1 KAIST   2 Google Research
We are interested in achieving spatially accurate and temporally consistent depth estimates from only a stream of 2D RGB images. Despite the success of recent depth estimation methods, we find that this remains difficult because existing approaches often estimate depth from 2D information alone and overlook how the scene exists in 3D space. To tackle this issue, we propose Multi-view Consistent depth estimation via Coordinated image-based neural rendering (MC2), which casts depth estimation as a feature matching problem in 3D space, constructing and aligning scene features directly in 3D from 2D images. First, we introduce a rescaling technique that reduces the ambiguity of the depth estimates obtained independently from each 2D image. Using the 2D images and their rescaled depths, we extract a context representation with our new transformer architecture built on three-way factorized attention. Moreover, to ensure alignment with 3D structure without explicit geometry modeling, we propose an ordinal volume rendering that respects the nature of 3D space. We perform extensive comparisons on casually captured scenes from various real-world datasets and significantly outperform previous work on depth estimation from a stream of 2D RGB images. The results highlight our method as a comprehensive framework that not only improves the accuracy of monocular estimates but also bridges the gap to multi-view consistent depth estimation that respects the 3D world depicted in the given images.
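The rescaling step is only summarized here. As a rough sketch of the underlying idea, the snippet below aligns each frame's monocular depth map to a shared metric scale by fitting a single least-squares scale factor against sparse reference depths (e.g., projected SfM points). The function name `rescale_depth`, the closed-form scale fit, and the use of sparse anchors are illustrative assumptions rather than the exact procedure used in MC2.

```python
import torch

def rescale_depth(mono_depth: torch.Tensor,
                  sparse_depth: torch.Tensor,
                  valid: torch.Tensor) -> torch.Tensor:
    """Align a per-frame monocular depth map to sparse metric anchors.

    mono_depth:   (H, W) relative depth from a monocular network.
    sparse_depth: (H, W) metric depth at a few pixels (e.g., SfM points).
    valid:        (H, W) boolean mask of pixels where sparse_depth is defined.
    Returns the rescaled (H, W) depth map.
    """
    d = mono_depth[valid]
    z = sparse_depth[valid]
    # Closed-form least-squares scale: argmin_s ||s * d - z||^2
    scale = (d * z).sum() / (d * d).sum().clamp(min=1e-8)
    return scale * mono_depth
```

A per-frame alignment of this kind removes the scale ambiguity between independently estimated frames, so that subsequent feature matching can reason in a single shared 3D coordinate frame.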
MC2 synthesizes a depth map and the corresponding RGB image at arbitrary camera viewpoints from a stream of 2D RGB images used as context. Such accurate depth prediction is enabled by three main components. First, MC2 rescales the depth estimate of each context view, obtained from a monocular depth estimation network, so that the estimates live in the same 3D world as the scene and are consistent with one another, providing geometric priors for feature matching in the next stage. These rescaled depth features, together with image features, are then fed into our three-way factorized transformers, which decompose attention along the view, ray, and pixel axes to efficiently find correspondences between the context features. Finally, MC2 renders a depth map without explicit proxy geometry by taking into account the sequential nature of samples along a ray. In doing so, MC2 produces spatially accurate and temporally consistent depth maps that respect the 3D world in which the given images reside.
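The three-way factorization is described only at a high level above. The sketch below shows one plausible reading, in which standard multi-head self-attention is applied one axis at a time over a (views, ray samples, pixels, channels) feature tensor; the tensor layout, module names, and residual structure are assumptions for illustration, not MC2's exact architecture.

```python
import torch
import torch.nn as nn

class ThreeWayFactorizedAttention(nn.Module):
    """Attend along one axis at a time: across views, along the ray, across pixels."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.view_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ray_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pixel_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    @staticmethod
    def _attend(attn: nn.MultiheadAttention, x: torch.Tensor, axis: int) -> torch.Tensor:
        # Move the chosen axis to the sequence position and flatten the rest into a batch.
        x = x.movedim(axis, -2)                     # (..., L, C)
        shape = x.shape
        x = x.reshape(-1, shape[-2], shape[-1])     # (B, L, C)
        x = x + attn(x, x, x, need_weights=False)[0]
        return x.reshape(shape).movedim(-2, axis)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (V views, S ray samples, P pixels, C channels)
        feats = self._attend(self.view_attn, feats, axis=0)   # match across context views
        feats = self._attend(self.ray_attn, feats, axis=1)    # reason along each ray
        feats = self._attend(self.pixel_attn, feats, axis=2)  # share information across pixels
        return feats
```

Attending along one axis at a time keeps the cost linear in the other two axis lengths rather than quadratic in their product, which is what makes jointly reasoning over views, ray samples, and pixels tractable.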
Illustration of depth estimates with corresponding synthesized RGB images on the mochi-high-five sequence from the iPhone dataset. The color bar on the right is in meters (m). MC2 estimates depth consistently over the video while accurately recovering metric depth. ZoeDepth, in contrast, overestimates depth, and its estimates fluctuate over time, as shown by the changing colors.
Figure panels: Ground-Truth (iPhone LiDAR sensor) vs. MC2 (Ours); ZoeDepth vs. MC2 (Ours).
Neural scene rendering has shown great potential in parameterizing complex 3D scenes as a neural network, either by 1) mapping 5D coordinates to RGB values and densities, as in NeRF, or by 2) performing image-based view synthesis.
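For reference, the first family follows NeRF's recipe: a network maps a 5D input (3D position and viewing direction) to a color and a density, and a pixel is formed by alpha-compositing the samples along its ray. The snippet below is a minimal, generic version of that compositing step, not code from any of the methods discussed; the expected depth it returns is the quantity visualized in the depth comparisons throughout this section.

```python
import torch

def composite_ray(rgb: torch.Tensor, sigma: torch.Tensor, t: torch.Tensor):
    """Alpha-composite NeRF-style samples along a single ray.

    rgb:   (S, 3) colors predicted at the S samples.
    sigma: (S,)   volume densities predicted at the samples.
    t:     (S,)   sample distances along the ray (increasing).
    Returns the rendered color (3,) and the expected depth (scalar).
    """
    delta = torch.diff(t, append=t[-1:] + 1e10)          # spacing between samples
    alpha = 1.0 - torch.exp(-sigma * delta)              # opacity of each segment
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)    # transmittance *after* each sample
    trans = torch.cat([torch.ones_like(trans[:1]), trans[:-1]])  # transmittance *before* each sample
    weights = alpha * trans                              # compositing weights, sum <= 1
    color = (weights[:, None] * rgb).sum(dim=0)
    depth = (weights * t).sum()
    return color, depth
```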
While the synthesized views are plausible and seemingly satisfactory, they often suffer from inaccurate correspondence with the ground-truth 3D scene, as revealed by their depth estimates. We perform the same metric depth estimation experiment as above with the existing video NeRFs Nerfies and HyperNeRF. The color bar on the right is in meters (m).
Moreover, we find that existing image-based neural rendering approaches can easily prioritize the simpler task of blending colors from the features of the context views over the harder task of establishing correspondence between the context features sampled along rays in 3D space. This occurs because image-based neural rendering operates in 2D pixel space and often lacks a comprehensive understanding of the 3D scene, leading it to prefer color blending over accurate correspondence matching of 3D points; the depth estimates, taken together with the camera poses, expose this inferred geometry, i.e., which 3D points the network implicitly treats as matched.
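As a simplified picture of this failure mode, an image-based renderer can predict a blend weight for the features sampled from each context view and average the reprojected colors; a plausible pixel color can then be produced even if the per-sample weights along the ray never peak at the true surface. The shapes and the single-linear-layer weight head below are illustrative assumptions, not IBRNet's actual architecture.

```python
import torch
import torch.nn as nn

class ColorBlender(nn.Module):
    """Blend colors reprojected from context views, IBR-style (simplified)."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.weight_head = nn.Linear(feat_dim, 1)  # one blend logit per (sample, view)

    def forward(self, feats: torch.Tensor, colors: torch.Tensor) -> torch.Tensor:
        # feats:  (S, V, C) context features sampled along the ray from V views
        # colors: (S, V, 3) colors reprojected from the same locations
        logits = self.weight_head(feats)        # (S, V, 1)
        w = torch.softmax(logits, dim=1)        # blend across views, per sample
        return (w * colors).sum(dim=1)          # (S, 3) blended color per sample
```

Nothing in this color-blending objective, on its own, forces the compositing weights along the ray to concentrate at the correct depth.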
Illustration of depth estimates and corresponding synthesized RGB images from IBRNet. The color bar on the right side of the depth maps indicates the depth scale in meters. In addition, the index of the sample with the largest predicted density weight from IBRNet is shown (right); its color bar indicates the sample index, ranging from 0 to 127.

As the figure shows, IBRNet typically assigns the highest weight to the last sample index along the ray. This occurs primarily when IBRNet struggles to identify correspondences between context features derived solely from image encoders, especially in textureless regions (e.g., walls and flat surfaces). We conjecture that this happens because IBRNet's primary focus is plausible view synthesis rather than accurate depth estimation. This bias manifests in the following way: objects predicted to be farther away, and thus to have greater depth, exhibit only small pixel shifts across viewpoints even under significant changes in camera perspective, whereas objects perceived to be closer exhibit pronounced pixel motion even under minimal changes in camera angle. Consequently, when an image-based neural rendering network is optimized for color image synthesis, it is strongly biased toward synthesizing images in which the relative motion of objects matches their expected real-world behavior. This prioritizes visual realism at the expense of accurate depth estimation, especially in the absence of explicit geometric modeling, as is the case for image-based view synthesis.
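The behavior described above can be read directly off the per-ray compositing weights. The sketch below computes, for every pixel, the two quantities being visualized: the index of the max-weight sample and the weight-averaged depth; variable names and shapes are illustrative.

```python
import torch

def weight_peak_maps(weights: torch.Tensor, t: torch.Tensor):
    """Per-pixel summaries of where the compositing weights place the surface.

    weights: (H, W, S) compositing weights for S samples along every pixel's ray.
    t:       (S,)      sample distances shared by all rays.
    Returns:
      peak_index:     (H, W) index of the max-weight sample (S-1 = far end of the ray).
      expected_depth: (H, W) weight-averaged depth, as rendered by the network.
    """
    peak_index = weights.argmax(dim=-1)
    expected_depth = (weights * t).sum(dim=-1)
    return peak_index, expected_depth
```

When the peak-index map saturates at the last sample over textureless regions, as in the figure, the renderer is effectively reporting that no correspondence was found while still producing a plausible color.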
Illustration of depth estimates and corresponding synthesized RGB images from our method, MC2. The color bar on the right side of the depth maps indicates the depth scale in meters. Probability and temperature are also visualized.
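The visualized probability and temperature are not defined in this excerpt. One common construction consistent with them is a temperature-scaled softmax over per-sample matching scores, which yields a probability distribution over the samples along a ray whose sharpness is controlled by the temperature; the sketch below shows that generic construction and is not necessarily MC2's exact ordinal rendering.

```python
import torch

def softmax_depth(logits: torch.Tensor, t: torch.Tensor, temperature: float):
    """Turn per-sample logits into a depth distribution and an expected depth.

    logits:      (S,) matching scores for the S samples along a ray.
    t:           (S,) sample distances along the ray.
    temperature: scalar > 0; lower values sharpen the distribution.
    Returns (probabilities over samples, expected depth).
    """
    probs = torch.softmax(logits / temperature, dim=0)
    depth = (probs * t).sum()
    return probs, depth
```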