IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
Abstract
We introduce an approach to enhance novel view synthesis from images taken by a freely moving camera. The
introduced approach focuses on outdoor scenes where recovering an accurate geometric scaffold and camera poses is
challenging, leading to inferior results with the state-of-the-art stable view synthesis (SVS) method. SVS and related
methods fail for outdoor scenes primarily due to (i) over-reliance on multiview stereo (MVS) for geometric scaffold
recovery and (ii) the assumption that COLMAP-computed camera poses are the best possible estimates, despite it being
well established that MVS 3D reconstruction accuracy is limited by scene disparity and camera-pose accuracy is sensitive to key-point
correspondence selection. This work proposes a principled way to enhance novel view synthesis solutions drawing
inspiration from the basics of multiple view geometry. By leveraging the complementary behavior of MVS and monocular
depth, we arrive at a better scene depth per view for nearby and far points, respectively. Moreover, our approach
jointly refines camera poses with image-based rendering via multiple rotation averaging graph optimization. The
recovered scene depth and camera poses enable better view-dependent on-surface feature aggregation of the entire scene.
Extensive evaluation of our approach on popular benchmark datasets, such as Tanks and Temples, shows substantial
improvement in view synthesis results compared to the prior art. For instance, our method shows a 1.5 dB PSNR
improvement on Tanks and Temples. Similar statistics are observed when tested on other benchmark datasets such as
FVS, Mip-NeRF 360, and DTU.
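
To put the reported 1.5 dB PSNR gain in perspective, the following minimal sketch computes PSNR from its standard definition and converts a PSNR gain into the corresponding reduction in mean squared error (MSE). It is an illustration only and not the evaluation code used in this work; the function names `psnr` and `mse_ratio_for_gain`, and the assumption that pixel intensities are normalized to [0, 1], are ours.

```python
import math


def psnr(reference, rendered, peak=1.0):
    """PSNR in dB between two equal-length sequences of pixel intensities.

    Assumes intensities lie in [0, peak] (an assumption of this sketch).
    """
    if len(reference) != len(rendered):
        raise ValueError("inputs must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((r - s) ** 2 for r, s in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)


if __name__ == "__main__":
    ratio = mse_ratio_for_gain(1.5)
    print(f"MSE shrinks by a factor of {ratio:.3f}")
    print(f"Relative MSE reduction: {100.0 * (1.0 - 1.0 / ratio):.1f}%")
```

Because PSNR is logarithmic in MSE, a 1.5 dB gain corresponds to the MSE shrinking by a factor of 10^0.15, roughly 1.41, independent of the baseline PSNR.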