iComMa: Inverting 3D Gaussian Splatting for Camera Pose Estimation via Comparing and Matching

Yuan Sun†,1, Xuan Wang†,2, Yunfan Zhang1, Jie Zhang1, Caigui Jiang1, Yu Guo*,1, Fei Wang1
1 Xi'an Jiaotong University     2 Ant Group
†Indicates Equal Contribution

*Indicates Corresponding Author

Given a query image with an unknown camera pose, iComMa estimates that pose by inverting 3D Gaussian Splatting, starting from a known initial pose. The gradient of the difference between the query image and the rendered image is used to optimize the camera pose iteratively. The two images are overlaid in the above video, and a higher degree of overlap indicates a more accurate pose estimate. Compared to iNeRF, the proposed method uses not only pixel-to-pixel comparing but also 2D keypoint matching; matched keypoints are connected by blue lines in the above video. As a result, our method can estimate camera poses precisely even under poor initial conditions, such as large angular deviations.

Abstract

We present a method named iComMa to address the 6D camera pose estimation problem in computer vision. Conventional pose estimation methods typically rely on the target's CAD model or necessitate specific network training tailored to particular object classes. Some existing methods have achieved promising results in mesh-free object and scene pose estimation by inverting the Neural Radiance Fields (NeRF). However, they still struggle with adverse initializations such as large rotations and translations. To address this issue, we propose an efficient method for accurate camera pose estimation by inverting 3D Gaussian Splatting (3DGS). Specifically, a gradient-based differentiable framework optimizes camera pose by minimizing the residual between the query image and the rendered image, requiring no training. An end-to-end matching module is designed to enhance the model's robustness against adverse initializations, while minimizing pixel-level comparing loss aids in precise pose estimation. Experimental results on synthetic and complex real-world data demonstrate the effectiveness of the proposed approach in challenging conditions and the accuracy of camera pose estimation.
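The division of labor between the two losses can be illustrated with a toy 1D example. This is a minimal sketch under strong assumptions, not the authors' implementation: the "image" is a single Gaussian blob, the "pose" is the blob position `x`, the keypoint match is assumed to be known exactly, and finite differences stand in for automatic differentiation. All names (`render`, `comparing_loss`, `matching_loss`, `grad`) are hypothetical. It compares the gradient of a per-pixel loss and a keypoint-distance loss near the true pose and far from it.

```python
import math

SIGMA = 0.5                                   # blob width in the toy 1D "image"
PIXELS = [i * 0.1 for i in range(-120, 121)]  # pixel centres on [-12, 12]

def render(x):
    """Render a 1D image containing one Gaussian blob centred at x."""
    return [math.exp(-(u - x) ** 2 / (2 * SIGMA ** 2)) for u in PIXELS]

QUERY_X = 0.0                  # assumed ground-truth pose of the query image
QUERY_IMAGE = render(QUERY_X)

def comparing_loss(x):
    """Mean squared per-pixel difference between rendered and query images."""
    rendered = render(x)
    return sum((r - q) ** 2 for r, q in zip(rendered, QUERY_IMAGE)) / len(PIXELS)

def matching_loss(x):
    """Squared distance between matched keypoints (here: the blob centres)."""
    return (x - QUERY_X) ** 2

def grad(f, x, eps=1e-4):
    """Central finite-difference derivative, a stand-in for autodiff."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

near, far = 0.3, 6.0           # a good and a poor initial pose
grads = {
    "comparing_near": grad(comparing_loss, near),
    "comparing_far": grad(comparing_loss, far),
    "matching_near": grad(matching_loss, near),
    "matching_far": grad(matching_loss, far),
}
for name, value in grads.items():
    print(f"{name}: {value:.3e}")
```

In this toy setting the per-pixel gradient is informative only while the rendered and query blobs overlap, whereas the keypoint gradient grows with the distance to the target. That is one plausible intuition for pairing a matching term with a comparing term; it is not a result taken from the paper's experiments.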

Pipeline

Given an initial camera pose, iComMa iteratively refines it toward the ground-truth pose associated with the query image. At the t-th optimization step, we first render the image corresponding to the camera pose $T_t$ using 3D Gaussian Splatting. We then compute the residuals between the rendered image and the query image, which include the matching loss $\mathcal{L}_{Ma}$ obtained from the end-to-end matching module and the per-pixel comparing loss $\mathcal{L}_{Com}$. The entire framework is differentiable, so the camera pose can be optimized with the gradients obtained by minimizing these residuals.
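The render, compare, and update loop described above can be sketched in a few lines. This is a hedged toy version, not the iComMa code: the scene is a hypothetical set of 2D points, the pose is a 2D rigid transform `(theta, tx, ty)` rather than a 6D camera pose, the "renderer" only projects keypoints, correspondences are assumed to be known, only a matching-style residual is minimized (the per-pixel comparing term is omitted), and finite differences replace backpropagation through the rasterizer. The function names, learning rate, and step count are illustrative assumptions.

```python
import math

# Hypothetical 2D scene: point centres standing in for the 3D Gaussians.
SCENE = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0), (0.5, 0.5)]

def project(pose):
    """Toy renderer: map scene points to keypoints under pose (theta, tx, ty)."""
    theta, tx, ty = pose
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in SCENE]

def matching_loss(pose, query_keypoints):
    """Mean squared distance between rendered and query keypoints."""
    rendered = project(pose)
    total = sum((rx - qx) ** 2 + (ry - qy) ** 2
                for (rx, ry), (qx, qy) in zip(rendered, query_keypoints))
    return total / len(SCENE)

def numerical_gradient(loss, pose, eps=1e-6):
    """Central finite differences; a real pipeline would use autodiff."""
    grad = []
    for i in range(len(pose)):
        hi = list(pose)
        lo = list(pose)
        hi[i] += eps
        lo[i] -= eps
        grad.append((loss(hi) - loss(lo)) / (2 * eps))
    return grad

def estimate_pose(query_keypoints, initial_pose, lr=0.1, steps=500):
    """Refine the pose T_t step by step with plain gradient descent."""
    pose = list(initial_pose)

    def loss(p):
        return matching_loss(p, query_keypoints)

    for _ in range(steps):
        g = numerical_gradient(loss, pose)
        pose = [p - lr * gi for p, gi in zip(pose, g)]
    return pose

true_pose = (0.4, 0.3, -0.2)
query = project(true_pose)               # keypoints seen in the query image
initial_pose = (2.4, 1.0, 0.8)           # about 115 degrees off in rotation
estimated = estimate_pose(query, initial_pose)
pose_error = max(abs(e - t) for e, t in zip(estimated, true_pose))
print("estimated pose:", [round(v, 4) for v in estimated])
print("max abs error:", pose_error)
```

Swapping `matching_loss` for a per-pixel loss, or summing the two, changes only the `loss` closure; the update loop stays the same, which is the practical benefit of keeping the whole pipeline differentiable.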

Experimental Results on Synthetic Datasets

Experimental Results on Front-facing LLFF Datasets

Experimental Results on 360° Scene Datasets

Note: the iNeRF variant used in this comparison is obtained by inverting Mip-NeRF360.

Ablation Study

BibTeX

@article{sun2023icomma,
  title={{iComMa}: Inverting {3D} {G}aussian Splatting for Camera Pose Estimation via Comparing and Matching},
  author={Sun, Yuan and Wang, Xuan and Zhang, Yunfan and Zhang, Jie and Jiang, Caigui and Guo, Yu and Wang, Fei},
  journal={arXiv preprint arXiv:2312.09031},
  year={2023}
}