Sparse Input View Synthesis: 3D Representations and Reliable Priors
Author: Nagabhushan Somraj
Category: cs.CV
Published: 2024-11-20
Note: PhD Thesis of Nagabhushan S N, Dept of ECE, Indian Institute of Science (IISc); Advisor: Dr. Rajiv Soundararajan; Thesis Reviewers: Dr. Kaushik Mitra (IIT Madras), Dr. Aniket Bera (Purdue University); Submitted: May 2024; Accepted and Defended: Sep 2024; Abstract condensed, please check the PDF for full abstract
💡 One-Sentence Takeaway
For novel view synthesis from sparse input views, this thesis proposes solutions based on 3D representations and reliable priors.
🎯 Matched Area: Pillar 3: Spatial Perception & Semantics (Perception & Semantics)
Keywords: novel view synthesis, neural radiance fields, sparse inputs, visibility priors, scene-specific priors, dynamic scenes, optical flow estimation
📋 Key Points
- Existing novel view synthesis methods degrade significantly when input views are sparse, falling short of practical requirements for both static and dynamic scenes.
- The thesis addresses sparse input novel view synthesis using 3D representations and reliable priors, including visibility priors, scene-specific priors, and sparse flow priors.
- Experiments show that the proposed methods outperform existing approaches on multiple datasets and achieve state-of-the-art performance in frame rate upsampling for video gaming.
🔬 Method Details
Problem definition: The thesis addresses how to synthesize high-quality novel view images when only sparse input views are available. Existing methods such as NeRF perform well with dense views but degrade significantly with sparse views, owing to insufficient geometric constraints and scene information.
Core idea: Regularize 3D representations (such as NeRF) with reliable priors so that high-quality synthesis remains possible under sparse views. Visibility priors, scene-specific priors, and sparse flow priors effectively constrain the scene geometry and motion.
Technical framework: 1) For static scenes, a plane sweep volume is first used to compute a pixel visibility prior between pairs of input views; this prior is then added as a regularization term to NeRF training. 2) To further strengthen regularization, scene-specific priors are learned: augmented NeRFs provide better depth supervision in certain regions of the scene. 3) For dynamic scenes, an explicit motion model based on factorized volumes is designed, and reliable sparse flow priors constrain the motion field. 4) For temporal view synthesis, multi-plane image representations and scene depth are used to reliably estimate object motion.
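The regularized objective described above can be sketched as follows. This is a minimal illustration only; the function name, weighting scheme, and parameters are hypothetical and not taken from the thesis:

```python
import numpy as np

def nerf_sparse_loss(rendered_rgb, gt_rgb, pred_rel_depth, prior_rel_depth,
                     prior_confidence, lambda_vis=0.1):
    """Hypothetical combined objective: a photometric loss on rendered pixels
    plus a visibility/relative-depth regularizer, weighted per pixel by a
    confidence derived from the plane-sweep prior. Illustrative only."""
    # Standard NeRF photometric reconstruction term.
    photo = np.mean((rendered_rgb - gt_rgb) ** 2)
    # Penalize disagreement with the prior only where the prior is confident.
    reg = np.mean(prior_confidence * np.abs(pred_rel_depth - prior_rel_depth))
    return photo + lambda_vis * reg
```

Gating the regularizer by a confidence map reflects the digest's emphasis on *reliable* priors: unreliable prior pixels simply contribute no gradient.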
Key innovations: 1) A visibility prior computed from plane sweep volumes, which is denser and more reliable than absolute depth priors. 2) Augmented NeRFs that learn scene-specific priors, thereby avoiding generalization issues. 3) An explicit motion model based on factorized volumes, together with reliable sparse flow priors, which improves synthesis quality for dynamic scenes.
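To make the factorized-volume idea concrete, here is a toy sketch of a low-rank motion field: each 3D flow component is reconstructed from per-axis factors rather than a dense 4D grid. Class and parameter names are hypothetical; the thesis's actual representation and optimization details differ:

```python
import numpy as np

class FactorizedMotionField:
    """Toy compact motion model: the 3D scene-flow vector at (x, y, z, t) is
    reconstructed from rank-R per-axis factors, in the spirit of a
    factorized-volume representation. All details are assumptions."""

    def __init__(self, res, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        X, Y, Z, T = res
        # One set of per-axis factors for each flow component (dx, dy, dz).
        self.factors = [
            {ax: rng.standard_normal((rank, n)) * 0.01
             for ax, n in zip("xyzt", (X, Y, Z, T))}
            for _ in range(3)
        ]

    def query(self, ix, iy, iz, it):
        """Return the 3D scene-flow vector at an integer grid location."""
        flow = []
        for f in self.factors:
            comps = f["x"][:, ix] * f["y"][:, iy] * f["z"][:, iz] * f["t"][:, it]
            flow.append(comps.sum())
        return np.array(flow)
```

The appeal of such a factorization is the parameter count: a dense X×Y×Z×T×3 grid is replaced by 3·R·(X+Y+Z+T) numbers, which is why the digest describes the representation as compact and fast to optimize.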
Key designs: 1) Computing the visibility prior: a plane sweep volume scores photometric agreement between views under different depth hypotheses to estimate per-pixel visibility. 2) Augmented NeRF design: additional supervision in specific regions improves the main NeRF's depth estimates. 3) Choice of sparse flow priors: instead of the commonly used dense optical flow, more reliable sparse flow is selected to constrain the motion field.
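The plane-sweep visibility computation in design 1) can be illustrated with a toy 1D example, where integer shifts stand in for the homography warps of real depth hypotheses. The threshold and names are illustrative assumptions, not the thesis's implementation:

```python
import numpy as np

def plane_sweep_visibility(ref_img, src_img, shifts, tol=0.1):
    """Toy 1D plane sweep: shift the source image by each depth-hypothesis
    disparity, score photometric agreement with the reference, and treat a
    confidently matched pixel as visible in both views. A sketch of the idea
    only; the real method sweeps fronto-parallel planes with camera geometry."""
    errors = []
    for s in shifts:
        warped = np.roll(src_img, s)      # stand-in for homography warping
        errors.append(np.abs(ref_img - warped))
    errors = np.stack(errors)             # (num_planes, num_pixels)
    best = errors.min(axis=0)             # best matching cost per pixel
    visibility = (best < tol).astype(float)  # confident match -> visible
    return visibility, errors.argmin(axis=0)
```

Because the score compares views directly at each depth plane, no network pre-trained on large datasets is needed, which matches the digest's claim about how the prior is obtained.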
📊 Experimental Highlights
Experiments on multiple datasets show that the method outperforms existing approaches on sparse input novel view synthesis. For static scenes, it produces sharper and more realistic images, particularly in occluded regions. For dynamic scenes, it estimates object motion more accurately, yielding smoother videos.
🎯 Applications
The results apply broadly to the metaverse, free-viewpoint video, video gaming, video stabilization, and video compression. Synthesizing high-quality novel views from only a few input images lowers data acquisition and transmission costs and improves user experience. The method also enables frame rate upsampling in video games, improving smoothness and visual quality.
📄 Abstract (Original)
Novel view synthesis refers to the problem of synthesizing novel viewpoints of a scene given the images from a few viewpoints. This is a fundamental problem in computer vision and graphics, and enables a vast variety of applications such as meta-verse, free-view watching of events, video gaming, video stabilization and video compression. Recent 3D representations such as radiance fields and multi-plane images significantly improve the quality of images rendered from novel viewpoints. However, these models require a dense sampling of input views for high quality renders. Their performance goes down significantly when only a few input views are available. In this thesis, we focus on the sparse input novel view synthesis problem for both static and dynamic scenes. In the first part of this work, we mainly focus on sparse input novel view synthesis of static scenes using neural radiance fields (NeRF). We study the design of reliable and dense priors to better regularize the NeRF in such situations. In particular, we propose a prior on the visibility of the pixels in a pair of input views. We show that this visibility prior, which is related to the relative depth of objects, is dense and more reliable than existing priors on absolute depth. We compute the visibility prior using plane sweep volumes without the need to train a neural network on large datasets. We evaluate our approach on multiple datasets and show that our model outperforms existing approaches for sparse input novel view synthesis. In the second part, we aim to further improve the regularization by learning a scene-specific prior that does not suffer from generalization issues. We achieve this by learning the prior on the given scene alone without pre-training on large datasets. In particular, we design augmented NeRFs to obtain better depth supervision in certain regions of the scene for the main NeRF. 
Further, we extend this framework to also apply to newer and faster radiance field models such as TensoRF and ZipNeRF. Through extensive experiments on multiple datasets, we show the superiority of our approach in sparse input novel view synthesis. The design of sparse input fast dynamic radiance fields is severely constrained by the lack of suitable representations and reliable priors for motion. We address the first challenge by designing an explicit motion model based on factorized volumes that is compact and optimizes quickly. We also introduce reliable sparse flow priors to constrain the motion field, since we find that the popularly employed dense optical flow priors are unreliable. We show the benefits of our motion representation and reliable priors on multiple datasets. In the final part of this thesis, we study the application of view synthesis for frame rate upsampling in video gaming. Specifically, we consider the problem of temporal view synthesis, where the goal is to predict the future frames given the past frames and the camera motion. The key challenge here is in predicting the future motion of the objects by estimating their past motion and extrapolating it. We explore the use of multi-plane image representations and scene depth to reliably estimate the object motion, particularly in the occluded regions. We design a new database to effectively evaluate our approach for temporal view synthesis of dynamic scenes and show that we achieve state-of-the-art performance.