Scene4U: Hierarchical Layered 3D Scene Reconstruction from Single Panoramic Image for Your Immerse Exploration
Authors: Zilong Huang, Jun He, Junyan Ye, Lihan Jiang, Weijia Li, Yiping Chen, Ting Han
Category: cs.CV
Published: 2025-04-01 (updated: 2025-04-21)
Comments: CVPR 2025, 11 pages, 7 figures
🔗 Code/Project: https://github.com/LongHZ140516/Scene4U
💡 One-Sentence Takeaway
Proposes Scene4U, a framework that tackles occlusion and texture inconsistency when reconstructing a 3D scene from a single panoramic image.
🎯 Matched Areas: Pillar 3: Spatial Perception & Semantics · Pillar 9: Embodied Foundation Models
Keywords: 3D scene reconstruction, panoramic image, deep learning, diffusion model, computer vision, semantic consistency, hierarchical representation
📋 Key Points
- Existing image-driven scene reconstruction methods handle dynamic-object occlusion and global texture inconsistency poorly, producing visible artifacts.
- The proposed Scene4U framework combines open-vocabulary segmentation with diffusion models to reconstruct the panorama layer by layer, resolving both the occlusion and the texture-consistency problems.
- Experiments show Scene4U outperforms state-of-the-art methods by more than 24% on several metrics while also training the fastest.
📝 Abstract (Condensed)
Immersive, photorealistic 3D scene reconstruction has broad practical value in computer vision and computer graphics. Existing methods suffer from visual discontinuities and scene voids under dynamic-object occlusion. To address this, this paper proposes Scene4U, a novel layered 3D scene reconstruction framework based on panoramic images. The framework combines an open-vocabulary segmentation model with a large language model to decompose a real panorama into multiple layers, and uses a diffusion-based layered repair module to restore occluded regions, producing a hierarchical scene representation. Scene4U improves LPIPS by 24.24% and BRISQUE by 24.40% over the state of the art while achieving the fastest training speed.
🔬 Method Details
Problem definition: This paper targets 3D scene reconstruction from a single panoramic image, where dynamic-object occlusion causes visual discontinuities and global texture inconsistency. Existing methods produce scene voids under varying camera poses, degrading reconstruction quality.
Core idea: Scene4U decomposes the panorama into multiple layers by combining an open-vocabulary segmentation model with a large language model, then repairs occluded regions using depth information and visual cues, yielding a hierarchical 3D scene representation.
Technical framework: The Scene4U architecture comprises three main modules. First, an open-vocabulary segmentation model splits the panorama into layers. Second, a diffusion-based layered repair module restores the occluded regions. Finally, the multi-layer panorama is initialized as a 3D Gaussian Splatting representation and optimized layer by layer to produce the final 3D scene.
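The three-stage flow above can be sketched as a minimal, self-contained pipeline. All function names, data structures, and the back-to-front optimization order below are hypothetical stand-ins for illustration, not the authors' implementation:

```python
def segment_into_layers(panorama):
    """Stage 1 (stub): an open-vocabulary segmentation model, guided by a
    large language model, decomposes the panorama into depth-ordered
    layers, foreground first."""
    return [
        {"name": "foreground", "occluded": False},
        {"name": "midground", "occluded": True},
        {"name": "background", "occluded": True},
    ]

def repair_layer(layer):
    """Stage 2 (stub): a diffusion-based layered repair module restores
    regions hidden by the layers in front, using visual cues and depth."""
    return {**layer, "occluded": False}

def build_gaussian_scene(layers):
    """Stage 3 (stub): initialize a 3D Gaussian Splatting representation
    from the multi-layer panorama, then optimize layer by layer
    (back-to-front order is an assumption here)."""
    return {
        "representation": "3DGS",
        "optimization_order": [l["name"] for l in reversed(layers)],
    }

def scene4u(panorama):
    layers = segment_into_layers(panorama)
    repaired = [repair_layer(l) if l["occluded"] else l for l in layers]
    return build_gaussian_scene(repaired)

scene = scene4u("panorama.png")
print(scene["representation"])      # 3DGS
print(scene["optimization_order"])  # ['background', 'midground', 'foreground']
```

The point of the stubs is the data flow: occlusion is resolved per layer before any 3D optimization happens, so the Gaussian scene is built from already-complete layers.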
Key innovation: Scene4U's main novelty lies in its layered reconstruction strategy and diffusion-based repair module, which fundamentally depart from the single-layer reconstruction and simple inpainting of prior methods and markedly improve semantic and structural consistency.
Key design: The design adopts a multi-layer panorama representation and a depth-guided repair strategy; the loss function accounts for both texture consistency and structural integrity, ensuring high-quality results and fast training.
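As a rough illustration of a loss that balances the two concerns named above, one could blend a texture-consistency term with a structural-integrity term. The specific terms and the weight `lam` are assumptions for illustration; the summary does not specify the actual formulation:

```python
def layered_loss(texture_term, structure_term, lam=0.2):
    """Hypothetical weighted mix: lam trades structural integrity
    against texture fidelity (mirrors the common L1 + D-SSIM blend
    used when optimizing 3D Gaussian Splatting scenes)."""
    return (1.0 - lam) * texture_term + lam * structure_term

print(round(layered_loss(0.5, 1.0), 6))  # 0.6
```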
🖼️ Key Figures
📊 Experimental Highlights
In the experiments, Scene4U improves LPIPS by 24.24% and BRISQUE by 24.40%, a marked gain in visual quality, and it also achieves the fastest training speed, underscoring its efficiency and practicality.
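A note on reading these headline numbers: LPIPS and BRISQUE are lower-is-better metrics, so "improving by 24.24%" means a relative reduction versus the strongest baseline. The metric values below are made up purely to illustrate the arithmetic; the summary does not report the raw scores:

```python
def relative_improvement(baseline, ours):
    """Relative reduction (in %) for a lower-is-better metric."""
    return (baseline - ours) / baseline * 100.0

# Hypothetical scores chosen only to show the formula's shape.
print(round(relative_improvement(0.33, 0.25), 2))  # 24.24
```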
🎯 Application Scenarios
Potential applications include virtual reality, augmented reality, game development, and cultural heritage preservation. By delivering high-quality 3D scene reconstruction, Scene4U can offer users a more immersive experience and help advance these industries. As the technique matures, it is likely to see adoption in many more real-world settings.
📄 Abstract (Original)
The reconstruction of immersive and realistic 3D scenes holds significant practical importance in various fields of computer vision and computer graphics. Typically, immersive and realistic scenes should be free from obstructions by dynamic objects, maintain global texture consistency, and allow for unrestricted exploration. The current mainstream methods for image-driven scene construction involve iteratively refining the initial image using a moving virtual camera to generate the scene. However, previous methods struggle with visual discontinuities due to global texture inconsistencies under varying camera poses, and they frequently exhibit scene voids caused by foreground-background occlusions. To this end, we propose a novel layered 3D scene reconstruction framework from a panoramic image, named Scene4U. Specifically, Scene4U integrates an open-vocabulary segmentation model with a large language model to decompose a real panorama into multiple layers. Then, we employ a layered repair module based on a diffusion model to restore occluded regions using visual cues and depth information, generating a hierarchical representation of the scene. The multi-layer panorama is then initialized as a 3D Gaussian Splatting representation, followed by layered optimization, which ultimately produces an immersive 3D scene with semantic and structural consistency that supports free exploration. Scene4U outperforms state-of-the-art methods, improving by 24.24% in LPIPS and 24.40% in BRISQUE, while also achieving the fastest training speed. Additionally, to demonstrate the robustness of Scene4U and allow users to experience immersive scenes from various landmarks, we build the WorldVista3D dataset for 3D scene reconstruction, which contains panoramic images of globally renowned sites. The implementation code and dataset will be released at https://github.com/LongHZ140516/Scene4U.