Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation

📄 arXiv: 2606.06002v2 📥 PDF

Authors: Mengshi Qi, Wei Deng, Xianlin Zhang, Huadong Ma

Categories: cs.CV

Published: 2026-06-04 (updated: 2026-06-05)


💡 One-sentence takeaway

Proposes a global-local Monte Carlo tree search for text-to-3D indoor scene generation.

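🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: text-to-3D, indoor scene generation, vision-language models, tree search, Monte Carlo tree search, multimodal learning, dataset construction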

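📋 Key points

  1. Existing vision-language-model methods for text-to-3D indoor scene generation make sequential chain-of-thought decisions that cannot be revised, so early errors propagate.
  2. The paper casts scene generation as a tree search problem with a global tree and a local tree, searched with a PRM-guided MCTS method.
  3. Experiments show that the method generates more realistic 3D scenes than state-of-the-art methods; the paper also introduces a new benchmark, 3DTindo-bench.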

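📝 Summary

Large vision-language models (LVLMs) have achieved strong reasoning performance across many tasks, yet text-to-3D indoor scene generation with LVLMs remains little studied. Existing LVLM-based methods rely on chain-of-thought sequential decisions that cannot revise earlier choices, which causes error propagation. This paper treats the task as a planning problem constrained by spatial and layout commonsense and models it as a tree search over a global tree and a local tree. The proposed PRM-guided MCTS method uses the PRM to prune unnecessary branches and MCTS to balance exploration and exploitation, producing more realistic 3D scenes. The authors also collect 3DTindo-bench, a large and diverse dataset with 65 scene types and 3250 instructions, to better assess model capability.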

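🔬 Method details

Problem definition: In text-to-3D indoor scene generation, existing methods cannot revise earlier decisions. Their chain-of-thought sequential decision making lets early errors propagate and degrades the generated scene.

Core idea: Treat scene generation as a tree search problem with a global tree and a local tree. Like a person furnishing a room, the search can explore multiple attempts and revise earlier placements.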

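Framework: The architecture consists of a global tree and a local tree. The global tree places objects one at a time and explores multiple attempts. The local tree decomposes the placement of each object into finer sub-steps, including the specific placement parameters. The tree is searched with PRM-guided MCTS.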
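The split between the two trees can be illustrated with a minimal Python sketch. Everything named here is hypothetical: the abstract does not list the local sub-steps, so `region`, `position`, and `rotation_deg` are assumed placement parameters, and `GlobalNode` and `local_candidates` are illustrative names rather than the authors' code.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass(frozen=True)
class Placement:
    """One fully specified object placement (fields are assumed, not from the paper)."""
    name: str
    region: str
    position: tuple
    rotation_deg: int


def local_candidates(name, regions, positions, rotations):
    """Local tree: enumerate the leaves of region -> position -> rotation sub-steps."""
    return [Placement(name, r, p, rot)
            for r, p, rot in product(regions, positions, rotations)]


@dataclass
class GlobalNode:
    """Global tree node: the partial scene after some objects have been placed."""
    placed: tuple = ()
    children: list = field(default_factory=list)

    def expand(self, candidates):
        """Add one child per candidate placement of the next object."""
        for cand in candidates:
            self.children.append(GlobalNode(self.placed + (cand,)))
        return self.children


root = GlobalNode()
bed_options = local_candidates("bed", ["sleeping area"], [(0, 0), (1, 0)], [0, 90])
children = root.expand(bed_options)
```

Because each global node stores its own partial scene, a search procedure can back up to a sibling branch when a placement turns out badly, which a purely sequential decision chain cannot do.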

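Key innovation: The PRM-guided MCTS method uses the PRM to prune unnecessary branches and the MCTS algorithm to balance exploration and exploitation, so that a good solution is reached with fewer attempts. This differs from existing sequential chain-of-thought decision making.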
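A generic sketch of reward-model-pruned MCTS on a toy problem may help make the idea concrete. It is not the authors' implementation: `prm_score` is a hypothetical callable standing in for the PRM (the abstract does not describe its form), the UCT rule with constant `c=1.4` is a standard choice assumed here, and the toy task (placing three objects into four slots without collisions) is invented for illustration.

```python
import math
import random


class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0


def uct(child, parent_visits, c=1.4):
    """Upper confidence bound: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # always try unvisited children first
    mean = child.value / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def prm_guided_mcts(root_state, actions, step, is_terminal, prm_score, reward,
                    iterations=300, threshold=0.5, seed=0):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend by UCT until reaching a leaf.
        while node.children:
            parent = node
            node = max(parent.children, key=lambda ch: uct(ch, parent.visits))
        # 2. Expansion: keep only children the (assumed) PRM scores above threshold.
        if not is_terminal(node.state):
            for a in actions(node.state):
                nxt = step(node.state, a)
                if prm_score(nxt) >= threshold:
                    node.children.append(Node(nxt, node))
            if node.children:
                node = rng.choice(node.children)
        # 3. Simulation: unpruned random rollout to a terminal state.
        state = node.state
        while not is_terminal(state):
            state = step(state, rng.choice(actions(state)))
        # 4. Backpropagation: update statistics along the path to the root.
        r = reward(state)
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    # Read out the most-visited path in the tree.
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
    return node.state


# Toy task: assign 3 objects to 4 slots; a collision is two objects in one slot.
SLOTS, N_OBJECTS = 4, 3
best = prm_guided_mcts(
    root_state=(),
    actions=lambda s: list(range(SLOTS)),
    step=lambda s, a: s + (a,),
    is_terminal=lambda s: len(s) == N_OBJECTS,
    prm_score=lambda s: 1.0 if len(set(s)) == len(s) else 0.0,
    reward=lambda s: len(set(s)) / N_OBJECTS,
)
```

In this sketch pruning happens at expansion time, so colliding partial layouts never enter the tree and the visit budget is spent only on branches the scorer accepts.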

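Key design: A hierarchical scene representation abstracts a scene into four levels: room level, region level, floor object level, and supported object level. To keep the overall appearance of the scene consistent, pre-trained diffusion image generation models predict textures for all objects in the scene.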
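The four levels can be pictured as a nested data structure. The sketch below is only an assumed encoding: the class and field names (`Room`, `Region`, `FloorObject`, `SupportedObject`, `size_m`) and the example bedroom are hypothetical, since the abstract names the levels but not their contents.

```python
from dataclasses import dataclass, field


@dataclass
class SupportedObject:
    """Object resting on another object, e.g. a lamp on a nightstand."""
    name: str


@dataclass
class FloorObject:
    """Object standing on the floor; may carry supported objects."""
    name: str
    supported: list = field(default_factory=list)


@dataclass
class Region:
    """Functional area of a room that groups floor objects."""
    name: str
    floor_objects: list = field(default_factory=list)


@dataclass
class Room:
    room_type: str
    size_m: tuple
    regions: list = field(default_factory=list)


def flatten(room):
    """Return (level, path) pairs in top-down, depth-first order."""
    out = [("room", room.room_type)]
    for region in room.regions:
        region_path = f"{room.room_type}/{region.name}"
        out.append(("region", region_path))
        for obj in region.floor_objects:
            obj_path = f"{region_path}/{obj.name}"
            out.append(("floor_object", obj_path))
            for sup in obj.supported:
                out.append(("supported_object", f"{obj_path}/{sup.name}"))
    return out


bedroom = Room("bedroom", (4.0, 3.5), [
    Region("sleeping area", [
        FloorObject("bed", [SupportedObject("pillow")]),
        FloorObject("nightstand", [SupportedObject("lamp")]),
    ]),
])
levels = flatten(bedroom)
```

A top-down ordering like the one `flatten` produces matches the intuition that coarse decisions (room, regions) are fixed before fine ones (objects on top of furniture).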

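📊 Experimental highlights

According to the abstract, the proposed method generates more realistic 3D scenes than state-of-the-art methods. The abstract gives no specific figures for the margin of improvement. The newly collected 3DTindo-bench dataset (65 scene types, 3250 instructions) is introduced to better assess state-of-the-art models.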

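🎯 Application scenarios

Potential applications include interior design, virtual reality, and game development, where generating high-quality 3D scenes from text could improve user experience and interaction.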

📄 Abstract (original)

Large Vision-Language Models have achieved significant reasoning performance in various tasks. However, there are few studies on text-to-3D indoor scene generation with LVLMs. The main challenge is that prevailing LVLM-based methods employ chain-of-thought sequential decision mechanisms that cannot revise earlier decisions, causing error propagation. In this paper, we consider the task as a planning problem constrained by spatial and layout commonsense. To solve this problem, we model it as a tree search problem with global and local trees, which differs from existing sequential decision-making approaches. In the global tree, we place each object iteratively and explore multiple attempts like humans furnishing a room, where the problem space is represented as a tree. To effectively search the tree, we propose a hierarchical scene representation and a PRM-guided MCTS method. This representation abstracts a scene into room level, region level, floor object level, and supported object level. The PRM-guided MCTS method uses the PRM to prune unnecessary branches and the MCTS algorithm to balance exploration and exploitation to get an optimal solution with fewer attempts. In the local tree, it further decomposes the placement of each object into finer sub-steps, including the specific placement parameters. To make the whole appearance of the scene consistent, we leverage pre-trained diffusion image generative models to predict textures for all the objects in the scene. As existing benchmarks for text-to-3D indoor scene generation remain limited in scale and diversity, we collect a new large-scale diverse dataset that contains 65 scene types and 3250 instructions with diverse sizes, layouts, and styles, named 3DTindo-bench, to better assess the capability of the state-of-the-art models. Our experiments show that our method generates more realistic 3D scenes than state-of-the-art methods.