EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents

Authors: Wenjia Wang, Liang Pan, Huaijin Pi, Yuke Lou, Xuqian Ren, Yifan Wu, Zhouyingcheng Liao, Lei Yang, Rishabh Dabral, Christian Theobalt, Taku Komura

Category: cs.CV

Published: 2026-02-28

💡 One-Sentence Takeaway

EmbodMocap is a portable 4D human-scene reconstruction method for embodied agents that captures data with two iPhones.

🎯 Matched areas: Pillar 1: Robot Control; Pillar 3: Spatial Perception & Semantics; Pillar 5: Interaction & Reaction; Pillar 7: Motion Retargeting; Pillar 8: Physics-based Animation; Pillar 9: Embodied Foundation Models

Keywords: human motion capture, scene reconstruction, stereo vision, embodied intelligence, RGB-D, human-computer interaction, robot control

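📋 Key Points

Existing motion capture systems rely on costly studio setups and wearable devices, which limits large-scale collection of scene-conditioned motion data in real-world environments.
EmbodMocap takes RGB-D sequences from two iPhones and jointly calibrates them to reconstruct the human and the scene at a unified metric scale, with no additional equipment.
Experiments show that the dual-view setup mitigates depth ambiguity and that the collected data supports human-scene reconstruction, character animation, and robot control.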

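📝 Abstract (translated from the Chinese summary)

This paper proposes EmbodMocap, a portable and affordable data collection pipeline that uses two moving iPhones. The core idea is to jointly calibrate the dual RGB-D sequences so that both the human and the scene are reconstructed in a unified metric world coordinate frame. The method enables metric-scale, scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry. Evaluated against optical capture ground truth, the dual-view setup mitigates depth ambiguity and outperforms a single iPhone or monocular models in alignment and reconstruction. Using the collected data, the paper validates the method on three embodied AI tasks: monocular human-scene reconstruction, physics-based character animation, and robot motion control.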

🔬 Method Details

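Problem definition: Existing human motion capture systems typically require expensive equipment and markers in controlled studio environments, which limits large-scale collection of human motion data in real-world scenes. For embodied-agent research in particular, the lack of real-world human-environment interaction data hinders progress on agents' perception, understanding, and action capabilities.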

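Core idea: EmbodMocap uses two moving iPhones and exploits two-view (stereo) geometry to improve depth estimation accuracy, enabling accurate reconstruction of both the human and the scene. By jointly calibrating the two iPhones' RGB-D sequences, it reconstructs the human and the scene in a single unified metric world coordinate frame, which establishes the spatial relationship between the person and the environment.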
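To make the "unified metric world coordinate frame" concrete, here is a minimal sketch of lifting one depth pixel from each phone into a shared world frame. It assumes a pinhole camera model and a known camera-to-world rigid pose for each phone; the intrinsics, poses, and function names are hypothetical illustration values, not taken from the paper.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject(u: float, v: float, depth_m: float,
                fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift pixel (u, v) with metric depth into the camera frame (pinhole model)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def to_world(p_cam: Vec3, R: Sequence[Sequence[float]], t: Sequence[float]) -> Vec3:
    """Apply a camera-to-world rigid transform: p_world = R @ p_cam + t."""
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3)
    )


# Hypothetical setup: phone A sits at the world origin; phone B stands 4 m away
# and faces A (a 180 degree rotation about the y axis).
R_A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
T_A = [0.0, 0.0, 0.0]
R_B = [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]]
T_B = [0.0, 0.0, 4.0]

# Assumed intrinsics (fx, fy, cx, cy) shared by both phones for simplicity.
K = (500.0, 500.0, 320.0, 240.0)

# Both phones observe the same surface point at their image center, 2 m away.
point_from_a = to_world(backproject(320.0, 240.0, 2.0, *K), R_A, T_A)
point_from_b = to_world(backproject(320.0, 240.0, 2.0, *K), R_B, T_B)
print(point_from_a, point_from_b)
```

When the calibration is consistent, both observations land on the same world-space point, which is what allows the human and scene reconstructions from the two views to be merged.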

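Technical framework: The EmbodMocap data collection pipeline has four main stages: 1) capture RGB-D video sequences simultaneously with two iPhones; 2) jointly calibrate the two video sequences and estimate the relative pose of the two cameras; 3) reconstruct 3D models of the human and the scene from the calibrated camera parameters and depth information; 4) align the reconstructed human with the SMPL model to obtain human pose parameters.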
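The relative pose in stage 2 can be illustrated with a small sketch. It assumes each phone's camera-to-world pose is already available as a 4x4 rigid transform for a given frame; how the paper actually estimates these poses is not described here, and the helper names are hypothetical.

```python
from typing import List

Mat4 = List[List[float]]


def invert_rigid(T: Mat4) -> Mat4:
    """Invert a rigid transform [R | t] as [R^T | -R^T t]."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]  # transpose of R
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(R_t[i][j] * t[j] for j in range(3)) for i in range(3)]
    return [R_t[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def matmul4(A: Mat4, B: Mat4) -> Mat4:
    """Multiply two 4x4 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def relative_pose(T_world_a: Mat4, T_world_b: Mat4) -> Mat4:
    """Pose of camera A expressed in camera B's frame: inv(T_world_b) @ T_world_a."""
    return matmul4(invert_rigid(T_world_b), T_world_a)


# Hypothetical poses for one frame: A at the origin, B 4 m away and facing A.
T_WORLD_A = [[1.0, 0.0, 0.0, 0.0],
             [0.0, 1.0, 0.0, 0.0],
             [0.0, 0.0, 1.0, 0.0],
             [0.0, 0.0, 0.0, 1.0]]
T_WORLD_B = [[-1.0, 0.0, 0.0, 0.0],
             [0.0, 1.0, 0.0, 0.0],
             [0.0, 0.0, -1.0, 4.0],
             [0.0, 0.0, 0.0, 1.0]]

T_B_A = relative_pose(T_WORLD_A, T_WORLD_B)
print(T_B_A)
```

Because both phones move during capture, a computation like this would have to be repeated per frame rather than done once.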

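Key innovation: EmbodMocap provides a low-cost, portable 4D human-scene reconstruction setup that can collect data in real-world scenes. Compared with traditional motion capture systems, it needs no expensive equipment or complex setup, which substantially lowers the barrier to data collection. In addition, the dual RGB-D configuration mitigates depth ambiguity and improves reconstruction accuracy.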

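Key design: For the dual-view calibration, the paper may use feature-matching or optimization-based methods to estimate camera poses. For human model alignment, it may use the iterative closest point (ICP) algorithm or an optimization-based method to match the reconstructed human point cloud to the SMPL model. The loss function may include terms such as point-to-model distance and pose regularization. The network architecture, if one is used, is not specified.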
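As a rough illustration of the kind of objective mentioned above, the sketch below combines a point-to-model distance term with a pose regularizer. This is an assumed form, not the paper's loss: the body model is stood in for by a fixed list of sample points, whereas in a real fitting loop those points would be produced by SMPL from the pose parameters. All names and the `reg_weight` value are hypothetical.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def nearest_sq_dist(p: Vec3, model_pts: Sequence[Vec3]) -> float:
    """Squared distance from p to its nearest model point (brute force)."""
    return min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in model_pts)


def fitting_loss(scan_pts: Sequence[Vec3], model_pts: Sequence[Vec3],
                 pose_params: Sequence[float], reg_weight: float = 0.01) -> float:
    """Mean point-to-model squared distance plus an L2 penalty on the pose."""
    data_term = sum(nearest_sq_dist(p, model_pts) for p in scan_pts) / len(scan_pts)
    reg_term = sum(theta * theta for theta in pose_params)
    return data_term + reg_weight * reg_term


# Toy data: two scanned points, two model points, two pose parameters.
scan = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.1)]
pose = [0.1, -0.2]
loss = fitting_loss(scan, model, pose, reg_weight=0.5)
print(loss)
```

An optimizer would minimize such a loss over the pose parameters; the regularizer discourages implausible poses when the scanned points are sparse or noisy.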

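📊 Experimental Highlights

The experiments indicate that EmbodMocap improves human-scene reconstruction. Compared with a single iPhone or monocular models, the dual-view setup mitigates depth ambiguity and yields more accurate alignment and reconstruction. Data collected with EmbodMocap also gives good results on monocular human-scene reconstruction, physics-based character animation, and robot motion control, which supports the effectiveness of the method.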

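🎯 Application Scenarios

Data collected with EmbodMocap can be used to train embodied agents with stronger perception, understanding, and action capabilities. For example, it can be used to train robots to imitate human behavior, or to enable more realistic human-computer interaction in virtual and augmented reality applications. The technique may also apply to motion analysis, game development, and rehabilitation training.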

📄 Abstract (Original)

Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting. However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild. To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame. The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly. Compared with optical capture ground truth, we demonstrate that the dual-view setting exhibits a remarkable ability to mitigate depth ambiguity, achieving superior alignment and reconstruction performance over a single iPhone or monocular models. Based on the collected data, we empower three embodied AI tasks: monocular human-scene reconstruction, where we fine-tune feedforward models that output metric-scale, world-space aligned humans and scenes; physics-based character animation, where we show that our data can be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos. Experimental results validate the effectiveness of our pipeline and its contributions towards advancing embodied AI research.
