PhySIC: Physically Plausible 3D Human-Scene Interaction and Contact from a Single Image

📄 arXiv: 2510.11649v1

Authors: Pradyumna Yalandur Muralidhar, Yuxuan Xue, Xianghui Xie, Margaret Kostyrko, Gerard Pons-Moll

Category: cs.CV

Published: 2025-10-13

Note: Accepted to ACM SIGGRAPH Asia 2025. Project website: https://yuxuan-xue.com/physic

DOI: 10.1145/3757377.3763862


💡 One-Sentence Takeaway

PhySIC is a framework for reconstructing physically plausible 3D human-scene interaction from a single image.

🎯 Matched Pillars: Pillar 3: Perception & Semantics · Pillar 4: Generative Motion · Pillar 5: Interaction & Reaction · Pillar 6: Video Extraction

Keywords: 3D reconstruction, human-scene interaction, single image, depth estimation, physical consistency, virtual reality, robotics

📋 Key Points

  1. Existing methods for reconstructing 3D humans and scenes from a single image struggle with depth ambiguity, occlusion, and physically inconsistent contacts.
  2. PhySIC fuses visible depth with unscaled geometry and performs occlusion-aware inpainting to recover the human mesh and dense scene surfaces.
  3. Experiments show large accuracy gains: mean per-vertex scene error drops from 641 mm to 227 mm, and contact F1 improves from 0.09 to 0.51.

🔬 Method Details

Problem definition: The goal is to reconstruct 3D humans and their scene interactions from a single image. Existing methods fall short in depth estimation and occlusion handling, producing reconstructions that are both inaccurate and physically implausible.

Core idea: PhySIC fuses visible depth with geometric cues and enforces physical-consistency constraints while recovering the human mesh and scene surface, ensuring the result is physically plausible.

Pipeline: PhySIC proceeds in stages: coarse monocular depth and body estimation, occlusion-aware inpainting, fusion of visible depth with unscaled geometry into a robust metric scaffold (synthesizing missing support surfaces such as floors), and a final confidence-weighted global optimization.
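
The fusion stage needs a global scale that maps the unscaled scene geometry onto the metric visible depth. A minimal sketch of one robust way to estimate such a scale, using the median of per-pixel depth ratios over visible pixels; the paper's actual solver is not detailed here, so this is an illustrative assumption:

```python
def align_scale(metric_depth, unscaled_depth, visible_mask):
    """Estimate a global scale mapping unscaled geometry to metric depth.

    Illustrative only: the ratio-median estimator is an assumption,
    not PhySIC's exact fusion procedure.
    """
    # Per-pixel ratios between metric (visible) depth and unscaled
    # geometry, restricted to visible pixels with valid depth.
    ratios = sorted(m / u
                    for m, u, v in zip(metric_depth, unscaled_depth, visible_mask)
                    if v and u > 0)
    # The median ratio is robust to outliers in either depth source.
    return ratios[len(ratios) // 2]
```

Masked-out (occluded or invalid) pixels are simply excluded, which is why the occlusion-aware inpainting stage matters: it determines which pixels count as reliable evidence for the scaffold.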

Key innovation: The physical-consistency optimization jointly enforces depth alignment, contact priors, and interpenetration avoidance, markedly improving both accuracy and plausibility.

Key design: A confidence-weighted optimization combines multiple loss terms to enforce depth and contact consistency, while explicit occlusion masks safeguard invisible regions against implausible configurations.
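
The joint objective described above can be sketched as a weighted sum of the four term families the paper names (depth alignment, contact prior, interpenetration avoidance, 2D reprojection). The term forms and weights below are assumptions for illustration, not the paper's exact formulation:

```python
def total_loss(pred_depth, obs_depth, conf,
               contact_dist, penetration_depth, proj_err,
               weights=(1.0, 0.5, 1.0, 0.1)):
    """Illustrative composite objective in the spirit of PhySIC's
    confidence-weighted optimization (term forms and weights assumed)."""
    w_d, w_c, w_p, w_r = weights
    # Depth alignment: confidence-weighted L1 between rendered and observed depth.
    l_depth = sum(c * abs(p - o)
                  for c, p, o in zip(conf, pred_depth, obs_depth)) / sum(conf)
    # Contact prior: pull labeled contact vertices onto the scene surface.
    l_contact = sum(abs(d) for d in contact_dist) / len(contact_dist)
    # Interpenetration: penalize only vertices that sink below the surface.
    l_pen = sum(max(d, 0.0) ** 2 for d in penetration_depth) / len(penetration_depth)
    # 2D reprojection consistency of body joints.
    l_reproj = sum(e ** 2 for e in proj_err) / len(proj_err)
    return w_d * l_depth + w_c * l_contact + w_p * l_pen + w_r * l_reproj
```

Note the asymmetry between the contact and penetration terms: contact vertices are pulled toward the surface from either side, while penetration is only penalized when signed depth is positive (the body is inside the scene), which is what makes the optimization avoid interpenetration without pushing the body off legitimate contact points.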


📊 Experimental Highlights

PhySIC substantially outperforms single-image baselines: mean per-vertex scene error drops from 641 mm to 227 mm, PA-MPJPE is halved to 42 mm, and contact F1 rises from 0.09 to 0.51, demonstrating its effectiveness on complex scenes and interactions.
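
The contact F1 above scores binary per-vertex contact predictions against ground truth. A minimal sketch of the standard metric (the benchmark's thresholding and vertex-matching details may differ):

```python
def contact_f1(pred, gt):
    """F1 for binary per-vertex contact labels (1 = in contact).

    Standard precision/recall F1; this is the generic metric, not a
    re-implementation of the paper's exact evaluation protocol.
    """
    tp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 1)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because most vertices are not in contact, F1 (rather than accuracy) is the natural score here: a model that predicts "no contact" everywhere gets F1 = 0, which is why the jump from 0.09 to 0.51 is a meaningful improvement.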

🎯 Applications

PhySIC has broad potential in virtual reality, robot navigation, and human-computer interaction. By providing accurate 3D human-scene interaction reconstruction, it can improve user experience and system intelligence in these domains.

📄 Abstract (Original)

Reconstructing metrically accurate humans and their surrounding scenes from a single image is crucial for virtual reality, robotics, and comprehensive 3D scene understanding. However, existing methods struggle with depth ambiguity, occlusions, and physically inconsistent contacts. To address these challenges, we introduce PhySIC, a framework for physically plausible Human-Scene Interaction and Contact reconstruction. PhySIC recovers metrically consistent SMPL-X human meshes, dense scene surfaces, and vertex-level contact maps within a shared coordinate frame from a single RGB image. Starting from coarse monocular depth and body estimates, PhySIC performs occlusion-aware inpainting, fuses visible depth with unscaled geometry for a robust metric scaffold, and synthesizes missing support surfaces like floors. A confidence-weighted optimization refines body pose, camera parameters, and global scale by jointly enforcing depth alignment, contact priors, interpenetration avoidance, and 2D reprojection consistency. Explicit occlusion masking safeguards invisible regions against implausible configurations. PhySIC is efficient, requiring only 9 seconds for joint human-scene optimization and under 27 seconds end-to-end. It naturally handles multiple humans, enabling reconstruction of diverse interactions. Empirically, PhySIC outperforms single-image baselines, reducing mean per-vertex scene error from 641 mm to 227 mm, halving PA-MPJPE to 42 mm, and improving contact F1 from 0.09 to 0.51. Qualitative results show realistic foot-floor interactions, natural seating, and plausible reconstructions of heavily occluded furniture. By converting a single image into a physically plausible 3D human-scene pair, PhySIC advances scalable 3D scene understanding. Our implementation is publicly available at https://yuxuan-xue.com/physic.