ChatHuman: Chatting about 3D Humans with Tools

📄 arXiv: 2405.04533v2

Authors: Jing Lin, Yao Feng, Weiyang Liu, Michael J. Black

Categories: cs.CV, cs.LG

Published: 2024-05-07 (updated: 2025-05-29)

Notes: Project page: https://chathuman.github.io


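💡 One-sentence takeaway

Proposes ChatHuman, a language-driven system that lowers the expertise needed to select, run, and interpret 3D human analysis tools.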

🎯 Matched areas: Pillar 5: Interaction & Reaction; Pillar 9: Embodied Foundation Models

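📋 Key points

  1. Existing methods for 3D human analysis usually require expert knowledge, which raises the barrier to use and makes results hard to interpret.
  2. ChatHuman takes a language-driven approach, integrating multiple specialized tools into one unified framework to simplify both operation and result analysis.
  3. Experiments show that ChatHuman surpasses existing models in tool selection accuracy and overall performance, and that it supports interactive conversation with users.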

📝 Summary (translated from Chinese)

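This paper presents ChatHuman, a language-driven system that integrates multiple specialized methods for analyzing properties of 3D humans, including pose, shape, contact, human-object interaction, and emotion. Existing methods typically require expert knowledge to select them and to interpret their results. Built on a Large Language Model (LLM) framework, ChatHuman autonomously selects, applies, and interprets a range of tools, overcoming many of the challenges of applying LLMs to 3D human tasks. Experiments show that ChatHuman outperforms existing models in both tool selection accuracy and overall performance, and that it supports interactive chat with users, marking a significant step toward a unified, robust system for 3D human task analysis.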

📄 摘要(原文)

Numerous methods have been proposed to detect, estimate, and analyze properties of people in images, including 3D pose, shape, contact, human-object interaction, and emotion. While widely applicable in vision and other areas, such methods require expert knowledge to select, use, and interpret the results. To address this, we introduce ChatHuman, a language-driven system that integrates the capabilities of specialized methods into a unified framework. ChatHuman functions as an assistant proficient in utilizing, analyzing, and interacting with tools specific to 3D human tasks, adeptly discussing and resolving related challenges. Built on a Large Language Model (LLM) framework, ChatHuman is trained to autonomously select, apply, and interpret a diverse set of tools in response to user inputs. Our approach overcomes significant hurdles in adapting LLMs to 3D human tasks, including the need for domain-specific knowledge and the ability to interpret complex 3D outputs. The innovations of ChatHuman include leveraging academic publications to instruct the LLM on tool usage, employing a retrieval-augmented generation model to create in-context learning examples for managing new tools, and effectively discriminating between and integrating tool results by transforming specialized 3D outputs into comprehensible formats. Experiments demonstrate that ChatHuman surpasses existing models in both tool selection accuracy and overall performance across various 3D human tasks, and it supports interactive chatting with users. ChatHuman represents a significant step toward consolidating diverse analytical methods into a unified, robust system for 3D human tasks.
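
To make the retrieval-augmented tool selection idea more concrete, here is a minimal Python sketch. It is not the authors' implementation. The tool names, tool descriptions, and example queries are all hypothetical, and a simple bag-of-words cosine similarity stands in for whatever retriever the paper actually uses. The sketch only shows the general pattern: retrieve the stored examples most similar to the user query, then place them in a prompt alongside tool descriptions so that an LLM could pick a tool. No LLM is called here.

```python
import math
import re
from collections import Counter

# Hypothetical tool descriptions. In the paper, tool knowledge is drawn from
# academic publications; these one-line summaries are illustrative only.
TOOL_DOCS = {
    "pose_estimator": "Estimates 3D body pose and shape of a person from an image.",
    "contact_detector": "Detects which body parts are in contact with objects or the ground.",
    "emotion_recognizer": "Recognizes the emotion a person expresses.",
}

# Hypothetical (query, tool) pairs that serve as in-context demonstrations.
EXAMPLES = [
    ("What is the 3D pose of the person in this image?", "pose_estimator"),
    ("Which body parts touch the chair?", "contact_detector"),
    ("Does the person look happy or sad?", "emotion_recognizer"),
    ("Estimate the body shape of the runner.", "pose_estimator"),
]


def tokenize(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_examples(query, k=2):
    """Return the k stored examples whose queries are most similar to `query`."""
    q = tokenize(query)
    ranked = sorted(EXAMPLES, key=lambda ex: cosine(q, tokenize(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(query, k=2):
    """Assemble a tool-selection prompt from tool docs and retrieved examples."""
    lines = ["Available tools:"]
    lines += [f"- {name}: {doc}" for name, doc in TOOL_DOCS.items()]
    lines += ["", "Examples:"]
    for example_query, tool in retrieve_examples(query, k):
        lines += [f"User: {example_query}", f"Tool: {tool}"]
    lines += ["", f"User: {query}", "Tool:"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("Which body parts are touching the floor?"))
```

Running the script prints a prompt that ends with the new query and an open `Tool:` slot for the model to complete. In a real system the retriever would more likely use learned embeddings, and the chosen tool's 3D output would still need to be converted into a form the LLM can discuss, which this sketch does not cover.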