KV-Control: Parameter-Efficient K/V Injection for Trajectory-Controlled Text-to-Motion

Authors: Tengjiao Sun, Pengcheng Fang, Xiaoyu Zhan, Yanwen Guo, Dongjie Fu, Xiaohao Cai, Hansung Kim

Categories: cs.CV, cs.GR

Published: 2026-06-04

💡 One-Sentence Summary

Proposes KV-Control to address control precision in text-driven motion generation.

🎯 Matched Areas: Pillar 4: Generative Motion; Pillar 7: Motion Retargeting; Pillar 8: Physics-based Animation

Keywords: text-driven generation, 3D human motion, self-attention, motion control, high-precision trajectory

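📋 Key Points

Existing text-driven motion generation methods fall short on control precision: they struggle to satisfy specific trajectory constraints without overwriting the pretrained text-conditioned motion prior.
The paper proposes KV-Control, which injects control-conditioned key/value memories into the self-attention layers, avoiding the cost of duplicating the generator while preserving motion generation quality.
Experiments show that KV-Control tracks root and multi-joint constraints with sub-centimeter accuracy while retaining text-conditioned motion quality.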

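📝 Abstract (Summary)

Text-conditioned 3D human motion models can synthesize plausible motions from prompts, but practical animation and embodied-agent workflows often require a character to follow a sketched path, reach an end-effector target, or satisfy a multi-joint trajectory while preserving the gait, style, and intent described by language. This exposes a control trade-off. KV-Control is a compact attention-side control interface for frozen masked text-to-motion transformers. Its key idea is to make geometric constraints available as memory inside self-attention, rather than injecting them through a global pose token or enforcing them only at the output side. The method injects control-conditioned key/value memories at every self-attention layer while preserving the pretrained query stream, enabling high-precision trajectory control.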

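🔬 Method Details

Problem definition: The paper targets control precision in text-driven motion generation. Existing methods either duplicate large portions of the generator to gain per-layer control access or shift much of the cost to test-time optimization, leading to inefficiency and limited precision.

Core idea: KV-Control treats geometric constraints as memory inside self-attention rather than injecting them through a global pose token, which makes control more flexible and efficient.

Technical framework: The method co-designs a part-tokenized motion substrate and a controller. PartVQ learns anatomy-aligned part codebooks, T-Concat exposes each frame-part token as an attention-addressable site, and KV-Control injects control-conditioned key/value memories at every self-attention layer.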
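
To make the idea of an "attention-addressable site" concrete, here is a minimal sketch of flattening per-frame, per-part tokens into a single sequence while recording where each (frame, part) pair lands. The function name `t_concat`, the frame-major ordering, and the token labels are assumptions for illustration; the summary does not specify the actual layout used by T-Concat.

```python
def t_concat(part_tokens):
    """Flatten tokens[frame][part] into one sequence (frame-major order).

    Returns (sequence, index), where index maps (frame, part) to the
    position of that token in the flattened sequence.
    NOTE: the frame-major ordering is an assumption, not the paper's spec.
    """
    sequence = []
    index = {}
    for frame_id, frame in enumerate(part_tokens):
        for part_id, token in enumerate(frame):
            index[(frame_id, part_id)] = len(sequence)
            sequence.append(token)
    return sequence, index


# Hypothetical example: 2 frames, 2 body parts per frame.
tokens = [["f0_torso", "f0_arm"], ["f1_torso", "f1_arm"]]
sequence, index = t_concat(tokens)
```

With such a layout, a constraint on a given body part at a given frame can be associated with one specific sequence position, which is what lets attention address it directly.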

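Key innovation: KV-Control reframes trajectory conditioning as lightweight memory retrieval, which keeps control efficient and transparent and avoids disturbing the pretrained model.

Key design: The adapter adds only trainable injection parameters atop a shared trajectory encoder. The pretrained query stream, text cross-attention, feed-forward network (FFN), and all backbone weights are preserved, which protects motion quality while enabling precise control.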
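
The following sketch illustrates one way key/value injection can work in a single attention head: extra control-conditioned key/value rows are appended to the keys and values derived from the motion tokens, while the queries are left untouched. This is a simplified illustration under stated assumptions, not the paper's implementation: appending by concatenation, the single-head pure-Python form, and the names `attention_with_kv_memory`, `mem_keys`, and `mem_values` are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_with_kv_memory(queries, keys, values, mem_keys=None, mem_values=None):
    """Scaled dot-product attention with optional extra K/V memory rows.

    queries, keys, values: lists of vectors from the (frozen) motion tokens.
    mem_keys, mem_values: hypothetical control-conditioned memories that are
    appended to the key/value sets. Queries are never modified.
    """
    all_keys = list(keys) + list(mem_keys or [])
    all_values = list(values) + list(mem_values or [])
    scale = 1.0 / math.sqrt(len(queries[0]))
    out_dim = len(all_values[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in all_keys])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, all_values)) for d in range(out_dim)]
        )
    return outputs


# One motion token, with and without a single control memory slot.
q, k, v = [[1.0, 0.0]], [[1.0, 0.0]], [[1.0, 0.0]]
plain = attention_with_kv_memory(q, k, v)
# The memory key matches the query as well as the token key does, so the
# output is an even blend of the token value and the memory value.
controlled = attention_with_kv_memory(q, k, v, mem_keys=[[1.0, 0.0]], mem_values=[[0.0, 1.0]])
```

In this form, calling the function without memories reproduces ordinary attention, so the only new learnable pieces would be whatever produces `mem_keys` and `mem_values`; the backbone projections can stay frozen.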

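📊 Experimental Highlights

Experiments show that KV-Control tracks root and multi-joint constraints with sub-centimeter accuracy while retaining text-conditioned motion quality. The summary reports improvements over existing baselines in both control accuracy and generation quality, indicating potential for practical use.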

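🎯 Application Scenarios

Potential applications include animation production, game development, and virtual reality, where the method could give character animation finer control and more flexibility. As the technique matures, KV-Control may also prove useful in a broader range of interactive applications.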

📄 Abstract (Original)

Text-conditioned 3D human motion models now synthesize plausible motions from prompts, but practical animation and embodied-agent workflows rarely stop at text: a character may need to follow a sketched root path, hit an end-effector target, or satisfy a multi-joint trajectory while still preserving the gait, style, and intent described by language. This exposes a control trade-off. A trajectory controller should be precise without overwriting the pretrained text-conditioned motion prior, yet existing solutions either duplicate large portions of the generator to regain per-layer control access or move much of the cost to test-time optimization. We introduce KV-Control, a compact attention-side control interface for frozen masked text-to-motion transformers. The key idea is to make geometric constraints available as memory inside self-attention rather than injecting them through a global pose token or enforcing them only at the output side. To support this interface, we co-design a part-tokenized motion substrate and controller: PartVQ learns anatomy-aligned part codebooks, T-Concat exposes each frame-part token as an attention-addressable site, and KV-Control injects control-conditioned key/value memories at every self-attention layer while preserving the pretrained query stream, text cross-attention, FFN, and all backbone weights. The resulting adapter adds only trainable injection parameters atop a shared trajectory encoder, yet tracks root and multi-joint constraints with sub-centimeter accuracy under the inherited refinement protocol while retaining text-conditioned motion quality. KV-Control reframes trajectory conditioning as lightweight memory retrieval, providing a small, precise, and transparent control interface for text-to-motion generation.
