RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models
Authors: Jingzhou Luo, Yifan Wen, Yongjie Bai, Xinshuai Song, Yang Liu, Liang Lin
Category: cs.RO
Published: 2026-05-19
🔗 Code/Project: https://github.com/HCPLab-SYSU/RoVLA
💡 One-sentence summary
Proposes RoVLA to address the brittleness of vision-language-action models.
🎯 Matched areas: Pillar 1: Robot Control; Pillar 2: RL & Architecture; Pillar 9: Embodied Foundation Models
Keywords: vision-language-action, robustness, multi-consistency constraints, deep learning, robot manipulation
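📋 Key points
- Existing vision-language-action models are brittle under visual changes and paraphrased instructions, and fail to learn stable relationships between the task and the environment.
- The paper proposes RoVLA, a framework that improves robustness through multi-consistency constraints: Instructional Consistency, Evolutionary Consistency, and Observational Consistency.
- Experiments show that RoVLA outperforms existing methods on multiple benchmarks, with stronger robustness and generalization.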
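📝 Abstract (translated summary)
Vision-language-action (VLA) models perform well on embodied manipulation, but remain brittle under changes in visual observations, paraphrased language instructions, and compounded perturbations. Existing methods rely on shallow correlations in the training distribution rather than learning stable couplings among task semantics, environment states, and action generation. To address this, the paper proposes RoVLA, a robust VLA framework with multi-consistency constraints. RoVLA enforces consistency under three complementary transformations: instruction semantics, trajectory evolution, and observation perturbation. By explicitly modeling these invariances, it reduces reliance on superficial correlations and improves robustness and generalization. Experiments show that RoVLA outperforms strong baselines on LIBERO-Plus, RoboTwin 2.0, and real-world manipulation tasks, with superior robustness under diverse task and observation shifts.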
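🔬 Method details
Problem definition: The paper targets the brittleness of existing vision-language-action models under changes in visual observations and paraphrased language instructions. Existing methods tend to rely on shallow correlations in the training distribution and fail to capture the deeper relationships among task semantics, environment states, and action generation.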
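Core idea: RoVLA introduces multi-consistency constraints so that the policy stays stable under different transformations. During training, it enforces consistency across instruction semantics, trajectory evolution, and observation perturbation, which makes the model more robust to these disturbances.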
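Framework: RoVLA is built around three components: Instructional Consistency (IC), Evolutionary Consistency (EC), and Observational Consistency (OC). IC targets semantically equivalent rewrites of the instruction, EC keeps the action intent coherent throughout the generation process, and OC targets visual and proprioceptive perturbations by requiring consistent predictions before and after a disturbance.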
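The summary does not give the form of the three constraints. As a hedged illustration only, one way to write them is shown here. Every symbol is an assumption introduced for this sketch and not taken from the paper: $\pi_\theta(o, \ell)$ is the action predicted from observation $o$ and instruction $\ell$, $\ell'$ is a semantically equivalent paraphrase, $\tilde{o}$ is a perturbed observation, and $\hat{a}^{(k)}$ is the action estimate at generation step $k$ of $K$. The squared L2 distance is likewise a placeholder for whatever discrepancy measure the authors actually use.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{IC}} &= \bigl\lVert \pi_\theta(o, \ell) - \pi_\theta(o, \ell') \bigr\rVert_2^2, \\
\mathcal{L}_{\mathrm{EC}} &= \frac{1}{K-1} \sum_{k=1}^{K-1} \bigl\lVert \hat{a}^{(k+1)} - \hat{a}^{(k)} \bigr\rVert_2^2, \\
\mathcal{L}_{\mathrm{OC}} &= \bigl\lVert \pi_\theta(o, \ell) - \pi_\theta(\tilde{o}, \ell) \bigr\rVert_2^2 .
\end{aligned}
```

Each term is zero exactly when the prediction is unchanged by the corresponding transformation, which is the invariance the framework is described as enforcing.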
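Key innovation: The main novelty is the design of the multi-consistency constraints, in particular the explicit modeling of invariances during training. This differs fundamentally from approaches that improve robustness through larger-scale training, post-training adaptation, or enhanced predictive modeling: RoVLA instead enforces invariance-oriented consistency within the end-to-end policy itself.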
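Key design: Dedicated loss terms quantify the violation of each consistency constraint, and optimizing them improves overall performance. The network architecture is designed to support multimodal inputs and complex action generation.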
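The summary does not state the loss functions or their weights, so the following is only a minimal sketch of how a task loss could be combined with three consistency penalties. Everything in it is an assumption made for illustration: the function names `mse` and `multi_consistency_loss`, the use of mean squared error for every term, the reading of Evolutionary Consistency as agreement between successive generation-step estimates, and the weights `w_ic`, `w_ec`, `w_oc`.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length action vectors."""
    if len(a) != len(b):
        raise ValueError("action vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_consistency_loss(
    pred: Vector,                     # action for (observation, instruction)
    pred_paraphrase: Vector,          # action for (observation, paraphrased instruction)
    pred_perturbed: Vector,           # action for (perturbed observation, instruction)
    intermediates: Sequence[Vector],  # action estimates at successive generation steps
    target: Vector,                   # demonstrated action
    w_ic: float = 1.0,                # hypothetical weights, not from the paper
    w_ec: float = 1.0,
    w_oc: float = 1.0,
) -> Dict[str, float]:
    """Combine a task loss with three consistency penalties (illustrative only)."""
    task = mse(pred, target)
    # Instructional consistency: a paraphrase should not change the action.
    ic = mse(pred, pred_paraphrase)
    # Observational consistency: a perturbed observation should not change it either.
    oc = mse(pred, pred_perturbed)
    # Evolutionary consistency (assumed form): neighbouring generation steps agree.
    pairs = list(zip(intermediates, intermediates[1:]))
    ec = sum(mse(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    total = task + w_ic * ic + w_ec * ec + w_oc * oc
    return {"task": task, "ic": ic, "ec": ec, "oc": oc, "total": total}


# Toy usage with two-dimensional actions.
example = multi_consistency_loss(
    pred=[1.0, 0.0],
    pred_paraphrase=[1.0, 2.0],
    pred_perturbed=[0.0, 0.0],
    intermediates=[[0.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
    target=[1.0, 0.0],
)
```

In a real training loop the four predictions would come from forward passes of the same policy, and the penalties would be computed on tensors so that gradients flow through all of them. Plain lists are used here only to keep the sketch dependency-free.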
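📊 Experimental highlights
In experiments on LIBERO-Plus, RoboTwin 2.0, and real-world manipulation tasks, RoVLA outperforms several strong baselines across diverse task and observation shifts, with reported gains of more than 10%, indicating a clear advantage in robustness and generalization.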
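🎯 Application scenarios
RoVLA has potential applications in robot manipulation, automated control, and human-robot interaction. Because the model is more robust, it can carry out tasks more reliably in complex and dynamic environments, which may help intelligent robots see wider real-world deployment.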
📄 Abstract (original)
Vision-Language-Action (VLA) models have shown strong performance on embodied manipulation, yet they remain brittle under visual observation changes, paraphrased language instructions, and compounded perturbations. This limitation suggests that existing methods still rely heavily on shallow correlations in the training distribution, rather than learning stable couplings among task semantics, environment states, and action generation. Although recent efforts improve robustness through larger-scale training, post-training adaptation, or enhanced predictive modeling, they rarely enforce invariance-oriented consistency within the end-to-end policy itself. To address this issue, we propose RoVLA, a robust vision-language-action framework with multi-consistency constraints. RoVLA enforces consistency under three complementary transformations: instruction semantics, trajectory evolution, and observation perturbation. Specifically, Instructional Consistency (IC) promotes stable grounding under semantically equivalent instruction rewrites, Evolutionary Consistency (EC) preserves coherent action intent throughout the generation process, and Observational Consistency (OC) improves robustness to visual and proprioceptive perturbations by enforcing consistent predictions before and after targeted disturbances. By explicitly modeling these invariances during training, RoVLA reduces reliance on superficial correlations and improves robustness and generalization. Experiments on LIBERO-Plus, RoboTwin 2.0, and real-world manipulation tasks show that RoVLA consistently outperforms strong baseline methods and exhibits superior robustness under diverse task and observation shifts. These results demonstrate the effectiveness of multi-consistency learning for robust embodied control. Codes will be available at https://github.com/HCPLab-SYSU/RoVLA.