BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation
Authors: Zhaohui Du, Zhe Wang, Hongmei Fei, Xiwen Cao, Ting Xiao, Qi Wang, Huanbo Jin, Jiaming Gu, Quan Lu, Zhe Liu
Categories: cs.RO, cs.AI
Published: 2026-05-08
Comments: 16 pages, 7 figures
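💡 One-Sentence Takeaway
Proposes BioProVLA-Agent: a low-cost, protocol-driven, vision-enhanced embodied multi-agent system for biological laboratory manipulation.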
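🎯 Matched Areas: Pillar 1: Robot Control; Pillar 9: Embodied Foundation Models
Keywords: embodied AI; vision-language-action models; biological laboratory automation; multi-agent systems; closed-loop reasoning; data augmentation; robotic manipulation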
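📋 Key Points
- Existing biological-laboratory robot systems are costly and depend on fixed workflows, so they struggle with unstructured protocols and with the visual perception problems posed by transparent labware.
- BioProVLA-Agent is a multi-agent architecture that combines protocol parsing, visual verification, and VLA policy execution into a closed-loop embodied manipulation workflow.
- AugSmolVLA, an online augmentation strategy, improves execution stability under illumination changes and when manipulating transparent objects.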
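📝 Abstract (Translated Summary)
Biological laboratory automation aims to reduce repetitive manual work and improve reproducibility, but reliable embodied execution in wet-lab environments remains challenging. Existing systems often rely on costly hardware, fixed workflows, or dedicated interfaces, and they struggle with unstructured protocols, transparent or reflective labware, and multi-step procedures. The paper introduces BioProVLA-Agent, a low-cost, protocol-driven embodied multi-agent system enabled by Vision-Language-Action (VLA) models. The system takes protocols as its task interface and carries them through to physical manipulation via protocol parsing, visual state verification, and closed-loop execution. The authors also develop AugSmolVLA, an online augmentation strategy that targets the visual perturbations caused by transparent labware, reflections, illumination shifts, and overexposure. On a benchmark of 15 atomic tasks, 6 composite workflows, and 3 bimanual tasks, AugSmolVLA improves execution stability over ACT, X-VLA, and the original SmolVLA.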
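🔬 Method Details
Problem definition: The paper addresses three obstacles to embodied agents in biological laboratory automation: unstructured protocols that are hard to parse, perception failures on transparent or reflective labware, and long-horizon multi-step execution that lacks closed-loop verification.
Core idea: A multi-agent architecture decomposes a complex experimental protocol into verifiable subtasks. Visual feedback closes the loop: each subtask is checked for readiness before execution and for completion afterward, which keeps long-horizon tasks under state-aware control.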
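Technical framework: The system has three core agents: (1) an LLM Protocol Agent that converts natural-language protocols into structured, verifiable subtasks; (2) a VLM-RAG Verification Agent that uses retrieval-augmented generation to assess whether a subtask is ready to run and whether it has completed; and (3) a VLA Embodied Agent that executes the verified subtasks through a lightweight policy.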
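The three-agent loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Subtask` fields, the `run_protocol` function, the retry count, and all `toy_*` stand-ins are hypothetical names introduced here, and the "verifier" is a simple set lookup rather than a VLM with retrieval.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass
class Subtask:
    instruction: str    # command handed to the embodied policy
    precondition: str   # fact that must hold before execution (readiness)
    postcondition: str  # fact that must hold after execution (completion)


def run_protocol(
    protocol: str,
    parse: Callable[[str], List[Subtask]],     # stand-in for the protocol agent
    verify: Callable[[str, Set[str]], bool],   # stand-in for the verification agent
    act: Callable[[str], None],                # stand-in for the embodied agent
    observe: Callable[[], Set[str]],           # stand-in for camera + robot state
    max_retries: int = 2,                      # assumed retry budget
) -> Tuple[bool, List[Tuple[str, str]]]:
    """Execute subtasks in order, checking readiness before and completion after."""
    log: List[Tuple[str, str]] = []
    for sub in parse(protocol):
        if not verify(sub.precondition, observe()):
            log.append((sub.instruction, "not_ready"))
            return False, log
        for _ in range(max_retries + 1):
            act(sub.instruction)
            if verify(sub.postcondition, observe()):
                log.append((sub.instruction, "done"))
                break
        else:  # every attempt failed the completion check
            log.append((sub.instruction, "failed"))
            return False, log
    return True, log


# --- Toy stand-ins so the loop can run end to end ---
world: Set[str] = {"tube_on_bench"}
EFFECTS = {"load tube into rack": "tube_in_rack", "twist cap off": "cap_removed"}


def toy_parse(protocol: str) -> List[Subtask]:
    # One subtask per line: "instruction | precondition | postcondition"
    subtasks = []
    for line in protocol.strip().splitlines():
        instr, pre, post = (part.strip() for part in line.split("|"))
        subtasks.append(Subtask(instr, pre, post))
    return subtasks


def toy_verify(condition: str, observation: Set[str]) -> bool:
    return condition in observation


def toy_act(instruction: str) -> None:
    world.add(EFFECTS[instruction])


def toy_observe() -> Set[str]:
    return set(world)


PROTOCOL = """
load tube into rack | tube_on_bench | tube_in_rack
twist cap off | tube_in_rack | cap_removed
"""

ok, log = run_protocol(PROTOCOL, toy_parse, toy_verify, toy_act, toy_observe)
print(ok, log)
```

The point of the structure is that the policy never runs on an unverified state, and a subtask is only marked done when the completion check passes; a failed readiness check stops the workflow instead of letting errors compound across steps.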
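Key innovation: AugSmolVLA, an online augmentation strategy that targets the visual perturbations common in wet labs (transparent labware, reflections, illumination shifts, and overexposure) and makes the policy more robust to them.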
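To make "online augmentation" concrete, the sketch below perturbs each training frame on the fly with randomly sampled overexposure, illumination shift, and a reflection-like glare spot. The specific transforms, parameter ranges, and function names are assumptions chosen for illustration; the summary does not describe how AugSmolVLA actually implements its augmentations. Images are plain nested lists of grayscale values in [0, 1] to keep the example dependency-free; a real pipeline would operate on image tensors.

```python
import random
from typing import List

Image = List[List[float]]  # grayscale frame, values in [0, 1]


def clip(value: float) -> float:
    return min(1.0, max(0.0, value))


def overexpose(img: Image, gain: float) -> Image:
    """Scale brightness by `gain`, saturating bright pixels at 1.0."""
    return [[clip(p * gain) for p in row] for row in img]


def shift_illumination(img: Image, offset: float) -> Image:
    """Add a uniform brightness offset (positive or negative)."""
    return [[clip(p + offset) for p in row] for row in img]


def add_glare(img: Image, cy: int, cx: int, radius: float, strength: float) -> Image:
    """Add a bright spot centered at (cy, cx) that fades to zero at `radius`."""
    out = []
    for y, row in enumerate(img):
        new_row = []
        for x, p in enumerate(row):
            dist2 = (y - cy) ** 2 + (x - cx) ** 2
            weight = max(0.0, 1.0 - dist2 / radius ** 2)
            new_row.append(clip(p + strength * weight))
        out.append(new_row)
    return out


def augment_online(img: Image, rng: random.Random, p: float = 0.5) -> Image:
    """Apply each perturbation independently with probability p.

    Parameters are resampled on every call, so the same frame looks
    different each time it is drawn during training.
    """
    if rng.random() < p:
        img = overexpose(img, rng.uniform(1.2, 2.0))
    if rng.random() < p:
        img = shift_illumination(img, rng.uniform(-0.2, 0.2))
    if rng.random() < p:
        h, w = len(img), len(img[0])
        img = add_glare(img, rng.randrange(h), rng.randrange(w),
                        radius=max(h, w) / 2, strength=rng.uniform(0.3, 0.8))
    return img


frame = [[0.5] * 4 for _ in range(4)]
augmented = augment_online(frame, random.Random(0))
```

Because the perturbations are sampled per call rather than baked into a fixed dataset, the policy sees a fresh variation of each frame at every training step, which is the usual motivation for doing augmentation online.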
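Key design: The system is evaluated on a hierarchical benchmark that runs from atomic tasks through composite workflows to bimanual tasks. The Verification Agent combines visual observations, robot states, retrieved knowledge, and success/failure examples to judge readiness and completion, which is what closes the loop around the lightweight VLA policy.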
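📊 Experimental Highlights
Experiments cover 15 atomic tasks, 6 composite workflows, and 3 bimanual tasks, under both normal and high-exposure settings. AugSmolVLA improves execution stability over ACT, X-VLA, and the original SmolVLA, especially for precise placement, transparent-object manipulation, composite workflows, and visually degraded scenes. These results support the system as a practical low-cost, protocol-driven approach to embodied AI for biological manipulation.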
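🎯 Application Scenarios
The system targets automated manipulation in biological laboratories, with tasks such as tube loading, sorting, waste disposal, cap twisting, and liquid pouring. Its low cost and robustness to visual perturbations could make it useful to research institutes, pharmaceutical companies, and clinical testing laboratories, where it may reduce manual handling errors and improve the standardization and reproducibility of experimental workflows.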
📄 Abstract (Original)
Biological laboratory automation can reduce repetitive manual work and improve reproducibility, but reliable embodied execution in wet-lab environments remains challenging. Protocols are often unstructured, labware is frequently transparent or reflective, and multi-step procedures require state-aware execution beyond one-shot instruction following. Existing robotic systems often rely on costly hardware, fixed workflows, dedicated instruments, or robotics-oriented interfaces. Here, we introduce BioProVLA-Agent, an affordable, protocol-driven, vision-enhanced embodied multi-agent system enabled by Vision-Language-Action (VLA) models for biological manipulation. The system uses protocols as the task interface and integrates protocol parsing, visual state verification, and embodied execution in a closed-loop workflow. A Tailored LLM Protocol Agent converts protocols into verifiable subtasks; a VLM-RAG Verification Agent assesses readiness and completion using observations, robot states, retrieved knowledge, and success/failure examples; and a VLA Embodied Agent executes verified subtasks through a lightweight policy. To improve robustness under wet-lab visual perturbations, we develop AugSmolVLA, an online augmentation strategy targeting transparent labware, reflections, illumination shifts, and overexposure. We evaluate the system on a hierarchical benchmark covering 15 atomic tasks, 6 composite workflows, and 3 bimanual tasks, including tube loading, sorting, waste disposal, cap twisting, and liquid pouring. Across normal and high-exposure settings, AugSmolVLA improves execution stability over ACT, X-VLA, and the original SmolVLA, especially for precise placement, transparent-object manipulation, composite workflows, and visually degraded scenes. These results suggest a practical route toward accessible, protocol-centered, and verification-capable embodied AI for biological manipulation.