HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
Authors: Guankai Li, Jiabin Chen, Yi Xu, Xichen Zhang, Yuan Lu
Categories: cs.LG, cs.AI
Published: 2026-05-08
Notes: Code & Data: https://github.com/Guankai-Li/HyperEyes
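💡 One-sentence takeaway
Proposes HyperEyes, a parallel multimodal search agent trained with a dual-grained efficiency-aware reinforcement learning framework.
🎯 Matched areas: Pillar 2: RL Algorithms and Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models
Keywords: multimodal search, reinforcement learning, parallel computing, inference efficiency, tool calling, visual grounding, policy distillation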
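📋 Key points
- Existing agents retrieve sequentially, issuing one tool call per entity, so multi-entity queries accumulate redundant interaction rounds and inference slows down.
- HyperEyes fuses visual grounding and retrieval into a single atomic action so that multiple entities can be searched concurrently, and it optimizes efficiency with dual-grained reinforcement learning.
- Across six benchmarks, HyperEyes-30B improves accuracy by 9.9% over the strongest comparable open-source agent while using 5.3x fewer tool-call rounds on average.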
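📝 Abstract (summary)
Existing multimodal search agents usually work sequentially, calling a tool once per entity, which produces redundant interaction rounds whenever a query decomposes into independent sub-retrievals. The paper proposes HyperEyes, a parallel multimodal search agent that fuses visual grounding and retrieval into a single atomic action, supports concurrent search across multiple entities, and treats inference efficiency as a first-class training objective. HyperEyes is trained in two stages. First, a Parallel-Amenable Data Synthesis Pipeline provides cold-start supervision. Second, a Dual-Grained Efficiency-Aware Reinforcement Learning framework suppresses redundant tool calls at the macro level through the TRACE reward and injects token-level corrective signals at the micro level through On-Policy Distillation. The paper also builds IMEB, a human-curated benchmark of 300 instances that evaluates search capability and efficiency together. Across six benchmarks, HyperEyes-30B surpasses the strongest comparable open-source agent by 9.9% in accuracy with 5.3x fewer tool-call rounds on average.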
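🔬 Method details
Problem definition: When handling multi-entity queries, existing search agents tend to split the task into sequential steps, so each interaction round covers only one entity. This adds inference latency and redundant tool-call overhead.
Core idea: The paper argues that agents should "search wider rather than longer". Visual grounding and retrieval are merged into one atomic action so that multiple grounded queries can be dispatched concurrently within a round, and inference cost is made an explicit part of the reinforcement learning objective.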
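The contrast between one query per round and many queries per round can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `GroundedQuery`, `grounded_search`, `parallel_round`, and `serial_rounds` are hypothetical names, the tool is a stub that only sleeps, and none of it is taken from the HyperEyes codebase.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GroundedQuery:
    # Hypothetical atomic action: an image region plus the text query for it.
    bbox: Tuple[int, int, int, int]
    query: str


async def grounded_search(action: GroundedQuery) -> str:
    # Stub standing in for a real retrieval tool; the sleep mimics latency.
    await asyncio.sleep(0.01)
    return f"results for {action.query!r} in region {action.bbox}"


async def parallel_round(actions: List[GroundedQuery]) -> List[str]:
    # One interaction round: every grounded query is dispatched concurrently.
    # asyncio.gather returns results in the order the awaitables were passed.
    return list(await asyncio.gather(*(grounded_search(a) for a in actions)))


async def serial_rounds(actions: List[GroundedQuery]) -> List[str]:
    # Sequential baseline: one tool call per round, len(actions) rounds.
    return [await grounded_search(a) for a in actions]


actions = [
    GroundedQuery((0, 0, 50, 50), "logo on the left"),
    GroundedQuery((60, 0, 120, 50), "building in the middle"),
    GroundedQuery((130, 0, 200, 50), "person on the right"),
]
parallel_results = asyncio.run(parallel_round(actions))
serial_results = asyncio.run(serial_rounds(actions))
parallel_round_count = 1
serial_round_count = len(actions)
```

Both paths return the same results, but the parallel path spends one round where the serial path spends three. This only helps when the sub-retrievals are independent; a genuine multi-hop query still needs later rounds that depend on earlier results.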
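Technical framework: Training has two stages. Stage one is a cold start using the Parallel-Amenable Data Synthesis Pipeline, which covers visual multi-entity and textual multi-constraint queries and curates efficiency-oriented trajectories via Progressive Rejection Sampling. Stage two is dual-grained reinforcement learning: at the macro level the TRACE reward shapes whole trajectories, and at the micro level On-Policy Distillation injects token-level feedback.
Key innovation: The central component is TRACE (Tool-use Reference-Adaptive Cost Efficiency), a trajectory-level reward whose reference is monotonically tightened during training. It suppresses superfluous tool calls without restricting genuine multi-hop search.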
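A reward with a reference that only tightens over training can be sketched as follows. The exact TRACE formula is not given in this summary, so the rule below is an assumption: the reference for a query is the fewest tool-call rounds seen so far in a correct rollout, and a correct trajectory loses a fixed `penalty` per round above that reference. `ReferenceTracker`, `efficiency_reward`, and `penalty` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ReferenceTracker:
    # Assumed reference: fewest rounds observed in any correct rollout so far.
    refs: Dict[str, int] = field(default_factory=dict)

    def update(self, query_id: str, rounds: int, correct: bool) -> None:
        # Only correct rollouts may tighten the reference, and min() means
        # the reference never loosens (it is monotonically non-increasing).
        if not correct:
            return
        prev = self.refs.get(query_id)
        self.refs[query_id] = rounds if prev is None else min(prev, rounds)

    def get(self, query_id: str) -> Optional[int]:
        return self.refs.get(query_id)


def efficiency_reward(
    correct: bool, rounds: int, reference: Optional[int], penalty: float = 0.1
) -> float:
    # Wrong answers earn nothing, so saving rounds never beats being right.
    if not correct:
        return 0.0
    # No reference yet: fall back to the plain outcome reward.
    if reference is None:
        return 1.0
    # Penalize only the rounds spent beyond the current reference.
    excess = max(0, rounds - reference)
    return max(0.0, 1.0 - penalty * excess)


tracker = ReferenceTracker()
for rounds, correct in [(5, True), (3, True), (4, True), (1, False)]:
    tracker.update("q1", rounds, correct)
reference = tracker.get("q1")
```

Because only correct rollouts update the reference in this sketch, a query that truly needs several hops keeps a reference of several rounds, which is one way to avoid punishing genuine multi-hop search.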
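Key design: At the micro level, On-Policy Distillation is adapted so that an external teacher model supplies dense token-level corrective signals on failed rollouts. This mitigates the credit-assignment deficiency of sparse outcome rewards.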
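A dense token-level signal on failed rollouts can be sketched as below. The summary does not state the paper's loss, so this uses a common on-policy distillation choice as an assumption: for each token the student sampled, the signal is the teacher's log-probability minus the student's log-probability. `token_level_signal` is a hypothetical helper, and the log-probabilities are made-up numbers.

```python
from typing import List


def token_level_signal(
    student_logprobs: List[float],
    teacher_logprobs: List[float],
    failed: bool,
) -> List[float]:
    # Both lists score the same student-sampled tokens, position by position.
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("student and teacher must score the same tokens")
    # Assumption: successful rollouts rely on the outcome reward alone.
    if not failed:
        return [0.0] * len(student_logprobs)
    # Positive where the teacher finds the token more likely than the student
    # did, negative where the teacher finds it less likely.
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


failed_signal = token_level_signal([-1.0, -2.0], [-0.5, -3.0], failed=True)
passed_signal = token_level_signal([-1.0, -2.0], [-0.5, -3.0], failed=False)
```

Unlike a single end-of-trajectory reward, each token gets its own value here, so the update can single out the step where a failed rollout went wrong.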
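📊 Experimental highlights
Across six benchmarks, HyperEyes-30B surpasses the strongest comparable open-source agent by 9.9% in accuracy while using 5.3x fewer tool-call rounds on average. Because existing benchmarks score accuracy alone and omit inference cost, the paper also introduces IMEB, a human-curated benchmark of 300 instances that evaluates search capability and efficiency together.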
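🎯 Application scenarios
The approach may suit intelligent assistants and automated information-retrieval systems that handle complex multimodal queries, and possibly robot visual navigation. Fewer tool-call rounds should lower inference cost and shorten response time in interactive settings, though these deployment benefits are projected from the benchmark results rather than measured directly.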
📄 Abstract (original)
Existing multimodal search agents process target entities sequentially, issuing one tool call per entity and accumulating redundant interaction rounds whenever a query decomposes into independent sub-retrievals. We argue that effective multimodal agents should search wider rather than longer: dispatching multiple grounded queries concurrently within a round. To this end, we present HyperEyes, a parallel multimodal search agent that fuses visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities while treating inference efficiency as a first-class training objective. HyperEyes is trained in two stages. For cold-start supervision, we develop a Parallel-Amenable Data Synthesis Pipeline covering visual multi-entity and textual multi-constraint queries, curating efficiency-oriented trajectories via Progressive Rejection Sampling. Building on this, our central contribution, a Dual-Grained Efficiency-Aware Reinforcement Learning framework, operates at two levels. At the macro level, we propose TRACE (Tool-use Reference-Adaptive Cost Efficiency), a trajectory-level reward whose reference is monotonically tightened during training to suppress superfluous tool calls without restricting genuine multi-hop search. At the micro level, we adapt On-Policy Distillation to inject dense token-level corrective signals from an external teacher on failed rollouts, mitigating the credit-assignment deficiency of sparse outcome rewards. Since existing benchmarks evaluate accuracy as the sole metric, omitting inference cost, we introduce IMEB, a human-curated benchmark of 300 instances that jointly evaluates search capability and efficiency. Across six benchmarks, HyperEyes-30B surpasses the strongest comparable open-source agent by 9.9% in accuracy with 5.3x fewer tool-call rounds on average.