AgenticNav: Zero-Shot Vision-and-Language Navigation as a Tool-Calling Harness

Authors: Yijian Li, Changze Li, Hantian Shi, Jiaying Luo, Jiyuan Cai, Ming Yang, Tong Qin

Category: cs.RO

Published: 2026-06-09

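💡 One-Sentence Takeaway

AgenticNav recasts zero-shot vision-and-language navigation as a tool-calling harness that exposes action, depth, and memory to the VLM as callable tools.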

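🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: vision-and-language navigation, zero-shot learning, deep learning, robot navigation, multimodal models, intelligent systems, path planning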

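📋 Key Points

Existing zero-shot vision-and-language navigation methods rely on learned waypoint predictors, which restrict the model's action space and make poor use of depth input.
AgenticNav exposes action, depth, and memory as callable tools, so the model selects a target pixel directly in the RGB observation and the harness converts it into executable motion.
On the R2R-CE benchmark, AgenticNav sets a new state of the art among zero-shot methods given the same VLM backbone.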

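📝 Abstract (translated from Chinese)

Zero-shot vision-and-language navigation in continuous environments (VLN-CE) has become feasible with the development of large vision-language models (VLMs). However, existing methods typically rely on learned waypoint predictors to propose navigable actions, which limits the model's action space and fails to use depth input effectively. In addition, memory is usually handled by accumulating long textual or visual histories that fill the context with irrelevant information, or by retrieving cross-episode experiences, which weakens the zero-shot setting. This paper rethinks zero-shot VLN-CE as an agentic interface between the VLM and the environment and proposes AgenticNav, a lightweight navigation harness that exposes action, depth, and memory as callable tools. With this design, AgenticNav achieves new state-of-the-art performance among zero-shot methods on the R2R-CE benchmark given the same VLM backbone, and real-world validation demonstrates its zero-shot generalization.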

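🔬 Method Details

Problem definition: The paper targets the limitations of learned waypoint predictors in existing zero-shot vision-and-language navigation methods, namely a restricted action space and underused depth information.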

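Core idea: AgenticNav treats navigation as an agentic interface between the vision-language model and the environment. The model issues tool calls, directly selecting a target pixel that the harness turns into motion, which gives it a more flexible action space and lets it query depth only where needed.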
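The tool-calling interface described above can be pictured as a small registry plus a dispatcher. The following is a minimal sketch under stated assumptions, not the authors' implementation: the tool names `move_to_pixel` and `get_depth`, the JSON call format, and the placeholder return values are all hypothetical, since the summary does not specify them.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical tool registry: names and signatures are illustrative only.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(name: str):
    """Register a function as a callable tool under the given name."""
    def decorator(fn):
        TOOLS[name] = fn
        return fn
    return decorator


@tool("move_to_pixel")
def move_to_pixel(u: int, v: int) -> dict:
    # Placeholder: a real harness would convert the pixel into motion.
    return {"status": "ok", "target_pixel": [u, v]}


@tool("get_depth")
def get_depth(u: int, v: int) -> dict:
    # Placeholder: a real harness would read the depth image at (u, v).
    return {"status": "ok", "depth_m": None}


def dispatch(tool_call_json: str) -> dict:
    """Parse one tool call emitted by the VLM and run the matching tool."""
    call = json.loads(tool_call_json)
    name = call.get("name")
    if name not in TOOLS:
        return {"status": "error", "message": f"unknown tool: {name}"}
    return TOOLS[name](**call.get("arguments", {}))
```

In this sketch the harness stays thin: the VLM decides which tool to call and with which pixel, and the harness only validates the name and forwards the arguments.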

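Technical framework: The architecture has three main components: an action tool, a depth tool, and memory tools. The action tool lets the model select a target pixel in the RGB observation. The depth tool returns the depth at a requested pixel on demand. For memory, a compact map image summarizes the trajectory so far, and a recall tool lets the model selectively revisit past visual observations.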
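The memory component can be illustrated with a per-episode store that renders the trajectory as a small map and returns a past observation on request. This is only a sketch under assumptions: the class `EpisodeMemory`, its method names, and the text grid standing in for the map image are hypothetical, and the summary does not say how the real map is drawn or how recall is indexed.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class EpisodeMemory:
    """Hypothetical per-episode memory: poses plus the observation at each step."""
    poses: List[Tuple[float, float]] = field(default_factory=list)
    observations: List[Any] = field(default_factory=list)

    def add(self, x: float, y: float, observation: Any) -> int:
        """Store one step and return its index."""
        self.poses.append((x, y))
        self.observations.append(observation)
        return len(self.poses) - 1

    def render_map(self, size: int = 9, cell_m: float = 1.0) -> List[str]:
        """Rasterize the trajectory onto a text grid centred on the start pose."""
        grid = [["." for _ in range(size)] for _ in range(size)]
        if self.poses:
            x0, y0 = self.poses[0]
            half = size // 2
            for i, (x, y) in enumerate(self.poses):
                col = half + round((x - x0) / cell_m)
                row = half - round((y - y0) / cell_m)  # +y is drawn upwards
                if 0 <= row < size and 0 <= col < size:
                    grid[row][col] = str(i % 10)  # mark the cell with the step index
        return ["".join(r) for r in grid]

    def recall(self, step: int) -> Any:
        """Return the observation stored at a given step of this episode."""
        if not 0 <= step < len(self.observations):
            raise IndexError(f"no observation stored for step {step}")
        return self.observations[step]
```

Because the store is created fresh for each episode, nothing is carried across episodes, which matches the zero-shot concern raised in the abstract. Only the compact map goes into the prompt by default; full observations are fetched one at a time through `recall`.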

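Key innovation: The main technical contribution is designing action, depth, and memory as callable tools rather than relying on a learned waypoint predictor. According to the ablations, the action tool outperforms traditional waypoint predictors.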

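Key design: The action tool selects a target pixel directly, the depth tool returns precise metric distances, and the memory tools combine a compact image of the historical trajectory with selective recall to avoid overloading the context. The abstract does not give specific parameter settings or other implementation details; refer to the original paper.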
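One plausible way to turn a selected pixel and its metric depth into a motion command is pinhole back-projection. The source does not describe the actual conversion, so the sketch below is an assumption: the function `pixel_to_motion`, the intrinsics `fx` and `cx`, a level forward-facing camera, and depth measured along the optical axis are all hypothetical choices made for illustration.

```python
import math
from typing import Tuple


def pixel_to_motion(u: float, depth_m: float, fx: float, cx: float) -> Tuple[float, float]:
    """Convert a pixel column and its depth into (turn_deg, distance_m).

    Assumes a pinhole camera that looks straight ahead, with depth measured
    along the optical axis. A positive turn means the target is to the right.
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    if fx <= 0:
        raise ValueError("focal length must be positive")
    # Back-project the pixel column into the camera frame (x right, z forward).
    x_right = (u - cx) * depth_m / fx
    z_forward = depth_m
    turn_deg = math.degrees(math.atan2(x_right, z_forward))
    distance_m = math.hypot(x_right, z_forward)  # ground-plane distance
    return turn_deg, distance_m
```

For example, with `fx = cx = 320`, a pixel at the image centre with depth 2 m maps to no turn and 2 m forward, while a pixel at column 640 with the same depth maps to a 45 degree right turn. The vertical pixel coordinate is ignored here because the sketch only plans motion on the ground plane.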

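📊 Experimental Highlights

On the R2R-CE benchmark, AgenticNav achieves new state-of-the-art performance among zero-shot methods given the same VLM backbone. Real-world validation further highlights its zero-shot generalization compared with prior methods. Ablations show that the action tool design outperforms traditional waypoint predictors, and that the depth tool and agentic memory each contribute further to navigation performance.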

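🎯 Application Scenarios

Potential applications include intelligent robot navigation, autonomous driving systems, and augmented reality. By improving zero-shot navigation, AgenticNav could support more effective path planning and decision-making in previously unseen environments.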

📄 Abstract (Original)

Zero-shot vision-and-language navigation in continuous environments (VLN-CE) has recently become feasible with large vision-language models (VLMs). However, existing methods typically rely on learned waypoint predictors to propose navigable actions. This severely limits the model's action space and fails to leverage depth inputs effectively. Moreover, memory is commonly handled by accumulating long textual or visual histories with substantial irrelevant context, or by retrieving cross-episode experiences, which weakens the zero-shot setting. In this paper, we rethink zero-shot VLN-CE as an agentic interface between the VLM and the environment, and present AgenticNav, a lightweight navigation harness that exposes action, depth, and memory as callable tools. Instead of choosing from predicted waypoints, the action tool allows the VLM to directly select a target pixel in RGB observations, converting it into executable motion. Depth is exposed through an on-demand pixel-depth tool, enabling the VLM to request precise metric distances only where they matter. For memory, AgenticNav provides a compact map image summarizing the historical trajectory, paired with a recall tool that allows the VLM to selectively revisit past visual observations without overwhelming the prompt context. On the R2R-CE benchmark, AgenticNav establishes new state-of-the-art (SOTA) performance among zero-shot methods given the same VLM backbone. Real-world validation further highlights its zero-shot generalization compared to prior methods. Ablations show that our action tool design outperforms traditional waypoint predictors, and that depth tool and agentic memory further contribute to navigation performance.
