Search-TTA: A Multimodal Test-Time Adaptation Framework for Visual Search in the Wild

📄 arXiv: 2505.11350v5

Authors: Derek Ming Siang Tan, Shailesh, Boyang Liu, Alok Raj, Qi Xuan Ang, Weiheng Dai, Tanishq Duhan, Jimmy Chiun, Yuhong Cao, Florian Shkurti, Guillaume Sartoretti

Category: cs.RO

Published: 2025-05-16 (Updated: 2025-11-07)

Comments: Accepted for presentation at CoRL 2025. Code, models, and data are available at https://search-tta.github.io/


💡 One-Sentence Takeaway

Proposes the Search-TTA framework to address uncertainty in outdoor visual search.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: multimodal adaptation, visual search, satellite imagery, reinforcement learning, uncertainty weighting, CLIP, UAV navigation

📋 Key Points

  1. Existing visual search methods often either assume no prior information or use priors without accounting for how they were obtained, leading to inefficient search.
  2. The proposed Search-TTA framework dynamically refines CLIP's predictions via uncertainty-weighted gradient updates, improving search accuracy and efficiency.
  3. Experiments show that Search-TTA improves planner performance by up to 30.0% and achieves zero-shot generalization to unseen modalities.

📝 Summary

For outdoor visual navigation and search, a robot can leverage satellite imagery to generate visual priors. These help inform high-level search strategies even when the images lack sufficient resolution for target recognition. However, many existing path planning or search methods either assume no prior information or use priors without accounting for how they were obtained. To address these challenges, we propose Search-TTA, a multimodal test-time adaptation framework with a flexible plug-and-play interface compatible with various input modalities (e.g., image, text, sound) and planning methods (e.g., RL-based). We find that Search-TTA improves planner performance by up to 30.0%, particularly when initial CLIP predictions are poor. Finally, we deploy Search-TTA on a real UAV via hardware-in-the-loop testing.

🔬 Method Details

Problem definition: This paper tackles the inefficiency of outdoor visual search caused by insufficient or inaccurate prior information. Existing methods often fail to make effective use of priors such as satellite imagery, resulting in poor planning.

Core idea: Search-TTA dynamically refines CLIP's predictions during search via uncertainty-weighted gradient updates, improving the accuracy and efficiency of visual search.

Technical framework: Search-TTA consists of two main modules. First, a satellite image encoder is pretrained to align with CLIP's visual encoder so that it outputs a probability distribution of target presence; second, CLIP's predictions are dynamically refined during the search itself.
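
To make the two-module structure concrete, below is a minimal sketch of the resulting search loop. All interfaces (`prior_fn`, `planner`, `sensor`, `tta_update`) are hypothetical stand-ins for the framework's plug-and-play components, not the authors' actual API.

```python
# A minimal sketch of the Search-TTA loop: plan on a prior, observe, adapt,
# repeat. Every callable here is an assumed placeholder interface.
def search_tta_episode(prior_fn, planner, sensor, tta_update, max_steps=100):
    """Run one search episode over a target-presence probability map."""
    prob_map = prior_fn()                    # initial CLIP-derived probability map
    observations = []
    for _ in range(max_steps):
        waypoint = planner(prob_map)         # e.g., an RL-based planner action
        found, detection = sensor(waypoint)  # onboard sensing at the waypoint
        observations.append((waypoint, detection))
        prob_map = tta_update(prob_map, observations)  # refine the prior online
        if found:
            break
    return prob_map, observations
```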

Key innovation: The main technical contribution is an uncertainty-weighted gradient update mechanism inspired by Spatial Poisson Point Processes, which markedly improves search accuracy and efficiency.
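
As a rough illustration of what such an update could look like, the sketch below fits per-cell presence rates to sparse in-situ observations with a Poisson point-process negative log-likelihood, weighted by a simple confidence term. The grid size, weighting scheme, and optimizer settings are all illustrative assumptions; the paper's actual formulation may differ.

```python
import torch

torch.manual_seed(0)

# Per-cell logits stand in for the adapted parameters; in the real framework
# the gradients would flow into CLIP-side parameters instead.
logits = torch.randn(32, 32, requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=1e-2)

# Sparse in-situ observations: cells the robot has inspected so far,
# with Poisson counts (here, a single detection in one cell).
visited = torch.zeros(32, 32, dtype=torch.bool)
visited[10:14, 10:14] = True
counts = torch.zeros(32, 32)
counts[12, 12] = 1.0

for step in range(50):
    rates = torch.sigmoid(logits)  # predicted presence rate per cell
    # One plausible confidence weight (an assumption): predictions near 0 or 1
    # count more, predictions near 0.5 (maximally uncertain) count less.
    weights = (2.0 * rates - 1.0).abs().detach()
    # Poisson point-process NLL per cell: rate - count * log(rate),
    # dropping the log(count!) constant.
    nll = rates - counts * torch.log(rates + 1e-8)
    loss = (weights * nll)[visited].mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```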

Key design: The model uses a satellite image encoder aligned with CLIP, trained with a dedicated loss function to optimize the predicted probability of target presence.
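
The following is a minimal sketch of what such CLIP alignment could look like: a small satellite encoder is trained with an InfoNCE-style contrastive loss to match frozen CLIP visual embeddings of paired ground-level images. The architecture, temperature, and pairing scheme are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

class SatEncoder(torch.nn.Module):
    """Toy satellite-image encoder projecting into a CLIP-sized embedding space."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.backbone = torch.nn.Sequential(
            torch.nn.Conv2d(3, 32, 3, stride=2, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(32, 64, 3, stride=2, padding=1), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
            torch.nn.Linear(64, embed_dim),
        )

    def forward(self, x):
        return F.normalize(self.backbone(x), dim=-1)

sat_encoder = SatEncoder()
opt = torch.optim.Adam(sat_encoder.parameters(), lr=1e-4)

# Stand-ins for a batch of satellite tiles and the precomputed, frozen CLIP
# visual embeddings of their paired ground-level images.
sat_tiles = torch.randn(8, 3, 224, 224)
clip_embeds = F.normalize(torch.randn(8, 512), dim=-1)

z = sat_encoder(sat_tiles)
logits = z @ clip_embeds.t() / 0.07      # temperature-scaled similarities
labels = torch.arange(8)                 # matching pairs lie on the diagonal
loss = F.cross_entropy(logits, labels)   # InfoNCE-style alignment loss
loss.backward()
opt.step()
```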

📊 Experimental Highlights

Search-TTA improves planner performance by up to 30.0%, with the largest gains when initial CLIP predictions are poor. It also performs comparably with significantly larger vision-language models and achieves zero-shot generalization to unseen modalities.

🎯 Application Scenarios

Potential applications include UAV navigation, search and rescue, and environmental monitoring, where the framework can strengthen a robot's autonomous search capability in complex environments. It could also extend to a broader range of real-world deployments, further advancing visual search technology.

📄 Abstract (Original)

To perform outdoor visual navigation and search, a robot may leverage satellite imagery to generate visual priors. This can help inform high-level search strategies, even when such images lack sufficient resolution for target recognition. However, many existing informative path planning or search-based approaches either assume no prior information, or use priors without accounting for how they were obtained. Recent work instead utilizes large Vision Language Models (VLMs) for generalizable priors, but their outputs can be inaccurate due to hallucination, leading to inefficient search. To address these challenges, we introduce Search-TTA, a multimodal test-time adaptation framework with a flexible plug-and-play interface compatible with various input modalities (e.g., image, text, sound) and planning methods (e.g., RL-based). First, we pretrain a satellite image encoder to align with CLIP's visual encoder to output probability distributions of target presence used for visual search. Second, our TTA framework dynamically refines CLIP's predictions during search using uncertainty-weighted gradient updates inspired by Spatial Poisson Point Processes. To train and evaluate Search-TTA, we curate AVS-Bench, a visual search dataset based on internet-scale ecological data containing 380k images and taxonomy data. We find that Search-TTA improves planner performance by up to 30.0%, particularly in cases with poor initial CLIP predictions due to domain mismatch and limited training data. It also performs comparably with significantly larger VLMs, and achieves zero-shot generalization via emergent alignment to unseen modalities. Finally, we deploy Search-TTA on a real UAV via hardware-in-the-loop testing, by simulating its operation within a large-scale simulation that provides onboard sensing.