Earth-OneVision: Extending Remote Sensing Multimodal Large Language Models to More Sensor Modalities and Tasks

📄 arXiv: 2606.10819v1

Authors: Miaoxin Cai, Guanqun Wang, Wei Zhang, Guangyao Zhou, Yin Zhuang, Tong Zhang, Hao Wang, He Chen, Jun Li

Categories: cs.CV, cs.AI

Published: 2026-06-09


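💡 One-Sentence Summary

Earth-OneVision extends remote sensing multimodal large language models beyond the narrow range of sensor types and tasks that existing models support.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: remote sensing, multimodal fusion, natural language processing, spatial reasoning, deep learning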

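📋 Key Points

  1. Existing remote sensing multimodal large language models (RS-MLLMs) support only a narrow range of sensor types and tasks, which yields a fragmented view of the earth and leaves cross-modal knowledge largely unexploited.
  2. The paper proposes Earth-OneVision, which unifies six sensor modalities and cross-sensor fusion across nine task categories within a single autoregressive framework.
  3. With only 2B parameters, Earth-OneVision achieves competitive or state-of-the-art results across multiple benchmarks, consistently matching or outperforming RS-MLLMs with 4B to 72B parameters.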

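📝 Abstract (Translated Summary)

RS-MLLMs enable natural-language understanding and spatial reasoning over earth observation imagery. However, existing models support only a narrow range of sensor types and tasks, yielding a fragmented view of the earth and leaving cross-modal geoscientific knowledge largely unexploited. This work presents Earth-OneVision, a 2B RS-MLLM that unifies six sensor modalities (optical, SAR, infrared, multispectral, temporal, and video) and cross-sensor fusion across nine task categories. Three dedicated mechanisms address three bottlenecks. To support joint training, the authors construct MMRS-OneVision, a dataset of about 34M QA pairs that substantially exceeds existing RS multimodal instruction datasets. The resulting model achieves competitive or state-of-the-art results across extensive benchmarks.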

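🔬 Method Details

Problem definition: Existing remote sensing multimodal large language models support only a limited set of sensor types and tasks, so their understanding of earth observation data is fragmented. The paper targets this limitation.

Core idea: Earth-OneVision integrates multiple sensor modalities and tasks into a single autoregressive framework to achieve a more comprehensive understanding of earth observation imagery.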

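Technical framework: Earth-OneVision unifies six sensor modalities and cross-sensor fusion, and relies on three mechanisms to handle multimodal alignment and domain gaps. Full-Granularity Vision-Language Alignment (FGVLA) aligns multi-level visual features with the multi-dimensional language space. Spatial-Linguistic Isomorphic Serialization (SLIS) unifies heterogeneous spatial outputs as autoregressive tokens. Progressive Cross-Modality Adaptation (PCMA) decomposes the compound domain gap into sequential stages, tackling the viewpoint gap and the imaging physics gap in turn.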
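The staged idea behind PCMA can be illustrated with a small schedule runner. This is only a sketch under stated assumptions: the abstract confirms the order (viewpoint gap first, imaging physics gap second), but the names `AdaptationStage`, `PCMA_SCHEDULE`, `run_schedule`, and `dummy_train_stage`, as well as the modalities assigned to each stage, are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class AdaptationStage:
    gap: str                     # the domain gap this stage targets
    modalities: Tuple[str, ...]  # ASSUMED data mix, for illustration only


# Stage order follows the abstract: viewpoint gap, then imaging physics gap.
# Which modalities belong to which stage is a guess, not the paper's recipe.
PCMA_SCHEDULE: List[AdaptationStage] = [
    AdaptationStage(gap="viewpoint", modalities=("optical",)),
    AdaptationStage(gap="imaging physics",
                    modalities=("SAR", "infrared", "multispectral")),
]


def run_schedule(
    schedule: List[AdaptationStage],
    train_stage: Callable[[AdaptationStage, Dict], Dict],
    state: Optional[Dict] = None,
) -> Dict:
    """Run stages one after another; each stage starts from the previous state."""
    state = dict(state or {})
    for stage in schedule:
        state = train_stage(stage, state)
    return state


def dummy_train_stage(stage: AdaptationStage, state: Dict) -> Dict:
    """Stand-in for real training: records which gaps were handled, in order."""
    history = state.get("history", []) + [stage.gap]
    return {**state, "history": history}


if __name__ == "__main__":
    print(run_schedule(PCMA_SCHEDULE, dummy_train_stage)["history"])
```

The point of the sequential structure is that each stage only has to bridge one gap at a time, starting from weights already adapted by the previous stage.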

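Key innovation: The main innovations are Full-Granularity Vision-Language Alignment (FGVLA) and Spatial-Linguistic Isomorphic Serialization (SLIS), which improve multimodal fusion and let heterogeneous spatial outputs be produced by the same autoregressive decoder as ordinary text.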
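To make the notion of emitting spatial outputs as autoregressive tokens concrete, here is a minimal sketch of one common way to do it: quantize box coordinates into integer bins and render them as text. The `<box>` tag format, the 1000-bin resolution, and the function names `serialize_box` and `parse_box` are assumptions for illustration; the summary does not specify the token format that SLIS actually uses.

```python
import re
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels

NUM_BINS = 1000  # ASSUMED quantization resolution, not from the paper
_BOX_RE = re.compile(r"<box>(\d+),(\d+),(\d+),(\d+)</box>")


def _to_bin(value: float, size: int, bins: int) -> int:
    """Map a pixel coordinate to an integer bin in [0, bins - 1]."""
    return min(bins - 1, max(0, int(value * bins / size)))


def serialize_box(box: Box, width: int, height: int, bins: int = NUM_BINS) -> str:
    """Render a pixel box as a short text string a language model can emit."""
    x0, y0, x1, y1 = box
    q = (_to_bin(x0, width, bins), _to_bin(y0, height, bins),
         _to_bin(x1, width, bins), _to_bin(y1, height, bins))
    return "<box>" + ",".join(str(v) for v in q) + "</box>"


def parse_box(text: str, width: int, height: int, bins: int = NUM_BINS) -> Box:
    """Invert serialize_box, mapping each bin back to its center in pixels."""
    match = _BOX_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a serialized box: {text!r}")
    qx0, qy0, qx1, qy1 = (int(g) for g in match.groups())
    return ((qx0 + 0.5) / bins * width, (qy0 + 0.5) / bins * height,
            (qx1 + 0.5) / bins * width, (qy1 + 0.5) / bins * height)


if __name__ == "__main__":
    text = serialize_box((100.0, 200.0, 300.0, 400.0), width=1000, height=800)
    print(text, parse_box(text, width=1000, height=800))
```

Because the box becomes ordinary text, the same next-token objective covers both language and localization; the round trip loses at most one bin of precision per coordinate.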

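Key design: To support joint training, the authors construct the MMRS-OneVision dataset with about 34M QA pairs spanning all six sensor modalities and cross-sensor fusion across nine task categories. The model keeps a single autoregressive design so that every task type is trained within the same framework.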

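📊 Experimental Highlights

Earth-OneVision performs strongly across multiple benchmarks. It reaches 87.52% P@0.5 on the OPT-RSVG test set for optical visual grounding and 80.68% on the SAR VQA benchmark SARLANG-Bench, exceeding 7B models by over 7%. It also achieves 75.74% recall on the BigEarthNet-MS test set for multispectral classification and 81.94% MCQ accuracy on EarthMind-Bench for cross-modality reasoning.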
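For readers unfamiliar with the grounding metric, P@0.5 is conventionally the fraction of predicted boxes whose intersection-over-union (IoU) with the ground-truth box reaches 0.5. The sketch below shows that computation under the usual convention; the function names are hypothetical, and the exact evaluation protocol of OPT-RSVG (for example, whether the comparison is `>=` or `>`) is an assumption here.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_at_iou(preds: Sequence[Box], gts: Sequence[Box],
                     threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the paired ground truth
    is at least `threshold` (ASSUMED `>=`; some protocols use `>`)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


if __name__ == "__main__":
    preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
    gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
    print(precision_at_iou(preds, gts))
```

In this toy run the first prediction matches its target exactly and the second misses entirely, so half of the predictions count as hits.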

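🎯 Application Scenarios

Potential application areas include environmental monitoring, urban planning, and agricultural management, where the model could supply more comprehensive and accurate geographic information for decision-making. Looking ahead, Earth-OneVision may help advance the use of remote sensing in multimodal data fusion and intelligent analysis, improving the efficiency and accuracy of earth science research.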

📄 Abstract (Original)

RS-MLLMs enable natural-language understanding and spatial reasoning over earth observation imagery. However, existing models support only a narrow range of sensor types and tasks, yielding a fragmented view of the earth and leaving cross-modal geoscientific knowledge largely unexploited. This work presents Earth-OneVision, a 2B RS-MLLM that unifies six sensor modalities (i.e., optical, SAR, infrared, multispectral, temporal, and video) and cross-sensor fusion across 9 task categories within a single autoregressive framework. Three dedicated mechanisms address three bottlenecks. Full-Granularity Vision-Language Alignment (FGVLA) aligns multi-level visual features with the multi-dimensional language space. Spatial-Linguistic Isomorphic Serialization (SLIS) unifies heterogeneous spatial outputs as autoregressive tokens. Progressive Cross-Modality Adaptation (PCMA) decomposes the compound domain gap into sequential stages, tackling the viewpoint and imaging physics gaps in turn. To support joint training, MMRS-OneVision is constructed with ~34M QA pairs spanning all six sensor modalities and cross-sensor fusion across 9 task categories, substantially exceeding existing RS multimodal instruction datasets. With only 2B parameters, Earth-OneVision achieves competitive or state-of-the-art results across extensive benchmarks, consistently matching or outperforming 4B-72B RS-MLLMs. It achieves 87.52% P@0.5 on the OPT-RSVG testset for optical visual grounding and 80.68% on the SAR VQA benchmark SARLANG-Bench, exceeding 7B models by over 7%. It further achieves 75.74% recall on the BigEarthNet-MS testset for multispectral classification, and 81.94% MCQ accuracy on EarthMind-Bench for cross-modality reasoning.