Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model

作者: David F. Ramirez, Tim Overman, Kristen Jaskie, Andreas Spanias

分类: eess.IV, cs.AI, cs.CV

发布日期: 2026-05-11

备注: Accepted to 2026 SPIE Defense + Security, Automatic Target Recognition XXXVI

💡 一句话要点

提出SMART-HC-VQA数据集与多模态大模型框架，实现遥感影像的时空活动推理

🎯 匹配领域: 支柱八：物理动画 (Physics-based Animation) 支柱九：具身大模型 (Embodied Foundation Models)

关键词: 遥感影像分析 多模态大模型 视觉问答 时空推理 自动目标识别 Sentinel-2

📋 核心要点

现有遥感分析多局限于静态目标检测，缺乏对地理空间活动随时间演变的深层语义推理能力。
提出SMART-HC-VQA数据集，通过将时空元数据转化为自然语言问答对，构建了多时相遥感影像的推理基准。
基于LLaVA-NeXT架构实现多图输入训练，使模型能够理解施工现场的动态演变过程及未来发展趋势。

📝 摘要（中文）

本文介绍了SMART-HC-VQA，这是一个基于Sentinel-2卫星影像的视觉问答（VQA）数据集，源自IARPA SMART重型建筑数据集，旨在进行人类活动的时空分析。该研究将建筑工地标注、施工类型、时间阶段、地理元数据及观测关系转化为自然语言问答对，将现有数据集重构为具备时序扩展性的自动目标识别与VQA挑战。目前，该数据集包含21,837个Sentinel-2影像切片、65,511个单图VQA示例，以及通过创新的“图像对组合增强”生成的约230万个双图时序对比示例。此外，作者基于LLaVA-NeXT Mistral-7B构建了一个多图多模态大模型（MLLM）训练框架，能够处理多时相输入并进行元数据驱动的推理。该工作为理解语言引导的遥感活动提供了可复现的基础，旨在实现从单纯的“变化检测”向“过程推理与预测”的跨越。

🔬 方法详解

问题定义：传统遥感任务多关注单帧图像的目标识别，难以处理固定地理位置在稀疏观测下的动态演变过程，缺乏对施工进度、活动状态等时空逻辑的语义理解能力。

核心思路：将遥感影像分析转化为视觉问答（VQA）任务，通过构建大规模时空问答对，引导多模态大模型（MLLM）学习地理空间实体的属性演变，从而实现从“检测”到“推理”的范式转换。

技术框架：流程包括：1. 遥感影像预处理与切片；2. 基于SMART-HC标注生成自然语言问答对；3. 采用“图像对组合增强”技术扩充时序对比样本；4. 基于LLaVA-NeXT Mistral-7B构建多图输入MLLM训练框架。

关键创新：引入“图像对组合增强”（Image-Pairwise Combinatorial Augmentation），通过组合不同时间点的影像，极大地丰富了时序对比样本，使模型能够捕捉施工过程的细微变化。

关键设计：模型架构适配了多时相输入，通过将多个带时间戳的影像切片作为上下文输入，结合元数据驱动的指令微调，使模型能够针对特定地理位置的施工阶段和类型进行逻辑推理。

📊 实验亮点

实验构建了包含21,837个影像切片、65,511个单图VQA及约230万个时序对比示例的庞大基准。通过将LLaVA-NeXT适配至多图时序输入，模型在处理复杂地理空间推理任务上表现出显著潜力，为遥感领域的大模型微调提供了可复现的基准框架，有效提升了对长周期活动演变的理解精度。

🎯 应用场景

该研究在城市规划、基础设施建设监测、环境变化分析及国防情报领域具有重要价值。通过自动化的时空推理，能够实时监控大型工程进度，识别异常施工活动，并辅助决策者预测项目完成周期，提升遥感数据在复杂动态场景下的应用效能。

📄 摘要（原文）

We introduce SMART-HC-VQA, a Sentinel-2-based visual question answering dataset derived from the IARPA SMART Heavy Construction dataset, designed for spatiotemporal analysis of human activity. The dataset transforms construction-site annotations, construction-type labels, temporal-phase labels, geographic metadata, and observation relationships into natural language question-answer triplets. This approach redefines the existing dataset as a temporally extended automatic target recognition and visual question answering (VQA) challenge, considering a fixed geospatial site as a target whose attributes and activity states evolve across sparse satellite observations. Currently, SMART-HC-VQA comprises 21,837 accessible Sentinel-2 image chips, 65,511 single-image VQA examples, and approximately 2.3 million two-image temporal comparison examples generated via our novel Image-Pairwise Combinatorial Augmentation. We detail the workflow for retrieving and processing Sentinel-2 imagery, segmenting large satellite tiles into site-centered images, maintaining traceability to SMART-HC annotations, and analyzing the distributions of site size, observation count, temporal coverage, construction type, and phase labels. Additionally, we describe an implemented multi-image MLLM training framework based on LLaVA-NeXT Mistral-7B, adapted to accept multiple dated image inputs and train on metadata-derived VQA examples. This work offers a reproducible foundation for understanding language-guided remote sensing activities, aiming not only to detect change but also to reason about the ongoing processes, their progression, and potential future developments.

Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model

💡 一句话要点

📋 核心要点

📝 摘要（中文）

🔬 方法详解

📊 实验亮点

🎯 应用场景

📄 摘要（原文）

⭐ 我的收藏

📁 新建收藏夹

⚙️ 管理收藏夹

🔍 搜索论文

🔐 登录 / 注册

👤 用户管理