Masks Can Talk: Extracting Structured Text Information from Single-Modal Images for Remote Sensing Change Detection

📄 arXiv: 2605.07178v1 📥 PDF

Authors: Kai Zheng, Hang-Cheng Dong, Jiatong Pan, Zhenkai Wu, Fupeng Wei, Wei Zhang

Categories: cs.CV

Published: 2026-05-08


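💡 One-sentence takeaway

Proposes the S2M framework, which converts remote sensing change detection masks into structured text to provide multimodal supervision.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: remote sensing change detection, multimodal learning, structured text, semantic alignment, contrastive learning, multi-class change detection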

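📋 Key points

  1. Existing unimodal methods struggle to separate genuine changes from visually similar but irrelevant variations, while the external text descriptions that multimodal methods rely on are often semantically coarse or noisy.
  2. S2M uses the mask labels that already ship with remote sensing change detection datasets: it automatically extracts structured semantic quadruples and converts them into text, giving multimodal supervision at zero additional annotation cost.
  3. On the Gaza-Change-v2 dataset, S2M reaches a Sek of 17.80% and an Fscd of 66.14%, supporting the claim that structured mask-derived text is an effective aid for remote sensing change detection.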

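📝 Summary

Remote sensing change detection is essential for urban monitoring, disaster assessment, and resource management. Unimodal deep learning methods, however, often mistake visually similar but irrelevant variations for semantic changes. Existing multimodal methods add text as auxiliary supervision, but their descriptions tend to be semantically coarse, unstructured, or contaminated by model-generated noise. This work observes that fine-grained change semantics are already implicit in the standard ground-truth masks. It therefore proposes the S2M framework, which needs no extra annotation: each change region is automatically transcribed into a semantic quadruple (where, what, how, how many) and converted into fixed-template text descriptions, giving precise, dense, and noise-free multimodal supervision. A two-stage training strategy with a bi-directional contrastive loss aligns visual features with the structured text embeddings. On the authors' own Gaza-Change-v2 multi-class change detection dataset, S2M notably surpasses multimodal methods that rely on large language models, showing how much semantic information masks can express.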

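🔬 Method details

Problem definition: In remote sensing change detection, models are limited by the ambiguity of visual features and struggle to tell genuine semantic changes apart from visually similar but irrelevant variations. Existing methods that add text supervision face uncontrolled text-generation quality and a lack of semantic structure.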

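Core idea: The paper argues that masks can serve as text. The ground-truth masks in a change detection dataset already carry rich semantic information: where the change happened, what the land-cover types were before and after, how the transition occurred, and how many objects were involved. Parsing the masks into structured quadruples yields a high-quality, noise-free text supervision signal.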
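The mask-to-quadruple idea can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the class ids and names in `CLASS_NAMES`, the four-quadrant notion of "where", the three transition types, and the choice to group regions by transition are all hypothetical.

```python
from collections import Counter, deque

# Hypothetical label set; the paper's actual classes are not listed here.
CLASS_NAMES = {0: "background", 1: "building", 2: "rubble"}


def connected_regions(change):
    """Return 4-connected regions of truthy cells as lists of (row, col)."""
    h, w = len(change), len(change[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for r in range(h):
        for c in range(w):
            if not change[r][c] or seen[r][c]:
                continue
            queue, cells = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and change[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(cells)
    return regions


def coarse_location(cells, h, w):
    """Map a region's centroid to one of four image quadrants (assumed scheme)."""
    cy = sum(y for y, _ in cells) / len(cells)
    cx = sum(x for _, x in cells) / len(cells)
    return f"{'upper' if cy < h / 2 else 'lower'}-{'left' if cx < w / 2 else 'right'}"


def transition_type(b, a):
    """Name the transition; the three categories are an assumption."""
    if b == 0:
        return "appeared"
    if a == 0:
        return "disappeared"
    return "converted"


def extract_quadruples(before, after):
    """Build one (where, what, how, how many) record per class transition."""
    h, w = len(before), len(before[0])
    pairs = sorted({(before[r][c], after[r][c])
                    for r in range(h) for c in range(w)
                    if before[r][c] != after[r][c]})
    quads = []
    for b, a in pairs:
        change = [[before[r][c] == b and after[r][c] == a for c in range(w)]
                  for r in range(h)]
        regions = connected_regions(change)
        where = Counter(coarse_location(cells, h, w)
                        for cells in regions).most_common(1)[0][0]
        quads.append({"where": where,
                      "what": (CLASS_NAMES[b], CLASS_NAMES[a]),
                      "how": transition_type(b, a),
                      "how_many": len(regions)})
    return quads
```

Everything here is computed from the two class masks alone, which is the point of the approach: no captioning model or extra annotation is involved.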

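Framework: S2M uses a two-stage training strategy. In the first stage, the model is fine-tuned on remote sensing imagery to obtain robust domain-specific representations. In the second stage, a multimodal decoder is introduced and a bi-directional contrastive loss aligns the visual features with the structured text embeddings.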
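One common form of a bi-directional contrastive loss is the symmetric InfoNCE objective, sketched below in plain Python. The paper's exact loss is not given in this summary, so the cosine similarity, the `temperature` value of 0.07, and the equal weighting of the two directions are assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the target of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def bidirectional_contrastive_loss(visual, text, temperature=0.07):
    """Average of the visual-to-text and text-to-visual InfoNCE terms.

    visual[i] and text[i] are assumed to be a matched pair; every other
    combination in the batch acts as a negative.
    """
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in text]
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-visual direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))
```

The loss is small when each visual embedding is closest to its own text embedding and large when the pairs are mismatched, which is what pushes the two embedding spaces into alignment.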

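Key innovation: Masks are converted into structured text (where, what, how, how many), giving a direct mapping from visual labels to semantics and avoiding the noise and semantic loss that come with text generated by an external LLM.

Key design: Fixed templates turn each quadruple into natural-language descriptions, which keeps the text consistent. The bi-directional contrastive loss pushes the visual encoder to learn features that correlate closely with the text semantics, improving discrimination in complex change scenes.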
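Fixed-template rendering can be as simple as string formatting. The sketch below assumes a quadruple stored as a dict with the hypothetical keys `where`, `what`, `how`, and `how_many`; the template wording is invented for illustration and is not taken from the paper.

```python
# Hypothetical templates; the paper's actual wording is not given here.
TEMPLATES = (
    "{count} changed {noun} in the {where} part of the image.",
    "The land cover was {before} before and {after} after.",
    "The transition type is '{how}'.",
)


def render_descriptions(quad):
    """Turn one (where, what, how, how many) quadruple into fixed sentences."""
    count = quad["how_many"]
    before, after = quad["what"]
    fields = {
        "where": quad["where"],
        "before": before,
        "after": after,
        "how": quad["how"],
        "count": count,
        "noun": "region" if count == 1 else "regions",
    }
    return [template.format(**fields) for template in TEMPLATES]


example = {
    "where": "upper-left",
    "what": ("building", "rubble"),
    "how": "converted",
    "how_many": 2,
}
for sentence in render_descriptions(example):
    print(sentence)
```

Because the sentences come from a closed set of templates, identical quadruples always produce identical text, which is what keeps the supervision consistent.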

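📊 Experimental highlights

On the authors' own Gaza-Change-v2 multi-class change detection dataset, S2M reaches a Sek of 17.80% and an Fscd of 66.14%. The results show that the method outperforms conventional unimodal baselines and also notably surpasses multimodal methods that rely on text generated by large language models (LLMs), supporting the effectiveness of structured mask supervision.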

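🎯 Application scenarios

The method could be applied to urban expansion monitoring, post-disaster damage assessment, illegal construction detection, and dynamic management of environmental resources. By providing more precise semantic understanding, S2M could raise the level of automation in remote sensing image analysis and support government decision-making, emergency response, and geospatial data analysis.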

📄 Abstract (original)

Remote sensing change detection is pivotal for urban monitoring, disaster assessment, and environmental resource management. Yet, unimodal deep learning methods frequently confuse genuine semantic changes with visually similar but irrelevant variations. Recent multimodal approaches incorporate text as auxiliary supervision, but their descriptions are either semantically coarse and unstructured or model-generated and thus noisy. Critically, all of them overlook a simple fact: fine-grained change semantics are already implicitly encoded in the ground-truth mask labels that come standard with every change detection dataset. These masks know where the change happened, what the land-cover types were before and after, how the transition occurred, and how many objects were involved. In this paper, we propose S2M, a framework that obtains structured textual features directly from change labels at zero additional annotation cost. Specifically, each change region is automatically transcribed into a semantic quadruple (where, what, how, how many) and converted into several fixed-template text descriptions, providing precise, dense, and noise-free multimodal supervision. We adopt a two-stage training strategy: the model is first fine-tuned on remote sensing imagery to obtain robust domain-specific representations, after which a multimodal decoder with a bi-directional contrastive loss is introduced to achieve deep alignment between visual features and structured textual embeddings. To validate our method, we construct Gaza-Change-v2, a new multi-class change detection (MCD) dataset about the Gaza Strip. On this MCD dataset, S2M achieves a Sek of 17.80% and an Fscd of 66.14%, notably surpassing even multimodal methods that leverage large language models. Our work demonstrates that masks can indeed talk. They tell us exactly what, where, how, and how many changes have occurred.