DifferSeg: Towards Diverse Multimodal Binary Segmentation via Differential Perception and Frequency Guidance

📄 arXiv: 2606.08906v1 📥 PDF

Authors: Qiangqiang Zhou, Jiawei Xu, Yong Chen, Dandan Zhu, Yugen Yi, Xiaoqi Zhao

Categories: cs.CV

Published: 2026-06-08


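💡 One-Sentence Summary

Proposes DifferSeg to address the lack of adaptive fusion and efficient decoding in multimodal binary segmentation

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: multimodal segmentation, binary segmentation, differential perception, frequency guidance, feature fusion, deep learning, computer vision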

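📋 Key Points

  1. Existing multimodal binary segmentation methods lack an adaptive mechanism for handling modality discrepancies and complementarity, which leads to redundant fusion and weaker results.
  2. The paper proposes the DifferSeg framework, which adaptively aligns multimodal features with a differential perception fusion module and uses a frequency-guided decoder to balance high- and low-frequency representations.
  3. DifferSeg outperforms 67 state-of-the-art methods on 29 public datasets, showing stronger generalization and segmentation accuracy.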

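📝 Abstract (Translated)

In many binary segmentation tasks, existing multimodal methods rely on fixed feature concatenation for cross-modal interaction, and their decoders are usually dominated by low-frequency semantics. They therefore lack both an adaptive mechanism for handling modality discrepancies and complementarity and an efficient decoding strategy that balances high- and low-frequency representations. To address this, the paper proposes DifferSeg, a simple yet general multimodal binary segmentation framework. Its differential perception fusion (DPF) module adaptively aligns multimodal features, and its frequency-guided decoder (FGD) builds cross-frequency interactions that keep high-frequency details consistent with low-frequency semantics. Experiments show that DifferSeg surpasses 67 state-of-the-art methods on 29 public datasets, demonstrating superior generalization and segmentation accuracy.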

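🔬 Method Details

Problem definition: The paper targets two shortcomings of existing multimodal binary segmentation methods: weak handling of modality discrepancies and inefficient decoding. In particular, fixed feature concatenation causes redundant fusion, and decoders are dominated by low-frequency semantics.

Core idea: DifferSeg introduces a differential perception fusion (DPF) module and a frequency-guided decoder (FGD). Together they adaptively align multimodal features, strengthen their complementarity, and keep high- and low-frequency representations consistent.

Technical framework: The overall architecture has two main modules. The differential perception fusion module handles feature alignment and fusion, and the frequency-guided decoder handles the interaction between high- and low-frequency representations and the upsampling.

Key innovation: The most important contribution is the DPF module, which lets multimodal features align adaptively and reduces modality mismatch and redundancy. The FGD, in turn, combines fine-grained boundary recovery with noise suppression.

Key design: The DPF module fuses features with learnable differential operators and residual fusion. The FGD uses multi-path upsampling and cross-frequency interaction to keep high-frequency structures consistent with low-frequency semantics, which improves segmentation accuracy.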
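The idea of fusing two modalities through their difference can be pictured with the toy sketch below. It is not the paper's DPF module: the function name `differential_fusion`, the scalar `weight` and `bias` (stand-ins for learned parameters), and the sigmoid gate are all assumptions made for illustration, and real features would be tensors rather than short lists.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def differential_fusion(primary, auxiliary, weight=1.0, bias=0.0):
    """Toy difference-gated residual fusion of two feature vectors.

    `weight` and `bias` are hypothetical stand-ins for learnable
    parameters; they are not taken from the paper.
    """
    if len(primary) != len(auxiliary):
        raise ValueError("feature vectors must have the same length")

    fused = []
    for a, b in zip(primary, auxiliary):
        diff = b - a                         # element-wise modality difference
        gate = sigmoid(weight * diff + bias)  # how much of the difference to admit
        fused.append(a + gate * diff)        # residual update of the primary feature
    return fused


if __name__ == "__main__":
    rgb = [0.0, 4.0]
    depth = [2.0, 0.0]
    print(differential_fusion(rgb, depth, weight=0.0))
```

Where the two modalities already agree, the difference is zero and the primary feature passes through unchanged, so nothing redundant is added. With `weight=0.0` and `bias=0.0` the gate is exactly 0.5 and the result is the midpoint of the two inputs.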
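To make the high- and low-frequency split concrete, here is a minimal one-dimensional sketch. It assumes a box filter with replicated edges as the low-pass step and defines the high-frequency part as the residual; the helper names `low_pass`, `split_frequencies`, and `recombine`, and the `high_gain` parameter, are hypothetical and do not describe the actual FGD, which operates on feature maps and includes multi-path upsampling.

```python
def low_pass(signal, radius=1):
    """Box-filter a 1D signal, replicating the edge values as padding."""
    n = len(signal)
    smoothed = []
    for i in range(n):
        window = [
            signal[min(max(j, 0), n - 1)]
            for j in range(i - radius, i + radius + 1)
        ]
        smoothed.append(sum(window) / len(window))
    return smoothed


def split_frequencies(signal, radius=1):
    """Return (low, high), where high is what the low-pass filter removed."""
    low = low_pass(signal, radius)
    high = [x - l for x, l in zip(signal, low)]
    return low, high


def recombine(low, high, high_gain=1.0):
    """Merge the two bands; high_gain > 1 sharpens edges, 0 keeps only low."""
    return [l + high_gain * h for l, h in zip(low, high)]


if __name__ == "__main__":
    edge = [0.0, 0.0, 3.0, 3.0]  # a step edge, like an object boundary
    low, high = split_frequencies(edge)
    print(low)
    print(high)
    print(recombine(low, high, high_gain=2.0))
```

In this toy setting the low band carries the smooth region-level signal and the high band is nonzero only around the step, which is why a decoder that leans on low-frequency content alone tends to blur boundaries.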

🖼️ Key Figures

fig_0
fig_1
fig_2

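📊 Experimental Highlights

DifferSeg surpasses 67 state-of-the-art methods across 29 public datasets covering 18 downstream tasks, showing strong segmentation accuracy and generalization. The reported gains over existing methods reach up to several percentage points.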

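🎯 Application Scenarios

DifferSeg applies to a wide range of binary segmentation tasks, including both natural and medical images. Its adaptive fusion and efficient decoding strategy let it handle image data from different modalities and improve segmentation quality in practical settings.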

📄 Abstract (Original)

In many binary segmentation tasks, most multimodal methods rely on fixed feature concatenation for cross-modal interaction and straightforward decoder designs dominated by low-frequency semantics. However, they ignore two key challenges: one is the lack of an adaptive mechanism to handle modality discrepancies and complementarity, and the other is the absence of an efficient decoding strategy to balance both high- and low-frequency representations. In this work, we propose a simple yet general multimodal binary segmentation framework, termed DifferSeg, to address both problems simultaneously. With the help of the differential perception fusion (DPF) module, DifferSeg employs learnable differential operators to adaptively align multimodal features and enhance their complementarity through residual fusion, effectively mitigating modality mismatch and fusion redundancy. In addition, we design a frequency-guided decoder (FGD) that builds cross-frequency interactions and multi-path upsampling to maintain consistency between detailed high-frequency structures and semantic low-frequency representations, ensuring fine-grained boundary recovery and noise suppression. Benefiting from these designs, DifferSeg can be easily generalized to diverse binary segmentation tasks, including both natural and medical modalities. Without bells and whistles, it consistently surpasses 67 state-of-the-art methods across 29 public datasets involving 18 downstream tasks, demonstrating superior generalization and segmentation accuracy. Code and pretrained models will be available at the Link.