Causal-Tune: Mining Causal Factors from Vision Foundation Models for Domain Generalized Semantic Segmentation

Authors: Yin Zhang, Yongqiang Zhang, Yaoyue Zheng, Bogdan Raducanu, Dan Liu

Category: cs.CV

Published: 2025-12-18

Comments: Accepted by AAAI 2026

💡 One-Sentence Summary

Causal-Tune mines causal factors from vision foundation models for domain generalized semantic segmentation.

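🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: domain generalization, semantic segmentation, vision foundation models, causal inference, discrete cosine transform, feature disentanglement, frequency-domain analysis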

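📋 Key Points

Existing domain generalized semantic segmentation methods overlook the artifacts present in pre-trained vision foundation models, which interfere with valuable feature representations.
Causal-Tune analyzes the frequency spectrum of vision foundation model features and separates causal from non-causal factors, suppressing artifacts and improving generalization.
Experiments show that Causal-Tune performs well on cross-domain semantic segmentation tasks, especially under adverse weather conditions, where it gains +4.8% mIoU over the baseline in snow.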

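📝 Abstract (Translated from Chinese)

This paper proposes Causal-Tune, a causal fine-tuning method for domain generalized semantic segmentation (DGSS) that addresses the artifacts found in vision foundation models (VFMs). The authors observe that these artifacts are associated with non-causal factors residing in the low- and high-frequency components of the VFM spectrum, which hinder effective use of VFMs and degrade DGSS performance. Causal-Tune explicitly examines the causal and non-causal factors in VFM features and separates them, enabling more robust domain generalization. The method first extracts the frequency spectrum of each layer's features using the Discrete Cosine Transform (DCT), then applies a Gaussian band-pass filter to split the spectrum into causal and non-causal components. To further refine the causal components, it introduces a set of causal-aware learnable tokens that operate in the frequency domain, while the non-causal components are discarded. Finally, the refined features are transformed back into the spatial domain via inverse DCT and passed to the next layer. Extensive experiments on various cross-domain tasks demonstrate the effectiveness of Causal-Tune, particularly under adverse weather conditions: in snow, it improves mIoU by +4.8% over the baseline.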

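🔬 Method Details

Problem definition: Domain generalized semantic segmentation (DGSS) aims to keep segmentation performance high on target domains that were never seen during training. Existing methods typically fine-tune lightweight adapters or refine intermediate features, but they overlook the artifacts present in pre-trained vision foundation models (VFMs). These artifacts are associated with non-causal factors; they hinder effective use of VFMs and degrade DGSS performance.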

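Core idea: The paper identifies and separates the causal and non-causal factors in VFM features. The authors observe that the non-causal factors usually reside in the low- and high-frequency components of the VFM spectrum. Suppressing these components leaves more robust causal features, which improves the model's domain generalization.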
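The band-splitting idea can be written down compactly. The notation below is an illustrative assumption, not the paper's: $F$ is a layer's feature map, $r(u,v) \in [0,1]$ is a normalized radial frequency, and $\mu$, $\sigma$ are a hypothetical passband center and width. The source only states that a Gaussian band-pass filter is applied to the DCT spectrum, so the exact mask form is a guess.

```latex
\begin{aligned}
\hat{F} &= \mathrm{DCT}(F), \\
M(u,v) &= \exp\!\left(-\frac{\bigl(r(u,v)-\mu\bigr)^{2}}{2\sigma^{2}}\right), \\
\hat{F}_{c} &= M \odot \hat{F}, \qquad \hat{F}_{nc} = (1 - M) \odot \hat{F}, \\
F_{c} &= \mathrm{IDCT}\bigl(\hat{F}_{c}\bigr).
\end{aligned}
```

Under this reading, $\hat{F}_{c}$ keeps the mid-band (treated as causal) and $\hat{F}_{nc}$ collects the low- and high-frequency remainder (treated as non-causal), so that $\hat{F}_{c} + \hat{F}_{nc} = \hat{F}$.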

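Framework: Causal-Tune consists of the following main steps: 1) extract the frequency spectrum of each VFM layer's features using the Discrete Cosine Transform (DCT); 2) apply a Gaussian band-pass filter to split the spectrum into causal and non-causal components; 3) introduce causal-aware learnable tokens that refine the causal components in the frequency domain; 4) discard the non-causal components; 5) transform the refined features back into the spatial domain via inverse DCT and pass them to the next layer.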
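The transform, filter, and inverse-transform steps can be sketched in NumPy as follows. This is a minimal illustration under several assumptions that the source does not confirm: features are arranged as an H×W×C grid, the DCT is taken over the two spatial axes, and the mask is a Gaussian in normalized radial frequency. The names `dct_matrix`, `gaussian_bandpass`, `causal_filter`, `center`, and `sigma` are hypothetical, and the token refinement step is omitted here.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix C: C @ x is the DCT of x, C.T @ y inverts it."""
    k = np.arange(n)[:, None]            # frequency index
    i = np.arange(n)[None, :]            # spatial index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)              # rescale the DC row for orthonormality
    return c


def gaussian_bandpass(h: int, w: int, center: float, sigma: float) -> np.ndarray:
    """Assumed mask form: Gaussian in normalized radial frequency, peaked at `center`."""
    u = np.arange(h)[:, None] / max(h - 1, 1)
    v = np.arange(w)[None, :] / max(w - 1, 1)
    radius = np.sqrt(u**2 + v**2) / np.sqrt(2.0)   # 0 = DC, 1 = highest frequency
    return np.exp(-((radius - center) ** 2) / (2.0 * sigma**2))


def causal_filter(feat: np.ndarray, center: float = 0.5, sigma: float = 0.2):
    """feat: (H, W, C). Returns (causal, non_causal) parts in the spatial domain."""
    h, w, _ = feat.shape
    ch, cw = dct_matrix(h), dct_matrix(w)
    # 2D DCT over the spatial axes, applied independently to every channel.
    spec = np.einsum("uh,hwc,vw->uvc", ch, feat, cw)
    mask = gaussian_bandpass(h, w, center, sigma)[:, :, None]
    causal_spec = mask * spec                # mid-band, kept
    non_causal_spec = (1.0 - mask) * spec    # low and high bands, discarded in the method

    def inverse(s: np.ndarray) -> np.ndarray:
        # Inverse 2D DCT: the transpose of an orthonormal matrix is its inverse.
        return np.einsum("uh,uvc,vw->hwc", ch, s, cw)

    return inverse(causal_spec), inverse(non_causal_spec)


rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 4))        # toy stand-in for one layer's features
causal, non_causal = causal_filter(feat)
```

Because the two masks sum to one and the transform is linear, `causal + non_causal` reconstructs `feat`; only `causal` would be passed on to the next layer in this sketch.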

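Key innovation: Causal-Tune explicitly models and separates the causal and non-causal factors in VFM features. Rather than simply fine-tuning or refining features as existing methods do, it analyzes the feature spectrum and selectively suppresses the non-causal components, yielding feature representations that generalize better.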

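Key design choices: 1) DCT is used for spectral analysis; 2) a Gaussian band-pass filter separates causal from non-causal components, and its parameters need to be tuned for the specific task; 3) causal-aware learnable tokens operate in the frequency domain to refine the causal components, and the number and dimension of tokens are chosen experimentally; 4) inverse DCT maps the features back to the spatial domain.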
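To make the token count and dimension concrete, here is one plausible shape for the refinement step. The source does not say how the tokens interact with the spectrum, so the attention-style residual update below is purely an assumption, and `refine_with_tokens`, the grid size, and the token count of 8 are hypothetical. In practice the tokens would be trainable parameters in a deep learning framework; plain NumPy arrays stand in for them here.

```python
import numpy as np


def refine_with_tokens(causal_spec: np.ndarray, tokens: np.ndarray) -> np.ndarray:
    """causal_spec: (N, C) frequency coefficients; tokens: (M, C) learnable tokens.

    Assumed mechanism (not confirmed by the source): each coefficient vector
    attends over the tokens and receives a residual correction.
    """
    _, c = causal_spec.shape
    scores = causal_spec @ tokens.T / np.sqrt(c)       # (N, M) similarities
    scores -= scores.max(axis=1, keepdims=True)        # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)      # rows sum to 1
    delta = weights @ tokens                           # (N, C) token-driven correction
    return causal_spec + delta                         # residual update


rng = np.random.default_rng(0)
spec = rng.standard_normal((16 * 16, 64))   # flattened 16x16 frequency grid, 64 channels
tokens = np.zeros((8, 64))                  # M = 8 tokens of dimension C = 64
refined = refine_with_tokens(spec, tokens)
```

With zero-initialized tokens the correction is zero, so the layer starts as an identity map and only departs from the frozen backbone as the tokens are trained. That is a common choice for parameter-efficient fine-tuning, though whether Causal-Tune initializes this way is not stated.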

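📊 Experimental Highlights

Causal-Tune improves performance on multiple cross-domain semantic segmentation tasks. Under adverse weather, for example, it raises mIoU in snow scenes by +4.8% over the baseline. The results indicate that Causal-Tune extracts causal features and suppresses non-causal factors effectively, improving the model's domain generalization.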
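For readers unfamiliar with the metric, mIoU (mean intersection over union) averages the per-class IoU computed from a confusion matrix. The sketch below is a generic implementation, not the paper's evaluation code; the function name `mean_iou` and the `ignore_index` value of 255 are assumptions (255 is a common convention for unlabeled pixels in segmentation benchmarks).

```python
import numpy as np


def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int,
             ignore_index: int = 255) -> float:
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    valid = target != ignore_index
    # Encode each (target, pred) pair as one index into a flattened confusion matrix.
    idx = target[valid].astype(np.int64) * num_classes + pred[valid].astype(np.int64)
    conf = np.bincount(idx, minlength=num_classes**2).reshape(num_classes, num_classes)
    inter = np.diag(conf).astype(np.float64)             # true positives per class
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter  # predicted + actual - overlap
    present = union > 0                                  # skip classes absent from both
    return float((inter[present] / union[present]).mean())


target = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)   # class 0: 1/2, class 1: 2/3
```

Reported gains such as the +4.8% above are differences in this score between the method and the baseline on the same target-domain data.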

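🎯 Application Scenarios

Causal-Tune can be applied to semantic segmentation tasks that require domain generalization, such as autonomous driving, remote sensing image analysis, and medical image diagnosis. It can make models more robust across environments and conditions and reduce the reliance on large amounts of annotated data.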

📄 Abstract (Original)

Fine-tuning Vision Foundation Models (VFMs) with a small number of parameters has shown remarkable performance in Domain Generalized Semantic Segmentation (DGSS). Most existing works either train lightweight adapters or refine intermediate features to achieve better generalization on unseen domains. However, they both overlook the fact that long-term pre-trained VFMs often exhibit artifacts, which hinder the utilization of valuable representations and ultimately degrade DGSS performance. Inspired by causal mechanisms, we observe that these artifacts are associated with non-causal factors, which usually reside in the low- and high-frequency components of the VFM spectrum. In this paper, we explicitly examine the causal and non-causal factors of features within VFMs for DGSS, and propose a simple yet effective method to identify and disentangle them, enabling more robust domain generalization. Specifically, we propose Causal-Tune, a novel fine-tuning strategy designed to extract causal factors and suppress non-causal ones from the features of VFMs. First, we extract the frequency spectrum of features from each layer using the Discrete Cosine Transform (DCT). A Gaussian band-pass filter is then applied to separate the spectrum into causal and non-causal components. To further refine the causal components, we introduce a set of causal-aware learnable tokens that operate in the frequency domain, while the non-causal components are discarded. Finally, refined features are transformed back into the spatial domain via inverse DCT and passed to the next layer. Extensive experiments conducted on various cross-domain tasks demonstrate the effectiveness of Causal-Tune. In particular, our method achieves superior performance under adverse weather conditions, improving +4.8% mIoU over the baseline in snow conditions.
