Metis-SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start
Authors: Kun Chen, Peng Shi, Haibo Qiu, Zhixiong Zeng, Siqi Yang, Wenji Mao, Lin Ma
Categories: cs.LG, cs.AI, cs.CL, cs.CV
Published: 2025-10-29 (updated: 2026-01-30)
Note: Published as a conference paper at ICLR 2026
🔗 Code/Project: https://kwen-chen.github.io/SPECS-VL/
💡 One-line takeaway
Proposes the SPECS framework to address the cold-start problem in multimodal learning.
🎯 Matched areas: Pillar 2: RL Algorithms & Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models
Keywords: multimodal learning, self-distillation, preference training, cold start, reinforcement learning, model generalization, vision-language models
📋 Key points
- Existing supervised fine-tuning (SFT)-based cold-start methods intertwine the reasoning paradigm with the task solution, inducing instruction-style overfitting and weak out-of-distribution generalization.
- The paper introduces a Generalization Factor (GF) coefficient to quantify generalization under different cold-start methods; empirically, preference-based training (e.g. DPO) generalizes better than SFT.
- SPECS generates preference data pairs via self-distillation and applies preference-based training to improve generalization.
- Across multiple multimodal benchmarks, SPECS improves over strong baselines by 4.1% on MEGA-Bench and 12.2% on MathVista.
📝 Abstract (translated)
Recently, reinforcement learning (RL) with verifiable rewards has driven a wave of "MLLM-r1" approaches that bring RL to vision-language models. However, existing supervised fine-tuning (SFT)-based cold-start methods adopt a reasoning paradigm intertwined with the task solution, which can induce instruction-style overfitting and weaken out-of-distribution generalization. This paper revisits the cold start from two angles, training method and data construction, and proposes SPECS, a self-distilled, preference-based cold-start framework. SPECS generates introspective preference data pairs via self-distillation, performs preference-based training on them, and then hands off to RL for deep reasoning. Experimental results show that SPECS performs strongly across multiple multimodal benchmarks, delivering significant performance gains.
🔬 Method details
Problem definition: existing SFT-based cold-start methods intertwine reasoning with the task solution, which leads to overfitting and insufficient generalization.
Core idea: the SPECS framework generates preference data pairs via self-distillation, avoiding reliance on larger teacher models or manual annotation, and applies preference-based training focused on surface-form criteria (format, structure, style) to improve generalization.
Technical framework: SPECS comprises three stages: (1) generate introspective preference data pairs via self-distillation; (2) perform preference-based training on these pairs; (3) hand off to RL with verifiable rewards for deep reasoning.
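The first stage above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate` and `introspect` are hypothetical stand-ins for the model's sampling call and its self-critique/revision call, and the pair construction (revised output preferred over the initial draft) is an assumption consistent with the paper's description of introspective preference pairs.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # self-revised answer (preferred surface form)
    rejected: str  # initial draft

def build_self_distilled_pairs(
    generate: Callable[[str], str],
    introspect: Callable[[str, str], str],
    prompts: List[str],
) -> List[PreferencePair]:
    """Stage 1 sketch: the model critiques and revises its own drafts,
    yielding (chosen, rejected) pairs without an external teacher."""
    pairs = []
    for p in prompts:
        draft = generate(p)              # initial model output
        revised = introspect(p, draft)   # model's own revision of the draft
        pairs.append(PreferencePair(prompt=p, chosen=revised, rejected=draft))
    return pairs
```

The resulting pairs feed directly into stage 2, preference-based training.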
Key innovation: SPECS decouples multimodal learning, using self-distilled preference data pairs, which differs fundamentally from conventional SFT and markedly improves generalization.
Key design: the training procedure adopts a loss function tailored to preference optimization, together with adaptive parameter settings to enhance training stability and exploration.
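The digest does not specify the loss, but the abstract names DPO as the representative preference-based method, so a standard DPO objective is a reasonable sketch. The `beta` value and the log-probability inputs below are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss:
        -log sigmoid(beta * [(log pi_w - log ref_w) - (log pi_l - log ref_l)])
    where w = chosen and l = rejected completions."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()
```

When the policy and reference assign identical log-probabilities, the loss equals -log(0.5); it decreases as the policy favors the chosen output more than the reference does.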
📊 Experimental highlights
Across multiple multimodal benchmarks, SPECS delivers consistent gains: 4.1% on MEGA-Bench and 12.2% on MathVista. SPECS also reduces in-distribution "stuckness", improves exploration, and stabilizes training.
🎯 Applications
Potential application areas include intelligent assistants, automated content generation, and multimodal interaction systems. By improving generalization and reasoning performance, SPECS can deliver more accurate and flexible responses in practice, suggesting real practical value and future impact.
📄 Abstract (original)
Reinforcement learning (RL) with verifiable rewards has recently catalyzed a wave of "MLLM-r1" approaches that bring RL to vision language models. Most representative paradigms begin with a cold start, typically employing supervised fine-tuning (SFT), to initialize the policy before RL. However, SFT-based cold start adopts the reasoning paradigm intertwined with task solution and output format, which may induce instruction-style overfitting, weakens out-of-distribution generalization, and ultimately affects downstream RL. We revisit the cold start along two views, its training method and data construction, and introduce the Generalization Factor (GF) coefficient to quantify the generalization capability under different methods. Our empirical study finds that preference-based training methods (e.g. DPO) generalizes better than SFT-based methods in cold start. Motivated by this, we propose SPECS-a Self-distilled, Preference-based Cold Start framework that decouples multimodal learning: (1) generates introspective preference data pairs via self-distillation, avoiding reliance on larger teachers or manual annotation; (2) performs preference-based training to learn, focusing on shallow, transferable surface-form criteria (format, structure, style) rather than memorizing content; and (3) hands off to RL with verifiable rewards for deep reasoning results. Experimental results across multiple multimodal benchmarks show that our decoupling learning framework yields consistent performance gains over strong baselines, improving MEGA-Bench by 4.1% and MathVista by 12.2%. Additional experiments indicate that SPECS contributes to reducing in-distribution "stuckness," improving exploration, stabilizing training, and raising the performance ceiling. Project Page: https://kwen-chen.github.io/SPECS-VL/