ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation

📄 arXiv: 2606.05718v1

Authors: Kanghui Tian, Siyuan Liu, Ziang Yan, Sheng Xia, Shuai Dong, Yi Wang

Categories: cs.CV, cs.AI, cs.LG

Published: 2026-06-04

Comments: 25 pages, 11 figures. Preprint, under review


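💡 One-sentence summary

Proposes ViCuR to address the teacher-privilege problem in multimodal on-policy distillation.

🎯 Matched areas: Pillar 2: RL Algorithms & Architecture; Pillar 9: Embodied Foundation Models

Keywords: multimodal distillation, visual reasoning, teacher privilege, cue recovery, deep learning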

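📋 Key points

  1. Existing multimodal distillation methods rely on answer-side teacher privilege (reference answers or rationales). This creates a mismatch between training-time and test-time signals and weakens the student's reasoning.
  2. ViCuR replaces answer-side privilege with visual cues. Because the student can recover these cues from the input at inference, its reasoning becomes more visually grounded.
  3. Across seven benchmarks with Qwen3-VL-2B and 8B students, ViCuR improves overall average performance by +1.19 and +1.24 over answer-based on-policy self-distillation. With a stronger teacher, it surpasses OPD baselines by +0.64 and +1.08.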

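📝 Abstract (summary)

The authors propose ViCuR, a visually grounded privileged-teacher distillation framework that targets the train-test mismatch in multimodal reasoning. Existing methods rely on answer-side teacher privilege, so the teacher's supervision may depend on signals the student never sees at inference, which encourages shortcut imitation rather than visually grounded reasoning. ViCuR replaces answer-side privilege with visual cues derived from the input, so the student can recover them at inference. Experiments show consistent gains across seven benchmarks, indicating that the design of teacher privilege matters as much as teacher strength.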

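🔬 Method details

Problem definition: The paper addresses how teacher privilege should be designed for multimodal reasoning. Existing methods give the teacher answer-side privilege, such as reference answers or rationales. The student has no access to these signals at inference, which creates a mismatch between training and testing.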

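Core idea: ViCuR replaces answer-side privilege with visual cues, that is, query-related evidence in the input. These cues come from the same visual input that is available at inference, so the student can recover the evidence itself, which promotes visually grounded reasoning.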
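To make the on-policy distillation setup concrete, here is a minimal sketch of a per-token distillation loss on a trajectory sampled by the student. The summary does not state which divergence ViCuR uses; the reverse KL below, along with the names `log_softmax`, `reverse_kl`, `opd_loss`, and the toy logits, are assumptions for illustration only. The only thing the sketch takes from the source is the setup: the student generates the tokens, and a teacher that additionally sees the visual cue scores the same positions.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token position."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one position (assumed divergence, not from the paper)."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(log_p, log_q))

def opd_loss(student_steps, teacher_steps):
    """Mean per-token divergence over a trajectory sampled from the student."""
    assert len(student_steps) == len(teacher_steps)
    total = sum(reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps))
    return total / len(student_steps)

# Toy trajectory: 2 generated tokens over a 3-word vocabulary (made-up values).
student_steps = [[2.0, 0.5, 0.1], [0.3, 0.2, 1.5]]  # student sees image + question
teacher_steps = [[2.5, 0.1, 0.0], [0.1, 0.1, 2.0]]  # teacher also sees the visual cue
loss = opd_loss(student_steps, teacher_steps)
print(f"distillation loss: {loss:.4f}")
```

The loss is zero only when the student already matches the cue-conditioned teacher at every sampled position, which is the sense in which the cue's evidence has to be recoverable from the student's own input.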

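Framework: ViCuR adds a lightweight cue recovery module. During prefill, the module uses dedicated sink-token cross-attention to aggregate task-relevant visual evidence into an internal representation. It does not change the inference interface and does not require auxiliary cue-generation losses.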

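Key innovation: ViCuR brings visual cues into the distillation process as the teacher's privilege. This removes the limitation of answer-side privilege in earlier methods, because the evidence behind the teacher's supervision is also present in the visual input the student sees at inference.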

Key design: Dedicated sink-token cross-attention aggregates the visual information, so the student can make use of the visual cues without any additional loss term.
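As a rough illustration of how a sink token can pool visual evidence, the sketch below runs single-head scaled dot-product cross-attention from one sink query over a handful of visual tokens. The summary gives no architectural details (number of sink tokens, heads, projections, or where the output is injected), so everything here, including the name `sink_cross_attention` and the toy vectors, is a hypothetical simplification rather than the authors' module.

```python
import math

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sink_cross_attention(sink_query, visual_keys, visual_values):
    """One sink query attends over visual tokens and returns a pooled vector.

    Hypothetical single-head version with no learned projections.
    """
    d = len(sink_query)
    scores = [
        sum(q * k for q, k in zip(sink_query, key)) / math.sqrt(d)
        for key in visual_keys
    ]
    weights = softmax(scores)
    dim = len(visual_values[0])
    pooled = [
        sum(w * value[j] for w, value in zip(weights, visual_values))
        for j in range(dim)
    ]
    return pooled, weights

# Toy example: 3 visual tokens with 2-dimensional keys and values (made-up numbers).
sink_query = [4.0, 0.0]
visual_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
visual_values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
pooled, weights = sink_cross_attention(sink_query, visual_keys, visual_values)
print("attention weights:", [round(w, 3) for w in weights])
print("pooled evidence:", [round(x, 3) for x in pooled])
```

The visual token whose key aligns with the sink query dominates the pooled vector. In this simplified picture, that is how task-relevant evidence could be concentrated into one internal representation during prefill.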

🖼️ Key figures

fig_0
fig_1
fig_2

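📊 Experimental highlights

Across seven benchmarks with Qwen3-VL-2B and 8B students, ViCuR improves overall average performance over answer-based on-policy self-distillation by +1.19 and +1.24. In stronger-teacher distillation, it surpasses OPD baselines by +0.64 and +1.08, with consistent out-of-domain gains at the 8B scale.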

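🎯 Application scenarios

Potential applications include intelligent question-answering systems, visual reasoning tasks, and multimodal learning. By improving the student model's reasoning, ViCuR could raise accuracy and reliability in deployed systems, particularly where decisions depend on visual information.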

📄 Abstract (original)

On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that observes training-time-only signals such as reference answers or rationales. However, such answer-side privilege creates a train-test mismatch: the teacher's supervision may depend on signals unavailable to the student, encouraging shortcut imitation rather than visually grounded reasoning. We propose ViCuR, a visually grounded privileged-teacher distillation framework that replaces answer-side privilege with visual cues (query-related evidence in the input). Because these cues are derived from the same visual input available at inference, their evidence is recoverable by the student. To support this, ViCuR introduces a lightweight cue recovery module that uses dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence into an internal representation, without changing the inference interface or requiring auxiliary cue-generation losses. Across seven benchmarks with Qwen3-VL-2B and 8B students, ViCuR consistently improves over answer-based on-policy self-distillation by +1.19 and +1.24 on overall average performance. It also extends naturally to stronger-teacher OPD, surpassing OPD baselines by +0.64 and +1.08, with consistent out-of-domain gains at the 8B scale. These results show that, in multimodal on-policy distillation, the design of teacher privilege is as important as teacher strength.