ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation

Authors: Kanghui Tian, Siyuan Liu, Ziang Yan, Sheng Xia, Shuai Dong, Yi Wang

Categories: cs.CV, cs.AI, cs.LG

Published: 2026-06-04

Comments: 25 pages, 11 figures. Preprint, under review

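💡 One-Sentence Takeaway

Proposes ViCuR, which replaces answer-side teacher privilege with recoverable visual cues in multimodal on-policy distillation.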

🎯 Matched areas: Pillar 2: RL Algorithms & Architecture; Pillar 9: Embodied Foundation Models

Keywords: multimodal distillation, visual reasoning, teacher privilege, cue recovery, deep learning

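📋 Key Points

Existing multimodal distillation methods rely on answer-side teacher privilege (reference answers or rationales seen only during training). The resulting train-test signal mismatch weakens the student's reasoning.
ViCuR replaces answer-side privilege with visual cues drawn from the input, so the student can recover the same evidence at inference and reason in a visually grounded way.
Across seven benchmarks, ViCuR improves overall average performance over answer-based on-policy self-distillation by +1.19 and +1.24 with Qwen3-VL-2B and 8B students, and it also yields gains in stronger-teacher distillation.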

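📝 Abstract (translated from Chinese)

The authors propose ViCuR, a visually grounded privileged-teacher distillation framework that targets the train-test mismatch in multimodal reasoning. Existing methods rely on answer-side teacher privilege, so the student cannot access the same signals at inference, which encourages shortcut imitation rather than visually grounded reasoning. ViCuR replaces answer-side privilege with visual cues taken from the input, which the student can recover at inference. Experiments on seven benchmarks show consistent improvements, indicating that the design of teacher privilege matters as much as teacher strength.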

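🔬 Method Details

Problem definition: The paper addresses how teacher privilege should be designed in multimodal reasoning. Existing methods give the teacher answer-side privilege, so its supervision may depend on signals the student never sees at inference, creating a mismatch between training and testing.

Core idea: ViCuR replaces answer-side privilege with visual cues, that is, query-related evidence in the input. Because these cues come from the same visual input that is available at inference, the student can recover them, which promotes visually grounded reasoning.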
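The on-policy distillation setup described above can be illustrated with a minimal sketch. It assumes a per-token reverse KL divergence between student and teacher next-token distributions, which is a common choice for OPD but is not confirmed by the source as ViCuR's objective. The function names (`log_softmax`, `reverse_kl`, `opd_loss`) and the hard-coded logits are hypothetical; in a real setup the teacher logits would come from a model conditioned on the prompt plus the visual cues, scored on tokens the student sampled itself.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position (assumed objective)."""
    ls = log_softmax(student_logits)
    lt = log_softmax(teacher_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(ls, lt))

def opd_loss(student_steps, teacher_steps):
    """Mean per-token divergence along a trajectory sampled from the student.

    student_steps: student logits at each position (prompt-only context).
    teacher_steps: teacher logits at the same positions (hypothetically
                   conditioned on prompt + privileged visual cues).
    """
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)

# Toy trajectory with two token positions and a three-word vocabulary.
student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
teacher = [[0.0, 3.0, -1.0], [0.1, 0.2, 0.3]]
loss = opd_loss(student, teacher)
```

In this toy case only the first position contributes to the loss, since the two models already agree at the second one. The point of the privilege design is what the teacher conditions on: if its logits depend on evidence the student can recover from the image, the student can in principle close this gap at inference.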

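Technical framework: ViCuR adds a lightweight cue recovery module. The module uses dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence into an internal representation, without changing the inference interface or requiring auxiliary cue-generation losses.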
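The aggregation step can be pictured with a small sketch of cross-attention in which a few sink queries pool over visual token features. This is only an illustration under strong assumptions: single-head attention, no learned projections, and plain Python lists instead of tensors. The name `sink_cross_attention` and the toy vectors are hypothetical and do not reflect the paper's actual module.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sink_cross_attention(sink_queries, visual_keys, visual_values):
    """Each sink query attends over all visual tokens.

    Returns (pooled, weights): pooled[i] is the attention-weighted sum of
    visual values for sink query i, and weights[i] is its attention
    distribution over the visual tokens.
    """
    scale = 1.0 / math.sqrt(len(visual_keys[0]))
    value_dim = len(visual_values[0])
    pooled, weights = [], []
    for q in sink_queries:
        w = softmax([dot(q, k) * scale for k in visual_keys])
        out = [sum(wi * v[j] for wi, v in zip(w, visual_values))
               for j in range(value_dim)]
        pooled.append(out)
        weights.append(w)
    return pooled, weights

# Three visual tokens; the single sink query is aligned with the second key.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pooled, weights = sink_cross_attention([[0.0, 4.0]], keys, values)
```

Here the sink query puts most of its attention on the second visual token, so the pooled vector is dominated by that token's value. Because the pooled state is internal, nothing about the model's inputs or outputs has to change, which matches the description that the inference interface is left untouched.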

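Key innovation: ViCuR brings visual cues into the distillation process as the teacher's privilege. Unlike answer-side privilege, the evidence behind these cues is derived from the visual input the student also sees at inference, so the student can recover it.

Key design: ViCuR uses dedicated sink-token cross-attention to aggregate visual information, so the student can exploit the cues without any auxiliary cue-generation loss.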

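📊 Experimental Highlights

Across seven benchmarks with Qwen3-VL-2B and 8B students, ViCuR improves overall average performance over answer-based on-policy self-distillation by +1.19 and +1.24. In stronger-teacher distillation, ViCuR also surpasses the OPD baselines by +0.64 and +1.08, with consistent out-of-domain gains at the 8B scale.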

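🎯 Application Scenarios

Potential applications include question-answering systems, visual reasoning tasks, and multimodal learning more broadly. By improving the student model's reasoning, ViCuR could raise accuracy and reliability in deployed systems, especially where decisions must combine visual information with other inputs.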

📄 Abstract (Original)

On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that observes training-time-only signals such as reference answers or rationales. However, such answer-side privilege creates a train-test mismatch: the teacher's supervision may depend on signals unavailable to the student, encouraging shortcut imitation rather than visually grounded reasoning. We propose ViCuR, a visually grounded privileged-teacher distillation framework that replaces answer-side privilege with visual cues (query-related evidence in the input). Because these cues are derived from the same visual input available at inference, their evidence is recoverable by the student. To support this, ViCuR introduces a lightweight cue recovery module that uses dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence into an internal representation, without changing the inference interface or requiring auxiliary cue-generation losses. Across seven benchmarks with Qwen3-VL-2B and 8B students, ViCuR consistently improves over answer-based on-policy self-distillation by +1.19 and +1.24 on overall average performance. It also extends naturally to stronger-teacher OPD, surpassing OPD baselines by +0.64 and +1.08, with consistent out-of-domain gains at the 8B scale. These results show that, in multimodal on-policy distillation, the design of teacher privilege is as important as teacher strength.
