Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts
Authors: Haoyu Dong
Category: cs.AI
Published: 2026-06-09
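💡 One-Sentence Takeaway
Proposes Visual-SDPO, a visual-feedback self-distillation policy-optimization framework, to reduce visual defects in artifacts produced by code generation.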
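🎯 Matched Areas: Pillar 2: RL Algorithms & Architecture; Pillar 9: Embodied Foundation Models
Keywords: visual feedback, self-distillation, policy optimization, code generation, visual artifacts, multimodal learning, deep learning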
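📋 Key Points
- Existing code-generation models often produce visual output with defects such as overlapping elements and clipped text, which degrade the final artifact.
- The paper proposes Visual-SDPO, a framework that performs self-distillation from visual feedback, using the rendered result to improve code generation.
- On the ChartMimic, Design2Code, and AeSlides benchmarks, Visual-SDPO improves the primary metric by more than 10 absolute points over the zero-shot base model.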
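📝 Abstract (Summary)
Code-generating large language models (LLMs) increasingly produce visual artifacts such as charts, web pages, and slides by writing programs. These programs are executed by non-differentiable renderers, so code that runs without error can still yield artifacts with visually salient defects. The paper studies visual-feedback self-distillation for code-generated visual artifacts and proposes Visual-SDPO, a framework that treats rendered visual feedback as privileged context for a weight-sharing teacher and distills that feedback into a coding student. Visual-Grounded Code Credit Weighting strengthens the supervision signal on the code responsible for each defect. Experiments show that Visual-SDPO improves the primary metric by more than 10 absolute points over the zero-shot base across the evaluated benchmarks.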
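🔬 Method Details
Problem definition: The paper targets visual defects in artifacts produced by code-generating models. The model commits to code before observing the render, so it receives no visual feedback while generating, and otherwise executable code often renders with obvious defects.
Core idea: Visual-SDPO treats the rendered visual feedback as privileged context and uses self-distillation to feed that information back into code generation, improving the quality of the generated artifact.
Technical framework: Visual-SDPO uses a teacher-student structure in which the teacher shares weights with the student. The teacher sees the visual feedback and guides the student, and Visual-Grounded Code Credit Weighting strengthens supervision on defective code.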
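The teacher-student setup above can be sketched as a per-token distillation loss. This is a minimal illustration, not the paper's implementation: the source states only that the teacher shares weights with the student and sees rendered feedback as privileged context. The choice of KL(teacher || student) as the loss, the function names `token_kl` and `distillation_losses`, and the toy distributions are all assumptions made for this example.

```python
import math

def token_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) at one token position over a shared vocabulary.

    Assumption: the digest does not specify the distillation loss; forward KL
    is used here purely for illustration.
    """
    return sum(
        t * (math.log(t + eps) - math.log(s + eps))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0.0
    )

def distillation_losses(teacher_dists, student_dists):
    """Per-token losses for one rollout.

    Both arguments are lists of next-token distributions over the same tokens.
    The teacher's distributions are assumed to come from the same weights,
    conditioned additionally on the rendered feedback (privileged context).
    """
    if len(teacher_dists) != len(student_dists):
        raise ValueError("teacher and student must score the same tokens")
    return [token_kl(t, s) for t, s in zip(teacher_dists, student_dists)]

# Toy rollout: 3 tokens, 2-word vocabulary.
student = [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]]
# After seeing the render, the hypothetical teacher disagrees only at token 1.
teacher = [[0.5, 0.5], [0.2, 0.8], [0.5, 0.5]]
losses = distillation_losses(teacher, student)
```

In this toy case the loss is zero wherever the teacher agrees with the student and concentrates on the single token where the privileged context changed the teacher's prediction, which is the dense token-level signal the framework relies on.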
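Key innovation: Visual-Grounded Code Credit Weighting traces each detected defect back to the code statements responsible for the affected elements and amplifies the distillation signal on those statements. Supervision becomes spatially targeted rather than uniform, which is the main difference from existing methods.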
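One way to picture the credit weighting is as a per-statement multiplier applied to token-level distillation losses. The sketch below assumes the defect-to-statement mapping is already available (the source does not describe how defects are detected or traced), and the names `statement_weights`, `weighted_distill_loss`, and the `boost` factor of 2.0 are hypothetical choices for illustration only.

```python
def statement_weights(num_statements, defects, boost=2.0):
    """One weight per code statement: 1.0 by default, `boost` if a defect blames it.

    `defects` is a list of dicts with a "statements" key holding the indices of
    the statements responsible for that defect (hypothetical format).
    """
    weights = [1.0] * num_statements
    for defect in defects:
        for idx in defect["statements"]:
            weights[idx] = boost
    return weights

def weighted_distill_loss(token_losses, token_to_statement, weights):
    """Weighted mean of per-token losses, weighting each token by its statement."""
    total = sum(
        weights[token_to_statement[i]] * loss
        for i, loss in enumerate(token_losses)
    )
    norm = sum(weights[s] for s in token_to_statement)
    return total / norm

# Toy program with 3 statements and 2 tokens per statement.
token_to_statement = [0, 0, 1, 1, 2, 2]
token_losses = [0.1, 0.1, 0.9, 0.7, 0.1, 0.1]
# A hypothetical detector blames statement 1 for clipped text.
defects = [{"kind": "clipped_text", "statements": [1]}]

weights = statement_weights(3, defects)
loss = weighted_distill_loss(token_losses, token_to_statement, weights)
uniform = sum(token_losses) / len(token_losses)
```

With these numbers the weighted loss is higher than the uniform mean because the tokens of the blamed statement count twice, so gradient updates would focus on the code that produced the defect.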
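Key design: A sequence-level GRPO (Group Relative Policy Optimization) term rewards executable, visually high-quality rollouts and complements the dense token-level objective. Failed executions remain learnable through the self-distillation path, which passes execution errors to the teacher as privileged context.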
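The sequence-level term can be illustrated with the usual group-relative advantage from GRPO, where each rollout's reward is normalized by the mean and standard deviation of its sampling group. The reward function below (zero for a failed execution, otherwise a visual-quality score in [0, 1]) is an assumption for this sketch; the source says only that executable, visually high-quality rollouts are rewarded.

```python
from statistics import mean, pstdev

def rollout_reward(executed, visual_score):
    """Hypothetical reward: 0.0 if the code failed to run, else the visual score."""
    return visual_score if executed else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one prompt: (executed?, visual-quality score).
rollouts = [(True, 0.9), (True, 0.5), (False, 0.0), (True, 0.6)]
rewards = [rollout_reward(ok, score) for ok, score in rollouts]
advantages = group_relative_advantages(rewards)
```

The failed rollout receives a negative advantage from this term alone. In the framework described above, it would still contribute a token-level learning signal through the self-distillation path.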
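📊 Experimental Highlights
On the ChartMimic, Design2Code, and AeSlides benchmarks, Visual-SDPO improves the primary metric by more than 10 absolute points over the zero-shot base and by at least 2.4 points over GRPO, with fewer training steps and no added inference-time cost.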
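🎯 Application Scenarios
Potential applications include automated chart generation, web design, and slide creation, where the method could improve the visual quality of content produced by code-generation models.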
📄 Abstract (Original)
Code-generating large language models (LLMs) increasingly produce visual artifacts such as charts, web pages, and slides by writing programs that are executed by non-differentiable renderers, committing to code before observing the render. As a result, otherwise executable code often yields artifacts with visually salient defects, including overlapping elements, clipped text, broken alignment, low contrast, and overflow. We study visual-feedback self-distillation for code-generated visual artifacts. We propose Visual-SDPO, a self-distillation policy-optimization framework that treats rendered visual feedback as privileged context for a weight-sharing teacher and distills this feedback into a coding student. To make supervision spatially targeted rather than uniform, we introduce Visual-Grounded Code Credit Weighting, which traces each detected defect back to the code statements responsible for the affected elements and amplifies the distillation signal on those statements. A sequence-level GRPO (Group Relative Policy Optimization) term complements the dense token-level objective by rewarding executable, visually high-quality rollouts, while failed executions remain learnable through the self-distillation path by passing execution errors as privileged context to the teacher. We instantiate Visual-SDPO for chart, web/UI, and slide generation with a unified Qwen3-VL-8B-Instruct backbone. Across chart-to-code, UI-to-code, and slide-generation benchmarks (ChartMimic, Design2Code, and AeSlides), Visual-SDPO improves over the zero-shot base by more than 10 absolute points in the primary metric and over GRPO by at least 2.4 points, with fewer training steps and no added inference-time cost.