Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts

Author: Haoyu Dong

Category: cs.AI

Published: 2026-06-09

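💡 One-Sentence Takeaway

Proposes Visual-SDPO, a visual-feedback self-distillation policy-optimization framework, to reduce visual defects in code-generated artifacts.

🎯 Matched Areas: Pillar 2: RL Algorithms & Architecture; Pillar 9: Embodied Foundation Models

Keywords: visual feedback, self-distillation, policy optimization, code generation, visual artifacts, multimodal learning, deep learning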

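📋 Key Points

Existing code-generation models often produce visual defects such as overlapping elements and clipped text when they generate visual content, which degrades the final result.
This paper proposes Visual-SDPO, a framework that self-distills from visual feedback and uses the rendered output to improve code generation.
On ChartMimic, Design2Code, and AeSlides, Visual-SDPO improves over the zero-shot base by more than 10 absolute points in the primary metric.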

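📝 Abstract (translated from Chinese)

Code-generating large language models (LLMs) increasingly produce visual artifacts such as charts, web pages, and slides by writing programs. These programs run in non-differentiable renderers, and the model commits to code before observing the render, so otherwise executable code often yields visually salient defects. This paper studies visual-feedback self-distillation for code-generated visual artifacts and proposes Visual-SDPO, which treats rendered visual feedback as privileged context for a weight-sharing teacher and distills it into a coding student. Visual-Grounded Code Credit Weighting strengthens the supervision signal on the code statements responsible for defects. Experiments show that Visual-SDPO improves generation quality on ChartMimic, Design2Code, and AeSlides.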

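🔬 Method Details

Problem definition: The paper targets visual defects in artifacts produced by code-generation models. The model commits to code before observing the render, so no visual feedback reaches it at generation time and the output can contain obvious defects.

Core idea: Visual-SDPO treats the rendered visual feedback as privileged context and uses a self-distillation mechanism to improve code generation and the quality of the resulting artifacts.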

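Technical framework: Visual-SDPO uses a teacher-student structure in which teacher and student share weights. The teacher receives the visual feedback as privileged context and guides the student, and Visual-Grounded Code Credit Weighting strengthens the supervision on defective code.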
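
As a rough illustration of the teacher-student distillation described above, the sketch below computes a weighted per-token divergence between a teacher distribution (which would be conditioned on the rendered feedback) and a student distribution. The source does not state the loss form, so the KL direction, the weighted-mean normalization, and the names `softmax`, `token_kl`, and `distillation_loss` are all assumptions made for illustration, not the paper's implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at one token position (direction is an assumption)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_seq, student_seq, weights=None):
    """Weighted mean of per-token KL over a sequence.

    teacher_seq / student_seq: one list of logits per token position.
    weights: optional per-token weights; uniform when omitted.
    """
    if weights is None:
        weights = [1.0] * len(teacher_seq)
    kls = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(w * k for w, k in zip(weights, kls)) / sum(weights)

# Hypothetical logits: in a weight-sharing setup both would come from the same
# model, with the teacher additionally seeing the feedback in its context.
teacher_seq = [[2.0, 0.5, 0.1], [0.1, 3.0, 0.2]]
student_seq = [[1.0, 1.0, 0.1], [0.1, 3.0, 0.2]]
loss = distillation_loss(teacher_seq, student_seq)
```

Only the first position contributes to `loss` here, because the two distributions agree exactly at the second position.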

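Key innovation: Visual-Grounded Code Credit Weighting traces each detected defect back to the code statements responsible for the affected elements and amplifies the distillation signal on those statements, so supervision is targeted rather than uniform. This is presented as the essential difference from existing methods.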
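
One way to picture the credit weighting is as a per-token weight vector that is larger on tokens belonging to statements blamed for a defect. The sketch below is a minimal version under stated assumptions: the mapping from rendered elements to source lines (`element_to_lines`), the constant `boost` factor, and the function name `credit_weights` are hypothetical, and the source does not describe how the mapping or the amplification is actually computed.

```python
def credit_weights(token_lines, defect_elements, element_to_lines, boost=2.0):
    """Return one weight per token.

    token_lines: source line number of each generated token.
    defect_elements: names of rendered elements flagged as defective.
    element_to_lines: hypothetical map from element name to the code lines
        that create or style it.
    boost: assumed constant amplification for blamed lines.
    """
    blamed = set()
    for element in defect_elements:
        blamed.update(element_to_lines.get(element, ()))
    return [boost if line in blamed else 1.0 for line in token_lines]

# Hypothetical example: tokens 0-1 sit on line 1, tokens 2-4 on line 2,
# token 5 on line 3; a detector flags the "title" element (e.g. clipped text).
token_lines = [1, 1, 2, 2, 2, 3]
element_to_lines = {"title": [2], "legend": [3]}
weights = credit_weights(token_lines, ["title"], element_to_lines)
print(weights)  # [1.0, 1.0, 2.0, 2.0, 2.0, 1.0]
```

Weights like these could then scale a token-level distillation loss so that the statements tied to the defect receive a stronger signal than the rest of the program.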

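Key design: A sequence-level GRPO (Group Relative Policy Optimization) term rewards executable, visually high-quality rollouts, while failed executions remain learnable through the self-distillation path because their execution errors are passed to the teacher as privileged context.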
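
The sequence-level term can be illustrated with the usual group-relative advantage used in GRPO: each rollout's reward is normalized against the mean and standard deviation of its sampling group. The sketch below assumes a reward of zero for rollouts that fail to execute and a scalar visual score otherwise; that reward shape, the use of the population standard deviation, and the names `rollout_reward` and `group_relative_advantages` are illustrative assumptions rather than details taken from the source.

```python
from statistics import mean, pstdev

def rollout_reward(executed, visual_score):
    """Assumed reward: the visual score if the code ran, else 0.0."""
    return visual_score if executed else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and std deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of four rollouts for one prompt; the third one crashed.
rewards = [
    rollout_reward(True, 0.9),
    rollout_reward(True, 0.5),
    rollout_reward(False, 0.0),
    rollout_reward(True, 0.6),
]
advantages = group_relative_advantages(rewards)
```

In this sketch the crashed rollout receives the lowest advantage, so the sequence-level term by itself only pushes away from it; per the text above, such rollouts still contribute a learning signal through the self-distillation path.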

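📊 Experimental Highlights

On ChartMimic, Design2Code, and AeSlides, Visual-SDPO improves over the zero-shot base by more than 10 absolute points in the primary metric and over GRPO by at least 2.4 points, with fewer training steps and no added inference-time cost.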

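🎯 Application Scenarios

Potential applications include automated chart generation, web design, and slide production, where the method could improve how well code-generation models produce visual content.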

📄 Abstract (Original)

Code-generating large language models (LLMs) increasingly produce visual artifacts such as charts, web pages, and slides by writing programs that are executed by non-differentiable renderers, committing to code before observing the render. As a result, otherwise executable code often yields artifacts with visually salient defects, including overlapping elements, clipped text, broken alignment, low contrast, and overflow. We study visual-feedback self-distillation for code-generated visual artifacts. We propose Visual-SDPO, a self-distillation policy-optimization framework that treats rendered visual feedback as privileged context for a weight-sharing teacher and distills this feedback into a coding student. To make supervision spatially targeted rather than uniform, we introduce Visual-Grounded Code Credit Weighting, which traces each detected defect back to the code statements responsible for the affected elements and amplifies the distillation signal on those statements. A sequence-level GRPO (Group Relative Policy Optimization) term complements the dense token-level objective by rewarding executable, visually high-quality rollouts, while failed executions remain learnable through the self-distillation path by passing execution errors as privileged context to the teacher. We instantiate Visual-SDPO for chart, web/UI, and slide generation with a unified Qwen3-VL-8B-Instruct backbone. Across chart-to-code, UI-to-code, and slide-generation benchmarks (ChartMimic, Design2Code, and AeSlides), Visual-SDPO improves over the zero-shot base by more than 10 absolute points in the primary metric and over GRPO by at least 2.4 points, with fewer training steps and no added inference-time cost.
