Reward-Weighted On-Policy Distillation with an Open Property-Equivalence Verifier for NL-to-SVA Generation

作者: Qingyun Zou, Yingze Li, Tianen Liu, Bingsheng He, Weng-Fai Wong

分类: cs.AR, cs.LG

发布日期: 2026-05-13

💡 一句话要点

提出奖励加权On-Policy蒸馏方法，提升NL到SVA生成的属性等价性

🎯 匹配领域: 支柱二：RL算法与架构 (RL & Architecture)

关键词: NL-to-SVA生成 属性等价性验证 奖励加权蒸馏 On-Policy学习 SystemVerilog Assertion

📋 核心要点

现有NL到SVA生成模型在整体准确率较高，但在处理时序规范时，容易退化为少数模板，无法保证属性等价性。
提出奖励加权On-Policy蒸馏(RWOPD)方法，利用属性等价性检查器对模型生成结果进行奖励，引导模型学习属性等价的SVA。
实验结果表明，RWOPD方法显著提升了NL到SVA生成的性能，在多个数据集上超越了现有SOTA模型和大型通用语言模型。

📝 摘要（中文）

本文指出，基于LLM的SystemVerilog Assertion (SVA)生成看似接近饱和，但实际上在时序上存在差距，模型在有界延迟和活性规范上会退化为少数模板。核心问题是，监督微调优化的是token级别的模仿，而非SVA正确性的关键——属性等价性。为此，本文提出奖励加权On-Policy蒸馏(RWOPD)，该方法采样student模型的rollout，使用SymbiYosys+Z3属性等价性检查器(PEC)进行评分，并对通过验证的rollout应用来自冻结的14B teacher模型的验证器奖励加权前向KL梯度。这保持了每个响应token的密集监督，同时将选择和损失权重都基于属性等价行为。RWOPD将CodeV-SVA-14B蒸馏成Qwen2.5-Coder-7B-Instruct student，在NL2SVA-Human和NL2SVA-Machine上，pass@1、pass@5和pass@10指标均达到了新的state-of-the-art，超越了之前的专门模型和671B通用模型。

🔬 方法详解

问题定义：论文旨在解决自然语言到SystemVerilog Assertion (NL-to-SVA) 生成任务中，现有方法过度依赖token级别模仿，忽略SVA的核心属性——属性等价性的问题。现有方法，如监督微调，容易使模型记住训练数据中的模式，而无法生成真正满足规范要求的SVA，尤其是在处理复杂的时序逻辑时，模型容易退化为少数几个模板，导致生成的SVA无法通过验证。

核心思路：论文的核心思路是利用属性等价性检查器(Property-Equivalence Checker, PEC)作为奖励信号，指导模型生成满足属性等价性的SVA。通过On-Policy蒸馏，让student模型学习teacher模型的行为，但同时根据PEC的验证结果对student模型的生成进行奖励加权，从而鼓励模型生成属性等价的SVA。这样既能利用teacher模型的知识，又能保证生成的SVA的正确性。

技术框架：RWOPD的技术框架主要包含以下几个模块：1) Teacher模型：一个预训练好的大型语言模型，用于提供初始的生成策略。2) Student模型：需要训练的小型语言模型，用于生成SVA。3) Property-Equivalence Checker (PEC)：使用SymbiYosys+Z3实现的属性等价性检查器，用于验证生成的SVA是否满足给定的自然语言规范。4) Reward加权模块：根据PEC的验证结果，对student模型的生成进行奖励加权，用于调整损失函数。整个流程是：首先，student模型根据当前策略生成SVA；然后，PEC验证生成的SVA是否与自然语言规范等价；接着，根据验证结果计算奖励；最后，使用奖励加权的前向KL散度损失函数更新student模型。

关键创新：论文最重要的技术创新点在于提出了奖励加权On-Policy蒸馏(RWOPD)方法，将属性等价性检查器引入到NL-to-SVA生成模型的训练过程中。与传统的监督微调方法相比，RWOPD方法能够直接优化SVA的属性等价性，从而生成更可靠的SVA。此外，RWOPD方法还采用了On-Policy蒸馏，能够更好地利用teacher模型的知识，同时避免了Off-Policy蒸馏中的分布偏移问题。

关键设计：RWOPD的关键设计包括：1) 使用SymbiYosys+Z3作为属性等价性检查器，保证了验证的准确性。2) 使用前向KL散度作为损失函数，保证了student模型能够有效地学习teacher模型的知识。3) 根据PEC的验证结果，对损失函数进行奖励加权，鼓励模型生成属性等价的SVA。4) Teacher模型使用CodeV-SVA-14B，Student模型使用Qwen2.5-Coder-7B-Instruct。具体训练时，对通过PEC验证的rollout，给予更高的权重，反之则降低权重。

🖼️ 关键图片

📊 实验亮点

RWOPD方法在NL2SVA-Human和NL2SVA-Machine数据集上取得了state-of-the-art的结果，pass@1、pass@5和pass@10指标均超越了之前的专门模型和671B通用模型。具体来说，RWOPD方法将CodeV-SVA-14B蒸馏成Qwen2.5-Coder-7B-Instruct student，在性能上取得了显著提升。

🎯 应用场景

该研究成果可应用于集成电路设计验证领域，通过自动生成SystemVerilog Assertion (SVA)，可以显著提高验证效率，降低验证成本。此外，该方法还可以推广到其他代码生成任务中，例如生成数据库查询语句、API调用序列等，具有广泛的应用前景。

📄 摘要（原文）

LLM-based generation of SystemVerilog Assertions (SVA) is often reported as nearing saturation, with the strongest specialized model reaching ${\sim}76\%$ accuracy on NL2SVA-Human. We show that this aggregate hides a temporal gap: models that appear strong overall still collapse to a few implication templates on bounded-delay and liveness specifications. The core issue is that the dominant recipe, supervised fine-tuning on NL/SVA pairs, optimizes token-level mimicry rather than the \emph{property equivalence} that defines SVA correctness. We introduce \emph{Reward-Weighted On-Policy Distillation} (RWOPD), an on-policy distillation method that samples student rollouts, scores them with an open SymbiYosys+Z3 Property-Equivalence Checker (PEC), and applies a verifier-reward-weighted forward-KL gradient from a frozen 14B teacher on verifier-passable rollouts. This keeps the supervision dense at every response token while grounding both selection and loss weight in property-equivalent behavior. RWOPD distills CodeV-SVA-14B into a Qwen2.5-Coder-7B-Instruct student that sets a new state of the art on NL2SVA-Human and NL2SVA-Machine across pass@1, pass@5, and pass@10, surpassing both specialized prior SOTA models and 671B general-purpose baselines.

Reward-Weighted On-Policy Distillation with an Open Property-Equivalence Verifier for NL-to-SVA Generation

💡 一句话要点

📋 核心要点

📝 摘要（中文）

🔬 方法详解

🖼️ 关键图片

📊 实验亮点

🎯 应用场景

📄 摘要（原文）

⭐ 我的收藏

📁 新建收藏夹

⚙️ 管理收藏夹

🔍 搜索论文

🔐 登录 / 注册

👤 用户管理