ComplexConstraints and Beyond: Expert Rubrics for RLVR
Authors: Sushant Mehta, Liudas Panavas, Edwin Chen
Category: cs.AI
Published: 2026-06-08
Comments: Accepted to the GEM workshop at ACL 2026: https://gem-workshop.com/
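💡 One-sentence takeaway
Proposes expert-curated rubrics to make evaluation for RLVR more effective.
🎯 Matched area: Pillar 9: Embodied Foundation Models
Keywords: expert rubrics, complex instruction following, agentic tasks, evaluation methods, large language models, machine learning, training signals, dataset construction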
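📋 Key points
- Existing evaluation methods rely mainly on programmatic verification of surface-level constraints and cannot adequately assess complex, context-dependent behaviors.
- The paper proposes an evaluation approach based on expert-curated rubrics, built on design principles such as Maximum Viable Atomicity and intent-aware criterion design.
- Training on the ComplexConstraints dataset improves instruction following by 15.5% for a 4B-parameter model and by 12.2% for a 235B-parameter model, supporting the value of expert rubrics as a training signal.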
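📝 Summary (translated from Chinese)
Large language model (LLM) capabilities are advancing rapidly, but evaluation methods have not kept pace. Traditional benchmarks rely on programmatic verification of narrow, surface-level constraints, whereas real-world instruction following and agentic tasks require assessing complex, context-dependent behaviors. The paper analyzes expert-curated rubric-based evaluation as an alternative paradigm, drawing on empirical evidence from two domains: complex instruction following and enterprise agentic tasks. It sets out five design principles for constructing high-quality rubrics and introduces the ComplexConstraints dataset, which it uses to show that expert rubrics are effective for both evaluation and training and that training on them substantially improves instruction-following performance.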
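🔬 Method details
Problem definition: The paper addresses the inability of existing evaluation methods to assess complex instruction following and agentic tasks. Traditional methods lean too heavily on surface-level constraints and do not account for context.
Core idea: The paper proposes rubric-based evaluation with expert-curated rubrics, arguing that well-designed rubrics improve both the accuracy of evaluation and the results of training.
Technical framework: The overall approach consists of articulating five design principles, constructing the ComplexConstraints dataset, and then training and evaluating models on that basis. The main components are rubric design, dataset construction, and model training.
Key innovation: The central contribution is using expert-curated rubrics as both an evaluation instrument and a training signal. Compared with traditional programmatic checks, rubrics capture nuanced behaviors better, and training on them yields measurable performance gains.
Key design: Rubric design follows principles such as Maximum Viable Atomicity and intent-aware criterion design, which keep the criteria valid and applicable. Training uses a feedback mechanism based on rubric scores.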
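To make the idea of a rubric-score feedback signal concrete, here is a minimal Python sketch. It assumes the reward is simply the fraction of atomic criteria a response satisfies; the summary does not state how the paper aggregates criteria, so this aggregation rule is an assumption. The names `Criterion` and `rubric_reward` are hypothetical, and the simple string checks stand in for what would be an LLM judge in practice.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One atomic rubric criterion (hypothetical structure)."""
    description: str
    # Stand-in for an LLM judge: returns True if the response meets the criterion.
    check: Callable[[str], bool]


def rubric_reward(response: str, rubric: List[Criterion]) -> float:
    """Return the fraction of criteria satisfied (assumed aggregation rule)."""
    if not rubric:
        raise ValueError("rubric must contain at least one criterion")
    passed = sum(1 for criterion in rubric if criterion.check(response))
    return passed / len(rubric)


# Illustrative rubric: each criterion tests exactly one property of the response.
rubric = [
    Criterion("Mentions the refund policy", lambda r: "refund" in r.lower()),
    Criterion("Stays under 50 words", lambda r: len(r.split()) <= 50),
    Criterion("Uses no exclamation marks", lambda r: "!" not in r),
    Criterion("Ends with a question", lambda r: r.strip().endswith("?")),
]

response = (
    "Our refund policy allows returns within 30 days. "
    "Would you like a return label?"
)
reward = rubric_reward(response, rubric)
print(f"reward = {reward:.2f}")
```

Keeping each criterion atomic means a failed check points at a single property of the response, which is the intuition behind splitting a prompt's requirements into many small criteria rather than one holistic score.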
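📊 Experimental highlights
Training on approximately 1,000 ComplexConstraints examples improves instruction following by 15.5% for a 4B-parameter model and by 12.2% for a 235B-parameter model. In addition, single-epoch RL training on a rubric-graded enterprise environment produces gains on out-of-distribution benchmarks the model was never trained on (+4.5% BFCL, +7.4% Tau2-Bench, +6.8% Tool-Decathlon), indicating that the improvements transfer.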
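🎯 Application scenarios
Potential applications include the evaluation and training of large language models, the development of AI assistants, and the automation of complex tasks. Expert-curated rubrics could help improve how models perform in real-world use.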
📄 Abstract (original)
As LLM capabilities advance rapidly, the evaluation methods used to assess them increasingly lag behind. Traditional benchmarks relied on programmatic verification of narrow, surface-level constraints, but real-world instruction following and agentic tasks demand assessment of nuanced, context-dependent behaviors that resist simple scripted checks. We present a systematic analysis of expert-curated rubric-based evaluation as an alternative paradigm, drawing on empirical evidence from two domains: complex instruction following and enterprise agentic tasks. We first articulate five design principles for constructing high-quality rubrics, including Maximum Viable Atomicity, intent-aware criterion design, and iterative LLM-judge calibration. To validate these principles, we introduce ComplexConstraints, a new expert-curated instruction-following dataset in which each prompt is paired with 10-40 atomic rubric criteria. We demonstrate that these expert rubrics are not only better evaluation instruments but also highly effective training signals: training on approximately 1,000 ComplexConstraints examples yields +15.5% improvement for a 4B-parameter model and +12.2% for a 235B-parameter model on instruction following, while single-epoch RL training on a rubric-graded enterprise environment produces gains that transfer to out-of-distribution benchmarks the model was never trained on (+4.5% BFCL, +7.4% Tau2-Bench, +6.8% Tool-Decathlon). Our findings establish that expert-authored rubrics improve both the measurement and the development of frontier LLM capabilities, serving as effective evaluation and RL training signals.
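The abstract names iterative LLM-judge calibration as a design principle but does not describe the procedure. One plausible reading, sketched below under that assumption, is to compare judge verdicts with expert labels per criterion and flag criteria whose agreement falls below a threshold so their wording can be revised. The function names, the 0.9 threshold, and the sample data are all hypothetical.

```python
from typing import Dict, List


def agreement_rate(judge: List[bool], expert: List[bool]) -> float:
    """Fraction of responses on which judge and expert verdicts match."""
    if not judge or len(judge) != len(expert):
        raise ValueError("verdict lists must be non-empty and equal in length")
    return sum(j == e for j, e in zip(judge, expert)) / len(judge)


def criteria_to_revise(
    judge_verdicts: Dict[str, List[bool]],
    expert_labels: Dict[str, List[bool]],
    threshold: float = 0.9,  # assumed cut-off, not taken from the paper
) -> List[str]:
    """Return criteria whose judge/expert agreement is below the threshold."""
    return [
        name
        for name, verdicts in judge_verdicts.items()
        if agreement_rate(verdicts, expert_labels[name]) < threshold
    ]


# Made-up verdicts for two criteria over four responses.
judge_verdicts = {
    "cites_source": [True, True, False, True],
    "polite_tone": [True, False, False, True],
}
expert_labels = {
    "cites_source": [True, True, False, True],
    "polite_tone": [True, True, True, True],
}

flagged = criteria_to_revise(judge_verdicts, expert_labels)
print(flagged)
```

In an iterative loop, the flagged criteria would be reworded and the comparison rerun until agreement is acceptable.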