WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models

📄 arXiv: 2604.18224v1

Authors: Xinping Lei, Xinyu Che, Junqi Xiong, Chenchen Zhang, Yukai Huang, Chenyu Zhou, Haoyang Huang, Minghao Liu, Letian Zhu, Hongyi Ye, Jinhua Hao, Ken Deng, Zizheng Zhan, Han Li, Dailin Li, Yifan Yao, Ming Sun, Zhaoxiang Zhang, Jiaheng Liu

Categories: cs.SE, cs.AI

Published: 2026-04-20


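💡 One-Sentence Takeaway

Introduces WebCompass, a multimodal benchmark that addresses the limits of existing web coding evaluations, which mostly rely on text-conditioned generation and static-correctness metrics.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: multimodal evaluation, web coding, interactive agents, generative models, code repair, closed-source models, framework choice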

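📋 Key Points

  1. Existing coding evaluations focus on text-conditioned generation and static correctness, so they do not fully measure the visual and interactive quality of the resulting code.
  2. WebCompass combines multiple input modalities with multiple task types to mirror the iterative cycle of real-world web coding, giving a more complete evaluation framework.
  3. Experiments show that closed-source models remain substantially stronger than open-source models, and that editing and repair have distinct difficulty profiles: repair preserves interactivity better but remains challenging to execute.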

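📝 Abstract (Summary)

Large language models are rapidly evolving into interactive coding agents capable of end-to-end web coding, but existing benchmarks evaluate only a narrow slice of this capability. They typically rely on text-conditioned generation and static-correctness metrics, and leave visual fidelity, interaction quality, and codebase-level reasoning largely unmeasured. The authors propose WebCompass, a multimodal benchmark that provides unified lifecycle evaluation of web engineering capability. WebCompass covers three input modalities (text, image, video) and three task types (generation, editing, repair), yielding seven task categories that mirror professional workflows. Through a multi-stage, human-in-the-loop pipeline, the authors curate instances covering 15 generation domains, 16 editing operation types, and 11 repair defect types, each annotated as Easy, Medium, or Hard.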

🔬 Method Details

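Problem definition: The paper targets the limits of existing coding evaluations, in particular their failure to measure visual fidelity, interaction quality, and codebase-level reasoning in web coding. Existing methods usually look only at text-conditioned generation and static correctness, which makes their results one-sided.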

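Core idea: WebCompass combines multiple input modalities (text, image, video) with multiple task types (generation, editing, repair) to mirror the iterative cycle of real-world web coding and provide a more complete evaluation framework. This design keeps the evaluation close to actual development workflows.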
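The taxonomy described above can be sketched as a small data model. This is only an illustrative sketch: the class names, field names, and the `category_of` helper are assumptions made here, not the benchmark's actual schema. The source states that the modality and task-type axes yield seven task categories but does not list which pairings are used, so the sketch does not enumerate them.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    VIDEO = "video"


class TaskType(Enum):
    GENERATION = "generation"
    EDITING = "editing"
    REPAIR = "repair"


class Difficulty(Enum):
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"


@dataclass(frozen=True)
class BenchmarkInstance:
    """One hypothetical benchmark record; all field names are illustrative."""
    instance_id: str
    modality: Modality
    task_type: TaskType
    difficulty: Difficulty
    # A generation domain, editing operation type, or repair defect type.
    subtype: str


def category_of(instance):
    """Return the (modality, task type) pair that identifies a task category."""
    return (instance.modality, instance.task_type)


# Placeholder record, not taken from the benchmark.
example = BenchmarkInstance(
    instance_id="demo-001",
    modality=Modality.IMAGE,
    task_type=TaskType.EDITING,
    difficulty=Difficulty.MEDIUM,
    subtype="placeholder-operation",
)
print(category_of(example))
```

Keeping the category as a pair of enum members, rather than a free-form string, makes it straightforward to group results by modality, by task type, or by both.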

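Technical framework: WebCompass is built on a multi-stage, human-in-the-loop pipeline with three main modules: task curation, instance annotation, and evaluation. The curated instances cover 15 generation domains, 16 editing operation types, and 11 repair defect types, and humans and models collaborate on annotation and evaluation.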

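Key innovation: The main technical contribution is the Agent-as-a-Judge paradigm for generation tasks. The judge autonomously executes each generated website in a real browser, explores its interactive behaviors via the Model Context Protocol (MCP), and iteratively synthesizes targeted test cases, closely approximating human acceptance testing. This differs fundamentally from static evaluation and reflects the usability and interactivity of generated code more faithfully.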
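The explore-then-test loop behind this idea can be illustrated with a toy sketch. Everything here is an assumption for illustration: `FakePage` is a stand-in for a real browser session (it is not an MCP client), `agent_judge` is a hypothetical name, and the expected outcomes are hard-coded where a real judge would synthesize test cases with a model.

```python
class FakePage:
    """Toy stand-in for a browser session (hypothetical; not a real MCP client)."""

    def __init__(self, initial, reveals, outcomes):
        self.visible = set(initial)  # elements that can currently be clicked
        self.reveals = reveals       # element -> elements shown after clicking it
        self.outcomes = outcomes     # element -> observable result of clicking it

    def click(self, element):
        self.visible.update(self.reveals.get(element, []))
        return self.outcomes.get(element, "no-op")


def agent_judge(page, expected, max_rounds=5):
    """Explore visible elements round by round and test each one once."""
    results = {}
    for _ in range(max_rounds):
        # Elements discovered so far that have not been exercised yet.
        untested = sorted(page.visible - set(results))
        if not untested:
            break
        for element in untested:
            observed = page.click(element)
            results[element] = observed == expected.get(element)
    return results


# Placeholder page: opening the menu reveals two buttons, one of which is broken.
page = FakePage(
    initial=["menu"],
    reveals={"menu": ["save", "delete"]},
    outcomes={"menu": "menu-open", "save": "saved", "delete": "no-op"},
)
expected = {"menu": "menu-open", "save": "saved", "delete": "deleted"}
report = agent_judge(page, expected)
print(report)
```

The point of the sketch is the iteration: the `save` and `delete` buttons only become testable after the first round opens the menu, which is the kind of behavior a static check on the source code would not exercise.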

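Key design: Instances are annotated at three difficulty levels (Easy, Medium, Hard), and editing and repair tasks are scored with a checklist-guided LLM-as-a-Judge protocol, which keeps the evaluation systematic and comprehensive.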
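One simple way to turn checklist verdicts into a score is a weighted pass rate. The sketch below is an assumption about how such aggregation could work, not the paper's scoring formula: `ChecklistItem`, its `weight` field, and `checklist_score` are hypothetical names, and the verdicts are hard-coded where a judge model would supply them.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    description: str
    weight: float = 1.0  # hypothetical importance weight


def checklist_score(items, verdicts):
    """Return the weighted fraction of checklist items judged as satisfied."""
    if len(items) != len(verdicts):
        raise ValueError("each checklist item needs exactly one verdict")
    total = sum(item.weight for item in items)
    if total == 0:
        return 0.0
    passed = sum(item.weight for item, ok in zip(items, verdicts) if ok)
    return passed / total


# Placeholder checklist for an imagined repair task.
items = [
    ChecklistItem("The reported defect no longer reproduces"),
    ChecklistItem("Unrelated layout is unchanged"),
    ChecklistItem("Existing interactions still work", weight=2.0),
]
verdicts = [True, False, True]  # would come from the judge model in practice
score = checklist_score(items, verdicts)
print(score)
```

With these placeholder values the passed weight is 3.0 out of a total of 4.0, so the score is 0.75.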

📊 Experimental Highlights

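Closed-source models remain substantially stronger and more balanced than open-source models. Editing and repair show distinct difficulty profiles: repair preserves interactivity better but remains challenging to execute. Aesthetics is the most persistent bottleneck, especially for open-source models. Framework choice also materially affects outcomes: Vue is consistently challenging, while React and Vanilla/HTML perform more strongly depending on task type.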

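🎯 Application Scenarios

WebCompass has potential value in web development, automated testing, and education and training. With a more complete evaluation framework, developers can assess and tune their code generation tools more effectively, improving development efficiency and code quality. In the longer term, WebCompass may help drive smarter coding assistants and automated development tools.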

📄 Abstract (Original)

Large language models are rapidly evolving into interactive coding agents capable of end-to-end web coding, yet existing benchmarks evaluate only narrow slices of this capability, typically text-conditioned generation with static-correctness metrics, leaving visual fidelity, interaction quality, and codebase-level reasoning largely unmeasured. We introduce WebCompass, a multimodal benchmark that provides unified lifecycle evaluation of web engineering capability. Recognizing that real-world web coding is an iterative cycle of generation, editing, and repair, WebCompass spans three input modalities (text, image, video) and three task types (generation, editing, repair), yielding seven task categories that mirror professional workflows. Through a multi-stage, human-in-the-loop pipeline, we curate instances covering 15 generation domains, 16 editing operation types, and 11 repair defect types, each annotated at Easy/Medium/Hard levels. For evaluation, we adopt a checklist-guided LLM-as-a-Judge protocol for editing and repair, and propose a novel Agent-as-a-Judge paradigm for generation that autonomously executes generated websites in a real browser, explores interactive behaviors via the Model Context Protocol (MCP), and iteratively synthesizes targeted test cases, closely approximating human acceptance testing. We evaluate representative closed-source and open-source models and observe that: (1) closed-source models remain substantially stronger and more balanced; (2) editing and repair exhibit distinct difficulty profiles, with repair preserving interactivity better but remaining execution-challenging; (3) aesthetics is the most persistent bottleneck, especially for open-source models; and (4) framework choice materially affects outcomes, with Vue consistently challenging while React and Vanilla/HTML perform more strongly depending on task type.