World Models in Words: Auditing Physical State-Transition Commitments in Vision-Language Models

Author: Emmanuelle Bourigault

Category: cs.CL

Published: 2026-05-28

Comments: 8 pages, 3 figures, 5 tables

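💡 One-Sentence Takeaway

Proposes a framework for auditing physical state-transition commitments in order to improve the evaluation of vision-language models.

🎯 Matched area: Pillar 2: RL Algorithms and Architecture (RL & Architecture)

Keywords: vision-language models, physical reasoning, model evaluation, state transitions, audit framework, typed traces, error labels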

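📋 Key Points

Existing evaluations of vision-language models focus mainly on the final answer and overlook how the model perceives physical state and reasons about it.
The paper proposes the WMW framework, which has models produce typed traces so that their physical commitments can be audited alongside the final answer.
Verifier-guided reranking recovers up to 7 percentage points of trace validity without sacrificing answer accuracy, and trace-level preference tuning reduces hidden inconsistency by 41% (relative).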

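📝 Abstract (Summary)

Vision-language models (VLMs) are increasingly used to answer questions about physical scenes, yet most evaluations look only at the final answer. They cannot tell whether the model perceived the right objects, represented the right physical state, predicted a plausible transition, or merely picked the right option for the wrong reasons. The paper introduces World Models in Words (WMW), an evaluation framework for auditing the physical commitments that VLMs express in language. The framework asks the model to produce a typed trace consisting of an initial state, a state transition, a resulting state, and an answer. A hybrid verifier then checks schema validity, state grounding, transition consistency, and answer-trace compatibility, and emits typed error labels such as object, relation, force, and transition errors. The authors also release Tracebank, a controlled trace resource with verifier code, and evaluate models on both controlled and external physical-reasoning examples. The study finds that 35% of correct answers from mid-tier models are backed by physically invalid traces.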

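🔬 Method Details

Problem definition: The paper addresses a gap in how vision-language models are evaluated, namely that the model's understanding of physical state and its reasoning process go unexamined. Existing methods usually score only the final answer, so they cannot expose errors made along the way.

Core idea: The WMW framework requires the model to produce a typed trace containing an initial state, a state transition, a resulting state, and an answer, so that its physical commitments can be audited as a whole. In the abstract's notation, the model is scored on $I,q\mapsto(s_0,\Delta s,s_1,a)$ rather than only on $I,q\mapsto a$. This makes the reasoning process, and whether it is valid, visible to the evaluator.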
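To make the shape of a typed trace concrete, here is a minimal sketch in Python. It is not the paper's schema: the `TypedTrace` class, the `State` alias, and the symbolic condition strings such as `"on_table"` are all hypothetical names chosen for illustration, assuming a state can be reduced to a mapping from object names to a symbolic condition.

```python
from dataclasses import dataclass

# Hypothetical representation (an assumption, not the paper's schema):
# a state maps object names to a symbolic physical condition.
State = dict[str, str]


@dataclass
class TypedTrace:
    s0: State      # initial state the model claims to see in the image
    delta: State   # state transition: objects whose condition changes
    s1: State      # resulting state the model commits to
    answer: str    # final answer a to the question q


# Toy example for the question "What happens to the ball?"
trace = TypedTrace(
    s0={"ball": "on_table", "table": "tilted"},
    delta={"ball": "on_floor"},
    s1={"ball": "on_floor", "table": "tilted"},
    answer="the ball falls",
)
```

Keeping the four fields separate is what allows each one to be checked on its own: an evaluator can ask whether `s0` matches the image independently of whether `s1` follows from `s0` and `delta`.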

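Framework: The pipeline has three parts: the model emits a typed trace, a hybrid verifier is applied to it, and typed error labels are produced. The verifier checks schema validity, state grounding, transition consistency, and answer-trace compatibility, that is, whether the physical story the model tells can hold together with its answer.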
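The four checks can be sketched as a small rule-based function. This is an illustrative sketch only, not the released verifier: the function name `verify`, the dictionary layout of a trace, the `answer_from_state` callback, and the rule that the resulting state equals the initial state overwritten by the transition are all assumptions. The label strings reuse error types named in the abstract.

```python
def verify(trace, scene_objects, answer_from_state):
    """Return a list of typed error labels; an empty list means no error found.

    trace: dict with keys "s0", "delta", "s1", "answer" (hypothetical layout).
    scene_objects: set of object names actually present in the scene.
    answer_from_state: hypothetical callback mapping a resulting state to
        the answer that state implies for the question being asked.
    """
    # 1. Schema validity: all four fields must be present.
    missing = {"s0", "delta", "s1", "answer"} - set(trace)
    if missing:
        return ["schema"]

    errors = []
    # 2. State grounding: every object in s0 must exist in the scene.
    if any(name not in scene_objects for name in trace["s0"]):
        errors.append("object")
    # 3. Transition consistency: recompute s1 from s0 and delta.
    if {**trace["s0"], **trace["delta"]} != trace["s1"]:
        errors.append("transition")
    # 4. Answer-trace compatibility: the answer must follow from s1.
    if answer_from_state(trace["s1"]) != trace["answer"]:
        errors.append("faithfulness")
    return errors


def does_ball_fall(s1):
    return "yes" if s1.get("ball") == "on_floor" else "no"


good = {"s0": {"ball": "on_table"}, "delta": {"ball": "on_floor"},
        "s1": {"ball": "on_floor"}, "answer": "yes"}
# Same answer, but the stated resulting state contradicts the transition.
bad = {**good, "s1": {"ball": "on_table"}}

good_errors = verify(good, {"ball"}, does_ball_fall)
bad_errors = verify(bad, {"ball"}, does_ball_fall)
```

The `bad` trace gives the same answer as `good`, so answer-only scoring would treat them identically, while the sketch flags it with both a transition and a faithfulness label.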

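Key innovation: The central contribution is the typed trace itself. Evaluation is no longer limited to the final answer but covers each step of the model's stated reasoning, which answer-only benchmarks leave unexamined.

Key design: The hybrid verifier distinguishes several error types, including object, relation, force, and transition errors. The accompanying Tracebank resource is built from synthetic scenarios validated by schema checks and recomputation, together with minimally perturbed contrastive preference pairs.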
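One way to picture a minimally perturbed contrastive preference pair is to copy a valid trace and change exactly one entry of its resulting state. The sketch below assumes that reading; the function `make_preference_pair` and the `"chosen"`/`"rejected"` keys are hypothetical and are not taken from the Tracebank release.

```python
import copy


def make_preference_pair(valid_trace, obj, wrong_condition):
    """Build a (chosen, rejected) pair that differs in a single s1 entry.

    valid_trace: dict with keys "s0", "delta", "s1", "answer" (hypothetical).
    obj: name of the object whose resulting condition is corrupted.
    wrong_condition: the physically invalid condition to substitute.
    """
    if valid_trace["s1"].get(obj) == wrong_condition:
        raise ValueError("perturbation must change the resulting state")
    # Deep copy so the valid trace is left untouched.
    rejected = copy.deepcopy(valid_trace)
    rejected["s1"][obj] = wrong_condition
    return {"chosen": valid_trace, "rejected": rejected}


valid = {"s0": {"ball": "on_table"}, "delta": {"ball": "on_floor"},
         "s1": {"ball": "on_floor"}, "answer": "yes"}
pair = make_preference_pair(valid, "ball", "on_table")
```

Because the answer is left unchanged, the rejected trace is exactly the kind of case that answer-only scoring cannot separate from the chosen one.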

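📊 Experimental Highlights

The experiments show that 35% of correct answers from mid-tier models are backed by physically invalid traces. Verifier-guided reranking recovers up to 7 percentage points of trace validity without sacrificing answer accuracy, and trace-level preference tuning reduces hidden inconsistency by 41% in relative terms. These are failures that answer-only evaluation does not detect.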
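The quantities reported above can be illustrated with a short sketch. It assumes that hidden inconsistency means the share of correct answers whose trace is invalid, and that reranking picks the candidate with the fewest verifier errors; both readings, the function names, and the toy records are assumptions for illustration and do not reproduce the paper's data.

```python
def hidden_inconsistency_rate(records):
    """Fraction of correct answers whose trace failed verification."""
    correct = [r for r in records if r["answer_correct"]]
    if not correct:
        return 0.0
    return sum(not r["trace_valid"] for r in correct) / len(correct)


def relative_reduction(before, after):
    """Relative change, e.g. going from 0.5 to 0.25 is a 50% reduction."""
    return (before - after) / before


def rerank(candidates, count_errors):
    """Verifier-guided reranking: keep the candidate with fewest errors."""
    return min(candidates, key=count_errors)


# Toy records, not results from the paper.
toy_records = [
    {"answer_correct": True, "trace_valid": True},
    {"answer_correct": True, "trace_valid": False},
    {"answer_correct": True, "trace_valid": True},
    {"answer_correct": True, "trace_valid": True},
    {"answer_correct": False, "trace_valid": False},
]
toy_rate = hidden_inconsistency_rate(toy_records)

toy_candidates = [
    {"id": "a", "errors": ["transition", "faithfulness"]},
    {"id": "b", "errors": []},
    {"id": "c", "errors": ["object"]},
]
best = rerank(toy_candidates, lambda c: len(c["errors"]))
```

Note that this reranker only looks at verifier errors; it does nothing by itself to preserve answer accuracy, which the paper reports as an empirical outcome.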

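🎯 Application Scenarios

Potential application areas include automated question answering, visual understanding for robotics, and physical-reasoning tasks. A stricter evaluation standard for vision-language models makes it easier to understand and improve their behavior in complex physical scenes, which in turn supports the reliability of systems deployed in practice.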

📄 Abstract (Original)

Vision-language models (VLMs) are increasingly used to answer questions about physical scenes, yet most evaluations reduce performance to a final answer. This hides whether the model perceived the right objects, represented the right physical state, predicted a plausible transition, or merely selected the right option for the wrong reasons. We introduce WMW, an evaluation framework for auditing the language-expressed physical commitments of VLMs. Instead of scoring only $I,q\mapsto a$, we ask models to produce a typed trace $I,q\mapsto(s_0,\Delta s,s_1,a)$: an initial state, a state transition, a resulting state, and an answer. A hybrid verifier then checks schema validity, state grounding, transition consistency, and answer-trace compatibility, yielding typed error labels such as object, relation, force, transition, temporal, unit/scale, and faithfulness errors. We release Tracebank, a controlled trace resource with \nSeed schema- and recomputation-validated synthetic scenarios across \nFamilies physics families, \nPairs minimally perturbed contrastive preference pairs, verifier code, audit guidelines, and model outputs. We evaluate \nModels VLMs on both controlled and external physical-reasoning examples. WMW reveals failures that answer-only evaluation misses: 35% of correct answers from mid-tier models are backed by physically invalid traces. Verifier-guided reranking recovers up to 7 percentage points of trace validity without sacrificing answer accuracy, and trace-level preference tuning reduces hidden inconsistency by 41% relative. The contribution is not another final-answer physics benchmark, but a reusable protocol for measuring whether a VLM's stated physical world can be true at the same time as its answer.
