Can Large Language Models Understand Symbolic Graphics Programs?

Authors: Zeju Qiu, Weiyang Liu, Haiwen Feng, Zhen Liu, Tim Z. Xiao, Katherine M. Collins, Joshua B. Tenenbaum, Adrian Weller, Michael J. Black, Bernhard Schölkopf

Categories: cs.LG, cs.AI, cs.CL, cs.CV

Published: 2024-08-15 (updated: 2025-05-27)

Comments: ICLR 2025 Spotlight (v4: 47 pages, 26 figures, project page: https://sgp-bench.github.io/)

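💡 One-Sentence Summary

Proposes a benchmark built on symbolic graphics programs to evaluate and improve the spatial-semantic reasoning ability of LLMs.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: large language models, symbolic graphics programs, spatial-semantic reasoning, instruction tuning, visual understanding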

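📋 Key Points

Existing methods struggle to evaluate the spatial-semantic reasoning of LLMs, because suitable test tasks that the models have not already seen are scarce.
Symbolic graphics programs generate visual data, so program-level semantic understanding tasks can test an LLM's reasoning without a vision encoder.
Symbolic Instruction Tuning (SIT) finetunes LLMs to improve both their understanding of symbolic programs and their general reasoning ability.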

📝 Abstract (translated from Chinese)

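With large language models (LLMs) attracting so much attention, there is a pressing need to assess their capabilities and shortcomings scientifically. This is not easy, because it is hard to find tasks that the models have not encountered during training. Using symbolic graphics programs, we propose a domain that is well suited to testing multiple spatial-semantic reasoning skills of LLMs. These programs are common in computer graphics and generate visual data procedurally. Although LLMs show impressive skill in general program synthesis and analysis, symbolic graphics programs offer a new layer of evaluation: they let us test whether an LLM can answer semantic questions about images or 3D geometry without a vision encoder. To understand a symbolic program semantically, an LLM must be able to "imagine" and reason about how the corresponding graphics content looks, given only a symbolic description of local curvatures and strokes. We use this task to evaluate LLMs by procedurally building a large benchmark for the semantic visual understanding of symbolic graphics programs, with minimal human effort. Particular emphasis is placed on image transformations that keep image-level semantics invariant while changing the underlying program significantly. We evaluate commercial and open-source LLMs on the benchmark to assess how well they reason about the visual output of programs, and find that LLMs considered stronger at reasoning generally perform better. Finally, we introduce a novel method to improve this ability, Symbolic Instruction Tuning (SIT), in which the LLM is finetuned on pre-collected instruction data about symbolic graphics programs. Interestingly, SIT improves not only the LLM's understanding of symbolic programs but also its general reasoning ability on various other benchmarks.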
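The idea of a transformation that leaves the picture's meaning unchanged while rewriting the program text can be illustrated with a small sketch. It assumes an SVG-style path syntax and uses a plain translation as the transformation; both are illustrative choices, since this summary does not state the program format or list the transformations the benchmark actually uses. All function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Command = Tuple[str, List[Point]]  # e.g. ("M", [(0, 0)]) or ("L", [(10, 0)])


def to_path_string(commands: List[Command]) -> str:
    """Serialize commands into an SVG-style path string (assumed format)."""
    parts = []
    for op, pts in commands:
        coords = " ".join(f"{x:g},{y:g}" for x, y in pts)
        parts.append(f"{op} {coords}".strip())
    return " ".join(parts)


def translate(commands: List[Command], dx: float, dy: float) -> List[Command]:
    """Shift every point by (dx, dy); the drawn shape itself is unchanged."""
    return [(op, [(x + dx, y + dy) for x, y in pts]) for op, pts in commands]


def shape_signature(commands: List[Command]) -> List[Point]:
    """Offsets of all points from the first point; invariant under translation."""
    pts = [p for _, ps in commands for p in ps]
    x0, y0 = pts[0]
    return [(x - x0, y - y0) for x, y in pts]


# A triangle and a translated copy: different program text, same shape.
triangle: List[Command] = [
    ("M", [(0.0, 0.0)]),
    ("L", [(10.0, 0.0)]),
    ("L", [(5.0, 8.0)]),
    ("Z", []),
]
shifted = translate(triangle, 30.0, 20.0)

print(to_path_string(triangle))
print(to_path_string(shifted))
print(shape_signature(triangle) == shape_signature(shifted))
```

A model that truly understands the program should give the same answer to a semantic question ("what shape is this?") for both strings, even though every coordinate differs.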

🔬 Method Details

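Problem definition: The paper aims to evaluate and improve how well large language models (LLMs) semantically understand symbolic graphics programs. Because it is hard to find tasks that LLMs have not encountered during training, their spatial-semantic reasoning is hard to evaluate effectively. Effective means of improving LLM performance on such tasks are also lacking.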

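Core idea: The paper uses symbolic graphics programs as a tool for both evaluating and improving the spatial-semantic reasoning of LLMs. Such programs generate visual data procedurally, so semantic questions about the resulting image or 3D geometry can test an LLM's understanding without any vision encoder. A benchmark containing a large number of symbolic graphics programs provides the evaluation, and Symbolic Instruction Tuning (SIT) provides the improvement.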

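Technical framework: The framework has four parts: 1) build a benchmark containing a large number of symbolic graphics programs to evaluate the semantic understanding of LLMs; 2) design a set of semantic questions that test whether an LLM understands the image or 3D geometry a program generates; 3) propose Symbolic Instruction Tuning (SIT), which finetunes the LLM on pre-collected instruction data about symbolic graphics programs to improve its semantic understanding; 4) evaluate different LLMs on the benchmark and compare their performance before and after SIT.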
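Steps 1, 2, and 4 above can be sketched as a small evaluation loop. This is a minimal sketch under stated assumptions, not the benchmark's actual harness: the multiple-choice format, the prompt wording, the `BenchmarkItem` fields, and the stub model are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

LETTERS = "ABCDEFGH"


@dataclass
class BenchmarkItem:
    program: str            # symbolic graphics program, given to the model as text
    question: str           # semantic question about the rendered content
    choices: Sequence[str]  # answer options (assumed multiple-choice format)
    answer_index: int       # index of the correct option


def build_prompt(item: BenchmarkItem) -> str:
    """Format one item as a text-only prompt; no image is ever shown."""
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(item.choices))
    return (
        f"Program:\n{item.program}\n\n"
        f"Question: {item.question}\n{options}\n"
        "Answer with a single letter."
    )


def accuracy(items: List[BenchmarkItem], model: Callable[[str], str]) -> float:
    """Fraction of items where the first letter of the reply is correct."""
    if not items:
        return 0.0
    correct = 0
    for item in items:
        reply = model(build_prompt(item)).strip().upper()
        if reply[:1] == LETTERS[item.answer_index]:
            correct += 1
    return correct / len(items)


items = [
    BenchmarkItem("M 0,0 L 10,0 L 5,8 Z", "What shape is drawn?",
                  ["Triangle", "Circle"], 0),
    BenchmarkItem("M 0,0 L 10,0 L 10,10 L 0,10 Z", "How many corners?",
                  ["Three", "Four"], 1),
]


def always_a(prompt: str) -> str:
    """Stub standing in for a real LLM call."""
    return "A"


print(accuracy(items, always_a))
```

Passing the model as a plain callable keeps the loop independent of any particular LLM API, so the same function could score a model before and after finetuning.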

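Key innovations: 1) using symbolic graphics programs as a tool for evaluating and improving the spatial-semantic reasoning of LLMs; 2) building a benchmark containing a large number of symbolic graphics programs, which offers a new platform for evaluating the semantic understanding of LLMs; 3) proposing Symbolic Instruction Tuning (SIT), which through finetuning improves both the LLM's understanding of symbolic programs and its general reasoning ability.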

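Key design: SIT uses pre-collected instruction data about symbolic graphics programs. Each instruction example contains a description of a symbolic graphics program together with related semantic questions and answers. By learning from this data, the LLM understands the semantics of symbolic graphics programs better and performs better on the benchmark. The specific loss function, network architecture, and other technical details are not described in detail and remain unknown here.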
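The shape of such instruction data can be sketched as follows. The `instruction`/`input`/`output` field names follow a common instruction-tuning convention and are an assumption here, as are the prompt wording and the JSON Lines serialization; this summary does not specify the actual SIT data format.

```python
import json
from typing import Dict, List


def to_instruction_record(program: str, question: str, answer: str) -> Dict[str, str]:
    """Pack one (program, question, answer) triple into a finetuning record.

    Field names are an assumed convention, not the paper's confirmed schema.
    """
    return {
        "instruction": "Answer the question about the content this program renders.",
        "input": f"{program}\n\nQuestion: {question}",
        "output": answer,
    }


def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


records = [
    to_instruction_record("M 0,0 L 10,0 L 5,8 Z", "What shape is drawn?", "A triangle."),
    to_instruction_record("M 0,0 L 10,0 L 10,10 L 0,10 Z", "How many corners?", "Four."),
]

print(to_jsonl(records))
```

Records in this form can be passed to a standard supervised finetuning pipeline; which pipeline, loss, and hyperparameters the authors used is not stated in this summary.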

📊 Experimental Highlights

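On the symbolic graphics program benchmark, LLMs considered stronger at reasoning generally perform better. With Symbolic Instruction Tuning (SIT), LLMs improve not only in understanding symbolic programs but also in general reasoning on various other benchmarks. The abstract does not give specific performance numbers, so they are unknown here.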

🎯 Application Scenarios

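The results could strengthen LLM applications in areas such as computer graphics and robot vision. For example, LLMs could be used to understand and generate complex 3D models, or to control robots performing precise vision-guided operations. The proposed evaluation method and the SIT technique could also be extended to other domains that require spatial-semantic reasoning.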

📄 Abstract (original)

Against the backdrop of enthusiasm for large language models (LLMs), there is a growing need to scientifically assess their capabilities and shortcomings. This is nontrivial in part because it is difficult to find tasks which the models have not encountered during training. Utilizing symbolic graphics programs, we propose a domain well-suited to test multiple spatial-semantic reasoning skills of LLMs. Popular in computer graphics, these programs procedurally generate visual data. While LLMs exhibit impressive skills in general program synthesis and analysis, symbolic graphics programs offer a new layer of evaluation: they allow us to test an LLM's ability to answer semantic questions about the images or 3D geometries without a vision encoder. To semantically understand the symbolic programs, LLMs would need to possess the ability to "imagine" and reason how the corresponding graphics content would look with only the symbolic description of the local curvatures and strokes. We use this task to evaluate LLMs by creating a large benchmark for the semantic visual understanding of symbolic graphics programs, built procedurally with minimal human effort. Particular emphasis is placed on transformations of images that leave the image level semantics invariant while introducing significant changes to the underlying program. We evaluate commercial and open-source LLMs on our benchmark to assess their ability to reason about visual output of programs, finding that LLMs considered stronger at reasoning generally perform better. Lastly, we introduce a novel method to improve this ability -- Symbolic Instruction Tuning (SIT), in which the LLM is finetuned with pre-collected instruction data on symbolic graphics programs. Interestingly, we find that SIT not only improves LLM's understanding on symbolic programs, but it also improves general reasoning ability on various other benchmarks.
