Generalizable Geometric Image Caption Synthesis

Authors: Yue Xin, Wenyuan Wang, Rui Pan, Ruida Wang, Howard Meng, Renjie Pi, Shizhe Diao, Tong Zhang

Categories: cs.AI, cs.CV, cs.LG

Published: 2025-09-18

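💡 One-sentence takeaway

Adds a Reinforcement Learning with Verifiable Rewards (RLVR) stage to the data generation pipeline for geometric image captions, so that the synthesized captions generalize beyond predefined templates.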

🎯 Matched areas: Pillar 2: RL Algorithms and Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models

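Keywords: geometric images, caption generation, reinforcement learning, multimodal models, data synthesis, reasoning ability, mathematical problem solving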

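📋 Key points

Existing multimodal large language models perform poorly on complex geometric problems; the main obstacle is the lack of high-quality image-text pair datasets.
The paper introduces Reinforcement Learning with Verifiable Rewards (RLVR) into the data generation pipeline to refine captions for geometric images and strengthen model reasoning.
Experiments show that the generated dataset improves accuracy by 2.8%-4.8% on statistics, arithmetic, algebraic, and numerical tasks in MathVista and MathVerse, and by 2.4%-3.9% on Art, Design, Tech, and Engineering tasks in MMMU.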

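📝 Summary

Multimodal large language models still struggle with complex geometric problems, mainly because high-quality image-text pair datasets are scarce. Existing template-based data synthesis pipelines usually fail to generalize to questions beyond their predefined templates. The paper adds a Reinforcement Learning with Verifiable Rewards (RLVR) process to the data generation pipeline. This lets the pipeline capture the key features of geometry problem-solving, which yields better task generalization and accuracy gains across several tasks.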

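🔬 Method details

Problem definition: The paper targets the weakness of multimodal large language models in geometric image captioning, in particular the poor generalization of existing methods on complex geometric problems.

Core idea: Introduce an RLVR mechanism to refine the caption generation process for geometric images so that the captions better capture the features that matter for solving geometry problems.

Technical framework: The overall pipeline has three main stages: data generation, RLVR refinement, and model training. Images are first synthesized from 50 basic geometric relations, the generated captions are then refined with RLVR, and the refined data is finally used to train multimodal models.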
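The data generation stage can be pictured with a minimal sketch. It assumes that a scene is a random selection of relations bound to point labels and that the initial caption is composed from per-relation templates. The three relation names, their wording, and the helper names `sample_scene` and `template_caption` are hypothetical; the summary only says that images are synthesized from 50 basic geometric relations.

```python
import random

# Hypothetical relation templates. The paper uses 50 basic geometric
# relations; these three names and wordings are illustrative only.
RELATION_TEMPLATES = {
    "midpoint": "{p} is the midpoint of segment {a}{b}.",
    "perpendicular": "Line {a}{b} is perpendicular to line {c}{d}.",
    "parallel": "Line {a}{b} is parallel to line {c}{d}.",
}


def sample_scene(num_relations, rng):
    """Pick relations at random and bind each one to distinct point labels."""
    scene = []
    for _ in range(num_relations):
        name = rng.choice(sorted(RELATION_TEMPLATES))
        a, b, c, d, p = rng.sample("ABCDEFGH", 5)
        scene.append((name, {"a": a, "b": b, "c": c, "d": d, "p": p}))
    return scene


def template_caption(scene):
    """Compose the initial caption, one template sentence per relation."""
    return " ".join(
        RELATION_TEMPLATES[name].format(**labels) for name, labels in scene
    )


if __name__ == "__main__":
    rng = random.Random(0)
    print(template_caption(sample_scene(3, rng)))
```

A caption built this way can only describe what its templates anticipate, which is the limitation the RLVR refinement stage is meant to address.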

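Key innovation: Introducing RLVR is the core contribution. Compared with conventional template-based data synthesis, it adapts more flexibly to different question types and improves generalization.

Key design: In the RLVR process, reward signals are derived from mathematical problem-solving tasks and guide the refinement of the generated captions. In addition, a dedicated loss function is used to balance the accuracy and diversity of the generated captions.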
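One way to read "reward signals derived from mathematical problem-solving tasks" is as an answer-matching score: a caption earns reward when a solver that sees only the caption answers the associated problems correctly. The sketch below illustrates that reading and is not the paper's reward. The function names, the answer normalization, and `toy_solver` (a keyword check standing in for a language model) are all assumptions.

```python
from typing import Callable, Sequence, Tuple

Problem = Tuple[str, str]  # (question, ground-truth answer)


def normalize(answer: str) -> str:
    """Canonicalize an answer string before exact-match comparison."""
    return answer.strip().lower().rstrip(".")


def verifiable_reward(
    caption: str,
    problems: Sequence[Problem],
    solver: Callable[[str, str], str],
) -> float:
    """Fraction of problems the solver answers correctly from the caption alone."""
    if not problems:
        return 0.0
    correct = sum(
        normalize(solver(caption, question)) == normalize(gold)
        for question, gold in problems
    )
    return correct / len(problems)


def toy_solver(caption: str, question: str) -> str:
    """Stand-in for a language model: 'yes' if the queried relation is mentioned."""
    return "yes" if question.lower() in caption.lower() else "no"


if __name__ == "__main__":
    problems = [("midpoint", "yes"), ("parallel", "no")]
    candidates = ["A triangle.", "M is the midpoint of segment AB."]
    best = max(candidates, key=lambda c: verifiable_reward(c, problems, toy_solver))
    print(best)
```

Because the score is computed by checking answers against ground truth, it can be verified automatically, and a policy-gradient update could then favor captions that score higher.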

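📊 Experimental highlights

Even in out-of-distribution scenarios, the dataset refined with RLVR improves accuracy by 2.8%-4.8% on statistics, arithmetic, algebraic, and numerical tasks with non-geometric input images from MathVista and MathVerse, and by 2.4%-3.9% on Art, Design, Tech, and Engineering tasks in MMMU. These results support the method's effectiveness and its applicability beyond geometry.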

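🎯 Application scenarios

Potential application areas include education, automated design, and robotics, where the approach could support automated solving of complex geometric problems and improve multimodal models in practical use. The method may also extend to broader domains and strengthen the reasoning and decision-support capabilities of intelligent systems.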

📄 Abstract (original)

Multimodal large language models have various practical applications that demand strong reasoning abilities. Despite recent advancements, these models still struggle to solve complex geometric problems. A key challenge stems from the lack of high-quality image-text pair datasets for understanding geometric images. Furthermore, most template-based data synthesis pipelines typically fail to generalize to questions beyond their predefined templates. In this paper, we bridge this gap by introducing a complementary process of Reinforcement Learning with Verifiable Rewards (RLVR) into the data generation pipeline. By adopting RLVR to refine captions for geometric images synthesized from 50 basic geometric relations and using reward signals derived from mathematical problem-solving tasks, our pipeline successfully captures the key features of geometry problem-solving. This enables better task generalization and yields non-trivial improvements. Furthermore, even in out-of-distribution scenarios, the generated dataset enhances the general reasoning capabilities of multimodal large language models, yielding accuracy improvements of $2.8\%\text{-}4.8\%$ in statistics, arithmetic, algebraic, and numerical tasks with non-geometric input images of MathVista and MathVerse, along with $2.4\%\text{-}3.9\%$ improvements in Art, Design, Tech, and Engineering tasks in MMMU.
