Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
Authors: Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, Philip Torr
Categories: cs.CV, cs.AI, cs.CL, cs.MA
Published: 2025-05-27 (updated: 2025-10-30)
Comments: Project Page: https://github.com/Paper2Poster/Paper2Poster
🔗 Code/Project: GitHub
💡 One-line Takeaway
Proposes Paper2Poster to address the automated generation of academic posters from scientific papers.
🎯 Matched field: Pillar 9: Embodied Foundation Models
Keywords: academic poster generation, multimodal learning, vision-language models, information compression, automation tools
📋 Key Points
- Existing academic-poster generation methods fall short in visual coherence and information delivery, and struggle to compress long documents effectively.
- The paper proposes PosterAgent, a multi-agent pipeline that generates posters automatically through parsing, planning, and painting modules.
- Experiments show that fully open-source variants based on the Qwen-2.5 series outperform existing systems on multiple evaluation metrics while using 87% fewer tokens.
📝 Abstract (Summary)
Academic poster generation is an important yet challenging task in scientific communication, requiring long documents to be compressed into a single, visually coherent page. To this end, this paper introduces the first benchmark and metric suite for poster generation, pairing conference papers with author-designed posters and evaluating outputs on visual quality, textual coherence, holistic assessment, and the poster's ability to convey the paper's content. Building on this benchmark, the authors propose PosterAgent, a top-down multi-agent pipeline with three main modules: parsing, planning, and painting. Comprehensive evaluation shows that fully open-source variants based on the Qwen-2.5 series outperform existing systems on multiple metrics at very low cost. These findings chart directions for the next generation of fully automated poster-generation models.
🔬 Method Details
Problem definition: The paper targets information compression and visual coherence in academic poster generation. Existing methods often align text and visual content poorly, producing posters that fail to convey the paper's core message.
Core idea: PosterAgent is a top-down, visual-in-the-loop multi-agent pipeline whose parsing, planning, and painting modules work in concert to optimize the generation process, ensuring effective information delivery and visual appeal.
Technical framework: The architecture comprises three main modules. The Parser distills the paper into a structured asset library; the Planner aligns text-visual pairs into a binary-tree layout; and the Painter-Commenter loop refines each panel by executing rendering code and incorporating feedback from a vision-language model (VLM).
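The Parser → Planner → Painter-Commenter flow described above can be sketched as plain Python. This is a minimal illustration, not the actual PosterAgent API: all class names, function names, and the `vlm_feedback` callable are hypothetical stand-ins (a real implementation would call a VLM and a renderer).

```python
from dataclasses import dataclass

# Hypothetical sketch of the Parser -> Planner -> Painter-Commenter flow.
# Names are illustrative; they are not the actual PosterAgent interfaces.

@dataclass
class Asset:
    kind: str          # e.g. "text" or "figure"
    content: str

@dataclass
class Panel:
    title: str
    assets: list

def parse(paper_sections: dict) -> list:
    """Parser: distill the paper into a structured asset library."""
    return [Asset("text", body) for body in paper_sections.values()]

def plan(assets: list, max_per_panel: int = 2) -> list:
    """Planner: group assets into panels, preserving reading order."""
    panels = []
    for i in range(0, len(assets), max_per_panel):
        panels.append(Panel(f"Panel {len(panels) + 1}", assets[i:i + max_per_panel]))
    return panels

def paint_and_comment(panel: Panel, vlm_feedback, max_rounds: int = 3) -> str:
    """Painter-Commenter loop: render, then revise until the critic approves."""
    rendering = " | ".join(a.content for a in panel.assets)
    for _ in range(max_rounds):
        issues = vlm_feedback(rendering)   # critic flags overflow, misalignment, ...
        if not issues:
            break
        rendering = rendering[: len(rendering) - 10]   # toy revision: trim overflow
    return rendering
```

Here `vlm_feedback` stands in for the VLM critic; the toy revision step simply truncates, whereas the paper's loop regenerates rendering code from feedback.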
Key innovation: The most notable contribution is the PaperQuiz metric, which quantifies how well a poster conveys the paper's core content by having VLMs answer quizzes generated from the paper; a VLM-as-judge is also used for holistic assessment, improving the objectivity and accuracy of evaluation.
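The PaperQuiz idea reduces to a simple accuracy computation: a "reader" model answers paper-derived questions while seeing only the poster. A minimal sketch, assuming `reader` is any callable `(poster, question) -> answer` (a VLM in the paper; the function name and quiz format below are illustrative):

```python
# Hypothetical sketch of PaperQuiz-style scoring. In the paper the reader
# is a VLM answering generated quizzes; here it is any callable.

def paper_quiz_score(poster: str, quiz: list, reader) -> float:
    """Fraction of quiz questions a reader answers correctly from the poster alone."""
    if not quiz:
        return 0.0
    correct = sum(
        1 for question, gold in quiz
        if reader(poster, question) == gold
    )
    return correct / len(quiz)

# Toy usage with a keyword-matching stand-in for the VLM reader:
poster = "PosterAgent uses a Parser, Planner and Painter."
quiz = [("mentions Planner", "yes"), ("mentions GAN", "no")]
reader = lambda p, q: "yes" if q.split()[-1] in p else "no"
score = paper_quiz_score(poster, quiz, reader)   # 1.0 for this toy case
```

A poster that omits key content drags the score down even if it looks polished, which is exactly the failure mode the metric is designed to expose.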
Key design: The binary-tree layout preserves reading order and spatial balance across panels, while the Painter-Commenter loop eliminates text overflow and misalignment, so the generated poster balances aesthetics with faithful information delivery.
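A binary-tree layout of the kind described can be sketched as a recursive split of the page: each node divides its region between the first and second half of its panels, alternating vertical and horizontal cuts, with area shared in proportion to panel weight. This is purely illustrative and assumes a simple `(name, weight)` panel representation, not the paper's actual layout algorithm.

```python
# Minimal sketch of a binary-tree poster layout. Recursion order follows
# list order, so reading order is preserved; proportional splits keep the
# layout spatially balanced. Illustrative only.

def layout(panels, x, y, w, h, vertical=True):
    """Assign each (name, weight) panel a rectangle inside (x, y, w, h)."""
    if len(panels) == 1:
        name, _ = panels[0]
        return {name: (x, y, w, h)}
    mid = len(panels) // 2
    left, right = panels[:mid], panels[mid:]
    ratio = sum(wt for _, wt in left) / sum(wt for _, wt in panels)
    boxes = {}
    if vertical:   # cut along x: the left block gets `ratio` of the width
        boxes.update(layout(left, x, y, w * ratio, h, not vertical))
        boxes.update(layout(right, x + w * ratio, y, w * (1 - ratio), h, not vertical))
    else:          # cut along y
        boxes.update(layout(left, x, y, w, h * ratio, not vertical))
        boxes.update(layout(right, x, y + h * ratio, w, h * (1 - ratio), not vertical))
    return boxes
```

Heavier panels (e.g. a results section with large figures) receive proportionally more area, and the rectangles tile the page exactly.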
📊 Experimental Highlights
Fully open-source variants built on the Qwen-2.5 series outperform the existing GPT-4o-driven multi-agent systems across nearly all metrics, including visual quality and information delivery, while using 87% fewer tokens; turning a 22-page paper into an editable .pptx poster costs about $0.005. Notably, GPT-4o outputs, though visually appealing at first glance, often exhibit noisy text and poor PaperQuiz scores.
🎯 Application Scenarios
Potential applications include academic conferences, research showcases, and education, where the system can substantially improve the efficiency and quality of poster creation. As the technology matures, it may extend to broader scientific-communication and information-visualization scenarios.
📄 Abstract (Original)
Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality: semantic alignment with human posters; (ii) Textual Coherence: language fluency; (iii) Holistic Assessment: six fine-grained aesthetic and informational criteria scored by a VLM-as-judge; and notably (iv) PaperQuiz: the poster's ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a) Parser distills the paper into a structured asset library; the (b) Planner aligns text-visual pairs into a binary-tree layout that preserves reading order and spatial balance; and the (c) Painter-Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs, though visually appealing at first glance, often exhibit noisy text and poor PaperQuiz scores, and we find that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source variants (e.g. based on the Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across nearly all metrics, while using 87% fewer tokens. It transforms a 22-page paper into a finalized yet editable .pptx poster, all for just $0.005. These findings chart clear directions for the next generation of fully automated poster-generation models. The code and datasets are available at https://github.com/Paper2Poster/Paper2Poster.