CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Authors: Hui Zhang, Dexiang Hong, Yitong Wang, Jie Shao, Xinglong Wu, Zuxuan Wu, Yu-Gang Jiang

Category: cs.CV

Published: 2024-12-05 (updated: 2025-08-06)

Note: Accepted by ICCV 2025

💡 One-Sentence Summary

Proposes CreatiLayout, which builds on a Siamese Multimodal Diffusion Transformer to enable controllable layout-to-image generation.

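🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: layout-to-image generation, multimodal diffusion Transformer, Siamese network, image generation, dataset, large language models, controllable generation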

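📋 Key Points

Existing L2I methods rely mainly on UNet backbones, so they do not fully exploit the strong image generation capability of MM-DiT, and fusing layout information effectively remains challenging.
SiamLayout decouples the image-layout interaction into a Siamese branch and processes the layout with its own set of weights, keeping the modalities in balance.
A large-scale LayoutSAM dataset and a LayoutSAM-Eval benchmark, together with the Layout Designer, make up the complete CreatiLayout solution.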

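📝 Abstract (Translated)

Diffusion models are widely recognized for their ability to generate visually appealing images of high artistic quality. Layout-to-image (L2I) generation was therefore proposed, using region-specific positions and descriptions to achieve more precise and controllable generation. However, previous methods focus mainly on UNet-based models (e.g., SD1.5 and SDXL), and there has been limited exploration of Multimodal Diffusion Transformers (MM-DiT), which have demonstrated strong image generation capabilities. Using MM-DiT for layout-to-image generation seems straightforward, but it is challenging because of the complexity of how the layout is introduced, integrated, and balanced among multiple modalities. To this end, the authors explore various network variants for incorporating layout guidance into MM-DiT and ultimately propose SiamLayout. To inherit the advantages of MM-DiT, a separate set of network weights processes the layout, treating it as equally important as the image and text modalities. To ease competition among modalities, the image-layout interaction is decoupled into a Siamese branch that runs in parallel with the image-text branch, and the two are fused at a later stage. The authors also contribute a large-scale layout dataset, LayoutSAM, which contains 2.7 million image-text pairs and 10.7 million entities, each annotated with a bounding box and a detailed description. They further build the LayoutSAM-Eval benchmark as a comprehensive tool for evaluating L2I generation quality. Finally, they introduce the Layout Designer, which taps the potential of large language models in layout planning and turns them into experts in layout generation and optimization. Together these components form CreatiLayout, a systematic solution that integrates a layout model, a dataset, and a planner for creative layout-to-image generation.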

🔬 Method Details

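Problem definition: The paper addresses how to use Multimodal Diffusion Transformers (MM-DiT) effectively for layout-to-image (L2I) generation. Existing methods are mostly UNet-based and cannot draw on the full potential of MM-DiT. A further challenge is how to inject layout information into MM-DiT while keeping the image, text, and layout modalities in balance.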

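Core idea: Treat the layout as a modality as important as image and text, and use a Siamese structure to decouple the image-layout interaction. The layout is processed independently, in a branch that runs in parallel with the image-text branch. This eases competition among modalities and lets the layout guide image generation more effectively.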

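Technical framework: CreatiLayout has three main components, namely the SiamLayout model, the LayoutSAM dataset, and the Layout Designer. SiamLayout is the core image generation model; it takes a text description and a layout as input and generates the corresponding image. LayoutSAM is a large-scale layout dataset used to train and evaluate SiamLayout. The Layout Designer uses large language models for layout planning and supplies SiamLayout with high-quality layout inputs.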
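To make the hand-off between a layout planner and the image model concrete, here is a minimal sketch of one way a layout could be represented and validated. The `Entity` class, the `parse_layout` helper, the JSON schema, and the normalized `(x0, y0, x1, y1)` box convention are all assumptions made for illustration; they are not taken from the CreatiLayout code or data format.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """One layout entity: a region description plus a bounding box."""
    description: str
    bbox: tuple  # (x0, y0, x1, y1), normalized to [0, 1] (assumed convention)


def parse_layout(raw: str) -> list:
    """Parse a JSON layout (e.g. text returned by a planner) into entities.

    Raises ValueError for boxes that are inverted or fall outside the image.
    """
    entities = []
    for item in json.loads(raw):
        x0, y0, x1, y1 = (float(v) for v in item["bbox"])
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"invalid bbox for {item['description']!r}")
        entities.append(Entity(item["description"], (x0, y0, x1, y1)))
    return entities


# Hypothetical planner output for a caption such as
# "a dog sitting next to a red ball on grass".
planner_output = """
[
  {"description": "a brown dog sitting", "bbox": [0.10, 0.30, 0.55, 0.90]},
  {"description": "a red ball", "bbox": [0.60, 0.65, 0.80, 0.85]}
]
"""
layout = parse_layout(planner_output)
for entity in layout:
    print(entity.description, entity.bbox)
```

Validating the boxes before generation keeps malformed planner output from reaching the image model, which matters when the layout comes from free-form language model text.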

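Key innovation: The main novelty is SiamLayout's Siamese structure, which decouples the image-layout interaction into its own branch so that it does not compete directly with the image-text interaction. Processing the layout with a separate set of network weights also ensures the layout is given due weight. In addition, the scale and detailed annotations of the LayoutSAM dataset make it a valuable resource for L2I research.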

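Key design: SiamLayout contains two parallel branches, an image-text branch and an image-layout branch. The image-text branch uses the standard MM-DiT structure to process image and text information. The image-layout branch uses a similar MM-DiT structure but processes the layout with its own network weights. The outputs of the two branches are fused at a later stage to produce the final image. The loss combines the standard diffusion loss with auxiliary losses that improve image quality and consistency with the layout. The Layout Designer uses a large language model to generate layouts and refines them based on user feedback.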
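The two-branch design can be illustrated with a toy, dependency-free sketch. It is not the authors' implementation: it uses single-head attention on tiny vectors, shares the image projections between the two branches, and fuses the branches by element-wise sum. All three choices are assumptions of this sketch, since the summary only states that layout has its own weights and that the branches are fused at a later stage.

```python
import math
import random


def linear(tokens, weight):
    """Apply a d_in x d_out weight matrix to every token (a list of floats)."""
    return [[sum(t[i] * weight[i][j] for i in range(len(t)))
             for j in range(len(weight[0]))] for t in tokens]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j] for e, v in zip(exps, values))
                    for j in range(len(values[0]))])
    return out


def init_weights(rng, dim):
    """Random Q/K/V projection matrices for one modality."""
    return {name: [[rng.gauss(0.0, 0.2) for _ in range(dim)] for _ in range(dim)]
            for name in ("q", "k", "v")}


def joint_branch(image, other, w_image, w_other):
    """Joint attention over [image; other]; returns the updated image tokens.

    Each modality is projected with its own weights before attention.
    """
    q = linear(image, w_image["q"]) + linear(other, w_other["q"])
    k = linear(image, w_image["k"]) + linear(other, w_other["k"])
    v = linear(image, w_image["v"]) + linear(other, w_other["v"])
    return attention(q, k, v)[:len(image)]


def siamese_block(image, text, layout, weights):
    """Run the image-text and image-layout branches in parallel, then fuse."""
    from_text = joint_branch(image, text, weights["image"], weights["text"])
    from_layout = joint_branch(image, layout, weights["image"], weights["layout"])
    # Fusion by element-wise sum is an assumption of this sketch.
    return [[a + b for a, b in zip(ta, tb)]
            for ta, tb in zip(from_text, from_layout)]


rng = random.Random(0)
dim = 4
weights = {name: init_weights(rng, dim) for name in ("image", "text", "layout")}
image = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(6)]
text = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3)]
layout = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(2)]

fused = siamese_block(image, text, layout, weights)
print(len(fused), len(fused[0]))  # 6 image tokens, each of width 4
```

Because text and layout tokens never appear in the same attention call, neither modality can crowd the other out of the softmax, which is the intuition behind decoupling the two interactions.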

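📊 Experimental Highlights

The paper builds the large-scale LayoutSAM dataset, with 2.7 million image-text pairs and 10.7 million entities. The proposed SiamLayout model achieves significant performance gains on the LayoutSAM-Eval benchmark, which supports its effectiveness. The Layout Designer generates high-quality layouts from user requirements, further improving the controllability of L2I generation.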
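As a quick sanity check on the dataset figures, the average annotation density follows directly from the two reported counts. The variable names are illustrative, and the result is approximate because both counts are rounded.

```python
image_text_pairs = 2_700_000  # reported number of image-text pairs in LayoutSAM
entities = 10_700_000         # reported number of annotated entities

avg_entities_per_image = entities / image_text_pairs
print(f"{avg_entities_per_image:.2f} entities per image on average")  # 3.96
```

That is roughly four annotated entities, each with a bounding box and a description, per image.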

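🎯 Application Scenarios

The work applies to creative design, advertisement generation, virtual content creation, and similar areas. By specifying a layout and a text description, users can quickly generate images that match their requirements, improving both the efficiency and the quality of creative work. The technique may later extend to broader scenarios such as smart home design and game scene generation.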

📄 Abstract (Original)

Diffusion models have been recognized for their ability to generate images that are not only visually appealing but also of high artistic quality. As a result, Layout-to-Image (L2I) generation has been proposed to leverage region-specific positions and descriptions to enable more precise and controllable generation. However, previous methods primarily focus on UNet-based models (e.g., SD1.5 and SDXL), and limited effort has explored Multimodal Diffusion Transformers (MM-DiTs), which have demonstrated powerful image generation capabilities. Enabling MM-DiT for layout-to-image generation seems straightforward but is challenging due to the complexity of how layout is introduced, integrated, and balanced among multiple modalities. To this end, we explore various network variants to efficiently incorporate layout guidance into MM-DiT, and ultimately present SiamLayout. To inherit the advantages of MM-DiT, we use a separate set of network weights to process the layout, treating it as equally important as the image and text modalities. Meanwhile, to alleviate the competition among modalities, we decouple the image-layout interaction into a siamese branch alongside the image-text one and fuse them in the later stage. Moreover, we contribute a large-scale layout dataset, named LayoutSAM, which includes 2.7 million image-text pairs and 10.7 million entities. Each entity is annotated with a bounding box and a detailed description. We further construct the LayoutSAM-Eval benchmark as a comprehensive tool for evaluating the L2I generation quality. Finally, we introduce the Layout Designer, which taps into the potential of large language models in layout planning, transforming them into experts in layout generation and optimization. These components form CreatiLayout -- a systematic solution that integrates the layout model, dataset, and planner for creative layout-to-image generation.
