GuideCAD: A Lightweight Multimodal Framework for 3D CAD Model Generation via Prefix Embedding
Authors: Minseong Kim, Jinyeong Park, Sungho Park, Jibum Kim
Categories: cs.CV
Published: 2026-06-05
🔗 Code/Project: GitHub (https://github.com/mskimS2/GuideCAD)
💡 One-sentence summary
Proposes GuideCAD to reduce the computational cost of multimodal 3D CAD model generation
🎯 Matched area: Pillar 9: Embodied Foundation Models
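📋 Key points
- Existing multimodal 3D CAD generation methods typically require substantial computational resources, which makes training inefficient.
- GuideCAD uses a mapping network to convert image embeddings into prefix embeddings, allowing a pretrained language model to integrate visual and textual information.
- Experiments show that GuideCAD uses approximately four times fewer parameters and achieves twice the training efficiency of fine-tuning approaches, while generating 3D CAD models of comparable quality.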
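📝 Abstract (translated summary)
Multimodal approaches to 3D CAD generation typically require substantial computational resources, which makes training inefficient. To address this, the paper proposes GuideCAD, which uses semantically rich visual-textual representations to generate 3D CAD models with only a small number of trainable parameters. GuideCAD uses a mapping network to convert image embeddings into prefix embeddings, enabling a pretrained large language model to integrate visual and textual information. A transformer-based decoder then uses the visual-textual embeddings to predict the construction sequence, from which the 3D CAD model is generated. For evaluation, the authors construct a new dataset, also called GuideCAD, consisting of text-image pairs. The results show that GuideCAD generates 3D CAD models of comparable quality while using approximately four times fewer parameters and achieving twice the training efficiency of fine-tuning approaches.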
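The prefix-embedding step described above can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the two-layer MLP, the tanh activation, the dimensions (`d_img`, `d_model`, `prefix_len`, `hidden`), and the function names are all hypothetical, and random vectors stand in for the outputs of an image encoder and a text embedding layer. It only shows the general idea of mapping one image embedding to a short sequence of prefix vectors and prepending them to the text token embeddings.

```python
import numpy as np

def make_mapping_network(d_img, d_model, prefix_len, hidden, rng):
    """Randomly initialised weights for a two-layer MLP (hypothetical architecture)."""
    return {
        "w1": rng.normal(0.0, 0.02, (d_img, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.02, (hidden, prefix_len * d_model)),
        "b2": np.zeros(prefix_len * d_model),
        "shape": (prefix_len, d_model),
    }

def image_to_prefix(params, image_emb):
    """Map one image embedding (d_img,) to prefix embeddings (prefix_len, d_model)."""
    h = np.tanh(image_emb @ params["w1"] + params["b1"])
    flat = h @ params["w2"] + params["b2"]
    return flat.reshape(params["shape"])

def build_llm_input(prefix, text_token_embs):
    """Prepend the visual prefix to the text token embeddings along the sequence axis."""
    return np.concatenate([prefix, text_token_embs], axis=0)

rng = np.random.default_rng(0)
d_img, d_model, prefix_len, hidden = 512, 768, 10, 256  # assumed sizes

params = make_mapping_network(d_img, d_model, prefix_len, hidden, rng)
image_emb = rng.normal(size=d_img)          # stand-in for an image encoder output
text_embs = rng.normal(size=(20, d_model))  # stand-in for 20 embedded prompt tokens

prefix = image_to_prefix(params, image_emb)
llm_input = build_llm_input(prefix, text_embs)
print(prefix.shape, llm_input.shape)  # (10, 768) (30, 768)
```

In a setup like this, the mapping network's weights would be the natural candidates for the trainable parameters, with the language model consuming the combined sequence; whether GuideCAD trains exactly this set of parameters is not stated in the summary.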
📄 Abstract (original)
Multi-modal approaches used for 3D CAD generation require substantial computational resources, necessitating efficient training. To address this, we propose GuideCAD, which leverages semantically rich visual-textual representations having only a small number of trainable parameters to generate 3D CAD models. Specifically, GuideCAD uses a mapping network that converts image embeddings into prefix embeddings, enabling a pretrained large language model to integrate visual and textual information. As a result, a transformer-based decoder predicts the construction sequence using the visual-textual embeddings in order to generate the 3D CAD model. For experimental evaluation, we construct a new dataset, referred to as GuideCAD, which consists of text-image pairs. Each pair includes a text prompt that represents a 3D CAD construction sequence and its corresponding 3D CAD image. Our experimental results show that GuideCAD generates comparably high-quality 3D CAD models while using approximately four times fewer parameters and achieving twice the training efficiency compared to fine-tuning approaches. We have released the source code and dataset for our method at: https://github.com/mskimS2/GuideCAD