VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction

Authors: Sinan Du, Jiahao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu, Yifu Luo, Yongxian Wei, Kun Gai, Xinggang Wang, Kai Wu, Chun Yuan

Categories: cs.CV

Published: 2025-11-28

Comments: 19 pages, 10 figures

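💡 One-Sentence Takeaway

The paper proposes VQRAE, a representation quantization autoencoder that unifies the representations for multimodal understanding, generation, and reconstruction in a single tokenizer.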

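🎯 Matched areas: Pillar 2: RL Algorithms and Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models

Keywords: multimodal learning, vector quantization, autoencoder, visual representation, image generation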

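📋 Key Points

Existing methods struggle to unify the representations for multimodal understanding, generation, and reconstruction in a single tokenizer, and usually fall back on a dual-encoder design.
VQRAE, a representation quantization autoencoder, produces continuous semantic features (for understanding) and discrete tokens (for generation) within one unified tokenizer.
Experiments show that VQRAE is competitive on visual understanding, generation, and reconstruction tasks and scales well in the autoregressive paradigm.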

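📝 Abstract (translated from Chinese)

This paper proposes VQRAE, a vector-quantized version of Representation AutoEncoders that aims to unify the representations for multimodal understanding, generation, and reconstruction. Prior work mostly follows a dual-encoder paradigm, for example using separate encoders for understanding and generation, or balancing semantic representations against low-level features with a contrastive loss. VQRAE is presented as the first exploration of a unified representation that yields continuous semantic features for image understanding and discrete tokens for visual generation within a single tokenizer. The method builds on a pretrained vision foundation model with a symmetric ViT decoder and uses a two-stage training strategy: first, the encoder is frozen and a high-dimensional semantic VQ codebook is learned with a pixel reconstruction objective; then the encoder is jointly optimized under self-distillation constraints. With this design the loss of semantic information is negligible, which preserves multimodal understanding ability, while the discrete tokens remain compatible with generation and support fine-grained reconstruction. The authors also find that quantizing semantic encoders has an intriguing property: it relies on a high-dimensional codebook, in contrast to the low-dimensional codebooks commonly used for image reconstruction. The semantic VQ codebook reaches a 100% utilization ratio at a dimension of 1536. VQRAE shows competitive performance on several benchmarks for visual understanding, generation, and reconstruction, and its discrete nature gives it promising scaling behavior in the autoregressive paradigm.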

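🔬 Method Details

Problem definition: For multimodal tasks, existing methods usually handle understanding and generation with separate encoders in a dual-encoder structure, which makes a unified representation hard to achieve. Balancing semantic information against low-level features is a further challenge.

Core idea: VQRAE applies vector quantization (VQ) to convert continuous visual representations into discrete tokens, so that understanding and generation share one representation. With a high-dimensional semantic VQ codebook, the tokenizer retains enough semantic information for understanding while also emitting discrete tokens for generation.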
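The quantization step described above can be illustrated with a minimal sketch. It assumes the standard nearest-neighbor lookup under squared Euclidean distance used by generic VQ tokenizers; the summary does not state VQRAE's exact distance metric, and the function name `quantize`, the 2-dimensional features, and the 3-entry codebook are all hypothetical toy choices.

```python
def quantize(features, codebook):
    """Map each continuous feature vector to its nearest codebook entry.

    Returns (indices, quantized): the discrete token ids and the
    codebook vectors they refer to. Assumption: nearest neighbor under
    squared Euclidean distance, as in generic VQ.
    """
    indices = []
    quantized = []
    for vec in features:
        # Squared Euclidean distance from this feature to every code.
        dists = [sum((a - b) ** 2 for a, b in zip(vec, code)) for code in codebook]
        idx = min(range(len(codebook)), key=dists.__getitem__)
        indices.append(idx)
        quantized.append(codebook[idx])
    return indices, quantized


# Toy data (hypothetical): 3 patch features of dimension 2, 3 codes.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
features = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

token_ids, quantized = quantize(features, codebook)
print(token_ids)  # discrete tokens, usable by an autoregressive generator
print(features)   # continuous features, usable for understanding
```

In this picture, the continuous `features` and the discrete `token_ids` both come from the same encoder output, which is the sense in which a single tokenizer serves both tasks.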

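Technical framework: VQRAE builds on a pretrained vision foundation model and adds a symmetric ViT decoder. The framework has three main modules: 1) a vision encoder that extracts visual features from the image; 2) a VQ codebook that quantizes the continuous visual features into discrete tokens; 3) a ViT decoder that decodes the discrete tokens back into a pixel-level image. Training has two stages. In the first stage, the encoder is frozen and the VQ codebook and decoder are trained for pixel-level image reconstruction. In the second stage, the encoder, VQ codebook, and decoder are optimized jointly under a self-distillation constraint.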
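The two-stage schedule can be written down as a small configuration sketch. The module names (`encoder`, `codebook`, `decoder`) and the `StageConfig` structure are hypothetical labels for illustration. Keeping the pixel reconstruction objective active in stage 2 is an assumption, since the summary only says that a self-distillation constraint is introduced there.

```python
from dataclasses import dataclass

MODULES = frozenset({"encoder", "codebook", "decoder"})


@dataclass(frozen=True)
class StageConfig:
    name: str
    trainable: frozenset  # modules that receive gradient updates
    frozen: frozenset     # modules whose weights stay fixed
    objectives: tuple     # loss terms active in this stage


def stage_config(stage: int) -> StageConfig:
    """Return which modules train and which losses apply in each stage."""
    if stage == 1:
        # Stage 1: encoder frozen; learn the codebook and decoder.
        trainable = frozenset({"codebook", "decoder"})
        objectives = ("pixel_reconstruction",)
    elif stage == 2:
        # Stage 2: joint optimization with a self-distillation constraint.
        # Assumption: pixel reconstruction remains active here.
        trainable = MODULES
        objectives = ("pixel_reconstruction", "self_distillation")
    else:
        raise ValueError(f"unknown stage: {stage}")
    return StageConfig(f"stage{stage}", trainable, MODULES - trainable, objectives)


for s in (1, 2):
    cfg = stage_config(s)
    print(cfg.name, sorted(cfg.trainable), sorted(cfg.frozen), cfg.objectives)
```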

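Key innovation: VQRAE uses a high-dimensional semantic VQ codebook to produce both continuous semantic features and discrete tokens within one unified tokenizer. Unlike conventional low-dimensional codebooks, a high-dimensional codebook retains semantic information better, which protects the model's performance on understanding tasks. The self-distillation constraint also helps the model generalize.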
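The summary does not give the form of the self-distillation constraint. One common form, shown here purely as an assumption, is a mean squared error between the features of the encoder being trained (the student) and those of a frozen copy of the pretrained encoder (the teacher). The function name and toy values are hypothetical.

```python
def self_distillation_loss(student_feats, teacher_feats):
    """Mean squared error between student and teacher feature vectors.

    Assumption: the teacher is a frozen copy of the pretrained encoder;
    the source does not specify the actual loss used by VQRAE.
    """
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_feats, teacher_feats):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


# Toy features (hypothetical): two patches, dimension 2.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[1.0, 0.5], [0.0, 1.0]]

loss = self_distillation_loss(student, teacher)
print(loss)
```

A loss of this kind penalizes the encoder for drifting away from its pretrained features while it is tuned for reconstruction, which is one way the semantic features could be kept usable for understanding.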

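Key design choices: 1) a symmetric ViT decoder, which keeps the encoder and decoder aligned; 2) a two-stage training strategy that first trains the VQ codebook and decoder and then jointly optimizes the whole model; 3) a self-distillation constraint to improve generalization; 4) a high-dimensional semantic VQ codebook, which reaches 100% utilization at a dimension of 1536.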

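📊 Experimental Highlights

VQRAE achieves competitive performance on several visual understanding, generation, and reconstruction benchmarks. Notably, when the high-dimensional semantic VQ codebook has a dimension of 1536, codebook utilization reaches 100%, which indicates that the method makes effective use of the high-dimensional semantic space. Because its tokens are discrete, VQRAE scales well in the autoregressive paradigm.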
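Codebook utilization is usually measured as the fraction of codebook entries that are selected at least once over an evaluation set. The sketch below assumes that definition; the source reports the 100% figure without spelling out how it is computed, and the function name and token ids are hypothetical.

```python
def codebook_utilization(token_ids, codebook_size):
    """Fraction of codebook entries used at least once.

    Assumption: utilization = (#distinct codes selected) / (codebook size).
    """
    used = {i for i in token_ids if 0 <= i < codebook_size}
    return len(used) / codebook_size


# Toy token streams (hypothetical) over a codebook with 4 entries.
full = codebook_utilization([0, 3, 3, 1, 0, 2], 4)   # every code appears
partial = codebook_utilization([0, 0, 1], 4)         # codes 2 and 3 unused
print(full, partial)
```

Low utilization means many codes are never selected (often called codebook collapse), so a ratio of 1.0 indicates that the whole codebook is in use.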

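🎯 Application Scenarios

VQRAE has potential applications in multimodal dialogue systems, image editing, and image captioning. A unified representation allows more efficient multimodal processing and generation. In the future, VQRAE may extend to more complex visual tasks such as video understanding and generation.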

📄 Abstract (original)

Unifying multimodal understanding, generation and reconstruction representation in a single tokenizer remains a key challenge in building unified models. Previous research predominantly attempts to address this in a dual encoder paradigm, e.g., utilizing the separate encoders for understanding and generation respectively or balancing semantic representations and low-level features with contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which pioneers the first exploration in unified representation to produce Continuous semantic features for image understanding and Discrete tokens for visual generation within a unified tokenizer. Specifically, we build upon pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: first, it freezes the encoder and learns a high-dimensional semantic VQ codebook with pixel reconstruction objective; then jointly optimizes the encoder with self-distillation constraints. This design enables negligible semantic information for maintaining the ability of multimodal understanding, discrete tokens that are compatible for generation and fine-grained reconstruction. Besides, we identify the intriguing property in quantizing semantic encoders that rely on high-dimensional codebook in contrast to the previous common practice of low-dimensional codebook in image reconstruction. The semantic VQ codebook can achieve a 100% utilization ratio at a dimension of 1536. VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction with promising scaling property in the autoregressive paradigm for its discrete merits.
