GST-VLA: Structured Gaussian Spatial Tokens for 3D Depth-Aware Vision-Language-Action Models

📄 arXiv: 2603.09079v1

Authors: Md Selim Sarowar, Omer Tariq, Sungho Kim

Categories: cs.CV, cs.AI, cs.RO

Published: 2026-03-10

Note: The results presented in this paper are preliminary. The experiments are ongoing, and the final data is subject to change upon completion of the study. All ideas, results, methods, and content herein are the sole property of the authors.


💡 One-Sentence Takeaway

GST-VLA is proposed to address the lack of geometric structure in 3D depth-aware vision-language-action (VLA) models.

🎯 Matched Areas: Pillar 2: RL Algorithms & Architecture · Pillar 3: Spatial Perception & Semantics · Pillar 9: Embodied Foundation Models

Keywords: 3D depth awareness, vision-language-action, Gaussian spatial tokenizer, chain-of-thought, multimodal learning

📋 Key Points

  1. Existing VLA models encode visual observations without intrinsic geometric structure, limiting their understanding of 3D information.
  2. GST-VLA introduces a Gaussian Spatial Tokenizer and 3D depth-aware chain-of-thought reasoning to strengthen the model's ability to capture and understand geometric information.
  3. Experiments show gains of 2.0% on LIBERO and 5.4% on SimplerEnv, validating the method's effectiveness.

📝 Abstract (Translated)

VLA models encode visual observations as 2D patch tokens with no intrinsic geometric structure. This paper proposes GST-VLA with two main contributions. First, the Gaussian Spatial Tokenizer (GST) converts frozen dense depth and semantic patch features into 128 anisotropic 3D Gaussian primitives, each parameterized by a metric residual mean, a log-scale covariance, and a learned opacity. The covariance encodes local surface orientation, and the opacity provides per-primitive geometric confidence. Second, 3D Depth-Aware Chain-of-Thought (DA-CoT) reasoning supervises four structured intermediate spatial thoughts as explicit generation targets in the training loss. GST-VLA achieves 96.4% accuracy on LIBERO and 80.2% on SimplerEnv, validating the independent and synergistic gains of each component.
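The primitive parameterization described above can be sketched numerically. A minimal sketch, assuming a raw 7-channel output per primitive and a back-projected anchor point (both illustrative); only the count of 128 primitives, the residual mean, the exponentiated log-scale, and the opacity in (0, 1) follow the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

N_G = 128  # number of Gaussian primitives (from the paper)

# Hypothetical raw tokenizer outputs: 7 channels per primitive
# (3 mean residuals, 3 log-scales, 1 opacity logit).
raw = rng.normal(size=(N_G, 7))

# Metric residual mean mu in R^3: a residual offset added to a
# back-projected anchor point (anchors are a stand-in here).
anchors = rng.uniform(-1.0, 1.0, size=(N_G, 3))
mu = anchors + raw[:, 0:3]

# Log-scale covariance: exponentiating guarantees positive per-axis
# scales, giving an anisotropic covariance diag(sigma^2).
sigma = np.exp(raw[:, 3:6])

# Learned opacity alpha in (0, 1) via a sigmoid.
alpha = 1.0 / (1.0 + np.exp(-raw[:, 6]))

assert mu.shape == (N_G, 3)
assert np.all(sigma > 0) and np.all((alpha > 0) & (alpha < 1))
```

The exponential and sigmoid mappings are the standard way to keep scales positive and opacities bounded without constrained optimization.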

🔬 Method Details

Problem definition: This work addresses the lack of geometric structure when VLA models encode visual observations, which limits their understanding of 3D depth information. Existing methods cannot effectively exploit depth and semantic cues, which hurts performance.

Core idea: GST-VLA introduces a Gaussian Spatial Tokenizer that converts depth and semantic features into 3D Gaussian primitives, strengthening the model's capacity to represent geometric information. In addition, 3D Depth-Aware Chain-of-Thought (DA-CoT) reasoning supplies a structured reasoning process that improves inference.

Technical framework: The GST-VLA architecture comprises two main modules: the Gaussian Spatial Tokenizer and 3D depth-aware chain-of-thought reasoning. The GST module converts input features into Gaussian primitives, while the DA-CoT module is supervised to generate several intermediate spatial thoughts that guide the model's reasoning.
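The original abstract notes that the GST concentrates its fixed token budget via spatial attention pooling with learned queries. A minimal sketch of that pooling step, with illustrative dimensions (patch count, feature width), showing each learned query forming a weighted average over patch features:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)

P, D, N_G = 576, 64, 128   # patch count, feature dim, token budget (illustrative)

patches = rng.normal(size=(P, D))    # frozen depth + semantic patch features
queries = rng.normal(size=(N_G, D))  # learned pooling queries

# Each query attends over all patches; the attention weights let the
# fixed token budget concentrate on salient regions rather than
# distributing uniformly over the grid.
attn = softmax(queries @ patches.T / np.sqrt(D), axis=-1)  # (N_G, P)
tokens = attn @ patches                                    # (N_G, D)

assert tokens.shape == (N_G, D)
assert np.allclose(attn.sum(axis=-1), 1.0)
```

In the real model the queries are trained end to end, so the attention maps learn to focus on geometrically salient regions.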

Key innovation: The main innovations are the Gaussian spatial tokenizer and depth-aware chain-of-thought reasoning: the former encodes geometric information compactly through Gaussian primitives, and the latter improves reasoning accuracy and efficiency through structured intermediate thoughts. This stands in clear contrast to conventional VLA models.

Key design: Key design choices include the parameterization of the 128 anisotropic 3D Gaussian primitives and a loss function that combines flow-matching, chain-of-thought, and depth terms, ensuring the model learns geometric features effectively during training.
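The composite objective above can be sketched with toy stand-ins for each term. The loss weights, tensor shapes, and the specific forms (MSE for flow matching, cross-entropy for CoT tokens, L1 for depth) are assumptions for illustration, not values taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-ins for the three training signals: flow matching on action
# chunks, CoT token prediction, and depth supervision.
pred_vel = rng.normal(size=(8, 7))
target_vel = rng.normal(size=(8, 7))
cot_logits = rng.normal(size=(16, 100))     # 16 CoT tokens, vocab of 100
cot_labels = rng.integers(0, 100, size=16)
pred_depth = rng.normal(size=(24, 24))
gt_depth = rng.normal(size=(24, 24))

l_flow = np.mean((pred_vel - target_vel) ** 2)          # flow-matching MSE

log_probs = cot_logits - np.log(np.exp(cot_logits).sum(-1, keepdims=True))
l_cot = -np.mean(log_probs[np.arange(16), cot_labels])  # CoT cross-entropy

l_depth = np.mean(np.abs(pred_depth - gt_depth))        # depth L1

# Composite objective L_flow + L_CoT + L_depth; the weights below are
# illustrative assumptions.
loss = 1.0 * l_flow + 0.5 * l_cot + 0.1 * l_depth
assert loss > 0
```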

📊 Experimental Highlights

GST-VLA reaches 96.4% accuracy on LIBERO, a 2.0% improvement over the baseline, and 80.2% on SimplerEnv, a 5.4% improvement. These results show that GST-VLA excels on precision-demanding tasks and validate the effectiveness of each component.

🎯 Application Scenarios

Potential applications include robotic manipulation, autonomous driving, and augmented reality, where systems must understand and act in complex 3D environments. The approach may also advance multimodal learning and depth-aware perception, improving the interaction and decision-making capabilities of intelligent systems.

📄 Abstract (Original)

VLA models encode visual observations as 2D patch tokens with no intrinsic geometric structure. We introduce GST-VLA with two contributions. First, the Gaussian Spatial Tokenizer (GST) converts frozen dense depth and frozen semantic patch features into $N_g{=}128$ anisotropic 3D Gaussian primitives, each parameterized by a metric residual mean $\mu \in \mathbb{R}^3$, log-scale covariance $\log \sigma \in \mathbb{R}^3$, and learned opacity $\alpha \in (0,1)$. The covariance eigenstructure encodes local surface orientation, and opacity provides per-primitive geometric confidence, both inaccessible from scalar depth. Spatial attention pooling with learned queries concentrates the fixed token budget on geometrically salient regions rather than distributing uniformly. Second, 3D Depth-Aware Chain-of-Thought (DA-CoT) reasoning supervises four structured intermediate spatial thoughts, covering 3D object grounding, grasp affordance contact geometry, pairwise metric distances, and coarse SE(3) waypoints, as explicit generation targets in the training loss. A cross-attention sublayer at every VLM transformer block provides direct access to the raw 256-primitive Gaussian field during DA-CoT generation. A 300M-parameter flow-matching action expert with mixture-of-experts feedforward sublayers decodes 7-DoF delta action chunks via conditional ODE integration, conditioned on both VLM hidden states and DA-CoT outputs through dual cross-attention. Trained with composite $\mathcal{L}_\mathrm{flow} + \mathcal{L}_\mathrm{CoT} + \mathcal{L}_\mathrm{depth}$ across three progressive stages, GST-VLA achieves 96.4% on LIBERO (+2.0%), and 80.2% on SimplerEnv (+5.4%). Ablations isolate the contribution of each GST component, each DA-CoT thought, and each training stage, confirming independent and synergistic gains concentrated on precision demanding tasks.
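The conditional ODE integration used to decode action chunks can be illustrated with a forward-Euler loop over a velocity field. The `velocity_field` below is a toy stand-in for the learned 300M-parameter flow-matching expert, and the chunk length and step count are illustrative; only the 7-DoF action dimensionality comes from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)

ACTION_DIM = 7   # 7-DoF delta actions (from the paper)
CHUNK = 8        # chunk length (illustrative)

def velocity_field(x, t, cond):
    # Toy stand-in for the learned expert: in the real model this network
    # is conditioned on VLM hidden states and DA-CoT outputs through
    # dual cross-attention. Here it simply drives x toward cond.
    return cond - x

cond = rng.normal(size=(CHUNK, ACTION_DIM))  # hypothetical conditioning
x0 = rng.normal(size=(CHUNK, ACTION_DIM))    # noise sample at t = 0
x = x0.copy()

# Conditional ODE integration (forward Euler) from t = 0 to t = 1.
steps = 10
dt = 1.0 / steps
for i in range(steps):
    t = i * dt
    x = x + dt * velocity_field(x, t, cond)

actions = x  # decoded 7-DoF delta action chunk
assert actions.shape == (CHUNK, ACTION_DIM)
```

With this linear field, each Euler step contracts the sample toward the conditioning target, mirroring how flow-matching samplers transport noise to the action distribution.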