cs.CV (2025-03-25)

📊 23 papers in total | 🔗 10 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (7 🔗3) · Pillar 2: RL Algorithms & Architecture (6 🔗3) · Pillar 3: Spatial Perception & Semantics (5 🔗2) · Pillar 1: Robot Control (4 🔗2) · Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (7 papers)

| # | Title | One-line takeaway | Tags |
|---|---|---|---|
| 1 | CoLLM: A Large Language Model for Composed Image Retrieval | Proposes CoLLM, using large language models to tackle data scarcity and multimodal fusion in composed image retrieval. | large language model, multimodal |
| 2 | Hyperdimensional Uncertainty Quantification for Multimodal Uncertainty Fusion in Autonomous Vehicles Perception | Proposes HyperDUM, using hyperdimensional computing to efficiently quantify multimodal uncertainty in autonomous driving perception. | multimodal |
| 3 | FullDiT: Multi-Task Video Generative Foundation Model with Full Attention | FullDiT: a multi-task video generation foundation model built on full attention. | foundation model |
| 4 | PAVE: Patching and Adapting Video Large Language Models | PAVE: strengthens the multimodal understanding of video large language models via lightweight adapters. | large language model |
| 5 | Towards Online Multi-Modal Social Interaction Understanding | Proposes the Online-MMSI-VLM framework for online multimodal social interaction understanding, targeting real-time human-machine interaction. | large language model, multimodal |
| 6 | Audio-centric Video Understanding Benchmark without Text Shortcut | Proposes AVUT, an audio-centric video understanding benchmark that eliminates text shortcuts. | large language model, multimodal |
| 7 | RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models | Proposes RGB-Th-Bench for evaluating how well vision-language models understand RGB-thermal image pairs. | multimodal |

🔬 Pillar 2: RL Algorithms & Architecture (6 papers)

| # | Title | One-line takeaway | Tags |
|---|---|---|---|
| 8 | Enhanced Spatiotemporal Consistency for Image-to-LiDAR Data Pretraining | SuperFlow++: an image-to-LiDAR pretraining framework with enhanced spatiotemporal consistency. | representation learning, contrastive learning, scene understanding |
| 9 | A-MESS: Anchor based Multimodal Embedding with Semantic Synchronization for Multimodal Intent Recognition | Proposes the A-MESS framework, improving multimodal intent recognition via anchor-based multimodal embedding and semantic synchronization. | contrastive learning, large language model, multimodal |
| 10 | DGTRSD & DGTRS-CLIP: A Dual-Granularity Remote Sensing Image-Text Dataset and Vision Language Foundation Model for Alignment | Proposes the DGTRSD dataset and DGTRS-CLIP model for dual-granularity remote sensing image-text alignment. | curriculum learning, foundation model |
| 11 | CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning | CAFe: unifies representation and generation through contrastive-autoregressive finetuning. | representation learning, multimodal |
| 12 | Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations | Proposes BYOV, learning fine-grained view-invariant video representations via masked self-supervised ego-exo modeling. | representation learning, egocentric |
| 13 | Scaling Vision Pre-Training to 4K Resolution | PS3: scales CLIP-style vision pretraining to 4K resolution via localized contrastive learning. | representation learning, contrastive learning |

🔬 Pillar 3: Spatial Perception & Semantics (5 papers)

| # | Title | One-line takeaway | Tags |
|---|---|---|---|
| 14 | OpenLex3D: A Tiered Evaluation Benchmark for Open-Vocabulary 3D Scene Representations | OpenLex3D: a tiered evaluation benchmark for open-vocabulary 3D scene representations. | scene understanding, open-vocabulary |
| 15 | LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation | Proposes LPOSS+, achieving open-vocabulary semantic segmentation by refining vision-language model predictions with label propagation. | open-vocabulary |
| 16 | Show or Tell? Effectively prompting Vision-Language Models for semantic segmentation | Studies how to effectively prompt vision-language models for semantic segmentation and proposes PromptMatcher. | open-vocabulary, foundation model |
| 17 | The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs | Releases Coralscapes, a coral reef image dataset for semantic understanding of reef scenes. | scene understanding |
| 18 | Vanishing Depth: A Depth Adapter with Positional Depth Encoding for Generalized Image Encoders | Proposes Vanishing Depth, augmenting generalized image encoders with positional depth encoding for metric depth understanding. | metric depth |

🔬 Pillar 1: Robot Control (4 papers)

| # | Title | One-line takeaway | Tags |
|---|---|---|---|
| 19 | TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization | TokenHSI: unified synthesis of physical human-scene interactions through task tokenization. | humanoid, physically plausible, human-scene interaction |
| 20 | OpenSDI: Spotting Diffusion-Generated Images in the Open World | Proposes the OpenSDI dataset and SPM framework for detecting and localizing diffusion-generated images in the open world. | manipulation, masked autoencoder (MAE) |
| 21 | G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation | G-DexGrasp: generalizable dexterous grasp synthesis via part-aware prior retrieval and prior-assisted generation. | dexterous hand, affordance |
| 22 | PartRM: Modeling Part-Level Dynamics with Large Cross-State Reconstruction Model | PartRM: models part-level dynamics with a large cross-state reconstruction model. | manipulation, world model |

🔬 Pillar 4: Generative Motion (1 paper)

| # | Title | One-line takeaway | Tags |
|---|---|---|---|
| 23 | Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models | Learns 3D object spatial relationships from pretrained 2D diffusion models. | motion synthesis, spatial relationship |
