cs.CV (2025-03-25)

📊 23 papers in total | 🔗 10 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (7 🔗3) · Pillar 2: RL & Architecture (6 🔗3) · Pillar 3: Perception & Semantics (5 🔗2) · Pillar 1: Robot Control (4 🔗2) · Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (7 papers)

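#	Title	One-line summary	Tags	🔗	⭐
1	CoLLM: A Large Language Model for Composed Image Retrieval	Proposes CoLLM, which uses large language models to address data scarcity and multimodal fusion challenges in composed image retrieval.	large language model multimodal
2	Hyperdimensional Uncertainty Quantification for Multimodal Uncertainty Fusion in Autonomous Vehicles Perception	Proposes HyperDUM, which uses hyperdimensional computing to efficiently quantify multimodal uncertainty in autonomous-vehicle perception.	multimodal
3	FullDiT: Multi-Task Video Generative Foundation Model with Full Attention	FullDiT: a multi-task video generative foundation model built on full attention.	foundation model
4	PAVE: Patching and Adapting Video Large Language Models	PAVE: strengthens the multimodal understanding of video large language models through lightweight adapters.	large language model	✅
5	Towards Online Multi-Modal Social Interaction Understanding	Proposes the Online-MMSI-VLM framework for online multimodal social interaction understanding, targeting real-time human-machine interaction.	large language model multimodal	✅
6	Audio-centric Video Understanding Benchmark without Text Shortcut	Proposes AVUT, an audio-centric video understanding benchmark that addresses the text-shortcut problem.	large language model multimodal	✅
7	RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models	Proposes RGB-Th-Bench for evaluating how well vision-language models understand RGB-thermal image pairs.	multimodal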

🔬 Pillar 2: RL & Architecture (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
8	Enhanced Spatiotemporal Consistency for Image-to-LiDAR Data Pretraining	SuperFlow++: an image-to-LiDAR data pretraining framework with enhanced spatiotemporal consistency.	representation learning contrastive learning scene understanding	✅
9	A-MESS: Anchor based Multimodal Embedding with Semantic Synchronization for Multimodal Intent Recognition	Proposes the A-MESS framework, which improves multimodal intent recognition through anchor-based multimodal embedding and semantic synchronization.	contrastive learning large language model multimodal
10	DGTRSD & DGTRS-CLIP: A Dual-Granularity Remote Sensing Image-Text Dataset and Vision Language Foundation Model for Alignment	Proposes the DGTRSD dataset and the DGTRS-CLIP model for dual-granularity alignment of remote sensing images and text.	curriculum learning foundation model	✅
11	CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning	CAFe: unifies representation and generation tasks through contrastive-autoregressive fine-tuning.	representation learning multimodal
12	Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations	Proposes BYOV, which learns view-invariant, fine-grained video representations through masked self-supervised learning.	representation learning egocentric	✅
13	Scaling Vision Pre-Training to 4K Resolution	PS3: scales CLIP-style vision pre-training to 4K resolution through localized contrastive learning.	representation learning contrastive learning

🔬 Pillar 3: Perception & Semantics (5 papers)

#	Title	One-line summary	Tags	🔗	⭐
14	OpenLex3D: A Tiered Evaluation Benchmark for Open-Vocabulary 3D Scene Representations	OpenLex3D: a tiered evaluation benchmark for open-vocabulary 3D scene representations.	scene understanding open-vocabulary open vocabulary	✅
15	LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation	Proposes LPOSS+, which uses label propagation to improve vision-language models for open-vocabulary semantic segmentation.	open-vocabulary open vocabulary	✅
16	Show or Tell? Effectively prompting Vision-Language Models for semantic segmentation	Studies how to prompt vision-language models effectively for semantic segmentation and proposes PromptMatcher.	open-vocabulary open vocabulary foundation model
17	The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs	Releases Coralscapes, a coral reef image dataset for semantic understanding of coral reef scenes.	scene understanding
18	Vanishing Depth: A Depth Adapter with Positional Depth Encoding for Generalized Image Encoders	Proposes Vanishing Depth, which augments generalized image encoders with positional depth encoding to enable metric depth understanding.	metric depth

🔬 Pillar 1: Robot Control (4 papers)

#	Title	One-line summary	Tags	🔗	⭐
19	TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization	TokenHSI: unified synthesis of physical human-scene interactions through task tokenization.	humanoid physically plausible human-scene interaction	✅
20	OpenSDI: Spotting Diffusion-Generated Images in the Open World	Proposes the OpenSDI dataset and the SPM framework for detecting and localizing diffusion-generated images in the open world.	manipulation masked autoencoder MAE	✅
21	G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation	G-DexGrasp: generalizable dexterous grasp synthesis through part-aware prior retrieval and prior-assisted generation.	dexterous hand affordance
22	PartRM: Modeling Part-Level Dynamics with Large Cross-State Reconstruction Model	PartRM: models part-level dynamics with a large cross-state reconstruction model.	manipulation world model

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
23	Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models	Learns 3D object spatial relationships from pre-trained 2D diffusion models.	motion synthesis spatial relationship
