cs.CV (2025-05-20)

📊 41 papers total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (16 🔗4) · Pillar 9: Embodied Foundation Models (12 🔗5) · Pillar 3: Perception & Semantics (6 🔗3) · Pillar 1: Robot Control (3) · Pillar 5: Interaction & Reaction (1) · Pillar 4: Generative Motion (1) · Pillar 6: Video Extraction (1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 2: RL Algorithms & Architecture (16 papers)

# | Title | One-line summary | Tags | 🔗
1 | UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning | Proposes UniVG-R1, which strengthens reasoning via reinforcement learning to tackle universal visual grounding. | reinforcement learning, large language model, multimodal
2 | UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation | UniGen: unified multimodal understanding and generation through enhanced training and test-time strategies. | direct preference optimization, large language model, multimodal
3 | Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning | Visionary-R1: mitigates shortcut learning in visual reasoning via reinforcement learning. | reinforcement learning, large language model, multimodal
4 | Programmatic Video Prediction Using Large Language Models | ProgGen: interpretable, programmatic video prediction using large language models. | world model, large language model
5 | Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency | Introduces the TemRobBench benchmark and the PanoDPO optimization method, improving the robustness of large models under temporal-inconsistency perturbations. | direct preference optimization, multimodal
6 | VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank | Proposes VisualQuality-R1, reasoning-driven image quality assessment via reinforcement learning to rank. | reinforcement learning, large language model
7 | DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning | DeepEyes: uses reinforcement learning to incentivize "thinking with images" in vision-language models. | reinforcement learning, multimodal
8 | Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method | Introduces the OmniVQA dataset and the 360-R1 method to improve panoramic visual question answering. | reinforcement learning, embodied AI, large language model
9 | StPR: Spatiotemporal Preservation and Routing for Exemplar-Free Video Class-Incremental Learning | Proposes the StPR framework, which disentangles and preserves spatiotemporal information for exemplar-free video class-incremental learning. | distillation, spatiotemporal
10 | Intra-class Patch Swap for Self-Distillation | Proposes a self-distillation method based on intra-class patch swapping that improves performance without a teacher network. | teacher-student, distillation
11 | MultiMAE Meets Earth Observation: Pre-training Multi-modal Multi-task Masked Autoencoders for Earth Observation Tasks | Proposes MultiMAE pre-training for Earth observation, improving downstream-task performance on multimodal remote-sensing data. | masked autoencoder
12 | RETRO: REthinking Tactile Representation Learning with Material PriOrs | Introduces material-aware priors to improve the accuracy of tactile representation learning. | representation learning
13 | Unify Graph Learning with Text: Unleashing LLM Potentials for Session Search | Proposes the symbolic graph ranker SGR, which uses LLMs to unify graph learning with textual information, improving session search. | contrastive learning, large language model
14 | Scaling Vision Mamba Across Resolutions via Fractal Traversal | FractalMamba++: a vision Mamba built on fractal traversal, improving adaptability across resolutions. | Mamba
15 | Physics-Driven Local-Whole Elastic Deformation Modeling for Point Cloud Representation Learning | Proposes physics-driven local-whole elastic deformation modeling to improve point-cloud representation learning. | representation learning
16 | Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels | Ground-V: pixel-level instruction tuning that improves VLM grounding in complex scenes. | distillation, instruction following

🔬 Pillar 9: Embodied Foundation Models (12 papers)

# | Title | One-line summary | Tags | 🔗
17 | Speculative Decoding Reimagined for Multimodal Large Language Models | Proposes multimodal speculative decoding (MSD) to accelerate inference for multimodal large language models. | large language model, multimodal
18 | ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations | ViC-Bench: evaluates the visual-interleaved chain-of-thought capability of MLLMs using free-style intermediate visual states. | large language model, chain-of-thought
19 | EmoSign: A Multimodal Dataset for Understanding Emotions in American Sign Language | EmoSign: a multimodal dataset for emotion understanding in American Sign Language, filling a gap in affective sign-language research. | multimodal
20 | RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding | Introduces the RAVENEA benchmark, using retrieval augmentation to address the shortfall in visual culture understanding in multimodal settings. | multimodal
21 | Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models | VidCom2: a plug-and-play inference-acceleration framework for video LLMs that improves efficiency while preserving performance. | large language model
22 | LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | LoVR: a benchmark dataset for long-video retrieval in multimodal contexts. | multimodal
23 | Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach | Proposes Llama-SMoP, a scalable LLM-based speech recognition method built on a sparse mixture of projectors. | large language model, multimodal
24 | RADAR: Enhancing Radiology Report Generation with Supplementary Knowledge Injection | RADAR: enhances radiology report generation through supplementary knowledge injection. | large language model, multimodal
25 | VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation | Proposes VideoEval-Pro for more robust and realistic long-video understanding evaluation. | multimodal
26 | Unlocking the Power of SAM 2 for Few-Shot Segmentation | A few-shot segmentation method built on SAM 2 that addresses matching foreground objects across different identities. | foundation model
27 | Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting | Dolphin: document image parsing via heterogeneous anchor prompting. | multimodal
28 | AppleGrowthVision: A large-scale stereo dataset for phenological analysis, fruit detection, and 3D reconstruction in apple orchards | AppleGrowthVision: a large-scale stereo dataset for phenological analysis, fruit detection, and 3D reconstruction in apple orchards. | multimodal

🔬 Pillar 3: Spatial Perception & Semantics (6 papers)

# | Title | One-line summary | Tags | 🔗
29 | MGStream: Motion-aware 3D Gaussian for Streamable Dynamic Scene Reconstruction | MGStream: motion-aware 3D Gaussians for streamable dynamic scene reconstruction, addressing flickering artifacts and storage inefficiency. | 3D gaussian splatting, 3DGS, gaussian splatting
30 | M3Depth: Wavelet-Enhanced Depth Estimation on Mars via Mutual Boosting of Dual-Modal Data | M3Depth: depth estimation on the Martian surface via mutual boosting of dual-modal data. | depth estimation, stereo depth
31 | Personalize Your Gaussian: Consistent 3D Scene Personalization from a Single Image | Proposes the CP-GS framework, addressing viewpoint bias in personalized 3D scene generation from a single image. | 3D gaussian splatting, 3DGS, gaussian splatting
32 | Multi-Label Stereo Matching for Transparent Scene Depth Estimation | Proposes multi-label stereo matching for transparent-scene depth estimation, relaxing the unimodal-distribution assumption of traditional methods. | depth estimation, scene reconstruction
33 | Diving into the Fusion of Monocular Priors for Generalized Stereo Matching | Proposes monocular-prior fusion based on local ordering and adaptive alignment, improving the generalization of stereo matching. | monocular depth, scene flow, foundation model
34 | 4D-ROLLS: 4D Radar Occupancy Learning via LiDAR Supervision | Proposes 4D-ROLLS, which learns 4D radar occupancy prediction under LiDAR supervision, improving perception in adverse environments. | height map

🔬 Pillar 1: Robot Control (3 papers)

# | Title | One-line summary | Tags | 🔗
35 | Emerging Properties in Unified Multimodal Pretraining | BAGEL: an open-source unified pretraining model supporting multimodal understanding and generation. | manipulation, multimodal
36 | Vid2World: Crafting Video Diffusion Models to Interactive World Models | Vid2World: crafts video diffusion models into interactive world models. | manipulation, world model
37 | Visual Agentic Reinforcement Fine-Tuning | Proposes Visual-ARFT, improving LVLM reasoning and generalization on multimodal agent tasks. | manipulation, multimodal

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-line summary | Tags | 🔗
38 | Beyond Words: Multimodal LLM Knows When to Speak | Proposes MM-When2Speak, using multimodal cues to improve the accuracy of response-timing prediction in conversation. | dyadic interaction, large language model, multimodal

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | One-line summary | Tags | 🔗
39 | EGFormer: Towards Efficient and Generalizable Multimodal Semantic Segmentation | EGFormer: an efficient and generalizable multimodal semantic segmentation framework. | MDM, multimodal

🔬 Pillar 6: Video Extraction & Matching (1 paper)

# | Title | One-line summary | Tags | 🔗
40 | Egocentric Action-aware Inertial Localization in Point Clouds with Vision-Language Guidance | Proposes the EAIL framework, using vision-language guidance for egocentric action-aware inertial localization in point clouds from a head-mounted IMU. | egocentric, multimodal

🔬 Pillar 8: Physics-based Animation (1 paper)

# | Title | One-line summary | Tags | 🔗
41 | Dynadiff: Single-stage Decoding of Images from Continuously Evolving fMRI | Dynadiff: single-stage decoding of images from continuously evolving fMRI, improving temporal resolution and semantic reconstruction. | diff-sim
