cs.CV (2025-06-11)

📊 40 papers in total | 🔗 10 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗4) · Pillar 3: Perception & Semantics (9 🔗3) · Pillar 2: RL & Architecture (9 🔗2) · Pillar 1: Robot Control (5) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 5: Interaction & Reaction (1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
1	Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search	Proposes the AutoCaption framework to address the evaluation of video captioning	large language model multimodal	✅
2	EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models	Proposes EfficientVLA to accelerate and compress VLA models	vision-language-action VLA
3	Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy	Proposes Kvasir-VQA-x1 to address the shortage of medical visual question answering datasets	large language model multimodal	✅
4	OctoNav: Towards Generalist Embodied Navigation	Proposes OctoNav to unify multimodal navigation tasks	embodied AI VLA VLN
5	AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation	Proposes AnimateAnyMesh to address high-quality animation generation for 3D models	foundation model
6	Class Similarity-Based Multimodal Classification under Heterogeneous Category Sets	Proposes a class-similarity-based multimodal classification method to handle heterogeneous category sets	multimodal
7	Prompt-Guided Latent Diffusion with Predictive Class Conditioning for 3D Prostate MRI Generation	Proposes CCELLA to address the scarcity of medical imaging data	large language model foundation model	✅
8	HSENet: Hybrid Spatial Encoding Network for 3D Medical Vision-Language Understanding	Proposes HSENet to address language-vision fusion in 3D medical image understanding	large language model multimodal	✅
9	Revisit What You See: Disclose Language Prior in Vision Tokens for LVLM Decoding	Proposes ReVisiT to address the insufficient use of visual information in LVLM decoding	multimodal visual grounding
10	Digitization of Document and Information Extraction using OCR	Proposes a framework combining OCR with large language models to improve the accuracy of document information extraction	large language model
11	DreamCS: Geometry-Aware Text-to-3D Generation with Unpaired 3D Reward Supervision	Proposes DreamCS to address geometric bias in text-to-3D generation	large language model
12	Q-SAM2: Accurate Quantization for Segment Anything Model 2	Proposes Q-SAM2 to address quantization of SAM2 for resource-constrained devices	foundation model
13	LLM-to-Phy3D: Physically Conform Online 3D Object Generation with LLMs	Proposes LLM-to-Phy3D to address 3D object generation under physical constraints	large language model

🔬 Pillar 3: Perception & Semantics (9 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
14	DynaSplat: Dynamic-Static Gaussian Splatting with Hierarchical Motion Decomposition for Scene Reconstruction	Proposes DynaSplat to address dynamic scene reconstruction	gaussian splatting splatting scene reconstruction
15	HAIF-GS: Hierarchical and Induced Flow-Guided Gaussian Splatting for Dynamic Scene	Proposes HAIF-GS to address consistency in dynamic scene reconstruction	3D gaussian splatting 3DGS gaussian splatting
16	Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation	Proposes the Vireo framework to address open-vocabulary domain-generalized semantic segmentation	open-vocabulary open vocabulary foundation model	✅
17	Accurate and efficient zero-shot 6D pose estimation with frozen foundation models	Proposes FreeZeV2 to address zero-shot 6D pose estimation	6D pose estimation foundation model
18	Self-Supervised Multi-Part Articulated Objects Modeling via Deformable Gaussian Splatting and Progressive Primitive Segmentation	Proposes the DeGSS framework to address the modeling of multi-part articulated objects	gaussian splatting splatting
19	Gaussian Herding across Pens: An Optimal Transport Perspective on Global Gaussian Reduction for 3DGS	Proposes a global Gaussian mixture reduction method to address the memory cost of 3DGS rendering	3D gaussian splatting 3DGS gaussian splatting	✅
20	MetricHMSR: Metric Human Mesh and Scene Recovery from Monocular Images	Proposes MetricHMSR to address human pose and scene recovery from monocular images	depth estimation metric depth
21	The Less You Depend, The More You Learn: Synthesizing Novel Views from Sparse, Unposed Images without Any 3D Knowledge	Proposes a novel view synthesis method for sparse, unposed images	3DGS NeRF	✅
22	Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes	Proposes a method to predict the sounds of hand interactions in 3D scenes	scene reconstruction

🔬 Pillar 2: RL & Architecture (9 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
23	UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting	Proposes UniPre3D to address unified representation learning for 3D point clouds	representation learning gaussian splatting splatting	✅
24	Revisiting Visual Understanding in Multimodal Reasoning through a Lens of Image Perturbation	Proposes a visual perturbation framework to improve multimodal reasoning	DPO large language model multimodal	✅
25	3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation	Proposes a geometric distillation method to improve the 3D understanding of vision-language models	distillation VGGT foundation model
26	SemanticSplat: Feed-Forward 3D Scene Understanding with Language-Aware Gaussian Fields	Proposes SemanticSplat to address semantic and geometric modeling in 3D scene understanding	distillation scene understanding open-vocabulary
27	Towards a general-purpose foundation model for fMRI analysis	Proposes NeuroSTORM to address reproducibility and transferability in fMRI analysis	Mamba foundation model
28	ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs	Proposes ViCrit to address visual perception in vision-language models	reinforcement learning large language model
29	PlayerOne: Egocentric World Simulator	Proposes PlayerOne to address the challenges of real-world simulation	world model egocentric
30	Synthetic Geology: Structural Geology Meets Deep Learning	Proposes StructuralGeo to address data scarcity in geological reconstruction	flow matching foundation model
31	MMME: A Spontaneous Multi-Modal Micro-Expression Dataset Enabling Visual-Physiological Fusion	Proposes the MMME dataset to address multimodal micro-expression analysis	MAE multimodal

🔬 Pillar 1: Robot Control (5 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
32	Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing	Proposes using drawing to strengthen the spatial reasoning of vision-language models	manipulation reinforcement learning spatial relationship
33	CHIP: A multi-sensor dataset for 6D pose estimation of chairs in industrial settings	Proposes the CHIP dataset to address 6D pose estimation of chairs in industrial settings	manipulation 6D pose estimation
34	Benchmarking Gaslighting Negation Attacks Against Reasoning Models	Proposes GaslightingBench-R to evaluate the resistance of reasoning models to negation attacks	manipulation multimodal chain-of-thought
35	VITA: Zero-Shot Value Functions via Test-Time Adaptation of Vision-Language Models	Proposes VITA to address zero-shot value functions for vision-language models	manipulation reinforcement learning offline reinforcement learning
36	CheckManual: A New Challenge and Benchmark for Manual-based Appliance Manipulation	Proposes the CheckManual benchmark to address the challenge of manual-based appliance manipulation	manipulation

🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
37	LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning	Proposes a mask-aware LoRA fine-tuning method for flexible video editing	spatiotemporal	✅
38	MPFNet: A Multi-Prior Fusion Network with a Progressive Training Strategy for Micro-Expression Recognition	Proposes MPFNet to address multi-source information fusion in micro-expression recognition	spatiotemporal

🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
39	InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions	Proposes the InterActHuman framework to address multi-concept human animation	human-object interaction spatiotemporal

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
40	A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs	Proposes a minimal-video-pair benchmark to address physical understanding in video-language models	egocentric
