cs.CV (2025-10-19)

📊 17 papers in total | 🔗 2 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (6) · Pillar 2: RL & Architecture (6 🔗2) · Pillar 3: Perception & Semantics (4) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (6 papers)

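#	Title	One-line summary	Tags	🔗	⭐
1	Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input	Proposes Res-Bench to evaluate the robustness of multimodal large language models under dynamic-resolution input.	large language model multimodal
2	Enrich and Detect: Video Temporal Grounding with Multimodal LLMs	Proposes ED-VTG, which uses multimodal LLMs for fine-grained video temporal grounding.	large language model multimodal
3	Segmentation as A Plug-and-Play Capability for Frozen Multimodal LLMs	LENS: gives frozen multimodal LLMs a plug-and-play segmentation capability.	large language model multimodal
4	Training-free Online Video Step Grounding	Proposes BaGLM, which uses the zero-shot ability of large models for online video step grounding and outperforms offline trained methods.	large language model multimodal
5	Uncovering Brain-Like Hierarchical Patterns in Vision-Language Models through fMRI-Based Neural Encoding	Reveals brain-like hierarchical patterns in vision-language models through fMRI-based neural encoding.	multimodal
6	EventFormer: A Node-graph Hierarchical Attention Transformer for Action-centric Video Event Prediction	Proposes EventFormer for action-centric video event prediction and builds the large-scale AVEP dataset.	multimodal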

🔬 Pillar 2: RL & Architecture (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
7	Foundation Models in Medical Image Analysis: A Systematic Review and Meta-Analysis	A survey of foundation models in medical image analysis that systematically summarizes architectures, training paradigms, and clinical applications.	distillation foundation model multimodal
8	A Comprehensive Survey on World Models for Embodied AI	A comprehensive survey of world models for embodied AI, covering three dimensions: functionality, temporal modeling, and spatial representation.	world model embodied AI	✅
9	EMRRG: Efficient Fine-Tuning Pre-trained X-ray Mamba Networks for Radiology Report Generation	EMRRG: efficient fine-tuning of pre-trained X-ray Mamba networks for radiology report generation.	Mamba SSM large language model	✅
10	Video Reasoning without Training	Proposes V-Reason, which improves large models on video reasoning tasks without any training.	reinforcement learning multimodal chain-of-thought
11	Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback	Uniworld-V2: strengthens image editing with diffusion negative-aware finetuning and implicit MLLM feedback.	flow matching large language model multimodal
12	Where, Not What: Compelling Video LLMs to Learn Geometric Causality for 3D-Grounding	Proposes the W2R2 framework to address the 2D semantic bias of video LLMs in 3D grounding.	representation learning multimodal

🔬 Pillar 3: Perception & Semantics (4 papers)

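#	Title	One-line summary	Tags	🔗	⭐
13	SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes	SceneCOT: proposes a grounded chain-of-thought reasoning framework for 3D scenes that improves embodied question answering.	scene understanding large language model multimodal
14	2DGS-R: Revisiting the Normal Consistency Regularization in 2D Gaussian Splatting	2DGS-R: improves the rendering quality and geometric accuracy of 2D Gaussian Splatting through hierarchical training and in-place cloning.	3D gaussian splatting 3DGS gaussian splatting
15	GS2POSE: Marry Gaussian Splatting to 6D Object Pose Estimation	GS2POSE: a 6D object pose estimation method that incorporates Gaussian Splatting.	3DGS gaussian splatting splatting
16	How Universal Are SAM2 Features?	Quantifies the gap in feature generalization between general-purpose vision models and segmentation-specialized models.	depth estimation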

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
17	HumanCM: One Step Human Motion Prediction	Proposes HumanCM, a one-step human motion prediction framework based on consistency models.	spatiotemporal
