cs.CV (2025-12-26)

📊 15 papers in total | 🔗 1 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (6) | Pillar 2: RL & Architecture (3) | Pillar 8: Physics-based Animation (2) | Pillar 1: Robot Control (2) | Pillar 4: Generative Motion (1) | Pillar 3: Perception & Semantics (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception	iSHIFT: a lightweight slow-fast GUI agent with adaptive perception that improves interaction efficiency and accuracy	large language model, multimodal, visual grounding
2	See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning	Proposes bi-directional perceptual shaping to address insufficient visual evidence in multimodal reasoning	multimodal
3	Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models	Proposes BadVSFM, a backdoor attack framework targeting prompt-driven video segmentation foundation models	foundation model
4	Perceive and Calibrate: Analyzing and Enhancing Robustness of Medical Multi-Modal Large Language Models	Proposes the Inherent-enhanced Multi-modal Calibration framework to improve the robustness of medical multimodal large language models under noisy conditions	large language model
5	SLIM-Brain: A Data- and Training-Efficient Foundation Model for fMRI Data Analysis	SLIM-Brain: a data- and training-efficient foundation model for fMRI analysis	foundation model
6	Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models	Proposes DIOR, a training-free conditional image embedding framework that leverages large vision-language models	foundation model

🔬 Pillar 2: RL & Architecture (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
7	Patch as Node: Human-Centric Graph Representation Learning for Multimodal Action Recognition	Proposes PAN, a human-centric graph representation learning framework for multimodal action recognition	representation learning, spatiotemporal, multimodal
8	VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning	Proposes VideoZoomer, which uses reinforcement learning to dynamically focus on key frames for long-video reasoning	reinforcement learning, large language model, multimodal
9	Yume-1.5: A Text-Controlled Interactive World Generation Model	Yume-1.5: a text-controlled interactive world generation model with improved real-time performance and controllability	linear attention, distillation

🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
10	End-to-End 3D Spatiotemporal Perception with Multimodal Fusion and V2X Collaboration	Proposes XET-V2X for end-to-end 3D spatiotemporal perception with multimodal fusion in V2X scenarios	spatiotemporal, multimodal
11	LongFly: Long-Horizon UAV Vision-and-Language Navigation with Spatiotemporal Context Integration	LongFly: proposes a spatiotemporal context integration framework for long-horizon UAV vision-and-language navigation	spatiotemporal, VLN, multimodal

🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
12	VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement	VULCAN: a tool-augmented multi-agent method for iterative 3D object arrangement	manipulation, scene understanding, large language model
13	Attack-Aware Deepfake Detection under Counter-Forensic Manipulations	Proposes an attack-aware deepfake detector with improved robustness and trustworthiness under counter-forensic manipulations	manipulation

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
14	DeMoGen: Towards Decompositional Human Motion Generation with Energy-Based Diffusion Models	DeMoGen: proposes an energy-based diffusion model for decompositional human motion generation	text-to-motion, motion generation, human motion

🔬 Pillar 3: Perception & Semantics (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
15	Reloc-VGGT: Visual Re-localization with Geometry Grounded Transformer	Proposes Reloc-VGGT, which uses a geometry-grounded Transformer for robust and efficient visual re-localization	VGGT, spatial relationship	✅
