cs.CV (2026-02-25)

📊 37 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (14 🔗5) Pillar 2: RL & Architecture (7) Pillar 3: Perception & Semantics (6 🔗2) Pillar 7: Motion Retargeting (4) Pillar 1: Robot Control (3 🔗2) Pillar 4: Generative Motion (2) Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (14 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
1	UniVBench: Towards Unified Evaluation for Video Foundation Models	Proposes UniVBench to address the fragmented evaluation of video foundation models	foundation model multimodal instruction following
2	MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving	MindDriver: a progressive multimodal reasoning framework for autonomous driving	multimodal chain-of-thought	✅
3	SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model	SkyReels-V4: a foundation model unifying multimodal video-audio generation, inpainting, and editing	large language model foundation model multimodal
4	RGB-Event HyperGraph Prompt for Kilometer Marker Recognition based on Pre-trained Foundation Models	Proposes a pre-trained model with RGB-Event hypergraph prompting for metro kilometer marker recognition in GNSS-denied environments	foundation model	✅
5	Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models	Proposes dynamic multimodal activation steering to mitigate hallucinations in large vision-language models	multimodal
6	E-comIQ-ZH: A Human-Aligned Dataset and Benchmark for Fine-Grained Evaluation of E-commerce Posters with Chain-of-Thought	Proposes the E-comIQ-ZH framework for fine-grained quality evaluation of Chinese e-commerce posters, addressing the text artifacts that existing methods overlook	chain-of-thought	✅
7	CARE: A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis	Proposes CARE: a molecular-guided foundation model with adaptive region modeling for pathology whole slide image analysis	foundation model
8	SEF-MAP: Subspace-Decomposed Expert Fusion for Robust Multimodal HD Map Prediction	SEF-MAP: subspace-decomposed expert fusion for robust multimodal HD map prediction	multimodal
9	WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs	WeaveTime: improves the temporal understanding of VideoLLMs in streaming scenarios by weaving information from earlier frames into emergent memory	large language model multimodal	✅
10	Global-Local Dual Perception for MLLMs in High-Resolution Text-Rich Image Translation	GLoTran: a global-local dual-perception MLLM framework for high-resolution text-rich image translation	large language model multimodal
11	StoryMovie: A Dataset for Semantic Alignment of Visual Stories with Movie Scripts and Subtitles	The StoryMovie dataset aligns visual stories with movie scripts and subtitles to improve the accuracy of semantic relations in visual stories	visual grounding TAMP
12	TranX-Adapter: Bridging Artifacts and Semantics within MLLMs for Robust AI-generated Image Detection	Proposes TranX-Adapter to make MLLMs more robust at detecting AI-generated images	large language model multimodal
13	NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors	NoLan: mitigates object hallucinations in large vision-language models by dynamically suppressing language priors	multimodal	✅
14	RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations	RobustVisRAG: a causality-aware retrieval-augmented generation framework robust to visual degradations	multimodal

🔬 Pillar 2: RL & Architecture (7 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
15	See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs	Proposes a training-free iterative framework that supervises LVLM multimodal reasoning with visual evidence to improve visual consistency	reinforcement learning multimodal chain-of-thought
16	Which Tool Response Should I Trust? Tool-Expertise-Aware Chest X-ray Agent with Multimodal Agentic Learning	Proposes a tool-expertise-aware chest X-ray agent that resolves tool conflicts through multimodal agentic learning	reinforcement learning multimodal
17	How to Take a Memorable Picture? Empowering Users with Actionable Feedback	Proposes MemCoach, which improves image memorability through actionable feedback, helping users take more memorable photos	teacher-student large language model multimodal
18	Enhancing Multi-Modal LLMs Reasoning via Difficulty-Aware Group Normalization	Proposes difficulty-aware group normalization (Durian) to improve the reasoning ability of multimodal LLMs	reinforcement learning large language model multimodal
19	PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning	PanoEnv: explores 3D spatial intelligence in panoramic environments with reinforcement learning	reinforcement learning scene understanding
20	Solaris: Building a Multiplayer Video World Model in Minecraft	Solaris: builds a multiplayer video world model in Minecraft with consistent multi-view simulation	world model
21	CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning	Proposes CCCaption, which uses dual-reward reinforcement learning to generate complete and correct image captions	reinforcement learning

🔬 Pillar 3: Perception & Semantics (6 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
22	Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction	GPOcc: uses generalizable visual geometry priors for sparse Gaussian occupancy prediction	monocular depth Depth Anything scene understanding	✅
23	SF3D-RGB: Scene Flow Estimation from Monocular Camera and Sparse LiDAR	SF3D-RGB: scene flow estimation by fusing a monocular camera with sparse LiDAR	scene flow
24	Lie Flow: Video Dynamic Fields Modeling and Predicting with Lie Algebra as Geometric Physics Principle	LieFlow: models and predicts video dynamic fields using Lie algebra as a geometric physics principle	NeRF geometric consistency
25	Pseudo-View Enhancement via Confidence Fusion for Unposed Sparse-View Reconstruction	Proposes confidence-fusion-based pseudo-view enhancement for unposed sparse-view 3D reconstruction	scene reconstruction geometric consistency
26	Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping	Proposes MoGaF, a framework based on motion-aware Gaussian grouping, for space-time forecasting of dynamic scenes	gaussian splatting splatting	✅
27	Joint Shadow Generation and Relighting via Light-Geometry Interaction Maps	Proposes light-geometry interaction maps for joint shadow generation and relighting from monocular depth information	monocular depth

🔬 Pillar 7: Motion Retargeting (4 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
28	Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos	Proposes LFG: a large-scale autonomous driving pretraining method based on unlabeled in-the-wild videos	motion prediction foundation model
29	MultiAnimate: Pose-Guided Image Animation Made Extensible	Proposes MultiAnimate to resolve identity confusion and occlusion in multi-character pose-guided image animation	spatial relationship character animation
30	SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance	SemVideo: reconstructs watched video content from brain activity via hierarchical semantic guidance	motion adaptation
31	Easy3E: Feed-Forward 3D Asset Editing via Rectified Voxel Flow	Easy3E: a feed-forward 3D asset editing framework based on rectified voxel flow	geometric consistency

🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
32	Structure-to-Image: Zero-Shot Depth Estimation in Colonoscopy via High-Fidelity Sim-to-Real Adaptation	Proposes a structure-to-image approach to zero-shot depth estimation in colonoscopy that addresses structural distortion	sim-to-real depth estimation monocular depth	✅
33	WHOLE: World-Grounded Hand-Object Lifted from Egocentric Videos	WHOLE: reconstructs hand-object interactions in world coordinates from egocentric videos	manipulation egocentric motion estimation	✅
34	WeatherCity: Urban Scene Reconstruction with Controllable Multi-Weather Transformation	WeatherCity: an urban scene reconstruction framework with controllable multi-weather transformation	manipulation scene reconstruction

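🔬 Pillar 4: Generative Motion (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
35	UniHand: A Unified Model for Diverse Controlled 4D Hand Motion Modeling	UniHand: a unified framework for 4D hand motion modeling that supports both estimation and generation	motion synthesis MANO
36	From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors	Proposes PhysicEdit, which achieves physics-aware image editing through physical state transition priors	physically plausible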

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
37	RT-RMOT: A Dataset and Framework for RGB-Thermal Referring Multi-Object Tracking	Proposes the RT-RMOT dataset and the RTrack framework for RGB-Thermal referring multi-object tracking under all-weather conditions	interactive character large language model multimodal
