cs.CV (2026-02-25)

📊 37 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (14 🔗5) · Pillar 2: RL & Architecture (7) · Pillar 3: Perception & Semantics (6 🔗2) · Pillar 7: Motion Retargeting (4) · Pillar 1: Robot Control (3 🔗2) · Pillar 4: Generative Motion (2) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (14 papers)

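# | Title | Key point | Tags | 🔗
1 | UniVBench: Towards Unified Evaluation for Video Foundation Models | Proposes UniVBench to address the fragmented evaluation of video foundation models | foundation model, multimodal, instruction following
2 | MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving | MindDriver: a progressive multimodal reasoning framework for autonomous driving | multimodal, chain-of-thought
3 | SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model | SkyReels-V4: a foundation model unifying multimodal video-audio generation, inpainting, and editing | large language model, foundation model, multimodal
4 | RGB-Event HyperGraph Prompt for Kilometer Marker Recognition based on Pre-trained Foundation Models | Proposes a pre-trained model based on RGB-Event hypergraph prompts for metro kilometer marker recognition in GNSS-denied environments | foundation model
5 | Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models | Proposes dynamic multimodal activation steering to mitigate hallucination in large vision-language models | multimodal
6 | E-comIQ-ZH: A Human-Aligned Dataset and Benchmark for Fine-Grained Evaluation of E-commerce Posters with Chain-of-Thought | Proposes the E-comIQ-ZH framework for fine-grained quality evaluation of Chinese e-commerce posters, addressing existing methods' neglect of text artifacts | chain-of-thought
7 | CARE: A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis | Proposes CARE, a molecular-guided foundation model with adaptive region modeling for pathology whole slide image analysis | foundation model
8 | SEF-MAP: Subspace-Decomposed Expert Fusion for Robust Multimodal HD Map Prediction | SEF-MAP: a subspace-decomposed expert fusion method for robust multimodal HD map prediction | multimodal
9 | WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs | WeaveTime: improves the temporal understanding of video LLMs in streaming scenarios by weaving information from earlier frames into emergent memory | large language model, multimodal
10 | Global-Local Dual Perception for MLLMs in High-Resolution Text-Rich Image Translation | GLoTran: proposes a global-local dual-perception MLLM framework for high-resolution text-rich image translation | large language model, multimodal
11 | StoryMovie: A Dataset for Semantic Alignment of Visual Stories with Movie Scripts and Subtitles | The StoryMovie dataset improves the accuracy of semantic relationships in visual stories by aligning them with movie scripts and subtitles | visual grounding, TAMP
12 | TranX-Adapter: Bridging Artifacts and Semantics within MLLMs for Robust AI-generated Image Detection | Proposes TranX-Adapter to make MLLMs more robust at AI-generated image detection | large language model, multimodal
13 | NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors | NoLan: mitigates object hallucinations in large vision-language models by dynamically suppressing language priors | multimodal
14 | RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations | RobustVisRAG: proposes a causality-aware retrieval-augmented generation framework that is robust to visual degradations | multimodal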

🔬 Pillar 2: RL & Architecture (7 papers)

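# | Title | Key point | Tags | 🔗
15 | See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs | Proposes a training-free iterative framework that supervises LVLM multimodal reasoning with visual evidence, improving visual consistency | reinforcement learning, multimodal, chain-of-thought
16 | Which Tool Response Should I Trust? Tool-Expertise-Aware Chest X-ray Agent with Multimodal Agentic Learning | Proposes a tool-expertise-aware chest X-ray agent that resolves tool conflicts through multimodal agentic learning | reinforcement learning, multimodal
17 | How to Take a Memorable Picture? Empowering Users with Actionable Feedback | Proposes MemCoach, which improves image memorability through actionable feedback, helping users take more memorable photos | teacher-student, large language model, multimodal
18 | Enhancing Multi-Modal LLMs Reasoning via Difficulty-Aware Group Normalization | Proposes difficulty-aware group normalization (Durian) to improve multimodal LLM reasoning | reinforcement learning, large language model, multimodal
19 | PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning | PanoEnv: explores 3D spatial intelligence in panoramic environments with reinforcement learning | reinforcement learning, scene understanding
20 | Solaris: Building a Multiplayer Video World Model in Minecraft | Solaris: builds a multiplayer video world model in Minecraft with consistent multi-view simulation | world model
21 | CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning | Proposes CCCaption, which uses dual-reward reinforcement learning to generate complete and correct image captions | reinforcement learning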

🔬 Pillar 3: Perception & Semantics (6 papers)

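# | Title | Key point | Tags | 🔗
22 | Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction | GPOcc: uses generalizable visual geometry priors for sparse Gaussian occupancy prediction | monocular depth, Depth Anything, scene understanding
23 | SF3D-RGB: Scene Flow Estimation from Monocular Camera and Sparse LiDAR | SF3D-RGB: a scene flow estimation method that fuses a monocular camera with sparse LiDAR | scene flow
24 | Lie Flow: Video Dynamic Fields Modeling and Predicting with Lie Algebra as Geometric Physics Principle | LieFlow: models and predicts video dynamic fields using Lie algebra as a geometric physics principle | NeRF, geometric consistency
25 | Pseudo-View Enhancement via Confidence Fusion for Unposed Sparse-View Reconstruction | Proposes a confidence-fusion-based pseudo-view enhancement method for unposed sparse-view 3D reconstruction | scene reconstruction, geometric consistency
26 | Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping | Proposes the MoGaF framework, based on motion-aware Gaussian grouping, for space-time forecasting of dynamic scenes | gaussian splatting, splatting
27 | Joint Shadow Generation and Relighting via Light-Geometry Interaction Maps | Proposes light-geometry interaction maps for joint shadow generation and relighting from monocular depth information | monocular depth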

🔬 Pillar 7: Motion Retargeting (4 papers)

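# | Title | Key point | Tags | 🔗
28 | Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos | Proposes LFG, a large-scale autonomous driving pretraining method based on unlabeled in-the-wild videos | motion prediction, foundation model
29 | MultiAnimate: Pose-Guided Image Animation Made Extensible | Proposes MultiAnimate to address identity confusion and occlusion in multi-character pose-guided image animation | spatial relationship, character animation
30 | SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance | SemVideo: reconstructs watched video content from brain activity via hierarchical semantic guidance | motion adaptation
31 | Easy3E: Feed-Forward 3D Asset Editing via Rectified Voxel Flow | Easy3E: a feed-forward 3D asset editing framework based on rectified voxel flow | geometric consistency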

🔬 Pillar 1: Robot Control (3 papers)

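# | Title | Key point | Tags | 🔗
32 | Structure-to-Image: Zero-Shot Depth Estimation in Colonoscopy via High-Fidelity Sim-to-Real Adaptation | Proposes a structure-to-image method for zero-shot depth estimation in colonoscopy, addressing structural distortion | sim-to-real, depth estimation, monocular depth
33 | WHOLE: World-Grounded Hand-Object Lifted from Egocentric Videos | WHOLE: reconstructs hand-object interactions in world coordinates from egocentric videos | manipulation, egocentric, motion estimation
34 | WeatherCity: Urban Scene Reconstruction with Controllable Multi-Weather Transformation | WeatherCity: proposes an urban scene reconstruction framework with controllable multi-weather transformation | manipulation, scene reconstruction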

🔬 Pillar 4: Generative Motion (2 papers)

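# | Title | Key point | Tags | 🔗
35 | UniHand: A Unified Model for Diverse Controlled 4D Hand Motion Modeling | UniHand: a unified 4D hand motion modeling framework that supports both estimation and generation | motion synthesis, MANO
36 | From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors | Proposes PhysicEdit, which achieves physics-aware image editing via physical state transition priors | physically plausible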

🔬 Pillar 8: Physics-based Animation (1 paper)

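# | Title | Key point | Tags | 🔗
37 | RT-RMOT: A Dataset and Framework for RGB-Thermal Referring Multi-Object Tracking | Proposes the RT-RMOT dataset and the RTrack framework for RGB-Thermal referring multi-object tracking under all-weather conditions | interactive character, large language model, multimodal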
