cs.CV (2026-02-24)

📊 39 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11 🔗2) · Pillar 3: Perception & Semantics (9 🔗1) · Pillar 2: RL & Architecture (8 🔗2) · Pillar 1: Robot Control (4 🔗1) · Pillar 7: Motion Retargeting (3 🔗1) · Pillar 6: Video Extraction (2 🔗1) · Pillar 8: Physics-based Animation (2 🔗1)

🔬 Pillar 9: Embodied Foundation Models (11 papers)

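# | Title | One-sentence summary | Tags | 🔗
1 | Are Multimodal Large Language Models Good Annotators for Image Tagging? | Proposes the TagLLM framework to improve the annotation quality of multimodal large language models on image tagging. | large language model, multimodal
2 | Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion | Proposes MVLAD-AD, which uses a masked diffusion model for efficient, explainable end-to-end autonomous driving. | vision-language-action, large language model
3 | CrystaL: Spontaneous Emergence of Visual Latents in MLLMs | CrystaL: spontaneous emergence of visual latents in MLLMs, improving fine-grained visual understanding. | large language model, multimodal, chain-of-thought
4 | OrthoDiffusion: A Generalizable Multi-Task Diffusion Foundation Model for Musculoskeletal MRI Interpretation | OrthoDiffusion: a generalizable multi-task diffusion model for musculoskeletal MRI interpretation. | foundation model
5 | An interactive enhanced driving dataset for autonomous driving | Proposes the Interactive Enhanced Driving Dataset (IEDD) to address data sparsity and insufficient multimodal alignment in VLA models for autonomous driving. | vision-language-action, VLA, multimodal
6 | UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics | Proposes the UDVideoQA dataset for video question answering with multi-object spatio-temporal reasoning in urban traffic videos. | multimodal, visual grounding
7 | OmniOCR: Generalist OCR for Ethnic Minority Languages | OmniOCR: a generalist OCR framework for ethnic minority languages that improves recognition accuracy in low-resource settings. | foundation model, multimodal
8 | Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction | Skullptor: fast, high-fidelity 3D head reconstruction based on multi-view normal prediction. | foundation model
9 | VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models | Proposes the VII framework, which bypasses the safety restrictions of image-to-video generation models via visual instruction injection. | instruction following
10 | Cycle-Consistent Tuning for Layered Image Decomposition | Proposes a cycle-consistent tuning method for diffusion-based layered image decomposition. | foundation model
11 | On the Explainability of Vision-Language Models in Art History | Studies the explainability of CLIP's visual reasoning in art history and evaluates the effectiveness of XAI methods. | multimodal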

🔬 Pillar 3: Perception & Semantics (9 papers)

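# | Title | One-sentence summary | Tags | 🔗
12 | SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens | SceMoS: scene-aware 3D human motion synthesis by planning with geometry-grounded tokens. | height map, occupancy grid, affordance
13 | RU4D-SLAM: Reweighting Uncertainty in Gaussian Splatting SLAM for 4D Scene Reconstruction | RU4D-SLAM: 4D Gaussian splatting SLAM reconstruction of dynamic scenes via uncertainty reweighting. | 3D gaussian splatting, gaussian splatting, splatting
14 | Dropping Anchor and Spherical Harmonics for Sparse-view Gaussian Splatting | Proposes DropAnSH-GS, which improves sparse-view Gaussian splatting through anchor dropout and spherical harmonics sparsification. | 3D gaussian splatting, 3DGS, gaussian splatting
15 | VAGNet: Grounding 3D Affordance from Human-Object Interactions in Videos | VAGNet: grounding 3D affordance regions from human-object interactions in videos. | affordance, human-object interaction, HOI
16 | Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation Models | Fuses geometry and interaction cues to improve, in a zero-shot manner, the affordance reasoning of vision foundation models. | affordance, foundation model
17 | BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting | Proposes BrepGaussian, which reconstructs CAD models from multi-view images using Gaussian splatting. | gaussian splatting, splatting
18 | Monocular Endoscopic Tissue 3D Reconstruction with Multi-Level Geometry Regularization | Proposes a monocular endoscopic tissue 3D reconstruction method with multi-level geometry regularization, achieving real-time rendering and smooth surfaces. | 3D gaussian splatting, gaussian splatting, splatting
19 | WildGHand: Learning Anti-Perturbation Gaussian Hand Avatars from Monocular In-the-Wild Videos | WildGHand: learning perturbation-resistant Gaussian hand avatars reconstructed from monocular in-the-wild videos. | 3D gaussian splatting, gaussian splatting, splatting
20 | Real-time Motion Segmentation with Event-based Normal Flow | Proposes a real-time motion segmentation framework based on event-based normal flow, significantly improving the efficiency of dynamic scene understanding. | scene understanding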

🔬 Pillar 2: RL & Architecture (8 papers)

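# | Title | One-sentence summary | Tags | 🔗
21 | LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding | Proposes LongVideo-R1, which achieves low-cost long video understanding through active reasoning-based navigation. | reinforcement learning, large language model, multimodal
22 | GatedCLIP: Gated Multimodal Fusion for Hateful Memes Detection | Proposes GatedCLIP, which improves hateful memes detection through gated multimodal fusion. | contrastive learning, multimodal
23 | RAYNOVA: 3D-Geometry-Free Auto-Regressive Driving World Modeling with Unified Spatio-Temporal Representation | RAYNOVA: an auto-regressive driving world modeling method without 3D geometric priors that achieves a unified spatio-temporal representation. | world model, physically plausible, foundation model
24 | Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models | Proposes MMHNet to address the difficulty video-to-audio generation models have in generalizing to long durations. | Mamba, multimodal
25 | Communication-Inspired Tokenization for Structured Image Representations | Proposes COMiT, which learns structured image representations by mimicking human communication, improving compositional generalization and relational reasoning. | flow matching, multimodal
26 | A Lightweight Vision-Language Fusion Framework for Predicting App Ratings from User Interfaces and Metadata | Proposes a lightweight vision-language fusion framework that predicts app ratings from user interfaces and metadata. | MAE, multimodal
27 | Path-Decoupled Hyperbolic Flow Matching for Few-Shot Adaptation | Proposes path-decoupled hyperbolic flow matching (HFM) to address path entanglement in few-shot cross-modal transfer. | flow matching
28 | PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models | PropFly: learning video edit propagation with on-the-fly supervision from pre-trained video diffusion models. | flow matching, classifier-free guidance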

🔬 Pillar 1: Robot Control (4 papers)

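# | Title | One-sentence summary | Tags | 🔗
29 | Object-Scene-Camera Decomposition and Recomposition for Data-Efficient Monocular 3D Object Detection | Proposes an object-scene-camera decomposition and recomposition method to improve the data efficiency of monocular 3D object detection. | manipulation
30 | From Perception to Action: An Interactive Benchmark for Vision Reasoning | Proposes the CHAIN benchmark for evaluating the ability of vision reasoning models to act in interactive physical environments. | manipulation
31 | See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis | ArtiAgent: enabling VLMs and diffusion models to comprehend visual artifacts via agentic data synthesis. | manipulation
32 | RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces | RecoverMark: a robust watermarking method for localization and recovery of manipulated faces. | manipulation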

🔬 Pillar 7: Motion Retargeting (3 papers)

# | Title | One-sentence summary | Tags | 🔗
33 | VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving | VGGDrive: enhancing vision-language models for autonomous driving with cross-view geometric grounding. | motion prediction, foundation model
34 | SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models | Proposes the SpatiaLQA benchmark to evaluate vision-language models on complex spatial logical reasoning. | spatial relationship, foundation model
35 | SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking | SIMSPINE: a biomechanics-aware simulation framework for 3D spine motion annotation and benchmarking. | motion estimation

🔬 Pillar 6: Video Extraction (2 papers)

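# | Title | One-sentence summary | Tags | 🔗
36 | Interaction-aware Representation Modeling with Co-occurrence Consistency for Egocentric Hand-Object Parsing | Proposes InterFormer, which improves egocentric hand-object parsing through interaction-aware modeling and co-occurrence consistency. | egocentric
37 | Long-Term Multi-Session 3D Reconstruction Under Substantial Appearance Change | Proposes a joint SfM reconstruction method to address 3D reconstruction under long-term appearance change. | feature matching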

🔬 Pillar 8: Physics-based Animation (2 papers)

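# | Title | One-sentence summary | Tags | 🔗
38 | PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning | PFGNet: a fully convolutional frequency-guided peripheral gating network for efficient spatiotemporal predictive learning. | spatiotemporal
39 | Human Video Generation from a Single Image with 3D Pose and View Control | Proposes the HVG model, which generates high-quality human videos with 3D pose and view control from a single image. | spatiotemporal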
