cs.CV(2026-06-08)

📊 39 papers in total | 🔗 5 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (16 🔗1) · Pillar 2: RL & Architecture (10 🔗1) · Pillar 3: Perception & Semantics (5) · Pillar 1: Robot Control (3 🔗2) · Pillar 4: Generative Motion (2) · Pillar 7: Motion Retargeting (2 🔗1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (16 papers)

# | Title | Key point | Tags | 🔗
1 | NutriMLLM: Multimodal Large Language Models for Dietary Micronutrient Analysis | Proposes NutriMLLM for dietary micronutrient analysis | large language model, multimodal |
2 | Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur? | Proposes Ego-MC-Bench and Ego-CoMist for real-time mistake correction by video LLMs | large language model, multimodal |
3 | GD-MIL: Grade-Disentangled Multiple Instance Learning for Multimodal Biochemical Recurrence Prediction in Prostate Cancer | Proposes GD-MIL for predicting biochemical recurrence in prostate cancer | foundation model, multimodal |
4 | Scaling by Diversified Experience for Vision-Language-Action Models | Proposes SyVLA to address control and reasoning in vision-language-action models | vision-language-action, VLA |
5 | Securing Self-supervised Data Curation for Foundation Models Robustness | Proposes a toxic-data detector to ensure the integrity of self-supervised data | foundation model |
6 | Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning | Proposes the Rea2Seg framework for complex image segmentation | large language model, foundation model, multimodal |
7 | CAMF-Det: Closure-Aware Multimodal Fusion for LiDAR-Camera 3D Object Detection on UAV Platforms | Proposes CAMF-Det to address occlusion on UAV platforms | multimodal |
8 | CRANE: Knowledge Editing for Reasoning MLLMs | Proposes the CRANE framework for knowledge editing in reasoning multimodal large language models | large language model, multimodal, chain-of-thought |
9 | When Vision Misleads, Let Location Speak: A Worldwide Image Geo-Localization Method via Location Attention Mechanism and Large Multimodal Models | Proposes TransGeoCLIP for worldwide image geo-localization | multimodal |
10 | DifferSeg: Towards Diverse Multimodal Binary Segmentation via Differential Perception and Frequency Guidance | Proposes DifferSeg to address adaptability and decoding efficiency in multimodal binary segmentation | multimodal |
11 | HDRAgent: An Agentic Framework for Multi-Exposure HDR Imaging | Proposes HDRAgent to address HDR imaging artifacts in dynamic scenes | large language model, multimodal |
12 | A multi-agent system for spine MRI report generation from multi-sequence imaging | Proposes SpineAgent to address the complexity of spine MRI report generation | foundation model, multimodal |
13 | HDSL: A Hierarchical Domain-Specific Language for Structured 3D Indoor Scene Generation and Localized Editing with LLM Agents | Proposes HDSL for text-driven indoor scene generation and editing | multimodal |
14 | Counterfactual Reasoning for Fine-Grained Evidence Disentanglement in VideoQA | Proposes the CREDiT framework for causal reasoning in video question answering | multimodal |
15 | See More, Think Deeper: Query-Expanded Visual Evidence and Answer-Clue Guided Reflection for Long Video Understanding | Proposes the CoVER framework to address evidence acquisition and feedback in long video understanding | large language model |
16 | Are Reasoning Vision-Language Models Robust to Semantic Visual Distractions? | Proposes Distract-Bench to address the robustness of vision-language models to semantic distractions | multimodal |

🔬 Pillar 2: RL & Architecture (10 papers)

# | Title | Key point | Tags | 🔗
17 | Training-Free Generalized Few-Shot Segmentation through Open-Vocabulary Semantic Arbitration | Proposes Open-V for training-free few-shot semantic segmentation | representation learning, open-vocabulary, open vocabulary |
18 | Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis | Investigates whether video foundation models understand intuitive physics | JEPA, foundation model |
19 | ATM: Action-Consistency Transfer Matrix for Diagnosing and Improving Latent World Models | Proposes ATM to diagnose and improve action consistency in latent world models | world model, world models |
20 | Latent Spatial Memory for Video World Models | Proposes latent spatial memory to address 3D consistency in video world models | world model, world models |
21 | Echo-Memory: A Controlled Study of Memory in Action World Models | Proposes Echo-Memory to address memory in action world models | world model, world models |
22 | Prisma-World: Camera-Controllable Multi-Agent Video World Model | Proposes Prisma-World to address view consistency in multi-agent video generation | world model, world models |
23 | CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning | Proposes CapRL++ to address reward verification in image and video captioning | reinforcement learning, multimodal |
24 | Temporal-Aware Reasoning Optimization for Video Temporal Grounding | Proposes the TaRO framework to address reasoning quality in video temporal grounding | reinforcement learning, large language model, TAMP |
25 | Vision-Language Guided Hyperspectral Object Tracking via Semantics Fusion and Contextual Template Updating | Proposes VLHTrack to address spectral redundancy in hyperspectral object tracking | Mamba, state space model, large language model |
26 | Self-supervised Learning Matters: A Simple Ensemble Solution for Micro-Gesture Recognition | Proposes a multimodal ensemble framework for micro-gesture recognition | representation learning, multimodal |

🔬 Pillar 3: Perception & Semantics (5 papers)

# | Title | Key point | Tags | 🔗
27 | Leveraging NeRF-Rendered Images for 3D Gaussian Splatting | Proposes using NeRF-rendered images to improve 3D Gaussian Splatting rendering quality | 3D gaussian splatting, 3DGS, gaussian splatting |
28 | REFINE: Super-efficient 3D Gaussian Splatting Pruning via Rendering-Free Primitive Importance | Proposes REFINE to address inefficient 3D Gaussian Splatting pruning | 3D gaussian splatting, 3DGS, gaussian splatting |
29 | Minimal Solvers for Full-DoF Motion Estimation from Asynchronous Differential SfM | Proposes a full-DoF motion estimation framework to handle asynchronous data | optical flow, motion estimation, spatiotemporal |
30 | ExDet: Open-Domain Open-Vocabulary Detection with Cross-modal Extrapolation and Rectification | Proposes ExDet for open-domain open-vocabulary detection | open-vocabulary, open vocabulary |
31 | Beyond Spherical Harmonics: Rethinking Appearance Models for Radiance Reconstruction | Proposes normalized anisotropic spherical Gabor functions for view-dependent appearance modeling | scene reconstruction |

🔬 Pillar 1: Robot Control (3 papers)

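# | Title | Key point | Tags | 🔗
32 | EPS3D: End-to-End Feed-Forward 3D Panoptic Segmentation | Proposes the EPS3D framework for open-vocabulary 3D panoptic segmentation | manipulation, distillation, scene understanding |
33 | EgoTactile: Learning Grasp Pressure for Everyday Objects from Egocentric Video | Proposes EgoTactile for estimating grasp pressure on everyday objects | manipulation, egocentric |
34 | An Enhanced Geometric-Spectral Feature Learning Framework for Airborne Multispectral Point Cloud Classification | Proposes an enhanced geometric-spectral feature learning framework for multispectral point cloud classification | MPC |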

🔬 Pillar 4: Generative Motion (2 papers)

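# | Title | Key point | Tags | 🔗
35 | CP4D: Compositional Physics-aware 4D Scene Generation | Proposes CP4D to address physical consistency in dynamic scene generation | motion synthesis, physically plausible, spatiotemporal |
36 | Real-time body pose non-verbal communication with a consistency-based reliability measure | Proposes a real-time method for body-pose non-verbal communication with a consistency-based reliability measure | MotionLCM |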

🔬 Pillar 7: Motion Retargeting (2 papers)

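# | Title | Key point | Tags | 🔗
37 | A Geometric Framework for Absolute Pose and Velocity Estimation with Event Cameras | Proposes a geometric framework for absolute pose and velocity estimation | motion estimation |
38 | See More, Match Better: Multi-Source Feature Fusion for Two-View Correspondence Learning | Proposes TriMatch to address pseudo-consistency in two-view correspondence learning | geometric consistency |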

🔬 Pillar 6: Video Extraction (1 paper)

# | Title | Key point | Tags | 🔗
39 | Decoding Pedestrian Crossing Intention from Egocentric Vision via Vision Language Models | Decodes pedestrian crossing intention with vision-language models | egocentric, egocentric vision, first-person view |
