cs.CV (2026-05-26)

📊 48 papers in total | 🔗 13 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (16 🔗2) · Pillar 9: Embodied Foundation Models (16 🔗5) · Pillar 3: Perception & Semantics (10 🔗3) · Pillar 4: Generative Motion (3 🔗1) · Pillar 1: Robot Control (1 🔗1) · Pillar 7: Motion Retargeting (1) · Pillar 5: Interaction & Reaction (1 🔗1)

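🔬 Pillar 2: RL & Architecture (16 papers)

# | Title | One-line summary | Tags | 🔗
1 | SpatialBench: Is Your Spatial Foundation Model an All-Round Player? | SpatialBench: a cross-domain, multi-task benchmark for evaluating the generalization of spatial foundation models. | representation learning egocentric foundation model
2 | FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation | FoundObj: uses self-supervised foundation models as rewards for label-free 3D object segmentation. | reinforcement learning foundation model
3 | Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini | Gemini Embedding 2: a native multimodal embedding model that provides a unified representation of video, audio, images, and text. | contrastive learning multimodal
4 | O-MARC: Omni Memory-Augmented Compression Distillation for Efficient Video Understanding | Proposes the O-MARC framework, which uses compression distillation to improve the efficiency and performance of multimodal large models on video understanding. | distillation large language model multimodal
5 | Re-M3Dr: Rebalanced MultiModal Mean Deviation Regression | Proposes Re-M3Dr to address performance degradation in visual field defect assessment under multimodal medical image fusion. | contrastive learning multimodal
6 | DinoComplete: 3D Shape Completion with Distilled Semantic Priors and State Space Models | DinoComplete: 3D shape completion using distilled semantic priors and state space models. | Mamba state space model foundation model
7 | OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation | Proposes OmniRetriever, based on fusion-as-teacher distillation, for any-to-any audio-video-text retrieval. | distillation multimodal
8 | JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search | JetViT: an efficient high-resolution Vision Transformer obtained through post-training attention search. | linear attention Depth Anything foundation model
9 | LongCat-Video-Avatar 1.5 Technical Report | LongCat-Video-Avatar 1.5: an open-source audio-driven video generation framework for commercial-grade applications. | RLHF distillation multi-person interaction
10 | PlayClass: Automated Play Behaviour Classification in Poultry | PlayClass: a pipeline for automated classification of play behaviour in poultry. | JEPA foundation model
11 | Touch-R1: Reinforcing Touch Reasoning in MLLMs | Touch-R1: improves touch reasoning in multimodal large models through tactile reinforcement learning. | reinforcement learning multimodal
12 | Semi-Supervised Gaze Estimation via Disentangled Subspace Contrastive Learning | Proposes a semi-supervised gaze estimation method based on disentangled subspace contrastive learning, improving domain generalization. | contrastive learning
13 | REVERSE: Reinforcing Evidence Verification and Search for Agentic Image geo-localization | Proposes the REVERSE framework, which achieves agentic image geo-localization by reinforcing evidence verification and search. | reinforcement learning visual grounding
14 | InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward | InterSketch: an interleaved reasoning model based on self-correcting visual sketches and stepwise rewards that improves vision-language model performance on complex visual reasoning tasks. | reinforcement learning chain-of-thought
15 | JLT: Clean-Latent Prediction in Latent Diffusion Transformers | JLT: improves image generation quality through clean-latent prediction in latent diffusion Transformers. | flow matching classifier-free guidance
16 | Triadic Dynamics Aware Diffusion Posterior Sampling for Inverse Problems: Optimizing Guidance and Stochasticity Schedules | Proposes TriPS, which optimizes the guidance and stochasticity schedules of diffusion posterior sampling, markedly improving imaging results on inverse problems. | reinforcement learning classifier-free guidance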

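🔬 Pillar 9: Embodied Foundation Models (16 papers)

# | Title | One-line summary | Tags | 🔗
17 | Can Retrieval Heads See Images? Multimodal Retrieval Heads in Long-Context Vision-Language Models | Proposes a method for detecting multimodal retrieval heads, improving long-context vision-language models on image-text retrieval tasks. | large language model multimodal
18 | DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding | Proposes DynFrame to address dynamic frame sampling in complex video understanding. | large language model multimodal
19 | How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning | Proposes View Dropout and panoramic visual thinking to improve cross-view spatial reasoning in unified multimodal models. | multimodal
20 | Pop-Up Distractions Reveal Bag-of-Events Behavior in Video Large Language Models | DistractionBench reveals "bag-of-events" behavior in the temporal understanding of video large language models. | large language model
21 | Once-For-All: A Train-Once and Select-Anytime Framework for Multimodal Instruction Tuning | Proposes the OFA framework, which after a single training run can select any dataset for multimodal instruction tuning, improving training efficiency. | multimodal
22 | IPIBench: Evaluating Interactive Proactive Intelligence of MLLMs under Continuous Streams | IPIBench: an evaluation benchmark for interactive proactive intelligence, assessing MLLM performance on continuous video streams. | large language model multimodal
23 | DV-SFT: Direct Vision Supervision for Fine-Grained Visual Understanding | Proposes DV-SFT, which uses direct vision supervision to improve the fine-grained visual understanding of multimodal large language models. | large language model multimodal
24 | OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants | OmniInteract: a benchmark of real-world streaming interaction for real-time omnimodal assistants. | large language model multimodal
25 | LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding | LocateAnything: proposes parallel box decoding to speed up vision-language grounding and improve its quality. | visual grounding
26 | Leveraging Visual Signals for Robust Token-Level Uncertainty in Vision-Language Generation | Proposes the VIG-TUQ framework, which uses visual signals to make token-level uncertainty estimation in vision-language generation more robust. | visual grounding
27 | ChartAct: A Benchmark for Dynamic Chart Understanding | Proposes ChartAct, an interactive benchmark for dynamic chart understanding. | multimodal
28 | On the Robustness of Machine Unlearning for Vision-Language Models | Proposes a robustness analysis and attack methods for multimodal knowledge unlearning in vision-language models. | multimodal
29 | I2PRef: Image-Driven Point Completion with Iterative Refinement | Proposes I2PRef, which achieves high-quality 3D reconstruction through image-driven point cloud completion with iterative refinement. | multimodal
30 | METATR: A Multilingual, Evolving Benchmark for Automatic Text Recognition | Proposes METATR, a multilingual, evolving benchmark for automatic text recognition. | large language model
31 | FTibSuite: A Comprehensive Resource Suite for Tibetan Vision-Language Modeling | FTibSuite: a comprehensive resource suite for Tibetan vision-language modeling. | multimodal
32 | The Rescue Effect: Spatio-Semantic Early Exit Bypasses Quantization Collapse in CLIP | Proposes LRA-EE, a layer-wise representation-aware early-exit mechanism that mitigates the quantization-induced performance collapse of CLIP models. | multimodal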

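🔬 Pillar 3: Perception & Semantics (10 papers)

# | Title | One-line summary | Tags | 🔗
33 | TrackRef3D: Multi-View Consistent Track-then-Label for Open-World Referring Segmentation in 3D Gaussian Splatting | TrackRef3D: proposes a multi-view consistent Track-then-Label method for open-world referring segmentation in 3D Gaussian Splatting. | 3D gaussian splatting 3DGS gaussian splatting
34 | DelowlightSplat: Feed-Forward Gaussian Splatting for Lowlight 3D Scene Reconstruction | DelowlightSplat: a feed-forward Gaussian Splatting method for low-light 3D scene reconstruction. | 3D reconstruction gaussian splatting splatting
35 | Underwater360: Reconstructing Underwater Scenes from Panoramic Images with Omnidirectional Gaussian Splatting | Proposes Underwater360, which reconstructs underwater scenes with omnidirectional Gaussian Splatting, addressing underwater image degradation and viewpoint distortion. | 3D gaussian splatting 3DGS gaussian splatting
36 | COVD: Continual Open-Vocabulary Object Detection with Novel Concept Injection | Proposes NoIn-Det, which addresses novel concept injection in continual open-vocabulary object detection without additional parameters. | open-vocabulary open vocabulary
37 | $R^3$: 3D Reconstruction via Relative Regression | Proposes $R^3$, a relative-regression method that addresses the dependence on a global coordinate frame in long-sequence and streaming 3D reconstruction. | 3D reconstruction foundation model
38 | Gaussian-Voxel Duet: A Dual-Scaffolding Hybrid Representation for Fast and Accurate Monocular Surface Reconstruction | Proposes Gaussian-Voxel Duet for fast and accurate monocular surface reconstruction. | 3D gaussian splatting 3D reconstruction gaussian splatting
39 | 3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation | Proposes a 3D Gaussian map to address environment understanding in vision-language navigation. | scene understanding egocentric VLN
40 | Sparse-LiDAR Prompting of Monocular Geometry Foundations: An Empirical Study Toward Long-Range Driving Depth | Proposes SLIM, which prompts monocular geometry foundation models with sparse LiDAR to improve depth estimation in long-range driving scenes. | Depth Anything MoGe foundation model
41 | Uncertainty-Aware Gaussian Map for Vision-Language Navigation | Proposes an uncertainty-aware Gaussian map that improves the reliability of vision-language navigation. | affordance VLN
42 | G3T Up! Gravity Aligned Coordinate Frames Simplify Pointmap Processing | G3T: uses gravity-aligned coordinate frames to simplify pointmap processing and improve 3D reconstruction accuracy. | 3D reconstruction VGGT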

🔬 Pillar 4: Generative Motion (3 papers)

# | Title | One-line summary | Tags | 🔗
43 | Self-Intersection-Aware 3D Human Motion Generation Using an Efficient Human Sphere Proxy | Proposes a self-intersection-aware 3D human motion generation method based on an efficient human sphere proxy. | motion diffusion model MDM motion diffusion
44 | Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos | HTD-Refine: improves the realism of human motion recovery from monocular video by aligning high-order temporal dynamics. | physically plausible HMR human motion
45 | Generative Animations: A Multi-Model Pipeline for Prompt-Driven Motion Synthesis | Proposes Generative Animations, which synthesizes animations through a prompt-driven multi-model pipeline. | motion synthesis large language model visual grounding

🔬 Pillar 1: Robot Control (1 paper)

# | Title | One-line summary | Tags | 🔗
46 | OSMa-Bench++: Toward Open-Ended Benchmarking of Semantic Mapping for Manipulation with Prompt-Generated Synthetic Scenes | OSMa-Bench++: uses prompt-generated synthetic scenes to enable open-ended benchmarking of semantic mapping for manipulation. | manipulation semantic mapping semantic map

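🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | One-line summary | Tags | 🔗
47 | Joint 2D-3D Segmentation and Association in Street-level Imaging | Proposes a joint 2D-3D segmentation and association framework for large-scale street-level image understanding and spatial digital twin construction. | geometric consistency motion reconstruction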

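🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-line summary | Tags | 🔗
48 | CmIVTP: Cross-modal Interaction-based Vessel Trajectory Prediction for Maritime Intelligence | Proposes the CmIVTP framework, which predicts maritime vessel trajectories using cross-modal interaction to advance intelligent shipping. | interaction transformer multimodal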
