cs.CV(2026-05-29)

📊 54 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (19 🔗4) · Pillar 9: Embodied Foundation Models (15 🔗3) · Pillar 3: Perception & Semantics (10 🔗1) · Pillar 1: Robot Control (3 🔗1) · Pillar 4: Generative Motion (2 🔗1) · Pillar 8: Physics-based Animation (2) · Pillar 7: Motion Retargeting (2 🔗2) · Pillar 6: Video Extraction (1)

🔬 Pillar 2: RL & Architecture (19 papers)

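| # | Title | One-line summary | Tags |
|---|---|---|---|
| 1 | HQ-JEPA: Hybrid Quantum Joint-Embedding Predictive Architecture for Cross-Modal Remote Sensing Representation Learning | Proposes HQ-JEPA, a hybrid quantum joint-embedding predictive architecture for cross-modal remote sensing representation learning | JEPA, Joint-Embedding Predictive Architecture, joint-embedding predictive architecture |
| 2 | iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning | Proposes iVGR, which uses reinforcement learning to internalize visual grounding within the textual reasoning of multimodal large language models | reinforcement learning, large language model, multimodal |
| 3 | VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching | VolFill: single-view amodal 3D scene reconstruction with volumetric flow matching | flow matching, scene reconstruction, foundation model |
| 4 | Learning from Fine-Grained Visual Discrepancies: Mitigating Multimodal Hallucinations via In-Context Visual Contrastive Optimization | Proposes IC-VCO, which mitigates multimodal hallucinations in vision-language models through in-context visual contrastive optimization | DPO, direct preference optimization, distillation |
| 5 | DriveMA: Driving Vision-Language-Action Models with verifiable Meta-Actions | DriveMA: drives autonomous-driving vision-language-action models with verifiable meta-actions | reinforcement learning, vision-language-action, VLA |
| 6 | Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models | Light Interaction: training-free inference acceleration for interactive video world models | world model, world models, latent dynamics |
| 7 | Attend to Evidence: Evidence-Anchored Spatial Attention Supervision for Multimodal RLVR | Proposes EASE, which improves multimodal RLVR performance through evidence-anchored spatial attention supervision | reinforcement learning, multimodal, visual grounding |
| 8 | Task-Focused Memorization for Multimodal Agents | Proposes TaskMem, a reinforcement-learning framework for learning task-focused memorization policies in multimodal agents | reinforcement learning, policy learning, multimodal |
| 9 | Astra: a generalizable report generation foundation model for 3D computed tomography | Astra: a generalizable report-generation foundation model for 3D CT that improves diagnostic efficiency and accuracy | reinforcement learning, foundation model |
| 10 | Robust Dreamer: Deviation-Aware Latent Gaussian Memory for Action-Controlled AR Video Generation | Robust Dreamer: proposes deviation-aware latent Gaussian memory for action-controlled AR video generation | dreamer, gaussian splatting, splatting |
| 11 | HiERO-StepG @ Ego4D Step Grounding Challenge: hierarchical activity understanding enables zero-shot step grounding | HiERO-StepG: uses hierarchical activity understanding to achieve zero-shot step grounding on Ego4D | representation learning, Ego4D |
| 12 | Detect in Any Scene: An Agentic Framework for Object Detection with Experience-Aware Reasoning | Proposes DetAS, an agent-based object detection framework with experience-aware reasoning that improves detection performance in complex scenes | representation learning, large language model, multimodal |
| 13 | NTR: Neural Token Reconstruction for Scene Token Bottleneck in End-to-End Driving | Proposes Neural Token Reconstruction (NTR) to strengthen the visual representation capacity of scene tokens in end-to-end autonomous driving | representation learning, distillation, foundation model |
| 14 | CoFiDA-M: Concept-Aware Feature Modulation for Cross-Domain Adaptation with Image-Only Inference | Proposes CoFiDA-M, which uses concept-aware feature modulation for cross-domain image adaptation, addressing deployment challenges in skin cancer screening | distillation, privileged information, foundation model |
| 15 | Equivariant Latent Alignment via Flow Matching under Group Symmetries | Proposes Residual Latent Flow to address equivariant latent-space alignment under group symmetries, improving novel view synthesis quality | flow matching, representation learning |
| 16 | PEEK: Picking Essential frames via Efficient Knowledge distillation | PEEK: selects key video frames via efficient knowledge distillation, improving video captioning efficiency | distillation |
| 17 | GUI-C$^2$: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning | Proposes GUI-C$^2$, which achieves precise GUI element grounding through difficulty-aware reinforcement learning | reinforcement learning |
| 18 | DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory | Proposes DecMem, which achieves minute-long consistent world generation through decoupled memory | world model, world models |
| 19 | Remembering by Reconstructing: Domain Incremental Learning With Test-Time Training on Video Streams | Proposes a domain-incremental learning method based on test-time training that addresses catastrophic forgetting on video streams | masked autoencoder, MAE |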

🔬 Pillar 9: Embodied Foundation Models (15 papers)

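| # | Title | One-line summary | Tags |
|---|---|---|---|
| 20 | ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models | Proposes ERGeoBench to evaluate the geo-localization capabilities of multimodal large language models in embodied settings | large language model, multimodal |
| 21 | Does Visual Information Play a Decisive Role in Vision-Language-Action Model Driving Behavior? | Proposes a multi-level visual perturbation framework to analyze how strongly VLA model behavior depends on visual input | vision-language-action, VLA, multimodal |
| 22 | SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models | SOCO: a benchmark for evaluating semantic object correspondence in vision foundation models | foundation model, multimodal |
| 23 | MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding | Proposes the MechVQA dataset and builds the MechVL model to improve MLLM understanding of mechanical drawings | large language model, multimodal |
| 24 | Representation Forcing for Bottleneck-Free Unified Multimodal Models | Proposes Representation Forcing (RF) to enable bottleneck-free unified multimodal models | multimodal |
| 25 | Beyond Classification: Dynamic Adapter Routing for Continual Multimodal Retrieval | Proposes Dynamic Adapter Routing (DAR) to address catastrophic forgetting in continual multimodal retrieval | multimodal |
| 26 | Inference-Free Multimodal Learned Sparse Retrieval for Production-Scale Visual Document Search | Proposes V-SPLADE, an inference-free multimodal learned sparse retrieval method for large-scale visual document search | multimodal |
| 27 | Towards Effective Long-Video Event Prediction via Multi-Level Event Semantics Mining | VISTA: proposes a multi-level event semantics mining framework that improves long-video event prediction accuracy | large language model, multimodal |
| 28 | GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration | Proposes GGT-100K, which uses generative models to synthesize high-quality image pairs, improving generalization in real-world image restoration | foundation model, multimodal |
| 29 | Recognizing Co-Speech Gestures in-the-Wild | Proposes GRW, a large-scale gesture recognition dataset for recognizing semantically meaningful gestures in the wild | multimodal |
| 30 | Personalize Your Large Vision-language Models With In-context Prompt Tuning | Proposes ICPT, which personalizes large vision-language models through in-context prompt tuning | multimodal |
| 31 | SAM for Robust Mitochondria Instance Segmentation in Fluorescence Microscopy | Proposes fine-tuning SAM on synthetic data to improve the robustness of mitochondria instance segmentation in fluorescence microscopy | foundation model |
| 32 | Vanilla ViT for Automotive Point Cloud Semantic Segmentation | Proposes VaViT, which applies a vanilla ViT to automotive point cloud semantic segmentation with performance comparable to SOTA methods | multimodal |
| 33 | Text-guided Feature Disentanglement for Cross-modal Gait Recognition | Proposes TCFDNet, which uses text-guided feature disentanglement for LiDAR-camera cross-modal gait recognition | large language model |
| 34 | Immuno-VLM: Immunizing Large Vision-Language Models via Generative Semantic Antibodies for Open-World Trustworthiness | Immuno-VLM: immunizes large vision-language models via generative semantic antibodies, improving open-world trustworthiness | large language model |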

🔬 Pillar 3: Perception & Semantics (10 papers)

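| # | Title | One-line summary | Tags |
|---|---|---|---|
| 35 | DSD-GS: Dynamic-Static Decomposition of Gaussian Splatting for Efficient and High-Fidelity Dynamic Scene Reconstruction | DSD-GS: dynamic-static decomposition of Gaussian splatting for efficient, high-fidelity dynamic scene reconstruction | 3DGS, gaussian splatting, splatting |
| 36 | Benchmarking Single-Step Inpainting Methods for Multi-Object 3D Gaussian Splatting Scenes | Evaluates single-step inpainting methods on multi-object 3D Gaussian splatting scenes and proposes a baseline | 3D gaussian splatting, 3DGS, gaussian splatting |
| 37 | Triangle Splatting SLAM | Proposes a dense RGB-D SLAM system based on differentiable triangle splatting, enabling real-time mesh reconstruction and editing | 3D gaussian splatting, gaussian splatting, splatting |
| 38 | Feature-Optimized Vision for Adaptive 3D Scene Reconstruction | Proposes an adaptive, feature-optimized vision front end that improves 3D scene reconstruction quality and efficiency | 3D reconstruction, scene reconstruction |
| 39 | QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer | QVGGT: a post-training quantized Visual Geometry Grounded Transformer for efficient 3D reconstruction on edge devices | 3D reconstruction, VGGT, geometric consistency |
| 40 | SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence | SVI-Bench: a dynamic microworld benchmark for strategic video intelligence | scene understanding, multimodal |
| 41 | Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction | Proposes C4G, a 4D reconstruction framework for monocular video based on compact Gaussians and video diffusion | scene reconstruction, scene understanding |
| 42 | SurGe: Improved Surface Geometry in Point Maps | SurGe: improves surface geometry accuracy in point maps with a gradient-matching loss and neighborhood attention | 3D reconstruction |
| 43 | SMART: SMPLest-X Mesh Adaptation and RAFT Tracking for Soccer Pose Estimation | SMART: soccer player pose estimation based on SMPLest-X mesh adaptation and RAFT tracking | optical flow |
| 44 | RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video | RayDer: proposes a scalable self-supervised novel view synthesis method for real-world video | scene reconstruction |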

🔬 Pillar 1: Robot Control (3 papers)

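| # | Title | One-line summary | Tags |
|---|---|---|---|
| 45 | WristCompass: Kinematic Coupling as a Learnable Visual Concept for Ego-Camera Orientation | WristCompass: uses kinematic coupling as a learnable visual concept for ego-camera orientation estimation | manipulation, imitation learning, scene reconstruction |
| 46 | TALON: Token-Aligned Lightweight Adapters for 6-DoF Spacecraft Pose Estimation | Proposes TALON, token-aligned lightweight adapters for 6-DoF spacecraft pose estimation | sim-to-real, optical flow, spatiotemporal |
| 47 | Polyphony: Diffusion-based Dual-Hand Action Segmentation with Alternating Vision Transformer and Semantic Conditioning | Polyphony: proposes a diffusion-based dual-hand action segmentation method that significantly improves performance | bi-manual |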

🔬 Pillar 4: Generative Motion (2 papers)

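| # | Title | One-line summary | Tags |
|---|---|---|---|
| 48 | MultiAct: Text-to-Motion Generation from Composite Text via Tailored Attention Guidance | MultiAct: generates motion from composite text via tailored attention guidance | text-to-motion, motion synthesis, motion generation |
| 49 | Guidance for Low-Level Perceptual Editing in Unconditional Diffusion Models | Proposes a training-free editing framework for diffusion models that adjusts low-level perceptual attributes of images | classifier-free guidance |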

🔬 Pillar 8: Physics-based Animation (2 papers)

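| # | Title | One-line summary | Tags |
|---|---|---|---|
| 50 | VisionPulse: Dynamic Visual Sparsity for Efficient Multimodal Reasoning | VisionPulse: proposes a dynamic visual sparsification method that improves multimodal reasoning efficiency | PULSE, multimodal |
| 51 | Linear Scaling Video VLMs for Long Video Understanding | Proposes StateKV, which enables linear scaling of video VLMs and improves long-video understanding efficiency | spatiotemporal |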

🔬 Pillar 7: Motion Retargeting (2 papers)

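| # | Title | One-line summary | Tags |
|---|---|---|---|
| 52 | Omni-Supervised Motion Editing: Balancing Change and Invariance through Positive-Negative Learning | Proposes the OmniME framework, which balances change and invariance in text-driven human motion editing through positive-negative learning | human motion |
| 53 | CameraNoise: Enabling Faithful Camera Control in Video Diffusion through Geometry-Flow-Guided Noise Warping | Proposes CameraNoise, which achieves precise camera control in video diffusion through geometry-flow-guided noise warping | geometric consistency |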

🔬 Pillar 6: Video Extraction (1 paper)

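| # | Title | One-line summary | Tags |
|---|---|---|---|
| 54 | EGOSTREAM: A Diagnostic Benchmark for Streaming Episodic Memory in Egocentric Vision | Proposes EGOSTREAM, a diagnostic benchmark for evaluating streaming episodic memory in egocentric vision | egocentric, egocentric vision |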
