cs.CV (2026-05-12)

📊 68 papers in total | 🔗 17 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (23 🔗6) · Pillar 9: Embodied Foundation Models (21 🔗6) · Pillar 3: Perception & Semantics (15 🔗2) · Pillar 1: Robot Control (4 🔗2) · Pillar 4: Generative Motion (2) · Pillar 7: Motion Retargeting (2 🔗1) · Pillar 5: Interaction & Reaction (1)

🔬 Pillar 2: RL & Architecture (23 papers)

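# | Title | One-line summary | Tags | 🔗
1 | PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting | PointGS: semantically consistent unsupervised 3D point cloud segmentation using 3D Gaussian splatting | contrastive learning 3D gaussian splatting gaussian splatting
2 | SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture | SenseNova-U1: a unified multimodal understanding and generation model built on the NEO-unify architecture | world model world models vision-language-action
3 | PairDropGS: Paired Dropout-Induced Consistency Regularization for Sparse-View Gaussian Splatting | Proposes PairDropGS to address unstable reconstruction in sparse-view Gaussian splatting | representation learning 3D gaussian splatting 3DGS
4 | Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training | VISTA: a vision-aware self-improvement training framework that improves the reasoning ability of multimodal large language models | preference learning large language model multimodal
5 | Selection, Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation | Proposes a radar-modulated selection mechanism for radar-camera depth estimation | Mamba state space model MAE
6 | Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction | Lite3R: a model-agnostic framework for efficient feed-forward 3D reconstruction that reduces compute cost while preserving accuracy | linear attention teacher-student distillation
7 | Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images | Proposes the PCSR-Bench benchmark to diagnose the perspective-conditioned spatial reasoning of MLLMs on omnidirectional images | reward design reward shaping egocentric
8 | CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating | Proposes CaC, a vision-language-model-based framework for improving the accuracy and interpretability of video anomaly detection | reinforcement learning spatiotemporal chain-of-thought
9 | HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation | HorizonDrive: a self-corrective autoregressive world model for long-horizon driving simulation | world model world models distillation
10 | TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles | Proposes TCP-SSM, which uses token-conditioned poles to improve the efficiency and interpretability of vision state space models | Mamba SSM state space model
11 | VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference | Proposes VIP, a visual-guided prompt evolution method for efficient dense vision-language inference | VIP distillation open-vocabulary
12 | SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning | Proposes SyncDPO to address temporal synchronization in joint video-audio generation | preference learning DPO direct preference optimization
13 | 3D-Belief: Embodied Belief Inference via Generative 3D World Modeling | Proposes 3D-Belief, which performs embodied belief inference through generative 3D world modeling | world model world models
14 | Contrastive Learning under Noisy Temporal Self-Supervision for Colonoscopy Videos | Proposes a noise-aware temporal self-supervised contrastive learning method for colonoscopy videos | contrastive learning foundation model
15 | Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation | Proposes a visual-anchored distillation method based on reasoning-prefix masking that improves how well VLMs use visual information in multimodal reasoning | distillation multimodal
16 | When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy | Proposes a perceptual entropy constraint to address diversity collapse in RLHF fine-tuning of flow models | flow matching RLHF
17 | Interactive State Space Model with Cross-Modal Local Scanning for Depth Super-Resolution | Proposes a depth super-resolution method with cross-modal local scanning, built on an interactive state space model | Mamba state space model
18 | The DAWN of World-Action Interactive Models | Proposes WAIM to address the interdependence between world prediction and action generation | world model world models world action model
19 | FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity | Proposes FIS-DiT, which uses training-free frame-interleaved sparsity to break the inference-speed bottleneck of video diffusion models | predictive model distillation spatiotemporal
20 | Large-Small Model Collaboration for Farmland Semantic Change Detection | Proposes a large-small model collaboration framework to address pseudo-changes in farmland semantic change detection | Mamba multimodal
21 | Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations | Proposes CoDAAR, which achieves cross-modal domain generalization through semantically aligned discrete representations | representation learning multimodal
22 | Learning Subspace-Preserving Sparse Attention Graphs from Heterogeneous Multiview Data | Proposes SAGL, which learns subspace-preserving sparse attention graphs from heterogeneous multiview data for unsupervised transfer learning | linear attention representation learning
23 | DORA: Dynamic Online Reinforcement Agent for Token Merging in Vision Transformers | Proposes DORA, a reinforcement-learning-based online inference method for dynamic token merging in ViTs | reinforcement learning distillation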

🔬 Pillar 9: Embodied Foundation Models (21 papers)

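# | Title | One-line summary | Tags | 🔗
24 | Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models | Proposes GAP, a granular alignment paradigm for visual reasoning in multimodal large language models | large language model multimodal
25 | Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts | Proposes an all-in-one image restoration framework based on multimodal large language models to address complex degradation modeling | large language model multimodal
26 | UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs | UniVLR: unifies text and vision in visual latent reasoning to improve the visual thinking efficiency of multimodal LLMs | large language model multimodal chain-of-thought
27 | Dynamic Execution Commitment of Vision-Language-Action Models | Proposes A3, an adaptive action acceptance mechanism, to address dynamic execution commitment in VLA models | vision-language-action VLA
28 | AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward | AlphaGRPO: unlocks self-reflective multimodal generation in UMMs via decompositional verifiable rewards | multimodal
29 | G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models | Proposes G$^2$TR, which improves the inference efficiency of separate-encoder unified multimodal models through generation-guided visual token reduction | multimodal
30 | Multimodal Abstractive Summarization of Instructional Videos with Vision-Language Models | Proposes the ClipSum framework, which uses frozen CLIP vision-language features for multimodal summarization of instructional videos | multimodal
31 | OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models | Proposes OTT-Vid, which compresses temporal tokens via optimal transport to improve Video-LLM efficiency | large language model
32 | CAST: Collapse-Aware multi-Scale Topology Fusion for Multimodal Coreset Selection | Proposes the CAST framework, which fuses multi-scale topological structure for efficient subset selection from multimodal datasets | multimodal
33 | Instruct-ICL: Instruction-Guided In-Context Learning for Post-Disaster Damage Assessment | Instruct-ICL: uses instruction-guided in-context learning to improve the performance of multimodal large language models on post-disaster damage assessment | large language model multimodal chain-of-thought
34 | PresentAgent-2: Towards Generalist Multimodal Presentation Agents | Proposes PresentAgent-2, a generalist multimodal presentation agent that supports multiple presentation modes | multimodal
35 | Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs | ContextGuard: a context-preserving token pruning framework for Omni-LLMs that improves efficiency while maintaining performance | large language model multimodal
36 | Logit-Attention Divergence: Mitigating Position Bias in Multi-Image Retrieval via Attention-Guided Calibration | Proposes the Logit-Attention Divergence method to address position bias caused by attention bias in multi-image retrieval | large language model multimodal
37 | When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs | Uses visual attention structure to reveal hallucination in multimodal large language models and proposes the LaSCD decoding strategy | large language model multimodal
38 | LDDR: Linear-DPP-Based Dynamic-Resolution Frame Sampling for Video MLLMs | Proposes LDDR, a linear-DPP-based dynamic-resolution video frame sampling method that improves video MLLM performance | large language model multimodal
39 | Elastic Attention Cores for Scalable Vision Transformers | Proposes VECA, which achieves scalable vision Transformers through elastic attention cores | foundation model
40 | H2G: Hierarchy-Aware Hyperbolic Grouping for 3D Scenes | Proposes H2G, a hierarchy-aware hyperbolic grouping method for 3D scene understanding | foundation model
41 | Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters | Proposes Chronicles-OCR for evaluating the cross-temporal visual perception of VLLMs along the evolutionary trajectory of Chinese characters | large language model
42 | SB-BEVFusion: Enhancing the Robustness against Sensor Malfunction and Corruptions | SB-BEVFusion: improves the robustness of multimodal fusion under sensor malfunction and data corruption | multimodal
43 | M$^4$-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection | M$^4$-SAM: a memory-augmented multimodal mixture-of-experts model for RGB-D video salient object detection | foundation model
44 | ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes | ShapeCodeBench: a renewable benchmark for perception-to-program reconstruction of synthetic shape scenes | multimodal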

🔬 Pillar 3: Perception & Semantics (15 papers)

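# | Title | One-line summary | Tags | 🔗
45 | 3D Gaussian Splatting for Efficient Retrospective Dynamic Scene Novel View Synthesis with a Standardized Benchmark | Proposes an efficient 3D Gaussian splatting method for novel view synthesis of synchronized multi-view dynamic scenes | 3D gaussian splatting 3DGS gaussian splatting
46 | Focusable Monocular Depth Estimation | Proposes FocusDepth to address insufficient depth accuracy in target regions for monocular depth estimation | depth estimation monocular depth Depth Anything
47 | Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction | AmbiSuR: a Gaussian-splatting-based surface reconstruction framework that is robust to photometric ambiguity | 3D reconstruction gaussian splatting splatting
48 | PD-4DGS: Progressive Decomposition of 4D Gaussian Splatting for Bandwidth-Adaptive Dynamic Scene Streaming | Proposes PD-4DGS, a progressive decomposition of 4D Gaussian splatting for bandwidth-adaptive dynamic scene streaming | 3DGS gaussian splatting splatting
49 | VidSplat: Gaussian Splatting Reconstruction with Geometry-Guided Video Diffusion Priors | VidSplat: uses geometry-guided video diffusion priors for Gaussian splatting reconstruction, improving 3D reconstruction from sparse views | gaussian splatting splatting scene reconstruction
50 | 4DVGGT-D: 4D Visual Geometry Transformer with Improved Dynamic Depth Estimation | Proposes 4DVGGT-D, which improves the 4D visual geometry Transformer through dynamic depth estimation for dynamic scene reconstruction from monocular video | depth estimation scene reconstruction foundation model
51 | GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction | GeoQuery: a geometry-guided diffusion model for sparse-view 3D reconstruction that improves reconstruction quality | 3D gaussian splatting 3DGS 3D reconstruction
52 | PoseCompass: Intelligent Synthetic Pose Selection for Visual Localization | Proposes PoseCompass to address synthetic pose selection in visual localization | 3D gaussian splatting 3DGS gaussian splatting
53 | BARISTA: A Multi-Task Egocentric Benchmark for Compositional Visual Understanding | BARISTA: a multi-task egocentric benchmark dataset for compositional visual understanding | scene understanding egocentric
54 | Urban Risk-Aware Navigation via VQA-Based Event Maps for People with Low Vision | Proposes an urban risk-aware navigation system based on VQA event maps to help people with low vision travel safely | scene understanding large language model multimodal
55 | PointForward: Feedforward Driving Reconstruction through Point-Aligned Representations | PointForward: a feed-forward driving scene reconstruction method based on point-aligned representations | 3D gaussian splatting 3DGS gaussian splatting
56 | The Midas Touch for Metric Depth | Proposes MTD, which uses extremely sparse 3D data to convert relative depth into metric depth and improves cross-scene generalization | depth estimation metric depth
57 | Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances | AFFORDMEM: uses cross-scene and in-scene memory for 3D functional affordance grounding | affordance
58 | LiBrA-Net: Lie-Algebraic Bilateral Affine Fields for Real-Time 4K Video Dehazing | Proposes LiBrA-Net for ultra-high-definition 4K video dehazing | optical flow spatiotemporal
59 | TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion | Proposes TriBand-BEV, which achieves real-time LiDAR-only 3D pedestrian detection through height-aware BEV and high-resolution feature fusion | 3D reconstruction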

🔬 Pillar 1: Robot Control (4 papers)

# | Title | One-line summary | Tags | 🔗
60 | OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation | OmniHumanoid: a cross-embodiment humanoid video generation framework that adapts without paired data | humanoid human-to-robot cross-embodiment
61 | EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras | EgoEV-HandPose: egocentric 3D hand pose estimation and gesture recognition using stereo event cameras | bi-manual monocular depth egocentric
62 | EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera | EgoForce: forearm-guided camera-space 3D hand pose estimation from a monocular egocentric camera | manipulation egocentric hand reconstruction
63 | WildRelight: A Real-World Benchmark and Physics-Guided Adaptation for Single-Image Relighting | WildRelight: a real-world single-image relighting benchmark and a physics-guided adaptation method | sim-to-real

🔬 Pillar 4: Generative Motion (2 papers)

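# | Title | One-line summary | Tags | 🔗
64 | ScaleMoGen: Autoregressive Next-Scale Prediction for Human Motion Generation | Proposes the ScaleMoGen framework to address fine-grained prediction in human motion generation | motion generation motion tokenizer MoMask
65 | Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers | Proposes a dynamic full-body human-object interaction motion generation framework that blends pre-trained modular controllers | motion diffusion model motion diffusion physically plausible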

🔬 Pillar 7: Motion Retargeting (2 papers)

# | Title | One-line summary | Tags | 🔗
66 | GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization | GaitProtector: impersonation-based gait de-identification via training-free diffusion latent optimization | latent optimization spatiotemporal
67 | EchoTracker2: Enhancing Myocardial Point Tracking by Modeling Local Motion | EchoTracker2: enhances myocardial point tracking by modeling local motion | motion estimation spatiotemporal

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-line summary | Tags | 🔗
68 | PoseBridge: Bridging the Skeletonization Gap for Zero-Shot Skeleton-Based Action Recognition | Proposes PoseBridge to address semantic loss in zero-shot skeleton-based action recognition | human-object interaction
