cs.CV (2026-05-28)

📊 81 papers in total | 🔗 17 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (26 🔗7)
Pillar 3: Perception & Semantics (23 🔗4)
Pillar 2: RL & Architecture (16 🔗4)
Pillar 1: Robot Control (8)
Pillar 4: Generative Motion (3 🔗1)
Pillar 8: Physics-based Animation (2 🔗1)
Pillar 6: Video Extraction (2)
Pillar 7: Motion Retargeting (1)

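🔬 Pillar 9: Embodied Foundation Models (26 papers)

# | Title | Key point | Tags
1 | VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies | Proposes VisualThink-VLA, which uses visual intermediate reasoning to achieve efficient, low-latency vision-language-action policies | vision-language-action, VLA, chain-of-thought
2 | Archon: A Unified Multimodal Model for Holistic Digital Human Generation | Archon: a unified multimodal model for holistic digital human generation | multimodal
3 | DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark | DocRetriever: a plug-and-play framework for multimodal document retrieval, accompanied by a comprehensive benchmark | multimodal
4 | VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents | VideoFDB: proposes the first full-duplex audio-visual dialogue benchmark to evaluate the non-verbal interaction capabilities of conversational agents | multimodal, visual grounding
5 | AnomalyAgent: Training-Free Agentic Models for Zero-/Few-Shot Anomaly Detection | Proposes AnomalyAgent, a training-free agentic model for zero-/few-shot anomaly detection | large language model, multimodal
6 | Genetically Aligned Patient Representations Improve Hematological Diagnosis | Proposes genetically aligned patient representations that improve hematological diagnosis performance | foundation model, multimodal
7 | Mitigating Hallucination in Vision-Language Models through Barrier-Regulated Adaptive Closed-form Steering | Proposes BRACS, which mitigates hallucination in vision-language models through adaptive closed-form steering | multimodal, visual grounding
8 | SuperVoxelGPT: Adaptive and Ordered 3D Tokenization for Autoregressive Shape Generation | SuperVoxelGPT: an adaptive, ordered 3D tokenization method for autoregressive shape generation | large language model, multimodal
9 | CogniVerse: Revolutionizing Multi-Modal Retrieval-Augmented Generation with Cognitive Reflection and Geometric Reasoning | CogniVerse: a multimodal retrieval-augmented generation framework that combines cognitive reflection with geometric reasoning | large language model, multimodal
10 | Grounded 3D-Aware Spatial Vision-Language Modeling | Proposes GR3D, a spatial vision-language model with explicit and implicit 2D grounding and monocular 3D grounding capabilities | chain-of-thought
11 | LoMo: Local Modality Substitution for Deeper Vision-Language Fusion | Proposes LoMo, a local modality substitution method that improves the robustness of cross-modal fusion in vision-language models | multimodal
12 | Unveiling the Visual Counting Bottleneck in Vision-Language Models | Reveals the visual counting bottleneck in vision-language models: symbol-mapping failures lead to poor extrapolative generalization | foundation model
13 | EarlyTom: Early Token Compression Completes Fast Video Understanding | EarlyTom: early token compression speeds up video understanding and significantly reduces latency | large language model
14 | Masked Diffusion Vision-Language Models for Temporal Action Localization | Proposes MDVLM-TAL, which uses masked diffusion models to address the difficulty of correcting temporal boundaries in temporal action localization | language conditioned
15 | Pocket-Dentist: On-Device Dental Image Understanding via Efficient Multimodal Large Language Models | Pocket-Dentist: on-device dental image understanding via efficient multimodal large language models | large language model, multimodal
16 | ReactBench: A Cause-Driven Benchmark for Multimodal Hallucination via Systematic Evaluation | ReactBench: proposes a cause-driven benchmark for multimodal hallucination that systematically evaluates vision-language models | large language model, multimodal, chain-of-thought
17 | WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction | Proposes WorldMemArena to evaluate multimodal agent memory in action-world interaction | large language model, multimodal
18 | DMC-CF: Dynamic Multimodal CounterFactual QA benchmark for Causal Reasoning | Proposes DMC-CF, a dynamic multimodal counterfactual QA benchmark for causal reasoning | large language model, multimodal
19 | PInVerify: An Offline Embodied Benchmark for Active Instance Verification | Proposes PInVerify, an offline embodied benchmark for active instance verification | embodied AI, large language model, multimodal
20 | Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset | Proposes CFMME, a comprehensive Chinese financial multimodal evaluation dataset for benchmarking large vision-language models | multimodal
21 | SuperVoxelGPT: Adaptive and Ordered 3D Tokenization for Autoregressive Shape Generation | SuperVoxelGPT: an adaptive, ordered 3D tokenization method for autoregressive shape generation | large language model, multimodal
22 | ReGuLaR: Relation-Grounded Latent Reasoning for Large Vision-Language Models | Proposes the ReGuLaR framework, which strengthens the latent reasoning of large vision-language models through relation-graph reasoning | chain-of-thought
23 | On-Device Generative AI for GDPR-Compliant Visual Monitoring: Natural Language Alerts from Local Object Detection | Proposes a GDPR-compliant, on-device generative AI visual monitoring system that performs local object detection and generates natural language alerts | large language model
24 | GiPL: Generative augmented iterative Pseudo-Labeling for Cross-Domain Few-Shot Object Detection | Proposes GiPL, which tackles cross-domain few-shot object detection through generative augmented iterative pseudo-labeling | foundation model
25 | FlowSeg: Dynamic Semantic Guidance for LLM-Conditioned Segmentation | FlowSeg: proposes a dynamic semantic guidance mechanism that improves LLM-conditioned image segmentation | large language model
26 | FedSmoothLoRA: Toward Smoother and Faster Convergence in Federated Low-Rank Adaptation | FedSmoothLoRA: a method for smoother and faster convergence in federated low-rank adaptation | foundation model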

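🔬 Pillar 3: Perception & Semantics (23 papers)

# | Title | Key point | Tags
27 | Supercharging Thermal Gaussian Splatting with Depth Estimation | Proposes TDg, a method based on thermal infrared images and depth estimation that speeds up and improves 3D Gaussian splatting | depth estimation, 3D gaussian splatting, 3D reconstruction
28 | PhyGenHOI: Physically-Aware 4D Generation of Dynamic Human-Object Interactions | PhyGenHOI: proposes a physically aware 4D generation framework for dynamic human-object interactions | 3DGS, motion diffusion model, MDM
29 | DGSG-Mind: Dynamic 3D Gaussian Scene Graphs for Long-Term Scene Understanding and Grounding | DGSG-Mind: dynamic 3D Gaussian scene graphs for long-term scene understanding and grounding | scene reconstruction, scene understanding, semantic mapping
30 | Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field | Proposes an uncertainty-driven 3D Gaussian splatting active mapping method based on an anisotropic visibility field | 3D gaussian splatting, 3DGS, gaussian splatting
31 | From General Vision to Reliable Traversability Estimation: Adapting Vision Foundation Models for Unstructured Outdoor Environments | ViTA: adapts vision foundation models for reliable terrain traversability estimation in unstructured environments | traversability, foundation model
32 | FRUC: Feedforward Dynamic Scene Reconstruction from Uncalibrated Collaborative Driving Views | FRUC: feedforward dynamic scene reconstruction from uncalibrated collaborative driving views | 3D gaussian splatting, gaussian splatting, splatting
33 | OmniCD: A Foundational Framework for Remote Sensing Image Change Detection Guided by Multimodal Semantics | OmniCD: a foundational framework for remote sensing image change detection guided by multimodal semantics | semantic map, multimodal
34 | City-Mesh3R: Simulation-Ready City-Scale 3D Mesh Reconstruction from Multi-View Images | City-Mesh3R: reconstructs simulation-ready, city-scale 3D meshes from multi-view images | 3D reconstruction, gaussian splatting, splatting
35 | REST3D: Reconstructing Physically Stable 3D Scenes from a Single Image | REST3D: proposes a physically constrained single-image 3D scene reconstruction framework that improves the physical stability of scenes | scene understanding, penetration, human-object interaction
36 | Large Depth Completion Model from Sparse Observations | Proposes LDCM, a large Transformer-based model for sparse depth completion | depth estimation, metric depth, foundation model
37 | Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence | Proposes a semantic correspondence learning framework based on 3D priors that improves the model's awareness of 3D structure | sam 3D, SAM 3D, foundation model
38 | MonoPhysics: Estimating Geometry, Appearance, and Physical Parameters from Monocular Videos | MonoPhysics: joint estimation of geometry, appearance, and physical parameters from monocular videos | 3D gaussian splatting, gaussian splatting, splatting
39 | Déjà View: Looping Transformers for Multi-View 3D Reconstruction | Déjà View: looping Transformers for multi-view 3D reconstruction, improving efficiency and performance | 3D reconstruction
40 | Towards Consistent Video Geometry Estimation | ViGeo: a general feedforward model for spatiotemporally consistent geometry estimation on video sequences | depth estimation, foundation model
41 | DVSM: Decoder-only View Synthesis Model Done Right | DVSM: a decoder-only view synthesis model that outperforms conventional encoder-decoder architectures | 3DGS, foundation model
42 | GMOS: Grounding Moving Object Segmentation in 3D Space and Time | Proposes the GMOS framework to address the lack of 3D information in moving object segmentation | optical flow
43 | BitC-3DGS: High-Capacity 3D Gaussian Splatting Watermarking via Bit Compression | BitC-3DGS: high-capacity 3D Gaussian splatting watermarking via bit compression | 3D gaussian splatting, 3DGS, gaussian splatting
44 | Comparative evaluation of photogrammetric reconstruction methods and 3D Gaussian Splatting for road surface roughness analysis | Compares four 3D reconstruction methods for assessing road surface roughness | 3D gaussian splatting, 3DGS, 3D reconstruction
45 | DGSG-Mind: Dynamic 3D Gaussian Scene Graphs for Long-Term Scene Understanding and Grounding | Proposes DGSG-Mind to address fragile instance association in dynamic 3D scene understanding | scene reconstruction, scene understanding, semantic mapping
46 | Learning Representations from 3D Gaussian Splats | Evaluates geometric deep learning for scene understanding on 3D Gaussian splats | 3D gaussian splatting, 3DGS, gaussian splatting
47 | Déjà View: Looping Transformers for Multi-View 3D Reconstruction | Déjà View: looping Transformers for multi-view 3D reconstruction, improving efficiency and performance | 3D reconstruction
48 | Towards Consistent Video Geometry Estimation | ViGeo: proposes a general feedforward model for spatiotemporally consistent geometry estimation on video sequences | depth estimation, foundation model
49 | VLM3: Vision Language Models Are Native 3D Learners | VLM3: uses vision-language models for native 3D scene understanding | depth estimation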

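🔬 Pillar 2: RL & Architecture (16 papers)

# | Title | Key point | Tags
50 | SAM3D-Phys: Towards Multi-Object Interactive Simulation in Real World | SAM3D-Phys: recovers complete object geometry, ready for interactive simulation, from reconstructed real-world scenes | distillation, scene reconstruction, sam 3D
51 | minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models | minWM: a full-stack open-source framework for real-time interactive video world models | world model, world models, distillation
52 | AgentCVR: Active Multi-Agent Cross-Video Reasoning via Script-Simulated Reinforcement Learning | Proposes AgentCVR, which uses script-simulated reinforcement learning to address the difficulty of evidence acquisition in cross-video reasoning | reinforcement learning, large language model, multimodal
53 | EVL-ECG: Efficient ECG Interpretation With Multi-Aspect Heterogeneous Knowledge Distillation | Proposes EVL-ECG, which achieves efficient electrocardiogram (ECG) interpretation through heterogeneous knowledge distillation | distillation, feature matching, foundation model
54 | FakeVLM-R1: Internalizing Physical Laws via CoT for Synthetic Image Detection | FakeVLM-R1: improves synthetic image detection through chain-of-thought reasoning and internalized physical laws | imitation learning, multimodal, chain-of-thought
55 | SLAD : Shared LoRA Adapters for Task Specific Distillation | Proposes SLAD, shared LoRA adapters for task-specific distillation that improve small-model performance | distillation, foundation model
56 | LiveSVG: Zero-Shot SVG Animation via Video Generation | LiveSVG: a zero-shot SVG animation method based on video generation | distillation, motion representation
57 | xModel-KD: Cross-modal Knowledge Distillation for 3D Scene Perception using LiDAR | Proposes xModel-KD, which uses cross-modal knowledge distillation to improve 3D scene perception on LiDAR point clouds | distillation, scene understanding
58 | Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning | Stable-Layers: fine-tunes image layer decomposition models with VLM-scored reinforcement learning, without paired supervision | reinforcement learning
59 | Reinforcement Learning with Robust Rubric Rewards | Proposes RLR³, reinforcement learning with robust rubric rewards, to improve performance on vision-language tasks | reinforcement learning
60 | SGMD: Score Gradient Matching Distillation for Few-Step Video Diffusion Distillation | Proposes SGMD, a score gradient matching distillation method for few-step video diffusion distillation | distillation
61 | GeoMag: Geometric-Aware Video Motion Magnification via State Space Model | Proposes GeoMag, a geometry-aware video motion magnification method based on a state space model | state space model
62 | NeuROK: Generative 4D Neural Object Kinematics | NeuROK: generative 4D neural object kinematics for realistic simulation of object deformation | world model, world models
63 | Clustering Guided Domain-Specific Pretrained Foundation Model Very High-Resolution Arctic Remote Sensing | Proposes a clustering-guided, domain-specific pretrained model to improve Arctic remote sensing analysis | masked autoencoder, MAE, foundation model
64 | UniNote: A Unified Embedding Model for Multimodal Representation and Ranking | Proposes UniNote to address the challenges of multimodal representation and ranking in industrial-scale item-to-item retrieval | reinforcement learning, representation learning, multimodal
65 | Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization | Proposes Guidance Contrastive Policy Optimization (GCPO) for token-level credit assignment in discrete policy optimization | reinforcement learning, policy learning, chain-of-thought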

🔬 Pillar 1: Robot Control (8 papers)

# | Title | Key point | Tags
66 | SAFE-Pruner: Semantic Attention-Guided Future-Aware Token Pruning for Efficient Vision-Language-Action Manipulation | Proposes SAFE-Pruner, which accelerates VLA model inference through semantic attention-guided, future-aware token pruning | manipulation, vision-language-action, VLA
67 | YoCausal: How Far is Video Generation from World Model? A Causality Perspective | YoCausal: evaluates the gap between video generation models and world models from a causality perspective | sim-to-real, world model, world models
68 | Geometry-Guided Modeling of Foundation Features Enables Generalizable Object Shape Deformation Learning | Proposes a geometry-guided deformation learning framework for generalizable object shape reconstruction | manipulation
69 | Dex2HOI: Dexterous Bimanual Two-Object Interaction Generation | Dex2HOI: proposes a dual-stream diffusion model for generating dexterous bimanual two-object interaction motions | manipulation, bi-manual, motion synthesis
70 | Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning | Mitigates state aliasing in vision-language-action models through inverse dynamics learning | manipulation, vision-language-action, VLA
71 | SalsaAgent: A multimodal embodied language model for interactive dance generation | SalsaAgent: proposes a multimodal embodied language model for generating interactive dance motions | humanoid, large language model, multimodal
72 | Prior Availability in Industrial Visual Sim-to-Real: A Review of CAD-Guided and CAD-Unavailable Regimes | Reframes industrial visual sim-to-real: a review of CAD-guided and CAD-unavailable methods organized by prior availability | sim-to-real, teacher-student
73 | ParCo-SDF: Learning Prior-Free Partial-to-Complete Signed Distance Fields of Deformable Objects | ParCo-SDF: learns prior-free partial-to-complete SDF reconstruction of deformable objects | manipulation

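🔬 Pillar 4: Generative Motion (3 papers)

# | Title | Key point | Tags
74 | Colored Noise Diffusion Sampling | Proposes colored noise sampling (CNS), which improves diffusion-model image synthesis quality through frequency-decoupled energy transfer | classifier-free guidance
75 | S2MDF: A Plug-And-Play Layer for Intersection-Free Multi-Object Signed Distance Fields | Proposes S2MDF, a plug-and-play module that resolves intersections in multi-object SDF representations | penetration
76 | AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling | AnyMo: an any-modality conditional motion generation framework based on masked modeling | motion synthesis, motion generation, motion tokenizer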

🔬 Pillar 8: Physics-based Animation (2 papers)

# | Title | Key point | Tags
77 | GenEraser: Generalizable Video Object Removal via Balanced Text-Mask Guidance and Decoupled Locator-Preserver | GenEraser: proposes a generalizable video object removal framework based on balanced text-mask guidance and a decoupled locator-preserver | spatiotemporal, multimodal
78 | Veda: Scalable Video Diffusion via Distilled Sparse Attention | Veda: scalable video diffusion models via distilled sparse attention | spatiotemporal

🔬 Pillar 6: Video Extraction (2 papers)

# | Title | Key point | Tags
79 | Mesh-Aware Epipolar Matching for Multi-View Multi-Person 3D Pose Estimation in Basketball | Proposes Mesh-Aware Epipolar Matching for multi-person 3D pose estimation in basketball games | human mesh recovery
80 | Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge | Proposes a framework that fuses semantic and visual evidence to tackle long-horizon video question answering | egocentric, large language model, multimodal

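🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | Key point | Tags
81 | Turbulence-Robust Dynamic Object Segmentation with Multi-Signal Priors and SAM2 Refinement | Proposes a turbulence-robust dynamic object segmentation method based on multi-signal priors and SAM2 refinement | motion estimation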
