cs.CV (2026-06-01)

📊 56 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (19 🔗1) · Pillar 3: Perception & Semantics (12 🔗2) · Pillar 2: RL & Architecture (11 🔗4) · Pillar 1: Robot Control (6 🔗4) · Pillar 7: Motion Retargeting (2) · Pillar 6: Video Extraction (2) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 4: Generative Motion (2)

🔬 Pillar 9: Embodied Foundation Models (19 papers)

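# | Title | One-sentence summary | Tags | 🔗
1 | Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning | Proposes Attentive-CoT, which uses attention-guided fine-tuning to improve the CoT reasoning of multimodal large language models. | large language model, multimodal, chain-of-thought
2 | Jailbreaking Multimodal Large Language Models using Multi-Clip Video | Proposes Multi-Clip Video SafetyBench to evaluate how the diversity of video inputs affects jailbreak attacks on multimodal large language models. | large language model, multimodal
3 | Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling | Proposes a method based on perceptual perturbation and reward modeling to mitigate judgment bias in multimodal LLM-as-a-judge. | large language model, multimodal
4 | ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning | ProtoAda: prototype-guided adaptive adapter expansion and geometric consolidation for multimodal continual instruction tuning. | large language model, multimodal
5 | Improving Visual Token Reduction via Rectifying Distortions for Efficient Multimodal LLM Inference | RESTORE: improves multimodal LLM inference efficiency by rectifying visual distortions. | large language model, multimodal
6 | Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains | Shows that the gains multimodal agents get from tool use may be overestimated; invoking a tool does not by itself indicate a capability gain. | multimodal
7 | Multimodal Approaches for Visually-Rich Document Type Classification: A Comparative Analysis | A comparative analysis of multimodal methods for visual document type classification, revealing the contribution of each modality. | multimodal
8 | InfoMerge: Information-aware Token Compression for Efficient Video Large Language Models | InfoMerge: an information-aware token compression method for efficient video large language models. | large language model
9 | Multimodal Action Diffusion for Robust End-to-End Autonomous Driving | Proposes the Action Diffusion Transformer for robust multimodal action prediction in end-to-end autonomous driving. | multimodal
10 | The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue | Proposes the Image Reconstruction Game benchmark, which improves image generation quality through iterative multimodal dialogue. | multimodal
11 | FlatVPR: Plug-and-play Geo-linear Residual Adapter for Geometric Rectification of Foundation Model Feature Manifolds | Proposes FlatVPR to address the feature reconstruction problem in visual place recognition. | foundation model
12 | PathAR: Structure-First Autoregressive Synthesis of Multimodal Pathology Images | PathAR: a structure-first autoregressive model for synthesizing multimodal pathology images. | multimodal
13 | AdaCodec: A Predictive Visual Code for Video MLLMs | AdaCodec: a predictive visual code for video MLLMs that significantly reduces compute cost and improves performance. | large language model, multimodal
14 | Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events | Moment-Video: diagnoses the temporal fidelity of video multimodal large models on momentary visual events. | large language model, multimodal
15 | Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models | Proposes the SEIG framework, which uses vision-language models to reconstruct an editable Blender scene from a single image. | foundation model
16 | Not All Points Are Equal: Uncertainty-Aware 4D LiDAR Scene Synthesis | Proposes the U4D framework, which uses uncertainty to guide 4D LiDAR scene synthesis, improving scene fidelity and temporal consistency. | embodied AI
17 | A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision | Proposes the TGAD benchmark, revealing that existing text-guided anomaly detection relies too little on language conditioning. | multimodal
18 | Density-Aware Translation of Spurious Correlations in Zero-Shot VLMs | Proposes Density-Aware Translation (DAT) to improve the robustness of zero-shot VLMs under spurious correlations. | multimodal
19 | Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation | Goal2Pixel: grounds goals to pixels for vision-language navigation. | VLN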

🔬 Pillar 3: Perception & Semantics (12 papers)

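# | Title | One-sentence summary | Tags | 🔗
20 | VEDAL: Variational Error-Driven Asynchronous Learning for 3D Gaussian Splatting Pruning | VEDAL: 3D Gaussian Splatting pruning based on variational error-driven asynchronous learning. | 3D gaussian splatting, 3DGS, gaussian splatting
21 | $\text{VG}^2$GT: Voxel-Gaussian Splatting Visual Geometry Grounded Transformer | Proposes VG²GT, which uses voxel Gaussian splatting and a visual geometry Transformer for high-quality 3D reconstruction and novel view synthesis. | 3D reconstruction, gaussian splatting, splatting
22 | Fast and Lightweight Novel View Synthesis with Differentiable Multiplane Image | Proposes a fast, lightweight novel view synthesis method based on a differentiable multiplane image, addressing the speed and model-size bottlenecks of NeRF and similar methods. | 3D gaussian splatting, 3DGS, gaussian splatting
23 | TIDES: Time-Derivative Event Simulation via Deformable Reconstruction | TIDES: a time-derivative event simulator based on deformable reconstruction, addressing the timestamp batching problem in event simulation. | gaussian splatting, splatting, TAMP
24 | Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation | Proposes MDA, a mixture-density representation that resolves flying points at boundaries in depth estimation. | depth estimation
25 | Retrieve What's Missing: Coverage-Maximizing Retrieval for Consistent Long Video Generation | Proposes COVRAG, which improves the consistency of long video generation through coverage-maximizing retrieval. | 3D reconstruction, geometric consistency
26 | Honey, I Shrunk the Arc de Triomphe! | Proposes the MetricScenes dataset to address scale collapse in monocular geometry estimation. | MoGe, foundation model
27 | PerBite: A Curated Diagnostic Workflow for Bite-Aware Food Volume Estimation | PerBite proposes a bite-aware diagnostic workflow for food volume estimation and achieved a leading result in the MetaFood challenge. | MoGe, 3D reconstruction
28 | Places in the Wild: A Large, High-Resolution RAW Photograph Dataset for Ecologically Valid Vision Research | Places in the Wild: a large, high-resolution RAW image dataset for ecologically valid vision research. | scene understanding
29 | WebSpline: Structure-Informed Splines for Real-Time 3D Gaussians from Monocular Videos | WebSpline: a structured spline method for real-time 3D Gaussian modeling from monocular videos. | scene reconstruction
30 | PhyScene3D: Physically Consistent Interactive 3D Tabletop Scene Generation | PhyScene3D: proposes a physically consistent, interactive 3D tabletop scene generation framework. | affordance
31 | Effective Multi-sensor Conditioning for Street-view Novel-view Synthesis | StreetNVS: proposes a street-view novel view synthesis method that effectively fuses multi-sensor information. | metric depth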

🔬 Pillar 2: RL & Architecture (11 papers)

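# | Title | One-sentence summary | Tags | 🔗
32 | MotionDreamer: Universal Skeletal Motion Generation for 3D Rigged Shapes | MotionDreamer: proposes a universal skeletal motion generation framework for animating 3D rigged shapes. | dreamer, motion generation
33 | Geometry-Aware Implicit Memory for Video World Models | Proposes GIM-World, which uses geometry-aware implicit memory to improve the long-horizon consistency of video world models. | world model, world models, 3D reconstruction
34 | PaCX-MAE: Physiology-Augmented Chest X-Ray Masked Autoencoder | PaCX-MAE: a physiology-augmented masked autoencoder for chest X-ray images that improves diagnostic performance. | masked autoencoder, MAE, visual pre-training
35 | From Zero to Hero: Training-Free Custom Concept Spawning in World Models | Proposes SPAWN, a training-free method for inserting concepts into world models, for interactive video generation. | world model, world models
36 | MT-EditFlow: Reinforcement Learning for Multi-Turn Image Editing with Flow Matching | MT-EditFlow: a flow-matching-based reinforcement learning framework for multi-turn image editing that improves interactive editing quality. | reinforcement learning, flow matching
37 | Unified Driving Tokens: Representation- and Geometry-Guided Discrete Tokenizer for Driving World Models and Planning | Proposes a representation- and geometry-guided discrete tokenizer for autonomous driving world models and planning. | world model, world models
38 | Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models | Proposes a dynamic cognitive map and spatial assertion code to strengthen spatial reasoning. | reinforcement learning, SAC, spatial relationship
39 | Paving the Way for Point Cloud Video Representation Learning Using A PDE Model | Proposes MotionPDE, which uses partial differential equations and contrastive learning to strengthen point cloud video representation learning. | representation learning, contrastive learning
40 | VISReg: Variance-Invariance-Sketching Regularization for JEPA training | VISReg: a variance-invariance-sketching regularization method that improves the stability and generalization of JEPA training. | JEPA
41 | From Extrinsic to Intrinsic: Geodesic-Guided Representation Learning for 3D Geometric Data | PRISM: learns isometric embeddings of 3D geometric data by recovering intrinsic surface geodesic distances. | representation learning
42 | Pool-Select-Refine: Allocation-Aware Generative Dataset Distillation with Soft-Label-Guided Latent Refinement | Proposes the Pool-Select-Refine framework, which improves diffusion-model dataset distillation by decoupling generation, selection, and refinement. | distillation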

🔬 Pillar 1: Robot Control (6 papers)

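# | Title | One-sentence summary | Tags | 🔗
43 | Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking From Sparse Inertial Sensors and Ranging-Based Between-Sensor Distances | Ultra Diffusion Poser: diffusion-based human motion tracking that combines sparse inertial sensors with ranging information. | motion tracking, human motion
44 | RoboTrustBench: Benchmarking the Trustworthiness of Video World Models for Robotic Manipulation | Proposes RoboTrustBench for evaluating the trustworthiness of video world models in robotic manipulation. | manipulation, world model, world models
45 | Restoring Initial Noise Sensitivity in Text-to-Image Distillation via Geometric Alignment | Proposes Geometry-Aware Distillation (GAD), which restores sensitivity to initial noise in text-to-image distillation. | manipulation, distillation
46 | PlatonicNav: Unveiling Semantic Correspondence in Navigation with Platonic Topological Maps | Proposes PlatonicNav to address semantic correspondence in visual navigation. | Unitree, VLN
47 | Explainable Forensics of Manipulated Segments in Untrimmed Long Videos | Proposes the TASLE benchmark and the MSLoc method for explainable forensics of AI-manipulated segments in long videos. | manipulation
48 | PRIMA: Boosting Animal Mesh Recovery with Biological Priors and Test-Time Adaptation | PRIMA: uses biological priors and test-time adaptation to improve animal mesh recovery. | quadruped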

🔬 Pillar 7: Motion Retargeting (2 papers)

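# | Title | One-sentence summary | Tags | 🔗
49 | Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization | Proposes a 3D-aware video diffusion model based on mesh tokenization, enabling render-free human motion control. | human motion
50 | Auteur: Language-Driven Cinematographic Framing for Human-Centric Video Generation | Auteur: proposes a language-driven cinematic framing control method for generating human-centric videos. | human motion, large language model, multimodal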

🔬 Pillar 6: Video Extraction (2 papers)

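# | Title | One-sentence summary | Tags | 🔗
51 | Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection | Proposes the UE-MCM model to address egocentric mistake detection under long-tailed distributions. | egocentric
52 | HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image | HumanNOVA: photorealistic, universal, and rapid 3D human avatar modeling from a single image. | SMPL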

🔬 Pillar 8: Physics-based Animation (2 papers)

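# | Title | One-sentence summary | Tags | 🔗
53 | 3rd Place at CVPR 2026 CASTLE Challenge: Agentic Multi-View Long-Context Video Understanding via Hierarchical Knowledge Graph Retrieval | Proposes an agentic multi-view long video understanding framework based on hierarchical knowledge graph retrieval; 3rd place in the CVPR 2026 CASTLE Challenge. | spatiotemporal, multimodal
54 | VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization | Proposes an adaptive test-time optimization method with a VLM teacher to improve video reasoning. | spatiotemporal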

🔬 Pillar 4: Generative Motion (2 papers)

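# | Title | One-sentence summary | Tags | 🔗
55 | Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation | Proposes the ST-DRC framework to balance semantic control against identity fidelity in identity-preserving text-to-video generation. | classifier-free guidance
56 | TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos | TROPHIES: temporal reconstruction of humans, scenes, and cameras from multi-view videos. | physically plausible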
