cs.CV (2026-06-01)

📊 56 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (19 🔗1) · Pillar 3: Perception & Semantics (12 🔗2) · Pillar 2: RL & Architecture (11 🔗4) · Pillar 1: Robot Control (6 🔗4) · Pillar 7: Motion Retargeting (2) · Pillar 6: Video Extraction (2) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 4: Generative Motion (2)

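🔬 Pillar 9: Embodied Foundation Models (19 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning	Proposes Attentive-CoT, which improves the CoT reasoning of multimodal large language models through attention-guided fine-tuning	large language model multimodal chain-of-thought
2	Jailbreaking Multimodal Large Language Models using Multi-Clip Video	Proposes Multi-Clip Video SafetyBench to evaluate how video input diversity affects jailbreak attacks on multimodal large language models	large language model multimodal
3	Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling	Proposes a method based on perceptual perturbation and reward modeling to mitigate judgment bias in multimodal LLM-as-a-judge	large language model multimodal
4	ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning	ProtoAda: prototype-guided adaptive adapter expansion and geometric consolidation for multimodal continual instruction tuning	large language model multimodal
5	Improving Visual Token Reduction via Rectifying Distortions for Efficient Multimodal LLM Inference	RESTORE: improves multimodal LLM inference efficiency by rectifying visual distortions	large language model multimodal
6	Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains	Shows that the benefits of tool use for multimodal agents may be overestimated; tool calls do not imply capability gains	multimodal
7	Multimodal Approaches for Visually-Rich Document Type Classification: A Comparative Analysis	Compares multimodal methods for visually-rich document type classification and reveals the contribution of each modality	multimodal
8	InfoMerge: Information-aware Token Compression for Efficient Video Large Language Models	InfoMerge: an information-aware token compression method for efficient video large language models	large language model
9	Multimodal Action Diffusion for Robust End-to-End Autonomous Driving	Proposes Action Diffusion Transformer for robust multimodal action prediction in end-to-end autonomous driving	multimodal
10	The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue	Proposes the Image Reconstruction Game benchmark, which improves image generation quality through iterative multimodal dialogue	multimodal
11	FlatVPR: Plug-and-play Geo-linear Residual Adapter for Geometric Rectification of Foundation Model Feature Manifolds	Proposes FlatVPR to address the feature reconstruction problem in visual place recognition	foundation model
12	PathAR: Structure-First Autoregressive Synthesis of Multimodal Pathology Images	PathAR: a structure-first autoregressive model for synthesizing multimodal pathology images	multimodal
13	AdaCodec: A Predictive Visual Code for Video MLLMs	AdaCodec: a predictive visual code for video MLLMs that significantly reduces compute cost and improves performance	large language model multimodal
14	Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events	Moment-Video: diagnoses the temporal fidelity of video MLLMs on momentary visual events	large language model multimodal
15	Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models	Proposes the SEIG framework, which uses vision-language models to reconstruct an editable Blender scene from a single image	foundation model
16	Not All Points Are Equal: Uncertainty-Aware 4D LiDAR Scene Synthesis	Proposes the U4D framework, which uses uncertainty to guide 4D LiDAR scene synthesis, improving scene fidelity and temporal consistency	embodied AI
17	A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision	Proposes the TGAD benchmark, revealing that existing text-guided anomaly detection depends too little on language conditioning	multimodal
18	Density-Aware Translation of Spurious Correlations in Zero-Shot VLMs	Proposes Density-Aware Translation (DAT) to improve the robustness of zero-shot VLMs under spurious correlations	multimodal
19	Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation	Goal2Pixel: grounds goals to pixels for vision-language navigation	VLN	✅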

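🔬 Pillar 3: Perception & Semantics (12 papers)

#	Title	One-line summary	Tags	🔗	⭐
20	VEDAL: Variational Error-Driven Asynchronous Learning for 3D Gaussian Splatting Pruning	VEDAL: 3D Gaussian splatting pruning based on variational error-driven asynchronous learning	3D gaussian splatting 3DGS gaussian splatting
21	$\text{VG}^2$GT: Voxel-Gaussian Splatting Visual Geometry Grounded Transformer	Proposes VG²GT, which uses voxel Gaussian splatting and a visual geometry Transformer for high-quality 3D reconstruction and novel view synthesis	3D reconstruction gaussian splatting splatting
22	Fast and Lightweight Novel View Synthesis with Differentiable Multiplane Image	Proposes a fast, lightweight novel view synthesis method based on differentiable multiplane images, addressing the speed and model-size bottlenecks of NeRF and similar methods	3D gaussian splatting 3DGS gaussian splatting
23	TIDES: Time-Derivative Event Simulation via Deformable Reconstruction	TIDES: a time-derivative event simulator based on deformable reconstruction that addresses the timestamp batching problem in event simulation	gaussian splatting splatting TAMP
24	Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation	Proposes the mixture-density representation MDA to solve the flying-point problem at boundaries in depth estimation	depth estimation	✅
25	Retrieve What's Missing: Coverage-Maximizing Retrieval for Consistent Long Video Generation	Proposes COVRAG, which improves the consistency of long video generation through coverage-maximizing retrieval	3D reconstruction geometric consistency
26	Honey, I Shrunk the Arc de Triomphe!	Proposes the MetricScenes dataset to address scale collapse in monocular geometry estimation	MoGe foundation model
27	PerBite: A Curated Diagnostic Workflow for Bite-Aware Food Volume Estimation	PerBite proposes a bite-aware diagnostic workflow for food volume estimation and achieves leading results in the MetaFood challenge	MoGe 3D reconstruction	✅
28	Places in the Wild: A Large, High-Resolution RAW Photograph Dataset for Ecologically Valid Vision Research	Places in the Wild: a large, high-resolution RAW image dataset for ecologically valid vision research	scene understanding
29	WebSpline: Structure-Informed Splines for Real-Time 3D Gaussians from Monocular Videos	WebSpline: a structure-informed spline method for real-time 3D Gaussian modeling from monocular videos	scene reconstruction
30	PhyScene3D: Physically Consistent Interactive 3D Tabletop Scene Generation	PhyScene3D: proposes a framework for physically consistent, interactive 3D tabletop scene generation	affordance
31	Effective Multi-sensor Conditioning for Street-view Novel-view Synthesis	StreetNVS: proposes a street-view novel view synthesis method that effectively fuses multi-sensor information	metric depth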

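🔬 Pillar 2: RL & Architecture (11 papers)

#	Title	One-line summary	Tags	🔗	⭐
32	MotionDreamer: Universal Skeletal Motion Generation for 3D Rigged Shapes	MotionDreamer: proposes a universal skeletal motion generation framework for animating 3D rigged shapes	dreamer motion generation
33	Geometry-Aware Implicit Memory for Video World Models	Proposes GIM-World, which uses geometry-aware implicit memory to improve the long-horizon consistency of video world models	world model world models 3D reconstruction
34	PaCX-MAE: Physiology-Augmented Chest X-Ray Masked Autoencoder	PaCX-MAE: a physiology-augmented masked autoencoder for chest X-ray images that improves diagnostic performance	masked autoencoder MAE visual pre-training
35	From Zero to Hero: Training-Free Custom Concept Spawning in World Models	Proposes SPAWN, a training-free method for inserting concepts into world models, for interactive video generation	world model world models
36	MT-EditFlow: Reinforcement Learning for Multi-Turn Image Editing with Flow Matching	MT-EditFlow: a reinforcement learning framework based on flow matching for multi-turn image editing that improves interactive editing quality	reinforcement learning flow matching
37	Unified Driving Tokens: Representation- and Geometry-Guided Discrete Tokenizer for Driving World Models and Planning	Proposes a representation- and geometry-guided discrete tokenizer for driving world models and planning	world model world models
38	Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models	Proposes dynamic cognitive maps and spatial assertion code to strengthen spatial reasoning	reinforcement learning SAC spatial relationship	✅
39	Paving the Way for Point Cloud Video Representation Learning Using A PDE Model	Proposes MotionPDE, which uses partial differential equations and contrastive learning to strengthen point cloud video representation learning	representation learning contrastive learning	✅
40	VISReg: Variance-Invariance-Sketching Regularization for JEPA training	VISReg: a variance-invariance-sketching regularization method that improves the stability and generalization of JEPA training	JEPA	✅
41	From Extrinsic to Intrinsic: Geodesic-Guided Representation Learning for 3D Geometric Data	PRISM: learns isometric embeddings of 3D geometric data by recovering intrinsic surface geodesic distances	representation learning	✅
42	Pool-Select-Refine: Allocation-Aware Generative Dataset Distillation with Soft-Label-Guided Latent Refinement	Proposes the Pool-Select-Refine framework, which improves diffusion-model dataset distillation by decoupling generation, selection, and refinement	distillation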

🔬 Pillar 1: Robot Control (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
43	Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking From Sparse Inertial Sensors and Ranging-Based Between-Sensor Distances	Ultra Diffusion Poser: diffusion-based human motion tracking that combines sparse inertial sensors with ranging information	motion tracking human motion
44	RoboTrustBench: Benchmarking the Trustworthiness of Video World Models for Robotic Manipulation	Proposes RoboTrustBench to evaluate the reliability of video world models for robotic manipulation	manipulation world model world models
45	Restoring Initial Noise Sensitivity in Text-to-Image Distillation via Geometric Alignment	Proposes geometry-aware distillation (GAD) to restore sensitivity to initial noise in text-to-image distillation	manipulation distillation	✅
46	PlatonicNav: Unveiling Semantic Correspondence in Navigation with Platonic Topological Maps	Proposes PlatonicNav to address semantic correspondence in visual navigation	Unitree VLN	✅
47	Explainable Forensics of Manipulated Segments in Untrimmed Long Videos	Proposes the TASLE benchmark and the MSLoc method for explainable forensics of AI-manipulated segments in long videos	manipulation	✅
48	PRIMA: Boosting Animal Mesh Recovery with Biological Priors and Test-Time Adaptation	PRIMA: improves animal mesh recovery using biological priors and test-time adaptation	quadruped	✅

🔬 Pillar 7: Motion Retargeting (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
49	Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization	Proposes a 3D-aware video diffusion model based on mesh tokenization, enabling render-free human motion control	human motion
50	Auteur: Language-Driven Cinematographic Framing for Human-Centric Video Generation	Auteur: proposes a language-driven cinematographic framing method for generating human-centric videos	human motion large language model multimodal

🔬 Pillar 6: Video Extraction (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
51	Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection	Proposes the UE-MCM model to address egocentric mistake detection under long-tailed distributions	egocentric
52	HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image	HumanNOVA: photorealistic, universal, and rapid 3D human avatar modeling from a single image	SMPL

🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
53	3rd Place at CVPR 2026 CASTLE Challenge: Agentic Multi-View Long-Context Video Understanding via Hierarchical Knowledge Graph Retrieval	Proposes an agentic multi-view long-video understanding framework based on hierarchical knowledge graph retrieval; 3rd place in the CVPR 2026 CASTLE Challenge	spatiotemporal multimodal	✅
54	VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization	Proposes an adaptive test-time optimization method with VLM teachers to improve video reasoning	spatiotemporal

🔬 Pillar 4: Generative Motion (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
55	Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation	Proposes the ST-DRC framework to balance semantic control and identity fidelity in identity-preserving text-to-video generation	classifier-free guidance
56	TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos	TROPHIES: temporal reconstruction of humans, scenes, and cameras from multi-view videos	physically plausible
