cs.CV (2026-05-26)

📊 48 papers in total | 🔗 13 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (16 🔗2) Pillar 9: Embodied Foundation Models (16 🔗5) Pillar 3: Perception & Semantics (10 🔗3) Pillar 4: Generative Motion (3 🔗1) Pillar 1: Robot Control (1 🔗1) Pillar 7: Motion Retargeting (1) Pillar 5: Interaction & Reaction (1 🔗1)

🔬 Pillar 2: RL & Architecture (16 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	SpatialBench: Is Your Spatial Foundation Model an All-Round Player?	SpatialBench: a cross-domain, multi-task benchmark for evaluating the generalization of spatial foundation models.	representation learning egocentric foundation model
2	FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation	FoundObj: uses self-supervised foundation models as rewards for label-free 3D object segmentation.	reinforcement learning foundation model
3	Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini	Gemini Embedding 2: a native multimodal embedding model that jointly represents video, audio, images, and text.	contrastive learning multimodal
4	O-MARC: Omni Memory-Augmented Compression Distillation for Efficient Video Understanding	Proposes the O-MARC framework, which uses compression distillation to improve the efficiency and performance of multimodal LLMs on video understanding.	distillation large language model multimodal
5	Re-M3Dr: Rebalanced MultiModal Mean Deviation Regression	Proposes Re-M3Dr to address performance degradation in visual field defect assessment with multimodal medical image fusion.	contrastive learning multimodal
6	DinoComplete: 3D Shape Completion with Distilled Semantic Priors and State Space Models	DinoComplete: 3D shape completion using distilled semantic priors and state space models.	Mamba state space model foundation model
7	OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation	Proposes OmniRetriever, based on fusion-as-teacher distillation, for any-to-any audio-video-text retrieval.	distillation multimodal
8	JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search	JetViT: an efficient high-resolution Vision Transformer obtained via post-training attention search.	linear attention Depth Anything foundation model
9	LongCat-Video-Avatar 1.5 Technical Report	LongCat-Video-Avatar 1.5: an open-source audio-driven video generation framework aimed at commercial-grade applications.	RLHF distillation multi-person interaction
10	PlayClass: Automated Play Behaviour Classification in Poultry	PlayClass: a pipeline for automated classification of play behaviour in poultry.	JEPA foundation model
11	Touch-R1: Reinforcing Touch Reasoning in MLLMs	Touch-R1: uses touch-oriented reinforcement learning to improve touch reasoning in multimodal LLMs.	reinforcement learning multimodal
12	Semi-Supervised Gaze Estimation via Disentangled Subspace Contrastive Learning	Proposes a semi-supervised gaze estimation method based on disentangled subspace contrastive learning, improving domain generalization.	contrastive learning	✅
13	REVERSE: Reinforcing Evidence Verification and Search for Agentic Image geo-localization	Proposes the REVERSE framework, which reinforces evidence verification and search for agentic image geo-localization.	reinforcement learning visual grounding	✅
14	InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward	InterSketch: an interleaved reasoning model built on self-correcting visual sketches and stepwise rewards, improving vision-language models on complex visual reasoning tasks.	reinforcement learning chain-of-thought
15	JLT: Clean-Latent Prediction in Latent Diffusion Transformers	JLT: improves image generation quality in latent diffusion Transformers through clean-latent prediction.	flow matching classifier-free guidance
16	Triadic Dynamics Aware Diffusion Posterior Sampling for Inverse Problems: Optimizing Guidance and Stochasticity Schedules	Proposes TriPS, which optimizes the guidance and stochasticity schedules of diffusion posterior sampling and significantly improves imaging results on inverse problems.	reinforcement learning classifier-free guidance

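🔬 Pillar 9: Embodied Foundation Models (16 papers)

#	Title	One-line summary	Tags	🔗	⭐
17	Can Retrieval Heads See Images? Multimodal Retrieval Heads in Long-Context Vision-Language Models	Proposes a method for detecting multimodal retrieval heads, improving long-context vision-language models on image-text retrieval tasks.	large language model multimodal
18	DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding	Proposes DynFrame to address dynamic frame sampling in complex video understanding.	large language model multimodal	✅
19	How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning	Proposes View Dropout and panoramic visual thinking to improve cross-view spatial reasoning in unified multimodal models.	multimodal
20	Pop-Up Distractions Reveal Bag-of-Events Behavior in Video Large Language Models	DistractionBench reveals "bag-of-events" behavior in the temporal understanding of video large language models.	large language model
21	Once-For-All: A Train-Once and Select-Anytime Framework for Multimodal Instruction Tuning	Proposes the OFA framework: a single training run supports selecting any dataset for multimodal instruction tuning, improving training efficiency.	multimodal
22	IPIBench: Evaluating Interactive Proactive Intelligence of MLLMs under Continuous Streams	IPIBench: a benchmark of interactive proactive intelligence that evaluates MLLMs on continuous video streams.	large language model multimodal
23	DV-SFT: Direct Vision Supervision for Fine-Grained Visual Understanding	Proposes DV-SFT, which uses direct vision supervision to improve fine-grained visual understanding in multimodal LLMs.	large language model multimodal
24	OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants	OmniInteract: a benchmark of real-world streaming interaction for real-time omnimodal assistants.	large language model multimodal	✅
25	LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding	LocateAnything: introduces parallel box decoding to speed up vision-language grounding and improve its quality.	visual grounding
26	Leveraging Visual Signals for Robust Token-Level Uncertainty in Vision-Language Generation	Proposes the VIG-TUQ framework, which uses visual signals to make token-level uncertainty estimation in vision-language generation more robust.	visual grounding
27	ChartAct: A Benchmark for Dynamic Chart Understanding	Proposes ChartAct, an interactive benchmark for dynamic chart understanding.	multimodal	✅
28	On the Robustness of Machine Unlearning for Vision-Language Models	Presents a robustness analysis and attack methods for multimodal knowledge unlearning in vision-language models.	multimodal	✅
29	I2PRef: Image-Driven Point Completion with Iterative Refinement	Proposes I2PRef, which achieves high-quality 3D reconstruction through image-driven point cloud completion with iterative refinement.	multimodal	✅
30	METATR: A Multilingual, Evolving Benchmark for Automatic Text Recognition	Proposes METATR, a multilingual, evolving benchmark for automatic text recognition.	large language model
31	FTibSuite: A Comprehensive Resource Suite for Tibetan Vision-Language Modeling	FTibSuite: a comprehensive resource suite for Tibetan vision-language modeling.	multimodal
32	The Rescue Effect: Spatio-Semantic Early Exit Bypasses Quantization Collapse in CLIP	Proposes LRA-EE, a layer-wise representation-aware early-exit mechanism that mitigates quantization-induced performance collapse in CLIP.	multimodal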

🔬 Pillar 3: Perception & Semantics (10 papers)

#	Title	One-line summary	Tags	🔗	⭐
33	TrackRef3D: Multi-View Consistent Track-then-Label for Open-World Referring Segmentation in 3D Gaussian Splatting	TrackRef3D: a multi-view consistent Track-then-Label method for open-world referring segmentation in 3D Gaussian Splatting.	3D gaussian splatting 3DGS gaussian splatting
34	DelowlightSplat: Feed-Forward Gaussian Splatting for Lowlight 3D Scene Reconstruction	DelowlightSplat: a feed-forward Gaussian splatting method for low-light 3D scene reconstruction.	3D reconstruction gaussian splatting splatting
35	Underwater360: Reconstructing Underwater Scenes from Panoramic Images with Omnidirectional Gaussian Splatting	Proposes Underwater360, which reconstructs underwater scenes from panoramic images with omnidirectional Gaussian splatting, addressing underwater image degradation and view distortion.	3D gaussian splatting 3DGS gaussian splatting	✅
36	COVD: Continual Open-Vocabulary Object Detection with Novel Concept Injection	Proposes NoIn-Det, which handles novel concept injection in continual open-vocabulary object detection without additional parameters.	open-vocabulary open vocabulary
37	$R^3$: 3D Reconstruction via Relative Regression	Proposes R^3, a relative-regression method that addresses the dependence on a global coordinate frame in long-sequence and streaming 3D reconstruction.	3D reconstruction foundation model	✅
38	Gaussian-Voxel Duet: A Dual-Scaffolding Hybrid Representation for Fast and Accurate Monocular Surface Reconstruction	Proposes Gaussian-Voxel Duet for fast and accurate monocular surface reconstruction.	3D gaussian splatting 3D reconstruction gaussian splatting	✅
39	3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation	Proposes a 3D Gaussian map to address environment understanding in vision-language navigation.	scene understanding egocentric VLN
40	Sparse-LiDAR Prompting of Monocular Geometry Foundations: An Empirical Study Toward Long-Range Driving Depth	Proposes SLIM, which prompts monocular geometry foundation models with sparse LiDAR to improve long-range depth estimation in driving scenes.	Depth Anything MoGe foundation model
41	Uncertainty-Aware Gaussian Map for Vision-Language Navigation	Proposes an uncertainty-aware Gaussian map that improves the reliability of vision-language navigation.	affordance VLN
42	G3T Up! Gravity Aligned Coordinate Frames Simplify Pointmap Processing	G3T: uses gravity-aligned coordinate frames to simplify pointmap processing and improve 3D reconstruction accuracy.	3D reconstruction VGGT

🔬 Pillar 4: Generative Motion (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
43	Self-Intersection-Aware 3D Human Motion Generation Using an Efficient Human Sphere Proxy	Proposes a self-intersection-aware 3D human motion generation method based on an efficient human sphere proxy.	motion diffusion model MDM motion diffusion	✅
44	Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos	HTD-Refine: improves the realism of human motion recovery from monocular video by aligning high-order temporal dynamics.	physically plausible HMR human motion
45	Generative Animations: A Multi-Model Pipeline for Prompt-Driven Motion Synthesis	Proposes Generative Animations, which synthesizes animations through a prompt-driven multi-model pipeline.	motion synthesis large language model visual grounding

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
46	OSMa-Bench++: Toward Open-Ended Benchmarking of Semantic Mapping for Manipulation with Prompt-Generated Synthetic Scenes	OSMa-Bench++: uses prompt-generated synthetic scenes for open-ended benchmarking of semantic mapping for manipulation.	manipulation semantic mapping semantic map	✅

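🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
47	Joint 2D-3D Segmentation and Association in Street-level Imaging	Proposes a joint 2D-3D segmentation and association framework for large-scale street-level image understanding and spatial digital twin construction.	geometric consistency motion reconstruction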

🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
48	CmIVTP: Cross-modal Interaction-based Vessel Trajectory Prediction for Maritime Intelligence	Proposes the CmIVTP framework, which uses cross-modal interaction to predict vessel trajectories at sea for maritime intelligence.	interaction transformer multimodal	✅
