cs.CV (2026-05-15)

📊 40 papers in total | 🔗 9 with code

🎯 Navigation by interest area

Pillar 9: Embodied Foundation Models (16 🔗2) | Pillar 3: Perception & Semantics (11 🔗3) | Pillar 2: RL & Architecture (6 🔗2) | Pillar 1: Robot Control (3) | Pillar 4: Generative Motion (3 🔗2) | Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (16 papers)

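#	Title	One-line summary	Tags	🔗	⭐
1	MAgSeg: Segmentation of Agricultural Landscapes in High-Resolution Satellite Imagery using Multimodal Large Language Models	MAgSeg: segmenting agricultural landscapes in high-resolution satellite imagery with multimodal large language models.	large language model multimodal
2	Attribute-Grounded Selective Reasoning for Artwork Emotion Understanding with Multimodal Large Language Models	Proposes the FAB-G framework, which uses attribute-guided selective reasoning to improve multimodal large models on artwork emotion understanding.	large language model multimodal	✅
3	Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models	Proposes ML-FOP-SOAP, which addresses modality competition in multimodal models through multi-level variance correction.	foundation model multimodal
4	SOLAR: Self-supervised Joint Learning for Symmetric Multimodal Retrieval	Proposes the SOLAR framework for symmetric multimodal retrieval without human-annotated data.	multimodal
5	Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal Models	Proposes the G2U framework, which feeds generative visual thinking back into multimodal understanding to improve the model's cognitive ability.	multimodal
6	STABLE: Simulation-Ready Tabletop Layout Generation via a Semantics-Physics Dual System	STABLE: simulation-ready tabletop layout generation based on a semantics-physics dual system.	embodied AI large language model
7	GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions	GRASP: learning to perform social reasoning in multi-person non-verbal interactions.	large language model multimodal
8	AGC: Adaptive Geodesic Correction for Adversarial Robustness on Vision-Language Models	Proposes Adaptive Geodesic Correction (AGC) to improve the robustness of vision-language models under adversarial attacks.	multimodal zero-shot transfer
9	IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation	IVGT: an implicit visual geometry Transformer for neural scene representation.	foundation model
10	Res$^2$CLIP: Few-Shot Generalist Anomaly Detection with Residual-to-Residual Alignment	Proposes Res$^2$CLIP, which uses residual alignment to address cross-granularity and cross-category generalization in few-shot generalist anomaly detection.	multimodal	✅
11	A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation	Proposes CrossMPI, a cross-modal prompt attack against large vision-language models that is injected through the image.	multimodal
12	Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination	Exposes the illusion of "visual re-examination" in vision-language models: the model is only saying, not actually seeing.	visual grounding
13	Cross-Modal Registration Between 3D and 2D Fingerprints via Pose-Aware Unwrapping and Point-Cloud Fusion	Proposes a pose-aware 3D fingerprint unwrapping and point-cloud fusion method for cross-modal registration between 3D and 2D fingerprints.	multimodal
14	HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion	HyperDiT: high-fidelity pixel-space diffusion models via hyper-connected Transformers.	foundation model
15	EntropyScan: Towards Model-level Backdoor Detection in LVLMs via Visual Attention Entropy	Proposes EntropyScan, which detects model-level backdoor attacks in LVLMs via visual attention entropy.	large language model
16	LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs	LRCP: visual token pruning for efficient LVLMs, guided by low-rank compressibility.	multimodal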

🔬 Pillar 3: Perception & Semantics (11 papers)

#	Title	One-line summary	Tags	🔗	⭐
17	Robust Prior-Guided Segmentation for Editable 3D Gaussian Splatting	Proposes a prior-guided segmentation method that enables editable 3D Gaussian splatting.	3D gaussian splatting gaussian splatting splatting
18	EndoGSim: Physics-Aware 4D Dynamic Endoscopic Scene Simulations via MLLM-Guided Gaussian Splatting	EndoGSim: physics-aware 4D dynamic endoscopic scene simulation via MLLM-guided Gaussian splatting.	depth estimation gaussian splatting splatting
19	Unlocking Dense Metric Depth Estimation in VLMs	Proposes DepthVLM, which turns vision-language models into native dense depth predictors and improves 3D spatial reasoning.	depth estimation metric depth foundation model
20	Learn2Splat: Extending the Horizon of Learned 3DGS Optimization	Learn2Splat: extending the optimization horizon of 3D Gaussian splatting via meta-learning.	3D gaussian splatting 3DGS gaussian splatting	✅
21	3D Segmentation Using Viewpoint-Dependent Spatial Relationships	Proposes a viewpoint-dependent 3D referring segmentation dataset and designs a viewpoint-aware model to improve spatial relationship understanding.	scene understanding spatial relationship multimodal
22	Decomposed Vision-Language Alignment for Fine-Grained Open-Vocabulary Segmentation	Proposes a decomposed vision-language alignment framework for fine-grained open-vocabulary segmentation.	open-vocabulary open vocabulary
23	Self-Prompting Diffusion Transformer for Open-Vocabulary Scene Text Editing via In-Context Learning	Proposes a self-prompting diffusion Transformer that performs open-vocabulary scene text editing via in-context learning.	open-vocabulary open vocabulary	✅
24	RaPD: Resolution-Agnostic Pixel Diffusion via Semantics-Enriched Implicit Representations	Proposes RaPD, a resolution-agnostic pixel diffusion model built on semantics-enriched implicit representations.	implicit representation
25	GHOST: Geometry-Hierarchical Online Streaming Token Eviction for Efficient 3D Reconstruction	GHOST: proposes a geometry-hierarchical online streaming token eviction method for efficient 3D reconstruction.	3D reconstruction	✅
26	Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer	Proposes Fisher-Guided Quantization (FGQ) to address differing per-task quantization sensitivity in the Visual Geometry Transformer.	depth estimation 3D reconstruction VGGT
27	On RGB-TIR Stereo Calibration under Extreme Resolution Asymmetry	Proposes an RGB-TIR stereo calibration framework that addresses calibration under extreme resolution asymmetry.	depth estimation multimodal

🔬 Pillar 2: RL & Architecture (6 papers)

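#	Title	One-line summary	Tags	🔗	⭐
28	Latent Video Prediction Learns Better World Models	Improves the robustness of video world models through latent-space video prediction.	world model world models JEPA
29	ChronoEarth-492K: A Large Scale and Long Horizon Spatiotemporal Hyperspectral Earth Observation Dataset and Benchmark	Proposes ChronoEarth-492K, a large-scale spatiotemporal hyperspectral dataset and benchmark, to advance self-supervised learning on long hyperspectral time series.	representation learning HSI spatiotemporal
30	DiLA: Disentangled Latent Action World Models	Proposes DiLA to resolve the trade-off between abstraction and generation quality in latent action models.	world model world models optical flow
31	3DTMDet: A Dual-Path Synergy Network of Transformer and SSM for 3D Object Detection in Point Clouds	Proposes 3DTMDet, which combines Transformer and SSM to address sparse distant points and context understanding in point-cloud object detection.	Mamba SSM state space model	✅
32	Pretraining Objective Matters in Extreme Low-Data FGVC: A Backbone-Controlled Study	Studies how the pretraining objective affects representation quality in extreme low-data fine-grained classification.	MAE contrastive learning distillation
33	From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding	Proposes the Group-Revision optimization paradigm to address sparse rewards on hard samples in object-level grounding.	reinforcement learning reward shaping	✅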

🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
34	Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation	Proposes VLA-AD to address the efficiency of VLA policy distillation.	manipulation distillation vision-language-action
35	UAM: A Dual-Stream Perspective on Forgetting in VLA Training	Proposes the UAM dual-stream architecture to address the forgetting of multimodal capabilities during VLA training.	manipulation VLA multimodal
36	WorldAct: Activating Monolithic 3D Worlds into Interactive-Ready Object-Centric Scenes	WorldAct: converting static 3D worlds into interactive, object-centric scenes.	manipulation world model world models

🔬 Pillar 4: Generative Motion (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
37	AnyAct: Towards Human Reenactment of Character Motion From Video	AnyAct: proposes a method for retargeting motion from character video to human performance.	motion generation motion retargeting human motion
38	Unsupervised 3D Human Pose Estimation via Conditional Multi-view Ancestral Sampling	Proposes conditional Multi-view Ancestral Sampling (cMAS) for unsupervised single-view 3D human pose estimation.	motion diffusion model MDM motion diffusion	✅
39	VAGS: Velocity Adaptive Guidance Scale for Image Editing and Generation	Proposes VAGS, a velocity-adaptive guidance scale method for improving image editing and generation quality.	classifier-free guidance	✅

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
40	VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation	VideoSeeker: incentivizing instance-level video understanding via native agentic tool invocation.	spatiotemporal
