cs.CV (2026-05-15)

📊 40 papers in total | 🔗 9 with code

🎯 Browse by interest area

Pillar 9: Embodied Foundation Models (16 🔗2) · Pillar 3: Perception & Semantics (11 🔗3) · Pillar 2: RL & Architecture (6 🔗2) · Pillar 1: Robot Control (3) · Pillar 4: Generative Motion (3 🔗2) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (16 papers)

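# | Title | One-line summary | Tags | 🔗
1 | MAgSeg: Segmentation of Agricultural Landscapes in High-Resolution Satellite Imagery using Multimodal Large Language Models | MAgSeg: segments agricultural landscapes in high-resolution satellite imagery using multimodal large language models. | large language model, multimodal
2 | Attribute-Grounded Selective Reasoning for Artwork Emotion Understanding with Multimodal Large Language Models | Proposes the FAB-G framework, which uses attribute-guided selective reasoning to improve multimodal large language models on artwork emotion understanding. | large language model, multimodal
3 | Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models | Proposes ML-FOP-SOAP, which addresses modality competition in multimodal models through multi-level variance correction. | foundation model, multimodal
4 | SOLAR: Self-supervised Joint Learning for Symmetric Multimodal Retrieval | Proposes the SOLAR framework for symmetric multimodal retrieval without manually annotated data. | multimodal
5 | Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal Models | Proposes the G2U framework, which feeds generative visual thinking back into multimodal understanding to improve the model's cognitive abilities. | multimodal
6 | STABLE: Simulation-Ready Tabletop Layout Generation via a Semantics-Physics Dual System | STABLE: simulation-ready tabletop layout generation based on a semantics-physics dual system. | embodied AI, large language model
7 | GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions | GRASP: learning social reasoning in multi-person non-verbal interactions. | large language model, multimodal
8 | AGC: Adaptive Geodesic Correction for Adversarial Robustness on Vision-Language Models | Proposes Adaptive Geodesic Correction (AGC) to improve the robustness of vision-language models under adversarial attack. | multimodal, zero-shot transfer
9 | IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation | IVGT: an implicit visual geometry Transformer for neural scene representation. | foundation model
10 | Res$^2$CLIP: Few-Shot Generalist Anomaly Detection with Residual-to-Residual Alignment | Proposes Res$^2$CLIP, which uses residual alignment to address cross-granularity and cross-category generalization in few-shot generalist anomaly detection. | multimodal
11 | A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation | Proposes CrossMPI, an image-injected cross-modal prompt attack against large vision-language models. | multimodal
12 | Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination | Exposes the illusion of "visual re-examination" in vision-language models: the models are only saying, not actually looking. | visual grounding
13 | Cross-Modal Registration Between 3D and 2D Fingerprints via Pose-Aware Unwrapping and Point-Cloud Fusion | Proposes a pose-aware 3D fingerprint unwrapping and point-cloud fusion method for cross-modal registration between 3D and 2D fingerprints. | multimodal
14 | HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion | HyperDiT: high-fidelity pixel-space diffusion models via hyper-connected Transformers. | foundation model
15 | EntropyScan: Towards Model-level Backdoor Detection in LVLMs via Visual Attention Entropy | Proposes EntropyScan, which detects model-level backdoor attacks in LVLMs via visual attention entropy. | large language model
16 | LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs | LRCP: visual token pruning for efficient LVLMs based on low-rank compressibility. | multimodal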

🔬 Pillar 3: Perception & Semantics (11 papers)

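# | Title | One-line summary | Tags | 🔗
17 | Robust Prior-Guided Segmentation for Editable 3D Gaussian Splatting | Proposes a prior-guided segmentation method that enables editable 3D Gaussian Splatting. | 3D gaussian splatting, gaussian splatting, splatting
18 | EndoGSim: Physics-Aware 4D Dynamic Endoscopic Scene Simulations via MLLM-Guided Gaussian Splatting | EndoGSim: physics-aware 4D dynamic endoscopic scene simulation via MLLM-guided Gaussian Splatting. | depth estimation, gaussian splatting, splatting
19 | Unlocking Dense Metric Depth Estimation in VLMs | Proposes DepthVLM, which turns vision-language models into native dense depth predictors and improves 3D spatial reasoning. | depth estimation, metric depth, foundation model
20 | Learn2Splat: Extending the Horizon of Learned 3DGS Optimization | Learn2Splat: extends the optimization horizon of 3D Gaussian Splatting through meta-learning. | 3D gaussian splatting, 3DGS, gaussian splatting
21 | 3D Segmentation Using Viewpoint-Dependent Spatial Relationships | Proposes a viewpoint-dependent 3D referring segmentation dataset and a viewpoint-aware model to improve understanding of spatial relationships. | scene understanding, spatial relationship, multimodal
22 | Decomposed Vision-Language Alignment for Fine-Grained Open-Vocabulary Segmentation | Proposes a decomposed vision-language alignment framework for fine-grained open-vocabulary segmentation. | open-vocabulary, open vocabulary
23 | Self-Prompting Diffusion Transformer for Open-Vocabulary Scene Text Editing via In-Context Learning | Proposes a self-prompting diffusion Transformer that performs open-vocabulary scene text editing via in-context learning. | open-vocabulary, open vocabulary
24 | RaPD: Resolution-Agnostic Pixel Diffusion via Semantics-Enriched Implicit Representations | Proposes RaPD, a resolution-agnostic pixel diffusion model built on semantics-enriched implicit representations. | implicit representation
25 | GHOST: Geometry-Hierarchical Online Streaming Token Eviction for Efficient 3D Reconstruction | GHOST: a geometry-hierarchical online streaming token eviction method for efficient 3D reconstruction. | 3D reconstruction
26 | Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer | Proposes Fisher-Guided Quantization (FGQ) to address differences in quantization sensitivity across tasks in the Visual Geometry Transformer. | depth estimation, 3D reconstruction, VGGT
27 | On RGB-TIR Stereo Calibration under Extreme Resolution Asymmetry | Proposes an RGB-TIR stereo calibration framework that handles calibration under extreme resolution asymmetry. | depth estimation, multimodal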

🔬 Pillar 2: RL & Architecture (6 papers)

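# | Title | One-line summary | Tags | 🔗
28 | Latent Video Prediction Learns Better World Models | Improves the robustness of video world models through latent-space video prediction. | world model, world models, JEPA
29 | ChronoEarth-492K: A Large Scale and Long Horizon Spatiotemporal Hyperspectral Earth Observation Dataset and Benchmark | Proposes ChronoEarth-492K, a large-scale spatiotemporal hyperspectral dataset and benchmark, to advance self-supervised learning on long hyperspectral time series. | representation learning, HSI, spatiotemporal
30 | DiLA: Disentangled Latent Action World Models | Proposes DiLA to resolve the trade-off between abstraction and generation quality in latent action models. | world model, world models, optical flow
31 | 3DTMDet: A Dual-Path Synergy Network of Transformer and SSM for 3D Object Detection in Point Clouds | Proposes 3DTMDet, which combines a Transformer and an SSM to address sparse distant points and contextual understanding in point-cloud object detection. | Mamba, SSM, state space model
32 | Pretraining Objective Matters in Extreme Low-Data FGVC: A Backbone-Controlled Study | Studies how the pretraining objective affects representation quality in extreme low-data fine-grained classification. | MAE, contrastive learning, distillation
33 | From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding | Proposes the Group-Revision optimization paradigm to address sparse rewards on hard samples in object-level grounding. | reinforcement learning, reward shaping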

🔬 Pillar 1: Robot Control (3 papers)

# | Title | One-line summary | Tags | 🔗
34 | Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation | Proposes VLA-AD to address the efficiency of VLA policy distillation. | manipulation, distillation, vision-language-action
35 | UAM: A Dual-Stream Perspective on Forgetting in VLA Training | Proposes the UAM dual-stream architecture to address forgetting of multimodal capabilities during VLA training. | manipulation, VLA, multimodal
36 | WorldAct: Activating Monolithic 3D Worlds into Interactive-Ready Object-Centric Scenes | WorldAct: converts static 3D worlds into interactive, object-centric scenes. | manipulation, world model, world models

🔬 Pillar 4: Generative Motion (3 papers)

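# | Title | One-line summary | Tags | 🔗
37 | AnyAct: Towards Human Reenactment of Character Motion From Video | AnyAct: proposes a retargeting method from character video to human performance. | motion generation, motion retargeting, human motion
38 | Unsupervised 3D Human Pose Estimation via Conditional Multi-view Ancestral Sampling | Proposes conditional Multi-view Ancestral Sampling (cMAS) for unsupervised single-view 3D human pose estimation. | motion diffusion model, MDM, motion diffusion
39 | VAGS: Velocity Adaptive Guidance Scale for Image Editing and Generation | Proposes VAGS, a velocity-adaptive guidance scale method for improving image editing and generation quality. | classifier-free guidance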

🔬 Pillar 8: Physics-based Animation (1 paper)

# | Title | One-line summary | Tags | 🔗
40 | VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation | VideoSeeker: incentivizes instance-level video understanding via native agent tool invocation. | spatiotemporal
