cs.CV (2026-06-09)

📊 31 papers in total | 🔗 6 with code

🎯 Navigation by interest area

Pillar 9: Embodied Foundation Models (12 🔗1) · Pillar 2: RL & Architecture (11 🔗3) · Pillar 3: Perception & Semantics (4 🔗1) · Pillar 1: Robot Control (3 🔗1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (12 papers)

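| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Earth-OneVision: Extending Remote Sensing Multimodal Large Language Models to More Sensor Modalities and Tasks | Proposes Earth-OneVision to address the sensor-type and task limitations of remote sensing multimodal models | large language model, multimodal, visual grounding | |
| 2 | Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories | Proposes Data Journalist Agent to address data story generation | multimodal | |
| 3 | Multimodal Brain Tumour Classification Using Feature Fusion | Proposes a multimodal brain tumour classification method to improve diagnostic accuracy | multimodal | |
| 4 | 5% > 100%: Flatness Preference is All You Need for Multimodal Parameter-Efficient Fine-Tuning | Proposes flatness-preference optimization to improve multimodal parameter-efficient fine-tuning | multimodal | |
| 5 | Vision-Assisted Foundation Model for Solving Multi-Task Vehicle Routing Problems | Proposes a vision-assisted foundation model to solve multi-task vehicle routing problems | foundation model | |
| 6 | P3D-Bench: Benchmarking MLLMs for Parametric 3D Generation and Structural Reasoning | Proposes P3D-Bench to evaluate parametric 3D generation and structural reasoning | large language model, multimodal | |
| 7 | Improving Adversarial Transferability on Vision-Language Pre-training Models via Surrogate-Specific Bias Correction | Proposes DeBias-Attack to address adversarial transferability on VLP models | large language model, multimodal | |
| 8 | CoCoSI: Collaborative Cognitive Map Construction for Spatial Intelligence | Proposes CoCoSI to address spatial understanding in multimodal large language models | large language model, multimodal | |
| 9 | AnyMod-LLVE: Low-Light Video Enhancement with Modality-Agnostic Inference | Proposes AMNet to address missing modalities in low-light video enhancement | multimodal | |
| 10 | Segment and Select: Vision-Language Segmentation in 3D Scenarios | Proposes SEGA3D to address boundary ambiguity in 3D vision-language segmentation | large language model | |
| 11 | GRAR: Glass-induced Reflection Artifact Removal in LiDAR Point Clouds | Proposes the GRAR framework to address glass-induced reflection artifacts in LiDAR point clouds | foundation model | |
| 12 | Audio-Visual Exchange-Aware Token Pruning for Efficient Audio-Visual Captioning | Proposes AVEX-Prune to address dynamic token pruning in audio-visual captioning | multimodal | |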

🔬 Pillar 2: RL & Architecture (11 papers)

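| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 13 | Next Forcing: Causal World Modeling with Multi-Chunk Prediction | Proposes Next Forcing to address slow training and inefficient inference in video generation | world model, world models, world action model | |
| 14 | Can Image Models Imagine Time? ImageTime: A Novel Benchmark for Probing Visual World Modeling Through Spatiotemporal Consistency | Proposes the ImageTime benchmark to address temporal consistency in visual world modeling | world model, world models, spatiotemporal | |
| 15 | LAFP: Preserving Latent Action Structure in Latent Policy Learning via Flow Matching | Proposes LAFP to address the collapse of multimodal action distributions | policy learning, imitation learning, behavior cloning | |
| 16 | ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations | Proposes the ARM model to unify image understanding, generation, and editing | reinforcement learning, multimodal | |
| 17 | Mean Flow Distillation: Robust and Stable Distillation for Flow Matching Models | Proposes Mean Flow Distillation to address the computational overhead of flow matching models | flow matching, distillation | |
| 18 | SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning | Proposes SCAIL-2 to address information loss in controlled character animation | DPO, character animation | |
| 19 | Kwai Keye-VL-2.0 Technical Report | Proposes Kwai Keye-VL-2.0 to address long-video understanding and agent collaboration | distillation, foundation model, multimodal | |
| 20 | FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model | Proposes FADA to address the shortage of prenatal ultrasound personnel in low-income countries | MAE, distillation, foundation model | |
| 21 | Listen, Look, and Learn: Learning Without Forgetting through SAM-Audio | Proposes SAM-Audio to address forgetting in audio-visual incremental learning | distillation, multimodal | |
| 22 | Efficient RWKV-based Representation Learning for 3D Point Clouds | Proposes P-RWKV to address the capture of local geometric structure in 3D point cloud representation learning | representation learning | |
| 23 | Benchmarking stereo reconstruction for 3D printable Martian terrain models | Proposes a stereo reconstruction method to address the challenges of Martian terrain modeling | MAE, stereo depth | |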

🔬 Pillar 3: Perception & Semantics (4 papers)

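| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 24 | GaussTrace: Provenance Analysis of 3D Gaussian Splatting Models with Evidence-based LLM Reasoning | Proposes GaussTrace to address provenance analysis of 3D Gaussian Splatting models | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 25 | Envision4D: Envisioning Visual Futures via Feed-forward 4D Gaussian Splatting for Autonomous Driving | Proposes Envision4D to address future scene prediction in autonomous driving | gaussian splatting, splatting | |
| 26 | Leveraging Metric Depth for Relative Depth Prediction | Proposes leveraging metric depth to address relative depth prediction in football scenes | depth estimation, monocular depth, metric depth | |
| 27 | 3D-CoS: A New 3D Reconstruction Paradigm Based on VLM Code Synthesis | Proposes 3D-CoS to address the difficulty of controlling 3D reconstruction | 3D reconstruction, NeRF | |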

🔬 Pillar 1: Robot Control (3 papers)

| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 28 | ManiSplat: Manipulation Trajectory Synthesis from Monocular Video via Decoupled 3D Gaussian Splatting | Proposes ManiSplat to address dynamic 3D scene reconstruction | manipulation, policy learning, 3D gaussian splatting | |
| 29 | WorldOlympiad: Can Your World Model Survive a Triathlon? | Proposes WorldOlympiad to address inadequate evaluation of video generation models | manipulation, world model, world models | |
| 30 | LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination | Proposes LIBERO-Occ to address the behavior of vision-language-action models under scene-induced occlusion | manipulation, vision-language-action, VLA | |

🔬 Pillar 6: Video Extraction (1 paper)

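| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 31 | A Multimodal RGB and Events Dataset for Hand Detection in First-Person View | Proposes a multimodal RGB and events dataset to address hand detection | egocentric, first-person view, multimodal | |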
