cs.CV (2026-03-13)
📊 46 papers in total | 🔗 16 with code
🎯 Interest-area navigation
Pillar 9: Embodied Foundation Models (14 🔗5)
Pillar 2: RL & Architecture (12 🔗4)
Pillar 3: Perception & Semantics (8 🔗2)
Pillar 1: Robot Control (4 🔗1)
Pillar 6: Video Extraction (4 🔗2)
Pillar 4: Generative Motion (2 🔗1)
Pillar 7: Motion Retargeting (2 🔗1)
🔬 Pillar 9: Embodied Foundation Models (14 papers)
🔬 Pillar 2: RL & Architecture (12 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 15 | TerraFlow: Multimodal, Multitemporal Representation Learning for Earth Observation | TerraFlow: a multimodal, multitemporal representation learning method for Earth observation. | representation learning, foundation model, multimodal | ||
| 16 | VGGT-World: Transforming VGGT into an Autoregressive Geometry World Model | VGGT-World: a geometry world model based on autoregressive prediction of geometric features, improving depth prediction efficiency. | flow matching, world model, VGGT | ||
| 17 | Team RAS in 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach | Team RAS proposes a multimodal fusion method for continuous valence and arousal estimation in in-the-wild settings. | Mamba, multimodal | ||
| 18 | Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach | Team LEYA proposes a multimodal fusion method for recognizing ambivalence/hesitancy in unconstrained videos. | Mamba, multimodal | ||
| 19 | GLEAM: A Multimodal Imaging Dataset and HAMM for Glaucoma Classification | Introduces the GLEAM multimodal glaucoma dataset and the HAMM model for glaucoma stage classification. | representation learning, multimodal | ||
| 20 | Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation | Cheers: decouples image details from semantic representations to enable unified multimodal comprehension and generation. | flow matching, multimodal | ||
| 21 | Think and Answer ME: Benchmarking and Exploring Multi-Entity Reasoning Grounding in Remote Sensing | Introduces the ME-RSRG benchmark dataset and the EAR framework for multi-entity reasoning and visual grounding in remote sensing imagery. | reinforcement learning, foundation model, visual grounding | ✅ | |
| 22 | CMHANet: A Cross-Modal Hybrid Attention Network for Point Cloud Registration | CMHANet: a cross-modal hybrid attention network for point cloud registration that improves robustness in complex scenes. | contrastive learning, scene understanding, geometric consistency | ✅ | |
| 23 | Visual-ERM: Reward Modeling for Visual Equivalence | Proposes Visual-ERM, a reward model for visual equivalence that improves performance on Vision-to-Code tasks. | reinforcement learning, multimodal | ||
| 24 | Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models | Proposes STEVO-Bench to evaluate the state evolution capability of video world models. | world model | ✅ | |
| 25 | Thinking in Streaming Video | ThinkStream: a streaming video understanding framework built on an observe-think-express paradigm, addressing real-time constraints. | reinforcement learning, multimodal | ✅ | |
| 26 | SGMatch: Semantic-Guided Non-Rigid Shape Matching with Flow Regularization | SGMatch: semantic-guided non-rigid shape matching with flow regularization. | flow matching, foundation model | ||
🔬 Pillar 3: Perception & Semantics (8 papers)
🔬 Pillar 1: Robot Control (4 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 35 | PVI: Plug-in Visual Injection for Vision-Language-Action Models | Proposes PVI, a plug-and-play visual injection module that improves language-conditioned manipulation in VLA models. | manipulation, bi-manual, flow matching | ||
| 36 | RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization | RoboStereo: a dual-tower 4D embodied world model for unified policy optimization, improving robot manipulation performance. | manipulation, policy learning, world model | ||
| 37 | SAW: Toward a Surgical Action World Model via Controllable and Scalable Video Generation | SAW: builds a surgical action world model through controllable and scalable video generation. | sim-to-real, world model, affordance | ||
| 38 | Rethinking VLMs for Image Forgery Detection and Localization | Proposes IFDL-VLM, which uses vision-language models to improve image forgery detection and localization. | manipulation | ✅ | |
🔬 Pillar 6: Video Extraction (4 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 39 | Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence | VAEX-BENCH: a synthetic video benchmark for evaluating the spatiotemporal abstract reasoning of MLLMs. | egocentric, spatiotemporal, large language model | ||
| 40 | Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering | Introduces the EgoPointVQA dataset for gesture-based egocentric video question answering. | egocentric, large language model, multimodal | ✅ | |
| 41 | Rooftop Wind Field Reconstruction Using Sparse Sensors: From Deterministic to Generative Learning Methods | Proposes deep-learning-based rooftop wind field reconstruction from sparse sensor data to improve UAV safety. | sparse sensors | ||
| 42 | CM-Bench: A Comprehensive Cross-Modal Feature Matching Benchmark Bridging Visible and Infrared Images | Builds CM-Bench, an infrared-visible cross-modal feature matching benchmark, to advance cross-modal vision applications. | feature matching | ✅ | |
🔬 Pillar 4: Generative Motion (2 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 43 | InterEdit: Navigating Text-Guided Multi-Human 3D Motion Editing | InterEdit: a text-guided multi-human 3D motion editing framework, with an accompanying dataset. | text-to-motion, human motion | ✅ | |
| 44 | TRACE: Structure-Aware Character Encoding for Robust and Generalizable Document Watermarking | TRACE: a structure-aware character encoding framework for document watermark embedding that improves robustness and generalization. | MDM | ||
🔬 Pillar 7: Motion Retargeting (2 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 45 | SDF-Net: Structure-Aware Disentangled Feature Learning for Optical-SAR Ship Re-identification | SDF-Net: a structure-aware disentangled feature learning network for optical-SAR ship re-identification. | geometric consistency | ✅ | |
| 46 | Spatial Reasoning is Not a Free Lunch: A Controlled Study on LLaVA | A controlled diagnostic study targeting the spatial reasoning weaknesses of LLaVA. | spatial relationship | ||