cs.CV (2025-08-31)
📊 21 papers in total | 🔗 3 with code
🎯 Interest Area Navigation
Pillar 3: Spatial Perception & Semantics (6 🔗1)
Pillar 2: RL Algorithms & Architecture (6 🔗1)
Pillar 9: Embodied Foundation Models (6 🔗1)
Pillar 5: Interaction & Reaction (2)
Pillar 1: Robot Control (1)
🔬 Pillar 3: Spatial Perception & Semantics (6 papers)
🔬 Pillar 2: RL Algorithms & Architecture (6 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 7 | CSFMamba: Cross State Fusion Mamba Operator for Multimodal Remote Sensing Image Classification | Proposes CSFMamba to address computational complexity in multimodal remote sensing image classification | Mamba SSM state space model | | |
| 8 | OmniReason: A Temporal-Guided Vision-Language-Action Framework for Autonomous Driving | Proposes the OmniReason framework to address spatiotemporal reasoning in autonomous driving | distillation scene understanding spatiotemporal | | |
| 9 | MV-SSM: Multi-View State Space Modeling for 3D Human Pose Estimation | Proposes the MV-SSM framework for multi-view 3D human pose estimation | Mamba SSM state space model | ✅ | |
| 10 | Multi-Level CLS Token Fusion for Contrastive Learning in Endoscopy Image Classification | Proposes multi-level CLS token fusion for endoscopy image classification | contrastive learning multimodal | | |
| 11 | LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model | Proposes LLaVA-Critic-R1 to improve multimodal generation and evaluation | reinforcement learning multimodal | | |
| 12 | CascadeFormer: A Family of Two-stage Cascading Transformers for Skeleton-based Human Action Recognition | Proposes CascadeFormer for skeleton-based human action recognition | representation learning spatiotemporal | | |
🔬 Pillar 9: Embodied Foundation Models (6 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 13 | Multimodal Iterative RAG for Knowledge-Intensive Visual Question Answering | Proposes the MI-RAG framework to address knowledge acquisition in knowledge-intensive visual question answering | large language model multimodal | | |
| 14 | Fusion to Enhance: Fusion Visual Encoder to Enhance Multimodal Language Model | Proposes Fusion to Enhance to address the visual perception bottleneck of multimodal language models | large language model multimodal | | |
| 15 | Ultrasound-based detection and malignancy prediction of breast lesions eligible for biopsy: A multi-center clinical-scenario study using nomograms, large language models, and radiologist evaluation | Proposes an integrated ultrasound nomogram to improve the accuracy of biopsy recommendations for breast lesions | large language model | | |
| 16 | Image-to-Brain Signal Generation for Visual Prosthesis with CLIP Guided Multimodal Diffusion Models | Proposes an image-to-brain-signal generation framework to address the encoding problem in visual prostheses | multimodal | | |
| 17 | EVENT-Retriever: Event-Aware Multimodal Image Retrieval for Realistic Captions | Proposes EVENT-Retriever for event-based multimodal image retrieval | multimodal | ✅ | |
| 18 | Prompt the Unseen: Evaluating Visual-Language Alignment Beyond Supervision | Proposes a new benchmark to evaluate how well the projection layer of vision-language models generalizes | large language model multimodal | | |
🔬 Pillar 5: Interaction & Reaction (2 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 19 | No More Sibling Rivalry: Debiasing Human-Object Interaction Detection | Proposes a new method to address bias in human-object interaction detection | human-object interaction HOI | | |
| 20 | Secure and Scalable Face Retrieval via Cancelable Product Quantization | Proposes cancelable product quantization to address privacy in face retrieval | OMOMO | | |
🔬 Pillar 1: Robot Control (1 paper)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 21 | InterPose: Learning to Generate Human-Object Interactions from Large-Scale Web Videos | Proposes InterPose to generate human-object interactions in complex scenes | manipulation motion generation human-object interaction | | |