cs.CV (2025-10-19)
📊 17 papers in total | 🔗 2 with code
🎯 Navigation by Interest Area
Pillar 9: Embodied Foundation Models (6)
Pillar 2: RL Algorithms & Architecture (6 🔗2)
Pillar 3: Spatial Perception & Semantics (4)
Pillar 8: Physics-based Animation (1)
🔬 Pillar 9: Embodied Foundation Models (6 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input | Proposes Res-Bench to evaluate the robustness of multimodal large language models to dynamic-resolution input | large language model multimodal | | |
| 2 | Enrich and Detect: Video Temporal Grounding with Multimodal LLMs | Proposes ED-VTG, which uses multimodal LLMs for fine-grained video temporal grounding | large language model multimodal | | |
| 3 | Segmentation as A Plug-and-Play Capability for Frozen Multimodal LLMs | LENS: plug-and-play segmentation capability for frozen multimodal LLMs | large language model multimodal | | |
| 4 | Training-free Online Video Step Grounding | Proposes BaGLM, which uses the zero-shot capabilities of large models for online video step grounding and outperforms offline trained methods | large language model multimodal | | |
| 5 | Uncovering Brain-Like Hierarchical Patterns in Vision-Language Models through fMRI-Based Neural Encoding | Reveals brain-like hierarchical patterns in vision-language models through fMRI-based neural encoding | multimodal | | |
| 6 | EventFormer: A Node-graph Hierarchical Attention Transformer for Action-centric Video Event Prediction | Proposes EventFormer for action-centric video event prediction and builds the large-scale dataset AVEP | multimodal | | |
🔬 Pillar 2: RL Algorithms & Architecture (6 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 7 | Foundation Models in Medical Image Analysis: A Systematic Review and Meta-Analysis | A systematic review of foundation models in medical image analysis, covering architectures, training paradigms, and clinical applications | distillation foundation model multimodal | | |
| 8 | A Comprehensive Survey on World Models for Embodied AI | A comprehensive survey of world models for embodied AI along three dimensions: functionality, temporal modeling, and spatial representation | world model embodied AI | ✅ | |
| 9 | EMRRG: Efficient Fine-Tuning Pre-trained X-ray Mamba Networks for Radiology Report Generation | EMRRG: efficient fine-tuning of pre-trained X-ray Mamba networks for radiology report generation | Mamba SSM large language model | ✅ | |
| 10 | Video Reasoning without Training | Proposes V-Reason, which improves large models' performance on video reasoning tasks without any training | reinforcement learning multimodal chain-of-thought | | |
| 11 | Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback | Uniworld-V2: strengthens image editing with Diffusion Negative-aware Finetuning and implicit MLLM feedback | flow matching large language model multimodal | | |
| 12 | Where, Not What: Compelling Video LLMs to Learn Geometric Causality for 3D-Grounding | Proposes the W2R2 framework to address the 2D semantic bias in 3D grounding with video LLMs | representation learning multimodal | | |
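🔬 Pillar 3: Spatial Perception & Semantics (4 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 13 | SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes | SceneCOT: a grounded chain-of-thought reasoning framework for 3D scenes that improves embodied question answering performance | scene understanding large language model multimodal | | |
| 14 | 2DGS-R: Revisiting the Normal Consistency Regularization in 2D Gaussian Splatting | 2DGS-R: improves the rendering quality and geometric accuracy of 2D Gaussian Splatting through hierarchical training and in-place cloning | 3D gaussian splatting 3DGS gaussian splatting | | |
| 15 | GS2POSE: Marry Gaussian Splatting to 6D Object Pose Estimation | GS2POSE: a 6D object pose estimation method built on Gaussian Splatting | 3DGS gaussian splatting splatting | | |
| 16 | How Universal Are SAM2 Features? | Quantifies the gap in feature generalization between general-purpose vision models and segmentation-specialized models | depth estimation | | |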
🔬 Pillar 8: Physics-based Animation (1 paper)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 17 | HumanCM: One Step Human Motion Prediction | Proposes HumanCM, a one-step human motion prediction framework based on consistency models | spatiotemporal | | |