cs.CV (2025-10-07)
📊 33 papers total | 🔗 9 with code
🎯 Interest Area Navigation
Pillar 9: Embodied Foundation Models (14 🔗4)
Pillar 2: RL & Architecture (8 🔗3)
Pillar 3: Spatial Perception & Semantics (8 🔗1)
Pillar 1: Robot Control (2 🔗1)
Pillar 5: Interaction & Reaction (1)
🔬 Pillar 9: Embodied Foundation Models (14 papers)
🔬 Pillar 2: RL & Architecture (8 papers)
| # | Title | One-line Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 15 | HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection | HOI-R1: explores the potential of multimodal large language models for human-object interaction (HOI) detection. | reinforcement learning human-object interaction HOI | ✅ | |
| 16 | Improving Chain-of-Thought Efficiency for Autoregressive Image Generation | Proposes the ShortCoTI framework to improve chain-of-thought efficiency in autoregressive image generation and cut redundant computation. | reinforcement learning large language model foundation model | | |
| 17 | Towards Robust and Reliable Multimodal Misinformation Recognition with Incomplete Modality | Proposes MMLNet to make misinformation recognition robust to missing modalities in multimodal content. | contrastive learning multimodal | ✅ | |
| 18 | GAZE: Governance-Aware pre-annotation for Zero-shot World Model Environments | GAZE: a governance-aware pre-annotation pipeline for zero-shot world model environments. | world model scene understanding multimodal | | |
| 19 | Midway Network: Learning Representations for Recognition and Motion from Latent Dynamics | Midway Network: learns representations for recognition and motion from latent dynamics. | latent dynamics optical flow motion latent | | |
| 20 | When Thinking Drifts: Evidential Grounding for Robust Video Reasoning | Proposes the Visual Evidence Reward (VER) framework to address thought drift in video reasoning. | reinforcement learning multimodal chain-of-thought | | |
| 21 | VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization | Proposes VideoMiner, which uses a tree structure and reinforcement-learning optimization to tackle key-frame extraction and understanding in hour-long videos. | reinforcement learning spatiotemporal large language model | ✅ | |
| 22 | Deforming Videos to Masks: Flow Matching for Referring Video Segmentation | Proposes FlowRVS to address language-guided video object segmentation. | flow matching | | |
🔬 Pillar 3: Spatial Perception & Semantics (8 papers)
| # | Title | One-line Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 23 | Human3R: Everyone Everywhere All at Once | Human3R: a unified framework for 4D human-scene reconstruction from monocular video, enabling real-time reconstruction of multiple people, the scene, and camera trajectories. | depth estimation scene reconstruction contact-aware | ✅ | |
| 24 | EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark | EgoNight: the first benchmark for nighttime egocentric vision understanding, targeting VQA in low-light scenes. | depth estimation egocentric egocentric vision | | |
| 25 | Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow | Flow4Agent: long-form video understanding using motion priors from optical flow to boost MLLM performance. | optical flow large language model multimodal | | |
| 26 | When and How to Cut Classical Concerts? A Multimodal Automated Video Editing Approach | Proposes a multimodal automated video editing approach for cutting multi-camera recordings of classical concerts. | scene understanding multimodal | | |
| 27 | ArchitectHead: Continuous Level of Detail Control for 3D Gaussian Head Avatars | ArchitectHead: the first 3D Gaussian head avatar framework with continuous level-of-detail control. | 3D gaussian splatting 3DGS gaussian splatting | | |
| 28 | Teleportraits: Training-Free People Insertion into Any Scene | Teleportraits: a training-free person-insertion method for compositing people into arbitrary scenes. | affordance classifier-free guidance affordance-aware | | |
| 29 | Human Action Recognition from Point Clouds over Time | Proposes a 3D human action recognition method based on point cloud sequences and sparse convolutional networks. | depth estimation monocular depth | | |
| 30 | Dropping the D: RGB-D SLAM Without the Depth Sensor | DropD-SLAM: monocular RGB SLAM without a depth sensor, achieving RGB-D-level accuracy. | metric depth | | |
🔬 Pillar 1: Robot Control (2 papers)
| # | Title | One-line Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 31 | Bimanual 3D Hand Motion and Articulation Forecasting in Everyday Images | Proposes a diffusion-based method for forecasting bimanual 3D hand motion and articulation, improving prediction accuracy in everyday images. | bi-manual multimodal | | |
| 32 | HoloScene: Simulation-Ready Interactive 3D Worlds from a Single Video | HoloScene: reconstructs interactive, simulation-ready 3D scenes from a single video. | manipulation scene understanding | ✅ | |
🔬 Pillar 5: Interaction & Reaction (1 paper)
| # | Title | One-line Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 33 | Text2Interact: High-Fidelity and Diverse Text-to-Two-Person Interaction Generation | Text2Interact: a framework for high-fidelity, diverse text-driven two-person interaction generation. | two-person interaction spatiotemporal | | |