cs.CV (2025-05-27)
📊 67 papers total | 🔗 24 with code
🎯 Navigation by interest area
Pillar 9: Embodied Foundation Models (24 🔗8)
Pillar 2: RL & Architecture (18 🔗6)
Pillar 3: Perception & Semantics (10 🔗5)
Pillar 1: Robot Control (5 🔗2)
Pillar 7: Motion Retargeting (3)
Pillar 6: Video Extraction (3 🔗2)
Pillar 4: Generative Motion (2)
Pillar 5: Interaction & Reaction (1)
Pillar 8: Physics-based Animation (1 🔗1)
🔬 Pillar 9: Embodied Foundation Models (24 papers)
🔬 Pillar 2: RL & Architecture (18 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 25 | Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO | Proposes the ACTIVE-O3 framework, which uses reinforcement learning to give multimodal large language models active perception capabilities | reinforcement learning large language model multimodal | | |
| 26 | Mamba-Driven Topology Fusion for Monocular 3D Human Pose Estimation | Proposes a Mamba-driven topology fusion framework that improves the accuracy and efficiency of monocular 3D human pose estimation | Mamba SSM state space model | | |
| 27 | Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation | Proposes the Adaptive Text Dreamer to address vision-and-language navigation | dreamer VLN large language model | ✅ | |
| 28 | Object Concepts Emerge from Motion | Proposes an unsupervised, motion-based framework for learning object concepts that improves visual representations | contrastive learning depth estimation monocular depth | | |
| 29 | TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs | Proposes the TACO algorithm, which uses reinforcement learning to optimize long-chain reasoning and data learning in LVLMs, addressing problems such as inconsistent reasoning | reinforcement learning large language model multimodal | | |
| 30 | MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on | MagicTryOn: uses a diffusion Transformer for video virtual try-on that preserves garment details | distillation human motion spatiotemporal | | |
| 31 | ZigzagPointMamba: Spatial-Semantic Mamba for Point Cloud Understanding | ZigzagPointMamba: improves point cloud understanding with a spatial-semantic Mamba network | Mamba SSM state space model | | |
| 32 | MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding | MUSEG: strengthens video temporal understanding through timestamp-aware multi-segment grounding | reinforcement learning large language model multimodal | ✅ | |
| 33 | Policy Optimized Text-to-Image Pipeline Design | Proposes a reinforcement-learning-based method for optimizing text-to-image generation pipelines, improving image quality and diversity | reinforcement learning classifier-free guidance large language model | | |
| 34 | OccLE: Label-Efficient 3D Semantic Occupancy Prediction | OccLE: a label-efficient method for 3D semantic occupancy prediction | Mamba scene understanding foundation model | ✅ | |
| 35 | OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers | OmniSync: a universal lip synchronization framework based on diffusion Transformers, applicable to diverse visual scenarios | flow matching classifier-free guidance spatiotemporal | | |
| 36 | DreamBoothDPO: Improving Personalized Generation using Direct Preference Optimization | DreamBoothDPO: improves personalized image generation with direct preference optimization | DPO direct preference optimization | ✅ | |
| 37 | PMA: Towards Parameter-Efficient Point Cloud Understanding via Point Mamba Adapter | Proposes PMA to address insufficient information utilization in point cloud understanding | Mamba | ✅ | |
| 38 | Rendering-Aware Reinforcement Learning for Vector Graphics Generation | Proposes RLRF: a reinforcement learning method that uses rendering feedback to improve vector graphics generation quality | reinforcement learning | | |
| 39 | Hierarchical Instruction-aware Embodied Visual Tracking | Proposes HIEVT, which uses hierarchical instruction awareness to bridge the gap between instruction understanding and action generation in embodied visual tracking | reinforcement learning VLA | | |
| 40 | Temporal Saliency-Guided Distillation: A Scalable Framework for Distilling Video Datasets | Proposes a temporal saliency-guided video dataset distillation framework for efficient video data compression | distillation | | |
| 41 | Supervised Contrastive Learning for Ordinal Engagement Measurement | Proposes a supervised contrastive learning method for ordinal student engagement measurement, addressing the imbalanced classification problem | contrastive learning | | |
| 42 | LPOI: Listwise Preference Optimization for Vision Language Models | Proposes LPOI, which reduces hallucination in vision-language models through listwise preference optimization | RLHF DPO | ✅ | |
🔬 Pillar 3: Perception & Semantics (10 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 43 | Intern-GS: Vision Model Guided Sparse-View 3D Gaussian Splatting | Intern-GS: sparse-view 3D Gaussian splatting guided by vision models | 3D gaussian splatting gaussian splatting splatting | | |
| 44 | Uni3D-MoE: Scalable Multimodal 3D Scene Understanding via Mixture of Experts | Proposes Uni3D-MoE, which achieves scalable multimodal 3D scene understanding via MoE | scene understanding large language model multimodal | | |
| 45 | Generalizable and Relightable Gaussian Splatting for Human Novel View Synthesis | Proposes GRGS for generalizable and relightable human novel view synthesis | gaussian splatting splatting geometric consistency | | |
| 46 | 3D-UIR: 3D Gaussian for Underwater 3D Scene Reconstruction via Physics Based Appearance-Medium Decoupling | Proposes a physics-based 3D Gaussian method for underwater scene reconstruction that decouples appearance from medium effects | 3D gaussian splatting 3DGS gaussian splatting | ✅ | |
| 47 | Empowering Vector Graphics with Consistently Arbitrary Viewing and View-dependent Visibility | Dream3DVG: proposes a text-to-vector-graphics generation method that supports arbitrary viewpoints, progressive detail refinement, and view-dependent visibility | 3D gaussian splatting 3DGS gaussian splatting | | |
| 48 | Occlusion Boundary and Depth: Mutual Enhancement via Multi-Task Learning | Proposes the MoDOT framework, in which multi-task learning mutually enhances occlusion boundary estimation and monocular depth estimation | depth estimation monocular depth geometric consistency | ✅ | |
| 49 | Compositional Scene Understanding through Inverse Generative Modeling | Proposes a compositional scene understanding method based on inverse generative modeling for robust parsing of complex scenes | scene understanding | ✅ | |
| 50 | Plenodium: UnderWater 3D Scene Reconstruction with Plenoptic Medium Representation | Plenodium: a plenoptic medium representation for underwater 3D scene reconstruction | scene reconstruction | ✅ | |
| 51 | Robust Video-Based Pothole Detection and Area Estimation for Intelligent Vehicles with Depth Map and Kalman Smoothing | Proposes ACSH-YOLOv8 and CDKF for robust video-based pothole detection and area estimation in intelligent vehicles | depth estimation monocular depth Depth Anything | | |
| 52 | OmniIndoor3D: Comprehensive Indoor 3D Reconstruction | OmniIndoor3D: a comprehensive indoor 3D reconstruction framework based on Gaussian representations | 3DGS scene understanding | ✅ | |
🔬 Pillar 1: Robot Control (5 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 53 | HTMNet: A Hybrid Network with Transformer-Mamba Bottleneck Multimodal Fusion for Transparent and Reflective Objects Depth Completion | HTMNet: a hybrid Transformer-Mamba network for depth completion of transparent and reflective objects | manipulation Mamba state space model | | |
| 54 | Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks | DORI: proposes a fine-grained multi-axis perception benchmark that disentangles orientation understanding in multimodal large models | manipulation scene reconstruction scene understanding | ✅ | |
| 55 | FastFace: Tuning Identity Preservation in Distilled Diffusion via Guidance and Attention | FastFace: tunes identity preservation in distilled diffusion models via guidance and attention mechanisms | manipulation distillation classifier-free guidance | | |
| 56 | RefAV: Towards Planning-Centric Scenario Mining | RefAV: proposes a planning-centric scenario mining method to tackle the difficulty of analyzing autonomous driving logs | motion planning | ✅ | |
| 57 | Geometry-Editable and Appearance-Preserving Object Compositon | Proposes the DGAD model, which decouples geometry editing from appearance preservation to achieve controllable, realistic object composition | manipulation | | |
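🔬 Pillar 7: Motion Retargeting (3 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 58 | HuMoCon: Concept Discovery for Human Motion Understanding | HuMoCon: proposes a concept discovery framework for human motion understanding that improves multimodal feature alignment and the representation of high-frequency information | human motion | | |
| 59 | Diffusion Model-based Activity Completion for AI Motion Capture from Videos | Proposes a diffusion-model-based motion completion method that generates natural, continuous motion for AI motion capture from videos | human motion | | |
| 60 | ProBA: Probabilistic Bundle Adjustment with the Bhattacharyya Coefficient | Proposes ProBA: a probabilistic bundle adjustment method based on the Bhattacharyya coefficient that handles unknown camera intrinsics and inaccurate initial estimates | geometric consistency | | |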
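🔬 Pillar 6: Video Extraction (3 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 61 | ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models | Proposes the ViewSpatial-Bench benchmark for evaluating the multi-perspective spatial localization ability of vision-language models | egocentric spatial relationship embodied AI | | |
| 62 | HCQA-1.5 @ Ego4D EgoSchema Challenge 2025 | Proposes an extended HCQA framework based on multi-source aggregation and confidence filtering that improves the accuracy of egocentric video question answering | egocentric Ego4D | ✅ | |
| 63 | SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation | SANSA: exploits the latent semantic information in SAM2 for few-shot segmentation | feature matching | ✅ | |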
🔬 Pillar 4: Generative Motion (2 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 64 | Exploring Timeline Control for Facial Motion Generation | Proposes a timeline-controlled facial motion generation method for fine-grained control of facial movements | motion generation | | |
| 65 | Normalized Attention Guidance: Universal Negative Guidance for Diffusion Models | Proposes Normalized Attention Guidance (NAG) to address the failure of negative guidance in diffusion models | classifier-free guidance | | |
🔬 Pillar 5: Interaction & Reaction (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 66 | OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions | Proposes OmniResponse to address listener feedback generation in online multimodal conversation | dyadic interaction large language model multimodal | | |
🔬 Pillar 8: Physics-based Animation (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 67 | AgriFM: A Multi-source Temporal Remote Sensing Foundation Model for Agriculture Mapping | AgriFM: a multi-source temporal remote sensing foundation model for agriculture mapping | spatiotemporal foundation model | ✅ | |