cs.CV (2025-12-23)

📊 23 papers in total | 🔗 4 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (8) · Pillar 2: RL & Architecture (5 🔗2) · Pillar 3: Perception & Semantics (4 🔗1) · Pillar 1: Robot Control (3 🔗1) · Pillar 5: Interaction & Reaction (1) · Pillar 8: Physics-based Animation (1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (8 papers)

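# | Title | Key point | Tags | 🔗
1 | CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation | CRAFT: a continuous reasoning and agentic feedback tuning framework for multimodal text-to-image generation | large language model, multimodal |
2 | FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models | FlashVLM: text-guided visual token selection that improves the efficiency of large multimodal models | multimodal |
3 | MAPI-GNN: Multi-Activation Plane Interaction Graph Neural Network for Multimodal Medical Diagnosis | MAPI-GNN: a multi-activation plane interaction graph neural network for multimodal medical diagnosis | multimodal |
4 | SpatialTree: How Spatial Abilities Branch Out in MLLMs | SpatialTree: builds a hierarchical evaluation framework for the spatial abilities of multimodal LLMs, along with methods to improve them | multimodal |
5 | Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models | Proposes DSR Suite and a geometry selection module (GSM) to improve the dynamic spatial reasoning of VLMs | foundation model |
6 | Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark | Proposes the NL-DIR benchmark dataset for retrieving document images from natural-language descriptions | large language model |
7 | Beyond Vision: Contextually Enriched Image Captioning with Multi-Modal Retrieval | Proposes a multimodal retrieval-augmented image captioning method that improves understanding of event background and context | multimodal |
8 | PaveSync: A Unified and Comprehensive Dataset for Pavement Distress Analysis and Classification | PaveSync: a unified and comprehensive dataset for pavement distress analysis and classification | zero-shot transfer |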

🔬 Pillar 2: RL & Architecture (5 papers)

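# | Title | Key point | Tags | 🔗
9 | Bridging Modalities and Transferring Knowledge: Enhanced Multimodal Understanding and Recognition | Proposes methods for multimodal alignment, translation, fusion, and transfer to improve understanding and recognition of complex inputs | distillation, egocentric, multimodal |
10 | AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model | Proposes AMoE, an efficient Agglomerative Mixture-of-Experts vision foundation model trained via multi-teacher distillation | representation learning, distillation, foundation model |
11 | Active Intelligence in Video Avatars via Closed-loop World Modeling | Proposes the ORCA framework, which achieves active intelligence in video avatars through closed-loop world modeling | world model |
12 | milliMamba: Specular-Aware Human Pose Estimation via Dual mmWave Radar with Multi-Frame Mamba Fusion | milliMamba: human pose estimation robust to specular reflection, based on dual mmWave radar and multi-frame Mamba fusion | Mamba |
13 | DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation | DDAVS: disentangled audio semantics and delayed bidirectional alignment for audio-visual segmentation | contrastive learning, multimodal |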

🔬 Pillar 3: Perception & Semantics (4 papers)

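# | Title | Key point | Tags | 🔗
14 | Enhancing annotations for 5D apple pose estimation through 3D Gaussian Splatting (3DGS) | Uses 3D Gaussian Splatting to improve annotation efficiency for 5D apple pose estimation | 3D gaussian splatting, 3DGS, gaussian splatting |
15 | SirenPose: Dynamic Scene Reconstruction via Geometric Supervision | SirenPose: accurate, temporally consistent dynamic scene reconstruction through geometric supervision | scene reconstruction, physically plausible, spatiotemporal |
16 | AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment | AlignPose: generalizable 6D pose estimation via multi-view feature-metric alignment | 6D pose estimation |
17 | SmartSplat: Feature-Smart Gaussians for Scalable Compression of Ultra-High-Resolution Images | SmartSplat: proposes a feature-aware Gaussian Splatting image compression framework for efficient compression and high-quality reconstruction of ultra-high-resolution images | 3D gaussian splatting, gaussian splatting, splatting |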

🔬 Pillar 1: Robot Control (3 papers)

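# | Title | Key point | Tags | 🔗
18 | LADLE-MM: Limited Annotation based Detector with Learned Ensembles for Multimodal Misinformation | Proposes LADLE-MM, a multimodal misinformation detector based on limited annotation and learned ensembles, suited to resource-constrained settings | manipulation, multimodal |
19 | Dreamcrafter: Immersive Editing of 3D Radiance Fields Through Flexible, Generative Inputs and Outputs | Dreamcrafter: immersive editing of 3D radiance fields through flexible generative inputs and outputs | manipulation, 3D gaussian splatting, gaussian splatting |
20 | LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving | LEAD: minimizes learner-expert asymmetry in end-to-end driving, improving driving performance in the CARLA simulator | sim-to-real, imitation learning |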

🔬 Pillar 5: Interaction & Reaction (1 paper)

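# | Title | Key point | Tags | 🔗
21 | TAVID: Text-Driven Audio-Visual Interactive Dialogue Generation | Proposes TAVID, which generates text-driven interactive audio-visual dialogue through cross-modal mapping | dyadic interaction, multimodal |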

🔬 Pillar 8: Physics-based Animation (1 paper)

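# | Title | Key point | Tags | 🔗
22 | A Contextual Analysis of Driver-Facing and Dual-View Video Inputs for Distraction Detection in Naturalistic Driving Environments | Studies how dual-view video input affects distraction detection in naturalistic driving environments, emphasizing the importance of fusion design | spatiotemporal, multimodal |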

🔬 Pillar 6: Video Extraction (1 paper)

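# | Title | Key point | Tags | 🔗
23 | DETACH: Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning | Proposes the DETACH framework, which addresses the fusion of exocentric video and ambient sensors through decomposed spatio-temporal alignment | egocentric |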
