cs.CV (2025-12-29)

📊 30 papers in total | 🔗 2 with code

🎯 Interest Area Navigation

Pillar 2: RL Algorithms and Architecture (11 🔗2) · Pillar 9: Embodied Foundation Models (11) · Pillar 3: Perception and Semantics (5) · Pillar 1: Robot Control (2) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 2: RL Algorithms and Architecture (11 papers)

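| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 1 | HY-Motion 1.0: Scaling Flow Matching Models for Text-To-Motion Generation | HY-Motion 1.0 scales flow matching models to the billion-parameter level for text-driven 3D human motion generation. | reinforcement learning, flow matching, text-to-motion | |
| 2 | PathFound: An Agentic Multimodal Model Activating Evidence-seeking Pathological Diagnosis | PathFound: an agentic multimodal model for pathological diagnosis that actively seeks evidence. | reinforcement learning, representation learning, foundation model | |
| 3 | LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation | Proposes an improved on-policy distillation method that enables real-time multimodal interactive video diffusion. | distillation, multimodal | |
| 4 | ProGuard: Towards Proactive Multimodal Safeguard | Proposes ProGuard, a proactive multimodal safeguard that identifies and describes OOD safety risks in generative models. | reinforcement learning, multimodal | |
| 5 | ThinkGen: Generalized Thinking for Visual Generation | ThinkGen: a visual generation framework based on generalized thinking that improves adaptability across scenarios. | reinforcement learning, large language model, multimodal | |
| 6 | GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation | Proposes GaussianDWM, a driving world model built on 3D Gaussian representations, for unified scene understanding and multimodal generation. | world model, scene understanding | |
| 7 | CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation | Proposes CME-CAD, a heterogeneous collaborative multi-expert reinforcement learning framework for high-precision, editable CAD code generation. | reinforcement learning, chain-of-thought | |
| 8 | GVSynergy-Det: Synergistic Gaussian-Voxel Representations for Multi-View 3D Object Detection | GVSynergy-Det: synergistic Gaussian-voxel representations for multi-view 3D object detection. | representation learning, gaussian splatting, splatting | |
| 9 | Bridging Cognitive Gap: Hierarchical Description Learning for Artistic Image Aesthetics Assessment | Proposes the ArtQuant framework, which uses hierarchical description learning to address the cognitive gap in artistic image aesthetics assessment. | contrastive learning, multimodal | |
| 10 | Visual Language Hypothesis | Proposes the Visual Language Hypothesis, analyzing visual representation learning from structural and topological perspectives. | representation learning, multimodal | |
| 11 | SoulX-LiveTalk Technical Report | Proposes the SoulX-LiveTalk framework for high-fidelity, real-time, audio-driven digital human generation. | distillation, spatiotemporal | |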

🔬 Pillar 9: Embodied Foundation Models (11 papers)

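| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 12 | RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature | RxnBench: a multimodal benchmark for evaluating how well large language models understand chemical reactions in scientific literature. | large language model, multimodal | |
| 13 | MM-UAVBench: How Well Do Multimodal Large Language Models See, Think, and Plan in Low-Altitude UAV Scenarios? | Proposes MM-UAVBench to evaluate the perception, cognition, and planning abilities of multimodal large language models in low-altitude UAV scenarios. | large language model, multimodal | |
| 14 | Scalable Residual Feature Aggregation Framework with Hybrid Metaheuristic Optimization for Robust Early Pancreatic Neoplasm Detection in Multimodal CT Imaging | Proposes a scalable residual feature aggregation framework with hybrid metaheuristic optimization for robust detection of early pancreatic neoplasms in multimodal CT imaging. | multimodal | |
| 15 | Multi-Track Multimodal Learning on iMiGUE: Micro-Gesture and Emotion Recognition | Proposes a multi-track multimodal learning framework for micro-gesture and emotion recognition on the iMiGUE dataset. | multimodal | |
| 16 | Multimodal Interpretation of Remote Sensing Images: Dynamic Resolution Input Strategy and Multi-scale Vision-Language Alignment Mechanism | Proposes DRIS and MS-VLAM to improve the efficiency and semantic understanding accuracy of multimodal fusion for remote sensing images. | multimodal | |
| 17 | RS-Prune: Training-Free Data Pruning at High Ratios for Efficient Remote Sensing Diffusion Foundation Models | RS-Prune: training-free, high-ratio data pruning for remote sensing diffusion models. | foundation model | |
| 18 | OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding | OmniAgent: an audio-guided active perception agent for omnimodal audio-video understanding. | large language model, multimodal | |
| 19 | MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images | MedGemma outperforms GPT-4 at disease classification from medical images; domain-specific fine-tuning is critical. | large language model, multimodal | |
| 20 | REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation | REVEALER: a reinforcement-guided visual reasoning framework for element-level text-image alignment evaluation. | large language model, multimodal | |
| 21 | Automated river gauge plate reading using a hybrid object detection and generative AI framework in the Limpopo River Basin | Proposes a hybrid AI framework for automatically reading river gauge plates in the Limpopo River Basin. | multimodal | |
| 22 | Towards Integrating Uncertainty for Domain-Agnostic Segmentation | Proposes the UncertSAM benchmark and explores uncertainty quantification to improve how segmentation models generalize to unseen domains. | foundation model | |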

🔬 Pillar 3: Perception and Semantics (5 papers)

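| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 23 | Contour Information Aware 2D Gaussian Splatting for Image Representation | Proposes contour-information-aware 2D Gaussian splatting, improving edge reconstruction quality in image representation. | gaussian splatting, splatting | |
| 24 | SpatialMosaic: A Multiview VLM Dataset for Partial Visibility | Proposes the SpatialMosaic dataset to strengthen the spatial reasoning of multi-view VLMs in partially visible scenes. | scene understanding, large language model, multimodal | |
| 25 | AVOID: The Adverse Visual Conditions Dataset with Obstacles for Driving Scene Understanding | AVOID: a dataset of adverse visual conditions with obstacles for driving scene understanding. | scene understanding | |
| 26 | Multi-label Classification with Panoptic Context Aggregation Networks | Proposes PanCAN, which improves multi-label classification performance through panoptic context aggregation networks. | scene understanding | |
| 27 | RealX3D: A Physically-Degraded 3D Benchmark for Multi-view Visual Restoration and Reconstruction | RealX3D: a physically degraded 3D benchmark for multi-view visual restoration and reconstruction. | metric depth | |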

🔬 Pillar 1: Robot Control (2 papers)

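| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 28 | NeXT-IMDL: Build Benchmark for NeXT-Generation Image Manipulation Detection & Localization | NeXT-IMDL: builds a benchmark for next-generation image manipulation detection and localization. | manipulation | |
| 29 | Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation | By repurposing video diffusion models, DKT achieves zero-shot SOTA depth and normal estimation for transparent objects. | manipulation, monocular depth | |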

🔬 Pillar 7: Motion Retargeting (1 paper)

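| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 30 | Holi-DETR: Holistic Fashion Item Detection Leveraging Contextual Information | Proposes Holi-DETR, which leverages contextual information for holistic fashion item detection and improves detection accuracy. | spatial relationship | |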
