cs.CV (2025-12-29)

📊 33 papers in total | 🔗 2 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (11 🔗2) · Pillar 9: Embodied Foundation Models (11) · Pillar 3: Perception & Semantics (7) · Pillar 7: Motion Retargeting (2) · Pillar 1: Robot Control (2)

🔬 Pillar 2: RL & Architecture (11 papers)

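| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 1 | HY-Motion 1.0: Scaling Flow Matching Models for Text-To-Motion Generation | HY-Motion 1.0 scales flow matching models to the billion-parameter level for text-driven 3D human motion generation. | reinforcement learning, flow matching, text-to-motion | |
| 2 | PathFound: An Agentic Multimodal Model Activating Evidence-seeking Pathological Diagnosis | PathFound: an agentic multimodal model that actively seeks evidence for pathological diagnosis. | reinforcement learning, representation learning, foundation model | |
| 3 | LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation | Proposes an improved on-policy distillation method for real-time multimodal interactive video diffusion. | distillation, multimodal | |
| 4 | GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation | Proposes a driving world model built on a 3D Gaussian representation for unified scene understanding and multi-modal generation. | world model, scene understanding | |
| 5 | ProGuard: Towards Proactive Multimodal Safeguard | Proposes ProGuard, a proactive multimodal safeguard that identifies and describes out-of-distribution safety risks. | reinforcement learning, multimodal | |
| 6 | ThinkGen: Generalized Thinking for Visual Generation | ThinkGen: a general visual generation framework based on chain-of-thought reasoning that improves adaptability across scenarios. | reinforcement learning, large language model, multimodal | |
| 7 | CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation | Proposes CME-CAD, a heterogeneous collaborative multi-expert reinforcement learning framework for high-precision, editable CAD code generation. | reinforcement learning, chain-of-thought | |
| 8 | SoulX-FlashTalk: Real-Time Infinite Streaming of Audio-Driven Avatars via Self-Correcting Bidirectional Distillation | SoulX-FlashTalk: real-time, infinite-length audio-driven avatar generation via self-correcting bidirectional distillation. | distillation, spatiotemporal | |
| 9 | GVSynergy-Det: Synergistic Gaussian-Voxel Representations for Multi-View 3D Object Detection | GVSynergy-Det: synergistic Gaussian-voxel representations for multi-view 3D object detection. | representation learning, gaussian splatting, splatting | |
| 10 | Bridging Cognitive Gap: Hierarchical Description Learning for Artistic Image Aesthetics Assessment | Proposes the ArtQuant framework, which uses hierarchical description learning to bridge the cognitive gap in artistic image aesthetics assessment. | contrastive learning, multimodal | |
| 11 | Visual Language Hypothesis | Proposes the Visual Language Hypothesis, which examines visual representation learning from structural and topological perspectives. | representation learning, multimodal | |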

🔬 Pillar 9: Embodied Foundation Models (11 papers)

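| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 12 | RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature | RxnBench: a multimodal benchmark for evaluating how well large language models understand chemical reactions in scientific literature. | large language model, multimodal | |
| 13 | MM-UAVBench: How Well Do Multimodal Large Language Models See, Think, and Plan in Low-Altitude UAV Scenarios? | Proposes MM-UAVBench to evaluate the perception, cognition, and planning abilities of multimodal large language models in low-altitude UAV scenarios. | large language model, multimodal | |
| 14 | Scaling Remote Sensing Foundation Models: Data Domain Tradeoffs at the Peta-Scale | Explores the scalability of remote sensing foundation models, weighing data-domain tradeoffs on peta-scale data. | foundation model, multimodal | |
| 15 | Scalable Residual Feature Aggregation Framework with Hybrid Metaheuristic Optimization for Robust Early Pancreatic Neoplasm Detection in Multimodal CT Imaging | Proposes a scalable residual feature aggregation framework with hybrid metaheuristic optimization for robust detection of early pancreatic neoplasms in multimodal CT imaging. | multimodal | |
| 16 | Multimodal Interpretation of Remote Sensing Images: Dynamic Resolution Input Strategy and Multi-scale Vision-Language Alignment Mechanism | Proposes a VLM framework with DRIS and MS-VLAM to improve the efficiency and accuracy of multimodal remote sensing image understanding. | multimodal | |
| 17 | RS-Prune: Training-Free Data Pruning at High Ratios for Efficient Remote Sensing Diffusion Foundation Models | RS-Prune: training-free, high-ratio data pruning for remote sensing diffusion models. | foundation model | |
| 18 | REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation | REVEALER: a reinforcement-learning-guided visual reasoning framework for element-level text-image alignment evaluation. | large language model, multimodal | |
| 19 | Active Perception Agent for Omnimodal Audio-Video Understanding | Proposes OmniAgent, the first fully active perception agent, for fine-grained audio-video understanding. | large language model, multimodal | |
| 20 | MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images | MedGemma outperforms GPT-4 in disease classification from medical images; domain-specific fine-tuning is crucial. | large language model, multimodal | |
| 21 | Automated river gauge plate reading using a hybrid object detection and generative AI framework in the Limpopo River Basin | Proposes a hybrid AI framework combining object detection and generative models for automated water-level reading in the Limpopo River Basin. | multimodal | |
| 22 | Towards Integrating Uncertainty for Domain-Agnostic Segmentation | Proposes the UncertSAM benchmark and explores how uncertainty quantification can improve domain-general segmentation models. | foundation model | |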

🔬 Pillar 3: Perception & Semantics (7 papers)

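| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 23 | Leveraging Synthetic Priors for Monocular Depth Estimation in Specular Surgical Environments | Leverages synthetic priors to improve monocular depth estimation accuracy in endoscopic surgical environments. | depth estimation, monocular depth, Depth Anything | |
| 24 | Contour Information Aware 2D Gaussian Splatting for Image Representation | Proposes a contour-aware 2D Gaussian splatting method that improves edge reconstruction quality in image representation. | gaussian splatting, splatting | |
| 25 | OptFormer: Optical Flow-Guided Attention and Phase Space Reconstruction for SST Forecasting | OptFormer: optical flow-guided attention and phase space reconstruction for sea surface temperature forecasting. | optical flow, spatiotemporal | |
| 26 | SpatialMosaic: A Multiview VLM Dataset for Partial Visibility | Proposes SpatialMosaic to address spatial reasoning under partial visibility. | scene understanding, large language model, multimodal | |
| 27 | AVOID: The Adverse Visual Conditions Dataset with Obstacles for Driving Scene Understanding | AVOID: an adverse visual conditions dataset with obstacles for driving scene understanding. | scene understanding | |
| 28 | Multi-label Classification with Panoptic Context Aggregation Networks | Proposes PanCAN, which improves multi-label classification performance through panoptic context aggregation networks. | scene understanding | |
| 29 | RealX3D: A Physically-Degraded 3D Benchmark for Multi-view Visual Restoration and Reconstruction | RealX3D: a physically degraded 3D benchmark for multi-view visual restoration and reconstruction. | metric depth | |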

🔬 Pillar 7: Motion Retargeting (2 papers)

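| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 30 | Multi-Track Multimodal Learning on iMiGUE: Micro-Gesture and Emotion Recognition | Proposes a multi-track multimodal learning framework for micro-gesture and emotion recognition on the iMiGUE dataset. | motion prediction, multimodal | |
| 31 | Holi-DETR: Holistic Fashion Item Detection Leveraging Contextual Information | Proposes Holi-DETR, which leverages contextual information for holistic fashion item detection and improves detection accuracy. | spatial relationship | |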

🔬 Pillar 1: Robot Control (2 papers)

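| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 32 | NeXT-IMDL: Build Benchmark for NeXT-Generation Image Manipulation Detection & Localization | NeXT-IMDL: builds a benchmark for next-generation image manipulation detection and localization. | manipulation | |
| 33 | Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation | By repurposing video diffusion models, DKT achieves zero-shot SOTA estimation of depth and normals for transparent objects. | manipulation, monocular depth | |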
