cs.CV (2025-12-29)

📊 33 papers in total | 🔗 2 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (11 🔗2) Pillar 9: Embodied Foundation Models (11) Pillar 3: Perception & Semantics (7) Pillar 7: Motion Retargeting (2) Pillar 1: Robot Control (2)

🔬 Pillar 2: RL & Architecture (11 papers)

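#	Title	One-sentence summary	Tags	🔗	⭐
1	HY-Motion 1.0: Scaling Flow Matching Models for Text-To-Motion Generation	HY-Motion 1.0 scales flow matching models to the billion-parameter level for text-driven 3D human motion generation.	reinforcement learning flow matching text-to-motion
2	PathFound: An Agentic Multimodal Model Activating Evidence-seeking Pathological Diagnosis	PathFound: an agentic multimodal model that actively seeks evidence for pathological diagnosis.	reinforcement learning representation learning foundation model
3	LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation	Proposes an improved on-policy distillation method for real-time multimodal interactive video diffusion.	distillation multimodal
4	GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation	Proposes a driving world model based on a 3D Gaussian representation for unified scene understanding and multi-modal generation.	world model scene understanding	✅
5	ProGuard: Towards Proactive Multimodal Safeguard	Proposes ProGuard, a proactive multimodal safeguard method for identifying and describing out-of-distribution safety risks.	reinforcement learning multimodal
6	ThinkGen: Generalized Thinking for Visual Generation	ThinkGen: proposes a general chain-of-thought-based framework for visual generation that improves adaptability across scenarios.	reinforcement learning large language model multimodal	✅
7	CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation	Proposes CME-CAD, a heterogeneous collaborative multi-expert reinforcement learning framework for high-precision, editable CAD code generation.	reinforcement learning chain-of-thought
8	SoulX-FlashTalk: Real-Time Infinite Streaming of Audio-Driven Avatars via Self-Correcting Bidirectional Distillation	SoulX-FlashTalk: real-time, infinite-streaming audio-driven avatar generation via self-correcting bidirectional distillation.	distillation spatiotemporal
9	GVSynergy-Det: Synergistic Gaussian-Voxel Representations for Multi-View 3D Object Detection	GVSynergy-Det: synergistic Gaussian-voxel representations for multi-view 3D object detection.	representation learning gaussian splatting splatting
10	Bridging Cognitive Gap: Hierarchical Description Learning for Artistic Image Aesthetics Assessment	Proposes the ArtQuant framework, which uses hierarchical description learning to address the cognitive gap in artistic image aesthetics assessment.	contrastive learning multimodal
11	Visual Language Hypothesis	Proposes the Visual Language Hypothesis, which examines visual representation learning from structural and topological perspectives.	representation learning multimodal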

🔬 Pillar 9: Embodied Foundation Models (11 papers)

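#	Title	One-sentence summary	Tags	🔗	⭐
12	RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature	RxnBench: a multimodal benchmark for evaluating how well large language models understand chemical reactions in scientific literature.	large language model multimodal
13	MM-UAVBench: How Well Do Multimodal Large Language Models See, Think, and Plan in Low-Altitude UAV Scenarios?	Proposes MM-UAVBench to evaluate the perception, cognition, and planning abilities of multimodal large language models in low-altitude UAV scenarios.	large language model multimodal
14	Scaling Remote Sensing Foundation Models: Data Domain Tradeoffs at the Peta-Scale	Explores the scalability of remote sensing foundation models, weighing data domain tradeoffs at the peta-scale.	foundation model multimodal
15	Scalable Residual Feature Aggregation Framework with Hybrid Metaheuristic Optimization for Robust Early Pancreatic Neoplasm Detection in Multimodal CT Imaging	Proposes a scalable residual feature aggregation framework with hybrid metaheuristic optimization for robust early detection of pancreatic neoplasms in multimodal CT imaging.	multimodal
16	Multimodal Interpretation of Remote Sensing Images: Dynamic Resolution Input Strategy and Multi-scale Vision-Language Alignment Mechanism	Proposes a VLM framework with DRIS and MS-VLAM to improve the efficiency and accuracy of multimodal remote sensing image understanding.	multimodal
17	RS-Prune: Training-Free Data Pruning at High Ratios for Efficient Remote Sensing Diffusion Foundation Models	RS-Prune: training-free, high-ratio data pruning for remote sensing diffusion models.	foundation model
18	REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation	REVEALER: proposes a reinforcement-guided visual reasoning framework for element-level text-image alignment evaluation.	large language model multimodal
19	Active Perception Agent for Omnimodal Audio-Video Understanding	Proposes OmniAgent, the first fully active perception agent, for fine-grained audio-video understanding.	large language model multimodal
20	MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images	MedGemma outperforms GPT-4 at disease classification from medical images; domain-specific fine-tuning is critical.	large language model multimodal
21	Automated river gauge plate reading using a hybrid object detection and generative AI framework in the Limpopo River Basin	Proposes a hybrid AI framework that combines object detection with generative models for automated water-level reading in the Limpopo River Basin.	multimodal
22	Towards Integrating Uncertainty for Domain-Agnostic Segmentation	Proposes the UncertSAM benchmark to explore how uncertainty quantification can improve domain-generalizable segmentation models.	foundation model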

🔬 Pillar 3: Perception & Semantics (7 papers)

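#	Title	One-sentence summary	Tags	🔗	⭐
23	Leveraging Synthetic Priors for Monocular Depth Estimation in Specular Surgical Environments	Leverages synthetic priors to improve monocular depth estimation accuracy in endoscopic surgical environments.	depth estimation monocular depth Depth Anything
24	Contour Information Aware 2D Gaussian Splatting for Image Representation	Proposes a contour-aware 2D Gaussian splatting method that improves edge reconstruction quality in image representation.	gaussian splatting splatting
25	OptFormer: Optical Flow-Guided Attention and Phase Space Reconstruction for SST Forecasting	OptFormer: optical flow-guided attention and phase space reconstruction for sea surface temperature forecasting.	optical flow spatiotemporal
26	SpatialMosaic: A Multiview VLM Dataset for Partial Visibility	Proposes SpatialMosaic to address spatial reasoning under partial visibility.	scene understanding large language model multimodal
27	AVOID: The Adverse Visual Conditions Dataset with Obstacles for Driving Scene Understanding	AVOID: an adverse visual conditions dataset with obstacles for driving scene understanding.	scene understanding
28	Multi-label Classification with Panoptic Context Aggregation Networks	Proposes PanCAN, which improves multi-label classification performance through panoptic context aggregation networks.	scene understanding
29	RealX3D: A Physically-Degraded 3D Benchmark for Multi-view Visual Restoration and Reconstruction	RealX3D: a physically degraded 3D benchmark for multi-view visual restoration and reconstruction.	metric depth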

🔬 Pillar 7: Motion Retargeting (2 papers)

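#	Title	One-sentence summary	Tags	🔗	⭐
30	Multi-Track Multimodal Learning on iMiGUE: Micro-Gesture and Emotion Recognition	Proposes a multi-track multimodal learning framework on the iMiGUE dataset for micro-gesture and emotion recognition.	motion prediction multimodal
31	Holi-DETR: Holistic Fashion Item Detection Leveraging Contextual Information	Proposes Holi-DETR, which leverages contextual information for holistic fashion item detection and improves detection accuracy.	spatial relationship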

🔬 Pillar 1: Robot Control (2 papers)

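#	Title	One-sentence summary	Tags	🔗	⭐
32	NeXT-IMDL: Build Benchmark for NeXT-Generation Image Manipulation Detection & Localization	NeXT-IMDL: builds a benchmark for next-generation image manipulation detection and localization.	manipulation
33	Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation	Using a video diffusion model, DKT achieves zero-shot SOTA estimation of depth and surface normals for transparent objects.	manipulation monocular depth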
