cs.CV (2025-12-29)

📊 30 papers in total | 🔗 2 with code

🎯 Interest Area Navigation

Pillar 2: RL Algorithms & Architecture (11, 🔗 2) · Pillar 9: Embodied Foundation Models (11) · Pillar 3: Spatial Perception & Semantics (5) · Pillar 1: Robot Control (2) · Pillar 7: Motion Retargeting (1)

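🔬 Pillar 2: RL Algorithms & Architecture (11 papers)

#	Title	Key Point	Tags	🔗	⭐
1	HY-Motion 1.0: Scaling Flow Matching Models for Text-To-Motion Generation	HY-Motion 1.0 scales flow matching models to the billion-parameter level for text-driven 3D human motion generation.	reinforcement learning flow matching text-to-motion
2	PathFound: An Agentic Multimodal Model Activating Evidence-seeking Pathological Diagnosis	PathFound: a multimodal agent model for pathological diagnosis that actively seeks evidence.	reinforcement learning representation learning foundation model
3	LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation	Proposes an improved on-policy distillation method that enables real-time multimodal interactive video diffusion.	distillation multimodal
4	ProGuard: Towards Proactive Multimodal Safeguard	Proposes ProGuard, a proactive multimodal safeguard for identifying and describing OOD safety risks in generative models.	reinforcement learning multimodal
5	ThinkGen: Generalized Thinking for Visual Generation	ThinkGen: proposes a visual generation framework based on generalized thinking that improves adaptability across scenarios.	reinforcement learning large language model multimodal	✅
6	GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation	Proposes GaussianDWM, a driving world model based on 3D Gaussian representations, for unified scene understanding and multimodal generation.	world model scene understanding	✅
7	CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation	Proposes CME-CAD, a heterogeneous collaborative multi-expert reinforcement learning framework for high-precision, editable CAD code generation.	reinforcement learning chain-of-thought
8	GVSynergy-Det: Synergistic Gaussian-Voxel Representations for Multi-View 3D Object Detection	GVSynergy-Det: synergistic Gaussian-voxel representations for multi-view 3D object detection.	representation learning gaussian splatting splatting
9	Bridging Cognitive Gap: Hierarchical Description Learning for Artistic Image Aesthetics Assessment	Proposes the ArtQuant framework, which uses hierarchical description learning to bridge the cognitive gap in artistic image aesthetics assessment.	contrastive learning multimodal
10	Visual Language Hypothesis	Proposes the Visual Language Hypothesis, analyzing visual representation learning from structural and topological perspectives.	representation learning multimodal
11	SoulX-LiveTalk Technical Report	Proposes the SoulX-LiveTalk framework for high-fidelity, real-time, audio-driven digital human generation.	distillation spatiotemporal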

🔬 Pillar 9: Embodied Foundation Models (11 papers)

#	Title	Key Point	Tags	🔗	⭐
12	RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature	RxnBench: a multimodal benchmark for evaluating large language models' understanding of chemical reactions in scientific literature.	large language model multimodal
13	MM-UAVBench: How Well Do Multimodal Large Language Models See, Think, and Plan in Low-Altitude UAV Scenarios?	Proposes MM-UAVBench to evaluate the perception, cognition, and planning abilities of multimodal large language models in low-altitude UAV scenarios.	large language model multimodal
14	Scalable Residual Feature Aggregation Framework with Hybrid Metaheuristic Optimization for Robust Early Pancreatic Neoplasm Detection in Multimodal CT Imaging	Proposes a scalable residual feature aggregation framework with hybrid metaheuristic optimization for robust early detection of pancreatic neoplasms in multimodal CT imaging.	multimodal
15	Multi-Track Multimodal Learning on iMiGUE: Micro-Gesture and Emotion Recognition	Proposes a multi-track multimodal learning framework for micro-gesture and emotion recognition on the iMiGUE dataset.	multimodal
16	Multimodal Interpretation of Remote Sensing Images: Dynamic Resolution Input Strategy and Multi-scale Vision-Language Alignment Mechanism	Proposes DRIS and MS-VLAM to improve the efficiency and semantic understanding accuracy of multimodal fusion for remote sensing images.	multimodal
17	RS-Prune: Training-Free Data Pruning at High Ratios for Efficient Remote Sensing Diffusion Foundation Models	RS-Prune: training-free, high-ratio data pruning for remote sensing diffusion models.	foundation model
18	OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding	OmniAgent: an audio-guided active perception agent for omnimodal audio-video understanding.	large language model multimodal
19	MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images	MedGemma outperforms GPT-4 in medical image disease classification; domain fine-tuning is crucial.	large language model multimodal
20	REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation	REVEALER: proposes a reinforcement-learning-guided visual reasoning framework for element-level text-image alignment evaluation.	large language model multimodal
21	Automated river gauge plate reading using a hybrid object detection and generative AI framework in the Limpopo River Basin	Proposes a hybrid AI framework for automated reading of river gauge plates in the Limpopo River Basin.	multimodal
22	Towards Integrating Uncertainty for Domain-Agnostic Segmentation	Proposes the UncertSAM benchmark and explores uncertainty quantification to improve segmentation models' generalization to unseen domains.	foundation model

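🔬 Pillar 3: Spatial Perception & Semantics (5 papers)

#	Title	Key Point	Tags	🔗	⭐
23	Contour Information Aware 2D Gaussian Splatting for Image Representation	Proposes contour-information-aware 2D Gaussian splatting to improve edge reconstruction quality in image representation.	gaussian splatting splatting
24	SpatialMosaic: A Multiview VLM Dataset for Partial Visibility	Proposes the SpatialMosaic dataset to strengthen the spatial reasoning of multi-view VLMs under partial visibility.	scene understanding large language model multimodal
25	AVOID: The Adverse Visual Conditions Dataset with Obstacles for Driving Scene Understanding	AVOID: a dataset of adverse visual conditions with obstacles for driving scene understanding.	scene understanding
26	Multi-label Classification with Panoptic Context Aggregation Networks	Proposes PanCAN, which improves multi-label classification performance through panoptic context aggregation networks.	scene understanding
27	RealX3D: A Physically-Degraded 3D Benchmark for Multi-view Visual Restoration and Reconstruction	RealX3D: a physically degraded 3D benchmark for multi-view visual restoration and reconstruction.	metric depth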

🔬 Pillar 1: Robot Control (2 papers)

#	Title	Key Point	Tags	🔗	⭐
28	NeXT-IMDL: Build Benchmark for NeXT-Generation Image Manipulation Detection & Localization	NeXT-IMDL: builds a benchmark for next-generation image manipulation detection and localization.	manipulation
29	Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation	Using a video diffusion model, DKT achieves zero-shot SOTA estimation of depth and normals for transparent objects.	manipulation monocular depth

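🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	Key Point	Tags	🔗	⭐
30	Holi-DETR: Holistic Fashion Item Detection Leveraging Contextual Information	Proposes Holi-DETR, which leverages contextual information for holistic fashion item detection and improves detection accuracy.	spatial relationship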
