cs.CV (2025-11-14)

📊 49 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (24 🔗4) · Pillar 2: RL & Architecture (12 🔗2) · Pillar 3: Perception & Semantics (5 🔗1) · Pillar 1: Robot Control (3 🔗1) · Pillar 8: Physics-based Animation (3) · Pillar 4: Generative Motion (2)

🔬 Pillar 9: Embodied Foundation Models (24 papers)

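| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 1 | VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models | VP-Bench: a comprehensive benchmark for evaluating visual prompt understanding in multimodal large language models. | large language model, multimodal |
| 2 | MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model | Proposes MicroVQA++, a high-quality microscopy reasoning dataset that uses weakly supervised graphs for training multimodal large language models. | large language model, multimodal |
| 3 | Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models | Proposes QTSplus to address visual information selection in long-video understanding. | large language model, multimodal |
| 4 | Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models | Q-Doc: evaluates the document image quality assessment capabilities of multimodal large language models. | large language model, chain-of-thought |
| 5 | AUVIC: Adversarial Unlearning of Visual Concepts for Multi-modal Large Language Models | AUVIC: an adversarial unlearning framework for visual concepts in multimodal large language models. | large language model, multimodal |
| 6 | MAFM^3: Modular Adaptation of Foundation Models for Multi-Modal Medical AI | MAFM^3: a modular adaptation framework of foundation models for multimodal medical AI. | foundation model, multimodal |
| 7 | CrossMed: A Multimodal Cross-Task Benchmark for Compositional Generalization in Medical Imaging | CrossMed: a multimodal cross-task benchmark for evaluating compositional generalization in medical imaging. | large language model, multimodal |
| 8 | Multimodal Posterior Sampling-based Uncertainty in PD-L1 Segmentation from H&E Images | Proposes nnUNet-B, based on multimodal posterior sampling, for PD-L1 segmentation from H&E images with uncertainty estimation. | multimodal |
| 9 | ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation | Proposes ImAgent, a unified multimodal agent framework for test-time scalable image generation. | multimodal |
| 10 | Synergy vs. Noise: Performance-Guided Multimodal Fusion For Biochemical Recurrence-Free Survival in Prostate Cancer | Proposes a performance-guided multimodal fusion method that improves the accuracy of biochemical recurrence prediction in prostate cancer. | multimodal |
| 11 | The Persistence of Cultural Memory: Investigating Multimodal Iconicity in Diffusion Models | Proposes a multimodal iconicity evaluation framework for analyzing the persistence of cultural memory in diffusion models. | multimodal |
| 12 | DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding | Proposes DocSLM, a small vision-language model for long-document understanding on resource-constrained edge devices. | multimodal |
| 13 | Positional Bias in Multimodal Embedding Models: Do They Favor the Beginning, the Middle, or the End? | Reveals positional bias in multimodal embedding models: text favors the beginning, while images favor both ends. | multimodal |
| 14 | Toward Generalized Detection of Synthetic Media: Limitations, Challenges, and the Path to Multimodal Solutions | Surveys the limitations and challenges of detecting AI-synthesized media and proposes research directions for multimodal deep learning solutions. | multimodal |
| 15 | EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation | EmoVid: the first multimodal emotion video dataset for emotion-centric video understanding and generation. | multimodal |
| 16 | Out-of-Distribution Detection with Positive and Negative Prompt Supervision Using Large Language Models | Proposes positive and negative prompt supervision to improve OOD detection performance. | large language model |
| 17 | PhaseWin Search Framework Enable Efficient Object-Level Interpretation | PhaseWin: an efficient object-level interpretation framework that achieves faithful region attribution with near-linear complexity. | foundation model, multimodal, visual grounding |
| 18 | AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization | AccKV: adaptive-focusing and cross-calibration KV cache optimization for efficient audio-video LLM inference. | large language model, multimodal |
| 19 | S2D-ALIGN: Shallow-to-Deep Auxiliary Learning for Anatomically-Grounded Radiology Report Generation | Proposes S2D-Align, which achieves anatomically grounded radiology report generation through shallow-to-deep auxiliary learning. | large language model, multimodal |
| 20 | Draft and Refine with Visual Experts | Proposes the Draft and Refine framework, which improves how well LVLMs use visual information and reduces hallucination. | multimodal, visual grounding |
| 21 | Φeat: Physically-Grounded Feature Representation | Proposes Φeat, a physically interpretable visual feature representation that improves material recognition. | foundation model |
| 22 | Beyond Flatlands: Unlocking Spatial Intelligence by Decoupling 3D Reasoning from Numerical Regression | GEODE: decouples 3D reasoning from numerical regression to improve the spatial intelligence of vision-language models. | chain-of-thought |
| 23 | Stroke Modeling Enables Vectorized Character Generation with Large Vectorized Glyph Model | Proposes LVGM, a large vectorized glyph model based on stroke modeling, for vectorized character generation. | large language model |
| 24 | PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs | PAS: a training-free stabilizer for temporal encoding in video LLMs that addresses temporal inconsistency. | multimodal |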

🔬 Pillar 2: RL & Architecture (12 papers)

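| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 25 | RealisticDreamer: Guidance Score Distillation for Few-shot Gaussian Splatting | RealisticDreamer: guidance score distillation for few-shot Gaussian splatting. | dreamer, distillation, 3D gaussian splatting |
| 26 | OpenUS: A Fully Open-Source Foundation Model for Ultrasound Image Analysis via Self-Adaptive Masked Contrastive Learning | OpenUS: the first fully open-source foundation model for ultrasound image analysis, using self-adaptive masked contrastive learning. | Mamba, contrastive learning, foundation model |
| 27 | Hindsight Distillation Reasoning with Knowledge Encouragement Preference for Knowledge-based Visual Question Answering | Proposes the HinD framework, which addresses knowledge-based visual question answering through hindsight distillation reasoning and knowledge encouragement preference optimization. | distillation, large language model, multimodal |
| 28 | MCN-CL: Multimodal Cross-Attention Network and Contrastive Learning for Multimodal Emotion Recognition | Proposes MCN-CL, which uses cross-modal attention and contrastive learning to improve multimodal emotion recognition. | contrastive learning, multimodal |
| 29 | Geospatial Chain of Thought Reasoning for Enhanced Visual Question Answering on Satellite Imagery | Proposes a geospatial chain-of-thought VQA framework that improves satellite image understanding and supports climate applications. | DPO, direct preference optimization, chain-of-thought |
| 30 | Arcee: Differentiable Recurrent State Chain for Generative Vision Modeling with Mamba SSMs | Arcee: uses a differentiable recurrent state chain with Mamba SSMs to improve generative vision modeling. | flow matching, Mamba, SSM |
| 31 | PROMISE: Prompt-Attentive Hierarchical Contrastive Learning for Robust Cross-Modal Representation with Missing Modalities | PROMISE: prompt-attentive hierarchical contrastive learning for robust cross-modal representation when modalities are missing. | representation learning, contrastive learning, multimodal |
| 32 | VIDEOP2R: Video Understanding from Perception to Reasoning | VideoP2R: improves video understanding by modeling perception and reasoning. | reinforcement learning, large language model, chain-of-thought |
| 33 | Language-Guided Graph Representation Learning for Video Summarization | Proposes LGRLN, a language-guided graph representation learning network, to address global dependencies and multimodal customization in video summarization. | representation learning, multimodal |
| 34 | Closing the Gap: Data-Centric Fine-Tuning of Vision Language Models for the Standardized Exam Questions | Proposes a data-centric fine-tuning method for vision-language models that improves performance on standardized exam questions. | reinforcement learning, DPO, multimodal |
| 35 | A Comparison of Lightweight Deep Learning Models for Particulate-Matter Nowcasting in the Indian Subcontinent & Surrounding Regions | Proposes lightweight deep learning models for nowcasting PM1, PM2.5, and PM10 in the Indian subcontinent and surrounding regions. | MAE, foundation model |
| 36 | Heterogeneous Complementary Distillation | Proposes the Heterogeneous Complementary Distillation (HCD) framework, which addresses knowledge transfer between heterogeneous architectures such as ViT to ResNet. | distillation |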

🔬 Pillar 3: Perception & Semantics (5 papers)

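| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 37 | PINGS-X: Physics-Informed Normalized Gaussian Splatting with Axes Alignment for Efficient Super-Resolution of 4D Flow MRI | Proposes PINGS-X for super-resolution of 4D flow MRI. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 38 | Dynamic Gaussian Scene Reconstruction from Unsynchronized Videos | Proposes a temporal alignment module for dynamic Gaussian scene reconstruction from unsynchronized videos. | gaussian splatting, splatting, scene reconstruction |
| 39 | 3D Gaussian and Diffusion-Based Gaze Redirection | Proposes DiT-Gaze, which combines diffusion models with 3D Gaussians to improve the realism and accuracy of gaze redirection. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 40 | DEFT-LLM: Disentangled Expert Feature Tuning for Micro-Expression Recognition | Proposes DEFT-LLM, which applies disentangled expert feature tuning to micro-expression recognition to improve performance and interpretability. | optical flow, large language model, multimodal |
| 41 | 6D Strawberry Pose Estimation: Real-time and Edge AI Solutions Using Purely Synthetic Training Data | Proposes a 6D strawberry pose estimation solution trained on purely synthetic data, suitable for real-time and edge AI. | 6D pose estimation |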

🔬 Pillar 1: Robot Control (3 papers)

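| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 42 | Viper-F1: Fast and Fine-Grained Multimodal Understanding with Cross-Modal State-Space Modulation | Viper-F1: fast and fine-grained multimodal understanding via cross-modal state-space modulation. | manipulation, large language model, multimodal |
| 43 | AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning | AirCopBench: a benchmark for multi-drone collaborative embodied perception and reasoning. | sim-to-real, scene understanding, egocentric |
| 44 | Phys-Liquid: A Physics-Informed Dataset for Estimating 3D Geometry and Volume of Transparent Deformable Liquids | Phys-Liquid: a physics-informed dataset for estimating the 3D geometry and volume of transparent deformable liquids. | manipulation |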

🔬 Pillar 8: Physics-based Animation (3 papers)

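| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 45 | Enhancing XR Auditory Realism via Multimodal Scene-Aware Acoustic Rendering | Proposes SAMOSA, which enhances XR auditory realism through multimodal scene-aware acoustic rendering. | PULSE, multimodal |
| 46 | SOTFormer: A Minimal Transformer for Unified Object Tracking and Trajectory Prediction | SOTFormer: a minimal Transformer for unified object tracking and trajectory prediction. | AMP |
| 47 | Computationally-efficient deep learning models for nowcasting of precipitation: A solution for the Weather4cast 2025 challenge | Proposes a precipitation nowcasting model based on ConvGRU and transfer learning, which placed second in the Weather4cast 2025 challenge. | spatiotemporal |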

🔬 Pillar 4: Generative Motion (2 papers)

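| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 48 | SOSControl: Enhancing Human Motion Generation through Saliency-Aware Symbolic Orientation and Timing Control | Proposes the SOSControl framework, which enhances human motion generation through saliency-aware symbolic control. | text-to-motion, motion generation, human motion |
| 49 | Free3D: 3D Human Motion Emerges from Single-View 2D Supervision | Free3D: a framework that generates 3D human motion using only 2D supervision. | motion generation, human motion |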
