| # | Title | Summary | Keywords | |
|---|-------|---------|----------|---|
| 1 | VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models | A comprehensive benchmark for evaluating visual-prompt understanding in multimodal large language models | large language model, multimodal | |
| 2 | MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model | Introduces MicroVQA++, a high-quality microscopy reasoning dataset that uses weakly supervised graphs for multimodal large language model training | large language model, multimodal | |
| 3 | Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models | Introduces QTSplus to address visual-information selection in long-video understanding | large language model, multimodal | |
| 4 | Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models | Evaluates the document image quality assessment capabilities of multimodal large language models | large language model, chain-of-thought | ✅ |
| 5 | AUVIC: Adversarial Unlearning of Visual Concepts for Multi-modal Large Language Models | An adversarial unlearning framework for visual concepts in multimodal large language models | large language model, multimodal | |
| 6 | MAFM^3: Modular Adaptation of Foundation Models for Multi-Modal Medical AI | A modular adaptation framework of foundation models for multimodal medical AI | foundation model, multimodal | ✅ |
| 7 | CrossMed: A Multimodal Cross-Task Benchmark for Compositional Generalization in Medical Imaging | A multimodal cross-task benchmark for evaluating compositional generalization in medical imaging | large language model, multimodal | |
| 8 | Multimodal Posterior Sampling-based Uncertainty in PD-L1 Segmentation from H&E Images | Introduces nnUNet-B, a multimodal posterior-sampling approach for PD-L1 segmentation and uncertainty estimation from H&E images | multimodal | |
| 9 | ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation | Introduces ImAgent, a unified multimodal agent framework for test-time scalable image generation | multimodal | |
| 10 | Synergy vs. Noise: Performance-Guided Multimodal Fusion For Biochemical Recurrence-Free Survival in Prostate Cancer | Proposes a performance-guided multimodal fusion method that improves prediction of biochemical recurrence in prostate cancer | multimodal | |
| 11 | The Persistence of Cultural Memory: Investigating Multimodal Iconicity in Diffusion Models | Proposes a multimodal iconicity evaluation framework for analyzing the persistence of cultural memory in diffusion models | multimodal | |
| 12 | DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding | Introduces DocSLM, a small vision-language model for long-document understanding on resource-constrained edge devices | multimodal | ✅ |
| 13 | Positional Bias in Multimodal Embedding Models: Do They Favor the Beginning, the Middle, or the End? | Reveals positional bias in multimodal embedding models: text embeddings favor the beginning, image embeddings favor both ends | multimodal | |
| 14 | Toward Generalized Detection of Synthetic Media: Limitations, Challenges, and the Path to Multimodal Solutions | Surveys the limitations and challenges of detecting AI-generated media and outlines research directions toward multimodal deep-learning solutions | multimodal | |
| 15 | EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation | The first multimodal emotion video dataset, for emotion-centric video understanding and generation | multimodal | |
| 16 | Out-of-Distribution Detection with Positive and Negative Prompt Supervision Using Large Language Models | Proposes positive and negative prompt supervision to improve OOD detection performance | large language model | |
| 17 | PhaseWin Search Framework Enable Efficient Object-Level Interpretation | An efficient object-level interpretation framework that achieves faithful region attribution with near-linear complexity | foundation model, multimodal, visual grounding | |
| 18 | AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization | Adaptive-focusing and cross-calibration KV cache optimization for efficient audio-video LLM inference | large language model, multimodal | |
| 19 | S2D-ALIGN: Shallow-to-Deep Auxiliary Learning for Anatomically-Grounded Radiology Report Generation | Introduces S2D-Align, which uses shallow-to-deep auxiliary learning for anatomically grounded radiology report generation | large language model, multimodal | |
| 20 | Draft and Refine with Visual Experts | Proposes a Draft and Refine framework that improves LVLMs' use of visual information and reduces hallucination | multimodal, visual grounding | ✅ |
| 21 | Φeat: Physically-Grounded Feature Representation | Introduces Φeat, a physically grounded visual feature representation that improves material recognition | foundation model | |
| 22 | Beyond Flatlands: Unlocking Spatial Intelligence by Decoupling 3D Reasoning from Numerical Regression | GEODE decouples 3D reasoning from numerical regression to improve the spatial intelligence of vision-language models | chain-of-thought | |
| 23 | Stroke Modeling Enables Vectorized Character Generation with Large Vectorized Glyph Model | Introduces LVGM, a large vectorized glyph model based on stroke modeling, for vectorized character generation | large language model | |
| 24 | PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs | A training-free temporal-encoding stabilizer for video LLMs that addresses temporal inconsistency | multimodal | |