cs.CV (2025-10-16)

📊 47 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (16 🔗3) · Pillar 2: RL Algorithms and Architecture (15 🔗3) · Pillar 3: Spatial Perception and Semantics (8 🔗3) · Pillar 1: Robot Control (3) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 4: Generative Motion (2 🔗2) · Pillar 6: Video Extraction and Matching (1)

🔬 Pillar 9: Embodied Foundation Models (16 papers)

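# | Title | One-sentence summary | Tags | 🔗
1 | Vision-Centric Activation and Coordination for Multimodal Large Language Models | Proposes VaCo, which improves the visual understanding of multimodal large language models through vision-centric activation and coordination | large language model foundation model multimodal |
2 | IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection | Proposes IAD-GPT, which uses a multimodal large language model to strengthen visual knowledge for industrial anomaly detection | large language model multimodal visual grounding |
3 | You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction | Proposes the nlg2choice method, which improves the classification and retrieval performance of multimodal large language models in fine-grained visual recognition | large language model multimodal |
4 | Benchmarking Multimodal Large Language Models for Face Recognition | Systematically benchmarks multimodal large language models on face recognition tasks | large language model multimodal |
5 | Train a Unified Multimodal Data Quality Classifier with Synthetic Data | Proposes UniFilter, a unified multimodal data quality classifier built on synthetic data | large language model multimodal |
6 | Leveraging Multimodal LLM Descriptions of Activity for Explainable Semi-Supervised Video Anomaly Detection | Proposes a semi-supervised video anomaly detection framework based on multimodal LLM descriptions, improving both the detection of complex anomalies and explainability | large language model multimodal |
7 | ChangingGrounding: 3D Visual Grounding in Changing Scenes | Proposes the ChangingGrounding benchmark and the Mem-ChangingGrounder method to address 3D visual grounding in dynamic scenes | visual grounding |
8 | VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning | VTimeCoT: performs video temporal grounding and reasoning by drawing a video progress bar | large language model multimodal chain-of-thought |
9 | Towards Generalist Intelligence in Dentistry: Vision Foundation Models for Oral and Maxillofacial Radiology | Proposes DentVFM, a general-purpose vision foundation model for oral and maxillofacial radiology | foundation model |
10 | Joint Modeling of Big Five and HEXACO for Multimodal Apparent Personality-trait Recognition | Proposes a joint modeling method for Big Five and HEXACO for multimodal apparent personality-trait recognition | multimodal |
11 | MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning | MOBIUS: Big-to-Mobile universal instance segmentation via multi-modal bottleneck fusion and calibrated decoder pruning | foundation model |
12 | MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos | Proposes MaskCaptioner, which jointly learns to segment and caption object trajectories in videos for end-to-end dense video object captioning | VLN |
13 | DEXTER: Diffusion-Guided EXplanations with TExtual Reasoning for Vision Models | DEXTER: uses diffusion models and textual reasoning to explain vision models without requiring data | large language model |
14 | In-Context Learning with Unpaired Clips for Instruction-based Video Editing | Proposes an in-context learning method based on unpaired video clips for instruction-based video editing | instruction following |
15 | Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference | Proposes Efficient Video Sampling (EVS), which speeds up VLM inference by pruning temporally redundant tokens | large language model |
16 | Watermarking for Factuality: Guiding Vision-Language Models Toward Truth via Tri-layer Contrastive Decoding | Proposes a watermark-based tri-layer contrastive decoding method that improves the factuality and visual grounding of vision-language models | multimodal |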

🔬 Pillar 2: RL Algorithms and Architecture (15 papers)

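# | Title | One-sentence summary | Tags | 🔗
17 | OmniMotion: Multimodal Motion Generation with Continuous Masked Autoregression | OmniMotion: proposes a continuous masked autoregressive Transformer for multimodal whole-body human motion generation | linear attention text-to-motion motion generation |
18 | WeCKD: Weakly-supervised Chained Distillation Network for Efficient Multimodal Medical Imaging | Proposes WeCKD, a weakly-supervised chained distillation network for efficient multimodal medical image analysis | teacher-student distillation multimodal |
19 | Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering | Proposes the Wiki-PRF framework to address multimodal query quality and retrieval-result relevance in knowledge-based VQA | reinforcement learning multimodal |
20 | Capturing Context-Aware Route Choice Semantics for Trajectory Representation Learning | Proposes the CORE framework, which incorporates context-aware route choice semantics to improve trajectory representation learning | representation learning spatiotemporal large language model |
21 | Directional Reasoning Injection for Fine-Tuning MLLMs | Proposes DRIFT, which injects directional reasoning knowledge in gradient space to fine-tune multimodal large language models efficiently | reinforcement learning large language model multimodal |
22 | Composition-Grounded Instruction Synthesis for Visual Reasoning | Proposes the COGS framework to improve the reasoning ability of multimodal large language models | reinforcement learning large language model multimodal |
23 | Spatial Preference Rewarding for MLLMs Spatial Understanding | Proposes Spatial Preference Rewarding (SPR) to improve the fine-grained spatial understanding of MLLMs | direct preference optimization large language model multimodal |
24 | RealDPO: Real or Not Real, that is the Preference | RealDPO: uses preference learning on real data to improve the motion realism of video generation models | preference learning DPO direct preference optimization |
25 | Terra: Explorable Native 3D World Model with Point Latents | Terra: an explorable native 3D world model based on point latents | flow matching world model |
26 | DRBD-Mamba for Robust and Efficient Brain Tumor Segmentation with Analytical Insights | Proposes the DRBD-Mamba model for robust and efficient brain tumor segmentation, with analytical insights | Mamba state space model |
27 | Generalized Dynamics Generation towards Scannable Physical World Model | GDGen: a potential-energy-based general dynamics generation framework for scannable physical world modeling | world model |
28 | Vision Mamba for Permeability Prediction of Porous Media | Proposes a Vision Mamba-based model for predicting the permeability of porous media, improving computational efficiency and memory utilization | Mamba |
29 | Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning | Proposes Identity-GRPO, which uses reinforcement learning to optimize identity preservation in multi-human video generation | reinforcement learning |
30 | Multi-modal video data-pipelines for machine learning with minimal human supervision | Proposes a machine learning approach based on weakly supervised multimodal video data pipelines | MAE depth estimation |
31 | Decorrelation Speeds Up Vision Transformers | Proposes DBP-MAE to accelerate ViT pre-training, reducing computational cost and carbon emissions while improving downstream task performance | masked autoencoder MAE |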

🔬 Pillar 3: Spatial Perception and Semantics (8 papers)

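# | Title | One-sentence summary | Tags | 🔗
32 | CoT-PL: Visual Chain-of-Thought Reasoning Meets Pseudo-Labeling for Open-Vocabulary Object Detection | Proposes the CoT-PL framework, which improves open-vocabulary object detection in complex scenes through visual chain-of-thought reasoning and pseudo-labeling | open-vocabulary open vocabulary chain-of-thought |
33 | GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering | GauSSmart: enhances 3D Gaussian Splatting reconstruction by combining 2D foundation models with geometric filtering | 3D gaussian splatting gaussian splatting splatting |
34 | BalanceGS: Algorithm-System Co-design for Efficient 3D Gaussian Splatting Training on GPU | BalanceGS: algorithm-system co-design for efficient 3D Gaussian Splatting training on GPUs | 3D gaussian splatting 3DGS gaussian splatting |
35 | SaLon3R: Structure-aware Long-term Generalizable 3D Reconstruction from Unposed Images | SaLon3R: structure-aware, long-term generalizable 3D reconstruction that addresses redundancy and geometric inconsistency | depth estimation 3D gaussian splatting 3DGS |
36 | Leveraging Learned Image Prior for 3D Gaussian Compression | Leverages image priors to improve the compression ratio and rendering quality of 3D Gaussian compression | 3D gaussian splatting 3DGS gaussian splatting |
37 | C4D: 4D Made from 3D through Dual Correspondences | C4D: reconstructs 4D dynamic scenes from 3D through dual correspondences | depth estimation optical flow |
38 | Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures | Proposes a framework for customizing video diffusion models with multi-view performance captures, achieving camera controllability and character consistency | gaussian splatting splatting |
39 | STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding | STANCE: motion-coherent video generation through sparse-to-dense anchored encoding | monocular depth |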

🔬 Pillar 1: Robot Control (3 papers)

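# | Title | One-sentence summary | Tags | 🔗
40 | MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning | MathCanvas: intrinsic visual chain-of-thought for multimodal mathematical reasoning | manipulation large language model multimodal |
41 | QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models | QDepth-VLA: uses quantized depth prediction as auxiliary supervision for vision-language-action models, improving spatial perception | manipulation VQ-VAE vision-language-action |
42 | UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos | UrbanVerse: scales urban simulation using city-tour videos, for training embodied agents | quadruped sim-to-real embodied AI |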

🔬 Pillar 8: Physics-based Animation (2 papers)

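# | Title | One-sentence summary | Tags | 🔗
43 | EuroMineNet: A Multitemporal Sentinel-2 Benchmark for Spatiotemporal Mining Footprint Analysis in the European Union (2015-2024) | EuroMineNet: a multitemporal Sentinel-2 benchmark dataset for spatiotemporal mining footprint analysis in the European Union | spatiotemporal |
44 | Event Interval Modulation: A Novel Scheme for Event-based Optical Camera Communication | Proposes the Event Interval Modulation (EIM) scheme to increase the transmission rate of event-camera optical communication | PULSE |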

🔬 Pillar 4: Generative Motion (2 papers)

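# | Title | One-sentence summary | Tags | 🔗
45 | TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions | Proposes the TOUCH framework for text-guided, controllable generation of free-form hand-object interactions | physically plausible HOI |
46 | Deep Compositional Phase Diffusion for Long Motion Sequence Generation | Proposes a compositional phase diffusion method to address unsmooth transitions between segments in long motion sequence generation | motion generation |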

🔬 Pillar 6: Video Extraction and Matching (1 paper)

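# | Title | One-sentence summary | Tags | 🔗
47 | A solution to generalized learning from small training sets found in everyday infant experiences | Analyzes the "lumpy" similarity structure of infant visual experience to improve generalization in small-sample learning | egocentric |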
