cs.CV (2025-04-14)

📊 47 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (16 🔗4) · Pillar 2: RL & Architecture (11 🔗2) · Pillar 3: Perception & Semantics (10 🔗3) · Pillar 1: Robot Control (3 🔗1) · Pillar 4: Generative Motion (2 🔗1) · Pillar 6: Video Extraction (2 🔗1) · Pillar 8: Physics-based Animation (2) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (16 papers)

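# | Title | One-line summary | Tags | 🔗
1 | TAMP: Token-Adaptive Layerwise Pruning in Multimodal Large Language Models | TAMP: token-adaptive layerwise pruning for multimodal large language models. | large language model multimodal TAMP
2 | COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts | Proposes the COUNTS dataset and the O(OD)2 and OODG benchmarks to evaluate how well object detectors and multimodal large models generalize under distribution shifts. | large language model multimodal visual grounding
3 | Mavors: Multi-granularity Video Representation for Multimodal Large Language Model | Mavors: a multi-granularity video representation for multimodal large language models that improves long-video understanding. | large language model multimodal
4 | Multimodal Long Video Modeling Based on Temporal Dynamic Context | Proposes TDC, a model based on temporal dynamic context, to address information loss in multimodal long-video understanding. | large language model multimodal chain-of-thought
5 | InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models | InternVL3: explores advanced training and test-time recipes for open-source multimodal models. | large language model multimodal
6 | Enhancing Multi-task Learning Capability of Medical Generalist Foundation Model via Image-centric Multi-annotation Data | Proposes the IMAX dataset to improve the multi-task learning capability of medical generalist foundation models. | large language model foundation model
7 | Integrating Vision and Location with Transformers: A Multimodal Deep Learning Framework for Medical Wound Analysis | Proposes a Transformer-based multimodal deep learning framework for classification and location analysis of medical wound images. | multimodal
8 | Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure | Summarizes multimodal presentations with vision-language models and studies the effect of modalities and structure. | multimodal
9 | Relation-Rich Visual Document Generator for Visual Information Extraction | Proposes RIDGE, which uses content-driven layout generation to tackle information extraction from relation-rich visual documents. | large language model multimodal
10 | MIEB: Massive Image Embedding Benchmark | MIEB: a massive image embedding benchmark for comprehensively evaluating image and image-text embedding models. | large language model multimodal
11 | The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer | Proposes SAIL, a unified single-Transformer multimodal large language model that improves the scalability of vision-language learning. | large language model multimodal
12 | CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography | CameraBench: evaluates visual reasoning in multimodal large language models through photography. | large language model multimodal
13 | Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding | Proposes the Socratic Chart framework, which uses multi-agent cooperation to make MLLMs more robust at SVG chart understanding. | large language model multimodal
14 | SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model | Proposes SlowFastVAD, which combines a fast detector with a RAG-enhanced vision-language model for efficient, interpretable video anomaly detection. | multimodal
15 | XY-Cut++: Advanced Layout Ordering via Hierarchical Mask Mechanism on a Novel Benchmark | Proposes XY-Cut++, which uses a hierarchical mask mechanism to substantially improve document layout ordering. | large language model
16 | DTFSal: Audio-Visual Dynamic Token Fusion for Video Saliency Prediction | Proposes DTFSal to address multimodal fusion in audio-visual saliency prediction. | multimodal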

🔬 Pillar 2: RL & Architecture (11 papers)

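# | Title | One-line summary | Tags | 🔗
17 | GaussVideoDreamer: 3D Scene Generation with Video Diffusion and Inconsistency-Aware Gaussian Splatting | GaussVideoDreamer: 3D scene generation with video diffusion and inconsistency-aware Gaussian splatting. | dreamer 3D gaussian splatting gaussian splatting
18 | Multimodal Representation Learning Techniques for Comprehensive Facial State Analysis | Proposes a multi-level multimodal face foundation model (MF^2) for comprehensive facial state analysis, improving AU and emotion recognition performance. | representation learning foundation model multimodal
19 | CleanMAP: Distilling Multimodal LLMs for Confidence-Driven Crowdsourced HD Map Updates | CleanMAP: a confidence-driven method for crowdsourced HD map updates based on distilling multimodal LLMs. | distillation large language model multimodal
20 | AgMMU: A Comprehensive Agricultural Multimodal Understanding Benchmark | Proposes AgMMU, an agricultural multimodal understanding benchmark for evaluating and improving vision-language models in the agricultural domain. | distillation multimodal
21 | Perturbed State Space Feature Encoders for Optical Flow with Event Cameras | Proposes Perturbed State Space Feature Encoders (P-SSE) for event-camera optical flow estimation, improving spatiotemporal reasoning. | SSM optical flow spatiotemporal
22 | Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding | Pixel-SAIL: a single-Transformer model for pixel-level understanding that simplifies multimodal large models. | distillation large language model multimodal
23 | InstructEngine: Instruction-driven Text-to-Image Alignment | InstructEngine: an instruction-driven text-to-image alignment framework that improves generation quality. | reinforcement learning RLHF multimodal
24 | Global and Local Mamba Network for Multi-Modality Medical Image Super-Resolution | Proposes the GLMamba network, which uses global and local Mamba modules to improve multi-modality medical image super-resolution. | Mamba state space model
25 | Masked Autoencoder Self Pre-Training for Defect Detection in Microelectronics | Proposes a masked-autoencoder-based self-supervised pre-training method for defect detection in microelectronics. | masked autoencoder MAE
26 | HDC: Hierarchical Distillation for Multi-level Noisy Consistency in Semi-Supervised Fetal Ultrasound Segmentation | Proposes the HDC framework, which uses hierarchical distillation to address noisy consistency in semi-supervised fetal ultrasound image segmentation. | teacher-student distillation
27 | GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents | GUI-R1: a generalist R1-style vision-language-action model for GUI agents. | reinforcement learning large language model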

🔬 Pillar 3: Perception & Semantics (10 papers)

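# | Title | One-line summary | Tags | 🔗
28 | LL-Gaussian: Low-Light Scene Reconstruction and Enhancement via Gaussian Splatting for Novel View Synthesis | Proposes LL-Gaussian, which uses Gaussian splatting for low-light scene reconstruction and novel view synthesis. | 3D gaussian splatting 3DGS gaussian splatting
29 | Foundation Models for Remote Sensing: An Analysis of MLLMs for Object Localization | Analyzes multimodal large language models for object localization in remote sensing, and optimizes prompt engineering and GSD. | scene understanding large language model foundation model
30 | EBAD-Gaussian: Event-driven Bundle Adjusted Deblur Gaussian Splatting | Proposes EBAD-Gaussian, which uses event cameras and Gaussian splatting for high-quality 3D reconstruction of motion-blurred scenes. | 3D gaussian splatting gaussian splatting splatting
31 | FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation | FLOSS: a training-free method that uses single-template classifiers to improve open-vocabulary semantic segmentation. | open-vocabulary open vocabulary
32 | ESCT3D: Efficient and Selectively Controllable Text-Driven 3D Content Generation with Gaussian Splatting | ESCT3D: efficient and controllable text-driven 3D content generation based on Gaussian splatting. | gaussian splatting splatting
33 | MCBlock: Boosting Neural Radiance Field Training Speed by MCTS-based Dynamic-Resolution Ray Sampling | Proposes MCBlock, an MCTS-based dynamic-resolution ray sampling method that accelerates NeRF training. | gaussian splatting splatting NeRF
34 | FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding | FUSION: a method that fully integrates vision-language representations for deep cross-modal understanding. | semantic mapping semantic map large language model
35 | SoccerNet-v3D: Leveraging Sports Broadcast Replays for 3D Scene Understanding | Proposes the SoccerNet-v3D dataset for 3D scene understanding in soccer broadcasts based on multi-view synchronization. | scene understanding
36 | DNF-Avatar: Distilling Neural Fields for Real-time Animatable Avatar Relighting | Proposes DNF-Avatar, which uses knowledge distillation to achieve real-time relightable, animatable avatars. | gaussian splatting splatting
37 | AGO: Adaptive Grounding for Open World 3D Occupancy Prediction | Proposes AGO, which uses adaptive grounding for open-world 3D occupancy prediction. | open-vocabulary open vocabulary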

🔬 Pillar 1: Robot Control (3 papers)

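# | Title | One-line summary | Tags | 🔗
38 | Separate to Collaborate: Dual-Stream Diffusion Model for Coordinated Piano Hand Motion Synthesis | Proposes a dual-stream diffusion model to address the challenge of modeling both hand independence and coordination in two-handed piano motion synthesis. | bi-manual motion synthesis
39 | Digital Staining with Knowledge Distillation: A Unified Framework for Unpaired and Paired-But-Misaligned Data | Proposes a unified knowledge-distillation-based digital staining framework for cell staining with unpaired and paired-but-misaligned data. | WBC distillation
40 | Hearing Anywhere in Any Environment | Proposes the xRIR framework, which uses geometric and acoustic features to predict room impulse responses in arbitrary environments. | sim-to-real PULSE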

🔬 Pillar 4: Generative Motion (2 papers)

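# | Title | One-line summary | Tags | 🔗
41 | HUMOTO: A 4D Dataset of Mocap Human Object Interactions | HUMOTO: a high-quality 4D human-object interaction dataset for research on motion generation and human-machine interaction. | motion generation penetration human-object interaction
42 | REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers | Proposes REPA-E, which uses a representation alignment loss to enable end-to-end training of the VAE with latent diffusion Transformers. | classifier-free guidance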

🔬 Pillar 6: Video Extraction (2 papers)

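# | Title | One-line summary | Tags | 🔗
43 | Improving Multimodal Hateful Meme Detection Exploiting LMM-Generated Knowledge | Uses knowledge from large multimodal models to improve hateful meme detection. | HuMoR multimodal
44 | Efficient 2D to Full 3D Human Pose Uplifting including Joint Rotations | Proposes an efficient 2D-to-3D human pose uplifting model that directly estimates full 3D poses including joint rotations. | human mesh recovery HMR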

🔬 Pillar 8: Physics-based Animation (2 papers)

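# | Title | One-line summary | Tags | 🔗
45 | ST-Booster: An Iterative SpatioTemporal Perception Booster for Vision-and-Language Navigation in Continuous Environments | Proposes ST-Booster, which enhances spatiotemporal perception for vision-and-language navigation in continuous environments. | spatiotemporal VLN
46 | Dual-Path Enhancements in Event-Based Eye Tracking: Augmented Robustness and Adaptive Temporal Modeling | Proposes KnightPupil, which improves the robustness and adaptive temporal modeling of event-camera eye tracking. | spatiotemporal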

🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | One-line summary | Tags | 🔗
47 | H-MoRe: Learning Human-centric Motion Representation for Action Analysis | Proposes H-MoRe, which learns human-centric motion representations for action analysis. | human motion
