cs.CV (2025-04-14)

📊 47 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (16 🔗4) · Pillar 2: RL & Architecture (11 🔗2) · Pillar 3: Perception & Semantics (10 🔗3) · Pillar 1: Robot Control (3 🔗1) · Pillar 4: Generative Motion (2 🔗1) · Pillar 6: Video Extraction (2 🔗1) · Pillar 8: Physics-based Animation (2) · Pillar 7: Motion Retargeting (1)

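🔬 Pillar 9: Embodied Foundation Models (16 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
1	TAMP: Token-Adaptive Layerwise Pruning in Multimodal Large Language Models	TAMP: token-adaptive layerwise pruning for multimodal large language models.	large language model multimodal TAMP
2	COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts	Introduces the COUNTS dataset and the O(OD)2 and OODG benchmarks to evaluate how object detectors and multimodal large models generalize under distribution shifts.	large language model multimodal visual grounding
3	Mavors: Multi-granularity Video Representation for Multimodal Large Language Model	Mavors: a multi-granularity video representation for multimodal large language models that improves long-video understanding.	large language model multimodal
4	Multimodal Long Video Modeling Based on Temporal Dynamic Context	Proposes the TDC model, based on temporal dynamic context, to address information loss in multimodal long-video understanding.	large language model multimodal chain-of-thought	✅
5	InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models	InternVL3: explores advanced training and test-time recipes for open-source multimodal models.	large language model multimodal
6	Enhancing Multi-task Learning Capability of Medical Generalist Foundation Model via Image-centric Multi-annotation Data	Proposes the IMAX dataset to improve the multi-task learning capability of medical generalist foundation models.	large language model foundation model
7	Integrating Vision and Location with Transformers: A Multimodal Deep Learning Framework for Medical Wound Analysis	Proposes a Transformer-based multimodal deep learning framework for classification and localization analysis of medical wound images.	multimodal
8	Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure	Uses vision-language models to summarize multimodal presentations and studies the effect of modalities and structure.	multimodal
9	Relation-Rich Visual Document Generator for Visual Information Extraction	Proposes RIDGE, which uses content-driven layout generation to address information extraction from relation-rich visual documents.	large language model multimodal	✅
10	MIEB: Massive Image Embedding Benchmark	MIEB: a massive image embedding benchmark for comprehensively evaluating image and image-text embedding models.	large language model multimodal	✅
11	The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer	Proposes SAIL, a unified single-Transformer multimodal large language model that improves the scalability of vision-language learning.	large language model multimodal	✅
12	CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography	CameraBench: evaluates visual reasoning in multimodal large language models through photography.	large language model multimodal
13	Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding	Proposes the Socratic Chart framework, which uses multi-agent cooperation to make MLLMs more robust at SVG chart understanding.	large language model multimodal
14	SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model	Proposes SlowFastVAD, which combines a fast detector with a RAG-enhanced vision-language model for efficient, interpretable video anomaly detection.	multimodal
15	XY-Cut++: Advanced Layout Ordering via Hierarchical Mask Mechanism on a Novel Benchmark	Proposes XY-Cut++, which markedly improves document layout ordering through a hierarchical mask mechanism.	large language model
16	DTFSal: Audio-Visual Dynamic Token Fusion for Video Saliency Prediction	Proposes DTFSal to address multimodal fusion in audio-visual saliency prediction.	multimodal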

🔬 Pillar 2: RL & Architecture (11 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
17	GaussVideoDreamer: 3D Scene Generation with Video Diffusion and Inconsistency-Aware Gaussian Splatting	GaussVideoDreamer: 3D scene generation with video diffusion and inconsistency-aware Gaussian splatting.	dreamer 3D gaussian splatting gaussian splatting
18	Multimodal Representation Learning Techniques for Comprehensive Facial State Analysis	Proposes a multi-level multimodal face foundation model (MF^2) for comprehensive facial state analysis, improving AU and emotion recognition.	representation learning foundation model multimodal
19	CleanMAP: Distilling Multimodal LLMs for Confidence-Driven Crowdsourced HD Map Updates	CleanMAP: a confidence-driven method for crowdsourced HD map updates based on multimodal LLM distillation.	distillation large language model multimodal	✅
20	AgMMU: A Comprehensive Agricultural Multimodal Understanding Benchmark	Proposes the AgMMU agricultural multimodal understanding benchmark for evaluating and improving vision-language models in agriculture.	distillation multimodal
21	Perturbed State Space Feature Encoders for Optical Flow with Event Cameras	Proposes Perturbed State Space Feature Encoders (P-SSE) for event-camera optical flow estimation, improving spatiotemporal reasoning.	SSM optical flow spatiotemporal
22	Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding	Pixel-SAIL: a single-Transformer model for pixel-level understanding that simplifies multimodal large models.	distillation large language model multimodal	✅
23	InstructEngine: Instruction-driven Text-to-Image Alignment	InstructEngine: an instruction-driven text-to-image alignment framework that improves generation quality.	reinforcement learning RLHF multimodal
24	Global and Local Mamba Network for Multi-Modality Medical Image Super-Resolution	Proposes the GLMamba network, which uses global and local Mamba modules to improve multi-modality medical image super-resolution.	Mamba state space model
25	Masked Autoencoder Self Pre-Training for Defect Detection in Microelectronics	Proposes a self-supervised pre-training method based on masked autoencoders for defect detection in microelectronics.	masked autoencoder MAE
26	HDC: Hierarchical Distillation for Multi-level Noisy Consistency in Semi-Supervised Fetal Ultrasound Segmentation	Proposes the HDC framework, which uses hierarchical distillation to address noisy consistency in semi-supervised fetal ultrasound segmentation.	teacher-student distillation
27	GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents	GUI-R1: a generalist R1-style vision-language-action model for GUI agents.	reinforcement learning large language model

🔬 Pillar 3: Perception & Semantics (10 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
28	LL-Gaussian: Low-Light Scene Reconstruction and Enhancement via Gaussian Splatting for Novel View Synthesis	Proposes LL-Gaussian, which uses Gaussian splatting for low-light scene reconstruction and novel view synthesis.	3D gaussian splatting 3DGS gaussian splatting
29	Foundation Models for Remote Sensing: An Analysis of MLLMs for Object Localization	Analyzes multimodal large language models for object localization in remote sensing and optimizes prompt engineering and ground sample distance (GSD).	scene understanding large language model foundation model
30	EBAD-Gaussian: Event-driven Bundle Adjusted Deblur Gaussian Splatting	Proposes EBAD-Gaussian, which uses event cameras and Gaussian splatting for high-quality 3D reconstruction of motion-blurred scenes.	3D gaussian splatting gaussian splatting splatting
31	FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation	FLOSS: a training-free method that improves open-vocabulary semantic segmentation using single-template classifiers.	open-vocabulary open vocabulary	✅
32	ESCT3D: Efficient and Selectively Controllable Text-Driven 3D Content Generation with Gaussian Splatting	ESCT3D: efficient and controllable text-driven 3D content generation based on Gaussian splatting.	gaussian splatting splatting
33	MCBlock: Boosting Neural Radiance Field Training Speed by MCTS-based Dynamic-Resolution Ray Sampling	Proposes MCBlock, an MCTS-based dynamic-resolution ray sampling method that accelerates NeRF training.	gaussian splatting splatting NeRF
34	FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding	FUSION: a method that fully integrates vision-language representations for deep cross-modal understanding.	semantic mapping semantic map large language model	✅
35	SoccerNet-v3D: Leveraging Sports Broadcast Replays for 3D Scene Understanding	Proposes the SoccerNet-v3D dataset for 3D scene understanding in soccer broadcasts based on multi-view synchronization.	scene understanding
36	DNF-Avatar: Distilling Neural Fields for Real-time Animatable Avatar Relighting	Proposes DNF-Avatar, which uses knowledge distillation to achieve real-time relightable, animatable avatars.	gaussian splatting splatting
37	AGO: Adaptive Grounding for Open World 3D Occupancy Prediction	Proposes AGO, which achieves open-world 3D occupancy prediction through adaptive grounding.	open-vocabulary open vocabulary	✅

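🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
38	Separate to Collaborate: Dual-Stream Diffusion Model for Coordinated Piano Hand Motion Synthesis	Proposes a dual-stream diffusion model to address the difficulty of modeling both hand independence and coordination in two-handed piano motion synthesis.	bi-manual motion synthesis	✅
39	Digital Staining with Knowledge Distillation: A Unified Framework for Unpaired and Paired-But-Misaligned Data	Proposes a unified knowledge-distillation framework for digital staining that handles cell staining with unpaired and paired-but-misaligned data.	WBC distillation
40	Hearing Anywhere in Any Environment	Proposes the xRIR framework, which uses geometric and acoustic features to predict room impulse responses in arbitrary environments.	sim-to-real PULSE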

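🔬 Pillar 4: Generative Motion (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
41	HUMOTO: A 4D Dataset of Mocap Human Object Interactions	HUMOTO: a high-quality 4D human-object interaction dataset for motion generation and human-computer interaction research.	motion generation penetration human-object interaction	✅
42	REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers	Proposes REPA-E, which enables end-to-end training of the VAE and the latent diffusion Transformer through a representation-alignment loss.	classifier-free guidance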

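🔬 Pillar 6: Video Extraction (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
43	Improving Multimodal Hateful Meme Detection Exploiting LMM-Generated Knowledge	Uses knowledge from large multimodal models to improve hateful meme detection.	HuMoR multimodal	✅
44	Efficient 2D to Full 3D Human Pose Uplifting including Joint Rotations	Proposes an efficient 2D-to-3D human pose uplifting model that directly estimates full 3D poses including joint rotations.	human mesh recovery HMR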

🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
45	ST-Booster: An Iterative SpatioTemporal Perception Booster for Vision-and-Language Navigation in Continuous Environments	Proposes ST-Booster to enhance spatiotemporal perception for vision-and-language navigation in continuous environments.	spatiotemporal VLN
46	Dual-Path Enhancements in Event-Based Eye Tracking: Augmented Robustness and Adaptive Temporal Modeling	Proposes KnightPupil to improve robustness and adaptive temporal modeling in event-camera eye tracking.	spatiotemporal

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
47	H-MoRe: Learning Human-centric Motion Representation for Action Analysis	Proposes H-MoRe, which learns human-centric motion representations for action analysis.	human motion
