cs.CV (2025-04-10)

📊 38 papers in total | 🔗 11 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (12 🔗5) Pillar 9: Embodied Foundation Models (12 🔗2) Pillar 3: Perception & Semantics (7 🔗2) Pillar 8: Physics-based Animation (4 🔗1) Pillar 1: Robot Control (1 🔗1) Pillar 7: Motion Retargeting (1) Pillar 6: Video Extraction (1)

🔬 Pillar 2: RL & Architecture (12 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	MM-IFEngine: Towards Multimodal Instruction Following	Proposes MM-IFEngine for generating high-quality multimodal instruction-following data and builds an evaluation benchmark.	DPO direct preference optimization large language model	✅
2	ContrastiveGaussian: High-Fidelity 3D Generation with Contrastive Learning and Gaussian Splatting	ContrastiveGaussian: uses contrastive learning and Gaussian splatting for high-fidelity 3D generation	contrastive learning distillation gaussian splatting
3	Leveraging LLMs for Multimodal Retrieval-Augmented Radiology Report Generation via Key Phrase Extraction	Proposes a retrieval-augmented multimodal LLM method for radiology report generation based on key phrase extraction	contrastive learning large language model multimodal
4	GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation	GLUS: an MLLM that unifies global and local reasoning for video segmentation, setting a new SOTA on RefVOS	contrastive learning large language model	✅
5	Benchmarking Image Embeddings for E-Commerce: Evaluating Off-the Shelf Foundation Models, Fine-Tuning Strategies and Practical Trade-offs	E-commerce image embedding benchmark: evaluates pretrained models, fine-tuning strategies, and practical trade-offs	contrastive learning foundation model
6	Perception-R1: Pioneering Perception Policy with Reinforcement Learning	Perception-R1: uses reinforcement learning to improve the perception policy of multimodal large language models, significantly improving performance on visual perception tasks.	reinforcement learning policy learning reward design
7	Kimi-VL Technical Report	Kimi-VL: an efficient open-source MoE vision-language model that excels at long-context understanding and high-resolution visual input	reinforcement learning multimodal chain-of-thought	✅
8	Heart Failure Prediction using Modal Decomposition and Masked Autoencoders for Scarce Echocardiography Databases	Proposes a heart failure prediction method based on modal decomposition and masked autoencoders, suited to scarce echocardiography databases.	masked autoencoder MAE	✅
9	BoxDreamer: Dreaming Box Corners for Generalizable Object Pose Estimation	BoxDreamer: generalizable object pose estimation by predicting object bounding-box corners	dreamer
10	SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement	ThinkLite-VL: uses MCTS-guided sample selection for data-efficient self-improvement in visual reasoning	distillation multimodal
11	VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model	VLM-R1: a stable and generalizable large vision-language model built on rule-based rewards	reinforcement learning large language model	✅
12	DGFamba: Learning Flow Factorized State Space for Visual Domain Generalization	Proposes DG-Famba, which learns domain-generalizable visual representations through a flow-factorized state space	Mamba state space model

🔬 Pillar 9: Embodied Foundation Models (12 papers)

#	Title	One-line summary	Tags	🔗	⭐
13	VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning	VCR-Bench: a comprehensive evaluation framework for video chain-of-thought reasoning	large language model chain-of-thought
14	VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding	VideoExpert: an augmented LLM for temporal-sensitive video understanding that addresses bias in timestamp generation.	large language model multimodal instruction following
15	Scaling Laws for Native Multimodal Models	A study of scaling laws for native multimodal models: early-fusion architectures hold the advantage	multimodal
16	MARS: a Multimodal Alignment and Ranking System for Few-Shot Segmentation	MARS: a multimodal alignment and ranking system that improves few-shot segmentation performance	multimodal
17	AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations	Proposes the AerialVG dataset and model to address spatial-relation reasoning in aerial visual grounding	visual grounding
18	A Multicore and Edge TPU-Accelerated Multimodal TinyML System for Livestock Behavior Recognition	Proposes a multimodal TinyML system for livestock behavior recognition, accelerated by multicore processing and an Edge TPU	multimodal
19	POEM: Precise Object-level Editing via MLLM control	Proposes POEM, which uses an MLLM for precise object-level image editing	large language model multimodal
20	FMNV: A Dataset of Media-Published News Videos for Fake News Detection	Builds the FMNV dataset for fake news detection in media-published news videos	large language model multimodal	✅
21	Gen3DEval: Using vLLMs for Automatic Evaluation of Generated 3D Objects	Proposes Gen3DEval to address shortcomings in the evaluation of generated 3D objects	large language model	✅
22	Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding	Proposes ProVideLLM, a memory-efficient streaming VideoLLM framework for real-time procedural video understanding.	multimodal
23	ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness	ColorBench: a comprehensive benchmark for evaluating vision-language models on color perception, reasoning, and robustness	multimodal
24	FakeIDet: Exploring Patches for Privacy-Preserving Fake ID Detection	Proposes FakeIDet to address privacy preservation in fake ID detection	foundation model

🔬 Pillar 3: Perception & Semantics (7 papers)

#	Title	One-line summary	Tags	🔗	⭐
25	View-Dependent Uncertainty Estimation of 3D Gaussian Splatting	Proposes a view-dependent uncertainty estimation method that improves the performance of 3D Gaussian splatting in downstream tasks	3D gaussian splatting 3DGS gaussian splatting
26	ZS-VCOS: Zero-Shot Video Camouflaged Object Segmentation By Optical Flow and Open Vocabulary Object Detection	Proposes ZS-VCOS, which performs zero-shot video camouflaged object segmentation using optical flow and open-vocabulary object detection	open-vocabulary open vocabulary optical flow	✅
27	RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Chest X-ray with Zero-Shot Multi-Task Capability	RadZero: similarity-based cross-attention for explainable vision-language alignment in chest X-rays, with zero-shot multi-task capability	open-vocabulary open vocabulary large language model	✅
28	Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction	Geo4D: uses video generation models for geometric 4D reconstruction of dynamic scenes	depth estimation scene reconstruction
29	InteractAvatar: Modeling Hand-Face Interaction in Photorealistic Avatars with Deformable Gaussians	InteractAvatar: proposes a method for modeling hand-face interaction in photorealistic avatars based on deformable Gaussians	3D gaussian splatting gaussian splatting splatting
30	Extending Visual Dynamics for Video-to-Music Generation	Proposes the DyViM framework, which improves video-to-music generation by enhancing visual dynamics modeling.	optical flow
31	DGOcc: Depth-aware Global Query-based Network for Monocular 3D Occupancy Prediction	DGOcc: a depth-aware global query-based network for monocular 3D occupancy prediction	scene understanding

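🔬 Pillar 8: Physics-based Animation (4 papers)

#	Title	One-line summary	Tags	🔗	⭐
32	How Can Objects Help Video-Language Understanding?	ObjectMLLM: improves video-language understanding through explicit object information	spatiotemporal large language model multimodal	✅
33	STeP: A Framework for Solving Scientific Video Inverse Problems with Spatiotemporal Diffusion Priors	STeP: a framework for solving scientific video inverse problems with spatiotemporal diffusion priors	spatiotemporal
34	SRVP: Strong Recollection Video Prediction Model Using Attention-Based Spatiotemporal Correlation Fusion	Proposes a strong recollection video prediction model based on attention-based spatiotemporal correlation fusion, improving prediction quality.	spatiotemporal
35	SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding	Proposes self-supervised fragment fine-tuning (SF²T) to improve the fine-grained video understanding of Video-LLMs	spatiotemporal large language model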

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
36	Novel Diffusion Models for Multimodal 3D Hand Trajectory Prediction	Proposes MMTwin, a novel diffusion model for multimodal 3D hand trajectory prediction.	manipulation Mamba motion diffusion	✅

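🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
37	Marmot: Object-Level Self-Correction via Multi-Agent Reasoning	Marmot: proposes an object-level self-correction framework based on multi-agent reasoning that improves the accuracy of image generation for multi-object scenes.	spatial relationship large language model multimodal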

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
38	SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos	SAMJAM: a zero-shot video scene graph generation method for egocentric kitchen videos	egocentric foundation model
