cs.CV (2025-04-10)

📊 38 papers in total | 🔗 11 with code

🎯 Interest Area Navigation

Pillar 2: RL Algorithms & Architecture (12 🔗5) · Pillar 9: Embodied Foundation Models (12 🔗2) · Pillar 3: Spatial Perception & Semantics (7 🔗2) · Pillar 8: Physics-based Animation (4 🔗1) · Pillar 1: Robot Control (1 🔗1) · Pillar 7: Motion Retargeting (1) · Pillar 6: Video Extraction & Matching (1)

🔬 Pillar 2: RL Algorithms & Architecture (12 papers)

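| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | MM-IFEngine: Towards Multimodal Instruction Following | Proposes MM-IFEngine for generating high-quality multimodal instruction-following data, and builds an evaluation benchmark. | DPO, direct preference optimization, large language model | |
| 2 | ContrastiveGaussian: High-Fidelity 3D Generation with Contrastive Learning and Gaussian Splatting | ContrastiveGaussian: high-fidelity 3D generation using contrastive learning and Gaussian splatting. | contrastive learning, distillation, gaussian splatting | |
| 3 | Leveraging LLMs for Multimodal Retrieval-Augmented Radiology Report Generation via Key Phrase Extraction | Proposes a retrieval-augmented multimodal LLM method for radiology report generation based on key phrase extraction. | contrastive learning, large language model, multimodal | |
| 4 | GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation | GLUS: an MLLM that unifies global and local reasoning for video segmentation, setting a new SOTA on RefVOS. | contrastive learning, large language model | |
| 5 | Benchmarking Image Embeddings for E-Commerce: Evaluating Off-the Shelf Foundation Models, Fine-Tuning Strategies and Practical Trade-offs | E-commerce image embedding benchmark: evaluates pretrained models, fine-tuning strategies, and practical trade-offs. | contrastive learning, foundation model | |
| 6 | Perception-R1: Pioneering Perception Policy with Reinforcement Learning | Perception-R1: uses reinforcement learning to improve the perception policy of multimodal large language models, significantly improving performance on visual perception tasks. | reinforcement learning, policy learning, reward design | |
| 7 | Kimi-VL Technical Report | Kimi-VL: an efficient open-source MoE vision-language model that excels at long-text understanding and high-resolution visual input. | reinforcement learning, multimodal, chain-of-thought | |
| 8 | Heart Failure Prediction using Modal Decomposition and Masked Autoencoders for Scarce Echocardiography Databases | Proposes a heart failure prediction method based on modal decomposition and masked autoencoders, suited to scarce echocardiography databases. | masked autoencoder, MAE | |
| 9 | BoxDreamer: Dreaming Box Corners for Generalizable Object Pose Estimation | BoxDreamer: generalizable object pose estimation by predicting the corners of object bounding boxes. | dreamer | |
| 10 | SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement | ThinkLite-VL: uses MCTS-guided sample selection for data-efficient self-improvement in visual reasoning. | distillation, multimodal | |
| 11 | VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model | VLM-R1: a stable and generalizable large vision-language model trained with rule-based rewards. | reinforcement learning, large language model | |
| 12 | DGFamba: Learning Flow Factorized State Space for Visual Domain Generalization | Proposes DG-Famba, which learns domain-generalizable visual representations through a flow-factorized state space. | Mamba, state space model | |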

🔬 Pillar 9: Embodied Foundation Models (12 papers)

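| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 13 | VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning | VCR-Bench: a comprehensive evaluation framework for video chain-of-thought reasoning. | large language model, chain-of-thought | |
| 14 | VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding | VideoExpert: an augmented LLM for temporal-sensitive video understanding that addresses bias in timestamp generation. | large language model, multimodal, instruction following | |
| 15 | Scaling Laws for Native Multimodal Models | A study of scaling laws for native multimodal models: early-fusion architectures hold the advantage. | multimodal | |
| 16 | MARS: a Multimodal Alignment and Ranking System for Few-Shot Segmentation | MARS: a multimodal alignment and ranking system that improves few-shot segmentation performance. | multimodal | |
| 17 | AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations | Proposes the AerialVG dataset and model to tackle spatial-relation reasoning in visual grounding on aerial imagery. | visual grounding | |
| 18 | A Multicore and Edge TPU-Accelerated Multimodal TinyML System for Livestock Behavior Recognition | Proposes a multimodal TinyML system for livestock behavior recognition, accelerated by multicore processing and an Edge TPU. | multimodal | |
| 19 | POEM: Precise Object-level Editing via MLLM control | Proposes POEM, which uses an MLLM to perform precise object-level image editing. | large language model, multimodal | |
| 20 | FMNV: A Dataset of Media-Published News Videos for Fake News Detection | Builds the FMNV dataset to address fake news detection in media-published news videos. | large language model, multimodal | |
| 21 | Gen3DEval: Using vLLMs for Automatic Evaluation of Generated 3D Objects | Proposes Gen3DEval to address the lack of adequate evaluation for 3D object generation. | large language model | |
| 22 | Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding | Proposes ProVideLLM, a memory-efficient streaming VideoLLM framework for real-time procedural video understanding. | multimodal | |
| 23 | ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness | ColorBench: builds a comprehensive benchmark that evaluates vision-language models on color perception, reasoning, and robustness. | multimodal | |
| 24 | FakeIDet: Exploring Patches for Privacy-Preserving Fake ID Detection | Proposes FakeIDet to address privacy preservation in fake ID detection. | foundation model | |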

🔬 Pillar 3: Spatial Perception & Semantics (7 papers)

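| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 25 | View-Dependent Uncertainty Estimation of 3D Gaussian Splatting | Proposes a view-dependent uncertainty estimation method that improves the performance of 3D Gaussian splatting in downstream tasks. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 26 | ZS-VCOS: Zero-Shot Video Camouflaged Object Segmentation By Optical Flow and Open Vocabulary Object Detection | Proposes ZS-VCOS, which uses optical flow and open-vocabulary object detection for zero-shot video camouflaged object segmentation. | open-vocabulary, open vocabulary, optical flow | |
| 27 | RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Chest X-ray with Zero-Shot Multi-Task Capability | RadZero: similarity-based cross-attention for explainable vision-language alignment in chest X-rays, with zero-shot multi-task capability. | open-vocabulary, open vocabulary, large language model | |
| 28 | Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction | Geo4D: uses video generation models for geometric 4D reconstruction of dynamic scenes. | depth estimation, scene reconstruction | |
| 29 | InteractAvatar: Modeling Hand-Face Interaction in Photorealistic Avatars with Deformable Gaussians | InteractAvatar: proposes a method for modeling hand-face interaction in photorealistic avatars based on deformable Gaussians. | 3D gaussian splatting, gaussian splatting, splatting | |
| 30 | Extending Visual Dynamics for Video-to-Music Generation | Proposes the DyViM framework, which improves video-to-music generation by strengthening visual dynamics modeling. | optical flow | |
| 31 | DGOcc: Depth-aware Global Query-based Network for Monocular 3D Occupancy Prediction | DGOcc: a depth-aware global query-based network for monocular 3D occupancy prediction. | scene understanding | |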

🔬 Pillar 8: Physics-based Animation (4 papers)

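| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 32 | How Can Objects Help Video-Language Understanding? | ObjectMLLM: improves video-language understanding through explicit object information. | spatiotemporal, large language model, multimodal | |
| 33 | STeP: A Framework for Solving Scientific Video Inverse Problems with Spatiotemporal Diffusion Priors | STeP: a framework that uses spatiotemporal diffusion priors to solve scientific video inverse problems. | spatiotemporal | |
| 34 | SRVP: Strong Recollection Video Prediction Model Using Attention-Based Spatiotemporal Correlation Fusion | Proposes a strong recollection video prediction model based on attention-based spatiotemporal correlation fusion, improving prediction quality. | spatiotemporal | |
| 35 | SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding | Proposes self-supervised fragment finetuning (SF²T) to improve the fine-grained video understanding of Video-LLMs. | spatiotemporal, large language model | |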

🔬 Pillar 1: Robot Control (1 paper)

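| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 36 | Novel Diffusion Models for Multimodal 3D Hand Trajectory Prediction | Proposes MMTwin, a novel diffusion model for multimodal 3D hand trajectory prediction. | manipulation, Mamba, motion diffusion | |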

🔬 Pillar 7: Motion Retargeting (1 paper)

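| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 37 | Marmot: Object-Level Self-Correction via Multi-Agent Reasoning | Marmot: proposes an object-level self-correction framework based on multi-agent reasoning, improving the accuracy of image generation for multi-object scenes. | spatial relationship, large language model, multimodal | |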

🔬 Pillar 6: Video Extraction & Matching (1 paper)

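| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 38 | SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos | SAMJAM: a zero-shot video scene graph generation method for egocentric kitchen videos. | egocentric, foundation model | |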
