cs.CV(2025-04-30)

📊 34 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (16 🔗5)
Pillar 2: RL & Architecture (5)
Pillar 1: Robot Control (4 🔗1)
Pillar 3: Perception & Semantics (3 🔗1)
Pillar 6: Video Extraction (2)
Pillar 8: Physics-based Animation (2)
Pillar 4: Generative Motion (2 🔗1)

🔬 Pillar 9: Embodied Foundation Models (16 papers)

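# | Title | One-sentence summary | Tags | 🔗
1 | Multimodal Language Models See Better When They Look Shallower | Proposes a visual layer selection strategy to improve the performance of multimodal large language models. | large language model, multimodal
2 | UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation | UniBiomed: a universal foundation model for interpretable biomedical image analysis. | large language model, foundation model
3 | CMD: Constraining Multimodal Distribution for Domain Adaptation in Stereo Matching | Proposes CMD, which constrains the multimodal (multi-peaked) distribution in domain adaptation for stereo matching. | multimodal
4 | GarmentDiffusion: 3D Garment Sewing Pattern Generation with Multimodal Diffusion Transformers | GarmentDiffusion: 3D garment sewing pattern generation based on multimodal diffusion Transformers. | multimodal
5 | Localizing Before Answering: A Hallucination Evaluation Benchmark for Grounded Medical Multimodal LLMs | Proposes the HEAL-MedVQA benchmark and the LobA framework to improve the localization ability and hallucination robustness of medical multimodal LLMs. | multimodal
6 | A Survey of Interactive Generative Video | Surveys interactive generative video, proposes a general framework with five modules, and analyzes future directions. | embodied AI, multimodal
7 | COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning | Proposes COMPACT, which composes atomic visual capabilities for efficient fine-tuning of multimodal large models. | large language model, multimodal
8 | Why Compress What You Can Generate? When GPT-4o Generation Ushers in Image Compression Fields | Uses GPT-4o's generative ability to explore AIGC for ultra-low-bitrate image compression, reporting strong performance. | foundation model, multimodal
9 | Diff-Prompt: Diffusion-Driven Prompt Generator with Mask Supervision | Diff-Prompt: uses a diffusion model with mask supervision to generate fine-grained prompts, improving multimodal model performance on complex tasks. | foundation model, multimodal
10 | Simple Visual Artifact Detection in Sora-Generated Videos | Proposes a multi-label classification framework for detecting visual artifacts in Sora-generated videos. | large language model, multimodal
11 | Zoomer: Adaptive Image Focus Optimization for Black-box MLLM | Zoomer: an adaptive image focus optimization framework for black-box MLLMs that improves small-object recognition. | large language model, multimodal
12 | DOPE: Dual Object Perception-Enhancement Network for Vision-and-Language Navigation | Proposes the DOPE network to enhance the agent's object perception in vision-and-language navigation. | VLN
13 | SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding | Proposes SeriesBench to evaluate multimodal large language models on narrative-driven drama series understanding. | large language model
14 | Responsive DNN Adaptation for Video Analytics against Environment Shift via Hierarchical Mobile-Cloud Collaborations | MOCHA: a responsive, hierarchical mobile-cloud collaborative DNN adaptation framework for video analytics under environment shift. | foundation model
15 | Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space | Nexus-Gen: unified image understanding, generation, and editing via prefilled autoregression in a shared embedding space. | multimodal
16 | CoCoDiff: Diversifying Skeleton Action Features via Coarse-Fine Text-Co-Guided Latent Diffusion | CoCoDiff: diversifies skeleton action recognition features with a coarse-fine text-co-guided latent diffusion model. | large language model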

🔬 Pillar 2: RL & Architecture (5 papers)

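# | Title | One-sentence summary | Tags | 🔗
17 | eNCApsulate: NCA for Precision Diagnosis on Capsule Endoscopes | Proposes eNCApsulate, which uses neural cellular automata for precision diagnosis on capsule endoscopes. | distillation, visual odometry, depth estimation
18 | Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning | Proposes SPORT, which iteratively explores tool-usage strategies for multimodal agents via step-wise preference tuning. | reinforcement learning, multimodal
19 | Mamba Based Feature Extraction And Adaptive Multilevel Feature Fusion For 3D Tumor Segmentation From Multi-modal Medical Image | Proposes a Mamba-based tumor segmentation method for multimodal medical images, improving 3D tumor segmentation accuracy. | Mamba, representation learning
20 | Early Exit and Multi Stage Knowledge Distillation in VLMs for Video Summarization | Proposes DEEVISum, a lightweight VLM for video summarization based on multi-stage knowledge distillation and early exit. | distillation
21 | CAE-DFKD: Bridging the Transferability Gap in Data-Free Knowledge Distillation | Proposes CAE-DFKD to improve the transferability of representations in data-free knowledge distillation. | distillation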

🔬 Pillar 1: Robot Control (4 papers)

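# | Title | One-sentence summary | Tags | 🔗
22 | Visual Text Processing: A Comprehensive Review and Unified Evaluation | For visual text processing, proposes the VTPBench benchmark and the VTPScore evaluation metric to promote effective use of text features. | manipulation, large language model, foundation model
23 | Consistency-aware Fake Videos Detection on Short Video Platforms | Proposes a consistency-aware fake video detection method that exploits cross-modal contradictions to improve fake news detection accuracy on short video platforms. | manipulation, large language model, multimodal
24 | Diffusion-based Adversarial Identity Manipulation for Facial Privacy Protection | Proposes DiffAIM, a diffusion-based adversarial identity manipulation method for facial privacy protection. | manipulation
25 | Combating Falsification of Speech Videos with Live Optical Signatures (Extended Version) | Proposes VeriLight to address the falsification of speech videos. | manipulation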

🔬 Pillar 3: Perception & Semantics (3 papers)

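# | Title | One-sentence summary | Tags | 🔗
26 | Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection | Mcity Data Engine: iteratively improves intelligent transportation system models through open-vocabulary data selection. | open-vocabulary, open vocabulary
27 | HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene Generation | HoloTime: generates panoramic 4D scenes with video diffusion models to enhance VR/AR experiences. | depth estimation, gaussian splatting, splatting
28 | AnimalMotionCLIP: Embedding motion in CLIP for Animal Behavior Analysis | AnimalMotionCLIP: analyzes animal behavior by embedding motion information in CLIP. | optical flow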

🔬 Pillar 6: Video Extraction (2 papers)

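# | Title | One-sentence summary | Tags | 🔗
29 | Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models | Proposes the UnHateMeme framework, which uses vision-language models to detect and mitigate hateful content in multimodal memes. | HuMoR, multimodal
30 | Learning to Borrow Features for Improved Detection of Small Objects in Single-Shot Detectors | Proposes a feature-borrowing framework that improves small-object detection in single-shot detectors. | feature matching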

🔬 Pillar 8: Physics-based Animation (2 papers)

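# | Title | One-sentence summary | Tags | 🔗
31 | Differentiable Room Acoustic Rendering with Multi-View Vision Priors | AV-DAR: differentiable room acoustic rendering with multi-view vision priors. | PULSE, multimodal
32 | Direct Motion Models for Assessing Generated Videos | Proposes a quality assessment method for generated videos based on a trajectory autoencoder, improving the evaluation of motion consistency and realism. | spatiotemporal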

🔬 Pillar 4: Generative Motion (2 papers)

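# | Title | One-sentence summary | Tags | 🔗
33 | ReVision: Refining Video Diffusion with Explicit 3D Motion Modeling | ReVision: refines video diffusion models with explicit 3D motion modeling, improving the quality of complex motion generation. | physically plausible
34 | MagicPortrait: Temporally Consistent Face Reenactment with 3D Geometric Guidance | MagicPortrait: temporally consistent face reenactment with 3D geometric guidance. | motion latent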
