cs.CV (2025-06-28)

📊 17 papers in total | 🔗 4 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (7 🔗2) · Pillar 1: Robot Control (3) · Pillar 2: RL & Architecture (3 🔗1) · Pillar 7: Motion Retargeting (2) · Pillar 3: Perception & Semantics (1 🔗1) · Pillar 6: Video Extraction (1)

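🔬 Pillar 9: Embodied Foundation Models (7 papers)

| # | Title | One-sentence summary | Tags |
|---|-------|----------------------|------|
| 1 | MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models | Proposes the MusiXQA dataset to improve multimodal large language models' understanding of sheet music. | large language model, multimodal |
| 2 | MANTA: Cross-Modal Semantic Alignment and Information-Theoretic Optimization for Long-form Multimodal Understanding | MANTA: long-form multimodal understanding through cross-modal semantic alignment and information-theoretic optimization. | large language model, multimodal |
| 3 | MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering | Proposes MOTOR, a medical visual question answering method based on multimodal optimal transport, to improve clinical relevance. | multimodal |
| 4 | Iterative Zoom-In: Temporal Interval Exploration for Long Video Understanding | Proposes the Temporal Search framework, which iteratively zooms in on temporal intervals to improve MLLM long-video understanding. | large language model, multimodal |
| 5 | Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval | Proposes Mask-aware TIR, which combines text-to-image retrieval with referring expression segmentation to improve retrieval accuracy and interpretability. | large language model, multimodal |
| 6 | ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment | ActAlign: zero-shot fine-grained video classification through language-guided sequence alignment. | large language model |
| 7 | Prompting without Panic: Attribute-aware, Zero-shot, Test-Time Calibration | Proposes an attribute-aware, zero-shot test-time calibration method that addresses confidence calibration in VLM test-time fine-tuning. | large language model |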

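🔬 Pillar 1: Robot Control (3 papers)

| # | Title | One-sentence summary | Tags |
|---|-------|----------------------|------|
| 8 | RoboPearls: Editable Video Simulation for Robot Manipulation | RoboPearls: an editable video simulation framework for robot manipulation that improves data efficiency. | manipulation, sim-to-real, distillation |
| 9 | Towards Explainable Bilingual Multimodal Misinformation Detection and Localization | Proposes the BiMi framework to tackle bilingual multimodal misinformation detection and localization. | manipulation, multimodal |
| 10 | PhonemeFake: Redefining Deepfake Realism with Language-Driven Segmental Manipulation and Adaptive Bilevel Detection | PhonemeFake: improves deepfake realism through language-driven segmental manipulation and adaptive bilevel detection. | manipulation |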

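🔬 Pillar 2: RL & Architecture (3 papers)

| # | Title | One-sentence summary | Tags |
|---|-------|----------------------|------|
| 11 | Unleashing the Multi-View Fusion Potential: Noise Correction in VLM for Open-Vocabulary 3D Scene Understanding | Proposes MVOV3D to address noise in open-vocabulary 3D scene understanding. | contrastive learning, scene understanding, open-vocabulary |
| 12 | LightBSR: Towards Lightweight Blind Super-Resolution via Discriminative Implicit Degradation Representation Learning | LightBSR: lightweight blind super-resolution through discriminative implicit degradation representation learning. | representation learning, contrastive learning, distillation |
| 13 | CLIP-like Model as a Foundational Density Ratio Estimator | Reinterprets CLIP-like models as general-purpose density ratio estimators and applies them to importance weight learning and KL divergence estimation. | contrastive learning, multimodal |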

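🔬 Pillar 7: Motion Retargeting (2 papers)

| # | Title | One-sentence summary | Tags |
|---|-------|----------------------|------|
| 14 | STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing | STR-Match: training-free video editing through spatiotemporal relevance matching. | latent optimization, spatiotemporal |
| 15 | MagShield: Towards Better Robustness in Sparse Inertial Motion Capture Under Magnetic Disturbances | MagShield: improves the robustness of sparse inertial motion capture under magnetic disturbances. | human motion |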

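🔬 Pillar 3: Perception & Semantics (1 paper)

| # | Title | One-sentence summary | Tags |
|---|-------|----------------------|------|
| 16 | RGE-GS: Reward-Guided Expansive Driving Scene Reconstruction via Diffusion Priors | RGE-GS: proposes a reward-guided method for expansive driving scene reconstruction using diffusion priors. | 3D gaussian splatting, 3DGS, gaussian splatting |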

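🔬 Pillar 6: Video Extraction (1 paper)

| # | Title | One-sentence summary | Tags |
|---|-------|----------------------|------|
| 17 | Single-Frame Point-Pixel Registration via Supervised Cross-Modal Feature Matching | Proposes a single-frame point cloud-to-pixel registration method based on supervised cross-modal feature matching. | feature matching |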
