cs.CV (2025-07-20)

📊 17 papers in total | 🔗 3 with code

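🎯 Browse by interest area

Pillar 9: Embodied Large Models (Embodied Foundation Models) (8 🔗1) · Pillar 3: Spatial Perception and Semantics (Perception & Semantics) (4) · Pillar 2: RL Algorithms and Architecture (RL & Architecture) (3 🔗2) · Pillar 6: Video Extraction and Matching (Video Extraction) (1) · Pillar 8: Physics-based Animation (1)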

🔬 Pillar 9: Embodied Large Models (Embodied Foundation Models) (8 papers)

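| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression | Proposes the RvTC framework, which uses data-specific prompts to improve multimodal large language models on image-based regression tasks. | large language model, multimodal | |
| 2 | Med-GRIM: Enhanced Zero-Shot Medical VQA using prompt-embedded Multimodal Graph RAG | Med-GRIM: enhances zero-shot medical VQA with prompt-embedded multimodal graph RAG. | large language model, multimodal | |
| 3 | Light Future: Multimodal Action Frame Prediction via InstructPix2Pix | Proposes a lightweight multimodal action frame prediction method based on InstructPix2Pix for robotic tasks. | multimodal | |
| 4 | TriCLIP-3D: A Unified Parameter-Efficient Framework for Tri-Modal 3D Visual Grounding based on CLIP | TriCLIP-3D: a unified, parameter-efficient, CLIP-based framework for tri-modal 3D visual grounding. | visual grounding | |
| 5 | BleedOrigin: Dynamic Bleeding Source Localization in Endoscopic Submucosal Dissection via Dual-Stage Detection and Tracking | BleedOrigin-Net: a dual-stage detection and tracking framework for dynamic bleeding source localization during endoscopic submucosal dissection. | large language model, multimodal | |
| 6 | LeAdQA: LLM-Driven Context-Aware Temporal Grounding for Video Question Answering | LeAdQA: addresses video question answering with LLM-driven, context-aware temporal grounding. | multimodal, visual grounding | |
| 7 | Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding | Proposes Video-TT, a holistic benchmark for evaluating the advanced reasoning and understanding abilities of video LLMs. | large language model | |
| 8 | Grounding Degradations in Natural Language for All-In-One Video Restoration | Proposes an end-to-end video restoration framework guided by natural-language semantics that needs no prior knowledge of the degradation type. | foundation model | |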

🔬 Pillar 3: Spatial Perception and Semantics (Perception & Semantics) (4 papers)

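| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 9 | Stereo-GS: Multi-View Stereo Vision Model for Generalizable 3D Gaussian Splatting Reconstruction | Proposes Stereo-GS for generalizable 3D Gaussian splatting reconstruction based on multi-view stereo vision. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 10 | Region-aware Depth Scale Adaptation with Sparse Measurements | Proposes a region-aware depth scale adaptation method that uses sparse measurements to improve monocular depth estimation accuracy. | depth estimation, monocular depth, metric depth | |
| 11 | An Evaluation of DUSt3R/MASt3R/VGGT 3D Reconstruction on Photogrammetric Aerial Blocks | Evaluates DUSt3R, MASt3R, and VGGT on 3D reconstruction of photogrammetric aerial image blocks. | VGGT | |
| 12 | Training Self-Supervised Depth Completion Using Sparse Measurements and a Single Image | Proposes a method for training self-supervised depth completion using only sparse depth measurements and a single image. | depth estimation, foundation model | |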

🔬 Pillar 2: RL Algorithms and Architecture (RL & Architecture) (3 papers)

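| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 13 | U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning with MLLMs | U-MARVEL: identifies the key factors for universal multimodal retrieval through embedding learning with MLLMs. | contrastive learning, distillation, multimodal | |
| 14 | Open-set Cross Modal Generalization via Multimodal Unified Representation | Proposes the MICU model to address open-set cross-modal generalization and improve the generalization of multimodal unified representations. | contrastive learning, multimodal | |
| 15 | Semantic-Aware Representation Learning via Conditional Transport for Multi-Label Image Classification | Proposes the SCT model, which addresses multi-label image classification through semantic-aware representation learning via conditional transport. | representation learning | |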

🔬 Pillar 6: Video Extraction and Matching (Video Extraction) (1 paper)

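| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 16 | Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction | VideoPlan: improves visual planning with auxiliary tasks and multi-token prediction. | egocentric, Ego4D, large language model | |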

🔬 Pillar 8: Physics-based Animation (1 paper)

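| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 17 | Event-based Graph Representation with Spatial and Motion Vectors for Asynchronous Object Detection | Proposes an event-based graph representation with spatial and motion vectors for asynchronous object detection. | spatiotemporal | |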
