cs.CV (2025-04-06)
📊 16 papers in total | 🔗 2 with code
🎯 Browse by Interest Area
Pillar 9: Embodied Foundation Models (6 🔗2)
Pillar 3: Spatial Perception & Semantics (4)
Pillar 2: RL Algorithms & Architecture (3)
Pillar 6: Video Extraction & Matching (1)
Pillar 7: Motion Retargeting (1)
Pillar 1: Robot Control (1)
🔬 Pillar 9: Embodied Foundation Models (6 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | Multimodal Lengthy Videos Retrieval Framework and Evaluation Metric | Proposes a multimodal long-video retrieval framework and evaluation metric to improve retrieval accuracy in complex scenarios. | multimodal | | |
| 2 | Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models | Proposes a multimodal cinematic video synthesis method based on text-to-image and audio generation models. | multimodal | | |
| 3 | Enhance Then Search: An Augmentation-Search Strategy with Foundation Models for Cross-Domain Few-Shot Object Detection | Proposes a CD-FSOD method based on an augmentation-search strategy, improving foundation-model performance on cross-domain few-shot object detection. | foundation model | ✅ | |
| 4 | UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding | UniToken: harmonizes multimodal understanding and generation through unified visual encoding. | multimodal | ✅ | |
| 5 | VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT | VideoAgent2: enhances the long-video understanding of LLM agents through uncertainty-aware CoT. | large language model, chain-of-thought | | |
| 6 | Domain Generalization for Face Anti-spoofing via Content-aware Composite Prompt Engineering | Proposes content-aware composite prompt engineering to address cross-domain generalization in face anti-spoofing. | large language model | | |
🔬 Pillar 3: Spatial Perception & Semantics (4 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 7 | Targetless LiDAR-Camera Calibration with Neural Gaussian Splatting | Proposes a targetless LiDAR-camera joint calibration method based on neural Gaussian splatting. | gaussian splatting, splatting | | |
| 8 | FluentLip: A Phonemes-Based Two-stage Approach for Audio-Driven Lip Synthesis with Optical Flow Consistency | FluentLip proposes a phoneme-based two-stage lip synthesis method that improves fluency and intelligibility. | optical flow, multimodal | | |
| 9 | Thermoxels: a voxel-based method to generate simulation-ready 3D thermal models | Proposes Thermoxels, a voxel-based method for generating 3D thermal models, aimed at building energy retrofits. | gaussian splatting, splatting, NeRF | | |
| 10 | VSLAM-LAB: A Comprehensive Framework for Visual SLAM Methods and Datasets | VSLAM-LAB: a unified VSLAM framework that simplifies development, evaluation, and deployment. | visual SLAM | | |
🔬 Pillar 2: RL Algorithms & Architecture (3 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 11 | M$^2$IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering | Proposes M$^2$IV, which achieves efficient, fine-grained multimodal in-context learning through representation engineering. | representation learning, distillation, multimodal | | |
| 12 | AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection | Proposes AVadCLIP, which uses audio-visual collaboration to make video anomaly detection more robust. | representation learning, distillation, multimodal | | |
| 13 | NCL-CIR: Noise-aware Contrastive Learning for Composed Image Retrieval | Proposes NCL-CIR, which addresses noise in composed image retrieval through noise-aware contrastive learning. | contrastive learning | | |
🔬 Pillar 6: Video Extraction & Matching (1 paper)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 14 | Advancing Egocentric Video Question Answering with Multimodal Large Language Models | Uses multimodal large language models to improve egocentric video question answering performance. | egocentric, Ego4D, large language model | | |
🔬 Pillar 7: Motion Retargeting (1 paper)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 15 | The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models? | Evaluates how much point clouds improve the spatial reasoning of large language models, revealing the limitations of 3D LLMs. | spatial relationship, large language model, foundation model | | |
🔬 Pillar 1: Robot Control (1 paper)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 16 | PRISM: Probabilistic Representation for Integrated Shape Modeling and Generation | PRISM: proposes a probabilistic representation for integrated shape modeling and generation. | manipulation, SSM | | |