cs.CV (2025-04-06)

📊 16 papers in total | 🔗 2 with code

🎯 Topic Navigation

Pillar 9: Embodied Foundation Models (6, 🔗 2)
Pillar 3: Perception & Semantics (4)
Pillar 2: RL & Architecture (3)
Pillar 6: Video Extraction (1)
Pillar 7: Motion Retargeting (1)
Pillar 1: Robot Control (1)

🔬 Pillar 9: Embodied Foundation Models (6 papers)

#	Title	One-line Summary	Tags	🔗	⭐
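1	Multimodal Lengthy Videos Retrieval Framework and Evaluation Metric	Proposes a multimodal retrieval framework and evaluation metric for lengthy videos, improving retrieval accuracy in complex scenarios.	multimodal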
2	Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models	Proposes a multimodal cinematic video synthesis method built on text-to-image and audio generation models.	multimodal
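3	Enhance Then Search: An Augmentation-Search Strategy with Foundation Models for Cross-Domain Few-Shot Object Detection	Proposes a CD-FSOD method based on an augmentation-search strategy, improving foundation-model performance on cross-domain few-shot object detection.	foundation model	✅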
4	UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding	UniToken: unifies multimodal understanding and generation through a single visual encoding.	multimodal	✅
5	VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT	VideoAgent2: improves long-form video understanding of an LLM-based agent via uncertainty-aware CoT.	large language model, chain-of-thought
6	Domain Generalization for Face Anti-spoofing via Content-aware Composite Prompt Engineering	Proposes content-aware composite prompt engineering to address cross-domain generalization in face anti-spoofing.	large language model

🔬 Pillar 3: Perception & Semantics (4 papers)

#	Title	One-line Summary	Tags	🔗	⭐
7	Targetless LiDAR-Camera Calibration with Neural Gaussian Splatting	Proposes a targetless LiDAR-camera calibration method based on neural Gaussian splatting.	gaussian splatting, splatting
8	FluentLip: A Phonemes-Based Two-stage Approach for Audio-Driven Lip Synthesis with Optical Flow Consistency	FluentLip proposes a phoneme-based two-stage lip synthesis method that improves fluency and intelligibility.	optical flow, multimodal
9	Thermoxels: a voxel-based method to generate simulation-ready 3D thermal models	Proposes Thermoxels, a voxel-based method for generating 3D thermal models, aimed at energy-efficiency retrofitting of buildings.	gaussian splatting, splatting, NeRF
10	VSLAM-LAB: A Comprehensive Framework for Visual SLAM Methods and Datasets	VSLAM-LAB: a unified VSLAM framework that streamlines development, evaluation, and deployment.	visual SLAM

🔬 Pillar 2: RL & Architecture (3 papers)

#	Title	One-line Summary	Tags	🔗	⭐
11	M$^2$IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering	Proposes M$^2$IV, which achieves efficient, fine-grained multimodal in-context learning through representation engineering.	representation learning, distillation, multimodal
12	AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection	Proposes AVadCLIP, which uses audio-visual collaboration to make video anomaly detection more robust.	representation learning, distillation, multimodal
13	NCL-CIR: Noise-aware Contrastive Learning for Composed Image Retrieval	Proposes NCL-CIR, which tackles noisy training data in composed image retrieval via noise-aware contrastive learning.	contrastive learning

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-line Summary	Tags	🔗	⭐
14	Advancing Egocentric Video Question Answering with Multimodal Large Language Models	Uses multimodal large language models to improve egocentric video question answering.	egocentric, Ego4D, large language model

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-line Summary	Tags	🔗	⭐
15	The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models?	Evaluates whether point clouds improve the spatial reasoning of large language models, revealing limitations of 3D LLMs.	spatial relationship, large language model, foundation model

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-line Summary	Tags	🔗	⭐
16	PRISM: Probabilistic Representation for Integrated Shape Modeling and Generation	PRISM: proposes a probabilistic representation for integrated shape modeling and generation.	manipulation, SSM
