cs.CV (2024-06-19)

📊 29 papers in total | 🔗 11 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (12 🔗3) · Pillar 3: Spatial Perception & Semantics (7 🔗3) · Pillar 2: RL Algorithms & Architecture (4 🔗2) · Pillar 1: Robot Control (3 🔗2) · Pillar 6: Video Extraction & Matching (2 🔗1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (12 papers)

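# | Title | One-sentence summary | Tags | 🔗
1 | Using Multimodal Large Language Models for Automated Detection of Traffic Safety Critical Events | Uses multimodal large language models to automatically detect traffic safety-critical events. | large language model, multimodal |
2 | Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models | Proposes a Theory of Mind (ToM) reasoning framework based on multimodal video large language models. | large language model, multimodal |
3 | MC-MKE: A Fine-Grained Multimodal Knowledge Editing Benchmark Emphasizing Modality Consistency | MC-MKE: proposes a fine-grained multimodal knowledge editing benchmark that emphasizes modality consistency, for evaluating and correcting errors in MLLMs. | large language model, multimodal |
4 | Biomedical Visual Instruction Tuning with Clinician Preference Alignment | BioMed-VITAL: biomedical visual instruction tuning with clinician preference alignment. | foundation model, multimodal, instruction following |
5 | VisualRWKV: Exploring Recurrent Neural Networks for Visual Language Models | Proposes VisualRWKV, which applies linear RNNs to vision-language models for efficient multimodal learning. | large language model, multimodal |
6 | GUI Action Narrator: Where and When Did That Action Take Place? | Proposes the GUI Narrator framework and the Act2Cap dataset to improve multimodal LLM performance on GUI action video understanding. | multimodal |
7 | IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning | IntCoOp: an interpretable vision-language prompt tuning method that improves image-text alignment. | zero-shot transfer |
8 | SpatialBot: Precise Spatial Understanding with Vision Language Models | SpatialBot: precise spatial understanding with vision-language models. | embodied AI |
9 | Semantic Enhanced Few-shot Object Detection | Proposes a semantically enhanced few-shot object detection framework that improves detection of novel classes. | multimodal |
10 | SituationalLLM: Proactive language models with scene awareness for dynamic, contextual task guidance | SituationalLLM: proposes a proactive language model with scene awareness for dynamic, contextual task guidance. | large language model |
11 | Neural Residual Diffusion Models for Deep Scalable Vision Generation | Proposes Neural Residual Diffusion Models (Neural-RDM) to address the scalability of deep vision generation models. | large language model |
12 | Block-level Text Spotting with LLMs | Proposes BTS-LLM, which uses large language models for block-level text localization and recognition in images. | large language model |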

🔬 Pillar 3: Spatial Perception & Semantics (7 papers)

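# | Title | One-sentence summary | Tags | 🔗
13 | PanDA: Towards Panoramic Depth Anything with Unlabeled Panoramas and Mobius Spatial Augmentation | PanDA: panoramic depth estimation using unlabeled panoramas and Mobius spatial augmentation. | depth estimation, Depth Anything, foundation model |
14 | Low Latency Visual Inertial Odometry with On-Sensor Accelerated Optical Flow for Resource-Constrained UAVs | Proposes low-latency visual-inertial odometry based on on-sensor accelerated optical flow for resource-constrained UAVs. | VIO, optical flow |
15 | Freq-Mip-AA : Frequency Mip Representation for Anti-Aliasing Neural Radiance Fields | Proposes FreqMipAA, which uses a frequency-domain mip representation and anti-aliasing to accelerate NeRF training and improve rendering quality. | NeRF, neural radiance field |
16 | NeRF-Feat: 6D Object Pose Estimation using Feature Rendering | NeRF-Feat: weakly supervised 6D object pose estimation via feature rendering. | NeRF |
17 | StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images | StableSemantics: a synthetic language-vision dataset of semantic representations in naturalistic images. | scene understanding, open-vocabulary, open vocabulary |
18 | SMORE: Simultaneous Map and Object REconstruction | Proposes the SMORE method to address dynamic scene reconstruction. | scene flow |
19 | 4K4DGen: Panoramic 4D Generation at 4K Resolution | Proposes 4K4DGen, the first method to achieve panoramic 4D dynamic scene generation at 4K resolution. | splatting |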

🔬 Pillar 2: RL Algorithms & Architecture (4 papers)

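# | Title | One-sentence summary | Tags | 🔗
20 | Towards a multimodal framework for remote sensing image change retrieval and captioning | Proposes a multimodal framework for remote sensing image change retrieval and captioning, improving understanding of temporal remote sensing data. | contrastive learning, foundation model, multimodal |
21 | WaterMono: Teacher-Guided Anomaly Masking and Enhancement Boosting for Robust Underwater Self-Supervised Monocular Depth Estimation | Proposes WaterMono to address dynamic interference in underwater monocular depth estimation. | distillation, depth estimation, monocular depth |
22 | DPO: Dual-Perturbation Optimization for Test-time Adaptation in 3D Object Detection | Proposes Dual-Perturbation Optimization (DPO) for test-time adaptation in 3D object detection. | DPO |
23 | Towards Trustworthy Unsupervised Domain Adaptation: A Representation Learning Perspective for Enhancing Robustness, Discrimination, and Generalization | Proposes MIRoUDA, which improves robust unsupervised domain adaptation from a representation learning perspective. | representation learning |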

🔬 Pillar 1: Robot Control (3 papers)

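# | Title | One-sentence summary | Tags | 🔗
24 | Splatter a Video: Video Gaussian Representation for Versatile Processing | Proposes a video Gaussian representation to address complex motion modeling and manipulability in video processing. | manipulation, optical flow, foundation model |
25 | CNN Based Flank Predictor for Quadruped Animal Species | Proposes a CNN-based flank predictor to improve the accuracy of individual identification for quadruped animals. | quadruped |
26 | Exploring Multi-view Pixel Contrast for General and Robust Image Forgery Localization | Proposes a multi-view pixel contrastive learning method for general and robust image forgery localization. | MPC |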

🔬 Pillar 6: Video Extraction & Matching (2 papers)

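# | Title | One-sentence summary | Tags | 🔗
27 | AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding | AlanaVLM: a multimodal embodied AI foundation model for egocentric video understanding. | egocentric, embodied AI, foundation model |
28 | HumorDB: Can AI understand graphical humor? | Proposes the HumorDB dataset for evaluating and improving AI understanding of visual humor. | HuMoR |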

🔬 Pillar 7: Motion Retargeting (1 paper)

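# | Title | One-sentence summary | Tags | 🔗
29 | Convolutional Kolmogorov-Arnold Networks | Proposes Convolutional Kolmogorov-Arnold Networks, which improve the parameter efficiency and expressive power of CNNs. | spatial relationship |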
