cs.CV (2024-06-27)

📊 17 papers in total | 🔗 1 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (8 🔗1) Pillar 2: RL & Architecture (4) Pillar 3: Perception & Semantics (3) Pillar 5: Interaction & Reaction (1) Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (8 papers)

#	Title	Key Point	Tags	🔗	⭐
1	HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale	HuatuoGPT-Vision: improves the medical capability of multimodal LLMs by injecting medical visual knowledge at scale	large language model multimodal
2	DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming	DocKylin: a large multimodal model for document understanding with efficient visual slimming	large language model multimodal
3	ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos	ReXTime: a benchmark suite for reasoning across time in videos	large language model multimodal
4	RAVEN: Multitask Retrieval Augmented Vision-Language Learning	RAVEN: a multitask retrieval-augmented vision-language learning framework that improves VLM performance	large language model multimodal
5	OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding	OMG-LLaVA: a multimodal model that unifies image-level, object-level, and pixel-level reasoning and understanding	multimodal
6	CELLO: Causal Evaluation of Large Vision-Language Models	Proposes CELLO to address causal reasoning in large vision-language models	chain-of-thought	✅
7	Revisiting Backdoor Attacks against Large Vision-Language Models from Domain Shift	Proposes MABA, a domain-generalizable multimodal backdoor attack on large vision-language models that raises the attack success rate	multimodal
8	ViT LoS V2X: Vision Transformers for Environment-aware LoS Blockage Prediction for 6G Vehicular Networks	Proposes a Vision Transformer-based, environment-aware LoS blockage prediction method for V2X in 6G vehicular networks	multimodal

🔬 Pillar 2: RL & Architecture (4 papers)

#	Title	Key Point	Tags	🔗	⭐
9	Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation	Proposes modality-aware feature distillation to improve continual learning in visual question answering	distillation multimodal
10	Mamba or RWKV: Exploring High-Quality and High-Efficiency Segment Anything Model	Proposes RWKV-SAM: an efficient, high-quality Segment Anything model	Mamba linear attention
11	Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads	Fibottention: inceptive visual representation learning with diverse attention across heads, improving Transformer performance under limited data	representation learning
12	Snakes and Ladders: Two Steps Up for VideoMamba	VideoMambaPro: improves video understanding performance by refining the Mamba architecture	Mamba

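🔬 Pillar 3: Perception & Semantics (3 papers)

#	Title	Key Point	Tags	🔗	⭐
13	Dense Monocular Motion Segmentation Using Optical Flow and Pseudo Depth Map: A Zero-Shot Approach	Proposes a zero-shot monocular motion segmentation method based on optical flow and pseudo depth maps	depth estimation monocular depth optical flow
14	A Universal Railway Obstacle Detection System based on Semi-supervised Segmentation And Optical Flow	Proposes a universal railway obstacle detection system based on semi-supervised segmentation and optical flow, addressing the difficulty of generalizing across obstacle categories	optical flow
15	360 in the Wild: Dataset for Depth Prediction and View Synthesis	Proposes a large-scale real-world 360° video dataset for depth prediction and view synthesis research	depth estimation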

🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	Key Point	Tags	🔗	⭐
16	CORE4D: A 4D Human-Object-Human Interaction Dataset for Collaborative Object REarrangement	CORE4D: a 4D human-object-human interaction dataset for collaborative object rearrangement	human-object interaction human motion

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	Key Point	Tags	🔗	⭐
17	PNeRV: A Polynomial Neural Representation for Videos	Proposes PNeRV, a parameter-efficient polynomial neural representation for videos that preserves spatiotemporal continuity	spatiotemporal
