cs.CV (2024-06-19)

📊 29 papers in total | 🔗 11 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (12 🔗3) | Pillar 3: Perception & Semantics (7 🔗3) | Pillar 2: RL & Architecture (4 🔗2) | Pillar 1: Robot Control (3 🔗2) | Pillar 6: Video Extraction (2 🔗1) | Pillar 7: Motion Retargeting (1)

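🔬 Pillar 9: Embodied Foundation Models (12 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
1	Using Multimodal Large Language Models for Automated Detection of Traffic Safety Critical Events	Uses multimodal large language models to automatically detect safety-critical traffic events.	large language model, multimodal
2	Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models	Proposes a Theory of Mind (ToM) reasoning framework built on multimodal video large language models.	large language model, multimodal
3	MC-MKE: A Fine-Grained Multimodal Knowledge Editing Benchmark Emphasizing Modality Consistency	MC-MKE: proposes a fine-grained multimodal knowledge editing benchmark that emphasizes modality consistency, for evaluating and correcting errors in MLLMs.	large language model, multimodal
4	Biomedical Visual Instruction Tuning with Clinician Preference Alignment	BioMed-VITAL: biomedical visual instruction tuning through clinician preference alignment.	foundation model, multimodal, instruction following
5	VisualRWKV: Exploring Recurrent Neural Networks for Visual Language Models	Proposes VisualRWKV, which applies linear RNNs to visual language models for efficient multimodal learning.	large language model, multimodal	✅
6	GUI Action Narrator: Where and When Did That Action Take Place?	Proposes the GUI Narrator framework and the Act2Cap dataset to improve multimodal LLM performance on understanding GUI action videos.	multimodal
7	IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning	IntCoOp: an interpretable vision-language prompt tuning method that improves image-text alignment.	zero-shot transfer
8	SpatialBot: Precise Spatial Understanding with Vision Language Models	SpatialBot: precise spatial understanding with vision language models.	embodied AI	✅
9	Semantic Enhanced Few-shot Object Detection	Proposes a semantic-enhanced few-shot object detection framework that improves detection of novel classes.	multimodal
10	SituationalLLM: Proactive language models with scene awareness for dynamic, contextual task guidance	SituationalLLM: proposes a proactive, scene-aware language model for dynamic, contextual task guidance.	large language model
11	Neural Residual Diffusion Models for Deep Scalable Vision Generation	Proposes Neural Residual Diffusion Models (Neural-RDM) to address the scalability of deep vision generation models.	large language model	✅
12	Block-level Text Spotting with LLMs	Proposes BTS-LLM, which uses large language models for block-level text localization and recognition in images.	large language model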

🔬 Pillar 3: Perception & Semantics (7 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
13	PanDA: Towards Panoramic Depth Anything with Unlabeled Panoramas and Mobius Spatial Augmentation	PanDA: panoramic depth estimation using unlabeled panoramas and Mobius spatial augmentation.	depth estimation, Depth Anything, foundation model
14	Low Latency Visual Inertial Odometry with On-Sensor Accelerated Optical Flow for Resource-Constrained UAVs	Proposes low-latency visual-inertial odometry based on on-sensor accelerated optical flow for resource-constrained UAVs.	VIO, optical flow
15	Freq-Mip-AA: Frequency Mip Representation for Anti-Aliasing Neural Radiance Fields	Proposes FreqMipAA, which uses a frequency-domain mip representation and anti-aliasing to speed up NeRF training and improve rendering quality.	NeRF, neural radiance field	✅
16	NeRF-Feat: 6D Object Pose Estimation using Feature Rendering	NeRF-Feat: weakly supervised 6D object pose estimation using feature rendering.	NeRF
17	StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images	StableSemantics: a synthetic language-vision dataset of semantic representations in naturalistic images.	scene understanding, open-vocabulary, open vocabulary	✅
18	SMORE: Simultaneous Map and Object REconstruction	Proposes SMORE to address dynamic scene reconstruction.	scene flow	✅
19	4K4DGen: Panoramic 4D Generation at 4K Resolution	Proposes 4K4DGen, the first method to generate panoramic 4D dynamic scenes at 4K resolution.	splatting

🔬 Pillar 2: RL & Architecture (4 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
20	Towards a multimodal framework for remote sensing image change retrieval and captioning	Proposes a multimodal framework for remote sensing image change retrieval and captioning, improving understanding of multi-temporal remote sensing data.	contrastive learning, foundation model, multimodal	✅
21	WaterMono: Teacher-Guided Anomaly Masking and Enhancement Boosting for Robust Underwater Self-Supervised Monocular Depth Estimation	Proposes WaterMono to address dynamic interference in underwater monocular depth estimation.	distillation, depth estimation, monocular depth	✅
22	DPO: Dual-Perturbation Optimization for Test-time Adaptation in 3D Object Detection	Proposes Dual-Perturbation Optimization (DPO) for test-time adaptation in 3D object detection.	DPO
23	Towards Trustworthy Unsupervised Domain Adaptation: A Representation Learning Perspective for Enhancing Robustness, Discrimination, and Generalization	Proposes MIRoUDA, which improves robust unsupervised domain adaptation from a representation learning perspective.	representation learning

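🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
24	Splatter a Video: Video Gaussian Representation for Versatile Processing	Proposes a video Gaussian representation to address complex motion modeling and manipulation in video processing.	manipulation, optical flow, foundation model	✅
25	CNN Based Flank Predictor for Quadruped Animal Species	Proposes a CNN-based flank predictor to improve the accuracy of individual identification of quadruped animals.	quadruped
26	Exploring Multi-view Pixel Contrast for General and Robust Image Forgery Localization	Proposes a multi-view pixel contrastive learning method for general and robust image forgery localization.	MPC	✅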

🔬 Pillar 6: Video Extraction (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
27	AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding	AlanaVLM: a multimodal embodied AI foundation model for egocentric video understanding.	egocentric, embodied AI, foundation model
28	HumorDB: Can AI understand graphical humor?	Proposes the HumorDB dataset for evaluating and improving AI understanding of visual humor.	HuMoR	✅

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
29	Convolutional Kolmogorov-Arnold Networks	Proposes Convolutional Kolmogorov-Arnold Networks, improving CNN parameter efficiency and expressive power.	spatial relationship
