cs.CV (2024-05-21)

📊 33 papers in total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (11 🔗2) Pillar 9: Embodied Foundation Models (11) Pillar 3: Perception & Semantics (6 🔗2) Pillar 8: Physics-based Animation (2 🔗1) Pillar 6: Video Extraction (1 🔗1) Pillar 4: Generative Motion (1) Pillar 1: Robot Control (1)

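🔬 Pillar 2: RL & Architecture (11 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	Physics-based Scene Layout Generation from Human Motion	Proposes a physics-based scene layout generation method that yields realistic human-scene interaction animation	reinforcement learning affordance physically plausible
2	Cross-spectral Gated-RGB Stereo Depth Estimation	Proposes a cross-spectral gated-RGB stereo depth estimation method that improves long-range depth accuracy	MAE depth estimation stereo depth
3	AMFD: Distillation via Adaptive Multimodal Fusion for Multispectral Pedestrian Detection	Proposes AMFD, a framework that distills via adaptive multimodal fusion to make multispectral pedestrian detection more efficient	distillation multimodal	✅
4	3DSS-Mamba: 3D-Spectral-Spatial Mamba for Hyperspectral Image Classification	Proposes 3DSS-Mamba for hyperspectral image classification, modeling long-range dependencies more efficiently	Mamba state space model HSI
5	A Survey of Deep Learning-based Radiology Report Generation Using Multimodal Data	Surveys deep learning methods for radiology report generation from multimodal data, focusing on data fusion and model interpretability	contrastive learning multimodal
6	A Multimodal Learning-based Approach for Autonomous Landing of UAV	Proposes a multimodal learning approach to autonomous UAV landing that improves accuracy and environmental adaptability	reinforcement learning multimodal
7	Active Object Detection with Knowledge Aggregation and Distillation from Large Models	Proposes an active object detection method based on knowledge aggregation and distillation that improves detection accuracy in interaction scenes	distillation affordance Ego4D
8	RemoCap: Disentangled Representation Learning for Motion Capture	RemoCap: a disentangled representation learning method for 3D human motion capture under complex occlusion	representation learning penetration	✅
9	CLRKDNet: Speeding up Lane Detection with Knowledge Distillation	CLRKDNet: uses knowledge distillation to speed up lane detection for real-time autonomous driving	teacher-student distillation
10	BIMM: Brain Inspired Masked Modeling for Video Representation Learning	Proposes BIMM, a brain-inspired masked modeling framework for video representation learning	representation learning
11	C3L: Content Correlated Vision-Language Instruction Tuning Data Generation via Contrastive Learning	Proposes C3L, which uses contrastive learning to generate content-correlated vision-language instruction tuning data and improve LVLM performance	contrastive learning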

🔬 Pillar 9: Embodied Foundation Models (11 papers)

#	Title	One-line summary	Tags	🔗	⭐
12	Single Image Unlearning: Efficient Machine Unlearning in Multimodal Large Language Models	Proposes SIU (Single Image Unlearning) for effectively unlearning visual concepts in multimodal large language models	large language model multimodal
13	Multimodal Adaptive Inference for Document Image Classification with Anytime Early Exiting	Proposes multimodal adaptive inference with anytime early exiting to improve both performance and efficiency in document image classification	foundation model multimodal
14	Context-Enhanced Video Moment Retrieval with Large Language Models	Proposes LMR, a model that uses large language models to enrich video context and improve video moment retrieval	large language model language conditioned
15	CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers	CamViG: camera-aware image-to-video generation with multimodal Transformers	multimodal
16	BiomedParse: a biomedical foundation model for image parsing of everything everywhere all at once	BiomedParse: a general-purpose foundation model for biomedical image parsing that handles all tasks at once	foundation model
17	Comprehensive Multimodal Deep Learning Survival Prediction Enabled by a Transformer Architecture: A Multicenter Study in Glioblastoma	Proposes a Transformer-based multimodal deep learning model that improves survival prediction accuracy in glioblastoma	multimodal
18	An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation	Uses large language models to improve text understanding in text-to-image generation	large language model
19	Multimodal video analysis for crowd anomaly detection using open access tourism cameras	Proposes a crowd anomaly detection method based on open-access tourism cameras and multimodal video analysis	multimodal
20	Is Dataset Quality Still a Concern in Diagnosis Using Large Foundation Model?	Finds that large pretrained models are more robust to dataset quality in fundus diagnosis	foundation model
21	Mutual Information Analysis in Multimodal Learning Systems	Proposes InfoMeter, which uses mutual information analysis to improve multimodal 3D object detection systems	multimodal
22	Towards Retrieval-Augmented Architectures for Image Captioning	Proposes a retrieval-augmented image captioning architecture that uses an external knowledge base to improve generation quality	multimodal

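🔬 Pillar 3: Perception & Semantics (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
23	MOSS: Motion-based 3D Clothed Human Synthesis from Monocular Video	Proposes MOSS, a framework that uses motion information to synthesize realistic 3D clothed humans from monocular video	gaussian splatting splatting NeRF	✅
24	WorldAfford: Affordance Grounding based on Natural Language Instructions	Proposes WorldAfford, a framework for grounding affordance regions from natural language instructions	affordance chain-of-thought
25	Leveraging Neural Radiance Fields for Pose Estimation of an Unknown Space Object during Proximity Operations	Proposes a NeRF-based pose estimation method for an unknown space object during proximity operations	NeRF neural radiance field
26	Gaussian Control with Hierarchical Semantic Graphs in 3D Human Recovery	Proposes HUGS, a framework that controls 3D Gaussians with hierarchical semantic graphs to improve 3D human reconstruction quality	3D gaussian splatting 3DGS gaussian splatting	✅
27	Rethink Predicting the Optical Flow with the Kinetics Perspective	Proposes an optical flow prediction method from a kinetics perspective that improves performance under occlusion and fast motion	optical flow
28	Anticipating Object State Changes in Long Procedural Videos	Introduces the Ego4D-OSCA dataset for anticipating object state changes in long procedural videos	scene understanding Ego4D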

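🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
29	Identity-free Artificial Emotional Intelligence via Micro-Gesture Understanding	Proposes an identity-free artificial emotional intelligence method based on micro-gesture understanding to improve emotion understanding	spatiotemporal large language model
30	Text-Video Retrieval with Global-Local Semantic Consistent Learning	Proposes GLSCL, a global-local semantic consistent learning method for efficient text-video retrieval	PULSE	✅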

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
31	OmniGlue: Generalizable Feature Matching with Foundation Model Guidance	OmniGlue: a generalizable feature matching method guided by foundation models	feature matching foundation model	✅

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
32	DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control	DisenStudio: a multi-subject text-to-video generation framework with disentangled spatial control	motion generation

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
33	EmoEdit: Evoking Emotions through Image Manipulation	EmoEdit: evokes emotions by manipulating image content, improving affective image editing	manipulation
