cs.CV (2024-07-11)

📊 34 papers in total | 🔗 14 with code

🎯 Interest Area Navigation

Pillar 3: Perception & Semantics (10 🔗4) | Pillar 2: RL & Architecture (10 🔗3) | Pillar 9: Embodied Foundation Models (7 🔗5) | Pillar 1: Robot Control (2) | Pillar 4: Generative Motion (2 🔗1) | Pillar 5: Interaction & Reaction (2) | Pillar 6: Video Extraction (1 🔗1)

🔬 Pillar 3: Perception & Semantics (10 papers)

#	Title	Key point	Tags	🔗	⭐
1	WildGaussians: 3D Gaussian Splatting in the Wild	WildGaussians: high-quality, real-time 3D Gaussian splatting in complex scenes	3D gaussian splatting, 3DGS, gaussian splatting
2	ScaleDepth: Decomposing Metric Depth Estimation into Scale Prediction and Relative Depth Estimation	Proposes ScaleDepth, which decomposes monocular metric depth estimation into scale prediction and relative depth estimation to improve cross-scene generalization	depth estimation, monocular depth, metric depth	✅
3	Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation	Proposes CLIPtrase, which recalibrates self-correlation to strengthen CLIP's local feature awareness for open-vocabulary semantic segmentation	open-vocabulary, open vocabulary	✅
4	Survey on Fundamental Deep Learning 3D Reconstruction Techniques	Surveys deep learning 3D reconstruction techniques, focusing on NeRF, LDM, and 3D Gaussian splatting	3D gaussian splatting, gaussian splatting, splatting
5	Generalizable Implicit Motion Modeling for Video Frame Interpolation	Proposes generalizable implicit motion modeling (GIMM) to improve video frame interpolation	optical flow, motion latent, spatiotemporal
6	Feasibility of Neural Radiance Fields for Crime Scene Video Reconstruction	Explores the feasibility of neural radiance fields for reconstructing crime scenes from video	NeRF, neural radiance field
7	Explicit-NeRF-QA: A Quality Assessment Database for Explicit NeRF Model Compression	Builds the Explicit-NeRF-QA dataset for assessing the quality of compressed explicit NeRF models	NeRF, neural radiance field	✅
8	Map It Anywhere (MIA): Empowering Bird's Eye View Mapping using Large-scale Public Data	MIA: uses large-scale public data to support bird's-eye-view map construction	semantic map, first-person view
9	Event-based vision on FPGAs -- a survey	Survey: event camera vision on FPGAs, accelerating low-power, real-time embedded applications	optical flow
10	Enriching Information and Preserving Semantic Consistency in Expanding Curvilinear Object Segmentation Datasets	Proposes COSTG and SCP ControlNet to expand curvilinear object segmentation datasets while preserving semantic consistency	semantic map	✅

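🔬 Pillar 2: RL & Architecture (10 papers)

#	Title	Key point	Tags	🔗	⭐
11	Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification	Proposes the Data Adaptive Traceback (DAT) framework to improve vision-language foundation models on image classification	contrastive learning, foundation model
12	MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine	MAVIS: mathematical visual instruction tuning with an automatic data engine, improving the mathematical ability of multimodal large language models	DPO, direct preference optimization, contrastive learning	✅
13	Emergent Visual-Semantic Hierarchies in Image-Text Representations	Finds that VLMs such as CLIP show an emergent understanding of visual-semantic hierarchies, and proposes the Radial Embedding framework to optimize it	representation learning, large language model, foundation model
14	VideoMamba: Spatio-Temporal Selective State Space Model	VideoMamba: a spatio-temporal selective state space model for video recognition	Mamba, SSM, state space model
15	SR-Mamba: Effective Surgical Phase Recognition with State Space Model	SR-Mamba: effective surgical phase recognition with a state space model	Mamba, state space model	✅
16	GraphMamba: An Efficient Graph Structure Learning Vision Mamba for Hyperspectral Image Classification	Proposes GraphMamba for efficiently learning graph structure and temporal features in hyperspectral image classification	Mamba, HSI
17	SliceMamba with Neural Architecture Search for Medical Image Segmentation	Proposes SliceMamba, combined with neural architecture search, to improve medical image segmentation	Mamba, representation learning
18	DegustaBot: Zero-Shot Visual Preference Estimation for Personalized Multi-Object Rearrangement	DegustaBot: zero-shot visual preference estimation for personalized multi-object rearrangement	preference learning, foundation model
19	Exemplar-free Continual Representation Learning via Learnable Drift Compensation	Proposes learnable drift compensation for exemplar-free continual representation learning	representation learning	✅
20	FYI: Flip Your Images for Dataset Distillation	Proposes FYI: augments dataset distillation with image flipping to improve the semantic expressiveness of small sample sets	distillation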

🔬 Pillar 9: Embodied Foundation Models (7 papers)

#	Title	Key point	Tags	🔗	⭐
21	SEED-Story: Multimodal Long Story Generation with Large Language Model	SEED-Story: generates long multimodal stories with a multimodal large language model	large language model, multimodal
22	DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception	Proposes DenseFusion-1M, which merges knowledge from vision experts to improve multimodal large language models' comprehensive perception of images	large language model, multimodal	✅
23	CAR-MFL: Cross-Modal Augmentation by Retrieval for Multimodal Federated Learning with Missing Modalities	Proposes CAR-MFL, which uses cross-modal retrieval augmentation to handle missing modalities in multimodal federated learning	multimodal	✅
24	15M Multimodal Facial Image-Text Dataset	Releases FaceCaption-15M, a large-scale multimodal facial image-text dataset to support research on face-related tasks	multimodal	✅
25	DSCENet: Dynamic Screening and Clinical-Enhanced Multimodal Fusion for MPNs Subtype Classification	DSCENet: dynamic screening and clinical-enhanced multimodal fusion for MPNs subtype classification	multimodal	✅
26	Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding	Proposes the HMLLM model, which uses EEG and eye-tracking data to evaluate heterogeneous responses in video understanding	large language model	✅
27	Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models	Live2Diff: proposes a video diffusion model with uni-directional attention for live-stream video translation	large language model

🔬 Pillar 1: Robot Control (2 papers)

#	Title	Key point	Tags	🔗	⭐
28	MetaUrban: An Embodied AI Simulation Platform for Urban Micromobility	MetaUrban: an embodied AI simulation platform for urban micromobility, aimed at improving the generalization and safety of AI models	humanoid, reinforcement learning, imitation learning
29	MeshAvatar: Learning High-quality Triangular Human Avatars from Multi-view Videos	MeshAvatar: proposes a new method for learning high-quality triangular human avatars from multi-view videos	manipulation, NeRF, neural radiance field

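🔬 Pillar 4: Generative Motion (2 papers)

#	Title	Key point	Tags	🔗	⭐
30	Infinite Motion: Extended Motion Generation via Long Text Instructions	Proposes Infinite Motion, which extends motion generation with long text instructions to synthesize high-quality motion sequences of unlimited length	motion synthesis, motion generation	✅
31	A Comprehensive Survey on Human Video Generation: Challenges, Methods, and Insights	A comprehensive survey of human video generation: challenges, methods, and future directions	motion generation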

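🔬 Pillar 5: Interaction & Reaction (2 papers)

#	Title	Key point	Tags	🔗	⭐
32	NODE-Adapter: Neural Ordinary Differential Equations for Better Vision-Language Reasoning	Proposes NODE-Adapter, which uses neural ordinary differential equations to improve vision-language reasoning	human-object interaction
33	Nonverbal Interaction Detection	Proposes NVI-DEHR, a hypergraph-based nonverbal interaction detection model for understanding nonverbal behavior in social scenes	HOI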

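🔬 Pillar 6: Video Extraction (1 paper)

#	Title	Key point	Tags	🔗	⭐
34	WalkTheDog: Cross-Morphology Motion Alignment via Phase Manifolds	Proposes a phase-manifold-based cross-morphology motion alignment method that enables motion transfer between characters with different skeletal structures	motion matching, motion retrieval	✅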
