cs.CV (2024-10-17)

📊 28 papers in total | 🔗 8 with code

🎯 Navigation by Interest Area

Pillar 9: Embodied Foundation Models (11 🔗2) · Pillar 3: Perception & Semantics (7 🔗3) · Pillar 2: RL & Architecture (3) · Pillar 1: Robot Control (2 🔗1) · Pillar 8: Physics-based Animation (2) · Pillar 4: Generative Motion (2 🔗1) · Pillar 5: Interaction & Reaction (1 🔗1)

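🔬 Pillar 9: Embodied Foundation Models (11 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
1	$γ-$MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models	Proposes $γ$-MoD, which uses mixture-of-depth adaptation to improve the efficiency of multimodal large language models.	large language model multimodal
2	RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models	RAP: a retrieval-augmented personalization framework for multimodal large language models.	large language model multimodal	✅
3	Improving Multi-modal Large Language Model through Boosting Vision Capabilities	Arcana: improves multimodal large language model performance by boosting vision capabilities.	large language model multimodal
4	Fundus to Fluorescein Angiography Video Generation as a Retinal Generative Foundation Model	Proposes Fundus2Video, which generates dynamic FFA videos from color fundus photographs and serves as a retinal generative foundation model.	foundation model multimodal
5	Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation	Janus: decouples visual encoding to unify multimodal understanding and generation.	multimodal
6	Movie Gen: A Cast of Media Foundation Models	Movie Gen: a suite of high-quality media foundation models for 1080p HD video generation and editing.	foundation model
7	Temporal-Enhanced Multimodal Transformer for Referring Multi-Object Tracking and Segmentation	Proposes TenRMOT, a temporal-enhanced Transformer for referring multi-object tracking and segmentation.	multimodal
8	Harnessing Webpage UIs for Text-Rich Visual Understanding	Uses webpage UIs to improve text-rich visual understanding, addressing how multimodal large models interact in structured environments.	large language model multimodal
9	Exploring the Design Space of Visual Context Representation in Video MLLMs	Proposes design choices for visual context representation to improve the performance of video multimodal large language models.	large language model multimodal	✅
10	Performance of Gaussian Mixture Model Classifiers on Embedded Feature Spaces	Proposes a GMM classifier with fewer parameters and evaluates it on CLIP and ImageBind embedded feature spaces.	multimodal
11	Trust but Verify: Programmatic VLM Evaluation in the Wild	Proposes a programmatic VLM evaluation method to address response verification for vision-language models.	large language model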

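🔬 Pillar 3: Perception & Semantics (7 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
12	GlossyGS: Inverse Rendering of Glossy Objects with 3D Gaussian Splatting	GlossyGS: inverse rendering of glossy objects using 3D Gaussian splatting and material priors.	3D gaussian splatting gaussian splatting splatting
13	DepthSplat: Connecting Gaussian Splatting and Depth	DepthSplat: connects Gaussian splatting and depth estimation for high-quality 3D reconstruction and depth prediction.	depth estimation monocular depth 3D gaussian splatting
14	MEGA: Memory-Efficient 4D Gaussian Splatting for Dynamic Scenes	MEGA: a memory-efficient 4D Gaussian splatting method for dynamic scenes.	gaussian splatting splatting	✅
15	VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding	VLM-Grounder: a zero-shot 3D visual grounding method based on vision-language models.	scene understanding visual grounding	✅
16	DN-4DGS: Denoised Deformable Network with Temporal-Spatial Aggregation for Dynamic Scene Rendering	Proposes DN-4DGS, which uses denoising and temporal-spatial aggregation for real-time, high-quality dynamic scene rendering.	3D gaussian splatting 3DGS gaussian splatting
17	ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding	ARKit LabelMaker: builds a large-scale indoor 3D scene understanding dataset that improves semantic segmentation performance.	scene understanding	✅
18	Self-Supervised Scene Flow Estimation with Point-Voxel Fusion and Surface Representation	Proposes a self-supervised scene flow estimation method with point-voxel fusion and surface representation, improving the accuracy of 3D motion field prediction.	scene flow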

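🔬 Pillar 2: RL & Architecture (3 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
19	DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation	DriveDreamer4D: uses world models as data machines to improve 4D driving scene representation.	world model dreamer 3DGS
20	Stochastic Flow Matching for Resolving Small-Scale Physics	Proposes the Stochastic Flow Matching (SFM) framework for super-resolution reconstruction of small-scale details in the physical sciences.	flow matching
21	Enhancing Dataset Distillation via Label Inconsistency Elimination and Learning Pattern Refinement	M-DATM: improves dataset distillation by eliminating label inconsistency and refining learning patterns.	distillation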

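🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
22	Object Pose Estimation Using Implicit Representation For Transparent Objects	Proposes a pose estimation method for transparent objects based on NeRF implicit representation, reported to outperform the existing state of the art.	manipulation NeRF neural radiance field
23	PUMA: Empowering Unified MLLM with Multi-granular Visual Generation	PUMA: proposes a unified multimodal large language model that supports multi-granular visual generation tasks.	manipulation large language model foundation model	✅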

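🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
24	Spatiotemporal Object Detection for Improved Aerial Vehicle Detection in Traffic Monitoring	Proposes a spatiotemporal vehicle detection model that improves vehicle detection accuracy in UAV traffic monitoring.	spatiotemporal
25	Training Compute-Optimal Vision Transformers for Brain Encoding	Studies compute-optimal training strategies for vision Transformers applied to brain encoding.	spatiotemporal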

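🔬 Pillar 4: Generative Motion (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
26	MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations	MotionBank: builds a large-scale video motion benchmark for generating disentangled, rule-based motion descriptions.	motion generation foundation model	✅
27	L3DG: Latent 3D Gaussian Diffusion	L3DG: proposes a generative 3D modeling method based on latent 3D Gaussian diffusion that scales to room-scale scenes.	VQ-VAE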

🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
28	GraspDiffusion: Synthesizing Realistic Whole-body Hand-Object Interaction	GraspDiffusion: synthesizes realistic whole-body human-object interaction scenes.	human-object interaction	✅
