cs.CV (2024-12-31)

📊 16 papers in total | 🔗 4 with code

🎯 Browse by interest area

Pillar 3: Perception & Semantics (6 🔗2) | Pillar 9: Embodied Foundation Models (6 🔗1) | Pillar 1: Robot Control (2 🔗1) | Pillar 8: Physics-based Animation (1) | Pillar 2: RL & Architecture (1)

🔬 Pillar 3: Perception & Semantics (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	OV-HHIR: Open Vocabulary Human Interaction Recognition Using Cross-modal Integration of Large Language Models	Proposes the OV-HHIR framework, which uses large language models for open-vocabulary human-to-human interaction recognition, with applications in public safety surveillance.	open-vocabulary open vocabulary large language model
2	SG-Splatting: Accelerating 3D Gaussian Splatting with Spherical Gaussians	SG-Splatting: uses spherical Gaussians to accelerate 3D Gaussian Splatting, improving rendering speed and quality.	3D gaussian splatting gaussian splatting splatting
3	OVGaussian: Generalizable 3D Gaussian Segmentation with Open Vocabularies	Proposes OVGaussian for open-vocabulary semantic segmentation of 3D Gaussians.	3DGS scene understanding semantic map	✅
4	PanoSLAM: Panoptic 3D Scene Reconstruction via Gaussian SLAM	PanoSLAM: the first Gaussian-SLAM-based system for panoptic 3D scene reconstruction.	3D gaussian splatting gaussian splatting splatting	✅
5	Gaussian Building Mesh (GBM): Extract a Building's 3D Mesh with Google Earth and Gaussian Splatting	Proposes Gaussian Building Mesh (GBM), a method for reconstructing a building's 3D mesh using Google Earth and Gaussian Splatting.	gaussian splatting splatting
6	STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes	STORM: a spatio-temporal reconstruction model for large-scale outdoor scenes that enables efficient dynamic scene reconstruction.	scene reconstruction scene understanding scene flow

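🔬 Pillar 9: Embodied Foundation Models (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
7	OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning	OCRBench v2: an improved benchmark for evaluating multimodal models on visual text localization and reasoning.	multimodal	✅
8	MLLM-as-a-Judge for Image Safety without Human Labeling	Proposes an MLLM-based image safety judgment method that requires no human labeling, targeting AIGC content safety.	large language model multimodal chain-of-thought
9	VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling	VideoChat-Flash: long-context video modeling via hierarchical compression, significantly reducing computational cost.	large language model multimodal
10	VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM	Proposes VideoRefer Suite to strengthen the spatial-temporal object understanding of Video LLMs.	large language model
11	CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval	Proposes the CaReBench benchmark for fine-grained video captioning and retrieval, and evaluates the spatial and temporal biases of video-language models.	multimodal
12	CRRG-CLIP: Automatic Generation of Chest Radiology Reports and Classification of Chest Radiographs	Proposes the CRRG-CLIP model for automatic chest X-ray report generation and disease classification.	multimodal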

🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
13	Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding	Embodied VideoAgent: dynamic scene understanding from egocentric videos and embodied sensors.	manipulation scene understanding egocentric
14	SoundBrush: Sound as a Brush for Visual Scene Editing	SoundBrush: proposes a model that uses sound as a brush to edit visual scenes.	manipulation	✅

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
15	Online Video Understanding: OVBench and VideoChat-Online	Proposes VideoChat-Online for online video understanding; it outperforms SOTA models on OVBench.	spatiotemporal large language model multimodal

🔬 Pillar 2: RL & Architecture (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
16	A Novel Convolution and Attention Mechanism-based Model for 6D Object Pose Estimation	PoseLecTr: a 6D object pose estimation method that combines Legendre convolution with an attention mechanism.	distillation spatial relationship
