cs.CV (2024-12-24)

📊 30 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (16 🔗5) · Pillar 3: Perception & Semantics (5 🔗2) · Pillar 2: RL & Architecture (4 🔗1) · Pillar 6: Video Extraction (2) · Pillar 5: Interaction & Reaction (1) · Pillar 1: Robot Control (1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (16 papers)

#	Title	One-Line Summary	Tags	🔗	⭐
1	ICM-Assistant: Instruction-tuning Multimodal Large Language Models for Rule-based Explainable Image Content Moderation	Proposes ICM-Assistant for rule-based, explainable image content moderation, significantly improving performance.	large language model multimodal	✅
2	TextMatch: Enhancing Image-Text Consistency Through Multimodal Optimization	TextMatch: enhancing image-text consistency through multimodal optimization.	large language model multimodal chain-of-thought
3	VisionLLM-based Multimodal Fusion Network for Glottic Carcinoma Early Detection	Proposes MMGC-Net, a VisionLLM-based multimodal fusion network for early detection of glottic carcinoma.	large language model multimodal
4	An Ensemble Approach to Short-form Video Quality Assessment Using Multimodal LLM	Proposes an ensemble method based on multimodal LLMs for short-form video quality assessment, improving generalization.	large language model multimodal
5	Multimodal joint prediction of traffic spatial-temporal data with graph sparse attention mechanism and bidirectional temporal convolutional network	Proposes the GSABT model, which uses a graph sparse attention mechanism and a bidirectional temporal convolutional network for joint prediction of multimodal traffic spatial-temporal data.	multimodal
6	AdaCo: Overcoming Visual Foundation Model Noise in 3D Semantic Segmentation via Adaptive Label Correction	AdaCo: overcoming visual foundation model noise in 3D semantic segmentation via adaptive label correction.	foundation model
7	BIG-MoE: Bypass Isolated Gating MoE for Generalized Multimodal Face Anti-Spoofing	Proposes BIG-MoE to address the isolated gating problem in multimodal face anti-spoofing.	multimodal	✅
8	Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation	Surveys the latest trends in long video generation, covering generative models, strategies, datasets, and evaluation metrics.	large language model multimodal
9	RDPM: Solve Diffusion Probabilistic Models via Recurrent Token Prediction	Proposes RDPM, which solves diffusion probabilistic models via recurrent token prediction, enabling discrete diffusion.	large language model multimodal
10	Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach	Unveiling visual perception in language models: an attention-head-based analysis approach.	large language model multimodal
11	Computer Vision-Driven Gesture Recognition: Toward Natural and Intuitive Human-Computer	Proposes a natural gesture recognition method based on a 3D hand skeleton model, improving the fluency of human-computer interaction.	multimodal
12	Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search	Proposes CoMCTS, giving MLLMs o1-like reasoning and reflection abilities for solving complex problems.	multimodal	✅
13	ERPA: Efficient RPA Model Integrating OCR and LLMs for Intelligent Document Processing	ERPA: an efficient RPA model integrating OCR and LLMs for intelligent document processing.	large language model
14	Expand VSR Benchmark for VLLM to Expertize in Spatial Rules	Expands the VSR benchmark to improve VLLM capability on spatial rules.	large language model	✅
15	Semantics Disentanglement and Composition for Versatile Codec toward both Human-eye Perception and Machine Vision Task	Proposes the DISCOVER codec, which disentangles and composes semantics to serve both human-eye perception and machine vision tasks.	multimodal
16	MMFactory: A Universal Solution Search Engine for Vision-Language Tasks	MMFactory: a universal solution search engine for vision-language tasks.	multimodal	✅

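🔬 Pillar 3: Perception & Semantics (5 papers)

#	Title	One-Line Summary	Tags	🔗	⭐
17	RSGaussian:3D Gaussian Splatting with LiDAR for Aerial Remote Sensing Novel View Synthesis	RSGaussian: 3D Gaussian splatting with LiDAR constraints for novel view synthesis in aerial remote sensing.	depth estimation 3D gaussian splatting gaussian splatting
18	3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding	Proposes 3DGraphLLM, combining semantic graphs with large language models for 3D scene understanding.	scene understanding large language model	✅
19	FlameGS: Reconstruct flame light field via Gaussian Splatting	Proposes FlameGS to address the computational inefficiency of traditional flame diagnostic algorithms.	gaussian splatting splatting
20	Sampling Bag of Views for Open-Vocabulary Object Detection	Proposes a concept-based sampling method for bags of views that improves the performance and efficiency of open-vocabulary object detection.	open-vocabulary open vocabulary
21	Parallel Neural Computing for Scene Understanding from LiDAR Perception in Autonomous Racing	Proposes PPN, a parallel perception network that accelerates LiDAR scene understanding and improves autonomous racing performance.	scene understanding	✅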

🔬 Pillar 2: RL & Architecture (4 papers)

#	Title	One-Line Summary	Tags	🔗	⭐
22	COMO: Cross-Mamba Interaction and Offset-Guided Fusion for Multimodal Object Detection	Proposes the COMO framework, which uses Cross-Mamba interaction and offset-guided fusion to address alignment problems in multimodal object detection.	Mamba multimodal
23	UniPLV: Towards Label-Efficient Open-World 3D Scene Understanding by Regional Visual Language Supervision	UniPLV: label-efficient open-world 3D scene understanding through regional visual-language supervision.	distillation scene understanding multimodal
24	DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers	DrivingGPT: unifying driving world modeling and planning with multimodal autoregressive Transformers.	world model multimodal
25	HTR-JAND: Handwritten Text Recognition with Joint Attention Network and Knowledge Distillation	HTR-JAND: a handwritten text recognition framework combining a joint attention network with knowledge distillation.	curriculum learning distillation	✅

🔬 Pillar 6: Video Extraction (2 papers)

#	Title	One-Line Summary	Tags	🔗	⭐
26	VORTEX: A Spatial Computing Framework for Optimized Drone Telemetry Extraction from First-Person View Flight Data	VORTEX: a spatial computing framework for optimized drone telemetry extraction from first-person view flight data.	first-person view
27	Switch-a-View: View Selection Learned from Unlabeled In-the-wild Videos	Proposes Switch-a-View, which learns view selection from unlabeled videos for generating instructional videos.	egocentric

🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	One-Line Summary	Tags	🔗	⭐
28	ZeroHSI: Zero-Shot 4D Human-Scene Interaction by Video Generation	ZeroHSI: zero-shot 4D human-scene interaction via video generation.	human-scene interaction HSI embodied AI

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-Line Summary	Tags	🔗	⭐
29	FameBias: Embedding Manipulation Bias Attack in Text-to-Image Models	FameBias: an embedding manipulation bias attack on text-to-image models that requires no model training.	manipulation

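🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-Line Summary	Tags	🔗	⭐
30	Quo Vadis, Anomaly Detection? LLMs and VLMs in the Spotlight	Uses LLMs/VLMs to enhance the explainability, temporal reasoning, and generalization of video anomaly detection.	spatiotemporal large language model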
