cs.CV (2024-12-24)

📊 30 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (16 🔗5) · Pillar 3: Perception & Semantics (5 🔗2) · Pillar 2: RL & Architecture (4 🔗1) · Pillar 6: Video Extraction (2) · Pillar 5: Interaction & Reaction (1) · Pillar 1: Robot Control (1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (16 papers)

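# | Title | Key point | Tags | 🔗
1 | ICM-Assistant: Instruction-tuning Multimodal Large Language Models for Rule-based Explainable Image Content Moderation | Proposes ICM-Assistant for rule-based, explainable image content moderation, significantly improving performance. | large language model, multimodal
2 | TextMatch: Enhancing Image-Text Consistency Through Multimodal Optimization | TextMatch: enhances image-text consistency through multimodal optimization. | large language model, multimodal, chain-of-thought
3 | VisionLLM-based Multimodal Fusion Network for Glottic Carcinoma Early Detection | Proposes MMGC-Net, a VisionLLM-based multimodal fusion network for early detection of glottic carcinoma. | large language model, multimodal
4 | An Ensemble Approach to Short-form Video Quality Assessment Using Multimodal LLM | Proposes an ensemble method for short-form video quality assessment based on multimodal LLMs, improving generalization. | large language model, multimodal
5 | Multimodal joint prediction of traffic spatial-temporal data with graph sparse attention mechanism and bidirectional temporal convolutional network | Proposes the GSABT model, which uses a graph sparse attention mechanism and a bidirectional temporal convolutional network for joint prediction of multimodal traffic spatial-temporal data. | multimodal
6 | AdaCo: Overcoming Visual Foundation Model Noise in 3D Semantic Segmentation via Adaptive Label Correction | AdaCo: overcomes visual foundation model noise in 3D semantic segmentation via adaptive label correction. | foundation model
7 | BIG-MoE: Bypass Isolated Gating MoE for Generalized Multimodal Face Anti-Spoofing | Proposes BIG-MoE to address the isolated gating problem in multimodal face anti-spoofing. | multimodal
8 | Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation | Surveys the latest trends in long video generation, covering generative models, strategies, datasets, and evaluation metrics. | large language model, multimodal
9 | RDPM: Solve Diffusion Probabilistic Models via Recurrent Token Prediction | Proposes RDPM, which solves diffusion probabilistic models via recurrent token prediction, enabling discrete diffusion. | large language model, multimodal
10 | Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach | Unveils visual perception in language models through an attention-head-based analysis approach. | large language model, multimodal
11 | Computer Vision-Driven Gesture Recognition: Toward Natural and Intuitive Human-Computer | Proposes a natural gesture recognition method based on a 3D hand skeleton model, improving the fluency of human-computer interaction. | multimodal
12 | Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search | Proposes CoMCTS to give MLLMs o1-like reasoning and reflection capabilities for solving complex problems. | multimodal
13 | ERPA: Efficient RPA Model Integrating OCR and LLMs for Intelligent Document Processing | ERPA: an efficient RPA model integrating OCR and LLMs for intelligent document processing. | large language model
14 | Expand VSR Benchmark for VLLM to Expertize in Spatial Rules | Expands the VSR benchmark to improve VLLM capability on spatial rules. | large language model
15 | Semantics Disentanglement and Composition for Versatile Codec toward both Human-eye Perception and Machine Vision Task | Proposes the DISCOVER codec, which disentangles and composes semantics to serve both human-eye perception and machine vision tasks. | multimodal
16 | MMFactory: A Universal Solution Search Engine for Vision-Language Tasks | MMFactory: a universal solution search engine for vision-language tasks. | multimodal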

🔬 Pillar 3: Perception & Semantics (5 papers)

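# | Title | Key point | Tags | 🔗
17 | RSGaussian:3D Gaussian Splatting with LiDAR for Aerial Remote Sensing Novel View Synthesis | RSGaussian: 3D Gaussian Splatting with LiDAR constraints for novel view synthesis in aerial remote sensing. | depth estimation, 3D gaussian splatting, gaussian splatting
18 | 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding | Proposes 3DGraphLLM, combining semantic graphs and large language models for 3D scene understanding. | scene understanding, large language model
19 | FlameGS: Reconstruct flame light field via Gaussian Splatting | Proposes FlameGS to address the computational efficiency problem of traditional flame diagnostic algorithms. | gaussian splatting, splatting
20 | Sampling Bag of Views for Open-Vocabulary Object Detection | Proposes a concept-sampling-based bag-of-views method that improves the performance and efficiency of open-vocabulary object detection. | open-vocabulary, open vocabulary
21 | Parallel Neural Computing for Scene Understanding from LiDAR Perception in Autonomous Racing | Proposes the Parallel Perception Network (PPN) to accelerate LiDAR scene understanding and improve autonomous racing performance. | scene understanding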

🔬 Pillar 2: RL & Architecture (4 papers)

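# | Title | Key point | Tags | 🔗
22 | COMO: Cross-Mamba Interaction and Offset-Guided Fusion for Multimodal Object Detection | Proposes the COMO framework, which uses Cross-Mamba interaction and offset-guided fusion to address alignment issues in multimodal object detection. | Mamba, multimodal
23 | UniPLV: Towards Label-Efficient Open-World 3D Scene Understanding by Regional Visual Language Supervision | UniPLV: label-efficient open-world 3D scene understanding via regional visual-language supervision. | distillation, scene understanding, multimodal
24 | DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers | DrivingGPT: unifies driving world modeling and planning with multimodal autoregressive Transformers. | world model, multimodal
25 | HTR-JAND: Handwritten Text Recognition with Joint Attention Network and Knowledge Distillation | HTR-JAND: a handwritten text recognition framework combining a joint attention network with knowledge distillation. | curriculum learning, distillation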

🔬 Pillar 6: Video Extraction (2 papers)

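# | Title | Key point | Tags | 🔗
26 | VORTEX: A Spatial Computing Framework for Optimized Drone Telemetry Extraction from First-Person View Flight Data | VORTEX: a spatial computing framework for optimized drone telemetry extraction from first-person view flight data. | first-person view
27 | Switch-a-View: View Selection Learned from Unlabeled In-the-wild Videos | Proposes Switch-a-View, which learns view selection from unlabeled videos for generating instructional videos. | egocentric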

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | Key point | Tags | 🔗
28 | ZeroHSI: Zero-Shot 4D Human-Scene Interaction by Video Generation | ZeroHSI: zero-shot 4D human-scene interaction via video generation. | human-scene interaction, HSI, embodied AI

🔬 Pillar 1: Robot Control (1 paper)

# | Title | Key point | Tags | 🔗
29 | FameBias: Embedding Manipulation Bias Attack in Text-to-Image Models | FameBias: an embedding manipulation bias attack on text-to-image models that requires no model training. | manipulation

🔬 Pillar 8: Physics-based Animation (1 paper)

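# | Title | Key point | Tags | 🔗
30 | Quo Vadis, Anomaly Detection? LLMs and VLMs in the Spotlight | Uses LLMs/VLMs to enhance the interpretability, temporal reasoning, and generalization of video anomaly detection. | spatiotemporal, large language model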
