| 1 |
ICM-Assistant: Instruction-tuning Multimodal Large Language Models for Rule-based Explainable Image Content Moderation |
Proposes ICM-Assistant for rule-based, explainable image content moderation, significantly improving performance. |
large language model multimodal |
✅ |
|
| 2 |
TextMatch: Enhancing Image-Text Consistency Through Multimodal Optimization |
TextMatch: enhancing image-text consistency through multimodal optimization |
large language model multimodal chain-of-thought |
|
|
| 3 |
VisionLLM-based Multimodal Fusion Network for Glottic Carcinoma Early Detection |
Proposes MMGC-Net, a VisionLLM-based multimodal fusion network for early detection of glottic carcinoma. |
large language model multimodal |
|
|
| 4 |
An Ensemble Approach to Short-form Video Quality Assessment Using Multimodal LLM |
Proposes an ensemble method based on multimodal LLMs for short-form video quality assessment, improving generalization. |
large language model multimodal |
|
|
| 5 |
Multimodal joint prediction of traffic spatial-temporal data with graph sparse attention mechanism and bidirectional temporal convolutional network |
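Proposes the GSABT model, which uses a graph sparse attention mechanism and a bidirectional temporal convolutional network for joint prediction of multimodal traffic spatial-temporal data. |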
multimodal |
|
|
| 6 |
AdaCo: Overcoming Visual Foundation Model Noise in 3D Semantic Segmentation via Adaptive Label Correction |
AdaCo: overcoming visual foundation model noise in 3D semantic segmentation via adaptive label correction |
foundation model |
|
|
| 7 |
BIG-MoE: Bypass Isolated Gating MoE for Generalized Multimodal Face Anti-Spoofing |
Proposes BIG-MoE to address the isolated gating problem in multimodal face anti-spoofing |
multimodal |
✅ |
|
| 8 |
Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation |
Surveys the latest trends in long video generation, covering generative models, strategies, datasets, and evaluation metrics. |
large language model multimodal |
|
|
| 9 |
RDPM: Solve Diffusion Probabilistic Models via Recurrent Token Prediction |
Proposes RDPM, which solves diffusion probabilistic models via recurrent token prediction, enabling discrete diffusion. |
large language model multimodal |
|
|
| 10 |
Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach |
Unveiling visual perception in language models: an attention head analysis approach |
large language model multimodal |
|
|
| 11 |
Computer Vision-Driven Gesture Recognition: Toward Natural and Intuitive Human-Computer |
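Proposes a natural gesture recognition method based on a 3D hand skeleton model, improving the fluency of human-computer interaction. |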
multimodal |
|
|
| 12 |
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search |
Proposes CoMCTS, giving MLLMs o1-like reasoning and reflection capabilities for solving complex problems. |
multimodal |
✅ |
|
| 13 |
ERPA: Efficient RPA Model Integrating OCR and LLMs for Intelligent Document Processing |
ERPA: an efficient RPA model integrating OCR and LLMs for intelligent document processing |
large language model |
|
|
| 14 |
Expand VSR Benchmark for VLLM to Expertize in Spatial Rules |
Expands the VSR benchmark to improve VLLM capability on spatial rules |
large language model |
✅ |
|
| 15 |
Semantics Disentanglement and Composition for Versatile Codec toward both Human-eye Perception and Machine Vision Task |
Proposes the DISCOVER codec, which disentangles and composes semantics to serve both human-eye perception and machine vision tasks |
multimodal |
|
|
| 16 |
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks |
MMFactory: a universal solution search engine for vision-language tasks |
multimodal |
✅ |
|