| 1 |
Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models |
Proposes MemVR, which mitigates hallucination in multimodal large language models through visual retracing |
large language model multimodal |
✅ |
|
| 2 |
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models |
Grounded-VideoLLM: improves fine-grained temporal grounding in video large language models |
large language model TAMP |
|
|
| 3 |
A Multimodal Framework for Deepfake Detection |
Proposes a multimodal deepfake detection framework that fuses visual and auditory information to improve detection accuracy. |
multimodal |
|
|
| 4 |
Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning |
Proposes the Visual-O1 framework, which uses multimodal multi-turn CoT reasoning to resolve ambiguous instructions in visual tasks |
chain-of-thought |
|
|
| 5 |
Frame-Voyager: Learning to Query Frames for Video Large Language Models |
Proposes Frame-Voyager, which learns to query combinations of video frames to improve Video-LLM performance on video understanding tasks. |
large language model |
|
|
| 6 |
ARB-LLM: Alternating Refined Binarizations for Large Language Models |
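Proposes ARB-LLM, which achieves efficient 1-bit quantization of large language models by alternately refining the binarization parameters |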
large language model |
✅ |
|
| 7 |
Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition |
Audio-Agent: uses LLMs for high-quality audio generation, editing, and composition |
large language model multimodal |
|
|
| 8 |
Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models |
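Proposes methods for detecting and mitigating cross-modality parametric knowledge conflicts, improving the performance of large vision-language models |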
multimodal |
|
|
| 9 |
An X-Ray Is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation |
Proposes SAE-Rad, which uses sparse autoencoders to improve the interpretability and efficiency of radiology report generation. |
multimodal |
|
|
| 10 |
Bridging the Gap between Text, Audio, Image, and Any Sequence: A Novel Approach using Gloss-based Annotation |
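Proposes the BGTAI model, which uses gloss-based annotation to bridge the gap in understanding across text, audio, image, and other modalities. |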
multimodal |
|
|
| 11 |
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark |
Proposes AuroraCap, an efficient model for detailed video captioning, and builds a new VDC benchmark. |
multimodal |
|
|