| 1 | Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models | Proposes an alignment-for-answerability framework that improves video LLMs' ability to refuse to answer irrelevant questions. | large language model multimodal | | |
| 2 | ReLoop: "Seeing Twice and Thinking Backwards" via Closed-loop Training to Mitigate Hallucinations in Multimodal understanding | Proposes ReLoop, a closed-loop training framework that mitigates hallucinations in multimodal large language models. | large language model multimodal | | |
| 3 | VectorLLM: Human-like Extraction of Structured Building Contours vis Multimodal LLMs | Proposes VectorLLM to address the problem of building contour extraction. | large language model multimodal | | |
| 4 | MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding | Proposes MODA, which uses a modular duplex attention mechanism to enhance multimodal perception, cognition, and emotion understanding. | large language model multimodal | | |
| 5 | Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing | Proposes X-Planner, which uses an MLLM to plan complex instruction-based image edits, improving edit quality and identity preservation. | large language model multimodal chain-of-thought | | |
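| 6 | Differential Attention for Multimodal Crisis Event Analysis | Proposes a differential attention mechanism that improves feature alignment and classification performance in multimodal crisis event analysis. | multimodal | ✅ | |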
| 7 | MurreNet: Modeling Holistic Multimodal Interactions Between Histopathology and Genomic Profiles for Survival Prediction | MurreNet: models holistic multimodal interactions between histopathology and genomic profiles for survival prediction. | multimodal | | |
| 8 | Geometric-Guided Few-Shot Dental Landmark Detection with Human-Centric Foundation Model | GeoSapiens: few-shot dental landmark detection that combines geometric constraints with a human-centric foundation model. | foundation model | ✅ | |
| 9 | HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding | Proposes HV-MMBench, a benchmark for comprehensively evaluating MLLMs on human-centric video understanding. | large language model multimodal | | |
| 10 | From Imitation to Innovation: The Emergence of AI Unique Artistic Styles and the Challenge of Copyright Protection | Proposes the ArtBulb framework for AI art copyright assessment and builds AICD, the first AI art copyright dataset. | large language model multimodal | | |
| 11 | An analysis of vision-language models for fabric retrieval | Proposes a zero-shot fabric retrieval approach based on vision-language models, using multimodal LLMs for automatic annotation. | large language model multimodal | | |
| 12 | Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts | Proposes SMoEStereo, which uses a selective mixture-of-experts model to improve the robustness of stereo matching in complex scenes. | foundation model | ✅ | |
| 13 | SPARC: Concept-Aligned Sparse Autoencoders for Cross-Model and Cross-Modal Interpretability | SPARC: concept-aligned sparse autoencoders that enable cross-model and cross-modal interpretability. | multimodal | ✅ | |
| 14 | Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model | Llama Nemoretriever Colembed: a high-performance cross-modal text-image retrieval model. | multimodal | | |
| 15 | INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling | Proposes INTER, which mitigates hallucinations in large vision-language models through interaction guidance sampling. | multimodal | | |
| 16 | Transcribing Spanish Texts from the Past: Experiments with Transkribus, Tesseract and Granite | The GRESEL team explores several OCR approaches for transcribing historical Spanish texts and compares them for the PastReader task. | multimodal | | |