| 1 |
ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models |
Proposes ViTCoT, a video-text interleaved chain-of-thought that improves video understanding in large language models |
embodied AI large language model chain-of-thought |
|
|
| 2 |
FaceLLM: A Multimodal Large Language Model for Face Understanding |
FaceLLM: a multimodal large language model for face understanding that improves performance on face-related tasks. |
large language model multimodal |
|
|
| 3 |
Test-Time Canonicalization by Foundation Models for Robust Perception |
Proposes FOCAL, which uses pretrained models to canonicalize inputs at test time, improving the robustness of perception systems. |
foundation model |
✅ |
|
| 4 |
Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection |
SynOOD: uses generative models to synthesize near-boundary OOD samples, improving OOD detection performance |
large language model foundation model multimodal |
✅ |
|
| 5 |
Boosting Multimodal Learning via Disentangled Gradient Learning |
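Proposes DGL, a disentangled gradient learning framework that resolves the optimization conflict between modality encoders and the fusion module in multimodal learning. |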
multimodal |
✅ |
|
| 6 |
CoSMo: A Multimodal Transformer for Page Stream Segmentation in Comic Books |
Proposes CoSMo, a multimodal Transformer for page stream segmentation in comic books |
multimodal |
|
|
| 7 |
(Almost) Free Modality Stitching of Foundation Models |
Proposes Hyma, a framework that uses hypernetworks to stitch multimodal models efficiently and select the best unimodal models. |
foundation model |
|
|
| 8 |
IGD: Instructional Graphic Design with Multimodal Layer Generation |
Proposes IGD, which produces editable instructional graphic designs through multimodal layer generation |
multimodal |
|
|
| 9 |
Text-Visual Semantic Constrained AI-Generated Image Quality Assessment |
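Proposes the SC-AGIQA framework, which improves the accuracy of AI-generated image quality assessment through text-visual semantic constraints. |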
large language model multimodal |
✅ |
|
| 10 |
DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs |
Proposes DisCo, which improves the semantic distinctness and temporal coherence of visual encapsulation in video MLLMs |
large language model multimodal |
✅ |
|
| 11 |
A Training-Free, Task-Agnostic Framework for Enhancing MLLM Performance on High-Resolution Images |
Proposes the training-free ECP framework, which improves the fine-grained localization and reasoning of MLLMs on high-resolution images |
large language model multimodal |
✅ |
|
| 12 |
Can GPT-4o mini and Gemini 2.0 Flash Predict Fine-Grained Fashion Product Attributes? A Zero-Shot Analysis |
Zero-shot analysis: evaluating how well GPT-4o mini and Gemini 2.0 Flash predict fine-grained fashion product attributes |
large language model multimodal |
✅ |
|
| 13 |
A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends |
Survey: methods, challenges, and emerging trends in MLLM-based visually rich document understanding |
large language model multimodal |
|
|
| 14 |
DEARLi: Decoupled Enhancement of Recognition and Localization for Semi-supervised Panoptic Segmentation |
DEARLi: decouples the enhancement of recognition and localization for semi-supervised panoptic segmentation |
foundation model |
✅ |
|
| 15 |
Cross-modal Associations in Vision and Language Models: Revisiting the Bouba-Kiki Effect |
Revisiting the Bouba-Kiki effect: evaluating cross-modal association in vision-language models |
multimodal |
|
|
| 16 |
Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction |
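Proposes a generative audio language model based on continuous-valued tokens and masked next-token prediction, improving audio generation quality. |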
large language model |
|
|