| 1 | AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model | AndesVL: an efficient multimodal large language model for mobile devices that balances performance and efficiency | large language model multimodal | ✅ | |
| 2 | InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models | InternSVG: unified handling of SVG tasks with multimodal large language models | large language model multimodal | | |
| 3 | FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models | Proposes FlexAC to address the lack of flexibility in associative reasoning in multimodal large language models | large language model multimodal | ✅ | |
| 4 | A Survey on Agentic Multimodal Large Language Models | Surveys agentic multimodal large language models, exploring their emergent intelligence and applications in dynamic environments | large language model multimodal | ✅ | |
| 5 | BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models | BLEnD-Vis: builds a multimodal cultural understanding benchmark to evaluate the robustness of vision language models on cultural knowledge | multimodal visual grounding | ✅ | |
| 6 | CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images | Proposes CodePlot-CoT, which tackles mathematical visual reasoning through a chain of thought built on code-driven images | large language model multimodal chain-of-thought | ✅ | |
| 7 | ExpVid: A Benchmark for Experiment Video Understanding & Reasoning | ExpVid: a new benchmark for evaluating multimodal large language models on understanding and reasoning over scientific experiment videos | large language model multimodal visual grounding | | |
| 8 | MS-Mix: Unveiling the Power of Mixup for Multimodal Sentiment Analysis | Proposes MS-Mix, a sentiment-aware Mixup augmentation method that improves multimodal sentiment analysis performance | multimodal | ✅ | |
| 9 | Benchmarking foundation models for hyperspectral image classification: Application to cereal crop type mapping | Benchmarks foundation models for hyperspectral image classification, applied to cereal crop type mapping | foundation model | | |
| 10 | How many samples to label for an application given a foundation model? Chest X-ray classification study | Studies how pretrained models can reduce the number of labeled samples needed for chest X-ray classification | foundation model | | |
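| 11 | A Large-Language-Model Assisted Automated Scale Bar Detection and Extraction Framework for Scanning Electron Microscopic Images | Proposes a large-language-model-based framework for automated scale bar detection and extraction in scanning electron microscope images | large language model | | |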
| 12 | CoPRS: Learning Positional Prior from Chain-of-Thought for Reasoning Segmentation | CoPRS: proposes learning a positional prior from chain-of-thought to improve the performance and interpretability of reasoning segmentation | chain-of-thought | ✅ | |
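| 13 | Connecting Giants: Synergistic Knowledge Transfer of Large Multimodal Models for Few-Shot Learning | Proposes the SynTrans framework, which uses synergistic knowledge transfer from large multimodal models to improve few-shot learning performance | multimodal | | |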
| 14 | Mixup Helps Understanding Multimodal Video Better | Proposes a multimodal Mixup method to address modality overfitting in multimodal video understanding | multimodal | | |
| 15 | IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment | Proposes IVEBench to address shortcomings in the evaluation of instruction-guided video editing | large language model multimodal | | |
| 16 | ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments? | Proposes the ODI-Bench benchmark to test MLLMs on omnidirectional image understanding, and proposes the Omni-CoT method | large language model chain-of-thought | | |
| 17 | video-SALMONN S: Memory-Enhanced Streaming Audio-Visual LLM | Proposes video-SALMONN S, which uses test-time training to strengthen the memory of a streaming LLM over long audio-visual input | large language model multimodal | | |
| 18 | GIR-Bench: Versatile Benchmark for Generating Images with Reasoning | GIR-Bench: a comprehensive benchmark for evaluating the reasoning ability of image generation models | large language model multimodal | ✅ | |
| 19 | COCO-Tree: Compositional Hierarchical Concept Trees for Enhanced Reasoning in Vision Language Models | Proposes COCO-Tree, which uses neurosymbolic concept trees to enhance compositional reasoning in vision language models | large language model chain-of-thought | | |
| 20 | Bringing The Consistency Gap: Explicit Structured Memory for Interleaved Image-Text Generation | Proposes IUT-Plug, which uses explicit structured memory to address multimodal context drift in interleaved image-text generation | multimodal symbolic grounding | | |
| 21 | EvoCAD: Evolutionary CAD Code Generation with Vision Language Models | EvoCAD: generates CAD code using vision language models and evolutionary algorithms | large language model | | |
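| 22 | Enhancing Zero-Shot Anomaly Detection: CLIP-SAM Collaboration with Cascaded Prompts | Proposes a two-stage framework combining CLIP-SAM collaboration with cascaded prompts to improve zero-shot anomaly detection performance | foundation model | | |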
| 23 | FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model | Proposes FG-CLIP 2 to improve bilingual (English-Chinese) fine-grained vision-language alignment | multimodal | | |