| 1 | AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model | AndesVL: an efficient multimodal large language model for mobile devices that balances performance and efficiency | large language model, multimodal | ✅ | |
| 2 | InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models | InternSVG: unified handling of SVG tasks with multimodal large language models | large language model, multimodal | | |
| 3 | FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models | FlexAC: flexible control of associative reasoning in multimodal large language models | large language model, multimodal | ✅ | |
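| 4 | A Survey on Agentic Multimodal Large Language Models | A survey of agentic multimodal large language models, covering the applications and development of autonomous agents in dynamic environments | large language model, multimodal | ✅ | |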
| 5 | BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models | BLEnD-Vis: a multimodal cultural-understanding benchmark that evaluates the robustness of cultural knowledge in vision-language models | multimodal, visual grounding | | |
| 6 | CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images | Proposes CodePlot-CoT, which tackles mathematical visual reasoning through chain-of-thought over code-driven images | large language model, multimodal, chain-of-thought | ✅ | |
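| 7 | ExpVid: A Benchmark for Experiment Video Understanding & Reasoning | ExpVid: a benchmark dataset for experiment video understanding and reasoning that tests multimodal large language models on scientific experiments | large language model, multimodal, visual grounding | | |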
| 8 | MS-Mix: Unveiling the Power of Mixup for Multimodal Sentiment Analysis | Proposes MS-Mix to address data scarcity in multimodal sentiment analysis | multimodal | ✅ | |
| 9 | Benchmarking foundation models for hyperspectral image classification: Application to cereal crop type mapping | Evaluates foundation models on hyperspectral image classification, applied to cereal crop type mapping | foundation model | | |
| 10 | How many samples to label for an application given a foundation model? Chest X-ray classification study | Studies how a pretrained model can reduce the number of labeled samples needed for chest X-ray classification | foundation model | | |
| 11 | A Large-Language-Model Assisted Automated Scale Bar Detection and Extraction Framework for Scanning Electron Microscopic Images | Proposes an LLM-based framework for automatically detecting and extracting scale bars from scanning electron microscope images | large language model | | |
| 12 | CoPRS: Learning Positional Prior from Chain-of-Thought for Reasoning Segmentation | CoPRS: learns a positional prior from chain-of-thought to improve both the performance and the interpretability of reasoning segmentation | chain-of-thought | ✅ | |
| 13 | Connecting Giants: Synergistic Knowledge Transfer of Large Multimodal Models for Few-Shot Learning | Proposes the SynTrans framework, which improves few-shot learning through synergistic knowledge transfer from large multimodal models | multimodal | | |
| 14 | Mixup Helps Understanding Multimodal Video Better | Proposes a multimodal Mixup method that improves the generalization and robustness of multimodal video understanding models | multimodal | | |
| 15 | IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment | IVEBench: a modern benchmark suite for evaluating instruction-guided video editing | large language model, multimodal | | |
| 16 | ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments? | Proposes ODI-Bench to evaluate how well MLLMs understand omnidirectional (panoramic) images, and introduces the Omni-CoT method | large language model, chain-of-thought | | |
| 17 | GIR-Bench: Versatile Benchmark for Generating Images with Reasoning | Proposes GIR-Bench to address the insufficient evaluation of multimodal models | large language model, multimodal | ✅ | |
| 18 | COCO-Tree: Compositional Hierarchical Concept Trees for Enhanced Reasoning in Vision Language Models | Proposes COCO-Tree, which uses neurosymbolic concept trees to strengthen compositional reasoning in vision-language models | large language model, chain-of-thought | | |
| 19 | EvoCAD: Evolutionary CAD Code Generation with Vision Language Models | EvoCAD: generating CAD code with vision-language models and evolutionary algorithms | large language model | | |
| 20 | Enhancing Zero-Shot Anomaly Detection: CLIP-SAM Collaboration with Cascaded Prompts | Proposes a two-stage framework built on CLIP-SAM collaboration and cascaded prompts to improve zero-shot anomaly detection | foundation model | | |
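| 21 | IUT-Plug: A Plug-in tool for Interleaved Image-Text Generation | Proposes the IUT-Plug plug-in, which uses explicit structured reasoning to improve contextual consistency in interleaved image-text generation | multimodal | | |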
| 22 | FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model | Proposes FG-CLIP 2 to improve fine-grained vision-language alignment in bilingual English-Chinese settings | multimodal | | |