Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models
Authors: Seyed Amir Ahmad Safavi-Naini, Shuhaib Ali, Omer Shahab, Zahra Shahhoseini, Thomas Savage, Sara Rafiee, Jamil S Samaan, Reem Al Shabeeb, Farah Ladak, Jamie O Yang, Juan Echavarria, Sumbal Babar, Aasma Shaukat, Samuel Margolis, Nicholas P Tatonetti, Girish Nadkarni, Bara El Kurdi, Ali Soroush
Categories: cs.CL, cs.AI
Published: 2024-08-25 (Updated: 2024-09-04)
Notes: Manuscript Pages: 34, Figures: 7, Tables: 2, Supplementary File Pages: 35, Data Transparency Statement: Code is available at: https://github.com/Sdamirsa/LLM-VLM-in-Gastroenterology . Study data from American College of Gastroenterology (ACG) are restricted and available upon request with ACG permission. Correction: updated abstract considering Llama3.1 results
DOI: 10.1038/s41746-025-02174-0
💡 One-Sentence Takeaway
Evaluates the medical reasoning ability of a broad range of language models in gastroenterology.
🎯 Matched Domain: Pillar 9: Embodied Foundation Models
Keywords: large language models, vision-language models, medical reasoning, multimodal learning, quantized models
📋 Key Points
- Existing multimodal models struggle to integrate visual data into medical reasoning, performing poorly on image-based questions in particular.
- This study systematically evaluates language models on gastroenterology board exam-style questions to identify best practices for model configuration and prompt engineering.
- GPT-4o and Claude3.5-Sonnet outperformed the open-source models in accuracy, while providing images to VLMs did not improve performance on image-containing questions.
📝 Abstract (Translated)
This study evaluates the medical reasoning performance of large language models (LLMs) and vision-language models (VLMs) in gastroenterology. Using 300 gastroenterology board exam-style multiple-choice questions, the authors systematically analyzed the impact of model configurations, parameters, and prompt engineering strategies. GPT-4o and Claude3.5-Sonnet achieved the highest accuracy, whereas VLMs did not improve on image-containing questions when given the images and performed worse when given LLM-generated image captions. In contrast, accuracy increased by 10% when images were accompanied by human-crafted descriptions.
🔬 Methods in Detail
Problem definition: The study addresses the limited medical reasoning ability of LLMs and VLMs in gastroenterology, and in particular how to integrate visual information effectively; existing approaches often fall short of expectations when handling images.
Core idea: Systematically evaluate model configurations, parameters, and prompt engineering strategies to identify best practices for improving performance on medical reasoning tasks.
Technical framework: The study uses 300 multiple-choice questions, 138 of which contain images, to evaluate multiple language models (GPT, Claude, Llama, and others) across different interfaces and computing environments. Model accuracy is assessed with a semi-automated pipeline.
Key innovation: A systematic comparison of many language models on medical reasoning, with particular attention to how they handle visual data, exposing the limitations of VLMs on image-based questions.
Key design: The study covers multiple model versions (e.g., GPT-3.5, GPT-4), accounts for model quantization precision, and applies several prompt engineering strategies to optimize performance.
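A zero-shot board-exam evaluation like this hinges on a consistent question prompt. The template below is a hypothetical sketch of that kind of prompt; the study's exact wording and system messages are not reproduced here, and the sample question is invented for illustration.

```python
# Illustrative zero-shot MCQ prompt template; the exact prompts used in the
# study are assumptions here, not taken from the paper or its repository.
PROMPT_TEMPLATE = (
    "You are taking a gastroenterology board exam.\n"
    "Answer the following multiple-choice question with a single option "
    "letter on the first line, followed by a brief justification.\n\n"
    "Question: {question}\n"
    "Options:\n{options}\n"
    "Answer:"
)


def build_prompt(question: str, options: dict[str, str]) -> str:
    """Render one exam question into the shared zero-shot template."""
    formatted = "\n".join(f"{k}. {v}" for k, v in sorted(options.items()))
    return PROMPT_TEMPLATE.format(question=question, options=formatted)


print(build_prompt(
    "Which vitamin deficiency is expected after terminal ileum resection?",
    {"A": "Vitamin C", "B": "Vitamin B12", "C": "Vitamin K"},
))
```

Holding the template fixed while varying only the model, interface, and precision is what makes accuracy differences attributable to the model configuration rather than to prompt wording.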
📊 Experimental Highlights
GPT-4o and Claude3.5-Sonnet reached 73.7% and 74.0% accuracy, respectively, significantly higher than the open-source models Llama3.1-405b (64%) and Mixtral-8x7b (54.3%). Among quantized models, the 6-bit Phi3-14b scored best at 48.7%, comparable to the full-precision models Llama2-7b, Llama2-13b, and Gemma2-9b. Notably, VLM performance on image questions did not improve when images were provided, and worsened when LLM-generated captions were used.
🎯 Application Scenarios
Potential applications include medical education, clinical decision support, and medical image analysis. Improving model accuracy in medical reasoning could give clinicians more effective assistive tools and improve patient care. The findings may also inform the deployment of multimodal models in other medical domains.
📄 Abstract (Original)
Background and Aims: This study evaluates the medical reasoning performance of large language models (LLMs) and vision language models (VLMs) in gastroenterology. Methods: We used 300 gastroenterology board exam-style multiple-choice questions, 138 of which contain images, to systematically assess the impact of model configurations, parameters, and prompt engineering strategies utilizing GPT-3.5. Next, we assessed the performance of proprietary and open-source LLMs (versions), including GPT (3.5, 4, 4o, 4o-mini), Claude (3, 3.5), Gemini (1.0), Mistral, Llama (2, 3, 3.1), Mixtral, and Phi (3), across different interfaces (web and API), computing environments (cloud and local), and model precisions (with and without quantization). Finally, we assessed accuracy using a semiautomated pipeline. Results: Among the proprietary models, GPT-4o (73.7%) and Claude3.5-Sonnet (74.0%) achieved the highest accuracy, outperforming the top open-source models: Llama3.1-405b (64%), Llama3.1-70b (58.3%), and Mixtral-8x7b (54.3%). Among the quantized open-source models, the 6-bit quantized Phi3-14b (48.7%) performed best. The scores of the quantized models were comparable to those of the full-precision models Llama2-7b, Llama2-13b, and Gemma2-9b. Notably, VLM performance on image-containing questions did not improve when the images were provided and worsened when LLM-generated captions were provided. In contrast, a 10% increase in accuracy was observed when images were accompanied by human-crafted image descriptions. Conclusion: While LLMs exhibit robust zero-shot performance in medical reasoning, the integration of visual data remains a challenge for VLMs. Effective deployment involves carefully determining optimal model configurations, encouraging users to consider either the high performance of proprietary models or the flexible adaptability of open-source models.