Fine-tuning a vision-language model for fracture-surface morphology recognition

作者: Quanliang Liu, Jungtaek Kim, Kangwook Lee, Hyunseok Oh

分类: cond-mat.mtrl-sci, cs.CV

发布日期: 2026-05-08

💡 一句话要点

提出基于Qwen3-VL的微调框架，显著提升断口形貌识别的专业精度

🎯 匹配领域: 支柱九：具身大模型 (Embodied Foundation Models)

关键词: 视觉语言模型 断口形貌分析 监督微调 材料科学 多模态学习 科学图像理解 自主显微分析

📋 核心要点

通用视觉-语言模型在处理材料科学中复杂的断口形貌图像时，因缺乏领域特定知识，导致识别精度不足，难以满足科研需求。
通过构建高质量文献挖掘数据集，结合大模型自动标注与人工增强，对Qwen3-VL-32B-Instruct进行针对性微调，注入专业领域知识。
实验结果显示，微调后的模型在断口形貌识别任务中精度达到0.92，显著超越了当前主流的通用多模态大模型，验证了领域微调的有效性。

📝 摘要（中文）

视觉-语言模型（VLMs）在科学图像理解方面展现出巨大潜力，但通用模型往往缺乏材料表征所需的领域特定视觉知识。本研究针对断口形貌分析，对开源VLM（Qwen3-VL-32B-Instruct）进行了微调。研究构建了一个包含13,168张开源文献断口图像的数据集，利用GPT-5.2-Reasoning结合图像与论文片段生成形貌标注，并通过人工收集与旋转增强进一步丰富数据。实验表明，该专家模型在100张人工标注图像的基准测试中表现优异，精度达到0.92，远超基准模型（0.35）及GPT-5.5-Reasoning（0.58）和Gemini 3.1 Pro-Reasoning（0.78）。消融实验证实了稀有特征的人工收集与旋转增强对提升识别效果的有效性。本研究展示了通过针对性数据采集与微调，使VLM适应特定科学特征识别的路径，为自主显微分析工作流提供了有力支持。

🔬 方法详解

问题定义：论文旨在解决通用多模态大模型在材料科学断口形貌识别任务中精度低、缺乏专业视觉知识的问题，以实现更可靠的材料表征。

核心思路：通过构建大规模、高质量的领域特定数据集，利用先进的推理模型进行自动标注，并结合人工校准与数据增强，对开源大模型进行监督微调（SFT），使其具备断口形貌的专业识别能力。

技术框架：整体流程包括：1. 数据挖掘与清洗，从开源文献中提取13,168张断口图像；2. 利用GPT-5.2-Reasoning结合多模态上下文进行自动标注；3. 引入人工收集的稀有特征样本与旋转增强策略；4. 对Qwen3-VL-32B-Instruct进行参数高效微调。

关键创新：创新点在于将多模态推理模型（GPT-5.2）的知识蒸馏至领域模型中，并证明了针对科学图像的“人工收集+旋转增强”策略能有效解决长尾分布下的稀有特征识别难题。

关键设计：采用了基于文献上下文的标注生成机制，确保了标注的科学严谨性；通过消融实验验证了数据增强对提升模型鲁棒性的贡献，并探讨了与通用模型集成以实现自主显微分析的协同工作流。

📊 实验亮点

模型在100张人工标注的测试集上精度高达0.92，对比基准模型Qwen3-VL-32B-Instruct（0.35）、GPT-5.5-Reasoning（0.58）及Gemini 3.1 Pro-Reasoning（0.78）具有显著优势。消融实验证实，针对稀有特征的人工补充采集与旋转增强策略是提升模型识别性能的核心驱动力。

🎯 应用场景

该研究主要应用于材料科学领域的自动化断口分析（Fractography）。通过将微调后的模型集成至显微镜工作流中，可实现对材料失效机理的快速、准确识别，显著降低人工分析成本，并为材料研发、失效分析及自主实验室建设提供关键的视觉决策支持。

📄 摘要（原文）

Vision-language models (VLMs) have shown strong potential for scientific image understanding, but general-purpose models often lack the domain-specific visual knowledge required for reliable materials characterization. In this work, we fine-tuned an open-source VLM (Qwen3-VL-32B-Instruct) for fracture-surface image analysis using a curated dataset of 13,168 open-source, literature-mined fracture-surface images. Morphology annotations were generated by GPT-5.2-Reasoning (high) from both the images and relevant excerpts of their source papers, and the dataset was further enriched with targeted manual collection and rotation-based augmentation. The resulting specialist model outperforms flagship proprietary multimodal models on a benchmark of 100 manually annotated images. It achieves a precision of 0.92, compared to 0.35 for the base Qwen3-VL-32B-Instruct, 0.58 for GPT-5.5-Reasoning (high), and 0.78 for Gemini 3.1 Pro-Reasoning (high). Dataset ablations show that manual collection of rare-feature images and augmentation via image rotation are both beneficial to improve recognition of less common fracture morphology features. We further discuss integrated use of the fine-tuned model with proprietary models to combine fracture-specific visual accuracy with broader multimodal reasoning for autonomous fractography. Although focused on fracture-surface images, this work demonstrates how VLMs can be adapted through targeted collection and fine-tuning on novel feature images to recognize those features and support downstream decision-making in autonomous microscopy workflows.

Fine-tuning a vision-language model for fracture-surface morphology recognition

💡 一句话要点

📋 核心要点

📝 摘要（中文）

🔬 方法详解

📊 实验亮点

🎯 应用场景

📄 摘要（原文）

⭐ 我的收藏

📁 新建收藏夹

⚙️ 管理收藏夹

🔍 搜索论文

🔐 登录 / 注册

👤 用户管理