ProAPO: Progressively Automatic Prompt Optimization for Visual Classification

Authors: Xiangyan Qu, Gaopeng Gou, Jiamin Zhuang, Jing Yu, Kun Song, Qihao Wang, Yili Li, Gang Xiong

Category: cs.CV

Published: 2025-02-27 (updated: 2025-03-12)

Comments: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

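💡 One-sentence takeaway

Proposes ProAPO, an evolution-based algorithm that progressively optimizes language prompts for visual classification.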

🎯 Matched area: Pillar 9: Embodied Foundation Models

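Keywords: vision-language models, image classification, prompt optimization, evolutionary algorithms, multimodal learning, adapter methods, deep learning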

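📋 Key points

Existing methods that generate class-specific prompts are vulnerable to LLM hallucination, which can make the prompts inaccurate or weakly discriminative.
The paper proposes an evolution-based algorithm that progressively optimizes language prompts with simple yet effective edit-based and evolution-based operations, while keeping iteration costs low.
In the one-shot image classification setting, the method outperforms existing textual prompt-based methods across 13 datasets and also improves adapter-based methods.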

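📝 Abstract (translated from Chinese)

Vision-language models (VLMs) have made significant progress in image classification by relying on large-scale paired image-text data. Their performance, however, depends heavily on prompt quality. Recent methods show that visual descriptions generated by large language models (LLMs) can improve the generalization of VLMs, but because LLMs hallucinate, class-specific prompts may be inaccurate or lack discriminative power. This paper aims to find visually discriminative prompts for fine-grained categories with minimal supervision and no human intervention. It proposes an evolution-based algorithm that progressively optimizes language prompts from task-specific templates to class-specific descriptions. Experiments show that the method outperforms existing textual prompt-based methods across 13 datasets and improves the performance of adapter-based methods.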

🔬 Method details

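Problem definition: The paper addresses how to generate high-quality class-specific prompts for visual classification. Existing methods typically rely on LLM-generated descriptions, which can be inaccurate or lack discriminative power.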

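Core idea: A progressive, evolution-based algorithm optimizes prompts from task-specific templates to class-specific descriptions. Diverse candidate prompts are derived from a one-time query of LLMs, which removes the need for human intervention and keeps prompt generation and iteration costs low.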

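Technical framework: The pipeline generates candidate prompts, applies sampling strategies to find an initial search point, and ranks candidates with a novel fitness score that mitigates overfitting. The main modules are a prompt generation module and an optimization module.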
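The pipeline described above can be sketched as a small evolutionary search loop. This is a minimal illustration, not the authors' implementation: it assumes a prompt can be modeled as a set of description strings, and the operator names (`edit_candidates`, `crossover`), the `optimize` function, and all parameters are hypothetical. The fitness function is supplied by the caller.

```python
import random
from typing import Callable, FrozenSet, List

# Assumption: a prompt is modeled as a set of description strings.
Prompt = FrozenSet[str]


def edit_candidates(prompt: Prompt, pool: List[str], rng: random.Random) -> List[Prompt]:
    """Hypothetical edit-based operators: add, remove, and replace one description."""
    unused = [d for d in pool if d not in prompt]
    out: List[Prompt] = []
    if unused:
        out.append(prompt | {rng.choice(unused)})  # add
    if len(prompt) > 1:
        out.append(prompt - {rng.choice(sorted(prompt))})  # remove
    if unused and prompt:
        drop = rng.choice(sorted(prompt))
        out.append((prompt - {drop}) | {rng.choice(unused)})  # replace
    return out


def crossover(a: Prompt, b: Prompt, rng: random.Random) -> Prompt:
    """Hypothetical evolution-based operator: keep a random half of the parents' union."""
    merged = sorted(a | b)
    k = max(1, len(merged) // 2)
    return frozenset(rng.sample(merged, k))


def optimize(
    init: Prompt,
    pool: List[str],
    fitness: Callable[[Prompt], float],
    iterations: int = 20,
    keep: int = 4,
    seed: int = 0,
) -> Prompt:
    """Greedy evolutionary search; `pool` stands in for a one-time LLM query result."""
    rng = random.Random(seed)
    population = [init]
    for _ in range(iterations):
        # Parents stay in the candidate list, so the best score never decreases.
        candidates = list(population)
        for p in population:
            candidates.extend(edit_candidates(p, pool, rng))
        if len(population) >= 2:
            a, b = rng.sample(population, 2)
            candidates.append(crossover(a, b, rng))
        unique = list(dict.fromkeys(candidates))  # drop duplicates, keep order
        unique.sort(key=fitness, reverse=True)
        population = unique[:keep]
    return population[0]


# Toy usage: reward "useful" descriptions, penalize the rest.
pool = ["long curved beak", "red crest", "striped wings", "a photo", "blurry"]
useful = {"long curved beak", "red crest", "striped wings"}


def toy_fitness(p: Prompt) -> float:
    return float(sum(1 if d in useful else -1 for d in p))


init = frozenset({"a photo"})
best = optimize(init, pool, toy_fitness)
print(sorted(best), toy_fitness(best))
```

In a real setting the toy fitness would be replaced by a score computed from a vision-language model on the labeled one-shot images. Because all candidates come from a fixed pool, no further LLM calls are needed during the search.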

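Key innovation: The main novelty is a set of edit-based and evolution-based operations that generate diverse candidate prompts from a one-time query of LLMs, combined with a fitness score with entropy constraints that mitigates overfitting.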
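This summary does not give the exact form of the entropy-constrained fitness score. One plausible reading, shown here purely as an assumption, is accuracy penalized by the mean cross-entropy of the true label, so that candidates with equal accuracy on a handful of one-shot samples are still separated by how confidently they classify. The function names and the weight `lam` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one image's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fitness_score(
    logits_per_image: Sequence[Sequence[float]],
    labels: Sequence[int],
    lam: float = 0.1,
) -> float:
    """Assumed form: accuracy minus lam times mean cross-entropy of the true class."""
    correct = 0
    cross_entropy = 0.0
    for logits, y in zip(logits_per_image, labels):
        probs = softmax(logits)
        pred = max(range(len(probs)), key=probs.__getitem__)
        correct += int(pred == y)
        cross_entropy += -math.log(max(probs[y], 1e-12))
    n = len(labels)
    return correct / n - lam * cross_entropy / n


# Two candidate prompts with identical accuracy but different confidence.
labels = [0, 1]
confident = [[5.0, 0.0], [0.0, 5.0]]
marginal = [[0.1, 0.0], [0.0, 0.1]]
print(fitness_score(confident, labels) > fitness_score(marginal, labels))
```

With only one labeled image per class, accuracy alone takes few distinct values, so many candidates tie. A soft term of this kind gives the search a way to rank them.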

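Key design: Two sampling strategies find a better initial search point and reduce the number of categories traversed, and the fitness score includes entropy constraints so that the selected prompts generalize better. The overall design aims to reduce prompt generation cost and the number of iterations.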
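The two sampling strategies are not specified in this summary, so the following is only a guess at their general shape. The first function samples several random prompt subsets and keeps the best one as the initial search point. The second restricts class-specific optimization to the categories that currently score worst. Both function names and all parameters are hypothetical.

```python
import random
from typing import Callable, Dict, FrozenSet, List


def sample_initial_point(
    pool: List[str],
    fitness: Callable[[FrozenSet[str]], float],
    num_samples: int = 8,
    subset_size: int = 2,
    seed: int = 0,
) -> FrozenSet[str]:
    """Assumed strategy 1: score a few random subsets and start from the best."""
    rng = random.Random(seed)
    size = min(subset_size, len(pool))
    samples = [frozenset(rng.sample(pool, size)) for _ in range(num_samples)]
    return max(samples, key=fitness)


def select_categories(per_class_accuracy: Dict[str, float], k: int) -> List[str]:
    """Assumed strategy 2: optimize only the k categories with the lowest accuracy."""
    ranked = sorted(per_class_accuracy, key=lambda c: (per_class_accuracy[c], c))
    return ranked[:k]


# Toy usage with made-up templates and accuracies.
templates = ["a photo of a {}", "a close-up of a {}", "a drawing of a {}", "{}"]
start = sample_initial_point(templates, fitness=lambda p: float(len("".join(p))))
worst = select_categories({"cat": 0.9, "lynx": 0.4, "tiger": 0.6}, k=2)
print(sorted(start), worst)
```

Restricting the search to a few weak categories shrinks the number of fitness evaluations per iteration, which matches the stated goal of saving iteration costs.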

📊 Experimental highlights

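In a challenging one-shot image classification setting, ProAPO outperforms existing textual prompt-based methods and improves LLM-generated description methods across 13 datasets. The optimized prompts also improve adapter-based methods and transfer effectively across different backbones.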

🎯 Application scenarios

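Potential application areas include image classification, visual recognition, and multimodal learning. By automating prompt optimization, the method can improve the performance of vision-language models on these tasks without human intervention.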

📄 Abstract (original)

Vision-language models (VLMs) have made significant progress in image classification by training with large-scale paired image-text data. Their performances largely depend on the prompt quality. While recent methods show that visual descriptions generated by large language models (LLMs) enhance the generalization of VLMs, class-specific prompts may be inaccurate or lack discrimination due to the hallucination in LLMs. In this paper, we aim to find visually discriminative prompts for fine-grained categories with minimal supervision and no human-in-the-loop. An evolution-based algorithm is proposed to progressively optimize language prompts from task-specific templates to class-specific descriptions. Unlike optimizing templates, the search space shows an explosion in class-specific candidate prompts. This increases prompt generation costs, iterative times, and the overfitting problem. To this end, we first introduce several simple yet effective edit-based and evolution-based operations to generate diverse candidate prompts by one-time query of LLMs. Then, two sampling strategies are proposed to find a better initial search point and reduce traversed categories, saving iteration costs. Moreover, we apply a novel fitness score with entropy constraints to mitigate overfitting. In a challenging one-shot image classification setting, our method outperforms existing textual prompt-based methods and improves LLM-generated description methods across 13 datasets. Meanwhile, we demonstrate that our optimal prompts improve adapter-based methods and transfer effectively across different backbones.
