Enhancing Continual Learning for Software Vulnerability Prediction: Addressing Catastrophic Forgetting via Hybrid-Confidence-Aware Selective Replay for Temporal LLM Fine-Tuning

Authors: Xuhui Dou, Hayretdin Bahsi, Alejandro Guerra-Manzanares

Categories: cs.CR, cs.AI, cs.LG

Published: 2026-02-27

Comments: Accepted for publication in the Proceedings of the 2026 International Conference on Information Systems Security and Privacy (ICISSP)

Journal reference: Proceedings of the 12th International Conference on Information Systems Security and Privacy - Volume 1, ISBN 978-989-758-800-6, ISSN 2184-4356, pages 474-485, 2026

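💡 One-Sentence Summary

Proposes Hybrid-CASR, a selective replay method that mitigates catastrophic forgetting in LLM-based software vulnerability prediction.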

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: continual learning, software vulnerability prediction, large language models, selective replay, catastrophic forgetting

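📋 Key Points

Existing software vulnerability prediction work ignores the time dimension, so LLM-based detectors perform worse in deployment than reported and fail to adapt as code bases evolve.
Hybrid-CASR uses confidence-aware selective replay, balancing vulnerable and non-vulnerable samples, to mitigate catastrophic forgetting.
Experiments show that Hybrid-CASR outperforms the window-only baseline on Macro-F1 and backward retention while cutting per-window training time by about 17 percent.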

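📝 Abstract (translated)

This paper studies continual learning for large language models (LLMs) in source-code vulnerability detection. Existing evaluations typically use random train-test splits, which ignore time and overestimate real-world performance. Using a CVE-linked dataset covering 2018-2024 and organised into bi-monthly windows, the authors continually fine-tune a decoder-style language model (microsoft/phi-2 with LoRA) and evaluate eight continual learning strategies. They propose Hybrid Class-Aware Selective Replay (Hybrid-CASR), a confidence-aware replay method for binary vulnerability classification that prioritises uncertain samples while keeping a balanced ratio of VULNERABLE and FIXED functions in the replay buffer. On bi-monthly forward evaluation, Hybrid-CASR reaches a Macro-F1 of 0.667, a statistically significant improvement of 0.016 over the window-only baseline (0.651, p = 0.026), with stronger backward retention (IBR@1 of 0.741). It also reduces per-window training time by about 17 percent relative to the baseline, whereas cumulative training yields only a minor F1 increase (0.661) at 15.9 times the computational cost. The results indicate that selective replay with class balancing offers a practical accuracy-efficiency trade-off for LLM-based temporal vulnerability detection under continuous temporal drift.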

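🔬 Method Details

Problem definition: The paper addresses catastrophic forgetting in LLM-based vulnerability detection models, caused by data distribution drift as code bases evolve over time. Existing approaches, such as randomly split datasets or plain cumulative training, handle this temporal shift poorly: the former overestimates model performance and the latter is computationally expensive.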

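Core idea: The paper relies on selective replay. It maintains a replay buffer of representative samples and, in each training round, mixes the current window's data with samples from the buffer, so that the model retains previously learned knowledge while adapting to the new data distribution. The key question is how to choose the buffer's samples so as to balance accuracy against efficiency.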
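The mixing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_training_batch` and the `replay_fraction` knob are hypothetical, and the replayed samples are drawn uniformly at random here, whereas Hybrid-CASR selects them by confidence and class.

```python
import random

def build_training_batch(current_window, replay_buffer, replay_fraction=0.25, seed=0):
    """Mix all current-window samples with a random draw from the replay buffer.

    replay_fraction is a hypothetical knob: the number of replayed samples is
    capped at this fraction of the current window's size.
    """
    rng = random.Random(seed)
    n_replay = min(len(replay_buffer), int(len(current_window) * replay_fraction))
    replayed = rng.sample(replay_buffer, n_replay)  # old samples to revisit
    mixed = list(current_window) + replayed
    rng.shuffle(mixed)  # interleave old and new samples
    return mixed
```

Because the replayed portion is bounded by the window size rather than by the full history, the per-window training set stays small, which is the property that separates replay from cumulative training.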

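Technical framework: The pipeline has four stages: 1) Data preprocessing: split the vulnerability dataset into bi-monthly time windows. 2) Model initialisation: start from a pretrained decoder-style language model (microsoft/phi-2 with LoRA). 3) Continual fine-tuning: in each time window, fine-tune the model on the current window's data together with data from the replay buffer. 4) Evaluation: measure the model's performance on data from subsequent time windows. Hybrid-CASR operates in the continual fine-tuning stage, where it selects the samples held in the replay buffer.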
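The windowing and forward-evaluation stages can be sketched as below. The source does not say how windows are aligned or how far ahead evaluation looks, so two assumptions are made here: windows follow calendar month pairs (Jan-Feb, Mar-Apr, and so on), and each model state is evaluated on the immediately following window. The sample layout `(date, function_source, label)` and all function names are hypothetical.

```python
from collections import defaultdict
from datetime import date

def bimonthly_key(d):
    """Map a date to (year, index), index 0..5: Jan-Feb -> 0, ..., Nov-Dec -> 5."""
    return (d.year, (d.month - 1) // 2)

def split_into_windows(samples):
    """Group (date, function_source, label) samples into chronological windows."""
    windows = defaultdict(list)
    for sample in samples:
        windows[bimonthly_key(sample[0])].append(sample)
    # Sorting the (year, index) keys gives chronological order.
    return [windows[k] for k in sorted(windows)]

def forward_pairs(windows):
    """Yield (train_window, test_window): train on window t, evaluate on t+1."""
    for t in range(len(windows) - 1):
        yield windows[t], windows[t + 1]
```

Splitting by date rather than at random ensures that every evaluation sample is newer than every training sample, which is what the temporal setting requires.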

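Key innovation: Hybrid-CASR's main contribution is its confidence-aware selective replay strategy. It accounts not only for class balance (VULNERABLE versus FIXED) but also for the model's prediction uncertainty on each sample. Prioritising samples on which the model's confidence is low helps the model learn to separate hard cases, which is intended to improve generalisation.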
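One common way to turn "low confidence" into a number for a binary classifier is the Shannon entropy of the predicted probability. The sketch below assumes the model outputs a single probability for the VULNERABLE class; the function name `binary_entropy` is illustrative and the exact scoring used in the paper may differ.

```python
import math

def binary_entropy(p_vulnerable):
    """Shannon entropy in bits of a binary prediction.

    Peaks at 1.0 when p = 0.5 (maximally uncertain) and is 0.0 when the
    model is fully confident (p = 0 or p = 1).
    """
    p = min(max(p_vulnerable, 0.0), 1.0)  # clamp against numerical noise
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
```

Under this score, a prediction of 0.6 ranks as more uncertain than a prediction of 0.9, so the former would be preferred for replay.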

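Key design: 1) Confidence computation: the entropy of the model's predicted probabilities measures a sample's uncertainty. 2) Class balance: the ratio of VULNERABLE to FIXED samples in the replay buffer is kept close to 1:1. 3) Replay buffer update: after each training round, buffer samples are selectively replaced according to confidence and class. 4) Loss function: training uses cross-entropy loss.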
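Points 2 and 3 above can be illustrated with a small buffer-selection routine. This is a sketch under stated assumptions rather than the published algorithm: candidates arrive as `(sample_id, label, uncertainty)` tuples with a precomputed uncertainty score, half the capacity is reserved for each class, and any shortfall in one class is filled from the other. The name `select_replay_buffer` is hypothetical.

```python
def select_replay_buffer(candidates, capacity):
    """Keep the most uncertain samples per class, aiming at a 1:1 class ratio.

    candidates: list of (sample_id, label, uncertainty), with label in
    {"VULNERABLE", "FIXED"}; higher uncertainty means lower model confidence.
    """
    by_class = {"VULNERABLE": [], "FIXED": []}
    for item in candidates:
        by_class[item[1]].append(item)
    for items in by_class.values():
        items.sort(key=lambda it: it[2], reverse=True)  # most uncertain first

    per_class = capacity // 2
    buffer = by_class["VULNERABLE"][:per_class] + by_class["FIXED"][:per_class]

    # Assumption: if one class is short (or capacity is odd), fill the
    # remaining slots with the most uncertain leftovers of either class.
    leftovers = by_class["VULNERABLE"][per_class:] + by_class["FIXED"][per_class:]
    leftovers.sort(key=lambda it: it[2], reverse=True)
    buffer += leftovers[: capacity - len(buffer)]
    return buffer
```

Calling this after each window with the old buffer plus the new window's samples as candidates gives a simple replace-by-score update rule.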

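📊 Experimental Highlights

On bi-monthly forward evaluation, Hybrid-CASR reaches a Macro-F1 of 0.667, a statistically significant improvement over the window-only baseline (0.651, p = 0.026). It also shows stronger backward retention (IBR@1 of 0.741) and reduces per-window training time by about 17 percent. By comparison, cumulative training raises F1 only slightly over the baseline (0.661) at 15.9 times the computational cost.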
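For readers unfamiliar with the headline metric, Macro-F1 is the unweighted mean of the per-class F1 scores, so the VULNERABLE and FIXED classes count equally regardless of their frequency. The sketch below is a plain-Python illustration with a hypothetical function name `macro_f1`; it is not the paper's evaluation code.

```python
def macro_f1(y_true, y_pred, labels=("VULNERABLE", "FIXED")):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

With true labels `[V, V, F, F]` and predictions `[V, F, F, F]`, the per-class F1 scores are 2/3 and 4/5, giving a Macro-F1 of 11/15.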

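🎯 Application Scenarios

The results apply to software security, where they can help developers find and fix potential vulnerabilities promptly. With continual learning, a vulnerability detection model can keep pace with an evolving code base, improving detection accuracy and efficiency and lowering software security risk. The approach may also carry over to other settings that involve time-ordered data and concept drift, such as financial risk prediction and network security situational awareness.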

📄 Abstract (original)

Recent work applies Large Language Models (LLMs) to source-code vulnerability detection, but most evaluations still rely on random train-test splits that ignore time and overestimate real-world performance. In practice, detectors are deployed on evolving code bases and must recognise future vulnerabilities under temporal distribution shift. This paper investigates continual fine-tuning of a decoder-style language model (microsoft/phi-2 with LoRA) on a CVE-linked dataset spanning 2018-2024, organised into bi-monthly windows. We evaluate eight continual learning strategies, including window-only and cumulative training, replay-based baselines and regularisation-based variants. We propose Hybrid Class-Aware Selective Replay (Hybrid-CASR), a confidence-aware replay method for binary vulnerability classification that prioritises uncertain samples while maintaining a balanced ratio of VULNERABLE and FIXED functions in the replay buffer. On bi-monthly forward evaluation Hybrid-CASR achieves a Macro-F1 of 0.667, improving on the window-only baseline (0.651) by 0.016 with statistically significant gains ($p = 0.026$) and stronger backward retention (IBR@1 of 0.741). Hybrid-CASR also reduces training time per window by about 17 percent compared to the baseline, whereas cumulative training delivers only a minor F1 increase (0.661) at a 15.9-fold computational cost. Overall, the results show that selective replay with class balancing offers a practical accuracy-efficiency trade-off for LLM-based temporal vulnerability detection under continuous temporal drift.
