Large Language Models' Varying Accuracy in Recognizing Risk-Promoting and Health-Supporting Sentiments in Public Health Discourse: The Cases of HPV Vaccination and Heated Tobacco Products

📄 arXiv: 2507.04364v1

Authors: Soojong Kim, Kwanho Kim, Hye Min Kim

Categories: cs.CL, cs.SI

Published: 2025-07-06

Comments: Forthcoming in Social Science & Medicine


💡 One-Sentence Takeaway

Evaluating how accurately large language models recognize sentiment in public health discourse.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: large language models, sentiment recognition, public health, social media analysis, HPV vaccination, heated tobacco products, machine learning, data analysis

📋 Key Points

  1. The accuracy and feasibility of existing large language models in recognizing sentiments in public health discourse remain underexplored, particularly across different platforms and health issues.
  2. This paper compares three prominent large language models on how well they detect risk-promoting versus health-supporting sentiments, providing an empirical analysis grounded in social media data.
  3. The results show that recognition accuracy differs notably across platforms; in particular, the models detect different sentiment types with different effectiveness on Facebook and Twitter.

📝 Abstract (Summary)

As machine learning methods are increasingly applied to the analysis of health-related public discourse, questions remain about their ability to accurately detect different types of health sentiment. This paper examines the accuracy of three prominent large language models (GPT, Gemini, and LLAMA) in recognizing risk-promoting versus health-supporting sentiments, focusing on two public health topics: Human Papillomavirus (HPV) vaccination and heated tobacco products (HTPs). Using data collected from Facebook and Twitter, the study finds that all three models classify risk-promoting and health-supporting sentiments with substantial accuracy, although notable differences emerge across platforms, health issues, and model types. These results underscore the importance of carefully selecting and validating language models for public health analyses, particularly where biases in training data may skew model behavior.

🔬 Method Details

Problem definition: This paper addresses how accurately large language models recognize risk-promoting versus health-supporting sentiments in public health discourse. How existing methods perform across different platforms and health topics has not been sufficiently studied.

Core idea: By comparing three large language models, the study analyzes how well they recognize health-related sentiments on different social media platforms, probing the models' accuracy and limitations.

Technical framework: The study draws on multiple sets of messages collected from Facebook and Twitter, combined with human annotations as the gold standard for sentiment classification, to build a multi-level analysis framework.
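The paper does not publish its prompts or model configurations, so the following is a minimal, hypothetical Python sketch of the classification step, assuming the OpenAI chat completions API; the model name, prompt wording, and label strings are illustrative assumptions, and the other models in the study (Gemini, LLAMA) would be queried through their own interfaces.

```python
# Hypothetical sketch of LLM-based sentiment classification. This is NOT the
# paper's actual prompt or model configuration, which is not reproduced here.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LABELS = ["risk-promoting", "health-supporting"]

def classify_message(message: str, topic: str) -> str:
    """Ask an LLM to label one social media message about a health topic."""
    prompt = (
        f"You will read a social media message about {topic}.\n"
        f"Classify its sentiment as exactly one of: {', '.join(LABELS)}.\n"
        f"Message: {message}\n"
        "Answer with only the label."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice; the paper evaluates GPT, Gemini, and LLAMA
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output is preferable for classification
    )
    return response.choices[0].message.content.strip().lower()

# Usage on a hypothetical message
print(classify_message(
    "Getting my daughter the HPV vaccine was the best decision we made.",
    topic="HPV vaccination",
))  # expected: "health-supporting"
```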

Key innovation: The paper systematically compares how well different large language models recognize sentiment in public health discourse, revealing performance differences across platforms and health issues.

Key design: The datasets were carefully curated to include a diverse mix of messages supporting and opposing recommended health behaviors, and multiple evaluation metrics were used to measure model accuracy.
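This summary does not enumerate the specific metrics; as a rough sketch of how model outputs could be scored against the human gold standard, the snippet below applies standard scikit-learn metrics to hypothetical labels.

```python
# Sketch of scoring model predictions against human gold-standard annotations.
# The labels below are hypothetical; the paper's exact metrics may differ.
from sklearn.metrics import accuracy_score, classification_report

gold = ["risk-promoting", "health-supporting", "health-supporting", "risk-promoting"]
pred = ["risk-promoting", "health-supporting", "risk-promoting", "risk-promoting"]

print("Accuracy:", accuracy_score(gold, pred))
# Per-class precision, recall, and F1 expose asymmetries between sentiment types
print(classification_report(gold, pred))
```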

📊 Experimental Highlights

The experiments show that all three large language models classify risk-promoting and health-supporting sentiments with substantial accuracy, but risk-promoting sentiment is recognized more accurately on Facebook, whereas health-supporting messages are detected more accurately on Twitter. These platform-dependent differences suggest that models must be chosen carefully for public health analyses.
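To make this stratified comparison concrete, here is a hypothetical pandas sketch of an accuracy breakdown by platform and sentiment type; the records are invented solely to mirror the reported direction of the differences, not actual study data.

```python
# Hypothetical breakdown of accuracy by platform and gold sentiment label.
import pandas as pd

df = pd.DataFrame({
    "platform": ["Facebook", "Facebook", "Twitter", "Twitter"],
    "gold": ["risk-promoting", "health-supporting",
             "risk-promoting", "health-supporting"],
    "pred": ["risk-promoting", "risk-promoting",
             "health-supporting", "health-supporting"],
})
df["correct"] = df["gold"] == df["pred"]

# Mean of the boolean column gives per-stratum accuracy
print(df.groupby(["platform", "gold"])["correct"].mean())
```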

🎯 Application Scenarios

Potential applications include public health policy making, social media opinion monitoring, and the optimization of health communication strategies. By accurately identifying public sentiment toward health topics, decision makers can design interventions more effectively and improve public health awareness and behavior. As social media data continue to grow, such techniques are likely to find increasingly broad application.

📄 Abstract (Original)

Machine learning methods are increasingly applied to analyze health-related public discourse based on large-scale data, but questions remain regarding their ability to accurately detect different types of health sentiments. Especially, Large Language Models (LLMs) have gained attention as a powerful technology, yet their accuracy and feasibility in capturing different opinions and perspectives on health issues are largely unexplored. Thus, this research examines how accurate the three prominent LLMs (GPT, Gemini, and LLAMA) are in detecting risk-promoting versus health-supporting sentiments across two critical public health topics: Human Papillomavirus (HPV) vaccination and heated tobacco products (HTPs). Drawing on data from Facebook and Twitter, we curated multiple sets of messages supporting or opposing recommended health behaviors, supplemented with human annotations as the gold standard for sentiment classification. The findings indicate that all three LLMs generally demonstrate substantial accuracy in classifying risk-promoting and health-supporting sentiments, although notable discrepancies emerge by platform, health issue, and model type. Specifically, models often show higher accuracy for risk-promoting sentiment on Facebook, whereas health-supporting messages on Twitter are more accurately detected. An additional analysis also shows the challenges LLMs face in reliably detecting neutral messages. These results highlight the importance of carefully selecting and validating language models for public health analyses, particularly given potential biases in training data that may lead LLMs to overestimate or underestimate the prevalence of certain perspectives.