Automating the Analysis of Public Saliency and Attitudes towards Biodiversity from Digital Media

📄 arXiv: 2405.01610v1

Authors: Noah Giebink, Amrita Gupta, Diogo Verìssimo, Charlotte H. Chang, Tony Chang, Angela Brennan, Brett Dickson, Alex Bowmer, Jonathan Baillie

Categories: cs.CL, cs.IR

Published: 2024-05-02

Comments: v0.1, 21 pages with 10 figures


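💡 One-Sentence Summary

Proposes an automated NLP pipeline for analyzing public salience of, and attitudes toward, biodiversity in news and social media.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: biodiversity, natural language processing, public attitudes, unsupervised learning, sentiment analysis, large language models, data filtering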

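📋 Key Points

  1. Existing approaches to assessing public attitudes toward biodiversity at a global scale depend on manually curated search terms, which is tedious, costly, and prone to bias.
  2. The paper applies modern natural language processing tools: a folk taxonomy approach for search term generation, and a relevance filtering pipeline that combines unsupervised topic discovery with a zero-shot large language model.
  3. Up to 62% of articles containing bat-related keywords were deemed irrelevant to biodiversity, and sentiment toward horseshoe bats shifted significantly at the onset of the pandemic.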

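📝 Abstract (Summary)

Measuring public attitudes toward wildlife is essential for understanding our relationship with nature and for monitoring progress toward Global Biodiversity Framework targets. Conducting such assessments at a global scale is difficult, however. Manually curating search terms for querying news and social media is tedious and costly, and can bias results. Using modern natural language processing tools, the paper proposes a folk taxonomy approach to improve search term generation and applies cosine similarity to Term Frequency-Inverse Document Frequency (TF-IDF) vectors to filter out syndicated articles. It also introduces an extensible relevance filtering pipeline that uses unsupervised learning to reveal common topics, then uses an open-source zero-shot large language model to assign topics to news article titles; those topics determine relevance. Finally, sentiment, topic, and volume analyses are run on the resulting data. A case study of news and X (formerly Twitter) data on several mammal taxa before and during the COVID-19 pandemic underscores the importance of relevance filtering.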

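🔬 Method Details

Problem definition: The paper addresses the tedium and bias of manually curating search terms when assessing public attitudes toward biodiversity at a global scale. Raw query results are also often cluttered with irrelevant content and syndicated articles, which undermines the accuracy of downstream analysis.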

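Core idea: Use modern natural language processing tools to automate the two most labor-intensive steps. A folk taxonomy approach generates search terms, and unsupervised learning combined with a zero-shot large language model filters results for relevance, with the aim of making the analysis more efficient and accurate.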
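To make the search-term step concrete, here is a minimal sketch of turning a folk taxonomy into a query string. The summary does not describe how the authors structure their taxonomy or queries, so the `FOLK_TAXONOMY` entries, the `build_query` helper, and the `OR` query syntax are all assumptions made for illustration.

```python
# Hypothetical folk taxonomy: folk group -> common-name search terms.
# These entries are illustrative and are not the paper's actual term lists.
FOLK_TAXONOMY = {
    "bats": ["bat", "horseshoe bat", "fruit bat"],
    "pangolins": ["pangolin", "scaly anteater"],
}


def build_query(group, taxonomy=FOLK_TAXONOMY):
    """Join a group's common names into one OR query, quoting multi-word names."""
    terms = taxonomy[group]
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    return " OR ".join(quoted)


query = build_query("pangolins")
print(query)
```

Keeping the common names in a data structure rather than in hand-written queries means the same term lists can be reused across news and social media sources, although each source's real query syntax may differ from the `OR` form assumed here.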

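Technical framework: The pipeline has four main stages: search term generation, data filtering, topic assignment, and analysis. Search terms are generated first. Syndicated articles are then filtered out using cosine similarity on TF-IDF vectors. Next, unsupervised learning reveals common topics, and a zero-shot large language model assigns those topics to news article titles to determine relevance. Finally, sentiment, topic, and volume analyses are run on the filtered data.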

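Key innovation: The main contributions are the folk taxonomy approach and the use of an open-source zero-shot large language model "out of the box". Unlike manual curation or supervised classifiers, zero-shot topic assignment needs no labeled training data, which lowers the barrier for conservation practitioners who want to apply these tools.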
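The topic-to-relevance step can be sketched as follows. This is only an outline under stated assumptions: the topic labels, the `TOPIC_RELEVANCE` mapping, and the example titles are invented, and `keyword_stub` is a crude keyword rule standing in for the zero-shot model so that the example runs without downloading anything. In practice the `classify` argument would wrap a real zero-shot classifier.

```python
# Hypothetical topics and their relevance to biodiversity (not from the paper).
TOPIC_RELEVANCE = {
    "wildlife and conservation": True,
    "sports": False,
    "entertainment": False,
}


def assign_relevance(titles, classify, topic_relevance=TOPIC_RELEVANCE):
    """Label each title with a topic, then look up whether that topic is relevant."""
    labels = list(topic_relevance)
    results = []
    for title in titles:
        topic = classify(title, labels)  # classify must return one of the labels
        results.append(
            {"title": title, "topic": topic, "relevant": topic_relevance[topic]}
        )
    return results


def keyword_stub(title, labels):
    """Stand-in for a zero-shot LLM call: a keyword rule for demonstration only."""
    text = title.lower()
    if "baseball" in text or "cricket" in text:
        return "sports"
    if "batman" in text:
        return "entertainment"
    return "wildlife and conservation"


titles = [
    "Rookie swings hot bat in baseball opener",
    "Horseshoe bat colony found in restored woodland",
    "New Batman film breaks box office record",
]
results = assign_relevance(titles, keyword_stub)
for row in results:
    print(row["relevant"], row["topic"], "|", row["title"])
```

Separating topic assignment from the relevance lookup keeps the relevance decision in one small, auditable table, so changing what counts as relevant does not require re-running the classifier.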

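Key design: Similarity between articles is computed as cosine similarity on Term Frequency-Inverse Document Frequency (TF-IDF) vectors, which is how syndicated copies of the same article are detected and filtered. On the model side, an open-source zero-shot large language model is used so that topic assignment can be adapted to new topics without task-specific training.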
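A minimal, dependency-free sketch of filtering syndicated articles with TF-IDF and cosine similarity is shown below. The whitespace tokenizer, the smoothed IDF formula, the greedy keep-first strategy, and the 0.9 threshold are all assumptions, since the summary does not give the authors' settings. A real pipeline would more likely use a library vectorizer such as scikit-learn's `TfidfVectorizer`.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) using a smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_syndicated(docs, threshold=0.9):
    """Return indices of docs to keep; later near-copies of a kept doc are dropped."""
    vectors = tfidf_vectors(docs)
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical article texts; the second is a re-published copy of the first.
articles = [
    "Horseshoe bats roost in caves across the region",
    "HORSESHOE BATS ROOST IN CAVES ACROSS THE REGION",
    "Elephant herd crosses the river at dawn",
]
kept = drop_syndicated(articles)
print(kept)
```

The pairwise comparison is quadratic in the number of kept articles, which is acceptable for a sketch but would need batching or approximate nearest-neighbor search on a large news corpus.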

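📊 Experimental Highlights

During the data collection period, up to 62% of articles containing bat-related keywords were deemed irrelevant to biodiversity, which underscores the importance of relevance filtering. At the onset of the pandemic, article volume on horseshoe bats increased and sentiment toward them shifted significantly, a pattern not observed for the other focal taxa.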
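One way to check whether a sentiment shift between two periods is larger than chance would produce is a permutation test on the difference in mean sentiment. The sketch below is an illustration only: the source does not say which statistical test the authors used, and the sentiment scores are invented.

```python
import random
from statistics import mean


def permutation_test(before, after, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in mean sentiment."""
    rng = random.Random(seed)
    observed = mean(after) - mean(before)
    pooled = list(before) + list(after)
    n_before = len(before)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[n_before:]) - mean(pooled[:n_before])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical per-article sentiment scores in [-1, 1]; not data from the paper.
before = [0.2, 0.1, 0.3, 0.15, 0.25, 0.2]
after = [-0.4, -0.3, -0.5, -0.35, -0.45, -0.3]

observed, p_value = permutation_test(before, after)
print(f"mean shift = {observed:.3f}, p = {p_value:.4f}")
```

A permutation test makes no assumption about the distribution of sentiment scores, which is convenient because model-generated sentiment scores are often skewed or bounded.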

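🎯 Application Scenarios

Potential application areas include environmental conservation, public policy making, and social science research. By automating the analysis of public attitudes toward biodiversity, organizations can design conservation strategies and outreach campaigns more effectively, and thereby raise public awareness and engagement. The approach could also be extended to public opinion analysis in other domains.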

📄 Abstract (Original)

Measuring public attitudes toward wildlife provides crucial insights into our relationship with nature and helps monitor progress toward Global Biodiversity Framework targets. Yet, conducting such assessments at a global scale is challenging. Manually curating search terms for querying news and social media is tedious, costly, and can lead to biased results. Raw news and social media data returned from queries are often cluttered with irrelevant content and syndicated articles. We aim to overcome these challenges by leveraging modern Natural Language Processing (NLP) tools. We introduce a folk taxonomy approach for improved search term generation and employ cosine similarity on Term Frequency-Inverse Document Frequency vectors to filter syndicated articles. We also introduce an extensible relevance filtering pipeline which uses unsupervised learning to reveal common topics, followed by an open-source zero-shot Large Language Model (LLM) to assign topics to news article titles, which are then used to assign relevance. Finally, we conduct sentiment, topic, and volume analyses on resulting data. We illustrate our methodology with a case study of news and X (formerly Twitter) data before and during the COVID-19 pandemic for various mammal taxa, including bats, pangolins, elephants, and gorillas. During the data collection period, up to 62% of articles including keywords pertaining to bats were deemed irrelevant to biodiversity, underscoring the importance of relevance filtering. At the pandemic's onset, we observed increased volume and a significant sentiment shift toward horseshoe bats, which were implicated in the pandemic, but not for other focal taxa. The proposed methods open the door to conservation practitioners applying modern and emerging NLP tools, including LLMs "out of the box," to analyze public perceptions of biodiversity during current events or campaigns.