| 1 |
The Percept-V Challenge: Can Multimodal LLMs Crack Simple Perception Problems? |
Proposes the Percept-V dataset to evaluate multimodal large language models on basic visual perception tasks |
large language model multimodal |
|
|
| 2 |
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers |
Surveys scientific large language models, from data foundations to agent frontiers |
large language model multimodal |
|
|
| 3 |
Exploring Machine Learning and Language Models for Multimodal Depression Detection |
Explores machine learning and language models for multimodal depression detection |
large language model multimodal |
|
|
| 4 |
How Does Cognitive Bias Affect Large Language Models? A Case Study on the Anchoring Effect in Price Negotiation Simulations |
Shows that large language models are affected by the anchoring effect in price negotiations |
large language model chain-of-thought |
|
|
| 5 |
Leveraging Large Language Models for Generating Research Topic Ontologies: A Multi-Disciplinary Study |
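Uses large language models to generate research topic ontologies, addressing the challenge of organizing knowledge across disciplines. |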
large language model chain-of-thought |
|
|
| 6 |
Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations |
Reveals label-induced bias in large language model evaluations and stresses the importance of blind evaluation |
large language model |
|
|
| 7 |
Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution |
LETHE: purifies backdoored large language models through knowledge dilution |
large language model |
|
|
| 8 |
GDLLM: A Global Distance-aware Modeling Approach Based on Large Language Models for Event Temporal Relation Extraction |
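Proposes GDLLM, which uses global distance-aware modeling to improve large language model performance on event temporal relation extraction |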
large language model |
|
|
| 9 |
Addressing Tokenization Inconsistency in Steganography and Watermarking Based on Large Language Models |
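Proposes a stepwise verification and rollback method to address tokenization inconsistency in LLM-based steganography and watermarking |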
large language model |
|
|
| 10 |
ConspirED: A Dataset for Cognitive Traits of Conspiracy Theories and Large Language Model Safety |
ConspirED: builds a dataset of the cognitive traits of conspiracy theories and evaluates large language model safety |
large language model |
|
|
| 11 |
CAPE: Context-Aware Personality Evaluation Framework for Large Language Models |
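CAPE: proposes a context-aware personality evaluation framework for LLMs, addressing the fact that existing methods ignore conversation history. |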
large language model |
✅ |
|
| 12 |
Benchmarking GPT-5 for biomedical natural language processing |
Evaluates GPT-5 on biomedical natural language processing tasks, revealing its strengths and limitations. |
multimodal chain-of-thought |
|
|
| 13 |
A Graph Talks, But Who's Listening? Rethinking Evaluations for Graph-Language Models |
Exposes a problem in evaluating graph-language models: existing benchmarks are insufficient to assess multimodal reasoning |
large language model multimodal |
|
|
| 14 |
GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs |
GUARD: improves compliance testing of LLMs through adaptive role-play and jailbreak diagnostics |
large language model |
|
|
| 15 |
On the Theoretical Limitations of Embedding-Based Retrieval |
Reveals theoretical limitations of embedding-based retrieval: it can fail even on simple queries |
instruction following |
|
|
| 16 |
Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection |
Proposes Rank-One Safety Injection (ROSI), which strengthens LLM safety alignment through rank-one weight modification. |
large language model |
|
|
| 17 |
Decoding Memories: An Efficient Pipeline for Self-Consistency Hallucination Detection |
Proposes the Decoding Memory Pipeline (DMP), which speeds up self-consistency hallucination detection and lowers compute cost |
large language model |
|
|
| 18 |
BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design |
Proposes BED-LLM to improve the information-gathering ability of large language models |
large language model |
|
|
| 19 |
ProactiveEval: A Unified Evaluation Framework for Proactive Dialogue Agents |
ProactiveEval: a unified evaluation framework for proactive dialogue agents |
large language model |
|
|
| 20 |
CoCoNUTS: Concentrating on Content while Neglecting Uninformative Textual Styles for AI-Generated Peer Review Detection |
Proposes the CoCoNUTS benchmark and the CoCoDet detector for identifying AI-generated content in peer reviews, focusing on content rather than style. |
large language model |
✅ |
|
| 21 |
Measuring Reasoning Utility in LLMs via Conditional Entropy Reduction |
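Evaluates the utility of LLM reasoning through conditional entropy reduction, with the aim of optimizing the reasoning process |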
large language model |
|
|
| 22 |
An Agile Method for Implementing Retrieval Augmented Generation Tools in Industrial SMEs |
EASI-RAG: an agile method for deploying retrieval-augmented generation tools in industrial SMEs |
large language model |
|
|
| 23 |
How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on $τ$-bench |
Proposes the IRMA framework, which significantly improves LLM tool-use accuracy in dynamic environments through input reformulation |
large language model |
|
|
| 24 |
Feel the Difference? A Comparative Analysis of Emotional Arcs in Real and LLM-Generated CBT Sessions |
Compares emotional arcs in real and LLM-generated CBT dialogues, revealing LLM limitations in emotional expression |
large language model |
✅ |
|
| 25 |
SciTopic: Enhancing Topic Discovery in Scientific Literature through Advanced LLM |
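SciTopic: uses large language models to enhance topic discovery in scientific literature and improve the efficiency of research information retrieval. |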
large language model |
|
|
| 26 |
From Post To Personality: Harnessing LLMs for MBTI Prediction in Social Media |
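Proposes the PostToPersonality framework, which uses LLMs to predict MBTI personality types from social media posts, mitigating hallucination and addressing data imbalance |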
large language model |
|
|
| 27 |
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers |
MCP-Bench: evaluates the tool-use ability of LLM agents on complex real-world tasks via MCP servers |
large language model |
✅ |
|
| 28 |
CAMB: A comprehensive industrial LLM benchmark on civil aviation maintenance |
Proposes CAMB, a comprehensive industrial LLM benchmark for civil aviation maintenance |
large language model |
✅ |
|
| 29 |
Joint Enhancement of Relational Reasoning for Long-Context LLMs |
Proposes the JERR framework, which strengthens relational reasoning in long-context LLMs through graph reasoning |
large language model |
|
|