| 1 |
Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? |
DeltaBench: evaluating the ability of large language models to detect errors in long chain-of-thought reasoning |
large language model chain-of-thought |
|
|
| 2 |
DataMan: Data Manager for Pre-training Large Language Models |
DataMan: a data manager for pre-training large language models that improves data quality and domain mixing. |
large language model instruction following |
|
|
| 3 |
Medical Hallucinations in Foundation Models and Their Impact on Healthcare |
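Reveals the hallucination problem of foundation models in the medical domain: general-purpose models outperform specialized ones, and CoT reasoning significantly mitigates hallucinations |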
foundation model chain-of-thought |
|
|
| 4 |
Do Large Language Models Know How Much They Know? |
Assesses the scope of large language models' knowledge: proposes a benchmark for testing models' cognitive abilities |
large language model |
|
|
| 5 |
Detecting Linguistic Indicators for Stereotype Assessment with Large Language Models |
Proposes a stereotype assessment method based on linguistic indicators, using large language models to detect stereotypes in text. |
large language model |
|
|
| 6 |
Binary Neural Networks for Large Language Model: A Survey |
Survey: binary neural network techniques for large language models |
large language model |
|
|
| 7 |
JailBench: A Comprehensive Chinese Security Assessment Benchmark for Large Language Models |
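JailBench: the first comprehensive Chinese security assessment benchmark for evaluating deep-seated vulnerabilities in large language models |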
large language model |
✅ |
|
| 8 |
When Large Language Models Meet Speech: A Survey on Integration Approaches |
Survey: explores three main approaches to integrating large language models with speech |
large language model |
|
|
| 9 |
Can Large Language Models Outperform Non-Experts in Poetry Evaluation? A Comparative Study Using the Consensual Assessment Technique |
Using the Consensual Assessment Technique, large language models outperform non-experts in poetry evaluation |
large language model |
|
|
| 10 |
MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering |
MEBench: a benchmark for large language models on cross-document multi-entity question answering |
large language model |
|
|
| 11 |
Towards Label-Only Membership Inference Attack against Pre-trained Large Language Models |
Proposes PETAL, a label-only membership inference attack against pre-trained large language models |
large language model |
|
|
| 12 |
Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance |
Proposes the Plutus-ben benchmark and the Plutus-8B model, filling a gap in large language model research for low-resource Greek finance. |
large language model |
|
|
| 13 |
Weaker LLMs' Opinions Also Matter: Mixture of Opinions Enhances LLM's Mathematical Reasoning |
Proposes the MoO method, which uses a mixture of opinions from weaker LLMs to enhance the mathematical reasoning of stronger LLMs |
large language model chain-of-thought |
|
|
| 14 |
A Causal Lens for Evaluating Faithfulness Metrics |
Proposes a causal diagnostic framework for evaluating the validity of faithfulness metrics for natural language explanations |
large language model chain-of-thought |
|
|
| 15 |
Can LLMs Help Uncover Insights about LLMs? A Large-Scale, Evolving Literature Analysis of Frontier LLMs |
Proposes LLMEvalDB, using LLMs to accelerate literature analysis and reveal performance insights about frontier LLMs |
multimodal chain-of-thought |
|
|
| 16 |
Stay Focused: Problem Drift in Multi-Agent Debate |
Proposes DRIFTJudge and DRIFTPolicy to address problem drift in multi-agent debate |
large language model instruction following |
|
|
| 17 |
Random Forest-of-Thoughts: Uncertainty-aware Reasoning for Computational Social Science |
Proposes the Random Forest-of-Thoughts (RFoT) method to improve LLMs' uncertainty-aware reasoning in social survey analysis. |
large language model chain-of-thought |
|
|
| 18 |
Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs |
Survey: explores the synergy between code-enhanced reasoning and reasoning-driven code intelligence in large language models |
large language model |
|
|
| 19 |
Shh, don't say that! Domain Certification in LLMs |
Proposes the VALID method, which provides output domain certification for LLMs in domain-specific applications to ensure model safety. |
large language model |
|
|
| 20 |
TestNUC: Enhancing Test-Time Computing Approaches and Scaling through Neighboring Unlabeled Data Consistency |
TestNUC: uses neighboring unlabeled data consistency to improve test-time computing approaches and achieve linear scaling |
large language model |
✅ |
|
| 21 |
Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs |
Amulet: test-time realignment for personalized preference adaptation of LLMs |
large language model |
|
|
| 22 |
Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review |
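Builds a benchmark for detecting AI-generated peer reviews, revealing the limitations of existing AI text detection algorithms in the review setting |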
large language model |
✅ |
|
| 23 |
Cognitive networks highlight differences and similarities in the STEM mindsets of human and LLM-simulated trainees, experts and academics |
Uses cognitive networks to reveal similarities and differences between the STEM mindsets of humans and LLMs |
large language model |
|
|
| 24 |
Norm Growth and Stability Challenges in Localized Sequential Knowledge Editing |
Reveals norm growth and stability challenges in localized knowledge editing of LLMs |
large language model |
|
|
| 25 |
BEYONDWORDS is All You Need: Agentic Generative AI based Social Media Themes Extractor |
Proposes an agentic generative AI method for extracting social media themes, improving the depth and accuracy of thematic analysis. |
chain-of-thought |
|
|
| 26 |
Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning |
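Proposes the Low-Confidence Gold (LCG) framework, which efficiently filters instruction-tuning datasets and improves large language model performance. |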
large language model |
|
|
| 27 |
Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework |
Proposes the ARJudge framework, which aligns LLM evaluation ability through multi-faceted evaluation and improves robustness. |
large language model |
|
|
| 28 |
Revisiting Word Embeddings in the LLM Era |
Compares LLMs with classical word embeddings, revealing the strengths and limitations of word embeddings in the LLM era |
large language model |
|
|
| 29 |
Where Are We? Evaluating LLM Performance on African Languages |
Evaluates LLM performance on African languages, revealing the impact of data bias on model effectiveness |
large language model |
|
|
| 30 |
Learning Code-Edit Embedding to Model Student Debugging Behavior |
Proposes a model based on code-edit embeddings for modeling student debugging behavior and providing personalized code suggestions. |
large language model |
|
|
| 31 |
Negation-Induced Forgetting in LLMs |
Finds that some large language models exhibit negation-induced forgetting |
large language model |
|
|
| 32 |
Bi'an: A Bilingual Benchmark and Model for Hallucination Detection in Retrieval-Augmented Generation |
Proposes Bi'an, a bilingual benchmark and model for hallucination detection in retrieval-augmented generation. |
large language model |
✅ |
|
| 33 |
BIG-Bench Extra Hard |
Proposes the BIG-Bench Extra Hard (BBEH) benchmark for evaluating more advanced general reasoning abilities of LLMs. |
large language model |
✅ |
|
| 34 |
Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval |
Proposes the PseudoEval benchmark for separately evaluating LLMs' coding ability and problem-solving ability |
large language model |
|
|
| 35 |
Exploring the Generalizability of Factual Hallucination Mitigation via Enhancing Precise Knowledge Utilization |
Proposes PKUE, which mitigates factual hallucination in large language models by enhancing precise knowledge utilization |
large language model |
|
|
| 36 |
LongEval: A Comprehensive Analysis of Long-Text Generation Through a Plan-based Paradigm |
LongEval: proposes a comprehensive evaluation benchmark for long-text generation based on a plan-based paradigm |
large language model |
✅ |
|
| 37 |
Sparse Brains are Also Adaptive Brains: Cognitive-Load-Aware Dynamic Activation for LLMs |
Proposes the CLADA framework to address the efficiency bottleneck of large language models |
large language model |
✅ |
|
| 38 |
IndicEval-XL: Bridging Linguistic Diversity in Code Generation Across Indic Languages |
IndicEval-XL: builds a multilingual evaluation benchmark for code generation across Indic languages |
large language model |
✅ |
|
| 39 |
MathClean: A Benchmark for Synthetic Mathematical Data Cleaning |
Proposes the MathClean benchmark for evaluating the effectiveness of mathematical data cleaning models. |
large language model |
✅ |
|
| 40 |
GenTool: Enhancing Tool Generalization in Language Models through Zero-to-One and Weak-to-Strong Simulation |
GenTool: enhances tool generalization in language models through zero-to-one and weak-to-strong simulation |
large language model |
|
|
| 41 |
TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation |
TokenSwift: a lossless acceleration framework for ultra-long sequence generation that improves LLM generation efficiency |
large language model |
✅ |
|
| 42 |
Active Few-Shot Learning for Text Classification |
Proposes an active-learning-based few-shot text classification method that improves LLM performance with limited labeled data |
large language model |
|
|
| 43 |
Towards Optimal Multi-draft Speculative Decoding |
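Proposes a method based on optimal transport theory for analyzing and optimizing the efficiency of multi-draft speculative decoding |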
large language model |
|
|
| 44 |
A Survey of Automatic Prompt Optimization with Instruction-focused Heuristic-based Search Algorithm |
Survey: automatic prompt optimization methods using instruction-focused heuristic-based search algorithms |
large language model |
|
|