| 1 |
LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios |
LIFBench: Evaluating the instruction-following performance and stability of large language models in long-context scenarios |
large language model instruction following |
|
|
| 2 |
Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models |
Proposes Chinese SimpleQA, a Chinese benchmark for evaluating the factuality of large language models |
large language model foundation model |
|
|
| 3 |
The Super Weight in Large Language Models |
Identifies super weights in large language models; pruning a single parameter can destroy model performance |
large language model |
|
|
| 4 |
Evaluating Large Language Models on Financial Report Summarization: An Empirical Study |
Evaluates the ability of large language models to summarize financial reports and provides a benchmark |
large language model |
|
|
| 5 |
AssistRAG: Boosting the Potential of Large Language Models with an Intelligent Information Assistant |
AssistRAG: Uses an intelligent information assistant to boost the capabilities of large language models |
large language model |
|
|
| 6 |
OpenThaiGPT 1.5: A Thai-Centric Open Source Large Language Model |
OpenThaiGPT 1.5: A Thai-centric open-source large language model |
large language model |
|
|
| 7 |
Cancer-Answer: Empowering Cancer Care with Advanced Large Language Models |
Cancer-Answer: Uses large language models to empower cancer care and improve patient outcomes |
large language model |
|
|
| 8 |
Persuasion with Large Language Models: a Survey |
Survey: Persuasion techniques based on large language models and their ethical risks |
large language model |
|
|
| 9 |
Large-scale moral machine experiment on large language models |
A large-scale Moral Machine experiment evaluating the ethical decision-making of large language models in autonomous driving |
large language model |
|
|
| 10 |
SetLexSem Challenge: Using Set Operations to Evaluate the Lexical and Semantic Robustness of Language Models |
SetLexSem Challenge: Uses set operations to evaluate the lexical and semantic robustness of language models |
large language model instruction following |
✅ |
|
| 11 |
Richer Output for Richer Countries: Uncovering Geographical Disparities in Generated Stories and Travel Recommendations |
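Reveals geographical bias: large language models perform differently in story generation and travel recommendation for countries of differing wealth |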
large language model |
|
|
| 12 |
LongSafety: Enhance Safety for Long-Context LLMs |
LongSafety: A comprehensive dataset and training method for enhancing the safety of long-context large language models |
large language model |
|
|
| 13 |
Beyond Keywords: A Context-based Hybrid Approach to Mining Ethical Concern-related App Reviews |
Proposes a context-based hybrid approach for mining ethics-related app reviews |
large language model |
|
|
| 14 |
UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts |
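UTMath: Proposes a unit-test-based math evaluation benchmark built on reasoning-to-coding thoughts to improve the mathematical ability of large language models |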
large language model |
✅ |
|
| 15 |
PDC & DM-SFT: A Road for LLM SQL Bug-Fix Enhancing |
Proposes the PDC & DM-SFT methods to improve LLM performance on SQL bug-fixing tasks |
large language model |
|
|
| 16 |
Explore the Reasoning Capability of LLMs in the Chess Testbed |
Proposes the MATE dataset and fine-tunes LLaMA-3-8B to improve LLM reasoning in chess |
large language model |
|
|
| 17 |
Using Generative AI and Multi-Agents to Provide Automatic Feedback |
Proposes AutoFeedback, a multi-agent system that improves the accuracy of generative AI in educational feedback |
large language model |
|
|
| 18 |
On Many-Shot In-Context Learning for Long-Context Evaluation |
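Evaluates long-context language models through many-shot in-context learning, revealing how tasks differ in their need for context understanding |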
large language model |
|
|
| 19 |
SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs |
Proposes SCAR, a sparse conditioned autoencoder for concept detection and steering in large language models |
large language model |
|
|
| 20 |
Building a Taiwanese Mandarin Spoken Language Model: A First Attempt |
A first attempt at building a Taiwanese Mandarin spoken large language model for real-time voice interaction |
large language model |
|
|
| 21 |
Sniff AI: Is My 'Spicy' Your 'Spicy'? Exploring LLM's Perceptual Alignment with Human Smell Experiences |
Sniff AI: Explores how well large language models align perceptually with human smell experiences |
large language model |
|
|
| 22 |
Reverse Prompt Engineering |
Proposes a training-free reverse prompt engineering framework that reconstructs prompts from only a small number of text outputs |
large language model |
|
|