| # | Title | Summary | Tags | ✓ |
|---|-------|---------|------|---|
| 1 | KITE: A Benchmark for Evaluating Korean Instruction-Following Abilities in Large Language Models | A benchmark for evaluating the Korean instruction-following abilities of large language models. | large language model, instruction following | |
| 2 | Evaluating Prompting Strategies and Large Language Models in Systematic Literature Review Screening: Relevance and Task-Stage Classification | Automates systematic literature review screening by evaluating the interplay of prompting strategies and large language models. | large language model, chain-of-thought | |
| 3 | Leveraging LLMs for Context-Aware Implicit Textual and Multimodal Hate Speech Detection | Leverages large language models to enhance context-aware detection of implicit textual and multimodal hate speech. | large language model, multimodal | |
| 4 | Outraged AI: Large language models prioritise emotion over cost in fairness enforcement | Shows that large language models prioritize emotion over cost in fairness enforcement, revealing human-like moral decision-making mechanisms. | large language model, foundation model | |
| 5 | Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding | Surveys multimodal retrieval-augmented generation for document understanding, addressing gaps in structural detail and context modeling in existing methods. | large language model, multimodal | |
| 6 | EgMM-Corpus: A Multimodal Vision-Language Dataset for Egyptian Culture | Introduces EgMM-Corpus, a multimodal vision-language dataset for understanding Egyptian culture. | multimodal | |
| 7 | SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling | Uses speech large language models for large-scale contextualized zero-shot slot filling. | large language model, foundation model, instruction following | |
| 8 | Contextual Augmentation for Entity Linking using Large Language Models | Proposes an entity-linking method based on LLM contextual augmentation, improving performance on out-of-domain datasets. | large language model | |
| 9 | Leveraging Test Driven Development with Large Language Models for Reliable and Verifiable Spreadsheet Code Generation: A Research Framework | Proposes a test-driven development (TDD) based LLM code-generation framework that improves the reliability and verifiability of spreadsheet code. | large language model | |
| 10 | Controllable Abstraction in Summary Generation for Large Language Models via Prompt Engineering | Proposes a prompt-engineering-based method for controllable abstractive summarization, improving the quality and controllability of LLM-generated summaries. | large language model | |
| 11 | Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry | Proposes a three-step evaluation framework that reveals biases of large language models in classical poetry generation and evaluation. | large language model | |
| 12 | When Seeing Is not Enough: Revealing the Limits of Active Reasoning in MLLMs | Introduces the GuessBench benchmark, revealing the limits of active reasoning in MLLMs. | large language model, multimodal | |
| 13 | Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs | CorrectBench: a comprehensive benchmark for evaluating the self-correction abilities of large language models. | large language model, chain-of-thought | |
| 14 | In Generative AI We (Dis)Trust? Computational Analysis of Trust and Distrust in Reddit Discussions | Proposes a computational framework based on Reddit data to analyze public trust and distrust in generative AI. | large language model | |
| 15 | PolySkill: Learning Generalizable Skills Through Polymorphic Abstraction | PolySkill: learns generalizable skills through polymorphic abstraction, improving agents' continual learning in open web environments. | large language model | |
| 16 | Paper2Web: Let's Make Your Paper Alive! | Paper2Web: proposes PWAgent, an automatic academic-webpage generation framework that improves paper dissemination. | large language model | |
| 17 | Emergence of Linear Truth Encodings in Language Models | Proposes a transparent toy Transformer model, revealing the mechanism by which linear truth encodings emerge in language models. | large language model | |
| 18 | LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation | Proposes a game-theoretic framework for mutual LLM evaluation, yielding model assessments better aligned with human judgment. | large language model | |
| 19 | GraphMind: Interactive Novelty Assessment System for Accelerating Scientific Discovery | GraphMind: an interactive novelty-assessment system for accelerating scientific discovery. | large language model | ✅ |
| 20 | Think Parallax: Solving Multi-Hop Problems via Multi-View Knowledge-Graph-Based Retrieval-Augmented Generation | ParallaxRAG: solves multi-hop reasoning problems via multi-view knowledge-graph-based retrieval-augmented generation. | large language model | |
| 21 | Rethinking Cross-lingual Gaps from a Statistical Viewpoint | Rethinks cross-lingual gaps from a statistical viewpoint and proposes a variance-control method. | large language model | |
| 22 | TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs | TokenTiming: a dynamic alignment method for universal speculative-decoding model pairs. | large language model | |
| 23 | LLM Latent Reasoning as Chain of Superposition | Proposes the Latent-SFT framework, achieving efficient and high-performing mathematical problem solving via latent reasoning chains. | chain-of-thought | |
| 24 | From Characters to Tokens: Dynamic Grouping with Hierarchical BPE | Proposes a dynamic grouping method based on hierarchical BPE, improving language-model efficiency and flexibility. | large language model | |
| 25 | Temporal Referential Consistency: Do LLMs Favor Sequences Over Absolute Time References? | Proposes TEMP-ReCon to address the temporal referential consistency problem in LLMs. | large language model | |
| 26 | DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios | DeceptionBench: a comprehensive benchmark for evaluating AI deception behaviors in real-world scenarios. | large language model | ✅ |
| 27 | CORE: Reducing UI Exposure in Mobile Agents via Collaboration Between Cloud and Local LLMs | Proposes the CORE framework to reduce UI exposure in mobile agents. | large language model | ✅ |
| 28 | VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency | VocalBench-DF: a benchmark for evaluating speech-LLM robustness to disfluency. | large language model | |
| 29 | When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling | Proposes the SAFE framework, improving the efficiency and stability of LLM ensembling in long-text generation via selective ensembling. | large language model | |