| 1 |
Automated Benchmark Auditing for AI Agents and Large Language Models |
Proposes the Auto Benchmark Audit (ABA) framework, which automatically audits AI benchmarks and improves evaluation quality. |
large language model |
|
|
| 2 |
Creative Quality Alignment: Expert Tacit Knowledge Transfer via Chain-of-Thought Fine-Tuning |
Transfers expert tacit knowledge through chain-of-thought fine-tuning to achieve creative quality alignment. |
chain-of-thought |
|
|
| 3 |
Toward a Benchmark for Controllable Simulation of Imperfect Students with Large Language Models |
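Proposes a benchmark for controllable learner simulation that uses large language models to simulate students with specific skill deficiencies, for use in teacher training. |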
large language model |
|
|
| 4 |
The Age of Curiosity Meets the Age of AI: Benchmarking Child Safety in Large Language Models |
KIDBench: a benchmark and safety model for evaluating large language models on child safety. |
large language model |
|
|
| 5 |
A general tensor-structured compression scheme for efficient large language models |
Proposes MixT, a general tensor-structured compression scheme for efficiently compressing large language models. |
large language model |
|
|
| 6 |
MATO: Multi-objective Personalized Alignment with Test-time Optimization for Large Language Models |
Proposes MATO, a multi-objective personalized alignment framework for large language models based on test-time optimization. |
large language model |
|
|
| 7 |
When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation |
Finds that LLM agents are more sensitive to semantic noise than to surface noise, and reveals a potential reasoning-divergence mechanism. |
large language model chain-of-thought |
|
|
| 8 |
Double Triangle Annotation: A Scalable Human-in-the-Loop Framework for High-Precision Historical Document Annotation |
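Proposes the Double Triangle Annotation framework, which uses consensus among multimodal large models to achieve high-precision annotation of historical documents. |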
large language model multimodal |
|
|
| 9 |
WhoSaidIt: Human-LLM Collaborative Annotation for Text-Based Multilingual Speaker-Attribute Classification |
Proposes WhoSaidIt, a human-LLM collaborative annotation framework for text-based multilingual speaker-attribute classification. |
large language model |
|
|
| 10 |
QUIET: A Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation Capability |
QUIET: a multi-blank cascaded story cloze benchmark for evaluating the creative generation capability of LLMs. |
large language model |
|
|
| 11 |
Thaka at KSAA-2026 Task 2: Regularized Fine-Tuning for Arabic Speech Diacritization |
For Arabic diacritic restoration, proposes a CATT-Whisper model based on regularized fine-tuning. |
multimodal |
|
|
| 12 |
Causal Tongue-Tie: LLMs Can Encode Causal Direction, But Their Yes/No Outputs Fail to Express |
Reveals a "tip-of-the-tongue effect" in LLM causal reasoning: internal understanding and external expression are inconsistent. |
large language model |
|
|
| 13 |
TIAR: Trajectory-Informed Advantage Reweighting for LLM Abstention Learning |
TIAR: trajectory-informed advantage reweighting for LLM abstention learning, improving model reliability. |
large language model |
|
|
| 14 |
Clarify, Abstain or Answer? Strategising in Conversation with Belief-Augmented Generation |
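Proposes Belief-Augmented Generation (BAG), which improves an LLM's ability to clarify, answer, or abstain in conversational question answering. |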
large language model |
|
|
| 15 |
StreamProfileBench: A Benchmark for Fine-Grained User Profile Inference in Real-World Streaming Scenarios |
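StreamProfileBench: proposes a large-scale streaming user-profile benchmark that addresses the challenge of modeling evolving user interests in real-time scenarios. |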
large language model |
|
|
| 16 |
PowLU: An Activation Function for Stable Pre-Training of LLMs |
Proposes the PowLU activation function to address numerical stability problems in LLM pre-training. |
large language model |
|
|
| 17 |
Neural Router: Semantic Content Matching for Agentic AI |
Proposes Neural Router, which uses LLMs for semantic content matching to support agentic AI. |
large language model |
|
|
| 18 |
PennySynth: RAG-Driven Data Synthesis for Automated Quantum Code Generation |
PennySynth: a RAG-based data synthesis framework for automated quantum code generation. |
large language model |
|
|
| 19 |
IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference |
IndexMem: learns a KV-cache eviction policy with latent memory to improve long-context LLM inference performance. |
large language model |
|
|
| 20 |
HyLaT: Efficient Multi-Agent Communication via Hybrid Latent-Text Protocol |
HyLaT: proposes a hybrid latent-text protocol to improve the efficiency of multi-agent communication. |
large language model |
|
|
| 21 |
SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models |
The SomaliBench Eval evaluation reveals significant gaps in Somali-language refusal behavior among open-weight language models. |
large language model |
|
|
| 22 |
LLM-as-a-Reviewer: Benchmarking Their Ability, Divergence, and Prompt Injection Resistance as Paper Reviewers |
Evaluates LLMs as paper reviewers: a benchmark study of bias, divergence, and prompt injection resistance. |
large language model |
|
|
| 23 |
EfficientGraph-RAG: Structured Retrieval-State Management for Cross-Task Retrieval-Augmented Generation |
EfficientGraph-RAG: improves cross-task RAG efficiency through structured retrieval-state management. |
large language model |
|
|
| 24 |
Tool-Call Dependency Structure is Linearly Decodable in LLM Agent Residual Streams |
Uses probes to decode runtime tool-call dependencies in LLM agents, revealing their linearly decodable structure. |
chain-of-thought |
|
|