| 1 |
BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali |
BenHalluEval: a multi-task hallucination evaluation framework for large language models on Bengali |
large language model, chain-of-thought |
|
|
| 2 |
Skill Availability and Presentation Granularity in Large-Language-Model Agents: A Controlled SkillsBench Study |
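Studies how the granularity of skill documentation affects the task success rate of LLM agents, finding that skill availability is the key factor |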
large language model |
|
|
| 3 |
The Sword, Shield, and Achilles' Heel: Characterizing the Linguistic Inductive Bias of Large Language Models for Spatial Reasoning in Navigation Planning |
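Proposes a dual-intervention framework to evaluate the linguistic inductive bias of large language models for spatial reasoning in navigation planning. |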
large language model |
✅ |
|
| 4 |
Target-Side Paraphrase Augmentation for Sign Language Translation with Large Language Models |
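Proposes a Signformer sign language translation method based on GPT-4o target-side paraphrase augmentation, improving performance in low-resource settings. |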
large language model |
|
|
| 5 |
Do Large Language Models Encode Institutional Experience? Evidence from Cross-Linguistic Moral Reasoning Under Ambiguity |
Shows that large language models exhibit traces of institutional experience in cross-linguistic moral reasoning |
large language model |
|
|
| 6 |
Human-Alignment, Calibration, and Activation Patterns in Large Language Model Uncertainty |
Explores how uncertainty in large language models relates to human alignment, calibration, and activation patterns |
large language model |
|
|
| 7 |
Semantic Triplet Restoration: A Novel Protocol for Hierarchical Table Understanding in Large Language Models |
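Proposes a semantic triplet restoration protocol that improves the performance of large language models on hierarchical table understanding tasks. |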
large language model |
✅ |
|
| 8 |
EvoDefense: Co-Evolving Black-Box Defense with Large Language Models |
EvoDefense: a co-evolving black-box defense method based on large language models |
large language model |
|
|
| 9 |
TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation |
Proposes TeachObs, a human-validated benchmark for multimodal teaching observation and model evaluation |
multimodal |
|
|
| 10 |
DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs |
Proposes DOA, a training-free decoder-only attention policy for long-form simultaneous translation with SpeechLLMs |
large language model, multimodal |
|
|
| 11 |
What Am I Missing? Question-Answering as Hidden State Probing |
Proposes a hidden-state probing method based on question generation to improve the reasoning ability of LLMs. |
large language model, chain-of-thought |
|
|
| 12 |
MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft |
MineExplorer: evaluates the open-world exploration ability of MLLM agents in Minecraft |
large language model, multimodal |
✅ |
|
| 13 |
MADS: Model-Aware Diverse Core Set Selection for Instruction Tuning |
Proposes a model-aware diverse core set selection method to address data selection for instruction tuning |
large language model, instruction following |
|
|
| 14 |
FBHM: Functional Benchmarking and Steering of VLMs for Hateful Meme Detection |
Proposes the FBHM benchmark and the LSV steering method to improve the generalization of VLMs in hateful meme detection. |
multimodal |
|
|
| 15 |
Scaling Multi-Hop Training Data via Graph-Constrained Path Selection |
Proposes graph-constrained path selection to scale multi-hop training data |
large language model |
✅ |
|
| 16 |
Shared Doubt: Zero-shot Cross-Lingual Confidence Estimation for Language Models |
Proposes a zero-shot cross-lingual confidence estimation method that exploits the shared confidence features of multilingual LLMs. |
large language model |
|
|
| 17 |
Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines |
With supervised feature selection, sparse autoencoders can rival LoRA on LLM steering tasks |
large language model |
|
|
| 18 |
XLGoBench: Detecting cross-lingual skill gaps with algorithmic tasks |
XLGoBench: proposes a set of algorithmic tasks to detect cross-lingual skill gaps in large language models |
large language model |
|
|
| 19 |
If LLMs Have Human-Like Attributes, Then So Does Age of Empires II |
Questions the attribution of human-like attributes to LLMs: similar phenomena can also be observed in Age of Empires II |
large language model |
|
|
| 20 |
D$^3$: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training |
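Proposes the D$^3$ framework, which optimizes LLM training data scheduling through dynamic directional graph constraints to improve learning efficiency. |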
large language model |
✅ |
|
| 21 |
Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits |
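Shows that toxic words in prompts reduce the reliability of large language models, and reveals the accompanying changes in internal computation. |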
large language model |
|
|
| 22 |
Fine-Tuning Improves Information Conveyance in Language Models |
Proposes Canopy Entropy to address the efficiency of information conveyance in language models |
large language model |
✅ |
|
| 23 |
Language Models Can Resolve Reference Compositionally, But It's Not Their Native Strength: The Case of the Personal Relation Task |
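Reveals the limitations of large language models in compositional reference resolution, using the Personal Relation Task as a case study |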
large language model |
|
|
| 24 |
Knowledge Boundary Probing and Demand-Guided Intervention for LLM-Based Power System Code Generation |
Proposes PowerCodeBench and a knowledge-boundary intervention method to improve the reliability of LLMs in power system code generation. |
large language model |
|
|
| 25 |
LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories |
Reveals the inconsistency of LLMs as safety judges, especially in regulated domains such as finance |
large language model |
|
|
| 26 |
The Latin Substrate: How Language Models Represent and Mediate Script Choice |
Reveals an underlying Latin-substrate preference in LLMs, examining how language models represent and mediate script choice |
large language model |
|
|
| 27 |
Divergence Decoding: Inference-Time Unlearning via Auxiliary Models |
Proposes Divergence Decoding, which achieves inference-time unlearning for LLMs via auxiliary models to address privacy and copyright risks. |
large language model |
|
|
| 28 |
Wind Turbine Maintenance Log Labelling Framework: LLM-Driven Data Correction and Enrichment via Semantic Extraction of Reliability Intelligence |
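Proposes an LLM-based framework for labelling wind turbine maintenance logs, enabling data correction and extraction of reliability information. |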
large language model |
|
|
| 29 |
Multilingual and Cross-Lingual Citation Needed Detection on Wikipedia for Lower-Resource Languages |
Proposes the MCN multilingual corpus and uses small language models for Citation Needed detection on Wikipedia in lower-resource languages |
large language model |
✅ |
|
| 30 |
GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs |
GRKV: training-free KV cache compression for long-context LLMs via global regression |
large language model |
|
|
| 31 |
Beyond Static Dialogues: Benchmarking Realistic, Heterogeneous, and Evolving Long-Term Memory |
Proposes the RHELM benchmark to evaluate LLM performance in realistic, heterogeneous, and evolving long-term memory scenarios |
large language model |
|
|
| 32 |
How Much Do LLMs Know About Chinese Zero Pronouns? |
Systematically evaluates how well large language models understand Chinese zero pronouns |
large language model |
|
|
| 33 |
MoG: Mixture of Experts for Graph-based Retrieval-Augmented Generation |
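Proposes MoG, a mixture-of-experts model for graph-based retrieval-augmented generation that improves complex reasoning performance. |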
large language model |
✅ |
|
| 34 |
EvoGens: A Population-Based Heuristic Search Framework for Scientific Idea Generation |
EvoGens: a population-based heuristic search framework for scientific idea generation. |
large language model |
|
|
| 35 |
dMoE: dLLMs with Learnable Block Experts |
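dMoE: proposes a learnable block expert mechanism to resolve the mismatch between expert selection and block-parallel decoding in diffusion language models. |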
large language model |
✅ |
|
| 36 |
Incremental BPE Tokenization |
Proposes an incremental BPE tokenization algorithm to improve the efficiency of streaming processing |
large language model |
✅ |
|
| 37 |
Efficient Diffusion LLMs via Temporal-Spatial Parallel Decoding and Confidence Extrapolation |
Proposes temporal-spatial parallel decoding and confidence extrapolation to accelerate inference in diffusion language models. |
large language model |
|
|
| 38 |
Triaging Threats to Specialized Guardrails |
Proposes RouteGuard, a specialized safety guardrail approach based on a router-expert framework |
large language model |
|
|