| 1 |
Reliable Chain-of-Thought via Prefix Consistency |
Proposes Prefix Consistency, a method that improves the reliability of chain-of-thought reasoning through resampling-based verification |
large language model chain-of-thought |
✅ |
|
| 2 |
Post-training makes large language models less human-like |
Introduces the Psych-201 dataset and shows that post-training reduces the behavioral alignment of large language models |
large language model |
|
|
| 3 |
PSK@EEUCA 2026: Fine-Tuning Large Language Models with Synthetic Data Augmentation for Multi-Class Toxicity Detection in Gaming Chat |
Proposes a Llama 3.1 fine-tuning strategy based on synthetic data augmentation to improve multi-class toxicity detection in gaming chat |
large language model |
|
|
| 4 |
Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions |
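Proposes an analysis framework based on decision representation transitions that reveals the underlying mechanism behind performance collapse in layer-pruned large language models |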
large language model |
|
|
| 5 |
Why do Large Language Models Fail in Low-resource Translation? Unraveling the Token Dynamics of Large Language Models for Machine Translation |
Reveals why large language models fail in low-resource machine translation and proposes the Token Activation Rate (TAR) metric |
large language model |
|
|
| 6 |
Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts |
Challenges the chain-of-thought paradigm: reasoning models can extract answers from sparse, order-shuffled chains of thought |
chain-of-thought |
|
|
| 7 |
NSMQ Riddles: A Benchmark of Scientific and Mathematical Riddles for Quizzing Large Language Models |
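Proposes NSMQ Riddles, a benchmark of scientific and mathematical riddles for evaluating the scientific reasoning ability of large language models |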
large language model |
|
|
| 8 |
WeatherSyn: An Instruction Tuning MLLM For Weather Forecasting Report Generation |
Proposes the WeatherSyn model and dataset, which use instruction tuning to automate the generation of weather forecast reports |
large language model multimodal |
|
|
| 9 |
LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification |
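Proposes the LaTER reasoning framework, which achieves efficient test-time reasoning through latent-space exploration and explicit verification |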
large language model chain-of-thought |
✅ |
|
| 10 |
Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning |
Proposes the RIS framework, which enables latent visual reasoning in multimodal large models through spatial-semantic grounding |
large language model multimodal |
|
|
| 11 |
Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs |
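Introduces cognitive appraisal theory: multi-dimensional self-assessment improves the reliability of performance prediction for large language models |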
large language model |
|
|
| 12 |
PaT: Planning-after-Trial for Efficient Test-Time Code Generation |
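Proposes the PaT (Planning-after-Trial) framework, which uses an adaptive planning strategy to significantly improve inference efficiency in LLM code generation |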
large language model |
|
|
| 13 |
Hallucination Detection via Activations of Open-Weight Proxy Analyzers |
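Proposes a hallucination detection framework based on the activations of open-source proxy analyzers, enabling general hallucination detection for both closed-source and open-source LLMs |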
large language model |
|
|
| 14 |
The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents |
Reveals the "memory curse" in large models: how long context windows weaken cooperative intent among multiple agents |
chain-of-thought |
|
|
| 15 |
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs |
Evaluates the EngGPT2MoE-16B-A3B model, a high-performance Mixture-of-Experts (MoE) large language model for the Italian context |
large language model |
|
|
| 16 |
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference |
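Proposes the LaProx framework, which reformulates the KV cache eviction strategy via output-aware matrix approximation to achieve efficient compression for long-context inference |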
large language model |
|
|
| 17 |
The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval |
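Reveals the "text uncanny valley" phenomenon in large language models: non-monotonic performance degradation and a mode-transition mechanism |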
large language model |
|
|
| 18 |
Region4Web: Rethinking Observation Space Granularity for Web Agents |
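Proposes the Region4Web framework, which restructures the web observation space at functional-region granularity to improve web agent task success rates |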
large language model |
|
|
| 19 |
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling |
Proposes the AutoTTS framework, in which agents automatically discover test-time scaling (TTS) strategies to optimize compute allocation |
large language model |
✅ |
|
| 20 |
GLiGuard: Schema-Conditioned Classification for LLM Safeguard |
Proposes GLiGuard, a lightweight LLM safeguard framework based on schema-conditioned classification |
large language model |
✅ |
|
| 21 |
How Value Induction Reshapes LLM Behaviour |
Reveals how value induction affects large language model behavior: trade-offs among safety, anthropomorphism, and sycophancy |
large language model |
|
|
| 22 |
Beyond "I cannot fulfill this request": Alleviating Rigid Rejection in LLMs via Label Enhancement |
Proposes LANCE to address rigid rejection in large language models |
large language model |
|
|
| 23 |
Multi-Dimensional Evaluation of LLMs for Grammatical Error Correction |
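Evaluates large language models on grammatical error correction along multiple dimensions, revealing the limitations of model performance and evaluation metrics |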
large language model |
|
|
| 24 |
Is She Even Relevant? When BERT Ignores Explicit Gender Cues |
Uses checkpoint-level analysis to reveal how gender bias forms in a Dutch BERT model and the limits of its context processing |
large language model |
|
|
| 25 |
TCMIIES: A Browser-Based LLM-Powered Intelligent Information Extraction System for Academic Literature |
Proposes the TCMIIES system, a browser-based, LLM-powered platform for structured information extraction from academic literature |
large language model |
|
|
| 26 |
From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs |
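Proposes the LogiHard framework, which uses combinatorial hardening to reveal compositional deficiencies in the logical reasoning of frontier large models |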
zero-shot transfer |
|
|
| 27 |
SAGE: Hierarchical LLM-Based Literary Evaluation through Ontology-Grounded Interpretive Dimensions |
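Proposes SAGE, a hierarchical evaluation framework that uses ontology-grounded large language models to quantitatively assess literary quality |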
large language model |
|
|