| 1 |
Enhancing Multimodal Large Language Models for Ancient Chinese Character Evolution Analysis via Glyph-Driven Fine-Tuning |
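Proposes the GEVO framework, which uses glyph-driven fine-tuning to strengthen multimodal large language models' ability to analyze the evolution of ancient Chinese characters |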
large language model multimodal |
✅ |
|
| 2 |
NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment |
Proposes the NovBench benchmark for evaluating large language models' ability to assess the novelty of academic papers |
large language model instruction following |
|
|
| 3 |
General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks |
General365: builds a general reasoning benchmark that evaluates large language models' reasoning across diverse tasks |
large language model |
|
|
| 4 |
METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models |
METER: evaluates large language models' ability to perform multi-level contextual causal reasoning |
large language model |
✅ |
|
| 5 |
AOP-Smart: A RAG-Enhanced Large Language Model Framework for Adverse Outcome Pathway Analysis |
AOP-Smart: a RAG-enhanced large language model framework for adverse outcome pathway analysis |
large language model |
|
|
| 6 |
Dialectic-Med: Mitigating Diagnostic Hallucinations via Counterfactual Adversarial Multi-Agent Debate |
Dialectic-Med: mitigates hallucinations in medical diagnosis through adversarial multi-agent debate |
large language model multimodal chain-of-thought |
|
|
| 7 |
How Robust Are Large Language Models for Clinical Numeracy? An Empirical Study on Numerical Reasoning Abilities in Clinical Contexts |
Proposes ClinicNumRobBench to evaluate the robustness of large language models in clinical numerical reasoning |
large language model |
✅ |
|
| 8 |
RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents |
RPA-Check: a multi-stage automated framework for evaluating the performance of LLM-based role-playing agents in constrained environments |
large language model chain-of-thought |
|
|
| 9 |
A Systematic Analysis of the Impact of Persona Steering on LLM Capabilities |
Improving LLM capabilities through persona steering: a systematic analysis and a dynamic routing strategy |
large language model instruction following |
|
|
| 10 |
A Triadic Suffix Tokenization Scheme for Numerical Reasoning |
Proposes the Triadic Suffix Tokenization (TST) scheme to address inconsistent tokenization of numbers in LLM numerical reasoning |
large language model |
|
|
| 11 |
C-ReD: A Comprehensive Chinese Benchmark for AI-Generated Text Detection Derived from Real-World Prompts |
Proposes C-ReD, a comprehensive Chinese benchmark for AI-generated text detection built from real-world prompts |
large language model |
✅ |
|
| 12 |
METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues |
METRO: induces strategies for non-collaborative dialogues from expert dialogue transcripts |
large language model |
✅ |
|
| 13 |
Do LLMs Know Tool Irrelevance? Demystifying Structural Alignment Bias in Tool Invocations |
Reveals structural alignment bias in LLM tool invocation and proposes the SABEval dataset and a rebalancing strategy |
large language model |
|
|
| 14 |
Exploring Knowledge Conflicts for Faithful LLM Reasoning: Benchmark and Method |
Proposes the ConflictQA benchmark and the XoT framework to address LLM reasoning over heterogeneous, conflicting knowledge |
large language model |
|
|
| 15 |
CocoaBench: Evaluating Unified Digital Agents in the Wild |
Proposes CocoaBench for evaluating the performance of unified digital agents on complex tasks |
visual grounding |
|
|
| 16 |
Efficient Training for Cross-lingual Speech Language Models |
Proposes CSLM, a cross-lingual speech language model that achieves cross-modal and cross-lingual alignment through efficient training |
large language model |
✅ |
|
| 17 |
Psychological Concept Neurons: Can Neural Control Bias Probing and Shift Generation in LLMs? |
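Studies the link between personality-trait representations and behavioral output in LLMs by intervening on psychological concept neurons |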
large language model |
|
|
| 18 |
CLSGen: A Dual-Head Fine-Tuning Framework for Joint Probabilistic Classification and Verbalized Explanation |
CLSGen: a dual-head fine-tuning framework for joint probabilistic classification and textual explanation |
large language model |
|
|
| 19 |
Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks |
Proposes AggAgent, which enables parallel scaling of long-horizon agent tasks through agentic aggregation |
chain-of-thought |
|
|
| 20 |
Hidden Failures in Robustness: Why Supervised Uncertainty Quantification Needs Better Evaluation |
Reveals hidden failures in robustness: supervised uncertainty quantification needs better evaluation methods |
large language model |
|
|
| 21 |
HTAA: Enhancing LLM Planning via Hybrid Toolset Agentization & Adaptation |
HTAA: enhances LLM planning through hybrid toolset agentization and adaptation |
large language model |
|
|