| 1 |
MaterialFigBENCH: benchmark dataset with figures for evaluating college-level materials science problem-solving abilities of multimodal large language models |
MaterialFigBENCH: a figure-based benchmark dataset for evaluating the materials science problem-solving abilities of multimodal LLMs |
large language model multimodal |
|
|
| 2 |
SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning |
Proposes the SciMDR framework to address dataset construction for scientific multimodal document reasoning |
foundation model multimodal |
|
|
| 3 |
Performance Evaluation of Open-Source Large Language Models for Assisting Pathology Report Writing in Japanese |
Evaluates the performance of open-source large language models in assisting Japanese pathology report writing |
large language model |
|
|
| 4 |
UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Optimization |
UtilityMax Prompting: proposes a formal-language-based framework for multi-objective large language model optimization |
large language model |
|
|
| 5 |
To Words and Beyond: Probing Large Language Models for Sentence-Level Psycholinguistic Norms of Memorability and Reading Times |
Uses fine-tuned large language models to predict sentence-level psycholinguistic measures: memorability and reading times |
large language model |
|
|
| 6 |
DatedGPT: Preventing Lookahead Bias in Large Language Models with Time-Aware Pretraining |
DatedGPT: prevents lookahead bias in large language models through time-aware pretraining |
large language model |
|
|
| 7 |
Large Language Models for Biomedical Article Classification |
Explores the use of large language models for biomedical article classification and offers practical configuration recommendations. |
large language model |
|
|
| 8 |
BLooP: Zero-Shot Abstractive Summarization using Large Language Models with Bigram Lookahead Promotion |
BLooP: zero-shot summarization using large language models with Bigram Lookahead Promotion |
large language model |
✅ |
|
| 9 |
CoMMET: To What Extent Can LLMs Perform Theory of Mind Tasks? |
Proposes CoMMET, a multimodal benchmark for evaluating LLM performance on theory of mind tasks |
large language model multimodal |
|
|
| 10 |
BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs |
BTZSC: a comprehensive benchmark for zero-shot text classification covering cross-encoders, embedding models, rerankers, and LLMs |
large language model |
|
|
| 11 |
One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries |
Proposes an adaptive tool orchestration framework for autonomous multimodal query processing. |
multimodal |
|
|
| 12 |
Sparking Scientific Creativity via LLM-Driven Interdisciplinary Inspiration |
Proposes the Idea-Catalyst framework, which uses LLMs to spark interdisciplinary inspiration and support scientific innovation. |
large language model |
|
|
| 13 |
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections |
Proposes the MADQA benchmark to evaluate the strategic reasoning ability of multimodal agents over document collections. |
multimodal |
|
|
| 14 |
SemBench: A Universal Semantic Framework for LLM Evaluation |
SemBench: a universal semantic evaluation framework for LLMs that automatically generates cross-lingual evaluation benchmarks. |
large language model |
|
|
| 15 |
QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions |
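Proposes the QAQ framework, which selects high-quality synthetic code instructions via bidirectional semantic coherence to improve code generation model performance. |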
instruction following |
|
|
| 16 |
LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation |
Proposes LifeSim for evaluating personalized assistants in long-horizon user life scenarios |
large language model |
|
|
| 17 |
Cross-Context Review: Improving LLM Output Quality by Separating Production and Review Sessions |
Proposes the Cross-Context Review (CCR) method, which improves LLM output quality by separating production and review sessions |
large language model |
|
|
| 18 |
CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading |
Proposes CHiL(L)Grader, a human-in-the-loop short-answer grading framework with calibrated confidence |
large language model |
|
|
| 19 |
PersonaTrace: Synthesizing Realistic Digital Footprints with LLM Agents |
PersonaTrace: synthesizes realistic digital footprints with LLM agents to address data scarcity |
large language model |
|
|
| 20 |
Legal-DC: Benchmarking Retrieval-Augmented Generation for Legal Documents |
Proposes the Legal-DC benchmark and the LegRAG framework to improve RAG performance on Chinese legal documents |
large language model |
✅ |
|
| 21 |
Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries |
Proposes DapQ: decoding-aligned KV cache compression via position-aware pseudo queries |
large language model |
|
|
| 22 |
Tiny Aya: Bridging Scale and Multilingual Depth |
Tiny Aya: an efficient and balanced multilingual AI model with 3.35 billion parameters |
foundation model |
|
|
| 23 |
Try, Check and Retry: A Divide-and-Conquer Framework for Boosting Long-context Tool-Calling Performance of LLMs |
Proposes the Tool-DC framework to improve LLM performance on long-context tool calling |
large language model |
|
|
| 24 |
LLM-Assisted Causal Structure Disambiguation and Factor Extraction for Legal Judgment Prediction |
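Proposes an LLM-assisted method for causal structure disambiguation and factor extraction to improve the accuracy and robustness of legal judgment prediction. |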
large language model |
|
|
| 25 |
Algorithmic Consequences of Particle Filters for Sentence Processing: Amplified Garden-Paths and Digging-In Effects |
Particle filter models reveal ambiguity amplification and "digging-in" effects in sentence processing |
large language model |
|
|
| 26 |
One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries |
Proposes an adaptive tool orchestration framework to optimize multimodal query processing |
multimodal |
|
|
| 27 |
LLM BiasScope: A Real-Time Bias Analysis Platform for Comparative LLM Evaluation |
LLM BiasScope: a platform for real-time bias analysis and comparative evaluation of large language models |
large language model |
|
|
| 28 |
CSE-UOI at SemEval-2026 Task 6: A Two-Stage Heterogeneous Ensemble with Deliberative Complexity Gating for Political Evasion Detection |
Proposes a two-stage method based on a heterogeneous LLM ensemble and deliberative complexity gating for political evasion detection. |
large language model |
|
|
| 29 |
Not Just the Destination, But the Journey: Reasoning Traces Causally Shape Generalization Behaviors |
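Shows that the reasoning process, rather than the final answer, causally shapes the generalization behavior of large language models |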
chain-of-thought |
|
|