| 1 |
IWISDM: Assessing instruction following in multimodal models at scale |
Introduces iWISDM, a benchmark for evaluating the instruction-following ability of multimodal models at scale |
large language model multimodal instruction following |
✅ |
|
| 2 |
PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents |
PIN: a knowledge-intensive dataset of paired and interleaved multimodal documents, intended to advance the development of LMMs |
multimodal |
|
|
| 3 |
FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving |
Proposes FVEL to address flexibility and efficiency problems in formal verification |
large language model |
✅ |
|
| 4 |
CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks |
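CityBench: builds an evaluation benchmark for urban tasks that systematically assesses the capabilities of large language models in urban research |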
large language model |
|
|
| 5 |
A Large Language Model Outperforms Other Computational Approaches to the High-Throughput Phenotyping of Physician Notes |
Uses the large language model GPT-4 for high-throughput phenotyping of physician notes, outperforming traditional methods |
large language model |
|
|
| 6 |
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal |
SORRY-Bench: systematically evaluates the safety refusal capability of large language models |
large language model |
|
|
| 7 |
APEER: Automatic Prompt Engineering Enhances Large Language Model Reranking |
Proposes APEER, which improves large language model reranking through automatic prompt engineering |
large language model |
|
|
| 8 |
LiveMind: Low-latency Large Language Models with Simultaneous Inference |
LiveMind: a low-latency large language model framework that supports simultaneous inference |
large language model |
|
|
| 9 |
SPL: A Socratic Playground for Learning Powered by Large Language Model |
SPL: a Socratic learning platform powered by large language models that fosters critical thinking. |
large language model |
|
|
| 10 |
DASB - Discrete Audio and Speech Benchmark |
Releases the Discrete Audio and Speech Benchmark (DASB) for comprehensive evaluation of audio tokenization methods. |
large language model multimodal |
|
|
| 11 |
RE-AdaptIR: Improving Information Retrieval through Reverse Engineered Adaptation |
Proposes RE-AdaptIR, which uses reverse engineered adaptation to improve LLM performance in information retrieval |
large language model |
|
|
| 12 |
How critically can an AI think? A framework for evaluating the quality of thinking of generative artificial intelligence |
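Proposes the MAGE framework for evaluating the limitations of generative AI in simulating critical thinking, helping educators design more robust assessments. |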
large language model |
|
|
| 13 |
Does GPT Really Get It? A Hierarchical Scale to Quantify Human vs AI's Understanding of Algorithms |
Proposes a hierarchy of algorithm understanding to quantify how well humans and GPT understand algorithms |
large language model |
|
|
| 14 |
Qiskit HumanEval: An Evaluation Benchmark For Quantum Code Generative Models |
Proposes Qiskit HumanEval, a benchmark for quantum code generation that evaluates LLM code generation ability in quantum computing |
large language model |
|
|
| 15 |
Artificial Leviathan: Exploring Social Evolution of LLM Agents Through the Lens of Hobbesian Social Contract Theory |
Simulates the evolution of a Hobbesian social contract with LLM agents to explore how complex social relationships form dynamically |
large language model |
|
|
| 16 |
The neural correlates of logical-mathematical symbol systems processing resemble that of spatial cognition more than natural language processing |
Reveals the neural mechanisms of logical-mathematical symbol processing: spatial cognition may be the foundation |
large language model |
|
|
| 17 |
EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms |
EvoAgent: automatic multi-agent generation via evolutionary algorithms, improving task-solving ability |
large language model |
✅ |
|
| 18 |
TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput |
TurboSpec: a closed-loop speculation control system that optimizes LLM serving goodput |
large language model |
|
|
| 19 |
AspirinSum: an Aspect-based utility-preserved de-identification Summarization framework |
Proposes the AspirinSum framework, which produces utility-preserving de-identified summaries through an aspect-based approach. |
large language model |
|
|