| 1 |
ProofSketch: Efficient Verified Reasoning for Large Language Models |
ProofSketch: an efficient, verifiable reasoning framework for large language models |
large language model chain-of-thought |
|
|
| 2 |
MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations |
Introduces MuSaG, a German multimodal sarcasm dataset with full-modal annotations, to improve the performance of sarcasm detection models. |
large language model multimodal |
|
|
| 3 |
Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation |
Proposes the MiRAGE framework for evaluating the performance of multimodal retrieval-augmented generation systems |
multimodal |
|
|
| 4 |
Beyond Neural Incompatibility: Easing Cross-Scale Knowledge Transfer in Large Language Models through Latent Semantic Alignment |
Proposes a cross-scale knowledge transfer method based on latent semantic alignment to improve large language model performance |
large language model |
|
|
| 5 |
Mitigating Hallucination in Large Language Models (LLMs): An Application-Oriented Survey on RAG, Reasoning, and Agentic Systems |
Survey: an application-oriented review of how RAG, reasoning, and agentic systems mitigate hallucination in large language models |
large language model |
|
|
| 6 |
POWSM: A Phonetic Open Whisper-Style Speech Foundation Model |
Introduces POWSM, a phonetic open Whisper-style speech foundation model that handles multiple phone-related speech tasks in a unified way |
foundation model |
|
|
| 7 |
Do Large Language Models Grasp The Grammar? Evidence from Grammar-Book-Guided Probing in Luxembourgish |
Proposes a grammar-book-guided evaluation framework to assess how well large language models understand Luxembourgish grammar |
large language model |
|
|
| 8 |
Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations |
CATT-Whisper: multimodal Arabic diacritic restoration using text and speech representations |
multimodal |
|
|
| 9 |
Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models? |
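Investigates what drives faithfulness in large language models, with the aim of improving trustworthiness in sensitive domains such as healthcare |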
large language model |
|
|
| 10 |
TEXT2DB: Integration-Aware Information Extraction with Large Language Model Agents |
Introduces the TEXT2DB task and the OPAL framework, which use LLM agents for information extraction and database integration. |
large language model |
✅ |
|
| 11 |
Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation |
Proposes a prefix-based adaptation method for zero-shot cross-lingual transfer in large language models. |
large language model zero-shot transfer |
|
|
| 12 |
Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants |
Proposes an open-ended Arabic cultural QA benchmark that addresses dialect variants |
large language model chain-of-thought |
|
|
| 13 |
Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers |
Trains minimal attention-only Transformers on the indirect object identification task, revealing the core reasoning circuits |
large language model |
|
|
| 14 |
Idea2Plan: Exploring AI-Powered Research Planning |
Idea2Plan: explores AI-powered research planning, laying groundwork for autonomous research agents |
large language model |
|
|
| 15 |
Tongyi DeepResearch Technical Report |
Introduces Tongyi DeepResearch, an agentic large language model for long-horizon deep information-seeking tasks. |
large language model |
|
|
| 16 |
zFLoRA: Zero-Latency Fused Low-Rank Adapters |
Proposes zFLoRA, a zero-latency fused low-rank adapter, to address adapter inference latency in LLM deployment |
large language model |
|
|
| 17 |
Towards a Method for Synthetic Generation of Persons with Aphasia Transcripts |
Proposes two methods for synthetically generating transcripts of speech from persons with aphasia, to ease data scarcity. |
large language model |
|
|
| 18 |
A word association network methodology for evaluating implicit biases in LLMs compared to humans |
Proposes a word-association-network method for evaluating implicit biases in LLMs that allows direct comparison with human biases. |
large language model |
|
|
| 19 |
Charting the European LLM Benchmarking Landscape: A New Taxonomy and a Set of Best Practices |
Proposes a new taxonomy and a set of best practices for benchmarking LLMs on European languages |
large language model |
|
|
| 20 |
Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content |
Proposes an agent-based framework for evaluating the accuracy and consistency of LLM-generated Islamic content |
large language model |
|
|
| 21 |
LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability |
Introduces the LongWeave benchmark, which uses CoV-Eval to assess LLM long-form generation in real-world scenarios. |
large language model |
|
|
| 22 |
Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean |
Introduces the Ko-MuSR benchmark for evaluating LLM multistep soft reasoning over long Korean narratives |
large language model |
|
|
| 23 |
RiddleBench: A New Generative Reasoning Benchmark for LLMs |
RiddleBench: a new benchmark for evaluating the generative reasoning ability of LLMs |
large language model |
|
|
| 24 |
WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking |
WebLeaper: improves WebAgent efficiency and efficacy through info-rich seeking |
large language model |
|
|
| 25 |
AgentFrontier: Expanding the Capability Frontier of LLM Agents with ZPD-Guided Data Synthesis |
AgentFrontier: expands the capability frontier of LLM agents with ZPD-guided data synthesis |
large language model |
|
|
| 26 |
STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence |
Introduces STAR-Bench for evaluating models on audio 4D spatio-temporal reasoning. |
large language model |
|
|
| 27 |
"Mm, Wat?" Detecting Other-initiated Repair Requests in Dialogue |
Proposes a multimodal model that improves detection of other-initiated repair requests in dialogue systems |
multimodal |
|
|
| 28 |
Diffusion LLM with Native Variable Generation Lengths: Let [EOS] Lead the Way |
Proposes dLLM-Var, a diffusion language model with native variable-length generation that significantly speeds up inference. |
large language model |
|
|
| 29 |
ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization |
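Proposes ReForm, which improves autoformalization of natural-language mathematics through reflective autoformalization and Prospective Bounded Sequence Optimization. |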
large language model |
|
|
| 30 |
Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts |
Builds a large-scale open Korean historical corpus to support quantitative study of how Korean has changed over time |
large language model |
|
|
| 31 |
Parallel Loop Transformer for Efficient Test-Time Computation Scaling |
Proposes the Parallel Loop Transformer (PLT), which speeds up LLM test-time computation and reduces memory usage |
large language model |
|
|
| 32 |
CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration? |
CritiCal: uses natural-language critiques to improve uncertainty and confidence calibration in large language models |
large language model |
|
|
| 33 |
LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data |
LuxIT: a Luxembourgish instruction-tuning dataset built from monolingual seed data |
large language model |
|
|
| 34 |
Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation |
Introduces PRDBench, an agent-driven benchmark for LLM code agents that addresses high annotation cost and narrow evaluation metrics. |
large language model |
|
|
| 35 |
Evaluating LLMs on Generating Age-Appropriate Child-Like Conversations |
Evaluates how well large language models generate age-appropriate, child-like conversations |
large language model |
|
|
| 36 |
Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures |
Introduces Global PIQA for evaluating the physical commonsense reasoning of large language models across 100+ languages and cultures |
large language model |
|
|
| 37 |
Pie: A Programmable Serving System for Emerging LLM Applications |
Pie: a programmable LLM serving system that provides flexible, efficient support for emerging applications |
large language model |
|
|
| 38 |
Success and Cost Elicit Convention Formation for Efficient Communication |
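Proposes a success- and cost-driven method for forming conversational conventions, improving the efficiency of multimodal communication. |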
multimodal |
|
|