| 1 |
Self-Blinding and Counterfactual Self-Simulation Mitigate Biases and Sycophancy in Large Language Models |
Uses self-blinding and counterfactual self-simulation to mitigate bias and sycophancy in large language models |
large language model |
|
|
| 2 |
Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning |
Proposes Render-of-Thought, which renders textual chains of thought as images for visual latent reasoning. |
large language model chain-of-thought |
✅ |
|
| 3 |
Social Caption: Evaluating Social Understanding in Multimodal Models |
Proposes the Social Caption framework for evaluating social understanding in multimodal models |
large language model multimodal |
|
|
| 4 |
RSNA Large Language Model Benchmark Dataset for Chest Radiographs of Cardiothoracic Disease: Radiologist Evaluation and Validation Enhanced by AI Labels (REVEAL-CXR) |
Proposes REVEAL-CXR, an AI-assisted chest radiograph benchmark dataset for evaluating large language models on cardiothoracic disease. |
large language model multimodal |
|
|
| 5 |
RECAP: Resistance Capture in Text-based Mental Health Counseling with Large Language Models |
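Proposes the RECAP framework for identifying resistance behaviors in text-based mental health counseling and providing explanations. |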
large language model |
|
|
| 6 |
Metadata Conditioned Large Language Models for Localization |
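Proposes metadata-conditioned large language models that improve performance in specific geographic regions without sacrificing cross-region generalization. |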
large language model |
|
|
| 7 |
Comparative Study of Large Language Models on Chinese Film Script Continuation: An Empirical Analysis Based on GPT-5.2 and Qwen-Max |
Builds a Chinese film script continuation benchmark and compares the creative-writing performance of GPT-5.2 and Qwen-Max. |
large language model |
|
|
| 8 |
Multi-Agent Constraint Factorization Reveals Latent Invariant Solution Structure |
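Proposes a multi-agent constraint factorization framework that reveals the latent invariant solution structure of large-model systems |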
large language model |
|
|
| 9 |
Knowledge Restoration-driven Prompt Optimization: Unlocking LLM Potential for Open-Domain Relational Triplet Extraction |
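Proposes a knowledge restoration-driven prompt optimization framework that improves LLM performance on open-domain relational triplet extraction. |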
large language model |
|
|
| 10 |
Obscuring Data Contamination Through Translation: Evidence from Arabic Corpora |
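Proposes a translation-aware contamination detection method that addresses the data-contamination blind spot in multilingual LLM evaluation |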
large language model |
|
|
| 11 |
CorpusQA: A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning |
Proposes CorpusQA, a 10-million-token benchmark for corpus-level analysis and reasoning. |
large language model |
|
|
| 12 |
Say Anything but This: When Tokenizer Betrays Reasoning in LLMs |
Exposes a tokenizer flaw: inconsistent tokenization during LLM reasoning causes phantom edits |
large language model |
|
|
| 13 |
The Effect of Scripts and Formats on LLM Numeracy |
Reveals that LLM numeracy degrades across different numeral scripts and formats, and proposes improvement strategies |
large language model |
|
|
| 14 |
Supporting Humans in Evaluating AI Summaries of Legal Depositions |
Proposes a nugget-based method that helps legal experts evaluate and improve summaries of legal documents. |
large language model |
|
|
| 15 |
CodeDelegator: Mitigating Context Pollution via Role Separation in Code-as-Action Agents |
CodeDelegator: mitigating context pollution in code-as-action agents through role separation |
large language model |
|
|
| 16 |
PodBench: A Comprehensive Benchmark for Instruction-Aware Audio-Oriented Podcast Script Generation |
PodBench: a comprehensive benchmark for instruction-aware podcast script generation |
instruction following |
|
|
| 17 |
AdaTIR: Adaptive Tool-Integrated Reasoning via Difficulty-Aware Policy Optimization |
Proposes AdaTIR to address redundant tool calls |
large language model |
|
|