| 1 |
A$^2$FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning |
Proposes A$^2$FM to address inefficiency in reasoning and tool calling |
large language model foundation model chain-of-thought |
|
|
| 2 |
VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents |
Proposes VCB Bench, a Chinese benchmark for evaluating speech-driven large language model conversational agents |
large language model multimodal instruction following |
|
|
| 3 |
Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations |
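Proposes a multimodal-perturbation-based framework for evaluating reasoning faithfulness in medical vision-language models, applied to chest X-ray VQA. |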
multimodal chain-of-thought |
|
|
| 4 |
Hallucination Detection via Internal States and Structured Reasoning Consistency in Large Language Models |
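Proposes the HalluDet framework, which detects large language model hallucinations via internal states and structured reasoning consistency |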
large language model chain-of-thought |
✅ |
|
| 5 |
Investigating Large Language Models' Linguistic Abilities for Text Preprocessing |
Uses large language models for text preprocessing to improve downstream text classification performance |
large language model |
✅ |
|
| 6 |
StoryBox: Collaborative Multi-Agent Simulation for Hybrid Bottom-Up Long-Form Story Generation Using Large Language Models |
Proposes StoryBox, which uses collaborative multi-agent simulation for hybrid bottom-up long-form story generation. |
large language model |
|
|
| 7 |
MeTA-LoRA: Data-Efficient Multi-Task Fine-Tuning for Large Language Models |
MeTA-LoRA: a data-efficient multi-task fine-tuning method for large language models |
large language model |
|
|
| 8 |
Survey Response Generation: Generating Closed-Ended Survey Responses In-Silico with Large Language Models |
Systematically studies how different methods affect LLM-generated closed-ended survey responses and offers practical recommendations. |
large language model |
|
|
| 9 |
LLMAtKGE: Large Language Models as Explainable Attackers against Knowledge Graph Embeddings |
Proposes LLMAtKGE, which uses large language models as explainable adversarial attackers against knowledge graph embeddings |
large language model |
|
|
| 10 |
Are Large Language Models Effective Knowledge Graph Constructors? |
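Proposes a knowledge graph construction method based on a hierarchical extraction framework, improving LLM performance on knowledge-intensive tasks. |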
large language model |
|
|
| 11 |
Do Psychometric Tests Work for Large Language Models? Evaluation of Tests on Sexism, Racism, and Morality |
Evaluates the validity of psychometric tests for large language models, revealing their limitations in assessing sexism, racism, and morality. |
large language model |
|
|
| 12 |
Ensembling Large Language Models to Characterize Affective Dynamics in Student-AI Tutor Dialogues |
Proposes an ensemble large language model framework for analyzing affective dynamics in student-AI tutor dialogues |
large language model |
|
|
| 13 |
DND: Boosting Large Language Models with Dynamic Nested Depth |
DND: improving large language model performance through dynamic nested depth |
large language model |
|
|
| 14 |
Judge Before Answer: Can MLLM Discern the False Premise in Question? |
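Proposes the JBA dataset and a recognition-enhancement framework to improve multimodal large language models' ability to identify false premises |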
large language model multimodal |
|
|
| 15 |
UALM: Unified Audio Language Model for Understanding, Generation and Reasoning |
Proposes UALM, a unified audio language model for audio understanding, generation, and cross-modal reasoning |
multimodal |
|
|
| 16 |
LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance |
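Shows that the brittleness of LLM knowledge stems from reliance on superficial resemblance rather than robust knowledge representations. |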
large language model |
|
|
| 17 |
Conjecturing: An Overlooked Step in Formal Mathematical Reasoning |
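Proposes ConjectureBench to evaluate the overlooked conjecturing step in LLM formal mathematical reasoning, and designs the Lean-FIRe method to improve performance. |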
large language model |
|
|
| 18 |
Direct Multi-Token Decoding |
Proposes Direct Multi-Token Decoding (DMTD), which accelerates decoder-only LLM inference without additional parameters. |
large language model |
|
|
| 19 |
TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition |
The TopoAlign framework aligns code to math via topological decomposition, improving the autoformalization ability of math LLMs. |
large language model |
|
|
| 20 |
FaStfact: Faster, Stronger Long-Form Factuality Evaluations in LLMs |
FaStfact: a faster, stronger framework for evaluating long-form factuality in LLMs |
large language model |
✅ |
|
| 21 |
PHANTOM RECALL: When Familiar Puzzles Fool Smart Models |
The PHANTOM RECALL benchmark reveals LLMs' over-reliance on memorized templates in logical reasoning |
large language model |
|
|
| 22 |
Early Detection and Reduction of Memorisation for Domain Adaptation and Instruction Tuning |
Proposes n-gram-based early stopping and regularization methods to reduce model memorization in domain adaptation and instruction tuning |
large language model |
|
|
| 23 |
Repurposing Annotation Guidelines to Instruct LLM Annotators: A Case Study |
Proposes an LLM-based method for repurposing annotation guidelines, improving the efficiency and quality of text annotation |
large language model |
|
|
| 24 |
LLM-Specific Utility: A New Perspective for Retrieval-Augmented Generation |
Proposes LLM-specific utility to optimize model-tailored evidence selection in retrieval-augmented generation |
large language model |
|
|
| 25 |
Do LLMs "Feel"? Emotion Circuits Discovery and Control |
Uncovers emotion circuits in LLMs, enabling precise and controllable emotional expression |
large language model |
|
|
| 26 |
Generate Logical Equivalence Questions |
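Proposes a formal-language-based method for automatically generating logical equivalence questions, improving efficiency while ensuring difficulty |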
large language model |
|
|
| 27 |
TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks |
TextBandit: proposes a multi-armed bandit benchmark based on text-only feedback to evaluate the probabilistic reasoning ability of LLMs |
large language model |
|
|
| 28 |
Deep Research Brings Deeper Harm |
Reveals the potential harms of LLM-based Deep Research agents in areas such as biosecurity |
large language model |
✅ |
|
| 29 |
Task-Aware Reduction for Scalable LLM-Database Systems |
Proposes a task-aware reduction method to improve the efficiency and sustainability of LLM-database systems handling massive data |
large language model |
|
|
| 30 |
ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems |
Proposes the Acadreason benchmark to evaluate the reasoning ability of LLMs and agents on academic research problems. |
large language model |
|
|
| 31 |
Invisible Languages of the LLM Universe |
Reveals linguistic inequality in LLMs, emphasizing the continuity between the digital divide and colonial-era language hierarchies |
large language model |
|
|
| 32 |
Who are you, ChatGPT? Personality and Demographic Style in LLM-Generated Content |
Proposes a data-driven method for analyzing personality and demographic traits in content generated by large language models. |
large language model |
|
|
| 33 |
Valid Survey Simulations with Limited Human Data: The Roles of Prompting, Fine-Tuning, and Rectification |
Proposes a survey simulation method combining LLM synthesis with bias correction, increasing effective sample size and reducing bias. |
large language model |
|
|
| 34 |
Beyond Survival: Evaluating LLMs in Social Deduction Games with Human-Aligned Strategies |
Proposes an evaluation framework aligned with human strategies for assessing LLM performance in social deduction games such as Werewolf |
multimodal |
|
|
| 35 |
XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression |
XQuant: achieves ultra-low-bit KV cache quantization through cross-layer compression, improving long-context processing efficiency. |
large language model |
|
|
| 36 |
CNSocialDepress: A Chinese Social Media Dataset for Depression Risk Detection and Structured Analysis |
Releases CNSocialDepress, a Chinese social media dataset for depression risk detection that supports structured analysis. |
large language model |
|
|
| 37 |
TypePilot: Leveraging the Scala Type System for Secure LLM-generated Code |
TypePilot: uses the Scala type system to improve the security of LLM-generated code |
large language model |
|
|
| 38 |
ABLEIST: Intersectional Disability Bias in LLM-Generated Hiring Scenarios |
ABLEIST: reveals intersectional disability bias in LLM-generated hiring scenarios |
large language model |
|
|
| 39 |
Secret-Protected Evolution for Differentially Private Synthetic Text Generation |
Proposes the Secret-Protected Evolution framework for differentially private synthetic text generation, improving the utility-privacy trade-off. |
large language model |
|
|
| 40 |
The Social Cost of Intelligence: Emergence, Propagation, and Amplification of Stereotypical Bias in Multi-Agent Systems |
Studies the mechanisms by which stereotypical bias emerges, propagates, and is amplified in multi-agent systems |
large language model |
|
|
| 41 |
ADVICE: Answer-Dependent Verbalized Confidence Estimation |
Proposes the ADVICE framework to address answer-independent confidence estimation in large language models |
large language model |
|
|
| 42 |
Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks |
Evaluates the effectiveness of inference-time scaling strategies on Text2SQL tasks to optimize agentic workflows |
large language model |
|
|