| 1 |
QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents |
QUACK: questioning, understanding, and auditing communicated knowledge in multimodal social deduction agents. |
large language model multimodal |
✅ |
|
| 2 |
Beyond Questions: Evaluating What Large Language Models (Actually) Know |
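Proposes BeQu, an open-ended knowledge evaluation framework for comprehensively assessing the knowledge that large language models hold. |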
large language model |
|
|
| 3 |
The Labyrinth and the Thread: Rethinking Regularizations in Sequential Knowledge Editing for Large Language Models |
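Revisits regularization methods in sequential knowledge editing for large language models, simplifying them and improving editing stability. |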
large language model |
✅ |
|
| 4 |
AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian |
Proposes AlbanianLLMSafety, the first Albanian-language LLM safety evaluation dataset, to advance LLM safety for low-resource languages. |
large language model |
|
|
| 5 |
KZ-SafetyPrompts: A Kazakh Safety Evaluation Prompt Dataset for Large Language Models |
Proposes KZ-SafetyPrompts, a Kazakh prompt dataset for evaluating the safety of large language models. |
large language model |
|
|
| 6 |
Rethinking the Multilingual Reasoning Gap with Layer Swap |
Proposes the Layer Swap method to improve the reasoning ability of multilingual large language models in non-English settings. |
large language model chain-of-thought |
|
|
| 7 |
Tracing Computation Density in LLMs |
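Proposes the s-Trace method, which reveals patterns in the distribution of computation density in LLMs and their modular organization. |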
large language model |
|
|
| 8 |
Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination |
Evaluates the relevance of uncertainty estimators for LLM hallucination detection. |
large language model |
|
|
| 9 |
Why Prompt Optimization Works, and Why It Sometimes Doesn't: A Causal-Inspired Edit-Level Analysis |
Uses causal analysis at the edit level to explain why prompt optimization succeeds and why it fails. |
large language model |
|
|
| 10 |
Lost in Sampling: Assessing Lexical Reachability in LLMs via the Word Coverage Score (WCS) |
Proposes the Word Coverage Score (WCS) to assess how LLM sampling strategies affect lexical richness. |
large language model |
|
|
| 11 |
JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors |
JuICE: a benchmark for evaluating the ability of LLMs to identify cultural errors. |
large language model |
|
|
| 12 |
ContextGuard: Structured Self-Auditing for Context Learning in Language Models |
ContextGuard: a structured self-auditing method for improving the performance of language models in context learning. |
large language model |
|
|
| 13 |
Annotator Positionality as Signal: Psychometric Weighting for Anti-Autistic Ableism Detection |
Proposes a psychometric weighting framework based on annotator positionality for detecting anti-autistic ableist speech. |
large language model |
|
|
| 14 |
Real Images, Worse Judgments: Evaluating Vision-Language Models on Concreteness and Imagery |
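Shows that vision-language models are easily distracted by image backgrounds when making lexical judgments, which lowers their agreement with human judgments. |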
multimodal |
|
|
| 15 |
ENPMR-Bench: Benchmarking Proactive Memory Retrieval for Emotional Support Agents |
Proposes the ENPMR-Bench benchmark to evaluate proactive memory retrieval in emotional support dialogue. |
chain-of-thought |
|
|
| 16 |
Temporal Simultaneity Predicts Annotation Quality in Sentiment Corpora |
Analyzes how temporal simultaneity affects annotation quality in sentiment corpora, and builds a Setswana corpus. |
TAMP |
|
|
| 17 |
BAIT: Boundary-Guided Disclosure Escalation via Self-Conditioned Reasoning |
BAIT: a jailbreak attack on large language models using self-conditioned reasoning and boundary guidance. |
large language model |
|
|
| 18 |
On the Hidden Costs of Counterfactual Knowledge Training in LLM Unlearning |
Reveals and mitigates hidden knowledge conflicts and the spread of hallucinations in counterfactual knowledge training for LLMs. |
large language model |
|
|
| 19 |
Share More, Search Less: Collaborative Parallel Thinking for Efficient Test-Time Scaling |
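Proposes the Collaborative Parallel Thinking (CPT) framework to improve the test-time reasoning efficiency of large language models. |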
large language model |
|
|
| 20 |
PersLitEval: Fine-grained Benchmark and Evaluation of LLMs on Persian Literature Questions |
PersLitEval: builds a fine-grained Persian literature benchmark to evaluate the performance of large language models. |
large language model |
|
|
| 21 |
Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals |
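Addresses prompt injection attacks with a deployment-aware evaluation framework and a detection method based on interpretable structural signals. |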
large language model |
|
|
| 22 |
Accountable Human-AI Deliberation with LLMs: Scaling Collective Intelligence through Symbiotic Scaffolding |
Proposes an LLM-based symbiotic human-AI collaboration framework to enhance collective intelligence while ensuring accountability. |
large language model |
|
|
| 23 |
Quality Without Usefulness: LLM-Generated XAI Narratives as Trust Heuristics Rather Than Decision Aids |
LLM-generated explainable AI narratives fail to improve decision utility and instead act as trust heuristics. |
large language model |
|
|
| 24 |
LATTE: Forecasting Peer Anchored Preference Trajectories for Personalized LLM Generation |
LATTE: forecasts peer-anchored preference trajectories to enable personalized LLM generation. |
large language model |
|
|
| 25 |
Verilog-Evolve: Feedback-Driven and Skill-Evolving Verilog Generation |
Verilog-Evolve: a feedback-driven, skill-evolving Verilog generation framework. |
large language model |
|
|
| 26 |
Model Unlearning Objectives Vary for Distinct Language Functions |
Proposes differentiated LLM unlearning objectives for distinct language functions to improve unlearning effectiveness. |
large language model |
|
|
| 27 |
Probing Minimalist Phase Structure in LLMs: What Universal Dependencies Cannot Represent |
Probes Minimalist syntactic structure in LLMs to reveal information that Universal Dependencies cannot represent. |
large language model |
|
|
| 28 |
Slide Deck Q&A Quality Assurance App: A Multi-Stage Pipeline for Pedagogical Question Generation |
Proposes the Slide Deck Q&A Quality Assurance system for generating high-quality pedagogical questions from slide decks. |
large language model |
✅ |
|
| 29 |
Towards Just-in-Time Adaptive Feedback: Enhancing Student Learning via Knowledge-Grounded LLM |
Proposes a knowledge-grounded LLM framework that delivers just-in-time adaptive feedback to improve student learning. |
large language model |
|
|