| 1 |
PersianMedQA: Evaluating Large Language Models on a Persian-English Bilingual Medical Question Answering Benchmark |
Proposes PersianMedQA for evaluating large language models on Persian-English bilingual medical question answering |
large language model instruction following chain-of-thought |
✅ |
|
| 2 |
Spurious Correlations and Beyond: Understanding and Mitigating Shortcut Learning in SDOH Extraction with Large Language Models |
Reveals and mitigates spurious correlations and shortcut learning in LLM-based SDOH extraction |
large language model chain-of-thought |
|
|
| 3 |
Exploring Multimodal Challenges in Toxic Chinese Detection: Taxonomy, Benchmark, and Findings |
Proposes a taxonomy of multimodal perturbations for toxic Chinese content detection and builds a benchmark to evaluate LLMs |
large language model multimodal |
|
|
| 4 |
Benchmarking Large Language Models for Cryptanalysis and Side-Channel Vulnerabilities |
Evaluates large language models on cryptanalysis and side-channel vulnerabilities |
large language model chain-of-thought |
|
|
| 5 |
When Large Multimodal Models Confront Evolving Knowledge:Challenges and Pathways |
Proposes the EVOKE benchmark to evaluate the capabilities and challenges of large multimodal models in evolving knowledge injection |
multimodal instruction following |
|
|
| 6 |
MMAFFBen: A Multilingual and Multimodal Affective Analysis Benchmark for Evaluating LLMs and VLMs |
Proposes MMAFFBen, a multilingual and multimodal affective analysis benchmark for evaluating LLMs and VLMs |
large language model multimodal |
✅ |
|
| 7 |
Don't Reinvent the Wheel: Efficient Instruction-Following Text Embedding based on Guided Space Transformation |
Proposes GSTransform, which achieves efficient instruction-following text embedding through guided space transformation |
instruction following |
✅ |
|
| 8 |
Who is in the Spotlight: The Hidden Bias Undermining Multimodal Retrieval-Augmented Generation |
Reveals position bias in multimodal RAG, proposes a position sensitivity metric, and analyzes its effect on performance |
multimodal |
|
|
| 9 |
Multilinguality Does not Make Sense: Investigating Factors Behind Zero-Shot Transfer in Sense-Aware Tasks |
Multilinguality is not the key to improving zero-shot transfer in sense-aware tasks; data and evaluation matter more |
zero-shot transfer |
|
|
| 10 |
Revisiting Epistemic Markers in Confidence Estimation: Can Markers Accurately Reflect Large Language Models' Uncertainty? |
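Shows that the epistemic markers of large language models fail to accurately reflect their uncertainty in out-of-distribution settings |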
large language model |
✅ |
|
| 11 |
HESEIA: A community-based dataset for evaluating social biases in large language models, co-designed in real school settings in Latin America |
Proposes the HESEIA dataset for evaluating social biases of large language models in Latin American school settings |
large language model |
|
|
| 12 |
Soft Reasoning: Navigating Solution Spaces in Large Language Models through Controlled Embedding Exploration |
Proposes the Soft Reasoning framework, which improves complex reasoning in large language models through controlled embedding exploration |
large language model |
✅ |
|
| 13 |
TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis |
TRIDENT: enhances large language model safety through tri-dimensional diversified red-teaming data synthesis |
large language model |
|
|
| 14 |
Disentangling Language and Culture for Evaluating Multilingual Large Language Models |
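Proposes a dual evaluation framework that disentangles language and culture for a more comprehensive evaluation of multilingual large language models |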
large language model |
|
|
| 15 |
Harnessing Large Language Models for Scientific Novelty Detection |
Uses large language models for scientific novelty detection and builds related benchmark datasets |
large language model |
|
|
| 16 |
CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation |
CaMMT: builds a benchmark dataset for culturally aware multimodal machine translation |
multimodal |
|
|
| 17 |
Donate or Create? Comparing Data Collection Strategies for Emotion-labeled Multimodal Social Media Posts |
Compares donated and created data to evaluate data collection strategies for emotion-labeled multimodal social media posts |
multimodal |
|
|
| 18 |
Multilingual Gloss-free Sign Language Translation: Towards Building a Sign Language Foundation Model |
Proposes a multilingual gloss-free sign language translation model that supports translation across multiple sign languages |
foundation model |
|
|
| 19 |
Unifying Language Agent Algorithms with Graph-based Orchestration Engine for Reproducible Agent Research |
AGORA: a unified framework for language agent algorithms built on a graph-based orchestration engine, promoting reproducible research |
large language model multimodal chain-of-thought |
|
|
| 20 |
Advantageous Parameter Expansion Training Makes Better Large Language Models |
APEX: improves large language model performance through advantageous parameter expansion training |
large language model |
|
|
| 21 |
Beyond Exponential Decay: Rethinking Error Accumulation in Large Language Models |
Rethinks error accumulation in LLMs: focuses on key tokens to break through the long-sequence performance bottleneck |
large language model |
|
|
| 22 |
Effects of Theory of Mind and Prosocial Beliefs on Steering Human-Aligned Behaviors of LLMs in Ultimatum Games |
Uses theory of mind and prosocial beliefs to steer LLMs toward human-aligned behavior in ultimatum games |
large language model chain-of-thought |
✅ |
|
| 23 |
FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation |
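FinMME: a benchmark dataset for financial multimodal reasoning evaluation, filling the gap in multimodal evaluation for the financial domain |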
large language model multimodal |
✅ |
|
| 24 |
LegalEval-Q: A New Benchmark for The Quality Evaluation of LLM-Generated Legal Text |
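LegalEval-Q: proposes a benchmark for evaluating the quality of LLM-generated legal text, focusing on clarity, coherence, and terminological accuracy |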
large language model |
✅ |
|
| 25 |
Lossless Token Sequence Compression via Meta-Tokens |
Proposes a lossless compression method based on meta-tokens that shortens LLM input sequences and speeds up encoding |
large language model |
|
|
| 26 |
Model Unlearning via Sparse Autoencoder Subspace Guided Projections |
Proposes SSPU, which uses sparse autoencoder subspace projections for interpretable and robust knowledge unlearning in large models |
large language model |
|
|
| 27 |
HD-NDEs: Neural Differential Equations for Hallucination Detection in LLMs |
Proposes HD-NDEs, which use neural differential equations to detect hallucinations in LLMs |
large language model |
|
|
| 28 |
An evaluation of LLMs for generating movie reviews: GPT-4o, Gemini-2.0 and DeepSeek-V3 |
Evaluates large language models on movie review generation: GPT-4o, Gemini-2.0, and DeepSeek-V3 |
large language model |
|
|
| 29 |
Hierarchical Level-Wise News Article Clustering via Multilingual Matryoshka Embeddings |
Proposes a hierarchical news article clustering method based on multilingual Matryoshka embeddings, improving scalability and interpretability |
large language model |
|
|
| 30 |
Multiple LLM Agents Debate for Equitable Cultural Alignment |
Proposes a multi-agent debate framework that improves the adaptability and fairness of LLMs across cultural contexts |
large language model |
|
|
| 31 |
Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX |
Proposes POLLUX, a comprehensive open-source benchmark for evaluating the generative capabilities of Russian-language LLMs |
large language model |
|
|
| 32 |
Bench4KE: Benchmarking Automated Competency Question Generation |
Bench4KE: a benchmarking system for automated competency question generation |
large language model |
|
|
| 33 |
Cross-Attention Speculative Decoding |
Proposes Beagle, a cross-attention-based speculative decoding model that simplifies the architecture and improves training efficiency |
large language model |
|
|
| 34 |
Localizing Persona Representations in LLMs |
Studies where persona representations are localized in large language models and how they are encoded |
large language model |
|
|
| 35 |
Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations |
Proposes an NLI explanation framework that combines LLMs with a theorem prover, improving faithfulness and robustness |
large language model |
|
|
| 36 |
COSMIC: Generalized Refusal Direction Identification in LLM Activations |
COSMIC: a generalized method for identifying refusal directions in the LLM activation space |
large language model |
|
|
| 37 |
LKD-KGC: Domain-Specific KG Construction via LLM-driven Knowledge Dependency Parsing |
Proposes the LKD-KGC framework, which builds domain-specific knowledge graphs through LLM-driven knowledge dependency parsing |
large language model |
|
|
| 38 |
CASPER: A Large Scale Spontaneous Speech Dataset |
CASPER: a large-scale spontaneous speech dataset that addresses the scarcity of high-quality spontaneous speech data |
large language model |
|
|
| 39 |
MultiHoax: A Dataset of Multi-hop False-Premise Questions |
Proposes the MultiHoax dataset for evaluating the ability of LLMs to detect false premises in multi-hop reasoning |
large language model |
|
|
| 40 |
The Impact of Disability Disclosure on Fairness and Bias in LLM-Driven Candidate Selection |
Reveals how disability disclosure affects fairness and bias in LLM-driven candidate selection |
large language model |
|
|
| 41 |
Guiding Generative Storytelling with Knowledge Graphs |
Proposes a knowledge-graph-assisted generative storytelling framework that improves long-text coherence and user control |
large language model |
|
|
| 42 |
From Macro to Micro: Probing Dataset Diversity in Language Model Fine-Tuning |
Probes dataset diversity in language model fine-tuning with an analysis framework spanning macro to micro levels |
large language model |
|
|
| 43 |
BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization |
Proposes SCRIPT encoding, which makes BPE more robust in multilingual pretokenization and avoids penalizing non-Western scripts |
large language model |
|
|
| 44 |
A*-Thought: Efficient Reasoning via Bidirectional Compression for Low-Resource Settings |
Proposes A*-Thought to address reasoning efficiency in low-resource settings |
chain-of-thought |
✅ |
|
| 45 |
Don't Erase, Inform! Detecting and Contextualizing Harmful Language in Cultural Heritage Collections |
Proposes an AI tool that detects harmful language in cultural heritage data and provides contextual information |
large language model |
|
|
| 46 |
ClueAnchor: Clue-Anchored Knowledge Reasoning Exploration and Optimization for Retrieval-Augmented Generation |
Proposes ClueAnchor, which enhances retrieval-augmented generation through clue-anchored knowledge reasoning exploration and optimization |
large language model |
✅ |
|
| 47 |
LLM Inference Enhanced by External Knowledge: A Survey |
Survey: enhancing large language model inference with external knowledge |
large language model |
|
|
| 48 |
HiCaM: A Hierarchical-Causal Modification Framework for Long-Form Text Modification |
Proposes the HiCaM framework, which improves long-form text modification by modeling hierarchical causal relations |
large language model |
|
|
| 49 |
Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation |
Proposes and evaluates several data leakage detection methods to improve the reliability of LLM evaluation benchmarks |
large language model |
|
|
| 50 |
Semi-structured LLM Reasoners Can Be Rigorously Audited |
Proposes semi-structured reasoning models to address the auditability of large language models |
large language model |
|
|
| 51 |
CLaSp: In-Context Layer Skip for Self-Speculative Decoding |
CLaSp: proposes an in-context layer-skipping method for self-speculative decoding that accelerates LLM inference |
large language model |
|
|
| 52 |
CrossICL: Cross-Task In-Context Learning via Unsupervised Demonstration Transfer |
Proposes CrossICL, which achieves cross-task in-context learning through unsupervised demonstration transfer |
large language model |
|
|
| 53 |
R-KV: Redundancy-aware KV Cache Compression for Reasoning Models |
R-KV: proposes a redundancy-aware KV cache compression method for reasoning models |
chain-of-thought |
|
|