| 1 |
When Tom Eats Kimchi: Evaluating Cultural Bias of Multimodal Large Language Models in Cultural Mixture Contexts |
Proposes the MixCuBe benchmark to evaluate the cultural bias of multimodal large language models in culturally mixed contexts |
large language model multimodal |
✅ |
|
| 2 |
MTBench: A Multimodal Time Series Benchmark for Temporal Reasoning and Question Answering |
Proposes MTBench, a multimodal time series benchmark for evaluating LLMs on temporal reasoning and question answering |
large language model multimodal |
|
|
| 3 |
Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models |
Bayesian teaching improves probabilistic reasoning in large language models |
large language model |
|
|
| 4 |
SaudiCulture: A Benchmark for Evaluating Large Language Models Cultural Competence within Saudi Arabia |
Proposes the SaudiCulture benchmark to evaluate the competence of large language models in the cultural context of Saudi Arabia |
large language model |
|
|
| 5 |
SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging |
SafeMERGE: preserves safety alignment in fine-tuned large language models via selective layer-wise model merging |
large language model |
|
|
| 6 |
Automating Adjudication of Cardiovascular Events Using Large Language Models |
Proposes an LLM-based framework to automate clinical trial adjudication of cardiovascular events |
large language model |
|
|
| 7 |
Text2Model: Generating dynamic chemical reactor models using large language models (LLMs) |
Text2Model: generates dynamic chemical reactor models using large language models |
large language model |
|
|
| 8 |
A Survey on Personalized Alignment -- The Missing Piece for Large Language Models in Real-World Applications |
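Addresses the lack of personalized alignment for large language models in real-world applications with a comprehensive survey and a unified framework |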
large language model |
|
|
| 9 |
Judge Anything: MLLM as a Judge Across Any Modality |
Proposes the TaskAnything and JudgeAnything benchmarks to evaluate MLLM performance on cross-modal understanding and generation tasks |
foundation model multimodal |
✅ |
|
| 10 |
From Text to Talent: A Pipeline for Extracting Insights from Candidate Profiles |
Proposes a recruitment pipeline based on LLMs and graph similarity to recommend ideal candidates for job openings |
large language model multimodal |
|
|
| 11 |
Language Models May Verbatim Complete Text They Were Not Explicitly Trained On |
Large language models may generate text they were not explicitly trained on, challenging existing definitions of membership |
large language model |
|
|
| 12 |
Language-specific Neurons Do Not Facilitate Cross-Lingual Transfer |
Shows that language-specific neurons do not effectively facilitate cross-lingual transfer in multilingual models |
large language model |
|
|
| 13 |
Leveraging Human Production-Interpretation Asymmetries to Test LLM Cognitive Plausibility |
Uses human production-interpretation asymmetries to test the cognitive plausibility of LLMs |
large language model |
✅ |
|
| 14 |
Dancing with Critiques: Enhancing LLM Reasoning with Stepwise Natural Language Self-Critique |
Proposes PANEL, which uses natural language self-critique to enhance LLM reasoning |
large language model |
✅ |
|
| 15 |
CASE -- Condition-Aware Sentence Embeddings for Conditional Semantic Textual Similarity Measurement |
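Proposes the CASE model, which uses condition-aware sentence embeddings to improve conditional semantic textual similarity measurement |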
large language model |
|
|
| 16 |
CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization |
Proposes CoKe, which enables customizable fine-grained story evaluation via chain-of-keyword rationalization |
chain-of-thought |
|
|
| 17 |
MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers |
Proposes the MMCR benchmark to evaluate the cross-source reasoning ability of vision-language models on scientific papers |
chain-of-thought |
|
|
| 18 |
Interpretable LLM Guardrails via Sparse Representation Steering |
Proposes the Sparse Representation Steering (SRS) framework for fine-grained, interpretable control of LLM behavior |
large language model |
|
|