| 1 |
Quantifying construct validity in large language model evaluations |
Proposes a structured capabilities model for quantifying construct validity in large language model evaluations |
large language model |
|
|
| 2 |
Enhancing Building Semantics Preservation in AI Model Training with Large Language Model Encodings |
Uses large language model encodings to better preserve building semantics during AI model training |
large language model |
|
|
| 3 |
How Vision Becomes Language: A Layer-wise Information-Theoretic Analysis of Multimodal Reasoning |
Proposes a layer-wise information-theoretic analysis framework for explaining the reasoning mechanisms of multimodal Transformers. |
multimodal |
|
|
| 4 |
CARE Drive A Framework for Evaluating Reason-Responsiveness of Vision Language Models in Automated Driving |
CARE Drive: evaluating how responsive vision language models are to human reasons in automated driving |
foundation model |
|
|
| 5 |
Decision Quality Evaluation Framework at Pinterest |
Pinterest proposes a decision quality evaluation framework to improve the enforcement of content safety policies. |
large language model |
|
|
| 6 |
This human study did not involve human subjects: Validating LLM simulations as behavioral evidence |
Validating LLM simulations as behavioral evidence: examining the validity of LLMs in social science experiments |
large language model |
|
|
| 7 |
SecCodeBench-V2 Technical Report |
SecCodeBench-V2: an industrial-grade benchmark for evaluating the security of LLM-generated code |
large language model |
✅ |
|
| 8 |
Automated Multi-Source Debugging and Natural Language Error Explanation for Dashboard Applications |
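Proposes a framework for automated multi-source debugging and natural language error explanation to make debugging dashboard applications more efficient. |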
large language model |
|
|
| 9 |
EAA: Automating materials characterization with vision language model agents |
EAA: automating materials characterization experiment workflows with vision language model agents |
multimodal |
|
|