| 1 |
MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models |
MacroBench: an LLM-based testbed for web automation scripts |
large language model |
✅ |
|
| 2 |
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation |
Proposes a Bayesian framework for evaluating large language models that improves the stability and reliability of evaluation. |
large language model |
✅ |
|
| 3 |
AlphaApollo: Orchestrating Foundation Models and Professional Tools into a Self-Evolving System for Deep Agentic Reasoning |
AlphaApollo: orchestrates foundation models and professional tools through a self-evolving system to achieve deep agentic reasoning |
foundation model |
✅ |
|
| 4 |
What Shapes a Creative Machine Mind? Comprehensively Benchmarking Creativity in Foundation Models |
Proposes C^2-Eval, which comprehensively evaluates foundation models on convergent and divergent creativity |
foundation model |
|
|
| 5 |
LLM-Based Data Science Agents: A Survey of Capabilities, Challenges, and Future Directions |
Survey: capabilities, challenges, and future directions of LLM-based data science agents |
large language model, multimodal |
|
|