| 1 |
WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models |
Proposes WebCompass to address the limitations of existing coding evaluation methods |
large language model multimodal |
|
|
| 2 |
Using large language models for embodied planning introduces systematic safety risks |
Embodied planning with large language models carries systematic safety risks |
large language model |
|
|
| 3 |
MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval |
MathNet: a global multimodal benchmark dataset for mathematical reasoning and retrieval |
multimodal |
|
|
| 4 |
Multi-Agent Systems: From Classical Paradigms to Large Foundation Model-Enabled Futures |
A survey of multi-agent systems: from classical paradigms to a future enabled by large models |
foundation model |
|
|
| 5 |
Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling |
Proposes DASH: uses Delta Attention Selective Halting to speed up long-context prefilling while preserving hardware efficiency. |
large language model multimodal |
✅ |
|
| 6 |
Bayesian Active Learning with Gaussian Processes Guided by LLM Relevance Scoring for Dense Passage Retrieval |
Proposes BAGEL, which uses LLM guidance for Gaussian-process active learning to improve dense retrieval |
large language model multimodal |
|
|
| 7 |
Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots |
DailyDroid: compares text and screenshot inputs for LLM-driven smartphone automation and reveals its failure modes |
large language model multimodal |
|
|
| 8 |
Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion |
Systematically evaluates cloud and local LLMs as system dynamics AI assistants |
large language model |
|
|
| 9 |
AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization |
Proposes AQPIM to address the problem of in-memory activation quantization for large language models |
large language model |
|
|
| 10 |
Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition |
Proposes Adversarial Arena, which crowdsources high-quality LLM training data through interactive adversarial competition. |
large language model |
|
|
| 11 |
Document-as-Image Representations Fall Short for Scientific Retrieval |
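Reveals the limitations of document-image representations for scientific document retrieval and proposes a new benchmark based on LaTeX sources. |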
multimodal |
|
|
| 12 |
Six Llamas: Comparative Religious Ethics Through LoRA-Adapted Language Models |
Six Llamas: a study of comparative religious ethics using LoRA-adapted language models |
large language model |
|
|
| 13 |
Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies |
Proposes context-aware retrieval evaluation (CARE) to improve the accuracy of multi-hop reasoning evaluation in RAG systems |
large language model |
✅ |
|
| 14 |
From Fallback to Frontline: When Can LLMs be Superior Annotators of Human Perspectives? |
Challenges conventional wisdom: large language models surpass human annotators on tasks that annotate human perspectives |
large language model |
|
|
| 15 |
RAVEN: Retrieval-Augmented Vulnerability Exploration Network for Memory Corruption Analysis in User Code and Binary Programs |
RAVEN: a retrieval-augmented vulnerability exploration network for analyzing memory corruption in user code and binary programs |
large language model |
|
|
| 16 |
WebUncertainty: Dual-Level Uncertainty Driven Planning and Reasoning For Autonomous Web Agent |
WebUncertainty: dual-level uncertainty-driven planning and reasoning for autonomous web agents |
large language model |
|
|
| 17 |
Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective |
Reveals secret-leakage risks in code large language models caused by tokenization |
large language model |
|
|
| 18 |
Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks |
Proposes an LRP-based contrastive attribution method to analyze why LLMs fail on realistic benchmarks. |
large language model |
|
|
| 19 |
Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization |
EvoOR-Agent: proposes a co-evolutionary framework for automated optimization of operations research problems. |
large language model |
|
|