| 1 |
ABC-Eval: Benchmarking Large Language Models on Symbolic Music Understanding and Instruction Following |
Proposes the ABC-Eval benchmark for evaluating large language models on symbolic music understanding and instruction following. |
large language model instruction following |
|
|
| 2 |
Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned |
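Proposes a hybrid data synthesis framework and perception-focused supervision to improve the multimodal reasoning of vision-language models. |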
large language model multimodal visual grounding |
|
|
| 3 |
AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models |
Proposes the AudioRole dataset to improve the audio personalization of large language models in role-playing. |
large language model multimodal |
|
|
| 4 |
Transferring Vision-Language-Action Models to Industry Applications: Architectures, Performance, and Challenges |
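Evaluates the performance and challenges of vision-language-action models in industrial applications and analyzes their deployment feasibility. |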
vision-language-action VLA |
|
|
| 5 |
Measuring Physical-World Privacy Awareness of Large Language Models: An Evaluation Benchmark |
Proposes the EAPrivacy benchmark for evaluating the privacy awareness of embodied agents in the physical world. |
large language model |
✅ |
|
| 6 |
Fact Grounded Attention: Eliminating Hallucination in Large Language Models Through Attention Level Knowledge Integration |
Proposes Fact Grounded Attention, which injects knowledge into the attention mechanism to eliminate factual hallucination in large language models. |
large language model |
|
|
| 7 |
Artificial Phantasia: Evidence for Propositional Reasoning-Based Mental Imagery in Large Language Models |
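Proposes a mental imagery task based on propositional reasoning to evaluate the complex cognitive abilities of large language models. |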
large language model |
|
|
| 8 |
CATMark: A Context-Aware Thresholding Framework for Robust Cross-Task Watermarking in Large Language Models |
Proposes CATMark, a context-aware thresholding framework for robust cross-task watermarking in large language models. |
large language model |
|
|
| 9 |
Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification |
DafnyCOMP: a benchmark for evaluating large language models on compositional formal verification. |
large language model |
|
|
| 10 |
Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia |
Proposes the Mini-Mafia benchmark to test the social intelligence of LLMs, evaluating deception, detection, and information disclosure. |
large language model |
|
|
| 11 |
GUI-PRA: Process Reward Agent for GUI Tasks |
Proposes GUI-PRA, which uses dynamic memory and UI perception to improve the performance of process reward models on GUI tasks. |
large language model multimodal |
|
|
| 12 |
Agentic AI Reasoning for Mobile Edge General Intelligence: Fundamentals, Approaches, and Directions |
Proposes an agentic AI reasoning framework for mobile edge general intelligence that optimizes resource efficiency and reasoning quality. |
large language model chain-of-thought |
|
|
| 13 |
VeriGRAG: Enhancing LLM-Based Verilog Code Generation with Structure-Aware Soft Prompts |
VeriGRAG: enhances LLM-based Verilog code generation with structure-aware soft prompts. |
large language model multimodal |
|
|
| 14 |
Your Dense Retriever is Secretly an Expeditious Reasoner |
Proposes AdaQR, an adaptive hybrid query rewriting framework that improves the efficiency of reasoning-based retrieval. |
large language model |
|
|
| 15 |
PARROT: A Benchmark for Evaluating LLMs in Cross-System SQL Translation |
PARROT: a benchmark for evaluating LLMs on cross-system SQL translation. |
large language model |
✅ |
|
| 16 |
Understanding and Enhancing the Planning Capability of Language Models via Multi-Token Prediction |
Enhances the reasoning of language models in complex planning through multi-token prediction. |
large language model |
|
|
| 17 |
MathBode: Measuring the Stability of LLM Reasoning using Frequency Response |
MathBode: measures the stability of LLM mathematical reasoning using frequency response. |
large language model |
|
|
| 18 |
ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search |
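Proposes ReliabilityRAG, which uses document reliability information to make RAG more robust in web search and to defend against attacks on the retrieval corpus. |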
large language model |
|
|
| 19 |
Model Consistency as a Cheap yet Predictive Proxy for LLM Elo Scores |
Proposes a model-consistency-based proxy for LLM Elo scores that is efficient and requires no human evaluation. |
large language model |
|
|
| 20 |
GeoBS: Information-Theoretic Quantification of Geographic Bias in AI Models |
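Proposes the GeoBS framework, which uses information theory to quantify geographic bias in AI models while accounting for spatial factors. |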
foundation model |
|
|
| 21 |
NeuroBridge: Using Generative AI to Bridge Cross-neurotype Communication Differences through Neurotypical Perspective-taking |
NeuroBridge: uses generative AI and neurotypical perspective-taking to bridge cross-neurotype communication differences. |
large language model |
|
|
| 22 |
Scaling LLM Test-Time Compute with Mobile NPU on Smartphones |
Proposes a parallel test-time scaling method for LLMs on mobile NPUs that improves the performance of small models. |
large language model |
|
|
| 23 |
p-less Sampling: A Robust Hyperparameter-Free Approach for LLM Decoding |
Proposes p-less sampling, a robust, hyperparameter-free LLM decoding strategy that improves generation quality. |
large language model |
✅ |
|
| 24 |
AutoEP: LLMs-Driven Automation of Hyperparameter Evolution for Metaheuristic Algorithms |
AutoEP: uses LLM-driven hyperparameter evolution to automatically optimize metaheuristic algorithms. |
large language model |
|
|
| 25 |
BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software |
BuildBench: benchmarks LLM agents on compiling real-world open-source software. |
large language model |
|
|
| 26 |
Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents |
Kimi-Dev: uses agentless training as a skill prior to improve the performance of software engineering agents. |
large language model |
|
|
| 27 |
LLM Watermark Evasion via Bias Inversion |
Proposes the Bias-Inversion Rewriting Attack, which effectively evades LLM watermarks. |
large language model |
|
|