| 1 | A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models | Proposes ART, an audio reasoning task benchmark for evaluating the audio reasoning capabilities of multimodal large language models. | large language model multimodal | | |
| 2 | TS-Debate: Multimodal Collaborative Debate for Zero-Shot Time Series Reasoning | Proposes TS-Debate, a multimodal collaborative debate framework for zero-shot time series reasoning. | large language model multimodal | | |
| 3 | SAM Audio Judge: A Unified Multimodal Framework for Perceptual Evaluation of Audio Separation | Proposes SAM Audio Judge, a multimodal framework for reference-free quality evaluation of audio separation. | multimodal | ✅ | |
| 4 | CoReTab: Improving Multimodal Table Understanding with Code-driven Reasoning | Proposes the CoReTab framework, which improves multimodal table understanding through code-driven reasoning. | multimodal | | |
| 5 | GAVEL: Towards rule-based safety through activation monitoring | GAVEL: a rule-based safety framework built on activation monitoring that improves LLM safety. | large language model | | |
| 6 | Benchmarks Saturate When The Model Gets Smarter Than The Judge | Omni-MATH-2: improves the accuracy of math benchmarking through a high-quality dataset and reliable evaluation. | large language model | | |
| 7 | RPO:Reinforcement Fine-Tuning with Partial Reasoning Optimization | RPO: reinforcement fine-tuning via partial reasoning optimization, substantially reducing computational overhead. | large language model | ✅ | |
| 8 | Beyond In-Domain Detection: SpikeScore for Cross-Domain Hallucination Detection | Proposes SpikeScore to address cross-domain hallucination detection in large language models. | large language model | | |
| 9 | AgenticSCR: An Autonomous Agentic Secure Code Review for Immature Vulnerabilities Detection | AgenticSCR: an autonomous agentic secure code review approach for detecting early-stage (immature) vulnerabilities. | large language model | | |
| 10 | Veri-Sure: A Contract-Aware Multi-Agent Framework with Temporal Tracing and Formal Verification for Correct RTL Code Generation | Proposes the Veri-Sure framework, which uses formal verification to improve the correctness of LLM-generated RTL code. | large language model | | |
| 11 | RvB: Automating AI System Hardening via Iterative Red-Blue Games | RvB: automates AI system hardening through iterative red-blue adversarial games. | large language model | | |
| 12 | Algorithmic Prompt-Augmentation for Efficient LLM-Based Heuristic Design for A* Search | Proposes A-CEoH, a framework using algorithmic prompt augmentation to efficiently have LLMs design heuristic functions for A* search. | large language model | | |
| 13 | ComAgent: Multi-LLM based Agentic AI Empowered Intelligent Wireless Networks | ComAgent: a multi-LLM agent approach for intelligent wireless networks that improves cross-layer optimization efficiency. | large language model | | |
| 14 | AACR-Bench: Evaluating Automatic Code Review with Holistic Repository-Level Context | Proposes AACR-Bench for evaluating large language models on automated code review with repository-level context. | large language model | ✅ | |
| 15 | Revisiting Parameter Server in LLM Post-Training | Targets load imbalance in LLM post-training and proposes On-Demand Communication to accelerate training. | large language model | ✅ | |
| 16 | Balancing Sustainability And Performance: The Role Of Small-Scale Llms In Agentic Artificial Intelligence Systems | Uses small-scale LLMs to balance sustainability and performance in agentic AI systems. | large language model | | |
| 17 | GLOVE: Global Verifier for LLM Memory-Environment Realignment | GLOVE: a global verifier for realigning LLM memory with the environment. | large language model | | |
| 18 | MAGNET: Towards Adaptive GUI Agents with Memory-Driven Knowledge Evolution | Proposes MAGNET, which builds adaptive GUI agents through memory-driven knowledge evolution. | foundation model | | |
| 19 | Multi-Agent Procedural Graph Extraction with Structural and Logical Refinement | Proposes a multi-agent procedural graph extraction framework that addresses structural validity and logical consistency. | large language model | | |
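| 20 | Detecting and Correcting Hallucinations in LLM-Generated Code via Deterministic AST Analysis | Proposes a framework that detects and corrects code hallucinations via deterministic AST analysis, improving the reliability of LLM code generation. | large language model | | |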
| 21 | HalluJudge: A Reference-Free Hallucination Detection for Context Misalignment in Code Review Automation | Proposes HalluJudge for detecting hallucinations caused by context misalignment in code review automation. | large language model | | |