| 1 |
Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions |
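Compares chain-of-thought, single-step code execution, and iterative code execution to evaluate the robustness of large language models on variations of math questions. |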
large language model chain-of-thought |
|
|
| 2 |
Boosting Knowledge Graph Foundation Models via Enhanced Negative Sampling |
Proposes KMAS, an adaptive negative sampling method that improves knowledge graph foundation models on zero-shot knowledge graph completion. |
foundation model |
|
|
| 3 |
Generating Robust Portfolios of Optimization Models using Large Language Models |
Uses large language models to generate portfolios of optimization models, improving decision robustness. |
large language model |
|
|
| 4 |
What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation |
Proposes a local co-occurrence activation model to explain why chain-of-thought is effective. |
chain-of-thought |
|
|
| 5 |
LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations? |
Proposes LiveK12Bench to evaluate the reasoning ability of large multimodal models in real high school examination settings. |
multimodal |
|
|
| 6 |
Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal |
Chain-of-thought disrupts simple steering of refusal behavior, revealing a new attack surface in large reasoning models. |
chain-of-thought |
|
|
| 7 |
MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning |
MedGuideX: integrates the decision logic of executable guidelines into large language models for clinical reasoning. |
large language model |
|
|
| 8 |
Gumbel Machine: Counterfactual Student Writing Generation via Gumbel Noise Steering |
Gumbel Machine: generates counterfactual student writing via Gumbel noise steering. |
large language model instruction following |
|
|
| 9 |
MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation |
MUSE-Autoskill: self-evolving agents through skill lifecycle management. |
large language model |
|
|
| 10 |
Cordyceps: Covert Control Attacks on LLMs via Data Poisoning |
Cordyceps: covert control attacks on LLMs via data poisoning. |
large language model |
|
|
| 11 |
GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesis, Research, and Testing |
GENESIS: uses AI agents for autonomous synthesis, research, and testing of 6G RAN. |
large language model |
|
|
| 12 |
Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation |
Builds the Qiskit QuantumKatas benchmark for evaluating LLM capabilities on quantum computing tasks. |
chain-of-thought |
|
|
| 13 |
Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments |
NoisyAgent: improves the robustness of LLM agents in real-world scenarios by training in noisy environments. |
large language model |
|
|
| 14 |
VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions |
VitaBench 2.0: evaluates personalized and proactive agents in long-term user interactions. |
large language model |
|
|
| 15 |
Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry |
Chat-ISV: an LLM-assisted decision support system for VOCs control in the steel industry, based on traceable knowledge graph reasoning. |
large language model |
|
|
| 16 |
ConVer: Using Contracts and Loop Invariant Synthesis for Scalable Formal Software Verification |
ConVer: scalable formal software verification using contracts and loop invariant synthesis. |
large language model |
|
|
| 17 |
ReasonOps: A Unified Operational Paradigm for Trustworthy Verified LLM Reasoning |
Proposes ReasonOps, a unified operational paradigm for trustworthy, verifiable LLM reasoning. |
large language model |
|
|
| 18 |
Strategies for Guiding LLMs to Use Software Design Patterns: A Case of Singleton |
Explores strategies for guiding LLMs to apply the Singleton design pattern, improving code quality and consistency. |
large language model |
|
|
| 19 |
Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study |
Builds a persistent AI agent research environment and explores its application and performance in academic research. |
large language model |
|
|
| 20 |
MatFormBench: A Benchmarking Evaluation Framework for Target-Driven Materials Formulation |
MatFormBench: a comprehensive benchmarking framework for target-driven materials formulation design. |
large language model |
|
|
| 21 |
Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation |
Proposes CUDAnalyst for analyzing the impact of feedback-to-plan decisions made by LLM agents in CUDA kernel generation. |
large language model |
|
|
| 22 |
L2Rec: Towards Dual-View Understanding of LLMs for Personalized Recommendation |
L2Rec: personalized recommendation through a dual-view understanding of LLMs. |
large language model |
|
|
| 23 |
Plans for Evaluating Structured Generative Search Summaries |
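Proposes a framework for evaluating structured generative search summaries, aimed at improving the presentation of web search results. |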
large language model |
|
|