| 1 |
Token Inflation: How Dishonest Providers Can Overcharge for Large Language Model Usage |
Reveals the fraud risk in per-token LLM billing: providers can maliciously overreport token counts |
large language model |
|
|
| 2 |
MuPHI: Learning Implicit Multimodal Harm Reasoning via Semantically Grounded Reward Optimization |
Proposes the MuPHIRM framework, which improves VLMs' implicit multimodal harm reasoning through semantically aligned reward optimization |
multimodal |
|
|
| 3 |
COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings |
COMET: concept-space dissection of the modality gap in audio-text multimodal contrastive embeddings |
multimodal |
|
|
| 4 |
Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence |
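Proposes HetMedAgent, a heterogeneous multi-agent framework that combines general-purpose LLMs with specialist models to improve medical decision-making performance. |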
large language model foundation model |
|
|
| 5 |
SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations |
SchGen: a PCB schematic generation model based on semantic code representations |
large language model |
|
|
| 6 |
Demystifying Data Organization for Enhanced LLM Training |
Explores data organization strategies to improve the efficiency and stability of large language model training |
large language model |
✅ |
|
| 7 |
Teaching Values to Machines: Simulating Human-Like Behavior in LLMs |
Uses value-based guidance to make LLMs simulate behavior that is more consistent with humans |
large language model |
|
|
| 8 |
PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing |
Proposes the PRAIB benchmark to evaluate LLM-assisted reviewing behavior and reveal how it differs from human reviewing |
large language model |
|
|
| 9 |
Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems |
Evaluates how well the token-optimized formats TOON and TRON reduce token overhead in agentic AI systems. |
large language model |
|
|
| 10 |
VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing |
VLA-Trace: diagnoses vision-language-action models through representation and behavior tracing |
vision-language-action VLA multimodal |
|
|
| 11 |
Reliable Reasoning with Large Language Models via Preference-Based Maximum Satisfiability |
Proposes a method based on preference-based maximum satisfiability to address LLM optimization problems |
large language model chain-of-thought |
|
|
| 12 |
CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models |
Proposes CodeGolf Bench to evaluate how well large language models generate concise code across 60 programming languages. |
large language model |
|
|
| 13 |
OmniMatBench: A Human-Calibrated Multimodal Reasoning Benchmark Across 19 Materials Science Subfields |
Proposes OmniMatBench to address shortcomings in multimodal reasoning for materials science |
multimodal |
|
|
| 14 |
Social Reasoning in Machines: Investigating Collective Truth-Seeking Dynamics in Large Language Model Debate |
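Simulates theories of argumentative reasoning through multi-agent LLM debate, improving truth-seeking performance on question-answering tasks. |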
large language model |
|
|
| 15 |
Harnessing non-adversarial robustness in large language models |
Proposes a method based on debiasing fine-tuning to improve the non-adversarial robustness of LLMs |
large language model |
|
|
| 16 |
Benchmarking Positional Encoding Strategies for Transformer-Based EEG Foundation Models |
Compares and evaluates several positional encoding strategies for Transformer-based EEG foundation models on brain-computer interface tasks. |
foundation model |
|
|
| 17 |
Improving Collaborative Storytelling with a Multi-Agent Framework Based on Large Language Models |
Proposes an LLM-based multi-agent framework that improves the quality of children's collaborative story creation |
large language model |
|
|
| 18 |
HiKEY: Hierarchical Multimodal Retrieval for Open-Domain Document Question Answering |
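HiKEY proposes a hierarchical multimodal retrieval framework that addresses routing failures and evidence fragmentation in open-domain document question answering. |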
multimodal |
|
|
| 19 |
FinVerBench: Benchmark Validity and Calibration in Large Language Model Financial Statement Verification |
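FinVerBench: builds a financial statement verification benchmark to assess the validity and calibration of large language models in judging financial consistency |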
large language model |
|
|
| 20 |
DenseSteer: Steering Small Language Models towards Dense Math Reasoning |
DenseSteer: steers small language models toward high-density math reasoning |
large language model chain-of-thought |
|
|
| 21 |
Inferring Code Correctness from Specification |
TRAILS: infers code correctness from input-output aligned specifications, improving the accuracy of LLM code verification. |
large language model chain-of-thought |
|
|
| 22 |
When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs |
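Shows that persona prompting mainly reshapes the characteristics of LLM responses rather than improving capability, so multi-metric evaluation is needed. |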
large language model |
|
|
| 23 |
When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop |
Studies the negative effects of human intervention and the preference alignment problem in multi-model self-consuming loops |
foundation model |
|
|
| 24 |
Automatically Attacking Software Reverse Engineering AI Agents |
Proposes a genetic-algorithm-based prompt generation method for attacking software reverse engineering AI agents. |
large language model |
|
|
| 25 |
Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents |
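Reveals that harness updating and harness benefit are decoupled in self-evolving LLM agents, and optimizes agent training strategies accordingly. |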
instruction following |
✅ |
|
| 26 |
Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection |
Proposes VisAnomReasoner, an efficient vision-language reasoning model for time-series anomaly detection. |
multimodal |
|
|
| 27 |
ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure |
ProjectionBench: proposes an evaluation framework for LLM scientific hypothesis generation under progressive information disclosure |
large language model |
|
|
| 28 |
Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization |
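Proposes a temporal and structural credit assignment method that optimizes prompts in LLM multi-agent systems and improves performance on complex reasoning tasks. |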
large language model |
|
|
| 29 |
Double-Edged Sword or Sharp Tool? Designing and Evaluating Triadic LLM-Teacher Collaboration for K-12 Writing at Scale |
Designs and evaluates a triadic LLM-teacher collaboration system for K-12 writing instruction at scale |
large language model |
|
|
| 30 |
When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems |
Explores hybrid multi-agent systems: the design space of collaborative reasoning between cloud agents and on-device agents |
large language model |
|
|
| 31 |
PokerSkill: LLMs Can Play Expert-Level Poker without Training or Solvers |
Proposes the PokerSkill framework for playing poker without training |
large language model |
✅ |
|
| 32 |
Projectional Decoding: Towards Semantic-Aware LLM Generation |
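Proposes projectional decoding, which integrates domain semantics to improve the semantic validity of LLM-generated software artifacts. |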
large language model |
|
|
| 33 |
RAISE: RAG Design as an Architecture Search Problem |
Proposes the RAISE framework, which recasts RAG design as an architecture search problem to optimize RAG hyperparameters. |
multimodal |
|
|
| 34 |
From GPS Points to Travel Patterns: Flexible and Semantic Trajectory Generation with LLMs |
HTP: uses LLMs to generate urban trajectories hierarchically, addressing the shortage of trajectory data under privacy constraints |
large language model |
✅ |
|
| 35 |
Compass: Navigating Global Marine Lead Data Integration through Expert-Guided LLM Agent |
Compass: navigates global marine lead data integration with an expert-guided LLM agent |
large language model |
|
|
| 36 |
Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction |
Proposes MemPoison, which stealthily hijacks LLM agent memory through conversational interaction to carry out Trojan attacks. |
large language model |
|
|
| 37 |
HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding |
Proposes HoliTok, a robust continuous holistic speech tokenization model for speech generation and understanding |
foundation model |
✅ |
|
| 38 |
Make LLM Learn to Synthesize from Streaming Experiences through Feedback |
Proposes StreamSynth and SynLearner, which let LLMs learn continually and transfer experience in streaming synthesis tasks. |
large language model |
|
|
| 39 |
Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation |
Proposes Moment-KV, a momentum-based decode-time KV cache compression method for improving long-text generation quality. |
large language model |
|
|
| 40 |
Mitigating Stethoscope-Induced Shortcuts in Respiratory Sound Classification under Federated Domain Generalization with Causality-Inspired Interventions |
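Proposes a federated domain generalization method with causal interventions to address stethoscope-induced spurious correlations in respiratory sound classification. |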
multimodal |
|
|
| 41 |
LFQ: Logit-aware Final-block Quantization for Boosting the Generation Quality of Low-Bit Quantized LLMs |
LFQ: logit-aware final-block quantization that improves the generation quality of low-bit quantized LLMs |
large language model |
|
|
| 42 |
NaRA: Noise-Aware LoRA for Parameter-Efficient Fine-Tuning of Diffusion LLMs |
Proposes noise-aware LoRA (NaRA) for efficient fine-tuning of diffusion language models. |
large language model |
✅ |
|
| 43 |
BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices |
BitTP: a lightweight trajectory prediction model for edge devices that uses BitLLM for efficient inference |
large language model |
✅ |
|
| 44 |
Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation |
Proposes the Think Fast, Talk Smart framework for generating high-quality health text from structured health data. |
large language model |
|
|
| 45 |
LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning |
Uses LLMs to evolve domain-independent heuristics that surpass human-designed ones in symbolic AI planning |
large language model |
|
|
| 46 |
ParaTool: Shifting Tool Representations from Context to Parameters |
ParaTool: moves tool representations from context into parameters, improving the tool-calling ability of large models |
large language model |
|
|
| 47 |
Battery-Sim-Agent: Leveraging LLM-Agent for Inverse Battery Parameter Estimation |
Proposes Battery-Sim-Agent to address battery parameter estimation |
large language model |
|
|
| 48 |
Opt-Verifier: Unleashing the Power of LLMs for Optimization Modeling via Dual-Side Verification |
Opt-Verifier: uses dual-side verification to unlock the potential of LLMs in optimization modeling |
large language model |
|
|
| 49 |
MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs |
MINDGAMES: a live arena for evaluating social and strategic reasoning in multi-agent LLMs |
large language model |
|
|
| 50 |
Xetrieval: Mechanistically Explaining Dense Retrieval |
Xetrieval: proposes an interpretable dense retrieval framework that reveals embedding-level reasoning mechanisms. |
chain-of-thought |
✅ |
|
| 51 |
SciIntBench: Measuring LLM Compliance with Research Integrity Norms Under Adversarial Framing |
SciIntBench: proposes an adversarial benchmark for evaluating LLM compliance with research integrity norms |
large language model |
|
|
| 52 |
CrystalXRD-Bench: Benchmarking Vision-Language Models for XRD Peak Indexing Across Diverse Crystalline Materials |
CrystalXRD-Bench: evaluates the performance of vision-language models on XRD peak indexing for crystalline materials |
multimodal |
|
|
| 53 |
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet |
Uses sparse autoencoders to extract interpretable monosemantic features from Claude 3 Sonnet |
multimodal |
|
|
| 54 |
Provably Secure Agent Guardrail |
Proposes an agent guardrail based on logical reasoning constraints to address the safety problem of AI loss of control |
large language model |
|
|
| 55 |
ReasonOps: Operator Segmentation for LLM Reasoning Traces |
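ReasonOps: proposes an unsupervised operator segmentation method for LLM reasoning traces, for analyzing and understanding the LLM reasoning process. |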
chain-of-thought |
|
|