| 1 |
Think Then Embed: Generative Context Improves Multimodal Embedding |
Proposes the Think-Then-Embed framework, which uses generative context to improve universal multimodal embedding performance. |
large language model multimodal chain-of-thought |
|
|
| 2 |
ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering |
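Proposes ChartAgent, which uses visual reasoning to tackle the difficulty of understanding unannotated charts in complex chart question answering. |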
multimodal chain-of-thought |
|
|
| 3 |
Large Language Models Achieve Gold Medal Performance at the International Olympiad on Astronomy & Astrophysics (IOAA) |
Large language models reach gold-medal level at the International Olympiad on Astronomy and Astrophysics. |
large language model multimodal |
|
|
| 4 |
Efficient Prediction of Pass@k Scaling in Large Language Models |
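Proposes a Pass@k prediction method based on the Beta-Binomial distribution, making capability and risk assessment of large language models more efficient. |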
large language model |
|
|
| 5 |
Exploring Student Choice and the Use of Multimodal Generative AI in Programming Learning |
Explores the use of multimodal generative AI in programming learning and students' choice preferences. |
multimodal |
|
|
| 6 |
BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions |
BIRD-INTERACT: redefines Text-to-SQL evaluation for large language models through the lens of dynamic interaction. |
large language model |
|
|
| 7 |
AtomWorld: A Benchmark for Evaluating Spatial Reasoning in Large Language Models on Crystalline Materials |
AtomWorld: a benchmark for evaluating the spatial reasoning of large language models on crystalline materials. |
large language model |
|
|
| 8 |
Improving Multimodal Brain Encoding Model with Dynamic Subject-awareness Routing |
Proposes the AFIRE and MIND frameworks to address subject differences in multimodal brain encoding models under naturalistic settings. |
multimodal |
|
|
| 9 |
LEGOMem: Modular Procedural Memory for Multi-agent LLM Systems for Workflow Automation |
LEGOMem: modular procedural memory for multi-agent LLM systems aimed at workflow automation. |
large language model |
|
|
| 10 |
Bridging Reasoning to Learning: Unmasking Illusions using Complexity Out of Distribution Generalization |
Proposes a complexity out-of-distribution generalization framework for evaluating and improving AI reasoning ability. |
large language model |
|
|
| 11 |
BrokenMath: A Benchmark for Sycophancy in Theorem Proving with LLMs |
Proposes the BrokenMath benchmark, which evaluates how readily LLMs go along with false conclusions in theorem proving. |
large language model |
|
|
| 12 |
VAL-Bench: Belief Consistency as a measure for Value Alignment in Language Models |
VAL-Bench: proposes a benchmark for evaluating value alignment in language models based on belief consistency. |
large language model |
|
|
| 13 |
UnitTenX: Generating Tests for Legacy Packages with AI Agents Powered by Formal Verification |
UnitTenX: uses AI agents powered by formal verification to generate unit tests for legacy software packages. |
large language model |
|
|
| 14 |
AInstein: Assessing the Feasibility of AI-Generated Approaches to Research Problems |
The AInstein framework assesses the feasibility of LLMs solving AI research problems without external assistance. |
large language model |
|
|
| 15 |
AutoDAN-Reasoning: Enhancing Strategies Exploration based Jailbreak Attacks with Test-Time Scaling |
AutoDAN-Reasoning: uses test-time scaling to strengthen strategy-exploration-based jailbreak attacks on LLMs. |
large language model |
|
|
| 16 |
DeepV: A Model-Agnostic Retrieval-Augmented Framework for Verilog Code Generation with a High-Quality Knowledge Base |
DeepV: a model-agnostic RAG framework that improves Verilog code generation through a high-quality knowledge base. |
large language model |
✅ |
|
| 17 |
Staircase Streaming for Low-Latency Multi-Agent Inference |
Proposes Staircase Streaming to address high latency in multi-agent inference, significantly reducing time to first token (TTFT). |
large language model |
|
|
| 18 |
AutoEmpirical: LLM-Based Automated Research for Empirical Software Fault Analysis |
AutoEmpirical: uses large language models to automate empirical research on software faults. |
large language model |
|
|
| 19 |
LLM-Hanabi: Evaluating Multi-Agent Gameplays with Theory-of-Mind and Rationale Inference in Imperfect Information Collaboration Game |
LLM-Hanabi: uses Hanabi to evaluate LLMs' theory-of-mind and rationale-inference abilities in imperfect-information collaboration. |
large language model |
|
|
| 20 |
Where Did It All Go Wrong? A Hierarchical Look into Multi-Agent Error Attribution |
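Proposes the ECHO algorithm, which improves the accuracy of error attribution in multi-agent systems through hierarchical context and objective consensus analysis. |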
large language model |
|
|
| 21 |
FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration |
FreshBrew: a benchmark for evaluating AI agents on Java code migration tasks. |
large language model |
|
|
| 22 |
Natural Language Edge Labelling: Decoupling Intent from Execution in Structured LM Reasoning |
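Proposes Natural Language Edge Labelling (NLEL), which decouples intent from execution in structured LM reasoning, improving controllability and auditability. |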
chain-of-thought |
|
|
| 23 |
Curved Boolean Logic: A Contextual Generalization of Propositional Logic with Algorithmic Consequences |
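Proposes Curved Boolean Logic, which generalizes propositional logic through local truth assignments and provides algorithmic optimizations. |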
large language model |
|
|
| 24 |
P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs |
Proposes P2P, a poison-to-poison remedy for reliable backdoor defense in LLMs. |
large language model |
|
|