| 1 |
RBF++: Quantifying and Optimizing Reasoning Boundaries across Measurable and Unmeasurable Capabilities for Chain-of-Thought Reasoning |
RBF++: Quantifying and optimizing reasoning boundaries across measurable and unmeasurable capabilities in CoT reasoning |
large language model multimodal chain-of-thought |
✅ |
|
| 2 |
KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025 |
KIT proposes an LLM-enhanced offline speech translation and instruction-following system that improves performance. |
large language model instruction following |
|
|
| 3 |
FlightGPT: Towards Generalizable and Interpretable UAV Vision-and-Language Navigation with Vision-Language Models |
FlightGPT: Generalizable and interpretable UAV vision-and-language navigation based on vision-language models |
VLN multimodal chain-of-thought |
|
|
| 4 |
Are Large Language Models Good at Detecting Propaganda? |
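Evaluates the ability of large language models to detect propaganda in news; the results show that their performance does not surpass the RoBERTa-CRF baseline. |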
large language model |
|
|
| 5 |
Krikri: Advancing Open Large Language Models for Greek |
Krikri: An open large language model for Greek that significantly improves Greek understanding and generation |
large language model |
|
|
| 6 |
Simulation Agent: A Framework for Integrating Simulation and Large Language Models for Enhanced Decision-Making |
Proposes the Simulation Agent framework, which integrates simulation with large language models to enhance decision-making |
large language model |
|
|
| 7 |
From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery |
Survey paper: large language models empowering scientific discovery, from automation tools to autonomous research agents |
large language model |
✅ |
|
| 8 |
SeedBench: A Multi-task Benchmark for Evaluating Large Language Models in Seed Science |
SeedBench: A multi-task benchmark for evaluating large language models in seed science |
large language model |
|
|
| 9 |
ToolSpectrum : Towards Personalized Tool Utilization for Large Language Models |
ToolSpectrum: A benchmark for personalized tool utilization by large language models |
large language model |
✅ |
|
| 10 |
Role-Playing Evaluation for Large Language Models |
Proposes the RPEval benchmark for evaluating the role-playing abilities of large language models |
large language model |
✅ |
|
| 11 |
The Effect of Language Diversity When Fine-Tuning Large Language Models for Translation |
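Uses controlled experiments to show how language diversity affects LLM fine-tuning for translation, and finds that moderate diversity improves translation quality |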
large language model |
|
|
| 12 |
Suicide Risk Assessment Using Multimodal Speech Features: A Study on the SW1 Challenge Dataset |
Suicide risk assessment using multimodal speech features, based on the SW1 challenge dataset. |
multimodal |
|
|
| 13 |
An Empirical Study of Many-to-Many Summarization with Large Language Models |
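Systematically studies the ability of large language models on multilingual document summarization, revealing the advantages of instruction tuning and the challenges of factuality. |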
large language model |
|
|
| 14 |
I'll believe it when I see it: Images increase misinformation sharing in Vision-Language Models |
Images increase the spread of misinformation in vision-language models: a study of the influence of images |
large language model multimodal |
✅ |
|
| 15 |
SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information |
SAKURA: Evaluating the multi-hop reasoning of large audio-language models based on speech and audio information |
large language model multimodal |
|
|
| 16 |
Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning |
Proposes the TREA dataset, evaluates the temporal reasoning of LALMs, and proposes an uncertainty measure. |
large language model multimodal |
|
|
| 17 |
MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix |
Proposes MMAR: a challenging benchmark for evaluating the deep reasoning abilities of audio-language models |
large language model chain-of-thought |
|
|
| 18 |
SQLForge: Synthesizing Reliable and Diverse Data to Enhance Text-to-SQL Reasoning in LLMs |
SQLForge: Synthesizing reliable and diverse data to enhance Text-to-SQL reasoning in LLMs |
large language model |
|
|
| 19 |
Assessing GPT Performance in a Proof-Based University-Level Course Under Blind Grading |
Assesses GPT performance in a proof-based university course under blind grading |
large language model |
|
|
| 20 |
Guided Search Strategies in Non-Serializable Environments with Applications to Software Engineering Agents |
Proposes guided search strategies for solving software engineering problems in non-serializable environments |
large language model |
|
|
| 21 |
What Prompts Don't Say: Understanding and Managing Underspecification in LLM Prompts |
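Addresses underspecification in LLM prompts by proposing a requirements-aware optimization method that improves model stability and performance. |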
instruction following |
|
|
| 22 |
Sense and Sensitivity: Examining the Influence of Semantic Recall on Long Context Code Reasoning |
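Reveals the influence of semantic recall on long-context code reasoning and proposes the SemTrace benchmark to test the semantic understanding of LLMs. |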
large language model |
|
|
| 23 |
GUARD: Generation-time LLM Unlearning via Adaptive Restriction and Detection |
Proposes GUARD: a generation-time LLM unlearning framework based on adaptive restriction and detection |
large language model |
|
|
| 24 |
Rank, Chunk and Expand: Lineage-Oriented Reasoning for Taxonomy Expansion |
LORex: Proposes a lineage-oriented reasoning framework for efficient taxonomy expansion. |
PaLM-E |
|
|
| 25 |
What's in a prompt? Language models encode literary style in prompt embeddings |
Language model prompt embeddings encode literary style information and can be used for authorship attribution |
large language model |
|
|
| 26 |
RAR: Setting Knowledge Tripwires for Retrieval Augmented Rejection |
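RAR: Sets knowledge tripwires for large language models through a retrieval-augmented rejection mechanism to perform content moderation. |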
large language model |
|
|
| 27 |
HeteroSpec: Leveraging Contextual Heterogeneity for Efficient Speculative Decoding |
HeteroSpec: Leverages contextual heterogeneity for efficient speculative decoding, significantly speeding up LLM inference. |
large language model |
|
|
| 28 |
Are LLMs Better Formalizers than Solvers on Complex Problems? |
On complex constraint satisfaction problems, LLMs perform worse as formalizers than as direct solvers |
large language model |
|
|
| 29 |
Positional Fragility in LLMs: How Offset Effects Reshape Our Understanding of Memorization Risks |
Reveals positional fragility in LLMs: how offset effects shape the understanding of memorization risks |
large language model |
|
|
| 30 |
What if Deception Cannot be Detected? A Cross-Linguistic Study on the Limits of Deception Detection from Text |
Questions the reliability of text-based deception detection: a cross-linguistic study reveals the limits of linguistic cues |
large language model |
|
|
| 31 |
Language-Specific Latent Process Hinders Cross-Lingual Performance |
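Reveals that language-specific latent processes hinder cross-lingual performance, and proposes a steering method to improve the cross-lingual reasoning of small models |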
large language model |
|
|