| 1 |
Curved Inference: Concern-Sensitive Geometry in Large Language Model Residual Streams |
Proposes the Curved Inference framework to address geometric interpretability of large language models |
large language model |
|
|
| 2 |
A Survey on Latent Reasoning |
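A survey of latent reasoning: explores a new paradigm in which large language models perform multi-step reasoning in latent space. |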
large language model multimodal chain-of-thought |
✅ |
|
| 3 |
UQLM: A Python Package for Uncertainty Quantification in Large Language Models |
UQLM: a Python toolkit for detecting hallucinations in large language models based on uncertainty quantification |
large language model |
|
|
| 4 |
Coding Triangle: How Does Large Language Model Understand Code? |
Proposes the Code Triangle framework to systematically evaluate large language models' capabilities in code understanding and generation. |
large language model |
|
|
| 5 |
Remember Past, Anticipate Future: Learning Continual Multimodal Misinformation Detectors |
Proposes DAEDCMD to address knowledge forgetting and environmental evolution in continual multimodal misinformation detection |
multimodal |
|
|
| 6 |
Unveiling Effective In-Context Configurations for Image Captioning: An External & Internal Analysis |
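For image captioning, proposes external and internal analysis methods for multimodal in-context learning, revealing effective configuration strategies. |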
large language model multimodal |
|
|
| 7 |
HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation |
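Proposes HIRAG, a hierarchical-thought instruction-tuning method for retrieval-augmented generation that improves the model's open-ended question answering. |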
large language model chain-of-thought |
|
|
| 8 |
Exploring Task Performance with Interpretable Models via Sparse Auto-Encoders |
Uses sparse autoencoders to improve LLM interpretability and downstream task performance |
large language model |
|
|
| 9 |
Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling |
REFORM: improves reward model robustness through reward-guided adversarial failure mode discovery |
large language model |
|
|
| 10 |
Humans overrely on overconfident language models, across languages |
Shows that humans overrely on language models across languages and are readily swayed by the models' overconfident expressions |
large language model |
|
|
| 11 |
Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers |
Proposes RPP and QPP, FLOPs-based efficiency metrics for LLM rerankers, to address the hardware dependence of existing evaluation methods. |
large language model |
✅ |
|
| 12 |
Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs |
Proposes the Entropy-Memorization Law to evaluate how difficult data is to memorize in LLMs and to enable dataset inference |
large language model |
|
|
| 13 |
DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations |
Proposes an in-context learning method based on fully synthetic demonstrations for document-level information extraction. |
large language model |
|
|
| 14 |
RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages |
RabakBench: builds a scalable multilingual safety benchmark for low-resource languages |
large language model |
|
|
| 15 |
OpenFActScore: Open-Source Atomic Evaluation of Factuality in Text Generation |
Proposes OpenFActScore for open-source evaluation of factuality in text generation |
large language model |
✅ |
|
| 16 |
Few-shot text-based emotion detection |
Uses large language models and few-shot learning for text-based emotion detection, achieving the best results on the Emakhuwa corpus |
large language model |
|
|
| 17 |
AI-Reporter: A Path to a New Genre of Scientific Communication |
AI-Reporter: rapidly converts academic presentations into publishable scientific papers |
large language model |
|
|
| 18 |
Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators |
Proposes a virtual respondent framework to address item validation for psychometric questionnaires |
large language model |
|
|
| 19 |
Bridging Perception and Language: A Systematic Benchmark for LVLMs' Understanding of Amodal Completion Reports |
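Builds a benchmark of LVLM perceptual abilities and analyzes differences in how models understand the completion of incomplete (occluded) information |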
multimodal |
|
|
| 20 |
Flippi: End To End GenAI Assistant for E-Commerce |
Flippi: an end-to-end generative AI assistant for e-commerce that improves the user shopping experience |
large language model |
|
|
| 21 |
DocTalk: Scalable Graph-based Dialogue Synthesis for Enhancing LLM Conversational Capabilities |
DocTalk: proposes a scalable graph-based dialogue synthesis method to enhance LLM conversational capabilities |
large language model |
✅ |
|
| 22 |
DRAGON: Dynamic RAG Benchmark On News |
DRAGON: proposes the first dynamic RAG benchmark for Russian, for evaluating retrieval-augmented generation systems in the news domain. |
large language model |
|
|
| 23 |
Smoothie-Qwen: Post-Hoc Smoothing to Reduce Language Bias in Multilingual LLMs |
Smoothie-Qwen: reduces language bias in multilingual LLMs through post-hoc smoothing |
large language model |
|
|