| 1 | Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs | Proposes the KnowRecall and VisRecall benchmarks to evaluate the cross-lingual consistency of multimodal LLMs | large language model multimodal | |
| 2 | Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models | Proposes a feature-extraction-and-steering method to enhance CoT reasoning, requiring no external datasets | large language model chain-of-thought | |
| 3 | NeSyGeo: A Neuro-Symbolic Framework for Multimodal Geometric Reasoning Data Generation | Proposes NeSyGeo, a neuro-symbolic framework for generating diverse and generalizable multimodal geometric reasoning data | large language model multimodal | |
| 4 | Forging Time Series with Language: A Large Language Model Approach to Synthetic Data Generation | SDForger: generating high-quality synthetic time-series data with large language models | large language model multimodal | ✅ |
| 5 | PhysicsArena: The First Multimodal Physics Reasoning Benchmark Exploring Variable, Process, and Solution Dimensions | Proposes PhysicsArena, the first multimodal physics reasoning benchmark, evaluating variable, process, and solution capabilities | large language model multimodal | |
| 6 | RRTL: Red Teaming Reasoning Large Language Models in Tool Learning | Proposes RRTL to evaluate the safety of reasoning LLMs in tool learning | large language model chain-of-thought | |
| 7 | TACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration | TACO: enhancing multimodal in-context learning via task-mapping-guided sequence configuration | multimodal | |
| 8 | HDLxGraph: Bridging Large Language Models and HDL Repositories via HDL Graph Databases | HDLxGraph: bridging LLMs and HDL repositories via HDL graph databases, improving performance on hardware design tasks | large language model | ✅ |
| 9 | Listen to the Context: Towards Faithful Large Language Models for Retrieval Augmented Generation on Climate Questions | Proposes ClimateGPT Faithful+, improving LLM faithfulness in retrieval-augmented generation on climate questions | large language model | |
| 10 | SLMEval: Entropy-Based Calibration for Human-Aligned Evaluation of Large Language Models | Proposes SLMEval, an entropy-maximization-based calibration of LLM evaluators, improving agreement with human judgments | large language model | |
| 11 | Extracting Probabilistic Knowledge from Large Language Models for Bayesian Network Parameterization | Extracts probabilistic knowledge from LLMs for Bayesian network parameterization | large language model | |
| 12 | FedSEA-LLaMA: A Secure, Efficient and Adaptive Federated Splitting Framework for Large Language Models | FedSEA-LLaMA: a secure, efficient, and adaptive federated splitting framework for LLaMA2 | large language model | |
| 13 | LFTF: Locating First and Then Fine-Tuning for Mitigating Gender Bias in Large Language Models | Proposes the LFTF algorithm, which locates and then fine-tunes specific LLM modules to mitigate gender bias | large language model | |
| 14 | Towards Explainable Temporal Reasoning in Large Language Models: A Structure-Aware Generative Framework | Proposes the GETER framework to improve the explainability of LLM temporal reasoning, along with a corresponding evaluation benchmark | large language model | ✅ |
| 15 | Prolonged Reasoning Is Not All You Need: Certainty-Based Adaptive Routing for Efficient LLM/MLLM Reasoning | Proposes CAR, a certainty-based adaptive reasoning framework that improves LLM/MLLM reasoning efficiency and accuracy | large language model multimodal chain-of-thought | |
| 16 | OpenEthics: A Comprehensive Ethical Evaluation of Open-Source Generative Large Language Models | Proposes OpenEthics for comprehensive ethical evaluation of open-source generative LLMs | large language model | ✅ |
| 17 | Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering | CoPriva: a large-scale benchmark for evaluating LLM security-policy preservation in question answering | large language model | |
| 18 | Evolutionary Computation and Large Language Models: A Survey of Methods, Synergies, and Applications | A survey of methods, synergies, and applications at the intersection of evolutionary computation and large language models | large language model | |
| 19 | After Retrieval, Before Generation: Enhancing the Trustworthiness of Large Language Models in Retrieval-Augmented Generation | The BRIDGE framework improves LLM trustworthiness under knowledge-conflict scenarios in RAG | large language model | |
| 20 | Comparative Evaluation of Prompting and Fine-Tuning for Applying Large Language Models to Grid-Structured Geospatial Data | Compares prompting and fine-tuning for applying LLMs to grid-structured geospatial data | large language model | |
| 21 | Beyond Empathy: Integrating Diagnostic and Therapeutic Reasoning with Large Language Models for Mental Health Counseling | PsyLLM: the first LLM integrating diagnostic and therapeutic reasoning for mental health counseling | large language model | ✅ |
| 22 | LyapLock: Bounded Knowledge Preservation in Sequential Large Language Model Editing | LyapLock: a framework guaranteeing bounded knowledge preservation in sequential LLM editing | large language model | ✅ |
| 23 | Can Large Language Models be Effective Online Opinion Miners? | Proposes the OOMB benchmark dataset to evaluate the effectiveness of LLMs at online opinion mining | large language model | |
| 24 | ThinkLess: A Training-Free Inference-Efficient Method for Reducing Reasoning Redundancy | ThinkLess: a training-free inference-acceleration method that reduces LLM reasoning redundancy | large language model instruction following chain-of-thought | |
| 25 | Cultural Value Alignment in Large Language Models: A Prompt-based Analysis of Schwartz Values in Gemini, ChatGPT, and DeepSeek | A prompt-based analysis revealing differences in how LLMs align with Schwartz cultural values | large language model | |
| 26 | Gated Integration of Low-Rank Adaptation for Continual Learning of Large Language Models | Proposes GainLoRA, gated integration of LoRA modules to address catastrophic forgetting in LLM continual learning | large language model | |
| 27 | Any Large Language Model Can Be a Reliable Judge: Debiasing with a Reasoning-based Bias Detector | Proposes RBD, a reasoning-based bias detector that improves the reliability of LLMs as judges | large language model | |
| 28 | SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models | SciCUEval: a comprehensive dataset for evaluating LLM context understanding in scientific domains | large language model | |
| 29 | Are LLMs reliable? An exploration of the reliability of large language models in clinical note generation | Evaluates the reliability of LLMs in clinical note generation and recommends locally deployed small open-source models | large language model | |
| 30 | Can Large Language Models Understand Internet Buzzwords Through User-Generated Content | Proposes the RESS method and the CHEER dataset to improve LLM understanding of internet buzzwords | large language model | ✅ |
| 31 | Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory | Uses item response theory to reassess the validity of LLM evaluation benchmarks | large language model | |
| 32 | Effective and Efficient Schema-aware Information Extraction Using On-Device Large Language Models | Proposes DLISC: an efficient on-device information extraction method based on dual LoRA and incremental schema caching | large language model | |
| 33 | Joint Flashback Adaptation for Forgetting-Resistant Instruction Tuning | Proposes Joint Flashback Adaptation to address catastrophic forgetting of large models during instruction tuning | large language model instruction following | |
| 34 | Aug2Search: Enhancing Facebook Marketplace Search with LLM-Generated Synthetic Data Augmentation | Aug2Search: LLM-generated synthetic data augmentation to improve Facebook Marketplace search | large language model multimodal | |
| 35 | MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling | MIKU-PAL: an automated, standardized multimodal method for speech paralinguistic and affect labeling | large language model multimodal | |
| 36 | TurnaboutLLM: A Deductive Reasoning Benchmark from Detective Games | Proposes TurnaboutLLM, a benchmark for LLM deductive reasoning built from detective games | large language model chain-of-thought | |
| 37 | Multi-Modality Expansion and Retention for LLMs through Parameter Merging and Decoupling | Proposes MMER to address modality expansion and retention in multimodal LLMs | large language model multimodal | |
| 38 | FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management | FlowKV: enhancing multi-turn conversational coherence in LLMs via isolated key-value cache management | large language model instruction following | |
| 39 | Web-Shepherd: Advancing PRMs for Reinforcing Web Agents | Web-Shepherd: a process reward model for reinforcing web agents, addressing the lack of dedicated reward models for web navigation | large language model multimodal | |
| 40 | RoT: Enhancing Table Reasoning with Iterative Row-Wise Traversals | Proposes RoT: enhancing table reasoning via iterative row-wise traversals, without training | large language model chain-of-thought | |
| 41 | Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective | Proposes diffusion language models for text embedding, substantially improving long-document and reasoning-intensive retrieval | large language model instruction following | |
| 42 | Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems | Proposes the Spoken-MQA benchmark to evaluate the reasoning of speech-based models on multi-faceted math problems | large language model multimodal | |
| 43 | Beyond Hard and Soft: Hybrid Context Compression for Balancing Local and Global Information Retention | Proposes HyCo₂, a hybrid context compression method balancing local and global information retention, improving long-text reasoning | large language model | |
| 44 | Scaling Physical Reasoning with the PHYSICS Dataset | Proposes the PHYSICS dataset for improving and evaluating LLM performance on physical reasoning tasks | large language model | ✅ |
| 45 | DEBATE, TRAIN, EVOLVE: Self Evolution of Language Model Reasoning | Proposes the DTE framework, improving language-model reasoning through multi-agent debate and self-evolving training | large language model | ✅ |
| 46 | Advancing LLM Safe Alignment with Safety Representation Ranking | Proposes Safety Representation Ranking (SRR), leveraging LLM internal states to improve safety alignment under adversarial prompts | large language model | |
| 47 | Can LLMs $\textit{understand}$ Math? -- Exploring the Pitfalls in Mathematical Reasoning | Proposes the MAPLE metric for comprehensively evaluating the logical alignment of LLM mathematical reasoning | large language model | |
| 48 | A quantitative analysis of semantic information in deep representations of text and images | Proposes a quantitative method for analyzing semantic information in deep representations of text and images | large language model | |
| 49 | DeFTX: Denoised Sparse Fine-Tuning for Zero-Shot Cross-Lingual Transfer | DeFT-X: zero-shot cross-lingual transfer via denoised sparse fine-tuning | large language model | |
| 50 | MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision | MAS-ZERO: a self-evolving multi-agent system design framework requiring no supervision | large language model | |
| 51 | Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space | Proposes Soft Thinking, improving LLM reasoning in a continuous concept space | chain-of-thought | ✅ |
| 52 | Shared Path: Unraveling Memorization in Multilingual LLMs through Language Similarities | Unravels memorization in multilingual LLMs through language similarities | large language model | |
| 53 | From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning | Compares human and LLM conceptual structure, revealing how LLMs compress semantics and the limitations of doing so | large language model | |
| 54 | UniErase: Towards Balanced and Precise Unlearning in Language Models | UniErase: a balanced and precise unlearning framework for language models, improving unlearning while preserving capability | large language model | |
| 55 | Protoknowledge Shapes Behaviour of LLMs in Downstream Tasks: Memorization and Generalization with Knowledge Graphs | Proposes the concept of protoknowledge to analyze LLM memorization and generalization of knowledge graphs in downstream tasks | large language model | |
| 56 | RePPL: Recalibrating Perplexity by Uncertainty in Semantic Propagation and Language Generation for Explainable QA Hallucination Detection | Proposes RePPL to address explainability in language-model hallucination detection | large language model | |
| 57 | Your Language Model Can Secretly Write Like Humans: Contrastive Paraphrase Attacks on LLM-Generated Text Detectors | Proposes CoPA, a contrastive paraphrase attack that effectively fools LLM-generated-text detectors without training | large language model | |
| 58 | Prototypical Human-AI Collaboration Behaviors from LLM-Assisted Writing in the Wild | Proposes prototypical human-AI collaboration behaviors (PATHs) to analyze user-AI interaction patterns in LLM-assisted writing | large language model | |
| 59 | Explaining Puzzle Solutions in Natural Language: An Exploratory Study on 6x6 Sudoku | Evaluates LLMs on solving 6x6 Sudoku and explaining solutions in natural language, revealing deficiencies in strategic reasoning | large language model | |
| 60 | Generalizable Process Reward Models via Formally Verified Training Data | Proposes FoVer, which automatically generates training data via formal verification to improve generalizable process reward models | large language model | ✅ |
| 61 | MAPS: A Multilingual Benchmark for Global Agent Performance and Security | MAPS: a benchmark suite for evaluating agent performance and security in multilingual settings | large language model | ✅ |
| 62 | Learning to Reason via Mixture-of-Thought for Logical Reasoning | Proposes the Mixture-of-Thought (MoT) framework for improving LLM performance on logical reasoning | chain-of-thought | |
| 63 | MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation | MTR-Bench: a comprehensive multi-turn reasoning benchmark revealing deficiencies in LLM interactive reasoning | large language model | |
| 64 | ToxicTone: A Mandarin Audio Dataset Annotated for Toxicity and Toxic Utterance Tonality | ToxicTone: a large-scale Mandarin speech toxicity dataset, with a proposed multimodal detection framework | multimodal | |
| 65 | Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions | Self-interpretability: LLMs can describe the complex internal processes driving their decisions | large language model | |
| 66 | VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models | VocalBench: a comprehensive benchmark for evaluating the conversational abilities of speech interaction models | large language model | |
| 67 | Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen! | Reveals data-leakage risks of fine-tuning open-source LLMs: attackers can extract fine-tuning data via backdoors | large language model | ✅ |
| 68 | InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation | InfoDeepSeek: an agentic RAG benchmark evaluating information-seeking ability in real, dynamic web environments | large language model | |
| 69 | CoLA: Collaborative Low-Rank Adaptation | CoLA: a collaborative low-rank adaptation method improving multi-task fine-tuning in low-sample settings | large language model | ✅ |
| 70 | An Empirical Study of the Anchoring Effect in LLMs: Existence, Mechanism, and Potential Mitigations | Shows that LLMs exhibit the anchoring effect and proposes potential mitigation strategies | large language model | |
| 71 | X-WebAgentBench: A Multilingual Interactive Web Benchmark for Evaluating Global Agentic System | X-WebAgentBench: a multilingual interactive web benchmark for evaluating global agentic systems | large language model | |
| 72 | NL-Debugging: Exploiting Natural Language as an Intermediate Representation for Code Debugging | Proposes the NL-DEBUGGING framework, using natural language as an intermediate representation to improve LLM code debugging | large language model | |
| 73 | Emotional Supporters often Use Multiple Strategies in a Single Turn | Redefines the emotional support conversation task around single-turn multi-strategy use, and validates the superiority of LLMs on it | large language model | |
| 74 | Chinese Toxic Language Mitigation via Sentiment Polarity Consistent Rewrites | ToxiRewriteCN: the first Chinese sentiment-polarity-consistent toxic-language rewriting dataset, improving LLM detoxification in nuanced contexts | large language model | |
| 75 | Hallucinate at the Last in Long Response Generation: A Case Study on Long Document Summarization | Reveals a positional bias of hallucinations in long response generation, concentrated at the end, and explores mitigations | large language model | |
| 76 | Multilingual Prompting for Improving LLM Generation Diversity | Proposes multilingual prompting to improve the diversity of LLM-generated content | large language model | |
| 77 | R-TOFU: Unlearning in Large Reasoning Models | Proposes the R-TOFU benchmark for evaluating the effectiveness of unlearning in large reasoning models | chain-of-thought | |
| 78 | BanglaByT5: Byte-Level Modelling for Bangla | Proposes BanglaByT5, a byte-level encoder-decoder model for Bangla, improving NLP performance in resource-constrained settings | large language model | |
| 79 | DUSK: Do Not Unlearn Shared Knowledge | DUSK: a benchmark evaluating selective unlearning in LLMs under data-overlap scenarios | large language model | |