The Post-GCN Decade Revisited: Curvature-Stratified Evaluation of Relational Learning

📄 arXiv: 2606.06397v2 📥 PDF

Authors: Shuo Wang, Xiangyu Wang, Quanxin Wang, Bailin Wu, Bokui Wang, Shunyang Huang, Boyan Deng, Haonan Liu, Ruiyi Fang, Zhenxiang Xu, Boyu Wang, Zhao Kang

Categories: cs.LG

Published: 2026-06-04 (updated: 2026-06-05)

Comments: Suggestions and comments are welcome

🔗 Code/Project: PROJECT_PAGE


💡 One-Sentence Takeaway

Proposes a curvature-stratified evaluation framework to address evaluation bias in relational learning.

🎯 Matched area: Pillar 9: Embodied Foundation Models

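Keywords: relational learning, curvature evaluation, graph convolutional networks, geometric properties, model evaluation, machine learning, dataset partitioning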

📋 Key Points

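  1. Existing evaluation methods assume a uniform geometric structure across datasets, which leads to misleading conclusions about model generalization.
  2. The paper proposes a curvature-stratified evaluation framework that partitions datasets by their geometric properties and exposes key performance trade-offs.
  3. Experiments show that model rankings are stable within each curvature regime but shift significantly across regimes, which highlights their dependence on geometry.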

📝 Abstract (translated from Chinese)

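Current evaluation practice in relational learning relies too heavily on flat leaderboards, which assume a uniform structure across datasets and thereby introduce systematic bias. The paper identifies intrinsic geometry as a key factor governing model effectiveness and proposes a curvature-stratified evaluation framework that partitions datasets into positive, negative, and near-zero curvature regimes. An evaluation of 18 representative models on 14 datasets shows that model rankings differ significantly across curvature regimes, indicating that performance depends on geometry. Building on this finding, the authors propose a geometry-aware evaluation protocol for more reliable comparisons, and they release the code and datasets to support reproducible future research.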

🔬 Method Details

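Problem definition: The paper targets the systematic bias in existing evaluation methods for relational learning, in particular the misleading conclusions that arise when the geometric properties of datasets are ignored.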

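Core idea: By introducing a curvature-stratified evaluation framework, the paper partitions datasets into distinct curvature regimes and thereby reveals how model performance depends on geometry.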
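As a rough illustration of the partitioning step, the sketch below assigns a graph to a positive, negative, or near-zero regime from its mean edge curvature. The source does not say which curvature notion the authors use, so the triangle-augmented Forman-Ricci-style formula here is an assumption, as are the function names and the `eps` threshold.

```python
from itertools import combinations


def forman_curvature(edges):
    """Triangle-augmented Forman-Ricci-style curvature per edge (assumed notion).

    For an unweighted edge (u, v): 4 - deg(u) - deg(v) + 3 * (#triangles on the edge).
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    curv = {}
    for u, v in edges:
        triangles = len(adj[u] & adj[v])  # common neighbours close a triangle
        curv[(u, v)] = 4 - len(adj[u]) - len(adj[v]) + 3 * triangles
    return curv


def mean_curvature(edges):
    """Average edge curvature, used here as a single summary per dataset."""
    curv = forman_curvature(edges)
    return sum(curv.values()) / len(curv)


def regime(mean_curv, eps=0.5):
    """Map a mean curvature to a regime; eps is a hypothetical tolerance."""
    if mean_curv > eps:
        return "positive"
    if mean_curv < -eps:
        return "negative"
    return "near-zero"


# Three toy graphs: a clique, a cycle, and a star (tree-like).
clique = list(combinations(range(5), 2))
cycle = [(i, (i + 1) % 6) for i in range(6)]
star = [(0, i) for i in range(1, 7)]

for name, g in [("clique", clique), ("cycle", cycle), ("star", star)]:
    print(name, mean_curvature(g), regime(mean_curvature(g)))
```

On these toy graphs the densely connected clique lands in the positive regime, the cycle in the near-zero regime, and the tree-like star in the negative regime. Real datasets would need a curvature estimator and threshold chosen to match the paper's actual protocol.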

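Technical framework: The overall framework consists of three main modules: curvature analysis of the datasets, model evaluation, and comparison of results. The datasets are first analyzed for their geometric properties, the models are then evaluated within each curvature regime, and the results are finally compared across regimes.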
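The difference between a flat leaderboard and per-regime evaluation can be sketched in a few lines. The model names, dataset names, regime labels, and accuracies below are invented purely for illustration and are not results from the paper. The helper functions are likewise hypothetical.

```python
from statistics import mean

# Toy accuracies, invented for illustration only (NOT results from the paper).
scores = {
    "model_a": {"d1": 0.90, "d2": 0.88, "d3": 0.60, "d4": 0.62},
    "model_b": {"d1": 0.70, "d2": 0.72, "d3": 0.82, "d4": 0.80},
    "model_c": {"d1": 0.75, "d2": 0.74, "d3": 0.74, "d4": 0.75},
}
# Hypothetical regime labels, assumed to come from a prior curvature analysis.
regimes = {"d1": "positive", "d2": "positive", "d3": "negative", "d4": "negative"}


def rank(model_to_score):
    """Return model names sorted from best to worst score."""
    return sorted(model_to_score, key=model_to_score.get, reverse=True)


def flat_ranking(scores):
    """Flat leaderboard: average each model over all datasets, then rank."""
    return rank({m: mean(s.values()) for m, s in scores.items()})


def stratified_rankings(scores, regimes):
    """One ranking per curvature regime, averaging only that regime's datasets."""
    result = {}
    for r in sorted(set(regimes.values())):
        datasets = [d for d, reg in regimes.items() if reg == r]
        result[r] = rank({m: mean(s[d] for d in datasets) for m, s in scores.items()})
    return result


print("flat:", flat_ranking(scores))
for r, ranking in stratified_rankings(scores, regimes).items():
    print(r, ranking)
```

In this toy setup the flat average puts `model_b` narrowly first, while the per-regime view shows `model_a` leading on the positive-curvature datasets and trailing on the negative-curvature ones. That is the kind of trade-off an aggregate score hides.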

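Key innovation: The central contribution is the curvature-stratified evaluation method, which overcomes the limitations of conventional evaluation and makes the geometry dependence of model performance visible.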

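Key design: The evaluation covers multiple model families (such as GCNs and GFMs) and uses evaluation metrics designed for the different curvature regimes, so that the results are reliable and interpretable.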

📊 Experimental Highlights

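The experiments show that model rankings are far more stable within a curvature regime than across regimes. For example, in some curvature regimes GFMs offer diminishing returns compared with geometry-aligned GNNs. This finding underscores the importance of geometric properties in model evaluation.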
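One way to quantify "stable within a regime, shifting across regimes" is to compare rank correlations between pairs of datasets. The sketch below uses Kendall's tau for this. The source does not state which stability measure the authors use, so this choice is an assumption, and the rankings are hypothetical toy data rather than results from the paper.

```python
from itertools import combinations


def kendall_tau(rank_x, rank_y):
    """Kendall's tau between two tie-free rankings of the same items."""
    pos_x = {m: i for i, m in enumerate(rank_x)}
    pos_y = {m: i for i, m in enumerate(rank_y)}
    concordant = discordant = 0
    for a, b in combinations(rank_x, 2):
        same_order = (pos_x[a] - pos_x[b]) * (pos_y[a] - pos_y[b]) > 0
        if same_order:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical per-dataset rankings (best model first), tagged with a regime.
rankings = {
    "d1": ("positive", ["m1", "m2", "m3", "m4"]),
    "d2": ("positive", ["m1", "m2", "m4", "m3"]),
    "d3": ("negative", ["m4", "m3", "m2", "m1"]),
    "d4": ("negative", ["m4", "m3", "m1", "m2"]),
}


def mean_tau(rankings, same_regime):
    """Average tau over dataset pairs that share (or do not share) a regime."""
    taus = [
        kendall_tau(rx, ry)
        for (gx, rx), (gy, ry) in combinations(rankings.values(), 2)
        if (gx == gy) == same_regime
    ]
    return sum(taus) / len(taus)


within = mean_tau(rankings, same_regime=True)
across = mean_tau(rankings, same_regime=False)
print(f"within-regime tau: {within:.3f}, cross-regime tau: {across:.3f}")
```

A high within-regime tau together with a low or negative cross-regime tau would match the pattern described above. With real benchmark results, tied scores would call for a tie-aware variant of the statistic.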

🎯 Application Scenarios

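Potential application areas include the evaluation of graph neural networks, relational learning methods, and machine learning models more broadly. By providing a more reliable evaluation framework, the work can help researchers develop more effective models, advance research in related fields, and improve model performance in practical applications.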

📄 Abstract (Original)

Current evaluation practices in relational learning rely heavily on flat leaderboards that average performance across heterogeneous datasets, implicitly assuming a uniform underlying structure. We show that this assumption introduces systematic bias: it obscures geometry-dependent performance variations and can lead to misleading conclusions about model generalization. In this work, we identify intrinsic geometry as a key latent factor governing model effectiveness. We demonstrate that conventional aggregated metrics mask critical performance trade-offs that only become visible when datasets are stratified by their geometric properties. To address this issue, we introduce a curvature-stratified evaluation framework that partitions datasets into positive, negative, and near-zero curvature regimes. Our benchmark evaluates 18 representative models including Graph Convolutional Networks (GCNs), Graph Foundation Models (GFMs), and tabular learning methods across 14 datasets. We find that model rankings are highly stable within each curvature regime but shift significantly across regimes, indicating that performance is fundamentally geometry-dependent rather than universally transferable. Notably, we identify regimes where GFMs offer diminishing returns compared to geometry-aligned GNNs. Based on these findings, we propose a geometry-aware evaluation protocol that yields more reliable and interpretable comparisons than standard aggregated benchmarks. We release all code, curvature-stratified dataset splits, and evaluation tools to support reproducible and rigorous assessment of future relational learning methods. Code and datasets are provided in our project homepage: https://sirbabbage.github.io/CurvBench_HOME/.