VectorLLM: Human-like Extraction of Structured Building Contours via Multimodal LLMs
Authors: Tao Zhang, Shiqing Wei, Shihao Chen, Wenling Yu, Muying Luo, Shunping Ji
Category: cs.CV
Published: 2025-07-07
💡 One-line takeaway
Proposes VectorLLM, the first multimodal LLM for extracting building contours from remote sensing imagery.
🎯 Matched area: Pillar 9: Embodied Foundation Models
Keywords: building contour extraction, multimodal large language models, remote sensing imagery, point-by-point regression, urban planning, datasets, zero-shot learning
📋 Key points
- Existing building contour extraction methods rely on complex multi-stage pipelines, which limits their scalability and real-world applicability.
- VectorLLM regresses building contours corner point by corner point with a multimodal large language model architecture, mimicking the human annotation process.
- On the WHU, WHU-Mix, and CrowdAI datasets, VectorLLM surpasses the previous state of the art by 5.6 AP, 7.1 AP, and 13.6 AP respectively, and also shows strong zero-shot performance.
🔬 Method details
Problem definition: The paper targets automatic extraction of building contours from remote sensing imagery. Existing methods typically rely on complex multi-stage pipelines of pixel segmentation, vectorization, and polygon refinement, which limits their scalability and efficiency in practice.
Core idea: VectorLLM extracts building contours directly via corner-point-by-corner-point regression, mimicking how human annotators label. This both improves extraction accuracy and simplifies the overall pipeline.
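The corner-by-corner idea can be made concrete by serializing a contour as a token sequence the LLM emits one corner at a time, the way an annotator would click corners in order. The token format below (`<x,y>` pairs terminated by `<eos>`) is an illustrative assumption, not the paper's exact vocabulary.

```python
# Sketch: a building contour as an ordered corner-token sequence.
# The "<x,y>"/"<eos>" token format is a hypothetical stand-in for
# whatever coordinate tokenization VectorLLM actually uses.

def contour_to_tokens(corners):
    """Flatten an ordered list of (x, y) corner points into a token string."""
    return " ".join(f"<{x},{y}>" for x, y in corners) + " <eos>"

def tokens_to_contour(text):
    """Parse the token string back into corner points (inverse of above)."""
    corners = []
    for tok in text.split():
        if tok == "<eos>":
            break
        x, y = tok.strip("<>").split(",")
        corners.append((int(x), int(y)))
    return corners

square = [(10, 10), (90, 10), (90, 90), (10, 90)]
seq = contour_to_tokens(square)
assert tokens_to_contour(seq) == square
```

Framing the polygon as a plain sequence is what lets a standard autoregressive LLM handle vector extraction without a separate vectorization or refinement stage.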
Technical framework: The overall architecture comprises a vision foundation backbone, an MLP connector, and a large language model (LLM), with learnable position embeddings added to strengthen spatial understanding.
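A minimal sketch of how these pieces compose: patch features from the vision backbone are projected by the connector into the LLM's embedding space, then learnable per-patch position embeddings are added before the vectors are handed to the LLM. All dimensions and the single-layer connector are assumptions for illustration; the paper's actual sizes are not given here.

```python
import random

# Hypothetical dimensions (the real backbone/LLM sizes are assumptions).
VIS_DIM, LLM_DIM, NUM_PATCHES = 8, 16, 4
random.seed(0)

def linear(x, w, b):
    """y = W x + b for a single feature vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

# Connector: projects vision-backbone patch features into the LLM
# embedding space (modeled as one linear layer for brevity).
W = [[random.uniform(-0.1, 0.1) for _ in range(VIS_DIM)] for _ in range(LLM_DIM)]
b = [0.0] * LLM_DIM

# Learnable position embeddings, one per patch, added after projection
# to improve spatial understanding.
pos_emb = [[random.uniform(-0.01, 0.01) for _ in range(LLM_DIM)]
           for _ in range(NUM_PATCHES)]

patch_feats = [[random.random() for _ in range(VIS_DIM)]
               for _ in range(NUM_PATCHES)]
llm_inputs = [[p + e for p, e in zip(linear(f, W, b), pe)]
              for f, pe in zip(patch_feats, pos_emb)]

# These vectors would be prepended to the text-token embeddings fed to the LLM.
assert len(llm_inputs) == NUM_PATCHES and len(llm_inputs[0]) == LLM_DIM
```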
Key innovation: The central contribution is applying a multimodal LLM to building contour extraction, replacing the multi-stage pipelines of prior work with efficient direct point-by-point regression.
Key design: Training combines pretraining, supervised fine-tuning, and preference optimization to ensure strong performance across datasets; the loss functions and network design are also carefully tuned to improve overall performance.
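The preference-optimization stage can be pictured with a DPO-style objective: the model is pushed to score a well-traced contour sequence ("chosen") above a poorly traced one ("rejected"), relative to a frozen reference model. The paper's exact preference formulation is not reproduced here, so this is an illustrative stand-in.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO-style loss: -log sigmoid of the beta-scaled preference margin.

    logp_* are policy log-probs of the chosen/rejected contour sequences;
    ref_* are the frozen reference model's log-probs of the same sequences.
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy prefers the chosen contour more strongly than the
# reference does, the loss drops below log 2 (the indifference value).
loss = dpo_loss(logp_chosen=-5.0, logp_rejected=-9.0,
                ref_chosen=-6.0, ref_rejected=-8.0, beta=0.1)
assert 0.0 < loss < math.log(2.0)
```

Minimizing this loss widens the policy's margin between good and bad contour traces without drifting far from the supervised fine-tuned reference.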
📊 Experimental highlights
On the WHU, WHU-Mix, and CrowdAI datasets, VectorLLM outperforms the previous state-of-the-art methods by 5.6 AP, 7.1 AP, and 13.6 AP respectively. The model also shows strong zero-shot performance on unseen objects such as aircraft, water bodies, and oil tanks, demonstrating broad applicability.
🎯 Application scenarios
Potential applications include urban planning, disaster assessment, and population estimation. By extracting building contours efficiently and accurately, VectorLLM can supply reliable data for decision-making and support smart-city development. In the future, the technique may extend to contour extraction for other remote sensing objects, further increasing the value of remote sensing data.
📄 Abstract (original)
Automatically extracting vectorized building contours from remote sensing imagery is crucial for urban planning, population estimation, and disaster assessment. Current state-of-the-art methods rely on complex multi-stage pipelines involving pixel segmentation, vectorization, and polygon refinement, which limits their scalability and real-world applicability. Inspired by the remarkable reasoning capabilities of Large Language Models (LLMs), we introduce VectorLLM, the first Multi-modal Large Language Model (MLLM) designed for regular building contour extraction from remote sensing images. Unlike existing approaches, VectorLLM performs corner-point by corner-point regression of building contours directly, mimicking human annotators' labeling process. Our architecture consists of a vision foundation backbone, an MLP connector, and an LLM, enhanced with learnable position embeddings to improve spatial understanding capability. Through comprehensive exploration of training strategies including pretraining, supervised fine-tuning, and preference optimization across WHU, WHU-Mix, and CrowdAI datasets, VectorLLM significantly outperformed the previous SOTA methods by 5.6 AP, 7.1 AP, 13.6 AP, respectively in the three datasets. Remarkably, VectorLLM exhibits strong zero-shot performance on unseen objects including aircraft, water bodies, and oil tanks, highlighting its potential for unified modeling of diverse remote sensing object contour extraction tasks. Overall, this work establishes a new paradigm for vector extraction in remote sensing, leveraging the topological reasoning capabilities of LLMs to achieve both high accuracy and exceptional generalization. All the codes and weights will be published for promoting community development.