Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images

Authors: Qiwei Wang, Zhongyao Tuo, Xianghui Ze, Yujiao Shi

Category: cs.CV

Published: 2026-05-08

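💡 One-Sentence Summary

Proposes Cross3R, a model that brings in a UAV viewpoint to achieve 6-DoF 3D reconstruction and localization across satellite, UAV, and ground images.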

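🎯 Matched area: Pillar 3: Spatial Perception and Semantics (Perception & Semantics)

Keywords: cross-view localization, 3D reconstruction, 6-DoF pose estimation, multimodal fusion, UAV vision, computer vision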

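📋 Key Points

Existing cross-view localization methods are constrained by nadir satellite imagery, which provides no cues for roll, pitch, or altitude, so they cannot deliver full 6-DoF pose estimates on complex terrain.
The paper proposes Cross3R, which uses a UAV image as a bridge: multi-view fusion reveals the 3D structure, enabling joint cross-view reconstruction and localization without a known relative pose.
On the authors' CrossGeo dataset, Cross3R outperforms existing feed-forward 3D reconstruction baselines in point-cloud reconstruction, 6-DoF pose estimation, and cross-view localization; on KITTI, it outperforms dedicated cross-view methods on most metrics without any KITTI training.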

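📝 Abstract (Translated)

Cross-view localization traditionally aims to determine where a ground image lies on a satellite tile. Existing methods are typically limited to 3-DoF estimates (x, y position and yaw) because nadir satellite imagery provides no cues for roll, pitch, or altitude. They therefore rely on planar-motion and zero-tilt assumptions, which fail on complex terrain (slopes, ramps) and with tilted camera mounts. To address this, the paper introduces a UAV image as an intermediate viewpoint: it reveals 3D structure invisible from nadir and supplies the pose cues that satellite imagery lacks, with no known relative pose required. Building on this, the paper proposes Cross3R, a flexible feed-forward model that takes a satellite tile together with a UAV image, a ground image, or both, and recovers a cross-view 3D point cloud, the 6-DoF pose of every camera, and localization results in a single forward pass. The authors also build the CrossGeo dataset, with 85 scenes and 278K images. Experiments show that Cross3R outperforms existing baselines in point-cloud reconstruction, 6-DoF pose estimation, and cross-view localization.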

🔬 Method Details

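Problem definition: Existing cross-view localization methods mostly rest on a planar assumption and estimate only 3 DoF (x, y, yaw). They cannot handle real-world slopes, terrain relief, or camera tilt, so they cannot recover a full 6-DoF camera pose or an accurate 3D scene reconstruction.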
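To make the gap between 3-DoF and 6-DoF concrete, the following minimal sketch lifts a planar (x, y, yaw) estimate to a full pose by assuming zero roll, zero pitch, and a fixed height, then compares it with a camera sitting on a 10-degree ramp. The ZYX Euler convention, the z-up world frame, and all function names are assumptions chosen for illustration; they are not taken from the paper.

```python
import math

def rotation_zyx(yaw, pitch, roll):
    """Rotation matrix R = Rz(yaw) * Ry(pitch) * Rx(roll), angles in radians."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]

def pose_3dof_to_6dof(x, y, yaw, assumed_height=0.0):
    """Lift a planar estimate to 6-DoF under the planar-motion, zero-tilt assumption."""
    return {"R": rotation_zyx(yaw, 0.0, 0.0), "t": (x, y, assumed_height)}

def euler_from_rotation(R):
    """Recover (yaw, pitch, roll) from R; valid while |pitch| < 90 degrees."""
    pitch = -math.asin(R[2][0])
    yaw = math.atan2(R[1][0], R[0][0])
    roll = math.atan2(R[2][1], R[2][2])
    return yaw, pitch, roll

# Hypothetical camera on a 10-degree ramp, heading 30 degrees.
true_R = rotation_zyx(math.radians(30), math.radians(10), 0.0)
planar = pose_3dof_to_6dof(5.0, 2.0, math.radians(30))

_, true_pitch, _ = euler_from_rotation(true_R)
_, planar_pitch, _ = euler_from_rotation(planar["R"])
print("true pitch (deg):  ", round(math.degrees(true_pitch), 6))
print("planar pitch (deg):", round(math.degrees(planar_pitch), 6))
```

The yaw agrees in both poses, but the planar lift cannot represent the ramp's pitch at all, which is the failure mode described above.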

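Core idea: Introduce a UAV image as an intermediate viewpoint. Its oblique viewing angle compensates for what satellite imagery lacks in the vertical dimension, and multimodal feature fusion enables cross-view geometric alignment and 3D reconstruction without a known relative camera pose.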

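Technical framework: Cross3R is a feed-forward architecture whose input is a combination of satellite, UAV, and ground images. A feature extractor computes per-view features, an attention mechanism lets features interact across views, and the model outputs a 3D point cloud of the scene, the 6-DoF pose of each input camera, and the on-tile (x, y) position and yaw of each perspective camera.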
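The relationship between a 6-DoF camera pose and an on-tile localization result can be sketched as a simple projection. The example below assumes a north-up satellite tile centered on the origin of a local east-north-up frame in meters, a known ground sample distance, and a camera whose forward axis is given explicitly. The function `on_tile_pose` and every parameter name are hypothetical; the paper's actual output parameterization may differ.

```python
import math

def on_tile_pose(R_cam_to_world, cam_center, tile_size_px, meters_per_px,
                 forward_axis=(0.0, 0.0, 1.0)):
    """Reduce a 6-DoF pose to a tile pixel position (u, v) and a yaw angle.

    Assumptions (not from the paper): the world frame is east-north-up in
    meters, the tile is north-up and centered on the world origin, and yaw is
    measured counterclockwise from east.
    """
    east, north, _up = cam_center
    # Pixel columns grow eastward; pixel rows grow southward on a north-up tile.
    u = tile_size_px / 2 + east / meters_per_px
    v = tile_size_px / 2 - north / meters_per_px
    # Rotate the camera's forward axis into the world frame.
    fwd = [sum(R_cam_to_world[i][k] * forward_axis[k] for k in range(3))
           for i in range(3)]
    # Heading of the forward axis projected onto the ground plane.
    # It is undefined when the camera looks straight up or down.
    yaw = math.atan2(fwd[1], fwd[0])
    return u, v, yaw

# Hypothetical ground camera: 10 m east and 20 m south of the tile center,
# 1.5 m above ground, with its forward axis (x) rotated to point north.
R_north = [[0.0, -1.0, 0.0],
           [1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0]]
u, v, yaw = on_tile_pose(R_north, (10.0, -20.0, 1.5), tile_size_px=512,
                         meters_per_px=0.2, forward_axis=(1.0, 0.0, 0.0))
print(round(u, 3), round(v, 3), round(math.degrees(yaw), 3))
```

Altitude, roll, and pitch are discarded by this projection, so the on-tile result carries strictly less information than the 6-DoF pose it is derived from.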

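Key innovations: Cross3R brings the UAV viewpoint into cross-view localization, moving beyond the conventional satellite-ground two-view setup, and proposes an end-to-end feed-forward reconstruction paradigm that needs no explicit relative-pose constraint, which improves robustness on complex terrain.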

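Key design: The model exposes a flexible input interface that supports several view combinations. It jointly optimizes a point-cloud reconstruction loss and a pose regression loss, so geometric consistency is learned implicitly. Training uses CrossGeo with multi-task supervision to support generalization across geographic environments.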
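A joint objective of the kind described above can be sketched as a weighted sum of a point-cloud term and a pose term. The specific terms used here (mean point distance, geodesic rotation angle, translation distance) and the weights `w_points` and `w_pose` are generic, assumed choices for illustration. The summary does not specify the paper's actual loss formulation.

```python
import math

def point_cloud_loss(pred_points, gt_points):
    """Mean Euclidean distance between corresponding predicted and reference points."""
    return sum(math.dist(p, g) for p, g in zip(pred_points, gt_points)) / len(gt_points)

def rotation_angle(R_pred, R_gt):
    """Geodesic angle (radians) between two 3x3 rotation matrices."""
    # trace(R_pred^T R_gt) equals the element-wise (Frobenius) inner product.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for safety
    return math.acos(cos_angle)

def pose_loss(pred_pose, gt_pose):
    """Rotation angle plus translation distance for one camera."""
    (R_pred, t_pred), (R_gt, t_gt) = pred_pose, gt_pose
    return rotation_angle(R_pred, R_gt) + math.dist(t_pred, t_gt)

def joint_loss(pred, gt, w_points=1.0, w_pose=1.0):
    """Weighted sum of the point-cloud term and the mean per-camera pose term."""
    pts = point_cloud_loss(pred["points"], gt["points"])
    poses = sum(pose_loss(p, g) for p, g in zip(pred["poses"], gt["poses"]))
    return w_points * pts + w_pose * poses / len(gt["poses"])

# Toy example with one camera and two points (all values are made up).
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
gt = {"points": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
      "poses": [(I3, (0.0, 0.0, 0.0))]}
pred = {"points": [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)],
        "poses": [(I3, (3.0, 4.0, 0.0))]}
print(joint_loss(pred, gt))
```

Because both terms share one scalar objective, an error in the predicted geometry and an error in the predicted cameras are penalized together, which is one plausible way a model could pick up geometric consistency without an explicit constraint.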

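📊 Experimental Highlights

On CrossGeo, Cross3R consistently outperforms existing feed-forward baselines in point-cloud reconstruction, 6-DoF pose estimation, and cross-view localization. Without any KITTI training, it also outperforms dedicated cross-view methods trained on KITTI on most metrics, which points to strong cross-scene generalization and geometric reasoning.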

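🎯 Application Scenarios

The technique is relevant to autonomous driving, UAV navigation, disaster response, and urban planning. In complex terrain where satellite maps are out of date or ground GPS is unreliable, real-time fusion of UAV and ground images can provide accurate localization and scene perception, giving autonomous robots the geometric grounding they need.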

📄 Abstract (Original)

Cross-view localization classically asks: where does this ground image lie on the satellite tile? Existing methods are typically limited to 3-DoF estimates -- an $(x,y)$ position and a yaw angle -- because nadir satellite imagery provides no direct cues for roll, pitch, or altitude, forcing a reliance on planar-motion and zero-tilt assumptions. These assumptions break on real terrain with slopes, ramps, and tilted camera mounts. To overcome this, we introduce a single UAV image as an intermediate viewpoint: it reveals the 3D structure invisible from nadir, supplies the cues for roll, pitch, and altitude that the satellite alone cannot provide, and needs only spatial overlap with the ground camera -- no known relative pose is required. Building on this insight, we propose Cross3R, a flexible feed-forward model that ingests a satellite tile together with a UAV image, a ground image, or both, and, in a single forward pass, recovers a cross-view 3D point cloud, the 6-DoF poses of every input camera, and the on-tile $(x,y)$ position and yaw of each perspective camera. For training and evaluation, we also construct CrossGeo, a 278K-image tri-view dataset spanning 85 scenes across every continent except Antarctica. On CrossGeo, Cross3R consistently outperforms feed-forward 3D baselines in point-cloud reconstruction, 6-DoF camera-pose estimation, and cross-view localization. On KITTI, it outperforms dedicated cross-view methods trained on KITTI on most metrics, despite having no KITTI training itself.
