Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data

1University of Illinois Urbana-Champaign, 2IBM Research
Pipeline of our proposed framework.

Our proposed framework, Hyper-MAE.

Abstract

Geospatial raster (imagery) data holds immense potential for enabling a wide range of high-impact downstream applications by providing spatial, temporal, and spectral information across multiple channels (e.g., spectral bands, polarizations) and sensing modalities. Although recent work has adapted existing self-supervised learning (SSL) approaches to geospatial data, these methods fall short of tailored model architectures and training objectives. To address these limitations and better take advantage of geospatial data, we introduce LESS ViT, a Vision Transformer variant specifically designed for hyperspectral geospatial data, and Hyper-MAE, a Masked Autoencoder-based pre-training framework that employs a LESS ViT encoder-decoder architecture and incorporates decoupled spatial and spectral masking to create a more challenging self-supervised pre-training objective. To evaluate empirical performance, we construct GFM-Bench, a comprehensive benchmark for geospatial data. Experimental results demonstrate that our proposed method surpasses current state-of-the-art multi-modal geospatial foundation models, achieving superior performance with less computation and fewer parameters. The flexibility and extensibility of our framework make it a promising solution for future geospatial data analysis tasks that involve a wide range of modalities and channels.

Low-rank Efficient Spatial-Spectral ViT

Our Low-rank Efficient Spatial-Spectral (LESS) ViT architecture consists of three key components: (1) Hyperspectral Patch Embedding Block, (2) LESS Attention Block, and (3) Perception Field Mask.

Hyperspectral Patch Embedding

Hyperspectral images can contain tens to thousands of channels, distinguishing them from natural images that typically have three (RGB) channels. The rich spectral information encoded in these channels exhibits strong physical correlations that must be effectively leveraged. To exploit these spectral dependencies in subsequent attention blocks, we adopt a Tied Patch Embedding Layer that maintains spectral fidelity by explicitly embedding each channel's information, and incorporate a continuous positional-channel embedding to capture both spatial and spectral relationships.
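
As a rough sketch of this design, the PyTorch snippet below applies one shared projection to every spectral channel independently, so that each channel keeps its own token sequence instead of being mixed into a single projection as in a standard ViT patch embedding. The module name and hyperparameters are illustrative assumptions, and the continuous positional-channel embedding is omitted; this is not the exact implementation.

import torch
import torch.nn as nn

class TiedPatchEmbed(nn.Module):
    # Tied (channel-shared) patch embedding: one Conv2d with in_channels=1 is applied
    # to each spectral channel separately, preserving a per-channel token sequence.
    def __init__(self, patch_size=16, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(1, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (B, C, H, W) hyperspectral image with C spectral channels
        B, C, H, W = x.shape
        x = x.reshape(B * C, 1, H, W)            # treat each channel as its own single-band image
        x = self.proj(x)                         # (B*C, D, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)         # (B*C, N, D), N spatial patches per channel
        return x.reshape(B, C, -1, x.shape[-1])  # (B, C, N, D): spectral x spatial token grid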


Hyperspectral Patch Embedding Block


LESS ViT

LESS Attention Block

Given the computational inefficiency of applying the standard attention mechanism to spatial-spectral tokens, we propose the LESS attention block, which is specifically designed for spatial-spectral tokens. To efficiently model spatial-spectral interactions, the LESS attention block approximates the full spatial-spectral attention matrix with the Kronecker product of separate spatial and spectral attention matrices. This approximation makes the attention block more scalable to larger numbers of channels while still capturing spatial-spectral interactions.
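
The sketch below illustrates this factorization under stated assumptions: an N x N spatial attention matrix and a C x C spectral attention matrix are computed separately, and their Kronecker product is applied to the value tokens implicitly via an einsum, so the full (C·N) x (C·N) matrix is never materialized. How the spatial and spectral queries and keys are pooled from the token grid is an assumption made here for illustration.

import torch
import torch.nn.functional as F

def less_attention(q_spat, k_spat, q_spec, k_spec, v, perception_mask=None):
    # q_spat, k_spat: (B, N, d) spatial queries/keys; q_spec, k_spec: (B, C, d) spectral ones
    # v: (B, C, N, d) value tokens on the spectral x spatial grid
    # perception_mask: optional (N, N) boolean mask, True where attention is allowed
    d = q_spat.shape[-1]
    attn_spat = (q_spat @ k_spat.transpose(-2, -1)) / d ** 0.5                     # (B, N, N)
    if perception_mask is not None:
        attn_spat = attn_spat.masked_fill(~perception_mask, float("-inf"))
    attn_spat = F.softmax(attn_spat, dim=-1)
    attn_spec = F.softmax((q_spec @ k_spec.transpose(-2, -1)) / d ** 0.5, dim=-1)  # (B, C, C)
    # Apply (attn_spec kron attn_spat) to v without building the (C*N) x (C*N) matrix.
    return torch.einsum("bcu,bnm,bumd->bcnd", attn_spec, attn_spat, v)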



Perception Field Mask

"Everything is related to everything else, but near things are more related than distant things."

— Tobler, First Law of Geography

To explicitly model spatial autocorrelation in geospatial data, we introduce the Perception Field Mask. The Perception Field Mask constrains the spatial attention computation by allowing each token to attend only to patches within a specified distance threshold. We define this threshold in meters rather than pixels, ensuring consistent spatial relationships across different image resolutions. This distance-based masking mechanism offers two key advantages: (1) it enforces locality in the attention computation, aligning with Tobler’s law, and (2) it enables the model to process images of varying sizes without downsampling, as the attention field remains spatially consistent regardless of resolution.
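
A minimal sketch of how such a mask could be built, assuming patches lie on a regular grid and the image's ground sample distance (meters per pixel) is known; the grid size, patch size, and threshold below are illustrative values, not the configuration used in the paper.

import torch

def perception_field_mask(grid_h, grid_w, patch_size, gsd, threshold_m):
    # Place patch centers on a regular grid and convert offsets to meters via the
    # ground sample distance (gsd, meters per pixel); two patches may attend to each
    # other only if their centers lie within threshold_m meters.
    ys, xs = torch.meshgrid(
        torch.arange(grid_h, dtype=torch.float32),
        torch.arange(grid_w, dtype=torch.float32),
        indexing="ij",
    )
    centers = torch.stack([ys + 0.5, xs + 0.5], dim=-1).reshape(-1, 2) * patch_size * gsd
    dists = torch.cdist(centers, centers)      # (N, N) pairwise center distances in meters
    return dists <= threshold_m                # boolean mask, True = attention allowed

# Illustrative usage: 14x14 patches of 16 px at 10 m/px with a 500 m attention radius.
mask = perception_field_mask(14, 14, patch_size=16, gsd=10.0, threshold_m=500.0)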


GFM-Bench

To encourage the use of a consistent evaluation protocol, we introduce GFM-Bench, a benchmark implemented with the HuggingFace framework for ease of use and standardized evaluation. The current version of GFM-Bench consists of three classification tasks (EuroSAT, BigEarthNet, and So2Sat) and four segmentation tasks (SegMunich, DFC2020, MARIDA, and NLCD-L).


For more detailed information about GFM-Bench, please refer to our GFM-Bench page.
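
Since the benchmark is built on the HuggingFace ecosystem, loading a task could look roughly like the sketch below; the dataset identifier and field names are placeholders rather than the benchmark's confirmed interface.

from datasets import load_dataset

# Hypothetical repository id; see the GFM-Bench page for the actual identifiers.
eurosat = load_dataset("GFM-Bench/EuroSAT")
train_split = eurosat["train"]
print(train_split[0].keys())   # e.g. multispectral bands and the classification label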


Quantitative Results

Hyperspectral Optical Experiments

Our LESS ViT is pre-trained on the SSL4EO-S12 dataset, a large-scale multi-modal geospatial dataset containing spatially and temporally aligned multispectral imagery (MSI) with 13 channels and SAR imagery with 2 channels. We compare LESS ViT-Base with state-of-the-art geospatial representation learning methods on our GFM-Bench.


Quantitative results on seven datasets in the GFM-Bench under Fine-tuning (FT) and Linear Probing (LP).


Cross-Satellite Generalization

To demonstrate the flexibility of our LESS ViT architecture in handling satellites with varying channel counts without architectural modifications, we evaluate our model on the NLCD-L dataset from GFM-Bench and compare it against two baseline architectures. We also analyze computational efficiency by comparing encoder parameter counts, floating point operations (FLOPs) during fine-tuning, and both fine-tuning and inference latency on NLCD-L (20 channels) and BigEarthNet (12 channels). Performance is reported as mIoU for NLCD-L and mAP for BigEarthNet. To enable direct comparison, we normalize both FLOPs and wall-clock times relative to LESS ViT's baseline measurements.


Cross-Satellite Generalization to Landsat and Model Efficiency.


Qualitative Evaluation

Visualization

We also perform principal component analysis (PCA) on the extracted patch features of each modality's input. The top three principal components are visualized using three distinct colors, and the resulting image is interpolated back to the original input size for better interpretation.
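
A minimal sketch of this visualization step, assuming the patch features for one image and modality have already been extracted as an (N, D) array; the grid and output sizes are illustrative.

import torch
import torch.nn.functional as F
from sklearn.decomposition import PCA

def pca_rgb(patch_features, grid_h, grid_w, out_h, out_w):
    # Project (N, D) patch features onto their top-3 principal components,
    # normalize each component to [0, 1], and render them as an RGB map
    # upsampled to the original input size.
    comps = PCA(n_components=3).fit_transform(patch_features)       # (N, 3)
    lo, hi = comps.min(axis=0), comps.max(axis=0)
    comps = (comps - lo) / (hi - lo + 1e-8)
    img = torch.from_numpy(comps.reshape(grid_h, grid_w, 3)).permute(2, 0, 1).float()
    img = F.interpolate(img.unsqueeze(0), size=(out_h, out_w), mode="bilinear", align_corners=False)
    return img.squeeze(0).permute(1, 2, 0).numpy()                  # (out_h, out_w, 3) RGB image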


Visualization of top PCA components on BigEarthNet.

There are three key observations. First, the multi-modal patch features are mainly influenced by optical features, which is expected given the higher number of optical channels compared to radar channels. Second, for both the optical and multi-modal patch features, the principal components of similar land cover types have consistent color patterns. For instance, in optical feature visualizations, water bodies (Row 1) are highlighted in green, woodlands (Row 2) are represented by purple, and farmlands (Row 3) are primarily marked in yellow. Third, despite the presence of salt-and-pepper noise, the radar patch features demonstrate a correspondence between surfaces with similar textures. Smooth surfaces (water) tend to be marked in pink, while rough surfaces (woodlands and farmlands) are generally highlighted in green.

BibTeX

@misc{si2025scalablefoundationmodelmultimodal,
      title={Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data}, 
      author={Haozhe Si and Yuxuan Wan and Minh Do and Deepak Vasisht and Han Zhao and Hendrik F. Hamann},
      year={2025},
      eprint={2503.12843},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.12843}, 
    }