# TerraTorch Model Zoo Overview
This guide provides a comprehensive overview of the Geospatial Foundation Models (GeoFMs) available in the TerraTorch toolkit. The models take different approaches to pre-training on Earth observation data, with varying architectures, data requirements, and downstream task performance.
## Model Comparison Metrics

For consistency, we evaluate each model using these standardized metrics:

- **Architecture Type**: Base neural network architecture (ResNet, ViT, Swin)
- **Parameter Count**: Total trainable parameters
- **Pre-training Method**: Self-supervised learning approach used
- **Input Resolution**: Spatial resolution of training data
- **Spectral Bands**: Number and type of input channels
- **Temporal Handling**: How the model processes time-series data
- **Pre-training Dataset Size**: Scale of training data
- **Patch Size**: For ViT models, the size of image patches
- **Embedding Dimension**: Size of learned representations
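To keep comparisons consistent in code as well as in prose, these fields can be collected in one record per model. The sketch below is purely illustrative; the `ModelCard` dataclass and its field names are ours, not part of TerraTorch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelCard:
    """Hypothetical container for the standardized comparison metrics."""
    name: str
    architecture: str          # e.g. "ResNet50", "ViT-Large", "Swin"
    parameters_m: int          # total trainable parameters, in millions
    pretraining_method: str    # self-supervised (or supervised) approach
    input_resolution: str      # spatial resolution of training data
    spectral_bands: str        # number/type of input channels
    temporal_handling: str     # how time-series data is processed
    dataset_size: str          # scale of pre-training data
    patch_size: Optional[int]  # ViT patch size; None for CNN backbones
    embedding_dim: int         # size of learned representations


# Example entry, filled in from the MOCOv2 metrics listed below
mocov2 = ModelCard(
    name="MOCOv2", architecture="ResNet50", parameters_m=25,
    pretraining_method="Momentum Contrastive Learning",
    input_resolution="10m (Sentinel-2)", spectral_bands="13 (Sentinel-2 MSI)",
    temporal_handling="Multi-seasonal contrasts", dataset_size="1M samples",
    patch_size=None, embedding_dim=2048,
)
```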
---

## Contrastive Learning Models
### MOCOv2

**Paper**: [Momentum Contrast for Unsupervised Visual Representation Learning](https://arxiv.org/abs/1911.05722)

**Repository**: Available through TerraTorch backbone registry

**Description**: MOCOv2 applies momentum-based contrastive learning to Sentinel-2 imagery, learning representations by maximizing agreement between different augmented views of the same scene across multiple seasons.

**Standard Metrics**:

- Architecture Type: ResNet50
- Parameter Count: 25M
- Pre-training Method: Momentum Contrastive Learning
- Input Resolution: 10m (Sentinel-2)
- Spectral Bands: 13 (Sentinel-2 MSI)
- Temporal Handling: Multi-seasonal contrasts
- Pre-training Dataset Size: 1M samples
- Patch Size: N/A (CNN-based)
- Embedding Dimension: 2048
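To make the pre-training method concrete, here is a minimal sketch of the momentum-contrast objective MoCo-style models optimize: a query encoder and a slowly updated key encoder embed two augmented views, and an InfoNCE loss contrasts the matching pair against a queue of negatives. This is a generic illustration, not the TerraTorch implementation.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """EMA update: the key encoder slowly follows the query encoder."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)


def info_nce(q, k, queue, temperature=0.07):
    """q, k: L2-normalized embeddings of two augmented views of the same
    scene (e.g. the same Sentinel-2 tile in different seasons), shape (N, D).
    queue: embeddings of other scenes used as negatives, shape (D, K)."""
    l_pos = torch.einsum("nd,nd->n", q, k).unsqueeze(-1)   # (N, 1) positive logits
    l_neg = torch.einsum("nd,dk->nk", q, queue)            # (N, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)                 # positive is class 0
```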
### DINO

**Paper**: [Emerging Properties in Self-Supervised Vision Transformers](https://arxiv.org/abs/2104.14294)

**Repository**: Integrated via TerraTorch

**Description**: DINO (self-DIstillation with NO labels) learns visual representations through self-distillation, adapted for Sentinel-2 imagery with multi-seasonal temporal patterns.

**Standard Metrics**:

- Architecture Type: ResNet50
- Parameter Count: 25M
- Pre-training Method: Self-Distillation
- Input Resolution: 10m (Sentinel-2)
- Spectral Bands: 13 (Sentinel-2 MSI)
- Temporal Handling: Multi-seasonal processing
- Pre-training Dataset Size: 1M samples
- Patch Size: N/A (CNN-based)
- Embedding Dimension: 2048
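The self-distillation objective is simple to state: a student network matches the softened, centered output of an EMA teacher on a different view of the same scene, with no labels involved. Below is a simplified single-pair version of the DINO loss for illustration only; the full recipe averages over multiple crops and also updates the centering term by EMA.

```python
import torch
import torch.nn.functional as F


def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the sharpened, centered teacher distribution
    and the student distribution for two views of the same scene."""
    t = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()
    log_s = F.log_softmax(student_out / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()


@torch.no_grad()
def update_teacher(student, teacher, m=0.996):
    """The teacher is an exponential moving average of the student."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.data.mul_(m).add_(p_s.data, alpha=1 - m)
```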
### DeCUR

**Paper**: [Decoupling Common and Unique Representations for Multimodal Self-Supervised Learning](https://arxiv.org/abs/2309.05300)

**Repository**: Available in TerraTorch

**Description**: DeCUR jointly learns from Sentinel-1 (radar) and Sentinel-2 (optical) data by decoupling common and unique representations between modalities, enabling robust multi-modal Earth observation.

**Standard Metrics**:

- Architecture Type: ResNet50
- Parameter Count: 25M
- Pre-training Method: Multi-modal Contrastive Learning
- Input Resolution: 10m
- Spectral Bands: 13 (S2) + 2 (S1 VV/VH polarizations)
- Temporal Handling: Single timestamp
- Pre-training Dataset Size: 1M samples
- Patch Size: N/A (CNN-based)
- Embedding Dimension: 2048
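The core idea can be sketched in a few lines: each modality's embedding is split into a "common" subspace, which is aligned across SAR and optical views of the same location, and a "unique" subspace, which is left modality-specific. This is a conceptual simplification that assumes a cosine-alignment term in place of DeCUR's actual Barlow Twins-style cross-correlation objective.

```python
import torch.nn.functional as F


def common_unique_alignment(z_s1, z_s2, common_dims=1024):
    """z_s1: Sentinel-1 embedding, z_s2: Sentinel-2 embedding, shape (N, D).
    Only the first `common_dims` dimensions are pulled together; the
    remaining dimensions stay free to encode modality-specific ("unique")
    information."""
    c1 = z_s1[:, :common_dims]
    c2 = z_s2[:, :common_dims]
    return 1 - F.cosine_similarity(c1, c2, dim=-1).mean()
```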
---

## Masked Autoencoding Models
### ScaleMAE

**Paper**: [Scale-Aware Masked Autoencoder for Multi-scale Geospatial Representation Learning](https://arxiv.org/abs/2212.14532)

**Repository**: [GitHub](https://github.com/bair-climate-initiative/scale-mae)

**Description**: ScaleMAE introduces scale-aware positional encodings to handle the variable ground sampling distances in remote sensing, training on RGB imagery across multiple resolutions.

**Standard Metrics**:

- Architecture Type: ViT-Large
- Parameter Count: 300M
- Pre-training Method: Masked Autoencoding with scale awareness
- Input Resolution: 0.1m to 30m (variable)
- Spectral Bands: 3 (RGB)
- Temporal Handling: Single timestamp
- Pre-training Dataset Size: 360k samples
- Patch Size: 16x16
- Embedding Dimension: 1024
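The scale-aware trick boils down to feeding the positional encoding coordinates in ground units rather than pixel units, so a patch's encoding reflects its physical footprint. Below is a simplified 1-D sketch (ScaleMAE itself uses 2-D sine-cosine encodings); the function names and reference GSD are our choices.

```python
import torch


def gsd_scaled_positions(num_patches, gsd, reference_gsd=1.0):
    """Patch-grid coordinates scaled by ground sample distance (m/pixel),
    so position reflects extent on the ground rather than pixel index."""
    return torch.arange(num_patches, dtype=torch.float32) * (gsd / reference_gsd)


def sincos_encoding(pos, dim):
    """Standard 1-D sine-cosine encoding of (possibly GSD-scaled) positions."""
    half = dim // 2
    omega = 1.0 / (10000 ** (torch.arange(half, dtype=torch.float32) / half))
    angles = pos[:, None] * omega[None, :]                        # (N, dim/2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], -1)  # (N, dim)


# The same patch index gets different encodings for a 0.3 m aerial tile and
# a 10 m Sentinel-2 tile, because their ground footprints differ.
enc_aerial = sincos_encoding(gsd_scaled_positions(14, gsd=0.3), dim=64)
enc_s2 = sincos_encoding(gsd_scaled_positions(14, gsd=10.0), dim=64)
```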
### DOFA (Dynamic One-For-All)

**Paper**: [Neural Plasticity-Inspired Foundation Model for Observing the Earth Crossing Modalities](https://arxiv.org/abs/2403.15356)

**Repository**: Available through TerraTorch

**Description**: DOFA employs dynamic wavelength encoding to handle arbitrary combinations of spectral bands, making it adaptable to various Earth observation sensors without retraining.

**Standard Metrics**:

- Architecture Type: ViT-Large
- Parameter Count: 300M
- Pre-training Method: Masked Autoencoding with dynamic encoding
- Input Resolution: 1-30m (variable)
- Spectral Bands: Dynamic (any combination)
- Temporal Handling: Single timestamp
- Pre-training Dataset Size: 8M samples
- Patch Size: 16x16
- Embedding Dimension: 1024
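Dynamic wavelength encoding can be illustrated with a small hypernetwork that maps each band's central wavelength to that band's patch-embedding weights, so any combination of bands can be projected into the token space without a fixed-channel stem. This is a simplified sketch of the idea, not DOFA's actual implementation; the module name and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WavelengthConditionedPatchEmbed(nn.Module):
    """Illustrative sketch: a hypernetwork generates per-band patch-embedding
    weights from each band's central wavelength (micrometers), so the stem
    adapts to arbitrary band combinations without retraining."""

    def __init__(self, patch_size=16, embed_dim=256, hidden=128):
        super().__init__()
        self.patch_size, self.embed_dim = patch_size, embed_dim
        self.hypernet = nn.Sequential(
            nn.Linear(1, hidden), nn.GELU(),
            nn.Linear(hidden, embed_dim * patch_size * patch_size),
        )

    def forward(self, x, wavelengths):
        # x: (B, C, H, W); wavelengths: (C,) central wavelength per band
        num_bands = x.shape[1]
        w = self.hypernet(wavelengths[:, None])                  # (C, D*P*P)
        w = w.view(num_bands, self.embed_dim, self.patch_size, self.patch_size)
        w = w.permute(1, 0, 2, 3).contiguous()                   # (D, C, P, P)
        tokens = F.conv2d(x, w, stride=self.patch_size)          # (B, D, H/P, W/P)
        return tokens.flatten(2).transpose(1, 2)                 # (B, N, D)


# Example: a 4-band (RGB + NIR) input with Sentinel-2-like central wavelengths
embed = WavelengthConditionedPatchEmbed()
x = torch.randn(2, 4, 224, 224)
wavelengths = torch.tensor([0.490, 0.560, 0.665, 0.842])
print(embed(x, wavelengths).shape)  # torch.Size([2, 196, 256])
```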
### Clay v1

**Paper**: [Clay Foundation Model Technical Report](https://arxiv.org/abs/2406.13030)

**Repository**: [HuggingFace](https://huggingface.co/made-with-clay/Clay)

**Description**: Clay combines masked autoencoding with DINO for self-supervised learning, incorporating location and temporal encodings alongside dynamic wavelength handling for comprehensive Earth observation.

**Standard Metrics**:

- Architecture Type: ViT-Base
- Parameter Count: 100M
- Pre-training Method: MAE + DINO hybrid
- Input Resolution: 1-500m (highly variable)
- Spectral Bands: Dynamic (Sentinel-2, Landsat, NAIP)
- Temporal Handling: Temporal position encodings
- Pre-training Dataset Size: 70M samples
- Patch Size: 8x8
- Embedding Dimension: 768
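The location and temporal encodings can be illustrated with simple cyclic features: acquisition date and longitude are periodic, so sine/cosine pairs are a natural representation to feed alongside the image tokens. This is only a conceptual sketch under our own simplifications; Clay's actual encodings differ in detail.

```python
import math
import torch


def cyclic_encode(value, period):
    """Encode a periodic quantity (day-of-year, longitude) as a sin/cos pair."""
    angle = 2 * math.pi * value / period
    return torch.tensor([math.sin(angle), math.cos(angle)])


def scene_metadata_encoding(lat, lon, day_of_year):
    """Toy location + time encoding to pair with image tokens."""
    return torch.cat([
        cyclic_encode(day_of_year, 365.25),           # time of year
        cyclic_encode(lon, 360.0),                    # longitude wraps around
        torch.tensor([math.sin(math.radians(lat))]),  # latitude (not periodic)
    ])


meta = scene_metadata_encoding(lat=34.4, lon=-119.8, day_of_year=172)
print(meta.shape)  # torch.Size([5])
```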
### Prithvi-EO-1.0

**Paper**: [Foundation Models for Generalist Geospatial Artificial Intelligence](https://arxiv.org/abs/2310.18660)

**Repository**: [HuggingFace](https://huggingface.co/ibm-nasa-geospatial/Prithvi-100M)

**Description**: Developed by IBM and NASA, Prithvi-EO-1.0 is trained on Harmonized Landsat and Sentinel-2 (HLS) data with multi-temporal inputs for comprehensive Earth system understanding.

**Standard Metrics**:

- Architecture Type: ViT-Base
- Parameter Count: 100M
- Pre-training Method: Masked Autoencoding
- Input Resolution: 30m (HLS)
- Spectral Bands: 6 (HLS bands)
- Temporal Handling: Multi-temporal stacking (3 timestamps)
- Pre-training Dataset Size: 250k samples
- Patch Size: 16x16
- Embedding Dimension: 768
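The multi-temporal stacking above translates into a 5-D input tensor. The sketch below assumes the common (batch, bands, time, height, width) layout and a 224x224 chip size; check the specific checkpoint's documentation for its exact expectations.

```python
import torch

# 6 HLS bands, 3 timestamps, 224x224 chips (an assumed, commonly used size)
B, C, T, H, W = 4, 6, 3, 224, 224
hls_batch = torch.randn(B, C, T, H, W)

# With 16x16 spatial patches, each timestamp yields (224 // 16) ** 2 = 196
# patches, so the transformer sees roughly 3 * 196 = 588 tokens per chip.
tokens_per_frame = (H // 16) * (W // 16)
print(tokens_per_frame, T * tokens_per_frame)  # 196 588
```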
### Prithvi-EO-2.0

**Paper**: [Prithvi-EO-2.0: A Versatile Multi-Temporal Foundation Model](https://arxiv.org/abs/2412.02732)

**Repository**: [HuggingFace](https://huggingface.co/ibm-nasa-geospatial/Prithvi-EO-2.0-300M)

**Description**: The second generation of Prithvi models, offering both 300M and 600M parameter variants with enhanced temporal and location encodings for improved global Earth observation capabilities.

**Standard Metrics**:

- Architecture Type: ViT-Large (300M) / ViT-Huge (600M)
- Parameter Count: 300M / 600M
- Pre-training Method: Masked Autoencoding with temporal encoding
- Input Resolution: 30m (HLS)
- Spectral Bands: 6 (HLS bands)
- Temporal Handling: Enhanced multi-temporal (3+ timestamps)
- Pre-training Dataset Size: 4.2M samples
- Patch Size: 16x16
- Embedding Dimension: 1024 (300M) / 1280 (600M)
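As a starting point, the backbone can usually be pulled from TerraTorch's backbone registry. The registry key and keyword arguments below are assumptions (they change between TerraTorch releases), so inspect `terratorch.registry.BACKBONE_REGISTRY` in your installed version before relying on them.

```python
# Hedged sketch: registry key and kwargs are assumptions, not guaranteed names.
from terratorch.registry import BACKBONE_REGISTRY

backbone = BACKBONE_REGISTRY.build(
    "prithvi_eo_v2_300",   # assumed key for the 300M variant; verify locally
    pretrained=True,       # load released weights if available
    num_frames=3,          # multi-temporal input, matching the metrics above
)
```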
---

## Multi-Task Supervised Models
### Satlas

**Paper**: [SatlasPretrain: A Large-Scale Dataset for Remote Sensing Image Understanding](https://arxiv.org/abs/2211.15660)

**Repository**: [GitHub](https://github.com/allenai/satlas)

**Description**: Satlas uses supervised multi-task learning across various label types and resolutions, creating a generalist model for diverse remote sensing applications.

**Standard Metrics**:

- Architecture Type: Swin Transformer
- Parameter Count: 100M
- Pre-training Method: Supervised Multi-task Learning
- Input Resolution: ~10m (various sources)
- Spectral Bands: Variable (RGB + multispectral)
- Temporal Handling: Single timestamp
- Pre-training Dataset Size: Not specified (labeled data)
- Patch Size: 4x4 (Swin patches)
- Embedding Dimension: 768
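Supervised multi-task pre-training amounts to one shared backbone feeding several task-specific heads, with the per-task losses combined into a single objective. The sketch below is a generic illustration with placeholder modules, not the Satlas codebase.

```python
import torch
import torch.nn as nn


class MultiTaskModel(nn.Module):
    """Shared backbone + per-task heads (classification and segmentation here);
    modules and dimensions are placeholders for illustration."""

    def __init__(self, backbone, feat_dim, n_cls_classes, n_seg_classes):
        super().__init__()
        self.backbone = backbone  # e.g. a Swin feature extractor -> (B, D, H', W')
        self.cls_head = nn.Linear(feat_dim, n_cls_classes)
        self.seg_head = nn.Conv2d(feat_dim, n_seg_classes, kernel_size=1)

    def forward(self, x):
        feats = self.backbone(x)                            # (B, D, H', W')
        cls_logits = self.cls_head(feats.mean(dim=(2, 3)))  # global pooled features
        seg_logits = self.seg_head(feats)                   # per-pixel logits
        return cls_logits, seg_logits


# Training then sums (optionally weighted) per-task losses, e.g.
# loss = w_cls * ce(cls_logits, y_cls) + w_seg * ce(seg_logits, y_seg)
```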
---

## Model Selection Guide

### Best for Multi-Modal Applications

- **DeCUR**: Optimized for combined SAR-optical analysis
- **Clay v1**: Flexible wavelength handling for diverse sensors
- **DOFA**: Dynamic adaptation to any spectral configuration

### Best for Temporal Analysis

- **Prithvi-EO-2.0**: Enhanced temporal encodings
- **Prithvi-EO-1.0**: Native multi-temporal support
- **MOCOv2/DINO**: Multi-seasonal contrastive learning

### Best for High-Resolution Tasks

- **ScaleMAE**: Scale-aware design for variable resolutions
- **Satlas**: Multi-resolution supervised training

### Best for Limited Compute Resources

- **MOCOv2/DINO/DeCUR**: 25M parameters (ResNet50)
- **Prithvi-EO-1.0**: 100M parameters with proven efficiency
- **Clay v1**: 100M parameters with 8x8 patches for detail

### Best for Production Deployment

- **Prithvi-EO-2.0**: Extensive validation and NASA/IBM support
- **Clay v1**: Active development and community support
- **Satlas**: Supervised training for predictable performance
---

## Implementation Example

```python
import terratorch
from terratorch.models import PrithviModelFactory

# Build a Prithvi-EO-2.0 backbone with a UPerNet decoder for 10-class
# segmentation on 6-band, 3-timestep inputs.
# NOTE: this block is an illustrative sketch; factory and trainer APIs differ
# between TerraTorch releases (recent releases expose fine-tuning through
# terratorch.tasks, e.g. SemanticSegmentationTask, together with a PyTorch
# Lightning Trainer), so check your installed version.
factory = PrithviModelFactory()
model = factory.build_model(
    backbone="prithvi_eo_v2_300m",
    decoder="upernet",
    num_classes=10,
    in_channels=6,
    bands=["B02", "B03", "B04", "B08", "B11", "B12"],
    num_frames=3,
)

# Fine-tune on your dataset
trainer = terratorch.Trainer(
    model=model,
    task="semantic_segmentation",
    learning_rate=1e-4,
    batch_size=16,
)
# ...then call the trainer's fit method with your dataloaders or datamodule.
```