Overview
This guide explains how to organize geospatial datasets for fine-tuning foundation models, with a focus on TerraTorch's DataModule system. You'll learn:
- Standard dataset organization patterns for different tasks
- How to structure files for TerraTorch DataModules
- Integration with HuggingFace datasets
- Train/validation/test split strategies
- Band selection and normalization
Task-Specific Organization Patterns
Classification Tasks
Directory Structure:
dataset_root/
├── class_1/
│   ├── image_001.tif
│   ├── image_002.tif
│   └── ...
├── class_2/
│   ├── image_001.tif
│   └── ...
└── class_n/
    └── ...
Key Points:
- One folder per class, named with the class label
- All images for that class inside the folder
- Works with GenericNonGeoClassificationDataModule
Example: EuroSAT Dataset
EuroSAT_MS/
├── AnnualCrop/
│   ├── AnnualCrop_1.tif
│   └── ...
├── Forest/
│   └── ...
└── Residential/
    └── ...
Segmentation Tasks
Directory Structure (Separate Folders):
dataset_root/
├── images/
│   ├── scene_001_image.tif
│   ├── scene_002_image.tif
│   └── ...
└── labels/
    ├── scene_001_mask.tif
    ├── scene_002_mask.tif
    └── ...
Directory Structure (Matching Names):
dataset_root/
├── train_images/
│   ├── scene_001.tif
│   └── ...
├── train_labels/
│   ├── scene_001.tif
│   └── ...
├── val_images/
│   └── ...
└── val_labels/
    └── ...
Key Points:
- Image and label files must have matching names or matching patterns (see the pairing sketch below)
- Common patterns: *_image.tif paired with *_mask.tif, or identical names in separate images/ and labels/ folders
- Labels are typically single-band rasters with integer class values
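
Before wiring up a DataModule, it is worth verifying that every image has a matching mask. A minimal sketch for the separate images/ and labels/ layout above, using the *_image.tif / *_mask.tif naming convention (plain Python, not a TerraTorch API):

from pathlib import Path

def pair_images_and_masks(dataset_root):
    """Pair each *_image.tif with its *_mask.tif and report unmatched images."""
    root = Path(dataset_root)
    pairs = []
    for img in sorted((root / "images").glob("*_image.tif")):
        mask = root / "labels" / img.name.replace("_image.tif", "_mask.tif")
        if mask.exists():
            pairs.append((img, mask))
        else:
            print(f"Warning: no mask found for {img.name}")
    return pairs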
Object Detection Tasks
Directory Structure:
dataset_root/
├── images/
│   ├── scene_001.tif
│   └── ...
└── annotations/
    ├── scene_001.json  # or .xml
    └── ...
Annotation Format (COCO-style JSON):
{
  "images": [{"id": 1, "file_name": "scene_001.tif"}],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "bbox": [x, y, width, height],
      "area": 1000
    }
  ],
  "categories": [{"id": 1, "name": "building"}]
}
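
Reading the annotations back is ordinary JSON handling. A minimal sketch that indexes boxes by image for one scene file (the path follows the structure above; nothing TerraTorch-specific is assumed):

import json
from collections import defaultdict

# Load one COCO-style annotation file and group boxes by image id
with open("dataset_root/annotations/scene_001.json") as f:
    coco = json.load(f)

id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}
boxes_per_image = defaultdict(list)
for ann in coco["annotations"]:
    boxes_per_image[ann["image_id"]].append(ann["bbox"])  # [x, y, width, height]

for image_id, boxes in boxes_per_image.items():
    print(id_to_file[image_id], len(boxes), "boxes")

Train/Val/Test Split Strategies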
Strategy 1: Separate Directories
Structure:
dataset_root/
├── train/
│   ├── class_1/
│   └── class_2/
├── val/
│   ├── class_1/
│   └── class_2/
└── test/
    ├── class_1/
    └── class_2/
TerraTorch Usage:
from terratorch.datamodules import GenericNonGeoClassificationDataModule
datamodule = GenericNonGeoClassificationDataModule(
    batch_size=16,
    num_workers=4,
    train_data_root="dataset_root/train",
    val_data_root="dataset_root/val",
    test_data_root="dataset_root/test",
    means=[...],
    stds=[...],
    num_classes=10
)

Strategy 2: Split Files (Recommended for Benchmarks)
Structure:
dataset_root/
├── class_1/
├── class_2/
└── splits/
    ├── train.txt
    ├── val.txt
    └── test.txt
Split File Format (train.txt):
class_1/image_001.tif
class_1/image_005.tif
class_2/image_003.tif
...
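
If a dataset doesn't ship with split files, you can generate them yourself. A minimal sketch that writes train.txt, val.txt, and test.txt in the format above (random per-image assignment; for geospatial data, consider splitting by region instead, since neighboring tiles are spatially correlated and random splits can leak information into the test set):

import random
from pathlib import Path

def write_split_files(dataset_root, train_frac=0.7, val_frac=0.15, seed=0):
    """Randomly assign each image to train/val/test and write the split files."""
    root = Path(dataset_root)
    files = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.tif"))
    random.Random(seed).shuffle(files)
    n_train = int(len(files) * train_frac)
    n_val = int(len(files) * val_frac)
    splits = {
        "train.txt": files[:n_train],
        "val.txt": files[n_train:n_train + n_val],
        "test.txt": files[n_train + n_val:],
    }
    split_dir = root / "splits"
    split_dir.mkdir(exist_ok=True)
    for name, entries in splits.items():
        (split_dir / name).write_text("\n".join(entries) + "\n")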
TerraTorch Usage:
datamodule = GenericNonGeoClassificationDataModule(
    batch_size=16,
    num_workers=4,
    # All splits use the same root
    train_data_root="dataset_root",
    val_data_root="dataset_root",
    test_data_root="dataset_root",
    # But different split files
    train_split="dataset_root/splits/train.txt",
    val_split="dataset_root/splits/val.txt",
    test_split="dataset_root/splits/test.txt",
    means=[...],
    stds=[...],
    num_classes=10,
    # Important options for split files
    ignore_split_file_extensions=True,  # match "image.tif" with "image.jpg" in the split file
    allow_substring_split_file=True     # split-file entries may be substrings of filenames
)

Key Parameters:
- ignore_split_file_extensions: set to True if the split files list different extensions than the actual files
- allow_substring_split_file: set to True for partial filename matching (mmsegmentation style); for example, an entry scene_001 will match scene_001_image.tif
Band Selection and Ordering
Specifying Input Bands
# Your dataset has all Sentinel-2 bands
dataset_bands = [
    "B1", "B2", "B3", "B4", "B5", "B6",
    "B7", "B8", "B8A", "B9", "B10", "B11", "B12"
]

# But your model only needs these 6
output_bands = ["B2", "B3", "B4", "B8A", "B11", "B12"]

datamodule = GenericNonGeoClassificationDataModule(
    # ... other params ...
    dataset_bands=dataset_bands,  # what your files contain
    output_bands=output_bands,    # what the model receives
)
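
Conceptually, this remapping is plain channel selection by index. A small numpy illustration of the idea (this sketches the concept, not TerraTorch's internal implementation):

import numpy as np

dataset_bands = ["B1", "B2", "B3", "B4", "B5", "B6",
                 "B7", "B8", "B8A", "B9", "B10", "B11", "B12"]
output_bands = ["B2", "B3", "B4", "B8A", "B11", "B12"]

# Map each requested band name to its channel index in the file
band_index = {name: i for i, name in enumerate(dataset_bands)}
keep = [band_index[b] for b in output_bands]

image = np.zeros((13, 224, 224))  # (channels, H, W) as read from disk
model_input = image[keep]         # shape (6, 224, 224)

Using HLSBands Enums: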
from terratorch.datasets import HLSBands

output_bands = [
    HLSBands.BLUE,
    HLSBands.GREEN,
    HLSBands.RED,
    HLSBands.NIR_NARROW,
    HLSBands.SWIR_1,
    HLSBands.SWIR_2
]

Normalization Statistics
Computing Dataset Statistics
import rasterio
import numpy as np
from pathlib import Path

def compute_band_statistics(data_root, band_indices):
    """Compute mean and std for each band across the dataset."""
    means = []
    stds = []
    for band_idx in band_indices:
        band_values = []
        for img_path in Path(data_root).rglob("*.tif"):
            with rasterio.open(img_path) as src:
                band_data = src.read(band_idx + 1)  # rasterio bands are 1-indexed
                band_values.append(band_data.flatten())
        all_values = np.concatenate(band_values)
        means.append(float(np.mean(all_values)))
        stds.append(float(np.std(all_values)))
    return means, stds

# Example usage
means, stds = compute_band_statistics("dataset_root", range(6))
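
This implementation holds every pixel of the dataset in memory at once. For large collections, a single-pass variant that accumulates running sums per band is safer; a sketch reusing the imports above:

def compute_band_statistics_streaming(data_root, band_indices):
    """Single-pass per-band mean/std using running sums (constant memory)."""
    n = np.zeros(len(band_indices))
    s = np.zeros(len(band_indices))
    s2 = np.zeros(len(band_indices))
    for img_path in Path(data_root).rglob("*.tif"):
        with rasterio.open(img_path) as src:
            for i, band_idx in enumerate(band_indices):
                data = src.read(band_idx + 1).astype(np.float64)
                n[i] += data.size
                s[i] += data.sum()
                s2[i] += np.square(data).sum()
    means = s / n
    stds = np.sqrt(s2 / n - means ** 2)  # Var[x] = E[x^2] - E[x]^2
    return means.tolist(), stds.tolist()

Using Pre-computed Statistics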
Common datasets have published statistics:
# Sentinel-2 L1C statistics (from TerraMesh)
S2L1C_STATS = {
"mean": [2357.090, 2137.398, 2018.799, 2082.998, 2295.663,
2854.548, 3122.860, 3040.571, 3306.491, 1473.849,
506.072, 2472.840, 1838.943],
"std": [1673.639, 1722.641, 1602.205, 1873.138, 1866.055,
1779.839, 1776.496, 1724.114, 1771.041, 1079.786,
512.404, 1340.879, 1172.435]
}
# Sentinel-2 L2A statistics
S2L2A_STATS = {
"mean": [1390.461, 1503.332, 1718.211, 1853.926, 2199.116,
2779.989, 2987.025, 3083.248, 3132.235, 3162.989,
2424.902, 1857.665],
"std": [2131.157, 2163.666, 2059.311, 2152.477, 2105.179,
1912.773, 1842.326, 1893.568, 1775.656, 1814.907,
1436.282, 1336.155]
}
datamodule = GenericNonGeoClassificationDataModule(
    # ... other params ...
    means=S2L1C_STATS["mean"],
    stds=S2L1C_STATS["std"]
)

Data Transforms
Basic Transform Pipeline
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Training transforms with augmentation
train_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
    ToTensorV2()
])

# Validation/test transforms (no augmentation)
val_transform = A.Compose([
    ToTensorV2()
])
datamodule = GenericNonGeoClassificationDataModule(
    # ... other params ...
    train_transform=train_transform,
    val_transform=val_transform,
    test_transform=val_transform
)

Cropping for Model Input Size
# If your images are larger than the model input size
train_transform = A.Compose([
    A.RandomCrop(height=224, width=224),  # random crops for training
    A.HorizontalFlip(p=0.5),
    ToTensorV2()
])

val_transform = A.Compose([
    A.CenterCrop(height=224, width=224),  # deterministic center crop for validation
    ToTensorV2()
])

Integration with HuggingFace Datasets
Loading from HuggingFace Hub
from datasets import load_dataset
# Load dataset
dataset = load_dataset("torchgeo/eurosat", "sentinel2-all")
# Save to disk in TerraTorch-compatible format
def save_classification_dataset(dataset, output_root):
    """Convert an HF dataset to the folder-per-class structure."""
    from pathlib import Path

    for split_name in ["train", "validation", "test"]:
        split = dataset[split_name]
        for idx, example in enumerate(split):
            # ClassLabel features store integers; map back to the class name
            class_name = split.features["label"].int2str(example["label"])
            class_dir = Path(output_root) / split_name / class_name
            class_dir.mkdir(parents=True, exist_ok=True)
            # Save image
            img_path = class_dir / f"image_{idx:05d}.tif"
            # ... save logic ...
# Use the saved dataset
datamodule = GenericNonGeoClassificationDataModule(
    train_data_root="output_root/train",
    val_data_root="output_root/validation",
    test_data_root="output_root/test",
    # ... other params ...
)

Using HuggingFace Split Files
Many HuggingFace datasets provide split files:
import urllib.request
from pathlib import Path

Path("splits").mkdir(exist_ok=True)  # target directory must exist

# Download split files from the HF Hub
base_url = "https://huggingface.co/datasets/torchgeo/eurosat/resolve/main"
for split in ["train", "val", "test"]:
    url = f"{base_url}/eurosat-{split}.txt"
    urllib.request.urlretrieve(url, f"splits/eurosat-{split}.txt")
# Use with TerraTorch
datamodule = GenericNonGeoClassificationDataModule(
    train_data_root="dataset_root",
    val_data_root="dataset_root",
    test_data_root="dataset_root",
    train_split="splits/eurosat-train.txt",
    val_split="splits/eurosat-val.txt",
    test_split="splits/eurosat-test.txt",
    # ... other params ...
)

Complete DataModule Example
Classification with EuroSAT
from terratorch.datamodules import GenericNonGeoClassificationDataModule
import albumentations as A
from albumentations.pytorch import ToTensorV2
# Dataset paths
EUROSAT_ROOT = "/data/EuroSAT_MS"
SPLIT_DIR = "/data/EuroSAT/splits"
# Sentinel-2 bands
S2_BANDS = ["B1", "B2", "B3", "B4", "B5", "B6", "B7",
"B8", "B8A", "B9", "B10", "B11", "B12"]
MODEL_BANDS = ["B2", "B3", "B4", "B8A", "B11", "B12"]
# Statistics
S2L1C_MEAN = [2357.090, 2137.398, 2018.799, 2082.998,
2295.663, 2854.548, 3122.860, 3040.571,
3306.491, 1473.849, 506.072, 2472.840, 1838.943]
S2L1C_STD = [1673.639, 1722.641, 1602.205, 1873.138,
1866.055, 1779.839, 1776.496, 1724.114,
1771.041, 1079.786, 512.404, 1340.879, 1172.435]
# Transforms
train_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    ToTensorV2()
])
val_transform = A.Compose([ToTensorV2()])
# Create DataModule
datamodule = GenericNonGeoClassificationDataModule(
    batch_size=32,
    num_workers=4,
    train_data_root=EUROSAT_ROOT,
    val_data_root=EUROSAT_ROOT,
    test_data_root=EUROSAT_ROOT,
    train_split=f"{SPLIT_DIR}/eurosat-train.txt",
    val_split=f"{SPLIT_DIR}/eurosat-val.txt",
    test_split=f"{SPLIT_DIR}/eurosat-test.txt",
    means=S2L1C_MEAN,
    stds=S2L1C_STD,
    num_classes=10,
    dataset_bands=S2_BANDS,
    output_bands=MODEL_BANDS,
    train_transform=train_transform,
    val_transform=val_transform,
    test_transform=val_transform,
    ignore_split_file_extensions=True,
    allow_substring_split_file=True,
)
# Setup for training
datamodule.setup("fit")
# Get a batch
train_loader = datamodule.train_dataloader()
batch = next(iter(train_loader))
print(f"Batch image shape: {batch['image'].shape}")
print(f"Batch label shape: {batch['label'].shape}")Segmentation with Custom Dataset
from terratorch.datamodules import GenericNonGeoSegmentationDataModule
# Segmentation dataset structure:
# dataset_root/
# train_images/
# train_labels/
# val_images/
# val_labels/
datamodule = GenericNonGeoSegmentationDataModule(
    batch_size=16,
    num_workers=4,
    train_data_root="dataset_root/train_images",
    train_label_data_root="dataset_root/train_labels",
    val_data_root="dataset_root/val_images",
    val_label_data_root="dataset_root/val_labels",
    test_data_root="dataset_root/test_images",
    test_label_data_root="dataset_root/test_labels",
    means=S2L2A_STATS["mean"],  # L2A statistics defined earlier
    stds=S2L2A_STATS["std"],
    num_classes=5,              # number of segmentation classes
    dataset_bands=S2_BANDS,
    output_bands=MODEL_BANDS,
    train_transform=train_transform,
    val_transform=val_transform,
    test_transform=val_transform,
    ignore_index=255,           # label value to ignore in the loss
)

Common Issues and Solutions
Issue: Split file paths don't match actual files
Problem: Split file lists image.jpg but files are image.tif
Solution:
datamodule = GenericNonGeoClassificationDataModule(
    # ... other params ...
    ignore_split_file_extensions=True  # ignore extension when matching
)

Issue: Band mismatch errors
Problem: Model expects 6 bands but receives 13
Solution: Specify both dataset_bands and output_bands
datamodule = GenericNonGeoClassificationDataModule(
    # ... other params ...
    dataset_bands=ALL_13_BANDS,  # what the files contain
    output_bands=MODEL_6_BANDS   # what the model needs
)

Issue: Normalization statistics from the wrong data product
Problem: Using L2A statistics on L1C data
Solution: Match statistics to your data product
# Check your data source:
# L1C = top-of-atmosphere reflectance (e.g., EuroSAT)
# L2A = bottom-of-atmosphere reflectance (atmospherically corrected)
datamodule = GenericNonGeoClassificationDataModule(
    # ... other params ...
    means=S2L1C_MEAN if is_l1c else S2L2A_MEAN,
    stds=S2L1C_STD if is_l1c else S2L2A_STD
)

Issue: Images can't be stacked into batches
Problem: Images have different dimensions
Solution: Use transforms to ensure consistent sizes
train_transform = A.Compose([
    A.RandomCrop(height=224, width=224),  # ensures all samples share one size
    ToTensorV2()
])

Or disable stackability checking (if intentional):
datamodule = GenericNonGeoClassificationDataModule(
    # ... other params ...
    check_stackability=False
)

Quick Reference
DataModule Parameters
| Parameter | Purpose | Example |
|---|---|---|
| `batch_size` | Samples per batch | `32` |
| `num_workers` | Parallel data-loading worker processes | `4` |
| `train_data_root` | Path to training images | `"data/train"` |
| `train_split` | File listing training samples | `"splits/train.txt"` |
| `means` | Per-band normalization means | `[2357.09, ...]` |
| `stds` | Per-band normalization stds | `[1673.64, ...]` |
| `num_classes` | Number of output classes | `10` |
| `dataset_bands` | Bands in your files | `["B2", "B3", ...]` |
| `output_bands` | Bands the model receives | `["B2", "B3", "B4"]` |
| `train_transform` | Training augmentations | `A.Compose([...])` |
| `ignore_split_file_extensions` | Match files ignoring extension | `True` |
| `allow_substring_split_file` | Allow partial filename matching | `True` |
| `drop_last` | Drop incomplete final batch | `True` |
| `check_stackability` | Verify all images are the same size | `True` |
Common Dataset Patterns
| Task | Structure | DataModule Class |
|---|---|---|
| Classification | `class_name/image.tif` | `GenericNonGeoClassificationDataModule` |
| Segmentation | `images/` + `labels/` | `GenericNonGeoSegmentationDataModule` |
| Regression | `images/` + `labels.csv` | `GenericNonGeoRegressionDataModule` |
Workflow Summary
- Organize data according to task pattern
- Create split files or separate directories
- Compute or obtain normalization statistics
- Define transforms for train/val/test
- Initialize DataModule with all parameters
- Setup and validate with datamodule.setup("fit")
- Use with Trainer for fine-tuning
Next Steps
- See TerraMind EuroSAT Example for complete classification workflow
- See Segmentation Tutorial for segmentation workflow
- Check TorchGeo Basics for additional dataset utilities