Accessing satellite data via STAC for GFM training
Introduction to STAC
STAC (SpatioTemporal Asset Catalog) is a specification for describing geospatial information that makes it easier to work with satellite imagery and other earth observation data at scale.
import pystac_clientimport pandas as pdimport numpy as npimport matplotlib.pyplot as pltfrom pystac.extensions.eo import EOExtensionfrom pystac.extensions.sat import SatExtensionimport requestsimport jsonfrom datetime import datetime, timedeltaprint("STAC libraries loaded successfully")
STAC libraries loaded successfully
Microsoft Planetary Computer
Microsoftβs Planetary Computer provides free access to petabytes of earth observation data through STAC APIs.
# Connect to Planetary Computer STAC API# Note: Planetary Computer requires authentication for data access, but catalog browsing is publicpc_catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")print(f"Planetary Computer STAC API: {pc_catalog.title}")print(f"Description: {pc_catalog.description}")# List available collectionscollections =list(pc_catalog.get_collections())print(f"\nNumber of collections: {len(collections)}")# Show first few collection IDs and titlesfor i, collection inenumerate(collections[:10]):print(f"{i+1:2d}. {collection.id}: {collection.title}")
Planetary Computer STAC API: Microsoft Planetary Computer STAC API
Description: Searchable spatiotemporal metadata describing Earth science datasets hosted by the Microsoft Planetary Computer
Number of collections: 126
1. daymet-annual-pr: Daymet Annual Puerto Rico
2. daymet-daily-hi: Daymet Daily Hawaii
3. 3dep-seamless: USGS 3DEP Seamless DEMs
4. 3dep-lidar-dsm: USGS 3DEP Lidar Digital Surface Model
5. fia: Forest Inventory and Analysis
6. gridmet: gridMET
7. daymet-annual-na: Daymet Annual North America
8. daymet-monthly-na: Daymet Monthly North America
9. daymet-annual-hi: Daymet Annual Hawaii
10. daymet-monthly-hi: Daymet Monthly Hawaii
Key Planetary Computer Collections
# Key collections for GFM trainingkey_collections = ["sentinel-2-l2a", # Sentinel-2 Level 2A (surface reflectance)"landsat-c2-l2", # Landsat Collection 2 Level 2"modis-13A1-061", # MODIS Vegetation Indices"naip", # National Agriculture Imagery Program"aster-l1t", # ASTER Level 1T"cop-dem-glo-30"# Copernicus DEM Global 30m]print("Key Collections for GFM Training:")print("="*50)for collection_id in key_collections:try: collection = pc_catalog.get_collection(collection_id)print(f"\n{collection.id}")print(f" Title: {collection.title}")print(f" Extent: {collection.extent.temporal.intervals[0][0]} to {collection.extent.temporal.intervals[0][1]}")# Show available bands if it's an EO collectionif'eo:bands'in collection.summaries: bands = collection.summaries['eo:bands']print(f" Bands: {len(bands)} bands available")exceptExceptionas e:print(f" Error accessing {collection_id}: {e}")
Key Collections for GFM Training:
==================================================
sentinel-2-l2a
Title: Sentinel-2 Level-2A
Extent: 2015-06-27 10:25:31+00:00 to None
Error accessing sentinel-2-l2a: argument of type 'Summaries' is not iterable
landsat-c2-l2
Title: Landsat Collection 2 Level-2
Extent: 1982-08-22 00:00:00+00:00 to None
Error accessing landsat-c2-l2: argument of type 'Summaries' is not iterable
modis-13A1-061
Title: MODIS Vegetation Indices 16-Day (500m)
Extent: 2000-02-18 00:00:00+00:00 to None
Error accessing modis-13A1-061: argument of type 'Summaries' is not iterable
naip
Title: NAIP: National Agriculture Imagery Program
Extent: 2010-01-01 00:00:00+00:00 to 2023-12-31 00:00:00+00:00
Error accessing naip: argument of type 'Summaries' is not iterable
aster-l1t
Title: ASTER L1T
Extent: 2000-03-04 12:00:00+00:00 to 2006-12-31 12:00:00+00:00
Error accessing aster-l1t: argument of type 'Summaries' is not iterable
cop-dem-glo-30
Title: Copernicus DEM GLO-30
Extent: 2021-04-22 00:00:00+00:00 to 2021-04-22 00:00:00+00:00
Error accessing cop-dem-glo-30: argument of type 'Summaries' is not iterable
Searching for Data
Basic Search Parameters
# Define area of interest (AOI) - California Central Valleyaoi = {"type": "Polygon","coordinates": [[ [-121.5, 37.0], # Southwest corner [-121.0, 37.0], # Southeast corner [-121.0, 37.5], # Northeast corner [-121.5, 37.5], # Northwest corner [-121.5, 37.0] # Close polygon ]]}# Define time rangestart_date ="2023-06-01"end_date ="2023-08-31"print(f"Search parameters:")print(f" AOI: Central Valley, California")print(f" Time range: {start_date} to {end_date}")print(f" Collections: Sentinel-2 L2A")
Search parameters:
AOI: Central Valley, California
Time range: 2023-06-01 to 2023-08-31
Collections: Sentinel-2 L2A
Sentinel-2 Search Example
# Search for Sentinel-2 datasearch = pc_catalog.search( collections=["sentinel-2-l2a"], intersects=aoi, datetime=f"{start_date}/{end_date}", query={"eo:cloud_cover": {"lt": 20}} # Less than 20% cloud cover)# Get search resultsitems =list(search.items())print(f"Found {len(items)} Sentinel-2 scenes")# Display first few resultsfor i, item inenumerate(items[:5]):# Get cloud cover from properties cloud_cover = item.properties.get('eo:cloud_cover', 'N/A') date = item.datetime.strftime('%Y-%m-%d')print(f"{i+1}. {item.id}")print(f" Date: {date}")print(f" Cloud cover: {cloud_cover}%")print(f" Assets: {list(item.assets.keys())}")
# Take a closer look at a single itemif items: sample_item = items[0]print(f"Item ID: {sample_item.id}")print(f"Collection: {sample_item.collection_id}")print(f"Datetime: {sample_item.datetime}")print(f"Geometry: {sample_item.geometry['type']}")print(f"Bbox: {sample_item.bbox}")# Propertiesprint(f"\nKey Properties:") key_props = ['eo:cloud_cover', 'proj:epsg', 'gsd']for prop in key_props:if prop in sample_item.properties:print(f" {prop}: {sample_item.properties[prop]}")# Available assets (bands/files)print(f"\nAvailable Assets:")for asset_key, asset in sample_item.assets.items():print(f" {asset_key}: {asset.title}")
Item ID: LC08_L2SP_043035_20230726_02_T1
Collection: landsat-c2-l2
Datetime: 2023-07-26 18:40:05.170226+00:00
Geometry: Polygon
Bbox: [-122.26260425677917, 34.964494820144395, -119.67909134681727, 37.098925179855605]
Key Properties:
eo:cloud_cover: 25.26
gsd: 30
Available Assets:
qa: Surface Temperature Quality Assessment Band
ang: Angle Coefficients File
red: Red Band
blue: Blue Band
drad: Downwelled Radiance Band
emis: Emissivity Band
emsd: Emissivity Standard Deviation Band
trad: Thermal Radiance Band
urad: Upwelled Radiance Band
atran: Atmospheric Transmittance Band
cdist: Cloud Distance Band
green: Green Band
nir08: Near Infrared Band 0.8
lwir11: Surface Temperature Band
swir16: Short-wave Infrared Band 1.6
swir22: Short-wave Infrared Band 2.2
coastal: Coastal/Aerosol Band
mtl.txt: Product Metadata File (txt)
mtl.xml: Product Metadata File (xml)
mtl.json: Product Metadata File (json)
qa_pixel: Pixel Quality Assessment Band
qa_radsat: Radiometric Saturation and Terrain Occlusion Quality Assessment Band
qa_aerosol: Aerosol Quality Assessment Band
tilejson: TileJSON with default rendering
rendered_preview: Rendered preview
Accessing Asset URLs
# Get asset URLs (authentication required for actual data download)# For production use, install: pip install planetary-computerif items: sample_item = items[0]# Key Sentinel-2 bands for ML applications key_bands = ['B02', 'B03', 'B04', 'B08'] # Blue, Green, Red, NIRprint("Asset URLs for key bands:")for band in key_bands:if band in sample_item.assets:# Get the asset URL (would need signing for actual data access) asset = sample_item.assets[band] asset_href = asset.hrefprint(f" {band}: {asset_href[:80]}...")print(f" Title: {asset.title}")else:print(f" {band}: Not available")# Note: For actual data access, use planetary-computer package:# import planetary_computer as pc# signed_item = pc.sign(sample_item)# Then use signed_item.assets[band].href for downloadingprint(f"\nNote: URLs above require authentication for actual data access")print(f"Install 'planetary-computer' package and use pc.sign() for data downloads")
Asset URLs for key bands:
B02: Not available
B03: Not available
B04: Not available
B08: Not available
Note: URLs above require authentication for actual data access
Install 'planetary-computer' package and use pc.sign() for data downloads
Other STAC Providers
Earth Search (Element84)
# Connect to Earth Search STAC APItry: earth_search = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")print(f"Earth Search API: {earth_search.title}")# List collections earth_collections =list(earth_search.get_collections())print(f"Available collections: {len(earth_collections)}")for collection in earth_collections[:5]:print(f" - {collection.id}: {collection.title}")exceptExceptionas e:print(f"Error connecting to Earth Search: {e}")
Earth Search API: Earth Search by Element 84
Available collections: 9
- sentinel-2-pre-c1-l2a: Sentinel-2 Pre-Collection 1 Level-2A
- cop-dem-glo-30: Copernicus DEM GLO-30
- naip: NAIP: National Agriculture Imagery Program
- cop-dem-glo-90: Copernicus DEM GLO-90
- landsat-c2-l2: Landsat Collection 2 Level-2
Google Earth Engine Data Catalog
# Example of other STAC endpointsother_endpoints = {"USGS STAC": "https://landsatlook.usgs.gov/stac-server","CBERS STAC": "https://cbers-stac.s3.amazonaws.com","Digital Earth Australia": "https://explorer.sandbox.dea.ga.gov.au/stac"}print("Other STAC Endpoints:")for name, url in other_endpoints.items():print(f" {name}: {url}")# Note: Some endpoints may require authentication or have different access patterns
Other STAC Endpoints:
USGS STAC: https://landsatlook.usgs.gov/stac-server
CBERS STAC: https://cbers-stac.s3.amazonaws.com
Digital Earth Australia: https://explorer.sandbox.dea.ga.gov.au/stac
Loading Data for ML Applications
Creating Datacubes
def create_datacube_info(items, bands=['B02', 'B03', 'B04', 'B08']):"""Create information for a datacube from STAC items""" datacube_info = []for item in items: item_info = {'id': item.id,'datetime': item.datetime,'cloud_cover': item.properties.get('eo:cloud_cover', None),'epsg': item.properties.get('proj:epsg', None),'bands': {} }for band in bands:if band in item.assets: item_info['bands'][band] = item.assets[band].hrefelse: item_info['bands'][band] =None datacube_info.append(item_info)return datacube_info# Create datacube informationif items: datacube = create_datacube_info(items[:5])print("Datacube Information:")for i, scene inenumerate(datacube):print(f"\nScene {i+1}:")print(f" ID: {scene['id']}")print(f" Date: {scene['datetime'].strftime('%Y-%m-%d')}")print(f" Cloud cover: {scene['cloud_cover']}%")print(f" Available bands: {list(scene['bands'].keys())}")
Datacube Information:
Scene 1:
ID: LC08_L2SP_043035_20230726_02_T1
Date: 2023-07-26
Cloud cover: 25.26%
Available bands: ['B02', 'B03', 'B04', 'B08']
Scene 2:
ID: LC08_L2SP_043034_20230726_02_T1
Date: 2023-07-26
Cloud cover: 0.75%
Available bands: ['B02', 'B03', 'B04', 'B08']
Scene 3:
ID: LC09_L2SP_044034_20230725_02_T1
Date: 2023-07-25
Cloud cover: 32.5%
Available bands: ['B02', 'B03', 'B04', 'B08']
Scene 4:
ID: LE07_L2SP_044034_20230721_02_T1
Date: 2023-07-21
Cloud cover: 54.0%
Available bands: ['B02', 'B03', 'B04', 'B08']
Scene 5:
ID: LC09_L2SP_043035_20230718_02_T1
Date: 2023-07-18
Cloud cover: 33.19%
Available bands: ['B02', 'B03', 'B04', 'B08']
Integration with Rasterio and Xarray
# Example of loading data with rasterio (conceptual)def load_stac_band_info(item, band_name):"""Get information needed to load a band with rasterio"""if band_name in item.assets: asset = item.assets[band_name] band_info = {'url': asset.href,'title': asset.title,'description': asset.description,'eo_bands': [] }# Get EO band information if availableifhasattr(asset, 'extra_fields') and'eo:bands'in asset.extra_fields: band_info['eo_bands'] = asset.extra_fields['eo:bands']return band_infoelse:returnNone# Example usageif items: sample_item = items[0] red_band_info = load_stac_band_info(sample_item, 'B04')if red_band_info:print("Red Band Information:")for key, value in red_band_info.items():print(f" {key}: {value}")
STAC for GFM Training Workflows
Multi-Temporal Data Collection
def plan_multitemporal_collection(aoi, date_ranges, collections):"""Plan a multi-temporal data collection for GFM training""" collection_plan = {'total_scenes': 0,'by_date_range': {},'by_collection': {} }for date_range in date_ranges: start, end = date_range range_key =f"{start}_to_{end}" collection_plan['by_date_range'][range_key] = {}for collection in collections:# Simulate search (would normally use actual search) estimated_scenes = np.random.randint(10, 50) # Mock data collection_plan['by_date_range'][range_key][collection] = estimated_scenesif collection notin collection_plan['by_collection']: collection_plan['by_collection'][collection] =0 collection_plan['by_collection'][collection] += estimated_scenes collection_plan['total_scenes'] += estimated_scenesreturn collection_plan# Plan multi-temporal collectiondate_ranges = [ ("2023-03-01", "2023-05-31"), # Spring ("2023-06-01", "2023-08-31"), # Summer ("2023-09-01", "2023-11-30") # Fall]collections = ["sentinel-2-l2a", "landsat-c2-l2"]plan = plan_multitemporal_collection(aoi, date_ranges, collections)print("Multi-temporal Collection Plan:")print(f"Total estimated scenes: {plan['total_scenes']}")print("\nBy Date Range:")for date_range, collections in plan['by_date_range'].items():print(f" {date_range}:")for collection, count in collections.items():print(f" {collection}: {count} scenes")print("\nBy Collection:")for collection, count in plan['by_collection'].items():print(f" {collection}: {count} scenes")
Multi-temporal Collection Plan:
Total estimated scenes: 167
By Date Range:
2023-03-01_to_2023-05-31:
sentinel-2-l2a: 21 scenes
landsat-c2-l2: 27 scenes
2023-06-01_to_2023-08-31:
sentinel-2-l2a: 31 scenes
landsat-c2-l2: 25 scenes
2023-09-01_to_2023-11-30:
sentinel-2-l2a: 23 scenes
landsat-c2-l2: 40 scenes
By Collection:
sentinel-2-l2a: 75 scenes
landsat-c2-l2: 92 scenes
def extract_training_metadata(items):"""Extract metadata useful for ML training""" metadata_df = []for item in items: metadata = {'scene_id': item.id,'collection': item.collection_id,'datetime': item.datetime,'cloud_cover': item.properties.get('eo:cloud_cover'),'sun_azimuth': item.properties.get('s2:mean_solar_azimuth'),'sun_elevation': item.properties.get('s2:mean_solar_zenith'),'data_coverage': item.properties.get('s2:data_coverage_percentage'),'processing_level': item.properties.get('processing:level'),'spatial_resolution': item.properties.get('gsd'),'epsg_code': item.properties.get('proj:epsg'),'bbox': item.bbox,'geometry_type': item.geometry['type'] }# Count available bands band_count =len([k for k in item.assets.keys() if k.startswith('B')]) metadata['band_count'] = band_count metadata_df.append(metadata)return pd.DataFrame(metadata_df)# Extract metadataif items: metadata_df = extract_training_metadata(items[:10])print("Training Metadata Summary:")print(f" Scenes: {len(metadata_df)}")print(f" Date range: {metadata_df['datetime'].min()} to {metadata_df['datetime'].max()}")print(f" Cloud cover range: {metadata_df['cloud_cover'].min()}% to {metadata_df['cloud_cover'].max()}%")print(f" Average bands per scene: {metadata_df['band_count'].mean():.1f}")# Show first few rowsprint("\nFirst 3 scenes:")print(metadata_df[['scene_id', 'datetime', 'cloud_cover', 'band_count']].head(3).to_string(index=False))
Training Metadata Summary:
Scenes: 10
Date range: 2023-07-10 18:39:59.445083+00:00 to 2023-07-26 18:40:05.170226+00:00
Cloud cover range: 0.75% to 54.0%
Average bands per scene: 0.0
First 3 scenes:
scene_id datetime cloud_cover band_count
LC08_L2SP_043035_20230726_02_T1 2023-07-26 18:40:05.170226+00:00 25.26 0
LC08_L2SP_043034_20230726_02_T1 2023-07-26 18:39:41.283422+00:00 0.75 0
LC09_L2SP_044034_20230725_02_T1 2023-07-25 18:45:40.240352+00:00 32.50 0
Advanced STAC Operations
Aggregating Collections
def compare_collections(catalog, collection_ids, aoi, date_range):"""Compare multiple collections for the same area and time""" comparison = {}for collection_id in collection_ids:try: search = catalog.search( collections=[collection_id], intersects=aoi, datetime=date_range, limit=100 ) items =list(search.items())if items:# Calculate statistics cloud_covers = [item.properties.get('eo:cloud_cover', 0) for item in items if item.properties.get('eo:cloud_cover') isnotNone] comparison[collection_id] = {'item_count': len(items),'avg_cloud_cover': np.mean(cloud_covers) if cloud_covers elseNone,'min_cloud_cover': np.min(cloud_covers) if cloud_covers elseNone,'temporal_coverage': (items[0].datetime, items[-1].datetime),'sample_bands': list(items[0].assets.keys())[:5] }else: comparison[collection_id] = {'item_count': 0}exceptExceptionas e: comparison[collection_id] = {'error': str(e)}return comparison# Compare collectionscomparison = compare_collections( pc_catalog, ["sentinel-2-l2a", "landsat-c2-l2"], aoi,"2023-07-01/2023-07-31")print("Collection Comparison:")for collection_id, stats in comparison.items():print(f"\n{collection_id}:")if'error'in stats:print(f" Error: {stats['error']}")elif stats['item_count'] ==0:print(" No items found")else:print(f" Items found: {stats['item_count']}")if stats['avg_cloud_cover'] isnotNone:print(f" Avg cloud cover: {stats['avg_cloud_cover']:.1f}%")print(f" Min cloud cover: {stats['min_cloud_cover']:.1f}%")print(f" Sample bands: {stats['sample_bands']}")
def create_training_manifest(filtered_items, output_bands=['B02', 'B03', 'B04', 'B08']):"""Create a manifest file for ML training""" training_manifest = {'dataset_info': {'created_at': datetime.now().isoformat(),'total_scenes': len(filtered_items),'bands': output_bands,'description': 'STAC-derived training dataset manifest' },'scenes': [] }for item in filtered_items: scene_data = {'scene_id': item.id,'collection': item.collection_id,'datetime': item.datetime.isoformat(),'bbox': item.bbox,'properties': {'cloud_cover': item.properties.get('eo:cloud_cover'),'sun_elevation': item.properties.get('s2:mean_solar_zenith'),'epsg': item.properties.get('proj:epsg') },'band_urls': {} }for band in output_bands:if band in item.assets: scene_data['band_urls'][band] = item.assets[band].href training_manifest['scenes'].append(scene_data)return training_manifest# Create training manifestif items: filtered_items, _ = filter_scenes_for_ml(items[:5], max_cloud_cover=15) manifest = create_training_manifest(filtered_items)print("Training Manifest Created:")print(f" Total scenes: {manifest['dataset_info']['total_scenes']}")print(f" Bands: {manifest['dataset_info']['bands']}")print(f" Created: {manifest['dataset_info']['created_at']}")# Show first scene structureif manifest['scenes']:print(f"\nFirst scene structure:") first_scene = manifest['scenes'][0]for key, value in first_scene.items():if key =='band_urls':print(f" {key}: {list(value.keys())}")else:print(f" {key}: {value}")
Training Manifest Created:
Total scenes: 1
Bands: ['B02', 'B03', 'B04', 'B08']
Created: 2025-08-12T19:02:37.690679
First scene structure:
scene_id: LC08_L2SP_043034_20230726_02_T1
collection: landsat-c2-l2
datetime: 2023-07-26T18:39:41.283422+00:00
bbox: [-121.86762720531041, 36.39286485893108, -119.22622808725025, 38.534055141068926]
properties: {'cloud_cover': 0.75, 'sun_elevation': None, 'epsg': None}
band_urls: []
Summary
Key STAC concepts for GFM development:
STAC APIs provide standardized access to petabytes of earth observation data
Microsoft Planetary Computer offers free access to major satellite datasets
Quality filtering is essential for ML training data preparation
Multi-temporal collections enable time-series and change detection models
Metadata extraction supports dataset organization and model training
Cross-collection searches maximize data availability and diversity
Essential workflows: - Search and filter scenes by quality metrics - Extract and organize metadata for training - Create manifests linking STAC items to training pipelines - Compare collections to optimize data selection - Plan multi-temporal acquisitions for comprehensive datasets
These patterns enable scalable, reproducible access to satellite imagery for geospatial foundation model development.
Source Code
---title: "STAC APIs & Planetary Computer"subtitle: "Accessing satellite data via STAC for GFM training"jupyter: geoaiformat: html: code-fold: false---## Introduction to STACSTAC (SpatioTemporal Asset Catalog) is a specification for describing geospatial information that makes it easier to work with satellite imagery and other earth observation data at scale.```{python}import pystac_clientimport pandas as pdimport numpy as npimport matplotlib.pyplot as pltfrom pystac.extensions.eo import EOExtensionfrom pystac.extensions.sat import SatExtensionimport requestsimport jsonfrom datetime import datetime, timedeltaprint("STAC libraries loaded successfully")```## Microsoft Planetary ComputerMicrosoft's Planetary Computer provides free access to petabytes of earth observation data through STAC APIs.```{python}# Connect to Planetary Computer STAC API# Note: Planetary Computer requires authentication for data access, but catalog browsing is publicpc_catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")print(f"Planetary Computer STAC API: {pc_catalog.title}")print(f"Description: {pc_catalog.description}")# List available collectionscollections =list(pc_catalog.get_collections())print(f"\nNumber of collections: {len(collections)}")# Show first few collection IDs and titlesfor i, collection inenumerate(collections[:10]):print(f"{i+1:2d}. {collection.id}: {collection.title}")```### Key Planetary Computer Collections```{python}# Key collections for GFM trainingkey_collections = ["sentinel-2-l2a", # Sentinel-2 Level 2A (surface reflectance)"landsat-c2-l2", # Landsat Collection 2 Level 2"modis-13A1-061", # MODIS Vegetation Indices"naip", # National Agriculture Imagery Program"aster-l1t", # ASTER Level 1T"cop-dem-glo-30"# Copernicus DEM Global 30m]print("Key Collections for GFM Training:")print("="*50)for collection_id in key_collections:try: collection = pc_catalog.get_collection(collection_id)print(f"\n{collection.id}")print(f" Title: {collection.title}")print(f" Extent: {collection.extent.temporal.intervals[0][0]} to {collection.extent.temporal.intervals[0][1]}")# Show available bands if it's an EO collectionif'eo:bands'in collection.summaries: bands = collection.summaries['eo:bands']print(f" Bands: {len(bands)} bands available")exceptExceptionas e:print(f" Error accessing {collection_id}: {e}")```## Searching for Data### Basic Search Parameters```{python}# Define area of interest (AOI) - California Central Valleyaoi = {"type": "Polygon","coordinates": [[ [-121.5, 37.0], # Southwest corner [-121.0, 37.0], # Southeast corner [-121.0, 37.5], # Northeast corner [-121.5, 37.5], # Northwest corner [-121.5, 37.0] # Close polygon ]]}# Define time rangestart_date ="2023-06-01"end_date ="2023-08-31"print(f"Search parameters:")print(f" AOI: Central Valley, California")print(f" Time range: {start_date} to {end_date}")print(f" Collections: Sentinel-2 L2A")```### Sentinel-2 Search Example```{python}# Search for Sentinel-2 datasearch = pc_catalog.search( collections=["sentinel-2-l2a"], intersects=aoi, datetime=f"{start_date}/{end_date}", query={"eo:cloud_cover": {"lt": 20}} # Less than 20% cloud cover)# Get search resultsitems =list(search.items())print(f"Found {len(items)} Sentinel-2 scenes")# Display first few resultsfor i, item inenumerate(items[:5]):# Get cloud cover from properties cloud_cover = item.properties.get('eo:cloud_cover', 'N/A') date = item.datetime.strftime('%Y-%m-%d')print(f"{i+1}. {item.id}")print(f" Date: {date}")print(f" Cloud cover: {cloud_cover}%")print(f" Assets: {list(item.assets.keys())}")```### Multi-Collection Search```{python}# Search across multiple collectionsmulti_search = pc_catalog.search( collections=["sentinel-2-l2a", "landsat-c2-l2"], intersects=aoi, datetime="2023-07-01/2023-07-31", limit=10)multi_items =list(multi_search.items())print(f"Found {len(multi_items)} items across collections")# Group by collectionby_collection = {}for item in multi_items: collection = item.collection_idif collection notin by_collection: by_collection[collection] = [] by_collection[collection].append(item)for collection, items in by_collection.items():print(f"\n{collection}: {len(items)} items")for item in items[:3]: # Show first 3 date = item.datetime.strftime('%Y-%m-%d')print(f" - {item.id} ({date})")```## Working with STAC Items### Examining Item Metadata```{python}# Take a closer look at a single itemif items: sample_item = items[0]print(f"Item ID: {sample_item.id}")print(f"Collection: {sample_item.collection_id}")print(f"Datetime: {sample_item.datetime}")print(f"Geometry: {sample_item.geometry['type']}")print(f"Bbox: {sample_item.bbox}")# Propertiesprint(f"\nKey Properties:") key_props = ['eo:cloud_cover', 'proj:epsg', 'gsd']for prop in key_props:if prop in sample_item.properties:print(f" {prop}: {sample_item.properties[prop]}")# Available assets (bands/files)print(f"\nAvailable Assets:")for asset_key, asset in sample_item.assets.items():print(f" {asset_key}: {asset.title}")```### Accessing Asset URLs```{python}# Get asset URLs (authentication required for actual data download)# For production use, install: pip install planetary-computerif items: sample_item = items[0]# Key Sentinel-2 bands for ML applications key_bands = ['B02', 'B03', 'B04', 'B08'] # Blue, Green, Red, NIRprint("Asset URLs for key bands:")for band in key_bands:if band in sample_item.assets:# Get the asset URL (would need signing for actual data access) asset = sample_item.assets[band] asset_href = asset.hrefprint(f" {band}: {asset_href[:80]}...")print(f" Title: {asset.title}")else:print(f" {band}: Not available")# Note: For actual data access, use planetary-computer package:# import planetary_computer as pc# signed_item = pc.sign(sample_item)# Then use signed_item.assets[band].href for downloadingprint(f"\nNote: URLs above require authentication for actual data access")print(f"Install 'planetary-computer' package and use pc.sign() for data downloads")```## Other STAC Providers### Earth Search (Element84)```{python}# Connect to Earth Search STAC APItry: earth_search = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")print(f"Earth Search API: {earth_search.title}")# List collections earth_collections =list(earth_search.get_collections())print(f"Available collections: {len(earth_collections)}")for collection in earth_collections[:5]:print(f" - {collection.id}: {collection.title}")exceptExceptionas e:print(f"Error connecting to Earth Search: {e}")```### Google Earth Engine Data Catalog```{python}# Example of other STAC endpointsother_endpoints = {"USGS STAC": "https://landsatlook.usgs.gov/stac-server","CBERS STAC": "https://cbers-stac.s3.amazonaws.com","Digital Earth Australia": "https://explorer.sandbox.dea.ga.gov.au/stac"}print("Other STAC Endpoints:")for name, url in other_endpoints.items():print(f" {name}: {url}")# Note: Some endpoints may require authentication or have different access patterns```## Loading Data for ML Applications### Creating Datacubes```{python}def create_datacube_info(items, bands=['B02', 'B03', 'B04', 'B08']):"""Create information for a datacube from STAC items""" datacube_info = []for item in items: item_info = {'id': item.id,'datetime': item.datetime,'cloud_cover': item.properties.get('eo:cloud_cover', None),'epsg': item.properties.get('proj:epsg', None),'bands': {} }for band in bands:if band in item.assets: item_info['bands'][band] = item.assets[band].hrefelse: item_info['bands'][band] =None datacube_info.append(item_info)return datacube_info# Create datacube informationif items: datacube = create_datacube_info(items[:5])print("Datacube Information:")for i, scene inenumerate(datacube):print(f"\nScene {i+1}:")print(f" ID: {scene['id']}")print(f" Date: {scene['datetime'].strftime('%Y-%m-%d')}")print(f" Cloud cover: {scene['cloud_cover']}%")print(f" Available bands: {list(scene['bands'].keys())}")```### Integration with Rasterio and Xarray```{python}# Example of loading data with rasterio (conceptual)def load_stac_band_info(item, band_name):"""Get information needed to load a band with rasterio"""if band_name in item.assets: asset = item.assets[band_name] band_info = {'url': asset.href,'title': asset.title,'description': asset.description,'eo_bands': [] }# Get EO band information if availableifhasattr(asset, 'extra_fields') and'eo:bands'in asset.extra_fields: band_info['eo_bands'] = asset.extra_fields['eo:bands']return band_infoelse:returnNone# Example usageif items: sample_item = items[0] red_band_info = load_stac_band_info(sample_item, 'B04')if red_band_info:print("Red Band Information:")for key, value in red_band_info.items():print(f" {key}: {value}")```## STAC for GFM Training Workflows### Multi-Temporal Data Collection```{python}def plan_multitemporal_collection(aoi, date_ranges, collections):"""Plan a multi-temporal data collection for GFM training""" collection_plan = {'total_scenes': 0,'by_date_range': {},'by_collection': {} }for date_range in date_ranges: start, end = date_range range_key =f"{start}_to_{end}" collection_plan['by_date_range'][range_key] = {}for collection in collections:# Simulate search (would normally use actual search) estimated_scenes = np.random.randint(10, 50) # Mock data collection_plan['by_date_range'][range_key][collection] = estimated_scenesif collection notin collection_plan['by_collection']: collection_plan['by_collection'][collection] =0 collection_plan['by_collection'][collection] += estimated_scenes collection_plan['total_scenes'] += estimated_scenesreturn collection_plan# Plan multi-temporal collectiondate_ranges = [ ("2023-03-01", "2023-05-31"), # Spring ("2023-06-01", "2023-08-31"), # Summer ("2023-09-01", "2023-11-30") # Fall]collections = ["sentinel-2-l2a", "landsat-c2-l2"]plan = plan_multitemporal_collection(aoi, date_ranges, collections)print("Multi-temporal Collection Plan:")print(f"Total estimated scenes: {plan['total_scenes']}")print("\nBy Date Range:")for date_range, collections in plan['by_date_range'].items():print(f" {date_range}:")for collection, count in collections.items():print(f" {collection}: {count} scenes")print("\nBy Collection:")for collection, count in plan['by_collection'].items():print(f" {collection}: {count} scenes")```### Quality Filtering for ML```{python}def filter_scenes_for_ml(items, max_cloud_cover=10, min_data_coverage=80):"""Filter STAC items for ML training quality""" filtered_items = [] filter_stats = {'total_input': len(items),'passed_cloud_filter': 0,'passed_data_filter': 0,'final_count': 0 }for item in items:# Check cloud cover cloud_cover = item.properties.get('eo:cloud_cover', 100)if cloud_cover > max_cloud_cover:continue filter_stats['passed_cloud_filter'] +=1# Check data coverage (if available) data_coverage = item.properties.get('s2:data_coverage_percentage', 100)if data_coverage < min_data_coverage:continue filter_stats['passed_data_filter'] +=1 filtered_items.append(item) filter_stats['final_count'] +=1return filtered_items, filter_stats# Apply quality filteringif items: filtered_items, stats = filter_scenes_for_ml(items, max_cloud_cover=15)print("Quality Filtering Results:")print(f" Input scenes: {stats['total_input']}")print(f" Passed cloud filter (<15%): {stats['passed_cloud_filter']}")print(f" Passed data filter (>80%): {stats['passed_data_filter']}")print(f" Final count: {stats['final_count']}")print(f" Retention rate: {stats['final_count']/stats['total_input']*100:.1f}%")```### Metadata Extraction for Training```{python}def extract_training_metadata(items):"""Extract metadata useful for ML training""" metadata_df = []for item in items: metadata = {'scene_id': item.id,'collection': item.collection_id,'datetime': item.datetime,'cloud_cover': item.properties.get('eo:cloud_cover'),'sun_azimuth': item.properties.get('s2:mean_solar_azimuth'),'sun_elevation': item.properties.get('s2:mean_solar_zenith'),'data_coverage': item.properties.get('s2:data_coverage_percentage'),'processing_level': item.properties.get('processing:level'),'spatial_resolution': item.properties.get('gsd'),'epsg_code': item.properties.get('proj:epsg'),'bbox': item.bbox,'geometry_type': item.geometry['type'] }# Count available bands band_count =len([k for k in item.assets.keys() if k.startswith('B')]) metadata['band_count'] = band_count metadata_df.append(metadata)return pd.DataFrame(metadata_df)# Extract metadataif items: metadata_df = extract_training_metadata(items[:10])print("Training Metadata Summary:")print(f" Scenes: {len(metadata_df)}")print(f" Date range: {metadata_df['datetime'].min()} to {metadata_df['datetime'].max()}")print(f" Cloud cover range: {metadata_df['cloud_cover'].min()}% to {metadata_df['cloud_cover'].max()}%")print(f" Average bands per scene: {metadata_df['band_count'].mean():.1f}")# Show first few rowsprint("\nFirst 3 scenes:")print(metadata_df[['scene_id', 'datetime', 'cloud_cover', 'band_count']].head(3).to_string(index=False))```## Advanced STAC Operations### Aggregating Collections```{python}def compare_collections(catalog, collection_ids, aoi, date_range):"""Compare multiple collections for the same area and time""" comparison = {}for collection_id in collection_ids:try: search = catalog.search( collections=[collection_id], intersects=aoi, datetime=date_range, limit=100 ) items =list(search.items())if items:# Calculate statistics cloud_covers = [item.properties.get('eo:cloud_cover', 0) for item in items if item.properties.get('eo:cloud_cover') isnotNone] comparison[collection_id] = {'item_count': len(items),'avg_cloud_cover': np.mean(cloud_covers) if cloud_covers elseNone,'min_cloud_cover': np.min(cloud_covers) if cloud_covers elseNone,'temporal_coverage': (items[0].datetime, items[-1].datetime),'sample_bands': list(items[0].assets.keys())[:5] }else: comparison[collection_id] = {'item_count': 0}exceptExceptionas e: comparison[collection_id] = {'error': str(e)}return comparison# Compare collectionscomparison = compare_collections( pc_catalog, ["sentinel-2-l2a", "landsat-c2-l2"], aoi,"2023-07-01/2023-07-31")print("Collection Comparison:")for collection_id, stats in comparison.items():print(f"\n{collection_id}:")if'error'in stats:print(f" Error: {stats['error']}")elif stats['item_count'] ==0:print(" No items found")else:print(f" Items found: {stats['item_count']}")if stats['avg_cloud_cover'] isnotNone:print(f" Avg cloud cover: {stats['avg_cloud_cover']:.1f}%")print(f" Min cloud cover: {stats['min_cloud_cover']:.1f}%")print(f" Sample bands: {stats['sample_bands']}")```### Creating Training Datasets```{python}def create_training_manifest(filtered_items, output_bands=['B02', 'B03', 'B04', 'B08']):"""Create a manifest file for ML training""" training_manifest = {'dataset_info': {'created_at': datetime.now().isoformat(),'total_scenes': len(filtered_items),'bands': output_bands,'description': 'STAC-derived training dataset manifest' },'scenes': [] }for item in filtered_items: scene_data = {'scene_id': item.id,'collection': item.collection_id,'datetime': item.datetime.isoformat(),'bbox': item.bbox,'properties': {'cloud_cover': item.properties.get('eo:cloud_cover'),'sun_elevation': item.properties.get('s2:mean_solar_zenith'),'epsg': item.properties.get('proj:epsg') },'band_urls': {} }for band in output_bands:if band in item.assets: scene_data['band_urls'][band] = item.assets[band].href training_manifest['scenes'].append(scene_data)return training_manifest# Create training manifestif items: filtered_items, _ = filter_scenes_for_ml(items[:5], max_cloud_cover=15) manifest = create_training_manifest(filtered_items)print("Training Manifest Created:")print(f" Total scenes: {manifest['dataset_info']['total_scenes']}")print(f" Bands: {manifest['dataset_info']['bands']}")print(f" Created: {manifest['dataset_info']['created_at']}")# Show first scene structureif manifest['scenes']:print(f"\nFirst scene structure:") first_scene = manifest['scenes'][0]for key, value in first_scene.items():if key =='band_urls':print(f" {key}: {list(value.keys())}")else:print(f" {key}: {value}")```## SummaryKey STAC concepts for GFM development:- **STAC APIs** provide standardized access to petabytes of earth observation data- **Microsoft Planetary Computer** offers free access to major satellite datasets- **Quality filtering** is essential for ML training data preparation- **Multi-temporal collections** enable time-series and change detection models- **Metadata extraction** supports dataset organization and model training- **Cross-collection searches** maximize data availability and diversityEssential workflows:- Search and filter scenes by quality metrics- Extract and organize metadata for training- Create manifests linking STAC items to training pipelines- Compare collections to optimize data selection- Plan multi-temporal acquisitions for comprehensive datasetsThese patterns enable scalable, reproducible access to satellite imagery for geospatial foundation model development.