This document describes the complete sequential workflow for AI forecast submission using the AIFS ensemble system, from initial condition preparation through to final forecast submission.
The process consists of three main steps, split across two computing environments (a CPU/ETL machine and GPU machines), to produce ensemble weather forecasts and submit them for evaluation.
- Initial Condition Preparation (ETL Machine) → `ecmwf_opendata_pkl_input_aifsens.py`
- GPU Inference (GPU Machines)
  - FP32 (A100 GPU) → `automate_aifs_gpu_pipeline.py`
  - FP16 (G2 GPU) → `fp16_automate_aifs_gpu_pipeline.py`
- Post-Processing & Submission (ETL Machine)
  - Regrid → `aifs_n320_grib_1p5defg_nc_cli.py`
  - Quintile Analysis → `ensemble_quintile_analysis_cli.py`
  - Forecast Submission → `forecast_submission_cli.py`
The ETL (non-GPU) machine is used for Step 1 and Step 3. Start the Coiled notebook:
```bash
coiled notebook start --name p2-aifs-etl-20260129 --vm-type n2-standard-2 --software aifs-etl-v2 --workspace=gcp-sewaa-nka --region us-east5
```

Install the required environment using micromamba:

```bash
micromamba create -n aifs-etl -c conda-forge python=3.12.7 \
  && eval "$(micromamba shell hook --shell bash)" \
  && micromamba activate aifs-etl \
  && micromamba install -c conda-forge earthkit-data ecmwf-opendata \
  && pip install gcsfs s3fs earthkit-regrid==0.4.0 google-cloud-storage icechunk AI_WQ_package \
  && sudo apt update && sudo apt install nano
```

Copy the `.env.example` file to `.env` and fill in your credentials:

```bash
cp .env.example .env
nano .env
```

Example `.env` contents:

```
AIWQ_TEAM_NAME=Fahamu
AIWQ_MODEL_NAME=FahamuAIFSv1
AIWQ_MODEL_NAME_FP16=FahamuAIFSv1_fp16
AIWQ_PASSWORD=your_password_here
```
A GCS service account key file (`coiled-data.json`) is also required for cloud storage access.
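The scripts below are expected to pick these credentials up at startup. As a minimal sketch, assuming `python-dotenv` and `google-cloud-storage` (both installed above); the individual scripts may wire this up differently:

```python
# Minimal sketch: load AI Weather Quest credentials from .env and open a
# GCS client with the service account key. Scripts may wire this differently.
import os

from dotenv import load_dotenv
from google.cloud import storage

load_dotenv()  # reads AIWQ_TEAM_NAME, AIWQ_MODEL_NAME, AIWQ_PASSWORD from .env
team = os.environ["AIWQ_TEAM_NAME"]

client = storage.Client.from_service_account_json("coiled-data.json")
bucket = client.bucket("aifs-aiquest-us-20251127")  # bucket used throughout this workflow
print(f"Team {team}: using bucket gs://{bucket.name}")
```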
File: `ecmwf_opendata_pkl_input_aifsens.py`

```bash
python ecmwf_opendata_pkl_input_aifsens.py
```

- Purpose: Download and preprocess ECMWF open data for ensemble members 1-50
- Environment: ETL machine (CPU-only, `n2-standard-2`)
- Input: ECMWF open data (surface, soil, pressure level parameters)
- Output: Pickle files uploaded to GCS bucket (`gs://aifs-aiquest-us-20251127/YYYYMMDD_0000/input/`)
- Requires: `coiled-data.json` (GCS service account key)
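For orientation, here is a hedged sketch of the kind of retrieval this step performs with the `ecmwf-opendata` and `earthkit-data` clients. The parameter list and pickle layout are illustrative only; the real script covers the full surface, soil, and pressure-level parameter set for members 1-50:

```python
# Illustrative only: fetch a small subset of one perturbed member's fields
# from ECMWF Open Data and pickle them.
import pickle

import earthkit.data
from ecmwf.opendata import Client

client = Client(source="ecmwf")
client.retrieve(
    stream="enfo",        # ensemble forecast stream
    type="pf",            # perturbed forecast member
    number=1,             # member 1; the script loops over 1-50
    step=0,
    param=["msl", "2t"],  # illustrative subset of the input parameters
    target="member_001.grib2",
)

fields = earthkit.data.from_source("file", "member_001.grib2")
state = {f.metadata("param"): f.to_numpy() for f in fields}
with open("input_state_member_001.pkl", "wb") as fh:
    pickle.dump(state, fh)
```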
Start the GPU notebook:

```bash
coiled notebook start --name p1-gpu-aifs-20260129 --vm-type a2-ultragpu-1g --software east5-us-flashattn-dockerv1 --workspace=gcp-sewaa-nka --region us-east5 --disk-size 60
```

File: `automate_aifs_gpu_pipeline.py`

```bash
python automate_aifs_gpu_pipeline.py --date 20260129_0000 --members 1-50
```

- Purpose: Run AIFS-ENS model at full FP32 precision for all ensemble members
- Environment: A100 GPU (`a2-ultragpu-1g`, ~80GB VRAM)
- Processing: One member at a time (download → inference → upload → cleanup) to minimise storage usage (sketched below)
- Output: GRIB files uploaded to `gs://aifs-aiquest-us-20251127/YYYYMMDD_0000/forecasts/`
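A simplified sketch of that per-member loop; the CLI arguments passed to the helper scripts are hypothetical, and the real orchestration in `automate_aifs_gpu_pipeline.py` may differ:

```python
# Sketch of the per-member loop: keep only one member's data on disk at a
# time. The helper script arguments shown here are hypothetical.
import subprocess
from pathlib import Path

DATE = "20260129_0000"

def run_member(member: int) -> None:
    subprocess.run(  # 1. download this member's input pickle from GCS
        ["python", "download_pkl_from_gcs.py", DATE, str(member)], check=True)
    subprocess.run(  # 2. run FP32 AIFS-ENS inference for this member
        ["python", "fp32_multi_run_AIFS_ENS_v1.py", DATE, str(member)], check=True)
    subprocess.run(  # 3. upload GRIB output to .../YYYYMMDD_0000/forecasts/
        ["python", "upload_aifs_gpu_output_grib_gcs.py", DATE, str(member)], check=True)
    # 4. cleanup: free local disk before the next member
    for path in Path(".").glob(f"*member_{member:03d}*"):
        path.unlink()

for member in range(1, 51):  # members 1-50
    run_member(member)
```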
Required files on the GPU machine:

| File | Purpose |
|---|---|
| `automate_aifs_gpu_pipeline.py` | Main pipeline orchestrator |
| `fp32_multi_run_AIFS_ENS_v1.py` | AIFS model runner (FP32) |
| `download_pkl_from_gcs.py` | GCS download utility |
| `upload_aifs_gpu_output_grib_gcs.py` | GCS upload utility |
| `coiled-data.json` | GCS service account key |
SHUTDOWN the GPU notebook after completion to avoid unnecessary costs.
Start the GPU notebook:

```bash
coiled notebook start --name p2-fp16-20260129 --vm-type g2-standard-12 --software flashattn-dockerv1 --workspace=gcp-sewaa-nka --region us-east4 --disk-size 400
```

File: `fp16_automate_aifs_gpu_pipeline.py`

```bash
python fp16_automate_aifs_gpu_pipeline.py --date 20260129_0000 --members 1-50
```

- Purpose: Run AIFS-ENS model at FP16 (half precision), reducing VRAM from ~50GB to <24GB
- Environment: G2 GPU (`g2-standard-12`)
- Processing: Same per-member pipeline as the FP32 version
- Output: GRIB files uploaded to `gs://aifs-aiquest-us-20251127/YYYYMMDD_0000/fp16_forecasts/`
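The VRAM reduction comes from storing weights and activations as 16-bit floats. A generic PyTorch illustration of the halved parameter memory (not the actual anemoi-inference code path):

```python
# Generic illustration of the FP16 saving (not the actual anemoi-inference
# code path): casting a model to float16 halves its parameter memory.
import torch

model = torch.nn.Linear(4096, 4096)
fp32_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6

model = model.half()  # float32 (4 bytes/element) -> float16 (2 bytes/element)
fp16_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6

print(f"FP32: {fp32_mb:.0f} MB, FP16: {fp16_mb:.0f} MB")  # ~2x reduction
```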
Required files on the GPU machine:

| File | Purpose |
|---|---|
| `fp16_automate_aifs_gpu_pipeline.py` | Main pipeline orchestrator (FP16) |
| `fp16_multi_run_AIFS_ENS_v1.py` | AIFS model runner (FP16) |
| `download_pkl_from_gcs.py` | GCS download utility |
| `upload_aifs_gpu_output_grib_gcs.py` | GCS upload utility |
| `coiled-data.json` | GCS service account key |
SHUTDOWN the GPU notebook after completion.
Use the same ETL machine from Step 1:
```bash
coiled notebook start --name p2-aifs-etl-20260129 --vm-type n2-standard-2 --software aifs-etl-v2 --workspace=gcp-sewaa-nka --region us-east5
```

File: `aifs_n320_grib_1p5defg_nc_cli.py`

```bash
python aifs_n320_grib_1p5defg_nc_cli.py --date 20260129
# For FP16:
python aifs_n320_grib_1p5defg_nc_cli.py --date 20260129 --fp16
```

- Purpose: Download GRIB files from GCS and regrid from N320 to 1.5 degree NetCDF
- Output: NetCDF files in `gs://aifs-aiquest-us-20251127/YYYYMMDD_0000/1p5deg_nc/` (or `fp16_1p5deg_nc/`)
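Conceptually, each field is interpolated from the model's native N320 reduced Gaussian grid (542,080 points) to a regular 1.5 degree lat/lon grid. A minimal sketch with `earthkit-regrid` (pinned to v0.4.0 above), assuming the N320 → 1.5 degree interpolation matrix is available; the CLI script additionally handles GCS transfer and NetCDF packing:

```python
# Minimal sketch: interpolate one field from the N320 reduced Gaussian grid
# (542,080 points) to a regular 1.5 x 1.5 degree lat/lon grid.
import numpy as np
from earthkit.regrid import interpolate

n320_field = np.random.rand(542080)  # stand-in for one decoded GRIB field
latlon = interpolate(n320_field, {"grid": "N320"}, {"grid": [1.5, 1.5]})
print(latlon.shape)  # (121, 240) on the regular 1.5 degree global grid
```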
File: `ensemble_quintile_analysis_cli.py`

```bash
# FP32 mode (uses icechunk by default for memory efficiency)
python ensemble_quintile_analysis_cli.py --date 20260129
# FP16 mode
python ensemble_quintile_analysis_cli.py --date 20260129 --fp16
```

- Purpose: Download ensemble NetCDF from GCS, retrieve climatology, calculate quintile probabilities
- Output: `ensemble_quintile_probabilities_YYYYMMDD.nc` (or `_fp16.nc`)
- Requires: `.env` file with `AIWQ_PASSWORD` for climatology retrieval, `coiled-data.json` for GCS access
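At its core, the quintile calculation counts, per grid point, the fraction of ensemble members falling in each climatological quintile bin. A hedged numpy sketch of that idea (the real CLI works on xarray datasets, with icechunk for memory efficiency):

```python
# Sketch: per grid point, the probability of each quintile is the fraction
# of ensemble members whose forecast falls in that climatological bin.
import numpy as np

n_members, n_points = 50, 1000                  # 50 members, illustrative grid
forecast = np.random.rand(n_members, n_points)  # stand-in for weekly-mean forecasts
quintile_edges = np.sort(np.random.rand(4, n_points), axis=0)  # climatological 20/40/60/80th percentiles

# Digitize each member against the 4 edges -> bin index 0..4 per member/point
bins = np.array([np.searchsorted(quintile_edges[:, p], forecast[:, p])
                 for p in range(n_points)]).T   # (members, points)

probs = np.stack([(bins == q).mean(axis=0) for q in range(5)])  # (5, points)
assert np.allclose(probs.sum(axis=0), 1.0)      # probabilities sum to 1 per point
```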
File: `forecast_submission_cli.py`

```bash
# FP32 submission
python forecast_submission_cli.py --date 20260129
# FP16 submission
python forecast_submission_cli.py --date 20260129 --fp16
# Dry run (validate without submitting)
python forecast_submission_cli.py --date 20260129 --dry-run
```

- Purpose: Submit quintile probabilities to the AI Weather Quest competition
- Requires: `.env` file with `AIWQ_TEAM_NAME`, `AIWQ_MODEL_NAME`, and `AIWQ_PASSWORD`
- Submits: 3 variables (mslp, pr, tas) × 2 weeks = 6 forecasts per run
```
ECMWF Open Data → Pickle Files → GCS (YYYYMMDD_0000/input/)
        ↓
GPU Inference (FP32/FP16)
        ↓
GCS (YYYYMMDD_0000/forecasts/ or fp16_forecasts/)
        ↓
Regrid N320 → 1.5deg NetCDF
        ↓
GCS (YYYYMMDD_0000/1p5deg_nc/ or fp16_1p5deg_nc/)
        ↓
Quintile Analysis → Submission
```
- GCS Bucket: `aifs-aiquest-us-20251127`
- Path Structure:
  - Input pickle files: `YYYYMMDD_0000/input/`
  - FP32 GRIB forecasts: `YYYYMMDD_0000/forecasts/`
  - FP16 GRIB forecasts: `YYYYMMDD_0000/fp16_forecasts/`
  - FP32 NetCDF outputs: `YYYYMMDD_0000/1p5deg_nc/`
  - FP16 NetCDF outputs: `YYYYMMDD_0000/fp16_1p5deg_nc/`
- Service Account: `coiled-data.json` for GCS access
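A small illustrative helper encoding this layout (hypothetical; the pipeline scripts build these paths internally):

```python
# Hypothetical helper encoding the bucket layout above.
BUCKET = "aifs-aiquest-us-20251127"

def gcs_prefix(date: str, stage: str, fp16: bool = False) -> str:
    """Build the GCS prefix for an init date (YYYYMMDD) and pipeline stage."""
    stages = {"input": "input", "grib": "forecasts", "nc": "1p5deg_nc"}
    sub = stages[stage]
    if fp16 and stage != "input":  # input pickles are shared by both precisions
        sub = f"fp16_{sub}"
    return f"gs://{BUCKET}/{date}_0000/{sub}/"

assert gcs_prefix("20260129", "grib", fp16=True) == \
    "gs://aifs-aiquest-us-20251127/20260129_0000/fp16_forecasts/"
```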
| GPU | VRAM | Pipeline Script | Precision | Chunks |
|---|---|---|---|---|
| A100 (80GB) | 80GB | `automate_aifs_gpu_pipeline.py` | FP32 | Default |
| A100 (40GB) | 40GB | `automate_aifs_gpu_pipeline.py` | FP32 | 8 |
| G2 (L4 24GB) | 24GB | `fp16_automate_aifs_gpu_pipeline.py` | FP16 | 16 |
| A10G | 24GB | `fp16_automate_aifs_gpu_pipeline.py` | FP16 | 16 |
| RTX 4090 | 24GB | `fp16_automate_aifs_gpu_pipeline.py` | FP16 | 16 |
Reference: HuggingFace Discussion #17
- Members: 1-50
- Forecast Length: 792 hours (33 days)
- Meteorological Parameters: pr, mslp, tas
An alternative pipeline using ERA5T data from the CEDA archive instead of ECMWF Open Data. This enables forecasts initialized from dates not covered by ECMWF Open Data (which only retains the most recent ~24h). ERA5T has a ~1 week lag from real time.
For full technical documentation, see `era5tFp16FahamuAIFSv1.md`.
| Aspect | Standard (ECMWF Open Data) | ERA5T (CEDA) |
|---|---|---|
| Input fields | 92 fields | 74 fields (adapted to 92 at inference) |
| Members | 1-50 | 0-9 (10 EDA members) |
| Lead time | 792h (33 days) | 960h (40 days) |
| Data source | ECMWF Open Data (latest only) | CEDA ERA5T archive (~1 week lag) |
| Auth | None | CEDA Bearer token (`ceda_token` in `.env`) |
Step 1: Create pkl files from CEDA (ETL Machine)
```bash
uv run ceda_era5t_pkl_input_aifsens.py
```

Edit `DATE` in the script to set the initialization date. Requires `ceda_token` in `.env`.

Output: `gs://aifs-aiquest-us-20251127/era5t/YYYYMMDD/input_state_member_00*.pkl`
Step 2: GPU Inference (GPU Machine, >=24GB VRAM)
```bash
python era5t_fp16_automate_aifs_gpu_pipeline.py \
    --date YYYYMMDD_0000 \
    --members 0-9 \
    --gcs-input-prefix era5t/YYYYMMDD \
    --gcs-output-subpath era5t_fp16_forecasts \
    --lead-time 960
```

Note: `--date` is the target forecast date folder, while `--gcs-input-prefix` points to the ERA5T init date pkl files. For example, init date 20260227 → target date 20260305.
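A quick check of that init → target date arithmetic (the gap is 6 days in this example):

```python
# Check of the init -> target date example above (6-day gap in this case).
from datetime import datetime, timedelta

init = datetime.strptime("20260227", "%Y%m%d")   # ERA5T initialization date
target = init + timedelta(days=6)                # ~1 week ERA5T lag
print(target.strftime("%Y%m%d"))                 # 20260305
```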
Step 3: GRIB to 1.5deg NetCDF (ETL Machine)
```bash
python era5t_aifs_n320_grib_1p5deg_nc_cli.py \
    --date YYYYMMDD_0000 \
    --members 0-9 \
    --gcs-input-subpath era5t_fp16_forecasts \
    --gcs-output-subpath era5t_fp16_1p5deg_nc \
    --init-date YYYYMMDD
```

`--init-date` must match the ERA5T initialization date used in the GRIB filenames.
Step 4: Quintile Analysis (ETL Machine)
```bash
python era5t_ensemble_quintile_analysis_cli.py --date YYYYMMDD --members 0-9 --fp16
```

Step 5: Submit (ETL Machine)

```bash
python era5t_forecast_submission_cli.py --date YYYYMMDD
```

| Script | Purpose |
|---|---|
| `ceda_era5t_pkl_input_aifsens.py` | CEDA ERA5T → pkl (74 fields, 10 members) |
| `era5t_fp16_automate_aifs_gpu_pipeline.py` | GPU pipeline orchestrator (FP16) |
| `era5t_fp16_multi_run_AIFS_ENS_v1.py` | FP16 inference with field adaptation (74→92) |
| `era5t_aifs_n320_grib_1p5deg_nc_cli.py` | GRIB → 1.5deg NetCDF regridding |
| `era5t_ensemble_quintile_analysis_cli.py` | Quintile probability calculation |
| `era5t_forecast_submission_cli.py` | AI Weather Quest submission |
| Script | Time | Environment | Cost |
|---|---|---|---|
| `ceda_era5t_pkl_input_aifsens.py` | ~14 min | CPU (n2-standard-2) | ~$0.04 |
| `era5t_fp16_automate_aifs_gpu_pipeline.py` | ~2.5 hours | GPU (g2-standard-12) | ~$5-7 |
| `era5t_aifs_n320_grib_1p5deg_nc_cli.py` | ~1 hour | CPU (n2-standard-2) | ~$0.24 |
| `era5t_ensemble_quintile_analysis_cli.py` | ~10 min | CPU (n2-standard-2) | ~$0.04 |
| `era5t_forecast_submission_cli.py` | ~5 min | CPU (n2-standard-2) | ~$0.02 |
- `anemoi-inference`: ECMWF AI model runner
- `earthkit-data`: ECMWF data handling
- `earthkit-regrid`: Data regridding (v0.4.0)
- `google-cloud-storage`: GCS operations
- `icechunk`: Memory-efficient ensemble processing
- `AI_WQ_package`: Forecast submission and evaluation
- `python-dotenv`: Credential management from `.env` file
- GCS service account key (`coiled-data.json`) for cloud storage access
- `.env` file with AI Weather Quest credentials (team name, model name, password)
| Script | Execution Time | Environment | Cost (USD) | Notes |
|---|---|---|---|---|
| `ecmwf_opendata_pkl_input_aifsens.py` | 2-2.5 hours | CPU (n2-standard-2) | ~$0.48-0.60 | Data preprocessing and GCS upload |
| `automate_aifs_gpu_pipeline.py` | 6.5-7 hours | GPU (a2-ultragpu-1g) | ~$35-42 | FP32, 50 members, per-member processing |
| `fp16_automate_aifs_gpu_pipeline.py` | 6.5-7 hours | GPU (g2-standard-12) | ~$15-20 | FP16, 50 members, reduced-cost GPU |
| `aifs_n320_grib_1p5defg_nc_cli.py` | 4-4.5 hours | CPU (n2-standard-2) | ~$0.96-1.08 | GRIB regridding and processing |
| `ensemble_quintile_analysis_cli.py` | 15 minutes | CPU (n2-standard-2) | ~$0.06 | Ensemble analysis |
| `forecast_submission_cli.py` | 5 minutes | CPU (n2-standard-2) | ~$0.02 | Submission validation |
The GPU inference pipeline hangs indefinitely at model checkpoint download:
```
Running forecast for member 0...
Fetching 7 files:   0%|          | 0/7 [00:00<?, ?it/s]
```

This occurs inside `runner.run()` when the anemoi-inference library attempts to download the `ecmwf/aifs-ens-1.0` model weights (~3-4 GB) from HuggingFace Hub. The model metadata loads quickly during `SimpleRunner()` init, but the large checkpoint blob download stalls.
The HuggingFace `huggingface_hub` downloader can hang due to:

- Network throttling or rate limiting on unauthenticated requests from cloud VMs
- Incomplete downloads with stale lock files preventing retry (a previous failed/killed download leaves `.incomplete` and `.lock` files in the cache)
- No `HF_TOKEN` set, causing anonymous downloads which are subject to stricter rate limits
```bash
# Check for incomplete downloads and stale locks
ls -la ~/.cache/huggingface/hub/models--ecmwf--aifs-ens-1.0/blobs/
# Look for files ending in .incomplete
ls -la ~/.cache/huggingface/hub/.locks/models--ecmwf--aifs-ens-1.0/
# Look for .lock files with recent timestamps

# Check HF token
echo $HF_TOKEN
```

```bash
# 1. Kill the stuck process
kill $(pgrep -f era5t_fp16_automate_aifs_gpu_pipeline)

# 2. Remove incomplete downloads and stale locks
rm -f ~/.cache/huggingface/hub/models--ecmwf--aifs-ens-1.0/blobs/*.incomplete
rm -f ~/.cache/huggingface/hub/.locks/models--ecmwf--aifs-ens-1.0/*.lock

# 3. Set HF token to avoid rate limiting
export HF_TOKEN="your_huggingface_token"

# 4. Re-run the pipeline
python era5t_fp16_automate_aifs_gpu_pipeline.py --date YYYYMMDD_0000 --members 0-4 ...
```

The most reliable solution is to bake the HuggingFace model into the Docker image used for GPU inference. This eliminates runtime downloads entirely, avoids network dependency during forecast runs, and ensures reproducible deployments.
Add the model download step to the GPU Docker image build:
```dockerfile
FROM your-base-gpu-image:latest

# Install huggingface_hub for download
RUN pip install huggingface_hub

# Pre-download the AIFS-ENS model checkpoint into the HF cache
# This caches all 7 files (~3-4 GB) at build time
ARG HF_TOKEN
RUN python -c "\
from huggingface_hub import snapshot_download; \
snapshot_download('ecmwf/aifs-ens-1.0', token='${HF_TOKEN}')"

# The model is now cached at /root/.cache/huggingface/hub/models--ecmwf--aifs-ens-1.0/
# anemoi-inference will find it automatically without any network calls
```

Build with:

```bash
docker build --build-arg HF_TOKEN=hf_your_token -t aifs-gpu-cached:latest .
```

If Docker image size is a concern (~3-4 GB added), pre-download the model to a persistent disk or GCS bucket and mount it:
```bash
# Pre-download once to a persistent location
python -c "
from huggingface_hub import snapshot_download
snapshot_download('ecmwf/aifs-ens-1.0', cache_dir='/mnt/model-cache/huggingface')
"

# Mount at runtime
export HF_HOME=/mnt/model-cache/huggingface
python era5t_fp16_automate_aifs_gpu_pipeline.py ...
```

For Coiled-managed GPU notebooks, include the model download in the software environment setup so it is available when the notebook starts:

```bash
# During software environment creation, ensure model is cached
python -c "from huggingface_hub import snapshot_download; snapshot_download('ecmwf/aifs-ens-1.0')"
```

| Approach | Pros | Cons |
|---|---|---|
| Runtime download | No image size increase | Slow startup (~10-30 min), network dependent, can hang |
| Docker pre-cache | Zero startup delay, no network needed, fully reproducible | Larger image (~3-4 GB), requires rebuild for model updates |
| Volume mount | Flexible, shared across instances | Requires persistent disk setup, mount configuration |
For operational forecast pipelines where reliability and speed matter, Docker pre-caching is strongly recommended. It converts a flaky runtime network dependency into a deterministic build-time step.
| S.No | Date | Ensemble Members | Status | Notes |
|---|---|---|---|---|
| 1 | 2025-08-21 | 50 | Completed | Full ensemble run |
| 2 | 2025-08-28 | 50 | Completed | Full ensemble run |
| 3 | 2025-09-04 | 50 | Completed | Full ensemble run |
| 4 | 2025-09-11 | 48 | Completed | Reduced members due to GPU memory issue |
| 5 | 2025-09-18 | 20 | Completed | Reduced members; open data download exceeded the available time window |
This work was funded in part by:
- The "Hazard modeling, impact estimation, climate storylines for event catalogue on drought and flood disasters in Eastern Africa" (E4DRR) project (https://icpac-igad.github.io/e4drr/), funded by the United Nations Complex Risk Analytics Fund (CRAF'd), under activity 2.3.3 "Experiment generative AI for EPS (Ensemble Prediction Systems)": exploring the application of generative AI (cGAN) in bias correction and downscaling of EPS data in an operational setup.
- The Strengthening Early Warning Systems for Anticipatory Action (SEWAA) Project (https://cgan.icpac.net/).