This document describes the complete sequential workflow for AI forecast submission using the AIFS ensemble system, from initial condition preparation through to final forecast submission.
The process consists of three main steps, split across two computing environments (a CPU/ETL machine and GPU machines), to produce ensemble weather forecasts and submit them for evaluation.
- Initial Condition Preparation (ETL Machine) → `ecmwf_opendata_pkl_input_aifsens.py`
- GPU Inference (GPU Machines)
  - FP32 (A100 GPU) → `automate_aifs_gpu_pipeline.py`
  - FP16 (G2 GPU) → `fp16_automate_aifs_gpu_pipeline.py`
- Post-Processing & Submission (ETL Machine)
  - Regrid → `aifs_n320_grib_1p5defg_nc_cli.py`
  - Quintile Analysis → `ensemble_quintile_analysis_cli.py`
  - Forecast Submission → `forecast_submission_cli.py`
The ETL (non-GPU) machine is used for Step 1 and Step 3. Start the Coiled notebook:
```bash
coiled notebook start --name p2-aifs-etl-20260129 --vm-type n2-standard-2 --software aifs-etl-v2 --workspace=gcp-sewaa-nka --region us-east5
```

Install the required environment using micromamba:

```bash
micromamba create -n aifs-etl -c conda-forge python=3.12.7 \
  && eval "$(micromamba shell hook --shell bash)" \
  && micromamba activate aifs-etl \
  && micromamba install -c conda-forge earthkit-data ecmwf-opendata \
  && pip install gcsfs s3fs earthkit-regrid==0.4.0 google-cloud-storage icechunk AI_WQ_package \
  && sudo apt update && sudo apt install nano
```

Copy the `.env.example` file to `.env` and fill in your credentials:

```bash
cp .env.example .env
nano .env
```

Example `.env` contents:

```
AIWQ_TEAM_NAME=Fahamu
AIWQ_MODEL_NAME=FahamuAIFSv1
AIWQ_MODEL_NAME_FP16=FahamuAIFSv1_fp16
AIWQ_PASSWORD=your_password_here
```
A GCS service account key file (`coiled-data.json`) is also required for cloud storage access.
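The scripts below are expected to pick these credentials up at startup. As a minimal sketch, assuming `python-dotenv` and `google-cloud-storage` (both installed above); the individual scripts may wire this up differently:

```python
# Minimal sketch: load AI Weather Quest credentials from .env and open a
# GCS client with the service account key. Scripts may wire this differently.
import os

from dotenv import load_dotenv
from google.cloud import storage

load_dotenv()  # reads AIWQ_TEAM_NAME, AIWQ_MODEL_NAME, AIWQ_PASSWORD from .env
team = os.environ["AIWQ_TEAM_NAME"]

client = storage.Client.from_service_account_json("coiled-data.json")
bucket = client.bucket("aifs-aiquest-us-20251127")  # bucket used throughout this workflow
print(f"Team {team}: using bucket gs://{bucket.name}")
```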
File: `ecmwf_opendata_pkl_input_aifsens.py`

```bash
python ecmwf_opendata_pkl_input_aifsens.py
```

- Purpose: Download and preprocess ECMWF open data for ensemble members 1-50
- Environment: ETL machine (CPU-only, `n2-standard-2`)
- Input: ECMWF open data (surface, soil, pressure level parameters)
- Output: Pickle files uploaded to GCS bucket (`gs://aifs-aiquest-us-20251127/YYYYMMDD_0000/input/`)
- Requires: `coiled-data.json` (GCS service account key)
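For orientation, here is a hedged sketch of the kind of retrieval this step performs with the `ecmwf-opendata` and `earthkit-data` clients. The parameter list and pickle layout are illustrative only; the real script covers the full surface, soil, and pressure-level parameter set for members 1-50:

```python
# Illustrative only: fetch a small subset of one perturbed member's fields
# from ECMWF Open Data and pickle them.
import pickle

import earthkit.data
from ecmwf.opendata import Client

client = Client(source="ecmwf")
client.retrieve(
    stream="enfo",        # ensemble forecast stream
    type="pf",            # perturbed forecast member
    number=1,             # member 1; the script loops over 1-50
    step=0,
    param=["msl", "2t"],  # illustrative subset of the input parameters
    target="member_001.grib2",
)

fields = earthkit.data.from_source("file", "member_001.grib2")
state = {f.metadata("param"): f.to_numpy() for f in fields}
with open("input_state_member_001.pkl", "wb") as fh:
    pickle.dump(state, fh)
```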
Start the GPU notebook:

```bash
coiled notebook start --name p1-gpu-aifs-20260129 --vm-type a2-ultragpu-1g --software east5-us-flashattn-dockerv1 --workspace=gcp-sewaa-nka --region us-east5 --disk-size 60
```

File: `automate_aifs_gpu_pipeline.py`

```bash
python automate_aifs_gpu_pipeline.py --date 20260129_0000 --members 1-50
```

- Purpose: Run AIFS-ENS model at full FP32 precision for all ensemble members
- Environment: A100 GPU (`a2-ultragpu-1g`, ~80GB VRAM)
- Processing: One member at a time (download → inference → upload → cleanup) to minimise storage usage (sketched below)
- Output: GRIB files uploaded to `gs://aifs-aiquest-us-20251127/YYYYMMDD_0000/forecasts/`
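A simplified sketch of that per-member loop; the CLI arguments passed to the helper scripts are hypothetical, and the real orchestration in `automate_aifs_gpu_pipeline.py` may differ:

```python
# Sketch of the per-member loop: keep only one member's data on disk at a
# time. The helper script arguments shown here are hypothetical.
import subprocess
from pathlib import Path

DATE = "20260129_0000"

def run_member(member: int) -> None:
    subprocess.run(  # 1. download this member's input pickle from GCS
        ["python", "download_pkl_from_gcs.py", DATE, str(member)], check=True)
    subprocess.run(  # 2. run FP32 AIFS-ENS inference for this member
        ["python", "fp32_multi_run_AIFS_ENS_v1.py", DATE, str(member)], check=True)
    subprocess.run(  # 3. upload GRIB output to .../YYYYMMDD_0000/forecasts/
        ["python", "upload_aifs_gpu_output_grib_gcs.py", DATE, str(member)], check=True)
    # 4. cleanup: free local disk before the next member
    for path in Path(".").glob(f"*member_{member:03d}*"):
        path.unlink()

for member in range(1, 51):  # members 1-50
    run_member(member)
```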
Required files on the GPU machine:

| File | Purpose |
|---|---|
| `automate_aifs_gpu_pipeline.py` | Main pipeline orchestrator |
| `fp32_multi_run_AIFS_ENS_v1.py` | AIFS model runner (FP32) |
| `download_pkl_from_gcs.py` | GCS download utility |
| `upload_aifs_gpu_output_grib_gcs.py` | GCS upload utility |
| `coiled-data.json` | GCS service account key |
SHUTDOWN the GPU notebook after completion to avoid unnecessary costs.
Start the GPU notebook:

```bash
coiled notebook start --name p2-fp16-20260129 --vm-type g2-standard-12 --software flashattn-dockerv1 --workspace=gcp-sewaa-nka --region us-east4 --disk-size 400
```

File: `fp16_automate_aifs_gpu_pipeline.py`

```bash
python fp16_automate_aifs_gpu_pipeline.py --date 20260129_0000 --members 1-50
```

- Purpose: Run AIFS-ENS model at FP16 (half precision), reducing VRAM from ~50GB to <24GB
- Environment: G2 GPU (`g2-standard-12`)
- Processing: Same per-member pipeline as the FP32 version
- Output: GRIB files uploaded to `gs://aifs-aiquest-us-20251127/YYYYMMDD_0000/fp16_forecasts/`
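The VRAM reduction comes from storing weights and activations as 16-bit floats. A generic PyTorch illustration of the halved parameter memory (not the actual anemoi-inference code path):

```python
# Generic illustration of the FP16 saving (not the actual anemoi-inference
# code path): casting a model to float16 halves its parameter memory.
import torch

model = torch.nn.Linear(4096, 4096)
fp32_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6

model = model.half()  # float32 (4 bytes/element) -> float16 (2 bytes/element)
fp16_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6

print(f"FP32: {fp32_mb:.0f} MB, FP16: {fp16_mb:.0f} MB")  # ~2x reduction
```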
Required files on the GPU machine:

| File | Purpose |
|---|---|
| `fp16_automate_aifs_gpu_pipeline.py` | Main pipeline orchestrator (FP16) |
| `fp16_multi_run_AIFS_ENS_v1.py` | AIFS model runner (FP16) |
| `download_pkl_from_gcs.py` | GCS download utility |
| `upload_aifs_gpu_output_grib_gcs.py` | GCS upload utility |
| `coiled-data.json` | GCS service account key |
SHUTDOWN the GPU notebook after completion.
Use the same ETL machine from Step 1:
```bash
coiled notebook start --name p2-aifs-etl-20260129 --vm-type n2-standard-2 --software aifs-etl-v2 --workspace=gcp-sewaa-nka --region us-east5
```

File: `aifs_n320_grib_1p5defg_nc_cli.py`

```bash
python aifs_n320_grib_1p5defg_nc_cli.py --date 20260129
# For FP16:
python aifs_n320_grib_1p5defg_nc_cli.py --date 20260129 --fp16
```

- Purpose: Download GRIB files from GCS and regrid from N320 to 1.5 degree NetCDF
- Output: NetCDF files in `gs://aifs-aiquest-us-20251127/YYYYMMDD_0000/1p5deg_nc/` (or `fp16_1p5deg_nc/`)
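Conceptually, each field is interpolated from the model's native N320 reduced Gaussian grid (542,080 points) to a regular 1.5 degree lat/lon grid. A minimal sketch with `earthkit-regrid` (pinned to v0.4.0 above), assuming the N320 → 1.5 degree interpolation matrix is available; the CLI script additionally handles GCS transfer and NetCDF packing:

```python
# Minimal sketch: interpolate one field from the N320 reduced Gaussian grid
# (542,080 points) to a regular 1.5 x 1.5 degree lat/lon grid.
import numpy as np
from earthkit.regrid import interpolate

n320_field = np.random.rand(542080)  # stand-in for one decoded GRIB field
latlon = interpolate(n320_field, {"grid": "N320"}, {"grid": [1.5, 1.5]})
print(latlon.shape)  # (121, 240) on the regular 1.5 degree global grid
```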
File: `ensemble_quintile_analysis_cli.py`

```bash
# FP32 mode (uses icechunk by default for memory efficiency)
python ensemble_quintile_analysis_cli.py --date 20260129
# FP16 mode
python ensemble_quintile_analysis_cli.py --date 20260129 --fp16
```

- Purpose: Download ensemble NetCDF from GCS, retrieve climatology, calculate quintile probabilities
- Output: `ensemble_quintile_probabilities_YYYYMMDD.nc` (or `_fp16.nc`)
- Requires: `.env` file with `AIWQ_PASSWORD` for climatology retrieval, `coiled-data.json` for GCS access
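At its core, the quintile calculation counts, per grid point, the fraction of ensemble members falling in each climatological quintile bin. A hedged numpy sketch of that idea (the real CLI works on xarray datasets, with icechunk for memory efficiency):

```python
# Sketch: per grid point, the probability of each quintile is the fraction
# of ensemble members whose forecast falls in that climatological bin.
import numpy as np

n_members, n_points = 50, 1000                  # 50 members, illustrative grid
forecast = np.random.rand(n_members, n_points)  # stand-in for weekly-mean forecasts
quintile_edges = np.sort(np.random.rand(4, n_points), axis=0)  # climatological 20/40/60/80th percentiles

# Digitize each member against the 4 edges -> bin index 0..4 per member/point
bins = np.array([np.searchsorted(quintile_edges[:, p], forecast[:, p])
                 for p in range(n_points)]).T   # (members, points)

probs = np.stack([(bins == q).mean(axis=0) for q in range(5)])  # (5, points)
assert np.allclose(probs.sum(axis=0), 1.0)      # probabilities sum to 1 per point
```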
File: `forecast_submission_cli.py`

```bash
# FP32 submission
python forecast_submission_cli.py --date 20260129
# FP16 submission
python forecast_submission_cli.py --date 20260129 --fp16
# Dry run (validate without submitting)
python forecast_submission_cli.py --date 20260129 --dry-run
```

- Purpose: Submit quintile probabilities to the AI Weather Quest competition
- Requires: `.env` file with `AIWQ_TEAM_NAME`, `AIWQ_MODEL_NAME`, and `AIWQ_PASSWORD`
- Submits: 3 variables (mslp, pr, tas) × 2 weeks = 6 forecasts per run
```
ECMWF Open Data → Pickle Files → GCS (YYYYMMDD_0000/input/)
        ↓
GPU Inference (FP32/FP16)
        ↓
GCS (YYYYMMDD_0000/forecasts/ or fp16_forecasts/)
        ↓
Regrid N320 → 1.5deg NetCDF
        ↓
GCS (YYYYMMDD_0000/1p5deg_nc/ or fp16_1p5deg_nc/)
        ↓
Quintile Analysis → Submission
```
- GCS Bucket: `aifs-aiquest-us-20251127`
- Path Structure:
  - Input pickle files: `YYYYMMDD_0000/input/`
  - FP32 GRIB forecasts: `YYYYMMDD_0000/forecasts/`
  - FP16 GRIB forecasts: `YYYYMMDD_0000/fp16_forecasts/`
  - FP32 NetCDF outputs: `YYYYMMDD_0000/1p5deg_nc/`
  - FP16 NetCDF outputs: `YYYYMMDD_0000/fp16_1p5deg_nc/`
- Service Account: `coiled-data.json` for GCS access
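A small illustrative helper encoding this layout (hypothetical; the pipeline scripts build these paths internally):

```python
# Hypothetical helper encoding the bucket layout above.
BUCKET = "aifs-aiquest-us-20251127"

def gcs_prefix(date: str, stage: str, fp16: bool = False) -> str:
    """Build the GCS prefix for an init date (YYYYMMDD) and pipeline stage."""
    stages = {"input": "input", "grib": "forecasts", "nc": "1p5deg_nc"}
    sub = stages[stage]
    if fp16 and stage != "input":  # input pickles are shared by both precisions
        sub = f"fp16_{sub}"
    return f"gs://{BUCKET}/{date}_0000/{sub}/"

assert gcs_prefix("20260129", "grib", fp16=True) == \
    "gs://aifs-aiquest-us-20251127/20260129_0000/fp16_forecasts/"
```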
| GPU | VRAM | Pipeline Script | Precision | Chunks |
|---|---|---|---|---|
| A100 (80GB) | 80GB | `automate_aifs_gpu_pipeline.py` | FP32 | Default |
| A100 (40GB) | 40GB | `automate_aifs_gpu_pipeline.py` | FP32 | 8 |
| G2 (L4 24GB) | 24GB | `fp16_automate_aifs_gpu_pipeline.py` | FP16 | 16 |
| A10G | 24GB | `fp16_automate_aifs_gpu_pipeline.py` | FP16 | 16 |
| RTX 4090 | 24GB | `fp16_automate_aifs_gpu_pipeline.py` | FP16 | 16 |
Reference: HuggingFace Discussion #17
- Members: 1-50
- Forecast Length: 792 hours (33 days)
- Meteorological Parameters: pr, mslp, tas
An alternative pipeline using ERA5T data from the CEDA archive instead of ECMWF Open Data. This enables forecasts initialized from dates not covered by ECMWF Open Data (which only retains the most recent ~24h). ERA5T has a ~1 week lag from real time.
For full technical documentation, see `era5tFp16FahamuAIFSv1.md`.
| Aspect | Standard (ECMWF Open Data) | ERA5T (CEDA) |
|---|---|---|
| Input fields | 92 fields | 74 fields (adapted to 92 at inference) |
| Members | 1-50 | 0-9 (10 EDA members) |
| Lead time | 792h (33 days) | 960h (40 days) |
| Data source | ECMWF Open Data (latest only) | CEDA ERA5T archive (~1 week lag) |
| Auth | None | CEDA Bearer token (`ceda_token` in `.env`) |
Step 1: Create pkl files from CEDA (ETL Machine)
```bash
uv run ceda_era5t_pkl_input_aifsens.py
```

Edit `DATE` in the script to set the initialization date. Requires `ceda_token` in `.env`.

Output: `gs://aifs-aiquest-us-20251127/era5t/YYYYMMDD/input_state_member_00*.pkl`
Step 2: GPU Inference (GPU Machine, >=24GB VRAM)
```bash
python era5t_fp16_automate_aifs_gpu_pipeline.py \
    --date YYYYMMDD_0000 \
    --members 0-9 \
    --gcs-input-prefix era5t/YYYYMMDD \
    --gcs-output-subpath era5t_fp16_forecasts \
    --lead-time 960
```

Note: `--date` is the target forecast date folder, while `--gcs-input-prefix` points to the ERA5T init date pkl files. For example, init date 20260227 → target date 20260305.
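A quick check of that init → target date arithmetic (the gap is 6 days in this example):

```python
# Check of the init -> target date example above (6-day gap in this case).
from datetime import datetime, timedelta

init = datetime.strptime("20260227", "%Y%m%d")   # ERA5T initialization date
target = init + timedelta(days=6)                # ~1 week ERA5T lag
print(target.strftime("%Y%m%d"))                 # 20260305
```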
Step 3: GRIB to 1.5deg NetCDF (ETL Machine)
```bash
python era5t_aifs_n320_grib_1p5deg_nc_cli.py \
    --date YYYYMMDD_0000 \
    --members 0-9 \
    --gcs-input-subpath era5t_fp16_forecasts \
    --gcs-output-subpath era5t_fp16_1p5deg_nc \
    --init-date YYYYMMDD
```

`--init-date` must match the ERA5T initialization date used in the GRIB filenames.
Step 4: Quintile Analysis (ETL Machine)
```bash
python era5t_ensemble_quintile_analysis_cli.py --date YYYYMMDD --members 0-9 --fp16
```

Step 5: Submit (ETL Machine)

```bash
python era5t_forecast_submission_cli.py --date YYYYMMDD
```

| Script | Purpose |
|---|---|
| `ceda_era5t_pkl_input_aifsens.py` | CEDA ERA5T → pkl (74 fields, 10 members) |
| `era5t_fp16_automate_aifs_gpu_pipeline.py` | GPU pipeline orchestrator (FP16) |
| `era5t_fp16_multi_run_AIFS_ENS_v1.py` | FP16 inference with field adaptation (74→92) |
| `era5t_aifs_n320_grib_1p5deg_nc_cli.py` | GRIB → 1.5deg NetCDF regridding |
| `era5t_ensemble_quintile_analysis_cli.py` | Quintile probability calculation |
| `era5t_forecast_submission_cli.py` | AI Weather Quest submission |
| Script | Time | Environment | Cost |
|---|---|---|---|
| `ceda_era5t_pkl_input_aifsens.py` | ~14 min | CPU (n2-standard-2) | ~$0.04 |
| `era5t_fp16_automate_aifs_gpu_pipeline.py` | ~2.5 hours | GPU (g2-standard-12) | ~$5-7 |
| `era5t_aifs_n320_grib_1p5deg_nc_cli.py` | ~1 hour | CPU (n2-standard-2) | ~$0.24 |
| `era5t_ensemble_quintile_analysis_cli.py` | ~10 min | CPU (n2-standard-2) | ~$0.04 |
| `era5t_forecast_submission_cli.py` | ~5 min | CPU (n2-standard-2) | ~$0.02 |
- `anemoi-inference`: ECMWF AI model runner
- `earthkit-data`: ECMWF data handling
- `earthkit-regrid`: Data regridding (v0.4.0)
- `google-cloud-storage`: GCS operations
- `icechunk`: Memory-efficient ensemble processing
- `AI_WQ_package`: Forecast submission and evaluation
- `python-dotenv`: Credential management from `.env` file
- GCS service account key (`coiled-data.json`) for cloud storage access
- `.env` file with AI Weather Quest credentials (team name, model name, password)
| Script | Execution Time | Environment | Cost (USD) | Notes |
|---|---|---|---|---|
| `ecmwf_opendata_pkl_input_aifsens.py` | 2-2.5 hours | CPU (n2-standard-2) | ~$0.48-0.60 | Data preprocessing and GCS upload |
| `automate_aifs_gpu_pipeline.py` | 6.5-7 hours | GPU (a2-ultragpu-1g) | ~$35-42 | FP32, 50 members, per-member processing |
| `fp16_automate_aifs_gpu_pipeline.py` | 6.5-7 hours | GPU (g2-standard-12) | ~$15-20 | FP16, 50 members, reduced-cost GPU |
| `aifs_n320_grib_1p5defg_nc_cli.py` | 4-4.5 hours | CPU (n2-standard-2) | ~$0.96-1.08 | GRIB regridding and processing |
| `ensemble_quintile_analysis_cli.py` | 15 minutes | CPU (n2-standard-2) | ~$0.06 | Ensemble analysis |
| `forecast_submission_cli.py` | 5 minutes | CPU (n2-standard-2) | ~$0.02 | Submission validation |
The GPU inference pipeline hangs indefinitely at model checkpoint download:
```
Running forecast for member 0...
Fetching 7 files:   0%|          | 0/7 [00:00<?, ?it/s]
```

This occurs inside `runner.run()` when the anemoi-inference library attempts to download the `ecmwf/aifs-ens-1.0` model weights (~3-4 GB) from HuggingFace Hub. The model metadata loads quickly during `SimpleRunner()` init, but the large checkpoint blob download stalls.
The HuggingFace `huggingface_hub` downloader can hang due to:

- Network throttling or rate limiting on unauthenticated requests from cloud VMs
- Incomplete downloads with stale lock files preventing retry (a previous failed/killed download leaves `.incomplete` and `.lock` files in the cache)
- No `HF_TOKEN` set, causing anonymous downloads which are subject to stricter rate limits
```bash
# Check for incomplete downloads and stale locks
ls -la ~/.cache/huggingface/hub/models--ecmwf--aifs-ens-1.0/blobs/
# Look for files ending in .incomplete
ls -la ~/.cache/huggingface/hub/.locks/models--ecmwf--aifs-ens-1.0/
# Look for .lock files with recent timestamps

# Check HF token
echo $HF_TOKEN
```

```bash
# 1. Kill the stuck process
kill $(pgrep -f era5t_fp16_automate_aifs_gpu_pipeline)

# 2. Remove incomplete downloads and stale locks
rm -f ~/.cache/huggingface/hub/models--ecmwf--aifs-ens-1.0/blobs/*.incomplete
rm -f ~/.cache/huggingface/hub/.locks/models--ecmwf--aifs-ens-1.0/*.lock

# 3. Set HF token to avoid rate limiting
export HF_TOKEN="your_huggingface_token"

# 4. Re-run the pipeline
python era5t_fp16_automate_aifs_gpu_pipeline.py --date YYYYMMDD_0000 --members 0-4 ...
```

The most reliable solution is to bake the HuggingFace model into the Docker image used for GPU inference. This eliminates runtime downloads entirely, avoids network dependency during forecast runs, and ensures reproducible deployments.
Add the model download step to the GPU Docker image build:
```dockerfile
FROM your-base-gpu-image:latest

# Install huggingface_hub for download
RUN pip install huggingface_hub

# Pre-download the AIFS-ENS model checkpoint into the HF cache
# This caches all 7 files (~3-4 GB) at build time
ARG HF_TOKEN
RUN python -c "\
from huggingface_hub import snapshot_download; \
snapshot_download('ecmwf/aifs-ens-1.0', token='${HF_TOKEN}')"

# The model is now cached at /root/.cache/huggingface/hub/models--ecmwf--aifs-ens-1.0/
# anemoi-inference will find it automatically without any network calls
```

Build with:

```bash
docker build --build-arg HF_TOKEN=hf_your_token -t aifs-gpu-cached:latest .
```

If Docker image size is a concern (~3-4 GB added), pre-download the model to a persistent disk or GCS bucket and mount it:
```bash
# Pre-download once to a persistent location
python -c "
from huggingface_hub import snapshot_download
snapshot_download('ecmwf/aifs-ens-1.0', cache_dir='/mnt/model-cache/huggingface')
"

# Mount at runtime
export HF_HOME=/mnt/model-cache/huggingface
python era5t_fp16_automate_aifs_gpu_pipeline.py ...
```

For Coiled-managed GPU notebooks, include the model download in the software environment setup so it is available when the notebook starts:

```bash
# During software environment creation, ensure model is cached
python -c "from huggingface_hub import snapshot_download; snapshot_download('ecmwf/aifs-ens-1.0')"
```

| Approach | Pros | Cons |
|---|---|---|
| Runtime download | No image size increase | Slow startup (~10-30 min), network dependent, can hang |
| Docker pre-cache | Zero startup delay, no network needed, fully reproducible | Larger image (~3-4 GB), requires rebuild for model updates |
| Volume mount | Flexible, shared across instances | Requires persistent disk setup, mount configuration |
For operational forecast pipelines where reliability and speed matter, Docker pre-caching is strongly recommended. It converts a flaky runtime network dependency into a deterministic build-time step.
| S.No | Date | Ensemble Members | Status | Notes |
|---|---|---|---|---|
| 1 | 2025-08-21 | 50 | Completed | Full ensemble run |
| 2 | 2025-08-28 | 50 | Completed | Full ensemble run |
| 3 | 2025-09-04 | 50 | Completed | Full ensemble run |
| 4 | 2025-09-11 | 48 | Completed | Reduced members due to GPU memory issue |
| 5 | 2025-09-18 | 20 | Completed | Reduced members; open data download exceeded the available time window |
This work was funded in part by:
- The "Hazard modeling, impact estimation, climate storylines for event catalogue on drought and flood disasters in Eastern Africa" (E4DRR) project (https://icpac-igad.github.io/e4drr/), funded by the United Nations Complex Risk Analytics Fund (CRAF'd), under activity 2.3.3 "Experiment generative AI for EPS (Ensemble Prediction Systems)": exploring the application of generative AI (cGAN) in bias correction and downscaling of EPS data in an operational setup.
- The Strengthening Early Warning Systems for Anticipatory Action (SEWAA) Project (https://cgan.icpac.net/).