Processing workflows for producing cloud-native geospatial datasets on the NRP Nautilus Kubernetes cluster.
Browse the full catalog in STAC Browser:
radiantearth.github.io/stac-browser → Boettiger Lab Datasets
Datasets are hosted on NRP Nautilus S3 storage (s3-west.nrp-nautilus.io).
This repo contains no code — just configuration (k8s YAML), documentation (STAC metadata), and instructions. All processing is done by the cng-datasets CLI tool running inside Kubernetes pods.
- You run `cng-datasets workflow` on your laptop; it generates Kubernetes Job YAML files
- You `kubectl apply` those files; the cluster does all the processing
- Outputs land on S3: GeoParquet, PMTiles, and H3-indexed hex parquet
You never process data locally. Your laptop just generates YAML and talks to kubectl.
```bash
# Install the CLI (one-time)
pip install cng-datasets

# Generate a processing pipeline for a dataset
cng-datasets workflow \
  --dataset my-dataset \
  --source-url https://example.com/data.gdb \
  --bucket public-mydata \
  --layer MyLayer \
  --h3-resolution 10 \
  --parent-resolutions "9,8,0" \
  --hex-memory 32Gi \
  --max-completions 200 \
  --max-parallelism 50 \
  --output-dir catalog/mydata/k8s/mylayer

# One-time RBAC setup (only needed once per cluster/namespace, likely already done)
kubectl apply -f catalog/mydata/k8s/mylayer/workflow-rbac.yaml

# Apply workflow (per dataset)
kubectl apply -f catalog/mydata/k8s/mylayer/configmap.yaml \
  -f catalog/mydata/k8s/mylayer/workflow.yaml

# Monitor
kubectl get jobs | grep my-dataset
kubectl logs job/my-dataset-workflow
```

That's it. The workflow orchestrates: bucket setup → convert to GeoParquet → PMTiles + H3 hex (parallel) → repartition.
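The applied `workflow.yaml` is ordinary Kubernetes `batch/v1` Job YAML. As a rough, hypothetical sketch of how the example flags above might map onto Job fields (the image, name, and exact field mapping are placeholders, not what `cng-datasets` actually emits):

```yaml
# Hypothetical sketch only: the real manifest generated by `cng-datasets workflow`
# will differ. Shown to illustrate how the CLI flags could map to Job fields.
apiVersion: batch/v1
kind: Job
metadata:
  name: my-dataset-workflow
spec:
  completions: 200          # --max-completions
  parallelism: 50           # --max-parallelism
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: example.org/cng-datasets:latest   # placeholder image
          resources:
            requests:
              memory: 32Gi                         # --hex-memory
```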
- AGENTS.md — Complete step-by-step guide for processing datasets (for humans and LLM agents)
- DATASET_DOCUMENTATION_WORKFLOW.md — How to create README and STAC metadata after processing
- todo.md — Tracking status of all datasets
```
catalog/
  <dataset>/
    k8s/       # Generated Kubernetes job YAML
    stac/      # README.md and stac-collection.json for the dataset
    *.ipynb    # Any exploratory notebooks (optional)
```
Each dataset gets a directory under catalog/. The k8s YAML is generated by cng-datasets workflow and applied with kubectl. STAC metadata is created after processing completes.
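Using the file names from the quickstart above, a populated dataset directory might look like this (a sketch; actual contents vary per dataset):

```
catalog/mydata/
  k8s/
    mylayer/
      workflow-rbac.yaml
      configmap.yaml
      workflow.yaml
  stac/
    README.md
    stac-collection.json
```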
See the cng-datasets README for full CLI documentation.
Key commands:
| Command | What it does | Where it runs |
|---|---|---|
| `cng-datasets workflow` | Generates k8s job YAML | Your laptop |
| `kubectl apply -f ...` | Submits jobs to the cluster | Your laptop |
| `kubectl get jobs` | Monitors job status | Your laptop |
| Everything else | Processing, S3 uploads, etc. | Kubernetes pods |
- Cluster: NRP Nautilus, namespace `biodiversity`
- S3: Ceph object storage (S3-compatible, not AWS)
- Public endpoint: `https://s3-west.nrp-nautilus.io/<bucket>/<path>`
- Secrets: `aws` and `rclone-config` are pre-configured in the namespace
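Because the endpoint is S3-compatible, objects in public buckets can be fetched over plain HTTPS with no AWS credentials. A minimal sketch of building the public URL (the bucket and key are hypothetical placeholders):

```shell
# Build the public HTTPS URL for an object on NRP Nautilus S3.
# BUCKET and KEY are illustrative placeholders, not real objects.
ENDPOINT="https://s3-west.nrp-nautilus.io"
BUCKET="public-mydata"
KEY="mylayer/hex/part-000.parquet"
URL="${ENDPOINT}/${BUCKET}/${KEY}"
echo "${URL}"
# Anonymous download, e.g.:
#   curl -fsSL -o part-000.parquet "${URL}"
```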
See .github/copilot-instructions.md for detailed infrastructure context.