cng-datasets

A CLI toolkit for processing large geospatial datasets into cloud-native formats on Kubernetes.

What it does: Takes source geospatial data (Shapefiles, GeoPackages, FileGDB, GeoTIFFs) and produces:

GeoParquet — columnar format for analytical queries (DuckDB, Polars)
PMTiles — vector tiles for web map visualization
H3 Hex Parquet — hexagonal grid indexed at configurable H3 resolution, hive-partitioned for fast spatial joins
Cloud-Optimized GeoTIFF — for raster data, optimized for HTTP range requests

How it works: You run a single CLI command locally. It generates Kubernetes Job YAML files that, once applied, run the entire pipeline on your cluster. No data is processed on your local machine.

Installation

pip install cng-datasets

Or from source:

pip install -e "."

Usage

Generate a vector processing pipeline

cng-datasets workflow \
  --dataset cpad-2024 \
  --source-url https://example.com/cpad.gdb \
  --bucket public-cpad \
  --layer CPAD_SuperUnits \
  --h3-resolution 10 \
  --parent-resolutions "9,8,0" \
  --hex-memory 32Gi \
  --max-completions 200 \
  --max-parallelism 50

This generates YAML files for a 5-step pipeline:

1. setup-bucket: creates the S3 bucket with public-read policy
2. convert: reads source data, reprojects to EPSG:4326, writes GeoParquet
3. pmtiles: converts GeoParquet to PMTiles (runs in parallel with step 4)
4. hex: computes H3 cell assignments in parallel pods
5. repartition: consolidates hex chunks into hive-partitioned layout

Multiple sources: Add --source-url multiple times to merge datasets:

cng-datasets workflow \
  --dataset merged-regions \
  --source-url https://example.com/region1.shp \
  --source-url https://example.com/region2.shp \
  --source-url https://example.com/region3.shp \
  --bucket my-bucket
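
When the list of sources lives in a script, the repeated --source-url flags can be assembled programmatically. The following is a minimal Python sketch that uses only the flags shown above; the helper name build_workflow_command is hypothetical and is not part of cng-datasets.

```python
import shlex


def build_workflow_command(dataset, source_urls, bucket):
    """Assemble a `cng-datasets workflow` argv list (hypothetical helper)."""
    if not source_urls:
        raise ValueError("at least one source URL is required")
    argv = ["cng-datasets", "workflow", "--dataset", dataset]
    for url in source_urls:
        # One --source-url flag per source, mirroring the example above.
        argv += ["--source-url", url]
    argv += ["--bucket", bucket]
    return argv


argv = build_workflow_command(
    "merged-regions",
    ["https://example.com/region1.shp", "https://example.com/region2.shp"],
    "my-bucket",
)
# Print a copy-pasteable shell command.
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run(argv, check=True) if you prefer to invoke the CLI from Python rather than from a shell.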

Apply the generated files to your cluster:

kubectl apply -f <output-dir>/workflow-rbac.yaml
kubectl apply -f <output-dir>/configmap.yaml
kubectl apply -f <output-dir>/workflow.yaml

The workflow orchestrator runs the steps in order, except that pmtiles and hex are launched together and run in parallel.
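
The step ordering can be pictured as a small dependency graph. The sketch below is illustrative only: the dependency edges are assumptions inferred from the step descriptions (for example, that repartition waits only on hex), not taken from the generated YAML.

```python
from graphlib import TopologicalSorter

# Assumed dependencies: each step maps to the steps it waits on.
deps = {
    "setup-bucket": set(),
    "convert": {"setup-bucket"},
    "pmtiles": {"convert"},
    "hex": {"convert"},
    "repartition": {"hex"},
}


def execution_waves(deps):
    """Group steps into waves; steps in the same wave can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for wave in execution_waves(deps):
    print(wave)
```

Under these assumed edges, pmtiles and hex fall into the same wave, which matches the parallel launch described above.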

Generate a raster processing pipeline

cng-datasets raster-workflow \
  --dataset wetlands-cog \
  --source-url https://example.com/wetlands.tif \
  --bucket public-wetlands

Multi-layer sources

For GeoDatabase or GeoPackage files with multiple layers, run one workflow per layer. Use --layer to select each layer and --dataset with / for hierarchical S3 paths:

# Inspect layers
ogrinfo /vsicurl/https://example.com/data.gdb

# Process each layer separately
cng-datasets workflow --dataset mydata/layer-a --layer LayerA ...
cng-datasets workflow --dataset mydata/layer-b --layer LayerB ...

The / in --dataset creates nested S3 paths (e.g., mydata/layer-a.parquet). In k8s resource names, the / is replaced with a hyphen.
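
As a rough illustration of that mapping, the Python sketch below derives both names from one --dataset value. The function names are hypothetical, and the slash-to-hyphen rule is the only transformation assumed here; the actual tool may sanitize names further.

```python
def s3_parquet_key(dataset):
    """S3 key of the GeoParquet output; slashes are kept as path separators."""
    return f"{dataset}.parquet"


def k8s_resource_name(dataset):
    """Name usable for k8s resources (assumption: '/' simply becomes '-')."""
    return dataset.replace("/", "-")


print(s3_parquet_key("mydata/layer-a"))     # mydata/layer-a.parquet
print(k8s_resource_name("mydata/layer-a"))  # mydata-layer-a
```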

CLI Reference

Command	Purpose
`cng-datasets workflow`	Generate vector processing k8s pipeline
`cng-datasets raster-workflow`	Generate raster processing k8s pipeline
`cng-datasets storage setup-bucket`	Create and configure an S3 bucket
`cng-convert-to-parquet`	Convert vector data to GeoParquet
`cng-datasets vector`	Run H3 hex tiling (used inside k8s pods)
`cng-datasets raster`	Run raster H3 tiling (used inside k8s pods)
`cng-datasets repartition`	Consolidate hex chunks (used inside k8s pods)

Commands marked "used inside k8s pods" are called by the generated jobs — you don't run them directly.

`cng-datasets workflow` options

--dataset NAME             Dataset name for S3 paths. Use / for hierarchy (e.g., "padus/fee").
--source-url URL           Public URL to source data.
--bucket BUCKET            Target S3 bucket name.
--layer LAYER              Layer name for multi-layer sources (GDB, GPKG).
--h3-resolution N          H3 resolution for hex tiling (default: 10).
--parent-resolutions STR   Comma-separated parent resolutions (default: "9,8,0").
--hex-memory SIZE          Memory per hex pod (default: 8Gi).
--max-completions N        Number of parallel chunks, max 200 (default: auto).
--max-parallelism N        Max concurrent pods (default: 50).
--id-column COL            ID column name (auto-detected if omitted).
--output-dir DIR           Directory for generated YAML files.
--intermediate-chunk-size  Rows per unnest batch (decrease if OOM).
--row-group-size N         Rows per parquet row group (default: 100000).
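
H3 resolutions range from 0 (coarsest) to 15 (finest), and a parent cell is always at a coarser resolution than its child, so each value in --parent-resolutions should be lower than --h3-resolution. The sketch below checks a value before you run the CLI. The function is hypothetical, and whether the CLI performs the same validation is not stated here.

```python
def parse_parent_resolutions(value, h3_resolution):
    """Parse a string such as "9,8,0" and check it against the base resolution."""
    parents = [int(part) for part in value.split(",") if part.strip()]
    for res in parents:
        if not 0 <= res <= 15:
            raise ValueError(f"H3 resolution out of range 0-15: {res}")
        if res >= h3_resolution:
            raise ValueError(
                f"parent resolution {res} must be coarser than {h3_resolution}"
            )
    return parents


print(parse_parent_resolutions("9,8,0", h3_resolution=10))  # [9, 8, 0]
```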

S3 Output Layout

bucket/
├── dataset.parquet              # GeoParquet
├── dataset.pmtiles              # PMTiles
├── dataset/
│   └── hex/
│       └── h0={cell}/data_0.parquet   # H3-indexed, hive-partitioned
├── README.md
└── stac-collection.json
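
In a hive-partitioned layout, the partition column and its value are encoded in the object key itself (here h0={cell}), which lets query engines skip files without opening them. The sketch below extracts those key=value segments from a key. The cell value is a made-up example, and how the tool encodes cell IDs in the path is an assumption.

```python
def parse_hive_partitions(key):
    """Return the key=value path segments of an object key as a dict."""
    partitions = {}
    for segment in key.split("/"):
        name, sep, value = segment.partition("=")
        if sep and name:
            partitions[name] = value
    return partitions


# Illustrative key following the layout above; the cell ID is hypothetical.
key = "dataset/hex/h0=8001fffffffffff/data_0.parquet"
print(parse_hive_partitions(key))  # {'h0': '8001fffffffffff'}
```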

Docker

The CLI and all dependencies are packaged in a Docker image used by the k8s jobs:

docker pull ghcr.io/boettiger-lab/datasets:latest

Troubleshooting

OOM on hex jobs: Increase --hex-memory (e.g., 32Gi → 64Gi), increase --max-completions for smaller chunks, or decrease --intermediate-chunk-size.
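
To see why more completions help, consider the arithmetic in the sketch below. It assumes rows are split evenly across chunks and uses a hypothetical 1,000,000-row dataset; the real chunking logic may differ.

```python
import math


def rows_per_chunk(total_rows, max_completions):
    """Rows each hex pod handles, assuming an even split across chunks."""
    return math.ceil(total_rows / max_completions)


def pod_waves(max_completions, max_parallelism):
    """Rounds of pods needed when only max_parallelism pods run at once."""
    return math.ceil(max_completions / max_parallelism)


print(rows_per_chunk(1_000_000, 100))  # 10000
print(rows_per_chunk(1_000_000, 200))  # 5000
print(pod_waves(200, 50))              # 4
```

Doubling --max-completions halves the rows each pod must hold, at the cost of more rounds of pods for a fixed --max-parallelism.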

Convert fails on curved geometries (MULTISURFACE): Handled automatically — the converter linearizes curved geometry types via ogr2ogr before processing.

Monitoring:

kubectl get jobs              # Pipeline status
kubectl logs job/<name>       # Job logs
kubectl get pods | grep OOM   # Check for memory issues

Development

pip install -e ".[dev]"
pytest tests/

License

Apache 2.0
