LinkML schemas and data models for FAO soil databases
This repository provides standardized, machine-readable schemas for FAO (Food and Agriculture Organization) soil databases using LinkML.
The Harmonized World Soil Database v2.0 is a comprehensive global soil dataset that provides:
- Global coverage at 30 arc-second resolution (~1 km)
- 7 standardized depth layers (0-200 cm)
- Comprehensive soil properties:
  - Physical: texture (sand/silt/clay), bulk density, coarse fragments
  - Chemical: organic carbon, pH, nitrogen, C/N ratio, calcium carbonate, gypsum
  - Cation exchange: CEC, base saturation, aluminum saturation
  - Hydrological: drainage class, available water capacity
- Multiple classification systems: WRB (World Reference Base), FAO-90, USDA
- Climate context: Köppen-Geiger climate zones
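The seven depth layers follow the HWSD v2.0 layer convention (D1–D7). A minimal sketch of the layer boundaries as described in the HWSD v2.0 documentation (verify against the schema's depth slots before relying on it):

```python
# HWSD v2.0 depth layers (cm), per the HWSD v2.0 documentation
LAYERS = {
    "D1": (0, 20),
    "D2": (20, 40),
    "D3": (40, 60),
    "D4": (60, 80),
    "D5": (80, 100),
    "D6": (100, 150),
    "D7": (150, 200),
}

def layer_for_depth(depth_cm):
    """Return the layer code whose [top, bottom) interval contains depth_cm."""
    for code, (top, bottom) in LAYERS.items():
        if top <= depth_cm < bottom:
            return code
    return None

print(layer_for_depth(120))  # D6
```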
Data structure:
- 408,835 layer records from 58,405 soil mapping units
- Each mapping unit can contain multiple soil sequences (representing spatial heterogeneity)
- Gridded raster data (43,200 × 21,600 pixels) linking locations to soil properties
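Because the grid is a regular 30 arc-second lattice, geographic coordinates map linearly to pixel indices. A minimal sketch, assuming the conventional top-left origin at 90°N, 180°W (check the actual raster header before using this in anger):

```python
def latlon_to_pixel(lat, lon, res=1.0 / 120.0):
    """Map WGS84 lat/lon to (row, col) in the 43,200 x 21,600 global grid.

    30 arc-seconds = 1/120 degree. The origin is assumed to be the
    top-left corner (90 N, 180 W), the common convention for global
    rasters; this is an assumption, not read from the file.
    """
    col = int((lon + 180.0) / res)
    row = int((90.0 - lat) / res)
    return row, col

print(latlon_to_pixel(40.0, -105.0))  # (6000, 9000)
```

The resulting pixel value is the soil mapping unit ID (HWSD2_SMU_ID), which keys into the layer records.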
Schema file: src/fao_soils/schema/hwsd2.yaml
Documentation: https://bioepic-data.github.io/fao-soils
This repository includes both the pre-built DuckDB export and the tidied CSV tables:
- export/hwsd2.ddb - ready-to-query DuckDB database export
- export/hwsd2_parquet/ - one Parquet file per HWSD2 table
- data/hwsd2/HWSD2_csv/ - tidied HWSD2 CSV files
If you want to start from packaged data in this repository, use export/ for DuckDB or Parquet workflows and data/ for direct CSV access.
- docs/ - mkdocs-managed documentation
- elements/ - generated schema documentation
- examples/ - Examples of using the schema
- export/ - Pre-built databases (ready to use!)
  - hwsd2.ddb - DuckDB database (32 MB)
  - hwsd2_parquet/ - Parquet exports of all HWSD2 tables
- data/ - Source CSV files
  - hwsd2/ - HWSD v2.0 data
    - HWSD2_csv/ - 25 CSV files (~100 MB)
- scripts/ - Data processing and extraction scripts
  - fetch_fao_soil_database.py - Download from FAO
  - load_hwsd2.py - Build DuckDB database
  - hwsd2_extractor.py - Extract by coordinates
- project/ - project files (these files are auto-generated, do not edit)
- src/ - source files (edit these)
  - fao_soils/
    - schema/ - LinkML schemas (edit these)
      - hwsd2.yaml - HWSD v2.0 schema
      - fao_soils.yaml - Template schema
    - datamodel/ - generated Python datamodels
- tests/ - Python tests
- data/ - Example data
Option 1: Use the pre-built database (Recommended, ~32 MB)

```shell
# Clone the repository
git clone https://github.com/bioepic-data/fao-soils.git
cd fao-soils

# The database is ready to use!
# File: export/hwsd2.ddb
```

Option 2: Build from source CSVs
```shell
# Install project dependencies
uv sync --dev

# Rebuild the packaged database in place (overwrites existing file)
uv run python scripts/load_hwsd2.py --force export/hwsd2.ddb
```

Option 3: Download from FAO (Complete dataset with rasters)
Requires mdb-tools to convert the FAO .mdb database into CSV and SQLite artifacts.
See scripts/README.md for installation help.
```shell
# Install project dependencies
uv sync --dev

# Download and refresh the repo-managed source artifacts under data/hwsd2/
uv run python scripts/fetch_fao_soil_database.py --data-dir data/hwsd2
```

Convenience targets
```shell
just update-data      # refresh data/hwsd2 artifacts from FAO
just update-export    # rebuild export/hwsd2.ddb from the CSVs
just update-parquet   # export DuckDB tables to export/hwsd2_parquet/
just update-artifacts # refresh data, DuckDB, and Parquet artifacts
```

just update-data and just update-artifacts require mdb-tools because they run the FAO .mdb conversion step.
```python
import duckdb

# Connect to the pre-built database
conn = duckdb.connect('export/hwsd2.ddb', read_only=True)

# Query soil properties
result = conn.execute("""
    SELECT l.*, d.VALUE as DRAINAGE_CLASS
    FROM HWSD2_LAYERS l
    LEFT JOIN D_DRAINAGE d ON l.DRAINAGE = d.CODE
    WHERE l.HWSD2_SMU_ID = 4726
    ORDER BY l.TOPDEP
""").fetchdf()
print(result)
```

```python
import pandas as pd

# Load from CSV
layers = pd.read_csv('data/hwsd2/HWSD2_csv/HWSD2_LAYERS.csv')
drainage = pd.read_csv('data/hwsd2/HWSD2_csv/D_DRAINAGE.csv')

# Join layer records to their drainage-class lookup values
merged = layers.merge(drainage, left_on='DRAINAGE', right_on='CODE')
```

```python
from scripts.hwsd2_extractor import get_soil_profile

# Get the 7-layer soil profile for a location
profile = get_soil_profile(lat=40.0, lon=-105.0)
if profile:
    print(f"Soil: {profile['metadata']['WRB2_NAME']}")
    print(profile['layers'][['LAYER', 'SAND', 'CLAY', 'ORG_CARBON']])
```

```python
from linkml_runtime.loaders import yaml_loader
from fao_soils.datamodel.hwsd2 import SoilMappingUnit, SoilLayer

# Load data conforming to the schema
# (after the schema is compiled, which generates these Python classes)
```

The LinkML schema can generate multiple formats:
```shell
# Install dependencies
uv sync --dev

# Generate Python dataclasses
gen-python src/fao_soils/schema/hwsd2.yaml > hwsd2_datamodel.py

# Generate JSON Schema
gen-json-schema src/fao_soils/schema/hwsd2.yaml > hwsd2.schema.json

# Generate SQL DDL
gen-sqlddl src/fao_soils/schema/hwsd2.yaml > hwsd2.sql

# Generate Markdown documentation
gen-markdown src/fao_soils/schema/hwsd2.yaml > hwsd2_docs.md
```

The HWSD2 schema enables:
- Extraction of soil profiles by geographic coordinates
- Integration with climate forcing data for ecosystem models
- Parameter calibration for biogeochemical models (e.g., EcoSIM, CENTURY, DayCENT)
- Multi-site comparative studies
- Standardized vocabulary for soil properties
- Consistent units and value ranges
- Validation of soil data quality
- Interoperability with other environmental databases
- Climate change impact assessments
- Agricultural productivity modeling
- Carbon cycle modeling
- Hydrological modeling
- Land use planning
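For carbon-cycle applications, layer records translate into carbon stocks via a standard calculation. A sketch assuming ORG_CARBON is a mass percentage, bulk density is in g/cm³, coarse fragments are a volume percentage, and layer depths (TOPDEP/BOTDEP) are in cm — check the schema's unit annotations before relying on these assumptions:

```python
def soc_stock_kg_m2(org_carbon_pct, bulk_density_g_cm3, topdep_cm, botdep_cm,
                    coarse_frag_pct=0.0):
    """Soil organic carbon stock for one layer, in kg C per m^2.

    stock = OC fraction * bulk density * layer thickness * fine-earth fraction * 10
    (the factor 10 converts %, g/cm^3, and cm into kg/m^2).
    """
    thickness_cm = botdep_cm - topdep_cm
    fine_earth = 1.0 - coarse_frag_pct / 100.0
    return (org_carbon_pct / 100.0) * bulk_density_g_cm3 * thickness_cm * fine_earth * 10.0

# 2 % OC, bulk density 1.3 g/cm^3, 0-20 cm layer, no coarse fragments -> ~5.2 kg C/m^2
print(soc_stock_kg_m2(2.0, 1.3, 0, 20))
```

Summing this over the seven layers of a profile gives a 0–200 cm carbon stock for a mapping unit.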
For data extraction and analysis tools using this schema, see:
- ecosim-co-scientist - AI-powered tools for ecosystem modeling including HWSD2 data extractors
Several pre-defined command recipes are available, written for the command runner just. To list them all, run just or just --list.
Common commands:

```shell
# Generate all schema artifacts
just gen-project

# Run tests
just test

# Build documentation
just gendoc
```

The schemas in this repository describe the structure of FAO soil databases. The actual data files are maintained by FAO and can be obtained from:
- HWSD v2.0: FAO Soils Portal
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
- Schema files: BSD-3-Clause (this repository)
- HWSD v2.0 data: CC-BY-4.0 (provided by FAO)
See LICENSE for details.
If you use these schemas in your research, please cite:
```bibtex
@software{fao_soils_schema,
  title = {FAO Soils LinkML Schemas},
  author = {{BioEPIC Data Team}},
  year = {2024},
  url = {https://github.com/bioepic-data/fao-soils},
  note = {LinkML schemas for FAO soil databases}
}
```

For the HWSD v2.0 data itself, please cite:
```bibtex
@misc{hwsd2,
  title = {Harmonized World Soil Database version 2.0},
  author = {{FAO}},
  year = {2023},
  publisher = {Food and Agriculture Organization of the United Nations},
  url = {https://www.fao.org/soils-portal/data-hub/soil-maps-and-databases/harmonized-world-soil-database-v20/en/}
}
```

This project uses the template linkml-project-copier published as doi:10.5281/zenodo.15163584.