LinkML schemas and data models for FAO soil databases
This repository provides standardized, machine-readable schemas for FAO (Food and Agriculture Organization) soil databases using LinkML.
The Harmonized World Soil Database v2.0 is a comprehensive global soil dataset that provides:
- Global coverage at 30 arc-second resolution (~1 km)
- 7 standardized depth layers (0-200 cm)
- Comprehensive soil properties:
  - Physical: texture (sand/silt/clay), bulk density, coarse fragments
  - Chemical: organic carbon, pH, nitrogen, C/N ratio, calcium carbonate, gypsum
  - Cation exchange: CEC, base saturation, aluminum saturation
  - Hydrological: drainage class, available water capacity
- Multiple classification systems: WRB (World Reference Base), FAO-90, USDA
- Climate context: Köppen-Geiger climate zones
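The seven depth layers follow the HWSD v2.0 layer convention (D1–D7). A minimal sketch of the layer boundaries as described in the HWSD v2.0 documentation (verify against the schema's depth slots before relying on it):

```python
# HWSD v2.0 depth layers (cm), per the HWSD v2.0 documentation
LAYERS = {
    "D1": (0, 20),
    "D2": (20, 40),
    "D3": (40, 60),
    "D4": (60, 80),
    "D5": (80, 100),
    "D6": (100, 150),
    "D7": (150, 200),
}

def layer_for_depth(depth_cm):
    """Return the layer code whose [top, bottom) interval contains depth_cm."""
    for code, (top, bottom) in LAYERS.items():
        if top <= depth_cm < bottom:
            return code
    return None

print(layer_for_depth(120))  # D6
```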
Data structure:
- 408,835 layer records from 58,405 soil mapping units
- Each mapping unit can contain multiple soil sequences (representing spatial heterogeneity)
- Gridded raster data (43,200 × 21,600 pixels) linking locations to soil properties
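Because the grid is a regular 30 arc-second lattice, geographic coordinates map linearly to pixel indices. A minimal sketch, assuming the conventional top-left origin at 90°N, 180°W (check the actual raster header before using this in anger):

```python
def latlon_to_pixel(lat, lon, res=1.0 / 120.0):
    """Map WGS84 lat/lon to (row, col) in the 43,200 x 21,600 global grid.

    30 arc-seconds = 1/120 degree. The origin is assumed to be the
    top-left corner (90 N, 180 W), the common convention for global
    rasters; this is an assumption, not read from the file.
    """
    col = int((lon + 180.0) / res)
    row = int((90.0 - lat) / res)
    return row, col

print(latlon_to_pixel(40.0, -105.0))  # (6000, 9000)
```

The resulting pixel value is the soil mapping unit ID (HWSD2_SMU_ID), which keys into the layer records.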
Schema file: src/fao_soils/schema/hwsd2.yaml
Documentation: https://bioepic-data.github.io/fao-soils
This repository includes both the pre-built DuckDB export and the tidied CSV tables:
- export/hwsd2.ddb - ready-to-query DuckDB database export
- export/hwsd2_parquet/ - one Parquet file per HWSD2 table
- data/hwsd2/HWSD2_csv/ - tidied HWSD2 CSV files
If you want to start from packaged data in this repository, use export/ for DuckDB or Parquet workflows and data/ for direct CSV access.
- docs/ - mkdocs-managed documentation
- elements/ - generated schema documentation
- examples/ - Examples of using the schema
- export/ - Pre-built databases (ready to use!)
  - hwsd2.ddb - DuckDB database (32 MB)
  - hwsd2_parquet/ - Parquet exports of all HWSD2 tables
- data/ - Source CSV files
  - hwsd2/ - HWSD v2.0 data
    - HWSD2_csv/ - 25 CSV files (~100 MB)
- scripts/ - Data processing and extraction scripts
  - fetch_fao_soil_database.py - Download from FAO
  - load_hwsd2.py - Build DuckDB database
  - hwsd2_extractor.py - Extract by coordinates
- project/ - project files (these files are auto-generated, do not edit)
- src/ - source files (edit these)
  - fao_soils/
    - schema/ - LinkML schemas (edit these)
      - hwsd2.yaml - HWSD v2.0 schema
      - fao_soils.yaml - Template schema
    - datamodel/ - generated Python datamodels
- tests/ - Python tests
- data/ - Example data
Option 1: Use the pre-built database (Recommended, ~32 MB)

```shell
# Clone the repository
git clone https://github.com/bioepic-data/fao-soils.git
cd fao-soils

# The database is ready to use!
# File: export/hwsd2.ddb
```

Option 2: Build from source CSVs
```shell
# Install project dependencies
uv sync --dev

# Rebuild the packaged database in place (overwrites existing file)
uv run python scripts/load_hwsd2.py --force export/hwsd2.ddb
```

Option 3: Download from FAO (Complete dataset with rasters)
Requires mdb-tools to convert the FAO .mdb database into CSV and SQLite artifacts.
See scripts/README.md for installation help.
```shell
# Install project dependencies
uv sync --dev

# Download and refresh the repo-managed source artifacts under data/hwsd2/
uv run python scripts/fetch_fao_soil_database.py --data-dir data/hwsd2
```

Convenience targets
```shell
just update-data      # refresh data/hwsd2 artifacts from FAO
just update-export    # rebuild export/hwsd2.ddb from the CSVs
just update-parquet   # export DuckDB tables to export/hwsd2_parquet/
just update-artifacts # refresh data, DuckDB, and Parquet artifacts
```

just update-data and just update-artifacts require mdb-tools because they run the FAO .mdb conversion step.
```python
import duckdb

# Connect to the pre-built database
conn = duckdb.connect('export/hwsd2.ddb', read_only=True)

# Query soil properties
result = conn.execute("""
    SELECT l.*, d.VALUE as DRAINAGE_CLASS
    FROM HWSD2_LAYERS l
    LEFT JOIN D_DRAINAGE d ON l.DRAINAGE = d.CODE
    WHERE l.HWSD2_SMU_ID = 4726
    ORDER BY l.TOPDEP
""").fetchdf()
print(result)
```

```python
import pandas as pd

# Load from CSV
layers = pd.read_csv('data/hwsd2/HWSD2_csv/HWSD2_LAYERS.csv')
drainage = pd.read_csv('data/hwsd2/HWSD2_csv/D_DRAINAGE.csv')

# Join layer records to their drainage-class lookup values
merged = layers.merge(drainage, left_on='DRAINAGE', right_on='CODE')
```

```python
from scripts.hwsd2_extractor import get_soil_profile

# Get the 7-layer soil profile for a location
profile = get_soil_profile(lat=40.0, lon=-105.0)
if profile:
    print(f"Soil: {profile['metadata']['WRB2_NAME']}")
    print(profile['layers'][['LAYER', 'SAND', 'CLAY', 'ORG_CARBON']])
```

```python
from linkml_runtime.loaders import yaml_loader
from fao_soils.datamodel.hwsd2 import SoilMappingUnit, SoilLayer

# Load data conforming to the schema
# (after the schema is compiled, which generates these Python classes)
```

The LinkML schema can generate multiple formats:
```shell
# Install dependencies
uv sync --dev

# Generate Python dataclasses
gen-python src/fao_soils/schema/hwsd2.yaml > hwsd2_datamodel.py

# Generate JSON Schema
gen-json-schema src/fao_soils/schema/hwsd2.yaml > hwsd2.schema.json

# Generate SQL DDL
gen-sqlddl src/fao_soils/schema/hwsd2.yaml > hwsd2.sql

# Generate Markdown documentation
gen-markdown src/fao_soils/schema/hwsd2.yaml > hwsd2_docs.md
```

The HWSD2 schema enables:
- Extraction of soil profiles by geographic coordinates
- Integration with climate forcing data for ecosystem models
- Parameter calibration for biogeochemical models (e.g., EcoSIM, CENTURY, DayCENT)
- Multi-site comparative studies
- Standardized vocabulary for soil properties
- Consistent units and value ranges
- Validation of soil data quality
- Interoperability with other environmental databases
- Climate change impact assessments
- Agricultural productivity modeling
- Carbon cycle modeling
- Hydrological modeling
- Land use planning
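For carbon-cycle applications, layer records translate into carbon stocks via a standard calculation. A sketch assuming ORG_CARBON is a mass percentage, bulk density is in g/cm³, coarse fragments are a volume percentage, and layer depths (TOPDEP/BOTDEP) are in cm — check the schema's unit annotations before relying on these assumptions:

```python
def soc_stock_kg_m2(org_carbon_pct, bulk_density_g_cm3, topdep_cm, botdep_cm,
                    coarse_frag_pct=0.0):
    """Soil organic carbon stock for one layer, in kg C per m^2.

    stock = OC fraction * bulk density * layer thickness * fine-earth fraction * 10
    (the factor 10 converts %, g/cm^3, and cm into kg/m^2).
    """
    thickness_cm = botdep_cm - topdep_cm
    fine_earth = 1.0 - coarse_frag_pct / 100.0
    return (org_carbon_pct / 100.0) * bulk_density_g_cm3 * thickness_cm * fine_earth * 10.0

# 2 % OC, bulk density 1.3 g/cm^3, 0-20 cm layer, no coarse fragments -> ~5.2 kg C/m^2
print(soc_stock_kg_m2(2.0, 1.3, 0, 20))
```

Summing this over the seven layers of a profile gives a 0–200 cm carbon stock for a mapping unit.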
For data extraction and analysis tools using this schema, see:
- ecosim-co-scientist - AI-powered tools for ecosystem modeling including HWSD2 data extractors
Several pre-defined command recipes are available, written for the command runner just. To list them all, run just or just --list.
Common commands:

```shell
# Generate all schema artifacts
just gen-project

# Run tests
just test

# Build documentation
just gendoc
```

The schemas in this repository describe the structure of FAO soil databases. The actual data files are maintained by FAO and can be obtained from:
- HWSD v2.0: FAO Soils Portal
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
- Schema files: BSD-3-Clause (this repository)
- HWSD v2.0 data: CC-BY-4.0 (provided by FAO)
See LICENSE for details.
If you use these schemas in your research, please cite:
```bibtex
@software{fao_soils_schema,
  title = {FAO Soils LinkML Schemas},
  author = {{BioEPIC Data Team}},
  year = {2024},
  url = {https://github.com/bioepic-data/fao-soils},
  note = {LinkML schemas for FAO soil databases}
}
```

For the HWSD v2.0 data itself, please cite:
```bibtex
@misc{hwsd2,
  title = {Harmonized World Soil Database version 2.0},
  author = {{FAO}},
  year = {2023},
  publisher = {Food and Agriculture Organization of the United Nations},
  url = {https://www.fao.org/soils-portal/data-hub/soil-maps-and-databases/harmonized-world-soil-database-v20/en/}
}
```

This project uses the template linkml-project-copier published as doi:10.5281/zenodo.15163584.