UNIPLU-BR

Standardisation, quality control, and database import of the UNIPLU-BR (Unified Brazilian Rainfall Dataset) for the AIR Centre Atlantic Cloud API.

Author: João Pinelo (joao.pinelo@aircentre.org), AIR Centre. Status: Complete. 396,646,593 observations from 21,277 stations loaded into the production MySQL database. Validated.


What This Project Does

The UNIPLU-BR dataset consolidates precipitation records from six Brazilian monitoring networks into 2,885 ZIP files partitioned by state and year. It covers 21,395 stations, 911 million observations, and spans 1885–2025 — the most comprehensive unified rainfall dataset for Brazil to date. The source authors explicitly state that no quality control was performed on the data.

This pipeline reads the raw dataset, applies per-network quality control, aggregates sub-hourly data to hourly where appropriate, and loads the results into the Atlantic Cloud MySQL database alongside existing Portuguese meteorological data.
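The aggregation step can be sketched as follows. This is an illustrative simplification, not the actual `src/aggregation.jl` code: sub-hourly totals are summed into the hour they fall in, and hours with fewer readings than expected are flagged `SUSPECT_INCOMPLETE_HOUR` rather than dropped (the `expected_intervals` default of 6 assumes 10-minute data).

```julia
using Dates

# Sum sub-hourly rainfall totals into hourly buckets; flag hours with
# fewer readings than expected instead of discarding them.
function aggregate_hourly(timestamps::Vector{DateTime}, values::Vector{Float64};
                          expected_intervals::Int = 6)
    buckets = Dict{DateTime,Tuple{Float64,Int}}()
    for (t, v) in zip(timestamps, values)
        h = floor(t, Hour)                       # hour the reading belongs to
        total, n = get(buckets, h, (0.0, 0))
        buckets[h] = (total + v, n + 1)
    end
    # Emit (hour, total_mm, qc_flag) tuples sorted by time.
    return [(h, total, n == expected_intervals ? "PASS" : "SUSPECT_INCOMPLETE_HOUR")
            for (h, (total, n)) in sort!(collect(buckets); by = first)]
end
```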

Pipeline Summary

| Stage | Input | Output |
|---|---|---|
| Raw dataset | | 2,885 ZIP files, 911M observations |
| QC + aggregation | 911M raw observations | 396,646,593 clean observations (FAIL rows removed, SUSPECT rows flagged) |
| Database import | 396,646,593 observations, 21,395 stations | meteo_br_1hr (255.6M rows), meteo_br_daily (141.1M rows), stations (+21,395 rows) |

What Was Removed and Why

| Category | Rows Removed | Reason |
|---|---|---|
| Sub-hourly aggregation | ~514M → ~256M (hourly) | CEMADEN 10-min and Telemetria sub-hourly summed to hourly totals |
| FAIL_DUPLICATE | ~511K | Duplicate (station, timestamp) pairs in Telemetria and CEMADEN |
| FAIL_ABSOLUTE_UPPER_BOUND | ~4,703 | Rainfall exceeding 2× WMO world records for the interval duration |
| FAIL_NEGATIVE | 0 | No negative values found |
| FAIL_MISSING_TIMESTAMP | 0 | No missing timestamps found |
| FAIL_FUTURE_TIMESTAMP | 0 | No future timestamps found |
| FAIL_ORPHAN_RECORD | 0 | No orphan records found |
| Excluded stations | 7 | Indefinido stations with insufficient data to determine resolution |
| Unparseable files | 37/2,885 | Pre-1968 files using Parquet Encoding 41, unsupported by Parquet2.jl |
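The absolute upper bound rule can be sketched as below. The record values here are illustrative placeholders for the WMO world records (roughly 305 mm in one hour, 1,825 mm in 24 hours); the thresholds the pipeline actually uses are documented in docs/qc_criteria.md.

```julia
# Illustrative FAIL_ABSOLUTE_UPPER_BOUND check: a reading fails if it exceeds
# twice the world-record rainfall for its interval duration.
const WORLD_RECORD_MM = Dict(60 => 305.0, 1440 => 1825.0)  # minutes => mm

function absolute_upper_bound_flag(value_mm::Float64, interval_min::Int)
    record = WORLD_RECORD_MM[interval_min]
    return value_mm > 2 * record ? "FAIL_ABSOLUTE_UPPER_BOUND" : "PASS"
end
```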

What Was Flagged (retained in database with flagged = true)

| Flag | Count | % of Output |
|---|---|---|
| SUSPECT_INCOMPLETE_HOUR | 4,903,785 | 1.24 |
| SUSPECT_FLAT_STATION | 1,782,864 | 0.45 |
| SUSPECT_DRY_SPELL | 1,472,926 | 0.37 |
| SUSPECT_STATISTICAL_OUTLIER | 1,414,073 | 0.36 |
| SUSPECT_CONSTANT_RAIN | 771,006 | 0.19 |
| SUSPECT_ISOLATED_SPIKE | 186,711 | 0.05 |
| SUSPECT_RAPID_OSCILLATION (CEMADEN only) | | <0.01 |
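As an example of the flag-don't-remove policy, an isolated-spike check can be sketched like this. The 40 mm threshold is an illustrative assumption; the actual criteria are in docs/qc_criteria.md. A single large value with zero rainfall immediately before and after is physically suspicious, so it is flagged but retained.

```julia
# Illustrative SUSPECT_ISOLATED_SPIKE check: flag a large value whose
# immediate neighbours are both zero; the value itself is kept.
function isolated_spike_flags(values::Vector{Float64}; spike_mm::Float64 = 40.0)
    flags = fill("PASS", length(values))
    for i in 2:(length(values) - 1)
        if values[i] >= spike_mm && values[i-1] == 0.0 && values[i+1] == 0.0
            flags[i] = "SUSPECT_ISOLATED_SPIKE"
        end
    end
    return flags
end
```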

Networks

| Network | Stations | Raw Resolution | Output Resolution | Period | Output Rows |
|---|---|---|---|---|---|
| Hidroweb diário | 12,051 | Daily | Daily | 1885–2025 | 133,230,051 |
| Telemetria | 2,818 | 10/15/30/60 min | Hourly | 2014–2025 | 143,570,214 |
| INMET subdiário | 629 | 60 min | Hourly | 2000–2025 | 68,043,387 |
| ICEA | 176 | Mixed | Hourly + 1 daily | 1951–2025 | 30,357,398 |
| CEMADEN | 5,014 | 10 min | Hourly | 2014–2025 | 13,643,755 |
| INMET diário | 589 | Daily | Daily | 1889–2025 | 7,801,788 |
| Total | 21,277 | | | | 396,646,593 |

Reproducing the Pipeline

Prerequisites

  • Julia ≥ 1.10 with packages: Parquet2 (v0.2.33), DataFrames, ZipFile.
  • MySQL CLI (mysql command) with access to the target database.
  • macOS with bash/zsh. Scripts use the global keyword for Julia 1.10 scoping.
  • UNIPLU-BR dataset downloaded from Zenodo and extracted. Set the data path in each processing script (default: ../Brazilian_Unified_Rainfall_Dataset_(1885-2025)/ relative to repo root).
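One way to set up the Julia environment before running the scripts, assuming the repository's Project.toml pins the dependencies listed above (this is environment configuration, not part of the pipeline itself):

```julia
using Pkg

Pkg.activate(@__DIR__)   # activate the repo's Project.toml
Pkg.instantiate()        # install the pinned dependencies (Parquet2, DataFrames, ZipFile)
```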

Step-by-Step

Scripts are numbered in execution order. Each is self-contained and can be re-run independently.

Stage B — Data access and station catalogue:

julia scripts/00_build_station_catalogue.jl    # → output/stations/station_catalogue.csv (21,395 stations)
julia scripts/01_investigate_unparseable.jl     # → output/qc_reports/unparseable_files_report.md
julia scripts/02_investigate_indefinido_utc.jl  # → output/qc_reports/indefinido_investigation.md, utc_verification.md

Stage C — Per-network QC and aggregation (order does not matter):

julia scripts/03_process_icea.jl               # → output/networks/icea/
julia scripts/04_process_inmet_subdiario.jl     # → output/networks/inmet_subdiario/
julia scripts/05_process_inmet_diario.jl        # → output/networks/inmet_diario/
julia scripts/06_process_telemetria.jl          # → output/networks/telemetria/
julia scripts/07_process_cemaden.jl             # → output/networks/cemaden/
julia scripts/08_process_hidroweb_diario.jl     # → output/networks/hidroweb_diario/

Each produces three files: {network}_observations.csv, {network}_qc_summary.md, {network}_processing_log.md.

Stage D — Database import:

julia scripts/09_import_stations.jl            # → output/import/stations_import.sql
julia scripts/10_transform_observations.jl     # → output/import/*.sql (batched INSERT files, ~20 GB total)
bash  scripts/11_load_observations.sh          # Loads stations + observations into MySQL

The transform script takes ~20 minutes (I/O bound). The load script takes ~5 hours at ~24K rows/second to a remote database. Copy .env.example to .env and set your database connection details before running the load.

Validation:

mysql [connection flags] meteorology < sql/validate_import.sql

Results documented in output/qc_reports/validation_report.md.

Expected Outputs

After a full run, output/ contains:

output/
├── stations/
│   ├── station_catalogue.csv          21,395 stations with metadata
│   ├── metadata_conflicts.csv         231 stations with coordinate/elevation spread
│   └── catalogue_summary.md
├── networks/
│   ├── icea/                          30.4M obs, 176 stations
│   ├── inmet_subdiario/               68.0M obs, 629 stations
│   ├── inmet_diario/                   7.8M obs, 589 stations
│   ├── telemetria/                   143.6M obs, 2,818 stations
│   ├── cemaden/                       13.6M obs, 5,014 stations
│   └── hidroweb_diario/             133.2M obs, 12,051 stations
├── qc_reports/
│   ├── unparseable_files_report.md    37 files, Encoding 41 issue
│   ├── indefinido_investigation.md    42 stations resolved
│   ├── utc_verification.md            Diurnal cycle confirmation
│   └── validation_report.md           10/10 checks pass
└── import/                            Generated SQL files (gitignored)

Database Schema

The pipeline loads data into the existing meteorology database, which also serves Portuguese observations via the Atlantic Cloud API.

Tables Modified

stations — Extended with 6 new columns: country (CHAR(2), 'PT' or 'BR'), state, elevation_m, responsible, utc_offset, temporal_resolution_min. 21,395 Brazilian stations inserted alongside 349 existing Portuguese stations.

Tables Created

meteo_br_1hr — 255,569,728 hourly observations. Composite PK (station_id, timestamp). RANGE partitioned by year. Secondary index on (state, timestamp).

meteo_br_daily — 141,076,865 daily observations. Same structure.

Both use DATETIME (not TIMESTAMP) to support pre-1970 data. The state column is deliberately denormalised from the stations table to speed up state-based queries; see docs/schema_design.md for the rationale.

No foreign key constraints (MySQL limitation on partitioned tables). Referential integrity verified: zero orphan station_ids.

Observation Columns

| Column | Type | Description |
|---|---|---|
| station_id | VARCHAR(45) | Station identifier (= gauge_code from source). |
| timestamp | DATETIME | Observation time (UTC). |
| precipitation_accum_mm | FLOAT UNSIGNED | Rainfall in mm for the interval. |
| qc_flag | VARCHAR(30) | PASS or SUSPECT_*. |
| flagged | TINYINT(1) | 0 = PASS, 1 = SUSPECT. |
| state | CHAR(2) | Brazilian state code (denormalised). |
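Since FAIL rows never reach the database, the flagged bit is fully derivable from qc_flag. A minimal sketch of the mapping (column names as in the table above; this is illustrative, not the transform script's code):

```julia
# Derive the flagged column from qc_flag: 1 for any SUSPECT_* flag, 0 for PASS.
flagged_bit(qc_flag::AbstractString) = startswith(qc_flag, "SUSPECT") ? 1 : 0
```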

Documentation

| Document | Location |
|---|---|
| QC criteria (v1.0) | docs/qc_criteria.md |
| Database schema design | docs/schema_design.md |
| Field mapping (early draft, partially superseded) | docs/field_mapping.md |
| Processing log | docs/processing_log.md |
| Validation report | output/qc_reports/validation_report.md |
| Unparseable files investigation | output/qc_reports/unparseable_files_report.md |
| Indefinido station resolution | output/qc_reports/indefinido_investigation.md |
| UTC verification | output/qc_reports/utc_verification.md |
| Per-network processing logs | output/networks/{network}/{network}_processing_log.md |
| Per-network QC summaries | output/networks/{network}/{network}_qc_summary.md |
| DDL | sql/create_br_observations.sql |
| Validation queries | sql/validate_import.sql |

Known Limitations

  1. 37 unparseable files (1.3% of dataset). All pre-1968, Parquet Encoding 41 unsupported by Parquet2.jl v0.2.33. Could be recovered with PyArrow. Documented in output/qc_reports/unparseable_files_report.md.

  2. No spatial consistency checks. QC is per-station only. See docs/qc_criteria.md Section 13.

  3. No gap-filling or interpolation. Missing data remains missing.

  4. Metadata conflicts in 231 stations. Most recent metadata used. Documented in output/stations/metadata_conflicts.csv.

  5. n_intervals and expected_intervals not in database. Dropped during import for storage efficiency at 397M rows. The SUSPECT_INCOMPLETE_HOUR flag captures the key information. Original CSVs preserve the full detail.

  6. FLOAT precision. MySQL FLOAT (7 significant digits), matching the existing meteo_1hr table. No practical loss for rainfall in mm.
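On point 6: MySQL FLOAT is IEEE-754 single precision, i.e. Julia's Float32, which carries roughly 7 significant decimal digits. The round-trip error for rainfall magnitudes is negligible, as a quick check shows:

```julia
# Relative error from storing a rainfall total as single precision.
x = 1234.5678                            # an extreme total, in mm
err = abs(Float64(Float32(x)) - x) / x   # bounded by eps(Float32)/2 ≈ 6e-8
```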


Repository Structure

UNIPLU-BR/
├── docs/
│   ├── qc_criteria.md              QC rules and justifications (v1.0)
│   ├── schema_design.md            Database design decisions (v1.0)
│   ├── field_mapping.md            Source→target mapping (partially superseded)
│   └── processing_log.md           Chronological decision log
├── src/
│   ├── reader.jl                   ZIP/Parquet reader module
│   ├── qc.jl                       Shared QC functions (310 tests)
│   ├── aggregation.jl              Sub-hourly → hourly aggregation
│   └── stations.jl                 Station catalogue utilities
├── scripts/
│   ├── 00–02                       Stage B: catalogue and investigations
│   ├── 03–08                       Stage C: per-network processing
│   └── 09–11                       Stage D: database import
├── test/
│   └── runtests.jl                 310 tests (QC, aggregation, reader)
├── sql/
│   ├── create_br_observations.sql  DDL for observation tables
│   └── validate_import.sql         Post-import validation queries
├── output/                         Pipeline outputs (.md tracked, .csv/.sql gitignored)
├── README.md                       This file
├── LICENSE                         MIT (code only)
└── Project.toml                    Julia dependencies

Data Source and Attribution

The UNIPLU-BR dataset is licensed under CC-BY 4.0. All outputs that include or derive from the dataset must cite the original authors.

Required Citation:

Lemos, F. C. et al. (2026). Unified Brazilian Rainfall Dataset (UNIPLU-BR): A Standardized National Database of Point Precipitation from Major Brazilian Monitoring Networks [Data set]. Zenodo. https://doi.org/10.5281/zenodo.18883358

Companion Paper:

Das Neves Almeida, C. et al. (2025). The design of the Brazilian Sub-Daily Rainfall dataset (BR-SDR): two decades of high-time-resolution data in Brazil. Hydrological Sciences Journal, 70(11), 1850–1862. https://doi.org/10.1080/02626667.2025.2506193

QC Methodology References:

Meira, M. A. et al. (2022). Quality control procedures for sub-hourly rainfall data. Journal of Hydrology, 613(A), 128358.

Vidal-Barbosa, J. L. et al. (2025). BRain-D. Atmospheric Research, 108552.

WMO (1994). Guide to Hydrological Practices, 5th edition. WMO-No. 168.


Licence

Code in this repository is MIT licensed. The UNIPLU-BR dataset is CC-BY 4.0 by Lemos, F. C. et al. (2026).
