Standardisation, quality control, and database import of the UNIPLU-BR (Unified Brazilian Rainfall Dataset) for the AIR Centre Atlantic Cloud API.
Author: João Pinelo (joao.pinelo@aircentre.org), AIR Centre. Status: Complete. 396,646,593 observations from 21,277 stations loaded into the production MySQL database. Validated.
The UNIPLU-BR dataset consolidates precipitation records from six Brazilian monitoring networks into 2,885 ZIP files partitioned by state and year. It covers 21,395 stations, 911 million observations, and spans 1885–2025 — the most comprehensive unified rainfall dataset for Brazil to date. The source authors explicitly state that no quality control was performed on the data.
This pipeline reads the raw dataset, applies per-network quality control, aggregates sub-hourly data to hourly where appropriate, and loads the results into the Atlantic Cloud MySQL database alongside existing Portuguese meteorological data.
Pipeline summary:

| Stage | Input | Output |
|---|---|---|
| Raw dataset | 2,885 ZIP files, 911M observations | — |
| QC + aggregation | 911M raw observations | 396,646,593 clean observations (FAIL rows removed, SUSPECT rows flagged) |
| Database import | 396,646,593 observations, 21,395 stations | meteo_br_1hr (255.6M rows), meteo_br_daily (141.1M rows), stations (+21,395 rows) |

Rows removed, aggregated, or excluded:

| Category | Rows Removed | Reason |
|---|---|---|
| Sub-hourly aggregation | ~514M → ~256M (hourly) | CEMADEN 10-min and Telemetria sub-hourly summed to hourly totals |
| FAIL_DUPLICATE | ~511K | Duplicate (station, timestamp) pairs in Telemetria and CEMADEN |
| FAIL_ABSOLUTE_UPPER_BOUND | ~4,703 | Rainfall exceeding 2× WMO world records for the interval duration |
| FAIL_NEGATIVE | 0 | No negative values found |
| FAIL_MISSING_TIMESTAMP | 0 | No missing timestamps found |
| FAIL_FUTURE_TIMESTAMP | 0 | No future timestamps found |
| FAIL_ORPHAN_RECORD | 0 | No orphan records found |
| Excluded stations | 7 | Indefinido stations with insufficient data to determine resolution |
| Unparseable files | 37/2,885 | Pre-1968 files using Parquet Encoding 41, unsupported by Parquet2.jl |
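The FAIL_ABSOLUTE_UPPER_BOUND rule can be sketched as below. This is an illustration only: the function name is invented and the thresholds come from widely cited WMO world-record rainfall values, not necessarily the pipeline's exact lookup table.

```julia
# Illustrative absolute-bound check. The record values are assumptions drawn
# from published WMO world-record rainfall for 60-minute and 24-hour intervals.
const WORLD_RECORD_MM = Dict(60   => 305.0,   # ~1-hour record (Holt, Missouri, 1947)
                             1440 => 1825.0)  # 24-hour record (Foc-Foc, Réunion, 1966)

"""Return :FAIL_ABSOLUTE_UPPER_BOUND when rainfall exceeds 2× the world record
for the interval duration, :FAIL_NEGATIVE below zero, and :PASS otherwise."""
function bound_flag(precip_mm::Real, interval_min::Int)
    precip_mm < 0 && return :FAIL_NEGATIVE
    precip_mm > 2 * WORLD_RECORD_MM[interval_min] && return :FAIL_ABSOLUTE_UPPER_BOUND
    return :PASS
end
```

Rows flagged FAIL_* are removed before import; SUSPECT_* rows are kept but flagged.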

Rows flagged SUSPECT (retained in the output):

| Flag | Count | % of Output |
|---|---|---|
| SUSPECT_INCOMPLETE_HOUR | 4,903,785 | 1.24 |
| SUSPECT_FLAT_STATION | 1,782,864 | 0.45 |
| SUSPECT_DRY_SPELL | 1,472,926 | 0.37 |
| SUSPECT_STATISTICAL_OUTLIER | 1,414,073 | 0.36 |
| SUSPECT_CONSTANT_RAIN | 771,006 | 0.19 |
| SUSPECT_ISOLATED_SPIKE | 186,711 | 0.05 |
| SUSPECT_RAPID_OSCILLATION | (CEMADEN only) | <0.01 |
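CEMADEN and Telemetria sub-hourly records are summed to hourly totals, and hours missing intervals receive SUSPECT_INCOMPLETE_HOUR. A minimal sketch of that step (function and argument names are assumptions; the pipeline's implementation lives in src/aggregation.jl):

```julia
# Minimal sketch of sub-hourly → hourly aggregation with completeness flagging.
using Dates

"""Sum 10-minute readings into hourly totals. Hours with fewer than `expected`
intervals are kept but flagged :SUSPECT_INCOMPLETE_HOUR rather than dropped."""
function to_hourly(times::Vector{DateTime}, vals::Vector{Float64}; expected::Int=6)
    buckets = Dict{DateTime,Tuple{Float64,Int}}()
    for (t, v) in zip(times, vals)
        h = floor(t, Hour(1))                 # hour bucket containing the reading
        total, n = get(buckets, h, (0.0, 0))
        buckets[h] = (total + v, n + 1)
    end
    return [(h, total, n == expected ? :PASS : :SUSPECT_INCOMPLETE_HOUR)
            for (h, (total, n)) in sort!(collect(buckets); by = first)]
end
```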

Per-network summary:

| Network | Stations | Raw Resolution | Output Resolution | Period | Output Rows |
|---|---|---|---|---|---|
| Hidroweb diário | 12,051 | Daily | Daily | 1885–2025 | 133,230,051 |
| Telemetria | 2,818 | 10/15/30/60 min | Hourly | 2014–2025 | 143,570,214 |
| INMET subdiário | 629 | 60 min | Hourly | 2000–2025 | 68,043,387 |
| ICEA | 176 | Mixed | Hourly + 1 daily | 1951–2025 | 30,357,398 |
| CEMADEN | 5,014 | 10 min | Hourly | 2014–2025 | 13,643,755 |
| INMET diário | 589 | Daily | Daily | 1889–2025 | 7,801,788 |
| Total | 21,277 | — | — | — | 396,646,593 |
- Julia ≥ 1.10 with packages: Parquet2 (v0.2.33), DataFrames, ZipFile.
- MySQL CLI (`mysql` command) with access to the target database.
- macOS with bash/zsh. Scripts use the `global` keyword for Julia 1.10 scoping.
- UNIPLU-BR dataset downloaded from Zenodo and extracted. Set the data path in each processing script (default: `../Brazilian_Unified_Rainfall_Dataset_(1885-2025)/` relative to repo root).
Scripts are numbered in execution order. Each is self-contained and can be re-run independently.
Stage B — Data access and station catalogue:
```
julia scripts/00_build_station_catalogue.jl     # → output/stations/station_catalogue.csv (21,395 stations)
julia scripts/01_investigate_unparseable.jl     # → output/qc_reports/unparseable_files_report.md
julia scripts/02_investigate_indefinido_utc.jl  # → output/qc_reports/indefinido_investigation.md, utc_verification.md
```

Stage C — Per-network QC and aggregation (order does not matter):
```
julia scripts/03_process_icea.jl            # → output/networks/icea/
julia scripts/04_process_inmet_subdiario.jl # → output/networks/inmet_subdiario/
julia scripts/05_process_inmet_diario.jl    # → output/networks/inmet_diario/
julia scripts/06_process_telemetria.jl      # → output/networks/telemetria/
julia scripts/07_process_cemaden.jl         # → output/networks/cemaden/
julia scripts/08_process_hidroweb_diario.jl # → output/networks/hidroweb_diario/
```

Each produces three files: `{network}_observations.csv`, `{network}_qc_summary.md`, and `{network}_processing_log.md`.
Stage D — Database import:
```
julia scripts/09_import_stations.jl        # → output/import/stations_import.sql
julia scripts/10_transform_observations.jl # → output/import/*.sql (batched INSERT files, ~20 GB total)
bash  scripts/11_load_observations.sh      # Loads stations + observations into MySQL
```

The transform script takes ~20 minutes (I/O bound). The load script takes ~5 hours at ~24K rows/second to a remote database. Copy `.env.example` to `.env` and set your database connection details before running the load.
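A hypothetical `.env` might look like the following. The key names here are invented for illustration; `.env.example` in the repository defines the actual ones.

```
# Hypothetical connection settings -- see .env.example for the real key names
DB_HOST=db.example.org
DB_PORT=3306
DB_USER=loader
DB_PASSWORD=change-me
DB_NAME=meteorology
```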
Validation:
```
mysql [connection flags] meteorology < sql/validate_import.sql
```

Results are documented in `output/qc_reports/validation_report.md`.
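Individual checks can also be run by hand. The queries below are illustrative, not necessarily those in `sql/validate_import.sql`, and they assume the `stations` table also carries a `station_id` column:

```sql
-- Row count in the hourly table (expected: 255,569,728)
SELECT COUNT(*) FROM meteo_br_1hr;

-- Referential integrity without FK constraints: expected 0 orphan station_ids
SELECT COUNT(*)
FROM meteo_br_1hr o
LEFT JOIN stations s ON s.station_id = o.station_id
WHERE s.station_id IS NULL;
```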
After a full run, output/ contains:
```
output/
├── stations/
│   ├── station_catalogue.csv       21,395 stations with metadata
│   ├── metadata_conflicts.csv      231 stations with coordinate/elevation spread
│   └── catalogue_summary.md
├── networks/
│   ├── icea/                       30.4M obs, 176 stations
│   ├── inmet_subdiario/            68.0M obs, 629 stations
│   ├── inmet_diario/               7.8M obs, 589 stations
│   ├── telemetria/                 143.6M obs, 2,818 stations
│   ├── cemaden/                    13.6M obs, 5,014 stations
│   └── hidroweb_diario/            133.2M obs, 12,051 stations
├── qc_reports/
│   ├── unparseable_files_report.md 37 files, Encoding 41 issue
│   ├── indefinido_investigation.md 42 stations resolved
│   ├── utc_verification.md         Diurnal cycle confirmation
│   └── validation_report.md        10/10 checks pass
└── import/                         Generated SQL files (gitignored)
```
The pipeline loads data into the existing meteorology database, which also serves Portuguese observations via the Atlantic Cloud API.
stations — Extended with 6 new columns: country (CHAR(2), 'PT' or 'BR'), state, elevation_m, responsible, utc_offset, temporal_resolution_min. 21,395 Brazilian stations inserted alongside 349 existing Portuguese stations.
meteo_br_1hr — 255,569,728 hourly observations. Composite PK (station_id, timestamp). RANGE partitioned by year. Secondary index on (state, timestamp).
meteo_br_daily — 141,076,865 daily observations. Same structure.
Both use DATETIME (not TIMESTAMP) to support pre-1970 data. The state column is deliberately denormalised from the stations table to speed state-filtered queries — see docs/schema_design.md for the rationale.
No foreign key constraints (MySQL limitation on partitioned tables). Referential integrity verified: zero orphan station_ids.

Observation table columns (`meteo_br_1hr` and `meteo_br_daily`):

| Column | Type | Description |
|---|---|---|
| station_id | VARCHAR(45) | Station identifier (= gauge_code from source). |
| timestamp | DATETIME | Observation time (UTC). |
| precipitation_accum_mm | FLOAT UNSIGNED | Rainfall in mm for the interval. |
| qc_flag | VARCHAR(30) | PASS or SUSPECT_*. |
| flagged | TINYINT(1) | 0 = PASS, 1 = SUSPECT. |
| state | CHAR(2) | Brazilian state code (denormalised). |
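Taken together, the hourly table's definition is roughly as follows. This is an illustrative reconstruction with invented partition names; the authoritative DDL is `sql/create_br_observations.sql`.

```sql
-- Illustrative reconstruction, not the shipped DDL.
CREATE TABLE meteo_br_1hr (
    station_id             VARCHAR(45)   NOT NULL,
    timestamp              DATETIME      NOT NULL,  -- DATETIME, not TIMESTAMP: pre-1970 data
    precipitation_accum_mm FLOAT UNSIGNED,
    qc_flag                VARCHAR(30),
    flagged                TINYINT(1),
    state                  CHAR(2),                 -- denormalised for state-filtered queries
    PRIMARY KEY (station_id, timestamp),
    KEY idx_state_ts (state, timestamp)
) PARTITION BY RANGE (YEAR(timestamp)) (
    -- one partition per year range in practice; two shown for illustration
    PARTITION p2024 VALUES LESS THAN (2025),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
```

MySQL forbids foreign keys on partitioned tables, which is why integrity is verified with post-load queries instead of constraints.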

Documentation:

| Document | Location |
|---|---|
| QC criteria (v1.0) | docs/qc_criteria.md |
| Database schema design | docs/schema_design.md |
| Field mapping (early draft, partially superseded) | docs/field_mapping.md |
| Processing log | docs/processing_log.md |
| Validation report | output/qc_reports/validation_report.md |
| Unparseable files investigation | output/qc_reports/unparseable_files_report.md |
| Indefinido station resolution | output/qc_reports/indefinido_investigation.md |
| UTC verification | output/qc_reports/utc_verification.md |
| Per-network processing logs | output/networks/{network}/{network}_processing_log.md |
| Per-network QC summaries | output/networks/{network}/{network}_qc_summary.md |
| DDL | sql/create_br_observations.sql |
| Validation queries | sql/validate_import.sql |
Known limitations:

- 37 unparseable files (1.3% of the dataset). All pre-1968, using Parquet Encoding 41, which Parquet2.jl v0.2.33 does not support. These could be recovered with PyArrow. Documented in `output/qc_reports/unparseable_files_report.md`.
- No spatial consistency checks. QC is per-station only. See `docs/qc_criteria.md`, Section 13.
- No gap-filling or interpolation. Missing data remains missing.
- Metadata conflicts in 231 stations. The most recent metadata was used. Documented in `output/stations/metadata_conflicts.csv`.
- `n_intervals` and `expected_intervals` are not stored in the database. They were dropped during import for storage efficiency at 397M rows; the `SUSPECT_INCOMPLETE_HOUR` flag captures the key information, and the original CSVs preserve the full detail.
- FLOAT precision. MySQL FLOAT carries 7 significant digits, matching the existing `meteo_1hr` table. No practical loss for rainfall in mm.
```
UNIPLU-BR/
├── docs/
│   ├── qc_criteria.md              QC rules and justifications (v1.0)
│   ├── schema_design.md            Database design decisions (v1.0)
│   ├── field_mapping.md            Source→target mapping (partially superseded)
│   └── processing_log.md           Chronological decision log
├── src/
│   ├── reader.jl                   ZIP/Parquet reader module
│   ├── qc.jl                       Shared QC functions (310 tests)
│   ├── aggregation.jl              Sub-hourly → hourly aggregation
│   └── stations.jl                 Station catalogue utilities
├── scripts/
│   ├── 00–02                       Stage B: catalogue and investigations
│   ├── 03–08                       Stage C: per-network processing
│   └── 09–11                       Stage D: database import
├── test/
│   └── runtests.jl                 310 tests (QC, aggregation, reader)
├── sql/
│   ├── create_br_observations.sql  DDL for observation tables
│   └── validate_import.sql         Post-import validation queries
├── output/                         Pipeline outputs (.md tracked, .csv/.sql gitignored)
├── README.md                       This file
├── LICENSE                         MIT (code only)
└── Project.toml                    Julia dependencies
```
The UNIPLU-BR dataset is licensed under CC-BY 4.0. All outputs that include or derive from the dataset must cite the original authors.
Required Citation:
Lemos, F. C. et al. (2026). Unified Brazilian Rainfall Dataset (UNIPLU-BR): A Standardized National Database of Point Precipitation from Major Brazilian Monitoring Networks [Data set]. Zenodo. https://doi.org/10.5281/zenodo.18883358
Companion Paper:
Das Neves Almeida, C. et al. (2025). The design of the Brazilian Sub-Daily Rainfall dataset (BR-SDR): two decades of high-time-resolution data in Brazil. Hydrological Sciences Journal, 70(11), 1850–1862. https://doi.org/10.1080/02626667.2025.2506193
QC Methodology References:
Meira, M. A. et al. (2022). Quality control procedures for sub-hourly rainfall data. Journal of Hydrology, 613(A), 128358.
Vidal-Barbosa, J. L. et al. (2025). BRain-D. Atmospheric Research, 108552.
WMO (1994). Guide to Hydrological Practices, 5th edition. WMO-No. 168.
Code in this repository is MIT licensed. The UNIPLU-BR dataset is CC-BY 4.0 by Lemos, F. C. et al. (2026).