UNIPLU-BR

Standardisation, quality control, and database import of the UNIPLU-BR (Unified Brazilian Rainfall Dataset) for the AIR Centre Atlantic Cloud API.

Author: João Pinelo (joao.pinelo@aircentre.org), AIR Centre. Status: Complete. 396,646,593 observations from 21,277 stations loaded into the production MySQL database. Validated.


What This Project Does

The UNIPLU-BR dataset consolidates precipitation records from six Brazilian monitoring networks into 2,885 ZIP files partitioned by state and year. It covers 21,395 stations, 911 million observations, and spans 1885–2025 — the most comprehensive unified rainfall dataset for Brazil to date. The source authors explicitly state that no quality control was performed on the data.

This pipeline reads the raw dataset, applies per-network quality control, aggregates sub-hourly data to hourly where appropriate, and loads the results into the Atlantic Cloud MySQL database alongside existing Portuguese meteorological data.
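The aggregation step can be sketched as follows. This is an illustrative simplification, not the actual `src/aggregation.jl` code: sub-hourly totals are summed into the hour they fall in, and hours with fewer readings than expected are flagged `SUSPECT_INCOMPLETE_HOUR` rather than dropped (the `expected_intervals` default of 6 assumes 10-minute data).

```julia
using Dates

# Sum sub-hourly rainfall totals into hourly buckets; flag hours with
# fewer readings than expected instead of discarding them.
function aggregate_hourly(timestamps::Vector{DateTime}, values::Vector{Float64};
                          expected_intervals::Int = 6)
    buckets = Dict{DateTime,Tuple{Float64,Int}}()
    for (t, v) in zip(timestamps, values)
        h = floor(t, Hour)                       # hour the reading belongs to
        total, n = get(buckets, h, (0.0, 0))
        buckets[h] = (total + v, n + 1)
    end
    # Emit (hour, total_mm, qc_flag) tuples sorted by time.
    return [(h, total, n == expected_intervals ? "PASS" : "SUSPECT_INCOMPLETE_HOUR")
            for (h, (total, n)) in sort!(collect(buckets); by = first)]
end
```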

Pipeline Summary

| Stage | Input | Output |
|---|---|---|
| Raw dataset | | 2,885 ZIP files, 911M observations |
| QC + aggregation | 911M raw observations | 396,646,593 clean observations (FAIL rows removed, SUSPECT rows flagged) |
| Database import | 396,646,593 observations, 21,395 stations | meteo_br_1hr (255.6M rows), meteo_br_daily (141.1M rows), stations (+21,395 rows) |

What Was Removed and Why

| Category | Rows Removed | Reason |
|---|---|---|
| Sub-hourly aggregation | ~514M → ~256M (hourly) | CEMADEN 10-min and Telemetria sub-hourly summed to hourly totals |
| FAIL_DUPLICATE | ~511K | Duplicate (station, timestamp) pairs in Telemetria and CEMADEN |
| FAIL_ABSOLUTE_UPPER_BOUND | ~4,703 | Rainfall exceeding 2× WMO world records for the interval duration |
| FAIL_NEGATIVE | 0 | No negative values found |
| FAIL_MISSING_TIMESTAMP | 0 | No missing timestamps found |
| FAIL_FUTURE_TIMESTAMP | 0 | No future timestamps found |
| FAIL_ORPHAN_RECORD | 0 | No orphan records found |
| Excluded stations | 7 | Indefinido stations with insufficient data to determine resolution |
| Unparseable files | 37/2,885 | Pre-1968 files using Parquet Encoding 41, unsupported by Parquet2.jl |
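The absolute upper bound rule can be sketched as below. The record values here are illustrative placeholders for the WMO world records (roughly 305 mm in one hour, 1,825 mm in 24 hours); the thresholds the pipeline actually uses are documented in docs/qc_criteria.md.

```julia
# Illustrative FAIL_ABSOLUTE_UPPER_BOUND check: a reading fails if it exceeds
# twice the world-record rainfall for its interval duration.
const WORLD_RECORD_MM = Dict(60 => 305.0, 1440 => 1825.0)  # minutes => mm

function absolute_upper_bound_flag(value_mm::Float64, interval_min::Int)
    record = WORLD_RECORD_MM[interval_min]
    return value_mm > 2 * record ? "FAIL_ABSOLUTE_UPPER_BOUND" : "PASS"
end
```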

What Was Flagged (retained in database with flagged = true)

| Flag | Count | % of Output |
|---|---|---|
| SUSPECT_INCOMPLETE_HOUR | 4,903,785 | 1.24 |
| SUSPECT_FLAT_STATION | 1,782,864 | 0.45 |
| SUSPECT_DRY_SPELL | 1,472,926 | 0.37 |
| SUSPECT_STATISTICAL_OUTLIER | 1,414,073 | 0.36 |
| SUSPECT_CONSTANT_RAIN | 771,006 | 0.19 |
| SUSPECT_ISOLATED_SPIKE | 186,711 | 0.05 |
| SUSPECT_RAPID_OSCILLATION (CEMADEN only) | | <0.01 |
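As an example of the flag-don't-remove policy, an isolated-spike check can be sketched like this. The 40 mm threshold is an illustrative assumption; the actual criteria are in docs/qc_criteria.md. A single large value with zero rainfall immediately before and after is physically suspicious, so it is flagged but retained.

```julia
# Illustrative SUSPECT_ISOLATED_SPIKE check: flag a large value whose
# immediate neighbours are both zero; the value itself is kept.
function isolated_spike_flags(values::Vector{Float64}; spike_mm::Float64 = 40.0)
    flags = fill("PASS", length(values))
    for i in 2:(length(values) - 1)
        if values[i] >= spike_mm && values[i-1] == 0.0 && values[i+1] == 0.0
            flags[i] = "SUSPECT_ISOLATED_SPIKE"
        end
    end
    return flags
end
```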

Networks

| Network | Stations | Raw Resolution | Output Resolution | Period | Output Rows |
|---|---|---|---|---|---|
| Hidroweb diário | 12,051 | Daily | Daily | 1885–2025 | 133,230,051 |
| Telemetria | 2,818 | 10/15/30/60 min | Hourly | 2014–2025 | 143,570,214 |
| INMET subdiário | 629 | 60 min | Hourly | 2000–2025 | 68,043,387 |
| ICEA | 176 | Mixed | Hourly + 1 daily | 1951–2025 | 30,357,398 |
| CEMADEN | 5,014 | 10 min | Hourly | 2014–2025 | 13,643,755 |
| INMET diário | 589 | Daily | Daily | 1889–2025 | 7,801,788 |
| Total | 21,277 | | | | 396,646,593 |

Reproducing the Pipeline

Prerequisites

  • Julia ≥ 1.10 with packages: Parquet2 (v0.2.33), DataFrames, ZipFile.
  • MySQL CLI (mysql command) with access to the target database.
  • macOS with bash/zsh. Scripts use the global keyword for Julia 1.10 scoping.
  • UNIPLU-BR dataset downloaded from Zenodo and extracted. Set the data path in each processing script (default: ../Brazilian_Unified_Rainfall_Dataset_(1885-2025)/ relative to repo root).
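One way to set up the Julia environment before running the scripts, assuming the repository's Project.toml pins the dependencies listed above (this is environment configuration, not part of the pipeline itself):

```julia
using Pkg

Pkg.activate(@__DIR__)   # activate the repo's Project.toml
Pkg.instantiate()        # install the pinned dependencies (Parquet2, DataFrames, ZipFile)
```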

Step-by-Step

Scripts are numbered in execution order. Each is self-contained and can be re-run independently.

Stage B — Data access and station catalogue:

julia scripts/00_build_station_catalogue.jl    # → output/stations/station_catalogue.csv (21,395 stations)
julia scripts/01_investigate_unparseable.jl     # → output/qc_reports/unparseable_files_report.md
julia scripts/02_investigate_indefinido_utc.jl  # → output/qc_reports/indefinido_investigation.md, utc_verification.md

Stage C — Per-network QC and aggregation (order does not matter):

julia scripts/03_process_icea.jl               # → output/networks/icea/
julia scripts/04_process_inmet_subdiario.jl     # → output/networks/inmet_subdiario/
julia scripts/05_process_inmet_diario.jl        # → output/networks/inmet_diario/
julia scripts/06_process_telemetria.jl          # → output/networks/telemetria/
julia scripts/07_process_cemaden.jl             # → output/networks/cemaden/
julia scripts/08_process_hidroweb_diario.jl     # → output/networks/hidroweb_diario/

Each produces three files: {network}_observations.csv, {network}_qc_summary.md, {network}_processing_log.md.

Stage D — Database import:

julia scripts/09_import_stations.jl            # → output/import/stations_import.sql
julia scripts/10_transform_observations.jl     # → output/import/*.sql (batched INSERT files, ~20 GB total)
bash  scripts/11_load_observations.sh          # Loads stations + observations into MySQL

The transform script takes ~20 minutes (I/O bound). The load script takes ~5 hours at ~24K rows/second to a remote database. Copy .env.example to .env and set your database connection details before running the load.

Validation:

mysql [connection flags] meteorology < sql/validate_import.sql

Results documented in output/qc_reports/validation_report.md.

Expected Outputs

After a full run, output/ contains:

output/
├── stations/
│   ├── station_catalogue.csv          21,395 stations with metadata
│   ├── metadata_conflicts.csv         231 stations with coordinate/elevation spread
│   └── catalogue_summary.md
├── networks/
│   ├── icea/                          30.4M obs, 176 stations
│   ├── inmet_subdiario/               68.0M obs, 629 stations
│   ├── inmet_diario/                   7.8M obs, 589 stations
│   ├── telemetria/                   143.6M obs, 2,818 stations
│   ├── cemaden/                       13.6M obs, 5,014 stations
│   └── hidroweb_diario/             133.2M obs, 12,051 stations
├── qc_reports/
│   ├── unparseable_files_report.md    37 files, Encoding 41 issue
│   ├── indefinido_investigation.md    42 stations resolved
│   ├── utc_verification.md            Diurnal cycle confirmation
│   └── validation_report.md           10/10 checks pass
└── import/                            Generated SQL files (gitignored)

Database Schema

The pipeline loads data into the existing meteorology database, which also serves Portuguese observations via the Atlantic Cloud API.

Tables Modified

stations — Extended with 6 new columns: country (CHAR(2), 'PT' or 'BR'), state, elevation_m, responsible, utc_offset, temporal_resolution_min. 21,395 Brazilian stations inserted alongside 349 existing Portuguese stations.

Tables Created

meteo_br_1hr — 255,569,728 hourly observations. Composite PK (station_id, timestamp). RANGE partitioned by year. Secondary index on (state, timestamp).

meteo_br_daily — 141,076,865 daily observations. Same structure.

Both use DATETIME (not TIMESTAMP) to support pre-1970 data. The state column is deliberately denormalised from the stations table to speed up state-based queries; see docs/schema_design.md for the rationale.

No foreign key constraints (MySQL limitation on partitioned tables). Referential integrity verified: zero orphan station_ids.

Observation Columns

| Column | Type | Description |
|---|---|---|
| station_id | VARCHAR(45) | Station identifier (= gauge_code from source). |
| timestamp | DATETIME | Observation time (UTC). |
| precipitation_accum_mm | FLOAT UNSIGNED | Rainfall in mm for the interval. |
| qc_flag | VARCHAR(30) | PASS or SUSPECT_*. |
| flagged | TINYINT(1) | 0 = PASS, 1 = SUSPECT. |
| state | CHAR(2) | Brazilian state code (denormalised). |
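Since FAIL rows never reach the database, the flagged bit is fully derivable from qc_flag. A minimal sketch of the mapping (column names as in the table above; this is illustrative, not the transform script's code):

```julia
# Derive the flagged column from qc_flag: 1 for any SUSPECT_* flag, 0 for PASS.
flagged_bit(qc_flag::AbstractString) = startswith(qc_flag, "SUSPECT") ? 1 : 0
```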

Documentation

| Document | Location |
|---|---|
| QC criteria (v1.0) | docs/qc_criteria.md |
| Database schema design | docs/schema_design.md |
| Field mapping (early draft, partially superseded) | docs/field_mapping.md |
| Processing log | docs/processing_log.md |
| Validation report | output/qc_reports/validation_report.md |
| Unparseable files investigation | output/qc_reports/unparseable_files_report.md |
| Indefinido station resolution | output/qc_reports/indefinido_investigation.md |
| UTC verification | output/qc_reports/utc_verification.md |
| Per-network processing logs | output/networks/{network}/{network}_processing_log.md |
| Per-network QC summaries | output/networks/{network}/{network}_qc_summary.md |
| DDL | sql/create_br_observations.sql |
| Validation queries | sql/validate_import.sql |

Known Limitations

  1. 37 unparseable files (1.3% of dataset). All pre-1968, Parquet Encoding 41 unsupported by Parquet2.jl v0.2.33. Could be recovered with PyArrow. Documented in output/qc_reports/unparseable_files_report.md.

  2. No spatial consistency checks. QC is per-station only. See docs/qc_criteria.md Section 13.

  3. No gap-filling or interpolation. Missing data remains missing.

  4. Metadata conflicts in 231 stations. Most recent metadata used. Documented in output/stations/metadata_conflicts.csv.

  5. n_intervals and expected_intervals not in database. Dropped during import for storage efficiency at 397M rows. The SUSPECT_INCOMPLETE_HOUR flag captures the key information. Original CSVs preserve the full detail.

  6. FLOAT precision. MySQL FLOAT (7 significant digits), matching the existing meteo_1hr table. No practical loss for rainfall in mm.
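On point 6: MySQL FLOAT is IEEE-754 single precision, i.e. Julia's Float32, which carries roughly 7 significant decimal digits. The round-trip error for rainfall magnitudes is negligible, as a quick check shows:

```julia
# Relative error from storing a rainfall total as single precision.
x = 1234.5678                            # an extreme total, in mm
err = abs(Float64(Float32(x)) - x) / x   # bounded by eps(Float32)/2 ≈ 6e-8
```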


Repository Structure

UNIPLU-BR/
├── docs/
│   ├── qc_criteria.md              QC rules and justifications (v1.0)
│   ├── schema_design.md            Database design decisions (v1.0)
│   ├── field_mapping.md            Source→target mapping (partially superseded)
│   └── processing_log.md           Chronological decision log
├── src/
│   ├── reader.jl                   ZIP/Parquet reader module
│   ├── qc.jl                       Shared QC functions (310 tests)
│   ├── aggregation.jl              Sub-hourly → hourly aggregation
│   └── stations.jl                 Station catalogue utilities
├── scripts/
│   ├── 00–02                       Stage B: catalogue and investigations
│   ├── 03–08                       Stage C: per-network processing
│   └── 09–11                       Stage D: database import
├── test/
│   └── runtests.jl                 310 tests (QC, aggregation, reader)
├── sql/
│   ├── create_br_observations.sql  DDL for observation tables
│   └── validate_import.sql         Post-import validation queries
├── output/                         Pipeline outputs (.md tracked, .csv/.sql gitignored)
├── README.md                       This file
├── LICENSE                         MIT (code only)
└── Project.toml                    Julia dependencies

Data Source and Attribution

The UNIPLU-BR dataset is licensed under CC-BY 4.0. All outputs that include or derive from the dataset must cite the original authors.

Required Citation:

Lemos, F. C. et al. (2026). Unified Brazilian Rainfall Dataset (UNIPLU-BR): A Standardized National Database of Point Precipitation from Major Brazilian Monitoring Networks [Data set]. Zenodo. https://doi.org/10.5281/zenodo.18883358

Companion Paper:

Das Neves Almeida, C. et al. (2025). The design of the Brazilian Sub-Daily Rainfall dataset (BR-SDR): two decades of high-time-resolution data in Brazil. Hydrological Sciences Journal, 70(11), 1850–1862. https://doi.org/10.1080/02626667.2025.2506193

QC Methodology References:

Meira, M. A. et al. (2022). Quality control procedures for sub-hourly rainfall data. Journal of Hydrology, 613(A), 128358.

Vidal-Barbosa, J. L. et al. (2025). BRain-D. Atmospheric Research, 108552.

WMO (1994). Guide to Hydrological Practices, 5th edition. WMO-No. 168.


Licence

Code in this repository is MIT licensed. The UNIPLU-BR dataset is CC-BY 4.0 by Lemos, F. C. et al. (2026).
