wikidata-cache

Streaming Rust filter for Wikidata JSON data dumps. Extracts music-relevant entities (artists, bands, record labels) and writes flat CSV files, then loads them into PostgreSQL to create the wikidata-cache database.

Analogous to discogs-xml-converter for Discogs data.

Usage

wikidata-cache exposes the standard WXYC cache-builder CLI shape: a build subcommand that produces the cache (CSV files), and an import subcommand that loads them into PostgreSQL.

# Filter the full Wikidata dump (~130GB gzipped, ~3 hours)
wikidata-cache build latest-all.json.gz --data-dir /path/to/csv/

# Limit entities for testing
wikidata-cache build latest-all.json.gz --data-dir /tmp/test/ --limit 1000

# Adjust progress logging interval
wikidata-cache build latest-all.json.gz --data-dir /path/to/csv/ --progress-interval 500000

Gzipped input is auto-detected by .gz extension.

`build` options

Flag	Description
`--data-dir DIR`	Working directory for CSV files (default: `./data`)
`--limit N`	Stop after N entities, 0 = no limit (default: 0)
`--progress-interval N`	Log progress every N entities (default: 1000000)
`--gzip`	Force gzip decompression (auto-detected for `.gz` files)
`--resume`, `--state-file FILE`	Standard cache-builder flags (currently no-ops; the streaming filter is idempotent)

--output-dir is accepted as a deprecated alias for --data-dir; passing it prints a stderr warning. The alias will be removed in a future release.

What gets filtered

An entity is included if it has any of these properties:

Property	Meaning	Example
P1953	Discogs artist ID	Autechre has Discogs ID 12
P1902	Discogs label ID	Warp Records has Discogs label ID 23528
P106	Occupation = musician, singer, DJ, composer, etc.
P31	Instance of musical group, band, or record label

From matched entities, the filter also extracts: P737 (influenced by), P136 (genre), P264 (record label), P749 (parent organization), P434 (MusicBrainz artist ID), P2850 (Apple Music artist ID), P3283 (Bandcamp profile ID), and English aliases.

CSV Output

Produces 8 CSV files:

File	Columns	Description
`entity.csv`	qid, label, description, entity_type	Core entity metadata
`discogs_mapping.csv`	qid, property, discogs_id	External IDs (P1953 Discogs, P1902 Spotify, P434 MusicBrainz, P2850 Apple Music, P3283 Bandcamp)
`influence.csv`	source_qid, target_qid	P737 influence relationships
`genre.csv`	entity_qid, genre_qid	P136 genre claims
`record_label.csv`	artist_qid, label_qid	P264 record label claims
`label_hierarchy.csv`	child_qid, parent_qid	P749 parent organization
`entity_alias.csv`	qid, alias	English language aliases
`occupation.csv`	entity_qid, occupation_qid	P106 occupation claims

Import into PostgreSQL

The import subcommand loads the CSV output directly into PostgreSQL:

# Import CSVs into PostgreSQL (creates schema, imports data, runs VACUUM)
wikidata-cache import --data-dir /path/to/csv/ --database-url 'host=localhost dbname=wikidata user=wikidata password=wikidata'

# Or use the DATABASE_URL_WIKIDATA environment variable
export DATABASE_URL_WIKIDATA='host=localhost dbname=wikidata user=wikidata password=wikidata'
wikidata-cache import --data-dir /path/to/csv/

# Drop and recreate schema before importing
wikidata-cache import --data-dir /path/to/csv/ --database-url '...' --fresh

--csv-dir is accepted as a deprecated alias for --data-dir and will be removed in a future release.

The import:

Creates the schema (idempotent with IF NOT EXISTS)
Sets tables to UNLOGGED for faster bulk loading
Truncates existing data
Streams each CSV via PostgreSQL COPY in FK order
Restores tables to LOGGED
Runs VACUUM FULL

PostgreSQL Schema

The schema is defined in schema/create_database.sql and includes pg_trgm indexes for fuzzy text search on entity labels and aliases. See that file for the full DDL.

Performance

Processing is parallelized across all CPU cores:

A reader thread decompresses the gzipped input on the fly and reads line by line
A rayon worker pool parses JSON and filters entities in parallel
The main thread writes matched entities to CSV sequentially, preserving order

Expected throughput: ~10K entities/second. The full Wikidata dump (~110M entities) takes roughly 3 hours. Output is ~500MB-1GB of CSV for ~2-5M music entities.

Building

Requires the Rust toolchain.

cargo build --release
cargo install --path .

Testing

cargo test --lib        # unit tests only (no external deps)
cargo test --test cli_tests    # CLI integration tests

# Import integration tests require PostgreSQL:
docker compose up -d    # starts PostgreSQL on port 5435
cargo test --test import_test

# Run everything:
docker compose up -d && cargo test

Unit and CLI tests use hand-written JSON fixtures; no external data dumps needed. Import tests require a PostgreSQL instance (provided by docker-compose.yml on port 5435).

Full Pipeline

# 1. Download the Wikidata dump
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz

# 2. Filter to music entities
wikidata-cache build latest-all.json.gz --data-dir /path/to/csv/

# 3. Load into PostgreSQL
wikidata-cache import --data-dir /path/to/csv/ --database-url 'host=localhost dbname=wikidata user=wikidata password=wikidata' --fresh

Data source

Wikidata JSON dumps are published weekly at https://dumps.wikimedia.org/wikidatawiki/entities/. The latest dump is available at latest-all.json.gz (~130GB).

Name		Name	Last commit message	Last commit date
Latest commit History 49 Commits
.github/workflows		.github/workflows
migrations		migrations
schema		schema
src		src
tests		tests
vendor/wxyc-etl		vendor/wxyc-etl
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md
docker-compose.yml		docker-compose.yml
wxyc-etl-pin.txt		wxyc-etl-pin.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

wikidata-cache

Usage

`build` options

What gets filtered

CSV Output

Import into PostgreSQL

PostgreSQL Schema

Performance

Building

Testing

Full Pipeline

Data source

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

wikidata-cache

Usage

build options

What gets filtered

CSV Output

Import into PostgreSQL

PostgreSQL Schema

Performance

Building

Testing

Full Pipeline

Data source

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`build` options

Packages