Builds a semantic artist graph from WXYC DJ transition data. When DJs curate transitions between artists during their shows, those adjacency relationships encode latent genre, mood, and style similarity. This project extracts that signal using Pointwise Mutual Information (PMI) and produces a graph for visualization and downstream use.
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
python run_pipeline.py /path/to/wxycmusic.sql

This parses the SQL dump, computes PMI for all artist co-occurrences, extracts cross-reference edges from the catalog, and writes a GEXF graph and a SQLite database to output/.
python run_pipeline.py <dump_path> [--output-dir DIR] [--min-count N] [--no-sqlite] [--db-path PATH] [--entity-source {local,lml}]

| Flag | Default | Description |
|---|---|---|
| `--output-dir` | `output/` | Directory for output files |
| `--min-count` | `2` | Minimum co-occurrence count for graph edges |
| `--no-sqlite` | disabled | Skip SQLite database export |
| `--db-path` | none | Path to pipeline SQLite database with persistent identity resolution |
| `--entity-source` | `local` (most cases); required when both `--db-path` and `--discogs-cache-dsn` are set | `local` skips LML; `lml` reads from LML's `entity.identity` PG table and requires both `--discogs-cache-dsn` and `--db-path` (the destination for imported identities). `lml` fails loudly if PG is unreachable, the DSN is missing, or `--db-path` is missing; pass `--entity-source=local` to bypass. The pipeline refuses to start when both `--db-path` and `--discogs-cache-dsn` are set without an explicit `--entity-source` choice (pre-PR #184 that combo silently triggered LML import; post-PR #184 it would silently skip LML, so the operator must pick). |

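
The `--entity-source` rules described above can be read as a small validation step that runs before the pipeline starts. The following is a minimal sketch of that reading, not the pipeline's actual code: the function name `resolve_entity_source`, the use of `None` for an unset flag, and the error messages are all assumptions made for illustration. The PG reachability check is left out.

```python
from typing import Optional


def resolve_entity_source(
    entity_source: Optional[str],
    db_path: Optional[str],
    discogs_cache_dsn: Optional[str],
) -> str:
    """Hypothetical helper mirroring the documented --entity-source rules."""
    if entity_source is None:
        # Both flags set with no explicit choice: refuse to start.
        if db_path and discogs_cache_dsn:
            raise SystemExit(
                "--db-path and --discogs-cache-dsn are both set; "
                "pass --entity-source=local or --entity-source=lml"
            )
        return "local"  # documented default in all other cases
    if entity_source == "lml":
        # lml needs both the PG DSN and the destination SQLite path.
        missing = [
            flag
            for flag, value in (
                ("--discogs-cache-dsn", discogs_cache_dsn),
                ("--db-path", db_path),
            )
            if not value
        ]
        if missing:
            raise SystemExit("--entity-source=lml requires " + ", ".join(missing))
    elif entity_source != "local":
        raise SystemExit(f"unknown --entity-source: {entity_source}")
    return entity_source
```

Under this reading, leaving `--entity-source` unset is only an error in the one ambiguous combination; every other invocation falls back to `local`.
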
- Parse the tubafrenzy MySQL dump directly (no database required)
- Resolve artist names via the library catalog FK chain (LIBRARY_RELEASE → LIBRARY_CODE)
- Extract consecutive artist pairs within each radio show
- Compute PMI: `log2(P(a,b) / (P(a) * P(b)))`. High PMI means two artists appear together more than chance predicts
- Extract cross-reference edges from catalog tables (LIBRARY_CODE_CROSS_REFERENCE, RELEASE_CROSS_REFERENCE)
- Export a GEXF graph loadable in Gephi and a SQLite database for querying
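
The pair-extraction and PMI steps above can be sketched on toy data. This is an illustration under stated assumptions rather than the pipeline's implementation: the function names are hypothetical, pairs are treated as unordered, back-to-back plays of the same artist are skipped, `P(a)` is estimated as an artist's share of all plays, and `P(a,b)` as a pair's share of all consecutive pairs. The show lists are invented.

```python
import math
from collections import Counter


def extract_pairs(shows):
    """Yield unordered consecutive artist pairs within each show."""
    for show in shows:
        for a, b in zip(show, show[1:]):
            if a != b:  # assumption: ignore an artist followed by itself
                yield tuple(sorted((a, b)))


def compute_pmi(shows, min_count=2):
    """Return {(a, b): PMI} for pairs seen at least min_count times."""
    plays = Counter(artist for show in shows for artist in show)
    pairs = Counter(extract_pairs(shows))
    total_plays = sum(plays.values())
    total_pairs = sum(pairs.values())
    pmi = {}
    for (a, b), n in pairs.items():
        if n < min_count:
            continue  # same role as --min-count: drop rare co-occurrences
        p_ab = n / total_pairs
        p_a = plays[a] / total_plays
        p_b = plays[b] / total_plays
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


# Invented playlists: each inner list is one show, in play order.
toy_shows = [["A", "B", "C"], ["A", "B", "D"], ["C", "D"]]
toy_pmi = compute_pmi(toy_shows, min_count=2)
```

With these toy shows only the pair `("A", "B")` occurs twice, so it is the only edge that survives the `min_count` filter, and its PMI is positive because the two artists co-occur more often than their individual play shares would predict.
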
A read-only FastAPI service that queries the SQLite database produced by the pipeline.
pip install -e ".[api]"
DB_PATH=output/wxyc_artist_graph.db python -m semantic_index.api

| Variable | Default | Description |
|---|---|---|
| `DB_PATH` | `output/wxyc_artist_graph.db` | Path to the SQLite graph database |
| `HOST` | `0.0.0.0` | Host to bind the server to |
| `PORT` | `8000` | Server port |

| Method | Path | Description |
|---|---|---|
| GET | `/health` | Health check: returns 200 with artist count, or 503 if DB unavailable |
| GET | `/graph/artists/search?q=autechre&limit=10` | Case-insensitive artist name search, ordered by total_plays descending |
| GET | `/graph/artists/{id}/neighbors?type=djTransition&limit=20` | Neighbors by edge type: djTransition, sharedPersonnel, sharedStyle, labelFamily, compilation, crossReference |
| GET | `/graph/artists/{id}/explain/{target_id}` | All relationship types between two artists with weights and details |

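
A small client for the endpoints above might look like the following sketch. It uses only the standard library and assumes the service is running locally on the default port and returns JSON bodies; `BASE_URL`, `build_url`, and `get_json` are hypothetical names, and nothing is assumed about the shape of the responses.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the API is running locally on the default PORT.
BASE_URL = "http://localhost:8000"


def build_url(path: str, **params) -> str:
    """Join BASE_URL, an endpoint path, and optional query parameters."""
    query = f"?{urlencode(params)}" if params else ""
    return f"{BASE_URL}{path}{query}"


def get_json(path: str, **params):
    """GET an endpoint and decode the body as JSON.

    urlopen raises urllib.error.HTTPError on non-2xx responses,
    such as the 503 that /health returns when the DB is unavailable.
    """
    with urlopen(build_url(path, **params), timeout=10) as resp:
        return json.load(resp)
```

For example, `get_json("/graph/artists/search", q="autechre", limit=10)` would issue the search request from the table, and `get_json("/graph/artists/42/neighbors", type="djTransition", limit=20)` would request DJ-transition neighbors for a placeholder artist id of 42.
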
The API is deployed to Railway. Configuration lives in railway.toml:
- Builder: nixpacks (auto-detects Python from `pyproject.toml`)
- Start command: `python -m semantic_index.api`
- Health check: `GET /health` with 300s timeout
- Restart policy: on failure, max 10 retries

Railway sets the PORT environment variable automatically. Set DB_PATH to point to the SQLite database file (e.g. via a Railway volume mount or persistent storage).
The pipeline depends on wxyc-etl, a shared Rust/PyO3 package providing text normalization (to_match_form, is_compilation_artist, split_artist_name) and discogs-cache schema constants. All discogs-cache table names in SQL queries come from wxyc_etl.schema constants rather than hardcoded strings. See CLAUDE.md for the full list of shared functions used.
pytest # unit tests
pytest -m integration # integration tests (needs fixture dump)
ruff check . # lint
ruff format --check . # format check
mypy . # type check

See CLAUDE.md for detailed development patterns, column mappings, and SQLite schema.