Skip to content

Add wxyc_library v2 hook (E1 §4.1.3 cross-cache-identity)#34

Merged
jakebromberg merged 6 commits into
mainfrom
feat/33-wxyc-library-v2
May 10, 2026
Merged

Add wxyc_library v2 hook (E1 §4.1.3 cross-cache-identity)#34
jakebromberg merged 6 commits into
mainfrom
feat/33-wxyc-library-v2

Conversation

@jakebromberg
Copy link
Copy Markdown
Member

@jakebromberg jakebromberg commented May 10, 2026

Summary

  • Lands the consolidated wxyc_library schema per §3.1 on the Homebrew wikidata cache. E1 §4.1.3 of the cross-cache-identity rollout. Schema validated by WXYC/discogs-etl#185; pattern-parallel with WXYC/musicbrainz-cache#48 (sibling Rust+sqlx port).
  • New module src/wxyc_loader.rs writes from library.db to wxyc_library via a single prepared INSERT … VALUES (...) ON CONFLICT (library_id) DO NOTHING executed once per row inside one transaction (small cache, ≤64K rows; per-row INSERT is plenty — switch to multi-VALUES batching only if profiling shows it's worth it). Normalizes via wxyc_etl::text::to_identity_match_form{,_title} (locked-on baseline, NOT to_match_form).
  • Wired as new import-wxyc-library subcommand. Added wxyc_library to import_schema::ALL_TABLES so --fresh rebuilds clean it. 7 pg-marked integration tests (schema, write-every-row, idempotency, snapshot_source CHECK, normalizer pin including Nilüfer Yanya diacritic-fold, NUL-byte stripping on derived norm columns, empty artist/title rejection) plus 1 unit test for minimal-schema library.db and 1 CLI integration test.
  • Per §4.1.3 the cache is small enough to build all indexes inline (no CONCURRENTLY needed); standard sqlx in-transaction migration applies cleanly.

Decisions flagged for review

  1. import-wxyc-library as a new subcommand (rather than embedding in the existing import flow). The existing import is single-purpose for the 8 CSV files; the wxyc-library flow needs a different input (--library-db) and is a separate post-rebuild step.
  2. Reused sibling worktree's PG instance during tests rather than starting our own docker compose up -d (the sibling container already binds :5435 with identical creds). Smoke-tested sqlx migrate run end-to-end against a fresh DB; both migrations apply, all 8 indexes land, second run is a no-op.

Test plan

  • cargo fmt --check — clean
  • cargo clippy --all-targets -- -D warnings -A clippy::manual_is_multiple_of — clean
  • cargo test --lib — unit tests pass
  • cargo test (full suite incl. 7 new integration tests) — all green
  • sqlx migrate run against fresh DB applies; idempotent on re-run

Closes #33

…oses #33

- migrations/0002_wxyc_library_v2.sql: consolidated wxyc_library table per
  §3.1 with all 8 indexes inline (pkey + 5 b-tree + 2 GIN trgm). Per
  §4.1.3 this cache is small enough to skip CONCURRENTLY, keeping the
  migration in a single transaction. Mirrored into schema/create_database.sql
  per the dual-source pattern in CLAUDE.md.
- src/wxyc_loader.rs: reads SQLite library.db, normalizes via the
  wxyc-etl 0.3.0 identity baseline (`to_identity_match_form` for artist +
  label, `to_identity_match_form_title` for title), and INSERTs with
  ON CONFLICT (library_id) DO NOTHING. snapshot_source is validated at the
  loader boundary against the §3.1 CHECK set {backend|tubafrenzy|llm}.
  artist_id/label_id/format_id/release_year stamp NULL — library.db
  doesn't carry Backend's IDs.
- src/main.rs: new `import-wxyc-library` subcommand with `--library-db`
  and `--snapshot-source` (defaults to `backend`).
- tests/wxyc_library_v2_test.rs: 5 PG-backed integration tests modeled
  after discogs-etl PR #185 — schema lands all 8 indexes, loader writes
  every fixture row, idempotent on re-run, rejects invalid snapshot_source,
  normalizer locked on `to_identity_match_form` (with diacritic-fold pin
  on Nilüfer Yanya). Use the docker compose `wikidata_test` DB on :5435.

Cargo.toml gains rusqlite 0.31 (bundled) — already in the allowed
dep set per the issue brief, sibling musicbrainz-cache uses the same
version. wxyc-etl stays on 0.3.0.
The pre-pr-review hook caught that adding wxyc_library to ALL_TABLES
(used by the bulk-import lifecycle) meant every `wikidata-cache import`
run would silently TRUNCATE the cross-cache identity hook before the
CSV import proceeds, wiping data loaded by the separate
`import-wxyc-library` subcommand.

Split ALL_TABLES into:
- CSV_IMPORT_TABLES (8 entries) — driven by truncate_all,
  set_tables_unlogged, set_tables_logged, vacuum_full
- ALL_TABLES (9 entries, includes wxyc_library) — driven by
  drop_schema (so --fresh still drops everything) and the schema
  validation test

Add a regression test asserting truncate_all preserves wxyc_library
while truncating the CSV tables.
1. NUL-byte stripping bug: wxyc_loader.rs computed norm_artist /
   norm_title / norm_label from the RAW source strings before stripping
   NUL bytes, then stripped only the display columns. A NUL in any
   library.db source would have crashed the INSERT on the norm columns.
   Reordered the loop so strip-then-normalize happens; norm columns
   inherit the PG-safe form.
   Regression test added in tests/wxyc_library_v2_test.rs:
   `test_loader_strips_nul_bytes_from_norm_columns` exercises a row with
   NULs in artist / title / label and asserts no NUL survives in any of
   the six TEXT columns.

2. read_library_db minimal-schema path: only the full-prod schema was
   exercised in integration tests. Added unit test
   `read_library_db_minimal_schema` covering the (id, artist, title)
   minimal-schema branch — the optional-column adapt loop must produce
   None for absent columns, not panic.

3. CLI integration test for `import-wxyc-library`: every other
   subcommand (`build`, `import`) has end-to-end binary invocation
   coverage; the new subcommand had none. Added
   `test_import_wxyc_library_subcommand` mirroring the
   `test_import_subcommand` pattern: builds a library.db fixture in a
   temp dir, invokes `wikidata-cache import-wxyc-library` via
   `assert_cmd::Command`, verifies rows + normalization landed, then
   asserts an invalid `--snapshot-source` is rejected.
The docstring promised that empty-string input would flow through as
NULL (so downstream NULL-aware joins on norm_label behave correctly),
but the implementation just did `.map(to_identity_match_form)` — which
returns Some("") for Some(""), not None. If library.db ever held an
empty-string label (possible on schema drift / bulk-import artifacts),
the loader would write a non-NULL empty norm_label that doesn't match
either an artist's NULL-label peers or any normalized real label.

Fix is one chained `.filter(|s| !s.is_empty())` — also covers the
Some("   ") whitespace-only case that the normalizer collapses to "".
Test pinned: norm_label_drops_empty_string.
Two fixes from the pre-pr-review hook on the wave:

1. The COPY-loop comment claimed "NOT NULL on the staging table will reject the row" if an empty artist/title slipped through. Postgres NOT NULL rejects SQL NULL, not empty strings — so a Some("") would have silently landed with an empty norm_artist column, defeating downstream NULL-aware joins. Replaced the false-safety claim with an explicit `anyhow::bail!` guard and pinned the loud-failure behavior in `loader_rejects_empty_artist_or_title`.

2. `schema/create_indexes.sql` claimed wxyc_library indexes were "mirrored CONCURRENTLY in migrations/0003_wxyc_library_v2.sql". The migration explicitly does NOT use CONCURRENTLY (sqlx implicit transaction; the migration file has a 21-line comment explaining why). Comment now accurately describes the symmetric plain-CREATE-INDEX-IF-NOT-EXISTS behavior on both paths.
Three review-driven changes:

1. **i32 overflow on integer columns must surface as a hard error**, not silently land NULL. `library.id` previously did `row.get::<_, i64>("id")? as i32` (truncating cast); `release_call_number` previously did `.ok().flatten().map(|v| v as i32)` (silent on overflow AND silent on type-mismatch). Replaced both with `i32::try_from` + a `FromSqlConversionFailure` carrying the offending value, mirroring the loud-failure pattern.

2. **`existing_columns` now lowercases the PRAGMA-returned column names.** SQLite identifiers are case-insensitive in DDL but `PRAGMA table_info` returns them as declared. A future `library.db` declared with `Label` instead of `label` would have silently missed the optional-column probe and stayed `None` with no error. The `select_parts` literals are already lowercase, so the comparison side now matches.

3. **Drop dead `let _ = wxyc_loader::ALLOWED_SNAPSHOT_SOURCES;`.** The constant is asserted directly elsewhere; the rebinding existed only to silence an unused-import lint that came from the unnecessary `self,` in the use group. Dropped both.

Tests: 38 unit + 7 pg integration + 24 import + 9 CLI + 5 charset + 5 oracle, all green; cargo fmt clean; clippy clean.
@jakebromberg jakebromberg merged commit 8c5723e into main May 10, 2026
3 checks passed
jakebromberg added a commit that referenced this pull request May 10, 2026
1. NUL-byte stripping bug: wxyc_loader.rs computed norm_artist /
   norm_title / norm_label from the RAW source strings before stripping
   NUL bytes, then stripped only the display columns. A NUL in any
   library.db source would have crashed the INSERT on the norm columns.
   Reordered the loop so strip-then-normalize happens; norm columns
   inherit the PG-safe form.
   Regression test added in tests/wxyc_library_v2_test.rs:
   `test_loader_strips_nul_bytes_from_norm_columns` exercises a row with
   NULs in artist / title / label and asserts no NUL survives in any of
   the six TEXT columns.

2. read_library_db minimal-schema path: only the full-prod schema was
   exercised in integration tests. Added unit test
   `read_library_db_minimal_schema` covering the (id, artist, title)
   minimal-schema branch — the optional-column adapt loop must produce
   None for absent columns, not panic.

3. CLI integration test for `import-wxyc-library`: every other
   subcommand (`build`, `import`) has end-to-end binary invocation
   coverage; the new subcommand had none. Added
   `test_import_wxyc_library_subcommand` mirroring the
   `test_import_subcommand` pattern: builds a library.db fixture in a
   temp dir, invokes `wikidata-cache import-wxyc-library` via
   `assert_cmd::Command`, verifies rows + normalization landed, then
   asserts an invalid `--snapshot-source` is rejected.
@jakebromberg jakebromberg deleted the feat/33-wxyc-library-v2 branch May 10, 2026 19:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[E1 §4.1.3] Homebrew wikidata (port 5432) — wxyc_library v2 hook

1 participant