Add wxyc_library v2 hook (E1 §4.1.3 cross-cache-identity)#34
Merged
Conversation
…oses #33 - migrations/0002_wxyc_library_v2.sql: consolidated wxyc_library table per §3.1 with all 8 indexes inline (pkey + 5 b-tree + 2 GIN trgm). Per §4.1.3 this cache is small enough to skip CONCURRENTLY, keeping the migration in a single transaction. Mirrored into schema/create_database.sql per the dual-source pattern in CLAUDE.md. - src/wxyc_loader.rs: reads SQLite library.db, normalizes via the wxyc-etl 0.3.0 identity baseline (`to_identity_match_form` for artist + label, `to_identity_match_form_title` for title), and INSERTs with ON CONFLICT (library_id) DO NOTHING. snapshot_source is validated at the loader boundary against the §3.1 CHECK set {backend|tubafrenzy|llm}. artist_id/label_id/format_id/release_year stamp NULL — library.db doesn't carry Backend's IDs. - src/main.rs: new `import-wxyc-library` subcommand with `--library-db` and `--snapshot-source` (defaults to `backend`). - tests/wxyc_library_v2_test.rs: 5 PG-backed integration tests modeled after discogs-etl PR #185 — schema lands all 8 indexes, loader writes every fixture row, idempotent on re-run, rejects invalid snapshot_source, normalizer locked on `to_identity_match_form` (with diacritic-fold pin on Nilüfer Yanya). Use the docker compose `wikidata_test` DB on :5435. Cargo.toml gains rusqlite 0.31 (bundled) — already in the allowed dep set per the issue brief, sibling musicbrainz-cache uses the same version. wxyc-etl stays on 0.3.0.
The pre-pr-review hook caught that adding wxyc_library to ALL_TABLES (used by the bulk-import lifecycle) meant every `wikidata-cache import` run would silently TRUNCATE the cross-cache identity hook before the CSV import proceeds, wiping data loaded by the separate `import-wxyc-library` subcommand. Split ALL_TABLES into: - CSV_IMPORT_TABLES (8 entries) — driven by truncate_all, set_tables_unlogged, set_tables_logged, vacuum_full - ALL_TABLES (9 entries, includes wxyc_library) — driven by drop_schema (so --fresh still drops everything) and the schema validation test Add a regression test asserting truncate_all preserves wxyc_library while truncating the CSV tables.
1. NUL-byte stripping bug: wxyc_loader.rs computed norm_artist / norm_title / norm_label from the RAW source strings before stripping NUL bytes, then stripped only the display columns. A NUL in any library.db source would have crashed the INSERT on the norm columns. Reordered the loop so strip-then-normalize happens; norm columns inherit the PG-safe form. Regression test added in tests/wxyc_library_v2_test.rs: `test_loader_strips_nul_bytes_from_norm_columns` exercises a row with NULs in artist / title / label and asserts no NUL survives in any of the six TEXT columns. 2. read_library_db minimal-schema path: only the full-prod schema was exercised in integration tests. Added unit test `read_library_db_minimal_schema` covering the (id, artist, title) minimal-schema branch — the optional-column adapt loop must produce None for absent columns, not panic. 3. CLI integration test for `import-wxyc-library`: every other subcommand (`build`, `import`) has end-to-end binary invocation coverage; the new subcommand had none. Added `test_import_wxyc_library_subcommand` mirroring the `test_import_subcommand` pattern: builds a library.db fixture in a temp dir, invokes `wikidata-cache import-wxyc-library` via `assert_cmd::Command`, verifies rows + normalization landed, then asserts an invalid `--snapshot-source` is rejected.
The docstring promised that empty-string input would flow through as
NULL (so downstream NULL-aware joins on norm_label behave correctly),
but the implementation just did `.map(to_identity_match_form)` — which
returns Some("") for Some(""), not None. If library.db ever held an
empty-string label (possible on schema drift / bulk-import artifacts),
the loader would write a non-NULL empty norm_label that doesn't match
either an artist's NULL-label peers or any normalized real label.
Fix is one chained `.filter(|s| !s.is_empty())` — also covers the
Some(" ") whitespace-only case that the normalizer collapses to "".
Test pinned: norm_label_drops_empty_string.
Two fixes from the pre-pr-review hook on the wave:
1. The COPY-loop comment claimed "NOT NULL on the staging table will reject the row" if an empty artist/title slipped through. Postgres NOT NULL rejects SQL NULL, not empty strings — so a Some("") would have silently landed with an empty norm_artist column, defeating downstream NULL-aware joins. Replaced the false-safety claim with an explicit `anyhow::bail!` guard and pinned the loud-failure behavior in `loader_rejects_empty_artist_or_title`.
2. `schema/create_indexes.sql` claimed wxyc_library indexes were "mirrored CONCURRENTLY in migrations/0003_wxyc_library_v2.sql". The migration explicitly does NOT use CONCURRENTLY (sqlx implicit transaction; the migration file has a 21-line comment explaining why). Comment now accurately describes the symmetric plain-CREATE-INDEX-IF-NOT-EXISTS behavior on both paths.
Three review-driven changes:
1. **i32 overflow on integer columns must surface as a hard error**, not silently land NULL. `library.id` previously did `row.get::<_, i64>("id")? as i32` (truncating cast); `release_call_number` previously did `.ok().flatten().map(|v| v as i32)` (silent on overflow AND silent on type-mismatch). Replaced both with `i32::try_from` + a `FromSqlConversionFailure` carrying the offending value, mirroring the loud-failure pattern.
2. **`existing_columns` now lowercases the PRAGMA-returned column names.** SQLite identifiers are case-insensitive in DDL but `PRAGMA table_info` returns them as declared. A future `library.db` declared with `Label` instead of `label` would have silently missed the optional-column probe and stayed `None` with no error. The `select_parts` literals are already lowercase, so the comparison side now matches.
3. **Drop dead `let _ = wxyc_loader::ALLOWED_SNAPSHOT_SOURCES;`.** The constant is asserted directly elsewhere; the rebinding existed only to silence an unused-import lint that came from the unnecessary `self,` in the use group. Dropped both.
Tests: 38 unit + 7 pg integration + 24 import + 9 CLI + 5 charset + 5 oracle, all green; cargo fmt clean; clippy clean.
jakebromberg
added a commit
that referenced
this pull request
May 10, 2026
1. NUL-byte stripping bug: wxyc_loader.rs computed norm_artist / norm_title / norm_label from the RAW source strings before stripping NUL bytes, then stripped only the display columns. A NUL in any library.db source would have crashed the INSERT on the norm columns. Reordered the loop so strip-then-normalize happens; norm columns inherit the PG-safe form. Regression test added in tests/wxyc_library_v2_test.rs: `test_loader_strips_nul_bytes_from_norm_columns` exercises a row with NULs in artist / title / label and asserts no NUL survives in any of the six TEXT columns. 2. read_library_db minimal-schema path: only the full-prod schema was exercised in integration tests. Added unit test `read_library_db_minimal_schema` covering the (id, artist, title) minimal-schema branch — the optional-column adapt loop must produce None for absent columns, not panic. 3. CLI integration test for `import-wxyc-library`: every other subcommand (`build`, `import`) has end-to-end binary invocation coverage; the new subcommand had none. Added `test_import_wxyc_library_subcommand` mirroring the `test_import_subcommand` pattern: builds a library.db fixture in a temp dir, invokes `wikidata-cache import-wxyc-library` via `assert_cmd::Command`, verifies rows + normalization landed, then asserts an invalid `--snapshot-source` is rejected.
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
wxyc_libraryschema per §3.1 on the Homebrew wikidata cache. E1 §4.1.3 of the cross-cache-identity rollout. Schema validated by WXYC/discogs-etl#185; pattern-parallel with WXYC/musicbrainz-cache#48 (sibling Rust+sqlx port).src/wxyc_loader.rswrites fromlibrary.dbtowxyc_libraryvia a single preparedINSERT … VALUES (...) ON CONFLICT (library_id) DO NOTHINGexecuted once per row inside one transaction (small cache, ≤64K rows; per-row INSERT is plenty — switch to multi-VALUES batching only if profiling shows it's worth it). Normalizes viawxyc_etl::text::to_identity_match_form{,_title}(locked-on baseline, NOTto_match_form).import-wxyc-librarysubcommand. Addedwxyc_librarytoimport_schema::ALL_TABLESso--freshrebuilds clean it. 7 pg-marked integration tests (schema, write-every-row, idempotency, snapshot_source CHECK, normalizer pin including Nilüfer Yanya diacritic-fold, NUL-byte stripping on derived norm columns, empty artist/title rejection) plus 1 unit test for minimal-schema library.db and 1 CLI integration test.CONCURRENTLYneeded); standard sqlx in-transaction migration applies cleanly.Decisions flagged for review
import-wxyc-libraryas a new subcommand (rather than embedding in the existingimportflow). The existingimportis single-purpose for the 8 CSV files; the wxyc-library flow needs a different input (--library-db) and is a separate post-rebuild step.docker compose up -d(the sibling container already binds :5435 with identical creds). Smoke-testedsqlx migrate runend-to-end against a fresh DB; both migrations apply, all 8 indexes land, second run is a no-op.Test plan
cargo fmt --check— cleancargo clippy --all-targets -- -D warnings -A clippy::manual_is_multiple_of— cleancargo test --lib— unit tests passcargo test(full suite incl. 7 new integration tests) — all greensqlx migrate runagainst fresh DB applies; idempotent on re-runCloses #33