9 changes: 9 additions & 0 deletions dev_env/docker-compose.yml
@@ -12,6 +12,11 @@ services:
      - '${DB_PORT:-5432}:5432'
    volumes:
      - pg-data:/var/lib/postgresql/data
      # Required by migration 0076: the wxyc_unaccent text-search dictionary
      # loads its rules from $SHAREDIR/tsearch_data/. Vendored from
      # WXYC/wxyc-etl@v0.4.0; SHA-pinned in wxyc-etl-pin.txt.
      - ../vendor/wxyc-etl/wxyc_unaccent.rules:/usr/local/share/postgresql/tsearch_data/wxyc_unaccent.rules:ro
      - ../vendor/wxyc-etl/wxyc_unaccent.version:/usr/local/share/postgresql/tsearch_data/wxyc_unaccent.version:ro
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U ${DB_USERNAME} -d ${DB_NAME} || pg_isready']
      interval: 2s
@@ -56,6 +61,10 @@ services:
      - '${CI_DB_PORT:-5433}:5432'
    volumes:
      - ci-pg-data:/var/lib/postgresql/data
      # Mirror of the `db` mounts so migration 0076 finds the wxyc_unaccent
      # rules file inside the CI container too.
      - ../vendor/wxyc-etl/wxyc_unaccent.rules:/usr/local/share/postgresql/tsearch_data/wxyc_unaccent.rules:ro
      - ../vendor/wxyc-etl/wxyc_unaccent.version:/usr/local/share/postgresql/tsearch_data/wxyc_unaccent.version:ro
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U ${DB_USERNAME} -d ${DB_NAME} || pg_isready']
      interval: 2s
4 changes: 4 additions & 0 deletions docs/migrations.md
@@ -81,6 +81,10 @@ CREATE UNIQUE INDEX IF NOT EXISTS ...

The same shape covers `NOT NULL` (count `WHERE col IS NULL`), `CHECK` (count rows that violate the predicate), and `FK` (count orphans via `LEFT JOIN ... WHERE referenced.id IS NULL`). This is the same safeguard the codebase already uses on the 0053 + `jobs/flowsheet-dj-name-backfill/` + 0054 chain; generalize it to any constraint-adding migration. Some constraints are provably safe (e.g. a `UNIQUE` index on a freshly-added nullable column, or `NOT NULL` paired with a `DEFAULT`); when no real precondition exists, document the reasoning with a `-- @no-precondition-needed: <reason>` comment so the linter (`scripts/validate-migrations.mjs` Check 8) suppresses its warning. The PR-bot data-shape report (companion #703) catches violations at PR time; the precondition guard is the last line of defense at apply time.
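
As an illustration of the `FK` variant described above, the following is a minimal sketch only. The table and column names (`child_rows`, `parent_rows`, `parent_id`) and the constraint name are hypothetical placeholders, not objects from this codebase, and the block assumes both tables already exist.

```sql
-- Hypothetical schema: child_rows.parent_id should reference parent_rows.id.
DO $$
DECLARE
    orphan_count bigint;
BEGIN
    -- Count child rows whose parent is missing (the LEFT JOIN orphan check).
    SELECT count(*) INTO orphan_count
    FROM child_rows c
    LEFT JOIN parent_rows p ON p.id = c.parent_id
    WHERE c.parent_id IS NOT NULL
      AND p.id IS NULL;

    IF orphan_count > 0 THEN
        RAISE EXCEPTION 'cannot add FK on child_rows.parent_id: % orphan row(s); run the backfill first',
            orphan_count;
    END IF;
END $$;

-- Only reached when the guard above did not raise.
ALTER TABLE child_rows
    ADD CONSTRAINT child_rows_parent_id_fkey
    FOREIGN KEY (parent_id) REFERENCES parent_rows (id);
```

The guard and the `ALTER TABLE` sit in the same migration file, so a raised exception aborts the migration before the constraint is attempted.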

<!-- @rule id=wxyc-etl-vendoring enforced-by=tests/integration/wxyc-identity-match-functions.spec.js added=2026-05-11 -->

**`wxyc_identity_match_*` function vendoring (cross-cache-identity §3.3.5).** The four plpgsql functions (`wxyc_identity_match_artist`, `_title`, `_with_punctuation`, `_with_disambiguator_strip`) are deployed by migration `0076_wxyc-identity-match-functions.sql`. The canonical bodies live in `WXYC/wxyc-etl@v0.4.0` (`data/`) and are vendored byte-for-byte into this repo at `vendor/wxyc-etl/` with SHA pins in `wxyc-etl-pin.txt`. The migration is a generated wrapper: extension + text-search dictionary setup followed by the canonical SQL inlined verbatim (drizzle-kit applies plain SQL files in a single transaction rather than through `psql`, so the `\i` include meta-command isn't available). The dictionary loads its rules from `$SHAREDIR/tsearch_data/wxyc_unaccent.rules`; `dev_env/docker-compose.yml` mounts both `wxyc_unaccent.rules` and `wxyc_unaccent.version` into that directory on `db` and `ci-db`. Refresh procedure: re-vendor from a new wxyc-etl tag, bump the SHAs + version in `wxyc-etl-pin.txt`, generate a follow-on migration with the new content, and rerun `tests/integration/wxyc-identity-match-functions.spec.js`. The column-side flip on `library_identity*` is deliberately downstream (gated on the E2-BS step-2 backfill PR #663); this rule only governs the function-deploy half.
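
After applying the migration (or a refreshed follow-on), a quick manual check along these lines can confirm that the dictionary and the four entry points landed. This is a sketch, not part of the documented refresh procedure; it relies only on the standard `pg_ts_dict` catalog and `to_regprocedure()`, and it assumes the functions are reachable through the session's `search_path`.

```sql
-- Each column should come back true after migration 0076 has been applied.
SELECT
    EXISTS (SELECT 1 FROM pg_ts_dict WHERE dictname = 'wxyc_unaccent')
        AS has_dictionary,
    to_regprocedure('wxyc_identity_match_artist(text)') IS NOT NULL
        AS has_artist,
    to_regprocedure('wxyc_identity_match_title(text)') IS NOT NULL
        AS has_title,
    to_regprocedure('wxyc_identity_match_with_punctuation(text)') IS NOT NULL
        AS has_with_punctuation,
    to_regprocedure('wxyc_identity_match_with_disambiguator_strip(text)') IS NOT NULL
        AS has_with_disambiguator_strip;
```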

<!-- @rule id=cross-cache-identity-guards enforced-by=scripts/check-precondition-guards.sh added=2026-04-25 -->

**Cross-cache-identity precondition guards (cross-epic, project-scoped).** The precondition-guard pattern above (0053 → `jobs/flowsheet-dj-name-backfill/` → 0054, scoped within a single migration chain) extends to **cross-epic** prerequisite chains for the cross-cache-identity project. The substrate migration `0NNN_library_identity_substrate.sql` (filed under WXYC/Backend-Service#663; PR number to be backfilled here when the substrate PR opens) ships its gate-check at `scripts/check-library-identity-gate.sql`. Any migration in any epic that FK-references `library_identity` / `library_identity_source` / `library_identity_history`, or adds a `NOT NULL` / `UNIQUE` / `CHECK` constraint to those tables, must include a `DO $$ ... RAISE EXCEPTION ... END $$;` block that calls or inlines the gate-check's `truly_unresolved_rows < 1000` predicate. Same mechanism as 0053/0054, scoped across epics rather than within one chain. Plan reference: `WXYC/wiki/plans/library-hook-canonicalization-plan.md` §3.2.3. CI enforcement is the `Migration guards` job in `.github/workflows/test.yml`, which runs `scripts/check-precondition-guards.sh` (ships with the substrate PR); a migration that legitimately doesn't need the guard opts out with a `-- precondition-guard: not-required (rationale)` first line.
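
The shape of such a guard block might look like the sketch below. The query behind `truly_unresolved_rows` lives in `scripts/check-library-identity-gate.sql` and is not reproduced in this document, so the assignment here is a labeled placeholder, not the real predicate; only the threshold (`< 1000`) and the `DO $$ ... RAISE EXCEPTION ... END $$;` form come from the rule above.

```sql
DO $$
DECLARE
    truly_unresolved_rows bigint;
BEGIN
    -- PLACEHOLDER (hypothetical): replace with the count computed by
    -- scripts/check-library-identity-gate.sql, called or inlined.
    truly_unresolved_rows := 0;

    -- The gate passes only while truly_unresolved_rows < 1000.
    IF truly_unresolved_rows >= 1000 THEN
        RAISE EXCEPTION 'library_identity gate not met: % truly unresolved row(s), must be < 1000',
            truly_unresolved_rows;
    END IF;
END $$;
```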
328 changes: 328 additions & 0 deletions shared/database/src/migrations/0076_wxyc-identity-match-functions.sql
@@ -0,0 +1,328 @@
-- precondition-guard: not-required (function-deploy-only; no DDL touches library_identity* tables; the body-comment mention is a forward-reference to BS#663, not a column edit)
-- Cross-cache-identity match form (wiki §3.3.5).
--
-- Deploys the four `wxyc_identity_match_*` plpgsql functions onto the Backend's
-- `wxyc_schema` so cross-cache identity-form expressions can join cleanly across
-- the four caches. Vendored byte-for-byte from WXYC/wxyc-etl@v0.4.0 (`data/`,
-- mirrored here at `vendor/wxyc-etl/`); SHA pin lives at `wxyc-etl-pin.txt`.
-- The sqlx-cli-style wrapper prelude (extension + text-search dictionary) is
-- prepended because drizzle-kit applies plain SQL files in a single transaction
-- rather than through psql, so the `\i` meta-command isn't available. A test in
-- `apps/backend/tests/integration/db/wxycIdentityMatchFunctions.test.ts`
-- asserts the four entry points return the expected normalization on a small
-- WXYC-canonical artist set.
--
-- Backend's column flip on `library_identity*` is downstream of this migration
-- — gated on the E2-BS step-2 backfill PR (#663). This file just ships the
-- function definitions so the backfill window has them available.

CREATE EXTENSION IF NOT EXISTS unaccent;

DROP TEXT SEARCH DICTIONARY IF EXISTS wxyc_unaccent;
CREATE TEXT SEARCH DICTIONARY wxyc_unaccent (
    TEMPLATE = unaccent,
    RULES = 'wxyc_unaccent'
);

-- Canonical SQL implementation of the cross-cache-identity match form.
--
-- Vendored verbatim into every cache repo (discogs-etl, musicbrainz-cache,
-- wikidata-cache) and Backend-Service. The four function bodies must produce
-- byte-identical output to the corresponding Rust entry points in
-- `wxyc_etl::text::identity`:
--
-- wxyc_identity_match_artist <-> to_identity_match_form
-- wxyc_identity_match_title <-> to_identity_match_form_title
-- wxyc_identity_match_with_punctuation <-> to_identity_match_form_with_punctuation
-- wxyc_identity_match_with_disambiguator_strip
-- <-> to_identity_match_form_with_disambiguator_strip
--
-- Parity is asserted by `wxyc-etl/tests/postgres_parity_test.rs` against the
-- 252-row fixture in `wxyc-etl/tests/fixtures/identity_normalization_cases.csv`.
--
-- Required Postgres version: 16+ (Unicode property classes, `normalize()`,
-- stable regex behavior). Required extension: `unaccent` configured with the
-- `wxyc_unaccent` text-search dictionary installed from
-- `data/wxyc_unaccent.rules`.
--
-- Vendoring contract: each consumer carries `wxyc-etl-pin.txt` recording the
-- SHA-256 of `data/wxyc_unaccent.rules` and the version header read from the
-- file's first comment line. Mismatch fails CI. See
-- `wxyc-etl/docs/postgres-analog-vendoring.md`.

DO $$
BEGIN
    IF current_setting('server_version_num')::int < 160000 THEN
        RAISE EXCEPTION 'wxyc identity-match functions require Postgres 16+; got %',
            current_setting('server_version');
    END IF;
END $$;

-- The wxyc_unaccent dictionary must be created before this file loads.
-- Consumer migrations do:
-- CREATE EXTENSION IF NOT EXISTS unaccent;
-- CREATE TEXT SEARCH DICTIONARY wxyc_unaccent (
-- TEMPLATE = unaccent, RULES = 'wxyc_unaccent'
-- );
-- followed by the rules-file SHA verification block (see vendoring docs).

-- ---------------------------------------------------------------------------
-- Base match-form pipeline.
--
-- Mirror of `wxyc_etl::text::to_match_form` after the storage-form pass
-- (no mojibake repair — callers responsible for storing pre-cleaned bytes).
-- Pipeline:
-- normalize NFKC -> lower -> wxyc_unaccent dictionary -> strip-Cf-except-ZWJ
-- -> collapse-ASCII-space + trim.
-- ---------------------------------------------------------------------------
CREATE OR REPLACE FUNCTION wxyc_match_form(s text)
RETURNS text
LANGUAGE plpgsql
IMMUTABLE
PARALLEL SAFE
AS $$
DECLARE
    r text;
    zwj text := chr(8205); -- U+200D
    cf_pattern text;
BEGIN
    IF s IS NULL THEN RETURN NULL; END IF;
    r := normalize(s, NFKC);
    r := lower(r);
    r := unaccent('wxyc_unaccent', r);
    -- Strip Cf (format) characters except U+200D ZWJ (emoji integrity), matching
    -- `strip_cf_except_zwj` in the Rust pipeline. Postgres regex has no
    -- `\p{Cf}` and no char-class subtraction; build the class from explicit
    -- BMP Cf codepoints split around ZWJ. Supplementary-plane Cf (U+E0001 etc.)
    -- is rare in music-catalog data and intentionally not handled here.
    cf_pattern :=
        '['
        || chr(173) -- U+00AD soft hyphen
        || chr(1564) -- U+061C ALM
        || chr(1757) -- U+06DD ARABIC END OF AYAH
        || chr(1807) -- U+070F SYRIAC ABBREV MARK
        || chr(2274) -- U+08E2 ARABIC DISPUTED END OF AYAH
        || chr(6158) -- U+180E MONG VOWEL SEP
        || chr(8203) || '-' || chr(8204) -- U+200B-U+200C (200D ZWJ skipped)
        || chr(8206) || '-' || chr(8207) -- U+200E-U+200F
        || chr(8234) || '-' || chr(8238) -- U+202A-U+202E
        || chr(8288) || '-' || chr(8303) -- U+2060-U+206F
        || chr(65279) -- U+FEFF BOM
        || chr(65529) || '-' || chr(65531) -- U+FFF9-U+FFFB
        || ']';
    -- ZWJ is excluded from the class above, so no placeholder swap needed.
    r := regexp_replace(r, cf_pattern, '', 'g');
    -- Collapse runs of ASCII space + trim. Other whitespace (TAB etc.) preserved.
    r := regexp_replace(r, ' +', ' ', 'g');
    r := regexp_replace(r, '^ | $', '', 'g');
    RETURN r;
END
$$;
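
-- Illustrative smoke check, NOT part of the vendored upstream SQL. It sketches
-- the last two pipeline stages in isolation, so it does not depend on the
-- contents of the wxyc_unaccent rules file. The sample strings are invented
-- for this example, and the chr() calls assume a UTF8 server encoding.
DO $$
BEGIN
    -- Cf strip: U+200B (zero-width space) is removed, U+200D (ZWJ) survives.
    ASSERT regexp_replace('a' || chr(8203) || chr(8205) || 'b',
                          '[' || chr(8203) || '-' || chr(8204) || ']', '', 'g')
           = 'a' || chr(8205) || 'b';
    -- Collapse + trim: once runs are collapsed, at most one space remains at
    -- each end, so '^ | $' with the 'g' flag is enough to trim both sides.
    ASSERT regexp_replace(regexp_replace('  a   b ', ' +', ' ', 'g'), '^ | $', '', 'g') = 'a b';
    -- TAB is not an ASCII space, so it is left in place.
    ASSERT regexp_replace(E'a\t b', ' +', ' ', 'g') = E'a\t b';
END $$;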

-- ---------------------------------------------------------------------------
-- Helper: strip a single trailing (...) or [...] group.
--
-- Mirror of `strip_trailing_parens` in `wxyc_etl::text::identity`. Returns
-- input unchanged when: no trailing close-bracket, brackets unbalanced,
-- or the matching open is the first character (index 0 in Rust, 1 in SQL;
-- stripping would reduce the stem to empty). One pass only.
-- ---------------------------------------------------------------------------
CREATE OR REPLACE FUNCTION wxyc_strip_trailing_parens(s text)
RETURNS text
LANGUAGE plpgsql
IMMUTABLE
PARALLEL SAFE
AS $$
DECLARE
    trimmed text;
    open_chr char;
    close_chr char;
    ch char;
    depth int := 0;
    open_idx int := -1;
    i int;
    stem text;
BEGIN
    IF s IS NULL THEN RETURN NULL; END IF;
    trimmed := regexp_replace(s, ' +$', '');
    IF length(trimmed) = 0 THEN RETURN s; END IF;
    ch := right(trimmed, 1);
    IF ch = ')' THEN
        open_chr := '('; close_chr := ')';
    ELSIF ch = ']' THEN
        open_chr := '['; close_chr := ']';
    ELSE
        RETURN s;
    END IF;
    -- Scan right-to-left for the matching open.
    FOR i IN REVERSE length(trimmed)..1 LOOP
        ch := substr(trimmed, i, 1);
        IF ch = close_chr THEN
            depth := depth + 1;
        ELSIF ch = open_chr THEN
            depth := depth - 1;
            IF depth = 0 THEN
                open_idx := i;
                EXIT;
            END IF;
        END IF;
    END LOOP;
    IF open_idx < 0 OR open_idx = 1 THEN
        -- Unbalanced or full-string brackets — preserve.
        RETURN s;
    END IF;
    stem := substr(trimmed, 1, open_idx - 1);
    stem := regexp_replace(stem, ' +$', '');
    RETURN stem;
END
$$;
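
-- Illustrative smoke check, NOT part of the vendored upstream SQL: a sketch of
-- the documented outcomes (strip, nested strip, preserve). The sample strings
-- are invented for this example; the block could equally live in the
-- integration test instead of the migration.
DO $$
BEGIN
    -- Single trailing group is stripped along with the space before it.
    ASSERT wxyc_strip_trailing_parens('low (remastered)') = 'low';
    ASSERT wxyc_strip_trailing_parens('low [live]  ') = 'low';
    -- Nested groups: the scan balances depth and removes the outermost group.
    ASSERT wxyc_strip_trailing_parens('a (b (c))') = 'a';
    -- Full-string brackets and unbalanced brackets are preserved.
    ASSERT wxyc_strip_trailing_parens('(untitled)') = '(untitled)';
    ASSERT wxyc_strip_trailing_parens('broken )') = 'broken )';
END $$;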

-- ---------------------------------------------------------------------------
-- Helper: drop a leading article or trailing comma-form article.
--
-- Mirror of `drop_articles` in `wxyc_etl::text::identity`. At most one
-- match is consumed. The leading form requires the article followed by
-- ASCII space (`the `, `a `, `an `); `theater` does not match. The comma
-- form requires `, the` / `, a` / `, an` at end-of-string with a
-- non-empty stem; `Beatles, the Best Of` does not match.
-- ---------------------------------------------------------------------------
CREATE OR REPLACE FUNCTION wxyc_drop_articles(s text)
RETURNS text
LANGUAGE plpgsql
IMMUTABLE
PARALLEL SAFE
AS $$
DECLARE
    art text;
    stripped text;
BEGIN
    IF s IS NULL THEN RETURN NULL; END IF;
    FOREACH art IN ARRAY ARRAY['the ', 'a ', 'an '] LOOP
        IF starts_with(s, art) THEN
            RETURN substr(s, length(art) + 1);
        END IF;
    END LOOP;
    FOREACH art IN ARRAY ARRAY[', the', ', a', ', an'] LOOP
        -- Suffix check via `right()` rather than `LIKE '%' || art` so a future
        -- article containing `%` or `_` doesn't trigger wildcard semantics.
        IF length(s) >= length(art) AND right(s, length(art)) = art THEN
            stripped := substr(s, 1, length(s) - length(art));
            IF length(stripped) > 0 THEN
                RETURN stripped;
            END IF;
        END IF;
    END LOOP;
    RETURN s;
END
$$;
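
-- Illustrative smoke check, NOT part of the vendored upstream SQL: a sketch of
-- the cases named in the header comment. Inputs are lowercase because the
-- helper runs after wxyc_match_form; the comparison itself is case-sensitive.
DO $$
BEGIN
    -- Leading article followed by an ASCII space is dropped.
    ASSERT wxyc_drop_articles('the beatles') = 'beatles';
    -- Trailing comma form is dropped when a non-empty stem remains.
    ASSERT wxyc_drop_articles('beatles, the') = 'beatles';
    -- No match: article is a word prefix, or the comma form is mid-string.
    ASSERT wxyc_drop_articles('theater') = 'theater';
    ASSERT wxyc_drop_articles('beatles, the best of') = 'beatles, the best of';
    -- Empty stem: the comma form alone is preserved.
    ASSERT wxyc_drop_articles(', the') = ', the';
END $$;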

-- ---------------------------------------------------------------------------
-- Helper: identity baseline (steps 4 + 5).
--
-- Mirror of `identity_baseline` in `wxyc_etl::text::identity`. The shared
-- body of artist + title entry points.
-- ---------------------------------------------------------------------------
CREATE OR REPLACE FUNCTION wxyc_identity_baseline(s text)
RETURNS text
LANGUAGE plpgsql
IMMUTABLE
PARALLEL SAFE
AS $$
DECLARE
    r text;
BEGIN
    IF s IS NULL THEN RETURN NULL; END IF;
    r := wxyc_match_form(s);
    r := wxyc_strip_trailing_parens(r);
    r := wxyc_drop_articles(r);
    r := regexp_replace(r, ' +', ' ', 'g');
    r := regexp_replace(r, '^ | $', '', 'g');
    RETURN r;
END
$$;

-- ---------------------------------------------------------------------------
-- Public entry point: artist identity match.
-- Mirror of `wxyc_etl::text::to_identity_match_form`.
-- ---------------------------------------------------------------------------
CREATE OR REPLACE FUNCTION wxyc_identity_match_artist(s text)
RETURNS text
LANGUAGE plpgsql
IMMUTABLE
PARALLEL SAFE
AS $$
BEGIN
    RETURN wxyc_identity_baseline(s);
END
$$;

-- ---------------------------------------------------------------------------
-- Public entry point: title identity match.
-- Mirror of `wxyc_etl::text::to_identity_match_form_title`. Same body as
-- artist today; separate function so callers type-distinguish at the call
-- site and a future step-6 promotion does not silently change titles that
-- would not benefit (`Side A/2` etc.).
-- ---------------------------------------------------------------------------
CREATE OR REPLACE FUNCTION wxyc_identity_match_title(s text)
RETURNS text
LANGUAGE plpgsql
IMMUTABLE
PARALLEL SAFE
AS $$
BEGIN
    RETURN wxyc_identity_baseline(s);
END
$$;

-- ---------------------------------------------------------------------------
-- Public entry point: identity match + opt-in punctuation collapse (step 6).
-- Mirror of `wxyc_etl::text::to_identity_match_form_with_punctuation`.
-- Each run of one-or-more non-letter, non-number, non-whitespace codepoints
-- becomes a single ASCII space; result is re-collapsed and re-trimmed.
-- ---------------------------------------------------------------------------
CREATE OR REPLACE FUNCTION wxyc_identity_match_with_punctuation(s text)
RETURNS text
LANGUAGE plpgsql
IMMUTABLE
PARALLEL SAFE
AS $$
DECLARE
    r text;
BEGIN
    IF s IS NULL THEN RETURN NULL; END IF;
    r := wxyc_match_form(s);
    r := wxyc_strip_trailing_parens(r);
    r := wxyc_drop_articles(r);
    -- Step 6: replace each run of non-{Letter,Number,Whitespace} with one space.
    -- Postgres regex doesn't support `\p{L}` directly, but POSIX `[:alpha:]` /
    -- `[:digit:]` / `[:space:]` are locale-aware (en_US.UTF-8 collation =
    -- full Unicode coverage).
    r := regexp_replace(r, '[^[:alpha:][:digit:][:space:]]+', ' ', 'g');
    r := regexp_replace(r, ' +', ' ', 'g');
    r := regexp_replace(r, '^ | $', '', 'g');
    RETURN r;
END
$$;
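
-- Illustrative smoke check, NOT part of the vendored upstream SQL. It sketches
-- step 6 in isolation on ASCII-only inputs, so it depends on neither the
-- wxyc_unaccent rules nor the database locale. Sample strings are invented.
DO $$
BEGIN
    -- A run of punctuation becomes exactly one ASCII space.
    ASSERT regexp_replace('ac/dc', '[^[:alpha:][:digit:][:space:]]+', ' ', 'g') = 'ac dc';
    -- An all-punctuation input collapses to one space, which the final
    -- collapse + trim then reduces to the empty string.
    ASSERT regexp_replace('!!!', '[^[:alpha:][:digit:][:space:]]+', ' ', 'g') = ' ';
    ASSERT regexp_replace(regexp_replace(' ', ' +', ' ', 'g'), '^ | $', '', 'g') = '';
END $$;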

-- ---------------------------------------------------------------------------
-- Public entry point: identity match + opt-in `/N` disambiguator strip (step 8).
-- Mirror of `wxyc_etl::text::to_identity_match_form_with_disambiguator_strip`.
--
-- Artists only. The leading whitespace before `/` is REQUIRED (`John Smith /1`
-- strips; `Track 1/12` does not — matches Rust's `\s+/\d+$` not `\s*`).
-- ---------------------------------------------------------------------------
CREATE OR REPLACE FUNCTION wxyc_identity_match_with_disambiguator_strip(s text)
RETURNS text
LANGUAGE plpgsql
IMMUTABLE
PARALLEL SAFE
AS $$
DECLARE
    r text;
BEGIN
    IF s IS NULL THEN RETURN NULL; END IF;
    r := wxyc_identity_baseline(s);
    r := regexp_replace(r, ' +/\d+$', '');
    RETURN r;
END
$$;
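
-- Illustrative smoke check, NOT part of the vendored upstream SQL. It sketches
-- the step-8 regex in isolation (independent of the wxyc_unaccent rules),
-- using the two examples from the header comment in lowercase.
DO $$
BEGIN
    -- Space(s) before `/N` at end of string: the disambiguator is stripped.
    ASSERT regexp_replace('john smith /1', ' +/\d+$', '') = 'john smith';
    -- No space before `/`: left untouched.
    ASSERT regexp_replace('track 1/12', ' +/\d+$', '') = 'track 1/12';
END $$;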