WXYC · jakebromberg · May 11, 2026
diff --git a/dev_env/docker-compose.yml b/dev_env/docker-compose.yml
@@ -12,6 +12,11 @@ services:
       - '${DB_PORT:-5432}:5432'
     volumes:
       - pg-data:/var/lib/postgresql/data
+      # Required by migration 0076: the wxyc_unaccent text-search dictionary
+      # loads its rules from $SHAREDIR/tsearch_data/. Vendored from
+      # WXYC/wxyc-etl@v0.4.0; SHA-pinned in wxyc-etl-pin.txt.
+      - ../vendor/wxyc-etl/wxyc_unaccent.rules:/usr/local/share/postgresql/tsearch_data/wxyc_unaccent.rules:ro
+      - ../vendor/wxyc-etl/wxyc_unaccent.version:/usr/local/share/postgresql/tsearch_data/wxyc_unaccent.version:ro
     healthcheck:
       test: ['CMD-SHELL', 'pg_isready -U ${DB_USERNAME} -d ${DB_NAME} || pg_isready']
       interval: 2s
@@ -56,6 +61,10 @@ services:
       - '${CI_DB_PORT:-5433}:5432'
     volumes:
       - ci-pg-data:/var/lib/postgresql/data
+      # Mirror of the `db` mounts so migration 0076 finds the wxyc_unaccent
+      # rules file inside the CI container too.
+      - ../vendor/wxyc-etl/wxyc_unaccent.rules:/usr/local/share/postgresql/tsearch_data/wxyc_unaccent.rules:ro
+      - ../vendor/wxyc-etl/wxyc_unaccent.version:/usr/local/share/postgresql/tsearch_data/wxyc_unaccent.version:ro
     healthcheck:
       test: ['CMD-SHELL', 'pg_isready -U ${DB_USERNAME} -d ${DB_NAME} || pg_isready']
       interval: 2s
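
One way to confirm the mounts landed where Postgres looks for them is to ask the server for its share directory, then check that the dictionary is registered once migration 0076 has run. The queries below are a sketch, assuming a superuser `psql` session against the `db` (or `ci-db`) container; the `pg_config` view is restricted to superusers by default.

```sql
-- Where this server resolves $SHAREDIR. The compose mounts assume
-- /usr/local/share/postgresql.
SELECT setting AS sharedir
FROM pg_config
WHERE name = 'SHAREDIR';

-- After migration 0076 has run, the dictionary should be registered.
SELECT dictname, dictinitoption
FROM pg_ts_dict
WHERE dictname = 'wxyc_unaccent';
```

If `sharedir` differs from `/usr/local/share/postgresql`, the mount targets would need to follow it, since the dictionary resolves its rules file under `$SHAREDIR/tsearch_data/`.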

diff --git a/docs/migrations.md b/docs/migrations.md
@@ -81,6 +81,10 @@ CREATE UNIQUE INDEX IF NOT EXISTS ...
 
 The same shape covers `NOT NULL` (count `WHERE col IS NULL`), `CHECK` (count rows that violate the predicate), and `FK` (count orphans via `LEFT JOIN ... WHERE referenced.id IS NULL`). The pattern is the same prevention this codebase already uses on the 0053 + `jobs/flowsheet-dj-name-backfill/` + 0054 chain — generalize it to any constraint-adding migration. Some constraints are provably safe (e.g. a `UNIQUE` index on a freshly-added nullable column, or `NOT NULL` paired with a `DEFAULT`); when no real precondition exists, document the reasoning with a `-- @no-precondition-needed: <reason>` comment so the linter (`scripts/validate-migrations.mjs` Check 8) suppresses its warning. The PR-bot data-shape report (companion #703) catches violations at PR time; the precondition guard is the last line of defense at apply time.
 
+<!-- @rule id=wxyc-etl-vendoring enforced-by=tests/integration/wxyc-identity-match-functions.spec.js added=2026-05-11 -->
+
+**`wxyc_identity_match_*` function vendoring (cross-cache-identity §3.3.5).** The four plpgsql functions (`wxyc_identity_match_artist`, `_title`, `_with_punctuation`, `_with_disambiguator_strip`) are deployed by migration `0076_wxyc-identity-match-functions.sql`. The canonical bodies live in `WXYC/wxyc-etl@v0.4.0` (`data/`) and are vendored byte-for-byte into this repo at `vendor/wxyc-etl/`, with SHA pins in `wxyc-etl-pin.txt`. The migration is a generated wrapper: extension + text-search dictionary setup, followed by the canonical SQL inlined verbatim (drizzle-kit applies plain SQL files in a single transaction, so `\i` isn't available). The dictionary loads its rules from `$SHAREDIR/tsearch_data/wxyc_unaccent.rules`; `dev_env/docker-compose.yml` mounts both `wxyc_unaccent.rules` and `wxyc_unaccent.version` into that directory on `db` and `ci-db`. Refresh procedure: (1) re-vendor from a new wxyc-etl tag, (2) bump the SHAs and version in `wxyc-etl-pin.txt`, (3) generate a follow-on migration with the new content, and (4) rerun `tests/integration/wxyc-identity-match-functions.spec.js`. The column-side flip on `library_identity*` is deliberately downstream (gated on the E2-BS step-2 backfill PR #663); this rule governs only the function-deploy half.
+
 <!-- @rule id=cross-cache-identity-guards enforced-by=scripts/check-precondition-guards.sh added=2026-04-25 -->
 
 **Cross-cache-identity precondition guards (cross-epic, project-scoped).** The precondition-guard pattern above (0053 → `jobs/flowsheet-dj-name-backfill/` → 0054, scoped within a single migration chain) extends to **cross-epic** prerequisite chains for the cross-cache-identity project. The substrate migration `0NNN_library_identity_substrate.sql` (filed under WXYC/Backend-Service#663; PR number to be backfilled here when the substrate PR opens) ships its gate-check at `scripts/check-library-identity-gate.sql`. Any migration in any epic that FK-references `library_identity` / `library_identity_source` / `library_identity_history`, or adds a `NOT NULL` / `UNIQUE` / `CHECK` constraint to those tables, must include a `DO $$ ... RAISE EXCEPTION ... END $$;` block that calls or inlines the gate-check's `truly_unresolved_rows < 1000` predicate. Same mechanism as 0053/0054, scoped across epics rather than within one chain. Plan reference: `WXYC/wiki/plans/library-hook-canonicalization-plan.md` §3.2.3. CI enforcement is the `Migration guards` job in `.github/workflows/test.yml`, which runs `scripts/check-precondition-guards.sh` (ships with the substrate PR); a migration that legitimately doesn't need the guard opts out with a `-- precondition-guard: not-required (rationale)` first line.
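
For the `wxyc-etl-vendoring` rule, the system catalog can confirm that all four entry points exist with the declared volatility and parallel-safety after 0076 (or a follow-on migration from the refresh procedure) has been applied. This is a sketch for manual verification, not part of the integration spec; it relies only on the standard `pg_proc` catalog and assumes no other schema in the database defines functions with the same prefix.

```sql
-- Expect four rows, each with provolatile = 'i' (IMMUTABLE)
-- and proparallel = 's' (PARALLEL SAFE).
SELECT proname, provolatile, proparallel
FROM pg_proc
WHERE starts_with(proname::text, 'wxyc_identity_match_')
ORDER BY proname;
```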

diff --git a/shared/database/src/migrations/0076_wxyc-identity-match-functions.sql b/shared/database/src/migrations/0076_wxyc-identity-match-functions.sql
@@ -0,0 +1,328 @@
+-- precondition-guard: not-required (function-deploy-only; no DDL touches library_identity* tables; the body-comment mention is a forward-reference to BS#663, not a column edit)
+-- Cross-cache-identity match form (wiki §3.3.5).
+--
+-- Deploys the four `wxyc_identity_match_*` plpgsql functions onto the Backend's
+-- `wxyc_schema` so cross-cache identity-form expressions can join cleanly across
+-- the four caches. Vendored byte-for-byte from WXYC/wxyc-etl@v0.4.0 (`data/`,
+-- mirrored here at `vendor/wxyc-etl/`); SHA pin lives at `wxyc-etl-pin.txt`.
+-- The sqlx-cli-style wrapper prelude (extension + text-search dictionary) is
+-- prepended because drizzle-kit applies plain SQL files in a single transaction,
+-- and `\i` isn't available. A test in
+-- `tests/integration/wxyc-identity-match-functions.spec.js`
+-- asserts the four entry points return the expected normalization on a small
+-- WXYC-canonical artist set.
+--
+-- Backend's column flip on `library_identity*` is downstream of this migration
+-- — gated on the E2-BS step-2 backfill PR (#663). This file just ships the
+-- function definitions so the backfill window has them available.
+
+CREATE EXTENSION IF NOT EXISTS unaccent;
+
+DROP TEXT SEARCH DICTIONARY IF EXISTS wxyc_unaccent;
+CREATE TEXT SEARCH DICTIONARY wxyc_unaccent (
+  TEMPLATE = unaccent,
+  RULES = 'wxyc_unaccent'
+);
+
+-- Canonical SQL implementation of the cross-cache-identity match form.
+--
+-- Vendored verbatim into every cache repo (discogs-etl, musicbrainz-cache,
+-- wikidata-cache) and Backend-Service. The four function bodies must produce
+-- byte-identical output to the corresponding Rust entry points in
+-- `wxyc_etl::text::identity`:
+--
+--   wxyc_identity_match_artist            <-> to_identity_match_form
+--   wxyc_identity_match_title             <-> to_identity_match_form_title
+--   wxyc_identity_match_with_punctuation  <-> to_identity_match_form_with_punctuation
+--   wxyc_identity_match_with_disambiguator_strip
+--                                         <-> to_identity_match_form_with_disambiguator_strip
+--
+-- Parity is asserted by `wxyc-etl/tests/postgres_parity_test.rs` against the
+-- 252-row fixture in `wxyc-etl/tests/fixtures/identity_normalization_cases.csv`.
+--
+-- Required Postgres version: 16+ (Unicode property classes, `normalize()`,
+-- stable regex behavior). Required extension: `unaccent` configured with the
+-- `wxyc_unaccent` text-search dictionary installed from
+-- `data/wxyc_unaccent.rules`.
+--
+-- Vendoring contract: each consumer carries `wxyc-etl-pin.txt` recording the
+-- SHA-256 of `data/wxyc_unaccent.rules` and the version header read from the
+-- file's first comment line. Mismatch fails CI. See
+-- `wxyc-etl/docs/postgres-analog-vendoring.md`.
+
+DO $$
+BEGIN
+  IF current_setting('server_version_num')::int < 160000 THEN
+    RAISE EXCEPTION 'wxyc identity-match functions require Postgres 16+; got %',
+      current_setting('server_version');
+  END IF;
+END $$;
+
+-- The wxyc_unaccent dictionary must be created before this file loads.
+-- Consumer migrations do:
+--   CREATE EXTENSION IF NOT EXISTS unaccent;
+--   CREATE TEXT SEARCH DICTIONARY wxyc_unaccent (
+--     TEMPLATE = unaccent, RULES = 'wxyc_unaccent'
+--   );
+-- followed by the rules-file SHA verification block (see vendoring docs).
+
+-- ---------------------------------------------------------------------------
+-- Base match-form pipeline.
+--
+-- Mirror of `wxyc_etl::text::to_match_form` after the storage-form pass
+-- (no mojibake repair — callers responsible for storing pre-cleaned bytes).
+-- Pipeline:
+--   normalize NFKC -> lower -> wxyc_unaccent dictionary -> strip-Cf-except-ZWJ
+--   -> collapse-ASCII-space + trim.
+-- ---------------------------------------------------------------------------
+CREATE OR REPLACE FUNCTION wxyc_match_form(s text)
+  RETURNS text
+  LANGUAGE plpgsql
+  IMMUTABLE
+  PARALLEL SAFE
+AS $$
+DECLARE
+  r text;
+  zwj text := chr(8205);     -- U+200D
+  cf_pattern text;
+BEGIN
+  IF s IS NULL THEN RETURN NULL; END IF;
+  r := normalize(s, NFKC);
+  r := lower(r);
+  r := unaccent('wxyc_unaccent', r);
+  -- Strip Cf (format) characters except U+200D ZWJ (emoji integrity), matching
+  -- `strip_cf_except_zwj` in the Rust pipeline. Postgres regex has no
+  -- `\p{Cf}` and no char-class subtraction; build the class from explicit
+  -- BMP Cf codepoints split around ZWJ. Supplementary-plane Cf (U+E0001 etc.)
+  -- is rare in music-catalog data and intentionally not handled here.
+  cf_pattern :=
+       '['
+    || chr(173)                                  -- U+00AD soft hyphen
+    || chr(1564)                                 -- U+061C ALM
+    || chr(1757)                                 -- U+06DD ARABIC END OF AYAH
+    || chr(1807)                                 -- U+070F SYRIAC ABBREV MARK
+    || chr(2274)                                 -- U+08E2 ARABIC DISPUTED END OF AYAH
+    || chr(6158)                                 -- U+180E MONG VOWEL SEP
+    || chr(8203) || '-' || chr(8204)             -- U+200B-U+200C  (200D ZWJ skipped)
+    || chr(8206) || '-' || chr(8207)             -- U+200E-U+200F
+    || chr(8234) || '-' || chr(8238)             -- U+202A-U+202E
+    || chr(8288) || '-' || chr(8303)             -- U+2060-U+206F
+    || chr(65279)                                -- U+FEFF BOM
+    || chr(65529) || '-' || chr(65531)           -- U+FFF9-U+FFFB
+    || ']';
+  -- ZWJ is excluded from the class above, so no placeholder swap needed.
+  r := regexp_replace(r, cf_pattern, '', 'g');
+  -- Collapse runs of ASCII space + trim. Other whitespace (TAB etc.) preserved.
+  r := regexp_replace(r, ' +', ' ', 'g');
+  r := regexp_replace(r, '^ | $', '', 'g');
+  RETURN r;
+END
+$$;
+
+-- ---------------------------------------------------------------------------
+-- Helper: strip a single trailing (...) or [...] group.
+--
+-- Mirror of `strip_trailing_parens` in `wxyc_etl::text::identity`. Returns
+-- input unchanged when: no trailing close-bracket, brackets unbalanced,
+-- or the matching open is at position 0 (would reduce stem to empty).
+-- One pass only.
+-- ---------------------------------------------------------------------------
+CREATE OR REPLACE FUNCTION wxyc_strip_trailing_parens(s text)
+  RETURNS text
+  LANGUAGE plpgsql
+  IMMUTABLE
+  PARALLEL SAFE
+AS $$
+DECLARE
+  trimmed text;
+  open_chr char;
+  close_chr char;
+  ch char;
+  depth int := 0;
+  open_idx int := -1;
+  i int;
+  stem text;
+BEGIN
+  IF s IS NULL THEN RETURN NULL; END IF;
+  trimmed := regexp_replace(s, ' +$', '');
+  IF length(trimmed) = 0 THEN RETURN s; END IF;
+  ch := right(trimmed, 1);
+  IF ch = ')' THEN
+    open_chr := '('; close_chr := ')';
+  ELSIF ch = ']' THEN
+    open_chr := '['; close_chr := ']';
+  ELSE
+    RETURN s;
+  END IF;
+  -- Scan right-to-left for the matching open.
+  FOR i IN REVERSE length(trimmed)..1 LOOP
+    ch := substr(trimmed, i, 1);
+    IF ch = close_chr THEN
+      depth := depth + 1;
+    ELSIF ch = open_chr THEN
+      depth := depth - 1;
+      IF depth = 0 THEN
+        open_idx := i;
+        EXIT;
+      END IF;
+    END IF;
+  END LOOP;
+  IF open_idx < 0 OR open_idx = 1 THEN
+    -- Unbalanced or full-string brackets — preserve.
+    RETURN s;
+  END IF;
+  stem := substr(trimmed, 1, open_idx - 1);
+  stem := regexp_replace(stem, ' +$', '');
+  RETURN stem;
+END
+$$;
+
+-- ---------------------------------------------------------------------------
+-- Helper: drop a leading article or trailing comma-form article.
+--
+-- Mirror of `drop_articles` in `wxyc_etl::text::identity`. At most one
+-- match is consumed. The leading form requires the article followed by
+-- ASCII space (`the `, `a `, `an `); `theater` does not match. The comma
+-- form requires `, the` / `, a` / `, an` at end-of-string with a
+-- non-empty stem; `Beatles, the Best Of` does not match.
+-- ---------------------------------------------------------------------------
+CREATE OR REPLACE FUNCTION wxyc_drop_articles(s text)
+  RETURNS text
+  LANGUAGE plpgsql
+  IMMUTABLE
+  PARALLEL SAFE
+AS $$
+DECLARE
+  art text;
+  stripped text;
+BEGIN
+  IF s IS NULL THEN RETURN NULL; END IF;
+  FOREACH art IN ARRAY ARRAY['the ', 'a ', 'an '] LOOP
+    IF starts_with(s, art) THEN
+      RETURN substr(s, length(art) + 1);
+    END IF;
+  END LOOP;
+  FOREACH art IN ARRAY ARRAY[', the', ', a', ', an'] LOOP
+    -- Suffix check via `right()` rather than `LIKE '%' || art` so a future
+    -- article containing `%` or `_` doesn't trigger wildcard semantics.
+    IF length(s) >= length(art) AND right(s, length(art)) = art THEN
+      stripped := substr(s, 1, length(s) - length(art));
+      IF length(stripped) > 0 THEN
+        RETURN stripped;
+      END IF;
+    END IF;
+  END LOOP;
+  RETURN s;
+END
+$$;
+
+-- ---------------------------------------------------------------------------
+-- Helper: identity baseline (steps 4 + 5).
+--
+-- Mirror of `identity_baseline` in `wxyc_etl::text::identity`. The shared
+-- body of artist + title entry points.
+-- ---------------------------------------------------------------------------
+CREATE OR REPLACE FUNCTION wxyc_identity_baseline(s text)
+  RETURNS text
+  LANGUAGE plpgsql
+  IMMUTABLE
+  PARALLEL SAFE
+AS $$
+DECLARE
+  r text;
+BEGIN
+  IF s IS NULL THEN RETURN NULL; END IF;
+  r := wxyc_match_form(s);
+  r := wxyc_strip_trailing_parens(r);
+  r := wxyc_drop_articles(r);
+  r := regexp_replace(r, ' +', ' ', 'g');
+  r := regexp_replace(r, '^ | $', '', 'g');
+  RETURN r;
+END
+$$;
+
+-- ---------------------------------------------------------------------------
+-- Public entry point: artist identity match.
+-- Mirror of `wxyc_etl::text::to_identity_match_form`.
+-- ---------------------------------------------------------------------------
+CREATE OR REPLACE FUNCTION wxyc_identity_match_artist(s text)
+  RETURNS text
+  LANGUAGE plpgsql
+  IMMUTABLE
+  PARALLEL SAFE
+AS $$
+BEGIN
+  RETURN wxyc_identity_baseline(s);
+END
+$$;
+
+-- ---------------------------------------------------------------------------
+-- Public entry point: title identity match.
+-- Mirror of `wxyc_etl::text::to_identity_match_form_title`. Same body as
+-- artist today; separate function so callers type-distinguish at the call
+-- site and a future step-6 promotion does not silently change titles that
+-- would not benefit (`Side A/2` etc.).
+-- ---------------------------------------------------------------------------
+CREATE OR REPLACE FUNCTION wxyc_identity_match_title(s text)
+  RETURNS text
+  LANGUAGE plpgsql
+  IMMUTABLE
+  PARALLEL SAFE
+AS $$
+BEGIN
+  RETURN wxyc_identity_baseline(s);
+END
+$$;
+
+-- ---------------------------------------------------------------------------
+-- Public entry point: identity match + opt-in punctuation collapse (step 6).
+-- Mirror of `wxyc_etl::text::to_identity_match_form_with_punctuation`.
+-- Each run of one-or-more non-letter, non-number, non-whitespace codepoints
+-- becomes a single ASCII space; result is re-collapsed and re-trimmed.
+-- ---------------------------------------------------------------------------
+CREATE OR REPLACE FUNCTION wxyc_identity_match_with_punctuation(s text)
+  RETURNS text
+  LANGUAGE plpgsql
+  IMMUTABLE
+  PARALLEL SAFE
+AS $$
+DECLARE
+  r text;
+BEGIN
+  IF s IS NULL THEN RETURN NULL; END IF;
+  r := wxyc_match_form(s);
+  r := wxyc_strip_trailing_parens(r);
+  r := wxyc_drop_articles(r);
+  -- Step 6: replace each run of non-{Letter,Number,Whitespace} with one space.
+  -- Postgres regex doesn't support `\p{L}`, but POSIX `[:alpha:]` / `[:digit:]` /
+  -- `[:space:]` follow the database's LC_CTYPE (en_US.UTF-8 gives full Unicode
+  -- coverage; under the C locale non-ASCII letters would be replaced too).
+  r := regexp_replace(r, '[^[:alpha:][:digit:][:space:]]+', ' ', 'g');
+  r := regexp_replace(r, ' +', ' ', 'g');
+  r := regexp_replace(r, '^ | $', '', 'g');
+  RETURN r;
+END
+$$;
+
+-- ---------------------------------------------------------------------------
+-- Public entry point: identity match + opt-in `/N` disambiguator strip (step 8).
+-- Mirror of `wxyc_etl::text::to_identity_match_form_with_disambiguator_strip`.
+--
+-- Artists only. The leading whitespace before `/` is REQUIRED (`John Smith /1`
+-- strips; `Track 1/12` does not — matches Rust's `\s+/\d+$` not `\s*`).
+-- ---------------------------------------------------------------------------
+CREATE OR REPLACE FUNCTION wxyc_identity_match_with_disambiguator_strip(s text)
+  RETURNS text
+  LANGUAGE plpgsql
+  IMMUTABLE
+  PARALLEL SAFE
+AS $$
+DECLARE
+  r text;
+BEGIN
+  IF s IS NULL THEN RETURN NULL; END IF;
+  r := wxyc_identity_baseline(s);
+  r := regexp_replace(r, ' +/\d+$', '');
+  RETURN r;
+END
+$$;
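
The two string helpers in migration 0076 (`wxyc_strip_trailing_parens` and `wxyc_drop_articles`) do not touch the `wxyc_unaccent` dictionary, so their documented edge cases can be checked directly once the migration is applied. The calls below are a sketch for manual verification. Inputs are already lowercase because the helpers run after `wxyc_match_form` in the real pipeline. The last two statements exercise only the bare regex from `wxyc_identity_match_with_disambiguator_strip`, not the function itself, whose output also depends on the dictionary rules.

```sql
-- Trailing (...) / [...] stripping: one group, one pass.
SELECT wxyc_strip_trailing_parens('abbey road (remastered)');  -- abbey road
SELECT wxyc_strip_trailing_parens('song (a) (b)');             -- song (a)
SELECT wxyc_strip_trailing_parens('(untitled)');               -- (untitled), stem would be empty
SELECT wxyc_strip_trailing_parens('side a)');                  -- side a), unbalanced

-- Article dropping: leading "the " or trailing ", the", at most one match.
SELECT wxyc_drop_articles('the beatles');           -- beatles
SELECT wxyc_drop_articles('beatles, the');          -- beatles
SELECT wxyc_drop_articles('theater');               -- theater
SELECT wxyc_drop_articles('beatles, the best of');  -- beatles, the best of

-- Disambiguator regex on its own: the space before "/" is required.
SELECT regexp_replace('john smith /1', ' +/\d+$', '');  -- john smith
SELECT regexp_replace('track 1/12', ' +/\d+$', '');     -- track 1/12
```

Mixed-case input such as `'The Beatles'` passes through `wxyc_drop_articles` unchanged, which is why the public entry points lowercase first.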