
Deploy wxyc_identity_match_* plpgsql functions (wiki §3.3.5) #806

Closed
jakebromberg wants to merge 3 commits into main from cross-cache-identity/identity-match-functions


@jakebromberg

Summary

Backend half of the four-cache function deploy. Mirrors WXYC/musicbrainz-cache#52 + WXYC/wikidata-cache#38; vendors canonical artifacts from WXYC/wxyc-etl@v0.4.0 (data/) under vendor/wxyc-etl/ and ships them as migration 0076.

  • shared/database/src/migrations/0076_wxyc-identity-match-functions.sql — extension + dictionary setup, then the canonical SQL inlined byte-for-byte.
  • vendor/wxyc-etl/{wxyc_unaccent.rules,.version,wxyc_identity_match_functions.sql} — vendored verbatim from upstream. SHA-pinned in wxyc-etl-pin.txt.
  • dev_env/docker-compose.yml: db (dev) and ci-db (ci) mount the rules + version files into /usr/local/share/postgresql/tsearch_data/ so the dictionary creates cleanly on first migrate.
  • tests/integration/wxyc-identity-match-functions.spec.js — three-layer check: SHA pin freshness, migration-vs-canonical byte-equality, and a small canonical-artist smoke + idempotence on the live PG.
  • docs/migrations.md — adds a @rule block documenting the vendoring convention and the refresh procedure.
  • Snapshot + journal: 0076 mirrors 0075's table/enum state with fresh UUIDs; journal uses when = previous + 1ms per the hand-edit recipe.
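
The first two layers of the integration check (pin freshness and migration-vs-canonical byte-equality) can be sketched roughly as below. This is a Python illustration, not the actual spec in tests/integration/. It assumes the pin file stores a SHA-256 of the vendored SQL, which this PR does not confirm, and the helper names are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def pin_is_fresh(vendored: bytes, pinned_sha: str) -> bool:
    # Assumption: the pin is a SHA-256 hex digest of the vendored file.
    return sha256_hex(vendored) == pinned_sha.strip().lower()


def migration_inlines_canonical(migration: bytes, canonical: bytes) -> bool:
    # Byte-for-byte check: the canonical SQL must appear unmodified
    # somewhere inside the migration body (after the setup preamble).
    return canonical in migration


# Usage with in-memory stand-ins for the real files.
canonical = b"CREATE FUNCTION wxyc_match_form(text) RETURNS text ...;\n"
migration = b"CREATE EXTENSION IF NOT EXISTS unaccent;\n" + canonical
pin = sha256_hex(canonical)
print(pin_is_fresh(canonical, pin), migration_inlines_canonical(migration, canonical))
```

Comparing raw bytes rather than decoded text means whitespace or line-ending drift between the vendored file and the migration also fails the check.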

Scope

Function deploy only. Per the ticket and wiki §3.3.0 row 6, Backend's library_identity* column flip is downstream — gated on the E2-BS step-2 backfill PR (#663). This migration ships the function definitions so the backfill window has them available.

Closes #805.

Related: parent epic WXYC/wxyc-etl#73, prerequisite WXYC/wxyc-etl#113 (merged), sibling deploys WXYC/musicbrainz-cache#52, WXYC/wikidata-cache#38, WXYC/discogs-etl#194.

Test plan

  • node scripts/validate-migrations.mjs — passes (75 entries, 2 historical warnings unchanged)
  • Local PG 18 apply: psql -f shared/database/src/migrations/0076_*.sql → 8 functions created, smoke queries return expected normalization
  • CI green (ci-db mounts pick up the rules; integration spec runs on the live container)
  • After merge: Manual Build & Deploy within 24h per the migration-chain risk cadence


Vendors the canonical artifacts from WXYC/wxyc-etl@v0.4.0 (`data/`) under `vendor/wxyc-etl/` and ships them as migration 0076. The migration sets up the `wxyc_unaccent` text-search dictionary, then inlines the canonical four-function SQL byte-for-byte (drizzle-kit applies plain SQL in a single transaction, so `\i` isn't an option). SHA-pinned in `wxyc-etl-pin.txt`.

Both `db` (dev profile) and `ci-db` (ci profile) in `dev_env/docker-compose.yml` now mount the rules + version files into `/usr/local/share/postgresql/tsearch_data/` so the dictionary creates cleanly on first migrate. The integration spec at `tests/integration/wxyc-identity-match-functions.spec.js` exercises three layers: pin SHA freshness, migration-vs-canonical byte-equality, and a small canonical-artist smoke + idempotence on the live PG.

Column flip on `library_identity*` is deliberately out of scope here — gated on the E2-BS step-2 backfill PR (#663) per the ticket and the wiki §3.3.0 row 6. This migration ships the function definitions so the backfill window has them available.

The journal entry uses `when = previous + 1ms` per the hand-edit recipe in `docs/migrations.md`. Snapshot 0076 mirrors 0075's table/enum/etc state with new id/prevId UUIDs since no schema-level changes accompany the function deploy.
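
As a rough illustration of the `when = previous + 1ms` recipe, the next journal entry can be derived from the last one as below. The field names (`idx`, `version`, `when`, `tag`, `breakpoints`) are an assumption about the drizzle-kit journal layout, the sample timestamp is invented, and `next_journal_entry` is a hypothetical helper, not something in this repo.

```python
def next_journal_entry(journal: dict, tag: str) -> dict:
    """Build a hand-edited journal entry one millisecond after the previous one."""
    previous = journal["entries"][-1]
    return {
        "idx": previous["idx"] + 1,
        "version": previous["version"],
        # `when` is assumed to be a millisecond epoch timestamp.
        "when": previous["when"] + 1,
        "tag": tag,
        "breakpoints": previous["breakpoints"],
    }


# Invented sample data standing in for the real journal file.
journal = {
    "entries": [
        {"idx": 75, "version": "7", "when": 1746982800000,
         "tag": "0075_previous", "breakpoints": True},
    ]
}
entry = next_journal_entry(journal, "0076_wxyc-identity-match-functions")
print(entry["idx"], entry["when"])
```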
@github-actions

Schema constraint shape report

no new constraints detected in this diff (uniqueIndex, .unique(), SET NOT NULL, CHECK, FK)

The CI 'Migration guards' check greps for any reference to library_identity* in migration SQL; my body comment mentioning the downstream #663 column flip tripped it. Add the documented opt-out comment as the first line of the migration. Refreeze applied-hashes.json with the new SHA since the migration body changed.

Does NOT address the second dry-run failure (RDS managed PG can't load custom tsearch_data files) — that's tracked separately as a deployment-pattern concern.
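
For readers unfamiliar with this kind of guard, the behaviour described above can be sketched as follows. The real check is a CI grep whose exact regex and opt-out syntax are not shown in this PR, so both `GUARD_PATTERN` and `OPT_OUT_MARKER` are hypothetical stand-ins.

```python
import re

# Hypothetical stand-ins: the real regex and marker are documented in the repo.
GUARD_PATTERN = re.compile(r"library_identity\w*")
OPT_OUT_MARKER = "-- migration-guard: allow library_identity"


def guard_trips(migration_sql: str) -> bool:
    """Return True when the guard would fail the migration."""
    lines = migration_sql.splitlines()
    # The opt-out only counts when it is the very first line of the file.
    if lines and lines[0].strip() == OPT_OUT_MARKER:
        return False
    # A plain text match cannot tell SQL from comments, so a mention
    # in a comment trips the guard just like a real column reference.
    return GUARD_PATTERN.search(migration_sql) is not None


body = (
    "-- column flip on library_identity* lands in #663\n"
    "CREATE FUNCTION f() RETURNS int LANGUAGE sql AS 'SELECT 1';\n"
)
print(guard_trips(body), guard_trips(OPT_OUT_MARKER + "\n" + body))
```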
@jakebromberg jakebromberg marked this pull request as draft May 11, 2026 17:01
@jakebromberg

CI exposed an architectural blocker — converting to draft

The lint failure (cross-cache-identity precondition-guard regex matching my body-comment mention of library_identity*) is fixed in a6525f9 (added the documented opt-out comment).

The deeper failure is the Migration Dry-Run (prod-shaped data) job:

ERROR: could not open unaccent file
"/rdsdbbin/postgres-14.17.R2/share/tsearch_data/wxyc_unaccent.rules":
No such file or directory

Two problems exposed:

  1. RDS managed Postgres can't load custom tsearch_data files. AWS-managed RDS doesn't expose write access to $SHAREDIR/tsearch_data/. The wiki §3.3.5 deploy pattern (canonical custom wxyc_unaccent dictionary) is feasible on self-managed PG (mb-cache, wikidata-cache, discogs-cache on docker/homebrew) but not on RDS as Backend uses it. The dev_env/docker-compose.yml mount I added works locally + CI; on prod RDS it has nowhere to land.
  2. Prod RDS is on Postgres 14.17, not 16. The wiki §3.3.5 spec (and the migration's current_setting('server_version_num') >= 16 guard) requires 16+. The wiki claim of "Backend prod RDS verified at Postgres 16 on 2026-04-28" is wrong about prod's current version. (The dry-run uses the latest automated RDS snapshot, so this reflects actual prod state.)
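
For reference, on Postgres 10 and later `server_version_num` packs the version as major × 10000 + minor, so 14.17 reports 140017. A minimal sketch of the version check, written in Python for illustration (the migration's actual guard is SQL, and these function names are hypothetical):

```python
def major_version(server_version_num: int) -> int:
    """Major version from server_version_num (valid for Postgres 10+)."""
    return server_version_num // 10000


def meets_minimum(server_version_num: int, minimum_major: int = 16) -> bool:
    """True when the server is at or above the required major version."""
    return major_version(server_version_num) >= minimum_major


# 14.17 is the version shown in the dry-run error path above.
print(major_version(140017), meets_minimum(140017))
```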

Proposed path forward

This PR is not feasible as designed on Backend's current managed-PG stack. Options:

  • Option A — inline transliteration. Drop the wxyc_unaccent dictionary dependency. Generate an inline translate() + replace() chain inside wxyc_match_form covering all 433 rules. Bigger function body but RDS-compatible. Filed for follow-up: needs a generator that emits the inline SQL from the same wxyc-etl upstream source.
  • Option B — backfill-job-side normalization. Skip the Backend SQL functions entirely. The eventual BS#663 step-2 backfill computes identity-match form in the backfill job (Python/Node via wxyc-etl package) and writes pre-normalized values to library_identity*. Loses expression-index optimizations but avoids the RDS constraint.
  • Option C — wait on prod PG 16 upgrade. The PG-version blocker is solvable independently from the tsearch_data one. Once prod is on 16+, the tsearch_data constraint still stands — RDS upgrade doesn't unlock the filesystem.
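
A minimal sketch of the kind of generator Option A would need, assuming the rules file follows the stock unaccent layout (one `source<whitespace>target` pair per line, with a missing target meaning "delete"). The function names are hypothetical and the upstream wxyc-etl generator, if one is written, may look quite different.

```python
def sql_literal(text: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def parse_rules(text: str) -> list[tuple[str, str]]:
    """Parse 'source target' lines; a missing target means 'delete'."""
    rules = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        rules.append((parts[0], parts[1] if len(parts) > 1 else ""))
    return rules


def build_inline_expr(column: str, rules: list[tuple[str, str]]) -> str:
    """Emit one translate() for 1:1 rules, wrapped in replace() for the rest."""
    one_to_one = [(s, t) for s, t in rules if len(s) == 1 and len(t) == 1]
    others = [(s, t) for s, t in rules if not (len(s) == 1 and len(t) == 1)]
    expr = column
    if one_to_one:
        src = "".join(s for s, _ in one_to_one)
        trg = "".join(t for _, t in one_to_one)
        expr = f"translate({expr}, {sql_literal(src)}, {sql_literal(trg)})"
    # replace() calls run sequentially, unlike the dictionary's single pass,
    # so a rule whose target contains another rule's source could behave
    # differently; a real generator would need to check for that.
    for s, t in others:
        expr = f"replace({expr}, {sql_literal(s)}, {sql_literal(t)})"
    return expr


rules = parse_rules("é\te\nß\tss\n")
print(build_inline_expr("name", rules))
```

Folding every one-to-one rule into a single translate() call keeps the nesting depth proportional to the number of multi-character rules rather than to all 433.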

Recommendation: Option B for the eventual BS#663 backfill, since Backend's role post-pivot is "thin writer" (LML composes identity, Backend stores verbatim). Server-side SQL functions are only needed for ad-hoc queries; the backfill itself is naturally a job-level computation. The other three sibling deploys (mb-cache#52, wikidata#38, discogs-etl#195) ship as-is on their self-managed PG instances and serve LML#280's needs without Backend's leg.
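
To make Option B concrete, here is a rough sketch of job-side normalization. It does not use the wxyc-etl API, which this thread does not show. `match_form` below is a stdlib stand-in (NFKD decomposition, combining-mark stripping, casefolding, whitespace collapsing) and will not reproduce the canonical rules exactly.

```python
import unicodedata


def match_form(name: str) -> str:
    """Stand-in normalizer; the real backfill would call wxyc-etl instead."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Casefold, then collapse runs of whitespace to single spaces.
    return " ".join(stripped.casefold().split())


def backfill_rows(rows):
    """Map (id, raw_name) pairs to (id, normalized_name) pairs for writing."""
    return [(row_id, match_form(name)) for row_id, name in rows]


print(backfill_rows([(1, "Björk"), (2, "  Sigur   Rós ")]))
```

The point of the design is that the database only ever stores the pre-normalized value, so nothing on RDS has to load a custom dictionary.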

Converting this PR to draft pending a decision on the path forward.

@jakebromberg

Closing — work folds into BS#663 step 2

Per the review feedback and the post-pivot architecture review:

  • Wiki erratum filed: WXYC/wiki#56 corrects the stale "Backend prod RDS is Postgres 16" claim in §3.3.5 and flags the Backend row of the per-cache implementation-ownership table as deferred.
  • Ticket Deploy wxyc_identity_match_* Postgres functions to Backend wxyc_schema (wiki §3.3.5) #805 updated: the post-pivot scope is now recorded at the top of the body — Backend's leg moves into the BS#663 step-2 backfill job (Python-side via the wxyc-etl wheel), not a server-side function deploy.
  • The vendored bytes from this branch are not lost: if BS#663 step 2 ends up needing the SHA pin contract, vendor/wxyc-etl/ + wxyc-etl-pin.txt can be cherry-picked cleanly from a6525f9. More likely, the backfill will just run pip install wxyc-etl==0.4.0 and call the Python API, so no vendoring is needed.

The sibling deploys WXYC/musicbrainz-cache#52, WXYC/wikidata-cache#38, WXYC/discogs-etl#195 ship on self-managed PG instances and serve LML#280's needs without Backend's leg. LML#280 is unblocked on WXYC/discogs-etl#195 merging.

Closing.
