Extend library-identity-consumer SELECT predicate to cover NULL canonical_entity_id rows (~35K out-of-scope today) #974

@jakebromberg

Problem

The SELECT predicate in jobs/library-identity-consumer/ (per README:9-17):

library.canonical_entity_id IS NOT NULL
OR library.id IN (
  SELECT library_id FROM library_identity
  WHERE last_verified_at < NOW() - interval '7 days'
)

This predicate selects only rows that already have canonical_entity_id populated: on a first run library_identity is itself empty, so the OR clause matches nothing. Rows with canonical_entity_id IS NULL are never touched by this consumer. They never enter library_identity, so they never become eligible for the OR clause either.
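A minimal sketch of that behavior on toy data, using Python's sqlite3 module rather than the real database. The two-column schema, the sample rows, and the use of SQLite's datetime('now', '-7 days') in place of NOW() - interval '7 days' are all assumptions made for illustration:

```python
import sqlite3

# Toy schema holding only the columns the predicate touches (types are assumed).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE library (id INTEGER PRIMARY KEY, canonical_entity_id TEXT);
    CREATE TABLE library_identity (
        library_id INTEGER NOT NULL,
        last_verified_at TEXT NOT NULL
    );
    INSERT INTO library (id, canonical_entity_id)
    VALUES (1, 'entity-a'), (2, NULL), (3, NULL);
""")

# The current predicate; datetime() stands in for NOW() - interval '7 days'.
CURRENT_PREDICATE = """
    library.canonical_entity_id IS NOT NULL
    OR library.id IN (
        SELECT library_id FROM library_identity
        WHERE last_verified_at < datetime('now', '-7 days')
    )
"""

# library_identity is empty, as on a first run, so only row 1 qualifies.
in_scope = [row[0] for row in conn.execute(
    f"SELECT id FROM library WHERE {CURRENT_PREDICATE} ORDER BY id")]
print(in_scope)  # [1]
```

Rows 2 and 3 stay out of scope on every later run as well, because nothing ever writes them into library_identity.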

Dry-run evidence (2026-05-20)

First-ever dry-run on prod (run_id 481b448d-7424-4a86-b3cc-db9aca533882) finished with:

  • scanned: 29,781 of ~64,676 library rows (~46% in scope)
  • resolved: 14,405 (48.4% of scanned)
  • unresolved: 15,376 (51.6% of scanned)
  • compilation: 0
  • All error counters: 0

The 34,895 rows excluded by the predicate include:

  • The ~6,300 V/A library rows per #801 (machine-populated by the old match_compilations.py path, likely with NULL canonical_entity_id). The dry-run's compilation: 0 counter supports this: LML never saw a V/A row to classify, because the predicate filtered them out.
  • Newly-added library rows that haven't been canonicalized by any legacy path.
  • Anything that the old multi-source backfill failed to canonicalize.
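The dry-run figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers quoted in this issue; the ~64,676 total is itself approximate:

```python
# Counters copied from the 2026-05-20 dry-run summary above.
total_rows = 64_676   # approximate library size
scanned = 29_781
resolved = 14_405
unresolved = 15_376

# Resolved and unresolved should account for every scanned row.
assert resolved + unresolved == scanned

excluded = total_rows - scanned
print(excluded)                               # 34895
print(round(100 * scanned / total_rows, 1))   # 46.0 (share of library in scope)
print(round(100 * resolved / scanned, 1))     # 48.4
print(round(100 * unresolved / scanned, 1))   # 51.6
print(round(100 * resolved / total_rows, 1))  # 22.3 (resolved share of whole library)
```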

End state

The consumer's SELECT predicate covers the full library on a first run, then narrows naturally on subsequent runs so it doesn't re-attempt unresolved rows in a hot loop.

Proposed shape

Extend the predicate to a three-way OR:

library.canonical_entity_id IS NOT NULL
OR library.id IN (
  SELECT library_id FROM library_identity
  WHERE last_verified_at < NOW() - interval '7 days'
)
OR library.id NOT IN (SELECT library_id FROM library_identity)

The third clause picks up library rows that have never been resolved by this consumer (which currently captures both NULL-canonical-entity-id rows AND any prior-pass unresolved rows that never landed in library_identity).
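A sketch of the three-way predicate on toy data, again using sqlite3 as a stand-in for the real database (schema, rows, and the datetime() substitution are assumptions). It also shows one caveat worth checking before adopting the NOT IN form: if library_identity.library_id can ever be NULL (the issue does not state whether the column is NOT NULL), NOT IN evaluates to unknown for every non-matching row and the third clause silently matches nothing. A NOT EXISTS formulation does not have that behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE library (id INTEGER PRIMARY KEY, canonical_entity_id TEXT);
    -- library_id is left nullable here purely to demonstrate the caveat.
    CREATE TABLE library_identity (library_id INTEGER, last_verified_at TEXT);
    INSERT INTO library VALUES (1, 'entity-a'), (2, NULL), (3, NULL);
    -- Row 2 was verified yesterday, so it is fresh and should be out of scope.
    INSERT INTO library_identity VALUES (2, datetime('now', '-1 day'));
""")

STALE = """library.id IN (
    SELECT library_id FROM library_identity
    WHERE last_verified_at < datetime('now', '-7 days'))"""
NOT_IN = "library.id NOT IN (SELECT library_id FROM library_identity)"
NOT_EXISTS = """NOT EXISTS (
    SELECT 1 FROM library_identity li WHERE li.library_id = library.id)"""

def scope(third_clause):
    """Return the library ids selected by the three-way OR."""
    sql = f"""SELECT id FROM library
              WHERE library.canonical_entity_id IS NOT NULL
                 OR {STALE}
                 OR {third_clause}
              ORDER BY id"""
    return [row[0] for row in conn.execute(sql)]

baseline = scope(NOT_IN)
print(baseline)  # [1, 3]: never-resolved row 3 is now in scope, fresh row 2 is not

# Simulate a stray NULL library_id in library_identity.
conn.execute("INSERT INTO library_identity VALUES (NULL, datetime('now'))")
with_null_not_in = scope(NOT_IN)
with_null_not_exists = scope(NOT_EXISTS)
print(with_null_not_in)      # [1]: row 3 silently drops out
print(with_null_not_exists)  # [1, 3]
```

If library_id is declared NOT NULL, the two forms select the same rows and the choice is a matter of style and query plan.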

Then add a stop condition for repeatedly-unresolved rows so the predicate doesn't pull them every run:

  • Option A: add an unresolved_attempted_at column on library (or a library_unresolved sidecar table) and gate the third clause on unresolved_attempted_at IS NULL OR unresolved_attempted_at < NOW() - interval '7 days'.
  • Option B: write an unresolved row to library_identity (with a kind discriminator), so the second OR clause's staleness check covers it. Requires a schema change.

Option A has the smaller blast radius; Option B is cleaner but changes the consumer's writer surface.
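One possible shape for the Option A gate, sketched on toy data with sqlite3. The column name unresolved_attempted_at comes from the option above; its TEXT type, the sample rows, and the datetime() substitution for NOW() - interval '7 days' are assumptions. Only the gated third clause is evaluated here, in isolation from the other two:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE library (
        id INTEGER PRIMARY KEY,
        canonical_entity_id TEXT,
        unresolved_attempted_at TEXT  -- Option A column; NULL = never attempted
    );
    CREATE TABLE library_identity (
        library_id INTEGER NOT NULL,
        last_verified_at TEXT NOT NULL
    );
    INSERT INTO library VALUES
        (1, NULL, NULL),                         -- never attempted
        (2, NULL, datetime('now', '-1 day')),    -- attempted recently
        (3, NULL, datetime('now', '-8 days'));   -- attempt is stale
""")

# Third clause, gated so a recent unresolved attempt suppresses a retry.
GATED_THIRD_CLAUSE = """
    library.id NOT IN (SELECT library_id FROM library_identity)
    AND (
        library.unresolved_attempted_at IS NULL
        OR library.unresolved_attempted_at < datetime('now', '-7 days')
    )
"""

retry_scope = [row[0] for row in conn.execute(
    f"SELECT id FROM library WHERE {GATED_THIRD_CLAUSE} ORDER BY id")]
print(retry_scope)  # [1, 3]
```

Row 2 is skipped until its last attempt is more than seven days old, which is the stop condition this option is meant to provide.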

Acceptance

  • Predicate change behind a feature flag or env var so the rollout can be staged (run on a subset first).
  • Dry-run shows the full library now in scope (scanned ≈ 64,676 — or whatever the current count is).
  • Subsequent runs honor the 7-day staleness for both resolved and unresolved rows — no hot-loop re-attempts.
  • The unresolved-attempt-tracking schema lands as a separate migration, not in the consumer's writer code (DDL-only per docs/migrations.md).
  • Re-run against prod shows kind: compilation rows being classified (V/A rows now in scope).
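For the first acceptance item, a flag-gated predicate switch might look roughly like the sketch below. The env var name LIBRARY_IDENTITY_EXTENDED_PREDICATE and the helper select_predicate are hypothetical, and the consumer's actual language and configuration mechanism are not stated in this issue:

```python
import os

# Predicate text copied from this issue.
LEGACY_PREDICATE = """
library.canonical_entity_id IS NOT NULL
OR library.id IN (
  SELECT library_id FROM library_identity
  WHERE last_verified_at < NOW() - interval '7 days'
)
"""

EXTENDED_PREDICATE = LEGACY_PREDICATE + (
    "OR library.id NOT IN (SELECT library_id FROM library_identity)\n"
)

def select_predicate(environ=os.environ):
    """Pick the predicate from a (hypothetical) env var; default to legacy."""
    flag = environ.get("LIBRARY_IDENTITY_EXTENDED_PREDICATE", "").strip().lower()
    return EXTENDED_PREDICATE if flag in {"1", "true", "yes"} else LEGACY_PREDICATE

# With the flag unset the consumer keeps today's behavior.
print(select_predicate({}) is LEGACY_PREDICATE)  # True
```

Defaulting to the legacy predicate means an unset or mistyped flag cannot widen the scan by accident during a staged rollout.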

Why now

  • #830 (E5-1 tracker: confirm matched_via.source distribution shifts toward library_identity) can't be answered until the full library is covered. The 14,405 rows that resolved in this dry-run are only ~22% of the library, so read-path observability would report a distribution skewed by the rows the consumer never saw.
  • dj-site#520's rotation tracks dropdown depends on library_identity.discogs_release_id. Rotation entries that point at NULL-canonical-entity-id library rows (likely the majority of V/A rotation cases) will stay empty even after the consumer runs.
  • #801's population pass (per-track via bulk-resolve-libraries) mirrors this consumer's pattern. Whatever predicate-shape gets adopted here should generalize to the track-level case.

Related

  • #802: parent (consumer implementation, closed 2026-05-11 by PR #807, "feat(jobs): library-identity-consumer for LML bulk-resolve (#802)")
  • #663: Epic E2 (cross-cache-identity Backend half)
  • #801: library_track generalization (parallel scope for track-level rows)
  • #830: read-side observability tracker for matched_via.source shift
  • dj-site#520: rotation tracks dropdown (downstream consumer)
  • Sentry trace: search by run_id:481b448d-7424-4a86-b3cc-db9aca533882 for the 2026-05-20 dry-run that surfaced this gap.
