Extend library-identity-consumer SELECT predicate to cover NULL canonical_entity_id rows (~34K out-of-scope today) #974

## Problem

`jobs/library-identity-consumer/`'s SELECT predicate (per [README:9-17](https://github.com/WXYC/Backend-Service/blob/main/jobs/library-identity-consumer/README.md)):

```sql
library.canonical_entity_id IS NOT NULL
OR library.id IN (
  SELECT library_id FROM library_identity
  WHERE last_verified_at < NOW() - interval '7 days'
)
```

only selects rows that already have `canonical_entity_id` populated. The OR clause contributes nothing on a first run, when `library_identity` itself is empty. Rows with `canonical_entity_id IS NULL` are never touched by this consumer: they never enter `library_identity`, so they never become eligible for the OR clause either.

## Dry-run evidence (2026-05-20)

First-ever dry-run on prod (run_id `481b448d-7424-4a86-b3cc-db9aca533882`) finished with:

- **scanned: 29,781** of ~64,676 library rows (~46% in scope)
- **resolved: 14,405** (48.4% of scanned)
- **unresolved: 15,376** (51.6% of scanned)
- **compilation: 0**
- All error counters: 0

The 34,895 rows excluded by the predicate include:

- The ~6,300 V/A library rows per [#801](https://github.com/WXYC/Backend-Service/issues/801) (machine-populated by the old `match_compilations.py` path, likely with NULL `canonical_entity_id`). The `compilation: 0` result from the dry-run is consistent with this: LML never saw a V/A row to classify, because the predicate filtered them out.
- Newly-added library rows that haven't been canonicalized by any legacy path.
- Anything that the old multi-source backfill failed to canonicalize.
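The in-scope/out-of-scope split above can be re-checked directly against the database. The query below is a minimal sketch, assuming PostgreSQL and the table and column names quoted in this issue (`library.id`, `library.canonical_entity_id`, `library_identity.library_id`, `library_identity.last_verified_at`); it wraps the current predicate in a boolean and counts both sides.

```sql
-- Sketch: count rows inside and outside the CURRENT predicate.
-- Table/column names are taken from this issue; adjust if the schema differs.
SELECT
  count(*)                                    AS total_rows,
  count(*) FILTER (WHERE in_scope IS TRUE)     AS in_scope_rows,
  count(*) FILTER (WHERE in_scope IS NOT TRUE) AS out_of_scope_rows
FROM (
  SELECT
    (
      l.canonical_entity_id IS NOT NULL
      OR l.id IN (
        SELECT li.library_id
        FROM library_identity li
        WHERE li.last_verified_at < NOW() - interval '7 days'
      )
    ) AS in_scope
  FROM library l
) scoped;
```

`IS TRUE` / `IS NOT TRUE` are used so that a NULL result from the `IN` subquery is counted as out of scope rather than dropped from both buckets. If the library has not changed since the dry-run, `in_scope_rows` should be close to the `scanned` figure above.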

## End state

The consumer's SELECT predicate covers the full library on a first run, then narrows naturally on subsequent runs so it doesn't re-attempt unresolved rows in a hot loop.

## Proposed shape

Extend the predicate to a three-way OR:

```sql
library.canonical_entity_id IS NOT NULL
OR library.id IN (
  SELECT library_id FROM library_identity
  WHERE last_verified_at < NOW() - interval '7 days'
)
OR library.id NOT IN (SELECT library_id FROM library_identity)
```

The third clause picks up library rows that have never been resolved by this consumer (which currently captures both NULL-canonical-entity-id rows AND any prior-pass `unresolved` rows that never landed in `library_identity`).
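One caveat on the third clause: in SQL, `x NOT IN (subquery)` never evaluates to true if the subquery returns any NULL, so a single NULL `library_identity.library_id` would silently empty the clause. Whether that column is declared `NOT NULL` is not stated in this issue. A NULL-safe equivalent, sketched under the same naming assumptions, uses `NOT EXISTS`:

```sql
-- Sketch: same three-way predicate, with the third clause as NOT EXISTS.
-- Equivalent to NOT IN when library_identity.library_id has no NULLs,
-- and still correct when it does.
SELECT l.id
FROM library l
WHERE l.canonical_entity_id IS NOT NULL
   OR l.id IN (
        SELECT li.library_id
        FROM library_identity li
        WHERE li.last_verified_at < NOW() - interval '7 days'
      )
   OR NOT EXISTS (
        SELECT 1
        FROM library_identity li
        WHERE li.library_id = l.id
      );
```

If `library_id` is already constrained `NOT NULL`, the two forms select the same rows and the choice is stylistic.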

Then add a stop condition for repeatedly-unresolved rows so the predicate doesn't pull them every run:

- Option A: add an `unresolved_attempted_at` column on `library` (or a `library_unresolved` sidecar) and gate the third clause on `unresolved_attempted_at IS NULL OR unresolved_attempted_at < NOW() - interval '7 days'`.
- Option B: write an `unresolved` row to `library_identity` (with a kind discriminator), so the staleness check in the second OR clause covers it. Requires a schema change.

Option A has the smaller blast radius; Option B is cleaner but changes the consumer's writer surface.
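For concreteness, Option A could look roughly like the following. This is a sketch, not a settled design: the column name comes from the option text above, while the `timestamptz` type and the `UPDATE` the consumer would issue after an unresolved attempt are assumptions.

```sql
-- Migration (DDL only). Column type is an assumption.
ALTER TABLE library
  ADD COLUMN unresolved_attempted_at timestamptz;

-- Predicate with the third clause gated on the new column.
SELECT l.id
FROM library l
WHERE l.canonical_entity_id IS NOT NULL
   OR l.id IN (
        SELECT li.library_id
        FROM library_identity li
        WHERE li.last_verified_at < NOW() - interval '7 days'
      )
   OR (
        l.id NOT IN (SELECT li.library_id FROM library_identity li)
        AND (
          l.unresolved_attempted_at IS NULL
          OR l.unresolved_attempted_at < NOW() - interval '7 days'
        )
      );

-- Hypothetical consumer-side write after a row comes back unresolved
-- ($1 is the library row id).
UPDATE library
SET unresolved_attempted_at = NOW()
WHERE id = $1;
```

With this shape, a never-attempted row is picked up immediately, and an unresolved row is retried at most once per 7-day window, matching the staleness interval already used for resolved rows.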

## Acceptance

- [ ] Predicate change behind a feature flag or env var so the rollout can be staged (run-on-a-subset first).
- [ ] Dry-run shows the full library now in scope (scanned ≈ 64,676 — or whatever the current count is).
- [ ] Subsequent runs honor the 7-day staleness for both resolved and unresolved rows — no hot-loop re-attempts.
- [ ] The unresolved-attempt-tracking schema lands as a separate migration, not in the consumer's writer code (DDL-only per `docs/migrations.md`).
- [ ] Re-run against prod shows `kind: compilation` rows being classified (V/A rows now in scope).

## Why now

- [#830](https://github.com/WXYC/Backend-Service/issues/830) (E5-1 tracker: confirm `matched_via.source` distribution shifts toward `library_identity`) can't be answered until the full library is covered. The 14,405 rows that resolved in this dry-run are only ~22% of the library, so read-path observability measured against that partial coverage would report a misleading distribution.
- [dj-site#520](https://github.com/WXYC/dj-site/issues/520)'s rotation tracks dropdown depends on `library_identity.discogs_release_id`. Rotation entries that point at NULL-canonical-entity-id library rows (likely the majority of V/A rotation cases) will stay empty even after the consumer runs.
- [#801](https://github.com/WXYC/Backend-Service/issues/801)'s population pass (per-track via bulk-resolve-libraries) mirrors this consumer's pattern. Whatever predicate-shape gets adopted here should generalize to the track-level case.

## Related

- [#802](https://github.com/WXYC/Backend-Service/issues/802) — parent (consumer implementation, closed 2026-05-11 by PR #807)
- [#663](https://github.com/WXYC/Backend-Service/issues/663) — Epic E2 (cross-cache-identity Backend half)
- [#801](https://github.com/WXYC/Backend-Service/issues/801) — `library_track` generalization (parallel scope for track-level rows)
- [#830](https://github.com/WXYC/Backend-Service/issues/830) — read-side observability tracker for matched_via.source shift
- [dj-site#520](https://github.com/WXYC/dj-site/issues/520) — rotation tracks dropdown (downstream consumer)
- Sentry trace: search by `run_id:481b448d-7424-4a86-b3cc-db9aca533882` for the 2026-05-20 dry-run that surfaced this gap.

