Backfill library.artwork_url for Discogs-resolvable rows #637

@jakebromberg

Problem

Catalog search latency under the new 500ms LML enrichment budget cap (#618) is bounded but not optimal. Every search returns rows where library.artwork_url IS NULL, the enrichWithArtwork service makes parallel LML lookups for those rows, and either the lookups complete within 500ms or the budget fires and we return uncached rows. Either way, every search pays at minimum a slice of the budget on first load for un-cached albums.
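The budget behaviour described above can be sketched roughly as follows. This is an illustration only, not the actual enrichWithArtwork implementation: the function name `enrich_with_budget`, the `lookup` callable, and the dict-shaped rows are all assumptions made for the example.

```python
import asyncio

BUDGET_SECONDS = 0.5  # the 500ms cap from #618


async def enrich_with_budget(rows, lookup, budget=BUDGET_SECONDS):
    """Run lookups in parallel for rows missing artwork; stop waiting after `budget` seconds.

    `rows` are dicts with an "artwork_url" key and `lookup` is an async callable
    taking one row. Both shapes are hypothetical stand-ins for this sketch.
    """
    # Rows that already have artwork short-circuit: no lookup is started for them.
    tasks = {
        asyncio.ensure_future(lookup(row)): row
        for row in rows
        if row.get("artwork_url") is None
    }
    if tasks:
        done, not_done = await asyncio.wait(tasks.keys(), timeout=budget)
        # Budget fired: abandon whatever has not finished and return those rows as-is.
        for task in not_done:
            task.cancel()
        for task in done:
            if task.exception() is None and task.result():
                tasks[task]["artwork_url"] = task.result()
    return rows
```

With every row already cached, no task is created and the call returns immediately, which is the state the backfill is trying to reach for the resolvable rows.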

Current state on prod:

SELECT count(*) FILTER (WHERE artwork_url IS NOT NULL) AS with_art,
       count(*) FILTER (WHERE artwork_url IS NULL) AS without,
       count(*) AS total
FROM wxyc_schema.library;

with_art=155, without=64008, total=64163. 0.24% of library rows have artwork cached.

Tonight's entity_resolution run lifted Discogs identity coverage on wxyc_schema.artists to 24.3% (5,785 / 23,816 artists). Of those, the corresponding library rows are the search-time enrichment cache hits we'd actually expect to succeed if we proactively backfill artwork_url for them.

Cost model (2026-04-29 update)

After verifying the LML code path: artwork URLs are sourced from discogs-cache.release.artwork_url, populated by discogs-etl/scripts/import_csv.py:import_artwork from release_image.csv in the monthly Discogs XML dump. For any release in the cache, LML serves artwork as a local PG read — no Discogs API hop.

This means the original "30 minutes at 10 req/s" estimate is too pessimistic. The actual cost shape per row:

  • BS → LML round-trip (~5–20ms each).
  • LML → discogs-cache local PG read (sub-millisecond when cached).
  • BS write back to RDS.

Live Discogs API hits only happen for the residual where release_image.csv had no entry at all, bounded by Discogs's image coverage. LML rate limits are no longer the bottleneck.
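As a back-of-envelope check on the two cost shapes, the arithmetic can be written out. It uses the ~18,500 candidate rows given in the Proposal and assumes a fully sequential run where the BS → LML round trip dominates; the sub-millisecond cache read and the RDS write are left out because this issue does not quantify them.

```python
ROWS = 18_500                          # expected candidate rows (from the Proposal)
ORIGINAL_RATE_PER_S = 10               # the original rate-limited estimate
RTT_LOW_S, RTT_HIGH_S = 0.005, 0.020   # BS -> LML round trip, 5 to 20 ms


def rate_limited_minutes(rows, rate_per_s):
    """Wall-clock minutes if throughput is capped at `rate_per_s` requests per second."""
    return rows / rate_per_s / 60


def sequential_seconds(rows, per_row_s):
    """Wall-clock seconds if rows are processed one at a time at `per_row_s` each."""
    return rows * per_row_s


print(f"original estimate: {rate_limited_minutes(ROWS, ORIGINAL_RATE_PER_S):.1f} min")
print(f"sequential, 5 ms/row:  {sequential_seconds(ROWS, RTT_LOW_S):.1f} s")
print(f"sequential, 20 ms/row: {sequential_seconds(ROWS, RTT_HIGH_S):.1f} s")
```

Under these assumptions the rate-limited figure comes to about 31 minutes, consistent with the original "30 minutes" estimate, while a sequential run at 5 to 20 ms per row lands at roughly 1.5 to 6 minutes before any batching or concurrency.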

Framing

This is a one-shot warm of the cache, not a recurring sync. Once library.artwork_url is populated, search-time enrichWithArtwork short-circuits and the 500ms enrichment budget stops firing for those rows. Staleness risk on the i.discogs.com URL is bounded by the monthly cache rebuild cadence (WXYC/discogs-etl#125) and Discogs's CDN URL stability.

For the roughly 75% of artists with no Discogs identity (the complement of the 24.3% coverage above), search-time enrichment of their rows is hopeless regardless and the budget will keep firing. That is the long-term concern addressed by other strategies (LML #207 wires up MusicBrainz/Wikidata/Spotify identity sources; LML #211 closes the matching gap). This issue addresses the resolvable slice.

Proposal

Mirror the pattern of jobs/library-artist-name-backfill/: a one-shot batched UPDATE job under jobs/library-artwork-url-backfill/ that:

  1. Queries every library row joined to artists where a.discogs_artist_id IS NOT NULL AND l.artwork_url IS NULL. Expected count after tonight's reconcile: ~18,500.
  2. For each batch of N rows, calls POST /api/v1/lookup against LML (or lookupMetadata directly if running inside the BS process) with the artist + album.
  3. Writes the resulting artwork_url back. If the lookup fails or returns no artwork, leave NULL — the search-path enrichment will retry and fail cleanly.
  4. Idempotent via WHERE artwork_url IS NULL AND a.discogs_artist_id IS NOT NULL filter — re-running picks up only rows the previous run didn't finish.
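A minimal sketch of the four steps above, using an in-memory-friendly sqlite3 connection so it runs standalone. The real job would target wxyc_schema on Postgres and call LML; here `lookup` is a hypothetical callable standing in for POST /api/v1/lookup, and the column names other than artwork_url and discogs_artist_id (`id`, `artist_id`, `artist_name`, `album_title`) are assumptions, not the actual schema.

```python
import sqlite3

# Step 1: candidate rows, paged by id so rows that stay NULL are not re-read in the same run.
SELECT_BATCH = """
SELECT l.id, a.artist_name, l.album_title
FROM library AS l
JOIN artists AS a ON a.id = l.artist_id
WHERE a.discogs_artist_id IS NOT NULL
  AND l.artwork_url IS NULL
  AND l.id > ?
ORDER BY l.id
LIMIT ?
"""

UPDATE_ROW = "UPDATE library SET artwork_url = ? WHERE id = ? AND artwork_url IS NULL"


def backfill(conn, lookup, batch_size=100):
    """Backfill artwork_url in batches; returns the number of rows updated."""
    updated = 0
    last_id = 0  # assumes positive integer ids
    while True:
        batch = conn.execute(SELECT_BATCH, (last_id, batch_size)).fetchall()
        if not batch:
            return updated
        with conn:  # one transaction per batch: commit on success, roll back on error
            for row_id, artist, album in batch:
                try:
                    url = lookup(artist, album)  # step 2: hypothetical LML call
                except Exception:
                    url = None  # step 3: a failed lookup leaves the row NULL
                if url:
                    updated += conn.execute(UPDATE_ROW, (url, row_id)).rowcount
        last_id = batch[-1][0]
```

The `l.id > ?` cursor is a choice made for this sketch: because failed lookups leave artwork_url NULL, filtering on NULL alone would re-select the same failing rows forever within a single run. Across runs, the NULL filter still gives the idempotency described in step 4.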

Run procedure mirrors tonight's artist-name backfill: Manual Build & Deploy with target=library-artwork-url-backfill, then docker run --rm --env-file .env <image> on EC2.

Expected impact

Once complete:

  • ~18,500 rows have artwork_url set, so enrichWithArtwork short-circuits on them.
  • For searches whose top-N results all have cached artwork, the enrichment budget never fires — search latency drops to baseline RTT (~370ms) consistently.
  • The 500ms budget continues to protect the path for searches that hit Discogs-unresolvable artists.

Caveats

  • Should not run against rows where artwork_url is non-null but stale; this job only touches NULL rows. Stale-artwork refresh is a separate, cron-scheduled concern.
  • Throughput can be tuned higher than the original 10 req/s estimate now that LML rate limits aren't the bottleneck — local round-trip is.

Acceptance

  • New job jobs/library-artwork-url-backfill/ mirroring library-artist-name-backfill/ (per-batch transactions, NULL-filter for idempotency, verification step at end).
  • Documented run procedure in CLAUDE.md.
  • Sentry / JSON logging via the standard Phase A observability contract (tool=library-artwork-url-backfill, step=batch-N, run_id UUID per run).
  • Post-run verification: count of artwork_url IS NOT NULL jumped to expected 18,500-ish.
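For the logging bullet, a per-batch JSON line might look like the sketch below. Only the tool, step, and run_id fields come from this issue; the helper names and the extra `rows_updated` field in the usage line are hypothetical, and Sentry wiring is not shown.

```python
import json
import uuid

TOOL = "library-artwork-url-backfill"


def new_run_id():
    """One UUID per run, shared by every log line the run emits."""
    return str(uuid.uuid4())


def batch_log_line(run_id, batch_number, **fields):
    """Serialize one batch event as a single JSON line."""
    record = {"tool": TOOL, "step": f"batch-{batch_number}", "run_id": run_id}
    record.update(fields)  # any extra fields here are illustrative, not part of the contract
    return json.dumps(record, sort_keys=True)


run_id = new_run_id()
print(batch_log_line(run_id, 1, rows_updated=100))
```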

Blocked by

Cross-repo risks

Out of scope

References

  • Tonight's catalog-latency investigation. The enrichWithArtwork budget was capped in #618 (Cap LML artwork enrichment on the catalog search hot path).
  • Pattern: jobs/library-artist-name-backfill/.
  • Identity coverage that gates this: 24.3% post-tonight's run (5,785 / 23,816 artists).
  • Cost-model verification: discogs-etl/scripts/import_csv.py:443 (artwork import) + library-metadata-lookup/discogs/cache_service.py:340,487 (local PG read).
