Per-call-site LML timeouts: shorten /proxy/metadata/album to ~8s with existing search-URL fallback #990

@jakebromberg

Problem

BS#971 (closed #873) raised LmlClient.TIMEOUT_MS to 30000ms globally at apps/backend/services/lml/lml.client.ts:55. That's the right ceiling for fire-and-forget enrichment (apps/backend/services/metadata/enrichment.service.ts): listeners aren't blocked by the wait, and the new .catch arm writes the three synthesized YouTube/Bandcamp/SoundCloud search URLs as a partial fallback while keeping metadata_attempt_at NULL for the recurring drift-repair sweep to retry.

It's too patient for the synchronous /proxy/metadata/album endpoint that iOS calls directly. On a cold-cache compilation lookup the LML cascade can take 9–22s (see BS#873's prod measurements). iOS clients set their own URLSession timeouts — wxyc-dj-tool-ios ships with a default that fires before BS's 30s ceiling, so the listener sees NSURLErrorDomain Code=-1001 "The request timed out." instead of the synthesized-URL response that the controller already produces.

Concrete repro from 2026-05-21, in the window after the Railway PG service-source swap dropped LML's PG connection pool (see discogs-etl#223 for the swap context):

fetching metadata for Underworld — Jumbo
Task <DE28239B-77D7-4DF3-8186-B2E649D47A9C>.<13> finished with error [-1001]
Error Domain=NSURLErrorDomain Code=-1001 "The request timed out."
NSErrorFailingURLStringKey=https://api.wxyc.org/proxy/metadata/album?artistName=Underworld&releaseTitle=Jumbo

BS Sentry spans on GET /proxy/metadata/album for the same window (statsPeriod=1h, ~10 min around the iOS failure):

timestamp   span.duration   span.status
19:22:30Z   9.08s           ok
19:23:04Z   9.41s           ok
19:23:15Z   15.00s          ok

All three returned 200 on the BS side; iOS gave up before the body arrived. The cold-cascade pattern matches BS#873's cache_stats exactly (api_calls=5–8, api_time_ms=15–22s). After 30 min of warm-cache use these same lookups land in sub-second territory.

Where the controller already handles this correctly

getAlbumMetadata at proxy.controller.ts:304-315 already wraps the lookupMetadata call in a try/catch that just console.warns and falls through with artwork undefined; the synthesized search URLs get filled in at lines 353-356 and the controller returns 200. That holds today even when lmlFetch throws LmlClientError('LML request timed out', 504) on AbortError — the 504 never surfaces to the client.

The bug is latency, not status. The fix is one knob (per-call timeoutMs), not a controller reshape.
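
A minimal sketch of that fall-through, under simplified assumptions: `lookup` and `searchUrls` are injected stand-ins for the real `lookupMetadata` call and `searchUrlProvider.getAllSearchUrls`, and the types list only a few of the real fields. It is not the code in proxy.controller.ts; it only illustrates why a timeout thrown inside the try block still ends in a 200 with search URLs.

```typescript
// Illustrative only: simplified stand-ins, not the actual proxy.controller.ts code.
interface SearchUrls {
  youtubeMusicUrl: string;
  bandcampUrl: string;
  soundcloudUrl: string;
}

interface EnrichedFields {
  artworkUrl: string;
  discogsUrl: string;
  releaseYear: number;
}

type AlbumResponse = SearchUrls & Partial<EnrichedFields>;

async function getAlbumMetadataSketch(
  artistName: string,
  releaseTitle: string,
  lookup: (artist: string, title: string) => Promise<EnrichedFields>,
  searchUrls: (artist: string, title: string) => SearchUrls,
): Promise<{ status: number; body: AlbumResponse }> {
  let enriched: Partial<EnrichedFields> = {};
  try {
    enriched = await lookup(artistName, releaseTitle);
  } catch (err) {
    // A timeout (or any other LML failure) is logged and swallowed here,
    // so the 504 never reaches the client.
    console.warn("LML lookup failed, falling back to search URLs", err);
  }
  // In this sketch the search URLs are always synthesized locally and merged in.
  return {
    status: 200,
    body: { ...enriched, ...searchUrls(artistName, releaseTitle) },
  };
}
```

Because the catch arm already exists, shortening the timeout only changes how long the caller waits before reaching the fallback branch.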

Why now

The 2026-05-21 Railway PG image swap (discogs-etl#223, closed) restarted the destination Postgres container, dumping LML's asyncpg pool. The next ~20 min of cold-cache LML lookups paid the full cascade. This will happen again on any Railway PG redeploy (deploys, manual restarts, infrastructure events). Worth fixing the trade-off so the next event doesn't surface as user-visible timeouts.

Desired end state

Per-call-site timeouts on LmlClient, with /proxy/metadata/album passing { timeoutMs: 8000 } (the exact value is a knob: start at 8s, measure, adjust). When the 8s timeout fires, the existing getAlbumMetadata catch arm runs, the synthesized search URLs already at 353-356 populate the response, and iOS gets a 200 inside its URLSession window with search URLs but no album artwork or release metadata. The real LML match lands via the recurring drift-repair sweep when the cache warms.

Shape of the change:

  1. Add a timeoutMs?: number parameter to LmlClient.lookupMetadata. Default stays 30000 so existing fire-and-forget call sites are unchanged.
  2. getAlbumMetadata passes { timeoutMs: 8000 }.

That's it on the controller side — the existing try/catch + fallback path holds.
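
The two steps above could look roughly like the following. This is a sketch under assumptions: the `transport` parameter stands in for the real `lmlFetch`, and the `LookupOptions` shape and function signature are hypothetical. Only `lookupMetadata`, `timeoutMs`, the 30000 default, and the `LmlClientError('LML request timed out', 504)` behavior come from this issue.

```typescript
// Sketch only: `Transport`, `LookupOptions`, and this signature are hypothetical.
class LmlClientError extends Error {
  readonly status: number;
  constructor(message: string, status: number) {
    super(message);
    this.name = "LmlClientError";
    this.status = status;
  }
}

const DEFAULT_TIMEOUT_MS = 30000; // global ceiling set by BS#971

interface LookupOptions {
  timeoutMs?: number;
}

// Stand-in for the real HTTP call; it is expected to reject when `signal` aborts.
type Transport = (query: string, signal: AbortSignal) => Promise<unknown>;

async function lookupMetadata(
  transport: Transport,
  query: string,
  options: LookupOptions = {},
): Promise<unknown> {
  // Callers that pass nothing keep the 30s ceiling.
  const timeoutMs = options.timeoutMs ?? DEFAULT_TIMEOUT_MS;
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await transport(query, controller.signal);
  } catch (err) {
    if (controller.signal.aborted) {
      // Map the abort to the same 504-style error the issue describes.
      throw new LmlClientError("LML request timed out", 504);
    }
    throw err;
  } finally {
    clearTimeout(timer);
  }
}
```

In this shape, fire-and-forget callers would keep calling `lookupMetadata(transport, query)` and inherit 30000, while `getAlbumMetadata` would pass `{ timeoutMs: 8000 }`.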

What iOS gets on an 8s timeout (response shape)

With PROXY_METADATA_SINGLE_LOOKUP off (current default), an 8s timeout means the listener gets a response in which:

  • Populated: youtubeMusicUrl, bandcampUrl, soundcloudUrl (the three synthesized search URLs from searchUrlProvider.getAllSearchUrls).
  • Absent: discogsReleaseId, discogsUrl, artworkUrl, artistBio, artistWikipediaUrl, spotifyUrl, appleMusicUrl (the populateCommonMetadataFields output).
  • Absent: releaseYear, genres, styles, label, discogsArtistId, fullReleaseDate, tracklist (the populateReleaseMetadata output).

The trade-off is "first cold lookup of a release returns search URLs only; the enriched payload arrives on the next read after the cache warms (or after the drift-repair sweep)." iOS UI should render gracefully without album artwork and without release-level metadata — not just without art.
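
For tests or client-side handling, the cold-timeout shape can be expressed as a small predicate. The following is a hedged sketch: the field names come from the lists above, but the `isSearchUrlsOnly` helper and the `AlbumMetadataPayload` type are hypothetical, and treating `null` the same as absent is an assumption.

```typescript
// Hypothetical helper; only the field names are taken from this issue.
const SEARCH_URL_FIELDS = ["youtubeMusicUrl", "bandcampUrl", "soundcloudUrl"];
const ENRICHED_FIELDS = [
  "discogsReleaseId", "discogsUrl", "artworkUrl", "artistBio",
  "artistWikipediaUrl", "spotifyUrl", "appleMusicUrl",
  "releaseYear", "genres", "styles", "label",
  "discogsArtistId", "fullReleaseDate", "tracklist",
];

type AlbumMetadataPayload = Record<string, unknown>;

// True when the payload is the cold-timeout fallback: all three search URLs
// are non-empty strings and every enriched field is absent or null.
function isSearchUrlsOnly(payload: AlbumMetadataPayload): boolean {
  const hasSearchUrls = SEARCH_URL_FIELDS.every((key) => {
    const value = payload[key];
    return typeof value === "string" && value.length > 0;
  });
  const lacksEnrichment = ENRICHED_FIELDS.every(
    (key) => payload[key] === undefined || payload[key] === null,
  );
  return hasSearchUrls && lacksEnrichment;
}
```

A predicate like this could back both the unit-test assertion and the manual prod check, so "response shape" means the same thing in each.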

Where

Side benefit — semaphore back-pressure

A shorter LML timeout releases the lookupSemaphore permit (max 5) sooner, so under the iOS-retry-storm scenario, queued retries get unblocked faster too. Not the primary motivation but a real second-order benefit when discogs-etl#223-style events happen again.
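
To illustrate why a shorter timeout frees permits sooner, here is a minimal counting-semaphore sketch. The `Semaphore` class and `withPermit` helper are hypothetical and are not the real `lookupSemaphore` implementation; only the idea of a bounded permit count (max 5 in BS) comes from this issue.

```typescript
// Hypothetical counting semaphore; not the actual lookupSemaphore implementation.
class Semaphore {
  private available: number;
  private readonly waiters: Array<() => void> = [];

  constructor(permits: number) {
    this.available = permits;
  }

  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available -= 1;
      return;
    }
    // No permit free: queue until release() hands one over.
    await new Promise<void>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) {
      next(); // hand the permit directly to the oldest waiter
    } else {
      this.available += 1;
    }
  }

  get queued(): number {
    return this.waiters.length;
  }
}

// The permit is held for as long as `task` runs, so a task that fails at 8s
// instead of 30s returns its permit that much earlier.
async function withPermit<T>(sem: Semaphore, task: () => Promise<T>): Promise<T> {
  await sem.acquire();
  try {
    return await task();
  } finally {
    sem.release();
  }
}
```

With all permits held by slow cold-cascade lookups, every queued request waits for the oldest holder to finish or time out, which is where the shorter per-call budget helps.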

Constraints

  • Don't bundle this with anything else from project #32 (post-launch hardening). Epic A (LML perf, LML#338) is the upstream fix that would make the cold cascade fast enough for the 30s ceiling to be fine. This ticket is the BS-side defense-in-depth so the cold cascade stays survivable while Epic A lands.
  • Not a feature add within Epic A's blast radius — it's a defense-in-depth tuning of an existing client-side budget, not a behavior change to the LML cascade itself. Cleared against the "stop adding features in the touched files" rule of project #32.
  • Don't push the timeout below ~5s without measurement — iOS retries on transient timeouts, and a too-tight ceiling would convert "single slow but successful lookup" into "three retry storms hitting LML."
  • The synthesized search URLs are guaranteed cheap to generate (string templating, no network). The fallback path's cost is ~0ms.
  • metadata_attempt_at semantics stay as the enrichment service writes them today: a synthesized-URL fallback never stamps the column, so the drift-repair sweep keeps retrying. /proxy/metadata/album doesn't write metadata_attempt_at directly (the runtime path does), so the controller change doesn't risk that invariant.
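
As an illustration of why the fallback is effectively free, synthesizing the three URLs can be pure string templating with no I/O. The URL patterns and the `buildSearchUrls` name below are assumptions for illustration; the real templates live in `searchUrlProvider.getAllSearchUrls` and may differ.

```typescript
// Assumed URL patterns for illustration; the real provider may use different ones.
interface SynthesizedSearchUrls {
  youtubeMusicUrl: string;
  bandcampUrl: string;
  soundcloudUrl: string;
}

function buildSearchUrls(artistName: string, releaseTitle: string): SynthesizedSearchUrls {
  // Encode once so spaces and punctuation are safe inside a query string.
  const q = encodeURIComponent(`${artistName} ${releaseTitle}`);
  return {
    youtubeMusicUrl: `https://music.youtube.com/search?q=${q}`,
    bandcampUrl: `https://bandcamp.com/search?q=${q}`,
    soundcloudUrl: `https://soundcloud.com/search?q=${q}`,
  };
}
```

Nothing here touches the network or a database, which is why the cost of the fallback path rounds to zero next to an 8s LML budget.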

Acceptance criteria

  • LmlClient.lookupMetadata accepts a per-call timeoutMs parameter; default stays 30000.
  • /proxy/metadata/album (getAlbumMetadata) passes timeoutMs: 8000 (or measured equivalent).
  • Unit test for the proxy timeout → fallback path. Induce LML latency via the mock-api-server (the exact API — simulateError with delayMs, a separate simulateLatency, or extending the mock — confirm against tests/mock-api-server/ source before writing the test). Assert response shape: status 200, youtubeMusicUrl/bandcampUrl/soundcloudUrl populated, artworkUrl / discogsUrl / discogsReleaseId / releaseYear / spotifyUrl / appleMusicUrl / artistBio absent (or null).
  • Unit test that the fire-and-forget enrichment path still uses 30s (regression guard — easy to accidentally change when the signature gains a parameter).
  • Manual prod verification: after deploy, dispatch the same Underworld — Jumbo cold-cache lookup from iOS (or curl with a valid bearer). Should get a 200 with search URLs in <10s instead of timing out. Assert response shape, not just status + latency — confirm only the three search URLs are populated on this cold-timeout path.

Out of scope

  • /proxy/metadata/artist and /proxy/entity/resolve — same regression vector, but SearchUrlProvider.getAllSearchUrls needs a name string and these endpoints only get artistId / type + id. There's no synthesized payload to fall into, and the catch-arm shape is an open design question that requires iOS-side coordination. Tracked separately in BS#991 with a "decide before implementing" gate.

Related

  • Sibling: BS#991 — same timeout-knob change for the artist + entity endpoints, behind the design decision.
  • Parent / context: BS#873 (closed) — the original cold-cascade incident.
  • Predecessor PR: BS#971 — raised the global timeout to 30s. This ticket refines that change rather than reversing it.
  • Upstream fix that would make this unnecessary: LML#338 (Epic A in project #32). When LML's cold-cascade lands in <5s, the 30s ceiling is fine everywhere.
  • Trigger event: discogs-etl#223 (closed) — Railway PG image swap dropped LML's connection pool; cold cache → user-visible timeouts.
  • Filed without a status:blocked label because the work doesn't actually require LML#338 to land first; it's a defense-in-depth.

Metadata

Labels: lml (Touches library-metadata-lookup)
