Problem
BS#971 (closed #873) raised LmlClient.TIMEOUT_MS to 30000ms globally at apps/backend/services/lml/lml.client.ts:55. That's the right ceiling for fire-and-forget enrichment (apps/backend/services/metadata/enrichment.service.ts): listeners aren't blocked by the wait, and the new .catch arm writes the three synthesized YouTube/Bandcamp/SoundCloud search URLs as a partial fallback while keeping metadata_attempt_at NULL for the recurring drift-repair sweep to retry.
It's too patient for the synchronous /proxy/metadata/album endpoint that iOS calls directly. On a cold-cache compilation lookup the LML cascade can take 9–22s (see BS#873's prod measurements). iOS clients set their own URLSession timeouts — wxyc-dj-tool-ios ships with a default that fires before BS's 30s ceiling, so the listener sees NSURLErrorDomain Code=-1001 "The request timed out." instead of the synthesized-URL response that the controller already produces.
Concrete repro from 2026-05-21, in the window after the Railway PG service-source swap dropped LML's PG connection pool (see discogs-etl#223 for the swap context):
fetching metadata for Underworld — Jumbo
Task <DE28239B-77D7-4DF3-8186-B2E649D47A9C>.<13> finished with error [-1001]
Error Domain=NSURLErrorDomain Code=-1001 "The request timed out."
NSErrorFailingURLStringKey=https://api.wxyc.org/proxy/metadata/album?artistName=Underworld&releaseTitle=Jumbo
BS Sentry spans on GET /proxy/metadata/album for the same window (statsPeriod=1h, ~10 min around the iOS failure):
| timestamp | span.duration | span.status |
|---|---|---|
| 19:22:30Z | 9.08s | ok |
| 19:23:04Z | 9.41s | ok |
| 19:23:15Z | 15.00s | ok |
All three returned 200 on the BS side; iOS gave up before the body arrived. The cold-cascade pattern matches BS#873's cache_stats exactly (api_calls=5–8, api_time_ms=15–22s). After 30 min of warm-cache use these same lookups land in sub-second territory.
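Where the controller already handles this correctly
getAlbumMetadata at proxy.controller.ts:304-315 already wraps the lookupMetadata call in a try/catch that just console.warns and falls through with artwork undefined; the synthesized search URLs get filled in at lines 353-356 and the controller returns 200. That holds today even when lmlFetch throws LmlClientError('LML request timed out', 504) on AbortError: the 504 never surfaces to the client.
The bug is latency, not status. The fix is one knob (per-call timeoutMs), not a controller reshape.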
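A minimal sketch of the catch-arm fall-through the controller relies on, under simplified assumptions. `albumMetadataSketch`, `Lookup`, `SearchUrls`, and `AlbumResponse` are hypothetical stand-ins for getAlbumMetadata, lookupMetadata, and the real payload; the actual controller code is not reproduced here.

```typescript
// Hypothetical, simplified shapes; the real payload carries more fields.
interface SearchUrls {
  youtubeMusicUrl: string;
  bandcampUrl: string;
  soundcloudUrl: string;
}

interface AlbumResponse extends SearchUrls {
  artworkUrl?: string;
}

// Stand-in for the LML lookup; it may reject (for example on a timeout).
type Lookup = (artist: string, title: string) => Promise<{ artworkUrl?: string }>;

// Any lookup failure is logged and swallowed, so the caller still receives
// a body containing the synthesized search URLs.
async function albumMetadataSketch(
  lookup: Lookup,
  searchUrls: SearchUrls,
  artist: string,
  title: string,
): Promise<AlbumResponse> {
  let artworkUrl: string | undefined;
  try {
    artworkUrl = (await lookup(artist, title)).artworkUrl;
  } catch (err) {
    console.warn('LML lookup failed; returning search URLs only', err);
  }
  const response: AlbumResponse = { ...searchUrls };
  if (artworkUrl !== undefined) {
    response.artworkUrl = artworkUrl;
  }
  return response;
}
```

Because the error never escapes the function, the status the client sees does not depend on whether the lookup succeeded; only the time spent waiting does.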
Why now
The 2026-05-21 Railway PG image swap (discogs-etl#223, closed) restarted the destination Postgres container, dumping LML's asyncpg pool. The next ~20 min of cold-cache LML lookups paid the full cascade. This will happen again on any Railway PG redeploy (deploys, manual restarts, infrastructure events). Worth fixing the trade-off so the next event doesn't surface as user-visible timeouts.
Desired end state
Per-call-site timeouts on LmlClient, with /proxy/metadata/album passing { timeoutMs: 8000 } (the exact value is a knob — start at 8s, measure, adjust). On an 8s timeout the existing getAlbumMetadata catch arm fires immediately, the synthesized search URLs already at 353-356 populate the response, and iOS gets a 200 inside their URLSession window with search URLs but no album artwork or release metadata. The real LML match lands via the recurring drift-repair sweep when the cache warms.
Shape of the change:
Add a timeoutMs?: number parameter to LmlClient.lookupMetadata. Default stays 30000 so existing fire-and-forget call sites are unchanged.
getAlbumMetadata passes { timeoutMs: 8000 }.
That's it on the controller side — the existing try/catch + fallback path holds.
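One way the per-call override could look, as a sketch only. `LookupOptions`, `resolveTimeoutMs`, `withTimeout`, and `TimeoutErrorSketch` are hypothetical names; the real LmlClient internals (lmlFetch, LmlClientError) are not shown in this ticket and may be structured differently.

```typescript
// Assumed default, matching the 30000ms global ceiling named in this ticket.
const DEFAULT_TIMEOUT_MS = 30_000;

interface LookupOptions {
  timeoutMs?: number;
}

// Hypothetical stand-in for LmlClientError('LML request timed out', 504).
class TimeoutErrorSketch extends Error {
  readonly status: number;
  constructor(message: string, status: number) {
    super(message);
    this.status = status;
  }
}

// The per-call override wins; callers that pass nothing keep the default.
function resolveTimeoutMs(options: LookupOptions = {}): number {
  return options.timeoutMs ?? DEFAULT_TIMEOUT_MS;
}

// Runs `work` with an AbortSignal and rejects once the timeout elapses.
function withTimeout<T>(
  work: (signal: AbortSignal) => Promise<T>,
  options: LookupOptions = {},
): Promise<T> {
  const controller = new AbortController();
  const timeoutMs = resolveTimeoutMs(options);
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      controller.abort(); // lets the underlying request cancel itself
      reject(new TimeoutErrorSketch('LML request timed out', 504));
    }, timeoutMs);
    work(controller.signal).then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

Under this shape, the album endpoint would call something like `withTimeout(doLookup, { timeoutMs: 8000 })`, while fire-and-forget callers omit the options argument and stay on the 30s default.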
What iOS gets on an 8s timeout (response shape)
With PROXY_METADATA_SINGLE_LOOKUP off (current default), an 8s timeout means the listener gets:
✅ youtubeMusicUrl, bandcampUrl, soundcloudUrl (the three synthesized search URLs from searchUrlProvider.getAllSearchUrls).
❌ discogsReleaseId, discogsUrl, artworkUrl, artistBio, artistWikipediaUrl, spotifyUrl, appleMusicUrl (the populateCommonMetadataFields output).
❌ releaseYear, genres, styles, label, discogsArtistId, fullReleaseDate, tracklist (the populateReleaseMetadata output).
The trade-off is "first cold lookup of a release returns search URLs only; the enriched payload arrives on the next read after the cache warms (or after the drift-repair sweep)." iOS UI should render gracefully without album artwork and without release-level metadata — not just without art.
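The response shape on the cold-timeout path can be expressed as a small predicate. This is a sketch, assuming the fallback body is a flat JSON object keyed by the field names used in this ticket; `isSearchOnlyFallback` is a hypothetical helper, not existing code.

```typescript
// Fields expected to be populated on the timeout path.
const SEARCH_URL_FIELDS = ['youtubeMusicUrl', 'bandcampUrl', 'soundcloudUrl'] as const;

// Fields expected to be absent (or null) on the timeout path.
const ENRICHED_FIELDS = [
  'discogsReleaseId', 'discogsUrl', 'artworkUrl', 'artistBio',
  'artistWikipediaUrl', 'spotifyUrl', 'appleMusicUrl',
  'releaseYear', 'genres', 'styles', 'label',
  'discogsArtistId', 'fullReleaseDate', 'tracklist',
] as const;

// True when the body carries the three search URLs and nothing enriched.
function isSearchOnlyFallback(body: Record<string, unknown>): boolean {
  const hasSearchUrls = SEARCH_URL_FIELDS.every((field) => {
    const value = body[field];
    return typeof value === 'string' && value.length > 0;
  });
  const noEnrichment = ENRICHED_FIELDS.every(
    (field) => body[field] === undefined || body[field] === null,
  );
  return hasSearchUrls && noEnrichment;
}
```

The same predicate could back both the unit test and the manual prod check, so "assert response shape" means the same thing in both places.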
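Where
apps/backend/services/lml/lml.client.ts:55 - global TIMEOUT_MS = 30000 → default + per-call override.
apps/backend/controllers/proxy.controller.ts:309-310 - pass { timeoutMs: 8000 } to lookupMetadata from getAlbumMetadata.
apps/backend/services/metadata/enrichment.service.ts - no change; confirm fire-and-forget callers still use the 30s default after the signature change.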
Side benefit — semaphore back-pressure
A shorter LML timeout releases the lookupSemaphore permit (max 5) sooner, so under the iOS-retry-storm scenario, queued retries get unblocked faster too. Not the primary motivation but a real second-order benefit when discogs-etl#223-style events happen again.
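A rough worst-case bound makes the back-pressure effect concrete. The sketch assumes every in-flight lookup holds its permit for the full timeout and that permits are handed out in batches; `worstCaseDrainMs` and the queue depth of 15 are illustrative, not measured.

```typescript
// Worst case: each batch of `permits` lookups holds its permits for the
// full timeout before the next batch can start.
function worstCaseDrainMs(queued: number, permits: number, timeoutMs: number): number {
  if (permits <= 0) {
    throw new RangeError('permits must be positive');
  }
  return Math.ceil(queued / permits) * timeoutMs;
}

// 15 queued retries against the 5-permit semaphore:
const at30s = worstCaseDrainMs(15, 5, 30_000); // 90000 ms
const at8s = worstCaseDrainMs(15, 5, 8_000); // 24000 ms
```

Under those assumptions the queue drains in 24s rather than 90s, which is the second-order benefit described above.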
Constraints
Don't bundle this with anything else from project #32 (post-launch hardening). Epic A (LML perf, LML#338) is the upstream fix that would make the cold cascade fast enough for the 30s ceiling to be fine. This ticket is the BS-side defense-in-depth so the cold cascade stays survivable while Epic A lands.
Don't push the timeout below ~5s without measurement — iOS retries on transient timeouts, and a too-tight ceiling would convert "single slow but successful lookup" into "three retry storms hitting LML."
The synthesized search URLs are guaranteed cheap to generate (string templating, no network). The fallback path's cost is ~0ms.
metadata_attempt_at semantics stay as the enrichment service writes them today: a synthesized-URL fallback never stamps the column, so the drift-repair sweep keeps retrying. /proxy/metadata/album doesn't write metadata_attempt_at directly (the runtime path does), so the controller change doesn't risk that invariant.
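To illustrate why the fallback is effectively free, here is a sketch of search-URL synthesis as pure string templating. The URL templates and the `synthesizeSearchUrls` name are assumptions for illustration; the real searchUrlProvider.getAllSearchUrls may build its URLs differently.

```typescript
interface SearchUrls {
  youtubeMusicUrl: string;
  bandcampUrl: string;
  soundcloudUrl: string;
}

// Pure function: no network, no I/O, only URL-encoding and templating.
function synthesizeSearchUrls(artistName: string, releaseTitle: string): SearchUrls {
  const query = encodeURIComponent(`${artistName} ${releaseTitle}`);
  return {
    // Assumed templates; the production provider may use other paths or params.
    youtubeMusicUrl: `https://music.youtube.com/search?q=${query}`,
    bandcampUrl: `https://bandcamp.com/search?q=${query}`,
    soundcloudUrl: `https://soundcloud.com/search?q=${query}`,
  };
}
```

Since nothing here can block, the only latency the timeout path adds to the response is the timeout itself.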
Acceptance criteria
LmlClient.lookupMetadata accepts a per-call timeoutMs parameter; default stays 30000.
/proxy/metadata/album (getAlbumMetadata) passes timeoutMs: 8000 (or measured equivalent).
Unit test for the proxy timeout → fallback path. Induce LML latency via the mock-api-server (the exact API — simulateError with delayMs, a separate simulateLatency, or extending the mock — confirm against tests/mock-api-server/ source before writing the test). Assert response shape: status 200, youtubeMusicUrl/bandcampUrl/soundcloudUrl populated, artworkUrl / discogsUrl / discogsReleaseId / releaseYear / spotifyUrl / appleMusicUrl / artistBio absent (or null).
Unit test that the fire-and-forget enrichment path still uses 30s (regression guard — easy to accidentally change when the signature gains a parameter).
Manual prod verification: after deploy, dispatch the same Underworld — Jumbo cold-cache lookup from iOS (or curl with a valid bearer). Should get a 200 with search URLs in <10s instead of timing out. Assert response shape, not just status + latency — confirm only the three search URLs are populated on this cold-timeout path.
Out of scope
/proxy/metadata/artist and /proxy/entity/resolve — same regression vector, but SearchUrlProvider.getAllSearchUrls needs a name string and these endpoints only get artistId / type + id. There's no synthesized payload to fall into, and the catch-arm shape is an open design question that requires iOS-side coordination. Tracked separately in BS#991 with a "decide before implementing" gate.
Related
Sibling: BS#991 — same timeout-knob change for the artist + entity endpoints, behind the design decision.
Parent / context: BS#873 (closed) — the original cold-cascade incident.
Predecessor PR: BS#971 — raised the global timeout to 30s. This ticket refines that change rather than reversing it.
Trigger event: discogs-etl#223 (closed). The Railway PG image swap dropped LML's connection pool; cold cache → user-visible timeouts.
Filed without a status:blocked label because the work doesn't actually require LML#338 to land first; it's defense-in-depth.