Per-call-site LML timeouts for /proxy/metadata/artist + /proxy/entity/resolve (design before implementing) #991

@jakebromberg

Problem

Sibling ticket to BS#990 (per-call-site LML timeouts for /proxy/metadata/album). The other two synchronous proxy endpoints — /proxy/metadata/artist and /proxy/entity/resolve — face the same regression vector (cold-cache LML cascade > iOS's URLSession ceiling → user-visible NSURLErrorDomain Code=-1001 timeout) but the synthesized-URL fallback that closes BS#990 doesn't apply here:

Neither endpoint has a try/catch today; LmlClientError bubbles to the global handler and surfaces as 502. So unlike the album endpoint (where the fix is just plumbing the timeout knob), these need both a timeout AND a controller reshape — but the reshape's target shape is the open question.
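A minimal sketch of the "timeout AND controller reshape" pairing, to make the moving parts concrete. Every name here (`LmlTimeoutError`, `withTimeout`, `proxyWithBudget`, `Reply`) is hypothetical; the real `LmlClient` and controller signatures are not shown in this ticket. The catch arm is left as an injected `onTimeout` callback because its return shape is exactly the undecided part.

```typescript
type Reply = { status: number; body: unknown };

// Hypothetical timeout error; the real LmlClientError's shape is not shown here.
class LmlTimeoutError extends Error {
  constructor(public readonly timeoutMs: number) {
    super("LML call exceeded " + timeoutMs + "ms");
    this.name = "LmlTimeoutError";
  }
}

// Name-based guard, so the check does not depend on the prototype chain.
function isLmlTimeout(err: unknown): boolean {
  return err instanceof Error && err.name === "LmlTimeoutError";
}

// Races the call against a timer and aborts the signal once the budget is spent.
async function withTimeout<T>(
  call: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new LmlTimeoutError(timeoutMs));
    }, timeoutMs);
  });
  try {
    return await Promise.race([call(controller.signal), timeout]);
  } finally {
    clearTimeout(timer); // no-op if the timer already fired
  }
}

// Controller-side reshape: only timeouts reach the catch arm's fallback;
// every other error keeps bubbling to the global handler as it does today.
async function proxyWithBudget(
  lookup: (signal: AbortSignal) => Promise<unknown>,
  onTimeout: () => Reply,
  timeoutMs: number = 8000, // starting value from the Constraints section
): Promise<Reply> {
  try {
    return { status: 200, body: await withTimeout(lookup, timeoutMs) };
  } catch (err) {
    if (isLmlTimeout(err)) return onTimeout();
    throw err;
  }
}
```

Passing the fallback in as a callback keeps this sketch neutral across options (a) through (e); a real PR would presumably hard-code whichever shape gets chosen.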

Design question (must decide before implementing)

When LmlClient times out from a synchronous proxy endpoint that can't fall back to synthesized search URLs, what does iOS see?

Candidate shapes:

| Option | Status | Body | iOS UX |
| --- | --- | --- | --- |
| (a) empty / null body | 200 | { discogsArtistId: null, bio: null, ... } | Listener renders empty bio + no avatar; reads next time may succeed |
| (b) explicit fallback flag | 200 | { ..., _lookupFailed: true } | Same UX but client can show retry affordance |
| (c) 504 + Retry-After hint | 504 | { error: "lml_timeout", retryAfterMs: 30000 } | Listener handles 5xx explicitly; can retry or show error |
| (d) keep current bubble-to-502 | 502 | (global error shape) | Today's behavior; iOS shows generic "couldn't load" |
| (e) hybrid | 200 if any partial result, 504 otherwise | varies | Most listener-friendly; most controller complexity |
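For side-by-side comparison, options (a) through (c) can be written as one small function. This is a sketch, not a proposal: `timeoutResponse` and `ProxyReply` are hypothetical names, only the two artist fields named in the table are included (the elided `...` fields are left out), and the `number | null` / `string | null` field types are assumptions. Options (d) and (e) are omitted because (d) is "no catch arm" and (e) depends on partial results.

```typescript
// Only the fields named in the table; the real payload has more (elided here).
type ArtistMetadata = { discogsArtistId: number | null; bio: string | null };

type TimeoutOption = "a" | "b" | "c";

interface ProxyReply {
  status: number;
  body: unknown;
}

// Builds the timeout-path reply for one candidate option.
function timeoutResponse(option: TimeoutOption, retryAfterMs: number = 30000): ProxyReply {
  const empty: ArtistMetadata = { discogsArtistId: null, bio: null };
  switch (option) {
    case "a":
      // Indistinguishable from a genuinely unknown artist.
      return { status: 200, body: empty };
    case "b":
      // Same payload plus an explicit marker the client can branch on.
      return { status: 200, body: { ...empty, _lookupFailed: true } };
    case "c":
      // Error status with a retry hint carried in the body.
      return { status: 504, body: { error: "lml_timeout", retryAfterMs } };
  }
}
```

Written out this way, (a) and (b) differ by a single field, which is the whole cost of letting the client tell "unknown artist" from "LML didn't answer".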

Inputs needed before picking:

  • Current iOS-side handling of 200-with-empty vs 504 vs 502 — check wxyc-ios-64 for getArtistMetadata / resolveEntity callers and what expectations they encode.
  • Whether the client wants to differentiate "no data exists" (genuinely unknown artist) from "we tried but LML didn't answer in time" — (b) or (c) make that explicit; (a) collapses them.

The right shape isn't obvious from the BS side alone; this needs an iOS-side coordination pass.
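To illustrate the "no data exists" vs "LML didn't answer in time" distinction, here is a rough sketch of the branching a client would need, written in TypeScript as a stand-in for the iOS logic. `classify` and `Outcome` are hypothetical and do not reflect what wxyc-ios-64 actually does today; that still needs checking.

```typescript
type Outcome = "has_data" | "no_data" | "lookup_timed_out" | "error";

// Classifies a proxy reply the way a client could under each candidate option.
function classify(status: number, body: Record<string, unknown> | null): Outcome {
  // Option (c): explicit 504 with a machine-readable error code.
  if (status === 504 && body?.error === "lml_timeout") return "lookup_timed_out";
  // Option (d) and any other non-2xx reply: generic failure.
  if (status < 200 || status >= 300) return "error";
  // Option (b): 200 with an explicit marker.
  if (body?._lookupFailed === true) return "lookup_timed_out";
  // Option (a) lands here: an all-null body cannot be told apart from an unknown artist.
  const hasData =
    body !== null &&
    Object.keys(body).some((key) => key.charAt(0) !== "_" && body[key] !== null);
  return hasData ? "has_data" : "no_data";
}
```

Under option (a), a timed-out lookup classifies as `no_data`, so any retry affordance would have to be shown for every empty artist or for none.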

Where

Constraints

  • Project-board placement matches BS#990: this is not a feature add within Epic A's blast radius. It is defense-in-depth tuning of an existing client-side budget, not a behavior change to the LML cascade itself, so it is cleared against the "stop adding features in the touched files" rule of project #32.
  • Decide before implementing: file a comment on this ticket documenting the chosen option (a/b/c/d/e) with the iOS-side context, then write the PR against that decision. Don't ship the controller-reshape PR without that comment landed.
  • Keep the timeout value coordinated with BS#990 (start at 8s; if BS#990 measures something else, match).

Acceptance criteria

  • Comment on this ticket documents the chosen response shape with iOS-side context.
  • LmlClient.lookupArtist (and any other entry point used by these endpoints) accepts the per-call timeoutMs parameter (inherits from BS#990 if landed first; otherwise add it here).
  • /proxy/metadata/artist and /proxy/entity/resolve cap LML calls at the chosen timeout.
  • Both endpoints implement the chosen catch-arm behavior.
  • Unit tests for each endpoint's timeout path. Assert response shape matching the chosen option, not just status + latency.
  • Manual prod verification: trigger a cold-cache lookup for each endpoint; confirm the chosen shape arrives within the timeout budget.
  • iOS-side updated if the chosen shape isn't backward compatible with current callers.
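A framework-agnostic sketch of what "assert response shape, not just status + latency" could look like for one endpoint. Everything here is a stand-in: `artistHandler` is a hypothetical reduction of the controller, `stalledLookup` fakes an LML call that never answers and rejects once its per-call `timeoutMs` elapses, and option (b) is used purely for illustration. The real tests would use the repo's test runner and the actual controller.

```typescript
type Reply = { status: number; body: unknown };

// Fake LML call: never answers, rejects once the per-call budget is spent.
function stalledLookup(timeoutMs: number): Promise<never> {
  return new Promise<never>((_, reject) => {
    setTimeout(() => {
      const err = new Error("LML call exceeded " + timeoutMs + "ms");
      err.name = "LmlTimeoutError"; // hypothetical error name
      reject(err);
    }, timeoutMs);
  });
}

// Hypothetical reduction of the controller: success path plus the timeout catch arm.
async function artistHandler(
  lookup: (timeoutMs: number) => Promise<unknown>,
  timeoutMs: number,
  onTimeout: Reply,
): Promise<Reply> {
  try {
    return { status: 200, body: await lookup(timeoutMs) };
  } catch (err) {
    if (err instanceof Error && err.name === "LmlTimeoutError") return onTimeout;
    throw err;
  }
}

// Returns the list of failed checks; an empty list means the timeout path passed.
async function testTimeoutPath(): Promise<string[]> {
  // Option (b), chosen here only so the sketch has a concrete shape to assert.
  const chosen: Reply = {
    status: 200,
    body: { discogsArtistId: null, bio: null, _lookupFailed: true },
  };
  const started = Date.now();
  const reply = await artistHandler(stalledLookup, 20, chosen);
  const elapsedMs = Date.now() - started;

  const failures: string[] = [];
  if (reply.status !== chosen.status) failures.push("status");
  // Full-body comparison: a status-only check would pass even with the wrong shape.
  if (JSON.stringify(reply.body) !== JSON.stringify(chosen.body)) failures.push("body shape");
  if (elapsedMs > 500) failures.push("latency");
  return failures;
}
```

The body comparison is the part that ties the test to the decision comment: if the chosen option changes, the expected shape changes with it and the test fails until the controller follows.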

Related

  • Sibling: BS#990 — same regression for /proxy/metadata/album, but with a clean synthesized-URL fallback. Implement-then-measure independently of this ticket.
  • Parent / context: BS#873 (closed) — the original cold-cascade incident; full prod measurements live there.
  • Predecessor PR: BS#971 — raised the global timeout to 30s.
  • Upstream fix that would make this unnecessary: LML#338 (Epic A in project #32). When LML's cold-cascade lands in <5s, the 30s ceiling is fine everywhere.
  • Trigger event: discogs-etl#223 (closed) — Railway PG image swap dropped LML's connection pool; cold cache → user-visible timeouts.
