Skip to content

Reconsider process-wide limiter shape (G4 follow-up) #962

@jakebromberg

Description

@jakebromberg

Context

G4 (#906) shipped a module-level Semaphore(5) + TokenBucket(50/min) in apps/backend/services/lml/lml.client.ts to mirror LML's discogs_max_concurrent + discogs_rate_limit ceilings, giving BS pre-emptive backpressure so we don't queue arbitrary depth inside LML. The implementation works as designed and is the right safety net to have.

This ticket is a design refinement, not a bug. The BS#955 incident surfaced four sharp edges in the chosen shape; this ticket frames the design questions for a future revisit. The fixes we shipped (#957 env-var overrides on both CI surfaces, follow-ups #958 + #959) are point patches that keep the current shape working; they don't address the shape itself.

Sharp edges observed

1. Module-level state persists across spec files under --runInBand

TokenBucket and Semaphore are module-scoped singletons initialized at import. The integration suite (--runInBand) imports lml.client.ts once and shares the bucket across every spec. By the time metadata-lml.spec.js runs, earlier specs (flowsheet.spec.js, search.spec.js, etc.) may have fired enough fireAndForgetMetadata calls to drain the 50-token bucket; subsequent calls then wait ~1200 ms each for refill, blowing the 30 s per-test budget. This was the BS#955 cause; today the workaround is to set LML_CLIENT_RATE_PER_MIN=60000 in CI to make the bucket effectively unlimited.

A bucket scoped to a request (or to a logical caller, e.g. the proxy controller vs the enrichment job) would not need test overrides.

2. Test-vs-prod env divergence has two surfaces

The override has to be wired in both local ci:testmock (docker-compose backend service env block) and GHA Integration-Tests (workflow Start services env block — host node processes). Pinned by source-grep test tests/unit/scripts/lml-limiter-test-env.test.ts. Any future limiter knob will need the same dual wire-up. BS#958 will mechanically guard the env-block divergence, but doesn't address the underlying coupling between limiter knob count and CI-surface count.

3. No integration with the existing per-request budget

PR #618 added LIBRARY_SEARCH_ENRICHMENT_BUDGET_MS (per-request deadline budget); LML's API contract supports an X-Caller-Budget-Ms header. A request that's been queueing inside the bucket for 1200 ms has burned ~25 % of a typical 5 s budget before its actual upstream call starts, but the caller can't see that — the limiter and the budget timer are independent axes. Either the limiter should consult the deadline (skip rather than queue when budget would be exceeded) or queue time should be subtracted from the budget passed to LML.

4. Shared bucket across call sites with different deadline tolerance

A synchronous /proxy/metadata/album lookup has a tight deadline (the iOS client is waiting). A fireAndForgetMetadata enrichment call has loose deadline tolerance (the row will get filled in eventually, the client polls). Sharing one bucket means a flood of fire-and-forget can starve sync. Splitting by call site (or giving sync priority through a small reserved sub-budget) would let us tune them independently.

Design questions (not committing to a fix)

  • Should the bucket be scoped to a request, a call-site, or the process?
  • Should bucket queueing consume from the caller's budget?
  • Should sync and fire-and-forget share a bucket or split?
  • Is the right metric here "depth" (current), "wait time per call", or both?
  • If we keep process-wide, does the existing _resetLmlClientLimitersForTest() hook want to wire into beforeEach for integration setup, so spec files don't inherit each other's bucket state?

Out of scope

  • Reverting G4. The safety net is real — when LML actually saturates Discogs, the limiter does its job.
  • Replacing the limiter with a circuit breaker. That's a separate concern (failure response, not pre-emptive backpressure).

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    status:design-onlyDesign/exploration work; not committed to a fix

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status

    Todo

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions