[Epic G] Capture enrichment-outcome quality (all-null result), not just LML client throws #969

@jakebromberg

Problem

The existing Sentry instrumentation at the .catch arm in apps/backend/services/metadata/enrichment.service.ts only fires when fetchMetadata throws. That captures the LML-timeout path, but it misses the dominant failure mode: LML responds just inside its 5 s budget with a degraded result (empty artwork, empty streaming URLs, synthesized fallback only). The row gets UPDATE'd with the degraded result, no exception is thrown, and no Sentry event is emitted.

The 2026-05-19 data pull on BS#692 quantified the undercount:

  • 30 days of subsystem:metadata exception capture: 68 events across two Sentry-grouped issues.
  • LML-side spans on /api/v1/lookup over the same window: thousands of slow lookups (>4.5 s) — 3,072 in one burst at 2026-05-16 06:46 UTC, 1,024 at 2026-05-16 03:33 UTC, 128 at 2026-05-18 23:14 UTC, plus a long-tail sparse trickle.
  • Capture undercount ratio: ~50–500× depending on which caller is hitting LML.

The 2026-05-14 iOS outage surfaced as 9 captured BS events for what BS#873 called "70% of newly-added playcuts broken for hours". Most of those 70% never threw — they got the degraded result and the UPDATE landed silently.

This issue closes the gap by moving the capture point from "LML client throws" to "enrichment outcome is empty".

Why now (Epic G fit)

Epic G (#881) charters enrichment-pipeline observability. The 2026-05-13 perf regression went undetected for two days specifically because the existing alerts measure the wrong thing — Sentry caught timeouts, the timeouts were sparse, and the actual user-visible all-null outcomes never paged.

The companion epic child BS#904 (G2) was framed as "pick a Sentry capture disposition for subsystem=metadata" — that question presupposed the existing capture was instrumenting the right thing. Per the BS#692 data pull, it isn't. This issue replaces #904's question with the right one: instrument outcome quality, then alert on it.

End state

In apps/backend/services/metadata/enrichment.service.ts, after the row is UPDATE'd with the result of fetchMetadata, the code evaluates whether the user-visible fields are all null and emits a structured Sentry event with a stable fingerprint and a classified cause when they are.

Specifically:

  • A new capture predicate: a row is "empty outcome" iff artwork_url IS NULL AND spotify_url IS NULL AND apple_music_url IS NULL AND discogs_url IS NULL AND release_year IS NULL. (Same definition surface as BS#873's failure-mode characterization.)
  • A new Sentry capture path on the "outcome empty" branch — Sentry.captureMessage('enrichment-empty-outcome', { level: 'warning', tags: { subsystem: 'metadata', cause, transaction }, fingerprint: ['enrichment-empty-outcome', 'subsystem-metadata'] }).
  • The cause tag classifies into the failure mode that produced the empty outcome, drawn from the LML response shape and any catch-arm error:
    • lml-timeout — the LmlClientError("LML request timed out") path threw (covers what BACKEND-SERVICE-1 / BACKEND-SERVICE-B already capture today).
    • lml-degraded — LML responded inside the budget with artwork === null across all source legs (this is the dominant case the existing instrumentation misses).
    • lml-no-match — LML returned an explicit no-match verdict (kind: unresolved post-pivot).
    • unknown — everything else; this should approach zero once classification is dialed in.
  • The fingerprint is stable across releases, so the same issue persists deploy-over-deploy. The current per-release fingerprinting (BACKEND-SERVICE-1 + BACKEND-SERVICE-B are split because the release tag changed) becomes a single long-lived Sentry issue.
  • A Sentry alert rule pages when the sustained rate of enrichment-empty-outcome events exceeds the chosen threshold for 15 minutes. The threshold is set against the post-fix baseline, not guessed.
  • Concurrently, the legacy Sentry.captureException(err, { tags: { subsystem: 'metadata' } }) on the catch arm stays. It now feeds one of the four cause classifications (lml-timeout) rather than being the only signal. Once the new path has been in production for a week, BACKEND-SERVICE-1 + BACKEND-SERVICE-B can be resolved as superseded.
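
The predicate and cause classification above could be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the EnrichmentMetadata and LmlOutcomeShape types, the artworkBySource field, and the check ordering (timeout, then no-match, then degraded) are assumptions made for the example. Only the five column names, the four cause values, the "LML request timed out" message, and kind: unresolved come from this issue.

```typescript
// Hypothetical row shape: the five user-visible fields named in the predicate.
interface EnrichmentMetadata {
  artwork_url: string | null;
  spotify_url: string | null;
  apple_music_url: string | null;
  discogs_url: string | null;
  release_year: number | null;
}

type EnrichmentCause = "lml-timeout" | "lml-degraded" | "lml-no-match" | "unknown";

// Assumed summary of what LML returned (or threw); not the real client type.
interface LmlOutcomeShape {
  error?: { message: string };            // set when the catch arm ran
  kind?: string;                          // "unresolved" marks an explicit no-match
  artworkBySource?: Array<string | null>; // artwork result per source leg
}

// True iff every user-visible field is null.
function isEmptyOutcome(m: EnrichmentMetadata): boolean {
  return (
    m.artwork_url === null &&
    m.spotify_url === null &&
    m.apple_music_url === null &&
    m.discogs_url === null &&
    m.release_year === null
  );
}

// The metadata parameter is kept to match the signature proposed in this
// issue; this sketch classifies from the response shape alone.
function classifyCause(_metadata: EnrichmentMetadata, shape: LmlOutcomeShape): EnrichmentCause {
  if (shape.error?.message.includes("timed out")) return "lml-timeout";
  // Checked before "degraded" on the assumption that a no-match response
  // also carries null artwork.
  if (shape.kind === "unresolved") return "lml-no-match";
  const legs = shape.artworkBySource;
  if (legs !== undefined && legs.length > 0 && legs.every((a) => a === null)) {
    return "lml-degraded";
  }
  return "unknown";
}
```

Keeping both helpers free of any Sentry or database dependency is what makes them straightforward to unit-test per cause path.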

Why this captures the 2026-05-14 outage

Counterfactual against the data:

  • During the 2026-05-14/15 burst window, the actual LML response shape was "responded inside 5 s but with no useful fields populated" (BS#873 confirms: "logged the timeout to Sentry and wrote nothing"). Today's instrumentation: caught the timeouts that did throw, 9 events. The new instrumentation: would have caught all the rows that got UPDATE'd with empty results — likely hundreds, matching the "70% of newly-added playcuts" surface.
  • At an illustrative 30-events/hour threshold (vs today's effectively infinite threshold), the 2026-05-14 outage would have paged within ~30 minutes of the regression starting. That's the order-of-magnitude detection improvement Epic G needs.
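
One way to sanity-check a candidate threshold against exported event timestamps is a small backtest like the sketch below. It uses a simplified trailing-window model (count events in the last window, scale to events/hour, compare to the threshold). That model is an assumption made for illustration and is not a description of how Sentry evaluates alert rules; firstPageTime and its parameters are hypothetical names.

```typescript
// Returns the timestamp (ms) of the first event at which the trailing-window
// rate exceeds thresholdPerHour, or null if it never does.
function firstPageTime(
  eventTimesMs: readonly number[],
  thresholdPerHour: number,
  windowMs: number,
): number | null {
  const sorted = [...eventTimesMs].sort((a, b) => a - b);
  const windowsPerHour = (60 * 60 * 1000) / windowMs;
  let start = 0; // index of the oldest event still inside the window
  for (let end = 0; end < sorted.length; end++) {
    // Drop events that fell out of the trailing window ending at sorted[end].
    while (sorted[end] - sorted[start] >= windowMs) start++;
    const countInWindow = end - start + 1;
    if (countInWindow * windowsPerHour > thresholdPerHour) return sorted[end];
  }
  return null;
}
```

Feeding it the timestamps of a past incident and of a quiet week gives a rough sense of whether a threshold would have paged during the former while staying silent during the latter.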

Why this doesn't blow up the Sentry quota

Baseline rate of all-null outcomes outside incidents is bounded by LML's no-match rate, which is observable in trace data: outside the incident bursts captured above, the >4.5 s slow-lookup rate is single-digit events/hour. So the steady-state enrichment-empty-outcome rate should also be small. Even if it isn't, the stable fingerprint means events stack on a single issue rather than fanning out. Grouping limits issue noise rather than event count, so the quota argument rests on the low baseline rate.

Implementation outline

  1. Identify the post-UPDATE branch in enrichment.service.ts where the metadata object that was just written is still in scope. Likely inside the .then() arm after the LML lookup resolves successfully.
  2. Add the predicate check + classification function. Keep them isolated as isEmptyOutcome(metadata) and classifyCause(metadata, lmlResponseShape) so they're unit-testable.
  3. Add the Sentry.captureMessage call on the truthy branch of the predicate.
  4. Wire the alert rule in Sentry once events start landing (don't pre-create; need real baseline first).
  5. Update metadata.enrichment.test.ts to cover each cause path (lml-timeout / lml-degraded / lml-no-match) — adjacent to BS#907 (G5)'s scope.
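
Step 3 could look something like the sketch below. To keep the example self-contained and testable, the capture function is injected rather than imported; in the service it would presumably be Sentry.captureMessage. The reportEmptyOutcome name, the CaptureContext and CaptureMessage types, and the constant are hypothetical. The message, level, tags, and fingerprint values are the ones proposed in this issue.

```typescript
type EnrichmentCause = "lml-timeout" | "lml-degraded" | "lml-no-match" | "unknown";

// Structural subset of the context object this issue passes to captureMessage.
interface CaptureContext {
  level: "warning";
  tags: Record<string, string>;
  fingerprint: string[];
}

// Injected so tests can substitute a recording fake for the Sentry call.
type CaptureMessage = (message: string, context: CaptureContext) => unknown;

const EMPTY_OUTCOME_EVENT = "enrichment-empty-outcome";

function reportEmptyOutcome(
  capture: CaptureMessage,
  cause: EnrichmentCause,
  transaction: string,
): void {
  capture(EMPTY_OUTCOME_EVENT, {
    level: "warning",
    tags: { subsystem: "metadata", cause, transaction },
    // Neither release nor cause appears in the fingerprint, so all events
    // group under one long-lived issue; cause stays filterable as a tag.
    fingerprint: [EMPTY_OUTCOME_EVENT, "subsystem-metadata"],
  });
}
```

In a test, passing a function that pushes its arguments onto an array is enough to assert the tags and fingerprint for each cause path without touching Sentry.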

Acceptance

  • isEmptyOutcome(metadata) + classifyCause(metadata, lmlResponseShape) helpers exist and have unit tests covering each cause classification.
  • Sentry.captureMessage('enrichment-empty-outcome', ...) fires on the empty-outcome branch with a stable fingerprint and the four-value cause tag.
  • Cross-check against current prod traffic: within 48h of deploy, the new event class has captured at least an order of magnitude more events than the legacy captureException path did over the same window, confirming the instrument now measures the right thing.
  • Sentry alert rule paging on the new fingerprint at a threshold informed by the first week of baseline data (don't pre-commit to a number).
  • BACKEND-SERVICE-1 + BACKEND-SERVICE-B can be resolved one week after the new path is in production.

Relationships

  • Sub-issue of #881 (Epic G — Enrichment-pipeline observability).
  • Supersedes #692 (the original "pick capture disposition" framing; the 2026-05-19 data pull on that issue showed it was asking the wrong instrumentation question, and the right one is captured here).
  • Replaces / re-scopes #904 (G2, previously "Resolve BS#692: pick Sentry capture disposition for subsystem=metadata"). That disposition decision now turns on shipping this instrumentation first; #904 can be re-scoped to "tune the alert threshold once #N has accrued a week of baseline" or closed in favor of the alert-threshold acceptance criterion above.
  • Adjacent to #907 (G5 — test gaps in metadata.enrichment.test.ts); the new helpers' unit tests can land in the same PR.
  • Triggered by the 2026-05-19 data pull on #692 and the BS#873 iOS outage RCA.

Metadata

Labels: enhancement (New feature or request), lml (Touches library-metadata-lookup)

Project status: Todo
