Change MetricsAggregatorService to publisher to support live metrics without shmem #306
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request refactors the metrics aggregation system to use a registry-based architecture with HDR histograms and a generalized pub/sub transport layer, replacing the legacy mmap-backed storage. The update introduces periodic snapshot publishing with disk fallback and updates reporting logic to consume these snapshots. Feedback suggests improving encapsulation by exposing in-flight task metrics through public properties and adopting a more numerically stable variance formula for high-precision latency calculations. Additionally, several legacy tests have been skipped pending migration to the new system.
(superseded — see #pullrequestreview-4237753512 for the full review-council output)
arekay-nv left a comment
Review Council — Multi-AI Code Review
Reviewed by: Codex + Claude | Depth: thorough
Found 12 issues across 7 files (excluding tests, pyproject.toml, uv.lock).
⚠️ GitHub's review-comment API is currently returning a persistent `pull_request_review_thread.base` internal error on this PR for inline comments via `line+side` and position-based posts (the github-code-quality and gemini-code-assist bot comments from yesterday were unaffected). Falling back to a single review with file:line links — clicking each link opens the PR diff at the cited line.
Note: the existing gemini comments on aggregator.py:286/342/353/354 already cover the encapsulation concern around table._in_flight_tasks direct access, so that issue is intentionally not duplicated here. The cancel-without-await issue (#4 below) is a separate concurrency concern at the same file:line.
🔴 Must Fix (critical/high)
Issues that will cause incorrect behavior, data loss, or security problems in production.
| # | File:Line | Reviewer | Category | Summary |
|---|---|---|---|---|
| 1 | src/inference_endpoint/commands/benchmark/execute.py:482 | Claude | data-integrity | Subscriber late-binding can drop initial ticks (ZMQ slow-joiner) |
| 2 | src/inference_endpoint/async_utils/services/metrics_aggregator/publisher.py:168 | Claude | performance | _write_atomic_fallback runs blocking I/O on the event loop |
| 3 | src/inference_endpoint/async_utils/services/metrics_aggregator/registry.py:208 | Codex | performance | Unbounded raw-sample retention in SeriesSampler |
🟡 Should Fix (medium)
Real issues that trigger under specific conditions or design flaws that will compound.
| # | File:Line | Reviewer | Category | Summary |
|---|---|---|---|---|
| 4 | src/inference_endpoint/async_utils/services/metrics_aggregator/aggregator.py:353 | Claude | error-handling | Cancellations not awaited before reading n_pending |
| 5 | src/inference_endpoint/async_utils/services/metrics_aggregator/registry.py:161 | Claude | bug | HdrHistogram(low, high, sig_figs) constructed without high >= 2*low check |
| 6 | src/inference_endpoint/async_utils/services/metrics_aggregator/subscriber.py:55 | Claude | data-integrity | conflate=True default for the Report consumer is fragile |
| 7 | src/inference_endpoint/commands/benchmark/execute.py:423 | Codex | data-integrity | Stale final_snapshot.msgpack not cleared on report-dir reuse |
🔵 Consider (low)
Valid improvements that could be follow-ups.
| # | File:Line | Reviewer | Category | Summary |
|---|---|---|---|---|
| 8 | src/inference_endpoint/async_utils/services/metrics_aggregator/aggregator.py:281 | Claude | bug | Double STARTED reassigns and orphans the tick task |
| 9 | src/inference_endpoint/async_utils/services/metrics_aggregator/publisher.py:195 | Claude | concurrency | close() cancels tick task but doesn't await it |
| 10 | src/inference_endpoint/async_utils/services/metrics_aggregator/__main__.py:166 | Claude | error-handling | No top-level exception handling around run_until_complete(main()) |
| 11 | src/inference_endpoint/load_generator/session.py:408 | Claude | design | ERROR-before-COMPLETE relies on an undocumented publisher ordering contract |
| 12 | src/inference_endpoint/async_utils/services/metrics_aggregator/aggregator.py:132 | Claude | bug | SIGTERM bypasses publish_final |
Detailed findings
1. 🔴 high · data-integrity · [Claude]
src/inference_endpoint/commands/benchmark/execute.py:482
Subscriber late-binding can drop initial ticks (ZMQ slow-joiner)
MetricsSnapshotSubscriber is constructed (482) and start()ed (485) AFTER launcher.launch() waits for aggregator readiness. The aggregator can begin publishing as soon as STARTED is observed; ZMQ slow-joiner means snapshots emitted before the SUB handshake completes are dropped. The comment on line 480 acknowledges this. On a busy host the subscriber may also miss COMPLETE if the publisher closes before the subscription warms up, silently degrading to LIVE/DRAINING (incomplete report). Fix: move subscriber construction + start() BEFORE launcher.launch(). Connecting to a not-yet-bound IPC path is fine — ZMQ retries connect transparently.
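The slow-joiner drop and the connect-before-bind fix are easy to demonstrate with raw pyzmq; a minimal sketch (hypothetical IPC path and toy timing, not the PR's wrapper classes):

```python
import asyncio
import zmq
import zmq.asyncio

async def main() -> None:
    ctx = zmq.asyncio.Context()
    # Connect the SUB *before* the PUB binds: ZMQ retries the connect
    # transparently until a binder appears, so the subscription is already
    # established when the first snapshot goes out.
    sub = ctx.socket(zmq.SUB)
    sub.connect("ipc:///tmp/metrics_demo.sock")
    sub.setsockopt(zmq.SUBSCRIBE, b"")

    pub = ctx.socket(zmq.PUB)
    pub.bind("ipc:///tmp/metrics_demo.sock")
    await asyncio.sleep(0.2)        # toy stand-in for the subprocess-spawn window
    await pub.send(b"snapshot-0")   # dropped if the SUB had joined late instead
    print(await sub.recv())         # b'snapshot-0'
    ctx.destroy(linger=0)

asyncio.run(main())
```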
2. 🔴 high · performance · [Claude]
src/inference_endpoint/async_utils/services/metrics_aggregator/publisher.py:168
_write_atomic_fallback runs blocking I/O on the event loop
f.write + f.flush + two os.fsync (file + parent dir) + os.rename execute synchronously on the aggregator subprocess's main event loop (awaited via publish_final from aggregator.py:362). On a busy box os.fsync can block tens-to-hundreds of ms — long enough to back-pressure event-record processing if any events are still in flight at finalization. Fix: wrap with await asyncio.to_thread(self._write_atomic_fallback, payload).
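A sketch of the suggested offload, assuming the write sequence described above (the standalone function here is a stand-in for the publisher's private helper):

```python
import asyncio
import os

def _write_atomic_fallback(path: str, payload: bytes) -> None:
    # tmp-write + fsync(file) + rename + fsync(parent dir): exactly the
    # blocking POSIX sequence the finding describes, now off the event loop.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())          # durability of the file contents
    os.rename(tmp, path)              # atomic replace on POSIX
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)              # durability of the rename itself
    finally:
        os.close(dir_fd)

async def publish_final(path: str, payload: bytes) -> None:
    # The loop stays free to process event records while fsync blocks.
    await asyncio.to_thread(_write_atomic_fallback, path, payload)
```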
3. 🔴 high · performance · [Codex]
src/inference_endpoint/async_utils/services/metrics_aggregator/registry.py:208
Unbounded raw-sample retention in SeriesSampler
self._raw.append(value) retains every observed value in an array.array for the full run so the final snapshot can recompute exact percentiles. The prior mmap-backed store spilled this to disk; the new path keeps it in the aggregator subprocess RAM, so memory now scales linearly with run length. A 5-min 50k-QPS run produces ~15M samples (~120MB per int64 series); once latency/ISL/OSL/TTFT/TPOT are all tracked, several hundred MB; longer runs OOM or swap heavily before the report is emitted. Fix: periodically spill _raw to disk, cap with reservoir sampling, or rely solely on the HDR digest for percentiles.
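Of the suggested mitigations, reservoir sampling is the simplest to sketch; an illustrative Algorithm R sampler (not the PR's code) that caps memory while keeping percentiles approximately unbiased:

```python
import random

class Reservoir:
    """Fixed-memory stand-in for the unbounded _raw array: retains a uniform
    random subset of everything seen, so approximate percentiles survive
    arbitrarily long runs at a constant memory budget."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.samples: list[int] = []
        self.seen = 0

    def append(self, value: int) -> None:
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(value)
            return
        # Replace a random slot with probability capacity/seen; every
        # observed value ends up retained with equal probability.
        j = random.randrange(self.seen)
        if j < self.capacity:
            self.samples[j] = value
```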
4. 🟡 medium · error-handling · [Claude]
src/inference_endpoint/async_utils/services/metrics_aggregator/aggregator.py:353
Cancellations not awaited before reading n_pending
After the drain timeout, t.cancel() only schedules cancellation at the next await point. n_pending = len(table._in_flight_tasks) on the next line therefore reads a count that's still high, and the not-yet-cancelled tasks may still be running when _publisher.close() and the loop teardown happen → "Task was destroyed but it is pending!" warnings on shutdown. Fix: after t.cancel(), do await asyncio.gather(*tasks, return_exceptions=True) (bounded by a short timeout) before recomputing n_pending. Independent of the encapsulation point gemini already raised.
5. 🟡 medium · bug · [Claude]
src/inference_endpoint/async_utils/services/metrics_aggregator/registry.py:161
HdrHistogram(low, high, sig_figs) constructed without high >= 2*low check
hdrh requires high >= 2*low. Current bounds (1ns–3.6e12ns, 1–1e7 tokens) satisfy this, but a future caller passing e.g. hdr_low=1, hdr_high=1 (or any hdr_high < 2*hdr_low) gets an opaque ValueError from deep inside the C library. Fix: explicit pre-check after the clamps: if self._hdr_high < self._hdr_low * 2: raise ValueError(...).
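Shape of the suggested pre-check as a standalone helper (names illustrative):

```python
def check_hdr_bounds(series: str, hdr_low: int, hdr_high: int) -> None:
    """Fail fast with a readable message; hdrhistogram's C layer requires
    high >= 2*low but surfaces only an opaque error when it doesn't hold."""
    if hdr_high < 2 * hdr_low:
        raise ValueError(
            f"series {series!r}: hdr_high ({hdr_high}) must be >= "
            f"2 * hdr_low ({hdr_low}) to construct an HdrHistogram"
        )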
6. 🟡 medium · data-integrity · [Claude]
src/inference_endpoint/async_utils/services/metrics_aggregator/subscriber.py:55
conflate=True default for the Report consumer is fragile
The docstring argues this is safe because COMPLETE is the publisher's last message, but ZMQ CONFLATE only keeps the most recent unread message. If the subscriber's loop is starved (main process busy in wait_for_exit) and the publisher closes before the SUB sees COMPLETE (LINGER expires, IPC socket unlinked), complete never fires and latest may be a stale LIVE. The 2 s wait_for_complete timeout in execute.py:548 papers over most cases, but consider conflate=False for the Report consumer (a single producer at a few snapshots/sec, not a TUI).
7. 🟡 medium · data-integrity · [Codex]
src/inference_endpoint/commands/benchmark/execute.py:423
Stale final_snapshot.msgpack not cleared on report-dir reuse
metrics_output_dir = ctx.report_dir / "metrics" is created (423) but an existing final_snapshot.msgpack from a prior run is never removed. If a user reuses --report-dir and the new run exits before publishing a fresh final snapshot (SIGTERM, crash), _load_final_snapshot_from_disk() decodes the previous run's file and silently builds the new report from stale metrics. Fix: unlink any pre-existing final_snapshot.msgpack here, OR refuse to decode a snapshot whose timestamp/run-id doesn't match the current run.
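The unlink variant of the fix, sketched as a hypothetical helper on the parent's setup path:

```python
from pathlib import Path

def prepare_metrics_dir(report_dir: Path) -> Path:
    metrics_dir = report_dir / "metrics"
    metrics_dir.mkdir(parents=True, exist_ok=True)
    # A leftover snapshot from a previous run must never be mistaken for
    # this run's output if the new run dies before publishing its own.
    (metrics_dir / "final_snapshot.msgpack").unlink(missing_ok=True)
    return metrics_dir
```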
8. 🔵 low · bug · [Claude]
src/inference_endpoint/async_utils/services/metrics_aggregator/aggregator.py:281
Double STARTED reassigns and orphans the tick task
self._publisher.start(...) is called every time a SessionEventType.STARTED record is observed. If two STARTED events ever land in the EventRecord stream (replay, buggy producer, test fixture), start() reassigns _tick_task; the previous task keeps running until garbage-collected and races with the new task to publish snapshots. Fix: guard with if self._tick_task is None: ... inside publisher.start(), or only call start() on the first STARTED in process().
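Minimal shape of the idempotency guard (class and method names assumed from the review):

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

class MetricsPublisher:
    """Stand-in showing only the idempotent-start guard."""

    def __init__(self) -> None:
        self._tick_task: asyncio.Task | None = None

    def start(self) -> None:
        if self._tick_task is not None:
            # Duplicate STARTED: keep the original task. Reassigning here
            # would orphan it and race two tick loops against one socket.
            logger.warning("publisher already started; ignoring duplicate STARTED")
            return
        self._tick_task = asyncio.create_task(self._tick())

    async def _tick(self) -> None:
        ...  # periodic snapshot publishing loop
```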
9. 🔵 low · concurrency · [Claude]
src/inference_endpoint/async_utils/services/metrics_aggregator/publisher.py:195
close() cancels tick task but doesn't await it
self._tick_task.cancel() (205) is followed immediately by self._publisher.close() (206). If close() is invoked from a sync path, the cancelled task may still be runnable when the underlying ZMQ publisher socket is closed → brief window where _tick could call publish() on a closed transport. ZmqMessagePublisher.publish() early-returns on is_closed so this isn't fatal, but produces noisy CancelledError-during-shutdown traces. Fix: make close() async-aware (mirror publish_final) or document that callers must await publish_final (which already cancels the task) before close.
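The cancel-then-await pattern the finding asks for, as a generic helper (a sketch, not the `aclose()` a later commit adds):

```python
import asyncio

async def cancel_and_await(task: asyncio.Task) -> None:
    """cancel() only schedules cancellation; awaiting the task afterwards
    guarantees it has exited before dependent resources (here the ZMQ
    transport) are closed."""
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected: the task ended via our cancel
```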
10. 🔵 low · error-handling · [Claude]
src/inference_endpoint/async_utils/services/metrics_aggregator/__main__.py:166
No top-level exception handling around run_until_complete(main())
If main() raises (bad CLI args, ZMQ bind failure, tokenizer load error), the subprocess exits with a stack trace but the parent's ServiceLauncher.wait_for_exit sees only a non-zero exit code with no diagnostic context propagated upward. Fix: wrap with try/except + structured logging so failures are surfaced in the parent's logs alongside the exit code.
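A sketch of the suggested wrapper; `main()` stands in for the module's entry coroutine:

```python
import asyncio
import logging
import sys

logger = logging.getLogger("metrics_aggregator")

async def main() -> None:
    ...  # argparse, ZMQ bind, tokenizer load, run loop

def entrypoint() -> None:
    loop = asyncio.new_event_loop()
    try:
        loop.run_until_complete(main())
    except SystemExit:
        raise  # keep argparse usage / explicit sys.exit user-facing
    except Exception:
        # Structured context for the parent, which otherwise sees only a
        # non-zero exit code from ServiceLauncher.wait_for_exit.
        logger.exception("metrics aggregator subprocess failed")
        sys.exit(1)
    finally:
        loop.close()
```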
11. 🔵 low · design · [Claude]
src/inference_endpoint/load_generator/session.py:408
ERROR-before-COMPLETE relies on an undocumented publisher ordering contract
The aggregator's TRACKED_SAMPLES_FAILED logic (aggregator.py:303–305) assumes ERROR is observed BEFORE COMPLETE removes the row. The ordering is correct here, but it depends on ZmqMessagePublisher preserving publish() order through batching AND ZMQ PUB→SUB delivering in order to a single SUB (both currently true). A short comment in session.py noting "publisher must preserve publish-order" would harden this against future transport refactors.
12. 🔵 low · bug · [Claude]
src/inference_endpoint/async_utils/services/metrics_aggregator/aggregator.py:132
SIGTERM bypasses publish_final
_finalize() only sets _shutdown_event after publish_final and _publisher.close() complete, but __main__.py doesn't install a SIGTERM handler. On parent-side launcher.kill_all (or any external SIGTERM before ENDED arrives), neither pub/sub COMPLETE nor disk fallback runs, leaving the parent's triple-redundant snapshot path empty. Fix: loop.add_signal_handler(SIGTERM, ...) to flush the final snapshot defensively before exit.
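A sketch of the defensive flush; `flush_final_snapshot` is a stand-in for publish_final plus the disk fallback:

```python
import asyncio
import signal

async def flush_final_snapshot() -> None:
    ...  # publish_final + disk fallback would run here

async def main() -> None:
    loop = asyncio.get_running_loop()
    stop = asyncio.Event()
    # Route SIGTERM into the loop instead of letting it kill the process
    # outright, so the final snapshot is flushed before exit.
    loop.add_signal_handler(signal.SIGTERM, stop.set)
    await stop.wait()
    await flush_final_snapshot()
```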
Generated by /review-council — Codex gpt-5.4 (review of git diff against main) + Claude (direct file review with HEAD-source line verification).
Two high-severity issues raised by the review-council pass on PR #306:

1. (#306-1) Subscriber late-binding could drop early ticks via the ZMQ slow-joiner pattern. Move MetricsSnapshotSubscriber construction + start() BEFORE launcher.launch() so the SUB handshake completes during the subprocess-spawn window. ZMQ tolerates connect-before-bind on IPC — the connect resolves once the binder appears. The prior ordering (subscribe AFTER launch returns) had a window where the aggregator could begin ticking on STARTED before the SUB subscription warmed up, dropping early live snapshots and, in the worst case, missing COMPLETE entirely.

2. (#306-2) MetricsPublisher._write_atomic_fallback runs synchronous f.flush + fsync(file) + fsync(parent dir) + rename on the aggregator's event loop. On a busy host this can block tens-to-hundreds of ms — long enough to back-pressure event-record processing. Wrap with asyncio.to_thread inside publish_final.

Both fixes are localized — no API changes, no test changes required. Existing integration tests (test_concurrency_benchmark, test_end_to_end_oracle) exercise both paths end-to-end and still pass.

The third P0 item (#306-3, unbounded raw-sample retention) is the agreed memory trade documented in metrics_pubsub_design_v5.md §11; addressed by adding "--persist-raw" as a tracked follow-up rather than a code change in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from 85dfac6 to 4110447
arekay-nv left a comment
Please do a performance test as well to determine what the limits of the interval and qps are.
Address PR #306 review comments from gemini-code-assist (encapsulation) and github-code-quality (non-iterable enum loop):
- Add `MetricsTable.in_flight_tasks_count` property so the aggregator no longer reaches into `table._in_flight_tasks` to report pending-task counts on snapshots and drain logging.
- Add `MetricsTable.cancel_in_flight_tasks()` returning the list of cancelled tasks (sets up the T3 await-cancellations fix).
- Update aggregator.py call sites accordingly.
- Use `MetricCounterKey.__members__.values()` in test_report_builder to satisfy CodeQL's "non-iterable used in for-loop" check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR #306 review comments from arekay-nv asking for the "interval" naming convention used elsewhere in the repo (e.g. `check_interval` in worker_manager, `interval` in benchmark_httpclient).
- CLI flag `--refresh-hz <Hz>` → `--publish-interval <seconds>` (default 4.0 Hz → 0.25 s; same wire cadence).
- Constructor parameters `refresh_hz` (aggregator + publisher) → `publish_interval_s`. The `_s` suffix makes the unit explicit so call sites can't accidentally pass a frequency.
- Internal field `_refresh_hz` → `_publish_interval_s`.
- Drops the `period = 1.0 / refresh_hz` indirection in publisher.start.
- Tests / AGENTS.md updated accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR #306 gemini-code-assist comment on report.py:53.

For ns-precision latency series (`SAMPLE_LATENCY_NS`, `TTFT_NS`, `TPOT_NS`, etc.) the rollups store `total` and `sum_sq` as Python ints that can grow to many digits. The previous formula `sum_sq - total*total / n` evaluates `total*total / n` as a float and catastrophically cancels against `sum_sq` when the variance is small relative to the mean, producing a negative variance numerator that the sqrt() then clamps to 0.

Use the exact integer numerator `n*sum_sq - total*total` when the inputs are ints (this is what the math.sqrt sees, no cancellation), falling back to the float form for series whose dtype is float (currently only TPOT, where the magnitudes are small enough that the naive form is fine).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
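A self-contained illustration of the cancellation described in the commit above and of the exact-integer fix (the helper name is hypothetical; the PR applies the same numerator inside report.py):

```python
import math

def stddev_from_rollup(n: int, total: int, sum_sq: int) -> float:
    """Population std-dev from integer rollups. The numerator
    n*sum_sq - total**2 stays in exact Python int arithmetic, so nothing
    cancels before the single conversion to float inside sqrt."""
    if n == 0:
        return 0.0
    return math.sqrt(n * sum_sq - total * total) / n

# ~1e9-ns latencies with +/-1 ns jitter: true population std-dev is
# sqrt(2/3), about 0.816.
samples = [10**9, 10**9 + 1, 10**9 - 1]
n, total = len(samples), sum(samples)
sum_sq = sum(v * v for v in samples)
print(stddev_from_rollup(n, total, sum_sq))                   # ~0.816
# The naive float form collapses to 0.0: sum_sq and total*total/n are both
# ~3e18, far beyond float's ~9e15 exact-integer range, so they cancel.
print(math.sqrt(max(0.0, (sum_sq - total * total / n) / n)))  # 0.0
```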
Address PR #306 review-council items #4, #9, #12.

#4 — Cancellations not awaited before reading n_pending
After a drain timeout, the aggregator's `t.cancel()` loop only *scheduled* cancellation; reading `n_pending` on the next line therefore reported a stale-high count and left the to-be-cancelled tasks runnable when the loop tore down. Now `await asyncio.gather(*cancelled, return_exceptions=True)` runs before `n_pending = table.in_flight_tasks_count`, so the snapshot reflects the post-cancellation set and the cancelled tasks actually exit.

#9 — close() cancels tick task but doesn't await it
Added `MetricsPublisher.aclose()` (async) that cancels the tick task AND awaits its exit before closing the underlying transport. Aggregator's post-publish_final path and __main__.py's finally block now use it. Sync `close()` is kept for sync error-path fallbacks with a docstring noting the race.

#12 — SIGTERM bypasses publish_final
Installed `SIGTERM` and `SIGINT` handlers in __main__.py that fire `publish_final` defensively before setting `shutdown_event`. Added `MetricsPublisher._finalized` so the SIGTERM-triggered and the ENDED-triggered paths are safe to race — only the first call publishes a COMPLETE frame.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR #306 council review #5 (registry.py:161). The C-backed hdrhistogram constructor requires `high >= 2*low` but raises an opaque allocation error if that doesn't hold — making it hard to debug a misconfigured `register_series` call. Add an explicit pre-check after the low/high clamps so the error names the series and both values up front. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR #306 council review #8 (aggregator.py:281). A repeat `SessionEventType.STARTED` (replay buffer, buggy producer, test fixture) used to make `MetricsPublisher.start` overwrite `_tick_task`, orphaning the first tick task — it kept running until GC and raced the new task to publish snapshots. Make `start` idempotent: if `_tick_task` is already set, log a warning and return without spawning a second task. The original task remains the one `publish_final` / `aclose` cancels and awaits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR #306 council review #10 (__main__.py:166). Wrap the top-level `run_until_complete(main())` so startup / bind / tokenizer-load failures emit a structured `logger.exception` before the interpreter prints the traceback. The parent's ServiceLauncher previously saw only the non-zero exit code and a raw stderr trace with no context to correlate against the parent's logs. `SystemExit` is re-raised untouched so argparse usage / explicit sys.exit paths stay user-facing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR #306 council review #11 (session.py:408). The metrics aggregator's `TRACKED_SAMPLES_FAILED` accounting relies on the publisher delivering ERROR strictly before COMPLETE for a failed sample. The ordering is correct today (ZMQ PUB→SUB in-order delivery, ZmqMessagePublisher batches without reordering), but it's an implicit contract — a future transport refactor that breaks it would break tracked-failure counting silently. Document the invariant inline so that future refactors trip over it instead of past it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR #306 arekay-nv inline comment on aggregator.py:99 ("This might need to be higher").

Analysis: at the system's design point (50k QPS short-context, default 2 tokenizer workers) the 30 s drain finishes in well under a second. Long-context tokenize workloads can push the backlog higher — a 32k-context 5k-QPS run with 2 workers can take ~100 s to drain. The right knob there is `--tokenizer-workers`, not the drain budget, but giving the user a CLI handle makes both ends tunable without redeploying.

Changes:
- Default drain budget bumped 30 s → 60 s. Covers normal + long-context at the default 2 workers without inflating the high-QPS short-context case (we exit early when drain_tasks returns).
- New `--drain-timeout <seconds>` CLI flag plumbed through the aggregator subprocess and into `MetricsAggregatorService` as a constructor arg `drain_timeout_s`. The kwarg is positionable (not a global) so callers can inject test values without monkey-patching the module-level constant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR #306 arekay-nv inline comments on snapshot.py:35 (add an `INITIALIZE` state preceding `LIVE`) and test_snapshot.py:67 (add state-check tests).
- Add `SessionState.INITIALIZE = "initialize"` to the wire schema; the aggregator now starts in INITIALIZE and transitions to LIVE on the first STARTED event. The state machine is forward-only: INITIALIZE → LIVE → DRAINING → COMPLETE.
- No INITIALIZE snapshot is emitted today (the tick task only starts on the first STARTED), but the state exists as the well-defined starting point and so a future setup-phase tick has a state to carry. Wire compatibility is preserved — INITIALIZE round-trips through the codec (test added).
- New `TestSessionStateTransitions` pins: member set, declaration order (consumers can rely on `list(SessionState)` for forward checks), the `complete = state == COMPLETE and n_pending_tasks == 0` rule across every state, and the INITIALIZE round-trip.
- AGENTS.md updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR #306 arekay-nv inline comment on registry.py:118 ("can we add tests to ensure that the behavior is fixed and any changes are caught by tests, specifically the internal points/boundaries").

New `TestSeriesSamplerBoundaries` class pins:
- HDR construction-time invariants: `high < 2*low` rejected, equality case accepted, `low=0` coerced to 1, unsupported dtype rejected.
- Clamp behavior at the HDR bounds: values exactly at `hdr_low` / `hdr_high` are unclamped and don't trip the warn-once flag.
- Under- and over-bound clamping: warn-once fires exactly once per sampler, raw values stay un-clamped (only HDR's view is clamped).
- Float dtype uses float comparison for the lower clamp (so sub-integer under-bound values are still detected).
- sig_figs at HDR-supported extremes (1 and 5) construct and record.
- Rollup edges: count==1 (min==max==total, sum_sq==v^2) and the empty case (count==0, histogram==[]).
- Warn-once flag is per-sampler, not process-global.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Decouple the two delivery paths so the Report consumer no longer depends on pub/sub terminal-frame survivability. Closes PR #306 council #6 (conflate=True fragility for the Report consumer).

Architecture change:
- `MetricsPublisher.publish_final(..., interrupted: bool = False)` now atomically writes `final_snapshot.json` (pretty-printed, dict form) as the **primary** Report source AND publishes the terminal-state snapshot over pub/sub as a **TUI shutdown signal**. Disk write and pub/sub send are independent best-effort paths.
- Signal handler in `__main__.py` invokes `publish_final(interrupted=True)` so SIGTERM/SIGINT writes a snapshot tagged `INTERRUPTED` (introduced in the prior commit) — distinguishes "user killed the run mid-execution" from a clean shutdown.
- `MetricsSnapshotSubscriber` is now TUI-only: stripped `complete`, `_complete_event`, `wait_for_complete`. `conflate=True` is the unambiguous default — no Report-consumer fragility to reason about.
- `execute.py` reads `final_snapshot.json` via `json.loads` straight to the dict form, drops the 2 s `wait_for_complete` window and the triple-redundant fallback chain. Single fallback: if the file is missing (SIGKILL/OOM before the signal handler ran), convert the subscriber's `latest` live snapshot via `snapshot_to_dict` and mark the report incomplete.
- `Report.from_snapshot` now accepts a dict (the consumer contract). All field reads use `dict.get(...)` with defaults that produce an honest "incomplete" report on missing fields rather than crashing. Surfaces a `state: str` field so `display()` renders an explicit INTERRUPTED indicator.
- New `snapshot_to_dict()` in `snapshot.py` is the one-way bridge from the wire `MetricsSnapshot` (array_like=True, compact msgpack) to the dict form used by both the file writer and any consumer that needs to feed a live Struct into Report. The inverse is intentionally absent — see `Report.from_snapshot` docstring for the rationale.

Tests rewritten:
- `test_publisher.py`: assertions read JSON from disk instead of msgpack, new test for `interrupted=True` writing `state=interrupted`.
- `test_aggregator_e2e.py`: covers both delivery paths (JSON file + pub/sub terminal frame).
- `test_report_builder.py`: routes through `snapshot_to_dict`; new tests for INTERRUPTED display, empty-dict defaults, and malformed metric entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduces the three primitives that the upcoming pub/sub metrics path will compose on top of:
- snapshot.py: MetricsSnapshot wire struct (msgspec, tagged union of CounterStat | SeriesStat) plus SessionState enum (LIVE / DRAINING / COMPLETE) and msgpack codec.
- registry.py: MetricsRegistry holding CounterSamplers and SeriesSamplers. Series samplers carry an HDR Histogram for cheap live percentiles, an array.array of raw values for exact-final computation, and exact rollup primitives. Histogram bucket edges are log-spaced over the observed [min, max] per snapshot, so they auto-zoom to data instead of wasting buckets on empty range.
- New unit tests cover the wire codec round-trip, sampler hot path, and registry registration/collision behavior.

Adds hdrhistogram==0.10.3 as a runtime dependency. Wiring of these primitives into the aggregator and removal of the old KVStore path follow in subsequent commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- publisher.py: MetricsPublisher owns the periodic tick task that publishes live MetricsSnapshots over IPC pub/sub at refresh_hz, plus publish_final() which is awaited by the aggregator on ENDED. Final delivery is dual-path:
  * pub/sub publish (best-effort, telemetry knobs sndhwm=4, linger=10s)
  * disk fallback (atomic: tmp + fsync(file) + rename + fsync(parent dir))
  Both paths are independently wrapped in try/except — neither failure suppresses the other. publish_final is async and awaits tick-task cancellation before publishing COMPLETE so a late LIVE/DRAINING tick can never land after COMPLETE on the wire.
- subscriber.py: MetricsSnapshotSubscriber tracks ``latest`` and the ``COMPLETE``-state snapshot. Defaults to conflate=True (TUI / report consumer) but parametrized for any consumer that needs every tick.
- New unit tests cover tick-task lifecycle, atomic disk fallback, independence of pub/sub vs disk failure paths, and the regression that publish_final must await tick-task cancellation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the mmap-backed BasicKVStore with the registry/publisher path introduced in the previous two commits.

Aggregator changes:
- MetricsAggregatorService now constructs a MetricsRegistry and MetricsPublisher on entry; trigger callbacks call registry.record / registry.increment instead of kv_store.update.
- Tracks SessionState (LIVE → DRAINING on ENDED → COMPLETE on publish_final). The publisher tick task captures (state, n_pending_tasks) per tick via a callback; consumers detect drain timeout as state == COMPLETE and n_pending_tasks > 0.
- Adds TRACKED_SAMPLES_FAILED counter, incremented on ERROR events whose tracked row still exists at processing time. Correctness depends on the load_generator emitting ERROR before COMPLETE; the matching test asserts that order.
- ENDED handler awaits drain_tasks (30s timeout), publish_final, and closes the publisher (linger=10s drains pending pub/sub frames).

Report changes:
- Replaces from_kv_reader with from_snapshot (pure function on a MetricsSnapshot). complete is derived from state == COMPLETE and n_pending_tasks == 0. Display warns when not complete.

Main-process changes (commands/benchmark/execute.py):
- Spawns a MetricsSnapshotSubscriber on the main loop. Triple-redundant report sourcing: pub/sub COMPLETE → disk fallback → latest live.
- Removes _setup_kv_reader, ARM tmpfs branching, and mmap salvage in _salvage_tmpfs (events.jsonl salvage is preserved).
- Awaits subscriber.wait_for_complete(timeout=2.0) after launcher exit so the loop can dispatch the COMPLETE frame before deciding the pub/sub path missed.

Removed:
- async_utils/services/metrics_aggregator/kv_store.py
- async_utils/services/metrics_aggregator/fs_check.py

Tests:
- Deletes test_kv_store.py.
- Marks test_aggregator.py / test_aggregator_e2e.py / test_metrics_table.py / test_report_builder.py / conftest.py with module-level skip + a TODO referencing the design doc; rewriting these on the new fixtures is a tracked follow-up.
- Adds test_aggregator_error_handler.py covering the TRACKED_SAMPLES_FAILED increment path and the negative case where COMPLETE arrives before ERROR (documents the bug the ERROR/COMPLETE swap fixes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Swaps the publish order in BenchmarkSession._handle_response so that a QueryResult carrying an error emits ErrorEventType.GENERIC first, then SampleEventType.COMPLETE. This is required for metrics-aggregator correctness: COMPLETE causes MetricsTable.set_field to remove the tracked row, so an ERROR observed afterward has no row to inspect and TRACKED_SAMPLES_FAILED would silently stay at 0. Emitting ERROR first keeps the row alive long enough for the aggregator's error handler to identify the failure as tracked. EventLoggerService and other event consumers treat the two event types independently, so order is invisible to them. The test_failed_query_published_as_error_event test now asserts the order explicitly so a future revert is caught immediately, and the aggregator-side regression is covered by test_aggregator_error_handler. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Foundation commit for the JSON-file-final-snapshot refactor: add a
terminal state distinct from `COMPLETE` so signal-handler-triggered
final snapshots can be told apart from clean ENDED-driven ones.
- Add `SessionState.INTERRUPTED = "interrupted"` and document the
forward-only transition graph in the enum docstring:
`INITIALIZE → LIVE → DRAINING → {COMPLETE | INTERRUPTED}`
- Tighten the `state == COMPLETE and n_pending_tasks == 0` complete-
predicate test to cover both INTERRUPTED + n_pending=0 and
INTERRUPTED + n_pending>0 as "not complete".
- Add a wire-round-trip test for INTERRUPTED via the msgpack codec.
No call-site changes yet — the next commit wires the publisher /
signal handler / consumer to use INTERRUPTED, and switches the
persisted final snapshot from msgpack pub/sub fallback to a JSON
file as the primary source.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR #306 review-council finding (Claude #5): a duplicate `STARTED` event silently froze `total_duration_ns` for the rest of the run because the max-of-elapsed guard never beat the new smaller deltas computed against the later start timestamp. The producer contract is "STARTED exactly once per session". Treat a duplicate as a producer bug: log an error with both timestamps and DROP the duplicate (don't re-assign `_session_start_ns`). The publisher.start guard already rejects the second tick-task spawn (council #8); this commit defends the session-state side of the same invariant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR #306 review-council finding (Claude #7): the persisted `final_snapshot.json` could contain literal `NaN` / `Infinity` tokens if any series recorded a non-finite float (e.g. division-by-zero in a future TPOT calc, clock-skew artifact, etc.). Python's `json.loads` reads those back fine, but `jq`, Go's `encoding/json`, JS strict mode, and most other strict-JSON consumers reject them — and the documented "cat / jq the file" workflow makes this a real interop tripwire.

Two changes:
1. `snapshot.py::snapshot_to_dict` scrubs non-finite floats to `None` on the numeric fields where they could land (counter value, series total/min/max/sum_sq/percentiles/histogram-edges). `None` is self-describing in the dict consumer: `Report.from_snapshot` uses `dict.get(..., default)` so the absence-mapping degrades gracefully to zero/empty.
2. `publisher.py::publish_final` switches `json.dumps` to `allow_nan=False`. With the scrub in place this should never raise; if it does, that's a producer-side bug that needs surfacing, not silencing into a non-strict JSON file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR #306 review-council findings (Codex #1 + #2):

#1 — SIGTERM-driven `_signal_finalize` skipped the `tracked_duration_ns` refresh that the ENDED-driven path does at `aggregator.py:379-381`. Interrupted reports therefore showed `duration_ns=0` / `QPS=N/A` even after processing many tracked samples. Mirror the ENDED path: `registry.set_counter(..., table.total_tracked_duration_ns)` before `publish_final`.

#2 — On interactive ^C, the OS sends SIGINT to the whole foreground process group; the aggregator child received it and immediately called `publish_final(interrupted=True)`, writing the file from whatever state it had at signal time. Samples that completed during the parent's clean-shutdown window (between the SIGINT and the parent's eventual ENDED) never reached the file because `_finalized=True` made the subsequent ENDED-driven `publish_final` a no-op. Result: systematic undercount on interactive runs.

Fix: SIGINT registers a no-op handler that silences Python's default KeyboardInterrupt and lets the parent's ENDED path drive the aggregator's finalize. SIGTERM remains the only signal that finalizes — used by `ServiceLauncher.kill_all` when the parent decides to terminate the child before ENDED arrives.

New integration tests in `tests/integration/async_utils/services/metrics_aggregator/test_signal_handling.py` spawn the aggregator as a real subprocess and verify both paths end-to-end (SIGTERM writes `state=interrupted`; SIGINT does not write the file and the subprocess stays alive).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR #306 review-council finding (Claude #15): the consumer-side fallback ladder in `execute.py` had no test coverage. The three branches (file present / file absent / file malformed) plus the state→complete-flag→display contract are load-bearing for the "JSON file is the canonical Report source" architecture, but a regression that swapped precedence or mis-defaulted on a malformed file would go unnoticed until manual QA.

New `TestLoadFinalSnapshotFromDisk` pins:
- file missing → None (SIGKILL / OOM case)
- valid JSON → dict returned with state+pending fields intact
- malformed JSON → None + WARNING logged (graceful, not crash)

New `TestReportFromLoadedSnapshot` pins:
- Parametrized state × n_pending → expected `report.complete`, covering clean-COMPLETE, drain-timeout-COMPLETE, INTERRUPTED-0, and INTERRUPTED-with-pending.
- INTERRUPTED display() surfaces the signal-driven shutdown so a user reading the output knows the data is partial.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR #306 review-council follow-ups #6, #7, #8, #9, #14, #17.

#6 — Drop the bottom `if not self.complete` WARNING in `Report.display`. The top if/elif (state == "interrupted" vs not self.complete) already says everything needed and says it correctly. The bottom warning fired a second time for INTERRUPTED runs with the misleading "(drain timeout)" attribution.

#7 — Reword `execute.py` fallback log from "report will be marked incomplete" to "state may or may not be terminal" — the latest pub/sub frame may in fact be a terminal-state signal.

#8 — Update `MetricsSnapshot.state` field docstring to list all five states (INITIALIZE, LIVE, DRAINING, COMPLETE, INTERRUPTED) and note that COMPLETE / INTERRUPTED are both terminal.

#9 — Codify the "parent owns directory setup" contract. The parent (`commands/benchmark/execute.py:432-433`) already creates `<report_dir>/metrics/` before launching the aggregator subprocess. The child's redundant `mkdir` and the publisher's redundant `path.parent.mkdir` are both replaced with a fail-fast contract check in `__main__.py`: if the directory doesn't exist at startup, the child raises `SystemExit` with a clear message in its stderr. This prevents the prior failure mode where an mkdir error in the child caused a 30s parent-side launcher timeout with no visible diagnostic. (The parent-side fail-fast-on-early-subprocess-death piece remains a known follow-up against `ServiceLauncher`.)

#14 — Enforce the "p50 mandatory" contract at registration time. `MetricsRegistry.register_series` now rejects percentiles tuples that omit 50.0, with a clear error message naming the series. `_series_to_metric_dict` keeps the midrange fallback as defense-in-depth for hand-crafted snapshot dicts (e.g. manually-edited JSON files) that bypass the registry path, with a comment labeling it as approximate-only.

#17 — Expand the `publisher.py:publish_final` pub/sub-publish `except` comment to call out the legitimate ENDED-vs-signal race (a SIGTERM-driven publish_final reaching `aclose()` first leaves the underlying ZMQ socket closed when this publish runs). The dropped TUI frame in that race is acceptable because the JSON file is the authoritative Report source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR #306 review-council follow-ups #10, #12, #13.

#10 — Top-level exception handler in `__main__.py` caught `BaseException`, which includes `KeyboardInterrupt`. If SIGINT arrived before the per-loop signal handlers were registered (during argparse / `aggregator.start()` / tokenizer load), the user-initiated ^C was logged as "subprocess crashed" with a full traceback — misleading on a clean interactive shutdown. Narrow to `except Exception as e:` so KeyboardInterrupt and SystemExit propagate untouched, and log the concrete exception type up front for grep-ability.

#12 — `aggregator.process()` ENDED path called `publish_final` → `aclose()` → `_finalize()` as three top-level awaits. If `publish_final` raised (e.g. tick-task crashed with a non-CancelledError that escaped its `await self._tick_task`), the remaining two cleanup steps were skipped — and `_finalize()` is what sets `shutdown_event`. Without it, `await shutdown_event.wait()` in main() hangs forever absent a signal. Wrap in `try/finally` so the cleanup pair always runs, with the inner `aclose()` also wrapped so its own failure can't prevent `_finalize()` from completing.

#13 — `_write_atomic_json` in `publisher.py` didn't clean up the `.tmp` file on failure. If `os.rename` raised (EXDEV cross-device after a tmpfs flip, parent dir removed mid-write, permission change), the `.tmp` file leaked across runs. Wrap the write + rename sequence so any failure unlinks `tmp` (with `missing_ok=True` since rename may have consumed it just before the failure point).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Make room for non-dataset test fixtures (e.g. local tokenizer artifacts for tests that need ISL/OSL/TPOT triggers but can't depend on HuggingFace Hub access in CI). `tests/datasets/` was too narrowly named; `tests/assets/` will house both `datasets/` and other test artifacts under logical subdirectories.

Pure path rename — files move from `tests/datasets/<x>` to `tests/assets/datasets/<x>`, no content changes. References updated across:
- pyproject.toml (sdist include glob)
- README.md, docs/CLI_QUICK_REFERENCE.md, docs/LOCAL_TESTING.md, examples/02_ServerBenchmarking/README.md
- scripts/create_dummy_dataset.py, scripts/regenerate_templates.py
- src/inference_endpoint/config/templates/*.yaml (placeholder examples regenerated from the updated script)
- tests/conftest.py, tests/unit/commands/test_benchmark.py
- AGENTS.md (Test data section)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two integration tests in PR #306's metrics-aggregator path were flaky / slow in CI because of HuggingFace Hub:
- `TestTemplateIntegration::test_template_runs` (6 cases) called `AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")` on the aggregator subprocess's startup path. Cold-cache CI runs paid the ~1 MB download + tokenizer-init cost, sometimes pushing subprocess startup past the parent launcher's 30 s timeout. Also required network egress / HF_TOKEN for some CI environments.
- `test_signal_handling.py` (new tests) were not affected (they don't pass `--tokenizer`), but the parent-owns-output-dir contract from the earlier #9 follow-up also applied — those tests now create the output dir themselves before spawning the subprocess.

Fix: drop in a local character-level tokenizer fixture at `tests/assets/tokenizers/char/`. ~3 KB total (`tokenizer.json` + `tokenizer_config.json`). Loaded via the existing `AutoTokenizer.from_pretrained(local_dir)` codepath — no test-only hooks in production code. Each character is one token, which is enough for the aggregator's ISL/OSL/TPOT triggers to produce deterministic counts (the e2e test path doesn't care about tokenization correctness, only that *some* count appears).

Effects: no network call on the aggregator startup path for these tests, no HF_TOKEN requirement, and tokenizer load completes in single-digit ms instead of seconds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from 1910057 to 924be7a
`TestTemplateIntegration::test_template_runs[concurrency_template.yaml]` consistently hits the 60 s `worker_initialization_timeout` in CI on cold-start. `concurrency_template.yaml` is alphabetically first in the parametrized lane, so it pays the full first-time-this-CI-job cost:
- Python `multiprocessing` `spawn`-mode re-import of the entire `inference_endpoint` package per worker subprocess (transformers, msgspec, pyzmq, etc.)
- First-time ZMQ IPC bind + connect handshake for the worker pool
- Concurrent aggregator subprocess cold-start contending for the same small-CI-runner CPU

Subsequent templates in the same lane benefit from warm module caches and don't approach the limit. Local Docker runs finish all 6 templates in ~40 s total (~6.5 s/template), but CI runners with less headroom (and `spawn` vs `fork`) consistently push the first test past 60 s.

Bump to 120 s in this test only — `_resolve_template` injects `settings.client.worker_initialization_timeout: 120.0` into each template before running. Production default (60 s) is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
arekay-nv left a comment
PR Review — ZMQ pub/sub metrics refactor
Five specialised reviews (general code, error-handling, tests, comments, type design) were run in parallel against the diff. Findings below are deduplicated and categorised by priority. Where multiple agents independently converged on the same issue, that is noted — those are the highest-confidence findings.
🔴 High Priority (correctness — fix before merge)
H1. Drain-timeout indicator n_pending_tasks is always 0 — state==COMPLETE and n_pending_tasks>0 contract is unenforceable
File: src/inference_endpoint/async_utils/services/metrics_aggregator/aggregator.py:371-393
AGENTS.md and the MetricsSnapshot.n_pending_tasks docstring both document the consumer contract: drain timeout is detected as state == COMPLETE and n_pending_tasks > 0. But the producer code does cancel_in_flight_tasks() → await asyncio.gather(*cancelled, return_exceptions=True), and each cancelled task's add_done_callback(self._in_flight_tasks.discard) fires before n_pending is read at line 390. The set is empty by then; the final snapshot reports n_pending_tasks=0; Report.complete renders clean success for a run that actually timed out draining.
Three independent agents (silent-failure, comments, tests) converged on this.
Fix: snapshot len(_in_flight_tasks) inside the except TimeoutError: block before cancellation; publish that count. Add a regression test with a forever-blocking MockTokenizePool and drain_timeout_s=0.05 asserting n_pending_tasks > 0 in the final snapshot.
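A sketch of the suggested ordering. The attribute and method names are taken from the finding; the `wait_for` framing is an assumption about the surrounding control flow:

```python
import asyncio


async def _finalize_with_drain(self) -> int:
    """Sketch of the H1 fix: capture the pending count before cancellation."""
    try:
        await asyncio.wait_for(self._drain_tasks(), timeout=self._drain_timeout_s)
        return 0
    except asyncio.TimeoutError:
        # Read the count BEFORE cancelling: each task's done-callback discards
        # it from _in_flight_tasks, so reading after gather() always sees 0.
        n_pending = len(self._in_flight_tasks)
        cancelled = [t for t in self._in_flight_tasks if t.cancel()]
        await asyncio.gather(*cancelled, return_exceptions=True)
        return n_pending  # publish this in the final snapshot
```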
H2. _signal_finalize task reference is discarded — GC race risks the entire INTERRUPTED delivery path
File: src/inference_endpoint/async_utils/services/metrics_aggregator/__main__.py:214
```python
def _on_sigterm() -> None:
    ...
    loop.create_task(_signal_finalize())  # ← return value discarded
```

Python's asyncio docs explicitly warn to save a reference, to avoid "a task disappearing mid-execution". `_signal_finalize` performs the atomic JSON write and pub/sub send that the entire INTERRUPTED contract depends on — exactly the failure mode the SIGTERM path exists to prevent.
Fix: hold the task in a module-level set, remove via task.add_done_callback.
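The standard strong-reference pattern from the asyncio docs, sketched against the handler named in the finding (the `_signal_finalize` body is elided):

```python
import asyncio

# Module-level strong references: the event loop only keeps weak refs to
# tasks, so a task whose create_task() result is discarded can be GC'd.
_pending_finalize_tasks: set[asyncio.Task] = set()


async def _signal_finalize() -> None:
    ...  # atomic JSON write + pub/sub send (body elided; see __main__.py)


def _on_sigterm() -> None:
    loop = asyncio.get_running_loop()
    task = loop.create_task(_signal_finalize())
    _pending_finalize_tasks.add(task)  # keep it alive until it finishes
    task.add_done_callback(_pending_finalize_tasks.discard)  # self-remove
```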
H3. NaN-scrubbed None percentiles crash Report.display()
Files: src/inference_endpoint/async_utils/services/metrics_aggregator/snapshot.py:239, src/inference_endpoint/metrics/report.py:306, 328
_scrub_nonfinite maps non-finite percentiles to None to keep json.dumps(allow_nan=False) honest, but report.py:306 does metric_dict[key] * scale_factor and :328 does val * scale_factor with no None guard. finalize_benchmark calls display() outside the report-build try/except, so any producer-side NaN takes down the whole run_benchmark flow with TypeError.
Fix: render N/A (or skip) for None values in _display_metric; add a round-trip test that puts NaN into registry.set_counter, builds snapshot, runs through snapshot_to_dict → Report.from_snapshot → display() without crashing.
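A minimal sketch of the guard, assuming a helper that formats one scalar at a time (the function shape is illustrative, not the actual report.py code):

```python
def _format_scaled(val: float | None, scale_factor: float) -> str:
    # _scrub_nonfinite maps NaN/Inf to None upstream; render it, don't multiply.
    if val is None:
        return "N/A"
    return f"{val * scale_factor:,.2f}"
```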
🟡 Medium Priority (rule violations, untested load-bearing code, doc/code mismatches)
M1. Docstrings claim SIGINT also calls publish_final — it doesn't
Files: publisher.py:91-94, 166-167, 193-194
All three say "the SIGTERM/SIGINT handler in __main__.py ... can both call it". But __main__.py:240-248 installs SIGINT as a log-only no-op handler. Only SIGTERM calls publish_final. Change all three to "SIGTERM" singular.
M2. SIGTERM-handler comment misdescribes its trigger
File: __main__.py:186-208
Comment says the primary reason for the SIGTERM handler is ServiceLauncher.kill_all — but kill_all calls proc.kill() (SIGKILL on POSIX, uncatchable). The handler never fires from that path. Either fix the comment to describe the real trigger (external operator kill -TERM), or change kill_all to terminate() + grace period + kill() so it actually exercises the handler.
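If the second option is taken, the launcher change is the usual escalation pattern. A sketch assuming a `subprocess.Popen`-like handle and a hypothetical grace period:

```python
import subprocess

GRACE_S = 5.0  # hypothetical grace period, not a value from this PR


def kill_all(procs: list[subprocess.Popen]) -> None:
    for proc in procs:
        proc.terminate()  # SIGTERM: gives the handler a chance to publish_final
    for proc in procs:
        try:
            proc.wait(timeout=GRACE_S)
        except subprocess.TimeoutExpired:
            proc.kill()  # SIGKILL fallback once the grace period expires
```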
M3. "Disk fallback" wording inverts the design
Files: execute.py:440, __main__.py:65, AGENTS.md:202
All three describe the JSON file as the "disk fallback". But publisher.py:48-55 and the narrative section of AGENTS.md call it the primary Report source, with pub/sub as the TUI signal. Inverted polarity. Pick one story (the JSON file is primary) and apply it consistently across docs and comments.
M4. publish_final idempotency has no test
File: tests/unit/async_utils/services/metrics_aggregator/test_publisher.py
publisher.py:196-198's if self._finalized: return is the explicit race protector for SIGTERM-vs-ENDED. Both _signal_finalize (__main__.py:216-237) and the ENDED path (aggregator.py:399) call publish_final. A regression that removed the _finalized guard would double-write the JSON file and the suite would stay green. Add: test_publish_final_idempotent asserting publish.call_count == 1 and unchanged file mtime after a second call.
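A sketch of the suggested regression test; the fixtures, the `snapshot` argument, and the `_transport.publish` mock path are hypothetical:

```python
async def test_publish_final_idempotent(publisher, snapshot, tmp_path):
    await publisher.publish_final(snapshot)
    path = tmp_path / "final_snapshot.json"
    first_mtime = path.stat().st_mtime_ns

    await publisher.publish_final(snapshot)  # must no-op via the _finalized guard

    assert publisher._transport.publish.call_count == 1  # single pub/sub send
    assert path.stat().st_mtime_ns == first_mtime  # file not rewritten
```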
M5. Drain-timeout recovery path is entirely untested
File: tests/unit/async_utils/services/metrics_aggregator/test_aggregator.py
aggregator.py:371-389 is the recovery for blocked tokenize tasks. No test exercises it. Compounds with H1: even after fixing the contract, there is no test to prevent regression. Add: integration test with a forever-blocking MockTokenizePool + drain_timeout_s=0.05.
M6. .tmp-cleanup leg of _write_atomic_json has no test
File: tests/unit/async_utils/services/metrics_aggregator/test_publisher.py
publisher.py:264-272's try / except BaseException: tmp.unlink(missing_ok=True); raise is the only thing preventing .tmp leak across aggregator restarts. The existing disk-failure test fails before .tmp is created. Add: patch os.rename to raise mid-call, assert no .tmp left behind.
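A sketch of the suggested test. `_write_atomic_json` is the function under test; patching `os.rename` as the injection point follows the finding, though the real implementation may use a different rename call:

```python
import os

import pytest

# Import path assumed from the file layout cited in this review.
from inference_endpoint.async_utils.services.metrics_aggregator.publisher import (
    _write_atomic_json,
)


def _explode(*args, **kwargs):
    raise OSError("simulated rename failure")


def test_tmp_unlinked_when_rename_fails(tmp_path, monkeypatch):
    # Fail after the .tmp file has been written but before it is renamed.
    monkeypatch.setattr(os, "rename", _explode)
    with pytest.raises(OSError):
        _write_atomic_json(tmp_path / "final_snapshot.json", {"state": "complete"})
    assert not list(tmp_path.glob("*.tmp"))  # cleanup leg must unlink the temp file
```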
M7. Bare except asyncio.CancelledError: pass violates AGENTS.md
File: src/inference_endpoint/load_generator/session.py:299-300
Project rules require every except block to contain a comment or logging. Adjacent blocks at publisher.py:203-205 and session.py:338 already comply. Add a one-line comment here.
M8. Default-to-INTERRUPTED on missing state masks malformed snapshots
File: src/inference_endpoint/metrics/report.py:200
```python
state = snap.get("state", "interrupted")
```

A snapshot dict missing `state` is malformed, not interrupted. Defaulting silently rewrites a schema violation as a normal "user pressed Ctrl-C" warning. Log at ERROR with the snapshot keys, then either raise or use a distinct `"malformed"` sentinel — see the sketch below.
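One possible shape, with the sentinel value as a placeholder and `snap`/`logger` assumed to come from the enclosing function:

```python
state = snap.get("state")
if state is None:
    # Schema violation, not a user interrupt: surface loudly and keep the
    # two cases distinguishable downstream.
    logger.error("snapshot missing 'state'; keys=%s", sorted(snap))
    state = "malformed"  # placeholder sentinel; raising is the alternative
```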
M9. _load_final_snapshot_from_disk collapses "missing" vs "corrupt"
File: src/inference_endpoint/commands/benchmark/execute.py:387-405
Both file-missing and corrupt-JSON return None; downstream log says "no file on disk" even when the file exists but failed to parse. Corrupt JSON is a producer-contract violation — should be ERROR level with a distinct message including the parse exception.
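A sketch of the split, with log levels per the finding (the function name mirrors the diff; the body is illustrative):

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def _load_final_snapshot_from_disk(path: Path) -> dict | None:
    if not path.exists():
        logger.info("no final snapshot on disk at %s", path)  # expected case
        return None
    try:
        return json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        # Producer-contract violation: the file exists but did not parse.
        logger.error("final snapshot at %s is corrupt: %s", path, exc)
        return None
```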
M10. Fallback warning at execute.py:575-580 says "may or may not be terminal"
File: src/inference_endpoint/commands/benchmark/execute.py:575-580
metrics_subscriber.latest.state is already available. A LIVE state means the aggregator died mid-run (data lost); COMPLETE/INTERRUPTED means the disk write failed. These deserve different log messages.
M11. Add gc=False to CounterStat, SeriesStat, MetricsSnapshot
File: src/inference_endpoint/async_utils/services/metrics_aggregator/snapshot.py
Matches project convention (per AGENTS.md and core/types.py). All fields are scalars, frozen dicts of scalars, or lists of frozen-scalar-only Structs — no cycle path exists. One snapshot every publish-interval × N series is high-frequency enough to matter.
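The convention in question, as a minimal illustration with made-up fields (the real Structs' fields live in snapshot.py):

```python
import msgspec


class CounterStat(msgspec.Struct, frozen=True, array_like=True, gc=False):
    # gc=False is safe only because no field can participate in a reference
    # cycle: scalars, or frozen containers of scalars, only.
    name: str
    value: int
```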
🟢 Low Priority (suggestions)
Wire schema
- Rename `MetricsSnapshot.counter` → `seq`/`emit_seq` — it collides namespace-wise with `CounterStat`; the docstring already has to disclaim the confusion. Easier to fix before downstream consumers ship.
- Replace the `((lo, hi), count)` histogram tuple with a named `HistogramBucket` Struct — lossy round-trip through `snapshot_to_dict`; consumers index positionally (`h[0][0]`).
Encapsulation
- `MetricsSnapshotSubscriber.latest` should be a `@property` with a private backing field — it is currently public-mutable and could be silently overwritten (see the sketch after this list).
- `MetricsTable.is_tracking` / `session_started_ns` are public mutable attributes — consider `@property` accessors.
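A sketch of the `latest` accessor; the class and type names come from the review, the body is illustrative:

```python
from __future__ import annotations


class MetricsSnapshotSubscriber:
    def __init__(self) -> None:
        self._latest: MetricsSnapshot | None = None  # written only by the recv loop

    @property
    def latest(self) -> MetricsSnapshot | None:
        return self._latest  # read-only to callers; no silent overwrite
```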
Error handling polish
- `publisher.start()` should validate `publish_interval_s > 0` before the idempotency early-return (publisher.py:120-130) — the current ordering can swallow validation errors.
- The `_write_atomic_json` parent-dir fsync error is misattributed as "JSON snapshot write failed" — the file is on disk; only durability is at risk.
- `MetricsSnapshotCodec.on_decode_error` returns `None` silently (snapshot.py:270-277) — a `msgspec.DecodeError` is a schema bug; add a one-shot `logger.warning` with the frame size and exception.
- The `http_client.shutdown_async` cleanup logs `f"{e}"` (execute.py:548-552) — bump to `logger.exception` so operators investigating leftover workers have the full exception chain.
Test polish
- `test_aggregator_e2e.py` is marked `@pytest.mark.unit` but binds real ZMQ IPC — should be `@pytest.mark.integration`.
- The `time.sleep(2.0)` readiness wait in test_signal_handling.py:117, 166 — `__main__.py` already supports `--readiness-path`/`--readiness-id`; use it.
- Add a negative test for `register_series` rejecting a percentiles tuple without 50.0 — the registry.py:388-393 guard currently has no test.
- Add a `_scrub_nonfinite` round-trip test — inject NaN via `registry.set_counter`, assert `None` in the dict plus clean `json.dumps(allow_nan=False)`.
- Add aggregator `_session_state` transition assertions (INITIALIZE → LIVE → DRAINING) at unit level.
- test_publisher.py: the `MetricsPublisher.start()` double-start guard (publisher.py:120-126) has no test.
- test_publisher.py:170-176 has `try: publisher.close() except Exception: pass` — the comment suggests the author wasn't sure whether it should raise. Construct the mock so `close()` succeeds.
- `test_on_decode_error_drops_malformed` (test_snapshot.py:112-119) uses `try/except` instead of `pytest.raises` — it silently no-ops if msgspec ever stops raising.
- `test_offline_benchmark` (integration) should assert `final_snapshot.json` exists and `state == "complete"` — given the file is now primary, a regression that never writes it would not be caught.
Comment hygiene
- Stale ref: registry.py:379 says `Report._series_to_metric_dict` — the function is module-level in metrics/report.py now.
- The metrics_table.py:332-339 TpotTrigger docstring contains a literal `# NOTE(agents):` — it renders as `#`-prefixed text in `help()`. Also drop the "if tokenization throughput becomes a bottleneck, consider merging..." speculation (per AGENTS.md, TODOs must link an issue).
- report.py:200 has a bare TODO (`# TODO: surface session_started_ns via snapshot`) — link an issue or drop it.
- subscriber.py:57-60: the "no current callers do" parenthetical will rot — rephrase.
- aggregator.py:305: the comment "Now that we have an event loop running" is misleading (the loop is always running inside `async def`); the real precondition is that STARTED was observed.
Type design
- The `SampleRow` `gc=False` docstring should use the standard AT-RISK pattern. The struct is mutated post-construction; the current docstring ("no mutable container fields") understates the audit argument.
- `SeriesSampler._raw: array.array` is unbounded — documented but not type-expressed. Its lifetime is the run, not the snapshot interval.
- Lift `get_runtime_state: Callable[[], tuple[SessionState, int]]` to a named `RuntimeState` (NamedTuple or array_like Struct) — see the sketch after this list.
- Co-locate `MetricCounterKey` and `MetricSeriesKey` (currently aggregator.py vs metrics_table.py) into a single keys.py.
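The lift is small. A sketch; the second field's name is inferred from the `tracked_duration_ns` refresh mentioned elsewhere in this thread, not confirmed by the diff:

```python
from __future__ import annotations

from typing import NamedTuple


class RuntimeState(NamedTuple):
    session_state: SessionState  # enum from the aggregator module
    tracked_duration_ns: int  # field name inferred, not confirmed


# get_runtime_state: Callable[[], RuntimeState] replaces the anonymous tuple.
```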
Strengths (worth keeping)
- Excellent comment density on tricky invariants — session.py:411-421 (ERROR-before-COMPLETE), aggregator.py:284-301 (duplicate-STARTED), aggregator.py:381-389 (cancel-then-await), publisher.py:42-65 (dual-path contract). All describe current state with rationale, no history narration.
- Dual-path final delivery is genuinely independent — `test_disk_failure_does_not_block_pubsub` (test_publisher.py:140) proves disk failure does not block pub/sub. Exemplary defensive design.
- The `_finalized` flag is set before any `await` (publisher.py:196-198) — the SIGTERM-vs-ENDED double-publish race is closed by construction, not just narrowed.
- test_aggregator_error_handler.py:144-197 explicitly parametrises the ERROR-before-COMPLETE counter-example with `test_error_after_complete_misses_tracked_failed` — a textbook regression-resistant test.
- The HDR `high >= 2*low` pre-check (registry.py:160-167) rejects misconfiguration loudly instead of surfacing cryptic HdrHistogram errors.
- The registry.py:388-393 p50 contract enforcement rejects misregistration that would silently produce midrange-fallback medians.
- Disciplined msgspec usage (frozen + array_like + tagged unions) is the correct hot-path shape.
- Integration tests for SIGINT-no-op and SIGTERM-final-write are the right shape and in the right place.
- License headers present, no lazy imports, no history-narrating comments, no gitignored refs across the reviewed surface.
Recommended Action Plan
- High priority (H1–H3) are correctness bugs that materially mislead the user. Fix before merge.
- Medium priority (M1–M11) — tighten the SIGINT/SIGTERM/disk-fallback documentation (M1–M3) and add the missing tests for load-bearing recovery code (M4–M6). These are cheap and currently the docs lie about the design.
- Low priority can be batched into a follow-up PR.
- Re-run targeted reviews after fixes (`errors` + `tests` aspects).
🤖 Generated by /pr-review-toolkit:review-pr (5 specialised agents — code, errors, tests, comments, types).
arekay-nv
left a comment
@nv-alicheng
The three high priority changes seem important and small enough to fix.
Everything else can be deferred to a follow-up.
Please merge when ready, and collect baseline perf numbers pre/post commit to see whether there are any major perf implications.
Thanks!
Addresses the three high-priority findings from the review council:

H1: `drain_tasks` now owns the timeout + cancel-and-await sequence, so the pending count is captured before per-task done callbacks empty the in-flight set. Previously it read 0 unconditionally — the documented state==COMPLETE and n_pending_tasks>0 drain-timeout contract was unenforceable.

H2: Extract `_make_sigterm_handler` returning a strong-ref `set[Task]` that holds the spawned `_signal_finalize` task; the loop tracks tasks via weakref only, so a discarded `create_task()` return value can be GC'd mid-flight (per the Python asyncio docs) — exactly the failure the INTERRUPTED delivery path exists to prevent.

H3: `_scrub_nonfinite` maps producer-side NaN/Inf to None for strict JSON. `_display_metric` did `val * scale_factor` with no guard → TypeError on `display()`, which `finalize_benchmark` calls outside the report-build try/except. Render N/A for None across named scalars, histogram bucket edges, and percentiles.

Tests added (all verified failing pre-fix):

- test_drain_timeout_reports_pending_count: forever-blocking pool + drain_timeout_s=0.05, asserts publish_final receives n_pending>0
- test_sigterm_handler_holds_strong_reference_to_finalize_task: drives the handler, asserts the task is in the strong-ref set, survives gc.collect(), and self-removes via done-callback on completion
- test_sigterm_handler_refreshes_tracked_duration: the handler mirrors the ENDED path's tracked_duration_ns refresh before publish_final
- test_display_handles_scrubbed_nan_percentiles: a dict with scrubbed None percentile values does not crash display(); renders N/A
- test_scrub_nonfinite_round_trip_yields_none: registry-side NaN/Inf surfaces as None in snapshot_to_dict and round-trips through json.dumps(allow_nan=False)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AGENTS.md forbids imports inside function bodies. The H3 round-trip test introduced lazy imports of math, json, MetricsSnapshot, and SeriesStat — move them to the top-level import block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What does this PR do?
The `shmem` implementation of KVStore in MetricsAggregatorService causes issues on ARM. Several solutions exist; this PR implements (2).
Type of change
Related issues
Testing
Checklist