Skip to content

feat: add distributed store backend for circuit breaker (Phase 4)#1061

Merged
ncrispino merged 9 commits intomassgen:dev/v0.1.78from
amabito:feat/cb-store-phase4
Apr 17, 2026
Merged

feat: add distributed store backend for circuit breaker (Phase 4)#1061
ncrispino merged 9 commits intomassgen:dev/v0.1.78from
amabito:feat/cb-store-phase4

Conversation

@amabito
Copy link
Copy Markdown
Contributor

@amabito amabito commented Apr 14, 2026

Summary

Phase 4: pluggable store backend for the LLM circuit breaker, so state can be shared across workers / processes. Default (store=None) keeps the existing single-process path unchanged.

Rebased on top of #1038 (Phase 2) and #1056 (Phase 3).

What's in

  • CircuitBreakerStore Protocol (sync, to match the existing CB code path)
  • InMemoryStore -- RLock, zero deps
  • RedisStore -- requires redis>=4.0, lazy-imported
  • atomic_record_failure / atomic_record_success on the Protocol. Needed for linearizability when multiple workers race on the same backend; found this the hard way in review and had to extend the Protocol mid-way.
  • Metrics hooks from feat: add circuit breaker observability module (Phase 3) #1056 fire on every store code path too.

Tests

159 tests passing locally: store unit, adversarial (TOCTOU, Redis eviction, corrupted state, Protocol contract), plus regression on the existing CB tests.

Unrelated test_subagent_docker_logs.py etc. fail on Windows with cp932 encoding, but they fail on origin/main too -- pre-existing, not touched here.

Install

Redis backend is optional:

pip install massgen[redis-store]

Summary by CodeRabbit

  • New Features

    • Persistent circuit-breaker stores (in-memory and Redis) and a factory; circuit breakers can be configured to use shared store-based state and coordinated probe ownership.
  • Bug Fixes / Reliability

    • Atomic updates, CAS semantics, probe-claiming, TTL handling, and retry/fallback paths to prevent races and stale reopen timing.
  • Tests

    • Extensive unit and adversarial tests for store behavior, concurrency, and end-to-end integration.
  • Chores

    • Added optional Redis dependency group.

amabito added 3 commits April 14, 2026 11:57
- CircuitBreakerStore Protocol (get_state/set_state/cas_state/increment_failure/clear)
- InMemoryStore: thread-safe, zero deps, default backend (ThreadingRLock)
- RedisStore: optional redis>=4.0, Lua CAS script, dynamic TTL covering open_until,
  WATCH/MULTI/EXEC fallback, _script_unavailable narrowed to avoid misclassifying
  READONLY/ACL errors
- make_store factory (memory|redis)
- LLMCircuitBreaker gains store= param; store=None preserves all existing behavior
  (100% backward compat)
- 138 tests: 27 happy path, 38 adversarial (thread-safety, Redis fallback, Rule 14/22/27/45/51),
  66 regression (test_llm_circuit_breaker unchanged)
- Pre-commit clean (black/isort/flake8/autoflake)

Known design constraint (Phase 5): record_failure store path splits
increment_failure + cas_state across two store calls; not fully linearizable
against concurrent reset()/record_success(). Requires Protocol extension with
atomic record_failure() method to fix properly.
- Protocol extended with atomic_record_failure and atomic_record_success
- InMemoryStore: RLock-protected read-compute-write for both atomic methods
- RedisStore: Lua scripts for single-round-trip atomicity
- LLMCircuitBreaker routes through atomic methods when store is set
- Preserves force_open semantics: atomic_record_success is no-op on OPEN state
- Preserves record_failure correctness in HALF_OPEN: any failure -> OPEN
- Backward compat: store=None path unchanged
- Adds new tests in test_cb_store.py and test_cb_store_adversarial.py
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 14, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a pluggable circuit-breaker state store and integrates it into LLMCircuitBreaker: new massgen/backend/cb_store.py (InMemoryStore, RedisStore, CircuitBreakerStore, factory), extensive tests, and optional redis dependency; LLMCircuitBreaker gains a store parameter and uses store atomic APIs for state transitions.

Changes

Cohort / File(s) Summary
State store implementation
massgen/backend/cb_store.py
New module: CircuitBreakerStore protocol, DEFAULT_CIRCUIT_BREAKER_STATE, InMemoryStore (RLock-protected) and RedisStore (hash-backed, Lua scripts + WATCH/MULTI fallback). Provides atomic ops: get_state, set_state, cas_state, increment_failure, atomic_record_failure, atomic_record_success, try_transition_and_claim_probe, clear, TTL and serialization, plus make_store() factory.
Circuit breaker integration
massgen/backend/llm_circuit_breaker.py
Adds store ctor param and routes state reads/writes through the store when provided. Refactors gating/claiming into _should_block_with_claim(), and delegates should_block, record_failure, record_success, force_open, reset, and call_with_retry to use store atomic APIs and prior-state metadata.
Unit tests
massgen/tests/test_cb_store.py
New unit tests for InMemoryStore and RedisStore: defaults, set/get isolation, CAS semantics, increment/clear, atomic failure/success transitions, TTL behavior, concurrency, and LLMCircuitBreaker integration with a shared store.
Adversarial tests
massgen/tests/test_cb_store_adversarial.py
Extensive adversarial/stress tests: edge backend names, corrupted/partial state, CAS races, Redis non-Lua fallback and retry-exhaustion, partial-write scenarios, concurrency linearizability, and deep LLMCircuitBreaker invariant tests (probe ownership, force-open races, ordering).
Dependency manifest
pyproject.toml
Adds optional dependency group redis-store = ["redis>=4.0"] and includes redis>=4.0 in the all extras group.

Sequence Diagram(s)

sequenceDiagram
    participant Client as LLMCircuitBreaker
    participant Store as CircuitBreakerStore
    participant Backend as Storage Backend

    Client->>Store: get_state(backend)
    Store->>Backend: fetch state (in-memory or Redis)
    Backend-->>Store: state dict
    Store-->>Client: state dict

    alt state == OPEN and now >= open_until
        Client->>Store: try_transition_and_claim_probe(backend, now, recovery_timeout)
        Store->>Backend: atomic OPEN→HALF_OPEN + claim probe
        Backend-->>Store: (claimed=true, state)
        Store-->>Client: (claimed, state)
    else state == HALF_OPEN
        Client->>Client: check half_open_probe_active -> gate/allow probe
    else state == CLOSED
        Client->>Client: allow request
    end

    Client->>Client: execute request

    alt request succeeds
        Client->>Store: atomic_record_success(backend)
        Store->>Backend: reset failure_count, maybe CLOSE
        Backend-->>Store: new state + prev metadata
        Store-->>Client: new state
    else request fails
        Client->>Store: atomic_record_failure(backend, threshold, recovery_timeout)
        Store->>Backend: increment failure_count, maybe OPEN with open_until
        Backend-->>Store: new state + prev metadata
        Store-->>Client: new state
    end
Loading
sequenceDiagram
    participant InMem as InMemoryStore
    participant Lock as RLock
    participant Redis as RedisStore
    participant RedisSvr as Redis Server

    rect rgba(100, 150, 255, 0.5)
        Note over InMem,Lock: In-memory path (single-process)
        InMem->>Lock: acquire()
        Lock-->>InMem: locked
        InMem->>InMem: read/modify dict atomically
        InMem->>Lock: release()
    end

    rect rgba(255, 150, 100, 0.5)
        Note over Redis,RedisSvr: Redis path (Lua preferred)
        Redis->>RedisSvr: EVAL (atomic script)
        RedisSvr->>RedisSvr: read/parse hash fields
        RedisSvr->>RedisSvr: compute new fields (counts, open_until)
        RedisSvr->>RedisSvr: HSET/EXPIRE atomically
        RedisSvr-->>Redis: serialized state
    end

    rect rgba(200, 100, 255, 0.5)
        Note over Redis,RedisSvr: Redis fallback (no EVAL)
        Redis->>RedisSvr: WATCH key
        Redis->>RedisSvr: HGETALL key
        RedisSvr-->>Redis: current hash
        Redis->>Redis: parse locally, compute updates
        Redis->>RedisSvr: MULTI / HSET / EXPIRE / EXEC
        RedisSvr-->>Redis: success or nil on conflict
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

Possibly related PRs

Suggested reviewers

  • a5507203
  • ncrispino
🚥 Pre-merge checks | ✅ 3 | ❌ 3

❌ Failed checks (3 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is missing required template sections: Type of change, Checklist, Pre-commit status, How to Test, and Test CLI Command are all absent or incomplete. Add missing sections from the template: explicitly check the type of change (New feature), complete the Checklist items, provide Pre-commit status output, and include test CLI commands with expected results.
Docstring Coverage ⚠️ Warning Docstring coverage is 14.44% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Documentation Updated ⚠️ Warning Pull request lacks design documentation, complete Google-style docstrings for implementation methods, documentation of new store parameter, and configuration examples for distributed circuit breaker store architecture. Create design document in docs/dev_notes/ explaining architecture and atomic operations, add Google-style docstrings to all public methods in cb_store.py, document the store parameter in LLMCircuitBreaker.init, and provide example YAML configurations.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'feat: add distributed store backend for circuit breaker (Phase 4)' directly and clearly summarizes the main change—introducing a distributed store backend for the circuit breaker with explicit phase notation.
Capabilities Registry Check ✅ Passed The custom check for capabilities registry updates is not applicable to this PR. This PR introduces distributed circuit breaker state management infrastructure (cb_store.py with InMemoryStore and RedisStore) and extends llm_circuit_breaker.py to support distributed state via an optional store parameter. It does not introduce any new backend models or backend-specific capabilities. The term "backend" used in cb_store.py refers to logical backend names (like "claude", "openai") as keys for storing circuit breaker state, not new backend definitions. Since no new models or capabilities are being added, no updates to massgen/backend/capabilities.py or massgen/token_manager/token_manager.py are required.
Config Parameter Sync ✅ Passed The store parameter is a constructor-only dependency injection parameter, not part of LLMCircuitBreakerConfig. No new YAML configuration parameters were added.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
massgen/backend/llm_circuit_breaker.py (1)

601-623: ⚠️ Potential issue | 🟠 Major

Store-backed retries can lose probe ownership tracking.

On later attempts, _owns_probe is refreshed only from self._state, which never reflects the shared-store state. If a retry acquires the HALF_OPEN probe in store mode and then exits via cancellation or another BaseException, the cleanup block skips the reopen path and can leave half_open_probe_active=True indefinitely.

Also applies to: 735-775

🧹 Nitpick comments (1)
massgen/backend/cb_store.py (1)

69-165: Add Google-style docstrings to the new store surface.

The protocol is documented, but the concrete store methods and factory still leave TTL behavior, expected-state no-ops, and Lua/WATCH fallback semantics implicit in the implementation. Please document the new callable surface directly.

As per coding guidelines "For new or changed functions, include Google-style docstrings".

Also applies to: 167-466, 675-684

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/backend/cb_store.py` around lines 69 - 165, The InMemoryStore class
and its methods lack Google-style docstrings describing TTL behavior,
expected-state no-ops, Lua/WATCH fallback semantics, thread-safety and return
types; add concise Google-style docstrings to the class InMemoryStore and each
method (get_state, set_state, cas_state, increment_failure,
atomic_record_failure, atomic_record_success, clear) that specify: purpose, args
(types and meaning, e.g., backend, expected_state, failure_threshold,
recovery_timeout, updates), return values, side effects (mutation of in-memory
state, deep-copy behavior), concurrency guarantees (uses RLock), and the
semantics for TTL/open_until, expected_state no-op behavior, and that this
in-memory implementation is a non-distributed fallback to Redis/Lua/WATCH
semantics. Ensure docstrings follow the project Google style and cover
error/edge-case behavior used by callers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@massgen/backend/cb_store.py`:
- Around line 420-445: The Redis-backed code is persisting time.monotonic()
which is not portable across processes; update atomic_record_failure (and the
fallback _atomic_record_failure_without_lua referenced around lines 517-521) to
use a wall-clock Unix timestamp (e.g., time.time()) or the Redis server TIME
inside the Lua script instead of time.monotonic(), and ensure the values written
as open_until/last_failure_time are produced and compared against the same clock
(Unix time) used by LLMCircuitBreaker so timestamps are meaningful across
workers/processes.
- Around line 675-684: The factory function make_store drops RedisStore's
key_prefix and will raise a raw KeyError if redis_client is missing; update
make_store to forward key_prefix (e.g., use kwargs.get("key_prefix")) into the
RedisStore constructor along with ttl, and validate presence of redis_client
(raise a clear ValueError like "redis_client required for redis backend" if
missing) instead of letting a KeyError propagate so callers can isolate
namespaces and get a helpful error message.

In `@massgen/backend/llm_circuit_breaker.py`:
- Around line 239-280: The current probe transition in llm_circuit_breaker.py
reads open_until and half_open_probe_active but calls
self._store.cas_state(self.backend_name, CircuitState.OPEN.value, ...) which
only guards on "state", allowing races (e.g., force_open updates open_until) and
multiple probes; add and use a single atomic store-side operation (e.g.,
self._store.try_transition_and_claim_probe or similar) that checks both the
current "state" and the relevant fields ("open_until" for OPEN->HALF_OPEN
transition and "half_open_probe_active" for HALF_OPEN probe claim) in one
compare-and-set, then perform the state change and set half_open_probe_active
atomically; update the code paths around the CircuitState.OPEN branch and the
HALF_OPEN branch to call this new atomic method instead of cas_state and handle
its boolean result for whether this process won the probe, keeping metric
logging and returned booleans the same.
- Around line 330-374: The atomic store methods (atomic_record_failure and
atomic_record_success) must return both previous and new state (e.g., include
keys like "_prev_state" and "_prev_was_half_open" or an explicit "transition"
descriptor) and the CircuitBreaker callers (record_failure and record_success in
llm_circuit_breaker.py) must base their logging and metrics solely on that
returned payload instead of inferring state from the new state only; update the
store implementations to include these fields, then change the code paths around
atomic_record_failure (current block using new_state_str, _prev_was_half_open,
_prev_state) and the symmetric atomic_record_success call (the other block noted
in the comment) to read the returned previous/new state or transition and emit
the correct log message and metrics (e.g., use the returned prev->new transition
for record_state_transition and for choosing "Probe failed, circuit breaker
re-opened" vs "Circuit breaker opened"/"open -> closed" messaging).

---

Nitpick comments:
In `@massgen/backend/cb_store.py`:
- Around line 69-165: The InMemoryStore class and its methods lack Google-style
docstrings describing TTL behavior, expected-state no-ops, Lua/WATCH fallback
semantics, thread-safety and return types; add concise Google-style docstrings
to the class InMemoryStore and each method (get_state, set_state, cas_state,
increment_failure, atomic_record_failure, atomic_record_success, clear) that
specify: purpose, args (types and meaning, e.g., backend, expected_state,
failure_threshold, recovery_timeout, updates), return values, side effects
(mutation of in-memory state, deep-copy behavior), concurrency guarantees (uses
RLock), and the semantics for TTL/open_until, expected_state no-op behavior, and
that this in-memory implementation is a non-distributed fallback to
Redis/Lua/WATCH semantics. Ensure docstrings follow the project Google style and
cover error/edge-case behavior used by callers.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0cb84a85-1af9-403d-9e85-2fd5d399ed10

📥 Commits

Reviewing files that changed from the base of the PR and between 965090f and 68cb298.

📒 Files selected for processing (5)
  • massgen/backend/cb_store.py
  • massgen/backend/llm_circuit_breaker.py
  • massgen/tests/test_cb_store.py
  • massgen/tests/test_cb_store_adversarial.py
  • pyproject.toml

Comment thread massgen/backend/cb_store.py
Comment thread massgen/backend/cb_store.py
Comment thread massgen/backend/llm_circuit_breaker.py Outdated
Comment thread massgen/backend/llm_circuit_breaker.py
- atomic try_transition_and_claim_probe op (Lua + InMem + WATCH fallback)
  closes the OPEN/HALF_OPEN race where force_open could be silently undone
  and prevents two workers from both winning the HALF_OPEN probe slot
- _should_block_with_claim() returns probe-ownership flag so call_with_retry
  correctly tracks _owns_probe in store mode (previous local-state refresh
  never fired, leaving half_open_probe_active stuck on cancellation)
- switch store-persisted timestamps to time.time() (Unix wall clock); Lua
  scripts and Python helpers now share comparable cross-process clocks
- make_store: forward key_prefix to RedisStore and raise ValueError for
  missing redis_client instead of bare KeyError
- Google-style docstrings on Protocol surface, InMemoryStore, RedisStore,
  and make_store factory
- 11 new adversarial tests for probe race, force_open race, factory
  validation, and store-mode probe ownership
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
massgen/backend/llm_circuit_breaker.py (1)

587-607: ⚠️ Potential issue | 🔴 Critical

Carry probe ownership through the retryable-status gate.

The new _owns_probe plumbing only updates the entry checks here. The retryable HTTP branch still calls self.should_block() at Line 709, and in store mode that call can win try_transition_and_claim_probe() without surfacing the claim. On the next iteration, the request can reject its own retry as HALF_OPEN, and because _owns_probe stayed False, the except BaseException cleanup never clears half_open_probe_active. That leaves the breaker wedged until some external reset.

Also applies to: 612-627

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/backend/llm_circuit_breaker.py` around lines 587 - 607, The
retryable-status branch still calls self.should_block(), which in store mode can
claim the half-open probe without updating the `_owns_probe` flag, leaving
half_open_probe_active set and the breaker wedged; update the retry loop to use
the same claim-aware call (self._should_block_with_claim()) or otherwise capture
and merge its claimed result into `_owns_probe` whenever you call
should_block()/try_transition_and_claim_probe() so `_owns_probe` accurately
reflects probe ownership across iterations, and ensure the BaseException cleanup
path clears half_open_probe_active when `_owns_probe` is true.
🧹 Nitpick comments (1)
massgen/backend/cb_store.py (1)

20-21: Add Google-style docstrings to the new concrete store methods.

The protocol is very well documented, but most of the new _default_state(), InMemoryStore, and RedisStore implementations still rely on signatures alone. Matching the concrete methods to the same docstring standard would make the fallback behavior and timestamp semantics much easier to audit and maintain.

As per coding guidelines, "**/*.py: For new or changed functions, include Google-style docstrings`."

Also applies to: 132-267, 574-949

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/backend/cb_store.py` around lines 20 - 21, Add Google-style
docstrings to the new concrete store functions and classes: _default_state(),
class InMemoryStore (its methods), and class RedisStore (its methods). For each
docstring follow the project's Google-style format and document purpose,
parameters, return types, exceptions, side effects, fallback behavior, and
timestamp semantics (e.g., what timestamps represent and when they are
set/updated) so the fallback behavior and timestamp semantics are explicit and
auditable; ensure methods that mutate state or read/write timestamps (e.g.,
get/set/delete/refresh methods in InMemoryStore and RedisStore) include examples
or notes about concurrency and persistence where relevant.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@massgen/backend/cb_store.py`:
- Around line 797-853: Both _cas_state_without_lua and
_increment_failure_without_lua open a watched pipeline but do not always release
it on all paths; wrap the existing try/except in each function with a finally
block that calls pipe.reset() so the watched connection is returned to the pool
on success, on retryable exceptions (where code currently continues), and on
other exits; ensure the finally runs after pipe.execute() returns or before each
early return (you can keep the existing pipe.reset() call in the early-return
branch but the finally must still call pipe.reset()).

In `@massgen/backend/llm_circuit_breaker.py`:
- Around line 479-504: The force_open() path (and the abnormal-probe cleanup at
the 745-765 region) currently does a non-atomic get_state() → mutate →
set_state(), which can lose a longer open_until written by another worker;
replace this with an atomic store-side compare/update (e.g., add or use
_store.compare_and_update_state / compare_and_set API) that retrieves the
current state and writes a new state only after merging values so open_until =
max(existing.open_until, computed_open_until) and last_failure_time =
max(existing.last_failure_time, computed_time), and ensure
half_open_probe_active is set to False; specifically modify the code around
force_open(), the calls to _store.get_state and _store.set_state, and the
abnormal-probe cleanup to perform that compare-and-update (or implement a CAS
loop using a versioned/state-token returned by _store) so updates preserve the
maximum cooldown instead of blind overwrites.

---

Outside diff comments:
In `@massgen/backend/llm_circuit_breaker.py`:
- Around line 587-607: The retryable-status branch still calls
self.should_block(), which in store mode can claim the half-open probe without
updating the `_owns_probe` flag, leaving half_open_probe_active set and the
breaker wedged; update the retry loop to use the same claim-aware call
(self._should_block_with_claim()) or otherwise capture and merge its claimed
result into `_owns_probe` whenever you call
should_block()/try_transition_and_claim_probe() so `_owns_probe` accurately
reflects probe ownership across iterations, and ensure the BaseException cleanup
path clears half_open_probe_active when `_owns_probe` is true.

---

Nitpick comments:
In `@massgen/backend/cb_store.py`:
- Around line 20-21: Add Google-style docstrings to the new concrete store
functions and classes: _default_state(), class InMemoryStore (its methods), and
class RedisStore (its methods). For each docstring follow the project's
Google-style format and document purpose, parameters, return types, exceptions,
side effects, fallback behavior, and timestamp semantics (e.g., what timestamps
represent and when they are set/updated) so the fallback behavior and timestamp
semantics are explicit and auditable; ensure methods that mutate state or
read/write timestamps (e.g., get/set/delete/refresh methods in InMemoryStore and
RedisStore) include examples or notes about concurrency and persistence where
relevant.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f0db84e0-daae-462e-91b8-6d9c643604c1

📥 Commits

Reviewing files that changed from the base of the PR and between 68cb298 and 5340531.

📒 Files selected for processing (4)
  • massgen/backend/cb_store.py
  • massgen/backend/llm_circuit_breaker.py
  • massgen/tests/test_cb_store.py
  • massgen/tests/test_cb_store_adversarial.py

Comment thread massgen/backend/cb_store.py
Comment thread massgen/backend/llm_circuit_breaker.py
- cb_store: release watched pipeline on all paths (finally: pipe.reset())
- llm_circuit_breaker: CAS with max-merge in force_open and abnormal-probe
  cleanup to preserve longer open_until under concurrent force_open
- llm_circuit_breaker: use _should_block_with_claim in retryable branch so
  _owns_probe tracks HALF_OPEN probe ownership across retries
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
massgen/backend/llm_circuit_breaker.py (1)

502-517: ⚠️ Potential issue | 🟠 Major

Avoid the blind set_state() fallback in force_open().

After the CAS loop exhausts, this drops back to get_state()set_state(). A concurrent worker can still write a later open_until after Line 505 and before Line 508, and this branch will overwrite it with the shorter merged value. Keep this path CAS-based all the way through, or move the max-merge into a single store-side atomic operation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/backend/llm_circuit_breaker.py` around lines 502 - 517, The fallback
in force_open() uses _store.get_state() then _store.set_state(), which can
overwrite a concurrent writer's newer open_until; replace this blind set with a
CAS loop or a single store-side atomic merge: read current via
_store.get_state(), compute merged_open_until and merged_last_failure as shown,
then attempt to write using the store's compare-and-set API (e.g.
_store.cas_state or _store.compare_and_set) passing the expected current and the
new merged dict; retry the CAS until it succeeds or a max retry/timeout is
reached; if your store cannot CAS, add a new atomic store-side method (e.g.
_store.merge_state_max(self.backend_name, merged_fields)) that performs the
max-merge server-side and call that instead to ensure no race overwrites newer
open_until.
🧹 Nitpick comments (2)
massgen/backend/cb_store.py (2)

765-769: Make the flat-pair zip() explicit.

Ruff is flagging this call. Since Line 767 already proves the slices are the same length, adding strict=True is a no-risk fix that keeps the invariant local.

Suggested change
-        return self._state_from_items(zip(pairs[0::2], pairs[1::2]))
+        return self._state_from_items(
+            zip(pairs[0::2], pairs[1::2], strict=True),
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/backend/cb_store.py` around lines 765 - 769, The call to zip in
_state_from_flat_pairs currently relies on equal-length slices but linter flags
it; update the return to call zip(pairs[0::2], pairs[1::2], strict=True) in the
_state_from_flat_pairs method so the equality invariant is enforced explicitly
and satisfies Ruff while preserving behavior, keeping the error check for odd
length unchanged.

136-162: Add Google-style docstrings to the new store methods.

Most of the newly introduced methods in this module are still missing per-function docstrings, which makes the store contract harder to implement and review consistently.

As per coding guidelines: **/*.py: For new or changed functions, include Google-style docstrings.

Also applies to: 574-610

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/backend/cb_store.py` around lines 136 - 162, The new store methods
get_state, set_state, cas_state, and increment_failure lack Google-style
docstrings; add a brief Google-style docstring to each function describing what
it does, the arguments (type and purpose) including backend: str and for
set_state/cas_state/ increment_failure the state/expected_state/updates types,
the return value and its type, any raised exceptions (if applicable), and a note
about thread-safety/locking behavior; ensure wording matches existing project
docstring style and covers side effects (mutates internal _storage) for
set_state/cas_state/increment_failure and that get_state returns a deep-copied
dict.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@massgen/backend/cb_store.py`:
- Around line 181-189: The failure handling currently only sets
state["open_until"] when transitioning from "half_open" or "closed" to "open",
but when state["state"] is already "open" we must extend the cooldown instead of
leaving it unchanged; update the failure-handling branches that set
state["state"] = "open" (the blocks manipulating state["state"],
state["open_until"], and state["half_open_probe_active"]) to set
state["open_until"] = max(state.get("open_until", 0), now + recovery_timeout) so
existing open breakers extend their cooldown; apply this same change to the
other two identical implementations in the file (the other failure-handling
blocks noted around the other occurrences).

In `@massgen/tests/test_cb_store_adversarial.py`:
- Around line 164-169: The thread helper currently catches BaseException in
run_safely (which uses exception_lock and appends to exceptions), which will
also swallow KeyboardInterrupt/SystemExit; change the handler to catch Exception
instead of BaseException to only capture regular worker failures; apply the same
change to the other helper block referenced around the second occurrence (the
helper at lines ~191-202) so both places catch Exception rather than
BaseException.
- Around line 732-738: The try/except around the assertions in the test masks
AssertionError from the checks on state returned by store.get_state("backend");
remove the broad except block so assertion failures surface directly. Locate the
block using store.get_state and the assertions for state["failure_count"] and
state["state"], delete the try/except wrapper (or at minimum remove the except
Exception clause) so that AssertionError is not caught and pytest shows the
original failure.

---

Duplicate comments:
In `@massgen/backend/llm_circuit_breaker.py`:
- Around line 502-517: The fallback in force_open() uses _store.get_state() then
_store.set_state(), which can overwrite a concurrent writer's newer open_until;
replace this blind set with a CAS loop or a single store-side atomic merge: read
current via _store.get_state(), compute merged_open_until and
merged_last_failure as shown, then attempt to write using the store's
compare-and-set API (e.g. _store.cas_state or _store.compare_and_set) passing
the expected current and the new merged dict; retry the CAS until it succeeds or
a max retry/timeout is reached; if your store cannot CAS, add a new atomic
store-side method (e.g. _store.merge_state_max(self.backend_name,
merged_fields)) that performs the max-merge server-side and call that instead to
ensure no race overwrites newer open_until.

---

Nitpick comments:
In `@massgen/backend/cb_store.py`:
- Around line 765-769: The call to zip in _state_from_flat_pairs currently
relies on equal-length slices but linter flags it; update the return to call
zip(pairs[0::2], pairs[1::2], strict=True) in the _state_from_flat_pairs method
so the equality invariant is enforced explicitly and satisfies Ruff while
preserving behavior, keeping the error check for odd length unchanged.
- Around line 136-162: The new store methods get_state, set_state, cas_state,
and increment_failure lack Google-style docstrings; add a brief Google-style
docstring to each function describing what it does, the arguments (type and
purpose) including backend: str and for set_state/cas_state/ increment_failure
the state/expected_state/updates types, the return value and its type, any
raised exceptions (if applicable), and a note about thread-safety/locking
behavior; ensure wording matches existing project docstring style and covers
side effects (mutates internal _storage) for
set_state/cas_state/increment_failure and that get_state returns a deep-copied
dict.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3af4cf73-d7b9-4de3-80d6-3681f0d429d5

📥 Commits

Reviewing files that changed from the base of the PR and between 5340531 and 175434b.

📒 Files selected for processing (3)
  • massgen/backend/cb_store.py
  • massgen/backend/llm_circuit_breaker.py
  • massgen/tests/test_cb_store_adversarial.py

Comment thread massgen/backend/cb_store.py
Comment thread massgen/tests/test_cb_store_adversarial.py
Comment thread massgen/tests/test_cb_store_adversarial.py Outdated
- cb_store: extend open_until on OPEN->OPEN failures (3 store impls) to
  match in-memory LLMCircuitBreaker.record_failure semantics
- llm_circuit_breaker: drop blind set_state fallback in force_open();
  log warning when CAS exhausts instead of risking stale open_until
- tests: narrow BaseException to Exception in thread helpers
- tests: drop broad except wrapper that masked AssertionError
- cb_store: zip(..., strict=True) for flat-pair parsing
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
massgen/backend/cb_store.py (1)

136-169: Add Google-style docstrings to the concrete store methods.

The CircuitBreakerStore protocol is documented well, but most of the new public methods in InMemoryStore and RedisStore still have no docstrings. Filling those in would make the concrete behavior easier to audit without jumping back to the protocol every time.

As per coding guidelines, **/*.py: For new or changed functions, include Google-style docstrings.

Also applies to: 199-236, 268-270, 580-661, 738-739

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/backend/cb_store.py` around lines 136 - 169, The concrete store
classes InMemoryStore and RedisStore are missing Google-style docstrings on
their public methods (e.g., get_state, set_state, cas_state, increment_failure,
atomic_record_failure), so add concise Google-style docstrings for each public
method describing purpose, args, returns, and raises where applicable; ensure
docstrings explain parameter types (backend, state, expected_state, updates,
failure_threshold, recovery_timeout), the return values (dict/int/bool), and any
side effects (mutating internal storage or performing CAS), and apply the same
pattern to the other listed method ranges (lines referenced in the review:
199-236, 268-270, 580-661, 738-739) so all new/changed functions follow project
docstring guidelines.
massgen/tests/test_cb_store_adversarial.py (1)

1009-1021: Assert namespace behavior instead of private fields.

This test only checks _key_prefix / _ttl, so it would still pass if RedisStore._key() ignored the custom prefix and wrote to the default namespace. A small behavioral assertion with set_state() plus client.exists("custom:ns:backend") would cover the isolation guarantee the factory is supposed to provide.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/tests/test_cb_store_adversarial.py` around lines 1009 - 1021, The
test currently asserts private attributes on RedisStore (_key_prefix/_ttl)
rather than observable behavior; change it to exercise the behavior by using
make_store to create the RedisStore, call the store.set_state(...) method (e.g.,
set_state("backend", some test value)), and then assert against the fake redis
client that the key with the custom namespace exists
(client.exists("custom:ns:backend") or equivalent) and optionally that the TTL
is set as expected; reference make_store, RedisStore, and set_state when
locating where to change the assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@massgen/backend/cb_store.py`:
- Around line 677-687: The probe transition paths currently set
half_open_probe_active and expire the key with only self._ttl (see uses in
_TRANSITION_PROBE_SCRIPT invocation and
_try_transition_and_claim_probe_without_lua), which can let the probe entry
expire too soon; change the TTL used when writing the half-open probe to take a
probe-specific floor (use the threaded recovery_timeout/probe TTL parameter)
e.g. replace the current max(1, self._ttl) usage with max(1, self._ttl,
recovery_timeout) (or a dedicated self._probe_ttl_floor) in both the Lua path
(_TRANSITION_PROBE_SCRIPT call) and the non-Lua path
(_try_transition_and_claim_probe_without_lua and the other occurrence around
lines 724-726) so the half_open probe cannot expire before recovery_timeout.
- Around line 569-579: The RedisStore __init__ currently allows construction
with a missing redis_client and fails later; modify RedisStore.__init__ to
validate the redis_client parameter (e.g., if not redis_client) and immediately
raise a ValueError with a clear message (e.g., "redis_client is required") so
the public constructor enforces the same contract as make_store; reference the
constructor parameters (redis_client, ttl, key_prefix) and the instance
attribute self._client when adding this check.

---

Nitpick comments:
In `@massgen/backend/cb_store.py`:
- Around line 136-169: The concrete store classes InMemoryStore and RedisStore
are missing Google-style docstrings on their public methods (e.g., get_state,
set_state, cas_state, increment_failure, atomic_record_failure), so add concise
Google-style docstrings for each public method describing purpose, args,
returns, and raises where applicable; ensure docstrings explain parameter types
(backend, state, expected_state, updates, failure_threshold, recovery_timeout),
the return values (dict/int/bool), and any side effects (mutating internal
storage or performing CAS), and apply the same pattern to the other listed
method ranges (lines referenced in the review: 199-236, 268-270, 580-661,
738-739) so all new/changed functions follow project docstring guidelines.

In `@massgen/tests/test_cb_store_adversarial.py`:
- Around line 1009-1021: The test currently asserts private attributes on
RedisStore (_key_prefix/_ttl) rather than observable behavior; change it to
exercise the behavior by using make_store to create the RedisStore, call the
store.set_state(...) method (e.g., set_state("backend", some test value)), and
then assert against the fake redis client that the key with the custom namespace
exists (client.exists("custom:ns:backend") or equivalent) and optionally that
the TTL is set as expected; reference make_store, RedisStore, and set_state when
locating where to change the assertion.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: aff8d988-b658-44c1-9972-c0b7c9040df0

📥 Commits

Reviewing files that changed from the base of the PR and between 175434b and 1134c5c.

📒 Files selected for processing (3)
  • massgen/backend/cb_store.py
  • massgen/backend/llm_circuit_breaker.py
  • massgen/tests/test_cb_store_adversarial.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • massgen/backend/llm_circuit_breaker.py

Comment thread massgen/backend/cb_store.py
Comment thread massgen/backend/cb_store.py
- RedisStore(None, ...) now raises ValueError at construction time
  instead of deferring failure to the first Redis call.
- half_open probe state is no longer bound to base TTL. Added
  half_open_probe_ttl param (forwarded through make_store) so the
  half_open entry outlives long probe requests when callers configure
  a small base TTL.
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
massgen/backend/cb_store.py (1)

20-21: Add Google-style docstrings to the new helpers and store methods.

A lot of the new functions here are still undocumented, and the Redis class-level Args section still omits half_open_probe_ttl. This module will be easier to maintain if the public surface and fallback helpers all carry the repo-standard docs.

As per coding guidelines, "**/*.py: For new or changed functions, include Google-style docstrings`."

Also applies to: 132-236, 268-270, 273-287, 574-673, 710-877

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/backend/cb_store.py` around lines 20 - 21, Add Google-style
docstrings to every new or changed function and public method in this module
(including the helper _default_state and all Redis class methods), following the
repo standard: a short summary line, Args with parameter types (ensure Redis
class Args includes half_open_probe_ttl), Returns with types, and Raises if
applicable; apply this to the helper functions and store methods in the shown
ranges (functions around lines 132-236, 268-270, 273-287, 574-673, 710-877) so
each public symbol (e.g., _default_state, Redis.__init__, Redis.get/set methods,
and any helper utilities) has a clear Google-style docstring describing purpose,
parameters, return value and side effects.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@massgen/backend/cb_store.py`:
- Around line 289-302: The Lua CAS (_CAS_SCRIPT) and the Python WATCH/MULTI path
must compute the TTL from the post-merge state instead of trusting the incoming
updates payload; modify _CAS_SCRIPT to read current hash fields (state,
open_until, probe_time, etc.), apply the ARGV updates to compute the merged
state inside the script and then derive and set EXPIRE based on that merged
state (not ARGV[2] alone), and update the Python cas_state() / the WATCH/MULTI
implementation to first read the current hash, apply the same merge logic to
produce the final state and then compute TTL from that merged state before
issuing HSET/EXPIRE in the transaction so both code paths use the same
post-merge expiry calculation.

---

Nitpick comments:
In `@massgen/backend/cb_store.py`:
- Around line 20-21: Add Google-style docstrings to every new or changed
function and public method in this module (including the helper _default_state
and all Redis class methods), following the repo standard: a short summary line,
Args with parameter types (ensure Redis class Args includes
half_open_probe_ttl), Returns with types, and Raises if applicable; apply this
to the helper functions and store methods in the shown ranges (functions around
lines 132-236, 268-270, 273-287, 574-673, 710-877) so each public symbol (e.g.,
_default_state, Redis.__init__, Redis.get/set methods, and any helper utilities)
has a clear Google-style docstring describing purpose, parameters, return value
and side effects.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d33d4e2b-526e-4277-930d-4c8d1eff364a

📥 Commits

Reviewing files that changed from the base of the PR and between 1134c5c and da4d12b.

📒 Files selected for processing (1)
  • massgen/backend/cb_store.py

Comment thread massgen/backend/cb_store.py
amabito added 2 commits April 14, 2026 17:44
In cas_state(), TTL was computed from the updates patch alone.
A partial CAS on an already-open or half_open breaker would reset
the Redis TTL to the base value, allowing the key to expire before
open_until or the probe window elapses.

Fix: _CAS_SCRIPT now reads post-merge state/open_until after HSET
and computes effective TTL using max(base_ttl, time_remaining+60)
for open state and max(base_ttl, half_open_probe_ttl) for half_open.
The Python WATCH/MULTI fallback merges existing state with updates
before calling _compute_ttl.
@ncrispino ncrispino changed the base branch from main to dev/v0.1.78 April 17, 2026 17:19
@ncrispino ncrispino merged commit 0b209a1 into massgen:dev/v0.1.78 Apr 17, 2026
11 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Apr 17, 2026
18 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants