feat: add distributed store backend for circuit breaker (Phase 4)#1061
feat: add distributed store backend for circuit breaker (Phase 4)#1061ncrispino merged 9 commits intomassgen:dev/v0.1.78from
Conversation
- CircuitBreakerStore Protocol (get_state/set_state/cas_state/increment_failure/clear) - InMemoryStore: thread-safe, zero deps, default backend (ThreadingRLock) - RedisStore: optional redis>=4.0, Lua CAS script, dynamic TTL covering open_until, WATCH/MULTI/EXEC fallback, _script_unavailable narrowed to avoid misclassifying READONLY/ACL errors - make_store factory (memory|redis) - LLMCircuitBreaker gains store= param; store=None preserves all existing behavior (100% backward compat) - 138 tests: 27 happy path, 38 adversarial (thread-safety, Redis fallback, Rule 14/22/27/45/51), 66 regression (test_llm_circuit_breaker unchanged) - Pre-commit clean (black/isort/flake8/autoflake) Known design constraint (Phase 5): record_failure store path splits increment_failure + cas_state across two store calls; not fully linearizable against concurrent reset()/record_success(). Requires Protocol extension with atomic record_failure() method to fix properly.
- Protocol extended with atomic_record_failure and atomic_record_success - InMemoryStore: RLock-protected read-compute-write for both atomic methods - RedisStore: Lua scripts for single-round-trip atomicity - LLMCircuitBreaker routes through atomic methods when store is set - Preserves force_open semantics: atomic_record_success is no-op on OPEN state - Preserves record_failure correctness in HALF_OPEN: any failure -> OPEN - Backward compat: store=None path unchanged - Adds new tests in test_cb_store.py and test_cb_store_adversarial.py
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a pluggable circuit-breaker state store and integrates it into LLMCircuitBreaker: new Changes
Sequence Diagram(s)sequenceDiagram
participant Client as LLMCircuitBreaker
participant Store as CircuitBreakerStore
participant Backend as Storage Backend
Client->>Store: get_state(backend)
Store->>Backend: fetch state (in-memory or Redis)
Backend-->>Store: state dict
Store-->>Client: state dict
alt state == OPEN and now >= open_until
Client->>Store: try_transition_and_claim_probe(backend, now, recovery_timeout)
Store->>Backend: atomic OPEN→HALF_OPEN + claim probe
Backend-->>Store: (claimed=true, state)
Store-->>Client: (claimed, state)
else state == HALF_OPEN
Client->>Client: check half_open_probe_active -> gate/allow probe
else state == CLOSED
Client->>Client: allow request
end
Client->>Client: execute request
alt request succeeds
Client->>Store: atomic_record_success(backend)
Store->>Backend: reset failure_count, maybe CLOSE
Backend-->>Store: new state + prev metadata
Store-->>Client: new state
else request fails
Client->>Store: atomic_record_failure(backend, threshold, recovery_timeout)
Store->>Backend: increment failure_count, maybe OPEN with open_until
Backend-->>Store: new state + prev metadata
Store-->>Client: new state
end
sequenceDiagram
participant InMem as InMemoryStore
participant Lock as RLock
participant Redis as RedisStore
participant RedisSvr as Redis Server
rect rgba(100, 150, 255, 0.5)
Note over InMem,Lock: In-memory path (single-process)
InMem->>Lock: acquire()
Lock-->>InMem: locked
InMem->>InMem: read/modify dict atomically
InMem->>Lock: release()
end
rect rgba(255, 150, 100, 0.5)
Note over Redis,RedisSvr: Redis path (Lua preferred)
Redis->>RedisSvr: EVAL (atomic script)
RedisSvr->>RedisSvr: read/parse hash fields
RedisSvr->>RedisSvr: compute new fields (counts, open_until)
RedisSvr->>RedisSvr: HSET/EXPIRE atomically
RedisSvr-->>Redis: serialized state
end
rect rgba(200, 100, 255, 0.5)
Note over Redis,RedisSvr: Redis fallback (no EVAL)
Redis->>RedisSvr: WATCH key
Redis->>RedisSvr: HGETALL key
RedisSvr-->>Redis: current hash
Redis->>Redis: parse locally, compute updates
Redis->>RedisSvr: MULTI / HSET / EXPIRE / EXEC
RedisSvr-->>Redis: success or nil on conflict
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~70 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 3❌ Failed checks (3 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
massgen/backend/llm_circuit_breaker.py (1)
601-623:⚠️ Potential issue | 🟠 MajorStore-backed retries can lose probe ownership tracking.
On later attempts,
_owns_probeis refreshed only fromself._state, which never reflects the shared-store state. If a retry acquires the HALF_OPEN probe in store mode and then exits via cancellation or anotherBaseException, the cleanup block skips the reopen path and can leavehalf_open_probe_active=Trueindefinitely.Also applies to: 735-775
🧹 Nitpick comments (1)
massgen/backend/cb_store.py (1)
69-165: Add Google-style docstrings to the new store surface.The protocol is documented, but the concrete store methods and factory still leave TTL behavior, expected-state no-ops, and Lua/WATCH fallback semantics implicit in the implementation. Please document the new callable surface directly.
As per coding guidelines "For new or changed functions, include Google-style docstrings".
Also applies to: 167-466, 675-684
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@massgen/backend/cb_store.py` around lines 69 - 165, The InMemoryStore class and its methods lack Google-style docstrings describing TTL behavior, expected-state no-ops, Lua/WATCH fallback semantics, thread-safety and return types; add concise Google-style docstrings to the class InMemoryStore and each method (get_state, set_state, cas_state, increment_failure, atomic_record_failure, atomic_record_success, clear) that specify: purpose, args (types and meaning, e.g., backend, expected_state, failure_threshold, recovery_timeout, updates), return values, side effects (mutation of in-memory state, deep-copy behavior), concurrency guarantees (uses RLock), and the semantics for TTL/open_until, expected_state no-op behavior, and that this in-memory implementation is a non-distributed fallback to Redis/Lua/WATCH semantics. Ensure docstrings follow the project Google style and cover error/edge-case behavior used by callers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@massgen/backend/cb_store.py`:
- Around line 420-445: The Redis-backed code is persisting time.monotonic()
which is not portable across processes; update atomic_record_failure (and the
fallback _atomic_record_failure_without_lua referenced around lines 517-521) to
use a wall-clock Unix timestamp (e.g., time.time()) or the Redis server TIME
inside the Lua script instead of time.monotonic(), and ensure the values written
as open_until/last_failure_time are produced and compared against the same clock
(Unix time) used by LLMCircuitBreaker so timestamps are meaningful across
workers/processes.
- Around line 675-684: The factory function make_store drops RedisStore's
key_prefix and will raise a raw KeyError if redis_client is missing; update
make_store to forward key_prefix (e.g., use kwargs.get("key_prefix")) into the
RedisStore constructor along with ttl, and validate presence of redis_client
(raise a clear ValueError like "redis_client required for redis backend" if
missing) instead of letting a KeyError propagate so callers can isolate
namespaces and get a helpful error message.
In `@massgen/backend/llm_circuit_breaker.py`:
- Around line 239-280: The current probe transition in llm_circuit_breaker.py
reads open_until and half_open_probe_active but calls
self._store.cas_state(self.backend_name, CircuitState.OPEN.value, ...) which
only guards on "state", allowing races (e.g., force_open updates open_until) and
multiple probes; add and use a single atomic store-side operation (e.g.,
self._store.try_transition_and_claim_probe or similar) that checks both the
current "state" and the relevant fields ("open_until" for OPEN->HALF_OPEN
transition and "half_open_probe_active" for HALF_OPEN probe claim) in one
compare-and-set, then perform the state change and set half_open_probe_active
atomically; update the code paths around the CircuitState.OPEN branch and the
HALF_OPEN branch to call this new atomic method instead of cas_state and handle
its boolean result for whether this process won the probe, keeping metric
logging and returned booleans the same.
- Around line 330-374: The atomic store methods (atomic_record_failure and
atomic_record_success) must return both previous and new state (e.g., include
keys like "_prev_state" and "_prev_was_half_open" or an explicit "transition"
descriptor) and the CircuitBreaker callers (record_failure and record_success in
llm_circuit_breaker.py) must base their logging and metrics solely on that
returned payload instead of inferring state from the new state only; update the
store implementations to include these fields, then change the code paths around
atomic_record_failure (current block using new_state_str, _prev_was_half_open,
_prev_state) and the symmetric atomic_record_success call (the other block noted
in the comment) to read the returned previous/new state or transition and emit
the correct log message and metrics (e.g., use the returned prev->new transition
for record_state_transition and for choosing "Probe failed, circuit breaker
re-opened" vs "Circuit breaker opened"/"open -> closed" messaging).
---
Nitpick comments:
In `@massgen/backend/cb_store.py`:
- Around line 69-165: The InMemoryStore class and its methods lack Google-style
docstrings describing TTL behavior, expected-state no-ops, Lua/WATCH fallback
semantics, thread-safety and return types; add concise Google-style docstrings
to the class InMemoryStore and each method (get_state, set_state, cas_state,
increment_failure, atomic_record_failure, atomic_record_success, clear) that
specify: purpose, args (types and meaning, e.g., backend, expected_state,
failure_threshold, recovery_timeout, updates), return values, side effects
(mutation of in-memory state, deep-copy behavior), concurrency guarantees (uses
RLock), and the semantics for TTL/open_until, expected_state no-op behavior, and
that this in-memory implementation is a non-distributed fallback to
Redis/Lua/WATCH semantics. Ensure docstrings follow the project Google style and
cover error/edge-case behavior used by callers.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0cb84a85-1af9-403d-9e85-2fd5d399ed10
📒 Files selected for processing (5)
massgen/backend/cb_store.pymassgen/backend/llm_circuit_breaker.pymassgen/tests/test_cb_store.pymassgen/tests/test_cb_store_adversarial.pypyproject.toml
- atomic try_transition_and_claim_probe op (Lua + InMem + WATCH fallback) closes the OPEN/HALF_OPEN race where force_open could be silently undone and prevents two workers from both winning the HALF_OPEN probe slot - _should_block_with_claim() returns probe-ownership flag so call_with_retry correctly tracks _owns_probe in store mode (previous local-state refresh never fired, leaving half_open_probe_active stuck on cancellation) - switch store-persisted timestamps to time.time() (Unix wall clock); Lua scripts and Python helpers now share comparable cross-process clocks - make_store: forward key_prefix to RedisStore and raise ValueError for missing redis_client instead of bare KeyError - Google-style docstrings on Protocol surface, InMemoryStore, RedisStore, and make_store factory - 11 new adversarial tests for probe race, force_open race, factory validation, and store-mode probe ownership
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
massgen/backend/llm_circuit_breaker.py (1)
587-607:⚠️ Potential issue | 🔴 CriticalCarry probe ownership through the retryable-status gate.
The new
_owns_probeplumbing only updates the entry checks here. The retryable HTTP branch still callsself.should_block()at Line 709, and in store mode that call can wintry_transition_and_claim_probe()without surfacing the claim. On the next iteration, the request can reject its own retry as HALF_OPEN, and because_owns_probestayedFalse, theexcept BaseExceptioncleanup never clearshalf_open_probe_active. That leaves the breaker wedged until some external reset.Also applies to: 612-627
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@massgen/backend/llm_circuit_breaker.py` around lines 587 - 607, The retryable-status branch still calls self.should_block(), which in store mode can claim the half-open probe without updating the `_owns_probe` flag, leaving half_open_probe_active set and the breaker wedged; update the retry loop to use the same claim-aware call (self._should_block_with_claim()) or otherwise capture and merge its claimed result into `_owns_probe` whenever you call should_block()/try_transition_and_claim_probe() so `_owns_probe` accurately reflects probe ownership across iterations, and ensure the BaseException cleanup path clears half_open_probe_active when `_owns_probe` is true.
🧹 Nitpick comments (1)
massgen/backend/cb_store.py (1)
20-21: Add Google-style docstrings to the new concrete store methods.The protocol is very well documented, but most of the new
_default_state(),InMemoryStore, andRedisStoreimplementations still rely on signatures alone. Matching the concrete methods to the same docstring standard would make the fallback behavior and timestamp semantics much easier to audit and maintain.As per coding guidelines, "
**/*.py: For new or changed functions, include Google-style docstrings`."Also applies to: 132-267, 574-949
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@massgen/backend/cb_store.py` around lines 20 - 21, Add Google-style docstrings to the new concrete store functions and classes: _default_state(), class InMemoryStore (its methods), and class RedisStore (its methods). For each docstring follow the project's Google-style format and document purpose, parameters, return types, exceptions, side effects, fallback behavior, and timestamp semantics (e.g., what timestamps represent and when they are set/updated) so the fallback behavior and timestamp semantics are explicit and auditable; ensure methods that mutate state or read/write timestamps (e.g., get/set/delete/refresh methods in InMemoryStore and RedisStore) include examples or notes about concurrency and persistence where relevant.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@massgen/backend/cb_store.py`:
- Around line 797-853: Both _cas_state_without_lua and
_increment_failure_without_lua open a watched pipeline but do not always release
it on all paths; wrap the existing try/except in each function with a finally
block that calls pipe.reset() so the watched connection is returned to the pool
on success, on retryable exceptions (where code currently continues), and on
other exits; ensure the finally runs after pipe.execute() returns or before each
early return (you can keep the existing pipe.reset() call in the early-return
branch but the finally must still call pipe.reset()).
In `@massgen/backend/llm_circuit_breaker.py`:
- Around line 479-504: The force_open() path (and the abnormal-probe cleanup at
the 745-765 region) currently does a non-atomic get_state() → mutate →
set_state(), which can lose a longer open_until written by another worker;
replace this with an atomic store-side compare/update (e.g., add or use
_store.compare_and_update_state / compare_and_set API) that retrieves the
current state and writes a new state only after merging values so open_until =
max(existing.open_until, computed_open_until) and last_failure_time =
max(existing.last_failure_time, computed_time), and ensure
half_open_probe_active is set to False; specifically modify the code around
force_open(), the calls to _store.get_state and _store.set_state, and the
abnormal-probe cleanup to perform that compare-and-update (or implement a CAS
loop using a versioned/state-token returned by _store) so updates preserve the
maximum cooldown instead of blind overwrites.
---
Outside diff comments:
In `@massgen/backend/llm_circuit_breaker.py`:
- Around line 587-607: The retryable-status branch still calls
self.should_block(), which in store mode can claim the half-open probe without
updating the `_owns_probe` flag, leaving half_open_probe_active set and the
breaker wedged; update the retry loop to use the same claim-aware call
(self._should_block_with_claim()) or otherwise capture and merge its claimed
result into `_owns_probe` whenever you call
should_block()/try_transition_and_claim_probe() so `_owns_probe` accurately
reflects probe ownership across iterations, and ensure the BaseException cleanup
path clears half_open_probe_active when `_owns_probe` is true.
---
Nitpick comments:
In `@massgen/backend/cb_store.py`:
- Around line 20-21: Add Google-style docstrings to the new concrete store
functions and classes: _default_state(), class InMemoryStore (its methods), and
class RedisStore (its methods). For each docstring follow the project's
Google-style format and document purpose, parameters, return types, exceptions,
side effects, fallback behavior, and timestamp semantics (e.g., what timestamps
represent and when they are set/updated) so the fallback behavior and timestamp
semantics are explicit and auditable; ensure methods that mutate state or
read/write timestamps (e.g., get/set/delete/refresh methods in InMemoryStore and
RedisStore) include examples or notes about concurrency and persistence where
relevant.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f0db84e0-daae-462e-91b8-6d9c643604c1
📒 Files selected for processing (4)
massgen/backend/cb_store.pymassgen/backend/llm_circuit_breaker.pymassgen/tests/test_cb_store.pymassgen/tests/test_cb_store_adversarial.py
- cb_store: release watched pipeline on all paths (finally: pipe.reset()) - llm_circuit_breaker: CAS with max-merge in force_open and abnormal-probe cleanup to preserve longer open_until under concurrent force_open - llm_circuit_breaker: use _should_block_with_claim in retryable branch so _owns_probe tracks HALF_OPEN probe ownership across retries
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
massgen/backend/llm_circuit_breaker.py (1)
502-517:⚠️ Potential issue | 🟠 MajorAvoid the blind
set_state()fallback inforce_open().After the CAS loop exhausts, this drops back to
get_state()→set_state(). A concurrent worker can still write a lateropen_untilafter Line 505 and before Line 508, and this branch will overwrite it with the shorter merged value. Keep this path CAS-based all the way through, or move the max-merge into a single store-side atomic operation.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@massgen/backend/llm_circuit_breaker.py` around lines 502 - 517, The fallback in force_open() uses _store.get_state() then _store.set_state(), which can overwrite a concurrent writer's newer open_until; replace this blind set with a CAS loop or a single store-side atomic merge: read current via _store.get_state(), compute merged_open_until and merged_last_failure as shown, then attempt to write using the store's compare-and-set API (e.g. _store.cas_state or _store.compare_and_set) passing the expected current and the new merged dict; retry the CAS until it succeeds or a max retry/timeout is reached; if your store cannot CAS, add a new atomic store-side method (e.g. _store.merge_state_max(self.backend_name, merged_fields)) that performs the max-merge server-side and call that instead to ensure no race overwrites newer open_until.
🧹 Nitpick comments (2)
massgen/backend/cb_store.py (2)
765-769: Make the flat-pairzip()explicit.Ruff is flagging this call. Since Line 767 already proves the slices are the same length, adding
strict=Trueis a no-risk fix that keeps the invariant local.Suggested change
- return self._state_from_items(zip(pairs[0::2], pairs[1::2])) + return self._state_from_items( + zip(pairs[0::2], pairs[1::2], strict=True), + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@massgen/backend/cb_store.py` around lines 765 - 769, The call to zip in _state_from_flat_pairs currently relies on equal-length slices but linter flags it; update the return to call zip(pairs[0::2], pairs[1::2], strict=True) in the _state_from_flat_pairs method so the equality invariant is enforced explicitly and satisfies Ruff while preserving behavior, keeping the error check for odd length unchanged.
136-162: Add Google-style docstrings to the new store methods.Most of the newly introduced methods in this module are still missing per-function docstrings, which makes the store contract harder to implement and review consistently.
As per coding guidelines:
**/*.py: For new or changed functions, include Google-style docstrings.Also applies to: 574-610
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@massgen/backend/cb_store.py` around lines 136 - 162, The new store methods get_state, set_state, cas_state, and increment_failure lack Google-style docstrings; add a brief Google-style docstring to each function describing what it does, the arguments (type and purpose) including backend: str and for set_state/cas_state/ increment_failure the state/expected_state/updates types, the return value and its type, any raised exceptions (if applicable), and a note about thread-safety/locking behavior; ensure wording matches existing project docstring style and covers side effects (mutates internal _storage) for set_state/cas_state/increment_failure and that get_state returns a deep-copied dict.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@massgen/backend/cb_store.py`:
- Around line 181-189: The failure handling currently only sets
state["open_until"] when transitioning from "half_open" or "closed" to "open",
but when state["state"] is already "open" we must extend the cooldown instead of
leaving it unchanged; update the failure-handling branches that set
state["state"] = "open" (the blocks manipulating state["state"],
state["open_until"], and state["half_open_probe_active"]) to set
state["open_until"] = max(state.get("open_until", 0), now + recovery_timeout) so
existing open breakers extend their cooldown; apply this same change to the
other two identical implementations in the file (the other failure-handling
blocks noted around the other occurrences).
In `@massgen/tests/test_cb_store_adversarial.py`:
- Around line 164-169: The thread helper currently catches BaseException in
run_safely (which uses exception_lock and appends to exceptions), which will
also swallow KeyboardInterrupt/SystemExit; change the handler to catch Exception
instead of BaseException to only capture regular worker failures; apply the same
change to the other helper block referenced around the second occurrence (the
helper at lines ~191-202) so both places catch Exception rather than
BaseException.
- Around line 732-738: The try/except around the assertions in the test masks
AssertionError from the checks on state returned by store.get_state("backend");
remove the broad except block so assertion failures surface directly. Locate the
block using store.get_state and the assertions for state["failure_count"] and
state["state"], delete the try/except wrapper (or at minimum remove the except
Exception clause) so that AssertionError is not caught and pytest shows the
original failure.
---
Duplicate comments:
In `@massgen/backend/llm_circuit_breaker.py`:
- Around line 502-517: The fallback in force_open() uses _store.get_state() then
_store.set_state(), which can overwrite a concurrent writer's newer open_until;
replace this blind set with a CAS loop or a single store-side atomic merge: read
current via _store.get_state(), compute merged_open_until and
merged_last_failure as shown, then attempt to write using the store's
compare-and-set API (e.g. _store.cas_state or _store.compare_and_set) passing
the expected current and the new merged dict; retry the CAS until it succeeds or
a max retry/timeout is reached; if your store cannot CAS, add a new atomic
store-side method (e.g. _store.merge_state_max(self.backend_name,
merged_fields)) that performs the max-merge server-side and call that instead to
ensure no race overwrites newer open_until.
---
Nitpick comments:
In `@massgen/backend/cb_store.py`:
- Around line 765-769: The call to zip in _state_from_flat_pairs currently
relies on equal-length slices but linter flags it; update the return to call
zip(pairs[0::2], pairs[1::2], strict=True) in the _state_from_flat_pairs method
so the equality invariant is enforced explicitly and satisfies Ruff while
preserving behavior, keeping the error check for odd length unchanged.
- Around line 136-162: The new store methods get_state, set_state, cas_state,
and increment_failure lack Google-style docstrings; add a brief Google-style
docstring to each function describing what it does, the arguments (type and
purpose) including backend: str and for set_state/cas_state/ increment_failure
the state/expected_state/updates types, the return value and its type, any
raised exceptions (if applicable), and a note about thread-safety/locking
behavior; ensure wording matches existing project docstring style and covers
side effects (mutates internal _storage) for
set_state/cas_state/increment_failure and that get_state returns a deep-copied
dict.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3af4cf73-d7b9-4de3-80d6-3681f0d429d5
📒 Files selected for processing (3)
massgen/backend/cb_store.pymassgen/backend/llm_circuit_breaker.pymassgen/tests/test_cb_store_adversarial.py
- cb_store: extend open_until on OPEN->OPEN failures (3 store impls) to match in-memory LLMCircuitBreaker.record_failure semantics - llm_circuit_breaker: drop blind set_state fallback in force_open(); log warning when CAS exhausts instead of risking stale open_until - tests: narrow BaseException to Exception in thread helpers - tests: drop broad except wrapper that masked AssertionError - cb_store: zip(..., strict=True) for flat-pair parsing
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
massgen/backend/cb_store.py (1)
136-169: Add Google-style docstrings to the concrete store methods.The
CircuitBreakerStoreprotocol is documented well, but most of the new public methods inInMemoryStoreandRedisStorestill have no docstrings. Filling those in would make the concrete behavior easier to audit without jumping back to the protocol every time.As per coding guidelines,
**/*.py: For new or changed functions, include Google-style docstrings.Also applies to: 199-236, 268-270, 580-661, 738-739
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@massgen/backend/cb_store.py` around lines 136 - 169, The concrete store classes InMemoryStore and RedisStore are missing Google-style docstrings on their public methods (e.g., get_state, set_state, cas_state, increment_failure, atomic_record_failure), so add concise Google-style docstrings for each public method describing purpose, args, returns, and raises where applicable; ensure docstrings explain parameter types (backend, state, expected_state, updates, failure_threshold, recovery_timeout), the return values (dict/int/bool), and any side effects (mutating internal storage or performing CAS), and apply the same pattern to the other listed method ranges (lines referenced in the review: 199-236, 268-270, 580-661, 738-739) so all new/changed functions follow project docstring guidelines.massgen/tests/test_cb_store_adversarial.py (1)
1009-1021: Assert namespace behavior instead of private fields.This test only checks
_key_prefix/_ttl, so it would still pass ifRedisStore._key()ignored the custom prefix and wrote to the default namespace. A small behavioral assertion withset_state()plusclient.exists("custom:ns:backend")would cover the isolation guarantee the factory is supposed to provide.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@massgen/tests/test_cb_store_adversarial.py` around lines 1009 - 1021, The test currently asserts private attributes on RedisStore (_key_prefix/_ttl) rather than observable behavior; change it to exercise the behavior by using make_store to create the RedisStore, call the store.set_state(...) method (e.g., set_state("backend", some test value)), and then assert against the fake redis client that the key with the custom namespace exists (client.exists("custom:ns:backend") or equivalent) and optionally that the TTL is set as expected; reference make_store, RedisStore, and set_state when locating where to change the assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@massgen/backend/cb_store.py`:
- Around line 677-687: The probe transition paths currently set
half_open_probe_active and expire the key with only self._ttl (see uses in
_TRANSITION_PROBE_SCRIPT invocation and
_try_transition_and_claim_probe_without_lua), which can let the probe entry
expire too soon; change the TTL used when writing the half-open probe to take a
probe-specific floor (use the threaded recovery_timeout/probe TTL parameter)
e.g. replace the current max(1, self._ttl) usage with max(1, self._ttl,
recovery_timeout) (or a dedicated self._probe_ttl_floor) in both the Lua path
(_TRANSITION_PROBE_SCRIPT call) and the non-Lua path
(_try_transition_and_claim_probe_without_lua and the other occurrence around
lines 724-726) so the half_open probe cannot expire before recovery_timeout.
- Around line 569-579: The RedisStore __init__ currently allows construction
with a missing redis_client and fails later; modify RedisStore.__init__ to
validate the redis_client parameter (e.g., if not redis_client) and immediately
raise a ValueError with a clear message (e.g., "redis_client is required") so
the public constructor enforces the same contract as make_store; reference the
constructor parameters (redis_client, ttl, key_prefix) and the instance
attribute self._client when adding this check.
---
Nitpick comments:
In `@massgen/backend/cb_store.py`:
- Around line 136-169: The concrete store classes InMemoryStore and RedisStore
are missing Google-style docstrings on their public methods (e.g., get_state,
set_state, cas_state, increment_failure, atomic_record_failure), so add concise
Google-style docstrings for each public method describing purpose, args,
returns, and raises where applicable; ensure docstrings explain parameter types
(backend, state, expected_state, updates, failure_threshold, recovery_timeout),
the return values (dict/int/bool), and any side effects (mutating internal
storage or performing CAS), and apply the same pattern to the other listed
method ranges (lines referenced in the review: 199-236, 268-270, 580-661,
738-739) so all new/changed functions follow project docstring guidelines.
In `@massgen/tests/test_cb_store_adversarial.py`:
- Around line 1009-1021: The test currently asserts private attributes on
RedisStore (_key_prefix/_ttl) rather than observable behavior; change it to
exercise the behavior by using make_store to create the RedisStore, call the
store.set_state(...) method (e.g., set_state("backend", some test value)), and
then assert against the fake redis client that the key with the custom namespace
exists (client.exists("custom:ns:backend") or equivalent) and optionally that
the TTL is set as expected; reference make_store, RedisStore, and set_state when
locating where to change the assertion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: aff8d988-b658-44c1-9972-c0b7c9040df0
📒 Files selected for processing (3)
massgen/backend/cb_store.pymassgen/backend/llm_circuit_breaker.pymassgen/tests/test_cb_store_adversarial.py
🚧 Files skipped from review as they are similar to previous changes (1)
- massgen/backend/llm_circuit_breaker.py
- RedisStore(None, ...) now raises ValueError at construction time instead of deferring failure to the first Redis call. - half_open probe state is no longer bound to base TTL. Added half_open_probe_ttl param (forwarded through make_store) so the half_open entry outlives long probe requests when callers configure a small base TTL.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
massgen/backend/cb_store.py (1)
20-21: Add Google-style docstrings to the new helpers and store methods.A lot of the new functions here are still undocumented, and the Redis class-level
Argssection still omitshalf_open_probe_ttl. This module will be easier to maintain if the public surface and fallback helpers all carry the repo-standard docs.As per coding guidelines, "
**/*.py: For new or changed functions, include Google-style docstrings`."Also applies to: 132-236, 268-270, 273-287, 574-673, 710-877
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@massgen/backend/cb_store.py` around lines 20 - 21, Add Google-style docstrings to every new or changed function and public method in this module (including the helper _default_state and all Redis class methods), following the repo standard: a short summary line, Args with parameter types (ensure Redis class Args includes half_open_probe_ttl), Returns with types, and Raises if applicable; apply this to the helper functions and store methods in the shown ranges (functions around lines 132-236, 268-270, 273-287, 574-673, 710-877) so each public symbol (e.g., _default_state, Redis.__init__, Redis.get/set methods, and any helper utilities) has a clear Google-style docstring describing purpose, parameters, return value and side effects.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@massgen/backend/cb_store.py`:
- Around line 289-302: The Lua CAS (_CAS_SCRIPT) and the Python WATCH/MULTI path
must compute the TTL from the post-merge state instead of trusting the incoming
updates payload; modify _CAS_SCRIPT to read current hash fields (state,
open_until, probe_time, etc.), apply the ARGV updates to compute the merged
state inside the script and then derive and set EXPIRE based on that merged
state (not ARGV[2] alone), and update the Python cas_state() / the WATCH/MULTI
implementation to first read the current hash, apply the same merge logic to
produce the final state and then compute TTL from that merged state before
issuing HSET/EXPIRE in the transaction so both code paths use the same
post-merge expiry calculation.
---
Nitpick comments:
In `@massgen/backend/cb_store.py`:
- Around line 20-21: Add Google-style docstrings to every new or changed
function and public method in this module (including the helper _default_state
and all Redis class methods), following the repo standard: a short summary line,
Args with parameter types (ensure Redis class Args includes
half_open_probe_ttl), Returns with types, and Raises if applicable; apply this
to the helper functions and store methods in the shown ranges (functions around
lines 132-236, 268-270, 273-287, 574-673, 710-877) so each public symbol (e.g.,
_default_state, Redis.__init__, Redis.get/set methods, and any helper utilities)
has a clear Google-style docstring describing purpose, parameters, return value
and side effects.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d33d4e2b-526e-4277-930d-4c8d1eff364a
📒 Files selected for processing (1)
massgen/backend/cb_store.py
In cas_state(), TTL was computed from the updates patch alone. A partial CAS on an already-open or half_open breaker would reset the Redis TTL to the base value, allowing the key to expire before open_until or the probe window elapses. Fix: _CAS_SCRIPT now reads post-merge state/open_until after HSET and computes effective TTL using max(base_ttl, time_remaining+60) for open state and max(base_ttl, half_open_probe_ttl) for half_open. The Python WATCH/MULTI fallback merges existing state with updates before calling _compute_ttl.
Summary
Phase 4: pluggable store backend for the LLM circuit breaker, so state can be shared across workers / processes. Default (
store=None) keeps the existing single-process path unchanged.Rebased on top of #1038 (Phase 2) and #1056 (Phase 3).
What's in
CircuitBreakerStoreProtocol (sync, to match the existing CB code path)InMemoryStore-- RLock, zero depsRedisStore-- requiresredis>=4.0, lazy-importedatomic_record_failure/atomic_record_successon the Protocol. Needed for linearizability when multiple workers race on the same backend; found this the hard way in review and had to extend the Protocol mid-way.Tests
159 tests passing locally: store unit, adversarial (TOCTOU, Redis eviction, corrupted state, Protocol contract), plus regression on the existing CB tests.
Unrelated
test_subagent_docker_logs.pyetc. fail on Windows with cp932 encoding, but they fail onorigin/maintoo -- pre-existing, not touched here.Install
Redis backend is optional:
Summary by CodeRabbit
New Features
Bug Fixes / Reliability
Tests
Chores