Skip to content

fix(circuit_breaker): clear rate keys on auto-reset#147

Open
levleontiev wants to merge 1 commit intomainfrom
fix/circuit-breaker-auto-reset-rate-keys
Open

fix(circuit_breaker): clear rate keys on auto-reset#147
levleontiev wants to merge 1 commit intomainfrom
fix/circuit-breaker-auto-reset-rate-keys

Conversation

@levleontiev
Copy link
Copy Markdown
Contributor

Summary

The circuit breaker's auto-reset path had a bug: when the cooldown elapsed and the state key was deleted, the rate accumulation keys were not cleared. For auto_reset_after_minutes < 2 (shorter than WINDOW_TTL_SECONDS = 120 s), the old high-rate data from the triggering spike was still alive in the shared dict. The sliding-window estimator would then read that stale data and immediately re-trip the breaker on the very first post-reset request — making auto-reset effectively a no-op.

Root cause

-- BEFORE (buggy): state key deleted, rate keys survive → breaker re-trips immediately
dict:delete(state_key)
-- falls through to rate check with stale previous_total still alive

Fix

Mirror what _M.reset() already does correctly — delete current and previous rate keys after clearing the state key:

-- AFTER: clear rate keys so sliding-window doesn't re-trip on first post-reset request
dict:delete(state_key)
local current_window_ar = floor(now / WINDOW_SIZE_SECONDS) * WINDOW_SIZE_SECONDS
local previous_window_ar = current_window_ar - WINDOW_SIZE_SECONDS
dict:delete(_M.build_rate_key(limit_key, current_window_ar))
dict:delete(_M.build_rate_key(limit_key, previous_window_ar))

Also introduces SECONDS_PER_MINUTE = 60 to decouple the elapsed-minutes calculation from WINDOW_SIZE_SECONDS — the two constants are coincidentally equal but represent different concepts.

Why tests didn't catch this

AC-4 uses auto_reset_after_minutes = 5 (300 s), well above WINDOW_TTL_SECONDS = 120 s, so rate keys expired naturally before auto-reset fired. The failure mode only manifests for auto_reset_after_minutes < 2.

🤖 Generated with Claude Code

When auto_reset_after_minutes > 0 and the cooldown elapses, the
circuit breaker deletes the state key but leaves the rate accumulation
keys alive (TTL = WINDOW_TTL_SECONDS = 120 s). For any cooldown shorter
than 120 s the sliding-window estimator reads the stale high-rate data
and immediately re-trips the breaker on the very first post-reset
request, making auto-reset a no-op.

Fix: mirror what _M.reset() already does — delete current and previous
rate keys immediately after clearing the state key.

Also introduce SECONDS_PER_MINUTE (= 60) so the elapsed-minutes
calculation is not semantically coupled to WINDOW_SIZE_SECONDS; the two
constants are coincidentally equal but represent different concepts.

Test coverage: AC-4 uses auto_reset_after_minutes=5 (300 s > 120 s
WINDOW_TTL_SECONDS), so rate keys expire naturally and the bug never
manifests. A future test with auto_reset_after_minutes=1 will cover the
fixed path.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant