tests/run-tests.py: Add automatic retry for known-flaky tests. #6

Draft
andrewleech wants to merge 2 commits into review/flaky-test-retry from flaky-test-retry

Conversation

@andrewleech
Owner

@andrewleech andrewleech commented Feb 23, 2026

Summary

The unix port CI workflow on master fails ~18% of the time due to a handful of intermittent test failures (mostly thread-related). This has normalised red CI to the point where contributors ignore it, masking real regressions.

I added a flaky-test allowlist and retry-until-pass mechanism to run-tests.py. When a listed test fails, it is re-run up to 2 more times and counted as a pass on the first successful run. The 8 tests on the list account for essentially all spurious CI failures on master.

tools/ci.sh is updated to remove --exclude patterns for tests now handled by the retry logic, so they get actual coverage again. thread/stress_aes.py stays excluded on RISC-V QEMU where it runs close to the 200s timeout and retries would burn excessive CI time.

Retries are on by default; --no-retry disables them. The end-of-run summary reports tests that passed via retry separately from clean passes.
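For reviewers skimming without the diff, the mechanism described above can be sketched roughly as follows. This is a minimal illustration using the names from this PR; the signatures and the exact allowlist contents are simplified, not the actual run-tests.py code:

```python
import sys

# Illustrative subset of the allowlist: test path -> (reason, platforms).
# platforms is None for all platforms, or a tuple of sys.platform strings.
FLAKY_TESTS = {
    "thread/thread_gc1.py": ("GC race condition", None),
    "thread/stress_heap.py": ("flaky on macOS", ("darwin",)),
}
FLAKY_TEST_MAX_RETRIES = 2  # 3 total runs including the initial one


def is_flaky(test_file):
    entry = FLAKY_TESTS.get(test_file)
    if entry is None:
        return False, ""
    reason, platforms = entry
    if platforms is not None and sys.platform not in platforms:
        return False, ""
    return True, reason


def run_with_retry(test_file, run_one_test):
    # run_one_test is a stand-in for the real test runner; it returns a
    # status string ("pass"/"fail") for one execution of test_file.
    status = run_one_test(test_file)
    flaky, reason = is_flaky(test_file)
    if status == "pass" or not flaky:
        return status, ""
    for attempt in range(1, FLAKY_TEST_MAX_RETRIES + 1):
        status = run_one_test(test_file)
        if status == "pass":
            # Passed on retry: reported separately from clean passes.
            return status, "flaky (retried {}, {})".format(attempt, reason)
    return status, "flaky (retried {}, {})".format(FLAKY_TEST_MAX_RETRIES, reason)
```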

Testing

Not yet tested in CI — this draft PR is to exercise the review workflow and get CI results on the fork.

Trade-offs and Alternatives

Retry-until-pass can mask a test that has become intermittently broken. The allowlist is deliberately small and each entry documents a reason and optional platform restriction. An alternative would be to mark these tests as expected-fail, but that removes coverage entirely which is worse than the current situation.

The test result status strings ("pass", "fail", "skip") are bare string literals throughout run-tests.py — the retry code follows this existing convention. Worth considering a follow-up to consolidate these into constants for the whole file.

Generative AI

I used generative AI tools when creating this PR, but a human has checked the code and is responsible for the description above.

@github-actions

github-actions bot commented Feb 23, 2026

Code size report:

Reference:  zephyr/mpconfigport: Remove duplicate builtins.open definition. [1ab9b66]
Comparison: tests/run-tests.py: Ignore known-flaky test failures instead of retrying. [merge of 36ae4eb]
  mpy-cross:    +0 +0.000% 
   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:    +0 +0.000% standard
      stm32:    +0 +0.000% PYBV10
      esp32:    +0 +0.000% ESP32_GENERIC
     mimxrt:    +0 +0.000% TEENSY40
        rp2:    +0 +0.000% RPI_PICO_W
       samd:    +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:    +0 +0.000% VIRT_RV32

saved_testcase_count = testcase_count.value
flaky_failures = []
for i, r in enumerate(results):
    if r[1] == "fail":
Owner Author

Is there a shared / original definition of "fail" that can be used rather than hard coded in string here?

retried_pass = False
for attempt in range(1, FLAKY_TEST_MAX_RETRIES + 1):
    print(
        "retry {} ({}/{}, {})".format(
Owner Author

Is .format the standard way of outputting strings in this project?

run_one_test(test_file)
retry_entries = results[pre_len:]
del results[pre_len:] # clean up all entries appended during retry
if len(retry_entries) == 1 and retry_entries[0][1] == "pass":
Owner Author

As per earlier comment, "pass" probably shouldn't be hard coded in string here. Also is this correct for all tests types (diff/exp/unittest)?

"{} (retried {})".format(FLAKY_REASON_PREFIX, FLAKY_TEST_MAX_RETRIES),
)
print(
"FAIL {} (retried {}, {})".format(
Owner Author

How does this FAIL compare to earlier fail strings?

Repository owner deleted a comment from mpy-reviewer bot Feb 24, 2026
Repository owner deleted a comment from mpy-reviewer bot Feb 24, 2026
@andrewleech
Copy link
Owner Author

/review

Repository owner deleted a comment from mpy-reviewer bot Feb 24, 2026
@mpy-reviewer mpy-reviewer bot left a comment

Several issues with the retry mechanism: the flaky-test registry uses a fragile string prefix for tagging results and private _value access to reset the counter, and stress_aes.py with platforms=None will triple the timeout on MIPS/ARM QEMU. Style nits: missing blank line and len(x) > 0 checks.

    "extmod/time_time_ns.py": ("CI runner clock precision", None),
}

FLAKY_TEST_MAX_RETRIES = 2 # 3 total runs including initial

Missing blank line between closing } of FLAKY_TESTS and this constant.

test_file, FLAKY_TEST_MAX_RETRIES, flaky_reason
)
)
testcase_count._value = saved_testcase_count

_value is a private attribute of ThreadSafeCounter. Use testcase_count.value = saved_testcase_count if the type supports assignment, or restructure to avoid resetting it entirely. Accessing internals like this is fragile.

# or a tuple of sys.platform strings to restrict retries to those platforms.
FLAKY_TESTS = {
    "thread/thread_gc1.py": ("GC race condition", None),
    "thread/stress_aes.py": ("timeout under QEMU emulation", None),

stress_aes.py is listed with platforms=None (all platforms) but the reason is "timeout under QEMU emulation". On MIPS/ARM QEMU it takes 70-90 s; retrying twice means up to 3x that timeout burning CI time. Restrict to the riscv64 platform, or drop from FLAKY_TESTS given ci.sh already excludes it there with --exclude thread/stress_aes.py anyway.

flaky_tests = [
    r for r in test_results if r[1] == "pass" and r[2].startswith(FLAKY_REASON_PREFIX)
]
if len(flaky_tests) > 0:

Use if flaky_tests: not if len(flaky_tests) > 0:.

flaky_failed = [
    r for r in test_results if r[1] == "fail" and r[2].startswith(FLAKY_REASON_PREFIX)
]
if len(flaky_failed) > 0:

Same: if flaky_failed: not if len(flaky_failed) > 0:.

import pyboard

# Prefix used by run-tests.py to tag flaky-retry results.
FLAKY_REASON_PREFIX = "flaky"

Using a string prefix to tag result tuples is fragile -- a test whose reason string happens to start with "flaky" would be misidentified. Better to use a distinct status value such as "pass-retry" or extend the result tuple with a dedicated field.

return False, ""
return True, reason

results = test_results.value # direct ref to internal list

Worth verifying this is safe. If run_one_test appends via the ThreadSafeCounter wrapper method (which takes the lock), and here we mutate the raw list directly (bypassing the lock), the del results[pre_len:] could race with a concurrent appender. The comment claims this block runs after all parallel work is done; confirm that is always true, especially for the single-job (sequential) path.
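To illustrate the race being flagged here, a lock-safe variant would take the same lock for truncation that the appenders take, along these lines (hypothetical helper names, not code from this PR):

```python
import threading

results = []  # shared result list, normally appended to by worker threads
results_lock = threading.Lock()


def append_result(entry):
    # Stand-in for the locked append the thread-safe wrapper performs.
    with results_lock:
        results.append(entry)


def discard_retry_entries(pre_len):
    # Truncate under the same lock, so a concurrent append cannot
    # interleave between the slice copy and the del.
    with results_lock:
        retry_entries = results[pre_len:]
        del results[pre_len:]
    return retry_entries
```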

@andrewleech
Owner Author

/review

@andrewleech
Owner Author

/review

@mpy-reviewer

mpy-reviewer bot commented Feb 24, 2026

Review failed. Retry with /review.

@andrewleech
Owner Author

/review

@andrewleech
Owner Author

/review

@mpy-reviewer mpy-reviewer bot left a comment

tests/run-tests.py, FLAKY_TESTS "thread/stress_aes.py" entry (suggestion)

platforms=None means retries apply everywhere including native unix/macOS, but the documented reason is "timeout under QEMU emulation". On native platforms this test isn't slow; if it fails for a real reason it'll be silently retried. Either restrict to QEMU platform identifiers or remove it from the allowlist — it's not actually flaky on MIPS/ARM, just slow.


tests/run-tests.py, after FLAKY_TEST_MAX_RETRIES = 2 (nitpick)

Missing blank line between the new constant and the existing # Tests require at least CPython 3.3 comment block below it.


tests/run-tests.py, testcase_count._value = saved_testcase_count (suggestion)

Don't write to ._value directly — that bypasses the ThreadSafeCounter abstraction. Better options:

  • Track the count delta added during retries and subtract it via testcase_count.add(-delta).
  • Add a public set() or reset() method to ThreadSafeCounter.
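The second option might look like this minimal sketch (an illustrative ThreadSafeCounter, not the one in run-tests.py):

```python
import threading


class ThreadSafeCounter:
    def __init__(self, start=0):
        self._lock = threading.Lock()
        self._value = start

    def add(self, delta):
        with self._lock:
            self._value += delta

    def set(self, value):
        # Public reset point, so callers never reach into _value directly.
        with self._lock:
            self._value = value

    @property
    def value(self):
        with self._lock:
            return self._value
```

With that, the retry cleanup becomes testcase_count.set(saved_testcase_count).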

tests/run-tests.py, --no-retry argument (nitpick)

default=False is redundant for action="store_true". Can remove.


tests/test_utils.py, flaky summary blocks (nitpick)

Change if len(flaky_tests) > 0: to if flaky_tests: (same for flaky_failed).

@andrewleech
Owner Author

/review

@mpy-reviewer mpy-reviewer bot left a comment

The retry approach is reasonable for reducing spurious CI failures. A few issues to address:

tests/run-tests.py, line R1113 — testcase_count._value = saved_testcase_count
Accessing ._value directly bypasses ThreadSafeCounter's public interface and any associated locking. If testcase_count.value is a read-only property, add a set() or reset() method to ThreadSafeCounter and use that here.

tests/run-tests.py, line R51 — "thread/stress_aes.py": ("timeout under QEMU emulation", None)
Platform is None (all platforms), but this test runs fine on macOS/native and is only slow on QEMU targets. On MIPS (~90s) and ARM (~70s), a retry on timeout failure burns up to 270s/210s of CI time. Either restrict to QEMU platforms or remove from FLAKY_TESTS entirely and rely on the existing --exclude in the rare case it times out.

tests/test_utils.py, lines R330 and R341 — r[2].startswith(FLAKY_REASON_PREFIX)
Detecting flaky results via startswith on the third tuple element is fragile — the result tuple has no documented schema for r[2]. If a non-flaky test's output string happens to start with "flaky", it'll be miscategorised. Consider using a 4-element tuple (test_file, status, reason, is_flaky_retry) or a named type instead of co-opting the reason string.

tests/run-tests.py, line R1064 — def _is_flaky(test_file):
No reason to define this inside the if not args.no_retry: block; it can be a module-level function alongside FLAKY_TESTS.

tests/test_utils.py, line R332 — if len(flaky_tests) > 0:
Please use if flaky_tests: (likewise if flaky_failed: on R343).

tests/run-tests.py, line R1237 — default=False
Redundant with action="store_true"; remove it.

@andrewleech
Owner Author

/review

@mpy-reviewer mpy-reviewer bot left a comment

Two bugs in FLAKY_TESTS: stress_recurse.py is listed twice so the first entry is silently dropped, and testcase_count._value is set directly via a private attribute that bypasses the ThreadSafeCounter abstraction. Several entries have platform-specific failure reasons but platforms=None, which will trigger unnecessary retries on native runners.

    "thread/thread_gc1.py": ("GC race condition", None),
    "thread/stress_aes.py": ("timeout under QEMU emulation", None),
    "thread/stress_schedule.py": ("intermittent crash under QEMU", None),
    "thread/stress_recurse.py": ("stack overflow under emulation", None),

Duplicate key — "thread/stress_recurse.py" already appears on the line above, so this entry silently overwrites the first ("intermittent crash under QEMU" is dropped). Merge into one entry.

# or a tuple of sys.platform strings to restrict retries to those platforms.
FLAKY_TESTS = {
    "thread/thread_gc1.py": ("GC race condition", None),
    "thread/stress_aes.py": ("timeout under QEMU emulation", None),

Reason says "timeout under QEMU emulation" but platforms=None means this test will also be retried on native macOS/Linux runners where it doesn't time out. Restrict to QEMU platforms or broaden the reason.

    "thread/stress_schedule.py": ("intermittent crash under QEMU", None),
    "thread/stress_recurse.py": ("stack overflow under emulation", None),
    "thread/stress_heap.py": ("flaky on macOS", ("darwin",)),
    "cmdline/repl_lock.py": ("REPL timing under QEMU", None),

Same issue: "REPL timing under QEMU" but platforms=None. Should restrict to QEMU platforms.

test_file, FLAKY_TEST_MAX_RETRIES, flaky_reason
)
)
testcase_count._value = saved_testcase_count

Don't assign to ._value directly. Add a reset(value) method (or similar) to ThreadSafeCounter and use that here.

cmd_parser.add_argument(
    "--no-retry",
    action="store_true",
    default=False,

default=False is redundant with action="store_true".

flaky_tests = [
    r for r in test_results if r[1] == "pass" and r[2].startswith(FLAKY_REASON_PREFIX)
]
if len(flaky_tests) > 0:

Use if flaky_tests: instead of if len(flaky_tests) > 0:.

flaky_failed = [
    r for r in test_results if r[1] == "fail" and r[2].startswith(FLAKY_REASON_PREFIX)
]
if len(flaky_failed) > 0:

Use if flaky_failed: instead.

(cd ports/unix && gcov -o build-coverage/py ../../py/*.c || true)
(cd ports/unix && gcov -o build-coverage/extmod ../../extmod/*.c || true)
      - name: Upload coverage to Codecov
        if: ${{ secrets.CODECOV_TOKEN != '' }}

This Codecov change is unrelated to the flaky-retry feature. Should be a separate PR.

Signed-off-by: Andrew Leech <andrew.leech@planet-innovation.com>
…ing.

Reclassify failures of tests listed in flaky_tests_to_skip as "ignored"
instead of retrying them. Ignored tests still run and their output is
reported, but they don't affect the exit code. The ci.sh --exclude lists
for these tests are removed so they run normally.

Signed-off-by: Andrew Leech <andrew.leech@planet-innovation.com>
