tests/run-tests.py: Add automatic retry for known-flaky tests. #6
andrewleech wants to merge 2 commits into review/flaky-test-retry from
Conversation
Force-pushed 29e879c to 91111d5
tests/run-tests.py (outdated):
    saved_testcase_count = testcase_count.value
    flaky_failures = []
    for i, r in enumerate(results):
        if r[1] == "fail":
Is there a shared / original definition of "fail" that can be used rather than hard-coding the string here?
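A minimal sketch of what such shared definitions could look like, assuming module-level constants (the names here are hypothetical; run-tests.py currently uses bare string literals):

```python
# Hypothetical status constants for run-tests.py result tuples.
# The real script currently compares against bare "pass"/"fail"/"skip" literals.
RESULT_PASS = "pass"
RESULT_FAIL = "fail"
RESULT_SKIP = "skip"


def count_failures(results):
    # results is a list of (test_name, status, reason) tuples
    return sum(1 for r in results if r[1] == RESULT_FAIL)


results = [("a.py", RESULT_PASS, ""), ("b.py", RESULT_FAIL, "timeout")]
print(count_failures(results))  # -> 1
```

With constants, a typo such as `"fial"` becomes a NameError at import time rather than a silently never-matching comparison.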
tests/run-tests.py (outdated):
    retried_pass = False
    for attempt in range(1, FLAKY_TEST_MAX_RETRIES + 1):
        print(
            "retry {} ({}/{}, {})".format(
Is .format the standard way of outputting strings in this project?
tests/run-tests.py (outdated):
    run_one_test(test_file)
    retry_entries = results[pre_len:]
    del results[pre_len:]  # clean up all entries appended during retry
    if len(retry_entries) == 1 and retry_entries[0][1] == "pass":
As per the earlier comment, "pass" probably shouldn't be hard-coded as a string here. Also, is this correct for all test types (diff/exp/unittest)?
tests/run-tests.py (outdated):
        "{} (retried {})".format(FLAKY_REASON_PREFIX, FLAKY_TEST_MAX_RETRIES),
    )
    print(
        "FAIL {} (retried {}, {})".format(
How does this FAIL compare to earlier fail strings?
/review
Several issues with the retry mechanism: the flaky-test registry uses a fragile string prefix for tagging results and private _value access to reset the counter, and stress_aes.py with platforms=None will triple the timeout on MIPS/ARM QEMU. Style nits: missing blank line and len(x) > 0 checks.
tests/run-tests.py (outdated):
        "extmod/time_time_ns.py": ("CI runner clock precision", None),
    }
    FLAKY_TEST_MAX_RETRIES = 2  # 3 total runs including initial
Missing blank line between closing } of FLAKY_TESTS and this constant.
tests/run-tests.py (outdated):
            test_file, FLAKY_TEST_MAX_RETRIES, flaky_reason
        )
    )
    testcase_count._value = saved_testcase_count
_value is a private attribute of ThreadSafeCounter. Use testcase_count.value = saved_testcase_count if the type supports assignment, or restructure to avoid resetting it entirely. Accessing internals like this is fragile.
tests/run-tests.py (outdated):
    # or a tuple of sys.platform strings to restrict retries to those platforms.
    FLAKY_TESTS = {
        "thread/thread_gc1.py": ("GC race condition", None),
        "thread/stress_aes.py": ("timeout under QEMU emulation", None),
stress_aes.py is listed with platforms=None (all platforms) but the reason is "timeout under QEMU emulation". On MIPS/ARM QEMU it takes 70-90 s; retrying twice means up to 3x that timeout burning CI time. Restrict to the riscv64 platform, or drop from FLAKY_TESTS given ci.sh already excludes it there with --exclude thread/stress_aes.py anyway.
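A minimal sketch of what a platform restriction could look like, assuming the (reason, platforms) tuple shape described in the PR. The ("linux",) value below is illustrative only, and note that sys.platform alone cannot distinguish a QEMU guest from native Linux, so a real fix might key on the CI target name instead:

```python
import sys

# Hypothetical allowlist mirroring the PR's (reason, platforms) shape:
# platforms is None (retry everywhere) or a tuple of sys.platform values
# that retries are restricted to. The ("linux",) entry is illustrative.
FLAKY_TESTS = {
    "thread/thread_gc1.py": ("GC race condition", None),
    "thread/stress_aes.py": ("timeout under QEMU emulation", ("linux",)),
}


def is_flaky(test_file, platform=sys.platform):
    entry = FLAKY_TESTS.get(test_file)
    if entry is None:
        return False, ""
    reason, platforms = entry
    # Restricted entries only trigger retries on the listed platforms.
    if platforms is not None and platform not in platforms:
        return False, ""
    return True, reason


print(is_flaky("thread/stress_aes.py", platform="darwin"))  # restricted: no retry
print(is_flaky("thread/thread_gc1.py", platform="darwin"))  # platforms=None: retried
```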
tests/test_utils.py (outdated):
    flaky_tests = [
        r for r in test_results if r[1] == "pass" and r[2].startswith(FLAKY_REASON_PREFIX)
    ]
    if len(flaky_tests) > 0:
Use if flaky_tests: not if len(flaky_tests) > 0:.
tests/test_utils.py (outdated):
    flaky_failed = [
        r for r in test_results if r[1] == "fail" and r[2].startswith(FLAKY_REASON_PREFIX)
    ]
    if len(flaky_failed) > 0:
Same: if flaky_failed: not if len(flaky_failed) > 0:.
    import pyboard

    # Prefix used by run-tests.py to tag flaky-retry results.
    FLAKY_REASON_PREFIX = "flaky"
Using a string prefix to tag result tuples is fragile: a test whose reason string happens to start with "flaky" would be misidentified. Better to use a distinct status value such as "pass-retry" or extend the result tuple with a dedicated field.
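One way to realise the dedicated-field suggestion, sketched with a namedtuple; the field names are hypothetical and do not match the PR's actual tuple schema:

```python
from collections import namedtuple

# Hypothetical richer result record with an explicit flag, instead of
# tagging retried results by prefixing the reason string with "flaky".
TestResult = namedtuple("TestResult", ["test_file", "status", "reason", "is_flaky_retry"])

results = [
    TestResult("thread/thread_gc1.py", "pass", "", True),
    # A reason string that merely starts with "flaky" no longer causes
    # the result to be miscategorised as a retried test.
    TestResult("misc/other.py", "pass", "flaky-looking output", False),
]

retried = [r for r in results if r.status == "pass" and r.is_flaky_retry]
print([r.test_file for r in retried])  # -> ['thread/thread_gc1.py']
```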
tests/run-tests.py (outdated):
        return False, ""
    return True, reason

    results = test_results.value  # direct ref to internal list
Worth verifying this is safe. If run_one_test appends via the ThreadSafeCounter wrapper method (which takes the lock), and here we mutate the raw list directly (bypassing the lock), the del results[pre_len:] could race with a concurrent appender. The comment claims this block runs after all parallel work is done; confirm that is always true, especially for the single-job (sequential) path.
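If the all-parallel-work-done assumption turns out not to hold, the truncation could be pushed behind the same lock the appenders use. A hypothetical sketch; the class and method names here do not match the real wrapper in run-tests.py:

```python
import threading

# Sketch of a thread-safe list wrapper. The point: truncation happens
# under the same lock the appenders take, rather than mutating the raw
# internal list via `del results[pre_len:]` with no synchronisation.
class ThreadSafeList:
    def __init__(self):
        self._lock = threading.Lock()
        self._list = []

    def append(self, item):
        with self._lock:
            self._list.append(item)

    def truncate(self, length):
        # Remove and return entries appended past `length`, atomically.
        with self._lock:
            removed = self._list[length:]
            del self._list[length:]
            return removed

    @property
    def value(self):
        with self._lock:
            return list(self._list)  # snapshot copy, not the raw list


r = ThreadSafeList()
r.append(("a.py", "fail", ""))
pre_len = len(r.value)
r.append(("a.py", "pass", ""))  # entry appended during a retry
retry_entries = r.truncate(pre_len)
print(retry_entries)
```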
/review
Review failed. Retry with
/review
tests/run-tests.py, FLAKY_TESTS — thread/stress_aes.py entry (suggestion)
platforms=None means retries apply everywhere including native unix/macOS, but the documented reason is "timeout under QEMU emulation". On native platforms this test isn't slow; if it fails for a real reason it'll be silently retried. Either restrict to QEMU platform identifiers or remove it from the allowlist — it's not actually flaky on MIPS/ARM, just slow.
tests/run-tests.py, after FLAKY_TEST_MAX_RETRIES = 2 (nitpick)
Missing blank line between the new constant and the existing # Tests require at least CPython 3.3 comment block below it.
tests/run-tests.py, testcase_count._value = saved_testcase_count (suggestion)
Don't write to ._value directly — that bypasses the ThreadSafeCounter abstraction. Better options:
- Track the count delta added during retries and subtract it via testcase_count.add(-delta).
- Add a public set() or reset() method to ThreadSafeCounter.
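A sketch of the second option: a public reset() on a minimal ThreadSafeCounter. This is hypothetical; the real class in run-tests.py may differ in detail:

```python
import threading

# Hypothetical ThreadSafeCounter with a public reset(), as suggested.
class ThreadSafeCounter:
    def __init__(self, start=0):
        self._lock = threading.Lock()
        self._value = start

    def add(self, n=1):
        with self._lock:
            self._value += n

    def reset(self, value=0):
        # Public setter so callers never reach into ._value directly.
        with self._lock:
            self._value = value

    @property
    def value(self):
        return self._value


testcase_count = ThreadSafeCounter()
testcase_count.add(5)
saved = testcase_count.value
testcase_count.add(3)        # counts accumulated during a retry
testcase_count.reset(saved)  # instead of testcase_count._value = saved
print(testcase_count.value)  # -> 5
```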
tests/run-tests.py, --no-retry argument (nitpick)
default=False is redundant for action="store_true". Can remove.
tests/test_utils.py, flaky summary blocks (nitpick)
if len(flaky_tests) > 0: → if flaky_tests: (same for flaky_failed).
/review
The retry approach is reasonable for reducing spurious CI failures. A few issues to address:
tests/run-tests.py, line R1113 — testcase_count._value = saved_testcase_count
Accessing ._value directly bypasses ThreadSafeCounter's public interface and any associated locking. If testcase_count.value is a read-only property, add a set() or reset() method to ThreadSafeCounter and use that here.
tests/run-tests.py, line R51 — "thread/stress_aes.py": ("timeout under QEMU emulation", None)
Platform is None (all platforms), but this test runs fine on macOS/native and is only slow on QEMU targets. On MIPS (~90s) and ARM (~70s), a retry on timeout failure burns up to 270s/210s of CI time. Either restrict to QEMU platforms or remove from FLAKY_TESTS entirely and rely on the existing --exclude in the rare case it times out.
tests/test_utils.py, lines R330 and R341 — r[2].startswith(FLAKY_REASON_PREFIX)
Detecting flaky results via startswith on the third tuple element is fragile — the result tuple has no documented schema for r[2]. If a non-flaky test's output string happens to start with "flaky", it'll be miscategorised. Consider using a 4-element tuple (test_file, status, reason, is_flaky_retry) or a named type instead of co-opting the reason string.
tests/run-tests.py, line R1064 — def _is_flaky(test_file):
No reason to define this inside the if not args.no_retry: block; it can be a module-level function alongside FLAKY_TESTS.
tests/test_utils.py, line R332 — if len(flaky_tests) > 0:
Please use if flaky_tests: (likewise if flaky_failed: on R343).
tests/run-tests.py, line R1237 — default=False
Redundant with action="store_true"; remove it.
/review
Two bugs in FLAKY_TESTS: stress_recurse.py is listed twice so the first entry is silently dropped, and testcase_count._value is set directly via a private attribute that bypasses the ThreadSafeCounter abstraction. Several entries have platform-specific failure reasons but platforms=None, which will trigger unnecessary retries on native runners.
tests/run-tests.py (outdated):
    "thread/thread_gc1.py": ("GC race condition", None),
    "thread/stress_aes.py": ("timeout under QEMU emulation", None),
    "thread/stress_schedule.py": ("intermittent crash under QEMU", None),
    "thread/stress_recurse.py": ("stack overflow under emulation", None),
Duplicate key — "thread/stress_recurse.py" already appears on the line above, so this entry silently overwrites the first ("intermittent crash under QEMU" is dropped). Merge into one entry.
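The silent-overwrite behaviour is easy to demonstrate: Python dict literals keep only the last occurrence of a duplicated key, with no error or warning at runtime.

```python
# Duplicate keys in a dict literal: the later entry wins silently,
# which is why the first stress_recurse.py reason would be dropped.
d = {
    "thread/stress_recurse.py": ("intermittent crash under QEMU", None),
    "thread/stress_recurse.py": ("stack overflow under emulation", None),
}
print(len(d))                            # -> 1
print(d["thread/stress_recurse.py"][0])  # -> stack overflow under emulation
```

(Linters such as ruff/pylint flag this as a duplicate-key error, which is one way to catch it in CI.)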
tests/run-tests.py (outdated):
    # or a tuple of sys.platform strings to restrict retries to those platforms.
    FLAKY_TESTS = {
        "thread/thread_gc1.py": ("GC race condition", None),
        "thread/stress_aes.py": ("timeout under QEMU emulation", None),
Reason says "timeout under QEMU emulation" but platforms=None means this test will also be retried on native macOS/Linux runners where it doesn't time out. Restrict to QEMU platforms or broaden the reason.
tests/run-tests.py (outdated):
    "thread/stress_schedule.py": ("intermittent crash under QEMU", None),
    "thread/stress_recurse.py": ("stack overflow under emulation", None),
    "thread/stress_heap.py": ("flaky on macOS", ("darwin",)),
    "cmdline/repl_lock.py": ("REPL timing under QEMU", None),
Same issue: "REPL timing under QEMU" but platforms=None. Should restrict to QEMU platforms.
tests/run-tests.py (outdated):
            test_file, FLAKY_TEST_MAX_RETRIES, flaky_reason
        )
    )
    testcase_count._value = saved_testcase_count
Don't assign to ._value directly. Add a reset(value) method (or similar) to ThreadSafeCounter and use that here.
tests/run-tests.py (outdated):
    cmd_parser.add_argument(
        "--no-retry",
        action="store_true",
        default=False,
default=False is redundant with action="store_true".
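This is quick to confirm with argparse directly: action="store_true" already defaults the flag to False without an explicit default.

```python
import argparse

# store_true implies default=False, so passing default=False is redundant.
parser = argparse.ArgumentParser()
parser.add_argument("--no-retry", action="store_true")

print(parser.parse_args([]).no_retry)             # -> False
print(parser.parse_args(["--no-retry"]).no_retry) # -> True
```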
tests/test_utils.py (outdated):
    flaky_tests = [
        r for r in test_results if r[1] == "pass" and r[2].startswith(FLAKY_REASON_PREFIX)
    ]
    if len(flaky_tests) > 0:
Use if flaky_tests: instead of if len(flaky_tests) > 0:.
tests/test_utils.py (outdated):
    flaky_failed = [
        r for r in test_results if r[1] == "fail" and r[2].startswith(FLAKY_REASON_PREFIX)
    ]
    if len(flaky_failed) > 0:
Use if flaky_failed: instead.
.github/workflows/ports_unix.yml (outdated):
    (cd ports/unix && gcov -o build-coverage/py ../../py/*.c || true)
    (cd ports/unix && gcov -o build-coverage/extmod ../../extmod/*.c || true)
    - name: Upload coverage to Codecov
      if: ${{ secrets.CODECOV_TOKEN != '' }}
This Codecov change is unrelated to the flaky-retry feature. Should be a separate PR.
Force-pushed 5571e56 to cbc5529
Signed-off-by: Andrew Leech <andrew.leech@planet-innovation.com>
…ing. Reclassify failures of tests listed in flaky_tests_to_skip as "ignored" instead of retrying them. Ignored tests still run and their output is reported, but they don't affect the exit code. The ci.sh --exclude lists for these tests are removed so they run normally. Signed-off-by: Andrew Leech <andrew.leech@planet-innovation.com>
Force-pushed cbc5529 to 36ae4eb
Summary
The unix port CI workflow on master fails ~18% of the time due to a handful of intermittent test failures (mostly thread-related). This has normalised red CI to the point where contributors ignore it, masking real regressions.
I added a flaky-test allowlist and retry-until-pass mechanism to run-tests.py. When a listed test fails, it's re-run up to 2 more times and passes on the first success. The 8 tests on the list account for essentially all spurious CI failures on master. tools/ci.sh is updated to remove --exclude patterns for tests now handled by the retry logic, so they get actual coverage again. thread/stress_aes.py stays excluded on RISC-V QEMU, where it runs close to the 200 s timeout and retries would burn excessive CI time.

Retries are on by default; --no-retry disables them. The end-of-run summary reports tests that passed via retry separately from clean passes.

Testing
Not yet tested in CI — this draft PR is to exercise the review workflow and get CI results on the fork.
Trade-offs and Alternatives
Retry-until-pass can mask a test that has become intermittently broken. The allowlist is deliberately small and each entry documents a reason and optional platform restriction. An alternative would be to mark these tests as expected-fail, but that removes coverage entirely which is worse than the current situation.
The test result status strings ("pass", "fail", "skip") are bare string literals throughout run-tests.py — the retry code follows this existing convention. Worth considering a follow-up to consolidate these into constants for the whole file.

Generative AI
I used generative AI tools when creating this PR, but a human has checked the code and is responsible for the description above.