tests/run-tests.py: Add automatic retry for known-flaky tests. #6

Draft
andrewleech wants to merge 2 commits into review/flaky-test-retry from flaky-test-retry

Conversation

@andrewleech
Owner

@andrewleech andrewleech commented Feb 23, 2026

Summary

The unix port CI workflow on master fails ~18% of the time due to a handful of intermittent test failures (mostly thread-related). This has normalised red CI to the point where contributors ignore it, masking real regressions.

I added a flaky-test allowlist and retry-until-pass mechanism to run-tests.py. When a listed test fails, it is re-run up to 2 more times and counted as a pass on the first successful run. The 8 tests on the list account for essentially all spurious CI failures on master.

tools/ci.sh is updated to remove --exclude patterns for tests now handled by the retry logic, so they get actual coverage again. thread/stress_aes.py stays excluded on RISC-V QEMU where it runs close to the 200s timeout and retries would burn excessive CI time.

Retries are on by default; --no-retry disables them. The end-of-run summary reports tests that passed via retry separately from clean passes.
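For reviewers skimming without the diff, the mechanism described above can be sketched roughly as follows. This is a minimal illustration using the names from this PR; the signatures and the exact allowlist contents are simplified, not the actual run-tests.py code:

```python
import sys

# Illustrative subset of the allowlist: test path -> (reason, platforms).
# platforms is None for all platforms, or a tuple of sys.platform strings.
FLAKY_TESTS = {
    "thread/thread_gc1.py": ("GC race condition", None),
    "thread/stress_heap.py": ("flaky on macOS", ("darwin",)),
}
FLAKY_TEST_MAX_RETRIES = 2  # 3 total runs including the initial one


def is_flaky(test_file):
    entry = FLAKY_TESTS.get(test_file)
    if entry is None:
        return False, ""
    reason, platforms = entry
    if platforms is not None and sys.platform not in platforms:
        return False, ""
    return True, reason


def run_with_retry(test_file, run_one_test):
    # run_one_test is a stand-in for the real test runner; it returns a
    # status string ("pass"/"fail") for one execution of test_file.
    status = run_one_test(test_file)
    flaky, reason = is_flaky(test_file)
    if status == "pass" or not flaky:
        return status, ""
    for attempt in range(1, FLAKY_TEST_MAX_RETRIES + 1):
        status = run_one_test(test_file)
        if status == "pass":
            # Passed on retry: reported separately from clean passes.
            return status, "flaky (retried {}, {})".format(attempt, reason)
    return status, "flaky (retried {}, {})".format(FLAKY_TEST_MAX_RETRIES, reason)
```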

Testing

Not yet tested in CI — this draft PR is to exercise the review workflow and get CI results on the fork.

Trade-offs and Alternatives

Retry-until-pass can mask a test that has become intermittently broken. The allowlist is deliberately small and each entry documents a reason and optional platform restriction. An alternative would be to mark these tests as expected-fail, but that removes coverage entirely which is worse than the current situation.

The test result status strings ("pass", "fail", "skip") are bare string literals throughout run-tests.py — the retry code follows this existing convention. Worth considering a follow-up to consolidate these into constants for the whole file.

Generative AI

I used generative AI tools when creating this PR, but a human has checked the code and is responsible for the description above.

@github-actions

github-actions bot commented Feb 23, 2026

Code size report:

Reference:  zephyr/mpconfigport: Remove duplicate builtins.open definition. [1ab9b66]
Comparison: tests/run-tests.py: Ignore known-flaky test failures instead of retrying. [merge of 36ae4eb]
  mpy-cross:    +0 +0.000% 
   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:    +0 +0.000% standard
      stm32:    +0 +0.000% PYBV10
      esp32:    +0 +0.000% ESP32_GENERIC
     mimxrt:    +0 +0.000% TEENSY40
        rp2:    +0 +0.000% RPI_PICO_W
       samd:    +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:    +0 +0.000% VIRT_RV32

saved_testcase_count = testcase_count.value
flaky_failures = []
for i, r in enumerate(results):
    if r[1] == "fail":
Owner Author

Is there a shared / original definition of "fail" that can be used rather than hard coded in string here?

retried_pass = False
for attempt in range(1, FLAKY_TEST_MAX_RETRIES + 1):
    print(
        "retry {} ({}/{}, {})".format(
Owner Author

Is .format the standard way of outputting strings in this project?

run_one_test(test_file)
retry_entries = results[pre_len:]
del results[pre_len:] # clean up all entries appended during retry
if len(retry_entries) == 1 and retry_entries[0][1] == "pass":
Owner Author

As per earlier comment, "pass" probably shouldn't be hard coded in string here. Also is this correct for all tests types (diff/exp/unittest)?

"{} (retried {})".format(FLAKY_REASON_PREFIX, FLAKY_TEST_MAX_RETRIES),
)
print(
"FAIL {} (retried {}, {})".format(
Owner Author

How does this FAIL compare to earlier fail strings?

Repository owner deleted a comment from mpy-reviewer bot Feb 24, 2026
Repository owner deleted a comment from mpy-reviewer bot Feb 24, 2026
@andrewleech
Copy link
Owner Author

/review

Repository owner deleted a comment from mpy-reviewer bot Feb 24, 2026
@mpy-reviewer mpy-reviewer bot left a comment

Several issues with the retry mechanism: the flaky-test registry uses a fragile string prefix for tagging results and private _value access to reset the counter, and stress_aes.py with platforms=None will triple the timeout on MIPS/ARM QEMU. Style nits: missing blank line and len(x) > 0 checks.

    "extmod/time_time_ns.py": ("CI runner clock precision", None),
}

FLAKY_TEST_MAX_RETRIES = 2 # 3 total runs including initial

Missing blank line between closing } of FLAKY_TESTS and this constant.

test_file, FLAKY_TEST_MAX_RETRIES, flaky_reason
)
)
testcase_count._value = saved_testcase_count

_value is a private attribute of ThreadSafeCounter. Use testcase_count.value = saved_testcase_count if the type supports assignment, or restructure to avoid resetting it entirely. Accessing internals like this is fragile.

# or a tuple of sys.platform strings to restrict retries to those platforms.
FLAKY_TESTS = {
    "thread/thread_gc1.py": ("GC race condition", None),
    "thread/stress_aes.py": ("timeout under QEMU emulation", None),

stress_aes.py is listed with platforms=None (all platforms) but the reason is "timeout under QEMU emulation". On MIPS/ARM QEMU it takes 70-90 s; retrying twice means up to 3x that timeout burning CI time. Restrict to the riscv64 platform, or drop from FLAKY_TESTS given ci.sh already excludes it there with --exclude thread/stress_aes.py anyway.

flaky_tests = [
    r for r in test_results if r[1] == "pass" and r[2].startswith(FLAKY_REASON_PREFIX)
]
if len(flaky_tests) > 0:

Use if flaky_tests: not if len(flaky_tests) > 0:.

flaky_failed = [
    r for r in test_results if r[1] == "fail" and r[2].startswith(FLAKY_REASON_PREFIX)
]
if len(flaky_failed) > 0:

Same: if flaky_failed: not if len(flaky_failed) > 0:.

import pyboard

# Prefix used by run-tests.py to tag flaky-retry results.
FLAKY_REASON_PREFIX = "flaky"

Using a string prefix to tag result tuples is fragile -- a test whose reason string happens to start with "flaky" would be misidentified. Better to use a distinct status value such as "pass-retry" or extend the result tuple with a dedicated field.

return False, ""
return True, reason

results = test_results.value # direct ref to internal list

Worth verifying this is safe. If run_one_test appends via the ThreadSafeCounter wrapper method (which takes the lock), and here we mutate the raw list directly (bypassing the lock), the del results[pre_len:] could race with a concurrent appender. The comment claims this block runs after all parallel work is done; confirm that is always true, especially for the single-job (sequential) path.
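To illustrate the race being flagged here, a lock-safe variant would take the same lock for truncation that the appenders take, along these lines (hypothetical helper names, not code from this PR):

```python
import threading

results = []  # shared result list, normally appended to by worker threads
results_lock = threading.Lock()


def append_result(entry):
    # Stand-in for the locked append the thread-safe wrapper performs.
    with results_lock:
        results.append(entry)


def discard_retry_entries(pre_len):
    # Truncate under the same lock, so a concurrent append cannot
    # interleave between the slice copy and the del.
    with results_lock:
        retry_entries = results[pre_len:]
        del results[pre_len:]
    return retry_entries
```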

@andrewleech
Owner Author

/review

@andrewleech
Owner Author

/review

@mpy-reviewer

mpy-reviewer bot commented Feb 24, 2026

Review failed. Retry with /review.

@andrewleech
Owner Author

/review

@andrewleech
Owner Author

/review

@mpy-reviewer mpy-reviewer bot left a comment

tests/run-tests.py, FLAKY_TESTS "thread/stress_aes.py" entry (suggestion)

platforms=None means retries apply everywhere including native unix/macOS, but the documented reason is "timeout under QEMU emulation". On native platforms this test isn't slow; if it fails for a real reason it'll be silently retried. Either restrict to QEMU platform identifiers or remove it from the allowlist — it's not actually flaky on MIPS/ARM, just slow.


tests/run-tests.py, after FLAKY_TEST_MAX_RETRIES = 2 (nitpick)

Missing blank line between the new constant and the existing # Tests require at least CPython 3.3 comment block below it.


tests/run-tests.py, testcase_count._value = saved_testcase_count (suggestion)

Don't write to ._value directly — that bypasses the ThreadSafeCounter abstraction. Better options:

  • Track the count delta added during retries and subtract it via testcase_count.add(-delta).
  • Add a public set() or reset() method to ThreadSafeCounter.
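The second option might look like this minimal sketch (an illustrative ThreadSafeCounter, not the one in run-tests.py):

```python
import threading


class ThreadSafeCounter:
    def __init__(self, start=0):
        self._lock = threading.Lock()
        self._value = start

    def add(self, delta):
        with self._lock:
            self._value += delta

    def set(self, value):
        # Public reset point, so callers never reach into _value directly.
        with self._lock:
            self._value = value

    @property
    def value(self):
        with self._lock:
            return self._value
```

With that, the retry cleanup becomes testcase_count.set(saved_testcase_count).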

tests/run-tests.py, --no-retry argument (nitpick)

default=False is redundant for action="store_true". Can remove.


tests/test_utils.py, flaky summary blocks (nitpick)

Change if len(flaky_tests) > 0: to if flaky_tests: (same for flaky_failed).

@andrewleech
Owner Author

/review

@mpy-reviewer mpy-reviewer bot left a comment

The retry approach is reasonable for reducing spurious CI failures. A few issues to address:

tests/run-tests.py, line R1113 — testcase_count._value = saved_testcase_count
Accessing ._value directly bypasses ThreadSafeCounter's public interface and any associated locking. If testcase_count.value is a read-only property, add a set() or reset() method to ThreadSafeCounter and use that here.

tests/run-tests.py, line R51 — "thread/stress_aes.py": ("timeout under QEMU emulation", None)
Platform is None (all platforms), but this test runs fine on macOS/native and is only slow on QEMU targets. On MIPS (~90s) and ARM (~70s), a retry on timeout failure burns up to 270s/210s of CI time. Either restrict to QEMU platforms or remove from FLAKY_TESTS entirely and rely on the existing --exclude in the rare case it times out.

tests/test_utils.py, lines R330 and R341 — r[2].startswith(FLAKY_REASON_PREFIX)
Detecting flaky results via startswith on the third tuple element is fragile — the result tuple has no documented schema for r[2]. If a non-flaky test's output string happens to start with "flaky", it'll be miscategorised. Consider using a 4-element tuple (test_file, status, reason, is_flaky_retry) or a named type instead of co-opting the reason string.

tests/run-tests.py, line R1064 — def _is_flaky(test_file):
No reason to define this inside the if not args.no_retry: block; it can be a module-level function alongside FLAKY_TESTS.

tests/test_utils.py, line R332 — if len(flaky_tests) > 0:
Please use if flaky_tests: (likewise if flaky_failed: on R343).

tests/run-tests.py, line R1237 — default=False
Redundant with action="store_true"; remove it.

@andrewleech
Owner Author

/review

@mpy-reviewer mpy-reviewer bot left a comment

Two bugs in FLAKY_TESTS: stress_recurse.py is listed twice so the first entry is silently dropped, and testcase_count._value is set directly via a private attribute that bypasses the ThreadSafeCounter abstraction. Several entries have platform-specific failure reasons but platforms=None, which will trigger unnecessary retries on native runners.

    "thread/thread_gc1.py": ("GC race condition", None),
    "thread/stress_aes.py": ("timeout under QEMU emulation", None),
    "thread/stress_schedule.py": ("intermittent crash under QEMU", None),
    "thread/stress_recurse.py": ("stack overflow under emulation", None),

Duplicate key — "thread/stress_recurse.py" already appears on the line above, so this entry silently overwrites the first ("intermittent crash under QEMU" is dropped). Merge into one entry.

# or a tuple of sys.platform strings to restrict retries to those platforms.
FLAKY_TESTS = {
    "thread/thread_gc1.py": ("GC race condition", None),
    "thread/stress_aes.py": ("timeout under QEMU emulation", None),

Reason says "timeout under QEMU emulation" but platforms=None means this test will also be retried on native macOS/Linux runners where it doesn't time out. Restrict to QEMU platforms or broaden the reason.

    "thread/stress_schedule.py": ("intermittent crash under QEMU", None),
    "thread/stress_recurse.py": ("stack overflow under emulation", None),
    "thread/stress_heap.py": ("flaky on macOS", ("darwin",)),
    "cmdline/repl_lock.py": ("REPL timing under QEMU", None),

Same issue: "REPL timing under QEMU" but platforms=None. Should restrict to QEMU platforms.

test_file, FLAKY_TEST_MAX_RETRIES, flaky_reason
)
)
testcase_count._value = saved_testcase_count

Don't assign to ._value directly. Add a reset(value) method (or similar) to ThreadSafeCounter and use that here.

cmd_parser.add_argument(
    "--no-retry",
    action="store_true",
    default=False,

default=False is redundant with action="store_true".

flaky_tests = [
    r for r in test_results if r[1] == "pass" and r[2].startswith(FLAKY_REASON_PREFIX)
]
if len(flaky_tests) > 0:

Use if flaky_tests: instead of if len(flaky_tests) > 0:.

flaky_failed = [
    r for r in test_results if r[1] == "fail" and r[2].startswith(FLAKY_REASON_PREFIX)
]
if len(flaky_failed) > 0:

Use if flaky_failed: instead.

(cd ports/unix && gcov -o build-coverage/py ../../py/*.c || true)
(cd ports/unix && gcov -o build-coverage/extmod ../../extmod/*.c || true)
      - name: Upload coverage to Codecov
        if: ${{ secrets.CODECOV_TOKEN != '' }}

This Codecov change is unrelated to the flaky-retry feature. Should be a separate PR.

Signed-off-by: Andrew Leech <andrew.leech@planet-innovation.com>
…ing.

Reclassify failures of tests listed in flaky_tests_to_skip as "ignored"
instead of retrying them. Ignored tests still run and their output is
reported, but they don't affect the exit code. The ci.sh --exclude lists
for these tests are removed so they run normally.

Signed-off-by: Andrew Leech <andrew.leech@planet-innovation.com>
