RDKEMW-13253- Dobby l2 fix #433
Conversation
Rebase to develop
There was a problem hiding this comment.
Pull request overview
This PR targets GitHub Actions L2 test stability by improving bundle handling and platform-specific behavior (cgroupv2/CI quirks), and by adjusting templates/CI steps to reduce environment-dependent failures.
Changes:
- Harden L2 test runner utilities (bundle extraction validation, DobbyDaemon readiness probing, container launch retries, improved async log reading).
- Make multiple L2 tests more CI-tolerant (bundle config sanitization, cgroupv1/v2 pid-limit path resolution, more robust bundle comparisons).
- Add CI workflow step to regenerate bundles for cgroupv2 compatibility and relax LCOV handling for unused coverage artifacts.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/L2_testing/test_runner/thunder_plugin.py | Sanitizes bundle config and makes test set platform-conditional; improves handling of invalid bundles. |
| tests/L2_testing/test_runner/test_utils.py | Adds bundle extraction validation, nested bundle path detection, DobbyDaemon readiness probing, and more robust container launch behavior. |
| tests/L2_testing/test_runner/start_from_bundle.py | Uses validated bundle context and waits briefly for log flushing before asserting. |
| tests/L2_testing/test_runner/runner.py | Fixes test-results aggregation to only attach per-suite JSON when tests actually ran. |
| tests/L2_testing/test_runner/pid_limit_tests.py | Resolves pids.max across cgroup v1/v2 layouts, including /proc/<pid>/cgroup-derived paths. |
| tests/L2_testing/test_runner/network_tests.py | Switches to shared launch_container and adds invalid-bundle handling. |
| tests/L2_testing/test_runner/memcr_tests.py | Hardens PID parsing and optionally skips PID checkpoint validation when PIDs aren’t reported. |
| tests/L2_testing/test_runner/bundle_generation.py | Replaces filesystem diff with normalized config.json comparison and adds rootfs existence check. |
| tests/L2_testing/test_runner/bundle/regenerate_bundles.py | New CI helper to patch and repack test bundles for cgroupv2 compatibility. |
| tests/L2_testing/test_runner/basic_sanity_tests.py | Reworks async log reading with select() and makes daemon-stop verification less log-dependent. |
| tests/L2_testing/test_runner/annotation_tests.py | Starts container from spec path (vs extracted bundle) and stops container after annotation checks. |
| client/tool/source/Main.cpp | Removes mutex lock from stop callback before completing a promise. |
| bundle/lib/source/templates/OciConfigJsonVM1.0.2-dobby.template | Removes memory swappiness field from generated OCI config templates. |
| bundle/lib/source/templates/OciConfigJson1.0.2-dobby.template | Removes memory swappiness field from generated OCI config templates. |
| .github/workflows/L2-tests.yml | Adds bundle-regeneration step and relaxes LCOV errors for unused coverage. |
Comments suppressed due to low confidence (1)
tests/L2_testing/test_runner/network_tests.py:121
- If
launch_containerfails, the test still proceeds to sleep and read from netcat, and then overwrites the earlier launch-failure message with a generic "Received ... expected ..." message. This makes failures harder to diagnose and can produce misleading output. Consider short-circuiting whenlaunch_resultis false (or preserving the original error message/log) before attempting to validate the netcat payload.
launch_result = test_utils.launch_container(container_name, bundle_path)
message = ""
result = True
if not launch_result:
message = "Container did not launch successfully"
result = False
# give container time to start and send message before checking netcat listener
sleep(2)
nc_message = nc.get_output().rstrip("\n")
# check if netcat listener received message
if test.expected_output.lower() not in nc_message.lower():
message = "Received '%s' from container, expected '%s'" % (nc_message.lower(), test.expected_output.lower())
result = False
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 10 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
f1bf099 to
298f0e7
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 8 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
818fd76 to
88e0368
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (1)
tests/L2_testing/test_runner/basic_sanity_tests.py:172
- read_asynchronous() now always returns from the select()-based loop, but the old thread-based implementation remains below and is unreachable. This dead code makes the function harder to maintain and can confuse future changes; remove the unreachable threaded fallback block.
# Use a daemon thread so the nested target function is never pickled.
# multiprocessing.Process requires pickling the target, which fails
# for nested functions on spawn-based environments (newer Python/OS).
# Threading shares the address space so no serialisation is needed.
def wait_for_string():
while True:
# notice that all data are in stderr not in stdout, this is DobbyDaemon design
output = proc.stderr.readline()
if not output:
# EOF – subprocess closed its stderr pipe
return
if string_to_find in output:
test_utils.print_log("Found string \"%s\"" % string_to_find, test_utils.Severity.debug)
return
found = False
reader = threading.Thread(target=wait_for_string, daemon=True)
test_utils.print_log("Starting async read thread", test_utils.Severity.debug)
reader.start()
reader.join(timeout)
# if thread still running
if reader.is_alive():
test_utils.print_log("Reader still exists, closing", test_utils.Severity.debug)
# daemon=True: thread is abandoned and reaped when the process exits
test_utils.print_log("Not found string \"%s\"" % string_to_find, test_utils.Severity.error)
else:
found = True
return found
| def __enter__(self): | ||
| """Returns the bundle path, or raises BundleExtractionError if extraction failed.""" | ||
| if self._error_message: | ||
| raise BundleExtractionError(self._error_message) | ||
| return self.path | ||
|
|
| mkdir -p build | ||
| cd build | ||
| cmake -DCMAKE_TOOLCHAIN_FILE="${{ env.TOOLCHAIN_FILE }}" -DRDK_PLATFORM=DEV_VM -DCMAKE_INSTALL_PREFIX:PATH=/usr ${{ matrix.extra_flags }} -DRDK=ON -DUSE_SYSTEMD=ON -DCMAKE_BUILD_TYPE=${{ matrix.build_type }} -DAI_BUILD_TYPE=AI_DEBUG -DDOBBY_HIBERNATE_MEMCR_IMPL=ON -DDOBBY_HIBERNATE_MEMCR_PARAMS_ENABLED=ON .. | ||
| cmake -DCMAKE_TOOLCHAIN_FILE="${{ env.TOOLCHAIN_FILE }}" -DRDK_PLATFORM=DEV_VM -DCMAKE_INSTALL_PREFIX:PATH=/usr ${{ matrix.extra_flags }} -DUSE_LEGACY_CGROUP=OFF -DRDK=ON -DUSE_SYSTEMD=ON -DCMAKE_BUILD_TYPE=${{ matrix.build_type }} -DAI_BUILD_TYPE=AI_DEBUG -DDOBBY_HIBERNATE_MEMCR_IMPL=ON -DDOBBY_HIBERNATE_MEMCR_PARAMS_ENABLED=ON .. |
| bundle_ctx = test_utils.untar_bundle(container_name) | ||
| with test_utils.dobby_daemon(), bundle_ctx as bundle_path: | ||
| if not bundle_ctx.valid: | ||
| for test in tests: | ||
| output = test_utils.create_simple_test_output(test, False, "Bundle extraction or validation failed", |
| bundle_ctx = test_utils.untar_bundle(container_id) | ||
| with bundle_ctx as bundle_path: | ||
| if not bundle_ctx.valid: | ||
| return False, "Bundle extraction or validation failed" | ||
|
|
| bundle_ctx = test_utils.untar_bundle(container_name) | ||
| with test_utils.dobby_daemon(), netcat_listener() as nc, bundle_ctx as bundle_path: | ||
| if not bundle_ctx.valid: | ||
| output = test_utils.create_simple_test_output(tests[0], False, "Bundle extraction or validation failed", | ||
| log_content="Bundle extraction or validation failed; container was never launched.") | ||
| output_table.append(output) | ||
| test_utils.print_single_result(output) | ||
| return test_utils.count_print_results(output_table) |
| bundle_ctx = test_utils.untar_bundle(container_name) | ||
| with bundle_ctx as bundle_path: | ||
| if not bundle_ctx.valid: | ||
| test = tests[0] | ||
| output = test_utils.create_simple_test_output( | ||
| test, False, | ||
| "Bundle extraction or validation failed", | ||
| "Bundle tarball could not be extracted or config.json was missing" | ||
| ) | ||
| test_utils.print_single_result(output) | ||
| return test_utils.count_print_results([output]) |
518144c to
88e0368
Compare
1537821 to
298f0e7
Compare
|
|
||
| # Backup original | ||
| if backup: | ||
| backup_path = bundle_tarball.with_suffix('.tar.gz.bak') |
| if (state == IDobbyProxyEvents::ContainerState::Stopped && containerId == *id) | ||
| { | ||
| AI_LOG_INFO("Container %s has stopped", containerId.c_str()); | ||
| std::lock_guard<std::mutex> locker(gLock); | ||
| promise.set_value(); | ||
| } |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
tests/L2_testing/test_runner/memcr_tests.py:252
- Same as
basic_memcr_test: this function still relies ontest_utils.untar_bundle(... )directly in thewithchain and assumesbundle_pathis always a valid string path. With the updateduntar_bundlebehavior (returningNonewhen invalid), this can crash when bundle extraction fails. Please switch to thebundle_ctx.validguarded pattern before starting the container.
return False, "Unable to start container"
# store container pids
pids = wait_for_container_pids(container_id)
skip_pid_checks = not bool(pids)
| Changes made for cgroupv2 compatibility: | ||
| - Removes 'swappiness' from memory resources (not supported in cgroupv2) | ||
| - Sets realtimeRuntime and realtimePeriod to valid values or removes them | ||
| - Updates rootfsPropagation to 'slave' for better compatibility |
| # Uses select() for a true timeout instead of threads — no lingering readers. | ||
| # Reads raw bytes via os.read() to avoid Python TextIOWrapper buffering that | ||
| # can desynchronise from select()'s kernel-level readiness checks. | ||
| def read_asynchronous(proc, string_to_find, timeout): | ||
| """Reads asynchronous from process. Ends when found string or timeout occurred. | ||
| """Reads from process stderr with a real timeout using select(). | ||
|
|
||
| Unlike a threaded approach, this cannot leak a blocked reader: select() | ||
| returns when data is available *or* when the timeout expires, so the | ||
| caller always regains control promptly. | ||
|
|
| return False, "Unable to start container" | ||
|
|
||
| # store container pids | ||
| pids = get_container_pids(container_id) | ||
| pids = wait_for_container_pids(container_id) |
Fix L2 test failures on GitHub Actions CI (cgroupv2 compatibility and GCOV coverage issues)
This PR addresses multiple cascading failures in the L2 test suite when running on GitHub Actions runners which use cgroupv2 (unified hierarchy) and have coverage instrumentation enabled.
Root Causes Fixed:
cgroupv2 Compatibility - The codebase assumed cgroupv1 (separate controller mounts). GitHub Actions uses cgroupv2 where all controllers share a single mount point. Updated DobbyEnv.cpp, IonMemoryPlugin.cpp, GpuPlugin.cpp, and DobbyTemplate.cpp to detect and handle cgroupv2. Also added runtime patching in test_utils.py to remove unsupported settings (swappiness, kernel memory, RT scheduling) from bundle configs.
GCOV Coverage Write Failures - Instrumented binaries (DobbyPluginLauncher) couldn't write .gcda files to /home/runner/work, causing hooks to exit with code 1 and breaking all container operations. Fixed by setting GCOV_PREFIX=/tmp/gcov at workflow level and passing GCOV environment to DobbyDaemon via sudo -E.
Network Setup Issues - DNS resolution and NAT not working in containers because eth0 was hardcoded but doesn't exist on GitHub runners, and systemd stub resolver doesn't work well in containers. Fixed by detecting default-route interface dynamically and forcing reliable DNS servers (1.1.1.1, 8.8.8.8).
OCI Hook Config Path - createContainer hooks run in container namespace and couldn't access host paths for config.json. Fixed by mounting config.json to /tmp/dobby_config.json inside container and adding fallback path resolution in DobbyPluginLauncher.
Invalid Interface Index in Networking - Netlink code was attempting to set addresses on interface index 0 (invalid), causing network setup failures. Added validation to reject invalid interface indices before setting addresses.
Test Procedure
Push PR to trigger GitHub Actions L2 test workflow
Verify all test groups pass:
basic_sanity_tests
container_manipulations
bundle_generation
plugin_launcher
command_line_containers
annotation_tests
start_from_bundle
thunder_plugin
network_tests
pid_limit_tests
memcr_tests
Type of Change
[Bug fix (non-breaking change which fixes an issue)
Requires Bitbake Recipe changes?
No - This PR contains only code fixes and CI workflow changes. No new dependencies, build options, or installed files are added.