lucidsoftware/bazel_virtual_thread_repro

--experimental_async_execution can cause linux-sandbox actions to fail erroneously

This is a minimal repro case demonstrating that --experimental_async_execution causes linux-sandbox actions that take longer than 30 seconds to be killed with SIGKILL.

Root cause

With --experimental_async_execution enabled, Bazel runs actions on virtual threads. When a virtual thread spawns a subprocess, the fork happens on its carrier thread, so the carrier thread is the subprocess's parent. While the virtual thread waits for the subprocess to complete, it is unmounted and the carrier thread is released. If there is no other work for the carrier thread, it sits idle and can be terminated due to inactivity: the default virtual thread scheduler in JDK 25 uses a ForkJoinPool with a 30 second TTL for idle threads.

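When the carrier thread that forked the linux-sandbox is terminated, the sandbox dies with it, because the sandbox self-destructs when its parent dies via prctl(PR_SET_PDEATHSIG, SIGKILL). For this purpose the kernel treats the thread that created the process as the parent, not the JVM process as a whole.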

These two things combined can cause Bazel actions to fail erroneously when they use the linux-sandbox and take longer than 30 seconds.
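
The parent-death signal can be observed in isolation with plain processes. The following is a minimal sketch, not the Bazel or JVM code path: it assumes Linux with setpriv from util-linux (which provides --pdeathsig), and the function name run_child and the marker file are hypothetical. A shell cannot reproduce the thread-level detail, so the "parent" here is a short-lived process rather than a carrier thread.

```sh
#!/bin/sh
# Sketch: show the effect of a parent-death signal using plain processes.
# Assumption: Linux with setpriv from util-linux available on PATH.

# run_child MODE
#   MODE=pdeathsig : child asks the kernel for SIGKILL when its parent exits
#   MODE=plain     : child runs without a parent-death signal
# The parent exits after 1 s; the child needs 2 s to write its marker file.
run_child() {
  mode=$1
  if [ "$mode" = pdeathsig ] && ! command -v setpriv >/dev/null 2>&1; then
    echo "$mode: setpriv not available"
    return 1
  fi
  dir=$(mktemp -d)              # scratch directory for the marker file
  marker=$dir/done
  child='sleep 2; echo done > "$MARKER"'
  if [ "$mode" = pdeathsig ]; then
    parent='setpriv --pdeathsig KILL sh -c "$CHILD" >/dev/null 2>&1 & sleep 1'
  else
    parent='sh -c "$CHILD" >/dev/null 2>&1 & sleep 1'
  fi
  # The parent shell starts the child in the background and exits after 1 s.
  CHILD=$child MARKER=$marker sh -c "$parent"
  sleep 2                       # give a surviving child time to finish
  if [ -e "$marker" ]; then
    echo "$mode: child finished"
  else
    echo "$mode: child was killed"
  fi
  rm -rf "$dir"
}
```

Calling run_child plain and then run_child pdeathsig should report that the plain child finished and that the child with the parent-death signal was killed before writing its marker. This mirrors the repro's genrule, which is killed before it can write output.txt.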

Prerequisites

This bug only occurs when linux-sandbox is the spawn strategy, so the prerequisites for linux-sandbox must be met. If you try to repro the bug by running bazel build //:slow_action and get an error like this:

ERROR: 'linux-sandbox' was requested for explicit default strategies but no strategy with that identifier was registered. Valid values are: [dynamic_worker, processwrapper-sandbox, standalone, dynamic, remote, worker, sandboxed, local]

Then you likely need to temporarily disable the AppArmor restriction that prevents unprivileged user namespaces:

# Check if unprivileged user namespaces are restricted
cat /proc/sys/kernel/apparmor_restrict_unprivileged_userns
# If 1, temporarily disable:
sudo sysctl -w kernel.apparmor_restrict_unprivileged_userns=0
# Restart the bazel server afterwards, so it picks up on the change
bazel shutdown
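
On kernels that do not expose this sysctl, the cat above fails with "No such file or directory", which suggests this particular restriction is not what is blocking linux-sandbox. A small sketch that distinguishes the three cases follows. The function name userns_status is hypothetical, and the optional file argument exists only so the function can be tried against a stand-in file.

```sh
#!/bin/sh
# userns_status [FILE]: classify the AppArmor user-namespace restriction.
# FILE defaults to the sysctl path checked above (hypothetical helper).
userns_status() {
  file=${1:-/proc/sys/kernel/apparmor_restrict_unprivileged_userns}
  if [ ! -r "$file" ]; then
    echo "absent"          # sysctl not present or not readable on this system
  elif [ "$(cat "$file")" = "1" ]; then
    echo "restricted"      # unprivileged user namespaces are restricted
  else
    echo "unrestricted"
  fi
}
```

If userns_status prints restricted, apply the sysctl change above and restart the Bazel server. If it prints absent or unrestricted, look for another reason that linux-sandbox is unavailable.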

How to repro

  1. Build the target:

    bazel build //:slow_action

    The build fails after approximately 30 seconds, and the error reports the action as (Killed):

    ERROR: /home/<redacted>/opensource/bazel_virtual_thread_repro/BUILD.bazel:1:8: Executing genrule //:slow_action failed: (Killed): bash failed: error executing Genrule command (from target //:slow_action) /bin/bash -c 'source external/bazel_tools/tools/genrule/genrule-setup.sh; sleep 45 && echo done > bazel-out/k8-fastbuild/bin/output.txt'
    
    Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
    Target //:slow_action failed to build
    Use --verbose_failures to see the command lines of failed build steps.
    INFO: Elapsed time: 32.774s, Critical Path: 30.03s
    INFO: 2 processes: 2 internal.
    ERROR: Build did NOT complete successfully
    
  2. Verify it passes without async execution. It will take approximately 45 seconds.

    bazel build //:slow_action --noexperimental_async_execution

  3. You can also verify it passes when processwrapper-sandbox is used while --experimental_async_execution is still enabled. If step 2 built the action successfully, run bazel clean first so the action is executed again.

    bazel build //:slow_action --spawn_strategy=processwrapper-sandbox
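
The three runs above can be compared side by side with a small timing wrapper. This is a sketch only: timed_run is a hypothetical helper, and it assumes, as step 1 implies, that the repository's own configuration already enables --experimental_async_execution and linux-sandbox for a plain bazel build.

```sh
#!/bin/sh
# timed_run LABEL COMMAND...: run COMMAND, then print its exit status and
# wall-clock duration in whole seconds (hypothetical helper).
timed_run() {
  label=$1; shift
  start=$(date +%s)
  "$@" >/dev/null 2>&1
  status=$?
  end=$(date +%s)
  echo "$label: exit=$status elapsed=$((end - start))s"
}

# Intended use against this repro (commented out so the sketch has no side effects):
#   timed_run async      bazel build //:slow_action
#   bazel clean
#   timed_run no-async   bazel build //:slow_action --noexperimental_async_execution
#   bazel clean
#   timed_run pw-sandbox bazel build //:slow_action --spawn_strategy=processwrapper-sandbox
```

Based on the steps above, the async run is expected to exit non-zero after roughly 30 seconds, while the other two should exit 0 after roughly 45 seconds.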
