
Simplify fine grain locking to lock before job execution#16657

Open
ranger-ross wants to merge 2 commits intorust-lang:masterfrom
ranger-ross:suspendable-tasks

Conversation

Member

@ranger-ross ranger-ross commented Feb 20, 2026

What does this PR try to resolve?

Tracking issue: #4282

This PR simplifies the locking design by taking the exclusive lock BEFORE executing the jobs.
This means that ALL exclusive locks are taken before executing the job queue.

The rationale for why this is better:

  1. We are optimizing for the "development iteration" workflow where small frequent changes are being made with a cargo build between changes.
    • This means that builds are "warm" and we are generally only recompiling a small subset of the dependency tree.
    • In most cases the small subset of units that need to be recompiled will NOT be shared between cargo build and cargo check.
    • Since units are not shared, both cargo build and cargo check will be able to take all of their locks up front and still be able to run in parallel.
  2. This approach simplifies the locking design
    • No communication is needed with in-flight jobs
    • No need to worry about a blocked job preventing other jobs from executing
    • Blocking messages are easier to implement

Drawbacks

  • Clean build/checks will get less parallelism as the second build will block until the last shared unit (often a build script or proc macro) is built
  • The build/check blocking problem may still be an issue when making changes to a proc-macro crate or build script itself
Original Description

What does this PR try to resolve?

This PR fixes a bug in -Zfine-grain-locking that prevented other jobs from running while a job was waiting on a lock.
Cargo's job queue has never supported suspending a job that is already executing.
This PR adds support for suspending and resuming jobs being executed.

This also paves the way to add better blocking messages to the user (originally attempted in #16463)

Tracking issue: #4282

Flow

  • Before taking an exclusive lock (and potentially blocking) we do a .try_lock() to see if we will block.
    • If we are going to block, we send a Message::Blocked letting the job queue reclaim the token for another job to run while we are blocked
    • If try_lock succeeds, we have the lock and never blocked, so we simply start compiling.
  • Once we acquire the lock, we send a Message::Unblocked letting the job queue know we are ready to continue
  • But before compiling, to respect the --jobs limit, we wait for the job queue to reschedule us by calling JobState.resume.recv() which blocks the current thread.
  • The job queue will eventually call ActiveJob.resume.send() which will let the job resume execution.
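The handshake above can be sketched with an in-process channel standing in for Cargo's machinery. This is a sketch only: the `Message` variants, `try_lock`/`lock` closures, and channel wiring here are illustrative stand-ins, not Cargo's real types.

```rust
use std::sync::mpsc;

// Illustrative stand-in for the job-queue messages described above.
enum Message {
    Blocked,
    Unblocked,
}

// Sketch of the suspend/resume handshake: `try_lock` and `lock` stand in
// for the file-lock operations; `resume` plays the role of the per-job
// resume channel.
fn run_job(
    to_queue: mpsc::Sender<Message>,
    resume: mpsc::Receiver<()>,
    try_lock: impl Fn() -> bool,
    lock: impl Fn(),
) {
    if !try_lock() {
        // We would block: hand our token back to the job queue.
        to_queue.send(Message::Blocked).unwrap();
        lock(); // blocks until the exclusive lock is acquired
        to_queue.send(Message::Unblocked).unwrap();
        // Respect --jobs: wait for the queue to reschedule us.
        resume.recv().unwrap();
    }
    // ... compile ...
}
```

Note the fast path: if `try_lock` succeeds, no messages are exchanged and the job compiles immediately.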

Design

  • We leverage the existing Queue<Message> in JobState to notify the job queue that we are going to block.
  • For resuming tasks that are blocked, I used a std::sync::mpsc::channel.
flowchart LR
    JQ["Job Queue (DrainState)"]
    J["Jobs"]

    J -- "Queue<Message>" --> JQ
    JQ -- "resume mpsc::channel" --> J

How to test and review this PR?

To test this, I found the easiest way was to do it synthetically by injecting a std::thread::sleep in the lock manager for a given lock, along with passing --jobs 1 to make a blocked job block the entire build.

pub fn lock(&self, key: &LockKey) -> CargoResult<()> {
    let locks = self.locks.read().unwrap();
    if key.0.to_str().unwrap().contains("libc") {
        std::thread::sleep(std::time::Duration::from_secs(5));
    }
    // ....
}

Test script

cargo new foo; cd foo; cargo add tokio --features full

rm -rf target build
alias lc=/path/to/cargo
CARGO_BUILD_BUILD_DIR=build lc -Zbuild-dir-new-layout -Zfine-grain-locking build --jobs 1

Doing this on master will result in a blocked build that never completes, while the changes in this PR will finish the build.

r? @epage

@rustbot rustbot added A-build-execution Area: anything dealing with executing the compiler S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 20, 2026
Contributor

Could you provide the motivation for this commit and in particular why each lock type was chosen with the confines of this commit?

Member Author

Sorry, I should have made that clear in the commit message.

The issue I am trying to solve is avoiding holding the MutexGuard while waiting for the file lock.
This is an issue since we can only take a single lock at a time. (I am a bit surprised I overlooked this previously)

For the lock manager, we only insert into the hashmap during the single-threaded prepare-fingerprint phase, so we take RwLock::write(). During the multithreaded Job section, we take RwLock::read() on the hashmap.
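A minimal sketch of that RwLock discipline, using a simplified stand-in for the lock manager (the `LockManager` struct and method names here are illustrative, not Cargo's actual types):

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Simplified stand-in for the lock manager described above: the map is
// only mutated under `write()` during the single-threaded prepare phase;
// the multithreaded job phase only ever takes `read()`.
struct LockManager {
    locks: RwLock<HashMap<String, ()>>,
}

impl LockManager {
    // Single-threaded prepare-fingerprint phase: exclusive access to
    // populate the map.
    fn register(&self, key: &str) {
        self.locks.write().unwrap().insert(key.to_string(), ());
    }

    // Multithreaded job phase: shared access, so jobs do not serialize
    // on the map itself while waiting on individual file locks.
    fn is_registered(&self, key: &str) -> bool {
        self.locks.read().unwrap().contains_key(key)
    }
}
```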

Contributor

Sounds like this is a fix independent of the rest of the PR?

Can we split it out and update the title to focus on the user impact?

Contributor

I worry about the maintainability of having side band communication going on.

Is it possible for us to manage this within the existing flow? For example, could we use Poll on Job::run to return early if we couldn't acquire the lock?

Member Author

Hmmm, I think it might be tricky to use Poll with the generic structure of Job. (like how to resume the job, since internally it's a closure)

Though I am happy to look into that to see what is possible.

Contributor

Wait, if we know a priori what jobs will actually build (see #16659 (comment)), then why can't we do all of the lock processing before the build, grabbing our exclusive locks and blocking then, rather than waiting?

Member Author

Oh, that's an interesting idea.
So basically the DrainState could try to lock before calling DrainState::run (running the job).

This loop is single-threaded, so it will change the way we take the locks.
I am imagining spawning a thread to take each lock to avoid blocking the main job queue loop.
I think this could be encapsulated in the LockManager. (and use try_lock() as an optimization to avoid spawning a new thread)
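That idea could be sketched like this, with an in-process `Mutex` standing in for the cross-process file lock and `acquire_async` as a hypothetical helper (it only models the notification flow; a real implementation would also hand the acquired guard back to the caller):

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Hypothetical helper: try_lock as a fast path; otherwise spawn a thread
// so the single-threaded job-queue loop is never blocked while waiting.
// `done` notifies the queue loop once the lock could be acquired.
fn acquire_async(lock: Arc<Mutex<()>>, done: mpsc::Sender<()>) {
    if lock.try_lock().is_ok() {
        // Fast path: the lock was free, no thread needed.
        done.send(()).unwrap();
    } else {
        thread::spawn(move || {
            // Block on this worker thread, not the main queue loop.
            let _guard = lock.lock().unwrap();
            done.send(()).unwrap();
        });
    }
}
```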

Contributor

I'm saying that we acquire the locks in compile and downgrade after build.

Contributor

That's super::compile, which currently coordinates locks.

How much does grabbing as we go benefit in practice, when the build we block on would hold a shared lock until it's done, blocking our job and all dependents anyway? We can build some jobs that don't overlap, so a little?

Member Author

We can build some jobs that don't overlap, so a little?

The benefit we get is when there are build units that do not overlap.
From my perspective, this is the main benefit of fine-grain locking over what we currently have.
If we were to wait for everything, I feel like the benefits would be limited to a few more niche scenarios?

Contributor

The goal we are working towards is non-blocking. If there is overlap, then we are blocking.

So the question then is does the benefit from some overlap justify a more complex architecture (both for this and blocking messages).

Member Author

@ranger-ross ranger-ross Feb 23, 2026

I did some thinking on this and I think we might be able to get away with simply locking before kicking off jobs.

The primary use case for fine-grain locking is to avoid rust-analyzer's cargo check from blocking the user's cargo build, which slows down iteration during development.

If the user is making small incremental changes to their project, full rebuilds should be uncommon. We will likely take shared locks on all of the dependencies, so we don't need an exclusive lock for those; we just need to take an exclusive lock on the things that changed (dirty units). Build scripts are still a problem with this, but I think those will not be dirty unless the user is editing the build script itself.

I think the big question of whether this is worth the extra complexity of suspending jobs is whether we care about build script units (or any other units that are shared between build and check) blocking.

Member Author

I made the changes I mentioned above and I believe they are working correctly. I was able to run a build and check in parallel, and I couldn't make it deadlock. (though much more testing is needed before stabilizing, of course)

Thanks for the reviews and helping guide me in this direction!
I think this approach is much better in terms of simplicity while still accomplishing the goals of fine-grain locking.

I've updated the PR description and title to better reflect the changes (and minimized the original PR description)

@ranger-ross ranger-ross marked this pull request as draft February 21, 2026 09:05
@rustbot rustbot removed the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 21, 2026
github-merge-queue bot pushed a commit that referenced this pull request Feb 21, 2026
### What does this PR try to resolve?

Split out of #16657

Currently we are holding a `MutexGuard` while waiting for the file lock,
preventing multiple locks from being taken at once.
This means that a single blocking unit can prevent other units from
running.

For the lock manager, we only insert into hashmap during the single
threaded prepare fingerprint so we take `RwLock::write()`. During the
multithreaded `Job` section, we take `RwLock::read()` to the hashmap.

### How to test and review this PR?

See the "How to test and review this PR" section of
#16657

However, unlike that PR, `--jobs 1` will not work with only the changes
in this PR.
We would need to use `--jobs 2` and make sure that the first build unit
blocks, causing the second one to block. Though I think this is a bit
difficult to set up a test for without the changes in
#16657

r? @epage
@ranger-ross ranger-ross force-pushed the suspendable-tasks branch 2 times, most recently from d56a024 to 277bcc9 Compare February 25, 2026 14:36
@ranger-ross ranger-ross changed the title Add support for suspendable jobs Simplify fine grain locking to lock before job execution Feb 25, 2026
@ranger-ross ranger-ross marked this pull request as ready for review February 25, 2026 14:58
@rustbot rustbot added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 25, 2026
Comment on lines +245 to 254
// We take an exclusive lock before scheduling the job to keep our locking model
// simple. While this does mean we could potentially be waiting for another job
// when this could begin immediately, it reduces the risk of deadlock.
//
// We are also optimizing for avoiding rust-analyzer's `cargo check` preventing
// the user from running `cargo build` while developing. Generally, only a few
// workspace units will be changing in each build and the units will not be
// shared between `build` and `check` allowing them to run in parallel.
build_runner.lock_manager.lock(&lock)?;
job.after(downgrade_lock_to_shared(lock));
Contributor

as well as couldn't make it deadlock. (though much more testing is needed before stabilizing of course)

The deadlock concern is if you have two cargo checks that acquire the shared lock before either acquires the exclusive lock. At that point, neither of them can. The window for this is short which makes it tricky to observe through one off testing but likely to be seen in the wild, especially with rust-analyzer running in the background.
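The hazard can be shown deterministically with an in-process `RwLock` standing in for the file lock. This is a sketch of the scenario, not the real cross-process behavior; `try_read`/`try_write` are used so the example terminates instead of hanging.

```rust
use std::sync::RwLock;

// Models the scenario above: two `cargo check`s both hold the shared
// lock, so neither can ever obtain the exclusive lock.
fn can_either_upgrade(lock: &RwLock<()>) -> bool {
    let _check_a = lock.read().unwrap();      // first process: shared lock
    let _check_b = lock.try_read().unwrap();  // second process: shared lock
    // A blocking write() here would wait forever; try_write shows the
    // exclusive lock can never be granted while both readers exist.
    lock.try_write().is_ok()
}
```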

Member Author

@ranger-ross ranger-ross Feb 25, 2026

What if we had another profile-level lock that we take (exclusively) during the fingerprint checking and while we take the unit-level locks, then drop before executing compiler jobs.

That would ensure that only a single Cargo instance is in the critical section where deadlocking is possible. And like you mention, this part of Cargo is generally quite fast, so I think this should be okay.

If we were to do this, I suppose we could probably reuse the existing .cargo-lock. I can't think of any obvious reasons why this wouldn't work.

What do you think?

Contributor

I think that should be safe. The one downside I can think of is that if we block when getting a unit lock, we now block all other check builds, even where there isn't overlap. Most likely there would be, so maybe it's fine for the MVP?

Member Author

I implemented this in 8d87975

I initially tried to use .cargo-lock but realized that we can't unlock it until the build is done as cargo clean relies on it to avoid cleaning while a build is in progress.
So instead, I created a new .acquisition-lock (not married to the name, we can change if preferred) that we use to coordinate lock acquisition between processes.
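A sketch of that coordination pattern, with in-process mutexes standing in for the cross-process file locks (`acquisition` plays the role of `.acquisition-lock`; the function and names are illustrative, not Cargo's actual code):

```rust
use std::sync::{Mutex, MutexGuard};

// Sketch of the acquisition-lock pattern: unit locks are only taken while
// an exclusive "acquisition" lock is held, so at most one process is ever
// in the deadlock-prone section. The acquisition lock is released before
// compilation, while the unit locks stay held.
fn acquire_unit_locks<'a>(
    acquisition: &Mutex<()>,
    units: &'a [Mutex<()>],
) -> Vec<MutexGuard<'a, ()>> {
    let _transaction = acquisition.lock().unwrap();
    // Take every unit lock inside the critical section.
    units.iter().map(|u| u.lock().unwrap()).collect()
    // `_transaction` drops here: other processes may now start acquiring,
    // while we keep the unit locks for the duration of the build.
}
```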

@rustbot rustbot added the A-layout Area: target output directory layout, naming, and organization label Feb 28, 2026
Member Author

ranger-ross commented Feb 28, 2026

yikes, ci failed on windows and it was the build_dir.rs tests, which have the -Zfine-grain-locking flag enabled.

They seemingly deadlocked but I am not quite sure how that is possible 😅
I'll look into this

…king)

This commit adds the concept of an acquisition lock that is held
exclusively during the fingerprint checking and job preparation.
We release this lock prior to executing jobs in the queue to allow other
cargos to proceed. This lock prevents deadlocks by only allowing a
single Cargo to lock the build unit locks at a time effectively making
this operation a "transaction" guarded by the acquisition lock.

Note that we added a new `.acquisition-lock` instead of reusing `.cargo-lock`.
This is needed as we drop the `.acquisition-lock` during the compilation
phase, however `cargo clean` relies on `.cargo-lock` to avoid cleaning
during a build. If we were to change the behavior of `.cargo-lock` to be
unlocked during the compilation phase, it would open us up to having
`cargo clean` potentially corrupt an in progress build.
Comment on lines +254 to +259
// Also note that we unlock before taking the exclusive lock as not all
// platforms support lock upgrading. This is safe as we hold the acquisition
// lock so we should be the only one operating on the locks so we can assume
// that the lock will not be stolen by another instance.
build_runner.lock_manager.unlock(&lock)?;
job.before(prebuild_lock_exclusive(lock.clone()));
build_runner.lock_manager.lock(&lock)?;
Member Author

Re: #16657 (comment)

I found the issue on Windows. There is different behavior between Linux and Windows when taking an exclusive (EX) lock on a file that is already locked with a shared (SH) lock: Linux allows the lock upgrade while Windows blocks.

I re-added the unlock() before the lock() so that we always have a fresh lock. We should be safe from deadlocking since we also hold the acquisition lock.

Collaborator

rustbot commented Mar 6, 2026

☔ The latest upstream changes (possibly #16708) made this pull request unmergeable. Please resolve the merge conflicts.
