
Attempt to optimize LocalExecutor#144

Open
james7132 wants to merge 23 commits into smol-rs:master from james7132:optimized-local-executor

Conversation

@james7132
Contributor

@james7132 james7132 commented Jul 24, 2025

This PR makes LocalExecutor its own separate implementation instead of wrapping an Executor, forking its own State type and utilizing unsynchronized !Send types where possible: Arc -> Rc, Mutex -> RefCell, AtomicPtr -> Cell<*mut T>, ConcurrentQueue -> VecDeque. This implementation also removes any extra operations that assume there are other concurrent Runners/Tickers (i.e. local queues, extra notifications).

For testing, I've duplicated most of the single-threaded compatible executor tests to ensure there's equivalent coverage on the new independent LocalExecutor.

I've also made an attempt to write this with UnsafeCell instead of RefCell, but it litters the code with a huge amount of unsafe that might be too much for this crate. The gains there might not be worth it.

I previously wrote some additional benchmarks but lost the changes to a stray `git reset --hard` while benchmarking.
The gains here are substantial. Here are the results:

single_thread/local_executor::spawn_one
                        time:   [130.05 ns 130.98 ns 132.16 ns]
                        change: [-80.586% -80.413% -80.214%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe
single_thread/local_executor::spawn_batch
                        time:   [17.195 µs 22.167 µs 32.281 µs]
                        change: [-30.007% +0.4082% +41.137%] (p = 0.98 > 0.05)
                        No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
  7 (7.00%) high mild
  10 (10.00%) high severe
Benchmarking single_thread/local_executor::spawn_many_local: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 856.5s, or reduce sample count to 10.
single_thread/local_executor::spawn_many_local
                        time:   [2.7766 ms 2.8091 ms 2.8453 ms]
                        change: [-21.653% -8.5454% +6.4249%] (p = 0.30 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
single_thread/local_executor::spawn_recursively
                        time:   [19.618 ms 20.071 ms 20.558 ms]
                        change: [-24.237% -22.461% -20.762%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild
Benchmarking single_thread/local_executor::yield_now: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.7s, enable flat sampling, or reduce sample count to 50.
single_thread/local_executor::yield_now
                        time:   [1.8382 ms 1.8421 ms 1.8470 ms]
                        change: [-52.888% -52.744% -52.570%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) high mild
  9 (9.00%) high severe
single_thread/local_executor::channels
                        time:   [9.9259 ms 9.9355 ms 9.9452 ms]
                        change: [-15.797% -15.656% -15.533%] (p = 0.00 < 0.05)
                        Performance has improved.
single_thread/local_executor::web_server
                        time:   [54.839 µs 56.776 µs 59.062 µs]
                        change: [-23.926% -18.725% -13.009%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

@james7132
Contributor Author

Hmmm, VecDeque::new is not const until Rust 1.68. It could be trivially replaced with LinkedList, which has had a const constructor since 1.63, but that would be a major performance regression.

@taiki-e
Collaborator

taiki-e commented Jul 25, 2025

There is no need to be worried about the MSRV of const VecDeque::new, because we will be able to raise the MSRV after about two weeks: smol-rs/smol#244 (comment)

@james7132
Contributor Author

OK, did a more comprehensive benchmark run and it's pretty clear that the UnsafeCell version is tangibly faster:

group                                                     local-executor-opt                      local-executor-opt-w-unsafe-cell        master
-----                                                     ------------------                      --------------------------------        ------
single_thread/local_executor::channels                    1.18     12.2±0.69ms        ? ?/sec     1.00     10.3±0.08ms        ? ?/sec     1.41     14.6±0.93ms        ? ?/sec
single_thread/local_executor::spawn_many_local            1.11      2.7±0.20ms        ? ?/sec     1.00      2.4±0.15ms        ? ?/sec     1.25      3.0±0.20ms        ? ?/sec
single_thread/local_executor::spawn_one                   1.15    155.7±7.89ns        ? ?/sec     1.00   135.0±14.68ns        ? ?/sec     6.08   821.0±86.13ns        ? ?/sec
single_thread/local_executor::spawn_recursively           1.01     27.4±2.17ms        ? ?/sec     1.00     27.1±2.79ms        ? ?/sec     1.17     31.7±2.56ms        ? ?/sec
single_thread/local_executor::web_server                  1.25     20.2±1.63ms        ? ?/sec     1.00     16.2±0.10ms        ? ?/sec     1.60     25.9±1.69ms        ? ?/sec
single_thread/local_executor::yield_now                   1.18      2.3±0.12ms        ? ?/sec     1.00  1981.4±166.43µs        ? ?/sec    2.35      4.7±0.21ms        ? ?/sec
single_thread/static_local_executor::channels             1.38     13.8±1.80ms        ? ?/sec     1.00     10.0±0.13ms        ? ?/sec     1.63     16.3±1.70ms        ? ?/sec
single_thread/static_local_executor::spawn_many_local     1.20  1770.7±119.07µs        ? ?/sec    1.00  1474.5±28.53µs        ? ?/sec     1.70      2.5±0.11ms        ? ?/sec
single_thread/static_local_executor::spawn_one            1.49    154.8±8.22ns        ? ?/sec     1.00    104.0±6.94ns        ? ?/sec     7.28   756.9±66.28ns        ? ?/sec
single_thread/static_local_executor::spawn_recursively    1.53     26.4±2.18ms        ? ?/sec     1.00     17.3±1.05ms        ? ?/sec     1.61     27.8±3.10ms        ? ?/sec
single_thread/static_local_executor::web_server           1.19     20.0±1.05ms        ? ?/sec     1.00     16.7±0.32ms        ? ?/sec     1.71     28.6±3.43ms        ? ?/sec
single_thread/static_local_executor::yield_now            1.29      2.5±0.22ms        ? ?/sec     1.00  1916.6±165.43µs        ? ?/sec    2.54      4.9±0.19ms        ? ?/sec

This PR should be ready for review now.

@james7132 james7132 marked this pull request as ready for review August 9, 2025 05:11
@james7132 james7132 requested review from notgull and taiki-e August 9, 2025 05:26
@james7132
Contributor Author

james7132 commented Aug 9, 2025

I don't think we can reasonably make LocalExecutor any faster with its current interface. Any further gains would probably require std LocalWaker support, a customized LocalTask<T> in async_task, or stack allocation of some kind.

@james7132 james7132 force-pushed the optimized-local-executor branch from d206459 to 13fbe6f Compare August 9, 2025 06:01
@james7132
Contributor Author

Spoke too soon. I added an early-out to LocalExecutor::run for when the queue is empty and nabbed some additional performance improvements:

group                                                     local-executor-opt-w-unsafe-cell        local-executor-opt-w-unsafe-cell-and-early-out-run    master
-----                                                     --------------------------------        --------------------------------------------------    ------
single_thread/local_executor::channels                    1.04     10.3±0.08ms        ? ?/sec     1.00      9.9±0.15ms        ? ?/sec                   1.47     14.6±0.93ms        ? ?/sec
single_thread/local_executor::spawn_many_local            1.42      2.4±0.15ms        ? ?/sec     1.00  1703.3±37.62µs        ? ?/sec                   1.78      3.0±0.20ms        ? ?/sec
single_thread/local_executor::spawn_one                   1.44   135.0±14.68ns        ? ?/sec     1.00     94.1±2.81ns        ? ?/sec                   8.73   821.0±86.13ns        ? ?/sec
single_thread/local_executor::spawn_recursively           1.58     27.1±2.79ms        ? ?/sec     1.00     17.2±0.84ms        ? ?/sec                   1.85     31.7±2.56ms        ? ?/sec
single_thread/local_executor::web_server                  1.00     16.2±0.10ms        ? ?/sec     1.00     16.1±0.14ms        ? ?/sec                   1.60     25.9±1.69ms        ? ?/sec
single_thread/local_executor::yield_now                   1.19  1981.4±166.43µs        ? ?/sec    1.00  1670.0±18.72µs        ? ?/sec                   2.79      4.7±0.21ms        ? ?/sec
single_thread/static_local_executor::channels             1.03     10.0±0.13ms        ? ?/sec     1.00      9.8±0.06ms        ? ?/sec                   1.67     16.3±1.70ms        ? ?/sec
single_thread/static_local_executor::spawn_many_local     1.11  1474.5±28.53µs        ? ?/sec     1.00  1323.9±83.38µs        ? ?/sec                   1.90      2.5±0.11ms        ? ?/sec
single_thread/static_local_executor::spawn_one            1.38    104.0±6.94ns        ? ?/sec     1.00     75.2±1.06ns        ? ?/sec                   10.07  756.9±66.28ns        ? ?/sec
single_thread/static_local_executor::spawn_recursively    1.20     17.3±1.05ms        ? ?/sec     1.00     14.4±0.85ms        ? ?/sec                   1.93     27.8±3.10ms        ? ?/sec
single_thread/static_local_executor::web_server           1.03     16.7±0.32ms        ? ?/sec     1.00     16.3±0.10ms        ? ?/sec                   1.76     28.6±3.43ms        ? ?/sec
single_thread/static_local_executor::yield_now            1.25  1916.6±165.43µs        ? ?/sec    1.00  1530.8±30.15µs        ? ?/sec                   3.18      4.9±0.19ms        ? ?/sec

@notgull
Contributor

notgull commented Aug 13, 2025

@james7132 Can you rebase on master?

@james7132 james7132 force-pushed the optimized-local-executor branch from 13fbe6f to c4f077f Compare August 13, 2025 09:34
@james7132
Contributor Author

> @james7132 Can you rebase on master?

Done!

Comment thread src/local_executor.rs Outdated
Comment thread src/local_executor.rs
Comment thread src/local_executor.rs
Comment thread src/local_executor.rs Outdated
@james7132 james7132 force-pushed the optimized-local-executor branch from 2b5adca to d0a1f9e Compare August 14, 2025 00:01
Comment thread src/local_executor.rs
Comment thread src/local_executor.rs
Comment thread src/lib.rs Outdated
Co-authored-by: Ellen Emilia Anna Zscheile <fogti+devel@ytrizja.de>
Member

@fogti fogti left a comment


Overall LGTM, but as this is a complicated change, it should still be reviewed by others.

cc @smol-rs/admins

Contributor

@notgull notgull left a comment


I have to wonder if there's some kind of generic implementation we could do with the advent of GATs, like one type that wraps the state in Arc<T> and one that wraps it in Rc<T>. I would prefer that to the duplicated implementation here.

Comment thread src/local_executor.rs
@notgull
Contributor

notgull commented Aug 27, 2025

Any chance you can look into using GATs for this one?

@james7132
Contributor Author

> Any chance you can look into using GATs for this one?

I can definitely try, though I'm not sure how much duplication there actually is, given how different the runners are.

Comment thread src/local_executor.rs Outdated
Comment thread src/local_executor.rs
Comment on lines +481 to +484
// SAFETY: All UnsafeCell accesses to sleepers are tightly scoped, and because
// `LocalExecutor` is !Send, there is no way to have concurrent access to the
// values in `State`, including the sleepers field.
let sleepers = unsafe { &mut *self.state.sleepers.get() };
Member


I think this `let` should be moved above the match, given that it is present in all branches anyway.

Contributor Author


I moved it inside to minimize the scope of the UnsafeCell borrows, though I suppose it currently does no harm to move it outside the match.

Member


Please do so (simply because nothing is gained here by doing this code duplication, the match itself is side-effect-free)

Co-authored-by: Ellen Emilia Anna Zscheile <fogti+devel@ytrizja.de>
@james7132
Contributor Author

> Any chance you can look into using GATs for this one?

Looked into this, and every attempt I've made to make it work was either much more complex or unsound due to the UnsafeCell use.

@fogti
Member

fogti commented Jan 21, 2026

I would be very interested in moving this forward (I hope it can lead to improvements in the downstream https://github.com/hermit-os/kernel), but the conflicts would have to be resolved first.

Comment thread src/local_executor.rs

use crate::{AbortOnPanic, AsyncCallOnDrop, Sleepers};
#[doc(no_inline)]
pub use async_task::Task;
Member


Why do we pub re-export this from here?

Comment thread src/local_executor.rs
Comment on lines +1 to +8
use std::cell::{Cell, UnsafeCell};
use std::collections::VecDeque;
use std::marker::PhantomData;
use std::panic::{RefUnwindSafe, UnwindSafe};
use std::pin::Pin;
use std::rc::Rc;
use std::task::{Poll, Waker};
use std::{fmt, mem};
Member

@fogti fogti Jan 21, 2026


In light of #160, it would be a good idea to only use std where necessary, and use core/alloc when possible.

Comment thread src/local_executor.rs

@fogti
Member

fogti commented Jan 21, 2026

It would be nice if this implementation could be made no_std compatible (because I could then try to use it directly in the Hermit kernel and throw it against the fairly extensive test suite of the Hermit project).

@fogti
Member

fogti commented Jan 21, 2026

First and foremost, it would be a good idea to split the local_executor into a separate .rs file in a separate commit and merge that first, to reduce the potential for future merge conflicts.

