Though the traditional Rust benchmarking utility relies heavily on color, there are ways to
do these comparisons without depending on color vision. PRs to Criterion that improve the
visualizations would improve many lives :-). It is tough for my eyes to see, but the lru
and example lines are up top, while the k-cache and workshop lines are near the bottom.
The multi_thread benchmark image above shows a comparison of cache interaction latency
up the Y axis, for increasing contending thread count along the X axis.
The benchmark is set up for a 4:1 read:write ratio over a working set that is 2x the size of the caches. This is in the range of fairly realistic workloads.
The usual tool for benchmarking pieces of code in Rust is Criterion. This is a microbenchmarking tool, and the usual microbenchmarking caveats apply: you need to be careful that you are measuring what you think you are measuring. Compiling in release mode enables some aggressive optimizations that can even remove your code entirely - so you'd just be benchmarking an empty function body. That's obviously quick to execute, but not representative of the performance of your code.
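The standard defense against dead-code elimination is a black box. This is a minimal sketch using `std::hint::black_box` (Criterion re-exports an equivalent); without it, the optimizer is free to fold the whole computation into a constant or delete it outright:

```rust
use std::hint::black_box;

// A toy workload: without black_box, release mode can constant-fold
// this call and the benchmark would measure an empty function body.
fn sum_to(n: u64) -> u64 {
    (1..=n).sum()
}

fn main() {
    // black_box tells the compiler to assume the value is produced and
    // consumed in an unknowable way, so the work can't be removed.
    let result = black_box(sum_to(black_box(1_000)));
    assert_eq!(result, 500_500);
}
```

Criterion's closures do this for you in the common cases, but it's worth knowing what's happening under the hood when a benchmark looks suspiciously fast.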
I'm also using pprof here, so that macOS users can get profiles without jumping through hoops.
Typically you'd use perf as the flamegraph backend, but the pprof profiler has fewer
dependencies and does a good enough job for workshop purposes!
To run the benchmarks, run this at the root of your workshop directory:
```sh
cargo bench --bench bench_main -- multi_thread
```
The output goes to target/criterion.
macOS
Open the report in your web browser. From the repo root directory, run this:
```sh
open target/criterion/report/index.html
```
WSL
Open the report folder in the Explorer and double-click index.html. From the repo root directory in a WSL terminal run this:
```sh
explorer.exe target/criterion/report
```
Remember that VS Code's integrated terminal is a WSL terminal, so you can easily run this from the terminal in your editor.
Your sieve cache is the workshop line. When you write your own sieve cache, you should see
it perform similarly to the example sieve cache line. It is the fastest in the screenshot
above because it is not implemented. No code is faster than no code!
With a little effort, you can match the performance of the rigorous and high-quality moka
library. The effort moka spends on maintaining the LRU algorithm eventually boils down to
something similar to a mutex. There is fancy work amortization and whatnot that makes moka's
LRU implementation quite impressive; however, it eventually reaches a point where it can't
squeeze any more throughput through, and you're reduced to approximately a mutex's throughput.
You can see that the example sieve cache and probably your workshop sieve cache perform in
the rough ballpark of moka. This is explained by the cache::ShareableCache which wraps both
the example cache and the workshop cache. The ShareableCache wrapper uses a mutex to make
it shareable, and that means you can't do better than a mutex with ShareableCache.
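The "can't do better than a mutex" point is worth seeing concretely. This is a minimal sketch of the idea behind ShareableCache, not the workshop's actual trait definitions: a single-threaded structure (which needs `&mut self`) wrapped in a `Mutex` so it can be shared by `&self`. A plain `HashMap` stands in for the cache here.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::Mutex;

// Illustrative wrapper: the Mutex supplies the exclusive access that
// the inner structure's &mut self methods require.
struct SharedCache<K, V> {
    inner: Mutex<HashMap<K, V>>, // stand-in for a single-threaded cache
}

impl<K: Eq + Hash, V: Clone> SharedCache<K, V> {
    fn new() -> Self {
        Self { inner: Mutex::new(HashMap::new()) }
    }

    // Note &self, not &mut self: callers on any thread can share this.
    fn insert(&self, key: K, value: V) {
        self.inner.lock().unwrap().insert(key, value);
    }

    fn get(&self, key: &K) -> Option<V> {
        self.inner.lock().unwrap().get(key).cloned()
    }
}

fn main() {
    let cache = SharedCache::new();
    cache.insert("a", 1);
    assert_eq!(cache.get(&"a"), Some(1));
    assert_eq!(cache.get(&"b"), None);
}
```

Every operation from every thread serializes on that one lock, which is exactly why the wrapped caches plateau at roughly a mutex's throughput.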
The lru line (the moka implementation) does not require an external mutex for sharing. Neither does
the k-cache line. k-cache is a sharded implementation of sieve cache, for better parallelism.
Both of these cache implementations are suitable for a multi-threaded application. But to beat the
lru algorithm, k-cache needs only a couple hundred lines of code. moka is a feature-rich
cache, so its line count is not very relevant as a comparison. But it's worth noting that a simple
implementation of exactly what you need may be better suited to you than a generic implementation of
what everyone needs!
Profile:
```sh
CARGO_PROFILE_BENCH_DEBUG=true cargo bench \
    --bench bench_main -- --profile-time 5 --exact \
    'multi_thread/lru/1'
```
This outputs a flamegraph at target/criterion/multi_thread/lru/1/profile/flamegraph.svg:
You can see the work that the moka crate does to achieve a high-performance lru implementation.
```sh
CARGO_PROFILE_BENCH_DEBUG=true cargo bench \
    --bench bench_main -- --profile-time 5 --exact \
    'multi_thread/k-cache/1'
```
Here you can see how much less work a sharded sieve cache has to do to perform well at the same task.
There are still opportunities to improve, but it's better than lru at this kind of workload.
```sh
CARGO_PROFILE_BENCH_DEBUG=true cargo bench \
    --bench bench_main -- --profile-time 5 --exact \
    'multi_thread/workshop/1'
```
Here you can see how your implementation compares and where it is spending its time. If your implementation is not as quick as the example, you can see where the time goes and address it.
You can move on to comparing contended 16-thread workloads if you like, which gets a little more into the
art of concurrency. Here, you'll be looking for frames like lock_contended, and for queues that are used
to work around locking. The deeper you go, the more you will need to read and understand the source code
that you are profiling. At a high level, you can see where time is being spent and whether locks or queues
are used to communicate across threads.
You will likely find that to get the most effect for the least effort, you will shard your cache. By making 10 individually locked caches instead of 1, and hashing each key into one of the caches, you can probabilistically avoid contention. The exact probability of a conflicting access relates to the latency of your cache and the rate of concurrent operations accessing your cache.
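The sharding idea can be sketched in a few lines. This is a minimal illustration, not the workshop's SizeLimitedCache or the k-cache source: ten independently locked maps, with each key hashed to exactly one of them, so two threads only contend when their keys land in the same shard.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

const SHARDS: usize = 10;

// Each shard has its own lock; contention only happens when two
// concurrent operations hash to the same shard.
struct ShardedCache<K, V> {
    shards: Vec<Mutex<HashMap<K, V>>>,
}

impl<K: Eq + Hash, V: Clone> ShardedCache<K, V> {
    fn new() -> Self {
        Self {
            shards: (0..SHARDS).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    // Hash the key to pick its one-and-only shard.
    fn shard_for(&self, key: &K) -> &Mutex<HashMap<K, V>> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.shards[(h.finish() as usize) % SHARDS]
    }

    fn insert(&self, key: K, value: V) {
        let shard = self.shard_for(&key);
        shard.lock().unwrap().insert(key, value);
    }

    fn get(&self, key: &K) -> Option<V> {
        self.shard_for(key).lock().unwrap().get(key).cloned()
    }
}

fn main() {
    let cache = ShardedCache::new();
    for i in 0..100 {
        cache.insert(i, i * 2);
    }
    assert_eq!(cache.get(&21), Some(42));
    assert_eq!(cache.get(&1000), None);
}
```

Note that `insert` and `get` take `&self`: the per-shard mutexes provide the mutability, which is what makes the whole structure shareable without an outer lock.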
To take advantage of a sharded cache, you will need to implement the ShareableCache trait for your cache.
See how the k-cache implementation of SizeLimitedCache does not depend on &mut self, and the ShareableCache
implementation simply uses the &self functions of the sharded k-cache.
If you do this, you will find out some interesting details about Rust's borrow checker and ownership semantics.
&mut self is different from many other languages' concept of a mutable variable. Often, a "mutable" flag
on a variable is a hint to the developer and a disabling of compiler guard rails. In Rust, it is almost the opposite
of those things. &mut self is a contract you make with the compiler guaranteeing that the stack frame that can
see the self is the only stack frame that can see the self. It is the unique pointer to this piece of memory, so
you can safely modify it without any synchronization or other concern. In many other languages, mutable references are
risky. In Rust, mutable references are sort of uniquely safe!
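A tiny illustration of that exclusivity guarantee:

```rust
fn main() {
    let mut count = 0;

    let exclusive = &mut count;
    // While `exclusive` is live, no other reference to `count` may
    // exist. Uncommenting the next line is a compile error:
    // let alias = &count; // error[E0502]: cannot borrow `count` as immutable
    *exclusive += 1; // safe to mutate: provably the only pointer to it

    // Once `exclusive` is no longer used, `count` is accessible again.
    assert_eq!(count, 1);
}
```

That compile-time uniqueness is why a `&mut self` method never needs runtime synchronization, and why sharing a cache across threads requires trading `&mut self` for `&self` plus a lock.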
When you seek to improve the latency of a workflow, sometimes you need to change the workflow's approach. That's the idea of the sieve algorithm: Flip the concept of caching around - expire quickly and promote lazily instead of promoting quickly and expiring lazily.
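That flipped approach fits in surprisingly little code. This is a simplified sketch, not the workshop's reference implementation: lookups are O(n) linear scans to keep the eviction logic front and center (a real implementation keeps a hash index alongside). A hit lazily "promotes" by setting a visited bit; eviction sweeps a hand pointer, clearing bits until it finds an entry that hasn't been touched since the hand last passed it.

```rust
// Simplified sieve cache sketch: promote lazily (flip a bit on hit),
// expire quickly (sweep and evict the first unvisited entry).
struct Entry<K, V> {
    key: K,
    value: V,
    visited: bool,
}

struct SieveCache<K, V> {
    entries: Vec<Entry<K, V>>,
    capacity: usize,
    hand: usize, // eviction sweep position
}

impl<K: Eq, V> SieveCache<K, V> {
    fn new(capacity: usize) -> Self {
        Self { entries: Vec::new(), capacity, hand: 0 }
    }

    // Lazy promotion: a hit only sets a bit, no list reordering.
    fn get(&mut self, key: &K) -> Option<&V> {
        let e = self.entries.iter_mut().find(|e| &e.key == key)?;
        e.visited = true;
        Some(&e.value)
    }

    fn insert(&mut self, key: K, value: V) {
        if self.entries.len() >= self.capacity {
            // Sweep the hand, clearing visited bits, and evict the
            // first entry that wasn't touched since the last pass.
            loop {
                if self.hand >= self.entries.len() {
                    self.hand = 0;
                }
                if self.entries[self.hand].visited {
                    self.entries[self.hand].visited = false;
                    self.hand += 1;
                } else {
                    self.entries.swap_remove(self.hand);
                    break;
                }
            }
        }
        self.entries.push(Entry { key, value, visited: false });
    }
}

fn main() {
    let mut cache = SieveCache::new(2);
    cache.insert("a", 1);
    cache.insert("b", 2);
    cache.get(&"a"); // marks "a" visited
    cache.insert("c", 3); // sweep skips "a", evicts unvisited "b"
    assert!(cache.get(&"b").is_none());
    assert!(cache.get(&"a").is_some());
}
```

Notice there is no recency list to maintain on the hot path: a read touches one bit, and all the bookkeeping is deferred to eviction time.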
Sieve caching is not without its downsides, and a simple implementation is vulnerable to issues like accessing the whole working set, then adding 1 element, and repeating. For many workflows that's not a realistic concern, and most cache algorithms have some patterns that don't work well.
Stick to comparable workloads and compare your implementation against other cache implementations!
| profile | result |
|---|---|
| workshop | |
| example | |
| k-cache sieve | |
| moka lru | |
