Fix LeapMap migration bugs and add concurrent benchmarks #22

Open

mandrean wants to merge 3 commits into robclu:main from mandrean:fix/migration-retry-livelock

Conversation


@mandrean mandrean commented Mar 24, 2026

Initially I just wanted to benchmark LeapMap against DashMap (similar to https://github.com/robclu/leapfrog/blob/main/benches/hashmap.rs), but that actually crashed. So I started digging further into it, and found some bugs:

Summary

  • Fix five correctness bugs in the LeapMap migration/resize machinery (memory ordering, livelock, TOCTOU race)
  • Fix UB in bucket_slice / bucket_slice_mut (slices created 4x past allocation boundary)
  • Add Criterion benchmarks comparing LeapMap vs DashMap vs std lock-based maps
  • Add dashmap as a dev-dependency for side-by-side comparison
  • Probably closes "Inserting causes a pianic" (#15)?

Bug fixes

Memory ordering (be22a87)

Multiple `Relaxed` atomic operations in the `Migrator` are upgraded to proper `Acquire`/`Release`/`AcqRel` orderings:

| Location | Change | Reason |
| --- | --- | --- |
| `set_initialization_begin_flag` | `Relaxed` -> `AcqRel` | Must see prior state and publish the flag |
| `set_initialization_complete_flag` | `Relaxed` -> `Release` | Publishes initialization stores to joining workers |
| `finishing()`, `initialization_complete()`, `in_process()` | `Relaxed` -> `Acquire` | Must see stores published by initializer/completer |
| `status.fetch_or(1)` (overflow and completion) | `Relaxed` -> `Release` | Publishes overflowed flag and migration results |
| `stale_sources` cleanup | separate load+store -> `swap(AcqRel)` | Prevents a double-free race |
| `current_stale_source` CAS | `Relaxed` -> `AcqRel`/`Acquire` | Must see the stale table pointer before deallocating |

bucket_slice UB fix (be22a87)

`Table::bucket_slice()` and `bucket_slice_mut()` created slices with `self.size()` elements (the number of cells), but only `self.size() >> 2` buckets are allocated. Fixed to `self.size() >> 2`.

Note: `hashmap.rs` has the identical bug at lines 967-972, left for a follow-up.

Migration retry livelock (introduced in be22a87, fixed in f8cd76c)

`prepare_retry_after_overflow` stored `status = INITIALIZATION_MASK` (`0x06`) to signal a retry round, but non-last workers spin on `while status >= 1`, expecting `status == 0`. Since `0x06 >= 1`, they spin forever under sustained concurrent inserts.

Fix: removed the status store from `prepare_retry_after_overflow`. The last worker now drives the retry migration single-handedly via `drive_retry_migration()` while the other workers spin, then stores `status = 0` to release them.

TOCTOU race in perform_migration (f8cd76c)

A thread could pass `initialization_complete()` in `participate_in_migration`, but by the time it reached the `status + 8` CAS in `perform_migration`, the migration had completed (`status = 0`). The `0 -> 8` CAS succeeded, creating a phantom worker that corrupted the worker count (`prev_status = 0x08` instead of `0x0F`), leading to heap corruption and SIGABRT.

Fix: added an `INITIALIZATION_MASK` guard in the CAS join loop. Bail out if the init flags are not set.

Infinite spin in participate_in_migration (f8cd76c)

`while migrator.finishing()` looped forever when `status == 0` because `finishing()` returns true for `status == 0`. Threads entering this path after a migration completed would never exit.

Fix: check `status & MIGRATION_COMPLETE_FLAG` directly instead of calling `finishing()`.

Benchmarks

`benches/leapmap.rs` adds three benchmark groups, each comparing LeapMap against DashMap and the appropriate lock-based baseline:

Results with my M1 Max 64GB (10 cores):

| Benchmark | LeapMap | DashMap | Lock-based |
| --- | --- | --- | --- |
| `concurrent_insert` (N threads x 10K inserts) | 4.0 ms (24.7 Melem/s) | 4.2 ms (24.0 Melem/s) | 16.6 ms, `Mutex` (6.0 Melem/s) |
| `concurrent_get` (N threads x 10K reads) | 646 µs (154.9 Melem/s) | 1.7 ms (58.4 Melem/s) | 25.5 ms, `RwLock` (3.9 Melem/s) |
| `resize/1K` (single-threaded fill) | 52 µs | 40 µs | -- |
| `resize/10K` | 532 µs | 355 µs | -- |
| `resize/100K` | 6.8 ms | 3.5 ms | -- |

LeapMap is roughly 2.6x faster than DashMap for concurrent reads and competitive for concurrent inserts. Single-threaded resize is slower as expected (lock-free migration overhead amortizes under concurrency).

Test plan

  • Unit tests (3/3 pass, including new overflow_retry_keeps_migration_live)
  • Integration tests: basic (6/6), hashmap (6/6)
  • Cuckoo stress test (same pre-existing flake rate as main, ~20%)
  • Doc-tests (44/44)
  • All 9 Criterion benchmarks complete without hang or crash
  • Standalone 1400+ iteration stress test (10 threads x 10K inserts, tight loop)
  • No regressions in existing hashmap benchmark

Commit messages

Add benches/leapmap.rs with three benchmark groups, each comparing
LeapMap against DashMap and the appropriate lock-based baseline:

concurrent_insert
  LeapMap vs DashMap vs Mutex<HashMap>
  N threads (available_parallelism), each inserting 10 000 unique keys
  into a fresh map.

concurrent_get
  LeapMap vs DashMap vs RwLock<HashMap>
  N threads performing 10 000 random reads on a map pre-populated with
  16 384 entries (setup excluded from timing).

resize/{1k,10k,100k}
  LeapMap vs DashMap, single-threaded fill from empty to N elements
  with no pre-allocated capacity, exercising the full resize/migration
  path including Drop cleanup.

DashMap is added as a dev-dependency so results are directly comparable
to a popular alternative. The existing benches/hashmap.rs is unchanged.

Three fixes for the LeapMap concurrent migration machinery:

1. **Livelock fix**: `prepare_retry_after_overflow` no longer stores
   `status = INITIALIZATION_MASK`.  Previously, non-last workers spinning
   on `while status >= 1` would never exit because INITIALIZATION_MASK
   (0x06) >= 1.  The last worker now drives the retry migration
   single-handedly via `drive_retry_migration` while other workers spin,
   then stores `status = 0` to release them.

2. **TOCTOU race fix**: Added an initialization-flag guard in
   `perform_migration`'s CAS join loop.  Without this, a thread could
   pass `initialization_complete()` in `participate_in_migration`, but by
   the time it CAS'd `status + 8`, the migration had already completed
   (status = 0).  The CAS `0 → 8` would succeed, creating a phantom
   worker that corrupts the worker count (prev_status != 15) and leads
   to heap corruption.

3. **Infinite spin fix**: The `while migrator.finishing()` loop in
   `participate_in_migration` now checks the completion flag directly
   instead of calling `finishing()` (which returns true for status == 0,
   causing an infinite loop when the migration completes while a thread
   is in the cleanup spin).
