Fix LeapMap migration bugs and add concurrent benchmarks #22

Open

mandrean wants to merge 3 commits into robclu:main from mandrean:fix/migration-retry-livelock

Conversation


@mandrean mandrean commented Mar 24, 2026

Initially I just wanted to benchmark LeapMap against DashMap (similar to https://github.com/robclu/leapfrog/blob/main/benches/hashmap.rs), but that actually crashed. So I started digging further into it, and found some bugs:

Summary

  • Fix five correctness bugs in the LeapMap migration/resize machinery (memory ordering, livelock, TOCTOU race)
  • Fix UB in bucket_slice / bucket_slice_mut (slices created 4x past allocation boundary)
  • Add Criterion benchmarks comparing LeapMap vs DashMap vs std lock-based maps
  • Add dashmap as a dev-dependency for side-by-side comparison
  • Probably closes "Inserting causes a pianic" (#15)?

Bug fixes

Memory ordering (be22a87)

Multiple `Relaxed` atomic operations in the `Migrator` are upgraded to proper `Acquire`/`Release`/`AcqRel` orderings:

| Location | Change | Reason |
| --- | --- | --- |
| `set_initialization_begin_flag` | `Relaxed` -> `AcqRel` | Must see prior state and publish the flag |
| `set_initialization_complete_flag` | `Relaxed` -> `Release` | Publishes initialization stores to joining workers |
| `finishing()`, `initialization_complete()`, `in_process()` | `Relaxed` -> `Acquire` | Must see stores published by initializer/completer |
| `status.fetch_or(1)` (overflow and completion) | `Relaxed` -> `Release` | Publishes overflowed flag and migration results |
| `stale_sources` cleanup | separate load+store -> `swap(AcqRel)` | Prevents a double-free race |
| `current_stale_source` CAS | `Relaxed` -> `AcqRel`/`Acquire` | Must see the stale table pointer before deallocating |

bucket_slice UB fix (be22a87)

`Table::bucket_slice()` and `bucket_slice_mut()` created slices with `self.size()` elements (the number of cells), but only `self.size() >> 2` buckets are allocated. Fixed to `self.size() >> 2`.

Note: `hashmap.rs` has the identical bug at lines 967-972, left for a follow-up.

Migration retry livelock (introduced in be22a87, fixed in f8cd76c)

`prepare_retry_after_overflow` stored `status = INITIALIZATION_MASK` (`0x06`) to signal a retry round, but non-last workers spin on `while status >= 1`, expecting `status == 0`. Since `0x06 >= 1`, they spin forever under sustained concurrent inserts.

Fix: removed the status store from `prepare_retry_after_overflow`. The last worker now drives the retry migration single-handedly via `drive_retry_migration()` while the other workers spin, then stores `status = 0` to release them.

TOCTOU race in perform_migration (f8cd76c)

A thread could pass `initialization_complete()` in `participate_in_migration`, but by the time it reached the `status + 8` CAS in `perform_migration`, the migration had completed (`status = 0`). The `0 -> 8` CAS succeeded, creating a phantom worker that corrupted the worker count (`prev_status = 0x08` instead of `0x0F`), leading to heap corruption and SIGABRT.

Fix: added an `INITIALIZATION_MASK` guard in the CAS join loop. Bail out if the init flags are not set.

Infinite spin in participate_in_migration (f8cd76c)

`while migrator.finishing()` looped forever when `status == 0` because `finishing()` returns true for `status == 0`. Threads entering this path after a migration completed would never exit.

Fix: check `status & MIGRATION_COMPLETE_FLAG` directly instead of calling `finishing()`.

Benchmarks

`benches/leapmap.rs` adds three benchmark groups, each comparing LeapMap against DashMap and the appropriate lock-based baseline:

Results with my M1 Max 64GB (10 cores):

| Benchmark | LeapMap | DashMap | Lock-based |
| --- | --- | --- | --- |
| `concurrent_insert` (N threads x 10K inserts) | 4.0 ms (24.7 Melem/s) | 4.2 ms (24.0 Melem/s) | 16.6 ms, `Mutex` (6.0 Melem/s) |
| `concurrent_get` (N threads x 10K reads) | 646 µs (154.9 Melem/s) | 1.7 ms (58.4 Melem/s) | 25.5 ms, `RwLock` (3.9 Melem/s) |
| `resize/1K` (single-threaded fill) | 52 µs | 40 µs | -- |
| `resize/10K` | 532 µs | 355 µs | -- |
| `resize/100K` | 6.8 ms | 3.5 ms | -- |

LeapMap is roughly 2.6x faster than DashMap for concurrent reads and competitive for concurrent inserts. Single-threaded resize is slower as expected (lock-free migration overhead amortizes under concurrency).

Test plan

  • Unit tests (3/3 pass, including new overflow_retry_keeps_migration_live)
  • Integration tests: basic (6/6), hashmap (6/6)
  • Cuckoo stress test (same pre-existing flake rate as main, ~20%)
  • Doc-tests (44/44)
  • All 9 Criterion benchmarks complete without hang or crash
  • Standalone 1400+ iteration stress test (10 threads x 10K inserts, tight loop)
  • No regressions in existing hashmap benchmark

Commit messages

Add benches/leapmap.rs with three benchmark groups, each comparing
LeapMap against DashMap and the appropriate lock-based baseline:

concurrent_insert
  LeapMap vs DashMap vs Mutex<HashMap>
  N threads (available_parallelism), each inserting 10 000 unique keys
  into a fresh map.

concurrent_get
  LeapMap vs DashMap vs RwLock<HashMap>
  N threads performing 10 000 random reads on a map pre-populated with
  16 384 entries (setup excluded from timing).

resize/{1k,10k,100k}
  LeapMap vs DashMap, single-threaded fill from empty to N elements
  with no pre-allocated capacity, exercising the full resize/migration
  path including Drop cleanup.

DashMap is added as a dev-dependency so results are directly comparable
to a popular alternative. The existing benches/hashmap.rs is unchanged.

Three fixes for the LeapMap concurrent migration machinery:

1. **Livelock fix**: `prepare_retry_after_overflow` no longer stores
   `status = INITIALIZATION_MASK`.  Previously, non-last workers spinning
   on `while status >= 1` would never exit because INITIALIZATION_MASK
   (0x06) >= 1.  The last worker now drives the retry migration
   single-handedly via `drive_retry_migration` while other workers spin,
   then stores `status = 0` to release them.

2. **TOCTOU race fix**: Added an initialization-flag guard in
   `perform_migration`'s CAS join loop.  Without this, a thread could
   pass `initialization_complete()` in `participate_in_migration`, but by
   the time it CAS'd `status + 8`, the migration had already completed
   (status = 0).  The CAS `0 → 8` would succeed, creating a phantom
   worker that corrupts the worker count (prev_status != 15) and leads
   to heap corruption.

3. **Infinite spin fix**: The `while migrator.finishing()` loop in
   `participate_in_migration` now checks the completion flag directly
   instead of calling `finishing()` (which returns true for status == 0,
   causing an infinite loop when the migration completes while a thread
   is in the cleanup spin).
