Off-heap memory allocators for Go — GC-isolated, lock-free, backed by mmap.
Package memory provides four off-heap allocator types:
Arena— variable-size bump-pointer allocator (CAS, lock-free)Pool— variable-size slab allocator (CAS, lock-free, bulkReset())FreeList— fixed-size lock-free allocator (Treiber stack, per-objectDeallocate())ShardedFreeList— sharded fixed-size allocator (per-shard LIFO caches + Hyaline SMR,Deallocate()orRetire())
Allocations are served from mmap'd slabs; the Go GC never scans this memory. Safe memory reclamation (SMR) for concurrent workloads is provided by Hyaline (PLDI 2021), a reference-counting scheme with a single-store hot path.
- Off-heap — allocations live in mmap'd memory, invisible to the Go GC
- Variable + fixed-size —
Pool/Arenafor arbitrary sizes;FreeList/ShardedFreeListfor fixed-size slots - Bulk or per-object free —
Pool.Reset()bulk-frees everything;FreeList.Deallocate()frees individual slots;ShardedFreeList.Retire()defers reclamation via Hyaline SMR - Hard memory bounds —
PoolSizecaps total mmap'd bytes; no unbounded growth - Lock-free hot paths — CAS-based allocation across all allocator types; zero mutex contention on the fast path
- Zero heap allocations — verified on every code path with
-benchmem, escape analysis, andGODEBUG=gctrace=1 - ShardedFreeList with adaptive backpressure — PI-controlled batch flushing prevents pool exhaustion stalls under extreme oversubscription
go get github.com/xDarkicex/memory
| Type | Allocation model | Free model | Concurrency | Best for |
|---|---|---|---|---|
Pool |
Variable-size (CAS slab) | Bulk Reset() |
Lock-free multi-producer | Request-scoped scratch buffers, parse buffers |
Arena |
Variable-size (CAS bump pointer) | Reset() (rewind) or Free() (destroy) |
Lock-free multi-producer | Frame scratch, per-request temp data |
FreeList |
Fixed-size (Treiber stack) | Per-object Deallocate() |
Lock-free | Fixed-size object pools, per-vector allocations |
ShardedFreeList |
Fixed-size (sharded + Hyaline SMR) | Per-object Deallocate() or Retire() |
Lock-free, sharded by goroutine | High-concurrency fixed-size pools, vector DBs |
pool, err := memory.NewPool(memory.AllocatorConfig{
PoolSize: 64 * 1024 * 1024, // 64MB hard limit
SlabSize: 1024 * 1024, // 1MB slabs
SlabCount: 16,
Prealloc: true,
})
if err != nil {
panic(err)
}
defer pool.Free()
buf, err := pool.Allocate(4096) // off-heap, zero GC
// use buf...
pool.Reset() // bulk-free everythingarena, err := memory.NewArena(1024 * 1024) // 1MB
ptr, err := arena.Alloc(256) // bump-pointer, lock-free
arena.Reset() // rewind, keep mmap
arena.Free() // release mmapfl, err := memory.NewFreeList(memory.FreeListConfig{
PoolSize: 256 * 1024 * 1024,
SlotSize: 64, // every slot is exactly 64 bytes
SlabSize: 2 * 1024 * 1024,
SlabCount: 32,
Prealloc: true,
})
if err != nil {
panic(err)
}
defer fl.Free()
slot, err := fl.Allocate() // returns []byte of exactly SlotSize
fl.Deallocate(slot) // return to freelist
fl.BatchAllocate(dst [][]byte) // batch-refill, amortizes CASsfl, err := memory.NewShardedFreeList(memory.FreeListConfig{
PoolSize: 256 * 1024 * 1024,
SlotSize: 64,
SlabSize: 2 * 1024 * 1024,
SlabCount: 32,
Prealloc: true,
}, 64) // 64 shards
if err != nil {
panic(err)
}
defer sfl.Free()
slot, err := sfl.Allocate()
// use slot...
sfl.Deallocate(slot) // fast path: shard cache, zero atomics- Large, bounded working sets (vector DBs, caches, parse buffers)
- GC scan time dominates latency percentiles
- Hard memory limits needed (no unbounded growth like
sync.Pool) - Fixed-size objects with high allocation churn (FreeList / ShardedFreeList)
- Allocation lifetimes are naturally scoped (per-request, per-frame, per-batch)
- You accept trading per-allocation speed for zero GC overhead
- Allocations are small and short-lived (Go's stack allocator is faster)
- You need automatic memory management (no GC integration)
- Your working set fits comfortably in the Go heap with acceptable GC pauses
- You need per-allocation free for variable-size allocations (use FreeList instead of Pool)
- You're building a library that can't impose lifecycle rules on callers
All allocations use unix.Mmap with MAP_ANON | MAP_PRIVATE. This memory is
not tracked by the Go GC — no heap scanning, no GOMEMLIMIT pressure.
The caller controls the lifecycle.
pool, err := memory.NewPool(memory.AllocatorConfig{...})
buf, err := pool.Allocate(size) // off-heap, 0 heap allocs
stats := pool.Stats() // atomic snapshot
pool.Reset() // bulk-free, reuse mmap
pool.Free() // release mmap, invalidate poolarena, err := memory.NewArena(size)
ptr, err := arena.Alloc(size) // bump-pointer, lock-free
remaining := arena.Remaining()
arena.Reset() // rewind, keep mmap
arena.Free() // release mmap, invalidatefl, err := memory.NewFreeList(cfg)
slot, err := fl.Allocate() // single fixed-size slot
n, err := fl.BatchAllocate(dst[:]) // batch refill, amortizes CAS
err := fl.Deallocate(slot) // return to freelist
stats := fl.Stats()
fl.Reset() // bulk-free, reuse mmap
fl.Free() // release mmapsfl, err := memory.NewShardedFreeList(cfg, numShards)
slot, err := sfl.Allocate() // shard cache → batch refill → global
err := sfl.Deallocate(slot) // fast path: shard cache (zero atomics)
err := sfl.Retire(slot) // Hyaline SMR path (see contracts below)
sfl.HyalineEnter(shardIdx) // protect concurrent readers
sfl.HyalineLeave(shardIdx) // drain retired nodes, decrement refs
stats := sfl.Stats()
sfl.Reset() // bulk-free + restart PID controller
sfl.Free() // release mmap + cancel PID controllerTyped allocation helpers for every allocator — no unsafe.Pointer casting
needed. All have Must variants that panic instead of returning an error.
// Pool: typed single-value and slice allocation
hdr, err := memory.PoolAlloc[Header](pool)
vec, err := memory.PoolSlice[float32](pool, 1536)
vec = vec[:1536] // reslice to full capacity
// Arena: typed allocation + string + append
hdr, err := memory.ArenaAlloc[Header](arena)
vec, err := memory.ArenaSlice[float32](arena, 1536)
s, err := memory.ArenaNewString(arena, "hello")
vec = memory.ArenaAppend(arena, vec, 1.0, 2.0)
// FreeList: typed alloc + dealloc
hdr, err := memory.FreeListAlloc[Header](fl)
hdr.ID = 42
memory.FreeListDealloc(fl, hdr)| Helper | Allocator | Description |
|---|---|---|
PoolAlloc[T](pool) *T |
Pool | Allocate a single T |
PoolSlice[T](pool, n) []T |
Pool | Allocate []T (len=0, cap=n) |
MustPoolAlloc[T](pool) *T |
Pool | Panic-on-error variant |
MustPoolSlice[T](pool, n) []T |
Pool | Panic-on-error variant |
ArenaAlloc[T](arena) *T |
Arena | Allocate a single T |
ArenaSlice[T](arena, n) []T |
Arena | Allocate []T (len=0, cap=n) |
ArenaNewString(arena, s) string |
Arena | Copy a string to the arena |
ArenaAppend[T](arena, sl, elems) []T |
Arena | Append to an arena-backed slice |
MustArenaAlloc[T](arena) *T |
Arena | Panic-on-error variant |
MustArenaSlice[T](arena, n) []T |
Arena | Panic-on-error variant |
MustArenaNewString(arena, s) string |
Arena | Panic-on-error variant |
FreeListAlloc[T](fl) *T |
FreeList | Allocate a single T |
FreeListDealloc[T](fl, *T) |
FreeList | Deallocate a *T |
FreeListSlotFor[T](fl, *T) []byte |
FreeList | Get backing slot for *T |
MustFreeListAlloc[T](fl) *T |
FreeList | Panic-on-error variant |
Reading or writing through an allocation after Reset() is undefined
behavior — it will either segfault (if the OS has reclaimed the page) or
silently corrupt data (if the page has been re-mmap'd and handed to another
allocation). The caller is responsible for ensuring no references survive
Reset().
Calling Reset() while other goroutines hold allocations from the same pool
is also undefined behavior. The caller must ensure quiescence — no in-flight
Allocate calls — before calling Reset().
Reset() increments a generation counter before unmapping slabs. Allocators
check the generation before and after their CAS: if the generation changed,
the allocation is retried rather than returning a pointer into memory being
unmapped. This is best-effort, not a true RCU barrier. The only guarantee
is external quiescence.
The Hyaline safe memory reclamation protocol has required invariants. Violating any of them causes use-after-free data corruption.
Every HyalineEnter MUST be paired with exactly one HyalineLeave.
sfl.HyalineEnter(shardIdx)
// ... read shared memory ...
sfl.HyalineLeave(shardIdx) // REQUIRED: paired with EnterRetire MUST NOT be called while the slot is still reachable by readers
that entered the corresponding Hyaline slot. The correct pattern is:
// CORRECT: unlink from shared structure, then retire
sfl.HyalineEnter(shardIdx)
slot, _ := sfl.Allocate()
// ... use slot, possibly publish it ...
// Remove from shared structure BEFORE retiring
liveMu.Lock()
delete(liveSet, slot)
liveMu.Unlock()
sfl.Retire(slot) // safe: no reader can reach this slot
sfl.HyalineLeave(shardIdx)// WRONG: retiring while still reachable — reader UAF risk
sfl.HyalineEnter(shardIdx)
sfl.Retire(slot) // UNSAFE: slot still in liveSet, readers can access it
sfl.HyalineLeave(shardIdx)A reader that calls HyalineEnter is protected from having memory freed
that was retired after the Enter. The reader must obtain its pointers
through a safe publication mechanism (shared slice, map, etc.) and must
not access memory after calling HyalineLeave.
// Reader goroutine
sfl.HyalineEnter(shardIdx)
liveMu.RLock()
for _, ptr := range livePtrs {
_ = *(*uint64)(ptr) // safe: protected by Enter
}
liveMu.RUnlock()
sfl.HyalineLeave(shardIdx)
// UNSAFE to access ptrs after LeaveDeallocate: Fast path. Returns the slot directly to the shard cache. No SMR protection. Use only when no other goroutine can reach the slot.Retire: Hyaline SMR path. Defers reclamation until all readers that entered before the retire have left. Use when concurrent readers may still access the slot.
Both Deallocate and Retire detect double-frees via per-slot generation
counters. Attempting to free or retire the same slot twice returns
ErrDoubleDeallocation. This is a safety net, not a correctness guarantee
under races — once you deallocate a slot, another goroutine can allocate
and use it before your second deallocate.
| Error | Meaning |
|---|---|
ErrInvalidSize |
size == 0 |
ErrPoolExhausted |
PoolSize limit reached |
ErrMmapFailed |
OS mmap call failed (OOM, system limit, hugepage alignment) |
ErrArenaExhausted |
Arena has insufficient space |
ErrFreelistExhausted |
FreeList pool exhausted (all slots allocated) |
ErrInvalidDeallocation |
Slot size mismatch or pointer outside any slab |
ErrDoubleDeallocation |
Slot freed or retired twice |
ErrLA57 |
5-level paging detected; tagged pointers require ≤48-bit virtual addresses |
ErrPoolFreed |
Pool has been freed |
ErrFreelistFreed |
FreeList has been freed |
ErrArenaCapacityExceeded |
Arena slice capacity exceeded |
ErrSlotTooSmall |
Slot size is too small for the requested struct/slice |
See examples/ for runnable demonstrations with benchmarks:
| Example | Scenario | Key metric |
|---|---|---|
| parser-scratch | JSON tokenizer with scratch buffer | 0 allocs vs 1 heap alloc per parse |
| request-pool | Per-request TLV message builder | Bulk Reset() vs per-buffer free |
| vector-storage | float32[1536] embeddings off-heap | 0 allocs vs 1 per vector; GC never scans vectors |
Each example includes a main.go (runnable demo), main_test.go (correctness
tests + benchmarks), and a README.md explaining the use case and tradeoffs.
To run an example benchmark:
go test -bench=. -benchmem ./examples/parser-scratch/
See BENCHMARK.md for extended methodology, raw data, and historical trends. Summary below. Apple M2, Go 1.25, Darwin (arm64). All paths show 0 heap allocations.
| Allocator | ns/op | B/op | allocs/op | vs make() |
|---|---|---|---|---|
| FreeList | 30.8 | 0 | 0 | 17.0× faster |
| ShardedFreeList | 38.5 | 0 | 0 | 13.6× faster |
| Slabby | 63.4 | 0 | 0 | 8.3× faster |
make([]float32, 1536) |
525 | 6,144 | 1 | 1.00× baseline |
| Pool (CAS slab) | 1,041 | 0 | 0 | 2.0× slower |
B/op and allocs/op reflect scaffolding (pool creation, goroutines), not the allocation hot path.
| Allocator | ms/op | B/op | allocs/op |
|---|---|---|---|
make() (Go heap) |
11.9 | 61,685,782 | 10,001 |
| Pool | 12.3 | 13,813 | 8 |
| FreeList | 13.3 | 361,308 | 8 |
| ShardedFreeList | 14.5 | 376,134 | 17 |
| Slabby | 26.0 | 62,221,757 | 10,024 |
All allocators show the same scaffolding overhead (~292 B/op, 3 allocs/op). The allocation hot path is zero heap.
| Allocator | ms/op | B/op | allocs/op |
|---|---|---|---|
| Pool | 3.41 | 292 | 3 |
make() (Go heap) |
3.42 | 292 | 3 |
| FreeList | 3.45 | 292 | 3 |
| ShardedFreeList | 3.61 | 292 | 3 |
| Slabby | 3.70 | 292 | 3 |
| Duration | Total ops | ops/sec | Errors | Error rate | Stalls | Corruption |
|---|---|---|---|---|---|---|
| 30s | 0.43B | 14.43M | 1.39M | 0.32% | 0 | 0 |
| 5m | 3.95B | 13.16M | 4.13M | 0.10% | 0 | 0 |
| 10m | 7.34B | 12.23M | 2.22M | 0.03% | 0 | 0 |
| 1h | 42.02B | 11.67M | 15.59M | 0.037% | 0 | 0 |
1-hour post-hammer recovery: 10,000/10,000 alloc/free cycles succeeded. RSS flat at ~6 MB (128 MB pool is off-heap mmap). Zero memory leak, zero throughput degradation beyond asymptotic PID settling. v1.0.0-gold certified.
| Metric | Static (threshold=65) | PID (adaptive) | Improvement |
|---|---|---|---|
| Stall duration | 6 seconds | 0 seconds | Eliminated |
| Error rate | 1.07% | 0.10% | 10× lower |
| Total errors | 40.1M | 4.13M | 89.7% reduction |
| Path | ops/sec | ns/op | B/op | allocs/op |
|---|---|---|---|---|
| Hot path (slab has space) | 124M | 9.4 | 0 | 0 |
| Slow path (scan for free slab) | 3.7M | 314 | 0 | 0 |
| Grow path (mmap new slab) | 1.9M | 620 | 0 | 0 |
| Large allocation (1MB, direct mmap) | 2.0M | 595 | 0 | 0 |
| Slabs | ns/op | B/op | allocs/op |
|---|---|---|---|
| 4 | 2,339 | 0 | 0 |
| 16 | 9,463 | 0 | 0 |
| 64 | 39,591 | 0 | 0 |
| 256 | 172,423 | 0 | 0 |
Sustained runs under GODEBUG=gctrace=1. Every path shows 0→0→0 MB
live heap with zero automatic GC triggers.
| Path | Duration | GC Cycles | Live Heap | Auto GC |
|---|---|---|---|---|
| Pool hot path | 10s | 7 forced | 0→0→0 MB | 0 |
| Pool grow path | 5s | 4 forced | 0→0→0 MB | 0 |
| Pool large allocation | 5s | 4 forced | 0→0→0 MB | 0 |
| FreeList per-vector alloc+free | 1s | 2 forced | 0→0→0 MB | 0 |
| ShardedFreeList per-vector alloc+free | 1s | 2 forced | 0→0→0 MB | 0 |
| ShardedFreeList + PID controller | 60m | all forced | 0→0→0 MB | 0 |
gctrace format (live_before→live_marked→live_after): all zeros means the GC
found nothing to scan. All cycles are (forced) — triggered by runtime.GC()
in benchmark scaffolding, not by heap pressure. No automatic GC fired because
the runtime never detected heap growth.
The PID controller (100ms ticker, per-vector allocations, 1-hour stress hammer)
adds zero measurable heap pressure. GC trace shows steady 0→0→0 MB with no
creep over time.
RSS behavior after Reset() varies by platform:
| Platform | madvise behavior |
RSS after Reset |
|---|---|---|
| Linux | MADV_DONTNEED releases pages immediately |
RSS drops |
| macOS (darwin) | MADV_FREE lazily reclaims pages |
RSS may linger until pressure |
On macOS, top/htop may show higher resident memory after Reset() due to
lazy page reclamation. This is cosmetic — the OS reclaims pages under pressure.
Go runtime metrics (MemStats) always report zero heap growth.
| Field | Type | Default | Description |
|---|---|---|---|
PoolSize |
uint64 | 64MB | Hard limit on total mmap'd bytes |
SlabSize |
uint64 | 1MB | Size of each slab |
SlabCount |
int | 16 | Initial slab descriptor capacity |
Prealloc |
bool | false | Eagerly allocate SlabCount slabs at creation |
UseHugePages |
bool | false | Use MAP_HUGETLB (Linux only; requires 2MB-aligned SlabSize) |
| Field | Type | Default | Description |
|---|---|---|---|
PoolSize |
uint64 | 64MB | Hard limit on total mmap'd bytes |
SlotSize |
uint64 | 64 | Fixed size of each slot (min 32 for metadata) |
SlabSize |
uint64 | 1MB | Size of each slab |
SlabCount |
int | 16 | Initial slab descriptor capacity |
Prealloc |
bool | false | Eagerly allocate SlabCount slabs at creation |
Prealloc: When true, NewPool/NewFreeList eagerly allocates SlabCount
slabs. On failure, already-allocated slabs are rolled back and ErrMmapFailed
is returned.
UseHugePages: Linux only. Attempts MAP_HUGETLB; silently falls back to
regular mmap if unavailable. macOS ignores this flag.
PoolSize is a hard limit on mmap'd bytes tracked via atomic reserve().
When exhausted, Allocate returns ErrPoolExhausted.
SlotSize (FreeList/ShardedFreeList): Must be ≥ 32 bytes. The slot metadata (Hyaline chain pointers, batch references, struct index, shard index) occupies offsets 0–31. Offsets 32+ are usable payload.
The numShards parameter to NewShardedFreeList defaults to 64. It is rounded
up to the next power of two. More shards reduce cross-CPU contention but increase
memory overhead (per-shard batch, caches, mutex). 64 is a good default for most
workloads; 256 is appropriate for extreme oversubscription scenarios.
For P-bound affinity (goroutines pinned to OS threads), build with -tags procpin
to use runtime.procPin instead of stack-address hashing for shard selection.
stats := pool.Stats() // atomic snapshot, safe for concurrent use
// PoolStats{Reserved, Allocated, Committed, PeakUsage, SlabCount, SlabSize, Align}Reserved= total bytes mmap'd (≤PoolSize)Allocated= bytes handed to callers (may slightly overcount during Reset churn)Committed= bytes mmap'd for large (>SlabSize) allocations
All allocations are 8-byte aligned for SIMD/ARM compatibility.
memory.Hint(HintWillNeed, ptr, len) or memory.Hint(HintDontNeed, ptr, len) wraps madvise(2) for
cache warming or page reclaim hints. Linux uses MADV_DONTNEED (eager);
macOS uses MADV_FREE (lazy).
| Operation | Complexity | Locks |
|---|---|---|
| Pool hot path (slab has space) | O(1), lock-free CAS | None |
| Pool slow path (scan slabs) | O(n slabs) | None |
| FreeList.Allocate | O(1), lock-free CAS | None |
| ShardedFreeList.Allocate (cache hit) | O(1), zero atomics | None |
| ShardedFreeList.Allocate (batch refill) | O(1), lock-free CAS | None |
| ShardedFreeList.Retire | O(1) amortized, lock-free CAS | batchMu (per-shard, uncontended) |
| HyalineEnter | O(1), single atomic store | None |
| HyalineLeave | O(nodes in slot chain) | None |
| PID controller | O(1) every 100ms, background | None |
| Reset | O(n slabs) munmap | None |
NewShardedFreeList launches a background PI controller (Kp=2.0, Ki=0.5,
anti-windup ±100, 100ms ticker) that dynamically adjusts the Hyaline batch
flush threshold from its default of 65 down to as low as 1. When the pool
drops below 20% free capacity, the controller forces partial batches to
flush sooner, preventing the exhaustion cliff that occurs with a static
threshold. The hot path (hyalineRetire) sees only a single
atomic.Uint64.Load — zero additional contention or branching.
The controller is automatically restarted on Reset() and cancelled on
Free().
A process-wide heap pressure monitor is available via
RegisterMemoryPressureCallback(threshold, fn). It monitors Go heap metrics
(HeapInuse), not the off-heap mmap'd memory managed by this package.
- Not GC-safe — memory is not zeroed on alloc/reset; caller manages contents
- Not thread-safe for
ArenaReset — single-producer reset only; calling Reset concurrently with Alloc causes overlapping allocations - Not a substitute for
sync.Pool— designed for explicit lifecycle control, not automatic GC integration - Not a general-purpose allocator — tuned for slab workloads; large allocations bypass slabs
- Not safe for use-after-Reset — accessing an allocation after
Reset()will segfault or corrupt data - Not safe for use-after-Retire without Enter — accessing a retired slot without holding an active Hyaline enter is a use-after-free bug
This implementation bridges high-level Go concurrency with low-level systems research:
-
Safe Memory Reclamation: Based on Hyaline: Fast and Transparent Lock-Free Memory Reclamation (PLDI '21) by Nikolaev and Ravindran. This provides
$O(1)$ reclamation and robustness against stalled goroutines, enabling our 13.8M ops/sec throughput without the frequent memory barrier overhead inherent to traditional Hazard Pointers (Michael, 2004). - Lock-Free Primitives: Utilizes a sharded Treiber Stack (1986). To resolve the ABA problem (a classic weakness of Treiber stacks in non-GC languages), 16-bit generation tags are packed into 48-bit virtual addresses. Furthermore, sharding is used to avoid the scalability bottlenecks of global stacks, a principle outlined in A Scalable Lock-free Stack Algorithm (Hendler, Shavit, and Yerushalmi, 2004).
- Adaptive Control: Reclamation pressure is managed via a PID controller, dynamically tuning batch flush thresholds to prevent liveness stalls under extreme oversubscription, applying principles from Feedback Control for Computer Systems (Janert).
See CONTRIBUTING.md.
MIT — see LICENSE.