UPSTREAM PR #21821: llama : add --hugepages for HugeTLB-backed weight loading (Linux) by loci-dev · Pull Request #1347 · auroralabs-loci/llama.cpp

loci-dev · 2026-04-13T03:12:39Z

Note

Source pull request: ggml-org/llama.cpp#21821

Overview

Addresses #2251 (partially: weights via the --mmap path only; I hope to address --no-mmap and KV cache in a follow-up). I realize this is a non-trivial change (even though the implementation itself is just one file / two blocks), so I'm going into a bit more detail than I usually would with a PR. :)

Summary

Adds a new --hugepages CLI flag (env: LLAMA_ARG_HUGEPAGES) that backs model weight memory with anonymous 2 MiB HugeTLB pages on Linux.

Motivation: vmemmap reclamation
(not TLB speedup, though it may facilitate future work in this area)

When HugeTLB Vmemmap Optimization (HVO, CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y) is enabled, the kernel frees the per-4 KiB struct page metadata within each hugepage. On a 128 GiB system this recovers ~1.75 GiB of kernel memory — enough to turn a tight-ceiling workload from OOM into working (which is what was happening to me).

The flag is opt-in at runtime, so there's no cost to anyone who doesn't use it.

# reserve the pool (runtime, no reboot for 2 MiB pages)
sudo sysctl -w vm.nr_hugepages=65536   # 65536 x 2 MiB = 128 GiB

# run with the flag
./llama-cli --hugepages -m model.gguf ...
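
To size the pool for a given model, the page count is just the weight size rounded up to whole 2 MiB pages. The helper below is a minimal sketch of that arithmetic; the function names are illustrative and not part of this PR.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kHugePageSize = 2ull * 1024 * 1024;  // 2 MiB

// Number of 2 MiB hugepages needed to hold `bytes`, rounded up.
inline std::uint64_t hugepages_needed(std::uint64_t bytes) {
    // Guard against wrap-around in the round-up addition below.
    assert(bytes <= UINT64_MAX - (kHugePageSize - 1));
    return (bytes + kHugePageSize - 1) / kHugePageSize;
}

// Mapping length rounded up to a whole number of hugepages
// (munmap on a HugeTLB mapping expects a hugepage-multiple length).
inline std::uint64_t round_up_to_hugepage(std::uint64_t bytes) {
    return hugepages_needed(bytes) * kHugePageSize;
}
```

For example, hugepages_needed(128ull << 30) gives 65536, the value used in the sysctl line above.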

Why?

Each 4 KiB page costs ~64 bytes of struct page metadata. With 128 GiB, that's ~2 GiB just to track pages. HugeTLB with HVO reduces this: the metadata for a 2 MiB hugepage shrinks from 32 KiB (512 x 64 bytes) to 4 KiB.

System RAM	Approx. Benefit
16 GiB	~224 MiB
64 GiB	~896 MiB
128 GiB	~1.75 GiB
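
The figures in the table follow from the per-hugepage numbers above. The sketch below reproduces them, assuming x86_64 defaults (4 KiB base pages, 2 MiB hugepages, a 64-byte struct page) and RAM that is fully backed by hugepages; the constant and function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t KiB = 1024;
constexpr std::uint64_t MiB = 1024 * KiB;
constexpr std::uint64_t GiB = 1024 * MiB;

// Assumed sizes (x86_64 defaults), not values read from the running kernel.
constexpr std::uint64_t kBasePage   = 4 * KiB;
constexpr std::uint64_t kHugePage   = 2 * MiB;
constexpr std::uint64_t kStructPage = 64;

// struct page metadata describing one 2 MiB hugepage without HVO: 512 * 64 B = 32 KiB.
constexpr std::uint64_t kVmemmapPerHugePage = (kHugePage / kBasePage) * kStructPage;

// Bytes of vmemmap freed when `ram_bytes` is fully backed by 2 MiB hugepages
// and HVO keeps a single 4 KiB vmemmap page per hugepage (28 KiB saved each).
inline std::uint64_t hvo_savings_bytes(std::uint64_t ram_bytes) {
    assert(ram_bytes % kHugePage == 0);  // expects a whole number of hugepages
    const std::uint64_t hugepages = ram_bytes / kHugePage;
    return hugepages * (kVmemmapPerHugePage - kBasePage);
}
```

With these assumptions, hvo_savings_bytes(128 * GiB) is 1792 MiB, i.e. the ~1.75 GiB row.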

Here's my real-life example: Strix APU with 128 GiB RAM (unified in my case) running MiniMax m2.5 IQ4_XS. The total footprint is ~127,910 MiB against ~127,342 MiB available (with "normal" pages). Saving 1,792 MiB pushes available memory to ~129,134 MiB, so now the model fits with ~1 GiB to spare.

Transparent Huge Pages (MADV_HUGEPAGE) don't help here: THP remaps existing 4 KiB pages under a 2 MiB entry for TLB efficiency, but the struct page array stays intact, so no memory is saved. Only explicit HugeTLB pool allocation with HVO addresses this.

I do want to be clear that currently this only works with CPU inference; that's because I haven't tackled hipMalloc. (That would open the door to directly passing memory to the GPU without reallocating.) I also am unsure of how this would work with other unified memory systems.

Approach

In #2251, @slaren suggested adding MAP_HUGETLB to the existing mmap call. @qdacsvx noted the kernel rejects MAP_HUGETLB on regular descriptors (EINVAL). This PR implements what @slaren had in mind:

llama_mmap allocates an anonymous region with MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB | MAP_POPULATE
load_all_data populates the region per-tensor via file->seek + file->read_raw (same pattern as the existing --no-mmap branch)
After all tensors are loaded, mprotect downgrades to PROT_READ

MAP_POPULATE forces atomic pool allocation at mmap time, so if the pool is insufficient, it yields a clean ENOMEM (with a friendly diagnostic including sysctl guidance), not a SIGBUS during load.
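
The allocate / fill / seal sequence described above can be sketched as follows. This is a standalone illustration, not the code in llama-mmap.cpp: the struct and function names are hypothetical, and the MAP_HUGE_* fallback defines are only there for older libc headers.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/mman.h>

// Fallbacks for older libc headers; values match the Linux UAPI definitions.
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_2MB
#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)  // log2(2 MiB) = 21
#endif

constexpr std::size_t kHugePageSize = 2u * 1024 * 1024;

struct hugetlb_region {
    void *      addr = nullptr;  // nullptr on failure
    std::size_t size = 0;        // rounded up to a multiple of 2 MiB
    int         err  = 0;        // errno from mmap on failure, else 0
};

// Reserve and pre-fault an anonymous, writable HugeTLB region.
inline hugetlb_region hugetlb_alloc(std::size_t bytes) {
    hugetlb_region r;
    r.size = (bytes + kHugePageSize - 1) / kHugePageSize * kHugePageSize;
    void * p = mmap(nullptr, r.size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB | MAP_POPULATE,
                    -1, 0);
    if (p == MAP_FAILED) {
        r.err = errno;  // typically ENOMEM when the pool is too small
        return r;
    }
    r.addr = p;
    return r;
}

// After the weights have been copied in, drop write permission.
inline bool hugetlb_seal(const hugetlb_region & r) {
    assert(r.addr != nullptr);
    return mprotect(r.addr, r.size, PROT_READ) == 0;
}

inline void hugetlb_free(hugetlb_region & r) {
    if (r.addr != nullptr) {
        munmap(r.addr, r.size);
        r.addr = nullptr;
    }
}
```

A caller would check r.err right after hugetlb_alloc and print the sysctl hint on ENOMEM, before any tensor data is read.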

Arg-parse rejects --hugepages combined with --no-mmap (wording points at a followup PR), --direct-io (the loader's existing conflict resolver would silently bypass our code), and non-Linux platforms.
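
As a rough illustration of those rejections, the check could look like the sketch below. The struct, field names, and error strings are hypothetical and do not reflect the actual arg-parse code or wording in this PR.

```cpp
#include <cassert>
#include <string>

// Hypothetical option struct; field names are illustrative, not llama.cpp's.
struct load_opts {
    bool use_hugepages = false;
    bool use_mmap      = true;
    bool use_direct_io = false;
};

// Returns an empty string when the combination is acceptable,
// otherwise a human-readable reason for rejecting it.
inline std::string check_hugepages_opts(const load_opts & o) {
    if (!o.use_hugepages) {
        return "";  // flag not requested: nothing to validate
    }
#if !defined(__linux__)
    return "--hugepages is only supported on Linux";
#else
    if (!o.use_mmap) {
        return "--hugepages cannot be combined with --no-mmap";
    }
    if (o.use_direct_io) {
        return "--hugepages cannot be combined with --direct-io";
    }
    return "";
#endif
}
```

Rejecting these combinations up front keeps the loader from silently falling back to a path that never touches the hugepage region.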

Anonymous mapping is the only path to HugeTLB-backed weight memory without requiring a hugetlbfs mount, copying the model, and recompiling, a la PR #12521 #12552 (issue #12444), which isn't especially user-friendly. This PR avoids all of that — just sysctl + --hugepages.

PR #7420 also used anonymous mappings inside llama_mmap, but for direct-I/O bypass rather than hugepage backing. A concern raised there was that anonymous memory can swap under pressure. That doesn't apply here, since HugeTLB pages are inherently pinned by the kernel and cannot be swapped, reclaimed, or migrated regardless of memory pressure or RLIMIT_MEMLOCK settings. They can be relinquished / reused, though.

Tradeoffs

Warm-load time increases because --mmap normally shares page-cache pages zero-copy, while --hugepages must copy the file into the anonymous region via read_raw.

Measured on qwen3-235B-Q4 (19.4 GB), Strix Halo, Linux 6.17:

Path	Cold	Warm
`--mmap` baseline	3843 ms	547 ms
`--hugepages`	5030 ms	2319 ms
Slowdown	1.31x	4.24x

The warm 4.24x is driven by Strix Halo's copy_to_user bandwidth (~8.5 GB/s on LPDDR5X). Other platforms may see less. At the 128 GiB target scale, warm-load adds ~11 seconds per session, in exchange for the model actually fitting.

Changes

There are three primary changes (in these three commits):

Adding the CLI parameter --hugepages (and LLAMA_ARG_HUGEPAGES)
Modifying function signatures & data structure to accommodate hugepages tracking
Implementation of the hugepages allocation (llama-mmap.cpp)

Testing

Builds clean on Linux x86_64, GCC 14.2, both CPU-only and HIP/ROCm configurations
End-to-end verified on Strix Halo (gfx1151, ROCm 7.2.0, Linux 6.17):
model loads with --hugepages, inference produces correct output
(39 t/s prompt, 8.2 t/s generation on qwen3 19.4 GB)
--help shows the flag with env var
Arg-parse rejections verified (--hugepages --no-mmap, --hugepages -dio)
Standalone harness validated the core mechanism on Strix Halo with a 19.4 GB GGUF:
- Pool accounting exact (9424 x 2 MiB pages consumed/restored)
- VmRSS confirms vmemmap reclamation: 3872 kB under --hugepages vs 19.30 GB baseline
- No SIGBUS under MAP_POPULATE
- read_raw EINTR/short-read handling inherited from existing code
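
For readers unfamiliar with the EINTR / short-read handling mentioned in the last item, the general pattern looks like the sketch below. It uses plain pread as a stand-in for the loader's file->seek + file->read_raw; the function names are illustrative and this is not the existing llama.cpp code.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Read exactly `len` bytes at `offset` into `dst`, retrying on EINTR and
// continuing after short reads. Returns false on error or premature EOF.
inline bool pread_full(int fd, void * dst, std::size_t len, off_t offset) {
    char * out = static_cast<char *>(dst);
    while (len > 0) {
        const ssize_t n = pread(fd, out, len, offset);
        if (n < 0) {
            if (errno == EINTR) {
                continue;      // interrupted by a signal: retry the same read
            }
            return false;      // real I/O error, errno is set
        }
        if (n == 0) {
            return false;      // EOF before the requested range was complete
        }
        out    += n;
        offset += n;
        len    -= static_cast<std::size_t>(n);
    }
    return true;
}

// Copy one byte range of a file (for example, one tensor) into `dst`.
inline bool read_range(const char * path, void * dst, std::size_t len, off_t offset) {
    const int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return false;
    }
    const bool ok = pread_full(fd, dst, len, offset);
    close(fd);
    return ok;
}
```

In the hugepages path, `dst` would point into the anonymous region at the tensor's offset.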

HVO verification

For users to confirm HVO is active:

# compiled in?
grep CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP /boot/config-$(uname -r)
# → CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y

# enabled at runtime?
cat /proc/sys/vm/hugetlb_optimize_vmemmap
# → 1  (if 0: sudo sysctl -w vm.hugetlb_optimize_vmemmap=1)

# pool state
grep -i huge /proc/meminfo
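
The same pool check can be done programmatically before attempting a load. The sketch below parses /proc/meminfo-style text and compares HugePages_Free against the pages a model would need. The function names are illustrative, and it deliberately ignores HugePages_Rsvd, so treat the result as an estimate.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>

// Extract the numeric value of a "Key:   value [kB]" line from
// /proc/meminfo-style text. Returns -1 if the key is absent or malformed.
inline long long meminfo_value(const std::string & text, const std::string & key) {
    std::istringstream in(text);
    std::string line;
    const std::string prefix = key + ":";
    while (std::getline(in, line)) {
        if (line.compare(0, prefix.size(), prefix) == 0) {
            std::istringstream fields(line.substr(prefix.size()));
            long long v = -1;
            fields >> v;
            return fields.fail() ? -1 : v;
        }
    }
    return -1;
}

// True when the free pool can hold `bytes`, using Hugepagesize (reported in kB).
inline bool pool_can_hold(const std::string & meminfo, std::uint64_t bytes) {
    const long long free_pages = meminfo_value(meminfo, "HugePages_Free");
    const long long page_kb    = meminfo_value(meminfo, "Hugepagesize");
    if (free_pages < 0 || page_kb <= 0) {
        return false;
    }
    const std::uint64_t page_bytes = static_cast<std::uint64_t>(page_kb) * 1024;
    const std::uint64_t needed     = (bytes + page_bytes - 1) / page_bytes;
    return static_cast<std::uint64_t>(free_pages) >= needed;
}

// Convenience helper that slurps a file such as /proc/meminfo into a string.
inline std::string read_text_file(const char * path) {
    std::ifstream f(path);
    std::ostringstream ss;
    ss << f.rdbuf();
    return ss.str();
}
```

Usage would be along the lines of pool_can_hold(read_text_file("/proc/meminfo"), model_bytes), where model_bytes is the size of the weights to be loaded.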

Follow-ups (future work / not this PR)

--no-mmap path + KV cache + compute buffers via a new ggml_backend_cpu_hugetlb_buffer_type (parallel to the existing HBM pattern). Reuses the --hugepages flag.
Multi-threaded read_raw for load parallelism
posix_fadvise(POSIX_FADV_DONTNEED) on the source page cache after load
buffer_from_host_ptr is hardcoded false for CUDA/HIP (ggml-cuda.cu:4710). This PR is forward-compatible with a future flip for zero-copy on unified-memory APUs.

cc @ggerganov @slaren — per prior comments in #2251, I was hoping you could let me know what your take is on this approach. I know this issue matters for me. I've done my best to simplify / minimize code changes, but I am always happy to reconsider my approach as needed. (I hope it's OK to tag you; sorry if I missed a policy against it.)

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, AI was used in the process of preparing these commits.

AI was used to identify the appropriate strategy, draft a harness, and draft initial code snippets. Every line was reviewed, edited as appropriate, and included in commits. Sections of code were separated manually into different commits, each with a specific focus, to ensure clarity and appropriate review (e.g., CLI parameter setup, data structures / call signatures, and final implementation were all placed in separate commits during review and manual processing).

…ing (Linux) (flag only, not implementation)

First commit to add --hugepages option (CLI). This commit only adds the structure for the flag but does not change any other code. (Commit 1/3) Full descriptions included here for convenience...

---

Back model weight mappings with anonymous 2 MiB HugeTLB pages on Linux, activated by a new --hugepages CLI flag (env: LLAMA_ARG_HUGEPAGES).

Primary benefit is kernel vmemmap reclamation via HugeTLB Vmemmap Optimization (HVO), not TLB speedup. On a 128 GiB system fully backed with 2 MiB hugepages this frees ~1.75 GiB of struct page memory, turning tight-ceiling workloads from OOM into working.

Mechanism: llama_mmap allocates MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB|MAP_HUGE_2MB|MAP_POPULATE (zero-filled), then load_all_data populates the region per-tensor via file->read_raw before check_tensors and view allocation consume it. mprotect downgrades to PROT_READ after load. MAP_POPULATE is a race-safety guarantee (pool-short → clean ENOMEM at mmap time, not SIGBUS mid-load).

Measured on qwen3 19.4 GB / Strix Halo: 1.31x cold / 4.24x warm slowdown vs baseline mmap; vmemmap reclamation confirmed via VmRSS delta (3872 kB hugepages vs 19.30 GB baseline).

This is the second commit to add huge pages support. This commit prepares the data structures and function call signatures but does not change functionality. (commit 2/3)

…ing (Linux) (**implementation**) This is the final commit of 3 to implement hugepages support. This is the implementation; previous commits were preparatory. (The description is identical to the first commit.)

loci-review · 2026-04-13T05:04:38Z

The flame graphs show the transformation from a trivial 10.7ns operation (base) to a 10-level call hierarchy (target) dominated by common_log infrastructure initialization (8,357ns, 88.8% of time), including vector operations (5,687ns), thread synchronization (366ns), and memory management. This represents the architectural migration to worker-thread-based asynchronous logging.

Additional Findings

Inference hot path unaffected: All performance-critical components remain unchanged—llama_decode(), matrix operations (70-90% of inference time), attention mechanisms, KV cache, quantization kernels, and all GPU backends (CUDA, Metal, HIP, Vulkan, SYCL). The hugepages feature adds memory optimization without modifying computation logic, providing 5-15% memory access improvement for large models through reduced TLB misses. Runtime benefits far exceed the 9.6μs maximum startup overhead.

💬 Questions? Tag @loci-dev

doctorjei and others added 3 commits April 12, 2026 14:02

hugepages support - data structures & function calls (no implementation)

359f51a


loci-dev temporarily deployed to PROD__AL_DEMO April 13, 2026 03:12 — with GitHub Actions Inactive

loci-dev force-pushed the main branch 8 times, most recently from 7638ab4 to f1b46d5 Compare April 20, 2026 02:19

loci-dev wants to merge 3 commits into main from loci/pr-21821-hugepages-pr
