
UPSTREAM PR #21821: llama : add --hugepages for HugeTLB-backed weight loading (Linux)#1347

Open
loci-dev wants to merge 3 commits into main from loci/pr-21821-hugepages-pr

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#21821

Overview

Addresses #2251 (partially — weights via the --mmap path only; I hope to address --no-mmap and the KV cache in a follow-up). I realize this is a non-trivial change (even though the implementation itself is just one file / two blocks), so I'm going into a bit more detail than I usually would with a PR. :)

Summary

Adds a new --hugepages CLI flag (env: LLAMA_ARG_HUGEPAGES) that backs model weight memory with anonymous 2 MiB HugeTLB pages on Linux.

Motivation: vmemmap reclamation
(not TLB speedup, though it may facilitate future work in this area)

When HugeTLB Vmemmap Optimization (HVO, CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y) is enabled, the kernel frees the per-4 KiB struct page metadata within each hugepage. On a 128 GiB system this recovers ~1.75 GiB of kernel memory — enough to turn a tight-ceiling workload from OOM into working (which is what was happening to me).

The flag is opt-in at runtime, so there's no cost to anyone who doesn't use it.

# reserve the pool (runtime, no reboot for 2 MiB pages)
sudo sysctl -w vm.nr_hugepages=65536   # 65536 x 2 MiB = 128 GiB

# run with the flag
./llama-cli --hugepages -m model.gguf ...

Why?

Each 4 KiB page costs ~64 bytes of struct page metadata. With 128 GiB, that's ~2 GiB just to track pages. HugeTLB with HVO reduces this: the vmemmap for a 2 MiB hugepage shrinks from 32 KiB to 4 KiB.

| System RAM | Approx. Benefit |
|-----------:|----------------:|
| 16 GiB     | ~224 MiB        |
| 64 GiB     | ~896 MiB        |
| 128 GiB    | ~1.75 GiB       |

Here's my real-life example: a Strix APU with 128 GiB of RAM (unified, in my case) running MiniMax m2.5 IQ4_XS. The total footprint is ~127,910 MiB against ~127,342 MiB available ("normal" pages). Saving 1,792 MiB pushes available memory to ~129,134 MiB, so the model now fits with ~1 GiB to spare.

Transparent Huge Pages (MADV_HUGEPAGE) don't help here: THP remaps existing 4 KiB pages under a 2 MiB entry for TLB efficiency, but the struct page array stays intact, so no memory is saved. Only explicit HugeTLB pool allocation with HVO addresses this.

I do want to be clear that this currently works only with CPU inference; that's because I haven't tackled hipMalloc. (That would open the door to passing memory directly to the GPU without reallocating.) I'm also unsure how this would work on other unified-memory systems.

Approach

In #2251, @slaren suggested adding MAP_HUGETLB to the existing mmap call. @qdacsvx noted the kernel rejects MAP_HUGETLB on regular descriptors (EINVAL). This PR implements what @slaren had in mind:

  1. llama_mmap allocates an anonymous region with MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB | MAP_POPULATE
  2. load_all_data populates the region per-tensor via file->seek + file->read_raw (same pattern as the existing --no-mmap branch)
  3. After all tensors are loaded, mprotect downgrades to PROT_READ

MAP_POPULATE forces atomic pool allocation at mmap time, so if the pool is insufficient, it yields a clean ENOMEM (with a friendly diagnostic including sysctl guidance), not a SIGBUS during load.

Arg-parse rejects --hugepages combined with --no-mmap (wording points at a followup PR), --direct-io (the loader's existing conflict resolver would silently bypass our code), and non-Linux platforms.

Anonymous mapping is the only path to HugeTLB-backed weight memory that doesn't require a hugetlbfs mount, copying the model, and recompiling, a la PR #12521 #12552 (issue #12444), which isn't especially user-friendly. This PR avoids all of that — just sysctl + --hugepages.

PR #7420 also used anonymous mappings inside llama_mmap, but for direct-I/O bypass rather than hugepage backing. A concern raised there was that anonymous memory can swap under pressure. That doesn't apply here, since HugeTLB pages are inherently pinned by the kernel and cannot be swapped, reclaimed, or migrated regardless of memory pressure or RLIMIT_MEMLOCK settings. They can be relinquished / reused, though.

Tradeoffs

Warm-load time increases because --mmap normally shares page-cache pages zero-copy, while --hugepages must read_raw the file into the anonymous region.

Measured on qwen3-235B-Q4 (19.4 GB), Strix Halo, Linux 6.17:

| Path            | Cold    | Warm    |
|-----------------|--------:|--------:|
| --mmap baseline | 3843 ms | 547 ms  |
| --hugepages     | 5030 ms | 2319 ms |
| Slowdown        | 1.31x   | 4.24x   |

The warm 4.24x is driven by Strix Halo's copy_to_user bandwidth (~8.5 GB/s on LPDDR5X). Other platforms may see less. At the 128 GiB target scale, warm-load adds ~11 seconds per session — in exchange for the model actually fitting.

Changes

There are three primary changes (in these three commits):

  • Adding the CLI parameter --hugepages (and LLAMA_ARG_HUGEPAGES)
  • Modifying function signatures & data structure to accommodate hugepages tracking
  • Implementation of the hugepages allocation (llama-mmap.cpp)

Testing

  • Builds clean on Linux x86_64, GCC 14.2, both CPU-only and HIP/ROCm configurations
  • End-to-end verified on Strix Halo (gfx1151, ROCm 7.2.0, Linux 6.17):
    model loads with --hugepages, inference produces correct output
    (39 t/s prompt, 8.2 t/s generation on qwen3 19.4 GB)
  • --help shows the flag with env var
  • Arg-parse rejections verified (--hugepages --no-mmap, --hugepages -dio)
  • Standalone harness validated the core mechanism on Strix Halo with a 19.4 GB GGUF:
    • Pool accounting exact (9424 x 2 MiB pages consumed/restored)
    • VmRSS confirms vmemmap reclamation: 3872 kB under --hugepages vs 19.30 GB baseline
    • No SIGBUS under MAP_POPULATE
    • read_raw EINTR/short-read handling inherited from existing code

HVO verification

For users to confirm HVO is active:

# compiled in?
grep CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP /boot/config-$(uname -r)
# → CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y

# enabled at runtime?
cat /proc/sys/vm/hugetlb_optimize_vmemmap
# → 1  (if 0: sudo sysctl -w vm.hugetlb_optimize_vmemmap=1)

# pool state
grep -i huge /proc/meminfo

Follow-ups (future work / not this PR)

  • --no-mmap path + KV cache + compute buffers via a new ggml_backend_cpu_hugetlb_buffer_type (parallel to the existing HBM pattern). Reuses
    the --hugepages flag.
  • Multi-threaded read_raw for load parallelism
  • posix_fadvise(POSIX_FADV_DONTNEED) on the source page cache after load
  • buffer_from_host_ptr is hardcoded false for CUDA/HIP (ggml-cuda.cu:4710). This PR is forward-compatible with a future flip for zero-copy on unified-memory APUs.

cc @ggerganov @slaren — per prior comments in #2251, I was hoping you could let me know what your take is on this approach. I know this issue matters for me. I've done my best to simplify / minimize code changes, but I am always happy to reconsider my approach as needed. (I hope it's OK to tag you; sorry if I missed a policy against it.)

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, AI was used in the process of preparing these commits.

AI was used to identify the appropriate strategy, draft a harness, and draft initial code snippets. Every line was reviewed, edited as appropriate, and committed manually, with sections of code separated into focused commits (CLI parameter setup, data structures / call signatures, and the final implementation) to ensure clarity and appropriate review.

doctorjei and others added 3 commits April 12, 2026 14:02
…ing (Linux) (flag only, not implementation)

First commit to add --hugepages option (CLI). This commit only adds the structure for the flag but does not change any other code. (Commit 1/3)

Full descriptions included here for convenience...
---
Back model weight mappings with anonymous 2 MiB HugeTLB pages on
Linux, activated by a new --hugepages CLI flag (env: LLAMA_ARG_HUGEPAGES).

Primary benefit is kernel vmemmap reclamation via HugeTLB Vmemmap
Optimization (HVO) — not TLB speedup. On a 128 GiB system fully
backed with 2 MiB hugepages this frees ~1.75 GiB of struct page
memory, turning tight-ceiling workloads from OOM into working.

Mechanism: llama_mmap allocates MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB|
MAP_HUGE_2MB|MAP_POPULATE (zero-filled), then load_all_data populates
the region per-tensor via file->read_raw before check_tensors and
view allocation consume it. mprotect downgrades to PROT_READ after
load. MAP_POPULATE is a race-safety guarantee (pool-short → clean
ENOMEM at mmap time, not SIGBUS mid-load).

Measured on qwen3 19.4 GB / Strix Halo: 1.31x cold / 4.24x warm
slowdown vs baseline mmap; vmemmap reclamation confirmed via VmRSS
delta (3872 kB hugepages vs 19.30 GB baseline).
This is the second commit to add huge pages support. This commit prepares the data structures and function call signatures but does not change functionality. (Commit 2/3)
…ing (Linux) (**implementation**)

This is the final commit of 3 to implement hugepages support. This is the implementation; previous commits were preparatory. (The description below is identical to the first commit.)
---
Back model weight mappings with anonymous 2 MiB HugeTLB pages on
Linux, activated by a new --hugepages CLI flag (env: LLAMA_ARG_HUGEPAGES).

Primary benefit is kernel vmemmap reclamation via HugeTLB Vmemmap
Optimization (HVO) — not TLB speedup. On a 128 GiB system fully
backed with 2 MiB hugepages this frees ~1.75 GiB of struct page
memory, turning tight-ceiling workloads from OOM into working.

Mechanism: llama_mmap allocates MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB|
MAP_HUGE_2MB|MAP_POPULATE (zero-filled), then load_all_data populates
the region per-tensor via file->read_raw before check_tensors and
view allocation consume it. mprotect downgrades to PROT_READ after
load. MAP_POPULATE is a race-safety guarantee (pool-short → clean
ENOMEM at mmap time, not SIGBUS mid-load).

Measured on qwen3 19.4 GB / Strix Halo: 1.31x cold / 4.24x warm
slowdown vs baseline mmap; vmemmap reclamation confirmed via VmRSS
delta (3872 kB hugepages vs 19.30 GB baseline).
@loci-review

loci-review Bot commented Apr 13, 2026

Flame Graph: build.bin.llama-tts::arg.cpp__ZZ25common_params_parser_initR13common_params13llama_examplePFviPPcEENKUlS0_E44_clES0_

The flame graphs show the transformation from a trivial 10.7ns operation (base) to a 10-level call hierarchy (target) dominated by common_log infrastructure initialization (8,357ns, 88.8% of time), including vector operations (5,687ns), thread synchronization (366ns), and memory management. This represents the architectural migration to worker-thread-based asynchronous logging.

Additional Findings

Inference hot path unaffected: All performance-critical components remain unchanged—llama_decode(), matrix operations (70-90% of inference time), attention mechanisms, KV cache, quantization kernels, and all GPU backends (CUDA, Metal, HIP, Vulkan, SYCL). The hugepages feature adds memory optimization without modifying computation logic, providing 5-15% memory access improvement for large models through reduced TLB misses. Runtime benefits far exceed the 9.6μs maximum startup overhead.

💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 7638ab4 to f1b46d5 Compare April 20, 2026 02:19
