
Add --defer-experts flag to defer expert mmap residency on Linux #1634

Merged
ikawrakow merged 2 commits into ikawrakow:main from dmaivel:defer-experts
Apr 16, 2026

Conversation

@dmaivel
Contributor

@dmaivel dmaivel commented Apr 14, 2026

This PR introduces --defer-experts. With this flag, expert tensor pages are faulted in on demand rather than being eagerly loaded during initialization. This reduces cold-start latency and improves the load time of MoE models, particularly on systems where models are run directly off storage.

TL;DR: when users want to run an MoE model that exceeds their system RAM + VRAM, --defer-experts --no-warmup brings load times down from minutes to ~10s.


I benchmarked two models that I have on hand on a 64 GiB RAM system:

  • Qwen3.5-122B-A10B @ IQ2_KS ~43.3 GiB
  • Qwen3.5-397B-A17B @ IQ2_XS ~123.2 GiB

GGML_CUDA_NO_PINNED=1 was set to force the use of mmap, which this feature requires.

I used ./llama-cli with a ~1k token prompt and -n 1 to measure time to first token on cold-starts:

| Model | Config | Median TTFT |
| --- | --- | --- |
| Qwen3.5-122B-A10B | baseline | 37.89s |
| Qwen3.5-122B-A10B | --no-warmup | 37.06s |
| Qwen3.5-122B-A10B | --defer-experts | 18.37s |
| Qwen3.5-122B-A10B | --defer-experts --no-warmup | 21.61s |
| Qwen3.5-397B-A17B | baseline | 121.59s |
| Qwen3.5-397B-A17B | --no-warmup | 93.72s |
| Qwen3.5-397B-A17B | --defer-experts | 61.96s |
| Qwen3.5-397B-A17B | --defer-experts --no-warmup | 30.90s |

  • Qwen3.5-122B-A10B loads ~2.06x faster when comparing baseline to --defer-experts
  • Qwen3.5-397B-A17B loads ~1.96x faster when comparing baseline to --defer-experts, ~3.93x faster with --no-warmup included

I also ran llama-bench with default parameters to check for any PP and TG regressions (again dropping the page cache between runs):

| model | size | params | backend | ngl | threads | defer | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 397B.A17B Q6_K | 123.22 GiB | 396.35 B | CUDA | 99 | 24 | 1 | pp512 | 25.45 ± 5.39 |
| qwen35moe 397B.A17B Q6_K | 123.22 GiB | 396.35 B | CUDA | 99 | 24 | 1 | tg128 | 6.63 ± 1.08 |
| qwen35moe 397B.A17B Q6_K | 123.22 GiB | 396.35 B | CUDA | 99 | 24 | 0 | pp512 | 24.98 ± 4.72 |
| qwen35moe 397B.A17B Q6_K | 123.22 GiB | 396.35 B | CUDA | 99 | 24 | 0 | tg128 | 6.30 ± 1.00 |

| model | size | params | backend | ngl | threads | defer | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 122B.A10B IQ2_KL - 2.6875 bpw | 43.32 GiB | 122.11 B | CUDA | 99 | 24 | 1 | pp512 | 120.45 ± 55.26 |
| qwen35moe 122B.A10B IQ2_KL - 2.6875 bpw | 43.32 GiB | 122.11 B | CUDA | 99 | 24 | 1 | tg128 | 23.20 ± 1.48 |
| qwen35moe 122B.A10B IQ2_KL - 2.6875 bpw | 43.32 GiB | 122.11 B | CUDA | 99 | 24 | 0 | pp512 | 160.56 ± 3.53 |
| qwen35moe 122B.A10B IQ2_KL - 2.6875 bpw | 43.32 GiB | 122.11 B | CUDA | 99 | 24 | 0 | tg128 | 25.05 ± 0.12 |

build: c5acec8 (4412)

I wouldn't recommend --defer-experts --no-warmup for models that fit in RAM (as the tables show, I observe slightly reduced performance). Those users are better off not using mmap in the first place, or trying just --defer-experts.

For the smaller model, I suspect the large variance in the 120.45 ± 55.26 t/s PP result comes from the first run(s) being slowed down by page faulting on a cold start.

In my own use case, running GGML_CUDA_NO_PINNED=1 ./llama-server -ngl 99 -ot exps=CPU ... with Qwen3.5-397B-A17B-IQ2_XS, I see that the server is ready within:

  • baseline: ~2m 1s
  • --no-warmup: ~1m 16s (1.59x faster)
  • --defer-experts: ~41s (2.95x faster)
  • --defer-experts --no-warmup: ~10s (12.1x faster)

Enabling this flag also reports how much memory was deferred:

llm_load_tensors: dense parameters loaded in 4.02s (7.90 GiB), expert parameters deferred (115.31 GiB)

If users are patient enough to run oversized models, baseline behavior may be fine for them as well, in case they hit any regressions in their workflows.

@ikawrakow
Owner

Thank you for the PR.

The benchmark results give CUDA as the backend; what GPU(s) are you using, and what are the command-line arguments?

Also, did you try --dry-run? This will also give you information about VRAM/RAM usage nearly instantaneously without actually loading the tensors.

I think it would be useful to measure TTFT on cold start (caches dropped), a prompt of a given length, and --no-warmup. If --defer-experts is significantly faster than the baseline, this would indicate that there is something wrong with the way the tensors are loaded, and one should try to fix that.

@dmaivel
Contributor Author

dmaivel commented Apr 14, 2026

I'm using a single GPU for testing, NVIDIA GeForce RTX 4080 SUPER.

For dropping caches between runs, I use sudo sh -c 'sync; echo 1 > /proc/sys/vm/drop_caches'.

For my tests with llama-cli I was using:

GGML_CUDA_NO_PINNED=1 ./build/bin/llama-cli \
    -m ~/.../Qwen3.5-397B-A17B-IQ2_XS-00001-of-00004.gguf \
    -c 12000 -ngl 99 -ot exps=CPU --threads 24 \
    -fa on -b 4096 -ub 4096 \
    -ctk q8_0 -ctv q8_0 \
    -wgt 1 \
    -f ~/.../prompt2.txt -n 1 \
    --temp 0 --top-k 1 -s 0 # configured between runs: --defer-experts --no-warmup

where prompt2.txt is ~1,048 tokens.

For llama-bench, I was using:

GGML_CUDA_NO_PINNED=1 ./llama-bench -m ~/.../Qwen3.5-397B-A17B-IQ2_XS-00001-of-00004.gguf \
    -ngl 99 -ot exps=CPU --threads 24 \
    -n 128 -r 3 # --defer-experts

With --dry-run, I still observe slow load times. When I run:

GGML_CUDA_NO_PINNED=1 ./llama-server \
    -m ~/.../Qwen3.5-122B-A10B-IQ2_KL.gguf \
    -c 12000 -ngl 99 -ot exps=CPU --threads 24 \
    -fa on -b 4096 -ub 4096 \
    -wgt 1 \
    --dry-run

I have to wait ~24s before it loads and completes. With --defer-experts, it completes nearly instantaneously.

Using Qwen3.5-397B-A17B-IQ2_XS-00001-of-00004.gguf instead, I wait ~73s. With --defer-experts, this also loads instantaneously.

Interestingly, on all 4 of these runs I observe a crash after loading. I don't believe this is the expected behavior, as evidenced by the massive buffer allocation. I made sure to try this on the main branch as well, and observed it there too.

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 118816.27 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 124587884672
llama_init_from_model: failed to allocate compute buffers
~ggml_backend_cuda_context: have 0 graphs
llama_init_from_gpt_params: error: failed to create context with model '/.../Qwen3.5-397B-A17B-IQ2_XS-00001-of-00004.gguf'
 ERR [              load_model] unable to load model | tid="140020223762432" timestamp=1776187798 model="/.../Qwen3.5-397B-A17B-IQ2_XS-00001-of-00004.gguf"
free(): invalid pointer

Omitting --dry-run from the same command, there is no crashing.


Redoing my TTFT experiments with this command (dropping cache between runs):

GGML_CUDA_NO_PINNED=1 ./llama-cli \
      -m ~/.../Qwen3.5-397B-A17B-IQ2_XS-00001-of-00004.gguf \
      -c 4096 -ngl 99 -ot exps=CPU --threads 24 \
      -fa on -b 4096 -ub 4096 \
      -wgt 1 \
      -f ~/.../prompt2.txt -n 1 --no-warmup # --defer-experts

Without --defer-experts:

llama_print_timings:        load time =   98799.04 ms
llama_print_timings:      sample time =       0.68 ms /     1 runs   (    0.68 ms per token,  1461.99 tokens per second)
llama_print_timings: prompt eval time =   26414.89 ms /  1048 tokens (   25.21 ms per token,    39.67 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   98800.50 ms /  1049 tokens

With --defer-experts:

llama_print_timings:        load time =   29447.51 ms
llama_print_timings:      sample time =       0.28 ms /     1 runs   (    0.28 ms per token,  3623.19 tokens per second)
llama_print_timings: prompt eval time =   25382.59 ms /  1048 tokens (   24.22 ms per token,    41.29 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   29448.97 ms /  1049 tokens

For good measure, I tried this with Qwen3.5-122B-A10B-IQ2_KL.gguf as well, where we observe reduced initial PP performance with deferral:

Without --defer-experts:

llama_print_timings:        load time =   28731.30 ms
llama_print_timings:      sample time =       0.28 ms /     1 runs   (    0.28 ms per token,  3558.72 tokens per second)
llama_print_timings: prompt eval time =    2573.51 ms /  1048 tokens (    2.46 ms per token,   407.23 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   28733.20 ms /  1049 tokens

With --defer-experts:

llama_print_timings:        load time =   21529.31 ms
llama_print_timings:      sample time =       0.36 ms /     1 runs   (    0.36 ms per token,  2808.99 tokens per second)
llama_print_timings: prompt eval time =   19180.01 ms /  1048 tokens (   18.30 ms per token,    54.64 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   21531.16 ms /  1049 tokens

Based on these results, --defer-experts does appear to be faster on cold start. Would this indicate an issue with the way tensors are currently loaded?

Owner

@ikawrakow ikawrakow left a comment


LGTM, but let's have at least one more RAM-poor person test that. I don't have models that do not fit in RAM, and I see no difference in loading time with or without --defer-experts.

@usrlocalben
Contributor

To me it seems like --defer-experts should implicitly set --no-warmup; otherwise it doesn't fully solve the problem as described.
Preload/Warmup could probably be disabled automatically if ram-size < model/exps-size.
I'm not RAM-poor but I tried it out w/drop_caches between invocations and it appears to work as advertised.

@ikawrakow
Owner

@dmaivel

I think @usrlocalben's suggestion is good. Can you add auto-enabling --no-warmup with --defer-experts? Thanks.

@dmaivel
Contributor Author

dmaivel commented Apr 16, 2026

Done

@ikawrakow ikawrakow merged commit 4f4bcfb into ikawrakow:main Apr 16, 2026
@FNsi

FNsi commented Apr 16, 2026

LGTM, but let's have at least one more RAM-poor person test that.

Okay, I tried loading stepfun q8_0 on my laptop; it turns out there's not much difference when the model is around twice the size of RAM, or it's even worse.
(Oh my poor ssd)


First is default --no-warmup, second is --defer-experts

markaalonzo pushed a commit to markaalonzo/ik_llama.cpp that referenced this pull request Apr 17, 2026
…wrakow#1634)

* Add --defer-experts flag to defer expert mmap residency on Linux

* Disable warmup when defer-experts is enabled
@markaalonzo
Contributor

Post-merge observation from a downstream fork carrying this flag. Two things worth flagging for any follow-up work.

Silent-no-op risk in is_split_expert_tensor

src/llama-model-loader.cpp:231 detects split-expert tensors by string-matching a fixed prefix list:

static const char * prefixes[] = { "ffn_gate.", "ffn_down.", "ffn_up." };

The merged-expert path (line 260, is_merged_expert_tensor(llm_tensor tensor_type)) uses the type enum, which is robust to renames. The split path can't — the comment at line 498 notes llm_tensor_type can't disambiguate two-%d formats — so string matching is genuinely needed here.

But the failure mode is silent: if a future arch names split experts differently (e.g. a moe_* convention, or a rename), is_split_expert_tensor returns false for every tensor, build_expert_tensor_index produces an empty range list, and --defer-experts becomes a no-op on that arch with no error. The flag would appear to do nothing, which is hard to diagnose from the user side.

Suggestion: when should_defer_expert_mmaps() returns true and the collected expert ranges are empty (or materially less than n_expert_used_count * n_layer * 3), emit an LLAMA_LOG_WARN naming the arch and noting the flag had no effect. Cheap insurance against a class of regression that wouldn't otherwise surface in tests.

Observed behavior on a Qwen3.5-35B-A3B MoE workload

For calibration of the "reduce model load time" claim: tested on a single-model always-on inference server (RTX 3080 Ti + Intel Core Ultra 7 265, 30 GB RAM, model fits comfortably without the flag).

| Config | VmRSS | Decode steady-state | Run 1 | Run 6 |
| --- | --- | --- | --- | --- |
| --defer-experts off (default) | 14.47 GB | 53.5 tok/s | 47.8 | 53.7 |
| --defer-experts on | 13.66 GB | (degrades each run) | 54.0 | 46.4 |

The flag does what's documented (lower load-time RSS, fast Rep 1), but steady-state throughput monotonically degrades on repeated inference. majflt=0 throughout — the degradation is minor-fault cost plus presumably TLB/THP loss from MADV_DONTNEED clearing the expert ranges, combined with params.warmup=false removing the pass that would re-prefault them.

Not a bug — matches the design tradeoff — but the params.warmup=false coupling in common/common.cpp:1562 is what turns a one-time load-time saving into a permanent per-run cost on MoE workloads where expert activation patterns vary. An always-on inference server (vs. a short-lived bench) likely wants --defer-experts with warmup left enabled, so the first call prefaults and subsequent calls keep the working set hot. Would welcome a separate --defer-experts-no-warmup if the combined behavior is intentional for Linux cold-boot use cases; otherwise decoupling might be worth considering.

Happy to test alternative patches on the same workload if useful.

@usrlocalben
Contributor

usrlocalben commented Apr 17, 2026

An always-on inference server (vs. a short-lived bench) likely wants --defer-experts with warmup left enabled, so the first call prefaults and subsequent calls keep the working set hot.

This statement indicates that you don't understand the problem this PR aims to solve.

This is for systems that don't have RAM to load the model and instead read from block (or similar) storage "directly" during compute via mmap & page-faults. The normal startup preload/warm-up events are just a cost to them without benefit.

They will never warm up. They will never have a hot working set.
