
Add --defer-experts flag to defer expert mmap residency on Linux #1634

Merged
ikawrakow merged 2 commits into ikawrakow:main from dmaivel:defer-experts
Apr 16, 2026

Conversation

@dmaivel
Contributor

@dmaivel dmaivel commented Apr 14, 2026

This PR introduces --defer-experts. With this flag, expert tensor pages are faulted in on demand rather than being eagerly loaded during initialization. This reduces cold-start latency and improves the load time of MoE models, particularly on systems where models are run directly off storage.

TL;DR: when users want to run an MoE model that exceeds their system RAM + VRAM, --defer-experts --no-warmup brings load times down from minutes to ~10s.


I benchmarked two models that I have on hand on a 64 GiB RAM system:

  • Qwen3.5-122B-A10B @ IQ2_KS ~43.3 GiB
  • Qwen3.5-397B-A17B @ IQ2_XS ~123.2 GiB

GGML_CUDA_NO_PINNED=1 was set to force the use of mmap, which this feature requires.

I used ./llama-cli with a ~1k token prompt and -n 1 to measure time to first token on cold-starts:

| Model | Config | Median TTFT |
| --- | --- | --- |
| Qwen3.5-122B-A10B | baseline | 37.89s |
| Qwen3.5-122B-A10B | --no-warmup | 37.06s |
| Qwen3.5-122B-A10B | --defer-experts | 18.37s |
| Qwen3.5-122B-A10B | --defer-experts --no-warmup | 21.61s |
| Qwen3.5-397B-A17B | baseline | 121.59s |
| Qwen3.5-397B-A17B | --no-warmup | 93.72s |
| Qwen3.5-397B-A17B | --defer-experts | 61.96s |
| Qwen3.5-397B-A17B | --defer-experts --no-warmup | 30.90s |

  • Qwen3.5-122B-A10B loads ~2.06x faster when comparing baseline to --defer-experts
  • Qwen3.5-397B-A17B loads ~1.96x faster when comparing baseline to --defer-experts, ~3.93x faster with --no-warmup included

I also ran llama-bench with default parameters to check for any PP and TG regressions (again dropping the page cache between runs):

| model | size | params | backend | ngl | threads | defer | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 397B.A17B Q6_K | 123.22 GiB | 396.35 B | CUDA | 99 | 24 | 1 | pp512 | 25.45 ± 5.39 |
| qwen35moe 397B.A17B Q6_K | 123.22 GiB | 396.35 B | CUDA | 99 | 24 | 1 | tg128 | 6.63 ± 1.08 |
| qwen35moe 397B.A17B Q6_K | 123.22 GiB | 396.35 B | CUDA | 99 | 24 | 0 | pp512 | 24.98 ± 4.72 |
| qwen35moe 397B.A17B Q6_K | 123.22 GiB | 396.35 B | CUDA | 99 | 24 | 0 | tg128 | 6.30 ± 1.00 |

| model | size | params | backend | ngl | threads | defer | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 122B.A10B IQ2_KL - 2.6875 bpw | 43.32 GiB | 122.11 B | CUDA | 99 | 24 | 1 | pp512 | 120.45 ± 55.26 |
| qwen35moe 122B.A10B IQ2_KL - 2.6875 bpw | 43.32 GiB | 122.11 B | CUDA | 99 | 24 | 1 | tg128 | 23.20 ± 1.48 |
| qwen35moe 122B.A10B IQ2_KL - 2.6875 bpw | 43.32 GiB | 122.11 B | CUDA | 99 | 24 | 0 | pp512 | 160.56 ± 3.53 |
| qwen35moe 122B.A10B IQ2_KL - 2.6875 bpw | 43.32 GiB | 122.11 B | CUDA | 99 | 24 | 0 | tg128 | 25.05 ± 0.12 |

build: c5acec8 (4412)

I wouldn't recommend --defer-experts --no-warmup for models that fit in RAM (as the tables show, I observe slightly reduced performance). Those users are better off not using mmap in the first place, or trying just --defer-experts.

For the smaller model, I suspect the large variance in the 120.45 ± 55.26 t/s PP result comes from the first run(s) being slowed down by page faulting on a cold start.

In my own use case, running GGML_CUDA_NO_PINNED=1 ./llama-server -ngl 99 -ot exps=CPU ... with Qwen3.5-397B-A17B-IQ2_XS, I see that the server is ready within:

  • baseline: ~2m 1s
  • --no-warmup: ~1m 16s (1.59x faster)
  • --defer-experts: ~41s (2.95x faster)
  • --defer-experts --no-warmup: ~10s (12.1x faster)

Enabling this flag also reports how much memory was deferred:

llm_load_tensors: dense parameters loaded in 4.02s (7.90 GiB), expert parameters deferred (115.31 GiB)

If users are patient enough to run oversized models, baseline behavior may be fine for them as well, in case they hit any regressions in their workflows.

@ikawrakow
Owner

Thank you for the PR.

The benchmark results give CUDA as the backend; what GPU(s) are you using, and what are the command-line arguments?

Also, did you try --dry-run? This will also give you information about VRAM/RAM usage nearly instantaneously without actually loading the tensors.

I think it would be useful to measure TTFT on cold start (caches dropped), a prompt of a given length, and --no-warmup. If --defer-experts is significantly faster than the baseline, this would indicate that there is something wrong with the way the tensors are loaded, and one should try to fix that.

@dmaivel
Contributor Author

dmaivel commented Apr 14, 2026

I'm using a single GPU for testing, NVIDIA GeForce RTX 4080 SUPER.

For dropping caches between runs, I use sudo sh -c 'sync; echo 1 > /proc/sys/vm/drop_caches'.

For my tests with llama-cli I was using:

GGML_CUDA_NO_PINNED=1 ./build/bin/llama-cli \
    -m ~/.../Qwen3.5-397B-A17B-IQ2_XS-00001-of-00004.gguf \
    -c 12000 -ngl 99 -ot exps=CPU --threads 24 \
    -fa on -b 4096 -ub 4096 \
    -ctk q8_0 -ctv q8_0 \
    -wgt 1 \
    -f ~/.../prompt2.txt -n 1 \
    --temp 0 --top-k 1 -s 0 # configured between runs: --defer-experts --no-warmup

where prompt2.txt is ~1,048 tokens.

For llama-bench, I was using:

GGML_CUDA_NO_PINNED=1 ./llama-bench -m ~/.../Qwen3.5-397B-A17B-IQ2_XS-00001-of-00004.gguf \
    -ngl 99 -ot exps=CPU --threads 24 \
    -n 128 -r 3 # --defer-experts

With --dry-run, I still observe slow load times. When I run:

GGML_CUDA_NO_PINNED=1 ./llama-server \
    -m ~/.../Qwen3.5-122B-A10B-IQ2_KL.gguf \
    -c 12000 -ngl 99 -ot exps=CPU --threads 24 \
    -fa on -b 4096 -ub 4096 \
    -wgt 1 \
    --dry-run

I have to wait ~24s before it loads and completes. With --defer-experts, it completes nearly instantaneously.

Using Qwen3.5-397B-A17B-IQ2_XS-00001-of-00004.gguf instead, I wait ~73s. With --defer-experts, this also loads instantaneously.

Interestingly, on all 4 of these runs I observe a crash after loading. I don't believe this is the expected behavior, as evidenced by the massive buffer allocation. I made sure to try this on the main branch as well, and observed it there too.

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 118816.27 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 124587884672
llama_init_from_model: failed to allocate compute buffers
~ggml_backend_cuda_context: have 0 graphs
llama_init_from_gpt_params: error: failed to create context with model '/.../Qwen3.5-397B-A17B-IQ2_XS-00001-of-00004.gguf'
 ERR [              load_model] unable to load model | tid="140020223762432" timestamp=1776187798 model="/.../Qwen3.5-397B-A17B-IQ2_XS-00001-of-00004.gguf"
free(): invalid pointer

Omitting --dry-run from the same command, there is no crashing.


Redoing my TTFT experiments with this command (dropping cache between runs):

GGML_CUDA_NO_PINNED=1 ./llama-cli \
      -m ~/.../Qwen3.5-397B-A17B-IQ2_XS-00001-of-00004.gguf \
      -c 4096 -ngl 99 -ot exps=CPU --threads 24 \
      -fa on -b 4096 -ub 4096 \
      -wgt 1 \
      -f ~/.../prompt2.txt -n 1 --no-warmup # --defer-experts

Without --defer-experts:

llama_print_timings:        load time =   98799.04 ms
llama_print_timings:      sample time =       0.68 ms /     1 runs   (    0.68 ms per token,  1461.99 tokens per second)
llama_print_timings: prompt eval time =   26414.89 ms /  1048 tokens (   25.21 ms per token,    39.67 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   98800.50 ms /  1049 tokens

With --defer-experts:

llama_print_timings:        load time =   29447.51 ms
llama_print_timings:      sample time =       0.28 ms /     1 runs   (    0.28 ms per token,  3623.19 tokens per second)
llama_print_timings: prompt eval time =   25382.59 ms /  1048 tokens (   24.22 ms per token,    41.29 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   29448.97 ms /  1049 tokens

For good measure, I tried this with Qwen3.5-122B-A10B-IQ2_KL.gguf as well, where we observe reduced initial PP performance with deferral:

Without --defer-experts:

llama_print_timings:        load time =   28731.30 ms
llama_print_timings:      sample time =       0.28 ms /     1 runs   (    0.28 ms per token,  3558.72 tokens per second)
llama_print_timings: prompt eval time =    2573.51 ms /  1048 tokens (    2.46 ms per token,   407.23 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   28733.20 ms /  1049 tokens

With --defer-experts:

llama_print_timings:        load time =   21529.31 ms
llama_print_timings:      sample time =       0.36 ms /     1 runs   (    0.36 ms per token,  2808.99 tokens per second)
llama_print_timings: prompt eval time =   19180.01 ms /  1048 tokens (   18.30 ms per token,    54.64 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   21531.16 ms /  1049 tokens

Based on these results, --defer-experts does appear to be faster on cold start. Would this indicate an issue with the way tensors are currently loaded?

Owner

@ikawrakow ikawrakow left a comment


LGTM, but let's have at least one more RAM-poor person test that. I don't have models that do not fit in RAM, and I see no difference in loading time with or without --defer-experts.

@usrlocalben
Contributor

To me it seems like --defer-experts should implicitly set --no-warmup; otherwise it doesn't fully solve the problem as described.
Preload/Warmup could probably be disabled automatically if ram-size < model/exps-size.
I'm not RAM-poor but I tried it out w/drop_caches between invocations and it appears to work as advertised.

@ikawrakow
Owner

@dmaivel

I think @usrlocalben's suggestion is good. Can you add auto-enabling --no-warmup with --defer-experts? Thanks.

@dmaivel
Contributor Author

dmaivel commented Apr 16, 2026

Done

@ikawrakow ikawrakow merged commit 4f4bcfb into ikawrakow:main Apr 16, 2026
@FNsi

FNsi commented Apr 16, 2026

LGTM, but let's have at least one more RAM-poor person test that.

Okay, I tried loading stepfun q8_0 on my laptop; it turns out there's not much difference when the model is around twice the size of RAM, or it's even worse.
(Oh my poor ssd)


First is default --no-warmup, second is --defer-experts

markaalonzo pushed a commit to markaalonzo/ik_llama.cpp that referenced this pull request Apr 17, 2026
…wrakow#1634)

* Add --defer-experts flag to defer expert mmap residency on Linux

* Disable warmup when defer-experts is enabled
@markaalonzo
Contributor

Post-merge observation from a downstream fork carrying this flag. Two things worth flagging for any follow-up work.

Silent-no-op risk in is_split_expert_tensor

src/llama-model-loader.cpp:231 detects split-expert tensors by string-matching a fixed prefix list:

static const char * prefixes[] = { "ffn_gate.", "ffn_down.", "ffn_up." };

The merged-expert path (line 260, is_merged_expert_tensor(llm_tensor tensor_type)) uses the type enum, which is robust to renames. The split path can't — the comment at line 498 notes llm_tensor_type can't disambiguate two-%d formats — so string matching is genuinely needed here.

But the failure mode is silent: if a future arch names split experts differently (e.g. a moe_* convention, or a rename), is_split_expert_tensor returns false for every tensor, build_expert_tensor_index produces an empty range list, and --defer-experts becomes a no-op on that arch with no error. The flag would appear to do nothing, which is hard to diagnose from the user side.

Suggestion: when should_defer_expert_mmaps() returns true and the collected expert ranges are empty (or materially less than n_expert_used_count * n_layer * 3), emit an LLAMA_LOG_WARN naming the arch and noting the flag had no effect. Cheap insurance against a class of regression that wouldn't otherwise surface in tests.

Observed behavior on a Qwen3.5-35B-A3B MoE workload

For calibration of the "reduce model load time" claim: tested on a single-model always-on inference server (RTX 3080 Ti + Intel Core Ultra 7 265, 30 GB RAM, model fits comfortably without the flag).

| Config | VmRSS | Decode steady-state | Run 1 | Run 6 |
| --- | --- | --- | --- | --- |
| --defer-experts off (default) | 14.47 GB | 53.5 tok/s | 47.8 | 53.7 |
| --defer-experts on | 13.66 GB | (degrades each run) | 54.0 | 46.4 |

The flag does what's documented (lower load-time RSS, fast Rep 1), but steady-state throughput monotonically degrades on repeated inference. majflt=0 throughout — the degradation is minor-fault cost plus presumably TLB/THP loss from MADV_DONTNEED clearing the expert ranges, combined with params.warmup=false removing the pass that would re-prefault them.

Not a bug — matches the design tradeoff — but the params.warmup=false coupling in common/common.cpp:1562 is what turns a one-time load-time saving into a permanent per-run cost on MoE workloads where expert activation patterns vary. An always-on inference server (vs. a short-lived bench) likely wants --defer-experts with warmup left enabled, so the first call prefaults and subsequent calls keep the working set hot. Would welcome a separate --defer-experts-no-warmup if the combined behavior is intentional for Linux cold-boot use cases; otherwise decoupling might be worth considering.

Happy to test alternative patches on the same workload if useful.

@usrlocalben
Contributor

usrlocalben commented Apr 17, 2026

An always-on inference server (vs. a short-lived bench) likely wants --defer-experts with warmup left enabled, so the first call prefaults and subsequent calls keep the working set hot.

This statement indicates that you don't understand the problem this PR aims to solve.

This is for systems that don't have RAM to load the model and instead read from block (or similar) storage "directly" during compute via mmap & page-faults. The normal startup preload/warm-up events are just a cost to them without benefit.

They will never warm up. They will never have a hot working set.
