Add --defer-experts flag to defer expert mmap residency on Linux #1634
ikawrakow merged 2 commits into ikawrakow:main
Conversation
---
Thank you for the PR. The benchmark results give CUDA as the backend; what is/are the GPU(s), and what are the command-line arguments? Also, did you try `…`? I think it would be useful to measure TTFT on cold start (caches dropped) with a prompt of a given length, and `…`.
---
I'm using a single GPU for testing, `…`. For dropping caches between runs, I use `…`.

For my tests with `llama-cli`:

```shell
GGML_CUDA_NO_PINNED=1 ./build/bin/llama-cli \
    -m ~/.../Qwen3.5-397B-A17B-IQ2_XS-00001-of-00004.gguf \
    -c 12000 -ngl 99 -ot exps=CPU --threads 24 \
    -fa on -b 4096 -ub 4096 \
    -ctk q8_0 -ctv q8_0 \
    -wgt 1 \
    -f ~/.../prompt2.txt -n 1 \
    --temp 0 --top-k 1 -s 0 # configured between runs: --defer-experts --no-warmup
```

where `GGML_CUDA_NO_PINNED=1` is used to force `mmap`, which this feature requires.

For `llama-bench`:

```shell
GGML_CUDA_NO_PINNED=1 ./llama-bench -m ~/.../Qwen3.5-397B-A17B-IQ2_XS-00001-of-00004.gguf \
    -ngl 99 -ot exps=CPU --threads 24 \
    -n 128 -r 3 # --defer-experts
```

With `llama-server`:

```shell
GGML_CUDA_NO_PINNED=1 ./llama-server \
    -m ~/.../Qwen3.5-122B-A10B-IQ2_KL.gguf \
    -c 12000 -ngl 99 -ot exps=CPU --threads 24 \
    -fa on -b 4096 -ub 4096 \
    -wgt 1 \
    --dry-run
```

I have to wait ~24s before it loads and completes. With `…`, `…`. Using `…`, `…`.

Interestingly, on all 4 of these runs, I am observing a crash after loading. I don't believe that this is the expected behavior, as noted by the massive buffer allocation. I made sure to also try this on the `…`. Omitting `…`, `…`.

Redoing my TTFT experiments with this command (dropping caches between runs):

```shell
GGML_CUDA_NO_PINNED=1 ./llama-cli \
    -m ~/.../Qwen3.5-397B-A17B-IQ2_XS-00001-of-00004.gguf \
    -c 4096 -ngl 99 -ot exps=CPU --threads 24 \
    -fa on -b 4096 -ub 4096 \
    -wgt 1 \
    -f ~/.../prompt2.txt -n 1 --no-warmup # --defer-experts
```

Without `--defer-experts`: `…`. With `--defer-experts`: `…`.

For good measure, I tried this with `…`. Without: `…`. With: `…`.

Based on these results, `…`.
ikawrakow left a comment
LGTM, but let's have at least one more RAM-poor person test that. I don't have models that do not fit in RAM, and I see no difference in loading time with or without --defer-experts.
---
To me it seems like `--defer-experts` should implicitly set `--no-warmup`; otherwise it doesn't fully solve the problem as described.
---
I think @usrlocalben's suggestion is good. Can you add auto-enabling of `--no-warmup` when `--defer-experts` is set?
---
Done
Add --defer-experts flag to defer expert mmap residency on Linux (ikawrakow#1634)
* Add --defer-experts flag to defer expert mmap residency on Linux
* Disable warmup when defer-experts is enabled
---
Post-merge observation from a downstream fork carrying this flag. Two things worth flagging for any follow-up work: a silent-no-op risk in `…`, and the throughput degradation shown below.
---
| Config | VmRSS | Decode steady-state | Run 1 (tok/s) | Run 6 (tok/s) |
|---|---|---|---|---|
| `--defer-experts` off (default) | 14.47 GB | 53.5 tok/s | 47.8 | 53.7 |
| `--defer-experts` on | 13.66 GB | (degrades each run) | 54.0 | 46.4 |
The flag does what's documented (lower load-time RSS, fast Run 1), but steady-state throughput monotonically degrades on repeated inference. `majflt=0` throughout: the degradation is minor-fault cost, plus presumably TLB/THP loss from `MADV_DONTNEED` clearing the expert ranges, combined with `params.warmup=false` removing the pass that would re-prefault them.
Not a bug (it matches the design tradeoff), but the `params.warmup=false` coupling in `common/common.cpp:1562` is what turns a one-time load-time saving into a permanent per-run cost on MoE workloads where expert activation patterns vary. An always-on inference server (vs. a short-lived bench) likely wants `--defer-experts` with warmup left enabled, so the first call prefaults and subsequent calls keep the working set hot. A separate `--defer-experts-no-warmup` would be welcome if the combined behavior is intentional for Linux cold-boot use cases; otherwise, decoupling might be worth considering.
Happy to test alternative patches on the same workload if useful.
This statement indicates that you don't understand the problem this PR aims to solve. This is for systems that don't have enough RAM to load the model and instead read from block (or similar) storage "directly" during compute via mmap and page faults. The normal startup preload/warm-up events are just a cost to them without benefit. They will never warm up. They will never have a hot working set.

This PR introduces `--defer-experts`. Using this flag, expert tensor pages are faulted in on demand rather than being eagerly loaded during initialization. This reduces cold-start latency, improving the load time of MoE models, particularly on systems where users run models off of storage.

TL;DR: when users wish to run an MoE model which exceeds their system RAM + VRAM, they can use `--defer-experts --no-warmup` to bring load times from minutes down to ~10s.

I benchmarked two models that I have on hand on a 64 GiB RAM system. `GGML_CUDA_NO_PINNED=1` was used in order to force usage of `mmap`, which is required for this feature.

I used `./llama-cli` with a ~1k token prompt and `-n 1` to measure time to first token on cold starts, comparing baseline, `--no-warmup`, `--defer-experts`, and `--defer-experts --no-warmup` for both models: `…`x faster with `--defer-experts`, and ~3.93x faster with `--no-warmup` included.

I did run llama-bench with default parameters as well, in case of any PP and TG regressions (also dropping the memory cache between runs):
build: c5acec8 (4412)
I wouldn't recommend users use `--defer-experts --no-warmup` with models that fit in RAM (as the tables show, I observe slightly reduced performance). It's more advantageous for these users not to use `mmap` in the first place, or to try just `--defer-experts`. For the smaller model, I suspect that the 120.45 ± 55.26 t/s for PP is because the first run(s) were slowed down by faulting on cold start.
In my own use case, running `GGML_CUDA_NO_PINNED=1 ./llama-server -ngl 99 -ot exps=CPU ...` with Qwen3.5-397B-A17B-IQ2_XS, I see that the server is ready within:

- `--no-warmup`: ~1m 16s (1.59x faster)
- `--defer-experts`: ~41s (2.95x faster)
- `--defer-experts --no-warmup`: ~10s (12.1x faster)

Enabling this flag will also tell users how much memory was deferred.
If users are patient enough to use oversized models, I suppose that baseline might be fine for them as well, in case they face any regressions in their workflows.