cuda: DGX Spark / GB10 backend support — HBM-resident model #13
Open
TrevorS wants to merge 1 commit into
This was referenced May 24, 2026
DGX Spark (GB10, sm_121, 121 GiB UMA, driver 580+) sits in an unusual
spot for CUDA inference: ATS (Address Translation Service) lets the
GPU consume host-mmap'd weights directly, but at significantly lower
effective bandwidth than HBM-resident copies. For an 80 GB IQ2XXS
DeepSeek V4 Flash checkpoint, the difference is the model running
versus the model being usable.
This commit adds:
- Startup HBM cache that copies hot tensor spans (attn projections,
MoE shared experts, output projection) into device memory at engine
init, capped by a configurable budget (defaults sized to leave
headroom for KV cache and a second model load). Cold MoE routed
experts stay ATS-mapped.
- Factored the cudaMalloc+memcpy populate path into a helper and
reordered cuda_model_range_ptr so the HBM-resident lookup is a
single hash-keyed read that wins over the UVA-mapped pointer on
the hot decode path.
- GPU argmax kernel; the prior fallback misused indexer scoring as
an argmax, which paid the dispatcher cost twice on N=1 decode.
- Pair-fused Q_A + KV_A matmuls in qkv_rms_fused decode path
(one shared weight load per row, two outputs).
- Parallelized matmul_q8_0_hc_expand epilogue across n_hc lanes
(n_hc parallel residual loads + writes vs n_hc^2 serial reads).
- Drop `cudaHostRegisterReadOnly` flag (unsupported on GB10).
- Drop `!mtp_ready` gate from accelerator_cache_model_tensors so
the HBM cache is also populated for the MTP support model.
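The lookup order described in the list above (device-resident copy first, mapped pointer as fallback) can be sketched on the host in plain C. This is a minimal illustration under assumptions, not the PR's code: `range_cache`, `range_cache_put`, and `range_ptr` are hypothetical names, the table is a small open-addressing hash keyed by the exact start offset of a span, and ordinary host pointers stand in for device and ATS-mapped addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RANGE_SLOTS 64  /* power of two, so masking replaces modulo */

/* One cached span: model byte offset -> address of the resident copy. */
typedef struct {
    uint64_t offset;    /* key: start of the span inside the model mapping */
    size_t   len;       /* span length in bytes */
    void    *resident;  /* NULL marks an empty slot */
} range_slot;

typedef struct {
    range_slot slots[RANGE_SLOTS];
    size_t used;
} range_cache;

static size_t range_hash(uint64_t offset) {
    /* Fibonacci hashing: multiply, then keep the top 6 bits (64 slots). */
    return (size_t)((offset * 0x9E3779B97F4A7C15ull) >> 58) & (RANGE_SLOTS - 1);
}

/* Record a resident copy; returns 0 on success, -1 if the table is full.
 * One slot is always left empty so that probing terminates. */
static int range_cache_put(range_cache *c, uint64_t offset, size_t len,
                           void *resident) {
    assert(resident != NULL);
    if (c->used >= RANGE_SLOTS - 1) return -1;
    size_t i = range_hash(offset);
    while (c->slots[i].resident != NULL && c->slots[i].offset != offset)
        i = (i + 1) & (RANGE_SLOTS - 1);     /* linear probing */
    if (c->slots[i].resident == NULL) c->used++;
    c->slots[i] = (range_slot){ offset, len, resident };
    return 0;
}

/* Resident copy wins; otherwise fall back to the mapped base pointer. */
static const void *range_ptr(const range_cache *c, const void *mapped_base,
                             uint64_t offset, size_t len) {
    size_t i = range_hash(offset);
    while (c->slots[i].resident != NULL) {
        if (c->slots[i].offset == offset && c->slots[i].len >= len)
            return c->slots[i].resident;
        i = (i + 1) & (RANGE_SLOTS - 1);
    }
    return (const char *)mapped_base + offset;
}
```

A request that is not cached, or that asks for more bytes than the cached span holds, falls through to the mapped pointer, so the cache can only speed up a lookup and never changes which bytes are read.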
Bench (DGX Spark / GB10, ds4flash, n=256, "knight" prompt, 3-run mean):
Plain decode before: ~13.9 t/s (ATS-mapped weights, all paths)
Plain decode after: ~16.13 t/s (HBM-resident hot spans + small-N kernel fuses)
Adds `speed-bench/gb10.csv` per CONTRIBUTING.md convention so the
2048..65536 sweep is preserved alongside the existing m2_ultra.csv
and m4_max.csv. Generated via:
    ./ds4-bench -m ds4flash.gguf \
        --prompt-file speed-bench/promessi_sposi.txt \
        --ctx-start 2048 --ctx-max 65536 --step-incr 2048 \
        --gen-tokens 128 --csv speed-bench/gb10.csv
Hardware: NVIDIA DGX Spark (GB10 / sm_121), driver 580.142, CUDA 13.0
Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
cuda: DGX Spark / GB10 backend support — HBM-resident model
Summary
Makes the 80 GB IQ2XXS DeepSeek V4 Flash checkpoint run well on NVIDIA DGX Spark (GB10, sm_121, 121 GiB UMA). On Spark the GPU can consume host-mmap'd weights via ATS, but at lower effective bandwidth than device-resident copies. This PR makes a budgeted set of hot tensor spans HBM-resident at startup, leaving the cold MoE routed experts ATS-mapped.
Standalone change: no MTP behavior, no kernel semantics change. Plain decode only. Adds GB10 to `speed-bench/` next to `m2_ultra.csv` and `m4_max.csv`.
Speed: `ds4-bench` standard sweep (`promessi_sposi.txt`, gen=128, GB10)
Steady ~+0.4 t/s from the small-N kernel fuses (Q_A+KV_A pairing, hc_expand epilogue parallelization, head_rms_norm+rope_tail fusion).
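To illustrate why pairing two small-N matmuls can help, the sketch below computes two projections of the same activation vector in one loop, so each activation element is loaded once and feeds both accumulators. It is a plain-C, float-only analogue under assumed names (`matvec_pair`, `w_q`, `w_kv`); the PR's kernels are CUDA, operate on quantized weights, and may share loads differently.

```c
#include <assert.h>
#include <stddef.h>

/* Computes out_q = w_q * x and out_kv = w_kv * x in a single pass.
 * w_q is rows_q x cols and w_kv is rows_kv x cols, both row-major.
 * The two matrices may have different row counts. */
static void matvec_pair(const float *w_q, size_t rows_q,
                        const float *w_kv, size_t rows_kv,
                        const float *x, size_t cols,
                        float *out_q, float *out_kv) {
    assert(w_q && w_kv && x && out_q && out_kv);
    size_t rows = rows_q > rows_kv ? rows_q : rows_kv;
    for (size_t r = 0; r < rows; r++) {
        /* NULL means this matrix has no row r. */
        const float *rq = r < rows_q  ? w_q  + r * cols : NULL;
        const float *rk = r < rows_kv ? w_kv + r * cols : NULL;
        float acc_q = 0.0f, acc_kv = 0.0f;
        for (size_t c = 0; c < cols; c++) {
            float xc = x[c];                 /* one load, two uses */
            if (rq) acc_q  += rq[c] * xc;
            if (rk) acc_kv += rk[c] * xc;
        }
        if (rq) out_q[r]  = acc_q;
        if (rk) out_kv[r] = acc_kv;
    }
}
```

On a GPU the same idea also removes one kernel launch per decode step, which matters most at N=1 where launch overhead is a large share of the work.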
UMA headroom (the bigger reason)
On current `upstream/main`, the startup cache copies the full 80.76 GiB model to device. On a 121 GiB Spark that leaves little headroom once KV cache and prefill activations grow: in my testing the standard `ds4-bench` 2048→65536 sweep did not complete past ~18k context on upstream.
This PR caps the cache at 24 GiB with a MoE filter, so only the hot spans (attention projections, shared experts, embedding, output head) become device-resident (~8.2 GiB in practice), while cold routed experts stay ATS-mapped. The full 2048→65536 sweep completes with headroom to spare. (Worth confirming on your own Spark, since UMA pressure depends on driver and allocator behavior.)
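The budgeted selection described above can be sketched as a single pass over the tensor list: skip routed-expert tensors, then admit the remaining tensors until the byte budget is spent. This is a minimal sketch in plain C that assumes a hypothetical `tensor_info` record and `plan_resident` helper; the real cache works on tensor spans in the model mapping, and its exact ordering and filter rules are not spelled out in this description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;       /* illustrative label only */
    uint64_t bytes;         /* size of the tensor span */
    int is_routed_expert;   /* 1 = cold MoE routed expert, stays mapped */
    int resident;           /* output: 1 if chosen for the device cache */
} tensor_info;

/* Marks tensors resident in list order until `budget` bytes are used.
 * Returns the number of bytes planned for the device cache. */
static uint64_t plan_resident(tensor_info *t, size_t n, uint64_t budget) {
    assert(t != NULL || n == 0);
    uint64_t used = 0;
    for (size_t i = 0; i < n; i++) {
        t[i].resident = 0;
        if (t[i].is_routed_expert) continue;        /* MoE filter */
        if (t[i].bytes > budget - used) continue;   /* would exceed the cap */
        t[i].resident = 1;
        used += t[i].bytes;
    }
    return used;
}
```

With the figures quoted above (a 24 GiB cap and roughly 8.2 GiB of hot spans), every non-routed tensor would fit, so in this sketch the cap would act as a safety limit rather than the deciding factor.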
What's in the PR
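- `cuda_model_range_ptr` helper: single hash-keyed lookup for device-resident pointers on the hot path.
- Pair-fused Q_A + KV_A matmuls in `qkv_rms_fused` decode.
- Parallelized `matmul_q8_0_hc_expand` epilogue across `n_hc` lanes.
- HBM cache also populated for the MTP support model (`--mtp`).
- Drop `cudaHostRegisterReadOnly` (unsupported on GB10).
- `speed-bench/gb10.csv` from the standard sweep.
Tested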
- `make clean && make cuda-spark`: clean
- `make cpu`: clean
- `./ds4_test --long-context`, `--tool-call-quality`, `--server`, `--metal-kernels`: OK
- `./ds4-bench` 2048→65536 sweep: completes; `speed-bench/gb10.csv`
Two `ds4_test` checks fail identically on stock `upstream/main` (f91c12b) and are not introduced here:
- `--logprob-vectors short_code_completion`: the only divergence across all 4 steps is the case of the markdown code-fence language tag. The full-precision official API emits ```c, the IQ2XXS (2-bit) local model emits ```C; the generated code (`return snprintf`) is byte-identical. This is a near-tie that the aggressive quant resolves to uppercase, and it reproduces on stock `upstream/main`.
- `--metal-tensor-equivalence`: long-context only. Upstream's MoE routed-expert down-projection accumulates via float `atomicAdd` when `n_tokens >= 128` (`use_atomic_down`); the order is scheduling-dependent, so two runs of the same config drift at ulp scale and occasionally flip a greedy argmax at long ctx. `DS4_CUDA_MOE_NO_ATOMIC_DOWN=1` (an upstream flag) makes it bit-exact, confirming the cause. Stock upstream flakes identically (~2/5 runs); this is a plain-decode-only PR and doesn't touch that path.
Hardware: NVIDIA DGX Spark (GB10 / sm_121), driver 580.142, CUDA 13.0
Model: `DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf`
Note
Foundation for a stacked follow-up (#14) that adds MTP combined-forward speculative decode. This PR stands alone: it's what makes Spark usable at high context whether or not you care about MTP.
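The `--metal-tensor-equivalence` flake noted under Tested comes down to floating-point addition not being associative, so the order in which `atomicAdd` calls land can change the sum. A small host-side C sketch of that effect follows. The values are deliberately exaggerated for illustration (the real drift is at ulp scale), and `sum_in_order` is a hypothetical helper, not code from this PR.

```c
#include <assert.h>
#include <stddef.h>

/* Sums v[order[0]], v[order[1]], ... in float, left to right.
 * `volatile` forces each partial sum to be rounded to float, which
 * mimics an accumulator that lives in memory between additions. */
static float sum_in_order(const float *v, const size_t *order, size_t n) {
    assert(v != NULL && order != NULL);
    volatile float acc = 0.0f;
    for (size_t i = 0; i < n; i++) acc += v[order[i]];
    return acc;
}
```

Summing `{1e8f, 1.0f, -1e8f}` in index order `0, 1, 2` loses the `1.0f` when it is added to `1e8f`, while the order `0, 2, 1` cancels the large terms first and keeps it. A fixed reduction order removes this source of run-to-run variation, which is consistent with the bit-exact result reported for `DS4_CUDA_MOE_NO_ATOMIC_DOWN=1`.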