
cuda: DGX Spark / GB10 backend support — HBM-resident model#13

Open
TrevorS wants to merge 1 commit into main from gb10-hbm-resident-model

@TrevorS TrevorS commented May 24, 2026

Summary

Makes the 80 GB IQ2XXS DeepSeek V4 Flash checkpoint run well on NVIDIA DGX Spark (GB10, sm_121, 121 GiB UMA). On Spark the GPU can consume host-mmap'd weights via ATS, but at lower effective bandwidth than device-resident copies. This PR makes a budgeted set of hot tensor spans HBM-resident at startup, leaving the cold MoE routed experts ATS-mapped.

Standalone change: no MTP behavior, no kernel semantics change. Plain decode only. Adds GB10 to speed-bench/ next to m2_ultra.csv and m4_max.csv.

Speed — ds4-bench standard sweep (promessi_sposi.txt, gen=128, GB10)

| ctx (tokens) | upstream/main (t/s) | this PR (t/s) | Δ (t/s) |
|---|---|---|---|
| 2048 | 13.85 | 14.24 | +0.39 |
| 8192 | 13.67 | 14.10 | +0.43 |
| 16384 | 13.54 | 13.97 | +0.43 |
| 18432 | 13.45 | 13.88 | +0.43 |

A steady gain of about 0.4 t/s at every context length, from the small-N kernel fusions (Q_A+KV_A pairing, hc_expand epilogue parallelization, head_rms_norm+rope_tail fusion).

UMA headroom (the bigger reason)

On current upstream/main, the startup cache copies the full 80.76 GiB model to device. On a 121 GiB Spark that leaves little headroom once KV cache and prefill activations grow: in my testing the standard ds4-bench 2048→65536 sweep did not complete past ~18k context on upstream.

This PR caps the cache at 24 GiB with a MoE filter, so only the hot spans (attention projections, shared experts, embedding, output head) become device-resident — ~8.2 GiB in practice — while cold routed experts stay ATS-mapped. The full 2048→65536 sweep completes with headroom to spare. (Worth confirming on your own Spark — UMA pressure depends on driver and allocator behavior.)
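The budget-plus-filter selection described above can be sketched in a few lines of C. This is a minimal illustration, not the PR's code: `span_t`, its fields, and `select_resident_spans` are hypothetical names, and the first-fit order is an assumption, since the PR text does not say how spans are prioritized.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical span descriptor; the PR's real structures are not shown. */
typedef struct {
    const char *name;          /* tensor name, for logging only */
    uint64_t    bytes;         /* size of the span in bytes */
    bool        routed_expert; /* true for cold MoE routed-expert weights */
    bool        resident;      /* output: chosen for a device-resident copy */
} span_t;

/* Mark spans for device residency without exceeding budget_bytes.
 * Routed experts are always skipped (the "MoE filter") and stay mapped.
 * Spans are taken first-fit in array order (an assumption).
 * Returns the number of bytes selected. */
static uint64_t select_resident_spans(span_t *spans, size_t n,
                                      uint64_t budget_bytes)
{
    assert(spans != NULL || n == 0);
    uint64_t used = 0;
    for (size_t i = 0; i < n; i++) {
        spans[i].resident = false;
        if (spans[i].routed_expert)
            continue;                           /* cold: leave ATS-mapped */
        if (spans[i].bytes > budget_bytes - used)
            continue;                           /* would exceed the cap */
        spans[i].resident = true;
        used += spans[i].bytes;
    }
    return used;
}
```

With the numbers quoted above, the call would use a budget of `24ull << 30` bytes, and the returned total would land well under that cap because the filter excludes the routed experts that make up most of the checkpoint.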

What's in the PR

  • Startup HBM cache with a budget cap + MoE filter (hot spans device-resident, cold routed experts ATS-mapped).
  • cuda_model_range_ptr helper: single hash-keyed lookup for device-resident pointers on the hot path.
  • GPU argmax kernel (the prior fallback misused indexer scoring as argmax, double-paying dispatcher cost at N=1).
  • Pair-fused Q_A + KV_A matmuls in qkv_rms_fused decode.
  • Parallelized matmul_q8_0_hc_expand epilogue across n_hc lanes.
  • HBM cache extends to the MTP support model (no behavioral change without --mtp).
  • Drop cudaHostRegisterReadOnly (unsupported on GB10).
  • speed-bench/gb10.csv from the standard sweep.
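For readers unfamiliar with what a dedicated argmax kernel computes, here is a CPU reference in C of the two-stage reduction pattern that GPU argmax kernels commonly use: each block finds a local winner, then the winners are reduced. It is a sketch only. The PR's kernel is not shown in this description, and `ARGMAX_BLOCK`, the function names, and the lowest-index tie-break are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define ARGMAX_BLOCK 256 /* hypothetical chunk size; stands in for one thread block */

/* Stage 1: argmax over one chunk [begin, end). Ties keep the lowest index. */
static size_t argmax_chunk(const float *x, size_t begin, size_t end)
{
    size_t best = begin;
    for (size_t i = begin + 1; i < end; i++)
        if (x[i] > x[best])
            best = i;
    return best;
}

/* Stage 2: reduce the per-chunk winners to a single global index. */
static size_t argmax_two_stage(const float *logits, size_t n)
{
    assert(logits != NULL && n > 0);
    size_t first_end = n < ARGMAX_BLOCK ? n : ARGMAX_BLOCK;
    size_t best = argmax_chunk(logits, 0, first_end);
    for (size_t b = ARGMAX_BLOCK; b < n; b += ARGMAX_BLOCK) {
        size_t end = (n - b < ARGMAX_BLOCK) ? n : b + ARGMAX_BLOCK;
        size_t cand = argmax_chunk(logits, b, end);
        if (logits[cand] > logits[best])
            best = cand;
    }
    return best;
}
```

On a GPU the stage-1 chunks run in parallel and stage 2 is a short reduction over the partial results, which is why a purpose-built kernel is cheap at N=1 compared with routing through a more general scoring path.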

Tested

  • make clean && make cuda-spark — clean
  • make cpu — clean
  • ./ds4_test --long-context, --tool-call-quality, --server, --metal-kernels — OK
  • ./ds4-bench 2048→65536 sweep — completes; speed-bench/gb10.csv
  • Two ds4_test checks fail identically on stock upstream/main (f91c12b) — not introduced here:
    • --logprob-vectors short_code_completion — the only divergence across all 4 steps is the case of the markdown code-fence language tag: the full-precision official API emits ```c, the IQ2XXS (2-bit) local model emits ```C; the generated code (return snprintf) is byte-identical. A near-tie the aggressive quant resolves to uppercase; reproduces on stock upstream/main.
    • --metal-tensor-equivalence — long-context only. Upstream's MoE routed-expert down-projection accumulates via float atomicAdd when n_tokens >= 128 (use_atomic_down); the order is scheduling-dependent, so two runs of the same config drift at ulp scale and occasionally flip a greedy argmax at long ctx. DS4_CUDA_MOE_NO_ATOMIC_DOWN=1 (an upstream flag) makes it bit-exact, confirming the cause. Stock upstream flakes identically (~2/5 runs); this is a plain-decode-only PR and doesn't touch that path.
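The scheduling dependence noted for the atomicAdd path comes from float addition not being associative. A minimal C sketch, using hypothetical helper names and a deliberately extreme input, shows the same three values summing to different results depending only on order:

```c
#include <stddef.h>

/* Accumulate in index order, rounding to float after every add. */
static float sum_forward(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Accumulate the same values in reverse order. */
static float sum_reverse(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = n; i-- > 0; )
        s += x[i];
    return s;
}
```

For the input `{1e8f, -1e8f, 1.0f}` on a typical IEEE 754 single-precision build, `sum_forward` yields 1 while `sum_reverse` yields 0, because 1 is smaller than the spacing between adjacent floats near 1e8 and is lost when added to -1e8 first. The drift described above is far smaller (ulp scale), but the mechanism is the same: the order in which atomic adds land changes the rounding.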

Hardware: NVIDIA DGX Spark (GB10 / sm_121), driver 580.142, CUDA 13.0
Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf

Note

Foundation for a stacked follow-up (#14) that adds MTP combined-forward speculative decode. This PR stands alone — it's what makes Spark usable at high context whether or not you care about MTP.

DGX Spark (GB10, sm_121, 121 GiB UMA, driver 580+) sits in an unusual
spot for CUDA inference: ATS (Address Translation Service) lets the
GPU consume host-mmap'd weights directly, but at significantly lower
effective bandwidth than HBM-resident copies.  For an 80 GB IQ2XXS
DeepSeek V4 Flash checkpoint, the difference is the model running
versus the model being usable.

This commit adds:

  - Startup HBM cache that copies hot tensor spans (attn projections,
    MoE shared experts, output projection) into device memory at engine
    init, capped by a configurable budget (defaults sized to leave
    headroom for KV cache and a second model load).  Cold MoE routed
    experts stay ATS-mapped.
  - A helper factored out of the cudaMalloc+memcpy populate path, with
    cuda_model_range_ptr reordered so that the HBM-resident lookup is a
    single hash-keyed read that takes priority over the UVA-mapped
    pointer on the hot decode path.
  - GPU argmax kernel; the prior fallback misused indexer scoring as
    an argmax which double-paid the dispatcher cost on N=1 decode.
  - Pair-fused Q_A + KV_A matmuls in qkv_rms_fused decode path
    (one shared weight load per row, two outputs).
  - Parallelized matmul_q8_0_hc_expand epilogue across n_hc lanes
    (n_hc parallel residual loads + writes vs n_hc^2 serial reads).
  - HBM cache also populated for the MTP support model.
  - Drop `cudaHostRegisterReadOnly` flag — unsupported on GB10.
  - Drop `!mtp_ready` gate from accelerator_cache_model_tensors so
    the MTP support model gets the same HBM-cache treatment.
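The lookup order described for cuda_model_range_ptr (device-resident copy first, mapped pointer otherwise) can be sketched as a small hash table in C. Everything here is illustrative: `range_table_t`, `range_insert`, `range_ptr`, the table size, the byte-offset key, and the linear probing are assumptions, not the PR's actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RANGE_BITS  10                  /* hypothetical table size: 2^10 slots */
#define RANGE_SLOTS (1u << RANGE_BITS)

typedef struct {
    uint64_t offset; /* key: byte offset of the span within the model file */
    void    *dev;    /* device-resident copy; NULL marks an empty slot */
} range_slot_t;

typedef struct {
    range_slot_t   slots[RANGE_SLOTS];
    const uint8_t *mapped_base; /* host mmap base, reachable by the GPU */
} range_table_t;

/* Multiplicative hash; keeps the top RANGE_BITS bits. */
static size_t range_hash(uint64_t offset)
{
    return (size_t)((offset * 0x9E3779B97F4A7C15ull) >> (64 - RANGE_BITS));
}

/* Record a device-resident copy for a span. Returns 0, or -1 if full. */
static int range_insert(range_table_t *t, uint64_t offset, void *dev)
{
    assert(t != NULL && dev != NULL);
    size_t h = range_hash(offset);
    for (size_t i = 0; i < RANGE_SLOTS; i++) {
        range_slot_t *s = &t->slots[(h + i) & (RANGE_SLOTS - 1)];
        if (s->dev == NULL || s->offset == offset) {
            s->offset = offset;
            s->dev = dev;
            return 0;
        }
    }
    return -1;
}

/* Resolve a span: the resident copy wins, else fall back to the mapping. */
static const void *range_ptr(const range_table_t *t, uint64_t offset)
{
    size_t h = range_hash(offset);
    for (size_t i = 0; i < RANGE_SLOTS; i++) {
        const range_slot_t *s = &t->slots[(h + i) & (RANGE_SLOTS - 1)];
        if (s->dev == NULL)
            break;                      /* miss: span was never cached */
        if (s->offset == offset)
            return s->dev;              /* hit: device-resident copy */
    }
    return t->mapped_base + offset;     /* cold span: mapped host pointer */
}
```

In the common case the first probe either hits or finds an empty slot, so the hot path costs one hashed read, and cold routed-expert spans fall through to the mapped pointer without any extra bookkeeping.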

Bench (DGX Spark / GB10, ds4flash, n=256, "knight" prompt, 3-run mean):

  Plain decode before: ~13.9 t/s  (ATS-mapped weights, all paths)
  Plain decode after:  ~16.13 t/s (HBM-resident hot spans + small-N kernel fuses)

Adds `speed-bench/gb10.csv` per CONTRIBUTING.md convention so the
2048..65536 sweep is preserved alongside the existing m2_ultra.csv
and m4_max.csv.  Generated via:

  ./ds4-bench -m ds4flash.gguf \
    --prompt-file speed-bench/promessi_sposi.txt \
    --ctx-start 2048 --ctx-max 65536 --step-incr 2048 \
    --gen-tokens 128 --csv speed-bench/gb10.csv

Hardware: NVIDIA DGX Spark (GB10 / sm_121), driver 580.142, CUDA 13.0
Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf