cuda: DGX Spark / GB10 backend support — HBM-resident model by TrevorS · Pull Request #13 · TrevorS/ds4

TrevorS · 2026-05-24T21:07:54Z

cuda: DGX Spark / GB10 backend support — HBM-resident model

Summary

Makes the 80 GB IQ2XXS DeepSeek V4 Flash checkpoint run well on NVIDIA DGX Spark (GB10, sm_121, 121 GiB UMA). On Spark the GPU can consume host-mmap'd weights via ATS, but at lower effective bandwidth than device-resident copies. This PR makes a budgeted set of hot tensor spans HBM-resident at startup, leaving the cold MoE routed experts ATS-mapped.

Standalone change: no MTP behavior, no kernel semantics change. Plain decode only. Adds GB10 to speed-bench/ next to m2_ultra.csv and m4_max.csv.

Speed — `ds4-bench` standard sweep (promessi_sposi.txt, gen=128, GB10)

| ctx (tokens) | upstream/main (t/s) | this PR (t/s) | Δ (t/s) |
|---|---|---|---|
| 2048 | 13.85 | 14.24 | +0.39 |
| 8192 | 13.67 | 14.10 | +0.43 |
| 16384 | 13.54 | 13.97 | +0.43 |
| 18432 | 13.45 | 13.88 | +0.43 |

A steady gain of roughly +0.4 t/s at every measured context, from the small-N kernel work (Q_A+KV_A pairing, hc_expand epilogue parallelization, head_rms_norm+rope_tail fusion).

UMA headroom (the bigger reason)

On current upstream/main, the startup cache copies the full 80.76 GiB model to device. On a 121 GiB Spark that leaves little headroom once KV cache and prefill activations grow: in my testing the standard ds4-bench 2048→65536 sweep did not complete past ~18k context on upstream.

This PR caps the cache at 24 GiB with a MoE filter, so only the hot spans (attention projections, shared experts, embedding, output head) become device-resident — ~8.2 GiB in practice — while cold routed experts stay ATS-mapped. The full 2048→65536 sweep completes with headroom to spare. (Worth confirming on your own Spark — UMA pressure depends on driver and allocator behavior.)
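
The cap-plus-filter policy described above can be sketched in a few lines of C. This is a minimal illustration, not the code in this PR: `span_t`, `is_routed_expert`, `plan_cache`, and the `_exps` name match are all hypothetical, and the real filter may key on different tensor metadata.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical description of one tensor span in the mmap'd model file. */
typedef struct {
    const char *name;   /* tensor name (illustrative naming only) */
    uint64_t    bytes;  /* span size in bytes */
    bool        cached; /* set by plan_cache(): copy to device at startup */
} span_t;

/* Assumed MoE filter: any name containing "_exps" is a cold routed expert. */
static bool is_routed_expert(const char *name) {
    return strstr(name, "_exps") != NULL;
}

/* Walk the spans in order and mark each one for device residency while it
 * still fits in the remaining budget. Returns the bytes that would be copied. */
static uint64_t plan_cache(span_t *spans, size_t n, uint64_t budget_bytes) {
    uint64_t used = 0;
    for (size_t i = 0; i < n; i++) {
        spans[i].cached = false;
        if (is_routed_expert(spans[i].name)) continue;      /* stays ATS-mapped */
        if (spans[i].bytes > budget_bytes - used) continue; /* would exceed cap */
        spans[i].cached = true;
        used += spans[i].bytes;
    }
    return used;
}
```

Under these assumptions a 24 GiB cap would be `plan_cache(spans, n, 24ULL << 30)`. The return value corresponds to the "in practice" figure quoted above: the filter, not the cap, is what limits the copy when the hot spans are much smaller than the budget.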

What's in the PR

- Startup HBM cache with a budget cap + MoE filter (hot spans device-resident, cold routed experts ATS-mapped).
- cuda_model_range_ptr helper: single hash-keyed lookup for device-resident pointers on the hot path.
- GPU argmax kernel (the prior fallback misused indexer scoring as argmax, double-paying dispatcher cost at N=1).
- Pair-fused Q_A + KV_A matmuls in qkv_rms_fused decode.
- Parallelized matmul_q8_0_hc_expand epilogue across n_hc lanes.
- HBM cache extends to the MTP support model (no behavioral change without --mtp).
- Drop cudaHostRegisterReadOnly (unsupported on GB10).
- speed-bench/gb10.csv from the standard sweep.
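
To make the "single hash-keyed lookup" item concrete, here is a host-side sketch of one way such a lookup could work: an open-addressing table keyed by a span's offset in the model file, returning the device copy when one exists and the host-mapped address otherwise. Every name here (`range_table_t`, `range_insert`, `range_ptr`, `RANGE_SLOTS`) is hypothetical, and the actual key and layout used by cuda_model_range_ptr may differ.

```c
#include <stddef.h>
#include <stdint.h>

#define RANGE_SLOTS 1024u /* hypothetical table size; must be a power of two */

typedef struct {
    uint64_t offset; /* span start within the model file (the key) */
    uint64_t bytes;  /* span length */
    void    *dev;    /* device-resident copy; NULL marks an empty slot */
} range_slot_t;

/* Must be zero-initialized before use (e.g. static storage). */
typedef struct {
    range_slot_t slots[RANGE_SLOTS];
} range_table_t;

static uint32_t hash_offset(uint64_t x) {
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    return (uint32_t)x & (RANGE_SLOTS - 1u);
}

/* Record a device copy for the span at `offset`. Returns 0, or -1 if full. */
static int range_insert(range_table_t *t, uint64_t offset, uint64_t bytes,
                        void *dev) {
    uint32_t i = hash_offset(offset);
    for (uint32_t probe = 0; probe < RANGE_SLOTS; probe++) {
        range_slot_t *s = &t->slots[(i + probe) & (RANGE_SLOTS - 1u)];
        if (s->dev == NULL || s->offset == offset) {
            s->offset = offset;
            s->bytes = bytes;
            s->dev = dev;
            return 0;
        }
    }
    return -1;
}

/* Prefer the device-resident copy; otherwise fall back to the mapped host
 * address, which the GPU can still read (more slowly) through ATS. */
static const void *range_ptr(const range_table_t *t, const void *host_base,
                             uint64_t offset, uint64_t bytes) {
    uint32_t i = hash_offset(offset);
    for (uint32_t probe = 0; probe < RANGE_SLOTS; probe++) {
        const range_slot_t *s = &t->slots[(i + probe) & (RANGE_SLOTS - 1u)];
        if (s->dev == NULL) break; /* end of probe chain: not cached */
        if (s->offset == offset && s->bytes >= bytes) return s->dev;
    }
    return (const char *)host_base + offset;
}
```

The point of the design is that a cached span resolves in one probe in the common case, so checking for a device copy first adds little cost on the decode path.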

Tested

- `make clean && make cuda-spark`: clean
- `make cpu`: clean
- `./ds4_test` with `--long-context`, `--tool-call-quality`, `--server`, `--metal-kernels`: OK
- `./ds4-bench` 2048→65536 sweep: completes; results in speed-bench/gb10.csv

Two ds4_test checks fail identically on stock upstream/main (f91c12b), so they are not introduced here:
- --logprob-vectors short_code_completion — the only divergence across all 4 steps is the case of the markdown code-fence language tag: the full-precision official API emits ```c, the IQ2XXS (2-bit) local model emits ```C; the generated code (return snprintf) is byte-identical. A near-tie the aggressive quant resolves to uppercase; reproduces on stock upstream/main.
- --metal-tensor-equivalence — long-context only. Upstream's MoE routed-expert down-projection accumulates via float atomicAdd when n_tokens >= 128 (use_atomic_down); the order is scheduling-dependent, so two runs of the same config drift at ulp scale and occasionally flip a greedy argmax at long ctx. DS4_CUDA_MOE_NO_ATOMIC_DOWN=1 (an upstream flag) makes it bit-exact, confirming the cause. Stock upstream flakes identically (~2/5 runs); this is a plain-decode-only PR and doesn't touch that path.
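
The order dependence behind the second failure comes from float addition not being associative. The sketch below is a CPU-only illustration with a hypothetical helper, `sum_in_order`; it is not the upstream kernel, and the values are deliberately exaggerated so the effect is visible, whereas the drift described above is at ulp scale.

```c
#include <stddef.h>

/* Add v[order[0]], v[order[1]], ... in single precision. The volatile
 * accumulator forces a rounding to float after every addition. */
static float sum_in_order(const float *v, const size_t *order, size_t n) {
    volatile float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc = acc + v[order[i]];
    }
    return acc;
}
```

With `v = {1e8f, 1.0f, -1e8f}`, visiting the elements in the order 0, 1, 2 yields 0.0f, because 1e8f + 1.0f rounds back to 1e8f. The order 0, 2, 1 yields 1.0f. An accumulation whose order depends on thread scheduling, such as one built on atomicAdd, can land on either kind of result from run to run.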

Hardware: NVIDIA DGX Spark (GB10 / sm_121), driver 580.142, CUDA 13.0
Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf

Note

Foundation for a stacked follow-up (#14) that adds MTP combined-forward speculative decode. This PR stands alone — it's what makes Spark usable at high context whether or not you care about MTP.

DGX Spark (GB10, sm_121, 121 GiB UMA, driver 580+) sits in an unusual spot for CUDA inference: ATS (Address Translation Service) lets the GPU consume host-mmap'd weights directly, but at significantly lower effective bandwidth than HBM-resident copies. For an 80 GB IQ2XXS DeepSeek V4 Flash checkpoint, the difference is the model running versus the model being usable.

This commit adds:

- Startup HBM cache that copies hot tensor spans (attn projections, MoE shared experts, output projection) into device memory at engine init, capped by a configurable budget (defaults sized to leave headroom for KV cache and a second model load). Cold MoE routed experts stay ATS-mapped.
- Factored the cudaMalloc+memcpy populate path into a helper and reordered cuda_model_range_ptr so the HBM-resident lookup is a single hash-keyed read that wins over the UVA-mapped pointer on the hot decode path.
- GPU argmax kernel; the prior fallback misused indexer scoring as an argmax which double-paid the dispatcher cost on N=1 decode.
- Pair-fused Q_A + KV_A matmuls in qkv_rms_fused decode path (one shared weight load per row, two outputs).
- Parallelized matmul_q8_0_hc_expand epilogue across n_hc lanes (n_hc parallel residual loads + writes vs n_hc^2 serial reads).
- HBM cache also populated for the MTP support model.
- Drop `cudaHostRegisterReadOnly` flag (unsupported on GB10).
- Drop `!mtp_ready` gate from accelerator_cache_model_tensors so the MTP support model gets the same HBM-cache treatment.

Bench (DGX Spark / GB10, ds4flash, n=256, "knight" prompt, 3-run mean):

    Plain decode before: ~13.9 t/s  (ATS-mapped weights, all paths)
    Plain decode after:  ~16.13 t/s (HBM-resident hot spans + small-N kernel fuses)

Adds `speed-bench/gb10.csv` per CONTRIBUTING.md convention so the 2048..65536 sweep is preserved alongside the existing m2_ultra.csv and m4_max.csv. Generated via:

    ./ds4-bench -m ds4flash.gguf \
        --prompt-file speed-bench/promessi_sposi.txt \
        --ctx-start 2048 --ctx-max 65536 --step-incr 2048 \
        --gen-tokens 128 --csv speed-bench/gb10.csv

Hardware: NVIDIA DGX Spark (GB10 / sm_121), driver 580.142, CUDA 13.0
Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
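
For reference, greedy decode only needs the index of the largest logit, which is what the GPU argmax item in the commit message refers to. The function below is a plain CPU reference for that result, not the kernel added by this commit; the name `argmax_f32` and the lowest-index tie-break are assumptions made for illustration.

```c
#include <stddef.h>

/* Index of the largest value in logits[0..n-1]. Ties resolve to the lowest
 * index (an assumed convention). Requires n > 0. */
static size_t argmax_f32(const float *logits, size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (logits[i] > logits[best]) {
            best = i;
        }
    }
    return best;
}
```

A reference like this is useful mainly as a check: a GPU reduction over the same logits should return the same index whenever the maximum is unique.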

This was referenced May 24, 2026

mtp: combined-forward speculative decode beats plain on GB10 (+2.4 t/s) (stacked on #13) #14

Open

cuda: DGX Spark / GB10 backend support — HBM-resident model #11

Closed

TrevorS force-pushed the gb10-hbm-resident-model branch from 36c1735 to 4e47e95 on May 24, 2026 21:35

TrevorS force-pushed the gb10-hbm-resident-model branch from 4e47e95 to a6284f0 on May 24, 2026 22:10

TrevorS mentioned this pull request May 25, 2026

ds4 / DGX Spark / MTP and Performance Improvements antirez/ds4#244

Open

TrevorS wants to merge 1 commit into main from gb10-hbm-resident-model
