cuda: HBM-resident model on DGX Spark / GB10 by TrevorS · Pull Request #1 · TrevorS/ds4

TrevorS · 2026-05-23T19:38:29Z

PR1 draft: cuda: HBM-resident model on DGX Spark / GB10

Summary

Seven commits on TrevorS/ds4 @ upstream-minimal (tip 3514c55), branched from upstream/main (9ae1eeb).

Fixes a GB10-specific bug where cudaHostRegister(... | cudaHostRegisterReadOnly) returns cudaErrorNotSupported (driver 580.142 / cudaDevAttrHostRegisterReadOnlySupported = 0), causing the model-resident fast path to silently fall back to per-deref H2D streaming. nsys on a 32-token MTP decode measured 90,525 MB H2D transferred (87.7% of GPU timeline) before the fix.

The stack:

24ce6bd — replaces an indexer-as-argmax misuse in the MTP path with a real GPU argmax kernel (also closes an upstream strict-mode drift under DS4_MTP_BATCH_VERIFY)
d9a986b — drops the cudaHostRegisterReadOnly flag (the actual GB10 bug)
2081f13 — routes MTP-loaded sessions through the startup tensor-cache walk (was gated off by !mtp_ready)
e9085d0 — refactors the cudaMalloc cache helper + reorders the lookup so device-resident copies win over the registered-mapped UVA pointer (behavior-equivalent on its own; sets up the next commit)
5505cf8 — force-populates the HBM cache at startup with a 24 GiB budget cap + MoE expert filter
df6b438 — extends the same cache walk to the MTP support model
3514c55 — adds speed-bench/gb10.csv per CONTRIBUTING.md

Headline numbers (DGX Spark / GB10, ds4flash IQ2XXS-w2Q2K)

ds4-bench canonical sweep (--ctx-start 2048 --step-incr 2048 --gen-tokens 128):

ctx_tokens	upstream gen t/s	this PR gen t/s	Δ
2,048	13.75	14.20	+3.3%
8,192	13.62	13.98	+2.6%
16,384	13.49	13.84	+2.6%
24,576	12.94	13.61	+5.2%
30,720	13.01	13.35	+2.6%
32,768	(timeout)	12.98	—
49,152	(timeout)	12.60	—
65,536	(timeout)	12.08	—

Upstream's sweep was truncated at ctx=30720 because per-iteration latency on upstream/main exceeded the available wall-clock budget (>9 min per row at ctx≥32768) — itself a symptom of the H2D-streaming bug this PR fixes. PR1's sweep completes the full canonical 2048→65536 range. Full per-row data is in speed-bench/gb10.csv.

MTP-mode wins (not exercised by ds4-bench, captured during commit-message benches with ./ds4 -n 256):

Mode	upstream `9ae1eeb`	PR1 tip `3514c55`	Δ
Plain decode	~13.9 t/s	~16.0 t/s (`./ds4`) / 14.2 t/s (`ds4-bench`)	+2-15%
MTP `BATCH_VERIFY=1 STRICT=1 --mtp-draft 2`	8.76	16.51	+88%
MTP canonical strict (K=1)	7.74	13.50	+74%

The MTP path was effectively broken on GB10 before this PR.

Tested against

make clean && make cuda-spark — clean build, no new warnings
./ds4_test --all — 1 failure (logprob-vectors short_code_completion) also fails on upstream/main, pre-existing fixture drift
make cuda-regression — pre-existing build error in tests/cuda_long_context_smoke.c (signature mismatch), also fails on upstream/main, not introduced by this PR
make cpu — clean build, no new warnings
./ds4-bench formal sweep — see speed-bench/gb10.csv (added by this PR)
Byte-equality diff vs upstream/main for plain decode (-n 64 -p "knight" --temp 0) — 0 bytes
Byte-equality diff vs upstream/main for MTP strict (DS4_MTP_BATCH_VERIFY=1 DS4_MTP_STRICT=1 --mtp-draft 2 -n 64 -p "knight" --temp 0) — 0 bytes
Hardware: NVIDIA DGX Spark (GB10 / sm_121), driver 580.142, CUDA 13.0
Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf (80.76 GiB)
MTP: DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf (3.5 GiB)

Notes

New env knob: DS4_CUDA_NO_HBM_CACHE=1 — kill-switch to fall back to pre-fix UVA-mapped behavior. Diagnostic only; default = off.
Existing knob default change: DS4_CUDA_WEIGHT_CACHE_LIMIT_GB default changes from unbounded → 24 GiB. Tunable on hosts with more memory budget.
The HBM cache duplicates ~8.2 GiB of hot tensors into device memory under the default MoE filter. On UMA this is real memory pressure; the budget cap prevents OOM. MoE expert weights (*_exps.weight, ~65 GiB total, top-K of 256 active per token) are skipped — caching them would waste budget on bandwidth-cold weights. Disable filter via DS4_CUDA_CACHE_ALL_TENSORS=1.
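The filter-plus-budget policy described in these notes can be sketched in plain C. This is a minimal illustration, not the code in ds4_cuda.cu: the `tensor_span` struct, the `is_moe_expert` and `plan_cache` helpers, and the choice to skip (rather than stop at) a span that does not fit are all assumptions made for the example. Only the `_exps.weight` suffix and the idea of a byte budget come from the PR text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tensor descriptor; the real ds4 structures are not shown here. */
typedef struct {
    const char *name;
    uint64_t bytes;
} tensor_span;

/* Returns 1 when the tensor name ends in "_exps.weight" (MoE expert weights). */
static int is_moe_expert(const char *name) {
    static const char suffix[] = "_exps.weight";
    size_t n = strlen(name);
    size_t m = sizeof(suffix) - 1;
    return n >= m && strcmp(name + n - m, suffix) == 0;
}

/*
 * Walks the spans in order and counts the ones that would be copied to
 * device memory: a span is cached only if it passes the MoE filter
 * (unless cache_all is set) and still fits under limit_bytes.
 * Returns the total bytes cached; *cached_count receives the span count.
 */
static uint64_t plan_cache(const tensor_span *spans, size_t count,
                           uint64_t limit_bytes, int cache_all,
                           size_t *cached_count) {
    uint64_t used = 0;
    size_t cached = 0;
    for (size_t i = 0; i < count; i++) {
        if (!cache_all && is_moe_expert(spans[i].name)) {
            continue; /* cold expert weights stay on the mapped pointer */
        }
        if (spans[i].bytes > limit_bytes - used) {
            continue; /* assumed policy: skip spans that exceed the cap */
        }
        used += spans[i].bytes;
        cached++;
    }
    if (cached_count != NULL) {
        *cached_count = cached;
    }
    return used;
}
```

In this sketch the `cache_all` argument plays the role of DS4_CUDA_CACHE_ALL_TENSORS=1 and `limit_bytes` the role of DS4_CUDA_WEIGHT_CACHE_LIMIT_GB converted to bytes.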

Sample row from `speed-bench/gb10.csv`

ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,402.88,128,14.20,52184460
16384,2048,375.45,128,13.84,249505164
32768,2048,346.36,128,12.98,475014540
65536,2048,287.44,128,12.08,926033292
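One thing the sample rows make visible is that `kvcache_bytes` grows linearly with `ctx_tokens`: the difference between the first and last rows works out to exactly 13,764 bytes per context token, and the two middle rows fall on the same line. The sketch below reproduces that arithmetic; the struct and function names are made up for the example, and the linear relation is only checked against these four rows, not claimed for other models or settings.

```c
#include <assert.h>
#include <stdint.h>

/* One (ctx_tokens, kvcache_bytes) pair taken from the CSV rows above. */
typedef struct {
    uint64_t ctx_tokens;
    uint64_t kvcache_bytes;
} kv_sample;

/* Slope in bytes per context token between two rows (b must have more tokens). */
static uint64_t kv_bytes_per_token(kv_sample a, kv_sample b) {
    return (b.kvcache_bytes - a.kvcache_bytes) / (b.ctx_tokens - a.ctx_tokens);
}

/* Linear extrapolation from a base row to another context length. */
static uint64_t kv_predict(kv_sample base, uint64_t per_token, uint64_t ctx_tokens) {
    return base.kvcache_bytes + (ctx_tokens - base.ctx_tokens) * per_token;
}
```

Feeding in the 2048 and 65536 rows gives the slope, and `kv_predict` then recovers the 16384 and 32768 rows exactly.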

Out of scope (follow-ups)

Combined-forward MTP path (Tier-2 stack in our fork) — has unresolved strict-mode drift; gated for follow-up PR
Captured-graph support for spec decode — independent subsystem, separate PR
Pre-existing --logprob-vectors short_code_completion fixture drift — also on upstream, likely tokenizer/template drift between fixture and current GGUF
Pre-existing tests/cuda_long_context_smoke.c signature mismatch in make cuda-regression — also on upstream

The answer was outside of the claimed energy precision.

The evaluation after the fix (with smooth distribution over the tokens):

```
$ ./ds4-eval --temp 3.0 --min-p 0.25 --nothink
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-eval: context auto-sized to 16777 tokens (largest prompt=777 tokens, case=70, generation budget=16000)
ds4-eval: context buffers 479.38 MiB (ctx=16777, backend=cuda, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=4196)
ds4-eval: 17/92 passed, 1 failed, runtime 00h:34m

 #  state    prompt  gen   total  given  correct  test
 1  PASSED   201     733   934    B      B        GPQA Diamond/recNu3MXkvWUzHZr9
 2  PASSED   149     87    236    C      C        SuperGPQA/001b51d76b4d422988f2c11f104a2c6c
 3  PASSED   81      574   655    70     70       AIME2025/aime2025-01
 4  PASSED   313     239   552    C      C        GPQA Diamond/recoiTJPGUmzAkief
 5  PASSED   272     177   449    J      J        SuperGPQA/b7e20eac98764fb0bf30e8366d951daa
 6  PASSED   146     1140  1286   468    468      AIME2025/aime2025-16
 7  PASSED   156     646   802    B      B        GPQA Diamond/rec4UqStf9WUVif1f
 8  PASSED   127     52    179    E      E        SuperGPQA/4a1d1780a93f4093b6fb7d3c314cbea8
 9  PASSED   633     4780  5413   588    588      AIME2025/aime2025-02
10  PASSED   182     322   504    B      B        GPQA Diamond/recgI6tUQ7RLJRWGx
11  PASSED   137     68    205    A      A        SuperGPQA/6082513c8dba4ec68aa68f1bf5854d09
12  PASSED   165     747   912    16     16       AIME2025/aime2025-03
13  PASSED   149     672   821    A      A        GPQA Diamond (modified)/recDytVnNYZe2HuUU
14  PASSED   167     68    235    J      J        SuperGPQA/bebf1ed45ae14ad7b4f205f3909cb58a
15  FAILED   305     4837  5142   86     82       AIME2025/aime2025-18
16  PASSED   131     671   802    D      D        GPQA Diamond/recNFJjE5PPTqVJGv
17  PASSED   175     67    242    I      I        SuperGPQA/7ca71b86327744b78e93185a45bc5cef
18  PASSED   102     1199  1301   117    117      AIME2025/aime2025-04
19  STOPPED  187     80    267    -      B        GPQA Diamond/rec2UlKqC6RFHdcro
20  PENDING  0       0     0      -      E        SuperGPQA/d44b94f7749345a39a65f6312bda8764
21  PENDING  0       0     0      -      106      AIME2025/aime2025-19
22  PENDING  0       0     0      -      B        GPQA Diamond/recv7GsQg3f0fvB1f
23  PENDING  0       0     0      -      B        SuperGPQA/febe406f44d74a40b50bb5b7c69d5dc1
```

Full routed-MoE TensorOps enabled the gate, up, and down projections. The regression was isolated to the gate projection: enabling TensorOps for gate is sufficient to send a sensitive AIME continuation into a repeated wrong answer, while TensorOps for up+down remains stable. The kernel-side cause is small but real arithmetic drift in mpp::tensor_ops::matmul2d relative to the legacy simdgroup MMA contraction. A same-input routed-MoE probe showed no address/layout corruption: TensorOps gate was close to legacy, but not bit-identical. An isolated same-tile primitive probe confirmed the source outside DS4 routing and quantization: legacy simdgroup_multiply_accumulate matched a CPU FP32 serial dot-product reference on the tested tile, while TensorOps produced close nonzero FP32 differences. MTLMathModeSafe and the tested TensorOps descriptor variant did not remove the drift. That normally tiny drift matters here because MoE routing has discontinuous top-k expert selection. In the failing path the first observed safe-vs-full routing change was layer 3, token row 11: the selected sixth expert changed from 96 to 50 across a margin of only about 8e-4. Once an expert changes, the transformer state is no longer a smooth local perturbation, and autoregressive decoding can fall into a bad repetition basin. Attempts that preserved the full gate TensorOps speed did not produce a zero-drift or stable fix: forcing the routed intermediate to F32, using the older generic TensorOps routed matmul instead of the expert-major fast layout, changing the TensorOps descriptor mode, and compiling with strict Metal math all left the gate drift or the bad continuation in place. Retaining TensorOps for up and down keeps most of the MoE speedup, but gate stays on the legacy path because it feeds the nonlinear silu(gate) * up branch and is the projection that can flip later router decisions.
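The discontinuity described here is easy to reproduce in isolation. The sketch below is a hypothetical top-k selector in plain C, not the ds4 router: it only shows that when two scores sit within a margin on the order of 1e-3, a perturbation of that size changes which index is selected even though every score moved by a tiny amount. The function name and the toy scores are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Writes the indices of the k largest scores into out, in descending
 * score order. Ties go to the lower index. O(n*k), which is fine for a
 * sketch; out must have room for k entries.
 */
static void top_k(const float *scores, size_t n, size_t k, int *out) {
    for (size_t j = 0; j < k; j++) {
        int best = -1;
        for (size_t i = 0; i < n; i++) {
            int taken = 0;
            for (size_t t = 0; t < j; t++) {
                if (out[t] == (int)i) {
                    taken = 1;
                }
            }
            if (taken) {
                continue; /* already selected in an earlier round */
            }
            if (best < 0 || scores[i] > scores[best]) {
                best = (int)i;
            }
        }
        out[j] = best;
    }
}
```

With scores `{0.5, 0.3000, 0.3008, 0.1}` and k=2 the second slot goes to index 2; adding 0.001 to index 1 hands that slot to index 1 instead, which is the same kind of flip as the expert 96 to expert 50 change reported above.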

Remove the routed-MoE TensorOps/NAX path completely instead of leaving it as a gated-off mode. Semantic evals showed that gate, up, and down TensorOps routed-MoE variants can each move the model into bad continuations, while the full-tile-only expert-major experiment was correctness-interesting but slower than the legacy simdgroup path. Keeping the dead kernels around risks accidental re-enablement without a trustworthy correctness story. The routed MoE grouped matmul now always uses the legacy 32-token expert-major simdgroup kernel. Other Metal4 TensorOps paths, such as the attention-output projection, remain enabled independently.

ds4-eval (fix): q13 provides wrong answer

agent: add --chdir working-directory option

Preserve generated output while linenoise prompt/history entries grow by scrolling the output region and keeping the output cursor column. Polish streaming colors by highlighting [upto], restoring active text attributes after prompt redraws, and showing throttled power in the status bar without duplicate messages. Keep unknown slash commands editable by beeping and restoring the input instead of printing an error or sending it to the model.

Persist agent session titles in the KV file trailer and derive the saved session ID from title plus created_at so resaves keep a stable identity. Preserve and display titles in /list, keep stripped sessions readable, and migrate legacy rendered-text sessions on their next successful save. Also fix stripped-session reloads to accept the retokenized rendered text count, and adjust the /list footer wording to use session IDs.

On DGX Spark / GB10 with driver 580.142, cudaHostRegister fails with cudaErrorNotSupported when the cudaHostRegisterReadOnly flag is requested (cudaDevAttrHostRegisterReadOnlySupported reports 0).

Pre-fix behavior: registration fails entirely, the model-resident fast path falls back to per-deref H2D streaming. nsys on combined-MTP at 32 tokens measured 90,525 MB H2D transferred (87.7% of the GPU timeline), 2,572 individual copies, and 12.16s of cudaMemcpy_ptds API time. Spark's equivalent decode does 24 MB total H2D / 0.3% of timeline.

Fix: drop the ReadOnly flag from both cudaHostRegister call sites in ds4_cuda.cu. Mirrors spark's policy in cuda/runtime.cu:254-268 -- on ATS-capable hosts spark skips registration entirely, but its non-ATS fallback also uses plain cudaHostRegisterMapped (no ReadOnly).

Per-mode bench impact (n=256, /dev/null, 3-run avg), in combination with the follow-up !mtp_ready gate drop:

Mode	Pre-fix	Post-fix	Delta
Plain decode	15.48	13.92	-10%
MTP canonical strict	7.74	11.87	+53%
MTP batched verifier	8.76	14.73	+68%
MTP combined-K=1	10.31	17.93	+74%

The 10% plain regression comes from the register-mapped UVA pointer going through host page tables on every kernel read, vs. the pre-fix path which fell through to cudaMalloc+cudaMemcpy and produced truly device-resident HBM copies. Net trade is heavily positive for MTP-dominant workloads. A follow-up will pursue a hybrid that recovers plain decode by pre-populating the cudaMalloc cache at startup while keeping the register-mapped fallback for cold lookups.

Pre-fix the startup model-tensor cache walk was guarded by `!e->mtp_ready`, meaning MTP-loaded sessions skipped the entire device-side cache population. Under the prior cudaHostRegister-fails behavior, that meant MTP runs paid per-call cudaMalloc + cudaMemcpy on every uncached tensor deref during decode -- contributing to the 2,572 H2D copies per 32-token spec iter that nsys captured.

Combined with the cudaHostRegisterReadOnly drop in the prior commit, removing this gate lets MTP-loaded sessions use the same fast device-mapped registered model pointer as plain decode.

Per-mode bench impact (cumulative with prior commit, n=256, /dev/null, 3-run avg):

MTP canonical strict	7.74 -> 11.87	(+53%)
MTP batched verifier	8.76 -> 14.73	(+68%)
MTP combined-K=1	10.31 -> 17.93	(+74%)

MTP combined-K=1 at 17.93 t/s now beats:
- mainline plain 13.92 (+29%)
- pre-fix mainline plain 15.48 (+16%)
- spark MTP K=1 captured 16.20 (+11%)
- spark MTP K=2 captured 14.83 (+21%)

Phase 4 goal "mtp > plain on mainline" is met.

Scaffolding for the hybrid HBM-cache restore. No behavior change today (cache stays empty under post-fix register-mapped flow); sets up the plumbing for the next commit.

1. Extract cudaMalloc + chunked cudaMemcpy body into cuda_model_range_populate_device_copy(). Same logic, same caching into g_model_ranges + g_model_range_by_offset, same verbose log line. Single call from cuda_model_range_ptr replaces the inline block at the bottom of that function.

2. Reorder cuda_model_range_ptr to check g_model_ranges cache *before* the g_model_device_owned / g_model_registered short-circuit. Today the cache is empty in the registered-mapped flow so the reorder is behavior-equivalent: cache lookup returns nothing, fall through to the registered shortcut, return the UVA-mapped pointer. After the next commit populates the cache via force-populate during the startup walk, this reorder lets device-resident HBM copies win over the slower UVA-mapped pointer (~10% on plain decode).

Bench parity check (3-run avg, n=256, vs post-fix baseline):

Plain decode	13.94 (was 13.92)
MTP combined-K=1	18.04 (was 17.93)

Within noise. Confirms commit is behavior-equivalent.
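The reordered lookup can be pictured with a small host-only model. This is a sketch under stated assumptions, not the body of cuda_model_range_ptr: the `cached_range` and `model_view` types, the linear scan, and the `range_ptr` name are invented for the example, and plain host pointers stand in for device and UVA-mapped addresses. It shows the one property the commit message relies on: with an empty cache the function still returns the registered pointer, so the reorder alone changes nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache entry: a model byte range with a device-resident copy. */
typedef struct {
    uint64_t offset;
    uint64_t bytes;
    const void *device_ptr;
} cached_range;

/* Hypothetical view of the lookup state; registered_base is NULL when
 * the model mapping was not registered for device access. */
typedef struct {
    const cached_range *ranges;
    size_t count;
    const unsigned char *registered_base;
} model_view;

/*
 * Lookup order after the reorder: device-resident cache first, then the
 * registered (UVA-mapped) pointer, then NULL to signal that the caller
 * would have to populate a device copy.
 */
static const void *range_ptr(const model_view *m, uint64_t offset, uint64_t bytes) {
    for (size_t i = 0; i < m->count; i++) {
        if (m->ranges[i].offset == offset && m->ranges[i].bytes >= bytes) {
            return m->ranges[i].device_ptr; /* cached copy wins */
        }
    }
    if (m->registered_base != NULL) {
        return m->registered_base + offset; /* mapped fallback */
    }
    return NULL;
}
```

Once a range is present in the cache, the same call returns the cached pointer instead, which is the behavior the following commit depends on.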

Restores plain decode (and gives MTP another 9% lift) by switching the startup tensor-span walk to actively cudaMalloc + cudaMemcpy device-resident copies instead of silently no-op'ing under the registered-mapped flow.

Three changes:

1. ds4_gpu_cache_model_range: stop calling cuda_model_range_ptr (which short-circuits to the UVA-mapped pointer when g_model_registered is set) and call cuda_model_range_populate_device_copy directly. Adds budget guard + DS4_CUDA_NO_HBM_CACHE opt-out env.

2. cuda_model_cache_limit_bytes: change default from UINT64_MAX to 24 GiB. Without a cap, an 80 GiB model duplicated into HBM-backed cudaMalloc would exhaust 121 GiB UMA. 24 GiB comfortably covers non-MoE tensors plus headroom; tune via DS4_CUDA_WEIGHT_CACHE_LIMIT_GB.

3. accelerator_cache_model_tensor_spans: skip MoE expert weights (`*_exps.weight`, ~65 GiB total). Top-K of N=256 experts fire per token, so caching them wastes most of the budget on cold weights and starves attn/embed/output-head/shared-FFN. Cold MoE reads fall back to UVA-mapped pointer. Disable filter via DS4_CUDA_CACHE_ALL_TENSORS=1.

Per-mode bench (3-run avg, n=256, /dev/null):

Mode	Pre-hybrid (post-fix)	Post-hybrid	Delta
Plain decode	13.92	15.90	+14.2%
MTP canonical strict	11.87	13.47	+13.5%
MTP batched verifier	14.73	16.45	+11.7%
MTP combined-K=1	17.93	19.56	+9.1%

Cumulative vs pre-weight-residency-fix baseline:

Plain decode	15.48 -> 15.90	+2.7%
MTP combined-K=1	10.31 -> 19.56	+89.7%

MTP combined-K=1 at 19.56 t/s beats:
- mainline plain 15.90 (+23%)
- spark MTP K=1 cap 16.20 (+21%)
- pre-hybrid post-fix 17.93 (+9%)

After commit 2 (force-populate at startup), the cache walk ran only against the base model -- MTP-block tensor reads at decode time hit the UVA-mapped pointer instead of cudaMalloc'd HBM copies. Two reasons it didn't manifest as catastrophic:

- MTP model is 3.5 GiB vs 80 GiB base -- small footprint
- Most MTP tensors are MoE experts that the *_exps filter would skip anyway, so the missed opportunity is small

Fix: call accelerator_cache_model_tensors a second time against e->mtp_model when MTP is loaded. The MoE filter from commit 2 automatically excludes `mtp.0.ffn_*_exps.weight`.

Per-mode bench (3-run avg, n=256, /dev/null):

Mode	Commit-2	Commit-3	Delta
Plain decode	15.90	15.89	noise
MTP canonical strict	13.47	13.50	+0.2%
MTP batched verifier	16.45	16.51	+0.4%
MTP combined-K=1	19.56	19.58	+0.1%

Closes a latent correctness gap (MTP tensors should be in the cache just like base-model tensors) without measurable risk. The walk is idempotent under the MoE filter, so even if both models share offsets the populate path skips already-cached spans.

Final cumulative vs pre-weight-residency-fix baseline:

Plain decode	15.48 -> 15.89	+2.6%
MTP canonical strict	7.74 -> 13.50	+74.4%
MTP batched verifier	8.76 -> 16.51	+88.5%
MTP combined-K=1	10.31 -> 19.58	+89.9%

MTP combined-K=1 (19.58 t/s) beats:
- mainline plain 15.89 (+23%)
- spark MTP K=1 captured 16.20 (+21%)
- spark plain (graphs ON) 14.89 (+32%)

Generated on this branch via:

    ./ds4-bench -m ds4flash.gguf --prompt-file speed-bench/promessi_sposi.txt \
        --ctx-start 2048 --ctx-max 65536 --step-incr 2048 --gen-tokens 128 \
        --csv speed-bench/gb10.csv

Hardware: NVIDIA DGX Spark (GB10 / sm_121), driver 580.142, CUDA 13.0
Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf

Two related changes packed together.

1. Combined-forward K=2 wiring
-----------------------------

`spec_argmax_combined` now also handles draft_cap=2 (N=3 batched verify over [first_token, drafts[0], drafts[1]]) with prefix-2 commit dispatch for the commit ∈ {0, 1, 2} cases. Gated behind DS4_MTP_COMBINED_K2=1 because measurement shows the K=2 variant is currently a loss on mainline:

    Combined K=1 (N=2 batched): 9.51 t/s (no flag, /dev/null)
    Combined K=2 (N=3 batched): 7.34 t/s (DS4_MTP_COMBINED_K2=1)

Why K=2 loses: `drafts[1]` cascades from `drafts[0]`'s MTP-state, but `drafts[0]` itself comes from `combined_prev_hc` (= post-previous-iter-last-token HC), not from the fresh post-`first_token` main-HC the canonical eval(first_token) would produce. So `drafts[0]` is "one position stale" already, and `drafts[1]` cascades further off-target. The target verifier rejects `drafts[1]` in the vast majority of iters, so the extra batched-N=3 row costs more than it pays.

Keeping the K=2 path as opt-in because the prefix-2 wiring is correct and reusable when the staleness fix lands (interleaved MTP-block inside batched main forward, spark-style). See PHASE4.md item #1.

2. Session-cached spec_row_logits buffer
----------------------------------------

Adds `s->spec_row_logits_buf` (3 * VOCAB f32 = ~1.5 MiB) and `s->spec_row_tops_buf` (3 * int) allocated at session creation, replacing the per-spec-call xmalloc/free pattern in `ds4_session_eval_speculative_argmax_combined`.

Measurement impact: small (~0-3% in noise). The malloc overhead hypothesis was a wrong guess at what was producing the 73 ms per-call overhead between component-timed (~95 ms) and observed wall (~168 ms) combined cost. Documented in PHASE4.md item #3 -- the actual source of that overhead is still unidentified after this attempt.

Effect on default combined K=1: 9.51 -> 9.48 t/s (within noise).

Foundation for future xmalloc cleanup in the canonical path's decode2_exact branch (still allocates per-call).

LOC
---

ds4.c: +67/-35 (combined K=2 dispatch + session buf fields + alloc/free sites + caller changes). Two new session fields, two new env gates.

NO github push. jj change oxmoztuq.
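The "prefix commit" idea can be illustrated with the usual greedy verification rule for speculative decoding. The sketch below assumes that rule; it is not taken from ds4.c, and `spec_result` and `verify_prefix` are hypothetical names. Row i of the batched verify over [first_token, drafts[0], ..., drafts[k-1]] yields the target's argmax for the next position, so draft i is accepted only if it equals that argmax and every earlier draft was accepted too.

```c
#include <assert.h>

/* Result of one verification step in this sketch. */
typedef struct {
    int commit;     /* number of drafts accepted, 0..k */
    int next_token; /* target's own prediction after the accepted prefix */
} spec_result;

/*
 * drafts has k entries; target_tops has k+1 entries, one per batched row.
 * Acceptance stops at the first mismatch, so a later draft that happens
 * to match is still discarded.
 */
static spec_result verify_prefix(const int *drafts, const int *target_tops, int k) {
    spec_result r;
    r.commit = 0;
    while (r.commit < k && drafts[r.commit] == target_tops[r.commit]) {
        r.commit++;
    }
    r.next_token = target_tops[r.commit];
    return r;
}
```

For K=2 this produces exactly the three commit cases named above: 0, 1, or 2 accepted drafts. When `drafts[1]` is rejected most of the time, the third batched row is paid for on every iteration but only rarely contributes a committed token, which is consistent with the K=2 slowdown reported here.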

TrevorS · 2026-05-24T20:58:57Z

Superseded by the reframed 2-PR stack (#11 + #12), which tells the same Spark/GB10 + MTP combined-forward story more concisely, rebased on current upstream/main, with the exploratory paths dropped.

gmontana and others added 6 commits May 23, 2026 14:43

agent: add working-directory option

23e264d

Highlight agent code output

e792c55

Fix F16 routed MoE graph dumps

a05e69e

Add GPU power throttling

444afce

TrevorS mentioned this pull request May 23, 2026

cuda: small-N batched kernel polish (stacked on #1) #2

Closed

antirez and others added 20 commits May 23, 2026 22:30

Simplify agent edit tools

df22245

Add runtime power commands

f398aa3

Guard attention output TensorOps full tiles

57ae485

Apply agent power changes while busy

4e867d6

Improve agent edit tooling

d447bdb

Fix anchored edit tail matching

a1f30a6

Merge pull request antirez#233 from alantsev/eval-fix

e3efafe

ds4-eval (fix): q13 provides wrong answer

Merge pull request antirez#232 from gmontana/fix/agent-chdir

b230e8e

agent: add --chdir working-directory option

Refine agent tool prompt reminders

925eb4c

Improve agent session management

4a855d8

cuda: GPU argmax kernel + remove indexer-as-argmax misuse in MTP path

693d717

TrevorS force-pushed the upstream-minimal branch from 3514c55 to 6e95cf6 Compare May 24, 2026 17:13

TrevorS closed this May 24, 2026

TrevorS deleted the upstream-minimal branch May 24, 2026 22:43

TrevorS wants to merge 26 commits into main from upstream-minimal


4 participants
