
cuda: HBM-resident model on DGX Spark / GB10#1

Closed
TrevorS wants to merge 26 commits into main from upstream-minimal

@TrevorS TrevorS commented May 23, 2026

PR1 draft: cuda: HBM-resident model on DGX Spark / GB10

Summary

Seven commits on TrevorS/ds4 @ upstream-minimal (tip 3514c55), branched from upstream/main (9ae1eeb).

Fixes a GB10-specific bug where cudaHostRegister(... | cudaHostRegisterReadOnly) returns cudaErrorNotSupported (driver 580.142 / cudaDevAttrHostRegisterReadOnlySupported = 0), causing the model-resident fast path to silently fall back to per-deref H2D streaming. nsys on a 32-token MTP decode measured 90,525 MB H2D transferred (87.7% of GPU timeline) before the fix.

The stack:

  1. 24ce6bd — replaces an indexer-as-argmax misuse in the MTP path with a real GPU argmax kernel (also closes an upstream strict-mode drift under DS4_MTP_BATCH_VERIFY)
  2. d9a986b — drops the cudaHostRegisterReadOnly flag (the actual GB10 bug)
  3. 2081f13 — routes MTP-loaded sessions through the startup tensor-cache walk (was gated off by !mtp_ready)
  4. e9085d0 — refactors the cudaMalloc cache helper + reorders the lookup so device-resident copies win over the registered-mapped UVA pointer (behavior-equivalent on its own; sets up the next commit)
  5. 5505cf8 — force-populates the HBM cache at startup with a 24 GiB budget cap + MoE expert filter
  6. df6b438 — extends the same cache walk to the MTP support model
  7. 3514c55 — adds speed-bench/gb10.csv per CONTRIBUTING.md
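
For reviewers who want to check commit 1's GPU argmax against a CPU result, a host-side reference can be as small as the sketch below. The function name and the tie-breaking rule (lowest index wins) are assumptions for illustration; the PR does not state how the kernel resolves ties.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side reference: index of the largest logit.
 * Ties resolve to the lowest index (an assumption, not taken from the PR). */
size_t argmax_f32(const float *logits, size_t n) {
    assert(logits != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (logits[i] > logits[best]) best = i;  /* strict > keeps the first maximum */
    }
    return best;
}
```

Comparing this against the kernel output on the same logits row is one way to confirm the two agree before relying on byte-equality of the decoded text.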

Headline numbers (DGX Spark / GB10, ds4flash IQ2XXS-w2Q2K)

ds4-bench canonical sweep (--ctx-start 2048 --step-incr 2048 --gen-tokens 128):

| ctx_tokens | upstream gen t/s | this PR gen t/s | Δ |
|---|---|---|---|
| 2,048 | 13.75 | 14.20 | +3.3% |
| 8,192 | 13.62 | 13.98 | +2.6% |
| 16,384 | 13.49 | 13.84 | +2.6% |
| 24,576 | 12.94 | 13.61 | +5.2% |
| 30,720 | 13.01 | 13.35 | +2.6% |
| 32,768 | (timeout) | 12.98 | |
| 49,152 | (timeout) | 12.60 | |
| 65,536 | (timeout) | 12.08 | |

Upstream's sweep was truncated at ctx=30720 because per-iteration latency on upstream/main exceeded the available wall-clock budget (>9 min per row at ctx≥32768) — itself a symptom of the H2D-streaming bug this PR fixes. PR1's sweep completes the full canonical 2048→65536 range. Full per-row data is in speed-bench/gb10.csv.
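
The Δ values in the sweep are relative changes in generation throughput against upstream. A minimal sketch of that arithmetic, with a hypothetical helper name, for anyone recomputing rows from speed-bench/gb10.csv:

```c
#include <assert.h>

/* Relative change from base to now, in percent.
 * Hypothetical helper; not part of ds4-bench. */
double pct_delta(double base, double now) {
    assert(base > 0.0);
    return (now - base) / base * 100.0;
}
```

For the first row, pct_delta(13.75, 14.20) evaluates to roughly 3.27, which the table rounds to +3.3%.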

MTP-mode wins (not exercised by ds4-bench, captured during commit-message benches with ./ds4 -n 256):

| Mode | upstream 9ae1eeb | PR1 tip 3514c55 | Δ |
|---|---|---|---|
| Plain decode | ~13.9 t/s | ~16.0 t/s (./ds4) / 14.2 t/s (ds4-bench) | +2-15% |
| MTP BATCH_VERIFY=1 STRICT=1 --mtp-draft 2 | 8.76 | 16.51 | +89% |
| MTP canonical strict (K=1) | 7.74 | 13.50 | +74% |

The MTP path was effectively broken on GB10 before this PR.

Tested against

  • make clean && make cuda-spark — clean build, no new warnings
  • ./ds4_test --all — 1 failure (logprob-vectors short_code_completion) also fails on upstream/main, pre-existing fixture drift
  • make cuda-regression — pre-existing build error in tests/cuda_long_context_smoke.c (signature mismatch), also fails on upstream/main, not introduced by this PR
  • make cpu — clean build, no new warnings
  • ./ds4-bench formal sweep — see speed-bench/gb10.csv (added by this PR)
  • Byte-equality diff vs upstream/main for plain decode (-n 64 -p "knight" --temp 0) — 0 bytes
  • Byte-equality diff vs upstream/main for MTP strict (DS4_MTP_BATCH_VERIFY=1 DS4_MTP_STRICT=1 --mtp-draft 2 -n 64 -p "knight" --temp 0) — 0 bytes
  • Hardware: NVIDIA DGX Spark (GB10 / sm_121), driver 580.142, CUDA 13.0
  • Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf (80.76 GiB)
  • MTP: DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf (3.5 GiB)

Notes

  • New env knob: DS4_CUDA_NO_HBM_CACHE=1 — kill-switch to fall back to pre-fix UVA-mapped behavior. Diagnostic only; default = off.
  • Existing knob default change: DS4_CUDA_WEIGHT_CACHE_LIMIT_GB default changes from unbounded → 24 GiB. Tunable on hosts with more memory budget.
  • The HBM cache duplicates ~8.2 GiB of hot tensors into device memory under the default MoE filter. On UMA this is real memory pressure; the budget cap prevents OOM. MoE expert weights (*_exps.weight, ~65 GiB total, top-K of 256 active per token) are skipped — caching them would waste budget on bandwidth-cold weights. Disable filter via DS4_CUDA_CACHE_ALL_TENSORS=1.
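
How the two cache knobs could combine into a single byte budget is sketched below. The variable names and the 24 GiB default come from the notes above; the function name, the acceptance of fractional GiB values, and the fallback to the default on malformed input are assumptions, not the behavior of ds4_cuda.cu.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define GIB (1024ull * 1024ull * 1024ull)

/* Hypothetical helper: resolve the HBM cache budget in bytes.
 * no_cache  : value of DS4_CUDA_NO_HBM_CACHE, or NULL if unset
 * limit_gb  : value of DS4_CUDA_WEIGHT_CACHE_LIMIT_GB, or NULL if unset
 * Parsing rules (fractions allowed, bad input falls back to the default)
 * are assumptions made for this sketch. */
uint64_t cache_limit_bytes(const char *no_cache, const char *limit_gb) {
    if (no_cache != NULL && no_cache[0] == '1' && no_cache[1] == '\0')
        return 0;                                  /* kill-switch: cache nothing */
    if (limit_gb == NULL || limit_gb[0] == '\0')
        return 24 * GIB;                           /* default cap from this PR */

    char *end = NULL;
    double gib = strtod(limit_gb, &end);
    if (end == limit_gb || *end != '\0' || gib <= 0.0)
        return 24 * GIB;                           /* malformed: keep the default */
    return (uint64_t)(gib * (double)GIB);
}
```

A caller would pass getenv("DS4_CUDA_NO_HBM_CACHE") and getenv("DS4_CUDA_WEIGHT_CACHE_LIMIT_GB"). Taking the strings as parameters keeps the helper testable without touching the process environment.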

Sample rows from speed-bench/gb10.csv

ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,402.88,128,14.20,52184460
16384,2048,375.45,128,13.84,249505164
32768,2048,346.36,128,12.98,475014540
65536,2048,287.44,128,12.08,926033292

Out of scope (follow-ups)

  • Combined-forward MTP path (Tier-2 stack in our fork) — has unresolved strict-mode drift; gated for follow-up PR
  • Captured-graph support for spec decode — independent subsystem, separate PR
  • Pre-existing --logprob-vectors short_code_completion fixture drift — also on upstream, likely tokenizer/template drift between fixture and current GGUF
  • Pre-existing tests/cuda_long_context_smoke.c signature mismatch in make cuda-regression — also on upstream

gmontana and others added 6 commits May 23, 2026 14:43
The answer was outside the claimed energy precision.

Evaluation after the fix
(with a smooth distribution over the tokens):

```
$ ./ds4-eval --temp 3.0 --min-p 0.25 --nothink
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-eval: context auto-sized to 16777 tokens (largest prompt=777 tokens, case=70, generation budget=16000)
ds4-eval: context buffers 479.38 MiB (ctx=16777, backend=cuda, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=4196)
ds4-eval: 17/92 passed, 1 failed, runtime 00h:34m
#   state      prompt      gen    total given    correct  test
  1 PASSED        201      733      934 B        B        GPQA Diamond/recNu3MXkvWUzHZr9
  2 PASSED        149       87      236 C        C        SuperGPQA/001b51d76b4d422988f2c11f104a2c6c
  3 PASSED         81      574      655 70       70       AIME2025/aime2025-01
  4 PASSED        313      239      552 C        C        GPQA Diamond/recoiTJPGUmzAkief
  5 PASSED        272      177      449 J        J        SuperGPQA/b7e20eac98764fb0bf30e8366d951daa
  6 PASSED        146     1140     1286 468      468      AIME2025/aime2025-16
  7 PASSED        156      646      802 B        B        GPQA Diamond/rec4UqStf9WUVif1f
  8 PASSED        127       52      179 E        E        SuperGPQA/4a1d1780a93f4093b6fb7d3c314cbea8
  9 PASSED        633     4780     5413 588      588      AIME2025/aime2025-02
 10 PASSED        182      322      504 B        B        GPQA Diamond/recgI6tUQ7RLJRWGx
 11 PASSED        137       68      205 A        A        SuperGPQA/6082513c8dba4ec68aa68f1bf5854d09
 12 PASSED        165      747      912 16       16       AIME2025/aime2025-03
 13 PASSED        149      672      821 A        A        GPQA Diamond (modified)/recDytVnNYZe2HuUU
 14 PASSED        167       68      235 J        J        SuperGPQA/bebf1ed45ae14ad7b4f205f3909cb58a
 15 FAILED        305     4837     5142 86       82       AIME2025/aime2025-18
 16 PASSED        131      671      802 D        D        GPQA Diamond/recNFJjE5PPTqVJGv
 17 PASSED        175       67      242 I        I        SuperGPQA/7ca71b86327744b78e93185a45bc5cef
 18 PASSED        102     1199     1301 117      117      AIME2025/aime2025-04
 19 STOPPED       187       80      267 -        B        GPQA Diamond/rec2UlKqC6RFHdcro
 20 PENDING         0        0        0 -        E        SuperGPQA/d44b94f7749345a39a65f6312bda8764
 21 PENDING         0        0        0 -        106      AIME2025/aime2025-19
 22 PENDING         0        0        0 -        B        GPQA Diamond/recv7GsQg3f0fvB1f
 23 PENDING         0        0        0 -        B        SuperGPQA/febe406f44d74a40b50bb5b7c69d5dc1
```
Full routed-MoE TensorOps enabled the gate, up, and down projections. The
regression was isolated to the gate projection: enabling TensorOps for gate is
sufficient to send a sensitive AIME continuation into a repeated wrong answer,
while TensorOps for up+down remains stable.

The kernel-side cause is small but real arithmetic drift in
mpp::tensor_ops::matmul2d relative to the legacy simdgroup MMA contraction. A
same-input routed-MoE probe showed no address/layout corruption: TensorOps gate
was close to legacy, but not bit-identical. An isolated same-tile primitive
probe confirmed the source outside DS4 routing and quantization: legacy
simdgroup_multiply_accumulate matched a CPU FP32 serial dot-product reference on
the tested tile, while TensorOps produced close nonzero FP32 differences.
MTLMathModeSafe and the tested TensorOps descriptor variant did not remove the
drift.

That normally tiny drift matters here because MoE routing has discontinuous
top-k expert selection. In the failing path the first observed safe-vs-full
routing change was layer 3, token row 11: the selected sixth expert changed from
96 to 50 across a margin of only about 8e-4. Once an expert changes, the
transformer state is no longer a smooth local perturbation, and autoregressive
decoding can fall into a bad repetition basin.
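
The discontinuity can be reproduced in isolation. The sketch below is a plain top-k selection in C with hypothetical names; the scores used in the note after it are illustrative and are not the values observed at layer 3.

```c
#include <assert.h>
#include <stddef.h>

/* Select the k highest-scoring experts; out[] receives their indices in
 * descending score order. Simple O(n*k) selection, adequate for a demo. */
void topk_experts(const float *score, size_t n, size_t k, size_t *out) {
    assert(score != NULL && out != NULL && k <= n);
    for (size_t j = 0; j < k; j++) {
        size_t best = n;                       /* sentinel: nothing picked yet */
        for (size_t i = 0; i < n; i++) {
            int taken = 0;
            for (size_t t = 0; t < j; t++) {
                if (out[t] == i) taken = 1;    /* already selected earlier */
            }
            if (taken) continue;
            if (best == n || score[i] > score[best]) best = i;
        }
        out[j] = best;
    }
}
```

With scores {0.9, 0.5000, 0.4996, 0.1} and k=2 the selection is experts 0 and 1. Lowering the second score by 8e-4 changes the selection to experts 0 and 2, even though every individual score moved by less than a thousandth.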

Attempts that preserved the full gate TensorOps speed did not produce a
zero-drift or stable fix: forcing the routed intermediate to F32, using the
older generic TensorOps routed matmul instead of the expert-major fast layout,
changing the TensorOps descriptor mode, and compiling with strict Metal math all
left the gate drift or the bad continuation in place. Retaining TensorOps for
up and down keeps most of the MoE speedup, but gate stays on the legacy path
because it feeds the nonlinear silu(gate) * up branch and is the projection that
can flip later router decisions.
antirez and others added 20 commits May 23, 2026 22:30
Remove the routed-MoE TensorOps/NAX path completely instead of leaving it as a gated-off mode. Semantic evals showed that gate, up, and down TensorOps routed-MoE variants can each move the model into bad continuations, while the full-tile-only expert-major experiment was correctness-interesting but slower than the legacy simdgroup path. Keeping the dead kernels around risks accidental re-enablement without a trustworthy correctness story.

The routed MoE grouped matmul now always uses the legacy 32-token expert-major simdgroup kernel. Other Metal4 TensorOps paths, such as the attention-output projection, remain enabled independently.
ds4-eval (fix): q13 provides wrong answer
agent: add --chdir working-directory option
Preserve generated output while linenoise prompt/history entries grow by scrolling the output region and keeping the output cursor column.

Polish streaming colors by highlighting [upto], restoring active text attributes after prompt redraws, and showing throttled power in the status bar without duplicate messages.

Keep unknown slash commands editable by beeping and restoring the input instead of printing an error or sending it to the model.
Persist agent session titles in the KV file trailer and derive the saved session ID from title plus created_at so resaves keep a stable identity. Preserve and display titles in /list, keep stripped sessions readable, and migrate legacy rendered-text sessions on their next successful save.

Also fix stripped-session reloads to accept the retokenized rendered text count, and adjust the /list footer wording to use session IDs.
On DGX Spark / GB10 with driver 580.142, cudaHostRegister fails with
cudaErrorNotSupported when the cudaHostRegisterReadOnly flag is
requested (cudaDevAttrHostRegisterReadOnlySupported reports 0).

Pre-fix behavior: registration fails entirely, the model-resident fast
path falls back to per-deref H2D streaming. nsys on combined-MTP at
32 tokens measured 90,525 MB H2D transferred (87.7% of the GPU
timeline), 2,572 individual copies, and 12.16s of cudaMemcpy_ptds API
time. Spark's equivalent decode does 24 MB total H2D / 0.3% of timeline.

Fix: drop the ReadOnly flag from both cudaHostRegister call sites in
ds4_cuda.cu. Mirrors spark's policy in cuda/runtime.cu:254-268 -- on
ATS-capable hosts spark skips registration entirely, but its non-ATS
fallback also uses plain cudaHostRegisterMapped (no ReadOnly).

Per-mode bench impact (n=256, /dev/null, 3-run avg), in combination
with the follow-up !mtp_ready gate drop:

  Mode                   Pre-fix   Post-fix   Delta
  Plain decode             15.48     13.92    -10%
  MTP canonical strict      7.74     11.87    +53%
  MTP batched verifier      8.76     14.73    +68%
  MTP combined-K=1         10.31     17.93    +74%

The 10% plain regression comes from the register-mapped UVA pointer
going through host page tables on every kernel read, vs. the pre-fix
path which fell through to cudaMalloc+cudaMemcpy and produced truly
device-resident HBM copies. Net trade is heavily positive for MTP-
dominant workloads. A follow-up will pursue a hybrid that recovers
plain decode by pre-populating the cudaMalloc cache at startup while
keeping the register-mapped fallback for cold lookups.
Pre-fix the startup model-tensor cache walk was guarded by
`!e->mtp_ready`, meaning MTP-loaded sessions skipped the entire
device-side cache population. Under the prior cudaHostRegister-fails
behavior, that meant MTP runs paid per-call cudaMalloc + cudaMemcpy on
every uncached tensor deref during decode -- contributing to the
2,572 H2D copies per 32-token spec iter that nsys captured.

Combined with the cudaHostRegisterReadOnly drop in the prior commit,
removing this gate lets MTP-loaded sessions use the same fast device-
mapped registered model pointer as plain decode.

Per-mode bench impact (cumulative with prior commit, n=256, /dev/null,
3-run avg):

  MTP canonical strict      7.74 -> 11.87  (+53%)
  MTP batched verifier      8.76 -> 14.73  (+68%)
  MTP combined-K=1         10.31 -> 17.93  (+74%)

MTP combined-K=1 at 17.93 t/s now beats:
  - mainline plain         13.92  (+29%)
  - pre-fix mainline plain 15.48  (+16%)
  - spark MTP K=1 captured 16.20  (+11%)
  - spark MTP K=2 captured 14.83  (+21%)

Phase 4 goal "mtp > plain on mainline" is met.
Scaffolding for the hybrid HBM-cache restore.  No behavior change today
(cache stays empty under post-fix register-mapped flow); sets up the
plumbing for the next commit.

1. Extract cudaMalloc + chunked cudaMemcpy body into
   cuda_model_range_populate_device_copy().  Same logic, same caching
   into g_model_ranges + g_model_range_by_offset, same verbose log
   line.  Single call from cuda_model_range_ptr replaces the inline
   block at the bottom of that function.

2. Reorder cuda_model_range_ptr to check g_model_ranges cache *before*
   the g_model_device_owned / g_model_registered short-circuit.  Today
   the cache is empty in the registered-mapped flow so the reorder is
   behavior-equivalent: cache lookup returns nothing, fall through to
   the registered shortcut, return the UVA-mapped pointer.

After the next commit populates the cache via force-populate during
the startup walk, this reorder lets device-resident HBM copies win
over the slower UVA-mapped pointer (~10% on plain decode).
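
A host-only model of the reordered lookup, as a sketch: the struct and function names are hypothetical and plain char pointers stand in for device and mapped addresses, so this does not mirror the actual layout of g_model_ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, host-only model of the lookup order described above. */
typedef struct {
    uint64_t offset;       /* byte offset of the span inside the model file */
    uint64_t bytes;        /* span length */
    const char *dev_ptr;   /* stands in for a cudaMalloc'd device copy */
} cached_range;

typedef struct {
    const cached_range *ranges;  /* device-resident copies */
    size_t n_ranges;
    const char *mapped_base;     /* registered, mapped view; NULL if unregistered */
} model_view;

const char *range_ptr(const model_view *m, uint64_t offset, uint64_t bytes) {
    assert(m != NULL);
    /* 1. A device-resident copy wins when it fully covers the request. */
    for (size_t i = 0; i < m->n_ranges; i++) {
        const cached_range *r = &m->ranges[i];
        if (offset >= r->offset && offset + bytes <= r->offset + r->bytes)
            return r->dev_ptr + (offset - r->offset);
    }
    /* 2. Otherwise fall back to the mapped pointer (cold lookups). */
    if (m->mapped_base != NULL) return m->mapped_base + offset;
    /* 3. Nothing available: the caller would have to populate a copy. */
    return NULL;
}
```

With an empty cache, step 1 never matches and every request returns the mapped pointer, which is why the reorder alone does not change behavior.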

Bench parity check (3-run avg, n=256, vs post-fix baseline):
  Plain decode             13.94 (was 13.92)
  MTP combined-K=1         18.04 (was 17.93)

Within noise.  Confirms commit is behavior-equivalent.
Restores plain decode (and gives MTP another 9% lift) by switching the
startup tensor-span walk to actively cudaMalloc + cudaMemcpy device-
resident copies instead of silently no-op'ing under the registered-
mapped flow.

Three changes:

1. ds4_gpu_cache_model_range: stop calling cuda_model_range_ptr (which
   short-circuits to the UVA-mapped pointer when g_model_registered is
   set) and call cuda_model_range_populate_device_copy directly.  Adds
   budget guard + DS4_CUDA_NO_HBM_CACHE opt-out env.

2. cuda_model_cache_limit_bytes: change default from UINT64_MAX to
   24 GiB.  Without a cap, an 80 GiB model duplicated into HBM-backed
   cudaMalloc would exhaust 121 GiB UMA.  24 GiB comfortably covers
   non-MoE tensors plus headroom; tune via DS4_CUDA_WEIGHT_CACHE_LIMIT_GB.

3. accelerator_cache_model_tensor_spans: skip MoE expert weights
   (`*_exps.weight`, ~65 GiB total).  Top-K of N=256 experts fire per
   token, so caching them wastes most of the budget on cold weights and
   starves attn/embed/output-head/shared-FFN.  Cold MoE reads fall back
   to UVA-mapped pointer.  Disable filter via DS4_CUDA_CACHE_ALL_TENSORS=1.
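
The budget guard and the MoE filter can be pictured as one planning pass over the tensor list. This is a sketch under assumptions: the types and function names are hypothetical, and whether the real walk stops at the first tensor that does not fit or keeps looking for smaller ones is not stated in the commit (this version keeps looking).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    const char *name;
    uint64_t bytes;
} tensor_info;

/* Returns 1 when name ends with the routed-expert suffix named in the commit. */
int is_moe_expert(const char *name) {
    static const char suffix[] = "_exps.weight";
    size_t n = strlen(name), s = sizeof(suffix) - 1;
    return n >= s && strcmp(name + n - s, suffix) == 0;
}

/* Hypothetical planning pass: sets keep[i] = 1 for tensors that would get a
 * device copy and returns the bytes consumed. Tensors that do not fit in the
 * remaining budget are skipped, never truncated. */
uint64_t plan_cache(const tensor_info *t, size_t n, uint64_t budget,
                    int cache_all, int *keep) {
    assert(t != NULL && keep != NULL);
    uint64_t used = 0;
    for (size_t i = 0; i < n; i++) {
        keep[i] = 0;
        if (!cache_all && is_moe_expert(t[i].name)) continue;  /* cold weights */
        if (t[i].bytes > budget - used) continue;              /* over budget */
        keep[i] = 1;
        used += t[i].bytes;
    }
    return used;
}
```

Passing cache_all = 1 corresponds to DS4_CUDA_CACHE_ALL_TENSORS=1: the suffix filter is bypassed and only the budget limits what is copied.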

Per-mode bench (3-run avg, n=256, /dev/null):

  Mode                    Pre-hybrid (post-fix)  Post-hybrid  Delta
  Plain decode                     13.92             15.90    +14.2%
  MTP canonical strict             11.87             13.47    +13.5%
  MTP batched verifier             14.73             16.45    +11.7%
  MTP combined-K=1                 17.93             19.56     +9.1%

Cumulative vs pre-weight-residency-fix baseline:
  Plain decode                     15.48 ->          15.90     +2.7%
  MTP combined-K=1                 10.31 ->          19.56    +89.7%

MTP combined-K=1 at 19.56 t/s beats:
  mainline plain      15.90  (+23%)
  spark MTP K=1 cap   16.20  (+21%)
  pre-hybrid post-fix 17.93   (+9%)
After commit 2 (force-populate at startup), the cache walk ran only
against the base model -- MTP-block tensor reads at decode time hit
the UVA-mapped pointer instead of cudaMalloc'd HBM copies.

Two reasons it didn't manifest as catastrophic:
- MTP model is 3.5 GiB vs 80 GiB base -- small footprint
- Most MTP tensors are MoE experts that the *_exps filter would skip
  anyway, so the missed opportunity is small

Fix: call accelerator_cache_model_tensors a second time against
e->mtp_model when MTP is loaded.  The MoE filter from commit 2
automatically excludes `mtp.0.ffn_*_exps.weight`.

Per-mode bench (3-run avg, n=256, /dev/null):

  Mode                    Commit-2  Commit-3   Delta
  Plain decode              15.90     15.89    noise
  MTP canonical strict      13.47     13.50    +0.2%
  MTP batched verifier      16.45     16.51    +0.4%
  MTP combined-K=1          19.56     19.58    +0.1%

Closes a latent correctness gap (MTP tensors should be in the cache
just like base-model tensors) without measurable risk.  The walk is
idempotent under the MoE filter, so even if both models share offsets
the populate path skips already-cached spans.

Final cumulative vs pre-weight-residency-fix baseline:

  Plain decode             15.48 -> 15.89    +2.6%
  MTP canonical strict      7.74 -> 13.50   +74.4%
  MTP batched verifier      8.76 -> 16.51   +88.5%
  MTP combined-K=1         10.31 -> 19.58   +89.9%

MTP combined-K=1 (19.58 t/s) beats:
  mainline plain          15.89  (+23%)
  spark MTP K=1 captured  16.20  (+21%)
  spark plain (graphs ON) 14.89  (+32%)
Generated on this branch via:
  ./ds4-bench -m ds4flash.gguf --prompt-file speed-bench/promessi_sposi.txt \
    --ctx-start 2048 --ctx-max 65536 --step-incr 2048 --gen-tokens 128 \
    --csv speed-bench/gb10.csv

Hardware: NVIDIA DGX Spark (GB10 / sm_121), driver 580.142, CUDA 13.0
Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
@TrevorS TrevorS force-pushed the upstream-minimal branch from 3514c55 to 6e95cf6 May 24, 2026 17:13
TrevorS added a commit that referenced this pull request May 24, 2026
Two related changes packed together.

1. Combined-forward K=2 wiring
-----------------------------
`spec_argmax_combined` now also handles draft_cap=2 (N=3 batched verify
over [first_token, drafts[0], drafts[1]]) with prefix-2 commit dispatch
for the commit ∈ {0, 1, 2} cases.

Gated behind DS4_MTP_COMBINED_K2=1 because measurement shows the K=2
variant is currently a loss on mainline:

  Combined K=1 (N=2 batched): 9.51 t/s (no flag, /dev/null)
  Combined K=2 (N=3 batched): 7.34 t/s (DS4_MTP_COMBINED_K2=1)

Why K=2 loses: `drafts[1]` cascades from `drafts[0]`'s MTP-state, but
`drafts[0]` itself comes from `combined_prev_hc` (= post-previous-iter-
last-token HC), not from the fresh post-`first_token` main-HC the
canonical eval(first_token) would produce.  So `drafts[0]` is "one
position stale" already, and `drafts[1]` cascades further off-target.
The target verifier rejects `drafts[1]` in the vast majority of iters,
so the extra batched-N=3 row costs more than it pays.

Keeping the K=2 path as opt-in because the prefix-2 wiring is correct
and reusable when the staleness fix lands (interleaved MTP-block inside
batched main forward, spark-style).  See PHASE4.md item #1.
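
The commit count can be read as the length of the matching draft prefix. The sketch below shows the usual greedy acceptance rule for speculative decoding; it is not taken from ds4.c, and the function name and array layout are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of greedy prefix acceptance.
 * target[i] is the verifier's argmax after consuming row i of the batch
 * [first_token, drafts[0], ..., drafts[k-1]], i.e. the token the target
 * model wants at the position drafts[i] occupies.
 * Returns how many leading drafts match: the commit count in 0..k. */
size_t accepted_prefix(const int *drafts, const int *target, size_t k) {
    assert(drafts != NULL && target != NULL);
    size_t commit = 0;
    while (commit < k && drafts[commit] == target[commit]) commit++;
    return commit;
}
```

For K=2 the result is 0, 1, or 2. A stale drafts[0] ends the loop at the first comparison, so drafts[1] is never credited even when it would have matched on its own.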

2. Session-cached spec_row_logits buffer
----------------------------------------
Adds `s->spec_row_logits_buf` (3 * VOCAB f32 = ~1.5 MiB) and
`s->spec_row_tops_buf` (3 * int) allocated at session creation,
replacing the per-spec-call xmalloc/free pattern in
`ds4_session_eval_speculative_argmax_combined`.

Measurement impact: small (~0-3%, within noise).  The malloc-overhead
hypothesis did not explain the 73 ms per-call gap between the
component-timed cost (~95 ms) and the observed wall-clock cost
(~168 ms) of the combined path.  Documented in PHASE4.md item #3 -- the
actual source of that overhead is still unidentified after this
attempt.

Effect on default combined K=1: 9.51 -> 9.48 t/s (within noise).
Foundation for future xmalloc cleanup in the canonical path's
decode2_exact branch (still allocates per-call).
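
The allocate-once pattern for the two session buffers, as a minimal sketch. The field names follow the commit text; the struct name, function names, and error handling are hypothetical, and row and vocabulary sizes are left as parameters because the commit does not give VOCAB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    float *spec_row_logits_buf;  /* rows * vocab floats, reused by every call */
    int *spec_row_tops_buf;      /* one argmax slot per row */
    size_t rows;
    size_t vocab;
} spec_scratch;

/* Allocate once at session creation. Returns 0 on success, -1 on failure. */
int spec_scratch_init(spec_scratch *s, size_t rows, size_t vocab) {
    assert(s != NULL && rows > 0 && vocab > 0);
    s->rows = rows;
    s->vocab = vocab;
    s->spec_row_logits_buf = malloc(rows * vocab * sizeof(float));
    s->spec_row_tops_buf = malloc(rows * sizeof(int));
    if (s->spec_row_logits_buf == NULL || s->spec_row_tops_buf == NULL) {
        free(s->spec_row_logits_buf);
        free(s->spec_row_tops_buf);
        s->spec_row_logits_buf = NULL;
        s->spec_row_tops_buf = NULL;
        return -1;
    }
    return 0;
}

/* Borrow the logits slice for one verify row; no allocation on the hot path. */
float *spec_scratch_row(spec_scratch *s, size_t row) {
    assert(s != NULL && row < s->rows);
    return s->spec_row_logits_buf + row * s->vocab;
}

/* Release at session teardown. */
void spec_scratch_free(spec_scratch *s) {
    free(s->spec_row_logits_buf);
    free(s->spec_row_tops_buf);
    s->spec_row_logits_buf = NULL;
    s->spec_row_tops_buf = NULL;
}
```

Each speculative call then borrows rows through spec_scratch_row instead of pairing an allocation with a free, which is the per-call pattern the commit removes.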

LOC
---
ds4.c: +67/-35 (combined K=2 dispatch + session buf fields + alloc/free
sites + caller changes).  Two new session fields, two new env gates.

NO github push.  jj change oxmoztuq.

TrevorS commented May 24, 2026

Superseded by the reframed 2-PR stack (#11 + #12), which tells the same Spark/GB10 + MTP combined-forward story more concisely, rebased on current upstream/main, with the exploratory paths dropped.

@TrevorS TrevorS closed this May 24, 2026
@TrevorS TrevorS deleted the upstream-minimal branch May 24, 2026 22:43