cuda: HBM-resident model on DGX Spark / GB10 #1
Closed
TrevorS wants to merge 26 commits into
The answer was outside of the claimed energy precision.

The evaluation after the fix (with smooth distribution over the tokens):

```
$ ./ds4-eval --temp 3.0 --min-p 0.25 --nothink
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-eval: context auto-sized to 16777 tokens (largest prompt=777 tokens, case=70, generation budget=16000)
ds4-eval: context buffers 479.38 MiB (ctx=16777, backend=cuda, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=4196)
ds4-eval: 17/92 passed, 1 failed, runtime 00h:34m
  #  state    prompt   gen  total  given  correct  test
  1  PASSED      201   733    934  B      B        GPQA Diamond/recNu3MXkvWUzHZr9
  2  PASSED      149    87    236  C      C        SuperGPQA/001b51d76b4d422988f2c11f104a2c6c
  3  PASSED       81   574    655  70     70       AIME2025/aime2025-01
  4  PASSED      313   239    552  C      C        GPQA Diamond/recoiTJPGUmzAkief
  5  PASSED      272   177    449  J      J        SuperGPQA/b7e20eac98764fb0bf30e8366d951daa
  6  PASSED      146  1140   1286  468    468      AIME2025/aime2025-16
  7  PASSED      156   646    802  B      B        GPQA Diamond/rec4UqStf9WUVif1f
  8  PASSED      127    52    179  E      E        SuperGPQA/4a1d1780a93f4093b6fb7d3c314cbea8
  9  PASSED      633  4780   5413  588    588      AIME2025/aime2025-02
 10  PASSED      182   322    504  B      B        GPQA Diamond/recgI6tUQ7RLJRWGx
 11  PASSED      137    68    205  A      A        SuperGPQA/6082513c8dba4ec68aa68f1bf5854d09
 12  PASSED      165   747    912  16     16       AIME2025/aime2025-03
 13  PASSED      149   672    821  A      A        GPQA Diamond (modified)/recDytVnNYZe2HuUU
 14  PASSED      167    68    235  J      J        SuperGPQA/bebf1ed45ae14ad7b4f205f3909cb58a
 15  FAILED      305  4837   5142  86     82       AIME2025/aime2025-18
 16  PASSED      131   671    802  D      D        GPQA Diamond/recNFJjE5PPTqVJGv
 17  PASSED      175    67    242  I      I        SuperGPQA/7ca71b86327744b78e93185a45bc5cef
 18  PASSED      102  1199   1301  117    117      AIME2025/aime2025-04
 19  STOPPED     187    80    267  -      B        GPQA Diamond/rec2UlKqC6RFHdcro
 20  PENDING       0     0      0  -      E        SuperGPQA/d44b94f7749345a39a65f6312bda8764
 21  PENDING       0     0      0  -      106      AIME2025/aime2025-19
 22  PENDING       0     0      0  -      B        GPQA Diamond/recv7GsQg3f0fvB1f
 23  PENDING       0     0      0  -      B        SuperGPQA/febe406f44d74a40b50bb5b7c69d5dc1
```
Full routed-MoE TensorOps enabled the gate, up, and down projections. The regression was isolated to the gate projection: enabling TensorOps for gate is sufficient to send a sensitive AIME continuation into a repeated wrong answer, while TensorOps for up+down remains stable.

The kernel-side cause is small but real arithmetic drift in mpp::tensor_ops::matmul2d relative to the legacy simdgroup MMA contraction. A same-input routed-MoE probe showed no address/layout corruption: TensorOps gate was close to legacy, but not bit-identical. An isolated same-tile primitive probe confirmed the source outside DS4 routing and quantization: legacy simdgroup_multiply_accumulate matched a CPU FP32 serial dot-product reference on the tested tile, while TensorOps produced close nonzero FP32 differences. MTLMathModeSafe and the tested TensorOps descriptor variant did not remove the drift.

That normally tiny drift matters here because MoE routing has discontinuous top-k expert selection. In the failing path the first observed safe-vs-full routing change was layer 3, token row 11: the selected sixth expert changed from 96 to 50 across a margin of only about 8e-4. Once an expert changes, the transformer state is no longer a smooth local perturbation, and autoregressive decoding can fall into a bad repetition basin.

Attempts that preserved the full gate TensorOps speed did not produce a zero-drift or stable fix: forcing the routed intermediate to F32, using the older generic TensorOps routed matmul instead of the expert-major fast layout, changing the TensorOps descriptor mode, and compiling with strict Metal math all left the gate drift or the bad continuation in place.

Retaining TensorOps for up and down keeps most of the MoE speedup, but gate stays on the legacy path because it feeds the nonlinear silu(gate) * up branch and is the projection that can flip later router decisions.
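The discontinuity described above can be illustrated with a small host-side sketch. It is not the DS4 router: `topk_experts` and `make_router_row` are hypothetical helpers, and the score values are invented. Only the expert ids (96 and 50), the sixth-slot position, and the roughly 8e-4 margin come from the commit message.

```c
#include <assert.h>

#define N_EXPERTS 256  /* assumption: expert count, chosen to cover ids 96 and 50 */
#define TOP_K 6        /* assumption: six routed experts, per "selected sixth expert" */

/* Select the k highest-scoring experts (ties go to the lower index).
 * Writes expert ids to out[0..k-1] in descending score order. */
static void topk_experts(const float *scores, int n, int k, int *out) {
    assert(k > 0 && k <= n);
    for (int slot = 0; slot < k; slot++) {
        int best = -1;
        for (int e = 0; e < n; e++) {
            int taken = 0;
            for (int j = 0; j < slot; j++) {
                if (out[j] == e) { taken = 1; break; }
            }
            if (taken) continue;
            if (best < 0 || scores[e] > scores[best]) best = e;
        }
        out[slot] = best;
    }
}

/* Hypothetical router row: five clear winners, then experts 96 and 50
 * competing for the sixth slot, separated by `margin`. */
static void make_router_row(float *scores, float margin) {
    for (int e = 0; e < N_EXPERTS; e++) scores[e] = 0.0f;
    scores[3]   = 0.90f;
    scores[17]  = 0.80f;
    scores[42]  = 0.70f;
    scores[128] = 0.60f;
    scores[200] = 0.50f;
    scores[96]  = 0.30f;
    scores[50]  = 0.30f - margin;
}
```

With a margin of 8e-4, adding a drift of 1e-3 to expert 50's score is enough to replace expert 96 in the sixth slot, even though every score moved by far less than one percent.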
Remove the routed-MoE TensorOps/NAX path completely instead of leaving it as a gated-off mode. Semantic evals showed that gate, up, and down TensorOps routed-MoE variants can each move the model into bad continuations, while the full-tile-only expert-major experiment was correctness-interesting but slower than the legacy simdgroup path. Keeping the dead kernels around risks accidental re-enablement without a trustworthy correctness story. The routed MoE grouped matmul now always uses the legacy 32-token expert-major simdgroup kernel. Other Metal4 TensorOps paths, such as the attention-output projection, remain enabled independently.
ds4-eval (fix): q13 provides wrong answer
agent: add --chdir working-directory option
Preserve generated output while linenoise prompt/history entries grow by scrolling the output region and keeping the output cursor column. Polish streaming colors by highlighting [upto], restoring active text attributes after prompt redraws, and showing throttled power in the status bar without duplicate messages. Keep unknown slash commands editable by beeping and restoring the input instead of printing an error or sending it to the model.
Persist agent session titles in the KV file trailer and derive the saved session ID from title plus created_at so resaves keep a stable identity. Preserve and display titles in /list, keep stripped sessions readable, and migrate legacy rendered-text sessions on their next successful save. Also fix stripped-session reloads to accept the retokenized rendered text count, and adjust the /list footer wording to use session IDs.
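One way to get an ID that stays stable across resaves is to hash only fields that never change after creation. The sketch below assumes a 64-bit FNV-1a hash over the title bytes and the creation timestamp; the commit message does not say which derivation the agent actually uses, and `session_id` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 64-bit FNV-1a over a byte range, continuing from state h. */
static uint64_t fnv1a_update(uint64_t h, const unsigned char *p, size_t n) {
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;  /* FNV 64-bit prime */
    }
    return h;
}

/* Hypothetical derivation: hash(title, 0x00 separator, created_at as 8 LE bytes).
 * Nothing that changes on resave (token counts, payload) feeds the hash. */
static uint64_t session_id(const char *title, int64_t created_at) {
    assert(title != NULL);
    uint64_t h = 14695981039346656037ULL;  /* FNV 64-bit offset basis */
    h = fnv1a_update(h, (const unsigned char *)title, strlen(title));
    unsigned char tail[9];
    tail[0] = 0;  /* separator between title and timestamp */
    for (int i = 0; i < 8; i++) {
        tail[1 + i] = (unsigned char)(((uint64_t)created_at >> (8 * i)) & 0xff);
    }
    return fnv1a_update(h, tail, sizeof(tail));
}
```

Because the inputs are fixed at creation time, saving the same session again yields the same ID, while a different title or creation time yields a different one.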
On DGX Spark / GB10 with driver 580.142, cudaHostRegister fails with cudaErrorNotSupported when the cudaHostRegisterReadOnly flag is requested (cudaDevAttrHostRegisterReadOnlySupported reports 0).

Pre-fix behavior: registration fails entirely, the model-resident fast path falls back to per-deref H2D streaming. nsys on combined-MTP at 32 tokens measured 90,525 MB H2D transferred (87.7% of the GPU timeline), 2,572 individual copies, and 12.16s of cudaMemcpy_ptds API time. Spark's equivalent decode does 24 MB total H2D / 0.3% of timeline.

Fix: drop the ReadOnly flag from both cudaHostRegister call sites in ds4_cuda.cu. Mirrors spark's policy in cuda/runtime.cu:254-268 -- on ATS-capable hosts spark skips registration entirely, but its non-ATS fallback also uses plain cudaHostRegisterMapped (no ReadOnly).

Per-mode bench impact (n=256, /dev/null, 3-run avg), in combination with the follow-up !mtp_ready gate drop:

  Mode                    Pre-fix (t/s)  Post-fix (t/s)  Delta
  Plain decode                    15.48           13.92   -10%
  MTP canonical strict             7.74           11.87   +53%
  MTP batched verifier             8.76           14.73   +68%
  MTP combined-K=1                10.31           17.93   +74%

The 10% plain regression comes from the register-mapped UVA pointer going through host page tables on every kernel read, vs. the pre-fix path which fell through to cudaMalloc+cudaMemcpy and produced truly device-resident HBM copies. Net trade is heavily positive for MTP-dominant workloads.

A follow-up will pursue a hybrid that recovers plain decode by pre-populating the cudaMalloc cache at startup while keeping the register-mapped fallback for cold lookups.
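The Delta column in the bench table is a plain relative change in tokens per second. A minimal sketch of that arithmetic follows; `pct_delta` and `round_pct` are hypothetical helper names, not part of ds4.

```c
#include <assert.h>

/* Relative throughput change in percent; positive means faster after the change. */
static double pct_delta(double before_tps, double after_tps) {
    assert(before_tps > 0.0);
    return (after_tps - before_tps) / before_tps * 100.0;
}

/* Round to the nearest whole percent (half away from zero) without <math.h>. */
static int round_pct(double pct) {
    return (int)(pct >= 0.0 ? pct + 0.5 : pct - 0.5);
}
```

For example, `round_pct(pct_delta(10.31, 17.93))` reproduces the +74% reported for MTP combined-K=1, and `round_pct(pct_delta(15.48, 13.92))` reproduces the -10% plain-decode regression.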
Pre-fix the startup model-tensor cache walk was guarded by `!e->mtp_ready`, meaning MTP-loaded sessions skipped the entire device-side cache population. Under the prior cudaHostRegister-fails behavior, that meant MTP runs paid per-call cudaMalloc + cudaMemcpy on every uncached tensor deref during decode -- contributing to the 2,572 H2D copies per 32-token spec iter that nsys captured.

Combined with the cudaHostRegisterReadOnly drop in the prior commit, removing this gate lets MTP-loaded sessions use the same fast device-mapped registered model pointer as plain decode.

Per-mode bench impact (cumulative with prior commit, n=256, /dev/null, 3-run avg):

  MTP canonical strict    7.74 -> 11.87 (+53%)
  MTP batched verifier    8.76 -> 14.73 (+68%)
  MTP combined-K=1       10.31 -> 17.93 (+74%)

MTP combined-K=1 at 17.93 t/s now beats:
- mainline plain 13.92 (+29%)
- pre-fix mainline plain 15.48 (+16%)
- spark MTP K=1 captured 16.20 (+11%)
- spark MTP K=2 captured 14.83 (+21%)

Phase 4 goal "mtp > plain on mainline" is met.
Scaffolding for the hybrid HBM-cache restore. No behavior change today (cache stays empty under post-fix register-mapped flow); sets up the plumbing for the next commit.

1. Extract cudaMalloc + chunked cudaMemcpy body into cuda_model_range_populate_device_copy(). Same logic, same caching into g_model_ranges + g_model_range_by_offset, same verbose log line. Single call from cuda_model_range_ptr replaces the inline block at the bottom of that function.

2. Reorder cuda_model_range_ptr to check g_model_ranges cache *before* the g_model_device_owned / g_model_registered short-circuit. Today the cache is empty in the registered-mapped flow so the reorder is behavior-equivalent: cache lookup returns nothing, fall through to the registered shortcut, return the UVA-mapped pointer. After the next commit populates the cache via force-populate during the startup walk, this reorder lets device-resident HBM copies win over the slower UVA-mapped pointer (~10% on plain decode).

Bench parity check (3-run avg, n=256, vs post-fix baseline):

  Plain decode        13.94 (was 13.92)
  MTP combined-K=1    18.04 (was 17.93)

Within noise. Confirms commit is behavior-equivalent.
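The lookup order after the reorder can be modeled on the host without any CUDA calls. The sketch below is an assumption-laden stand-in: `struct model_state`, `resolve_range`, and the containment-based match are hypothetical, and the real `cuda_model_range_ptr` returns a pointer rather than an enum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum range_source {
    SRC_DEVICE_COPY,         /* hit in the device-resident cache */
    SRC_REGISTERED_MAPPING,  /* UVA-mapped pointer into the registered host mapping */
    SRC_NEEDS_POPULATE       /* neither: caller would cudaMalloc + cudaMemcpy */
};

struct cached_range { uint64_t offset; uint64_t bytes; };

struct model_state {
    const struct cached_range *ranges;  /* stand-in for g_model_ranges */
    size_t n_ranges;
    int registered;                     /* stand-in for g_model_registered */
};

/* Cache first, then the registered shortcut, then populate. */
static enum range_source resolve_range(const struct model_state *m,
                                       uint64_t offset, uint64_t bytes) {
    for (size_t i = 0; i < m->n_ranges; i++) {
        const struct cached_range *r = &m->ranges[i];
        if (offset < r->offset) continue;
        uint64_t rel = offset - r->offset;
        /* Overflow-safe check that [offset, offset+bytes) lies inside r. */
        if (rel <= r->bytes && bytes <= r->bytes - rel) return SRC_DEVICE_COPY;
    }
    if (m->registered) return SRC_REGISTERED_MAPPING;
    return SRC_NEEDS_POPULATE;
}
```

With an empty cache and a registered mapping, every lookup resolves to the mapping, which is why the reorder alone changes nothing. Once the cache holds a range, the same lookup resolves to the device copy instead.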
Restores plain decode (and gives MTP another 9% lift) by switching the startup tensor-span walk to actively cudaMalloc + cudaMemcpy device-resident copies instead of silently no-op'ing under the registered-mapped flow.

Three changes:

1. ds4_gpu_cache_model_range: stop calling cuda_model_range_ptr (which short-circuits to the UVA-mapped pointer when g_model_registered is set) and call cuda_model_range_populate_device_copy directly. Adds budget guard + DS4_CUDA_NO_HBM_CACHE opt-out env.

2. cuda_model_cache_limit_bytes: change default from UINT64_MAX to 24 GiB. Without a cap, an 80 GiB model duplicated into HBM-backed cudaMalloc would exhaust 121 GiB UMA. 24 GiB comfortably covers non-MoE tensors plus headroom; tune via DS4_CUDA_WEIGHT_CACHE_LIMIT_GB.

3. accelerator_cache_model_tensor_spans: skip MoE expert weights (`*_exps.weight`, ~65 GiB total). Top-K of N=256 experts fire per token, so caching them wastes most of the budget on cold weights and starves attn/embed/output-head/shared-FFN. Cold MoE reads fall back to UVA-mapped pointer. Disable filter via DS4_CUDA_CACHE_ALL_TENSORS=1.

Per-mode bench (3-run avg, n=256, /dev/null):

  Mode                    Pre-hybrid (post-fix)  Post-hybrid  Delta
  Plain decode                            13.92        15.90  +14.2%
  MTP canonical strict                    11.87        13.47  +13.5%
  MTP batched verifier                    14.73        16.45  +11.7%
  MTP combined-K=1                        17.93        19.56   +9.1%

Cumulative vs pre-weight-residency-fix baseline:

  Plain decode        15.48 -> 15.90   +2.7%
  MTP combined-K=1    10.31 -> 19.56  +89.7%

MTP combined-K=1 at 19.56 t/s beats:

  mainline plain         15.90 (+23%)
  spark MTP K=1 cap      16.20 (+21%)
  pre-hybrid post-fix    17.93 (+9%)
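The budget guard and the MoE filter from changes 2 and 3 amount to a per-tensor admission decision. The following is a minimal host-side sketch under stated assumptions: the helper names are hypothetical, the limit is assumed to be parsed as whole GiB, and the tensor names used to exercise it are illustrative instances of the `*_exps.weight` pattern rather than names taken from the model file.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GIB (1024ULL * 1024ULL * 1024ULL)
#define DEFAULT_LIMIT_GIB 24ULL  /* default cap stated in the commit message */

/* Parse a cache limit given in whole GiB (assumed semantics for the
 * DS4_CUDA_WEIGHT_CACHE_LIMIT_GB value). NULL, empty, or malformed input
 * falls back to the default. */
static uint64_t cache_limit_bytes(const char *env_value) {
    if (env_value == NULL || *env_value < '0' || *env_value > '9') {
        return DEFAULT_LIMIT_GIB * GIB;
    }
    char *end = NULL;
    unsigned long long gib = strtoull(env_value, &end, 10);
    if (*end != '\0') return DEFAULT_LIMIT_GIB * GIB;
    if (gib > UINT64_MAX / GIB) return UINT64_MAX;  /* saturate instead of wrapping */
    return (uint64_t)gib * GIB;
}

/* True when the tensor name ends in "_exps.weight" (routed MoE experts). */
static int is_moe_expert_weight(const char *name) {
    static const char suffix[] = "_exps.weight";
    size_t n = strlen(name), s = sizeof(suffix) - 1;
    return n >= s && strcmp(name + n - s, suffix) == 0;
}

/* Admit a tensor into the device cache only if it passes the MoE filter
 * (unless cache_all is set) and still fits under the byte budget. */
static int should_cache(const char *name, uint64_t bytes,
                        uint64_t used, uint64_t limit, int cache_all) {
    if (!cache_all && is_moe_expert_weight(name)) return 0;
    if (used > limit) return 0;
    return bytes <= limit - used;
}
```

A caller would pass `getenv("DS4_CUDA_WEIGHT_CACHE_LIMIT_GB")` to `cache_limit_bytes` once at startup and then consult `should_cache` for each tensor span during the walk, adding the span size to `used` on every admission.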
After commit 2 (force-populate at startup), the cache walk ran only against the base model -- MTP-block tensor reads at decode time hit the UVA-mapped pointer instead of cudaMalloc'd HBM copies. Two reasons it didn't manifest as catastrophic:

- MTP model is 3.5 GiB vs 80 GiB base -- small footprint
- Most MTP tensors are MoE experts that the *_exps filter would skip anyway, so the missed opportunity is small

Fix: call accelerator_cache_model_tensors a second time against e->mtp_model when MTP is loaded. The MoE filter from commit 2 automatically excludes `mtp.0.ffn_*_exps.weight`.

Per-mode bench (3-run avg, n=256, /dev/null):

  Mode                    Commit-2  Commit-3  Delta
  Plain decode               15.90     15.89  noise
  MTP canonical strict       13.47     13.50  +0.2%
  MTP batched verifier       16.45     16.51  +0.4%
  MTP combined-K=1           19.56     19.58  +0.1%

Closes a latent correctness gap (MTP tensors should be in the cache just like base-model tensors) without measurable risk. The walk is idempotent under the MoE filter, so even if both models share offsets the populate path skips already-cached spans.

Final cumulative vs pre-weight-residency-fix baseline:

  Plain decode            15.48 -> 15.89   +2.6%
  MTP canonical strict     7.74 -> 13.50  +74.4%
  MTP batched verifier     8.76 -> 16.51  +88.5%
  MTP combined-K=1        10.31 -> 19.58  +89.9%

MTP combined-K=1 (19.58 t/s) beats:

  mainline plain             15.89 (+23%)
  spark MTP K=1 captured     16.20 (+21%)
  spark plain (graphs ON)    14.89 (+32%)
Generated on this branch via:
./ds4-bench -m ds4flash.gguf --prompt-file speed-bench/promessi_sposi.txt \
--ctx-start 2048 --ctx-max 65536 --step-incr 2048 --gen-tokens 128 \
--csv speed-bench/gb10.csv
Hardware: NVIDIA DGX Spark (GB10 / sm_121), driver 580.142, CUDA 13.0
Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
TrevorS added a commit that referenced this pull request on May 24, 2026
Two related changes packed together.
1. Combined-forward K=2 wiring
-----------------------------
`spec_argmax_combined` now also handles draft_cap=2 (N=3 batched verify
over [first_token, drafts[0], drafts[1]]) with prefix-2 commit dispatch
for the commit ∈ {0, 1, 2} cases.
Gated behind DS4_MTP_COMBINED_K2=1 because measurement shows the K=2
variant is currently a loss on mainline:
Combined K=1 (N=2 batched): 9.51 t/s (no flag, /dev/null)
Combined K=2 (N=3 batched): 7.34 t/s (DS4_MTP_COMBINED_K2=1)
Why K=2 loses: `drafts[1]` cascades from `drafts[0]`'s MTP-state, but
`drafts[0]` itself comes from `combined_prev_hc` (= post-previous-iter-
last-token HC), not from the fresh post-`first_token` main-HC the
canonical eval(first_token) would produce. So `drafts[0]` is "one
position stale" already, and `drafts[1]` cascades further off-target.
The target verifier rejects `drafts[1]` in the vast majority of iters,
so the extra batched-N=3 row costs more than it pays.
Keeping the K=2 path as opt-in because the prefix-2 wiring is correct
and reusable when the staleness fix lands (interleaved MTP-block inside
batched main forward, spark-style). See PHASE4.md item #1.
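The commit count in a batched verify can be sketched as a longest-matching-prefix check. This assumes the usual greedy speculative-decoding acceptance rule, where a draft is kept only while it equals the target model's argmax at that position; `accepted_draft_prefix` is a hypothetical name, and the actual dispatch inside `spec_argmax_combined` may differ.

```c
#include <assert.h>

/* target_tops[i] is the target model's argmax after consuming row i of the
 * batch [first_token, drafts[0], ..., drafts[draft_cap-1]].
 * Returns how many drafts can be committed: 0..draft_cap. */
static int accepted_draft_prefix(const int *target_tops, const int *drafts,
                                 int draft_cap) {
    assert(draft_cap >= 0);
    int commit = 0;
    /* Stop at the first mismatch; later drafts were conditioned on it. */
    while (commit < draft_cap && drafts[commit] == target_tops[commit]) {
        commit++;
    }
    return commit;
}
```

For draft_cap=2 the result is one of {0, 1, 2}. In this model, `target_tops[commit]` is the token the target itself would emit next, so every iteration still advances by at least one token even when both drafts are rejected.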
2. Session-cached spec_row_logits buffer
----------------------------------------
Adds `s->spec_row_logits_buf` (3 * VOCAB f32 = ~1.5 MiB) and
`s->spec_row_tops_buf` (3 * int) allocated at session creation,
replacing the per-spec-call xmalloc/free pattern in
`ds4_session_eval_speculative_argmax_combined`.
Measurement impact: small (~0-3% in noise). The malloc overhead
hypothesis was a wrong guess at what was producing the 73 ms per-call
overhead between component-timed (~95 ms) and observed wall
(~168 ms) combined cost. Documented in PHASE4.md item #3 -- the
actual source of that overhead is still unidentified after this
attempt.
Effect on default combined K=1: 9.51 -> 9.48 t/s (within noise).
Foundation for future xmalloc cleanup in the canonical path's
decode2_exact branch (still allocates per-call).
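The allocate-once pattern described in this section can be sketched as follows. The two field names match the commit message; `struct spec_session`, the helper functions, and the use of plain `malloc` in place of `xmalloc` are assumptions made for a self-contained example.

```c
#include <assert.h>
#include <stdlib.h>

#define SPEC_ROWS 3  /* rows in the N=3 batched verify */

struct spec_session {
    float *spec_row_logits_buf;  /* SPEC_ROWS * vocab floats, reused by every spec call */
    int   *spec_row_tops_buf;    /* SPEC_ROWS argmax results */
    size_t vocab;
};

/* Allocate both buffers once at session creation. Returns 0 on success. */
static int spec_session_init(struct spec_session *s, size_t vocab) {
    assert(vocab > 0);
    s->vocab = vocab;
    s->spec_row_logits_buf = malloc(SPEC_ROWS * vocab * sizeof(float));
    s->spec_row_tops_buf = malloc(SPEC_ROWS * sizeof(int));
    if (s->spec_row_logits_buf == NULL || s->spec_row_tops_buf == NULL) {
        free(s->spec_row_logits_buf);
        free(s->spec_row_tops_buf);
        s->spec_row_logits_buf = NULL;
        s->spec_row_tops_buf = NULL;
        return -1;
    }
    return 0;
}

/* Per-call access: hand out a row view instead of allocating. */
static float *spec_row_logits(struct spec_session *s, int row) {
    assert(row >= 0 && row < SPEC_ROWS);
    return s->spec_row_logits_buf + (size_t)row * s->vocab;
}

/* Release both buffers when the session is destroyed. */
static void spec_session_free(struct spec_session *s) {
    free(s->spec_row_logits_buf);
    free(s->spec_row_tops_buf);
    s->spec_row_logits_buf = NULL;
    s->spec_row_tops_buf = NULL;
}
```

Each speculative call then indexes into the same storage through `spec_row_logits`, so the hot path performs no allocation at all.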
LOC
---
ds4.c: +67/-35 (combined K=2 dispatch + session buf fields + alloc/free
sites + caller changes). Two new session fields, two new env gates.
NO github push. jj change oxmoztuq.
PR1 draft: cuda: HBM-resident model on DGX Spark / GB10

Summary

Seven commits on `TrevorS/ds4 @ upstream-minimal` (tip `3514c55`), branched from `upstream/main` (`9ae1eeb`).

Fixes a GB10-specific bug where `cudaHostRegister(... | cudaHostRegisterReadOnly)` returns `cudaErrorNotSupported` (driver 580.142 / cudaDevAttrHostRegisterReadOnlySupported = 0), causing the model-resident fast path to silently fall back to per-deref H2D streaming. nsys on a 32-token MTP decode measured 90,525 MB H2D transferred (87.7% of GPU timeline) before the fix.

The stack:

- `24ce6bd`: replaces an indexer-as-argmax misuse in the MTP path with a real GPU argmax kernel (also closes an upstream strict-mode drift under `DS4_MTP_BATCH_VERIFY`)
- `d9a986b`: drops the `cudaHostRegisterReadOnly` flag (the actual GB10 bug)
- `2081f13`: routes MTP-loaded sessions through the startup tensor-cache walk (was gated off by `!mtp_ready`)
- `e9085d0`: refactors the cudaMalloc cache helper + reorders the lookup so device-resident copies win over the registered-mapped UVA pointer (behavior-equivalent on its own; sets up the next commit)
- `5505cf8`: force-populates the HBM cache at startup with a 24 GiB budget cap + MoE expert filter
- `df6b438`: extends the same cache walk to the MTP support model
- `3514c55`: adds `speed-bench/gb10.csv` per CONTRIBUTING.md

Headline numbers (DGX Spark / GB10, ds4flash IQ2XXS-w2Q2K)

`ds4-bench` canonical sweep (`--ctx-start 2048 --step-incr 2048 --gen-tokens 128`):

Upstream's sweep was truncated at ctx=30720 because per-iteration latency on `upstream/main` exceeded the available wall-clock budget (>9 min per row at ctx≥32768), itself a symptom of the H2D-streaming bug this PR fixes. PR1's sweep completes the full canonical 2048→65536 range. Full per-row data is in `speed-bench/gb10.csv`.

MTP-mode wins (not exercised by `ds4-bench`, captured during commit-message benches with `./ds4 -n 256`):

(Table only partially recovered. Header fragments: `9ae1eeb`, `3514c55`. Surviving cell fragments: "`./ds4`) / 14.2 t/s (`ds4-bench`)" and "`BATCH_VERIFY=1 STRICT=1 --mtp-draft 2`".)

The MTP path was effectively broken on GB10 before this PR.

Tested against

- `make clean && make cuda-spark`: clean build, no new warnings
- `./ds4_test --all`: 1 failure (`logprob-vectors short_code_completion`) also fails on `upstream/main`, pre-existing fixture drift
- `make cuda-regression`: pre-existing build error in `tests/cuda_long_context_smoke.c` (signature mismatch), also fails on `upstream/main`, not introduced by this PR
- `make cpu`: clean build, no new warnings
- `./ds4-bench` formal sweep: see `speed-bench/gb10.csv` (added by this PR)
- `upstream/main` for plain decode (`-n 64 -p "knight" --temp 0`): 0 bytes
- `upstream/main` for MTP strict (`DS4_MTP_BATCH_VERIFY=1 DS4_MTP_STRICT=1 --mtp-draft 2 -n 64 -p "knight" --temp 0`): 0 bytes
- `DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf` (80.76 GiB)
- `DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf` (3.5 GiB)

Notes

- `DS4_CUDA_NO_HBM_CACHE=1`: kill-switch to fall back to pre-fix UVA-mapped behavior. Diagnostic only; default = off.
- `DS4_CUDA_WEIGHT_CACHE_LIMIT_GB` default changes from unbounded → 24 GiB. Tunable on hosts with more memory budget.
- MoE expert weights (`*_exps.weight`, ~65 GiB total, top-K of 256 active per token) are skipped, because caching them would waste budget on bandwidth-cold weights. Disable filter via `DS4_CUDA_CACHE_ALL_TENSORS=1`.

Sample row from `speed-bench/gb10.csv`

Out of scope (follow-ups)

- `--logprob-vectors short_code_completion` fixture drift: also on upstream, likely tokenizer/template drift between fixture and current GGUF
- `tests/cuda_long_context_smoke.c` signature mismatch in `make cuda-regression`: also on upstream