cli: default mtp_draft_tokens=2; cuda: post-stack cleanup (stacked on #9) #10
Closed
TrevorS wants to merge 1 commit into
Makes the combined-forward MTP win available at zero CLI ceremony.
With the share-warp Q8 and F16 kernels landed earlier in this stack,
combined-forward K=1 (mtp_draft_tokens=2) at the verifier reaches
+1.0 t/s above plain decode on DGX Spark. Previously users had to
pass `--mtp-draft 2` explicitly; the default of 1 routed all `--mtp`
sessions through the canonical decode2_exact path and delivered no
visible win over plain decode.
Default flips from 1 -> 2. The path through `DS4_MTP_STRICT=1` (or
`--quality`) still falls back to canonical decode2_exact for users
who require byte-equality with plain decode.
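The routing described above can be sketched roughly as follows. This is a minimal illustration, not the code in `ds4_cli.c`: the names `mtp_opts`, `mtp_opts_default`, `mtp_select_path`, and the `mtp_path` enum are hypothetical, the strict setting is passed in as a string rather than read from the environment, and the handling of draft counts above 2 is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration; not the real ds4_cli.c API. */
enum mtp_path {
    MTP_PATH_CANONICAL_DECODE2_EXACT, /* byte-equal with plain decode */
    MTP_PATH_COMBINED_FORWARD         /* combined-forward K=1 */
};

struct mtp_opts {
    int  draft_tokens; /* value of --mtp-draft */
    bool strict;       /* DS4_MTP_STRICT=1 or --quality */
};

/* Build defaults. strict_env stands in for the value of DS4_MTP_STRICT
 * (NULL when unset). The default draft count is 2, as in this PR. */
static struct mtp_opts mtp_opts_default(const char *strict_env) {
    struct mtp_opts o;
    o.draft_tokens = 2;
    o.strict = (strict_env != NULL && strcmp(strict_env, "1") == 0);
    return o;
}

/* Pick the decode path for an --mtp session. */
static enum mtp_path mtp_select_path(const struct mtp_opts *o) {
    assert(o != NULL && o->draft_tokens >= 1);
    if (o->strict) {
        return MTP_PATH_CANONICAL_DECODE2_EXACT; /* strict always falls back */
    }
    /* Assumption: any draft count >= 2 takes the combined-forward path;
     * the PR text only describes the values 1 and 2. */
    return (o->draft_tokens >= 2) ? MTP_PATH_COMBINED_FORWARD
                                  : MTP_PATH_CANONICAL_DECODE2_EXACT;
}
```

With these assumptions, a plain `--mtp` session selects combined-forward, while `--mtp-draft 1` or the strict setting selects the canonical path.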
Cleanup pass on ds4_cuda.cu now that the share-warp design is settled:
- Strip fork-internal "PR3/PR5/PR7/PR8/PR9" comment prefixes that
referenced this stack's own PR numbering; they would be noise in
the upstream history. The behavioral comments themselves are
preserved or tightened.
- Drop the share-warp dispatcher's stale "no perf delta because
combined-forward routes through F16" note: combined-forward at
--mtp-draft 2 DOES route through Q8 share-warp 10k+ times per
64-token gen (confirmed by nsys profile), so PR8's bit-equality
rewrite is load-bearing for the win this PR makes default.
- Tighten the F16 share-warp comment block to reflect the
n_tok==2 final shape (was scoped to 2..4 in intermediate
revisions).
Refresh `speed-bench/gb10.csv` with current PR-stack tip numbers
(plain decode only; ds4-bench doesn't drive --mtp). Plain decode is
within +/-0.05 t/s of the prior baseline at every measured context
size from 2048 to 65536 -- no regression, slight improvement at
small contexts.
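The "within +/-0.05 t/s at every context size" claim amounts to a per-row tolerance check between the old and new CSV columns. A minimal sketch, assuming the two t/s columns have already been parsed into arrays; the helper name `within_tolerance` is hypothetical and is not part of ds4-bench:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true when every current[i] is within
 * tol of baseline[i]. Both arrays hold tokens/second, one entry per
 * measured context size, in the same order. */
static bool within_tolerance(const double *baseline, const double *current,
                             size_t n, double tol) {
    assert(baseline != NULL && current != NULL);
    for (size_t i = 0; i < n; i++) {
        double diff = current[i] - baseline[i];
        if (diff < 0.0) {
            diff = -diff; /* absolute difference without needing libm */
        }
        if (diff > tol) {
            return false; /* this context size moved by more than tol */
        }
    }
    return true;
}
```

Calling it with `tol = 0.05` over the rows of the previous and refreshed `speed-bench/gb10.csv` would reproduce the check stated above.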
Bench (DGX Spark / GB10, ds4flash, n=256, "knight" prompt, 5-run mean):
  Plain decode (no --mtp)                          16.13 t/s
  --mtp (now defaults to combined-forward K=1)     17.14 t/s (+1.01)
  --mtp --mtp-draft 1 (forces canonical decode2)   16.18 t/s (parity)
  --mtp --quality (strict canonical)               12.92 t/s
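The headline figures are plain arithmetic over the per-run numbers. As a sketch, the five default `--mtp` runs listed in the PR description (17.16 / 17.15 / 17.13 / 17.15 / 17.13) and the plain-decode mean of 16.13 t/s can be checked like this; the helper names `mean_of` and `spread_of` are illustrative only:

```c
#include <assert.h>
#include <stddef.h>

/* Per-run t/s for default --mtp, as reported in the PR description. */
static const double MTP_RUNS[5] = {17.16, 17.15, 17.13, 17.15, 17.13};

/* Reported 5-run mean for plain decode (no --mtp). */
static const double PLAIN_MEAN = 16.13;

/* Arithmetic mean of n samples. */
static double mean_of(const double *x, size_t n) {
    assert(x != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += x[i];
    }
    return sum / (double)n;
}

/* Max minus min: the run-to-run spread. */
static double spread_of(const double *x, size_t n) {
    assert(x != NULL && n > 0);
    double lo = x[0], hi = x[0];
    for (size_t i = 1; i < n; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    return hi - lo;
}
```

`mean_of(MTP_RUNS, 5)` is about 17.144, which rounds to the reported 17.14 t/s; subtracting `PLAIN_MEAN` gives roughly +1.01 t/s, and `spread_of(MTP_RUNS, 5)` is about 0.03 t/s.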
CONTRIBUTING.md test sweep:
- `make clean && make cuda-spark`: clean
- `make cpu`: clean
- `make test` (= `./ds4_test --all`): 1 failure, pre-existing
  (`--logprob-vectors short_code_completion`, same as upstream/main,
  PR5-9; the test fixture's official continuation is one greedy
  token off, well-documented across this stack)
- `make cuda-regression`: pre-existing build
  error in `tests/cuda_long_context_smoke.c` (stale signature for
  `ds4_gpu_attention_decode_heads_tensor`; same on PR7 base and
  upstream/main, not introduced here)
- `./ds4-bench`: standard 2048..65536 sweep written to
  speed-bench/gb10.csv; no regression vs prior baseline
Hardware: NVIDIA DGX Spark (GB10 / sm_121), driver 580.142, CUDA 13.0
Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
MTP: DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf
This is the capstone for the mtp-beats-plain-kernels stack: 10 PRs,
each minimal, that take MTP from -0.5 t/s under plain to +1.0 t/s
above plain on DGX Spark, with bit-equal plain decode preserved.
PR10: cli: default mtp_draft_tokens=2; cuda: post-stack cleanup pass (stacked on #9)
Summary
Capstone for the `mtp-beats-plain-kernels` stack. Two changes:
- `ds4_cli.c`: default `mtp_draft_tokens` flips from 1 → 2. With the share-warp Q8 (PR8) and F16 (PR9) kernels landed, combined-forward MTP K=1 delivers +1.0 t/s above plain decode on DGX Spark. Previously users had to pass `--mtp-draft 2` explicitly to get the win; the default of 1 routed `--mtp` through canonical `decode2_exact` and produced no measurable win over plain.
- `ds4_cuda.cu`: cleanup pass. Strip fork-internal "PR3"/"PR5"/"PR7"/"PR8"/"PR9" comment prefixes that reference this stack's own PR numbering. Tighten the share-warp dispatcher comments to reflect the settled final shape (was scoped wider in intermediate revisions). Net -18 LOC with no behavior change to the kernels.

`speed-bench/gb10.csv`: refreshed with PR-stack-tip plain-decode numbers from the standard CONTRIBUTING.md sweep.

Bench headline (DGX Spark / GB10, ds4flash, n=256, "knight" prompt, 5-run mean)
- Plain decode (no `--mtp`): 16.13 t/s
- `--mtp` (default): 17.14 t/s
- `--mtp` vs plain on PR10: +1.01 t/s

All 5 default-`--mtp` runs landed within 0.03 t/s: 17.16 / 17.15 / 17.13 / 17.15 / 17.13.

Other modes (still supported):
- `--mtp --mtp-draft 1` (forces canonical decode2): 16.18 t/s
- `--mtp --quality` (strict canonical): 12.92 t/s

CONTRIBUTING.md test sweep
- ✓ `make clean && make cuda-spark`: clean, no warnings
- ✓ `make cpu`: clean
- ~ `make test` (= `./ds4_test --all`): 1 pre-existing failure
  - `--logprob-vectors short_code_completion` (same on upstream/main, PR5-9; test fixture's official continuation is one greedy token off; well-documented across this stack)
- ~ `make cuda-regression`: pre-existing build error
  - `tests/cuda_long_context_smoke.c` has a stale signature for `ds4_gpu_attention_decode_heads_tensor` (verified same on PR7 base and upstream/main; not introduced here)
- ✓ `./ds4-bench` standard 2048..65536 sweep: written to `speed-bench/gb10.csv`
  - no regression vs prior baseline: every measured context size within ±0.05 t/s

What this stack delivers, end-to-end
The 10-PR stack:
AGENT.md compliance
- `--quality` / `DS4_MTP_STRICT=1` preserved as byte-equal-to-plain canonical path.
- `--mtp-draft` CLI flag is pre-existing; we change only its default value.
- `DS4_CUDA_NO_*` opt-outs preserved as kill switches.

Hardware
- Hardware: NVIDIA DGX Spark (GB10 / sm_121), driver 580.142, CUDA 13.0
- Model: `DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf`
- MTP: `DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf`

Out of scope / follow-ups
The scout from PR9's bench analysis (recorded as comments in PR9) identifies the remaining levers if the stack lands and there's appetite for more:
- [...] `drafts[1]` staleness is fixed)
- `drafts[1]` staleness: currently rejects nearly always; fixing unlocks `--mtp-draft 3`