repro(pa-asm): standalone reproducer for fp8 PA OOB at bs=128, qlen=3 #3295
Open
yhl-amd wants to merge 6 commits into
Use min-based anchor rebase in K_V_window_rebase to allow pages within the same wave to span different 65536 windows, as long as max_page_id - min_page_id < 65536. This removes the previous constraint that all pages in a load group must share the same high-16-bit window.

Changes:
- Replace v_mul_u32_u24 with v_mul_lo_u32 to remove 24-bit truncation
- Update K/V buffer descriptor num_records for full offset range
- Update all 36 PA kernel .co binaries (gfx942 + gfx950)
- Add test_pa_block_id_truncation.py regression test
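The two ideas in this commit can be illustrated in plain Python. This is a sketch of the arithmetic only, not the SP3 kernel code: `mul_u32_u24` and `mul_lo_u32` model the two instructions (24-bit operands vs. full 32-bit operands, low 32 bits of the product kept), while `shares_high16_window`, `fits_min_anchor_window`, and `rebase_to_min` are hypothetical helper names for the old grouping condition, the new one, and the rebase step. The page ids and the stride are made-up values.

```python
MASK24 = (1 << 24) - 1
MASK32 = (1 << 32) - 1


def mul_u32_u24(a, b):
    """Model of v_mul_u32_u24: only the low 24 bits of each operand are used."""
    return ((a & MASK24) * (b & MASK24)) & MASK32


def mul_lo_u32(a, b):
    """Model of v_mul_lo_u32: full 32-bit operands, low 32 bits of the product."""
    return ((a & MASK32) * (b & MASK32)) & MASK32


def shares_high16_window(page_ids):
    """Old condition: every page id has the same high 16 bits."""
    return len({p >> 16 for p in page_ids}) == 1


def fits_min_anchor_window(page_ids):
    """New condition: the spread of page ids is below 65536."""
    return max(page_ids) - min(page_ids) < 65536


def rebase_to_min(page_ids):
    """Return the anchor (minimum id) and each id's offset from it."""
    anchor = min(page_ids)
    return anchor, [p - anchor for p in page_ids]


# Two pages that straddle a 65536 boundary (illustrative values).
pages = [65530, 65540]
print(shares_high16_window(pages), fits_min_anchor_window(pages))
print(rebase_to_min(pages))

# A block id above 2**24 loses its high bits in the 24-bit multiply.
block_id, stride = (1 << 24) + 5, 4
print(mul_u32_u24(block_id, stride), mul_lo_u32(block_id, stride))
```

With the min-based condition, the pair of pages above is accepted even though the two ids fall in different high-16-bit windows, and the offsets relative to the anchor stay below 65536.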
Rebuild all 36 PA kernel .co files from updated SP3 sources that use min-based anchor rebase instead of the previous direct offset approach.

Performance (test_pa.py, bf16, batch=128):
- ctx_len=128: 12.17-12.22 us (vs 12.51-12.58 in v1, vs 11.73-11.78 baseline)
- ctx_len=257: 18.10-18.17 us (vs 18.52-18.82 in v1)
- ctx_len=4097: 161.16-162.33 us (vs 160.27-160.49 baseline, <1% delta)

Short-context regression reduced from ~7-9% to ~3-5% vs baseline. Long-context effectively neutral (<1%).
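As a rough check of the percentages, the reported timing ranges can be compared by their midpoints. This is only one possible way to summarise a range and is an assumption here; the commit message does not say how its percentages were derived. `midpoint` and `pct_over` are illustrative helper names. ctx_len=257 is omitted because no baseline is quoted for it.

```python
def midpoint(lo, hi):
    """Midpoint of a reported timing range, in microseconds."""
    return (lo + hi) / 2


def pct_over(baseline_us, measured_us):
    """Slowdown of measured relative to baseline, in percent."""
    return (measured_us / baseline_us - 1.0) * 100.0


# Ranges copied from the commit message (bf16, batch=128).
base_128 = midpoint(11.73, 11.78)
v1_128 = midpoint(12.51, 12.58)
v2_128 = midpoint(12.17, 12.22)
base_4097 = midpoint(160.27, 160.49)
v2_4097 = midpoint(161.16, 162.33)

print(f"ctx_len=128  v1: {pct_over(base_128, v1_128):.2f}%")
print(f"ctx_len=128  v2: {pct_over(base_128, v2_128):.2f}%")
print(f"ctx_len=4097 v2: {pct_over(base_4097, v2_4097):.2f}%")
```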
- Reformat multiline function args per black style
- Remove unused imports (dtypes, time)
- Remove extra blank line
- Add trailing newline
Adds op_tests/repros/ with a self-contained Python reproducer for the HIP
illegal-memory access in pa_bf16_pertokenFp8_gqa8_1tg_4w_mtp_msk1.co
(gfx950). Triggers exactly when batch_size==128 and qlen==3 on the fp8
per-token-quant ASM PA path. The bf16 sibling kernel
(pa_bf16_noquant_gqa8_1tg_4w_mtp_msk1.co) is clean on the same shape.
Minimal repro (≤5s, no concurrency required):
AMD_SERIALIZE_KERNEL=3 HIP_LAUNCH_BLOCKING=1 \
python op_tests/repros/pa_asm_fp8_repeat_call.py \
--bs 128 --ctx 1024 --qlen 3 --kv-dtype fp8 --n-repeat 5
See op_tests/repros/README.md for the full sweep matrix and negative
controls (bs±1, qlen±1, same total_qo via other factorings, bf16 KV).
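The negative controls named above can be enumerated mechanically. The sketch below is not taken from the README; `factorings` and `negative_controls` are hypothetical helpers, and the `max_qlen=8` cap on the "same total_qo" factorings is an assumption made only to keep the list short.

```python
def factorings(total, max_qlen=8):
    """All (bs, qlen) pairs with bs * qlen == total and qlen <= max_qlen."""
    return [(total // q, q) for q in range(1, max_qlen + 1) if total % q == 0]


def negative_controls(bs=128, qlen=3):
    """Shapes near the failing one that are expected to run clean."""
    controls = []
    # bs±1 and qlen±1 around the failing shape, still on fp8 KV.
    controls += [(b, qlen, "fp8") for b in (bs - 1, bs + 1)]
    controls += [(bs, q, "fp8") for q in (qlen - 1, qlen + 1)]
    # Same total_qo = bs * qlen reached through other factorings.
    controls += [
        (b, q, "fp8") for b, q in factorings(bs * qlen) if (b, q) != (bs, qlen)
    ]
    # Same shape on the bf16 KV path.
    controls.append((bs, qlen, "bf16"))
    return controls


for shape in negative_controls():
    print(shape)
```

For bs=128 and qlen=3, total_qo is 384, so the factoring controls include shapes such as 192x2 and 96x4.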
Summary
- Adds op_tests/repros/ with a self-contained Python reproducer for the HIP illegal-memory access in pa_bf16_pertokenFp8_gqa8_1tg_4w_mtp_msk1.co on gfx950.
- Triggers exactly when batch_size == 128 and qlen == 3 on the fp8 per-token-quant ASM PA path. The bf16 sibling kernel (pa_bf16_noquant_gqa8_1tg_4w_mtp_msk1.co) is clean on the same shape.
- Includes negative controls (bs±1, qlen±1, same total_qo via other factorings, bf16 KV) and a shape sweep so the failure surface is unambiguous.

Minimal repro (≤5s, no concurrency required)
AMD_SERIALIZE_KERNEL=3 HIP_LAUNCH_BLOCKING=1 \
python op_tests/repros/pa_asm_fp8_repeat_call.py \
--bs 128 --ctx 1024 --qlen 3 --kv-dtype fp8 --n-repeat 5
See op_tests/repros/README.md for the full sweep matrix and controls.

Test plan
- Run with --kv-dtype bf16 on same shape: expect clean
- Run the shape sweep (pa_asm_fp8_shape_sweep.py) and confirm only bs=128 ∧ qlen=3 fails

Positive case (should crash, ~5 s)
AMD_SERIALIZE_KERNEL=3 HIP_LAUNCH_BLOCKING=1 \
python op_tests/repros/pa_asm_fp8_repeat_call.py \
--bs 128 --ctx 1024 --qlen 3 --kv-dtype fp8 --n-repeat 5
Negative cases (should pass)
python op_tests/repros/pa_asm_fp8_repeat_call.py --bs 128 --ctx 1024 --qlen 3 --kv-dtype bf16 --n-repeat 5
python op_tests/repros/pa_asm_fp8_repeat_call.py --bs 127 --ctx 1024 --qlen 3 --kv-dtype fp8 --n-repeat 5
python op_tests/repros/pa_asm_fp8_repeat_call.py --bs 128 --ctx 1024 --qlen 2 --kv-dtype fp8 --n-repeat 5
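The positive and negative cases can also be driven from one small script, each in a fresh process so that a faulting case cannot affect the next one. This is a sketch, not part of the PR: `build_cmd`, `run_case`, and `CASES` are hypothetical names, the script path and flags are copied from the commands above, and it assumes the reproducer exits with a non-zero status when the illegal access surfaces.

```python
import os
import subprocess
import sys

SCRIPT = "op_tests/repros/pa_asm_fp8_repeat_call.py"


def build_cmd(bs, qlen, kv_dtype, ctx=1024, n_repeat=5):
    """Argument list matching the commands shown above."""
    return [
        sys.executable, SCRIPT,
        "--bs", str(bs), "--ctx", str(ctx), "--qlen", str(qlen),
        "--kv-dtype", kv_dtype, "--n-repeat", str(n_repeat),
    ]


def run_case(bs, qlen, kv_dtype, serialize=False):
    """Run one case in a fresh process and return its exit code."""
    env = dict(os.environ)
    if serialize:
        # Same debugging variables as the positive case above.
        env.update(AMD_SERIALIZE_KERNEL="3", HIP_LAUNCH_BLOCKING="1")
    return subprocess.run(build_cmd(bs, qlen, kv_dtype), env=env).returncode


# (bs, qlen, kv_dtype, expect_crash), taken from the cases listed above.
CASES = [
    (128, 3, "fp8", True),
    (128, 3, "bf16", False),
    (127, 3, "fp8", False),
    (128, 2, "fp8", False),
]
```

A caller would loop over `CASES`, call `run_case(bs, qlen, kv, serialize=expect_crash)`, and compare `returncode != 0` against `expect_crash`.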