repro(pa-asm): standalone reproducer for fp8 PA OOB at bs=128, qlen=3 by yhl-amd · Pull Request #3295 · ROCm/aiter

yhl-amd · 2026-05-21T03:17:19Z

Summary

Adds op_tests/repros/ with a self-contained Python reproducer for the HIP illegal memory access in pa_bf16_pertokenFp8_gqa8_1tg_4w_mtp_msk1.co on gfx950.
The crash triggers exactly when batch_size == 128 and qlen == 3 on the fp8 per-token-quant ASM PA path. The bf16 sibling kernel (pa_bf16_noquant_gqa8_1tg_4w_mtp_msk1.co) is clean on the same shape.
Includes negative controls (bs±1, qlen±1, same total_qo via other factorings, bf16 KV) and a shape sweep so the failure surface is unambiguous.
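
The negative controls named in the summary can be enumerated mechanically. The sketch below is illustrative only: `control_shapes` and its `max_qlen` cap are hypothetical and not part of this PR; the authoritative sweep matrix is the one in op_tests/repros/README.md.

```python
from typing import List, Tuple


def control_shapes(bs: int = 128, qlen: int = 3,
                   max_qlen: int = 8) -> List[Tuple[int, int]]:
    """Return (bs, qlen) pairs around the failing shape, excluding the shape itself.

    Hypothetical helper: max_qlen is an arbitrary cap chosen for illustration.
    """
    shapes = set()
    # Nearest neighbours: bs +/- 1 and qlen +/- 1.
    for delta in (-1, 1):
        if bs + delta >= 1:
            shapes.add((bs + delta, qlen))
        if qlen + delta >= 1:
            shapes.add((bs, qlen + delta))
    # Same total_qo = bs * qlen, reached through other factorings.
    total_qo = bs * qlen
    for q in range(1, max_qlen + 1):
        if total_qo % q == 0:
            shapes.add((total_qo // q, q))
    shapes.discard((bs, qlen))
    return sorted(shapes)


if __name__ == "__main__":
    for b, q in control_shapes():
        print(f"--bs {b} --qlen {q}")
```

Every pair this produces is expected to pass if the failure really is confined to bs=128, qlen=3; any crash among them would widen the failure surface.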

Minimal repro (≤5s, no concurrency required)

AMD_SERIALIZE_KERNEL=3 HIP_LAUNCH_BLOCKING=1 \
  python op_tests/repros/pa_asm_fp8_repeat_call.py \
    --bs 128 --ctx 1024 --qlen 3 --kv-dtype fp8 --n-repeat 5

See op_tests/repros/README.md for the full sweep matrix and controls.

Test plan

Run minimal repro on gfx950 — expect HIP memory access fault
Run with --kv-dtype bf16 on same shape — expect clean
Run shape sweep (pa_asm_fp8_shape_sweep.py) and confirm only bs=128 ∧ qlen=3 fails
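
The sweep reports each cell from a fresh process, which matters because a GPU memory fault usually leaves the faulting process unusable for later launches. A minimal sketch of that pattern follows; `run_cell` and `repro_cmd` are hypothetical names, the timeout is an arbitrary choice, and the real logic lives in pa_asm_fp8_shape_sweep.py. Only the script path and flags are taken from this PR.

```python
import subprocess
import sys
from typing import List


def run_cell(cmd: List[str], timeout_s: float = 120.0) -> str:
    """Run one sweep cell in its own process and classify the outcome."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "TIMEOUT"
    if proc.returncode == 0:
        return "OK"
    # On POSIX a negative return code means the child died from a signal.
    return f"CRASH(rc={proc.returncode})"


def repro_cmd(bs: int, qlen: int, kv_dtype: str = "fp8") -> List[str]:
    """Build the reproducer command line using the flags shown in this PR."""
    return [sys.executable, "op_tests/repros/pa_asm_fp8_repeat_call.py",
            "--bs", str(bs), "--ctx", "1024", "--qlen", str(qlen),
            "--kv-dtype", kv_dtype, "--n-repeat", "5"]


if __name__ == "__main__":
    for bs, qlen in [(127, 3), (128, 2), (128, 3)]:
        print(bs, qlen, run_cell(repro_cmd(bs, qlen)))
```

Isolating each cell this way keeps one crashing shape from contaminating the results of the shapes that run after it.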

Positive case (expected to crash, ~5 s)

AMD_SERIALIZE_KERNEL=3 HIP_LAUNCH_BLOCKING=1 \
  python op_tests/repros/pa_asm_fp8_repeat_call.py \
    --bs 128 --ctx 1024 --qlen 3 --kv-dtype fp8 --n-repeat 5

Negative cases (expected to pass)

python op_tests/repros/pa_asm_fp8_repeat_call.py --bs 128 --ctx 1024 --qlen 3 --kv-dtype bf16 --n-repeat 5
python op_tests/repros/pa_asm_fp8_repeat_call.py --bs 127 --ctx 1024 --qlen 3 --kv-dtype fp8 --n-repeat 5
python op_tests/repros/pa_asm_fp8_repeat_call.py --bs 128 --ctx 1024 --qlen 2 --kv-dtype fp8 --n-repeat 5

Use min-based anchor rebase in K_V_window_rebase to allow pages within the same wave to span different 65536 windows, as long as max_page_id - min_page_id < 65536. This removes the previous constraint that all pages in a load group must share the same high-16-bit window.

Changes:
- Replace v_mul_u32_u24 with v_mul_lo_u32 to remove 24-bit truncation
- Update K/V buffer descriptor num_records for full offset range
- Update all 36 PA kernel .co binaries (gfx942 + gfx950)
- Add test_pa_block_id_truncation.py regression test
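
The two arithmetic points in this commit message can be modeled in plain Python. This is a sketch of the arithmetic only, assuming non-empty lists of non-negative page ids; all function names are hypothetical, and how the kernel actually splits the anchor and the per-page offsets between the buffer descriptor and registers is not shown in the source.

```python
from typing import List, Sequence, Tuple

WINDOW = 1 << 16  # 65536, the window size named in the commit message


def same_high16_window(page_ids: Sequence[int]) -> bool:
    """Previous constraint: all pages share the same high-16-bit window."""
    return len({p >> 16 for p in page_ids}) == 1


def fits_min_anchor(page_ids: Sequence[int]) -> bool:
    """New constraint: max_page_id - min_page_id < 65536."""
    return max(page_ids) - min(page_ids) < WINDOW


def rebase_to_min(page_ids: Sequence[int]) -> Tuple[int, List[int]]:
    """Split page ids into a shared anchor (the minimum) and small offsets."""
    anchor = min(page_ids)
    return anchor, [p - anchor for p in page_ids]


def mul_u32_u24(a: int, b: int) -> int:
    """Model of a 24-bit multiply: only the low 24 bits of each operand count."""
    return ((a & 0xFFFFFF) * (b & 0xFFFFFF)) & 0xFFFFFFFF


def mul_lo_u32(a: int, b: int) -> int:
    """Model of a full 32-bit multiply that keeps the low 32 bits."""
    return (a * b) & 0xFFFFFFFF


if __name__ == "__main__":
    pages = [65530, 65540]  # straddles the boundary between two windows
    print(same_high16_window(pages), fits_min_anchor(pages))
    print(rebase_to_min(pages))
    big = (1 << 24) + 5  # an operand that needs more than 24 bits
    print(mul_u32_u24(big, 2), mul_lo_u32(big, 2))
```

The pair `[65530, 65540]` fails the old same-window test yet passes the min-anchor test, and an operand above 2^24 shows where the 24-bit multiply silently drops high bits.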

Rebuild all 36 PA kernel .co files from updated SP3 sources that use min-based anchor rebase instead of the previous direct offset approach.

Performance (test_pa.py, bf16, batch=128):
- ctx_len=128: 12.17-12.22 us (vs 12.51-12.58 in v1, vs 11.73-11.78 baseline)
- ctx_len=257: 18.10-18.17 us (vs 18.52-18.82 in v1)
- ctx_len=4097: 161.16-162.33 us (vs 160.27-160.49 baseline, <1% delta)

Short-context regression reduced from ~7-9% to ~3-5% vs baseline. Long-context effectively neutral (<1%).
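
As a worked check of the percentages, the relative slowdowns can be recomputed from the quoted latency ranges. The sketch below compares range midpoints, which is an assumption about how to summarize each range; the commit message does not say how its percentages were derived. ctx_len=257 is left out because no baseline is quoted for it.

```python
from typing import Dict, Tuple

Range = Tuple[float, float]  # (low, high) latency in microseconds

# Numbers copied from the commit message above.
BASELINE: Dict[int, Range] = {128: (11.73, 11.78), 4097: (160.27, 160.49)}
V1: Dict[int, Range] = {128: (12.51, 12.58)}
V2: Dict[int, Range] = {128: (12.17, 12.22), 4097: (161.16, 162.33)}


def midpoint(r: Range) -> float:
    """Midpoint of a (low, high) range."""
    return (r[0] + r[1]) / 2.0


def rel_delta(measured: Range, baseline: Range) -> float:
    """Relative slowdown of the measured midpoint versus the baseline midpoint."""
    return midpoint(measured) / midpoint(baseline) - 1.0


if __name__ == "__main__":
    print(f"ctx 128,  v1 vs baseline: {rel_delta(V1[128], BASELINE[128]):.2%}")
    print(f"ctx 128,  v2 vs baseline: {rel_delta(V2[128], BASELINE[128]):.2%}")
    print(f"ctx 4097, v2 vs baseline: {rel_delta(V2[4097], BASELINE[4097]):.2%}")
```

By midpoints, the ctx_len=128 slowdown falls from about 6.7% in v1 to about 3.7%, and ctx_len=4097 comes out at about 0.85%. Comparing the slowest new measurement against the fastest baseline would put the long-context delta slightly above 1%.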

- Reformat multiline function args per black style
- Remove unused imports (dtypes, time)
- Remove extra blank line
- Add trailing newline

github-actions · 2026-05-21T03:17:55Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label	Tests
`ci:triton-300x`	Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
`ci:sglang`	SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
`ci:atom`	ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
`ci:atom_full`	ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json`
`ci:vllm`	vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
`ci:all`	All standard extended tests (excludes `ci:atom_full`)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3295 --add-label <label>

github-actions · 2026-05-21T03:18:03Z

+import os
+import random


⚠️ [ruff] <F401> (reported by reviewdog 🐶)
os imported but unused

Suggested change

-import os
-import random
+import random

github-actions · 2026-05-21T03:18:04Z

+import sys
+import time
+import traceback
+from typing import List, Optional, Tuple


⚠️ [ruff] <F401> (reported by reviewdog 🐶)
typing.Optional imported but unused

Suggested change

-from typing import List, Optional, Tuple
+from typing import List, Tuple

github-actions · 2026-05-21T03:18:04Z

+from aiter import dtypes
+from aiter import pertoken_quant


⚠️ [ruff] <F401> (reported by reviewdog 🐶)
aiter.dtypes imported but unused

Suggested change

-from aiter import dtypes
-from aiter import pertoken_quant
+from aiter import pertoken_quant

github-actions · 2026-05-21T03:18:04Z

+    torch.cuda.synchronize()
+
+    # ---- single call ----
+    print(f"[min-repro] calling pa_fwd_asm ...", flush=True)


⚠️ [ruff] <F541> (reported by reviewdog 🐶)
f-string without any placeholders

Suggested change

-    print(f"[min-repro] calling pa_fwd_asm ...", flush=True)
+    print("[min-repro] calling pa_fwd_asm ...", flush=True)

github-actions · 2026-05-21T03:18:04Z

+import os
+import sys


⚠️ [ruff] <F401> (reported by reviewdog 🐶)
os imported but unused

Suggested change

-import os
-import sys
+import sys

github-actions · 2026-05-21T03:18:04Z

+
+    print(f"# fp8 ASM PA shape sweep (each cell = {args.n_repeat} repeats "
+          f"of same call, fresh process)")
+    print(f"# OK = no crash. CRASH@k = launch error surfaced at call k "


⚠️ [ruff] <F541> (reported by reviewdog 🐶)
f-string without any placeholders

Suggested change

-    print(f"# OK = no crash. CRASH@k = launch error surfaced at call k "
+    print("# OK = no crash. CRASH@k = launch error surfaced at call k "

github-actions · 2026-05-21T03:18:04Z

+    print(f"# fp8 ASM PA shape sweep (each cell = {args.n_repeat} repeats "
+          f"of same call, fresh process)")
+    print(f"# OK = no crash. CRASH@k = launch error surfaced at call k "
+          f"(0-indexed; means call k-1 corrupted device).")


⚠️ [ruff] <F541> (reported by reviewdog 🐶)
f-string without any placeholders

Suggested change

-          f"(0-indexed; means call k-1 corrupted device).")
+          "(0-indexed; means call k-1 corrupted device).")

fangche123 and others added 6 commits May 20, 2026 03:03

style: fix black/ruff formatting in test_pa_block_id_truncation.py

603bebb

update mi300 kernel co

5b0b35b

using min_v rebase

aff4047

yhl-amd requested a review from a team May 21, 2026 03:17

github-actions Bot reviewed May 21, 2026

yhl-amd wants to merge 6 commits into main from repro/pa-asm-fp8-bs128-qlen3-oob

2 participants
