repro(pa-asm): standalone reproducer for fp8 PA OOB at bs=128, qlen=3 #3295

Open
yhl-amd wants to merge 6 commits into main from repro/pa-asm-fp8-bs128-qlen3-oob

@yhl-amd yhl-amd commented May 21, 2026

Summary

  • Adds op_tests/repros/ with a self-contained Python reproducer for the HIP illegal-memory access in pa_bf16_pertokenFp8_gqa8_1tg_4w_mtp_msk1.co on gfx950.
  • Crash triggers exactly when batch_size == 128 and qlen == 3 on the fp8 per-token-quant ASM PA path. The bf16 sibling kernel (pa_bf16_noquant_gqa8_1tg_4w_mtp_msk1.co) is clean on the same shape.
  • Includes negative controls (bs±1, qlen±1, same total_qo via other factorings, bf16 KV) and a shape sweep so the failure surface is unambiguous.
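The shape controls in the last bullet can be enumerated mechanically. The following is a minimal sketch, not the logic used by the repro scripts: `negative_controls` and the `max_qlen` cap are hypothetical names chosen for illustration. The bf16 KV control is a dtype switch rather than a shape, so it is not covered here.

```python
from typing import List, Tuple


def negative_controls(bs: int, qlen: int, max_qlen: int = 4) -> List[Tuple[int, int]]:
    """Return (bs, qlen) shapes near the failing one that should run cleanly.

    max_qlen is an assumed cap on the qlen values worth testing; it is not
    taken from the kernel or from the repro scripts.
    """
    total_qo = bs * qlen
    shapes = set()
    # Neighbours of the failing shape: bs±1 and qlen±1.
    for delta in (-1, 1):
        if bs + delta >= 1:
            shapes.add((bs + delta, qlen))
        if 1 <= qlen + delta <= max_qlen:
            shapes.add((bs, qlen + delta))
    # Same total_qo = bs * qlen reached through other factorings.
    for q in range(1, max_qlen + 1):
        if total_qo % q == 0:
            shapes.add((total_qo // q, q))
    # The failing shape itself is the positive case, not a control.
    shapes.discard((bs, qlen))
    return sorted(shapes)


controls = negative_controls(128, 3)
```

For the reported shape, `controls` contains the four neighbours plus the factorings of total_qo = 384 that stay within the assumed qlen cap.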

Minimal repro (≤5s, no concurrency required)

AMD_SERIALIZE_KERNEL=3 HIP_LAUNCH_BLOCKING=1 \
  python op_tests/repros/pa_asm_fp8_repeat_call.py \
    --bs 128 --ctx 1024 --qlen 3 --kv-dtype fp8 --n-repeat 5

See op_tests/repros/README.md for the full sweep matrix and controls.
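When the repro has to be launched from a harness rather than a shell, the same invocation can be assembled in Python. This is a sketch only: `build_repro_command` is a hypothetical helper, while the script path, flags, and environment variables are copied from the command above.

```python
import os
import sys
from typing import Dict, List, Tuple

SCRIPT = "op_tests/repros/pa_asm_fp8_repeat_call.py"


def build_repro_command(
    bs: int, ctx: int, qlen: int, kv_dtype: str, n_repeat: int
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) matching the shell command in the PR description."""
    env = dict(os.environ)
    # The same two variables the shell command sets before invoking Python.
    env["AMD_SERIALIZE_KERNEL"] = "3"
    env["HIP_LAUNCH_BLOCKING"] = "1"
    argv = [
        sys.executable, SCRIPT,
        "--bs", str(bs),
        "--ctx", str(ctx),
        "--qlen", str(qlen),
        "--kv-dtype", kv_dtype,
        "--n-repeat", str(n_repeat),
    ]
    return argv, env


argv, env = build_repro_command(128, 1024, 3, "fp8", 5)
```

Passing the pair to `subprocess.run(argv, env=env)` starts the repro in a fresh process. On POSIX, a negative `returncode` means the child was terminated by a signal; how the script itself reports the fault is not assumed here.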

Test plan

  • Run minimal repro on gfx950 — expect HIP memory access fault
  • Run with --kv-dtype bf16 on same shape — expect clean
  • Run shape sweep (pa_asm_fp8_shape_sweep.py) and confirm only bs=128 ∧ qlen=3 fails
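The last check in the plan can be expressed as a small predicate. The sketch below assumes the sweep output has already been parsed into a dict mapping (bs, qlen) to a status string in which "OK" means no crash; that dict layout and both function names are assumptions for illustration, not part of `pa_asm_fp8_shape_sweep.py`.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, int]  # (bs, qlen)

# The only shape the PR description expects to fail.
EXPECTED_FAILURES = {(128, 3)}


def failing_cells(results: Dict[Shape, str]) -> List[Shape]:
    """Return the shapes whose status is anything other than "OK"."""
    return sorted(shape for shape, status in results.items() if status != "OK")


def sweep_matches_expectation(results: Dict[Shape, str]) -> bool:
    """True only if the failing set is exactly EXPECTED_FAILURES."""
    return set(failing_cells(results)) == EXPECTED_FAILURES
```

An extra failing cell or a clean (128, 3) cell both make `sweep_matches_expectation` return False, so the check catches a wider failure surface as well as a non-reproducing run.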

Positive case (should crash, ~5 s)

AMD_SERIALIZE_KERNEL=3 HIP_LAUNCH_BLOCKING=1 \
  python op_tests/repros/pa_asm_fp8_repeat_call.py \
    --bs 128 --ctx 1024 --qlen 3 --kv-dtype fp8 --n-repeat 5

Negative cases (should pass)

python op_tests/repros/pa_asm_fp8_repeat_call.py --bs 128 --ctx 1024 --qlen 3 --kv-dtype bf16 --n-repeat 5
python op_tests/repros/pa_asm_fp8_repeat_call.py --bs 127 --ctx 1024 --qlen 3 --kv-dtype fp8 --n-repeat 5
python op_tests/repros/pa_asm_fp8_repeat_call.py --bs 128 --ctx 1024 --qlen 2 --kv-dtype fp8 --n-repeat 5

fangche123 and others added 6 commits May 20, 2026 03:03
Use min-based anchor rebase in K_V_window_rebase to allow pages within
the same wave to span different 65536 windows, as long as
max_page_id - min_page_id < 65536. This removes the previous constraint
that all pages in a load group must share the same high-16-bit window.

Changes:
- Replace v_mul_u32_u24 with v_mul_lo_u32 to remove 24-bit truncation
- Update K/V buffer descriptor num_records for full offset range
- Update all 36 PA kernel .co binaries (gfx942 + gfx950)
- Add test_pa_block_id_truncation.py regression test
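The 24-bit truncation named in the first change can be shown with plain integer arithmetic. The two functions below are hedged Python models of the instruction semantics (a multiply that reads only the low 24 bits of each operand versus a full 32-bit multiply that keeps the low 32 bits of the product); the operand values are made up for illustration.

```python
MASK24 = (1 << 24) - 1
MASK32 = (1 << 32) - 1


def mul_u32_u24(a: int, b: int) -> int:
    """Model of a 24-bit multiply: operands are truncated to their low 24 bits."""
    return ((a & MASK24) * (b & MASK24)) & MASK32


def mul_lo_u32(a: int, b: int) -> int:
    """Model of a 32-bit multiply that keeps the low 32 bits of the product."""
    return ((a & MASK32) * (b & MASK32)) & MASK32


# A block id just above 2**24 and a small hypothetical stride.
block_id = (1 << 24) + 5
stride = 3
truncated = mul_u32_u24(block_id, stride)
full = mul_lo_u32(block_id, stride)
```

With these inputs the 24-bit model drops the top bit of `block_id` and computes 5 * 3, whereas the 32-bit model returns the true product, which is the kind of divergence a block-id truncation regression test would look for.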
Rebuild all 36 PA kernel .co files from updated SP3 sources that use
min-based anchor rebase instead of the previous direct offset approach.

Performance (test_pa.py, bf16, batch=128):
- ctx_len=128:  12.17-12.22 us (vs 12.51-12.58 in v1, vs 11.73-11.78 baseline)
- ctx_len=257:  18.10-18.17 us (vs 18.52-18.82 in v1)
- ctx_len=4097: 161.16-162.33 us (vs 160.27-160.49 baseline, <1% delta)

Short-context regression reduced from ~7-9% to ~3-5% vs baseline.
Long-context effectively neutral (<1%).
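The relative deltas can be recomputed from the reported ranges. The sketch below uses range midpoints, which is a simplification chosen here and not necessarily how the percentages above were derived; only ctx_len=128 and ctx_len=4097 are covered because no baseline is listed for ctx_len=257.

```python
def midpoint(lo: float, hi: float) -> float:
    """Midpoint of a reported latency range, in microseconds."""
    return (lo + hi) / 2


def pct_over(value: float, baseline: float) -> float:
    """Percentage by which value exceeds baseline."""
    return (value / baseline - 1) * 100


# Figures copied from the commit message (test_pa.py, bf16, batch=128).
base_128 = midpoint(11.73, 11.78)
v1_128 = midpoint(12.51, 12.58)
new_128 = midpoint(12.17, 12.22)
base_4097 = midpoint(160.27, 160.49)
new_4097 = midpoint(161.16, 162.33)

v1_short = pct_over(v1_128, base_128)
new_short = pct_over(new_128, base_128)
new_long = pct_over(new_4097, base_4097)
```

On midpoints, ctx_len=128 comes out at roughly 6.7% over baseline for v1 and roughly 3.7% for the rebuilt kernels, and ctx_len=4097 at under 1%.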
- Reformat multiline function args per black style
- Remove unused imports (dtypes, time)
- Remove extra blank line
- Add trailing newline
Adds op_tests/repros/ with a self-contained Python reproducer for the HIP
illegal-memory access in pa_bf16_pertokenFp8_gqa8_1tg_4w_mtp_msk1.co
(gfx950). Triggers exactly when batch_size==128 and qlen==3 on the fp8
per-token-quant ASM PA path. The bf16 sibling kernel
(pa_bf16_noquant_gqa8_1tg_4w_mtp_msk1.co) is clean on the same shape.

Minimal repro (≤5s, no concurrency required):

  AMD_SERIALIZE_KERNEL=3 HIP_LAUNCH_BLOCKING=1 \
    python op_tests/repros/pa_asm_fp8_repeat_call.py \
      --bs 128 --ctx 1024 --qlen 3 --kv-dtype fp8 --n-repeat 5

See op_tests/repros/README.md for the full sweep matrix and negative
controls (bs±1, qlen±1, same total_qo via other factorings, bf16 KV).
@yhl-amd yhl-amd requested a review from a team May 21, 2026 03:17
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
|---|---|
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| ci:atom | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| ci:atom_full | ATOM accuracy suite for PR and main models from ATOM models_accuracy.json |
| ci:vllm | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| ci:all | All standard extended tests (excludes ci:atom_full) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3295 --add-label <label>

Comment on lines +47 to +48
import os
import random

⚠️ [ruff] <F401> reported by reviewdog 🐶
os imported but unused

Suggested change
- import os
- import random
+ import random

import sys
import time
import traceback
from typing import List, Optional, Tuple

⚠️ [ruff] <F401> reported by reviewdog 🐶
typing.Optional imported but unused

Suggested change
- from typing import List, Optional, Tuple
+ from typing import List, Tuple

Comment on lines +57 to +58
from aiter import dtypes
from aiter import pertoken_quant

⚠️ [ruff] <F401> reported by reviewdog 🐶
aiter.dtypes imported but unused

Suggested change
- from aiter import dtypes
- from aiter import pertoken_quant
+ from aiter import pertoken_quant

torch.cuda.synchronize()

# ---- single call ----
print(f"[min-repro] calling pa_fwd_asm ...", flush=True)

⚠️ [ruff] <F541> reported by reviewdog 🐶
f-string without any placeholders

Suggested change
- print(f"[min-repro] calling pa_fwd_asm ...", flush=True)
+ print("[min-repro] calling pa_fwd_asm ...", flush=True)

Comment on lines +22 to +23
import os
import sys

⚠️ [ruff] <F401> reported by reviewdog 🐶
os imported but unused

Suggested change
- import os
- import sys
+ import sys


print(f"# fp8 ASM PA shape sweep (each cell = {args.n_repeat} repeats "
f"of same call, fresh process)")
print(f"# OK = no crash. CRASH@k = launch error surfaced at call k "

⚠️ [ruff] <F541> reported by reviewdog 🐶
f-string without any placeholders

Suggested change
- print(f"# OK = no crash. CRASH@k = launch error surfaced at call k "
+ print("# OK = no crash. CRASH@k = launch error surfaced at call k "

print(f"# fp8 ASM PA shape sweep (each cell = {args.n_repeat} repeats "
f"of same call, fresh process)")
print(f"# OK = no crash. CRASH@k = launch error surfaced at call k "
f"(0-indexed; means call k-1 corrupted device).")

⚠️ [ruff] <F541> reported by reviewdog 🐶
f-string without any placeholders

Suggested change
- f"(0-indexed; means call k-1 corrupted device).")
+ "(0-indexed; means call k-1 corrupted device).")
