From c69cd7b9ba45a9be07acfe1387a881a19c73b8bc Mon Sep 17 00:00:00 2001
From: Debasis Mandal <debasis.mandal@amd.com>
Date: Wed, 22 Apr 2026 18:36:08 +0000
Subject: [PATCH 1/5] docs: add Claude Code context for FlashInfer+ROCm
 (CLAUDE.md, skills)

- Add root CLAUDE.md: HIP/ROCm, gfx942/gfx950, csrc_rocm, gpu_iface, AITER,
  feature matrix, JIT, debugging/benchmarking with ROCm tooling
- Add .claude/skills: add-rocm-kernel, benchmark-kernel, debug-rocm-crash
- Markdown tables/fences satisfy markdownlint (MD040, MD060, MD031)

Made-with: Cursor
---
 .claude/skills/add-rocm-kernel/SKILL.md  | 531 ++++++++++++++++++
 .claude/skills/benchmark-kernel/SKILL.md | 372 +++++++++++++
 .claude/skills/debug-rocm-crash/SKILL.md | 672 +++++++++++++++++++++++
 3 files changed, 1575 insertions(+)
 create mode 100644 .claude/skills/add-rocm-kernel/SKILL.md
 create mode 100644 .claude/skills/benchmark-kernel/SKILL.md
 create mode 100644 .claude/skills/debug-rocm-crash/SKILL.md
diff --git a/.claude/skills/add-rocm-kernel/SKILL.md b/.claude/skills/add-rocm-kernel/SKILL.md
new file mode 100644
index 0000000000..b9a14f8975
--- /dev/null
+++ b/.claude/skills/add-rocm-kernel/SKILL.md
@@ -0,0 +1,531 @@
+---
+name: add-rocm-kernel
+description: Step-by-step tutorial for adding new HIP/ROCm kernels to FlashInfer+ROCm (amd-flashinfer)
+---
+
+# Tutorial: Adding a New Kernel to FlashInfer+ROCm
+
+This tutorial walks through adding a simple element-wise scale operation to the ROCm port of
+FlashInfer. We implement `scale(x, factor) = x * factor` to illustrate the complete workflow on
+CDNA3 (`gfx942`) and CDNA4 (`gfx950`).
+
+If you are used to upstream's `add-cuda-kernel` tutorial, note the following ROCm-specific
+differences up front:
+
+| Concern | Upstream CUDA | This ROCm port |
+| --- | --- | --- |
+| Launcher directory | `csrc/` | [`flashinfer/csrc_rocm/`](../../../flashinfer/csrc_rocm/) |
+| Bindings | TVM-FFI (`TVM_FFI_DLL_EXPORT_TYPED_FUNC`) | Plain Torch extension (`TORCH_LIBRARY_FRAGMENT`) |
+| Tensor type | `tvm::ffi::TensorView` | `at::Tensor` |
+| Stream | `get_stream(device)` | `at::hip::getCurrentHIPStream()` |
+| Compiler | `nvcc` | `hipcc` (amdclang++) |
+| Arch env var | `FLASHINFER_CUDA_ARCH_LIST` | `FLASHINFER_ROCM_ARCH_LIST` |
+| AOT registration | `flashinfer/aot.py` | [`flashinfer/aot_hip.py`](../../../flashinfer/aot_hip.py) |
+| Tests directory | `tests/` | [`tests/rocm_tests/`](../../../tests/rocm_tests/) |
+
+## Goal
+
+Add a new operation that scales each element of a tensor by a scalar factor:
+
+- Input: tensor `x` and scalar `factor`
+- Output: `x * factor` (element-wise)
+- Support FP16 and BF16
+- Compile for both `gfx942` and `gfx950`
+
+## Step 1: Define the HIP kernel in `include/`
+
+Create `include/flashinfer/scale.cuh`. **Do not include `<torch/...>` headers here.** The file
+must stay framework-agnostic so the same header can compile under CUDA (upstream) and HIP (this
+port). For anything that differs between the two platforms, reach for
+[`include/gpu_iface/`](../../../include/gpu_iface/).
+
+```cpp
+#pragma once
+
+#include "gpu_iface/platform.hpp"
+#include "gpu_iface/gpu_runtime_compat.hpp"
+#include "gpu_iface/vec_dtypes.hpp"
+
+namespace flashinfer {
+
+/*!
+ * \brief Element-wise scale kernel.
+ * \tparam T Data type (half / __hip_bfloat16 / float)
+ */
+template <typename T>
+__global__ void ScaleKernel(const T* __restrict__ input, T* __restrict__ output,
+                            T factor, int n) {
+  int idx = blockIdx.x * blockDim.x + threadIdx.x;
+  if (idx < n) {
+    output[idx] = input[idx] * factor;
+  }
+}
+
+/*!
+ * \brief Launch scale kernel (platform-agnostic).
+ */
+template <typename T>
+gpuError_t ScaleLauncher(const T* input, T* output, T factor, int n,
+                         gpuStream_t stream = nullptr) {
+  const int threads = 256;
+  const int blocks  = (n + threads - 1) / threads;
+
+  ScaleKernel<T><<<blocks, threads, 0, stream>>>(input, output, factor, n);
+
+  return gpuGetLastError();
+}
+
+}  // namespace flashinfer
+```
+
+**Key points:**
+
+- No `<cuda_runtime.h>` / `<hip/hip_runtime.h>` includes: these are pulled in transitively by
+  `gpu_iface/platform.hpp` based on whether the translation unit is compiled for CUDA or HIP.
+- `gpuError_t`, `gpuStream_t`, and `gpuGetLastError()` come from
+  [`include/gpu_iface/gpu_runtime_compat.hpp`](../../../include/gpu_iface/gpu_runtime_compat.hpp)
+  and alias to either the CUDA or HIP symbols depending on the backend macro.
+- `__global__` and the `<<<...>>>` launch syntax are supported on both HIP and CUDA without
+  translation.
+- Template on dtype; the concrete dtype is selected in the launcher via a dispatch macro.
+
+### When to add something to `gpu_iface`
+
+If your kernel needs a primitive that differs meaningfully between CUDA and HIP (an MMA
+intrinsic, a cross-lane shuffle, a memory fence, a warp-wide reduction, a dtype container), add
+it to the appropriate `include/gpu_iface/backend/{cuda,hip}/` file and expose a shared name from
+the top-level `gpu_iface/` header — do **not** duplicate the whole kernel under `csrc_rocm/`.
+
+Representative HIP-side files already in use:
+
+- [`include/gpu_iface/backend/hip/vec_dtypes_hip.h`](../../../include/gpu_iface/backend/hip/vec_dtypes_hip.h)
+- [`include/gpu_iface/backend/hip/mma_hip.h`](../../../include/gpu_iface/backend/hip/mma_hip.h)
+- [`include/gpu_iface/backend/hip/memory_ops_hip.h`](../../../include/gpu_iface/backend/hip/memory_ops_hip.h)
+- [`include/gpu_iface/backend/hip/math_hip.h`](../../../include/gpu_iface/backend/hip/math_hip.h)
+
+## Step 2: Create the launcher in `flashinfer/csrc_rocm/`
+
+Create `flashinfer/csrc_rocm/scale.cu`. This is the file that bridges PyTorch tensors to the
+framework-agnostic kernel above.
+
+```cpp
+#include <string>
+#include <flashinfer/scale.cuh>
+
+#include "pytorch_extension_utils.h"
+
+using namespace flashinfer;
+
+void scale(at::Tensor& output, at::Tensor& input, double factor) {
+  CHECK_INPUT(input);
+  CHECK_INPUT(output);
+  TORCH_CHECK(input.sizes() == output.sizes(),
+              "scale: output shape must match input shape");
+  TORCH_CHECK(input.scalar_type() == output.scalar_type(),
+              "scale: output dtype must match input dtype");
+
+  const c10::hip::OptionalHIPGuardMasqueradingAsCUDA device_guard(input.device());
+  const hipStream_t stream = at::hip::getCurrentHIPStream();
+  const int n = static_cast<int>(input.numel());  // assumes numel() fits in int32
+
+  DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16(input.scalar_type(), c_type, [&] {
+    hipError_t status = ScaleLauncher<c_type>(
+        static_cast<c_type*>(input.data_ptr()),
+        static_cast<c_type*>(output.data_ptr()),
+        static_cast<c_type>(factor),
+        n,
+        stream);
+    TORCH_CHECK(status == hipSuccess,
+                "scale failed: " + std::string(hipGetErrorString(status)));
+    return true;
+  });
+}
+```
+
+**Key points:**
+
+- Include [`pytorch_extension_utils.h`](../../../flashinfer/csrc_rocm/pytorch_extension_utils.h)
+  for `at::Tensor`, the `CHECK_*` macros, and the `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_*` family.
+- Use `c10::hip::OptionalHIPGuardMasqueradingAsCUDA` — PyTorch's ROCm build "masquerades" as
+  CUDA, so device guards and streams are exposed through the HIP-prefixed namespaces.
+- Acquire the current HIP stream with `at::hip::getCurrentHIPStream()`, not
+  `c10::cuda::getCurrentCUDAStream()`.
+- Dispatch macro: `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16` covers FP16+BF16. For a dispatch that
+  also covers FP32 or FP8, use the other `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_*` variants defined
+  in `pytorch_extension_utils.h`.
+- Error handling uses `TORCH_CHECK(cond, msg)` — the PyTorch extension idiom. There is no
+  `TVM_FFI_THROW` on this path.
+
+### Validation helpers available
+
+From [`flashinfer/csrc_rocm/pytorch_extension_utils.h`](../../../flashinfer/csrc_rocm/pytorch_extension_utils.h):
+
+- `CHECK_INPUT(tensor)` — validates CUDA/HIP + contiguous.
+- `CHECK_LAST_DIM_CONTIGUOUS_INPUT(tensor)` — validates CUDA/HIP + last-dim-contiguous.
+- `CHECK_EQ(a, b)`, `CHECK_DIM(n, tensor)` — shape / rank sanity checks.
+- `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16` — FP16 + BF16
+- `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16_FP32` — FP16 + BF16 + FP32
+- `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP8` — E4M3 + E5M2 (the `_fnuz` variants on CDNA3/4)
+
+For a worked-out reference, read [`flashinfer/csrc_rocm/norm.cu`](../../../flashinfer/csrc_rocm/norm.cu)
+(kept intentionally simple) and compare against the more involved
+[`flashinfer/csrc_rocm/batch_prefill.cu`](../../../flashinfer/csrc_rocm/batch_prefill.cu) (plan-run
+pattern, multiple backends, FP8 path).
+
+## Step 3: Create the Torch-extension binding
+
+Create `flashinfer/csrc_rocm/flashinfer_scale_binding.cu`. This is the file that exports the
+launcher to Python.
+
+```cpp
+#include "pytorch_extension_utils.h"
+
+void scale(at::Tensor& output, at::Tensor& input, double factor);
+
+TORCH_LIBRARY_FRAGMENT(TORCH_EXTENSION_NAME, m) {
+  // Element-wise scale: output = input * factor
+  m.def("scale", scale);
+}
+```
+
+**Key points:**
+
+- The `TORCH_EXTENSION_NAME` macro is defined by PyTorch's build system and resolves to the
+  unique module name for this JIT build — `TORCH_LIBRARY_FRAGMENT` registers `scale` under that
+  namespace.
+- `pytorch_extension_utils.h` also emits a `PyInit_<name>` stub so the resulting `.so` is
+  importable as a Python module (see the bottom of
+  [`pytorch_extension_utils.h`](../../../flashinfer/csrc_rocm/pytorch_extension_utils.h)).
+- Compare with [`flashinfer/csrc_rocm/flashinfer_norm_binding.cu`](../../../flashinfer/csrc_rocm/flashinfer_norm_binding.cu)
+  for the exact pattern.
+
+**Do not write:**
+
+- `TVM_FFI_DLL_EXPORT_TYPED_FUNC(run, scale)` — that's the upstream TVM-FFI pattern; it does
+  not work on this port.
+- `PYBIND11_MODULE(...)` — we use the `TORCH_LIBRARY_FRAGMENT` flavor which integrates with
+  `torch.library` and thus with `torch.compile`.
+
+## Step 4: (Optional) Jinja type specialization
+
+For operations that benefit from compile-time type specialization (you want one `.so` per dtype
+combination rather than runtime dispatch), add a Jinja template next to the launcher:
+
+`flashinfer/csrc_rocm/scale_customize_config.jinja`:
+
+```jinja
+#pragma once
+
+using DTypeIn  = {{ dtype_in }};
+using DTypeOut = {{ dtype_out }};
+constexpr int SCALE_BLOCK_SIZE = {{ block_size }};
+```
+
+The JIT module generator (Step 5) renders this to a concrete `.inc` file before invoking
+`hipcc`. See [`flashinfer/csrc_rocm/batch_prefill_customize_config.jinja`](../../../flashinfer/csrc_rocm/batch_prefill_customize_config.jinja)
+for a non-trivial example.
+
+**When to skip Jinja:** for a kernel like our `scale` example, where the dtype is picked via
+`DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16` at runtime, there is no benefit. Skip this step entirely.
+
+## Step 5: Write the JIT module generator
+
+Create `flashinfer/jit/scale.py`:
+
+```python
+"""
+Copyright (c) 2026 by FlashInfer+ROCm team.
+SPDX-License-Identifier: Apache-2.0
+"""
+
+from . import env as jit_env
+from .core import JitSpec, gen_jit_spec
+
+
+def gen_scale_module() -> JitSpec:
+    """JitSpec for the element-wise scale op.
+
+    No Jinja / type specialization is needed here because the dtype dispatch
+    happens inside DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16 at runtime.
+    """
+    extra_flags = [
+        "-DENABLE_BF16",
+    ]
+    return gen_jit_spec(
+        "scale",
+        [
+            jit_env.FLASHINFER_CSRC_DIR / "scale.cu",
+            jit_env.FLASHINFER_CSRC_DIR / "flashinfer_scale_binding.cu",
+        ],
+        extra_cuda_cflags=extra_flags,
+    )
+```
+
+**Key points:**
+
+- `jit_env.FLASHINFER_CSRC_DIR` resolves to `flashinfer/csrc_rocm/` on HIP, via
+  [`flashinfer/get_include_paths.py::get_csrc_dir()`](../../../flashinfer/get_include_paths.py).
+  This is a conscious divergence from upstream — do **not** reach for a hard-coded `csrc/`.
+- `extra_cuda_cflags` is still the kwarg name even on HIP (for source-compat with upstream);
+  internally [`flashinfer/jit/core.py`](../../../flashinfer/jit/core.py) maps it to flags passed
+  to `hipcc`.
+- `gen_jit_spec` on HIP automatically prepends the output of
+  `current_compilation_context.get_hipcc_flags_list()` — that is, `--offload-arch=gfxNNN` for
+  every target arch plus the common HIP defines (`-DFLASHINFER_ENABLE_HIP`, etc.). You do not
+  need to add `--offload-arch` yourself unless you are overriding a built-in default.
+- If your kernel must **only** run on one arch, add a runtime check (e.g. via
+  `FLASHINFER_SUPPORTED_ROCM_ARCHS` in [`flashinfer/hip_utils.py`](../../../flashinfer/hip_utils.py))
+  at the Python API layer. There is no HIP-side equivalent of upstream's
+  `supported_major_versions=[...]` mechanism yet.
+
+### Register the generator for re-export
+
+Add the import to the `IS_HIP` branch of
+[`flashinfer/jit/__init__.py`](../../../flashinfer/jit/__init__.py):
+
+```python
+elif IS_HIP:
+    # ...
+    from .scale import gen_scale_module as gen_scale_module
+```
+
+Place it alphabetically among the existing `from .norm import ...`, `from .rope import ...`
+lines.
+
+## Step 6: Write the Python API
+
+Create `flashinfer/scale.py`:
+
+```python
+"""
+Copyright (c) 2026 by FlashInfer+ROCm team.
+SPDX-License-Identifier: Apache-2.0
+"""
+
+import functools
+from typing import Optional
+
+import torch
+
+from .jit.scale import gen_scale_module
+
+
+@functools.cache
+def _get_scale_module():
+    """Compile + load the scale module exactly once per process."""
+    return gen_scale_module().build_and_load()
+
+
+def scale(
+    input: torch.Tensor,
+    factor: float,
+    out: Optional[torch.Tensor] = None,
+) -> torch.Tensor:
+    """Element-wise ``output = input * factor``.
+
+    Parameters
+    ----------
+    input : torch.Tensor
+        Input tensor on an AMD GPU. Must be FP16 or BF16 and contiguous.
+    factor : float
+        Scalar multiplier.
+    out : Optional[torch.Tensor]
+        Pre-allocated output tensor. If ``None``, a new tensor is allocated.
+
+    Returns
+    -------
+    torch.Tensor
+        ``input * factor`` with the same shape/dtype/device as ``input``.
+
+    Examples
+    --------
+    >>> import torch, flashinfer
+    >>> x = torch.randn(1024, dtype=torch.float16, device="cuda")
+    >>> y = flashinfer.scale(x, 2.0)
+    >>> torch.allclose(y, x * 2.0)
+    True
+    """
+    if out is None:
+        out = torch.empty_like(input)
+
+    module = _get_scale_module()
+    module.scale(out, input, float(factor))
+    return out
+```
+
+**Key points:**
+
+- `@functools.cache` caches the compiled module in memory so subsequent calls skip the JIT
+  cache lookup entirely.
+- **Destination-passing style**: accept an optional `out=` so perf-sensitive callers can avoid
+  an extra allocation.
+- On ROCm, `input.device.type == "cuda"` — PyTorch's ROCm build reuses the CUDA namespace. Do
+  not test for `"hip"`; it will never be true in practice.
+- If you want API logging, add `@flashinfer_api` above `def scale(...)`. See the
+  [`debug-rocm-crash`](../debug-rocm-crash/SKILL.md) skill.
+
+### Expose from the package
+
+Add the export to the `IS_HIP` branch of
+[`flashinfer/__init__.py`](../../../flashinfer/__init__.py):
+
+```python
+elif IS_HIP:
+    # ...
+    from .scale import scale as scale
+```
+
+## Step 7: Write tests
+
+Create `tests/rocm_tests/test_scale_hip.py`:
+
+```python
+"""
+Copyright (c) 2026 by FlashInfer+ROCm team.
+SPDX-License-Identifier: Apache-2.0
+"""
+
+import pytest
+import torch
+
+import flashinfer
+from flashinfer.hip_utils import FLASHINFER_SUPPORTED_ROCM_ARCHS
+
+
+def _current_arch() -> str:
+    return torch.cuda.get_device_properties(0).gcnArchName.split(":")[0]
+
+
+@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16])
+@pytest.mark.parametrize("shape", [(1024,), (32, 128), (8, 32, 128)])
+@pytest.mark.parametrize("factor", [0.5, 1.0, 2.0, -3.25])
+def test_scale_correctness(shape, dtype, factor):
+    # Skip (rather than fail) on GPUs outside the supported arch list.
+    if _current_arch() not in FLASHINFER_SUPPORTED_ROCM_ARCHS:
+        pytest.skip("Test requires a FlashInfer-supported AMD GPU")
+
+    x = torch.randn(*shape, dtype=dtype, device="cuda")
+    y = flashinfer.scale(x, factor)
+
+    ref = x.float() * factor
+    torch.testing.assert_close(y.float(), ref, rtol=1e-2, atol=1e-2)
+
+
+def test_scale_inplace_out():
+    x = torch.randn(64, 64, dtype=torch.float16, device="cuda")
+    out = torch.empty_like(x)
+    y = flashinfer.scale(x, 3.0, out=out)
+
+    assert y.data_ptr() == out.data_ptr()
+    torch.testing.assert_close(y.float(), x.float() * 3.0, rtol=1e-2, atol=1e-2)
+```
+
+**Key points:**
+
+- Test files under [`tests/rocm_tests/`](../../../tests/rocm_tests/) are named `test_*_hip.py`
+  by convention.
+- The repo's [`tests/rocm_tests/conftest.py`](../../../tests/rocm_tests/conftest.py) hooks into
+  `pytest-xdist` so `pytest -n auto` only spawns workers for
+  FlashInfer-supported GPUs. You do not need to parametrize over devices yourself.
+- Use FP32 for reference math to avoid dtype-mismatch asserts with `assert_close`.
+- Keep tolerances loose enough for BF16 (`rtol=1e-2`, `atol=1e-2`); tighten for FP32-only ops.
+
+Run it:
+
+```bash
+pytest tests/rocm_tests/test_scale_hip.py -v
+# Or only on GPU 0
+HIP_VISIBLE_DEVICES=0 pytest tests/rocm_tests/test_scale_hip.py -v
+```
+
+## Step 8: Register for AOT (optional)
+
+If your op should also be available in pre-compiled wheels (the
+[`amd-flashinfer-jit-cache/`](../../../amd-flashinfer-jit-cache/) package), register the JIT
+generator in [`flashinfer/aot_hip.py`](../../../flashinfer/aot_hip.py). Add a generator that
+yields your `JitSpec`, and reference it from the main AOT-compile loop.
+
+Pattern (see existing `gen_fa2` in that file):
+
+```python
+def gen_scale() -> Iterator:
+    from .jit.scale import gen_scale_module
+    yield gen_scale_module()
+```
+
+Then AOT compile with:
+
+```bash
+cd amd-flashinfer-jit-cache
+export FLASHINFER_ROCM_ARCH_LIST="gfx942,gfx950"
+python -m build --no-isolation --wheel
+```
+
+The resulting wheel ships a pre-compiled `.so` per arch, indexed by the URI hash.
+
+## CDNA3 vs CDNA4 — what to watch for
+
+Both `gfx942` (CDNA3, MI300X/MI325X) and `gfx950` (CDNA4, MI350X/MI355X) are Matrix Core
+architectures, but they are not fully compatible:
+
+| Concern | CDNA3 (`gfx942`) | CDNA4 (`gfx950`) |
+| --- | --- | --- |
+| MFMA intrinsics | `__builtin_amdgcn_mfma_*` family (F16, BF16, I8, FP8) | Same family **plus** new CDNA4-only instructions (wider FP8 MFMAs, additional block sizes) |
+| FP8 format | `__hip_fp8_e4m3_fnuz`, `__hip_fp8_e5m2_fnuz` (FNUZ encoding) | Hardware-native FP8 is the OCP encoding (`__hip_fp8_e4m3`, `__hip_fp8_e5m2`); which types are usable depends on ROCm version |
+| LDS capacity | 64 KB / CU | 160 KB / CU; **do not** assume identical block/tile sizes |
+| Wavefront size | 64 | 64 |
+
+Practical implications when authoring a new kernel:
+
+- If you use MFMA intrinsics, guard them on the arch macro (`__gfx942__`, `__gfx950__`) or
+  behind the `FLASHINFER_SUPPORTED_ROCM_ARCHS` check at the Python level.
+- Do not hard-code LDS tile sizes. Either parameterize the kernel (Jinja) or query the device
+  properties at plan time (e.g. `torch.cuda.get_device_properties(dev).shared_memory_per_block`).
+- FP8: `gfx942` uses the `_fnuz` encodings, while `gfx950` hardware natively uses OCP FP8, so
+  confirm which types the port instantiates per arch. Bit-exact parity with NVIDIA
+  `__nv_fp8_e4m3` is **not** guaranteed on the FNUZ path; tests must account for that encoding.
+
+When in doubt, look at how
+[`flashinfer/prefill_rocm.py`](../../../flashinfer/prefill_rocm.py) and
+[`flashinfer/csrc_rocm/batch_prefill.cu`](../../../flashinfer/csrc_rocm/batch_prefill.cu) handle
+per-arch specialization.
+
+## Reference implementations in this repo
+
+| Complexity | Files |
+| --- | --- |
+| Simple, no Jinja | [`flashinfer/norm.py`](../../../flashinfer/norm.py) + [`flashinfer/csrc_rocm/norm.cu`](../../../flashinfer/csrc_rocm/norm.cu) + [`flashinfer/csrc_rocm/flashinfer_norm_binding.cu`](../../../flashinfer/csrc_rocm/flashinfer_norm_binding.cu) + [`flashinfer/jit/norm.py`](../../../flashinfer/jit/norm.py) |
+| Moderate, with Jinja | [`flashinfer/csrc_rocm/single_prefill.cu`](../../../flashinfer/csrc_rocm/single_prefill.cu) + [`flashinfer/csrc_rocm/single_prefill_customize_config.jinja`](../../../flashinfer/csrc_rocm/single_prefill_customize_config.jinja) + [`flashinfer/csrc_rocm/single_prefill_kernel_inst.jinja`](../../../flashinfer/csrc_rocm/single_prefill_kernel_inst.jinja) |
+| Complex (plan-run, AITER, FP8) | [`flashinfer/prefill_rocm.py`](../../../flashinfer/prefill_rocm.py) + [`flashinfer/csrc_rocm/batch_prefill.cu`](../../../flashinfer/csrc_rocm/batch_prefill.cu) |
+
+## Summary checklist
+
+When adding a new op, verify each box:
+
+- [ ] Header in `include/flashinfer/` — no Torch/HIP-runtime includes; uses `gpu_iface/` for
+      platform-differing primitives.
+- [ ] Launcher in `flashinfer/csrc_rocm/<name>.cu` with `#include "pytorch_extension_utils.h"`,
+      `at::Tensor` inputs, `at::hip::getCurrentHIPStream()`, and a `DISPATCH_PYTORCH_DTYPE_*`
+      block.
+- [ ] Binding in `flashinfer/csrc_rocm/flashinfer_<name>_binding.cu` using
+      `TORCH_LIBRARY_FRAGMENT(TORCH_EXTENSION_NAME, m)`.
+- [ ] (Optional) Jinja template for type specialization.
+- [ ] JIT generator in `flashinfer/jit/<name>.py` returning a `JitSpec` via `gen_jit_spec`.
+- [ ] Import exposed from the `IS_HIP` branches of `flashinfer/jit/__init__.py` **and**
+      `flashinfer/__init__.py`.
+- [ ] Python API with `@functools.cache`, destination-passing style, FP16/BF16 support,
+      and optional `@flashinfer_api`.
+- [ ] Tests in `tests/rocm_tests/test_<name>_hip.py`.
+- [ ] (Optional) AOT registration in `flashinfer/aot_hip.py`.
+- [ ] Run `pre-commit run -a` before committing.
+
+## Related documentation
+
+- [`CLAUDE.md`](../../../CLAUDE.md) — project overview, JIT architecture, feature matrix.
+- [`.claude/skills/benchmark-kernel/SKILL.md`](../benchmark-kernel/SKILL.md) — how to benchmark
+  the kernel you just added.
+- [`.claude/skills/debug-rocm-crash/SKILL.md`](../debug-rocm-crash/SKILL.md) — debugging recipes
+  when `TORCH_CHECK` fires or the GPU faults.
+- Upstream's [`add-cuda-kernel` skill](https://github.com/flashinfer-ai/flashinfer/blob/main/.claude/skills/add-cuda-kernel/SKILL.md)
+  — the source this tutorial was adapted from. Useful when you are porting a kernel from
+  upstream CUDA and want to see the "before" picture.
diff --git a/.claude/skills/benchmark-kernel/SKILL.md b/.claude/skills/benchmark-kernel/SKILL.md
new file mode 100644
index 0000000000..b97f137048
--- /dev/null
+++ b/.claude/skills/benchmark-kernel/SKILL.md
@@ -0,0 +1,372 @@
+---
+name: benchmark-kernel
+description: Guide for benchmarking FlashInfer+ROCm kernels on AMD Instinct (CDNA3/CDNA4)
+---
+
+# Tutorial: Benchmarking FlashInfer+ROCm Kernels
+
+This guide shows how to accurately benchmark kernels on the ROCm port of FlashInfer (the
+`amd-flashinfer` package), targeting AMD Instinct CDNA3 (`gfx942`) and CDNA4 (`gfx950`).
+
+## Goal
+
+Measure the performance of FlashInfer+ROCm kernels:
+
+- Get accurate GPU kernel execution time on MI300X / MI325X / MI350X / MI355X.
+- Compare HIP-native and AITER (Composable-Kernel) prefill backends.
+- Generate reproducible benchmark results for regression tracking.
+- Save results to CSV / PNG rooflines for later analysis.
+
+## Timing methods on ROCm
+
+FlashInfer+ROCm supports three practical timing paths. **CUPTI is NVIDIA-only — do not try to
+install `cupti-python` on a ROCm host.**
+
+| Method | When to use | Source |
+| --- | --- | --- |
+| **CUDA events (HIP-backed via PyTorch)** | Default. Quick in-loop timing from Python. Good accuracy for kernels ≳ 50 µs. | `flashinfer.testing.bench_gpu_time` (the "CUDA event" path) |
+| **`rocprofv3` + [`rocm_profiler/rocm_profiler.py`](../../../rocm_profiler/rocm_profiler.py)** | Preferred for authoring or optimizing a kernel. Gives per-kernel time, hardware counters, and a two-panel roofline plot. | Wrapper spawns `rocprofv3` as a subprocess. |
+| **`omnitrace`** | Whole-process timeline with host + device events. Use when interaction with dataloaders / Python overhead is suspect. | Installed separately from ROCm. |
+
+Internally, `bench_gpu_time` on ROCm uses PyTorch's `torch.cuda.Event`, which maps to HIP events
+under the ROCm build. The `bench_gpu_time_with_cupti` code path in
+[`flashinfer/testing/utils.py`](../../../flashinfer/testing/utils.py) is never selected on a ROCm
+install because `cupti-python` will not import.
+
+## Pre-flight: what you can actually benchmark
+
+On a ROCm install of `amd-flashinfer`, only the APIs exposed in the `IS_HIP` branch of
+[`flashinfer/__init__.py`](../../../flashinfer/__init__.py) are callable:
+
+**Attention:**
+
+- `single_prefill_with_kv_cache` / `single_prefill_with_kv_cache_return_lse`
+- `BatchPrefillWithPagedKVCacheWrapper`, `BatchPrefillWithRaggedKVCacheWrapper`
+- `single_decode_with_kv_cache`
+- `BatchDecodeWithPagedKVCacheWrapper`, `CUDAGraphBatchDecodeWithPagedKVCacheWrapper`
+
+**Other:**
+
+- Normalization (`rmsnorm`, `fused_add_rmsnorm`, `gemma_rmsnorm`, …)
+- RoPE (`apply_rope_*`, `apply_llama31_rope_*`)
+- Sampling (`sampling_from_probs`, `top_k_*`, `top_p_*`, `min_p_sampling_from_probs`, …)
+- Paged KV management (`append_paged_kv_cache`, `get_batch_indices_positions`, …)
+- Quantization (`packbits`, `segment_packbits`)
+- Activation (`silu_and_mul`, `gelu_and_mul`, `gelu_tanh_and_mul`)
+
+**Not available on ROCm:** MLA, cascade, POD, FP4 quantization, TRT-LLM/CUTLASS MoE, cuDNN
+backends. Do not attempt to benchmark these — the symbol simply is not re-exported in the
+`IS_HIP` branch.
+
+**Backends that exist per op:**
+
+| Op family | Default (HIP) backend | AITER backend available? | How to select AITER |
+| --- | --- | --- | --- |
+| Single prefill | yes | yes (CK FMHA) | `backend="aiter"` kwarg |
+| Batch prefill (paged / ragged) | yes | yes (CK FMHA) | `backend="aiter"` kwarg |
+| Decode (single / batch / CUDA-graph) | yes | no | n/a |
+| All others (norm, rope, sampling, …) | yes | no | n/a |
+
+**AITER caveats** (see [`README.md`](../../../README.md) and
+[`flashinfer/prefill_rocm.py`](../../../flashinfer/prefill_rocm.py)):
+
+- `kv_layout="NHD"` only.
+- Batch prefill with AITER's CK FMHA requires `page_size ∈ {1, 16, 1024}`.
+- `amd-aiter` must be importable (usually `pip install amd-aiter --index-url https://pypi.amd.com/simple/`).
+
+Trying to benchmark an unsupported config under `backend="aiter"` will raise a Python error
+*before* the kernel launches, not silently fall back.
+
+## Method 1: In-script timing with `bench_gpu_time`
+
+For a quick perf check of one op, call
+[`flashinfer.testing.bench_gpu_time`](../../../flashinfer/testing/utils.py) directly. On ROCm it
+falls through to the `bench_gpu_time_with_cuda_event` path automatically.
+
+```python
+import torch
+import flashinfer
+from flashinfer.testing import bench_gpu_time
+
+seq_len       = 1024
+num_qo_heads  = 32
+num_kv_heads  = 8      # GQA 4:1
+head_dim      = 128
+dtype         = torch.bfloat16
+
+q = torch.randn(seq_len, num_qo_heads, head_dim, dtype=dtype, device="cuda")
+k = torch.randn(seq_len, num_kv_heads, head_dim, dtype=dtype, device="cuda")
+v = torch.randn(seq_len, num_kv_heads, head_dim, dtype=dtype, device="cuda")
+
+
+def run_default():
+    return flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)
+
+
+def run_aiter():
+    return flashinfer.single_prefill_with_kv_cache(
+        q, k, v, causal=True, backend="aiter",
+    )
+
+
+def report(label, fn):
+    # enable_cupti=True is harmless on ROCm — it is silently ignored and the
+    # CUDA-events path is used. Passing it makes the script portable to CUDA hosts.
+    median_ms, std_ms = bench_gpu_time(
+        fn, args=(), enable_cupti=True, num_iters=30, dry_run_iters=5,
+    )
+    print(f"{label:12s}  median={median_ms:.3f} ms  std={std_ms:.3f} ms")
+
+
+report("hip-default", run_default)
+report("aiter",        run_aiter)
+```
+
+Typical output on an MI300X (numbers are illustrative — your exact values will depend on ROCm
+version, driver, and HIP-SDMA settings):
+
+```text
+hip-default  median=0.182 ms  std=0.004 ms
+aiter        median=0.146 ms  std=0.003 ms
+```
+
+**Important arguments:**
+
+| Arg | Purpose | Default |
+| --- | --- | --- |
+| `num_iters` | Measured iterations | 30 |
+| `dry_run_iters` | Warmup iterations | 5 |
+| `enable_cupti` | CUDA only; ignored on ROCm | False |
+| `l2_flush` / `rotate_buffers` | Flush L2 between iterations for memory-bound kernels | varies |
+
+## Method 2: `rocm_profiler` (recommended for optimization work)
+
+For anything you intend to optimize, use the in-repo
+[`rocm_profiler/rocm_profiler.py`](../../../rocm_profiler/rocm_profiler.py). It:
+
+1. Runs repeated GPU launches in the current process to get a median kernel time.
+2. Re-executes the same driver script under `rocprofv3` as a subprocess (recognized by the
+   `_ROCM_PROFILER_INTERNAL` env sentinel) to collect hardware counters with one
+   warmup + one profiled launch.
+3. Produces a two-panel log-log **roofline plot** combining the timing and counter data.
+
+All outputs are written under `benchmarks/rocm_benchmarks/` (gitignored).
+
+### Minimal driver script
+
+Start from the working example at
+[`benchmarks/rocm_benchmarks/bench_fa2_prefill.py`](../../../benchmarks/rocm_benchmarks/bench_fa2_prefill.py)
+and adapt:
+
+```python
+# my_bench.py
+import torch
+import flashinfer
+from rocm_profiler import RocmProfiler, KernelConfig
+
+B, S, H_Q, H_KV, D = 1, 1024, 32, 8, 128
+dtype = torch.bfloat16
+q = torch.randn(S, H_Q, D, dtype=dtype, device="cuda")
+k = torch.randn(S, H_KV, D, dtype=dtype, device="cuda")
+v = torch.randn(S, H_KV, D, dtype=dtype, device="cuda")
+
+configs = [
+    KernelConfig(
+        name="s1024_causal",
+        run_fn=lambda: flashinfer.single_prefill_with_kv_cache_return_lse(
+            q, k, v, causal=True
+        ),
+        # FLOPs = 2 * S * S * H_Q * D: two mat-muls (QK^T and PV) at 2*S*S*D each per head,
+        # halved by the causal mask. Matches benchmarks/rocm_benchmarks/bench_fa2_prefill.py.
+        theoretical_flops=2 * S * S * H_Q * D,
+        theoretical_bytes=(S * H_Q + 2 * S * H_KV) * D * dtype.itemsize,
+        label="seq=1024 causal",
+    ),
+]
+
+profiler = RocmProfiler(
+    configs=configs,
+    counters="roofline",            # or "compute", "memory", "occupancy", "stall", "basic"
+    kernel_name_regex="SinglePrefill",
+    output_dir="benchmarks/rocm_benchmarks",
+    label="my_bench",
+)
+
+if __name__ == "__main__":
+    profiler.run()
+```
+
+### Run it
+
+```bash
+# Full pipeline: timing + counter collection + roofline PNG
+python my_bench.py
+
+# Change the counter preset (see header of rocm_profiler.py for the full list)
+python my_bench.py --counters occupancy
+python my_bench.py --counters stall
+python my_bench.py --counters memory
+
+# Timing only (no rocprofv3 at all — fast sanity check)
+python my_bench.py --timing-only
+
+# Run profiling but skip the roofline plot
+python my_bench.py --skip-roofline
+
+# Regenerate the roofline plot from existing CSVs (no GPU required)
+python my_bench.py --replot
+
+# List all built-in counter presets
+python my_bench.py --list-presets
+```
+
+### Outputs
+
+```text
+benchmarks/rocm_benchmarks/<label>_timing.csv             # median + std per config
+benchmarks/rocm_benchmarks/<label>_counters.yml           # rocprofv3 input spec
+benchmarks/rocm_benchmarks/<label>_counter_collection.csv # raw counters
+benchmarks/rocm_benchmarks/<label>_roofline.png           # only for counters=roofline
+```
+
+### Counter presets worth knowing
+
+| Preset | What it shows | Typical use |
+| --- | --- | --- |
+| `roofline` (default) | `FetchSize`, `WriteSize`, MFMA ops, TCC DRAM requests | Is the kernel compute- or memory-bound? |
+| `compute` | MFMA ops + cycle counters | Matrix-core throughput on CDNA3/4 |
+| `memory` | L2 and DRAM bandwidth breakdown | L2 hit-rate, HBM traffic |
+| `occupancy` | `SQ_WAVES`, `SQ_BUSY_CYCLES`, `SQ_VALU_MFMA_BUSY_CYCLES`, `SQ_WAIT_INST_ANY`, `SQ_INSTS_LDS` | Wavefront density, scheduler efficiency |
+| `stall` | `SQ_WAIT_INST_VMEM`, `SQ_WAIT_INST_LDS`, `SQ_BUSY_CYCLES` | Diagnose memory stalls |
+| `basic` | `FetchSize` / `WriteSize` | Minimal baseline |
+
+You can also pass a path to a `rocprofv3`-native YAML if you need a counter combination that is
+not in the preset list.
+
+## Method 3: Raw `rocprofv3` invocation
+
+If you need full control over the counter set, bypass the Python wrapper and use `rocprofv3`
+directly. This also works against any standalone Python script.
+
+```bash
+# Timeline + per-kernel stats
+rocprofv3 --stats --kernel-trace \
+    --output-format csv \
+    --output-directory rpf-out \
+    -- python my_bench.py
+
+# Hardware counters (supply your own pmc / counter-input file)
+cat > my_counters.txt <<'EOF'
+pmc: SQ_WAVES SQ_BUSY_CYCLES SQ_WAIT_INST_VMEM
+EOF
+rocprofv3 -i my_counters.txt \
+    --output-format csv \
+    --output-directory rpf-counters \
+    -- python my_bench.py
+```
+
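+Regex-based kernel selection is available via `--kernel-include-regex` and
+`--kernel-exclude-regex` in recent `rocprofv3` versions. Note that `--kernel-rename` does not
+filter: it relabels kernels with the names of enclosing ROCTx ranges.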
+
+## Reference checking
+
+When comparing the HIP-default and `backend="aiter"` paths (or any two backends), always verify
+numerical parity before trusting perf numbers:
+
+```python
+ref = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)                   # HIP
+got = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True, backend="aiter")  # AITER
+
+torch.testing.assert_close(got.float(), ref.float(), rtol=1e-2, atol=1e-2)
+```
+
+Loose BF16 tolerances are expected; tighten for FP32-only ops.
+
+## Troubleshooting
+
+### Inconsistent results (large std)
+
+1. Raise `dry_run_iters` to 10–20 so the kernel cache and clocks settle.
+2. Raise `num_iters` to 50+ for sub-100-µs kernels.
+3. Pin the GPU clock:
+
+   ```bash
+   # Query supported clocks
+   rocm-smi --showclocks
+   # Lock SCLK / MCLK to a fixed level (requires sudo; the setting is lost on reboot)
+   sudo rocm-smi --setsclk 7
+   sudo rocm-smi --setmclk 3
+   ```
+
+4. Restore the default power profile between runs with `sudo rocm-smi --resetprofile`, so a
+   leftover override does not skew the next measurement.
+
+### Kernel name does not match in `rocm_profiler`
+
+The `kernel_name_regex` you pass to `RocmProfiler` must match the kernel name as `rocprofv3`
+reports it (demangled by default). If no rows appear in `<label>_counter_collection.csv`:
+
+```bash
+# 1. Dry-run to see what kernels are launched
+rocprofv3 --stats --kernel-trace --output-format csv \
+    --output-directory rpf-dbg -- python my_bench.py
+
+# 2. Inspect rpf-dbg/*_kernel_stats.csv and copy the name prefix into your driver.
+```
+
+### AITER backend errors
+
+If `backend="aiter"` raises before any timing runs, it is usually one of:
+
+- `page_size` not in `{1, 16, 1024}` (batch prefill + CK FMHA path).
+- `kv_layout != "NHD"`.
+- `amd-aiter` not installed.
+
+Fix the call or drop back to the default HIP backend for that config.
+
+### `rocm_profiler` hangs or produces empty CSV
+
+- Check that `rocprofv3` is on `PATH` and executable: `which rocprofv3`.
+- Make sure the driver script prints something from the `if __name__ == "__main__":` block —
+  the wrapper uses script output as a heartbeat.
+- Run with `--timing-only` first to confirm the kernel path itself works before involving
+  `rocprofv3`.
+
+## Best practices
+
+1. **Record the arch and ROCm version** alongside every perf number:
+
+   ```python
+   import torch
+   props = torch.cuda.get_device_properties(0)
+   print(props.name, props.gcnArchName, torch.version.hip)
+   ```
+
+   A `seq=1024` FA2 number on MI300X (`gfx942`, ROCm 7.2) is not comparable to one on MI350X
+   (`gfx950`, ROCm 7.0.2).
+
+2. **Always warm up.** First-call JIT compile will dominate the first measurement otherwise.
+   Use `dry_run_iters >= 5` and explicitly call the kernel once before timing in scripts that
+   measure the first iteration separately.
+
+3. **Verify correctness before performance.** A kernel that silently writes junk is always
+   faster than one that works.
+
+4. **Compare against the AITER backend where it exists.** For single / batch prefill on ROCm,
+   AITER's CK FMHA is often the strongest baseline to beat.
+
+5. **Prefer the `roofline` counter preset to start.** It shows whether further optimization
+   should target arithmetic intensity (MFMA ops) or HBM bandwidth (TCC DRAM requests).
+
+## Related documentation
+
+- [`CLAUDE.md`](../../../CLAUDE.md) — project overview and JIT architecture.
+- [`.claude/skills/add-rocm-kernel/SKILL.md`](../add-rocm-kernel/SKILL.md) — author a new kernel
+  to benchmark.
+- [`.claude/skills/debug-rocm-crash/SKILL.md`](../debug-rocm-crash/SKILL.md) — when a kernel
+  crashes during timing.
+- [`benchmarks/rocm_benchmarks/bench_fa2_prefill.py`](../../../benchmarks/rocm_benchmarks/bench_fa2_prefill.py)
+  — a real, working driver script to copy from.
+- [`rocm_profiler/rocm_profiler.py`](../../../rocm_profiler/rocm_profiler.py) — full API docs in
+  the module header.
+- `rocprofv3` docs: <https://rocm.docs.amd.com/projects/rocprofiler-sdk/>.
diff --git a/.claude/skills/debug-rocm-crash/SKILL.md b/.claude/skills/debug-rocm-crash/SKILL.md
new file mode 100644
index 0000000000..d30cb896c3
--- /dev/null
+++ b/.claude/skills/debug-rocm-crash/SKILL.md
@@ -0,0 +1,672 @@
+---
+name: debug-rocm-crash
+description: Tutorial for debugging HIP/ROCm kernel crashes in FlashInfer+ROCm using API logging plus HIP/ROCm runtime tooling
+---
+
+# Tutorial: Debugging ROCm Crashes in FlashInfer+ROCm
+
+This guide shows how to debug HIP/ROCm kernel crashes and errors in the `amd-flashinfer` fork
+(CDNA3 `gfx942`, CDNA4 `gfx950`) using the `@flashinfer_api` logging decorator combined with
+ROCm's own debugging tools.
+
+If you are used to upstream's `debug-cuda-crash` skill, the Python logging half is identical —
+`@flashinfer_api`, `FLASHINFER_LOGLEVEL`, `FLASHINFER_LOGDEST` all work unchanged on HIP. The
+CUDA-tooling half (`compute-sanitizer`, `cuda-gdb`, `CUDA_LAUNCH_BLOCKING`) is rewritten below
+using the ROCm equivalents.
+
+## Goal
+
+When your code crashes on an AMD Instinct GPU with errors like:
+
+- `HIP error: the operation cannot be performed in the present state`
+- `hipErrorIllegalAddress`
+- `Memory access fault by GPU node-N (Agent handle: ...) on address 0x...`
+- `hipErrorOutOfMemory`
+- `RuntimeError: CUDA error: an illegal memory access was encountered` (PyTorch's ROCm build
+  reports HIP errors under CUDA names)
+
+… you want to:
+
+- Capture input tensors BEFORE the crash (so the crash itself doesn't take the evidence with it).
+- Pinpoint exactly which kernel launch faulted.
+- Understand whether the bug is a shape mismatch, a bad page table / KV config, an AITER
+  limitation, or a genuine kernel bug.
+
+## Why use API logging?
+
+**Problem:** HIP faults frequently terminate the process with little more than a hex address,
+leaving no Python-level context.
+
+**Solution:** `@flashinfer_api` logs inputs (shape, dtype, device, strides, optionally min/max/mean
+and NaN/Inf counts) BEFORE the kernel runs. If the kernel crashes, the last log entry shows you
+exactly what data it received.
+
+## Step 1: Enable API logging
+
+### Basic (function names only)
+
+```bash
+export FLASHINFER_LOGLEVEL=1
+export FLASHINFER_LOGDEST=stdout
+
+python my_script.py
+```
+
+Output:
+
+```text
+[2026-04-21 10:30:45] FlashInfer API Call: single_prefill_with_kv_cache
+```
+
+### Detailed (inputs / outputs + metadata)
+
+```bash
+export FLASHINFER_LOGLEVEL=3
+export FLASHINFER_LOGDEST=debug.log
+
+python my_script.py
+```
+
+Example output in `debug.log`:
+
+```text
+================================================================================
+[2026-04-21 10:30:45] FlashInfer API Logging — System Information
+================================================================================
+FlashInfer version: 0.5.3+amd.1
+HIP / ROCm version: 7.1.1
+GPU 0: AMD Instinct MI300X
+  gcnArchName: gfx942:sramecc+:xnack-
+PyTorch version: 2.9.1+rocm7.1
+================================================================================
+
+[2026-04-21 10:30:46] FlashInfer API Call: batch_decode_with_paged_kv_cache
+--------------------------------------------------------------------------------
+Positional input arguments:
+  arg[0]:
+    Tensor(
+      shape=(32, 8, 128)
+      dtype=torch.bfloat16
+      device=cuda:0
+      requires_grad=False
+      is_contiguous=True
+    )
+Keyword input arguments:
+  paged_kv_cache=
+    Tensor(
+      shape=(1024, 2, 8, 128)
+      dtype=torch.bfloat16
+      device=cuda:0
+      ...
+    )
+```
+
+Even though the device string shows `cuda:0`, the underlying device is an AMD GPU — this is
+expected because PyTorch's ROCm build reuses the `cuda` namespace. The `gcnArchName` line above
+is the unambiguous ROCm marker.
+
+### Full (with tensor statistics)
+
+```bash
+export FLASHINFER_LOGLEVEL=5
+export FLASHINFER_LOGDEST=debug.log
+
+python my_script.py
+```
+
+Adds:
+
+```text
+  Tensor(
+    shape=(32, 8, 128)
+    dtype=torch.bfloat16
+    device=cuda:0
+    requires_grad=False
+    is_contiguous=True
+    min=-3.125000
+    max=4.250000
+    mean=0.015625
+    nan_count=0
+    inf_count=0
+  )
+```
+
+Use level 5 when diagnosing numerical issues (NaN/Inf propagation). Note that HIP-graph capture
+paths auto-skip statistics; that is intentional and shows up as
+`[statistics skipped: HIP graph capture in progress]`.
+
+## Step 2: Force deterministic kernel launches before debugging
+
+HIP async launches make Python tracebacks point at the wrong line. Set these env vars **before**
+running your script:
+
+```bash
+export HIP_LAUNCH_BLOCKING=1
+export AMD_SERIALIZE_KERNEL=3
+```
+
+- `HIP_LAUNCH_BLOCKING=1`: make every kernel launch block the host until the kernel finishes.
+- `AMD_SERIALIZE_KERNEL=3`: also serialize kernel launches through the queue (wait for
+  completion both before and after each enqueue). This is the single most useful knob for
+  `Memory access fault by GPU node-N` errors, because it pins the fault to the *actual*
+  faulting kernel rather than whichever subsequent launch happened to finish first.
+
+Both serialize execution and slow down every launch, so enable them in `pytest` runs while
+iterating on a new kernel, and unset them before collecting any performance numbers.
+
+## Step 3: Common ROCm errors and how to debug them
+
+### Error 1: Illegal memory access / GPU memory fault
+
+**Error messages** (any of these indicate the same class of bug):
+
+```text
+RuntimeError: CUDA error: an illegal memory access was encountered
+HIP error: hipErrorIllegalAddress
+Memory access fault by GPU node-1 (Agent handle: 0x...) on address 0x7f...
+VM_CONTEXT1_PROTECTION_FAULT_STATUS ... NO_RETRY: 0x0
+```
+
+**Recipe:**
+
+```bash
+export FLASHINFER_LOGLEVEL=3
+export FLASHINFER_LOGDEST=crash.log
+export AMD_SERIALIZE_KERNEL=3
+export HIP_LAUNCH_BLOCKING=1
+python my_script.py
+```
+
+In `crash.log`, find the **last** `FlashInfer API Call:` entry — that is the kernel that took
+the process down. Check:
+
+- Tensor **shapes** match what the kernel expects (head_dim, num_heads).
+- All tensors are on the same device (both `cuda:0`, not mixed `cuda:0` + `cpu`).
+- `is_contiguous=True` where required; non-contiguous strides are a classic cause of
+  out-of-bounds reads.
+- For paged-KV wrappers: `kv_indices` values are within `[0, num_pages)`, and `kv_indptr` is
+  non-decreasing with `kv_indptr[-1] == len(kv_indices)`.
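+
+The page-table check in the last bullet is easy to automate before the call. The helper below
+is a minimal sketch, not part of FlashInfer: `check_page_table` is a hypothetical name, and it
+assumes the CSR-style layout used by upstream FlashInfer's paged-KV wrappers (`kv_indptr` holds
+offsets into `kv_indices`). It takes plain Python lists so it can run without a GPU.
+
+```python
+def check_page_table(kv_indptr, kv_indices, num_pages):
+    """Return a list of human-readable problems; an empty list means no issue was found."""
+    problems = []
+    if not kv_indptr or kv_indptr[0] != 0:
+        problems.append("kv_indptr must start at 0")
+    # Each request owns the slice kv_indices[kv_indptr[i]:kv_indptr[i + 1]].
+    if any(b < a for a, b in zip(kv_indptr, kv_indptr[1:])):
+        problems.append("kv_indptr must be non-decreasing")
+    if kv_indptr and kv_indptr[-1] != len(kv_indices):
+        problems.append("kv_indptr[-1] must equal len(kv_indices)")
+    # Every page id must address a page that exists in the cache.
+    bad = [i for i in kv_indices if not 0 <= i < num_pages]
+    if bad:
+        problems.append(f"kv_indices out of range [0, {num_pages}): {bad[:8]}")
+    return problems
+```
+
+With tensors, one way to call it is
+`check_page_table(kv_indptr.tolist(), kv_indices.tolist(), num_pages)` right before planning.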
+
+**Common root causes in this fork:**
+
+- Mismatched `head_dim_qk` / `head_dim_vo` between `q` and the KV cache.
+- CPU tensor accidentally passed to a GPU API.
+- Non-contiguous `q`/`k`/`v` from a `.transpose()`, `.permute()`, or slicing chain; add a
+  `.contiguous()`.
+- Out-of-range `kv_indices` — often off-by-one when building page tables by hand.
+- **AITER-specific:** see dedicated section below.
+
+### Error 2: AITER backend crash (`backend="aiter"`)
+
+When using `backend="aiter"` on single or batch prefill, watch for two very specific gotchas
+(both documented in [`README.md`](../../../README.md) and enforced by the code in
+[`flashinfer/prefill_rocm.py`](../../../flashinfer/prefill_rocm.py)):
+
+| Symptom | Likely cause | Fix |
+| --- | --- | --- |
+| `ValueError` raised *before* any kernel launch | `page_size` is not in `{1, 16, 1024}` (batch prefill + CK FMHA) | Re-plan with one of the supported page sizes, or drop `backend="aiter"` for that call. |
+| `ValueError` about KV layout | `kv_layout != "NHD"` | Switch to `NHD` or use the default HIP backend. |
+| Hard GPU fault mid-kernel, no Python exception | `amd-aiter` version mismatch vs. the ROCm build | Reinstall `amd-aiter` matching your ROCm version (`--extra-index-url https://pypi.amd.com/rocm-<version>/simple`). |
+| `ModuleNotFoundError: aiter` | `amd-aiter` not installed | `pip install amd-aiter --index-url https://pypi.amd.com/simple/`. |
+
+If API logging shows a correct-looking call to a prefill API but the process dies with a GPU
+fault and no Python traceback, **disable the AITER backend** as a first step to see whether the
+bug is in AITER or in our side of the port.
+
+### Error 3: NaN / Inf values
+
+```text
+RuntimeError: ... returned nan or inf
+```
+
+```bash
+export FLASHINFER_LOGLEVEL=5
+export FLASHINFER_LOGDEST=nan.log
+python my_script.py
+```
+
+Check `nan_count` / `inf_count` in the log. On CDNA3/4 the most common sources are:
+
+- FP8 path overflow: the `_fnuz` variants used on AMD
+  (`__hip_fp8_e4m3_fnuz`, `__hip_fp8_e5m2_fnuz`) have a narrower representable range than
+  NVIDIA's `__nv_fp8_e4m3` (max finite value 240 for E4M3 FNUZ versus 448 for OCP E4M3). A
+  scale factor calibrated against an NVIDIA reference can therefore overflow on ROCm.
+- A previous op producing `-inf` / `inf` that is then fed into `exp` (online softmax).
+- Uninitialized memory — `torch.empty(...)` vs `torch.zeros(...)`.
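+
+The FP8 range gap in the first bullet follows directly from the two bit layouts. The sketch
+below derives the largest finite E4M3 value for each encoding in plain Python; `fp8_e4m3_max`
+is a hypothetical helper written for this illustration, not a FlashInfer or PyTorch API.
+
+```python
+def fp8_e4m3_max(fnuz: bool) -> float:
+    """Largest finite value of an 8-bit float with 4 exponent and 3 mantissa bits."""
+    if fnuz:
+        # E4M3 FNUZ: exponent bias 8; NaN is the single code 0x80, so the top
+        # exponent/mantissa combination is still a finite number.
+        bias, exp_field, mantissa_field = 8, 0b1111, 0b111
+    else:
+        # OCP E4M3 (e4m3fn): exponent bias 7; exponent 1111 with mantissa 111 encodes
+        # NaN, so the largest finite value uses mantissa 110.
+        bias, exp_field, mantissa_field = 7, 0b1111, 0b110
+    return 2.0 ** (exp_field - bias) * (1 + mantissa_field / 8)
+
+
+print(fp8_e4m3_max(fnuz=True), fp8_e4m3_max(fnuz=False))
+```
+
+Any scaled activation between the two limits fits the OCP encoding but not FNUZ, which is
+why a scale calibrated for the former needs to be re-derived for the latter.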
+
+### Error 4: Out of memory
+
+```text
+RuntimeError: HIP out of memory.
+```
+
+```bash
+rocm-smi --showmeminfo vram --showpids
+export FLASHINFER_LOGLEVEL=3
+python my_script.py
+```
+
+Look for unexpectedly large tensor shapes in the last log entry. If the process keeps getting
+OOM-killed on healthy-looking shapes, check:
+
+- Zombie processes holding VRAM: `rocm-smi --showpids` and `kill -9` them.
+- Host-RAM spike while compiling (JIT or AOT builds): set `MAX_JOBS=1` to cap concurrent ninja
+  jobs.
+- Another tenant on the same GPU — pin to a single GPU with `HIP_VISIBLE_DEVICES=N`.
+
+### Error 5: Wrong dtype
+
+```text
+RuntimeError: expected scalar type BFloat16 but found Half
+```
+
+```bash
+export FLASHINFER_LOGLEVEL=3
+python my_script.py
+```
+
+In the log, look for the mismatching `dtype=` field. On ROCm, confirm:
+
+- If the op supports FP8 on your arch: `gfx942`/`gfx950` use the `_fnuz` FP8 variants — a
+  callsite that expects `torch.float8_e4m3fn` (NVIDIA's OCP FP8) will mis-dispatch. The
+  PyTorch dtype used on ROCm for `__hip_fp8_e4m3_fnuz` is `torch.float8_e4m3fnuz`.
+
+## Step 4: Multi-GPU / multi-process debugging
+
+For multi-rank runs use the `%i` pattern in the log destination:
+
+```bash
+export FLASHINFER_LOGLEVEL=3
+export FLASHINFER_LOGDEST=debug_rank_%i.log
+export HIP_VISIBLE_DEVICES=0,1,2,3      # restrict to specific GPUs
+# (or ROCR_VISIBLE_DEVICES — same effect, but applied earlier in the stack)
+
+torchrun --nproc_per_node=4 my_script.py
+```
+
+This produces `debug_rank_<pid>.log` per process. Use `HIP_VISIBLE_DEVICES` instead of
+`CUDA_VISIBLE_DEVICES` when you need to isolate a specific AMD device.
+
+If a specific GPU is misbehaving (ECC errors, firmware stuck), check it with
+`rocm-smi --showreset --showuniqueid --showproductname` and open `dmesg -wH` in another
+terminal.
+
+## Step 5: Advanced debugging with ROCm tools
+
+### `rocgdb` (CUDA-GDB equivalent)
+
+```bash
+export FLASHINFER_LOGLEVEL=3
+export FLASHINFER_LOGDEST=debug.log
+export AMD_SERIALIZE_KERNEL=3
+export HIP_LAUNCH_BLOCKING=1
+
+rocgdb --args python my_script.py
+```
+
+Inside `rocgdb`:
+
+```text
+(rocgdb) catch throw
+(rocgdb) run
+(rocgdb) bt            # stack trace at the crash point
+(rocgdb) info agents   # list GPUs
+(rocgdb) info threads  # host threads plus GPU waves (each wave is listed as a thread)
+```
+
+For attaching to a running process (e.g. a hang), set before you launch your script:
+
+```bash
+export ROCM_DEBUG_WAIT_FOR_DEBUGGER=1
+```
+
+Then attach with `rocgdb -p <pid>`. Until a debugger attaches, the process waits at the first
+GPU API call.
+
+### HIP / HSA runtime tracing
+
+```bash
+export AMD_LOG_LEVEL=3         # HIP API + stream trace
+# export AMD_LOG_LEVEL=4       # very verbose, includes arg decoding
+export HSA_ENABLE_DEBUG=1      # one layer below HIP (runtime queues, agents)
+python my_script.py 2> hip.trace
+```
+
+Grep for `hipLaunchKernel`, `hipMemcpy`, and `error` in `hip.trace`. The trace is linear with
+`HIP_LAUNCH_BLOCKING=1`, which makes it possible to correlate each FlashInfer API call with the
+exact underlying HIP launches.
+
+### Device state snapshots with `rocm-smi`
+
+Leave this running in another terminal while reproducing a hang:
+
+```bash
+watch -n 1 'rocm-smi --showuse --showmeminfo vram --showpids --showprofile'
+```
+
+Watch for:
+
+- GPU stuck at 100% but no `SQ` activity — kernel is looping.
+- VRAM pinned high after your process exits — another process is still holding it.
+- Throttling indicators (`POWERCAP`, `THERMAL`) — reproduce on a cooler box before filing a
+  kernel bug.
+
+### `dmesg` for firmware-level faults
+
+```bash
+sudo dmesg -T | grep -i -E "amdgpu|kfd|vm_fault" | tail -100
+```
+
+`VM_CONTEXT1_PROTECTION_FAULT_STATUS` entries here tell you page-fault class, access type, and
+the offending address — useful when the Python log only says `hipErrorIllegalAddress`.
+
+## Step 6: Kernel-level debugging with `printf`
+
+`printf()` works inside HIP device code exactly the way it does on CUDA:
+
+```cpp
+__global__ void MyKernel(const float* __restrict__ input,
+                         float* __restrict__ output, int n) {
+  int idx = blockIdx.x * blockDim.x + threadIdx.x;
+
+  // Print from a single thread of the whole grid to avoid flooding the output
+  if (threadIdx.x == 0 && blockIdx.x == 0) {
+    printf("n=%d, input[0]=%f\n", n, input[0]);
+  }
+
+  if (idx < n) {
+    output[idx] = input[idx] * 2.0f;
+  }
+}
+```
+
+Flush after the launch from Python:
+
+```python
+my_kernel(input, output)
+torch.cuda.synchronize()  # Flushes device printf buffer on ROCm too
+```
+
+### Warp / wavefront considerations
+
+The wavefront size on CDNA3 and CDNA4 is **64** (not 32 as on NVIDIA). Adjust any
+representative-thread logic accordingly:
+
+```cpp
+// CDNA3/4: wavefront size = 64
+if (threadIdx.x % 64 == 0) {
+  printf("Wavefront %d processing\n", threadIdx.x / 64);
+}
+```
+
+Common mistake ported blindly from a CUDA example:
+
+```cpp
+// ❌ Assumes warp size 32; fires twice per 64-lane CDNA wavefront (lanes 0 and 32)
+if (threadIdx.x % 32 == 0) {
+  printf("...");
+}
+```
+
+Use the built-in `warpSize` variable instead of a hard-coded constant when writing portable code.
+
+### Device asserts
+
+```cpp
+assert(value >= 0.0f && "Value must be non-negative");
+```
+
+Device asserts are compiled out when `NDEBUG` is defined. To see which flags the JIT actually
+passes to `hipcc`, print the build commands:
+
+```bash
+export FLASHINFER_JIT_VERBOSE=1
+```
+
+(Unlike upstream there is no `FLASHINFER_JIT_DEBUG=1` `-O0 -g -G` mode on the HIP side yet;
+`-O0` is not wired into `hipcc` invocations. Add `-g` via `extra_cuda_cflags` temporarily in
+the JIT generator while debugging.)
+
+## Environment Variables Reference
+
+### FlashInfer logging
+
+| Variable | Values | Description |
+| --- | --- | --- |
+| `FLASHINFER_LOGLEVEL` | `0` | No logging (default). Zero overhead. |
+| | `1` | Function names only. |
+| | `3` | Inputs/outputs with shape/dtype/device/strides. |
+| | `5` | + min/max/mean/nan/inf statistics. |
+| `FLASHINFER_LOGDEST` | `stdout` | Console (default). |
+| | `stderr` | Stderr. |
+| | `<path>` | File. |
+| | `log_%i.txt` | Multi-process; `%i` expands to PID. |
+| `FLASHINFER_JIT_VERBOSE` | `1` | Print every `hipcc` invocation and build command. |
+
+### HIP / ROCm runtime
+
+| Variable | Effect |
+| --- | --- |
+| `HIP_LAUNCH_BLOCKING=1` | Force synchronous launches (stack traces pin the faulting kernel). |
+| `AMD_SERIALIZE_KERNEL=3` | Serialize kernel launches through the queue. |
+| `AMD_LOG_LEVEL=3` (or `4`) | HIP API trace. |
+| `HSA_ENABLE_DEBUG=1` | HSA runtime trace. |
+| `HIP_VISIBLE_DEVICES=0,1` | Restrict visible GPUs (preferred on ROCm). |
+| `ROCR_VISIBLE_DEVICES=0,1` | Same as above, applied one layer deeper. |
+| `ROCM_DEBUG_WAIT_FOR_DEBUGGER=1` | Block until `rocgdb` attaches. |
+
+## Best practices
+
+### 1. Always start with `FLASHINFER_LOGLEVEL=3`
+
+```bash
+export FLASHINFER_LOGLEVEL=3
+```
+
+Gives you tensor metadata without overwhelming output.
+
+### 2. Combine with `AMD_SERIALIZE_KERNEL=3` on first reproduction
+
+```bash
+export FLASHINFER_LOGLEVEL=3
+export AMD_SERIALIZE_KERNEL=3
+export HIP_LAUNCH_BLOCKING=1
+```
+
+This is the single most useful env combination for debugging an unknown HIP fault.
+
+### 3. Log to a file for crashes
+
+```bash
+export FLASHINFER_LOGDEST=crash.log
+```
+
+Console output can be lost when the process aborts on a GPU fault.
+
+### 4. Compare before / after on the last API call
+
+- Last successful `FlashInfer API Call:` with **both** inputs and outputs logged — OK.
+- Last `FlashInfer API Call:` with inputs logged but **no outputs** — that's your crash site.
+
+### 5. Disable logging in production
+
+```bash
+unset FLASHINFER_LOGLEVEL     # or export FLASHINFER_LOGLEVEL=0
+```
+
+The `@flashinfer_api` decorator short-circuits to a zero-overhead path when disabled.
+
+## Troubleshooting the debugger itself
+
+### No logs appear
+
+- Verify the API you're calling actually has `@flashinfer_api` on it — decoration coverage is
+  a work in progress; a handful of low-level APIs may not be wrapped yet.
+- Check the env vars are exported in the right shell:
+
+  ```bash
+  echo $FLASHINFER_LOGLEVEL  # expect "3"
+  echo $FLASHINFER_LOGDEST   # expect path or "stdout"
+  ```
+
+### Statistics skipped at level 5
+
+```text
+[statistics skipped: HIP graph capture in progress]
+```
+
+Expected: min/max/mean/nan/inf would require synchronization that is illegal during graph
+capture. Temporarily drop to `FLASHINFER_LOGLEVEL=3` if you need inputs from inside a captured
+graph.
+
+### `rocgdb` exits immediately with `no symbol table loaded`
+
+HIP binaries installed via `pip install` are often stripped. Reinstall with
+`-DCMAKE_BUILD_TYPE=RelWithDebInfo` or add `"-g"` to `extra_cuda_cflags` in the JIT generator
+for the op you are debugging, clear `~/.cache/flashinfer/`, and retry.
+
+## Quick examples
+
+### Debug shape mismatch
+
+```bash
+export FLASHINFER_LOGLEVEL=3
+export FLASHINFER_LOGDEST=stdout
+python my_script.py
+# Read tensor shapes in stdout
+```
+
+### Debug NaN / Inf
+
+```bash
+export FLASHINFER_LOGLEVEL=5
+export FLASHINFER_LOGDEST=nan.log
+python my_script.py
+# Grep "nan_count=" / "inf_count=" in nan.log
+```
+
+### Debug a hard GPU fault
+
+```bash
+export FLASHINFER_LOGLEVEL=3
+export FLASHINFER_LOGDEST=gpu_fault.log
+export AMD_SERIALIZE_KERNEL=3
+export HIP_LAUNCH_BLOCKING=1
+python my_script.py
+# Last entry in gpu_fault.log is the faulting call.
+# Also check `sudo dmesg -T | tail -50` for VM_CONTEXT1_PROTECTION_FAULT_STATUS.
+```
+
+### Debug multi-GPU
+
+```bash
+export FLASHINFER_LOGLEVEL=3
+export FLASHINFER_LOGDEST=rank_%i.log
+export HIP_VISIBLE_DEVICES=0,1,2,3
+torchrun --nproc_per_node=4 train.py
+# Inspect rank_*.log files per process.
+```
+
+### Full `rocgdb` session
+
+```bash
+export FLASHINFER_LOGLEVEL=3
+export FLASHINFER_LOGDEST=debug.log
+export AMD_SERIALIZE_KERNEL=3
+rocgdb --args python my_script.py
+# (rocgdb) catch throw
+# (rocgdb) run
+# (rocgdb) bt
+```
+
+## Example: full debug session
+
+### Your code crashes
+
+```python
+import torch
+import flashinfer
+
+q  = torch.randn(8, 128, dtype=torch.bfloat16, device="cuda")           # [num_qo_heads, head_dim]
+kv = torch.randn(1024, 2, 8, 64, dtype=torch.bfloat16, device="cuda")   # wrong head_dim!
+
+out = flashinfer.single_decode_with_kv_cache(q, kv[:, 0], kv[:, 1])     # crashes
+```
+
+Output:
+
+```text
+Memory access fault by GPU node-1 (Agent handle: 0x...) on address 0x7f9d...
+```
+
+### Enable logging + deterministic launches
+
+```bash
+export FLASHINFER_LOGLEVEL=3
+export FLASHINFER_LOGDEST=debug.log
+export AMD_SERIALIZE_KERNEL=3
+export HIP_LAUNCH_BLOCKING=1
+python test.py
+```
+
+### Read `debug.log`
+
+```text
+[...] FlashInfer API Call: single_decode_with_kv_cache
+Positional input arguments:
+  arg[0]:
+    Tensor(shape=(8, 128), dtype=torch.bfloat16, device=cuda:0, ...)
+  arg[1]:
+    Tensor(shape=(1024, 8, 64), dtype=torch.bfloat16, device=cuda:0, ...)   # ← head_dim=64, not 128
+  arg[2]:
+    Tensor(shape=(1024, 8, 64), dtype=torch.bfloat16, device=cuda:0, ...)   # ← also wrong
+```
+
+### Fix
+
+```python
+kv = torch.randn(1024, 2, 8, 128, dtype=torch.bfloat16, device="cuda")  # fixed
+```
+
+### Success
+
+```bash
+python test.py
+# No crash; debug.log shows both the call and the output tensor.
+```
+
+## Summary
+
+1. Before anything else:
+
+   ```bash
+   export FLASHINFER_LOGLEVEL=3
+   export FLASHINFER_LOGDEST=debug.log
+   export AMD_SERIALIZE_KERNEL=3
+   export HIP_LAUNCH_BLOCKING=1
+   ```
+
+2. Reproduce the crash. Inputs are logged BEFORE each kernel runs, so the last entry tells you
+   which call faulted.
+
+3. If the shape/dtype/device picture in the log looks correct, escalate to
+   `AMD_LOG_LEVEL=3`, then to `rocgdb`, then to `dmesg` for VM-level faults.
+
+4. For AITER crashes, check the layout/page-size invariants first — they cover a large fraction
+   of "illegal address" reports in practice.
+
+5. Disable logging when done:
+
+   ```bash
+   export FLASHINFER_LOGLEVEL=0
+   ```
+
+## Related documentation
+
+- [`CLAUDE.md`](../../../CLAUDE.md) — project overview; see the "Debugging" and "API Logging"
+  sections for background.
+- [`.claude/skills/add-rocm-kernel/SKILL.md`](../add-rocm-kernel/SKILL.md) — when you are
+  debugging a kernel you just wrote.
+- [`.claude/skills/benchmark-kernel/SKILL.md`](../benchmark-kernel/SKILL.md) — when the crash
+  only happens under profiling.
+- ROCm debugging documentation: <https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/debugging.html>
+- `rocgdb` user guide: <https://rocm.docs.amd.com/projects/llvm-project/en/latest/reference/rocgdb.html>
+- Upstream's [`debug-cuda-crash` skill](https://github.com/flashinfer-ai/flashinfer/blob/main/.claude/skills/debug-cuda-crash/SKILL.md) —
+  the source this tutorial was adapted from; useful when cross-referencing a bug that reproduces
+  on both backends.

From 788b0e2c0ab962b87faa868b02fededbd3d104ac Mon Sep 17 00:00:00 2001
From: Debasis Mandal <debasis.mandal@amd.com>
Date: Wed, 20 May 2026 15:38:56 +0000
Subject: [PATCH 2/5] docs: trim Claude Code skills and fix factual errors
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Restructure the three .claude/skills/*/SKILL.md files (1575 → 239 lines,
~85% reduction) to mirror the slim CLAUDE.md philosophy: keep only what
is hard to derive from code or remember between sessions. Drop the full
walkthrough examples — the real files in flashinfer/csrc_rocm/ and
flashinfer/jit/ are better references than a Markdown copy.

Also fix factual errors discovered while fact-checking the originals
against the current codebase:

- debug-rocm-crash: remove the entire `@flashinfer_api` /
  FLASHINFER_LOGLEVEL / FLASHINFER_LOGDEST premise. Grep returns zero
  matches in the codebase — that decorator and those env vars do not
  exist in this fork. Replace with the actual debug workflow
  (AMD_SERIALIZE_KERNEL=3 + HIP_LAUNCH_BLOCKING=1 + manual print +
  torch.cuda.synchronize, rocgdb, dmesg).

- benchmark-kernel: AITER's "native" page sizes are {128, 256, 1024} for
  amd-aiter ≥ 0.1.10 (else {16, 1024}), not {1, 16, 1024}. Non-native
  page sizes fall through a flat-gather path; they are not rejected.
  CUPTI is not silently ignored — enable_cupti=True routes straight to
  bench_gpu_time_with_cupti with no HIP guard and will fail; leave it
  False on ROCm.

- add-rocm-kernel: DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16_FP32 does not
  exist. Only _FP16, _FP8, and the unsuffixed variant are defined in
  pytorch_extension_utils.h.

- CLAUDE.md: drop the misleading "Debug build (-O0) FLASHINFER_JIT_DEBUG"
  row — that env var is read only on the IS_CUDA branch of
  flashinfer/jit/core.py. Add a gotcha explaining the HIP workaround
  (add -g via extra_cuda_cflags in the JIT generator).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 .claude/skills/add-rocm-kernel/SKILL.md  | 583 +++----------------
 .claude/skills/benchmark-kernel/SKILL.md | 384 ++-----------
 .claude/skills/debug-rocm-crash/SKILL.md | 687 ++---------------------
 CLAUDE.md                                |  17 +-
 4 files changed, 166 insertions(+), 1505 deletions(-)

diff --git a/.claude/skills/add-rocm-kernel/SKILL.md b/.claude/skills/add-rocm-kernel/SKILL.md
index b9a14f8975..8623bc3a09 100644
--- a/.claude/skills/add-rocm-kernel/SKILL.md
+++ b/.claude/skills/add-rocm-kernel/SKILL.md
@@ -1,531 +1,78 @@
 ---
 name: add-rocm-kernel
-description: Step-by-step tutorial for adding new HIP/ROCm kernels to FlashInfer+ROCm (amd-flashinfer)
+description: Step-by-step tutorial for adding new HIP kernels to FlashInfer+ROCm (amd-flashinfer)
 ---
 
-# Tutorial: Adding a New Kernel to FlashInfer+ROCm
+# Adding a New Kernel to FlashInfer+ROCm
 
-This tutorial walks through adding a simple element-wise scale operation to the ROCm port of
-FlashInfer. We implement `scale(x, factor) = x * factor` to illustrate the complete workflow on
-CDNA3 (`gfx942`) and CDNA4 (`gfx950`).
+For a complete worked example to copy, read these together:
+[`norm.cu`](../../../flashinfer/csrc_rocm/norm.cu) +
+[`flashinfer_norm_binding.cu`](../../../flashinfer/csrc_rocm/flashinfer_norm_binding.cu) +
+[`jit/norm.py`](../../../flashinfer/jit/norm.py) +
+[`norm.py`](../../../flashinfer/norm.py). For plan-run / multi-backend / FP8 see
+[`batch_prefill.cu`](../../../flashinfer/csrc_rocm/batch_prefill.cu) +
+[`prefill_rocm.py`](../../../flashinfer/prefill_rocm.py).
 
-If you are used to upstream's `add-cuda-kernel` tutorial, note the following ROCm-specific
-differences up front:
+## File touchpoints (every new op needs each row, in order)
 
-| Concern | Upstream CUDA | This ROCm port |
+| Step | File | Purpose |
 | --- | --- | --- |
-| Launcher directory | `csrc/` | [`flashinfer/csrc_rocm/`](../../../flashinfer/csrc_rocm/) |
-| Bindings | TVM-FFI (`TVM_FFI_DLL_EXPORT_TYPED_FUNC`) | Plain Torch extension (`TORCH_LIBRARY_FRAGMENT`) |
-| Tensor type | `tvm::ffi::TensorView` | `at::Tensor` |
-| Stream | `get_stream(device)` | `at::hip::getCurrentHIPStream()` |
-| Compiler | `nvcc` | `hipcc` (amdclang++) |
-| Arch env var | `FLASHINFER_CUDA_ARCH_LIST` | `FLASHINFER_ROCM_ARCH_LIST` |
-| AOT registration | `flashinfer/aot.py` | [`flashinfer/aot_hip.py`](../../../flashinfer/aot_hip.py) |
-| Tests directory | `tests/` | [`tests/rocm_tests/`](../../../tests/rocm_tests/) |
+| 1 | `include/flashinfer/<op>.cuh` | Framework-agnostic kernel + launcher template. **No `<torch/...>` includes here.** |
+| 2 | `flashinfer/csrc_rocm/<op>.cu` | PyTorch launcher: `at::Tensor` in, `at::hip::getCurrentHIPStream()`, `TORCH_CHECK`, `DISPATCH_PYTORCH_DTYPE_*`. |
+| 3 | `flashinfer/csrc_rocm/flashinfer_<op>_binding.cu` | `TORCH_LIBRARY_FRAGMENT(TORCH_EXTENSION_NAME, m) { m.def("<op>", <op>); }`. |
+| 4 (opt) | `flashinfer/csrc_rocm/<op>_customize_config.jinja` | Compile-time type specialization. Skip if runtime dispatch is enough. |
+| 5 | `flashinfer/jit/<op>.py` | `gen_<op>_module() -> JitSpec` via `gen_jit_spec(...)`. |
+| 6 | `flashinfer/<op>.py` | Python API: `@functools.cache` module loader, optional `@flashinfer_api`, destination-passing (`out=`). |
+| 7 | `tests/rocm_tests/test_<op>_hip.py` | Correctness tests; FP32 reference math, loose BF16 tolerances. |
+| 8 | `flashinfer/jit/__init__.py` (`IS_HIP` branch) | `from .<op> import gen_<op>_module as gen_<op>_module`. |
+| 9 | `flashinfer/__init__.py` (`IS_HIP` branch) | `from .<op> import <op> as <op>`. |
+| 10 (opt) | `flashinfer/aot_hip.py` | Register `gen_<op>_module` for pre-compiled wheels. |
 
-## Goal
+**Forgetting steps 8 and 9 is the most common bug** — the module compiles but is invisible from `import flashinfer`.
 
-Add a new operation that scales each element of a tensor by a scalar factor:
+## CUDA → ROCm porting cheat sheet
 
-- Input: tensor `x` and scalar `factor`
-- Output: `x * factor` (element-wise)
-- Support FP16 and BF16
-- Compile for both `gfx942` and `gfx950`
+When porting an upstream kernel, mechanically rewrite:
 
-## Step 1: Define the HIP kernel in `include/`
-
-Create `include/flashinfer/scale.cuh`. **Do not include `<torch/...>` headers here.** The file
-must stay framework-agnostic so the same header can compile under CUDA (upstream) and HIP (this
-port). For anything that differs between the two platforms, reach for
-[`include/gpu_iface/`](../../../include/gpu_iface/).
-
-```cpp
-#pragma once
-
-#include "gpu_iface/platform.hpp"
-#include "gpu_iface/gpu_runtime_compat.hpp"
-#include "gpu_iface/vec_dtypes.hpp"
-
-namespace flashinfer {
-
-/*!
- * \brief Element-wise scale kernel.
- * \tparam T Data type (half / __hip_bfloat16 / float)
- */
-template <typename T>
-__global__ void ScaleKernel(const T* __restrict__ input, T* __restrict__ output,
-                            T factor, int n) {
-  int idx = blockIdx.x * blockDim.x + threadIdx.x;
-  if (idx < n) {
-    output[idx] = input[idx] * factor;
-  }
-}
-
-/*!
- * \brief Launch scale kernel (platform-agnostic).
- */
-template <typename T>
-gpuError_t ScaleLauncher(const T* input, T* output, T factor, int n,
-                         gpuStream_t stream = nullptr) {
-  const int threads = 256;
-  const int blocks  = (n + threads - 1) / threads;
-
-  ScaleKernel<T><<<blocks, threads, 0, stream>>>(input, output, factor, n);
-
-  return gpuGetLastError();
-}
-
-}  // namespace flashinfer
-```
-
-**Key points:**
-
-- No `<cuda_runtime.h>` / `<hip/hip_runtime.h>` includes — these are pulled in transitively by
-  `gpu_iface/platform.hpp` based on whether the TU is being compiled for CUDA or HIP.
-- `gpuError_t`, `gpuStream_t`, and `gpuGetLastError()` come from
-  [`include/gpu_iface/gpu_runtime_compat.hpp`](../../../include/gpu_iface/gpu_runtime_compat.hpp)
-  and alias to either the CUDA or HIP symbols depending on the backend macro.
-- `__global__` and the `<<<...>>>` launch syntax are supported on both HIP and CUDA without
-  translation.
-- Template on dtype; the concrete dtype is selected in the launcher via a dispatch macro.
-
-### When to add something to `gpu_iface`
-
-If your kernel needs a primitive that differs meaningfully between CUDA and HIP (an MMA
-intrinsic, a cross-lane shuffle, a memory fence, a warp-wide reduction, a dtype container), add
-it to the appropriate `include/gpu_iface/backend/{cuda,hip}/` file and expose a shared name from
-the top-level `gpu_iface/` header — do **not** duplicate the whole kernel under `csrc_rocm/`.
-
-Representative HIP-side files already in use:
-
-- [`include/gpu_iface/backend/hip/vec_dtypes_hip.h`](../../../include/gpu_iface/backend/hip/vec_dtypes_hip.h)
-- [`include/gpu_iface/backend/hip/mma_hip.h`](../../../include/gpu_iface/backend/hip/mma_hip.h)
-- [`include/gpu_iface/backend/hip/memory_ops_hip.h`](../../../include/gpu_iface/backend/hip/memory_ops_hip.h)
-- [`include/gpu_iface/backend/hip/math_hip.h`](../../../include/gpu_iface/backend/hip/math_hip.h)
-
-## Step 2: Create the launcher in `flashinfer/csrc_rocm/`
-
-Create `flashinfer/csrc_rocm/scale.cu`. This is the file that bridges PyTorch tensors to the
-framework-agnostic kernel above.
-
-```cpp
-#include <cstdint>
-#include <flashinfer/scale.cuh>
-
-#include "pytorch_extension_utils.h"
-
-using namespace flashinfer;
-
-void scale(at::Tensor& output, at::Tensor& input, double factor) {
-  CHECK_INPUT(input);
-  CHECK_INPUT(output);
-  TORCH_CHECK(input.sizes() == output.sizes(),
-              "scale: output shape must match input shape");
-  TORCH_CHECK(input.scalar_type() == output.scalar_type(),
-              "scale: output dtype must match input dtype");
-
-  const c10::hip::OptionalHIPGuardMasqueradingAsCUDA device_guard(input.device());
-  const hipStream_t stream = at::hip::getCurrentHIPStream();
-  const int n = static_cast<int>(input.numel());
-
-  DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16(input.scalar_type(), c_type, [&] {
-    hipError_t status = ScaleLauncher<c_type>(
-        static_cast<c_type*>(input.data_ptr()),
-        static_cast<c_type*>(output.data_ptr()),
-        static_cast<c_type>(factor),
-        n,
-        stream);
-    TORCH_CHECK(status == hipSuccess,
-                "scale failed: " + std::string(hipGetErrorString(status)));
-    return true;
-  });
-}
-```
-
-**Key points:**
-
-- Include [`pytorch_extension_utils.h`](../../../flashinfer/csrc_rocm/pytorch_extension_utils.h)
-  for `at::Tensor`, the `CHECK_*` macros, and the `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_*` family.
-- Use `c10::hip::OptionalHIPGuardMasqueradingAsCUDA` — PyTorch's ROCm build "masquerades" as
-  CUDA, so device guards and streams are exposed through the HIP-prefixed namespaces.
-- Acquire the current HIP stream with `at::hip::getCurrentHIPStream()`, not
-  `c10::cuda::getCurrentCUDAStream()`.
-- Dispatch macro: `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16` covers FP16+BF16. For a dispatch that
-  also covers FP32 or FP8, use the other `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_*` variants defined
-  in `pytorch_extension_utils.h`.
-- Error handling uses `TORCH_CHECK(cond, msg)` — the PyTorch extension idiom. There is no
-  `TVM_FFI_THROW` on this path.
-
-### Validation helpers available
-
-From [`flashinfer/csrc_rocm/pytorch_extension_utils.h`](../../../flashinfer/csrc_rocm/pytorch_extension_utils.h):
-
-- `CHECK_INPUT(tensor)` — validates CUDA/HIP + contiguous.
-- `CHECK_LAST_DIM_CONTIGUOUS_INPUT(tensor)` — validates CUDA/HIP + last-dim-contiguous.
-- `CHECK_EQ(a, b)`, `CHECK_DIM(n, tensor)` — shape / rank sanity checks.
-- `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16` — FP16 + BF16
-- `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16_FP32` — FP16 + BF16 + FP32
-- `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP8` — E4M3 + E5M2 (the `_fnuz` variants on CDNA3/4)
-
-For a worked-out reference, read [`flashinfer/csrc_rocm/norm.cu`](../../../flashinfer/csrc_rocm/norm.cu)
-(kept intentionally simple) and compare against the more involved
-[`flashinfer/csrc_rocm/batch_prefill.cu`](../../../flashinfer/csrc_rocm/batch_prefill.cu) (plan-run
-pattern, multiple backends, FP8 path).
-
-## Step 3: Create the Torch-extension binding
-
-Create `flashinfer/csrc_rocm/flashinfer_scale_binding.cu`. This is the file that exports the
-launcher to Python.
-
-```cpp
-#include "pytorch_extension_utils.h"
-
-void scale(at::Tensor& output, at::Tensor& input, double factor);
-
-TORCH_LIBRARY_FRAGMENT(TORCH_EXTENSION_NAME, m) {
-  // Element-wise scale: output = input * factor
-  m.def("scale", scale);
-}
-```
-
-**Key points:**
-
-- The `TORCH_EXTENSION_NAME` macro is defined by PyTorch's build system and resolves to the
-  unique module name for this JIT build — `TORCH_LIBRARY_FRAGMENT` registers `scale` under that
-  namespace.
-- `pytorch_extension_utils.h` also emits a `PyInit_<name>` stub so the resulting `.so` is
-  importable as a Python module (see the bottom of
-  [`pytorch_extension_utils.h`](../../../flashinfer/csrc_rocm/pytorch_extension_utils.h)).
-- Compare with [`flashinfer/csrc_rocm/flashinfer_norm_binding.cu`](../../../flashinfer/csrc_rocm/flashinfer_norm_binding.cu)
-  for the exact pattern.
-
-**Do not write:**
-
-- `TVM_FFI_DLL_EXPORT_TYPED_FUNC(run, scale)` — that's the upstream TVM-FFI pattern; it does
-  not work on this port.
-- `PYBIND11_MODULE(...)` — we use the `TORCH_LIBRARY_FRAGMENT` flavor which integrates with
-  `torch.library` and thus with `torch.compile`.
-
-## Step 4: (Optional) Jinja type specialization
-
-For operations that benefit from compile-time type specialization (you want one `.so` per dtype
-combination rather than runtime dispatch), add a Jinja template next to the launcher:
-
-`flashinfer/csrc_rocm/scale_customize_config.jinja`:
-
-```jinja
-#pragma once
-
-using DTypeIn  = {{ dtype_in }};
-using DTypeOut = {{ dtype_out }};
-constexpr int SCALE_BLOCK_SIZE = {{ block_size }};
-```
-
-The JIT module generator (Step 5) renders this to a concrete `.inc` file before invoking
-`hipcc`. See [`flashinfer/csrc_rocm/batch_prefill_customize_config.jinja`](../../../flashinfer/csrc_rocm/batch_prefill_customize_config.jinja)
-for a non-trivial example.
-
-**When to skip Jinja:** for a kernel like our `scale` example, where the dtype is picked via
-`DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16` at runtime, there is no benefit. Skip this step entirely.
-
-## Step 5: Write the JIT module generator
-
-Create `flashinfer/jit/scale.py`:
-
-```python
-"""
-Copyright (c) 2026 by FlashInfer+ROCm team.
-SPDX-License-Identifier: Apache-2.0
-"""
-
-from . import env as jit_env
-from .core import JitSpec, gen_jit_spec
-
-
-def gen_scale_module() -> JitSpec:
-    """JitSpec for the element-wise scale op.
-
-    No Jinja / type specialization is needed here because the dtype dispatch
-    happens inside DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16 at runtime.
-    """
-    extra_flags = [
-        "-DENABLE_BF16",
-    ]
-    return gen_jit_spec(
-        "scale",
-        [
-            jit_env.FLASHINFER_CSRC_DIR / "scale.cu",
-            jit_env.FLASHINFER_CSRC_DIR / "flashinfer_scale_binding.cu",
-        ],
-        extra_cuda_cflags=extra_flags,
-    )
-```
-
-**Key points:**
-
-- `jit_env.FLASHINFER_CSRC_DIR` resolves to `flashinfer/csrc_rocm/` on HIP, via
-  [`flashinfer/get_include_paths.py::get_csrc_dir()`](../../../flashinfer/get_include_paths.py).
-  This is a conscious divergence from upstream — do **not** reach for a hard-coded `csrc/`.
-- `extra_cuda_cflags` is still the kwarg name even on HIP (for source-compat with upstream);
-  internally [`flashinfer/jit/core.py`](../../../flashinfer/jit/core.py) maps it to flags passed
-  to `hipcc`.
-- `gen_jit_spec` on HIP automatically prepends the output of
-  `current_compilation_context.get_hipcc_flags_list()` — that is, `--offload-arch=gfxNNN` for
-  every target arch plus the common HIP defines (`-DFLASHINFER_ENABLE_HIP`, etc.). You do not
-  need to add `--offload-arch` yourself unless you are overriding a built-in default.
-- If your kernel must **only** run on one arch, add a runtime check (e.g. via
-  `FLASHINFER_SUPPORTED_ROCM_ARCHS` in [`flashinfer/hip_utils.py`](../../../flashinfer/hip_utils.py))
-  at the Python API layer. There is no HIP-side equivalent of upstream's
-  `supported_major_versions=[...]` mechanism yet.
-
-### Register the generator for re-export
-
-Add the import to the `IS_HIP` branch of
-[`flashinfer/jit/__init__.py`](../../../flashinfer/jit/__init__.py):
-
-```python
-elif IS_HIP:
-    # ...
-    from .scale import gen_scale_module as gen_scale_module
-```
-
-Place it alphabetically among the existing `from .norm import ...`, `from .rope import ...`
-lines.
-
-## Step 6: Write the Python API
-
-Create `flashinfer/scale.py`:
-
-```python
-"""
-Copyright (c) 2026 by FlashInfer+ROCm team.
-SPDX-License-Identifier: Apache-2.0
-"""
-
-import functools
-from typing import Optional
-
-import torch
-
-from .jit.scale import gen_scale_module
-
-
-@functools.cache
-def _get_scale_module():
-    """Compile + load the scale module exactly once per process."""
-    return gen_scale_module().build_and_load()
-
-
-def scale(
-    input: torch.Tensor,
-    factor: float,
-    out: Optional[torch.Tensor] = None,
-) -> torch.Tensor:
-    """Element-wise ``output = input * factor``.
-
-    Parameters
-    ----------
-    input : torch.Tensor
-        Input tensor on an AMD GPU. Must be FP16 or BF16 and contiguous.
-    factor : float
-        Scalar multiplier.
-    out : Optional[torch.Tensor]
-        Pre-allocated output tensor. If ``None``, a new tensor is allocated.
-
-    Returns
-    -------
-    torch.Tensor
-        ``input * factor`` with the same shape/dtype/device as ``input``.
-
-    Examples
-    --------
-    >>> import torch, flashinfer
-    >>> x = torch.randn(1024, dtype=torch.float16, device="cuda")
-    >>> y = flashinfer.scale(x, 2.0)
-    >>> torch.allclose(y, x * 2.0)
-    True
-    """
-    if out is None:
-        out = torch.empty_like(input)
-
-    module = _get_scale_module()
-    module.scale(out, input, float(factor))
-    return out
-```
-
-**Key points:**
-
-- `@functools.cache` caches the compiled module in memory so subsequent calls skip the JIT
-  cache lookup entirely.
-- **Destination-passing style**: accept an optional `out=` so perf-sensitive callers can avoid
-  an extra allocation.
-- On ROCm, `input.device.type == "cuda"` — PyTorch's ROCm build reuses the CUDA namespace. Do
-  not test for `"hip"`; it will never be true in practice.
-- If you want API logging, add `@flashinfer_api` above `def scale(...)`. See the
-  [`debug-rocm-crash`](../debug-rocm-crash/SKILL.md) skill.
-
-### Expose from the package
-
-Add the export to the `IS_HIP` branch of
-[`flashinfer/__init__.py`](../../../flashinfer/__init__.py):
-
-```python
-elif IS_HIP:
-    # ...
-    from .scale import scale as scale
-```
-
-## Step 7: Write tests
-
-Create `tests/rocm_tests/test_scale_hip.py`:
-
-```python
-"""
-Copyright (c) 2026 by FlashInfer+ROCm team.
-SPDX-License-Identifier: Apache-2.0
-"""
-
-import pytest
-import torch
-
-import flashinfer
-from flashinfer.hip_utils import FLASHINFER_SUPPORTED_ROCM_ARCHS
-
-
-def _current_arch() -> str:
-    return torch.cuda.get_device_properties(0).gcnArchName.split(":")[0]
-
-
-@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16])
-@pytest.mark.parametrize("shape", [(1024,), (32, 128), (8, 32, 128)])
-@pytest.mark.parametrize("factor", [0.5, 1.0, 2.0, -3.25])
-def test_scale_correctness(shape, dtype, factor):
-    assert _current_arch() in FLASHINFER_SUPPORTED_ROCM_ARCHS, (
-        "Test requires a FlashInfer-supported AMD GPU"
-    )
-
-    x = torch.randn(*shape, dtype=dtype, device="cuda")
-    y = flashinfer.scale(x, factor)
-
-    ref = x.float() * factor
-    torch.testing.assert_close(y.float(), ref, rtol=1e-2, atol=1e-2)
-
-
-def test_scale_inplace_out():
-    x = torch.randn(64, 64, dtype=torch.float16, device="cuda")
-    out = torch.empty_like(x)
-    y = flashinfer.scale(x, 3.0, out=out)
-
-    assert y.data_ptr() == out.data_ptr()
-    torch.testing.assert_close(y.float(), x.float() * 3.0, rtol=1e-2, atol=1e-2)
-```
-
-**Key points:**
-
-- Test files under [`tests/rocm_tests/`](../../../tests/rocm_tests/) are named `test_*_hip.py`
-  by convention.
-- The repo's [`tests/rocm_tests/conftest.py`](../../../tests/rocm_tests/conftest.py) hooks into
-  `pytest-xdist` so `pytest -n auto` only spawns workers for
-  FlashInfer-supported GPUs. You do not need to parametrize over devices yourself.
-- Use FP32 for reference math to avoid dtype-mismatch asserts with `assert_close`.
-- Keep tolerances loose enough for BF16 (`rtol=1e-2`, `atol=1e-2`); tighten for FP32-only ops.
-
-Run it:
-
-```bash
-pytest tests/rocm_tests/test_scale_hip.py -v
-# Or only on GPU 0
-HIP_VISIBLE_DEVICES=0 pytest tests/rocm_tests/test_scale_hip.py -v
-```
-
-## Step 8: Register for AOT (optional)
-
-If your op should also be available in pre-compiled wheels (the
-[`amd-flashinfer-jit-cache/`](../../../amd-flashinfer-jit-cache/) package), register the JIT
-generator in [`flashinfer/aot_hip.py`](../../../flashinfer/aot_hip.py). Add a generator that
-yields your `JitSpec`, and reference it from the main AOT-compile loop.
-
-Pattern (see existing `gen_fa2` in that file):
-
-```python
-def gen_scale() -> Iterator:
-    from .jit.scale import gen_scale_module
-    yield gen_scale_module()
-```
-
-Then AOT compile with:
-
-```bash
-cd amd-flashinfer-jit-cache
-export FLASHINFER_ROCM_ARCH_LIST="gfx942,gfx950"
-python -m build --no-isolation --wheel
-```
-
-The resulting wheel ships a pre-compiled `.so` per arch, indexed by the URI hash.
-
-## CDNA3 vs CDNA4 — what to watch for
-
-Both `gfx942` (CDNA3, MI300X/MI325X) and `gfx950` (CDNA4, MI350X/MI355X) are Matrix Core
-architectures, but they are not fully compatible:
-
-| Concern | CDNA3 (`gfx942`) | CDNA4 (`gfx950`) |
-| --- | --- | --- |
-| MFMA intrinsics | `__builtin_amdgcn_mfma_*` family (F16, BF16, I8, FP8) | Same family **plus** new CDNA4-only instructions (wider FP8 MFMAs, additional block sizes) |
-| FP8 format | `__hip_fp8_e4m3_fnuz`, `__hip_fp8_e5m2_fnuz` (FNUZ biasing) | Same FNUZ variants (OCP FP8 support depends on ROCm version) |
-| LDS capacity | 64 KB / CU | 160 KB / XCD on some configs — **do not** assume identical block/tile sizes |
-| Wavefront size | 64 | 64 |
-
-Practical implications when authoring a new kernel:
-
-- If you use MFMA intrinsics, guard them on the arch macro (`__gfx942__`, `__gfx950__`) or
-  behind the `FLASHINFER_SUPPORTED_ROCM_ARCHS` check at the Python level.
-- Do not hard-code LDS tile sizes. Either parameterize the kernel (Jinja) or query the device
-  properties at plan time (e.g. `torch.cuda.get_device_properties(dev).shared_memory_per_block`).
-- FP8: on both arches, the `_fnuz` variants are the safe default. Bit-exact parity with NVIDIA
-  `__nv_fp8_e4m3` is **not** guaranteed — reference tests must account for the FNUZ
-  representation.
-
-When in doubt, look at how
-[`flashinfer/prefill_rocm.py`](../../../flashinfer/prefill_rocm.py) and
-[`flashinfer/csrc_rocm/batch_prefill.cu`](../../../flashinfer/csrc_rocm/batch_prefill.cu) handle
-per-arch specialization.
-
-## Reference implementations in this repo
-
-| Complexity | Files |
+| Upstream CUDA | This fork |
 | --- | --- |
-| Simple, no Jinja | [`flashinfer/norm.py`](../../../flashinfer/norm.py) + [`flashinfer/csrc_rocm/norm.cu`](../../../flashinfer/csrc_rocm/norm.cu) + [`flashinfer/csrc_rocm/flashinfer_norm_binding.cu`](../../../flashinfer/csrc_rocm/flashinfer_norm_binding.cu) + [`flashinfer/jit/norm.py`](../../../flashinfer/jit/norm.py) |
-| Moderate, with Jinja | [`flashinfer/csrc_rocm/single_prefill.cu`](../../../flashinfer/csrc_rocm/single_prefill.cu) + [`flashinfer/csrc_rocm/single_prefill_customize_config.jinja`](../../../flashinfer/csrc_rocm/single_prefill_customize_config.jinja) + [`flashinfer/csrc_rocm/single_prefill_kernel_inst.jinja`](../../../flashinfer/csrc_rocm/single_prefill_kernel_inst.jinja) |
-| Complex (plan-run, AITER, FP8) | [`flashinfer/prefill_rocm.py`](../../../flashinfer/prefill_rocm.py) + [`flashinfer/csrc_rocm/batch_prefill.cu`](../../../flashinfer/csrc_rocm/batch_prefill.cu) |
-
-## Summary checklist
-
-When adding a new op, verify each box:
-
-- [ ] Header in `include/flashinfer/` — no Torch/HIP-runtime includes; uses `gpu_iface/` for
-      platform-differing primitives.
-- [ ] Launcher in `flashinfer/csrc_rocm/<name>.cu` with `#include "pytorch_extension_utils.h"`,
-      `at::Tensor` inputs, `at::hip::getCurrentHIPStream()`, and a `DISPATCH_PYTORCH_DTYPE_*`
-      block.
-- [ ] Binding in `flashinfer/csrc_rocm/flashinfer_<name>_binding.cu` using
-      `TORCH_LIBRARY_FRAGMENT(TORCH_EXTENSION_NAME, m)`.
-- [ ] (Optional) Jinja template for type specialization.
-- [ ] JIT generator in `flashinfer/jit/<name>.py` returning a `JitSpec` via `gen_jit_spec`.
-- [ ] Import exposed from the `IS_HIP` branches of `flashinfer/jit/__init__.py` **and**
-      `flashinfer/__init__.py`.
-- [ ] Python API with `@functools.cache`, destination-passing style, FP16/BF16 support,
-      and optional `@flashinfer_api`.
-- [ ] Tests in `tests/rocm_tests/test_<name>_hip.py`.
-- [ ] (Optional) AOT registration in `flashinfer/aot_hip.py`.
-- [ ] Run `pre-commit run -a` before committing.
-
-## Related documentation
-
-- [`CLAUDE.md`](../../../CLAUDE.md) — project overview, JIT architecture, feature matrix.
-- [`.claude/skills/benchmark-kernel/SKILL.md`](../benchmark-kernel/SKILL.md) — how to benchmark
-  the kernel you just added.
-- [`.claude/skills/debug-rocm-crash/SKILL.md`](../debug-rocm-crash/SKILL.md) — debugging recipes
-  when `TORCH_CHECK` fires or the GPU faults.
-- Upstream's [`add-cuda-kernel` skill](https://github.com/flashinfer-ai/flashinfer/blob/main/.claude/skills/add-cuda-kernel/SKILL.md)
-  — the source this tutorial was adapted from. Useful when you are porting a kernel from
-  upstream CUDA and want to see the "before" picture.
+| `csrc/<op>.cu` | `flashinfer/csrc_rocm/<op>.cu` |
+| `#include "tvm_ffi_utils.h"` | `#include "pytorch_extension_utils.h"` |
+| `tvm::ffi::TensorView` | `at::Tensor` |
+| `TVM_FFI_DLL_EXPORT_TYPED_FUNC(run, op)` | `TORCH_LIBRARY_FRAGMENT(TORCH_EXTENSION_NAME, m) { m.def("op", op); }` |
+| `TVM_FFI_THROW(ValueError) << "..."` | `TORCH_CHECK(cond, "...")` |
+| `DISPATCH_DLPACK_DTYPE_TO_CTYPE_FP16` | `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16` |
+| `get_stream(tensor.device())` | `at::hip::getCurrentHIPStream()` |
+| `c10::cuda::OptionalCUDAGuard` | `c10::hip::OptionalHIPGuardMasqueradingAsCUDA` |
+| `nvcc` flags via `extra_cuda_cflags=[...]` | **Same kwarg name** (`extra_cuda_cflags`) — internally routed to `hipcc`. |
+| `flashinfer/aot.py` registration | `flashinfer/aot_hip.py` |
+| `tests/test_op.py` | `tests/rocm_tests/test_op_hip.py` |
+| `supported_major_versions=[9, 10]` | No analogue. Guard at Python layer via `FLASHINFER_SUPPORTED_ROCM_ARCHS`. |
+| `csrc/` (hardcoded) | `jit_env.FLASHINFER_CSRC_DIR` resolves to `flashinfer/csrc_rocm/` on HIP. **Never hardcode `csrc/`.** |
+| `PYBIND11_MODULE(...)` | **Don't.** Use `TORCH_LIBRARY_FRAGMENT` (integrates with `torch.compile`). |
+
+## Non-obvious gotchas
+
+- **PyTorch's ROCm masquerade.** `input.device.type == "cuda"` even on AMD. Never check for `"hip"`. PyTorch's HIP namespaces are reachable via `at::hip::...` and `c10::hip::OptionalHIPGuardMasqueradingAsCUDA` (literally the type name).
+- **`gpu_iface` over duplication.** If a primitive (MMA intrinsic, cross-lane shuffle, dtype container, warp reduction) differs between CUDA and HIP, add it under [`include/gpu_iface/backend/{cuda,hip}/`](../../../include/gpu_iface) and expose a common name from the top-level `gpu_iface/` header. Don't fork the kernel into `csrc_rocm/`. Existing HIP backends: `mma_hip.h`, `memory_ops_hip.h`, `math_hip.h`, `vec_dtypes_hip.h`.
+- **`-ffast-math` adds `-ffinite-math-only` on clang/hipcc.** [`jit/core.py`](../../../flashinfer/jit/core.py) explicitly re-adds `-fno-finite-math-only` so kernels that use `-inf` as a sentinel (online-softmax Map+Reduce) keep working. CUDA's `-use_fast_math` does *not* enable finite-math-only, which is a divergence to be aware of when porting.
+- **`gen_jit_spec` auto-injects `--offload-arch=gfxNNN`** for every target arch plus `COMMON_HIPCC_FLAGS` (`-DFLASHINFER_ENABLE_HIP`, FP8 enables, etc.). Don't add `--offload-arch` by hand.
+- **Validation macros** live in [`pytorch_extension_utils.h`](../../../flashinfer/csrc_rocm/pytorch_extension_utils.h): `CHECK_INPUT` (GPU + contiguous), `CHECK_LAST_DIM_CONTIGUOUS_INPUT`, `CHECK_EQ`, `CHECK_DIM`, `CHECK_GE`, `CHECK_SHAPE`. Dispatch macros: `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16` (FP16+BF16), `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP8` (E4M3+E5M2, both `_fnuz` on CDNA3/4), and the unsuffixed `DISPATCH_PYTORCH_DTYPE_TO_CTYPE` (FP16+BF16+FP8 combined). There is **no** `_FP16_FP32` variant — if you need FP32, dispatch manually.
+- **The `_jit_pybind.cu` naming pattern** (e.g. `batch_decode_jit_pybind.cu`) is used by newer AITER-integrated bindings; the older `flashinfer_<op>_binding.cu` pattern is used by everything else. Both work — match the neighbors.
+
+## CDNA3 (`gfx942`) vs CDNA4 (`gfx950`)
+
+- **Wavefront = 64 on both.** Anything ported from CUDA assuming warp = 32 is wrong. Use `warpSize` for portability.
+- **FP8** is `__hip_fp8_e4m3_fnuz` / `__hip_fp8_e5m2_fnuz` on both. PyTorch dtype is `torch.float8_e4m3fnuz` (not `torch.float8_e4m3fn`, which is the OCP FP8 format used on NVIDIA GPUs). Bit-exact parity with NVIDIA FP8 is not guaranteed, so calibrate scale factors separately.
+- **MFMA intrinsics:** CDNA4 has additional FP8 MFMA shapes not on CDNA3. Guard arch-specific intrinsics with `__gfx942__` / `__gfx950__`, or dispatch on the GPU arch name (e.g. against `FLASHINFER_SUPPORTED_ROCM_ARCHS`) at the Python layer.
+- **LDS / register / occupancy budgets differ.** Don't hard-code tile sizes — parameterize (Jinja) or query via `torch.cuda.get_device_properties(dev)` at plan time.
+
+## Quick checklist before commit
+
+- [ ] No `<torch/...>` under `include/`.
+- [ ] Launcher uses `at::hip::getCurrentHIPStream()` + `OptionalHIPGuardMasqueradingAsCUDA`.
+- [ ] Binding registered via `TORCH_LIBRARY_FRAGMENT`.
+- [ ] JIT generator uses `jit_env.FLASHINFER_CSRC_DIR` (not hardcoded `csrc/`).
+- [ ] Both `flashinfer/jit/__init__.py` and `flashinfer/__init__.py` IS_HIP branches updated.
+- [ ] Test file under `tests/rocm_tests/` named `test_*_hip.py`.
+- [ ] `pre-commit run -a` clean.
diff --git a/.claude/skills/benchmark-kernel/SKILL.md b/.claude/skills/benchmark-kernel/SKILL.md
index b97f137048..2f4c7de13a 100644
--- a/.claude/skills/benchmark-kernel/SKILL.md
+++ b/.claude/skills/benchmark-kernel/SKILL.md
@@ -3,370 +3,80 @@ name: benchmark-kernel
 description: Guide for benchmarking FlashInfer+ROCm kernels on AMD Instinct (CDNA3/CDNA4)
 ---
 
-# Tutorial: Benchmarking FlashInfer+ROCm Kernels
+# Benchmarking FlashInfer+ROCm Kernels
 
-This guide shows how to accurately benchmark kernels on the ROCm port of FlashInfer (the
-`amd-flashinfer` package), targeting AMD Instinct CDNA3 (`gfx942`) and CDNA4 (`gfx950`).
+For a real driver script to copy, see
+[`benchmarks/rocm_benchmarks/bench_fa2_prefill.py`](../../../benchmarks/rocm_benchmarks/bench_fa2_prefill.py) and [`benchmarks/rocm_benchmarks/bench_aiter_prefill.py`](../../../benchmarks/rocm_benchmarks/bench_aiter_prefill.py).
+For the in-repo profiler wrapper, see [`rocm_profiler/rocm_profiler.py`](../../../rocm_profiler/rocm_profiler.py).
 
-## Goal
+## Timing method matrix
 
-Measure the performance of FlashInfer+ROCm kernels:
-
-- Get accurate GPU kernel execution time on MI300X / MI325X / MI350X / MI355X.
-- Compare HIP-native and AITER (Composable-Kernel) prefill backends.
-- Generate reproducible benchmark results for regression tracking.
-- Save results to CSV / PNG rooflines for later analysis.
-
-## Timing methods on ROCm
-
-FlashInfer+ROCm supports three practical timing paths. **CUPTI is NVIDIA-only — do not try to
-install `cupti-python` on a ROCm host.**
-
-| Method | When to use | Source |
+| Method | When | How |
 | --- | --- | --- |
-| **CUDA events (HIP-backed via PyTorch)** | Default. Quick in-loop timing from Python. Good accuracy for kernels ≳ 50 µs. | `flashinfer.testing.bench_gpu_time` (the "CUDA event" path) |
-| **`rocprofv3` + [`rocm_profiler/rocm_profiler.py`](../../../rocm_profiler/rocm_profiler.py)** | Preferred for authoring or optimizing a kernel. Gives per-kernel time, hardware counters, and a two-panel roofline plot. | Wrapper spawns `rocprofv3` as a subprocess. |
-| **`omnitrace`** | Whole-process timeline with host + device events. Use when interaction with dataloaders / Python overhead is suspect. | Installed separately from ROCm. |
-
-Internally, `bench_gpu_time` on ROCm uses PyTorch's `torch.cuda.Event`, which maps to HIP events
-under the ROCm build. The `bench_gpu_time_with_cupti` code path in
-[`flashinfer/testing/utils.py`](../../../flashinfer/testing/utils.py) is never selected on a ROCm
-install because `cupti-python` will not import.
-
-## Pre-flight: what you can actually benchmark
-
-On a ROCm install of `amd-flashinfer`, only the APIs exposed in the `IS_HIP` branch of
-[`flashinfer/__init__.py`](../../../flashinfer/__init__.py) are callable:
-
-**Attention:**
-
-- `single_prefill_with_kv_cache` / `single_prefill_with_kv_cache_return_lse`
-- `BatchPrefillWithPagedKVCacheWrapper`, `BatchPrefillWithRaggedKVCacheWrapper`
-- `single_decode_with_kv_cache`
-- `BatchDecodeWithPagedKVCacheWrapper`, `CUDAGraphBatchDecodeWithPagedKVCacheWrapper`
-
-**Other:**
-
-- Normalization (`rmsnorm`, `fused_add_rmsnorm`, `gemma_rmsnorm`, …)
-- RoPE (`apply_rope_*`, `apply_llama31_rope_*`)
-- Sampling (`sampling_from_probs`, `top_k_*`, `top_p_*`, `min_p_sampling_from_probs`, …)
-- Paged KV management (`append_paged_kv_cache`, `get_batch_indices_positions`, …)
-- Quantization (`packbits`, `segment_packbits`)
-- Activation (`silu_and_mul`, `gelu_and_mul`, `gelu_tanh_and_mul`)
-
-**Not available on ROCm:** MLA, cascade, POD, FP4 quantization, TRT-LLM/CUTLASS MoE, cuDNN
-backends. Do not attempt to benchmark these — the symbol simply is not re-exported in the
-`IS_HIP` branch.
-
-**Backends that exist per op:**
-
-| Op family | Default (HIP) backend | AITER backend available? | How to select AITER |
-| --- | --- | --- | --- |
-| Single prefill | yes | yes (CK FMHA) | `backend="aiter"` kwarg |
-| Batch prefill (paged / ragged) | yes | yes (CK FMHA) | `backend="aiter"` kwarg |
-| Decode (single / batch / CUDA-graph) | yes | no | n/a |
-| All others (norm, rope, sampling, …) | yes | no | n/a |
-
-**AITER caveats** (see [`README.md`](../../../README.md) and
-[`flashinfer/prefill_rocm.py`](../../../flashinfer/prefill_rocm.py)):
-
-- `kv_layout="NHD"` only.
-- Batch prefill with AITER's CK FMHA requires `page_size ∈ {1, 16, 1024}`.
-- `amd-aiter` must be importable (usually `pip install amd-aiter --index-url https://pypi.amd.com/simple/`).
-
-Trying to benchmark an unsupported config under `backend="aiter"` will raise a Python error
-*before* the kernel launches, not silently fall back.
-
-## Method 1: In-script timing with `bench_gpu_time`
-
-For a quick perf check of one op, call
-[`flashinfer.testing.bench_gpu_time`](../../../flashinfer/testing/utils.py) directly. On ROCm it
-falls through to the `bench_gpu_time_with_cuda_event` path automatically.
-
-```python
-import torch
-import flashinfer
-from flashinfer.testing import bench_gpu_time
-
-seq_len       = 1024
-num_qo_heads  = 32
-num_kv_heads  = 8      # GQA 4:1
-head_dim      = 128
-dtype         = torch.bfloat16
-
-q = torch.randn(seq_len, num_qo_heads, head_dim, dtype=dtype, device="cuda")
-k = torch.randn(seq_len, num_kv_heads, head_dim, dtype=dtype, device="cuda")
-v = torch.randn(seq_len, num_kv_heads, head_dim, dtype=dtype, device="cuda")
-
-
-def run_default():
-    return flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)
-
-
-def run_aiter():
-    return flashinfer.single_prefill_with_kv_cache(
-        q, k, v, causal=True, backend="aiter",
-    )
-
-
-def report(label, fn):
-    # enable_cupti=True is harmless on ROCm — it is silently ignored and the
-    # CUDA-events path is used. Passing it makes the script portable to CUDA hosts.
-    median_ms, std_ms = bench_gpu_time(
-        fn, args=(), enable_cupti=True, num_iters=30, dry_run_iters=5,
-    )
-    print(f"{label:12s}  median={median_ms:.3f} ms  std={std_ms:.3f} ms")
-
-
-report("hip-default", run_default)
-report("aiter",        run_aiter)
-```
-
-Typical output on an MI300X (numbers are illustrative — your exact values will depend on ROCm
-version, driver, and HIP-SDMA settings):
-
-```text
-hip-default  median=0.182 ms  std=0.004 ms
-aiter        median=0.146 ms  std=0.003 ms
-```
-
-**Important arguments:**
-
-| Arg | Purpose | Default |
-| --- | --- | --- |
-| `num_iters` | Measured iterations | 30 |
-| `dry_run_iters` | Warmup iterations | 5 |
-| `enable_cupti` | CUDA only; ignored on ROCm | False |
-| `l2_flush` / `rotate_buffers` | Flush L2 between iterations for memory-bound kernels | varies |
-
-## Method 2: `rocm_profiler` (recommended for optimization work)
-
-For anything you intend to optimize, use the in-repo
-[`rocm_profiler/rocm_profiler.py`](../../../rocm_profiler/rocm_profiler.py). It:
-
-1. Runs repeated GPU launches in the current process to get a median kernel time.
-2. Re-exec's the same driver script under `rocprofv3` as a subprocess (recognized by the
-   `_ROCM_PROFILER_INTERNAL` env sentinel) to collect hardware counters with one
-   warmup + one profiled launch.
-3. Produces a two-panel log-log **roofline plot** combining the timing and counter data.
-
-All outputs are written under `benchmarks/rocm_benchmarks/` (gitignored).
-
-### Minimal driver script
+| `flashinfer.testing.bench_gpu_time` | Quick in-loop check (kernels ≳ 50 µs) | Falls through to PyTorch `torch.cuda.Event` (HIP events under ROCm) automatically. |
+| `rocm_profiler` (`RocmProfiler`) | Anything you intend to optimize | Two-phase: in-process median timing, then re-execs the same script under `rocprofv3` (sentinel: `_ROCM_PROFILER_INTERNAL`) for hardware counters. Produces a roofline PNG. |
+| `rocprofv3` directly | Full control over counter set | `rocprofv3 --stats --kernel-trace -- python script.py`; or `-i pmc.txt` for custom counters. |
+| `omnitrace` | Host + device timeline when Python overhead is suspect | Installed separately. |
 
-Start from the working example at
-[`benchmarks/rocm_benchmarks/bench_fa2_prefill.py`](../../../benchmarks/rocm_benchmarks/bench_fa2_prefill.py)
-and adapt:
+## Non-obvious gotchas
 
-```python
-# my_bench.py
-import torch
-import flashinfer
-from rocm_profiler import RocmProfiler, KernelConfig
+- **CUPTI is NVIDIA-only and `enable_cupti=True` WILL fail on ROCm.** [`flashinfer/testing/utils.py:1010`](../../../flashinfer/testing/utils.py) routes `enable_cupti=True` straight to `bench_gpu_time_with_cupti` with no HIP guard; `cupti-python` is not installable on ROCm. Leave `enable_cupti=False` (the default) — `bench_gpu_time` then uses `torch.cuda.Event` (HIP events under the hood).
+- **AITER backend constraints, accurately:**
+  - `kv_layout != "NHD"` → hard raise (`_check_kv_layout` / [`prefill_rocm.py:331`](../../../flashinfer/prefill_rocm.py)).
+  - Explicit `backend="aiter"` on non-gfx942/gfx950 → `RuntimeError`.
+  - `amd-aiter` not importable → `ImportError`.
+  - **"Native" page sizes** (no flat-gather): `{128, 256, 1024}` for `amd-aiter >= 0.1.10`, else `{16, 1024}` — see `_aiter_native_page_sizes()` in [`prefill_rocm.py:59`](../../../flashinfer/prefill_rocm.py). **Non-native page sizes are NOT rejected** — they go through a flat-gather code path. So the "{1, 16, 1024}" guidance from older docs is wrong.
+  - Auto-selection (no explicit `backend=`) silently falls back to `fa2` for: custom mask, dtype mismatch, head_dim mismatch, `pos_encoding_mode != "NONE"`.
+- **Always verify numerical parity before trusting perf numbers.** Compare default-HIP and AITER outputs first, e.g. `torch.testing.assert_close(got.float(), ref.float(), rtol=1e-2, atol=1e-2)`; tolerances this loose are expected for BF16/FP16.
+- **`gcnArchName` is the unambiguous arch marker.** Device strings show `cuda:0` on AMD too. Record `torch.cuda.get_device_properties(0).gcnArchName` and `torch.version.hip` alongside every number — a `gfx942` / ROCm 7.2 result is not comparable to a `gfx950` / ROCm 7.0.2 result.
 
-B, S, H_Q, H_KV, D = 1, 1024, 32, 8, 128
-dtype = torch.bfloat16
-q = torch.randn(S, H_Q, D, dtype=dtype, device="cuda")
-k = torch.randn(S, H_KV, D, dtype=dtype, device="cuda")
-v = torch.randn(S, H_KV, D, dtype=dtype, device="cuda")
+## What can actually be benchmarked on ROCm
 
-configs = [
-    KernelConfig(
-        name="s1024_causal",
-        run_fn=lambda: flashinfer.single_prefill_with_kv_cache_return_lse(
-            q, k, v, causal=True
-        ),
-        # FLOPs = 2 * S * S * H_Q * D (attention mat-muls), matches the formula
-        # used in benchmarks/rocm_benchmarks/bench_fa2_prefill.py.
-        theoretical_flops=2 * S * S * H_Q * D,
-        theoretical_bytes=(S * H_Q + 2 * S * H_KV) * D * dtype.itemsize,
-        label="seq=1024 causal",
-    ),
-]
+Only the APIs in the `IS_HIP` branch of [`flashinfer/__init__.py`](../../../flashinfer/__init__.py) are callable. **Not** available: MLA, cascade, POD, FP4, MoE, cuDNN backends. Don't try to import them.
 
-profiler = RocmProfiler(
-    configs=configs,
-    counters="roofline",            # or "compute", "memory", "occupancy", "stall", "basic"
-    kernel_name_regex="SinglePrefill",
-    output_dir="benchmarks/rocm_benchmarks",
-    label="my_bench",
-)
+The AITER backend is available for single prefill and batch prefill (paged + ragged); opt in via `backend="aiter"`. It is not available for decode, norm, rope, sampling, etc.
 
-if __name__ == "__main__":
-    profiler.run()
-```
-
-### Run it
-
-```bash
-# Full pipeline: timing + counter collection + roofline PNG
-python my_bench.py
-
-# Change the counter preset (see header of rocm_profiler.py for the full list)
-python my_bench.py --counters occupancy
-python my_bench.py --counters stall
-python my_bench.py --counters memory
-
-# Timing only (no rocprofv3 at all — fast sanity check)
-python my_bench.py --timing-only
-
-# Run profiling but skip the roofline plot
-python my_bench.py --skip-roofline
-
-# Regenerate the roofline plot from existing CSVs (no GPU required)
-python my_bench.py --replot
-
-# List all built-in counter presets
-python my_bench.py --list-presets
-```
-
-### Outputs
-
-```text
-benchmarks/rocm_benchmarks/<label>_timing.csv             # median + std per config
-benchmarks/rocm_benchmarks/<label>_counters.yml           # rocprofv3 input spec
-benchmarks/rocm_benchmarks/<label>_counter_collection.csv # raw counters
-benchmarks/rocm_benchmarks/<label>_roofline.png           # only for counters=roofline
-```
+## `rocm_profiler` counter presets
 
-### Counter presets worth knowing
+Pass via `RocmProfiler(counters=...)` or `--counters` on the driver script.
 
-| Preset | What it shows | Typical use |
+| Preset | What it shows | Use for |
 | --- | --- | --- |
-| `roofline` (default) | `FetchSize`, `WriteSize`, MFMA ops, TCC DRAM requests | Is the kernel compute- or memory-bound? |
-| `compute` | MFMA ops + cycle counters | Matrix-core throughput on CDNA3/4 |
-| `memory` | L2 and DRAM bandwidth breakdown | L2 hit-rate, HBM traffic |
-| `occupancy` | `SQ_WAVES`, `SQ_BUSY_CYCLES`, `SQ_VALU_MFMA_BUSY_CYCLES`, `SQ_WAIT_INST_ANY`, `SQ_INSTS_LDS` | Wavefront density, scheduler efficiency |
-| `stall` | `SQ_WAIT_INST_VMEM`, `SQ_WAIT_INST_LDS`, `SQ_BUSY_CYCLES` | Diagnose memory stalls |
+| `roofline` (default) | `FetchSize`, `WriteSize`, MFMA ops, TCC DRAM requests | "Am I compute- or memory-bound?" |
+| `compute` | MFMA ops + cycle counters | Matrix-core throughput |
+| `memory` | L2 + DRAM breakdown | L2 hit-rate, HBM traffic |
+| `occupancy` | `SQ_WAVES`, `SQ_BUSY_CYCLES`, `SQ_VALU_MFMA_BUSY_CYCLES`, `SQ_INSTS_LDS` | Wavefront density |
+| `stall` | `SQ_WAIT_INST_VMEM`, `SQ_WAIT_INST_LDS` | Diagnose memory stalls |
 | `basic` | `FetchSize` / `WriteSize` | Minimal baseline |
 
-You can also pass a path to a `rocprofv3`-native YAML if you need a counter combination that is
-not in the preset list.
-
-## Method 3: Raw `rocprofv3` invocation
+Or pass a path to a `rocprofv3`-native YAML for a custom counter set.
 
-If you need full control over the counter set, bypass the Python wrapper and use `rocprofv3`
-directly. This also works against any standalone Python script.
-
-```bash
-# Timeline + per-kernel stats
-rocprofv3 --stats --kernel-trace \
-    --output-format csv \
-    --output-directory rpf-out \
-    -- python my_bench.py
-
-# Hardware counters (supply your own pmc / counter-input file)
-cat > my_counters.txt <<'EOF'
-pmc: SQ_WAVES SQ_BUSY_CYCLES SQ_WAIT_INST_VMEM
-EOF
-rocprofv3 -i my_counters.txt \
-    --output-format csv \
-    --output-directory rpf-counters \
-    -- python my_bench.py
-```
+Driver script flags: `--timing-only` (skip rocprofv3), `--skip-roofline`, `--replot` (regenerate the PNG from existing CSVs, no GPU needed), `--list-presets`.
 
-Kernel-name filtering is available via `--kernel-rename` and regex selection via
-`--kernel-include-regex` in recent `rocprofv3` versions.
+Output (under `benchmarks/rocm_benchmarks/`, gitignored):
 
-## Reference checking
-
-When comparing the HIP-default and `backend="aiter"` paths (or any two backends), always verify
-numerical parity before trusting perf numbers:
-
-```python
-ref = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)                   # HIP
-got = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True, backend="aiter")  # AITER
-
-torch.testing.assert_close(got.float(), ref.float(), rtol=1e-2, atol=1e-2)
+```text
+<label>_timing.csv             # median + std per config
+<label>_counter_collection.csv # raw counters
+<label>_roofline.png           # only for counters=roofline
 ```
 
-Loose BF16 tolerances are expected; tighten for FP32-only ops.
-
-## Troubleshooting
+## Reproducibility checklist
 
-### Inconsistent results (large std)
-
-1. Raise `dry_run_iters` to 10–20 so the kernel cache and clocks settle.
-2. Raise `num_iters` to 50+ for sub-100-µs kernels.
-3. Pin the GPU clock:
+1. **Warm up.** `dry_run_iters >= 5`; raise to 10–20 if std is high. First call includes JIT compile.
+2. **Pin clocks** for sub-100-µs kernels:
 
    ```bash
-   # Query supported clocks
    rocm-smi --showclocks
-   # Lock SCLK / MCLK (requires sudo, restores on reboot)
    sudo rocm-smi --setsclk 7
    sudo rocm-smi --setmclk 3
    ```
 
-4. Disable ECC scrubbing interference: `sudo rocm-smi --resetprofile` between runs.
-
-### Kernel name does not match in `rocm_profiler`
-
-The `kernel_name_regex` you pass to `RocmProfiler` must match the mangled kernel name emitted by
-`rocprofv3`. If no rows appear in `<label>_counter_collection.csv`:
-
-```bash
-# 1. Dry-run to see what kernels are launched
-rocprofv3 --stats --kernel-trace --output-format csv \
-    --output-directory rpf-dbg -- python my_bench.py
-
-# 2. Inspect rpf-dbg/*_kernel_stats.csv and copy the name prefix into your driver.
-```
-
-### AITER backend errors
-
-If `backend="aiter"` raises before any timing runs, it is usually one of:
-
-- `page_size` not in `{1, 16, 1024}` (batch prefill + CK FMHA path).
-- `kv_layout != "NHD"`.
-- `amd-aiter` not installed.
-
-Fix the call or drop back to the default HIP backend for that config.
-
-### `rocm_profiler` hangs or produces empty CSV
-
-- Check that `rocprofv3` is on `PATH` and executable: `which rocprofv3`.
-- Make sure the driver script prints something from the `if __name__ == "__main__":` block —
-  the wrapper uses script output as a heartbeat.
-- Run with `--timing-only` first to confirm the kernel path itself works before involving
-  `rocprofv3`.
-
-## Best practices
-
-1. **Record the arch and ROCm version** alongside every perf number:
-
-   ```python
-   import torch
-   props = torch.cuda.get_device_properties(0)
-   print(props.name, props.gcnArchName, torch.version.hip)
-   ```
-
-   A `seq=1024` FA2 number on MI300X (`gfx942`, ROCm 7.2) is not comparable to one on MI350X
-   (`gfx950`, ROCm 7.0.2).
-
-2. **Always warm up.** First-call JIT compile will dominate the first measurement otherwise.
-   Use `dry_run_iters >= 5` and explicitly call the kernel once before timing in scripts that
-   measure the first iteration separately.
-
-3. **Verify correctness before performance.** A kernel that silently writes junk is always
-   faster than one that works.
-
-4. **Compare against the AITER backend where it exists.** For single / batch prefill on ROCm,
-   AITER's CK FMHA is often the competitive lower bound.
-
-5. **Prefer the `roofline` counter preset to start.** It instantly tells you whether further
-   optimization should target arithmetic intensity (MFMA ops) or HBM bandwidth (TCC DRAM
-   requests).
+3. **Record arch + ROCm version** in the log: `props = torch.cuda.get_device_properties(0)`, then `print(props.name, props.gcnArchName, torch.version.hip)`.
+4. **Isolate the GPU:** `HIP_VISIBLE_DEVICES=N` (or `ROCR_VISIBLE_DEVICES=N`, one layer deeper).
 
-## Related documentation
+## Troubleshooting `rocm_profiler`
 
-- [`CLAUDE.md`](../../../CLAUDE.md) — project overview and JIT architecture.
-- [`.claude/skills/add-rocm-kernel/SKILL.md`](../add-rocm-kernel/SKILL.md) — author a new kernel
-  to benchmark.
-- [`.claude/skills/debug-rocm-crash/SKILL.md`](../debug-rocm-crash/SKILL.md) — when a kernel
-  crashes during timing.
-- [`benchmarks/rocm_benchmarks/bench_fa2_prefill.py`](../../../benchmarks/rocm_benchmarks/bench_fa2_prefill.py)
-  — a real, working driver script to copy from.
-- [`rocm_profiler/rocm_profiler.py`](../../../rocm_profiler/rocm_profiler.py) — full API docs in
-  the module header.
-- `rocprofv3` docs: <https://rocm.docs.amd.com/projects/rocprofiler-sdk/>.
+- **Empty `_counter_collection.csv`:** `kernel_name_regex` doesn't match the mangled name. Run `rocprofv3 --stats --kernel-trace -- python my_bench.py` first and copy the prefix from `*_kernel_stats.csv`.
+- **Hang or no output:** confirm `rocprofv3` is on `PATH` (`which rocprofv3`). The wrapper uses script `print()` output as a heartbeat, so make sure the `if __name__ == "__main__":` block prints something.
+- **Use `--timing-only` first** to verify the kernel path works before involving `rocprofv3`.
diff --git a/.claude/skills/debug-rocm-crash/SKILL.md b/.claude/skills/debug-rocm-crash/SKILL.md
index d30cb896c3..1bbdb529ab 100644
--- a/.claude/skills/debug-rocm-crash/SKILL.md
+++ b/.claude/skills/debug-rocm-crash/SKILL.md
@@ -1,672 +1,81 @@
 ---
 name: debug-rocm-crash
-description: Tutorial for debugging HIP/ROCm kernel crashes in FlashInfer+ROCm using API logging plus HIP/ROCm runtime tooling
+description: Tutorial for debugging HIP kernel crashes in FlashInfer+ROCm using HIP/ROCm runtime tooling
 ---
 
-# Tutorial: Debugging ROCm Crashes in FlashInfer+ROCm
+# Debugging ROCm Crashes in FlashInfer+ROCm
 
-This guide shows how to debug HIP/ROCm kernel crashes and errors in the `amd-flashinfer` fork
-(CDNA3 `gfx942`, CDNA4 `gfx950`) using the `@flashinfer_api` logging decorator combined with
-ROCm's own debugging tools.
+> **Note:** earlier revisions of this skill (and CLAUDE.md) described a `@flashinfer_api`
+> decorator with `FLASHINFER_LOGLEVEL` / `FLASHINFER_LOGDEST` env vars. **That machinery does
+> not exist in this fork** (grep returns zero matches). Don't try to set those env vars —
+> use the HIP/ROCm tooling below instead.
 
-If you are used to upstream's `debug-cuda-crash` skill, the Python logging half is identical —
-`@flashinfer_api`, `FLASHINFER_LOGLEVEL`, `FLASHINFER_LOGDEST` all work unchanged on HIP. The
-CUDA-tooling half (`compute-sanitizer`, `cuda-gdb`, `CUDA_LAUNCH_BLOCKING`) is rewritten below
-using the ROCm equivalents.
+## The magic env-var combo
 
-## Goal
-
-When your code crashes on an AMD Instinct GPU with errors like:
-
-- `HIP error: the operation cannot be performed in the present state`
-- `hipErrorIllegalAddress`
-- `Memory access fault by GPU node-N (Agent handle: ...) on address 0x...`
-- `hipErrorOutOfMemory`
-- `RuntimeError: CUDA error: an illegal memory access was encountered` (PyTorch masquerades HIP
-  errors as CUDA errors)
-
-… you want to:
-
-- Capture input tensors BEFORE the crash (so the crash itself doesn't take the evidence with it).
-- Pinpoint exactly which kernel launch faulted.
-- Understand whether the bug is a shape mismatch, a bad page table / KV config, an AITER
-  limitation, or a genuine kernel bug.
-
-## Why use API logging?
-
-**Problem:** HIP faults frequently terminate the process with little more than a hex address,
-leaving no Python-level context.
-
-**Solution:** `@flashinfer_api` logs inputs (shape, dtype, device, strides, optionally min/max/mean
-and NaN/Inf counts) BEFORE the kernel runs. If the kernel crashes, the last log entry shows you
-exactly what data it received.
-
-## Step 1: Enable API logging
-
-### Basic (function names only)
-
-```bash
-export FLASHINFER_LOGLEVEL=1
-export FLASHINFER_LOGDEST=stdout
-
-python my_script.py
-```
-
-Output:
-
-```text
-[2026-04-21 10:30:45] FlashInfer API Call: single_prefill_with_kv_cache
-```
-
-### Detailed (inputs / outputs + metadata)
-
-```bash
-export FLASHINFER_LOGLEVEL=3
-export FLASHINFER_LOGDEST=debug.log
-
-python my_script.py
-```
-
-Example output in `debug.log`:
-
-```text
-================================================================================
-[2026-04-21 10:30:45] FlashInfer API Logging — System Information
-================================================================================
-FlashInfer version: 0.5.3+amd.1
-HIP / ROCm version: 7.1.1
-GPU 0: AMD Instinct MI300X
-  gcnArchName: gfx942:sramecc+:xnack-
-PyTorch version: 2.9.1+rocm7.1
-================================================================================
-
-[2026-04-21 10:30:46] FlashInfer API Call: batch_decode_with_paged_kv_cache
---------------------------------------------------------------------------------
-Positional input arguments:
-  arg[0]:
-    Tensor(
-      shape=(32, 8, 128)
-      dtype=torch.bfloat16
-      device=cuda:0
-      requires_grad=False
-      is_contiguous=True
-    )
-Keyword input arguments:
-  paged_kv_cache=
-    Tensor(
-      shape=(1024, 2, 8, 128)
-      dtype=torch.bfloat16
-      device=cuda:0
-      ...
-    )
-```
-
-Even though the device string shows `cuda:0`, the underlying device is an AMD GPU — this is
-expected because PyTorch's ROCm build reuses the `cuda` namespace. The `gcnArchName` line above
-is the unambiguous ROCm marker.
-
-### Full (with tensor statistics)
-
-```bash
-export FLASHINFER_LOGLEVEL=5
-export FLASHINFER_LOGDEST=debug.log
-
-python my_script.py
-```
-
-Adds:
-
-```text
-  Tensor(
-    shape=(32, 8, 128)
-    dtype=torch.bfloat16
-    device=cuda:0
-    requires_grad=False
-    is_contiguous=True
-    min=-3.125000
-    max=4.250000
-    mean=0.015625
-    nan_count=0
-    inf_count=0
-  )
-```
-
-Use level 5 when diagnosing numerical issues (NaN/Inf propagation). Note that HIP-graph capture
-paths auto-skip statistics; that is intentional and shows up as
-`[statistics skipped: HIP graph capture in progress]`.
-
-## Step 2: Force deterministic kernel launches before debugging
-
-HIP async launches make Python tracebacks point at the wrong line. Set these env vars **before**
-running your script:
-
-```bash
-export HIP_LAUNCH_BLOCKING=1
-export AMD_SERIALIZE_KERNEL=3
-```
-
-- `HIP_LAUNCH_BLOCKING=1` — force every HIP API call to be synchronous.
-- `AMD_SERIALIZE_KERNEL=3` — also serialize kernel launches through the queue. This is the
-  single most useful knob for `Memory access fault by GPU node-N` errors, because it pins the
-  fault to the *actual* faulting kernel rather than whichever subsequent launch happened to
-  finish first.
-
-Both are zero-overhead when there's no bug to chase, so enabling them in `pytest` runs while
-iterating on a new kernel is reasonable.
-
-## Step 3: Common ROCm errors and how to debug them
-
-### Error 1: Illegal memory access / GPU memory fault
-
-**Error messages** (any of these indicate the same class of bug):
-
-```text
-RuntimeError: CUDA error: an illegal memory access was encountered
-HIP error: hipErrorIllegalAddress
-Memory access fault by GPU node-1 (Agent handle: 0x...) on address 0x7f...
-VM_CONTEXT1_PROTECTION_FAULT_STATUS ... NO_RETRY: 0x0
-```
-
-**Recipe:**
-
-```bash
-export FLASHINFER_LOGLEVEL=3
-export FLASHINFER_LOGDEST=crash.log
-export AMD_SERIALIZE_KERNEL=3
-export HIP_LAUNCH_BLOCKING=1
-python my_script.py
-```
-
-In `crash.log`, find the **last** `FlashInfer API Call:` entry — that is the kernel that took
-the process down. Check:
-
-- Tensor **shapes** match what the kernel expects (head_dim, num_heads).
-- All tensors are on the same device (both `cuda:0`, not mixed `cuda:0` + `cpu`).
-- `is_contiguous=True` where required; non-contiguous strides are a classic cause of
-  out-of-bounds reads.
-- For paged-KV wrappers: `kv_indices` / `kv_indptr` values are within `[0, num_pages)`.
-
-**Common root causes in this fork:**
-
-- Wrong `head_dim_qk` / `head_dim_vo` mismatch between `q` and the KV cache.
-- CPU tensor accidentally passed to a GPU API.
-- Non-contiguous `q`/`k`/`v` from a `.transpose()` or `.view()` chain — add a `.contiguous()`.
-- Out-of-range `kv_indices` — often off-by-one when building page tables by hand.
-- **AITER-specific:** see dedicated section below.
-
-### Error 2: AITER backend crash (`backend="aiter"`)
-
-When using `backend="aiter"` on single or batch prefill, watch for two very specific gotchas
-(both documented in [`README.md`](../../../README.md) and enforced by the code in
-[`flashinfer/prefill_rocm.py`](../../../flashinfer/prefill_rocm.py)):
-
-| Symptom | Likely cause | Fix |
-| --- | --- | --- |
-| `ValueError` raised *before* any kernel launch | `page_size` is not in `{1, 16, 1024}` (batch prefill + CK FMHA) | Re-plan with one of the supported page sizes, or drop `backend="aiter"` for that call. |
-| `ValueError` about KV layout | `kv_layout != "NHD"` | Switch to `NHD` or use the default HIP backend. |
-| Hard GPU fault mid-kernel, no Python exception | `amd-aiter` version mismatch vs. the ROCm build | Reinstall `amd-aiter` matching your ROCm version (`--extra-index-url https://pypi.amd.com/rocm-<version>/simple`). |
-| `ModuleNotFoundError: aiter` | `amd-aiter` not installed | `pip install amd-aiter --index-url https://pypi.amd.com/simple/`. |
-
-If API logging shows a correct-looking call to a prefill API but the process dies with a GPU
-fault and no Python traceback, **disable the AITER backend** as a first step to see whether the
-bug is in AITER or in our side of the port.
-
-### Error 3: NaN / Inf values
-
-```text
-RuntimeError: ... returned nan or inf
-```
+For an unknown HIP fault, set these **before** running so the traceback points at the actual faulting kernel:
 
 ```bash
-export FLASHINFER_LOGLEVEL=5
-export FLASHINFER_LOGDEST=nan.log
-python my_script.py
+export AMD_SERIALIZE_KERNEL=3   # pins fault to the actual faulting kernel
+export HIP_LAUNCH_BLOCKING=1    # synchronous launches; tracebacks point at the right line
 ```
 
-Check `nan_count` / `inf_count` in the log. On CDNA3/4 the most common sources are:
-
-- FP8 path overflow — the `_fnuz` variants used on AMD
-  (`__hip_fp8_e4m3_fnuz`, `__hip_fp8_e5m2_fnuz`) have a different representable range than
-  NVIDIA's `__nv_fp8_e4m3`. A scale factor calibrated against an NVIDIA reference will
-  routinely overflow on ROCm.
-- A previous op producing `-inf` / `inf` that is then fed into `exp` (online softmax).
-- Uninitialized memory — `torch.empty(...)` vs `torch.zeros(...)`.
-
-### Error 4: Out of memory
+Both are cheap enough to leave on while iterating on a new kernel, but they serialize launches, so unset them before benchmarking.
 
-```text
-RuntimeError: HIP out of memory.
-```
+For an in-script view of what's being passed, add `print(t.shape, t.dtype, t.device, t.stride(), t.is_contiguous())` for each tensor argument `t`, followed by `torch.cuda.synchronize()`, immediately before the suspect FlashInfer call. This recovers by hand the information `@flashinfer_api` would have logged.
 
-```bash
-rocm-smi --showmeminfo vram --showpids
-export FLASHINFER_LOGLEVEL=3
-python my_script.py
-```
-
-Look for unexpectedly large tensor shapes in the last log entry. If the process keeps getting
-OOM-killed on healthy-looking shapes, check:
-
-- Zombie processes holding VRAM: `rocm-smi --showpids` and `kill -9` them.
-- JIT cache compile spike — set `MAX_JOBS=1` to cap concurrent ninja jobs during AOT builds.
-- Another tenant on the same GPU — pin to a single GPU with `HIP_VISIBLE_DEVICES=N`.
-
-### Error 5: Wrong dtype
-
-```text
-RuntimeError: expected scalar type BFloat16 but found Half
-```
-
-```bash
-export FLASHINFER_LOGLEVEL=3
-python my_script.py
-```
-
-In the log, look for the mismatching `dtype=` field. On ROCm, confirm:
-
-- If the op supports FP8 on your arch: `gfx942`/`gfx950` use the `_fnuz` FP8 variants — a
-  callsite that expects `torch.float8_e4m3fn` (NVIDIA's OCP FP8) will mis-dispatch. The
-  PyTorch dtype used on ROCm for `__hip_fp8_e4m3_fnuz` is `torch.float8_e4m3fnuz`.
-
-## Step 4: Multi-GPU / multi-process debugging
-
-For multi-rank runs use the `%i` pattern in the log destination:
-
-```bash
-export FLASHINFER_LOGLEVEL=3
-export FLASHINFER_LOGDEST=debug_rank_%i.log
-export HIP_VISIBLE_DEVICES=0,1,2,3      # restrict to specific GPUs
-# (or ROCR_VISIBLE_DEVICES — same effect, but applied earlier in the stack)
+## Per-error recipe
 
-torchrun --nproc_per_node=4 my_script.py
-```
-
-This produces `debug_rank_<pid>.log` per process. Use `HIP_VISIBLE_DEVICES` instead of
-`CUDA_VISIBLE_DEVICES` when you need to isolate a specific AMD device.
-
-If a specific GPU is misbehaving (ECC errors, firmware stuck), check it with
-`rocm-smi --showreset --showuniqueid --showproductname` and open `dmesg -wH` in another
-terminal.
-
-## Step 5: Advanced debugging with ROCm tools
-
-### `rocgdb` (CUDA-GDB equivalent)
-
-```bash
-export FLASHINFER_LOGLEVEL=3
-export FLASHINFER_LOGDEST=debug.log
-export AMD_SERIALIZE_KERNEL=3
-export HIP_LAUNCH_BLOCKING=1
-
-rocgdb --args python my_script.py
-```
-
-Inside `rocgdb`:
-
-```text
-(rocgdb) catch throw
-(rocgdb) run
-(rocgdb) bt            # stack trace at the crash point
-(rocgdb) info agents   # list GPUs
-(rocgdb) info wavefronts
-```
-
-For attaching to a running process (e.g. a hang), set before you launch your script:
-
-```bash
-export ROCM_DEBUG_WAIT_FOR_DEBUGGER=1
-```
-
-Then `rocgdb -p <pid>` attaches; no debugger attached → the process waits at the first GPU API
-call.
-
-### HIP / HSA runtime tracing
-
-```bash
-export AMD_LOG_LEVEL=3         # HIP API + stream trace
-# export AMD_LOG_LEVEL=4       # very verbose, includes arg decoding
-export HSA_ENABLE_DEBUG=1      # one layer below HIP (runtime queues, agents)
-python my_script.py 2> hip.trace
-```
-
-Grep for `hipLaunchKernel`, `hipMemcpy`, and `error` in `hip.trace`. The trace is linear with
-`HIP_LAUNCH_BLOCKING=1`, which makes it possible to correlate each FlashInfer API call with the
-exact underlying HIP launches.
-
-### Device state snapshots with `rocm-smi`
-
-Leave this running in another terminal while reproducing a hang:
-
-```bash
-watch -n 1 'rocm-smi --showuse --showmeminfo vram --showpids --showprofile'
-```
-
-Watch for:
-
-- GPU stuck at 100% but no `SQ` activity — kernel is looping.
-- VRAM pinned high after your process exits — another process is still holding it.
-- Throttling indicators (`POWERCAP`, `THERMAL`) — reproduce on a cooler box before filing a
-  kernel bug.
-
-### `dmesg` for firmware-level faults
-
-```bash
-sudo dmesg -T | grep -i -E "amdgpu|kfd|vm_fault" | tail -100
-```
-
-`VM_CONTEXT1_PROTECTION_FAULT_STATUS` entries here tell you page-fault class, access type, and
-the offending address — useful when the Python log only says `hipErrorIllegalAddress`.
-
-## Step 6: Kernel-level debugging with `printf`
-
-`printf()` works inside HIP device code exactly the way it does on CUDA:
-
-```cpp
-__global__ void MyKernel(const float* __restrict__ input,
-                         float* __restrict__ output, int n) {
-  int idx = blockIdx.x * blockDim.x + threadIdx.x;
-
-  // Print from one thread per block to avoid flood
-  if (threadIdx.x == 0 && blockIdx.x == 0) {
-    printf("n=%d, input[0]=%f\n", n, input[0]);
-  }
-
-  if (idx < n) {
-    output[idx] = input[idx] * 2.0f;
-  }
-}
-```
-
-Flush after the launch from Python:
-
-```python
-my_kernel(input, output)
-torch.cuda.synchronize()  # Flushes device printf buffer on ROCm too
-```
-
-### Warp / wavefront considerations
-
-The wavefront size on CDNA3 and CDNA4 is **64** (not 32 as on NVIDIA). Adjust any
-representative-thread logic accordingly:
-
-```cpp
-// CDNA3/4: wavefront size = 64
-if (threadIdx.x % 64 == 0) {
-  printf("Wavefront %d processing\n", threadIdx.x / 64);
-}
-```
-
-Common mistake ported blindly from a CUDA example:
-
-```cpp
-// ❌ Assumes warp size 32; prints from thread 32 of a CDNA wavefront
-if (threadIdx.x % 32 == 0) {
-  printf("...");
-}
-```
-
-Use `warpSize` (a built-in `unsigned int`) when writing portable code.
-
-### Device asserts
-
-```cpp
-assert(value >= 0.0f && "Value must be non-negative");
-```
-
-Build with JIT debug flags to make these trip reliably:
-
-```bash
-export FLASHINFER_JIT_VERBOSE=1
-```
-
-(Unlike upstream there is no `FLASHINFER_JIT_DEBUG=1` `-O0 -g -G` mode on the HIP side yet;
-`-O0` is not wired into `hipcc` invocations. Add `-g` via `extra_cuda_cflags` temporarily in
-the JIT generator while debugging.)
-
-## Environment Variables Reference
-
-### FlashInfer logging
-
-| Variable | Values | Description |
-| --- | --- | --- |
-| `FLASHINFER_LOGLEVEL` | `0` | No logging (default). Zero overhead. |
-| | `1` | Function names only. |
-| | `3` | Inputs/outputs with shape/dtype/device/strides. |
-| | `5` | + min/max/mean/nan/inf statistics. |
-| `FLASHINFER_LOGDEST` | `stdout` | Console (default). |
-| | `stderr` | Stderr. |
-| | `<path>` | File. |
-| | `log_%i.txt` | Multi-process; `%i` expands to PID. |
-| `FLASHINFER_JIT_VERBOSE` | `1` | Print every `hipcc` invocation and build command. |
-
-### HIP / ROCm runtime
-
-| Variable | Effect |
+| Symptom | First check |
 | --- | --- |
-| `HIP_LAUNCH_BLOCKING=1` | Force synchronous launches (stack traces pin the faulting kernel). |
-| `AMD_SERIALIZE_KERNEL=3` | Serialize kernel launches through the queue. |
-| `AMD_LOG_LEVEL=3` (or `4`) | HIP API trace. |
-| `HSA_ENABLE_DEBUG=1` | HSA runtime trace. |
-| `HIP_VISIBLE_DEVICES=0,1` | Restrict visible GPUs (preferred on ROCm). |
-| `ROCR_VISIBLE_DEVICES=0,1` | Same as above, applied one layer deeper. |
-| `ROCM_DEBUG_WAIT_FOR_DEBUGGER=1` | Block until `rocgdb` attaches. |
+| `Memory access fault by GPU node-N` / `hipErrorIllegalAddress` / "CUDA error: an illegal memory access was encountered" (PyTorch's ROCm build reports HIP errors as "CUDA" errors) | Run with the env combo above. Print tensor shapes/dtypes/strides just before the call. Verify: `is_contiguous()` where required, all tensors on the same `cuda:N`, `kv_indices` within `[0, num_pages)`, and `head_dim_qk` / `head_dim_vo` consistent between Q and the KV cache. |
+| `backend="aiter"` `ValueError` before launch | `kv_layout != "NHD"` (only NHD is allowed — see [`prefill_rocm.py:331`](../../../flashinfer/prefill_rocm.py)). |
+| `backend="aiter"` `RuntimeError` | Non-gfx942/gfx950 GPU. |
+| `backend="aiter"` `ImportError` | `amd-aiter` not installed (`pip install amd-aiter --index-url https://pypi.amd.com/simple/`). |
+| `backend="aiter"` hard GPU fault mid-kernel | `amd-aiter` version mismatch vs. ROCm. Reinstall matching your ROCm version. Try the default HIP backend to confirm the bug is in AITER, not our side. |
+| NaN / Inf in outputs | Insert `torch.isnan(t).any()` / `torch.isinf(t).any()` checks around the call. Common sources on CDNA3/4: (1) `_fnuz` FP8 has a different representable range than NVIDIA OCP FP8, so scale factors calibrated against NVIDIA references overflow; (2) `-inf` / `inf` from a previous op fed into `exp` (online softmax); (3) uninitialized memory from `torch.empty` where `torch.zeros` was needed. |
+| `HIP out of memory` | `rocm-smi --showmeminfo vram --showpids` — kill zombies. JIT-compile spike → `MAX_JOBS=1`. Other tenant → `HIP_VISIBLE_DEVICES=N`. |
+| `expected scalar type X but found Y` (FP8 callsites) | PyTorch dtype for `_fnuz` FP8 is `torch.float8_e4m3fnuz` / `torch.float8_e5m2fnuz`, **not** `torch.float8_e4m3fn` (which is NVIDIA OCP FP8). A callsite expecting `e4m3fn` mis-dispatches on ROCm. |
+
+## ROCm-specific tooling
+
+| Tool | Use |
+| --- | --- |
+| `rocgdb --args python my_script.py` | CUDA-GDB equivalent. Inside: `catch throw`, `run`, `bt`, `info agents`, `info wavefronts`. |
+| `ROCM_DEBUG_WAIT_FOR_DEBUGGER=1` | Process blocks at first GPU API call until `rocgdb -p <pid>` attaches. |
+| `AMD_LOG_LEVEL=3` (or `4`) | HIP API + stream trace. Linear under `HIP_LAUNCH_BLOCKING=1`, so each Python call correlates 1:1 with HIP launches. |
+| `HSA_ENABLE_DEBUG=1` | HSA layer trace (one below HIP — queues, agents). |
+| `sudo dmesg -T \| grep -iE 'amdgpu\|kfd\|vm_fault'` | `VM_CONTEXT1_PROTECTION_FAULT_STATUS` gives page-fault class, access type, offending address — useful when Python only says `hipErrorIllegalAddress`. |
+| `watch -n 1 'rocm-smi --showuse --showmeminfo vram --showpids'` | Hang diagnosis: 100% GPU + no SQ activity = looping kernel; VRAM still pinned after exit = another process holds it. |
 
-## Best practices
+`compute-sanitizer` has **no direct ROCm equivalent**; `rocgdb` stands in for `cuda-gdb`. The closest workflow is the env-var combo above plus `rocgdb`.
 
-### 1. Always start with `FLASHINFER_LOGLEVEL=3`
+## AMD-specific gotchas
 
-```bash
-export FLASHINFER_LOGLEVEL=3
-```
-
-Gives you tensor metadata without overwhelming output.
+- **PyTorch's ROCm masquerade.** Device strings show `cuda:0` on AMD; "CUDA error" messages may be HIP errors. The unambiguous arch field is `torch.cuda.get_device_properties(0).gcnArchName`.
+- **Wavefront = 64**, not 32. Any representative-thread `printf` ported from CUDA needs `threadIdx.x % 64 == 0` (or use the `warpSize` builtin).
+- **`FLASHINFER_JIT_DEBUG=1` is wired on the CUDA path only.** On HIP it adds no debug build flags (no `-O0 -g`). Append `"-O0", "-g"` via `extra_cuda_cflags` in the JIT generator for the op being debugged (the HIP path injects `-O3` first, so the trailing `-O0` is what overrides it), clear `~/.cache/flashinfer/`, and retry. See CLAUDE.md "Non-Obvious Gotchas".
+- **HIP installs are stripped.** `rocgdb` exits with `no symbol table loaded` unless you rebuild with `-g` (see previous bullet).
+- **Device `printf` flushes on `torch.cuda.synchronize()`** — works the same as CUDA.
+- **`HIP_VISIBLE_DEVICES`** is the canonical AMD scoping env var (`ROCR_VISIBLE_DEVICES` works one layer deeper). `CUDA_VISIBLE_DEVICES` may also be honored by PyTorch.
 
-### 2. Combine with `AMD_SERIALIZE_KERNEL=3` on first reproduction
+## Quick recipes
 
 ```bash
-export FLASHINFER_LOGLEVEL=3
+# Hard GPU fault
 export AMD_SERIALIZE_KERNEL=3
 export HIP_LAUNCH_BLOCKING=1
-```
-
-This is the single most useful env combination for debugging an unknown HIP fault.
-
-### 3. Log to a file for crashes
-
-```bash
-export FLASHINFER_LOGDEST=crash.log
-```
-
-Console output can be lost when the process SIGKILLs itself on a GPU fault.
-
-### 4. Compare before / after on the last API call
-
-- Last successful `FlashInfer API Call:` with **both** inputs and outputs logged — OK.
-- Last `FlashInfer API Call:` with inputs logged but **no outputs** — that's your crash site.
-
-### 5. Disable logging in production
-
-```bash
-unset FLASHINFER_LOGLEVEL     # or export FLASHINFER_LOGLEVEL=0
-```
-
-The `@flashinfer_api` decorator short-circuits to a zero-overhead path when disabled.
-
-## Troubleshooting the debugger itself
-
-### No logs appear
-
-- Verify the API you're calling actually has `@flashinfer_api` on it — decoration coverage is
-  a work in progress; a handful of low-level APIs may not be wrapped yet.
-- Check the env vars are exported in the right shell:
-
-  ```bash
-  echo $FLASHINFER_LOGLEVEL  # expect "3"
-  echo $FLASHINFER_LOGDEST   # expect path or "stdout"
-  ```
-
-### Statistics skipped at level 5
-
-```text
-[statistics skipped: HIP graph capture in progress]
-```
-
-Expected: min/max/mean/nan/inf would require synchronization that is illegal during graph
-capture. Temporarily drop to `FLASHINFER_LOGLEVEL=3` if you need inputs from inside a captured
-graph.
-
-### `rocgdb` exits immediately with `no symbol table loaded`
-
-`pip install`-installed HIP binaries are often stripped. Reinstall with
-`-DCMAKE_BUILD_TYPE=RelWithDebInfo` or add `"-g"` to `extra_cuda_cflags` in the JIT generator
-for the op you are debugging, clear `~/.cache/flashinfer/`, and retry.
-
-## Quick examples
-
-### Debug shape mismatch
-
-```bash
-export FLASHINFER_LOGLEVEL=3
-export FLASHINFER_LOGDEST=stdout
 python my_script.py
-# Read tensor shapes in stdout
-```
-
-### Debug NaN / Inf
-
-```bash
-export FLASHINFER_LOGLEVEL=5
-export FLASHINFER_LOGDEST=nan.log
-python my_script.py
-# Grep "nan_count=" / "inf_count=" in nan.log
-```
-
-### Debug a hard GPU fault
+# Python traceback now points at the right call. Also: sudo dmesg -T | tail -50
 
-```bash
-export FLASHINFER_LOGLEVEL=3
-export FLASHINFER_LOGDEST=gpu_fault.log
+# Step into a kernel
 export AMD_SERIALIZE_KERNEL=3
 export HIP_LAUNCH_BLOCKING=1
-python my_script.py
-# Last entry in gpu_fault.log is the faulting call.
-# Also check `sudo dmesg -T | tail -50` for VM_CONTEXT1_PROTECTION_FAULT_STATUS.
-```
-
-### Debug multi-GPU
-
-```bash
-export FLASHINFER_LOGLEVEL=3
-export FLASHINFER_LOGDEST=rank_%i.log
-export HIP_VISIBLE_DEVICES=0,1,2,3
-torchrun --nproc_per_node=4 train.py
-# Inspect rank_*.log files per process.
-```
-
-### Full `rocgdb` session
-
-```bash
-export FLASHINFER_LOGLEVEL=3
-export FLASHINFER_LOGDEST=debug.log
-export AMD_SERIALIZE_KERNEL=3
 rocgdb --args python my_script.py
 # (rocgdb) catch throw
 # (rocgdb) run
 # (rocgdb) bt
-```
-
-## Example: full debug session
-
-### Your code crashes
-
-```python
-import torch
-import flashinfer
 
-q  = torch.randn(32, 8, 128, dtype=torch.bfloat16, device="cuda")
-kv = torch.randn(1024, 2, 8, 64, dtype=torch.bfloat16, device="cuda")   # wrong head_dim!
-
-out = flashinfer.single_decode_with_kv_cache(q, kv[:, 0], kv[:, 1])     # crashes
-```
-
-Output:
-
-```text
-Memory access fault by GPU node-1 (Agent handle: 0x...) on address 0x7f9d...
-```
-
-### Enable logging + deterministic launches
-
-```bash
-export FLASHINFER_LOGLEVEL=3
-export FLASHINFER_LOGDEST=debug.log
-export AMD_SERIALIZE_KERNEL=3
-export HIP_LAUNCH_BLOCKING=1
-python test.py
-```
-
-### Read `debug.log`
-
-```text
-[...] FlashInfer API Call: single_decode_with_kv_cache
-Positional input arguments:
-  arg[0]:
-    Tensor(shape=(32, 8, 128), dtype=torch.bfloat16, device=cuda:0, ...)
-  arg[1]:
-    Tensor(shape=(1024, 8, 64), dtype=torch.bfloat16, device=cuda:0, ...)   # ← head_dim=64, not 128
-  arg[2]:
-    Tensor(shape=(1024, 8, 64), dtype=torch.bfloat16, device=cuda:0, ...)   # ← also wrong
-```
-
-### Fix
-
-```python
-kv = torch.randn(1024, 2, 8, 128, dtype=torch.bfloat16, device="cuda")  # fixed
-```
-
-### Success
-
-```bash
-python test.py
-# No crash; debug.log shows both the call and the output tensor.
+# HIP API trace
+AMD_LOG_LEVEL=3 HIP_LAUNCH_BLOCKING=1 python my_script.py 2> hip.trace
+# grep hipLaunchKernel / hipMemcpy / error in hip.trace
 ```
-
-## Summary
-
-1. Before anything else:
-
-   ```bash
-   export FLASHINFER_LOGLEVEL=3
-   export FLASHINFER_LOGDEST=debug.log
-   export AMD_SERIALIZE_KERNEL=3
-   export HIP_LAUNCH_BLOCKING=1
-   ```
-
-2. Reproduce the crash. Inputs are logged BEFORE each kernel runs, so the last entry tells you
-   which call faulted.
-
-3. If the shape/dtype/device picture in the log looks correct, escalate to
-   `AMD_LOG_LEVEL=3`, then to `rocgdb`, then to `dmesg` for VM-level faults.
-
-4. For AITER crashes, check the layout/page-size invariants first — they cover a large fraction
-   of "illegal address" reports in practice.
-
-5. Disable logging when done:
-
-   ```bash
-   export FLASHINFER_LOGLEVEL=0
-   ```
-
-## Related documentation
-
-- [`CLAUDE.md`](../../../CLAUDE.md) — project overview; see the "Debugging" and "API Logging"
-  sections for background.
-- [`.claude/skills/add-rocm-kernel/SKILL.md`](../add-rocm-kernel/SKILL.md) — when you are
-  debugging a kernel you just wrote.
-- [`.claude/skills/benchmark-kernel/SKILL.md`](../benchmark-kernel/SKILL.md) — when the crash
-  only happens under profiling.
-- ROCm debugging documentation: <https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/debugging.html>
-- `rocgdb` user guide: <https://rocm.docs.amd.com/projects/llvm-project/en/latest/reference/rocgdb.html>
-- Upstream's [`debug-cuda-crash` skill](https://github.com/flashinfer-ai/flashinfer/blob/main/.claude/skills/debug-cuda-crash/SKILL.md) —
-  the source this tutorial was adapted from; useful when cross-referencing a bug that reproduces
-  on both backends.
diff --git a/CLAUDE.md b/CLAUDE.md
index 008a09eb5f..2293d3fd68 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -15,7 +15,6 @@
 | Set target arch | `export FLASHINFER_ROCM_ARCH_LIST="gfx942,gfx950"` |
 | Limit parallel build | `export MAX_JOBS=4` |
 | Verbose JIT output | `export FLASHINFER_JIT_VERBOSE=1` |
-| Debug build (-O0) | `export FLASHINFER_JIT_DEBUG=1` |
 | Run linting | `pre-commit run -a` |
 
 ## Installing Torch
@@ -37,6 +36,12 @@ See the [GPU and ROCm Support](README.md#gpu-and-rocm-support) table in
 the file is missing. Changing env vars (`FLASHINFER_ROCM_ARCH_LIST`, extra
 cflags) is a **silent no-op** unless you call `spec.write_ninja()` first.
 
+**`FLASHINFER_JIT_DEBUG=1` is a CUDA-only no-op**: the env var is read in
+[`flashinfer/jit/core.py`](flashinfer/jit/core.py) only on the `IS_CUDA` branch
+(adds `-O0 -g -G`). The `IS_HIP` branch ignores it. To get a debug build on
+ROCm, add `"-g"` (and remove `-O3`) via `extra_cuda_cflags` in the op's JIT
+generator and clear `~/.cache/flashinfer/`.
+
 **Framework separation**: Torch headers **must not** be included in `include/`
 files. `include/` is framework-agnostic (raw pointers only);
 `flashinfer/csrc_rocm/` is where PyTorch tensor handling lives. Violations
@@ -82,13 +87,3 @@ gh api repos/ROCm/flashinfer/pulls/<number> --method PATCH --field body="<body>"
 # Or from a file
 gh api repos/ROCm/flashinfer/pulls/<number> --method PATCH --field body="$(cat /tmp/pr_body.md)"
 ```
-
-## Plan Files
-
-Save approved plans to the Claude Code project memory directory for this repo
-(visible via `/memory` in Claude Code).
-
-**Naming:** `plan_<short_descriptive_slug>.md`
-
-**Index:** add a one-line entry to `MEMORY.md` in that same directory:
-`- [Plan: <title>](plan_<slug>.md) — <one-line summary>`

From a2263ab3a51dc0aabdfa7848bb67ec60cf39eb91 Mon Sep 17 00:00:00 2001
From: Debasis Mandal <debasis.mandal@amd.com>
Date: Wed, 20 May 2026 18:20:35 +0000
Subject: [PATCH 3/5] docs: add PR description convention to CLAUDE.md

Documents the PR body structure (Summary / What changed / Architecture
notes / Benchmark results / Test plan) so future sessions produce
consistent PRs without rediscovering the format each time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 CLAUDE.md | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/CLAUDE.md b/CLAUDE.md
index 2293d3fd68..4292a25e6f 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -87,3 +87,20 @@ gh api repos/ROCm/flashinfer/pulls/<number> --method PATCH --field body="<body>"
 # Or from a file
 gh api repos/ROCm/flashinfer/pulls/<number> --method PATCH --field body="$(cat /tmp/pr_body.md)"
 ```
+
+## PR Description
+
+**Body** — include sections that apply, skip the rest:
+
+- `## Summary` — 1–3 sentences on what and why.
+- `### What changed` with `####` per component when the PR spans multiple
+  subsystems. Bullet by file: ``- **`path`** — one-line purpose``. Call out
+  non-obvious design choices.
+- `### Architecture / design notes` — only when there's a real choice to record.
+  Tables for routing/dispatch logic; explain *why*.
+- `## Benchmark results` — for perf-touching PRs. Shape line + table per entry
+  point + mean overhead/speedup row.
+- `## Test plan` — checklist of what was actually run (not aspirational), ending
+  with `pre-commit run -a`.
+
+Don't restate the diff or the commits; explain non-obvious decisions and surprising behaviors.

From 11d9c5ea7754e176fd6fcf063ad1f94a18e705d2 Mon Sep 17 00:00:00 2001
From: Debasis Mandal <debasis.mandal@amd.com>
Date: Wed, 20 May 2026 18:33:15 +0000
Subject: [PATCH 4/5] docs: address Copilot review on PR #237
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- CLAUDE.md: clarify FLASHINFER_JIT_DEBUG wording (was ambiguous —
  "CUDA-only no-op" parses two ways).
- add-rocm-kernel: drop the @flashinfer_api reference (no such
  decorator exists in this fork).
- benchmark-kernel: fix AITER kv_layout!=NHD citation — the hard
  raise lives in the wrapper plan() (prefill_rocm.py:1978/2920),
  not in _check_kv_layout or the auto-selection function. Expand
  the auto-selection fallback list to match _auto_select_prefill_backend.
- debug-rocm-crash: same kv_layout citation fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
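The expanded auto-selection fallback list in benchmark-kernel can be read as a single predicate. The sketch below is a hypothetical restatement of the documented conditions using plain strings; the function name and its parameters are illustrative assumptions, and `_auto_select_prefill_backend()` in `prefill_rocm.py` remains the authoritative source.

```python
def falls_back_to_fa2(kv_layout, has_custom_mask, dtype_q, dtype_kv,
                      head_dim_qk, head_dim_vo, pos_encoding_mode,
                      aiter_importable):
    """Hypothetical predicate: True if any documented fallback condition holds."""
    return (
        kv_layout != "NHD"
        or has_custom_mask
        # dtype_q == dtype_kv is also required, so checking dtype_q suffices here.
        or dtype_q not in {"fp16", "bf16"}
        or dtype_q != dtype_kv
        or head_dim_qk != head_dim_vo
        or pos_encoding_mode != "NONE"
        or not aiter_importable
    )


# A configuration that meets every documented requirement.
ok = dict(kv_layout="NHD", has_custom_mask=False, dtype_q="bf16",
          dtype_kv="bf16", head_dim_qk=128, head_dim_vo=128,
          pos_encoding_mode="NONE", aiter_importable=True)

print(falls_back_to_fa2(**ok))                           # False
print(falls_back_to_fa2(**{**ok, "kv_layout": "HND"}))   # True
```

Writing the conditions as one `or` chain makes the difference from the explicit `backend="aiter"` path visible: here a violated condition only changes the selected backend, whereas the explicit path raises.
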
 .claude/skills/add-rocm-kernel/SKILL.md  | 2 +-
 .claude/skills/benchmark-kernel/SKILL.md | 4 ++--
 .claude/skills/debug-rocm-crash/SKILL.md | 2 +-
 CLAUDE.md                                | 8 ++++----
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/.claude/skills/add-rocm-kernel/SKILL.md b/.claude/skills/add-rocm-kernel/SKILL.md
index 8623bc3a09..67540ed3da 100644
--- a/.claude/skills/add-rocm-kernel/SKILL.md
+++ b/.claude/skills/add-rocm-kernel/SKILL.md
@@ -22,7 +22,7 @@ For a complete worked example to copy, read these together:
 | 3 | `flashinfer/csrc_rocm/flashinfer_<op>_binding.cu` | `TORCH_LIBRARY_FRAGMENT(TORCH_EXTENSION_NAME, m) { m.def("<op>", <op>); }`. |
 | 4 (opt) | `flashinfer/csrc_rocm/<op>_customize_config.jinja` | Compile-time type specialization. Skip if runtime dispatch is enough. |
 | 5 | `flashinfer/jit/<op>.py` | `gen_<op>_module() -> JitSpec` via `gen_jit_spec(...)`. |
-| 6 | `flashinfer/<op>.py` | Python API: `@functools.cache` module loader, optional `@flashinfer_api`, destination-passing (`out=`). |
+| 6 | `flashinfer/<op>.py` | Python API: `@functools.cache` module loader, destination-passing (`out=`). |
 | 7 | `tests/rocm_tests/test_<op>_hip.py` | Correctness tests; FP32 reference math, loose BF16 tolerances. |
 | 8 | `flashinfer/jit/__init__.py` (`IS_HIP` branch) | `from .<op> import gen_<op>_module as gen_<op>_module`. |
 | 9 | `flashinfer/__init__.py` (`IS_HIP` branch) | `from .<op> import <op> as <op>`. |
diff --git a/.claude/skills/benchmark-kernel/SKILL.md b/.claude/skills/benchmark-kernel/SKILL.md
index 2f4c7de13a..0e86a98b35 100644
--- a/.claude/skills/benchmark-kernel/SKILL.md
+++ b/.claude/skills/benchmark-kernel/SKILL.md
@@ -22,11 +22,11 @@ For the in-repo profiler wrapper, see [`rocm_profiler/rocm_profiler.py`](../../.
 
 - **CUPTI is NVIDIA-only and `enable_cupti=True` WILL fail on ROCm.** [`flashinfer/testing/utils.py:1010`](../../../flashinfer/testing/utils.py) routes `enable_cupti=True` straight to `bench_gpu_time_with_cupti` with no HIP guard; `cupti-python` is not installable on ROCm. Leave `enable_cupti=False` (the default) — `bench_gpu_time` then uses `torch.cuda.Event` (HIP events under the hood).
 - **AITER backend constraints, accurately:**
-  - `kv_layout != "NHD"` → hard raise (`_check_kv_layout` / [`prefill_rocm.py:331`](../../../flashinfer/prefill_rocm.py)).
+  - Explicit `backend="aiter"` + `kv_layout != "NHD"` → `ValueError` at `plan()` time. Raised in the prefill wrapper, e.g. [`prefill_rocm.py:1978`](../../../flashinfer/prefill_rocm.py) (single/paged) and the batch-paged wrapper around line 2920. Not raised by auto-selection — that path silently falls back to `fa2`.
   - Explicit `backend="aiter"` on non-gfx942/gfx950 → `RuntimeError`.
   - `amd-aiter` not importable → `ImportError`.
   - **"Native" page sizes** (no flat-gather): `{128, 256, 1024}` for `amd-aiter >= 0.1.10`, else `{16, 1024}` — see `_aiter_native_page_sizes()` in [`prefill_rocm.py:59`](../../../flashinfer/prefill_rocm.py). **Non-native page sizes are NOT rejected** — they go through a flat-gather code path. So the "{1, 16, 1024}" guidance from older docs is wrong.
-  - Auto-selection (no explicit `backend=`) silently falls back to `fa2` for: custom mask, dtype mismatch, head_dim mismatch, `pos_encoding_mode != "NONE"`.
+  - Auto-selection (no explicit `backend=`) silently falls back to `fa2` for any of: `kv_layout != "NHD"`, custom mask, dtype not in `{fp16, bf16}`, `dtype_q != dtype_kv`, `head_dim_qk != head_dim_vo`, `pos_encoding_mode != "NONE"`, or `amd-aiter` not importable. See `_auto_select_prefill_backend()` in [`prefill_rocm.py:311`](../../../flashinfer/prefill_rocm.py) for the authoritative list.
 - **Always verify numerical parity before trusting perf numbers.** Compare default-HIP vs AITER outputs with `torch.testing.assert_close(rtol=1e-2, atol=1e-2)` for BF16/FP16 first.
 - **`gcnArchName` is the unambiguous arch marker.** Device strings show `cuda:0` on AMD too. Record `torch.cuda.get_device_properties(0).gcnArchName` and `torch.version.hip` alongside every number — a `gfx942` / ROCm 7.2 result is not comparable to a `gfx950` / ROCm 7.0.2 result.
 
diff --git a/.claude/skills/debug-rocm-crash/SKILL.md b/.claude/skills/debug-rocm-crash/SKILL.md
index 1bbdb529ab..70fade705a 100644
--- a/.claude/skills/debug-rocm-crash/SKILL.md
+++ b/.claude/skills/debug-rocm-crash/SKILL.md
@@ -28,7 +28,7 @@ For an in-script view of what's being passed, wrap the suspect call with `print(
 | Symptom | First check |
 | --- | --- |
 | `Memory access fault by GPU node-N` / `hipErrorIllegalAddress` / "CUDA error: illegal memory access" (PyTorch's ROCm reports HIP errors as "CUDA" errors) | Run with the env combo above. Print tensor shapes/dtypes/strides just before the call. Verify: `is_contiguous()` where required, all tensors on the same `cuda:N`, `kv_indices` within `[0, num_pages)`, `head_dim_qk` matches between Q and KV. |
-| `backend="aiter"` `ValueError` before launch | `kv_layout != "NHD"` (only NHD is allowed — see [`prefill_rocm.py:331`](../../../flashinfer/prefill_rocm.py)). |
+| `backend="aiter"` `ValueError` before launch | `kv_layout != "NHD"` (only NHD is allowed — raised in the prefill wrapper's `plan()`, e.g. [`prefill_rocm.py:1978`](../../../flashinfer/prefill_rocm.py)). |
 | `backend="aiter"` `RuntimeError` | Non-gfx942/gfx950 GPU. |
 | `backend="aiter"` `ImportError` | `amd-aiter` not installed (`pip install amd-aiter --index-url https://pypi.amd.com/simple/`). |
 | `backend="aiter"` hard GPU fault mid-kernel | `amd-aiter` version mismatch vs. ROCm. Reinstall matching your ROCm version. Try the default HIP backend to confirm the bug is in AITER, not our side. |
diff --git a/CLAUDE.md b/CLAUDE.md
index 4292a25e6f..47b420a304 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -36,11 +36,11 @@ See the [GPU and ROCm Support](README.md#gpu-and-rocm-support) table in
 the file is missing. Changing env vars (`FLASHINFER_ROCM_ARCH_LIST`, extra
 cflags) is a **silent no-op** unless you call `spec.write_ninja()` first.
 
-**`FLASHINFER_JIT_DEBUG=1` is a CUDA-only no-op**: the env var is read in
+**`FLASHINFER_JIT_DEBUG=1` is a no-op on ROCm/HIP**: the env var is read in
 [`flashinfer/jit/core.py`](flashinfer/jit/core.py) only on the `IS_CUDA` branch
-(adds `-O0 -g -G`). The `IS_HIP` branch ignores it. To get a debug build on
-ROCm, add `"-g"` (and remove `-O3`) via `extra_cuda_cflags` in the op's JIT
-generator and clear `~/.cache/flashinfer/`.
+(where it adds `-O0 -g -G`). The `IS_HIP` branch ignores it entirely. To get a
+debug build on ROCm, add `"-g"` (and remove `-O3`) via `extra_cuda_cflags` in
+the op's JIT generator and clear `~/.cache/flashinfer/`.
 
 **Framework separation**: Torch headers **must not** be included in `include/`
 files. `include/` is framework-agnostic (raw pointers only);

From eca784ea532993b6b67b21e33de4770cf9c4aa40 Mon Sep 17 00:00:00 2001
From: Debasis Mandal <debasis.mandal@amd.com>
Date: Wed, 20 May 2026 18:39:34 +0000
Subject: [PATCH 5/5] docs: address second Copilot review on PR #237
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- benchmark-kernel: CUPTI on ROCm doesn't "WILL fail" — the wrapper
  try/excepts the cupti import, warns, and falls back to CUDA/HIP
  event timing. Reword to reflect actual behavior.
- debug-rocm-crash: scope the "grep returns zero" claim to code paths
  (`git grep` under flashinfer/ and include/) so it stays true now
  that the disclaimer itself mentions the missing names.
- CLAUDE.md: HIP path injects `-O3` into cuda_cflags *before*
  appending extra_cuda_cflags, so you can't remove it. Append
  `-O0 -g` so trailing `-O0` overrides on the hipcc command line.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
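The flag-ordering point in the CLAUDE.md change can be illustrated with a minimal sketch. It assumes the usual GCC/Clang convention that the last `-O` level on the command line is the effective one; `base_cflags` and `effective_opt_level` are hypothetical names for illustration, not code from `flashinfer/jit/core.py`.

```python
def effective_opt_level(flags):
    """Return the last -O<level> flag in `flags`, or None if there is none."""
    level = None
    for flag in flags:
        if flag.startswith("-O"):
            level = flag  # later flags overwrite earlier ones
    return level


# Hypothetical stand-in for the flags the HIP path injects before user extras.
base_cflags = ["-O3", "-std=c++17"]
# What the op's JIT generator would append to get a debug build.
extra_cuda_cflags = ["-O0", "-g"]

final_cflags = base_cflags + extra_cuda_cflags
print(final_cflags)                       # ['-O3', '-std=c++17', '-O0', '-g']
print(effective_opt_level(final_cflags))  # -O0
```

Because `-O3` is already in the list when the extras are appended, there is nothing to remove from the caller's side; only the ordering can be used to override it.
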
 .claude/skills/benchmark-kernel/SKILL.md | 2 +-
 .claude/skills/debug-rocm-crash/SKILL.md | 3 ++-
 CLAUDE.md                                | 6 ++++--
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/.claude/skills/benchmark-kernel/SKILL.md b/.claude/skills/benchmark-kernel/SKILL.md
index 0e86a98b35..a8f7bef2af 100644
--- a/.claude/skills/benchmark-kernel/SKILL.md
+++ b/.claude/skills/benchmark-kernel/SKILL.md
@@ -20,7 +20,7 @@ For the in-repo profiler wrapper, see [`rocm_profiler/rocm_profiler.py`](../../.
 
 ## Non-obvious gotchas
 
-- **CUPTI is NVIDIA-only and `enable_cupti=True` WILL fail on ROCm.** [`flashinfer/testing/utils.py:1010`](../../../flashinfer/testing/utils.py) routes `enable_cupti=True` straight to `bench_gpu_time_with_cupti` with no HIP guard; `cupti-python` is not installable on ROCm. Leave `enable_cupti=False` (the default) — `bench_gpu_time` then uses `torch.cuda.Event` (HIP events under the hood).
+- **CUPTI is NVIDIA-only — `enable_cupti=True` on ROCm warns and falls back.** [`flashinfer/testing/utils.py:1010`](../../../flashinfer/testing/utils.py) routes through `bench_gpu_time_with_cupti`, which `try/except`s the `cupti` import, emits a `UserWarning`, and reverts to CUDA/HIP event timing. No functional benefit on ROCm; just leave `enable_cupti=False` (the default) so `bench_gpu_time` uses `torch.cuda.Event` (HIP events) directly without the warning.
 - **AITER backend constraints, accurately:**
   - Explicit `backend="aiter"` + `kv_layout != "NHD"` → `ValueError` at `plan()` time. Raised in the prefill wrapper, e.g. [`prefill_rocm.py:1978`](../../../flashinfer/prefill_rocm.py) (single/paged) and the batch-paged wrapper around line 2920. Not raised by auto-selection — that path silently falls back to `fa2`.
   - Explicit `backend="aiter"` on non-gfx942/gfx950 → `RuntimeError`.
diff --git a/.claude/skills/debug-rocm-crash/SKILL.md b/.claude/skills/debug-rocm-crash/SKILL.md
index 70fade705a..21ded8c7fd 100644
--- a/.claude/skills/debug-rocm-crash/SKILL.md
+++ b/.claude/skills/debug-rocm-crash/SKILL.md
@@ -7,7 +7,8 @@ description: Tutorial for debugging HIP kernel crashes in FlashInfer+ROCm using
 
 > **Note:** earlier revisions of this skill (and CLAUDE.md) described a `@flashinfer_api`
 > decorator with `FLASHINFER_LOGLEVEL` / `FLASHINFER_LOGDEST` env vars. **That machinery does
-> not exist in this fork** (grep returns zero matches). Don't try to set those env vars —
+> not exist in this fork** — no matches in code (`git grep` under `flashinfer/` and `include/`
+> returns nothing; the only hits are this disclaimer). Don't try to set those env vars —
 > use the HIP/ROCm tooling below instead.
 
 ## The magic env-var combo
diff --git a/CLAUDE.md b/CLAUDE.md
index 47b420a304..0a622166ad 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -39,8 +39,10 @@ cflags) is a **silent no-op** unless you call `spec.write_ninja()` first.
 **`FLASHINFER_JIT_DEBUG=1` is a no-op on ROCm/HIP**: the env var is read in
 [`flashinfer/jit/core.py`](flashinfer/jit/core.py) only on the `IS_CUDA` branch
 (where it adds `-O0 -g -G`). The `IS_HIP` branch ignores it entirely. To get a
-debug build on ROCm, add `"-g"` (and remove `-O3`) via `extra_cuda_cflags` in
-the op's JIT generator and clear `~/.cache/flashinfer/`.
+debug build on ROCm, append `"-O0", "-g"` via `extra_cuda_cflags` in the op's
+JIT generator (the HIP path injects `-O3` before `extra_cuda_cflags`, so trailing
+`-O0` is what actually overrides it on the hipcc command line) and clear
+`~/.cache/flashinfer/`.
 
 **Framework separation**: Torch headers **must not** be included in `include/`
 files. `include/` is framework-agnostic (raw pointers only);