diff --git a/.claude/skills/add-rocm-kernel/SKILL.md b/.claude/skills/add-rocm-kernel/SKILL.md
new file mode 100644
index 0000000000..67540ed3da
--- /dev/null
+++ b/.claude/skills/add-rocm-kernel/SKILL.md
@@ -0,0 +1,78 @@
+---
+name: add-rocm-kernel
+description: Step-by-step tutorial for adding new HIP kernels to FlashInfer+ROCm (amd-flashinfer)
+---
+
+# Adding a New Kernel to FlashInfer+ROCm
+
+For a complete worked example to copy, read these together:
+[`norm.cu`](../../../flashinfer/csrc_rocm/norm.cu) +
+[`flashinfer_norm_binding.cu`](../../../flashinfer/csrc_rocm/flashinfer_norm_binding.cu) +
+[`jit/norm.py`](../../../flashinfer/jit/norm.py) +
+[`norm.py`](../../../flashinfer/norm.py). For plan-run / multi-backend / FP8 see
+[`batch_prefill.cu`](../../../flashinfer/csrc_rocm/batch_prefill.cu) +
+[`prefill_rocm.py`](../../../flashinfer/prefill_rocm.py).
+
+## File touchpoints (every new op needs each row, in order)
+
+| Step | File | Purpose |
+| --- | --- | --- |
+| 1 | `include/flashinfer/.cuh` | Framework-agnostic kernel + launcher template. **No `` includes here.** |
+| 2 | `flashinfer/csrc_rocm/.cu` | PyTorch launcher: `at::Tensor` in, `at::hip::getCurrentHIPStream()`, `TORCH_CHECK`, `DISPATCH_PYTORCH_DTYPE_*`. |
+| 3 | `flashinfer/csrc_rocm/flashinfer__binding.cu` | `TORCH_LIBRARY_FRAGMENT(TORCH_EXTENSION_NAME, m) { m.def("", ); }`. |
+| 4 (opt) | `flashinfer/csrc_rocm/_customize_config.jinja` | Compile-time type specialization. Skip if runtime dispatch is enough. |
+| 5 | `flashinfer/jit/.py` | `gen__module() -> JitSpec` via `gen_jit_spec(...)`. |
+| 6 | `flashinfer/.py` | Python API: `@functools.cache` module loader, destination-passing (`out=`). |
+| 7 | `tests/rocm_tests/test__hip.py` | Correctness tests; FP32 reference math, loose BF16 tolerances. |
+| 8 | `flashinfer/jit/__init__.py` (`IS_HIP` branch) | `from . import gen__module as gen__module`. |
+| 9 | `flashinfer/__init__.py` (`IS_HIP` branch) | `from . import as `. |
+| 10 (opt) | `flashinfer/aot_hip.py` | Register `gen__module` for pre-compiled wheels. |
+
+**Forgetting steps 8 and 9 is the most common bug** — the module compiles but is invisible from `import flashinfer`.
+
+## CUDA → ROCm porting cheat sheet
+
+When porting an upstream kernel, mechanically rewrite:
+
+| Upstream CUDA | This fork |
+| --- | --- |
+| `csrc/.cu` | `flashinfer/csrc_rocm/.cu` |
+| `#include "tvm_ffi_utils.h"` | `#include "pytorch_extension_utils.h"` |
+| `tvm::ffi::TensorView` | `at::Tensor` |
+| `TVM_FFI_DLL_EXPORT_TYPED_FUNC(run, op)` | `TORCH_LIBRARY_FRAGMENT(TORCH_EXTENSION_NAME, m) { m.def("op", op); }` |
+| `TVM_FFI_THROW(ValueError) << "..."` | `TORCH_CHECK(cond, "...")` |
+| `DISPATCH_DLPACK_DTYPE_TO_CTYPE_FP16` | `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16` |
+| `get_stream(tensor.device())` | `at::hip::getCurrentHIPStream()` |
+| `c10::cuda::OptionalCUDAGuard` | `c10::hip::OptionalHIPGuardMasqueradingAsCUDA` |
+| `nvcc` flags via `extra_cuda_cflags=[...]` | **Same kwarg name** (`extra_cuda_cflags`) — internally routed to `hipcc`. |
+| `flashinfer/aot.py` registration | `flashinfer/aot_hip.py` |
+| `tests/test_op.py` | `tests/rocm_tests/test_op_hip.py` |
+| `supported_major_versions=[9, 10]` | No analogue. Guard at Python layer via `FLASHINFER_SUPPORTED_ROCM_ARCHS`. |
+| `csrc/` (hardcoded) | `jit_env.FLASHINFER_CSRC_DIR` resolves to `flashinfer/csrc_rocm/` on HIP. **Never hardcode `csrc/`.** |
+| `PYBIND11_MODULE(...)` | **Don't.** Use `TORCH_LIBRARY_FRAGMENT` (integrates with `torch.compile`). |
+
+## Non-obvious gotchas
+
+- **PyTorch's ROCm masquerade.** `input.device.type == "cuda"` even on AMD. Never check for `"hip"`. PyTorch's HIP namespaces are reachable via `at::hip::...` and `c10::hip::OptionalHIPGuardMasqueradingAsCUDA` (literally the type name).
+- **`gpu_iface` over duplication.** If a primitive (MMA intrinsic, cross-lane shuffle, dtype container, warp reduction) differs between CUDA and HIP, add it under [`include/gpu_iface/backend/{cuda,hip}/`](../../../include/gpu_iface) and expose a common name from the top-level `gpu_iface/` header. Don't fork the kernel into `csrc_rocm/`. Existing HIP backends: `mma_hip.h`, `memory_ops_hip.h`, `math_hip.h`, `vec_dtypes_hip.h`.
+- **`-ffast-math` adds `-ffinite-math-only` on clang/hipcc.** [`jit/core.py`](../../../flashinfer/jit/core.py) explicitly re-adds `-fno-finite-math-only` so kernels that use `-inf` as a sentinel (online-softmax Map+Reduce) keep working. CUDA's `-use_fast_math` does *not* enable finite-math-only — a divergence to be aware of when porting.
+- **`gen_jit_spec` auto-injects `--offload-arch=gfxNNN`** for every target arch plus `COMMON_HIPCC_FLAGS` (`-DFLASHINFER_ENABLE_HIP`, FP8 enables, etc.). Don't add `--offload-arch` by hand.
+- **Validation macros** live in [`pytorch_extension_utils.h`](../../../flashinfer/csrc_rocm/pytorch_extension_utils.h): `CHECK_INPUT` (GPU + contiguous), `CHECK_LAST_DIM_CONTIGUOUS_INPUT`, `CHECK_EQ`, `CHECK_DIM`, `CHECK_GE`, `CHECK_SHAPE`. Dispatch macros: `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16` (FP16+BF16), `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP8` (E4M3+E5M2, both `_fnuz` on CDNA3/4), and the unsuffixed `DISPATCH_PYTORCH_DTYPE_TO_CTYPE` (FP16+BF16+FP8 combined). There is **no** `_FP16_FP32` variant — if you need FP32, dispatch manually.
+- **The `_jit_pybind.cu` naming pattern** (e.g. `batch_decode_jit_pybind.cu`) is used by newer AITER-integrated bindings; the older `flashinfer__binding.cu` pattern is used by everything else. Both work — match the neighbors.
+
+## CDNA3 (`gfx942`) vs CDNA4 (`gfx950`)
+
+- **Wavefront = 64 on both.** Anything ported from CUDA assuming warp = 32 is wrong. Use `warpSize` for portability.
+- **FP8** is `__hip_fp8_e4m3_fnuz` / `__hip_fp8_e5m2_fnuz` on both. PyTorch dtype is `torch.float8_e4m3fnuz` (not `torch.float8_e4m3fn`, which is NVIDIA OCP FP8). Bit-exact parity with NVIDIA FP8 is not guaranteed — calibrate scale factors separately.
+- **MFMA intrinsics:** CDNA4 has additional FP8 MFMA shapes not on CDNA3. Guard arch-specific intrinsics with `__gfx942__` / `__gfx950__` or compute-capability dispatch at the Python layer.
+- **LDS / register / occupancy budgets differ.** Don't hard-code tile sizes — parameterize (Jinja) or query via `torch.cuda.get_device_properties(dev)` at plan time.
+
+## Quick checklist before commit
+
+- [ ] No `` under `include/`.
+- [ ] Launcher uses `at::hip::getCurrentHIPStream()` + `OptionalHIPGuardMasqueradingAsCUDA`.
+- [ ] Binding registered via `TORCH_LIBRARY_FRAGMENT`.
+- [ ] JIT generator uses `jit_env.FLASHINFER_CSRC_DIR` (not hardcoded `csrc/`).
+- [ ] Both `flashinfer/jit/__init__.py` and `flashinfer/__init__.py` IS_HIP branches updated.
+- [ ] Test file under `tests/rocm_tests/` named `test_*_hip.py`.
+- [ ] `pre-commit run -a` clean.
diff --git a/.claude/skills/benchmark-kernel/SKILL.md b/.claude/skills/benchmark-kernel/SKILL.md
new file mode 100644
index 0000000000..a8f7bef2af
--- /dev/null
+++ b/.claude/skills/benchmark-kernel/SKILL.md
@@ -0,0 +1,82 @@
+---
+name: benchmark-kernel
+description: Guide for benchmarking FlashInfer+ROCm kernels on AMD Instinct (CDNA3/CDNA4)
+---
+
+# Benchmarking FlashInfer+ROCm Kernels
+
+For a real driver script to copy, see
+[`benchmarks/rocm_benchmarks/bench_fa2_prefill.py`](../../../benchmarks/rocm_benchmarks/bench_fa2_prefill.py) and [`benchmarks/rocm_benchmarks/bench_aiter_prefill.py`](../../../benchmarks/rocm_benchmarks/bench_aiter_prefill.py).
+For the in-repo profiler wrapper, see [`rocm_profiler/rocm_profiler.py`](../../../rocm_profiler/rocm_profiler.py).
+
+## Timing method matrix
+
+| Method | When | How |
+| --- | --- | --- |
+| `flashinfer.testing.bench_gpu_time` | Quick in-loop check (kernels ≳ 50 µs) | Falls through to PyTorch `torch.cuda.Event` (HIP events under ROCm) automatically. |
+| `rocm_profiler` (`RocmProfiler`) | Anything you intend to optimize | Two-phase: in-process median timing, then re-execs the same script under `rocprofv3` (sentinel: `_ROCM_PROFILER_INTERNAL`) for hardware counters. Produces roofline PNG. |
+| `rocprofv3` directly | Full control over counter set | `rocprofv3 --stats --kernel-trace -- python script.py`; or `-i pmc.txt` for custom counters. |
+| `omnitrace` | Host + device timeline when Python overhead is suspect | Installed separately. |
+
+## Non-obvious gotchas
+
+- **CUPTI is NVIDIA-only — `enable_cupti=True` on ROCm warns and falls back.** [`flashinfer/testing/utils.py:1010`](../../../flashinfer/testing/utils.py) routes through `bench_gpu_time_with_cupti`, which `try/except`s the `cupti` import, emits a `UserWarning`, and reverts to CUDA/HIP event timing. No functional benefit on ROCm; just leave `enable_cupti=False` (the default) so `bench_gpu_time` uses `torch.cuda.Event` (HIP events) directly without the warning.
+- **AITER backend constraints, accurately:**
+  - Explicit `backend="aiter"` + `kv_layout != "NHD"` → `ValueError` at `plan()` time. Raised in the prefill wrapper, e.g. [`prefill_rocm.py:1978`](../../../flashinfer/prefill_rocm.py) (single/paged) and the batch-paged wrapper around line 2920. Not raised by auto-selection — that path silently falls back to `fa2`.
+  - Explicit `backend="aiter"` on non-gfx942/gfx950 → `RuntimeError`.
+  - `amd-aiter` not importable → `ImportError`.
+  - **"Native" page sizes** (no flat-gather): `{128, 256, 1024}` for `amd-aiter >= 0.1.10`, else `{16, 1024}` — see `_aiter_native_page_sizes()` in [`prefill_rocm.py:59`](../../../flashinfer/prefill_rocm.py). **Non-native page sizes are NOT rejected** — they go through a flat-gather code path. So the "{1, 16, 1024}" guidance from older docs is wrong.
+  - Auto-selection (no explicit `backend=`) silently falls back to `fa2` for any of: `kv_layout != "NHD"`, custom mask, dtype not in `{fp16, bf16}`, `dtype_q != dtype_kv`, `head_dim_qk != head_dim_vo`, `pos_encoding_mode != "NONE"`, or `amd-aiter` not importable. See `_auto_select_prefill_backend()` in [`prefill_rocm.py:311`](../../../flashinfer/prefill_rocm.py) for the authoritative list.
+- **Always verify numerical parity before trusting perf numbers.** Compare default-HIP vs AITER outputs with `torch.testing.assert_close(rtol=1e-2, atol=1e-2)` for BF16/FP16 first.
+- **`gcnArchName` is the unambiguous arch marker.** Device strings show `cuda:0` on AMD too. Record `torch.cuda.get_device_properties(0).gcnArchName` and `torch.version.hip` alongside every number — a `gfx942` / ROCm 7.2 result is not comparable to a `gfx950` / ROCm 7.0.2 result.
+
+## What can actually be benchmarked on ROCm
+
+Only the APIs in the `IS_HIP` branch of [`flashinfer/__init__.py`](../../../flashinfer/__init__.py) are callable. **Not** available: MLA, cascade, POD, FP4, MoE, cuDNN backends. Don't try to import them.
+
+AITER backend available for: single prefill, batch prefill (paged + ragged) — opt in via `backend="aiter"`. Not available for decode, norm, rope, sampling, etc.
+
+## `rocm_profiler` counter presets
+
+Pass via `RocmProfiler(counters=...)` or `--counters` on the driver script.
+
+| Preset | What it shows | Use for |
+| --- | --- | --- |
+| `roofline` (default) | `FetchSize`, `WriteSize`, MFMA ops, TCC DRAM requests | "Am I compute- or memory-bound?" |
+| `compute` | MFMA ops + cycle counters | Matrix-core throughput |
+| `memory` | L2 + DRAM breakdown | L2 hit-rate, HBM traffic |
+| `occupancy` | `SQ_WAVES`, `SQ_BUSY_CYCLES`, `SQ_VALU_MFMA_BUSY_CYCLES`, `SQ_INSTS_LDS` | Wavefront density |
+| `stall` | `SQ_WAIT_INST_VMEM`, `SQ_WAIT_INST_LDS` | Diagnose memory stalls |
+| `basic` | `FetchSize` / `WriteSize` | Minimal baseline |
+
+Or pass a path to a `rocprofv3`-native YAML for a custom counter set.
+
+Driver script flags: `--timing-only` (skip rocprofv3), `--skip-roofline`, `--replot` (regen PNG from existing CSVs, no GPU), `--list-presets`.
+
+Output (under `benchmarks/rocm_benchmarks/`, gitignored):
+
+```text
+