docs: Add Claude Code skills and update Claude.md by demandal25 · Pull Request #237 · ROCm/flashinfer

demandal25 · 2026-05-20T15:40:31Z

Summary

The primary source is the upstream material, which was then simplified and made specific to rocm-flashinfer, the ROCm ecosystem, and AMD GPUs.

Trims the three .claude/skills/*/SKILL.md files from 1575 → 239 lines (~85% reduction), mirroring the slim CLAUDE.md philosophy from #225. Keeps only the non-obvious essentials needed when updating code or making design decisions.

While trimming, an independent fact-check surfaced several factual errors in the original docs that were carried into early drafts. Those are fixed here.

Factual fixes

debug-rocm-crash: removed the entire @flashinfer_api / FLASHINFER_LOGLEVEL / FLASHINFER_LOGDEST premise — that machinery does not exist in this fork (grep returns zero matches). Replaced with the AMD_SERIALIZE_KERNEL=3 + HIP_LAUNCH_BLOCKING=1 env-var combo and a per-error recipe table using print() + torch.cuda.synchronize() for input inspection. Added an explicit note at the top so future readers don't reintroduce the fiction.
benchmark-kernel (CUPTI): corrected "silently ignored" → "WILL fail on ROCm". flashinfer/testing/utils.py:1010 routes enable_cupti=True straight to bench_gpu_time_with_cupti with no HIP guard; cupti-python is not installable on ROCm.
benchmark-kernel (AITER): replaced the wrong {1, 16, 1024} page-size whitelist with the accurate constraint set from _aiter_native_page_sizes() in flashinfer/prefill_rocm.py:59 — native page sizes are {128, 256, 1024} for amd-aiter >= 0.1.10, else {16, 1024}; non-native sizes go through a flat-gather path and are NOT rejected. Also documented the silent fallback-to-fa2 auto-selection cases.
add-rocm-kernel: dropped the fictional DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16_FP32 macro; documented the actual three variants (_FP16, _FP8, unsuffixed) and noted that FP32 needs manual dispatch.
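The env-var combo named in the debug-rocm-crash fix can be sketched as a small launcher. This is a minimal illustration, not code from the repository: the helper names `rocm_debug_env` and `run_with_rocm_debug` are hypothetical, and only the two variable names and values come from the text above.

```python
import os
import subprocess
import sys

# The two variables named in the PR description; values copied from the text.
ROCM_DEBUG_ENV = {
    "AMD_SERIALIZE_KERNEL": "3",
    "HIP_LAUNCH_BLOCKING": "1",
}


def rocm_debug_env(base=None):
    """Return a copy of `base` (default: os.environ) with the debug vars set."""
    env = dict(os.environ if base is None else base)
    env.update(ROCM_DEBUG_ENV)
    return env


def run_with_rocm_debug(argv):
    """Run a command (e.g. a hypothetical repro script) in a child process.

    Setting the variables on the child keeps them out of the parent shell
    and guarantees they are in place before the HIP runtime initializes.
    """
    return subprocess.run(argv, env=rocm_debug_env(), check=False)
```

A call such as `run_with_rocm_debug([sys.executable, "repro.py"])` (where `repro.py` is a placeholder) would then run the failing script with kernel launches serialized, so the reported error is more likely to point at the launch that caused it.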

CLAUDE.md cleanup

Removed the misleading FLASHINFER_JIT_DEBUG=1 row from the Essential Commands table. The flag is wired on the CUDA path only; on HIP it does nothing for debug build flags. Added a gotcha pointing to the HIP workaround (add -g via extra_cuda_cflags in the JIT generator).
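A later commit message in this PR notes that the HIP path injects `-O3` before appending `extra_cuda_cflags`, so a trailing `-O0` is needed to actually lower the optimization level. The sketch below models only that ordering. `build_hip_cflags` and `effective_opt_level` are hypothetical helpers, not functions from flashinfer/jit/core.py, and the "last `-O` flag wins" rule is the usual GCC/Clang-style driver behavior.

```python
def build_hip_cflags(extra_cuda_cflags=None):
    """Model the ordering described in the PR: -O3 first, extras appended after.

    Assumption: this is a simplified stand-in for the real flag assembly.
    """
    flags = ["-O3"]
    flags += list(extra_cuda_cflags or [])
    return flags


def effective_opt_level(flags):
    """Return the last -O<level> flag, which is the one the compiler honors."""
    opt_flags = [flag for flag in flags if flag.startswith("-O")]
    return opt_flags[-1] if opt_flags else None


# Passing only -g keeps the injected -O3; appending -O0 as well overrides it.
debug_flags = build_hip_cflags(["-O0", "-g"])
```

Under this model, `-g` on its own adds debug info but leaves the build optimized, which is why the final wording recommends appending `-O0 -g`.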

Test plan

pre-commit run -a clean
All file paths and line numbers in skills verified to exist
Fact-check pass via independent subagent

- Add root CLAUDE.md: HIP/ROCm, gfx942/gfx950, csrc_rocm, gpu_iface, AITER, feature matrix, JIT, debugging/benchmarking with ROCm tooling
- Add .claude/skills: add-rocm-kernel, benchmark-kernel, debug-rocm-crash
- Markdown tables/fences satisfy markdownlint (MD040, MD060, MD031)

Made-with: Cursor

Restructure the three .claude/skills/*/SKILL.md files (1575 → 239 lines, ~85% reduction) to mirror the slim CLAUDE.md philosophy: keep only what is hard to derive from code or remember between sessions. Drop the full walkthrough examples; the real files in flashinfer/csrc_rocm/ and flashinfer/jit/ are better references than a Markdown copy.

Also fix factual errors discovered while fact-checking the originals against the current codebase:

- debug-rocm-crash: remove the entire `@flashinfer_api` / FLASHINFER_LOGLEVEL / FLASHINFER_LOGDEST premise. Grep returns zero matches in the codebase; that decorator and those env vars do not exist in this fork. Replace with the actual debug workflow (AMD_SERIALIZE_KERNEL=3 + HIP_LAUNCH_BLOCKING=1 + manual print + torch.cuda.synchronize, rocgdb, dmesg).
- benchmark-kernel: AITER's "native" page sizes are {128, 256, 1024} for amd-aiter ≥ 0.1.10 (else {16, 1024}), not {1, 16, 1024}. Non-native page sizes fall through a flat-gather path; they are not rejected. CUPTI is not silently ignored: enable_cupti=True routes straight to bench_gpu_time_with_cupti with no HIP guard and will fail; leave it False on ROCm.
- add-rocm-kernel: DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16_FP32 does not exist. Only _FP16, _FP8, and the unsuffixed variant are defined in pytorch_extension_utils.h.
- CLAUDE.md: drop the misleading "Debug build (-O0) FLASHINFER_JIT_DEBUG" row; that env var is read only on the IS_CUDA branch of flashinfer/jit/core.py. Add a gotcha explaining the HIP workaround (add -g via extra_cuda_cflags in the JIT generator).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
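The AITER page-size rule stated in this commit message can be restated as a short sketch. It is an illustration only and does not reproduce `_aiter_native_page_sizes()`. The function names and the `"native"` / `"flat-gather"` labels are hypothetical, and the version parser assumes plain dotted integers such as `0.1.10`.

```python
def parse_version(text):
    """Turn '0.1.10' into (0, 1, 10); assumes plain dotted integers."""
    return tuple(int(part) for part in text.split("."))


def native_page_sizes(aiter_version):
    """Native page sizes as described in the PR text (hypothetical helper)."""
    if parse_version(aiter_version) >= (0, 1, 10):
        return frozenset({128, 256, 1024})
    return frozenset({16, 1024})


def page_size_path(page_size, aiter_version):
    """Non-native sizes are not rejected; they take the flat-gather path."""
    if page_size in native_page_sizes(aiter_version):
        return "native"
    return "flat-gather"
```

Comparing parsed tuples rather than raw strings matters here, because a string comparison would rank `"0.1.9"` above `"0.1.10"`.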

Copilot

Pull request overview

Updates the project’s Claude Code guidance by trimming the .claude/skills/*/SKILL.md documents and correcting ROCm/FlashInfer fork-specific factual inaccuracies, plus a small cleanup in CLAUDE.md around JIT debug behavior.

Changes:

Removes/condenses large portions of the three .claude/skills/*/SKILL.md docs while keeping ROCm-specific essentials.
Corrects several ROCm-vs-CUDA factual details (e.g., CUPTI behavior, AITER page-size constraints, HIP debug workflow).
Updates CLAUDE.md to remove a misleading command row and adds a HIP-specific debug-build workaround note.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `CLAUDE.md` | Removes the `FLASHINFER_JIT_DEBUG` command row and adds a clarification note about CUDA vs ROCm behavior. |
| `.claude/skills/debug-rocm-crash/SKILL.md` | Adds a ROCm crash-debugging recipe focused on HIP runtime tooling and environment variables. |
| `.claude/skills/benchmark-kernel/SKILL.md` | Adds a ROCm benchmarking guide, including timing method guidance and AITER/CUPTI gotchas. |
| `.claude/skills/add-rocm-kernel/SKILL.md` | Adds a step-by-step guide for adding HIP kernels, including file touchpoints and ROCm porting notes. |

Documents the PR body structure (Summary / What changed / Architecture notes / Benchmark results / Test plan) so future sessions produce consistent PRs without rediscovering the format each time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- CLAUDE.md: clarify FLASHINFER_JIT_DEBUG wording (was ambiguous: "CUDA-only no-op" parses two ways).
- add-rocm-kernel: drop the @flashinfer_api reference (no such decorator exists in this fork).
- benchmark-kernel: fix AITER kv_layout!=NHD citation. The hard raise lives in the wrapper plan() (prefill_rocm.py:1978/2920), not in _check_kv_layout or the auto-selection function. Expand the auto-selection fallback list to match _auto_select_prefill_backend.
- debug-rocm-crash: same kv_layout citation fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

- benchmark-kernel: CUPTI on ROCm doesn't "WILL fail": the wrapper try/excepts the cupti import, warns, and falls back to CUDA/HIP event timing. Reword to reflect actual behavior.
- debug-rocm-crash: scope the "grep returns zero" claim to code paths (`git grep` under flashinfer/ and include/) so it stays true now that the disclaimer itself mentions the missing names.
- CLAUDE.md: HIP path injects `-O3` into cuda_cflags *before* appending extra_cuda_cflags, so you can't remove it. Append `-O0 -g` so trailing `-O0` overrides on the hipcc command line.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
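The corrected CUPTI behavior (try the import, warn, fall back to event timing) follows a common optional-dependency pattern. The sketch below shows that pattern in isolation. `select_timing_backend`, its return labels, and the default module name `"cupti"` are assumptions for illustration, not the actual code in flashinfer/testing/utils.py.

```python
import importlib
import warnings


def select_timing_backend(enable_cupti, cupti_module="cupti"):
    """Pick a timing backend, degrading gracefully when CUPTI is missing.

    `cupti_module` is a hypothetical parameter naming the module to probe.
    """
    if enable_cupti:
        try:
            importlib.import_module(cupti_module)
            return "cupti"
        except ImportError:
            # Mirrors the described behavior: warn, then fall through.
            warnings.warn(
                "CUPTI is unavailable; falling back to event timing",
                RuntimeWarning,
            )
    return "events"
```

On ROCm, where cupti-python is not installable, a caller that passes `enable_cupti=True` would therefore get a warning and event-based timings rather than a hard failure.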

demandal25 and others added 2 commits May 20, 2026 14:14

Copilot AI review requested due to automatic review settings May 20, 2026 15:40

Copilot started reviewing on behalf of demandal25 May 20, 2026 15:40 View session

Copilot AI reviewed May 20, 2026

Comment thread CLAUDE.md Outdated

Comment thread .claude/skills/add-rocm-kernel/SKILL.md Outdated

Comment thread .claude/skills/benchmark-kernel/SKILL.md Outdated

Comment thread .claude/skills/benchmark-kernel/SKILL.md Outdated

Comment thread .claude/skills/debug-rocm-crash/SKILL.md Outdated

demandal25 and others added 2 commits May 20, 2026 18:20

Copilot AI review requested due to automatic review settings May 20, 2026 18:33

Copilot started reviewing on behalf of demandal25 May 20, 2026 18:33 View session

Merge branch 'amd-integration' into add-claude-setup

10d9610

Copilot AI reviewed May 20, 2026

Comment thread .claude/skills/debug-rocm-crash/SKILL.md Outdated

Comment thread .claude/skills/benchmark-kernel/SKILL.md Outdated

Comment thread CLAUDE.md Outdated

demandal25 changed the title ~~docs: trim Claude Code skills and fix factual errors~~ docs: Add Claude Code skills and update Claude.md May 20, 2026

demandal25 merged commit 889350b into ROCm:amd-integration May 20, 2026
1 check passed

demandal25 deleted the add-claude-setup branch May 20, 2026 18:42

demandal25 merged 6 commits into ROCm:amd-integration from demandal25:add-claude-setup