docs: Add Claude Code skills and update Claude.md #237
Merged
- Add root CLAUDE.md: HIP/ROCm, gfx942/gfx950, csrc_rocm, gpu_iface, AITER, feature matrix, JIT, debugging/benchmarking with ROCm tooling
- Add .claude/skills: add-rocm-kernel, benchmark-kernel, debug-rocm-crash
- Markdown tables/fences satisfy markdownlint (MD040, MD060, MD031)

Made-with: Cursor
Restructure the three .claude/skills/*/SKILL.md files (1575 → 239 lines,
~85% reduction) to mirror the slim CLAUDE.md philosophy: keep only what
is hard to derive from code or remember between sessions. Drop the full
walkthrough examples — the real files in flashinfer/csrc_rocm/ and
flashinfer/jit/ are better references than a Markdown copy.
Also fix factual errors discovered while fact-checking the originals
against the current codebase:
- debug-rocm-crash: remove the entire `@flashinfer_api` /
FLASHINFER_LOGLEVEL / FLASHINFER_LOGDEST premise. Grep returns zero
matches in the codebase — that decorator and those env vars do not
exist in this fork. Replace with the actual debug workflow
(AMD_SERIALIZE_KERNEL=3 + HIP_LAUNCH_BLOCKING=1 + manual print +
torch.cuda.synchronize, rocgdb, dmesg).
- benchmark-kernel: AITER's "native" page sizes are {128, 256, 1024} for
amd-aiter ≥ 0.1.10 (else {16, 1024}), not {1, 16, 1024}. Non-native
page sizes fall through a flat-gather path; they are not rejected.
CUPTI is not silently ignored — enable_cupti=True routes straight to
bench_gpu_time_with_cupti with no HIP guard and will fail; leave it
False on ROCm.
- add-rocm-kernel: DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16_FP32 does not
exist. Only _FP16, _FP8, and the unsuffixed variant are defined in
pytorch_extension_utils.h.
- CLAUDE.md: drop the misleading "Debug build (-O0) FLASHINFER_JIT_DEBUG"
row — that env var is read only on the IS_CUDA branch of
flashinfer/jit/core.py. Add a gotcha explaining the HIP workaround
(add -g via extra_cuda_cflags in the JIT generator).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
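The AITER page-size rule from the benchmark-kernel bullet above can be illustrated with a small sketch. This is not the fork's own implementation: the names `parse_version`, `native_page_sizes`, and `kv_path` are hypothetical, the version parsing is deliberately naive, and the code only mirrors the behavior the commit message describes (non-native sizes take a flat-gather path rather than being rejected).

```python
def parse_version(version: str) -> tuple:
    """Naive 'X.Y.Z' parser (assumption: plain numeric release versions)."""
    return tuple(int(part) for part in version.split(".")[:3])


def native_page_sizes(aiter_version: str) -> set:
    """Page sizes treated as native, per the rule stated in the commit message."""
    if parse_version(aiter_version) >= (0, 1, 10):
        return {128, 256, 1024}
    return {16, 1024}


def kv_path(page_size: int, aiter_version: str) -> str:
    """Non-native page sizes are not rejected; they take the flat-gather path."""
    if page_size in native_page_sizes(aiter_version):
        return "native"
    return "flat-gather"
```

When benchmarking, a helper like this makes it easy to label each result with the path it exercised, so native and flat-gather timings are not compared as if they were the same code path.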
Pull request overview
Updates the project’s Claude Code guidance by trimming the .claude/skills/*/SKILL.md documents and correcting ROCm/FlashInfer fork-specific factual inaccuracies, plus a small cleanup in CLAUDE.md around JIT debug behavior.
Changes:
- Removes/condenses large portions of the three `.claude/skills/*/SKILL.md` docs while keeping ROCm-specific essentials.
- Corrects several ROCm-vs-CUDA factual details (e.g., CUPTI behavior, AITER page-size constraints, HIP debug workflow).
- Updates `CLAUDE.md` to remove a misleading command row and adds a HIP-specific debug-build workaround note.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `CLAUDE.md` | Removes the FLASHINFER_JIT_DEBUG command row and adds a clarification note about CUDA vs ROCm behavior. |
| `.claude/skills/debug-rocm-crash/SKILL.md` | Adds a ROCm crash-debugging recipe focused on HIP runtime tooling and environment variables. |
| `.claude/skills/benchmark-kernel/SKILL.md` | Adds a ROCm benchmarking guide, including timing method guidance and AITER/CUPTI gotchas. |
| `.claude/skills/add-rocm-kernel/SKILL.md` | Adds a step-by-step guide for adding HIP kernels, including file touchpoints and ROCm porting notes. |
Documents the PR body structure (Summary / What changed / Architecture notes / Benchmark results / Test plan) so future sessions produce consistent PRs without rediscovering the format each time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- CLAUDE.md: clarify FLASHINFER_JIT_DEBUG wording (was ambiguous: "CUDA-only no-op" parses two ways).
- add-rocm-kernel: drop the @flashinfer_api reference (no such decorator exists in this fork).
- benchmark-kernel: fix AITER kv_layout!=NHD citation. The hard raise lives in the wrapper plan() (prefill_rocm.py:1978/2920), not in _check_kv_layout or the auto-selection function. Expand the auto-selection fallback list to match _auto_select_prefill_backend.
- debug-rocm-crash: same kv_layout citation fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- benchmark-kernel: CUPTI on ROCm doesn't "WILL fail". The wrapper try/excepts the cupti import, warns, and falls back to CUDA/HIP event timing. Reword to reflect actual behavior.
- debug-rocm-crash: scope the "grep returns zero" claim to code paths (`git grep` under flashinfer/ and include/) so it stays true now that the disclaimer itself mentions the missing names.
- CLAUDE.md: HIP path injects `-O3` into cuda_cflags *before* appending extra_cuda_cflags, so you can't remove it. Append `-O0 -g` so trailing `-O0` overrides on the hipcc command line.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
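The flag-ordering point in the CLAUDE.md bullet above can be sketched in a few lines. `build_hip_cflags` is a hypothetical stand-in for the HIP branch of the JIT flag assembly, not the code in flashinfer/jit/core.py, and `effective_opt_level` assumes the usual GCC/Clang-style driver rule that the last `-O` option on the command line wins.

```python
def build_hip_cflags(extra_cuda_cflags):
    """Hypothetical stand-in for the HIP branch: -O3 is injected first,
    then the caller's extra flags are appended after it."""
    cuda_cflags = ["-O3"]
    cuda_cflags += list(extra_cuda_cflags)
    return cuda_cflags


def effective_opt_level(flags):
    """Return the last -O<level> flag (assumed to be the one the driver honors)."""
    opt_flags = [flag for flag in flags if flag.startswith("-O")]
    return opt_flags[-1] if opt_flags else None


# Appending "-O0 -g" leaves -O3 in place but puts -O0 after it.
debug_flags = build_hip_cflags(["-O0", "-g"])
```

The injected `-O3` cannot be removed from the caller's side, so the only lever is to append a later `-O0`; passing `-g` alone would still compile at `-O3`.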
Summary
The primary source is upstream material, which was then simplified and made specific to rocm-flashinfer, the ROCm ecosystem, and AMD GPUs.
Trims the three `.claude/skills/*/SKILL.md` files from 1575 → 239 lines (~85% reduction), mirroring the slim CLAUDE.md philosophy from #225. Keeps only the non-obvious essentials needed when updating code or making design decisions.

While trimming, an independent fact-check surfaced several factual errors in the original docs that were carried into early drafts. Those are fixed here.
Factual fixes
- `debug-rocm-crash`: removed the entire `@flashinfer_api` / `FLASHINFER_LOGLEVEL` / `FLASHINFER_LOGDEST` premise. That machinery does not exist in this fork (grep returns zero matches). Replaced with the `AMD_SERIALIZE_KERNEL=3` + `HIP_LAUNCH_BLOCKING=1` env-var combo and a per-error recipe table using `print()` + `torch.cuda.synchronize()` for input inspection. Added an explicit note at the top so future readers don't reintroduce the fiction.
- `benchmark-kernel` (CUPTI): corrected "silently ignored" → "WILL fail on ROCm". `flashinfer/testing/utils.py:1010` routes `enable_cupti=True` straight to `bench_gpu_time_with_cupti` with no HIP guard; `cupti-python` is not installable on ROCm.
- `benchmark-kernel` (AITER): replaced the wrong `{1, 16, 1024}` page-size whitelist with the accurate constraint set from `_aiter_native_page_sizes()` in `flashinfer/prefill_rocm.py:59`: native page sizes are `{128, 256, 1024}` for `amd-aiter >= 0.1.10`, else `{16, 1024}`; non-native sizes go through a flat-gather path and are NOT rejected. Also documented the silent fallback-to-`fa2` auto-selection cases.
- `add-rocm-kernel`: dropped the fictional `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16_FP32` macro; documented the actual three variants (`_FP16`, `_FP8`, unsuffixed) and noted that FP32 needs manual dispatch.

CLAUDE.md cleanup
Removed the misleading `FLASHINFER_JIT_DEBUG=1` row from the Essential Commands table. The flag is wired on the CUDA path only; on HIP it does nothing for debug build flags. Added a gotcha pointing to the HIP workaround (add `-g` via `extra_cuda_cflags` in the JIT generator).

Test plan
- `pre-commit run -a` clean
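For reference, the env-var combo named in the debug-rocm-crash fix could be applied to a repro script roughly as follows. This is a minimal sketch, not part of the skill file: `ROCM_DEBUG_ENV`, `rocm_debug_env`, and `run_repro` are hypothetical names, and the repro script path is whatever the reader supplies. The variables are set in a child process so they are in place before the HIP runtime initializes.

```python
import os
import subprocess
import sys

# The two variables named in the PR description; values as given there.
ROCM_DEBUG_ENV = {
    "AMD_SERIALIZE_KERNEL": "3",
    "HIP_LAUNCH_BLOCKING": "1",
}


def rocm_debug_env(base=None):
    """Return a copy of the environment with the ROCm debug variables set."""
    env = dict(os.environ if base is None else base)
    env.update(ROCM_DEBUG_ENV)
    return env


def run_repro(script_path):
    """Run a (hypothetical) repro script in a child process with the debug env."""
    return subprocess.run(
        [sys.executable, script_path],
        env=rocm_debug_env(),
        check=False,  # a crash is the expected outcome; inspect returncode instead
    )
```

Inside the repro script itself, printing input shapes and dtypes and calling `torch.cuda.synchronize()` right after the suspect call should make the failure surface at that call rather than at a later, unrelated one.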