docs: Add Claude Code skills and update Claude.md #237

Merged
demandal25 merged 6 commits into ROCm:amd-integration from demandal25:add-claude-setup
May 20, 2026

demandal25 (Collaborator) commented May 20, 2026

Summary

The primary source for these files is the upstream material, which was then simplified and made specific to rocm-flashinfer, the ROCm ecosystem, and AMD GPUs.

Trims the three .claude/skills/*/SKILL.md files from 1575 → 239 lines (~85% reduction), mirroring the slim CLAUDE.md philosophy from #225. Keeps only the non-obvious essentials needed when updating code or making design decisions.

While trimming, an independent fact-check surfaced several factual errors in the original docs that were carried into early drafts. Those are fixed here.

Factual fixes

  • debug-rocm-crash: removed the entire @flashinfer_api / FLASHINFER_LOGLEVEL / FLASHINFER_LOGDEST premise — that machinery does not exist in this fork (grep returns zero matches). Replaced with the AMD_SERIALIZE_KERNEL=3 + HIP_LAUNCH_BLOCKING=1 env-var combo and a per-error recipe table using print() + torch.cuda.synchronize() for input inspection. Added an explicit note at the top so future readers don't reintroduce the fiction.
  • benchmark-kernel (CUPTI): corrected "silently ignored" → "WILL fail on ROCm". flashinfer/testing/utils.py:1010 routes enable_cupti=True straight to bench_gpu_time_with_cupti with no HIP guard; cupti-python is not installable on ROCm.
  • benchmark-kernel (AITER): replaced the wrong {1, 16, 1024} page-size whitelist with the accurate constraint set from _aiter_native_page_sizes() in flashinfer/prefill_rocm.py:59 — native page sizes are {128, 256, 1024} for amd-aiter >= 0.1.10, else {16, 1024}; non-native sizes go through a flat-gather path and are NOT rejected. Also documented the silent fallback-to-fa2 auto-selection cases.
  • add-rocm-kernel: dropped the fictional DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16_FP32 macro; documented the actual three variants (_FP16, _FP8, unsuffixed) and noted that FP32 needs manual dispatch.
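The env-var combo named in the debug-rocm-crash fix can be sketched as a small launcher. This is a minimal illustration, not code from the skill file: the helper names `rocm_debug_env` and `run_with_rocm_debug` are hypothetical, and only the two variable names and their values come from the description above.

```python
import os
import subprocess

# Variables and values taken from the PR description. They are intended to
# serialize kernel launches so a fault surfaces at the call that caused it.
ROCM_DEBUG_ENV = {
    "AMD_SERIALIZE_KERNEL": "3",
    "HIP_LAUNCH_BLOCKING": "1",
}


def rocm_debug_env(base=None):
    """Return a copy of the environment with the ROCm debug variables set."""
    env = dict(os.environ if base is None else base)
    env.update(ROCM_DEBUG_ENV)
    return env


def run_with_rocm_debug(argv):
    """Run a command (for example a hypothetical repro script) under that environment."""
    return subprocess.run(argv, env=rocm_debug_env(), check=False)
```

Inside the reproduced script itself, the recipe described above still applies: print the inputs of the suspect call and follow it with `torch.cuda.synchronize()` so the error is attributed to the right line.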

CLAUDE.md cleanup

Removed the misleading FLASHINFER_JIT_DEBUG=1 row from the Essential Commands table. The flag is wired on the CUDA path only; on HIP it does nothing for debug build flags. Added a gotcha pointing to the HIP workaround (add -g via extra_cuda_cflags in the JIT generator).
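Why appended flags are the only lever on HIP can be modelled in a few lines. This is a sketch under assumptions: `build_hip_cflags` and `effective_opt_level` are hypothetical helpers, not functions from flashinfer/jit. The ordering they model (`-O3` injected first, `extra_cuda_cflags` appended afterwards, with the last `-O` flag taking effect) is the behavior a later commit in this PR describes for the HIP path.

```python
def build_hip_cflags(extra_cuda_cflags=None):
    """Hypothetical model: the default flag comes first, caller flags are appended."""
    cflags = ["-O3"]  # injected before caller flags, so callers cannot remove it
    cflags.extend(extra_cuda_cflags or [])
    return cflags


def effective_opt_level(cflags):
    """Return the last -O<level> flag, which is the one the compiler honors."""
    opt_flags = [flag for flag in cflags if flag.startswith("-O")]
    return opt_flags[-1] if opt_flags else None


debug_flags = build_hip_cflags(["-O0", "-g"])
```

Under this model, passing only `-g` keeps the build at `-O3` with debug info, while passing `-O0 -g` yields an unoptimized debug build because the trailing `-O0` wins.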

Test plan

  • pre-commit run -a clean
  • All file paths and line numbers in skills verified to exist
  • Fact-check pass via independent subagent

demandal25 and others added 2 commits May 20, 2026 14:14
- Add root CLAUDE.md: HIP/ROCm, gfx942/gfx950, csrc_rocm, gpu_iface, AITER,
  feature matrix, JIT, debugging/benchmarking with ROCm tooling
- Add .claude/skills: add-rocm-kernel, benchmark-kernel, debug-rocm-crash
- Markdown tables/fences satisfy markdownlint (MD040, MD060, MD031)

Made-with: Cursor

Restructure the three .claude/skills/*/SKILL.md files (1575 → 239 lines,
~85% reduction) to mirror the slim CLAUDE.md philosophy: keep only what
is hard to derive from code or remember between sessions. Drop the full
walkthrough examples — the real files in flashinfer/csrc_rocm/ and
flashinfer/jit/ are better references than a Markdown copy.

Also fix factual errors discovered while fact-checking the originals
against the current codebase:

- debug-rocm-crash: remove the entire `@flashinfer_api` /
  FLASHINFER_LOGLEVEL / FLASHINFER_LOGDEST premise. Grep returns zero
  matches in the codebase — that decorator and those env vars do not
  exist in this fork. Replace with the actual debug workflow
  (AMD_SERIALIZE_KERNEL=3 + HIP_LAUNCH_BLOCKING=1 + manual print +
  torch.cuda.synchronize, rocgdb, dmesg).

- benchmark-kernel: AITER's "native" page sizes are {128, 256, 1024} for
  amd-aiter ≥ 0.1.10 (else {16, 1024}), not {1, 16, 1024}. Non-native
  page sizes fall through a flat-gather path; they are not rejected.
  CUPTI is not silently ignored — enable_cupti=True routes straight to
  bench_gpu_time_with_cupti with no HIP guard and will fail; leave it
  False on ROCm.

- add-rocm-kernel: DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16_FP32 does not
  exist. Only _FP16, _FP8, and the unsuffixed variant are defined in
  pytorch_extension_utils.h.

- CLAUDE.md: drop the misleading "Debug build (-O0) FLASHINFER_JIT_DEBUG"
  row — that env var is read only on the IS_CUDA branch of
  flashinfer/jit/core.py. Add a gotcha explaining the HIP workaround
  (add -g via extra_cuda_cflags in the JIT generator).
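The AITER page-size rule stated in this commit message can be written out as a short sketch. The function names below are hypothetical and do not reproduce `_aiter_native_page_sizes()`; only the version threshold, the two size sets, and the flat-gather fallback come from the text above.

```python
def parse_version(version: str) -> tuple:
    """Parse a dotted numeric version such as "0.1.10" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def native_page_sizes(aiter_version: str) -> frozenset:
    """Page sizes handled natively for a given amd-aiter version."""
    if parse_version(aiter_version) >= (0, 1, 10):
        return frozenset({128, 256, 1024})
    return frozenset({16, 1024})


def page_size_path(page_size: int, aiter_version: str) -> str:
    """Non-native sizes are not rejected; they take the flat-gather path."""
    if page_size in native_page_sizes(aiter_version):
        return "native"
    return "flat-gather"
```

Comparing integer tuples rather than raw strings matters here, because a string comparison would rank "0.1.9" above "0.1.10".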

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 20, 2026 15:40

Copilot AI left a comment

Pull request overview

Updates the project’s Claude Code guidance by trimming the .claude/skills/*/SKILL.md documents and correcting ROCm/FlashInfer fork-specific factual inaccuracies, plus a small cleanup in CLAUDE.md around JIT debug behavior.

Changes:

  • Removes/condenses large portions of the three .claude/skills/*/SKILL.md docs while keeping ROCm-specific essentials.
  • Corrects several ROCm-vs-CUDA factual details (e.g., CUPTI behavior, AITER page-size constraints, HIP debug workflow).
  • Updates CLAUDE.md to remove a misleading command row and adds a HIP-specific debug-build workaround note.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| CLAUDE.md | Removes the FLASHINFER_JIT_DEBUG command row and adds a clarification note about CUDA vs ROCm behavior. |
| .claude/skills/debug-rocm-crash/SKILL.md | Adds a ROCm crash-debugging recipe focused on HIP runtime tooling and environment variables. |
| .claude/skills/benchmark-kernel/SKILL.md | Adds a ROCm benchmarking guide, including timing method guidance and AITER/CUPTI gotchas. |
| .claude/skills/add-rocm-kernel/SKILL.md | Adds a step-by-step guide for adding HIP kernels, including file touchpoints and ROCm porting notes. |

Comment thread CLAUDE.md Outdated
Comment thread .claude/skills/add-rocm-kernel/SKILL.md Outdated
Comment thread .claude/skills/benchmark-kernel/SKILL.md Outdated
Comment thread .claude/skills/benchmark-kernel/SKILL.md Outdated
Comment thread .claude/skills/debug-rocm-crash/SKILL.md Outdated
demandal25 and others added 2 commits May 20, 2026 18:20
Documents the PR body structure (Summary / What changed / Architecture
notes / Benchmark results / Test plan) so future sessions produce
consistent PRs without rediscovering the format each time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- CLAUDE.md: clarify FLASHINFER_JIT_DEBUG wording (was ambiguous:
  "CUDA-only no-op" parses two ways).
- add-rocm-kernel: drop the @flashinfer_api reference (no such
  decorator exists in this fork).
- benchmark-kernel: fix AITER kv_layout!=NHD citation — the hard
  raise lives in the wrapper plan() (prefill_rocm.py:1978/2920),
  not in _check_kv_layout or the auto-selection function. Expand
  the auto-selection fallback list to match _auto_select_prefill_backend.
- debug-rocm-crash: same kv_layout citation fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 20, 2026 18:33

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comment thread .claude/skills/debug-rocm-crash/SKILL.md Outdated
Comment thread .claude/skills/benchmark-kernel/SKILL.md Outdated
Comment thread CLAUDE.md Outdated
- benchmark-kernel: CUPTI on ROCm does not hard-fail, contrary to the
  earlier "WILL fail" wording. The wrapper try/excepts the cupti import,
  warns, and falls back to CUDA/HIP event timing. Reword to reflect
  actual behavior.
- debug-rocm-crash: scope the "grep returns zero" claim to code paths
  (`git grep` under flashinfer/ and include/) so it stays true now
  that the disclaimer itself mentions the missing names.
- CLAUDE.md: HIP path injects `-O3` into cuda_cflags *before*
  appending extra_cuda_cflags, so you can't remove it. Append
  `-O0 -g` so that the trailing `-O0` overrides `-O3` on the hipcc
  command line.
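The CUPTI behavior described in the first bullet (try the import, warn, fall back) follows a common optional-dependency pattern, sketched below. This is not the wrapper in flashinfer/testing/utils.py: `bench_time_ms` and its `cupti_module` parameter are hypothetical, and a host-side `time.perf_counter` loop stands in for both timing backends so the sketch runs without a GPU.

```python
import importlib
import time
import warnings


def bench_time_ms(fn, enable_cupti=False, cupti_module="cupti", repeats=10):
    """Time fn(), preferring CUPTI when requested and importable.

    Returns (milliseconds_per_call, method_used).
    """
    method = "events"
    if enable_cupti:
        try:
            importlib.import_module(cupti_module)  # expected to be missing on ROCm
            method = "cupti"
        except ImportError:
            warnings.warn("CUPTI unavailable; falling back to event timing")
    # Stand-in measurement: a real wrapper would use CUPTI or GPU events here.
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / repeats, method
```

With this shape, requesting CUPTI on a machine without it produces a warning and an "events" result instead of an exception, which matches the reworded claim.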

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
demandal25 changed the title from "docs: trim Claude Code skills and fix factual errors" to "docs: Add Claude Code skills and update Claude.md" on May 20, 2026
demandal25 merged commit 889350b into ROCm:amd-integration on May 20, 2026
1 check passed
demandal25 deleted the add-claude-setup branch on May 20, 2026 18:42
2 participants