[Feature] Add block size hint for cuda codegen #2012

Rachmanino wants to merge 6 commits into tile-ai:main from
Conversation
👋 Hi! Thank you for contributing to the TileLang project. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
📝 Walkthrough

Codegen now detects an explicit

Changes

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes

Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches

🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
@regression-perf
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/target/codegen_cuda.cc (1)
467-485: ⚠️ Potential issue | 🔴 Critical

Remove or justify the `__block_size__` attribute.

The `__block_size__((x, y, z))` attribute does not exist anywhere in the codebase and is not a recognized CUDA attribute. The kernel parser in `tilelang/jit/adapter/utils.py` explicitly does not expect this attribute (its regex pattern only accounts for `__launch_bounds__`, not `__block_size__`). The HIP backend similarly emits only `__launch_bounds__`, not `__block_size__`. Either remove this line or provide documentation and corresponding parser support in the framework if it's meant as a custom annotation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/target/codegen_cuda.cc` around lines 467 - 485, The PrintExtraAttrs method in CodeGenTileLangCUDA emits a nonstandard "__block_size__((x, y, z))" annotation (see CodeGenTileLangCUDA::PrintExtraAttrs and use of extractor.threadIdx_x_ext/yi/zi), which the rest of the codebase and parsers do not recognize; either remove the stream output that writes "__block_size__" so only the supported "__launch_bounds__" is emitted, or if this custom annotation is required, add framework support: document the annotation, update the kernel parser (tilelang/jit/adapter/utils.py) to accept "__block_size__" and ensure the HIP backend and other consumers handle it, and keep the existing logic that computes xi/yi/zi and extractor.min_blocks_per_sm.
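The parser mismatch flagged above can be illustrated with a hedged sketch. The pattern below is illustrative only and is not the actual regex from `tilelang/jit/adapter/utils.py`; it merely shows how a matcher that accounts for `__launch_bounds__` but not `__block_size__` misreads a kernel signature carrying the nonstandard annotation:

```python
import re

# Illustrative kernel-signature matcher: an optional __launch_bounds__
# clause is recognized, but __block_size__ is not (mirroring the parser
# behavior described in the review comment).
SIG_RE = re.compile(
    r"__global__\s+void\s+"
    r"(?:__launch_bounds__\(\s*(\d+)(?:\s*,\s*(\d+))?\s*\)\s+)?"
    r"([A-Za-z_]\w*)\s*\("
)

with_lb = "__global__ void __launch_bounds__(256, 2) main_kernel(float* x)"
m = SIG_RE.search(with_lb)
print(m.group(3), m.group(1), m.group(2))  # main_kernel 256 2

# With the nonstandard annotation, the annotation itself is captured
# as the kernel name instead of main_kernel.
with_bs = "__global__ void __block_size__((128, 1, 1)) main_kernel(float* x)"
print(SIG_RE.search(with_bs).group(3))  # __block_size__
```

This is why emitting `__block_size__` without corresponding parser support would silently break downstream consumers of the generated source.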
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5d538f98-9e3e-4151-bb51-9f95990557f0
📒 Files selected for processing (1)
src/target/codegen_cuda.cc
@regression-perf
… min_blocks_per_sm annotation. Update tests to reflect changes in emitted kernel source, ensuring correct handling of block size and launch bounds attributes.
…flict
Three bugs fixed:
1. (Critical) lower.py: `get_target_arch` was used to guard `__block_size__`
emission, but it unconditionally appends 'a' for all SM9.x, making the
condition `not endswith("a")` always False. Replace with `target.arch`
(the user-specified arch string) so sm_90 enables the hint while sm_90a
(which implies cluster-launch metadata) correctly disables it.
2. (Critical) codegen_cuda.cc: `PrintFunctionSignature` (used for forward
declarations) was also calling `PrintExtraAttrs`, placing `__block_size__`
on both the declaration and the definition. NVCC 12.9 rejects duplicate
`__block_size__` on kernels that include TMA copy headers (which bring in
`shared::cluster` PTX). Fix: remove the call from the forward-declaration
path; the definition already emits the hint via the explicit `PrintExtraAttrs`
call in `AddFunction`.
3. (Minor) codegen_cuda.cc: remove the no-op `CodeGenC::PrintExtraAttrs` call
that preceded `this->PrintExtraAttrs`.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
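The arch-string gating from bug 1 can be sketched as follows. This is a minimal illustration under stated assumptions: `should_emit_block_size` and `buggy_get_target_arch` are hypothetical stand-ins, not the project's actual API — the commit message only names `target.arch` and `get_target_arch`:

```python
def buggy_get_target_arch(user_arch: str) -> str:
    # Hypothetical stand-in for the described behavior: the canonicalized
    # arch unconditionally appends 'a' for all SM9.x targets.
    if user_arch.startswith("sm_9") and not user_arch.endswith("a"):
        return user_arch + "a"
    return user_arch

def should_emit_block_size(user_arch: str) -> bool:
    # Fixed guard: gate on the user-specified arch string (target.arch),
    # so sm_90 enables the hint while sm_90a (which implies cluster-launch
    # metadata) correctly disables it.
    return not user_arch.endswith("a")

print(should_emit_block_size("sm_90"))                   # True
print(should_emit_block_size("sm_90a"))                  # False
# The bug: guarding on the canonicalized arch is always False for SM9.x.
print(not buggy_get_target_arch("sm_90").endswith("a"))  # False
```

The last line shows why the original condition could never enable the hint on any SM9.x target.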
…avior

Tests: make `__block_size__` / `__launch_bounds__` assertions conditional on runtime NVCC version and compute capability, since sm_90a targets (the default on H100) no longer emit `__block_size__` after the gating fix.

annotations.py: update docstring for `annotate_min_blocks_per_sm` to drop the outdated reference to `__block_size__` being the alternative.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
🧹 Nitpick comments (2)
tilelang/engine/lower.py (2)
44-47: Make attribute application explicit and guard module attrs.

Line 46 relies on dict-overload behavior of `with_attr`, and Line 47 may be fragile if `device_mod.attrs` is `None`. Using explicit key/value updates and a `None` guard is safer across TVM API variants.

Proposed fix

```diff
 def _annotate_device_mod(device_mod: tvm.IRModule, attrs: dict) -> tvm.IRModule:
     """Return a copy of *device_mod* with *attrs* added to every PrimFunc."""
-    new_funcs = {gv: (func.with_attr(attrs) if isinstance(func, tir.PrimFunc) else func)
-                 for gv, func in device_mod.functions.items()}
-    return tvm.IRModule(new_funcs).with_attrs(device_mod.attrs)
+    new_funcs = {}
+    for gv, func in device_mod.functions.items():
+        if isinstance(func, tir.PrimFunc):
+            for k, v in attrs.items():
+                func = func.with_attr(k, v)
+        new_funcs[gv] = func
+    new_mod = tvm.IRModule(new_funcs)
+    return new_mod.with_attrs(device_mod.attrs) if device_mod.attrs is not None else new_mod
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tilelang/engine/lower.py` around lines 44 - 47, In _annotate_device_mod, avoid relying on the dict-overload of tir.PrimFunc.with_attr and guard against device_mod.attrs being None: iterate over attrs.items() and call func = func.with_attr(k, v) for each key/value when func is a tir.PrimFunc, build new_funcs accordingly, and when attaching module-level attrs use device_mod.attrs if not None (or an empty dict) when calling tvm.IRModule(...).with_attrs(...) so the code works across TVM API variants.
40-41: Narrow the exception scope in capability detection.

Line 40 catches all exceptions, which can suppress unexpected breakages. Prefer catching expected failures only.

Proposed fix

```diff
-    except Exception:
+    except (AttributeError, ValueError, RuntimeError):
         return False
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tilelang/engine/lower.py` around lines 40 - 41, The broad except Exception: return False in the capability-detection block should be narrowed to only the expected failure types (e.g., ImportError, AttributeError, OSError, ValueError depending on what the detection code does) so unexpected errors aren't silently suppressed; replace the blanket except with explicit except clauses for the anticipated exceptions and either return False there, and re-raise (or let propagate) any other exceptions, optionally logging the caught error for diagnostics. Target the specific except Exception: return False occurrence in the capability detection code (the capability detection block/function in lower.py) and update it accordingly.
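The narrowed exception handling suggested above can be sketched like this. The concrete exception types to catch depend on what the detection code actually calls; `has_capability` and `broken_probe` below are hypothetical illustrations, not code from `lower.py`:

```python
def has_capability(probe) -> bool:
    # Catch only the failure modes the probe is expected to raise; any
    # other exception (a genuine bug) propagates instead of being
    # silently swallowed by a blanket `except Exception`.
    try:
        return bool(probe())
    except (AttributeError, ValueError, RuntimeError):
        return False

def broken_probe():
    # Simulates an expected detection failure (e.g. no device visible).
    raise ValueError("no CUDA device visible")

print(has_capability(lambda: True))  # True
print(has_capability(broken_probe))  # False
```

An unexpected error type — say a `KeyError` from a malformed config — would still surface with a traceback, which is the diagnostic behavior the review comment asks for.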
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5da199fa-be6c-4c68-a7dc-29532a783fc4
📒 Files selected for processing (6)
- src/target/codegen_cuda.cc
- src/target/codegen_cuda.h
- testing/python/issue/test_tilelang_issue_tma_no_ws.py
- testing/python/issue/test_tilelang_issue_ws_simt_copy_full_producer_extent.py
- tilelang/engine/lower.py
- tilelang/language/annotations.py
✅ Files skipped from review due to trivial changes (1)
- tilelang/language/annotations.py
🚧 Files skipped from review as they are similar to previous changes (3)
- testing/python/issue/test_tilelang_issue_ws_simt_copy_full_producer_extent.py
- testing/python/issue/test_tilelang_issue_tma_no_ws.py
- src/target/codegen_cuda.cc
Summary by CodeRabbit