
[Feature] Add block size hint for cuda codegen#2012

Open
Rachmanino wants to merge 6 commits into tile-ai:main from Rachmanino:block-size-hint

Conversation

Collaborator

@Rachmanino Rachmanino commented Apr 3, 2026

Summary by CodeRabbit

  • Bug Fixes
    • Kernel launch attribute emission now only produces block-size or launch-bounds hints when thread extents are statically known and based on an explicit min-blocks annotation or a target-enabled hint flag.
  • New Features
    • Device codegen honors a target-controlled flag to allow emitting block-size hints when supported.
  • Tests
    • Tests updated to check for either block-size or launch-bounds forms conditionally based on CUDA/runtime capability.
  • Documentation
    • Docstring clarified how min-blocks-per-SM maps to emitted launch bounds.


github-actions bot commented Apr 3, 2026

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run `pre-commit run --all-files` in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

Contributor

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough

Walkthrough

Codegen now detects an explicit tl.min_blocks_per_sm attribute and separately simplifies each thread-dimension extent; it emits __block_size__((x,y,z)) only when all extents are static and block-size hints are allowed, and emits __launch_bounds__ with or without min-blocks depending on the presence of the annotation.
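The emission rule above can be modeled as a small Python sketch (illustrative only; the real logic is C++ in PrintExtraAttrs, and the function name here is invented for this example):

```python
def emit_launch_hint(extents, allow_block_size_hint, min_blocks_per_sm=None):
    """Model of the attribute-emission rule (illustrative only).

    extents: the (x, y, z) thread extents; None marks a dynamic extent.
    """
    # Any dynamic extent suppresses both hint forms.
    if any(e is None for e in extents):
        return ""
    total = extents[0] * extents[1] * extents[2]
    # An explicit tl.min_blocks_per_sm annotation wins: emit the
    # two-argument __launch_bounds__ and never __block_size__.
    if min_blocks_per_sm is not None:
        return f"__launch_bounds__({total}, {min_blocks_per_sm})"
    if total > 1 and allow_block_size_hint:
        x, y, z = extents
        return f"__block_size__(({x}, {y}, {z}))"
    if total > 1:
        return f"__launch_bounds__({total})"
    return ""
```

For example, static extents (128, 1, 1) with the hint flag enabled yield `__block_size__((128, 1, 1))`, while adding `min_blocks_per_sm=2` yields `__launch_bounds__(128, 2)` regardless of the flag.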

Changes

  • CUDA codegen (header), src/target/codegen_cuda.h: CodeGenTileLangCUDA::PrintExtraAttrs signature changed to accept std::ostream &os and marked final; added allow_block_size_hint_ member.
  • CUDA codegen (implementation), src/target/codegen_cuda.cc: PrintExtraAttrs now independently simplifies threadIdx_x_ext, threadIdx_y_ext, and threadIdx_z_ext; only emits a block-size hint when all three simplify to static IntImmNodes, the product is greater than 1, and allow_block_size_hint_ is enabled; detects tl.min_blocks_per_sm via LaunchConfigExtractor::saw_min_blocks_per_sm_attr and emits __launch_bounds__(total_threads, min_blocks_per_sm) when the annotation is present, otherwise __launch_bounds__(total_threads) if applicable. Call sites updated to pass os.
  • Lowering / annotation propagation, tilelang/engine/lower.py: Adds _allow_block_size_hint to compute whether block-size hints are allowed for a CUDA target; adds _annotate_device_mod to attach TIR attrs to device PrimFuncs; lower() annotates device PrimFuncs with "tl.allow_block_size_hint" before device codegen.
  • Tests (runtime-aware expectations), testing/python/issue/test_tilelang_issue_tma_no_ws.py and testing/python/issue/test_tilelang_issue_ws_simt_copy_full_producer_extent.py: Tests import tilelang.contrib.nvcc and torch to detect the CUDA version and compute capability, and make launch-hint assertions conditional (accept either __block_size__((...)) when hints are allowed or __launch_bounds__(...) otherwise).
  • Tests (min blocks assertion), testing/python/language/test_tilelang_language_min_blocks_per_sm.py: Now asserts the generated source contains __launch_bounds__(128, 2) and that __block_size__ is not emitted for annotated functions.
  • Annotations docs, tilelang/language/annotations.py: Docstring for annotate_min_blocks_per_sm clarified to state that codegen emits __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) when annotated; no API changes.
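The runtime-aware test pattern described above can be sketched as a small helper (hypothetical; the actual tests detect capability via tilelang.contrib.nvcc and torch rather than taking a boolean):

```python
import re

def launch_hint_present(source: str, hints_allowed: bool) -> bool:
    """Accept whichever hint form the target configuration should produce."""
    if hints_allowed:
        # Expect the custom __block_size__((x, y, z)) attribute.
        return re.search(r"__block_size__\(\(\d+,\s*\d+,\s*\d+\)\)", source) is not None
    # Otherwise expect __launch_bounds__(n) or __launch_bounds__(n, m).
    return re.search(r"__launch_bounds__\(\d+(?:,\s*\d+)?\)", source) is not None
```

A test would then assert `launch_hint_present(kernel_source, hints_allowed)` after deciding `hints_allowed` from the detected CUDA version and compute capability.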

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Possibly related PRs

Suggested reviewers

  • LeiWang1999

Poem

🐰 I hopped through threadIdx x, y, z,

Counting lanes where kernels flee,
If bounds are told, I shout with glee,
Else I hint the block — soft as can be,
A rabbit's clap for CUDA harmony.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 42.11%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The PR title clearly and concisely describes the main feature: adding block size hints for CUDA code generation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@Rachmanino
Collaborator Author

@regression-perf

Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/target/codegen_cuda.cc (1)

467-485: ⚠️ Potential issue | 🔴 Critical

Remove or justify the __block_size__ attribute.

The __block_size__((x, y, z)) attribute does not exist anywhere in the codebase and is not a recognized CUDA attribute. The kernel parser in tilelang/jit/adapter/utils.py explicitly does not expect this attribute (its regex pattern only accounts for __launch_bounds__, not __block_size__). The HIP backend similarly emits only __launch_bounds__, not __block_size__.

Either remove this line or provide documentation and corresponding parser support in the framework if it's meant as a custom annotation.
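If the custom annotation were kept, the kernel parser would need a pattern along these lines (a hypothetical sketch; the group names are invented here, and the actual regex in tilelang/jit/adapter/utils.py may be structured differently):

```python
import re

# Accept both the standard __launch_bounds__(n[, m]) form and the
# custom __block_size__((x, y, z)) annotation.
LAUNCH_HINT = re.compile(
    r"__launch_bounds__\((?P<max_threads>\d+)(?:\s*,\s*(?P<min_blocks>\d+))?\)"
    r"|__block_size__\(\((?P<bx>\d+)\s*,\s*(?P<by>\d+)\s*,\s*(?P<bz>\d+)\)\)"
)
```

Matching `"__launch_bounds__(128, 2)"` fills max_threads/min_blocks, while matching `"__block_size__((128, 1, 1))"` fills bx/by/bz instead.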

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/target/codegen_cuda.cc` around lines 467 - 485, The PrintExtraAttrs
method in CodeGenTileLangCUDA emits a nonstandard "__block_size__((x, y, z))"
annotation (see CodeGenTileLangCUDA::PrintExtraAttrs and use of
extractor.threadIdx_x_ext/yi/zi), which the rest of the codebase and parsers do
not recognize; either remove the stream output that writes "__block_size__" so
only the supported "__launch_bounds__" is emitted, or if this custom annotation
is required, add framework support: document the annotation, update the kernel
parser (tilelang/jit/adapter/utils.py) to accept "__block_size__" and ensure the
HIP backend and other consumers handle it, and keep the existing logic that
computes xi/yi/zi and extractor.min_blocks_per_sm.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5d538f98-9e3e-4151-bb51-9f95990557f0

📥 Commits

Reviewing files that changed from the base of the PR and between 5f70374 and 3f66da3.

📒 Files selected for processing (1)
  • src/target/codegen_cuda.cc

@LeiWang1999
Member

@regression-perf

Rachmanino and others added 5 commits April 7, 2026 13:01
… min_blocks_per_sm annotation. Update tests to reflect changes in emitted kernel source, ensuring correct handling of block size and launch bounds attributes.
…flict

Three bugs fixed:

1. (Critical) lower.py: `get_target_arch` was used to guard `__block_size__`
   emission, but it unconditionally appends 'a' for all SM9.x, making the
   condition `not endswith("a")` always False.  Replace with `target.arch`
   (the user-specified arch string) so sm_90 enables the hint while sm_90a
   (which implies cluster-launch metadata) correctly disables it.

2. (Critical) codegen_cuda.cc: `PrintFunctionSignature` (used for forward
   declarations) was also calling `PrintExtraAttrs`, placing `__block_size__`
   on both the declaration and the definition.  NVCC 12.9 rejects duplicate
   `__block_size__` on kernels that include TMA copy headers (which bring in
   `shared::cluster` PTX).  Fix: remove the call from the forward-declaration
   path; the definition already emits the hint via the explicit `PrintExtraAttrs`
   call in `AddFunction`.

3. (Minor) codegen_cuda.cc: remove the no-op `CodeGenC::PrintExtraAttrs` call
   that preceded `this->PrintExtraAttrs`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…avior

Tests: make __block_size__ / __launch_bounds__ assertions conditional on
runtime NVCC version and compute capability, since sm_90a targets (the
default on H100) no longer emit __block_size__ after the gating fix.

annotations.py: update docstring for annotate_min_blocks_per_sm to drop
the outdated reference to __block_size__ being the alternative.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
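The arch gating described in fix 1 above can be sketched in a few lines (the helper name is assumed for this example; the real check lives in tilelang/engine/lower.py):

```python
def allow_block_size_hint(arch: str) -> bool:
    """Gate the hint on the user-specified arch string.

    sm_90 enables the hint; sm_90a (which implies cluster-launch
    metadata) disables it. get_target_arch is unsuitable here because
    it unconditionally appends 'a' for all SM9.x architectures, making
    the check always fail.
    """
    return not arch.endswith("a")
```

Under this rule, sm_80 and sm_90 enable the hint while sm_90a disables it.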
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
tilelang/engine/lower.py (2)

44-47: Make attribute application explicit and guard module attrs.

Line 46 relies on dict-overload behavior of with_attr, and Line 47 may be fragile if device_mod.attrs is None. Using explicit key/value updates and a None guard is safer across TVM API variants.

Proposed fix
 def _annotate_device_mod(device_mod: tvm.IRModule, attrs: dict) -> tvm.IRModule:
     """Return a copy of *device_mod* with *attrs* added to every PrimFunc."""
-    new_funcs = {gv: (func.with_attr(attrs) if isinstance(func, tir.PrimFunc) else func) for gv, func in device_mod.functions.items()}
-    return tvm.IRModule(new_funcs).with_attrs(device_mod.attrs)
+    new_funcs = {}
+    for gv, func in device_mod.functions.items():
+        if isinstance(func, tir.PrimFunc):
+            for k, v in attrs.items():
+                func = func.with_attr(k, v)
+        new_funcs[gv] = func
+    new_mod = tvm.IRModule(new_funcs)
+    return new_mod.with_attrs(device_mod.attrs) if device_mod.attrs is not None else new_mod
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tilelang/engine/lower.py` around lines 44 - 47, In _annotate_device_mod,
avoid relying on the dict-overload of tir.PrimFunc.with_attr and guard against
device_mod.attrs being None: iterate over attrs.items() and call func =
func.with_attr(k, v) for each key/value when func is a tir.PrimFunc, build
new_funcs accordingly, and when attaching module-level attrs use
device_mod.attrs if not None (or an empty dict) when calling
tvm.IRModule(...).with_attrs(...) so the code works across TVM API variants.

40-41: Narrow the exception scope in capability detection.

Line 40 catches all exceptions, which can suppress unexpected breakages. Prefer catching expected failures only.

Proposed fix
-    except Exception:
+    except (AttributeError, ValueError, RuntimeError):
         return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tilelang/engine/lower.py` around lines 40 - 41, The broad except Exception:
return False in the capability-detection block should be narrowed to only the
expected failure types (e.g., ImportError, AttributeError, OSError, ValueError
depending on what the detection code does) so unexpected errors aren't silently
suppressed; replace the blanket except with explicit except clauses for the
anticipated exceptions and either return False there, and re-raise (or let
propagate) any other exceptions, optionally logging the caught error for
diagnostics. Target the specific except Exception: return False occurrence in
the capability detection code (the capability detection block/function in
lower.py) and update it accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5da199fa-be6c-4c68-a7dc-29532a783fc4

📥 Commits

Reviewing files that changed from the base of the PR and between 631817c and e7d243a.

📒 Files selected for processing (6)
  • src/target/codegen_cuda.cc
  • src/target/codegen_cuda.h
  • testing/python/issue/test_tilelang_issue_tma_no_ws.py
  • testing/python/issue/test_tilelang_issue_ws_simt_copy_full_producer_extent.py
  • tilelang/engine/lower.py
  • tilelang/language/annotations.py
✅ Files skipped from review due to trivial changes (1)
  • tilelang/language/annotations.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • testing/python/issue/test_tilelang_issue_ws_simt_copy_full_producer_extent.py
  • testing/python/issue/test_tilelang_issue_tma_no_ws.py
  • src/target/codegen_cuda.cc
