
[Feature] Add block size hint for cuda codegen#2012

Open
Rachmanino wants to merge 6 commits into tile-ai:main from Rachmanino:block-size-hint

Conversation

Collaborator

@Rachmanino Rachmanino commented Apr 3, 2026

Summary by CodeRabbit

  • Bug Fixes
    • Kernel launch attribute emission now only produces block-size or launch-bounds hints when thread extents are statically known and based on an explicit min-blocks annotation or a target-enabled hint flag.
  • New Features
    • Device codegen honors a target-controlled flag to allow emitting block-size hints when supported.
  • Tests
    • Tests updated to check for either block-size or launch-bounds forms conditionally based on CUDA/runtime capability.
  • Documentation
    • Docstring clarified how min-blocks-per-SM maps to emitted launch bounds.


github-actions bot commented Apr 3, 2026

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run `pre-commit run --all-files` in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

Contributor

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough

Walkthrough

Codegen now detects an explicit tl.min_blocks_per_sm attribute and separately simplifies each thread-dimension extent; it emits __block_size__((x,y,z)) only when all extents are static and block-size hints are allowed, and emits __launch_bounds__ with or without min-blocks depending on the presence of the annotation.
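The emission rule above can be modeled as a small Python sketch (illustrative only; the real logic is C++ in PrintExtraAttrs, and the function name here is invented for this example):

```python
def emit_launch_hint(extents, allow_block_size_hint, min_blocks_per_sm=None):
    """Model of the attribute-emission rule (illustrative only).

    extents: the (x, y, z) thread extents; None marks a dynamic extent.
    """
    # Any dynamic extent suppresses both hint forms.
    if any(e is None for e in extents):
        return ""
    total = extents[0] * extents[1] * extents[2]
    # An explicit tl.min_blocks_per_sm annotation wins: emit the
    # two-argument __launch_bounds__ and never __block_size__.
    if min_blocks_per_sm is not None:
        return f"__launch_bounds__({total}, {min_blocks_per_sm})"
    if total > 1 and allow_block_size_hint:
        x, y, z = extents
        return f"__block_size__(({x}, {y}, {z}))"
    if total > 1:
        return f"__launch_bounds__({total})"
    return ""
```

For example, static extents (128, 1, 1) with the hint flag enabled yield `__block_size__((128, 1, 1))`, while adding `min_blocks_per_sm=2` yields `__launch_bounds__(128, 2)` regardless of the flag.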

Changes

  • CUDA codegen (header), src/target/codegen_cuda.h: CodeGenTileLangCUDA::PrintExtraAttrs signature changed to accept std::ostream &os and marked final; added allow_block_size_hint_ member.
  • CUDA codegen (implementation), src/target/codegen_cuda.cc: PrintExtraAttrs now independently simplifies threadIdx_x_ext, threadIdx_y_ext, and threadIdx_z_ext; only emits a block-size hint when all three simplify to static IntImmNodes, the product is greater than 1, and allow_block_size_hint_ is enabled; detects tl.min_blocks_per_sm via LaunchConfigExtractor::saw_min_blocks_per_sm_attr and emits __launch_bounds__(total_threads, min_blocks_per_sm) when the annotation is present, otherwise __launch_bounds__(total_threads) if applicable. Call sites updated to pass os.
  • Lowering / annotation propagation, tilelang/engine/lower.py: Adds _allow_block_size_hint to compute whether block-size hints are allowed for a CUDA target; adds _annotate_device_mod to attach TIR attrs to device PrimFuncs; lower() annotates device PrimFuncs with "tl.allow_block_size_hint" before device codegen.
  • Tests (runtime-aware expectations), testing/python/issue/test_tilelang_issue_tma_no_ws.py and testing/python/issue/test_tilelang_issue_ws_simt_copy_full_producer_extent.py: Tests import tilelang.contrib.nvcc and torch to detect the CUDA version and compute capability, and make launch-hint assertions conditional (accept either __block_size__((...)) when hints are allowed or __launch_bounds__(...) otherwise).
  • Tests (min blocks assertion), testing/python/language/test_tilelang_language_min_blocks_per_sm.py: Now asserts the generated source contains __launch_bounds__(128, 2) and that __block_size__ is not emitted for annotated functions.
  • Annotations docs, tilelang/language/annotations.py: Docstring for annotate_min_blocks_per_sm clarified to state that codegen emits __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) when annotated; no API changes.
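The runtime-aware test pattern described above can be sketched as a small helper (hypothetical; the actual tests detect capability via tilelang.contrib.nvcc and torch rather than taking a boolean):

```python
import re

def launch_hint_present(source: str, hints_allowed: bool) -> bool:
    """Accept whichever hint form the target configuration should produce."""
    if hints_allowed:
        # Expect the custom __block_size__((x, y, z)) attribute.
        return re.search(r"__block_size__\(\(\d+,\s*\d+,\s*\d+\)\)", source) is not None
    # Otherwise expect __launch_bounds__(n) or __launch_bounds__(n, m).
    return re.search(r"__launch_bounds__\(\d+(?:,\s*\d+)?\)", source) is not None
```

A test would then assert `launch_hint_present(kernel_source, hints_allowed)` after deciding `hints_allowed` from the detected CUDA version and compute capability.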

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Possibly related PRs

Suggested reviewers

  • LeiWang1999

Poem

🐰 I hopped through threadIdx x, y, z,

Counting lanes where kernels flee,
If bounds are told, I shout with glee,
Else I hint the block — soft as can be,
A rabbit's clap for CUDA harmony.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 42.11%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The PR title clearly and concisely describes the main feature: adding block size hints for CUDA code generation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@Rachmanino
Collaborator Author

@regression-perf

Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/target/codegen_cuda.cc (1)

467-485: ⚠️ Potential issue | 🔴 Critical

Remove or justify the __block_size__ attribute.

The __block_size__((x, y, z)) attribute does not exist anywhere in the codebase and is not a recognized CUDA attribute. The kernel parser in tilelang/jit/adapter/utils.py explicitly does not expect this attribute (its regex pattern only accounts for __launch_bounds__, not __block_size__). The HIP backend similarly emits only __launch_bounds__, not __block_size__.

Either remove this line or provide documentation and corresponding parser support in the framework if it's meant as a custom annotation.
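If the custom annotation were kept, the kernel parser would need a pattern along these lines (a hypothetical sketch; the group names are invented here, and the actual regex in tilelang/jit/adapter/utils.py may be structured differently):

```python
import re

# Accept both the standard __launch_bounds__(n[, m]) form and the
# custom __block_size__((x, y, z)) annotation.
LAUNCH_HINT = re.compile(
    r"__launch_bounds__\((?P<max_threads>\d+)(?:\s*,\s*(?P<min_blocks>\d+))?\)"
    r"|__block_size__\(\((?P<bx>\d+)\s*,\s*(?P<by>\d+)\s*,\s*(?P<bz>\d+)\)\)"
)
```

Matching `"__launch_bounds__(128, 2)"` fills max_threads/min_blocks, while matching `"__block_size__((128, 1, 1))"` fills bx/by/bz instead.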

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/target/codegen_cuda.cc` around lines 467 - 485, The PrintExtraAttrs
method in CodeGenTileLangCUDA emits a nonstandard "__block_size__((x, y, z))"
annotation (see CodeGenTileLangCUDA::PrintExtraAttrs and use of
extractor.threadIdx_x_ext/yi/zi), which the rest of the codebase and parsers do
not recognize; either remove the stream output that writes "__block_size__" so
only the supported "__launch_bounds__" is emitted, or if this custom annotation
is required, add framework support: document the annotation, update the kernel
parser (tilelang/jit/adapter/utils.py) to accept "__block_size__" and ensure the
HIP backend and other consumers handle it, and keep the existing logic that
computes xi/yi/zi and extractor.min_blocks_per_sm.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5d538f98-9e3e-4151-bb51-9f95990557f0

📥 Commits

Reviewing files that changed from the base of the PR and between 5f70374 and 3f66da3.

📒 Files selected for processing (1)
  • src/target/codegen_cuda.cc

@LeiWang1999
Member

@regression-perf

Rachmanino and others added 5 commits April 7, 2026 13:01
… min_blocks_per_sm annotation. Update tests to reflect changes in emitted kernel source, ensuring correct handling of block size and launch bounds attributes.
…flict

Three bugs fixed:

1. (Critical) lower.py: `get_target_arch` was used to guard `__block_size__`
   emission, but it unconditionally appends 'a' for all SM9.x, making the
   condition `not endswith("a")` always False.  Replace with `target.arch`
   (the user-specified arch string) so sm_90 enables the hint while sm_90a
   (which implies cluster-launch metadata) correctly disables it.

2. (Critical) codegen_cuda.cc: `PrintFunctionSignature` (used for forward
   declarations) was also calling `PrintExtraAttrs`, placing `__block_size__`
   on both the declaration and the definition.  NVCC 12.9 rejects duplicate
   `__block_size__` on kernels that include TMA copy headers (which bring in
   `shared::cluster` PTX).  Fix: remove the call from the forward-declaration
   path; the definition already emits the hint via the explicit `PrintExtraAttrs`
   call in `AddFunction`.

3. (Minor) codegen_cuda.cc: remove the no-op `CodeGenC::PrintExtraAttrs` call
   that preceded `this->PrintExtraAttrs`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…avior

Tests: make __block_size__ / __launch_bounds__ assertions conditional on
runtime NVCC version and compute capability, since sm_90a targets (the
default on H100) no longer emit __block_size__ after the gating fix.

annotations.py: update docstring for annotate_min_blocks_per_sm to drop
the outdated reference to __block_size__ being the alternative.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
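The arch gating described in fix 1 above can be sketched in a few lines (the helper name is assumed for this example; the real check lives in tilelang/engine/lower.py):

```python
def allow_block_size_hint(arch: str) -> bool:
    """Gate the hint on the user-specified arch string.

    sm_90 enables the hint; sm_90a (which implies cluster-launch
    metadata) disables it. get_target_arch is unsuitable here because
    it unconditionally appends 'a' for all SM9.x architectures, making
    the check always fail.
    """
    return not arch.endswith("a")
```

Under this rule, sm_80 and sm_90 enable the hint while sm_90a disables it.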
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
tilelang/engine/lower.py (2)

44-47: Make attribute application explicit and guard module attrs.

Line 46 relies on dict-overload behavior of with_attr, and Line 47 may be fragile if device_mod.attrs is None. Using explicit key/value updates and a None guard is safer across TVM API variants.

Proposed fix
 def _annotate_device_mod(device_mod: tvm.IRModule, attrs: dict) -> tvm.IRModule:
     """Return a copy of *device_mod* with *attrs* added to every PrimFunc."""
-    new_funcs = {gv: (func.with_attr(attrs) if isinstance(func, tir.PrimFunc) else func) for gv, func in device_mod.functions.items()}
-    return tvm.IRModule(new_funcs).with_attrs(device_mod.attrs)
+    new_funcs = {}
+    for gv, func in device_mod.functions.items():
+        if isinstance(func, tir.PrimFunc):
+            for k, v in attrs.items():
+                func = func.with_attr(k, v)
+        new_funcs[gv] = func
+    new_mod = tvm.IRModule(new_funcs)
+    return new_mod.with_attrs(device_mod.attrs) if device_mod.attrs is not None else new_mod
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tilelang/engine/lower.py` around lines 44 - 47, In _annotate_device_mod,
avoid relying on the dict-overload of tir.PrimFunc.with_attr and guard against
device_mod.attrs being None: iterate over attrs.items() and call func =
func.with_attr(k, v) for each key/value when func is a tir.PrimFunc, build
new_funcs accordingly, and when attaching module-level attrs use
device_mod.attrs if not None (or an empty dict) when calling
tvm.IRModule(...).with_attrs(...) so the code works across TVM API variants.

40-41: Narrow the exception scope in capability detection.

Line 40 catches all exceptions, which can suppress unexpected breakages. Prefer catching expected failures only.

Proposed fix
-    except Exception:
+    except (AttributeError, ValueError, RuntimeError):
         return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tilelang/engine/lower.py` around lines 40 - 41, The broad except Exception:
return False in the capability-detection block should be narrowed to only the
expected failure types (e.g., ImportError, AttributeError, OSError, ValueError
depending on what the detection code does) so unexpected errors aren't silently
suppressed; replace the blanket except with explicit except clauses for the
anticipated exceptions and either return False there, and re-raise (or let
propagate) any other exceptions, optionally logging the caught error for
diagnostics. Target the specific except Exception: return False occurrence in
the capability detection code (the capability detection block/function in
lower.py) and update it accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5da199fa-be6c-4c68-a7dc-29532a783fc4

📥 Commits

Reviewing files that changed from the base of the PR and between 631817c and e7d243a.

📒 Files selected for processing (6)
  • src/target/codegen_cuda.cc
  • src/target/codegen_cuda.h
  • testing/python/issue/test_tilelang_issue_tma_no_ws.py
  • testing/python/issue/test_tilelang_issue_ws_simt_copy_full_producer_extent.py
  • tilelang/engine/lower.py
  • tilelang/language/annotations.py
✅ Files skipped from review due to trivial changes (1)
  • tilelang/language/annotations.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • testing/python/issue/test_tilelang_issue_ws_simt_copy_full_producer_extent.py
  • testing/python/issue/test_tilelang_issue_tma_no_ws.py
  • src/target/codegen_cuda.cc
