[Transform] Add InjectTcgen05Fence pass by LeiWang1999 · Pull Request #2003 · tile-ai/tilelang

LeiWang1999 · 2026-03-31T07:45:47Z

Summary

Add InjectTcgen05Fence TIR pass that automatically inserts tcgen05.fence::before_thread_sync and tcgen05.fence::after_thread_sync around every __syncthreads() on Blackwell (SM100+) targets when the kernel uses TMEM/tcgen05 operations.
On Blackwell, the tcgen05 accumulator (TMEM) lives in a separate address space. Regular thread barriers do not synchronize TMEM state across threads. The fence instructions bridge this gap: before_fence flushes TMEM writes before the barrier, after_fence makes other threads' TMEM writes visible after the barrier.
The pass is a no-op on non-SM100 targets or functions without tcgen05/TMEM usage.

Changes

File	Description
`src/op/builtin.h` / `.cc`	Define two new TIR intrinsic Ops
`src/target/codegen_cuda.cc`	Emit `tl::tcgen05_before_thread_sync()` / `tl::tcgen05_after_thread_sync()`
`src/transform/inject_tcgen05_fence.cc`	Pass implementation
`tilelang/transform/__init__.py`	Python pass registration
`tilelang/engine/phase.py`	Insert pass after `ThreadSync("shared.dyn")`

Generated code example

Before:

__syncthreads();

After:

tl::tcgen05_before_thread_sync();
__syncthreads();
tl::tcgen05_after_thread_sync();

Test plan

Build passes (make succeeds)
Format passes (format.sh pre-commit checks all pass)
Verified fence insertion on gemm_tcgen5mma SM100 example: all __syncthreads() correctly wrapped with fence pair
End-to-end correctness test on Blackwell hardware

Summary by CodeRabbit

New Features
- Added synchronization fence operations for NVIDIA Blackwell (SM100+) GPU support.
- Introduced automatic fence injection around thread synchronization barriers on Blackwell targets to optimize memory synchronization.
- New transformation pass conditionally applies fence wrapping to shared memory synchronization calls on compatible architectures.

On Blackwell GPUs, the tcgen05 accumulator (TMEM) resides in a separate address space that is not synchronized by regular thread barriers like __syncthreads() or mbarrier. Two PTX fence instructions are required to ensure cross-thread visibility of TMEM state: tcgen05.fence::before_thread_sync -- flush TMEM before barrier tcgen05.fence::after_thread_sync -- pull TMEM after barrier This commit introduces the `InjectTcgen05Fence` TIR pass that automatically wraps every `tvm_storage_sync("shared")` call with the fence pair when the target is SM100+ and the function uses tcgen05/TMEM operations. Changes: - Define two new TIR intrinsic Ops: `tcgen05_before_thread_sync` and `tcgen05_after_thread_sync` (builtin.h/cc) - Add codegen support to emit `tl::tcgen05_before_thread_sync()` and `tl::tcgen05_after_thread_sync()` (codegen_cuda.cc) - Implement the `InjectTcgen05Fence` pass (inject_tcgen05_fence.cc) - Register the pass in the Python transform module (__init__.py) - Insert the pass in OptimizeForTarget after ThreadSync (phase.py)

github-actions · 2026-03-31T07:45:57Z

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

…t-tcgen05-fence

coderabbitai · 2026-03-31T09:44:44Z

📝 Walkthrough

Walkthrough

This PR introduces new thread synchronization fence intrinsics (tcgen05_before_thread_sync and tcgen05_after_thread_sync) for Blackwell GPUs. The changes include builtin declarations, CUDA codegen support, a new transformation pass that wraps shared memory synchronization barriers with these fences, and integration into the optimization pipeline.

Changes

Cohort / File(s)	Summary
Intrinsic Declarations `src/op/builtin.h`, `src/op/builtin.cc`	Added two new builtin intrinsic declarations for `tcgen05_before_thread_sync` and `tcgen05_after_thread_sync` with opaque effect kind and zero inputs. Builtin registrations repositioned in `.cc` file (moved earlier, same definitions).
CUDA Code Generation `src/target/codegen_cuda.cc`	Extended CallNode handling to emit extern calls for the new `tcgen05_before_thread_sync` and `tcgen05_after_thread_sync` intrinsics during CUDA code generation.
Fence Injection Pass `src/transform/inject_tcgen05_fence.cc`	Introduced new `InjectTcgen05Fence` TVM transformation pass that rewrites PrimFunc bodies to conditionally wrap `tvm_storage_sync("shared")` and `tvm_storage_sync("shared.dyn")` calls with fence intrinsics on SM100+ (Blackwell) targets. Uses StmtExprMutator to inject before/after fence calls.
Pass Integration `tilelang/engine/phase.py`, `tilelang/transform/__init__.py`	Integrated new `InjectTcgen05Fence` pass into the optimization pipeline after thread synchronization injection, and exposed via Python FFI wrapper function.

Sequence Diagram

sequenceDiagram
    participant Pass as InjectTcgen05Fence<br/>Pass
    participant PrimFunc as PrimFunc Body
    participant Visitor as StmtExprMutator<br/>Visitor
    participant TIR as TIR Nodes

    Pass->>PrimFunc: Check target attribute<br/>(SM100+?)
    PrimFunc-->>Pass: Target verified
    Pass->>PrimFunc: Scan for tcgen05/TMEM<br/>operations
    PrimFunc-->>Pass: Operations detected
    
    Pass->>Visitor: Traverse function body
    Visitor->>TIR: Find tvm_storage_sync<br/>("shared"/"shared.dyn")
    TIR-->>Visitor: Located sync calls
    
    Visitor->>Visitor: Build replacement:<br/>SeqStmt[ before_fence,<br/>original_sync,<br/>after_fence ]
    Visitor-->>Pass: Rewritten body
    
    Pass->>PrimFunc: Update with<br/>transformed body
    PrimFunc-->>Pass: PrimFunc updated
    Pass-->>Pass: Return modified<br/>function

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly Related PRs

[Bugfix] Fix tensor memory copy layout #933: Extends tcgen05-related intrinsics and modifies CUDA codegen CallNode handling for similar fence-related operations.
[BugFix] Add missing fences in GEMM SM100 examples and canonicalize the order of blockIdx #1980: Adds matching tcgen05_before_thread_sync / tcgen05_after_thread_sync intrinsics with equivalent fence-injection transformation logic.

Poem

🐰 Fences spring up 'round shared barriers bright,
Blackwell's tcgen05 holds sync just right,
Before and after, a dance so neat,
Memory whispers, threads that meet! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title '[Transform] Add InjectTcgen05Fence pass' clearly and concisely describes the main change: introducing a new transformation pass.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

src/op/builtin.h (1)

431-445: Drop the duplicate public declarations.

These two APIs are already declared again at Lines 729-735. Keeping both blocks means the next signature or doc change has two places to update.

✂️ Proposed cleanup

-/*!
- * \brief Emit tcgen05.fence::before_thread_sync on Blackwell (SM100+)
- *
- * tcgen05_before_thread_sync()
- *
- */
-TVM_DLL const Op &tcgen05_before_thread_sync();
-
-/*!
- * \brief Emit tcgen05.fence::after_thread_sync on Blackwell (SM100+)
- *
- * tcgen05_after_thread_sync()
- *
- */
-TVM_DLL const Op &tcgen05_after_thread_sync();
-
 /*!
  * \brief Indicate arrival of warp issuing TMA_STORE

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/op/builtin.h` around lines 431 - 445, This file contains duplicate public
declarations for the two ops tcgen05_before_thread_sync() and
tcgen05_after_thread_sync(); remove the earlier duplicate block (the
declarations inside the commented brief around the first occurrence) so there is
a single canonical declaration for each (keep the later declarations at the
other location), ensuring only the unique symbols tcgen05_before_thread_sync and
tcgen05_after_thread_sync remain declared once in the header.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/target/codegen_cuda.cc`:
- Around line 2034-2039: Remove the duplicate tcgen05 fence handler branches
that check op->op.same_as(tl::tcgen05_before_thread_sync()) and
tl::tcgen05_after_thread_sync() (which set need_tcgen05_common_h_ and call
print_extern_call_stmt), because they shadow the intended handler that performs
ICHECK_EQ(op->args.size(), 0U); delete these earlier branches so the later
handler (the one that validates op->args.size()) runs and malformed IR triggers
the ICHECK_EQ failure instead of falling through to bad C++ emission.

In `@src/transform/inject_tcgen05_fence.cc`:
- Around line 109-112: The current gate uses TargetIsSm100 which restricts to
SM100–110; change the check to use TargetHasSMVersionGE(opt_target.value(), 100)
so the pass applies to all SM100+ (Blackwell) targets; specifically, in the
block that gets the target via Optional<Target> opt_target =
f->GetAttr<Target>(tvm::attr::kTarget) and currently calls
TargetIsSm100(opt_target.value()), replace that call with
TargetHasSMVersionGE(opt_target.value(), 100) while keeping the existing
opt_target.defined() guard and the early return (return f) behavior.

---

Nitpick comments:
In `@src/op/builtin.h`:
- Around line 431-445: This file contains duplicate public declarations for the
two ops tcgen05_before_thread_sync() and tcgen05_after_thread_sync(); remove the
earlier duplicate block (the declarations inside the commented brief around the
first occurrence) so there is a single canonical declaration for each (keep the
later declarations at the other location), ensuring only the unique symbols
tcgen05_before_thread_sync and tcgen05_after_thread_sync remain declared once in
the header.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6fce64a2-16bb-432b-9c7c-09bd38ee5ca1

📥 Commits

Reviewing files that changed from the base of the PR and between 0f7c214 and fec911c.

📒 Files selected for processing (6)

src/op/builtin.cc
src/op/builtin.h
src/target/codegen_cuda.cc
src/transform/inject_tcgen05_fence.cc
tilelang/engine/phase.py
tilelang/transform/__init__.py

coderabbitai · 2026-03-31T09:59:05Z

src/target/codegen_cuda.cc

+  } else if (op->op.same_as(tl::tcgen05_before_thread_sync())) {
+    need_tcgen05_common_h_ = true;
+    print_extern_call_stmt("tl::tcgen05_before_thread_sync");
+  } else if (op->op.same_as(tl::tcgen05_after_thread_sync())) {
+    need_tcgen05_common_h_ = true;
+    print_extern_call_stmt("tl::tcgen05_after_thread_sync");


⚠️ Potential issue | 🟡 Minor

Remove the shadow tcgen05 fence handler.

These branches duplicate the existing handlers at Lines 2631-2640. Because this copy matches first, the later ICHECK_EQ(op->args.size(), 0U) never runs, so malformed IR now falls through to bad C++ emission instead of failing here.

🧹 Proposed cleanup

- } else if (op->op.same_as(tl::tcgen05_before_thread_sync())) { - need_tcgen05_common_h_ = true; - print_extern_call_stmt("tl::tcgen05_before_thread_sync"); - } else if (op->op.same_as(tl::tcgen05_after_thread_sync())) { - need_tcgen05_common_h_ = true; - print_extern_call_stmt("tl::tcgen05_after_thread_sync"); } else if (op->op.same_as(tl::tma_store_arrive())) {

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

} else if (op->op.same_as(tl::tcgen05_before_thread_sync())) {

need_tcgen05_common_h_ = true;

print_extern_call_stmt("tl::tcgen05_before_thread_sync");

} else if (op->op.same_as(tl::tcgen05_after_thread_sync())) {

need_tcgen05_common_h_ = true;

print_extern_call_stmt("tl::tcgen05_after_thread_sync");

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/target/codegen_cuda.cc` around lines 2034 - 2039, Remove the duplicate tcgen05 fence handler branches that check op->op.same_as(tl::tcgen05_before_thread_sync()) and tl::tcgen05_after_thread_sync() (which set need_tcgen05_common_h_ and call print_extern_call_stmt), because they shadow the intended handler that performs ICHECK_EQ(op->args.size(), 0U); delete these earlier branches so the later handler (the one that validates op->args.size()) runs and malformed IR triggers the ICHECK_EQ failure instead of falling through to bad C++ emission.

coderabbitai · 2026-03-31T09:59:05Z

src/transform/inject_tcgen05_fence.cc

+    // Only apply on SM100+ (Blackwell) targets.
+    Optional<Target> opt_target = f->GetAttr<Target>(tvm::attr::kTarget);
+    if (!opt_target.defined() || !TargetIsSm100(opt_target.value())) {
+      return f;


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n -C5 '\bTargetIsSm100\b|\bTargetHasSMVersionGE\b' src/target src/transform/inject_tcgen05_fence.cc

Repository: tile-ai/tilelang

Length of output: 4416

🏁 Script executed:

#!/bin/bash set -euo pipefail # Check for SM120 and other architecture references echo "=== Architecture definitions ===" rg -n 'SM120|SM130|arch.*12[0-9]|arch.*13[0-9]' src/target echo "=== Blackwell references in codebase ===" rg -n -i 'blackwell' src/ echo "=== File header for inject_tcgen05_fence ===" head -20 src/transform/inject_tcgen05_fence.cc echo "=== TargetIsSM120 function ===" rg -n -A3 'bool TargetIsSM120' src/target/utils.cc

Repository: tile-ai/tilelang

Length of output: 2835

Switch to TargetHasSMVersionGE(opt_target.value(), 100) to match the "SM100+" scope stated in the file header.

TargetIsSm100 limits the gate to arch 100–110, but the file header specifies SM100+ (Blackwell targets). Since TargetIsSM120 defines a separate architecture family (120–129), future SM120+ targets would bypass this pass despite requiring tcgen05 fence injection. The CUDA backend already uses TargetHasSMVersionGE(cur_target, 100) for SM100+ feature checks (codegen_cuda.cc:886), so align this gate with that pattern to ensure all Blackwell-family targets are covered.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/transform/inject_tcgen05_fence.cc` around lines 109 - 112, The current gate uses TargetIsSm100 which restricts to SM100–110; change the check to use TargetHasSMVersionGE(opt_target.value(), 100) so the pass applies to all SM100+ (Blackwell) targets; specifically, in the block that gets the target via Optional<Target> opt_target = f->GetAttr<Target>(tvm::attr::kTarget) and currently calls TargetIsSm100(opt_target.value()), replace that call with TargetHasSMVersionGE(opt_target.value(), 100) while keeping the existing opt_target.defined() guard and the early return (return f) behavior.

LeiWang1999 changed the title ~~[Transform] Add InjectTcgen05Fence pass for Blackwell (SM100+)~~ [Transform] Add InjectTcgen05Fence pass Mar 31, 2026

LeiWang1999 added 2 commits March 31, 2026 17:27

Merge branch 'main' of https://github.com/tile-ai/tilelang into injec…

0617294

…t-tcgen05-fence

fix

fec911c

coderabbitai bot reviewed Mar 31, 2026

View reviewed changes

Merge remote-tracking branch 'origin/main' into inject-tcgen05-fence

3d4a8ae

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Transform] Add InjectTcgen05Fence pass#2003

[Transform] Add InjectTcgen05Fence pass#2003
LeiWang1999 wants to merge 4 commits intotile-ai:mainfrom
LeiWang1999:inject-tcgen05-fence

LeiWang1999 commented Mar 31, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

github-actions bot commented Mar 31, 2026

Uh oh!

coderabbitai bot commented Mar 31, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram

Estimated Code Review Effort

Possibly Related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 31, 2026

Uh oh!

coderabbitai bot Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

LeiWang1999 commented Mar 31, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Generated code example

Test plan

Summary by CodeRabbit

Uh oh!

github-actions bot commented Mar 31, 2026

Uh oh!

coderabbitai bot commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated Code Review Effort

Possibly Related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

LeiWang1999 commented Mar 31, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 31, 2026 •

edited

Loading