Skip to content

[Transform] Add InjectTcgen05Fence pass#2003

Open
LeiWang1999 wants to merge 4 commits intotile-ai:mainfrom
LeiWang1999:inject-tcgen05-fence
Open

[Transform] Add InjectTcgen05Fence pass#2003
LeiWang1999 wants to merge 4 commits intotile-ai:mainfrom
LeiWang1999:inject-tcgen05-fence

Conversation

@LeiWang1999
Copy link
Copy Markdown
Member

@LeiWang1999 LeiWang1999 commented Mar 31, 2026

Summary

  • Add InjectTcgen05Fence TIR pass that automatically inserts tcgen05.fence::before_thread_sync and tcgen05.fence::after_thread_sync around every __syncthreads() on Blackwell (SM100+) targets when the kernel uses TMEM/tcgen05 operations.
  • On Blackwell, the tcgen05 accumulator (TMEM) lives in a separate address space. Regular thread barriers do not synchronize TMEM state across threads. The fence instructions bridge this gap: before_fence flushes TMEM writes before the barrier, after_fence makes other threads' TMEM writes visible after the barrier.
  • The pass is a no-op on non-SM100 targets or functions without tcgen05/TMEM usage.

Changes

File Description
src/op/builtin.h / .cc Define two new TIR intrinsic Ops
src/target/codegen_cuda.cc Emit tl::tcgen05_before_thread_sync() / tl::tcgen05_after_thread_sync()
src/transform/inject_tcgen05_fence.cc Pass implementation
tilelang/transform/__init__.py Python pass registration
tilelang/engine/phase.py Insert pass after ThreadSync("shared.dyn")

Generated code example

Before:

__syncthreads();

After:

tl::tcgen05_before_thread_sync();
__syncthreads();
tl::tcgen05_after_thread_sync();

Test plan

  • Build passes (make succeeds)
  • Format passes (format.sh pre-commit checks all pass)
  • Verified fence insertion on gemm_tcgen5mma SM100 example: all __syncthreads() correctly wrapped with fence pair
  • End-to-end correctness test on Blackwell hardware

Summary by CodeRabbit

  • New Features
    • Added synchronization fence operations for NVIDIA Blackwell (SM100+) GPU support.
    • Introduced automatic fence injection around thread synchronization barriers on Blackwell targets to optimize memory synchronization.
    • New transformation pass conditionally applies fence wrapping to shared memory synchronization calls on compatible architectures.

On Blackwell GPUs, the tcgen05 accumulator (TMEM) resides in a separate
address space that is not synchronized by regular thread barriers like
__syncthreads() or mbarrier. Two PTX fence instructions are required to
ensure cross-thread visibility of TMEM state:

  tcgen05.fence::before_thread_sync  -- flush TMEM before barrier
  tcgen05.fence::after_thread_sync   -- pull TMEM after barrier

This commit introduces the `InjectTcgen05Fence` TIR pass that
automatically wraps every `tvm_storage_sync("shared")` call with the
fence pair when the target is SM100+ and the function uses tcgen05/TMEM
operations.

Changes:
- Define two new TIR intrinsic Ops: `tcgen05_before_thread_sync` and
  `tcgen05_after_thread_sync` (builtin.h/cc)
- Add codegen support to emit `tl::tcgen05_before_thread_sync()` and
  `tl::tcgen05_after_thread_sync()` (codegen_cuda.cc)
- Implement the `InjectTcgen05Fence` pass (inject_tcgen05_fence.cc)
- Register the pass in the Python transform module (__init__.py)
- Insert the pass in OptimizeForTarget after ThreadSync (phase.py)
@github-actions
Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@LeiWang1999 LeiWang1999 changed the title [Transform] Add InjectTcgen05Fence pass for Blackwell (SM100+) [Transform] Add InjectTcgen05Fence pass Mar 31, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 31, 2026

📝 Walkthrough

Walkthrough

This PR introduces new thread synchronization fence intrinsics (tcgen05_before_thread_sync and tcgen05_after_thread_sync) for Blackwell GPUs. The changes include builtin declarations, CUDA codegen support, a new transformation pass that wraps shared memory synchronization barriers with these fences, and integration into the optimization pipeline.

Changes

Cohort / File(s) Summary
Intrinsic Declarations
src/op/builtin.h, src/op/builtin.cc
Added two new builtin intrinsic declarations for tcgen05_before_thread_sync and tcgen05_after_thread_sync with opaque effect kind and zero inputs. Builtin registrations repositioned in .cc file (moved earlier, same definitions).
CUDA Code Generation
src/target/codegen_cuda.cc
Extended CallNode handling to emit extern calls for the new tcgen05_before_thread_sync and tcgen05_after_thread_sync intrinsics during CUDA code generation.
Fence Injection Pass
src/transform/inject_tcgen05_fence.cc
Introduced new InjectTcgen05Fence TVM transformation pass that rewrites PrimFunc bodies to conditionally wrap tvm_storage_sync("shared") and tvm_storage_sync("shared.dyn") calls with fence intrinsics on SM100+ (Blackwell) targets. Uses StmtExprMutator to inject before/after fence calls.
Pass Integration
tilelang/engine/phase.py, tilelang/transform/__init__.py
Integrated new InjectTcgen05Fence pass into the optimization pipeline after thread synchronization injection, and exposed via Python FFI wrapper function.

Sequence Diagram

sequenceDiagram
    participant Pass as InjectTcgen05Fence<br/>Pass
    participant PrimFunc as PrimFunc Body
    participant Visitor as StmtExprMutator<br/>Visitor
    participant TIR as TIR Nodes

    Pass->>PrimFunc: Check target attribute<br/>(SM100+?)
    PrimFunc-->>Pass: Target verified
    Pass->>PrimFunc: Scan for tcgen05/TMEM<br/>operations
    PrimFunc-->>Pass: Operations detected
    
    Pass->>Visitor: Traverse function body
    Visitor->>TIR: Find tvm_storage_sync<br/>("shared"/"shared.dyn")
    TIR-->>Visitor: Located sync calls
    
    Visitor->>Visitor: Build replacement:<br/>SeqStmt[ before_fence,<br/>original_sync,<br/>after_fence ]
    Visitor-->>Pass: Rewritten body
    
    Pass->>PrimFunc: Update with<br/>transformed body
    PrimFunc-->>Pass: PrimFunc updated
    Pass-->>Pass: Return modified<br/>function
Loading

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly Related PRs

Poem

🐰 Fences spring up 'round shared barriers bright,
Blackwell's tcgen05 holds sync just right,
Before and after, a dance so neat,
Memory whispers, threads that meet!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title '[Transform] Add InjectTcgen05Fence pass' clearly and concisely describes the main change: introducing a new transformation pass.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/op/builtin.h (1)

431-445: Drop the duplicate public declarations.

These two APIs are already declared again at Lines 729-735. Keeping both blocks means the next signature or doc change has two places to update.

✂️ Proposed cleanup
-/*!
- * \brief Emit tcgen05.fence::before_thread_sync on Blackwell (SM100+)
- *
- * tcgen05_before_thread_sync()
- *
- */
-TVM_DLL const Op &tcgen05_before_thread_sync();
-
-/*!
- * \brief Emit tcgen05.fence::after_thread_sync on Blackwell (SM100+)
- *
- * tcgen05_after_thread_sync()
- *
- */
-TVM_DLL const Op &tcgen05_after_thread_sync();
-
 /*!
  * \brief Indicate arrival of warp issuing TMA_STORE
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/op/builtin.h` around lines 431 - 445, This file contains duplicate public
declarations for the two ops tcgen05_before_thread_sync() and
tcgen05_after_thread_sync(); remove the earlier duplicate block (the
declarations inside the commented brief around the first occurrence) so there is
a single canonical declaration for each (keep the later declarations at the
other location), ensuring only the unique symbols tcgen05_before_thread_sync and
tcgen05_after_thread_sync remain declared once in the header.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/target/codegen_cuda.cc`:
- Around line 2034-2039: Remove the duplicate tcgen05 fence handler branches
that check op->op.same_as(tl::tcgen05_before_thread_sync()) and
tl::tcgen05_after_thread_sync() (which set need_tcgen05_common_h_ and call
print_extern_call_stmt), because they shadow the intended handler that performs
ICHECK_EQ(op->args.size(), 0U); delete these earlier branches so the later
handler (the one that validates op->args.size()) runs and malformed IR triggers
the ICHECK_EQ failure instead of falling through to bad C++ emission.

In `@src/transform/inject_tcgen05_fence.cc`:
- Around line 109-112: The current gate uses TargetIsSm100 which restricts to
SM100–110; change the check to use TargetHasSMVersionGE(opt_target.value(), 100)
so the pass applies to all SM100+ (Blackwell) targets; specifically, in the
block that gets the target via Optional<Target> opt_target =
f->GetAttr<Target>(tvm::attr::kTarget) and currently calls
TargetIsSm100(opt_target.value()), replace that call with
TargetHasSMVersionGE(opt_target.value(), 100) while keeping the existing
opt_target.defined() guard and the early return (return f) behavior.

---

Nitpick comments:
In `@src/op/builtin.h`:
- Around line 431-445: This file contains duplicate public declarations for the
two ops tcgen05_before_thread_sync() and tcgen05_after_thread_sync(); remove the
earlier duplicate block (the declarations inside the commented brief around the
first occurrence) so there is a single canonical declaration for each (keep the
later declarations at the other location), ensuring only the unique symbols
tcgen05_before_thread_sync and tcgen05_after_thread_sync remain declared once in
the header.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6fce64a2-16bb-432b-9c7c-09bd38ee5ca1

📥 Commits

Reviewing files that changed from the base of the PR and between 0f7c214 and fec911c.

📒 Files selected for processing (6)
  • src/op/builtin.cc
  • src/op/builtin.h
  • src/target/codegen_cuda.cc
  • src/transform/inject_tcgen05_fence.cc
  • tilelang/engine/phase.py
  • tilelang/transform/__init__.py

Comment on lines +2034 to +2039
} else if (op->op.same_as(tl::tcgen05_before_thread_sync())) {
need_tcgen05_common_h_ = true;
print_extern_call_stmt("tl::tcgen05_before_thread_sync");
} else if (op->op.same_as(tl::tcgen05_after_thread_sync())) {
need_tcgen05_common_h_ = true;
print_extern_call_stmt("tl::tcgen05_after_thread_sync");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Remove the shadow tcgen05 fence handler.

These branches duplicate the existing handlers at Lines 2631-2640. Because this copy matches first, the later ICHECK_EQ(op->args.size(), 0U) never runs, so malformed IR now falls through to bad C++ emission instead of failing here.

🧹 Proposed cleanup
-  } else if (op->op.same_as(tl::tcgen05_before_thread_sync())) {
-    need_tcgen05_common_h_ = true;
-    print_extern_call_stmt("tl::tcgen05_before_thread_sync");
-  } else if (op->op.same_as(tl::tcgen05_after_thread_sync())) {
-    need_tcgen05_common_h_ = true;
-    print_extern_call_stmt("tl::tcgen05_after_thread_sync");
   } else if (op->op.same_as(tl::tma_store_arrive())) {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
} else if (op->op.same_as(tl::tcgen05_before_thread_sync())) {
need_tcgen05_common_h_ = true;
print_extern_call_stmt("tl::tcgen05_before_thread_sync");
} else if (op->op.same_as(tl::tcgen05_after_thread_sync())) {
need_tcgen05_common_h_ = true;
print_extern_call_stmt("tl::tcgen05_after_thread_sync");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/target/codegen_cuda.cc` around lines 2034 - 2039, Remove the duplicate
tcgen05 fence handler branches that check
op->op.same_as(tl::tcgen05_before_thread_sync()) and
tl::tcgen05_after_thread_sync() (which set need_tcgen05_common_h_ and call
print_extern_call_stmt), because they shadow the intended handler that performs
ICHECK_EQ(op->args.size(), 0U); delete these earlier branches so the later
handler (the one that validates op->args.size()) runs and malformed IR triggers
the ICHECK_EQ failure instead of falling through to bad C++ emission.

Comment on lines +109 to +112
// Only apply on SM100+ (Blackwell) targets.
Optional<Target> opt_target = f->GetAttr<Target>(tvm::attr::kTarget);
if (!opt_target.defined() || !TargetIsSm100(opt_target.value())) {
return f;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n -C5 '\bTargetIsSm100\b|\bTargetHasSMVersionGE\b' src/target src/transform/inject_tcgen05_fence.cc

Repository: tile-ai/tilelang

Length of output: 4416


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check for SM120 and other architecture references
echo "=== Architecture definitions ==="
rg -n 'SM120|SM130|arch.*12[0-9]|arch.*13[0-9]' src/target

echo "=== Blackwell references in codebase ==="
rg -n -i 'blackwell' src/

echo "=== File header for inject_tcgen05_fence ==="
head -20 src/transform/inject_tcgen05_fence.cc

echo "=== TargetIsSM120 function ==="
rg -n -A3 'bool TargetIsSM120' src/target/utils.cc

Repository: tile-ai/tilelang

Length of output: 2835


Switch to TargetHasSMVersionGE(opt_target.value(), 100) to match the "SM100+" scope stated in the file header.

TargetIsSm100 limits the gate to arch 100–110, but the file header specifies SM100+ (Blackwell targets). Since TargetIsSM120 defines a separate architecture family (120–129), future SM120+ targets would bypass this pass despite requiring tcgen05 fence injection. The CUDA backend already uses TargetHasSMVersionGE(cur_target, 100) for SM100+ feature checks (codegen_cuda.cc:886), so align this gate with that pattern to ensure all Blackwell-family targets are covered.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/transform/inject_tcgen05_fence.cc` around lines 109 - 112, The current
gate uses TargetIsSm100 which restricts to SM100–110; change the check to use
TargetHasSMVersionGE(opt_target.value(), 100) so the pass applies to all SM100+
(Blackwell) targets; specifically, in the block that gets the target via
Optional<Target> opt_target = f->GetAttr<Target>(tvm::attr::kTarget) and
currently calls TargetIsSm100(opt_target.value()), replace that call with
TargetHasSMVersionGE(opt_target.value(), 100) while keeping the existing
opt_target.defined() guard and the early return (return f) behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants