[Transform] Add InjectTcgen05Fence pass#2003
[Transform] Add InjectTcgen05Fence pass#2003LeiWang1999 wants to merge 4 commits intotile-ai:mainfrom
Conversation
On Blackwell GPUs, the tcgen05 accumulator (TMEM) resides in a separate
address space that is not synchronized by regular thread barriers like
__syncthreads() or mbarrier. Two PTX fence instructions are required to
ensure cross-thread visibility of TMEM state:
tcgen05.fence::before_thread_sync -- flush TMEM before barrier
tcgen05.fence::after_thread_sync -- pull TMEM after barrier
This commit introduces the `InjectTcgen05Fence` TIR pass that
automatically wraps every `tvm_storage_sync("shared")` call with the
fence pair when the target is SM100+ and the function uses tcgen05/TMEM
operations.
Changes:
- Define two new TIR intrinsic Ops: `tcgen05_before_thread_sync` and
`tcgen05_after_thread_sync` (builtin.h/cc)
- Add codegen support to emit `tl::tcgen05_before_thread_sync()` and
`tl::tcgen05_after_thread_sync()` (codegen_cuda.cc)
- Implement the `InjectTcgen05Fence` pass (inject_tcgen05_fence.cc)
- Register the pass in the Python transform module (__init__.py)
- Insert the pass in OptimizeForTarget after ThreadSync (phase.py)
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
…t-tcgen05-fence
📝 WalkthroughWalkthroughThis PR introduces new thread synchronization fence intrinsics ( Changes
Sequence DiagramsequenceDiagram
participant Pass as InjectTcgen05Fence<br/>Pass
participant PrimFunc as PrimFunc Body
participant Visitor as StmtExprMutator<br/>Visitor
participant TIR as TIR Nodes
Pass->>PrimFunc: Check target attribute<br/>(SM100+?)
PrimFunc-->>Pass: Target verified
Pass->>PrimFunc: Scan for tcgen05/TMEM<br/>operations
PrimFunc-->>Pass: Operations detected
Pass->>Visitor: Traverse function body
Visitor->>TIR: Find tvm_storage_sync<br/>("shared"/"shared.dyn")
TIR-->>Visitor: Located sync calls
Visitor->>Visitor: Build replacement:<br/>SeqStmt[ before_fence,<br/>original_sync,<br/>after_fence ]
Visitor-->>Pass: Rewritten body
Pass->>PrimFunc: Update with<br/>transformed body
PrimFunc-->>Pass: PrimFunc updated
Pass-->>Pass: Return modified<br/>function
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly Related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/op/builtin.h (1)
431-445: Drop the duplicate public declarations.These two APIs are already declared again at Lines 729-735. Keeping both blocks means the next signature or doc change has two places to update.
✂️ Proposed cleanup
-/*! - * \brief Emit tcgen05.fence::before_thread_sync on Blackwell (SM100+) - * - * tcgen05_before_thread_sync() - * - */ -TVM_DLL const Op &tcgen05_before_thread_sync(); - -/*! - * \brief Emit tcgen05.fence::after_thread_sync on Blackwell (SM100+) - * - * tcgen05_after_thread_sync() - * - */ -TVM_DLL const Op &tcgen05_after_thread_sync(); - /*! * \brief Indicate arrival of warp issuing TMA_STORE🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/op/builtin.h` around lines 431 - 445, This file contains duplicate public declarations for the two ops tcgen05_before_thread_sync() and tcgen05_after_thread_sync(); remove the earlier duplicate block (the declarations inside the commented brief around the first occurrence) so there is a single canonical declaration for each (keep the later declarations at the other location), ensuring only the unique symbols tcgen05_before_thread_sync and tcgen05_after_thread_sync remain declared once in the header.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/target/codegen_cuda.cc`:
- Around line 2034-2039: Remove the duplicate tcgen05 fence handler branches
that check op->op.same_as(tl::tcgen05_before_thread_sync()) and
tl::tcgen05_after_thread_sync() (which set need_tcgen05_common_h_ and call
print_extern_call_stmt), because they shadow the intended handler that performs
ICHECK_EQ(op->args.size(), 0U); delete these earlier branches so the later
handler (the one that validates op->args.size()) runs and malformed IR triggers
the ICHECK_EQ failure instead of falling through to bad C++ emission.
In `@src/transform/inject_tcgen05_fence.cc`:
- Around line 109-112: The current gate uses TargetIsSm100 which restricts to
SM100–110; change the check to use TargetHasSMVersionGE(opt_target.value(), 100)
so the pass applies to all SM100+ (Blackwell) targets; specifically, in the
block that gets the target via Optional<Target> opt_target =
f->GetAttr<Target>(tvm::attr::kTarget) and currently calls
TargetIsSm100(opt_target.value()), replace that call with
TargetHasSMVersionGE(opt_target.value(), 100) while keeping the existing
opt_target.defined() guard and the early return (return f) behavior.
---
Nitpick comments:
In `@src/op/builtin.h`:
- Around line 431-445: This file contains duplicate public declarations for the
two ops tcgen05_before_thread_sync() and tcgen05_after_thread_sync(); remove the
earlier duplicate block (the declarations inside the commented brief around the
first occurrence) so there is a single canonical declaration for each (keep the
later declarations at the other location), ensuring only the unique symbols
tcgen05_before_thread_sync and tcgen05_after_thread_sync remain declared once in
the header.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6fce64a2-16bb-432b-9c7c-09bd38ee5ca1
📒 Files selected for processing (6)
src/op/builtin.ccsrc/op/builtin.hsrc/target/codegen_cuda.ccsrc/transform/inject_tcgen05_fence.cctilelang/engine/phase.pytilelang/transform/__init__.py
| } else if (op->op.same_as(tl::tcgen05_before_thread_sync())) { | ||
| need_tcgen05_common_h_ = true; | ||
| print_extern_call_stmt("tl::tcgen05_before_thread_sync"); | ||
| } else if (op->op.same_as(tl::tcgen05_after_thread_sync())) { | ||
| need_tcgen05_common_h_ = true; | ||
| print_extern_call_stmt("tl::tcgen05_after_thread_sync"); |
There was a problem hiding this comment.
Remove the shadow tcgen05 fence handler.
These branches duplicate the existing handlers at Lines 2631-2640. Because this copy matches first, the later ICHECK_EQ(op->args.size(), 0U) never runs, so malformed IR now falls through to bad C++ emission instead of failing here.
🧹 Proposed cleanup
- } else if (op->op.same_as(tl::tcgen05_before_thread_sync())) {
- need_tcgen05_common_h_ = true;
- print_extern_call_stmt("tl::tcgen05_before_thread_sync");
- } else if (op->op.same_as(tl::tcgen05_after_thread_sync())) {
- need_tcgen05_common_h_ = true;
- print_extern_call_stmt("tl::tcgen05_after_thread_sync");
} else if (op->op.same_as(tl::tma_store_arrive())) {📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| } else if (op->op.same_as(tl::tcgen05_before_thread_sync())) { | |
| need_tcgen05_common_h_ = true; | |
| print_extern_call_stmt("tl::tcgen05_before_thread_sync"); | |
| } else if (op->op.same_as(tl::tcgen05_after_thread_sync())) { | |
| need_tcgen05_common_h_ = true; | |
| print_extern_call_stmt("tl::tcgen05_after_thread_sync"); |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/target/codegen_cuda.cc` around lines 2034 - 2039, Remove the duplicate
tcgen05 fence handler branches that check
op->op.same_as(tl::tcgen05_before_thread_sync()) and
tl::tcgen05_after_thread_sync() (which set need_tcgen05_common_h_ and call
print_extern_call_stmt), because they shadow the intended handler that performs
ICHECK_EQ(op->args.size(), 0U); delete these earlier branches so the later
handler (the one that validates op->args.size()) runs and malformed IR triggers
the ICHECK_EQ failure instead of falling through to bad C++ emission.
| // Only apply on SM100+ (Blackwell) targets. | ||
| Optional<Target> opt_target = f->GetAttr<Target>(tvm::attr::kTarget); | ||
| if (!opt_target.defined() || !TargetIsSm100(opt_target.value())) { | ||
| return f; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n -C5 '\bTargetIsSm100\b|\bTargetHasSMVersionGE\b' src/target src/transform/inject_tcgen05_fence.ccRepository: tile-ai/tilelang
Length of output: 4416
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check for SM120 and other architecture references
echo "=== Architecture definitions ==="
rg -n 'SM120|SM130|arch.*12[0-9]|arch.*13[0-9]' src/target
echo "=== Blackwell references in codebase ==="
rg -n -i 'blackwell' src/
echo "=== File header for inject_tcgen05_fence ==="
head -20 src/transform/inject_tcgen05_fence.cc
echo "=== TargetIsSM120 function ==="
rg -n -A3 'bool TargetIsSM120' src/target/utils.ccRepository: tile-ai/tilelang
Length of output: 2835
Switch to TargetHasSMVersionGE(opt_target.value(), 100) to match the "SM100+" scope stated in the file header.
TargetIsSm100 limits the gate to arch 100–110, but the file header specifies SM100+ (Blackwell targets). Since TargetIsSM120 defines a separate architecture family (120–129), future SM120+ targets would bypass this pass despite requiring tcgen05 fence injection. The CUDA backend already uses TargetHasSMVersionGE(cur_target, 100) for SM100+ feature checks (codegen_cuda.cc:886), so align this gate with that pattern to ensure all Blackwell-family targets are covered.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/transform/inject_tcgen05_fence.cc` around lines 109 - 112, The current
gate uses TargetIsSm100 which restricts to SM100–110; change the check to use
TargetHasSMVersionGE(opt_target.value(), 100) so the pass applies to all SM100+
(Blackwell) targets; specifically, in the block that gets the target via
Optional<Target> opt_target = f->GetAttr<Target>(tvm::attr::kTarget) and
currently calls TargetIsSm100(opt_target.value()), replace that call with
TargetHasSMVersionGE(opt_target.value(), 100) while keeping the existing
opt_target.defined() guard and the early return (return f) behavior.
Summary
InjectTcgen05FenceTIR pass that automatically insertstcgen05.fence::before_thread_syncandtcgen05.fence::after_thread_syncaround every__syncthreads()on Blackwell (SM100+) targets when the kernel uses TMEM/tcgen05 operations.before_fenceflushes TMEM writes before the barrier,after_fencemakes other threads' TMEM writes visible after the barrier.Changes
src/op/builtin.h/.ccsrc/target/codegen_cuda.cctl::tcgen05_before_thread_sync()/tl::tcgen05_after_thread_sync()src/transform/inject_tcgen05_fence.cctilelang/transform/__init__.pytilelang/engine/phase.pyThreadSync("shared.dyn")Generated code example
Before:
__syncthreads();After:
Test plan
makesucceeds)format.shpre-commit checks all pass)gemm_tcgen5mmaSM100 example: all__syncthreads()correctly wrapped with fence pairSummary by CodeRabbit