[Bugfix] Fix incorrect sync hoist for fragment buffer conditions in ThreadSync#2030
Conversation
…hreadSync
The ConditionThreadPropertyChecker in ThreadSync incorrectly classified
conditions derived from fragment (local-scope) buffer loads as
non-block-uniform, solely based on storage scope. This caused the sync
planner to hoist __syncthreads() from inside the if-body to before the
if-statement, removing the write-before-read synchronization guarantee
between shared memory writes and TMA store reads.
Fragment buffers commonly hold block-uniform data when populated from
block-uniform global addresses (e.g., T.copy(BlockMask[blockIdx.y, :],
fragment)). The fix removes the scope-based heuristic and instead relies
on the recursive visit of buffer load indices — if any index depends on
threadIdx, VisitExpr_(VarNode*) will correctly mark the load as
non-block-uniform.
Before fix:
__syncthreads(); // hoisted here (too early)
if (a >= 0) {
write_to_shared(); // all threads
tma_store(); // elected thread — no sync protection!
}
After fix:
__syncthreads(); // loop-carried sync
if (a >= 0) {
write_to_shared(); // all threads
__syncthreads(); // correctly placed intra-iteration sync
tma_store(); // elected thread
}
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughModified buffer load handling in thread property checker to delegate uniformity determination to load index traversal instead of unconditionally marking thread-local buffer loads as non-uniform. Runtime dependency is always recorded during analysis. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@regression-perf |
Performance Regression Test ReportTriggered by: @LeiWang1999 Results
Artifacts
|
|
@regression-perf |
Performance Regression Test ReportTriggered by: @LeiWang1999 Results
Artifacts
|
…hreadSync (#2030) The ConditionThreadPropertyChecker in ThreadSync incorrectly classified conditions derived from fragment (local-scope) buffer loads as non-block-uniform, solely based on storage scope. This caused the sync planner to hoist __syncthreads() from inside the if-body to before the if-statement, removing the write-before-read synchronization guarantee between shared memory writes and TMA store reads. Fragment buffers commonly hold block-uniform data when populated from block-uniform global addresses (e.g., T.copy(BlockMask[blockIdx.y, :], fragment)). The fix removes the scope-based heuristic and instead relies on the recursive visit of buffer load indices — if any index depends on threadIdx, VisitExpr_(VarNode*) will correctly mark the load as non-block-uniform. Before fix: __syncthreads(); // hoisted here (too early) if (a >= 0) { write_to_shared(); // all threads tma_store(); // elected thread — no sync protection! } After fix: __syncthreads(); // loop-carried sync if (a >= 0) { write_to_shared(); // all threads __syncthreads(); // correctly placed intra-iteration sync tma_store(); // elected thread }
Summary
ConditionThreadPropertyCheckerwhere conditions derived from fragment (local-scope) buffer loads were incorrectly classified as non-block-uniform based solely on storage scopeThreadSyncto hoist__syncthreads()out of an if-body, removing the write-before-read synchronization between shared memory writes and TMA store readsProblem
When a blocksparse copy kernel uses a fragment buffer for block mask indices (e.g.,
a = block_mask_f[i]), the conditiona >= 0guarding the copy body is actually block-uniform — all threads in a block hold the same fragment data loaded fromBlockMask[blockIdx.y, :].However,
ConditionThreadPropertyChecker::VisitExpr_(BufferLoadNode*)marked all local-scope buffer loads as non-block-uniform. This triggered the sync hoist logic, which moved__syncthreads()from between the shared memory writes and TMA store to before the if-statement — breaking the synchronization guarantee.Before fix (incorrect):
After fix (correct):
Root Cause
In
ConditionThreadPropertyChecker::VisitExpr_(BufferLoadNode*):This unconditionally marked fragment buffer loads as non-block-uniform. The fix removes this check and instead relies on the recursive visit of buffer load indices — if any index depends on
threadIdx,VisitExpr_(VarNode*)will correctly setis_block_uniform = false.Test plan
test_blocksparse_copy_tma— previously failing, now passestest_blocksparse_copy_cp_async— passestest_tilelang_transform_thread_synctests passSummary by CodeRabbit