[TLERaw] Remove redundant copy && Support buffered_tensor with tle_raw.call #479

Merged
sgjzfzzf merged 9 commits into flagos-ai:triton_v3.6.x from lizhangyu258:triton_v3.6.x
Apr 13, 2026

@lizhangyu258 lizhangyu258 commented Mar 26, 2026

[TLERaw]

  • Eliminate the redundant copies introduced when tensors are passed as parameters to tle_raw.call().
  • Support passing buffered_tensors as parameters with tle_raw.call_smem().
  • Update mlir/05-topk.py to pass an initialization value to alloc.
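
To make the first two items concrete, here is a toy model of why passing a buffer instead of a tensor can remove copies from a loop. It is only a sketch: plain Python lists stand in for register tensors and shared-memory buffers, and `CopyCounter`, `kernel_step`, and both loop functions are hypothetical names that do not come from the TLE API.

```python
# Toy model only: plain lists stand in for tensors and shared-memory buffers.
# None of these names are part of the TLE API.

class CopyCounter:
    """Counts whole-buffer copies between the two simulated memory spaces."""

    def __init__(self):
        self.copies = 0

    def copy(self, src):
        self.copies += 1
        return list(src)


def kernel_step(buf):
    """Hypothetical raw kernel: increments every element in place."""
    for i in range(len(buf)):
        buf[i] += 1


def loop_with_tensor_args(init, steps, counter):
    """Each iteration copies the tensor into a buffer and the result back."""
    tensor = list(init)
    for _ in range(steps):
        smem = counter.copy(tensor)   # tensor -> simulated shared memory
        kernel_step(smem)
        tensor = counter.copy(smem)   # simulated shared memory -> tensor
    return tensor


def loop_with_buffer_args(init, steps, counter):
    """The buffer is created once with its initial value and then reused."""
    smem = counter.copy(init)         # single initializing copy
    for _ in range(steps):
        kernel_step(smem)             # the kernel works on the buffer directly
    return smem


by_tensor, by_buffer = CopyCounter(), CopyCounter()
result_a = loop_with_tensor_args([0, 0], 4, by_tensor)
result_b = loop_with_buffer_args([0, 0], 4, by_buffer)
print(result_a == result_b, by_tensor.copies, by_buffer.copies)
```

In this model both loops compute the same values, but the tensor-argument version performs two copies per iteration while the buffer-argument version performs one copy in total.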

[Bug]

  • When passing parameters with tle_raw.call_smem(), the tle.passes.add_assign_local_pointers_encoding(pm) statement must be commented out; otherwise, compilation errors occur intermittently.
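
As an illustration of the workaround above, the sketch below filters one pass out of a pass list when a kernel uses call_smem, instead of commenting the statement out by hand. Everything here is assumed for illustration: `build_pass_list` and the `uses_call_smem` flag are hypothetical, the pass names are plain placeholder strings, and nothing in the PR says the compiler exposes such a switch.

```python
# Hypothetical stand-in for a pass pipeline; this is not Triton's pass manager.

SKIPPED_FOR_CALL_SMEM = "add_assign_local_pointers_encoding"


def build_pass_list(base_passes, uses_call_smem):
    """Return the passes to run, dropping the problematic one for call_smem.

    `uses_call_smem` is an assumed flag; the PR does not describe how (or
    whether) call_smem usage is detected.
    """
    if not uses_call_smem:
        return list(base_passes)
    return [name for name in base_passes if name != SKIPPED_FOR_CALL_SMEM]


# Placeholder names; the real pipeline order is not shown in the PR.
base = ["some_earlier_pass", SKIPPED_FOR_CALL_SMEM, "some_later_pass"]
print(build_pass_list(base, uses_call_smem=True))
```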

[Fixed]

  • No errors occur when testing on the tle_topk_3.6 branch.

CLAassistant commented Mar 26, 2026

CLA assistant check
All committers have signed the CLA.

Copilot AI left a comment

Pull request overview

This PR updates the TLE Raw pipeline to (1) support passing ttg::MemDescType/buffered shared-memory tensors through tle_raw calls and (2) eliminate redundant shared-memory copies by rewriting specific scf.for loop argument patterns.

Changes:

  • Extend TLERaw protocol flatten/pack logic to handle ttg::MemDescType and add a MemDescPattern to signatures.
  • Add and wire a new tle-remove-redundant-copy MLIR pass (plus small SCF utility helpers) into the NVIDIA backend compilation pipeline.
  • Add Python tle_raw.call_smem() and extend tle_gpu.alloc() with an init_value to support initialized shared-memory buffers; update tutorials accordingly.
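
The flatten/pack idea in the first item can be sketched as a round trip between typed arguments and a flat value list guided by a signature. This is a minimal sketch under assumptions: the `Tensor` and `MemDesc` classes, the field layout (base followed by shape), and the signature tags are all invented for illustration, since the PR page does not show how Protocol.cpp actually lays out a memdesc.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tensor:
    """Stand-in for a register tensor value (hypothetical)."""
    data: tuple


@dataclass(frozen=True)
class MemDesc:
    """Stand-in for a shared-memory descriptor (hypothetical layout)."""
    base: int
    shape: tuple


def flatten(args):
    """Flatten typed arguments into (signature, flat list of scalars)."""
    signature, flat = [], []
    for arg in args:
        if isinstance(arg, MemDesc):
            signature.append(("memdesc", len(arg.shape)))
            flat.append(arg.base)
            flat.extend(arg.shape)
        elif isinstance(arg, Tensor):
            signature.append(("tensor", len(arg.data)))
            flat.extend(arg.data)
        else:
            raise TypeError(f"unsupported argument type: {type(arg).__name__}")
    return signature, flat


def pack(signature, flat):
    """Rebuild typed arguments from the signature and the flat scalars."""
    args, pos = [], 0
    for kind, count in signature:
        if kind == "memdesc":
            base = flat[pos]
            shape = tuple(flat[pos + 1:pos + 1 + count])
            args.append(MemDesc(base, shape))
            pos += 1 + count
        else:
            args.append(Tensor(tuple(flat[pos:pos + count])))
            pos += count
    return args


original = [Tensor((1, 2)), MemDesc(64, (4, 8))]
signature, flat = flatten(original)
print(flat)
print(pack(signature, flat) == original)
```

The point of the signature entry per argument kind is that the packer can tell how many flat values belong to each argument, which is the role a dedicated memdesc pattern would play alongside the existing ones.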

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| third_party/tle/utils/lib/Protocol.cpp | Add MemDesc signature handling and allow packing memdesc returns in LLVMStructurePattern. |
| third_party/tle/utils/include/Protocol.h | Expose MemDescPattern and include TritonGPU dialect types for memdesc support. |
| third_party/tle/triton_tle.cc | Register the new raw pass wrapper add_tle_remove_redundant_copy. |
| third_party/tle/dialect/lib/Transforms/TleUtility.cpp | Add SCF helper utilities used by multiple transforms. |
| third_party/tle/dialect/lib/Transforms/RemoveRedundantCopy.cpp | New transform pass to rewrite loop-carried args to avoid redundant local copies. |
| third_party/tle/dialect/lib/Transforms/ConvertArgToMemDesc.cpp | Adjust arg-to-memdesc conversion behavior in single-loop/iter-arg cases and barrier insertion conditions. |
| third_party/tle/dialect/lib/Transforms/CMakeLists.txt | Build integration for the new transform sources. |
| third_party/tle/dialect/include/Transforms/TleUtility.h | Header for SCF helper utilities. |
| third_party/tle/dialect/include/Transforms/RemoveRedundantCopy.h | Header for new redundant-copy removal patterns. |
| third_party/tle/dialect/include/Transforms/Passes.td | Add tle-remove-redundant-copy pass definition. |
| third_party/nvidia/backend/compiler.py | Insert the new raw pass into the CUDA TTGIR pipeline. |
| python/triton/experimental/tle/language/raw/core.py | Add call_smem() returning buffered_tensor outputs. |
| python/triton/experimental/tle/language/raw/__init__.py | Export call_smem. |
| python/triton/experimental/tle/language/gpu/core.py | Add init_value support to tle_gpu.alloc() for initialized smem buffers. |
| python/tutorials/tle/raw/mlir/05-topk.py | Switch to initialized smem buffers + call_smem usage for outputs/inputs. |
| python/tutorials/tle/raw/mlir/03-matrix-multiplication-smem.py | New MLIR tutorial example using call_smem and buffered shared memory. |
| python/tutorials/tle/raw/cuda/03-matrix-multiplication-smem.py | New CUDA tutorial example using call_smem and buffered shared memory. |

Comment thread third_party/tle/dialect/lib/Transforms/RemoveRedundantCopy.cpp Outdated
Comment thread third_party/tle/dialect/lib/Transforms/RemoveRedundantCopy.cpp
Comment thread third_party/tle/utils/lib/Protocol.cpp
sgjzfzzf previously approved these changes Apr 9, 2026
Comment thread third_party/tle/utils/lib/Protocol.cpp
sunnycase previously approved these changes Apr 9, 2026
Collaborator

@sunnycase sunnycase left a comment

LGTM

Collaborator

@sunnycase sunnycase left a comment

LGTM

Collaborator

@zhzhcookie zhzhcookie left a comment

LG

@sgjzfzzf sgjzfzzf merged commit aa0db2f into flagos-ai:triton_v3.6.x Apr 13, 2026
4 checks passed
7 participants