Skip to content

[codex] add compilable a3 test3 validshape tpush/tpop sample#475

Draft
zhangstevenunity wants to merge 1 commit intomainfrom
codex/add-compilable-a3-test3-validshape-pop
Draft

[codex] add compilable a3 test3 validshape tpush/tpop sample#475
zhangstevenunity wants to merge 1 commit intomainfrom
codex/add-compilable-a3-test3-validshape-pop

Conversation

@zhangstevenunity
Copy link
Copy Markdown
Collaborator

What changed

  • recreate test/samples/TPushTPop/a3/test3/kernel.pto using the same PTO syntax/style as the existing a3/test1 and a3/test2 samples
  • keep the requested semantics: pto.tmatmul writes into an ACC tile, pto.set_validshape changes it to (8, 16), and the vector side pops 4x16 tiles
  • keep the looped cube->vector handoff structure so the sample stays aligned with the existing TPushTPop/a3/test2 pattern

Why

The first PR carried the intended logic, but the file shape/syntax did not match the current A3 sample style closely enough and failed parsing with the available local ptoas.

Validation

  • compiled the updated sample with an existing local WSL ptoas 0.23
  • command used:
    • /home/rdp/hw-native-sys/PTOAS/build/tools/ptoas/ptoas --pto-arch=a3 --enable-insert-sync -o /tmp/ptoas-a3-test3/kernel.cpp /mnt/c/Users/rdp/Documents/ptoas/test/samples/TPushTPop/a3/test3/kernel.pto
  • confirmed generated C++ contains:
    • SetValidShape(...)
    • Tile<TileType::Vec, float, 4, 16, ...>
    • TPOP<..., Tile<TileType::Vec, float, 4, 16, ...>>

Notes

  • rebuilding the current workspace ptoas under build-wsl-pr443 is presently blocked by an unrelated compile error in lib/PTO/IR/PTO.cpp (verifyTileBufSameShapeAndElem missing), so sample validation used an already-built local ptoas binary instead

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new test case for matrix multiplication using TPush and TPop operations between cube and vector kernels. Several issues were identified regarding the data flow and logic: the vector kernel's loop count and tile shapes do not align with the data produced by the cube kernel, and the use of tfree_from_aic may lead to data loss. Additionally, there are opportunities to optimize the vector kernel by removing redundant tile allocations and move operations. Finally, the placement of set_validshape within the cube kernel loop may cause unintended side effects on subsequent matrix multiplications due to in-place metadata updates.

Comment on lines +84 to +90
scf.for %i = %c0 to %c4 step %c1 {
%fifo_tile = pto.tpop_from_aic {split = 0}
-> !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>
pto.tmov ins(%fifo_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>) outs(%vec_print : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>)
pto.tprint ins(%vec_print : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>)
pto.tfree_from_aic {split = 0}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There is a significant data mismatch and potential logic error in the vector kernel loop:

  1. Loop Count Mismatch: The cube kernel pushes 4 tiles of shape 8x16 (total 32 rows). The vector kernel loop only runs 4 times (%c4) and pops 4x16 tiles (total 16 rows). To consume all pushed data, the vector loop should run 8 times (%c8).
  2. Shape and Split Mismatch: The cube pushes 8x16 tiles with {split = 0}, while the vector pops 4x16 tiles with {split = 0}. Typically, {split = 0} (no split) requires exact shape matching between the producer and consumer. If splitting is intended, the split attribute should likely be non-zero (e.g., split = 1 for height split) to allow popping sub-tiles from a larger pushed slot.
  3. Premature Slot Release: pto.tfree_from_aic is called in every iteration. If one push (8x16) is intended to be consumed by two pops (4x16), calling tfree after the first pop will release the FIFO slot and discard the remaining data.

Comment on lines +87 to +88
pto.tmov ins(%fifo_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>) outs(%vec_print : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>)
pto.tprint ins(%vec_print : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The pto.tmov operation is redundant. %fifo_tile is already a tile buffer in the vec location with the required shape and type. You can pass %fifo_tile directly to pto.tprint and remove this move, as well as the unnecessary %vec_print allocation at line 83.

      pto.tprint ins(%fifo_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>)

Comment on lines +59 to +60
pto.set_validshape %acc_tile, %c8, %c16
: !pto.tile_buf<loc=acc, dtype=f32, rows=16, cols=16, v_row=?, v_col=?, blayout=col_major, slayout=row_major, fractal=1024, pad=0>
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The placement of pto.set_validshape inside the loop causes an inconsistency between iterations. %acc_tile is initialized with a 16x16 valid shape (line 42). In the first iteration, pto.tmatmul (line 58) operates on a 16x16 region. However, set_validshape updates the tile handle metadata in-place to 8x16, meaning all subsequent tmatmul calls in the loop will see and potentially respect the restricted 8x16 valid shape. If tmatmul is intended to always compute a full 16x16 result, consider moving set_validshape before the loop or resetting the shape after the push.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant