[codex] add compilable a3 test3 validshape tpush/tpop sample#475
[codex] add compilable a3 test3 validshape tpush/tpop sample#475zhangstevenunity wants to merge 1 commit intomainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new test case for matrix multiplication using TPush and TPop operations between cube and vector kernels. Several issues were identified regarding the data flow and logic: the vector kernel's loop count and tile shapes do not align with the data produced by the cube kernel, and the use of tfree_from_aic may lead to data loss. Additionally, there are opportunities to optimize the vector kernel by removing redundant tile allocations and move operations. Finally, the placement of set_validshape within the cube kernel loop may cause unintended side effects on subsequent matrix multiplications due to in-place metadata updates.
| scf.for %i = %c0 to %c4 step %c1 { | ||
| %fifo_tile = pto.tpop_from_aic {split = 0} | ||
| -> !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0> | ||
| pto.tmov ins(%fifo_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>) outs(%vec_print : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>) | ||
| pto.tprint ins(%vec_print : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>) | ||
| pto.tfree_from_aic {split = 0} | ||
| } |
There was a problem hiding this comment.
There is a significant data mismatch and potential logic error in the vector kernel loop:
- Loop Count Mismatch: The cube kernel pushes 4 tiles of shape 8x16 (total 32 rows). The vector kernel loop only runs 4 times (
%c4) and pops 4x16 tiles (total 16 rows). To consume all pushed data, the vector loop should run 8 times (%c8). - Shape and Split Mismatch: The cube pushes 8x16 tiles with
{split = 0}, while the vector pops 4x16 tiles with{split = 0}. Typically,{split = 0}(no split) requires exact shape matching between the producer and consumer. If splitting is intended, thesplitattribute should likely be non-zero (e.g.,split = 1for height split) to allow popping sub-tiles from a larger pushed slot. - Premature Slot Release:
pto.tfree_from_aicis called in every iteration. If one push (8x16) is intended to be consumed by two pops (4x16), callingtfreeafter the first pop will release the FIFO slot and discard the remaining data.
| pto.tmov ins(%fifo_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>) outs(%vec_print : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>) | ||
| pto.tprint ins(%vec_print : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>) |
There was a problem hiding this comment.
The pto.tmov operation is redundant. %fifo_tile is already a tile buffer in the vec location with the required shape and type. You can pass %fifo_tile directly to pto.tprint and remove this move, as well as the unnecessary %vec_print allocation at line 83.
pto.tprint ins(%fifo_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>)
| pto.set_validshape %acc_tile, %c8, %c16 | ||
| : !pto.tile_buf<loc=acc, dtype=f32, rows=16, cols=16, v_row=?, v_col=?, blayout=col_major, slayout=row_major, fractal=1024, pad=0> |
There was a problem hiding this comment.
The placement of pto.set_validshape inside the loop causes an inconsistency between iterations. %acc_tile is initialized with a 16x16 valid shape (line 42). In the first iteration, pto.tmatmul (line 58) operates on a 16x16 region. However, set_validshape updates the tile handle metadata in-place to 8x16, meaning all subsequent tmatmul calls in the loop will see and potentially respect the restricted 8x16 valid shape. If tmatmul is intended to always compute a full 16x16 result, consider moving set_validshape before the loop or resetting the shape after the push.
What changed
test/samples/TPushTPop/a3/test3/kernel.ptousing the same PTO syntax/style as the existinga3/test1anda3/test2samplespto.tmatmulwrites into an ACC tile,pto.set_validshapechanges it to(8, 16), and the vector side pops4x16tilesTPushTPop/a3/test2patternWhy
The first PR carried the intended logic, but the file shape/syntax did not match the current A3 sample style closely enough and failed parsing with the available local
ptoas.Validation
ptoas 0.23/home/rdp/hw-native-sys/PTOAS/build/tools/ptoas/ptoas --pto-arch=a3 --enable-insert-sync -o /tmp/ptoas-a3-test3/kernel.cpp /mnt/c/Users/rdp/Documents/ptoas/test/samples/TPushTPop/a3/test3/kernel.ptoSetValidShape(...)Tile<TileType::Vec, float, 4, 16, ...>TPOP<..., Tile<TileType::Vec, float, 4, 16, ...>>Notes
ptoasunderbuild-wsl-pr443is presently blocked by an unrelated compile error inlib/PTO/IR/PTO.cpp(verifyTileBufSameShapeAndElemmissing), so sample validation used an already-built localptoasbinary instead