[codex] add a3 test3 validshape tpush/tpop sample#472
[codex] add a3 test3 validshape tpush/tpop sample#472zhangstevenunity wants to merge 1 commit intomainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new PTO kernel file to test TPush and TPop operations between cube and vector cores. However, the implementation contains several critical issues. There is a logical mismatch between the data produced in @cube_func and consumed in @vec_func, resulting in data being discarded prematurely due to incorrect tile sizes and improper use of tfree_from_aic. Additionally, multiple operations—including aic_initialize_pipe, reserve_buffer, tload, tmov, and various pipe operations—violate the PTO dialect definitions specified in PTOOps.td regarding required operands, return types, and attribute formats.
| scf.for %i = %c0 to %c4 step %c1 { | ||
| %vec_tile = pto.tpop_from_aic ins(%vec_l0c : !pto.async_buffer<core="vector", direction="in", slots=4, slot_size=1024>) { split = 0 : index } : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0> | ||
| %vec_print = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0> | ||
| pto.tmov ins(%vec_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>) outs(%vec_print : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>) | ||
| pto.tprint ins(%vec_print : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>) | ||
| pto.tfree_from_aic ins(%vec_l0c : !pto.async_buffer<core="vector", direction="in", slots=4, slot_size=1024>) | ||
| } |
There was a problem hiding this comment.
There is a logical mismatch between the data produced on the cube side and consumed on the vector side. The cube side pushes 4 tiles of 8x16 (Line 24). The vector side pops 4 times, but each pop is only 4x16 (Line 36). Furthermore, tfree_from_aic is called in every iteration (Line 40), which releases the entire slot. This means half of the data in each pushed slot is discarded. To process all data in 4x16 chunks as described in the PR, the vector loop should run 8 times, and tfree should only be called after every two pops.
| pto.aic_initialize_pipe 4, 1 | ||
| %left_l1 = pto.reserve_buffer "cube" [1] { slot_size = 256 } : !pto.async_buffer<core="cube", direction="in", slots=1, slot_size=256> | ||
| %right_l1 = pto.reserve_buffer "cube" [1] { slot_size = 256 } : !pto.async_buffer<core="cube", direction="in", slots=1, slot_size=256> | ||
| %acc_l0c = pto.reserve_buffer "cube" [4] { slot_size = 1024 } : !pto.async_buffer<core="cube", direction="out", slots=4, slot_size=1024> |
There was a problem hiding this comment.
The operations pto.aic_initialize_pipe and pto.reserve_buffer do not follow the definitions in PTOOps.td. aic_initialize_pipe is missing the attribute dictionary braces and the required operands (gm_slot_buffer, c2v_consumer_buf, v2c_consumer_buf). reserve_buffer uses an incorrect syntax and returns a !pto.async_buffer type, whereas the ODS specifies it returns an i32 address.
| %left_tile = pto.tload ins(%left[%c0] : memref<64xf32, #pto.address_space<gm>>) : !pto.tile_buf<loc=mat, dtype=f32, rows=16, cols=16, v_row=16, v_col=4, blayout=row_major, slayout=none_box, fractal=512, pad=0> | ||
| %offset = arith.muli %i, %c64 : index | ||
| %right_tile = pto.tload ins(%right[%offset] : memref<256xf32, #pto.address_space<gm>>) : !pto.tile_buf<loc=mat, dtype=f32, rows=16, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0> | ||
| %left_tile_l0a = pto.tmov ins(%left_tile : !pto.tile_buf<loc=mat, dtype=f32, rows=16, cols=16, v_row=16, v_col=4, blayout=row_major, slayout=none_box, fractal=512, pad=0>) outs(!pto.tile_buf<loc=left, dtype=f32, rows=16, cols=16, v_row=16, v_col=4, blayout=row_major, slayout=row_major, fractal=512, pad=0>) | ||
| %right_tile_l0b = pto.tmov ins(%right_tile : !pto.tile_buf<loc=mat, dtype=f32, rows=16, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>) outs(!pto.tile_buf<loc=right, dtype=f32, rows=16, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=col_major, fractal=512, pad=0>) |
There was a problem hiding this comment.
Several operations use an incorrect syntax relative to the PTO dialect definition:
pto.tload(Lines 16, 18) is used in a functional style, but it is defined as a Destination-Passing Style (DPS) operation inPTOOps.tdrequiring anoutsoperand.pto.tmov(Lines 19, 20) provides a type in theoutsclause instead of a destination operand.
| pto.tpush_to_aiv ins(%acc_tile : !pto.tile_buf<loc=acc, dtype=f32, rows=16, cols=16, v_row=?, v_col=?, blayout=col_major, slayout=row_major, fractal=1024, pad=0>) outs(%acc_l0c : !pto.async_buffer<core="cube", direction="out", slots=4, slot_size=1024>) { split = 0 : index } | ||
| } | ||
| return | ||
| } | ||
|
|
||
| func.func @vec_func() { | ||
| %c0 = arith.constant 0 : index | ||
| %c1 = arith.constant 1 : index | ||
| %c4 = arith.constant 4 : index | ||
| pto.aiv_initialize_pipe 1, 4 | ||
| %vec_l0c = pto.reserve_buffer "vector" [4] { slot_size = 1024 } : !pto.async_buffer<core="vector", direction="in", slots=4, slot_size=1024> | ||
| scf.for %i = %c0 to %c4 step %c1 { | ||
| %vec_tile = pto.tpop_from_aic ins(%vec_l0c : !pto.async_buffer<core="vector", direction="in", slots=4, slot_size=1024>) { split = 0 : index } : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0> | ||
| %vec_print = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0> | ||
| pto.tmov ins(%vec_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>) outs(%vec_print : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>) | ||
| pto.tprint ins(%vec_print : !pto.tile_buf<loc=vec, dtype=f32, rows=4, cols=16, v_row=4, v_col=16, blayout=row_major, slayout=none_box, fractal=512, pad=0>) | ||
| pto.tfree_from_aic ins(%vec_l0c : !pto.async_buffer<core="vector", direction="in", slots=4, slot_size=1024>) |
There was a problem hiding this comment.
The frontend pipe operations do not match the ODS:
pto.tpush_to_aiv(Line 24) andpto.tfree_from_aic(Line 40) should not have anoutsorinsoperand respectively.pto.tpop_from_aic(Line 36) should not have aninsoperand.- The
splitattribute in these operations should be ani8integer attribute (e.g.,0 : i8), not anindex.
|
/run a3 test/samples/TPushTPop/a3/test3/kernel.pto |
A3 板测失败
日志尾部 |
|
/run a3 test/samples/TPushTPop/a3/test3/kernel |
A3 板测失败
日志尾部 |
|
Superseded by #475. The new PR carries the PTO syntax rewrite that compiles with the available local |
What changed
test/samples/TPushTPop/a3/test3/kernel.ptoTPushTPop%acc_tileto valid shape(8, 16)afterpto.tmatmul4x16tile and print itWhy
This adds the exact A3 sample variant requested for the
tmatmul -> set_validshape -> vec popflow, but undera3/test3instead ofa3/test2.Impact
tpush/tpop8x16valid accumulator tile from the vector side in4x16chunksValidation
ptoasvalidation in this environment because a runnableptoasbinary was not available in the workspacegithub.com:443, so the branch/file/PR were created through the GitHub connector instead