Summary
This issue records a practical use case where ptoas --enable-insert-sync still leaves ~10% room for performance improvement compared to a known manual-sync plan.
Background
I wrote a dynamic-shape matmul that is 2x faster than the original pto-isa gemm_performance example and reaches 0.9~1.1x the performance of aclnnMatmul in CANN 8.5.0. See matmul_swizzle/simple_demo to reproduce.
The auto-sync version is only ~100 lines of Python, and reaching 90% of manual-sync performance is already quite decent. I just wonder whether the last 10% perf gap can be closed.
Command line
ptoas --enable-insert-sync simple_matmul_auto_sync.pto -o simple_matmul_auto_sync.cpp
ptoas simple_matmul_manual_sync.pto -o simple_matmul_manual_sync.cpp
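For convenience, the two invocations above could be wrapped in a small Python driver. This is only a sketch: it assumes ptoas is on PATH and that the .pto files from the zip are in the current directory. The helper names ptoas_cmd and regenerate are hypothetical and not part of ptoas; only the flags shown in the commands above are used.

```python
import shutil
import subprocess


def ptoas_cmd(src, out, insert_sync=False):
    """Build the ptoas argv for one input (hypothetical helper).

    Only the options that appear in the commands above are used:
    --enable-insert-sync and -o.
    """
    cmd = ["ptoas"]
    if insert_sync:
        cmd.append("--enable-insert-sync")
    cmd += [src, "-o", out]
    return cmd


# (input, output, use auto-sync insertion?) for the two variants in this issue
JOBS = [
    ("simple_matmul_auto_sync.pto", "simple_matmul_auto_sync.cpp", True),
    ("simple_matmul_manual_sync.pto", "simple_matmul_manual_sync.cpp", False),
]


def regenerate():
    """Run ptoas on both inputs; stops at the first failing command."""
    if shutil.which("ptoas") is None:
        raise SystemExit("ptoas not found on PATH")
    for src, out, insert_sync in JOBS:
        subprocess.run(ptoas_cmd(src, out, insert_sync), check=True)
```

Calling regenerate() from the directory containing the unpacked inputs should reproduce both .cpp outputs.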
Reproduction input
pto_matmul.zip
contains both inputs:
simple_matmul_auto_sync.pto
simple_matmul_manual_sync.pto
and outputs:
simple_matmul_auto_sync.cpp
simple_matmul_manual_sync.cpp
Expected performance
Ideally, auto-sync should be as fast as the manual-sync version (or perhaps even discover a faster pipelining scheme?).
Actual performance
Auto-sync is 5~15% slower than manual-sync. See the detailed PRs below, which contain the full code with kernel launch and on-device performance measurements on 910B2:
Git commit
29ed536