Need to generalize the code generation logic for different directions, precisions, and architectures.

* global load/store:
    - [ ] support different precisions: fp32/fp16(short)/ubyte
    - [ ] support 2d/3d load, with the exec mask derived from different dimensions
    - [ ] support `global_load`/`buffer_load` and accumulate through sgpr/vgpr
* shared memory load/store:
    - [ ] support 1d/2d load/store for different precisions
    - [ ] support k pack
* coalescing store:
    - [ ] support multiple groups to do coalescing store
    - [ ] support the fp16/int8 pack operation on the final store
    - [ ] support cases that do not need an LDS shuffle
    - [ ] support vector write-out
* mfma main loop:
    - [ ] different repeat/step
    - [ ] support both with and without inst-schedule
    - [ ] support a k pack chosen from the instruction requirement and precision
    - [ ] support loading multiple k_pack from shared memory at once, then doing mfma multiple times
    - [ ] pass through LDS
* fma main loop
* thread mapping
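
As a rough illustration of the first global load/store items, a generator could pick the load mnemonic from the instruction family, the precision, and the number of elements per access. This is only a sketch: the helper name `select_load_mnemonic` and both tables are hypothetical and not taken from the existing code. The only thing assumed about the ISA is the usual size suffixes (`ubyte`, `ushort`, `dword`, `dwordx2`, `dwordx4`) on `global_load`/`buffer_load`.

```python
# Hypothetical sketch: names and structure are illustrative only,
# not part of the existing generator.

# Element size in bytes for each precision named in the list above.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "ubyte": 1}

# Mnemonic suffix for each access size (in bytes) handled by this sketch.
# 12-byte (dwordx3) accesses are deliberately left out to keep it small.
SUFFIX_BY_BYTES = {1: "ubyte", 2: "ushort", 4: "dword", 8: "dwordx2", 16: "dwordx4"}


def select_load_mnemonic(kind, precision, vector_length):
    """Return a load mnemonic such as 'global_load_dwordx4'.

    kind          -- 'global_load' or 'buffer_load'
    precision     -- 'fp32', 'fp16' or 'ubyte'
    vector_length -- number of elements fetched by one instruction
    """
    if kind not in ("global_load", "buffer_load"):
        raise ValueError(f"unsupported load kind: {kind}")
    if precision not in BYTES_PER_ELEMENT:
        raise ValueError(f"unsupported precision: {precision}")

    access_bytes = BYTES_PER_ELEMENT[precision] * vector_length
    if access_bytes not in SUFFIX_BY_BYTES:
        raise ValueError(f"no single load for {access_bytes} bytes")

    return f"{kind}_{SUFFIX_BY_BYTES[access_bytes]}"


if __name__ == "__main__":
    for args in [("global_load", "fp32", 4),
                 ("buffer_load", "fp16", 1),
                 ("global_load", "ubyte", 4)]:
        print(args, "->", select_load_mnemonic(*args))
```

Keeping the precision table separate from the size table means a new precision only adds one entry, and the same lookup could be reused for the store side.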