This issue tracks a series of 3 pull requests targeting ROCm/aiter.
Status: PRs are being prepared; a full description will be added shortly.
- PR 1: [Perf][Kernel] Add decode buffer caches to eliminate per-step HIP malloc in fused_moe
- PR 2: [Perf][Kernel] Add gfx950 1-stage ASM fast path for FP8 blockscale decode with BLOCK_SIZE_M=16
- PR 3: [Perf] Add MiniMax-M2.5 GEMM and FMoE tuning configs with doweight_stage1=0 dispatch fix