
[Fix] Align buffer-resource num_records with logical tensor sizes #555

Closed
coderfeli wants to merge 1 commit into main from fix/buffer-resource-num-records


@coderfeli
Collaborator

Summary

Several MoE / GEMM kernels call buffer_ops.create_buffer_resource(arg, max_size=False) without supplying num_records_bytes. With max_size=False, the descriptor size is inferred from the memref type at trace time. When the kernel is later reused for tensors with larger logical extents, that inferred size is too small, so accesses past it are treated as out of bounds: reads return 0 and writes are dropped, with no error raised.
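
To make the failure mode concrete, here is a minimal pure-Python sketch of a bounds-checked buffer descriptor. It is not the hardware behaviour verbatim and not the buffer_ops implementation: the `BufferDescriptor` class and its methods are hypothetical, and they only model what the summary describes, namely that reads past `num_records` bytes return 0 and writes past it are dropped.

```python
from dataclasses import dataclass


@dataclass
class BufferDescriptor:
    """Hypothetical model of a bounds-checked buffer resource (byte-addressed)."""

    data: bytearray    # backing storage, may be larger than num_records
    num_records: int   # number of bytes the descriptor allows access to

    def load_byte(self, offset: int) -> int:
        # Out-of-range reads do not fault; they return 0.
        if not 0 <= offset < self.num_records:
            return 0
        return self.data[offset]

    def store_byte(self, offset: int, value: int) -> None:
        # Out-of-range writes are silently dropped.
        if 0 <= offset < self.num_records:
            self.data[offset] = value


# The tensor really holds 8 bytes, but the descriptor was sized for 4,
# as if the size had been inferred from a smaller traced type.
storage = bytearray([1, 2, 3, 4, 5, 6, 7, 8])
too_small = BufferDescriptor(storage, num_records=4)
correct = BufferDescriptor(storage, num_records=len(storage))

in_range = too_small.load_byte(2)    # inside the descriptor
truncated = too_small.load_byte(6)   # valid data, but past num_records
too_small.store_byte(7, 99)          # dropped: storage[7] is unchanged
full = correct.load_byte(6)          # same offset through the correct descriptor
```

With the undersized descriptor, offset 6 reads as 0 even though the storage holds real data there. The same offset read through a descriptor sized to the logical extent returns the stored value.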

This mirrors aiter PR ROCm/aiter#3314 and extends the same fix to two files aiter does not cover (moe_blockscale_2stage.py and the w_rsrc/bias_rsrc sites in mixed_moe_gemm_2stage.py).

What changes

For each affected site, compute num_records from compile-time constants (experts, model_dim, inter_dim, num_groups, …) and pass it explicitly to create_buffer_resource. Pre-existing dead expressions that lacked the * elem_bytes multiplier are also corrected.
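
As a rough illustration of the size computation, the sketch below derives a byte count from compile-time extents. The helper `num_records_bytes`, the example extents `(experts, inter_dim, model_dim)`, and the concrete numbers are assumptions made for illustration, not the layout of any specific kernel in this PR. The commented-out call reuses only the `create_buffer_resource` and `num_records_bytes` names from the summary; its exact signature is not shown here.

```python
from math import prod


def num_records_bytes(extents, elem_bytes):
    """Byte size of a dense tensor with the given logical extents.

    Hypothetical helper: multiplies all extents and then scales by the
    element size, which is the multiplier the old dead expressions lacked.
    """
    if elem_bytes <= 0 or any(e < 0 for e in extents):
        raise ValueError("extents must be non-negative and elem_bytes positive")
    return prod(extents) * elem_bytes


# Illustrative compile-time constants (values are made up).
experts, model_dim, inter_dim = 8, 4096, 1024
elem_bytes = 2  # e.g. a 16-bit element type

w_elems = experts * inter_dim * model_dim  # element count only, not bytes
w_bytes = num_records_bytes((experts, inter_dim, model_dim), elem_bytes)

# Assumed call shape, based only on the names used in the summary:
# w_rsrc = buffer_ops.create_buffer_resource(
#     arg_w, max_size=False, num_records_bytes=w_bytes)
```

Passing an element count where a byte count is expected would under-size the descriptor by a factor of `elem_bytes`, which leads to the same truncation the PR is fixing.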

Files (18 buffer-resource sites total)

| File | Sites |
| --- | --- |
| kernels/moe_gemm_2stage.py | stage1: w_rsrc, sw_rsrc, sorted_rsrc, sorted_w_rsrc; stage2: w_rsrc, sw_rsrc |
| kernels/moe_blockscale_2stage.py | stage1: w_rsrc, sw_rsrc, sorted_rsrc, sorted_w_rsrc; stage2: w_rsrc, sw_rsrc |
| kernels/mixed_moe_gemm_2stage.py | stage1: w_rsrc, bias_rsrc, sorted_scale_rsrc; stage2: w_rsrc, bias_rsrc |
| kernels/preshuffle_gemm.py | scale_a_rsrc (with fp4 path) |

Test plan

  • tests/kernels/test_preshuffle_gemm.py — 103 passed
  • tests/kernels/test_moe_gemm.py — 349 passed
  • tests/kernels/test_moe_blockscale.py — 4 passed
  • bash scripts/check_python_style.sh — clean

Related

🤖 Generated with Claude Code

Several MoE / GEMM kernels call ``buffer_ops.create_buffer_resource(arg,
max_size=False)`` without supplying ``num_records_bytes``.  With
``max_size=False`` the descriptor size is inferred from the memref *type*
at trace time.  When the kernel is reused for tensors with larger logical
extents, that size is too small, so out-of-bounds reads return 0 and
out-of-bounds writes are dropped.

This mirrors aiter PR ROCm/aiter#3314 and extends the same fix to two
files aiter does not cover (``moe_blockscale_2stage.py`` and the
``w_rsrc``/``bias_rsrc`` sites in ``mixed_moe_gemm_2stage.py``).

Compute ``num_records`` from compile-time constants (experts, model_dim,
inter_dim, num_groups, …) and pass it explicitly to
``create_buffer_resource``.  Also fixes pre-existing dead expressions
that lacked the ``* elem_bytes`` multiplier.

Files touched (18 buffer-resource sites total):
- kernels/moe_gemm_2stage.py        — stage1: w/sw/sorted/sorted_w; stage2: w/sw
- kernels/moe_blockscale_2stage.py  — stage1: w/sw/sorted/sorted_w; stage2: w/sw
- kernels/mixed_moe_gemm_2stage.py  — stage1: w/bias/sorted_scale; stage2: w/bias
- kernels/preshuffle_gemm.py        — scale_a (with fp4 path)

Test plan:
- tests/kernels/test_preshuffle_gemm.py  (103 passed)
- tests/kernels/test_moe_gemm.py         (349 passed)
- tests/kernels/test_moe_blockscale.py   (4 passed)
- bash scripts/check_python_style.sh     (clean)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderfeli coderfeli closed this May 24, 2026