matmul_all_reduce: no validation that lock array is large enough for current tile count #463

## Bug

In `iris/ops/matmul_all_reduce.py`, the lock array (one entry per output tile) is pre-allocated with a size derived from the block dimensions in effect at allocation time. Nothing checks that size against the tile count of later calls. If a subsequent call uses smaller block sizes (producing more tiles), the kernel writes lock entries past the end of the array. On MI355X this silently corrupts adjacent symmetric heap objects, leading to non-deterministic wrong results on later calls.
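The failure mode can be illustrated with a toy model in plain Python. It assumes, for illustration only, that the lock array and another heap object sit back to back in one flat buffer and that the kernel addresses locks by raw offset with no bounds check. The layout, sizes, and the `signal_tile` helper are hypothetical, not Iris internals.

```python
# Toy model of a symmetric heap: one flat buffer holding several objects
# back to back. The layout and sizes are hypothetical; they only show why
# an out-of-bounds lock write is silent.
HEAP_SIZE = 16
heap = [0] * HEAP_SIZE

# Hypothetical layout: a 4-entry lock array followed by a 4-element tensor.
LOCKS_OFFSET, LOCKS_LEN = 0, 4
DATA_OFFSET, DATA_LEN = 4, 4

# Fill the neighboring object with known values.
for i in range(DATA_LEN):
    heap[DATA_OFFSET + i] = 100 + i


def signal_tile(tile_id: int) -> None:
    """Mimic a kernel setting the lock for tile_id via raw offset arithmetic.

    Like a GPU kernel, this computes an address from a base offset and does
    not compare tile_id against LOCKS_LEN.
    """
    heap[LOCKS_OFFSET + tile_id] = 1


# The lock array was sized for 4 tiles, but the current call produces 6.
for tile_id in range(6):
    signal_tile(tile_id)

# The first two elements of the neighboring tensor were overwritten.
print(heap[DATA_OFFSET:DATA_OFFSET + DATA_LEN])  # [1, 1, 102, 103]
```

No exception is raised because every write stays inside the heap buffer; only the neighboring object's contents change.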

**Example:** Allocating with `bm=64` for M=2048 gives `32 * 23 = 736` tiles (2048 / 64 = 32 tile rows, times 23 tile columns). A later call with `bm=32` needs `64 * 23 = 1472` lock entries, so it writes 736 entries past the end of the array.
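The arithmetic behind the example, as a small sketch that assumes the tile grid is `ceil(M / bm) * ceil(N / bn)`. The issue does not state `N` or `bn`; `N = 2944` with `bn = 128` is a hypothetical pair chosen only because it yields the 23 tile columns used above.

```python
def num_tiles(m: int, n: int, bm: int, bn: int) -> int:
    """Number of output tiles for an m x n result with bm x bn blocks."""
    # Ceiling division, so partial tiles at the edges are counted.
    return ((m + bm - 1) // bm) * ((n + bn - 1) // bn)


M = 2048
N, BN = 2944, 128  # hypothetical: any (N, bn) pair giving 23 tile columns

allocated = num_tiles(M, N, bm=64, bn=BN)  # 32 * 23
needed = num_tiles(M, N, bm=32, bn=BN)     # 64 * 23
print(allocated, needed, needed - allocated)  # 736 1472 736
```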

## Impact

Silent symmetric heap corruption. No error raised. Wrong numerical results on subsequent kernel calls.

## Fix

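Add a runtime check in `matmul_all_reduce()`, before the kernel is launched:

```python
# total_tiles is the tile count for this call's block sizes; it must not
# exceed the number of lock entries allocated in the workspace.
if workspace.locks is not None and workspace.locks.numel() < total_tiles:
    raise ValueError(
        f"Lock array too small: have {workspace.locks.numel()} but need {total_tiles}. "
        f"Pre-allocate workspace with the smallest block sizes you intend to use."
    )
```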
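The guard can be exercised in isolation with stub objects. `StubLocks`, `StubWorkspace`, and `check_lock_capacity` are hypothetical stand-ins, not part of Iris; the check only relies on `locks` being `None` or exposing a `numel()` method. The condition and error message mirror the proposed fix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubLocks:
    """Hypothetical stand-in for a lock tensor; only numel() is used."""
    size: int

    def numel(self) -> int:
        return self.size


@dataclass
class StubWorkspace:
    """Hypothetical workspace holding an optional lock array."""
    locks: Optional[StubLocks]


def check_lock_capacity(workspace: StubWorkspace, total_tiles: int) -> None:
    """Raise if the lock array cannot hold one entry per tile."""
    if workspace.locks is not None and workspace.locks.numel() < total_tiles:
        raise ValueError(
            f"Lock array too small: have {workspace.locks.numel()} but need {total_tiles}. "
            f"Pre-allocate workspace with the smallest block sizes you intend to use."
        )


ws = StubWorkspace(locks=StubLocks(736))  # allocated for bm=64
check_lock_capacity(ws, 736)              # same block sizes: passes
try:
    check_lock_capacity(ws, 1472)         # bm=32: would overflow
except ValueError as e:
    print(e)
```

Factoring the check into a helper like this would also let it be covered by a unit test that never touches the GPU.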

## Component

`iris/ops/matmul_all_reduce.py`
