Array reductions can be represented as a two-stage pipeline built on top of matrix-vector multiplications, where the vector is made of all ones.
Let's say our hardware supports fast 16 by 16 matrix multiplications with a single instruction. We can reshape the input array of length $N$ as a matrix of $16$ rows and $N/16$ columns, and use a tiled matrix-multiplication instruction sliding through that wide matrix, multiplying it by a $16$-element vector of ones, and accumulating into $16$ other floats.
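Before reaching for matrix instructions, the two-stage pipeline is easy to emulate with scalar code. The sketch below (a hypothetical helper, not part of any library) treats the input as a 16-row matrix, accumulates one partial sum per row as a tile-by-tile matrix-vector product with a ones vector would, and then collapses the 16 partials in a final pass:

```c
#include <stddef.h>

// Scalar emulation of the two-stage reduction described above.
// Stage 1: view `data` as a 16 x (n/16) row-major matrix and multiply it
// by a ones vector, tile by tile, accumulating into 16 partial sums.
// Stage 2: reduce the 16 partials into a single scalar.
// On real hardware, stage 1 would be a tiled matrix instruction.
float reduce_via_matvec(float const *data, size_t n) {
    float partial[16] = {0};
    size_t cols = n / 16; // assumes `n` is a multiple of 16 for simplicity
    for (size_t j = 0; j < cols; ++j)
        for (size_t i = 0; i < 16; ++i)
            partial[i] += data[i * cols + j]; // one dot with the ones vector
    float total = 0;
    for (size_t i = 0; i < 16; ++i) // final 16-element reduction
        total += partial[i];
    return total;
}
```

The payoff of the reshaping is that the 16 accumulators are independent, so the per-tile products can retire back to back without a loop-carried dependency on a single scalar sum.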
In reality, we can't use Intel AMX with float32 inputs, but we can use Arm SME, and later apply similar techniques to SimSIMD.