Discussion of possible algorithmic approaches with the good stuff near the end
It can even be fully parallelized. The block-shuffling pass would require either a rav1d-style DisjointMut, which is unsafe, though maybe not in a really bad way, or something like Vec<Mutex<&mut [f32]>>. The per-block operation, by contrast, should be trivial to express with SIMD and a small out-of-place buffer.
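
The Vec<Mutex<&mut [f32]>> option could look roughly like the sketch below. It assumes that "block shuffling" means exchanging equal-sized, non-overlapping blocks; the actual permutation is not specified here, so the function name `swap_blocks_parallel` and its parameters `block_len`, `swaps`, and `workers` are hypothetical and only illustrate the locking pattern, using nothing beyond the standard library.

```rust
use std::sync::Mutex;
use std::thread;

/// Hypothetical helper: exchange pairs of equal-sized blocks of `data` in parallel.
/// `swaps` lists (block index, block index) pairs; `workers` is the thread count.
pub fn swap_blocks_parallel(
    data: &mut [f32],
    block_len: usize,
    swaps: &[(usize, usize)],
    workers: usize,
) {
    assert!(block_len > 0 && workers > 0);
    assert!(data.len() % block_len == 0, "data must split into whole blocks");

    // chunks_mut yields non-overlapping &mut slices, so no unsafe code is needed;
    // each block gets its own lock.
    let blocks: Vec<Mutex<&mut [f32]>> =
        data.chunks_mut(block_len).map(Mutex::new).collect();
    let blocks = &blocks;

    // Give each worker a contiguous share of the swap list.
    let per_worker = ((swaps.len() + workers - 1) / workers).max(1);

    thread::scope(|s| {
        for part in swaps.chunks(per_worker) {
            s.spawn(move || {
                for &(a, b) in part {
                    if a == b {
                        continue;
                    }
                    // Always lock the lower index first so workers cannot deadlock.
                    let (lo, hi) = if a < b { (a, b) } else { (b, a) };
                    let mut x = blocks[lo].lock().unwrap();
                    let mut y = blocks[hi].lock().unwrap();
                    x.swap_with_slice(&mut **y);
                }
            });
        }
    });
}
```

The result is only deterministic if the pairs in `swaps` touch disjoint blocks; the mutexes guarantee memory safety, not ordering. A DisjointMut-style wrapper would drop the per-block locking cost at the price of an unsafe contract that the caller never hands out overlapping ranges.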