fix: avx512vl masked load/store by DiamonDinoia · Pull Request #1353 · xtensor-stack/xsimd

DiamonDinoia · 2026-05-21T07:24:01Z

xsimd::batch<uint32_t, avx512vl_256>::store(ptr, constexpr_mask, mode) used
to compile to a 6-instruction per-lane scalar extract loop instead of a single
EVEX-encoded masked store, because the call site in xsimd_batch.hpp:766
over-specified template arguments and SFINAE'd away every viable masked-store
overload except the scalar common fallback.

The load side was unaffected — its call site (:743) was already correct.

#include <xsimd/xsimd.hpp>

using A    = xsimd::avx512vl_256;
using Bu32 = xsimd::batch<uint32_t, A>;

// Constant alternating mask: lanes 0,2,4,6 active.
struct alt {
    static constexpr bool get(std::size_t i, std::size_t) { return (i & 1) == 0; }
};

static constexpr auto mask = xsimd::make_batch_bool_constant<uint32_t, alt, A>();

Bu32 load_u32(uint32_t const* p) {
    return Bu32::load(p, mask, xsimd::unaligned_mode{});
}

void store_u32(uint32_t* p, Bu32 v) {
    v.store(p, mask, xsimd::unaligned_mode{});
}

g++ -O3 -S -masm=intel -std=c++14 -march=skylake-avx512 -DXSIMD_DEFAULT_ARCH=xsimd::avx512vl_256

Codegen — before (master, commit `7d30b9cc`)

_Z8load_u32PKj:                  # load_u32
    mov         eax, 85
    kmovb       k1, eax
    vmovdqu32   ymm0{k1}{z}, YMMWORD PTR [rdi]
    ret

_Z9store_u32PjN5xsimd5batchIjNS0_12avx512vl_256EEE:   # store_u32
    valignd       ymm1, ymm0, ymm0, 6
    vmovd         DWORD PTR  [rdi], xmm0
    vpextrd       DWORD PTR 8[rdi], xmm0, 2
    vextracti32x4 xmm2, ymm0, 1
    vmovd         DWORD PTR 24[rdi], xmm1
    vmovd         DWORD PTR 16[rdi], xmm2
    ret

Load is fine. Store is the scalar fallback: GCC unrolled the 8-lane loop in
xsimd_common_memory.hpp:377 into four per-lane 32-bit stores, materialising
each active lane no mask instruction involved.

Codegen — after

_Z8load_u32PKj:                  # load_u32 — unchanged
    mov         eax, 85
    kmovb       k1, eax
    vmovdqu32   ymm0{k1}{z}, YMMWORD PTR [rdi]
    ret

_Z9store_u32PjN5xsimd5batchIjNS0_12avx512vl_256EEE:   # store_u32
    mov         eax, 85
    kmovb       k1, eax
    vmovdqu32   YMMWORD PTR [rdi]{k1}, ymm0
    ret

One masked instruction.

Bug fixes

AVX-512VL masked store collapsed to a 6-insn scalar fallback. An over-specified template arg list at the public batch::store(mask) call site pushed a type into a non-type pack, SFINAE'ing every per-arch overload away. The store now reaches the EVEX vmov*{k} intrinsic.
CI linux.yml typo silently dropped CXXFLAGS in the VL_128 matrix row ($CXX_FLAGS vs $CXXFLAGS), so the VL_128-default test job was building with stock flags instead of the requested override.
avx512vl register-traits comment misattributed to AVX512DQ.
Half-confined masked op fell back to scalar on AVX/AVX2/AVX-512F. The half-fold hardcoded sse4_2/avx2 as the half-width target, but two-phase lookup made the better-arch overload invisible at template-definition time — so the recursive call dispatched to the wrong arch. Fixed by include reorder + make_sized_batch_t<T, half>.

Features added

Per-type load_masked / store_masked for avx512vl_128 and avx512vl_256 covering i32/u32/i64/u64/f32/f64 in both aligned and unaligned modes; partial ordering picks them over the avx2 bridges these archs
inherit.
Compile-time guarantee that default_arch matches XSIMD_DEFAULT_ARCH when the macro is set — a static_assert in test_arch.cpp (plus the CMake plumbing to forward the macro) catches default-arch wiring regressions at compile time instead of at runtime.

Cleanups

AVX detail::maskstore helpers now take batch<> types, symmetric with detail::maskload.
Half-store sites and the common select use a typed variable (const batch<T, A> lo = …) instead of const auto lo = batch<T, A>{ … }.
Aligned/unaligned dispatch in the VL overloads uses XSIMD_IF_CONSTEXPR, so the inactive intrinsic isn't instantiated.
Half-fold in xsimd_avx.hpp / xsimd_avx2.hpp / xsimd_avx512f.hpp uses make_sized_batch_t<T, half> instead of a hardcoded arch — picks avx_128 / avx2_128 / avx512vl_256 when available, so half-confined stores land on the EVEX or VEX masked intrinsic.
xsimd_isa.hpp include order: _128 siblings before their wider arch, VL before avx512f.hpp. Required so the recursive store_masked<half_arch> call sees the better-arch overload at template-definition time.

(Changelog by @claude)

The CI job that exercises avx512vl_128 as the default arch was passing the variable but the test build never received it; this fixes both halves of the loop (linux.yml typo $CXX_FLAGS->$CXXFLAGS, CMakeLists propagation, and a static_assert in test_arch).

- sse2 store_masked: align call signature to the (Mode, common{}) shape - avx512vl_register: doc comment AVX512DQ -> AVX512VL typo - isa.hpp: include _128 headers before wider arch so the AVX half-fold's recursive call resolves at parse time (load-bearing for the upcoming perf change); wrap in clang-format off/on - avx_128 swizzle: split the dynamic-mask overload into per-type ovlds for true C++14 builds where XSIMD_IF_CONSTEXPR is plain 'if' and both branches must type-check

Three coupled changes; gcc-10 partial-ordering forces them into a single commit: 1) Rewrite avx512vl_128 / avx512vl_256 masked load/store. Adds the missing int64/uint64/float/double load_masked ovlds, corrects the batch_bool_constant typing on store_masked (was uint32_t/uint64_t for signed-int/float/double stores, now matches the value type), and branches aligned vs. unaligned to the right EVEX intrinsic. Unsigned ovlds delegate to the signed one via bitwise_cast. 2) Constrain the non-VL master ovlds (avx_128 float/double, avx2_128 int32/uint32 + int64/uint64, avx2 templated and int32/uint32/int64/ uint64) and the common-memory int<->float bridges with !is_base_of<avx512vl_*, A>. gcc-10's partial ordering otherwise sees a concrete requires_arch<X> and the inherited concrete requires_arch<Y> (Y a base of X) as equally specialized, likewise for templated bridge<A> vs. native<avx512vl_*> when A is VL. gcc-14 handles both cases naturally so this is a no-op there. The avx native gains an is_floating_point<T> SFINAE and the avx2 templated gains is_integral<T> && sizeof>=4 so the new half-fold dispatch (half_arch = avx for floats, avx2 for ints in a 512-bit batch) is unambiguous on gcc-10. 3) Resolve the half-fold target arch in avx / avx2 / avx512f through make_sized_batch_t<T, half>::arch_type so the dispatch picks avx512vl_128 / avx512vl_256 when available and emits EVEX vmovdqu32{k}{z} instead of VEX vpmaskmovd / vmaskmovps. (Without (3), (2)'s is_integral SFINAE on the avx2 templated form leaves the pre-existing avx512f.hpp:339 'store_masked<avx2>(float*, __m256, ...)' callsite with no matching ovld on gcc-10.) The xsimd_batch dispatch drops the explicit <A, T, U, Values...> args on the kernel::store_masked call so the SFINAE'd overload set can be resolved by ADL, and adds a fwd decl of make_sized_batch ahead of xsimd_isa.hpp so the half-fold sites can see the type at parse time. bridge_not_vl lives in xsimd_common_fwd next to the bridge fwd-decls; fwd.hpp now pulls xsimd_avx512vl_register so the trait sees complete types. The 4 redundant register.hpp includes that would otherwise be added at the point-of-use are dropped — they're reachable transitively through fwd.hpp.

serge-sans-paille · 2026-05-23T06:34:21Z

        if [[ '${{ matrix.sys.flags }}' == 'avx512vl_128' ]]; then
          CMAKE_EXTRA_ARGS="$CMAKE_EXTRA_ARGS -DTARGET_ARCH=skylake-avx512"
-          CXXFLAGS="$CXX_FLAGS -DXSIMD_DEFAULT_ARCH=avx512vl_128"
+          CXXFLAGS="$CXXFLAGS -DXSIMD_DEFAULT_ARCH=avx512vl_128"


oopsie. Thanks for fixing this one.

serge-sans-paille

I wish we could keep each architecture file unaware from other architectures. I do understand this will disappear once we move to C++17-based architecture, but I'd be happier if you could find another way to apply the constraints.

serge-sans-paille · 2026-05-23T06:35:42Z

+        // and the VL native as equally specialized for A=avx512vl_*. (bridge_not_vl in fwd.hpp)
        template <class A, bool... Values, class Mode>
-        XSIMD_INLINE batch<int32_t, A> load_masked(int32_t const* mem, batch_bool_constant<int32_t, A, Values...>, convert<int32_t>, Mode, requires_arch<A>) noexcept
+        XSIMD_INLINE std::enable_if_t<bridge_not_vl<A>::value, batch<int32_t, A>>


there should not be any arch-specific code in xsimd_common_memory.hpp. Could you find another approach?

serge-sans-paille · 2026-05-23T06:38:09Z


            template <class A>
-            XSIMD_INLINE void maskstore(double* mem, batch_bool<double, A> const& mask, batch<double, A> const& src) noexcept
+            XSIMD_INLINE void maskstore(double* mem, batch<as_integer_t<double>, A> const& mask, batch<double, A> const& src) noexcept


wgy that change? In my mental model, the masked store take a bool mask, not an integer mask

serge-sans-paille · 2026-05-23T06:39:00Z

        // single templated implementation for integer masked loads (32/64-bit)
-        template <class A, class T, bool... Values, class Mode>
+        template <class A, class T, bool... Values, class Mode,
+                  class = std::enable_if_t<std::is_base_of<avx2, A>::value && !std::is_base_of<avx512vl_256, A>::value>>


I quite dislike the fact that xsimd_avx2.hpp needs to know stuff about avx512vl

serge-sans-paille · 2026-05-23T06:40:23Z

        }
-        template <class A, bool... Values, class Mode>
+        template <class A, bool... Values, class Mode,
+                  class = std::enable_if_t<std::is_base_of<avx_128, A>::value && !std::is_base_of<avx512vl_128, A>::value>>


same architecture mix reference issue here.

serge-sans-paille · 2026-05-23T06:40:46Z

 #if XSIMD_WITH_AVX
-#include "./xsimd_avx.hpp"
+// clang-format off
+// _128 first: avx half-fold recursive call needs avx_128 visible at parse time.


nice catch.

serge-sans-paille · 2026-05-23T06:41:02Z

     * @ingroup architectures
     *
-     * AVX512DQ instructions
+     * AVX512VL instructions


serge-sans-paille · 2026-05-23T06:41:22Z

+    template <typename T, std::size_t N>
+    struct make_sized_batch;
+    template <typename T, std::size_t N>
+    using make_sized_batch_t = typename make_sized_batch<T, N>::type;


serge-sans-paille · 2026-05-23T06:43:04Z

    message(STATUS "Using emulated target: ${TARGET_EMULATED}")
    set(EMULATED_COMPILE_FLAGS -DXSIMD_DEFAULT_ARCH=${TARGET_ARCH};-DXSIMD_WITH_EMULATED=1)
    unset(TARGET_ARCH CACHE)
+elseif (DEFINED XSIMD_DEFAULT_ARCH AND NOT "${XSIMD_DEFAULT_ARCH}" STREQUAL "")


Per https://cmake.org/cmake/help/latest/command/if.html#constant I think

if(XSIMD_DEFAULT_ARCH)

is enough

(confirmed by https://godbolt.org/z/b64GcKaxM)

serge-sans-paille · 2026-05-23T06:46:04Z

 static_assert(xsimd::all_architectures::contains<xsimd::default_arch>(), "default arch is a valid arch");
+#else
+namespace xsimd
+{


DiamonDinoia force-pushed the fix/avx512vl-masked-memory branch 4 times, most recently from a9a8eb2 to fe2938e Compare May 21, 2026 10:18

DiamonDinoia added 3 commits May 21, 2026 06:33

DiamonDinoia force-pushed the fix/avx512vl-masked-memory branch from fe2938e to ea882e6 Compare May 21, 2026 12:04

DiamonDinoia marked this pull request as ready for review May 21, 2026 12:51

DiamonDinoia requested a review from serge-sans-paille May 21, 2026 12:52

serge-sans-paille reviewed May 23, 2026

View reviewed changes

serge-sans-paille requested changes May 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: avx512vl masked load/store#1353

fix: avx512vl masked load/store#1353
DiamonDinoia wants to merge 3 commits into
xtensor-stack:masterfrom
DiamonDinoia:fix/avx512vl-masked-memory

DiamonDinoia commented May 21, 2026 •

edited

Loading

Uh oh!

serge-sans-paille May 23, 2026

Uh oh!

serge-sans-paille left a comment

Uh oh!

serge-sans-paille May 23, 2026

Uh oh!

serge-sans-paille May 23, 2026

Uh oh!

serge-sans-paille May 23, 2026

Uh oh!

serge-sans-paille May 23, 2026

Uh oh!

serge-sans-paille May 23, 2026

Uh oh!

serge-sans-paille May 23, 2026

Uh oh!

serge-sans-paille May 23, 2026

Uh oh!

serge-sans-paille May 23, 2026

Uh oh!

serge-sans-paille May 23, 2026

Uh oh!

serge-sans-paille May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

DiamonDinoia commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codegen — before (master, commit 7d30b9cc)

Codegen — after

Bug fixes

Features added

Cleanups

Uh oh!

Choose a reason for hiding this comment

Uh oh!

serge-sans-paille left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

DiamonDinoia commented May 21, 2026 •

edited

Loading

Codegen — before (master, commit `7d30b9cc`)