Skip to content

fix: avx512vl masked load/store#1353

Open
DiamonDinoia wants to merge 3 commits into
xtensor-stack:masterfrom
DiamonDinoia:fix/avx512vl-masked-memory
Open

fix: avx512vl masked load/store#1353
DiamonDinoia wants to merge 3 commits into
xtensor-stack:masterfrom
DiamonDinoia:fix/avx512vl-masked-memory

Conversation

@DiamonDinoia
Copy link
Copy Markdown
Contributor

@DiamonDinoia DiamonDinoia commented May 21, 2026

xsimd::batch<uint32_t, avx512vl_256>::store(ptr, constexpr_mask, mode) used
to compile to a 6-instruction per-lane scalar extract loop instead of a single
EVEX-encoded masked store, because the call site in xsimd_batch.hpp:766
over-specified template arguments and SFINAE'd away every viable masked-store
overload except the scalar common fallback.

The load side was unaffected — its call site (:743) was already correct.

#include <xsimd/xsimd.hpp>

using A    = xsimd::avx512vl_256;
using Bu32 = xsimd::batch<uint32_t, A>;

// Constant alternating mask: lanes 0,2,4,6 active.
struct alt {
    static constexpr bool get(std::size_t i, std::size_t) { return (i & 1) == 0; }
};

static constexpr auto mask = xsimd::make_batch_bool_constant<uint32_t, alt, A>();

Bu32 load_u32(uint32_t const* p) {
    return Bu32::load(p, mask, xsimd::unaligned_mode{});
}

void store_u32(uint32_t* p, Bu32 v) {
    v.store(p, mask, xsimd::unaligned_mode{});
}
g++ -O3 -S -masm=intel -std=c++14 -march=skylake-avx512 -DXSIMD_DEFAULT_ARCH=xsimd::avx512vl_256

Codegen — before (master, commit 7d30b9cc)

_Z8load_u32PKj:                  # load_u32
    mov         eax, 85
    kmovb       k1, eax
    vmovdqu32   ymm0{k1}{z}, YMMWORD PTR [rdi]
    ret

_Z9store_u32PjN5xsimd5batchIjNS0_12avx512vl_256EEE:   # store_u32
    valignd       ymm1, ymm0, ymm0, 6
    vmovd         DWORD PTR  [rdi], xmm0
    vpextrd       DWORD PTR 8[rdi], xmm0, 2
    vextracti32x4 xmm2, ymm0, 1
    vmovd         DWORD PTR 24[rdi], xmm1
    vmovd         DWORD PTR 16[rdi], xmm2
    ret

Load is fine. Store is the scalar fallback: GCC unrolled the 8-lane loop in
xsimd_common_memory.hpp:377 into four per-lane 32-bit stores, materialising
each active lane no mask instruction involved.

Codegen — after

_Z8load_u32PKj:                  # load_u32 — unchanged
    mov         eax, 85
    kmovb       k1, eax
    vmovdqu32   ymm0{k1}{z}, YMMWORD PTR [rdi]
    ret

_Z9store_u32PjN5xsimd5batchIjNS0_12avx512vl_256EEE:   # store_u32
    mov         eax, 85
    kmovb       k1, eax
    vmovdqu32   YMMWORD PTR [rdi]{k1}, ymm0
    ret

One masked instruction.

Bug fixes

  • AVX-512VL masked store collapsed to a 6-insn scalar fallback. An over-specified template arg list at the public batch::store(mask) call site pushed a type into a non-type pack, SFINAE'ing every per-arch overload away. The store now reaches the EVEX vmov*{k} intrinsic.

  • CI linux.yml typo silently dropped CXXFLAGS in the VL_128 matrix row ($CXX_FLAGS vs $CXXFLAGS), so the VL_128-default test job was building with stock flags instead of the requested override.

  • avx512vl register-traits comment misattributed to AVX512DQ.

  • Half-confined masked op fell back to scalar on AVX/AVX2/AVX-512F. The half-fold hardcoded sse4_2/avx2 as the half-width target, but two-phase lookup made the better-arch overload invisible at template-definition time — so the recursive call dispatched to the wrong arch. Fixed by include reorder + make_sized_batch_t<T, half>.

Features added

  • Per-type load_masked / store_masked for avx512vl_128 and avx512vl_256 covering i32/u32/i64/u64/f32/f64 in both aligned and unaligned modes; partial ordering picks them over the avx2 bridges these archs
    inherit.
  • Compile-time guarantee that default_arch matches XSIMD_DEFAULT_ARCH when the macro is set — a static_assert in test_arch.cpp (plus the CMake plumbing to forward the macro) catches default-arch wiring regressions at compile time instead of at runtime.

Cleanups

  • AVX detail::maskstore helpers now take batch<> types, symmetric with detail::maskload.
  • Half-store sites and the common select use a typed variable (const batch<T, A> lo = …) instead of const auto lo = batch<T, A>{ … }.
  • Aligned/unaligned dispatch in the VL overloads uses XSIMD_IF_CONSTEXPR, so the inactive intrinsic isn't instantiated.
  • Half-fold in xsimd_avx.hpp / xsimd_avx2.hpp / xsimd_avx512f.hpp uses make_sized_batch_t<T, half> instead of a hardcoded arch — picks avx_128 / avx2_128 / avx512vl_256 when available, so half-confined stores land on the EVEX or VEX masked intrinsic.
  • xsimd_isa.hpp include order: _128 siblings before their wider arch, VL before avx512f.hpp. Required so the recursive store_masked<half_arch> call sees the better-arch overload at template-definition time.

(Changelog by @claude)

@DiamonDinoia DiamonDinoia force-pushed the fix/avx512vl-masked-memory branch 4 times, most recently from a9a8eb2 to fe2938e Compare May 21, 2026 10:18
The CI job that exercises avx512vl_128 as the default arch was passing
the variable but the test build never received it; this fixes both
halves of the loop (linux.yml typo $CXX_FLAGS->$CXXFLAGS, CMakeLists
propagation, and a static_assert in test_arch).
- sse2 store_masked: align call signature to the (Mode, common{}) shape
- avx512vl_register: doc comment AVX512DQ -> AVX512VL typo
- isa.hpp: include _128 headers before wider arch so the AVX half-fold's
  recursive call resolves at parse time (load-bearing for the upcoming
  perf change); wrap in clang-format off/on
- avx_128 swizzle: split the dynamic-mask overload into per-type ovlds
  for true C++14 builds where XSIMD_IF_CONSTEXPR is plain 'if' and both
  branches must type-check
Three coupled changes; gcc-10 partial-ordering forces them into a
single commit:

1) Rewrite avx512vl_128 / avx512vl_256 masked load/store. Adds the
   missing int64/uint64/float/double load_masked ovlds, corrects the
   batch_bool_constant typing on store_masked (was uint32_t/uint64_t for
   signed-int/float/double stores, now matches the value type), and
   branches aligned vs. unaligned to the right EVEX intrinsic. Unsigned
   ovlds delegate to the signed one via bitwise_cast.

2) Constrain the non-VL master ovlds (avx_128 float/double, avx2_128
   int32/uint32 + int64/uint64, avx2 templated and int32/uint32/int64/
   uint64) and the common-memory int<->float bridges with
   !is_base_of<avx512vl_*, A>. gcc-10's partial ordering otherwise sees
   a concrete requires_arch<X> and the inherited concrete
   requires_arch<Y> (Y a base of X) as equally specialized, likewise
   for templated bridge<A> vs. native<avx512vl_*> when A is VL.
   gcc-14 handles both cases naturally so this is a no-op there.
   The avx native gains an is_floating_point<T> SFINAE and the avx2
   templated gains is_integral<T> && sizeof>=4 so the new half-fold
   dispatch (half_arch = avx for floats, avx2 for ints in a 512-bit
   batch) is unambiguous on gcc-10.

3) Resolve the half-fold target arch in avx / avx2 / avx512f through
   make_sized_batch_t<T, half>::arch_type so the dispatch picks
   avx512vl_128 / avx512vl_256 when available and emits EVEX
   vmovdqu32{k}{z} instead of VEX vpmaskmovd / vmaskmovps. (Without
   (3), (2)'s is_integral SFINAE on the avx2 templated form leaves the
   pre-existing avx512f.hpp:339 'store_masked<avx2>(float*, __m256,
   ...)' callsite with no matching ovld on gcc-10.)

The xsimd_batch dispatch drops the explicit <A, T, U, Values...> args
on the kernel::store_masked call so the SFINAE'd overload set can be
resolved by ADL, and adds a fwd decl of make_sized_batch ahead of
xsimd_isa.hpp so the half-fold sites can see the type at parse time.

bridge_not_vl lives in xsimd_common_fwd next to the bridge fwd-decls;
fwd.hpp now pulls xsimd_avx512vl_register so the trait sees complete
types. The 4 redundant register.hpp includes that would otherwise be
added at the point-of-use are dropped — they're reachable transitively
through fwd.hpp.
@DiamonDinoia DiamonDinoia force-pushed the fix/avx512vl-masked-memory branch from fe2938e to ea882e6 Compare May 21, 2026 12:04
@DiamonDinoia DiamonDinoia marked this pull request as ready for review May 21, 2026 12:51
if [[ '${{ matrix.sys.flags }}' == 'avx512vl_128' ]]; then
CMAKE_EXTRA_ARGS="$CMAKE_EXTRA_ARGS -DTARGET_ARCH=skylake-avx512"
CXXFLAGS="$CXX_FLAGS -DXSIMD_DEFAULT_ARCH=avx512vl_128"
CXXFLAGS="$CXXFLAGS -DXSIMD_DEFAULT_ARCH=avx512vl_128"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oopsie. Thanks for fixing this one.

Copy link
Copy Markdown
Contributor

@serge-sans-paille serge-sans-paille left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish we could keep each architecture file unaware from other architectures. I do understand this will disappear once we move to C++17-based architecture, but I'd be happier if you could find another way to apply the constraints.

// and the VL native as equally specialized for A=avx512vl_*. (bridge_not_vl in fwd.hpp)
template <class A, bool... Values, class Mode>
XSIMD_INLINE batch<int32_t, A> load_masked(int32_t const* mem, batch_bool_constant<int32_t, A, Values...>, convert<int32_t>, Mode, requires_arch<A>) noexcept
XSIMD_INLINE std::enable_if_t<bridge_not_vl<A>::value, batch<int32_t, A>>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there should not be any arch-specific code in xsimd_common_memory.hpp. Could you find another approach?


template <class A>
XSIMD_INLINE void maskstore(double* mem, batch_bool<double, A> const& mask, batch<double, A> const& src) noexcept
XSIMD_INLINE void maskstore(double* mem, batch<as_integer_t<double>, A> const& mask, batch<double, A> const& src) noexcept
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wgy that change? In my mental model, the masked store take a bool mask, not an integer mask

// single templated implementation for integer masked loads (32/64-bit)
template <class A, class T, bool... Values, class Mode>
template <class A, class T, bool... Values, class Mode,
class = std::enable_if_t<std::is_base_of<avx2, A>::value && !std::is_base_of<avx512vl_256, A>::value>>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I quite dislike the fact that xsimd_avx2.hpp needs to know stuff about avx512vl

}
template <class A, bool... Values, class Mode>
template <class A, bool... Values, class Mode,
class = std::enable_if_t<std::is_base_of<avx_128, A>::value && !std::is_base_of<avx512vl_128, A>::value>>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same architecture mix reference issue here.

#if XSIMD_WITH_AVX
#include "./xsimd_avx.hpp"
// clang-format off
// _128 first: avx half-fold recursive call needs avx_128 visible at parse time.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch.

* @ingroup architectures
*
* AVX512DQ instructions
* AVX512VL instructions
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oopsie

template <typename T, std::size_t N>
struct make_sized_batch;
template <typename T, std::size_t N>
using make_sized_batch_t = typename make_sized_batch<T, N>::type;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

??

Comment thread test/CMakeLists.txt
message(STATUS "Using emulated target: ${TARGET_EMULATED}")
set(EMULATED_COMPILE_FLAGS -DXSIMD_DEFAULT_ARCH=${TARGET_ARCH};-DXSIMD_WITH_EMULATED=1)
unset(TARGET_ARCH CACHE)
elseif (DEFINED XSIMD_DEFAULT_ARCH AND NOT "${XSIMD_DEFAULT_ARCH}" STREQUAL "")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per https://cmake.org/cmake/help/latest/command/if.html#constant I think

if(XSIMD_DEFAULT_ARCH)

is enough

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread test/test_arch.cpp
static_assert(xsimd::all_architectures::contains<xsimd::default_arch>(), "default arch is a valid arch");
#else
namespace xsimd
{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yay

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants