fix: avx512vl masked load/store#1353
Conversation
a9a8eb2 to
fe2938e
Compare
The CI job that exercises avx512vl_128 as the default arch was passing the variable but the test build never received it; this fixes both halves of the loop (linux.yml typo $CXX_FLAGS->$CXXFLAGS, CMakeLists propagation, and a static_assert in test_arch).
- sse2 store_masked: align call signature to the (Mode, common{}) shape
- avx512vl_register: doc comment AVX512DQ -> AVX512VL typo
- isa.hpp: include _128 headers before wider arch so the AVX half-fold's
recursive call resolves at parse time (load-bearing for the upcoming
perf change); wrap in clang-format off/on
- avx_128 swizzle: split the dynamic-mask overload into per-type ovlds
for true C++14 builds where XSIMD_IF_CONSTEXPR is plain 'if' and both
branches must type-check
Three coupled changes; gcc-10 partial-ordering forces them into a
single commit:
1) Rewrite avx512vl_128 / avx512vl_256 masked load/store. Adds the
missing int64/uint64/float/double load_masked ovlds, corrects the
batch_bool_constant typing on store_masked (was uint32_t/uint64_t for
signed-int/float/double stores, now matches the value type), and
branches aligned vs. unaligned to the right EVEX intrinsic. Unsigned
ovlds delegate to the signed one via bitwise_cast.
2) Constrain the non-VL master ovlds (avx_128 float/double, avx2_128
int32/uint32 + int64/uint64, avx2 templated and int32/uint32/int64/
uint64) and the common-memory int<->float bridges with
!is_base_of<avx512vl_*, A>. gcc-10's partial ordering otherwise sees
a concrete requires_arch<X> and the inherited concrete
requires_arch<Y> (Y a base of X) as equally specialized, likewise
for templated bridge<A> vs. native<avx512vl_*> when A is VL.
gcc-14 handles both cases naturally so this is a no-op there.
The avx native gains an is_floating_point<T> SFINAE and the avx2
templated gains is_integral<T> && sizeof>=4 so the new half-fold
dispatch (half_arch = avx for floats, avx2 for ints in a 512-bit
batch) is unambiguous on gcc-10.
3) Resolve the half-fold target arch in avx / avx2 / avx512f through
make_sized_batch_t<T, half>::arch_type so the dispatch picks
avx512vl_128 / avx512vl_256 when available and emits EVEX
vmovdqu32{k}{z} instead of VEX vpmaskmovd / vmaskmovps. (Without
(3), (2)'s is_integral SFINAE on the avx2 templated form leaves the
pre-existing avx512f.hpp:339 'store_masked<avx2>(float*, __m256,
...)' callsite with no matching ovld on gcc-10.)
The xsimd_batch dispatch drops the explicit <A, T, U, Values...> args
on the kernel::store_masked call so the SFINAE'd overload set can be
resolved by ADL, and adds a fwd decl of make_sized_batch ahead of
xsimd_isa.hpp so the half-fold sites can see the type at parse time.
bridge_not_vl lives in xsimd_common_fwd next to the bridge fwd-decls;
fwd.hpp now pulls xsimd_avx512vl_register so the trait sees complete
types. The 4 redundant register.hpp includes that would otherwise be
added at the point-of-use are dropped — they're reachable transitively
through fwd.hpp.
fe2938e to
ea882e6
Compare
| if [[ '${{ matrix.sys.flags }}' == 'avx512vl_128' ]]; then | ||
| CMAKE_EXTRA_ARGS="$CMAKE_EXTRA_ARGS -DTARGET_ARCH=skylake-avx512" | ||
| CXXFLAGS="$CXX_FLAGS -DXSIMD_DEFAULT_ARCH=avx512vl_128" | ||
| CXXFLAGS="$CXXFLAGS -DXSIMD_DEFAULT_ARCH=avx512vl_128" |
There was a problem hiding this comment.
oopsie. Thanks for fixing this one.
serge-sans-paille
left a comment
There was a problem hiding this comment.
I wish we could keep each architecture file unaware from other architectures. I do understand this will disappear once we move to C++17-based architecture, but I'd be happier if you could find another way to apply the constraints.
| // and the VL native as equally specialized for A=avx512vl_*. (bridge_not_vl in fwd.hpp) | ||
| template <class A, bool... Values, class Mode> | ||
| XSIMD_INLINE batch<int32_t, A> load_masked(int32_t const* mem, batch_bool_constant<int32_t, A, Values...>, convert<int32_t>, Mode, requires_arch<A>) noexcept | ||
| XSIMD_INLINE std::enable_if_t<bridge_not_vl<A>::value, batch<int32_t, A>> |
There was a problem hiding this comment.
there should not be any arch-specific code in xsimd_common_memory.hpp. Could you find another approach?
|
|
||
| template <class A> | ||
| XSIMD_INLINE void maskstore(double* mem, batch_bool<double, A> const& mask, batch<double, A> const& src) noexcept | ||
| XSIMD_INLINE void maskstore(double* mem, batch<as_integer_t<double>, A> const& mask, batch<double, A> const& src) noexcept |
There was a problem hiding this comment.
wgy that change? In my mental model, the masked store take a bool mask, not an integer mask
| // single templated implementation for integer masked loads (32/64-bit) | ||
| template <class A, class T, bool... Values, class Mode> | ||
| template <class A, class T, bool... Values, class Mode, | ||
| class = std::enable_if_t<std::is_base_of<avx2, A>::value && !std::is_base_of<avx512vl_256, A>::value>> |
There was a problem hiding this comment.
I quite dislike the fact that xsimd_avx2.hpp needs to know stuff about avx512vl
| } | ||
| template <class A, bool... Values, class Mode> | ||
| template <class A, bool... Values, class Mode, | ||
| class = std::enable_if_t<std::is_base_of<avx_128, A>::value && !std::is_base_of<avx512vl_128, A>::value>> |
There was a problem hiding this comment.
same architecture mix reference issue here.
| #if XSIMD_WITH_AVX | ||
| #include "./xsimd_avx.hpp" | ||
| // clang-format off | ||
| // _128 first: avx half-fold recursive call needs avx_128 visible at parse time. |
| * @ingroup architectures | ||
| * | ||
| * AVX512DQ instructions | ||
| * AVX512VL instructions |
| template <typename T, std::size_t N> | ||
| struct make_sized_batch; | ||
| template <typename T, std::size_t N> | ||
| using make_sized_batch_t = typename make_sized_batch<T, N>::type; |
| message(STATUS "Using emulated target: ${TARGET_EMULATED}") | ||
| set(EMULATED_COMPILE_FLAGS -DXSIMD_DEFAULT_ARCH=${TARGET_ARCH};-DXSIMD_WITH_EMULATED=1) | ||
| unset(TARGET_ARCH CACHE) | ||
| elseif (DEFINED XSIMD_DEFAULT_ARCH AND NOT "${XSIMD_DEFAULT_ARCH}" STREQUAL "") |
There was a problem hiding this comment.
Per https://cmake.org/cmake/help/latest/command/if.html#constant I think
if(XSIMD_DEFAULT_ARCH)
is enough
| static_assert(xsimd::all_architectures::contains<xsimd::default_arch>(), "default arch is a valid arch"); | ||
| #else | ||
| namespace xsimd | ||
| { |
xsimd::batch<uint32_t, avx512vl_256>::store(ptr, constexpr_mask, mode)usedto compile to a 6-instruction per-lane scalar extract loop instead of a single
EVEX-encoded masked store, because the call site in
xsimd_batch.hpp:766over-specified template arguments and SFINAE'd away every viable masked-store
overload except the scalar
commonfallback.The load side was unaffected — its call site (
:743) was already correct.Codegen — before (master, commit
7d30b9cc)Load is fine. Store is the scalar fallback: GCC unrolled the 8-lane loop in
xsimd_common_memory.hpp:377into four per-lane 32-bit stores, materialisingeach active lane no mask instruction involved.
Codegen — after
One masked instruction.
Bug fixes
AVX-512VL masked store collapsed to a 6-insn scalar fallback. An over-specified template arg list at the public
batch::store(mask)call site pushed a type into a non-type pack, SFINAE'ing every per-arch overload away. The store now reaches the EVEXvmov*{k}intrinsic.CI
linux.ymltypo silently droppedCXXFLAGSin the VL_128 matrix row ($CXX_FLAGSvs$CXXFLAGS), so the VL_128-default test job was building with stock flags instead of the requested override.avx512vlregister-traits comment misattributed to AVX512DQ.Half-confined masked op fell back to scalar on AVX/AVX2/AVX-512F. The half-fold hardcoded
sse4_2/avx2as the half-width target, but two-phase lookup made the better-arch overload invisible at template-definition time — so the recursive call dispatched to the wrong arch. Fixed by include reorder +make_sized_batch_t<T, half>.Features added
load_masked/store_maskedforavx512vl_128andavx512vl_256covering i32/u32/i64/u64/f32/f64 in both aligned and unaligned modes; partial ordering picks them over the avx2 bridges these archsinherit.
default_archmatchesXSIMD_DEFAULT_ARCHwhen the macro is set — astatic_assertintest_arch.cpp(plus the CMake plumbing to forward the macro) catches default-arch wiring regressions at compile time instead of at runtime.Cleanups
detail::maskstorehelpers now takebatch<>types, symmetric withdetail::maskload.selectuse a typed variable (const batch<T, A> lo = …) instead ofconst auto lo = batch<T, A>{ … }.XSIMD_IF_CONSTEXPR, so the inactive intrinsic isn't instantiated.xsimd_avx.hpp/xsimd_avx2.hpp/xsimd_avx512f.hppusesmake_sized_batch_t<T, half>instead of a hardcoded arch — picksavx_128/avx2_128/avx512vl_256when available, so half-confined stores land on the EVEX or VEX masked intrinsic.xsimd_isa.hppinclude order:_128siblings before their wider arch, VL beforeavx512f.hpp. Required so the recursivestore_masked<half_arch>call sees the better-arch overload at template-definition time.(Changelog by @claude)