
bobjansen/Zorro


zorro

Zorro is a header-only xoshiro256++ library with automatic SIMD dispatch. Copy a single header into your project and get high-performance random number generation across uniform, normal, exponential, and Bernoulli distributions.

Usage

Copy include/zorro/zorro.hpp into your project and compile with -mavx2 or -march=native. Requires C++20 or later.

#include "zorro/zorro.hpp"
#include <vector>

int main() {
    zorro::Rng rng(42);
    std::vector<double> buf(1'000'000);

    rng.fill_uniform(buf.data(), buf.size());                   // [0, 1)
    rng.fill_uniform(buf.data(), buf.size(), -1.0, 1.0);        // [-1, 1)
    rng.fill_normal(buf.data(), buf.size());                     // N(0, 1)
    rng.fill_normal(buf.data(), buf.size(), 100.0, 15.0);        // N(100, 15)
    rng.fill_exponential(buf.data(), buf.size());                // Exp(1)
    rng.fill_exponential(buf.data(), buf.size(), 0.5);           // Exp(0.5)
    rng.fill_bernoulli(buf.data(), buf.size(), 0.3);             // Bernoulli(0.3)
}

High-dimensional example

The repo also includes a standalone example that simulates a high-dimensional factor model with heavy-tailed common shocks. It exposes a small swappable backend interface so the same workload can run on either std or zorro.

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --target high_dimensional_random_variables
./build/high_dimensional_random_variables --backend both

Useful options:

  • --backend std|zorro|both to switch the RNG implementation at runtime.
  • --dimension N to change the random-vector dimension.
  • --paths N and --repetitions N to make the run shorter or longer.
  • --factors N and --degrees-of-freedom X to change the factor model.
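The shape of that workload can be sketched in a few lines of scalar C++. This is a minimal illustration only, using std:: distributions as stand-ins for the swappable backend; the function name, loading scheme, and parameters are hypothetical, not the example's actual code.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Minimal scalar sketch: each component is a linear combination of a few
// heavy-tailed common factors plus Gaussian idiosyncratic noise.
std::vector<double> simulate_path(std::size_t dimension, std::size_t n_factors,
                                  double dof, std::mt19937_64& rng) {
    std::student_t_distribution<double> common(dof);  // heavy-tailed common shocks
    std::normal_distribution<double> idio(0.0, 1.0);  // idiosyncratic noise

    // Equal loadings keep the sketch short; the real example configures the
    // factor structure via the command-line options above.
    const double loading = 1.0 / std::sqrt(static_cast<double>(n_factors));

    double common_term = 0.0;
    for (std::size_t k = 0; k < n_factors; ++k)
        common_term += loading * common(rng);

    std::vector<double> x(dimension);
    for (auto& xi : x) xi = common_term + idio(rng);
    return x;
}
```

Swapping the std:: distributions for zorro's bulk fill_* calls is what the backend interface in the example abstracts over.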

Importance sampling example

There is also a rare-event example for importance sampling. It estimates a portfolio tail probability under a correlated Gaussian factor model using both plain Monte Carlo and a mean-shift importance proposal, again with a swappable std/zorro backend.

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --target importance_sampling
./build/importance_sampling --backend both

Useful options:

  • --rare-z X sets the loss threshold in standard deviations above the mean.
  • --dimension N, --factors N, --paths N, and --repetitions N scale the workload.
  • --backend std|zorro|both switches the RNG implementation at runtime.
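The mean-shift idea itself is standard and fits in a few lines. The sketch below estimates a one-dimensional Gaussian tail probability P(X > z) by sampling from the shifted proposal N(z, 1) and reweighting; it is an illustration of the technique, not the example's portfolio model, and the function name is hypothetical.

```cpp
#include <cmath>
#include <cstdint>
#include <random>

// Mean-shift importance sampling for P(X > z), X ~ N(0,1).
// Drawing from N(z, 1) puts most samples in the rare region; the
// likelihood ratio phi(y)/phi(y - z) = exp(-z*y + z^2/2) removes the bias.
double tail_prob_is(double z, std::size_t n, std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::normal_distribution<double> std_normal(0.0, 1.0);
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double y = z + std_normal(rng);             // draw from the proposal
        if (y > z)
            acc += std::exp(-z * y + 0.5 * z * z);  // importance weight
    }
    return acc / static_cast<double>(n);
}
```

Plain Monte Carlo would need millions of draws to see events at z = 3 or beyond; the reweighted estimator concentrates every draw near the threshold.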

SIMD dispatch

The SIMD tier is selected at compile time based on which instruction sets are enabled. Within a tier, runtime CPUID checks handle AMD/Intel differences.

| Compile flags | Engine | Uniform throughput |
| --- | --- | --- |
| -mavx512f -mavx512vl -mavx512dq | 16-wide AVX-512 (2x8 interleaved) | ~3300 M/s |
| -mavx2 | 8-wide AVX2 (2x4 interleaved) | ~2900 M/s |
| (neither) | 4-wide portable (compiler auto-vec) | ~1200 M/s |

The AMD Zen 4 runtime check applies to FP-heavy kernels (normal): 512-bit sqrt/div serialize on the 256-bit datapath, so the library falls back to AVX2 vecpolar automatically. Integer-only kernels (uniform, bernoulli) use AVX-512 at full speed on AMD.

Vectorized log (optional)

Normal and exponential generation involve log(). By default, Zorro uses SIMD for the RNG core and scalar std::log for the transcendental. On glibc systems, define ZORRO_USE_LIBMVEC and link -lmvec -lm for fully vectorized log:

#define ZORRO_USE_LIBMVEC
#include "zorro/zorro.hpp"

g++ -O2 -mavx2 -DZORRO_USE_LIBMVEC main.cpp -lmvec -lm

| Distribution | Without libmvec | With libmvec |
| --- | --- | --- |
| Normal | ~230 M/s | ~400 M/s |
| Exponential | ~280 M/s | ~860 M/s |

Other API

The header also provides building-block types for direct use:

  • zorro::Xoshiro256pp -- scalar engine, satisfies UniformRandomBitGenerator (a drop-in generator for std::uniform_int_distribution, std::shuffle, etc.)
  • zorro::Xoshiro256pp_x2 / zorro::Xoshiro256pp_x4_portable -- portable multi-lane engines (SoA layout, compiler auto-vectorizes)
  • zorro::splitmix64, zorro::bits_to_01, zorro::bits_to_pm1 -- seed expansion and bit-to-float conversion utilities
  • zorro::get_rng() / zorro::reseed(seed) -- thread-local Rng singleton
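The algorithms behind these building blocks are published (Blackman & Vigna's xoshiro256++ and splitmix64), so a self-contained sketch is easy to give. This is an illustration of the underlying algorithms, not the header's exact code; details such as the seeding path and the bit-to-float scaling may differ.

```cpp
#include <cstdint>

static inline std::uint64_t rotl(std::uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}

// splitmix64: expands one 64-bit seed into well-mixed state words.
std::uint64_t splitmix64(std::uint64_t& state) {
    std::uint64_t z = (state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// Scalar xoshiro256++ step, as published by Blackman & Vigna.
struct Xoshiro256pp {
    std::uint64_t s[4];
    explicit Xoshiro256pp(std::uint64_t seed) {
        for (auto& w : s) w = splitmix64(seed);  // avoids the all-zero state
    }
    std::uint64_t next() {
        const std::uint64_t result = rotl(s[0] + s[3], 23) + s[0];
        const std::uint64_t t = s[1] << 17;
        s[2] ^= s[0]; s[3] ^= s[1]; s[1] ^= s[2]; s[0] ^= s[3];
        s[2] ^= t;
        s[3] = rotl(s[3], 45);
        return result;
    }
};

// One plausible bits-to-[0,1) mapping: keep the top 52 bits (matching the
// >> 12 mentioned in the Bernoulli section) and scale by 2^-52.
double bits_to_01(std::uint64_t bits) {
    return static_cast<double>(bits >> 12) * 0x1.0p-52;
}
```

The multi-lane x2/x4 engines run several copies of this state transition in an SoA layout so the compiler (or explicit intrinsics) can vectorize across lanes.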

Benchmarks

The benchmark target compares 2^20 samples across eight distribution suites and is configured for C++23.

  • Uniform(0, 1) -- scalar, x2, x4 portable, x4/x8 AVX2, x8/x16 AVX-512
  • Normal(0, 1) -- polar, batched, veclog, vecpolar (AVX2 and AVX-512)
  • Exponential(1) -- scalar log, veclog AVX2, veclog AVX-512
  • Bernoulli(0.3) -- naive, integer-threshold AVX2, native ucmp AVX-512
  • Bernoulli(0.3/0.5) -> uint8_t -- compact 1-byte output with bit-unpack special case for p=0.5
  • Gamma(2, 1) -- scalar fused, x8 AVX2 fused/decoupled/full
  • Student's t(5) -- scalar fused, x8 AVX2 fused/decoupled/fast

Each suite also includes Stephan Friedl's handwritten AVX2 Xoshiro256PlusSIMD variants where applicable.

A literate programming walkthrough of the core AVX2 4x2 loop lives in docs/xoshiro256pp_avx_4x2.md.

Build

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

For local machine-specific benchmark runs:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DRNG_BENCH_ENABLE_NATIVE=ON
cmake --build build -j

CMake options:

| Option | Default | Description |
| --- | --- | --- |
| RNG_BENCH_ENABLE_NATIVE | OFF | Compile with -march=native -mtune=native |
| RNG_BENCH_ENABLE_AVX512 | auto-detected | AVX-512F/VL/DQ kernels. Auto-enabled when the host CPU supports AVX-512 (compile-and-run test), disabled otherwise. |
| RNG_BENCH_ENABLE_STEPHANFR_AVX2 | auto-detected | Stephan Friedl's AVX2 comparison. Auto-enabled when the compiler supports -mavx2. |
| RNG_BENCH_MARCH_FLAGS | "" | Extra SIMD override flags appended after -march=native (e.g. -mno-avx512f). |

Run

./build/benchmark_distributions

The benchmark uses a fixed seed and reports the best, median, and mean wall-clock time plus derived throughput in samples/second.

Dieharder

If dieharder is installed, you can run a targeted battery against the raw xoshiro256++ bitstreams exported by the library:

./benchmarks/run_dieharder.sh

Useful options:

  • --stream scalar|x2|x4|all to select which generator layout to test.
  • --out-dir DIR to keep the raw dieharder reports in a specific folder.

The script builds dieharder_stream if needed and automatically reruns any WEAK or FAILED result in ambiguity-resolution mode (-k 2 -Y 1).

TestU01

If TestU01 is installed, the project also exposes a direct runner linked against unif01/bbattery:

./benchmarks/run_testu01.sh

Useful options:

  • --stream scalar|x2|x4|all to select which generator layout to test.
  • --battery smallcrush|crush|bigcrush to pick the TestU01 battery.
  • --out-dir DIR to keep the raw TestU01 reports in a specific folder.

The default is SmallCrush, which is the practical quick check. Crush and especially BigCrush are much more expensive.

AWS benchmarking

aws/bench.sh launches spot instances, uploads the source, builds with both GCC and Clang across multiple SIMD levels (native, no-avx512, no-avx2, no-avx), and collects results. Results are saved under aws/results/.

./aws/bench.sh                                    # defaults: c6i, c7i, c7a
./aws/bench.sh --instances c7i.xlarge,c7a.xlarge  # custom instance types

Findings

Results from aws/bench.sh (GCC 15, native SIMD, best ms).

Uniform generation

| Kernel | c7a (Zen 4) | c7i (SPR) | c6i (Ice Lake) |
| --- | --- | --- | --- |
| scalar | 1.073 ms / 977 M/s | 1.262 ms / 831 M/s | 1.492 ms / 703 M/s |
| x8 AVX2 (2x4) | 0.268 ms / 3916 M/s | 0.362 ms / 2895 M/s | 0.348 ms / 3015 M/s |
| x16 AVX-512 (2x8) | 0.273 ms / 3840 M/s | 0.318 ms / 3299 M/s | 0.321 ms / 3263 M/s |

AVX-512 uniform is a modest win on Intel (~8-14% over AVX2) but roughly breaks even on AMD, where integer/bitwise 512-bit ops split into 256-bit micro-op pairs at full throughput.

Normal generation

| Kernel | c7a (Zen 4) | c7i (SPR) | c6i (Ice Lake) |
| --- | --- | --- | --- |
| x8 AVX2 + vecpolar | 3.046 ms / 344 M/s | 3.022 ms / 347 M/s | 3.627 ms / 289 M/s |
| x16 AVX-512 + vecpolar | 3.052 ms / 344 M/s | 2.158 ms / 486 M/s | 2.643 ms / 397 M/s |

AVX-512 vecpolar gives a 1.3-1.4x speedup on Intel thanks to native 512-bit sqrt/div and hardware compress-store. On AMD Zen 4, the kernel automatically falls back to AVX2 vecpolar at runtime (CPUID vendor check) because 512-bit sqrt, div, and _mm512_mask_compressstoreu_pd serialize on the 256-bit datapath, making the AVX-512 version ~1.6x slower than AVX2 on AMD hardware.

Exponential and Bernoulli

| Kernel | c7a (Zen 4) | c7i (SPR) | c6i (Ice Lake) |
| --- | --- | --- | --- |
| Exp x8 AVX2 veclog | 1.743 ms / 602 M/s | 2.123 ms / 494 M/s | 2.355 ms / 445 M/s |
| Exp x16 AVX-512 veclog | 1.322 ms / 793 M/s | 1.370 ms / 766 M/s | 1.374 ms / 763 M/s |
| Bernoulli x8 AVX2 fast | 0.269 ms / 3893 M/s | 0.357 ms / 2934 M/s | 0.360 ms / 2912 M/s |
| Bernoulli x16 AVX-512 ucmp | 0.247 ms / 4247 M/s | 0.325 ms / 3229 M/s | 0.363 ms / 2890 M/s |

AVX-512 exponential benefits strongly from the 8-wide _ZGVeN8v_log (+32-71% over AVX2). AVX-512 Bernoulli uses native _mm512_cmp_epu64_mask to eliminate the sign-flip workaround required by AVX2's signed-only _mm256_cmpgt_epi64.
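The sign-flip workaround mentioned above is the standard trick for unsigned comparison on AVX2: XOR both operands with the sign bit, and the signed ordering of the results equals the unsigned ordering of the originals. A scalar illustration (the kernel does the same per lane with _mm256_xor_si256 and _mm256_cmpgt_epi64):

```cpp
#include <cstdint>

// Unsigned 64-bit less-than computed via a signed compare.
// Flipping the top bit maps [0, 2^64) monotonically onto the signed range,
// which is exactly what AVX2 needs since it only has signed 64-bit compares.
bool unsigned_lt_via_signed(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t bias = 0x8000000000000000ULL;
    return static_cast<std::int64_t>(a ^ bias) <
           static_cast<std::int64_t>(b ^ bias);
}
```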

Bernoulli output formats and the p=0.5 special case

The Bernoulli kernels support two output formats:

  • double (8 bytes/sample) -- 0.0 or 1.0, compatible with the other distribution suites.
  • uint8_t (1 byte/sample) -- 0 or 1, 8x less memory traffic.

For p = 0.5, every bit of the raw xoshiro256++ output is an independent Bernoulli trial. Instead of generating a uniform double and comparing, we unpack individual bits directly into the output -- one RNG call produces 208 samples (52 quality bits x 4 SIMD lanes) instead of 4.

The low 12 bits of each uint64 are discarded, matching the uniform path's >> 12 and avoiding the weakest bits of the xoshiro256++ output function.
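A scalar sketch of the p = 0.5 fast path makes the bookkeeping concrete. The real kernel does this unpacking with SIMD shuffles rather than a loop, and the function name here is illustrative:

```cpp
#include <cstddef>
#include <cstdint>

// Unpack one raw 64-bit word into 52 Bernoulli(0.5) samples.
// The low 12 bits are discarded, mirroring the uniform path's >> 12,
// so only the stronger bits of the xoshiro256++ output are used.
std::size_t unpack_bernoulli_half(std::uint64_t word, std::uint8_t* out) {
    for (int i = 0; i < 52; ++i)
        out[i] = static_cast<std::uint8_t>((word >> (12 + i)) & 1u);
    return 52;  // samples produced per 64-bit word
}
```

With 4 SIMD lanes producing one word each per step, a single RNG call yields the 208 samples quoted above.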

Local AVX2 results (2^20 samples, best of 21 runs):

| Kernel | Output | Best ms | Throughput |
| --- | --- | --- | --- |
| naive (uniform + cmp) | double | 0.349 | 3.0 G/s |
| fast (int threshold) | double | 0.312 | 3.4 G/s |
| bit-unpack (p=0.5) | double | 0.176 | 6.0 G/s |
| naive (uniform + cmp) | uint8_t | 0.550 | 1.9 G/s |
| fast (int threshold) | uint8_t | 0.336 | 3.1 G/s |
| bit-unpack (p=0.5) | uint8_t | 0.050 | 21.2 G/s |

The uint8_t bit-unpack path achieves 21 billion samples/second -- roughly 7 samples per clock cycle on a 3 GHz machine. At this speed, the RNG core itself is no longer the bottleneck; the limit is the byte-scatter from bits to the output buffer.

Fused vs decoupled generation

Gamma(2, 1) -- fused wins decisively

| Kernel | c7a (Zen 4) | c7i (SPR) |
| --- | --- | --- |
| scalar fused | 48 M/s | 47 M/s |
| x8 AVX2 fused | 125 M/s | 127 M/s |
| x8+x4 AVX2 full (vectorized MT) | 133 M/s | 126 M/s |
| x8 AVX2 decoupled | 60 M/s | 66 M/s |

Fused keeps the PRNG state registers hot and feeds the Marsaglia-Tsang acceptance loop directly from a 64-sample L1-resident buffer. Decoupled materializes intermediate normals and uniforms to the heap, paying full memory bandwidth for no gain.
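For reference, the Marsaglia-Tsang acceptance loop being fused here is the standard 2000 algorithm. A scalar sketch for Gamma(alpha, 1) with alpha >= 1, using std:: distributions as stand-ins for the library's fused buffers:

```cpp
#include <cmath>
#include <random>

// Marsaglia-Tsang (2000) sampler for Gamma(alpha, 1), alpha >= 1.
// Draw x ~ N(0,1), form v = (1 + c*x)^3, and accept d*v with a cheap
// squeeze test followed by the exact log-based acceptance test.
double gamma_mt(double alpha, std::mt19937_64& rng) {
    std::normal_distribution<double> normal(0.0, 1.0);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    const double d = alpha - 1.0 / 3.0;
    const double c = 1.0 / std::sqrt(9.0 * d);
    for (;;) {
        double x = normal(rng);
        double t = 1.0 + c * x;
        if (t <= 0.0) continue;                 // v = t^3 must be positive
        double v = t * t * t;
        double u = unif(rng);
        double x2 = x * x;
        if (u < 1.0 - 0.0331 * x2 * x2)
            return d * v;                       // fast squeeze acceptance
        if (std::log(u) < 0.5 * x2 + d - d * v + d * std::log(v))
            return d * v;                       // full acceptance test
    }
}
```

The data-dependent retry in this loop is exactly what makes the kernel hard to vectorize end to end, which is why keeping its inputs in an L1-resident buffer matters so much.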

Student's t(5) -- decoupled wins

| Kernel | c7a (Zen 4) | c7i (SPR) |
| --- | --- | --- |
| scalar fused | 26 M/s | 29 M/s |
| x8 AVX2 fused | 35 M/s | 38 M/s |
| x8 AVX2 decoupled | 62 M/s | 61 M/s |
| x8+x4 AVX2 fast | 64 M/s | 62 M/s |

For a compound distribution the fused variant forces a scalar Gamma loop into the hot path. Decoupled lets each sub-distribution run its own best kernel independently; the memory-traffic cost is more than recovered.
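The decoupled strategy can be sketched with std:: stand-ins for the per-distribution kernels: generate each ingredient in bulk, then combine elementwise via t = Z / sqrt(V / nu) with Z ~ N(0,1) and V ~ chi^2(nu). The function name and pass structure are illustrative.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Decoupled Student's t(nu): three bulk passes instead of one fused loop.
// Each pass could use its own best SIMD kernel; here std:: stands in.
std::vector<double> student_t_decoupled(std::size_t n, double nu,
                                        std::mt19937_64& rng) {
    std::vector<double> z(n), v(n);
    std::normal_distribution<double> normal(0.0, 1.0);
    std::chi_squared_distribution<double> chi2(nu);
    for (auto& zi : z) zi = normal(rng);  // pass 1: normals, vectorizable
    for (auto& vi : v) vi = chi2(rng);    // pass 2: chi-squared draws
    for (std::size_t i = 0; i < n; ++i)   // pass 3: elementwise combine
        z[i] /= std::sqrt(v[i] / nu);
    return z;
}
```

The fused alternative would interleave the chi-squared rejection loop (itself Gamma-based) with the normal draws, dragging scalar control flow into every sample.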

Summary

  • Fused integration is the right default when the distribution is self-contained and fits in L1 (Gamma: 2x over decoupled).
  • Decoupled is better for compound distributions where one sub-computation would serialize an otherwise-vectorized pipeline (Student-t: 1.7x over fused).
  • AVX-512 is not extended to Gamma or Student-t: their bottleneck is the scalar Marsaglia-Tsang acceptance loop, not the PRNG or vectorized math.

AMD vs Intel: when AVX-512 helps and when it hurts

AMD Zen 4 implements AVX-512 by splitting every 512-bit op into two 256-bit micro-ops. This is transparent for integer-heavy kernels (uniform, Bernoulli) where both halves pipeline through the ALUs at full throughput. But for FP-math-heavy kernels with sqrt, div, and compress-store (vecpolar), the two halves serialize on the single divider unit, making 512-bit strictly worse than 256-bit.

The cpu_is_amd() check in the vecpolar kernel detects this at runtime and routes AMD hardware to the AVX2 path automatically.
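A vendor check of this kind reads CPUID leaf 0, which returns the 12-byte vendor string in EBX, EDX, ECX ("AuthenticAMD" on AMD parts). A sketch using GCC/Clang's <cpuid.h>; the library's cpu_is_amd() may be implemented differently, so this function name is a stand-in:

```cpp
#include <cstring>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

// Returns true on AMD x86 CPUs by checking the CPUID leaf-0 vendor string.
// On non-x86 targets there is no AVX-512 datapath concern, so return false.
bool cpu_is_amd_sketch() {
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx)) return false;
    char vendor[13] = {};                 // 12 chars + NUL terminator
    std::memcpy(vendor + 0, &ebx, 4);     // vendor string order: EBX, EDX, ECX
    std::memcpy(vendor + 4, &edx, 4);
    std::memcpy(vendor + 8, &ecx, 4);
    return std::strcmp(vendor, "AuthenticAMD") == 0;
#else
    return false;
#endif
}
```

Because the check runs once at dispatch time, its cost is negligible next to any fill_* call.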

Third-party code

The handwritten SIMD comparison uses a vendored snapshot of Stephan Friedl's Xoshiro256PlusSIMD headers under third_party/stephanfr_xoshiro256plussimd, taken from upstream commit 29b30821f8f59b6106462c03d1d0225c93c4545d. The upstream license is preserved in third_party/stephanfr_xoshiro256plussimd/LICENSE.
