Zorro is a header-only xoshiro256++ library with automatic SIMD dispatch.
Copy a single header into your project and get high-performance random number
generation across uniform, normal, exponential, and Bernoulli distributions.
Copy include/zorro/zorro.hpp into your project. Compile with -mavx2 or
-march=native (C++20 or later).
#include "zorro/zorro.hpp"
#include <vector>
int main() {
zorro::Rng rng(42);
std::vector<double> buf(1'000'000);
rng.fill_uniform(buf.data(), buf.size()); // [0, 1)
rng.fill_uniform(buf.data(), buf.size(), -1.0, 1.0); // [-1, 1)
rng.fill_normal(buf.data(), buf.size()); // N(0, 1)
rng.fill_normal(buf.data(), buf.size(), 100.0, 15.0); // N(100, 15)
rng.fill_exponential(buf.data(), buf.size()); // Exp(1)
rng.fill_exponential(buf.data(), buf.size(), 0.5); // Exp(0.5)
rng.fill_bernoulli(buf.data(), buf.size(), 0.3); // Bernoulli(0.3)
}The repo also includes a standalone example that simulates a high-dimensional
factor model with heavy-tailed common shocks. It exposes a small swappable
backend interface so the same workload can run on either std or zorro.
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --target high_dimensional_random_variables
./build/high_dimensional_random_variables --backend bothUseful options:
--backend std|zorro|bothto switch the RNG implementation at runtime.--dimension Nto change the random-vector dimension.--paths Nand--repetitions Nto make the run shorter or longer.--factors Nand--degrees-of-freedom Xto change the factor model.
There is also a rare-event example for importance sampling. It estimates a
portfolio tail probability under a correlated Gaussian factor model using both
plain Monte Carlo and a mean-shift importance proposal, again with a swappable
std/zorro backend.
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --target importance_sampling
./build/importance_sampling --backend bothUseful options:
--rare-z Xsets the loss threshold in standard deviations above the mean.--dimension N,--factors N,--paths N, and--repetitions Nscale the workload.--backend std|zorro|bothswitches the RNG implementation at runtime.
The SIMD tier is selected at compile time based on which instruction sets are enabled. Within a tier, runtime CPUID checks handle AMD/Intel differences.
| Compile flags | Engine | Uniform throughput |
|---|---|---|
-mavx512f -mavx512vl -mavx512dq |
16-wide AVX-512 (2x8 interleaved) | ~3300 M/s |
-mavx2 |
8-wide AVX2 (2x4 interleaved) | ~2900 M/s |
| (neither) | 4-wide portable (compiler auto-vec) | ~1200 M/s |
The AMD Zen 4 runtime check applies to FP-heavy kernels (normal): 512-bit sqrt/div serialize on the 256-bit datapath, so the library falls back to AVX2 vecpolar automatically. Integer-only kernels (uniform, bernoulli) use AVX-512 at full speed on AMD.
Normal and exponential generation involve log(). By default, Zorro uses SIMD
for the RNG core and scalar std::log for the transcendental. On glibc
systems, define ZORRO_USE_LIBMVEC and link -lmvec -lm for fully vectorized
log:
#define ZORRO_USE_LIBMVEC
#include "zorro/zorro.hpp"g++ -O2 -mavx2 -DZORRO_USE_LIBMVEC main.cpp -lmvec -lm| Distribution | Without libmvec | With libmvec |
|---|---|---|
| Normal | ~230 M/s | ~400 M/s |
| Exponential | ~280 M/s | ~860 M/s |
The header also provides building-block types for direct use:
zorro::Xoshiro256pp-- scalar engine, satisfiesUniformRandomBitGenerator(drop-in forstd::uniform_int_distribution(rng)etc.)zorro::Xoshiro256pp_x2/zorro::Xoshiro256pp_x4_portable-- portable multi-lane engines (SoA layout, compiler auto-vectorizes)zorro::splitmix64,zorro::bits_to_01,zorro::bits_to_pm1-- seed expansion and bit-to-float conversion utilitieszorro::get_rng()/zorro::reseed(seed)-- thread-localRngsingleton
The benchmark target compares 2^20 samples across eight distribution suites
and is configured for C++23.
- Uniform(0, 1) -- scalar, x2, x4 portable, x4/x8 AVX2, x8/x16 AVX-512
- Normal(0, 1) -- polar, batched, veclog, vecpolar (AVX2 and AVX-512)
- Exponential(1) -- scalar log, veclog AVX2, veclog AVX-512
- Bernoulli(0.3) -- naive, integer-threshold AVX2, native ucmp AVX-512
- Bernoulli(0.3/0.5) -> uint8_t -- compact 1-byte output with bit-unpack special case for p=0.5
- Gamma(2, 1) -- scalar fused, x8 AVX2 fused/decoupled/full
- Student's t(5) -- scalar fused, x8 AVX2 fused/decoupled/fast
Each suite also includes
Stephan Friedl's
handwritten AVX2 Xoshiro256PlusSIMD variants where applicable.
A literate programming walkthrough of the core AVX2 4x2 loop lives in docs/xoshiro256pp_avx_4x2.md.
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -jFor local machine-specific benchmark runs:
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DRNG_BENCH_ENABLE_NATIVE=ON
cmake --build build -jCMake options:
| Option | Default | Description |
|---|---|---|
RNG_BENCH_ENABLE_NATIVE |
OFF |
Compile with -march=native -mtune=native |
RNG_BENCH_ENABLE_AVX512 |
auto-detected | AVX-512F/VL/DQ kernels. Auto-enabled when the host CPU supports AVX-512 (compile-and-run test), disabled otherwise. |
RNG_BENCH_ENABLE_STEPHANFR_AVX2 |
auto-detected | Stephan Friedl's AVX2 comparison. Auto-enabled when the compiler supports -mavx2. |
RNG_BENCH_MARCH_FLAGS |
"" |
Extra SIMD override flags appended after -march=native (e.g. -mno-avx512f). |
./build/benchmark_distributionsThe benchmark uses a fixed seed and reports the best, median, and mean wall-clock time plus derived throughput in samples/second.
If dieharder is installed, you can run a targeted battery against the raw
xoshiro256++ bitstreams exported by the library:
./benchmarks/run_dieharder.shUseful options:
--stream scalar|x2|x4|allto select which generator layout to test.--out-dir DIRto keep the rawdieharderreports in a specific folder.
The script builds dieharder_stream if needed and automatically reruns any
WEAK or FAILED result in ambiguity-resolution mode (-k 2 -Y 1).
If TestU01 is installed, the project also exposes a direct runner linked
against unif01/bbattery:
./benchmarks/run_testu01.shUseful options:
--stream scalar|x2|x4|allto select which generator layout to test.--battery smallcrush|crush|bigcrushto pick the TestU01 battery.--out-dir DIRto keep the raw TestU01 reports in a specific folder.
The default is SmallCrush, which is the practical quick check. Crush and
especially BigCrush are much more expensive.
aws/bench.sh launches spot instances, uploads the source, builds with
both GCC and Clang across multiple SIMD levels (native, no-avx512, no-avx2,
no-avx), and collects results. Results are saved under aws/results/.
./aws/bench.sh # defaults: c6i, c7i, c7a
./aws/bench.sh --instances c7i.xlarge,c7a.xlarge # custom instance typesResults from aws/bench.sh (GCC 15, native SIMD, best ms).
| Kernel | c7a (Zen 4) | c7i (SPR) | c6i (Ice Lake) |
|---|---|---|---|
| scalar | 1.073 / 977 M/s | 1.262 / 831 M/s | 1.492 / 703 M/s |
| x8 AVX2 (2x4) | 0.268 / 3916 M/s | 0.362 / 2895 M/s | 0.348 / 3015 M/s |
| x16 AVX-512 (2x8) | 0.273 / 3840 M/s | 0.318 / 3299 M/s | 0.321 / 3263 M/s |
AVX-512 uniform is a modest win on Intel (~8-14% over AVX2) but roughly breaks even on AMD, where integer/bitwise 512-bit ops split into 256-bit micro-op pairs at full throughput.
| Kernel | c7a (Zen 4) | c7i (SPR) | c6i (Ice Lake) |
|---|---|---|---|
| x8 AVX2 + vecpolar | 3.046 / 344 M/s | 3.022 / 347 M/s | 3.627 / 289 M/s |
| x16 AVX-512 + vecpolar | 3.052 / 344 M/s | 2.158 / 486 M/s | 2.643 / 397 M/s |
AVX-512 vecpolar gives a 1.3-1.4x speedup on Intel thanks to native 512-bit
sqrt/div and hardware compress-store. On AMD Zen 4, the kernel automatically
falls back to AVX2 vecpolar at runtime (CPUID vendor check) because 512-bit
sqrt, div, and _mm512_mask_compressstoreu_pd serialize on the 256-bit
datapath, making the AVX-512 version ~1.6x slower than AVX2 on AMD hardware.
| Kernel | c7a (Zen 4) | c7i (SPR) | c6i (Ice Lake) |
|---|---|---|---|
| Exp x8 AVX2 veclog | 1.743 / 602 M/s | 2.123 / 494 M/s | 2.355 / 445 M/s |
| Exp x16 AVX-512 veclog | 1.322 / 793 M/s | 1.370 / 766 M/s | 1.374 / 763 M/s |
| Bernoulli x8 AVX2 fast | 0.269 / 3893 M/s | 0.357 / 2934 M/s | 0.360 / 2912 M/s |
| Bernoulli x16 AVX-512 ucmp | 0.247 / 4247 M/s | 0.325 / 3229 M/s | 0.363 / 2890 M/s |
AVX-512 exponential benefits strongly from the 8-wide _ZGVeN8v_log (+32-71%
over AVX2). AVX-512 Bernoulli uses native _mm512_cmp_epu64_mask to eliminate
the sign-flip workaround required by AVX2's signed-only _mm256_cmpgt_epi64.
The Bernoulli kernels support two output formats:
- double (8 bytes/sample) -- 0.0 or 1.0, compatible with the other distribution suites.
- uint8_t (1 byte/sample) -- 0 or 1, 8x less memory traffic.
For p = 0.5, every bit of the raw xoshiro256++ output is an independent Bernoulli trial. Instead of generating a uniform double and comparing, we unpack individual bits directly into the output -- one RNG call produces 208 samples (52 quality bits x 4 SIMD lanes) instead of 4.
The low 12 bits of each uint64 are discarded, matching the uniform path's
>> 12 and avoiding the weakest bits of the xoshiro256++ output function.
Local AVX2 results (2^20 samples, best of 21 runs):
| Kernel | Output | Best ms | Throughput |
|---|---|---|---|
| naive (uniform + cmp) | double | 0.349 | 3.0 G/s |
| fast (int threshold) | double | 0.312 | 3.4 G/s |
| bit-unpack (p=0.5) | double | 0.176 | 6.0 G/s |
| naive (uniform + cmp) | uint8_t | 0.550 | 1.9 G/s |
| fast (int threshold) | uint8_t | 0.336 | 3.1 G/s |
| bit-unpack (p=0.5) | uint8_t | 0.050 | 21.2 G/s |
The uint8_t bit-unpack path achieves 21 billion samples/second -- roughly 7 samples per clock cycle on a 3 GHz machine. At this speed, the RNG core itself is no longer the bottleneck; the limit is the byte-scatter from bits to the output buffer.
Gamma(2, 1) -- fused wins decisively
| Kernel | c7a (Zen 4) | c7i (SPR) |
|---|---|---|
| scalar fused | 48 M/s | 47 M/s |
| x8 AVX2 fused | 125 M/s | 127 M/s |
| x8+x4 AVX2 full (vectorized MT) | 133 M/s | 126 M/s |
| x8 AVX2 decoupled | 60 M/s | 66 M/s |
Fused keeps the PRNG state registers hot and feeds the Marsaglia-Tsang acceptance loop directly from a 64-sample L1-resident buffer. Decoupled materialises intermediate normals and uniforms to the heap, paying full memory bandwidth for no gain.
Student's t(5) -- decoupled wins
| Kernel | c7a (Zen 4) | c7i (SPR) |
|---|---|---|
| scalar fused | 26 M/s | 29 M/s |
| x8 AVX2 fused | 35 M/s | 38 M/s |
| x8 AVX2 decoupled | 62 M/s | 61 M/s |
| x8+x4 AVX2 fast | 64 M/s | 62 M/s |
For a compound distribution the fused variant forces a scalar Gamma loop into the hot path. Decoupled lets each sub-distribution run its own best kernel independently; the memory-traffic cost is more than recovered.
Summary
- Fused integration is the right default when the distribution is self-contained and fits in L1 (Gamma: 2x over decoupled).
- Decoupled is better for compound distributions where one sub-computation would otherwise serialize an otherwise-vectorised pipeline (Student-t: 1.7x over fused).
- AVX-512 is not extended to Gamma or Student-t: their bottleneck is the scalar Marsaglia-Tsang acceptance loop, not the PRNG or vectorized math.
AMD Zen 4 implements AVX-512 by splitting every 512-bit op into two 256-bit micro-ops. This is transparent for integer-heavy kernels (uniform, Bernoulli) where both halves pipeline through the ALUs at full throughput. But for FP-math-heavy kernels with sqrt, div, and compress-store (vecpolar), the two halves serialize on the single divider unit, making 512-bit strictly worse than 256-bit.
The cpu_is_amd() check in the vecpolar kernel detects this at runtime and
routes AMD hardware to the AVX2 path automatically.
The handwritten SIMD comparison uses a vendored snapshot of Stephan Friedl's Xoshiro256PlusSIMD headers under third_party/stephanfr_xoshiro256plussimd, taken from upstream commit 29b30821f8f59b6106462c03d1d0225c93c4545d. The upstream license is preserved in third_party/stephanfr_xoshiro256plussimd/LICENSE.