Releases: LostBeard/SpawnDev.ILGPU

SpawnDev.ILGPU v4.15.0

@LostBeard LostBeard released this 21 Jun 04:39

SpawnDev.ILGPU 4.15.0

Headline: the Wasm backend now auto-vectorizes kernels to WebAssembly SIMD128 (v128).

A separate kernel_simd is generated alongside the scalar kernel and selected at runtime when the engine supports SIMD (System.Runtime.Intrinsics.Wasm.PackedSimd.IsSupported). The scalar path stays byte-identical and first-class — SIMD-less browsers, older devices, and the desktop CLR run it unchanged — and the emitter bails to the scalar path on anything outside its class, so the feature is purely additive with zero regression. A by-4 dispatch processes four thread-ids per kernel_simd call with a scalar tail.
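The by-4 dispatch with a scalar tail can be pictured on the host as follows. This is a minimal sketch, not the Wasm emitter's output: `By4DispatchSketch`, `Kernel`, `KernelSimd`, and `Dispatch` are hypothetical names, `System.Numerics.Vector4` stands in for an f32x4 v128, and the `simdSupported` flag stands in for the runtime SIMD check.

```csharp
using System;
using System.Numerics;

public static class By4DispatchSketch
{
    // Scalar kernel body: one thread-id per call.
    public static void Kernel(int i, float[] x, float[] y, float scale, float bias)
        => y[i] = x[i] * scale + bias;

    // Four-lane body: thread-ids i..i+3 per call.
    // Vector4 is only a stand-in for an f32x4 v128 value.
    public static void KernelSimd(int i, float[] x, float[] y, float scale, float bias)
    {
        var v = new Vector4(x[i], x[i + 1], x[i + 2], x[i + 3]);
        var r = v * new Vector4(scale) + new Vector4(bias); // separate multiply and add per lane
        r.CopyTo(y, i);
    }

    // Runs full groups of four through the four-lane body when allowed,
    // then finishes the remaining 0..3 ids with the scalar body.
    public static void Dispatch(int n, bool simdSupported,
        float[] x, float[] y, float scale, float bias)
    {
        int i = 0;
        if (simdSupported)
            for (; i + 4 <= n; i += 4) KernelSimd(i, x, y, scale, bias);
        for (; i < n; i++) Kernel(i, x, y, scale, bias); // scalar tail, or the whole range without SIMD
    }
}
```

With `n = 10`, the SIMD path covers ids 0..7 in two calls and the scalar tail covers ids 8 and 9; with `simdSupported` false, the same loop runs all ten ids through the scalar body.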

What vectorizes — the complete per-lane kernel class

  • Numeric tier: f32x4, i32x4, and (double-pumped, 2 lanes per v128) f64x2, i64x2.
  • Shapes: straight-line elementwise · counted loops (v128 accumulator) · divergent if-diamonds (mask + v128.bitselect) · gather · scatter · conditional/masked stores · general acyclic divergent control flow (chained and nested selects, via a per-phi control-dependence bitselect tree) · divergent loops (a data-dependent branch inside a counted loop).
  • Math (f32 and f64): + - * /, min/max, neg/abs, sqrt, floor/ceil, all compares; transcendentals (sin, cos, tan, asin, acos, atan, sinh, cosh, tanh, exp, exp2, log, log2, log10); rcp/rsqrt; pow/atan2/log_b.
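The "mask + v128.bitselect" treatment of a divergent if-diamond can be illustrated one lane at a time. The sketch below is an assumption-laden illustration, not the emitter's code: `BitselectSketch`, `Branchy`, and `Selected` are hypothetical names, and a single 32-bit integer mask stands in for one lane of a v128 compare result.

```csharp
using System;

public static class BitselectSketch
{
    // Divergent form: the two arms are separate control-flow paths.
    public static float Branchy(float x) => x > 0f ? x * 2f : -x;

    // If-converted form: evaluate both arms, then choose bits with a mask.
    // The mask is all ones where the condition holds and all zeros elsewhere,
    // which is the shape a vector lane compare produces.
    public static float Selected(float x)
    {
        int mask = x > 0f ? -1 : 0;
        int thenBits = BitConverter.SingleToInt32Bits(x * 2f);
        int elseBits = BitConverter.SingleToInt32Bits(-x);
        return BitConverter.Int32BitsToSingle((thenBits & mask) | (elseBits & ~mask));
    }
}
```

Both arms are always computed, so this form only suits arms without side effects; stores need the masked-store shape listed above instead.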

Cross-mode determinism is a hard invariant: kernel_simd is bit-exact to the scalar kernel (no fused FMA; no saturating-vs-trapping convert; the per-lane Math fallbacks call the identical import). 34 Wasm_Simd128_* gates assert kernel_simd-is-emitted and simd == scalar == reference bit-exact, and the whole suite also runs in a SIMD-off CI mode as a permanent cross-mode oracle.

Kept on the existing path (out of class by design): group/barrier/atomic/warp kernels (the multi-worker shared-memory model), f32 → i32 saturating convert (kept scalar for determinism), non-inlined helper calls, and narrow i8/i16 element types.

Also in 4.15.0

  • All in-register quant decoders are now single-exit / branchless: Float8E8M0, Float4E2M1, Float8E4M3, Float8E5M2 RawBitsToFloat (the subnormal-normalize while loops folded to a computed shift count; value-identical, verified bit-exact on all 6 backends). This fixes a WebGL/GLSL shader-size explosion: an early-return (multi-exit) decode inlined before a loop made the structurizer duplicate the loop continuation per exit arm, blowing past WebGL's compile limit (hit in MXFP4 dequant; the same class would hit MXFP8).
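To make the "while loop folded to a computed shift count" idea concrete, here is a minimal sketch for an E4M3 byte (1 sign, 4 exponent, 3 mantissa bits, bias 7). It is not the library's RawBitsToFloat: `E4M3DecodeSketch` and both method names are hypothetical, and the leading-zero count is one assumed way to compute the shift.

```csharp
using System;
using System.Numerics;

public static class E4M3DecodeSketch
{
    private const int NaNBits = 0x7FC00000;

    // Multi-exit form: early returns plus a normalize loop for subnormals.
    public static float DecodeMultiExit(int raw)
    {
        int sign = (raw & 0x80) << 24;
        int e = (raw >> 3) & 0xF, m = raw & 0x7;
        if (e == 0xF && m == 0x7) return BitConverter.Int32BitsToSingle(NaNBits);
        if (e == 0 && m == 0) return BitConverter.Int32BitsToSingle(sign);
        if (e == 0)
        {
            e = 1;                                   // subnormals use the minimum exponent
            while ((m & 0x8) == 0) { m <<= 1; e--; } // shift until the implicit one appears
            m &= 0x7;
        }
        return BitConverter.Int32BitsToSingle(sign | ((e + 120) << 23) | (m << 20));
    }

    // Single-exit form: the loop becomes one computed shift, and the special
    // cases become value selects feeding a single return.
    public static float DecodeSingleExit(int raw)
    {
        int sign = (raw & 0x80) << 24;
        int e = (raw >> 3) & 0xF, m = raw & 0x7;
        // m = 1 -> 3, m = 2..3 -> 2, m = 4..7 -> 1 (m = 0 is selected away below)
        int shift = BitOperations.LeadingZeroCount((uint)m) - 28;
        bool sub = e == 0;
        int expField = sub ? 121 - shift : e + 120;  // f32 biased exponent
        int mant = sub ? (m << shift) & 0x7 : m;
        int normal = sign | (expField << 23) | (mant << 20);
        int bits = (e == 0xF && m == 0x7) ? NaNBits
                 : (sub && m == 0) ? sign
                 : normal;
        return BitConverter.Int32BitsToSingle(bits);
    }
}
```

The two forms agree on every one of the 256 input bytes; the second simply has no loop and one exit, which is the shape the note above says keeps the structurizer from duplicating code per exit arm.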

Verification

Full PMT suite 3980 pass / 0 fail / 258 skip across all six backends (CUDA, OpenCL, CPU, WebGPU, WebGL, Wasm) against the 4.15.0 bits. Forks bump to 2.0.42.


🖖 The SpawnDev Crew — LostBeard (Captain), Riker, Data, Tuvok, Geordi, Seven.

SpawnDev.ILGPU v4.14.1

@LostBeard LostBeard released this 20 Jun 13:27

SpawnDev.ILGPU 4.14.1

The 4-bit data-type tier: a packed FP4 float and packed INT4 integers on all 6 backends (now including inside non-inlined helpers), the Float8E8M0 MX scale that completes the OCP Float8 family, and low-precision conversion correctness pinned to the numpy / ml_dtypes references.

SpawnDev.ILGPU extends ILGPU with three browser GPU backends. It transpiles .NET IL into GPU shader languages at runtime, so the same C# kernel runs on 6 backends from one codebase: WebGPU (WGSL), WebGL (GLSL ES 3.0), and WebAssembly in the browser, and CUDA (PTX), OpenCL, and CPU on the desktop. The browser backends make Blazor WebAssembly a first-class GPU-compute target.

This release rolls up the whole 4.14.x line on top of 4.13.0's low-precision float support. Full per-version detail with code samples is in CHANGELOG.md.


Headline: TRUE packed 4-bit types

Three new 4-bit types, each stored genuinely sub-byte - [PackedBits(4)], 2 nibbles per byte, 8 per 32-bit word - so an ArrayView<T> of N elements is ceil(N/2) device bytes, the real NVFP4 / INT4 memory density rather than a 1-byte placeholder:

| Type | Kind | Range / codes | Reference |
|---|---|---|---|
| Float4E2M1 | 4-bit float (1/2/1) | 16 codes {0, .5, 1, 1.5, 2, 3, 4, 6}, no Inf/NaN | OCP E2M1FN, the NVFP4 / MXFP4 element format - bit-exact to ml_dtypes.float4_e2m1fn |
| QInt4 | signed 4-bit int | -8 .. 7 | sign-extends to int |
| QUInt4 | unsigned 4-bit int | 0 .. 15 | zero-extends to int |

Each nibble is decoded to a wider register (f32 for FP4, i32 for the ints) at the load - the data stays packed in the buffer, so you get the storage/bandwidth win and full-precision compute. Per-backend support:

  • Load: all 6 backends - including inside a non-inlined ([MethodImpl(MethodImplOptions.NoInlining)]) helper, which 4.14.1 wired through every backend's helper-function code generator (a separate codegen path from the kernel generators that previously didn't understand packed sub-word storage).
  • In-kernel store: CUDA, OpenCL, WebGPU, Wasm (the nibble write is an atomic word read-modify-write). CPU and WebGL stores are fail-loud - WebGL has no atomics, and the CPU managed-reference indexer cannot address a sub-byte element, so both throw a typed exception rather than silently corrupting the enclosing word.
  • Radix-sort: keys + key/value pairs, ascending + descending, on the four store backends.
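The per-nibble decode at the load can be mirrored on the host for checking packed data. This is a minimal sketch under stated assumptions: `Packed4Sketch` and its methods are hypothetical names (not library API), element 2k is taken to sit in the low nibble of byte k, and the FP4 magnitudes are the eight values listed in the table above with the top nibble bit as the sign.

```csharp
using System;

public static class Packed4Sketch
{
    // Magnitudes for E2M1 codes 0..7; bit 3 of the nibble is the sign.
    private static readonly float[] E2M1Magnitudes = { 0f, 0.5f, 1f, 1.5f, 2f, 3f, 4f, 6f };

    // Extract the i-th 4-bit element from a packed byte array (low nibble first).
    public static int Nibble(byte[] packed, int i) => (packed[i >> 1] >> ((i & 1) * 4)) & 0xF;

    // FP4 nibble -> f32.
    public static float DecodeFp4(int nib)
    {
        float mag = E2M1Magnitudes[nib & 0x7];
        return (nib & 0x8) != 0 ? -mag : mag;
    }

    // Signed 4-bit -> int: move the nibble to the top, then arithmetic-shift back.
    public static int DecodeQInt4(int nib) => (nib << 28) >> 28;

    // Unsigned 4-bit -> int: plain mask.
    public static int DecodeQUInt4(int nib) => nib & 0xF;
}
```

For example, the bytes `{ 0xF1, 0x07 }` hold the nibbles 1, 15, 7 (plus one padding nibble), which decode as FP4 to 0.5, -6, 6 and as QInt4 to 1, -1, 7.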

Working with packed buffers is explicit (there is no transparent typed host pack/unpack - the win is that the buffer stays packed): pack two nibbles per byte and upload the raw bytes.

// Pack N FP4 codes (each Float4E2M1.RawValue, 0..15) into ceil(N/2) bytes, upload, decode in-kernel.
var packed = new byte[(n + 1) / 2];
for (int k = 0; k < packed.Length; k++)
    packed[k] = (byte)((codes[2*k] & 0xF) | ((2*k+1 < n ? codes[2*k+1] & 0xF : 0) << 4));

using var buf = accelerator.Allocate1D<Float4E2M1>(n);   // ceil(N/2) device bytes
((IContiguousArrayView)buf.View.BaseView).AsRawArrayView().CopyFromCPU(packed);
// dispatch a kernel that reads buf as ArrayView<Float4E2M1> - the E2M1 nibble decodes to f32 in-register
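
The in-kernel store noted earlier (a nibble write done as an atomic word read-modify-write) can be pictured on the host as a compare-exchange loop. This is a sketch of the general technique only, assuming 8 nibbles per 32-bit word with element 0 in the lowest bits; `NibbleStoreSketch` and `StoreNibble` are hypothetical names, not the backends' generated code.

```csharp
using System;
using System.Threading;

public static class NibbleStoreSketch
{
    // Writes a 4-bit value into element `index` of a word-packed buffer
    // without disturbing the other seven nibbles in the same 32-bit word.
    public static void StoreNibble(int[] words, int index, int value)
    {
        int w = index >> 3;             // which 32-bit word
        int shift = (index & 7) * 4;    // bit offset of the nibble inside it
        int mask = 0xF << shift;
        int oldWord, newWord;
        do
        {
            oldWord = Volatile.Read(ref words[w]);
            newWord = (oldWord & ~mask) | ((value & 0xF) << shift);
            // Retry if another writer changed the word in between.
        } while (Interlocked.CompareExchange(ref words[w], newWord, oldWord) != oldWord);
    }
}
```

A plain (non-atomic) read, modify, write of the word would lose updates when two threads store neighbouring nibbles at once, which is why the notes treat backends without atomics as unable to store packed 4-bit elements.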

To select a backend that can actually store packed 4-bit, the capability flags RequiresQInt4, RequiresQUInt4, and RequiresPacked4Store (the last rules out WebGL and CPU) join AcceleratorRequirements.


RawBitsToFloat - decode packed quant bits in-register

When your quantized weights live as raw integer words (a GGUF / MXFP4 block of u32s), you often want to decode one element inside a kernel without an ArrayView<Float4E2M1>. <Type>Extensions.RawBitsToFloat(int rawBits) does exactly that - a kernel-safe, all-6-backend decode of a raw nibble / byte / ushort to its f32 value:

// inside a kernel - decode the i-th FP4 nibble out of a packed u32 word, in-register
float v = Float4E2M1Extensions.RawBitsToFloat((int)((word >> (i * 4)) & 0xF));

Available for the sub-word float types whose bits you might hold raw: Float4E2M1, Float8E4M3, Float8E5M2, BFloat16, and now Float8E8M0 (below). Host-side, those types also expose FromRawBits(...) and a public RawValue for raw round-trips.


New: Float8E8M0 - the OCP MX scale format

Float8E8M0 (OCP float8_e8m0fnu) is the third member of the OCP Float8 family, alongside Float8E4M3 and Float8E5M2. It is 8 exponent bits, no sign, no mantissa, bias 127 - not an element format but the shared per-block scale for every OCP microscaling layout (MXFP4 / MXFP8 / MXINT8 / NVFP4): a pure power-of-two 2^(e-127). Byte e in 0..254 decodes to 2^(e-127); e == 0xFF is the only special and decodes to NaN (no zero, no Inf).

Because E8M0 and IEEE-754 binary32 share exponent bias 127, the decode is exactly the f32 whose biased-exponent field is e with a zero mantissa. It is intentionally minimal - you never add two scales in a kernel, you decode a scale and multiply by it - so it ships as a host struct plus a kernel-safe in-register decode that transpiles on all 6 backends:

// Decode an MX block's raw scale byte to f32 in-register, while the block stays packed:
float scale = Float8E8M0Extensions.RawBitsToFloat(scaleByte);   // 2^(e-127), e==0xFF -> NaN

// Host:
Float8E8M0 s = Float8E8M0.FromSingle(2.0f);   // RNE on the exponent; NaN/<=0/Inf -> 0xFF
byte raw = s.RawValue;                         // round-trips with Float8E8M0.FromRawBits(raw)
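
The shared-bias observation above translates into a very small host-side decode. The sketch below is illustrative only (`E8M0Sketch.Decode` is a hypothetical name, not the library's implementation). One detail is an inference from the stated rule that byte 0 decodes to 2^-127: that value lies below f32's normal range, so it is written here as the f32 subnormal with only the top mantissa bit set rather than as an exponent field of 0.

```csharp
using System;

public static class E8M0Sketch
{
    // Decode an E8M0 scale byte to f32: 2^(e - 127), with 0xFF as NaN.
    public static float Decode(byte e)
    {
        if (e == 0xFF) return float.NaN;
        // e in 1..254: same biased exponent as f32, zero mantissa.
        // e == 0: 2^-127 is an f32 subnormal, bit pattern 0x00400000.
        int bits = e == 0 ? 0x00400000 : e << 23;
        return BitConverter.Int32BitsToSingle(bits);
    }
}
```

`Decode(127)` is 1 and `Decode(128)` is 2, so multiplying a decoded element by the decoded scale is an exact power-of-two scaling whenever the result stays in f32 range.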

Low-precision conversion correctness

The four byte/2-byte low-precision floats (Half, BFloat16, Float8E4M3, Float8E5M2) reach feature-complete parity - a selectable saturating cast (FromSingle(x, saturate) / FromSingleSaturating) and the full radix-sort grid on all 6 backends - and two conversions were corrected against their authoritative references:

  • Float8E4M3 is now bit-exact to float8_e4m3fn (PyTorch / JAX / ml_dtypes). The cast and the IR-level convert use the fn convention - finite overflow and ±Inf map to NaN (previously saturated to ±448). The saturating clamp is opt-in via FromSingleSaturating (the NVIDIA Transformer Engine / OCP mode).
  • Half float→half is now IEEE round-to-nearest-even on every backend, bit-exact to numpy.float16 / PyTorch / CUDA / OpenCL. It previously truncated toward zero and flushed subnormals to zero - a silent divergence from numpy and from the desktop backends.

Every float→low-precision conversion is pinned to its numpy / ml_dtypes reference in CI, on each backend's actual on-device convert.


Backends at a glance

| Backend | Target | Shader language |
|---|---|---|
| WebGPU | Browser | WGSL |
| WebGL | Browser | GLSL ES 3.0 |
| Wasm | Browser | WebAssembly binary (multi-worker) |
| CUDA | Desktop | PTX |
| OpenCL | Desktop | OpenCL C |
| CPU | Desktop | .NET |

The full sub-word + low-precision set (Int8/UInt8/Int16/UInt16, Half, BFloat16, FP8 Float8E4M3/E5M2, Float8E8M0, FP4 Float4E2M1, packed QInt4/QUInt4), i64/f64 emulation where there's no native hardware support, automatic backend selection with capability gating, and zero-copy CopyFromJS on the browser are all available across the matrix. Latest full cross-backend sweep: 3934 pass / 0 fail / 258 skip (the skips are the genuinely-impossible cells - in-kernel scatter/atomics/packed-store on WebGL, packed-store on CPU).


Install

dotnet add package SpawnDev.ILGPU

PublishTrimmed and RunAOTCompilation must remain false - ILGPU relies on IL reflection at runtime.
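
In a project file, that requirement can be pinned explicitly. A minimal fragment, assuming a standard SDK-style .csproj (merge the two properties into an existing PropertyGroup as appropriate):

```xml
<PropertyGroup>
  <!-- Keep IL available for runtime reflection, as required above. -->
  <PublishTrimmed>false</PublishTrimmed>
  <RunAOTCompilation>false</RunAOTCompilation>
</PropertyGroup>
```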

Credits

Built on ILGPU by the ILGPU project. SpawnDev.ILGPU is part of the SpawnDev family by LostBeard.

SpawnDev.ILGPU v4.13.0

@LostBeard LostBeard released this 17 Jun 00:35

SpawnDev.ILGPU 4.13.0

Full low-precision floating-point support on all 6 backends, plus a portable-CUDA fix that brings bf16 and FP8 to pre-Ampere GPUs.

This is the first GitHub release since 4.6.0, and a great deal has changed across 4.7 through 4.13. The headline is low-precision types, but a lot more landed underneath. Full per-version detail with code samples is in CHANGELOG.md.


Headline: low-precision floating point, everywhere

Three kernel-native low-precision float types now join float/double/Half, each a real IR primitive type with full System.Numerics.INumber<T> support, and each working bit-identically on all 6 backends:

| Type | Layout | Notes |
|---|---|---|
| ILGPU.BFloat16 | 1 / 8 / 7 | "brain float" - the top 16 bits of an fp32, so it keeps fp32's full dynamic range (the right trade for ML weights/activations, where fp16's tiny range overflows/underflows). |
| ILGPU.Float8E4M3 | 1 / 4 / 3, bias 7 | FP8 forward / inference format (E4M3FN): no infinities, saturates to ±448, single NaN. One extra mantissa bit vs E5M2. |
| ILGPU.Float8E5M2 | 1 / 5 / 2, bias 15 | FP8 backward / gradient format: IEEE-754-style with infinities and NaNs (fp16-class range, which gradients need). |

Use them exactly like ILGPU.Half:

// One generic kernel runs for float, Half, BFloat16, Float8E4M3, Float8E5M2 - no per-type variants.
static void FusedRelu<T>(Index1D i,
    ArrayView1D<T, Stride1D.Dense> x, ArrayView1D<T, Stride1D.Dense> y, T scale, T bias)
    where T : unmanaged, INumber<T>
{
    T v = x[i] * scale + bias;
    y[i] = v > T.Zero ? v : T.Zero;
}

// Read low-precision input, accumulate in float, write low-precision output - one generic op,
// no fp32 temp buffers. PrecisionConvert gives the float<->T conversion that a plain (float)t /
// (T)f cast cannot express in a generic kernel.
static void MeanGeneric<T>(Index1D row,
    ArrayView1D<T, Stride1D.Dense> input, ArrayView1D<T, Stride1D.Dense> output, int C)
    where T : unmanaged, INumber<T>
{
    int b = row * C; float acc = 0f;
    for (int c = 0; c < C; c++) acc += PrecisionConvert.ConvertToSingle(input[b + c]);
    output[row] = PrecisionConvert.ConvertFromSingle<T>(acc / C);
}

What's new this release that makes the above work:

  • Generic INumber<T> mixed-precision kernels. A single where T : INumber<T> kernel transpiles and runs for float/Half/bf16/fp8 on every backend, instead of N hand-written per-type copies. This includes by-value low-precision scalar parameters (e.g. a kernel's scale/bias), which previously arrived as zero on several backends.
  • PrecisionConvert.ConvertToSingle<T>(T) / ConvertFromSingle<T>(float). Inside a generic kernel there is no C# way to write (float)t or (T)f (no cast constraint exists), so callers reach for float.CreateChecked / T.CreateChecked - which touch System.Type and the kernel transpiler rejects on every GPU backend. These two methods lower to the same native conversion the concrete cast emits, so generic precision-aware ops just work.

All of this uses the f32-register model: low-precision values compute as f32 in-register and are converted to their narrow grid only at the load/store boundary, so accumulation stays full-precision (matching how real low-precision tensor hardware accumulates). The conversions are byte-identical across backends, emitted as callable helper functions on OpenCL/WGSL/GLSL, inline WebAssembly bytecode on Wasm, and inline PTX on CUDA.


bf16 and FP8 now run on every CUDA architecture (including pre-Ampere)

If you have an older NVIDIA card, this one matters: bf16 previously failed to compile on pre-Ampere GPUs (Pascal GTX 1080 = sm_61, Volta sm_70, Turing RTX 2060 = sm_75). The PTX path emitted the native cvt.f32.bf16 / cvt.rn.bf16.f32 instructions, which only exist on sm_80+ (Ampere/Ada/Hopper), so ptxas rejected them on anything older.

bf16 is now converted with portable bit-manipulation (basic integer ops available on every CUDA architecture), byte-identical to the result on every other backend. FP8 uses the same portable approach (its native cvt is sm_89/Hopper-only). The lesson, generalized: native-cvt shortcuts silently gate out older hardware, so the default is portable bit-manip unless support is explicitly capability-gated.
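
For readers who want to see what "portable bit-manipulation" can look like, here is a commonly used integer-only float to bf16 conversion with round-to-nearest-even, written as host C#. It is a sketch of the general technique, not the PTX the backend emits; `Bf16Sketch` and its method names are hypothetical.

```csharp
using System;

public static class Bf16Sketch
{
    // float -> bf16 bits, round-to-nearest-even, using only integer ops.
    public static ushort FloatToBf16(float f)
    {
        uint bits = unchecked((uint)BitConverter.SingleToInt32Bits(f));
        // NaN: keep the top 16 bits and force a mantissa bit so it stays NaN.
        if ((bits & 0x7FFFFFFFu) > 0x7F800000u)
            return (ushort)((bits >> 16) | 0x0040u);
        // Add 0x7FFF plus the lowest kept bit: ties go to the even result.
        uint roundingBias = 0x7FFFu + ((bits >> 16) & 1u);
        return (ushort)((bits + roundingBias) >> 16);
    }

    // bf16 bits -> float: bf16 is the top half of an fp32, so just shift up.
    public static float Bf16ToFloat(ushort h)
        => BitConverter.Int32BitsToSingle(unchecked((int)((uint)h << 16)));
}
```

Every operation here is a shift, mask, add, or compare on 32-bit integers, which is the property the paragraph above relies on to avoid architecture-gated convert instructions.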

Verified: the full bf16 test surface (radix-sort keys, struct fields, range and ±Inf/NaN/RNE specials, arithmetic) passes on all 6 backends including CUDA; FP8 round-trips bit-exact vs the concrete cast on every backend.


Also since the last GitHub release (4.6.0)

A condensed tour of the bigger items across 4.7 -> 4.13 (the CHANGELOG has the full per-version detail):

  • Complete sub-word data type support (4.9.0). Int8, UInt8, Int16, UInt16, and Float16 (ILGPU.Half) buffer access on all 6 backends, stored packed with correct stride and sign/zero extension per backend (no more corruption from type-promotion mismatches), plus the Half.Abs/Min/Max/Clamp intrinsics. This is the foundation the bf16/FP8 work in 4.13.0 builds on. Also added CopyFromJS - zero-copy writes of a JS TypedArray/ArrayBuffer straight to GPU memory with no .NET heap allocation, on every browser backend.
  • Capability gating + typed codegen errors (4.9.2). AcceleratorRequirements (RequiresAtomics, RequiresFloat64Native, RequiresSharedMemory, ...) lets the selection path filter out incapable backends up front, and kernels that use a feature a backend cannot implement now throw a typed UnsupportedKernelFeatureException at compile time instead of silently producing wrong output. Plus IEEE-754 NaN/Inf correctness across the four emulated backends and helper-function emission to stay under browser shader-validator size limits.
  • Generic-math Half, then full mixed precision (4.9.12 -> 4.13.0). INumber<Half> kernels first, then the generic INumber<T> path generalized to bf16 and fp8, and PrecisionConvert for the generic in-kernel float<->T conversion (see the headline above).
  • Offline code generation + precompiled shaders (4.10.0). Generate a kernel's WGSL/GLSL/Wasm with no device on any host OS (ShaderCompiler.Generate + CapabilityProfile), precompile at build time via an MSBuild task, and load the artifact at runtime to skip IL-to-shader transpilation.
  • Wasm backend maturity (4.6.0 -> 4.13.0). A fiber-based barrier dispatch model (full ILGPU Algorithms - RadixSort/Scan/Reduce/Histogram at 100K-4M+ elements - run on Wasm), worker-function caching (3-4x dispatch speedup), then a multi-worker correctness overhaul that killed the large-sort / barrier race family (verified atomic stores, group-barrier release fences, monotonic kernel ids, yield-to-JS escapes under oversubscription) and a process-static shared worker pool + linear memory that keep a long session bounded. A first-class non-SIMD path stays supported forever, with an additive SIMD128 v128 fast path for ALU-dense elementwise kernels on SIMD-capable browsers. (Barriers use pure-spin synchronization to work around a V8 Atomics.wait visibility bug.)
  • Sync/async contract (4.12.0). Operations that wait or read a result back are async-only on the browser; the sync form now throws (Synchronize() -> await SynchronizeAsync(); sync device readback / device-to-device copy -> the ...Async forms) instead of silently returning stale data, while fire-and-forget work (dispatch, alloc, upload, Flush-submit) stays synchronous. AcceleratorRequirements.RequiresScatterStores (4.12.1) gates WebGL out of in-kernel scatter kernels at selection time.
  • CPU backend cooperative multi-multiprocessor execution (4.13.0). Thread-groups run one-per-core with a cheap cooperative barrier instead of oversubscribing one simulated multiprocessor, eliminating multi-second barrier thrash on heavy reduction/decode kernels.
  • And more: GpuTestVerify for GPU-side test verification without CPU readback (4.7.1), the SpawnDev.ILGPU.QR GPU QR encoder/decoder (4.7.1), WebGL Half RadixSort + a cross-backend sub-word sign-extension fix (4.9.13), CopyFromStreamAsync / MemoryPressure.AllocateWithReclaim, and assorted WebGPU/WebGL/OpenCL/PTX codegen correctness fixes.

Backends at a glance

| Backend | Target | Shader language |
|---|---|---|
| WebGPU | Browser | WGSL |
| WebGL | Browser | GLSL ES 3.0 |
| Wasm | Browser | WebAssembly binary (multi-worker) |
| CUDA | Desktop | PTX |
| OpenCL | Desktop | OpenCL C |
| CPU | Desktop | .NET |

Sub-word and low-precision types (Int8/UInt8/Int16/UInt16/Half/BFloat16/Float8E4M3/Float8E5M2), i64/f64 emulation where there's no native hardware support, automatic backend selection with capability gating, and zero-copy CopyFromJS on the browser are all available across the matrix.


Install

dotnet add package SpawnDev.ILGPU

PublishTrimmed and RunAOTCompilation must remain false - ILGPU relies on IL reflection at runtime.

SpawnDev.ILGPU v4.6.0

@LostBeard LostBeard released this 23 Mar 01:49

SpawnDev.ILGPU v4.6.0

6 backends. 1,511 tests. Zero failures.

CUDA, OpenCL, CPU, WebGPU, WebGL, and Wasm — all passing. GPU compute in the browser is no longer experimental.

Highlights

Full Multi-Worker Wasm Barrier Dispatch

The Wasm backend now supports full navigator.hardwareConcurrency workers with group barriers and shared memory. A pure spin barrier using i32.atomic.load loops replaces the previous wait32/notify approach, working around a V8 atomics visibility gap that caused data races with 3+ workers.
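
The idea of a barrier that only polls an atomic value, with no wait/notify, can be sketched with .NET threads. This is an analogy under stated assumptions rather than the generated Wasm: `SpinBarrierSketch` is a hypothetical class, a generation counter is one assumed way to make the barrier reusable, and `SpinWait` is used so the host example behaves when threads outnumber cores.

```csharp
using System;
using System.Threading;

public sealed class SpinBarrierSketch
{
    private readonly int _count;
    private int _arrived;
    private int _generation;

    public SpinBarrierSketch(int count) => _count = count;

    // Blocks until `count` threads have called Wait for the current phase.
    public void Wait()
    {
        int gen = Volatile.Read(ref _generation);
        if (Interlocked.Increment(ref _arrived) == _count)
        {
            // Last arrival: reset the counter first, then release everyone
            // by advancing the generation.
            Volatile.Write(ref _arrived, 0);
            Interlocked.Increment(ref _generation);
        }
        else
        {
            // Everyone else polls the generation with atomic loads.
            var spin = new SpinWait();
            while (Volatile.Read(ref _generation) == gen) spin.SpinOnce();
        }
    }
}
```

Resetting the arrival counter before publishing the new generation is what lets a fast thread re-enter the next phase safely.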

RadixSort Verified at Scale

RadixSort passes across all data types and sizes up to 4M elements on every backend — including Wasm in the browser. Key fixes:

  • Histogram counter buffer sizing — fixed undersized counters that caused real out-of-bounds writes during grid-stride iteration
  • Grid-stride tail byte padding — extended linear-memory slack allocation to prevent OOB traps on packed buffers
  • Per-worker scratch isolation — eliminated intermittent sort corruption in non-barrier kernels

20+ Wasm Codegen Fixes

Deep correctness pass across the Wasm code generator:

  • Fiber yield-per-phase with dynamic block splitting
  • Atomic loads/stores for all shared memory access in barrier kernels (including float via i32/i64 reinterpret)
  • Struct load copy semantics to prevent aliasing
  • Unsigned comparison in MinUInt32/MinUInt64 reductions
  • Correct atomic RMW opcode table for interleaved sub-word variants
  • Local alloca addressing, shared memory deduplication, and IR address space aliasing guards

WebGPU Backend Fixes

  • WGSL loop break + bool PHI: correct merge value generation when breaking from loops with boolean phi nodes
  • WGSL continuation after if-else with break: prevent unreachable code generation

Test Results

| Backend | Pass | Fail | Skip |
|---|---|---|---|
| CUDA | all | 0 | |
| OpenCL | all | 0 | |
| CPU | all | 0 | |
| WebGPU | 229 | 0 | 12 |
| WebGL | 139 | 0 | 115 |
| Wasm | 249 | 0 | 3 |
| Total | 1,511 | 0 | 162 |

WebGL skips are architectural (GLSL ES 3.0 lacks shared memory/barriers/atomics). Wasm skips are subgroup-dependent features not available in browser WebAssembly.

What This Means

This release proves that GPU-class parallel algorithms — radix sort, scan, reduce, atomics, shared memory, group barriers — run correctly in the browser across WebGPU, WebGL, and WebAssembly, alongside native CUDA, OpenCL, and CPU backends. Write your kernel once, run it everywhere.

SpawnDev.ILGPU v4.0.0

@LostBeard LostBeard released this 15 Mar 22:23

SpawnDev.ILGPU v4.0.0

Run ILGPU C# kernels on WebGPU, WebGL, Wasm, CUDA, OpenCL, and CPU — from a single codebase.

This is a major release with deep improvements to the WebGPU and Wasm backends, bringing ILGPU's algorithm library (RadixSort, Scan, Reduce) to the browser for the first time.

Highlights

WebGPU RadixSort — Full Algorithm Support

All RadixSort variants now pass on WebGPU, including large-scale sorts (4M+ elements), pairs, descending, and multiple data types. Fixed shared memory sizing, scan barrier synchronization, range checks for auto-grouped kernels, and 256-byte alignment padding for minStorageBufferOffsetAlignment.

Wasm Backend — Barrier Kernel Infrastructure

The Wasm backend received 7 codegen and dispatch fixes enabling correct barrier-synchronized kernels (Scan, Reduce, and single-group RadixSort):

  • Struct-with-view serialization — Fixed CLR-to-IR layout mismatch for kernel structs containing ArrayViews (e.g., InitializerImplementation<T>). Manual IR-layout-aware serialization replaces Unsafe.Write.
  • View field mapping — Fixed GetField handler returning 0 for ArrayView1D's Extent (Length) field, which caused all view.Length checks to fail silently.
  • Local alloca addressing — Fixed local memory allocations defaulting to address 0, which caused the ExclusiveScan helper to corrupt the data buffer between sort passes.
  • Per-thread scratch memory — Each parallel Web Worker now gets its own scratch region, preventing cross-worker data races during struct construction.
  • Post-helper barriers — Added synchronization barriers after each ExclusiveScan helper call to prevent fast workers from starting the next scan while slow workers are still completing the previous one.
  • SpecializedValue unwrapping — Fixed dispatch to correctly extract scalar values from SpecializedValue<T> wrapper structs.
  • GetViewLength tracing — Added TraceToParameter() to resolve view sources through GetField/NewView chains.

WebGPU Backend Refactor

Major internal restructuring for maintainability and performance:

  • Extracted SharedMemoryResolver and UniformityAnalyzer into standalone subsystems
  • Per-function emulation library trimming via BFS dependency graph
  • Dead variable elimination post-pass for cleaner generated WGSL
  • i64 constant hoisting to module-scope const declarations
  • Pre-compiled regex patterns replacing runtime Regex.IsMatch calls
  • WGSL pre-validation (ValidateWGSL()) catches shader errors before GPU submission
  • KernelSpecialization for all algorithm kernel loaders (RadixSort, Histogram, Scan, etc.)

Device Loss Detection

  • WebGPU: Monitors device.lost promise. IsDeviceLost property and DeviceLost event.
  • WebGL: Monitors webglcontextlost event via glWorker.js. IsContextLost property and ContextLost event.
  • Intentional disposal (Dispose()) is filtered out — only unexpected losses fire the events.

Test Infrastructure

  • PlaywrightMultiTest: Unified NUnit + Playwright runner executes all tests (desktop + browser) in a single dotnet test invocation
  • 1316 tests passing across all 6 backends (WebGPU, WebGL, Wasm, CUDA, OpenCL, CPU), 0 failures

Browser Backend Capabilities

| Feature | WebGPU | WebGL | Wasm |
|---|---|---|---|
| Shared Memory | | | |
| Group.Barrier() | | | |
| Atomics | | | |
| ILGPU Algorithms | ✅ RadixSort, Scan, Reduce, Histogram | | ✅ Scan, Reduce (single-group) |
| 64-bit (f64/i64) | ✅ Emulated | ✅ Emulated | ✅ Native |

Known Limitations

  • Wasm multi-group barrier dispatch: Barrier kernels are fully correct for single-group workloads (up to 64 elements for groupSize=64). Multi-group workloads have a cross-group SharedArrayBuffer memory visibility limitation in current browsers. A cooperative scheduling fix is planned for a future release. Desktop backends and WebGPU have no such limitation.

Breaking Changes

None. Existing ILGPU kernels and API usage are fully compatible.

Installation

dotnet add package SpawnDev.ILGPU --version 4.0.0

Links

  • Live Demo — Fractal Explorer, 3D Raymarching, GPU Boids, Benchmarks, Unit Tests
  • Documentation — Getting Started, Backends, Kernels, Memory & Buffers, Canvas Rendering
  • GitHub

SpawnDev.ILGPU v3.5.0

@LostBeard LostBeard released this 06 Mar 16:14

SpawnDev.ILGPU 3.5.0

Half (f16) Support

  • WebGPU f16 kernels: Float16 maps to native f16 in WGSL. Buffer alignment, constant emission, and Half ↔ float conversion intrinsics all wired up. Capability-gated on device feature support.
  • XMath.Min/Max/Clamp for Half — Added to XMath via float promotion.
  • Group Scan/Reduce for Half: ExclusiveScan, InclusiveScan, AllReduce, and GroupReduce now support Half on WebGPU and CUDA.
  • CUDA PTX Half warp shuffles: WarpShuffle, WarpShuffleDown, WarpShuffleUp, WarpShuffleXor (and SubWarp variants) for Half via b32 widening. Unlocks Half scan/reduce on CUDA.
  • Lock-free AllReduce — Rewrote AllReduce in both IL and PTX backends to use per-warp shared-memory slots instead of atomic operations. Removes the Half atomics dependency entirely and is correct for all types.
  • Half.One constant fix — Was 0x0001 (denormal ≈5.96e-8); corrected to 0x3C00 (IEEE-754 1.0).
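
The two bit patterns in the Half.One fix can be checked with the .NET runtime's own `System.Half`, used here only as an independent IEEE-754 binary16 reference (it is a different type from `ILGPU.Half`). `HalfOneSketch` is a hypothetical helper name.

```csharp
using System;

public static class HalfOneSketch
{
    // Interpret a raw 16-bit pattern as IEEE-754 binary16 and widen it to float.
    public static float FromHalfBits(short bits) => (float)BitConverter.Int16BitsToHalf(bits);
}
```

`FromHalfBits(0x3C00)` is exactly 1, while `FromHalfBits(0x0001)` is 2^-24 (about 5.96e-8), the smallest positive subnormal, so any kernel that multiplied by the old constant effectively scaled its result to near zero.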

WebGPU RadixSort with double / long Keys

  • RadixSortPairs<double, …> and RadixSortPairs<long, …> now work on WebGPU. Multiple root causes fixed end-to-end:
    • FloatAsInt/IntAsFloat casts for emulated f64 now correctly reconstruct the IEEE-754 64-bit pattern.
    • Structs containing emulated 64-bit fields are flattened to array<u32> in WGSL ("packed structs") to match CPU memory layout.
    • True element count is passed to the GPU via a dedicated _scalar_params slot, replacing the incorrect arrayLength() calculation for packed views.
    • Sub-view element offset is now computed in u32 units (padding / 4) instead of logical CPU elements, fixing sort correctness for array sizes where the inner temp allocation doesn't start at a 256-byte boundary.

Canvas Rendering (ICanvasRenderer)

  • ICanvasRenderer API — New interface for presenting ILGPU pixel buffers (MemoryBuffer2D<uint/int>, packed RGBA) directly to an HTML <canvas> element. Obtained via CanvasRendererFactory.Create(accelerator).
  • WebGPU — Zero-copy path: a cached WGSL fullscreen-triangle pipeline reads the pixel buffer directly from a read-only-storage binding. No CPU readback. Blit to the visible canvas via drawImage. Pipeline and bind-group are built once; uniforms only re-uploaded on resolution change.
  • WebGL — Delegates to an offscreen FBO blit in the GL Web Worker. Result is transferred as ImageBitmap back to the main thread, preventing Blazor's render cycle from clearing the canvas between frames.
  • CPU / Wasm — Fallback via putImageData. Browser-backed buffers use CopyToHostUint8ArrayAsync for a JS-side copy; pure CPU buffers fall back to synchronous CopyToCPU.

WebGPU Warp Reduce without Subgroups

  • GenerateWarpReduce now emits a full shared-memory butterfly reduction when the subgroups feature is unavailable, replacing the previous no-op passthrough. Correct results on hardware/drivers that don't expose subgroup extensions.
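
A butterfly reduction over a shared array can be emulated sequentially to show the access pattern. This is a sketch of the general technique, not the WGSL that GenerateWarpReduce emits: `ButterflySketch` is a hypothetical name, the lane count is assumed to be a power of two, and the second array stands in for the barrier that separates reads from writes in each step.

```csharp
using System;

public static class ButterflySketch
{
    // After log2(n) steps every lane holds the sum of all lanes.
    public static float[] AllReduceSum(float[] lanes)
    {
        int n = lanes.Length;
        if (n == 0 || (n & (n - 1)) != 0)
            throw new ArgumentException("lane count must be a power of two");

        var cur = (float[])lanes.Clone();
        var next = new float[n];
        for (int offset = n >> 1; offset > 0; offset >>= 1)
        {
            // Each lane combines its value with its partner at index i XOR offset.
            for (int i = 0; i < n; i++) next[i] = cur[i] + cur[i ^ offset];
            (cur, next) = (next, cur); // plays the role of the barrier between steps
        }
        return cur;
    }
}
```

For the lanes 1 through 8, every entry of the result is 36, which is the "all lanes see the reduced value" behaviour a warp reduce needs when subgroup operations are unavailable.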

Algorithm Type Coverage

Added scan and reduce test/support variants for double, long, and uint:

| Operation | New Types |
|---|---|
| ExclusiveScan | double, uint |
| InclusiveScan | long, double, uint |
| AllReduce | double, long, uint |
| GroupReduce | float, long, double, uint, Half |

SpawnDev.ILGPU v3.3.0

@LostBeard LostBeard released this 22 Feb 06:41

SpawnDev.ILGPU v3.3.0 Release Notes

Desktop & Browser

  • WPF Demo Application — new desktop demo running the same shared kernels (Fractal Explorer, 3D Raymarching, GPU Boids) on CUDA, OpenCL, and CPU with live backend switching
  • Shared Kernel Library — extracted SpawnDev.ILGPU.Demo.Shared so browser and desktop demos share identical kernel code
  • Console Test Runner — added SpawnDev.ILGPU.ConsoleDemo for running the full unit test suite on desktop backends with process isolation for crash resilience
  • OpenCL 3.0 Compatibility — relaxed the GenericAddressSpace requirement, enabling NVIDIA GPUs with OpenCL 3.0 drivers that were previously blocked
  • Multi-platform support — updated SupportedPlatform to include Windows, Linux, and macOS

WebGL2 Backend — GPU-Resident Buffers

The WebGL2 backend has been refactored to eliminate unnecessary CPU↔GPU data transfers:

  • GPU-resident buffers — buffers persist as textures in the GL worker; kernel dispatch sends buffer references, not data
  • On-demand readback: CopyToHostAsync() is the only GPU→CPU transfer path
  • New worker protocol: allocBuffer, uploadBuffer, readbackBuffer, freeBuffer messages manage buffer lifecycle
  • Proper buffer disposal — buffers are freed in the worker when disposed on the C# side

Wasm Backend Improvements

  • Expanded API coverage including shared memory, barriers, dynamic shared memory, atomics, and broadcasting
  • Single-worker fallback mode when SharedArrayBuffer is unavailable

Transpiler Fixes

  • Break-PHI bug — fixed assignments before break in loops being dropped in WGSL and GLSL transpilers
  • CopySign — corrected argument swap in the CopySign intrinsic
  • 64-bit reduce — fixed signed/unsigned mismatch in MinUInt64 and emu_f64 buffer I/O for AddDouble/MaxDouble
  • WebGL raymarching — fixed GLSL rendering issues
  • BVH ray traversal — corrected WebGPU and WebGL backend issues for complex scene traversal

Upstream ILGPU Fixes

Six bugs from the original ILGPU repo have been fixed in our fork:

| Issue | Description | Severity |
|---|---|---|
| #1361 | MathF.CopySign argument order swapped: silent wrong results on all GPU backends | High |
| #1309 | uint to float cast routed through double: crashes on devices without fp64 | Medium |
| #1479 | Infinite compilation with large local arrays (new int[1_000_000]): 10+ min, 10+ GB RAM | High |
| #1538 | Internal Compiler Error with nested struct properties: wrong field slicing after type unification | Medium |
| #1539 | OpenCL produces wrong results for complex kernels: stale phi variables persisted across blocks | High |
| #1540 | H100/H200 not working: added SM_90, SM_100, SM_101, SM_120 architecture support | High |

See upstream-issues.md for detailed root cause analysis and fix descriptions.

Documentation

  • Corrected synchronization semantics: Synchronize() = flush (non-blocking), SynchronizeAsync() = flush + wait, CopyToHostAsync() = only GPU→CPU path
  • Updated test count to 640 tests across 8 suites
  • Added WebGL GPU-resident buffer architecture documentation
  • Reduced default logging verbosity across all backends
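
The flush versus flush-and-wait distinction can be modeled with a small queue. This is a toy model in TypeScript, assuming a device that runs submitted work asynchronously; the ModelQueue class and its method names only mirror the names in the notes and are not the library's API.

```typescript
// Hypothetical model of a command queue.
class ModelQueue {
  private pending: Array<() => void> = [];
  private inFlight: Promise<void> = Promise.resolve();

  enqueue(work: () => void): void {
    this.pending.push(work);
  }

  // Flush: hand queued work to the "device" and return immediately.
  // Nothing has necessarily run when this returns.
  synchronize(): void {
    const batch = this.pending;
    this.pending = [];
    this.inFlight = this.inFlight.then(() => {
      for (const work of batch) work();
    });
  }

  // Flush + wait: resolves once everything submitted so far has run.
  async synchronizeAsync(): Promise<void> {
    this.synchronize();
    await this.inFlight;
  }
}
```

In this model, code that reads results right after synchronize() would see stale data, whereas awaiting synchronizeAsync() guarantees the queued work has completed.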

Demo Improvements

  • Game of Life — fixed mouse interaction and added NavMenu icon
  • Fractal Explorer — moved to shared kernel library, improved WebGL2 rendering pipeline
  • Reduced console log noise for cleaner browser dev tools experience

Full Changelog: v3.2.0...v3.3.0

SpawnDev.ILGPU v3.2.0

@LostBeard LostBeard released this 21 Feb 14:14

Cross-platform GPU compute from a single codebase — browser and desktop.

What's New

🖥️ Desktop Support Verified

  • SpawnDev.ILGPU now officially supports desktop/server environments (Console, WPF, ASP.NET) alongside Blazor WebAssembly
  • Same NuGet package provides browser backends (WebGPU, WebGL, Wasm) and native backends (Cuda, OpenCL, CPU)
  • SynchronizeAsync() and CopyToHostAsync() work everywhere — async in the browser, graceful sync fallback on desktop
  • New SpawnDev.ILGPU.ConsoleDemo project included as a working reference

🎮 New Demos

  • Game of Life — GPU-accelerated cellular automaton
  • Boids 3D — Flocking simulation on all backends
  • Compute 3D — 3D compute shader demo

🐛 Bug Fixes

  • Fixed 3 transpiler bugs found during Game of Life development
  • Fixed handling of Debug IL in WebGPU and WebGL transpilers
  • Updated Wasm backend intrinsics

📚 Comprehensive Documentation

  • New Docs/ folder with 8 markdown guides: Getting Started, Backends, Kernels, Memory & Buffers, Advanced Patterns (GPU intrinsics, device sharing, rendering), Limitations, and API Reference
  • Covers both Blazor WASM and desktop usage
  • Incorporates foundational ILGPU concepts adapted for the browser

Full Changelog

See README.md and Docs/ for complete documentation.

SpawnDev.ILGPU v3.0.0

@LostBeard LostBeard released this 16 Feb 17:39

What's New

🚀 Next-Generation GPU Computing in Blazor Wasm — v3.0.0 brings major performance improvements, streamlined architecture, and enhanced compatibility. Run C# ILGPU kernels on WebGPU, WebGL, and native WebAssembly with automatic backend selection.

Key Features

  • Three Powerful Backends — WebGPU (modern GPU compute via WGSL), WebGL (universal GPU access via GLSL ES 3.0), and Wasm (native WebAssembly on Web Workers)
  • CPU Backend — Standard ILGPU CPU accelerator included for debugging and performance comparison
  • Universal GPU Access — WebGPU for cutting-edge browsers, WebGL for virtually every device
  • Intelligent Auto-Selection — CreatePreferredAcceleratorAsync() automatically picks the best available backend (WebGPU → WebGL → Wasm)
  • 64-bit Computing — Full double and long support via optimized emulation on both GPU backends
  • Multi-Worker Dispatch — Wasm backend distributes work across all available CPU cores
  • Zero-Copy Shared Memory — SharedArrayBuffer support for efficient data sharing
  • Atomic Operations — Workgroup synchronization and atomic operations on WebGPU and Wasm backends
  • Production Ready — Comprehensive test suite, stable APIs, and real-world optimization
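
The preference order used for auto-selection (WebGPU → WebGL → Wasm) amounts to a first-available scan over an ordered list. The sketch below models only that ordering; pickBackend and the availability table are hypothetical stand-ins for real feature detection, and the code says nothing about how CreatePreferredAcceleratorAsync() is implemented.

```typescript
// Backend names and their order come from the release notes.
type Backend = "WebGPU" | "WebGL" | "Wasm";

const PREFERENCE: readonly Backend[] = ["WebGPU", "WebGL", "Wasm"];

// `available` is a hypothetical probe result, e.g. filled in by checking
// for a WebGPU adapter or a WebGL2 context in a real page.
function pickBackend(available: Record<Backend, boolean>): Backend | undefined {
  for (const backend of PREFERENCE) {
    if (available[backend]) return backend;
  }
  return undefined; // nothing usable was detected
}
```

A browser with WebGL2 but no WebGPU would therefore land on "WebGL", and one with neither GPU API would fall through to "Wasm".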

Built For

  • Blazor WebAssembly — Run compute-intensive C# kernels in the browser
  • 🎮 Game Development — GPU-accelerated physics, graphics, and AI
  • 📊 Data Processing — High-performance number crunching without native compilation
  • 🔬 Scientific Computing — GPGPU capabilities in pure managed code

Full Changelog: v2.1.0...v3.0.0

SpawnDev.ILGPU v2.1.0

@LostBeard LostBeard released this 13 Feb 20:41

What's New

🖼️ New WebGL Backend — GPU-accelerated compute on virtually every modern browser and device. C# kernels are transpiled to GLSL ES 3.0 vertex shaders and executed via Transform Feedback, providing broad GPU access even where WebGPU isn't supported.

Highlights

  • Five backends — WebGPU, WebGL, Wasm, Workers, and CPU
  • Two GPU backends — WebGPU for cutting-edge browsers, WebGL for universal coverage
  • Auto-selection — CreatePreferredAcceleratorAsync() picks the best available backend (WebGPU → WebGL → Wasm → Workers → CPU)
  • 64-bit emulation on both GPU backends (double/long support via software emulation)
  • Benchmarks page — New interactive benchmark suite comparing throughput across all backends
  • Workers performance — Cached compiled functions and script bodies to reduce per-dispatch overhead
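
For readers unfamiliar with 64-bit emulation on 32-bit-only shader targets, one common approach for integers is to carry a value as two 32-bit halves and propagate the carry by hand. The sketch below shows unsigned addition that way. The notes say only that double/long support uses software emulation, so the U64 layout and addU64 are assumptions for illustration and not necessarily what the transpilers emit.

```typescript
// A 64-bit unsigned value split into two unsigned 32-bit halves
// (hypothetical representation). Both fields are assumed to be in [0, 2^32).
interface U64 {
  lo: number;
  hi: number;
}

function addU64(a: U64, b: U64): U64 {
  const lo = (a.lo + b.lo) >>> 0;   // low half, wrapped to 32 bits
  const carry = lo < a.lo ? 1 : 0;  // a wrapped sum is smaller than either operand
  const hi = (a.hi + b.hi + carry) >>> 0;
  return { lo, hi };
}
```

Adding 1 to { lo: 0xFFFFFFFF, hi: 0 } yields { lo: 0, hi: 1 }, showing the carry moving into the high half.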

Full Changelog: v2.0.0...v2.1.0