Releases: LostBeard/SpawnDev.ILGPU
SpawnDev.ILGPU v4.15.0
Headline: the Wasm backend now auto-vectorizes kernels to WebAssembly SIMD128 (v128).
A separate kernel_simd is generated alongside the scalar kernel and selected at runtime when the engine supports SIMD (System.Runtime.Intrinsics.Wasm.PackedSimd.IsSupported). The scalar path stays byte-identical and first-class — SIMD-less browsers, older devices, and the desktop CLR run it unchanged — and the emitter bails to the scalar path on anything outside its class, so the feature is purely additive with zero regression. A by-4 dispatch processes four thread-ids per kernel_simd call with a scalar tail.
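The by-4 dispatch with a scalar tail can be pictured with a small standalone C# program. This is only a sketch of the dispatch shape: `Kernel`, `KernelSimd` and `Dispatch` are hypothetical stand-ins written for illustration, not the emitter's generated code, and the "SIMD" body is a plain four-lane loop.

```csharp
using System;

// Hypothetical scalar kernel: one thread id per call.
static void Kernel(int tid, float[] x, float[] y) => y[tid] = x[tid] * 2f + 1f;

// Hypothetical SIMD kernel: one call covers four consecutive thread ids.
// A real v128 body would do the four lanes in one instruction.
static void KernelSimd(int baseTid, float[] x, float[] y)
{
    for (int lane = 0; lane < 4; lane++)
        y[baseTid + lane] = x[baseTid + lane] * 2f + 1f;
}

static void Dispatch(int n, bool simdSupported, float[] x, float[] y)
{
    int tid = 0;
    if (simdSupported)
        for (; tid + 4 <= n; tid += 4) KernelSimd(tid, x, y); // by-4 body
    for (; tid < n; tid++) Kernel(tid, x, y);                 // scalar tail, or the whole range
}

var input = new float[10];
for (int i = 0; i < input.Length; i++) input[i] = i;

var simdOut = new float[10];
var scalarOut = new float[10];
Dispatch(10, true, input, simdOut);    // two by-4 calls, then thread ids 8 and 9 in the tail
Dispatch(10, false, input, scalarOut); // scalar path only

bool same = true;
for (int i = 0; i < simdOut.Length; i++)
    same &= BitConverter.SingleToInt32Bits(simdOut[i]) == BitConverter.SingleToInt32Bits(scalarOut[i]);
Console.WriteLine(same);
```

Because the tail reuses the scalar kernel, element counts that are not a multiple of four need no padding.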
What vectorizes — the complete per-lane kernel class
- Numeric tier: `f32x4`, `i32x4`, and (double-pumped, 2 lanes per v128) `f64x2`, `i64x2`.
- Shapes: straight-line elementwise · counted loops (v128 accumulator) · divergent if-diamonds (mask + `v128.bitselect`) · gather · scatter · conditional/masked stores · general acyclic divergent control flow (chained and nested selects, via a per-phi control-dependence bitselect tree) · divergent loops (a data-dependent branch inside a counted loop).
- Math (f32 and f64): `+ - * /`, min/max, neg/abs, sqrt, floor/ceil, all compares; transcendentals (sin, cos, tan, asin, acos, atan, sinh, cosh, tanh, exp, exp2, log, log2, log10); rcp/rsqrt; pow/atan2/log_b.
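The "mask + bitselect" shape for a divergent if-diamond can be illustrated on the desktop CLR with `System.Runtime.Intrinsics.Vector128` (.NET 7 or later). This is a sketch of the general technique, not the Wasm emitter's output: both arms are computed for all four lanes and a comparison mask picks the result per lane. `DiamondScalar` and `DiamondSimd` are names invented for the example.

```csharp
using System;
using System.Runtime.Intrinsics;

// Scalar form: each thread takes exactly one arm of the branch.
static float DiamondScalar(float v) => v > 0f ? v * 2f : v - 1f;

// Vector form: evaluate both arms, then select per lane with a mask.
static Vector128<float> DiamondSimd(Vector128<float> v)
{
    Vector128<float> thenArm = v * Vector128.Create(2f);
    Vector128<float> elseArm = v - Vector128.Create(1f);
    Vector128<float> mask = Vector128.GreaterThan(v, Vector128<float>.Zero); // all-ones where v > 0
    return Vector128.ConditionalSelect(mask, thenArm, elseArm);
}

Vector128<float> lanes = Vector128.Create(-2f, 0f, 1.5f, 4f);
Vector128<float> result = DiamondSimd(lanes);
for (int k = 0; k < 4; k++)
    Console.WriteLine($"{lanes.GetElement(k)} -> {result.GetElement(k)}");
```

Each lane runs the same multiply or subtract the scalar branch would, which is why a select-based rewrite can stay bit-exact with the scalar path.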
Cross-mode determinism is a hard invariant: kernel_simd is bit-exact to the scalar kernel (no fused FMA; no saturating-vs-trapping convert; the per-lane Math fallbacks call the identical import). 34 Wasm_Simd128_* gates assert kernel_simd-is-emitted and simd == scalar == reference bit-exact, and the whole suite also runs in a SIMD-off CI mode as a permanent cross-mode oracle.
Kept on the existing path (out of class by design): group/barrier/atomic/warp kernels (the multi-worker shared-memory model), f32 → i32 saturating convert (kept scalar for determinism), non-inlined helper calls, and narrow i8/i16 element types.
Also in 4.15.0
- All in-register quant decoders are now single-exit / branchless: `Float8E8M0`, `Float4E2M1`, `Float8E4M3`, and `Float8E5M2` `RawBitsToFloat` (the subnormal-normalize `while` loops are folded to a computed shift count; value-identical, verified bit-exact on all 6 backends). This fixes a WebGL/GLSL shader-size explosion: an early-return (multi-exit) decode inlined before a loop made the structurizer duplicate the loop continuation per exit arm, blowing past WebGL's compile limit (hit in MXFP4 dequant; the same class would hit MXFP8).
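To make "subnormal-normalize loop folded to a computed shift count" concrete, here is a standalone sketch of a single-exit E4M3FN decode (1 sign / 4 exponent / 3 mantissa bits, bias 7, NaN at exponent 15 with mantissa 7). It is written from the format definition, not taken from the library: `DecodeE4M3` is a hypothetical name, and using a leading-zero count for the shift is an assumption about how such a fold can be done.

```csharp
using System;
using System.Numerics;

static float DecodeE4M3(int raw)
{
    int sign = (raw >> 7) & 1;
    int exp = (raw >> 3) & 0xF;
    int man = raw & 0x7;

    // Subnormal input: move the leading 1 of the 3-bit mantissa into the implicit-one
    // position. The shift is computed from a leading-zero count instead of a loop.
    int shift = BitOperations.LeadingZeroCount((uint)man) - 28; // 1..3 for man = 4..1
    bool isSub = exp == 0;
    int e = isSub ? 1 - shift : exp;
    int m = isSub ? (man << shift) & 0x7 : man;

    // Rebias from 7 to 127 (add 120) and widen the mantissa to 23 bits.
    int bits = (sign << 31) | ((e + 120) << 23) | (m << 20);
    bits = (isSub && man == 0) ? (sign << 31) : bits;        // signed zero
    bits = (exp == 0xF && man == 0x7) ? 0x7FC00000 : bits;   // the single NaN encoding
    return BitConverter.Int32BitsToSingle(bits);             // one exit
}

Console.WriteLine(DecodeE4M3(0x38)); // exponent 7, mantissa 0
Console.WriteLine(DecodeE4M3(0x7E)); // largest finite value
Console.WriteLine(DecodeE4M3(0x01)); // smallest subnormal, 2^-9
```

Whether a given backend lowers the conditional expressions to selects rather than branches depends on its code generator; the point of the sketch is only that no loop and no early return remain.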
Verification
Full PMT suite 3980 pass / 0 fail / 258 skip across all six backends (CUDA, OpenCL, CPU, WebGPU, WebGL, Wasm) against the 4.15.0 bits. Forks bump to 2.0.42.
🖖 The SpawnDev Crew — LostBeard (Captain), Riker, Data, Tuvok, Geordi, Seven.
SpawnDev.ILGPU v4.14.1
The 4-bit data-type tier: a packed FP4 float and packed INT4 integers on all 6 backends (now including inside non-inlined helpers), the Float8E8M0 MX scale that completes the OCP Float8 family, and low-precision conversion correctness pinned to the numpy / ml_dtypes references.
SpawnDev.ILGPU extends ILGPU with three browser GPU backends. It transpiles .NET IL into GPU shader languages at runtime, so the same C# kernel runs on 6 backends from one codebase: WebGPU (WGSL), WebGL (GLSL ES 3.0), and WebAssembly in the browser, and CUDA (PTX), OpenCL, and CPU on the desktop. The browser backends make Blazor WebAssembly a first-class GPU-compute target.
This release rolls up the whole 4.14.x line on top of 4.13.0's low-precision float support. Full per-version detail with code samples is in CHANGELOG.md.
Headline: TRUE packed 4-bit types
Three new 4-bit types, each stored genuinely sub-byte - [PackedBits(4)], 2 nibbles per byte, 8 per 32-bit word - so an ArrayView<T> of N elements is ceil(N/2) device bytes, the real NVFP4 / INT4 memory density rather than a 1-byte placeholder:
| Type | Kind | Range / codes | Reference |
|---|---|---|---|
| `Float4E2M1` | 4-bit float (1/2/1) | 16 codes: ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}, no Inf/NaN | OCP E2M1FN, the NVFP4 / MXFP4 element format - bit-exact to `ml_dtypes.float4_e2m1fn` |
| `QInt4` | signed 4-bit int | -8 .. 7 | sign-extends to int |
| `QUInt4` | unsigned 4-bit int | 0 .. 15 | zero-extends to int |
Each nibble is decoded to a wider register (f32 for FP4, i32 for the ints) at the load - the data stays packed in the buffer, so you get the storage/bandwidth win and full-precision compute. Per-backend support:
- Load: all 6 backends - including inside a non-inlined (`[MethodImpl(MethodImplOptions.NoInlining)]`) helper, which 4.14.1 wired through every backend's helper-function code generator (a separate codegen path from the kernel generators that previously didn't understand packed sub-word storage).
- In-kernel store: CUDA, OpenCL, WebGPU, Wasm (the nibble write is an atomic word read-modify-write). CPU and WebGL stores are fail-loud - WebGL has no atomics, and the CPU managed-reference indexer cannot address a sub-byte element, so both throw a typed exception rather than silently corrupting the enclosing word.
- Radix-sort: keys + key/value pairs, ascending + descending, on the four store backends.
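The nibble-to-register decode for the two integer types amounts to a shift, a mask, and a sign or zero extension. The following host-side sketch shows that arithmetic on a packed byte array. The helper names are invented for the example, and the low-nibble-first element order is an assumption taken from the host packing snippet in these notes.

```csharp
using System;

// Element i lives in byte i / 2: even indices in the low nibble, odd indices in the high nibble.
static int Nibble(byte[] data, int i) => (data[i >> 1] >> ((i & 1) * 4)) & 0xF;

// Unsigned 4-bit: zero-extend, range 0 .. 15.
static int DecodeQUInt4(byte[] data, int i) => Nibble(data, i);

// Signed 4-bit: sign-extend bit 3, range -8 .. 7.
static int DecodeQInt4(byte[] data, int i) => (Nibble(data, i) ^ 8) - 8;

byte[] buffer = { 0xF7, 0x08 }; // nibbles in element order: 7, F, 8, 0
for (int idx = 0; idx < 4; idx++)
    Console.WriteLine($"[{idx}] unsigned={DecodeQUInt4(buffer, idx)} signed={DecodeQInt4(buffer, idx)}");
```

Four elements occupy two bytes here, which is the `ceil(N/2)` density described above.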
Working with packed buffers is explicit (there is no transparent typed host pack/unpack - the win is that the buffer stays packed): pack two nibbles per byte and upload the raw bytes.
```csharp
// Pack N FP4 codes (each Float4E2M1.RawValue, 0..15) into ceil(N/2) bytes, upload, decode in-kernel.
var packed = new byte[(n + 1) / 2];
for (int k = 0; k < packed.Length; k++)
    packed[k] = (byte)((codes[2*k] & 0xF) | ((2*k+1 < n ? codes[2*k+1] & 0xF : 0) << 4));
using var buf = accelerator.Allocate1D<Float4E2M1>(n); // ceil(N/2) device bytes
((IContiguousArrayView)buf.View.BaseView).AsRawArrayView().CopyFromCPU(packed);
// dispatch a kernel that reads buf as ArrayView<Float4E2M1> - the E2M1 nibble decodes to f32 in-register
```

To select a backend that can actually store packed 4-bit, the capability flags `RequiresQInt4`, `RequiresQUInt4`, and `RequiresPacked4Store` (the last rules out WebGL and CPU) join `AcceleratorRequirements`.
RawBitsToFloat - decode packed quant bits in-register
When your quantized weights live as raw integer words (a GGUF / MXFP4 block of u32s), you often want to decode one element inside a kernel without an ArrayView<Float4E2M1>. <Type>Extensions.RawBitsToFloat(int rawBits) does exactly that - a kernel-safe, all-6-backend decode of a raw nibble / byte / ushort to its f32 value:
```csharp
// inside a kernel - decode the i-th FP4 nibble out of a packed u32 word, in-register
float v = Float4E2M1Extensions.RawBitsToFloat((int)((word >> (i * 4)) & 0xF));
```

Available for the sub-word float types whose bits you might hold raw: `Float4E2M1`, `Float8E4M3`, `Float8E5M2`, `BFloat16`, and now `Float8E8M0` (below). Host-side, those types also expose `FromRawBits(...)` and a public `RawValue` for raw round-trips.
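For checking results on the host, a reference E2M1 decode is small enough to write by hand. The sketch below is independent of the library (`DecodeE2M1` is a hypothetical helper, not `RawBitsToFloat` itself): it maps the low three bits to the eight E2M1 magnitudes, applies bit 3 as the sign, and walks the eight nibbles of one 32-bit word, lowest nibble first.

```csharp
using System;

// Reference decode of one E2M1 code: bits 0..2 select the magnitude, bit 3 is the sign.
static float DecodeE2M1(int nibble)
{
    float magnitude = (nibble & 0x7) switch
    {
        0 => 0f, 1 => 0.5f, 2 => 1f, 3 => 1.5f,
        4 => 2f, 5 => 3f, 6 => 4f, _ => 6f,
    };
    return (nibble & 0x8) != 0 ? -magnitude : magnitude;
}

uint word = 0x0000F731u; // nibbles from the low end: 1, 3, 7, F, 0, 0, 0, 0
for (int i = 0; i < 8; i++)
    Console.WriteLine(DecodeE2M1((int)((word >> (i * 4)) & 0xF)));
```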
New: Float8E8M0 - the OCP MX scale format
Float8E8M0 (OCP float8_e8m0fnu) is the third member of the OCP Float8 family, alongside Float8E4M3 and Float8E5M2. It is 8 exponent bits, no sign, no mantissa, bias 127 - not an element format but the shared per-block scale for every OCP microscaling layout (MXFP4 / MXFP8 / MXINT8 / NVFP4): a pure power-of-two 2^(e-127). Byte e in 0..254 decodes to 2^(e-127); e == 0xFF is the only special and decodes to NaN (no zero, no Inf).
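The decode rule above is short enough to sketch as a host-side reference. `DecodeE8M0` is a hypothetical helper written from the stated rule, not the library's implementation. One detail is an assumption of this sketch: 2^-127 lies below the smallest normal f32, so byte 0 is special-cased to the matching f32 subnormal bit pattern.

```csharp
using System;

static float DecodeE8M0(byte e)
{
    if (e == 0xFF) return float.NaN;                               // the only special value
    if (e == 0) return BitConverter.Int32BitsToSingle(0x00400000); // 2^-127 as an f32 subnormal
    return BitConverter.Int32BitsToSingle(e << 23);                // exponent field = e, zero mantissa
}

Console.WriteLine(DecodeE8M0(127)); // 2^0
Console.WriteLine(DecodeE8M0(130)); // 2^3
Console.WriteLine(DecodeE8M0(0xFF));
```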
Because E8M0 and IEEE-754 binary32 share exponent bias 127, the decode is exactly the f32 whose biased-exponent field is e with a zero mantissa. It is intentionally minimal - you never add two scales in a kernel, you decode a scale and multiply by it - so it ships as a host struct plus a kernel-safe in-register decode that transpiles on all 6 backends:
```csharp
// Decode an MX block's raw scale byte to f32 in-register, while the block stays packed:
float scale = Float8E8M0Extensions.RawBitsToFloat(scaleByte); // 2^(e-127), e==0xFF -> NaN
// Host:
Float8E8M0 s = Float8E8M0.FromSingle(2.0f); // RNE on the exponent; NaN/<=0/Inf -> 0xFF
byte raw = s.RawValue; // round-trips with Float8E8M0.FromRawBits(raw)
```

Low-precision conversion correctness
The four 1-byte / 2-byte low-precision floats (`Half`, `BFloat16`, `Float8E4M3`, `Float8E5M2`) reach feature-complete parity - a selectable saturating cast (`FromSingle(x, saturate)` / `FromSingleSaturating`) and the full radix-sort grid on all 6 backends - and two conversions were corrected against their authoritative references:
- `Float8E4M3` is now bit-exact to `float8_e4m3fn` (PyTorch / JAX / `ml_dtypes`). The cast and the IR-level convert use the `fn` convention - finite overflow and ±Inf map to NaN (previously saturated to ±448). The saturating clamp is opt-in via `FromSingleSaturating` (the NVIDIA Transformer Engine / OCP mode).
- `Half` float→half is now IEEE round-to-nearest-even on every backend, bit-exact to `numpy.float16` / PyTorch / CUDA / OpenCL. It previously truncated toward zero and flushed subnormals to zero - a silent divergence from numpy and from the desktop backends.
Every float→low-precision conversion is pinned to its numpy / ml_dtypes reference in CI, on each backend's actual on-device convert.
Backends at a glance
| Backend | Target | Shader language |
|---|---|---|
| WebGPU | Browser | WGSL |
| WebGL | Browser | GLSL ES 3.0 |
| Wasm | Browser | WebAssembly binary (multi-worker) |
| CUDA | Desktop | PTX |
| OpenCL | Desktop | OpenCL C |
| CPU | Desktop | .NET |
The full sub-word + low-precision set (Int8/UInt8/Int16/UInt16, Half, BFloat16, FP8 Float8E4M3/E5M2, Float8E8M0, FP4 Float4E2M1, packed QInt4/QUInt4), i64/f64 emulation where there's no native hardware support, automatic backend selection with capability gating, and zero-copy CopyFromJS on the browser are all available across the matrix. Latest full cross-backend sweep: 3934 pass / 0 fail / 258 skip (the skips are the genuinely-impossible cells - in-kernel scatter/atomics/packed-store on WebGL, packed-store on CPU).
Install
dotnet add package SpawnDev.ILGPU
PublishTrimmed and RunAOTCompilation must remain false - ILGPU relies on IL reflection at runtime.
Links
- Documentation: Docs/ - start with getting-started.md; the per-backend data-type matrix (including the packed 4-bit types and the raw-packed host I/O pattern) is in data-type-support.md
- Full changelog: CHANGELOG.md
Credits
Built on ILGPU by the ILGPU project. SpawnDev.ILGPU is part of the SpawnDev family by LostBeard.
SpawnDev.ILGPU v4.13.0
Full low-precision floating-point support on all 6 backends, plus a portable-CUDA fix that brings bf16 and FP8 to pre-Ampere GPUs.
SpawnDev.ILGPU extends ILGPU with three browser GPU backends. It transpiles .NET IL into GPU shader languages at runtime, so the same C# kernel runs on 6 backends from one codebase: WebGPU (WGSL), WebGL (GLSL ES 3.0), and WebAssembly in the browser, and CUDA (PTX), OpenCL, and CPU on the desktop. The browser backends make Blazor WebAssembly a first-class GPU-compute target.
This is the first GitHub release since 4.6.0, and a great deal has changed across 4.7 through 4.13. The headline is low-precision types, but a lot more landed underneath. Full per-version detail with code samples is in CHANGELOG.md.
Headline: low-precision floating point, everywhere
Three kernel-native low-precision float types now join float/double/Half, each a real IR primitive type with full System.Numerics.INumber<T> support, and each working bit-identically on all 6 backends:
| Type | Layout | Notes |
|---|---|---|
| `ILGPU.BFloat16` | 1 / 8 / 7 | "brain float" - the top 16 bits of an fp32, so it keeps fp32's full dynamic range (the right trade for ML weights/activations, where fp16's tiny range overflows/underflows). |
| `ILGPU.Float8E4M3` | 1 / 4 / 3, bias 7 | FP8 forward / inference format (E4M3FN): no infinities, saturates to ±448, single NaN. One extra mantissa bit vs E5M2. |
| `ILGPU.Float8E5M2` | 1 / 5 / 2, bias 15 | FP8 backward / gradient format: IEEE-754-style with infinities and NaNs (fp16-class range, which gradients need). |
Use them exactly like ILGPU.Half:
```csharp
// One generic kernel runs for float, Half, BFloat16, Float8E4M3, Float8E5M2 - no per-type variants.
static void FusedRelu<T>(Index1D i,
    ArrayView1D<T, Stride1D.Dense> x, ArrayView1D<T, Stride1D.Dense> y, T scale, T bias)
    where T : unmanaged, INumber<T>
{
    T v = x[i] * scale + bias;
    y[i] = v > T.Zero ? v : T.Zero;
}

// Read low-precision input, accumulate in float, write low-precision output - one generic op,
// no fp32 temp buffers. PrecisionConvert gives the float<->T conversion that a plain (float)t /
// (T)f cast cannot express in a generic kernel.
static void MeanGeneric<T>(Index1D row,
    ArrayView1D<T, Stride1D.Dense> input, ArrayView1D<T, Stride1D.Dense> output, int C)
    where T : unmanaged, INumber<T>
{
    int b = row * C; float acc = 0f;
    for (int c = 0; c < C; c++) acc += PrecisionConvert.ConvertToSingle(input[b + c]);
    output[row] = PrecisionConvert.ConvertFromSingle<T>(acc / C);
}
```

What's new this release that makes the above work:
- Generic `INumber<T>` mixed-precision kernels. A single `where T : INumber<T>` kernel transpiles and runs for float/Half/bf16/fp8 on every backend, instead of N hand-written per-type copies. This includes by-value low-precision scalar parameters (e.g. a kernel's `scale`/`bias`), which previously arrived as zero on several backends.
- `PrecisionConvert.ConvertToSingle<T>(T)` / `ConvertFromSingle<T>(float)`. Inside a generic kernel there is no C# way to write `(float)t` or `(T)f` (no cast constraint exists), so callers reach for `float.CreateChecked` / `T.CreateChecked` - which touch `System.Type` and which the kernel transpiler rejects on every GPU backend. These two methods lower to the same native conversion the concrete cast emits, so generic precision-aware ops just work.
All of this uses the f32-register model: low-precision values compute as f32 in-register and are converted to their narrow grid only at the load/store boundary, so accumulation stays full-precision (matching how real low-precision tensor hardware accumulates). The conversions are byte-identical across backends, emitted as callable helper functions on OpenCL/WGSL/GLSL, inline WebAssembly bytecode on Wasm, and inline PTX on CUDA.
bf16 and FP8 now run on every CUDA architecture (including pre-Ampere)
If you have an older NVIDIA card, this one matters: bf16 previously failed to compile on pre-Ampere GPUs (Pascal GTX 1080 = sm_61, Volta sm_70, Turing RTX 2060 = sm_75). The PTX path emitted the native cvt.f32.bf16 / cvt.rn.bf16.f32 instructions, which only exist on sm_80+ (Ampere/Ada/Hopper), so ptxas rejected them on anything older.
bf16 is now converted with portable bit-manipulation (basic integer ops available on every CUDA architecture), byte-identical to the result on every other backend. FP8 uses the same portable approach (its native cvt is sm_89/Hopper-only). The lesson, generalized: native-cvt shortcuts silently gate out older hardware, so the default is portable bit-manip unless support is explicitly capability-gated.
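A portable bit-manipulation conversion of the kind described here needs only integer adds, shifts, and masks. The sketch below shows one common formulation of f32 to bf16 with round-to-nearest-even and the reverse widening. It illustrates the technique under that assumption and is not the PTX the library emits; the function names are invented for the example.

```csharp
using System;

// f32 -> bf16 with round-to-nearest-even, using integer ops only.
static ushort FloatToBFloat16(float value)
{
    uint bits = unchecked((uint)BitConverter.SingleToInt32Bits(value));
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u)
        return (ushort)((bits >> 16) | 0x0040u);         // NaN stays NaN (force a mantissa bit)
    uint roundingBias = 0x7FFFu + ((bits >> 16) & 1u);   // ties go to the even neighbour
    return (ushort)((bits + roundingBias) >> 16);
}

// bf16 -> f32: the 16 stored bits are the top half of the f32 pattern.
static float BFloat16ToFloat(ushort raw) =>
    BitConverter.Int32BitsToSingle(unchecked((int)((uint)raw << 16)));

float[] samples = { 1.0f, 3.14159274f, -0.1f, float.PositiveInfinity };
foreach (float sample in samples)
{
    ushort narrow = FloatToBFloat16(sample);
    Console.WriteLine($"{sample} -> 0x{narrow:X4} -> {BFloat16ToFloat(narrow)}");
}
```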
Verified: the full bf16 test surface (radix-sort keys, struct fields, range and ±Inf/NaN/RNE specials, arithmetic) passes on all 6 backends including CUDA; FP8 round-trips bit-exact vs the concrete cast on every backend.
Also since the last GitHub release (4.6.0)
A condensed tour of the bigger items across 4.7 -> 4.13 (the CHANGELOG has the full per-version detail):
- Complete sub-word data type support (4.9.0). `Int8`, `UInt8`, `Int16`, `UInt16`, and `Float16` (`ILGPU.Half`) buffer access on all 6 backends, stored packed with correct stride and sign/zero extension per backend (no more corruption from type-promotion mismatches), plus the `Half.Abs/Min/Max/Clamp` intrinsics. This is the foundation the bf16/FP8 work in 4.13.0 builds on. Also added `CopyFromJS` - zero-copy writes of a JS `TypedArray`/`ArrayBuffer` straight to GPU memory with no .NET heap allocation, on every browser backend.
- Capability gating + typed codegen errors (4.9.2). `AcceleratorRequirements` (`RequiresAtomics`, `RequiresFloat64Native`, `RequiresSharedMemory`, ...) lets the selection path filter out incapable backends up front, and kernels that use a feature a backend cannot implement now throw a typed `UnsupportedKernelFeatureException` at compile time instead of silently producing wrong output. Plus IEEE-754 NaN/Inf correctness across the four emulated backends and helper-function emission to stay under browser shader-validator size limits.
- Generic-math `Half`, then full mixed precision (4.9.12 -> 4.13.0). `INumber<Half>` kernels first, then the generic `INumber<T>` path generalized to bf16 and fp8, and `PrecisionConvert` for the generic in-kernel `float<->T` conversion (see the headline above).
- Offline code generation + precompiled shaders (4.10.0). Generate a kernel's WGSL/GLSL/Wasm with no device on any host OS (`ShaderCompiler.Generate` + `CapabilityProfile`), precompile at build time via an MSBuild task, and load the artifact at runtime to skip IL-to-shader transpilation.
- Wasm backend maturity (4.6.0 -> 4.13.0). A fiber-based barrier dispatch model (full ILGPU Algorithms - RadixSort/Scan/Reduce/Histogram at 100K-4M+ elements - run on Wasm), worker-function caching (3-4x dispatch speedup), then a multi-worker correctness overhaul that killed the large-sort / barrier race family (verified atomic stores, group-barrier release fences, monotonic kernel ids, yield-to-JS escapes under oversubscription) and a process-static shared worker pool + linear memory that keep a long session bounded. A first-class non-SIMD path stays supported forever, with an additive SIMD128 v128 fast path for ALU-dense elementwise kernels on SIMD-capable browsers. (Barriers use pure-spin synchronization to work around a V8 `Atomics.wait` visibility bug.)
- Sync/async contract (4.12.0). Operations that wait or read a result back are async-only on the browser; the sync form now throws (`Synchronize()` -> `await SynchronizeAsync()`; sync device readback / device-to-device copy -> the `...Async` forms) instead of silently returning stale data, while fire-and-forget work (dispatch, alloc, upload, `Flush`-submit) stays synchronous. `AcceleratorRequirements.RequiresScatterStores` (4.12.1) gates WebGL out of in-kernel scatter kernels at selection time.
- CPU backend cooperative multi-multiprocessor execution (4.13.0). Thread-groups run one-per-core with a cheap cooperative barrier instead of oversubscribing one simulated multiprocessor, eliminating multi-second barrier thrash on heavy reduction/decode kernels.
- And more: `GpuTestVerify` for GPU-side test verification without CPU readback (4.7.1), the `SpawnDev.ILGPU.QR` GPU QR encoder/decoder (4.7.1), WebGL `Half` RadixSort + a cross-backend sub-word sign-extension fix (4.9.13), `CopyFromStreamAsync` / `MemoryPressure.AllocateWithReclaim`, and assorted WebGPU/WebGL/OpenCL/PTX codegen correctness fixes.
Backends at a glance
| Backend | Target | Shader language |
|---|---|---|
| WebGPU | Browser | WGSL |
| WebGL | Browser | GLSL ES 3.0 |
| Wasm | Browser | WebAssembly binary (multi-worker) |
| CUDA | Desktop | PTX |
| OpenCL | Desktop | OpenCL C |
| CPU | Desktop | .NET |
Sub-word and low-precision types (Int8/UInt8/Int16/UInt16/Half/BFloat16/Float8E4M3/Float8E5M2), i64/f64 emulation where there's no native hardware support, automatic backend selection with capability gating, and zero-copy CopyFromJS on the browser are all available across the matrix.
Install
dotnet add package SpawnDev.ILGPU
PublishTrimmed and RunAOTCompilation must remain false - ILGPU relies on IL reflection at runtime.
Links
- Documentation: Docs/ (start with getting-started.md; see [data-type-support.md](https://gi...
SpawnDev.ILGPU v4.6.0
6 backends. 1,511 tests. Zero failures.
CUDA, OpenCL, CPU, WebGPU, WebGL, and Wasm — all passing. GPU compute in the browser is no longer experimental.
Highlights
Full Multi-Worker Wasm Barrier Dispatch
The Wasm backend now supports full navigator.hardwareConcurrency workers with group barriers and shared memory. A pure spin barrier using i32.atomic.load loops replaces the previous wait32/notify approach, working around a V8 atomics visibility gap that caused data races with 3+ workers.
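The shape of a spin barrier built on atomic loads can be sketched in plain C# with threads. This is an analogy for the idea only, not the generated Wasm: it uses a generation counter that waiters poll with `Volatile.Read`, and `SpinWait` (which may yield to the scheduler) stands in for the pure `i32.atomic.load` loop. All names and the worker and phase counts are assumptions of the example.

```csharp
using System;
using System.Threading;

const int Workers = 4;
const int Phases = 100;
int arrived = 0;    // workers that have reached the barrier in the current phase
int generation = 0; // bumped once per completed phase
int errors = 0;
int[] slots = new int[Workers];

void BarrierWait()
{
    int seen = Volatile.Read(ref generation);
    if (Interlocked.Increment(ref arrived) == Workers)
    {
        Volatile.Write(ref arrived, 0);        // reset before releasing the others
        Interlocked.Increment(ref generation); // release: waiters observe the new generation
    }
    else
    {
        var spinner = new SpinWait();
        while (Volatile.Read(ref generation) == seen) spinner.SpinOnce(); // poll, no wait/notify
    }
}

var threads = new Thread[Workers];
for (int w = 0; w < Workers; w++)
{
    int id = w;
    threads[w] = new Thread(() =>
    {
        for (int phase = 1; phase <= Phases; phase++)
        {
            Volatile.Write(ref slots[id], phase);
            BarrierWait(); // every worker has written this phase
            for (int k = 0; k < Workers; k++)
                if (Volatile.Read(ref slots[k]) != phase) Interlocked.Increment(ref errors);
            BarrierWait(); // every worker has read before the next write
        }
    });
    threads[w].Start();
}
foreach (var t in threads) t.Join();
Console.WriteLine($"stale reads: {errors}");
```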
RadixSort Verified at Scale
RadixSort passes across all data types and sizes up to 4M elements on every backend — including Wasm in the browser. Key fixes:
- Histogram counter buffer sizing — fixed undersized counters that caused real out-of-bounds writes during grid-stride iteration
- Grid-stride tail byte padding — extended linear-memory slack allocation to prevent OOB traps on packed buffers
- Per-worker scratch isolation — eliminated intermittent sort corruption in non-barrier kernels
20+ Wasm Codegen Fixes
Deep correctness pass across the Wasm code generator:
- Fiber yield-per-phase with dynamic block splitting
- Atomic loads/stores for all shared memory access in barrier kernels (including float via i32/i64 reinterpret)
- Struct load copy semantics to prevent aliasing
- Unsigned comparison in `MinUInt32`/`MinUInt64` reductions
- Correct atomic RMW opcode table for interleaved sub-word variants
- Local alloca addressing, shared memory deduplication, and IR address space aliasing guards
WebGPU Backend Fixes
- WGSL loop break + bool PHI: correct merge value generation when breaking from loops with boolean phi nodes
- WGSL continuation after if-else with break: prevent unreachable code generation
Test Results
| Backend | Pass | Fail | Skip |
|---|---|---|---|
| CUDA | all | 0 | — |
| OpenCL | all | 0 | — |
| CPU | all | 0 | — |
| WebGPU | 229 | 0 | 12 |
| WebGL | 139 | 0 | 115 |
| Wasm | 249 | 0 | 3 |
| Total | 1,511 | 0 | 162 |
WebGL skips are architectural (GLSL ES 3.0 lacks shared memory/barriers/atomics). Wasm skips are subgroup-dependent features not available in browser WebAssembly.
What This Means
This release proves that GPU-class parallel algorithms — radix sort, scan, reduce, atomics, shared memory, group barriers — run correctly in the browser across WebGPU, WebGL, and WebAssembly, alongside native CUDA, OpenCL, and CPU backends. Write your kernel once, run it everywhere.
SpawnDev.ILGPU v4.0.0
Run ILGPU C# kernels on WebGPU, WebGL, Wasm, CUDA, OpenCL, and CPU — from a single codebase.
This is a major release with deep improvements to the WebGPU and Wasm backends, bringing ILGPU's algorithm library (RadixSort, Scan, Reduce) to the browser for the first time.
Highlights
WebGPU RadixSort — Full Algorithm Support
All RadixSort variants now pass on WebGPU, including large-scale sorts (4M+ elements), pairs, descending, and multiple data types. Fixed shared memory sizing, scan barrier synchronization, range checks for auto-grouped kernels, and 256-byte alignment padding for minStorageBufferOffsetAlignment.
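The alignment padding mentioned above is the usual round-up-to-a-multiple calculation. A minimal sketch follows, assuming a 256-byte `minStorageBufferOffsetAlignment` and a simple running cursor for sub-allocations. `AlignUp` and the sizes are illustrative and not part of the library.

```csharp
using System;

// Round an offset up to the next multiple of a power-of-two alignment.
static long AlignUp(long offset, long alignment) => (offset + alignment - 1) & ~(alignment - 1);

const long Alignment = 256; // assumed value of minStorageBufferOffsetAlignment
long[] sizes = { 100, 256, 1000 };
long cursor = 0;
foreach (long size in sizes)
{
    Console.WriteLine($"sub-buffer of {size} bytes bound at offset {cursor}");
    cursor = AlignUp(cursor + size, Alignment); // pad so the next binding offset is legal
}
Console.WriteLine($"total bytes reserved: {cursor}");
```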
Wasm Backend — Barrier Kernel Infrastructure
The Wasm backend received 7 codegen and dispatch fixes enabling correct barrier-synchronized kernels (Scan, Reduce, and single-group RadixSort):
- Struct-with-view serialization: fixed a CLR-to-IR layout mismatch for kernel structs containing ArrayViews (e.g., `InitializerImplementation<T>`). Manual IR-layout-aware serialization replaces `Unsafe.Write`.
- View field mapping: fixed the `GetField` handler returning 0 for ArrayView1D's Extent (Length) field, which caused all `view.Length` checks to fail silently.
- Local alloca addressing: fixed local memory allocations defaulting to address 0, which caused the ExclusiveScan helper to corrupt the data buffer between sort passes.
- Per-thread scratch memory: each parallel Web Worker now gets its own scratch region, preventing cross-worker data races during struct construction.
- Post-helper barriers: added synchronization barriers after each ExclusiveScan helper call to prevent fast workers from starting the next scan while slow workers are still completing the previous one.
- SpecializedValue unwrapping: fixed dispatch to correctly extract scalar values from `SpecializedValue<T>` wrapper structs.
- GetViewLength tracing: added `TraceToParameter()` to resolve view sources through GetField/NewView chains.
WebGPU Backend Refactor
Major internal restructuring for maintainability and performance:
- Extracted `SharedMemoryResolver` and `UniformityAnalyzer` into standalone subsystems
- Per-function emulation library trimming via BFS dependency graph
- Dead variable elimination post-pass for cleaner generated WGSL
- i64 constant hoisting to module-scope `const` declarations
- Pre-compiled regex patterns replacing runtime `Regex.IsMatch` calls
- WGSL pre-validation (`ValidateWGSL()`) catches shader errors before GPU submission
- `KernelSpecialization` for all algorithm kernel loaders (RadixSort, Histogram, Scan, etc.)
Device Loss Detection
- WebGPU: Monitors the `device.lost` promise. Exposes an `IsDeviceLost` property and a `DeviceLost` event.
- WebGL: Monitors the `webglcontextlost` event via glWorker.js. Exposes an `IsContextLost` property and a `ContextLost` event.
- Intentional disposal (`Dispose()`) is filtered out; only unexpected losses fire the events.
Test Infrastructure
- PlaywrightMultiTest: Unified NUnit + Playwright runner executes all tests (desktop + browser) in a single `dotnet test` invocation
- 1316 tests passing across all 6 backends (WebGPU, WebGL, Wasm, CUDA, OpenCL, CPU), 0 failures
Browser Backend Capabilities
| Feature | WebGPU | WebGL | Wasm |
|---|---|---|---|
| Shared Memory | ✅ | ❌ | ✅ |
| Group.Barrier() | ✅ | ❌ | ✅ |
| Atomics | ✅ | ❌ | ✅ |
| ILGPU Algorithms | ✅ RadixSort, Scan, Reduce, Histogram | ❌ | ✅ Scan, Reduce (single-group) |
| 64-bit (f64/i64) | ✅ Emulated | ✅ Emulated | ✅ Native |
Known Limitations
- Wasm multi-group barrier dispatch: Barrier kernels are fully correct for single-group workloads (up to 64 elements for groupSize=64). Multi-group workloads have a cross-group SharedArrayBuffer memory visibility limitation in current browsers. A cooperative scheduling fix is planned for a future release. Desktop backends and WebGPU have no such limitation.
Breaking Changes
None. Existing ILGPU kernels and API usage are fully compatible.
Installation
dotnet add package SpawnDev.ILGPU --version 4.0.0
Links
- Live Demo — Fractal Explorer, 3D Raymarching, GPU Boids, Benchmarks, Unit Tests
- Documentation — Getting Started, Backends, Kernels, Memory & Buffers, Canvas Rendering
- GitHub
SpawnDev.ILGPU v3.5.0
Half (f16) Support
- WebGPU f16 kernels: `Float16` maps to native `f16` in WGSL. Buffer alignment, constant emission, and `Half ↔ float` conversion intrinsics all wired up. Capability-gated on device feature support.
- `XMath.Min/Max/Clamp` for `Half`: added to `XMath` via float promotion.
- Group Scan/Reduce for `Half`: `ExclusiveScan`, `InclusiveScan`, `AllReduce`, and `GroupReduce` now support `Half` on WebGPU and CUDA.
- CUDA PTX Half warp shuffles: `WarpShuffle`, `WarpShuffleDown`, `WarpShuffleUp`, `WarpShuffleXor` (and SubWarp variants) for `Half` via `b32` widening. Unlocks Half scan/reduce on CUDA.
- Lock-free `AllReduce`: rewrote `AllReduce` in both IL and PTX backends to use per-warp shared-memory slots instead of atomic operations. Removes the Half atomics dependency entirely and is correct for all types.
- `Half.One` constant fix: was `0x0001` (denormal ≈5.96e-8); corrected to `0x3C00` (IEEE-754 `1.0`).
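The two bit patterns in the `Half.One` fix can be checked with the .NET BCL's own `System.Half`, which uses the same IEEE-754 binary16 layout. Note that this sketch uses `System.Half` and `BitConverter.Int16BitsToHalf` from the BCL, not `ILGPU.Half`.

```csharp
using System;

Half wrongOne = BitConverter.Int16BitsToHalf(0x0001); // the old constant
Half rightOne = BitConverter.Int16BitsToHalf(0x3C00); // the corrected constant

Console.WriteLine((float)wrongOne); // smallest positive subnormal, 2^-24 (about 5.96e-8)
Console.WriteLine((float)rightOne); // 1
```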
WebGPU RadixSort with double / long Keys
- `RadixSortPairs<double, …>` and `RadixSortPairs<long, …>` now work on WebGPU. Multiple root causes fixed end-to-end:
  - `FloatAsInt`/`IntAsFloat` casts for emulated `f64` now correctly reconstruct the IEEE-754 64-bit pattern.
  - Structs containing emulated 64-bit fields are flattened to `array<u32>` in WGSL ("packed structs") to match CPU memory layout.
  - True element count is passed to the GPU via a dedicated `_scalar_params` slot, replacing the incorrect `arrayLength()` calculation for packed views.
  - Sub-view element offset is now computed in u32 units (`padding / 4`) instead of logical CPU elements, fixing sort correctness for array sizes where the inner temp allocation doesn't start at a 256-byte boundary.
Canvas Rendering (ICanvasRenderer)
- `ICanvasRenderer` API: new interface for presenting ILGPU pixel buffers (`MemoryBuffer2D<uint/int>`, packed RGBA) directly to an HTML `<canvas>` element. Obtained via `CanvasRendererFactory.Create(accelerator)`.
- WebGPU: zero-copy path. A cached WGSL fullscreen-triangle pipeline reads the pixel buffer directly from a `read-only-storage` binding. No CPU readback. Blit to the visible canvas via `drawImage`. Pipeline and bind-group are built once; uniforms are only re-uploaded on resolution change.
- WebGL: delegates to an offscreen FBO blit in the GL Web Worker. The result is transferred as an `ImageBitmap` back to the main thread, preventing Blazor's render cycle from clearing the canvas between frames.
- CPU / Wasm: fallback via `putImageData`. Browser-backed buffers use `CopyToHostUint8ArrayAsync` for a JS-side copy; pure CPU buffers fall back to synchronous `CopyToCPU`.
WebGPU Warp Reduce without Subgroups
- `GenerateWarpReduce` now emits a full shared-memory butterfly reduction when the `subgroups` feature is unavailable, replacing the previous no-op passthrough. Correct results on hardware/drivers that don't expose subgroup extensions.
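A butterfly reduction pairs each lane with the lane whose index differs in one bit, halving the distance every step, so that after log2(N) steps every lane holds the full result. The following CPU simulation sketches that access pattern under the assumption of a power-of-two lane count. It is not the generated WGSL, and the separate `next` buffer stands in for the barrier between each step's reads and writes.

```csharp
using System;

// Simulates a butterfly (XOR) all-reduce over one "warp" of lanes held in an array.
static float[] ButterflyReduce(float[] lanes)
{
    int n = lanes.Length; // assumed to be a power of two
    float[] current = (float[])lanes.Clone();
    for (int offset = n / 2; offset > 0; offset >>= 1)
    {
        var next = new float[n];
        for (int lane = 0; lane < n; lane++)
            next[lane] = current[lane] + current[lane ^ offset]; // partner differs in one bit
        current = next; // on a GPU this hand-over is a shared-memory barrier
    }
    return current;
}

float[] result = ButterflyReduce(new float[] { 1, 2, 3, 4, 5, 6, 7, 8 });
Console.WriteLine(string.Join(", ", result)); // every lane holds the sum, 36
```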
Algorithm Type Coverage
Added scan and reduce test/support variants for double, long, and uint:
| Operation | New Types |
|---|---|
| `ExclusiveScan` | double, uint |
| `InclusiveScan` | long, double, uint |
| `AllReduce` | double, long, uint |
| `GroupReduce` | float, long, double, uint, Half |
SpawnDev.ILGPU v3.3.0
SpawnDev.ILGPU v3.3.0 Release Notes
Desktop & Browser
- WPF Demo Application: new desktop demo running the same shared kernels (Fractal Explorer, 3D Raymarching, GPU Boids) on CUDA, OpenCL, and CPU with live backend switching
- Shared Kernel Library: extracted `SpawnDev.ILGPU.Demo.Shared` so browser and desktop demos share identical kernel code
- Console Test Runner: added `SpawnDev.ILGPU.ConsoleDemo` for running the full unit test suite on desktop backends with process isolation for crash resilience
- OpenCL 3.0 Compatibility: relaxed the `GenericAddressSpace` requirement, enabling NVIDIA GPUs with OpenCL 3.0 drivers that were previously blocked
- Multi-platform support: updated `SupportedPlatform` to include Windows, Linux, and macOS
WebGL2 Backend — GPU-Resident Buffers
The WebGL2 backend has been refactored to eliminate unnecessary CPU↔GPU data transfers:
- GPU-resident buffers: buffers persist as textures in the GL worker; kernel dispatch sends buffer references, not data
- On-demand readback: `CopyToHostAsync()` is the only GPU→CPU transfer path
- New worker protocol: `allocBuffer`, `uploadBuffer`, `readbackBuffer`, `freeBuffer` messages manage buffer lifecycle
- Proper buffer disposal: buffers are freed in the worker when disposed on the C# side
Wasm Backend Improvements
- Expanded API coverage including shared memory, barriers, dynamic shared memory, atomics, and broadcasting
- Single-worker fallback mode when `SharedArrayBuffer` is unavailable
Transpiler Fixes
- Break-PHI bug: fixed assignments before `break` in loops being dropped in WGSL and GLSL transpilers
- CopySign: corrected argument swap in the `CopySign` intrinsic
- 64-bit reduce: fixed signed/unsigned mismatch in `MinUInt64` and `emu_f64` buffer I/O for `AddDouble`/`MaxDouble`
- WebGL raymarching: fixed GLSL rendering issues
- BVH ray traversal: corrected WebGPU and WebGL backend issues for complex scene traversal
Upstream ILGPU Fixes
Six bugs from the original ILGPU repo have been fixed in our fork:
| Issue | Description | Severity |
|---|---|---|
| #1361 | `MathF.CopySign` argument order swapped; silent wrong results on all GPU backends | High |
| #1309 | `uint` to `float` cast routed through `double`; crashes on devices without fp64 | Medium |
| #1479 | Infinite compilation with large local arrays (`new int[1_000_000]`): 10+ min, 10+ GB RAM | High |
| #1538 | Internal Compiler Error with nested struct properties — wrong field slicing after type unification | Medium |
| #1539 | OpenCL produces wrong results for complex kernels — stale phi variables persisted across blocks | High |
| #1540 | H100/H200 not working — added SM_90, SM_100, SM_101, SM_120 architecture support | High |
See upstream-issues.md for detailed root cause analysis and fix descriptions.
Documentation
- Corrected synchronization semantics: `Synchronize()` = flush (non-blocking), `SynchronizeAsync()` = flush + wait, `CopyToHostAsync()` = only GPU→CPU path
- Updated test count to 640 tests across 8 suites
- Added WebGL GPU-resident buffer architecture documentation
- Reduced default logging verbosity across all backends
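The synchronization semantics listed above (flush, flush + wait, readback) can be modeled with a toy queue. This is a sketch of the distinction only: `ToyQueue` and its members are hypothetical names, the "device" is a plain array, and treating readback as "wait, then copy" is an assumption rather than documented behavior.

```typescript
// Toy model only: class and member names are hypothetical, not SpawnDev.ILGPU APIs.
class ToyQueue {
  private queued: Array<() => void> = [];
  private inFlight: Promise<void> = Promise.resolve();
  readonly device: number[];

  constructor(initial: number[]) {
    this.device = initial.slice();
  }

  // Record work; nothing runs yet.
  enqueue(op: (device: number[]) => void): void {
    this.queued.push(() => op(this.device));
  }

  // Models Synchronize(): submit queued work and return immediately.
  flush(): void {
    const batch = this.queued;
    this.queued = [];
    this.inFlight = this.inFlight.then(() => {
      for (const op of batch) op();
    });
  }

  // Models SynchronizeAsync(): flush, then wait for submitted work to finish.
  async flushAndWait(): Promise<void> {
    this.flush();
    await this.inFlight;
  }

  // Models CopyToHostAsync(): the only way to obtain a host-side copy.
  // Assumption: it waits for in-flight work before copying.
  async readback(): Promise<number[]> {
    await this.flushAndWait();
    return this.device.slice();
  }

  // Number of operations recorded but not yet submitted.
  get pendingCount(): number {
    return this.queued.length;
  }
}
```

Right after `flush()` returns, the queue is empty but the device data is still unchanged, because the submitted work has not run yet. Only the awaited calls observe completed results.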
Demo Improvements
- Game of Life — fixed mouse interaction and added NavMenu icon
- Fractal Explorer — moved to shared kernel library, improved WebGL2 rendering pipeline
- Reduced console log noise for cleaner browser dev tools experience
Full Changelog: v3.2.0...v3.3.0
SpawnDev.ILGPU v3.2.0
Cross-platform GPU compute from a single codebase — browser and desktop.
What's New
🖥️ Desktop Support Verified
- SpawnDev.ILGPU now officially supports desktop/server environments (Console, WPF, ASP.NET) alongside Blazor WebAssembly
- Same NuGet package provides browser backends (WebGPU, WebGL, Wasm) and native backends (Cuda, OpenCL, CPU)
- `SynchronizeAsync()` and `CopyToHostAsync()` work everywhere — async in the browser, graceful sync fallback on desktop
- New `SpawnDev.ILGPU.ConsoleDemo` project included as a working reference
🎮 New Demos
- Game of Life — GPU-accelerated cellular automaton
- Boids 3D — Flocking simulation on all backends
- Compute 3D — 3D compute shader demo
🐛 Bug Fixes
- Fixed 3 transpiler bugs found during Game of Life development
- Fixed handling of Debug IL in WebGPU and WebGL transpilers
- Updated Wasm backend intrinsics
📚 Comprehensive Documentation
- New `Docs/` folder with 8 markdown guides: Getting Started, Backends, Kernels, Memory & Buffers, Advanced Patterns (GPU intrinsics, device sharing, rendering), Limitations, and API Reference
- Covers both Blazor WASM and desktop usage
- Incorporates foundational ILGPU concepts adapted for the browser
Full Changelog
SpawnDev.ILGPU v3.0.0
What's New
🚀 Next-Generation GPU Computing in Blazor Wasm — v3.0.0 brings major performance improvements, streamlined architecture, and enhanced compatibility. Run C# ILGPU kernels on WebGPU, WebGL, and native WebAssembly with automatic backend selection.
Key Features
- Three Powerful Backends — WebGPU (modern GPU compute via WGSL), WebGL (universal GPU access via GLSL ES 3.0), and Wasm (native WebAssembly on Web Workers)
- CPU Backend — Standard ILGPU CPU accelerator included for debugging and performance comparison
- Universal GPU Access — WebGPU for cutting-edge browsers, WebGL for virtually every device
- Intelligent Auto-Selection — `CreatePreferredAcceleratorAsync()` automatically picks the best available backend (WebGPU → WebGL → Wasm)
- 64-bit Computing — Full `double` and `long` support via optimized emulation on both GPU backends
- Multi-Worker Dispatch — Wasm backend distributes work across all available CPU cores
- Zero-Copy Shared Memory — SharedArrayBuffer support for efficient data sharing
- Atomic Operations — Workgroup synchronization and atomic operations on WebGPU and Wasm backends
- Production Ready — Comprehensive test suite, stable APIs, and real-world optimization
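The WebGPU → WebGL → Wasm preference order mentioned in the feature list amounts to a first-match search. The sketch below shows that idea in TypeScript. `pickBackend`, `Backend`, and `V3_ORDER` are hypothetical names, and the synchronous availability predicate is a simplification: real capability probes in a browser are typically asynchronous.

```typescript
type Backend = "WebGPU" | "WebGL" | "Wasm";

// Hypothetical helper: return the first backend in `preference` that `isAvailable` accepts.
function pickBackend(
  preference: readonly Backend[],
  isAvailable: (backend: Backend) => boolean,
): Backend | undefined {
  for (const backend of preference) {
    if (isAvailable(backend)) return backend;
  }
  // No backend in the list is usable.
  return undefined;
}

// Preference order as stated in the release notes.
const V3_ORDER: readonly Backend[] = ["WebGPU", "WebGL", "Wasm"];
```

For example, on a browser without WebGPU, `pickBackend(V3_ORDER, b => b !== "WebGPU")` falls through to `"WebGL"`.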
Built For
- ✨ Blazor WebAssembly — Run compute-intensive C# kernels in the browser
- 🎮 Game Development — GPU-accelerated physics, graphics, and AI
- 📊 Data Processing — High-performance number crunching without native compilation
- 🔬 Scientific Computing — GPGPU capabilities in pure managed code
Resources
Full Changelog: v2.1.0...v3.0.0
SpawnDev.ILGPU v2.1.0
What's New
🖼️ New WebGL Backend — GPU-accelerated compute on virtually every modern browser and device. C# kernels are transpiled to GLSL ES 3.0 vertex shaders and executed via Transform Feedback, providing broad GPU access even where WebGPU isn't supported.
Highlights
- Five backends — WebGPU, WebGL, Wasm, Workers, and CPU
- Two GPU backends — WebGPU for cutting-edge browsers, WebGL for universal coverage
- Auto-selection — `CreatePreferredAcceleratorAsync()` picks the best available backend (WebGPU → WebGL → Wasm → Workers → CPU)
- 64-bit emulation on both GPU backends (`double`/`long` support via software emulation)
- Benchmarks page — New interactive benchmark suite comparing throughput across all backends
- Workers performance — Cached compiled functions and script bodies to reduce per-dispatch overhead
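For readers unfamiliar with 64-bit software emulation, one generic approach to integer `long` arithmetic on 32-bit-only hardware is to carry each value as two 32-bit halves. The TypeScript sketch below shows addition with carry. It illustrates the general technique only: the notes do not describe how the library's emulation is implemented, and `U64` and `add64` are names invented here.

```typescript
// A 64-bit integer held as two unsigned 32-bit halves (assumed to be in range 0..2^32-1).
type U64 = { lo: number; hi: number };

function add64(a: U64, b: U64): U64 {
  // Wrap the low half to 32 bits.
  const lo = (a.lo + b.lo) >>> 0;
  // If the wrapped sum is smaller than an operand, the low half overflowed.
  const carry = lo < a.lo ? 1 : 0;
  // Wrap the high half; overflow past 64 bits is discarded, as in native 64-bit adds.
  const hi = (a.hi + b.hi + carry) >>> 0;
  return { lo, hi };
}
```

Because two's-complement addition is the same bit pattern for signed and unsigned values, the same routine serves both interpretations; only comparisons, division, and conversions need to distinguish them.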
Links
Full Changelog: v2.0.0...v2.1.0