Merged
16 changes: 15 additions & 1 deletion CHANGELOG.md
@@ -7,6 +7,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [0.4.0] - 2026-02-26

### Changed

#### Internal Architecture
- **BytecodeTape decomposition**: split 2,689-line monolithic `bytecode_tape.rs` into a directory module with 10 focused submodules (`forward.rs`, `reverse.rs`, `tangent.rs`, `jacobian.rs`, `sparse.rs`, `optimize.rs`, `taylor.rs`, `parallel.rs`, `serde_support.rs`, `thread_local.rs`). Zero public API changes; benchmarks confirm no performance impact.
- Deduplicated reverse sweep in `gradient_with_buf()` and `sparse_jacobian_par()` — both now call shared `reverse_sweep_core()` instead of inlining the loop. `gradient_with_buf` gains the zero-adjoint skip optimization it was previously missing.
- Bumped `nalgebra` dependency from 0.33 to 0.34

### Fixed
- Corrected opcode variant count in documentation (44 variants, not 38/43)
- Fixed CONTRIBUTING.md MSRV reference (1.93, not 1.80)

## [0.3.0] - 2026-02-25

### Added
@@ -162,7 +175,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Forward-vs-reverse cross-validation on Rosenbrock, Beale, Ackley, Booth, and more
- Criterion benchmarks for forward overhead and reverse gradient

-[Unreleased]: https://github.com/Entrolution/echidna/compare/v0.3.0...HEAD
+[Unreleased]: https://github.com/Entrolution/echidna/compare/v0.4.0...HEAD
+[0.4.0]: https://github.com/Entrolution/echidna/compare/v0.3.0...v0.4.0
[0.3.0]: https://github.com/Entrolution/echidna/compare/v0.2.0...v0.3.0
[0.2.0]: https://github.com/Entrolution/echidna/compare/v0.1.0...v0.2.0
[0.1.0]: https://github.com/Entrolution/echidna/releases/tag/v0.1.0
50 changes: 42 additions & 8 deletions CONTRIBUTING.md
@@ -21,7 +21,7 @@ This project is governed by the [Contributor Covenant Code of Conduct](CODE_OF_C

### Prerequisites

-- Rust 1.80 or later (install via [rustup](https://rustup.rs/))
+- Rust 1.93 or later (install via [rustup](https://rustup.rs/))
- Cargo (included with Rust)

### Building
@@ -37,9 +37,19 @@ cargo build --release
### Testing

```bash
-# Run all tests
+# Run core tests (no feature flags)
cargo test

# Run with all CPU features
cargo test --features "bytecode,taylor,laurent,stde,diffop,serde,faer,nalgebra,ndarray"

# Run GPU tests (requires hardware)
cargo test --features "bytecode,gpu-wgpu"
cargo test --features "bytecode,gpu-cuda"

# Run echidna-optim tests
cargo test -p echidna-optim

# Run tests with output
cargo test -- --nocapture

@@ -65,12 +75,29 @@ cargo doc --no-deps
### Benchmarks

```bash
-# Run all benchmarks
+# Run core benchmarks (forward + reverse)
cargo bench

-# Run a specific benchmark
-cargo bench --bench forward
-cargo bench --bench reverse
# Bytecode tape
cargo bench --features bytecode --bench bytecode

# Taylor mode
cargo bench --features stde --bench taylor

# STDE estimators
cargo bench --features stde --bench stde

# Differential operators
cargo bench --features diffop --bench diffop

# Cross-country, sparse, nonsmooth
cargo bench --features bytecode --bench advanced

# GPU
cargo bench --features gpu-wgpu --bench gpu

# Comparison vs other libraries
cargo bench --features bytecode --bench comparison
```

### Security Audits
@@ -136,7 +163,7 @@ src/
├── api.rs # Public API: grad, jvp, vjp, jacobian, hessian, ...
├── breverse.rs # BReverse<F> bytecode-tape reverse variable [bytecode]
├── bytecode_tape.rs # BytecodeTape SoA representation [bytecode]
-├── opcode.rs # Opcode definitions and dispatch [bytecode]
+├── opcode.rs # Opcode definitions and dispatch (44 opcodes) [bytecode]
├── sparse.rs # Sparsity detection and graph coloring [bytecode]
├── cross_country.rs # Markowitz vertex elimination [bytecode]
├── nonsmooth.rs # Branch tracking, Clarke Jacobian [bytecode]
@@ -146,7 +173,14 @@ src/
├── taylor_ops.rs # Shared Taylor propagation rules [taylor]
├── laurent.rs # Laurent<F, K> singularity analysis [laurent]
├── stde.rs # Stochastic derivative estimators [stde]
-├── gpu/ # GPU acceleration [gpu-wgpu, gpu-cuda]
+├── diffop.rs # Arbitrary differential operators via jets [diffop]
+├── gpu/
+│ ├── mod.rs # GpuBackend trait, GpuTapeData, GpuError
+│ ├── wgpu_backend.rs # WgpuContext (Metal/Vulkan/DX12, f32) [gpu-wgpu]
+│ ├── cuda_backend.rs # CudaContext (NVIDIA, f32+f64) [gpu-cuda]
+│ ├── stde_gpu.rs # GPU-accelerated STDE functions [stde]
+│ ├── shaders/ # 5 WGSL compute shaders [gpu-wgpu]
+│ └── kernels/ # CUDA kernels (tape_eval.cu, taylor_eval.cu) [gpu-cuda]
├── faer_support.rs # faer integration [faer]
├── nalgebra_support.rs # nalgebra integration [nalgebra]
├── ndarray_support.rs # ndarray integration [ndarray]
6 changes: 3 additions & 3 deletions Cargo.toml
@@ -3,7 +3,7 @@ members = [".", "echidna-optim"]

[package]
name = "echidna"
-version = "0.3.0"
+version = "0.4.0"
edition = "2021"
rust-version = "1.93"
description = "A high-performance automatic differentiation library for Rust"
@@ -26,7 +26,7 @@ rayon = { version = "1", optional = true }
serde = { version = "1", features = ["derive"], optional = true }
ndarray = { version = "0.17", optional = true }
faer = { version = "0.24", optional = true }
-nalgebra = { version = "0.33", optional = true }
+nalgebra = { version = "0.34", optional = true }
wgpu = { version = "28", optional = true }
bytemuck = { version = "1", features = ["derive"], optional = true }
pollster = { version = "0.4", optional = true }
@@ -37,7 +37,7 @@ criterion = { version = "0.8", features = ["html_reports"] }
approx = "0.5"
serde_json = "1"
ciborium = "0.2"
-nalgebra = "0.33"
+nalgebra = "0.34"
num-dual = "0.11"
ad_trait = "0.2.2"
tempfile = "3"
20 changes: 16 additions & 4 deletions README.md
@@ -46,7 +46,7 @@ Add to `Cargo.toml`:

```toml
[dependencies]
-echidna = "0.3"
+echidna = "0.4"
```

### Gradient via reverse mode
@@ -157,7 +157,7 @@ assert!((d2_dxdy - 2.0).abs() < 1e-6);
Enable features in `Cargo.toml`:

```toml
-echidna = { version = "0.3", features = ["bytecode", "taylor"] }
+echidna = { version = "0.4", features = ["bytecode", "taylor"] }
```

## API
@@ -200,12 +200,18 @@ echidna = { version = "0.3", features = ["bytecode", "taylor"] }
| `tape.ode_taylor_step::<K>(y0)` | ODE Taylor integration step |
| `stde::laplacian(tape, x, dirs)` | Stochastic Laplacian estimate |
| `stde::hessian_diagonal(tape, x)` | Stochastic Hessian diagonal |
| `stde::diagonal_kth_order(tape, x, k)` | Exact k-th order diagonal (dynamic, arena-based) |
| `stde::diagonal_kth_order_const::<K>(tape, x)` | Exact k-th order diagonal (const-generic, stack-allocated) |
| `stde::diagonal_kth_order_stochastic(tape, x, k, indices)` | Stochastic k-th order diagonal via subsampling |
| `stde::dense_stde_2nd(tape, x, cholesky, z)` | Dense STDE for positive-definite 2nd-order operators |
| `stde::stde_sparse(tape, x, dist, indices)` | Sparse STDE for arbitrary differential operators |
| `stde::parabolic_diffusion(tape, x, sigma)` | Parabolic PDE σ-transform diffusion operator |
| `stde::directional_derivatives(tape, x, dirs)` | Batched 1st and 2nd order directional derivatives |
| `stde::laplacian_with_control(tape, x, dirs, ctrl)` | Laplacian with control variate variance reduction |
| `stde::laplacian_hutchpp(tape, x, dirs_s, dirs_g)` | Hutch++ Laplacian estimator (lower variance) |
| `stde::divergence(tape, x, dirs)` | Stochastic divergence (trace of Jacobian) estimator |
| `stde::estimate(estimator, tape, x, dirs)` | Run an `Estimator` with Welford statistics |
| `stde::estimate_weighted(estimator, tape, x, dirs, weights)` | Importance-weighted estimator (West's algorithm) |

### Differential Operators (requires `diffop`)

@@ -215,6 +221,10 @@ echidna = { version = "0.3", features = ["bytecode", "taylor"] }
| `diffop::hessian(tape, x)` | Full Hessian via jet extraction |
| `JetPlan::plan(n, indices)` | Plan once for a set of multi-indices |
| `diffop::eval_dyn(plan, tape, x)` | Evaluate a plan at a new point (reuses precomputed slot assignments) |
| `DiffOp::new(n, terms)` | Differential operator from `(coefficient, MultiIndex)` pairs |
| `DiffOp::laplacian(n)` / `biharmonic(n)` / `diagonal(n, k)` | Common operator constructors |
| `diffop.eval(tape, x)` | Evaluate operator at a point via `JetPlan` |
| `diffop.sparse_distribution()` | Build `SparseSamplingDistribution` for stochastic estimation |

### GPU (requires `gpu-wgpu` or `gpu-cuda`)

Expand All @@ -229,6 +239,8 @@ echidna = { version = "0.3", features = ["bytecode", "taylor"] }
| `stde_gpu::laplacian_gpu(ctx, bufs, x, dirs)` | GPU-accelerated Laplacian estimator |
| `stde_gpu::hessian_diagonal_gpu(ctx, bufs, x)` | GPU-accelerated exact Hessian diagonal |
| `stde_gpu::laplacian_with_control_gpu(ctx, bufs, x, dirs, ctrl)` | GPU Laplacian with control variate |
| `stde_gpu::laplacian_gpu_cuda(ctx, bufs, x, dirs)` | CUDA-accelerated Laplacian estimator |
| `stde_gpu::hessian_diagonal_gpu_cuda(ctx, bufs, x)` | CUDA-accelerated exact Hessian diagonal |

CUDA additionally supports `_f64` variants of all methods.

@@ -267,7 +279,7 @@ cargo bench # Forward + reverse
cargo bench --features bytecode --bench bytecode # Bytecode tape
cargo bench --features stde --bench taylor # Taylor mode
cargo bench --features gpu-wgpu --bench gpu # GPU
-cargo bench --features "bytecode,nalgebra" --bench comparison # vs other libraries
+cargo bench --features bytecode --bench comparison # vs other libraries
```

### Gradient benchmark (Rosenbrock, lower is better)
@@ -296,7 +308,7 @@ The [`echidna-optim`](echidna-optim/) crate provides optimization solvers and im

```toml
[dependencies]
-echidna-optim = "0.3"
+echidna-optim = "0.4"
```

Optional features: `parallel` (enables rayon parallelism via `echidna/parallel`), `sparse-implicit` (sparse implicit differentiation via faer).
6 changes: 3 additions & 3 deletions SECURITY.md
@@ -4,8 +4,8 @@

| Version | Supported |
|---------|-----------|
-| 0.3.x | Yes |
-| < 0.3 | No |
+| 0.4.x | Yes |
+| < 0.4 | No |

Only the latest minor release receives security updates.

@@ -30,7 +30,7 @@ echidna uses `unsafe` in the following scoped contexts:

| Location | Purpose |
|----------|---------|
-| `tape.rs`, `bytecode_tape.rs`, `taylor_dyn.rs` | Thread-local mutable pointer dereference for tape/arena access. Each is RAII-guarded: the pointer is valid for the lifetime of the enclosing scope guard. |
+| `tape.rs`, `bytecode_tape/thread_local.rs`, `taylor_dyn.rs` | Thread-local mutable pointer dereference for tape/arena access. Each is RAII-guarded: the pointer is valid for the lifetime of the enclosing scope guard. |
| `checkpoint.rs` | Byte-level transmutation (`&[F]` ↔ `&[u8]`) for disk-backed checkpoint serialisation. Relies on `F: Float` being `f32`/`f64` (IEEE 754, no padding). |
| `gpu/cuda_backend.rs` | FFI kernel launches via `cudarc`. Each call passes validated buffer sizes and grid dimensions to the CUDA driver. |
| `traits/simba_impls.rs` | `extract_unchecked` / `replace_unchecked` for simba's `SimdValue` trait. Scalar types have only one lane, so the index is always 0. |
27 changes: 25 additions & 2 deletions docs/adr-deferred-work.md
@@ -2,8 +2,9 @@

**Date**: 2026-02-25
**Status**: Accepted
**Last updated**: 2026-02-25 (added STDE deferred items)

-All core roadmap items (R1–R13) and Phases 1–5 are complete. This ADR captures every evaluated-but-not-implemented item with explicit reasoning, so future planning doesn't re-investigate the same paths.
+All core roadmap items (R1–R13), Phases 1–5, and the STDE deferred items plan are complete. This ADR captures every evaluated-but-not-implemented item with explicit reasoning, so future planning doesn't re-investigate the same paths.

---

@@ -12,12 +13,26 @@ All core roadmap items (R1–R13) and Phases 1–5 are complete. This ADR captur
| Item | Blocker | Revisit When |
|------|---------|--------------|
| RUSTSEC-2024-0436 (paste/simba) | Upstream simba must release with paste alternative | simba publishes a new release |
-| nalgebra 0.33 → 0.34 | nalgebra 0.34 requires Rust 1.87+; MSRV is 1.80 | MSRV is raised to 1.87+ |

---

## Deferred (valuable, not yet needed)

| Item | Reasoning | Revisit When |
|------|-----------|--------------|
| nalgebra 0.33 → 0.34 | MSRV blocker resolved (now 1.93, nalgebra 0.34 requires 1.87+). May have API changes in `NalgebraVec` and sparse matrix types. | When convenient; test thoroughly for API changes. |
| Indefinite dense STDE | Eigendecomposition-based approach for indefinite coefficient matrices C (6 parameters, sign-splitting into C⁺ - C⁻) adds significant API complexity. The positive-definite Cholesky case (`dense_stde_2nd`) covers most PDE use cases (Fokker-Planck, Black-Scholes, HJB). Users with indefinite C can manually split into C⁺ - C⁻, compute Cholesky factors for each, and call `dense_stde_2nd` twice. | A concrete user need for indefinite C arises |
| General-K GPU Taylor kernels | GPU Taylor kernel is hardcoded to K=3 (second-order). Hardcoding allows complete loop unrolling — critical for GPU performance. General K would need dynamic loops or a family of K-specialized kernels. | Need for GPU-accelerated 3rd+ order derivatives |
| Chunked GPU Taylor dispatch | Working buffer is `3 * num_variables * batch_size * 4` bytes. WebGPU's 128 MB limit caps `num_variables * batch_size ≤ ~10M`. For larger problems, the dispatch function should chunk the batch and accumulate results. | Users hit the buffer limit in practice |
| CUDA `laplacian_with_control_gpu` | `laplacian_with_control_gpu` is currently wgpu-only. CUDA equivalent (`laplacian_with_control_gpu_cuda`) is straightforward to add — same CPU-side Welford aggregation, just dispatches through `CudaContext`. | CUDA users need variance-reduced Laplacian |
| `taylor_forward_2nd_batch` in `GpuBackend` trait | Currently an inherent method on each backend, not part of the `GpuBackend` trait. Adding to the trait would enable generic code over backends but requires an associated type for Taylor results. | Multiple backends need to be used generically for Taylor |

---

## Rejected (evaluated, explicit reasoning)

### Core AD

| Item | Reasoning | What exists instead |
|------|-----------|---------------------|
| Constant deduplication (5.2) | `FloatBits` orphan rule blocks impl for `Dual<F>`. CSE handles ops over duplicate constants. | CSE pass in bytecode tape |
@@ -28,3 +43,11 @@ All core roadmap items (R1–R13) and Phases 1–5 are complete. This ADR captur
| Preaccumulation of straight-line segments | Superseded by cross-country Markowitz elimination (R5), which achieves the same goal more generally. | `jacobian_cross_country()` |
| Trait impl macros for num_traits | ~3,200 lines are mechanical but readable. Macros hurt error messages, IDE support, and debuggability. Low maintenance cost since impls rarely change. | Manual trait impls |
| DiffOperator trait abstraction | Concrete estimators work well as standalone functions. `Estimator` trait (5.1) already provides the needed abstraction. Another trait layer would over-abstract. | `Estimator` trait + concrete functions |

### STDE Paper Items

| Item | Paper Section | Reasoning | What exists instead |
|------|---------------|-----------|---------------------|
| Multi-pushforward correction | App F, Eq 48-52 | Current collision-free prime-window approach in `diagonal_kth_order` is simpler and equally effective for practical operators. The correction addresses collisions in multi-index sampling that don't occur with the prime-window design. | `diagonal_kth_order`, `diagonal_kth_order_const` |
| Amortized PINN training | §5.1, Eq 25 | Application-level integration — amortizing STDE samples across training steps is a training loop concern, not a core AD primitive. Users can implement this in their training code using existing `stde::*` functions. | `stde::laplacian`, `stde::stde_sparse` |
| Weight sharing across network layers | App G | Network architecture concern — sharing Taylor jet computation across layers with shared weights requires network-level orchestration, not AD-level support. | — |
11 changes: 6 additions & 5 deletions docs/algorithms.md
@@ -72,7 +72,7 @@ inputs. The tape uses a Structure-of-Arrays (SoA) layout: separate contiguous
arrays for opcodes, operand indices, and constant values. This layout improves
cache utilization — the opcode stream is compact enough to fit in L1 cache for
moderate-sized tapes, and the separation allows independent vectorization of
-forward and reverse sweeps. The tape currently defines 38 opcodes covering all
+forward and reverse sweeps. The tape currently defines 44 opcodes covering all
standard elemental operations (arithmetic, transcendental, power, comparison).

A key advantage of the graph representation is that it enables optimization
@@ -505,11 +505,12 @@ numerically intensive phase. Custom operations are rejected at tape upload
time since they cannot be compiled to GPU code.

The wgpu backend (feature `gpu-wgpu`) provides cross-platform GPU access via
-Metal, Vulkan, and DX12. It implements four WGSL compute shaders: `forward`
+Metal, Vulkan, and DX12. It implements five WGSL compute shaders: `forward`
 (batched forward evaluation), `reverse` (batched adjoint sweep),
-`tangent_forward` (forward tangent for JVP and sparse Jacobian), and
-`tangent_reverse` (forward-over-reverse for HVP and sparse Hessian). All 43
-opcodes are implemented in each shader. The backend is limited to f32 due to
+`tangent_forward` (forward tangent for JVP and sparse Jacobian),
+`tangent_reverse` (forward-over-reverse for HVP and sparse Hessian), and
+`taylor_forward_2nd` (batched second-order Taylor forward propagation for STDE).
+All 44 opcodes are implemented in each shader. The backend is limited to f32 due to
WGSL's lack of native f64 support. The CUDA backend (feature `gpu-cuda`)
targets NVIDIA GPUs and supports both f32 and f64. It uses a single templated
CUDA kernel file compiled via NVRTC at runtime, instantiated for both `float`
6 changes: 3 additions & 3 deletions docs/roadmap.md
@@ -5,7 +5,7 @@
This roadmap synthesizes:
- Deferred items from all implementation phases to date
- Remaining phases from the [book breakdown](book-breakdown.md)
-- Taylor mode AD and stochastic estimators from the STDE paper (Schatz et al., NeurIPS 2024)
+- Taylor mode AD and stochastic estimators from the STDE paper (Shi et al., NeurIPS 2024)
- Organic improvements identified during implementation

---
@@ -70,7 +70,7 @@ Given `y' = f(y)`, computes the Taylor expansion of `y(t)` to order K via coeffi
Key file: `src/bytecode_tape.rs`.

### R2. Stochastic Taylor Derivative Estimators (STDE)
-**STDE paper (Schatz et al., NeurIPS 2024)**
+**STDE paper (Shi et al., NeurIPS 2024)**

Builds on R1. Uses random jets to estimate differential operators without computing full derivative tensors.

@@ -219,7 +219,7 @@ Two GPU backends for batched tape evaluation:

**CUDA backend** (`gpu-cuda` feature) — NVIDIA only, f32 + f64. Single templated CUDA kernel file `tape_eval.cu` with four kernels (`forward_eval`, `reverse_sweep`, `tangent_forward`, `tangent_reverse`), compiled via NVRTC at runtime for both float and double. `CudaContext` mirrors the wgpu API surface plus f64 variants: `forward_batch_f64`, `gradient_batch_f64`, `sparse_jacobian_f64`, `sparse_hessian_f64`, `hvp_batch_f64`. 6 CUDA tests (f32 + f64).

-CPU-side sparsity detection and graph coloring drive the GPU tangent sweeps. Custom ops are rejected at upload time (`GpuTapeData::from_tape`). All 38 opcodes implemented in all four shader/kernel variants.
+CPU-side sparsity detection and graph coloring drive the GPU tangent sweeps. Custom ops are rejected at upload time (`GpuTapeData::from_tape`). All 44 opcodes implemented in all five shader/kernel variants.

Key files: `src/gpu/mod.rs`, `src/gpu/wgpu_backend.rs`, `src/gpu/cuda_backend.rs`, `src/gpu/shaders/forward.wgsl`, `src/gpu/shaders/reverse.wgsl`, `src/gpu/shaders/tangent_forward.wgsl`, `src/gpu/shaders/tangent_reverse.wgsl`, `src/gpu/kernels/tape_eval.cu`, `tests/gpu_wgpu_tests.rs`, `tests/gpu_cuda_tests.rs`.
