Merged
16 changes: 15 additions & 1 deletion CHANGELOG.md
@@ -7,6 +7,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [0.4.0] - 2026-02-26

### Changed

#### Internal Architecture
- **BytecodeTape decomposition**: split 2,689-line monolithic `bytecode_tape.rs` into a directory module with 10 focused submodules (`forward.rs`, `reverse.rs`, `tangent.rs`, `jacobian.rs`, `sparse.rs`, `optimize.rs`, `taylor.rs`, `parallel.rs`, `serde_support.rs`, `thread_local.rs`). Zero public API changes; benchmarks confirm no performance impact.
- Deduplicated reverse sweep in `gradient_with_buf()` and `sparse_jacobian_par()` — both now call shared `reverse_sweep_core()` instead of inlining the loop. `gradient_with_buf` gains the zero-adjoint skip optimization it was previously missing.
- Bumped `nalgebra` dependency from 0.33 to 0.34

### Fixed
- Corrected opcode variant count in documentation (44 variants, not 38/43)
- Fixed CONTRIBUTING.md MSRV reference (1.93, not 1.80)

## [0.3.0] - 2026-02-25

### Added
@@ -162,7 +175,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Forward-vs-reverse cross-validation on Rosenbrock, Beale, Ackley, Booth, and more
- Criterion benchmarks for forward overhead and reverse gradient

-[Unreleased]: https://github.com/Entrolution/echidna/compare/v0.3.0...HEAD
+[Unreleased]: https://github.com/Entrolution/echidna/compare/v0.4.0...HEAD
+[0.4.0]: https://github.com/Entrolution/echidna/compare/v0.3.0...v0.4.0
[0.3.0]: https://github.com/Entrolution/echidna/compare/v0.2.0...v0.3.0
[0.2.0]: https://github.com/Entrolution/echidna/compare/v0.1.0...v0.2.0
[0.1.0]: https://github.com/Entrolution/echidna/releases/tag/v0.1.0
50 changes: 42 additions & 8 deletions CONTRIBUTING.md
@@ -21,7 +21,7 @@ This project is governed by the [Contributor Covenant Code of Conduct](CODE_OF_C

### Prerequisites

-- Rust 1.80 or later (install via [rustup](https://rustup.rs/))
+- Rust 1.93 or later (install via [rustup](https://rustup.rs/))
- Cargo (included with Rust)

### Building
@@ -37,9 +37,19 @@ cargo build --release
### Testing

```bash
-# Run all tests
+# Run core tests (no feature flags)
cargo test

# Run with all CPU features
cargo test --features "bytecode,taylor,laurent,stde,diffop,serde,faer,nalgebra,ndarray"

# Run GPU tests (requires hardware)
cargo test --features "bytecode,gpu-wgpu"
cargo test --features "bytecode,gpu-cuda"

# Run echidna-optim tests
cargo test -p echidna-optim

# Run tests with output
cargo test -- --nocapture

@@ -65,12 +75,29 @@ cargo doc --no-deps
### Benchmarks

```bash
-# Run all benchmarks
+# Run core benchmarks (forward + reverse)
cargo bench

-# Run a specific benchmark
-cargo bench --bench forward
-cargo bench --bench reverse
# Bytecode tape
cargo bench --features bytecode --bench bytecode

# Taylor mode
cargo bench --features stde --bench taylor

# STDE estimators
cargo bench --features stde --bench stde

# Differential operators
cargo bench --features diffop --bench diffop

# Cross-country, sparse, nonsmooth
cargo bench --features bytecode --bench advanced

# GPU
cargo bench --features gpu-wgpu --bench gpu

# Comparison vs other libraries
cargo bench --features bytecode --bench comparison
```

### Security Audits
@@ -136,7 +163,7 @@ src/
├── api.rs # Public API: grad, jvp, vjp, jacobian, hessian, ...
├── breverse.rs # BReverse<F> bytecode-tape reverse variable [bytecode]
├── bytecode_tape.rs # BytecodeTape SoA representation [bytecode]
-├── opcode.rs # Opcode definitions and dispatch [bytecode]
+├── opcode.rs # Opcode definitions and dispatch (44 opcodes) [bytecode]
├── sparse.rs # Sparsity detection and graph coloring [bytecode]
├── cross_country.rs # Markowitz vertex elimination [bytecode]
├── nonsmooth.rs # Branch tracking, Clarke Jacobian [bytecode]
@@ -146,7 +173,14 @@ src/
├── taylor_ops.rs # Shared Taylor propagation rules [taylor]
├── laurent.rs # Laurent<F, K> singularity analysis [laurent]
├── stde.rs # Stochastic derivative estimators [stde]
-├── gpu/ # GPU acceleration [gpu-wgpu, gpu-cuda]
+├── diffop.rs # Arbitrary differential operators via jets [diffop]
+├── gpu/
+│ ├── mod.rs # GpuBackend trait, GpuTapeData, GpuError
+│ ├── wgpu_backend.rs # WgpuContext (Metal/Vulkan/DX12, f32) [gpu-wgpu]
+│ ├── cuda_backend.rs # CudaContext (NVIDIA, f32+f64) [gpu-cuda]
+│ ├── stde_gpu.rs # GPU-accelerated STDE functions [stde]
+│ ├── shaders/ # 5 WGSL compute shaders [gpu-wgpu]
+│ └── kernels/ # CUDA kernels (tape_eval.cu, taylor_eval.cu) [gpu-cuda]
├── faer_support.rs # faer integration [faer]
├── nalgebra_support.rs # nalgebra integration [nalgebra]
├── ndarray_support.rs # ndarray integration [ndarray]
6 changes: 3 additions & 3 deletions Cargo.toml
@@ -3,7 +3,7 @@ members = [".", "echidna-optim"]

[package]
name = "echidna"
-version = "0.3.0"
+version = "0.4.0"
edition = "2021"
rust-version = "1.93"
description = "A high-performance automatic differentiation library for Rust"
@@ -26,7 +26,7 @@ rayon = { version = "1", optional = true }
serde = { version = "1", features = ["derive"], optional = true }
ndarray = { version = "0.17", optional = true }
faer = { version = "0.24", optional = true }
-nalgebra = { version = "0.33", optional = true }
+nalgebra = { version = "0.34", optional = true }
wgpu = { version = "28", optional = true }
bytemuck = { version = "1", features = ["derive"], optional = true }
pollster = { version = "0.4", optional = true }
@@ -37,7 +37,7 @@ criterion = { version = "0.8", features = ["html_reports"] }
approx = "0.5"
serde_json = "1"
ciborium = "0.2"
-nalgebra = "0.33"
+nalgebra = "0.34"
num-dual = "0.11"
ad_trait = "0.2.2"
tempfile = "3"
20 changes: 16 additions & 4 deletions README.md
@@ -46,7 +46,7 @@ Add to `Cargo.toml`:

```toml
[dependencies]
-echidna = "0.3"
+echidna = "0.4"
```

### Gradient via reverse mode
@@ -157,7 +157,7 @@ assert!((d2_dxdy - 2.0).abs() < 1e-6);
Enable features in `Cargo.toml`:

```toml
-echidna = { version = "0.3", features = ["bytecode", "taylor"] }
+echidna = { version = "0.4", features = ["bytecode", "taylor"] }
```

## API
@@ -200,12 +200,18 @@ echidna = { version = "0.3", features = ["bytecode", "taylor"] }
| `tape.ode_taylor_step::<K>(y0)` | ODE Taylor integration step |
| `stde::laplacian(tape, x, dirs)` | Stochastic Laplacian estimate |
| `stde::hessian_diagonal(tape, x)` | Stochastic Hessian diagonal |
| `stde::diagonal_kth_order(tape, x, k)` | Exact k-th order diagonal (dynamic, arena-based) |
| `stde::diagonal_kth_order_const::<K>(tape, x)` | Exact k-th order diagonal (const-generic, stack-allocated) |
| `stde::diagonal_kth_order_stochastic(tape, x, k, indices)` | Stochastic k-th order diagonal via subsampling |
| `stde::dense_stde_2nd(tape, x, cholesky, z)` | Dense STDE for positive-definite 2nd-order operators |
| `stde::stde_sparse(tape, x, dist, indices)` | Sparse STDE for arbitrary differential operators |
| `stde::parabolic_diffusion(tape, x, sigma)` | Parabolic PDE σ-transform diffusion operator |
| `stde::directional_derivatives(tape, x, dirs)` | Batched 1st and 2nd order directional derivatives |
| `stde::laplacian_with_control(tape, x, dirs, ctrl)` | Laplacian with control variate variance reduction |
| `stde::laplacian_hutchpp(tape, x, dirs_s, dirs_g)` | Hutch++ Laplacian estimator (lower variance) |
| `stde::divergence(tape, x, dirs)` | Stochastic divergence (trace of Jacobian) estimator |
| `stde::estimate(estimator, tape, x, dirs)` | Run an `Estimator` with Welford statistics |
| `stde::estimate_weighted(estimator, tape, x, dirs, weights)` | Importance-weighted estimator (West's algorithm) |

### Differential Operators (requires `diffop`)

@@ -215,6 +221,10 @@ echidna = { version = "0.3", features = ["bytecode", "taylor"] }
| `diffop::hessian(tape, x)` | Full Hessian via jet extraction |
| `JetPlan::plan(n, indices)` | Plan once for a set of multi-indices |
| `diffop::eval_dyn(plan, tape, x)` | Evaluate a plan at a new point (reuses precomputed slot assignments) |
| `DiffOp::new(n, terms)` | Differential operator from `(coefficient, MultiIndex)` pairs |
| `DiffOp::laplacian(n)` / `biharmonic(n)` / `diagonal(n, k)` | Common operator constructors |
| `diffop.eval(tape, x)` | Evaluate operator at a point via `JetPlan` |
| `diffop.sparse_distribution()` | Build `SparseSamplingDistribution` for stochastic estimation |

### GPU (requires `gpu-wgpu` or `gpu-cuda`)

Expand All @@ -229,6 +239,8 @@ echidna = { version = "0.3", features = ["bytecode", "taylor"] }
| `stde_gpu::laplacian_gpu(ctx, bufs, x, dirs)` | GPU-accelerated Laplacian estimator |
| `stde_gpu::hessian_diagonal_gpu(ctx, bufs, x)` | GPU-accelerated exact Hessian diagonal |
| `stde_gpu::laplacian_with_control_gpu(ctx, bufs, x, dirs, ctrl)` | GPU Laplacian with control variate |
| `stde_gpu::laplacian_gpu_cuda(ctx, bufs, x, dirs)` | CUDA-accelerated Laplacian estimator |
| `stde_gpu::hessian_diagonal_gpu_cuda(ctx, bufs, x)` | CUDA-accelerated exact Hessian diagonal |

CUDA additionally supports `_f64` variants of all methods.

@@ -267,7 +279,7 @@ cargo bench # Forward + reverse
cargo bench --features bytecode --bench bytecode # Bytecode tape
cargo bench --features stde --bench taylor # Taylor mode
cargo bench --features gpu-wgpu --bench gpu # GPU
-cargo bench --features "bytecode,nalgebra" --bench comparison # vs other libraries
+cargo bench --features bytecode --bench comparison # vs other libraries
```

### Gradient benchmark (Rosenbrock, lower is better)
@@ -296,7 +308,7 @@ The [`echidna-optim`](echidna-optim/) crate provides optimization solvers and im

```toml
[dependencies]
-echidna-optim = "0.3"
+echidna-optim = "0.4"
```

Optional features: `parallel` (enables rayon parallelism via `echidna/parallel`), `sparse-implicit` (sparse implicit differentiation via faer).
6 changes: 3 additions & 3 deletions SECURITY.md
@@ -4,8 +4,8 @@

| Version | Supported |
|---------|-----------|
-| 0.3.x | Yes |
-| < 0.3 | No |
+| 0.4.x | Yes |
+| < 0.4 | No |

Only the latest minor release receives security updates.

@@ -30,7 +30,7 @@ echidna uses `unsafe` in the following scoped contexts:

| Location | Purpose |
|----------|---------|
-| `tape.rs`, `bytecode_tape.rs`, `taylor_dyn.rs` | Thread-local mutable pointer dereference for tape/arena access. Each is RAII-guarded: the pointer is valid for the lifetime of the enclosing scope guard. |
+| `tape.rs`, `bytecode_tape/thread_local.rs`, `taylor_dyn.rs` | Thread-local mutable pointer dereference for tape/arena access. Each is RAII-guarded: the pointer is valid for the lifetime of the enclosing scope guard. |
| `checkpoint.rs` | Byte-level transmutation (`&[F]` ↔ `&[u8]`) for disk-backed checkpoint serialisation. Relies on `F: Float` being `f32`/`f64` (IEEE 754, no padding). |
| `gpu/cuda_backend.rs` | FFI kernel launches via `cudarc`. Each call passes validated buffer sizes and grid dimensions to the CUDA driver. |
| `traits/simba_impls.rs` | `extract_unchecked` / `replace_unchecked` for simba's `SimdValue` trait. Scalar types have only one lane, so the index is always 0. |
27 changes: 25 additions & 2 deletions docs/adr-deferred-work.md
@@ -2,8 +2,9 @@

**Date**: 2026-02-25
**Status**: Accepted
**Last updated**: 2026-02-25 (added STDE deferred items)

-All core roadmap items (R1–R13) and Phases 1–5 are complete. This ADR captures every evaluated-but-not-implemented item with explicit reasoning, so future planning doesn't re-investigate the same paths.
+All core roadmap items (R1–R13), Phases 1–5, and the STDE deferred items plan are complete. This ADR captures every evaluated-but-not-implemented item with explicit reasoning, so future planning doesn't re-investigate the same paths.

---

@@ -12,12 +13,26 @@ All core roadmap items (R1–R13) and Phases 1–5 are complete. This ADR captur
| Item | Blocker | Revisit When |
|------|---------|--------------|
| RUSTSEC-2024-0436 (paste/simba) | Upstream simba must release with paste alternative | simba publishes a new release |
-| nalgebra 0.33 → 0.34 | nalgebra 0.34 requires Rust 1.87+; MSRV is 1.80 | MSRV is raised to 1.87+ |

---

## Deferred (valuable, not yet needed)

| Item | Reasoning | Revisit When |
|------|-----------|--------------|
| nalgebra 0.33 → 0.34 | MSRV blocker resolved (now 1.93, nalgebra 0.34 requires 1.87+). May have API changes in `NalgebraVec` and sparse matrix types. | When convenient; test thoroughly for API changes. |
| Indefinite dense STDE | Eigendecomposition-based approach for indefinite coefficient matrices C (6 parameters, sign-splitting into C⁺ - C⁻) adds significant API complexity. The positive-definite Cholesky case (`dense_stde_2nd`) covers most PDE use cases (Fokker-Planck, Black-Scholes, HJB). Users with indefinite C can manually split into C⁺ - C⁻, compute Cholesky factors for each, and call `dense_stde_2nd` twice. | A concrete user need for indefinite C arises |
| General-K GPU Taylor kernels | GPU Taylor kernel is hardcoded to K=3 (second-order). Hardcoding allows complete loop unrolling — critical for GPU performance. General K would need dynamic loops or a family of K-specialized kernels. | Need for GPU-accelerated 3rd+ order derivatives |
| Chunked GPU Taylor dispatch | Working buffer is `3 * num_variables * batch_size * 4` bytes. WebGPU's 128 MB limit caps `num_variables * batch_size ≤ ~10M`. For larger problems, the dispatch function should chunk the batch and accumulate results. | Users hit the buffer limit in practice |
| CUDA `laplacian_with_control_gpu` | `laplacian_with_control_gpu` is currently wgpu-only. CUDA equivalent (`laplacian_with_control_gpu_cuda`) is straightforward to add — same CPU-side Welford aggregation, just dispatches through `CudaContext`. | CUDA users need variance-reduced Laplacian |
| `taylor_forward_2nd_batch` in `GpuBackend` trait | Currently an inherent method on each backend, not part of the `GpuBackend` trait. Adding to the trait would enable generic code over backends but requires an associated type for Taylor results. | Multiple backends need to be used generically for Taylor |

---

## Rejected (evaluated, explicit reasoning)

### Core AD

| Item | Reasoning | What exists instead |
|------|-----------|---------------------|
| Constant deduplication (5.2) | `FloatBits` orphan rule blocks impl for `Dual<F>`. CSE handles ops over duplicate constants. | CSE pass in bytecode tape |
@@ -28,3 +43,11 @@ All core roadmap items (R1–R13) and Phases 1–5 are complete. This ADR captur
| Preaccumulation of straight-line segments | Superseded by cross-country Markowitz elimination (R5), which achieves the same goal more generally. | `jacobian_cross_country()` |
| Trait impl macros for num_traits | ~3,200 lines are mechanical but readable. Macros hurt error messages, IDE support, and debuggability. Low maintenance cost since impls rarely change. | Manual trait impls |
| DiffOperator trait abstraction | Concrete estimators work well as standalone functions. `Estimator` trait (5.1) already provides the needed abstraction. Another trait layer would over-abstract. | `Estimator` trait + concrete functions |

### STDE Paper Items

| Item | Paper Section | Reasoning | What exists instead |
|------|---------------|-----------|---------------------|
| Multi-pushforward correction | App F, Eq 48-52 | Current collision-free prime-window approach in `diagonal_kth_order` is simpler and equally effective for practical operators. The correction addresses collisions in multi-index sampling that don't occur with the prime-window design. | `diagonal_kth_order`, `diagonal_kth_order_const` |
| Amortized PINN training | §5.1, Eq 25 | Application-level integration — amortizing STDE samples across training steps is a training loop concern, not a core AD primitive. Users can implement this in their training code using existing `stde::*` functions. | `stde::laplacian`, `stde::stde_sparse` |
| Weight sharing across network layers | App G | Network architecture concern — sharing Taylor jet computation across layers with shared weights requires network-level orchestration, not AD-level support. | — |
11 changes: 6 additions & 5 deletions docs/algorithms.md
@@ -72,7 +72,7 @@ inputs. The tape uses a Structure-of-Arrays (SoA) layout: separate contiguous
arrays for opcodes, operand indices, and constant values. This layout improves
cache utilization — the opcode stream is compact enough to fit in L1 cache for
moderate-sized tapes, and the separation allows independent vectorization of
-forward and reverse sweeps. The tape currently defines 38 opcodes covering all
+forward and reverse sweeps. The tape currently defines 44 opcodes covering all
standard elemental operations (arithmetic, transcendental, power, comparison).

A key advantage of the graph representation is that it enables optimization
@@ -505,11 +505,12 @@ numerically intensive phase. Custom operations are rejected at tape upload
time since they cannot be compiled to GPU code.

The wgpu backend (feature `gpu-wgpu`) provides cross-platform GPU access via
-Metal, Vulkan, and DX12. It implements four WGSL compute shaders: `forward`
+Metal, Vulkan, and DX12. It implements five WGSL compute shaders: `forward`
 (batched forward evaluation), `reverse` (batched adjoint sweep),
-`tangent_forward` (forward tangent for JVP and sparse Jacobian), and
-`tangent_reverse` (forward-over-reverse for HVP and sparse Hessian). All 43
-opcodes are implemented in each shader. The backend is limited to f32 due to
+`tangent_forward` (forward tangent for JVP and sparse Jacobian),
+`tangent_reverse` (forward-over-reverse for HVP and sparse Hessian), and
+`taylor_forward_2nd` (batched second-order Taylor forward propagation for STDE).
+All 44 opcodes are implemented in each shader. The backend is limited to f32 due to
WGSL's lack of native f64 support. The CUDA backend (feature `gpu-cuda`)
targets NVIDIA GPUs and supports both f32 and f64. It uses a single templated
CUDA kernel file compiled via NVRTC at runtime, instantiated for both `float`
6 changes: 3 additions & 3 deletions docs/roadmap.md
@@ -5,7 +5,7 @@
This roadmap synthesizes:
- Deferred items from all implementation phases to date
- Remaining phases from the [book breakdown](book-breakdown.md)
-- Taylor mode AD and stochastic estimators from the STDE paper (Schatz et al., NeurIPS 2024)
+- Taylor mode AD and stochastic estimators from the STDE paper (Shi et al., NeurIPS 2024)
- Organic improvements identified during implementation

---
@@ -70,7 +70,7 @@ Given `y' = f(y)`, computes the Taylor expansion of `y(t)` to order K via coeffi
Key file: `src/bytecode_tape.rs`.

### R2. Stochastic Taylor Derivative Estimators (STDE)
-**STDE paper (Schatz et al., NeurIPS 2024)**
+**STDE paper (Shi et al., NeurIPS 2024)**

Builds on R1. Uses random jets to estimate differential operators without computing full derivative tensors.

@@ -219,7 +219,7 @@ Two GPU backends for batched tape evaluation:

**CUDA backend** (`gpu-cuda` feature) — NVIDIA only, f32 + f64. Single templated CUDA kernel file `tape_eval.cu` with four kernels (`forward_eval`, `reverse_sweep`, `tangent_forward`, `tangent_reverse`), compiled via NVRTC at runtime for both float and double. `CudaContext` mirrors the wgpu API surface plus f64 variants: `forward_batch_f64`, `gradient_batch_f64`, `sparse_jacobian_f64`, `sparse_hessian_f64`, `hvp_batch_f64`. 6 CUDA tests (f32 + f64).

-CPU-side sparsity detection and graph coloring drive the GPU tangent sweeps. Custom ops are rejected at upload time (`GpuTapeData::from_tape`). All 38 opcodes implemented in all four shader/kernel variants.
+CPU-side sparsity detection and graph coloring drive the GPU tangent sweeps. Custom ops are rejected at upload time (`GpuTapeData::from_tape`). All 44 opcodes implemented in all five shader/kernel variants.

Key files: `src/gpu/mod.rs`, `src/gpu/wgpu_backend.rs`, `src/gpu/cuda_backend.rs`, `src/gpu/shaders/forward.wgsl`, `src/gpu/shaders/reverse.wgsl`, `src/gpu/shaders/tangent_forward.wgsl`, `src/gpu/shaders/tangent_reverse.wgsl`, `src/gpu/kernels/tape_eval.cu`, `tests/gpu_wgpu_tests.rs`, `tests/gpu_cuda_tests.rs`.
