fix: compiled fused-training — standard-Adam default, OCP dispatch, MlpForward wiring, loud fallback #1469

Open
ooples wants to merge 18 commits into master from perf/fused-training-gate
@ooples ooples commented May 29, 2026

Closes #1447 (wire AIsEval fused primitives into NN layers + builder — framework side of AiDotNet.Tensors#436)
Closes #1470 (Transformer per-call training stalls — Noam LR frozen on the fused path)

Wires AiDotNet's NN training/inference onto AiDotNet.Tensors' fused-compiled kernels (the compile-once / replay-many path the Tensors micro-benchmarks beat PyTorch-CPU on), via open/closed self-describing interfaces — no central enum whitelist to keep in sync. Found by profiling the AIsEval PyTorch-vs-AiDotNet benchmark, where every Train() step silently fell back to the eager tape.

1. Default-optimizer gate (root cause of "compiled does nothing")

GetOrCreateBaseOptimizer defaulted UseAMSGrad = true (a non-standard band-aid from #1350); the fused mapper rejected AMSGrad, so every default-optimizer model fell back to the eager tape — silently. Reverted to standard Adam (matches PyTorch/TF/Optax, all default amsgrad=False).

2. Open/closed fused dispatch (replaces type-switch + enum whitelist)

  • IFusedOptimizerSpec — optimizers self-describe their FusedOptimizerConfig; no central catalog.
  • IFusedActivation — activations self-declare their FusedActivationType.
  • TryGetFusedLrSchedule — LR schedulers map to the per-step fused LrSchedule (the fused kernel evaluates GetLr(step) every optimizer step, exactly like PyTorch fused=True).
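The self-describing dispatch in section 2 can be illustrated with a minimal sketch. Every type name below (`FusedConfigSketch`, `IFusedSpecSketch`, `AdamSketch`, `NoKernelOptimizerSketch`, `FusedDispatchSketch`) is a hypothetical stand-in invented for illustration and does not reproduce the actual AiDotNet signatures; the point is only that the dispatcher asks the object instead of consulting a central list.

```csharp
using System;

// Hypothetical stand-in for a fused optimizer config; fields are illustrative only.
public readonly struct FusedConfigSketch
{
    public FusedConfigSketch(string kernel, double learningRate)
    {
        Kernel = kernel;
        LearningRate = learningRate;
    }

    public string Kernel { get; }
    public double LearningRate { get; }
}

// An optimizer that has a fused kernel opts in by implementing this interface.
public interface IFusedSpecSketch
{
    bool TryGetFusedConfig(out FusedConfigSketch config);
}

public sealed class AdamSketch : IFusedSpecSketch
{
    public double LearningRate { get; set; } = 1e-3;
    public bool UseAMSGrad { get; set; }

    // Explicit implementation keeps the plumbing off the concrete public surface.
    bool IFusedSpecSketch.TryGetFusedConfig(out FusedConfigSketch config)
    {
        // The optimizer selects its own kernel variant; no caller-side switch is needed.
        config = new FusedConfigSketch(UseAMSGrad ? "AMSGrad" : "Adam", LearningRate);
        return true;
    }
}

// An optimizer with no fused kernel simply does not implement the interface.
public sealed class NoKernelOptimizerSketch { }

public static class FusedDispatchSketch
{
    // No type switch and no enum whitelist: unknown optimizers fall through to eager.
    public static bool TryMap(object optimizer, out FusedConfigSketch config)
    {
        if (optimizer is IFusedSpecSketch spec)
            return spec.TryGetFusedConfig(out config);

        config = default;
        return false;
    }
}
```

Under this shape, adding a new fuse-able optimizer means adding one interface implementation, and `FusedDispatchSketch.TryMap` never has to change.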

3. #1470 — Adam+Noam on the fused fast path (true adaptive-LR fix)

Bumped AiDotNet.Tensors 0.86.6 → 0.88.0 for LrSchedule.Noam (Tensors #504). TryGetFusedLrSchedule now maps NoamSchedule → LrSchedule.Noam(d, warmup, factor) (replacing an eager-fallback workaround). The default Transformer recipe (Adam β₂=0.98 + Noam) now trains on the fused path with a correct per-step warmup ramp, bit-identical to the eager schedule. Verified: a default-Noam Transformer per-call Train engages the fused path 3200/3200 steps and converges (PPL 5.06, top-1 7/8) instead of freezing at the uniform floor.
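For reference, the Noam schedule as introduced with the original Transformer is `lr = factor · d^-0.5 · min(step^-0.5, step · warmup^-1.5)`. The sketch below evaluates that formula per step. The class name `NoamSketch`, the 1-based step convention, and the parameter names are assumptions made for illustration; whether `LrSchedule.Noam` uses exactly this indexing is not confirmed by this PR text.

```csharp
using System;

public static class NoamSketch
{
    // Standard Noam learning rate. `step` is assumed to be 1-based here.
    public static double GetLr(long step, int dModel, int warmupSteps, double factor = 1.0)
    {
        if (step < 1) throw new ArgumentOutOfRangeException(nameof(step));
        if (dModel < 1) throw new ArgumentOutOfRangeException(nameof(dModel));
        if (warmupSteps < 1) throw new ArgumentOutOfRangeException(nameof(warmupSteps));

        double s = step;
        double ramp = s * Math.Pow(warmupSteps, -1.5); // linear warmup branch
        double decay = Math.Pow(s, -0.5);              // inverse-sqrt decay branch
        return factor * Math.Pow(dModel, -0.5) * Math.Min(ramp, decay);
    }
}
```

Because the warmup branch grows linearly with `step`, a schedule that is evaluated once and then held would stay at its small initial value, which is consistent with the stall described in #1470; evaluating it on every optimizer step restores the ramp.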

4. Proper wiring — Predict → MlpForward

FeedForwardNeuralNetwork.Predict runs a pure dense+fused-activation stack as one IEngine.MlpForward call instead of the per-layer tape walk, via the activation interface. Falls back to generic Forward for anything unrepresentable.
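The control flow of that fast path (validate every layer first, take one fused pass if all are representable, otherwise use the generic forward) can be sketched as below. This is a toy model with hypothetical names (`DenseSketch`, `FusedKindSketch`, `MlpPredictSketch`); the loop standing in for the fused call is not `IEngine.MlpForward`, and the uninitialized-weights check is only an illustration of a lazy-init guard.

```csharp
using System;
using System.Collections.Generic;

public enum FusedKindSketch { None, ReLU }

public sealed class DenseSketch
{
    public DenseSketch(double[,] weights, double[] bias,
                       Func<double, double> activation, FusedKindSketch? fusedKind)
    {
        Weights = weights; Bias = bias; Activation = activation; FusedKind = fusedKind;
    }

    public double[,] Weights { get; }           // shape [outputs, inputs]
    public double[] Bias { get; }
    public Func<double, double> Activation { get; }
    public FusedKindSketch? FusedKind { get; }  // null = no fused equivalent
}

public static class MlpPredictSketch
{
    public static double[] Predict(IReadOnlyList<DenseSketch> layers, double[] input, out bool usedFused)
    {
        usedFused = TryFusedPredict(layers, input, out var output);
        return usedFused ? output : GenericForward(layers, input);
    }

    private static bool TryFusedPredict(IReadOnlyList<DenseSketch> layers, double[] input, out double[] output)
    {
        output = input;
        // Decide eligibility before doing any work, so a fallback never sees partial results.
        foreach (var layer in layers)
            if (layer.FusedKind == null || layer.Weights.Length == 0) return false;

        // Stand-in for the single fused call over the whole stack.
        foreach (var layer in layers)
        {
            var kind = layer.FusedKind.Value;
            output = Affine(layer, output, x => ApplyFused(kind, x));
        }
        return true;
    }

    private static double[] GenericForward(IReadOnlyList<DenseSketch> layers, double[] input)
    {
        var output = input;
        foreach (var layer in layers) output = Affine(layer, output, layer.Activation);
        return output;
    }

    private static double ApplyFused(FusedKindSketch kind, double x) =>
        kind == FusedKindSketch.ReLU ? Math.Max(0.0, x) : x;

    private static double[] Affine(DenseSketch layer, double[] x, Func<double, double> act)
    {
        int outs = layer.Weights.GetLength(0), ins = layer.Weights.GetLength(1);
        var y = new double[outs];
        for (int o = 0; o < outs; o++)
        {
            double sum = layer.Bias[o];
            for (int i = 0; i < ins; i++) sum += layer.Weights[o, i] * x[i];
            y[o] = act(sum);
        }
        return y;
    }
}
```

Checking eligibility up front, instead of discovering an unmapped activation halfway through the stack, keeps the fallback decision all-or-nothing.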

5. Loud fallback (observability)

The fused path silently fell back at the default diagnostic level. Now emits a one-time warning per model naming the reason (suppressible via AIDOTNET_QUIET).

Coverage being completed on this branch

AiDotNet.Tensors 0.88.0 exposes 37 fused activations, 22 fused optimizers, 8 fused LR-schedule shapes. This PR expands AiDotNet's mappings toward full coverage, each gated by a numerical-parity test (fused result == eager result within tolerance) so no optimizer/activation is wired to a kernel whose math differs:

  • Schedulers: Constant, Cosine, Exponential, Noam, + Step / Cyclic / OneCycle / LinearWarmupCosine.
  • Optimizers: Adam, AdamW, SGD(+momentum), + the remaining fused kernels whose AiDotNet update math matches (AMSGrad, AdaMax, Nadam, RAdam, Adagrad, RMSprop, AdaDelta, Lion, …).
  • Activations: ReLU/Sigmoid/Tanh/Identity, + the remaining exact-equivalent fused shapes (GELU, LeakyReLU, ELU, Softplus, Mish, Swish, HardSwish/HardSigmoid/HardTanh, ReLU6, SoftSign, CELU, …).
  • Benchmarks: AIsEval PyTorch-parity harness re-run to confirm the compiled training plan beats PyTorch compiled (torch.compile) on the target shapes.

Builds clean on net10.0 + net471.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Broad fused activation support and a fused MLP inference fast path.
    • Fused/compiled optimizer training paths added for many optimizers (Adam variants, SGD, RMSProp, Adagrad, AdaDelta, AdaMax, LAMB, Lion, Nadam).
  • Improvements

    • Learning-rate schedules integrated with compiled training; improved fused-path diagnostics and fallback behavior.
    • Default Adam tuned for fused-kernel alignment.
  • Documentation

    • Added PyTorch parity benchmark suite.
  • Tests

    • New parity and integration tests validating fused inference and fused training.
  • Dependencies

    • Updated native libraries to v0.90.2.

✅ Coverage completed + verified (parity-gated)

All wirings gated by a numerical-parity test; anything that diverged was left on eager and documented.

LR schedulers → LrSchedule: Constant, Cosine (fixed a pre-existing off-by-one), Exponential, Noam, Step, Cyclic(triangular). Step-for-step parity. (OneCycle deferred — AiDotNet uses linear warmup vs the kernel's cosine.)

Optimizers → fused kernels (fused-vs-eager training parity, Adam as control, all maxAbsDiff=0): SGD, Adam, AdamW, AMSGrad, AdaMax, Nadam, RMSprop, Adagrad, Lion, AdaDelta, LAMB. Expanded the stale 4-type allowlist → 20 and generalized the per-type fallback latch. (LARS/FTRL need params the config can't carry; RAdam/ASGD/Rprop have no AiDotNet class.)

Activations → FusedActivationType (parity ≤5e-7 via identity-weight FusedLinear): added Mish, SELU, Softplus, SoftSign, Sign, BentIdentity, Gaussian, LiSHT, SQRBF, ReLU6, HardSwish (on top of ReLU/Sigmoid/Tanh/Identity/GELU/LeakyReLU/SiLU/Swish). The gate caught HardSigmoid (0.2 slope vs kernel's x/6) — left on eager. Parametric (ELU/CELU/HardTanh/…) and RReLU/softmax-family deferred (documented).
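A parity gate of the kind described here can be sketched as a max-absolute-difference check over a sample grid. The sketch below is illustrative only: `ParityGateSketch` is a hypothetical name, the `+ 0.5` offset and the clamp to [0, 1] in both HardSigmoid variants are assumptions (the text above states only the two slopes), and the real tests compare kernel output rather than two scalar functions.

```csharp
using System;

public static class ParityGateSketch
{
    // Largest |eager(x) - fused(x)| over `samples` evenly spaced points in [lo, hi].
    public static double MaxAbsDiff(Func<double, double> eager, Func<double, double> fused,
                                    double lo, double hi, int samples)
    {
        if (samples < 2) throw new ArgumentOutOfRangeException(nameof(samples));
        double max = 0.0;
        for (int i = 0; i < samples; i++)
        {
            double x = lo + (hi - lo) * i / (samples - 1);
            max = Math.Max(max, Math.Abs(eager(x) - fused(x)));
        }
        return max;
    }

    // An activation is only wired to the fused kernel when this returns true.
    public static bool PassesGate(Func<double, double> eager, Func<double, double> fused,
                                  double tolerance = 5e-7) =>
        MaxAbsDiff(eager, fused, -4.0, 4.0, 801) <= tolerance;

    // Slope-0.2 variant (offset and clamp are assumptions).
    public static double HardSigmoidSlope02(double x) =>
        Math.Min(1.0, Math.Max(0.0, 0.2 * x + 0.5));

    // x/6 variant (offset and clamp are assumptions).
    public static double HardSigmoidSixth(double x) =>
        Math.Min(1.0, Math.Max(0.0, x / 6.0 + 0.5));
}
```

With these assumed definitions the two variants differ by several hundredths near x = 2.5, far above a 5e-7 tolerance, so a gate like `PassesGate` would reject the mapping and leave the activation on the eager path.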

Benchmark — vs torch.compile (TorchInductor), MLP CPU, steady-state

| Metric | AiDotNet fused | torch.compile | Verdict |
|---|---|---|---|
| Training (compiled training plan) | ~0.014 s/epoch | ~0.084 s/epoch | AiDotNet ~6× faster |
| Inference (Predict latency, bs32) | p95 1.27 ms | mean 0.32 ms | torch.compile ~4× faster |

The compiled training plan beats torch.compile (~6×, and ~15× over AiDotNet eager). Inference-latency vs TorchInductor is an honest remaining gap. Harness: benchmarks/AiDotNet.PyTorchParity (now with --compile).

The compiled fused-optimizer training path (CompiledTapeTrainingStep) is the
fast path and is attempted on every Train() step (EnableCompilation defaults
true). When its gates aren't met it silently falls back to the eager autograd
tape — a multi-x perf cliff with ZERO signal at the default diagnostic level.
A user can "enable compilation" and unknowingly train on the slow path forever.

This was found via the AIsEval benchmark: a bare FeedForwardNeuralNetwork with
the default AdamOptimizer had every step rejected by TryMapToFusedOptimizerConfig
("optimizer AdamOptimizer not compatible with fused kernel") and fell back
invisibly — only TrainingDiagnosticsConfig at PerStep surfaced it.

Emit a one-time Trace.TraceWarning per model instance the first time the fused
path doesn't engage, naming the reason and how to re-enable it. Gated by a
one-shot flag so it never spams the per-step training loop, and suppressible
via AIDOTNET_QUIET. PerStep diagnostics still give per-step detail.
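The one-shot behavior described above can be sketched with an atomic latch. `FusedFallbackNotifierSketch` and the message text are hypothetical, and treating any non-empty `AIDOTNET_QUIET` value as "suppress" is an assumption; the PR text does not state which values the variable accepts.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public sealed class FusedFallbackNotifierSketch
{
    private int _logged; // 0 = not yet warned, 1 = already warned

    // Returns true only on the call that actually emitted the warning.
    public bool NotifyFallback(string modelName, string reason)
    {
        // Assumption: any non-empty value of AIDOTNET_QUIET suppresses the warning.
        if (!string.IsNullOrEmpty(Environment.GetEnvironmentVariable("AIDOTNET_QUIET")))
            return false;

        // Atomically flip the latch; every caller after the first sees 1 and stays silent.
        if (Interlocked.Exchange(ref _logged, 1) == 1)
            return false;

        Trace.TraceWarning(
            "{0}: fused training path did not engage ({1}); falling back to the eager tape.",
            modelName, reason);
        return true;
    }
}
```

Holding one notifier per model instance gives the "one warning per model" behavior, and the latch keeps the per-step training loop free of repeated log writes.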

This is observability only — no behavior change to training itself.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
coderabbitai Bot commented May 29, 2026

Walkthrough

Adds IFusedActivation and IFusedOptimizerSpec contracts; implements TryGetFusedActivation across many activations (with parameter gating where needed); implements a fused dense+activation MLP inference fast-path; maps LR schedulers for fused optimizers and adds IFusedOptimizerSpec implementations for many optimizers; improves compiled-training fused gating and adds tests plus a PyTorch-parity benchmark project with Python scripts.

Changes

Fused Execution Infrastructure

| Layer | File(s) | Summary |
|---|---|---|
| Fused activation contract & implementations | `src/ActivationFunctions/Fused/IFusedActivation.cs`, `src/ActivationFunctions/*Activation.cs` | Adds public IFusedActivation.TryGetFusedActivation(out FusedActivationType) and implements it across many activations (identity→None, LeakyReLU alpha gating, SiLU→Swish, GELU documented numeric form; ELU/Mish intentionally non-claimed or documented). |
| Fused MLP inference fast-path | `src/NeuralNetworks/FeedForwardNeuralNetwork.cs` | Predict enters inference mode, attempts TryFusedDensePredict which validates dense-only layers and scalar IFusedActivation mappings, collects weights/biases, calls AiDotNetEngine.MlpForward, and falls back to Forward on ineligibility or InvalidOperationException while restoring training-mode. |
| Fused optimizer contract | `src/Optimizers/Fused/IFusedOptimizerSpec.cs` | Adds internal FusedOptimizerConfig and IFusedOptimizerSpec.TryGetFusedOptimizerConfig(out FusedOptimizerConfig) for fused-kernel optimizer configuration extraction. |
| Optimizers & LR schedule mapping | `src/Optimizers/*`, `src/Optimizers/GradientBasedOptimizerBase.cs` | Multiple optimizers implement IFusedOptimizerSpec.TryGetFusedOptimizerConfig returning fused configs when adaptive LR is disabled and a fused LR schedule is available; TryGetFusedLrSchedule maps supported schedulers (null/constant, CosineAnnealing, Exponential, Noam, Step, symmetric Triangular cyclic) to fused LrSchedule. |
| NeuralNetworkBase: fused optimizer dispatch & fallback logging | `src/NeuralNetworks/NeuralNetworkBase.cs` | TryMapToFusedOptimizerConfig dispatches via IFusedOptimizerSpec.TryGetFusedOptimizerConfig; adds a one-time fused-fallback TraceWarning guarded by _loggedFusedFallback; changes default base optimizer to Adam with UseAMSGrad=false. |
| Compiled training step: fused gating & latch | `src/Training/CompiledTapeTrainingStep.cs` | Expands fused optimizer allowlist (includes AMSGrad); introduces per-thread _fusedUnavailableTypes to remember optimizer types that failed fused execution and skip retries; latch cleared on Invalidate(). |
| Tests | `tests/*` | Adds integration/unit tests validating fused optimizer engagement, fused activation numeric parity, LR schedule parity, LeakyReLU parameter gating, Mish/ELU non-claiming, and Predict stability on lazy initialization. |

PyTorch-parity Benchmarks

| Layer | File(s) | Summary |
|---|---|---|
| Solution & project files | `AiDotNet.sln`, `benchmarks/AiDotNet.PyTorchParity/*` | Adds new benchmark project to solution, project file referencing src/AiDotNet.csproj, and .gitignore for benchmark artifacts. |
| C# benchmark harness & models | `benchmarks/AiDotNet.PyTorchParity/Program.cs` | Adds harness that runs configured models (mlp, mlp-fused, cnn, lstm, transformer) for training and inference, measures timing, peak RSS, optional nvidia-smi, and writes indented JSON report. |
| PyTorch benchmark & compare scripts | `benchmarks/AiDotNet.PyTorchParity/pytorch/*` | Adds Python parity benchmark, compare tool, requirements, and README describing parity workflow and measurement conventions. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

feature, dependencies, testing, architecture, priority:p0

"A fused kernel hums beneath the code,
Activations mapped, the fast-path strode.
Adam learns while schedules climb,
Benchmarks hum and tests mark time.
Merge the harness — let metrics rhyme."

…to Predict

Two issues, same PR.

(1) Default-optimizer gate — the real reason "compiled training does nothing":
GetOrCreateBaseOptimizer defaulted UseAMSGrad=true (added in #1350 as a weak,
non-standard band-aid for post-convergence drift on a couple of recurrent
models — partial at best). AMSGrad-by-default is non-standard (PyTorch/TF/Optax
all default amsgrad=False) and, because the fused kernel mapper didn't accept
AMSGrad, it silently forced EVERY model onto the eager tape. Reverted to
standard Adam, restoring both the industry default and the fused fast path.

(2) Open/closed fused dispatch (replaces the type-switch / enum mapping):
- IFusedOptimizerSpec: optimizers that have a fused SIMD kernel self-describe
  their FusedOptimizerConfig. Only Adam/AdamW/SGD implement it (the only kernels
  that exist), so there's no central whitelist and no `OptimizerType is (… or …)`
  list to maintain — a new optimizer becomes fuse-able by implementing the
  interface. Adam/AdamW self-select the AMSGrad kernel variant when UseAMSGrad is
  set, so opt-in AMSGrad keeps the fast path (matching PyTorch's fused/compiled
  amsgrad) instead of being rejected. Scheduler→LrSchedule mapping moved to a
  shared GradientBasedOptimizerBase helper.
- IFusedActivation: activations with an exact fused equivalent (ReLU, Sigmoid,
  Tanh, Identity→None) self-declare their FusedActivationType. GELU intentionally
  omitted until tanh-approx-vs-erf equivalence is verified.

(3) Proper wiring — FeedForwardNeuralNetwork.Predict now runs a pure dense+fused-
activation stack through IEngine.MlpForward (one fused call) instead of the
per-layer tape walk, via the activation interface (no switch). Falls back to the
generic Forward for anything the kernel can't represent (non-dense, vector
activation, unmapped/mixed activations) or if MlpForward declines under a tape.

CompiledTapeTrainingStep now also accepts OptimizerType.AMSGrad (the AVX2
AMSGradUpdateSimd kernel already exists in Tensors); the companion Tensors PR
wires it through CompiledTrainingPlan's supported set + vMax buffer so opt-in
AMSGrad fully runs fused. Until that lands, AMSGrad falls back loudly (no wrong
update). Builds clean on net10.0 + net471.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ooples ooples changed the title from "fix: make the compiled fused-training fallback loud instead of silent" to "fix: compiled fused-training — standard-Adam default, OCP dispatch, MlpForward wiring, loud fallback" May 29, 2026
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/Training/CompiledTapeTrainingStep.cs (1)

351-353: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update fused-optimizer constraints docs to include AMSGrad

The XML constraint text is now stale; code allows OptimizerType.AMSGrad (Line 408). Please update this block so behavior/docs stay aligned.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/Training/CompiledTapeTrainingStep.cs` around lines 351 - 353, The XML doc
for CompiledTapeTrainingStep currently lists only SGD, Adam, and AdamW as
supported by CompiledTrainingPlan{T}.ConfigureOptimizer but the code also allows
OptimizerType.AMSGrad; update the <item> text to include AMSGrad (and optionally
rephrase to "SGD, Adam, AdamW, and AMSGrad") and keep the note about using the
plain Step method or the eager tape path for other optimizer types so the
documentation matches the behavior of ConfigureOptimizer and references to Step
and the eager tape path remain accurate.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/ActivationFunctions/Fused/IFusedActivation.cs`:
- Around line 19-23: The IFusedActivation interface is an internal plumbing
contract and must not be public; change its declaration from public to internal
(i.e., make IFusedActivation internal) and ensure the related type
FusedActivationType is also internal or otherwise not exposed publicly so
callers outside the assembly cannot reference this plumbing; update any
implementing classes (names referencing IFusedActivation) to match the new
internal visibility and run compilation to fix any accessibility errors caused
by this change.

In `@src/ActivationFunctions/ReLUActivation.cs`:
- Around line 27-31: The ReLUActivation<T> class exposes the fused-kernel
plumbing member FusedActivationType as a public property; change it to an
explicit interface implementation so it is not part of the concrete public API.
Replace the public property declaration with an explicit implementation of
Fused.IFusedActivation.FusedActivationType (i.e. implement
AiDotNet.Tensors.Engines.FusedActivationType
Fused.IFusedActivation.FusedActivationType =>
AiDotNet.Tensors.Engines.FusedActivationType.ReLU) inside the ReLUActivation<T>
class so the member is only accessible via the Fused.IFusedActivation interface.

In `@src/ActivationFunctions/SigmoidActivation.cs`:
- Around line 36-40: The public FusedActivationType property on
SigmoidActivation<T> is exposing internal routing metadata; remove the public
auto-property and implement AiDotNet.Tensors.Engines.FusedActivationType as an
explicit interface member for Fused.IFusedActivation (i.e., implement
Fused.IFusedActivation.FusedActivationType =>
AiDotNet.Tensors.Engines.FusedActivationType.Sigmoid) inside the
SigmoidActivation<T> class so it is not part of the concrete public API but
still satisfies the interface contract.

In `@src/NeuralNetworks/NeuralNetworkBase.cs`:
- Around line 6666-6672: GetOrCreateBaseOptimizer relies on
AdamOptimizerOptions' constructor default for UseAMSGrad; make the non-AMSGrad
behavior explicit by creating the AdamOptimizerOptions instance, set its
UseAMSGrad property to false, and pass that options object into the
AdamOptimizer constructor (i.e., modify GetOrCreateBaseOptimizer to instantiate
a Models.Options.AdamOptimizerOptions<T, Tensor<T>, Tensor<T>> options variable,
set options.UseAMSGrad = false, then new AdamOptimizer(..., options)).
- Around line 5328-5341: The warning should not be emitted when compilation is
explicitly disabled; update the condition around the _loggedFusedFallback branch
to also check TensorCodecOptions.Current.EnableCompilation (or
TensorCodecOptions.EnableCompilation) and only log the fused-training fallback
when compilation is enabled and _mixedPrecisionContext is null; keep the
existing checks for _loggedFusedFallback, _mixedPrecisionContext, the
AIDOTNET_QUIET env var, and include _pendingFusedMissReason/GetType().Name in
the message as before so the alert only appears for unexpected fallbacks when
compilation was intended to run.

In `@src/Optimizers/AdamOptimizer.cs`:
- Around line 126-144: The public TryGetFusedOptimizerConfig method on
AdamOptimizer should be removed from the public API and implemented as an
explicit IFusedOptimizerSpec member; change the public method into an explicit
interface implementation of IFusedOptimizerSpec.TryGetFusedOptimizerConfig
within the AdamOptimizer class, keeping the existing logic (including the
UseAdaptiveLearningRate check, TryGetFusedLrSchedule call, and construction of
Fused.FusedOptimizerConfig using _options, GetCurrentLearningRate(),
TryGetFusedLrSchedule and the AMSGrad branch) so the behavior is identical but
the method is no longer part of the concrete public surface.

In `@src/Optimizers/AdamWOptimizer.cs`:
- Around line 146-162: The public method TryGetFusedOptimizerConfig on
AdamWOptimizer should be converted to an explicit interface implementation so
fused internals are not exposed on the concrete AdamWOptimizer API; locate the
TryGetFusedOptimizerConfig method (and its use of Fused.FusedOptimizerConfig,
TryGetFusedLrSchedule, GetCurrentLearningRate and _options) and change its
signature to implement the fused interface explicitly (e.g.
IFusedOptimizerSpec.TryGetFusedOptimizerConfig) rather than a public member,
retaining the existing logic and return behavior so callers through the
interface still receive the Fused.FusedOptimizerConfig while the concrete
AdamWOptimizer class no longer exposes the fused plumbing publicly.

In `@src/Optimizers/Fused/IFusedOptimizerSpec.cs`:
- Around line 18-25: The Fused optimizer plumbing types are exposed publicly but
should be internal; change the accessibility of FusedOptimizerConfig (currently
declared as "public readonly record struct FusedOptimizerConfig(...)") to
internal (e.g., "internal readonly record struct FusedOptimizerConfig(...)") and
likewise change the other plumbing types mentioned around lines 51-58 to
internal; update any callers within the project (tests or other internal
classes) to use the now-internal types (no API consumers should be affected) and
ensure the file compiles after switching the access modifiers.

In `@src/Optimizers/StochasticGradientDescentOptimizer.cs`:
- Around line 70-81: The public TryGetFusedOptimizerConfig method on
StochasticGradientDescentOptimizer is exposing internal fused-config plumbing;
change it to an explicit interface implementation so it is not part of the
concrete public API. Locate the public bool TryGetFusedOptimizerConfig(out
Fused.FusedOptimizerConfig config) on class StochasticGradientDescentOptimizer
and convert it to an explicit implementation of the appropriate internal
interface (keep the logic that checks _options.UseAdaptiveLearningRate, calls
TryGetFusedLrSchedule, and uses GetCurrentLearningRate to build the
Fused.FusedOptimizerConfig), removing the public modifier so only the interface
exposes this member. Ensure the signature matches the internal interface exactly
and update accessibility accordingly.

In `@src/Training/CompiledTapeTrainingStep.cs`:
- Around line 397-408: The hot path currently admits
AiDotNet.Tensors.Engines.Compilation.OptimizerType.AMSGrad but relies on
exception-catching inside TryStepWithFusedOptimizer (and the per-step warning at
line ~640) to handle unsupported runtime builds, causing per-step exceptions and
log churn; modify the control flow so ConfigureOptimizer (or the plan selection
logic) performs a cheap runtime capability check for AMSGrad support (e.g.,
whether vMax buffer and FusedOptimizer.AMSGradUpdateSimd kernel are available)
and return a boolean/capability flag that the calling code
(CompiledTapeTrainingStep.TryStepWithFusedOptimizer) inspects before entering
the fused path, falling back once deterministically to the eager tape with a
single one-time warning instead of relying on exceptions; update usage of
optimizerType, ConfigureOptimizer, and TryStepWithFusedOptimizer to gate AMSGrad
by that capability flag.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e67cea0d-07a3-4271-8787-dd0f2f62624a

📥 Commits

Reviewing files that changed from the base of the PR and between d05f43e and f8da771.

📒 Files selected for processing (14)
  • src/ActivationFunctions/Fused/IFusedActivation.cs
  • src/ActivationFunctions/GELUActivation.cs
  • src/ActivationFunctions/IdentityActivation.cs
  • src/ActivationFunctions/ReLUActivation.cs
  • src/ActivationFunctions/SigmoidActivation.cs
  • src/ActivationFunctions/TanhActivation.cs
  • src/NeuralNetworks/FeedForwardNeuralNetwork.cs
  • src/NeuralNetworks/NeuralNetworkBase.cs
  • src/Optimizers/AdamOptimizer.cs
  • src/Optimizers/AdamWOptimizer.cs
  • src/Optimizers/Fused/IFusedOptimizerSpec.cs
  • src/Optimizers/GradientBasedOptimizerBase.cs
  • src/Optimizers/StochasticGradientDescentOptimizer.cs
  • src/Training/CompiledTapeTrainingStep.cs

franklinic and others added 2 commits May 29, 2026 16:13
… wiring

Convert IFusedActivation from a FusedActivationType property to
TryGetFusedActivation(out type), so a parametric activation whose parameter
differs from the kernel's hardcoded value reports no fused equivalent and
stays on the exact generic path instead of silently inheriting the kernel
default.

Wire the 8 activations whose kernel is numerically identical through the
shipped Tensors FusedLinear/MlpForward path (resolved via
CpuFusedOperations._floatActivations/_doubleActivations): ReLU, Sigmoid,
Tanh, Identity(None), GELU(tanh-approx), Swish, SiLU(=Swish), LeakyReLU.
Each is locked by a parity test asserting the fused kernel equals the
scalar Activate() on the same pre-activation (<1e-4).

Unwire Mish and ELU: that path's activation tables register only
None/ReLU/GELU/Sigmoid/Tanh/LeakyReLU/Swish, so routing Mish/ELU would
throw (their formulas live only in the unrelated BlasManaged
ActivationEpilogue). MishAndElu_ReportNoFusedKernel locks the contract;
adding the kernels is tracked by AiDotNet.Tensors #499.

Fix LeakyReLU fused guard tolerance (1e-12 -> 1e-6): the default 0.01 slope
round-trips through float as 0.009999999776, ~2.2e-10 off the literal, so
1e-12 wrongly rejected the default.
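The tolerance issue is easy to reproduce in isolation. The sketch below combines the `TryGet...(out ...)` gating shape with the float round-trip check; `LeakyReluGuardSketch`, `KernelSlope`, and the string kernel name are hypothetical stand-ins, and only the 0.01 slope and the two tolerances come from the commit message above.

```csharp
using System;

public static class LeakyReluGuardSketch
{
    // Slope assumed to be hardcoded in the fused kernel (0.01 per the commit message).
    public const double KernelSlope = 0.01;

    // Reports a fused equivalent only when alpha matches the kernel's slope within tolerance;
    // otherwise the caller stays on the exact generic path.
    public static bool TryGetFused(double alpha, double tolerance, out string kernel)
    {
        if (Math.Abs(alpha - KernelSlope) <= tolerance)
        {
            kernel = "LeakyReLU";
            return true;
        }

        kernel = string.Empty;
        return false;
    }

    // What a default alpha looks like after being stored as a float and widened again.
    public static double RoundTripThroughFloat(double alpha)
    {
        float stored = (float)alpha;
        return stored;
    }
}
```

`RoundTripThroughFloat(0.01)` is about 2.2e-10 below the literal, so a 1e-12 tolerance rejects the default slope while 1e-6 accepts it and still rejects a genuinely different alpha such as 0.2.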

Guard DenseLayer lazy-init in TryFusedDensePredict: a fresh network's first
Predict has [0,0] sentinel weights that MlpForward rejects, so bail to the
generic Forward for that call (Predict_FreshLazyNetwork_DoesNotThrow).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ngage fused path

Adds DefaultOptimizer_EngagesFusedPath_NotSilentEagerFallback: trains with no
explicitly-supplied optimizer (optimizer: null → GetOrCreateBaseOptimizer) and
asserts CompiledTapeTrainingStep.GetFusedStepCount() > 0. The default optimizer
previously constructed Adam with UseAMSGrad=true, which TryMapToFusedOptimizerConfig
rejected — silently demoting every default-configured model to the eager tape so
compiled training never ran. This regression test fails loudly if a non-mappable
default is ever reintroduced.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ooples pushed a commit that referenced this pull request May 29, 2026
In-repo twin of AIsEval's aidotnet-benchmarks, but referencing the AiDotNet
*source* (ProjectReference, not a NuGet package) so it measures the current
working tree — the validation harness for perf changes a released-package
benchmark can't see (e.g. PR #1469's fused-training gate and the
FeedForwardNeuralNetwork.Predict -> IEngine.MlpForward inference wiring).

Both sides build the same MLP/CNN/LSTM/Transformer models with matching layer
shapes, run the same training + multi-batch (1/8/32/128) inference loop with p95
latency + RSS, and emit the same JSON schema; pytorch/compare.py lines the two
reports up row-by-row (gate: p95(AiDotNet) < mean(PyTorch)). PyTorch runs eager
on purpose so the comparison is kernels-vs-kernels, not compile-stack-vs-stack.
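The gate stated above, p95(AiDotNet) < mean(PyTorch), can be sketched as follows. The actual check lives in `pytorch/compare.py` and is not reproduced here; `LatencyGateSketch` is a hypothetical name, and the nearest-rank percentile definition is an assumption, since the commit message does not say which percentile method the harness uses.

```csharp
using System;
using System.Linq;

public static class LatencyGateSketch
{
    // Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
    public static double Percentile(double[] samples, double p)
    {
        if (samples == null || samples.Length == 0)
            throw new ArgumentException("At least one sample is required.", nameof(samples));
        if (p <= 0.0 || p > 100.0)
            throw new ArgumentOutOfRangeException(nameof(p));

        var sorted = samples.OrderBy(x => x).ToArray();
        int rank = (int)Math.Ceiling(p * sorted.Length / 100.0);
        return sorted[Math.Max(rank, 1) - 1];
    }

    // Tail latency on one side must beat the average on the other side.
    public static bool Passes(double[] aiDotNetMs, double[] pyTorchMs) =>
        Percentile(aiDotNetMs, 95.0) < pyTorchMs.Average();
}
```

Comparing a tail percentile against a mean is a deliberately asymmetric bar: it passes only when nearly all of one side's samples are faster than the other side's typical sample.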

AIDOTNET_FUSED_DIAG=1 prints whether the compiled fused training step engaged
(Hit) and, on fallback, the captured root-cause exception via
CompiledTapeTrainingStep.GetLastFallbackException — which already earned its
keep: it shows FeedForwardNeuralNetwork's AMSGrad-mode default optimizer (chosen
for the #1332 drift fix) hits NotSupportedException in the Tensors
CompiledTrainingPlan, so compiled training silently falls back to eager for the
most common model class. Wiring AMSGrad's existing kernel into the plan dispatch
is tracked by Tensors #74.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ooples ooples force-pushed the perf/fused-training-gate branch from 3e3d6fe to a6851e4 May 29, 2026 20:41
ooples pushed a commit to ooples/AiDotNet.Tensors that referenced this pull request May 29, 2026
CompiledTrainingPlan.ConfigureOptimizer previously threw NotSupportedException
for AMSGrad even though the AVX2 AMSGradUpdateSimd kernel already existed. That
silently demoted AiDotNet's FeedForwardNeuralNetwork — whose default optimizer
is AMSGrad-mode Adam (chosen for the drift fix in ooples/AiDotNet#1332) — to the
eager tape for every training run (the "compiled does nothing" symptom the
in-repo parity harness surfaced via GetLastFallbackException).

Wire AMSGrad into the plan's CPU fused-update closures:
- Add OptimizerType.AMSGrad to ValidatePlanOptimizerSupport.
- Thread a per-parameter vMax buffer (running max of the second moment) through
  all four closure builders (float/double x grouped/ungrouped); allocated only
  when the optimizer is AMSGrad.
- Add an AMSGrad case to each CPU step switch calling AMSGradUpdateSimd, using
  Adam's L2 weight-decay convention (wd=0 for the FFN default).
- Add a double overload of AMSGradUpdateSimd mirroring the float kernel so the
  double-precision plan keeps the same non-decreasing-denominator guarantee.

GPU AMSGrad is intentionally still unsupported (no backend kernel) — a
GPU-resident parameter with AMSGrad throws clearly via the GPU switch default;
the common CPU path (FeedForwardNeuralNetwork on the CPU engine) is unblocked.

Tests (ConfigureOptimizerAMSGradTests): kernel-direct float/double parity against
an independent textbook AMSGrad over a rise-then-fall gradient sequence (so vMax
exceeds v and the max path matters); plan-level AMSGrad updates params in place
(the direct regression), equals Adam on step 1 (vMax==v), and diverges from Adam
over 40 steps (proves vMax is consulted, not aliased). Adam/SGD param-update and
double-path tests still green (no regression from the shared-closure edits).
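
The properties these tests check can be reproduced with a scalar reference. The sketch below is an assumption-laden illustration, not the Tensors kernel: it uses the bias-corrected AMSGrad variant (as in PyTorch), a single scalar parameter, and a made-up rise-then-fall gradient sequence.

```python
import math

def run(grads, amsgrad, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Scalar Adam / AMSGrad; returns the parameter value after each step."""
    p, m, v, v_max = 0.0, 0.0, 0.0, 0.0
    trajectory = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        v_max = max(v_max, v)               # running max of the second moment
        second = v_max if amsgrad else v
        m_hat = m / (1.0 - beta1 ** t)
        denom = math.sqrt(second / (1.0 - beta2 ** t)) + eps
        p -= lr * m_hat / denom
        trajectory.append(p)
    return trajectory

# Gradients rise, then fall, so v drops below v_max from step 4 onward.
grads = [0.1, 1.0, 5.0] + [0.1] * 37
adam = run(grads, amsgrad=False)
ams = run(grads, amsgrad=True)
print(adam[0] == ams[0], abs(adam[-1] - ams[-1]))
```

While the second moment is still rising, v_max equals v and the two optimizers coincide; once the gradients shrink, AMSGrad keeps the larger denominator and takes smaller steps.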

Companion to ooples/AiDotNet#1469: once released and bumped, FeedForwardNeuralNetwork's
AMSGrad default finally engages compiled training instead of falling back.

Closes #500.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…allbacks

API surface (keep fused dispatch plumbing off the concrete public API):
- IFusedOptimizerSpec + FusedOptimizerConfig made internal (compiled-dispatch
  implementation details; only consumed in-assembly via the interface).
- AdamOptimizer / AdamWOptimizer / StochasticGradientDescentOptimizer now
  implement TryGetFusedOptimizerConfig as explicit interface implementations.
- All 8 IFusedActivation implementations (GELU/Identity/LeakyReLU/ReLU/Sigmoid/
  SiLU/Swish/Tanh) now implement TryGetFusedActivation explicitly.

Behavior:
- NeuralNetworkBase: suppress the loud fused-fallback warning when
  TensorCodecOptions.EnableCompilation is false (explicit opt-out, not an
  unexpected fallback).
- GetOrCreateBaseOptimizer pins UseAMSGrad = false explicitly so the fused fast
  path can't silently regress if the AdamOptimizerOptions default ever flips.
- CompiledTapeTrainingStep: latch a per-thread _amsgradFusedUnavailable flag on
  the first AMSGrad fused failure so later AMSGrad steps skip the fused attempt
  instead of reconfigure/throw/catch/warn every step (per-step churn).
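
The latch in the last bullet can be sketched as follows. This is a minimal Python illustration under stated assumptions: the class, method, and attribute names are hypothetical, and NotImplementedError stands in for the NotSupportedException mentioned in this PR.

```python
import threading

class FusedStepRunner:
    """Hypothetical step runner with a fused fast path and an eager fallback."""

    def __init__(self, fused_step, eager_step):
        self._fused_step = fused_step
        self._eager_step = eager_step
        self._state = threading.local()   # per-thread latch
        self.fused_attempts = 0

    def step(self, batch):
        if not getattr(self._state, "fused_unavailable", False):
            self.fused_attempts += 1
            try:
                return self._fused_step(batch)
            except NotImplementedError:
                # Latch once so later steps skip the doomed fused attempt.
                self._state.fused_unavailable = True
        return self._eager_step(batch)

def failing_fused(batch):
    raise NotImplementedError("optimizer not supported by the fused path")

runner = FusedStepRunner(failing_fused, eager_step=lambda batch: sum(batch))
results = [runner.step([1, 2, 3]) for _ in range(5)]
print(results, runner.fused_attempts)
```

Only the first call pays for the failed fused attempt; the remaining calls go straight to the eager path.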

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ooples pushed a commit to ooples/AiDotNet.Tensors that referenced this pull request May 29, 2026
…r dispatch

CpuFusedOperations._floatActivations/_doubleActivations — the tables
FusedLinear/MlpForward resolves pointwise activations through — registered only
None/ReLU/GELU/Sigmoid/Tanh/LeakyReLU/Swish, so MlpForward THREW for the other
pointwise FusedActivationType values. That is exactly the gap that forced
AiDotNet's Mish/ELU activations off the fused inference path.

Adds ELU (alpha=1), SELU, Softplus (with the standard x>20 linear cutoff to avoid
exp overflow), Mish, HardSwish, HardSigmoid, HardTanh to both the float and double
tables, as inlined helper methods mirroring ApplyGelu. MlpForward/FusedLinear now
cover all 14 pointwise activations; only Softmax stays out (it's row-wise, not
pointwise — must be applied separately after the GEMM).

Tests (MlpForwardActivationParityTests): for each new activation, float and
double, MlpForward(activation) equals the canonical scalar formula applied to the
raw x·W (independent textbook reference, <1e-4 float / <1e-9 double).
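
For orientation, scalar reference formulas of the kind such a parity test compares against might look like the Python sketch below. The conventions are assumptions (PyTorch-style HardSigmoid/HardSwish built on ReLU6, the usual SELU constants); the actual Tensors helpers are not shown in this PR and may differ.

```python
import math

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * math.expm1(x)

def selu(x):
    return SELU_SCALE * elu(x, SELU_ALPHA)

def softplus(x, cutoff=20.0):
    # Above the cutoff, log(1 + e^x) is approximately x and exp(x) could overflow.
    return x if x > cutoff else math.log1p(math.exp(x))

def mish(x):
    return x * math.tanh(softplus(x))

def relu6(x):
    return min(max(0.0, x), 6.0)

def hard_sigmoid(x):
    return relu6(x + 3.0) / 6.0

def hard_swish(x):
    return x * hard_sigmoid(x)

def hard_tanh(x):
    return min(max(-1.0, x), 1.0)

print(softplus(1000.0), hard_swish(3.0), mish(0.0))
```

The cutoff branch is what lets softplus accept large inputs without raising an overflow error from exp.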

Unblocks re-wiring AiDotNet's Mish/ELU (and SELU/Softplus/Hard* if classed) to
IFusedActivation once this ships — companion to ooples/AiDotNet#1469.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ooples added a commit to ooples/AiDotNet.Tensors that referenced this pull request May 29, 2026
…ations into the plan/MlpForward dispatch (#501)

* feat(#500): dispatch AMSGrad through CompiledTrainingPlan's fused update

* docs(#500): study PyTorch fused/compiled optimizer internals vs ours

Findings doc for the compiled-training perf work: PyTorch's 3-tier
(for-loop/foreach/fused) optimizer impls, multi_tensor_apply horizontal fusion,
and torch.compile (Inductor) vertical fusion — mapped against what
CompiledTrainingPlan already does well (compile-once-replay, inlined LR
schedule, live-backed in-place writes, AVX2 kernels, epilogue fusion) and the
gaps to close. Surfaced a concrete quick win: the per-param *UpdateSimd kernels
recompute the step-constant bias-correction powers (1-β^t) on every parameter
call; PyTorch computes them once per step. Prioritized action items included.
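
The quick win can be illustrated with a small Python sketch that hoists the step-constant terms out of the per-parameter loop. It treats each parameter as a scalar for brevity, and the function name and defaults are illustrative assumptions rather than the FusedOptimizer API.

```python
def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step over several scalar parameters; updates the lists in place."""
    # Step-constant terms: identical for every parameter, so computed once per step.
    bias1 = 1.0 - beta1 ** t
    bias2 = 1.0 - beta2 ** t
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1.0 - beta1) * g
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
        m_hat = m[i] / bias1
        v_hat = v[i] / bias2
        params[i] -= lr * m_hat / (v_hat ** 0.5 + eps)

params = [1.0, -2.0, 0.5]
m = [0.0] * 3
v = [0.0] * 3
adam_step(params, grads=[0.3, -0.1, 0.2], m=m, v=v, t=1)
print(params)
```

With many parameter tensors per step, the two power evaluations are paid once instead of once per tensor.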

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(#500): wire the remaining float kernel-backed optimizers into the plan dispatch

Extends the AMSGrad dispatch to the other optimizers whose AVX2 kernels already
existed in FusedOptimizer but were rejected by CompiledTrainingPlan (forcing the
eager tape): Nadam, RAdam, LAMB, RMSprop, Adagrad, Lion, SGDMomentum, AdaMax.

- ValidatePlanOptimizerSupport is now dtype-aware: float allows the full set
  (SGD/Adam/AdamW/AMSGrad + the 8 above); double keeps SGD/Adam/AdamW/AMSGrad
  (the new kernels are float-only). Double/GPU use of a float-only optimizer is
  rejected at ConfigureOptimizer rather than configure-then-throw mid-step.
- Buffer allocation: RAdam/LAMB join the m+v set; AdaMax reuses the v slot as its
  infinity-norm u; the others reuse existing m or v.
- Hyperparameter mapping into the generic (lr, beta1, beta2, eps, wd) slots:
  beta2 = RMSprop decay (rho); beta1 = SGD momentum; Lion/LAMB apply decoupled
  weight decay inside their kernels, the rest use Adam's L2 convention.

Still NOT wired (need hyperparameters the ConfigureOptimizer API doesn't carry):
AdaDelta (rho + 2 accumulators), FTRL (l1/l2/lr_power), LARS (trust coeff), ASGD
(lambd/alpha/mu). These are cleanly rejected at configure time so callers fall
back to eager — tracked for a follow-up API extension.

Tests (ConfigureOptimizerFusedDispatchTests): all 12 wired float optimizers
dispatch through the plan, update params in place, stay finite, and move
meaningfully over 5 steps; float-only optimizers on a double plan throw; un-wired
optimizers throw. Kernel math correctness stays covered by the kernels' own
tests (AMSGrad additionally has full kernel-direct parity).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(#500): wire AdaDelta/LARS/FTRL/ASGD/Rprop via a FusedOptimizerExtras API extension

Completes the dispatch for every per-parameter-elementwise kernel that exists in
FusedOptimizer. These five needed hyperparameters the generic
(lr, beta1, beta2, eps, weightDecay) slots don't carry, so this adds an optional
FusedOptimizerExtras object to ConfigureOptimizer / ConfigureOptimizerGrouped
(and the interface) with documented per-field defaults:

- AdaDelta: 2 accumulators (accumGrad=v, accumUpdate=vMax); rho via the beta2 slot.
- LARS:     velocity=m; layer-wise trust ratio from extras.Momentum + TrustCoefficient + wd.
- FTRL:     z=v, n=vMax; extras.L1/L2/LrPower (FTRL owns its regularization).
- ASGD:     ax=m; closure computes eta_t = lr/(1+lambd*lr*t)^alpha and mu_t = 1/max(1,t-t0).
- Rprop:    prevGrad=m, stepSize=v (seeded to extras.RpropInitialStep on step 1);
            extras.RpropEtaPlus/EtaMinus/StepMin/StepMax. No lr (step-size based).
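
As a worked instance of the ASGD formulas in the list above, the Python sketch below evaluates eta_t and mu_t for a given step. The default values are borrowed from PyTorch's torch.optim.ASGD and are an assumption here; the FusedOptimizerExtras defaults are not stated in this message.

```python
def asgd_schedule(t, lr=0.01, lambd=1e-4, alpha=0.75, t0=1_000_000):
    """Return (eta_t, mu_t) for step t using the formulas quoted above."""
    eta_t = lr / (1.0 + lambd * lr * t) ** alpha
    mu_t = 1.0 / max(1, t - t0)
    return eta_t, mu_t

for step in (0, 1_000, 1_000_004):
    print(step, asgd_schedule(step))
```

The learning rate decays slowly from lr, while mu_t stays at 1 until averaging starts after t0.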

FusedOptimizerExtras is a class with property initializers (not a record struct) so
`new FusedOptimizerExtras()` yields the documented defaults rather than all-zero.

Now wired (17 float OptimizerTypes): SGD, SGDMomentum, Adam, AdamW, Adagrad,
RMSprop, Lion, AdaMax, AMSGrad, Nadam, AdaDelta, LARS, LAMB, FTRL, RAdam, ASGD,
Rprop. Still rejected (need a different execution model, fail fast at configure):
SparseAdam (sparse indices), LBFGS (closure line-search), HypergradientSGD /
DAdaptationSGD (GLOBAL cross-parameter reductions — per-tensor would be a
different algorithm), ScheduleFreeSGD (y-buffer written before the forward).

Tests: all 17 wired float optimizers dispatch + update in place + finite + move;
FTRL with a strong L1 drives sparsity (proves extras flow into the kernel, not
ignored); float-only on double throws; the 5 unwireable throw. Updated
FusedAdaptiveLrPlanTests (Lion/LAMB are no longer rejected — now supported).
Adam/SGD/double-path/adaptive-lr regression tests still green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(#500): add the 7 missing pointwise activations to the FusedLinear dispatch

* fix(#500): make the optimizer gate device-aware + label doc code fence (PR review)

CodeRabbit review on PR #501:
- ValidatePlanOptimizerSupport was dtype-aware but not device-aware: AMSGrad and
  the float-only CPU kernels were accepted for any float plan, but the GPU step
  path ships only SGD/Adam/AdamW backend kernels. On a mixed CPU/GPU plan the
  GPU-switch default throw lands AFTER earlier CPU params were updated — a
  partially-applied step. Now ConfigureOptimizer / ConfigureOptimizerGrouped
  detect any GPU-backed parameter and reject non-SGD/Adam/AdamW at configure time
  (atomic, before _optimizerUpdate is published). CPU-only plans are unaffected
  (hasGpuParams=false) — all 46 dispatch/parity tests still green.
- Tagged the unlabeled fenced code block in the research doc as csharp (MD040).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: franklinic <franklin@ivorycloud.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ooples pushed a commit to ooples/AiDotNet.Tensors that referenced this pull request May 29, 2026
… to full pointwise parity

Two fused activation paths existed with different coverage: the
MlpForward/FusedLinear dispatch tables (CpuFusedOperations) reached 14 pointwise
activations in #501, while the BlasManaged ActivationEpilogue stopped at 8
(ReLU/LeakyReLU/Sigmoid/Tanh/GELU/Swish/Mish/ELU). This:

- Adds two new pointwise FusedActivationType kernels: ReLU6 = min(max(0,x),6)
  (MobileNet/quantized) and SoftSign = x/(1+|x|) — wired into the CpuFusedOperations
  float+double tables.
- Brings ActivationEpilogue (fp32 + fp64) to parity with those tables: adds SELU,
  Softplus, HardSwish, HardSigmoid, HardTanh + the new ReLU6, SoftSign.

Both fused paths now cover all 16 pointwise activations. Still out of scope (not
pointwise / need parameter threading, tracked in #499): Softmax & Softmin (row-wise),
and the parametric activations (CELU/ThresholdedReLU/ScaledTanh/PReLU) which need a
parameter carried through the fused path — a follow-up API extension analogous to
FusedOptimizerExtras.

Tests: MlpForwardActivationParityTests + new EpilogueActivationParityTests verify
every new activation (float + double) matches an independent canonical formula
through both fused paths.

Companion to ooples/AiDotNet#1469.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ooples pushed a commit to ooples/AiDotNet.Tensors that referenced this pull request May 30, 2026
Adds an optional FusedActivationParams (Alpha/Beta/Theta, nullable so a missing
field resolves to each activation's canonical default) threaded through the fused
activation paths, so parametric activations fuse with ANY parameter instead of
only the hardcoded value:

- New FusedActivationType values: CELU, ThresholdedReLU, ScaledTanh.
- LeakyReLU now fuses for any slope (not just 0.01); ELU for any alpha (not just 1).
- CELU (alpha), ThresholdedReLU (theta), ScaledTanh (alpha, beta) fuse via params.

Plumbing (optional trailing param, fully back-compatible — null = prior behavior):
- CpuFusedOperations.GetFloatActivation/GetDoubleActivation build a parametric
  closure from the params (falling back to the per-activation default), with the
  non-parametric activations still served by the static dispatch tables.
- ApplyBiasActivationInPlace/Double, CpuEngine.FusedLinear, CpuEngine.MlpForward
  (hidden + output params), and the IEngine interface all carry the optional params.
- DirectGpuTensorEngine.FusedLinear defers to the base CPU params-aware path when
  custom params are supplied (GPU fused kernels don't carry them yet).
- ActivationEpilogue (fp32 + fp64) honors params for LeakyReLU/ELU and implements
  CELU/ThresholdedReLU/ScaledTanh.

Out of scope (documented): PReLU needs a per-channel slope vector (not a scalar) —
a separate kernel signature; the tape/graph training path applies activations via
ActivationRegistry (canonical defaults) — MlpForward is inference-only so the main
consumer path is fully covered; Softmax/Softmin remain row-wise.

Tests: MlpForwardActivationParityTests + EpilogueActivationParityTests gain
parametric cases (LeakyReLU 0.2, ELU 2, CELU 1.5, ThresholdedReLU 0.5,
ScaledTanh 1.7/0.66) verifying both fused paths honor the supplied parameter
(float + double). FusedLinear/MlpForward regression suite green (110 passed).

Companion to ooples/AiDotNet#1469.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ooples pushed a commit to ooples/AiDotNet.Tensors that referenced this pull request May 30, 2026
…ActivationType set

Implements the remaining declared activation types so every FusedActivationType
value now has a working fused kernel in BOTH paths (MlpForward/FusedLinear tables
and the BlasManaged ActivationEpilogue):

- PReLU: per-output-channel learned slope via FusedActivationParams.PReluSlope
  (length = features, or 1 for shared; default 0.25). Applied per output column,
  so it runs in a dedicated channel-aware pass (not the pointwise delegate).
- RReLU: deterministic eval form = leaky with slope (lower+upper)/2 (default
  ≈0.2292, override via Alpha) — fused paths are inference-only.
- Softmax / Softmin: row-wise (over the feature dim) with the standard max-shift
  for numerical stability; Softmin = softmax(-x). Run as a per-row pass after bias.

PReLU/Softmax/Softmin get a dedicated branch at the top of
ApplyBiasActivationInPlace/Double (and matching epilogue cases) because they need
column/row context; the pointwise delegate path and SIMD fast path are unchanged
for every other activation (no regression — existing activations skip the branch).

Tests: per-channel PReLU parity, Softmax/Softmin row-normalization (+ monotonic /
anti-monotonic ordering, rows sum to 1), RReLU added to the parametric set — float
and double, through both MlpForward and the epilogue. 58 activation tests green.
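
The row-wise definitions above can be written as a short Python reference. This is a sketch only: the RReLU bounds 1/8 and 1/3 are assumed (they are PyTorch's defaults and reproduce the ≈0.2292 slope quoted above), and the function names are illustrative.

```python
import math

def softmax(row):
    shift = max(row)                       # max-shift keeps exp() from overflowing
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def softmin(row):
    return softmax([-x for x in row])

def rrelu_eval(x, lower=1.0 / 8.0, upper=1.0 / 3.0):
    slope = (lower + upper) / 2.0          # about 0.2292 with these bounds
    return x if x >= 0 else slope * x

probs = softmax([1000.0, 1001.0, 1002.0])
anti = softmin([1.0, 2.0, 3.0])
print(probs, anti, rrelu_eval(-1.0))
```

Without the shift, exp(1000.0) would overflow; with it, the largest exponent is always zero.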

This closes the activation half of #499: all 23 FusedActivationType values fuse.
Only the specialized activation classes without an enum value (Sparsemax, Maxout,
GumbelSoftmax, etc.) remain unenumerated, marked lower-priority in the issue.

Companion to ooples/AiDotNet#1469.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (1)
src/ActivationFunctions/Fused/IFusedActivation.cs (1)

28-35: ⚠️ Potential issue | 🔴 Critical | ⚖️ Poor tradeoff

BLOCKING: Fused activation dispatch contract must be internal, not public.

IFusedActivation is a fused-kernel routing interface consumed by internal layers (ActivationLayer.TryGetFusedActivationType, FeedForwardNeuralNetwork.TryFusedDensePredict). Exposing it as public violates the facade pattern—users should interact only with AiModelBuilder and configuration classes, not dispatch plumbing. All implementations use explicit interface syntax, so the interface visibility does not affect the concrete activation classes.

A prior review flagged this exact issue and marked it "Addressed in commits b029210 to a6851e4," yet the code remains public. Make it internal.

🔒 Proposed fix
-public interface IFusedActivation
+internal interface IFusedActivation
 {
     /// <summary>
     /// Reports the fused-kernel activation type equivalent to this activation, or
     /// returns <c>false</c> if this instance can't be reproduced by the kernel.
     /// </summary>
     bool TryGetFusedActivation(out FusedActivationType type);
 }

As per coding guidelines: "src/**: Users should ONLY interact with AiModelBuilder.cs and AiModelResult.cs" and "Prefer internal over public for plumbing/helper classes that users never instantiate or consume."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/ActivationFunctions/Fused/IFusedActivation.cs` around lines 28 - 35,
Change the IFusedActivation interface from public to internal so the
fused-kernel routing contract is not exposed in the public API; update the
declaration of IFusedActivation accordingly (the explicit implementations on
concrete activation classes need no changes), and verify call sites like
ActivationLayer.TryGetFusedActivationType and
FeedForwardNeuralNetwork.TryFusedDensePredict still compile against the
now-internal interface.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/AiDotNet.PyTorchParity/pytorch/benchmark.py`:
- Line 276: The expression computing p95_idx uses an unnecessary int() cast
around round(); update the line that assigns p95_idx to remove int() so it reads
p95_idx = min(len(steady_sorted) - 1, round(0.95 * (len(steady_sorted) - 1)))
(referencing variable steady_sorted and the p95_idx assignment), ensuring the
result is still bounded by len(steady_sorted) - 1; no other logic change required.
- Line 89: The __enter__ method's return type annotation uses unnecessary string
quotes; update the signature in the ResourceMonitor class by removing the quotes
so it reads a normal forward reference (i.e., change def __enter__(self) ->
"ResourceMonitor": to use ResourceMonitor without quotes), ensuring the
annotation matches the class name and Python's type hinting style.

In `@benchmarks/AiDotNet.PyTorchParity/pytorch/compare.py`:
- Around line 24-30: The helper _get lacks a return type annotation; update its
signature to include a return type and type for default (e.g. import Any from
typing and change to def _get(d: dict, *names: str, default: Any = None) ->
Any:) and ensure the typing import (from typing import Any) is added at the top;
this keeps behavior identical but provides explicit type information for callers
and linters.

In `@benchmarks/AiDotNet.PyTorchParity/pytorch/requirements.txt`:
- Line 3: Update the PyTorch minimum version in requirements.txt to avoid known
security vulnerabilities: replace the current "torch>=2.2" entry with a safer
minimum such as "torch>=2.5.0" or pin to a compatible patch series like
"torch~=2.6.0" so consumers use a known-safe release; ensure any CI or local
test docs that reference the "torch" requirement are updated accordingly.

---

Duplicate comments:
In `@src/ActivationFunctions/Fused/IFusedActivation.cs`:
- Around line 28-35: Change the IFusedActivation interface from public to
internal so the fused-kernel routing contract is not exposed in the public API;
update the declaration of IFusedActivation accordingly (the explicit
implementations on concrete activation classes need no changes), and verify call
sites like ActivationLayer.TryGetFusedActivationType and
FeedForwardNeuralNetwork.TryFusedDensePredict still compile against the
now-internal interface.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5cd946b5-dc1b-4935-9648-9e103d21f2d0

📥 Commits

Reviewing files that changed from the base of the PR and between f8da771 and e32ade6.

📒 Files selected for processing (28)
  • AiDotNet.sln
  • benchmarks/AiDotNet.PyTorchParity/.gitignore
  • benchmarks/AiDotNet.PyTorchParity/AiDotNet.PyTorchParity.csproj
  • benchmarks/AiDotNet.PyTorchParity/Program.cs
  • benchmarks/AiDotNet.PyTorchParity/README.md
  • benchmarks/AiDotNet.PyTorchParity/pytorch/benchmark.py
  • benchmarks/AiDotNet.PyTorchParity/pytorch/compare.py
  • benchmarks/AiDotNet.PyTorchParity/pytorch/requirements.txt
  • src/ActivationFunctions/ELUActivation.cs
  • src/ActivationFunctions/Fused/IFusedActivation.cs
  • src/ActivationFunctions/GELUActivation.cs
  • src/ActivationFunctions/IdentityActivation.cs
  • src/ActivationFunctions/LeakyReLUActivation.cs
  • src/ActivationFunctions/MishActivation.cs
  • src/ActivationFunctions/ReLUActivation.cs
  • src/ActivationFunctions/SiLUActivation.cs
  • src/ActivationFunctions/SigmoidActivation.cs
  • src/ActivationFunctions/SwishActivation.cs
  • src/ActivationFunctions/TanhActivation.cs
  • src/NeuralNetworks/FeedForwardNeuralNetwork.cs
  • src/NeuralNetworks/NeuralNetworkBase.cs
  • src/Optimizers/AdamOptimizer.cs
  • src/Optimizers/AdamWOptimizer.cs
  • src/Optimizers/Fused/IFusedOptimizerSpec.cs
  • src/Optimizers/StochasticGradientDescentOptimizer.cs
  • src/Training/CompiledTapeTrainingStep.cs
  • tests/AiDotNet.Tests/IntegrationTests/NeuralNetworks/FusedOptimizerIntegrationTests.cs
  • tests/AiDotNet.Tests/UnitTests/NeuralNetworks/FusedInferenceParityTests.cs

ooples pushed a commit to ooples/AiDotNet.Tensors that referenced this pull request May 30, 2026
…n/LiSHT/ISRU/SQRBF/BinarySpiking/BentIdentity + LogSoftmax/LogSoftmin/Spherical/Taylor/Gumbel/Sparsemax/Squash

Enumerates and fuses every remaining activation that can be an elementwise or
row-wise epilogue on the [batch, features] GEMM output. Formulas matched to the
AiDotNet activation classes.

Pointwise (added to the CpuFusedOperations resolvers + delegated by the epilogue):
  Sign, BentIdentity, Gaussian, LiSHT, ISRU(alpha), SQRBF(beta), BinarySpiking(threshold).

Row-/channel-wise (new shared RowwiseFusedActivations helper, used by BOTH the
MlpForward/FusedLinear epilogue and the BlasManaged ActivationEpilogue so the two
paths stay identical — float + double):
  LogSoftmax, LogSoftmin, SphericalSoftmax (softmax of x/‖x‖₂), TaylorSoftmax
  (2nd-order), GumbelSoftmax (deterministic eval = softmax(x/temperature); the
  training-time noise is not fused), Sparsemax (simplex projection via sort),
  Squash (capsule). PReLU + Softmax/Softmin also moved into this shared helper.

The epilogue now routes channel/row-wise types through RowwiseFusedActivations and
resolves any other pointwise activation from the shared registry (default branch),
so it covers the full set without duplicating 30+ inline cases.

Every FusedActivationType value (0–36) now has a working fused kernel in both
paths. The only activations NOT fused are those that have no FusedActivationType
value and are structurally not elementwise epilogues:
  • Maxout — reduces k channels to 1 (changes output dimensionality; a pooling op).
  • HierarchicalSoftmax — needs a class tree + target label (loss-coupled).
These are documented in the enum; fusing them would require a different op shape,
not a kernel.
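
Two of the row-wise activations listed above, sketched in Python as independent references. This is an illustration of the standard definitions (sort-based simplex projection for Sparsemax, max-shifted LogSoftmax), not the RowwiseFusedActivations code; names are hypothetical.

```python
import math

def log_softmax(row):
    shift = max(row)
    log_sum = shift + math.log(sum(math.exp(x - shift) for x in row))
    return [x - log_sum for x in row]

def sparsemax(row):
    """Euclidean projection of a row onto the probability simplex, via a sort."""
    ordered = sorted(row, reverse=True)
    cumulative = 0.0
    support_sum, k = 0.0, 0
    for j, z in enumerate(ordered, start=1):
        cumulative += z
        if 1.0 + j * z > cumulative:       # z is still inside the support
            support_sum, k = cumulative, j
    tau = (support_sum - 1.0) / k          # threshold subtracted from every entry
    return [max(x - tau, 0.0) for x in row]

print(sparsemax([3.0, 0.0]), sparsemax([1.0, 1.0]))
```

Unlike softmax, sparsemax can assign exactly zero probability to low-scoring entries, which is why it needs the sort rather than a single exp pass.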

Tests: MlpForwardActivationParityTests gains all 7 new pointwise (float+double) and
a row-wise theory (LogSoftmax/LogSoftmin/Spherical/Taylor/Gumbel/Sparsemax/Squash)
each vs an independent reference; existing PReLU/Softmax/Softmin/epilogue tests
still pass through the shared helper. 79 activation tests green; FusedLinear/
MlpForward regression green (135 passed / 0 failed).

Companion to ooples/AiDotNet#1469.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ooples pushed a commit to ooples/AiDotNet.Tensors that referenced this pull request May 30, 2026
…dient, D-Adaptation)

Two items previously documented as "needs a different op shape" are now implemented.

FusedLinearMaxout (CpuEngine + IEngine): GEMM + bias then grouped-max over the
feature dim, [.., M, N] → [.., M, N/numPieces] (Goodfellow et al. 2013). Maxout is
a shape-changing reduction, not an activation epilogue, so it gets its own fused
op. Forward/inference-only (reuses the FusedLinear fast path for the GEMM).
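
The shape change can be seen in a small pure-Python sketch of linear-plus-maxout. It assumes each group consists of numPieces consecutive output features, which is an assumption about the layout (interleaved groupings also exist); the function names are hypothetical.

```python
def maxout(row, num_pieces):
    """Reduce N features to N / num_pieces by taking the max of each group."""
    if len(row) % num_pieces != 0:
        raise ValueError("feature count must be divisible by num_pieces")
    return [max(row[i:i + num_pieces]) for i in range(0, len(row), num_pieces)]

def linear_maxout(x, weights, bias, num_pieces):
    """x: [M][K], weights: [K][N], bias: [N] -> [M][N / num_pieces]."""
    out = []
    for x_row in x:
        pre = [sum(x_row[k] * weights[k][n] for k in range(len(weights))) + bias[n]
               for n in range(len(bias))]
        out.append(maxout(pre, num_pieces))
    return out

weights = [[1.0, 0.0, -1.0, 2.0],
           [0.0, 1.0, 1.0, -1.0]]
print(linear_maxout([[1.0, 2.0]], weights, [0.0] * 4, num_pieces=2))
```

Four pre-activation features collapse to two outputs, which is why this cannot be expressed as an elementwise epilogue.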

HypergradientSGD + DAdaptationSGD: wired into CompiledTrainingPlan via a NEW
two-phase (global-reduce → apply) path in the float optimizer-update closure —
they maintain ONE scalar shared across ALL parameters, which the per-parameter
switch can't express:
  • Hypergradient: lr_t = lr_{t-1} + β·⟨g_t, g_{t-1}⟩ (global inner product), then
    p -= lr_t·g; prevGrad in m[p]. β via FusedOptimizerExtras.HyperLr.
  • D-Adaptation (growth-bounded / Prodigy): global ‖s‖² and r drive a single
    distance estimate d; p -= d·lr·g; s in m[p]. d0 / growth via extras.
State persists across steps in captured closure locals. CPU-only (the device gate
rejects them for GPU plans) and ungrouped (rejected with per-group schedules — a
single global LR is meaningless per group).
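
The two-phase shape of the Hypergradient update can be sketched in Python as follows. It is a simplified illustration of the rule quoted above, with hypothetical names and plain lists standing in for parameter tensors; it does not model the D-Adaptation variant.

```python
def hypergradient_sgd_step(params, grads, prev_grads, lr, hyper_lr):
    """params/grads/prev_grads: lists of per-parameter lists; returns the new lr."""
    # Phase 1 (global reduce): one inner product across ALL parameters.
    dot = sum(g * pg
              for g_vec, pg_vec in zip(grads, prev_grads)
              for g, pg in zip(g_vec, pg_vec))
    lr = lr + hyper_lr * dot
    # Phase 2 (apply): every parameter uses the same adapted scalar lr.
    for p_vec, g_vec, pg_vec in zip(params, grads, prev_grads):
        for i, g in enumerate(g_vec):
            p_vec[i] -= lr * g
            pg_vec[i] = g          # remember g_t for the next step
    return lr

params = [[1.0], [2.0, 3.0]]
grads = [[0.5], [1.0, -1.0]]
prev = [[0.0], [0.0, 0.0]]
lr1 = hypergradient_sgd_step(params, grads, prev, lr=0.1, hyper_lr=0.01)
lr2 = hypergradient_sgd_step(params, grads, prev, lr=lr1, hyper_lr=0.01)
print(lr1, lr2)
```

Because the inner product spans every parameter, the learning rate must be settled before any parameter is touched, which a per-parameter switch cannot do.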

Still NOT fused, each needing machinery beyond a fused step (documented in tests):
  • SparseAdam — sparse-gradient index lists (plan operates on dense grads).
  • LBFGS — closure line-search (multiple loss evals per step).
  • ScheduleFreeSGD — needs y=(1-β)z+βx written BEFORE the forward (a pre-forward
    parameter-transform hook the plan doesn't have).
  • HierarchicalSoftmax — an alternative output LAYER with its own learned tree-node
    weights traversed over the input features; not an activation on the logits.

Tests: FusedLinearMaxoutTests (grouped-max parity for numPieces 2/3/4 + indivisible
guard); Hypergradient diverges from SGD (global LR adaptation active); D-Adaptation
grows d above d0 (moves ≫ d0·lr·g); both rejected with grouped schedules. Optimizer
+ activation regression suites green.

Companion to ooples/AiDotNet#1469.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… regression)

#1470: a from-scratch Transformer trained via the per-call minibatch pattern
(MaxIterations=1 + external epoch loop calling model.Train repeatedly) stalled
on 0.207.x — the default Noam schedule's LR stayed frozen at its warmup-step-1
value instead of ramping, so loss never left the uniform floor (PPL ≈ V).

Root cause: the compiled fused-training kernel bakes a CONSTANT learning rate.
A default Transformer's Adam+Noam (StepPerBatch) optimizer was committed to that
fused path, which froze the LR — the per-step Noam ramp can't be reproduced by a
constant-rate kernel. The fix on this branch is the IFusedOptimizerSpec gating:
TryGetFusedLrSchedule returns false for unmapped schedules (Noam), so
TryMapToFusedOptimizerConfig declines the fused path and training falls back to
the eager OnBatchEnd → StepScheduler path that actually ramps the LR.

Two guards (both verified passing on this branch):

1. AdamWithNoamSchedule_DoesNotMapToConstantRateFusedConfig — deterministic unit
   test of the exact fix seam: Adam+Noam.TryGetFusedOptimizerConfig() must return
   false (forces eager), while a no-scheduler Adam still returns true (fused fast
   path preserved). NeuralNetworkBase.TryMapToFusedOptimizerConfig delegates to
   this same spec method, so it faithfully guards the real training path.

2. Transformer_PerCallTrain_DefaultNoam_RampsLearningRateAcrossCalls — end-to-end
   smoke: a default-optimizer Transformer trained via repeated per-call Train must
   accumulate scheduler step state (one StepPerBatch advance per call) and ramp the
   Noam LR above its warmup-step-1 value.
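
For reference, the warmup ramp that a constant-rate kernel cannot reproduce looks like this in Python. The formula is the standard Noam schedule from the original Transformer paper; d_model=512 and warmup=4000 are that paper's values and only an assumption about AiDotNet's defaults.

```python
def noam_lr(step, d_model=512, warmup=4000, factor=1.0):
    """Noam schedule; step counts from 1."""
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

for step in (1, 100, 4000, 8000):
    print(step, noam_lr(step))
```

During warmup the rate grows linearly with the step, so a rate frozen at its step-1 value is thousands of times smaller than the intended peak.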

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ooples ooples force-pushed the perf/fused-training-gate branch from bf7a785 to 7310df9 Compare May 30, 2026 02:56
ooples added a commit to ooples/AiDotNet.Tensors that referenced this pull request May 30, 2026
…op, global-reduction optimizers) (#502)

* feat(#499): add ReLU6 + SoftSign kernels and bring ActivationEpilogue to full pointwise parity

* feat(#499): parametric fused activations via FusedActivationParams

Adds an optional FusedActivationParams (Alpha/Beta/Theta, nullable so a missing
field resolves to each activation's canonical default) threaded through the fused
activation paths, so parametric activations fuse with ANY parameter instead of
only the hardcoded value:

- New FusedActivationType values: CELU, ThresholdedReLU, ScaledTanh.
- LeakyReLU now fuses for any slope (not just 0.01); ELU for any alpha (not just 1).
- CELU (alpha), ThresholdedReLU (theta), ScaledTanh (alpha, beta) fuse via params.
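
The "null resolves to the canonical default" behaviour can be sketched as follows. This is a minimal Python illustration under stated assumptions: `ActivationParams` is a hypothetical stand-in for FusedActivationParams, the LeakyReLU (0.01) and ELU (1) defaults come from the text above, the CELU default of 1.0 is an assumption, and the CELU formula is the common `max(0,x) + min(0, alpha*(exp(x/alpha)-1))` form.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ActivationParams:
    """Hypothetical stand-in for FusedActivationParams; None means 'use the default'."""
    alpha: Optional[float] = None


def _alpha(params: Optional[ActivationParams], default: float) -> float:
    # A missing params object or a missing field both fall back to the default.
    if params is None or params.alpha is None:
        return default
    return params.alpha


def leaky_relu(x: float, params: Optional[ActivationParams] = None) -> float:
    slope = _alpha(params, 0.01)  # default slope from the commit text
    return x if x > 0 else slope * x


def elu(x: float, params: Optional[ActivationParams] = None) -> float:
    alpha = _alpha(params, 1.0)  # default alpha from the commit text
    return x if x > 0 else alpha * math.expm1(x)


def celu(x: float, params: Optional[ActivationParams] = None) -> float:
    alpha = _alpha(params, 1.0)  # assumed default, not confirmed by the source
    if alpha <= 0:
        # CELU divides by alpha, so a non-positive value is rejected.
        raise ValueError("CELU alpha must be > 0")
    return max(0.0, x) + min(0.0, alpha * math.expm1(x / alpha))
```

Passing no params reproduces the previously hardcoded behaviour, so existing callers see no change.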

Plumbing (optional trailing param, fully back-compatible — null = prior behavior):
- CpuFusedOperations.GetFloatActivation/GetDoubleActivation build a parametric
  closure from the params (falling back to the per-activation default), with the
  non-parametric activations still served by the static dispatch tables.
- ApplyBiasActivationInPlace/Double, CpuEngine.FusedLinear, CpuEngine.MlpForward
  (hidden + output params), and the IEngine interface all carry the optional params.
- DirectGpuTensorEngine.FusedLinear defers to the base CPU params-aware path when
  custom params are supplied (GPU fused kernels don't carry them yet).
- ActivationEpilogue (fp32 + fp64) honors params for LeakyReLU/ELU and implements
  CELU/ThresholdedReLU/ScaledTanh.

Out of scope (documented): PReLU needs a per-channel slope vector (not a scalar) —
a separate kernel signature; the tape/graph training path applies activations via
ActivationRegistry (canonical defaults) — MlpForward is inference-only so the main
consumer path is fully covered; Softmax/Softmin remain row-wise.

Tests: MlpForwardActivationParityTests + EpilogueActivationParityTests gain
parametric cases (LeakyReLU 0.2, ELU 2, CELU 1.5, ThresholdedReLU 0.5,
ScaledTanh 1.7/0.66) verifying both fused paths honor the supplied parameter
(float + double). FusedLinear/MlpForward regression suite green (110 passed).

Companion to ooples/AiDotNet#1469.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(#499): fuse PReLU, RReLU, Softmax, Softmin — completes the FusedActivationType set

Implements the remaining declared activation types so every FusedActivationType
value now has a working fused kernel in BOTH paths (MlpForward/FusedLinear tables
and the BlasManaged ActivationEpilogue):

- PReLU: per-output-channel learned slope via FusedActivationParams.PReluSlope
  (length = features, or 1 for shared; default 0.25). Applied per output column,
  so it runs in a dedicated channel-aware pass (not the pointwise delegate).
- RReLU: deterministic eval form = leaky with slope (lower+upper)/2 (default
  ≈0.2292, override via Alpha) — fused paths are inference-only.
- Softmax / Softmin: row-wise (over the feature dim) with the standard max-shift
  for numerical stability; Softmin = softmax(-x). Run as a per-row pass after bias.
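
A minimal Python sketch of the row-wise pass and the RReLU eval slope described above. The function names are hypothetical, and the `lower=1/8`, `upper=1/3` bounds are an assumption chosen because they reproduce the ≈0.2292 default quoted in the text.

```python
import math
from typing import List


def softmax_row(row: List[float]) -> List[float]:
    """Softmax over one row with the max-shift for numerical stability."""
    shift = max(row)
    exps = [math.exp(v - shift) for v in row]  # largest exponent is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]


def softmin_row(row: List[float]) -> List[float]:
    """Softmin is softmax applied to the negated row."""
    return softmax_row([-v for v in row])


def rrelu_eval_slope(lower: float = 1.0 / 8.0, upper: float = 1.0 / 3.0) -> float:
    """Deterministic eval-time slope: the midpoint of the sampling range (bounds assumed)."""
    return (lower + upper) / 2.0
```

The max-shift keeps every exponent at or below zero, so a row containing very large logits does not overflow.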

PReLU/Softmax/Softmin get a dedicated branch at the top of
ApplyBiasActivationInPlace/Double (and matching epilogue cases) because they need
column/row context; the pointwise delegate path and SIMD fast path are unchanged
for every other activation (no regression — existing activations skip the branch).

Tests: per-channel PReLU parity, Softmax/Softmin row-normalization (+ monotonic /
anti-monotonic ordering, rows sum to 1), RReLU added to the parametric set — float
and double, through both MlpForward and the epilogue. 58 activation tests green.

This closes the activation half of #499: all 23 FusedActivationType values fuse
(only the specialized non-enumerated activation classes — Sparsemax, Maxout,
GumbelSoftmax, etc. — remain unenumerated, marked lower-priority in the issue).

Companion to ooples/AiDotNet#1469.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(#499): fuse the remaining specialized activations — Sign/Gaussian/LiSHT/ISRU/SQRBF/BinarySpiking/BentIdentity + LogSoftmax/LogSoftmin/Spherical/Taylor/Gumbel/Sparsemax/Squash

Enumerates and fuses every remaining activation that can be an elementwise or
row-wise epilogue on the [batch, features] GEMM output. Formulas matched to the
AiDotNet activation classes.

Pointwise (added to the CpuFusedOperations resolvers + delegated by the epilogue):
  Sign, BentIdentity, Gaussian, LiSHT, ISRU(alpha), SQRBF(beta), BinarySpiking(threshold).

Row-/channel-wise (new shared RowwiseFusedActivations helper, used by BOTH the
MlpForward/FusedLinear epilogue and the BlasManaged ActivationEpilogue so the two
paths stay identical — float + double):
  LogSoftmax, LogSoftmin, SphericalSoftmax (softmax of x/‖x‖₂), TaylorSoftmax
  (2nd-order), GumbelSoftmax (deterministic eval = softmax(x/temperature); the
  training-time noise is not fused), Sparsemax (simplex projection via sort),
  Squash (capsule). PReLU + Softmax/Softmin also moved into this shared helper.

The epilogue now routes channel/row-wise types through RowwiseFusedActivations and
resolves any other pointwise activation from the shared registry (default branch),
so it covers the full set without duplicating 30+ inline cases.

Every FusedActivationType value (0–36) now has a working fused kernel in both
paths. The only activations NOT fused are the ones with NO FusedActivationType and
that are structurally not elementwise epilogues:
  • Maxout — reduces k channels to 1 (changes output dimensionality; a pooling op).
  • HierarchicalSoftmax — needs a class tree + target label (loss-coupled).
These are documented in the enum; fusing them would require a different op shape,
not a kernel.

Tests: MlpForwardActivationParityTests gains all 7 new pointwise (float+double) and
a row-wise theory (LogSoftmax/LogSoftmin/Spherical/Taylor/Gumbel/Sparsemax/Squash)
each vs an independent reference; existing PReLU/Softmax/Softmin/epilogue tests
still pass through the shared helper. 79 activation tests green; FusedLinear/
MlpForward regression green (135 passed / 0 failed).

Companion to ooples/AiDotNet#1469.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(#499): FusedLinearMaxout + global-reduction optimizers (Hypergradient, D-Adaptation)

Two items previously documented as "needs a different op shape", now implemented.

FusedLinearMaxout (CpuEngine + IEngine): GEMM + bias then grouped-max over the
feature dim, [.., M, N] → [.., M, N/numPieces] (Goodfellow et al. 2013). Maxout is
a shape-changing reduction, not an activation epilogue, so it gets its own fused
op. Forward/inference-only (reuses the FusedLinear fast path for the GEMM).
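
The shape change can be sketched in plain Python. This is an illustration only: `linear_maxout` is a hypothetical name, and grouping consecutive features into each piece group is an assumption (the source does not say whether groups are contiguous or strided).

```python
from typing import List


def linear_maxout(x: List[List[float]], w: List[List[float]],
                  b: List[float], num_pieces: int) -> List[List[float]]:
    """GEMM + bias, then max over groups of num_pieces features: [M, N] -> [M, N/num_pieces]."""
    n = len(b)
    if n % num_pieces != 0:
        # Mirrors the 'indivisible guard' mentioned in the tests.
        raise ValueError("feature count must be divisible by num_pieces")
    out = []
    for row in x:
        # Linear part: z[j] = sum_k row[k] * w[k][j] + b[j]
        z = [sum(row[k] * w[k][j] for k in range(len(row))) + b[j] for j in range(n)]
        # Grouped max over consecutive slices (assumed contiguous grouping).
        out.append([max(z[g * num_pieces:(g + 1) * num_pieces])
                    for g in range(n // num_pieces)])
    return out
```

Because the output has fewer columns than the GEMM result, this cannot be expressed as an elementwise epilogue, which matches the reasoning for giving it a dedicated op.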

HypergradientSGD + DAdaptationSGD: wired into CompiledTrainingPlan via a NEW
two-phase (global-reduce → apply) path in the float optimizer-update closure —
they maintain ONE scalar shared across ALL parameters, which the per-parameter
switch can't express:
  • Hypergradient: lr_t = lr_{t-1} + β·⟨g_t, g_{t-1}⟩ (global inner product), then
    p -= lr_t·g; prevGrad in m[p]. β via FusedOptimizerExtras.HyperLr.
  • D-Adaptation (growth-bounded / Prodigy): global ‖s‖² and r drive a single
    distance estimate d; p -= d·lr·g; s in m[p]. d0 / growth via extras.
State persists across steps in captured closure locals. CPU-only (the device gate
rejects them for GPU plans) and ungrouped (rejected with per-group schedules — a
single global LR is meaningless per group).
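
The two-phase structure for Hypergradient can be sketched as below. This is a minimal Python illustration of the update rule as written above (lr_t = lr_{t-1} + β·⟨g_t, g_{t-1}⟩, then p -= lr_t·g); `hypergradient_sgd_step` and its argument layout are hypothetical, not the CompiledTrainingPlan closure.

```python
from typing import List, Tuple

Tensors = List[List[float]]  # one flat list of floats per parameter tensor


def hypergradient_sgd_step(params: Tensors, grads: Tensors, prev_grads: Tensors,
                           lr: float, hyper_lr: float) -> Tuple[Tensors, float]:
    """One step: global reduce over ALL parameters, then apply the shared scalar LR."""
    # Phase 1 (global reduce): inner product of current and previous gradients.
    dot = sum(g * pg
              for gs, pgs in zip(grads, prev_grads)
              for g, pg in zip(gs, pgs))
    lr = lr + hyper_lr * dot
    # Phase 2 (apply): every parameter uses the same updated learning rate.
    new_params = [[p - lr * g for p, g in zip(ps, gs)]
                  for ps, gs in zip(params, grads)]
    return new_params, lr
```

The caller would store `grads` as the next step's `prev_grads`, which corresponds to keeping prevGrad in m[p]. A per-parameter update loop cannot compute `dot`, since it needs every gradient before any parameter is touched.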

Still NOT fused, each needing machinery beyond a fused step (documented in tests):
  • SparseAdam — sparse-gradient index lists (plan operates on dense grads).
  • LBFGS — closure line-search (multiple loss evals per step).
  • ScheduleFreeSGD — needs y=(1-β)z+βx written BEFORE the forward (a pre-forward
    parameter-transform hook the plan doesn't have).
  • HierarchicalSoftmax — an alternative output LAYER with its own learned tree-node
    weights traversed over the input features; not an activation on the logits.

Tests: FusedLinearMaxoutTests (grouped-max parity for numPieces 2/3/4 + indivisible
guard); Hypergradient diverges from SGD (global LR adaptation active); D-Adaptation
grows d above d0 (moves ≫ d0·lr·g); both rejected with grouped schedules. Optimizer
+ activation regression suites green.

Companion to ooples/AiDotNet#1469.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(#499): wire ScheduleFreeSGD (pre-forward hook) + FusedHierarchicalSoftmax — the last two

ScheduleFreeSGD (Defazio et al. 2024): the SIMD kernels already existed in
FusedOptimizer; this wires them into CompiledTrainingPlan via a new
_preForwardParamTransform hook invoked in Step() before the forward replay.
The hook writes y=(1-β)z+βx into the live parameter backing so gradients are
evaluated at the interpolation point; the optimizer update advances z (SGD)
and x (running weighted average, weightSum += lr²) then restores x into the
backing as the eval copy. z/x live in m[p]/v[p] (seeded from the initial
weights). Added SfBeta to FusedOptimizerExtras; gate + grouped-guard updated;
ScheduleFreeSGD moved from the rejected list to a dedicated functional test
(eval weights shrink on Σwᵢ² and diverge from plain SGD).
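
The ordering constraint (interpolate before the forward, update afterwards) can be sketched in Python. This is a sketch under assumptions: the averaging coefficient `lr² / weightSum` is inferred from the `weightSum += lr²` note above, and `schedule_free_sgd_step` and `grad_fn` are hypothetical names rather than the FusedOptimizer kernels.

```python
from typing import Callable, List, Tuple

Vec = List[float]


def schedule_free_sgd_step(z: Vec, x: Vec, weight_sum: float,
                           grad_fn: Callable[[Vec], Vec],
                           lr: float, beta: float) -> Tuple[Vec, Vec, float]:
    """One Schedule-Free SGD step; returns the updated (z, x, weight_sum)."""
    # Pre-forward hook: gradients are evaluated at the interpolation point y.
    y = [(1.0 - beta) * zi + beta * xi for zi, xi in zip(z, x)]
    g = grad_fn(y)
    # z takes a plain SGD step.
    z = [zi - lr * gi for zi, gi in zip(z, g)]
    # x is a running weighted average of z, with weight lr^2 per step (assumed).
    weight_sum += lr * lr
    c = (lr * lr) / weight_sum
    x = [(1.0 - c) * xi + c * zi for xi, zi in zip(x, z)]
    return z, x, weight_sum
```

On the quadratic Σwᵢ²/2 (gradient equal to the weights), two steps from w = 1 with lr = 0.1 and β = 0.9 leave x at a different value than plain SGD would, which is the kind of divergence the functional test checks for.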

FusedHierarchicalSoftmax (Morin & Bengio 2005): new virtual CpuEngine op
(inherited by DirectGpuTensorEngine). Computes the treeDepth shared per-level
gate sigmoids once per row then forms each leaf's root-to-leaf path product,
replacing the eager layer's per-class gate recomputation. Generic over T via
INumericOperations. Test matches the naive per-class reference for power-of-two
(sums to 1) and non-power-of-two (early-break) class counts.
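
The path-product idea can be sketched for the power-of-two case. This is an illustration only: `hierarchical_softmax_row` is a hypothetical name, the bit order (most significant bit at the root) is an assumption, and the non-power-of-two early-break behaviour is not modelled because the source does not specify it.

```python
import math
from typing import List


def hierarchical_softmax_row(gate_logits: List[float], num_classes: int) -> List[float]:
    """Leaf probabilities for one row from tree_depth shared per-level gates."""
    depth = len(gate_logits)
    if num_classes > 2 ** depth:
        raise ValueError("num_classes exceeds the leaves reachable at this depth")
    # Each per-level sigmoid is computed once per row and reused by every leaf.
    gates = [1.0 / (1.0 + math.exp(-v)) for v in gate_logits]
    probs = []
    for c in range(num_classes):
        p = 1.0
        for level in range(depth):
            # Assumed convention: bit `level` of the class index picks the branch.
            go_right = (c >> (depth - 1 - level)) & 1
            p *= gates[level] if go_right else 1.0 - gates[level]
        probs.append(p)
    return probs
```

With 2^depth classes the leaf probabilities sum to 1, because each level contributes a factor of gate + (1 - gate).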

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#499): classify FusedLinearMaxout + FusedHierarchicalSoftmax as NonDifferentiableOps

TapeCompletenessTests.AllTensorReturningMethods_AreClassified enumerates every
IEngine Tensor-returning method and requires each be registered. Both fused
output primitives are forward/inference-only (they throw under an active tape;
training decomposes into recordable per-layer ops), so they belong in
NonDifferentiableOps alongside MlpForward / LstmSequenceForward /
MultiHeadAttentionForward. FusedLinearMaxout was unclassified since it landed
in 15ec075; this fixes both.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(#499): drive FusedLinear(...,FusedActivationParams) overload in the GPU coverage harness

EveryGpuKernel_IsAutoTestedOrAllowlisted flagged the parametric FusedLinear
overload as uncovered: the single-shape arg generator couldn't synthesize a
FusedActivationParams value, so the overload was neither auto-testable nor
allowlisted. Teach CandidatesForType to emit null for FusedActivationParams —
a valid value meaning "use defaults", which reduces the overload to the base
FusedLinear(...,FusedActivationType) GPU kernel that is already auto-tested.
This gives the params overload real GPU-vs-CPU coverage rather than an
allowlist skip. (Pre-existing gap from the parametric #499 work.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#499): address CodeRabbit review — CELU alpha guards, params threading, PReLU bounds, schedule-free/hypergradient correctness

Resolves all 9 review threads on PR #502:
- CELU divides by alpha → reject alpha <= 0 in both fused paths (ActivationEpilogue
  fp32/fp64, CpuFusedOperations float/double activation delegates).
- Thread FusedActivationParams through the public FusedGemmBiasActivation float
  and double entrypoints (+ Unchecked) so direct callers can use parametric
  LeakyReLU/ELU/CELU/ThresholdedReLU/ScaledTanh settings.
- PReLU per-channel slope: defensively clamp to the last element when a
  misconfigured slope array is shorter than the feature dim (was
  IndexOutOfRangeException), in both ApplyFloat and ApplyDouble.
- Schedule-Free: clear _preForwardParamTransform on grouped reconfigure so a
  stale y=(1-β)z+βx rewrite can't leak into a subsequent grouped optimizer.
- HypergradientSGD: honor a non-constant LrSchedule — effective lr is the
  per-step schedule base plus the accumulated hypergradient adjustment (was
  frozen at GetLr(1)); constant schedule reduces to the prior behavior.
- FusedOptimizerExtras.Validate(): reject HyperLr<0, D0<=0, DGrowthRate<1,
  SfBeta∉[0,1] at configure time; called from both ConfigureOptimizer paths.
- Test comment: note LBFGS is also still rejected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: franklinic <franklin@ivorycloud.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
franklinic and others added 3 commits May 30, 2026 10:20
…R fix)

Replaces the eager-fallback workaround with the real solution: the default
Transformer recipe (Adam β₂=0.98 + Noam warmup) now trains on the fused-
compiled path with a correct per-step LR ramp — no forced slow path.

The fused training plan already evaluates LrSchedule.GetLr(step) every
optimizer step (the same model as PyTorch fused=True, which takes lr as a
per-step scalar). Cosine/Exponential/OneCycle/LinearWarmupCosine were already
supported; Noam was the only missing shape, which is why Adam+Noam
Transformers fell back to eager (or, pre-gate, froze at a constant rate).

- Bump AiDotNet.Tensors 0.86.6 → 0.88.0 for LrSchedule.Noam (Tensors #504).
- GradientBasedOptimizerBase.TryGetFusedLrSchedule: map NoamSchedule →
  LrSchedule.Noam(modelDim, warmup, factor). Both use t = step (1-based), so
  the fused LR sequence is bit-identical to the eager NoamSchedule.
- NoamSchedule.Factor getter so the mapping is fully faithful.

Tests (3, all passing on the live CUDA box):
1. AdamWithNoamSchedule_MapsToFusedConfig_WithRampingSchedule — Adam+Noam now
   maps to a fused config (no eager fallback); mapped schedule ramps 4000× over
   warmup and matches the paper peak.
2. FusedNoamSchedule_MatchesEagerNoamSchedule_StepForStep — fused GetLr(N) ==
   eager lr(t=N) for 3× warmup steps.
3. Transformer_PerCallTrain_DefaultNoam_EngagesFusedPath_AndConverges —
   end-to-end: default-Noam Transformer per-call Train engages the fused path
   (3200/3200 steps fused) and converges to PPL 5.06 / top-1 7/8 (avgNll 1.62
   < ln(V) 2.08), proving the LR ramped instead of freezing.
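
The 4000× ramp in test 1 follows directly from the Noam formula. A minimal Python sketch, assuming the standard form from the Transformer paper with a 1-based step; `noam_lr` is a hypothetical name, not the Tensors API, and `model_dim=512`, `warmup=4000` are illustrative values.

```python
def noam_lr(step: int, model_dim: int, warmup: int, factor: float = 1.0) -> float:
    """factor * d^-0.5 * min(step^-0.5, step * warmup^-1.5), with step starting at 1."""
    return factor * model_dim ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# Linear ramp during warmup, peak at step == warmup, inverse-sqrt decay afterwards.
ramp = noam_lr(4000, 512, 4000) / noam_lr(1, 512, 4000)
```

During warmup the second branch of the `min` is active, so the LR grows linearly in the step and the ratio between step `warmup` and step 1 equals `warmup`. A fused path that evaluates this only once at step 1 would stay near the bottom of the ramp for the whole run.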

Verified the pre-existing ModelFamily TableTransformer/TabTransformer/
DecisionTransformer failures reproduce identically at the 0.86.6 baseline —
they don't use Noam and are unrelated to this change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…x Cosine off-by-one

Extends TryGetFusedLrSchedule so more LR schedulers run on the fused-compiled
training path (the fused plan evaluates GetLr(step) per optimizer step):

- StepLRScheduler → LrSchedule.Step. Verified: eager lr0·γ^((N-1)/stepSize) on
  batch N == fused GetLr(N) (Tensors uses max(0,step-1)/stepSize).
- CyclicLRScheduler → LrSchedule.Cyclic, gated to the canonical symmetric-
  triangular case (mode==Triangular && stepSizeUp==stepSizeDown). Triangular2 /
  ExponentialRange / asymmetric have no fused shape and fall back to eager.
  Added a CyclicLRScheduler.Mode getter for the gate.

Also fixes a PRE-EXISTING off-by-one in the Cosine mapping that the new parity
test caught: eager CosineAnnealing uses cos(π·(N-1)/tMax) but fused CosineLr
uses cos(π·(s-1)/(totalSteps-1)); passing totalSteps = tMax (not tMax+1) made
the fused sequence drift ~4e-6/step from eager. Now passes tMax+1 for an exact
match. (Exponential verified already exact: lr0·γ^(N-1) both sides.)
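
The off-by-one can be reproduced with the two cosine arguments quoted above. This is a Python sketch under assumptions: the `eta_min + 0.5·(lr0 − eta_min)·(1 + cos(·))` envelope is the usual cosine-annealing form and is assumed here, and the function names are hypothetical.

```python
import math


def eager_cosine(n: int, lr0: float, t_max: int, eta_min: float = 0.0) -> float:
    """Eager form from the text: cos(pi * (N - 1) / tMax) on batch N."""
    return eta_min + 0.5 * (lr0 - eta_min) * (1.0 + math.cos(math.pi * (n - 1) / t_max))


def fused_cosine(step: int, lr0: float, total_steps: int, eta_min: float = 0.0) -> float:
    """Fused form from the text: cos(pi * (s - 1) / (totalSteps - 1))."""
    return eta_min + 0.5 * (lr0 - eta_min) * (
        1.0 + math.cos(math.pi * (step - 1) / (total_steps - 1)))


T_MAX = 100
# Passing totalSteps = tMax + 1 makes the two denominators identical.
exact = all(eager_cosine(n, 0.1, T_MAX) == fused_cosine(n, 0.1, T_MAX + 1)
            for n in range(1, T_MAX + 1))
# Passing totalSteps = tMax leaves a denominator of tMax - 1, so the curves drift apart.
drifted = any(eager_cosine(n, 0.1, T_MAX) != fused_cosine(n, 0.1, T_MAX)
              for n in range(2, T_MAX + 1))
```

With the corrected argument both functions evaluate the same floating-point expression, which is why the match is exact rather than merely close.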

New FusedLrScheduleMappingTests: step-for-step parity (eager sequence == fused
GetLr(N)) for Step / Cyclic-triangular / Cosine / Exponential, plus negative
guards that Triangular2 and asymmetric cyclic fall back to eager. All pass.

Note: OneCycle is NOT wired — AiDotNet's OneCycle uses LINEAR warmup while the
fused/PyTorch OneCycle uses cosine warmup; the formulas differ, so mapping it
would train differently on the fused path. Left on eager (documented).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…verified)

Tensors 0.88.0's CompiledTrainingPlan.ConfigureOptimizerFloat supports 20 fused
optimizer kernels, but AiDotNet's CompiledTapeTrainingStep had a stale allowlist
(SGD/Adam/AdamW/AMSGrad only) from when those were the only kernels — so any
other optimizer silently fell back to eager even with an IFusedOptimizerSpec.

- Expand the allowlist to the full set the linked Tensors build supports
  (SGD, SGDMomentum, Adam, AdamW, AMSGrad, Nadam, RAdam, AdaMax, AdaDelta,
  Adagrad, RMSprop, Lion, LARS, LAMB, FTRL, ASGD, Rprop, HypergradientSGD,
  ScheduleFreeSGD, DAdaptationSGD).
- Generalize the AMSGrad-only "fused-unavailable" latch to a per-OptimizerType
  set, so any type the linked build can't actually run falls back ONCE (loud
  warning) instead of throwing/reconfiguring every step — still never a wrong
  update.
- AdaMaxOptimizer + NadamOptimizer implement IFusedOptimizerSpec
  (OptimizerType.AdaMax / Nadam; no decoupled weight decay → WeightDecay 0;
  decline on adaptive LR / unmappable scheduler).

New FusedOptimizerParityTests gates each wiring with a fused-vs-eager training
comparison: train two identically-initialised MLPs (EnableCompilation true vs
false), compare final params. Adam is the control. Result: AdaMax and Nadam
both engage the fused path (fusedSteps=40/40) and match eager to maxAbsDiff=0
(bit-identical) — verified safe to wire. The test asserts fusedSteps>0 so a
silent eager fallback can't pass vacuously.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
franklinic and others added 3 commits May 30, 2026 11:57
…th (parity-verified)

Five more optimizers self-describe via IFusedOptimizerSpec, mapped to their
Tensors fused kernels using the exact param interpretation each kernel expects:
- RMSprop  → RMSpropUpdateSimd(lr, decay=β2, eps)
- Adagrad  → AdagradUpdateSimd(lr, eps)
- Lion     → LionUpdateSimd(lr, β1, β2, wd)
- AdaDelta → AdaDeltaUpdateSimd(lr, rho=β2, eps)
- LAMB     → LAMBUpdateSimd(lr, β1, β2, eps, wd)

Each declines (→ eager) under UseAdaptiveLearningRate, which is what gates the
AiDotNet-side adaptive hyperparameter schedules (Adagrad LR factors, AdaDelta
rho schedule, Lion β factors) that the fixed-hyperparameter fused kernels don't
reproduce — so the fused path only engages for the canonical fixed-param case.

Parity-verified (FusedOptimizerParityTests, fused-vs-eager training): all five
engage the fused path (40/40 steps) and match eager to maxAbsDiff=0
(bit-identical), with a non-vacuous guard confirming training actually moved the
params (trainDelta: Lion 0.40, LAMB 0.39, RMSprop 0.14, Adagrad 0.06,
AdaDelta 3e-3 — distinct dynamics, not all the same).

Total fused optimizers now: SGD/Adam/AdamW/AMSGrad + AdaMax/Nadam +
RMSprop/Adagrad/Lion/AdaDelta/LAMB. LARS/FTRL/RAdam/ASGD/Rprop deferred (LARS/
FTRL need params the fixed (lr,β1,β2,ε,wd) config can't carry; RAdam/ASGD/Rprop
have no AiDotNet optimizer class).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ty-gated)

Tensors 0.88.0's CpuFusedOperations registry implements 26 pointwise activation
kernels (feat/499 "fuse every activation"); AiDotNet wired only 8
(ReLU/Identity/LeakyReLU/GELU/SiLU/Swish/Sigmoid/Tanh). Adds IFusedActivation to
11 more whose fused kernel is numerically identical to the eager scalar form:
Mish, SELU, Softplus, SoftSign, Sign, BentIdentity, Gaussian, LiSHT, SQRBF,
ReLU6, HardSwish.

Gated by a new FusedActivationParityTests harness: isolate each activation via
IEngine.FusedLinear(x, I, null, type) (identity weights → only the fused
activation applies) and compare element-wise to eager activation.Activate(x)
over inputs spanning saturating regions. All 13 wired non-parametric activations
match to ≤5e-7 (a few float32 ULPs; float epsilon is ≈1.19e-7).

The gate caught a real mismatch: HardSigmoid is NOT wired — AiDotNet's
HardSigmoidActivation uses slope 0.2 (clamp(0.2x+0.5,0,1)) while the fused kernel
uses the PyTorch form (x/6+0.5); parity measured 0.333 divergence, so it stays on
the eager path until the formula is reconciled.

Deferred (documented): parametric activations (ELU/CELU/HardTanh/ScaledTanh/
ThresholdedReLU/ISRU) need per-instance param guards vs the kernel's hardcoded
constants; RReLU is non-deterministic; the softmax family isn't pointwise so it
can't be a fused activation epilogue.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cord findings

Adds a --compile flag to the PyTorch side so the head-to-head can be run against
torch.compile (TorchInductor), not just eager — the comparison requested for the
fused-training-plan claim. (TorchInductor CPU needs MSVC cl.exe on PATH; run
under a VS Developer environment / vcvars64.bat.)

Findings (MLP, CPU, 8 threads, bs64, 20 train-batches × 5 epochs, steady-state
i.e. excluding torch.compile's ~3.7s first-epoch compilation):

  TRAINING (the compiled-training-plan claim):
    AiDotNet fused      ~0.014-0.017 s/epoch
    torch.compile        ~0.084      s/epoch   → AiDotNet ~5-6x faster
    AiDotNet eager       ~0.22       s/epoch   (fused ~13-16x over eager)

  INFERENCE (Predict latency, post-warmup): torch.compile wins ~2-4x on MLP
    (e.g. bs32: AiDotNet p95 1.27ms vs torch.compile mean 0.32ms). The fused
    training plan beats torch.compile; the inference path does not yet — an
    honest gap to close (TorchInductor's fused pointwise+GEMM inference codegen).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/Training/CompiledTapeTrainingStep.cs`:
- Line 128: The static field _fusedUnavailableTypes is being latched on any
exception which can permanently disable fused mode; update the logic so you only
add to _fusedUnavailableTypes for known non-transient exceptions (e.g.,
NotSupportedException/PlatformNotSupportedException or a specific
OptimizerUnsupportedException) instead of catching Exception, and ensure
Invalidate() actually clears or resets _fusedUnavailableTypes (or make it
instance-scoped) so transient failures don't permanently disable fused
execution; locate usages in CompiledTapeTrainingStep (the _fusedUnavailableTypes
field and the method where exceptions are caught and Invalidate() is
implemented) and change the catch to specific exception types and add a
safe-clear/reset in Invalidate().

In
`@tests/AiDotNet.Tests/IntegrationTests/Optimizers/FusedOptimizerParityTests.cs`:
- Around line 115-151: Each test (Adam_Control_FusedMatchesEager,
AdaMax_FusedMatchesEager_NoWorseThanAdam,
Nadam_FusedMatchesEager_NoWorseThanAdam) currently ignores the returned
trainDelta so they can pass with no parameter updates; update each test to
assert that trainDelta indicates actual parameter movement (e.g.,
Assert.True(trainDelta > 0 || maxAbs(trainDelta) > 1e-6) or similar non-zero
threshold) after calling Divergence(...) and include a clear failure message
mentioning the test name and that no training occurred; use the existing
trainDelta variable from the Divergence(...) call and keep the threshold
conservative (like 1e-6) so small-but-real updates pass while no-op runs fail.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 656f2a77-456b-4109-8222-49895364a143

📥 Commits

Reviewing files that changed from the base of the PR and between 53f58b4 and 7bba7ce.

📒 Files selected for processing (24)
  • src/ActivationFunctions/BentIdentityActivation.cs
  • src/ActivationFunctions/GaussianActivation.cs
  • src/ActivationFunctions/HardSwishActivation.cs
  • src/ActivationFunctions/LiSHTActivation.cs
  • src/ActivationFunctions/MishActivation.cs
  • src/ActivationFunctions/ReLU6Activation.cs
  • src/ActivationFunctions/SELUActivation.cs
  • src/ActivationFunctions/SQRBFActivation.cs
  • src/ActivationFunctions/SignActivation.cs
  • src/ActivationFunctions/SoftPlusActivation.cs
  • src/ActivationFunctions/SoftSignActivation.cs
  • src/LearningRateSchedulers/CyclicLRScheduler.cs
  • src/Optimizers/AdaDeltaOptimizer.cs
  • src/Optimizers/AdaMaxOptimizer.cs
  • src/Optimizers/AdagradOptimizer.cs
  • src/Optimizers/GradientBasedOptimizerBase.cs
  • src/Optimizers/LAMBOptimizer.cs
  • src/Optimizers/LionOptimizer.cs
  • src/Optimizers/NadamOptimizer.cs
  • src/Optimizers/RootMeanSquarePropagationOptimizer.cs
  • src/Training/CompiledTapeTrainingStep.cs
  • tests/AiDotNet.Tests/IntegrationTests/Optimizers/FusedActivationParityTests.cs
  • tests/AiDotNet.Tests/IntegrationTests/Optimizers/FusedLrScheduleMappingTests.cs
  • tests/AiDotNet.Tests/IntegrationTests/Optimizers/FusedOptimizerParityTests.cs

Comment thread src/Training/CompiledTapeTrainingStep.cs
…parity tests, py nits

- CompiledTapeTrainingStep: only latch a type as fused-unsupported for capability-gap
  exceptions (NotSupported/MissingMethod/TypeLoad/EntryPointNotFound/DllNotFound), not any
  exception — transient runtime failures fall back one step without permanently disabling
  fused mode; and clear _fusedUnavailableTypes in Invalidate() so a fresh lifecycle retries.
- FusedOptimizerParityTests: assert trainDelta > 1e-6 in Adam/AdaMax/Nadam tests so they
  cannot pass vacuously when both paths do no training.
- pytorch benchmark/compare: drop quoted self-type annotation, redundant int() cast, add
  _get return type; bump requirements torch>=2.5.0 for the torch.load RCE / DoS CVEs.
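
The latch behaviour in the first bullet can be sketched as a small Python model. This is a pattern illustration only, not the C# implementation: `FusedDispatcher` is a hypothetical class, and Python's `NotImplementedError` stands in for the capability-gap exceptions listed above.

```python
from typing import Callable, List, Set


class FusedDispatcher:
    """Latch a type as fused-unavailable only for capability gaps, never for transient errors."""

    # Stand-in for NotSupported/MissingMethod/TypeLoad/EntryPointNotFound/DllNotFound.
    CAPABILITY_GAPS = (NotImplementedError,)

    def __init__(self) -> None:
        self.unavailable: Set[str] = set()
        self.warnings: List[str] = []

    def step(self, optimizer_type: str,
             run_fused: Callable[[], str], run_eager: Callable[[], str]) -> str:
        if optimizer_type in self.unavailable:
            return run_eager()  # already latched: skip the fused attempt entirely
        try:
            return run_fused()
        except self.CAPABILITY_GAPS as exc:
            # Permanent gap: latch once and warn loudly, then fall back.
            self.unavailable.add(optimizer_type)
            self.warnings.append(f"fused path unavailable for {optimizer_type}: {exc}")
            return run_eager()
        except Exception:
            # Transient failure: fall back for this step only, retry fused next step.
            return run_eager()

    def invalidate(self) -> None:
        """A fresh lifecycle retries the fused path for every type."""
        self.unavailable.clear()
```

Keeping the latch keyed by optimizer type means one unsupported kernel does not disable the fused path for the others.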

All 3 parity tests pass; solution builds.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
franklinic and others added 3 commits May 31, 2026 11:47
…ce work + SpMM unconstrained fix)

0.90.2 pulls in the merged compiled-inference plan (Tensors#513 — CompiledMlp
self-tuning kernel selection, CNN conv im2col fast path, MlpForward small-batch
native-BLAS routing, public CpuInferenceConfig.PinBlasThreadsForLatency) and the
Tensors#520 fix that made ISparseEngine.SpMM<T> unconstrained again (0.90.0/0.90.1
broke the AiDotNet build — #379 had leaked `where T : unmanaged` into the public
API, failing SparseLinearLayer<T>; 0.90.2 is the first 0.90.x that compiles).

Core + tests + PyTorchParity benchmark all build clean against 0.90.2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…dMlp plan

FeedForwardNeuralNetwork.Predict already collapsed a pure dense+activation stack
into one MlpForward call, but MlpForward is Tensor-based (per-call AutoTensorCache
+ dispatch + Tensor-wrapper overhead). The Tensors compiled-inference flagship —
CompiledMlp (array-based, near-zero per-call allocation, persistent prepacked
weights, per-layer managed-vs-native self-tuning) — beats torch.compile at the
kernel level but wasn't on the Predict path. It's internal to AiDotNet.Tensors and
reachable via [InternalsVisibleTo("AiDotNet")].

TryFusedDensePredict now adds a float tier (TryCompiledMlpPredict): build/cache a
CompiledMlp from the dense layers' weights/biases on first eligible inference, then
replay it. The plan is rebuilt when absent, when batch exceeds the buffers it was
sized for, or when any layer's weight backing array was reallocated (reference
guard) — the same frozen-weights-during-inference contract as the MlpForward path,
plus the reallocation guard the cached plan needs. Non-float and non-contiguous /
rank>2 inputs fall through to MlpForward unchanged.

Measured (AIsEval MLP 784->512->128->10, this machine): Predict bs1 avg
0.503 -> 0.225 ms — ~2.2x faster, now at parity with torch.compile (0.217 ms mean),
where the Tensor-based path was ~2.3x slower. (mlp-fused, which calls MlpForward
directly rather than via Predict, is unchanged — isolating the gain to this path.)

Correctness: FeedForwardCompiledMlpPredictTests asserts the CompiledMlp Predict
output equals the generic per-layer Forward (first-call lazy-weights path) within
1e-4 and is deterministic across calls, at bs 1/8/32. Builds clean on 0.90.2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ralNetwork.Predict

A canonical CNN classifier — [Conv(→ReLU) | MaxPool]+ → Flatten → Dense+ — now
replays inference by calling the engine kernels directly (FusedConv2D fusing
bias+activation; index-free MaxPool2D; cached-B FusedLinear), skipping the
per-layer LayerBase.Forward wrappers. Predict overrides the base to try this stem
and falls back to base.Predict for anything outside the pattern (non-float, active
tape, lazy/unmaterialized weights, a conv activation other than identity/ReLU).

Root-caused via a per-stage breakdown (CnnStemBreakdownBench) at bs1: the layer
path pools through MaxPool2DWithIndices — allocating a 5-D backward-index array
even at inference (~213 µs vs ~26 µs index-free) — and pays per-layer
shape-resolution / _lastInput-caching / Tensor-view churn. The stem drops both.

Result (parity CNN, this machine): bs8 inference 2.39 → 1.32 ms (~1.8x), bs1
0.78 → 0.69 ms, bs32 3.34 → 2.95 ms. Output matches the generic per-layer Forward
within 1e-4 and is deterministic (ConvNetFusedStemPredictTests, bs 1/4).

Honest ceiling: still ~3x behind torch.compile. The remaining gap is NOT layer
overhead — it's (a) the per-op Tensor allocation the stem still incurs (each
FusedConv2D/MaxPool2D returns a fresh Tensor; torch fuses the whole graph into one
allocation-free C++ fn) and (b) the conv kernel floor itself — the im2col-GEMM
convs sum to ~188 µs and the full kernel floor (~329 µs) already exceeds torch's
whole-CNN 254 µs. Fully matching torch needs faster conv kernels (oneDNN/direct-conv
codegen) or a zero-alloc array-based CompiledConvNet (FusedConv2DInto +
MaxPool2DInto + ping-pong NCHW buffers) — a larger Tensors effort, filed as follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>