fix: compiled fused-training — standard-Adam default, OCP dispatch, MlpForward wiring, loud fallback by ooples · Pull Request #1469 · ooples/AiDotNet

ooples · 2026-05-29T18:57:15Z

Closes #1447 (wire AIsEval fused primitives into NN layers + builder — framework side of AiDotNet.Tensors#436)
Closes #1470 (Transformer per-call training stalls — Noam LR frozen on the fused path)

Wires AiDotNet's NN training/inference onto AiDotNet.Tensors' fused-compiled kernels (the compile-once / replay-many path the Tensors micro-benchmarks beat PyTorch-CPU on), via open/closed self-describing interfaces — no central enum whitelist to keep in sync. Found by profiling the AIsEval PyTorch-vs-AiDotNet benchmark, where every Train() step silently fell back to the eager tape.

1. Default-optimizer gate (root cause of "compiled does nothing")

GetOrCreateBaseOptimizer defaulted UseAMSGrad = true (a non-standard band-aid from #1350); the fused mapper rejected AMSGrad, so every default-optimizer model fell back to the eager tape — silently. Reverted to standard Adam (matches PyTorch/TF/Optax, all default amsgrad=False).

2. Open/closed fused dispatch (replaces type-switch + enum whitelist)

IFusedOptimizerSpec — optimizers self-describe their FusedOptimizerConfig; no central catalog.
IFusedActivation — activations self-declare their FusedActivationType.
TryGetFusedLrSchedule — LR schedulers map to the per-step fused LrSchedule (the fused kernel evaluates GetLr(step) every optimizer step, exactly like PyTorch fused=True).
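
A minimal sketch of the open/closed idea described above, written in Python purely for illustration. Every name here (`FusedConfig`, `FusedSpec`, `try_get_fused_config`, `dispatch`) is hypothetical and is not the AiDotNet API. It only shows how a self-describing interface lets the dispatcher stay unchanged when a new optimizer gains a fused kernel.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, runtime_checkable


@dataclass(frozen=True)
class FusedConfig:
    """Hypothetical stand-in for the settings a fused kernel needs."""
    kernel: str
    lr: float


@runtime_checkable
class FusedSpec(Protocol):
    """Optimizers that have a fused kernel implement this themselves."""

    def try_get_fused_config(self) -> Optional[FusedConfig]: ...


class Adam:
    def __init__(self, lr: float = 1e-3, amsgrad: bool = False):
        self.lr = lr
        self.amsgrad = amsgrad

    def try_get_fused_config(self) -> Optional[FusedConfig]:
        # The optimizer selects its own kernel variant; no central list is edited.
        return FusedConfig("amsgrad" if self.amsgrad else "adam", self.lr)


class ExoticOptimizer:
    """Has no fused kernel, so it simply does not implement the protocol."""


def dispatch(optimizer: object) -> str:
    # The dispatcher never names a concrete optimizer type.
    if isinstance(optimizer, FusedSpec):
        cfg = optimizer.try_get_fused_config()
        if cfg is not None:
            return f"fused:{cfg.kernel}"
    return "eager"
```

Under these assumptions, adding a fused-capable optimizer means adding one method to that optimizer; `dispatch` is not touched.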

3. #1470 — Adam+Noam on the fused fast path (true adaptive-LR fix)

Bumped AiDotNet.Tensors 0.86.6 → 0.88.0 for LrSchedule.Noam (Tensors #504). TryGetFusedLrSchedule now maps NoamSchedule → LrSchedule.Noam(d, warmup, factor) (replacing an eager-fallback workaround). The default Transformer recipe (Adam β₂=0.98 + Noam) now trains on the fused path with a correct per-step warmup ramp, bit-identical to the eager schedule. Verified: a default-Noam Transformer per-call Train engages the fused path 3200/3200 steps and converges (PPL 5.06, top-1 7/8) instead of freezing at the uniform floor.
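
For reference, a sketch of the per-step warmup ramp, assuming the standard Noam formula from the original Transformer recipe. The parameter order follows `LrSchedule.Noam(d, warmup, factor)` as named above, but the kernel's exact formula is not shown in this PR, so treat `noam_lr` as an illustrative assumption.

```python
def noam_lr(step: int, d_model: int, warmup: int, factor: float = 1.0) -> float:
    """Standard Noam schedule: linear warmup, then inverse-sqrt decay.

    `step` is 1-based. If the step counter never advances, the rate stays
    pinned at its tiny first-step value, which is the stall described above.
    """
    if step < 1:
        raise ValueError("step is 1-based")
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# The rate rises during warmup, peaks at step == warmup, then decays.
ramp = [noam_lr(s, d_model=512, warmup=4000) for s in (1, 2000, 4000, 8000)]
```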

4. Proper wiring — `Predict` → `MlpForward`

FeedForwardNeuralNetwork.Predict runs a pure dense+fused-activation stack as one IEngine.MlpForward call instead of the per-layer tape walk, via the activation interface. Falls back to generic Forward for anything unrepresentable.
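
The eligibility-then-fallback shape of this wiring can be sketched as follows. This is a toy Python model, not the AiDotNet implementation: `FUSED_ACTIVATIONS`, `mlp_forward`, and `predict` are hypothetical names, and the "fused" call here is an ordinary loop standing in for a single kernel invocation.

```python
import math
from typing import Callable, Dict, List, Tuple

# (weights, biases, activation name); weights holds one row per output unit.
Layer = Tuple[List[List[float]], List[float], str]

# Hypothetical table of activations the fused kernel can apply.
FUSED_ACTIVATIONS: Dict[str, Callable[[float], float]] = {
    "identity": lambda v: v,
    "relu": lambda v: max(0.0, v),
    "tanh": math.tanh,
}


def dense(x: List[float], weights: List[List[float]], biases: List[float]) -> List[float]:
    """One dense layer: dot each weight row with x and add the bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def mlp_forward(x: List[float], layers: List[Layer]) -> List[float]:
    """Stand-in for the single fused call over the whole stack."""
    for weights, biases, act in layers:
        x = [FUSED_ACTIVATIONS[act](v) for v in dense(x, weights, biases)]
    return x


def predict(x, layers, generic_activations):
    """Return (output, path); fall back when any activation is unmapped."""
    if all(act in FUSED_ACTIVATIONS for _, _, act in layers):
        return mlp_forward(x, layers), "fused"
    for weights, biases, act in layers:
        fn = generic_activations[act]
        x = [fn(v) for v in dense(x, weights, biases)]
    return x, "generic"
```

The point of the sketch is that eligibility is decided once for the whole stack, so a single unmapped activation sends the entire call down the generic path.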

5. Loud fallback (observability)

The fused path silently fell back at the default diagnostic level. Now emits a one-time warning per model naming the reason (suppressible via AIDOTNET_QUIET).
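
A sketch of the one-shot warning pattern, assuming a per-instance flag and that any non-empty `AIDOTNET_QUIET` value suppresses the message (the exact interpretation of the variable is an assumption). The `Model` class and the `warn` callback are hypothetical stand-ins, not the `NeuralNetworkBase` code.

```python
import os
from typing import Callable, Mapping


class Model:
    def __init__(self, warn: Callable[[str], None], env: Mapping[str, str] = os.environ):
        self._warn = warn
        self._env = env
        self._logged_fused_fallback = False  # one-shot latch, per instance

    def train_step(self, fused_ok: bool, reason: str = "") -> str:
        if fused_ok:
            return "fused"
        # Warn only the first time, and only if the user has not opted out.
        if not self._logged_fused_fallback and not self._env.get("AIDOTNET_QUIET"):
            self._logged_fused_fallback = True
            self._warn(
                f"{type(self).__name__}: fused training path not engaged "
                f"({reason}); using the eager tape"
            )
        return "eager"
```

Because the latch is checked before the message is built, the per-step cost after the first fallback is a single boolean test.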

Coverage being completed on this branch

AiDotNet.Tensors 0.88.0 exposes 37 fused activations, 22 fused optimizers, 8 fused LR-schedule shapes. This PR expands AiDotNet's mappings toward full coverage, each gated by a numerical-parity test (fused result == eager result within tolerance) so no optimizer/activation is wired to a kernel whose math differs:

Schedulers: Constant, Cosine, Exponential, Noam, + Step / Cyclic / OneCycle / LinearWarmupCosine.
Optimizers: Adam, AdamW, SGD(+momentum), + the remaining fused kernels whose AiDotNet update math matches (AMSGrad, AdaMax, Nadam, RAdam, Adagrad, RMSprop, AdaDelta, Lion, …).
Activations: ReLU/Sigmoid/Tanh/Identity, + the remaining exact-equivalent fused shapes (GELU, LeakyReLU, ELU, Softplus, Mish, Swish, HardSwish/HardSigmoid/HardTanh, ReLU6, SoftSign, CELU, …).
Benchmarks: AIsEval PyTorch-parity harness re-run to confirm the compiled training plan beats PyTorch compiled (torch.compile) on the target shapes.

Builds clean on net10.0 + net471.

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Broad fused activation support and a fused MLP inference fast path.
- Fused/compiled optimizer training paths added for many optimizers (Adam variants, SGD, RMSProp, Adagrad, AdaDelta, AdaMax, LAMB, Lion, Nadam).
Improvements
- Learning-rate schedules integrated with compiled training; improved fused-path diagnostics and fallback behavior.
- Default Adam tuned for fused-kernel alignment.
Documentation
- Added PyTorch parity benchmark suite.
Tests
- New parity and integration tests validating fused inference and fused training.
Dependencies
- Updated native libraries to v0.90.2.

✅ Coverage completed + verified (parity-gated)

All wirings gated by a numerical-parity test; anything that diverged was left on eager and documented.

LR schedulers → LrSchedule: Constant, Cosine (fixed a pre-existing off-by-one), Exponential, Noam, Step, Cyclic(triangular). Step-for-step parity. (OneCycle deferred — AiDotNet uses linear warmup vs the kernel's cosine.)

Optimizers → fused kernels (fused-vs-eager training parity, Adam as control, all maxAbsDiff=0): SGD, Adam, AdamW, AMSGrad, AdaMax, Nadam, RMSprop, Adagrad, Lion, AdaDelta, LAMB. Expanded the stale 4-type allowlist → 20 and generalized the per-type fallback latch. (LARS/FTRL need params the config can't carry; RAdam/ASGD/Rprop have no AiDotNet class.)

Activations → FusedActivationType (parity ≤5e-7 via identity-weight FusedLinear): added Mish, SELU, Softplus, SoftSign, Sign, BentIdentity, Gaussian, LiSHT, SQRBF, ReLU6, HardSwish (on top of ReLU/Sigmoid/Tanh/Identity/GELU/LeakyReLU/SiLU/Swish). The gate caught HardSigmoid (0.2 slope vs kernel's x/6) — left on eager. Parametric (ELU/CELU/HardTanh/…) and RReLU/softmax-family deferred (documented).
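
To illustrate why the gate rejects HardSigmoid, here is a small sketch comparing the two slopes mentioned above. It assumes both variants use a 0.5 offset and clamp to [0, 1], as in the commonly used `0.2*x + 0.5` and `x/6 + 0.5` definitions; the function names and the sample grid are hypothetical, and only the 5e-7 tolerance comes from the text.

```python
def hard_sigmoid_slope_02(x: float) -> float:
    """Variant with a 0.2 slope."""
    return min(1.0, max(0.0, 0.2 * x + 0.5))


def hard_sigmoid_x_over_6(x: float) -> float:
    """Variant with a 1/6 slope."""
    return min(1.0, max(0.0, x / 6.0 + 0.5))


def parity_gate(eager, fused, xs, tol: float = 5e-7) -> bool:
    """Pass only if the two functions agree within tol on every sample."""
    return max(abs(eager(x) - fused(x)) for x in xs) <= tol


xs = [i / 10 for i in range(-50, 51)]
same_math = parity_gate(hard_sigmoid_slope_02, hard_sigmoid_slope_02, xs)
different_math = parity_gate(hard_sigmoid_slope_02, hard_sigmoid_x_over_6, xs)
```

The two variants agree at `x = 0` and in the saturated tails, so a gate that sampled only those points would miss the mismatch; sweeping the linear region is what exposes it.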

Benchmark — vs `torch.compile` (TorchInductor), MLP CPU, steady-state

| | AiDotNet fused | torch.compile | Verdict |
|---|---|---|---|
| Training (compiled training plan) | ~0.014 s/epoch | ~0.084 s/epoch | AiDotNet ~6× faster ✅ |
| Inference (Predict latency, bs32) | p95 1.27 ms | mean 0.32 ms | torch.compile ~4× faster |

The compiled training plan beats torch.compile (~6×, and ~15× over AiDotNet eager). Inference-latency vs TorchInductor is an honest remaining gap. Harness: benchmarks/AiDotNet.PyTorchParity (now with --compile).
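
The ratios quoted above follow directly from the table, and the harness gate compares a p95 on one side against a mean on the other. A small sketch of that arithmetic, assuming a nearest-rank p95 (the harness's actual percentile method is not stated here; `p95`, `speedup`, and `latency_gate` are hypothetical helper names):

```python
from statistics import mean
from typing import Sequence


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile (integer arithmetic avoids float rounding)."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def speedup(slower: float, faster: float) -> float:
    return slower / faster


def latency_gate(aidotnet_ms: Sequence[float], pytorch_ms: Sequence[float]) -> bool:
    """Gate from the harness description: p95(AiDotNet) < mean(PyTorch)."""
    return p95(aidotnet_ms) < mean(pytorch_ms)


training_ratio = speedup(0.084, 0.014)   # torch.compile s/epoch over AiDotNet s/epoch
inference_ratio = speedup(1.27, 0.32)    # AiDotNet p95 ms over torch.compile mean ms
```

Comparing a p95 against a mean is deliberately strict on the AiDotNet side, so the inference ratio overstates the gap relative to a mean-vs-mean comparison.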

The compiled fused-optimizer training path (CompiledTapeTrainingStep) is the fast path and is attempted on every Train() step (EnableCompilation defaults true). When its gates aren't met it silently falls back to the eager autograd tape: a multi-x perf cliff with ZERO signal at the default diagnostic level. A user can "enable compilation" and unknowingly train on the slow path forever.

This was found via the AIsEval benchmark: a bare FeedForwardNeuralNetwork with the default AdamOptimizer had every step rejected by TryMapToFusedOptimizerConfig ("optimizer AdamOptimizer not compatible with fused kernel") and fell back invisibly. Only TrainingDiagnosticsConfig at PerStep surfaced it.

Emit a one-time Trace.TraceWarning per model instance the first time the fused path doesn't engage, naming the reason and how to re-enable it. Gated by a one-shot flag so it never spams the per-step training loop, and suppressible via AIDOTNET_QUIET. PerStep diagnostics still give per-step detail. This is observability only; there is no behavior change to training itself.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-29T18:57:23Z

Walkthrough

Adds IFusedActivation and IFusedOptimizerSpec contracts; implements TryGetFusedActivation across many activations (with parameter gating where needed); implements a fused dense+activation MLP inference fast-path; maps LR schedulers for fused optimizers and adds IFusedOptimizerSpec implementations for many optimizers; improves compiled-training fused gating and adds tests plus a PyTorch-parity benchmark project with Python scripts.

Changes

Fused Execution Infrastructure

| Layer / File(s) | Summary |
|---|---|
| Fused activation contract & implementations `src/ActivationFunctions/Fused/IFusedActivation.cs`, `src/ActivationFunctions/*Activation.cs` | Adds public `IFusedActivation.TryGetFusedActivation(out FusedActivationType)` and implements it across many activations (identity→None, LeakyReLU alpha gating, SiLU→Swish, GELU documented numeric form; ELU/Mish intentionally non-claimed or documented). |
| Fused MLP inference fast-path `src/NeuralNetworks/FeedForwardNeuralNetwork.cs` | `Predict` enters inference mode, attempts `TryFusedDensePredict` which validates dense-only layers and scalar IFusedActivation mappings, collects weights/biases, calls `AiDotNetEngine.MlpForward`, and falls back to `Forward` on ineligibility or InvalidOperationException while restoring training-mode. |
| Fused optimizer contract `src/Optimizers/Fused/IFusedOptimizerSpec.cs` | Adds internal `FusedOptimizerConfig` and `IFusedOptimizerSpec.TryGetFusedOptimizerConfig(out FusedOptimizerConfig)` for fused-kernel optimizer configuration extraction. |
| Optimizers & LR schedule mapping `src/Optimizers/*`, `src/Optimizers/GradientBasedOptimizerBase.cs` | Multiple optimizers implement `IFusedOptimizerSpec.TryGetFusedOptimizerConfig` returning fused configs when adaptive LR is disabled and a fused LR schedule is available; `TryGetFusedLrSchedule` maps supported schedulers (null/constant, CosineAnnealing, Exponential, Noam, Step, symmetric Triangular cyclic) to fused `LrSchedule`. |
| NeuralNetworkBase: fused optimizer dispatch & fallback logging `src/NeuralNetworks/NeuralNetworkBase.cs` | `TryMapToFusedOptimizerConfig` dispatches via `IFusedOptimizerSpec.TryGetFusedOptimizerConfig`; adds a one-time fused-fallback TraceWarning guarded by `_loggedFusedFallback`; changes default base optimizer to Adam with `UseAMSGrad=false`. |
| Compiled training step: fused gating & latch `src/Training/CompiledTapeTrainingStep.cs` | Expands fused optimizer allowlist (includes AMSGrad); introduces per-thread `_fusedUnavailableTypes` to remember optimizer types that failed fused execution and skip retries; latch cleared on Invalidate(). |
| Tests `tests/*` | Adds integration/unit tests validating fused optimizer engagement, fused activation numeric parity, LR schedule parity, LeakyReLU parameter gating, Mish/ELU non-claiming, and Predict stability on lazy initialization. |

PyTorch-parity Benchmarks

| Layer / File(s) | Summary |
|---|---|
| Solution & project files `AiDotNet.sln`, `benchmarks/AiDotNet.PyTorchParity/*` | Adds new benchmark project to solution, project file referencing `src/AiDotNet.csproj`, and `.gitignore` for benchmark artifacts. |
| C# benchmark harness & models `benchmarks/AiDotNet.PyTorchParity/Program.cs` | Adds harness that runs configured models (mlp, mlp-fused, cnn, lstm, transformer) for training and inference, measures timing, peak RSS, optional nvidia-smi, and writes indented JSON report. |
| PyTorch benchmark & compare scripts `benchmarks/AiDotNet.PyTorchParity/pytorch/*` | Adds Python parity benchmark, compare tool, requirements, and README describing parity workflow and measurement conventions. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

perf: Close per-model gaps surfaced by AIsEval fair-comparison vs PyTorch (LSTM, Transformer, MLP, regression-pipeline, OpenCL RSS) AiDotNet.Tensors#436 — Framework-side fused-kernel wiring and MLP/optimizer fusion objectives align with these changes.

Possibly related PRs

ooples/AiDotNet#1386 — Related fused-Adam fused-step engagement tests and fused training path changes.
feat: auto-compiled training integration (Phase 2) #1107 — Overlaps compiled training-step execution flow adjustments.
feat: add learning rate scheduler integration and refactor diffusion architecture #574 — Related learning-rate scheduler and optimizer stepping integrations.

Suggested labels

feature, dependencies, testing, architecture, priority:p0

…to Predict

Two issues, same PR.

(1) Default-optimizer gate, the real reason "compiled training does nothing": GetOrCreateBaseOptimizer defaulted UseAMSGrad=true (added in #1350 as a weak, non-standard band-aid for post-convergence drift on a couple of recurrent models; partial at best). AMSGrad-by-default is non-standard (PyTorch/TF/Optax all default amsgrad=False) and, because the fused kernel mapper didn't accept AMSGrad, it silently forced EVERY model onto the eager tape. Reverted to standard Adam, restoring both the industry default and the fused fast path.

(2) Open/closed fused dispatch (replaces the type-switch / enum mapping):
- IFusedOptimizerSpec: optimizers that have a fused SIMD kernel self-describe their FusedOptimizerConfig. Only Adam/AdamW/SGD implement it (the only kernels that exist), so there's no central whitelist and no `OptimizerType is (… or …)` list to maintain; a new optimizer becomes fuse-able by implementing the interface. Adam/AdamW self-select the AMSGrad kernel variant when UseAMSGrad is set, so opt-in AMSGrad keeps the fast path (matching PyTorch's fused/compiled amsgrad) instead of being rejected. Scheduler→LrSchedule mapping moved to a shared GradientBasedOptimizerBase helper.
- IFusedActivation: activations with an exact fused equivalent (ReLU, Sigmoid, Tanh, Identity→None) self-declare their FusedActivationType. GELU intentionally omitted until tanh-approx-vs-erf equivalence is verified.

(3) Proper wiring: FeedForwardNeuralNetwork.Predict now runs a pure dense+fused-activation stack through IEngine.MlpForward (one fused call) instead of the per-layer tape walk, via the activation interface (no switch). Falls back to the generic Forward for anything the kernel can't represent (non-dense, vector activation, unmapped/mixed activations) or if MlpForward declines under a tape.

CompiledTapeTrainingStep now also accepts OptimizerType.AMSGrad (the AVX2 AMSGradUpdateSimd kernel already exists in Tensors); the companion Tensors PR wires it through CompiledTrainingPlan's supported set + vMax buffer so opt-in AMSGrad fully runs fused. Until that lands, AMSGrad falls back loudly (no wrong update).

Builds clean on net10.0 + net471.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/Training/CompiledTapeTrainingStep.cs (1)
351-353: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update fused-optimizer constraints docs to include AMSGrad

The XML constraint text is now stale; code allows OptimizerType.AMSGrad (Line 408). Please update this block so behavior/docs stay aligned.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/Training/CompiledTapeTrainingStep.cs` around lines 351 - 353, The XML doc
for CompiledTapeTrainingStep currently lists only SGD, Adam, and AdamW as
supported by CompiledTrainingPlan{T}.ConfigureOptimizer but the code also allows
OptimizerType.AMSGrad; update the <item> text to include AMSGrad (and optionally
rephrase to "SGD, Adam, AdamW, and AMSGrad") and keep the note about using the
plain Step method or the eager tape path for other optimizer types so the
documentation matches the behavior of ConfigureOptimizer and references to Step
and the eager tape path remain accurate.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/ActivationFunctions/Fused/IFusedActivation.cs`:
- Around line 19-23: The IFusedActivation interface is an internal plumbing
contract and must not be public; change its declaration from public to internal
(i.e., make IFusedActivation internal) and ensure the related type
FusedActivationType is also internal or otherwise not exposed publicly so
callers outside the assembly cannot reference this plumbing; update any
implementing classes (names referencing IFusedActivation) to match the new
internal visibility and run compilation to fix any accessibility errors caused
by this change.

In `@src/ActivationFunctions/ReLUActivation.cs`:
- Around line 27-31: The ReLUActivation<T> class exposes the fused-kernel
plumbing member FusedActivationType as a public property; change it to an
explicit interface implementation so it is not part of the concrete public API.
Replace the public property declaration with an explicit implementation of
Fused.IFusedActivation.FusedActivationType (i.e. implement
AiDotNet.Tensors.Engines.FusedActivationType
Fused.IFusedActivation.FusedActivationType =>
AiDotNet.Tensors.Engines.FusedActivationType.ReLU) inside the ReLUActivation<T>
class so the member is only accessible via the Fused.IFusedActivation interface.

In `@src/ActivationFunctions/SigmoidActivation.cs`:
- Around line 36-40: The public FusedActivationType property on
SigmoidActivation<T> is exposing internal routing metadata; remove the public
auto-property and implement AiDotNet.Tensors.Engines.FusedActivationType as an
explicit interface member for Fused.IFusedActivation (i.e., implement
Fused.IFusedActivation.FusedActivationType =>
AiDotNet.Tensors.Engines.FusedActivationType.Sigmoid) inside the
SigmoidActivation<T> class so it is not part of the concrete public API but
still satisfies the interface contract.

In `@src/NeuralNetworks/NeuralNetworkBase.cs`:
- Around line 6666-6672: GetOrCreateBaseOptimizer relies on
AdamOptimizerOptions' constructor default for UseAMSGrad; make the non-AMSGrad
behavior explicit by creating the AdamOptimizerOptions instance, set its
UseAMSGrad property to false, and pass that options object into the
AdamOptimizer constructor (i.e., modify GetOrCreateBaseOptimizer to instantiate
a Models.Options.AdamOptimizerOptions<T, Tensor<T>, Tensor<T>> options variable,
set options.UseAMSGrad = false, then new AdamOptimizer(..., options)).
- Around line 5328-5341: The warning should not be emitted when compilation is
explicitly disabled; update the condition around the _loggedFusedFallback branch
to also check TensorCodecOptions.Current.EnableCompilation (or
TensorCodecOptions.EnableCompilation) and only log the fused-training fallback
when compilation is enabled and _mixedPrecisionContext is null; keep the
existing checks for _loggedFusedFallback, _mixedPrecisionContext, the
AIDOTNET_QUIET env var, and include _pendingFusedMissReason/GetType().Name in
the message as before so the alert only appears for unexpected fallbacks when
compilation was intended to run.

In `@src/Optimizers/AdamOptimizer.cs`:
- Around line 126-144: The public TryGetFusedOptimizerConfig method on
AdamOptimizer should be removed from the public API and implemented as an
explicit IFusedOptimizerSpec member; change the public method into an explicit
interface implementation of IFusedOptimizerSpec.TryGetFusedOptimizerConfig
within the AdamOptimizer class, keeping the existing logic (including the
UseAdaptiveLearningRate check, TryGetFusedLrSchedule call, and construction of
Fused.FusedOptimizerConfig using _options, GetCurrentLearningRate(),
TryGetFusedLrSchedule and the AMSGrad branch) so the behavior is identical but
the method is no longer part of the concrete public surface.

In `@src/Optimizers/AdamWOptimizer.cs`:
- Around line 146-162: The public method TryGetFusedOptimizerConfig on
AdamWOptimizer should be converted to an explicit interface implementation so
fused internals are not exposed on the concrete AdamWOptimizer API; locate the
TryGetFusedOptimizerConfig method (and its use of Fused.FusedOptimizerConfig,
TryGetFusedLrSchedule, GetCurrentLearningRate and _options) and change its
signature to implement the fused interface explicitly (e.g.
IFusedOptimizerSpec.TryGetFusedOptimizerConfig) rather than a public member,
retaining the existing logic and return behavior so callers through the
interface still receive the Fused.FusedOptimizerConfig while the concrete
AdamWOptimizer class no longer exposes the fused plumbing publicly.

In `@src/Optimizers/Fused/IFusedOptimizerSpec.cs`:
- Around line 18-25: The Fused optimizer plumbing types are exposed publicly but
should be internal; change the accessibility of FusedOptimizerConfig (currently
declared as "public readonly record struct FusedOptimizerConfig(...)") to
internal (e.g., "internal readonly record struct FusedOptimizerConfig(...)") and
likewise change the other plumbing types mentioned around lines 51-58 to
internal; update any callers within the project (tests or other internal
classes) to use the now-internal types (no API consumers should be affected) and
ensure the file compiles after switching the access modifiers.

In `@src/Optimizers/StochasticGradientDescentOptimizer.cs`:
- Around line 70-81: The public TryGetFusedOptimizerConfig method on
StochasticGradientDescentOptimizer is exposing internal fused-config plumbing;
change it to an explicit interface implementation so it is not part of the
concrete public API. Locate the public bool TryGetFusedOptimizerConfig(out
Fused.FusedOptimizerConfig config) on class StochasticGradientDescentOptimizer
and convert it to an explicit implementation of the appropriate internal
interface (keep the logic that checks _options.UseAdaptiveLearningRate, calls
TryGetFusedLrSchedule, and uses GetCurrentLearningRate to build the
Fused.FusedOptimizerConfig), removing the public modifier so only the interface
exposes this member. Ensure the signature matches the internal interface exactly
and update accessibility accordingly.

In `@src/Training/CompiledTapeTrainingStep.cs`:
- Around line 397-408: The hot path currently admits
AiDotNet.Tensors.Engines.Compilation.OptimizerType.AMSGrad but relies on
exception-catching inside TryStepWithFusedOptimizer (and the per-step warning at
line ~640) to handle unsupported runtime builds, causing per-step exceptions and
log churn; modify the control flow so ConfigureOptimizer (or the plan selection
logic) performs a cheap runtime capability check for AMSGrad support (e.g.,
whether vMax buffer and FusedOptimizer.AMSGradUpdateSimd kernel are available)
and return a boolean/capability flag that the calling code
(CompiledTapeTrainingStep.TryStepWithFusedOptimizer) inspects before entering
the fused path, falling back once deterministically to the eager tape with a
single one-time warning instead of relying on exceptions; update usage of
optimizerType, ConfigureOptimizer, and TryStepWithFusedOptimizer to gate AMSGrad
by that capability flag.

📥 Commits

Reviewing files that changed from the base of the PR and between d05f43e and f8da771.

📒 Files selected for processing (14)

src/ActivationFunctions/Fused/IFusedActivation.cs
src/ActivationFunctions/GELUActivation.cs
src/ActivationFunctions/IdentityActivation.cs
src/ActivationFunctions/ReLUActivation.cs
src/ActivationFunctions/SigmoidActivation.cs
src/ActivationFunctions/TanhActivation.cs
src/NeuralNetworks/FeedForwardNeuralNetwork.cs
src/NeuralNetworks/NeuralNetworkBase.cs
src/Optimizers/AdamOptimizer.cs
src/Optimizers/AdamWOptimizer.cs
src/Optimizers/Fused/IFusedOptimizerSpec.cs
src/Optimizers/GradientBasedOptimizerBase.cs
src/Optimizers/StochasticGradientDescentOptimizer.cs
src/Training/CompiledTapeTrainingStep.cs

… wiring

Convert IFusedActivation from a FusedActivationType property to TryGetFusedActivation(out type), so a parametric activation whose parameter differs from the kernel's hardcoded value reports no fused equivalent and stays on the exact generic path instead of silently inheriting the kernel default.

Wire the 8 activations whose kernel is numerically identical through the shipped Tensors FusedLinear/MlpForward path (resolved via CpuFusedOperations._floatActivations/_doubleActivations): ReLU, Sigmoid, Tanh, Identity(None), GELU(tanh-approx), Swish, SiLU(=Swish), LeakyReLU. Each is locked by a parity test asserting the fused kernel equals the scalar Activate() on the same pre-activation (<1e-4).

Unwire Mish and ELU: that path's activation tables register only None/ReLU/GELU/Sigmoid/Tanh/LeakyReLU/Swish, so routing Mish/ELU would throw (their formulas live only in the unrelated BlasManaged ActivationEpilogue). MishAndElu_ReportNoFusedKernel locks the contract; adding the kernels is tracked by AiDotNet.Tensors #499.

Fix LeakyReLU fused guard tolerance (1e-12 -> 1e-6): the default 0.01 slope round-trips through float as 0.009999999776, ~2.2e-10 off the literal, so 1e-12 wrongly rejected the default.

Guard DenseLayer lazy-init in TryFusedDensePredict: a fresh network's first Predict has [0,0] sentinel weights that MlpForward rejects, so bail to the generic Forward for that call (Predict_FreshLazyNetwork_DoesNotThrow).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
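
The LeakyReLU tolerance fix can be reproduced with a few lines of arithmetic. This sketch uses only the standard library to round-trip 0.01 through IEEE-754 single precision; `slope_matches` is a hypothetical guard written for illustration, not the AiDotNet check.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def slope_matches(alpha: float, kernel_alpha: float = 0.01, tol: float = 1e-6) -> bool:
    """Accept the fused kernel only if alpha equals its hardcoded slope within tol."""
    return abs(alpha - kernel_alpha) <= tol


alpha = to_float32(0.01)      # what a float-typed activation actually stores
drift = abs(alpha - 0.01)     # roughly 2.2e-10, far above 1e-12

accepted_loose = slope_matches(alpha, tol=1e-6)    # default slope passes
accepted_tight = slope_matches(alpha, tol=1e-12)   # default slope wrongly rejected
custom_rejected = not slope_matches(0.2)           # a real mismatch is still caught
```

A tolerance of 1e-6 sits well above single-precision rounding error for a value near 0.01, while still rejecting any slope a user would set on purpose.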

…ngage fused path Adds DefaultOptimizer_EngagesFusedPath_NotSilentEagerFallback: trains with no explicitly-supplied optimizer (optimizer: null → GetOrCreateBaseOptimizer) and asserts CompiledTapeTrainingStep.GetFusedStepCount() > 0. The default optimizer previously constructed Adam with UseAMSGrad=true, which TryMapToFusedOptimizerConfig rejected — silently demoting every default-configured model to the eager tape so compiled training never ran. This regression test fails loudly if a non-mappable default is ever reintroduced. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

In-repo twin of AIsEval's aidotnet-benchmarks, but referencing the AiDotNet *source* (ProjectReference, not a NuGet package) so it measures the current working tree — the validation harness for perf changes a released-package benchmark can't see (e.g. PR #1469's fused-training gate and the FeedForwardNeuralNetwork.Predict -> IEngine.MlpForward inference wiring). Both sides build the same MLP/CNN/LSTM/Transformer models with matching layer shapes, run the same training + multi-batch (1/8/32/128) inference loop with p95 latency + RSS, and emit the same JSON schema; pytorch/compare.py lines the two reports up row-by-row (gate: p95(AiDotNet) < mean(PyTorch)). PyTorch runs eager on purpose so the comparison is kernels-vs-kernels, not compile-stack-vs-stack. AIDOTNET_FUSED_DIAG=1 prints whether the compiled fused training step engaged (Hit) and, on fallback, the captured root-cause exception via CompiledTapeTrainingStep.GetLastFallbackException — which already earned its keep: it shows FeedForwardNeuralNetwork's AMSGrad-mode default optimizer (chosen for the #1332 drift fix) hits NotSupportedException in the Tensors CompiledTrainingPlan, so compiled training silently falls back to eager for the most common model class. Wiring AMSGrad's existing kernel into the plan dispatch is tracked by Tensors #74. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

CompiledTrainingPlan.ConfigureOptimizer previously threw NotSupportedException for AMSGrad even though the AVX2 AMSGradUpdateSimd kernel already existed. That silently demoted AiDotNet's FeedForwardNeuralNetwork, whose default optimizer is AMSGrad-mode Adam (chosen for the drift fix in ooples/AiDotNet#1332), to the eager tape for every training run (the "compiled does nothing" symptom the in-repo parity harness surfaced via GetLastFallbackException).

Wire AMSGrad into the plan's CPU fused-update closures:
- Add OptimizerType.AMSGrad to ValidatePlanOptimizerSupport.
- Thread a per-parameter vMax buffer (running max of the second moment) through all four closure builders (float/double x grouped/ungrouped); allocated only when the optimizer is AMSGrad.
- Add an AMSGrad case to each CPU step switch calling AMSGradUpdateSimd, using Adam's L2 weight-decay convention (wd=0 for the FFN default).
- Add a double overload of AMSGradUpdateSimd mirroring the float kernel so the double-precision plan keeps the same non-increasing-denominator guarantee.

GPU AMSGrad is intentionally still unsupported (no backend kernel): a GPU-resident parameter with AMSGrad throws clearly via the GPU switch default; the common CPU path (FeedForwardNeuralNetwork on the CPU engine) is unblocked.

Tests (ConfigureOptimizerAMSGradTests): kernel-direct float/double parity against an independent textbook AMSGrad over a rise-then-fall gradient sequence (so vMax exceeds v and the max path matters); plan-level AMSGrad updates params in place (the direct regression), equals Adam on step 1 (vMax==v), and diverges from Adam over 40 steps (proves vMax is consulted, not aliased). Adam/SGD param-update and double-path tests still green (no regression from the shared-closure edits).

Companion to ooples/AiDotNet#1469: once released and bumped, FeedForwardNeuralNetwork's AMSGrad default finally engages compiled training instead of falling back. Closes #500.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
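
The test design described above (equal to Adam on step 1, diverging over a rise-then-fall gradient sequence) can be reproduced with a scalar textbook sketch. It assumes the common convention of taking the running max before bias correction; the Tensors kernel's exact convention is not shown here, and the gradient values are made up for illustration.

```python
import math
from typing import List


def run(grads: List[float], use_amsgrad: bool, lr: float = 1e-3,
        b1: float = 0.9, b2: float = 0.999, eps: float = 1e-8) -> List[float]:
    """Scalar Adam / AMSGrad; returns the parameter value after each step."""
    p, m, v, v_max = 1.0, 0.0, 0.0, 0.0
    history = []
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        v_max = max(v_max, v)                  # running max of the second moment
        second = v_max if use_amsgrad else v   # the only difference between the two
        m_hat = m / (1 - b1 ** t)
        v_hat = second / (1 - b2 ** t)
        p -= lr * m_hat / (math.sqrt(v_hat) + eps)
        history.append(p)
    return history


# Gradients rise, then fall, so v drops below v_max from step 4 onward.
grads = [0.1, 1.0, 5.0] + [0.1] * 37
adam = run(grads, use_amsgrad=False)
amsgrad = run(grads, use_amsgrad=True)
```

While the second moment is still rising, `v_max == v` and the two trajectories coincide; once it falls, AMSGrad keeps the larger denominator and takes smaller steps.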

…allbacks

API surface (keep fused dispatch plumbing off the concrete public API):
- IFusedOptimizerSpec + FusedOptimizerConfig made internal (compiled-dispatch implementation details; only consumed in-assembly via the interface).
- AdamOptimizer / AdamWOptimizer / StochasticGradientDescentOptimizer now implement TryGetFusedOptimizerConfig as explicit interface implementations.
- All 8 IFusedActivation implementations (GELU/Identity/LeakyReLU/ReLU/Sigmoid/SiLU/Swish/Tanh) now implement TryGetFusedActivation explicitly.

Behavior:
- NeuralNetworkBase: suppress the loud fused-fallback warning when TensorCodecOptions.EnableCompilation is false (explicit opt-out, not an unexpected fallback).
- GetOrCreateBaseOptimizer pins UseAMSGrad = false explicitly so the fused fast path can't silently regress if the AdamOptimizerOptions default ever flips.
- CompiledTapeTrainingStep: latch a per-thread _amsgradFusedUnavailable flag on the first AMSGrad fused failure so later AMSGrad steps skip the fused attempt instead of reconfigure/throw/catch/warn every step (per-step churn).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
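
A sketch of the latch idea: remember which optimizer types failed on the fused path so later steps skip the attempt, and clear the memory on invalidation. `FusedStepper` and its members are hypothetical names, `NotImplementedError` stands in for whatever the real path throws, and the per-thread aspect is omitted here for brevity.

```python
from typing import Callable, Set


class FusedStepper:
    def __init__(self, fused_step: Callable[[str], None]):
        self._fused_step = fused_step
        self._unavailable: Set[str] = set()  # optimizer types that already failed
        self.fused_attempts = 0

    def step(self, optimizer_type: str) -> str:
        if optimizer_type in self._unavailable:
            return "eager"  # skip the attempt: no throw/catch/warn churn
        self.fused_attempts += 1
        try:
            self._fused_step(optimizer_type)
            return "fused"
        except NotImplementedError:
            self._unavailable.add(optimizer_type)  # latch on first failure
            return "eager"

    def invalidate(self) -> None:
        """Forget past failures, for example after the plan is rebuilt."""
        self._unavailable.clear()


def fake_kernel(optimizer_type: str) -> None:
    # Pretend only AMSGrad lacks a fused kernel in this build.
    if optimizer_type == "amsgrad":
        raise NotImplementedError(optimizer_type)
```

Keying the latch by optimizer type, as the generalized version in this PR does, means one unsupported optimizer does not disable the fused path for the others.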

…r dispatch

CpuFusedOperations._floatActivations/_doubleActivations (the tables FusedLinear/MlpForward resolves pointwise activations through) registered only None/ReLU/GELU/Sigmoid/Tanh/LeakyReLU/Swish, so MlpForward THREW for the other pointwise FusedActivationType values. That is exactly the gap that forced AiDotNet's Mish/ELU activations off the fused inference path.

Adds ELU (alpha=1), SELU, Softplus (with the standard x>20 linear cutoff to avoid exp overflow), Mish, HardSwish, HardSigmoid, HardTanh to both the float and double tables, as inlined helper methods mirroring ApplyGelu. MlpForward/FusedLinear now cover all 14 pointwise activations; only Softmax stays out (it's row-wise, not pointwise, and must be applied separately after the GEMM).

Tests (MlpForwardActivationParityTests): for each new activation, float and double, MlpForward(activation) equals the canonical scalar formula applied to the raw x·W (independent textbook reference, <1e-4 float / <1e-9 double).

Unblocks re-wiring AiDotNet's Mish/ELU (and SELU/Softplus/Hard* if classed) to IFusedActivation once this ships. Companion to ooples/AiDotNet#1469.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
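For reference, the seven added activations have short scalar definitions. The sketch below is an independent Python rendering of the commonly used formulas, not the C# helpers: the SELU constants are the standard published values, and the HardSigmoid/HardSwish forms follow the ReLU6-based convention, which is an assumption about which variant the kernels use.

```python
import math

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * math.expm1(x)

def selu(x):
    return SELU_SCALE * elu(x, SELU_ALPHA)

def softplus(x, threshold=20.0):
    # Past the cutoff, softplus(x) is numerically x, and skipping exp()
    # avoids overflow for large inputs.
    return x if x > threshold else math.log1p(math.exp(x))

def mish(x):
    return x * math.tanh(softplus(x))

def hard_sigmoid(x):
    return min(max(x + 3.0, 0.0), 6.0) / 6.0  # assumed ReLU6-based variant

def hard_swish(x):
    return x * hard_sigmoid(x)

def hard_tanh(x):
    return min(max(x, -1.0), 1.0)

print(softplus(1000.0), mish(0.0), hard_swish(3.0), hard_tanh(2.5))
```

Without the cutoff, `math.exp(1000.0)` raises OverflowError in Python; the analogous float kernel would produce infinity.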

…ations into the plan/MlpForward dispatch (#501)

* feat(#500): dispatch AMSGrad through CompiledTrainingPlan's fused update

* docs(#500): study PyTorch fused/compiled optimizer internals vs ours

Findings doc for the compiled-training perf work: PyTorch's 3-tier (for-loop/foreach/fused) optimizer impls, multi_tensor_apply horizontal fusion, and torch.compile (Inductor) vertical fusion, mapped against what CompiledTrainingPlan already does well (compile-once-replay, inlined LR schedule, live-backed in-place writes, AVX2 kernels, epilogue fusion) and the gaps to close. Surfaced a concrete quick win: the per-param *UpdateSimd kernels recompute the step-constant bias-correction powers (1-β^t) on every parameter call; PyTorch computes them once per step. Prioritized action items included.

* feat(#500): wire the remaining float kernel-backed optimizers into the plan dispatch

Extends the AMSGrad dispatch to the other optimizers whose AVX2 kernels already existed in FusedOptimizer but were rejected by CompiledTrainingPlan (forcing the eager tape): Nadam, RAdam, LAMB, RMSprop, Adagrad, Lion, SGDMomentum, AdaMax.

- ValidatePlanOptimizerSupport is now dtype-aware: float allows the full set (SGD/Adam/AdamW/AMSGrad + the 8 above); double keeps SGD/Adam/AdamW/AMSGrad (the new kernels are float-only). Double/GPU use of a float-only optimizer is rejected at ConfigureOptimizer rather than configure-then-throw mid-step.
- Buffer allocation: RAdam/LAMB join the m+v set; AdaMax reuses the v slot as its infinity-norm u; the others reuse existing m or v.
- Hyperparameter mapping into the generic (lr, beta1, beta2, eps, wd) slots: beta2 = RMSprop decay (rho); beta1 = SGD momentum; Lion/LAMB apply decoupled weight decay inside their kernels, the rest use Adam's L2 convention.

Still NOT wired (need hyperparameters the ConfigureOptimizer API doesn't carry): AdaDelta (rho + 2 accumulators), FTRL (l1/l2/lr_power), LARS (trust coeff), ASGD (lambd/alpha/mu). These are cleanly rejected at configure time so callers fall back to eager; tracked for a follow-up API extension.

Tests (ConfigureOptimizerFusedDispatchTests): all 12 wired float optimizers dispatch through the plan, update params in place, stay finite, and move meaningfully over 5 steps; float-only optimizers on a double plan throw; un-wired optimizers throw. Kernel math correctness stays covered by the kernels' own tests (AMSGrad additionally has full kernel-direct parity).

* feat(#500): wire AdaDelta/LARS/FTRL/ASGD/Rprop via a FusedOptimizerExtras API extension

Completes the dispatch for every per-parameter-elementwise kernel that exists in FusedOptimizer. These four (five) needed hyperparameters the generic (lr, beta1, beta2, eps, weightDecay) slots don't carry, so this adds an optional FusedOptimizerExtras object to ConfigureOptimizer / ConfigureOptimizerGrouped (and the interface) with documented per-field defaults:

- AdaDelta: 2 accumulators (accumGrad=v, accumUpdate=vMax); rho via the beta2 slot.
- LARS: velocity=m; layer-wise trust ratio from extras.Momentum + TrustCoefficient + wd.
- FTRL: z=v, n=vMax; extras.L1/L2/LrPower (FTRL owns its regularization).
- ASGD: ax=m; closure computes eta_t = lr/(1+lambd*lr*t)^alpha and mu_t = 1/max(1,t-t0).
- Rprop: prevGrad=m, stepSize=v (seeded to extras.RpropInitialStep on step 1); extras.RpropEtaPlus/EtaMinus/StepMin/StepMax. No lr (step-size based).

FusedOptimizerExtras is a class with property initializers (not a record struct) so `new FusedOptimizerExtras()` yields the documented defaults rather than all-zero.

Now wired (17 float OptimizerTypes): SGD, SGDMomentum, Adam, AdamW, Adagrad, RMSprop, Lion, AdaMax, AMSGrad, Nadam, AdaDelta, LARS, LAMB, FTRL, RAdam, ASGD, Rprop.

Still rejected (need a different execution model, fail fast at configure): SparseAdam (sparse indices), LBFGS (closure line-search), HypergradientSGD / DAdaptationSGD (GLOBAL cross-parameter reductions; per-tensor would be a different algorithm), ScheduleFreeSGD (y-buffer written before the forward).

Tests: all 17 wired float optimizers dispatch + update in place + finite + move; FTRL with a strong L1 drives sparsity (proves extras flow into the kernel, not ignored); float-only on double throws; the 5 unwireable throw. Updated FusedAdaptiveLrPlanTests (Lion/LAMB are no longer rejected; now supported). Adam/SGD/double-path/adaptive-lr regression tests still green.

* feat(#500): add the 7 missing pointwise activations to the FusedLinear dispatch

* fix(#500): make the optimizer gate device-aware + label doc code fence (PR review)

CodeRabbit review on PR #501:

- ValidatePlanOptimizerSupport was dtype-aware but not device-aware: AMSGrad and the float-only CPU kernels were accepted for any float plan, but the GPU step path ships only SGD/Adam/AdamW backend kernels. On a mixed CPU/GPU plan the GPU-switch default throw lands AFTER earlier CPU params were updated, leaving a partially applied step. Now ConfigureOptimizer / ConfigureOptimizerGrouped detect any GPU-backed parameter and reject non-SGD/Adam/AdamW at configure time (atomic, before _optimizerUpdate is published). CPU-only plans are unaffected (hasGpuParams=false); all 46 dispatch/parity tests still green.
- Tagged the unlabeled fenced code block in the research doc as csharp (MD040).

Co-authored-by: franklinic <franklin@ivorycloud.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
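The ASGD schedule quoted in the FusedOptimizerExtras commit, eta_t = lr/(1+lambd*lr*t)^alpha and mu_t = 1/max(1,t-t0), is easy to evaluate by hand. The Python sketch below computes both and applies them in one scalar step; the update order (decay, gradient step, then averaging into ax) follows the usual ASGD formulation and is an assumption about the closure, and all argument values are illustrative.

```python
def asgd_eta(lr, lambd, alpha, t):
    """Decayed step size eta_t = lr / (1 + lambd * lr * t) ** alpha."""
    return lr / (1.0 + lambd * lr * t) ** alpha

def asgd_mu(t, t0):
    """Averaging weight mu_t = 1 / max(1, t - t0); stays 1 until t passes t0."""
    return 1.0 / max(1, t - t0)

def asgd_step(p, ax, g, t, lr=0.01, lambd=1e-4, alpha=0.75, t0=3):
    """One scalar ASGD step (assumed ordering, illustrative defaults)."""
    eta = asgd_eta(lr, lambd, alpha, t)
    mu = asgd_mu(t, t0)
    p = p * (1.0 - lambd * eta) - eta * g   # weight decay + gradient step
    ax = ax + (p - ax) * mu                 # running average of the iterates
    return p, ax

p, ax = 1.0, 0.0
for t in range(1, 8):
    p, ax = asgd_step(p, ax, g=0.5, t=t)
print(p, ax)
```

While mu_t is 1 the average simply tracks the parameter; once t exceeds t0 it starts to lag behind, which is the averaging effect ASGD is named for.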

… to full pointwise parity

Two fused activation paths existed with different coverage: the MlpForward/FusedLinear dispatch tables (CpuFusedOperations) reached 14 pointwise activations in #501, while the BlasManaged ActivationEpilogue stopped at 8 (ReLU/LeakyReLU/Sigmoid/Tanh/GELU/Swish/Mish/ELU). This:

- Adds two new pointwise FusedActivationType kernels: ReLU6 = min(max(0,x),6) (MobileNet/quantized) and SoftSign = x/(1+|x|), wired into the CpuFusedOperations float+double tables.
- Brings ActivationEpilogue (fp32 + fp64) to parity with those tables: adds SELU, Softplus, HardSwish, HardSigmoid, HardTanh + the new ReLU6, SoftSign.

Both fused paths now cover all 16 pointwise activations.

Still out of scope (not pointwise / need parameter threading, tracked in #499): Softmax & Softmin (row-wise), and the parametric activations (CELU/ThresholdedReLU/ScaledTanh/PReLU), which need a parameter carried through the fused path. That is a follow-up API extension analogous to FusedOptimizerExtras.

Tests: MlpForwardActivationParityTests + new EpilogueActivationParityTests verify every new activation (float + double) matches an independent canonical formula through both fused paths.

Companion to ooples/AiDotNet#1469.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
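The two new kernels and the table-style lookup can be sketched together. The formulas are the ones stated in the commit; the `POINTWISE` dictionary and `apply_epilogue` helper are hypothetical Python stand-ins for the dispatch tables, not the CpuFusedOperations API.

```python
def relu6(x):
    return min(max(0.0, x), 6.0)

def softsign(x):
    return x / (1.0 + abs(x))

# Hypothetical name-keyed table, standing in for the float/double tables.
POINTWISE = {"relu6": relu6, "softsign": softsign}

def apply_epilogue(rows, name):
    """Apply a pointwise activation to every element of a [batch, features] list."""
    fn = POINTWISE[name]
    return [[fn(v) for v in row] for row in rows]

gemm_out = [[-2.0, 3.0, 7.5], [1.0, -3.0, 6.0]]
print(apply_epilogue(gemm_out, "relu6"))
print(apply_epilogue(gemm_out, "softsign"))
```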

Adds an optional FusedActivationParams (Alpha/Beta/Theta, nullable so a missing field resolves to each activation's canonical default) threaded through the fused activation paths, so parametric activations fuse with ANY parameter instead of only the hardcoded value:

- New FusedActivationType values: CELU, ThresholdedReLU, ScaledTanh.
- LeakyReLU now fuses for any slope (not just 0.01); ELU for any alpha (not just 1).
- CELU (alpha), ThresholdedReLU (theta), ScaledTanh (alpha, beta) fuse via params.

Plumbing (optional trailing param, fully back-compatible; null = prior behavior):

- CpuFusedOperations.GetFloatActivation/GetDoubleActivation build a parametric closure from the params (falling back to the per-activation default), with the non-parametric activations still served by the static dispatch tables.
- ApplyBiasActivationInPlace/Double, CpuEngine.FusedLinear, CpuEngine.MlpForward (hidden + output params), and the IEngine interface all carry the optional params.
- DirectGpuTensorEngine.FusedLinear defers to the base CPU params-aware path when custom params are supplied (GPU fused kernels don't carry them yet).
- ActivationEpilogue (fp32 + fp64) honors params for LeakyReLU/ELU and implements CELU/ThresholdedReLU/ScaledTanh.

Out of scope (documented):

- PReLU needs a per-channel slope vector (not a scalar), which is a separate kernel signature.
- The tape/graph training path applies activations via ActivationRegistry (canonical defaults). MlpForward is inference-only, so the main consumer path is fully covered.
- Softmax/Softmin remain row-wise.

Tests: MlpForwardActivationParityTests + EpilogueActivationParityTests gain parametric cases (LeakyReLU 0.2, ELU 2, CELU 1.5, ThresholdedReLU 0.5, ScaledTanh 1.7/0.66) verifying both fused paths honor the supplied parameter (float + double). FusedLinear/MlpForward regression suite green (110 passed).

Companion to ooples/AiDotNet#1469.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
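The "nullable field resolves to the canonical default" behavior can be sketched as follows. This is a Python illustration, not the C# type: `ActivationParams` and `resolve` are hypothetical names, the choice of `alpha` as the field carrying the LeakyReLU slope is an assumption, and the CELU default of 1.0 is assumed rather than taken from the source (the LeakyReLU 0.01 and ELU 1 defaults are the ones the commit states).

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class ActivationParams:  # hypothetical stand-in for FusedActivationParams
    alpha: Optional[float] = None
    beta: Optional[float] = None
    theta: Optional[float] = None

def _or_default(value: Optional[float], default: float) -> float:
    return default if value is None else value

def resolve(kind: str, params: Optional[ActivationParams] = None) -> Callable[[float], float]:
    """Build a scalar closure; a missing field falls back to the default."""
    p = params if params is not None else ActivationParams()
    if kind == "leaky_relu":
        slope = _or_default(p.alpha, 0.01)
        return lambda x: x if x > 0 else slope * x
    if kind == "elu":
        alpha = _or_default(p.alpha, 1.0)
        return lambda x: x if x > 0 else alpha * math.expm1(x)
    if kind == "celu":
        alpha = _or_default(p.alpha, 1.0)  # assumed default
        return lambda x: max(0.0, x) + min(0.0, alpha * math.expm1(x / alpha))
    raise KeyError(kind)

print(resolve("leaky_relu")(-1.0))
print(resolve("leaky_relu", ActivationParams(alpha=0.2))(-1.0))
```

Passing no params object at all and passing one with every field unset give the same closure, which mirrors the back-compatible "null = prior behavior" rule.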

…ActivationType set

Implements the remaining declared activation types so every FusedActivationType value now has a working fused kernel in BOTH paths (MlpForward/FusedLinear tables and the BlasManaged ActivationEpilogue):

- PReLU: per-output-channel learned slope via FusedActivationParams.PReluSlope (length = features, or 1 for shared; default 0.25). Applied per output column, so it runs in a dedicated channel-aware pass (not the pointwise delegate).
- RReLU: deterministic eval form = leaky with slope (lower+upper)/2 (default ≈0.2292, override via Alpha). Fused paths are inference-only.
- Softmax / Softmin: row-wise (over the feature dim) with the standard max-shift for numerical stability; Softmin = softmax(-x). Run as a per-row pass after bias.

PReLU/Softmax/Softmin get a dedicated branch at the top of ApplyBiasActivationInPlace/Double (and matching epilogue cases) because they need column/row context; the pointwise delegate path and SIMD fast path are unchanged for every other activation (no regression: existing activations skip the branch).

Tests: per-channel PReLU parity, Softmax/Softmin row-normalization (+ monotonic / anti-monotonic ordering, rows sum to 1), RReLU added to the parametric set, all in float and double through both MlpForward and the epilogue. 58 activation tests green.

This closes the activation half of #499: all 23 FusedActivationType values fuse (only the specialized non-enumerated activation classes, such as Sparsemax, Maxout, and GumbelSoftmax, remain unenumerated, marked lower-priority in the issue).

Companion to ooples/AiDotNet#1469.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
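The row- and channel-aware passes are short enough to write out. The Python sketch below shows the max-shifted softmax, softmin as softmax(-x), and a per-column PReLU; the RReLU bounds of 1/8 and 1/3 are the common defaults and are assumed here because they reproduce the ≈0.2292 slope quoted above. Function names are illustrative.

```python
import math

def softmax_row(row):
    shift = max(row)  # max-shift keeps exp() from overflowing
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def softmin_row(row):
    return softmax_row([-v for v in row])

def prelu_rows(rows, slopes):
    """PReLU with one slope per output column, or a single shared slope."""
    def slope_for(j):
        return slopes[j] if len(slopes) > 1 else slopes[0]
    return [[v if v > 0 else slope_for(j) * v for j, v in enumerate(row)]
            for row in rows]

# Deterministic eval-mode RReLU slope: midpoint of the assumed bounds.
RRELU_EVAL_SLOPE = (1.0 / 8.0 + 1.0 / 3.0) / 2.0

print(softmax_row([1000.0, 1001.0, 1002.0]))
print(prelu_rows([[-2.0, 4.0], [1.0, -1.0]], [0.25, 0.5]))
print(round(RRELU_EVAL_SLOPE, 4))
```

Without the shift, `math.exp(1000.0)` would overflow; with it, the largest exponent is always 0.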

coderabbitai

Actionable comments posted: 4

♻️ Duplicate comments (1)

src/ActivationFunctions/Fused/IFusedActivation.cs (1)
28-35: ⚠️ Potential issue | 🔴 Critical | ⚖️ Poor tradeoff

BLOCKING: Fused activation dispatch contract must be internal, not public.

IFusedActivation is a fused-kernel routing interface consumed by internal layers (ActivationLayer.TryGetFusedActivationType, FeedForwardNeuralNetwork.TryFusedDensePredict). Exposing it as public violates the facade pattern—users should interact only with AiModelBuilder and configuration classes, not dispatch plumbing. All implementations use explicit interface syntax, so the interface visibility does not affect the concrete activation classes.

A prior review flagged this exact issue and marked it "Addressed in commits b029210 to a6851e4," yet the code remains public. Make it internal.
🔒 Proposed fix

```diff
-public interface IFusedActivation
+internal interface IFusedActivation
 {
     /// <summary>
     /// Reports the fused-kernel activation type equivalent to this activation, or
     /// returns <c>false</c> if this instance can't be reproduced by the kernel.
     /// </summary>
     bool TryGetFusedActivation(out FusedActivationType type);
 }
```
As per coding guidelines: "src/**: Users should ONLY interact with AiModelBuilder.cs and AiModelResult.cs" and "Prefer internal over public for plumbing/helper classes that users never instantiate or consume."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/ActivationFunctions/Fused/IFusedActivation.cs` around lines 28 - 35,
Change the IFusedActivation interface from public to internal so the
fused-kernel routing contract is not exposed in the public API; update the
declaration of IFusedActivation accordingly (the explicit implementations on
concrete activation classes need no changes), and verify call sites like
ActivationLayer.TryGetFusedActivationType and
FeedForwardNeuralNetwork.TryFusedDensePredict still compile against the
now-internal interface.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/AiDotNet.PyTorchParity/pytorch/benchmark.py`:
- Line 276: The expression computing p95_idx uses an unnecessary int() cast
around round(); update the line that assigns p95_idx to remove int() so it reads
`p95_idx = min(len(steady_sorted) - 1, round(0.95 * (len(steady_sorted) - 1)))`
(referencing variable steady_sorted and the p95_idx assignment) ensuring the
result is still bounded by len(steady_sorted)-1; no other logic change required.
- Line 89: The __enter__ method's return type annotation uses unnecessary string
quotes; update the signature in the ResourceMonitor class by removing the quotes
so it reads a normal forward reference (i.e., change def __enter__(self) ->
"ResourceMonitor": to use ResourceMonitor without quotes), ensuring the
annotation matches the class name and Python's type hinting style.

In `@benchmarks/AiDotNet.PyTorchParity/pytorch/compare.py`:
- Around line 24-30: The helper _get lacks a return type annotation; update its
signature to include a return type and type for default (e.g. import Any from
typing and change to def _get(d: dict, *names: str, default: Any = None) ->
Any:) and ensure the typing import (from typing import Any) is added at the top;
this keeps behavior identical but provides explicit type information for callers
and linters.

In `@benchmarks/AiDotNet.PyTorchParity/pytorch/requirements.txt`:
- Line 3: Update the PyTorch minimum version in requirements.txt to avoid known
security vulnerabilities: replace the current "torch>=2.2" entry with a safer
minimum such as "torch>=2.5.0" or pin to a compatible patch series like
"torch~=2.6.0" so consumers use a known-safe release; ensure any CI or local
test docs that reference the "torch" requirement are updated accordingly.
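The p95_idx suggestion for benchmark.py relies on Python's built-in `round()` returning an `int` when called with a single argument, which is why the extra `int()` cast is redundant. A minimal sketch, with `p95_index` as a hypothetical helper name wrapping the same expression:

```python
def p95_index(n):
    """Index of the 95th-percentile element in a sorted list of length n >= 1."""
    return min(n - 1, round(0.95 * (n - 1)))

steady_sorted = sorted(float(i) for i in range(20))
idx = p95_index(len(steady_sorted))
print(type(round(0.95 * 19)).__name__, idx, steady_sorted[idx])
```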


📥 Commits

Reviewing files that changed from the base of the PR and between f8da771 and e32ade6.

📒 Files selected for processing (28)

AiDotNet.sln
benchmarks/AiDotNet.PyTorchParity/.gitignore
benchmarks/AiDotNet.PyTorchParity/AiDotNet.PyTorchParity.csproj
benchmarks/AiDotNet.PyTorchParity/Program.cs
benchmarks/AiDotNet.PyTorchParity/README.md
benchmarks/AiDotNet.PyTorchParity/pytorch/benchmark.py
benchmarks/AiDotNet.PyTorchParity/pytorch/compare.py
benchmarks/AiDotNet.PyTorchParity/pytorch/requirements.txt
src/ActivationFunctions/ELUActivation.cs
src/ActivationFunctions/Fused/IFusedActivation.cs
src/ActivationFunctions/GELUActivation.cs
src/ActivationFunctions/IdentityActivation.cs
src/ActivationFunctions/LeakyReLUActivation.cs
src/ActivationFunctions/MishActivation.cs
src/ActivationFunctions/ReLUActivation.cs
src/ActivationFunctions/SiLUActivation.cs
src/ActivationFunctions/SigmoidActivation.cs
src/ActivationFunctions/SwishActivation.cs
src/ActivationFunctions/TanhActivation.cs
src/NeuralNetworks/FeedForwardNeuralNetwork.cs
src/NeuralNetworks/NeuralNetworkBase.cs
src/Optimizers/AdamOptimizer.cs
src/Optimizers/AdamWOptimizer.cs
src/Optimizers/Fused/IFusedOptimizerSpec.cs
src/Optimizers/StochasticGradientDescentOptimizer.cs
src/Training/CompiledTapeTrainingStep.cs
tests/AiDotNet.Tests/IntegrationTests/NeuralNetworks/FusedOptimizerIntegrationTests.cs
tests/AiDotNet.Tests/UnitTests/NeuralNetworks/FusedInferenceParityTests.cs

…n/LiSHT/ISRU/SQRBF/BinarySpiking/BentIdentity + LogSoftmax/LogSoftmin/Spherical/Taylor/Gumbel/Sparsemax/Squash

Enumerates and fuses every remaining activation that can be an elementwise or row-wise epilogue on the [batch, features] GEMM output. Formulas matched to the AiDotNet activation classes.

Pointwise (added to the CpuFusedOperations resolvers + delegated by the epilogue): Sign, BentIdentity, Gaussian, LiSHT, ISRU(alpha), SQRBF(beta), BinarySpiking(threshold).

Row-/channel-wise (new shared RowwiseFusedActivations helper, used by BOTH the MlpForward/FusedLinear epilogue and the BlasManaged ActivationEpilogue so the two paths stay identical, float + double): LogSoftmax, LogSoftmin, SphericalSoftmax (softmax of x/‖x‖₂), TaylorSoftmax (2nd-order), GumbelSoftmax (deterministic eval = softmax(x/temperature); the training-time noise is not fused), Sparsemax (simplex projection via sort), Squash (capsule). PReLU + Softmax/Softmin also moved into this shared helper.

The epilogue now routes channel/row-wise types through RowwiseFusedActivations and resolves any other pointwise activation from the shared registry (default branch), so it covers the full set without duplicating 30+ inline cases.

Every FusedActivationType value (0–36) now has a working fused kernel in both paths. The only activations NOT fused are the ones with NO FusedActivationType and that are structurally not elementwise epilogues:

• Maxout: reduces k channels to 1 (changes output dimensionality; a pooling op).
• HierarchicalSoftmax: needs a class tree + target label (loss-coupled).

These are documented in the enum; fusing them would require a different op shape, not a kernel.

Tests: MlpForwardActivationParityTests gains all 7 new pointwise (float+double) and a row-wise theory (LogSoftmax/LogSoftmin/Spherical/Taylor/Gumbel/Sparsemax/Squash) each vs an independent reference; existing PReLU/Softmax/Softmin/epilogue tests still pass through the shared helper. 79 activation tests green; FusedLinear/MlpForward regression green (135 passed / 0 failed).

Companion to ooples/AiDotNet#1469.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
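Two of the row-wise activations named above, Sparsemax (simplex projection via sort) and second-order TaylorSoftmax, can be written as short row functions. The Python sketch below uses the standard published formulations; it is an independent reference, not the RowwiseFusedActivations helper, and the function names are illustrative.

```python
def sparsemax_row(row):
    """Euclidean projection of `row` onto the probability simplex, via a sort."""
    ordered = sorted(row, reverse=True)
    running, support, support_sum = 0.0, 0, 0.0
    for k, z in enumerate(ordered, start=1):
        running += z
        if 1.0 + k * z > running:  # z is still inside the support
            support, support_sum = k, running
    tau = (support_sum - 1.0) / support
    return [max(0.0, z - tau) for z in row]

def taylor_softmax_row(row):
    """Softmax with exp(x) replaced by its 2nd-order Taylor expansion."""
    terms = [1.0 + x + 0.5 * x * x for x in row]  # always >= 0.5
    total = sum(terms)
    return [t / total for t in terms]

print(sparsemax_row([1.0, 0.0]))
print(sparsemax_row([0.1, 0.2, 0.3]))
print(taylor_softmax_row([0.0, 1.0, 2.0]))
```

Unlike softmax, sparsemax can return exact zeros, as the first call shows.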

…dient, D-Adaptation)

Two items previously documented as "needs a different op shape", now implemented.

FusedLinearMaxout (CpuEngine + IEngine): GEMM + bias then grouped-max over the feature dim, [.., M, N] → [.., M, N/numPieces] (Goodfellow et al. 2013). Maxout is a shape-changing reduction, not an activation epilogue, so it gets its own fused op. Forward/inference-only (reuses the FusedLinear fast path for the GEMM).

HypergradientSGD + DAdaptationSGD: wired into CompiledTrainingPlan via a NEW two-phase (global-reduce → apply) path in the float optimizer-update closure. They maintain ONE scalar shared across ALL parameters, which the per-parameter switch can't express:

• Hypergradient: lr_t = lr_{t-1} + β·⟨g_t, g_{t-1}⟩ (global inner product), then p -= lr_t·g; prevGrad in m[p]. β via FusedOptimizerExtras.HyperLr.
• D-Adaptation (growth-bounded / Prodigy): global ‖s‖² and r drive a single distance estimate d; p -= d·lr·g; s in m[p]. d0 / growth via extras.

State persists across steps in captured closure locals. CPU-only (the device gate rejects them for GPU plans) and ungrouped (rejected with per-group schedules, since a single global LR is meaningless per group).

Still NOT fused, each needing machinery beyond a fused step (documented in tests):

• SparseAdam: sparse-gradient index lists (plan operates on dense grads).
• LBFGS: closure line-search (multiple loss evals per step).
• ScheduleFreeSGD: needs y=(1-β)z+βx written BEFORE the forward (a pre-forward parameter-transform hook the plan doesn't have).
• HierarchicalSoftmax: an alternative output LAYER with its own learned tree-node weights traversed over the input features; not an activation on the logits.

Tests: FusedLinearMaxoutTests (grouped-max parity for numPieces 2/3/4 + indivisible guard); Hypergradient diverges from SGD (global LR adaptation active); D-Adaptation grows d above d0 (moves ≫ d0·lr·g); both rejected with grouped schedules. Optimizer + activation regression suites green.

Companion to ooples/AiDotNet#1469.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
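The two-phase (global-reduce → apply) structure is what separates Hypergradient SGD from the per-parameter optimizers. A minimal Python sketch of the stated update, lr_t = lr_{t-1} + β·⟨g_t, g_{t-1}⟩ followed by p -= lr_t·g, using nested lists for the parameter tensors; the function name and values are illustrative, not the closure's actual code.

```python
def hypergradient_sgd_step(params, grads, prev_grads, lr, hyper_lr):
    """One step over ALL parameter tensors, sharing a single learning rate."""
    # Phase 1 (global reduce): inner product of current and previous gradients,
    # accumulated across every tensor before any parameter is touched.
    dot = sum(g * pg
              for gs, pgs in zip(grads, prev_grads)
              for g, pg in zip(gs, pgs))
    lr = lr + hyper_lr * dot
    # Phase 2 (apply): plain SGD with the freshly adapted learning rate.
    new_params = [[p - lr * g for p, g in zip(ps, gs)]
                  for ps, gs in zip(params, grads)]
    return new_params, lr

params = [[1.0, 2.0], [3.0]]
grads = [[0.5, 0.5], [1.0]]
new_params, new_lr = hypergradient_sgd_step(params, grads, grads, lr=0.1, hyper_lr=0.01)
print(new_lr, new_params)
```

Running the reduction per tensor instead would give each tensor its own learning rate, which is the "different algorithm" the earlier rejection note refers to.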

… regression)

#1470: a from-scratch Transformer trained via the per-call minibatch pattern (MaxIterations=1 + external epoch loop calling model.Train repeatedly) stalled on 0.207.x. The default Noam schedule's LR stayed frozen at its warmup-step-1 value instead of ramping, so loss never left the uniform floor (PPL ≈ V).

Root cause: the compiled fused-training kernel bakes a CONSTANT learning rate. A default Transformer's Adam+Noam (StepPerBatch) optimizer was committed to that fused path, which froze the LR; the per-step Noam ramp can't be reproduced by a constant-rate kernel.

The fix on this branch is the IFusedOptimizerSpec gating: TryGetFusedLrSchedule returns false for unmapped schedules (Noam), so TryMapToFusedOptimizerConfig declines the fused path and training falls back to the eager OnBatchEnd → StepScheduler path that actually ramps the LR.

Two guards (both verified passing on this branch):

1. AdamWithNoamSchedule_DoesNotMapToConstantRateFusedConfig: deterministic unit test of the exact fix seam. Adam+Noam.TryGetFusedOptimizerConfig() must return false (forces eager), while a no-scheduler Adam still returns true (fused fast path preserved). NeuralNetworkBase.TryMapToFusedOptimizerConfig delegates to this same spec method, so it faithfully guards the real training path.
2. Transformer_PerCallTrain_DefaultNoam_RampsLearningRateAcrossCalls: end-to-end smoke test. A default-optimizer Transformer trained via repeated per-call Train must accumulate scheduler step state (one StepPerBatch advance per call) and ramp the Noam LR above its warmup-step-1 value.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
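To see why a frozen step-1 learning rate stalls training, it helps to evaluate the Noam schedule directly. The Python sketch below uses the standard Transformer formulation, lr = factor · d^-0.5 · min(step^-0.5, step · warmup^-1.5); the d_model and warmup values are illustrative, not the project's defaults.

```python
def noam_lr(step, d_model, warmup, factor=1.0):
    """Noam schedule: linear warmup to `warmup` steps, then inverse-sqrt decay."""
    step = max(step, 1)  # the schedule is defined from step 1
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

d_model, warmup = 512, 4000
frozen = noam_lr(1, d_model, warmup)      # what a constant-rate kernel would keep
peak = noam_lr(warmup, d_model, warmup)   # what the ramp reaches at the end of warmup
print(frozen, peak, peak / frozen)
```

With these values the step-1 rate is roughly 1/4000 of the peak, so a kernel that bakes in the step-1 value trains with a learning rate thousands of times too small.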

…op, global-reduction optimizers) (#502) * feat(#499): add ReLU6 + SoftSign kernels and bring ActivationEpilogue to full pointwise parity Two fused activation paths existed with different coverage: the MlpForward/FusedLinear dispatch tables (CpuFusedOperations) reached 14 pointwise activations in #501, while the BlasManaged ActivationEpilogue stopped at 8 (ReLU/LeakyReLU/Sigmoid/Tanh/GELU/Swish/Mish/ELU). This: - Adds two new pointwise FusedActivationType kernels: ReLU6 = min(max(0,x),6) (MobileNet/quantized) and SoftSign = x/(1+|x|) — wired into the CpuFusedOperations float+double tables. - Brings ActivationEpilogue (fp32 + fp64) to parity with those tables: adds SELU, Softplus, HardSwish, HardSigmoid, HardTanh + the new ReLU6, SoftSign. Both fused paths now cover all 16 pointwise activations. Still out of scope (not pointwise / need parameter threading, tracked in #499): Softmax & Softmin (row-wise), and the parametric activations (CELU/ThresholdedReLU/ScaledTanh/PReLU) which need a parameter carried through the fused path — a follow-up API extension analogous to FusedOptimizerExtras. Tests: MlpForwardActivationParityTests + new EpilogueActivationParityTests verify every new activation (float + double) matches an independent canonical formula through both fused paths. Companion to ooples/AiDotNet#1469. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(#499): parametric fused activations via FusedActivationParams Adds an optional FusedActivationParams (Alpha/Beta/Theta, nullable so a missing field resolves to each activation's canonical default) threaded through the fused activation paths, so parametric activations fuse with ANY parameter instead of only the hardcoded value: - New FusedActivationType values: CELU, ThresholdedReLU, ScaledTanh. - LeakyReLU now fuses for any slope (not just 0.01); ELU for any alpha (not just 1). - CELU (alpha), ThresholdedReLU (theta), ScaledTanh (alpha, beta) fuse via params. 
Plumbing (optional trailing param, fully back-compatible — null = prior behavior): - CpuFusedOperations.GetFloatActivation/GetDoubleActivation build a parametric closure from the params (falling back to the per-activation default), with the non-parametric activations still served by the static dispatch tables. - ApplyBiasActivationInPlace/Double, CpuEngine.FusedLinear, CpuEngine.MlpForward (hidden + output params), and the IEngine interface all carry the optional params. - DirectGpuTensorEngine.FusedLinear defers to the base CPU params-aware path when custom params are supplied (GPU fused kernels don't carry them yet). - ActivationEpilogue (fp32 + fp64) honors params for LeakyReLU/ELU and implements CELU/ThresholdedReLU/ScaledTanh. Out of scope (documented): PReLU needs a per-channel slope vector (not a scalar) — a separate kernel signature; the tape/graph training path applies activations via ActivationRegistry (canonical defaults) — MlpForward is inference-only so the main consumer path is fully covered; Softmax/Softmin remain row-wise. Tests: MlpForwardActivationParityTests + EpilogueActivationParityTests gain parametric cases (LeakyReLU 0.2, ELU 2, CELU 1.5, ThresholdedReLU 0.5, ScaledTanh 1.7/0.66) verifying both fused paths honor the supplied parameter (float + double). FusedLinear/MlpForward regression suite green (110 passed). Companion to ooples/AiDotNet#1469. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(#499): fuse PReLU, RReLU, Softmax, Softmin — completes the FusedActivationType set Implements the remaining declared activation types so every FusedActivationType value now has a working fused kernel in BOTH paths (MlpForward/FusedLinear tables and the BlasManaged ActivationEpilogue): - PReLU: per-output-channel learned slope via FusedActivationParams.PReluSlope (length = features, or 1 for shared; default 0.25). Applied per output column, so it runs in a dedicated channel-aware pass (not the pointwise delegate). 
- RReLU: deterministic eval form = leaky with slope (lower+upper)/2 (default ≈0.2292, override via Alpha) — fused paths are inference-only. - Softmax / Softmin: row-wise (over the feature dim) with the standard max-shift for numerical stability; Softmin = softmax(-x). Run as a per-row pass after bias. PReLU/Softmax/Softmin get a dedicated branch at the top of ApplyBiasActivationInPlace/Double (and matching epilogue cases) because they need column/row context; the pointwise delegate path and SIMD fast path are unchanged for every other activation (no regression — existing activations skip the branch). Tests: per-channel PReLU parity, Softmax/Softmin row-normalization (+ monotonic / anti-monotonic ordering, rows sum to 1), RReLU added to the parametric set — float and double, through both MlpForward and the epilogue. 58 activation tests green. This closes the activation half of #499: all 23 FusedActivationType values fuse (only the specialized non-enumerated activation classes — Sparsemax, Maxout, GumbelSoftmax, etc. — remain unenumerated, marked lower-priority in the issue). Companion to ooples/AiDotNet#1469. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(#499): fuse the remaining specialized activations — Sign/Gaussian/LiSHT/ISRU/SQRBF/BinarySpiking/BentIdentity + LogSoftmax/LogSoftmin/Spherical/Taylor/Gumbel/Sparsemax/Squash Enumerates and fuses every remaining activation that can be an elementwise or row-wise epilogue on the [batch, features] GEMM output. Formulas matched to the AiDotNet activation classes. Pointwise (added to the CpuFusedOperations resolvers + delegated by the epilogue): Sign, BentIdentity, Gaussian, LiSHT, ISRU(alpha), SQRBF(beta), BinarySpiking(threshold). 
Row-/channel-wise (new shared RowwiseFusedActivations helper, used by BOTH the MlpForward/FusedLinear epilogue and the BlasManaged ActivationEpilogue so the two paths stay identical; float + double): LogSoftmax, LogSoftmin, SphericalSoftmax (softmax of x/‖x‖₂), TaylorSoftmax (2nd-order), GumbelSoftmax (deterministic eval = softmax(x/temperature); the training-time noise is not fused), Sparsemax (simplex projection via sort), Squash (capsule). PReLU + Softmax/Softmin also moved into this shared helper.

The epilogue now routes channel/row-wise types through RowwiseFusedActivations and resolves any other pointwise activation from the shared registry (default branch), so it covers the full set without duplicating 30+ inline cases. Every FusedActivationType value (0–36) now has a working fused kernel in both paths.

The only activations NOT fused are the ones with NO FusedActivationType and that are structurally not elementwise epilogues:
• Maxout: reduces k channels to 1 (changes output dimensionality; a pooling op).
• HierarchicalSoftmax: needs a class tree + target label (loss-coupled).

These are documented in the enum; fusing them would require a different op shape, not a kernel.

Tests: MlpForwardActivationParityTests gains all 7 new pointwise (float+double) and a row-wise theory (LogSoftmax/LogSoftmin/Spherical/Taylor/Gumbel/Sparsemax/Squash) each vs an independent reference; existing PReLU/Softmax/Softmin/epilogue tests still pass through the shared helper. 79 activation tests green; FusedLinear/MlpForward regression green (135 passed / 0 failed).

Companion to ooples/AiDotNet#1469.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(#499): FusedLinearMaxout + global-reduction optimizers (Hypergradient, D-Adaptation)

Two items previously documented as "needs a different op shape", now implemented.
FusedLinearMaxout (CpuEngine + IEngine): GEMM + bias then grouped-max over the feature dim, [.., M, N] → [.., M, N/numPieces] (Goodfellow et al. 2013). Maxout is a shape-changing reduction, not an activation epilogue, so it gets its own fused op. Forward/inference-only (reuses the FusedLinear fast path for the GEMM).

HypergradientSGD + DAdaptationSGD: wired into CompiledTrainingPlan via a NEW two-phase (global-reduce → apply) path in the float optimizer-update closure. They maintain ONE scalar shared across ALL parameters, which the per-parameter switch can't express:
• Hypergradient: lr_t = lr_{t-1} + β·⟨g_t, g_{t-1}⟩ (global inner product), then p -= lr_t·g; prevGrad in m[p]. β via FusedOptimizerExtras.HyperLr.
• D-Adaptation (growth-bounded / Prodigy): global ‖s‖² and r drive a single distance estimate d; p -= d·lr·g; s in m[p]. d0 / growth via extras.

State persists across steps in captured closure locals. CPU-only (the device gate rejects them for GPU plans) and ungrouped (rejected with per-group schedules, since a single global LR is meaningless per group).

Still NOT fused, each needing machinery beyond a fused step (documented in tests):
• SparseAdam: sparse-gradient index lists (plan operates on dense grads).
• LBFGS: closure line-search (multiple loss evals per step).
• ScheduleFreeSGD: needs y=(1-β)z+βx written BEFORE the forward (a pre-forward parameter-transform hook the plan doesn't have).
• HierarchicalSoftmax: an alternative output LAYER with its own learned tree-node weights traversed over the input features; not an activation on the logits.

Tests: FusedLinearMaxoutTests (grouped-max parity for numPieces 2/3/4 + indivisible guard); Hypergradient diverges from SGD (global LR adaptation active); D-Adaptation grows d above d0 (moves ≫ d0·lr·g); both rejected with grouped schedules. Optimizer + activation regression suites green.

Companion to ooples/AiDotNet#1469.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(#499): wire ScheduleFreeSGD (pre-forward hook) + FusedHierarchicalSoftmax: the last two

ScheduleFreeSGD (Defazio et al. 2024): the SIMD kernels already existed in FusedOptimizer; this wires them into CompiledTrainingPlan via a new _preForwardParamTransform hook invoked in Step() before the forward replay. The hook writes y=(1-β)z+βx into the live parameter backing so gradients are evaluated at the interpolation point; the optimizer update advances z (SGD) and x (running weighted average, weightSum += lr²) then restores x into the backing as the eval copy. z/x live in m[p]/v[p] (seeded from the initial weights). Added SfBeta to FusedOptimizerExtras; gate + grouped-guard updated; ScheduleFreeSGD moved from the rejected list to a dedicated functional test (eval weights shrink on Σwᵢ² and diverge from plain SGD).

FusedHierarchicalSoftmax (Morin & Bengio 2005): new virtual CpuEngine op (inherited by DirectGpuTensorEngine). Computes the treeDepth shared per-level gate sigmoids once per row then forms each leaf's root-to-leaf path product, replacing the eager layer's per-class gate recomputation. Generic over T via INumericOperations. Test matches the naive per-class reference for power-of-two (sums to 1) and non-power-of-two (early-break) class counts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#499): classify FusedLinearMaxout + FusedHierarchicalSoftmax as NonDifferentiableOps

TapeCompletenessTests.AllTensorReturningMethods_AreClassified enumerates every IEngine Tensor-returning method and requires each be registered. Both fused output primitives are forward/inference-only (they throw under an active tape; training decomposes into recordable per-layer ops), so they belong in NonDifferentiableOps alongside MlpForward / LstmSequenceForward / MultiHeadAttentionForward. FusedLinearMaxout was unclassified since it landed in 15ec075; this fixes both.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(#499): drive FusedLinear(...,FusedActivationParams) overload in the GPU coverage harness

EveryGpuKernel_IsAutoTestedOrAllowlisted flagged the parametric FusedLinear overload as uncovered: the single-shape arg generator couldn't synthesize a FusedActivationParams value, so the overload was neither auto-testable nor allowlisted. Teach CandidatesForType to emit null for FusedActivationParams, a valid value meaning "use defaults", which reduces the overload to the base FusedLinear(...,FusedActivationType) GPU kernel that is already auto-tested. This gives the params overload real GPU-vs-CPU coverage rather than an allowlist skip. (Pre-existing gap from the parametric #499 work.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#499): address CodeRabbit review: CELU alpha guards, params threading, PReLU bounds, schedule-free/hypergradient correctness

Resolves all 9 review threads on PR #502:
- CELU divides by alpha → reject alpha <= 0 in both fused paths (ActivationEpilogue fp32/fp64, CpuFusedOperations float/double activation delegates).
- Thread FusedActivationParams through the public FusedGemmBiasActivation float and double entrypoints (+ Unchecked) so direct callers can use parametric LeakyReLU/ELU/CELU/ThresholdedReLU/ScaledTanh settings.
- PReLU per-channel slope: defensively clamp to the last element when a misconfigured slope array is shorter than the feature dim (was IndexOutOfRangeException), in both ApplyFloat and ApplyDouble.
- Schedule-Free: clear _preForwardParamTransform on grouped reconfigure so a stale y=(1-β)z+βx rewrite can't leak into a subsequent grouped optimizer.
- HypergradientSGD: honor a non-constant LrSchedule. The effective lr is the per-step schedule base plus the accumulated hypergradient adjustment (was frozen at GetLr(1)); a constant schedule reduces to the prior behavior.
- FusedOptimizerExtras.Validate(): reject HyperLr<0, D0<=0, DGrowthRate<1, SfBeta∉[0,1] at configure time; called from both ConfigureOptimizer paths.
- Test comment: note LBFGS is also still rejected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: franklinic <franklin@ivorycloud.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
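
The two-phase (global-reduce → apply) Hypergradient update described in the messages above can be sketched in a few lines. This is an illustrative Python analog, not the Tensors kernel: `hypergradient_sgd_step` and its argument names are hypothetical, and parameters are plain nested lists rather than tensors.

```python
def hypergradient_sgd_step(params, grads, prev_grads, lr, hyper_lr):
    """One hypergradient-SGD step over a list of parameter vectors.

    Phase 1 (global reduce): a single inner product <g_t, g_{t-1}> taken
    across ALL parameters adjusts the one shared learning rate.
    Phase 2 (apply): every parameter is updated with that same rate.
    Names and list-based layout are assumptions made for illustration.
    """
    if hyper_lr < 0:
        raise ValueError("hyper_lr must be non-negative")

    # Phase 1: global inner product over every parameter tensor.
    dot = sum(
        g * pg
        for gs, pgs in zip(grads, prev_grads)
        for g, pg in zip(gs, pgs)
    )
    new_lr = lr + hyper_lr * dot  # lr_t = lr_{t-1} + beta * <g_t, g_{t-1}>

    # Phase 2: p -= lr_t * g for every element of every parameter.
    new_params = [
        [p - new_lr * g for p, g in zip(ps, gs)]
        for ps, gs in zip(params, grads)
    ]
    return new_params, new_lr
```

Because the inner product spans every parameter, the update cannot be expressed as an independent per-parameter kernel, which is the reason the commit gives for the separate two-phase path.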

…R fix)

Replaces the eager-fallback workaround with the real solution: the default Transformer recipe (Adam β₂=0.98 + Noam warmup) now trains on the fused-compiled path with a correct per-step LR ramp, with no forced slow path.

The fused training plan already evaluates LrSchedule.GetLr(step) every optimizer step (the same model as PyTorch fused=True, which takes lr as a per-step scalar). Cosine/Exponential/OneCycle/LinearWarmupCosine were already supported; Noam was the only missing shape, which is why Adam+Noam Transformers fell back to eager (or, pre-gate, froze at a constant rate).

- Bump AiDotNet.Tensors 0.86.6 → 0.88.0 for LrSchedule.Noam (Tensors #504).
- GradientBasedOptimizerBase.TryGetFusedLrSchedule: map NoamSchedule → LrSchedule.Noam(modelDim, warmup, factor). Both use t = step (1-based), so the fused LR sequence is bit-identical to the eager NoamSchedule.
- NoamSchedule.Factor getter so the mapping is fully faithful.

Tests (3, all passing on the live CUDA box):
1. AdamWithNoamSchedule_MapsToFusedConfig_WithRampingSchedule: Adam+Noam now maps to a fused config (no eager fallback); mapped schedule ramps 4000× over warmup and matches the paper peak.
2. FusedNoamSchedule_MatchesEagerNoamSchedule_StepForStep: fused GetLr(N) == eager lr(t=N) for 3× warmup steps.
3. Transformer_PerCallTrain_DefaultNoam_EngagesFusedPath_AndConverges: end-to-end, a default-Noam Transformer per-call Train engages the fused path (3200/3200 steps fused) and converges to PPL 5.06 / top-1 7/8 (avgNll 1.62 < ln(V) 2.08), proving the LR ramped instead of freezing.

Verified the pre-existing ModelFamily TableTransformer/TabTransformer/DecisionTransformer failures reproduce identically at the 0.86.6 baseline; they don't use Noam and are unrelated to this change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
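
For reference, the Noam schedule that this mapping carries onto the fused path can be sketched as below. It assumes the standard formula from the Transformer paper, lr(t) = factor · d^-0.5 · min(t^-0.5, t · warmup^-1.5), with a 1-based step as the commit states; the Python function name and signature are hypothetical and are not the AiDotNet or Tensors API.

```python
def noam_lr(step, model_dim, warmup, factor=1.0):
    """Noam learning rate at a 1-based optimizer step.

    Linear ramp for step <= warmup, inverse-square-root decay afterwards.
    Assumes the standard 'Attention Is All You Need' formula.
    """
    if step < 1:
        raise ValueError("step is 1-based")
    return factor * model_dim ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# With warmup = 4000 the rate at the end of warmup is 4000x the rate at step 1,
# which is the ramp the first test above checks for.
ramp = noam_lr(4000, 512, 4000) / noam_lr(1, 512, 4000)
```

A schedule frozen at step 1 would therefore train at 1/4000 of the intended peak rate for this warmup, which is consistent with the stall described in #1470.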

…x Cosine off-by-one

Extends TryGetFusedLrSchedule so more LR schedulers run on the fused-compiled training path (the fused plan evaluates GetLr(step) per optimizer step):
- StepLRScheduler → LrSchedule.Step. Verified: eager lr0·γ^((N-1)/stepSize) on batch N == fused GetLr(N) (Tensors uses max(0,step-1)/stepSize).
- CyclicLRScheduler → LrSchedule.Cyclic, gated to the canonical symmetric-triangular case (mode==Triangular && stepSizeUp==stepSizeDown). Triangular2 / ExponentialRange / asymmetric have no fused shape and fall back to eager. Added a CyclicLRScheduler.Mode getter for the gate.

Also fixes a PRE-EXISTING off-by-one in the Cosine mapping that the new parity test caught: eager CosineAnnealing uses cos(π·(N-1)/tMax) but fused CosineLr uses cos(π·(s-1)/(totalSteps-1)); passing totalSteps = tMax (not tMax+1) made the fused sequence drift ~4e-6/step from eager. Now passes tMax+1 for an exact match. (Exponential verified already exact: lr0·γ^(N-1) both sides.)

New FusedLrScheduleMappingTests: step-for-step parity (eager sequence == fused GetLr(N)) for Step / Cyclic-triangular / Cosine / Exponential, plus negative guards that Triangular2 and asymmetric cyclic fall back to eager. All pass.

Note: OneCycle is NOT wired. AiDotNet's OneCycle uses LINEAR warmup while the fused/PyTorch OneCycle uses cosine warmup; the formulas differ, so mapping it would train differently on the fused path. Left on eager (documented).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
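
The Cosine off-by-one is easiest to see by evaluating both argument conventions side by side. The sketch below uses the two cosine arguments quoted in the commit message; the surrounding envelope (`eta_min + (lr0 - eta_min)·(1 + cos)/2`) and the function names are assumptions made for illustration, not the actual scheduler code.

```python
import math


def eager_cosine(n, lr0, t_max, eta_min=0.0):
    """Eager convention: argument is pi * (N - 1) / tMax."""
    return eta_min + (lr0 - eta_min) * 0.5 * (1.0 + math.cos(math.pi * (n - 1) / t_max))


def fused_cosine(step, lr0, total_steps, eta_min=0.0):
    """Fused convention: argument is pi * (s - 1) / (totalSteps - 1)."""
    return eta_min + (lr0 - eta_min) * 0.5 * (
        1.0 + math.cos(math.pi * (step - 1) / (total_steps - 1))
    )


def max_gap(lr0, t_max, total_steps):
    """Largest per-step difference between the two sequences over tMax steps."""
    return max(
        abs(eager_cosine(n, lr0, t_max) - fused_cosine(n, lr0, total_steps))
        for n in range(1, t_max + 1)
    )


gap_fixed = max_gap(0.001, 100, 101)  # totalSteps = tMax + 1: denominators agree
gap_buggy = max_gap(0.001, 100, 100)  # totalSteps = tMax: denominators differ by one
```

Passing `tMax + 1` makes the fused denominator `totalSteps - 1` equal to `tMax`, so both functions evaluate the same expression and the sequences agree exactly; with `tMax` the fused curve is slightly compressed and the gap is non-zero.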

…verified)

Tensors 0.88.0's CompiledTrainingPlan.ConfigureOptimizerFloat supports 20 fused optimizer kernels, but AiDotNet's CompiledTapeTrainingStep had a stale allowlist (SGD/Adam/AdamW/AMSGrad only) from when those were the only kernels, so any other optimizer silently fell back to eager even with an IFusedOptimizerSpec.

- Expand the allowlist to the full set the linked Tensors build supports (SGD, SGDMomentum, Adam, AdamW, AMSGrad, Nadam, RAdam, AdaMax, AdaDelta, Adagrad, RMSprop, Lion, LARS, LAMB, FTRL, ASGD, Rprop, HypergradientSGD, ScheduleFreeSGD, DAdaptationSGD).
- Generalize the AMSGrad-only "fused-unavailable" latch to a per-OptimizerType set, so any type the linked build can't actually run falls back ONCE (loud warning) instead of throwing/reconfiguring every step, and still never applies a wrong update.
- AdaMaxOptimizer + NadamOptimizer implement IFusedOptimizerSpec (OptimizerType.AdaMax / Nadam; no decoupled weight decay → WeightDecay 0; decline on adaptive LR / unmappable scheduler).

New FusedOptimizerParityTests gates each wiring with a fused-vs-eager training comparison: train two identically-initialised MLPs (EnableCompilation true vs false), compare final params. Adam is the control. Result: AdaMax and Nadam both engage the fused path (fusedSteps=40/40) and match eager to maxAbsDiff=0 (bit-identical), so they are verified safe to wire. The test asserts fusedSteps>0 so a silent eager fallback can't pass vacuously.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
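
The shape of such a parity gate can be sketched independently of the framework. The following is a minimal Python analog, not the FusedOptimizerParityTests code: the helper names are hypothetical, the "model" is a flat list of floats, and plain SGD stands in for the optimizer under test. It returns both the fused-vs-eager divergence and how far training moved the parameters, since a divergence of zero proves nothing if neither path trained.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


def train(step_fn, init, grad_fn, steps):
    """Run `steps` optimizer steps from a copy of `init`."""
    params = list(init)
    for _ in range(steps):
        params = step_fn(params, grad_fn(params))
    return params


def parity(step_a, step_b, init, grad_fn, steps=40):
    """Return (divergence between the two paths, movement away from init)."""
    a = train(step_a, init, grad_fn, steps)
    b = train(step_b, init, grad_fn, steps)
    return max_abs_diff(a, b), max_abs_diff(a, init)


def sgd_loop(params, grads, lr=0.1):
    """Stand-in for the eager path: element-by-element update."""
    out = []
    for p, g in zip(params, grads):
        out.append(p - lr * g)
    return out


def sgd_single_pass(params, grads, lr=0.1):
    """Stand-in for the fused path: the same arithmetic in one pass."""
    return [p - lr * g for p, g in zip(params, grads)]


def quadratic_grad(params):
    """Gradient of 0.5 * ||p||^2, used as a toy loss."""
    return list(params)


divergence, train_delta = parity(sgd_loop, sgd_single_pass, [1.0, -2.0, 0.5], quadratic_grad)

# A no-op optimizer on both sides also yields divergence 0, but train_delta 0:
noop_divergence, noop_delta = parity(
    lambda p, g: p, lambda p, g: p, [1.0, -2.0, 0.5], quadratic_grad
)
```

The no-op case is why a parity test of this kind needs a second assertion on parameter movement (or on the fused step count) alongside the divergence check.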

…th (parity-verified)

Five more optimizers self-describe via IFusedOptimizerSpec, mapped to their Tensors fused kernels using the exact param interpretation each kernel expects:
- RMSprop → RMSpropUpdateSimd(lr, decay=β2, eps)
- Adagrad → AdagradUpdateSimd(lr, eps)
- Lion → LionUpdateSimd(lr, β1, β2, wd)
- AdaDelta → AdaDeltaUpdateSimd(lr, rho=β2, eps)
- LAMB → LAMBUpdateSimd(lr, β1, β2, eps, wd)

Each declines (→ eager) under UseAdaptiveLearningRate, which is what gates the AiDotNet-side adaptive hyperparameter schedules (Adagrad LR factors, AdaDelta rho schedule, Lion β factors) that the fixed-hyperparameter fused kernels don't reproduce, so the fused path only engages for the canonical fixed-param case.

Parity-verified (FusedOptimizerParityTests, fused-vs-eager training): all five engage the fused path (40/40 steps) and match eager to maxAbsDiff=0 (bit-identical), with a non-vacuous guard confirming training actually moved the params (trainDelta: Lion 0.40, LAMB 0.39, RMSprop 0.14, Adagrad 0.06, AdaDelta 3e-3; distinct dynamics, not all the same).

Total fused optimizers now: SGD/Adam/AdamW/AMSGrad + AdaMax/Nadam + RMSprop/Adagrad/Lion/AdaDelta/LAMB. LARS/FTRL/RAdam/ASGD/Rprop deferred (LARS/FTRL need params the fixed (lr,β1,β2,ε,wd) config can't carry; RAdam/ASGD/Rprop have no AiDotNet optimizer class).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ty-gated)

Tensors 0.88.0's CpuFusedOperations registry implements 26 pointwise activation kernels (feat/499 "fuse every activation"); AiDotNet wired only 8 (ReLU/Identity/LeakyReLU/GELU/SiLU/Swish/Sigmoid/Tanh). Adds IFusedActivation to 11 more whose fused kernel is numerically identical to the eager scalar form: Mish, SELU, Softplus, SoftSign, Sign, BentIdentity, Gaussian, LiSHT, SQRBF, ReLU6, HardSwish.

Gated by a new FusedActivationParityTests harness: isolate each activation via IEngine.FusedLinear(x, I, null, type) (identity weights → only the fused activation applies) and compare element-wise to eager activation.Activate(x) over inputs spanning saturating regions. All 13 wired non-parametric activations match to ≤5e-7 (float epsilon).

The gate caught a real mismatch: HardSigmoid is NOT wired. AiDotNet's HardSigmoidActivation uses slope 0.2 (clamp(0.2x+0.5,0,1)) while the fused kernel uses the PyTorch form (x/6+0.5); parity measured 0.333 divergence, so it stays on the eager path until the formula is reconciled.

Deferred (documented): parametric activations (ELU/CELU/HardTanh/ScaledTanh/ThresholdedReLU/ISRU) need per-instance param guards vs the kernel's hardcoded constants; RReLU is non-deterministic; the softmax family isn't pointwise so it can't be a fused activation epilogue.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
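
The HardSigmoid mismatch comes down to two different slopes, which a scalar comparison already exposes. The sketch below is illustrative only: the function names are hypothetical, and it assumes the fused form is clamped to [0, 1] the way PyTorch's hardsigmoid is. It does not attempt to reproduce the 0.333 figure, which comes from the real kernel and the test's own inputs.

```python
def hard_sigmoid_slope_fifth(x):
    """Slope-0.2 form quoted for the eager activation: clamp(0.2x + 0.5, 0, 1)."""
    return min(1.0, max(0.0, 0.2 * x + 0.5))


def hard_sigmoid_slope_sixth(x):
    """PyTorch-style form: clamp(x/6 + 0.5, 0, 1). The clamp is an assumption here."""
    return min(1.0, max(0.0, x / 6.0 + 0.5))


def max_divergence(f, g, xs):
    """Largest absolute difference between two scalar functions over sample points."""
    return max(abs(f(x) - g(x)) for x in xs)


# Sample [-6, 6] in steps of 0.01 so both saturating regions are covered.
samples = [i / 100.0 for i in range(-600, 601)]
gap = max_divergence(hard_sigmoid_slope_fifth, hard_sigmoid_slope_sixth, samples)
passes_gate = gap <= 5e-7  # the tolerance the commit quotes for wired activations
```

Under these assumptions the two forms agree at 0 and in full saturation but differ everywhere in between, so no tolerance near float epsilon can pass, which is why the activation stays on the eager path.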

…cord findings

Adds a --compile flag to the PyTorch side so the head-to-head can be run against torch.compile (TorchInductor), not just eager, which is the comparison requested for the fused-training-plan claim. (TorchInductor CPU needs MSVC cl.exe on PATH; run under a VS Developer environment / vcvars64.bat.)

Findings (MLP, CPU, 8 threads, bs64, 20 train-batches × 5 epochs, steady-state i.e. excluding torch.compile's ~3.7s first-epoch compilation):

TRAINING (the compiled-training-plan claim):
  AiDotNet fused   ~0.014-0.017 s/epoch
  torch.compile    ~0.084 s/epoch       → AiDotNet ~6x faster
  AiDotNet eager   ~0.22 s/epoch        (fused ~15x over eager)

INFERENCE (Predict latency, post-warmup):
  torch.compile wins ~2-4x on MLP (e.g. bs32: AiDotNet p95 1.27ms vs torch.compile mean 0.32ms).

The fused training plan beats torch.compile; the inference path does not yet. This is an honest gap to close (TorchInductor's fused pointwise+GEMM inference codegen).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/Training/CompiledTapeTrainingStep.cs`:
- Line 128: The static field _fusedUnavailableTypes is being latched on any
exception which can permanently disable fused mode; update the logic so you only
add to _fusedUnavailableTypes for known non-transient exceptions (e.g.,
NotSupportedException/PlatformNotSupported or a specific
OptimizerUnsupportedException) instead of catching Exception, and ensure
Invalidate() actually clears or resets _fusedUnavailableTypes (or make it
instance-scoped) so transient failures don't permanently disable fused
execution; locate usages in CompiledTapeTrainingStep (the _fusedUnavailableTypes
field and the method where exceptions are caught and Invalidate() is
implemented) and change the catch to specific exception types and add a
safe-clear/reset in Invalidate().

In
`@tests/AiDotNet.Tests/IntegrationTests/Optimizers/FusedOptimizerParityTests.cs`:
- Around line 115-151: Each test (Adam_Control_FusedMatchesEager,
AdaMax_FusedMatchesEager_NoWorseThanAdam,
Nadam_FusedMatchesEager_NoWorseThanAdam) currently ignores the returned
trainDelta so they can pass with no parameter updates; update each test to
assert that trainDelta indicates actual parameter movement (e.g.,
Assert.True(trainDelta > 0 || maxAbs(trainDelta) > 1e-6) or similar non-zero
threshold) after calling Divergence(...) and include a clear failure message
mentioning the test name and that no training occurred; use the existing
trainDelta variable from the Divergence(...) call and keep the threshold
conservative (like 1e-6) so small-but-real updates pass while no-op runs fail.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 656f2a77-456b-4109-8222-49895364a143

📥 Commits

Reviewing files that changed from the base of the PR and between 53f58b4 and 7bba7ce.

📒 Files selected for processing (24)

src/ActivationFunctions/BentIdentityActivation.cs
src/ActivationFunctions/GaussianActivation.cs
src/ActivationFunctions/HardSwishActivation.cs
src/ActivationFunctions/LiSHTActivation.cs
src/ActivationFunctions/MishActivation.cs
src/ActivationFunctions/ReLU6Activation.cs
src/ActivationFunctions/SELUActivation.cs
src/ActivationFunctions/SQRBFActivation.cs
src/ActivationFunctions/SignActivation.cs
src/ActivationFunctions/SoftPlusActivation.cs
src/ActivationFunctions/SoftSignActivation.cs
src/LearningRateSchedulers/CyclicLRScheduler.cs
src/Optimizers/AdaDeltaOptimizer.cs
src/Optimizers/AdaMaxOptimizer.cs
src/Optimizers/AdagradOptimizer.cs
src/Optimizers/GradientBasedOptimizerBase.cs
src/Optimizers/LAMBOptimizer.cs
src/Optimizers/LionOptimizer.cs
src/Optimizers/NadamOptimizer.cs
src/Optimizers/RootMeanSquarePropagationOptimizer.cs
src/Training/CompiledTapeTrainingStep.cs
tests/AiDotNet.Tests/IntegrationTests/Optimizers/FusedActivationParityTests.cs
tests/AiDotNet.Tests/IntegrationTests/Optimizers/FusedLrScheduleMappingTests.cs
tests/AiDotNet.Tests/IntegrationTests/Optimizers/FusedOptimizerParityTests.cs

…parity tests, py nits

- CompiledTapeTrainingStep: only latch a type as fused-unsupported for capability-gap exceptions (NotSupported/MissingMethod/TypeLoad/EntryPointNotFound/DllNotFound), not any exception. Transient runtime failures fall back one step without permanently disabling fused mode, and _fusedUnavailableTypes is cleared in Invalidate() so a fresh lifecycle retries.
- FusedOptimizerParityTests: assert trainDelta > 1e-6 in Adam/AdaMax/Nadam tests so they cannot pass vacuously when both paths do no training.
- pytorch benchmark/compare: drop quoted self-type annotation, redundant int() cast, add _get return type; bump requirements torch>=2.5.0 for the torch.load RCE / DoS CVEs.

All 3 parity tests pass; solution builds.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
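
The latch behaviour described in the first bullet can be sketched as follows. This is a Python analog with hypothetical names, not the CompiledTapeTrainingStep code: `NotImplementedError` stands in for the capability-gap exception family, the latch is instance-scoped rather than static, and the warning goes through the standard `logging` module.

```python
import logging

log = logging.getLogger("fused_dispatch_sketch")


class FusedStepDispatcher:
    """Try the fused step first and fall back to eager, loudly.

    Only capability-gap errors latch an optimizer type as unavailable;
    any other failure falls back for that single step and is retried later.
    """

    CAPABILITY_GAP = (NotImplementedError,)  # stand-in for NotSupported etc.

    def __init__(self, fused_step, eager_step):
        self._fused_step = fused_step
        self._eager_step = eager_step
        self._fused_unavailable = set()

    def step(self, optimizer_type, batch):
        if optimizer_type not in self._fused_unavailable:
            try:
                return self._fused_step(optimizer_type, batch)
            except self.CAPABILITY_GAP as exc:
                # Permanent for this lifecycle: warn once, then stop retrying.
                self._fused_unavailable.add(optimizer_type)
                log.warning("fused path unavailable for %s: %s", optimizer_type, exc)
            except Exception as exc:
                # Transient: fall back for this step only, retry next step.
                log.warning("fused step failed once for %s: %s", optimizer_type, exc)
        return self._eager_step(optimizer_type, batch)

    def invalidate(self):
        """Start a fresh lifecycle: forget which types were latched."""
        self._fused_unavailable.clear()
```

The design point is the split between the two `except` branches: latching on every exception would let one transient failure disable the fused path for the rest of the process, which is the issue the review comment raised.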

…ce work + SpMM unconstrained fix) 0.90.2 pulls in the merged compiled-inference plan (Tensors#513 — CompiledMlp self-tuning kernel selection, CNN conv im2col fast path, MlpForward small-batch native-BLAS routing, public CpuInferenceConfig.PinBlasThreadsForLatency) and the Tensors#520 fix that made ISparseEngine.SpMM<T> unconstrained again (0.90.0/0.90.1 broke the AiDotNet build — #379 had leaked `where T : unmanaged` into the public API, failing SparseLinearLayer<T>; 0.90.2 is the first 0.90.x that compiles). Core + tests + PyTorchParity benchmark all build clean against 0.90.2. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…dMlp plan

FeedForwardNeuralNetwork.Predict already collapsed a pure dense+activation stack into one MlpForward call, but MlpForward is Tensor-based (per-call AutoTensorCache + dispatch + Tensor-wrapper overhead). The Tensors compiled-inference flagship, CompiledMlp (array-based, near-zero per-call allocation, persistent prepacked weights, per-layer managed-vs-native self-tuning), beats torch.compile at the kernel level but wasn't on the Predict path. It's internal to AiDotNet.Tensors and reachable via [InternalsVisibleTo("AiDotNet")].

TryFusedDensePredict now adds a float tier (TryCompiledMlpPredict): build/cache a CompiledMlp from the dense layers' weights/biases on first eligible inference, then replay it. The plan is rebuilt when absent, when batch exceeds the buffers it was sized for, or when any layer's weight backing array was reallocated (reference guard). This is the same frozen-weights-during-inference contract as the MlpForward path, plus the reallocation guard the cached plan needs. Non-float and non-contiguous / rank>2 inputs fall through to MlpForward unchanged.

Measured (AIsEval MLP 784->512->128->10, this machine): Predict bs1 avg 0.503 -> 0.225 ms, ~2.2x faster and now at parity with torch.compile (0.217 ms mean), where the Tensor-based path was ~2.8x slower. (mlp-fused, which calls MlpForward directly rather than via Predict, is unchanged, isolating the gain to this path.)

Correctness: FeedForwardCompiledMlpPredictTests asserts the CompiledMlp Predict output equals the generic per-layer Forward (first-call lazy-weights path) within 1e-4 and is deterministic across calls, at bs 1/8/32. Builds clean on 0.90.2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
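
The three rebuild conditions for the cached plan (absent, batch larger than the sized buffers, weight array reallocated) can be sketched as a small cache. This is a Python analog under stated assumptions: `CompiledPlanCache` and `build_plan` are hypothetical names, weights are plain lists, and Python's `is` plays the role of the reference guard.

```python
class CompiledPlanCache:
    """Build an inference plan once and replay it until a guard trips."""

    def __init__(self, build_plan):
        self._build_plan = build_plan  # callable(weights, batch) -> plan
        self._plan = None
        self._capacity = 0
        self._weight_refs = []
        self.builds = 0  # exposed so callers can observe rebuilds

    def get(self, weights, batch):
        stale = (
            self._plan is None                        # no plan yet
            or batch > self._capacity                 # buffers too small
            or len(weights) != len(self._weight_refs)
            or any(w is not r for w, r in zip(weights, self._weight_refs))
        )                                             # an array was swapped out
        if stale:
            self._plan = self._build_plan(weights, batch)
            self._capacity = batch
            self._weight_refs = list(weights)
            self.builds += 1
        return self._plan
```

Note that an identity check only detects a replaced array, not an in-place edit of the same array, which matches the frozen-weights-during-inference contract the commit describes.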

…ralNetwork.Predict

A canonical CNN classifier, [Conv(→ReLU) | MaxPool]+ → Flatten → Dense+, now replays inference by calling the engine kernels directly (FusedConv2D fusing bias+activation; index-free MaxPool2D; cached-B FusedLinear), skipping the per-layer LayerBase.Forward wrappers. Predict overrides the base to try this stem and falls back to base.Predict for anything outside the pattern (non-float, active tape, lazy/unmaterialized weights, a conv activation other than identity/ReLU).

Root-caused via a per-stage breakdown (CnnStemBreakdownBench) at bs1: the layer path pools through MaxPool2DWithIndices, allocating a 5-D backward-index array even at inference (~213 µs vs ~26 µs index-free), and pays per-layer shape-resolution / _lastInput-caching / Tensor-view churn. The stem drops both.

Result (parity CNN, this machine): bs8 inference 2.39 → 1.32 ms (~1.8x), bs1 0.78 → 0.69 ms, bs32 3.34 → 2.95 ms. Output matches the generic per-layer Forward within 1e-4 and is deterministic (ConvNetFusedStemPredictTests, bs 1/4).

Honest ceiling: still ~3x behind torch.compile. The remaining gap is NOT layer overhead. It is (a) the per-op Tensor allocation the stem still incurs (each FusedConv2D/MaxPool2D returns a fresh Tensor; torch fuses the whole graph into one allocation-free C++ fn) and (b) the conv kernel floor itself: the im2col-GEMM convs sum to ~188 µs and the full kernel floor (~329 µs) already exceeds torch's whole-CNN 254 µs. Fully matching torch needs faster conv kernels (oneDNN/direct-conv codegen) or a zero-alloc array-based CompiledConvNet (FusedConv2DInto + MaxPool2DInto + ping-pong NCHW buffers), a larger Tensors effort, filed as follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vercel Bot deployed to Preview – aidotnet_website May 29, 2026 18:57 View deployment

vercel Bot deployed to Preview – aidotnet-playground-api May 29, 2026 18:57 View deployment

ooples changed the title ~~fix: make the compiled fused-training fallback loud instead of silent~~ fix: compiled fused-training — standard-Adam default, OCP dispatch, MlpForward wiring, loud fallback May 29, 2026

ooples mentioned this pull request May 29, 2026

Add fused kernels for the missing activations + optimizer types (no existing kernel) ooples/AiDotNet.Tensors#499

Closed

coderabbitai Bot requested changes May 29, 2026

View reviewed changes

franklinic and others added 2 commits May 29, 2026 16:13

ooples force-pushed the perf/fused-training-gate branch from 3e3d6fe to a6851e4 Compare May 29, 2026 20:41

ooples mentioned this pull request May 29, 2026

Wire the existing AMSGrad kernel into CompiledTrainingPlan's fused-update dispatch ooples/AiDotNet.Tensors#500

Closed

ooples mentioned this pull request May 29, 2026

feat(#500): wire ALL fused kernel-backed optimizers + pointwise activations into the plan/MlpForward dispatch ooples/AiDotNet.Tensors#501

Merged

coderabbitai Bot approved these changes May 29, 2026

View reviewed changes

Merge branch 'master' into perf/fused-training-gate

e32ade6

coderabbitai Bot requested changes May 30, 2026

View reviewed changes

ooples mentioned this pull request May 30, 2026

feat(#499): fuse every activation + the remaining optimizers (Maxout op, global-reduction optimizers) ooples/AiDotNet.Tensors#502

Merged

ooples force-pushed the perf/fused-training-gate branch from bf7a785 to 7310df9 Compare May 30, 2026 02:56

ooples mentioned this pull request May 30, 2026

Per-call model.Train (tape path) with default Noam scheduler keeps LR frozen at warmup-step-1 -> from-scratch Transformer training stalls (regression 0.207.0-fix1380 -> 0.207.x) #1470

Open

franklinic and others added 3 commits May 30, 2026 10:20

vercel Bot deployed to Preview – aidotnet-playground-api May 30, 2026 15:13 View deployment

vercel Bot deployed to Preview – aidotnet_website May 30, 2026 15:13 View deployment

franklinic and others added 3 commits May 30, 2026 11:57

coderabbitai Bot requested changes May 30, 2026

View reviewed changes

Comment thread src/Training/CompiledTapeTrainingStep.cs

Comment thread tests/AiDotNet.Tests/IntegrationTests/Optimizers/FusedOptimizerParityTests.cs

coderabbitai Bot approved these changes May 30, 2026

View reviewed changes

franklinic and others added 3 commits May 31, 2026 11:47

ooples commented May 29, 2026 • edited by coderabbitai Bot

1. Default-optimizer gate (root cause of "compiled does nothing")

2. Open/closed fused dispatch (replaces type-switch + enum whitelist)

3. #1470 — Adam+Noam on the fused fast path (true adaptive-LR fix)

4. Proper wiring — Predict → MlpForward

5. Loud fallback (observability)

Coverage being completed on this branch

Summary by CodeRabbit

✅ Coverage completed + verified (parity-gated)

Benchmark — vs torch.compile (TorchInductor), MLP CPU, steady-state

vercel Bot commented May 29, 2026 • edited

coderabbitai Bot commented May 29, 2026 • edited

Reviews paused

Walkthrough

Changes

coderabbitai Bot left a comment
