sourcepirate · May 30, 2026
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -13,7 +13,7 @@ jobs:
       contents: write
     strategy:
       matrix:
-        python-version: ["3.9", "3.10", "3.11", "3.12"]
+        python-version: ["3.10", "3.11", "3.12"]
 
     steps:
     - uses: actions/checkout@v4

diff --git a/Agents.md b/Agents.md
@@ -33,6 +33,18 @@ You are an agent working on `neutro`, an "intentionally naive" and educational i
 - `RegexTokenizer` is preferred for LLM tasks, implementing byte-level BPE with regex splitting.
 - Maintain educational clarity: explicitly implement the greedy merge process without obscure optimizations.
 
+## Documentation Sync
+
+Whenever you modify a source file under `neutro/layers/`, `neutro/models/`, or `neutro/engine/`, you MUST update its corresponding documentation file under `docs/`. The doc path mirrors the source path (e.g., `neutro/layers/core/dense.py` ↔ `docs/layers/core/dense.md`).
+
+Required for every doc change:
+- Follow the **line-by-line walkthrough** style: explain `__init__`, `build`, `forward`, `backward` in sequence.
+- Add 🔍 **"Why" annotations** on every stored/cached value — explain what it's used for in backward.
+- Add 📐 **Shape walkthroughs** on every matrix operation — show `(B, D) @ (D, U) → (B, U)`.
+- Reference exact file paths and line numbers in the source.
+- If creating a new layer, create a new `.md` file in the corresponding `docs/` subdirectory.
+- Run `pytest` after doc changes to verify no regressions.
+
 ## Testing
 - Aim for >90% test coverage.
 - Use `pytest`.
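The path-mirroring rule in the Documentation Sync section of `Agents.md` can be sketched as a small helper. This is only an illustration: `doc_path_for` and `SYNCED_PACKAGES` are hypothetical names, not part of the repository, and the helper assumes POSIX-style relative paths.

```python
from pathlib import PurePosixPath

# Packages whose sources must stay in sync with docs/ (taken from the rule above).
SYNCED_PACKAGES = ("layers", "models", "engine")


def doc_path_for(source_path):
    """Return the mirrored docs path for a source file, or None if the rule does not apply."""
    parts = PurePosixPath(source_path).parts
    if len(parts) < 3 or parts[0] != "neutro" or parts[1] not in SYNCED_PACKAGES:
        return None
    if not parts[-1].endswith(".py"):
        return None
    # Swap the leading "neutro" for "docs" and the ".py" suffix for ".md".
    return str(PurePosixPath("docs", *parts[1:]).with_suffix(".md"))


dense_doc = doc_path_for("neutro/layers/core/dense.py")   # docs/layers/core/dense.md
node_doc = doc_path_for("neutro/engine/node.py")          # docs/engine/node.md
relu_doc = doc_path_for("neutro/activations/relu.py")     # None: not a synced package
```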

diff --git a/docs/activations/activations.md b/docs/activations/activations.md
@@ -0,0 +1,77 @@
+# Activation Functions
+
+## Theory
+
+Activation functions introduce non-linearity into neural networks. Without them, stacking linear layers would collapse into a single linear transformation.
+
+### ReLU — `neutro/activations/relu.py`
+
+$$\text{ReLU}(x) = \max(0, x)$$
+
+$$\text{ReLU}'(x) = \mathbf{1}_{x > 0}$$
+
+- **Gradient**: 1 for positive inputs, 0 otherwise. A neuron whose pre-activation stays non-positive receives zero gradient and stops updating, which is the "dying ReLU" problem.
+
+### Sigmoid — `neutro/activations/sigmoid.py`
+
+$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
+
+$$\sigma'(x) = \sigma(x)(1 - \sigma(x))$$
+
+- Output range: (0, 1). Used for binary classification or as a gating mechanism (LSTM, GRU).
+- **Vanishing gradient**: for inputs of large magnitude (strongly positive or strongly negative), $\sigma(x)$ saturates near 1 or 0 and the gradient approaches 0.
+
+### Tanh — `neutro/activations/tanh.py`
+
+$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$
+
+$$\tanh'(x) = 1 - \tanh^2(x)$$
+
+- Output range: (-1, 1). Zero-centered, often preferred over sigmoid in hidden layers.
+
+### Softmax — `neutro/activations/softmax.py`
+
+$$\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
+
+- Output: probability distribution over classes.
+- **Jacobian-Vector Product** (`gradient_fast`, line 18): computes $y \odot \left(g - \sum_j y_j g_j\right)$, where $y$ is the softmax output and $g$ is `grad_output`, without building the full $N \times N$ Jacobian.
+
+### SiLU — `neutro/activations/silu.py` (Sigmoid Linear Unit)
+
+$$\text{SiLU}(x) = x \cdot \sigma(x)$$
+
+$$\text{SiLU}'(x) = \sigma(x) + x \cdot \sigma(x) \cdot (1 - \sigma(x))$$
+
+- Also called Swish. Used in modern architectures (e.g., Llama, as part of its SwiGLU feed-forward block).
+
+## Implementation Guide
+
+All activations follow the same pattern:
+
+```python
+class ReLU:
+    def forward(self, x): ...
+    def gradient(self, x): ...        # element-wise gradient
+    def gradient_fast(self, x, grad): ...  # fused JVP (optional)
+```
+
+- `forward` is used by `Dense` and other layers in the forward pass.
+- `gradient` returns the element-wise derivative, which is multiplied by the upstream gradient in `Dense.backward`.
+- `gradient_fast` is an optimization used by Softmax to avoid the full Jacobian matrix.
+
+## Usage Example
+
+```python
+import numpy as np
+from neutro.activations import get_activation
+
+relu = get_activation('relu')
+x = np.array([-1, 0, 2])
+y = relu(x)          # [0, 0, 2]
+dy = relu.gradient(x)  # [0, 0, 1]
+```
+
+## References
+
+- Nair, V., & Hinton, G. E. (2010). **Rectified Linear Units Improve Restricted Boltzmann Machines**.
+- Hendrycks, D., & Gimpel, K. (2016). **Gaussian Error Linear Units (GELUs)**.
+- Elfwing, S., Uchibe, E., & Doya, K. (2018). **Sigmoid-weighted linear units for neural network function approximation in reinforcement learning**.
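A minimal NumPy sketch of the Softmax `gradient_fast` note in `docs/activations/activations.md`: it compares the fused product against an explicitly built Jacobian. The function names here are illustrative and are not the ones in `neutro/activations/softmax.py`; the sketch assumes a single 1-D input vector.

```python
import numpy as np


def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()


def softmax_jacobian(y):
    # Full N x N Jacobian: J[i, j] = y_i * (delta_ij - y_j).
    return np.diag(y) - np.outer(y, y)


def softmax_grad_fused(y, grad_output):
    # Same product as J @ grad_output, computed in O(N) memory.
    return y * (grad_output - np.sum(y * grad_output))


x = np.array([1.0, 2.0, 3.0])
g = np.array([0.1, -0.2, 0.3])   # stand-in for the upstream gradient
y = softmax(x)

full = softmax_jacobian(y) @ g
fused = softmax_grad_fused(y, g)
```

Because the softmax Jacobian is symmetric, the Jacobian-vector and vector-Jacobian products coincide, so the fused form can be used directly in the backward pass.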
diff --git a/docs/callbacks/callbacks.md b/docs/callbacks/callbacks.md
@@ -0,0 +1,60 @@
+# Callbacks
+
+## Theory
+
+Callbacks are objects that hook into the training loop at various points. They allow you to monitor training, save checkpoints, adjust learning rates, and stop training early without cluttering the training loop itself.
+
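+**Hook points** (listed from outermost to innermost; each `begin` hook fires before, and each `end` hook after, everything nested below it):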
+1. `on_train_begin` / `on_train_end`
+2. `on_epoch_begin` / `on_epoch_end`
+3. `on_batch_begin` / `on_batch_end`
+
+## Implementation Guide
+
+### File: `neutro/callbacks/base.py`
+
+```python
+class Callback:
+    def set_model(self, model): ...
+    def on_train_begin(self, logs=None): ...
+    def on_train_end(self, logs=None): ...
+    def on_epoch_begin(self, epoch, logs=None): ...
+    def on_epoch_end(self, epoch, logs=None): ...
+    def on_batch_begin(self, batch, logs=None): ...
+    def on_batch_end(self, batch, logs=None): ...
+```
+
+All methods are no-ops by default. Subclasses override the needed hooks.
+
+### History — `neutro/callbacks/history.py`
+
+Records per-epoch metrics in the `history.history` dict (keys: `loss`, `val_loss`, `accuracy`, etc.).
+
+### EarlyStopping — `neutro/callbacks/early_stopping.py`
+
+Monitors a metric (e.g., `val_loss`) and stops training if it hasn't improved for `patience` epochs. Uses `model.stop_training = True`.
+
+### ReduceLROnPlateau / LR Scheduler — `neutro/callbacks/lr_scheduler.py`
+
+Reduces the learning rate when a metric plateaus, or follows a predefined schedule.
+
+### Checkpoint — `neutro/callbacks/checkpoint.py`
+
+Saves the model to disk at the end of each epoch using `joblib.dump`.
+
+## Usage Example
+
+```python
+from neutro.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau
+
+callbacks = [
+    EarlyStopping(monitor='val_loss', patience=5),
+    ModelCheckpoint('best_model.pkl', save_best_only=True),
+    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3),
+]
+model.fit(X, y, callbacks=callbacks, epochs=100)
+```
+
+## References
+
+- Keras Callbacks API. [Keras.io](https://keras.io/api/callbacks/)
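The hook ordering and the `stop_training` flag described in `docs/callbacks/callbacks.md` can be sketched with a toy loop. Everything below is a stand-in under stated assumptions: `PatienceStopper`, `ToyModel`, and `fit` are hypothetical, the validation losses are supplied by hand, and the real `neutro` training loop may differ.

```python
class Callback:
    """Base class: every hook is a no-op, mirroring the interface documented above."""
    def set_model(self, model):
        self.model = model
    def on_train_begin(self, logs=None): pass
    def on_train_end(self, logs=None): pass
    def on_epoch_begin(self, epoch, logs=None): pass
    def on_epoch_end(self, epoch, logs=None): pass
    def on_batch_begin(self, batch, logs=None): pass
    def on_batch_end(self, batch, logs=None): pass


class PatienceStopper(Callback):
    """Hypothetical early stopper: stop after `patience` epochs without improvement."""
    def __init__(self, monitor="val_loss", patience=2):
        self.monitor = monitor
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def on_epoch_end(self, epoch, logs=None):
        current = (logs or {}).get(self.monitor)
        if current is None:
            return
        if current < self.best:
            self.best, self.wait = current, 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.model.stop_training = True


class ToyModel:
    stop_training = False


def fit(model, val_losses, callbacks, batches_per_epoch=2):
    """Toy loop: `val_losses` stands in for real per-epoch validation results."""
    for cb in callbacks:
        cb.set_model(model)
        cb.on_train_begin()
    epochs_run = 0
    for epoch, val_loss in enumerate(val_losses):
        for cb in callbacks:
            cb.on_epoch_begin(epoch)
        for batch in range(batches_per_epoch):
            for cb in callbacks:
                cb.on_batch_begin(batch)
            # Forward, backward, and the optimizer step would run here.
            for cb in callbacks:
                cb.on_batch_end(batch)
        for cb in callbacks:
            cb.on_epoch_end(epoch, logs={"val_loss": val_loss})
        epochs_run += 1
        if model.stop_training:
            break
    for cb in callbacks:
        cb.on_train_end()
    return epochs_run


model = ToyModel()
epochs_run = fit(model, [1.0, 0.8, 0.9, 0.85, 0.7], [PatienceStopper(patience=2)])
```

With these losses the best value is reached at epoch 1, the next two epochs fail to improve on it, and the loop exits after four epochs.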
diff --git a/docs/data/data.md b/docs/data/data.md
@@ -0,0 +1,34 @@
+# Data
+
+## DataLoader — `neutro/data.py`
+
+A simple data loader for batching and shuffling:
+
+```python
+class DataLoader:
+    def __init__(self, x, y, batch_size=32, shuffle=True, augmenter=None):
+        self.x = x
+        self.y = y
+        self.batch_size = batch_size
+        self.shuffle = shuffle
+        self.augmenter = augmenter
+        self.indices = np.arange(len(x))
+        self.on_epoch_end()
+
+    def __len__(self):
+        return int(np.ceil(len(self.x) / self.batch_size))
+
+    def on_epoch_end(self):
+        if self.shuffle:
+            np.random.shuffle(self.indices)
+
+    def __getitem__(self, index):
+        batch_idx = self.indices[index * self.batch_size:(index + 1) * self.batch_size]
+        batch_x, batch_y = self.x[batch_idx], self.y[batch_idx]
+        return batch_x, batch_y
+
+    def __iter__(self):
+        for i in range(len(self)):
+            yield self[i]
+        self.on_epoch_end()
+```
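The batching arithmetic used by `DataLoader` in `docs/data/data.md` can be checked in isolation. This sketch repeats the index slicing from `__getitem__` on toy arrays rather than importing `neutro`; the seeded `default_rng` shuffle is an assumption made for reproducibility (the excerpt uses `np.random.shuffle`).

```python
import numpy as np

x = np.arange(10).reshape(10, 1)   # 10 samples, 1 feature
y = np.arange(10)
batch_size = 4

indices = np.arange(len(x))
np.random.default_rng(0).shuffle(indices)   # in-place shuffle, as in on_epoch_end

# Same formula as DataLoader.__len__: the last batch may be smaller.
num_batches = int(np.ceil(len(x) / batch_size))

batch_sizes, seen = [], []
for i in range(num_batches):
    batch_idx = indices[i * batch_size:(i + 1) * batch_size]
    batch_x, batch_y = x[batch_idx], y[batch_idx]
    batch_sizes.append(len(batch_x))
    seen.extend(batch_y.tolist())
```

Each sample appears exactly once per epoch, and the final batch carries the remainder (here 2 of the 10 samples).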
diff --git a/docs/engine/node.md b/docs/engine/node.md
@@ -0,0 +1,113 @@
+# KerasTensor, Node, and the Functional API Graph Engine
+
+## Theory
+
+The Functional API lets you build models as directed acyclic graphs (DAGs) of layers, rather than as linear stacks. This requires a mechanism to track *symbolic* data flow during model construction, before any real data is seen.
+
+Two core classes enable this:
+
+- **`KerasTensor`**: A symbolic placeholder representing the *future* output of a layer. It carries a `shape` but no actual data.
+- **`Node`**: A record of one *call* to a layer. It links input `KerasTensor`s → output `KerasTensor`s and is stored on the layer's `_inbound_nodes` list.
+
+When you write `outputs = Dense(32)(inputs)`, the layer's `__call__` method detects that `inputs` is a `KerasTensor`, builds the layer (if needed), computes the output shape symbolically, wraps it in a new `KerasTensor`, and records a `Node`. No NumPy computation occurs.
+
+Later, `Model._init_graph` traverses the graph backward from the outputs to discover all reachable `Node`s and `Layer`s, producing a topological ordering used for forward and backward execution.
+
+## Implementation Guide
+
+### `KerasTensor` — `neutro/engine/node.py:3-13`
+
+```python
+class KerasTensor:
+    def __init__(self, shape, node=None, name=None):
+        self.shape = shape
+        self.node = node      # The Node that produced this tensor
+        self.name = name
+```
+
+- `shape` is a tuple like `(None, 32)` — the batch dimension is `None` (unknown until runtime).
+- `node` is set when a `Node` is created and links back to the producing layer.
+
+### `Node` — `neutro/engine/node.py:15-38`
+
+```python
+class Node:
+    def __init__(self, layer, input_tensors, output_tensors):
+        self.layer = layer
+        self.input_tensors = input_tensors
+        self.output_tensors = output_tensors
+        layer._inbound_nodes.append(self)
+        # Link output tensors back to this node
+        if isinstance(output_tensors, list):
+            for t in output_tensors:
+                t.node = self
+        else:
+            output_tensors.node = self
+```
+
+Key behaviors:
+- **Registration**: The node registers itself on `layer._inbound_nodes`, enabling multi-parent graph traversal.
+- **One layer, many nodes**: A shared layer used 3 times will have 3 entries in `_inbound_nodes`, each with different input/output tensors.
+- **List inputs and outputs**: Layers like `Add` that take lists of inputs store the lists in `input_tensors`. Multi-output layers store lists in `output_tensors`.
+
+### How `Layer.__call__` triggers Node creation — `neutro/layers/base.py:67-105`
+
+The symbolic path (lines 77-97):
+
+```python
+if is_symbolic:
+    if not self.built:
+        self.build(input_shapes)           # e.g., Dense.build((None, 10))
+    output_shape = self.compute_output_shape(input_shapes)
+    output_tensors = KerasTensor(shape=output_shape)
+    Node(self, input_tensors=inputs, output_tensors=output_tensors)
+    return output_tensors
+```
+
+This is a **zero-computation** path: no `forward` is called, only shape inference.
+
+## Graph Discovery (`Model._init_graph`) — `neutro/models/base_model.py:25-62`
+
+```python
+def traverse(tensor):
+    if hasattr(tensor, 'node') and tensor.node:
+        node = tensor.node
+        if node not in visited_nodes:
+            visited_nodes.add(node)
+            # Recursively visit inputs
+            if isinstance(node.input_tensors, list):
+                for t in node.input_tensors:
+                    traverse(t)
+            else:
+                traverse(node.input_tensors)
+            nodes_ordered.append(node)
+```
+
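+Because each node is appended only after all of its inputs have been visited, this produces `_nodes_ordered` in **topological order** (inputs before outputs). The forward pass iterates `_nodes_ordered` as-is; the backward pass iterates `reversed(_nodes_ordered)`.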
+
+## Usage Example
+
+```python
+from neutro.layers import Input, Dense
+from neutro.models import Model
+from neutro.engine.node import KerasTensor, Node
+
+# Symbolic construction
+inputs = Input(shape=(4,))          # returns a KerasTensor
+x = Dense(8, activation='relu')(inputs)  # Layer.__call__ creates a Node
+outputs = Dense(1)(x)
+
+# Inspect the graph
+print(type(inputs))          # <class 'neutro.engine.node.KerasTensor'>
+print(inputs.shape)          # (None, 4)
+print(outputs.node.layer)    # Dense(1) — the final layer
+
+# Model discovers nodes via traversal
+model = Model(inputs=inputs, outputs=outputs)
+print(len(model._nodes_ordered))  # Number of Nodes discovered
+```
+
+## References
+
+- Chollet, F. (2015). **Keras** — the Functional API was introduced in Keras 1.0. [GitHub](https://github.com/keras-team/keras)
+- Keras Functional API Guide. [Keras.io](https://keras.io/guides/functional_api/)
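The post-order traversal documented in `docs/engine/node.md` can be exercised on a small diamond-shaped graph. This is a standalone sketch: `StubLayer`, `connect`, and `order_nodes` are hypothetical helpers, and `KerasTensor` and `Node` are trimmed re-creations of the excerpts above, not imports from `neutro`.

```python
class KerasTensor:
    def __init__(self, shape, node=None, name=None):
        self.shape = shape
        self.node = node   # the Node that produced this tensor, if any
        self.name = name


class StubLayer:
    """Hypothetical stand-in for a layer: only a name and the node list."""
    def __init__(self, name):
        self.name = name
        self._inbound_nodes = []


class Node:
    def __init__(self, layer, input_tensors, output_tensors):
        self.layer = layer
        self.input_tensors = input_tensors
        self.output_tensors = output_tensors
        layer._inbound_nodes.append(self)
        output_tensors.node = self


def connect(layer, inputs, shape):
    """Symbolic call: create the output tensor and record the Node."""
    out = KerasTensor(shape)
    Node(layer, inputs, out)
    return out


def order_nodes(output):
    visited, ordered = set(), []

    def traverse(tensor):
        node = tensor.node
        if node is None or node in visited:
            return
        visited.add(node)
        inputs = node.input_tensors
        for t in inputs if isinstance(inputs, list) else [inputs]:
            traverse(t)
        ordered.append(node)   # appended only after all of its inputs

    traverse(output)
    return ordered


# Diamond: inp -> a -> (b, c) -> add
inp = KerasTensor((None, 4))
a = connect(StubLayer("dense_a"), inp, (None, 8))
b = connect(StubLayer("dense_b"), a, (None, 8))
c = connect(StubLayer("dense_c"), a, (None, 8))
out = connect(StubLayer("add"), [b, c], (None, 8))

names = [n.layer.name for n in order_nodes(out)]
```

The shared producer `dense_a` is reached through both branches but recorded once, and every node appears after the nodes that feed it.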
diff --git a/docs/initializers/initializers.md b/docs/initializers/initializers.md
@@ -0,0 +1,52 @@
+# Initializers
+
+## Theory
+
+Weight initialization is critical for training deep networks. Poor initialization can cause vanishing/exploding gradients. `neutro` implements several strategies.
+
+### Glorot (Xavier) Uniform — `neutro/initializers/glorot.py`
+
+$$W \sim U\left[-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right]$$
+
+Recommended for layers with tanh or sigmoid activation.
+
+### He Initialization — `neutro/initializers/he.py`
+
+$$W \sim N\left(0, \sigma^2\right), \quad \sigma = \sqrt{\frac{2}{n_{\text{in}}}}$$
+
+Recommended for layers with ReLU activation. Keeps the variance of activations roughly constant across layers.
+
+### Constant — `neutro/initializers/constant.py`
+
+$W = c$ for a constant $c$. Used for bias initialization (typically $c=0$).
+
+### Random — `neutro/initializers/random.py`
+
+$$W \sim N(\text{mean}, \text{stddev}^2)$$
+
+## Implementation Guide
+
+All initializers are callable objects:
+
+```python
+class GlorotUniform:
+    def __call__(self, shape):
+        limit = np.sqrt(6 / (shape[0] + shape[1]))
+        return np.random.uniform(-limit, limit, size=shape)
+```
+
+They are instantiated in layer `__init__` and called in `build`:
+
+```python
+class Dense(Layer):
+    def __init__(self, units, kernel_initializer='glorot_uniform', ...):
+        self.kernel_initializer = get_initializer(kernel_initializer)
+
+    def build(self, input_shape):
+        self.params['W'] = self.kernel_initializer((input_shape[-1], self.units))
+```
+
+## References
+
+- Glorot, X., & Bengio, Y. (2010). **Understanding the difficulty of training deep feedforward neural networks**. *AISTATS*.
+- He, K., et al. (2015). **Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification**. [arXiv:1502.01852](https://arxiv.org/abs/1502.01852)
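The Glorot and He formulas in `docs/initializers/initializers.md` can be sanity-checked numerically. This is a sketch under stated assumptions: the function names and the explicit `rng` argument are illustrative, weights are 2-D with shape `(n_in, n_out)`, and the real initializers in `neutro/initializers/` may be organized differently.

```python
import numpy as np


def glorot_uniform(shape, rng):
    fan_in, fan_out = shape
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=shape), limit


def he_normal(shape, rng):
    fan_in = shape[0]
    std = np.sqrt(2.0 / fan_in)   # standard deviation, not variance
    return rng.normal(0.0, std, size=shape), std


rng = np.random.default_rng(0)
W_glorot, limit = glorot_uniform((256, 128), rng)
W_he, std = he_normal((256, 128), rng)

# A uniform distribution on [-limit, limit] has variance limit**2 / 3,
# which equals 2 / (n_in + n_out) for the Glorot limit.
glorot_var_expected = 2.0 / (256 + 128)
```

For this shape the Glorot limit is exactly 0.125, and the empirical statistics of both samples land close to the values the formulas predict.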