Skip to content

dnexdev/tiramisu

Repository files navigation

tiramisu

MNIST training — loss over 10 epochs

A 2-layer MLP (784 → 128 → ReLU → 10) training on MNIST with Adam. Generated from the stock examples/mnist binary — regenerate with python scripts/plot_training.py build/mnist_run.log (see Training GIF).

The deep learning framework you can actually read.

Tiramisu is a from-scratch machine learning stack in C++20 — about 5,000 lines of framework code, structured like a production framework but small enough to read in an afternoon.

  • ~5,000 lines of framework code (coreopsautogradnnoptim)
  • Stdlib-only compute — no Eigen, no BLAS, no PyTorch at link time
  • PyTorch-familiar APITensor, requires_grad, backward(), Module, Linear, Adam (Python and C++)
  • Built to teach — explicit autograd graph, readable kernels, end-to-end MNIST

Real tensors, real autograd, real training — at a scale you can read cover to cover.

CMake on multiple platforms


Development timeline

What's shipped today vs what's planned. Legend: ✅ shipped · 🚧 in progress · 📋 roadmap

Phase What
Foundation Storage, Tensor views, strides, dtypes
Ops Elementwise ops, broadcast, reduce, matmul (AVX2/FMA)
Autograd Node, backward(), gradcheck, NoGradGuard
NN + optim Linear, cross_entropy_loss, SGD, Adam, MNIST example
Normalization softmax, layernorm forward + backward
Batched matmul N-D GEMM with batch broadcast
Transformer / GPT Embedding, MHA, FFN, TransformerBlock, GPT
CUDA backend GPU training via -DTIRAMISU_ENABLE_CUDA=ON and --cuda
Python bindings pip install . — Tensor, autograd, nn, optim
Conv2d, serialize, quant 📋 README placeholders today

API at a glance

If you know PyTorch, you already know most of tiramisu — in Python (pip install .) or C++.

Autograd

PyTorchTiramisu (Python)Tiramisu (C++)
x = torch.tensor([2.0], requires_grad=True)
y = x * x + 3.0 * x
y.backward()
print(x.grad)  # tensor([7.])
import numpy as np
import tiramisu as tr

x = tr.from_numpy(np.array([2.0], dtype=np.float32))
x.requires_grad = True
y = tr.add(tr.mul(x, x), tr.mul(x, 3.0))
y.backward()
print(x.grad)  # [7.]
#include "tiramisu/autograd/ops.hpp"

Tensor x({1});
x.at<float>({0}) = 2.0f;
x.set_requires_grad(true);

Tensor y = autograd::add(autograd::mul(x, x),
                         autograd::mul(x, 3.0f));
autograd::backward(y);
// x.grad()->at<float>({0}) == 7.0f

Training step

PyTorchTiramisu (Python)Tiramisu (C++)
optimizer.zero_grad()
logits = model(batch_x)
loss = F.cross_entropy(logits, batch_y)
loss.backward()
optimizer.step()
import tiramisu as tr

opt.zero_grad()
h = layer1.forward(batch_x)
h = tr.relu(h)
logits = layer2.forward(h)
loss = tr.nn.cross_entropy_loss(logits, batch_y)
loss.backward()
opt.step()
opt.zero_grad();
Tensor h = layer1->forward(batch_x);
h = autograd::relu(h);
Tensor logits = layer2->forward(h);
Tensor loss = nn::cross_entropy_loss(logits, batch_y);
autograd::backward(loss);
opt.step();

Linear layer

PyTorchTiramisu (Python)Tiramisu (C++)
layer = nn.Linear(784, 128)
# y = x @ W.T + b
y = layer(x)
import tiramisu as tr

layer = tr.nn.Linear(784, 128)
out = layer.forward(x)
// weight: (in_features, out_features), bias: (out_features,)
Tensor out = autograd::matmul(x, weight);
y = autograd::add(out, bias);  // bias broadcasts over batch

Quick start (Python)

Requires Python 3.10+ and a C++20 compiler.

pip install .
python -c "import tiramisu as tr; print(tr.nn.Linear(10, 5))"

Forward pass with NumPy interop:

import numpy as np
import tiramisu as tr

x = tr.from_numpy(np.random.randn(2, 784).astype(np.float32))
layer = tr.nn.Linear(784, 10)
out = layer.forward(x)
print(out.shape())  # [2, 10]

See examples/python/ and python/README.md for GPT training steps and the full binding reference.


Quick start (C++)

Requires CMake 3.20+ and a C++20 compiler.

cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Debug
cmake --build build --parallel
ctest --test-dir build --output-on-failure

Debug builds enable ASan + UBSan by default (TIRAMISU_ENABLE_SANITIZERS=ON).
Use -DCMAKE_BUILD_TYPE=Release for optimized ops (-O3 -mavx2 -mfma) without sanitizers.

Run MNIST

  1. Download MNIST IDX files into data/:
    • train-images-idx3-ubyte, train-labels-idx1-ubyte
    • t10k-images-idx3-ubyte, t10k-labels-idx1-ubyte
  2. Build (above), then:
cd build/examples && ./mnist

Expected: loss decreases over 10 epochs, ~95%+ test accuracy.


Read the whole stack

A guided path through the codebase (~2.3k LOC of libraries):

# File Why read it
1 core/include/tiramisu/core/tensor.hpp Views, strides, autograd hooks
2 ops/cpu/broadcast.cpp NumPy-style broadcast rules
3 ops/cpu/elementwise.cpp Stride-0 broadcast trick
4 autograd/src/ops.cpp backward() + wrapper pattern
5 nn/src/linear.cpp One layer: Y = XW + b
6 examples/mnist.cpp Full training loop

Compare to micrograd (minimal autograd in Python), tinygrad (full stack, large codebase), and llm.c (training-focused C). Tiramisu targets typed C++, modular libraries, and MNIST end-to-end as a readable reference implementation — not production throughput.


Training GIF

Regenerate the README animation after a local training run:

cd build/examples && ./mnist 2>&1 | tee ../mnist_run.log
pip install -r scripts/requirements.txt   # matplotlib, pillow
python scripts/plot_training.py build/mnist_run.log

Output: docs/assets/mnist_training.gif


Char-level GPT training

Train a tiny, ~2M, or ~10M-parameter GPT on Tiny Shakespeare:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --target train_shakespeare
./build/examples/train_shakespeare --preset tiny --epochs 3

GPU training (~2M preset tuned for a 6GB card):

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DTIRAMISU_ENABLE_CUDA=ON
cmake --build build --target train_shakespeare
./build/examples/train_shakespeare --preset 2m --cuda --epochs 5 \
  --checkpoint build/shakespeare_2m.ckpt

Smoke test on GPU:

./build/examples/train_shakespeare --preset tiny --cuda --epochs 1 --max-batches 10

For the ~10M config (CPU-slow; use CUDA when available):

./scripts/run_10m_training.sh

Checkpoints use the binary format in serialize/. Options: --preset tiny|2m|10m, --cuda, --checkpoint PATH, --max-batches N.


Layout

core/       Storage, Tensor, dtype, device
ops/cpu/    Forward kernels (elementwise, reduce, matmul, normalization)
ops/cuda/   Optional CUDA kernels (-DTIRAMISU_ENABLE_CUDA=ON)
autograd/   Differentiable wrappers, backward(), gradcheck
nn/         Module, Linear, GPT, loss, LayerNorm, …
optim/      SGD, Adam, AdamW, grad clipping, cosine LR
python/     pybind11 bindings (`pip install .`)
serialize/  GPT checkpoint save/load
examples/   hello_tiramisu, mnist, train_shakespeare
tests/      GoogleTest (fetched by CMake)

GoogleTest is the only non-stdlib fetch at configure time. Compute uses the C++ standard library only (plus compiler intrinsics for AVX2 in ops).

Releases

No releases published

Packages

 
 
 

Contributors