makemore is a series of progressively more sophisticated autoregressive character-level language models from Andrej Karpathy's Neural Networks: Zero to Hero lecture series. All of them train on the same dataset — ~32k first names from the SSA names list — and ask the same question: given a prefix of characters, what is the next character?
This tutorial walks through an OCANNL implementation of that progression,
mirroring Karpathy's structure part-by-part. Each part produces a standalone
runnable example under test/training/:
- Part 1 — Bigram →
bigram.ml - Part 2 — Bengio MLP →
mlp_names.ml - Part 3 — BatchNorm MLP →
mlp_bn_names.ml - Part 4 — How OCANNL compiles gradients
- Part 5 — WaveNet (stretch goal — deferred)
- Part 6 — Transformer →
transformer_names.ml
All examples run on the CPU (OCANNL_BACKEND=sync_cc dune runtest test/training/) and use Utils.settings.fixed_state_for_init <- Some 3 so
that .expected fixtures stay deterministic across runs.
The Names dataset is provided by the opam package
dataprep. It exposes a dict of
size 28: the 26 lowercase letters plus . (used as both beginning- and
end-of-name marker) and a space (used to pad short names where a fixed-length
sequence is required). The helpers we use across the examples:
Dataprep.Names.read_names : unit -> string list— returns ~32k names.Dataprep.Names.char_index : char -> int— alphabet → index.Dataprep.Names.letters_with_dot : char list— index → character.
The examples below each split the name list deterministically into
80% / 10% / 10% train / dev / test via a Fisher–Yates shuffle with a fixed
seed (split_seed = 42).
Karpathy: Building makemore Part 1: bigrams, probabilities, negative-log-likelihood
The bigram model predicts the next character given only the previous
character. Each training example is a pair (char_prev, char_next).
bigram.ml implements the neural-net formulation
directly — one weight matrix w of shape [vocab_size -> vocab_size], softmax
per row, cross-entropy against the one-hot target. Karpathy's lecture shows a
counts-only warm-up notebook step first; we skip that and go straight to the
neural formulation to keep the training + generation loop compact.
The softmax denominator is expressed as an einsum reduction:
let counts = exp (({ w } + 1) * input) in
counts /. (counts ++ "...|... => ...|0")In PyTorch that would be F.softmax(w @ input, dim=-1). In OCANNL the
...|... => ...|0 einsum reduces the output axis to size 1 by summing,
leaving batch and other axes untouched. This is the same pattern you'll see
repeated in each subsequent part, plus a numerical-stability max subtraction
starting from Part 2.
Sampling is a CDF-based walk over the 28 outputs (counts[i] / sum(counts)),
terminated when . is emitted.
Karpathy's lecture additionally applies a small 0.01 * (W ** 2).mean ()
regularizer. We omit it — the loss thresholds in bigram.expected already
accommodate its absence — and keep the example focused on the core softmax +
cross-entropy + generation triangle.
Karpathy: Building makemore Part 2: MLP
mlp_names.ml is the Bengio et al. 2003
neural probabilistic language model: a learned character embedding table c
of shape [vocab_size -> embed_dim] (here embed_dim = 10), a fixed context
window of block_size = 3 preceding characters, and an MLP over the
concatenated embeddings:
┌─ C[x_{t-2}] ─┐
x → │ C[x_{t-1}] │ → tanh (W1 @ concat + b1) → (W2 @ hidden + b2) → softmax
└─ C[x_t] ─┘
OCANNL's shape inference replaces Karpathy's explicit .view(-1, n_embd * block_size) reshape. We keep block_size as a batch axis of the embedded
tensor and use an einsum contraction with +* to fuse block_size and
embed_dim into the hidden weight:
let%op embed input = { c; o = [ embed_dim ] } * input in
(* embed : batch=[batch_size, block_size], output=[embed_dim] *)
let%op hidden x =
tanh
((embed x +* { w1 } "bs|->e; |se->h => b|->h" [ "s"; "e" ])
+ { b1; o = [ hid_dim ] })
(* hidden : batch=[batch_size], output=[hid_dim] *)The einsum spec reads "LHS has batch axes b,s and output e; RHS (w1) has
input axes s,e and output h; result has batch b and output h." The
repeated s and e are the contracted axes — same semantics as PyTorch's
matrix multiply over a flattened [block_size * embed_dim] view, but here
expressed without a reshape.
The cross-entropy loss uses the numerically-stable log-softmax (subtract per- row max before exp) — identical to the transformer example and Karpathy's later lectures.
OCANNL's default uniform1 init produces values in [0, 1), i.e. all
positive. For an MLP with tanh that saturates the preactivation and traps
SGD at a high-loss plateau. We mitigate post-hoc by recentering every
parameter to [-0.25, 0.25), matching the pattern
transformer_names.ml uses:
Set.iter batch_loss.Tensor.params ~f:(fun p ->
let tn = p.Tensor.value in
Train.set_on_host tn;
let vals = Tn.get_values tn in
Array.iteri vals ~f:(fun i v -> vals.(i) <- 0.5 *. (v -. 0.5));
Tn.set_values tn vals);Karpathy's lecture handles the same pain point a different way, which is itself the entry point to Part 3.
Under the fixed seed, mlp_names.ml converges to a final train/dev/test NLL
of ~2.32 over 15 epochs. The three sampled names (ahla, nilia, gatro)
are recognizably Names-like.
Karpathy: Building makemore Part 3: Activations & Gradients, BatchNorm
mlp_bn_names.ml extends Part 2 by
inserting Nn_blocks.batch_norm1d between the hidden
linear and the tanh. The new block mirrors the existing batch_norm2d but
normalizes over the batch axis only (no spatial axes to reduce). Its einsum
is:
let mean = (x ++ "..o.. | ..c.. => 0 | ..c.." [ "o" ]) /. dim o inwhere o is the captured batch-axis length and ..c.. is the channel row
variable passed through unchanged.
The model becomes:
embed → (w1 +* ... + b1) → batch_norm1d → tanh → (w2 * hidden + b2)
The hidden weight w1 uses Kaiming-normal initialization — normal1 ()
sampled from a standard normal then scaled by sqrt(scale_sq / fan_in) per
Nn_blocks.kaiming. Threading train_step through
the layers requires the unit-closure pattern from nn_blocks.ml:
let bn1 = Nn_blocks.batch_norm1d ~label:[ "bn1" ] () in
let%op mk_hidden () ~train_step x =
tanh
(bn1 ~train_step
((embed x +* { w1 = kaiming normal1 () }
"bs|->e; |se->h => b|->h" [ "s"; "e" ])
+ { b1; o = [ hid_dim ] }))
in
let hidden = mk_hidden ()let%op mk_hidden () ~train_step x = ... lifts the inline { w1 = … } and
{ b1 } parameters to the () closure scope — they are constructed once
when mk_hidden () is called, and shared across every subsequent invocation
(training, dev/test eval, inference generation). The post-init recentering
pass (inherited from Part 2) explicitly skips w1 so Kaiming's scale is
preserved; c, b1, w2, b2 are still recentered from uniform [0,1)
to roughly centered.
batch_norm1d inherits batch_norm2d's FIXME: running statistics are not
implemented, so momentum is ignored and the inference path
(~train_step:None) falls through to (gamma *. normalized) + beta computed
from batch statistics rather than population statistics. For a single-
example inference batch, mean == x, centered == 0, normalized == 0, so
the output collapses to beta regardless of input. Generation quality
degrades accordingly — mlp_bn_names.ml's three sampled names are noticeably
noisier than Part 2's. The tutorial leaves this as a pedagogical demonstration
of why running statistics matter; the framework-level fix is tracked as a
follow-up to this task.
Karpathy: Building makemore Part 4: becoming a backprop ninja
Part 4 of the lecture series replaces loss.backward() with a hand-written
backward pass through the Part 3 MLP — the reader implements gradient
formulas for each forward operation.
OCANNL does the equivalent automatically. The %cd / %op syntax extensions
generate both forward and backward code at Train.grad_update time. To read
the generated code for yourself, enable the debug flag:
OCANNL_BACKEND=sync_cc \
dune exec test/training/mlp_bn_names.exe \
-- --ocannl_output_debug_files_in_build_directory=trueAfter the run, _build/default/test/training/build_files/ holds three files
per compiled routine:
*.cd— the high-level assignment IR (forward + backward interleaved),*.ll— the low-level for-loop IR, and*.c/*.cu/*.metal— the backend-specific source passed to the compiler.
Skim the .cd file for the training routine and look at the sub-expression
for tanh(w1 +* embed x + b1). Karpathy's lecture has the reader write, by
hand:
dh = (1 - h ** 2) * dhpreact # d tanh / d preact
dhpreact = dh # alias for clarity
dw1 = embed.T @ dhpreact # weight gradient
dembed = dhpreact @ w1.T # backprop into embedIn OCANNL the emitted .cd expresses the same four gradient terms as
explicit assignment statements, annotated by the tensor label they correspond
to. The tanh derivative in particular surfaces as an elementwise
multiplication of the upstream gradient by 1 - h**2, matching the manual
formula line-for-line. The difference is that you don't maintain the
gradient bookkeeping — the PPX threads it for every operation you wrote in
%op / %cd, and re-threads it when you change the model.
Treat Part 4 as the bridge from "I could hand-write a gradient" to "the compiler has already written it" — the same mental model, without the upkeep.
Karpathy: Building makemore Part 5: Building a WaveNet
Part 5 introduces hierarchical 1D dilated causal convolutions. OCANNL's
conv2d has no dilation support at HEAD, and the einsum does not yet have
first-class stride-dilation primitives for the 1D case. A WaveNet example is
therefore deferred to a follow-up issue — implementing it is a meaningful
einsum / lowering extension, not a tutorial polish.
Karpathy: Context covered across Zero to Hero Parts 7–9 (intro to language modelling → GPT from scratch), condensed into the decoder-only GPT.
transformer_names.ml is a
masked-attention decoder-only transformer trained on the Names dataset with a
16-character context and teacher-forced targets. It was merged as part of
gh-ocannl-57 and is the
continuation of this tutorial — once you've worked through Parts 2 and 3, the
transformer replaces the fixed-context-window MLP with self-attention and
adds positional encoding + FFN residual blocks.
The same data-prep primitives (name_to_sequences, seqs_to_flat_one_hot)
are visible there, adapted to a full sequence instead of a sliding window.
Mapping summary
- Part 1 →
bigram.ml - Part 2 →
mlp_names.ml - Part 3 →
mlp_bn_names.ml - Part 4 → this page, §Part 4
- Part 5 → deferred — follow-up issue
- Part 6 →
transformer_names.ml