From 82dc4d5fbec286790402ae592dee3f19f212ff47 Mon Sep 17 00:00:00 2001 From: Grill cheese Date: Tue, 7 Apr 2026 12:22:20 -0400 Subject: [PATCH] Update README.md --- README.md | 194 +++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 193 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index ca886a3..8b57a3c 100644 --- a/README.md +++ b/README.md @@ -395,7 +395,199 @@ uv run pytest tests/test_linear.py -v ## What's New -### 0.6.x (current) +### Pre-v1.0 (current `main`, since 0.6.1) {#whats-new-since-061} + +A long debug session against a real workload (`vsa_lm_v3c_grilly` — +language modeling with multiplication-free FFN + causal Linear-RNN +mixer) surfaced and fixed a stack of bugs and perf cliffs that the +0.6.1 test suite never tripped. Each fix is small in isolation; the +pile is large enough to warrant a major version bump. + +#### Performance — bridge dispatch overhauled + +- **`BufferPool::allocateBuffer` VMA fix.** Changed `preferredFlags` + → `requiredFlags = DEVICE_LOCAL_BIT`. The old code silently fell + back to slow host-visible BAR memory on AMD/Windows when the + allocator's auto-select picked the wrong heap; the fix forces + `memoryType[2]` (DEVICE_LOCAL+HOST_VISIBLE+HOST_COHERENT) under + Resizable BAR or fails loudly when ReBAR is unavailable. + ([cpp/src/buffer_pool.cpp](cpp/src/buffer_pool.cpp)) +- **3-way bucket pool routing.** `acquire` / `acquireDeviceLocal` / + `acquireReadback` now have separate per-size pools; `release` + routes by the buffer's `deviceLocal` / `readback` flag. Prevents a + DL buffer from being picked up by a host-visible `acquire` and + crashing on `mappedPtr=null`. +- **Staging pattern across all hot ops** ("Thread A"). Each op + acquires DEVICE_LOCAL VRAM compute buffers + WC sequential-write + stage-in + HOST_CACHED random-read stage-out, batches a single + command buffer with `copyBuffer × N → barrier → dispatch → + barrier → copyBuffer × M → submit/wait`. 
  Applied to:
  - `cpp/src/ops/linear.cpp` — `linear`, `linearBackward`, `dropout`
  - `cpp/src/ops/activations.cpp` — `activationForward` /
    `activationBackward` helpers (covers ReLU/GELU/SiLU/Tanh)
  - `cpp/src/ops/layernorm.cpp` — `layernorm`, `layernormBackward`
  - `cpp/src/ops/embedding.cpp` — `embeddingLookup`
  - `cpp/src/ops/optimizer.cpp` — `adamUpdate`, `adamwUpdate`
  - `cpp/src/ops/loss.cpp` — `crossEntropyLoss`, `crossEntropyBackward`
- **Measured impact**: forward `nn.Linear` on a 4096×384×1152 GEMM
  went from **763 ms → 19 ms** on an AMD RX 6750 XT (~40×). The
  download phase alone collapsed from **749 ms → 2.7 ms** once the
  output stage moved to `HOST_CACHED` memory (cached random reads
  instead of uncached WC reads).
- **`transferComputeBarrier()`** added to `CommandBatch` — bidirectional
  TRANSFER ↔ COMPUTE memory + execution barrier needed by the
  staging pattern (the existing `barrier()` is COMPUTE→COMPUTE only,
  kept unchanged for `linearBackward`'s 3-pass intra-shader barriers).

#### fp16 + cooperative matrix GEMM

- **`shaders/gemm-coopmat-shared.glsl`** — fp16 tiled GEMM via
  `VK_KHR_cooperative_matrix` with shared-memory staging. Subgroup
  scope, 16×64 (M×N) tile per workgroup, 256 threads (4×Wave64
  subgroups), fp32 accumulator. Dispatches to native WMMA on RDNA3
  and NVIDIA RTX, and falls back to the driver's emulation path on
  RDNA1/RDNA2.
- **`shaders/gemm-bias-add.glsl`** — companion row-broadcast bias
  add (the coopmat store can't interleave the bias inline).
- **`LinearParams.elemSize`** — new field (4 = fp32, 2 = fp16).
  `linear()` selects `gemm-coopmat-shared` when `elemSize == 2`,
  cooperative matrix is supported, AND the shape is tile-aligned
  (`M % 16 == 0`, `K % 16 == 0`, `N % 64 == 0`); otherwise it falls
  back to `fnn-linear.glsl`.
- **Pybind: generic `py::array`** — `bindings_linear.cpp` now accepts
  fp32 OR fp16 numpy input via `xBuf.itemsize`. Output is always
  fp32 (the coopmat accumulator). Bias must be fp32 regardless of
  input dtype.
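The shader-selection gate above can be sketched as a small predicate (a sketch for illustration only; `use_coopmat_path` is a hypothetical name, and the real check lives in `linear()` on the C++ side):

```python
def use_coopmat_path(elem_size: int, has_coopmat: bool,
                     m: int, k: int, n: int) -> bool:
    """fp16 input + cooperative-matrix support + tile-aligned shape,
    mirroring the documented elemSize/alignment gate; anything else
    takes the fp32 fnn-linear.glsl fallback."""
    return (elem_size == 2          # 2 bytes per element = fp16
            and has_coopmat         # VK_KHR_cooperative_matrix present
            and m % 16 == 0         # M tile alignment
            and k % 16 == 0         # K tile alignment
            and n % 64 == 0)        # N tile alignment (16×64 tiles)

# The benchmark GEMM from above is eligible once the input is fp16:
use_coopmat_path(2, True, 4096, 384, 1152)   # → True
use_coopmat_path(4, True, 4096, 384, 1152)   # → False: fp32 input
```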
- **`linearBackward` interface upgrade** — same `void*` + `elemSize`
  signature, so the fp16 path slots in cleanly once an fp16 backward
  shader lands. For now `elemSize != 4` raises with a clear message.

#### Causal Linear-RNN prefix scan (new feature)

- **`shaders/prefix-scan-causal.glsl`** — `h_t = a_t * h_{t-1} + x_t`
  in O(log S) parallel depth via `subgroupInclusiveAdd` on `log(a)`
  and the rescaled input (Blelloch's two-scan trick). Strictly
  causal; one workgroup per `(batch, hidden_dim)` pair.
- **`shaders/prefix-scan-causal-backward.glsl`** — anti-causal scan
  for `grad_x` and `grad_a` via the identity
  `R[t] = total - F[t] + w[t]` (no `subgroupShuffle`, which is
  undefined on partial Wave64 subgroups). Matches the closed-form
  gradient to fp32 epsilon (verified `max abs err ≈ 3.6e-6`).
- **`grilly/cpp/src/ops/prefix_scan.cpp`** — C++ dispatcher using the
  same staging pattern as the rest of Thread A.
- **`grilly/cpp/python/bindings_prefix_scan.cpp`** — pybind bindings
  exposing `prefix_scan_causal` and `prefix_scan_causal_backward`.
- **`grilly/nn/prefix_scan.py`** — Python autograd wrapper
  (`prefix_scan_causal()`) wired into grilly's `Variable` /
  `GradFn` system, plus a `CausalSequenceMixer` module that uses it
  as a drop-in causal sequence-pooling replacement.
- **Constraint**: `seq_len <= 32` (one thread per time step in a
  single subgroup). A hierarchical multi-subgroup version is on the
  roadmap.

#### Autograd — actually working now

- **`Module.__setattr__` auto-registration.** `self.weight =
  nn.Parameter(...)` and `self.lin = nn.Linear(...)` now populate
  `_parameters` / `_modules` automatically — the standard PyTorch
  idiom. This was previously broken with no warning: every `Module`
  subclass returned 0 parameters from `parameters()`, so AdamW was
  a silent no-op.
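The auto-registration idiom can be sketched in a few lines (a simplified stand-in that assumes nothing about grilly's real `Parameter`/`Module` internals beyond what is described above):

```python
class Parameter:
    """Stand-in for nn.Parameter (in grilly it is an np.ndarray subclass)."""
    def __init__(self, data):
        self.data = data
        self.grad = None

class Module:
    def __init__(self):
        # bypass __setattr__ so the registries themselves aren't routed
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        # route Parameters and sub-Modules into the registries so that
        # parameters() (and therefore the optimizer) can see them
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def parameters(self):
        yield from self._parameters.values()
        for child in self._modules.values():
            yield from child.parameters()

class Linear(Module):
    def __init__(self):
        super().__init__()
        self.weight = Parameter([[0.0]])   # auto-registered
        self.bias = Parameter([0.0])       # auto-registered

class Model(Module):
    def __init__(self):
        super().__init__()
        self.lin = Linear()                # sub-module auto-registered

len(list(Model().parameters()))   # → 2, collected through the child
```

Note the `object.__setattr__` in `Module.__init__`: the registries must exist before the first routed assignment, which is why subclasses call `super().__init__()` first.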
- **`nn.Linear.forward` autograd wiring.** When the input is a
  `Variable`, the output is wrapped in a `Variable` with a `GradFn`
  whose backward closure calls the existing `Linear.backward()`
  (which already populates `weight.grad`/`bias.grad` via the GPU
  shader). The same template is applied to `nn.LayerNorm.forward`
  and `nn.Embedding.forward`.
- **`Variable.__array__`** — numpy array protocol on
  `nn.autograd.Variable`. `np.matmul(tensor, w)` /
  `np.dot(tensor, w)` / `np.asarray(tensor)` now operate on the
  backing ndarray transparently. Required so that grilly's existing
  numpy-native layer code keeps working when called with torch_api
  `Tensor` inputs.
- **`Module.__call__` Variable passthrough + output wrap.** Inputs
  of type `Tensor` / `LongTensor` / `Variable` are passed through to
  `forward()` unchanged; raw ndarray outputs are re-wrapped in
  `Tensor` so chained calls preserve the torch-style type all the
  way through user-defined Module subclasses.
- **`Parameter` shape methods** — `unsqueeze`, `view`,
  `mean(dim=...)`, and `detach` added to `nn.Parameter` so user
  `forward` code can do `self.weight.unsqueeze(0)` /
  `self.weight.view(...)` / `self.weight.mean(dim=-1)` without
  knowing that `Parameter` is an `np.ndarray` subclass.
- **`nn.init.normal_/uniform_`** — added a `_writable_array(tensor)`
  helper that unwraps Tensor/Variable wrappers to their backing
  ndarray for in-place init. Previously the standard
  `nn.init.normal_(self.weight, 0, 0.02)` idiom raised
  `TypeError: 'Tensor' object does not support item assignment`.
- **`F.gelu` re-export** in `grilly.nn.functional` (it was importable
  via `grilly.nn.autograd.gelu` but missing from the public
  `functional` namespace).

#### Checkpoint format

- **`.grl` save/load roundtrip fixed.**
  `torch.save({'model': sd, 'step': N}, path)` followed by
  `ck = torch.load(path)` now returns exactly what was saved
  (matches `torch.save`/`torch.load` semantics).
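In plain-dict terms the fix looks like this (a toy sketch with illustrative function names; the real code serializes to and from disk):

```python
def load_grl_old(payload):
    # old behavior: force-wrap the loaded content under a fixed key
    return {"model": payload}

def load_grl_fixed(payload):
    # fixed behavior: load is the exact inverse of save
    return payload

ck = {"model": {"weight": [0.5]}, "step": 100}
load_grl_old(ck)["model"]["model"]["weight"]   # old double nesting: [0.5]
load_grl_fixed(ck) == ck                       # → True, exact roundtrip
```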
The previous `load_grl` force-wrapped content under a fixed `'model'`
key, producing `ck['model']['model']['weight']` instead of
`ck['model']['weight']` for any payload that already contained a
`model` key.

#### Editable install / Vulkan probe

- **`grilly/__init__.py` sys.path fix.** Added an `os.path.dirname`
  insert at the very top of the package init so `import grilly_core`
  works under PEP 660 editable installs. The path hook used by
  modern editable installs (`__editable__.grilly-X.Y.finder.__path_hook__`)
  doesn't add the package directory to `sys.path`, so the sibling
  `grilly_core..pyd` was invisible to `import grilly_core`.
  The downstream effect: `backend/base.py:_probe_cpp_vulkan()`
  silently caught the `ModuleNotFoundError`, set
  `VULKAN_AVAILABLE = False`, and the entire `nn` stack thought it
  had no GPU (despite a perfectly working Vulkan device).
- **`Module._get_backend()` graceful None.** Catches the legacy
  `VulkanCompute` init exception and returns `None`, so layers that
  only used `_get_backend()` for one-time GPU Xavier init at
  construction time don't crash when the legacy Python `vulkan`
  ctypes package isn't installed (the new C++ `_bridge` path doesn't
  need it).

#### Pre-existing shader bugs surfaced by recompile

Three shaders shipped stale `.spv` files that had been compiled with
a more permissive glslang version; a recent glslang catches the
errors:

- **`fused-layernorm-linear.glsl`** — added the missing
  `#extension GL_EXT_shader_atomic_float : require` for the
  `atomicAdd(shared_sum, sg_sum)` accumulator.
- **`lstm-cell-forward.glsl`** — renamed buffer field `input` →
  `input_data` (`input` is a reserved word in recent glslang).
  Also removed an incorrect `writeonly` qualifier on the gates
  buffer that the shader actually reads back.
- **`vsa-explore.glsl`** — renamed buffer field `output` →
  `output_data`, plus the same `writeonly` mismatch fix.

#### Tooling

- **`rebuild.ps1`** — one-command Windows rebuild.
  Compiles all GLSL → SPIR-V (with `-S comp` to disambiguate the
  stage and `--target-env vulkan1.3` for the cooperative matrix +
  subgroup extensions), runs `cmake --build build2 --config Release
  --target grilly_core`, and copies the freshly built `.pyd` to the
  package root. Skips up-to-date shaders by mtime comparison.
- **`PipelineCache::getDevice()`** accessor — needed by `linear.cpp`
  to query `hasCooperativeMatrix()` before selecting the coopmat
  shader path.

#### Lint cleanup

- 75 ruff errors fixed across the codebase: a mix of unsorted imports
  (`I001`), unused imports (`F401`), f-strings missing placeholders
  (`F541`), deprecated typing imports (`UP035`), non-PEP 585
  annotations (`UP006`), and a `yield from` modernization in
  `nn.Module.named_buffers`.

### 0.6.x

- **MindForge** adapter hypernetwork integration (via CubeMind)
- **Synaptic shaders**: `synapsis-stdp-update.glsl`,
  `bridge-spike-to-continuous.glsl`