nkane · nkane · Jun 3, 2026 · Jun 3, 2026
diff --git a/README.md b/README.md
@@ -496,12 +496,26 @@ opens a reverse-incremental search through history — each keystroke
 narrows the match, Ctrl-R again walks to the next older one, Esc restores
 the original line, Enter accepts.
 
-Press `<` to rewind one instruction. Each explicit step (`s`, `S`, `n`,
-`f`) records a full CPU + RAM snapshot beforehand, kept in a 256-entry
-FIFO ring; the status bar shows `rwd:N` while non-empty. Free-run via
-`r` does NOT snapshot — the 64 KiB-per-step cost would dominate at multi-MHz
-throughput — so reverse-step covers single-stepping sessions, not whole
-program executions.
+Press `<` to rewind one instruction. Every step — explicit (`s`, `S`, `n`,
+`f`) or free-run (`r`) — records a page-level copy-on-write delta beforehand,
+kept in a 256-entry FIFO ring; the status bar shows `rwd:N` while non-empty.
+
+For jumps deeper than that ring, **deep rewind** keeps periodic full-RAM
+*keyframes* (one every 4096 steps) and reconstructs any earlier step by
+restoring the nearest keyframe and replaying forward to the exact target:
+
+| Command            | Effect                                                      |
+|--------------------|-------------------------------------------------------------|
+| `:rewind N`        | Step back N executed steps (keyframe replay for deep jumps) |
+| `:rewind-budget MB`| Cap keyframe memory; sets the deep-rewind reach             |
+
+Reach (steps) = `budget / 64 KiB × 4096`. At the default 128 MiB cap that's
+~8.4M steps; `:rewind-budget 256` reaches ~16.7M. The budget is a ceiling —
+a short run holds only the keyframes it produced — and the status bar shows
+`deep:<reach>@<budget>` once keyframes exist. A deep rewind replays at most
+4096 instructions (about 1.3 ms end to end when benchmarked). Replay assumes
+deterministic execution between keyframes; live keyboard input is captured in
+the snapshot, so buffered input replays correctly.
 
 ---
 

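Note on the README reach formula above: the arithmetic can be checked with a small standalone sketch. It is not part of the patch. `reachSteps`, `keyframeBytes`, and `keyframeInterval` are hypothetical names chosen for illustration; the values (64 KiB per keyframe, one keyframe every 4096 steps) and the one-slot clamp for tiny budgets are taken from the README text and `NewKeyframeRing` in this diff.

```go
package main

import "fmt"

const (
	keyframeBytes    = 64 << 10 // one full 64 KiB RAM image, as cpu.KeyframeBytes
	keyframeInterval = 4096     // steps between keyframes, per the README text
)

// reachSteps mirrors the README formula: budget / 64 KiB x interval.
// A positive budget smaller than one keyframe still counts as one slot,
// matching the clamp in NewKeyframeRing; a non-positive budget means
// the feature is off.
func reachSteps(budgetBytes int) int {
	if budgetBytes <= 0 {
		return 0
	}
	slots := budgetBytes / keyframeBytes
	if slots < 1 {
		slots = 1
	}
	return slots * keyframeInterval
}

func main() {
	for _, mib := range []int{64, 128, 256} {
		fmt.Printf("%d MiB -> %d steps\n", mib, reachSteps(mib<<20))
	}
}
```

Under these assumptions 128 MiB gives 8,388,608 steps and 256 MiB gives 16,777,216, which matches the "~8.4M" and "~16.7M" figures quoted in the README hunk.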
diff --git a/cpu/keyframe.go b/cpu/keyframe.go
@@ -0,0 +1,146 @@
+package cpu
+
+// Keyframe-based deep rewind (issue #392).
+//
+// The per-step SnapshotRing only reaches back as far as its capacity (a few
+// hundred steps) — fine for "oops, step back one" but useless for "rewind to
+// somewhere in the last few million steps". Storing a delta for every one of
+// those steps is infeasible, so deep rewind instead keeps periodic *full*
+// machine snapshots (keyframes) and reconstructs an arbitrary earlier state
+// by restoring the nearest keyframe at or before the target step and
+// replaying forward the handful of steps in between.
+//
+//	reach (steps) = ring capacity (keyframes) × keyframe interval (steps)
+//	ring capacity = budget bytes / KeyframeBytes
+//
+// Memory is a *cap*, not a preallocation: a short run holds only as many
+// keyframes as it produced. Forward-replay cost is bounded by the interval,
+// so a larger interval trades replay latency for reach at a fixed budget.
+
+// KeyframeBytes is the accounting size of one keyframe: a full 64 KiB RAM
+// image. Register/peripheral state is negligible next to it, so the budget
+// math treats every keyframe as this fixed size.
+const KeyframeBytes = 0x10000
+
+// Keyframe is a full machine snapshot tagged with the step index at which it
+// was taken. Snap.Pages holds every page (a complete RAM image), so Restore
+// reconstructs the exact state with no delta chain.
+type Keyframe struct {
+	Step uint64
+	Snap Snapshot
+}
+
+// SnapshotFull captures a complete RAM image (all 256 pages) plus registers,
+// suitable for use as a keyframe base. Unlike CPU.Snapshot — which records
+// only a page delta for undoing a single step — this is self-contained:
+// Restore needs nothing else. Peripherals are filled in by the caller, as
+// with the delta path.
+func (c *CPU) SnapshotFull(ram *RAM) Snapshot {
+	s := c.Snapshot(ram)
+	pages := make(map[byte][256]byte, 256)
+	for p := 0; p < 256; p++ {
+		var img [256]byte
+		base := p << 8
+		copy(img[:], ram.Data[base:base+256])
+		pages[byte(p)] = img
+	}
+	s.Pages = pages
+	return s
+}
+
+// KeyframeRing is a fixed-capacity FIFO of keyframes ordered by ascending
+// step. Push appends the newest; when full it drops the oldest, so the ring
+// always holds the most recent `cap` keyframes. Nil receiver methods are
+// safe and behave as an empty, zero-capacity ring.
+type KeyframeRing struct {
+	buf  []Keyframe
+	head int // next-write index
+	size int
+	cap  int
+}
+
+// NewKeyframeRing builds a ring sized to hold budgetBytes worth of keyframes.
+// A budget too small for even one keyframe still yields a 1-slot ring so deep
+// rewind degrades to "nearest keyframe" rather than disabling outright; a
+// non-positive budget yields nil (feature off).
+func NewKeyframeRing(budgetBytes int) *KeyframeRing {
+	if budgetBytes <= 0 {
+		return nil
+	}
+	c := budgetBytes / KeyframeBytes
+	if c < 1 {
+		c = 1
+	}
+	return &KeyframeRing{buf: make([]Keyframe, c), cap: c}
+}
+
+// Cap returns the ring's keyframe capacity (0 for a nil ring).
+func (r *KeyframeRing) Cap() int {
+	if r == nil {
+		return 0
+	}
+	return r.cap
+}
+
+// Len returns the number of keyframes currently held.
+func (r *KeyframeRing) Len() int {
+	if r == nil {
+		return 0
+	}
+	return r.size
+}
+
+// Bytes is the approximate resident size of the held keyframes.
+func (r *KeyframeRing) Bytes() int {
+	return r.Len() * KeyframeBytes
+}
+
+// Push appends a keyframe. Callers are responsible for pushing in ascending
+// step order (the TUI does, since it captures during forward execution).
+func (r *KeyframeRing) Push(kf Keyframe) {
+	if r == nil || r.cap == 0 {
+		return
+	}
+	r.buf[r.head] = kf
+	r.head = (r.head + 1) % r.cap
+	if r.size < r.cap {
+		r.size++
+	}
+}
+
+// Nearest returns the latest keyframe whose Step is <= target, and true. When
+// the ring is empty or every held keyframe is newer than target (target fell
+// off the back of the reach window), it returns false.
+func (r *KeyframeRing) Nearest(target uint64) (Keyframe, bool) {
+	if r == nil || r.size == 0 {
+		return Keyframe{}, false
+	}
+	// Entries run oldest..newest starting at (head - size). Scan newest-first
+	// and take the first with Step <= target.
+	for i := 0; i < r.size; i++ {
+		idx := (r.head - 1 - i + r.cap) % r.cap
+		if r.buf[idx].Step <= target {
+			return r.buf[idx], true
+		}
+	}
+	return Keyframe{}, false
+}
+
+// Oldest returns the lowest step still reachable (the back of the window) and
+// true, or (0,false) when empty. Used to report reach to the user.
+func (r *KeyframeRing) Oldest() (uint64, bool) {
+	if r == nil || r.size == 0 {
+		return 0, false
+	}
+	idx := (r.head - r.size + r.cap) % r.cap
+	return r.buf[idx].Step, true
+}
+
+// Reset drops all keyframes without freeing the backing buffer.
+func (r *KeyframeRing) Reset() {
+	if r == nil {
+		return
+	}
+	r.head = 0
+	r.size = 0
+}
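Note on how the ring is meant to be used: the header comment describes restoring the nearest keyframe at or before the target and replaying forward. The sketch below shows that flow on a toy deterministic machine so it runs on its own. It is an illustration under stated assumptions, not the TUI's `rewindToStep`/`stepReplay` code, which this diff does not include. `machine`, `stepOnce`, `rewindTo`, and `run` are hypothetical names, and a plain ascending slice stands in for `KeyframeRing`.

```go
package main

import "fmt"

// machine is a toy deterministic stand-in for the CPU: one accumulator
// plus a step counter (hypothetical, for illustration only).
type machine struct {
	acc  uint32
	step uint64
}

// stepOnce advances the toy machine by one deterministic step.
func (m *machine) stepOnce() {
	m.acc = m.acc*1664525 + 1013904223
	m.step++
}

// keyframe is a full copy of the toy state tagged with its step index.
type keyframe struct {
	step uint64
	snap machine
}

// nearest returns the latest keyframe with step <= target, scanning
// newest-first over a slice held in ascending step order.
func nearest(kfs []keyframe, target uint64) (keyframe, bool) {
	for i := len(kfs) - 1; i >= 0; i-- {
		if kfs[i].step <= target {
			return kfs[i], true
		}
	}
	return keyframe{}, false
}

// rewindTo restores the nearest keyframe and replays forward to target.
// It reports how many steps were replayed, or false if target is older
// than every held keyframe.
func rewindTo(m *machine, kfs []keyframe, target uint64) (int, bool) {
	kf, ok := nearest(kfs, target)
	if !ok {
		return 0, false
	}
	*m = kf.snap
	replayed := 0
	for m.step < target {
		m.stepOnce()
		replayed++
	}
	return replayed, true
}

// run executes n steps from reset, capturing a keyframe every interval
// steps (including step 0) and a reference trace used only for checking.
func run(n, interval uint64) (machine, []keyframe, []uint32) {
	var m machine
	var kfs []keyframe
	history := make([]uint32, 0, n+1)
	for {
		if m.step%interval == 0 {
			kfs = append(kfs, keyframe{step: m.step, snap: m})
		}
		history = append(history, m.acc)
		if m.step == n {
			break
		}
		m.stepOnce()
	}
	return m, kfs, history
}

func main() {
	m, kfs, history := run(10000, 4096)
	replayed, ok := rewindTo(&m, kfs, 7000)
	fmt.Println(ok, replayed, m.acc == history[7000])
}
```

With keyframes at steps 0, 4096, and 8192, rewinding to step 7000 restores the step-4096 keyframe and replays 2904 steps. The replay count never reaches the interval, which is the bound the header comment relies on; exactness depends on `stepOnce` being deterministic, the same caveat the README states for the real core.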
diff --git a/cpu/keyframe_test.go b/cpu/keyframe_test.go
@@ -0,0 +1,95 @@
+package cpu
+
+import "testing"
+
+func TestSnapshotFull_RoundTrip(t *testing.T) {
+	ram := NewRAM()
+	ram.EnableShadow()
+	c := New(ram)
+	for a := 0; a < 0x10000; a += 257 {
+		ram.Data[a] = byte(a)
+	}
+	c.A, c.X, c.PC = 0x11, 0x22, 0x9000
+
+	kf := c.SnapshotFull(ram)
+	if len(kf.Pages) != 256 {
+		t.Fatalf("SnapshotFull captured %d pages; want 256", len(kf.Pages))
+	}
+	// Mutate everything, then restore.
+	for a := 0; a < 0x10000; a++ {
+		ram.Data[a] = 0xEE
+	}
+	c.A, c.X, c.PC = 0, 0, 0
+	c.Restore(kf, ram)
+	if c.A != 0x11 || c.X != 0x22 || c.PC != 0x9000 {
+		t.Errorf("regs not restored: A=%02X X=%02X PC=%04X", c.A, c.X, c.PC)
+	}
+	for a := 0; a < 0x10000; a += 257 {
+		if ram.Data[a] != byte(a) {
+			t.Fatalf("RAM[%04X] = %02X; want %02X", a, ram.Data[a], byte(a))
+		}
+	}
+}
+
+func TestKeyframeRing_CapFromBudget(t *testing.T) {
+	if r := NewKeyframeRing(0); r != nil {
+		t.Error("zero budget should yield nil ring")
+	}
+	// 64 MiB / 64 KiB = 1024.
+	if r := NewKeyframeRing(64 << 20); r.Cap() != 1024 {
+		t.Errorf("cap = %d; want 1024", r.Cap())
+	}
+	// Sub-keyframe budget still yields a 1-slot ring.
+	if r := NewKeyframeRing(100); r.Cap() != 1 {
+		t.Errorf("tiny budget cap = %d; want 1", r.Cap())
+	}
+}
+
+func TestKeyframeRing_NearestAndEviction(t *testing.T) {
+	r := NewKeyframeRing(3 * KeyframeBytes) // cap 3
+	for _, step := range []uint64{0, 1000, 2000, 3000} {
+		r.Push(Keyframe{Step: step})
+	}
+	// Cap 3 -> step 0 evicted; window is {1000,2000,3000}.
+	if old, _ := r.Oldest(); old != 1000 {
+		t.Errorf("oldest = %d; want 1000", old)
+	}
+	cases := []struct {
+		target uint64
+		step   uint64
+		ok     bool
+	}{
+		{3500, 3000, true},
+		{3000, 3000, true},
+		{2999, 2000, true},
+		{2000, 2000, true},
+		{1000, 1000, true},
+		{999, 0, false}, // older than the back of the window
+	}
+	for _, c := range cases {
+		kf, ok := r.Nearest(c.target)
+		if ok != c.ok || (ok && kf.Step != c.step) {
+			t.Errorf("Nearest(%d) = (%d,%v); want (%d,%v)", c.target, kf.Step, ok, c.step, c.ok)
+		}
+	}
+}
+
+func TestKeyframeRing_Bytes(t *testing.T) {
+	r := NewKeyframeRing(10 * KeyframeBytes)
+	r.Push(Keyframe{Step: 0})
+	r.Push(Keyframe{Step: 1})
+	if got := r.Bytes(); got != 2*KeyframeBytes {
+		t.Errorf("Bytes = %d; want %d", got, 2*KeyframeBytes)
+	}
+}
+
+func TestKeyframeRing_NilSafe(t *testing.T) {
+	var r *KeyframeRing
+	r.Push(Keyframe{})
+	if r.Len() != 0 || r.Cap() != 0 || r.Bytes() != 0 {
+		t.Error("nil ring should report zero")
+	}
+	if _, ok := r.Nearest(5); ok {
+		t.Error("nil ring Nearest should be false")
+	}
+}
diff --git a/docs/context.md b/docs/context.md
@@ -143,6 +143,7 @@ Bus chain: `CPU → tui.WBus → cpu.MMIO → cpu.RAM`
 - #1, #2, #3, #7, #8 (cycle audit), #9 (65C02), #10 (IRQ/NMI), #11–#15
 
 ### Merged PRs of note
+- Deep rewind via keyframes (issue #392, v1.3.0): the per-step `SnapshotRing` only reaches back its capacity (256 steps) — fine for "step back one", useless for "rewind into the last few million steps". Added keyframe-based deep rewind: `cpu.KeyframeRing` holds periodic full-RAM snapshots (`CPU.SnapshotFull` captures all 256 pages; one keyframe every `keyframeInterval`=4096 steps), and `:rewind N` reconstructs any earlier step by restoring the nearest keyframe ≤ target (`KeyframeRing.Nearest`) and replaying forward to the exact step (`rewindToStep` → `stepReplay` loop under a `replayingRewind` guard so replay doesn't re-capture keyframes). Small jumps still pop the fine ring exactly. `:rewind-budget MB` resizes the ring (cap = budget/64KiB); reach = cap × interval, shown in the status bar as `deep:<reach>@<budget>`. **Note — the issue's own numbers are mutually inconsistent**: full 64 KiB keyframes every 1k steps can't reach 10M under 256 MiB (that's ~4M). Used interval 4096 instead so 256 MiB reaches ~16.7M while forward-replay stays ≤4096 instructions (benchmarked **1.3 ms** incl. replay, vs the 100 ms acceptance). Memory is a *cap* not a reservation — the ring only fills to the run length; the old "fixed 256-entry ring" already sat at ≤16 MiB so the issue's "ring grows" framing was off. A step-0 keyframe is seeded on the first step so sub-interval targets are reachable. `StepCount` tracks position; `<` and reset keep it in sync. Determinism caveat: forward replay assumes deterministic execution between keyframes (buffered keyboard input is snapshotted, so it replays). Deltas-from-previous-keyframe compression is a future optimisation. No state-format change (StepCount/keyframes are ephemeral). `cpu` ring logic unit-tested apart from the TUI; deep-rewind exactness verified byte-for-byte against a RAM-mutating loop ROM.
 - Trace replay — search / jump-to-cycle / diff (issue #391, v1.3.0): four navigation features on top of `-trace-replay` (issue #64's playback). (1) **`:find EXPR` / `:rfind EXPR`** — jump to the next/previous frame matching an expression over the frame's registers/flags, reusing the breakpoint-condition `expr` grammar against a scratch CPU loaded per frame (`framePredicate`). A bare `=` is normalised to `==` (`normalizeFindExpr`) so `:find PC=$8042` works as users type it; bare `:find` repeats the last expression to sweep matches. (2) **`:cycle N`** — `Replay.SeekCycle` binary-searches the monotonic cycle column (O(log N) on a 1M-frame trace). (3) **`-diff PATH`** — loads a second trace; `trace.Diff` walks both by index and returns the first `Frame.Equal` mismatch (or a length-mismatch divergence at the shorter trace's end) as `trace.Divergence{Index,Cycle,Found}`, computed eagerly in `WithReplayDiff` and surfaced in the status line. (4) **`d` / `D`** — `d` toggles a side-by-side diff overlay (`diffModal`, double-bordered like the help modal) centred on the primary cursor with mismatched frames in red + a `✗` gutter at the divergence; `D` jumps both cursors there. Pure-`trace` logic (SeekCycle/FindFunc/Diff/Frame.Equal) is unit-tested separately from the TUI wiring. No state-format change.
 - Watch panel array expansion (issue #390, v1.3.0): `:watch` learns an `xN` (or `[N]`) array token — `:watch grid word x16` pins 16 consecutive LE words and renders them as indexed rows `grid[0..15]` (header `[16]`, first `maxWatchElemRows`=8 shown, rest collapsed to `… +N more`). Element width = the watch's `byte`/`word` kind; addresses are `Addr + i*Width`. `symbols.Table` now parses the cc65 `sym size=` field (`Size(addr)`) and seeds the count automatically when present — but **the issue's premise was false**: cc65 V2.18 `.dbg` carries *no* struct member layout, array bounds, or element types. C globals get bare `sym ... type=lab` records with no `size=`; even local `csym` records collapse every type to `type id=0 val="00"` (void). So struct-tree expansion is impossible from `.dbg` and the auto-seed rarely fires for data globals — `xN` is the workhorse. Scoped to array-only best-effort per that finding; struct overlays + DAP `variables` array children deferred (DAP has no globals scope yet). New `Watch.Count` is an optional v1 state field (omitempty, no schema bump). Tests: `symbols` size parse, `:watch xN`/`[N]` parsing + element addressing, panel render + truncation.
 - Blargg `apu_test` 4/8 → 8/8 PASS — Mesen2 frame-counter substeps + DMC alignment (PRs #379-#382, nessy v0.10): wired Blargg's `apu_test.nes` (8 sub-tests: len_ctr, len_table, irq_flag, irq_timing, len_timing, irq_flag_timing, dmc_basics, dmc_rates) into the accuracy harness (#379) and closed every gap it surfaced over three follow-up PRs. (1) **6 internal frame-counter sub-steps** (#380) — Mesen2 `ApuFrameCounter.h:19` table encodes the user-visible 'step 3' of 4-step mode as 3 CPU cycles (29828, 29829, 29830) where IRQ asserts continuously and the half-frame tick fires at cycle 29829. chippy's 4-entry interval table from #377 fired the tick at 29828; replaced with `frameStepIntervalsNtsc4Step = [6]int{7456, 7458, 7457, 1, 1, 7457}` + 5-step analogue, switch in `advanceFrameStep` extended to 6 cases (step 3 = IRQ-only, step 4 = q+h+IRQ, step 5 = idle/reset for 4-step). Cleared 5-len_timing. (2) **DMC buffer-fill + enable-fetch + $4015 read** (#381) — three real-silicon DMC behaviors chippy was getting wrong: `maybeRefill` was silencing whenever `bufferEmpty=true` at the 8-bit boundary instead of only when `bytesRemaining=0` too; `setEnabled` didn't schedule the initial DMA fetch (Mesen `SetEnabled` does via `transferStartDelay`); $4015 read was clearing the DMC IRQ flag (per nesdev + Mesen `NesApu.cpp:101`, only frame-counter IRQ is cleared by $4015 read — DMC IRQ acks via $4015 write or $4010 bit-7 clear). dmcChannel now inits with `bufferEmpty=true`+`silenced=true`. Cleared 7-dmc_basics' 18 sub-tests. (3) **Mesen-aligned DMC Clock** (#382) — three compounding structural mismatches: chippy burned an extra 'reload-only' fire per byte (each byte = 9 fires instead of Mesen's 8), the timer reload was period+1 cycles between fires (429 vs Mesen's 428), and the fetch-schedule check only ran at byte boundaries. Replaced `clockShift`+`maybeRefill` with a unified `clock()` mirroring Mesen `DeltaModulationChannel::Run`'s inner body: always shift+decrement, reload at `bitsRemaining=0` boundary, schedule fetch on every clock when buffer-empty+bytes-pending. Initialise `bitsRemaining=8` (matches Mesen `Reset:36`). Cleared 8-dmc_rates' 16 rates × 2 boundary checks. **All four accuracy ROMs now PASS**: `ppu_vbl_nmi` 10/10, `instr_timing`, `cpu_interrupts_v2` 5/5, `apu_test` 8/8. No regression on nestest / Klaus / demo SHAs. The DMC restructure also fixes any ROM that uses delta samples — the rate timing was off by ~12% before. Refs #318 (rolling accuracy tracker).

diff --git a/internal/tui/complete.go b/internal/tui/complete.go
@@ -23,6 +23,7 @@ var defaultVerbs = func() []string {
 		"syms", "symbols",
 		"mem",
 		"find", "rfind", "cycle",
+		"rewind", "rewind-budget",
 		"trace",
 		"textsave",
 		"theme",