"Native" interface, XOR in main library#4

Open
HeathenUK wants to merge 223 commits into vascofazza:master from HeathenUK:master


@HeathenUK

No description provided.

Based on the display_multiplex sketch, to create an image for those of us that don't program EEPROM with Arduino.
I'm sure this can be optimised significantly, but here's a readable XOR function to complete the set
Owner

@vascofazza vascofazza left a comment

Thank you for the great contribution. Please integrate the display code into the existing code or remove it. Thanks.

HeathenUK and others added 24 commits March 28, 2026 09:56
…mulator

Phase 1 (microcode-only): XOR, NEG, SWAP, LDP3/STP3 (page 3 access),
LDSP (stack-relative load), JNC/JNZ (inverted conditional jumps),
CLR aliases, and forward-compatible SETJMP/SETRET for future code banking.

Phase 2 (hardware): INC/DEC via CINV bodge wire on U24 carry-in path
(1 trace cut + 3 wires).

Tooling: cycle-accurate MK1 simulator (mk1sim.py) with 25-test validation
suite and microcode timing violation checker. ESP32 Nano Web IDE with
assembler, program upload, clock speed measurement, single-step debugging,
hex dump display, save/load to flash, and example programs.

Schematics converted from KiCad 5 legacy to KiCad 10 format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- vbcc MK1 backend: compiles C to MK1 assembly (vbccmk1 binary)
- Example C program with max() function and out()/halt() builtins
- ESP32 WiFi: connects to saved network (FFat NV storage), falls back to AP
- mDNS: reachable at http://mk1.local when on home network
- Web UI: fixed layout so output panel stays visible, no header wrapping
- WiFi config via POST /wifi endpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…b, stsp, dispmode

New microcode instructions (35/35 simulator tests pass):
- deref/ideref: indirect load/store from data page (enables arrays/pointers)
- setz/setnz/setc/setnc: flag-to-register boolean evaluation
- stsp: stack-relative store (clobbers D, enables re-entrant functions)
- push_imm: combined ldi+push in 2 bytes (saves 1 byte per call arg)
- ldsp_b: load stack to B directly (saves ldsp+mov pattern)
- dispmode: forward-compatible display mode latch (NOP until hw mod)

vbcc backend updated to generate push_imm, ldsp_b, stsp:
  example.c: 146 → 130 bytes (11% reduction)

ESP32 assembler: collision protection for reclaimed ALU opcodes.
Full toolchain: vbcc → vasm → mk1link pipeline verified end-to-end.
All assembler definitions updated (mk1.cpu, isa.h, vasm cpu module).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace i2c_start/i2c_stop/i2c2bit with gpio_read + 2 NOPs
- Fix exw 1 1 timing: add AO|E1 setup step before AO|E1|E0 latch
- Use E0 instead of U0 for latch clock (U75 has EEPROM glitch issues)
- Add gpio_read opcode (0xC5) for 652 daughter board read-back
- Update simulator tests: replace 7 I2C tests with exw/gpio_read tests
- Update mk1.cpu assembler: add gpio_read, remove i2c mnemonics

The 82C55 PPI uses existing exw 0 x variants (E0 for ~WR, U0/U1 for
A0/A1) with a 74HCT04 inverter. No new microcode needed for 82C55.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Program manager: save/load/delete named programs on flash (FFat)
- Auto-save: source saved to flash on every Assemble & Run
- Boot restore: reassembles and uploads last program on ESP32 boot
- If no saved program, uploads HLT as safe default
- Hold MK1 in reset during ESP32 boot (prevents garbage execution)
- #clock directive: set ESP32 clock speed from assembly source
- Ctrl+Enter keyboard shortcut for Assemble & Run
- Ctrl+S opens Programs dialog
- Download button exports editor as .asm file
- Programs UI: modal with save (named), list, load, delete

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hardware:
- W65C22S VIA replaces 82C55 PPI on daughter board (J4)
- DDR-only open-drain I2C: no MOSFET, no inverter needed
- E0→PHI2, E1→R/W, U0/U1→RS0/RS1 (all from J4)
- New exrw opcodes (0xC0/0xC9/0xCD/0xD6) for VIA reads with E0+E1

Microcode:
- Added exrw 0-3 opcodes (read with both E0+E1 for VIA PHI2+R/W)
- Reclaimed ALU slots 0xC0, 0xC9, 0xCD, 0xD6

ESP32 firmware:
- OI pin (A7) for automated output capture
- /run_cycles endpoint with configurable clock speed (us= parameter)
- /upload_and_wait for combined upload+capture
- /read_output returns captured value + history
- Polled OI detection in run_cycles (atomic GPIO.in read)
- handleStatus guards bus mode when OI monitor active

C compiler (mk1cc2.py):
- I2C builtins: i2c_start(), i2c_stop(), i2c_send_byte()
- LCD builtins: lcd_init(), lcd_cmd(), lcd_char()
- Shared __i2c_sb subroutine with ACK return value
- Merged __lcd_cmd/__lcd_chr via __lcd_send with flags parameter
- i2c_send_byte/lcd_cmd/lcd_char NOT in BUILTINS set (they clobber
  B/C/D via jal, register allocator must treat as function calls)
- Fixed _has_calls to walk list children in blocks
- Fixed _has_calls to exclude all builtins (not just out/halt)
- Added save_a_to_d when function body has user calls
- Recursive _find_reg_candidates for inner-scope variables
- Depth-prioritized register allocation (inner loops get C/D first)
- Predecrement optimization: dec;jnz for do{}while(--i) patterns
- do-while uses gen_branch_true (single jump, not jz+j)

Key findings:
- VIA needs exrw 2 (read DDRB) before each I2C START condition
- I2C ACK clock needs 5+5 NOPs for reliable slave response detection
- run_cycles at us=1 (~500kHz) optimal for I2C reliability
- 16x2 integrated LCD had no contrast control; 20x4 with pot works

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ht fix

Compiler (mk1cc2.py):
- Add i2c_init(), i2c_bus_reset(), i2c_start(), i2c_stop() builtins
- Add i2c_send_byte(), i2c_read_byte(), i2c_ack(), i2c_nack() builtins
- i2c_init() uses delay loop (~1.5ms) for VIA RC reset settling
- i2c_bus_reset() adds STOP for first-program-in-session bus reset
- halt() now idles I2C bus (DDRB=0) before hlt to prevent upload glitches
- Fix out $reg: was emitting bare 'out' (only outputs A), now emits mov+out
- Fix peephole DCE stripping section directives (data page vars in code page)
- Fix helper emission ordering (after all functions compiled)
- Add lcd_print() builtin with page 3 string storage
- Add string literal parser
- Dynamic overlay helper inlining (only inline helpers used solely by overlays)
- Overlay loader uses inc (Phase 2 CINV) instead of addi 1
- LCD init: keep backlight (BL=0x08) on throughout, no PCF8574 reset

ESP32 firmware (main.cpp):
- HLT LED wired to D1: ISR stops LEDC clock on RISING edge
- Deferred ISR attachment (wait for PIN_HLT LOW to avoid false trigger)
- Code page prefilled with 0x7F (HLT) to prevent PC wrap past halt

ESP32 assembler (assembler.h):
- Fill code buffer with HLT (0x7F) before assembly

New files:
- eeprom_stress.py: AT24C32 EEPROM write/read stress test
- i2c_scan_lcd.c: Combined I2C scan + LCD display program

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Calibration:
- calibrate.asm: count loop between SQW rising edges on VIA PA0
- ds3231_sqw_config.asm: configure DS3231 for 1Hz SQW output
- SQW on PA0 with 5.1k pullup, gives consistent count=23 at run_cycles us=1

Compiler (mk1cc2.py):
- i2c_init(): delay loop (~1.5ms) for VIA RC reset settle instead of 3 NOPs
- i2c_bus_reset(): same delay loop + STOP
- halt(): idles I2C bus (DDRB=0) before hlt
- i2c_start()/i2c_stop(): now inline (avoids jal overhead)
- i2c_read_byte(): via jal __i2c_rb, returns in A
- i2c_ack()/i2c_nack(): inline
- Fix out $reg: was bare 'out', now emits mov $reg,$a; out

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Speedometer (speedometer.asm):
- Continuous kHz display via DS3231 SQW on VIA PA0
- Configures DS3231 for 1Hz SQW, counts loop iterations per half-cycle
- Outputs count×7 ≈ kHz, updates every ~1 second
- Tested: 098@100kHz, 196@200kHz, 252@250kHz ✓

ESP32 firmware fixes:
- LEDC: use 1-bit resolution for all frequencies (exact prescaler, no quantization)
  Old 8-bit resolution made 200/250/300kHz all output ~100kHz
- Upload: release bus (busSetInput+disableOutput) after upload
  ESP32 bus pins stayed OUTPUT, overriding VIA data reads at LEDC speed
- HLT ISR: debounce with digitalRead verify (filter glitches on continuous programs)
- Assembler: NOP fill instead of HLT fill (prevents false HLT on looping programs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
delay_cal.asm: calibrates from SQW then blinks 100/200 with ~2s gaps.
Works on any clock source (555, LEDC, run_cycles) — measures actual
clock speed at startup via 1Hz SQW full-cycle count.

delay_ms: A=calibration count, B=ms. Inner loop = C/2 iters × 7 cycles.
~5% accuracy (always slightly long = safe for timing constraints).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stopwatch (stopwatch.asm):
- Calibration and delay use IDENTICAL inner loop (10 nops + dec + jnz = 13 cycles)
- D overflows measured per SQW cycle = exact 1 second reference
- D/4 overflows = 250ms delay chunk. 4 chunks = 1 second tick.
- Tested: 30 displayed seconds ≈ 30 real seconds on 555 auto clock

Previous attempts failed due to:
- Cycle count mismatch between calibration loop and delay loop
- Outer loop overhead dominating at low clock speeds (small C values)
- The standard fix: use the same loop body for both (well-known 6502/Z80 technique)
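
The same-loop idea can be illustrated with a small timing model. This is a sketch in Python, not code from the repository: `calibrate` and `delay_seconds` are hypothetical names, the 126 kHz clock is an arbitrary input, and the 7-cycle body stands in for any delay loop that differs from the 13-cycle calibration loop.

```python
def calibrate(clock_hz, cycles_per_iter, window_s=1.0):
    """Count how many loop passes fit in the reference window (one SQW cycle)."""
    return int(clock_hz * window_s) // cycles_per_iter


def delay_seconds(iterations, clock_hz, cycles_per_iter):
    """Wall-clock time spent replaying `iterations` passes of a delay loop."""
    return iterations * cycles_per_iter / clock_hz


clock_hz = 126_000                       # illustrative clock, any value works
count = calibrate(clock_hz, 13)          # calibration loop: 13 cycles per pass

same_loop = delay_seconds(count, clock_hz, 13)   # delay reuses the same body
mismatched = delay_seconds(count, clock_hz, 7)   # delay uses a shorter body

print(f"same loop body:  {same_loop:.4f} s")
print(f"mismatched body: {mismatched:.4f} s")
```

With the same body on both sides the clock frequency and the per-pass cost cancel out, so the only remaining error is the truncation of the last partial pass during calibration.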

ESP32 fixes:
- Upload releases bus (busSetInput) — fixes VIA reads at LEDC speed
- LEDC uses 1-bit resolution for accurate frequencies at all speeds
- HLT ISR debounced with digitalRead verify
- NOP fill instead of HLT fill for looping programs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Programs:
- eeprom_calibrate.asm: SQW calibration, stores D/4 in data[0] (160B)
- eeprom_write_read.asm: write byte, delay_Nms(15), read back (219B)
- Tested 30/30: 3 rounds × 10 patterns (0,1,42,85,99,127,128,170,200,255)

Key technique: same-loop calibration — __d256 subroutine shared by
calibration counter and delay function. Zero systematic timing error.

Compiler (mk1cc2.py):
- Add delay_calibrate() and delay_ms(n) builtins
- Add __delay_cal and __delay_Nms helper emission
- Add __delay_cal, __delay_Nms, __i2c_rb to _NO_OVERLAY set

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ESP32: release PIN_HL/PIN_STK to INPUT after upload so microcode
can drive them for stack/data/page3 access. Previous OUTPUT LOW
caused stack writes to corrupt code page, breaking jal/ret.

VIA: add exw 0 3 (DDRA=0) to all VIA init sequences. VIA registers
persist across CPU resets, so stale DDRA makes PA0 an output,
breaking SQW reads even though the physical signal is toggling.

Stopwatch: replace ldsp-based D/4 passing with data page storage
via ideref/deref. ldsp produced wrong values in the full stopwatch
binary (cause under investigation). Stopwatch now times accurately
against real clock (~30 displayed ≈ 30 real seconds on 555 auto).

New: 5-pattern EEPROM stress test (eeprom_stress_5pattern.asm).
Writes 42/255/170/85/99 to EEPROM, delays 15ms each, reads back.
5/5 passes confirmed on auto clock.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ldsp computed SP+N and latched MAR in the same step (EO|MI). When
SP+N caused a long carry chain (e.g. 254+2=256 wrapping all 8 bits),
the carry didn't fully propagate before MAR latched, causing
intermittent wrong stack reads (observed: values 106, 108 instead
of expected 3).

Fix: split into EO (ALU drives bus, carry settles) then EO|MI (MAR
latches settled value). Uses all 8 microcode steps with natural
step counter wrap. Applied to both ldsp (0xEB) and ldsp_b (0xF3).

Note: stsp (0xDB) has the same risk but already uses all 8 steps
and cannot be fixed without hardware changes.
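
Why latching in the same step is risky can be shown with a toy ripple-carry model. This is a sketch under simplifying assumptions: every bit position costs exactly one gate delay and all carries start at 0. Real adder chips may use internal look-ahead, and the model does not try to reproduce the observed values 106 and 108. `ripple_sum` is a hypothetical helper, not part of the simulator.

```python
def ripple_sum(a, b, steps, width=8):
    """Sum visible on the adder outputs after `steps` gate delays."""
    carries = [0] * (width + 1)          # carries[i] is the carry INTO bit i
    for _ in range(steps):
        new = [0] * (width + 1)
        for i in range(width):
            ai, bi = (a >> i) & 1, (b >> i) & 1
            # Majority function: carry out of bit i, using last step's carry in
            new[i + 1] = (ai & bi) | (ai & carries[i]) | (bi & carries[i])
        carries = new
    total = 0
    for i in range(width):
        bit = ((a >> i) & 1) ^ ((b >> i) & 1) ^ carries[i]
        total |= bit << i
    return total


# SP=254, N=2: the carry generated at bit 1 has to ripple up to bit 7.
for steps in range(8):
    print(steps, ripple_sum(254, 2, steps))
```

In this model the output passes through several wrong intermediate values before it settles on 0, which is why giving the ALU one step to drive the bus before MAR latches removes the race.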

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The delay loop only works when .d_outer lands at address 118.
Adding exw 0 3 (DDRA=0) shifted it to 119 which breaks.
Fix: remove redundant ldi $d,0 at start (-2 bytes), add exw 0 3
(+1 byte) and alignment NOP (+1 byte) = net 0 change, .d_outer
stays at 118.

Root cause of address sensitivity under investigation — possibly
STK/HL floating pin timing or microcode EEPROM marginal access.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Buzzer on VIA PA1 (pin 3). DDRA=0x02 set only during beep
subroutine, DDRA=0 at all other times (PA0 must stay input
for SQW reads). Beep toggles PA1 via exw 0 1 for 256 cycles.

beep_check in tick loop: countdown in data[1], decrements each
tick, beeps at 0, resets to 10. DDRA cleared to 0 every tick
in beep_check to ensure SQW reads work reliably.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- 64-iteration inner loop (ldi $a,64 instead of clr $a) for wider
  clock speed range. D/4 no longer truncates to 0 at low speeds.
  Supports ~14kHz to ~550kHz.
- PA1 piezo buzzer beep synced to display value (every 10 ticks)
- DDRA=0 in VIA init (prevents stale state from previous runs)
- beep_check before delays in tick loop (clears DDRA before delay)
- 251 bytes, .d_outer at address 126

Known limitation: address-dependent timing bug means this specific
code layout works at 126kHz auto but may fail at other speeds.
Root cause under investigation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The assembler fills unused code page with HLT (0x7F) to prevent
runaway execution past program end. But handleAssemble/handleRun
overwrote the fill with NOP (0x00) via memset. This caused the CPU
to loop through NOPs after HLT, wrapping to address 0 and
re-executing — breaking run_cycles OI detection and upload_and_wait.

Fix: copy full 256 bytes from result.code (which includes HLT fill)
instead of only copying codeBytes then padding with NOP.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Serial commands over USB (115200 baud):
  ASM:<source>  — assemble (\\n for newlines)
  UPLOAD        — upload to MK1
  RUN:n,us      — run n cycles at us half-period
  OI            — read last output capture
  STATUS        — CPU state
  RESET         — reset CPU

Assembly response now includes src_len and cksum fields for
detecting WiFi upload corruption. Checksum is sum of all bytes
in the upload buffer.
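
A host-side check against these fields might look like the sketch below. Several details are assumptions: the response is modelled as a plain dict, `src_len` is taken to be the UTF-8 byte length of the source, and the checksum is an untruncated sum unless a `modulus` is supplied (the firmware may truncate to a fixed width). `upload_checksum` and `response_matches` are hypothetical names.

```python
def upload_checksum(buf: bytes, modulus=None) -> int:
    """Sum of all bytes in the upload buffer, optionally truncated."""
    total = sum(buf)
    return total if modulus is None else total % modulus


def response_matches(source: str, upload: bytes, response: dict,
                     modulus=None) -> bool:
    """Compare locally computed values with the fields the firmware reports."""
    return (response.get("src_len") == len(source.encode("utf-8"))
            and response.get("cksum") == upload_checksum(upload, modulus))


upload = bytes([0x7F]) * 256                     # a page of HLT, for illustration
good = {"src_len": 3, "cksum": upload_checksum(upload)}
bad = {"src_len": 3, "cksum": upload_checksum(upload) - 1}
print(response_matches("hlt", upload, good), response_matches("hlt", upload, bad))
```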

Key finding: comprehensive serial testing proves MK1 CPU has
NO hardware bugs. All instructions (sequential, J, JNZ, JAL)
pass 100% at all addresses and speeds. Previous failures were
caused by WiFi instability and NOP fill (both now fixed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- CLOCK:hz and HALT serial commands for testing without WiFi
- OI ISR no longer deduplicates same-value events, enabling
  accurate tick rate measurement via oi_count polling
- Verified stopwatch runs at ~1 tick/s on LEDC at all speeds
  (10kHz-500kHz), confirming SQW calibration adapts correctly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RUNLOG:n,us runs N clock cycles without stopping on OI, logging
all captured values. Enables measuring tick rate and reading
display values during continuous execution.

Key findings:
- Stopwatch ticks at exactly 1.00/s on LEDC at all speeds
- RUNLOG captures OI values but reads 0/2 alternating (bus timing)
- OI history expanded to 256 entries for longer captures
- OICNT returns raw event count for rate measurement

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Serial sweep results prove all instructions work at all addresses
and speeds. The "address-dependent bug" and VIA read failures were
caused by a loose J4 connector on the VIA daughter board, discovered
after board cleaning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Back-to-back exw 0 2 (DDRB write) sequences caused data bus values to
bleed into VIA RS0, corrupting DDRA instead of DDRB. This broke I2C
by setting port A pins as outputs, making SQW reads return 0.

ddrb_imm N writes an immediate directly to DDRB (RAM→bus→VIA) without
touching any CPU registers. The instruction fetch overhead provides 4
non-VIA clock cycles of settling between consecutive writes, which
eliminates the corruption (verified 10/10 hammer tests, 10/10 at all
clock speeds us=1..50).

Also fixes RUNLOG truncating output to 16 entries when >256 captured.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HeathenUK and others added 29 commits April 20, 2026 19:08
Add a secondary acceptance criterion to T2.1's budget gate: if the
strict gate (kernel + max_overlay unchanged-or-shrunk) rejects a
candidate BUT raw savings are positive AND the post-state still
fits the 250B code page, accept. This lets T2.1 extract sequences
whose occurrences are concentrated in non-largest overlays — the
kernel grows by the thunk size, but multiple overlays shrink in
parallel, giving net total-code savings even when max_overlay is
unchanged.
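
The two-tier gate described above can be sketched as a predicate. This is not the compiler's actual code: `accept_extraction` and its parameters are hypothetical, and "fits the 250B code page" is interpreted here as kernel plus largest overlay not exceeding 250 bytes, which is an assumption.

```python
CODE_PAGE_LIMIT = 250  # bytes, the code-page budget named above


def accept_extraction(kernel_before, max_overlay_before,
                      kernel_after, max_overlay_after,
                      raw_savings, limit=CODE_PAGE_LIMIT):
    """Return True if a thunk-extraction candidate should be accepted."""
    # Strict gate: kernel + largest overlay must not grow.
    if kernel_after + max_overlay_after <= kernel_before + max_overlay_before:
        return True
    # Secondary gate: total code shrinks and the result still fits the page.
    fits = kernel_after + max_overlay_after <= limit
    return raw_savings > 0 and fits


# Kernel grows by a 3B thunk, largest overlay unchanged, smaller overlays shrink.
print(accept_extraction(100, 120, 103, 120, raw_savings=5))
```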

Impact (corpus-wide after IDX remap fix):
  - test_eeprom_overlay: process0 now fits in SRAM (was spilling
    to EEPROM tier). With correct IDX remap AND SRAM placement,
    all 6 process functions produce CORRECT outputs in sim:
    [4,11,25,53,106, 12,28,60,124,248, 20,45,95,195,134, ...].
    Previously the sim output was 30 buggy entries because main's
    IDX dispatch called the wrong overlay for 5 of 6 calls.
  - overlay_clock / overlay_seconds / test1_clock / test3_seconds:
    +2-3B each from extracted thunks that the strict gate would
    have rejected.
  - Overall: 6 programs see size changes, 41/42 sim_corpus still
    byte-identical to their baselines.

The 50B net kernel growth across affected programs is well below
their code-page headroom (largest remaining is 158B/250B).
Trade: small kernel growth for correctness + cross-overlay dedup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two improvements to the OI capture harness:

1. Ring buffer (not a fill-once buffer). Before: `if (oiCount < 256) oiHistory[oiCount] = val`
   dropped every event past #256 — fatal for I2C-heavy programs whose real
   `out()` emissions happen at the END, after hundreds of intermediate
   bus OI events. Now: `oiHistory[oiCount & 0xFF] = val` keeps the
   NEWEST 256. Readers use `oiHistAt()` / `oiHistStored()` accessors
   and iterate oldest→newest. RUNNB reports the newest 32 events;
   RUNLOG reports all stored events in chronological order.

2. Single-snapshot OI-and-bus read. Before: RUNLOG/RUNNB did
   `gpio1 = GPIO.in` to CHECK OI, then called `readBusFast()` which
   took a SECOND GPIO snapshot to decode the bus. Between snapshots
   the CPU could advance microcode steps and change bus state. Now:
   new `decodeBus(gpio)` helper decodes from a caller-supplied
   snapshot; capture sites pass the same snapshot that detected OI.

Impact: test_idx_eeprom (EEPROM write/readback) now shows the real
final out() sequence at the end of the reported history instead of
buried behind I2C noise. Previously even a known-correct program
returned garbage in the first 32 events.

The `readBusFast()` wrapper is kept for back-compat (it just calls
`decodeBus(GPIO.in)`).
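
The ring-buffer behaviour in item 1 can be modelled in a few lines. This is a Python sketch of the indexing only, not the firmware's C++: the class and method names are hypothetical stand-ins for `oiHistAt()` / `oiHistStored()`, whose real signatures are not shown here.

```python
class OiHistory:
    """Keeps the newest 256 OI values, readable oldest to newest."""

    SIZE = 256

    def __init__(self):
        self._buf = [0] * self.SIZE
        self.count = 0                      # total events ever recorded

    def record(self, val):
        self._buf[self.count & 0xFF] = val & 0xFF   # overwrite the oldest slot
        self.count += 1

    def stored(self):
        return min(self.count, self.SIZE)

    def at(self, i):
        """Return the i-th stored event, where 0 is the oldest."""
        if not 0 <= i < self.stored():
            raise IndexError(i)
        start = self.count - self.stored()  # event number of the oldest entry
        return self._buf[(start + i) & 0xFF]


h = OiHistory()
for event in range(300):
    h.record(event)
print(h.stored(), h.at(0), h.at(h.stored() - 1))
```

After 300 events the first 44 have been overwritten, so index 0 refers to event 44 and the last index to event 299 (stored as its low 8 bits).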

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Uses EEPROM write-and-readback: fn0 (overlay) writes 0x5C to
EEPROM[0x0020], main reads it back and outputs. Followed by a
4-byte end sentinel (0xFE 0xED 0xBE 0xEF) so harness can anchor
on program completion.
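
A harness can anchor on that sentinel roughly as follows. This is a hedged sketch: `values_before_sentinel` is a hypothetical helper, and the captured history is modelled as a flat list of byte values.

```python
SENTINEL = [0xFE, 0xED, 0xBE, 0xEF]   # end marker emitted by the test program


def values_before_sentinel(history, count=1):
    """Return the `count` values just before the last sentinel, or None."""
    history = list(history)
    n = len(SENTINEL)
    # Search backwards so intermediate bus noise earlier in the capture is ignored.
    for start in range(len(history) - n, -1, -1):
        if history[start:start + n] == SENTINEL:
            if start < count:
                return None
            return history[start - count:start]
    return None


captured = [0x00, 0x02, 0x00, 0x02, 0x5C, 0xFE, 0xED, 0xBE, 0xEF]
print(values_before_sentinel(captured))
```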

On manual clock, RUNNB:cycles,us,0 + RESET per run gives 5/5
deterministic output [0x5C, 0xFE, 0xED, 0xBE, 0xEF] — proving
the IDX remap (T3.3 $c vs legacy $a) dispatches fn0 correctly
through the overlay loader and that the round-trip through the
I2C/AT24C32 path works.

Requires:
  - Manual (ESP32) clock — auto clock produces non-deterministic
    OI captures because ESP32 samples async to 555 oscillator
  - Post-7a9ec7e firmware with ring-buffer OI history and
    single-snapshot GPIO decode — otherwise the real output
    gets buried behind intermediate I2C bus activity in the
    first-256-events FIFO

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a first-line diagnostic tool for "where did my bytes go?" on any
program. Previously: grep through ad-hoc stderr chatter. Now:

    $ python3 mk1cc2.py foo.c -o foo.asm --why-not-smaller
    ── Byte Sink Report ──
      Mode: overlay
      Stage-1 init: 361B / 250B
      Kernel total: 271B (loader=74B + main=62B + helpers=133B)
      Resident helpers (largest first):
        __i2c_sb    36B  (13.3% of kernel)
        __delay_Nms 34B  (12.5% of kernel)
        ...
      Overlay slots (largest first):
        [0] _show_temp  148B
        [1] _main_p1    116B
        ...
      Top 5 sinks (tight budget: stage1):
        loader                         74B
        main                           62B
        __i2c_sb                       36B
        ...

Fires on the overflow path too (before the hard exit), which is
exactly when the user needs it.

Also adds corpus_sizes.py: compile every program and show a one-row
table of stage1/kernel/loader/main/helpers/overlay counts. Feeds
into --metrics-out for scripted monitoring.

Hardware verified via fresh-compile of test_idx_eeprom.c + 3× RUNNB:
3/3 runs return [0x5C, 0xFE, 0xED, 0xBE, 0xEF] as before. No
compiler regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User-level restructuring to get the program under 250B code page.
Two changes:

1. Replace inline i2c_start/send_byte/stop chains with bytecode
   sequences fed to __i2c_stream:
   - Position-set (6 i2c primitive calls) → pos_seq[] + i2c_stream(135)
   - send_glyph (8 calls including 5 dynamic font bytes) → patch font
     bytes into glyph_seq[] then i2c_stream(208). Uses poke3/peek3.

2. Factor the divmod-and-display block out of main() into a
   show_temp(t) function. Phase 7 places it as an overlay, shrinks
   main from 116B → 52B.

Before:
  Stage-1: 334B / 250B → OVERFLOW by 80B
  Kernel:  318B (main=130B, helpers=125B, loader=42B, thunks)
  Status:  FAIL

After:
  Stage-1: 237B / 250B (13B free)
  Kernel:  221B (main=52B, helpers=125B, loader=42B)
  Overlay: _show_temp 129B on page 2 (wraps 100B, safe-loaded-last)
  Status:  COMPILES CLEAN

Corpus state: 47/48 programs compile (was 46/48).
Only overlay_dashboard still overflows (24B over, resident helpers).

Hardware: test_idx_eeprom still 3/3 deterministic — no compiler regression.

Per "arbitrary C within reason" — this rewrite is fair-game: the old
source mixed bytecoded streams with inline I2C primitive calls
inconsistently. The unused `glyph_buf[30]` in the old source hints
the author's original intent was this exact pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The compiler auto-emitted a 1ms calibrated delay after any lcd_cmd(N)
where N is not 0x01 (clear) or 0x02 (home). The stated rationale was
"HD44780 generic-command 37µs minimum." But 37µs is trivially covered
by the I2C transport's natural latency:

  - Sending one I2C byte via the PCF8574 backpack is ~300 CPU cycles
    of bit-banging. At the MK1's max clock (1MHz) that's ≥300µs, an
    order of magnitude over the 37µs minimum.
  - Any subsequent lcd operation triggers another I2C transaction,
    further covering the hardware's internal latency.

The only commands that genuinely need explicit delay are clear (0x01)
and home (0x02) at 1.52ms. Those already use the inline dec/jnz loop
path — no change.
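
The latency argument is simple arithmetic, sketched here in Python. The 300-cycle figure is the approximate one quoted above, and `i2c_byte_latency_us` is a hypothetical helper name.

```python
HD44780_MIN_US = 37       # generic-command minimum quoted above
I2C_BYTE_CYCLES = 300     # approximate bit-bang cost of one I2C byte


def i2c_byte_latency_us(clock_hz, cycles=I2C_BYTE_CYCLES):
    """Microseconds spent sending one I2C byte at the given CPU clock."""
    return cycles * 1_000_000 / clock_hz


for clock_hz in (1_000_000, 500_000, 100_000):
    latency = i2c_byte_latency_us(clock_hz)
    print(f"{clock_hz:>9} Hz: {latency:7.1f} us, covers 37 us: {latency >= HD44780_MIN_US}")
```

Slower clocks only widen the margin, so the fastest clock is the worst case to check.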

Removing the spurious delay unblocks overlay_dashboard:
  - `__delay_Nms` (34B) goes from "used by 2 overlays, kept resident"
    to "used by 0 overlays, not emitted at all"
  - `__delay_cal` (~58B init-only) also drops
  - Kernel: 271B → 223B
  - Stage-1: 361B → 242B
  - Status: was FAIL (+107B over), now COMPILES CLEAN

Corpus: 47/48 → **48/48 programs compile**.

Incidentally fixes test_lcd_eeprom's prior silent hang (now reaches
its final out() and halts).

test_idx_eeprom hardware: still 5/5 deterministic, no regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the 8×8→16 multiply + 16-bit countdown with a straightforward
two-nested-loops structure:

  __delay_Nms:        ; B = N (ms)
    ldi $a,240
    derefp3           ; A = ipm from page3[240]
    mov $a,$c         ; C = ipm
  .outer:
    mov $c,$a         ; reload ipm
  .inner:
    dec; jnz .inner   ; bare inner loop (same as calibrator)
    decb              ; B--
    jnz .outer
    ret

**12B vs 34B. 22B saved.**

Accuracy analysis:
  - Inner loop is still bare dec+jnz — matches the calibrator's own
    inner loop exactly, so ipm (iterations per ms) translates directly.
  - Outer-loop overhead is 3 extra instructions per ms ≈ 0.6% at
    500kHz, ≈1.2% at 250kHz. Well inside the pre-existing ±2%
    tolerance noted in the old code's comment.
  - The old approach had NO per-ms overhead (single flat 16-bit
    countdown) but paid 22B for the multiply+carry setup. The new
    approach trades that for ~1% accuracy loss at delay-heavy clocks
    — still monotonically increasing and within HD44780 timing margins.
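
The loop structure can be modelled to count inner-loop passes. This is a sketch that assumes `dec` and `decb` wrap at 8 bits and that `jnz` is taken whenever the result is non-zero; it counts passes only and does not model cycle costs. `delay_inner_iterations` is a hypothetical name.

```python
def delay_inner_iterations(ipm, n_ms):
    """Count inner-loop passes of the two-nested-loops delay."""
    total = 0
    b = n_ms & 0xFF                 # B = N (ms)
    while True:
        a = ipm & 0xFF              # mov $c,$a : reload ipm
        while True:
            a = (a - 1) & 0xFF      # dec
            total += 1
            if a == 0:              # jnz .inner falls through at zero
                break
        b = (b - 1) & 0xFF          # decb
        if b == 0:                  # jnz .outer falls through at zero
            break
    return total


print(delay_inner_iterations(36, 5))
```

Under these assumptions a zero in either register wraps to 256 passes rather than zero, so callers of a loop shaped like this would need to avoid passing 0.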

Corpus impact: 22B per program that uses __delay_Nms
(twinkle_v2 kernel 232B → 210B, twinkle 228B → 206B, etc).

Hardware: test_idx_eeprom still 3/3 deterministic.
Corpus: 48/48 sim unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ESP32 firmware:
  - Add parallel oiTimes[256] ring buffer. Each OI event records the
    current actualCycles counter (RUN/RUNNB/RUNLOG) or micros() (ISR).
  - RUNNB and RUNLOG JSON output now includes a "ts" array alongside
    "hist"/"vals".

Consumers can take deltas between successive timestamps to verify
timing — e.g., delay(N) accuracy by comparing expected vs observed
cycles-between-out()s.

Verified on test_idx_eeprom:
  val=0x5c  ts=7111
  val=0xfe  ts=7115  (Δ 4 cycles ≈ 16µs between out_imm calls)
  val=0xed  ts=7119
  val=0xbe  ts=7123
  val=0xef  ts=7127

4 cycles between consecutive out_imm's matches the expected ~4 MK1
instructions (fetch/decode/execute) at the ESP32's 165kHz clocking.
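
Taking deltas on the consumer side is a one-liner. The sketch below uses the timestamps listed above and assumes the "ts" array has already been parsed into a list of integers; `timestamp_deltas` is a hypothetical helper.

```python
def timestamp_deltas(ts):
    """Differences between successive OI timestamps."""
    return [later - earlier for earlier, later in zip(ts, ts[1:])]


ts = [7111, 7115, 7119, 7123, 7127]   # values from the test_idx_eeprom run above
print(timestamp_deltas(ts))           # [4, 4, 4, 4]
```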

Compiler fix (bonus, caught while writing the accuracy test):
  delay(N) was not triggering __delay_cal auto-insertion — only
  lcd_cmd's auto-delay path was. Any program calling delay() without
  an explicit delay_calibrate() ran on an uninitialised ipm at
  page3[240]. Now delay() sets _needs_delay_calibrate like the other
  calibrated-delay paths.

Corpus: 48/48 sim unchanged.
Hardware: test_idx_eeprom still 3/3 deterministic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, SP init fix

Language features:
- printf("%d %x %c %s", ...) with compile-time format expansion; targets LCD
  via new __print_u8_dec (no leading-zero suppression) and __print_u8_hex
- typedef <type> <alias>;  parsed and expanded in type and statement positions
- sizeof(type) and sizeof(arr) → compile-time int constant
- #define NAME value with iterative substitution (chains fully resolve)
- Compound assigns /= %= <<= >>= now parse and codegen
- String literal concatenation "abc" "def" at parse time
- \xNN hex escapes in strings for non-ASCII LCD glyphs (e.g. \xDF = °)
- Strings as first-class values via gen_expr('string'): allocate to page 3
  and return offset, dedupe by content

Correctness fix — flat-mode SP init:
Without ldi $b,0xFF; mov $b,$sp at _main entry, SP starts at 0 (reset),
first push wraps SP→0xFF, next stsp at SP=0xFF triggers the unfixed stsp
microcode carry race. Symptom: x %= 3 gave 0 instead of 1 when lcd_print
was present (stack-local x, stsp/mov/ldsp sequence). Inject the same 3B
preamble that overlay mode already emits.

Correctness fix — /N %N non-power-of-2:
The constant-binop path returned early for / and %, so x / 3 with
variable x emitted nothing. Fall through to the general divide/modulo
helpers for those ops when val isn't a handled power-of-2.

Size wins:
- Array store arr[const] = val: 4B save (no push/clr/mov/pop dance)
- Array load arr[const]: 2B save (fold base+index into ldi $a,addr)
- __i2c_sb: counter→B via decb (-1-2B); drop redundant mov $b,$a at isbn
- __tone: replace `dec; mov $a,$d` with decd (-1B)
- Tone port claim: _claim_port_bits('DDRA', 0x02, 'tone') so delay_cal's
  DDRA clear preserves PA1 output

Corpus: 20 programs shrank 1-15B (total ~43B); printf-using programs now
compile (previously failed with unresolved overlay call). 13/13 overlay
regression pass, DS3231+EEPROM pass, 48/48 compile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test2_temp, overlay_temp_label: 226B → 209B kernel (-17B each).
Inline `while (ones >= 10) { ones -= 10; tens++; } lcd_char(tens+48);
lcd_char(ones+48); lcd_char(0xDF); lcd_char('C');` replaced with
printf("Temp: %d\xDFC", t);

test4_info, overlay_info: kept printf despite +1B size — readability
win outweighs the cost.

lcd_temp, test1_clock, overlay_clock, test3_seconds, overlay_seconds
left as-is with a comment explaining why: printf's ~52B decimal helper
overwhelms the savings when the program only displays one short value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
digit extraction instead of printf (helper cost > savings for
single-value displays).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…elides

After `stsp off`, the microcode saves old A into D BEFORE clobbering A.
That means stack[SP+off] and D both hold the stored value. Tell the
register tracker that regs['d'] = ('sp', off) after stsp (not just the
abstract unknown value). Then when `mov $d,$a` copies D into A, the
tracker sees A = ('sp', off), and the subsequent `ldsp off` is
recognized as redundant and elided.

Common pattern from compound assigns and store-then-read sequences:
  [compute in A]
  stsp N
  mov $d,$a   ; restore A via D
  ldsp N      ; ELIDED — already in A
  [use A]
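
The tracker rule can be sketched as a tiny peephole pass. This is an illustration, not the compiler's implementation: `elide_redundant_ldsp` is hypothetical, offsets are assumed to be decimal literals, and any instruction the sketch does not recognise conservatively resets everything it knows.

```python
def elide_redundant_ldsp(lines):
    """Drop `ldsp N` when A is already known to hold stack[SP+N]."""
    regs = {"a": None, "d": None}
    out = []
    for line in lines:
        parts = line.split(";")[0].split()   # strip comment, tokenize
        op = parts[0] if parts else ""
        if op == "stsp":
            regs["d"] = ("sp", int(parts[1]))  # D keeps the stored value
            regs["a"] = None                   # A is clobbered
        elif op == "mov" and parts[1:] == ["$d,$a"]:
            regs["a"] = regs["d"]              # A now mirrors D
        elif op == "ldsp":
            off = int(parts[1])
            if regs["a"] == ("sp", off):
                continue                       # redundant load: elide it
            regs = {"a": ("sp", off), "d": None}
        elif op:
            regs = {"a": None, "d": None}      # unknown effect: forget all
        out.append(line)
    return out


asm = ["ldi $a,5", "stsp 2", "mov $d,$a   ; restore A via D", "ldsp 2", "out"]
print(elide_redundant_ldsp(asm))
```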

Corpus: -15B net (test_i2c_ack_diag -10B; oled_temp, eeprom_test,
test_i2c_switching each -2B; small others). 13/13 regression pass,
DS3231+EEPROM pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new opcodes that pack 2 or 3 DDRB writes into a single instruction:
  ddrb2_imm A B     : 3B total (opcode + 2 imm) vs 2×ddrb_imm = 4B
  ddrb3_imm A B C   : 4B total (opcode + 3 imm) vs 3×ddrb_imm = 6B

Retires two unused slots:
  0xCB  setjmp → ddrb2_imm    (setjmp needed SETPG hw wire; unwired)
  0xCF  setret → ddrb3_imm    (same — reserved for code banking, blocked)

Both microcodes use the same E0|U1 signal as ddrb_imm, just replayed
with PC++ between values. Registers and flags preserved (no AI/FI).
Microcode fits in 8 steps (ddrb3_imm uses all 8, counter wraps).

Compiler:
- New peephole after the rest: scans emitted asm for contiguous runs
  of `ddrb_imm N`, fuses greedily (3 first, then 2, leftover single).
- Gated behind MK1_NEW_OPCODES=1 — same flag as sllb, since this
  needs microcode reflash on all four SST39SF040 EEPROMs.
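
The greedy fusion can be sketched like this. It is a simplified stand-in for the real peephole: `fuse_ddrb` is a hypothetical name, only lines of the exact form `ddrb_imm N` are treated as fusable, and the feature-flag gating is left out.

```python
def fuse_ddrb(lines):
    """Fuse contiguous `ddrb_imm N` runs: groups of 3 first, then 2, then 1."""
    out, run = [], []

    def flush():
        i = 0
        while len(run) - i >= 3:
            out.append("ddrb3_imm " + " ".join(run[i:i + 3]))
            i += 3
        if len(run) - i == 2:
            out.append("ddrb2_imm " + " ".join(run[i:i + 2]))
        elif len(run) - i == 1:
            out.append("ddrb_imm " + run[i])
        run.clear()

    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[0] == "ddrb_imm":
            run.append(parts[1])             # extend the current run
        else:
            flush()                          # any other line ends the run
            out.append(line)
    flush()
    return out


print(fuse_ddrb(["ddrb_imm 1", "ddrb_imm 2", "ddrb_imm 3", "ddrb_imm 4", "ddrb_imm 5"]))
```

A run of five becomes one 3-immediate and one 2-immediate opcode, 7B instead of 10B using the sizes given above.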

Assembler (ESP32):
- New InstrArgs ARGS_IMM2 / ARGS_IMM3 for 2- and 3-immediate opcodes.
- Instruction emitter handles nImm consecutive operand tokens.

Simulator: verified ddrb2_imm (3B consumed, A preserved) and
ddrb3_imm (4B consumed, A preserved) via /tmp/test_ddrb_fusion.py.

Corpus impact (with MK1_NEW_OPCODES=1):
  page0_used: 8922B → 8084B  (−838B across 45/48 programs)
  biggest winners: overlay_info −89B, test_pcf8574_writeread −71B,
  eeprom_overlay_write −69B, test_lcd_eeprom −66B.
Default mode (MK1_NEW_OPCODES unset): byte-identical to pre-change.
13/13 overlay regression + DS3231 + EEPROM probes PASS on unflashed
hardware.

To enable on hardware: flash microcode.bin to all four EEPROMs
(T48 + minipro, same binary per chip), rebuild ESP32 firmware with
updated isa.h/assembler.h, then compile with MK1_NEW_OPCODES=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The DDRB fusion peephole emits multi-immediate opcodes whose byte size
isn't captured by the old `two_byte` set (which assumed 1B or 2B).
Without size awareness, the overlay partitioner sees stale pre-fusion
sizes and the manifest ends up misaligned from the actual bytes —
overlay loads copy the wrong bytes and the program hangs.

Fix: every size-computation path now handles ddrb2_imm as 3B and
ddrb3_imm as 4B. Sites touched:
- `measure_lines` in _overlay_partition
- inline size loops (phase 6 thunk + per-section measurement)
- `_why_breakdown` walker for the --why-not-smaller report
- Final page0/page3/data byte counter
- `instr_byte_size` / `instr_size` in mk1ir.py
Also removed stale `setjmp` references from the two_byte sets (its
slot is now ddrb2_imm, a different size class).
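
The shape of the fix, roughly: one size function that every measuring site calls, with the multi-immediate opcodes handled before the 1B/2B split. The function names mirror the ones listed above, but the bodies and the `TWO_BYTE` contents are an illustrative guess (a small subset), not the compiler's actual tables; directives such as `org` or `byte` are not modelled.

```python
# Illustrative subset only; the real set in the compiler is larger.
TWO_BYTE = {"ldi", "ddrb_imm", "push_imm", "j", "jal", "jnz", "jnc"}
# Multi-immediate opcodes: opcode byte plus 2 or 3 immediates.
MULTI_BYTE = {"ddrb2_imm": 3, "ddrb3_imm": 4}

def instr_size(line):
    """Byte size of one asm line; labels, blanks and comments are 0."""
    code = line.split(";", 1)[0].strip()
    if not code or code.endswith(":"):
        return 0
    op = code.split()[0]
    if op in MULTI_BYTE:              # must be checked before TWO_BYTE
        return MULTI_BYTE[op]
    return 2 if op in TWO_BYTE else 1

def measure_lines(lines):
    """Total bytes for a block of asm lines."""
    return sum(instr_size(line) for line in lines)
```

Routing every site through a single function like this is one way to keep the partitioner, the report walker and the final byte counter from drifting apart again.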

Results with MK1_NEW_OPCODES=1 on flashed microcode:
- DS3231 probe: 0x14 (20°C) ✓
- EEPROM idx roundtrip: full sequence ✓
- 12/13 overlay regression PASS (up from 4/13 before this fix).
  The one remaining failure — "grouped overlay" — passes standalone
  (val=130, 7904 cycles) but hangs in the regression harness; appears
  to be harness state-dependent, not a compilation issue.
- 48/48 corpus compiles
- Default mode (flag unset): byte-identical to pre-change, 13/13
  regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ss 1

Four stage-1 layout bugs found during lcd_temp.c debugging:
- Post-mini-copy fall-through into pre-mc helper bodies
- T2 thunk extractor splitting bus-recovery loops at labels
- Init code overflow into mini-copy destination (budget didn't count
  j __selfcopy)
- Overlay wrap heuristic was unsound; now detects and fails loud rather
  than silently corrupting _overlay_load

Plus __delay_cal moved to init-only (fixes twinkle, twinkle_v2) and
Phase 7 extended with balanced size-driven split + N-live-var xfer
globals.

Phase 0 of overlay redesign: py_asm now byte-matches ESP32 assembler on
12/12 opcode snippets and 42/60 corpus programs. Fixed missing 2-byte
flags on jnz/jnc/jcf/jzf/je0/je1/jal and cmp N translation; rewrote
data_code/stack_code/page3_code section byte emission. Residual 18
corpus divergences are in overlay manifest/body layout and will be
absorbed by the Phase A loader rewrite.

MK1_CPU/OVERLAY_REDESIGN.md is now the canonical design for the next-gen
overlay system. Read before making compiler changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hand-assembled proof-of-concept demonstrates the core mechanism from the
overlay redesign doc:
- Kernel thunk __hello (4B) dispatches to _load_helper by setting $c
- _load_helper copies the helper body from page3 into R_helper (code[230])
  using the manifest, then tail-jumps (j, not jal) into R_helper
- Helper executes out_imm 0xB0, then ret — which pops the jal __hello
  return from main's stack and lands back in main past the jal
- Main continues and emits A2 before halting

OI trace on hardware: A0, A1, B0, A2 (4 events, in order, 197 cycles
from start of call to return). Architecture is sound.

Gotcha recorded in decision log: `org N` in `section page3_code` emits
HLT pad bytes up to N into the page3 buffer (not just virtual-PC bump),
so helper bodies without internal labels should use raw `section page3`
with `byte` directives.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the entry point for the helper-overlay transform designed in
MK1_CPU/OVERLAY_REDESIGN.md:

- In _overlay_partition, after kernel layout is finalised, check
  MK1_HELPER_OVERLAY=1. If set, call _apply_helper_overlay_transform.
- The transform method itself is a stub that raises NotImplementedError
  with a pointer to the design doc. The method docstring lists the 8
  concrete transform steps for the next session to execute against.

Default path (flag unset) is untouched; regression 13/13 still passes.
Flag path explicitly fails so partial implementations can't ship.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements _apply_helper_overlay_transform for a single target helper
(__lcd_chr) on flat-mode programs. When MK1_HELPER_OVERLAY=1:

1. Scans self.code for __lcd_chr's body (label to terminating ret).
2. Replaces body with a 4B kernel thunk (ldi $c, 0; j _load_helper).
3. Emits _load_helper routine (~28B) in section code.
4. Pre-assembles the helper body to bytes using py_asm, substituting
   external jal targets (like __i2c_sb) with their resolved addresses.
   This bypasses the section-mismatch validator which would otherwise
   flag the body-in-page3 + jal-to-code as broken (correct for overlay
   mode, wrong for flat).
5. Emits __helper_manifest + body bytes as raw byte directives in
   section page3, so the assembler treats them as data not code.

Reserves code[180..249] (70B) for R_helper. __lcd_chr compiles to 64B
— larger than the design doc's 28B estimate, because it has inlined
peephole __xsthunk content plus the digit-handling logic. A future
pass should shrink it (reinstate the extracted thunk as a leaf helper).

Regression 13/13 still passes. Hardware verification pending on the
lcd_char('A') smoke test — needs visual check of the LCD.

Known limitations / follow-ups:
- Only flat-mode programs: overlay-mode would need R_helper carved
  from the user overlay region, which requires integrating with the
  placement engine rather than running after it.
- Hardcoded R_HELPER_BASE=180. Should be computed from kernel size.
- Hardcoded HELPER_OVERLAY_NAMES = {'__lcd_chr'}. Widens with each
  helper the leaf rule allows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The prior MVP hardcoded R_HELPER_BASE=180, which collided with the
_load_helper routine's own tail instructions (copy overwrote the
loader's `j R_helper` mid-run). Measure kernel end after body→thunk
replacement, then place R_HELPER_BASE past the loader end so no
collision is possible.

Probe assembly uses a stub `_load_helper:` label appended to self.code
so the thunk's `j _load_helper` resolves during the sizing pass.

Hardware verified via OI markers:
  phase_a_test.c: out 0xA0; i2c_init; out 0xA1; lcd_char('A');
                  out 0xA2; lcd_char('B'); out 0xA3; halt
  Trace output:
    [0] t=11     val=0xA0  (main entered)
    [1] t=2503   val=0xA1  (after i2c_init)
    [2] t=6772   val=0xA2  (first lcd_char — helper loaded, ran, returned)
    [3] t=11041  val=0xA3  (second lcd_char — helper re-loaded, returned)
  All four markers in order. The helper-overlay load/execute/return
  cycle works on real hardware.

A fuller test program with lcd_init() doesn't fit in this MVP because
flat mode doesn't init-extract __lcd_init (needs ~70B). The init-
extraction path is overlay-mode only today; a follow-up will enable
it for flat mode when MK1_HELPER_OVERLAY=1.

Regression 13/13 still passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bodies with internal labels (like __lcd_init's `.li_lp13`, `.li_av16`,
etc.) need to assemble at their runtime address so label references
resolve correctly. Changed the body-byte synthesis from `org 0` to
`org R_HELPER_BASE` — now internal jumps point at the right place in
R_helper when the body runs there.

With both __lcd_chr and __lcd_init as helper overlays, the fuller test
(i2c_init + lcd_init + lcd_char×2) fits in 164B of kernel + 70B
R_helper. Both helpers share the same R_helper slot, dynamically
overwriting each other on successive loads.

Hardware verified via OI trace — all 5 markers fire in order:
  A0 t=11      (main entered)
  A1 t=2503    (after i2c_init)
  A2 t=56498   (after lcd_init — loaded, ran ~54k cycles including
               LCD timing delays, returned)
  A3 t=60767   (after lcd_char('A') — R_helper slot reused for
               __lcd_chr, loaded, ran, returned)
  A4 t=65036   (after lcd_char('B') — re-loaded again)

Kernel shrank from 208B (both resident) to 164B (both overlays) —
net -44B, matching design prediction.

Regression 13/13 still passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug: body bytes extracted from self.code were pre-peephole, but the
final compile pipeline peepholes after this transform runs. Two
symptoms:

  1. Instruction form diverged. Resident __lcd_chr used push_imm (2B)
     after peephole collapse of `ldi $a, N; push $a`. Overlay kept the
     3B form, which CLOBBERS caller's $a with N — so lcd_char('A')
     transmitted 0x09 (the flags byte literal) instead of 'A'.
  2. All jal target addresses drifted by 1 per push_imm collapse
     elsewhere in the file. The 5 jal __i2c_sb calls inside __lcd_chr
     pointed 1 byte past the helper, hitting the middle of the
     previous instruction.

Fix: run peephole on self.code BEFORE the target scan, so body lines
are already in final form and external label addresses are final.
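
The ordering issue can be reproduced in a few lines. The sketch below is hypothetical: `collapse_push_imm` stands in for the one peephole rule named above (the real pass has many more rules and may check extra conditions), and `extract_body` takes a body as "label to terminating ret".

```python
import re

_LDI_A = re.compile(r"^\s*ldi\s+\$a\s*,\s*(\S+)\s*$")
_PUSH_A = re.compile(r"^\s*push\s+\$a\s*$")

def collapse_push_imm(lines):
    """Rewrite `ldi $a, N` followed by `push $a` as `push_imm N`."""
    out, i = [], 0
    while i < len(lines):
        m = _LDI_A.match(lines[i])
        if m and i + 1 < len(lines) and _PUSH_A.match(lines[i + 1]):
            out.append(f"push_imm {m.group(1)}")
            i += 2
        else:
            out.append(lines[i])
            i += 1
    return out

def extract_body(lines, label):
    """Return the lines after `label:` up to and including the first ret."""
    start = lines.index(label + ":") + 1
    end = next(i for i in range(start, len(lines))
               if lines[i].strip() == "ret")
    return lines[start:end + 1]

code = ["__lcd_chr:", "ldi $a, 9", "push $a", "jal __i2c_sb", "ret"]
stale = extract_body(code, "__lcd_chr")                     # old order
final = extract_body(collapse_push_imm(code), "__lcd_chr")  # fixed order
```

Here `stale` still carries the 3B `ldi`/`push` pair while `final` has the 2B `push_imm`, which is the divergence between the overlay and resident bodies described above.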

Verified by /tmp/verify_helper_bytes.py: resident vs overlay body
bytes are now byte-identical for all 63B of __lcd_chr's body.
Hardware: phase_a_test.c (lcd_init, lcd_char('A'), lcd_char('B'))
shows "AB" on the LCD. 5/5 OI markers still fire in order.

Regression 13/13 passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Created in response to the LCD-blank-but-OI-trace-passed bug that
shipped earlier today. The goal: any future helper-overlay regression
gets caught by automated tests BEFORE hitting hardware visual-check.

Three stacked checks:

  Layer 1 — byte-semantic equivalence (offline, fast).
    Compile with and without MK1_HELPER_OVERLAY=1. Extract each target
    helper's bytes. Normalise internal jmp targets to offsets from the
    helper's base so resident and overlay builds compare identically
    (internal jumps legitimately target different absolute addresses
    in each build but should land at the same OFFSET within the body).
    Fails if any byte differs after normalisation. This catches the
    class of bug where pre/post-peephole body divergence corrupts
    instruction encoding (e.g. `ldi $a,N; push $a` surviving where
    `push_imm N` should have replaced it — which clobbered caller's $a
    with the flags literal, sending 0x09 instead of the character).

  Layer 2 — OI control-flow trace (hardware).
    Upload the compiled program and confirm every expected `out()`
    marker fires in order. Handles HLT_GRACE re-entry by accepting
    the marker sequence as a contiguous subsequence anywhere in the
    captured stream.

  Layer 3 — I/O side-effect equivalence (hardware).
    The strongest check. Program reads the RTC via I2C (routes through
    every helper in the overlay set) and `out()`s the result. We run
    BOTH the resident and overlay builds and compare the full captured
    OI stream byte-for-byte. Different bytes between the two builds
    means a subtle corruption in the I2C path that neither byte-equiv
    nor control-flow traces would detect. Works because the RTC temp
    value is stable over the seconds it takes to run both tests.
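
Two of the checks reduce to small pure functions. This is a sketch under assumptions: instructions are represented as hypothetical `(mnemonic, operand, size)` tuples rather than raw bytes, and the `JUMPS` set is illustrative. Layer 3 is then a plain equality comparison of the two captured streams.

```python
JUMPS = {"j", "jal", "jnz", "jnc", "jcf", "jzf", "je0", "je1"}

def normalise_jumps(instrs, base):
    """Layer 1: rewrite jump targets that land inside the body as
    offsets from `base`, so builds at different addresses compare equal.
    `instrs` is a list of (mnemonic, operand_or_None, size_in_bytes)."""
    size = sum(sz for _, _, sz in instrs)
    out = []
    for op, arg, sz in instrs:
        if op in JUMPS and arg is not None and base <= arg < base + size:
            arg = ("rel", arg - base)   # internal target -> offset
        out.append((op, arg, sz))
    return out

def contains_contiguous(stream, expected):
    """Layer 2: True if `expected` occurs as one contiguous run
    anywhere in `stream` (tolerates extra markers before or after)."""
    stream, expected = list(stream), list(expected)
    n = len(expected)
    return any(stream[i:i + n] == expected
               for i in range(len(stream) - n + 1))
```

External `jal` targets fall outside the body range, so they are left absolute and must match exactly in both builds.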

Current suite: 5/5 passing. Run with `--skip-hw` to do Layer 1 only
when serial isn't attached.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Widening HELPER_OVERLAY_NAMES to __print_u8_dec exposed the leaf-rule
violation the design doc warned about: __print_u8_dec calls __lcd_chr,
so loading __lcd_chr at runtime clobbers __print_u8_dec's code mid-
execution. User caught it on hardware: printf("%d", 42) produced
"lots of 4 digits" on the LCD instead of "42".

Added two safeguards:

1. Leaf-rule scan before body extraction. Each candidate's body is
   scanned for `jal` into the overlay set; callers get demoted to
   resident. Iterates until the set is leaf-consistent.

2. "No room" fallback. If kernel_end + loader + largest_helper would
   exceed 250B, the transform reverts its own mutations (restores
   _saved_pre_transform) and the compile proceeds with all helpers
   resident. This way flag=1 never produces worse code than flag=0
   for programs that don't benefit.
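
Both safeguards are simple to state in code. A minimal sketch, assuming helper bodies are available as lists of asm lines keyed by name; `leaf_consistent` and `has_room` are hypothetical names, and only direct `jal` calls are considered.

```python
import re

_JAL = re.compile(r"^\s*jal\s+(\S+)")

def _calls_into(body, names):
    """True if any line of `body` is a `jal` to a label in `names`."""
    for line in body:
        m = _JAL.match(line)
        if m and m.group(1) in names:
            return True
    return False

def leaf_consistent(bodies):
    """Return the helpers that may stay in the overlay set.
    Any helper that calls into the set is demoted to resident."""
    overlay = set(bodies)
    while True:
        snapshot = frozenset(overlay)
        callers = {n for n in snapshot if _calls_into(bodies[n], snapshot)}
        if not callers:
            return overlay            # fixed point reached
        overlay -= callers            # demote and re-check

def has_room(kernel_end, loader_size, largest_helper, limit=250):
    """False means the transform should revert and keep helpers resident."""
    return kernel_end + loader_size + largest_helper <= limit
```

For the case described above, `__print_u8_dec` (which calls `__lcd_chr`) is demoted and `__lcd_chr` remains the only overlay candidate.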

Also made R_HELPER_SIZE dynamic: sized to the largest actually-found
helper, not a worst-case reservation. A program using only __lcd_chr
doesn't pay for __lcd_init's 48B just-in-case.

Test suite grew from 5 to 8 cases. Layer 3 now includes a printf-
via-overlay test that would have caught today's bug (OI streams must
match between resident and overlay builds for identical inputs).

Design doc updated with the decision log entry and the remaining
overlay-mode integration work — Phase A is a flat-mode proof of
mechanism; real wins are in overlay-mode programs (the 11 that fail
to compile today).

Regression 13/13 still passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Runs `_split_helpers_for_overlay` inside `_overlay_partition` after T2.1
and before kernel sizing so the reduced `runtime_resident_helpers` feeds
KERNEL_SIZE / OVERLAY_REGION correctly. Helper bodies in
`HELPER_OVERLAY_NAMES` that the natural classification keeps resident
are replaced with 4B thunks; `_load_helper` is appended; the wrap-safety
and final fit checks both subtract R_HELPER_SIZE so user overlays can't
cross into the reserved zone at code[250-R_HELPER_SIZE..249]. Post-
partition `_emit_helper_overlay_bodies` pre-assembles each body with
external labels resolved from a full-program pass, then writes raw byte
directives in page3 — manifest entries (`byte __h_body_N; byte size`)
resolve on the assembler's second pass.

Cost-benefit guard: reject splits with `len < 2` or
`kernel_save <= overlay_cost`. Single-helper extraction on lcd_temp_overlay
measured -32B net fit (28B loader + 41B reserve vs 9B kernel save). The
guard keeps such programs on the baseline path.
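
As a sketch, the guard is a two-line predicate. It assumes `len` refers to the number of candidate helpers in the split and that `overlay_cost` is loader plus R_helper reserve, as in the measurement above; the function name is hypothetical.

```python
def accept_split(candidates, kernel_save, overlay_cost):
    """Accept the split only if at least two helpers are extracted and
    the kernel bytes saved strictly exceed the overlay overhead."""
    if len(candidates) < 2:
        return False
    return kernel_save > overlay_cost

# The lcd_temp_overlay measurement: 9B saved vs 28B loader + 41B reserve.
lcd_temp_overlay_ok = accept_split(["__lcd_chr"], 9, 28 + 41)
```

The comparison is strict, so a break-even split is rejected and the program stays on the baseline path.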

The aggressive first version (force candidates resident in
`_classify_helpers`) triggered propagation — forcing `__lcd_chr` resident
dragged `__lcd_send` resident too, growing the kernel by 84B on
overlay_clock while only extracting 5B of body. Reverted: the split now
only fires on candidates that the knapsack would have kept resident
anyway.

13/13 overlay regression green; 8/8 Phase A hardware regression green
(run prior to the `_hov_info`-aware early-return in
`_apply_helper_overlay_transform`; flat-mode code path unchanged).

Follow-up documented in OVERLAY_REDESIGN.md §8: extracting bundled
helpers (the real win for the ~8 failing corpus programs) requires a
wider split that scans overlay bodies too, not just `runtime_resident_helpers`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends Phase A infrastructure toward the full §4.3 candidate set.

- HELPER_OVERLAY_NAMES now includes __lcd_cmd, __lcd_send, __print_u8_hex,
  __i2c_rb, __i2c_rs alongside __lcd_chr and __print_u8_dec.
- __lcd_send is auto-included when __lcd_chr or __lcd_cmd is used — it's
  emitted implicitly as the shared I2C-send tail but never added to
  `_lcd_helpers`, so the split wouldn't see it otherwise.
- `_split_helpers_for_overlay` injects `j <next>` when a body has no
  ret/j/jal terminator (e.g. __lcd_cmd falls through to __lcd_send in
  the kernel emission). Without this, an extracted __lcd_cmd body would
  run off the end into HLT padding in R_helper.
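
The fall-through fix amounts to checking the last real instruction of an extracted body. A minimal sketch with a hypothetical function name, treating `ret`, `j` and `jal` as terminators as described above:

```python
TERMINATORS = {"ret", "j", "jal"}

def ensure_terminated(body, next_label):
    """Append `j <next_label>` if the body's last instruction is not a
    terminator. Blank lines, comments and labels are skipped."""
    for line in reversed(body):
        code = line.split(";", 1)[0].strip()
        if not code or code.endswith(":"):
            continue                  # not an instruction
        if code.split()[0] in TERMINATORS:
            return list(body)         # already terminated
        break                         # last instruction falls through
    return list(body) + [f"j {next_label}"]
```

For a body like `__lcd_cmd` that falls through, this appends `j __lcd_send` so execution leaves R_helper explicitly instead of running into padding.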

Honest Phase B status: these extensions don't fire on the corpus today
because the cost-benefit guard (len<2 or kernel_save <= overlay_cost)
still rejects. Guard-loosening attempts regressed programs that today
rely on the "safe wrapping (loaded last)" fit. Corpus sweep: 0 diffs
between flag on/off. 13/13 overlay regression PASS. Layer 1 (offline
byte-equivalence) of the Phase A hw test PASS; Layers 2-3 hit serial
disconnects on both attempts (environmental: USB was flaky during this
session, not a code regression).

What Phase B still needs (OVERLAY_REDESIGN.md §8): unified loader
(§4.2) to save the ~28B of the current dual-loader setup; per-helper
paging decision in the knapsack instead of a global flag; Phase C
tail-chain for overlays too big for R_user. Findings documented in
the decision log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hase C)

`_split_helpers_for_overlay` now scans `overlay_blocks` in addition to
`runtime_resident_helpers`. When a candidate helper has bundled copies
across N≥2 overlay slots, extracts a single canonical body to R_helper,
adds a kernel thunk, removes the bundled bodies, and retargets
`jal __X_ovN` → `jal __X` callers via regex on the suffix-renamed labels.
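
The caller retargeting can be sketched with a single regular expression. This is an illustration, not the compiler's code: `retarget_calls` is a hypothetical name, and it assumes one `jal` per line with nothing after the label.

```python
import re

# `jal __name_ovN` with optional indentation; group 2 is the base name.
_JAL_OV = re.compile(r"^(\s*jal\s+)(__\w+?)_ov\d+\s*$")

def retarget_calls(lines, extracted):
    """Rewrite `jal __X_ovN` to `jal __X` for every X in `extracted`;
    calls to helpers that were not extracted are left untouched."""
    out = []
    for line in lines:
        m = _JAL_OV.match(line)
        if m and m.group(2) in extracted:
            line = m.group(1) + m.group(2)
        out.append(line)
    return out
```

Restricting the rewrite to names in `extracted` matters: other helpers keep their per-overlay copies, so their suffixed labels must stay as they are.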

The cost-benefit guard now sees both kernel_save (resident extractions)
AND bundled_save (per-overlay body bytes freed). Accepts when total >
overlay_cost. For overlay_dashboard this fires: 3 helpers extracted
(__lcd_chr, __lcd_cmd, __lcd_send), 116B saved in bundling redundancy,
overlay slots shrink ~42B each.

**Honest result**: 0 corpus programs newly compile.

- overlay_dashboard: split fires, helper-paging works, but `_main_p1`
  (102B orchestration overlay, no helper calls in it) is the bottleneck
  and won't shrink without Phase C tail-chain.
- overlay_temp_label / overlay_info / test2_temp / test4_info /
  test_lcd_eeprom / lcd_temp_overlay_cmd: helpers are RESIDENT (not
  bundled — wrap-retry promoted them), so bundled_save=0. Resident
  kernel_save (~30B) doesn't beat R_HELPER reservation (~39B); guard
  rejects.

What this still needs to actually fix the failing corpus:
- Phase C (tail-chain) for orchestration-bottlenecked programs
- Unified loader (§4.2) to save ~28B per program — measured: would
  push overlay_temp_label's kernel_save from -9B to +19B, fitting the
  smaller overlay but still not the 75B one
- Per-helper subset selection (extract __lcd_chr only, leave
  __lcd_send resident) for programs where the chain dependency forces
  oversized R_HELPER reservation

Regression: 13/13 overlay PASS. Corpus diff sweep: 0 changes flag on/off.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`__lcd_print` (printf format-string runner) is another runtime helper
that ends up resident in some programs. Including it in the candidate
set lets the split see it. Leaf rule correctly demotes __lcd_print
(it calls __lcd_chr in a loop), so the split keeps __lcd_print
resident; net effect on corpus is 0 diffs.

This pads out the candidate set to match the actual runtime helpers
the existing knapsack tracks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single-use bundled helpers (`__lcd_chr` only used in one overlay) get
emitted with their canonical name — no `_ov{N}` suffix because there's
no naming conflict to dedupe. My existing extraction regex
(`^(__\w+?)_ov\d+$`) missed them.

Now: scan for both `__X_ov{N}:` and bare `__X:` (where X in candidates)
inside overlay bodies. Single-use bundled helpers ARE detected; the
existing cost-benefit guard correctly rejects single-bundled extraction
(net-zero byte change vs leaving them inline).

Corpus sweep: 0 diffs flag on/off.

Honest finding on printf compression idea:
The failing corpus has 1 printf per program, each with unique format.
A "per-spec helper" approach would cost 12B per helper to save 10B per
call site — net loss for single-use formats. Doesn't fix the failing
corpus. The actual byte sink is __lcd_cmd (a 65B chain with __lcd_send),
which can't be paged because the chain logic depends on adjacency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`sllb` opcode collapse (3B → 1B for `mov $b,$a; sll; mov $a,$b`) was
gated behind MK1_NEW_OPCODES=1 during the pre-flash phase. Microcode
flashed 2026-04-08; gate removed. Corpus: 37/48 (unchanged), overlay
regression 13/13 PASS.

`ddrb2_imm`/`ddrb3_imm` fusion REMOVED. The microcode packs 2-3
consecutive VIA DDRB writes into one multi-byte opcode but only has
1 settling cycle between writes (the PO|MI byte-fetch) — insufficient
for the VIA RS0 line to settle, reproducing the same bus-timing bug
that `ddrb_imm` was originally designed to avoid. Verified on
hardware: with fusion on, the overlay regression's grouped-compute
test hangs at 100k cycles. With fusion off, 13/13 PASS.

To re-enable ddrb fusion: edit ucode_template[0xCB]/[0xCF] microcode
to add settling cycles between the E0|U1 VIA writes (2-3 NOP-like
steps each), then reflash the 4 SST39SF040s. The opcodes remain in
the microcode (harmless if unused); compiler just doesn't emit them.

Also landed as fallout from this session:
- Adaptive pre-mc helper placement (§5 stage-1 layout): if placing
  pre-mc helpers forces the expensive pre-pad path (init lands past
  mc_end), pop smallest helpers back to post-mc until the cheaper
  post-pad path fits. Saves `_init_code_size` bytes when triggered.
- `INIT_PREFIXES` whitelist extended to include ddrb2/3_imm (dead
  code today, kept in place in case fusion returns).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
HeathenUK pushed a commit to HeathenUK/8bit-cpu that referenced this pull request Apr 25, 2026
…g corruption, stale

LCD-driver-rename helpers, page-3 layout report

Three intertwined bugs surfaced by `overlay_dashboard.c` and the
expanded `rgb_lcd_smoke` test (now using printf):

1. Compiler peephole at mk1cc2.py:1565..1578 silently collapsed
   `i2c_stop(); i2c_start(); i2c_send_byte(odd_addr);` into a
   "repeated START", claiming alignment with the compiler's RTC/EEPROM
   builtins. The compiler's `i2c_start` builtin emits
   `exrw 2; ddrb_imm 0x01; ddrb_imm 0x03` which expects the bus in
   idle state (both HIGH). Coming out of `__i2c_sb` the bus is
   `SDA released, SCL LOW` and the simultaneous SCL-rise + SDA-fall
   in `ddrb_imm 0x01` is not a clean START condition — slaves miss
   it. Surfaced by overlay_dashboard's manual `rtc_reg()` reading
   garbage `0xFE` bytes consistently. Removed; the proven
   readDS3231Temp ESP32 program does NOT skip the STOP either.

2. `printf` / `lcd_print` strings allocated via `page3_alloc` and
   emitted in `section page3` with no `org`. The first string ended
   up at page3[0..N], silently overwriting the kernel image (which
   lives at page3[0..K-1] until self-copy). After self-copy, page-0
   bytes 0..N held the string content instead of the kernel's first
   instructions and the program crashed/hung. The compiler also
   baked the integer offset directly into `ldi $a, N` — magic
   numbers all the way down. Fix: move strings to page 1 (no kernel-
   image collision); each string gets a `__str_N` symbolic label
   resolved by the assembler, so no numeric offset appears in the
   asm anywhere. `__lcd_print` now uses `deref` (page 1) instead of
   `derefp3` (page 3).

3. The 2026-04-24 LCD-driver-rewrite renamed the underlying helper
   from `__lcd_send` to `__lcd_send_raw`, but four sites in
   compile-time bookkeeping kept the old name:
     a. dead-function-elimination "keep-alive on __lcd_cmd" guard
        (mk1cc2.py:2470 area) — the eliminator silently stripped
        `__lcd_send_raw` from the final asm. Resident `__lcd_cmd`
        then fell through into `_main`, which reset SP and recursed.
     b. duplicate-detection SKIP set (mk1cc2.py:2592 area).
     c. `runtime_i2c_markers` set (mk1cc2.py:2828 area).
     d. `_NO_OVERLAY` add for shared-by-chr-and-cmd helper (line 2843).
   All four updated to `__lcd_send_raw`. The bug had been latent
   since 2026-04-24 because every passing hw_regression test
   happened to also reference `__lcd_chr` directly; the bug
   surfaced when rgb_lcd_smoke was expanded with `lcd_cmd(0x80);
   printf("MK1");` — neither lcd_chr nor lcd_char fires here.
   Stale-name comments at lines 5212 and 9698 also corrected for
   future readers.

Compiler memory report updated: Page 2 line now shows "stack_bytes
used (manifest + kstate); stack reserved at 0xC0..0xFF" instead of
the misleading "stack (grows down from 0xFF)". Page 3 line corrected
to "kernel image transient; shared helpers + page-3 overlays
persist" — Phase 5 moved manifest+pages off page 3 but the report
hadn't caught up. Adds page-2 byte tracking to the assembled-byte
counter so the report has real data.

`rgb lcd smoke test` regression test expanded to also exercise the
printf path — `lcd_cmd(0x80); printf("MK1");` between `lcd_rgb` and
`out(42)`. Visual confirmation: "MK1" at top-left with magenta
backlight; regression gate is the out(42) byte capture.

Verification:
- sim_regression: 5/5
- hw_regression: 15/15 (was 14/15 with rgb_lcd_smoke as the lone
  pre-existing failure; now ALL pass including the expanded printf
  exercise)
- mk1_py_asm string + label resolution verified on /tmp/probe.c:
  __str_0 = 51, data[51..54] = M K 1 0; __lcd_send_raw correctly
  emitted post-fix; resident __lcd_cmd → __lcd_send_raw fall-through
  intact.

Issue list: vascofazza#3 (manual-I2C peephole) and vascofazza#4 (rgb_lcd_smoke /
printf-string corruption / stale LCD names) both closed in this
commit. Worklog updated with all three root causes and the
"check-other-stale-rename-sites" methodology.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>