popsolutions · marcos-mendez · May 6, 2026
diff --git a/.github/workflows/bench.yml b/.github/workflows/bench.yml
@@ -0,0 +1,96 @@
+# SPDX-License-Identifier: CC-BY-SA-4.0
+name: bench
+
+# Manual-dispatch only. The model fetch (~668 MB) + decode (~minutes) is too
+# heavy to run on every PR; we'd swamp the existing Verilator/cocotb job in
+# ci.yml. Trigger this from the Actions tab when you want a fresh baseline
+# measurement (e.g., after a llama-cpp-python bump or a runner OS upgrade).
+on:
+  workflow_dispatch:
+    inputs:
+      n_tokens:
+        description: "Number of tokens to decode for the timing measurement"
+        required: false
+        default: "64"
+
+# Default least-privilege token scope for the entire workflow. The job does
+# not push to the repo, only reads source and uploads an artifact (which uses
+# the run's own implicit scope, not contents:write).
+permissions:
+  contents: read
+
+jobs:
+  tinyllama-baseline:
+    name: TinyLlama-1.1B Q4_0 baseline (x86 CPU)
+    runs-on: ubuntu-24.04
+    timeout-minutes: 30
+
+    steps:
+      # Action versions pinned to commit SHA — `@v4` style tags are mutable
+      # and a compromised tag in any of these repos would land in our runner.
+      - name: Checkout
+        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5  # v4
+
+      - name: Set up Python 3.12
+        uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065  # v5
+        with:
+          python-version: "3.12"
+
+      - name: Cache TinyLlama GGUF model
+        id: model-cache
+        uses: actions/cache@0057852bfaa89a56745cba8c7296529d2fc39830  # v4
+        with:
+          path: bench/tinyllama-1.1b/tinyllama-1.1b-chat-v1.0.Q4_0.gguf
+          key: tinyllama-1.1b-chat-v1.0-Q4_0-gguf-v1
+
+      - name: Fetch TinyLlama model (cache miss only)
+        if: steps.model-cache.outputs.cache-hit != 'true'
+        env:
+          # Bootstrap: skip SHA256 verification on first run so we can capture
+          # the canonical hash from the logs and pin it in fetch_model.sh.
+          # Remove this env once EXPECTED_SHA256 is updated in the script.
+          EXPECTED_SHA256: ""
+        run: |
+          cd bench/tinyllama-1.1b
+          chmod +x ./fetch_model.sh
+          ./fetch_model.sh
+          echo "Computed SHA256 (record this and pin in fetch_model.sh):"
+          sha256sum tinyllama-1.1b-chat-v1.0.Q4_0.gguf
+
+      - name: Install bench dependencies
+        run: |
+          python -m pip install --upgrade pip
+          # llama-cpp-python pinned to a known-good CPU build.
+          # PyPI may only ship an sdist for this release, so pip may compile llama.cpp from source here.
+          pip install "llama-cpp-python==0.2.90" pytest
+
+      - name: Run schema tests (fast, no model)
+        run: |
+          cd bench/tinyllama-1.1b
+          python -m pytest test_bench.py -v
+
+      - name: Run TinyLlama bench and check threshold
+        # `${{ inputs.* }}` interpolation in `run:` blocks is workflow-script
+        # injection if the value isn't a clean integer (Actions templating
+        # happens before the shell parses the line — argparse `type=int`
+        # cannot save us). Route through env + validate before use.
+        env:
+          N_TOKENS: ${{ inputs.n_tokens || '64' }}
+        run: |
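+          # Without pipefail the step reports tee's exit status and hides a failed threshold check.
+          set -o pipefail
+          [[ "$N_TOKENS" =~ ^[1-9][0-9]*$ ]] || { echo "ERROR: n_tokens must be a positive integer, got: $N_TOKENS"; exit 1; }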
+          cd bench/tinyllama-1.1b
+          python bench.py \
+            --model tinyllama-1.1b-chat-v1.0.Q4_0.gguf \
+            --backend llama-cpp-python \
+            --n-tokens "$N_TOKENS" \
+            --check expected_baseline.json \
+            | tee bench-result.json
+
+      - name: Upload bench result
+        if: always()
+        uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02  # v4
+        with:
+          name: tinyllama-bench-result
+          path: bench/tinyllama-1.1b/bench-result.json
+          if-no-files-found: warn
+          retention-days: 90
diff --git a/bench/tinyllama-1.1b/README.md b/bench/tinyllama-1.1b/README.md
@@ -0,0 +1,209 @@
+<!-- SPDX-License-Identifier: CC-BY-SA-4.0 -->
+
+# TinyLlama-1.1B reference workload (GGUF int4)
+
+> First "hello world" reference workload for InnerJib7EA / POPC_16A.
+> Closes [popsolutions/InnerJib7EA#3](https://github.com/popsolutions/InnerJib7EA/issues/3).
+
+This directory holds the **baseline harness**. It runs TinyLlama-1.1B-Chat
+(Q4_0 quantization) under stock `llama.cpp` on x86-64 CPU and emits a
+structured tokens/sec measurement. The number is the floor we need
+InnerJib7EA simulation (and eventually silicon) to beat by a meaningful
+margin to justify the project.
+
+The harness is intentionally **simple and dependency-light**. It does
+not attempt to use Spanker hardware (Spanker isn't taped out yet) or
+RVV / `Xpop_matmul`. Those land later — see *Migration path* below.
+
+---
+
+## What this bench does
+
+1. **Fetch** TinyLlama-1.1B-Chat-v1.0 in GGUF Q4_0 quantization
+   (~668 MB) from HuggingFace, verify SHA256, cache locally.
+2. **Run** 64 tokens of decode against a fixed prompt
+   (`"Hello, "`) using `llama.cpp` (via `llama-cpp-python` or the
+   `llama-cli` binary).
+3. **Emit** a JSON document on stdout with
+   `tokens_generated`, `wall_clock_seconds`, `tokens_per_second`,
+   plus environment metadata (host CPU, llama.cpp build).
+4. **Compare** measured `tokens_per_second` against
+   `expected_baseline.json`. If below threshold the harness exits
+   non-zero so CI fails.
+
+This is the same `Hello, ` prompt as in the issue's task list. The issue
+specifies a 16-token budget; we extend it to 64 tokens here so timing
+isn't dominated by prompt-eval and warm-up.
+
+## Model source
+
+| Field | Value |
+|---|---|
+| Model | TinyLlama 1.1B Chat v1.0 |
+| Quantization | Q4_0 (GGUF) |
+| Source | [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF on HuggingFace](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) |
+| Direct URL | `https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_0.gguf` |
+| File size | ~668 MB |
+| License | Apache 2.0 (TinyLlama) |
+
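+`fetch_model.sh` is idempotent: if the file is already present and its
+SHA256 matches, it skips the download. The expected SHA256 is recorded
+in the script and verified after every fetch. Until that hash is pinned,
+the CI workflow overrides `EXPECTED_SHA256` with an empty string, which
+skips verification.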
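A minimal Python sketch of the idempotent check described above. `fetch_model.sh` itself is a shell script and is not shown here, so this only illustrates the logic: the function names `sha256_of` and `needs_download` are hypothetical, and treating an empty expected hash as "skip verification" is an assumption based on the bootstrap comment in the workflow.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path: Path, expected_sha256: str) -> bool:
    """Return True when the file is missing or its hash does not match.

    Assumption: an empty expected hash means "no pin available yet",
    so any existing file is accepted without verification.
    """
    if not path.is_file():
        return True
    if not expected_sha256:
        return False
    return sha256_of(path) != expected_sha256.lower()
```

Hashing in chunks keeps memory use flat for a ~668 MB file, which is why the sketch avoids reading the whole model into memory at once.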
+
+## int4 quantization procedure
+
+Q4_0 is the simplest GGML int4 format: weights are quantized in blocks
+of 32, each block stores one fp16 scale + 32 packed 4-bit weights
+(symmetric, no zero-point). Dequantization at compute time is
+`weight = scale * (int4_value - 8)`. This is what `Xpop_matmul` will
+need to fuse into the matmul inner loop on POPC_16A.
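To make the block layout concrete, here is a small sketch that dequantizes one Q4_0 block in pure Python. It assumes a little-endian fp16 scale followed by 16 packed bytes, with the nibble ordering used by the reference ggml dequantizer (low nibbles fill elements 0-15, high nibbles fill elements 16-31). The names are illustrative, and this is not the planned `Xpop_matmul` kernel.

```python
import struct

QK4_0 = 32             # weights per block
BLOCK_BYTES = 2 + 16   # one fp16 scale + 32 nibbles packed into 16 bytes


def dequantize_q4_0_block(block: bytes) -> list:
    """Expand one 18-byte Q4_0 block into 32 float weights."""
    if len(block) != BLOCK_BYTES:
        raise ValueError(f"expected {BLOCK_BYTES} bytes, got {len(block)}")
    # "<e" is a little-endian IEEE 754 half-precision float.
    (scale,) = struct.unpack_from("<e", block, 0)
    packed = block[2:]
    out = [0.0] * QK4_0
    for j, byte in enumerate(packed):
        # weight = scale * (int4_value - 8), symmetric with no zero-point.
        out[j] = scale * ((byte & 0x0F) - 8)
        out[j + QK4_0 // 2] = scale * ((byte >> 4) - 8)
    return out
```

At 18 bytes per 32 weights, the format costs 4.5 bits per weight, which is the memory traffic a fused kernel would have to stream.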
+
+We use the pre-quantized model from TheBloke rather than running the
+quantization ourselves. The GGUF file should be equivalent to what stock
+`llama-quantize` emits, and skipping the step saves ~5 minutes of CI time.
+If the upstream artefact ever disappears, regenerate with:
+
+```bash
+./build/bin/llama-quantize tinyllama-1.1b-chat-v1.0.fp16.gguf \
+    tinyllama-1.1b-chat-v1.0.Q4_0.gguf Q4_0
+```
+
+## Expected baseline (sanity threshold)
+
+`expected_baseline.json` records conservative thresholds — set well
+below typical hardware so flaky CI doesn't false-positive. Indicative
+ranges measured on a few common hosts (informational, not gated):
+
+| Host | Tokens/sec |
+|---|---|
+| Modern x86 laptop (Ryzen 7 5800H, 8c) | 25-40 |
+| GitHub Actions ubuntu-24.04 (4 vCPU) | 6-12 |
+| Raspberry Pi 5 (8GB) | 3-6 |
+
+The CI threshold is **5 tokens/sec on a 4-vCPU runner** — TinyLlama Q4_0
+beats that comfortably. Anything below means something has gone wrong
+(bad model file, llama.cpp regression, runner hardware change).
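The threshold gate can be sketched as follows. The layout of `expected_baseline.json` is not shown in this README, so the key `min_tokens_per_second` and the helper name `exit_code_for` are assumptions made for illustration only.

```python
def exit_code_for(result: dict, baseline: dict) -> int:
    """Return 0 when the measured rate meets the baseline, 1 otherwise.

    `min_tokens_per_second` is a hypothetical key; the real
    expected_baseline.json may name or nest it differently.
    """
    measured = float(result["tokens_per_second"])
    minimum = float(baseline["min_tokens_per_second"])
    return 0 if measured >= minimum else 1
```

A caller would pass the returned value to `sys.exit` so that a run below the threshold fails the CI step.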
+
+## Output schema
+
+`bench.py` writes one JSON document to stdout with this shape (schema
+version pinned in `expected_baseline.json` so consumers can detect
+breaks):
+
+```json
+{
+  "schema_version": "1",
+  "model": "tinyllama-1.1b-chat-v1.0.Q4_0.gguf",
+  "backend": "llama-cpp-python",
+  "prompt": "Hello, ",
+  "tokens_generated": 64,
+  "wall_clock_seconds": 7.12,
+  "tokens_per_second": 8.99,
+  "host": {"cpu": "AMD Ryzen 7 5800H", "cores": 8, "ram_gb": 16.0},
+  "llama_cpp_version": "0.2.90",
+  "timestamp_utc": "2026-05-06T12:34:56Z"
+}
+```
+
+`timestamp_utc` is ISO-8601 with trailing `Z`. `tokens_per_second` is
+computed from `tokens_generated / wall_clock_seconds` (decode-only —
+the prompt-eval pass is timed separately and discarded).
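The two derived fields could be produced along these lines. This is a sketch rather than the code in `bench.py`: the helper names are hypothetical, and rounding to two decimals is an assumption inferred from the example document above.

```python
from datetime import datetime, timezone
from typing import Optional


def tokens_per_second(tokens_generated: int, wall_clock_seconds: float) -> float:
    """Decode throughput, rounded to two decimals (assumed precision)."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return round(tokens_generated / wall_clock_seconds, 2)


def utc_timestamp(now: Optional[datetime] = None) -> str:
    """ISO-8601 UTC timestamp with a trailing 'Z' and second precision."""
    moment = now if now is not None else datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

With the example values, `tokens_per_second(64, 7.12)` gives `8.99`, matching the sample document.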
+
+## Running locally
+
+```bash
+cd bench/tinyllama-1.1b
+
+# 1. Fetch model (one-time, idempotent)
+./fetch_model.sh
+
+# 2. Install bench deps (CPU-only llama.cpp Python binding)
+pip install llama-cpp-python==0.2.90
+
+# 3. Run bench
+python3 bench.py --model tinyllama-1.1b-chat-v1.0.Q4_0.gguf
+
+# 4. Run bench AND check threshold
+python3 bench.py \
+    --model tinyllama-1.1b-chat-v1.0.Q4_0.gguf \
+    --check expected_baseline.json
+```
+
+To use the `llama-cli` binary instead of the Python binding (skips
+`llama-cpp-python` install if you already have llama.cpp built):
+
+```bash
+python3 bench.py \
+    --model tinyllama-1.1b-chat-v1.0.Q4_0.gguf \
+    --backend llama-cli \
+    --llama-cli-path /path/to/llama.cpp/build/bin/llama-cli
+```
+
+## CI integration
+
+`.github/workflows/bench.yml` runs the bench on **`workflow_dispatch`
+only** (manual trigger). We do not run it on every PR because the model
+download + first-token latency add ~5 minutes per run, which would
+swamp the existing Verilator/cocotb job (~3 minutes).
+
+To trigger from the GitHub UI: Actions → "bench" → Run workflow.
+
+## Schema test (unit test)
+
+`test_bench.py` is a **schema-only** unit test. It does not download or
+run TinyLlama (that's CI's job on manual dispatch). It runs `bench.py`
+with `--dry-run`, captures the JSON output, and asserts the document
+shape matches what downstream tooling (regression tracker,
+`Xpop_matmul` performance comparison plots) will consume.
+
+This satisfies the "always write tests for code changes" rule
+(`feedback_testing.md`) — the realistic test for a benchmark harness is
+its output schema, not its absolute speed (which is environment-dependent).
+
+```bash
+cd bench/tinyllama-1.1b
+python3 -m pytest test_bench.py -v
+```
+
+## Migration path to InnerJib7EA / Spanker
+
+The bench is structured so the *measurement skeleton* (prompt setup,
+token-count timing, JSON schema, threshold check) stays constant while
+the *backend* swaps. Planned phases:
+
+| Phase | Backend | When |
+|---|---|---|
+| **0 (this PR)** | stock `llama.cpp` on x86-64 CPU | now — establishes baseline |
+| 1 | `llama.cpp` on RISC-V softcore in Verilator (RVA23 + RVV 1.0) | when MAST adds RVV-enabled core |
+| 2 | `llama.cpp` with `Xpop_matmul` matmul kernel intrinsic | when MAST `Xpop_matmul` lands |
+| 3 | Spanker runtime driving InnerJib7EA over PCIe (single card) | when InnerJib7EA bitstream + PCIe driver land in MVP PCB |
+| 4 | Spanker runtime, multi-card tensor-parallel | post-MVP, per `project_multicard_parallelism.md` |
+
+Each phase plugs in a new `--backend <name>` to `bench.py`. The same
+JSON schema and the same threshold structure get reused — that's the
+whole point of fixing them now, before there's anything to measure
+against.
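One way to keep the measurement skeleton constant while backends swap is a small registry keyed by the `--backend` name. The sketch below is hypothetical and not taken from `bench.py`: `register_backend`, `run_backend`, and the `dry-run` entry are illustrative names, and each backend is assumed to return the decode wall-clock time in seconds.

```python
from typing import Callable, Dict

# A backend takes (prompt, n_tokens) and returns decode seconds.
BackendFn = Callable[[str, int], float]
BACKENDS: Dict[str, BackendFn] = {}


def register_backend(name: str) -> Callable[[BackendFn], BackendFn]:
    """Decorator that records a backend under its --backend name."""
    def decorator(fn: BackendFn) -> BackendFn:
        BACKENDS[name] = fn
        return fn
    return decorator


@register_backend("dry-run")
def _dry_run(prompt: str, n_tokens: int) -> float:
    # Placeholder timing so the skeleton can run without a model.
    return 1.0


def run_backend(name: str, prompt: str, n_tokens: int) -> dict:
    """Run one backend and wrap its timing in the shared result shape."""
    if name not in BACKENDS:
        raise ValueError(f"unknown backend: {name}")
    seconds = BACKENDS[name](prompt, n_tokens)
    return {
        "backend": name,
        "prompt": prompt,
        "tokens_generated": n_tokens,
        "wall_clock_seconds": seconds,
        "tokens_per_second": n_tokens / seconds,
    }
```

Under this design a later phase would only add another decorated function, leaving the result shape and threshold check unchanged.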
+
+## What this bench is NOT
+
+- Not a correctness test. We don't verify that generated tokens match a
+  golden reference. (Issue #3 task: "Compare token-by-token against
+  reference llama.cpp running on CPU" — that lands in a follow-up; see
+  *Out of scope*.) The current harness is timing only.
+- Not a Spanker integration test. Spanker hardware doesn't exist;
+  importing `spanker-runtime` is deferred until the crate stabilizes
+  and PR #6 (in-flight collective ops) lands.
+- Not a power/energy measurement. PCIe-side power instrumentation
+  arrives with the FPGA dev-board phase.
+
+## Out of scope (future PRs)
+
+- Token-by-token equivalence vs reference (issue #3 task 4)
+- Cycle / SRAM / DDR bandwidth telemetry (issue #3 task 5) — needs
+  Verilator integration and is RTL-side work (Agent 1 stream-1)
+- `docs/benchmarks/tinyllama-int4.md` write-up (issue #3 task 6) —
+  separate PR once we have real measured-vs-expected numbers from a
+  live POPC_16A simulation
+- Quantization to Q4_K (issue #3 task 1 mentions Q4_K first); Q4_0 is
+  simpler and the right starting point for the kernel. Q4_K bench
+  variant gets added once `Xpop_matmul` supports it.