feat(bench): TinyLlama-1.1B GGUF int4 reference workload + harness by marcos-mendez · Pull Request #9 · popsolutions/InnerJib7EA

marcos-mendez · 2026-05-06T04:19:50Z

Closes #3 (partially — establishes the harness; full task list in the
issue spans multiple follow-up PRs as enumerated in the README's Out of
scope section).

Summary

Lands bench/tinyllama-1.1b/ with the model fetcher, timing harness,
expected-baseline thresholds, and a 14-test pytest schema suite.
Adds .github/workflows/bench.yml as a workflow_dispatch-only CI
entry point — model download (~668 MB) is too heavy to run on every PR.
The bench is harness-only in this PR — it can run end-to-end
given a model file + llama-cpp-python, but neither dependency is
installed in this dev environment, so only the dry-run path was
exercised locally. Full end-to-end runs happen on manual GitHub
Actions dispatch.

Why now

Issue #3 calls TinyLlama-1.1B int4 "the 'hello world' of the project"
and the workload that "drives most architectural decisions for Gen A".
We can't measure cycles/token on InnerJib7EA until we know the x86 floor
to compare against — that's what this PR establishes. Subsequent phases
(RVV softcore, Xpop_matmul, Spanker on InnerJib7EA PCB) plug in via
--backend <name> while the schema and threshold stay constant, so
regressions surface immediately.

Files

| Path | Purpose |
|---|---|
| `bench/tinyllama-1.1b/README.md` | Model source, int4 quant procedure, baseline ranges, migration path |
| `bench/tinyllama-1.1b/fetch_model.sh` | Idempotent HF download with SHA256 verification |
| `bench/tinyllama-1.1b/bench.py` | Three backends: `llama-cpp-python`, `llama-cli`, `dry-run` |
| `bench/tinyllama-1.1b/expected_baseline.json` | 5 tok/s floor on a 4-vCPU runner |
| `bench/tinyllama-1.1b/test_bench.py` | 14 pytest assertions on output schema |
| `.github/workflows/bench.yml` | `workflow_dispatch` job: cache model, run schema tests, run bench, check threshold, upload artefact |

Sample structured output

bench.py --dry-run (executed locally, captured verbatim):

{
  "backend": "dry-run",
  "host": {
    "cores": 16,
    "cpu": "11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz",
    "ram_gb": 31.05
  },
  "llama_cpp_version": "dry-run",
  "model": "tinyllama-1.1b-chat-v1.0.Q4_0.gguf",
  "prompt": "Hello, ",
  "schema_version": "1",
  "timestamp_utc": "2026-05-06T04:18:30Z",
  "tokens_generated": 64,
  "tokens_per_second": 10.0,
  "wall_clock_seconds": 6.4
}

bench.py --dry-run --check expected_baseline.json produces the same
record on stdout plus PASS: measured 10.00 tok/s >= threshold 5.00 tok/s
on stderr; exits 0.

Schema (locked at version 1)

Required top-level fields: schema_version, model, backend, prompt,
tokens_generated, wall_clock_seconds, tokens_per_second, host,
llama_cpp_version, timestamp_utc. Required host sub-fields:
cpu, cores, ram_gb. timestamp_utc is ISO-8601 UTC with
trailing Z. All locked by test_bench.py; bumping the version
requires a deliberate bump in bench.py SCHEMA_VERSION constant.
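As an illustration of what the locked schema implies, here is a minimal validator sketch. It is not the code in test_bench.py or bench.py: the function name `validate_record`, the constant names, and the timestamp regex (which assumes second precision, as in the sample record) are all hypothetical. Only the field names are taken from the list above.

```python
import re

# Field names are taken from the schema description; everything else in
# this sketch (names, regex, message wording) is an assumption.
REQUIRED_TOP = (
    "schema_version", "model", "backend", "prompt", "tokens_generated",
    "wall_clock_seconds", "tokens_per_second", "host",
    "llama_cpp_version", "timestamp_utc",
)
REQUIRED_HOST = ("cpu", "cores", "ram_gb")
# Assumes second-precision ISO-8601 UTC with a trailing Z.
TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$")

# The dry-run record shown earlier in the PR description.
SAMPLE = {
    "backend": "dry-run",
    "host": {"cores": 16, "cpu": "11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz", "ram_gb": 31.05},
    "llama_cpp_version": "dry-run",
    "model": "tinyllama-1.1b-chat-v1.0.Q4_0.gguf",
    "prompt": "Hello, ",
    "schema_version": "1",
    "timestamp_utc": "2026-05-06T04:18:30Z",
    "tokens_generated": 64,
    "tokens_per_second": 10.0,
    "wall_clock_seconds": 6.4,
}


def validate_record(record: dict) -> list:
    """Return a list of schema problems; an empty list means the record is valid."""
    problems = [f"missing field: {key}" for key in REQUIRED_TOP if key not in record]

    host = record.get("host")
    if isinstance(host, dict):
        problems += [f"missing host field: {key}" for key in REQUIRED_HOST if key not in host]
    elif "host" in record:
        problems.append("host must be an object")

    ts = record.get("timestamp_utc")
    if ts is not None and not (isinstance(ts, str) and TIMESTAMP_RE.match(ts)):
        problems.append("timestamp_utc must be ISO-8601 UTC with trailing Z")
    return problems
```

Returning a list of problems rather than raising on the first one lets a single run report every schema violation at once, and it makes negative cases (a record with a field removed) straightforward to assert against.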

Scoping notes (per Agent R's brief)

Spanker source untouched. PR #6 in popsolutions/Spanker
(in-flight collective ops) stays uncontested. When this bench
eventually depends on spanker-runtime, it will do so via a
Cargo.toml git path dep — not from this PR.
No cocotb/Verilator interaction. That's stream-1 territory.
No actual hardware. Spanker isn't taped out; bench uses x86 CPU
llama.cpp only. The architecture is staged for hardware backends —
see README "Migration path".

Test plan

python3 -m pytest test_bench.py -v locally — 14/14 pass
python3 bench.py --dry-run — emits valid JSON
python3 bench.py --dry-run --check expected_baseline.json — exit 0, PASS message
gh workflow run bench.yml (manual dispatch by Agent R after merge) — exercises model fetch, real llama.cpp decode, threshold check
First successful CI run: capture the canonical SHA256 from the logs and pin it in fetch_model.sh (currently a PLACEHOLDER_UPDATE_AFTER_FIRST_FETCH sentinel; the workflow temporarily skips verification via EXPECTED_SHA256="" env)

Out of scope (explicit, deferred)

Token-by-token equivalence vs reference (issue #3, task 4)
Cycle / SRAM / DDR bandwidth telemetry (issue #3, task 5): RTL stream-1
docs/benchmarks/tinyllama-int4.md write-up (issue #3, task 6): needs real measured-vs-expected from POPC_16A simulation
Q4_K quantization variant: Q4_0 is the simpler kernel and ships first
Spanker integration: deferred until PR #6 lands and the runtime crate stabilises

Authored by Agent 3 (Software Stack — Spanker).

First baseline measurement skeleton for the InnerJib7EA "hello world" workload. Lands the timing harness, schema test, and CI dispatch entry point for TinyLlama-1.1B-Chat-v1.0 in Q4_0 quantization.

Scope: stock llama.cpp on x86 CPU only. This establishes the floor we need later phases (RVV softcore, Xpop_matmul, Spanker on InnerJib7EA PCB) to beat. Backend swaps cleanly via --backend; schema and threshold stay fixed across phases, so regressions surface immediately.

Files:
- bench/tinyllama-1.1b/README.md: model source, quant, migration path
- bench/tinyllama-1.1b/fetch_model.sh: idempotent HF download with SHA256
- bench/tinyllama-1.1b/bench.py: llama-cpp-python / llama-cli / dry-run
- bench/tinyllama-1.1b/expected_baseline.json: 5 tok/s floor (CI sanity)
- bench/tinyllama-1.1b/test_bench.py: 14 schema-only pytest assertions
- .github/workflows/bench.yml: workflow_dispatch only (heavy fetch)

Tests: 14/14 passing locally (python3 -m pytest test_bench.py -v). Schema test runs without the model; the actual decode runs on manual GitHub Actions dispatch (cached model artefact, ~90 day retention).

Out of scope (deferred to future PRs): token-by-token equivalence vs reference, cycle/SRAM/DDR telemetry (RTL stream-1 work), Q4_K variant.

Spanker source untouched per instruction (in-flight PR #6 in Spanker must not collide).

Closes #3 (partially: establishes harness; full task list spans multiple follow-up PRs as enumerated in README "Out of scope").

Authored by Agent 3 (Software Stack - Spanker).
Signed-off-by: Marcos <m@pop.coop>

marcos-mendez · 2026-05-06T04:25:29Z

Review by Agent R

Verdict: REQUEST-CHANGES

Severity counts: CRITICAL=2 HIGH=3 MEDIUM=3 LOW=2

Pre-review gates (`e7622016`)

CI: 1/1 pass (Verilator + cocotb tests — bench.yml is workflow_dispatch, did not run on PR)
Mergeable: yes
Local pytest: 14/14 schema tests pass (verified by Agent R: ran pytest test_bench.py -v on the rebased branch — all green)
All 6 new files have SPDX header

Findings

| # | Severity | File:line | Finding | Suggestion |
|---|---|---|---|---|
| 1 | CRITICAL | `.github/workflows/bench.yml:70` | Workflow-script injection via `${{ inputs.n_tokens }}` interpolated by Actions template engine before the shell sees it. `workflow_dispatch` is collaborator-gated but the blast radius (compromised maintainer / 3rd-party reuse) is non-zero. | Assign to env first: `env: { N_TOKENS: "${{ inputs.n_tokens \|\| '64' }}" }`, reference `$N_TOKENS` in the shell, validate with `[[ "$N_TOKENS" =~ ^[0-9]+$ ]] \|\| { echo 'bad n_tokens'; exit 1; }`. argparse `type=int` only helps after the shell parses the line. |
| 2 | CRITICAL | `bench/tinyllama-1.1b/test_bench.py` (absent) | `check_threshold` / `--check` exit-code path has zero test coverage. Primary safety gate. A regression that makes it always return True passes silently in every future CI run. | Add 2 tests using `tmp_path`: (a) baseline `min_tokens_per_sec` below dry-run synthetic rate → exit 0; (b) baseline above synthetic rate → non-zero exit. |
| 3 | HIGH | `bench/tinyllama-1.1b/bench.py:140-153` | No `timeout` on `subprocess.run` for `llama-cli`. Stalled/deadlocked run hangs indefinitely. The job-level `timeout-minutes: 30` is the only backstop and absent locally. | Add `timeout=n_tokens * 10` (or fixed `600`). Catch `subprocess.TimeoutExpired`, re-raise as `RuntimeError`. |
| 4 | HIGH | `bench/tinyllama-1.1b/fetch_model.sh` | No `trap` for partial-download cleanup. If curl/wget exits non-zero mid-download under `set -euo pipefail`, `${TARGET_PATH}.partial` accumulates. | Add `trap 'rm -f "${TARGET_PATH}.partial"' EXIT` immediately before the download block. |
| 5 | HIGH | `bench/tinyllama-1.1b/bench.py:205` | `check_threshold` silently ignores `max_wall_clock_seconds` and `tokens_generated_expected` declared in `expected_baseline.json`. False contract: fields look like thresholds but never enforced. | Decision needed (Agent R picks: ENFORCE). Add checks for both fields, fail if violated, add corresponding tests. (Alternative: remove them from JSON if doc-only; Agent R judges enforcing is conservative since they're already declared.) |
| 6 | MEDIUM | `bench/tinyllama-1.1b/fetch_model.sh:85-90` | No connect/max-time timeout on curl/wget. Hung TCP blocks until 30-min job timeout. | `--max-time 1200 --connect-timeout 30` for curl; `--timeout=30 --read-timeout=600` for wget. |
| 7 | MEDIUM | `bench/tinyllama-1.1b/test_bench.py:28-32` | `_run_bench` helper has no subprocess timeout. | Add `timeout=30` to `subprocess.run`. |
| 8 | MEDIUM | Schema contract (overlaps #5) | False contract from unused baseline fields. | Subsumed by #5 if you go the ENFORCE route. |
| 9 | LOW | `bench/tinyllama-1.1b/bench.py:104-108` | Warm-up comment is misleading: model load happens in `Llama()`, first `llm()` call warms KV cache. | Inline comment clarification. |
| 10 | LOW | `bench/tinyllama-1.1b/expected_baseline.json:2` | SPDX in `_comment` value, not first-line; OK if no SPDX scanner is wired up. | No action unless SPDX CI is added. |
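To make Finding 3 concrete, the suggested pattern (bound the child process, then convert `subprocess.TimeoutExpired` into a `RuntimeError`) could be sketched as follows. This is an illustration only, not the code in bench.py: `run_with_timeout` is a hypothetical helper name, and the message wording is an assumption. A caller following the suggestion would pass something like `n_tokens * 10` as the budget.

```python
import subprocess
import time


def run_with_timeout(cmd: list, timeout_s: float) -> str:
    """Run cmd and return its stdout; raise RuntimeError if it exceeds timeout_s."""
    start = time.monotonic()
    try:
        # A list argument with no shell=True keeps the call free of shell parsing.
        proc = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run has already killed and reaped the child at this point.
        elapsed = time.monotonic() - start
        raise RuntimeError(
            f"{cmd!r} timed out after {elapsed:.1f}s (limit {timeout_s}s)"
        ) from exc
    return proc.stdout
```

Re-raising as `RuntimeError` keeps the caller's error handling independent of the subprocess module, while `from exc` preserves the original exception for debugging.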

Test quality assessment

The 14 tests are mostly meaningful (not always-passing smoke tests). test_timestamp_is_iso8601_utc_z would catch a real Z-stripping regression. test_tokens_per_second_matches_division catches rounding/copy-paste errors. test_field_types catches JSON type regressions. Two real gaps: (1) check_threshold has zero coverage (Finding 2); (2) no negative tests — feeding a malformed record (missing required field, tokens_per_second = -1) and asserting rejection. Without negative tests the validator could be gutted and the suite would still pass.

Security assessment

bench.py subprocess use: Clean — list arg, no shell=True, argparse-typed inputs, no env-fed user data.
fetch_model.sh: set -euo pipefail + double-quoted vars + correct BASH_SOURCE[0] SCRIPT_DIR. No injection surface. Concerns are operational (Finding 4 + Finding 6), not exploitable.
SHA256 placeholder: Design is deliberate, but bootstrap window is real — between first CI run (no verification) and the pinning commit, any pusher to TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF can substitute a binary that lands in the 90-day Actions cache. Mitigation: in bootstrap run, compare computed hash against the SHA256 published in the HuggingFace model card before accepting the file.
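The bootstrap mitigation amounts to refusing any model file whose digest does not match a hash obtained from a trusted source. fetch_model.sh is a shell script, so the following Python version is only an illustrative equivalent; `sha256_of` and `verify_model` are hypothetical names, and rejecting an empty expected hash is an assumption about how a strict mode might behave.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so a ~668 MB model is never held in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path: Path, expected_hex: str) -> None:
    """Raise ValueError unless the file's SHA256 equals expected_hex."""
    if not expected_hex:
        # Assumption: an empty hash means "unverified", which this sketch rejects.
        raise ValueError("refusing to accept a model without an expected SHA256")
    actual = sha256_of(path)
    if not hmac.compare_digest(actual, expected_hex.strip().lower()):
        raise ValueError(f"SHA256 mismatch for {path.name}: got {actual}")
```

In the bootstrap run, `expected_hex` would be the value copied by hand from the model card rather than one computed from the downloaded file itself.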
bench.yml: No secrets, no elevated GITHUB_TOKEN, but missing explicit permissions: contents: read block. Major-version action tags are mutable (acceptable at this stage, worth pinning to SHA later).

Action

Labeling changes-requested, removing review-pending.
Original Agent 3 session being resumed via SendMessage to address all findings (especially CRITICAL #1 + #2 and HIGH #3-#5).
Issue #8 in InnerJib7EA NOT created: the followups are bundled into this PR's revision rather than split, since they're all in the same files.

Authored by Agent R (Reviewer).

… tests, timeouts

CRITICAL fixes:
- Fix 1 (bench.yml:70): route ${{ inputs.n_tokens }} through env + regex validate before shell use (workflow-script injection mitigation).
- Fix 2 (test_bench.py): add 8 new tests covering check_threshold and --check exit-code path with pytest tmp_path baselines (passing + failing cases for every enforced field, plus aggregate + empty cases).

HIGH fixes:
- Fix 3 (bench.py:140-153): add subprocess.run timeout = max(60, n_tokens*10) on llama-cli backend; catch TimeoutExpired and re-raise as RuntimeError with command + elapsed time.
- Fix 4 (fetch_model.sh): add trap to remove TARGET_PATH.partial on EXIT before the download block to prevent partial-file accumulation.
- Fix 5 (bench.py:check_threshold): ENFORCE max_wall_clock_seconds and tokens_generated_expected in addition to min_tokens_per_sec. Per Agent R design decision: fields stay in expected_baseline.json AND are now real contracts (was: silently ignored). Failures aggregate so a CI run surfaces every regression in one shot.

MEDIUM fixes:
- Fix 6 (fetch_model.sh): add curl connect-timeout 30, max-time 1200; add wget timeout=30, read-timeout=600.
- Fix 7 (test_bench.py:_run_bench): add timeout=30 to subprocess.run helper.
- Fix 8: subsumed by Fix 5.

LOW fixes:
- Fix 9 (bench.py:104-108): clarify warm-up comment - model load is in Llama() constructor; first llm() call warms KV cache.
- Fix 10 (bench.yml): add workflow-level permissions block (contents: read).

Other hardening (Agent R add):
- Pin actions/checkout, actions/setup-python, actions/cache, actions/upload-artifact to commit SHAs (was: mutable v4/v5 tags).

Verification:
- pytest bench/tinyllama-1.1b/test_bench.py -v -> 22 passed in 3.02s (14 original + 8 new for check_threshold path).
- bash -n bench/tinyllama-1.1b/fetch_model.sh -> exit 0.
- python YAML safe_load on bench.yml -> OK.
- Dry-run end-to-end emits valid JSON record.

Out of scope (tracked as TODO comments / follow-ups):
- Negative tests for the schema validator (malformed records).
- Pinning fetch_model.sh SHA256 (waiting on first CI dispatch to capture).

Reviewed-by: Agent R (Reviewer)
Authored by Agent 3 (Software Stack - Spanker).
Signed-off-by: Marcos <m@pop.coop>

marcos-mendez · 2026-05-06T04:36:15Z

Re-review by Agent R (post-fix)

Verdict: APPROVE

Severity counts: CRITICAL=0 HIGH=0 MEDIUM=0 LOW=0

Verification of Agent 3 v2 fix push (`bd82f35b`)

CI: 1/1 SUCCESS (Verilator + cocotb tests; bench.yml itself is workflow_dispatch, did not run on PR — that's by design)
Mergeable: yes
Local pytest: 22/22 pass in 2.93s (verified by Agent R: ran on bd82f35 — 14 original + 8 new for check_threshold)

Fix coverage verification

| # | Severity | Status |
|---|---|---|
| 1 | CRITICAL | ✅ env var pattern + regex validate in `bench.yml` |
| 2 | CRITICAL | ✅ 8 tests for `check_threshold` (exceeds the 2 required: covers all 3 fields × pass/fail + aggregate + empty-baseline edge case) |
| 3 | HIGH | ✅ `subprocess.run` timeout `max(60, n_tokens*10)` + RuntimeError re-raise |
| 4 | HIGH | ✅ `trap 'rm -f' EXIT` for `.partial` cleanup |
| 5 | HIGH | ✅ ENFORCE `max_wall_clock_seconds` + `tokens_generated_expected` (Agent R design call honoured) |
| 6 | MEDIUM | ✅ `--max-time` + `--connect-timeout` on curl, `--timeout` + `--read-timeout` on wget |
| 7 | MEDIUM | ✅ `timeout=30` on `_run_bench` helper |
| 8 | MEDIUM | ✅ subsumed by Fix 5 |
| 9 | LOW | ✅ inline warm-up clarification |
| 10 | LOW | ✅ `permissions: contents: read` block |
| extra | (Agent R add) | ✅ all 4 actions pinned to commit SHAs, each with a major-version comment trailer |
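Fix 5's aggregated enforcement could look roughly like the sketch below. It is not the code in bench.py. The three baseline keys come from the review, but several details are assumptions: the comparison rules (below the minimum fails, above the maximum fails, exact equality for `tokens_generated_expected`), treating absent keys as unchecked so that an empty baseline passes, and the wording of the failure messages.

```python
def check_threshold(record: dict, baseline: dict) -> list:
    """Return every violated threshold; an empty list means PASS.

    Failures are collected rather than returned one at a time, so a single
    run reports all regressions together.
    """
    failures = []

    min_tps = baseline.get("min_tokens_per_sec")
    if min_tps is not None and record["tokens_per_second"] < min_tps:
        failures.append(
            f"tokens_per_second {record['tokens_per_second']:.2f} < min {min_tps:.2f}"
        )

    max_wall = baseline.get("max_wall_clock_seconds")
    if max_wall is not None and record["wall_clock_seconds"] > max_wall:
        failures.append(
            f"wall_clock_seconds {record['wall_clock_seconds']:.2f} > max {max_wall:.2f}"
        )

    expected = baseline.get("tokens_generated_expected")
    if expected is not None and record["tokens_generated"] != expected:
        failures.append(
            f"tokens_generated {record['tokens_generated']} != expected {expected}"
        )

    return failures


# Values taken from the dry-run sample record in the PR description.
dry_run = {"tokens_per_second": 10.0, "wall_clock_seconds": 6.4, "tokens_generated": 64}
```

A caller would map a non-empty result to a non-zero exit code, for example `sys.exit(1 if check_threshold(record, baseline) else 0)`. With the `dry_run` values above, a baseline of `{"min_tokens_per_sec": 5.0}` yields no failures, while a baseline that violates all three fields yields three entries.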

Action

Merging via two-step. Forgejo sync follows. Post-merge, will trigger bench.yml via workflow_dispatch to surface the canonical SHA256 of TinyLlama-1.1B-Q4_0 GGUF and open follow-up issue for hash pinning.

Authored by Agent R (Reviewer).

…b for POPC_16A (#11)

* feat(hw): inter-card connector pinout + KiCad footprint + SV port stub for POPC_16A

Closes #8. Designs the 40-pin, 0.8 mm pitch dual-row board-to-board mezzanine connector that lets two POPC_16A Sails aggregate compute (multi-card parallelism mandate), even though rev-A ships single-card.

Connector choice: Samtec QSE-040-01-L-D-A (or pin-compatible Hirose FX18-40P-0.8SH, JLCPCB basic part LCSC C40503). Signals only: no power between cards. Carries 4 TX diff pairs + 4 RX diff pairs + 1 forwarded clock pair + 4 sideband (RESET_N, PRSNT_N, SMB_CLK/DAT) + 13 GND + 5 reserved = 40 pins total.

Width contract aligns with the MAST #14 contract verified by Spanker PR #6: INTERCARD_LANES=4, INTERCARD_LANE_WIDTH=32, INTERCARD_BUS_WIDTH=128.

Artifacts:
- docs/hw/intercard-connector-pinout.md: full design doc, per-pin table
- docs/adr/0002-intercard-connector.md: decision ADR (closes issue task)
- docs/adr/0001-spec.md: Status amended to note ADR-002
- kicad/intercard-connector/: KiCad 8 symbol + footprint (CERN-OHL-S-2.0 via README)
- src/intercard_link.sv: SV port-surface stub (Apache-2.0)
- verif/intercard_link/: Verilator --lint-only smoke test

Smoke test passes locally with Verilator 5.048:

$ bash verif/intercard_link/run_lint.sh
[run_lint] PASS — intercard_link elaborates with INTERCARD_BUS_WIDTH = 128

KiCad library lives inside InnerJib7EA (not Stays) because Stays's working tree was on a stale feature branch at authoring time. The collision-safe fallback is documented in §5.2 of the design doc, with a worked sym-lib-table snippet for Stays integration in §5.3.

Out of scope (separate PRs):
- PCB layout placement (Stays #10)
- Controlled-impedance stackup (Stays #9)
- Line coding choice (MAST ADR-014)
- FPGA-side transceiver schematic capture
- Hot-plug capability (rev-B)

Authored by Agent 2 (FPGA Hardware).

* fix(docs)
* fix(adr)
* fix(verif)

---------

Co-authored-by: Marcos <m@pop.coop>

marcos-mendez added the stream-3 (Software Stack (Agent 3): driver, runtime, GGML, Spanker) and review-pending (PR awaiting reviewer agent (R)) labels May 6, 2026

marcos-mendez added the changes-requested (PR has CRITICAL/HIGH findings, author must fix) label and removed the review-pending label May 6, 2026

marcos-mendez added the review-pending label and removed the changes-requested label May 6, 2026

marcos-mendez merged commit fb8555b into main May 6, 2026
1 check passed

marcos-mendez deleted the feat/stream-3/pr-XX-tinyllama-bench branch May 6, 2026 04:36

marcos-mendez restored the feat/stream-3/pr-XX-tinyllama-bench branch May 6, 2026 04:36

This was referenced May 6, 2026

fix(bench): pin TinyLlama-1.1B-Q4_0 GGUF SHA256 from first CI run #10

Merged

feat(hw): inter-card connector pinout + KiCad footprint + SV port stub for POPC_16A #11

Merged

marcos-mendez merged 2 commits into main from feat/stream-3/pr-XX-tinyllama-bench
