feat(model-eval-ingest): [WIP] cloud-side worker for bench → kiloBench ingest by lambertjosh · Pull Request #3028 · Kilo-Org/cloud

lambertjosh · 2026-05-04T16:54:57Z

⚠️ DRAFT / WIP — DO NOT MERGE YET

This PR is opened to preserve in-progress cloud-side work. The bench-side (kilo-bench) promotion flow is still being built and needs to be finished and end-to-end tested against this worker before merge. Design lives at .plans/dashboard-v2.md in the kilo-bench repo.

Do not rush review — the bench-side work may still shake out changes to the wire contract.

Summary

Cloud-side plumbing for a new promotion workflow: human-reviewed eval results from bench.s1lv.com (internal, oauth2-proxy gated) get ingested into the public modelStats.benchmarks.kiloBench so kilo.ai/models/[slug] can show real-world benchmark numbers (success rate, avg cost per task, avg token usage) alongside the existing Artificial Analysis scores.

The design explicitly keeps bench and cloud separate: normal users never access bench, and internal identifiers (bench URLs, ingest IDs, promoter emails) stay out of the public JSONB.
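
A minimal sketch of that separation, assuming a hypothetical internal row shape: `IngestRow`, its internal field names (`ingestId`, `benchUrl`, `promoterEmail`), and `toPublicKiloBench` are illustrative and not taken from this PR. Only the public-safe field names come from the schema description further down, and representing `lastPromotedAt` as an ISO string is also an assumption.

```typescript
// Hypothetical internal row. The three "internal" fields are illustrative
// names, not the real model_eval_ingest columns.
interface IngestRow {
  ingestId: string; // internal
  benchUrl: string; // internal
  promoterEmail: string; // internal
  successRate: number;
  avgCostUsd: number;
  avgInputTokens?: number;
  avgOutputTokens?: number;
  avgCacheReadTokens?: number;
  avgExecutionMs?: number;
  nTrials: number;
  promotedAt: Date;
}

// Public-safe fields as listed for KiloBenchEvalSchema in this PR.
// Assumption: lastPromotedAt is serialized as an ISO-8601 string.
interface PublicKiloBenchEval {
  successRate: number;
  avgCostUsd: number;
  avgInputTokens?: number;
  avgOutputTokens?: number;
  avgCacheReadTokens?: number;
  avgExecutionMs?: number;
  nTrials: number;
  lastPromotedAt: string;
}

const OPTIONAL_KEYS = [
  "avgInputTokens",
  "avgOutputTokens",
  "avgCacheReadTokens",
  "avgExecutionMs",
] as const;

// Build the public object from an explicit allowlist so that a new internal
// column can never leak into the public JSONB by accident.
function toPublicKiloBench(row: IngestRow): PublicKiloBenchEval {
  const out: PublicKiloBenchEval = {
    successRate: row.successRate,
    avgCostUsd: row.avgCostUsd,
    nTrials: row.nTrials,
    lastPromotedAt: row.promotedAt.toISOString(),
  };
  for (const key of OPTIONAL_KEYS) {
    const value = row[key];
    if (value !== undefined) out[key] = value;
  }
  return out;
}
```

Copying from an allowlist, rather than deleting known-internal keys from a spread, means the default for any future field is "not public".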

What's in this PR

Schema (packages/db/src/schema.ts)
- New model_eval_ingest table — immutable append-only record of every promotion act. Partitioned by (provider, model, variant, task_source); latest wins at query time via DISTINCT ON … ORDER BY promoted_at DESC (no is_active flag, no supersede links).
- New KiloBenchEvalSchema Zod type and kiloBench extension on ModelStatsBenchmarksSchema. Only public-safe fields — successRate, avgCostUsd, avgInputTokens?, avgOutputTokens?, avgCacheReadTokens?, avgExecutionMs?, nTrials, lastPromotedAt. No bench URL, no ingest id, no promoter email.
Migration (0109_watery_freak.sql) — generated by pnpm drizzle generate against current main; contains only the new table + indexes.
Worker (services/model-eval-ingest/)
- Scaffolded from the session-ingest / webhook-agent-ingest template (same -ingest naming convention). Single Hono app, single Hyperdrive binding, one Secrets Store secret.
- src/middleware/hmac-auth.ts — HMAC-SHA-256 signature verification (X-Ingest-Signature + X-Ingest-Timestamp, 5-min skew window), portable constant-time compare (works under both Workers runtime and Node for unit tests).
- src/routes/api.ts
  - POST /api/model-eval-ingest/submit — validate + insert + recompute the modelStats.benchmarks.kiloBench JSONB. Rollup uses DISTINCT ON (task_source) … ORDER BY promoted_at DESC; JSONB || merge preserves artificialAnalysis and other benchmark sources.
  - GET /api/model-eval-ingest/latest/:model — bench-side delta preview.
  - GET /api/model-eval-ingest/:id — bench-side status / admin audit.
Tests (33 passing) — 19 HMAC middleware (hex parsing, determinism, missing/invalid headers, skew in both directions, wrong secret, missing cloud secret, happy path) + 14 submission schema (valid shapes, invalid URL/email, n_trials=0, negative cost, absurd success_rate, float tokens, zero-cost-is-fine).
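
For reviewers unfamiliar with the middleware, the portable pieces (hex parsing, constant-time compare, skew check) might look roughly like the sketch below. This is not the code in src/middleware/hmac-auth.ts: the function names are hypothetical, and it assumes X-Ingest-Timestamp carries Unix epoch seconds, which this description does not specify. Signature computation itself is left out because the signed payload format is not described here.

```typescript
const MAX_SKEW_MS = 5 * 60 * 1000; // the 5-minute window from the PR description

// Parse a hex string into bytes; returns null for empty, odd-length,
// or non-hex input instead of throwing.
function parseHex(hex: string): Uint8Array | null {
  if (hex.length === 0 || hex.length % 2 !== 0 || !/^[0-9a-fA-F]+$/.test(hex)) {
    return null;
  }
  const out = new Uint8Array(hex.length / 2);
  for (let i = 0; i < out.length; i++) {
    out[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return out;
}

// Compare two byte arrays without an early exit on the first mismatch.
// Uses only typed arrays and bitwise ops, so it runs on Workers and Node.
// The length check does exit early, which is acceptable here because an
// HMAC-SHA-256 digest always has the same length (32 bytes).
function constantTimeEqual(a: Uint8Array, b: Uint8Array): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    diff |= a[i]! ^ b[i]!;
  }
  return diff === 0;
}

// Reject timestamps too far in the past or the future.
// Assumption: the header value is Unix epoch seconds as a decimal string.
function withinSkew(timestampHeader: string, nowMs: number): boolean {
  if (!/^\d+$/.test(timestampHeader)) return false;
  const sentMs = Number(timestampHeader) * 1000;
  return Math.abs(nowMs - sentMs) <= MAX_SKEW_MS;
}
```

Taking `nowMs` as a parameter rather than calling `Date.now()` inside keeps the skew check deterministic in unit tests.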

Verification

Ran locally against origin/main (0c8c06b40):

pnpm --filter cloudflare-model-eval-ingest test      → 33 / 33 pass
pnpm --filter cloudflare-model-eval-ingest typecheck → clean
pnpm --filter @kilocode/db typecheck                 → clean
pnpm --filter web typecheck                          → 0 errors
pnpm --filter web test -- schema.test.ts             → 6 / 6 pass (schema drift)
pnpm format                                          → applied

Pre-push hook (lint + typecheck) passes without --no-verify.

Visual Changes

N/A — no UI in this PR. The public KiloBenchSection component for kilocode-landing model pages is a separate follow-up.

Reviewer Notes

Still needed before this can ship to production:

- Bench-side promotion endpoint + SPA dialog (kilo-bench repo, Track 2 Phase D). Nothing calls this worker yet.
- Secrets: MODEL_EVAL_INGEST_SECRET_PROD and _DEV need to be created in the shared Secrets Store (342a86d9e3a94da698e82d0c6e2a36f0) before first deploy. Rotate by bumping both sides simultaneously.
- Admin UI: /admin/model-eval-ingest list + detail pages (Phase 2). Can land in a separate PR after this one.
- Public KiloBenchSection on kilocode-landing/src/app/models/[...slug]/page.tsx (Phase 3). Same: separate PR.

Risk areas:

- The JSONB || merge in recomputeKiloBench relies on the kiloBench key being a single top-level field. If someone later adds a kiloBench.evals cross-reference that needs deeper merging, that's the place to revisit.
- variant is nullable. The DISTINCT ON SQL uses a conditional variant IS NULL vs variant = $variant branch. Covered by the plan's latest-wins-per-tuple semantics but worth a second set of eyes.
- The HMAC middleware reads the raw body via c.req.text() exactly once and caches it on c.var.rawBody. Downstream handlers MUST use c.var.rawBody + JSON.parse rather than c.req.json(), because they won't be able to re-read the stream. Current handler does this correctly; any future handler on the same middleware needs to as well.
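
To make the first two risk areas easier to reason about, here is an in-memory analogue in TypeScript. It is only a sketch: `PromotionRow`, `latestPerTaskSource`, and `mergeBenchmarks` are hypothetical names, the row shape is simplified, and the real logic lives in SQL inside recomputeKiloBench.

```typescript
// Simplified, hypothetical row. Only the fields needed for the illustration.
interface PromotionRow {
  variant: string | null;
  taskSource: string;
  promotedAt: number; // epoch ms; representation is an assumption
  successRate: number;
}

// Analogue of DISTINCT ON (task_source) ... ORDER BY promoted_at DESC for one
// (provider, model, variant) tuple: keep the newest row per task source.
function latestPerTaskSource(
  rows: PromotionRow[],
  variant: string | null,
): PromotionRow[] {
  const latest = new Map<string, PromotionRow>();
  for (const row of rows) {
    // In TypeScript, null === null is true, so one comparison covers both
    // cases. SQL needs the separate `variant IS NULL` branch because
    // `variant = NULL` never evaluates to true.
    if (row.variant !== variant) continue;
    const seen = latest.get(row.taskSource);
    if (seen === undefined || row.promotedAt > seen.promotedAt) {
      latest.set(row.taskSource, row);
    }
  }
  return Array.from(latest.values());
}

// Analogue of the JSONB `||` merge: top-level keys from the right-hand side
// replace those on the left, and nothing is merged recursively.
function mergeBenchmarks(
  existing: Record<string, unknown>,
  kiloBench: unknown,
): Record<string, unknown> {
  return { ...existing, kiloBench };
}
```

Note what the shallow merge does to a previous `kiloBench` value: it is replaced wholesale, so any nested key that the new value omits disappears. That is the behaviour the first risk item depends on, and the reason a future nested `kiloBench.evals` would need a different merge.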

Rollout order:

1. Land schema migration in a separate non-draft PR (safe; additive only).
2. Create Secrets Store secrets.
3. Deploy the worker (dev first, then prod). It will just sit there until the bench side calls it.
4. Ship the bench-side promotion flow.
5. Verify end-to-end with a real clean eval run.
6. Add admin UI + public landing-page component.

Design doc: .plans/dashboard-v2.md in the kilo-bench repo.

…h ingest

DRAFT / WIP: do not merge until the bench-side promotion flow is finished and end-to-end tested against this worker. This PR is opened to preserve the cloud-side work in progress; the kilo-bench side is still being built (see kilo-bench/.plans/dashboard-v2.md Track 2).

Adds:
- `model_eval_ingest` table: immutable append-only record of eval results promoted from bench.s1lv.com. Primary partitioning key is (provider, model, variant, task_source); a later row supersedes by `promoted_at` at query time.
- `KiloBenchEvalSchema` + extended `ModelStatsBenchmarksSchema.kiloBench`: denormalised read cache for public model pages. Only public-safe fields (no bench URL, no ingest id, no promoter email).
- `services/model-eval-ingest/` Cloudflare Worker:
  - HMAC-SHA-256 signature verification (`X-Ingest-Signature` + `X-Ingest-Timestamp`, 5-min skew), portable constant-time compare.
  - `POST /api/model-eval-ingest/submit`: insert + recompute the `modelStats.benchmarks.kiloBench` JSONB via a DISTINCT-ON-by-timestamp rollup (JSONB merge preserves other benchmark sources).
  - `GET /api/model-eval-ingest/:id` and `/latest/:model` for bench-side status + delta preview.
- 33 vitest unit tests (19 HMAC middleware + 14 submission schema).

Operational notes for reviewers:
- `MODEL_EVAL_INGEST_SECRET_PROD` and `_DEV` secrets need to be created in the shared Secrets Store (342a86d9e3a94da698e82d0c6e2a36f0) before deploy. Rotate by bumping both sides simultaneously.
- No admin UI yet (`/admin/model-eval-ingest` pages); tracked as Phase 2 of the plan.
- No public `KiloBenchSection` component yet in kilocode-landing; tracked as Phase 3 of the plan.
- No bench-side promotion endpoint yet; tracked as Phase D of the kilo-bench rebuild (Track 2).

Verification run locally against origin/main (0c8c06b):

pnpm --filter cloudflare-model-eval-ingest test      → 33 / 33 pass
pnpm --filter cloudflare-model-eval-ingest typecheck → clean
pnpm --filter @kilocode/db typecheck                 → clean
pnpm --filter web typecheck                          → 0 errors
pnpm --filter web test -- schema.test.ts             → 6 / 6 pass

Design: `.plans/dashboard-v2.md` in the kilo-bench repo.

lambertjosh wants to merge 1 commit into main from feat/model-eval-ingest-worker
