feat(model-eval-ingest): [WIP] cloud-side worker for bench → kiloBench ingest#3028

Draft
lambertjosh wants to merge 1 commit into main from feat/model-eval-ingest-worker

@lambertjosh (Contributor)

⚠️ DRAFT / WIP — DO NOT MERGE YET

This PR is opened to preserve in-progress cloud-side work. The bench-side (kilo-bench) promotion flow is still being built and needs to be finished and end-to-end tested against this worker before merge. Design lives at .plans/dashboard-v2.md in the kilo-bench repo.

Do not rush review — the bench-side work may still shake out changes to the wire contract.

Summary

Cloud-side plumbing for a new promotion workflow: human-reviewed eval results from bench.s1lv.com (internal, oauth2-proxy gated) get ingested into the public modelStats.benchmarks.kiloBench so kilo.ai/models/[slug] can show real-world benchmark numbers (success rate, avg cost per task, avg token usage) alongside the existing Artificial Analysis scores.

The design explicitly keeps bench and cloud separate: normal users never access bench, and internal identifiers (bench URLs, ingest IDs, promoter emails) stay out of the public JSONB.

What's in this PR

  • Schema (packages/db/src/schema.ts)

    • New model_eval_ingest table — immutable append-only record of every promotion act. Partitioned by (provider, model, variant, task_source); latest wins at query time via DISTINCT ON … ORDER BY promoted_at DESC (no is_active flag, no supersede links).
    • New KiloBenchEvalSchema Zod type and kiloBench extension on ModelStatsBenchmarksSchema. Only public-safe fields — successRate, avgCostUsd, avgInputTokens?, avgOutputTokens?, avgCacheReadTokens?, avgExecutionMs?, nTrials, lastPromotedAt. No bench URL, no ingest id, no promoter email.
  • Migration (0109_watery_freak.sql) — generated by pnpm drizzle generate against current main; contains only the new table + indexes.

  • Worker (services/model-eval-ingest/)

    • Scaffolded from the session-ingest / webhook-agent-ingest template (same -ingest naming convention). Single Hono app, single Hyperdrive binding, one Secrets Store secret.
    • src/middleware/hmac-auth.ts — HMAC-SHA-256 signature verification (X-Ingest-Signature + X-Ingest-Timestamp, 5-min skew window), portable constant-time compare (works under both Workers runtime and Node for unit tests).
    • src/routes/api.ts
      • POST /api/model-eval-ingest/submit — validate + insert + recompute the modelStats.benchmarks.kiloBench JSONB. Rollup uses DISTINCT ON (task_source) … ORDER BY promoted_at DESC; JSONB || merge preserves artificialAnalysis and other benchmark sources.
      • GET /api/model-eval-ingest/latest/:model — bench-side delta preview.
      • GET /api/model-eval-ingest/:id — bench-side status / admin audit.
  • Tests (33 passing) — 19 HMAC middleware (hex parsing, determinism, missing/invalid headers, skew in both directions, wrong secret, missing cloud secret, happy path) + 14 submission schema (valid shapes, invalid URL/email, n_trials=0, negative cost, absurd success_rate, float tokens, zero-cost-is-fine).
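
As a rough illustration of the checks listed for src/middleware/hmac-auth.ts, here is a minimal sketch. It is not the PR's code. The signed payload format (`${timestamp}.${rawBody}`), the Unix-seconds timestamp, and the names verifyIngestRequest, sign, hexToBytes and constantTimeEqual are all assumptions made for the example, and it uses Node's createHmac for brevity where a Workers-compatible version would more likely go through Web Crypto.

```typescript
import { createHmac } from "node:crypto";

const MAX_SKEW_SECONDS = 5 * 60; // 5-minute window, as stated in the PR

// Parse a hex string into bytes; null on empty, odd-length, or non-hex input.
function hexToBytes(hex: string): Uint8Array | null {
  if (hex.length === 0 || hex.length % 2 !== 0 || !/^[0-9a-fA-F]+$/.test(hex)) return null;
  const out = new Uint8Array(hex.length / 2);
  for (let i = 0; i < out.length; i++) {
    out[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return out;
}

// Portable constant-time compare: always walks every byte and ORs the XOR
// differences together, so timing does not reveal the first mismatch position.
function constantTimeEqual(a: Uint8Array, b: Uint8Array): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) diff |= a[i] ^ b[i];
  return diff === 0;
}

// ASSUMPTION: the signature covers `${timestamp}.${rawBody}`; the PR does not
// spell out the exact signed payload.
function sign(secret: string, timestamp: string, rawBody: string): string {
  return createHmac("sha256", secret).update(`${timestamp}.${rawBody}`).digest("hex");
}

type VerifyResult = { ok: true } | { ok: false; reason: string };

// signatureHex and timestampHeader stand in for the X-Ingest-Signature and
// X-Ingest-Timestamp header values.
function verifyIngestRequest(
  secret: string,
  signatureHex: string | undefined,
  timestampHeader: string | undefined,
  rawBody: string,
  nowSeconds: number,
): VerifyResult {
  if (!signatureHex || !timestampHeader) return { ok: false, reason: "missing headers" };
  const ts = Number(timestampHeader);
  if (!Number.isInteger(ts)) return { ok: false, reason: "invalid timestamp" };
  // Reject skew in both directions (too old or too far in the future).
  if (Math.abs(nowSeconds - ts) > MAX_SKEW_SECONDS) return { ok: false, reason: "timestamp skew" };
  const provided = hexToBytes(signatureHex);
  const expected = hexToBytes(sign(secret, timestampHeader, rawBody));
  if (!provided || !expected || !constantTimeEqual(provided, expected)) {
    return { ok: false, reason: "bad signature" };
  }
  return { ok: true };
}
```

The timestamp is checked before the signature so that stale requests are rejected cheaply. The signature still covers the timestamp, so an attacker cannot refresh it without knowing the secret.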

Verification

Ran locally against origin/main (0c8c06b40):

    pnpm --filter cloudflare-model-eval-ingest test      → 33 / 33 pass
    pnpm --filter cloudflare-model-eval-ingest typecheck → clean
    pnpm --filter @kilocode/db typecheck                 → clean
    pnpm --filter web typecheck                          → 0 errors
    pnpm --filter web test -- schema.test.ts             → 6 / 6 pass (schema drift)
    pnpm format                                          → applied

Pre-push hook (lint + typecheck) passes without --no-verify.

Visual Changes

N/A — no UI in this PR. The public KiloBenchSection component for kilocode-landing model pages is a separate follow-up.

Reviewer Notes

Still needed before this can ship to production:

  1. Bench-side promotion endpoint + SPA dialog (kilo-bench repo, Track 2 Phase D). Nothing calls this worker yet.
  2. Secrets: MODEL_EVAL_INGEST_SECRET_PROD and _DEV need to be created in the shared Secrets Store (342a86d9e3a94da698e82d0c6e2a36f0) before first deploy. Rotate by bumping both sides simultaneously.
  3. Admin UI: /admin/model-eval-ingest list + detail pages (Phase 2). Can land in a separate PR after this one.
  4. Public KiloBenchSection on kilocode-landing/src/app/models/[...slug]/page.tsx (Phase 3). Same — separate PR.

Risk areas:

  • The JSONB || merge in recomputeKiloBench relies on the kiloBench key being a single top-level field. If someone later adds a kiloBench.evals cross-reference that needs deeper merging, that's the place to revisit.
  • variant is nullable. The DISTINCT ON SQL uses a conditional variant IS NULL vs variant = $variant branch. Covered by the plan's latest-wins-per-tuple semantics but worth a second set of eyes.
  • The HMAC middleware reads the raw body via c.req.text() exactly once and caches it on c.var.rawBody. Downstream handlers MUST use c.var.rawBody + JSON.parse rather than c.req.json() — they won't be able to re-read the stream. Current handler does this correctly; any future handler on the same middleware needs to as well.
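
For reviewers looking at the nullable-variant risk, here is an in-memory sketch of the latest-wins-per-tuple semantics. It is an illustration only, not the SQL in this PR. The IngestRow shape, its camelCase field names, and latestPerTaskSource are hypothetical. The point it makes is that a null variant has to match only null rows, which is what the `variant IS NULL` vs `variant = $variant` branch does in SQL, because `variant = NULL` never evaluates to true there.

```typescript
// Hypothetical row shape: only columns named in the PR description, plus one
// metric so the winning row is visible.
interface IngestRow {
  provider: string;
  model: string;
  variant: string | null;
  taskSource: string;
  promotedAt: Date;
  successRate: number;
}

// Returns one row per task source for the given (provider, model, variant):
// the row with the greatest promotedAt, mimicking
// DISTINCT ON (task_source) ... ORDER BY promoted_at DESC.
function latestPerTaskSource(
  rows: IngestRow[],
  provider: string,
  model: string,
  variant: string | null,
): IngestRow[] {
  const latest = new Map<string, IngestRow>();
  for (const row of rows) {
    // In JS, null === null is true, so strict equality covers both SQL
    // branches (IS NULL and = $variant) in a single comparison.
    if (row.provider !== provider || row.model !== model || row.variant !== variant) continue;
    const current = latest.get(row.taskSource);
    if (!current || row.promotedAt.getTime() > current.promotedAt.getTime()) {
      latest.set(row.taskSource, row);
    }
  }
  return Array.from(latest.values());
}

// Made-up example data.
const exampleRows: IngestRow[] = [
  { provider: "acme", model: "model-a", variant: null, taskSource: "suite-1", promotedAt: new Date("2025-01-01T00:00:00Z"), successRate: 0.5 },
  { provider: "acme", model: "model-a", variant: null, taskSource: "suite-1", promotedAt: new Date("2025-02-01T00:00:00Z"), successRate: 0.7 },
  { provider: "acme", model: "model-a", variant: "high", taskSource: "suite-1", promotedAt: new Date("2025-03-01T00:00:00Z"), successRate: 0.9 },
  { provider: "acme", model: "model-a", variant: null, taskSource: "suite-2", promotedAt: new Date("2025-01-15T00:00:00Z"), successRate: 0.4 },
];

const nullVariantLatest = latestPerTaskSource(exampleRows, "acme", "model-a", null);
const highVariantLatest = latestPerTaskSource(exampleRows, "acme", "model-a", "high");
```

In this example the newer "high" row does not displace the null-variant result for suite-1, which is the behaviour the conditional branch is meant to guarantee.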

Rollout order:

  1. Land schema migration in a separate non-draft PR (safe; additive only).
  2. Create Secrets Store secrets.
  3. Deploy the worker (dev first, then prod) — it will just sit there until the bench side calls it.
  4. Ship the bench-side promotion flow.
  5. Verify end-to-end with a real clean eval run.
  6. Add admin UI + public landing-page component.

Design doc: .plans/dashboard-v2.md in the kilo-bench repo.

…h ingest

DRAFT / WIP — do not merge until the bench-side promotion flow is
finished and end-to-end tested against this worker. This PR is opened
to preserve the cloud-side work in progress; the kilo-bench side is
still being built (see kilo-bench/.plans/dashboard-v2.md Track 2).

Adds:

- `model_eval_ingest` table — immutable append-only record of eval
  results promoted from bench.s1lv.com. Primary partitioning key is
  (provider, model, variant, task_source); a later row supersedes by
  `promoted_at` at query time.
- `KiloBenchEvalSchema` + extended `ModelStatsBenchmarksSchema.kiloBench`
  — denormalised read cache for public model pages. Only public-safe
  fields (no bench URL, no ingest id, no promoter email).
- `services/model-eval-ingest/` Cloudflare Worker:
  - HMAC-SHA-256 signature verification (`X-Ingest-Signature` +
    `X-Ingest-Timestamp`, 5-min skew), portable constant-time compare.
  - `POST /api/model-eval-ingest/submit` — insert + recompute the
    `modelStats.benchmarks.kiloBench` JSONB via a DISTINCT-ON-by-
    timestamp rollup (JSONB merge preserves other benchmark sources).
  - `GET /api/model-eval-ingest/:id` and `/latest/:model` for bench-side
    status + delta preview.
- 33 vitest unit tests (19 HMAC middleware + 14 submission schema).
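
To make the "JSONB merge preserves other benchmark sources" point concrete, here is a small sketch of top-level merge semantics. PostgreSQL's `||` on two JSONB objects takes the union of top-level keys, lets the right-hand value win on conflicts, and does not recurse. Object spread in TypeScript behaves the same way, so it is used here as a stand-in. The shallowMerge helper and the field values (including `score` and `evals`) are hypothetical and not taken from the PR.

```typescript
type Json = { [key: string]: unknown };

// Top-level merge only: each key in `patch` replaces the same key in `base`
// wholesale, with no recursion into nested objects.
function shallowMerge(base: Json, patch: Json): Json {
  return { ...base, ...patch };
}

// Made-up existing benchmarks value with two sources.
const existing: Json = {
  artificialAnalysis: { score: 42 },
  kiloBench: { successRate: 0.5, evals: { "suite-1": "kept?" } },
};

// Freshly recomputed kiloBench rollup (only the key being rewritten).
const patch: Json = {
  kiloBench: { successRate: 0.8 },
};

const merged = shallowMerge(existing, patch);
// merged.artificialAnalysis is untouched because the patch has no such key.
// merged.kiloBench is replaced as a whole, so the nested `evals` entry is gone.
```

This is why the merge is safe while kiloBench is written as one complete value, and why a nested field that must survive a recompute would need a deeper merge.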

Operational notes for reviewers:

- `MODEL_EVAL_INGEST_SECRET_PROD` and `_DEV` secrets need to be
  created in the shared Secrets Store (342a86d9e3a94da698e82d0c6e2a36f0)
  before deploy. Rotate by bumping both sides simultaneously.
- No admin UI yet (`/admin/model-eval-ingest` pages); tracked as Phase
  2 of the plan.
- No public `KiloBenchSection` component yet in kilocode-landing;
  tracked as Phase 3 of the plan.
- No bench-side promotion endpoint yet; tracked as Phase D of the
  kilo-bench rebuild (Track 2).

Verification run locally against origin/main (0c8c06b):

    pnpm --filter cloudflare-model-eval-ingest test      → 33 / 33 pass
    pnpm --filter cloudflare-model-eval-ingest typecheck → clean
    pnpm --filter @kilocode/db typecheck                 → clean
    pnpm --filter web typecheck                          → 0 errors
    pnpm --filter web test -- schema.test.ts             → 6 / 6 pass

Design: `.plans/dashboard-v2.md` in the kilo-bench repo.