Kimi/GLM L6 benchmarks + harness defect #8 + backlog fixes #69

Merged: unohee merged 6 commits into main from bench/kimi-glm-harness-fixes on Jun 18, 2026

@unohee unohee commented Jun 18, 2026

Summary

Post-v0.5.0 model-routing & benchmarking work plus the developable backlog. Five logical commits, each scoped to one issue area.

Benchmarks (L6, official swebench grading)

  • Kimi K2.5/K2.6 (INT-1455): both sweep L0–L5 100% (24/24), beat gemini-2.5-flash baseline (88%); k2.5 −57% $/pass. L6 hybrid — k2.5 worker 2/2 RESOLVED, k2.6 planner 1/1 RESOLVED. The frontier-diagnosis hybrid no longer requires gpt-5.
  • GLM-5.1 (INT-1461): solid implementer (5859 RESOLVED), recoverable diagnostician (7080 only via re-diagnosis escalate). kimi-k2.6 remains the better+cheaper planner.

Harness defect #8 — self-authored test masking (INT-1462)

The gold FAIL_TO_PASS test was absent from the sandbox, so the implementer reinvented it, and on 7080 a wrong self-authored test validated a wrong fix. sweBench.ts now pre-applies the instance test_patch before baselining, so local verification uses the real oracle, the touched test files are protected, and model_patch stays source-only. Verified RESOLVED on 5859 with zero self-authored test files.

Backlog fixes

  • INT-1441 — extract shared resultParsing.ts; gpt/local/openrouter delegate (codex kept on its richer variant).
  • INT-1442 — 64-entry LRU on the loop read cache; distillation no longer downgrades a verified explicit memory type; document the empty-answer salvage turn.
  • INT-1437 — rowid DESC tiebreaker makes issue-event ordering deterministic (flaky test fixed).
  • INT-1460 — OpenRouter key from repo .env (OPENROUTER_API_KEY) via the zero-dep loader.

Verification

tsc --noEmit clean · full suite 692/692 · adapter suite 76/76 · defect-#8 fix and benchmark verdicts confirmed by the official swebench.harness.run_evaluation.

Related Issues

Refs INT-1455, INT-1460, INT-1461, INT-1462, INT-1441, INT-1442, INT-1437

unohee added 6 commits June 18, 2026 19:08
Benchmark the planner/worker hybrid with non-OpenAI models, graded by the
official swebench harness:
- Kimi K2.5/K2.6 sweep L0-L5 100% (24/24), beating gemini-2.5-flash (88%);
  k2.5 -57% $/pass. L6 hybrid: k2.5 worker 2/2 RESOLVED, k2.6 planner 1/1.
- GLM-5.1: solid implementer (5859 RESOLVED), recoverable diagnostician
  (7080 via re-diagnosis escalate). New failure mode: self-authored test
  masking (defect #8 candidate, fixed separately).

Also switch the OpenRouter key source from a manually-sourced OPENROUTER_API
to the repo-local OPENROUTER_API_KEY via the zero-dep .env loader; benchmarks
call loadEnvFile() at startup.
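
As an illustration of what a zero-dep .env loader can look like, here is a minimal sketch. It is not the repo's loadEnvFile(): the names parseEnv and applyEnv are hypothetical, file reading is left out so the example stays self-contained, and the rule that already-set variables win over the file is an assumption.

```typescript
// Hypothetical sketch of a dependency-free .env parser (not the repo's loadEnvFile()).

/** Parse KEY=VALUE lines, skipping blanks and # comments, and unquoting values. */
export function parseEnv(text: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // no key before "=": ignore the line
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    const quote = value.charAt(0);
    if (value.length >= 2 && (quote === '"' || quote === "'") && value.endsWith(quote)) {
      value = value.slice(1, -1); // strip one pair of matching quotes
    }
    out[key] = value;
  }
  return out;
}

/** Copy parsed values into an env-like object. Assumption: existing values are kept. */
export function applyEnv(
  parsed: Record<string, string>,
  target: Record<string, string | undefined>,
): void {
  for (const key of Object.keys(parsed)) {
    if (target[key] === undefined) target[key] = parsed[key];
  }
}
```

In a Node process the target would typically be process.env, filled once at startup before any adapter reads OPENROUTER_API_KEY.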

Refs: INT-1455, INT-1460, INT-1461

…king

SWE-bench images ship base_commit only; the FAIL_TO_PASS test is applied at
eval time. The implementer therefore reinvented the gold test, which on 7080
validated a wrong fix (local "pass" while official grading failed). Now
sweBench.ts git-applies the instance test_patch to the sandbox before
baselining: local verification runs the real oracle, the touched test files
join protectedFiles, and the gold test is absorbed into the baseline so the
extracted model_patch stays source-only (no manual test-hunk stripping).

Verified on 5859: worker created zero test files, model_patch was source-only,
official grading RESOLVED. Also adds a missing-test_patch WARNING (no silent
fallback). This is harness defect #8.
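
The planning half of that flow can be sketched as follows. This is an assumption-laden illustration, not the code in sweBench.ts: touchedFiles, planBaseline, and the BaselinePlan shape are hypothetical names, and only instance_id and test_patch are real SWE-bench instance fields. The sketch derives the files to protect from the patch's `diff --git` headers and emits a warning instead of silently continuing when test_patch is missing.

```typescript
// Hypothetical sketch; the real sweBench.ts implementation is not shown in this PR.

interface SweInstance {
  instance_id: string;
  test_patch?: string; // unified diff carrying the gold tests
}

interface BaselinePlan {
  applyPatch: string | null; // patch to `git apply` in the sandbox before baselining
  protectedFiles: string[];  // files the worker must not modify
  warnings: string[];
}

/** Paths touched by a unified diff, read from its "diff --git a/x b/x" headers. */
export function touchedFiles(patch: string): string[] {
  const files = new Set<string>();
  for (const line of patch.split("\n")) {
    const match = /^diff --git a\/(.+) b\/(.+)$/.exec(line.trimEnd());
    const path = match ? match[2] : undefined;
    if (path) files.add(path);
  }
  return Array.from(files);
}

/** Decide what to apply and protect before the baseline test run is recorded. */
export function planBaseline(instance: SweInstance, protectedFiles: string[] = []): BaselinePlan {
  const patch = instance.test_patch ?? "";
  if (patch.trim() === "") {
    // No silent fallback: surface the missing oracle to the operator.
    const warning = `WARNING: ${instance.instance_id} has no test_patch; local verification lacks the gold test`;
    return { applyPatch: null, protectedFiles, warnings: [warning] };
  }
  const merged = Array.from(new Set(protectedFiles.concat(touchedFiles(patch))));
  return { applyPatch: patch, protectedFiles: merged, warnings: [] };
}
```

Because the patch is applied before the baseline is taken, a later diff of the sandbox against that baseline contains only the worker's source edits, which is why no test-hunk stripping is needed.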

Refs: INT-1462

gpt, local, and openrouter each carried a byte-identical (modulo formatting)
copy of eight result-parsing functions, already drifting in comments — a parse
fix would have needed three-way sync. New src/adapters/resultParsing.ts exports
parseWorkerResult / parseReviewerResult; the three adapters delegate in one line
and drop their local copies (and the now-orphaned t import).
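
The shape of the delegation can be sketched like this. Only the name parseWorkerResult comes from the commit. The WorkerResult fields, the JSON-then-text fallback, and the adapter objects are assumptions made for illustration, not the contents of src/adapters/resultParsing.ts.

```typescript
// Hypothetical sketch of one shared parser with thin adapter delegates.

export interface WorkerResult {
  ok: boolean;
  summary: string;
}

/** Shared parser: take the outermost {...} span as JSON, else fall back to raw text. */
export function parseWorkerResult(raw: string): WorkerResult {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start !== -1 && end > start) {
    try {
      const data: unknown = JSON.parse(raw.slice(start, end + 1));
      if (typeof data === "object" && data !== null) {
        const record = data as Record<string, unknown>;
        const summary = record["summary"];
        return { ok: record["ok"] === true, summary: typeof summary === "string" ? summary : "" };
      }
    } catch {
      // Not valid JSON: fall through to the text fallback below.
    }
  }
  return { ok: false, summary: raw.trim() };
}

// Each adapter delegates in one line instead of carrying its own copy.
export const gptAdapter = { parseWorker: parseWorkerResult };
export const localAdapter = { parseWorker: parseWorkerResult };
export const openrouterAdapter = { parseWorker: parseWorkerResult };
```

With a single implementation, a parse fix lands once and all three adapters pick it up.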

codex is intentionally left on its own variant: its text fallbacks salvage
filesChanged/commands/issues from prose and its JSON extractor differs, so
sharing would regress behavior or touch untested paths.

Adapter suite 76/76 green. The openrouter.ts edit also carries the INT-1460
key-source change.

Refs: INT-1441

- tools.ts: the loop-scoped read cache was an unbounded Map (megabytes over an
  80-turn run). Add a 64-entry LRU (cacheGet bumps to MRU, cacheSet evicts the
  oldest). New test covers eviction.
- memoryCore.ts: saveMemory let distillation silently override the caller's
  explicit type (system_pattern/constraint -> belief), escapable only via
  skipDistillation. New contract: distillation refines the type only when
  !isVerified; a verified explicit type is authoritative.
- agenticLoop.ts: document that an empty final answer intentionally triggers the
  salvage turn (not an accidental extra call).
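
The LRU behavior described for tools.ts can be sketched with a plain Map, which iterates in insertion order. The cacheGet and cacheSet names and the 64-entry bound come from the commit message. The ReadCache class wrapper and the string keys are assumptions, not the actual code.

```typescript
// Minimal LRU sketch built on Map insertion order (hypothetical ReadCache wrapper).

const MAX_ENTRIES = 64;

export class ReadCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly capacity: number;

  constructor(capacity: number = MAX_ENTRIES) {
    this.capacity = capacity;
  }

  /** Return the cached value and bump the key to most-recently-used. */
  cacheGet(key: string): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    this.entries.delete(key);
    this.entries.set(key, value); // re-insert so the key moves to the end (MRU)
    return value;
  }

  /** Store a value as most-recently-used, evicting the oldest entry when over capacity. */
  cacheSet(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      const oldest = this.entries.keys().next(); // first key is the least recently used
      if (!oldest.done) this.entries.delete(oldest.value);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Both operations are O(1), and the cache can never hold more than the configured number of file reads regardless of how many turns the loop runs.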

Full suite 692/692.
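
The new saveMemory type contract reduces to a small decision function. The sketch below is illustrative only: resolveMemoryType and the SaveRequest shape are hypothetical, the three type names are the ones mentioned in the commit, and returning the distilled type when no explicit type was supplied is an assumption.

```typescript
// Hypothetical sketch of the type-resolution rule, not the code in memoryCore.ts.

type MemoryType = "system_pattern" | "constraint" | "belief";

interface SaveRequest {
  explicitType?: MemoryType; // type chosen by the caller, if any
  isVerified: boolean;
}

/** A verified explicit type is authoritative; otherwise distillation may refine it. */
export function resolveMemoryType(request: SaveRequest, distilledType: MemoryType): MemoryType {
  if (request.isVerified && request.explicitType !== undefined) {
    return request.explicitType;
  }
  return distilledType;
}
```

Under this rule a verified system_pattern or constraint can no longer be downgraded to belief, and callers do not need skipDistillation just to keep their type.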

Refs: INT-1442

getEvents/getRecentEvents ordered by created_at DESC alone, but created_at is
ms-precision TEXT — a created event and an immediate addEvent landing in the
same millisecond ordered non-deterministically, intermittently failing the
issueStore event test. Add rowid DESC as a tiebreaker (monotonic with insertion
order), making the order total. No schema change.
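
The effect of the tiebreaker can be shown with an in-memory comparator that mirrors the two-key ordering. This is a sketch under assumptions: EventRow, its kind field, and compareEventsDesc are hypothetical, timestamps are assumed to be fixed-width ISO-8601 strings so that string comparison matches time order, and the ORDER BY text is a guess at the clause rather than the query in the repo.

```typescript
// Hypothetical sketch: created_at DESC with rowid DESC as the tiebreaker.

interface EventRow {
  rowid: number;      // grows with insertion order
  created_at: string; // ms-precision timestamp stored as TEXT
  kind: string;
}

// Assumed shape of the SQL clause that this comparator mirrors.
export const EVENT_ORDER = "ORDER BY created_at DESC, rowid DESC";

/** Newest first; rows sharing a millisecond are ordered by rowid, newest insert first. */
export function compareEventsDesc(a: EventRow, b: EventRow): number {
  if (a.created_at !== b.created_at) {
    return a.created_at < b.created_at ? 1 : -1;
  }
  return b.rowid - a.rowid;
}

const rows: EventRow[] = [
  { rowid: 1, created_at: "2026-06-18T10:00:00.000Z", kind: "created" },
  { rowid: 2, created_at: "2026-06-18T10:00:00.000Z", kind: "comment" },
  { rowid: 3, created_at: "2026-06-18T10:00:00.001Z", kind: "closed" },
];

export const ordered = rows.slice().sort(compareEventsDesc);
```

Without the second key, rows 1 and 2 compare as equal and their relative order is left to the engine, which is what made the test flaky.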

Isolated 25/25 and full suite 692/692 across repeated runs.

Refs: INT-1437
@unohee unohee merged commit c3dc1f1 into main Jun 18, 2026
9 checks passed
@unohee unohee deleted the bench/kimi-glm-harness-fixes branch June 18, 2026 11:04