Kimi/GLM L6 benchmarks + harness defect #8 + backlog fixes #69
Merged
Benchmark the planner/worker hybrid with non-OpenAI models, graded by the official swebench harness:
- Kimi K2.5/K2.6 sweep L0-L5 100% (24/24), beating gemini-2.5-flash (88%); k2.5 -57% $/pass. L6 hybrid: k2.5 worker 2/2 RESOLVED, k2.6 planner 1/1.
- GLM-5.1: solid implementer (5859 RESOLVED), recoverable diagnostician (7080 via re-diagnosis escalate). New failure mode: self-authored test masking (defect #8 candidate, fixed separately).

Also switch the OpenRouter key source from a manually-sourced OPENROUTER_API to the repo-local OPENROUTER_API_KEY via the zero-dep .env loader; benchmarks call loadEnvFile() at startup.

Refs: INT-1455, INT-1460, INT-1461
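A minimal sketch of what a zero-dependency .env loader might look like, to illustrate the key-source switch described above. The repo's actual loadEnvFile() is not shown in this PR page, so parseEnvText, applyEnv, and the rule that already-set variables win are all assumptions made for illustration.

```typescript
// Hypothetical sketch only: the real loadEnvFile() in this repo may behave differently.
type Env = Record<string, string | undefined>;

// Parse KEY=VALUE lines, skipping blanks, comments, and lines without "=".
function parseEnvText(text: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.charAt(0) === "#") continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // no key, or no "=" at all
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    const first = value.charAt(0);
    const last = value.charAt(value.length - 1);
    // Strip one pair of matching surrounding quotes, if present.
    if (value.length >= 2 && first === last && (first === '"' || first === "'")) {
      value = value.slice(1, -1);
    }
    out[key] = value;
  }
  return out;
}

// Copy parsed values into an environment object without overriding
// variables that are already set (assumed policy, not confirmed by the PR).
function applyEnv(parsed: Record<string, string>, target: Env): void {
  for (const key of Object.keys(parsed)) {
    if (target[key] === undefined) target[key] = parsed[key];
  }
}
```

In a Node process the target would typically be process.env, with the text read from the repo-local .env file at benchmark startup.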
…king

SWE-bench images ship base_commit only; the FAIL_TO_PASS test is applied at eval time. The implementer therefore reinvented the gold test, which on 7080 validated a wrong fix (local "pass" while official grading failed).

Now sweBench.ts git-applies the instance test_patch to the sandbox before baselining: local verification runs the real oracle, the touched test files join protectedFiles, and the gold test is absorbed into the baseline so the extracted model_patch stays source-only (no manual test-hunk stripping).

Verified on 5859: worker created zero test files, model_patch was source-only, official grading RESOLVED. Also adds a missing-test_patch WARNING (no silent fallback). This is harness defect #8.

Refs: INT-1462
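The bookkeeping side of this fix can be sketched as follows: derive the files a test_patch touches, add them to the protected set, and emit a warning when the patch is missing. This is an illustration under assumptions, not the code in sweBench.ts; touchedFiles, planOracle, and the OraclePlan shape are hypothetical names, and the actual git apply step is left out.

```typescript
// Hypothetical sketch: names and shapes below are not taken from sweBench.ts.
interface Instance {
  instance_id: string;
  test_patch?: string; // unified diff carrying the gold tests
}

interface OraclePlan {
  applyPatch: boolean;      // whether the harness should git-apply test_patch
  protectedFiles: string[]; // files the worker must not modify
  warning?: string;         // set when no test_patch is available
}

// Collect the "b/" side path of every "diff --git" header, without duplicates.
function touchedFiles(patch: string): string[] {
  const files: string[] = [];
  for (const line of patch.split(/\r?\n/)) {
    const m = /^diff --git a\/(.+?) b\/(.+)$/.exec(line);
    if (m && files.indexOf(m[2]) === -1) files.push(m[2]);
  }
  return files;
}

function planOracle(instance: Instance, protectedFiles: string[]): OraclePlan {
  const patch = instance.test_patch;
  if (!patch || patch.trim() === "") {
    // No silent fallback: surface the missing oracle explicitly.
    return {
      applyPatch: false,
      protectedFiles: protectedFiles.slice(),
      warning: "WARNING: " + instance.instance_id + " has no test_patch",
    };
  }
  const merged = protectedFiles.slice();
  for (const file of touchedFiles(patch)) {
    if (merged.indexOf(file) === -1) merged.push(file);
  }
  return { applyPatch: true, protectedFiles: merged };
}
```

Because the patch is applied before the baseline is taken, a later diff against that baseline contains only the worker's source changes, which is what keeps model_patch source-only.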
gpt, local, and openrouter each carried a byte-identical (modulo formatting) copy of eight result-parsing functions, already drifting in comments, so a parse fix would have needed three-way sync. New src/adapters/resultParsing.ts exports parseWorkerResult / parseReviewerResult; the three adapters delegate in one line and drop their local copies (and the now-orphaned t import).

codex is intentionally left on its own variant: its text fallbacks salvage filesChanged/commands/issues from prose and its JSON extractor differs, so sharing would regress behavior or touch untested paths.

Adapter suite 76/76 green. The openrouter.ts edit also carries the INT-1460 key-source change.

Refs: INT-1441
- tools.ts: the loop-scoped read cache was an unbounded Map (megabytes over an 80-turn run). Add a 64-entry LRU (cacheGet bumps to MRU, cacheSet evicts the oldest). New test covers eviction.
- memoryCore.ts: saveMemory let distillation silently override the caller's explicit type (system_pattern/constraint -> belief), escapable only via skipDistillation. New contract: distillation refines the type only when !isVerified; a verified explicit type is authoritative.
- agenticLoop.ts: document that an empty final answer intentionally triggers the salvage turn (not an accidental extra call).

Full suite 692/692.

Refs: INT-1442
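The LRU behaviour described for tools.ts can be sketched on top of a plain Map, which iterates its keys in insertion order. This is an illustration, not the code in the PR: the ReadCache class is a hypothetical wrapper, and only the cacheGet/cacheSet names and the 64-entry bound come from the commit message.

```typescript
// Hypothetical sketch of a bounded LRU read cache built on Map insertion order.
class ReadCache<V> {
  private readonly entries = new Map<string, V>();

  constructor(private readonly maxEntries: number = 64) {}

  // A hit re-inserts the key so it becomes the most recently used entry.
  cacheGet(key: string): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  // Inserting a new key into a full cache evicts the least recently used key,
  // which is the first key in the Map's iteration order.
  cacheSet(key: string, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, value);
  }

  get size(): number {
    return this.entries.size;
  }
}
```

With a capacity of 2, setting a and b, reading a, then setting c evicts b, since the read moved a to the most recently used position.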
getEvents/getRecentEvents ordered by created_at DESC alone, but created_at is ms-precision TEXT, so a created event and an immediate addEvent landing in the same millisecond ordered non-deterministically, intermittently failing the issueStore event test. Add rowid DESC as a tiebreaker (monotonic with insertion order), making the order total. No schema change.

Isolated 25/25 and full suite 692/692 across repeated runs.

Refs: INT-1437
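To see why the tiebreaker makes the order total, the same ordering can be expressed as an in-memory comparator. This is a sketch under assumptions: the EventRow shape, the issue_events table name, and the sample rows are hypothetical, and only the ORDER BY clause reflects what the commit describes.

```typescript
// Hypothetical row shape; only created_at and rowid matter for ordering.
interface EventRow {
  rowid: number;      // SQLite row id, increasing with insertion order here
  created_at: string; // ms-precision ISO timestamp stored as TEXT
  kind: string;
}

// Assumed query shape; the table name is illustrative.
const RECENT_EVENTS_SQL =
  "SELECT rowid, created_at, kind FROM issue_events ORDER BY created_at DESC, rowid DESC";

// Comparator equivalent to ORDER BY created_at DESC, rowid DESC.
function compareEventsDesc(a: EventRow, b: EventRow): number {
  if (a.created_at !== b.created_at) {
    // Same-format ISO timestamps compare correctly as plain strings.
    return a.created_at < b.created_at ? 1 : -1;
  }
  return b.rowid - a.rowid; // tiebreaker: later insert sorts first
}

const rows: EventRow[] = [
  { rowid: 1, created_at: "2025-01-01T00:00:00.000Z", kind: "created" },
  { rowid: 2, created_at: "2025-01-01T00:00:00.000Z", kind: "comment" },
  { rowid: 3, created_at: "2025-01-01T00:00:00.001Z", kind: "status" },
];

const ordered = rows.slice().sort(compareEventsDesc);
```

Without the rowid comparison, the first two rows compare as equal and their relative order is left to the engine, which is the non-determinism the flaky test exposed.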
Summary
Post-v0.5.0 model-routing & benchmarking work plus the developable backlog. Five logical commits, each scoped to one issue area.
Benchmarks (L6, official swebench grading)
Harness defect #8 — self-authored test masking (INT-1462)
The gold FAIL_TO_PASS test was absent from the sandbox, so the implementer reinvented it and on 7080 a wrong test validated a wrong fix.
sweBench.ts now pre-applies the instance test_patch before baselining → local verification uses the real oracle, test files are protected, and model_patch stays source-only. Verified RESOLVED on 5859 with zero self-authored test files.
Backlog fixes
resultParsing.ts; gpt/local/openrouter delegate (codex kept on its richer variant).
rowid DESC tiebreaker makes issue-event ordering deterministic (flaky test fixed).
.env (OPENROUTER_API_KEY) via the zero-dep loader.
Verification
tsc --noEmit clean · full suite 692/692 · adapter suite 76/76 · defect-#8 fix and benchmark verdicts confirmed by the official swebench.harness.run_evaluation.
Related Issues
Refs INT-1455, INT-1460, INT-1461, INT-1462, INT-1441, INT-1442, INT-1437