Kimi/GLM L6 benchmarks + harness defect #8 + backlog fixes by unohee · Pull Request #69 · unohee/OpenSwarm

unohee · 2026-06-18T10:09:46Z

Summary

Post-v0.5.0 model-routing & benchmarking work plus the developable backlog. Five logical commits, each scoped to one issue area.

Benchmarks (L6, official swebench grading)

Kimi K2.5/K2.6 (INT-1455): both sweep L0–L5 at 100% (24/24), beating the gemini-2.5-flash baseline (88%); k2.5 comes in at −57% $/pass. L6 hybrid: k2.5 worker 2/2 RESOLVED, k2.6 planner 1/1 RESOLVED. The frontier-diagnosis hybrid no longer requires gpt-5.
GLM-5.1 (INT-1461): solid implementer (5859 RESOLVED), recoverable diagnostician (7080 only via re-diagnosis escalate). kimi-k2.6 remains the better and cheaper planner.

Harness defect #8 — self-authored test masking (INT-1462)

The gold FAIL_TO_PASS test was absent from the sandbox, so the implementer reinvented it, and on 7080 a wrong test validated a wrong fix. sweBench.ts now pre-applies the instance test_patch before baselining, so local verification uses the real oracle, test files are protected, and model_patch stays source-only. Verified RESOLVED on 5859 with zero self-authored test files.

Backlog fixes

INT-1441 — extract shared resultParsing.ts; gpt/local/openrouter delegate (codex kept on its richer variant).
INT-1442 — 64-entry LRU on the loop read cache; distillation no longer downgrades a verified explicit memory type; document the empty-answer salvage turn.
INT-1437 — rowid DESC tiebreaker makes issue-event ordering deterministic (flaky test fixed).
INT-1460 — OpenRouter key from repo .env (OPENROUTER_API_KEY) via the zero-dep loader.
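
For the INT-1460 item above, a minimal sketch of what the parsing step of a zero-dep .env loader could look like. `parseEnvText` and `applyEnv` are hypothetical names used only for illustration; the repo's actual `loadEnvFile()` is not shown in this PR text and may handle quoting or precedence differently.

```typescript
// Hypothetical sketch of a zero-dependency .env parser (NOT the repo's loadEnvFile()).
// Handles KEY=value lines, blank lines, '#' comment lines, and matching surrounding quotes.
export function parseEnvText(text: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // no '=' or empty key: skip the malformed line
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    const quote = value.charAt(0);
    if (value.length >= 2 && (quote === '"' || quote === "'") && value.endsWith(quote)) {
      value = value.slice(1, -1); // strip one pair of matching quotes
    }
    out[key] = value;
  }
  return out;
}

// Assumed policy: a variable already present in the environment wins over the file.
export function applyEnv(
  parsed: Record<string, string>,
  env: Record<string, string | undefined>,
): void {
  for (const key of Object.keys(parsed)) {
    if (env[key] === undefined) env[key] = parsed[key];
  }
}
```

Under these assumptions a caller would read the file contents, pass them to `parseEnvText`, and apply the result to the process environment before constructing the OpenRouter client.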

Verification

tsc --noEmit clean · full suite 692/692 · adapter suite 76/76 · defect-#8 fix and benchmark verdicts confirmed by the official swebench.harness.run_evaluation.

Related Issues

Refs INT-1455, INT-1460, INT-1461, INT-1462, INT-1441, INT-1442, INT-1437

Benchmark the planner/worker hybrid with non-OpenAI models, graded by the official swebench harness:

- Kimi K2.5/K2.6 sweep L0-L5 100% (24/24), beating gemini-2.5-flash (88%); k2.5 -57% $/pass. L6 hybrid: k2.5 worker 2/2 RESOLVED, k2.6 planner 1/1.
- GLM-5.1: solid implementer (5859 RESOLVED), recoverable diagnostician (7080 via re-diagnosis escalate). New failure mode: self-authored test masking (defect #8 candidate, fixed separately).

Also switch the OpenRouter key source from a manually-sourced OPENROUTER_API to the repo-local OPENROUTER_API_KEY via the zero-dep .env loader; benchmarks call loadEnvFile() at startup.

Refs: INT-1455, INT-1460, INT-1461

…king

SWE-bench images ship base_commit only; the FAIL_TO_PASS test is applied at eval time. The implementer therefore reinvented the gold test, which on 7080 validated a wrong fix (local "pass" while official grading failed).

Now sweBench.ts git-applies the instance test_patch to the sandbox before baselining: local verification runs the real oracle, the touched test files join protectedFiles, and the gold test is absorbed into the baseline so the extracted model_patch stays source-only (no manual test-hunk stripping).

Verified on 5859: worker created zero test files, model_patch was source-only, official grading RESOLVED. Also adds a missing-test_patch WARNING (no silent fallback).

This is harness defect #8.

Refs: INT-1462
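
To illustrate the "touched test files join protectedFiles" step: the harness needs the set of paths that test_patch modifies. A minimal sketch, assuming git-style `diff --git a/… b/…` headers. `touchedFilesFromPatch` and `isSourceOnly` are hypothetical helpers, not code from sweBench.ts, and `isSourceOnly` is only an illustrative after-the-fact check, since the commit achieves source-only patches through protection and baselining rather than through a check like this.

```typescript
// Hypothetical helper: list the paths a git-style unified diff touches,
// read from "diff --git a/<old> b/<new>" headers (paths containing " b/" are not handled).
export function touchedFilesFromPatch(patch: string): string[] {
  const files: string[] = [];
  for (const line of patch.split(/\r?\n/)) {
    const m = /^diff --git a\/(.+) b\/(.+)$/.exec(line);
    if (m === null) continue;
    const path = m[2];
    if (path !== undefined && files.indexOf(path) === -1) {
      files.push(path); // keep first-seen order, no duplicates
    }
  }
  return files;
}

// Illustrative check: a model patch is "source-only" when it touches
// none of the protected (gold test) files.
export function isSourceOnly(modelPatch: string, protectedFiles: string[]): boolean {
  return touchedFilesFromPatch(modelPatch).every(
    (file) => protectedFiles.indexOf(file) === -1,
  );
}
```

In this sketch the protected list would be computed once from the instance test_patch before the worker runs, and any later patch that touches one of those paths would be flagged.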

gpt, local, and openrouter each carried a byte-identical (modulo formatting) copy of eight result-parsing functions, already drifting in comments — a parse fix would have needed three-way sync. New src/adapters/resultParsing.ts exports parseWorkerResult / parseReviewerResult; the three adapters delegate in one line and drop their local copies (and the now-orphaned t import). codex is intentionally left on its own variant: its text fallbacks salvage filesChanged/commands/issues from prose and its JSON extractor differs, so sharing would regress behavior or touch untested paths. Adapter suite 76/76 green. The openrouter.ts edit also carries the INT-1460 key-source change. Refs: INT-1441
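
The delegation pattern described above can be sketched as follows. Only the module role and the name `parseWorkerResult` come from the commit message; the `WorkerResult` shape, the parsing logic, and the adapter objects are assumptions made for illustration and do not reflect the real src/adapters/resultParsing.ts.

```typescript
// Assumed result shape; the real type in the repo is not shown in this PR text.
export interface WorkerResult {
  success: boolean;
  summary: string;
}

// Hypothetical shared parser: take the outermost {...} span, parse it as JSON,
// and fall back to treating the raw text as a failed result's summary.
export function parseWorkerResult(raw: string): WorkerResult {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start !== -1 && end > start) {
    try {
      const parsed: unknown = JSON.parse(raw.slice(start, end + 1));
      if (typeof parsed === "object" && parsed !== null) {
        const obj = parsed as Record<string, unknown>;
        const success = obj["success"];
        const summary = obj["summary"];
        if (typeof success === "boolean" && typeof summary === "string") {
          return { success, summary };
        }
      }
    } catch {
      // invalid JSON: fall through to the text fallback below
    }
  }
  return { success: false, summary: raw.trim() };
}

// Each adapter delegates in one line instead of carrying its own copy.
export const gptAdapter = {
  parseWorker: (raw: string): WorkerResult => parseWorkerResult(raw),
};
export const localAdapter = {
  parseWorker: (raw: string): WorkerResult => parseWorkerResult(raw),
};
export const openrouterAdapter = {
  parseWorker: (raw: string): WorkerResult => parseWorkerResult(raw),
};
```

With a single shared function, a parse fix lands in one place and all three adapters pick it up, which is the three-way sync problem the commit removes.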

- tools.ts: the loop-scoped read cache was an unbounded Map (megabytes over an 80-turn run). Add a 64-entry LRU (cacheGet bumps to MRU, cacheSet evicts the oldest). New test covers eviction.
- memoryCore.ts: saveMemory let distillation silently override the caller's explicit type (system_pattern/constraint -> belief), escapable only via skipDistillation. New contract: distillation refines the type only when !isVerified; a verified explicit type is authoritative.
- agenticLoop.ts: document that an empty final answer intentionally triggers the salvage turn (not an accidental extra call).

Full suite 692/692.

Refs: INT-1442
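
The two behavioral changes above can be sketched as follows. This is a minimal illustration, not the code in tools.ts or memoryCore.ts: the class name `LruCache`, the function `resolveMemoryType`, and the `MemoryType` union (limited to the three types named in the commit message) are assumptions. Only the method names `cacheGet`/`cacheSet`, the 64-entry default, and the `isVerified` rule come from the text.

```typescript
// Minimal LRU built on Map, which iterates keys in insertion order.
export class LruCache<K, V> {
  private readonly map = new Map<K, V>();
  private readonly capacity: number;

  constructor(capacity: number = 64) {
    this.capacity = capacity;
  }

  // A hit re-inserts the entry so it becomes the most recently used (MRU).
  cacheGet(key: K): V | undefined {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key) as V;
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  // Insert or refresh; once over capacity, evict the oldest (first) key.
  cacheSet(key: K, value: V): void {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.capacity) {
      const oldest = this.map.keys().next();
      if (!oldest.done) this.map.delete(oldest.value);
    }
  }

  get size(): number {
    return this.map.size;
  }
}

// Only the types named in the commit message; the real set may be larger.
export type MemoryType = "belief" | "system_pattern" | "constraint";

// Contract sketch: a verified explicit type is authoritative;
// otherwise distillation may refine it (falling back to the explicit type).
export function resolveMemoryType(
  explicitType: MemoryType,
  distilledType: MemoryType | undefined,
  isVerified: boolean,
): MemoryType {
  if (isVerified) return explicitType;
  return distilledType ?? explicitType;
}
```

Deleting and re-setting a key on every hit is what keeps Map's insertion order equal to recency order, so eviction only ever needs to look at the first key.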

getEvents/getRecentEvents ordered by created_at DESC alone, but created_at is ms-precision TEXT — a created event and an immediate addEvent landing in the same millisecond ordered non-deterministically, intermittently failing the issueStore event test. Add rowid DESC as a tiebreaker (monotonic with insertion order), making the order total. No schema change. Isolated 25/25 and full suite 692/692 across repeated runs. Refs: INT-1437
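
A small sketch of why the tiebreaker makes the order total. The SQL string uses an assumed table name (`issue_events`) and filter column (`issue_id`); only `created_at` and `rowid` come from the commit message. The comparator mirrors the ORDER BY in memory and assumes ISO-8601 timestamps, so that text comparison matches chronological order.

```typescript
// Assumed table and filter column; the ORDER BY clause is the point of the example.
export const EVENTS_QUERY =
  "SELECT rowid, * FROM issue_events WHERE issue_id = ? ORDER BY created_at DESC, rowid DESC";

export interface EventRow {
  rowid: number;      // monotonic with insertion order
  created_at: string; // ms-precision timestamp stored as TEXT (ISO-8601 assumed)
  type: string;
}

// In-memory equivalent of "ORDER BY created_at DESC, rowid DESC".
export function compareEvents(a: EventRow, b: EventRow): number {
  if (a.created_at !== b.created_at) {
    return a.created_at < b.created_at ? 1 : -1; // newer timestamp first
  }
  return b.rowid - a.rowid; // same millisecond: later insert first
}

// Sort a copy so the caller's array is left untouched.
export function sortEvents(rows: EventRow[]): EventRow[] {
  return rows.slice().sort(compareEvents);
}
```

Without the second key, two rows sharing a millisecond compare as equal and their relative order is left to the engine; with it, no two distinct rows compare as equal, so repeated runs return the same sequence.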

unohee added 6 commits June 18, 2026 19:08

Merge branch 'main' into bench/kimi-glm-harness-fixes

fea17fc

unohee merged commit c3dc1f1 into main Jun 18, 2026
9 checks passed

unohee deleted the bench/kimi-glm-harness-fixes branch June 18, 2026 11:04
