ci(e2e-bot): drive the PR head + exercise compaction #3050
Merged
The bot built reasonix from main-v2, so /e2e on a PR measured main-v2's agent, not the PR's code — worthless for pre-merge validation. Build the agent from the PR head (falling back to main-v2 only when the head predates run --metrics), shrink the e2e context_window to 20000 so the tiny suite actually crosses the compaction trigger, and add a compaction task whose six prose chapters force multi-file reads past the threshold while hiding the graded facts in the first and last chapter. Harness and suite still come from main-v2 so a PR can't weaken its own grader or tests.
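The ref selection described above can be sketched as a small decision function. This is an illustration only, not the workflow's actual code: `plan_refs` and the `has_metrics_binary` predicate are hypothetical names, with the predicate standing in for "this ref yields a binary that supports `run --metrics`".

```python
from typing import Callable, Dict


def plan_refs(
    head_ref: str,
    has_metrics_binary: Callable[[str], bool],  # hypothetical probe, supplied by the caller
    base_ref: str = "main-v2",
) -> Dict[str, str]:
    """Decide which git ref each e2e component is taken from."""
    # Prefer the PR head so the run measures the PR's own agent code.
    # Fall back to the base branch only when the head cannot supply a usable binary.
    agent_ref = head_ref if has_metrics_binary(head_ref) else base_ref
    # Harness and suite always come from the base branch, so a PR
    # cannot weaken its own grader or tests.
    return {"agent": agent_ref, "harness": base_ref, "suite": base_ref}


if __name__ == "__main__":
    print(plan_refs("pr-head", lambda ref: True))
    print(plan_refs("pr-head", lambda ref: False))
```

Keeping the probe as an injected callable keeps the rule testable without a real build; how the workflow actually detects the flag is not shown in this PR text.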
added 2 commits
June 4, 2026 00:16
A single 'read all six files' prompt let the agent batch the reads into one turn, so the whole corpus landed in the kept tail with no foldable middle and compaction never fired. Chaining each chapter to the next forces one read per turn; history accumulates and folds, and a real run now triggers 3 auto-compactions. The final chapter restates the full deliverable so the task stays solvable across a degraded summary.
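A toy model may help show why batching defeats folding. It is not reasonix's compaction algorithm: `count_folds`, the one-turn kept tail, the 500-token summary, the 200-token prompt, and the ~5,500 tokens per chapter (about 22 KB at an assumed 4 bytes per token) are all assumptions chosen for illustration.

```python
from typing import List


def count_folds(
    turn_tokens: List[int],
    trigger: int,
    keep_tail: int = 1,         # assumed: number of most recent turns never folded (must be >= 1)
    summary_tokens: int = 500,  # assumed: size of the summary that replaces the folded middle
) -> int:
    """Toy model: count how many times history folds as turns arrive."""
    history: List[int] = []
    folds = 0
    for tokens in turn_tokens:
        history.append(tokens)
        # Foldable middle: everything between the first turn and the kept tail.
        middle = history[1:-keep_tail]
        # Fold only when over the trigger AND folding would actually shrink history.
        if sum(history) > trigger and sum(middle) > summary_tokens:
            history = [history[0], summary_tokens] + history[-keep_tail:]
            folds += 1
    return folds


PROMPT = 200                  # assumed prompt size in tokens
CHAPTER = 5_500               # assumed tokens per chapter
TRIGGER = 20_000 * 8 // 10    # 0.8 of the 20000-token context_window

batched = [PROMPT, 6 * CHAPTER]    # all six reads land in one turn
chained = [PROMPT] + [CHAPTER] * 6  # one read per turn

if __name__ == "__main__":
    print("batched folds:", count_folds(batched, TRIGGER))
    print("chained folds:", count_folds(chained, TRIGGER))
```

In this model the batched run never folds because the whole corpus sits in the kept tail, while the chained run folds as history accumulates. The fold count it reports depends entirely on the assumed parameters and is not meant to reproduce the number of auto-compactions seen in a real run.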
What
The e2e bot has been measuring the wrong binary. The workflow builds `reasonix` from `main-v2` before checking out the PR head, then runs the suite with that main-v2 binary, so `/e2e` on a PR reported main-v2's accuracy/cache/token/cost and told us nothing about the PR's actual code. (The report footer even claimed "run on the PR head", which was false.)

On top of that, the fixed suite is three trivial tasks (fizzbuzz / palindrome / fix-add-bug) against a 64K context window, so compaction never fires: the cache/compaction path, which is the whole point of measuring cache-hit% and cost, was never exercised.
Changes
- Build the agent from the PR head, falling back to `main-v2` only when the head can't produce a `run --metrics`-capable binary (build break or predates the flag). Harness (`e2ebench`) and suite still come from `main-v2` so a PR can't weaken its own grader or tests.
- Shrink the e2e `context_window` to 20000 so a realistic task actually crosses the 0.8 compaction trigger.
- Add a `compaction` task: six ~22 KB prose chapters that force multi-file reads past the threshold. The graded facts (a village name in chapter 1, a House name in chapter 6) are woven into the prose under synonyms the prompt doesn't use, so the agent must read the chapters; it can't grep a needle and skip the context growth. Chapter 1's fact lands in the folded region and chapter 6's in the kept tail, so the task also checks that compaction preserves context.

Why it matters
This is the prerequisite for using `/e2e` to gate the cache/compaction PRs (#2405 / #2406 / #2407): only after this lands does an `/e2e` run reflect the PR's code and actually trigger the compaction logic those PRs change.
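As a rough check on the numbers in Changes, the sketch below compares the compaction trigger before and after the window change against an estimate of the new corpus size. The 4 bytes-per-token figure, reading "64K" as 64,000 and "~22 KB" as 22,000 bytes, and the helper names are assumptions; real tokenizer counts will differ.

```python
BYTES_PER_TOKEN = 4  # assumption: rough average for English prose


def trigger_tokens(context_window: int) -> int:
    """Token count at which compaction triggers (0.8 of the window)."""
    # Integer arithmetic avoids floating-point rounding on 0.8.
    return context_window * 8 // 10


def approx_tokens(n_bytes: int) -> int:
    """Very rough byte-to-token estimate under the assumption above."""
    return n_bytes // BYTES_PER_TOKEN


old_trigger = trigger_tokens(64_000)   # previous 64K window
new_trigger = trigger_tokens(20_000)   # window after this PR
corpus = approx_tokens(6 * 22_000)     # six ~22 KB chapters

if __name__ == "__main__":
    print("old trigger:", old_trigger)
    print("new trigger:", new_trigger)
    print("estimated corpus tokens:", corpus)
    print("crosses old trigger:", corpus > old_trigger)
    print("crosses new trigger:", corpus > new_trigger)
```

Under these assumptions the six chapters alone land between the two triggers, which is consistent with shrinking the window rather than only adding a larger task.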