
ci(e2e-bot): drive the PR head + exercise compaction #3050

Merged: esengine merged 3 commits into main-v2 from chore/e2e-bot-pr-head on Jun 4, 2026
Conversation

esengine (Owner) commented Jun 4, 2026

What

The e2e bot has been measuring the wrong binary. The workflow builds reasonix from main-v2 before checking out the PR head, then runs the suite with that main-v2 binary — so /e2e on a PR reported main-v2's accuracy/cache/token/cost and told us nothing about the PR's actual code. (The report footer even claimed "run on the PR head", which was false.)

On top of that, the fixed suite is three trivial tasks (fizzbuzz / palindrome / fix-add-bug) against a 64K context window, so compaction never fires — the cache/compaction path, which is the whole point of measuring cache-hit% and cost, was never exercised.

Changes

  • Build the agent under test from the PR head, falling back to the main-v2 build only when the head can't produce a binary that supports run --metrics (the build breaks, or the head predates the flag). The harness (e2ebench) and the suite still come from main-v2 so a PR can't weaken its own grader or tests.
  • Shrink the e2e context_window to 20000 so a realistic task actually crosses the 0.8 compaction trigger.
  • Add a compaction task: six ~22 KB prose chapters that force multi-file reads past the threshold. The graded facts (a village name in chapter 1, a House name in chapter 6) are woven into the prose under synonyms the prompt doesn't use, so the agent must read the chapters — it can't grep a needle and skip the context growth. Chapter 1's fact lands in the folded region and chapter 6's in the kept tail, so the task also checks that compaction preserves context.
  • Drop the false "run on the PR head" line from the harness footer; the workflow footer now states the real agent source.
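The effect of shrinking context_window can be checked with rough arithmetic. The sketch below assumes the trigger is a plain percentage of the window, that "64K" means 64,000 tokens, and that prose averages about 4 bytes per token; none of these are confirmed by the PR, and the function names are hypothetical. It also ignores system prompts and tool-call overhead, so the real crossing point may come earlier.

```go
package main

import "fmt"

// compactionThreshold returns the token count at which compaction would fire,
// assuming the trigger is a plain percentage of the context window.
func compactionThreshold(contextWindow, triggerPercent int) int {
	return contextWindow * triggerPercent / 100
}

// firstChapterOver returns the 1-based index of the chapter whose read pushes
// the running total past threshold, or 0 if the total never crosses it.
func firstChapterOver(chapters, tokensPerChapter, threshold int) int {
	total := 0
	for i := 1; i <= chapters; i++ {
		total += tokensPerChapter
		if total > threshold {
			return i
		}
	}
	return 0
}

func main() {
	const (
		chapters      = 6
		chapterBytes  = 22_000 // "~22 KB" per chapter, from the PR description
		bytesPerToken = 4      // assumption: rough average for English prose
	)
	tokensPerChapter := chapterBytes / bytesPerToken

	// 64_000 stands in for the old "64K" window (assumption); 20_000 is the new value.
	for _, window := range []int{64_000, 20_000} {
		threshold := compactionThreshold(window, 80) // 0.8 trigger
		fmt.Printf("window=%d threshold=%d first chapter over threshold=%d\n",
			window, threshold, firstChapterOver(chapters, tokensPerChapter, threshold))
	}
}
```

Under these assumptions the six chapters alone would not cross the trigger in the old window, while in the 20000-token window the running total passes the threshold during the third chapter.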

Why it matters

This is the prerequisite for using /e2e to gate the cache/compaction PRs (#2405 / #2406 / #2407): only after this lands does a /e2e run reflect the PR's code and actually trigger the compaction logic those PRs change.

The bot built reasonix from main-v2, so /e2e on a PR measured main-v2's agent, not the PR's code — worthless for pre-merge validation. Build the agent from the PR head (falling back to main-v2 only when the head predates run --metrics), shrink the e2e context_window to 20000 so the tiny suite actually crosses the compaction trigger, and add a compaction task whose six prose chapters force multi-file reads past the threshold while hiding the graded facts in the first and last chapter.

Harness and suite still come from main-v2 so a PR can't weaken its own grader or tests.
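The fallback rule described above can be written as a small decision function. This is a minimal sketch, not the workflow's actual implementation: the headBuild type, its fields, and the returned labels are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// headBuild describes the outcome of building the PR head (hypothetical type).
type headBuild struct {
	Compiled        bool // the PR head built without errors
	SupportsMetrics bool // the resulting binary accepts `run --metrics`
}

// agentSource picks which build the suite should drive. The PR head is used
// whenever it yields a metrics-capable binary; otherwise the main-v2 build is
// the fallback. The returned label is what a report footer could print.
func agentSource(h headBuild) string {
	if h.Compiled && h.SupportsMetrics {
		return "PR head"
	}
	return "main-v2 (fallback)"
}

func main() {
	cases := []headBuild{
		{Compiled: true, SupportsMetrics: true},   // normal case
		{Compiled: false, SupportsMetrics: false}, // build break
		{Compiled: true, SupportsMetrics: false},  // head predates the flag
	}
	for _, c := range cases {
		fmt.Printf("%+v -> agent from %s\n", c, agentSource(c))
	}
}
```

Reporting the chosen source next to the results keeps the footer truthful even when the fallback is taken.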
@github-actions github-actions Bot added the v2 Go rewrite (1.x) — main-v2 branch, active development label Jun 4, 2026
reasonix added 2 commits June 4, 2026 00:16
The bot needs to show whether a task actually triggered compaction — that's the signal the cache/compaction PRs (#2405-#2407) are measured by. metricsSink counts CompactionStarted; e2ebench adds a Compactions total to the summary and a per-task Compact column.
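A minimal sketch of the counting described in this commit follows. Only the names metricsSink and CompactionStarted come from the commit message; the event type, the Observe method, the taskResult struct, and the printed layout are assumptions made for illustration and may not match the real e2ebench code.

```go
package main

import "fmt"

// eventKind is a hypothetical event type; only CompactionStarted is named
// in the commit message.
type eventKind string

const (
	compactionStarted eventKind = "CompactionStarted"
	toolCalled        eventKind = "ToolCalled" // hypothetical, for contrast
)

// metricsSink counts compaction starts and ignores every other event.
type metricsSink struct {
	compactions int
}

// Observe records one event from the agent run.
func (s *metricsSink) Observe(k eventKind) {
	if k == compactionStarted {
		s.compactions++
	}
}

// taskResult is a hypothetical per-task row for the summary table.
type taskResult struct {
	Name        string
	Compactions int
}

// totalCompactions sums the per-task counts for the summary line.
func totalCompactions(results []taskResult) int {
	total := 0
	for _, r := range results {
		total += r.Compactions
	}
	return total
}

func main() {
	sink := &metricsSink{}
	for _, k := range []eventKind{toolCalled, compactionStarted, toolCalled, compactionStarted} {
		sink.Observe(k)
	}

	results := []taskResult{
		{Name: "fizzbuzz", Compactions: 0},
		{Name: "compaction", Compactions: sink.compactions},
	}
	fmt.Printf("%-12s %s\n", "Task", "Compact")
	for _, r := range results {
		fmt.Printf("%-12s %d\n", r.Name, r.Compactions)
	}
	fmt.Printf("Compactions: %d\n", totalCompactions(results))
}
```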
A single 'read all six files' prompt let the agent batch the reads into one turn, so the whole corpus landed in the kept tail with no foldable middle and compaction never fired. Chaining each chapter to the next forces one read per turn; history accumulates and folds, and a real run now triggers 3 auto-compactions. The final chapter restates the full deliverable so the task stays solvable across a degraded summary.
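The chaining idea can be illustrated with a small generator for the instruction appended to each chapter. This is a sketch under stated assumptions: the chapter-N.md file names, the wording of the instructions, and the chapterFooters function are all hypothetical, since the commit message does not show the actual corpus.

```go
package main

import "fmt"

// chapterFooters returns the instruction appended to the end of each of n
// chapters. Every chapter except the last points only at the next one, so the
// agent cannot batch all reads into a single turn. The last chapter restates
// the full deliverable so the task stays solvable after earlier turns have
// been folded into a summary.
func chapterFooters(n int, deliverable string) []string {
	footers := make([]string, n)
	for i := 1; i <= n; i++ {
		if i < n {
			// Hypothetical file naming: chapter-1.md ... chapter-n.md.
			footers[i-1] = fmt.Sprintf("Now read chapter-%d.md before answering.", i+1)
		} else {
			footers[i-1] = "Final step: " + deliverable
		}
	}
	return footers
}

func main() {
	deliverable := "report the village name and the House name."
	for i, f := range chapterFooters(6, deliverable) {
		fmt.Printf("chapter %d footer: %s\n", i+1, f)
	}
}
```

Because the initial prompt names only the first chapter, each later read happens in its own turn, which is what lets history accumulate and fold.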
@esengine esengine merged commit a182c3c into main-v2 Jun 4, 2026
8 checks passed
@esengine esengine deleted the chore/e2e-bot-pr-head branch June 4, 2026 07:51