ci(e2e-bot): drive the PR head + exercise compaction by esengine · Pull Request #3050 · esengine/DeepSeek-Reasonix

esengine · 2026-06-04T07:06:51Z

What

The e2e bot has been measuring the wrong binary. The workflow builds reasonix from main-v2 before checking out the PR head, then runs the suite with that main-v2 binary — so /e2e on a PR reported main-v2's accuracy/cache/token/cost and told us nothing about the PR's actual code. (The report footer even claimed "run on the PR head", which was false.)

On top of that, the fixed suite is three trivial tasks (fizzbuzz / palindrome / fix-add-bug) against a 64K context window, so compaction never fires — the cache/compaction path, which is the whole point of measuring cache-hit% and cost, was never exercised.

Changes

Build the agent under test from the PR head, falling back to the main-v2 build only when the head can't produce a binary that supports run --metrics (the build breaks, or the head predates the flag). Harness (e2ebench) and suite still come from main-v2 so a PR can't weaken its own grader or tests.
Shrink the e2e context_window to 20000 so a realistic task actually crosses the 0.8 compaction trigger.
Add a compaction task: six ~22 KB prose chapters that force multi-file reads past the threshold. The graded facts (a village name in chapter 1, a House name in chapter 6) are woven into the prose under synonyms the prompt doesn't use, so the agent must read the chapters — it can't grep a needle and skip the context growth. Chapter 1's fact lands in the folded region and chapter 6's in the kept tail, so the task also checks that compaction preserves context.
Drop the false "run on the PR head" line from the harness footer; the workflow footer now states the real agent source.

Why it matters

This is the prerequisite for using /e2e to gate the cache/compaction PRs (#2405 / #2406 / #2407): only after this lands does a /e2e run reflect the PR's code and actually trigger the compaction logic those PRs change.

The bot built reasonix from main-v2, so /e2e on a PR measured main-v2's agent, not the PR's code — worthless for pre-merge validation. Build the agent from the PR head (falling back to main-v2 only when the head predates run --metrics), shrink the e2e context_window to 20000 so the tiny suite actually crosses the compaction trigger, and add a compaction task whose six prose chapters force multi-file reads past the threshold while hiding the graded facts in the first and last chapter. Harness and suite still come from main-v2 so a PR can't weaken its own grader or tests.
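
The fallback rule can be sketched as a small decision function. This is only an illustration of the rule as described in this commit message, not the workflow's actual implementation; the type, field and function names are hypothetical.

```go
package main

import "fmt"

// headBuild records what the workflow learned about the PR head binary.
// Hypothetical type, for illustration only.
type headBuild struct {
	Built           bool // did the PR head compile?
	SupportsMetrics bool // does the resulting binary accept `run --metrics`?
}

// agentSource picks the binary the suite should drive and returns a reason
// that a report footer could state, so the report never misnames its source.
func agentSource(h headBuild) (source, reason string) {
	switch {
	case !h.Built:
		return "main-v2", "PR head failed to build"
	case !h.SupportsMetrics:
		return "main-v2", "PR head predates run --metrics"
	default:
		return "pr-head", "PR head built with run --metrics support"
	}
}

func main() {
	cases := []headBuild{
		{Built: true, SupportsMetrics: true},
		{Built: true, SupportsMetrics: false},
		{Built: false, SupportsMetrics: false},
	}
	for _, c := range cases {
		src, why := agentSource(c)
		fmt.Printf("%+v -> agent from %s (%s)\n", c, src, why)
	}
}
```

Returning the reason alongside the source keeps the footer honest in the fallback case: a reader can tell at a glance when the numbers describe main-v2 rather than the PR.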

The bot needs to show whether a task actually triggered compaction — that's the signal the cache/compaction PRs (#2405-#2407) are measured by. metricsSink counts CompactionStarted; e2ebench adds a Compactions total to the summary and a per-task Compact column.

A single 'read all six files' prompt let the agent batch the reads into one turn, so the whole corpus landed in the kept tail with no foldable middle and compaction never fired. Chaining each chapter to the next forces one read per turn; history accumulates and folds, and a real run now triggers 3 auto-compactions. The final chapter restates the full deliverable so the task stays solvable across a degraded summary.

github-actions Bot added the v2 Go rewrite (1.x) — main-v2 branch, active development label Jun 4, 2026

reasonix added 2 commits June 4, 2026 00:16

esengine merged commit a182c3c into main-v2 Jun 4, 2026
8 checks passed

esengine deleted the chore/e2e-bot-pr-head branch June 4, 2026 07:51

esengine mentioned this pull request Jun 4, 2026

feat(agent): structured, economical compaction (soft/force ratios + fold economics) #3073

Merged

esengine merged 3 commits into main-v2 from chore/e2e-bot-pr-head
