Offline context-engine evaluation for issue #286 is fixture-driven and deterministic. It targets
three behaviors already present in ContextEngine:
- layered context assembly
- token-guard trimming
- advisory track routing
- Fixture: `evals/fixtures/context/bench_v1.json`
- Benchmark library: `src/paperbot/context_engine/benchmark.py`
- CLI: `scripts/eval_context_engine.py`
- Smoke runner: `evals/runners/run_context_engine_benchmark_smoke.py`
Each case includes:
- `query`, `query_type`, `stage`
- routing setup: `active_track_id`, optional `track_id`
- optional `paper_id`
- optional `context_token_budget`
- `expected.layers`, `expected.token_guard`, optional `expected.router_track_id`
- deterministic store state under `state`
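The field names above can be illustrated with a hypothetical fixture case. All values below are invented for illustration and are not taken from `bench_v1.json`:

```python
# Hypothetical benchmark case showing the fields listed above.
# Values are illustrative only; see bench_v1.json for real cases.
case = {
    "query": "Summarize retrieval-augmented generation surveys",
    "query_type": "long",
    "stage": "survey",
    "active_track_id": "track-rag",
    "track_id": "track-rag",           # optional
    "paper_id": None,                  # optional
    "context_token_budget": 2000,      # optional
    "expected": {
        "layers": ["system", "track", "memory"],
        "token_guard": False,
        "router_track_id": "track-rag",  # optional
    },
    # Deterministic store state consumed by the fake stores.
    "state": {
        "papers": [],
        "memory_items": [],
    },
}
```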
The benchmark builds a real ContextEngine with fake research/memory stores and runs
build_context_pack() directly, so the scorer observes the same layer assembly and token-guard
logic that production code uses.
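The harness pattern can be sketched generically. The stub engine below is a stand-in for the real ContextEngine (whose constructor signature is not shown here); only the `build_context_pack()` call shape matches the description above:

```python
from dataclasses import dataclass


@dataclass
class StubEngine:
    """Stand-in for ContextEngine; mimics only the build_context_pack
    call shape this sketch needs."""
    populated_layers: tuple

    def build_context_pack(self, query: str) -> dict:
        return {"layers": list(self.populated_layers), "query": query}


def run_case(engine, case: dict) -> dict:
    """Run one fixture case and record which layers were populated,
    alongside the layers the fixture expected."""
    pack = engine.build_context_pack(case["query"])
    return {
        "actual_layers": set(pack["layers"]),
        "expected_layers": set(case["expected"]["layers"]),
    }


result = run_case(
    StubEngine(("system", "track")),
    {"query": "q", "expected": {"layers": ["system", "track"]}},
)
```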
The scorer reports six metrics:
- `layer_precision`: expected populated layers vs. actual populated layers
- `layer_recall`: share of expected populated layers recovered by the engine
- `token_guard_accuracy`: whether guard triggering matches expectation
- `token_guard_trigger_rate`: observed guard trigger rate across cases
- `router_coverage`: share of router-evaluable cases that produced a suggestion
- `router_accuracy`: share of router-evaluable cases whose suggested track matched expectation
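As a sketch of how the per-case layer metrics could be computed (the helper below is hypothetical, not the actual `benchmark.py` implementation), assuming expected and actual populated layers arrive as sets of layer names:

```python
def layer_metrics(expected: set, actual: set) -> tuple:
    """Precision/recall over populated layer names for a single case.

    Hypothetical helper; the real scorer lives in
    src/paperbot/context_engine/benchmark.py.
    """
    overlap = len(expected & actual)
    if not actual:
        # No layers populated: vacuously precise only if none expected.
        precision = 1.0 if not expected else 0.0
    else:
        precision = overlap / len(actual)
    recall = 1.0 if not expected else overlap / len(expected)
    return precision, recall


# Example: the engine populated one layer beyond the expected two.
p, r = layer_metrics({"system", "track"}, {"system", "track", "memory"})
```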
```shell
PYTHONPATH=src python scripts/eval_context_engine.py \
  --fixtures evals/fixtures/context/bench_v1.json \
  --output output/reports/context_bench_v1.json \
  --fail-under-layer-precision 0.95 \
  --fail-under-token-guard-accuracy 1.0 \
  --fail-under-router-coverage 1.0 \
  --fail-under-router-accuracy 1.0
```

The seed cases cover the combinations called out in the issue:
- stages: `survey`, `writing`, `rebuttal`
- query types: `short`, `long`, `track_query`, `paper_targeted`
- one explicit token-guard overflow case
- two router-evaluable cases with deterministic switch expectations
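The `--fail-under-*` gating can be sketched as follows. The threshold names mirror the CLI flags shown earlier, but the comparison logic is an assumption, not the actual `eval_context_engine.py` code:

```python
import sys


def apply_gates(metrics: dict, gates: dict) -> int:
    """Return a process exit code: 0 if every gated metric meets its
    threshold, 1 otherwise. Hypothetical sketch of --fail-under-* behavior."""
    failures = [
        name for name, threshold in gates.items()
        if metrics.get(name, 0.0) < threshold
    ]
    for name in failures:
        print(
            f"FAIL: {name}={metrics.get(name, 0.0):.3f} "
            f"< required {gates[name]:.2f}",
            file=sys.stderr,
        )
    return 1 if failures else 0


# Thresholds mirroring the CLI invocation above.
gates = {
    "layer_precision": 0.95,
    "token_guard_accuracy": 1.0,
    "router_coverage": 1.0,
    "router_accuracy": 1.0,
}
code = apply_gates(
    {
        "layer_precision": 0.97,
        "token_guard_accuracy": 1.0,
        "router_coverage": 1.0,
        "router_accuracy": 1.0,
    },
    gates,
)
```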