This repository shares reproducible reliability probe artifacts on gpt-5.3-codex.
Reports:
- reports/report_codex_benchmark_20260211_singlefile_full_30k.md (Japanese)
- reports/report_codex_benchmark_20260211_singlefile_full_30k_en.md (English)
- reports/codex-turn-budget-decision-20260211.md
- reports/codex-strategy100-execution-report-20260211.md
Raw outputs:
- data/strategy100/baseline/
- data/strategy100/recap/
- data/strategy100/snapshot/
Runner:
scripts/codex_final_recall_probe.sh
Measured result (100 turns, 1 trial each):
| Strategy | Strict | Semantic | Outcome |
|---|---|---|---|
| baseline | 0/1 | 0/1 | failed at turn 29 (mid-turn) |
| recap (every 10 turns) | 1/1 | 1/1 | pass |
| snapshot (every 10 turns) | 1/1 | 1/1 | pass |
Cost signal from this run:
- recap final-turn input_tokens: 17,961,775
- snapshot final-turn input_tokens: 440,459
- snapshot context footprint was about 40.8x smaller than recap at the final turn.
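The 40.8x figure above follows directly from the two token counts:

```python
# Reproduce the context-footprint ratio from the reported final-turn counts.
recap_final_input_tokens = 17_961_775
snapshot_final_input_tokens = 440_459

ratio = recap_final_input_tokens / snapshot_final_input_tokens
print(f"snapshot is ~{ratio:.1f}x smaller at the final turn")  # ~40.8x
```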
```bash
# baseline
scripts/codex_final_recall_probe.sh \
  --strategy baseline \
  --plan 100x1 \
  --max-input-tokens 20000000 \
  --max-delta-input-tokens 300000

# recap (every 10 turns)
scripts/codex_final_recall_probe.sh \
  --strategy recap \
  --recap-interval 10 \
  --plan 100x1 \
  --max-input-tokens 20000000 \
  --max-delta-input-tokens 300000

# snapshot (every 10 turns)
scripts/codex_final_recall_probe.sh \
  --strategy snapshot \
  --snapshot-interval 10 \
  --plan 100x1 \
  --max-input-tokens 20000000 \
  --max-delta-input-tokens 300000

# render line chart from raw per_turn.tsv
python3 scripts/render_strategy100_context_growth_line_svg.py
```
- Recommended validation set: baseline@100, recap@100, snapshot@100.
- Routine stop-line: <=100 turns.
- For routine operation, avoid >100 turns unless running an explicit budget-approved stress test.
- This is a controlled reliability probe for long-turn operation, not a general coding benchmark.
- The core question is: which orchestration pattern preserves exact recall while controlling context cost?
- Main outcome from this dataset: baseline failed early (turn 29); recap and snapshot both reached 100 turns; snapshot achieved the same reliability with a dramatically lower context footprint.
If the goal is higher real-world reliability per token budget, prioritize these levers:
- Thread segmentation by design (snapshot pattern)
- Reset thread state every fixed interval (for example every 10 turns).
- Carry only compact state forward (token/checkpoint + required constraints).
- This directly controls context growth and reduces late-turn instability.
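As a minimal sketch, the snapshot pattern described above could look like the following. `run_turn` and `summarize` are stub stand-ins for a real model client and summarizer, not this repository's API:

```python
# Sketch of the snapshot pattern: reset thread state every N turns,
# carrying only a compact state summary forward. run_turn/summarize are
# illustrative stubs, not the repo's actual client.
SNAPSHOT_INTERVAL = 10

def run_turn(context, prompt):
    # stub: a real client would send `context + prompt` to the model
    return f"reply({len(context)} ctx chars, {prompt!r})"

def summarize(reply):
    # stub: a real summarizer would extract the checkpoint token
    # plus the required constraints, nothing else
    return reply[:40]

def run_with_snapshots(prompts, interval=SNAPSHOT_INTERVAL):
    context = ""   # full thread context, grows turn by turn
    state = ""     # compact carried state, survives thread resets
    reply = ""
    for turn, prompt in enumerate(prompts, start=1):
        reply = run_turn(context, f"{state} {prompt}".strip())
        context += reply              # context grows within a segment
        if turn % interval == 0:
            state = summarize(reply)  # keep only the compact snapshot
            context = ""              # reset thread: growth restarts near zero
    return reply
```

Because `context` is cleared every `interval` turns, its size is bounded by one segment instead of the whole run, which is what keeps the final-turn input_tokens low.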
- Periodic memory anchoring (recap pattern)
- Re-state critical invariants at fixed cadence.
- Useful when single-thread continuity is required, but expect higher token cost than snapshot.
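A sketch of the recap cadence, under the assumption that recaps are injected as prompt prefixes; the invariant text is a made-up example:

```python
# Sketch of the recap pattern: one continuous thread, with critical
# invariants re-stated at a fixed cadence. INVARIANTS is a hypothetical
# example, not the probe's actual payload.
RECAP_INTERVAL = 10
INVARIANTS = "checkpoint token = ALPHA-7; keep all stated constraints"

def build_prompts(task_prompts, interval=RECAP_INTERVAL):
    prompts = []
    for turn, prompt in enumerate(task_prompts, start=1):
        if turn > 1 and turn % interval == 1:
            # first turn after each interval boundary: anchor memory
            prompt = f"RECAP: {INVARIANTS}\n{prompt}"
        prompts.append(prompt)
    return prompts
```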
- Hard budget guardrails
- Stop or rotate strategy when input_tokens or the per-turn delta exceeds its threshold.
- Treat guardrails as part of correctness, not only cost control.
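The guardrail check can be sketched as follows, mirroring the --max-input-tokens and --max-delta-input-tokens limits used by the runner (the function itself is illustrative):

```python
# Hard budget guardrail: abort (or rotate strategy) when cumulative
# input_tokens or the per-turn delta exceeds its limit. Thresholds match
# the flags used in the runner commands above.
MAX_INPUT_TOKENS = 20_000_000
MAX_DELTA_INPUT_TOKENS = 300_000

def check_guardrails(input_tokens_per_turn):
    """Return the first turn that breaches a limit, or None if all pass."""
    prev = 0
    for turn, tokens in enumerate(input_tokens_per_turn, start=1):
        delta = tokens - prev
        if tokens > MAX_INPUT_TOKENS or delta > MAX_DELTA_INPUT_TOKENS:
            return turn  # stop here; continuing risks unbounded cost
        prev = tokens
    return None
```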
- Failure-type-aware QA
- Separate mid_turn failures from final_recall failures.
- Different failure types require different mitigations (state reset, recap cadence, prompt tightening).
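A minimal triage sketch for this split; the per-trial field names here are assumptions, not the probe's actual output schema:

```python
# Failure-type-aware triage: mid_turn vs final_recall failures call for
# different mitigations. Field names (completed_turns, strict_recall)
# are hypothetical, not the probe's real schema.
def classify(trial):
    if trial.get("completed_turns", 0) < trial.get("planned_turns", 0):
        return "mid_turn"      # died before finishing the plan (e.g. turn 29)
    if not trial.get("strict_recall", False):
        return "final_recall"  # finished, but failed exact recall at the end
    return "pass"

MITIGATION = {
    "mid_turn": "reset thread state more often (snapshot pattern)",
    "final_recall": "tighten recap cadence / prompt anchoring",
    "pass": "none",
}
```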
- Increase statistical confidence
- Current run count is intentionally small.
- Recommended next step: repeat each strategy with n>=5 (or n>=10) and publish confidence intervals.
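One standard way to publish such intervals is the Wilson score interval for a binomial pass rate, which behaves well at small n:

```python
# Wilson score interval for a pass rate (binomial proportion).
# Suitable for small n such as n>=5 repeated trials per strategy.
import math

def wilson_interval(passes, n, z=1.96):
    """95% Wilson score interval (z=1.96) for passes out of n trials."""
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half
```

For example, 5/5 passes still only yields a lower bound of about 0.57, which is why repeating trials matters before claiming reliability.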
- These are controlled reliability probes, not universal benchmarks.
- Sample size is still small; treat this as a reproducibility package and baseline for follow-up trials.

