This repository shares reproducible reliability probe artifacts on gpt-5.3-codex.
Reports:
- reports/report_codex_benchmark_20260211_singlefile_full_30k.md (Japanese)
- reports/report_codex_benchmark_20260211_singlefile_full_30k_en.md (English)
- reports/codex-turn-budget-decision-20260211.md
- reports/codex-strategy100-execution-report-20260211.md
Raw outputs:
- data/strategy100/baseline/
- data/strategy100/recap/
- data/strategy100/snapshot/
Runner:
scripts/codex_final_recall_probe.sh
Measured result (100 turns, 1 trial each):
| Strategy | Strict | Semantic | Outcome |
|---|---|---|---|
| baseline | 0/1 | 0/1 | failed at turn 29 (mid-turn) |
| recap (every 10 turns) | 1/1 | 1/1 | pass |
| snapshot (every 10 turns) | 1/1 | 1/1 | pass |
Cost signal from this run:
- recap final-turn input_tokens: 17,961,775
- snapshot final-turn input_tokens: 440,459
- snapshot context footprint was about 40.8x smaller than recap at the final turn.
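The 40.8x figure above follows directly from the two token counts:

```python
# Reproduce the context-footprint ratio from the reported final-turn counts.
recap_final_input_tokens = 17_961_775
snapshot_final_input_tokens = 440_459

ratio = recap_final_input_tokens / snapshot_final_input_tokens
print(f"snapshot is ~{ratio:.1f}x smaller at the final turn")  # ~40.8x
```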
```bash
# baseline
scripts/codex_final_recall_probe.sh \
  --strategy baseline \
  --plan 100x1 \
  --max-input-tokens 20000000 \
  --max-delta-input-tokens 300000

# recap (every 10 turns)
scripts/codex_final_recall_probe.sh \
  --strategy recap \
  --recap-interval 10 \
  --plan 100x1 \
  --max-input-tokens 20000000 \
  --max-delta-input-tokens 300000

# snapshot (every 10 turns)
scripts/codex_final_recall_probe.sh \
  --strategy snapshot \
  --snapshot-interval 10 \
  --plan 100x1 \
  --max-input-tokens 20000000 \
  --max-delta-input-tokens 300000

# render line chart from raw per_turn.tsv
python3 scripts/render_strategy100_context_growth_line_svg.py
```
- Recommended validation set: baseline@100, recap@100, snapshot@100.
- Routine stop-line: <=100 turns.
- For routine operation, avoid >100 turns unless running an explicit budget-approved stress test.
- This is a controlled reliability probe for long-turn operation, not a general coding benchmark.
- The core question is: which orchestration pattern preserves exact recall while controlling context cost?
- Main outcome from this dataset: baseline failed early (turn 29); recap and snapshot both reached 100 turns; snapshot achieved the same reliability with a dramatically lower context footprint.
If the goal is higher real-world reliability per token budget, prioritize these levers:
- Thread segmentation by design (snapshot pattern)
- Reset thread state every fixed interval (for example every 10 turns).
- Carry only compact state forward (token/checkpoint + required constraints).
- This directly controls context growth and reduces late-turn instability.
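As a minimal sketch, the snapshot pattern described above could look like the following. `run_turn` and `summarize` are stub stand-ins for a real model client and summarizer, not this repository's API:

```python
# Sketch of the snapshot pattern: reset thread state every N turns,
# carrying only a compact state summary forward. run_turn/summarize are
# illustrative stubs, not the repo's actual client.
SNAPSHOT_INTERVAL = 10

def run_turn(context, prompt):
    # stub: a real client would send `context + prompt` to the model
    return f"reply({len(context)} ctx chars, {prompt!r})"

def summarize(reply):
    # stub: a real summarizer would extract the checkpoint token
    # plus the required constraints, nothing else
    return reply[:40]

def run_with_snapshots(prompts, interval=SNAPSHOT_INTERVAL):
    context = ""   # full thread context, grows turn by turn
    state = ""     # compact carried state, survives thread resets
    reply = ""
    for turn, prompt in enumerate(prompts, start=1):
        reply = run_turn(context, f"{state} {prompt}".strip())
        context += reply              # context grows within a segment
        if turn % interval == 0:
            state = summarize(reply)  # keep only the compact snapshot
            context = ""              # reset thread: growth restarts near zero
    return reply
```

Because `context` is cleared every `interval` turns, its size is bounded by one segment instead of the whole run, which is what keeps the final-turn input_tokens low.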
- Periodic memory anchoring (recap pattern)
- Re-state critical invariants at fixed cadence.
- Useful when single-thread continuity is required, but expect higher token cost than snapshot.
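A sketch of the recap cadence, under the assumption that recaps are injected as prompt prefixes; the invariant text is a made-up example:

```python
# Sketch of the recap pattern: one continuous thread, with critical
# invariants re-stated at a fixed cadence. INVARIANTS is a hypothetical
# example, not the probe's actual payload.
RECAP_INTERVAL = 10
INVARIANTS = "checkpoint token = ALPHA-7; keep all stated constraints"

def build_prompts(task_prompts, interval=RECAP_INTERVAL):
    prompts = []
    for turn, prompt in enumerate(task_prompts, start=1):
        if turn > 1 and turn % interval == 1:
            # first turn after each interval boundary: anchor memory
            prompt = f"RECAP: {INVARIANTS}\n{prompt}"
        prompts.append(prompt)
    return prompts
```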
- Hard budget guardrails
- Stop or rotate strategy when input_tokens or the per-turn delta exceeds its threshold.
- Treat guardrails as part of correctness, not only cost control.
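The guardrail check can be sketched as follows, mirroring the --max-input-tokens and --max-delta-input-tokens limits used by the runner (the function itself is illustrative):

```python
# Hard budget guardrail: abort (or rotate strategy) when cumulative
# input_tokens or the per-turn delta exceeds its limit. Thresholds match
# the flags used in the runner commands above.
MAX_INPUT_TOKENS = 20_000_000
MAX_DELTA_INPUT_TOKENS = 300_000

def check_guardrails(input_tokens_per_turn):
    """Return the first turn that breaches a limit, or None if all pass."""
    prev = 0
    for turn, tokens in enumerate(input_tokens_per_turn, start=1):
        delta = tokens - prev
        if tokens > MAX_INPUT_TOKENS or delta > MAX_DELTA_INPUT_TOKENS:
            return turn  # stop here; continuing risks unbounded cost
        prev = tokens
    return None
```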
- Failure-type-aware QA
- Separate mid_turn failures from final_recall failures.
- Different failure types require different mitigations (state reset, recap cadence, prompt tightening).
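A minimal triage sketch for this split; the per-trial field names here are assumptions, not the probe's actual output schema:

```python
# Failure-type-aware triage: mid_turn vs final_recall failures call for
# different mitigations. Field names (completed_turns, strict_recall)
# are hypothetical, not the probe's real schema.
def classify(trial):
    if trial.get("completed_turns", 0) < trial.get("planned_turns", 0):
        return "mid_turn"      # died before finishing the plan (e.g. turn 29)
    if not trial.get("strict_recall", False):
        return "final_recall"  # finished, but failed exact recall at the end
    return "pass"

MITIGATION = {
    "mid_turn": "reset thread state more often (snapshot pattern)",
    "final_recall": "tighten recap cadence / prompt anchoring",
    "pass": "none",
}
```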
- Increase statistical confidence
- Current run count is intentionally small.
- Recommended next step: repeat each strategy with n>=5 (or n>=10) and publish confidence intervals.
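One standard way to publish such intervals is the Wilson score interval for a binomial pass rate, which behaves well at small n:

```python
# Wilson score interval for a pass rate (binomial proportion).
# Suitable for small n such as n>=5 repeated trials per strategy.
import math

def wilson_interval(passes, n, z=1.96):
    """95% Wilson score interval (z=1.96) for passes out of n trials."""
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half
```

For example, 5/5 passes still only yields a lower bound of about 0.57, which is why repeating trials matters before claiming reliability.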
- These are controlled reliability probes, not universal benchmarks.
- Sample size is still small; treat this as a reproducibility package and baseline for follow-up trials.

