moc-com/codex-reliability-probes

Codex Reliability Probes (Deterministic Exact-Match)

This repository publishes reproducible reliability-probe artifacts for gpt-5.3-codex.

Reports

  • reports/report_codex_benchmark_20260211_singlefile_full_30k.md (Japanese)
  • reports/report_codex_benchmark_20260211_singlefile_full_30k_en.md (English)
  • reports/codex-turn-budget-decision-20260211.md
  • reports/codex-strategy100-execution-report-20260211.md

Figure

  • 100-turn strategy bar dashboard
  • 100-turn strategy comparison
  • 100-turn context growth line chart

New Strategy-100 Dataset (baseline vs recap vs snapshot)

Raw outputs:

  • data/strategy100/baseline/
  • data/strategy100/recap/
  • data/strategy100/snapshot/

Runner:

  • scripts/codex_final_recall_probe.sh

Measured result (100 turns, 1 trial each):

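| Strategy                  | Strict | Semantic | Outcome                      |
|---------------------------|--------|----------|------------------------------|
| baseline                  | 0/1    | 0/1      | failed at turn 29 (mid-turn) |
| recap (every 10 turns)    | 1/1    | 1/1      | pass                         |
| snapshot (every 10 turns) | 1/1    | 1/1      | pass                         |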

Cost signal from this run:

  • recap final-turn input_tokens: 17,961,775
  • snapshot final-turn input_tokens: 440,459
  • At the final turn, the snapshot input_tokens footprint was about 40.8x smaller than recap's (17,961,775 / 440,459 ≈ 40.8).
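
The ratio quoted above can be checked directly from the two reported final-turn values. This is only an arithmetic check on the published numbers; the variable names are illustrative and not taken from the repository's scripts.

```python
# Final-turn input_tokens as reported in this README.
RECAP_FINAL_INPUT_TOKENS = 17_961_775
SNAPSHOT_FINAL_INPUT_TOKENS = 440_459

# How many times larger recap's final-turn footprint is than snapshot's.
ratio = RECAP_FINAL_INPUT_TOKENS / SNAPSHOT_FINAL_INPUT_TOKENS

# Equivalent view: the share of recap's final-turn tokens that snapshot avoids.
saving = 1 - SNAPSHOT_FINAL_INPUT_TOKENS / RECAP_FINAL_INPUT_TOKENS

print(f"recap / snapshot = {ratio:.1f}x")    # 40.8x
print(f"snapshot saving  = {saving:.1%}")    # 97.5%
```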

Reproduce (CLI)

# baseline
scripts/codex_final_recall_probe.sh \
  --strategy baseline \
  --plan 100x1 \
  --max-input-tokens 20000000 \
  --max-delta-input-tokens 300000

# recap (every 10 turns)
scripts/codex_final_recall_probe.sh \
  --strategy recap \
  --recap-interval 10 \
  --plan 100x1 \
  --max-input-tokens 20000000 \
  --max-delta-input-tokens 300000

# snapshot (every 10 turns)
scripts/codex_final_recall_probe.sh \
  --strategy snapshot \
  --snapshot-interval 10 \
  --plan 100x1 \
  --max-input-tokens 20000000 \
  --max-delta-input-tokens 300000

# render line chart from raw per_turn.tsv
python3 scripts/render_strategy100_context_growth_line_svg.py
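
If the three runs are driven from Python rather than typed by hand, the command lines above could be assembled as in the sketch below. It uses only the flags shown in this section; the helper name `build_command` is hypothetical and not part of the repository, and the sketch only prints the commands instead of executing them.

```python
import shlex

RUNNER = "scripts/codex_final_recall_probe.sh"


def build_command(strategy, interval=None, plan="100x1",
                  max_input_tokens=20_000_000,
                  max_delta_input_tokens=300_000):
    """Build the argv list for one probe run (hypothetical helper)."""
    if strategy not in ("baseline", "recap", "snapshot"):
        raise ValueError(f"unknown strategy: {strategy}")
    argv = [RUNNER, "--strategy", strategy]
    if strategy != "baseline":
        # recap and snapshot each take their own interval flag:
        # --recap-interval or --snapshot-interval.
        if interval is None:
            raise ValueError(f"{strategy} requires an interval")
        argv += [f"--{strategy}-interval", str(interval)]
    argv += [
        "--plan", plan,
        "--max-input-tokens", str(max_input_tokens),
        "--max-delta-input-tokens", str(max_delta_input_tokens),
    ]
    return argv


for strategy, interval in [("baseline", None), ("recap", 10), ("snapshot", 10)]:
    # Print only; pass the list to subprocess.run(...) to actually execute.
    print(shlex.join(build_command(strategy, interval)))
```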

Operational Guidance

  • Recommended validation set: baseline@100, recap@100, snapshot@100.
  • Routine stop-line: at most 100 turns per run.
  • Go beyond 100 turns only as an explicit, budget-approved stress test.

What This Is (Why It Matters)

  • This is a controlled reliability probe for long-turn operation, not a general coding benchmark.
  • The core question is: which orchestration pattern preserves exact recall while controlling context cost?
  • Main outcome from this dataset:
    • baseline failed early (turn 29).
    • recap and snapshot both reached 100 turns.
    • snapshot matched recap's pass result in this single trial while using about 40.8x fewer final-turn input_tokens.
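
One way to build intuition for the footprint gap is a toy growth model: a single thread keeps accumulating context, while a thread that is reset at a fixed interval stays bounded. The sketch below is an illustration only. `tokens_per_turn` and `carry_tokens` are made-up values, and the model does not attempt to reproduce how `input_tokens` is measured in this dataset.

```python
def context_sizes(turns, tokens_per_turn, reset_interval=None, carry_tokens=0):
    """Return the thread's context size (in tokens) after each turn.

    Toy model: every turn appends a fixed number of tokens. If
    reset_interval is set, the thread is restarted after every
    reset_interval turns, keeping only carry_tokens of compact state.
    """
    sizes = []
    context = 0
    for turn in range(1, turns + 1):
        context += tokens_per_turn
        sizes.append(context)
        if reset_interval and turn % reset_interval == 0:
            # Snapshot-style reset: carry only compact state forward.
            context = carry_tokens
    return sizes


single = context_sizes(100, tokens_per_turn=1_000)
segmented = context_sizes(100, tokens_per_turn=1_000,
                          reset_interval=10, carry_tokens=200)

print(max(single), max(segmented))  # 100000 10200
```

In this model the single thread grows linearly with the turn count, whereas the segmented thread never exceeds one interval's worth of turns plus the carried state.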

Model/Workflow Improvement Levers

If the goal is higher real-world reliability per token budget, prioritize these levers:

  1. Thread segmentation by design (snapshot pattern)
    • Reset thread state every fixed interval (for example every 10 turns).
    • Carry only compact state forward (token/checkpoint + required constraints).
    • This directly controls context growth and reduces late-turn instability.
  2. Periodic memory anchoring (recap pattern)
    • Re-state critical invariants at fixed cadence.
    • Useful when single-thread continuity is required, but expect higher token cost than snapshot.
  3. Hard budget guardrails
    • Stop or rotate strategy when input_tokens or per-turn delta exceeds threshold.
    • Treat guardrails as part of correctness, not only cost control.
  4. Failure-type-aware QA
    • Separate mid_turn failures from final_recall failures.
    • Different failure types require different mitigations (state reset, recap cadence, prompt tightening).
  5. Increase statistical confidence
    • Current run count is intentionally small.
    • Recommended next step: repeat each strategy with n>=5 (or n>=10) and publish confidence intervals.
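
The hard budget guardrail lever can be sketched as a per-turn check. The thresholds mirror the `--max-input-tokens` and `--max-delta-input-tokens` values from the Reproduce section plus the 100-turn stop-line, but the `Budget` class, the `check_budget` function, and the reason strings are hypothetical. How the runner script enforces its flags internally is not documented here, so treat this as an assumption about the intent, not a description of the script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    """Thresholds for one run (defaults taken from this README's examples)."""
    max_input_tokens: int = 20_000_000
    max_delta_input_tokens: int = 300_000
    max_turns: int = 100


def check_budget(turn, input_tokens, prev_input_tokens, budget=Budget()):
    """Return a stop reason string, or None if the run may continue."""
    if turn > budget.max_turns:
        return "turn_limit"
    if input_tokens > budget.max_input_tokens:
        return "max_input_tokens"
    if input_tokens - prev_input_tokens > budget.max_delta_input_tokens:
        return "max_delta_input_tokens"
    return None


# Example: a turn whose input_tokens jumped by 400,000 trips the delta guard.
print(check_budget(turn=30, input_tokens=1_400_000, prev_input_tokens=1_000_000))
```

A caller would stop the run or rotate to a different strategy whenever the returned reason is not `None`.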

Notes

  • These are controlled reliability probes, not universal benchmarks.
  • Sample size is still small; treat this as a reproducibility package and baseline for follow-up trials.
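
To make the sample-size caveat concrete, the sketch below computes a Wilson score interval for a pass rate. The Wilson interval is one common choice for small samples and is an assumption here; the repository does not specify which interval method follow-up trials should use.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the edges.
    return max(0.0, center - half), min(1.0, center + half)


# A single passing trial (1/1) compared with hypothetical all-pass follow-ups.
for n in (1, 5, 10):
    low, high = wilson_interval(n, n)
    print(f"{n}/{n} passes: [{low:.2f}, {high:.2f}]")
```

With one pass out of one trial the lower bound is only about 0.21, so a 1/1 result is compatible with a low underlying pass rate. The 5/5 and 10/10 rows are hypothetical outcomes, included only to show how the interval would tighten.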
