
[codex] S3 cross-topic scheduled injection benchmark #26

Draft
Joacocade wants to merge 4 commits into codex/s2-issue19-baseline from codex/s3-cross-topic-injection


@Joacocade

Collaborator

Linked issue

S3 follow-up to S2 Issue 19 / PR #25. This PR is intentionally stacked on codex/s2-issue19-baseline so the diff only contains the S3 cross-topic extension.

Summary

Adds a reproducible S3 cross-topic scheduled-injection benchmark under backtesting/s3-cross-topic-injection/.

What changed:

  • Added compact topic packets for football, Bolivia, and IPC.
  • Added a 3 topics x 2 models x 7 conditions matrix.
  • Added a resumable runner with prepared-simulation reuse.
  • Added package validation, technical summarization, and deterministic metric extraction scripts.
  • Added committed smoke/full summaries, condition metrics, run ledger, final report, and detailed result analysis.
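The resumable runner with prepared-simulation reuse can be pictured roughly as follows. This is a minimal sketch, not the code in run_s3_matrix.py: the function `run_matrix`, its `prepare` and `execute` callbacks, and the row and ledger shapes are all hypothetical names chosen for illustration.

```python
def run_matrix(rows, completed, prepare, execute):
    """Run every row not already in `completed` (hypothetical resume logic).

    One simulation is prepared per (topic, model) pair and reused for
    all conditions of that pair.
    """
    prepared = {}
    for row in rows:
        key = (row["topic"], row["model"], row["condition"])
        if key in completed:
            continue  # resume: this run already finished earlier
        pair = (row["topic"], row["model"])
        if pair not in prepared:
            prepared[pair] = prepare(*pair)  # prepare once per pair
        execute(prepared[pair], row["condition"])
        completed.add(key)
    return prepared


# Stub callbacks so the sketch runs without any real simulation backend.
prepare_calls = []
executed = []
rows = [
    {"topic": "ipc", "model": "m", "condition": c}
    for c in ("baseline-control", "signal-mid", "signal-late")
]
done = {("ipc", "m", "baseline-control")}  # pretend this run finished earlier

run_matrix(
    rows,
    done,
    prepare=lambda t, m: prepare_calls.append((t, m)) or f"sim:{t}:{m}",
    execute=lambda sim, cond: executed.append((sim, cond)),
)
```

In this sketch the already-completed baseline is skipped, and the two remaining conditions share a single prepared simulation.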

Experimental design

Models:

  • DeepInfra google/gemma-3-27b-it
  • DeepInfra meta-llama/Llama-3.3-70B-Instruct-Turbo

Topics:

  • football: Argentina vs Colombia, Copa America 2024 final
  • bolivia: Bolivia 2025 presidential runoff
  • ipc: Argentina IPC 2025 forecast

Conditions per topic/model:

  • baseline-control
  • signal-early
  • signal-mid
  • signal-late
  • counter-signal-mid
  • noise-near-mid
  • noise-off-mid

Full matrix size: 42 runs.
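The 42-run figure is the Cartesian product of the three lists above. A minimal sketch of the enumeration, assuming a plain dict per run (the `build_matrix` helper and the row shape are illustrative, not taken from the committed scripts):

```python
from itertools import product

TOPICS = ["football", "bolivia", "ipc"]
MODELS = [
    "google/gemma-3-27b-it",
    "meta-llama/Llama-3.3-70B-Instruct-Turbo",
]
CONDITIONS = [
    "baseline-control",
    "signal-early",
    "signal-mid",
    "signal-late",
    "counter-signal-mid",
    "noise-near-mid",
    "noise-off-mid",
]


def build_matrix():
    """Return one dict per (topic, model, condition) combination."""
    return [
        {"topic": t, "model": m, "condition": c}
        for t, m, c in product(TOPICS, MODELS, CONDITIONS)
    ]


matrix = build_matrix()
baselines = [r for r in matrix if r["condition"] == "baseline-control"]
injected = [r for r in matrix if r["condition"] != "baseline-control"]
```

Here `len(matrix)` is 3 x 2 x 7 = 42, with 6 baseline rows and 36 injected rows, which matches the counts reported under Results.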

Results

Technical validity:

  • 42/42 full matrix rows valid.
  • 6/6 baselines fired zero scheduled events.
  • 36/36 injected conditions fired exactly one scheduled event.
  • 6/6 topic/model pairs reused one prepared simulation across their seven conditions.
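The event-audit rule behind these counts (zero scheduled events for baselines, exactly one for every injected condition) can be sketched as below. The `audit_row` and `audit_matrix` helpers and the `events_fired` field are assumptions for illustration; the actual checks live in the committed validation and summarization scripts, and the sample rows are made up, not benchmark output.

```python
def audit_row(row):
    """Check one run against the expected scheduled-event count.

    `row` is a hypothetical dict with 'condition' and 'events_fired' keys.
    Baselines must fire zero events; injected conditions exactly one.
    """
    expected = 0 if row["condition"] == "baseline-control" else 1
    return row["events_fired"] == expected


def audit_matrix(rows):
    """Summarize how many rows pass the event audit."""
    failures = [r for r in rows if not audit_row(r)]
    return {
        "total": len(rows),
        "valid": len(rows) - len(failures),
        "failures": failures,
    }


# Made-up sample rows, including one deliberate failure.
sample = [
    {"condition": "baseline-control", "events_fired": 0},
    {"condition": "signal-mid", "events_fired": 1},
    {"condition": "noise-off-mid", "events_fired": 2},  # fails: fired twice
]
summary = audit_matrix(sample)
```

A run counts as valid only when the fired-event count matches its condition exactly, so an injected run that fires twice fails just as one that never fires would.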

Directional findings:

  • Football mostly reproduces the S2/V2 pattern: Gemma remains sticky toward Argentina, while Llama flips weakly to Colombia under counter-signal-mid.
  • Bolivia is the cleanest cross-topic signal result: both models move toward Paz under signal injections and toward Quiroga under counter-signal.
  • IPC remains stable toward lower/disinflation except under counter-signal, where Gemma becomes unclear and Llama flips to higher/rebound.
  • Off-topic noise is generally ignored; near-topic noise is more dangerous, especially in Bolivia.

Conclusion

S3 supports the S2/V2 conclusion that scheduled injection is technically robust and directionally meaningful, and extends that conclusion across domains. The effect transfers to Bolivia and IPC, with topic- and model-specific sensitivity. The strongest technical result is the reliable event audit; the strongest behavioral result is that relevant counter-evidence can change direction, especially for Llama.

This PR should be read as a technical and deterministic directional benchmark. Artifact-only ReportAgent scoring is documented as an optional follow-up and is not included in this PR.

Evidence

Primary committed evidence:

  • backtesting/s3-cross-topic-injection/evaluation/final_s3_report.md
  • backtesting/s3-cross-topic-injection/evaluation/results_analysis.md
  • backtesting/s3-cross-topic-injection/evaluation/full_summary.md
  • backtesting/s3-cross-topic-injection/evaluation/condition_summary_metrics.md
  • backtesting/s3-cross-topic-injection/RUN_LEDGER.csv

Local reproducibility artifacts remain under runs/s3_cross_topic/* and are intentionally not committed because they include SQLite/log outputs.

How to test

Run from backend/:

uv run --frozen python -m py_compile ../backtesting/s3-cross-topic-injection/scripts/validate_s3_package.py ../backtesting/s3-cross-topic-injection/scripts/run_s3_matrix.py ../backtesting/s3-cross-topic-injection/scripts/summarize_s3_smoke.py ../backtesting/s3-cross-topic-injection/scripts/extract_s3_metrics.py
uv run --frozen python ../backtesting/s3-cross-topic-injection/scripts/validate_s3_package.py

Validated locally:

  • script py_compile passed;
  • package validation passed with topics=3 models=2 conditions=7 smoke_rows=12 full_rows=42.
