[codex] S3 cross-topic scheduled injection benchmark by Joacocade · Pull Request #26 · LucasErcolano/MiroFish

Joacocade · 2026-06-18T19:11:38Z

Linked issue

S3 follow-up to S2 Issue 19 / PR #25. This PR is intentionally stacked on codex/s2-issue19-baseline so the diff only contains the S3 cross-topic extension.

Summary

Adds a reproducible S3 cross-topic scheduled-injection benchmark under backtesting/s3-cross-topic-injection/.

What changed:

Added compact topic packets for football, Bolivia, and IPC.
Added a 3 topics x 2 models x 7 conditions matrix.
Added a resumable runner with prepared-simulation reuse.
Added package validation, technical summarization, and deterministic metric extraction scripts.
Added committed smoke/full summaries, condition metrics, run ledger, final report, and detailed result analysis.
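
The resumable runner with prepared-simulation reuse could look roughly like the sketch below. This is not the code in run_s3_matrix.py. The names `run_matrix`, `prepare`, `run_condition`, and `done` are hypothetical, and a real runner would presumably load the completed set from a ledger on disk rather than keep it in memory.

```python
def run_matrix(rows, prepare, run_condition, done=None):
    """Run every pending row, preparing each topic/model pair at most once.

    rows          -- iterable of dicts with "topic", "model", "condition" keys
    prepare       -- hypothetical callable (topic, model) -> prepared simulation
    run_condition -- hypothetical callable (prepared_simulation, condition)
    done          -- set of (topic, model, condition) keys already completed
    """
    done = set() if done is None else done
    prepared = {}  # cache: (topic, model) -> prepared simulation
    for row in rows:
        key = (row["topic"], row["model"], row["condition"])
        if key in done:
            continue  # resume: skip rows finished in an earlier invocation
        pair = (row["topic"], row["model"])
        if pair not in prepared:
            prepared[pair] = prepare(*pair)  # expensive step, done once per pair
        run_condition(prepared[pair], row["condition"])
        done.add(key)
    return done
```

Passing the returned set back in as `done` on a later call skips everything that already finished, which is the resumability property in miniature.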

Experimental design

Models:

DeepInfra google/gemma-3-27b-it
DeepInfra meta-llama/Llama-3.3-70B-Instruct-Turbo

Topics:

football: Argentina vs Colombia, Copa America 2024 final
bolivia: Bolivia 2025 presidential runoff
ipc: Argentina IPC 2025 forecast

Conditions per topic/model:

baseline-control
signal-early
signal-mid
signal-late
counter-signal-mid
noise-near-mid
noise-off-mid

Full matrix size: 42 runs.
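
As a quick check on the matrix size, the design above can be enumerated as a Cartesian product. The topic keys, model IDs, and condition names are taken from this description. The `build_matrix` helper and the row layout are illustrative assumptions, not the committed matrix format.

```python
from itertools import product

TOPICS = ["football", "bolivia", "ipc"]
MODELS = [
    "google/gemma-3-27b-it",
    "meta-llama/Llama-3.3-70B-Instruct-Turbo",
]
CONDITIONS = [
    "baseline-control",
    "signal-early",
    "signal-mid",
    "signal-late",
    "counter-signal-mid",
    "noise-near-mid",
    "noise-off-mid",
]


def build_matrix():
    """Return one row per topic/model/condition combination (hypothetical layout)."""
    return [
        {"topic": topic, "model": model, "condition": condition}
        for topic, model, condition in product(TOPICS, MODELS, CONDITIONS)
    ]


matrix = build_matrix()
print(len(matrix))  # 3 * 2 * 7 = 42
```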

Results

Technical validity:

42/42 full matrix rows valid.
6/6 baselines fired zero scheduled events.
36/36 injected conditions fired exactly one scheduled event.
6/6 topic/model pairs reused one prepared simulation across their seven conditions.
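
The event audit above amounts to three checks per matrix. A minimal sketch follows, assuming each row carries hypothetical `events_fired` and `simulation_id` fields. The actual column names in RUN_LEDGER.csv may differ.

```python
from collections import defaultdict


def audit_rows(rows):
    """Return a list of problems; an empty list means the audit passed.

    Assumes hypothetical row fields: topic, model, condition,
    events_fired (int) and simulation_id (str).
    """
    problems = []
    sims_per_pair = defaultdict(set)
    for row in rows:
        # Baselines must fire no scheduled event; every other condition exactly one.
        expected = 0 if row["condition"] == "baseline-control" else 1
        if row["events_fired"] != expected:
            problems.append(
                f"{row['topic']}/{row['model']}/{row['condition']}: "
                f"expected {expected} event(s), got {row['events_fired']}"
            )
        sims_per_pair[(row["topic"], row["model"])].add(row["simulation_id"])
    # Each topic/model pair should reuse a single prepared simulation.
    for (topic, model), sim_ids in sims_per_pair.items():
        if len(sim_ids) != 1:
            problems.append(f"{topic}/{model}: {len(sim_ids)} distinct simulations")
    return problems
```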

Directional findings:

Football mostly reproduces the S2/V2 pattern: Gemma remains sticky toward Argentina, while Llama flips weakly to Colombia under counter-signal-mid.
Bolivia is the cleanest cross-topic signal result: both models move toward Paz under signal injections and toward Quiroga under counter-signal.
IPC remains stable toward lower/disinflation except under counter-signal, where Gemma becomes unclear and Llama flips to higher/rebound.
Off-topic noise is generally ignored; near-topic noise is more dangerous, especially in Bolivia.
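
One way to make the directional reading deterministic is to compare each condition's extracted direction with the baseline direction for the same topic/model pair. The sketch below is an assumption about how that could be done, not the logic in extract_s3_metrics.py. The labels `held`, `flipped`, and `unclear` are hypothetical, as are the short model keys. The example values restate the IPC counter-signal finding above.

```python
def classify_shift(baseline, observed):
    """Label a condition relative to its baseline direction (hypothetical labels)."""
    if observed == "unclear":
        return "unclear"
    return "held" if observed == baseline else "flipped"


# IPC under counter-signal-mid, as described in the findings:
# the baseline direction is "lower"; Gemma becomes unclear, Llama moves to "higher".
ipc_counter = {"gemma": "unclear", "llama": "higher"}
shifts = {model: classify_shift("lower", d) for model, d in ipc_counter.items()}
print(shifts)  # {'gemma': 'unclear', 'llama': 'flipped'}
```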

Conclusion

S3 supports the S2/V2 conclusion that scheduled injection is technically robust and directionally meaningful, and extends it across domains. The effect transfers to Bolivia and IPC, with topic- and model-specific sensitivity. The strongest technical result is the reliable event audit; the strongest behavioral result is that relevant counter-evidence can change direction, especially for Llama.

This PR should be read as a technical and deterministic directional benchmark. Artifact-only ReportAgent scoring is documented as an optional follow-up and is not included in this PR.

Evidence

Primary committed evidence:

backtesting/s3-cross-topic-injection/evaluation/final_s3_report.md
backtesting/s3-cross-topic-injection/evaluation/results_analysis.md
backtesting/s3-cross-topic-injection/evaluation/full_summary.md
backtesting/s3-cross-topic-injection/evaluation/condition_summary_metrics.md
backtesting/s3-cross-topic-injection/RUN_LEDGER.csv

Local reproducibility artifacts remain under runs/s3_cross_topic/* and are intentionally not committed because they include SQLite/log outputs.

How to test

Run from backend/:

uv run --frozen python -m py_compile ../backtesting/s3-cross-topic-injection/scripts/validate_s3_package.py ../backtesting/s3-cross-topic-injection/scripts/run_s3_matrix.py ../backtesting/s3-cross-topic-injection/scripts/summarize_s3_smoke.py ../backtesting/s3-cross-topic-injection/scripts/extract_s3_metrics.py
uv run --frozen python ../backtesting/s3-cross-topic-injection/scripts/validate_s3_package.py

Validated locally:

script py_compile passed;
package validation passed with topics=3 models=2 conditions=7 smoke_rows=12 full_rows=42.

Joacocade added 4 commits June 18, 2026 03:52

Add S3 cross-topic injection smoke package

e19e24c

Run S3 cross-topic smoke benchmark

eab3859

Complete S3 full matrix analysis

6c3f774

docs: add s3 results analysis

a60f328

LucasErcolano mentioned this pull request Jun 20, 2026

[Spike] Frontend: visualize new backend features (Fusion, routing, telemetry, wiki, deep search) #32

Open

15 tasks

Joacocade wants to merge 4 commits into codex/s2-issue19-baseline from codex/s3-cross-topic-injection
