S2 temporal backtest: Bolivia 2025 runoff by BrunoDC-dev · Pull Request #24 · LucasErcolano/MiroFish

BrunoDC-dev · 2026-06-03T20:28:10Z

Summary

Adds the S2 temporal backtesting case for the 2025 Bolivia presidential runoff and the passive MiroFish worldbuilding/planning trace capture needed to preserve pre-simulation artifacts for later judge training.

Linked issue

Closes #17

What changed

Added backtesting/case-b-s2-bolivia-2025-runoff/ with case card, manifest, temporal packages, question, rubric, private ground truth, evaluator, run notes, reports, and scored outputs.
Added strict issue-named temporal input artifacts: seed_T0.md, seed_T1.md, seed_T2.md, seed_T3.md.
Kept assembled_T0.md through assembled_T3.md as equivalent cumulative package aliases.
Updated ISSUE_RESPONSE.md, README.md, RESULTS.md, case_card.md, and testing_protocol.md to address the review comments explicitly.
Corrected evaluator metadata so committed Gemma probe outputs say model_policy: gemma_probe instead of being mislabeled as the primary Qwen policy.
Added passive worldbuilding_trace.json capture at simulation preparation time, including input context, filtered entities, generated profiles, simulation config, provenance, and artifact manifest.
Added PLANNING_CAPTURE_* config flags and a focused trace test.
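
As a rough illustration of the passive capture described above, here is a minimal sketch rather than the code in backend/app/services/worldbuilding_trace.py. The function names, the exact flag name PLANNING_CAPTURE_ENABLED, the trace field layout, and the credential-name heuristic are all assumptions made for the example; only the file name worldbuilding_trace.json and the list of captured sections come from this PR.

```python
import hashlib
import json
import os
from pathlib import Path

# Assumed heuristic: any config key containing one of these substrings is
# treated as a credential and left out of the trace (deliberately conservative).
SECRET_MARKERS = ("key", "token", "secret", "password")


def capture_enabled(env=None):
    """Read a hypothetical PLANNING_CAPTURE_ENABLED flag (the name is an assumption)."""
    env = os.environ if env is None else env
    return env.get("PLANNING_CAPTURE_ENABLED", "true").lower() in {"1", "true", "yes"}


def sha256_text(text):
    """Hash a document so the trace can prove which input was used."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def redact(config):
    """Drop config entries whose name looks like a credential."""
    return {
        k: v for k, v in config.items()
        if not any(marker in k.lower() for marker in SECRET_MARKERS)
    }


def write_trace(sim_dir, input_docs, entities, profiles, sim_config, env=None):
    """Write worldbuilding_trace.json into sim_dir, or return None if capture is off."""
    if not capture_enabled(env):
        return None
    trace = {
        "input_context": [
            {"name": name, "sha256": sha256_text(text)}
            for name, text in sorted(input_docs.items())
        ],
        "filtered_entities": entities,
        "generated_profiles": profiles,
        "simulation_config": redact(sim_config),
        "provenance": {"stage": "prepare_simulation"},
        "artifact_manifest": sorted(input_docs),
    }
    out = Path(sim_dir) / "worldbuilding_trace.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(trace, indent=2, ensure_ascii=False), encoding="utf-8")
    return out
```

The capture only reads values that already exist at preparation time and writes one file, which is what keeps it passive: it does not change what the simulation receives.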

Main findings

T0 failed to identify the correct runoff field with only early evidence.
T1 was the best Gemma probe run: after first-round surprise evidence, MiroFish predicted Rodrigo Paz correctly and nearly matched the final margin.
T2 and T3 shifted incorrectly toward Quiroga, showing salience/recency bias from platform framing and a late poll.
The T3 football-noise document did not materially affect the structured forecast.
No direct final-result leakage was found in the intended input packages.
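
For reference, a leakage check of this kind can be as simple as scanning each input package for phrases that would give away the outcome. The sketch below is an assumption about how such a scan could look, not the procedure actually used for this case; the function name and the placeholder phrases are hypothetical.

```python
import re


def _normalize(text):
    """Lowercase and collapse whitespace so line wrapping does not hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_leaks(package_text, forbidden_phrases):
    """Return the forbidden phrases that appear in the package text."""
    haystack = _normalize(package_text)
    return [p for p in forbidden_phrases if _normalize(p) in haystack]


# Placeholder phrases for illustration only; a real list would be derived
# from the private ground truth file and kept out of the input packages.
example_forbidden = ["won the runoff", "final official result"]
example_package = "Poll published before the vote.\nNo outcome is known yet."
print(find_leaks(example_package, example_forbidden))
```

A substring scan only catches direct leakage; paraphrased or implied outcomes would still need a manual read of each package.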

Review status

Addressed now:

PR hygiene sections: ## Linked issue and ## How to test.
seed_T0/T1/T2/T3 artifacts are now present.
Complexity gate is documented explicitly, including >20 expected extractable entities.
Model-policy ambiguity is corrected in committed eval metadata.
Technical fixes are documented as run dependencies.

Still pending for strict S2 closure:

A clean primary fixed-model pass with qwen/qwen3-8b.
Three-run robustness/replica summary if the S2 best-condition rule is applied.
Any model ladder should be reported separately from the primary T0/T1/T2/T3 comparison.

How to test

PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 backend/.venv/bin/python -m pytest tests/test_worldbuilding_trace.py -q
backend/.venv/bin/python -m py_compile backend/app/services/worldbuilding_trace.py backend/app/services/simulation_manager.py backend/app/config.py tests/test_worldbuilding_trace.py
python3 -m py_compile backtesting/case-b-s2-bolivia-2025-runoff/eval_objective.py
for t in T0 T1 T2 T3; do cmp -s backtesting/case-b-s2-bolivia-2025-runoff/seed_${t}.md backtesting/case-b-s2-bolivia-2025-runoff/assembled_${t}.md && echo "$t ok" || echo "$t mismatch"; done

Local results:

tests/test_worldbuilding_trace.py: 1 passed
Python compile checks: passed
seed_T* vs assembled_T*: all matched
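
The shell loop above can also be written in Python, which may be handy where cmp is unavailable. This is a small sketch under the assumption that the case directory uses the seed_T*.md and assembled_T*.md names listed in this PR; the function name check_seed_aliases is hypothetical.

```python
import filecmp
from pathlib import Path


def check_seed_aliases(case_dir, cutoffs=("T0", "T1", "T2", "T3")):
    """Compare each seed_<T>.md with assembled_<T>.md byte for byte."""
    results = {}
    for t in cutoffs:
        seed = Path(case_dir) / f"seed_{t}.md"
        assembled = Path(case_dir) / f"assembled_{t}.md"
        if not (seed.is_file() and assembled.is_file()):
            results[t] = "missing"
        elif filecmp.cmp(seed, assembled, shallow=False):
            # shallow=False forces a content comparison instead of trusting stat data
            results[t] = "ok"
        else:
            results[t] = "mismatch"
    return results


if __name__ == "__main__":
    report = check_seed_aliases("backtesting/case-b-s2-bolivia-2025-runoff")
    for cutoff, status in report.items():
        print(cutoff, status)
```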

Next work

Run the stricter primary-model experiments next, especially IPC and football/noise cases, using the new planning/worldbuilding traces as auditable pre-simulation artifacts.

LucasErcolano · 2026-06-06T17:49:55Z

Qualitative analysis of the gaps that remain before S2 issue #17 can be closed properly:

CI / PR hygiene: the check currently fails because the body lacks the exact sections it expects. Add:
- ## Linked issue with Closes #17
- ## How to test with commands/verification.
Automatic linking: GitHub does not detect closingIssuesReferences; put Closes #17 under the expected section.
T0/T1/T2/T3 artifacts: the issue asked for seed_T0/T1/T2/T3; the PR uses assembled_T0.md, etc. For strict closure, add or rename the artifacts to seed_T0, seed_T1, seed_T2, seed_T3, or formally document the equivalence.
Run everything the issue asks for: leave evidence of complete T0, T1, T2, and T3 runs with the same question.md (differing only in the available evidence), ground truth kept out of the input, and an objective evaluation for each package.
Fixed primary model: the saved runs are labeled gemma_probe, while the issue asks for a fixed primary model so that architecture effects are not confounded with model effects. A clean pass with the primary model defined for S2 is still missing; alternatively, document equivalent evidence if it already exists.
Seeds/replicas: if the issue applies the S2 best-condition rule with 3 runs, add those 3 runs and report mean, standard deviation, min/max range, narrative stability, cost per run, and failures/invalid parses.
Model ladder: if a model ladder is run, it must be kept separate from the main experiment. The T0/T1/T2/T3 comparison must be done with a fixed primary model.
Complexity gate: document an explicit checklist: at least 6 documents, 3 dates, 3 sources/types, 2 causal hypotheses, 1 temporally valid noise document, >20 entities, ground truth outside the input, a post-cutoff event, and a defined metric.
Technical scope: the PR includes fixes in llm_client.py, zep_tools.py, and quality guards. If they are needed to complete the run, explain that dependency; if not, split them out so the experiment is easier to audit.
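
The complexity gate in the list above lends itself to an explicit, machine-checkable checklist. The sketch below encodes the thresholds exactly as stated in this comment; the CaseStats structure, its field names, and the complexity_gate function are hypothetical and only illustrate one way to document the gate.

```python
from dataclasses import dataclass


@dataclass
class CaseStats:
    """Counts describing one backtesting case (field names are assumptions)."""
    documents: int
    distinct_dates: int
    source_types: int
    causal_hypotheses: int
    valid_noise_docs: int
    expected_entities: int
    ground_truth_outside_input: bool
    event_after_cutoff: bool
    metric_defined: bool


def complexity_gate(s):
    """Return (passed, failed_checks) for the thresholds listed in the review."""
    checks = {
        "documents >= 6": s.documents >= 6,
        "distinct dates >= 3": s.distinct_dates >= 3,
        "source types >= 3": s.source_types >= 3,
        "causal hypotheses >= 2": s.causal_hypotheses >= 2,
        "temporally valid noise docs >= 1": s.valid_noise_docs >= 1,
        "expected entities > 20": s.expected_entities > 20,
        "ground truth outside input": s.ground_truth_outside_input,
        "event after cutoff": s.event_after_cutoff,
        "metric defined": s.metric_defined,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return not failed, failed


# Illustrative values only, not measurements of the Bolivia case.
example = CaseStats(6, 3, 3, 2, 1, 20, True, True, True)
print(complexity_gate(example))  # fails only the strict ">20 entities" check
```

Returning the list of failed checks, rather than a bare boolean, makes it easier to paste the outcome into case_card.md as the documented checklist.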

In summary: to close #17, the PR must demonstrate full compliance on the temporal packages, the fixed primary model, the required seeds/replicas, the complexity gate, and objective evaluation. The current gemma_probe evidence is not sufficient for strict closure unless the requested primary configuration has also been run.
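
For the replica requirement, the quantitative part of the report (mean, standard deviation, min/max range, invalid parses, cost per run) can be produced with a few lines. This is a minimal sketch under the assumption that each run is summarized as a dict with a margin_error value (None when the answer could not be parsed) and a cost_usd value; both field names are hypothetical, and narrative stability still needs a qualitative read.

```python
import statistics


def summarize_replicas(runs):
    """Aggregate per-run metrics for a replica report.

    Each run is a dict with 'margin_error' (float, or None if parsing failed)
    and 'cost_usd' (float). Field names are assumptions for this sketch.
    """
    valid = [r["margin_error"] for r in runs if r["margin_error"] is not None]
    costs = [r["cost_usd"] for r in runs]
    return {
        "runs": len(runs),
        "invalid_parses": len(runs) - len(valid),
        "mean": statistics.mean(valid) if valid else None,
        # The sample standard deviation needs at least two valid runs.
        "stdev": statistics.stdev(valid) if len(valid) >= 2 else None,
        "min": min(valid) if valid else None,
        "max": max(valid) if valid else None,
        "mean_cost_usd": statistics.mean(costs) if costs else None,
    }


# Illustrative numbers only; these are not results from this PR.
example_runs = [
    {"margin_error": 1.0, "cost_usd": 0.5},
    {"margin_error": 2.0, "cost_usd": 0.5},
    {"margin_error": 3.0, "cost_usd": 0.5},
]
print(summarize_replicas(example_runs))
```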

BrunoDC-dev · 2026-06-16T01:20:57Z

Update pushed to the feat/issue-17-bolivia-runoff-backtesting-pr branch.

New commits:

1eb1c55 - passive worldbuilding/planning capture.
103fe68 - response to the Bolivia backtest review comments.

What was done:

Added worldbuilding_trace.json at the end of prepare_simulation, written to backend/uploads/simulations/<simulation_id>/worldbuilding_trace.json.
The trace stores input context, sources/hashes, filtered entities, OASIS profiles, simulation config, provenance, and the artifact manifest, without storing secrets.
Added the PLANNING_CAPTURE_* flags and the focused test tests/test_worldbuilding_trace.py.
Added the strict artifacts requested by the issue: seed_T0.md, seed_T1.md, seed_T2.md, seed_T3.md.
Rebuilt and committed assembled_T1.md, assembled_T2.md, assembled_T3.md; each seed_T* matches its assembled_T* exactly.
Fixed the evaluation metadata: the saved runs now say model_policy: gemma_probe, not primary_fixed_qwen3_8b.
ISSUE_RESPONSE.md, README.md, RESULTS.md, case_card.md, and testing_protocol.md now clearly separate what is covered from what is pending.
The PR body now has ## Linked issue with Closes #17 and ## How to test.

Local verification:

PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 backend/.venv/bin/python -m pytest tests/test_worldbuilding_trace.py -q
backend/.venv/bin/python -m py_compile backend/app/services/worldbuilding_trace.py backend/app/services/simulation_manager.py backend/app/config.py tests/test_worldbuilding_trace.py
python3 -m py_compile backtesting/case-b-s2-bolivia-2025-runoff/eval_objective.py
for t in T0 T1 T2 T3; do cmp -s backtesting/case-b-s2-bolivia-2025-runoff/seed_${t}.md backtesting/case-b-s2-bolivia-2025-runoff/assembled_${t}.md && echo "$t ok" || echo "$t mismatch"; done

Result: the trace test reports 1 passed, the compile checks are OK, and all four seed_T* / assembled_T* pairs match.

Honest status: Bolivia is now better documented, with names and metadata corrected, but strict closure still requires the fixed primary pass with Qwen, plus the replicas if we apply the S2 robustness rule.

Recommended next step: run the next S2 tests with IPC and football/noise, using this planning/worldbuilding capture to audit each run before simulating.

BrunoDC-dev · 2026-06-18T14:30:38Z

S2 T0-T3 results matrix

I uploaded a versioned matrix covering Bolivia, IPC, and Copa America to backtesting/S2_TEMPORAL_RESULTS_MATRIX.md.

Bolivia 2025 runoff

| T | Prediction | Correct | Vote MAE | Margin error | Parse errors |
|---|---|---|---|---|---|
| T0 | null | no | null | 5.06 | 2 |
| T1 | paz_gana | yes | 2.0 | 0.06 | 0 |
| T2 | quiroga_gana | no | 7.02 | 18.06 | 0 |
| T3 | quiroga_gana | no | 7.687 | 18.06 | 0 |
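
To make the two error columns easier to read, here is a sketch of how a vote MAE and a margin error are commonly computed. It assumes the evaluator compares forecast and actual vote shares in percentage points and uses a signed margin between the two runoff candidates; the actual definitions live in eval_objective.py and may differ. The candidate labels and all numbers below are illustrative, not the case's ground truth.

```python
def vote_mae(forecast, actual):
    """Mean absolute error across candidates, in percentage points."""
    return sum(abs(forecast[c] - actual[c]) for c in actual) / len(actual)


def margin_error(forecast, actual, first, second):
    """Absolute difference between forecast and actual signed margins."""
    forecast_margin = forecast[first] - forecast[second]
    actual_margin = actual[first] - actual[second]
    return abs(forecast_margin - actual_margin)


# Illustrative shares only (not the real result of any election).
actual = {"candidate_a": 55.0, "candidate_b": 45.0}
right_winner = {"candidate_a": 52.0, "candidate_b": 48.0}
wrong_winner = {"candidate_a": 45.0, "candidate_b": 55.0}

print(vote_mae(right_winner, actual))                                    # 3.0
print(margin_error(right_winner, actual, "candidate_a", "candidate_b"))  # 6.0
print(margin_error(wrong_winner, actual, "candidate_a", "candidate_b"))  # 20.0
```

Under a signed-margin definition, picking the wrong winner flips the sign of the forecast margin, so the margin error grows much faster than the vote MAE. That would be consistent with the large margin errors on the rows that predict the wrong winner, although this is an inference from the table and not something the PR states.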

IPC Argentina 2025

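| T | Score | Feb error | Dec error | Monthly MAE | Monthly + cumulative MAE |
|---|---|---|---|---|---|
| T0 | 1/5 | 0.6 | 1.6 | 1.538 | 4.03 |
| T1 | 1/5 | 0.6 | 1.3 | 1.25 | 2.3 |
| T2 | 1/5 | 0.1 | 1.6 | 1.2 | 2.26 |
| T3 | 3/5 | 0.1 | 0.8 | 0.9 | 1.52 |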

Copa America 2024 final

| T | Prediction | Correct | Point prob. | Winner range | Valid width | Goal margin | Score |
|---|---|---|---|---|---|---|---|
| T0 | Argentina | yes | 0.625 | 0.575-0.675 | no | 1.0 [0.5, 1.5] | 4/5 |
| T1 | Argentina | yes | 0.65 | 0.6-0.7 | no | 1.0 [0.5, 1.5] | 5/5 |
| T2 | Argentina | yes | 0.49 | 0.46-0.51 | yes | 1.0 [0.0, 2.0] | 5/5 |
| T3 | Argentina | yes | 0.49 | 0.46-0.51 | yes | 1.0 [1.0, 2.0] | 5/5 |

The complete IPC/Copa artifacts are also uploaded: structured_answer.json, eval_result.json, run_notes.md, worldbuilding_trace.json, and worldbuilding_artifacts/llm_calls per variant.

BrunoDC-dev · 2026-06-18T14:47:24Z

Llama Line5 extension prepared for Bolivia/Copa

I added the setup needed to replicate the design of Issue #18 / PR #22 on our Bolivia and Copa America cases using Llama 3.3 70B Instruct:

backtesting/LINE5_LLAMA_BOLIVIA_COPA.md
backtesting/scripts/run_line5_llama_matrix.py
backtesting/case-b-s2-bolivia-2025-runoff/config_line5_llama.yaml
backtesting/case-d-s2-copa-america-line5-gemma/config_line5_llama.yaml

The matrix replicates the 5 reduced Line 5 conditions:

| Condition | Rounds | Density |
|---|---|---|
| R10-D2 | 10 | 2 |
| R40-D2 | 40 | 2 |
| R80-D2 | 80 | 2 |
| R40-D1 | 40 | 1 |
| R40-D3 | 40 | 3 |

I have not executed the real runs yet because there is no DEEPINFRA_API_KEY available locally and the backend was not running. The runner writes its outputs to output_llama_line5/ and keeps worldbuilding_trace.json, worldbuilding_artifacts/llm_calls, reports, and eval_result.json per variant.
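
For readers who want to see the shape of the run plan, the five reduced conditions and the two cases above expand into ten variants. The sketch below is not the code in backtesting/scripts/run_line5_llama_matrix.py; the function names and the per-variant directory layout under output_llama_line5/ are assumptions, while the condition values, case directories, and the DEEPINFRA_API_KEY variable come from this comment.

```python
import os
from pathlib import Path

# Rounds/density pairs copied from the conditions listed above.
CONDITIONS = [
    {"name": "R10-D2", "rounds": 10, "density": 2},
    {"name": "R40-D2", "rounds": 40, "density": 2},
    {"name": "R80-D2", "rounds": 80, "density": 2},
    {"name": "R40-D1", "rounds": 40, "density": 1},
    {"name": "R40-D3", "rounds": 40, "density": 3},
]

CASES = [
    "case-b-s2-bolivia-2025-runoff",
    "case-d-s2-copa-america-line5-gemma",
]


def plan_runs(cases=CASES, conditions=CONDITIONS, output_dir="output_llama_line5"):
    """Expand cases x conditions into one planned run per variant."""
    plan = []
    for case in cases:
        for cond in conditions:
            plan.append({
                "case": case,
                "condition": cond["name"],
                "rounds": cond["rounds"],
                "density": cond["density"],
                # Assumed layout: one subdirectory per condition inside the case.
                "output_path": Path("backtesting") / case / output_dir / cond["name"],
            })
    return plan


def missing_prerequisites(env=None):
    """List reasons the matrix cannot run yet (only the API key is checked here)."""
    env = os.environ if env is None else env
    return [] if env.get("DEEPINFRA_API_KEY") else ["DEEPINFRA_API_KEY is not set"]


if __name__ == "__main__":
    print(len(plan_runs()), "planned runs")
    print(missing_prerequisites())
```

Checking prerequisites before expanding the matrix avoids starting a long batch that would fail on the first model call.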

BrunoDC-dev added 4 commits June 3, 2026 17:27

Stabilize report agent backtesting flow

b644b2a

Add Bolivia runoff temporal backtest

16ba3ff

Expand Bolivia runoff issue response

1cad2bb

Add issue 17 acceptance checklist

d67ff02

BrunoDC-dev mentioned this pull request Jun 3, 2026

S2 - Investigador 1: Caso cualitativo (Issue #12) + Línea 1 — Actualización temporal con evidencia post-cutoff #17

Closed

16 tasks

BrunoDC-dev added 2 commits June 15, 2026 22:19

Capture worldbuilding traces

1eb1c55

Address Bolivia backtest review notes

103fe68

BrunoDC-dev added 5 commits June 15, 2026 22:26

Import IPC line 5 case packet

7a70890

Add IPC Gemma backtesting packet

3fde4a5

Add IPC temporal evidence packages

b72a7e0

Fix local backend startup config

699a58a

add temporal S2 backtesting matrix and cases

5722acd

add llama line5 configs for bolivia and copa

2b95c66

BrunoDC-dev added 3 commits June 18, 2026 12:37

remove bolivia football noise from llama line5

4000d58

sanitize graphiti attributes for neo4j

3a6e8ca

add slim llama line5 matrices

563a203

LucasErcolano mentioned this pull request Jun 20, 2026

[Spike] Frontend: visualize new backend features (Fusion, routing, telemetry, wiki, deep search) #32

Open

15 tasks

BrunoDC-dev wants to merge 15 commits into main from feat/issue-17-bolivia-runoff-backtesting-pr
