S2 temporal backtest: Bolivia 2025 runoff #24

Draft
BrunoDC-dev wants to merge 15 commits into main from feat/issue-17-bolivia-runoff-backtesting-pr

BrunoDC-dev (Collaborator) commented Jun 3, 2026

Summary

Adds the S2 temporal backtesting case for the 2025 Bolivia presidential runoff and the passive MiroFish worldbuilding/planning trace capture needed to preserve pre-simulation artifacts for later judge training.

Linked issue

Closes #17

What changed

  • Added backtesting/case-b-s2-bolivia-2025-runoff/ with case card, manifest, temporal packages, question, rubric, private ground truth, evaluator, run notes, reports, and scored outputs.
  • Added strict issue-named temporal input artifacts: seed_T0.md, seed_T1.md, seed_T2.md, seed_T3.md.
  • Kept assembled_T0.md through assembled_T3.md as equivalent cumulative package aliases.
  • Updated ISSUE_RESPONSE.md, README.md, RESULTS.md, case_card.md, and testing_protocol.md to address the review comments explicitly.
  • Corrected evaluator metadata so committed Gemma probe outputs say model_policy: gemma_probe instead of being mislabeled as the primary Qwen policy.
  • Added passive worldbuilding_trace.json capture at simulation preparation time, including input context, filtered entities, generated profiles, simulation config, provenance, and artifact manifest.
  • Added PLANNING_CAPTURE_* config flags and a focused trace test.
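
As a rough illustration of what a passive trace capture of this kind might look like, the sketch below writes a worldbuilding_trace.json containing source hashes and a redacted config. The function name `write_worldbuilding_trace`, the `enabled` parameter (standing in for whichever PLANNING_CAPTURE_* flag gates capture), the field layout, and the secret-key heuristic are all assumptions for illustration. It is not the code in backend/app/services/worldbuilding_trace.py, and it omits the provenance and artifact manifest sections.

```python
from __future__ import annotations

import hashlib
import json
from pathlib import Path
from typing import Any

# Hypothetical heuristic: config keys containing these fragments count as secrets.
SECRET_MARKERS = ("key", "token", "secret", "password")


def redact(config: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of config with secret-looking values masked."""
    return {
        k: "***" if any(m in k.lower() for m in SECRET_MARKERS) else v
        for k, v in config.items()
    }


def write_worldbuilding_trace(
    sim_dir: Path,
    sources: dict[str, str],
    entities: list[str],
    profiles: list[dict[str, Any]],
    config: dict[str, Any],
    enabled: bool = True,
) -> Path | None:
    """Write the trace into the simulation directory; do nothing if disabled."""
    if not enabled:
        return None
    trace = {
        # Store a hash and length per source instead of trusting file names alone.
        "input_context": {
            name: {
                "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
                "chars": len(text),
            }
            for name, text in sources.items()
        },
        "filtered_entities": entities,
        "generated_profiles": profiles,
        "simulation_config": redact(config),
    }
    sim_dir.mkdir(parents=True, exist_ok=True)
    out = sim_dir / "worldbuilding_trace.json"
    out.write_text(json.dumps(trace, indent=2, ensure_ascii=False), encoding="utf-8")
    return out
```

Hashing each input document makes it possible to confirm later that a judge is looking at the same package the simulation actually saw.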

Main findings

  • T0 failed to identify the correct runoff field with only early evidence.
  • T1 was the best Gemma probe run: after first-round surprise evidence, MiroFish predicted Rodrigo Paz correctly and nearly matched the final margin.
  • T2 and T3 shifted incorrectly toward Quiroga, showing salience/recency bias from platform framing and a late poll.
  • The T3 football-noise document did not materially affect the structured forecast.
  • No direct final-result leakage was found in the intended input packages.

Review status

Addressed now:

  • PR hygiene sections: ## Linked issue and ## How to test.
  • seed_T0/T1/T2/T3 artifacts are now present.
  • Complexity gate is documented explicitly, including >20 expected extractable entities.
  • Model-policy ambiguity is corrected in committed eval metadata.
  • Technical fixes are documented as run dependencies.

Still pending for strict S2 closure:

  • A clean primary fixed-model pass with qwen/qwen3-8b.
  • Three-run robustness/replica summary if the S2 best-condition rule is applied.
  • Any model ladder should be reported separately from the primary T0/T1/T2/T3 comparison.

How to test

PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 backend/.venv/bin/python -m pytest tests/test_worldbuilding_trace.py -q
backend/.venv/bin/python -m py_compile backend/app/services/worldbuilding_trace.py backend/app/services/simulation_manager.py backend/app/config.py tests/test_worldbuilding_trace.py
python3 -m py_compile backtesting/case-b-s2-bolivia-2025-runoff/eval_objective.py
for t in T0 T1 T2 T3; do cmp -s backtesting/case-b-s2-bolivia-2025-runoff/seed_${t}.md backtesting/case-b-s2-bolivia-2025-runoff/assembled_${t}.md && echo "$t ok" || echo "$t mismatch"; done
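
For environments without `cmp`, the seed/assembled equivalence check in the last command can be approximated in Python. This is a minimal sketch: `check_seed_aliases` is a hypothetical helper name, not something shipped in this PR, and it treats a missing file as a mismatch.

```python
import filecmp
from pathlib import Path


def check_seed_aliases(case_dir: Path, steps=("T0", "T1", "T2", "T3")) -> dict:
    """Byte-compare seed_<T>.md with assembled_<T>.md for each temporal step."""
    results = {}
    for t in steps:
        seed = case_dir / f"seed_{t}.md"
        assembled = case_dir / f"assembled_{t}.md"
        # shallow=False forces a content comparison instead of trusting os.stat() data.
        results[t] = (
            seed.is_file()
            and assembled.is_file()
            and filecmp.cmp(seed, assembled, shallow=False)
        )
    return results


if __name__ == "__main__":
    case = Path("backtesting/case-b-s2-bolivia-2025-runoff")
    for step, ok in check_seed_aliases(case).items():
        print(f"{step} {'ok' if ok else 'mismatch'}")
```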

Local results:

  • tests/test_worldbuilding_trace.py: 1 passed
  • Python compile checks: passed
  • seed_T* vs assembled_T*: all matched

Next work

Run the stricter primary-model experiments next, especially IPC and football/noise cases, using the new planning/worldbuilding traces as auditable pre-simulation artifacts.

LucasErcolano (Owner) commented Jun 6, 2026

Qualitative gap analysis for properly closing the S2 issue (#17):

  • CI / PR Hygiene: currently fails because the body lacks the exact expected sections. Add:
    • ## Linked issue with Closes #17
    • ## How to test with commands/verification.
  • Automatic linking: GitHub does not detect closingIssuesReferences; use Closes #17 under the expected section.
  • T0/T1/T2/T3 artifacts: the issue asked for seed_T0/T1/T2/T3; the PR uses assembled_T0.md etc. For strict closure, add/rename the artifacts seed_T0, seed_T1, seed_T2, seed_T3, or formally document the equivalence.
  • Run everything the issue asks for: leave evidence of complete T0, T1, T2, and T3 runs with the same question.md apart from the available evidence, ground truth kept out of the input, and an objective evaluation for each package.
  • Fixed primary model: the saved runs are labeled gemma_probe, whereas the issue asks for a fixed primary model so that architecture effects are not confounded with model effects. A clean pass with the primary model defined for S2 is missing; alternatively, document equivalent evidence if it already exists.
  • Seeds/replicas: if the issue applies the S2 best-condition rule with 3 runs, add those 3 runs and report mean, standard deviation, min/max range, narrative stability, cost per run, and failures/invalid parses.
  • Model ladder: if a model ladder is run, it must be kept separate from the main experiment. The T0/T1/T2/T3 comparison must be done with a fixed primary model.
  • Complexity gate: document an explicit checklist: at least 6 documents, 3 dates, 3 sources/types, 2 causal hypotheses, 1 temporally valid noise item, >20 entities, ground truth outside the input, a post-cutoff event, and a defined metric.
  • Technical scope: the PR includes fixes in llm_client.py, zep_tools.py, and quality guards. If they are needed to complete the run, explain that dependency; if not, split them out so the experiment is more auditable.
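
The complexity-gate checklist in this comment could be encoded as a small automated check. This is only a sketch: `CaseStats` and `complexity_gate` are hypothetical names that do not exist in the repository, and only the thresholds come from the checklist itself.

```python
from dataclasses import dataclass


@dataclass
class CaseStats:
    """Hand-counted properties of a backtesting case (hypothetical structure)."""
    documents: int
    dates: int
    source_types: int
    causal_hypotheses: int
    valid_noise_docs: int
    entities: int
    ground_truth_outside_input: bool
    event_after_cutoff: bool
    metric_defined: bool


def complexity_gate(s: CaseStats) -> dict:
    """Evaluate each checklist item; the case passes only if every value is True."""
    return {
        "documents >= 6": s.documents >= 6,
        "dates >= 3": s.dates >= 3,
        "source types >= 3": s.source_types >= 3,
        "causal hypotheses >= 2": s.causal_hypotheses >= 2,
        "temporally valid noise >= 1": s.valid_noise_docs >= 1,
        "entities > 20": s.entities > 20,  # strictly more than 20
        "ground truth outside input": s.ground_truth_outside_input,
        "event after cutoff": s.event_after_cutoff,
        "metric defined": s.metric_defined,
    }
```

Returning one boolean per item, rather than a single pass/fail, keeps the result usable as the explicit checklist the review asks for.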

In short: to close #17, the PR must demonstrate full compliance on the temporal packages, the fixed primary model, the required seeds/replicas, the complexity gate, and objective evaluation. The current gemma_probe evidence is not enough for strict closure unless the requested primary configuration has also been run.
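
For the replica reporting requested in this review (mean, standard deviation, min/max range, invalid parses), a summary along these lines might be enough. It is a sketch under assumptions: `summarize_replicas` is a hypothetical helper, an invalid parse is represented as `None`, and narrative stability and cost per run are left out because they depend on data this page does not describe.

```python
from __future__ import annotations

import statistics


def summarize_replicas(scores: list[float | None]) -> dict[str, float | int]:
    """Summarize one metric across replica runs; None marks an invalid parse."""
    valid = [s for s in scores if s is not None]
    summary: dict[str, float | int] = {
        "runs": len(scores),
        "invalid_parses": len(scores) - len(valid),
    }
    if valid:
        summary["mean"] = statistics.fmean(valid)
        summary["min"] = min(valid)
        summary["max"] = max(valid)
        # Sample standard deviation needs at least two valid runs.
        summary["stdev"] = statistics.stdev(valid) if len(valid) > 1 else 0.0
    return summary
```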

BrunoDC-dev (Collaborator, Author) commented

Update pushed to the feat/issue-17-bolivia-runoff-backtesting-pr branch.

New commits:

  • 1eb1c55 - passive worldbuilding/planning capture.
  • 103fe68 - response to the Bolivia backtest review comments.

What is done:

  • Added worldbuilding_trace.json at the end of prepare_simulation, written to backend/uploads/simulations/<simulation_id>/worldbuilding_trace.json.
  • The trace stores the input context, sources/hashes, filtered entities, OASIS profiles, simulation config, provenance, and artifact manifest, without storing secrets.
  • Added the PLANNING_CAPTURE_* flags and the focused test tests/test_worldbuilding_trace.py.
  • Added the strict artifacts requested by the issue: seed_T0.md, seed_T1.md, seed_T2.md, seed_T3.md.
  • Rebuilt and committed assembled_T1.md, assembled_T2.md, assembled_T3.md; each seed_T* matches its assembled_T* exactly.
  • Corrected the evaluation metadata: the saved runs now say model_policy: gemma_probe, not primary_fixed_qwen3_8b.
  • ISSUE_RESPONSE.md, README.md, RESULTS.md, case_card.md, and testing_protocol.md now clearly separate what is covered from what is pending.
  • The PR body now has ## Linked issue with Closes #17 and ## How to test.

Local verification:

PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 backend/.venv/bin/python -m pytest tests/test_worldbuilding_trace.py -q
backend/.venv/bin/python -m py_compile backend/app/services/worldbuilding_trace.py backend/app/services/simulation_manager.py backend/app/config.py tests/test_worldbuilding_trace.py
python3 -m py_compile backtesting/case-b-s2-bolivia-2025-runoff/eval_objective.py
for t in T0 T1 T2 T3; do cmp -s backtesting/case-b-s2-bolivia-2025-runoff/seed_${t}.md backtesting/case-b-s2-bolivia-2025-runoff/assembled_${t}.md && echo "$t ok" || echo "$t mismatch"; done

Result: the trace test reports 1 passed, the compile checks are OK, and all four seed_T* / assembled_T* pairs match.

Honest status: Bolivia is better documented and its names/metadata are corrected, but strict closure still needs the fixed primary pass with Qwen, plus the replicas if we apply the S2 robustness rule.

Recommended next step: run the next S2 tests with IPC and football/noise, using this planning/worldbuilding capture to audit each run before simulating.

BrunoDC-dev (Collaborator, Author) commented

S2 T0-T3 results matrix

I uploaded a versioned matrix covering Bolivia, IPC, and Copa America to backtesting/S2_TEMPORAL_RESULTS_MATRIX.md.

Bolivia 2025 runoff

| T | Prediction | Correct | Vote MAE | Margin error | Parse errors |
|---|---|---|---|---|---|
| T0 | null | no | null | 5.06 | 2 |
| T1 | paz_gana | yes | 2.0 | 0.06 | 0 |
| T2 | quiroga_gana | no | 7.02 | 18.06 | 0 |
| T3 | quiroga_gana | no | 7.687 | 18.06 | 0 |
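
The page does not show how eval_objective.py computes the vote MAE and margin error columns. One plausible reading, sketched here purely as an assumption, is the mean absolute difference in vote share per candidate and the absolute difference between predicted and actual margins. The function names and all numbers below are made up for illustration and are not the case's ground truth.

```python
def vote_mae(predicted: dict, actual: dict) -> float:
    """Mean absolute error over the candidates present in the ground truth."""
    return sum(abs(predicted[c] - actual[c]) for c in actual) / len(actual)


def margin_error(predicted: dict, actual: dict, a: str, b: str) -> float:
    """Absolute difference between the predicted and actual (a minus b) margins."""
    return abs((predicted[a] - predicted[b]) - (actual[a] - actual[b]))


# Made-up vote shares in percentage points, for illustration only.
actual = {"cand_a": 55.0, "cand_b": 45.0}
predicted = {"cand_a": 52.0, "cand_b": 48.0}

print(vote_mae(predicted, actual))                          # 3.0
print(margin_error(predicted, actual, "cand_a", "cand_b"))  # 6.0
```

Under this definition, a forecast that picks the wrong winner yields a large margin error because the predicted margin has the opposite sign to the actual one.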

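IPC Argentina 2025

| T | Score | Feb err | Dec err | Monthly MAE | Monthly + cumulative MAE | Parse errors |
|---|---|---|---|---|---|---|
| T0 | 1/5 | 0.6 | 1.6 | 1.538 | 4.03 | 0 |
| T1 | 1/5 | 0.6 | 1.3 | 1.25 | 2.3 | 0 |
| T2 | 1/5 | 0.1 | 1.6 | 1.2 | 2.26 | 0 |
| T3 | 3/5 | 0.1 | 0.8 | 0.9 | 1.52 | 0 |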

Copa America 2024 final

| T | Prediction | Correct | Point prob. | Winner range | Valid width | Goal margin | Score |
|---|---|---|---|---|---|---|---|
| T0 | Argentina | yes | 0.625 | 0.575-0.675 | no | 1.0 [0.5, 1.5] | 4/5 |
| T1 | Argentina | yes | 0.65 | 0.6-0.7 | no | 1.0 [0.5, 1.5] | 5/5 |
| T2 | Argentina | yes | 0.49 | 0.46-0.51 | yes | 1.0 [0.0, 2.0] | 5/5 |
| T3 | Argentina | yes | 0.49 | 0.46-0.51 | yes | 1.0 [1.0, 2.0] | 5/5 |

The complete IPC/Copa artifacts are also uploaded: structured_answer.json, eval_result.json, run_notes.md, worldbuilding_trace.json, and worldbuilding_artifacts/llm_calls per variant.

BrunoDC-dev (Collaborator, Author) commented

Llama Line5 extension prepared for Bolivia/Copa

I added the setup to replicate the Issue #18 / PR #22 design on our Bolivia and Copa America cases using Llama 3.3 70B Instruct:

  • backtesting/LINE5_LLAMA_BOLIVIA_COPA.md
  • backtesting/scripts/run_line5_llama_matrix.py
  • backtesting/case-b-s2-bolivia-2025-runoff/config_line5_llama.yaml
  • backtesting/case-d-s2-copa-america-line5-gemma/config_line5_llama.yaml

The matrix replicates the 5 reduced Line 5 conditions:

| Condition | Rounds | Density |
|---|---|---|
| R10-D2 | 10 | 2 |
| R40-D2 | 40 | 2 |
| R80-D2 | 80 | 2 |
| R40-D1 | 40 | 1 |
| R40-D3 | 40 | 3 |
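
The condition labels encode both parameters, so a runner could derive the matrix from the labels alone. The sketch below shows one way to do that; `Condition`, `parse_condition`, and `LINE5_CONDITIONS` are hypothetical names, and this is not the code in backtesting/scripts/run_line5_llama_matrix.py.

```python
import re
from typing import NamedTuple


class Condition(NamedTuple):
    label: str
    rounds: int
    density: int


def parse_condition(label: str) -> Condition:
    """Split a label such as 'R40-D2' into its rounds and density values."""
    m = re.fullmatch(r"R(\d+)-D(\d+)", label)
    if m is None:
        raise ValueError(f"unrecognized condition label: {label!r}")
    return Condition(label, int(m.group(1)), int(m.group(2)))


# The five reduced conditions listed in the table.
LINE5_CONDITIONS = [
    parse_condition(x)
    for x in ("R10-D2", "R40-D2", "R80-D2", "R40-D1", "R40-D3")
]
```

R40-D2 acts as the shared center point: rounds vary at density 2, and density varies at 40 rounds.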

I have not run the actual jobs yet because there is no DEEPINFRA_API_KEY locally and the backend was not up. The runner writes outputs to output_llama_line5/ and keeps worldbuilding_trace.json, worldbuilding_artifacts/llm_calls, reports, and eval_result.json per variant.

Development

Successfully merging this pull request may close these issues.

S2 - Researcher 1: Qualitative case (Issue #12) + Line 1: Temporal update with post-cutoff evidence
