Recover Phase 1 benchmark runner and CI gates onto main #15

Merged: magic-alt merged 4 commits into main from feat/phase1-benchmark-runner on May 22, 2026

Conversation

@magic-alt
Owner

Summary

This PR recovers the Phase 1 benchmark runner and CI gate changes onto main.

The Phase 1 work was reviewed in PR #14, which was intentionally stacked on chore/phase0-eval-gates. PR #13 merged the Phase 0 branch into main first; PR #14 was merged afterwards, but only into chore/phase0-eval-gates. As a result, the 4 Phase 1 commits never reached main.

This PR re-targets the same Phase 1 delta directly to main so the work is visible and mergeable on the default branch.

Included commits

  • 803fc3d add phase 1 benchmark runner and ci gates
  • 309830b fix PR14 CI regressions
  • fe89119 fix eval gate extraction
  • 5b47c78 fix markdown lint and ts config warnings
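
One way to confirm that the commits listed above are missing from main is to ask git whether each one is reachable from that branch. The sketch below is illustrative only and is not part of this PR: the helper names `is_ancestor` and `missing_from` are hypothetical, and it assumes a local clone with `main` up to date. It relies on `git merge-base --is-ancestor`, which exits with status 0 when the first commit is an ancestor of the second.

```python
import subprocess
from typing import Callable, Iterable, List


def is_ancestor(commit: str, branch: str) -> bool:
    """Return True if `commit` is reachable from `branch`.

    Any non-zero exit (not an ancestor, or an unknown commit) counts as False.
    """
    result = subprocess.run(
        ["git", "merge-base", "--is-ancestor", commit, branch],
        check=False,
    )
    return result.returncode == 0


def missing_from(
    commits: Iterable[str],
    branch: str,
    check: Callable[[str, str], bool] = is_ancestor,
) -> List[str]:
    """List the commits that are not yet reachable from `branch`."""
    return [c for c in commits if not check(c, branch)]


# Short hashes copied from the "Included commits" list in this PR.
PHASE1_COMMITS = ["803fc3d", "309830b", "fe89119", "5b47c78"]
```

Calling `missing_from(PHASE1_COMMITS, "main")` inside the repository would return the hashes that still need to land. The `check` parameter exists only so the logic can be exercised without a git checkout.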

What Changed

  • Extend scripts/eval.py with local benchmark manifest support and document-level metrics
  • Add committed local manifest batches under benchmarks/local_manifests/
  • Expand key-fact eval coverage and thresholds
  • Wire eval threshold gating into CI and local CI
  • Include the follow-up CI, markdownlint, and TypeScript config fixes from the original stacked branch
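
The threshold gating mentioned above can be pictured as a comparison between computed metrics and per-metric minimums. This is a minimal sketch, not the logic in scripts/eval.py: the functions `check_thresholds` and `gate_exit_code`, the flat `{metric: minimum}` layout, and any metric names are assumptions made for illustration. The actual schema of benchmarks/thresholds/golden_minimum.json is not shown in this PR.

```python
from typing import Dict, List


def check_thresholds(
    metrics: Dict[str, float], minimums: Dict[str, float]
) -> List[str]:
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for name, minimum in sorted(minimums.items()):
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing (minimum {minimum})")
        elif value < minimum:
            failures.append(f"{name}: {value} < {minimum}")
    return failures


def gate_exit_code(metrics: Dict[str, float], minimums: Dict[str, float]) -> int:
    """Map failures to a process exit code so a CI job can fail on them."""
    failures = check_thresholds(metrics, minimums)
    for message in failures:
        print(f"THRESHOLD FAILED {message}")
    return 1 if failures else 0
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a gate that silently passes when a metric disappears from the report would hide regressions in the eval script itself.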

Why this recovery PR exists

PR #14 was merged into chore/phase0-eval-gates, not into main.

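Because PR #13 had already been merged into main before PR #14 landed on chore/phase0-eval-gates, merging PR #14 did not carry the Phase 1 commits to the default branch, and GitHub no longer shows the original stacked path as an active route for getting them there. This PR restores that route explicitly.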

Validation

  • python -m pytest tests/test_metrics.py tests/test_eval_script.py tests/test_benchmark_manifest.py tests/test_normalizer_and_events.py tests/golden/test_golden.py -q
  • python scripts/eval.py --skip-pytest --thresholds benchmarks/thresholds/golden_minimum.json --output-dir data/eval-gate-check
  • python scripts/eval.py --benchmark-manifest benchmarks/local_manifests/us_filings_private_batch.json --skip-pytest --output-dir data/eval-us-private
  • python scripts/eval.py --benchmark-manifest benchmarks/local_manifests/stress_private_batch.json --skip-pytest --output-dir data/eval-stress-private
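
The commands above could be chained so that the first failure stops the run. The following is a minimal sketch under the assumption that it is started from the repository root; `run_steps` and `_run` are hypothetical helpers and do not reflect the contents of scripts/local_ci.sh. Only two of the listed commands are included to keep the example short.

```python
import shlex
import subprocess
from typing import Callable, Sequence

# Two of the validation commands listed above, copied verbatim.
STEPS = [
    "python scripts/eval.py --skip-pytest "
    "--thresholds benchmarks/thresholds/golden_minimum.json "
    "--output-dir data/eval-gate-check",
    "python scripts/eval.py "
    "--benchmark-manifest benchmarks/local_manifests/stress_private_batch.json "
    "--skip-pytest --output-dir data/eval-stress-private",
]


def _run(argv: Sequence[str]) -> int:
    """Run one command and return its exit code without raising."""
    return subprocess.run(list(argv), check=False).returncode


def run_steps(
    steps: Sequence[str], run: Callable[[Sequence[str]], int] = _run
) -> int:
    """Run each command in order; return the first non-zero exit code, else 0."""
    for step in steps:
        code = run(shlex.split(step))
        if code != 0:
            print(f"step failed ({code}): {step}")
            return code
    return 0
```

A caller would typically end with `raise SystemExit(run_steps(STEPS))` so the shell sees the failing exit code.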

Copilot AI review requested due to automatic review settings May 22, 2026 06:36
@magic-alt magic-alt merged commit f742895 into main May 22, 2026
8 checks passed

Copilot AI left a comment

Pull request overview

This PR reintroduces the “Phase 1” benchmark runner and CI quality gates onto main, expanding eval coverage (including key-fact/document-level metrics), adding local benchmark manifests/fixtures, and wiring the thresholds gate into CI and the local CI script.

Changes:

  • Extend scripts/eval.py to support running benchmark manifests (--benchmark-manifest), emit per-document metrics, and compute key-fact accuracy metrics used by threshold gates.
  • Add committed local manifest metadata + a synthetic fixture to exercise the manifest flow, and expand golden cases/normalization to cover additional key facts (cash, gross profit, operating income, operating cash flow, capex).
  • Wire eval threshold gating into CI/local CI and update web TS config/tooling (TypeScript upgrade + config warnings cleanup).
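
To make the key-fact accuracy idea concrete, here is a small sketch of alias-aware fact comparison. It does not reproduce src/utils/metrics.py or src/finance/normalizer.py: the alias table, the canonical concept names, the relative tolerance, and the function names are all assumptions chosen for illustration.

```python
from typing import Dict, Mapping

# Hypothetical alias table: label variant -> canonical concept name.
ALIASES: Dict[str, str] = {
    "cash and cash equivalents": "cash",
    "capital expenditures": "capex",
    "net cash provided by operating activities": "operating_cash_flow",
}


def canonical(label: str) -> str:
    """Lowercase, collapse whitespace, then map known aliases to a concept."""
    key = " ".join(label.lower().split())
    return ALIASES.get(key, key)


def key_fact_accuracy(
    expected: Mapping[str, float],
    extracted: Mapping[str, float],
    rel_tol: float = 0.01,
) -> float:
    """Fraction of expected facts matched within a relative tolerance."""
    if not expected:
        return 0.0
    found = {canonical(k): v for k, v in extracted.items()}
    hits = 0
    for label, want in expected.items():
        got = found.get(canonical(label))
        if got is not None and abs(got - want) <= rel_tol * abs(want):
            hits += 1
    return hits / len(expected)
```

Resolving both sides through the same `canonical` function means an extracted "Cash and Cash Equivalents" line can satisfy an expected `cash` fact, which is the kind of lookup that alias support is meant to enable.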

Reviewed changes

Copilot reviewed 20 out of 22 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| web/tsconfig.json | Adjust TS config (deprecation ignore + path mapping tweak). |
| web/package.json | Bump TypeScript dev dependency. |
| web/package-lock.json | Lockfile updates for the TypeScript bump. |
| tests/test_normalizer_and_events.py | Add normalization alias tests for cash equivalents/capex. |
| tests/test_metrics.py | Update test helper + add alias coverage for operating cash flow. |
| tests/test_eval_script.py | Add tests for manifest execution, document metrics output, and local PDF manifest spec. |
| tests/test_benchmark_manifest.py | Validate committed local manifests and ensure thresholds include key-fact metrics. |
| tests/golden/conftest.py | Expand golden cases to include expected facts for new key metrics (incl. capex/cash). |
| src/utils/metrics.py | Add concept alias support and reuse it for statement/fact metric lookups. |
| src/finance/normalizer.py | Expand account-name normalization mappings (EN + ZH) for new concepts. |
| src/agent/nodes.py | Expand extracted totals/labels; refine labeled-metric extraction segmentation/value selection. |
| scripts/local_ci.sh | Add eval thresholds step and expand ruff target set. |
| scripts/eval.py | Add benchmark-manifest runner, dataset metadata, per-case metrics, and key-fact accuracy metrics. |
| README.md | Formatting-only newline normalization at EOF. |
| docs/financial_fact_platform_roadmap.md | Heading formatting + EOF newline normalization. |
| benchmarks/thresholds/golden_minimum.json | Tighten thresholds and add key-fact accuracy gates. |
| benchmarks/sample_manifest.json | Switch sample manifest to use a committed synthetic fixture + richer expected facts. |
| benchmarks/README.md | Document running sample manifest + local private batches. |
| benchmarks/local_manifests/us_filings_private_batch.json | Add committed “local-private” US filings manifest metadata. |
| benchmarks/local_manifests/stress_private_batch.json | Add committed “local-private” stress manifest metadata. |
| benchmarks/fixtures/synthetic_income_001.json | Add synthetic fixture used by sample manifest. |
| .github/workflows/ci.yml | Run ruff on scripts/, add eval thresholds gate, and change docker job triggering behavior. |

Files not reviewed (1)
  • web/package-lock.json: Language not supported

Comment thread scripts/eval.py
Comment on lines 126 to +130
"## Metrics",
"",
]
dataset = report.get("dataset")
if dataset:
Comment thread .github/workflows/ci.yml
@@ -79,7 +82,6 @@ jobs:
docker:

2 participants