Phase 0: Table extraction accuracy harness + fixtures + CI gate #62

Merged
pratyush618 merged 12 commits into main from feature/table-extraction-accuracy-harness
Apr 16, 2026

@pratyush618
Collaborator

Summary

Lays the measurement foundation for the table-extraction roadmap. No production code under crates/paperjam-core/src/table/* or py_src/paperjam/ is changed — this PR adds the harness that future phases (merged-cell reconstruction, multi-page stitching, borderless robustness, type inference) will drive against.

  • Deterministic synthetic fixture corpus under tests/fixtures/tables/ — 8 PDFs + .gt.json sidecars generated from a single TableSpec (PDF and GT cannot drift; SHA256-verified reproducible via reportlab.rl_config.invariant = 1 + pageCompression=0). Covers bordered/borderless, merged cells, 2-page continuation, landscape page, and sparse subtotal rows.
  • Pure-Python scorer at tests/python/grits.py: whitespace-normalized text, greedy bbox-IoU table matching, cell-level multiset P/R, and GriTS-Top F1 over cell topological signatures (structural metric, robust to text errors). 9 unit tests cover the identical / all-wrong / missing-col / extra-merge / page-mismatch / low-IoU cases.
  • pytest -m accuracy suite parametrized over every .gt.json. Writes tests/output/table_accuracy.json and includes a test_baseline_regression gate that fails CI if aggregate table-detection F1, cell F1, or GriTS-Top F1 drops >1% vs the committed baseline. --update-baseline refreshes after intentional improvements.
  • Criterion microbench at crates/paperjam-core/benches/table_extraction.rs — per-fixture wall-clock (13–107 µs locally). Not wired into CI yet (cross-runner noise); local-only until a stable single-runner baseline exists.
  • New CI job (accuracy) on ubuntu-latest + py3.13: regenerates fixtures and diffs against the committed copies (catches non-deterministic generator changes), runs -m accuracy, uploads the JSON report as an artifact.
  • Incidental cleanup: removes a redundant <h1>paperjam</h1> from the docs-site hero (the logo image already says "paperjam"); happy to split into its own PR on request.

Committed baseline (current heuristics, pre-Phase 1)

Metric               Value
table_detection_f1   1.0000
avg_cell_f1          0.9923
avg_grits_top_f1     0.9557

The bordered_merged fixture alone scores GriTS-Top 0.757 — that number quantifies exactly the bug Phase 1 (merged-cell reconstruction in lattice.rs) will fix. Every future PR will show a measurable delta.

Commits (8)

  1. fix(docs-site) — remove redundant 'paperjam' hero heading
  2. chore(deps) — reportlab + pandas dev deps + accuracy pytest marker
  3. feat(fixtures) — deterministic synthetic fixture generator script
  4. feat(fixtures) — committed PDFs + GT sidecars
  5. feat(tests) — GriTS-Top + cell-level P/R scorer + 9 unit tests
  6. feat(tests) — pytest accuracy harness + baseline regression gate
  7. feat(bench) — criterion microbench
  8. ci — accuracy job in ci.yml

Test plan

  • uv run pytest tests/python/: 88 existing + 18 new tests, all green
  • uv run pytest tests/python/ -m accuracy -v — 8 fixture scores + baseline gate green
  • uv run pytest tests/python/ -m accuracy --update-baseline round-trips cleanly
  • uv run python scripts/generate_table_fixtures.py is bit-identical (SHA256 diff shows no changes)
  • cargo bench -p paperjam-core --bench table_extraction --quick runs end-to-end
  • cargo clippy --workspace --all-targets -- -D warnings clean
  • cargo fmt --all --check clean
  • uv run ruff check tests/python/ scripts/ + ruff format --check clean
  • uv run mypy py_src/ tests/ clean
  • cargo check -p paperjam-wasm --target wasm32-unknown-unknown clean
  • cargo test --workspace clean
  • CI accuracy job green on this branch once PR opens

The landing page rendered an h1 with the text 'paperjam' directly below
the logo image, which itself contains the word 'paperjam'. Drop the
duplicated heading — the logo's alt text keeps the product name
accessible to screen readers, and Docusaurus <Layout title='Home'>
already sets the HTML <title>.

Sets up the foundation for the table extraction accuracy harness:
- reportlab for deterministic synthetic PDF fixture generation.
- pandas promoted from optional to dev so tests can exercise
  Table.to_dataframe() in accuracy scoring.
- Registers the 'accuracy' pytest marker used by the harness tests.

Each fixture is described by a single TableSpec that is the shared
source of truth for both the rendered PDF and the ground-truth JSON
sidecar — by construction they cannot drift apart.

- reportlab.rl_config.invariant = 1 + pageCompression = 0 for
  bit-identical output across runs (verified via SHA256).
- 8 fixture types covering bordered/borderless, merged cells,
  borderless financial/invoice layouts, a landscape page, sparse
  subtotal rows, and a 2-page continuation with repeated header.
- Output lands under tests/fixtures/tables/, paired <name>.pdf +
  <name>.gt.json.
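
A minimal sketch of the "single source of truth" idea, using only the standard library. The `TableSpec` and `CellSpec` field names and the JSON layout below are assumptions for illustration, not the schema used by scripts/generate_table_fixtures.py; the real script also renders the PDF from the same spec via reportlab, which is omitted here.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CellSpec:
    """One logical cell; spans default to 1 (hypothetical shape)."""
    row: int
    col: int
    text: str
    row_span: int = 1
    col_span: int = 1


@dataclass(frozen=True)
class TableSpec:
    """Hypothetical single description that both outputs derive from."""
    name: str
    page: int
    bbox: tuple[float, float, float, float]  # x0, y0, x1, y1 in PDF points
    cells: tuple[CellSpec, ...] = ()

    def to_ground_truth(self) -> str:
        """Serialize the spec to canonical JSON for a .gt.json sidecar."""
        payload = {
            "name": self.name,
            "page": self.page,
            "bbox": list(self.bbox),
            "cells": [
                {"row": c.row, "col": c.col, "row_span": c.row_span,
                 "col_span": c.col_span, "text": c.text}
                for c in self.cells
            ],
        }
        # sort_keys plus fixed separators keep the bytes stable across runs.
        return json.dumps(payload, sort_keys=True, separators=(",", ":"))


def sha256_hex(data: bytes) -> str:
    """Hex digest used to compare two generations byte-for-byte."""
    return hashlib.sha256(data).hexdigest()


spec = TableSpec(
    name="bordered_simple",
    page=1,
    bbox=(72.0, 600.0, 540.0, 720.0),  # made-up coordinates
    cells=(CellSpec(0, 0, "Item"), CellSpec(0, 1, "Qty"),
           CellSpec(1, 0, "Widget"), CellSpec(1, 1, "3")),
)
first = sha256_hex(spec.to_ground_truth().encode("utf-8"))
second = sha256_hex(spec.to_ground_truth().encode("utf-8"))
print(first == second)  # True
```

Because the sidecar is a pure function of the spec, any change to a fixture has to go through the spec, so the PDF renderer and the ground truth see the same edit.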

Output of scripts/generate_table_fixtures.py checked in so the
harness and CI don't depend on regenerating at test time. 8 fixtures
totaling ~50 KB:

- bordered_simple, bordered_dense, bordered_merged
- borderless_financial, borderless_invoice
- multipage_continuation (2 pages, repeated header)
- rotated_landscape (landscape page)
- sparse_cells (borderless with blank subtotal cells)

Regeneration is verified for bit-identical output in CI.
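
The bit-identical check can be sketched as a digest comparison between the committed directory and a freshly regenerated one. The helper names and the temporary directories below are hypothetical; the CI job in this PR diffs the regenerated files against the committed copies, and this is just one way such a diff might be expressed.

```python
import hashlib
import tempfile
from pathlib import Path


def digest_tree(root: Path) -> dict[str, str]:
    """Map each file's relative path to the SHA256 of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def drifted(committed: Path, regenerated: Path) -> list[str]:
    """Return paths that are missing, extra, or differ byte-for-byte."""
    old, new = digest_tree(committed), digest_tree(regenerated)
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


with tempfile.TemporaryDirectory() as tmp:
    committed_dir = Path(tmp, "committed")
    regen_dir = Path(tmp, "regenerated")
    committed_dir.mkdir()
    regen_dir.mkdir()
    # One file is identical in both trees, one differs by a few bytes.
    for d in (committed_dir, regen_dir):
        (d / "bordered_simple.pdf").write_bytes(b"%PDF-1.4\nsame bytes\n")
    (committed_dir / "sparse_cells.pdf").write_bytes(b"%PDF-1.4\nold\n")
    (regen_dir / "sparse_cells.pdf").write_bytes(b"%PDF-1.4\nnew\n")
    changed = drifted(committed_dir, regen_dir)

print(changed)  # ['sparse_cells.pdf']
```

An empty list means the generator is still deterministic; any entry points at the fixture whose bytes moved.
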
Pure-Python scorer — no external deps beyond stdlib. Implements:

- Whitespace-normalized text comparison.
- Greedy bbox-IoU 1:1 matching of predicted tables to ground truth,
  restricted to same page.
- Cell-level multiset precision/recall on normalized text.
- GriTS-Top: F1 over the set of (row_start, row_end, col_start,
  col_end) topological signatures of every cell, accounting for
  row/col spans. Robust to missing text; penalizes structural errors.
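
To illustrate how the two cell-level metrics respond differently to a missed merge, here is a sketch that follows the description above: multiset overlap on normalized text, and set overlap on span signatures. The cell dictionary keys (`row`, `col`, `row_span`, `col_span`, `text`) are assumed for the example and may not match tests/python/grits.py; the signature F1 is the simplified set-based form described in this commit, not a full GriTS implementation.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Collapse any run of whitespace to one space and trim the ends."""
    return " ".join(text.split())


def f1(tp: int, n_pred: int, n_true: int) -> float:
    """F1 from a true-positive count and the two totals."""
    if tp == 0 or n_pred == 0 or n_true == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_true
    return 2 * precision * recall / (precision + recall)


def cell_text_f1(pred: list[dict], true: list[dict]) -> float:
    """Multiset overlap of normalized cell texts (duplicates count)."""
    p = Counter(normalize(c["text"]) for c in pred)
    t = Counter(normalize(c["text"]) for c in true)
    tp = sum((p & t).values())  # Counter & keeps the minimum count per key
    return f1(tp, sum(p.values()), sum(t.values()))


def signature(cell: dict) -> tuple[int, int, int, int]:
    """(row_start, row_end, col_start, col_end) with inclusive ends."""
    r, c = cell["row"], cell["col"]
    return (r, r + cell.get("row_span", 1) - 1,
            c, c + cell.get("col_span", 1) - 1)


def topology_f1(pred: list[dict], true: list[dict]) -> float:
    """F1 over the sets of span signatures; cell text is ignored."""
    p = {signature(c) for c in pred}
    t = {signature(c) for c in true}
    return f1(len(p & t), len(p), len(t))


# Ground truth: a header merged across two columns, then two body cells.
true_cells = [
    {"row": 0, "col": 0, "col_span": 2, "text": "Totals"},
    {"row": 1, "col": 0, "text": "A"},
    {"row": 1, "col": 1, "text": "B"},
]
# Prediction: the merge is split into two cells; text is otherwise right.
pred_cells = [
    {"row": 0, "col": 0, "text": "Totals"},
    {"row": 0, "col": 1, "text": ""},
    {"row": 1, "col": 0, "text": "A"},
    {"row": 1, "col": 1, "text": "  B "},
]
print(round(cell_text_f1(pred_cells, true_cells), 3))  # 0.857
print(round(topology_f1(pred_cells, true_cells), 3))   # 0.571
```

In this toy case the text score stays high while the structural score drops sharply, which is the same pattern the bordered_merged fixture shows against the baseline.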

9 unit tests cover the identical, all-wrong-text, missing-column,
extra-merge, page-mismatch, and low-IoU cases.
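
The greedy same-page IoU matching, including the page-mismatch and low-IoU cases named above, could look roughly like this. The `min_iou` threshold of 0.5 and the table dictionary shape are assumptions made for the sketch; the actual threshold in the scorer is not stated in this PR.

```python
def iou(a, b) -> float:
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def greedy_match(pred, true, min_iou=0.5):
    """Pair tables 1:1, best IoU first, same page only.

    Returns a list of (pred_index, true_index, iou) tuples.
    """
    candidates = []
    for i, p in enumerate(pred):
        for j, t in enumerate(true):
            if p["page"] != t["page"]:
                continue  # tables on different pages never match
            score = iou(p["bbox"], t["bbox"])
            if score >= min_iou:
                candidates.append((score, i, j))
    # Highest IoU first; indices break ties so the result is deterministic.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    used_pred, used_true, pairs = set(), set(), []
    for score, i, j in candidates:
        if i in used_pred or j in used_true:
            continue
        used_pred.add(i)
        used_true.add(j)
        pairs.append((i, j, score))
    return pairs


true_tables = [
    {"page": 1, "bbox": (0, 0, 100, 100)},
    {"page": 2, "bbox": (0, 0, 100, 100)},
]
pred_tables = [
    {"page": 1, "bbox": (0, 0, 100, 90)},   # IoU 0.9 with true_tables[0]
    {"page": 3, "bbox": (0, 0, 100, 100)},  # page mismatch
    {"page": 2, "bbox": (0, 0, 20, 20)},    # IoU 0.04, below the threshold
]
pairs = greedy_match(pred_tables, true_tables)
print([(i, j) for i, j, _ in pairs])  # [(0, 0)]
```

Unmatched predictions count against precision and unmatched ground-truth tables against recall when the table-detection F1 is computed.
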
New pytest suite under -m accuracy runs over every .gt.json fixture,
converts paperjam.Table outputs into the harness shape, and scores
against ground truth via GriTS-Top + cell-level P/R.

- conftest.py session-scoped accuracy_report fixture accumulates
  per-fixture scores and writes tests/output/table_accuracy.json on
  session teardown.
- --update-baseline CLI option rewrites tests/fixtures/tables/.accuracy_baseline.json
  after an intentional improvement.
- test_baseline_regression fails if aggregate table-detection F1,
  cell F1, or GriTS-Top F1 drops more than 1% from the committed
  baseline.
- tests/output/ is excluded from version control.
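
A sketch of the regression check itself, assuming the three metric keys shown in the baseline and treating "1%" as an absolute drop of 0.01. Whether the gate uses an absolute or a relative tolerance is not stated here, so that choice is an assumption. The `improved` and `regressed` reports are invented inputs for the example.

```python
TOLERANCE = 0.01  # assumption: absolute drop, not relative

GATED_METRICS = ("table_detection_f1", "avg_cell_f1", "avg_grits_top_f1")


def regressions(baseline: dict, current: dict, tol: float = TOLERANCE) -> list[str]:
    """Return one message per gated metric that dropped by more than tol."""
    failures = []
    for key in GATED_METRICS:
        drop = baseline[key] - current[key]
        if drop > tol:
            failures.append(f"{key}: {baseline[key]:.4f} -> {current[key]:.4f}")
    return failures


baseline = {
    "table_detection_f1": 1.0000,
    "avg_cell_f1": 0.9923,
    "avg_grits_top_f1": 0.9557,
}
improved = dict(baseline, avg_grits_top_f1=0.9800)  # hypothetical later run
regressed = dict(baseline, avg_cell_f1=0.9700)      # hypothetical bad run

print(regressions(baseline, improved))   # []
print(regressions(baseline, regressed))  # ['avg_cell_f1: 0.9923 -> 0.9700']
```

Improvements never trip the gate; they only become the new floor once the baseline file is refreshed with --update-baseline.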

Initial committed baseline (on current heuristics):
  table_detection_f1   = 1.0000
  avg_cell_f1          = 0.9923
  avg_grits_top_f1     = 0.9557
Phase 1 (merged-cell reconstruction) will drive the GriTS-Top number
up by fixing bordered_merged (currently 0.757).

Runs against the full fixture corpus under tests/fixtures/tables/.
Document parsing and page loading happen outside the measured section
so the bench only times table::extract_tables — that's the code
Phases 1–4 will change.

Local baseline on this machine: 13–107 µs per fixture. Not wired into
CI yet since cross-runner wall-clock is too noisy; gets a dedicated
perf-regression job once a stable single-runner baseline exists.

Run with: cargo bench -p paperjam-core --bench table_extraction

New ubuntu-latest / Python 3.13 job runs on every PR, gated by the
existing lint job:

1. Regenerates fixtures via scripts/generate_table_fixtures.py and
   diffs against the committed copies. Catches non-deterministic
   generator changes before they pollute the corpus.
2. Runs pytest -m accuracy — the parametrized fixture tests plus the
   baseline regression gate.
3. Uploads tests/output/table_accuracy.json as an artifact so
   per-fixture deltas are visible on every PR.

The cross-platform pytest matrix already covers correctness on macOS
and Windows; one ubuntu runner is enough for accuracy scoring.

@github-actions github-actions Bot added documentation Improvements or additions to documentation github_actions Pull requests that update GitHub Actions code rust Pull requests that update rust code javascript Pull requests that update javascript code python Pull requests that update Python code labels Apr 16, 2026

CLAUDE.md is a local developer-assistant guide, not a tracked project
artifact.

New local hook (stages: [commit-msg]) that strips any Co-Authored-By
trailers from the commit message before the commit is created.

The hook script lives at .claude/hooks/strip-ai-attribution.sh which
is gitignored, so contributors who want the hook active must provide
their own copy locally. In standard pre-commit runs (and CI) this
hook is inert — it only fires on the commit-msg stage, which requires
'pre-commit install --hook-type commit-msg' to be enabled.

Rust 1.95 tightens collapsible_match and unnecessary_sort_by. These
errors were pre-existing on main but surface here because the CI
toolchain just bumped to 1.95; leaving them unfixed would keep the
accuracy CI job (and rust-test) red on every PR.

- paperjam-epub/toc.rs, parser.rs: fold single-branch if inside
  match arms into if-guards on the arm itself. No behavior change —
  subsequent arms are distinct Event variants or a _ catch-all.
- paperjam-pptx/parser.rs: same treatment for Event::Empty and
  Event::Text arms.
- paperjam-core/forms/mod.rs: merge the None / Some((true, _)) /
  Some((false, _)) with inner if-else into a guarded arm plus a
  catch-all. All three original 'not found' paths now share one
  catch-all body; semantics are identical.
- paperjam-core/manipulation/insert.rs: sort_by(|a, b| b.0.cmp(&a.0))
  → sort_by_key(|p| Reverse(p.0)).

Windows CI (test (windows-latest, 3.x)) fails with 'Invalid file
trailer' on every PDF under tests/fixtures/tables/ because those
fixtures are generated with pageCompression=0 (required for
byte-deterministic output) — the uncompressed ASCII streams fool
git's binary-vs-text heuristic, so core.autocrlf=true rewrites LF to
CRLF on Windows checkout and shifts xref/trailer offsets into
parser-breaking garbage.

Add an explicit .gitattributes declaring PDFs and other universally
binary formats (fonts, images, wasm) as binary so no content-based
heuristic applies. Ubuntu/macOS CI and local runs are unaffected.
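
The failure mode can be reproduced in miniature without git. The toy byte string below is not a valid PDF; it only mimics the part that matters, namely a `startxref` value that records the byte offset of the `xref` keyword. The helper names are made up for this sketch.

```python
def build_pdf_like() -> bytes:
    """Build a toy file whose 'startxref' value points at 'xref'."""
    body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    xref_offset = len(body)  # 'xref' starts right after the body
    tail = (b"xref\n0 1\ntrailer\n<< /Size 1 >>\nstartxref\n"
            + str(xref_offset).encode("ascii") + b"\n%%EOF\n")
    return body + tail


def recorded_xref_offset(data: bytes) -> int:
    """Read the integer that follows the last 'startxref' keyword."""
    after = data.rsplit(b"startxref", 1)[1]
    return int(after.split()[0])


def xref_is_where_the_file_says(data: bytes) -> bool:
    """Check that the recorded offset really lands on 'xref'."""
    offset = recorded_xref_offset(data)
    return data[offset:offset + 4] == b"xref"


original = build_pdf_like()
# Simulates what an LF -> CRLF checkout conversion does to a file
# that git has classified as text.
checked_out = original.replace(b"\n", b"\r\n")

print(xref_is_where_the_file_says(original))     # True
print(xref_is_where_the_file_says(checked_out))  # False
```

Every inserted carriage return shifts the bytes after it by one while the recorded offset stays the same, so the pointer lands in the wrong place. Marking the pattern as binary in .gitattributes (commonly written as `*.pdf binary`) turns off the conversion for those files.
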
@pratyush618 pratyush618 merged commit 8f79339 into main Apr 16, 2026
13 checks passed
@pratyush618 pratyush618 deleted the feature/table-extraction-accuracy-harness branch April 16, 2026 20:30