Phase 0: Table extraction accuracy harness + fixtures + CI gate #62
Merged
pratyush618 merged 12 commits into main · Apr 16, 2026
Conversation
The landing page rendered an `<h1>` with the text "paperjam" directly below the logo image, which itself contains the word "paperjam". Drop the duplicated heading — the logo's alt text keeps the product name accessible to screen readers, and the Docusaurus `<Layout title="Home">` already sets the HTML `<title>`.
…rker

Sets up the foundation for the table extraction accuracy harness:

- reportlab for deterministic synthetic PDF fixture generation.
- pandas promoted from optional to dev so tests can exercise `Table.to_dataframe()` in accuracy scoring.
- Registers the `accuracy` pytest marker used by the harness tests.
Each fixture is described by a single `TableSpec` that is the shared source of truth for both the rendered PDF and the ground-truth JSON sidecar — by construction they cannot drift apart.

- `reportlab.rl_config.invariant = 1` + `pageCompression = 0` for bit-identical output across runs (verified via SHA256).
- 8 fixture types covering bordered/borderless, merged cells, borderless financial/invoice layouts, a landscape page, sparse subtotal rows, and a 2-page continuation with a repeated header.
- Output lands under `tests/fixtures/tables/`, paired `<name>.pdf` + `<name>.gt.json`.
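The shared-spec idea can be sketched as follows. `TableSpec`'s actual fields are an assumption — only the pattern (one dataclass feeding both the PDF renderer and the JSON sidecar, so the two cannot disagree) comes from the PR:

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class TableSpec:
    # Hypothetical field set — the real generator's spec may differ.
    name: str
    page: int
    rows: list[list[str]]
    # (row_start, col_start, row_end, col_end) merged-cell spans.
    spans: list[tuple[int, int, int, int]] = field(default_factory=list)


def write_ground_truth(spec: TableSpec, out_dir: Path) -> Path:
    """Emit the .gt.json sidecar from the same spec that drives the PDF.

    Because both artifacts derive from one object, they cannot drift apart.
    sort_keys + trailing newline keep the output byte-stable across runs.
    """
    sidecar = out_dir / f"{spec.name}.gt.json"
    sidecar.write_text(json.dumps(asdict(spec), indent=2, sort_keys=True) + "\n")
    return sidecar
```

The PDF renderer (not shown) would consume the same `spec.rows` / `spec.spans`, which is what makes the sidecar trustworthy as ground truth.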
…cars

Output of `scripts/generate_table_fixtures.py` checked in so the harness and CI don't depend on regenerating at test time. 8 fixtures totaling ~50 KB:

- bordered_simple, bordered_dense, bordered_merged
- borderless_financial, borderless_invoice
- multipage_continuation (2 pages, repeated header)
- rotated_landscape (landscape page)
- sparse_cells (borderless with blank subtotal cells)

Regeneration is verified for bit-identical output in CI.
Pure-Python scorer — no external deps beyond the stdlib. Implements:

- Whitespace-normalized text comparison.
- Greedy bbox-IoU 1:1 matching of predicted tables to ground truth, restricted to the same page.
- Cell-level multiset precision/recall on normalized text.
- GriTS-Top: F1 over the set of (row_start, row_end, col_start, col_end) topological signatures of every cell, accounting for row/col spans. Robust to missing text; penalizes structural errors.

9 unit tests cover the identical, all-wrong-text, missing-column, extra-merge, page-mismatch, and low-IoU cases.
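A minimal sketch of two of the scoring primitives, assuming the signature tuples and normalization described above — the real `tests/python/grits.py` may differ in detail:

```python
from collections import Counter


def normalize(text: str) -> str:
    # Whitespace-normalized comparison: collapse internal runs, strip edges.
    return " ".join(text.split())


def cell_precision_recall(pred: list[str], gt: list[str]) -> tuple[float, float]:
    """Multiset precision/recall over normalized cell texts.

    Counter intersection gives the multiset overlap, so duplicate cell
    values are only credited as many times as they appear on both sides.
    """
    p, g = Counter(map(normalize, pred)), Counter(map(normalize, gt))
    overlap = sum((p & g).values())
    return overlap / max(sum(p.values()), 1), overlap / max(sum(g.values()), 1)


def grits_top_f1(pred_cells, gt_cells) -> float:
    """F1 over (row_start, row_end, col_start, col_end) cell signatures.

    Purely structural: a missed merge or an extra column changes the
    signature set, while wrong cell text does not affect this metric.
    """
    p, g = set(pred_cells), set(gt_cells)
    if not p and not g:
        return 1.0
    tp = len(p & g)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

An extra-merge error, for example, replaces two unit-span signatures with one wide span, costing both precision and recall in `grits_top_f1` without touching the text metric.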
New pytest suite under `-m accuracy` runs over every `.gt.json` fixture, converts `paperjam.Table` outputs into the harness shape, and scores against ground truth via GriTS-Top + cell-level P/R.

- `conftest.py`: session-scoped `accuracy_report` fixture accumulates per-fixture scores and writes `tests/output/table_accuracy.json` on session teardown.
- `--update-baseline` CLI option rewrites `tests/fixtures/tables/.accuracy_baseline.json` after an intentional improvement.
- `test_baseline_regression` fails if aggregate table-detection F1, cell F1, or GriTS-Top F1 drops more than 1% from the committed baseline.
- `tests/output/` is excluded from version control.

Initial committed baseline (on current heuristics):

- table_detection_f1 = 1.0000
- avg_cell_f1 = 0.9923
- avg_grits_top_f1 = 0.9557

Phase 1 (merged-cell reconstruction) will drive the GriTS-Top number up by fixing bordered_merged (currently 0.757).
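The regression gate reduces to a per-metric comparison against the committed JSON baseline. This sketch assumes the metric key names above and treats the 1% tolerance as an absolute drop (whether the real gate uses an absolute or relative threshold is not stated in the PR):

```python
import json
from pathlib import Path

# Fail if any aggregate metric drops more than 1 point (absolute) vs baseline.
TOLERANCE = 0.01
METRICS = ("table_detection_f1", "avg_cell_f1", "avg_grits_top_f1")


def check_against_baseline(report: dict, baseline_path: Path) -> list[str]:
    """Return regression messages; an empty list means the gate passes."""
    baseline = json.loads(baseline_path.read_text())
    failures = []
    for metric in METRICS:
        drop = baseline[metric] - report[metric]
        if drop > TOLERANCE:
            failures.append(
                f"{metric}: {report[metric]:.4f} vs baseline {baseline[metric]:.4f}"
            )
    return failures
```

`--update-baseline` would then simply overwrite `baseline_path` with the current report after an intentional improvement.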
Runs against the full fixture corpus under `tests/fixtures/tables/`. Document parsing and page loading happen outside the measured section, so the bench only times `table::extract_tables` — that's the code Phases 1–4 will change. Local baseline on this machine: 13–107 µs per fixture. Not wired into CI yet since cross-runner wall clock is too noisy; it gets a dedicated perf-regression job once a stable single-runner baseline exists. Run with: `cargo bench -p paperjam-core --bench table_extraction`
New ubuntu-latest / Python 3.13 job runs on every PR, gated on the existing lint job:

1. Regenerates fixtures via `scripts/generate_table_fixtures.py` and diffs against the committed copies. Catches non-deterministic generator changes before they pollute the corpus.
2. Runs `pytest -m accuracy` — the parametrized fixture tests plus the baseline regression gate.
3. Uploads `tests/output/table_accuracy.json` as an artifact so per-fixture deltas are visible on every PR.

The cross-platform pytest matrix already covers correctness on macOS and Windows; one ubuntu runner is enough for accuracy scoring.
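Step 1's determinism check boils down to hashing regenerated outputs against the committed copies. A stdlib-only sketch (function names are hypothetical, not the PR's actual CI script):

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_drift(committed: Path, regenerated: Path) -> list[str]:
    """Names of regenerated fixtures whose bytes differ from the committed copy.

    A non-empty result means the generator is no longer deterministic
    (or the committed corpus is stale) and the CI job should fail.
    """
    drifted = []
    for fresh in sorted(regenerated.glob("*")):
        baseline = committed / fresh.name
        if not baseline.exists() or sha256_of(baseline) != sha256_of(fresh):
            drifted.append(fresh.name)
    return drifted
```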
CLAUDE.md is a local developer-assistant guide, not a tracked project artifact.
New local hook (`stages: [commit-msg]`) that strips any Co-Authored-By trailers from the commit message before the commit is created. The hook script lives at `.claude/hooks/strip-ai-attribution.sh`, which is gitignored, so contributors who want the hook active must provide their own copy locally. In standard pre-commit runs (and CI) this hook is inert — it only fires on the commit-msg stage, which must be enabled via `pre-commit install --hook-type commit-msg`.
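Since the script itself is gitignored, here is a hypothetical sketch of its effect in Python (the real hook is a shell script; the function name and regex are assumptions). A commit-msg hook receives the path to the message file as its first argument, filters it, and rewrites it in place:

```python
import re


def strip_attribution(message: str) -> str:
    """Drop Co-Authored-By trailer lines from a commit message.

    Everything else is kept intact; the result always ends with a
    single trailing newline, as git expects for a commit-msg file.
    A hook wrapper would read the file at sys.argv[1], pass its
    contents through this function, and write the result back.
    """
    kept = [
        line
        for line in message.splitlines()
        if not re.match(r"(?i)^co-authored-by:", line)
    ]
    return "\n".join(kept).rstrip() + "\n"
```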
Rust 1.95 tightens `collapsible_match` and `unnecessary_sort_by`. These errors were pre-existing on main but surface here because the CI toolchain just bumped to 1.95; leaving them unfixed would keep the accuracy CI job (and rust-test) red on every PR.

- paperjam-epub/toc.rs, parser.rs: fold a single-branch `if` inside match arms into if-guards on the arm itself. No behavior change — subsequent arms are distinct `Event` variants or a `_` catch-all.
- paperjam-pptx/parser.rs: same treatment for the `Event::Empty` and `Event::Text` arms.
- paperjam-core/forms/mod.rs: merge the `None` / `Some((true, _))` / `Some((false, _))` with inner if-else into a guarded arm plus a catch-all. All three original 'not found' paths now share one catch-all body; semantics are identical.
- paperjam-core/manipulation/insert.rs: `sort_by(|a, b| b.0.cmp(&a.0))` → `sort_by_key(|p| Reverse(p.0))`.
Windows CI (`test (windows-latest, 3.x)`) fails with 'Invalid file trailer' on every PDF under `tests/fixtures/tables/`. Those fixtures are generated with `pageCompression=0` (required for byte-deterministic output), and the uncompressed ASCII streams fool git's binary-vs-text heuristic, so `core.autocrlf=true` rewrites LF to CRLF on Windows checkout and shifts xref/trailer offsets into parser-breaking garbage. Add an explicit `.gitattributes` declaring PDFs and other universally binary formats (fonts, images, wasm) as binary so no content-based heuristic applies. Ubuntu/macOS CI and local runs are unaffected.
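A minimal `.gitattributes` along these lines — the exact extension list in the PR is an assumption; the fix only requires that `*.pdf` be marked binary so autocrlf never touches it:

```gitattributes
# Never apply text/EOL conversion to inherently binary formats.
*.pdf   binary
*.png   binary
*.jpg   binary
*.ttf   binary
*.otf   binary
*.woff2 binary
*.wasm  binary
```

`binary` is shorthand for `-text -diff`, so these files are exempt from both CRLF rewriting and textual diffing regardless of any content heuristic or `core.autocrlf` setting.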
Summary
Lays the measurement foundation for the table-extraction roadmap. No production code under `crates/paperjam-core/src/table/*` or `py_src/paperjam/` is changed — this PR adds the harness that future phases (merged-cell reconstruction, multi-page stitching, borderless robustness, type inference) will drive against.

- `tests/fixtures/tables/` — 8 PDFs + `.gt.json` sidecars generated from a single `TableSpec` (PDF and GT cannot drift; SHA256-verified reproducible via `reportlab.rl_config.invariant = 1` + `pageCompression=0`). Covers bordered/borderless, merged cells, 2-page continuation, landscape page, and sparse subtotal rows.
- `tests/python/grits.py`: whitespace-normalized text, greedy bbox-IoU table matching, cell-level multiset P/R, and GriTS-Top F1 over cell topological signatures (structural metric, robust to text errors). 9 unit tests cover the identical / all-wrong / missing-col / extra-merge / page-mismatch / low-IoU cases.
- `-m accuracy` suite parametrized over every `.gt.json`. Writes `tests/output/table_accuracy.json` and includes a `test_baseline_regression` gate that fails CI if aggregate table-detection F1, cell F1, or GriTS-Top F1 drops >1% vs the committed baseline. `--update-baseline` refreshes after intentional improvements.
- `crates/paperjam-core/benches/table_extraction.rs` — per-fixture wall clock (13–107 µs locally). Not wired into CI yet (cross-runner noise); local-only until a stable single-runner baseline exists.
- CI job (`accuracy`) on ubuntu-latest + py3.13: regenerates fixtures and diffs against the committed copies (catches non-deterministic generator changes), runs `-m accuracy`, uploads the JSON report as an artifact.
- Removes the redundant `<h1>paperjam</h1>` from the docs-site hero (the logo image already says "paperjam"); happy to split into its own PR on request.

Committed baseline (current heuristics, pre-Phase 1)
| Metric | Value |
| --- | --- |
| table_detection_f1 | 1.0000 |
| avg_cell_f1 | 0.9923 |
| avg_grits_top_f1 | 0.9557 |

The `bordered_merged` fixture alone scores GriTS-Top 0.757 — that number quantifies exactly the bug Phase 1 (merged-cell reconstruction in `lattice.rs`) will fix. Every future PR will show a measurable delta.

Commits (8)
- fix(docs-site) — remove redundant 'paperjam' hero heading
- chore(deps) — reportlab + pandas dev deps + `accuracy` pytest marker
- feat(fixtures) — deterministic synthetic fixture generator script
- feat(fixtures) — committed PDFs + GT sidecars
- feat(tests) — GriTS-Top + cell-level P/R scorer + 9 unit tests
- feat(tests) — pytest accuracy harness + baseline regression gate
- feat(bench) — criterion microbench
- ci — accuracy job in `ci.yml`

Test plan
- `uv run pytest tests/python/` — 88 existing + 18 new = all green
- `uv run pytest tests/python/ -m accuracy -v` — 8 fixture scores + baseline gate green
- `uv run pytest tests/python/ -m accuracy --update-baseline` round-trips cleanly
- `uv run python scripts/generate_table_fixtures.py` is bit-identical (SHA256 diff shows no changes)
- `cargo bench -p paperjam-core --bench table_extraction --quick` runs end-to-end
- `cargo clippy --workspace --all-targets -- -D warnings` clean
- `cargo fmt --all --check` clean
- `uv run ruff check tests/python/ scripts/` + `ruff format --check` clean
- `uv run mypy py_src/ tests/` clean
- `cargo check -p paperjam-wasm --target wasm32-unknown-unknown` clean
- `cargo test --workspace` clean
- `accuracy` job green on this branch once PR opens