Phase 0: Table extraction accuracy harness + fixtures + CI gate by pratyush618 · Pull Request #62 · ByteVeda/paperjam

pratyush618 · 2026-04-16T19:27:23Z

Summary

Lays the measurement foundation for the table-extraction roadmap. No production code under crates/paperjam-core/src/table/* or py_src/paperjam/ is changed — this PR adds the harness that future phases (merged-cell reconstruction, multi-page stitching, borderless robustness, type inference) will drive against.

- Deterministic synthetic fixture corpus under tests/fixtures/tables/: 8 PDFs + .gt.json sidecars generated from a single TableSpec (PDF and GT cannot drift; SHA256-verified reproducible via reportlab.rl_config.invariant = 1 + pageCompression=0). Covers bordered/borderless, merged cells, 2-page continuation, landscape page, and sparse subtotal rows.
- Pure-Python scorer at tests/python/grits.py: whitespace-normalized text, greedy bbox-IoU table matching, cell-level multiset P/R, and GriTS-Top F1 over cell topological signatures (structural metric, robust to text errors). 9 unit tests cover the identical / all-wrong / missing-col / extra-merge / page-mismatch / low-IoU cases.
- pytest -m accuracy suite parametrized over every .gt.json. Writes tests/output/table_accuracy.json and includes a test_baseline_regression gate that fails CI if aggregate table-detection F1, cell F1, or GriTS-Top F1 drops >1% vs the committed baseline. --update-baseline refreshes the baseline after intentional improvements.
- Criterion microbench at crates/paperjam-core/benches/table_extraction.rs: per-fixture wall-clock (13–107 µs locally). Not wired into CI yet (cross-runner noise); local-only until a stable single-runner baseline exists.
- New CI job (accuracy) on ubuntu-latest + Python 3.13: regenerates fixtures and diffs against the committed copies (catches non-deterministic generator changes), runs -m accuracy, uploads the JSON report as an artifact.
- Incidental cleanup: removes a redundant <h1>paperjam</h1> from the docs-site hero (the logo image already says "paperjam"); happy to split into its own PR on request.
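
The greedy bbox-IoU table matching mentioned in the scorer bullet can be sketched as follows. This is a minimal illustration, not the code in tests/python/grits.py: the `Table` tuple, the `(x0, y0, x1, y1)` bbox layout, and the `min_iou=0.5` threshold are all assumptions made for the example.

```python
from typing import NamedTuple


class Table(NamedTuple):
    """Hypothetical table record: page number plus (x0, y0, x1, y1) bbox."""
    page: int
    bbox: tuple


def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(pred, truth, min_iou=0.5):
    """Pair predicted and ground-truth tables 1:1, best IoU first.

    Only tables on the same page are candidates. The min_iou threshold
    is an assumed value, not one taken from the PR.
    """
    candidates = sorted(
        (
            (iou(p.bbox, t.bbox), pi, ti)
            for pi, p in enumerate(pred)
            for ti, t in enumerate(truth)
            if p.page == t.page
        ),
        reverse=True,
    )
    used_pred, used_truth, pairs = set(), set(), []
    for score, pi, ti in candidates:
        if score < min_iou:
            break  # remaining candidates overlap even less
        if pi in used_pred or ti in used_truth:
            continue  # each table may be matched at most once
        used_pred.add(pi)
        used_truth.add(ti)
        pairs.append((pi, ti))
    return pairs
```

A predicted table on the wrong page, or one overlapping its ground truth by less than the threshold, stays unmatched and counts against detection precision and recall, which is what the page-mismatch and low-IoU unit tests exercise.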

Committed baseline (current heuristics, pre-Phase 1)

| Metric | Value |
| --- | --- |
| `table_detection_f1` | 1.0000 |
| `avg_cell_f1` | 0.9923 |
| `avg_grits_top_f1` | 0.9557 |

The bordered_merged fixture alone scores GriTS-Top 0.757 — that number quantifies exactly the bug Phase 1 (merged-cell reconstruction in lattice.rs) will fix. Every future PR will show a measurable delta.

Commits (8)

fix(docs-site) — remove redundant 'paperjam' hero heading
chore(deps) — reportlab + pandas dev deps + accuracy pytest marker
feat(fixtures) — deterministic synthetic fixture generator script
feat(fixtures) — committed PDFs + GT sidecars
feat(tests) — GriTS-Top + cell-level P/R scorer + 9 unit tests
feat(tests) — pytest accuracy harness + baseline regression gate
feat(bench) — criterion microbench
ci — accuracy job in ci.yml

Test plan

The landing page rendered an h1 with the text 'paperjam' directly below the logo image, which itself contains the word 'paperjam'. Drop the duplicated heading — the logo's alt text keeps the product name accessible to screen readers, and Docusaurus <Layout title='Home'> already sets the HTML <title>.

…rker

Sets up the foundation for the table extraction accuracy harness:
- reportlab for deterministic synthetic PDF fixture generation.
- pandas promoted from optional to dev so tests can exercise Table.to_dataframe() in accuracy scoring.
- Registers the 'accuracy' pytest marker used by the harness tests.

Each fixture is described by a single TableSpec that is the shared source of truth for both the rendered PDF and the ground-truth JSON sidecar; by construction they cannot drift apart.
- reportlab.rl_config.invariant = 1 + pageCompression = 0 for bit-identical output across runs (verified via SHA256).
- 8 fixture types covering bordered/borderless, merged cells, borderless financial/invoice layouts, a landscape page, sparse subtotal rows, and a 2-page continuation with repeated header.
- Output lands under tests/fixtures/tables/, paired <name>.pdf + <name>.gt.json.
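
The "single source of truth" idea can be illustrated with a small stdlib-only sketch. The field names on `TableSpec` and `CellSpec`, the JSON layout, and the `draw_commands` stand-in for the PDF renderer are hypothetical; the real generator in scripts/generate_table_fixtures.py renders with reportlab and may structure its spec differently.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CellSpec:
    """Hypothetical cell description; spans default to a single cell."""
    row: int
    col: int
    text: str
    row_span: int = 1
    col_span: int = 1


@dataclass(frozen=True)
class TableSpec:
    """Hypothetical fixture description shared by renderer and sidecar."""
    name: str
    page: int
    bbox: tuple
    cells: tuple


def ground_truth_json(spec: TableSpec) -> bytes:
    """Serialize the spec to a byte-stable JSON sidecar."""
    payload = {
        "name": spec.name,
        "tables": [{
            "page": spec.page,
            "bbox": list(spec.bbox),
            "cells": [
                {"row": c.row, "col": c.col, "text": c.text,
                 "row_span": c.row_span, "col_span": c.col_span}
                for c in spec.cells
            ],
        }],
    }
    # sort_keys and a fixed indent keep the bytes identical across runs.
    return json.dumps(payload, sort_keys=True, indent=2).encode("utf-8") + b"\n"


def draw_commands(spec: TableSpec) -> list:
    """Stand-in for the PDF renderer: one draw instruction per cell."""
    return [f"draw r{c.row} c{c.col} '{c.text}'" for c in spec.cells]


spec = TableSpec(
    name="bordered_simple",
    page=1,
    bbox=(72.0, 500.0, 400.0, 700.0),
    cells=(CellSpec(0, 0, "Item"), CellSpec(0, 1, "Qty")),
)
first = hashlib.sha256(ground_truth_json(spec)).hexdigest()
second = hashlib.sha256(ground_truth_json(spec)).hexdigest()
print(first == second)  # the sidecar bytes are reproducible
```

Because both outputs are derived from the same `spec` object, a change to a cell's text or span shows up in the rendered page and in the ground truth at the same time.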

…cars

Output of scripts/generate_table_fixtures.py checked in so the harness and CI don't depend on regenerating at test time. 8 fixtures totaling ~50 KB:
- bordered_simple, bordered_dense, bordered_merged
- borderless_financial, borderless_invoice
- multipage_continuation (2 pages, repeated header)
- rotated_landscape (landscape page)
- sparse_cells (borderless with blank subtotal cells)

Regeneration is verified for bit-identical output in CI.

Pure-Python scorer with no external deps beyond stdlib. Implements:
- Whitespace-normalized text comparison.
- Greedy bbox-IoU 1:1 matching of predicted tables to ground truth, restricted to same page.
- Cell-level multiset precision/recall on normalized text.
- GriTS-Top: F1 over the set of (row_start, row_end, col_start, col_end) topological signatures of every cell, accounting for row/col spans. Robust to missing text; penalizes structural errors.

9 unit tests cover the identical, all-wrong-text, missing-column, extra-merge, page-mismatch, and low-IoU cases.
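
The two cell-level metrics can be sketched as below. This follows the description in this commit message (set F1 over span signatures) rather than the scorer's actual source, and the function names and the inclusive-end convention for signatures are assumptions.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(text.split())


def f1(precision: float, recall: float) -> float:
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def cell_prf(pred_texts, truth_texts):
    """Multiset precision/recall/F1 over normalized cell text.

    Repeated values (e.g. the same number in two cells) are counted
    as many times as they occur.
    """
    pred = Counter(normalize(t) for t in pred_texts)
    truth = Counter(normalize(t) for t in truth_texts)
    overlap = sum((pred & truth).values())  # multiset intersection
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(truth.values()) if truth else 0.0
    return precision, recall, f1(precision, recall)


def grits_top_f1(pred_sigs, truth_sigs):
    """F1 over sets of (row_start, row_end, col_start, col_end) signatures."""
    pred, truth = set(pred_sigs), set(truth_sigs)
    overlap = len(pred & truth)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(truth) if truth else 0.0
    return f1(precision, recall)


# Ground truth: a header spanning two columns above two body cells.
truth_sigs = [(0, 0, 0, 1), (1, 1, 0, 0), (1, 1, 1, 1)]
# Prediction that missed the merge and split the header in two.
pred_sigs = [(0, 0, 0, 0), (0, 0, 1, 1), (1, 1, 0, 0), (1, 1, 1, 1)]
print(round(grits_top_f1(pred_sigs, truth_sigs), 4))  # 0.5714
```

In the example, the text of every cell could be recovered perfectly and the structural score would still be 4/7, which is the sense in which the metric penalizes structure independently of text.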

New pytest suite under -m accuracy runs over every .gt.json fixture, converts paperjam.Table outputs into the harness shape, and scores against ground truth via GriTS-Top + cell-level P/R.
- conftest.py session-scoped accuracy_report fixture accumulates per-fixture scores and writes tests/output/table_accuracy.json on session teardown.
- --update-baseline CLI option rewrites tests/fixtures/tables/.accuracy_baseline.json after an intentional improvement.
- test_baseline_regression fails if aggregate table-detection F1, cell F1, or GriTS-Top F1 drops more than 1% from the committed baseline.
- tests/output/ is excluded from version control.

Initial committed baseline (on current heuristics):
table_detection_f1 = 1.0000
avg_cell_f1 = 0.9923
avg_grits_top_f1 = 0.9557

Phase 1 (merged-cell reconstruction) will drive the GriTS-Top number up by fixing bordered_merged (currently 0.757).
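
A rough sketch of the regression gate, using the committed baseline values. The PR does not say whether "1%" is an absolute or a relative drop; this example assumes an absolute drop of 0.01, and the `regressions` helper is a hypothetical name.

```python
BASELINE = {
    "table_detection_f1": 1.0000,
    "avg_cell_f1": 0.9923,
    "avg_grits_top_f1": 0.9557,
}
TOLERANCE = 0.01  # assumed: absolute drop in F1, not relative


def regressions(current: dict, baseline: dict, tolerance: float = TOLERANCE) -> dict:
    """Return {metric: (baseline, current)} for every metric that dropped
    by more than the tolerance. A missing metric counts as 0.0."""
    return {
        name: (base, current.get(name, 0.0))
        for name, base in baseline.items()
        if base - current.get(name, 0.0) > tolerance
    }


# A run where only the structural metric regressed noticeably.
run = {"table_detection_f1": 1.0, "avg_cell_f1": 0.9910, "avg_grits_top_f1": 0.9400}
print(regressions(run, BASELINE))  # {'avg_grits_top_f1': (0.9557, 0.94)}
```

Improvements never trip the gate, so a phase that raises a metric only needs --update-baseline to lock in the new floor.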

Runs against the full fixture corpus under tests/fixtures/tables/. Document parsing and page loading happen outside the measured section so the bench only times table::extract_tables, which is the code Phases 1–4 will change. Local baseline on this machine: 13–107 µs per fixture. Not wired into CI yet since cross-runner wall-clock is too noisy; gets a dedicated perf-regression job once a stable single-runner baseline exists.

Run with:
cargo bench -p paperjam-core --bench table_extraction

New ubuntu-latest / Python 3.13 job runs on every PR, gated by the existing lint job:
1. Regenerates fixtures via scripts/generate_table_fixtures.py and diffs against the committed copies. Catches non-deterministic generator changes before they pollute the corpus.
2. Runs pytest -m accuracy: the parametrized fixture tests plus the baseline regression gate.
3. Uploads tests/output/table_accuracy.json as an artifact so per-fixture deltas are visible on every PR.

The cross-platform pytest matrix already covers correctness on macOS and Windows; one ubuntu runner is enough for accuracy scoring.
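
The commit message does not show how step 1 performs the diff. One way to express the same check in Python is to hash both directories and report any file whose bytes differ; the helper names and the idea of regenerating into a separate directory are assumptions made for this sketch.

```python
import hashlib
from pathlib import Path


def digest_tree(root: Path, patterns=("*.pdf", "*.gt.json")) -> dict:
    """Map each fixture file name under root to its SHA256 hex digest."""
    digests = {}
    for pattern in patterns:
        for path in sorted(root.glob(pattern)):
            digests[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def drifted(committed: Path, regenerated: Path) -> list:
    """Names of files that are missing on either side or differ in content."""
    old, new = digest_tree(committed), digest_tree(regenerated)
    return sorted(
        name for name in old.keys() | new.keys()
        if old.get(name) != new.get(name)
    )
```

A CI step built on this would fail whenever `drifted(...)` returns a non-empty list, for instance after a generator change that makes output depend on a timestamp.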

CLAUDE.md is a local developer-assistant guide, not a tracked project artifact.

New local hook (stages: [commit-msg]) that removes any Co-Authored-By trailers from the commit message before the commit is created. The hook script lives at .claude/hooks/strip-ai-attribution.sh, which is gitignored, so contributors who want the hook active must provide their own copy locally. In standard pre-commit runs (and CI) this hook is inert: it only fires on the commit-msg stage, which requires 'pre-commit install --hook-type commit-msg' to be enabled.

Rust 1.95 tightens collapsible_match and unnecessary_sort_by. These errors were pre-existing on main but surface here because the CI toolchain just bumped to 1.95; leaving them unfixed would keep the accuracy CI job (and rust-test) red on every PR.
- paperjam-epub/toc.rs, parser.rs: fold single-branch if inside match arms into if-guards on the arm itself. No behavior change: subsequent arms are distinct Event variants or a _ catch-all.
- paperjam-pptx/parser.rs: same treatment for Event::Empty and Event::Text arms.
- paperjam-core/forms/mod.rs: merge the None / Some((true, _)) / Some((false, _)) with inner if-else into a guarded arm plus a catch-all. All three original 'not found' paths now share one catch-all body; semantics are identical.
- paperjam-core/manipulation/insert.rs: sort_by(|a, b| b.0.cmp(&a.0)) → sort_by_key(|p| Reverse(p.0)).

Windows CI (test (windows-latest, 3.x)) fails with 'Invalid file trailer' on every PDF under tests/fixtures/tables/ because those fixtures are generated with pageCompression=0 (required for byte-deterministic output) — the uncompressed ASCII streams fool git's binary-vs-text heuristic, so core.autocrlf=true rewrites LF to CRLF on Windows checkout and shifts xref/trailer offsets into parser-breaking garbage. Add an explicit .gitattributes declaring PDFs and other universally binary formats (fonts, images, wasm) as binary so no content-based heuristic applies. Ubuntu/macOS CI and local runs are unaffected.
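
The offset shift can be reproduced with a toy example. The bytes below are a deliberately simplified PDF-like skeleton, not a valid PDF, and the helper names are hypothetical; the point is only that the number after startxref is a byte offset, so inserting a CR before every LF invalidates it.

```python
def build_toy_pdf() -> bytes:
    """Build a minimal PDF-like byte string with a correct startxref offset."""
    body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    xref_offset = len(body)  # byte position where the xref section starts
    tail = (
        b"xref\n0 1\ntrailer\n<< /Root 1 0 R >>\nstartxref\n"
        + str(xref_offset).encode("ascii")
        + b"\n%%EOF\n"
    )
    return body + tail


def simulate_autocrlf(data: bytes) -> bytes:
    """Mimic a text-mode checkout with core.autocrlf=true: LF becomes CRLF."""
    return data.replace(b"\n", b"\r\n")


def startxref_points_at_xref(data: bytes) -> bool:
    """Check that the recorded offset really lands on the 'xref' keyword."""
    marker = b"startxref"
    idx = data.rfind(marker)
    offset = int(data[idx + len(marker):].split()[0])
    return data[offset:offset + 4] == b"xref"


original = build_toy_pdf()
converted = simulate_autocrlf(original)
print(startxref_points_at_xref(original))   # True
print(startxref_points_at_xref(converted))  # False
```

Each converted line ending adds one byte ahead of the xref section while the recorded offset stays the same, so a parser that trusts the offset reads the wrong bytes. Marking the files as binary in .gitattributes prevents the conversion from happening at all.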

pratyush618 added 8 commits April 17, 2026 00:53

pratyush618 added 4 commits April 17, 2026 01:02

chore(gitignore): exclude CLAUDE.md project guide

a87609c


pratyush618 merged commit 8f79339 into main Apr 16, 2026
13 checks passed

pratyush618 deleted the feature/table-extraction-accuracy-harness branch April 16, 2026 20:30

pratyush618 merged 12 commits into main from feature/table-extraction-accuracy-harness
