Improve PDF extraction fallback with quality gates (chars/page + alpha ratio) #61

@JordanCoin

Problem

openfoia/pipeline/pdf_extract.py currently decides whether extraction succeeded mostly by checking char_count >= min_chars (default ~50). Some hard PDFs (especially scan-like or sparse line-art docs) clear that bar and are marked "successful" even though the extracted text is too low-quality for embedding/search use.

Why this matters

On large documents, the total character count can look non-trivial while the per-page density is very low. Text like that is unlikely to be usable by journalists unless the document is routed to OCR.

Proposed path

Implement quality-aware fallback gating for PDF extraction in OpenFOIA.

1) Add extraction quality metrics

After text extraction, compute and log:

  • total_chars
  • pages
  • chars_per_page
  • alpha_ratio (alphabetic chars / non-whitespace chars)
  • optional: printable_ratio and unique_token_ratio
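A minimal sketch of the required metrics, assuming the extractor can hand back one string per page. `ExtractionMetrics` and `compute_metrics` are hypothetical names, and counting whitespace toward total_chars is an assumption the issue does not settle:

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ExtractionMetrics:
    """Hypothetical container for the metrics listed above."""
    total_chars: int
    pages: int
    chars_per_page: float
    alpha_ratio: float


def compute_metrics(page_texts: Sequence[str]) -> ExtractionMetrics:
    """Compute quality metrics from per-page extracted text."""
    text = "".join(page_texts)
    pages = len(page_texts)
    total_chars = len(text)  # assumption: whitespace counts toward the total
    non_ws = [c for c in text if not c.isspace()]
    alpha = sum(1 for c in non_ws if c.isalpha())
    return ExtractionMetrics(
        total_chars=total_chars,
        pages=pages,
        # Guard against empty documents so the ratios never divide by zero.
        chars_per_page=total_chars / pages if pages else 0.0,
        alpha_ratio=alpha / len(non_ws) if non_ws else 0.0,
    )
```

Returning 0.0 for empty input means a document with no pages or no text fails any positive threshold, which seems like the safer default for OCR routing.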

2) Add configurable quality thresholds

Extend config and pdf_extract.py checks with:

  • min_chars (keep existing)
  • min_chars_per_page (new)
  • min_alpha_ratio (new)

If any threshold fails, treat the result as an extraction quality failure and route the document to OCR fallback.
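The "any threshold fails" rule could look roughly like this. `QualityThresholds` and `needs_ocr_fallback` are hypothetical names, and the two new default values are placeholders to be tuned, not recommendations:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityThresholds:
    """Hypothetical config object; only min_chars exists today."""
    min_chars: int = 50                # existing default is roughly 50 per this issue
    min_chars_per_page: float = 100.0  # placeholder value, needs tuning
    min_alpha_ratio: float = 0.5       # placeholder value, needs tuning


def needs_ocr_fallback(total_chars, pages, alpha_ratio,
                       thresholds=QualityThresholds()):
    """Return True if any single quality threshold fails."""
    chars_per_page = total_chars / pages if pages else 0.0
    return (
        total_chars < thresholds.min_chars
        or chars_per_page < thresholds.min_chars_per_page
        or alpha_ratio < thresholds.min_alpha_ratio
    )
```

With these placeholder values, a 200-page document with 5000 extracted chars passes the old min_chars check but fails on density (25 chars/page), which is the case this issue is about.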

3) Add reason-coded fallback logging

Emit structured reason(s), e.g.:

  • LOW_TOTAL_CHARS
  • LOW_CHARS_PER_PAGE
  • LOW_ALPHA_RATIO
  • EXTRACT_TIMEOUT
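One possible shape for the reason codes and the structured log line, using only the standard library. The logger name, the `pdf_extract_fallback` event name, the payload keys, and the function names are all assumptions for illustration:

```python
import json
import logging
from enum import Enum

logger = logging.getLogger("openfoia.pdf_extract")  # hypothetical logger name


class FallbackReason(Enum):
    LOW_TOTAL_CHARS = "LOW_TOTAL_CHARS"
    LOW_CHARS_PER_PAGE = "LOW_CHARS_PER_PAGE"
    LOW_ALPHA_RATIO = "LOW_ALPHA_RATIO"
    EXTRACT_TIMEOUT = "EXTRACT_TIMEOUT"


def fallback_reasons(metrics, thresholds, timed_out=False):
    """Collect every failing reason so logs show all causes, not just the first."""
    if timed_out:
        # Assumption: no usable metrics exist after a timeout.
        return [FallbackReason.EXTRACT_TIMEOUT]
    checks = [
        ("total_chars", "min_chars", FallbackReason.LOW_TOTAL_CHARS),
        ("chars_per_page", "min_chars_per_page", FallbackReason.LOW_CHARS_PER_PAGE),
        ("alpha_ratio", "min_alpha_ratio", FallbackReason.LOW_ALPHA_RATIO),
    ]
    return [reason for metric, limit, reason in checks
            if metrics[metric] < thresholds[limit]]


def log_fallback(doc_id, metrics, thresholds, reasons):
    """Emit one JSON log line with reason codes, metrics, and thresholds."""
    payload = {
        "event": "pdf_extract_fallback",  # hypothetical event name
        "doc_id": doc_id,
        "reasons": [r.value for r in reasons],
        "metrics": metrics,
        "thresholds": thresholds,
    }
    logger.warning(json.dumps(payload, sort_keys=True))
    return payload
```

Logging the thresholds next to the metrics makes each line self-explanatory when thresholds change between runs.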

4) Add tests + benchmark fixture set

Use a mixed corpus:

  • dense synthetic docs
  • normal text-heavy PDFs
  • difficult sparse/scan-like FOIA docs

Add a small benchmark script/report so threshold tuning is reproducible.
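A rough sketch of what the benchmark core might do: take fixtures labeled with the expected route, sweep a grid of candidate thresholds, and report which fixtures each combination misroutes. `Fixture`, `misroutes`, and `sweep` are hypothetical names, and the fixture metrics would come from the real corpus rather than hand-written values:

```python
from typing import NamedTuple


class Fixture(NamedTuple):
    """One labeled benchmark document (hypothetical structure)."""
    name: str
    chars_per_page: float
    alpha_ratio: float
    expect_ocr: bool  # human label: should this doc go to OCR?


def misroutes(fixtures, min_chars_per_page, min_alpha_ratio):
    """Names of fixtures whose routing disagrees with their label."""
    wrong = []
    for f in fixtures:
        routed_to_ocr = (f.chars_per_page < min_chars_per_page
                         or f.alpha_ratio < min_alpha_ratio)
        if routed_to_ocr != f.expect_ocr:
            wrong.append(f.name)
    return wrong


def sweep(fixtures, cpp_grid, alpha_grid):
    """Try every threshold pair; best (fewest misroutes) first."""
    rows = [(c, a, misroutes(fixtures, c, a))
            for c in cpp_grid for a in alpha_grid]
    return sorted(rows, key=lambda row: len(row[2]))
```

Writing the sorted rows out to a CSV or Markdown report alongside the fixture list would make a threshold change reviewable in a PR.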

Suggested implementation files

  • openfoia/pipeline/pdf_extract.py
  • extraction config module/env wiring
  • tests around fallback decision logic
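For the env wiring, something along these lines might be enough. The `OPENFOIA_PDF_*` variable names and `load_thresholds` are hypothetical, and the two new defaults are placeholders:

```python
import os

# Placeholder defaults; min_chars mirrors the existing ~50 noted above.
DEFAULTS = {
    "min_chars": 50,
    "min_chars_per_page": 100.0,
    "min_alpha_ratio": 0.5,
}


def load_thresholds(env=None):
    """Read thresholds from env vars, falling back to DEFAULTS.

    Variable names follow a hypothetical OPENFOIA_PDF_<KEY> pattern,
    e.g. OPENFOIA_PDF_MIN_ALPHA_RATIO=0.6.
    """
    env = os.environ if env is None else env
    thresholds = {}
    for key, default in DEFAULTS.items():
        raw = env.get("OPENFOIA_PDF_" + key.upper())
        # Cast to the default's type so ints stay ints and floats stay floats.
        thresholds[key] = type(default)(raw) if raw is not None else default
    return thresholds
```

Accepting an optional `env` mapping keeps the fallback decision tests independent of the process environment.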

Acceptance criteria

  • Hard sparse docs that previously slipped through now route to OCR fallback.
  • Text-heavy docs still pass direct extraction.
  • Logs include threshold metrics + reason codes.
  • Thresholds are configurable without code changes.
