Problem
openfoia/pipeline/pdf_extract.py currently treats extraction success mostly as char_count >= min_chars (default ~50). That allows some hard PDFs (especially scan-like or sparse line-art docs) to pass as "successful" even when extracted text quality is too low for embedding/search use.
Why this matters
On large documents, the total character count can still look non-trivial while the per-page density is very low. Output like that is likely not usable by journalists unless the document is routed to OCR.
Proposed path
Implement quality-aware fallback gating for PDF extraction in OpenFOIA.
1) Add extraction quality metrics
After text extraction, compute and log:
- total_chars
- pages
- chars_per_page
- alpha_ratio (alphabetic chars / non-whitespace chars)
- optional: printable_ratio and unique_token_ratio
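A minimal sketch of the required metrics, assuming the extractor can hand back one string per page. The names `ExtractionMetrics` and `compute_metrics` are hypothetical, and counting whitespace toward total_chars is an assumption this issue does not settle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionMetrics:
    total_chars: int
    pages: int
    chars_per_page: float
    alpha_ratio: float


def compute_metrics(page_texts: list[str]) -> ExtractionMetrics:
    """Compute quality metrics from per-page extracted text (hypothetical helper)."""
    text = "".join(page_texts)
    pages = len(page_texts)
    # alpha_ratio = alphabetic chars / non-whitespace chars, as defined above.
    non_ws = [c for c in text if not c.isspace()]
    alpha = sum(1 for c in non_ws if c.isalpha())
    return ExtractionMetrics(
        total_chars=len(text),  # assumption: whitespace counts toward the total
        pages=pages,
        chars_per_page=len(text) / pages if pages else 0.0,
        alpha_ratio=alpha / len(non_ws) if non_ws else 0.0,
    )
```

Empty inputs return zeros rather than raising, so a PDF with no extractable pages still produces loggable metrics.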
2) Add configurable quality thresholds
Extend config and pdf_extract.py checks with:
- min_chars (keep existing)
- min_chars_per_page (new)
- min_alpha_ratio (new)
If any threshold fails, treat as extraction quality failure and route to OCR fallback.
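The gating rule could look roughly like this. `QualityThresholds` and `needs_ocr_fallback` are hypothetical names, and the defaults for the two new thresholds are placeholders, not tuned values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityThresholds:
    min_chars: int = 50                # the issue cites a default of roughly 50
    min_chars_per_page: float = 100.0  # placeholder, needs tuning
    min_alpha_ratio: float = 0.5       # placeholder, needs tuning


def needs_ocr_fallback(total_chars: int, pages: int, alpha_ratio: float,
                       t: QualityThresholds = QualityThresholds()) -> bool:
    """Return True if any threshold fails, i.e. the doc should go to OCR."""
    chars_per_page = total_chars / pages if pages else 0.0
    return (
        total_chars < t.min_chars
        or chars_per_page < t.min_chars_per_page
        or alpha_ratio < t.min_alpha_ratio
    )
```

With these placeholder values, a 200-page document with 5,000 extracted characters passes the old min_chars check but fails min_chars_per_page (25 chars per page), which is the case described under "Why this matters".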
3) Add reason-coded fallback logging
Emit structured reason(s), e.g.:
- LOW_TOTAL_CHARS
- LOW_CHARS_PER_PAGE
- LOW_ALPHA_RATIO
- EXTRACT_TIMEOUT
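One possible shape for the reason codes and the log line, as a sketch only. The logger name, the `doc_id` field, the JSON layout, and the function names are all assumptions. Only the four reason codes come from this issue.

```python
import json
import logging
from enum import Enum

logger = logging.getLogger("openfoia.pdf_extract")  # logger name is an assumption


class FallbackReason(str, Enum):
    LOW_TOTAL_CHARS = "LOW_TOTAL_CHARS"
    LOW_CHARS_PER_PAGE = "LOW_CHARS_PER_PAGE"
    LOW_ALPHA_RATIO = "LOW_ALPHA_RATIO"
    EXTRACT_TIMEOUT = "EXTRACT_TIMEOUT"


def fallback_reasons(metrics: dict, thresholds: dict, timed_out: bool = False) -> list:
    """Collect every failing check so the log shows all reasons, not just the first."""
    reasons = []
    if timed_out:
        reasons.append(FallbackReason.EXTRACT_TIMEOUT)
    if metrics["total_chars"] < thresholds["min_chars"]:
        reasons.append(FallbackReason.LOW_TOTAL_CHARS)
    if metrics["chars_per_page"] < thresholds["min_chars_per_page"]:
        reasons.append(FallbackReason.LOW_CHARS_PER_PAGE)
    if metrics["alpha_ratio"] < thresholds["min_alpha_ratio"]:
        reasons.append(FallbackReason.LOW_ALPHA_RATIO)
    return reasons


def log_decision(doc_id: str, metrics: dict, reasons: list) -> str:
    """Emit one JSON log line with metrics, reason codes, and the chosen route."""
    payload = {
        "doc_id": doc_id,  # hypothetical identifier field
        "metrics": metrics,
        "reasons": [r.value for r in reasons],
        "route": "ocr" if reasons else "direct",
    }
    line = json.dumps(payload, sort_keys=True)
    logger.info(line)
    return line
```

Logging the metrics even when no reason fires keeps the data available for later threshold tuning.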
4) Add tests + benchmark fixture set
Use a mixed corpus:
- dense synthetic docs
- normal text-heavy PDFs
- difficult sparse/scan-like FOIA docs
Add a small benchmark script/report so threshold tuning is reproducible.
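A rough sketch of what the benchmark script might do: sweep candidate thresholds over labeled fixtures and count misrouted documents. The fixture rows below are made-up illustrative numbers, not measurements from real documents, and `evaluate` and `sweep` are hypothetical names.

```python
from itertools import product

# Hypothetical fixture records: (label, chars_per_page, alpha_ratio, should_ocr).
# The values are invented for illustration; real rows would come from the corpus.
FIXTURES = [
    ("dense_synthetic", 2800.0, 0.82, False),
    ("text_heavy", 1500.0, 0.78, False),
    ("sparse_scan_like", 22.0, 0.35, True),
]


def evaluate(min_cpp: float, min_alpha: float, fixtures=FIXTURES) -> int:
    """Return how many fixtures are routed differently from their label."""
    errors = 0
    for _label, cpp, alpha, should_ocr in fixtures:
        routed_to_ocr = cpp < min_cpp or alpha < min_alpha
        if routed_to_ocr != should_ocr:
            errors += 1
    return errors


def sweep(cpp_grid=(50, 100, 200), alpha_grid=(0.4, 0.5, 0.6)) -> list:
    """Evaluate every threshold pair, sorted with the fewest errors first."""
    rows = [(evaluate(c, a), c, a) for c, a in product(cpp_grid, alpha_grid)]
    return sorted(rows)


if __name__ == "__main__":
    for errors, c, a in sweep():
        print(f"min_chars_per_page={c:>4} min_alpha_ratio={a:.1f} misrouted={errors}")
```

Committing the fixture metrics alongside the script would let a threshold change be reviewed against the same table each time.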
Suggested implementation files
- openfoia/pipeline/pdf_extract.py
- extraction config module/env wiring
- tests around fallback decision logic
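For the env wiring, something along these lines may be enough. The environment variable names are hypothetical, and the fallback defaults are placeholders apart from min_chars, which follows the approximate existing default.

```python
import os


def load_thresholds(env=None) -> dict:
    """Read quality thresholds from the environment, falling back to defaults."""
    env = os.environ if env is None else env
    return {
        # Variable names below are assumptions, not existing OpenFOIA settings.
        "min_chars": int(env.get("OPENFOIA_PDF_MIN_CHARS", "50")),
        "min_chars_per_page": float(env.get("OPENFOIA_PDF_MIN_CHARS_PER_PAGE", "100")),
        "min_alpha_ratio": float(env.get("OPENFOIA_PDF_MIN_ALPHA_RATIO", "0.5")),
    }
```

Accepting an optional mapping instead of always reading `os.environ` keeps the loader easy to exercise in the fallback decision tests.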
Acceptance criteria
- Hard sparse docs that previously slipped through now route to OCR fallback.
- Text-heavy docs still pass direct extraction.
- Logs include threshold metrics + reason codes.
- Thresholds are configurable without code changes.