Improve PDF extraction fallback with quality gates (chars/page + alpha ratio) #61

## Problem
`openfoia/pipeline/pdf_extract.py` currently treats extraction as successful mostly on the basis of `char_count >= min_chars` (default ~50). Hard PDFs, especially scan-like or sparse line-art documents, can therefore pass as "successful" even when the extracted text is too low in quality for embedding or search.

## Why this matters
On large documents, the total character count can look non-trivial and clear `min_chars` while the per-page density is very low. Text like that is unlikely to be usable by journalists unless the document is routed to OCR.

## Proposed path
Implement quality-aware fallback gating for PDF extraction in OpenFOIA.

### 1) Add extraction quality metrics
After text extraction, compute and log:
- `total_chars`
- `pages`
- `chars_per_page`
- `alpha_ratio` (alphabetic chars / non-whitespace chars)
- optional: `printable_ratio` and `unique_token_ratio`
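
The required metrics above could be computed along these lines. This is a minimal sketch, not the existing code: `ExtractionMetrics` and `compute_metrics` are hypothetical names, it assumes the extractor can hand back one string per page, and it assumes `total_chars` counts whitespace (the issue does not say either way).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionMetrics:
    """Hypothetical container for the metrics listed in this issue."""
    total_chars: int
    pages: int
    chars_per_page: float
    alpha_ratio: float


def compute_metrics(page_texts: list[str]) -> ExtractionMetrics:
    """Compute quality metrics from per-page extracted text.

    Assumption: total_chars includes whitespace; alpha_ratio is
    alphabetic chars / non-whitespace chars, as defined above.
    """
    text = "".join(page_texts)
    pages = len(page_texts)
    total_chars = len(text)
    non_ws = [c for c in text if not c.isspace()]
    alpha = sum(1 for c in non_ws if c.isalpha())
    return ExtractionMetrics(
        total_chars=total_chars,
        pages=pages,
        # Guard against zero pages and empty text to avoid ZeroDivisionError.
        chars_per_page=total_chars / pages if pages else 0.0,
        alpha_ratio=alpha / len(non_ws) if non_ws else 0.0,
    )
```

An empty or all-whitespace extraction yields `alpha_ratio == 0.0`, so it fails any positive `min_alpha_ratio` rather than raising.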

### 2) Add configurable quality thresholds
Extend config and `pdf_extract.py` checks with:
- `min_chars` (keep existing)
- `min_chars_per_page` (new)
- `min_alpha_ratio` (new)

If any threshold fails, treat the result as an extraction quality failure and route the document to OCR fallback.
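
A sketch of what the threshold config and the "any threshold fails" check might look like. Everything named here is an assumption for illustration: the `OPENFOIA_PDF_*` environment variable names, the `QualityThresholds` class, and the default values for the two new thresholds (only `min_chars` of about 50 comes from the current behavior; the other defaults are placeholders pending benchmark tuning).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityThresholds:
    min_chars: int = 50                # existing default (~50)
    min_chars_per_page: float = 200.0  # placeholder, to be tuned
    min_alpha_ratio: float = 0.6       # placeholder, to be tuned

    @classmethod
    def from_env(cls, env=None) -> "QualityThresholds":
        """Read thresholds from env vars (hypothetical names), else defaults."""
        env = os.environ if env is None else env
        return cls(
            min_chars=int(env.get("OPENFOIA_PDF_MIN_CHARS", cls.min_chars)),
            min_chars_per_page=float(
                env.get("OPENFOIA_PDF_MIN_CHARS_PER_PAGE", cls.min_chars_per_page)
            ),
            min_alpha_ratio=float(
                env.get("OPENFOIA_PDF_MIN_ALPHA_RATIO", cls.min_alpha_ratio)
            ),
        )


def needs_ocr(total_chars: int, pages: int, alpha_ratio: float,
              t: QualityThresholds) -> bool:
    """Return True if any single threshold fails."""
    chars_per_page = total_chars / pages if pages else 0.0
    return (
        total_chars < t.min_chars
        or chars_per_page < t.min_chars_per_page
        or alpha_ratio < t.min_alpha_ratio
    )
```

With these placeholder defaults, a 100-page document with 5,000 extracted characters clears `min_chars` but averages 50 chars/page, so it would be routed to OCR. That is the large-document case described under "Why this matters".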

### 3) Add reason-coded fallback logging
Emit structured reason(s), e.g.:
- `LOW_TOTAL_CHARS`
- `LOW_CHARS_PER_PAGE`
- `LOW_ALPHA_RATIO`
- `EXTRACT_TIMEOUT`
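
One possible shape for the reason codes and the structured log line, using only the standard library. The logger name, the `event` and `route` field names, and the helper functions are assumptions; the four reason codes are the ones listed above. The sketch treats a timeout as the sole reason, since no metrics exist in that case.

```python
import json
import logging
from enum import Enum

logger = logging.getLogger("openfoia.pdf_extract")  # assumed logger name


class FallbackReason(str, Enum):
    LOW_TOTAL_CHARS = "LOW_TOTAL_CHARS"
    LOW_CHARS_PER_PAGE = "LOW_CHARS_PER_PAGE"
    LOW_ALPHA_RATIO = "LOW_ALPHA_RATIO"
    EXTRACT_TIMEOUT = "EXTRACT_TIMEOUT"


def fallback_reasons(metrics: dict, thresholds: dict,
                     timed_out: bool = False) -> list[FallbackReason]:
    """Collect every failing gate so logs show all reasons, not just the first."""
    if timed_out:
        return [FallbackReason.EXTRACT_TIMEOUT]
    reasons = []
    if metrics["total_chars"] < thresholds["min_chars"]:
        reasons.append(FallbackReason.LOW_TOTAL_CHARS)
    if metrics["chars_per_page"] < thresholds["min_chars_per_page"]:
        reasons.append(FallbackReason.LOW_CHARS_PER_PAGE)
    if metrics["alpha_ratio"] < thresholds["min_alpha_ratio"]:
        reasons.append(FallbackReason.LOW_ALPHA_RATIO)
    return reasons


def log_decision(doc_id: str, metrics: dict,
                 reasons: list[FallbackReason]) -> str:
    """Emit one JSON log line with metrics, route, and reason codes."""
    record = {
        "event": "pdf_extract_quality",
        "doc_id": doc_id,
        "route": "ocr" if reasons else "direct",
        "reasons": [r.value for r in reasons],
        **metrics,
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

Returning the serialized line as well as logging it keeps the function easy to assert on in tests of the fallback decision logic.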

### 4) Add tests + benchmark fixture set
Use a mixed corpus:
- dense synthetic docs
- normal text-heavy PDFs
- difficult sparse/scan-like FOIA docs

Add a small benchmark script/report so threshold tuning is reproducible.
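
The benchmark could be as simple as a threshold sweep over labeled fixtures. The sketch below assumes each fixture already has its metrics computed and a hand-assigned `expect_ocr` label; the `Fixture` type, function names, and grid values are hypothetical, and real fixture numbers would come from the corpus described above rather than from this example.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Fixture:
    name: str
    chars_per_page: float
    alpha_ratio: float
    expect_ocr: bool  # hand-assigned label for the fixture


def evaluate(fixtures, min_cpp: float, min_alpha: float) -> dict:
    """Count misrouted fixtures for one threshold pair."""
    missed_ocr = false_fallback = 0
    for f in fixtures:
        routed = f.chars_per_page < min_cpp or f.alpha_ratio < min_alpha
        if f.expect_ocr and not routed:
            missed_ocr += 1        # bad text slipped through
        elif routed and not f.expect_ocr:
            false_fallback += 1    # good text sent to OCR needlessly
    return {
        "min_chars_per_page": min_cpp,
        "min_alpha_ratio": min_alpha,
        "missed_ocr": missed_ocr,
        "false_fallback": false_fallback,
    }


def sweep(fixtures, cpp_grid, alpha_grid) -> list[dict]:
    """Evaluate every grid combination, fewest total errors first."""
    rows = [evaluate(fixtures, c, a) for c, a in product(cpp_grid, alpha_grid)]
    return sorted(
        rows,
        key=lambda r: (
            r["missed_ocr"] + r["false_fallback"],
            r["min_chars_per_page"],
            r["min_alpha_ratio"],
        ),
    )
```

The sort is deterministic, so rerunning the sweep on the same fixture set yields the same report, which is the reproducibility this step asks for.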

## Suggested implementation files
- `openfoia/pipeline/pdf_extract.py`
- extraction config module/env wiring
- tests around fallback decision logic

## Acceptance criteria
- Hard sparse docs that previously slipped through now route to OCR fallback.
- Text-heavy docs still pass direct extraction.
- Logs include threshold metrics + reason codes.
- Thresholds are configurable without code changes.

