After PR #83 cleared all PAGE BREAK pollution, the next most significant data-quality issue surfaced by the audit is chunks whose `protocol_number` is a `UNK-` placeholder:
| state_code | agencies_with_unk | unk_chunks |
| --- | --- | --- |
| CA | 16 | 5,809 |
These are chunks where the PDF extractor failed to identify a protocol number, defaulting to `UNK-1`, `UNK-2`, etc. They have valid embedded `content` but render in search results with junk titles like "UNK-170".
The runtime quality demotion in `server/_core/rag/scoring.ts` (× 0.3) handles these correctly today — they surface only when nothing better matches. So this is not a launch blocker.
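A minimal sketch of what such a demotion might look like, to make the behaviour concrete. Only the 0.3 multiplier comes from this issue; the type and function names (`ScoredChunk`, `isUnkProtocolNumber`, `applyQualityDemotion`, `rankChunks`) are hypothetical and are not taken from `server/_core/rag/scoring.ts`.

```typescript
// Hypothetical names throughout; the real logic lives in server/_core/rag/scoring.ts.
const UNK_DEMOTION_FACTOR = 0.3; // multiplier stated in this issue

interface ScoredChunk {
  protocolNumber: string;
  score: number;
}

// True for placeholder identifiers such as "UNK-1" or "UNK-170".
function isUnkProtocolNumber(protocolNumber: string): boolean {
  return /^UNK-\d+$/.test(protocolNumber);
}

// Returns a copy with the score scaled down when the chunk has a placeholder number.
function applyQualityDemotion(chunk: ScoredChunk): ScoredChunk {
  if (!isUnkProtocolNumber(chunk.protocolNumber)) return chunk;
  return { ...chunk, score: chunk.score * UNK_DEMOTION_FACTOR };
}

// Demote first, then sort by score in descending order.
function rankChunks(chunks: ScoredChunk[]): ScoredChunk[] {
  return chunks.map(applyQualityDemotion).sort((a, b) => b.score - a.score);
}
```

Under this sketch, a `UNK-170` chunk with a raw score of 0.9 ranks below a properly numbered chunk with a raw score of 0.5, which is the "surfaces only when nothing better matches" behaviour described above.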
Recommended approach
Long-term cleanup at ingestion time, not in the database:
- Improve `scripts/lib/protocol-extractor.ts` to derive protocol numbers from content headers (e.g., regex match for `POLICY \d+`, `PROTOCOL \d+`, `SECTION \d+\.\d+`) before defaulting to `UNK`.
- For chunks where no number can be derived, use the parent protocol's `protocol_number` (chunks belong to a parent protocol).
- Re-ingest the 16 affected agencies after the fix.
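The first two steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the current contents of `scripts/lib/protocol-extractor.ts`: the function names, the choice to scan only the first five non-empty lines, and the decision to return the bare number are all hypothetical. The three patterns are the ones listed above.

```typescript
// All names here are hypothetical; they do not reflect the actual
// contents of scripts/lib/protocol-extractor.ts.

// Header patterns from the issue text, tried in this order on each line.
const HEADER_PATTERNS: RegExp[] = [
  /\bPOLICY\s+(\d+)\b/,
  /\bPROTOCOL\s+(\d+)\b/,
  /\bSECTION\s+(\d+\.\d+)\b/,
];

// Scans the first few non-empty lines of a chunk for a header match.
// Assumption: headers appear near the top of the chunk content.
function deriveFromHeader(content: string, maxHeaderLines = 5): string | null {
  const headerLines = content
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .slice(0, maxHeaderLines);

  for (const line of headerLines) {
    for (const pattern of HEADER_PATTERNS) {
      const match = pattern.exec(line);
      if (match && match[1] !== undefined) return match[1];
    }
  }
  return null;
}

// Resolution order: content header, then parent protocol, then UNK placeholder.
function resolveProtocolNumber(
  content: string,
  parentProtocolNumber: string | null,
  nextUnkIndex: number,
): string {
  const derived = deriveFromHeader(content);
  if (derived !== null) return derived;

  // Skip the parent if it is itself a UNK placeholder.
  if (parentProtocolNumber !== null && !/^UNK-\d+$/.test(parentProtocolNumber)) {
    return parentProtocolNumber;
  }
  return `UNK-${nextUnkIndex}`;
}
```

For example, `resolveProtocolNumber("SECTION 4.12 Airway", null, 1)` yields `"4.12"`, while a chunk with no recognizable header and no usable parent still falls back to `UNK-<n>`, so the existing runtime demotion continues to apply to whatever remains.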
Why not a SQL cleanup like #83
PAGE BREAK was a literal substring of the title, so it could be removed surgically. `UNK-1`, `UNK-2`, etc. are valid identifiers from the extractor's point of view; we can't "strip" them without losing chunk identity. Recovering the real number requires either content parsing or re-ingestion.
Priority
H3 (long-term). The runtime demotion is sufficient for launch.