Data: 5,809 UNK-* protocol_number chunks across 16 CA agencies (long-term) #84

@thefiredev-cloud

Description

After PR #83 cleared all PAGE BREAK pollution, the next deepest data-quality issue surfaced by the audit is `UNK-*` `protocol_number` chunks:

| state_code | agencies_with_unk | unk_chunks |
|---|---|---|
| CA | 16 | 5,809 |

These are chunks where the PDF extractor failed to identify a protocol number, defaulting to `UNK-1`, `UNK-2`, etc. They have valid embedded `content` but render in search results with junk titles like "UNK-170".
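As a rough illustration of how the audit counts above could be reproduced, here is a minimal sketch that groups `UNK-*` chunks by state. The `ChunkRow` shape, the `agencyId` field, and the `summarizeUnkChunks` helper are hypothetical and are not taken from the repository; only the `UNK-<n>` pattern comes from this issue.

```typescript
// Hypothetical row shape; the real chunk schema may differ.
interface ChunkRow {
  stateCode: string;
  agencyId: string;
  protocolNumber: string;
}

interface UnkSummary {
  agenciesWithUnk: number;
  unkChunks: number;
}

// Counts UNK-* chunks and the distinct agencies that own them, per state.
function summarizeUnkChunks(rows: ChunkRow[]): Map<string, UnkSummary> {
  const agenciesByState = new Map<string, Set<string>>();
  const countByState = new Map<string, number>();

  for (const row of rows) {
    // Skip chunks whose protocol number is not an extractor placeholder.
    if (!/^UNK-\d+$/.test(row.protocolNumber)) continue;

    const agencies = agenciesByState.get(row.stateCode) ?? new Set<string>();
    agencies.add(row.agencyId);
    agenciesByState.set(row.stateCode, agencies);
    countByState.set(row.stateCode, (countByState.get(row.stateCode) ?? 0) + 1);
  }

  const summary = new Map<string, UnkSummary>();
  agenciesByState.forEach((agencies, state) => {
    summary.set(state, {
      agenciesWithUnk: agencies.size,
      unkChunks: countByState.get(state) ?? 0,
    });
  });
  return summary;
}
```

Running this over the full chunk table would be one way to confirm the CA figures after a re-ingest.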

The runtime quality demotion in `server/_core/rag/scoring.ts` (× 0.3) handles these correctly today — they surface only when nothing better matches. So this is not a launch blocker.
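For readers unfamiliar with the demotion, a minimal sketch of the idea follows. It is not the code in `server/_core/rag/scoring.ts`: the `ScoredChunk` type and both function names are hypothetical, and the sketch assumes the 0.3 factor is multiplied into the chunk's relevance score.

```typescript
// Hypothetical shape; the real types in scoring.ts may differ.
interface ScoredChunk {
  protocolNumber: string;
  score: number;
}

// Multiplier quoted in this issue (× 0.3).
const UNK_DEMOTION_FACTOR = 0.3;

// True for extractor placeholders such as "UNK-1" or "UNK-170".
function isUnkProtocolNumber(protocolNumber: string): boolean {
  return /^UNK-\d+$/.test(protocolNumber);
}

// Returns a new array with UNK chunks demoted, sorted by descending score.
function applyQualityDemotion(chunks: ScoredChunk[]): ScoredChunk[] {
  return chunks
    .map((chunk) =>
      isUnkProtocolNumber(chunk.protocolNumber)
        ? { ...chunk, score: chunk.score * UNK_DEMOTION_FACTOR }
        : chunk
    )
    .sort((a, b) => b.score - a.score);
}
```

With this shape, an `UNK-170` chunk scoring 0.9 falls below a properly numbered chunk scoring 0.5, but still ranks first when it is the only candidate.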

Recommended approach

Long-term cleanup at ingestion time, not in the database:

  1. Improve `scripts/lib/protocol-extractor.ts` to derive protocol numbers from content headers (e.g., regex match for `POLICY \d+`, `PROTOCOL \d+`, `SECTION \d+\.\d+`) before defaulting to `UNK`.
  2. For chunks where no number can be derived, use the parent protocol's `protocol_number` (chunks belong to a parent protocol).
  3. Re-ingest the 16 affected agencies after the fix.
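Steps 1 and 2 could be sketched roughly as below. This is an illustration under stated assumptions, not the current contents of `scripts/lib/protocol-extractor.ts`: the `deriveProtocolNumber` name, its parameters, and the 500-character header window are hypothetical, and only the three header patterns come from this issue.

```typescript
// Header patterns from this issue, tried in list order (case-sensitive).
const HEADER_PATTERNS: RegExp[] = [
  /\bPOLICY\s+(\d+)/,
  /\bPROTOCOL\s+(\d+)/,
  /\bSECTION\s+(\d+\.\d+)/,
];

// Assumption: protocol headers appear near the start of a chunk.
const HEADER_WINDOW_CHARS = 500;

function deriveProtocolNumber(
  content: string,
  parentProtocolNumber: string | null,
  unkCounter: number
): string {
  const header = content.slice(0, HEADER_WINDOW_CHARS);

  // Step 1: try to read a number from a content header.
  for (const pattern of HEADER_PATTERNS) {
    const match = pattern.exec(header);
    if (match && match[1] !== undefined) return match[1];
  }

  // Step 2: fall back to the parent protocol's number, unless the parent
  // is itself a placeholder.
  if (parentProtocolNumber && !parentProtocolNumber.startsWith("UNK-")) {
    return parentProtocolNumber;
  }

  // Last resort: keep the existing placeholder behaviour.
  return `UNK-${unkCounter}`;
}
```

Note that the patterns are checked in list order rather than by position in the text, so a chunk containing both a `SECTION` and a `POLICY` header resolves to the `POLICY` number. Whether that precedence is right for every agency's PDFs would need checking against real documents.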

Why not a SQL cleanup like #83

PAGE BREAK was a literal substring in the title, so it could be removed surgically. `UNK-1`, `UNK-2`, etc. are valid identifiers from the extractor's point of view; we can't "strip" them without losing chunk identity. Recovering the real number requires either content parsing or re-ingestion.

Priority

H3 (long-term). The runtime demotion is sufficient for launch.

Metadata

    Labels

    autonomous-followup (Follow-up from autonomous branch work)
    h3 (Horizon 3: long-term, 6+ weeks)
    ingestion (Data ingestion - new agencies/states)
