KnowledgeHarvester.ts classifier over-matches the People domain via substring matching #1367

@stratofax

Description

Component: PAI/TOOLS/KnowledgeHarvester.ts (classifyDomain + extractTags, v5)
Affected source: Releases/v5.0.0/.claude/PAI/TOOLS/KnowledgeHarvester.ts @ 2fde1bb
Severity: Medium — misfiles notes into the wrong KNOWLEDGE domain. Platform-independent (reproduces everywhere).

Summary

classifyDomain and extractTags match domain keywords by substring, so "person" matches
"personal"/"persona", and a single incidental keyword ("contact", "profile") in an otherwise technical
note is enough to win the high-precision People domain. Result: technical notes get misfiled as People.

Bug — Substring keyword matching + no precision floor

Line 322, inside classifyDomain (line 342 in extractTags has the same shape):

const score = keywords.reduce((acc, kw) => acc + (text.includes(kw) ? 1 : 0), 0);

Two problems:

  1. Substring matching: text.includes("person") is true for "personal" and "persona"; any single
    incidental keyword ("contact", "profile") in a technical note can score the People domain.
  2. No precision floor: one weak keyword is enough to route a note to People.
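
A minimal reproduction of both problems, as a sketch only. The keyword list and the sample note are hypothetical stand-ins (the real People keywords live in KnowledgeHarvester.ts); the scoring function copies the shape of the line quoted above.

```typescript
// Hypothetical keyword list for illustration; not the actual list from KnowledgeHarvester.ts.
const peopleKeywords: string[] = ["person", "contact", "profile", "dossier"];

// Same shape as the quoted scoring line: one point per keyword found as a substring.
function substringScore(text: string, keywords: string[]): number {
  return keywords.reduce((acc, kw) => acc + (text.includes(kw) ? 1 : 0), 0);
}

// A purely technical note with no people-related content.
const technicalNote = "use a personal access token; see the cpu profile for details";

// "person" matches inside "personal", and "profile" matches incidentally,
// so the note scores 2 for People despite being about tooling.
const falsePositiveScore = substringScore(technicalNote, peopleKeywords);
console.log(falsePositiveScore); // 2
```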

Proposed fix

Word-boundary matching, and require a ≥2-keyword signal for the high-precision People domain (genuine
OSINT/dossier notes hit several: osint, dossier, linkedin, profile, background…):

// word-boundary so "person" doesn't match "personal"/"persona"
scores[domain] = keywords.reduce(
  (acc, kw) => acc + (new RegExp(`\\b${kw}\\b`).test(text) ? 1 : 0), 0);
// People is high-precision: a single incidental keyword must not route here
if (scores.People < 2) scores.People = 0;

extractTags gets the same \b-boundary treatment.
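
To show how the two changes combine, here is a self-contained sketch of the proposed behavior. It is not the actual classifyDomain: the keyword map, the lowercasing step, the first-wins tie-break, and the names `DOMAIN_KEYWORDS`, `countWholeWordHits`, and `classifyDomainSketch` are all assumptions made for illustration. Only the `\b` matching and the People floor of 2 come from the proposal above.

```typescript
type Domain = "People" | "Ideas" | "Research" | "Companies";

// Hypothetical keyword map; the real lists are defined in KnowledgeHarvester.ts.
const DOMAIN_KEYWORDS: Record<Domain, string[]> = {
  People: ["person", "contact", "profile", "osint", "dossier", "linkedin", "background"],
  Ideas: ["idea", "concept"],
  Research: ["research", "study", "paper"],
  Companies: ["startup", "company"],
};

// People must hit at least this many keywords to be considered at all.
const PEOPLE_FLOOR = 2;

// Count keywords that appear as whole words (assumes plain alphanumeric keywords).
function countWholeWordHits(text: string, keywords: string[]): number {
  return keywords.reduce(
    (acc, kw) => acc + (new RegExp(`\\b${kw}\\b`).test(text) ? 1 : 0), 0);
}

// Returns the best-scoring domain, or null when nothing matches.
function classifyDomainSketch(rawText: string): Domain | null {
  const text = rawText.toLowerCase(); // assumption: matching is case-insensitive
  const scores = {} as Record<Domain, number>;
  for (const domain of Object.keys(DOMAIN_KEYWORDS) as Domain[]) {
    scores[domain] = countWholeWordHits(text, DOMAIN_KEYWORDS[domain]);
  }
  // Precision floor: a single incidental keyword must not route to People.
  if (scores.People < PEOPLE_FLOOR) scores.People = 0;

  let best: Domain | null = null;
  let bestScore = 0;
  for (const domain of Object.keys(scores) as Domain[]) {
    if (scores[domain] > bestScore) {
      best = domain;
      bestScore = scores[domain];
    }
  }
  return best;
}

console.log(classifyDomainSketch("Research paper on personal access tokens, see profile output")); // "Research"
console.log(classifyDomainSketch("OSINT dossier: LinkedIn profile and background")); // "People"
```

In the first call, "personal" no longer counts as "person", and the lone "profile" hit is zeroed by the floor, so Research wins on its two keywords.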

Note: \b boundaries won't help multi-word or hyphenated keywords, but the current keyword set has
none, so no further change is needed today.
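
Should the keyword set ever grow beyond plain single words, one possible hardening is to escape each keyword and use explicit lookarounds instead of `\b`. This is a sketch under that assumption, not part of the proposed patch; `escapeRegExp` and `wholeKeywordMatcher` are hypothetical helper names.

```typescript
// Escape regex metacharacters so a keyword is matched literally.
function escapeRegExp(keyword: string): string {
  return keyword.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Require that the keyword is not directly preceded or followed by a word
// character. Unlike \b, this also works when the keyword itself begins or
// ends with punctuation (for example "c++").
function wholeKeywordMatcher(keyword: string): RegExp {
  return new RegExp(`(?<![A-Za-z0-9_])${escapeRegExp(keyword)}(?![A-Za-z0-9_])`);
}

console.log(wholeKeywordMatcher("person").test("personal notes"));    // false
console.log(wholeKeywordMatcher("c++").test("notes on c++ templates")); // true
```

The matcher is built with the RegExp constructor, and lookbehind requires an ES2018-capable runtime.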

Verification

  • A Linux PAI 5.x install (empirical): after the fix, all 29 graduated notes classified with
    0 misfiled into People (21 Ideas / 7 Research / 1 Companies). Acceptable residuals noted: invidious
    → Companies via "startup"; relationship notes → Ideas/Research rather than People dossiers.
  • This bug is not platform-specific — the substring logic misclassifies on macOS and Linux alike.

Related issues

Split from the Linux-path issue #1366 (that one is a Linux platform blocker; this is a platform-independent
classification-precision bug — both were found and fixed in the same pass). No existing issue covers
classifier/domain precision (searched 2026-06-20). Adjacent but distinct: #1171 (writes directly to
KNOWLEDGE, bypassing the _harvest-queue curation step) and #1351 (queue review/promote lifecycle)
concern the harvest pipeline, not keyword classification.

Suggested labels

bug, precision, tool:KnowledgeHarvester
