KnowledgeHarvester.ts classifier over-matches the People domain via substring matching

**Component:** `PAI/TOOLS/KnowledgeHarvester.ts` (`classifyDomain` + `extractTags`, v5)
**Affected source:** `Releases/v5.0.0/.claude/PAI/TOOLS/KnowledgeHarvester.ts` @ `2fde1bb`
**Severity:** Medium — misfiles notes into the wrong KNOWLEDGE domain. Platform-independent (reproduces everywhere).

## Summary

`classifyDomain` and `extractTags` match domain keywords by **substring**, so "person" matches
"personal"/"persona", and a single incidental keyword ("contact", "profile") in an otherwise technical
note is enough to win the high-precision People domain. Result: technical notes get misfiled as People.

## Bug — Substring keyword matching + no precision floor

Line 322, inside `classifyDomain` (line 342 in `extractTags` has the same shape):

```ts
const score = keywords.reduce((acc, kw) => acc + (text.includes(kw) ? 1 : 0), 0);
```

**Two problems:**
1. **Substring matching**: `text.includes("person")` is true for "personal" and "persona", so a keyword
   scores on any word that merely contains it.
2. **No precision floor**: one incidental keyword ("contact", "profile") in an otherwise technical note
   is enough to route it to People.

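A minimal sketch of both problems, assuming the text has already been lowercased. The keyword list and the sample note are hypothetical stand-ins, not the real People keywords from `KnowledgeHarvester.ts`; only the scoring expression mirrors the line quoted above.

```typescript
// Hypothetical keyword list, for illustration only.
const peopleKeywords: string[] = ["person", "contact", "profile", "dossier"];

// Same shape as the quoted scoring line: one point per keyword found as a substring.
function substringScore(text: string, keywords: string[]): number {
  return keywords.reduce((acc, kw) => acc + (text.includes(kw) ? 1 : 0), 0);
}

// A purely technical note that mentions no person at all.
const technicalNote = "store the personal access token in the cpu profile config";

const peopleScore = substringScore(technicalNote, peopleKeywords);
console.log(peopleScore); // 2: "person" (inside "personal") and "profile"
```

Neither hit describes a person, yet the note collects two People points.
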
## Proposed fix

Match keywords on word boundaries, and require at least two keyword hits before routing a note to the
high-precision People domain (genuine OSINT/dossier notes hit several: osint, dossier, linkedin, profile,
background…):

```ts
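// word-boundary so "person" doesn't match "personal"/"persona";
// escape regex metacharacters so every keyword is matched literally
const esc = (kw: string) => kw.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
scores[domain] = keywords.reduce(
  (acc, kw) => acc + (new RegExp(`\\b${esc(kw)}\\b`).test(text) ? 1 : 0), 0);
// People is high-precision: a single incidental keyword must not route here
if (scores.People < 2) scores.People = 0;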
```

`extractTags` gets the same `\b`-boundary treatment.

> Note: `\b` only guards the outer edges of a keyword, so multi-word or hyphenated keywords would need
> separate handling (for example, flexible whitespace between words). The current keyword set has none,
> so no further change is needed today.

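The proposal can be sketched end to end as a self-contained scorer. Everything named here (`DOMAIN_KEYWORDS`, `hasWord`, `scoreDomains`, `PEOPLE_MIN_HITS`) is hypothetical, the keyword table is illustrative rather than copied from the tool, and lowercasing the input is an assumption about how the real code normalizes text.

```typescript
type Domain = "People" | "Companies" | "Ideas" | "Research";

// Hypothetical keyword table, for illustration only.
const DOMAIN_KEYWORDS: Record<Domain, string[]> = {
  People: ["person", "contact", "profile", "osint", "dossier", "linkedin", "background"],
  Companies: ["startup", "company"],
  Ideas: ["idea", "concept"],
  Research: ["study", "benchmark", "paper"],
};

// Escape regex metacharacters so each keyword is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True only when kw appears as a whole word in text.
function hasWord(text: string, kw: string): boolean {
  return new RegExp(`\\b${escapeRegExp(kw)}\\b`).test(text);
}

// Precision floor from the proposal: fewer People hits than this count as none.
const PEOPLE_MIN_HITS = 2;

function scoreDomains(rawText: string): Record<Domain, number> {
  const text = rawText.toLowerCase(); // assumption: keywords are stored lowercase
  const scores: Record<Domain, number> = { People: 0, Companies: 0, Ideas: 0, Research: 0 };
  for (const domain of Object.keys(DOMAIN_KEYWORDS) as Domain[]) {
    scores[domain] = DOMAIN_KEYWORDS[domain].reduce(
      (acc, kw) => acc + (hasWord(text, kw) ? 1 : 0), 0);
  }
  if (scores.People < PEOPLE_MIN_HITS) scores.People = 0;
  return scores;
}

console.log(scoreDomains("Personal notes on a CPU profile benchmark").People); // 0
```

One trade-off worth weighing: whole-word matching also stops matching inflected forms, so "profiles" no longer counts as a hit for "profile" unless the plural is listed as its own keyword.
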
## Verification

- **A Linux PAI 5.x install (empirical):** after the fix, all 23 graduated notes classified with
  **0 misfiled into People** (21 Ideas / 7 Research / 1 Companies). Acceptable residuals noted: `invidious`
  → Companies via "startup"; relationship notes → Ideas/Research rather than People dossiers.
- This bug is **not platform-specific** — the substring logic misclassifies on macOS and Linux alike.

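A check like the one above could be kept as a small regression guard. The sketch below is not the harness used for the reported numbers: the fixture notes, the `notesRoutedTo` helper, and the stand-in classifier are all hypothetical, and a real test would pass in `classifyDomain` instead.

```typescript
interface Note { title: string; body: string; }
type Classifier = (text: string) => string;

// Returns the titles of notes that the classifier routes to the given domain.
function notesRoutedTo(notes: Note[], classify: Classifier, domain: string): string[] {
  return notes
    .filter((n) => classify(`${n.title} ${n.body}`.toLowerCase()) === domain)
    .map((n) => n.title);
}

// Hypothetical technical fixtures that should never land in People.
const technicalFixtures: Note[] = [
  { title: "Token rotation", body: "Rotate the personal access token monthly." },
  { title: "Profiling", body: "Capture a CPU profile before optimizing." },
];

// Stand-in classifier applying the two proposed rules to one domain:
// whole-word matching plus a two-hit floor.
const peopleWords = ["person", "contact", "profile", "dossier"];
const standInClassify: Classifier = (text) => {
  const hits = peopleWords.filter((kw) => new RegExp(`\\b${kw}\\b`).test(text)).length;
  return hits >= 2 ? "People" : "Other";
};

const misfiled = notesRoutedTo(technicalFixtures, standInClassify, "People");
console.log(misfiled.length); // 0
```
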
## Related issues

Split from the Linux-path issue #1366 (that one is a Linux platform blocker; this is a platform-independent
classification-precision bug — both were found and fixed in the same pass). No existing issue covers
classifier/domain precision (searched 2026-06-20). Adjacent but distinct: #1171 (writes directly to
KNOWLEDGE, bypassing the `_harvest-queue` curation step) and #1351 (queue review/promote lifecycle)
concern the harvest *pipeline*, not keyword classification.

## Suggested labels

`bug`, `precision`, `tool:KnowledgeHarvester`

