Summary
khive-text (crates/khive-text, ~1391 LOC) provides shared text-analysis primitives —
tokenization, normalization, filtering, CJK/script detection, and a query noise-gate
(is_meaningful_query) — behind an Analyzer / Tokenizer / TokenFilter trait pipeline.
It was created to be the single home for this logic, deduplicating it across the retrieval and
memory paths.
It is currently wired into nothing. No crate declares khive-text as a dependency, and there
are no khive_text:: references anywhere in the workspace outside the crate itself.
The duplication it was meant to prevent
Two consumers reimplemented khive-text's logic instead of depending on it:
- Keyword tokenization: khive-bm25 defines its own Tokenizer trait and
  SimpleTokenizer in crates/khive-bm25/src/tokenizer.rs, parallel to khive-text's
  Tokenizer / BoxedTokenizer / Analyzer.
- Query noise-gate: khive-pack-memory reimplements is_meaningful_query and
  is_cjk_char in crates/khive-pack-memory/src/scoring.rs. This copy is the live one
  (called from handlers/recall.rs as the pre-embedding gate). The two is_meaningful_query
  implementations have since diverged: khive-text's is Latin-centric (global dominant-char
  heuristic), while pack-memory's is CJK-aware (per-word gibberish detection). pack-memory's
  is the more capable of the two.
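
To make the divergence concrete, here is a minimal sketch of how a whole-query heuristic and a per-word, CJK-aware heuristic can disagree. Neither function is the real code from khive-text or khive-pack-memory: the names global_gate and per_word_gate, the thresholds, the vowel test, and the character ranges in is_cjk_char are all assumptions chosen for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical CJK check. Covers only a few common blocks, not every CJK range.
fn is_cjk_char(c: char) -> bool {
    matches!(
        c,
        '\u{4E00}'..='\u{9FFF}'   // CJK Unified Ideographs
        | '\u{3040}'..='\u{309F}' // Hiragana
        | '\u{30A0}'..='\u{30FF}' // Katakana
        | '\u{AC00}'..='\u{D7AF}' // Hangul Syllables
    )
}

/// Stand-in for a "global dominant-char" gate that scores the query as a whole.
/// Assumed rule: at least 3 letters, and no single letter exceeds 60% of them.
fn global_gate(query: &str) -> bool {
    let letters: Vec<char> = query.chars().filter(|c| c.is_alphabetic()).collect();
    if letters.len() < 3 {
        return false;
    }
    let mut counts: HashMap<char, usize> = HashMap::new();
    for c in &letters {
        *counts.entry(c.to_ascii_lowercase()).or_insert(0) += 1;
    }
    let dominant = counts.values().copied().max().unwrap_or(0);
    // Integer form of: dominant / letters.len() <= 0.6
    dominant * 10 <= letters.len() * 6
}

/// True when every letter in the slice is the same character.
fn all_same(letters: &[char]) -> bool {
    letters.windows(2).all(|w| w[0] == w[1])
}

/// Stand-in for a per-word, CJK-aware gate.
/// Assumed rule: the query passes if any whitespace-separated word contains a
/// CJK character, or has 2+ letters, a vowel, and is not one repeated letter.
fn per_word_gate(query: &str) -> bool {
    query.split_whitespace().any(|word| {
        if word.chars().any(is_cjk_char) {
            return true;
        }
        let letters: Vec<char> = word.chars().filter(|c| c.is_alphabetic()).collect();
        let has_vowel = letters.iter().any(|c| "aeiouyAEIOUY".contains(*c));
        letters.len() >= 2 && has_vowel && !all_same(&letters)
    })
}
```

Under these assumed rules, a short CJK query such as "東京" fails global_gate (too few letters) but passes per_word_gate, while "rust tokenizer" passes both and "zzzzzz" fails both. Inputs of the first kind are where the two real implementations would need to be reconciled.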
Options
- (A) Wire it in. Make khive-bm25 and khive-pack-memory depend on khive-text and
  delete their duplicated tokenizer / is_meaningful_query / is_cjk_char. Reconcile the two
  is_meaningful_query variants first: pack-memory's CJK-aware version is the one to keep, so
  khive-text's should be brought up to match before bm25 / pack-memory adopt it. This touches
  the recall hot path, so it needs test-coverage parity before and after.
- (B) Drop it. Remove the crate and keep the inline implementations.
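
For option (A), one way to check parity before and after the switch is a small differential harness that runs both gates over the same corpus and reports every disagreement. The sketch below is an assumption about how such a check could look: gate_disagreements is a hypothetical helper, and the `Fn(&str) -> bool` shape of the two gates is assumed, not taken from either crate.

```rust
/// Returns every query on which the two gates disagree, together with
/// each gate's verdict. `old_gate` and `new_gate` are placeholders for the
/// pack-memory and khive-text implementations; their real signatures are
/// not confirmed here.
fn gate_disagreements<'a>(
    corpus: &[&'a str],
    old_gate: impl Fn(&str) -> bool,
    new_gate: impl Fn(&str) -> bool,
) -> Vec<(&'a str, bool, bool)> {
    corpus
        .iter()
        .copied()
        .filter_map(|q| {
            let (old, new) = (old_gate(q), new_gate(q));
            // Keep only the inputs where the verdicts differ.
            (old != new).then_some((q, old, new))
        })
        .collect()
}
```

A test in the consuming crate could call this with a corpus that mixes Latin, CJK, and gibberish queries and assert that the returned vector is empty once khive-text's version has been brought up to match.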
Related
khive-quant has a similar but only half-completed dedup. Its description says "shared by HNSW
and Vamana indexes," but only khive-vamana depends on it; khive-hnsw has its own quantizer
in crates/khive-hnsw/src/index/quantized.rs. Wiring HNSW onto khive-quant (and deleting its
local quantizer) is a separate, optional cleanup in the same spirit — tracked here only as
context, not part of this issue's scope.