Wire khive-text into bm25 + pack-memory, or drop the crate #377

@ohdearquant

Summary

khive-text (crates/khive-text, ~1391 LOC) provides shared text-analysis primitives —
tokenization, normalization, filtering, CJK/script detection, and a query noise-gate
(is_meaningful_query) — behind an Analyzer / Tokenizer / TokenFilter trait pipeline.
It was created to be the single home for this logic, deduplicating it across the retrieval and
memory paths.

It is currently wired into nothing. No crate declares khive-text as a dependency, and there
are no khive_text:: references anywhere in the workspace outside the crate itself.

The duplication it was meant to prevent

Two consumers reimplemented khive-text's logic instead of depending on it:

  1. Keyword tokenization: khive-bm25 defines its own Tokenizer trait and
    SimpleTokenizer in crates/khive-bm25/src/tokenizer.rs, parallel to khive-text's
    Tokenizer / BoxedTokenizer / Analyzer.

  2. Query noise-gate: khive-pack-memory reimplements is_meaningful_query and
    is_cjk_char in crates/khive-pack-memory/src/scoring.rs. This copy is the live one
    (called from handlers/recall.rs as the pre-embedding gate). The two is_meaningful_query
    implementations have since diverged: khive-text's is Latin-centric (global dominant-char
    heuristic), while pack-memory's is CJK-aware (per-word gibberish detection). pack-memory's
    is the more capable of the two.
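
To make the divergence concrete, here is a rough sketch of how a global dominant-character gate and a per-word CJK-aware gate can disagree on the same input. Nothing here is copied from either crate: the names gate_global and gate_per_word, the thresholds, and the Unicode ranges in this is_cjk_char are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Illustrative CJK check over a few common Unicode blocks (not exhaustive,
/// and not the ranges used by either crate).
fn is_cjk_char(c: char) -> bool {
    matches!(
        c as u32,
        0x4E00..=0x9FFF       // CJK Unified Ideographs
            | 0x3400..=0x4DBF // CJK Extension A
            | 0x3040..=0x30FF // Hiragana and Katakana
            | 0xAC00..=0xD7AF // Hangul Syllables
    )
}

/// Hypothetical "global dominant-char" gate: accept a query only if it has at
/// least two non-whitespace characters and no single character accounts for
/// more than half of them. The length floor is what makes it Latin-centric.
fn gate_global(query: &str) -> bool {
    let chars: Vec<char> = query.chars().filter(|c| !c.is_whitespace()).collect();
    if chars.len() < 2 {
        return false;
    }
    let mut counts: HashMap<char, usize> = HashMap::new();
    for c in &chars {
        *counts.entry(*c).or_insert(0) += 1;
    }
    let max = counts.values().copied().max().unwrap_or(0);
    max * 2 <= chars.len()
}

/// Hypothetical "per-word" gate: accept a query if any word either contains a
/// CJK character or has at least two alphanumeric characters that are not all
/// the same character.
fn gate_per_word(query: &str) -> bool {
    query.split_whitespace().any(|word| {
        if word.chars().any(is_cjk_char) {
            return true;
        }
        let letters: Vec<char> = word.chars().filter(|c| c.is_alphanumeric()).collect();
        letters.len() >= 2 && letters.iter().any(|c| *c != letters[0])
    })
}
```

Under these assumptions both gates accept "rust tokenizer" and reject "aaaa", but a single-ideograph query such as "猫" is rejected by gate_global and accepted by gate_per_word. That is the kind of behavioral gap that would need reconciling before a shared implementation replaces the live one.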

Options

  • (A) Wire it in. Make khive-bm25 and khive-pack-memory depend on khive-text and
    delete their duplicated tokenizer / is_meaningful_query / is_cjk_char. Reconcile the two
    is_meaningful_query variants first: pack-memory's CJK-aware version is the one to keep, so
    khive-text's should be brought up to match before bm25 / pack-memory adopt it. This touches
    the recall hot path, so it needs test-coverage parity before and after.
  • (B) Drop it. Remove the crate and keep the inline implementations.
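
For option (A), one possible shape for the parity check is a small differential helper that runs the inline and shared gates over the same corpus and reports every query where they disagree. This is only a sketch: disagreements, inline_gate, and shared_gate are hypothetical names, and the two gates are trivial stand-ins rather than the real is_meaningful_query implementations.

```rust
/// Hypothetical differential helper: returns every query in `corpus` on which
/// the two gate implementations return different answers.
fn disagreements<'a>(
    old: fn(&str) -> bool,
    new: fn(&str) -> bool,
    corpus: &[&'a str],
) -> Vec<&'a str> {
    corpus
        .iter()
        .copied()
        .filter(|q| old(*q) != new(*q))
        .collect()
}

/// Stand-in for the inline gate (assumption): anything non-blank passes.
fn inline_gate(query: &str) -> bool {
    !query.trim().is_empty()
}

/// Stand-in for the shared gate (assumption): needs at least one
/// alphanumeric character.
fn shared_gate(query: &str) -> bool {
    query.chars().any(|c| c.is_alphanumeric())
}
```

With these stand-ins, disagreements(inline_gate, shared_gate, &["rust", "", "???", "猫"]) returns only "???". In the real migration the corpus would be drawn from recall queries, and an empty result before and after the switch would be the parity signal.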

Related

khive-quant has a similar but only half-completed dedup. Its description says "shared by HNSW
and Vamana indexes," but only khive-vamana depends on it; khive-hnsw has its own quantizer
in crates/khive-hnsw/src/index/quantized.rs. Wiring HNSW onto khive-quant (and deleting its
local quantizer) is a separate, optional cleanup in the same spirit — tracked here only as
context, not part of this issue's scope.
