sanitize_fts5_query strips hyphens/dots instead of space-splitting, making kebab-case terms silently unfindable via FTS #397

@ohdearquant

Bug

sanitize_fts5_query (crates/khive-db/src/stores/text.rs) removes - and . from queries instead of replacing them with spaces, which makes any hyphenated or dotted term silently unfindable through the FTS leg.

Pass 1 replaces ( ) , : with spaces; Pass 2 then drops * " ' + - ^ . ~ ! $ entirely. So:

  • query khive-pack-memory → sanitized to khivepackmemory
  • indexed content still contains the literal khive-pack-memory, whose trigrams include the hyphens (e-p, k-m, ...)
  • the sanitized trigrams (epa, ckm, ...) never occur in the indexed text → 0 hits, no error
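The mismatch above can be reproduced without SQLite. The following is a minimal sketch, assuming the two passes behave exactly as described in this issue; `sanitize_current` and `trigrams` are hypothetical stand-ins written for illustration, not the code in crates/khive-db/src/stores/text.rs.

```rust
use std::collections::HashSet;

/// Hypothetical model of the current two-pass behavior described in this
/// issue. It is NOT the actual implementation in khive-db.
fn sanitize_current(query: &str) -> String {
    // Pass 1: replace ( ) , : with a space.
    let pass1: String = query
        .chars()
        .map(|c| if matches!(c, '(' | ')' | ',' | ':') { ' ' } else { c })
        .collect();
    // Pass 2: drop * " ' + - ^ . ~ ! $ entirely (this is the reported bug
    // for '-' and '.').
    pass1
        .chars()
        .filter(|&c| !matches!(c, '*' | '"' | '\'' | '+' | '-' | '^' | '.' | '~' | '!' | '$'))
        .collect()
}

/// All 3-character windows of `s`, lowercased. A simplified model of what a
/// trigram tokenizer would index or look up.
fn trigrams(s: &str) -> HashSet<String> {
    let chars: Vec<char> = s.to_lowercase().chars().collect();
    chars
        .windows(3)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

fn main() {
    let indexed = trigrams("khive-pack-memory");
    let sanitized = sanitize_current("khive-pack-memory");
    let query = trigrams(&sanitized);

    // Trigrams the query needs that the indexed text never contains.
    let mut missing: Vec<&String> = query.difference(&indexed).collect();
    missing.sort();

    println!("sanitized = {sanitized}");
    println!("query trigrams missing from index: {missing:?}");
}
```

Any trigram in the missing set is enough to make the phrase lookup fail, which is consistent with the silent 0-hit result.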

Verified empirically against 0.3.0 (in-memory runtime, no embedders, so the text leg is the only leg):

| Query | Hits against content LEGACY-FLAT-NOTE |
| --- | --- |
| LEGACY | 1 |
| LEGACY FLAT NOTE | matches |
| LEGACY-FLAT | 0 |
| LEGACY-FLAT-NOTE (exact content) | 0 |

Why it matters

Suggested fix

Move - and . (and plausibly + ~ ^) from the Pass 2 filter set into the Pass 1 space-replacement set, for exactly the reason the existing Pass 1 comment gives for the colon: tenant:isolation → tenant isolation, not tenantisolation. With that change, LEGACY-FLAT-NOTE → LEGACY FLAT NOTE, whose trigrams all occur in the indexed content, restoring the match.

Characters that FTS5 rejects outright regardless of position ($, quotes, *) stay in Pass 2.

Regression test suggestion: index a document containing a kebab-case token and assert the exact hyphenated query returns it through the text-only path.
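A rough sketch of the fix and the regression check, under stated assumptions: `sanitize_fixed` is a hypothetical rewrite that moves only - and . into Pass 1 (leaving + ~ ^ in Pass 2), and `text_leg_matches` is a stand-in for the text-only path that treats each whitespace-separated term as a case-insensitive substring match, which only approximates trigram matching for terms of three or more characters. A real regression test would go through the actual store and FTS index instead.

```rust
/// Hypothetical sketch of the suggested fix; the real function in
/// khive-db may be structured differently.
fn sanitize_fixed(query: &str) -> String {
    // Pass 1: separators become spaces. '-' and '.' are moved here from
    // Pass 2 so that kebab-case and dotted terms split into words.
    let pass1: String = query
        .chars()
        .map(|c| if matches!(c, '(' | ')' | ',' | ':' | '-' | '.') { ' ' } else { c })
        .collect();
    // Pass 2: the remaining special characters are still dropped.
    pass1
        .chars()
        .filter(|&c| !matches!(c, '*' | '"' | '\'' | '+' | '^' | '~' | '!' | '$'))
        .collect()
}

/// Stand-in for the text leg (an assumption, not the real FTS query):
/// every term must occur in the content as a case-insensitive substring.
fn text_leg_matches(content: &str, sanitized: &str) -> bool {
    let haystack = content.to_lowercase();
    let terms: Vec<String> = sanitized
        .split_whitespace()
        .map(|t| t.to_lowercase())
        .collect();
    !terms.is_empty() && terms.iter().all(|t| haystack.contains(t.as_str()))
}

fn main() {
    let content = "LEGACY-FLAT-NOTE";

    // The exact hyphenated query should find the document again.
    let sanitized = sanitize_fixed("LEGACY-FLAT-NOTE");
    assert_eq!(sanitized, "LEGACY FLAT NOTE");
    assert!(text_leg_matches(content, &sanitized));

    // The old output (hyphens stripped, no spaces) does not match.
    assert!(!text_leg_matches(content, "LEGACYFLATNOTE"));

    println!("kebab-case query matches after the fix");
}
```

Whether + ~ ^ should also move into Pass 1 is left open here, as it is in the suggestion above.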
