Skip to content

perf: embedding-bearing verbs cost ~2s per op inside the warm daemon (recall/search); non-embedding verbs are at the 20-50ms floor #394

Description

@ohdearquant

Measurements (idle machine, warm kkernel mcp --daemon, production-scale DB)

Op Latency
stats() 0.04-0.05s
get(id=<full-uuid>) 0.02s
memory.recall(query=..., limit=3) 2.0-3.5s, flat (same or novel query)
search(kind="entity", query=..., limit=3) 2.9s
batch [recall, recall] (one request, distinct queries) 3.0s
first recall after daemon cold start ~37s
KHIVE_NO_DAEMON=1 recall (in-process) ~37s every time

Decomposition

  • The warm daemon IS serving (KHIVE_NO_DAEMON=1 forces the in-process path and pays the ~37s cold ANN + model load every call; the daemon path does not).
  • The ~2s cost is shared by every embedding-bearing verb (memory.recall and entity search alike) while by-ID/aggregate verbs sit at the 20-50ms floor — so it is not ANN search, not FTS, not SQLite, and not per-request dispatch overhead.
  • Batch of 2 recalls in one request = 3.0s vs 2.1s single: per-op embedding dominates, with weak parallel scaling (consistent with contention on a shared embedder).
  • Repeating an identical query does not get faster: no query-embedding cache.

Hypothesis

The per-query embedding (embed_one, model all-minilm-l6-v2 via lattice-embed) costs ~2s per call inside the daemon. KhiveRuntime::embed documents lazy-load-then-reuse of model weights, so either (a) the cached-weights path still costs ~2s of pure-Rust inference for a single short query, or (b) the daemon is not actually reusing the loaded embedder across requests. A MiniLM-class forward pass for one short sentence should be tens of milliseconds, not seconds.

Suggested next steps

  1. Instrument the daemon serve path with per-stage timing (embed / vector search / FTS / fusion / hydration) behind a debug flag, and confirm where the 2s lands.
  2. If (a): profile lattice-embed's forward pass for the single-query case (SIMD/quantized path, batch-of-1 overhead).
  3. If (b): fix embedder reuse across daemon requests.
  4. Consider a query-embedding LRU as a cheap orthogonal win for repeated queries.

Why it matters: every recall-shaped caller (hooks, agents, interactive sessions) pays this on every call. Dropping recall from ~2.1s to ~200ms improves every consumer of the substrate, not just any single integration.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions