Measurements (idle machine, warm kkernel mcp --daemon, production-scale DB)
| Op | Latency |
|---|---|
| stats() | 0.04-0.05s |
| get(id=<full-uuid>) | 0.02s |
| memory.recall(query=..., limit=3) | 2.0-3.5s, flat (same or novel query) |
| search(kind="entity", query=..., limit=3) | 2.9s |
| batch [recall, recall] (one request, distinct queries) | 3.0s |
| first recall after daemon cold start | ~37s |
| KHIVE_NO_DAEMON=1 recall (in-process) | ~37s every time |
Decomposition
- The warm daemon IS serving (KHIVE_NO_DAEMON=1 forces the in-process path and pays the ~37s cold ANN + model load every call; the daemon path does not).
- The ~2s cost is shared by every embedding-bearing verb (memory.recall and entity search alike) while by-ID/aggregate verbs sit at the 20-50ms floor, so it is not ANN search, not FTS, not SQLite, and not per-request dispatch overhead.
- Batch of 2 recalls in one request = 3.0s vs 2.1s single: per-op embedding dominates, with weak parallel scaling (consistent with contention on a shared embedder).
- Repeating an identical query does not get faster: no query-embedding cache.
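The batch observation can be put in numbers. The sketch below is plain arithmetic on the two latencies reported above (2.1s single, 3.0s for a batch of two); the function name `batch_scaling` is made up for illustration and is not part of khive. It compares the observed batch time against the two bounds: strictly serial execution and perfect overlap.

```python
def batch_scaling(single_s, batch_s, n_ops):
    """Compare an observed batch latency against serial and ideal-parallel bounds."""
    serial_s = n_ops * single_s       # expected if the ops ran strictly one after another
    speedup = serial_s / batch_s      # 1.0 = no overlap, n_ops = perfect overlap
    efficiency = speedup / n_ops      # fraction of the ideal speedup actually achieved
    return serial_s, speedup, efficiency


# Numbers taken from the measurements above.
serial_s, speedup, efficiency = batch_scaling(single_s=2.1, batch_s=3.0, n_ops=2)
print(f"serial estimate: {serial_s:.1f}s")
print(f"observed speedup: {speedup:.2f}x (ideal would be 2x)")
print(f"parallel efficiency: {efficiency:.0%}")
```

The batch lands between the bounds: faster than the 4.2s a strictly serial run would take, but well short of the 2.1s that full overlap would give. These two data points alone cannot separate partial overlap from a fixed per-request cost plus a smaller per-op cost, which is one more reason to time the stages directly.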
Hypothesis
The per-query embedding (embed_one, model all-minilm-l6-v2 via lattice-embed) costs ~2s per call inside the daemon. KhiveRuntime::embed documents lazy-load-then-reuse of model weights, so either (a) the cached-weights path still costs ~2s of pure-Rust inference for a single short query, or (b) the daemon is not actually reusing the loaded embedder across requests. A MiniLM-class forward pass for one short sentence should be tens of milliseconds, not seconds.
Suggested next steps
- Instrument the daemon serve path with per-stage timing (embed / vector search / FTS / fusion / hydration) behind a debug flag, and confirm where the 2s lands.
- If (a): profile lattice-embed's forward pass for the single-query case (SIMD/quantized path, batch-of-1 overhead).
- If (b): fix embedder reuse across daemon requests.
- Consider a query-embedding LRU as a cheap orthogonal win for repeated queries.
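The first and last steps above can be sketched together. This is an illustration in Python, not khive or lattice-embed code: `StageTimer`, `EmbeddingLRU`, `fake_embed`, and `recall` are all hypothetical names, the embedder is a stand-in that just sleeps, and the non-embed stages are empty placeholders. It shows per-stage wall-clock accounting for one request and how a small LRU keyed on model name plus query text would remove the embed cost for a repeated query.

```python
import time
from collections import OrderedDict, defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds per named stage of one request."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start


class EmbeddingLRU:
    """Small LRU keyed by (model, query text). Hypothetical, for illustration."""

    def __init__(self, embed_fn, model, capacity=256):
        self.embed_fn = embed_fn
        self.model = model
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def embed(self, query):
        key = (self.model, query)
        if key in self.entries:
            self.entries.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        vector = self.embed_fn(query)
        self.entries[key] = vector
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
        return vector


def fake_embed(query):
    """Stand-in for the real embedder: sleeps briefly, returns a dummy vector."""
    time.sleep(0.05)
    return [float(len(query)), 0.0, 0.0]


def recall(query, cache, timer):
    """Hypothetical recall path; only the embed stage does any work here."""
    with timer.stage("embed"):
        vector = cache.embed(query)
    for name in ("vector_search", "fts", "fusion", "hydration"):
        with timer.stage(name):
            pass                               # placeholder for the real stage
    return vector


cache = EmbeddingLRU(fake_embed, model="all-minilm-l6-v2", capacity=2)
first, second = StageTimer(), StageTimer()
recall("same query", cache, first)     # miss: pays the simulated embed cost
recall("same query", cache, second)    # hit: embed stage should be near zero

for label, timer in (("first", first), ("second", second)):
    print(label, {name: f"{secs * 1000:.1f}ms" for name, secs in timer.totals.items()})
```

Keying on the model name as well as the query text is a deliberate choice in this sketch, so that cached vectors would not be reused if the embedding model changed. A cache like this only helps repeated queries; it does nothing for the ~2s paid on novel ones, which is why it is orthogonal to fixing (a) or (b).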
Why it matters: every recall-shaped caller (hooks, agents, interactive sessions) pays this on every call. Dropping recall from ~2.1s to ~200ms improves every consumer of the substrate, not just any single integration.