perf: embedding-bearing verbs cost ~2s per op inside the warm daemon (recall/search); non-embedding verbs are at the 20-50ms floor

## Measurements (idle machine, warm `kkernel mcp --daemon`, production-scale DB)

| Op | Latency |
|---|---|
| `stats()` | 0.04-0.05s |
| `get(id=<full-uuid>)` | 0.02s |
| `memory.recall(query=..., limit=3)` | 2.0-3.5s, flat (same or novel query) |
| `search(kind="entity", query=..., limit=3)` | 2.9s |
| batch `[recall, recall]` (one request, distinct queries) | 3.0s |
| first recall after daemon cold start | ~37s |
| `KHIVE_NO_DAEMON=1` recall (in-process) | ~37s every time |

## Decomposition

- The warm daemon IS serving (`KHIVE_NO_DAEMON=1` forces the in-process path and pays the ~37s cold ANN + model load every call; the daemon path does not).
- The ~2s cost is shared by **every embedding-bearing verb** (`memory.recall` and entity `search` alike) while by-ID/aggregate verbs sit at the 20-50ms floor — so it is not ANN search, not FTS, not SQLite, and not per-request dispatch overhead.
- A batch of 2 recalls in one request takes 3.0s vs 2.1s for a single recall (about 1.4x, neither 1x nor 2x): per-op embedding dominates, and the two ops overlap only partially (weak parallel scaling, consistent with contention on a shared embedder).
- Repeating an identical query does not get faster: no query-embedding cache.
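For reference, the kind of query-embedding cache whose absence the last bullet infers can be quite small. The following is a minimal sketch only: `EmbeddingCache` and `get_or_embed` are hypothetical names, not part of khive or lattice-embed, and the `embed` closure stands in for whatever the real embedder call is. It keys on the exact query string and evicts the least recently used entry.

```rust
use std::collections::{HashMap, VecDeque};

/// Minimal LRU cache for query embeddings, keyed by the exact query string.
/// Hypothetical sketch: not an existing khive or lattice-embed type.
pub struct EmbeddingCache {
    capacity: usize,
    map: HashMap<String, Vec<f32>>,
    /// Keys ordered from least to most recently used.
    order: VecDeque<String>,
}

impl EmbeddingCache {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    /// Return the cached vector for `query`, or compute it with `embed`
    /// (a stand-in for the real embedder call) and store it.
    pub fn get_or_embed<F>(&mut self, query: &str, embed: F) -> Vec<f32>
    where
        F: FnOnce(&str) -> Vec<f32>,
    {
        if let Some(hit) = self.map.get(query).cloned() {
            self.touch(query);
            return hit;
        }
        let fresh = embed(query);
        // Evict the least recently used entry when the cache is full.
        if self.map.len() == self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
        self.map.insert(query.to_owned(), fresh.clone());
        self.order.push_back(query.to_owned());
        fresh
    }

    /// Move `query` to the most-recently-used end of the order queue.
    fn touch(&mut self, query: &str) {
        if let Some(pos) = self.order.iter().position(|k| k == query) {
            if let Some(key) = self.order.remove(pos) {
                self.order.push_back(key);
            }
        }
    }

    pub fn len(&self) -> usize {
        self.map.len()
    }
}
```

The linear scan in `touch` is acceptable for a small cache of a few hundred queries; a larger cache would want a linked-list-backed LRU instead. A cache like this only helps repeated queries and does nothing for the ~2s cost of a novel one.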

## Hypothesis

The per-query embedding (`embed_one`, model `all-minilm-l6-v2` via lattice-embed) costs ~2s per call inside the daemon. `KhiveRuntime::embed` documents lazy-load-then-reuse of model weights, so either (a) the cached-weights path still costs ~2s of pure-Rust inference for a single short query, or (b) the daemon is not actually reusing the loaded embedder across requests. A MiniLM-class forward pass for one short sentence should be tens of milliseconds, not seconds.
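To make case (b) concrete, here is a rough sketch of what lazy-load-then-reuse looks like when it works, with a load counter that would expose a daemon reloading per request. Everything here is an assumption for illustration: `Embedder`, `embedder()`, and `load_count()` are hypothetical and do not reflect how `KhiveRuntime::embed` or lattice-embed are actually implemented. The only detail taken from the model itself is the 384-dimensional output of `all-minilm-l6-v2`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how many times the (expensive) model load ran.
static LOADS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a loaded embedding model.
pub struct Embedder {
    dim: usize,
}

impl Embedder {
    /// Stand-in for reading weights from disk; should run once per process.
    fn load() -> Self {
        LOADS.fetch_add(1, Ordering::SeqCst);
        Embedder { dim: 384 }
    }

    /// Placeholder "inference": returns a vector of the right shape only.
    pub fn embed_one(&self, text: &str) -> Vec<f32> {
        vec![text.len() as f32; self.dim]
    }
}

/// Process-wide embedder, initialized on first use and then reused.
static EMBEDDER: OnceLock<Embedder> = OnceLock::new();

pub fn embedder() -> &'static Embedder {
    EMBEDDER.get_or_init(Embedder::load)
}

/// Number of model loads so far; anything above 1 means reuse is broken.
pub fn load_count() -> usize {
    LOADS.load(Ordering::SeqCst)
}
```

Logging an equivalent counter from the real daemon after a handful of recalls would separate the two cases: a count that grows with requests points at (b), while a count that stays at 1 leaves (a).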

## Suggested next steps

1. Instrument the daemon serve path with per-stage timing (embed / vector search / FTS / fusion / hydration) behind a debug flag, and confirm where the 2s lands.
2. If (a): profile lattice-embed's forward pass for the single-query case (SIMD/quantized path, batch-of-1 overhead).
3. If (b): fix embedder reuse across daemon requests.
4. Consider a query-embedding LRU as a cheap orthogonal win for repeated queries.
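The per-stage timing in step 1 could be sketched along these lines, using only `std::time::Instant`. This is an illustration under assumptions, not the daemon's code: `StageTimings` is a hypothetical helper, the stage names are the ones listed in step 1, and `KHIVE_STAGE_TIMING` is an invented environment variable standing in for whatever debug flag is chosen.

```rust
use std::time::{Duration, Instant};

/// Collects wall-clock time per named stage of one request.
/// Hypothetical helper: not an existing khive type.
#[derive(Default)]
pub struct StageTimings {
    stages: Vec<(&'static str, Duration)>,
}

impl StageTimings {
    /// Run `f`, record how long it took under `name`, and return its result.
    pub fn time<T>(&mut self, name: &'static str, f: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = f();
        self.stages.push((name, start.elapsed()));
        out
    }

    /// Sum of all recorded stage durations.
    pub fn total(&self) -> Duration {
        self.stages.iter().map(|(_, d)| *d).sum()
    }

    /// One-line report such as `embed=1234.5ms vector_search=3.2ms`.
    pub fn report(&self) -> String {
        self.stages
            .iter()
            .map(|(name, d)| format!("{name}={:.1}ms", d.as_secs_f64() * 1e3))
            .collect::<Vec<_>>()
            .join(" ")
    }
}

/// Assumed debug switch; the variable name is invented for this sketch.
pub fn timing_enabled() -> bool {
    std::env::var_os("KHIVE_STAGE_TIMING").is_some()
}
```

In a serve path this would wrap each stage, for example `timings.time("embed", || ...)` followed by `timings.time("vector_search", || ...)`, and log `timings.report()` when `timing_enabled()` is true. Comparing `total()` with the end-to-end latency also shows whether any time falls outside the instrumented stages.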

Why it matters: every recall-shaped caller (hooks, agents, interactive sessions) pays this on every call. Dropping recall from ~2.1s to ~200ms improves every consumer of the substrate, not just a single integration.

perf: embedding-bearing verbs cost ~2s per op inside the warm daemon (recall/search); non-embedding verbs are at the 20-50ms floor #394
