fix(auto-recall): thread AbortSignal to cancel embedding on timeout#668
fix(auto-recall): thread AbortSignal to cancel embedding on timeout#668ScientificProgrammer wants to merge 2 commits intoCortexReach:masterfrom
Conversation
Problem: the Promise.race-based timeout in the auto-recall before_prompt_build hook resolves the outer promise with undefined after AUTO_RECALL_TIMEOUT_MS, but it does not cancel the underlying embedding HTTP call. The underlying work keeps running, holding the per-agent memory-runtime mutex (and lancedb connection), so the next hook invocation serializes behind it. Combined with the gateway's session-write-lock max-hold of (timeoutMs + 120s grace, min 300s), a single slow embedding call can hold the session lock for several minutes before the watchdog force-releases — seen in production as "stuck session: state=processing age=219s queueDepth=0". Fix: - Create an AbortController in the hook and pass its signal into recallWork(signal). - recallWork threads the signal through retrieveWithRetry() into Retriever.retrieve() via a new RetrievalContext.signal field. - Retriever.retrieve() forwards the signal to the embedQuery() calls in both vectorOnlyRetrieval() and hybridRetrieval(). The embedder already accepts AbortSignal throughout (src/embedder.ts), so no downstream changes were needed. - When the setTimeout fires, the hook calls controller.abort() before resolving undefined. In-flight embedding fetch() requests terminate promptly; retriever.retrieve() rejects with AbortError which the retryWithRetry early-return handles (aborts skip the 75ms retry). - The catch branch downgrades AbortError to a debug log so we don't double-log the timeout warning the setTimeout path already emitted. Out of scope (follow-up): - Jina rerank HTTP calls (a separate code path in retriever.ts) should also accept the signal. Left for a focused follow-up since it touches the rerank client interface. - BM25 path does not call embedQuery and has no long-running HTTP call, so it does not need signal plumbing. Note: tsc --noEmit reports a pre-existing TS2353 error on retrieveWithRetry where callers pass `source: "auto-recall"` but the function's inline type did not include `source` (introduced by CortexReach#535, not this PR). Non-blocking — tsx runtime transpile does not check types. Can be fixed with a one-line addition to the param type. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR fixes an operational issue in the auto-recall before_prompt_build hook where a timeout via Promise.race returned early but left the underlying embedding HTTP request running, holding locks and causing subsequent sessions/hooks to stall.
Changes:
- Add an
AbortControllerin the auto-recall hook and thread itsAbortSignalthroughrecallWork()into retrieval. - Extend retriever plumbing (
RetrievalContext) to accept an optionalsignaland forward it toembedder.embedQuery(...). - Make
retrieveWithRetrybail out after the retry delay if the signal was aborted.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
index.ts |
Creates/aborts an AbortController on auto-recall timeout; threads signal into recall retrieval and adds an early return in the retry gap. |
src/retriever.ts |
Adds signal?: AbortSignal to RetrievalContext and forwards it into query embedding calls in vector/hybrid retrieval paths. |
Comments suppressed due to low confidence (1)
index.ts:2712
- This change introduces new timeout-cancellation behavior (aborting the in-flight embed/retrieval work) but there’s no automated test asserting that the hook returns promptly and the downstream embedder receives an aborted signal. Given the existing test suite, consider adding a unit/integration test that stubs a slow embedder call, advances timers to
AUTO_RECALL_TIMEOUT_MS, and asserts (1)before_prompt_buildreturnsundefinedquickly and (2) the stub observedsignal.aborted === true/ rejected with AbortError.
// Cancel in-flight embedding/retrieval HTTP calls so they don't
// keep holding resources after we've given up on the result.
abortController.abort(new Error("auto-recall timeout"));
api.logger.warn(
`memory-lancedb-pro: auto-recall timed out after ${AUTO_RECALL_TIMEOUT_MS}ms; skipping memory injection to avoid stalling agent startup`,
);
resolve(undefined);
}, AUTO_RECALL_TIMEOUT_MS);
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Downgrade AbortError to debug — the timeout path already emitted | ||
| // a warn log, and the race has already resolved undefined. | ||
| const isAbort = err instanceof Error && (err.name === "AbortError" || /abort/i.test(err.message)); | ||
| if (isAbort) { | ||
| api.logger.debug?.(`memory-lancedb-pro: recall aborted: ${String(err)}`); | ||
| } else { | ||
| api.logger.warn(`memory-lancedb-pro: recall failed: ${String(err)}`); | ||
| } |
There was a problem hiding this comment.
The abort classification in the catch block is too broad (err.name === "AbortError" || /abort/i.test(err.message)). This will also downgrade unrelated aborts (e.g., the embedder's own timeout AbortError) to debug, reducing visibility of real failures. Instead, key this specifically to this timeout controller (e.g., check abortController.signal.aborted, compare err to abortController.signal.reason, or track a timedOut flag set in the timeout handler) and keep other errors at warn.
| * the method rejects with AbortError instead of holding the caller's session | ||
| * lock while the underlying HTTP request runs to completion. */ |
There was a problem hiding this comment.
The docstring for RetrievalContext.signal says the method rejects with AbortError, but when a caller passes an already-aborted signal, embedder.withTimeout() rejects with externalSignal.reason (which may be a plain Error, not an AbortError). Consider softening the wording to “rejects due to abort (often AbortError / abort reason)” to match actual behavior.
| * the method rejects with AbortError instead of holding the caller's session | |
| * lock while the underlying HTTP request runs to completion. */ | |
| * the method rejects due to abort (often with AbortError or the signal's | |
| * abort reason) instead of holding the caller's session lock while the | |
| * underlying HTTP request runs to completion. */ |
| async function retrieveWithRetry(params: { | ||
| query: string; | ||
| limit: number; | ||
| scopeFilter?: string[]; | ||
| category?: string; | ||
| signal?: AbortSignal; | ||
| }) { | ||
| let results = await retriever.retrieve(params); | ||
| if (results.length === 0) { | ||
| await sleep(75); | ||
| if (params.signal?.aborted) return results; | ||
| results = await retriever.retrieve(params); | ||
| } |
There was a problem hiding this comment.
retrieveWithRetry's inline params type is missing the source field even though callers pass source: "auto-recall". This keeps the existing TS2353 error around and makes it easier for future call sites to drift. Prefer typing params as RetrievalContext (or Pick<RetrievalContext, "query"|"limit"|"scopeFilter"|"category"|"source"|"signal">) so source/signal are both supported and checked.
Address Copilot P2 feedback on PR CortexReach#668. The catch block used `err.name === "AbortError" || /abort/i.test(err.message)` to decide whether to downgrade recall errors to debug. That was too broad: if the embedder's own internal timeout AbortController fired (src/embedder.ts withTimeout), its AbortError would propagate up and match the same pattern, silencing a real infrastructure failure. Key the debug-downgrade to `abortController.signal.aborted` instead. That flag is owned by this hook's own controller and is true only when our timeout callback fired. Aborts from any other controller (embedder, SDK client, upstream cancellation) now stay at warn. Also tighten the `RetrievalContext.signal` docstring: the embedder's withTimeout rejects with `externalSignal.reason ?? new Error("aborted")` when the signal is already aborted, which can be a plain Error rather than an AbortError. "rejects due to abort (often with AbortError or the signal's abort reason)" matches observed behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@copilot-pull-request-reviewer — feedback on 1. Abort classification narrowed ( 2. Retriever docstring tightened ( 3. Follow-up noted for a later PR (out of scope here, not a regression): the |
|
Thanks for tackling this. The problem is real and worth fixing, and the overall direction here looks right. I’m still at request changes for one blocker before merge:
What I’d like to see before merge:
Non-blocking notes:
Once the timeout rejection path is handled, this looks close. |
app3apps
left a comment
There was a problem hiding this comment.
Requesting changes before merge.
The direction here looks right and the problem is worth fixing, but there is still one blocker on the timeout path:
- When the timeout wins the
Promise.race, the in-flightrecallWork(abortController.signal)can still reject after the caller has already moved on. In the current shape, that rejection is not handled explicitly, so we risk turning the stuck-mutex issue into an unhandled promise rejection on the abort path.
What I’d like before merge:
- Handle the post-timeout rejection path explicitly so an abort-triggered failure is consumed intentionally rather than escaping as an unhandled rejection.
- Add a focused test for the
timeout -> abort -> cancellationpath.
Once that is covered, this looks close.
Summary
The auto-recall
before_prompt_buildhook wrapsrecallWork()inPromise.racewith asetTimeoutresolver (index.ts:~2690). When the timer fires, the race resolvesundefined— but the underlying embedding HTTP call keeps running. NoAbortController/AbortSignalis threaded through. The lingering work holds the per-agent memory-runtime mutex, so subsequent hook invocations serialize behind it.In production at the reporter's site, this manifested as
stuck session: state=processing age=219s queueDepth=0with the gateway's session-write-lock watchdog firing repeatedly. The gateway's lock max-hold istimeoutMs + 120s grace(min 300s), so a single slow embedding call ends up holding the session lock for 2–5 minutes before forced release.Changes
index.ts— create anAbortControllerin the hook; passcontroller.signalinto a newrecallWork(signal)parameter; in thesetTimeoutcallback, callcontroller.abort(new Error("auto-recall timeout"))before resolving. Thecatchbranch downgradesAbortErrorto adebuglog (the timeout path already emitted awarn).retrieveWithRetrynow accepts an optionalsignaland early-returns after the 75 ms retry gap if the signal was aborted in the interim.src/retriever.ts— addsignal?: AbortSignaltoRetrievalContext; destructure inretrieve(); forward tovectorOnlyRetrieval()andhybridRetrieval(), which pass it tothis.embedder.embedQuery(query, signal). The embedder already accepts anAbortSignalthroughout (src/embedder.ts:794,798,806,810,865), so no downstream changes are needed.Scope
embedQuery, so no plumbing needed.Test plan
Verified locally:
tsc --noEmitintroduces no new type errors. Two pre-existing errors remain (notablyTS2353onretrieveWithRetrywheresourceis passed but not declared in the inline param type — introduced by feat(memory): add per-agent auto-recall control #535, independent of this PR).autoRecallTimeoutMsand the in-flight embedding fetch seessignal.aborted === true.Automated test not added in this PR — suggested follow-up: a unit test that stubs the embedder with a 30 s delay and asserts
recallWorkreturns within 100 ms of the timeout AND the stub receives an aborted signal.Related
proving-groundagent, OpenClaw 2026.4.9) directly implicated this code path.memory_compactregistration), fix: resolve stale lock and same-instance write contention (#622, #623) #626 (stale lockfile). Those address adjacent failure modes but not thePromise.race-without-cancellation root cause fixed here.🤖 Developed with Claude Code (Claude Opus 4.7, 1M context).