Skip to content

fix(auto-recall): thread AbortSignal to cancel embedding on timeout#668

Open
ScientificProgrammer wants to merge 2 commits intoCortexReach:masterfrom
ScientificProgrammer:fix/auto-recall-abort-signal
Open

fix(auto-recall): thread AbortSignal to cancel embedding on timeout#668
ScientificProgrammer wants to merge 2 commits intoCortexReach:masterfrom
ScientificProgrammer:fix/auto-recall-abort-signal

Conversation

@ScientificProgrammer
Copy link
Copy Markdown
Contributor

Summary

The auto-recall before_prompt_build hook wraps recallWork() in Promise.race with a setTimeout resolver (index.ts:~2690). When the timer fires, the race resolves undefined — but the underlying embedding HTTP call keeps running. No AbortController/AbortSignal is threaded through. The lingering work holds the per-agent memory-runtime mutex, so subsequent hook invocations serialize behind it.

In production at the reporter's site, this manifested as stuck session: state=processing age=219s queueDepth=0 with the gateway's session-write-lock watchdog firing repeatedly. The gateway's lock max-hold is timeoutMs + 120s grace (min 300s), so a single slow embedding call ends up holding the session lock for 2–5 minutes before forced release.

Changes

  • index.ts — create an AbortController in the hook; pass controller.signal into a new recallWork(signal) parameter; in the setTimeout callback, call controller.abort(new Error("auto-recall timeout")) before resolving. The catch branch downgrades AbortError to a debug log (the timeout path already emitted a warn). retrieveWithRetry now accepts an optional signal and early-returns after the 75 ms retry gap if the signal was aborted in the interim.
  • src/retriever.ts — add signal?: AbortSignal to RetrievalContext; destructure in retrieve(); forward to vectorOnlyRetrieval() and hybridRetrieval(), which pass it to this.embedder.embedQuery(query, signal). The embedder already accepts an AbortSignal throughout (src/embedder.ts:794,798,806,810,865), so no downstream changes are needed.

Scope

  • In scope: auto-recall hook → retriever → embedder cancellation.
  • Out of scope (follow-ups):
    • Jina rerank HTTP calls (separate code path in the retriever's rerank stage). Should also accept the signal. Left for a focused follow-up that touches the rerank client interface.
    • BM25 path has no long-running HTTP call and doesn't invoke embedQuery, so no plumbing needed.

Test plan

Verified locally:

  • tsc --noEmit introduces no new type errors. Two pre-existing errors remain (notably TS2353 on retrieveWithRetry where source is passed but not declared in the inline param type — introduced by feat(memory): add per-agent auto-recall control #535, independent of this PR).
  • The previously-deployed (stale) plugin exhibited the stuck-session symptom under slow-embedding conditions; after rolling to a build with this change applied, a contrived delayed-embedding scenario returns from the hook inside the configured autoRecallTimeoutMs and the in-flight embedding fetch sees signal.aborted === true.

Automated test not added in this PR — suggested follow-up: a unit test that stubs the embedder with a 30 s delay and asserts recallWork returns within 100 ms of the timeout AND the stub receives an aborted signal.

Related


🤖 Developed with Claude Code (Claude Opus 4.7, 1M context).

Problem: the Promise.race-based timeout in the auto-recall
before_prompt_build hook resolves the outer promise with undefined
after AUTO_RECALL_TIMEOUT_MS, but it does not cancel the underlying
embedding HTTP call. The underlying work keeps running, holding the
per-agent memory-runtime mutex (and lancedb connection), so the next
hook invocation serializes behind it. Combined with the gateway's
session-write-lock max-hold of (timeoutMs + 120s grace, min 300s),
a single slow embedding call can hold the session lock for several
minutes before the watchdog force-releases — seen in production as
"stuck session: state=processing age=219s queueDepth=0".

Fix:
- Create an AbortController in the hook and pass its signal into
  recallWork(signal).
- recallWork threads the signal through retrieveWithRetry() into
  Retriever.retrieve() via a new RetrievalContext.signal field.
- Retriever.retrieve() forwards the signal to the embedQuery() calls
  in both vectorOnlyRetrieval() and hybridRetrieval(). The embedder
  already accepts AbortSignal throughout (src/embedder.ts), so no
  downstream changes were needed.
- When the setTimeout fires, the hook calls controller.abort() before
  resolving undefined. In-flight embedding fetch() requests terminate
  promptly; retriever.retrieve() rejects with AbortError which the
  retryWithRetry early-return handles (aborts skip the 75ms retry).
- The catch branch downgrades AbortError to a debug log so we don't
  double-log the timeout warning the setTimeout path already emitted.

Out of scope (follow-up):
- Jina rerank HTTP calls (a separate code path in retriever.ts) should
  also accept the signal. Left for a focused follow-up since it
  touches the rerank client interface.
- BM25 path does not call embedQuery and has no long-running HTTP
  call, so it does not need signal plumbing.

Note: tsc --noEmit reports a pre-existing TS2353 error on
retrieveWithRetry where callers pass `source: "auto-recall"` but the
function's inline type did not include `source` (introduced by CortexReach#535,
not this PR). Non-blocking — tsx runtime transpile does not check
types. Can be fixed with a one-line addition to the param type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 19, 2026 16:15
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes an operational issue in the auto-recall before_prompt_build hook where a timeout via Promise.race returned early but left the underlying embedding HTTP request running, holding locks and causing subsequent sessions/hooks to stall.

Changes:

  • Add an AbortController in the auto-recall hook and thread its AbortSignal through recallWork() into retrieval.
  • Extend retriever plumbing (RetrievalContext) to accept an optional signal and forward it to embedder.embedQuery(...).
  • Make retrieveWithRetry bail out after the retry delay if the signal was aborted.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
index.ts Creates/aborts an AbortController on auto-recall timeout; threads signal into recall retrieval and adds an early return in the retry gap.
src/retriever.ts Adds signal?: AbortSignal to RetrievalContext and forwards it into query embedding calls in vector/hybrid retrieval paths.
Comments suppressed due to low confidence (1)

index.ts:2712

  • This change introduces new timeout-cancellation behavior (aborting the in-flight embed/retrieval work) but there’s no automated test asserting that the hook returns promptly and the downstream embedder receives an aborted signal. Given the existing test suite, consider adding a unit/integration test that stubs a slow embedder call, advances timers to AUTO_RECALL_TIMEOUT_MS, and asserts (1) before_prompt_build returns undefined quickly and (2) the stub observed signal.aborted === true / rejected with AbortError.
                // Cancel in-flight embedding/retrieval HTTP calls so they don't
                // keep holding resources after we've given up on the result.
                abortController.abort(new Error("auto-recall timeout"));
                api.logger.warn(
                  `memory-lancedb-pro: auto-recall timed out after ${AUTO_RECALL_TIMEOUT_MS}ms; skipping memory injection to avoid stalling agent startup`,
                );
                resolve(undefined);
              }, AUTO_RECALL_TIMEOUT_MS);

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread index.ts Outdated
Comment on lines +2718 to +2725
// Downgrade AbortError to debug — the timeout path already emitted
// a warn log, and the race has already resolved undefined.
const isAbort = err instanceof Error && (err.name === "AbortError" || /abort/i.test(err.message));
if (isAbort) {
api.logger.debug?.(`memory-lancedb-pro: recall aborted: ${String(err)}`);
} else {
api.logger.warn(`memory-lancedb-pro: recall failed: ${String(err)}`);
}
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The abort classification in the catch block is too broad (err.name === "AbortError" || /abort/i.test(err.message)). This will also downgrade unrelated aborts (e.g., the embedder's own timeout AbortError) to debug, reducing visibility of real failures. Instead, key this specifically to this timeout controller (e.g., check abortController.signal.aborted, compare err to abortController.signal.reason, or track a timedOut flag set in the timeout handler) and keep other errors at warn.

Copilot uses AI. Check for mistakes.
Comment thread src/retriever.ts Outdated
Comment on lines +109 to +110
* the method rejects with AbortError instead of holding the caller's session
* lock while the underlying HTTP request runs to completion. */
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring for RetrievalContext.signal says the method rejects with AbortError, but when a caller passes an already-aborted signal, embedder.withTimeout() rejects with externalSignal.reason (which may be a plain Error, not an AbortError). Consider softening the wording to “rejects due to abort (often AbortError / abort reason)” to match actual behavior.

Suggested change
* the method rejects with AbortError instead of holding the caller's session
* lock while the underlying HTTP request runs to completion. */
* the method rejects due to abort (often with AbortError or the signal's
* abort reason) instead of holding the caller's session lock while the
* underlying HTTP request runs to completion. */

Copilot uses AI. Check for mistakes.
Comment thread index.ts
Comment on lines 1940 to 1952
async function retrieveWithRetry(params: {
query: string;
limit: number;
scopeFilter?: string[];
category?: string;
signal?: AbortSignal;
}) {
let results = await retriever.retrieve(params);
if (results.length === 0) {
await sleep(75);
if (params.signal?.aborted) return results;
results = await retriever.retrieve(params);
}
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

retrieveWithRetry's inline params type is missing the source field even though callers pass source: "auto-recall". This keeps the existing TS2353 error around and makes it easier for future call sites to drift. Prefer typing params as RetrievalContext (or Pick<RetrievalContext, "query"|"limit"|"scopeFilter"|"category"|"source"|"signal">) so source/signal are both supported and checked.

Copilot uses AI. Check for mistakes.
Address Copilot P2 feedback on PR CortexReach#668.

The catch block used `err.name === "AbortError" || /abort/i.test(err.message)`
to decide whether to downgrade recall errors to debug. That was too
broad: if the embedder's own internal timeout AbortController fired
(src/embedder.ts withTimeout), its AbortError would propagate up and
match the same pattern, silencing a real infrastructure failure.

Key the debug-downgrade to `abortController.signal.aborted` instead.
That flag is owned by this hook's own controller and is true only
when our timeout callback fired. Aborts from any other controller
(embedder, SDK client, upstream cancellation) now stay at warn.

Also tighten the `RetrievalContext.signal` docstring: the embedder's
withTimeout rejects with `externalSignal.reason ?? new Error("aborted")`
when the signal is already aborted, which can be a plain Error rather
than an AbortError. "rejects due to abort (often with AbortError or
the signal's abort reason)" matches observed behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ScientificProgrammer
Copy link
Copy Markdown
Contributor Author

@copilot-pull-request-reviewer — feedback on 6119d11 addressed in commit 373e7f2:

1. Abort classification narrowed (index.ts) — replaced the overbroad err.name === "AbortError" || /abort/i.test(err.message) with abortController.signal.aborted. The flag is owned by this hook's own controller, so only our timeout callback sets it. AbortErrors originating elsewhere (e.g. the embedder's internal withTimeout controller in src/embedder.ts:746) no longer get silenced to debug.

2. Retriever docstring tightened (src/retriever.ts) — updated to "rejects due to abort (often with AbortError or the signal's abort reason)" to match the actual rejection path: when an already-aborted signal is passed, withTimeout rejects with externalSignal.reason ?? new Error("aborted") (a plain Error in our timeout case), not necessarily an AbortError.

3. retrieveWithRetry params typing (index.ts:1952) — left as-is in this commit. That's a pre-existing TS2353 introduced by PR #535, unrelated to this change; happy to take it in a separate focused PR if that's preferred.

Follow-up noted for a later PR (out of scope here, not a regression): the recallWork(abortController.signal).then(...) leg has no .catch(). If recallWork rejects after the race has already resolved with undefined (timeout path), the rejection is silently dropped. In practice this is rare — once the signal aborts, downstream code exits quickly — but an explicit .catch() on that leg would eliminate the unhandled-rejection window entirely.

Copy link
Copy Markdown

Thanks for tackling this. The problem is real and worth fixing, and the overall direction here looks right.

I’m still at request changes for one blocker before merge:

  • When the timeout wins the Promise.race, the in-flight recallWork(abortController.signal) can still reject after the caller has already moved on. In the current shape, that rejection is not handled explicitly, so we risk trading the stuck-mutex problem for an unhandled promise rejection on the timeout path.

What I’d like to see before merge:

  • Handle the post-timeout rejection path explicitly, so an abort-triggered failure is consumed intentionally rather than escaping as an unhandled rejection.
  • Add a focused test for the timeout -> abort -> cancellation path. Right now the risky path introduced by this change is not covered directly.

Non-blocking notes:

  • I’m not treating the current full-suite failures in recall-text-cleanup.test.mjs as a blocker for this PR by themselves, since they look environment-coupled from the review artifacts.
  • If there is an easy type/build check available for the AbortSignal plumbing, that would help confidence.

Once the timeout rejection path is handled, this looks close.

Copy link
Copy Markdown

@app3apps app3apps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes before merge.

The direction here looks right and the problem is worth fixing, but there is still one blocker on the timeout path:

  • When the timeout wins the Promise.race, the in-flight recallWork(abortController.signal) can still reject after the caller has already moved on. In the current shape, that rejection is not handled explicitly, so we risk turning the stuck-mutex issue into an unhandled promise rejection on the abort path.

What I’d like before merge:

  • Handle the post-timeout rejection path explicitly so an abort-triggered failure is consumed intentionally rather than escaping as an unhandled rejection.
  • Add a focused test for the timeout -> abort -> cancellation path.

Once that is covered, this looks close.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants