fix(auto-recall): thread AbortSignal to cancel embedding on timeout by ScientificProgrammer · Pull Request #668 · CortexReach/memory-lancedb-pro

ScientificProgrammer · 2026-04-19T16:15:54Z

Summary

The auto-recall before_prompt_build hook wraps recallWork() in Promise.race with a setTimeout resolver (index.ts:~2690). When the timer fires, the race resolves undefined — but the underlying embedding HTTP call keeps running. No AbortController/AbortSignal is threaded through. The lingering work holds the per-agent memory-runtime mutex, so subsequent hook invocations serialize behind it.

In production at the reporter's site, this manifested as stuck session: state=processing age=219s queueDepth=0 with the gateway's session-write-lock watchdog firing repeatedly. The gateway's lock max-hold is timeoutMs + 120s grace (min 300s), so a single slow embedding call ends up holding the session lock for 2–5 minutes before forced release.

Changes

index.ts — create an AbortController in the hook; pass controller.signal into a new recallWork(signal) parameter; in the setTimeout callback, call controller.abort(new Error("auto-recall timeout")) before resolving. The catch branch downgrades AbortError to a debug log (the timeout path already emitted a warn). retrieveWithRetry now accepts an optional signal and early-returns after the 75 ms retry gap if the signal was aborted in the interim.
src/retriever.ts — add signal?: AbortSignal to RetrievalContext; destructure in retrieve(); forward to vectorOnlyRetrieval() and hybridRetrieval(), which pass it to this.embedder.embedQuery(query, signal). The embedder already accepts an AbortSignal throughout (src/embedder.ts:794,798,806,810,865), so no downstream changes are needed.

Scope

In scope: auto-recall hook → retriever → embedder cancellation.
Out of scope (follow-ups):
- Jina rerank HTTP calls (separate code path in the retriever's rerank stage). Should also accept the signal. Left for a focused follow-up that touches the rerank client interface.
- BM25 path has no long-running HTTP call and doesn't invoke embedQuery, so no plumbing needed.

Test plan

Verified locally:

tsc --noEmit introduces no new type errors. Two pre-existing errors remain (notably TS2353 on retrieveWithRetry where source is passed but not declared in the inline param type — introduced by feat(memory): add per-agent auto-recall control #535, independent of this PR).
The previously-deployed (stale) plugin exhibited the stuck-session symptom under slow-embedding conditions; after rolling to a build with this change applied, a contrived delayed-embedding scenario returns from the hook inside the configured autoRecallTimeoutMs and the in-flight embedding fetch sees signal.aborted === true.

Automated test not added in this PR — suggested follow-up: a unit test that stubs the embedder with a 30 s delay and asserts recallWork returns within 100 ms of the timeout AND the stub receives an aborted signal.

RCA: stuck-session incident on 2026-04-18 (proving-ground agent, OpenClaw 2026.4.9) directly implicated this code path.
Related upstream fixes already merged: fix: remove duplicate memory_compact registration + redundant schema required (OpenClaw 2026.4.5 compat) #541 (duplicate memory_compact registration), fix: resolve stale lock and same-instance write contention (#622, #623) #626 (stale lockfile). Those address adjacent failure modes but not the Promise.race-without-cancellation root cause fixed here.

🤖 Developed with Claude Code (Claude Opus 4.7, 1M context).

Problem: the Promise.race-based timeout in the auto-recall before_prompt_build hook resolves the outer promise with undefined after AUTO_RECALL_TIMEOUT_MS, but it does not cancel the underlying embedding HTTP call. The underlying work keeps running, holding the per-agent memory-runtime mutex (and lancedb connection), so the next hook invocation serializes behind it. Combined with the gateway's session-write-lock max-hold of (timeoutMs + 120s grace, min 300s), a single slow embedding call can hold the session lock for several minutes before the watchdog force-releases — seen in production as "stuck session: state=processing age=219s queueDepth=0". Fix: - Create an AbortController in the hook and pass its signal into recallWork(signal). - recallWork threads the signal through retrieveWithRetry() into Retriever.retrieve() via a new RetrievalContext.signal field. - Retriever.retrieve() forwards the signal to the embedQuery() calls in both vectorOnlyRetrieval() and hybridRetrieval(). The embedder already accepts AbortSignal throughout (src/embedder.ts), so no downstream changes were needed. - When the setTimeout fires, the hook calls controller.abort() before resolving undefined. In-flight embedding fetch() requests terminate promptly; retriever.retrieve() rejects with AbortError which the retryWithRetry early-return handles (aborts skip the 75ms retry). - The catch branch downgrades AbortError to a debug log so we don't double-log the timeout warning the setTimeout path already emitted. Out of scope (follow-up): - Jina rerank HTTP calls (a separate code path in retriever.ts) should also accept the signal. Left for a focused follow-up since it touches the rerank client interface. - BM25 path does not call embedQuery and has no long-running HTTP call, so it does not need signal plumbing. Note: tsc --noEmit reports a pre-existing TS2353 error on retrieveWithRetry where callers pass `source: "auto-recall"` but the function's inline type did not include `source` (introduced by CortexReach#535, not this PR). Non-blocking — tsx runtime transpile does not check types. Can be fixed with a one-line addition to the param type. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR fixes an operational issue in the auto-recall before_prompt_build hook where a timeout via Promise.race returned early but left the underlying embedding HTTP request running, holding locks and causing subsequent sessions/hooks to stall.

Changes:

Add an AbortController in the auto-recall hook and thread its AbortSignal through recallWork() into retrieval.
Extend retriever plumbing (RetrievalContext) to accept an optional signal and forward it to embedder.embedQuery(...).
Make retrieveWithRetry bail out after the retry delay if the signal was aborted.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
`index.ts`	Creates/aborts an `AbortController` on auto-recall timeout; threads `signal` into recall retrieval and adds an early return in the retry gap.
`src/retriever.ts`	Adds `signal?: AbortSignal` to `RetrievalContext` and forwards it into query embedding calls in vector/hybrid retrieval paths.

Comments suppressed due to low confidence (1)

index.ts:2712

This change introduces new timeout-cancellation behavior (aborting the in-flight embed/retrieval work) but there’s no automated test asserting that the hook returns promptly and the downstream embedder receives an aborted signal. Given the existing test suite, consider adding a unit/integration test that stubs a slow embedder call, advances timers to AUTO_RECALL_TIMEOUT_MS, and asserts (1) before_prompt_build returns undefined quickly and (2) the stub observed signal.aborted === true / rejected with AbortError.

                // Cancel in-flight embedding/retrieval HTTP calls so they don't
                // keep holding resources after we've given up on the result.
                abortController.abort(new Error("auto-recall timeout"));
                api.logger.warn(
                  `memory-lancedb-pro: auto-recall timed out after ${AUTO_RECALL_TIMEOUT_MS}ms; skipping memory injection to avoid stalling agent startup`,
                );
                resolve(undefined);
              }, AUTO_RECALL_TIMEOUT_MS);

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-19T16:19:20Z

+          // Downgrade AbortError to debug — the timeout path already emitted
+          // a warn log, and the race has already resolved undefined.
+          const isAbort = err instanceof Error && (err.name === "AbortError" || /abort/i.test(err.message));
+          if (isAbort) {
+            api.logger.debug?.(`memory-lancedb-pro: recall aborted: ${String(err)}`);
+          } else {
+            api.logger.warn(`memory-lancedb-pro: recall failed: ${String(err)}`);
+          }


The abort classification in the catch block is too broad (err.name === "AbortError" || /abort/i.test(err.message)). This will also downgrade unrelated aborts (e.g., the embedder's own timeout AbortError) to debug, reducing visibility of real failures. Instead, key this specifically to this timeout controller (e.g., check abortController.signal.aborted, compare err to abortController.signal.reason, or track a timedOut flag set in the timeout handler) and keep other errors at warn.

Copilot · 2026-04-19T16:19:21Z

+   *  the method rejects with AbortError instead of holding the caller's session
+   *  lock while the underlying HTTP request runs to completion. */


The docstring for RetrievalContext.signal says the method rejects with AbortError, but when a caller passes an already-aborted signal, embedder.withTimeout() rejects with externalSignal.reason (which may be a plain Error, not an AbortError). Consider softening the wording to “rejects due to abort (often AbortError / abort reason)” to match actual behavior.

Suggested change

* the method rejects with AbortError instead of holding the caller's session

* lock while the underlying HTTP request runs to completion. */

* the method rejects due to abort (often with AbortError or the signal's

* abort reason) instead of holding the caller's session lock while the

* underlying HTTP request runs to completion. */

Copilot · 2026-04-19T16:19:21Z

    async function retrieveWithRetry(params: {
      query: string;
      limit: number;
      scopeFilter?: string[];
      category?: string;
+      signal?: AbortSignal;
    }) {
      let results = await retriever.retrieve(params);
      if (results.length === 0) {
        await sleep(75);
+        if (params.signal?.aborted) return results;
        results = await retriever.retrieve(params);
      }


retrieveWithRetry's inline params type is missing the source field even though callers pass source: "auto-recall". This keeps the existing TS2353 error around and makes it easier for future call sites to drift. Prefer typing params as RetrievalContext (or Pick<RetrievalContext, "query"|"limit"|"scopeFilter"|"category"|"source"|"signal">) so source/signal are both supported and checked.

Address Copilot P2 feedback on PR CortexReach#668. The catch block used `err.name === "AbortError" || /abort/i.test(err.message)` to decide whether to downgrade recall errors to debug. That was too broad: if the embedder's own internal timeout AbortController fired (src/embedder.ts withTimeout), its AbortError would propagate up and match the same pattern, silencing a real infrastructure failure. Key the debug-downgrade to `abortController.signal.aborted` instead. That flag is owned by this hook's own controller and is true only when our timeout callback fired. Aborts from any other controller (embedder, SDK client, upstream cancellation) now stay at warn. Also tighten the `RetrievalContext.signal` docstring: the embedder's withTimeout rejects with `externalSignal.reason ?? new Error("aborted")` when the signal is already aborted, which can be a plain Error rather than an AbortError. "rejects due to abort (often with AbortError or the signal's abort reason)" matches observed behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ScientificProgrammer · 2026-04-20T06:44:44Z

@copilot-pull-request-reviewer — feedback on 6119d11 addressed in commit 373e7f2:

1. Abort classification narrowed (index.ts) — replaced the overbroad err.name === "AbortError" || /abort/i.test(err.message) with abortController.signal.aborted. The flag is owned by this hook's own controller, so only our timeout callback sets it. AbortErrors originating elsewhere (e.g. the embedder's internal withTimeout controller in src/embedder.ts:746) no longer get silenced to debug.

2. Retriever docstring tightened (src/retriever.ts) — updated to "rejects due to abort (often with AbortError or the signal's abort reason)" to match the actual rejection path: when an already-aborted signal is passed, withTimeout rejects with externalSignal.reason ?? new Error("aborted") (a plain Error in our timeout case), not necessarily an AbortError.

3. retrieveWithRetry params typing (index.ts:1952) — left as-is in this commit. That's a pre-existing TS2353 introduced by PR #535, unrelated to this change; happy to take it in a separate focused PR if that's preferred.

Follow-up noted for a later PR (out of scope here, not a regression): the recallWork(abortController.signal).then(...) leg has no .catch(). If recallWork rejects after the race has already resolved with undefined (timeout path), the rejection is silently dropped. In practice this is rare — once the signal aborts, downstream code exits quickly — but an explicit .catch() on that leg would eliminate the unhandled-rejection window entirely.

app3apps · 2026-04-20T11:08:24Z

Thanks for tackling this. The problem is real and worth fixing, and the overall direction here looks right.

I’m still at request changes for one blocker before merge:

When the timeout wins the Promise.race, the in-flight recallWork(abortController.signal) can still reject after the caller has already moved on. In the current shape, that rejection is not handled explicitly, so we risk trading the stuck-mutex problem for an unhandled promise rejection on the timeout path.

What I’d like to see before merge:

Handle the post-timeout rejection path explicitly, so an abort-triggered failure is consumed intentionally rather than escaping as an unhandled rejection.
Add a focused test for the timeout -> abort -> cancellation path. Right now the risky path introduced by this change is not covered directly.

Non-blocking notes:

I’m not treating the current full-suite failures in recall-text-cleanup.test.mjs as a blocker for this PR by themselves, since they look environment-coupled from the review artifacts.
If there is an easy type/build check available for the AbortSignal plumbing, that would help confidence.

Once the timeout rejection path is handled, this looks close.

app3apps

Requesting changes before merge.

The direction here looks right and the problem is worth fixing, but there is still one blocker on the timeout path:

When the timeout wins the Promise.race, the in-flight recallWork(abortController.signal) can still reject after the caller has already moved on. In the current shape, that rejection is not handled explicitly, so we risk turning the stuck-mutex issue into an unhandled promise rejection on the abort path.

What I’d like before merge:

Handle the post-timeout rejection path explicitly so an abort-triggered failure is consumed intentionally rather than escaping as an unhandled rejection.
Add a focused test for the timeout -> abort -> cancellation path.

Once that is covered, this looks close.

Copilot AI review requested due to automatic review settings April 19, 2026 16:15

Copilot started reviewing on behalf of ScientificProgrammer April 19, 2026 16:16 View session

Copilot AI reviewed Apr 19, 2026

View reviewed changes

app3apps suggested changes Apr 20, 2026

View reviewed changes

jlin53882 mentioned this pull request Apr 24, 2026

[BUG] auto-recall timeout 在首次啟動或多進程 session 時容易發生 #643

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(auto-recall): thread AbortSignal to cancel embedding on timeout#668

fix(auto-recall): thread AbortSignal to cancel embedding on timeout#668
ScientificProgrammer wants to merge 2 commits intoCortexReach:masterfrom
ScientificProgrammer:fix/auto-recall-abort-signal

ScientificProgrammer commented Apr 19, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 19, 2026

Uh oh!

Copilot AI Apr 19, 2026

Uh oh!

Copilot AI Apr 19, 2026

Uh oh!

ScientificProgrammer commented Apr 20, 2026

Uh oh!

app3apps commented Apr 20, 2026

Uh oh!

app3apps left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		* the method rejects with AbortError instead of holding the caller's session
		* lock while the underlying HTTP request runs to completion. */

-   *  the method rejects with AbortError instead of holding the caller's session
-   *  lock while the underlying HTTP request runs to completion. */
+   *  the method rejects due to abort (often with AbortError or the signal's
+   *  abort reason) instead of holding the caller's session lock while the
+   *  underlying HTTP request runs to completion. */

Conversation

ScientificProgrammer commented Apr 19, 2026

Summary

Changes

Scope

Test plan

Related

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

Uh oh!

ScientificProgrammer commented Apr 20, 2026

Uh oh!

app3apps commented Apr 20, 2026

Uh oh!

app3apps left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants