Skip to content

fix(sdk): keep cancelled token observable after arun() interrupt#3395

Open
VascoSch92 wants to merge 1 commit into
mainfrom
fix/cancel-token-after-interrupt
Open

fix(sdk): keep cancelled token observable after arun() interrupt#3395
VascoSch92 wants to merge 1 commit into
mainfrom
fix/cancel-token-after-interrupt

Conversation

@VascoSch92
Copy link
Copy Markdown
Member

@VascoSch92 VascoSch92 commented May 27, 2026

  • A human has tested these changes.

Why

Addresses B3 of #3341.

arun()'s finally cleared self._cancel_token unconditionally. On interrupt(), the in-flight asyncio.Task is cancelled (breaking the await in astep), but tool calls dispatched via run_in_executor run in worker threads that can't be force-cancelled and may outlive arun(). A tool polling conversation.cancel_token — the documented cooperative-cancellation pattern — after the finally ran saw None and lost the cancellation signal, the opposite of what interrupt() intended.

What

  • Clear _cancel_token in arun()'s finally only when it was not cancelled. A cancelled token stays observable; the next run()/arun() creates a fresh one, so leaving it is safe.
  • Updated test_interrupt_sets_cancel_token (it encoded the old behaviour) and added test_cancel_token_stays_observable_after_interrupt.

The sync run() path is not affected: interrupt() falls back to pause() there and step() runs inline, so the finally only fires after tools return.


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22-slim Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:0998a62-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-0998a62-python \
  ghcr.io/openhands/agent-server:0998a62-python

All tags pushed for this build

ghcr.io/openhands/agent-server:0998a62-golang-amd64
ghcr.io/openhands/agent-server:0998a622efe78cb24543db09a251909f237d724e-golang-amd64
ghcr.io/openhands/agent-server:fix-cancel-token-after-interrupt-golang-amd64
ghcr.io/openhands/agent-server:0998a62-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:0998a62-golang-arm64
ghcr.io/openhands/agent-server:0998a622efe78cb24543db09a251909f237d724e-golang-arm64
ghcr.io/openhands/agent-server:fix-cancel-token-after-interrupt-golang-arm64
ghcr.io/openhands/agent-server:0998a62-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:0998a62-java-amd64
ghcr.io/openhands/agent-server:0998a622efe78cb24543db09a251909f237d724e-java-amd64
ghcr.io/openhands/agent-server:fix-cancel-token-after-interrupt-java-amd64
ghcr.io/openhands/agent-server:0998a62-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:0998a62-java-arm64
ghcr.io/openhands/agent-server:0998a622efe78cb24543db09a251909f237d724e-java-arm64
ghcr.io/openhands/agent-server:fix-cancel-token-after-interrupt-java-arm64
ghcr.io/openhands/agent-server:0998a62-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:0998a62-python-amd64
ghcr.io/openhands/agent-server:0998a622efe78cb24543db09a251909f237d724e-python-amd64
ghcr.io/openhands/agent-server:fix-cancel-token-after-interrupt-python-amd64
ghcr.io/openhands/agent-server:0998a62-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:0998a62-python-arm64
ghcr.io/openhands/agent-server:0998a622efe78cb24543db09a251909f237d724e-python-arm64
ghcr.io/openhands/agent-server:fix-cancel-token-after-interrupt-python-arm64
ghcr.io/openhands/agent-server:0998a62-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:0998a62-golang
ghcr.io/openhands/agent-server:0998a622efe78cb24543db09a251909f237d724e-golang
ghcr.io/openhands/agent-server:fix-cancel-token-after-interrupt-golang
ghcr.io/openhands/agent-server:0998a62-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:0998a62-java
ghcr.io/openhands/agent-server:0998a622efe78cb24543db09a251909f237d724e-java
ghcr.io/openhands/agent-server:fix-cancel-token-after-interrupt-java
ghcr.io/openhands/agent-server:0998a62-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:0998a62-python
ghcr.io/openhands/agent-server:0998a622efe78cb24543db09a251909f237d724e-python
ghcr.io/openhands/agent-server:fix-cancel-token-after-interrupt-python
ghcr.io/openhands/agent-server:0998a62-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

  • Each variant tag (e.g., 0998a62-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 0998a62-python-amd64) are also available if needed

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 27, 2026

Python API breakage checks — ✅ PASSED

Result:PASSED

Action log

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 27, 2026

REST API breakage checks (OpenAPI) — ✅ PASSED

Result:PASSED

Action log

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taste Rating: 🟢 Good taste — the fix is small and preserves the intended cooperative-cancellation contract. I didn’t find any blocking code issues in the diff; focused interrupt tests pass locally (uv run pytest tests/sdk/conversation/test_interrupt.py -q).

Because this touches agent interruption/tool execution behavior, I’m leaving a COMMENT rather than approval per repo eval-risk guidance; a human maintainer should decide whether lightweight eval coverage is needed.

Automated review generated by OpenHands on behalf of the requester.

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM — small localized cancellation lifecycle change, but it affects agent run/interruption semantics and tool execution observability.

Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26497199123

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 27, 2026

Coverage

Coverage Report •
FileStmtsMissCoverMissing
openhands-sdk/openhands/sdk/conversation/impl
   local_conversation.py5324491%308, 313, 457, 503, 540, 556, 621, 846–847, 850, 963, 974–977, 984–985, 988, 994–995, 998, 1004, 1019, 1022, 1026–1027, 1031–1033, 1040, 1130, 1135, 1245, 1247, 1251–1252, 1263–1264, 1289, 1484, 1488, 1558, 1565–1566
TOTAL27846817970% 

@VascoSch92 VascoSch92 requested a review from malhotra5 May 27, 2026 07:30
arun()'s finally cleared self._cancel_token unconditionally. Tool calls
dispatched via run_in_executor run in worker threads that can't be
force-cancelled and may outlive arun(), so a tool polling
conversation.cancel_token after an interrupt saw None and lost the
cancellation signal. Retain a cancelled token; the next run() replaces it.
Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ QA Report: PASS

Verified the SDK async interrupt flow with a real Conversation.arun() call: the PR keeps the cancelled token observable after interrupt and replaces it with a fresh token on the next run.

Does this PR achieve its stated goal?

Yes. The PR's goal was to prevent arun() cleanup from clearing a cancelled token before late-running worker/thread code can observe cooperative cancellation. On origin/main, the same SDK script showed after_interrupt_token_present=False and late_worker_token_present=False; on the PR commit, both became True with is_cancelled=True, and the next run received a fresh uncancelled token.

Phase Result
Environment Setup make build completed successfully; dependencies installed and build marked ready.
CI Status 🟡 gh pr checks reported 0 failing, 19 successful, 7 pending, 6 skipped at review time.
Functional Verification ✅ Reproduced the base bug and verified the PR fixes token observability plus fresh-token behavior.
Functional Verification

Test 1: Interrupted async run keeps cancellation observable for late worker-style polling

Step 1 — Reproduce / establish baseline (without the fix):
Checked out base origin/main (58771eeb) and ran a standalone SDK script that creates an Agent with a hanging async LLM, starts Conversation.arun(), calls conversation.interrupt(), then has a separate worker-style thread poll conversation.cancel_token after arun() has unwound.

Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_cancel_token.py 2>&1 | grep -E '^(during|after|late|next)':

58771eeb (HEAD, origin/main) fix(sdk): dedupe LLMs by usage_id in _ensure_agent_ready registration loop (#3392)
during_run_token_present=True
during_run_token_cancelled=False
after_interrupt_token_present=False
after_interrupt_token_cancelled=None
late_worker_token_present=False
late_worker_token_cancelled=None
next_run_token_present=True
next_run_token_cancelled=False
next_run_token_is_same_as_retained=False

This confirms the bug exists on base: while the token exists during the run, after interrupt() and arun() cleanup a late poll sees None, so cooperative cancellation is no longer observable.

Step 2 — Apply the PR's changes:
Checked out fix/cancel-token-after-interrupt at d495eeeaab6631dbe182af8e8fbfa1e96f7965f7.

Step 3 — Re-run with the fix in place:
Ran the same command on the PR branch:

d495eeea (HEAD -> fix/cancel-token-after-interrupt, origin/fix/cancel-token-after-interrupt) fix(sdk): keep cancelled token observable after arun() interrupt
during_run_token_present=True
during_run_token_cancelled=False
after_interrupt_token_present=True
after_interrupt_token_cancelled=True
late_worker_token_present=True
late_worker_token_cancelled=True
next_run_token_present=True
next_run_token_cancelled=False
next_run_token_is_same_as_retained=False

This confirms the fix works: after interrupt, both the direct public conversation.cancel_token poll and the late worker-style poll observe the retained cancelled token. The follow-up run also creates a fresh uncancelled token, matching the PR's safety claim.

Issues Found

None.

This review was created by an AI agent (OpenHands) on behalf of the user.

Comment on lines +1073 to +1077
# A cancelled token must stay observable: interrupted tool calls run
# in worker threads that can outlive arun() and still poll it. A
# fresh token is created on the next run().
if self._cancel_token is not None and not self._cancel_token.is_cancelled:
self._cancel_token = None
Copy link
Copy Markdown
Member

@malhotra5 malhotra5 May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should pass a cancel token to each tool call. If arun can be called without a tool call finishing, interrupts should stop everything that's currently in flight

wdyt?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants