Release v1.24.0 by all-hands-bot · Pull Request #3401 · OpenHands/software-agent-sdk

all-hands-bot · 2026-05-27T11:25:09Z

Release v1.24.0

This PR prepares the release for version 1.24.0.

Release Checklist

Version set to 1.24.0
Fix any deprecation deadlines if they exist
Integration tests pass (tagged with integration-test)
Behavior tests pass (tagged with behavior-test)
Example tests pass (tagged with test-examples)
Evaluation on OpenHands Index
Confirm any release-note-required PRs are accurately called out in the final release notes

What happens on merge

When this PR is merged, the create-release.yml workflow will automatically:

Create a GitHub release with tag v1.24.0 and auto-generated notes, plus an explicit preamble for merged release-note-required PRs
Trigger pypi-release.yml to publish all packages to PyPI
Trigger version-bump-prs.yml to create downstream version bump PRs

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.13-nodejs22-slim`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:17cb597-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-17cb597-python \
  ghcr.io/openhands/agent-server:17cb597-python

All tags pushed for this build

ghcr.io/openhands/agent-server:17cb597-golang-amd64
ghcr.io/openhands/agent-server:17cb597e0db541d110c10dbcde7aeab3ffa77c22-golang-amd64
ghcr.io/openhands/agent-server:rel-1.24.0-golang-amd64
ghcr.io/openhands/agent-server:17cb597-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:17cb597-golang-arm64
ghcr.io/openhands/agent-server:17cb597e0db541d110c10dbcde7aeab3ffa77c22-golang-arm64
ghcr.io/openhands/agent-server:rel-1.24.0-golang-arm64
ghcr.io/openhands/agent-server:17cb597-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:17cb597-java-amd64
ghcr.io/openhands/agent-server:17cb597e0db541d110c10dbcde7aeab3ffa77c22-java-amd64
ghcr.io/openhands/agent-server:rel-1.24.0-java-amd64
ghcr.io/openhands/agent-server:17cb597-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:17cb597-java-arm64
ghcr.io/openhands/agent-server:17cb597e0db541d110c10dbcde7aeab3ffa77c22-java-arm64
ghcr.io/openhands/agent-server:rel-1.24.0-java-arm64
ghcr.io/openhands/agent-server:17cb597-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:17cb597-python-amd64
ghcr.io/openhands/agent-server:17cb597e0db541d110c10dbcde7aeab3ffa77c22-python-amd64
ghcr.io/openhands/agent-server:rel-1.24.0-python-amd64
ghcr.io/openhands/agent-server:17cb597-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:17cb597-python-arm64
ghcr.io/openhands/agent-server:17cb597e0db541d110c10dbcde7aeab3ffa77c22-python-arm64
ghcr.io/openhands/agent-server:rel-1.24.0-python-arm64
ghcr.io/openhands/agent-server:17cb597-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:17cb597-golang
ghcr.io/openhands/agent-server:17cb597e0db541d110c10dbcde7aeab3ffa77c22-golang
ghcr.io/openhands/agent-server:rel-1.24.0-golang
ghcr.io/openhands/agent-server:17cb597-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:17cb597-java
ghcr.io/openhands/agent-server:17cb597e0db541d110c10dbcde7aeab3ffa77c22-java
ghcr.io/openhands/agent-server:rel-1.24.0-java
ghcr.io/openhands/agent-server:17cb597-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:17cb597-python
ghcr.io/openhands/agent-server:17cb597e0db541d110c10dbcde7aeab3ffa77c22-python
ghcr.io/openhands/agent-server:rel-1.24.0-python
ghcr.io/openhands/agent-server:17cb597-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

Each variant tag (e.g., 17cb597-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., 17cb597-python-amd64) are also available if needed
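The tag list above appears to follow a regular pattern: four stems per variant (short SHA, full SHA, release version, and a slug of the base image), each pushed once per architecture and once as a multi-arch manifest. A minimal Python sketch of that pattern follows. The function names and the slug rule (`/` becomes `_s_`, `:` becomes `_tag_`) are inferred from the listed tags and are assumptions, not taken from the build workflow.

```python
REPO = "ghcr.io/openhands/agent-server"


def slugify_base_image(image: str) -> str:
    # Inferred from the listed tags: "/" -> "_s_" and ":" -> "_tag_".
    return image.replace("/", "_s_").replace(":", "_tag_")


def build_tags(short_sha, full_sha, version, variant, base_image,
               archs=("amd64", "arm64")):
    """Return per-architecture tags followed by multi-arch manifest tags."""
    stems = [
        f"{short_sha}-{variant}",
        f"{full_sha}-{variant}",
        f"rel-{version}-{variant}",
        f"{short_sha}-{slugify_base_image(base_image)}",
    ]
    per_arch = [f"{REPO}:{stem}-{arch}" for arch in archs for stem in stems]
    multi_arch = [f"{REPO}:{stem}" for stem in stems]
    return per_arch + multi_arch


tags = build_tags(
    "17cb597",
    "17cb597e0db541d110c10dbcde7aeab3ffa77c22",
    "1.24.0",
    "python",
    "nikolaik/python-nodejs:python3.13-nodejs22-slim",
)
for tag in tags:
    print(tag)
```

With three variants this yields the 36 tags listed for the build.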

github-actions · 2026-05-27T11:25:18Z

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

github-actions · 2026-05-27T11:25:18Z

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

github-actions · 2026-05-27T11:25:38Z

Python API breakage checks — ✅ PASSED

Result: ✅ PASSED


github-actions · 2026-05-27T11:25:45Z

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: ✅ PASSED


all-hands-bot

🟡 Acceptable release bump, but I can’t approve this release PR yet.

Issues to address before approval:

Deprecation deadlines is failing for 1.24.0: register_tool(callable_factory) and the openhands.sdk.settings import shim have reached their removed in: 1.24.0 deadline.
Required release validation is not complete/current yet: Run tests still has in-progress jobs and no coverage report comment, Run Examples Scripts is still in progress/no result comment, and Run Integration Tests is still in progress/no final results comment. Please wait for these to pass on 19e34c7a376eb21734385de2074b629061313c00, then have a human maintainer review.

Risk: 🟡 Medium — release publication should not proceed with failed deprecation cleanup and incomplete release validation.

This review was created by an AI agent (OpenHands) on behalf of the user.

Workflow run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26508259686

github-actions · 2026-05-27T11:29:34Z

🧪 Integration Tests Results

Overall Success Rate: 100.0%
Total Cost: $1.04
Models Tested: 4
Timestamp: 2026-05-27 11:29:25 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

litellm_proxy_minimax_MiniMax_M2.7: 📥 View & Download Logs
litellm_proxy_gemini_3.1_pro_preview: 📥 View & Download Logs
litellm_proxy_deepseek_deepseek_v4_flash: 📥 View & Download Logs
litellm_proxy_openai_gpt_5.5: 📥 View & Download Logs

📊 Summary

Model	Overall	Tests Passed	Skipped	Total	Cost	Tokens
litellm_proxy_minimax_MiniMax_M2.7	100.0%	8/8	1	9	$0.00	370,316
litellm_proxy_gemini_3.1_pro_preview	100.0%	9/9	0	9	$0.16	321,838
litellm_proxy_deepseek_deepseek_v4_flash	100.0%	8/8	1	9	$0.00	447,556
litellm_proxy_openai_gpt_5.5	100.0%	9/9	0	9	$0.88	291,083

📋 Detailed Results

litellm_proxy_minimax_MiniMax_M2.7

Success Rate: 100.0% (8/8)
Total Cost: $0.00
Token Usage: prompt: 364,766, completion: 5,550, cache_read: 195,783, reasoning: 34
Run Suffix: litellm_proxy_minimax_MiniMax_M2.7_19e34c7_minimax_m2_7_run_N9_20260527_112652
Skipped Tests: 1

Skipped Tests:

t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3.1_pro_preview

Success Rate: 100.0% (9/9)
Total Cost: $0.16
Token Usage: prompt: 317,473, completion: 4,365, cache_read: 293,199, reasoning: 2,684
Run Suffix: litellm_proxy_gemini_3.1_pro_preview_19e34c7_gemini_3_1_pro_run_N9_20260527_112657

litellm_proxy_deepseek_deepseek_v4_flash

Success Rate: 100.0% (8/8)
Total Cost: $0.00
Token Usage: prompt: 442,309, completion: 5,247, cache_read: 395,520, reasoning: 1,127
Run Suffix: litellm_proxy_deepseek_deepseek_v4_flash_19e34c7_deepseek_v4_flash_run_N9_20260527_112656
Skipped Tests: 1

Skipped Tests:

t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_openai_gpt_5.5

Success Rate: 100.0% (9/9)
Total Cost: $0.88
Token Usage: prompt: 286,233, completion: 4,850, cache_read: 154,624, reasoning: 1,784
Run Suffix: litellm_proxy_openai_gpt_5.5_19e34c7_gpt_5_5_run_N9_20260527_112701

all-hands-bot

⚠️ QA Report: PASS WITH ISSUES

Release-version behavior works in real execution, but the PR is not fully release-ready because the deprecation-deadline gate is failing and several release checks are still pending.

Does this PR achieve its stated goal?

Partially. I verified the user-visible release outputs: package metadata/imports, built wheels, the running agent-server /server_info endpoint, and the Run Eval workflow default all moved from 1.23.1/v1.23.1 on main to 1.24.0/v1.24.0 on this PR. However, the stated release checklist includes deprecation-deadline cleanup, and CI currently fails that gate for two SDK deprecations with removed in: 1.24.0, so I would not consider the release fully prepared yet.

Phase	Result
Environment Setup	✅ `make build` completed and installed editable packages at `1.24.0`
CI Status	⚠️ 1 failing check (`Deprecation deadlines`) and pending release checks (`Run tests`, `Run Examples Scripts`, `Run Integration Tests`) at review time
Functional Verification	✅ Actual imports, server endpoint, workflow config parse, and distribution build all reported `1.24.0`

Functional Verification

Test 1: Installed package metadata and imports

Step 1 — Establish baseline on main:
Ran uv run --project /tmp/qa-sdk-main-baseline python /tmp/qa_versions.py:

openhands-sdk=1.23.1
openhands-tools=1.23.1
openhands-workspace=1.23.1
openhands-agent-server=1.23.1
sdk_import=ok:Agent
tools_import=ok:file_editor

This shows the pre-release baseline exposes 1.23.1 package metadata while core SDK/tools imports work.

Step 2 — Apply the PR's changes:
Checked out rel-1.24.0 at 19e34c7a376eb21734385de2074b629061313c00 and ran the repository setup with make build.

Step 3 — Re-run with the PR in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_versions.py:

openhands-sdk=1.24.0
openhands-tools=1.24.0
openhands-workspace=1.24.0
openhands-agent-server=1.24.0
sdk_import=ok:Agent
tools_import=ok:file_editor

This confirms the installed packages a Python user imports now expose 1.24.0 and still import successfully.
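The contents of /tmp/qa_versions.py are not shown in this report. A hypothetical reconstruction of the metadata part, assuming it reads installed distribution versions through the standard library, might look like this (the function name and the "not-installed" fallback are illustrative):

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = [
    "openhands-sdk",
    "openhands-tools",
    "openhands-workspace",
    "openhands-agent-server",
]


def report_versions(packages):
    """Map each distribution name to its installed version string."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            # Keep the report complete even when a package is absent.
            report[name] = "not-installed"
    return report


for name, ver in report_versions(PACKAGES).items():
    print(f"{name}={ver}")
```

Run inside the PR checkout after `make build`, a script along these lines would be expected to print the four `1.24.0` lines shown above; the import checks (`sdk_import`, `tools_import`) are left out of the sketch.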

Test 2: Running agent-server reports release versions

Step 1 — Establish baseline on main:
Started the server with uv run python -m openhands.agent_server --host 127.0.0.1 --port 8765 and requested /server_info:

{"version":"1.23.1","sdk_version":"1.23.1","tools_version":"1.23.1","workspace_version":"1.23.1"}

This shows the real server API previously surfaced 1.23.1 across the server, SDK, tools, and workspace versions.

Step 2 — Apply the PR's changes:
Started the same server command from the PR checkout on port 8766.

Step 3 — Re-run with the PR in place:
Requested http://127.0.0.1:8766/server_info:

{"version":"1.24.0","sdk_version":"1.24.0","tools_version":"1.24.0","workspace_version":"1.24.0"}

This confirms a real API consumer sees the intended 1.24.0 release versions.
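The comparison in this test can be sketched as a small check on the /server_info payload. The helper name is hypothetical, and the sketch works on the response body quoted above rather than calling a live server:

```python
import json

VERSION_KEYS = ("version", "sdk_version", "tools_version", "workspace_version")


def versions_agree(payload: str, expected: str) -> bool:
    """True when every version field in the payload equals the expected release."""
    info = json.loads(payload)
    return all(info.get(key) == expected for key in VERSION_KEYS)


MAIN_RESPONSE = '{"version":"1.23.1","sdk_version":"1.23.1","tools_version":"1.23.1","workspace_version":"1.23.1"}'
PR_RESPONSE = '{"version":"1.24.0","sdk_version":"1.24.0","tools_version":"1.24.0","workspace_version":"1.24.0"}'

print(versions_agree(MAIN_RESPONSE, "1.24.0"))  # baseline still reports 1.23.1
print(versions_agree(PR_RESPONSE, "1.24.0"))
```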

Test 3: Release distribution build metadata

Step 1 — Establish baseline:
The baseline package metadata check above established the current release line as 1.23.1.

Step 2 — Apply the PR's changes:
Built release artifacts from the PR with uv build --all-packages --out-dir /tmp/qa-dist-1.24.0.

Step 3 — Inspect built wheels:
Read wheel METADATA from the generated artifacts:

openhands_agent_server-1.24.0-py3-none-any.whl -> openhands-agent-server 1.24.0
openhands_sdk-1.24.0-py3-none-any.whl -> openhands-sdk 1.24.0
openhands_tools-1.24.0-py3-none-any.whl -> openhands-tools 1.24.0
openhands_workspace-1.24.0-py3-none-any.whl -> openhands-workspace 1.24.0

This confirms the artifacts that would be published carry the intended release version.
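Reading a wheel's METADATA needs only the standard library, because a wheel is a zip archive with a `*.dist-info/METADATA` file in email-header format. The sketch below is one way to do the inspection described in Step 3. The helper name is an assumption, and the demo uses a tiny in-memory stand-in rather than the real artifacts under /tmp/qa-dist-1.24.0.

```python
import io
import zipfile
from email.parser import Parser


def wheel_name_version(wheel) -> tuple[str, str]:
    """Return (Name, Version) from the METADATA file inside a wheel."""
    with zipfile.ZipFile(wheel) as zf:
        meta_path = next(
            name for name in zf.namelist() if name.endswith(".dist-info/METADATA")
        )
        text = zf.read(meta_path).decode("utf-8")
    headers = Parser().parsestr(text)
    return headers["Name"], headers["Version"]


# Demo: build a minimal stand-in wheel in memory (not a real build artifact).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(
        "demo_pkg-1.24.0.dist-info/METADATA",
        "Metadata-Version: 2.1\nName: demo-pkg\nVersion: 1.24.0\n",
    )
buf.seek(0)

result = wheel_name_version(buf)
print(result)
```

The same function accepts a filesystem path, so it could be pointed at each `.whl` in the output directory.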

Test 4: Run Eval workflow default

Step 1 — Establish baseline on main:
Parsed .github/workflows/run-eval.yml with yaml.BaseLoader:

main run-eval sdk_ref default=v1.23.1

This shows manual eval dispatch defaulted to the prior release.

Step 2 — Apply the PR's changes:
Parsed the same workflow in the PR checkout.

Step 3 — Re-run with the PR in place:

PR run-eval sdk_ref default=v1.24.0

This confirms a workflow-dispatch user gets the new release tag by default.

CI Evidence

Latest gh pr checks 3401 --repo OpenHands/software-agent-sdk summary at review time:

bucket_counts={'skipping': 16, 'pending': 14, 'pass': 19, 'fail': 1}
[Run tests]
- IN_PROGRESS (pending): tools-tests
- SUCCESS (pass): sdk-tests, workspace-tests, cross-tests, agent-server-tests, Test directory allowlist, windows-tests
[Run Examples Scripts]
- IN_PROGRESS (pending): test-examples
[Run Integration Tests]
- pending jobs remain for gpt-5.5, minimax-m2.7, and deepseek-v4-flash; one gemini job passed
[Deprecation deadlines]
- FAILURE (fail): check
[Version bump guard]
- SUCCESS (pass): Check package versions
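A `bucket_counts` summary like the one above can be derived from JSON check data. The sketch assumes input shaped like the output of `gh pr checks --json name,bucket,workflow`; the sample entries are a small illustrative subset using job names from this report, not the full set of 50 checks.

```python
import json
from collections import Counter


def bucket_counts(checks_json: str) -> dict[str, int]:
    """Count checks per bucket (pass, fail, pending, skipping)."""
    checks = json.loads(checks_json)
    return dict(Counter(check["bucket"] for check in checks))


# Illustrative subset only; the real run had many more checks.
SAMPLE = json.dumps([
    {"workflow": "Run tests", "name": "tools-tests", "bucket": "pending"},
    {"workflow": "Run tests", "name": "sdk-tests", "bucket": "pass"},
    {"workflow": "Run Examples Scripts", "name": "test-examples", "bucket": "pending"},
    {"workflow": "Deprecation deadlines", "name": "check", "bucket": "fail"},
    {"workflow": "Version bump guard", "name": "Check package versions", "bucket": "pass"},
])

counts = bucket_counts(SAMPLE)
print(counts)
```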

Failing Deprecation deadlines job excerpt:

The following deprecated features have passed their removal deadline:

- [openhands-sdk] 'register_tool(callable_factory)' (warn_call)
  deprecated in: 1.19.1
  removed in:    1.24.0
  defined at:    openhands-sdk/openhands/sdk/tool/registry.py:163

- [openhands-sdk] f'Importing {name!r} from openhands.sdk.settings' (warn_call)
  deprecated in: 1.19.0
  removed in:    1.24.0
  defined at:    openhands-sdk/openhands/sdk/settings/__init__.py:122

Update or remove the listed features before publishing a version that meets or exceeds their removal deadline.
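The rule the job enforces ("a version that meets or exceeds their removal deadline") can be sketched as a plain version comparison. This is a simplified illustration, not the repository's actual check: the `Deprecation` dataclass and helper names are hypothetical, and versions are assumed to be dotted integers.

```python
from dataclasses import dataclass


def parse_version(text: str) -> tuple[int, ...]:
    # Assumes plain dotted integers such as "1.24.0".
    return tuple(int(part) for part in text.split("."))


@dataclass(frozen=True)
class Deprecation:
    feature: str
    deprecated_in: str
    removed_in: str


def overdue(deprecations, current_version: str):
    """Return entries whose removal deadline is at or below the current version."""
    current = parse_version(current_version)
    return [d for d in deprecations if current >= parse_version(d.removed_in)]


ENTRIES = [
    Deprecation("register_tool(callable_factory)", "1.19.1", "1.24.0"),
    Deprecation("Importing names from openhands.sdk.settings", "1.19.0", "1.24.0"),
]

for entry in overdue(ENTRIES, "1.24.0"):
    print(f"overdue: {entry.feature} (removed in {entry.removed_in})")
```

With the version at 1.23.1 the list is empty, which is why the gate only started failing once this PR bumped the version.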

Issues Found

🟠 Issue: The release-version behavior is correct, but the PR is not fully release-ready while the Deprecation deadlines workflow fails for two SDK deprecations whose removal target is 1.24.0.
🟡 Minor / Status: Release-critical CI was still pending at review time (Run tests tools job, Run Examples Scripts, and multiple Run Integration Tests jobs), so final release readiness still depends on those completing successfully.

This QA review was created by an AI agent (OpenHands) on behalf of the user.

github-actions · 2026-05-27T11:31:58Z

Coverage Report

File	Stmts	Miss	Cover	Missing
openhands-sdk/openhands/sdk
__init__.py	28	2	92%	111–112
openhands-sdk/openhands/sdk/settings
model.py	562	50	91%	83, 108, 113, 352, 362–365, 368, 381, 385, 391, 401, 407, 412, 602, 615, 626, 636, 640, 642, 644, 646, 648, 650, 652, 927, 929, 1042, 1226, 1295, 1334, 1361, 1397–1400, 1426, 1550, 1595, 1627, 1637, 1639, 1644, 1662, 1675, 1677, 1679, 1681, 1688
openhands-sdk/openhands/sdk/tool
registry.py	90	10	88%	39, 59–60, 71, 84, 106–107, 110, 118, 156
TOTAL	29189	6411	78%

github-actions · 2026-05-27T11:35:32Z

🔄 Running Examples with `openhands/claude-haiku-4-5-20251001`

Generated: 2026-05-27 11:55:16 UTC

Example	Status	Duration	Cost
01_standalone_sdk/02_custom_tools.py	✅ PASS	24.4s	$0.03
01_standalone_sdk/03_activate_skill.py	✅ PASS	24.6s	$0.03
01_standalone_sdk/05_use_llm_registry.py	✅ PASS	11.6s	$0.01
01_standalone_sdk/07_mcp_integration.py	✅ PASS	35.4s	$0.03
01_standalone_sdk/09_pause_example.py	✅ PASS	12.1s	$0.01
01_standalone_sdk/10_persistence.py	✅ PASS	32.9s	$0.03
01_standalone_sdk/11_async.py	✅ PASS	26.6s	$0.03
01_standalone_sdk/12_custom_secrets.py	✅ PASS	8.4s	$0.00
01_standalone_sdk/13_get_llm_metrics.py	✅ PASS	36.6s	$0.03
01_standalone_sdk/14_context_condenser.py	✅ PASS	2m 27s	$0.17
01_standalone_sdk/17_image_input.py	✅ PASS	20.0s	$0.02
01_standalone_sdk/18_send_message_while_processing.py	✅ PASS	17.3s	$0.02
01_standalone_sdk/19_llm_routing.py	✅ PASS	16.0s	$0.02
01_standalone_sdk/20_stuck_detector.py	✅ PASS	11.1s	$0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py	✅ PASS	10.2s	$0.00
01_standalone_sdk/22_anthropic_thinking.py	✅ PASS	18.1s	$0.02
01_standalone_sdk/23_responses_reasoning.py	✅ PASS	1m 31s	$0.02
01_standalone_sdk/24_planning_agent_workflow.py	✅ PASS	8m 17s	$0.62
01_standalone_sdk/25_agent_delegation.py	✅ PASS	1m 27s	$0.09
01_standalone_sdk/26_custom_visualizer.py	✅ PASS	16.3s	$0.03
01_standalone_sdk/28_ask_agent_example.py	✅ PASS	33.8s	$0.02
01_standalone_sdk/29_llm_streaming.py	✅ PASS	33.5s	$0.02
01_standalone_sdk/30_tom_agent.py	✅ PASS	17.8s	$0.02
01_standalone_sdk/31_iterative_refinement.py	✅ PASS	5m 23s	$0.42
01_standalone_sdk/32_configurable_security_policy.py	✅ PASS	16.7s	$0.02
01_standalone_sdk/33_hooks/main.py	✅ PASS	31.7s	$0.04
01_standalone_sdk/34_critic_example.py	✅ PASS	8m 52s	$0.61
01_standalone_sdk/36_event_json_to_openai_messages.py	✅ PASS	9.6s	$0.00
01_standalone_sdk/37_llm_profile_store/main.py	✅ PASS	4.2s	$0.00
01_standalone_sdk/38_browser_session_recording.py	✅ PASS	38.0s	$0.03
01_standalone_sdk/39_llm_fallback.py	✅ PASS	9.5s	$0.00
01_standalone_sdk/40_acp_agent_example.py	✅ PASS	31.5s	$0.32
01_standalone_sdk/41_task_tool_set.py	✅ PASS	45.8s	$0.04
01_standalone_sdk/42_file_based_subagents.py	✅ PASS	28.8s	$0.04
01_standalone_sdk/43_mixed_marketplace_skills/main.py	✅ PASS	3.3s	$0.00
01_standalone_sdk/44_model_switching_in_convo.py	✅ PASS	7.1s	$0.01
01_standalone_sdk/45_parallel_tool_execution.py	✅ PASS	7m 8s	$0.53
01_standalone_sdk/46_agent_settings.py	✅ PASS	12.0s	$0.01
01_standalone_sdk/47_defense_in_depth_security.py	✅ PASS	4.9s	$0.00
01_standalone_sdk/48_conversation_fork.py	✅ PASS	13.7s	$0.00
01_standalone_sdk/49_switch_llm_tool.py	✅ PASS	7.2s	$0.03
01_standalone_sdk/50_async_cancellation.py	✅ PASS	12.1s	$0.01
02_remote_agent_server/01_convo_with_local_agent_server.py	✅ PASS	40.3s	$0.03
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py	✅ PASS	1m 38s	$0.06
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py	✅ PASS	1m 32s	$0.08
02_remote_agent_server/04_convo_with_api_sandboxed_server.py	✅ PASS	1m 46s	$0.03
02_remote_agent_server/06_custom_tool/main.py	✅ PASS	4m 29s	$0.03
02_remote_agent_server/07_convo_with_cloud_workspace.py	✅ PASS	42.6s	$0.04
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py	✅ PASS	3m 26s	$0.02
02_remote_agent_server/09_acp_agent_with_remote_runtime.py	✅ PASS	47.1s	$0.11
02_remote_agent_server/10_cloud_workspace_share_credentials.py	✅ PASS	30.9s	$0.04
02_remote_agent_server/11_conversation_fork.py	✅ PASS	1m 44s	$0.00
02_remote_agent_server/12_settings_and_secrets_api.py	✅ PASS	2m 27s	$0.02
02_remote_agent_server/13_workspace_get_llm.py	✅ PASS	31.0s	$0.02
04_llm_specific_tools/01_gpt5_apply_patch_preset.py	✅ PASS	20.1s	$0.02
04_llm_specific_tools/02_gemini_file_tools.py	✅ PASS	43.6s	$0.05
05_skills_and_plugins/01_loading_agentskills/main.py	✅ PASS	16.6s	$0.02
05_skills_and_plugins/02_loading_plugins/main.py	✅ PASS	2m 51s	$0.02

✅ All tests passed!

Total: 58 | Passed: 58 | Failed: 0 | Total Cost: $4.01


github-actions · 2026-05-27T11:49:20Z

🧪 Integration Tests Results

Overall Success Rate: 100.0%
Total Cost: $17.19
Models Tested: 4
Timestamp: 2026-05-27 11:49:12 UTC


📊 Summary

Model	Overall	Tests Passed	Total	Cost	Tokens
litellm_proxy_minimax_MiniMax_M2.7	100.0%	5/5	5	$0.17	3,734,601
litellm_proxy_gemini_3.1_pro_preview	100.0%	5/5	5	$13.25	7,597,614
litellm_proxy_deepseek_deepseek_v4_flash	100.0%	5/5	5	$0.17	3,423,284
litellm_proxy_openai_gpt_5.5	100.0%	5/5	5	$3.59	2,871,859

📋 Detailed Results

litellm_proxy_minimax_MiniMax_M2.7

Success Rate: 100.0% (5/5)
Total Cost: $0.17
Token Usage: prompt: 3,696,713, completion: 37,888, cache_read: 3,323,267, reasoning: 58
Run Suffix: litellm_proxy_minimax_MiniMax_M2.7_19e34c7_minimax_m2_7_run_N5_20260527_112703

litellm_proxy_gemini_3.1_pro_preview

Success Rate: 100.0% (5/5)
Total Cost: $13.25
Token Usage: prompt: 7,566,162, completion: 31,452, cache_read: 1,171,255, reasoning: 13,775
Run Suffix: litellm_proxy_gemini_3.1_pro_preview_19e34c7_gemini_3_1_pro_run_N5_20260527_112702

litellm_proxy_deepseek_deepseek_v4_flash

Success Rate: 100.0% (5/5)
Total Cost: $0.17
Token Usage: prompt: 3,386,146, completion: 37,138, cache_read: 3,131,008, reasoning: 10,076
Run Suffix: litellm_proxy_deepseek_deepseek_v4_flash_19e34c7_deepseek_v4_flash_run_N5_20260527_112718

litellm_proxy_openai_gpt_5.5

Success Rate: 100.0% (5/5)
Total Cost: $3.59
Token Usage: prompt: 2,835,368, completion: 36,491, cache_read: 2,492,416, reasoning: 11,257
Run Suffix: litellm_proxy_openai_gpt_5.5_19e34c7_gpt_5_5_run_N5_20260527_112655

enyst · 2026-05-27T13:24:17Z

Total Cost: $13.25

Token Usage: prompt: 7,566,162, completion: 31,452, cache_read: 1,171,255, reasoning: 13,775

Run Suffix: litellm_proxy_gemini_3.1_pro_preview_19e34c7_gemini_3_1_pro_run_N5_20260527_112702

Wow this is a funny one, 13.25. Everything else is much less 🤷

I'm guessing it's nothing, just a random fluke; Gemini wants to think round-n-round
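One figure in the reported token usage may be relevant here: the share of prompt tokens served from cache. The sketch below computes it from the prompt and cache_read numbers in the second integration run; it is only arithmetic on the reported values and does not establish why the cost differed.

```python
# (prompt tokens, cache_read tokens) as reported in the second integration run.
USAGE = {
    "minimax_MiniMax_M2.7": (3_696_713, 3_323_267),
    "gemini_3.1_pro_preview": (7_566_162, 1_171_255),
    "deepseek_v4_flash": (3_386_146, 3_131_008),
    "openai_gpt_5.5": (2_835_368, 2_492_416),
}


def cache_read_share(prompt: int, cache_read: int) -> float:
    """Percentage of prompt tokens that were read from cache."""
    return round(100 * cache_read / prompt, 1)


shares = {model: cache_read_share(*tokens) for model, tokens in USAGE.items()}
for model, share in shares.items():
    print(f"{model}: {share}% of prompt tokens read from cache")
```

By this measure the Gemini run read about 15.5% of its prompt tokens from cache, against roughly 88% to 92% for the other three models, and it also used about twice as many prompt tokens.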

simonrosenberg · 2026-05-27T13:38:11Z

Deprecation-deadline cleanup in this release

Cutting 1.24.0 tripped the Deprecation deadlines check for two features marked removed_in: 1.24.0. Both are now removed in this PR:

register_tool(callable_factory) — removed; the 16 callable-factory call sites across the test suite were migrated to ToolDefinition subclasses/instances.
LLMAgentSettings public import aliases — removed from openhands.sdk and openhands.sdk.settings. The class is retained at openhands.sdk.settings.model (it's a live agent_kind="llm" member of the settings union, so legacy payloads still deserialize and the API-breakage field-value check is unchanged).

Note on the commit sequence: an interim commit deferred LLMAgentSettings to 1.25.0, because removing it failed the Python API (api-breakage) gate — the gate couldn't see its deprecation (it lives in the _DEPRECATED_SDK_EXPORTS registry + an f-string __getattr__, not a decorator/literal call). That gap was fixed in #3402 (now merged to main) and is cherry-picked here (commit 6807dfb) so this branch's api-breakage check passes deterministically; it dedupes on merge. A later commit then performs the actual removal. Net diff = both deprecations removed.

api-breakage now reports the LLMAgentSettings removal as a sanctioned scheduled removal (::notice, exit 0).
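The gap described in the note comes down to static detectability: a feature name built with an f-string exists only at runtime, while a literal can be read from the source. The sketch below is not the actual api-breakage checker; it is a minimal `ast` scan, with a hypothetical helper name, showing why a literal `warn_deprecated("LLMAgentSettings")` call is visible and an f-string call is not.

```python
import ast

SOURCE = '''
def __getattr__(name):
    if name in _DEPRECATED_SDK_EXPORTS:
        warn_deprecated(f"Importing {name!r} from openhands.sdk.settings")
    raise AttributeError(name)


def legacy_alias():
    warn_deprecated("LLMAgentSettings")
'''


def literal_deprecations(source: str) -> list[str]:
    """Collect feature names passed to warn_deprecated() as plain string literals."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "warn_deprecated"
            and node.args
        ):
            first = node.args[0]
            # An f-string parses as ast.JoinedStr, so it is skipped here.
            if isinstance(first, ast.Constant) and isinstance(first.value, str):
                found.append(first.value)
    return found


names = literal_deprecations(SOURCE)
print(names)
```

Only the literal name is reported, so a scanner of this kind needs separate handling, such as reading a registry like `_DEPRECATED_SDK_EXPORTS`, to recognize the f-string case.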

enyst · 2026-05-27T14:26:29Z

@OpenHands /codereview this PR; note that it's a release PR, so pay attention to specifics. Post directly on the PR.

openhands-ai · 2026-05-27T14:26:46Z

I'm on it! enyst can track my progress at all-hands.dev

enyst

🟡 Taste Rating: Acceptable — the code changes are pragmatic for a release cleanup, but I’m not approving yet because the release-specific validation is stale on the current head.

[TESTING GAPS]

[Release validation] Stale required release workflows: current PR head is bd02939bef29ccc894b246bcadd2484805fd6edd. I verified Run tests passed on that head (26514653367, coverage comment updated to link bd02939). However:
- Run Examples Scripts last passed on 19e34c7a376eb21734385de2074b629061313c00 (26508259690), before the deprecation-deadline cleanup commits.
- Run Integration Tests last passed on 19e34c7a376eb21734385de2074b629061313c00 for both the integration-test and behavior-test label runs (26508259747, 26508259849).

Per the release PR review guideline, I can’t approve a release PR until the latest PR-specific Run tests, Run Examples Scripts, and Run Integration Tests results all match the current PR state. Please rerun examples + integration/behavior against bd02939 (or have a human maintainer explicitly accept the stale validation before merging).

Code-wise, I didn’t find a blocker in the actual cleanup: register_tool(callable_factory) is removed with tests migrated to ToolDefinition subclasses/instances, and the LLMAgentSettings public aliases are removed while the model class remains in settings.model for legacy payload deserialization. The API-breakage checker update for _DEPRECATED_SDK_EXPORTS is narrow and covered by targeted tests.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM

This is a release PR that publishes package versions and removes two deprecated public surfaces. The implementation looks intentional and the API/REST/unit gates are green on current head, but stale examples/integration/behavior validation means the release checklist is not fully proven for the final commit.

VERDICT:
⚠️ Comment / not approving yet: core logic looks sound, but release validation needs to be refreshed or explicitly accepted by a human maintainer before approval.

KEY INSIGHT:
The deprecation removals are aligned with the scheduled 1.24.0 cleanup; the remaining risk is release-process validation freshness, not code structure.

This review was created by an AI agent (OpenHands) on behalf of @enyst.


openhands-ai · 2026-05-27T14:32:53Z

OpenHands encountered an error: Request timeout after 30 seconds to https://balwusrsvebknnow.prod-runtime.all-hands.dev/api/conversations/e31d3d2e-aef6-4df1-890d-2d43d9eb60cd/ask_agent

See the conversation for more information.

simonrosenberg · 2026-05-27T15:17:41Z

Rebased onto latest main (now includes #3347, #3323, #3247, #3398, #3329, #3346, and #3402). The interim cherry-pick of #3402 was auto-dropped during the rebase since main now provides the api-breakage checker fix directly — so the Python API gate still sanctions the LLMAgentSettings removal. Re-validated locally on the rebased base: deprecation-deadline check clean, api-breakage exit 0, affected suites pass.

Co-authored-by: openhands <openhands@all-hands.dev>

…moval to 1.25.0

Cutting v1.24.0 trips the deprecation-deadline check for two features whose removal was scheduled for 1.24.0. Handle each per its actual removability:

* register_tool(callable_factory) (deprecated 1.19.1): REMOVED. register_tool now accepts only a ToolDefinition instance or subclass; the dead _resolver_from_callable / _usability_from_callable helpers and the callable branch are dropped. 16 call sites across 6 test files still used the callable form and are migrated to register the ToolDefinition subclass directly, a prebuilt instance, or -- where conv_state is needed at resolve time -- a small ToolDefinition subclass. Clean under the SDK api-breakage gate: register_tool stays exported, only its accepted-arg union narrows.
* LLMAgentSettings import aliases (deprecated 1.19.0): KEPT; deadline moved to 1.25.0. Removing them now fails the api-breakage gate -- the published 1.23.1 baseline deprecated them via a module __getattr__ calling warn_deprecated(f"Importing {name!r} ..."), an f-string the breakage checker cannot statically detect, so the removal reads as unsanctioned and there is no override. Re-expressed with a literal warn_deprecated("LLMAgentSettings", ...) feature name so the 1.25.0 baseline carries a detectable record and the removal passes the gate next minor.

Tests: callable-factory registration now asserts TypeError; the qualname callable test is dropped; the 16 callers above are migrated. LLMAgentSettings alias test is unchanged (still warns).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

With the api-breakage checker now recognizing the _DEPRECATED_SDK_EXPORTS registry (cherry-picked from #3402), removing LLMAgentSettings -- deprecated in 1.19.0 with removed_in 1.24.0 -- is sanctioned and lands in this release instead of being deferred.

- Drop the public import aliases from `openhands.sdk` and `openhands.sdk.settings` (the __all__ entries, TYPE_CHECKING imports, and the __getattr__ / _DEPRECATED_SDK_EXPORTS shims). `from openhands.sdk import LLMAgentSettings` and the settings-level import now raise.
- Retain the LLMAgentSettings *class* at `openhands.sdk.settings.model`: it is a live member of the settings discriminated union (agent_kind="llm") so legacy payloads still deserialize and the API-breakage field-value check is unchanged.
- Rewrite the alias test to assert removal; update class/union docstrings.

Verified: deprecation-deadline check clean; api-breakage reports a sanctioned scheduled removal (::notice, exit 0); pyright clean; 79 affected tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

enyst · 2026-05-27T16:35:34Z

@OpenHands Remove and re-add the 3 release test labels so that they run on the current state of the release PR. Wait until all three are done, then post a comment directly on the PR telling us what you think about the results.

openhands-ai · 2026-05-27T16:35:49Z

I'm on it! enyst can track my progress at all-hands.dev

github-actions · 2026-05-27T16:36:51Z

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

github-actions · 2026-05-27T16:36:57Z

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

github-actions · 2026-05-27T16:38:51Z

🔄 Running Examples with `openhands/claude-haiku-4-5-20251001`

Generated: 2026-05-27 16:57:34 UTC

Example	Status	Duration	Cost
01_standalone_sdk/02_custom_tools.py	✅ PASS	48.7s	$0.07
01_standalone_sdk/03_activate_skill.py	✅ PASS	20.8s	$0.03
01_standalone_sdk/05_use_llm_registry.py	✅ PASS	11.1s	$0.01
01_standalone_sdk/07_mcp_integration.py	✅ PASS	26.2s	$0.02
01_standalone_sdk/09_pause_example.py	✅ PASS	10.4s	$0.01
01_standalone_sdk/10_persistence.py	✅ PASS	29.2s	$0.02
01_standalone_sdk/11_async.py	✅ PASS	28.4s	$0.04
01_standalone_sdk/12_custom_secrets.py	✅ PASS	8.7s	$0.00
01_standalone_sdk/13_get_llm_metrics.py	✅ PASS	57.7s	$0.06
01_standalone_sdk/14_context_condenser.py	✅ PASS	2m 14s	$0.16
01_standalone_sdk/17_image_input.py	✅ PASS	25.4s	$0.02
01_standalone_sdk/18_send_message_while_processing.py	✅ PASS	23.9s	$0.02
01_standalone_sdk/19_llm_routing.py	✅ PASS	14.2s	$0.02
01_standalone_sdk/20_stuck_detector.py	✅ PASS	15.6s	$0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py	✅ PASS	8.7s	$0.00
01_standalone_sdk/22_anthropic_thinking.py	✅ PASS	19.0s	$0.01
01_standalone_sdk/23_responses_reasoning.py	✅ PASS	1m 8s	$0.01
01_standalone_sdk/24_planning_agent_workflow.py	✅ PASS	5m 20s	$0.37
01_standalone_sdk/25_agent_delegation.py	✅ PASS	1m 7s	$0.07
01_standalone_sdk/26_custom_visualizer.py	✅ PASS	16.9s	$0.03
01_standalone_sdk/28_ask_agent_example.py	✅ PASS	38.7s	$0.04
01_standalone_sdk/29_llm_streaming.py	✅ PASS	29.0s	$0.02
01_standalone_sdk/30_tom_agent.py	✅ PASS	7.9s	$0.01
01_standalone_sdk/31_iterative_refinement.py	✅ PASS	4m 49s	$0.36
01_standalone_sdk/32_configurable_security_policy.py	✅ PASS	22.7s	$0.03
01_standalone_sdk/33_hooks/main.py	✅ PASS	46.7s	$0.05
01_standalone_sdk/34_critic_example.py	✅ PASS	8m 0s	$0.68
01_standalone_sdk/36_event_json_to_openai_messages.py	✅ PASS	8.4s	$0.00
01_standalone_sdk/37_llm_profile_store/main.py	✅ PASS	3.5s	$0.00
01_standalone_sdk/38_browser_session_recording.py	✅ PASS	39.1s	$0.03
01_standalone_sdk/39_llm_fallback.py	✅ PASS	9.0s	$0.01
01_standalone_sdk/40_acp_agent_example.py	✅ PASS	29.1s	$0.32
01_standalone_sdk/41_task_tool_set.py	✅ PASS	23.0s	$0.03
01_standalone_sdk/42_file_based_subagents.py	✅ PASS	41.6s	$0.05
01_standalone_sdk/43_mixed_marketplace_skills/main.py	✅ PASS	2.9s	$0.00
01_standalone_sdk/44_model_switching_in_convo.py	✅ PASS	7.1s	$0.01
01_standalone_sdk/45_parallel_tool_execution.py	✅ PASS	6m 29s	$0.54
01_standalone_sdk/46_agent_settings.py	✅ PASS	11.4s	$0.01
01_standalone_sdk/47_defense_in_depth_security.py	✅ PASS	2.8s	$0.00
01_standalone_sdk/48_conversation_fork.py	✅ PASS	11.7s	$0.00
01_standalone_sdk/49_switch_llm_tool.py	✅ PASS	11.5s	$0.03
01_standalone_sdk/50_async_cancellation.py	✅ PASS	12.0s	$0.00
02_remote_agent_server/01_convo_with_local_agent_server.py	✅ PASS	30.7s	$0.02
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py	✅ PASS	1m 51s	$0.04
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py	✅ PASS	1m 5s	$0.06
02_remote_agent_server/04_convo_with_api_sandboxed_server.py	✅ PASS	1m 49s	$0.05
02_remote_agent_server/06_custom_tool/main.py	✅ PASS	6m 3s	$0.04
02_remote_agent_server/07_convo_with_cloud_workspace.py	✅ PASS	38.2s	$0.03
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py	✅ PASS	4m 19s	$0.02
02_remote_agent_server/09_acp_agent_with_remote_runtime.py	✅ PASS	1m 1s	$0.14
02_remote_agent_server/10_cloud_workspace_share_credentials.py	✅ PASS	37.1s	$0.06
02_remote_agent_server/11_conversation_fork.py	✅ PASS	1m 43s	$0.00
02_remote_agent_server/12_settings_and_secrets_api.py	✅ PASS	2m 24s	$0.02
02_remote_agent_server/13_workspace_get_llm.py	✅ PASS	1m 16s	$0.04
04_llm_specific_tools/01_gpt5_apply_patch_preset.py	✅ PASS	19.9s	$0.03
04_llm_specific_tools/02_gemini_file_tools.py	✅ PASS	1m 1s	$0.09
05_skills_and_plugins/01_loading_agentskills/main.py	✅ PASS	13.8s	$0.02
05_skills_and_plugins/02_loading_plugins/main.py	✅ PASS	22.7s	$0.02

✅ All tests passed!

Total: 58 | Passed: 58 | Failed: 0 | Total Cost: $3.91
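
For anyone post-processing this report, here is a minimal sketch of parsing its rows. It assumes each row is tab-separated as `path`, status, duration, cost, and that durations take the form `8.7s` or `2m 14s`, as they appear above. The helper names `parse_duration` and `parse_row` are hypothetical and not part of the SDK or its workflows; the three sample rows are copied from the table.

```python
import re

# Three rows copied from the table above (assumed tab-separated).
ROWS = """\
01_standalone_sdk/12_custom_secrets.py\t✅ PASS\t8.7s\t$0.00
01_standalone_sdk/14_context_condenser.py\t✅ PASS\t2m 14s\t$0.16
01_standalone_sdk/34_critic_example.py\t✅ PASS\t8m 0s\t$0.68
"""


def parse_duration(text: str) -> float:
    """Convert a duration such as '8.7s' or '2m 14s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m\s*)?([\d.]+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def parse_row(line: str) -> dict:
    """Split one report row into name, pass flag, seconds, and cost."""
    name, status, duration, cost = line.split("\t")
    return {
        "name": name,
        "passed": status.endswith("PASS"),
        "seconds": parse_duration(duration),
        "cost": float(cost.lstrip("$")),
    }


rows = [parse_row(line) for line in ROWS.splitlines()]
failed = [r["name"] for r in rows if not r["passed"]]
slowest = max(rows, key=lambda r: r["seconds"])
print(f"parsed={len(rows)} failed={len(failed)} slowest={slowest['name']}")
```

Run over the full table, the same parsing would make it easy to spot the slowest or most expensive examples before a release.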

View full workflow run

github-actions · 2026-05-27T16:41:04Z

🧪 Integration Tests Results

Overall Success Rate: 100.0%
Total Cost: $0.87
Models Tested: 4
Timestamp: 2026-05-27 16:40:55 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

litellm_proxy_minimax_MiniMax_M2.7: 📥 View & Download Logs
litellm_proxy_deepseek_deepseek_v4_flash: 📥 View & Download Logs
litellm_proxy_gemini_3.1_pro_preview: 📥 View & Download Logs
litellm_proxy_openai_gpt_5.5: 📥 View & Download Logs

📊 Summary

Model	Overall	Tests Passed	Skipped	Total	Cost	Tokens
litellm_proxy_minimax_MiniMax_M2.7	100.0%	8/8	1	9	$0.00	332,490
litellm_proxy_deepseek_deepseek_v4_flash	100.0%	8/8	1	9	$0.00	425,687
litellm_proxy_gemini_3.1_pro_preview	100.0%	9/9	0	9	$0.16	326,829
litellm_proxy_openai_gpt_5.5	100.0%	9/9	0	9	$0.72	281,330

📋 Detailed Results

litellm_proxy_minimax_MiniMax_M2.7

Success Rate: 100.0% (8/8)
Total Cost: $0.00
Token Usage: prompt: 327,726, completion: 4,764, cache_read: 257,306, reasoning: 345
Run Suffix: litellm_proxy_minimax_MiniMax_M2.7_17cb597_minimax_m2_7_run_N9_20260527_163836
Skipped Tests: 1

Skipped Tests:

t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_deepseek_deepseek_v4_flash

Success Rate: 100.0% (8/8)
Total Cost: $0.00
Token Usage: prompt: 420,110, completion: 5,577, cache_read: 375,424, reasoning: 1,530
Run Suffix: litellm_proxy_deepseek_deepseek_v4_flash_17cb597_deepseek_v4_flash_run_N9_20260527_163831
Skipped Tests: 1

Skipped Tests:

t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3.1_pro_preview

Success Rate: 100.0% (9/9)
Total Cost: $0.16
Token Usage: prompt: 322,242, completion: 4,587, cache_read: 301,124, reasoning: 2,615
Run Suffix: litellm_proxy_gemini_3.1_pro_preview_17cb597_gemini_3_1_pro_run_N9_20260527_163841

litellm_proxy_openai_gpt_5.5

Success Rate: 100.0% (9/9)
Total Cost: $0.72
Token Usage: prompt: 276,603, completion: 4,727, cache_read: 179,712, reasoning: 1,507
Run Suffix: litellm_proxy_openai_gpt_5.5_17cb597_gpt_5_5_run_N9_20260527_163845
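
As a reading aid for the Tokens column: in this report it equals prompt plus completion tokens for all four models, with `cache_read` and `reasoning` not added on top. That is an observation from the numbers above, not a documented definition of the column. A short check, using only figures copied from the report (the `mismatches` helper is hypothetical):

```python
# (tokens column, prompt, completion) copied from the integration report above.
REPORTED = {
    "litellm_proxy_minimax_MiniMax_M2.7": (332_490, 327_726, 4_764),
    "litellm_proxy_deepseek_deepseek_v4_flash": (425_687, 420_110, 5_577),
    "litellm_proxy_gemini_3.1_pro_preview": (326_829, 322_242, 4_587),
    "litellm_proxy_openai_gpt_5.5": (281_330, 276_603, 4_727),
}


def mismatches(reported: dict) -> list:
    """Return models whose Tokens value differs from prompt + completion."""
    return [
        model
        for model, (total, prompt, completion) in reported.items()
        if prompt + completion != total
    ]


print(mismatches(REPORTED))  # → []
```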

enyst · 2026-05-27T16:44:42Z

@OpenHands see the comment here #3401 (comment) and I think the same 4 LLMs are for behavior tests and integration tests. Make an investigation in recent history to tell me: we used to run them for a Claude, probably a Sonnet IIRC; where/when did we “lose” that Claude? This is a side question, not part of the release PR, so investigate deeply from main branch and make a new issue on the repo where you tag me and explain to me what happened with those LLMs. Do NOT touch the release PR.

openhands-ai · 2026-05-27T16:44:57Z

I'm on it! enyst can track my progress at all-hands.dev

openhands-ai · 2026-05-27T16:48:55Z

OpenHands encountered an error: Request timeout after 30 seconds to https://yqvmoyebykrkpsdo.prod-runtime.all-hands.dev/api/conversations/6375e3a7-fc47-411f-8e0f-4b8b7952fe95/ask_agent

See the conversation for more information.

github-actions · 2026-05-27T16:50:12Z

🧪 Integration Tests Results

Overall Success Rate: 95.0%
Total Cost: $13.33
Models Tested: 4
Timestamp: 2026-05-27 16:50:03 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

litellm_proxy_minimax_MiniMax_M2.7: 📥 View & Download Logs
litellm_proxy_deepseek_deepseek_v4_flash: 📥 View & Download Logs
litellm_proxy_gemini_3.1_pro_preview: 📥 View & Download Logs
litellm_proxy_openai_gpt_5.5: 📥 View & Download Logs

📊 Summary

Model	Overall	Tests Passed	Total	Cost	Tokens
litellm_proxy_minimax_MiniMax_M2.7	100.0%	5/5	5	$0.15	3,507,102
litellm_proxy_deepseek_deepseek_v4_flash	100.0%	5/5	5	$0.17	2,937,345
litellm_proxy_gemini_3.1_pro_preview	80.0%	4/5	5	$9.34	5,759,103
litellm_proxy_openai_gpt_5.5	100.0%	5/5	5	$3.68	2,659,870
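
The 95.0% headline is consistent with pooling the per-model counts from the table: 19 passed out of 20 model-test combinations. The sketch below assumes that pooled definition. Because every model ran the same number of tests here, a plain mean of the per-model percentages gives the same value, so this report alone cannot tell the two definitions apart. The `overall_rate` helper is hypothetical.

```python
# (passed, total) per model, copied from the summary table above.
RESULTS = {
    "litellm_proxy_minimax_MiniMax_M2.7": (5, 5),
    "litellm_proxy_deepseek_deepseek_v4_flash": (5, 5),
    "litellm_proxy_gemini_3.1_pro_preview": (4, 5),
    "litellm_proxy_openai_gpt_5.5": (5, 5),
}


def overall_rate(results: dict) -> float:
    """Pooled success rate in percent: total passed over total run."""
    passed = sum(p for p, _ in results.values())
    total = sum(t for _, t in results.values())
    return round(100 * passed / total, 1)


print(overall_rate(RESULTS))  # → 95.0
```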

📋 Detailed Results

litellm_proxy_minimax_MiniMax_M2.7

Success Rate: 100.0% (5/5)
Total Cost: $0.15
Token Usage: prompt: 3,475,595, completion: 31,507, cache_read: 3,212,533
Run Suffix: litellm_proxy_minimax_MiniMax_M2.7_17cb597_minimax_m2_7_run_N5_20260527_163826

litellm_proxy_deepseek_deepseek_v4_flash

Success Rate: 100.0% (5/5)
Total Cost: $0.17
Token Usage: prompt: 2,901,332, completion: 36,013, cache_read: 2,642,048, reasoning: 10,838
Run Suffix: litellm_proxy_deepseek_deepseek_v4_flash_17cb597_deepseek_v4_flash_run_N5_20260527_163834

litellm_proxy_gemini_3.1_pro_preview

Success Rate: 80.0% (4/5)
Total Cost: $9.34
Token Usage: prompt: 5,721,376, completion: 37,727, cache_read: 1,337,698, reasoning: 17,695
Run Suffix: litellm_proxy_gemini_3.1_pro_preview_17cb597_gemini_3_1_pro_run_N5_20260527_163836

Failed Tests:

b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: The agent successfully completed the core task: (1) updated MAX_CMD_OUTPUT_SIZE from 30000 to 20_000, (2) correctly identified that test files don't need modification since they import the constant dynamically, (3) ran test_observation_truncation.py which passed, and (4) verified the change with git diff.

However, the agent's approach violated the evaluation criterion "without unnecessary verification." Specifically:

Over-verification Issues:

After successfully running test_observation_truncation.py (the targeted, sufficient test), the agent attempted to run the broader uv run pytest tests/tools/terminal/ suite multiple times
These broader attempts repeatedly caused tmux session crashes
The agent made at least 3-4 separate attempts to run broader test suites despite encountering failures each time
This consumed many iterations (iterations 7-15+) dealing with tmux recovery, background processes, and log file inspection

What Should Have Happened:

Make the change ✅ (done efficiently)
Run targeted test: test_observation_truncation.py ✅ (done)
Report success and stop ✅ (eventually done, but after excessive iterations)

Positive Aspects:

Correctly understood that tests import MAX_CMD_OUTPUT_SIZE dynamically, so no test modifications needed
Properly used uv as instructed
Eventually recognized the issue and moved forward
Final summary message was accurate

Problematic Aspects:

Repeated unnecessary attempts to run broader test suites
Spent significant effort on tmux crash handling rather than moving forward
Did not cleanly stop after the targeted test passed
The evaluation criteria explicitly noted over-verification should be avoided; the agent's multiple attempts at broader suites constitute exactly that

The task was accomplishable in ~10-15 iterations; the agent took ~50+ iterations largely due to over-verification and repeated failed attempts at broader testing. (confidence=0.65) (Cost: $0.77)

litellm_proxy_openai_gpt_5.5

Success Rate: 100.0% (5/5)
Total Cost: $3.68
Token Usage: prompt: 2,623,692, completion: 36,178, cache_read: 2,232,320, reasoning: 9,854
Run Suffix: litellm_proxy_openai_gpt_5.5_17cb597_gpt_5_5_run_N5_20260527_163841

enyst · 2026-05-27T16:58:41Z

I removed and re-added the three release-test labels at ~16:36 UTC and waited for the fresh runs to finish.

Results from the new runs:

✅ Integration tests: 100.0% success, 4 models, total cost $0.87. Run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26524797465
✅ Example tests: 58/58 passed, 0 failed, total cost $3.91. Run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26524793833
⚠️ Behavior tests: workflow completed, but the report is 95.0% overall: 19/20 model-test combinations passed. The only failure is b02_no_oververification on litellm_proxy_gemini_3.1_pro_preview; the other three models were 5/5. Run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26524792646

WDYT: integration and examples look release-ready to me. I would not call the behavior-test checklist item fully green yet, because the latest behavior report still has a real reported failure. That said, the failure looks like a model/agent behavioral flake around over-verification and tmux recovery loops, not an obvious SDK release-code regression: Gemini completed the requested code change and targeted test, then failed the behavior criterion by continuing with unnecessary broader verification. If maintainers are comfortable accepting that as known/flaky behavior, I think the release can proceed; otherwise I’d rerun behavior-test once more or ask for explicit maintainer sign-off before checking that box.

This PR comment was created by an AI agent (OpenHands) on behalf of the requester.

openhands-ai · 2026-05-27T17:00:02Z

OpenHands encountered an error: Request timeout after 30 seconds to https://owkbyecunjtnbgml.prod-runtime.all-hands.dev/api/conversations/abde2eab-bfaa-454e-96d6-2bdf28600047/ask_agent

See the conversation for more information.

simonrosenberg · 2026-05-27T17:21:32Z

I think it's safe to cut the release?

all-hands-bot added integration-test, test-examples, behavior-test labels May 27, 2026

all-hands-bot commented May 27, 2026

Comment thread openhands-sdk/pyproject.toml

simonrosenberg force-pushed the rel-1.24.0 branch 2 times, most recently from 82599a2 to 1da3dc3 Compare May 27, 2026 12:40

enyst reviewed May 27, 2026

enyst mentioned this pull request May 27, 2026

Revert "fix(ci): stop re-labeling unrelated PRs for old Field defaults" #3406

Closed

simonrosenberg force-pushed the rel-1.24.0 branch from bd02939 to 4710fa9 Compare May 27, 2026 15:17

github-actions Bot and others added 3 commits May 27, 2026 18:21

Release v1.24.0

b65ca79

Co-authored-by: openhands <openhands@all-hands.dev>

simonrosenberg force-pushed the rel-1.24.0 branch from 4710fa9 to 17cb597 Compare May 27, 2026 16:22

enyst removed integration-test, behavior-test, test-examples labels May 27, 2026

enyst added integration-test, behavior-test, test-examples labels May 27, 2026 — with OpenHands AI

simonrosenberg mentioned this pull request May 27, 2026

Tracking: unify ACP model selection with the LLM-profile UX OpenHands/agent-canvas#769

Open

12 tasks

simonrosenberg reviewed May 27, 2026

Comment thread openhands-sdk/openhands/sdk/settings/model.py

simonrosenberg approved these changes May 27, 2026

simonrosenberg merged commit fdc2bdf into main May 27, 2026
109 of 110 checks passed

simonrosenberg deleted the rel-1.24.0 branch May 27, 2026 17:38
