Release v1.24.0 #3401
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.
Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.
Python API breakage checks: ✅ PASSED
REST API breakage checks (OpenAPI): ✅ PASSED
all-hands-bot
left a comment
🟡 Acceptable release bump, but I can’t approve this release PR yet.
Issues to address before approval:
- `Deprecation deadlines` is failing for 1.24.0: `register_tool(callable_factory)` and the `openhands.sdk.settings` import shim have reached their `removed in: 1.24.0` deadline.
- Required release validation is not complete/current yet: `Run tests` still has in-progress jobs and no coverage report comment, `Run Examples Scripts` is still in progress/no result comment, and `Run Integration Tests` is still in progress/no final results comment. Please wait for these to pass on `19e34c7a376eb21734385de2074b629061313c00`, then have a human maintainer review.
Risk: 🟡 Medium — release publication should not proceed with failed deprecation cleanup and incomplete release validation.
This review was created by an AI agent (OpenHands) on behalf of the user.
Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26508259686
🧪 Integration Tests Results
Overall Success Rate: 100.0%
📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results
litellm_proxy_minimax_MiniMax_M2.7
Skipped Tests:
litellm_proxy_gemini_3.1_pro_preview
litellm_proxy_deepseek_deepseek_v4_flash
Skipped Tests:
litellm_proxy_openai_gpt_5.5
|
all-hands-bot
left a comment
⚠️ QA Report: PASS WITH ISSUES
Release-version behavior works in real execution, but the PR is not fully release-ready because the deprecation-deadline gate is failing and several release checks are still pending.
Does this PR achieve its stated goal?
Partially. I verified the user-visible release outputs: package metadata/imports, built wheels, the running agent-server /server_info endpoint, and the Run Eval workflow default all moved from 1.23.1/v1.23.1 on main to 1.24.0/v1.24.0 on this PR. However, the stated release checklist includes deprecation-deadline cleanup, and CI currently fails that gate for two SDK deprecations with removed in: 1.24.0, so I would not consider the release fully prepared yet.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed and installed editable packages at 1.24.0 |
| CI Status | ⚠️ One failing check (Deprecation deadlines) and pending release checks (Run tests, Run Examples Scripts, Run Integration Tests) at review time |
| Functional Verification | ✅ Actual imports, server endpoint, workflow config parse, and distribution build all reported 1.24.0 |
Functional Verification
Test 1: Installed package metadata and imports
Step 1 — Establish baseline on main:
Ran uv run --project /tmp/qa-sdk-main-baseline python /tmp/qa_versions.py:
openhands-sdk=1.23.1
openhands-tools=1.23.1
openhands-workspace=1.23.1
openhands-agent-server=1.23.1
sdk_import=ok:Agent
tools_import=ok:file_editor
This shows the pre-release baseline exposes 1.23.1 package metadata while core SDK/tools imports work.
Step 2 — Apply the PR's changes:
Checked out rel-1.24.0 at 19e34c7a376eb21734385de2074b629061313c00 and ran the repository setup with make build.
Step 3 — Re-run with the PR in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_versions.py:
openhands-sdk=1.24.0
openhands-tools=1.24.0
openhands-workspace=1.24.0
openhands-agent-server=1.24.0
sdk_import=ok:Agent
tools_import=ok:file_editor
This confirms the installed packages a Python user imports now expose 1.24.0 and still import successfully.
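The contents of /tmp/qa_versions.py are not shown in this report. A minimal sketch of a script that could print output in this shape, assuming it reads installed distribution metadata through `importlib.metadata`; the helper name `report_versions` and its injectable `lookup` parameter are hypothetical and exist only to make the sketch easy to check:

```python
from importlib import metadata
from typing import Callable

PACKAGES = [
    "openhands-sdk",
    "openhands-tools",
    "openhands-workspace",
    "openhands-agent-server",
]


def report_versions(
    packages: list[str],
    lookup: Callable[[str], str] = metadata.version,
) -> list[str]:
    """Return one 'name=version' line per package (hypothetical helper)."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}={lookup(name)}")
        except metadata.PackageNotFoundError:
            # Report a missing distribution instead of crashing.
            lines.append(f"{name}=not-installed")
    return lines


if __name__ == "__main__":
    print("\n".join(report_versions(PACKAGES)))
```

Running the same script in the baseline and PR environments keeps the comparison like-for-like, since only the installed metadata differs.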
Test 2: Running agent-server reports release versions
Step 1 — Establish baseline on main:
Started the server with uv run python -m openhands.agent_server --host 127.0.0.1 --port 8765 and requested /server_info:
{"version":"1.23.1","sdk_version":"1.23.1","tools_version":"1.23.1","workspace_version":"1.23.1"}
This shows the real server API previously surfaced 1.23.1 across the server, SDK, tools, and workspace versions.
Step 2 — Apply the PR's changes:
Started the same server command from the PR checkout on port 8766.
Step 3 — Re-run with the PR in place:
Requested http://127.0.0.1:8766/server_info:
{"version":"1.24.0","sdk_version":"1.24.0","tools_version":"1.24.0","workspace_version":"1.24.0"}
This confirms a real API consumer sees the intended 1.24.0 release versions.
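The /server_info comparison can also be checked mechanically. A small sketch, assuming the response body is the JSON object shown above; the function name `version_mismatches` is hypothetical, and no network call is made here (the response text is pasted in as a string):

```python
import json

VERSION_FIELDS = ("version", "sdk_version", "tools_version", "workspace_version")


def version_mismatches(payload: str, expected: str) -> dict[str, str]:
    """Return the /server_info fields whose value differs from `expected`."""
    info = json.loads(payload)
    return {
        field: info.get(field, "<missing>")
        for field in VERSION_FIELDS
        if info.get(field) != expected
    }


# Response body copied from the PR run above.
response = (
    '{"version":"1.24.0","sdk_version":"1.24.0",'
    '"tools_version":"1.24.0","workspace_version":"1.24.0"}'
)
mismatches = version_mismatches(response, "1.24.0")
print(mismatches or "all versions match")
```

An empty result means every reported component version equals the release version being cut.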
Test 3: Release distribution build metadata
Step 1 — Establish baseline:
The baseline package metadata check above established the current release line as 1.23.1.
Step 2 — Apply the PR's changes:
Built release artifacts from the PR with uv build --all-packages --out-dir /tmp/qa-dist-1.24.0.
Step 3 — Inspect built wheels:
Read wheel METADATA from the generated artifacts:
openhands_agent_server-1.24.0-py3-none-any.whl -> openhands-agent-server 1.24.0
openhands_sdk-1.24.0-py3-none-any.whl -> openhands-sdk 1.24.0
openhands_tools-1.24.0-py3-none-any.whl -> openhands-tools 1.24.0
openhands_workspace-1.24.0-py3-none-any.whl -> openhands-workspace 1.24.0
This confirms the artifacts that would be published carry the intended release version.
Test 4: Run Eval workflow default
Step 1 — Establish baseline on main:
Parsed .github/workflows/run-eval.yml with yaml.BaseLoader:
main run-eval sdk_ref default=v1.23.1
This shows manual eval dispatch defaulted to the prior release.
Step 2 — Apply the PR's changes:
Parsed the same workflow in the PR checkout.
Step 3 — Re-run with the PR in place:
PR run-eval sdk_ref default=v1.24.0
This confirms a workflow-dispatch user gets the new release tag by default.
CI Evidence
Latest gh pr checks 3401 --repo OpenHands/software-agent-sdk summary at review time:
bucket_counts={'skipping': 16, 'pending': 14, 'pass': 19, 'fail': 1}
[Run tests]
- IN_PROGRESS (pending): tools-tests
- SUCCESS (pass): sdk-tests, workspace-tests, cross-tests, agent-server-tests, Test directory allowlist, windows-tests
[Run Examples Scripts]
- IN_PROGRESS (pending): test-examples
[Run Integration Tests]
- pending jobs remain for gpt-5.5, minimax-m2.7, and deepseek-v4-flash; one gemini job passed
[Deprecation deadlines]
- FAILURE (fail): check
[Version bump guard]
- SUCCESS (pass): Check package versions
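The bucket counts above reduce to a simple readiness rule. A sketch under the assumption that each check is available as a record with a `bucket` field (the record shape and both helper names are hypothetical), and that any failing or pending check blocks the release:

```python
from collections import Counter


def summarize(checks: list[dict[str, str]]) -> Counter:
    """Count checks per bucket (pass, fail, pending, skipping)."""
    return Counter(check["bucket"] for check in checks)


def blocking_buckets(counts: Counter) -> list[str]:
    """Assumed policy: any failing or pending check blocks the release."""
    return [bucket for bucket in ("fail", "pending") if counts.get(bucket, 0) > 0]


# Counts copied from the summary above.
counts = Counter({"skipping": 16, "pending": 14, "pass": 19, "fail": 1})
print(blocking_buckets(counts))
```

Skipped checks are deliberately ignored in this rule; whether they should count is a policy decision for the maintainers.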
Failing Deprecation deadlines job excerpt:
The following deprecated features have passed their removal deadline:
- [openhands-sdk] 'register_tool(callable_factory)' (warn_call)
deprecated in: 1.19.1
removed in: 1.24.0
defined at: openhands-sdk/openhands/sdk/tool/registry.py:163
- [openhands-sdk] f'Importing {name!r} from openhands.sdk.settings' (warn_call)
deprecated in: 1.19.0
removed in: 1.24.0
defined at: openhands-sdk/openhands/sdk/settings/__init__.py:122
Update or remove the listed features before publishing a version that meets or exceeds their removal deadline.
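The rule this gate applies can be illustrated with a short sketch. This is not the repository's checker: the `Deprecation` record, `parse_version`, and `overdue` are hypothetical names, and versions are assumed to be plain `X.Y.Z` strings. A feature is flagged once the version being published meets or exceeds its `removed in` value:

```python
from dataclasses import dataclass


def parse_version(text: str) -> tuple[int, ...]:
    """Parse a plain 'X.Y.Z' version; pre-release suffixes are not handled."""
    return tuple(int(part) for part in text.split("."))


@dataclass(frozen=True)
class Deprecation:
    feature: str
    deprecated_in: str
    removed_in: str


def overdue(deprecations: list[Deprecation], current: str) -> list[Deprecation]:
    """Return deprecations whose removal deadline is at or below `current`."""
    now = parse_version(current)
    return [d for d in deprecations if parse_version(d.removed_in) <= now]


# The two entries reported in the job excerpt above.
REGISTERED = [
    Deprecation("register_tool(callable_factory)", "1.19.1", "1.24.0"),
    Deprecation("Importing {name!r} from openhands.sdk.settings", "1.19.0", "1.24.0"),
]

for item in overdue(REGISTERED, "1.24.0"):
    print(f"{item.feature}: removed in {item.removed_in}")
```

Comparing parsed tuples rather than raw strings matters here: as strings, "1.9.0" would sort after "1.24.0".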
Issues Found
- 🟠 Issue: The release-version behavior is correct, but the PR is not fully release-ready while the `Deprecation deadlines` workflow fails for two SDK deprecations whose removal target is `1.24.0`.
- 🟡 Minor / Status: Release-critical CI was still pending at review time (`Run tests` tools job, `Run Examples Scripts`, and multiple `Run Integration Tests` jobs), so final release readiness still depends on those completing successfully.
This QA review was created by an AI agent (OpenHands) on behalf of the user.
Coverage Report •
🔄 Running Examples with
|
| Example | Status | Duration | Cost |
|---|---|---|---|
| 01_standalone_sdk/02_custom_tools.py | ✅ PASS | 24.4s | $0.03 |
| 01_standalone_sdk/03_activate_skill.py | ✅ PASS | 24.6s | $0.03 |
| 01_standalone_sdk/05_use_llm_registry.py | ✅ PASS | 11.6s | $0.01 |
| 01_standalone_sdk/07_mcp_integration.py | ✅ PASS | 35.4s | $0.03 |
| 01_standalone_sdk/09_pause_example.py | ✅ PASS | 12.1s | $0.01 |
| 01_standalone_sdk/10_persistence.py | ✅ PASS | 32.9s | $0.03 |
| 01_standalone_sdk/11_async.py | ✅ PASS | 26.6s | $0.03 |
| 01_standalone_sdk/12_custom_secrets.py | ✅ PASS | 8.4s | $0.00 |
| 01_standalone_sdk/13_get_llm_metrics.py | ✅ PASS | 36.6s | $0.03 |
| 01_standalone_sdk/14_context_condenser.py | ✅ PASS | 2m 27s | $0.17 |
| 01_standalone_sdk/17_image_input.py | ✅ PASS | 20.0s | $0.02 |
| 01_standalone_sdk/18_send_message_while_processing.py | ✅ PASS | 17.3s | $0.02 |
| 01_standalone_sdk/19_llm_routing.py | ✅ PASS | 16.0s | $0.02 |
| 01_standalone_sdk/20_stuck_detector.py | ✅ PASS | 11.1s | $0.02 |
| 01_standalone_sdk/21_generate_extraneous_conversation_costs.py | ✅ PASS | 10.2s | $0.00 |
| 01_standalone_sdk/22_anthropic_thinking.py | ✅ PASS | 18.1s | $0.02 |
| 01_standalone_sdk/23_responses_reasoning.py | ✅ PASS | 1m 31s | $0.02 |
| 01_standalone_sdk/24_planning_agent_workflow.py | ✅ PASS | 8m 17s | $0.62 |
| 01_standalone_sdk/25_agent_delegation.py | ✅ PASS | 1m 27s | $0.09 |
| 01_standalone_sdk/26_custom_visualizer.py | ✅ PASS | 16.3s | $0.03 |
| 01_standalone_sdk/28_ask_agent_example.py | ✅ PASS | 33.8s | $0.02 |
| 01_standalone_sdk/29_llm_streaming.py | ✅ PASS | 33.5s | $0.02 |
| 01_standalone_sdk/30_tom_agent.py | ✅ PASS | 17.8s | $0.02 |
| 01_standalone_sdk/31_iterative_refinement.py | ✅ PASS | 5m 23s | $0.42 |
| 01_standalone_sdk/32_configurable_security_policy.py | ✅ PASS | 16.7s | $0.02 |
| 01_standalone_sdk/33_hooks/main.py | ✅ PASS | 31.7s | $0.04 |
| 01_standalone_sdk/34_critic_example.py | ✅ PASS | 8m 52s | $0.61 |
| 01_standalone_sdk/36_event_json_to_openai_messages.py | ✅ PASS | 9.6s | $0.00 |
| 01_standalone_sdk/37_llm_profile_store/main.py | ✅ PASS | 4.2s | $0.00 |
| 01_standalone_sdk/38_browser_session_recording.py | ✅ PASS | 38.0s | $0.03 |
| 01_standalone_sdk/39_llm_fallback.py | ✅ PASS | 9.5s | $0.00 |
| 01_standalone_sdk/40_acp_agent_example.py | ✅ PASS | 31.5s | $0.32 |
| 01_standalone_sdk/41_task_tool_set.py | ✅ PASS | 45.8s | $0.04 |
| 01_standalone_sdk/42_file_based_subagents.py | ✅ PASS | 28.8s | $0.04 |
| 01_standalone_sdk/43_mixed_marketplace_skills/main.py | ✅ PASS | 3.3s | $0.00 |
| 01_standalone_sdk/44_model_switching_in_convo.py | ✅ PASS | 7.1s | $0.01 |
| 01_standalone_sdk/45_parallel_tool_execution.py | ✅ PASS | 7m 8s | $0.53 |
| 01_standalone_sdk/46_agent_settings.py | ✅ PASS | 12.0s | $0.01 |
| 01_standalone_sdk/47_defense_in_depth_security.py | ✅ PASS | 4.9s | $0.00 |
| 01_standalone_sdk/48_conversation_fork.py | ✅ PASS | 13.7s | $0.00 |
| 01_standalone_sdk/49_switch_llm_tool.py | ✅ PASS | 7.2s | $0.03 |
| 01_standalone_sdk/50_async_cancellation.py | ✅ PASS | 12.1s | $0.01 |
| 02_remote_agent_server/01_convo_with_local_agent_server.py | ✅ PASS | 40.3s | $0.03 |
| 02_remote_agent_server/02_convo_with_docker_sandboxed_server.py | ✅ PASS | 1m 38s | $0.06 |
| 02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py | ✅ PASS | 1m 32s | $0.08 |
| 02_remote_agent_server/04_convo_with_api_sandboxed_server.py | ✅ PASS | 1m 46s | $0.03 |
| 02_remote_agent_server/06_custom_tool/main.py | ✅ PASS | 4m 29s | $0.03 |
| 02_remote_agent_server/07_convo_with_cloud_workspace.py | ✅ PASS | 42.6s | $0.04 |
| 02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py | ✅ PASS | 3m 26s | $0.02 |
| 02_remote_agent_server/09_acp_agent_with_remote_runtime.py | ✅ PASS | 47.1s | $0.11 |
| 02_remote_agent_server/10_cloud_workspace_share_credentials.py | ✅ PASS | 30.9s | $0.04 |
| 02_remote_agent_server/11_conversation_fork.py | ✅ PASS | 1m 44s | $0.00 |
| 02_remote_agent_server/12_settings_and_secrets_api.py | ✅ PASS | 2m 27s | $0.02 |
| 02_remote_agent_server/13_workspace_get_llm.py | ✅ PASS | 31.0s | $0.02 |
| 04_llm_specific_tools/01_gpt5_apply_patch_preset.py | ✅ PASS | 20.1s | $0.02 |
| 04_llm_specific_tools/02_gemini_file_tools.py | ✅ PASS | 43.6s | $0.05 |
| 05_skills_and_plugins/01_loading_agentskills/main.py | ✅ PASS | 16.6s | $0.02 |
| 05_skills_and_plugins/02_loading_plugins/main.py | ✅ PASS | 2m 51s | $0.02 |
✅ All tests passed!
Total: 58 | Passed: 58 | Failed: 0 | Total Cost: $4.01
🧪 Integration Tests Results
Overall Success Rate: 100.0%
📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results
litellm_proxy_minimax_MiniMax_M2.7
litellm_proxy_gemini_3.1_pro_preview
litellm_proxy_deepseek_deepseek_v4_flash
litellm_proxy_openai_gpt_5.5
|
82599a2 to 1da3dc3
Wow this is a funny one, 13.25. Everything else is much less 🤷 I'm guessing it's nothing, just random fluke, Gemini wants to think round-n-round |
Deprecation-deadline cleanup in this release
Cutting
api-breakage now reports the |
|
@OpenHands /codereview this PR, note that it's release PR so pay attention to specifics. Post directly on the PR. |
|
I'm on it! enyst can track my progress at all-hands.dev |
enyst
left a comment
🟡 Taste Rating: Acceptable — the code changes are pragmatic for a release cleanup, but I’m not approving yet because the release-specific validation is stale on the current head.
[TESTING GAPS]
- [Release validation] Stale required release workflows: current PR head is `bd02939bef29ccc894b246bcadd2484805fd6edd`. I verified `Run tests` passed on that head (26514653367, coverage comment updated to link `bd02939`). However:
  - `Run Examples Scripts` last passed on `19e34c7a376eb21734385de2074b629061313c00` (26508259690), before the deprecation-deadline cleanup commits.
  - `Run Integration Tests` last passed on `19e34c7a376eb21734385de2074b629061313c00` for both the `integration-test` and `behavior-test` label runs (26508259747, 26508259849).
Per the release PR review guideline, I can’t approve a release PR until the latest PR-specific Run tests, Run Examples Scripts, and Run Integration Tests results all match the current PR state. Please rerun examples + integration/behavior against bd02939 (or have a human maintainer explicitly accept the stale validation before merging).
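The freshness rule described above amounts to comparing each workflow's last passing commit with the current head. A minimal sketch, with the workflow names and SHAs taken from this review and the helper name `stale_workflows` being hypothetical:

```python
REQUIRED_WORKFLOWS = ("Run tests", "Run Examples Scripts", "Run Integration Tests")


def stale_workflows(head_sha: str, last_passed: dict[str, str]) -> list[str]:
    """List required workflows whose last passing run is not on `head_sha`."""
    return [
        name
        for name in REQUIRED_WORKFLOWS
        if last_passed.get(name) != head_sha
    ]


HEAD = "bd02939bef29ccc894b246bcadd2484805fd6edd"
OLD = "19e34c7a376eb21734385de2074b629061313c00"

# Last passing commit per workflow, as reported in this review.
last_passed = {
    "Run tests": HEAD,
    "Run Examples Scripts": OLD,
    "Run Integration Tests": OLD,
}
print(stale_workflows(HEAD, last_passed))
```

A workflow with no recorded passing run is treated as stale too, because `dict.get` returns `None` for it.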
Code-wise, I didn’t find a blocker in the actual cleanup: register_tool(callable_factory) is removed with tests migrated to ToolDefinition subclasses/instances, and the LLMAgentSettings public aliases are removed while the model class remains in settings.model for legacy payload deserialization. The API-breakage checker update for _DEPRECATED_SDK_EXPORTS is narrow and covered by targeted tests.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM
This is a release PR that publishes package versions and removes two deprecated public surfaces. The implementation looks intentional and the API/REST/unit gates are green on current head, but stale examples/integration/behavior validation means the release checklist is not fully proven for the final commit.
VERDICT:
KEY INSIGHT:
The deprecation removals are aligned with the scheduled 1.24.0 cleanup; the remaining risk is release-process validation freshness, not code structure.
This review was created by an AI agent (OpenHands) on behalf of @enyst.
Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:
- Add a `.agents/skills/custom-codereview-guide.md` file to your branch (or edit it if one already exists) with the `/codereview` trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.
- Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
- When your PR is merged, the guideline file goes through normal code review by repository maintainers.
Resolve with AI? Install the iterate skill in your agent and run `/iterate` to automatically drive this PR through CI, review, and QA until it's merge-ready.
Was this review helpful? React with 👍 or 👎 to give feedback.
|
OpenHands encountered an error: Request timeout after 30 seconds to https://balwusrsvebknnow.prod-runtime.all-hands.dev/api/conversations/e31d3d2e-aef6-4df1-890d-2d43d9eb60cd/ask_agent See the conversation for more information. |
bd02939 to 4710fa9
|
Rebased onto latest |
Co-authored-by: openhands <openhands@all-hands.dev>
…moval to 1.25.0
Cutting v1.24.0 trips the deprecation-deadline check for two features whose
removal was scheduled for 1.24.0. Handle each per its actual removability:
* register_tool(callable_factory) (deprecated 1.19.1): REMOVED. register_tool
now accepts only a ToolDefinition instance or subclass; the dead
_resolver_from_callable / _usability_from_callable helpers and the callable
branch are dropped. 16 call sites across 6 test files still used the callable
form and are migrated to register the ToolDefinition subclass directly, a
prebuilt instance, or -- where conv_state is needed at resolve time -- a small
ToolDefinition subclass. Clean under the SDK api-breakage gate: register_tool
stays exported, only its accepted-arg union narrows.
* LLMAgentSettings import aliases (deprecated 1.19.0): KEPT; deadline moved to
1.25.0. Removing them now fails the api-breakage gate -- the published 1.23.1
baseline deprecated them via a module __getattr__ calling
warn_deprecated(f"Importing {name!r} ..."), an f-string the breakage checker
cannot statically detect, so the removal reads as unsanctioned and there is no
override. Re-expressed with a literal warn_deprecated("LLMAgentSettings", ...)
feature name so the 1.25.0 baseline carries a detectable record and the removal
passes the gate next minor.
Tests: callable-factory registration now asserts TypeError; the qualname
callable test is dropped; the 16 callers above are migrated. LLMAgentSettings
alias test is unchanged (still warns).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
With the api-breakage checker now recognizing the _DEPRECATED_SDK_EXPORTS registry (cherry-picked from #3402), removing LLMAgentSettings -- deprecated in 1.19.0 with removed_in 1.24.0 -- is sanctioned and lands in this release instead of being deferred.
- Drop the public import aliases from `openhands.sdk` and `openhands.sdk.settings` (the __all__ entries, TYPE_CHECKING imports, and the __getattr__ / _DEPRECATED_SDK_EXPORTS shims). `from openhands.sdk import LLMAgentSettings` and the settings-level import now raise.
- Retain the LLMAgentSettings *class* at `openhands.sdk.settings.model`: it is a live member of the settings discriminated union (agent_kind="llm") so legacy payloads still deserialize and the API-breakage field-value check is unchanged.
- Rewrite the alias test to assert removal; update class/union docstrings.
Verified: deprecation-deadline check clean; api-breakage reports a sanctioned scheduled removal (::notice, exit 0); pyright clean; 79 affected tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4710fa9 to 17cb597
|
@OpenHands Remove and re-add the 3 release tests labels, so that we have them run on the actual recent release PR. Wait until all three are done, then tell us directly on PR as a comment WDYT about the results. |
|
I'm on it! enyst can track my progress at all-hands.dev |
|
Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly. |
|
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly. |
🔄 Running Examples with
|
| Example | Status | Duration | Cost |
|---|---|---|---|
| 01_standalone_sdk/02_custom_tools.py | ✅ PASS | 48.7s | $0.07 |
| 01_standalone_sdk/03_activate_skill.py | ✅ PASS | 20.8s | $0.03 |
| 01_standalone_sdk/05_use_llm_registry.py | ✅ PASS | 11.1s | $0.01 |
| 01_standalone_sdk/07_mcp_integration.py | ✅ PASS | 26.2s | $0.02 |
| 01_standalone_sdk/09_pause_example.py | ✅ PASS | 10.4s | $0.01 |
| 01_standalone_sdk/10_persistence.py | ✅ PASS | 29.2s | $0.02 |
| 01_standalone_sdk/11_async.py | ✅ PASS | 28.4s | $0.04 |
| 01_standalone_sdk/12_custom_secrets.py | ✅ PASS | 8.7s | $0.00 |
| 01_standalone_sdk/13_get_llm_metrics.py | ✅ PASS | 57.7s | $0.06 |
| 01_standalone_sdk/14_context_condenser.py | ✅ PASS | 2m 14s | $0.16 |
| 01_standalone_sdk/17_image_input.py | ✅ PASS | 25.4s | $0.02 |
| 01_standalone_sdk/18_send_message_while_processing.py | ✅ PASS | 23.9s | $0.02 |
| 01_standalone_sdk/19_llm_routing.py | ✅ PASS | 14.2s | $0.02 |
| 01_standalone_sdk/20_stuck_detector.py | ✅ PASS | 15.6s | $0.02 |
| 01_standalone_sdk/21_generate_extraneous_conversation_costs.py | ✅ PASS | 8.7s | $0.00 |
| 01_standalone_sdk/22_anthropic_thinking.py | ✅ PASS | 19.0s | $0.01 |
| 01_standalone_sdk/23_responses_reasoning.py | ✅ PASS | 1m 8s | $0.01 |
| 01_standalone_sdk/24_planning_agent_workflow.py | ✅ PASS | 5m 20s | $0.37 |
| 01_standalone_sdk/25_agent_delegation.py | ✅ PASS | 1m 7s | $0.07 |
| 01_standalone_sdk/26_custom_visualizer.py | ✅ PASS | 16.9s | $0.03 |
| 01_standalone_sdk/28_ask_agent_example.py | ✅ PASS | 38.7s | $0.04 |
| 01_standalone_sdk/29_llm_streaming.py | ✅ PASS | 29.0s | $0.02 |
| 01_standalone_sdk/30_tom_agent.py | ✅ PASS | 7.9s | $0.01 |
| 01_standalone_sdk/31_iterative_refinement.py | ✅ PASS | 4m 49s | $0.36 |
| 01_standalone_sdk/32_configurable_security_policy.py | ✅ PASS | 22.7s | $0.03 |
| 01_standalone_sdk/33_hooks/main.py | ✅ PASS | 46.7s | $0.05 |
| 01_standalone_sdk/34_critic_example.py | ✅ PASS | 8m 0s | $0.68 |
| 01_standalone_sdk/36_event_json_to_openai_messages.py | ✅ PASS | 8.4s | $0.00 |
| 01_standalone_sdk/37_llm_profile_store/main.py | ✅ PASS | 3.5s | $0.00 |
| 01_standalone_sdk/38_browser_session_recording.py | ✅ PASS | 39.1s | $0.03 |
| 01_standalone_sdk/39_llm_fallback.py | ✅ PASS | 9.0s | $0.01 |
| 01_standalone_sdk/40_acp_agent_example.py | ✅ PASS | 29.1s | $0.32 |
| 01_standalone_sdk/41_task_tool_set.py | ✅ PASS | 23.0s | $0.03 |
| 01_standalone_sdk/42_file_based_subagents.py | ✅ PASS | 41.6s | $0.05 |
| 01_standalone_sdk/43_mixed_marketplace_skills/main.py | ✅ PASS | 2.9s | $0.00 |
| 01_standalone_sdk/44_model_switching_in_convo.py | ✅ PASS | 7.1s | $0.01 |
| 01_standalone_sdk/45_parallel_tool_execution.py | ✅ PASS | 6m 29s | $0.54 |
| 01_standalone_sdk/46_agent_settings.py | ✅ PASS | 11.4s | $0.01 |
| 01_standalone_sdk/47_defense_in_depth_security.py | ✅ PASS | 2.8s | $0.00 |
| 01_standalone_sdk/48_conversation_fork.py | ✅ PASS | 11.7s | $0.00 |
| 01_standalone_sdk/49_switch_llm_tool.py | ✅ PASS | 11.5s | $0.03 |
| 01_standalone_sdk/50_async_cancellation.py | ✅ PASS | 12.0s | $0.00 |
| 02_remote_agent_server/01_convo_with_local_agent_server.py | ✅ PASS | 30.7s | $0.02 |
| 02_remote_agent_server/02_convo_with_docker_sandboxed_server.py | ✅ PASS | 1m 51s | $0.04 |
| 02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py | ✅ PASS | 1m 5s | $0.06 |
| 02_remote_agent_server/04_convo_with_api_sandboxed_server.py | ✅ PASS | 1m 49s | $0.05 |
| 02_remote_agent_server/06_custom_tool/main.py | ✅ PASS | 6m 3s | $0.04 |
| 02_remote_agent_server/07_convo_with_cloud_workspace.py | ✅ PASS | 38.2s | $0.03 |
| 02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py | ✅ PASS | 4m 19s | $0.02 |
| 02_remote_agent_server/09_acp_agent_with_remote_runtime.py | ✅ PASS | 1m 1s | $0.14 |
| 02_remote_agent_server/10_cloud_workspace_share_credentials.py | ✅ PASS | 37.1s | $0.06 |
| 02_remote_agent_server/11_conversation_fork.py | ✅ PASS | 1m 43s | $0.00 |
| 02_remote_agent_server/12_settings_and_secrets_api.py | ✅ PASS | 2m 24s | $0.02 |
| 02_remote_agent_server/13_workspace_get_llm.py | ✅ PASS | 1m 16s | $0.04 |
| 04_llm_specific_tools/01_gpt5_apply_patch_preset.py | ✅ PASS | 19.9s | $0.03 |
| 04_llm_specific_tools/02_gemini_file_tools.py | ✅ PASS | 1m 1s | $0.09 |
| 05_skills_and_plugins/01_loading_agentskills/main.py | ✅ PASS | 13.8s | $0.02 |
| 05_skills_and_plugins/02_loading_plugins/main.py | ✅ PASS | 22.7s | $0.02 |
✅ All tests passed!
Total: 58 | Passed: 58 | Failed: 0 | Total Cost: $3.91
🧪 Integration Tests Results
Overall Success Rate: 100.0%
📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results
litellm_proxy_minimax_MiniMax_M2.7
Skipped Tests:
litellm_proxy_deepseek_deepseek_v4_flash
Skipped Tests:
litellm_proxy_gemini_3.1_pro_preview
litellm_proxy_openai_gpt_5.5
|
|
@OpenHands see the comment here #3401 (comment) and I think the same 4 LLMs are for behavior tests and integration tests. Make an investigation in recent history to tell me: we used to run them for a Claude, probably a Sonnet IIRC; where/when did we “lose” that Claude? This is a side question, not part of the release PR, so investigate deeply from main branch and make a new issue on the repo where you tag me and explain to me what happened with those LLMs. Do NOT touch the release PR. |
|
I'm on it! enyst can track my progress at all-hands.dev |
|
OpenHands encountered an error: Request timeout after 30 seconds to https://yqvmoyebykrkpsdo.prod-runtime.all-hands.dev/api/conversations/6375e3a7-fc47-411f-8e0f-4b8b7952fe95/ask_agent See the conversation for more information. |
🧪 Integration Tests Results
Overall Success Rate: 95.0%
📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results
litellm_proxy_minimax_MiniMax_M2.7
litellm_proxy_deepseek_deepseek_v4_flash
litellm_proxy_gemini_3.1_pro_preview
Failed Tests:
However, the agent's approach violated the evaluation criterion "without unnecessary verification." Specifically: Over-verification Issues:
What Should Have Happened:
Positive Aspects:
Problematic Aspects:
The task was accomplishable in ~10-15 iterations; the agent took ~50+ iterations largely due to over-verification and repeated failed attempts at broader testing. (confidence=0.65) (Cost: $0.77)
litellm_proxy_openai_gpt_5.5
|
|
I removed and re-added the three release-test labels at ~16:36 UTC and waited for the fresh runs to finish. Results from the new runs:
WDYT: integration and examples look release-ready to me. I would not call the behavior-test checklist item fully green yet, because the latest behavior report still has a real reported failure. That said, the failure looks like a model/agent behavioral flake around over-verification and tmux recovery loops, not an obvious SDK release-code regression: Gemini completed the requested code change and targeted test, then failed the behavior criterion by continuing with unnecessary broader verification. If maintainers are comfortable accepting that as known/flaky behavior, I think the release can proceed; otherwise I’d rerun
This PR comment was created by an AI agent (OpenHands) on behalf of the requester.
|
OpenHands encountered an error: Request timeout after 30 seconds to https://owkbyecunjtnbgml.prod-runtime.all-hands.dev/api/conversations/abde2eab-bfaa-454e-96d6-2bdf28600047/ask_agent See the conversation for more information. |
I think it's safe to cut the release? |
Release v1.24.0
This PR prepares the release for version 1.24.0.
Release Checklist
- integration-test)
- behavior-test)
- test-examples)
- `release-note-required` PRs are accurately called out in the final release notes
What happens on merge
When this PR is merged, the `create-release.yml` workflow will automatically:
- `v1.24.0` and auto-generated notes, plus an explicit preamble for merged `release-note-required` PRs
- `pypi-release.yml` to publish all packages to PyPI
- `version-bump-prs.yml` to create downstream version bump PRs
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdk
nikolaik/python-nodejs:python3.13-nodejs22-slim
golang:1.21-bookworm
Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:17cb597-python
Run
All tags pushed for this build
About Multi-Architecture Support
- 17cb597-python) is a multi-arch manifest supporting both amd64 and arm64
- 17cb597-python-amd64) are also available if needed