Skip to content

feat(controller): add sandbox backend runtime#988

Merged
shiyiyue1102 merged 4 commits into
agentscope-ai:mainfrom
shiyiyue1102:codex/sandbox-openkruise-backend-20260704
Jul 4, 2026
Merged

feat(controller): add sandbox backend runtime#988
shiyiyue1102 merged 4 commits into
agentscope-ai:mainfrom
shiyiyue1102:codex/sandbox-openkruise-backend-20260704

Conversation

@shiyiyue1102

Copy link
Copy Markdown
Collaborator

Summary

  • add Worker spec.backendRuntime / status support for pod and sandbox
  • add OpenKruise Sandbox/SandboxClaim backend lifecycle, status mapping, and watches
  • wire controller config, CRD schema, Helm env, and RBAC for sandbox backend support

Validation

  • go test ./internal/backend ./internal/controller ./internal/config ./internal/app ./internal/remoteclient
  • make test-unit
  • make check-crd-sync
  • git diff --check

Note: local helm lint was not run because the helm binary is not installed in this environment.

Change-Id: I0a21e7c6483a4edea2ca41cb49fd0ae4fcbc9317
@shiyiyue1102 shiyiyue1102 requested a review from maplefeng-a July 4, 2026 01:58
Change-Id: I8bb900f351b39f16dfcf1991e3dad9637b057e59
@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

📊 CI Metrics Report

Summary

Metric Current Baseline Change
LLM Calls 65 81 -16 ↓ -19.8%
Input Tokens 1768376 2803871 -1035495 ↓ -36.9%
Output Tokens 13283 16791 -3508 ↓ -20.9%
Total Tokens 1781659 2820662 -1039003 ↓ -36.8%

By Role

Role Metric Current Baseline Change
🧠 Manager LLM Calls 54 68 -14 ↓ -20.6%
Input Tokens 1536490 2502214 -965724 ↓ -38.6%
Output Tokens 11483 13725 -2242 ↓ -16.3%
Total Tokens 1547973 2515939 -967966 ↓ -38.5%
🔧 Workers LLM Calls 11 13 -2 ↓ -15.4%
Input Tokens 231886 301657 -69771 ↓ -23.1%
Output Tokens 1800 3066 -1266 ↓ -41.3%
Total Tokens 233686 304723 -71037 ↓ -23.3%

Per-Test Breakdown

Test Mgr Calls Wkr Calls Δ Calls Mgr In Wkr In Mgr Out Wkr Out Δ Tokens Trend
02-create-worker 3 0 -9 ↓ -75.0% 81278 0 812 0 -276532 ↓ -77.1% ✅ improved
03-assign-task 18 4 +7 ↑ +46.7% 419058 88281 2743 609 +37035 ↑ +7.8% ⚠️ regressed
04-human-intervene 12 0 -1 ↓ -7.7% 313591 0 1989 0 -117418 ↓ -27.1% ✅ improved
05-heartbeat 6 0 -1 ↓ -14.3% 187089 0 1976 0 -86187 ↓ -31.3% ✅ improved
06-multi-worker 15 7 -12 ↓ -35.3% 535474 143605 3963 1191 -595901 ↓ -46.5% ✅ improved

Trends

4 test(s) improved (fewer LLM calls)
⚠️ 1 test(s) regressed (more LLM calls)


Generated by HiClaw CI on 2026-07-04 08:33:01 UTC


📦 Download debug logs & test artifacts

@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

❌ Integration Tests Failed (llm-interaction / mgr=copaw / wk=hermes)

Commit: 8642c07
Workflow run: #1428

Test Results
No test output captured.
Debug Log (tail)
No debug logs available.

📦 Download full debug logs & test artifacts

@shiyiyue1102

Copy link
Copy Markdown
Collaborator Author

@maplefeng-a CI is green now. Could you take a look when convenient?

@maplefeng-a maplefeng-a left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. I like the direction of adding spec.backendRuntime=sandbox, but I found two merge blockers in the current wiring.

  1. The sandbox backend does not pass the worker runtime environment or auth token into the SandboxClaim path.

createMemberContainer builds workerEnv and requests an auth token for non-k8s backends, but SandboxBackend.Create does not consume req.Env or req.AuthToken. It only forwards WorkersDeps into inplaceUpdate / dynamicVolumesMount, and the controller currently never populates WorkersDeps. As a result, a backendRuntime=sandbox Worker can create a SandboxClaim without the HICLAW_* env, Matrix settings, controller URL, or auth material that the worker needs to start and authenticate.

Could you wire the env/auth delivery path for sandbox Workers, and add a regression test that proves a sandbox Worker create request carries the generated runtime env/auth material into the provider-facing SandboxClaim/deps contract?

  1. Existing Workers with empty status.backendRuntime can bypass the backend switch guard.

For Workers created before this field existed, status.backendRuntime is empty. If such a running Worker is changed to spec.backendRuntime=sandbox, ensureMemberContainerPresent currently defaults currentBackend to the desired backend. That makes the reconciler query the sandbox backend, see NotFound, and create a sandbox runtime instead of treating the existing runtime as the legacy pod backend and blocking the switch until Stopped. This can leave the old Pod and the new Sandbox running at the same time.

Could you default an empty applied backend to pod for existing running Workers, keep the “backendRuntime cannot change until Stopped” behavior, and cover the upgrade/migration case in tests?

Change-Id: Ibd8ba0b20291c064b01975885271fda86bca6441
@shiyiyue1102

Copy link
Copy Markdown
Collaborator Author

@maplefeng-a Thanks for catching these. Fixed both blockers: sandbox Worker create now carries the generated runtime env/auth material into the sandbox deps/dynamic mount contract, and empty applied backendRuntime is treated as legacy pod so pod-to-sandbox switches stay blocked until the Worker is Stopped. Added regression coverage for both paths.

@shiyiyue1102 shiyiyue1102 requested a review from maplefeng-a July 4, 2026 07:40

@maplefeng-a maplefeng-a left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick update. The legacy backend switch guard looks fixed now: empty status.backendRuntime is treated as pod, and the regression test covers the pod-to-sandbox switch case.

I think the env/auth delivery path still needs one more change before merge. The new WorkerDeps carries Env and AuthToken inside the Go struct, but the provider-facing SandboxClaim only serializes inplaceUpdate and dynamicVolumesMount. I could not find a path that writes WorkerDeps.Env / WorkerDeps.AuthToken into workers-deps/<name>/env or workers-deps/<name>/token before the claim is created. That means the sandbox can receive mount declarations without the actual HICLAW_* env/auth material behind those mounts.

Could you either wire the materialization/provider contract so the generated env/token are actually available through those dynamic mounts, with a regression test for that, or point to/update the source that already consumes WorkerDeps.Env / WorkerDeps.AuthToken and backs those subPaths?

Change-Id: I2fae3a7e3cd834f288ba7690ce740039a19ecd09
@shiyiyue1102

Copy link
Copy Markdown
Collaborator Author

@maplefeng-a Thanks, you were right. Fixed by materializing the generated sandbox Worker env and auth token into the worker-deps storage paths before creating the SandboxClaim, and added regression coverage for both the storage writes and the controller call ordering.

@shiyiyue1102 shiyiyue1102 requested a review from maplefeng-a July 4, 2026 08:51

@maplefeng-a maplefeng-a left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the follow-up. I re-checked the latest head and the previous blockers are addressed: sandbox Worker deps are materialized before SandboxClaim creation, the dynamic mount subPaths match the storage keys, and the legacy pod-to-sandbox switch guard remains covered. CI and the controller test suite are green from my side.

@shiyiyue1102 shiyiyue1102 merged commit c39d1b4 into agentscope-ai:main Jul 4, 2026
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants