fix(cubelet): converge pause/resume state on ttrpc errors and shim events#404
Conversation
…ents When task.Pause/Resume returned a ttrpc error (DeadlineExceeded, canceled, ttrpc closed), the cubelet in-memory view could stay stuck at PAUSING/PAUSED while cubeshim had already switched the VM state, leaving PausingAt lingering forever. update.go: - Wrap task.Pause/Resume with explicit 30s timeouts so the calls cannot be stretched arbitrarily by a missing or blocked upstream ctx. - On error, reconcile against the shim's real status using a fresh Background ctx (carrying over the namespace) and straighten the in-memory view to the truth, while still returning TaskPauseFailed/TaskResumeFailed upstream for alerting. events.go: - Subscribe to /tasks/paused and /tasks/resumed and handle them in convertEvent/handleEvent as an event-driven fallback. - Idempotently converge PausedAt/PausingAt for all containers, skipping terminal or user-deleted sandboxes. Assisted-by: Claude Opus 4.8 Signed-off-by: Feng Jin <ronyjin@tencent.com>
|
Claude finished @kinwin-ustc's task in 2m 4s —— View job Review: PR #404Progress
SummaryThis PR fixes a state divergence bug where the cubelet's in-memory view of container pause/resume state could become permanently stuck when ttrpc calls fail (DeadlineExceeded, canceled, or connection-closed). The fix introduces a three-layer defense:
The approach is sound and well-structured. Below are specific observations. Strengths
Observations1. Lock held during In the 2. Event handler timeout governs all cases The 3. Reconciled state is persisted via deferred In 4. Race window between reconcile and event-driven fallback If the reconcile path runs (failes to query status due to query timeout), and then the 5. Double At VerdictApproved. The code is correct, handles edge cases well (terminal states, exec processes, duplicate events), and the three-layer defense (timeout → reconcile → event) is thorough. The changes follow the existing codebase patterns consistently. No blocking issues found. |
DeadGC previously skipped all paused/pausing sandboxes outright, so a cubelet crash/restart mid-pause (missing both the RPC-level reconcile and the /tasks/paused event window) could leave a sandbox stuck at PAUSING forever. - service.go: expand scanDeadContainer skip logic into an explicit switch over container state. PAUSED and in-flight PAUSING (within pausingStuckThreshold) are still skipped to avoid the ttrpc state() timeout that would stamp Unknown=true and trigger a spurious Terminated/Destroy cascade; only PAUSING past the threshold is reconciled. - update.go: add pausingStuckThreshold (2 * taskPauseTimeout); extract the shared PAUSE-direction converger convergePauseStateFromShim as the single source of truth for both the RPC path and the new fallback; add reconcileStuckPausingSandbox, which queries the shim's real task status and converges without ever writing Unknown=true. Assisted-by: Claude opus: v4.8 Signed-off-by: Feng Jin <ronyjin@tencent.com>
|
Claude finished @kinwin-ustc's task in 1m 51s —— View job Review: PR #404Progress
SummaryThis PR addresses a state divergence bug where the cubelet's in-memory view of pause/resume state could become permanently stuck when ttrpc calls return errors (DeadlineExceeded, canceled, connection-closed). The fix introduces a three-layer defense:
Additionally, a DeadGC-based third line of defense ( The overall design is sound and well-structured. Below are detailed observations. Strengths
Observations1. Lock ordering in event handlers The 2. Event handler timeout applies uniformly The 3. Reconciler modifies in-memory state, persisted by deferred SyncByID In 4. Race window between reconcile and event-driven fallback If the reconcile path runs and also fails to query the shim status (status query times out), and then the 5. Double At 6. At 7. At 8. Unlike VerdictThe code is correct, well-structured, and handles edge cases thoroughly (terminal states, exec processes, duplicate events, cubelet restart mid-pause). The three-layer defense (RPC timeout → reconcile → event subscription) plus the DeadGC backstop provides comprehensive coverage against state divergence. The changes follow existing codebase patterns consistently. No blocking issues found. The pre-existing dead code in the |
|
I addressed one review item because it can produce a real correctness issue after a resume that succeeds through the shim but is only observed via the Before this change, the resumed-event handler only cleared The fix makes all successful resume convergence paths use the same helper:
That keeps the state transition logic consistent and guarantees that an opaque restore always invalidates stale runtime/restore-base bindings before the next commit. I also added tests for both the shared helper and the resumed-event path. |
|
I'll analyze this and get back to you. |
Signed-off-by: jinlong <jinlong@tencent.com>
74a30e8 to
4c65f64
Compare
|
Claude finished @fslongjin's task in 1m 24s —— View job Review: PR #404 — follow-up review (3rd commit)Progress
Focus of this reviewTwo earlier reviews already covered the first two commits ( Summary of the fix in commit
|

When task.Pause/Resume returned a ttrpc error (DeadlineExceeded, canceled, ttrpc closed), the cubelet in-memory view could stay stuck at PAUSING/PAUSED while cubeshim had already switched the VM state, leaving PausingAt lingering forever. close issue #403
update.go:
events.go:
Assisted-by: Claude Opus 4.8