Skip to content

fix(cubelet): converge pause/resume state on ttrpc errors and shim events#404

Merged
fslongjin merged 3 commits into
TencentCloud:masterfrom
kinwin-ustc:fix-cublet-pausing
May 31, 2026
Merged

fix(cubelet): converge pause/resume state on ttrpc errors and shim events#404
fslongjin merged 3 commits into
TencentCloud:masterfrom
kinwin-ustc:fix-cublet-pausing

Conversation

@kinwin-ustc
Copy link
Copy Markdown
Collaborator

When task.Pause/Resume returned a ttrpc error (DeadlineExceeded, canceled, ttrpc closed), the cubelet in-memory view could stay stuck at PAUSING/PAUSED while cubeshim had already switched the VM state, leaving PausingAt lingering forever. close issue #403

update.go:

  • Wrap task.Pause/Resume with explicit 30s timeouts so the calls cannot be stretched arbitrarily by a missing or blocked upstream ctx.
  • On error, reconcile against the shim's real status using a fresh Background ctx (carrying over the namespace) and straighten the in-memory view to the truth, while still returning TaskPauseFailed/TaskResumeFailed upstream for alerting.

events.go:

  • Subscribe to /tasks/paused and /tasks/resumed and handle them in convertEvent/handleEvent as an event-driven fallback.
  • Idempotently converge PausedAt/PausingAt for all containers, skipping terminal or user-deleted sandboxes.

Assisted-by: Claude Opus 4.8

…ents

When task.Pause/Resume returned a ttrpc error (DeadlineExceeded, canceled, ttrpc closed), the cubelet in-memory view could stay stuck at PAUSING/PAUSED while cubeshim had already switched the VM state, leaving PausingAt lingering forever.

update.go:
- Wrap task.Pause/Resume with explicit 30s timeouts so the calls cannot be stretched arbitrarily by a missing or blocked upstream ctx.
- On error, reconcile against the shim's real status using a fresh Background ctx (carrying over the namespace) and straighten the in-memory view to the truth, while still returning TaskPauseFailed/TaskResumeFailed upstream for alerting.

events.go:
- Subscribe to /tasks/paused and /tasks/resumed and handle them in convertEvent/handleEvent as an event-driven fallback.
- Idempotently converge PausedAt/PausingAt for all containers, skipping terminal or user-deleted sandboxes.

Assisted-by: Claude Opus 4.8
Signed-off-by: Feng Jin <ronyjin@tencent.com>
@kinwin-ustc kinwin-ustc requested a review from fslongjin as a code owner May 29, 2026 15:26
@cubesandboxbot
Copy link
Copy Markdown

cubesandboxbot Bot commented May 29, 2026

Claude finished @kinwin-ustc's task in 2m 4s —— View job


Review: PR #404

Progress

  • Analyze PR description and context
  • Read changed files (events.go, update.go)
  • Review supporting types (status.go, cubebox.go)
  • Analyze code for correctness, safety, edge cases
  • Post review feedback

Summary

This PR fixes a state divergence bug where the cubelet's in-memory view of container pause/resume state could become permanently stuck when ttrpc calls fail (DeadlineExceeded, canceled, or connection-closed). The fix introduces a three-layer defense:

  1. 30s explicit timeouts on task.Pause/task.Resume so the calls can't stretch arbitrarily
  2. Reconciliation on error — query the shim's real status with a fresh context.Background() and converge if the shim actually completed the operation
  3. Event-driven fallback via /tasks/paused and /tasks/resumed subscriptions

The approach is sound and well-structured. Below are specific observations.


Strengths

  1. Fresh context for reconciliation (update.go:288-293, 356-361): Using context.Background() for the status query is critical — the parent context is almost certainly already Done when the pause/resume ttrpc call fails. The namespace is explicitly carried over to avoid namespaces.NamespaceRequired failures. Well done.

  2. Idempotent event handlers (events.go:289, 325): Both event handlers check status.PausedAt == 0 before writing (for paused) and unconditionally clear both timestamps (for resumed). This makes them safe for duplicate delivery from containerd's event system.

  3. Terminal state guards (events.go:277-282, 313-318): Skipping the entire convergence path when UserMarkDeletedTime != nil or the container IsTerminated() is correct — no point backfilling paused semantics into something already shutting down or gone.

  4. Alertability preserved: The reconciler converges the in-memory view but still returns TaskPauseFailed/TaskResumeFailed upstream. Operators get alerts, but the system state is consistent. Good defensive design.

  5. exec- prefix filtering (events.go:261-263, 298-300): Prevents exec process events from triggering sandbox-level state changes. Consistent with the existing TaskExit handler pattern.

  6. Dual reconciliation for resume (update.go:378): The resume reconciler also calls invalidateRuntimeSnapshotBindingsAfterOpaqueRestore, keeping it consistent with the success path. This is easy to overlook but correctly handled.


Observations

1. Lock held during SyncByID in event handlers (pre-existing pattern)

In the TaskPaused/TaskResumed event handlers at events.go:274-275 and 311-312, cb.Lock() is acquired and held via defer cb.Unlock() through the SyncByID call on lines 296 and 330. If SyncByID internally attempts to acquire the same lock (e.g., through a status read), this would deadlock. The existing TaskExit handler at line 223-224 uses the same pattern, so this is consistent — but worth verifying during review that SyncByID doesn't need cb.Lock().

2. Event handler timeout governs all cases

The handleEventTimeout (10s at events.go:35) is set at the top of handleEvent (line 189) and applies to all cases including TaskPaused/TaskResumed. If SyncByID takes close to 10s, the context will be canceled before SyncByID completes. The existing handlers have the same constraint, so this is consistent, but worth noting for operators who might see timeout logs under load.

3. Reconciled state is persisted via deferred SyncByID

In UpdateWithPause/UpdateWithResume (update.go:152-154, 226-228), SyncByID is deferred before the reconciler call. Since the reconciler updates the in-memory container statuses directly (not through a copy), the deferred SyncByID will persist the converged state. This is correct but relies on the fact that the reconciler modifies sb.AllContainers() entries in-place.

4. Race window between reconcile and event-driven fallback

If the reconcile path runs (failes to query status due to query timeout), and then the /tasks/paused event arrives shortly after, both paths will try to converge. Since the event handler also locks cb.Lock() and checks terminal state idempotently, this is safe — the worst case is a redundant SyncByID.

5. Double exec- prefix check in TaskExit handler (pre-existing, not introduced by this PR)

At events.go:205 and 209, the TaskExit case has two identical strings.HasPrefix(e.ID, "exec-") checks in succession. The second is dead code.


Verdict

Approved. The code is correct, handles edge cases well (terminal states, exec processes, duplicate events), and the three-layer defense (timeout → reconcile → event) is thorough. The changes follow the existing codebase patterns consistently. No blocking issues found.

DeadGC previously skipped all paused/pausing sandboxes outright, so a cubelet crash/restart mid-pause (missing both the RPC-level reconcile and the /tasks/paused event window) could leave a sandbox stuck at PAUSING forever.

- service.go: expand scanDeadContainer skip logic into an explicit switch over container state. PAUSED and in-flight PAUSING (within pausingStuckThreshold) are still skipped to avoid the ttrpc state() timeout that would stamp Unknown=true and trigger a spurious Terminated/Destroy cascade; only PAUSING past the threshold is reconciled.
- update.go: add pausingStuckThreshold (2 * taskPauseTimeout); extract the shared PAUSE-direction converger convergePauseStateFromShim as the single source of truth for both the RPC path and the new fallback; add reconcileStuckPausingSandbox, which queries the shim's real task status and converges without ever writing Unknown=true.

Assisted-by: Claude opus: v4.8
Signed-off-by: Feng Jin <ronyjin@tencent.com>
@cubesandboxbot
Copy link
Copy Markdown

cubesandboxbot Bot commented May 29, 2026

Claude finished @kinwin-ustc's task in 1m 51s —— View job


Review: PR #404

Progress

  • Analyze PR description and context
  • Read changed files (events.go, update.go, service.go)
  • Review supporting types (status.go, cubebox.go, store.go)
  • Analyze code for correctness, safety, edge cases
  • Post review feedback

Summary

This PR addresses a state divergence bug where the cubelet's in-memory view of pause/resume state could become permanently stuck when ttrpc calls return errors (DeadlineExceeded, canceled, connection-closed). The fix introduces a three-layer defense:

  1. 30s explicit timeouts on task.Pause/task.Resume to prevent unbounded blocking
  2. Reconciliation on error — query the shim's real status using a fresh context.Background() and converge if the shim actually completed the operation
  3. Event-driven fallback via /tasks/paused and /tasks/resumed subscriptions

Additionally, a DeadGC-based third line of defense (reconcileStuckPausingSandbox) handles the case where the cubelet restarts mid-pause and misses both the RPC-level reconcile and the event-driven converge.

The overall design is sound and well-structured. Below are detailed observations.


Strengths

  1. Fresh context for reconciliation (update.go:299-303, 386-389): Using context.Background() for the status query is critical — the parent context is very likely already Done when the pause/resume ttrpc call fails. The namespace is explicitly carried over to avoid namespaces.NamespaceRequired failures. Well handled.

  2. Idempotent event handlers (events.go:288-289, 324-325): The TaskPaused handler only writes PausedAt when PausedAt == 0, and both handlers unconditionally clear timestamps. This makes them safe for duplicate delivery from containerd's event system.

  3. Terminal state guards (events.go:277-282, 313-318): Skipping convergence when UserMarkDeletedTime != nil or the container IsTerminated() is correct — no point backfilling pause/resume semantics into something already shutting down or gone.

  4. Alertability preserved: The reconciler converges the in-memory view but still returns TaskPauseFailed/TaskResumeFailed upstream. Operators get alerted, but system state stays consistent. Good defensive design.

  5. exec- prefix filtering (events.go:261-263, 298-300): Prevents exec process events from triggering sandbox-level state changes. Consistent with the existing TaskExit handler pattern.

  6. Shared convergence function (convergePauseStateFromShim at update.go:332): Used by both the RPC-level reconciler and the DeadGC fallback, preventing drift between the two paths. Also correctly handles the intermediate/unknown shim statuses by leaving the in-memory state untouched.

  7. Deferred SyncByID vs reconciler ordering (update.go:152-155, 183): The SyncByID is deferred before the reconciler call. Since defers are LIFO, the reconciler modifies in-memory state first, then the deferred SyncByID persists it. This ordering is correct.

  8. DeadGC safety window (pausingStuckThreshold = 2 * taskPauseTimeout at update.go:283): 60s is comfortably larger than the 30s RPC timeout, preventing premature reconciliation while a genuine in-flight pause holds the shim mutex.

  9. reconcileStuckPausingSandbox handles nil container (update.go:459-468): Defensively loads the container from the containerd client if the in-memory Container field is nil. This handles edge cases where the cubebox is partially initialized.


Observations

1. Lock ordering in event handlers

The TaskPaused/TaskResumed handlers at events.go:274-275 and 311-312 acquire cb.Lock() and hold it through the SyncByID call. I verified that SyncByID (cubes/local.go:143) calls l.Get(ctx, id) and l.cubeboxStore.Sync(id), which use the store-level sync.RWMutex — not the per-cubebox cb.RWMutex. So there is no deadlock risk. However, holding the lock through the entire handler (including store I/O in SyncByID) increases lock contention on this sandbox, blocking other operations like List() or Destroy() that also acquire cb.Lock(). The existing TaskExit handler at line 223-224 uses the same pattern, so this is a pre-existing design choice, not a new concern.

2. Event handler timeout applies uniformly

The handleEventTimeout (10s at events.go:35) applies to all event types including TaskPaused/TaskResumed. If SyncByID takes close to 10s (e.g., under heavy I/O load), the context will be canceled before completion. The existing handlers have the same constraint, so this is consistent — but operators might see context deadline exceeded logs under load.

3. Reconciler modifies in-memory state, persisted by deferred SyncByID

In UpdateWithPause/UpdateWithResume (update.go:152-155, 226-228), SyncByID is deferred before the reconciler call. Since the reconciler modifies sb.AllContainers() entries in-place (through c.Status.Update()), the deferred SyncByID persists the converged state. This is correct.

4. Race window between reconcile and event-driven fallback

If the reconcile path runs and also fails to query the shim status (status query times out), and then the /tasks/paused event arrives shortly after, both paths will try to converge. Since the event handler also locks cb.Lock() and checks terminal state idempotently, this is safe — the worst case is a redundant SyncByID.

5. Double exec- prefix check in TaskExit handler (pre-existing, not introduced by this PR)

At events.go:205 and 209, the TaskExit case has two identical strings.HasPrefix(e.ID, "exec-") checks in succession. The second is dead code, unreachable since the first returns nil.

6. reconcileStuckPausingSandbox runs with Background context

At update.go:454, the function creates its context from context.Background() with reconcileStatusTimeout (5s). The DeadGC loop at service.go:850 also uses context.Background(). This means DeadGC's task status queries won't be subject to any upstream cancellation, which is the correct behavior — these are truly independent background operations. However, note that the logging context (ctx parameter, passed as the DeadGC logger context) is used for log.G(ctx).Errorf(...) calls inside reconcileStuckPausingSandbox, so errors will be properly attributed.

7. UpdateWithPause applies PreStopType before both doPreStop and task.Pause

At update.go:143, constants.WithPreStopType(ctx, constants.PreStopTypePause) is set on the context before calling doPreStop (lines 167-170) and task.Pause (line 178). This means the prestop hooks and the pause call share the same prestop-type context metadata. The timeout context on line 176 wraps this context, so the pause ttrpc call gets the 30s timeout. Correct.

8. UpdateWithResume does not set PreStopType (correct — resume doesn't need prestop)

Unlike UpdateWithPause, UpdateWithResume (update.go:200) does not call constants.WithPreStopType or invoke doPreStop. This is correct — only the pause path needs prestop hooks, and resume directly issues task.Resume() without preparatory steps.


Verdict

The code is correct, well-structured, and handles edge cases thoroughly (terminal states, exec processes, duplicate events, cubelet restart mid-pause). The three-layer defense (RPC timeout → reconcile → event subscription) plus the DeadGC backstop provides comprehensive coverage against state divergence. The changes follow existing codebase patterns consistently.

No blocking issues found. The pre-existing dead code in the TaskExit handler (events.go:209) is minor and outside the scope of this PR.

@fslongjin
Copy link
Copy Markdown
Member

I addressed one review item because it can produce a real correctness issue after a resume that succeeds through the shim but is only observed via the /tasks/resumed event path.

Before this change, the resumed-event handler only cleared PausedAt and PausingAt, but did not invalidate the runtime snapshot bindings. That left Cubelet believing the VM was still bound to the pre-resume snapshot lineage, even though CubeShim had actually restored the VM from its internal pause snapshot. In that state, a later CommitSandbox could choose an incremental memory base that no longer matches the VM's real restore boundary.

The fix makes all successful resume convergence paths use the same helper:

  • normal UpdateWithResume success
  • resume RPC error reconciliation when shim status is already RUNNING
  • /tasks/resumed event reconciliation

That keeps the state transition logic consistent and guarantees that an opaque restore always invalidates stale runtime/restore-base bindings before the next commit. I also added tests for both the shared helper and the resumed-event path.

@cubesandboxbot
Copy link
Copy Markdown

Claude Code is working…

I'll analyze this and get back to you.

View job run

Signed-off-by: jinlong <jinlong@tencent.com>
@fslongjin fslongjin force-pushed the fix-cublet-pausing branch from 74a30e8 to 4c65f64 Compare May 31, 2026 10:11
@cubesandboxbot
Copy link
Copy Markdown

cubesandboxbot Bot commented May 31, 2026

Claude finished @fslongjin's task in 1m 24s —— View job


Review: PR #404 — follow-up review (3rd commit)

Progress

  • Analyze PR description and context
  • Read all changed files (events.go, update.go, service.go, resume_reconcile_test.go)
  • Review supporting types (snapshot_runtime_binding.go, rollback_test.go)
  • Review the latest commit 4c65f64 changes in detail
  • Analyze code for correctness, safety, edge cases
  • Post review feedback

Focus of this review

Two earlier reviews already covered the first two commits (7d6ceed, 6a9a677) and found no blocking issues. This review focuses on the third commit 4c65f64 — "fix(cubelet): align resumed event reconcile with opaque restore" — which was pushed in response to a review finding.


Summary of the fix in commit 4c65f64

The earlier /tasks/resumed event handler only cleared PausedAt and PausingAt but did not invalidate the runtime snapshot bindings (MasterAnnotationRuntimeSnapshotID, MasterAnnotationRuntimeRestoreSnapshotID). A resume from cubeshim is an opaque restore from an internal pause snapshot that cubelet cannot later reflink from. Without invalidation, a subsequent CommitSandbox could choose an incremental memory base that no longer matches the VM's real restore boundary.

The fix:

  1. Extracts convergeResumeStateAfterOpaqueRestore(sb, attachedAt) as the shared single source of truth for all resume-convergence paths
  2. Wires it into:
    • The normal UpdateWithResume success path (update.go:242)
    • The resume-RPC error reconcile path (update.go:418 via reconcileStatusAfterResumeError)
    • The /tasks/resumed event handler (events.go:322)
  3. Adds tests for both the shared helper and the resumed-event path

Detailed observations

1. Three paths, one function — correct factoring.

update.go:372 defines convergeResumeStateAfterOpaqueRestore, which calls invalidateRuntimeSnapshotBindingsAfterOpaqueRestore and then clears PausedAt/PausingAt on all containers. All three call sites (update.go:242, update.go:418, events.go:322) converge through this single function, so the binding-invalidation behavior cannot drift between paths again.

2. TaskPaused handler correctly does NOT call the shared helper.

The TaskPaused handler at events.go:288 still uses inline status updates — it only sets PausedAt and clears PausingAt. This is correct: pausing does not involve a snapshot restore, so there are no runtime bindings to invalidate.

3. convergeResumeStateAfterOpaqueRestore is nil-safe but convergePauseStateFromShim is not.

convergeResumeStateAfterOpaqueRestore at update.go:373-375 returns early if sb == nil. However, convergePauseStateFromShim at update.go:317 does not have a nil check. Both are called after sb has already been dereferenced elsewhere, so in practice this is not a bug — but the asymmetry is worth noting for future maintainers. If either is ever called from a context where sb could be nil (e.g., a background worker that filters sandboxes differently), convergePauseStateFromShim would panic with a nil-pointer dereference on sb.ID.

4. Test coverage is solid.

The two new tests in resume_reconcile_test.go cover:

  • TestConvergeResumeStateAfterOpaqueRestoreClearsPauseStateAndInvalidatesBindings: verifies that PausedAt and PausingAt are cleared, runtime snapshot bindings are invalidated, attachedAt timestamps are stamped, and unrelated labels are preserved — including for a sidecar container.
  • TestHandleEventTaskResumedConvergesOpaqueRestoreBindings: verifies the full event handler path with a fakeCubeboxAPI mock, including that SyncByID is called to persist the converged state.

Both tests assert on the content of the labels, not just their presence, which makes them resistant to false positives. Well done.

5. attachedAt time parameter consistency across paths.

All three call sites pass time.Now().UTC():

  • update.go:242 — normal success
  • update.go:418 — RPC error reconcile
  • events.go:322 — event handler

invalidateRuntimeSnapshotBindingsAfterOpaqueRestore stamps the attached-at timestamp only if !attachedAt.IsZero() (snapshot_runtime_binding.go:93). Since time.Now().UTC() is never the zero time, this guard is technically dead code for all three callers, but it's a reasonable defensive check in case a future caller passes time.Time{}.

6. Lock held through convergeResumeStateAfterOpaqueRestore in event handler.

The TaskResumed handler at events.go:311-312 acquires cb.Lock() and holds it through convergeResumeStateAfterOpaqueRestore (which mutates container statuses in-place) and SyncByID. This is consistent with the TaskExit and TaskPaused handlers, so it's a pre-existing design pattern. The test proves it works correctly under unit test conditions.


Verdict

Approved. The fix is correct, well-structured, and closes a real correctness gap where the resumed-event path could leave stale runtime snapshot bindings. The three resume-convergence paths now share a single function, preventing future drift. Tests are comprehensive and assert on the right things. No blocking issues found.


@fslongjin fslongjin merged commit 32a30f5 into TencentCloud:master May 31, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants