fix(canopy): run proxy as a native sidecar so canopy Jobs complete #86

Merged: passcod merged 1 commit into main from fix/canopy-native-sidecar on Jul 2, 2026
@passcod passcod commented Jul 2, 2026

Member

🤖 The canopy-proxy was added as a plain container in the restore and
snapshot-list Jobs. In a Job Pod (restartPolicy: Never) the kubelet
doesn't stop still-running containers when another exits, so once the
main kopia container finished, the proxy kept serving and the Pod never
reached a terminal phase. The Job sat Active until
activeDeadlineSeconds fired → DeadlineExceeded → Failed, even
though the snapshot-list callback had already been POSTed successfully.
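
The stuck-Pod behaviour can be modelled with a few lines of Rust. This is a simplified sketch of the phase rule for a Pod with restartPolicy: Never, not the kubelet's actual code; the `State`, `Phase`, and `pod_phase` names are hypothetical. The point is that every regular container counts towards the terminal phase, so a proxy that keeps serving holds the Pod in Running.

```rust
/// State of one regular (non-init) container in the Pod.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Running,
    Exited(i32), // exit code
}

#[derive(Debug, PartialEq, Eq)]
enum Phase {
    Running,
    Succeeded,
    Failed,
}

/// Simplified phase rule for restartPolicy: Never (an illustration only):
/// the Pod is terminal only once every regular container has exited.
fn pod_phase(regular: &[State]) -> Phase {
    if regular.iter().any(|s| *s == State::Running) {
        return Phase::Running;
    }
    if regular.iter().all(|s| *s == State::Exited(0)) {
        Phase::Succeeded
    } else {
        Phase::Failed
    }
}
```

With the proxy as a plain container, `pod_phase(&[State::Exited(0), State::Running])` stays at `Phase::Running` even though kopia is done. Once the proxy is no longer a regular container, only kopia's exit is considered and the Pod can finish.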

Confirmed on the live cluster: the failed canopy-replica-snapshot-list
Job showed reason=DeadlineExceeded, both containers plain, no init
containers.

Fix

Move the proxy to a native sidecar — an init container with
restartPolicy: Always. The kubelet keeps it running alongside the main
container and SIGTERMs it once the main container exits, so the Pod
completes on the main container's exit code. Needs k8s ≥ 1.29
(SidecarContainers on by default as beta; GA in 1.33); target cluster is 1.34.

Two supporting changes:

  • The proxy now waits on SIGTERM as well as SIGINT. tokio's
    ctrl_c() only catches SIGINT, so under k8s the proxy would have hung
    until SIGKILL and lost its final traffic stats.
  • Both canopy Job Pods get a 30s grace period
    (terminationGracePeriodSeconds: 30) so the sidecar can flush its stats
    callback on SIGTERM.

Regression test canopy_restore_job_proxy_is_native_sidecar asserts the
proxy is an init container with restartPolicy=Always and not a plain
container.
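
The shape of that assertion, as a minimal std-only sketch. The `Container` and `PodSpec` structs are hypothetical stand-ins rather than the types the real test uses, and the main container name `kopia` is an assumption; `canopy-proxy` is the proxy named above.

```rust
/// Minimal stand-in for the parts of a Pod spec the check cares about.
/// Hypothetical types for illustration only.
#[derive(Debug, Clone)]
struct Container {
    name: String,
    /// Container-level restartPolicy; set to "Always" on a native sidecar.
    restart_policy: Option<String>,
}

#[derive(Debug, Clone, Default)]
struct PodSpec {
    init_containers: Vec<Container>,
    containers: Vec<Container>,
}

fn container(name: &str, restart_policy: Option<&str>) -> Container {
    Container {
        name: name.to_string(),
        restart_policy: restart_policy.map(str::to_string),
    }
}

/// True when `name` is an init container with restartPolicy: Always
/// and does not also appear among the plain containers.
fn is_native_sidecar(spec: &PodSpec, name: &str) -> bool {
    let sidecar = spec
        .init_containers
        .iter()
        .any(|c| c.name == name && c.restart_policy.as_deref() == Some("Always"));
    let plain = spec.containers.iter().any(|c| c.name == name);
    sidecar && !plain
}

/// Layout after the fix: proxy as a restartable init container.
fn fixed_job_pod() -> PodSpec {
    PodSpec {
        init_containers: vec![container("canopy-proxy", Some("Always"))],
        containers: vec![container("kopia", None)], // assumed name
    }
}

/// Layout before the fix: both containers plain, no init containers.
fn broken_job_pod() -> PodSpec {
    PodSpec {
        init_containers: vec![],
        containers: vec![container("kopia", None), container("canopy-proxy", None)],
    }
}
```

`is_native_sidecar(&fixed_job_pod(), "canopy-proxy")` holds, while the pre-fix layout from `broken_job_pod()` fails the same check, which is the regression this guards against.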

@passcod passcod enabled auto-merge July 2, 2026 02:38
@passcod passcod merged commit ae225c7 into main Jul 2, 2026
18 checks passed
@passcod passcod deleted the fix/canopy-native-sidecar branch July 2, 2026 02:42