define pod sandbox lifecycle contract #286
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,162 @@ | ||
| # NRI Pod Sandbox Lifecycle Hooks | ||
|
|
||
| ## Relationship to CRI API | ||
|
|
||
| This specification defines how NRI plugins interact with pod sandbox lifecycle events. The underlying pod sandbox operations are defined by the [Kubernetes CRI API](https://github.com/kubernetes/cri-api): | ||
|
|
||
| - **RunPodSandbox (CRI)**: Creates and starts a pod-level sandbox. Runtimes must ensure the sandbox is in the ready state on success. | ||
| - **StopPodSandbox (CRI)**: Stops any running process that is part of the sandbox and directs the runtime to reclaim certain pod resources (e.g. Network Namespace, CNI teardown, and image mounts). May be called multiple times, and is idempotent. | ||
| - **RemovePodSandbox (CRI)**: Removes the sandbox. If there are any running containers, they must be forcibly terminated and removed. | ||
|
|
||
| This NRI specification details when and under what conditions NRI plugins receive notifications for these events, ensuring plugins can reliably depend on consistent sandbox state across different runtime implementations. | ||
|
|
||
| ## Overview | ||
|
|
||
| The pod sandbox lifecycle consists of three distinct phases, each with a corresponding NRI event that plugins can subscribe to: | ||
|
|
||
| 1. **RunPodSandbox**: Fired during the runtime's CRI RunPodSandbox execution, after the pod sandbox is created but before the pod is set to running and success is returned for the CRI RunPodSandbox request. | ||
| 2. **StopPodSandbox**: Fired when the runtime initiates CRI StopPodSandbox. | ||
| 3. **RemovePodSandbox**: Fired when the runtime performs CRI RemovePodSandbox. | ||
|
|
||
| For each event, this specification defines: | ||
|
|
||
| - **Sandbox State Contract**: What sandbox infrastructure conditions runtimes MUST satisfy when firing the NRI event | ||
| - **Plugin Responsibilities and Capabilities**: What plugins can safely do in response to the event | ||
|
|
||
| ## RunPodSandbox | ||
|
|
||
| **CRI Operation**: RunPodSandbox - Creates and starts a pod-level sandbox. | ||
|
|
||
| **NRI Event Timing**: The RunPodSandbox NRI event is fired after the runtime has successfully executed most of the CRI RunPodSandbox operation; NRI plugin execution is the final step before the sandbox reaches a "Ready" state. The Kubelet does not start workload containers until after the sandbox becomes "Ready". | ||
|
Will the container runtime keep retrying, or will it fail on the first failure/timeout?
Member
On errors other than plugin timeout, the runtime will fail the pod creation request. On plugin timeout, the plugin is kicked out by the runtime. |
||
|
|
||
| ### Sandbox State Contract | ||
|
If an NRI plugin times out, will it receive the Stop event?
Member
If an NRI plugin times out, the runtime kicks it out by forcibly disconnecting the plugin. |
||
|
|
||
| When the runtime fires the RunPodSandbox NRI event, it guarantees: | ||
|
Let's say something about volumes and DRA devices in this list.
Should the DRA downward API be accessible? |
||
|
|
||
| - The Pod-level cgroup hierarchy has been established | ||
| - The Sandbox namespaces (IPC, Network, UTS) are created and active | ||
| - Network setup has been fully configured (network interfaces are up and assigned addressing) | ||
| - The pod IP address (if applicable) is assigned and available | ||
| - The "pause" container (if the runtime uses one) is running | ||
| - All prerequisite operations for workload container startup are complete. The pod is in the "unknown" state and will become "Ready" once the NRI event is processed, which guarantees the NRI plugin a window to allocate resources for the pod before any workload containers are started. | ||
|
Can an NRI plugin start its own processes in this sandbox?
Member
Not by using NRI. NRI itself does not provide a means for this. |
||
|
|
||
| ### Plugin Responsibilities and Capabilities | ||
|
|
||
| Upon receiving the RunPodSandbox event, plugins can safely: | ||
|
|
||
| - Access the network namespace and inspect network configuration | ||
| - Perform network-level operations or monitoring | ||
| - Inject sandbox-level hardware configurations (e.g., RDMA, RoCEv2) | ||
| - Establish plugin-specific tracking or monitoring for the pod | ||
| - Store initial state or baseline metrics for later reference | ||
|
Let's clarify the ordering of NRI plugins and what an NRI plugin will "see" if there are multiple plugins.
Member
Currently NRI plugins cannot mutate a pod via NRI. Therefore, in the case of multiple plugins, all NRI plugins see identical pod sandbox data. |
||
|
|
||
| Plugins should treat this as an initialization phase. The sandbox infrastructure will remain accessible throughout the pod's lifetime until StopPodSandbox is called. | ||
|
|
||
| ## StopPodSandbox | ||
|
|
||
| **CRI Operation**: StopPodSandbox - Stops any running process that is part of the sandbox and reclaims certain pod resources (e.g. Network Namespace, CNI teardown, and image mounts). | ||
|
|
||
| **NRI Event Timing**: The StopPodSandbox NRI event is fired when the runtime initiates the CRI StopPodSandbox operation. | ||
|
|
||
| ### Sandbox State Contract | ||
|
|
||
| When the runtime fires the StopPodSandbox NRI event, it guarantees: | ||
|
Let's add a guarantee: this sandbox will never be reused. There is never a scenario in which the container runtime starts a new workload (containers) there after the first attempt to stop the sandbox.
This can be called even if Run was never called, mostly for garbage-collecting a sandbox whose creation was interrupted. E.g. Run is called, containerd crashes almost at the end, and on restart containerd doesn't know whether it needs to clean up state, so it must call Stop on each NRI plugin. Otherwise there will be a resource leak.
Member
nod.. at which point I believe the state of the pod would be "unknown" |
||
|
|
||
| - Workload containers within the sandbox are stopped or are stopping | ||
| - **CRITICAL**: The sandbox infrastructure still exists and remains fully accessible during this hook | ||
|
Can the runtime guarantee that the state is "good enough" to start another container in this sandbox?
Member
no.. this is currently post stopping the pod, which includes killing the containers and the pod container. When the pod container is killed, exit processing of its task sets the pod state to NotReady, which subsequently blocks new containers on the pod. This call is pre-teardown of certain pod resources, noting there was a discrepancy in the crio placement of the call; WIP to normalize. We could add another call earlier in the stop processing if we want to add the goal of creating notifications that stop pod has been requested. And possibly a third call to indicate the pod containers are stopped but the pod container is still running and thus another container can be added. |
||
| - The pod resources allocated by the runtime (such as the network namespace, CNI networks, and image mounts) are not unmounted or deleted until this hook completes | ||
| - The pod's cgroups remain accessible | ||
| - All pod-level resources remain stable until this hook returns | ||
|
|
||
| ### Plugin Responsibilities and Capabilities | ||
|
Responsibility: this event processing must be re-entrant. |
||
|
|
||
| StopPodSandbox is the designated cleanup and observation phase for plugins. Upon receiving this event, plugins can: | ||
|
|
||
| - Access the pod's network namespace to read final telemetry or metrics | ||
| - Collect final state for observability or troubleshooting | ||
| - Detach hardware interfaces or reconfigure resources | ||
| - Clean up custom firewall configurations, routing rules, or other network-level state | ||
| - Perform graceful cleanup or resource release before sandbox teardown | ||
|
|
||
| **Important**: Plugin processing must complete within the configured request timeout. Do not assume sandbox access persists after this hook returns or times out. | ||
|
Again, I think we need to say that there will be a retry if a timeout happens.
From that perspective, can we say that it can be assumed as long as the plugin keeps returning an error? |
||
|
|
||
| ## RemovePodSandbox | ||
|
|
||
| **CRI Operation**: RemovePodSandbox - Removes the sandbox and forcibly terminates any remaining containers. | ||
|
|
||
| **NRI Event Timing**: The RemovePodSandbox NRI event is fired when the runtime initiates the CRI RemovePodSandbox operation, just prior to removing the pod from the pod list. | ||
|
|
||
| ### Sandbox State Contract | ||
|
|
||
| When the runtime fires the RemovePodSandbox NRI event: | ||
|
Another guarantee is that it will never be called until Stop has succeeded. |
||
|
|
||
| - All workload containers have been removed | ||
| - The StopPodSandbox operation has completed | ||
| - Network setup teardown may be underway or complete | ||
| - The pod's namespaces (Network, IPC, UTS) may have already been deleted | ||
| - Pod-level cgroups may be destroyed | ||
| - Sandbox infrastructure access is **not guaranteed** | ||
|
|
||
| ### Plugin Responsibilities and Capabilities | ||
|
The plugin must be re-entrant here as well. |
||
|
|
||
| RemovePodSandbox is strictly for plugin-internal cleanup. Plugins MUST NOT attempt to access pod infrastructure (namespaces, cgroups, network configuration) during this hook, as their existence is not guaranteed. | ||
|
|
||
| Plugins receiving this event should only: | ||
|
|
||
| - Clean up plugin-internal memory caches or object tracking associated with the podSandboxID | ||
| - Remove host-level tracking files, database entries, or other locally stored pod references | ||
| - Release any plugin resources held for this specific pod | ||
| - Perform final accounting or bookkeeping | ||
|
|
||
| **Important**: This hook is informational only. Plugins should not assume any pod infrastructure exists. Only clean up information the plugin created or stored internally. | ||
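As an illustration of cleanup that touches only plugin-internal state, the sketch below keeps a map keyed by pod sandbox ID and makes removal safe to repeat. The `Cache` type and its methods are hypothetical names chosen for this example; they are not part of NRI.

```go
package main

import (
	"fmt"
	"sync"
)

// podState is hypothetical plugin-internal bookkeeping for one pod.
type podState struct {
	note string
}

// Cache is a hypothetical in-memory store keyed by pod sandbox ID.
type Cache struct {
	mu   sync.Mutex
	pods map[string]podState
}

func NewCache() *Cache {
	return &Cache{pods: make(map[string]podState)}
}

// Track records state for a pod, e.g. when RunPodSandbox is received.
func (c *Cache) Track(id string, s podState) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pods[id] = s
}

// Forget drops the plugin's own record for the pod and reports whether one
// existed. Calling it again for the same ID is harmless, so the handler can
// be invoked more than once without side effects.
func (c *Cache) Forget(id string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.pods[id]
	delete(c.pods, id)
	return ok
}

// Len returns the number of tracked pods.
func (c *Cache) Len() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.pods)
}

func main() {
	c := NewCache()
	c.Track("pod-1", podState{note: "baseline"})
	fmt.Println(c.Forget("pod-1"), c.Forget("pod-1"), c.Len())
}
```

Nothing in `Forget` reads namespaces, cgroups, or network configuration, which matches the restriction that this hook must not depend on pod infrastructure.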
|
|
||
| ## Event Ordering and Guarantees | ||
|
|
||
| Runtimes MUST guarantee the following ordering: | ||
|
|
||
| 1. **RunPodSandbox** NRI event fires after successful CRI RunPodSandbox execution, but before the pod is set to the "Ready" state. | ||
| 2. **StopPodSandbox** NRI event fires during CRI StopPodSandbox execution, just prior to removing the pod resources allocated by the runtime (such as the network namespace, CNI networks, and image mounts) | ||
| 3. **RemovePodSandbox** NRI event fires during CRI RemovePodSandbox execution | ||
| 4. These events MUST fire in strict order: RunPodSandbox → StopPodSandbox → RemovePodSandbox | ||
| 5. No workload containers will be started until after RunPodSandbox hook completes | ||
| 6. All workload containers will be stopped before StopPodSandbox hook is called | ||
|
This contradicts the contract above. It was saying stopped or being stopped:
Member
nod.. was wondering why the "or"; didn't get a chance to verify that |
||
| 7. No network resource reclamation should occur during StopPodSandbox hook execution | ||
|
|
||
| See the [CRI API specification](https://github.com/kubernetes/cri-api) for details on each CRI operation. | ||
|
|
||
| ## Plugin Implementation Guidance | ||
|
|
||
| ### Subscribing to Events | ||
|
|
||
| Plugins subscribe to these events during the Configure phase by returning the appropriate event flags in the ConfigureResponse: | ||
|
|
||
| - `Event_RUN_POD_SANDBOX` (1 << 0) for RunPodSandbox | ||
| - `Event_STOP_POD_SANDBOX` (1 << 1) for StopPodSandbox | ||
| - `Event_REMOVE_POD_SANDBOX` (1 << 2) for RemovePodSandbox | ||
|
|
||
| These events are delivered to plugins using the RunPodSandbox, StopPodSandbox and RemovePodSandbox event handlers. | ||
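The event flags listed above are single bits, so a subscription is the bitwise OR of the events a plugin wants. The sketch below mirrors only the bit positions given in this section; the `EventMask` type, the constant names, and the helper functions are stand-ins for illustration, not the identifiers defined by the NRI API package.

```go
package main

import "fmt"

// EventMask is a hypothetical bitmask type. Only the bit positions
// (1 << 0, 1 << 1, 1 << 2) are taken from the list above.
type EventMask uint32

const (
	EventRunPodSandbox    EventMask = 1 << 0
	EventStopPodSandbox   EventMask = 1 << 1
	EventRemovePodSandbox EventMask = 1 << 2
)

// Subscribe combines the given events into a single mask.
func Subscribe(events ...EventMask) EventMask {
	var m EventMask
	for _, e := range events {
		m |= e
	}
	return m
}

// Has reports whether every bit of e is set in m.
func (m EventMask) Has(e EventMask) bool {
	return m&e == e
}

func main() {
	mask := Subscribe(EventRunPodSandbox, EventStopPodSandbox, EventRemovePodSandbox)
	fmt.Printf("mask=%#x stop=%v\n", mask, mask.Has(EventStopPodSandbox))
}
```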
|
|
||
| ### Timeout Handling | ||
|
|
||
| All plugin processing must complete within the configured request timeout. Plugins should plan accordingly: | ||
|
This statement is hard to parse as a spec. I suggest specifying that a timeout is treated as an error by the runtime. |
||
|
|
||
| - **RunPodSandbox**: Failure may result in pod creation failure | ||
| - **StopPodSandbox**: Non-blocking for subsequent operations; the plugin should not depend on completion of subsequent teardown | ||
| - **RemovePodSandbox**: Non-blocking; removal will proceed regardless of plugin timeout | ||
|
Why are we not guaranteeing successful execution here? It may lead to a resource leak. With DRA we guarantee a successful unprepare call; why shouldn't we do the same here?
Member
We have discussed MUST-run plugins vs. optional plugins. It is a good point that in this RM processing we also need to consider which of the plugins MUST complete, or possibly cause leaks. |
||
|
|
||
| ### Error Handling | ||
|
|
||
| On the teardown path, plugin errors MUST NOT prevent the operation from proceeding. Runtimes MUST ensure that a failing plugin cannot block pod or container teardown: | ||
|
|
||
| - **RunPodSandbox errors**: A plugin error may prevent the pod from being created, depending on runtime policy. Plugins bear responsibility for errors they return at this phase. | ||
| - **StopPodSandbox errors**: A plugin error MUST NOT prevent the sandbox from being stopped. The runtime MUST proceed with teardown regardless of plugin failures. | ||
|
Why? Is there any alternative way for a plugin to handle cleanup if the runtime does not guarantee the call and a retry?
Member
One alternative: on sync of the list of pods, a plugin may be able to normalize its internal list of pods with resources against the actual sync list. |
||
| - **RemovePodSandbox errors**: A plugin error MUST NOT prevent the sandbox from being removed. The runtime MUST proceed with removal regardless of plugin failures. | ||
|
|
||
|
|
||
| ### Multi-Plugin Coordination | ||
|
|
||
| When multiple plugins are active: | ||
|
If a plugin didn't exist at sandbox creation and was notified about existing sandboxes on loading, what is the guarantee on consistency between that notification and the plugin's Stop/Remove calls? |
||
|
|
||
| - All RunPodSandbox hooks complete before first workload container starts | ||
| - Hooks execute in plugin index order; later plugins should not assume earlier plugins' modifications will persist | ||
| - RemovePodSandbox hooks are independent; plugins should not rely on side effects from other plugins | ||