|
same way we do conformance for runtimes, some of these contracts may be added to conformance: kubernetes-sigs/cri-tools#2046. I think critest is the right place for it. But it opens an interesting discussion on whether we will also want to do conformance with other things like CNI. NRI is different enough that I would see it as a reasonable conformance requirement |
CNI is not part of CRI; it is an implementation detail of the runtimes and has several flaws that cause a lot of problems with current workloads ... During the last KubeCon, at the OCI meeting, we also discussed replacing it with a modular solution based on NRI ... I have a draft that I plan to finish and share once I have time, but I suggest not including CNI as part of any conformance |
| Plugins should handle errors gracefully and avoid leaving the pod or system in an inconsistent state. Error recovery strategies: | ||
| - **RunPodSandbox errors**: Problematic; may block pod creation depending on failure severity and runtime policy | ||
| - **StopPodSandbox errors**: May not prevent sandbox termination depending on runtime policy |
I think on the teardown path, actually both for pods and containers, we should not allow a plugin to try and prevent the operation with an error. If we agree, then we should clearly state here that, for StopPodSandbox and RemovePodSandbox, a plugin failing with an error will not prevent the operation from proceeding.
The current implementation has incorrect/inconsistent behavior in this regard when multiple plugins are involved: for some of these (or the corresponding container) teardown lifecycle events, a failure in one plugin will incorrectly prevent the event from being delivered to subsequent plugins, although it will not prevent the CRI-level operation from proceeding. There is a fix coming for this, but it is waiting for #274 to get merged first (which in turn is waiting for #277 to get merged first).
@klihub can you please suggest the better wording for this?
I do not feel I'm able to translate that correctly to words :)
ok, let me take a stab at this, I think now I got what you mean
see new commit rephrasing it
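To make the teardown behaviour discussed in this thread concrete, here is a minimal Go sketch of event delivery in which a failing plugin neither stops delivery to later plugins nor blocks the operation. `Plugin`, `StopHook`, `ErrPluginFailed`, and `notifyStop` are hypothetical names for illustration; this is not the NRI API and not the pending fix.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrPluginFailed is a hypothetical sentinel error used by the example.
var ErrPluginFailed = errors.New("plugin failed")

// Plugin is a hypothetical stand-in for an NRI plugin connection;
// it is not the real NRI API.
type Plugin struct {
	Name     string
	StopHook func(podID string) error
}

// notifyStop delivers the StopPodSandbox event to every plugin.
// A failing plugin does not prevent delivery to later plugins; the
// errors are collected so the caller can log them and still proceed
// with teardown.
func notifyStop(plugins []Plugin, podID string) error {
	var errs []error
	for _, p := range plugins {
		if err := p.StopHook(podID); err != nil {
			errs = append(errs, fmt.Errorf("plugin %s: %w", p.Name, err))
		}
	}
	// errors.Join returns nil when no plugin failed.
	return errors.Join(errs...)
}
```

In this sketch the caller would log the joined error and continue with the CRI-level stop regardless of its value.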
|
@aojea Thank you, this looks great! I only have a few comments. |
609d764 to 3c820b0
|
squashed |
|
@aojea thanks for putting this together, great doc. |
| **CRI Operation**: RunPodSandbox - Creates and starts a pod-level sandbox. | ||
| **NRI Event Timing**: The RunPodSandbox NRI event is fired after the runtime has successfully executed the CRI RunPodSandbox operation and the sandbox has reached a "Ready" state, but before any workload containers are started. |
| **NRI Event Timing**: The RunPodSandbox NRI event is fired after the runtime has successfully executed the CRI RunPodSandbox operation and the sandbox has reached a "Ready" state, but before any workload containers are started. | |
| **NRI Event Timing**: The RunPodSandbox NRI event is fired when the runtime has tentatively finished executing the CRI RunPodSandbox operation but just before setting the pod to the "Ready" state, which occurs immediately after NRI event processing, and thus before any workload containers are/can be started as the Pod is still in the "unknown" state. |
| **NRI Event Timing**: The RunPodSandbox NRI event is fired after the runtime has successfully executed the CRI RunPodSandbox operation and the sandbox has reached a "Ready" state, but before any workload containers are started. | |
| **NRI Event Timing**: The RunPodSandbox NRI event is fired after the runtime has successfully executed most of the CRI RunPodSandbox operation; NRI plugin execution is the final step before the sandbox reaches a "Ready" state. The Kubelet does not start workload containers until after the sandbox becomes "Ready". |
| This specification defines how NRI plugins interact with pod sandbox lifecycle events. The underlying pod sandbox operations are defined by the [Kubernetes CRI API](https://github.com/kubernetes/cri-api): | ||
| - **RunPodSandbox (CRI)**: Creates and starts a pod-level sandbox. Runtimes must ensure the sandbox is in the ready state on success. | ||
| - **StopPodSandbox (CRI)**: Stops any running process that is part of the sandbox and reclaims network resources. |
| - **StopPodSandbox (CRI)**: Stops any running process that is part of the sandbox and reclaims network resources. | |
| - **StopPodSandbox (CRI)**: Stops any running process that is part of the sandbox and directs the runtime to reclaim certain pod resources (e.g. Network Namespace, CNI teardown, and image mounts). May be called multiple times, and is idempotent. |
| The pod sandbox lifecycle consists of three distinct phases, each with a corresponding NRI event that plugins can subscribe to: | ||
| 1. **RunPodSandbox**: Fired after the runtime successfully executes CRI RunPodSandbox |
| 1. **RunPodSandbox**: Fired after the runtime successfully executes CRI RunPodSandbox | |
| 1. **RunPodSandbox**: Fired after the runtime successfully creates the pod, but before setting the pod to running and then replying success to CRI RunPodSandbox request. |
| 1. **RunPodSandbox**: Fired after the runtime successfully executes CRI RunPodSandbox | |
| 1. **RunPodSandbox**: Fired during the the runtime CRI RunPodSandbox execution |
| - Network setup has been fully configured (network interfaces are up and assigned addressing) | ||
| - The pod IP address (if applicable) is assigned and available | ||
| - The "pause" container (if the runtime uses one) is running | ||
| - All prerequisite operations for workload container startup are complete |
| - All prerequisite operations for workload container startup are complete | |
| - All prerequisite operations for workload container startup are complete, the pod is in the "unknown state" and will become "Ready" once the NRI event is processed. This guarantees the NRI plugin has a window to allocate resources for the pod before any workload containers are started. |
| - Workload containers within the sandbox are stopped or are stopping | ||
| - **CRITICAL**: The sandbox infrastructure still exists and remains fully accessible during this hook | ||
| - The network namespace is not unmounted or deleted until this hook completes |
| - The network namespace is not unmounted or deleted until this hook completes | |
| - The pod resources allocated by the runtime, such as the network namespace, CNI networks, and image mounts, are not unmounted or deleted until this hook completes |
| **CRI Operation**: RemovePodSandbox - Removes the sandbox and forcibly terminates any remaining containers. | ||
| **NRI Event Timing**: The RemovePodSandbox NRI event is fired when the runtime initiates the CRI RemovePodSandbox operation, during final garbage collection. |
| **NRI Event Timing**: The RemovePodSandbox NRI event is fired when the runtime initiates the CRI RemovePodSandbox operation, during final garbage collection. | |
| **NRI Event Timing**: The RemovePodSandbox NRI event is fired by the runtime just prior to removing the pod from the pod list. |
The runtime doesn't initiate RemovePodSandbox, the Kubelet does.
| Runtimes MUST guarantee the following ordering: | ||
| 1. **RunPodSandbox** NRI event fires after successful CRI RunPodSandbox execution |
| 1. **RunPodSandbox** NRI event fires after successful CRI RunPodSandbox execution | |
| 1. **RunPodSandbox** NRI event fires after successful CRI RunPodSandbox execution, but before the pod is set to the "Ready" state |
| Runtimes MUST guarantee the following ordering: | ||
| 1. **RunPodSandbox** NRI event fires after successful CRI RunPodSandbox execution | ||
| 2. **StopPodSandbox** NRI event fires during CRI StopPodSandbox execution |
| 2. **StopPodSandbox** NRI event fires during CRI StopPodSandbox execution | |
| 2. **StopPodSandbox** NRI event fires during CRI StopPodSandbox execution, just prior to removing the pod resources allocated by the runtime, such as the network namespace, CNI networks, and image mounts |
3c820b0 to 18b2b7b
|
new commit trying to reconcile @mikebrow and @samuelkarp comments |
|
pls squash :-) |
Define the contract for the PodSandbox hooks for the NRI plugins. The Sandbox hooks are based on the CRI-API RPCs, since the OCI runtime only specifies the container lifecycle.
Co-authored-by: Mike Brown <brownwm@us.ibm.com>
Signed-off-by: Antonio Ojea <aojea@google.com>
b635dc2 to 9a74a7b
| The pod sandbox lifecycle consists of three distinct phases, each with a corresponding NRI event that plugins can subscribe to: | ||
| 1. **RunPodSandbox**: Fired during the the runtime CRI RunPodSandbox execution, after the PodSandbox is created but before setting the pod to running and then replying success to CRI RunPodSandbox request. |
| 1. **RunPodSandbox**: Fired during the the runtime CRI RunPodSandbox execution, after the PodSandbox is created but before setting the pod to running and then replying success to CRI RunPodSandbox request. | |
| 1. **RunPodSandbox**: Fired during the runtime CRI RunPodSandbox execution, after the PodSandbox is created but before setting the pod to running and then replying success to CRI RunPodSandbox request. |
| **CRI Operation**: RunPodSandbox - Creates and starts a pod-level sandbox. | ||
| **NRI Event Timing**: The RunPodSandbox NRI event is fired after the runtime has successfully executed most of the CRI RunPodSandbox operation; NRI plugin execution is the final step before the sandbox reaches a "Ready" state. The Kubelet does not start workload containers until after the sandbox becomes "Ready". |
Will the container runtime keep retrying, or fail on the first failure/timeout?
On errors other than plugin timeout, the runtime will fail the pod creation request. On plugin timeout the plugin is kicked out by the runtime.
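The distinction drawn in this answer (an ordinary plugin error fails the pod creation request, while a timeout gets the plugin disconnected) can be sketched in Go as follows. This is only an illustration under assumptions: `Outcome`, `Hook`, `callRunPodSandbox`, the example hooks, and `exampleTimeout` are hypothetical names, not part of NRI or any runtime.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// Outcome is what the runtime does after calling one plugin (hypothetical type).
type Outcome int

const (
	Proceed          Outcome = iota // plugin succeeded
	FailPodCreation                 // plugin returned an error
	DisconnectPlugin                // plugin did not answer in time
)

// Hook is a hypothetical stand-in for a plugin's RunPodSandbox handler.
type Hook func(ctx context.Context) error

// callRunPodSandbox runs the hook under a deadline and maps the result
// to an Outcome. A timeout is reported separately from an ordinary error.
func callRunPodSandbox(hook Hook, timeout time.Duration) Outcome {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	done := make(chan error, 1) // buffered so a late hook does not block forever
	go func() { done <- hook(ctx) }()

	select {
	case err := <-done:
		if err != nil {
			return FailPodCreation
		}
		return Proceed
	case <-ctx.Done():
		return DisconnectPlugin
	}
}

// Example hooks and timeout for exercising the sketch.
const exampleTimeout = 50 * time.Millisecond

func okHook(context.Context) error   { return nil }
func failHook(context.Context) error { return errors.New("plugin rejected the pod") }
func slowHook(context.Context) error { time.Sleep(10 * exampleTimeout); return nil }
```

With these example hooks, `okHook` maps to `Proceed`, `failHook` to `FailPodCreation`, and `slowHook` to `DisconnectPlugin`. What the runtime does with the pod after a disconnect is left out, since the thread does not specify it.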
| **NRI Event Timing**: The RunPodSandbox NRI event is fired after the runtime has successfully executed most of the CRI RunPodSandbox operation; NRI plugin execution is the final step before the sandbox reaches a "Ready" state. The Kubelet does not start workload containers until after the sandbox becomes "Ready". | ||
| ### Sandbox State Contract |
If an NRI plugin timed out, will it still receive the Stop event?
If an NRI plugin times out, the runtime kicks it out by forcibly disconnecting the plugin.
| - Network setup has been fully configured (network interfaces are up and assigned addressing) | ||
| - The pod IP address (if applicable) is assigned and available | ||
| - The "pause" container (if the runtime uses one) is running | ||
| - All prerequisite operations for workload container startup are complete, the pod is in the "unknown state" and will become "Ready" once the NRI event is processed. This guarantees the NRI plugin has a window to allocate resources for the pod before any workload containers are started. |
Can an NRI plugin start its own processes in this sandbox?
Not through NRI. NRI itself does not provide a means for this.
| ### Sandbox State Contract | ||
| When the runtime fires the RunPodSandbox NRI event, it guarantees: |
let's say something about volumes and DRA devices in this list
Should the DRA downward API be accessible?
| 3. **RemovePodSandbox** NRI event fires during CRI RemovePodSandbox execution | ||
| 4. These events MUST fire in strict order: RunPodSandbox → StopPodSandbox → RemovePodSandbox | ||
| 5. No workload containers will be started until after RunPodSandbox hook completes | ||
| 6. All workload containers will be stopped before StopPodSandbox hook is called |
This contradicts the contract above, which says "stopped or stopping":
- Workload containers within the sandbox are stopped or are stopping
Nod. I was wondering why the "or" was there, but didn't get a chance to verify that.
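The strict RunPodSandbox → StopPodSandbox → RemovePodSandbox ordering quoted above, together with the suggestion elsewhere in this review that StopPodSandbox may be called multiple times, could be checked on the plugin side with a small state tracker. The Go sketch below is illustrative only: `Phase`, `Tracker`, and `Observe` are hypothetical names, and a plugin that registers after a sandbox already exists would need to seed the tracker from its sync list first.

```go
package main

import "fmt"

// Phase is the last lifecycle event seen for a pod (hypothetical type).
type Phase int

const (
	None Phase = iota
	Running
	Stopped
	Removed
)

// Tracker records the last lifecycle event observed per pod.
type Tracker struct{ phase map[string]Phase }

// NewTracker returns an empty tracker.
func NewTracker() *Tracker { return &Tracker{phase: map[string]Phase{}} }

// Observe accepts an event only if it is the next step in the
// Run -> Stop -> Remove sequence, or a repeated Stop.
func (t *Tracker) Observe(podID string, event Phase) error {
	cur := t.phase[podID]
	switch {
	case event == cur && event == Stopped:
		return nil // repeated StopPodSandbox is tolerated
	case event == cur+1:
		t.phase[podID] = event
		return nil
	default:
		return fmt.Errorf("pod %s: event %d out of order after %d", podID, event, cur)
	}
}
```

A plugin could call `Observe` at the top of each hook and log the returned error instead of failing, which keeps teardown non-blocking.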
| - **RunPodSandbox**: Failure may result in pod creation failure | ||
| - **StopPodSandbox**: Non-blocking for subsequent operations; the plugin should not depend on completion of subsequent teardown | ||
| - **RemovePodSandbox**: Non-blocking; removal will proceed regardless of plugin timeout |
Why are we not guaranteeing successful execution here? It may lead to resource leaks. With DRA we guarantee a successful unprepare call; why shouldn't we do the same here?
We have discussed MUST-run plugins vs. optional plugins. It is a good point that in this RM processing we also need to consider which of the plugins MUST complete, or else possibly cause leaks.
| ### Timeout Handling | ||
| All plugin processing must complete within the configured request timeout. Plugins should plan accordingly: |
This statement is hard to parse as a spec. I suggest specifying that a timeout is treated as an error by the runtime.
| On the teardown path, plugin errors MUST NOT prevent the operation from proceeding. Runtimes MUST ensure that a failing plugin cannot block pod or container teardown: | ||
| - **RunPodSandbox errors**: A plugin error may prevent the pod from being created, depending on runtime policy. Plugins bear responsibility for errors they return at this phase. | ||
| - **StopPodSandbox errors**: A plugin error MUST NOT prevent the sandbox from being stopped. The runtime MUST proceed with teardown regardless of plugin failures. |
Why? Is there any alternative way for a plugin to handle cleanup if the runtime does not guarantee the call and a retry?
One alternative: on sync of the list of pods, a plugin may be able to normalize its internal list of pods with resources against the actual sync list.
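As a rough illustration of that alternative, the Go sketch below computes which of a plugin's tracked allocations no longer appear in the pod list received on sync, so the plugin can release them. The function name `staleAllocations` and the use of plain string pod IDs are assumptions for the example, not NRI types.

```go
package main

import "sort"

// staleAllocations returns the pod IDs for which the plugin still holds
// resources but which are absent from the pod list received on sync.
func staleAllocations(allocated map[string]struct{}, synced []string) []string {
	live := make(map[string]struct{}, len(synced))
	for _, id := range synced {
		live[id] = struct{}{}
	}
	var stale []string
	for id := range allocated {
		if _, ok := live[id]; !ok {
			stale = append(stale, id)
		}
	}
	sort.Strings(stale) // deterministic order for logging and tests
	return stale
}
```

The plugin would then free the resources of each returned pod ID. This recovers from missed Stop/Remove events only at the next sync, so it complements rather than replaces a delivery guarantee.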
| ### Multi-Plugin Coordination | ||
| When multiple plugins are active: |
If a plugin didn't exist at sandbox creation and was notified about existing sandboxes on loading, what is the guarantee of consistency between that notification and the plugin's Stop/Remove calls?
Define the contract for the PodSandbox hooks for the NRI plugins.
The Sandbox hooks are based on the CRI-API RPCs, since the OCI runtime only specifies the container lifecycle.
/assign @samuelkarp @haircommander