Create blog post on AKS NAP disruption management #5685
wdarko1 wants to merge 3 commits into Azure:master from wdarko1:nap-disruption-blog
| --- | ||||||
| title: "Managing Disruption with AKS Node Auto-Provisioning" | ||||||
| description: "Learn AKS best practices to control NAP disruption with Pod Disruption Budgets (PDBs), node pool disruption budgets, consolidation, and maintenance windows." | ||||||
| date: 2026-04-12 | ||||||
| authors: ["wilson-darko"] | ||||||
| tags: | ||||||
| - node-auto-provisioning | ||||||
| --- | ||||||
|
|
||||||
| ## Background | ||||||
| AKS users want to ensure that their workloads scale when needed and are disrupted only when (and where) desired. | ||||||
| AKS Node Auto-Provisioning (NAP) is designed to keep clusters efficient: it provisions nodes for pending pods, and it also continuously *removes* nodes when it’s safe to do so (for example, when nodes are empty or underutilized). That node-removal **disruption** is where many production surprises happen. | ||||||

When managing Kubernetes, operators commonly ask:

- How do I control when scale-downs happen, and where they shouldn't?
- How do I control workload disruption so it happens predictably (and not in the middle of business hours)?
- Why won't NAP scale down, even though I have lots of underused capacity?
- Why do upgrades get "stuck" on certain nodes?

This post focuses on **NAP disruption best practices**, not workload scheduling (topology spread constraints, node affinity, taints, and so on). For more on scheduling best practices, check out our earlier blog post on NAP scheduling fundamentals.

If you're new to these NAP features, this post gives you "good defaults" as a starting point. If you're already deep into NAP disruption settings, treat it as a checklist for the behaviors AKS users most commonly ask about.

---

<!-- truncate -->

:::info

Learn more about how to [configure disruption policies for NAP](https://learn.microsoft.com/azure/aks/node-auto-provisioning-disruption).

:::

---

## Part 1 — The mental model: two layers of disruption control

When NAP decides a node (virtual machine) *could* be removed, two layers of controls determine whether it actually happens:

### Workload layer: Pod Disruption Budgets (PDBs)

PDBs are Kubernetes-native guardrails that limit **voluntary evictions** of pods. PDBs are how you tell Kubernetes:

"During voluntary disruptions, keep at least N replicas available (or limit the maximum unavailable)."

:::note
Pod Disruption Budgets protect against **voluntary evictions**, not involuntary failures, forced migrations, or spot node eviction.
:::

### Infrastructure layer: Node-level disruption settings

NAP also lets you configure disruption behavior at the node level.

NAP is built on Karpenter concepts and exposes disruption controls on the **NodePool**:

- **Consolidation policy** (when NAP is allowed to consolidate)
- **Disruption budgets** (how many nodes can be disrupted at once, and when)
- **Expire-after** (node lifetime)
- **Drift** (replace nodes that are out of date with the desired NodePool configuration)

A good operational posture is: **use PDBs to protect *applications*** and **use NAP disruption tools to control *the cluster's disruption rate***.
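
To make the two layers concrete, here is a minimal sketch of how they pair up. The app label `web` and NodePool name `default` are illustrative values, not requirements:

```yaml
# Workload layer: protect the application during voluntary evictions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  maxUnavailable: 1           # at most one replica evicted at a time
  selector:
    matchLabels:
      app: web
---
# Infrastructure layer: cap the cluster-wide disruption rate.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    budgets:
    - nodes: "1"              # disrupt at most one node at a time
```

The PDB bounds how fast any one application loses replicas; the NodePool budget bounds how fast the cluster as a whole loses nodes. The rest of this post walks through each knob in detail.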

---

## Part 2 — NAP Overview

Node auto-provisioning (NAP) provisions, scales, and manages nodes. NAP bases its scheduling and disruption logic on settings from three sources:

- **Workload manifests**: for disruption, NAP honors the Pod Disruption Budgets you define here.
- **[NodePool CRD](https://learn.microsoft.com/azure/aks/node-auto-provisioning-node-pools)**: lists the range of allowed virtual machine options (size, zones, architecture) as well as disruption settings.
- **[AKSNodeClass CRD](https://learn.microsoft.com/azure/aks/node-auto-provisioning-aksnodeclass)**: defines Azure-specific settings.

### How NAP handles disruption

NAP honors Kubernetes-native concepts such as Pod Disruption Budgets when making disruption decisions. NAP also has Karpenter-based concepts such as consolidation, drift, and node disruption budgets.

#### What "disruption" means in NAP (and what it doesn't)

In NAP, "disruption" typically refers to **voluntary** actions that delete nodes after draining them, such as:

- **Consolidation**: deleting or replacing nodes (with better VM sizes) to increase compute efficiency (and reduce cost).
- **Drift**: replacing existing nodes that no longer match the desired configuration (for example, updated settings in your NodePool and AKSNodeClass CRDs).
- **Expiration**: replacing nodes after a configured lifetime.

These are different from **involuntary** disruptions such as:

- Spot eviction events
- Hardware failures
- Host reboots outside your control

PDBs and Karpenter disruption budgets mainly help with **voluntary** disruptions. These features do not regulate involuntary disruption (for example, spot VM evictions, node termination events, or node stopping events).

---

## Part 3 — Pod Disruption Budgets (PDBs): controlling voluntary disruption

The most common NAP disruption problems come from PDBs that are either:

- **Too strict**: a guardrail so strong that it blocks node drains indefinitely
- **Missing**: no guardrail at all, allowing too much disruption at once

### A good default PDB

The Kubernetes documentation describes `minAvailable` and `maxUnavailable` as the two key availability knobs for PDBs, and notes that you can specify only one per PDB.

Here's an example of a PDB that regulates disruption without blocking scale-downs, upgrades, and consolidation:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: web
```

Why it works well in practice:

- Consolidation, drift, and expiration can still proceed.
- You avoid large brownouts caused by draining too many replicas at once.
- You reduce the chance of NAP "thrashing" a service by repeatedly moving too many pods.
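
If it's more natural to reason about how many replicas must stay up, the same intent can be expressed with `minAvailable` instead (remember: only one of the two knobs per PDB). A sketch, reusing the hypothetical `web` app:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb-minavailable
spec:
  minAvailable: 2             # keep at least two replicas up during voluntary disruptions
  selector:
    matchLabels:
      app: web
```

With three replicas this behaves like `maxUnavailable: 1`; as you scale the deployment up, it becomes progressively more permissive, which is one reason `maxUnavailable` is often the easier default to reason about.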

### The common PDB pitfall: "zero voluntary evictions"

If you effectively allow zero voluntary evictions (`maxUnavailable: 0` or `minAvailable: 100%`), Kubernetes warns that this can block node drains indefinitely for any node running one of those pods.

This common misconfiguration can cause scenarios such as:

- Node and cluster upgrades fail because nodes won't voluntarily drain
- Node migrations fail
- NAP consolidation never happens

This can be intentional for extremely sensitive workloads, but it has a cost: if a node hosts one of these pods, draining that node can become impossible without changing the PDB (or taking an outage). We recommend allowing some tolerance in these settings, and also using disruption budgets or maintenance windows to control disruption.

**Practical guidance:**

- For critical workloads that you do not want disrupted at all, the strictness of "zero evictions" may be intentional — but be deliberate. When you're ready to allow disruption to these workloads, you may have to change the PDBs in the workload deployment file.
- For general workloads that can tolerate minor disruption, prefer a small `maxUnavailable` (like 1) rather than "zero evictions."
- Be clear on the tradeoff: zero tolerance blocks upgrades, NAP consolidation, and scale-downs.
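
For contrast, here is a sketch of the anti-pattern described above. Applying a PDB like this makes any node hosting an `app: web` pod effectively undrainable (the name and label are illustrative):

```yaml
# Anti-pattern: zero voluntary evictions allowed.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb-frozen
spec:
  maxUnavailable: 0           # blocks every voluntary eviction, including drains
  selector:
    matchLabels:
      app: web
```

If you ship a configuration like this, treat it as a deliberate operational decision and document who is responsible for relaxing it when nodes need to be drained.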

## Part 4 — Controlling consolidation: "when" vs. "how fast"

There are two different operator intents that often get conflated:

- **When** consolidation is allowed to happen
- **How much** disruption can happen concurrently (budgets / rate limiting)

### Consolidation policy (when)

Use the NodePool's consolidation policy to express your comfort level with cost-optimization moves. For many clusters, a safe baseline is "only consolidate when empty or underutilized," and then use budgets to keep the pace controlled.

Consolidation settings:

- `consolidationPolicy: WhenEmptyOrUnderutilized`: triggered when NAP identifies that existing nodes are underutilized (or empty). NAP determines this by running cost simulations to find a combination of VM sizes that better matches the current configuration; once such a combination is found, consolidation is triggered.
- `consolidateAfter: 24h`: a time-based setting that controls the delay before NAP consolidates underutilized nodes, working in conjunction with the `consolidationPolicy` setting.
- `expireAfter: 24h`: a time-based setting that determines how long nodes defined in this NodePool are allowed to exist. Any older nodes are deleted, regardless of consolidation policy.

_NOTE:_ How NAP defines "underutilized" is not currently a value that users can set; it is determined by the cost simulations NAP runs.

The following example shows these disruption settings in action:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
  template:
    spec:
      nodeClassRef:
        name: default
      expireAfter: Never
```
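
Putting the time-based settings from the list above together, a sketch of a NodePool that waits a day before consolidating underutilized nodes and recycles every node weekly might look like this. The durations are illustrative; pick values that fit your change cadence:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 24h     # wait a day before consolidating underutilized nodes
  template:
    spec:
      nodeClassRef:
        name: default
      expireAfter: 168h       # recycle nodes after one week, regardless of utilization
```

Note that `expireAfter` sits under `template.spec` (it is a property of the nodes the pool creates), while the consolidation knobs sit under `spec.disruption`.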

### Node disruption budgets (how fast)

NAP exposes Karpenter-style disruption budgets on the NodePool. If you don't set them, a default budget of `nodes: 10%` is used. Use budgets to regulate how many nodes can be disrupted at a time.

The following example limits disruption to one node at a time:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    budgets:
    - nodes: "1"
```

This is often the simplest way to prevent "NAP moved too many nodes at once."
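
Budgets can also be scoped to specific disruption reasons. A sketch, assuming your NAP version supports the Karpenter v1 `reasons` field (values such as `Underutilized`, `Empty`, and `Drifted`):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    budgets:
    - nodes: "2"              # allow up to two nodes at a time for drift replacement
      reasons:
      - Drifted
    - nodes: "1"              # but only one at a time for consolidation
      reasons:
      - Underutilized
      - Empty
```

This lets you keep configuration rollouts (drift) moving at a different pace than cost-driven consolidation.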

---

## Part 5 — Maintenance windows

A good practice for managing disruption is to **allow some consolidation, but only during a specific time window**.

NAP node disruption budgets support `schedule` and `duration` so you can create time-based rules (cron syntax). These node disruption budgets are defined in the `spec.disruption.budgets` field of the [NodePool CRD](https://learn.microsoft.com/azure/aks/node-auto-provisioning-node-pools).

For example, block disruptions during business hours:

```yaml
budgets:
- nodes: "0"
  schedule: "0 9 * * 1-5"   # 9 AM Monday-Friday
  duration: 8h
```

Or allow higher disruption on weekends, and block it otherwise. When multiple budgets are active at the same time, the most restrictive one wins, so the "block" budget is scheduled for weekdays rather than left unconditional:

```yaml
budgets:
- nodes: "50%"              # up to half the nodes may be disrupted
- nodes: "0"
  schedule: "0 0 * * 1-5"   # midnight Monday-Friday
  duration: 24h             # block disruption for the whole weekday
```

**Why this matters:** it aligns cost optimization (consolidation, drift, expiration) and updates with a regulated timeline that works for your workload needs.

To learn more about node disruption budgets, visit our [NAP disruption documentation](https://learn.microsoft.com/azure/aks/node-auto-provisioning-disruption#disruption-budgets).

---

## Part 6 — Don't forget node image updates (drift) and the "90-day" reality

NAP nodes are regularly updated as images change. The node image updates documentation calls out a key behavior: **if a node image version is older than 90 days, NAP forces pickup of the latest image version, bypassing any existing maintenance window**.

Operational takeaways:

- Set up maintenance windows and budgets, but also ensure you're not drifting so long that you hit a forced-update scenario.
- Treat "keep nodes reasonably fresh" as part of disruption planning, not an afterthought.

---

## Part 7 — Observability: verify disruption decisions with events and logs

Before changing policies, confirm what NAP *thinks* it's doing:

- View events: `kubectl get events --field-selector source=karpenter-events`
- Or use AKS control plane logs in Log Analytics (filter for `karpenter-events`)

This helps distinguish:

- "NAP wants to disrupt but is blocked by PDBs / budgets"
- "NAP isn't trying to disrupt because the consolidation policy doesn't allow it"
- "NAP can't replace nodes because provisioning is failing"

---

## Common disruption pitfalls

### Symptom: NAP won't consolidate / drains hang forever

**Likely cause**

- PDBs effectively allow zero voluntary evictions (`maxUnavailable: 0` / `minAvailable: 100%`), or
- Too few replicas to satisfy the PDB during drain.

**Fix**

- Relax PDBs (for example, `maxUnavailable: 1`) or increase replicas.
- If a workload truly must not be disrupted, accept that nodes running it won't be good consolidation targets.