This guide covers strategies for scaling and optimizing performance when deploying and managing OpenStack External Data Plane Management (EDPM) nodes using the openstack-operator.
The openstack-operator uses Ansible to configure and manage external compute nodes through its DataPlane functionality. Understanding how to structure your NodeSets and tune Ansible execution is critical for achieving optimal deployment performance, especially in large-scale environments.
An OpenStackDataPlaneNodeSet represents a group of nodes with similar
configuration. How you group nodes significantly impacts deployment performance
and manageability.
Group all similar nodes (e.g., all compute nodes) into one NodeSet.
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneNodeSet
metadata:
  name: compute-nodes
spec:
  nodes:
    compute-0:
      hostName: compute-0
      ansible:
        ansibleHost: 192.168.122.100
    compute-1:
      hostName: compute-1
      ansible:
        ansibleHost: 192.168.122.101
    # ... Up to any large number of computes
    compute-99:
      hostName: compute-99
      ansible:
        ansibleHost: 192.168.122.199

Advantages:
- Single Ansible execution handles all nodes
- Ansible’s built-in parallelism (forks) manages concurrency
- Simpler to manage - one OpenStackDataPlaneNodeSet CR to track
- Consistent configuration across all nodes
- Efficient OpenShift resource usage - fewer ansible-runner pods during OpenStackDataPlaneDeployment execution
Disadvantages:
- Parallelism is limited by the Ansible forks setting
- Single failure point - one playbook error may affect all nodes in the NodeSet
- Harder to isolate problems to specific node subsets
- Longer serial operations (e.g., gathering facts from 100 nodes)
Divide nodes into multiple NodeSets, each with a subset of nodes.
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneNodeSet
metadata:
  name: compute-group-1
spec:
  nodes:
    compute-0:
      hostName: compute-0
      ansible:
        ansibleHost: 192.168.122.100
    # ... compute-1 through compute-24
---
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneNodeSet
metadata:
  name: compute-group-2
spec:
  nodes:
    compute-25:
      hostName: compute-25
      ansible:
        ansibleHost: 192.168.122.125
    # ... compute-26 through compute-49

Advantages:
- Increased parallelism - multiple ansible-runner pods execute simultaneously, one per NodeSet
- Better failure isolation - one NodeSet’s failure doesn’t block others
- Easier troubleshooting and isolation of issues
- Can deploy groups incrementally or independently
- Lower memory per ansible-runner pod
Disadvantages:
- More NodeSet CRs to manage and monitor
- Potential for configuration drift between NodeSets
- Higher OpenShift overhead - multiple ansible-runner pods
- More complex deployment orchestration
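Splitting a flat host list into evenly sized groups is mechanical, so it can be scripted. The following is a minimal sketch, not part of the operator: the helper name `split_into_nodesets` and the `compute-group-N` naming are assumptions chosen to mirror the example above.

```python
def split_into_nodesets(hostnames, group_size, prefix="compute-group"):
    """Partition hostnames into consecutive groups of at most group_size."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    groups = {}
    for start in range(0, len(hostnames), group_size):
        # Groups are numbered from 1, matching compute-group-1, compute-group-2, ...
        name = f"{prefix}-{start // group_size + 1}"
        groups[name] = hostnames[start:start + group_size]
    return groups


# 50 hypothetical hosts split into groups of 25, as in the two NodeSets above.
hosts = [f"compute-{i}" for i in range(50)]
nodesets = split_into_nodesets(hosts, 25)
for name, members in nodesets.items():
    print(f"{name}: {members[0]} .. {members[-1]} ({len(members)} nodes)")
```

Each resulting group would still need its own OpenStackDataPlaneNodeSet CR; the sketch only decides which hosts go where.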
Group nodes by their role or function rather than arbitrarily.
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneNodeSet
metadata:
  name: compute-standard
spec:
  nodes:
    compute-0:
      hostName: compute-0
      ansible:
        ansibleHost: 192.168.122.100
    # ... standard compute nodes
---
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneNodeSet
metadata:
  name: compute-gpu
spec:
  nodeTemplate:
    ansible:
      ansibleVars:
        edpm_nova_pci_passthrough_whitelist:
          - '{"vendor_id": "10de", "product_id": "1b38"}'
  nodes:
    compute-gpu-0:
      hostName: compute-gpu-0
      ansible:
        ansibleHost: 192.168.122.210
    # ... GPU compute nodes

Advantages:
- Different configurations for different node types
- Clear organizational structure
- Natural parallelism across different roles
- Easier capacity planning and management
Disadvantages:
- May not maximize parallelism if roles have different node counts
- Configuration must be carefully managed across NodeSets
Role-based NodeSets otherwise share the advantages and disadvantages of multiple smaller NodeSets.
Understanding how the openstack-operator executes deployments is key to optimization.
When you create an OpenStackDataPlaneDeployment with multiple NodeSets, the operator starts them sequentially but they execute in parallel:
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: edpm-deployment
spec:
  nodeSets:
    - compute-group-1
    - compute-group-2
    - compute-group-3
    - compute-group-4

Execution flow:
- Operator starts deployment for compute-group-1 → ansible-runner pod launches
- Operator starts deployment for compute-group-2 → ansible-runner pod launches
- Operator starts deployment for compute-group-3 → ansible-runner pod launches
- Operator starts deployment for compute-group-4 → ansible-runner pod launches
- All four ansible-runner pods execute in parallel
This means four separate Ansible executions run simultaneously, each processing its own NodeSet.
Within each NodeSet, services execute sequentially (one after another). You cannot parallelize service execution within a single NodeSet, but multiple NodeSets executing in parallel means those services run in parallel across NodeSets.
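The execution model described above can be pictured with a toy simulation. This is only an illustration under stated assumptions: a thread stands in for each ansible-runner pod, the service names are placeholders, and nothing here reflects the operator's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def run_nodeset(nodeset, services):
    """Run services one after another for a single NodeSet."""
    completed = []
    for service in services:
        # A real run would execute the service's playbook here.
        completed.append(service)
    return nodeset, completed


nodesets = ["compute-group-1", "compute-group-2", "compute-group-3", "compute-group-4"]
services = ["service-a", "service-b", "service-c"]  # placeholder names

# One worker per NodeSet: NodeSets run side by side, services stay ordered.
with ThreadPoolExecutor(max_workers=len(nodesets)) as pool:
    futures = [pool.submit(run_nodeset, ns, services) for ns in nodesets]
    results = dict(f.result() for f in futures)

for ns in nodesets:
    print(ns, "->", results[ns])
```

Whatever the interleaving between workers, each NodeSet sees its services in the listed order, which is the property the paragraph above describes.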
Within a single Ansible execution (one NodeSet), parallelism is controlled by
Ansible’s forks setting. This determines how many nodes Ansible configures
simultaneously.
Default behavior (from edpm-ansible playbooks):
- Strategy: linear (waits for all hosts to complete a task before moving to the next task)
- Forks: defaults to 5 (can be overridden with ANSIBLE_FORKS)
You can configure Ansible behavior by setting Ansible-specific environment variables in the NodeSet spec:
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneNodeSet
metadata:
  name: compute-optimized
spec:
  env:
    # Enable colored output for easier log reading
    - name: ANSIBLE_FORCE_COLOR
      value: "True"
    # Increase parallel execution (default: 5)
    # Set based on your control plane resources
    - name: ANSIBLE_FORKS
      value: "50"
    # Increase SSH connection timeout (default: 10)
    # Useful for slow networks or heavily loaded nodes
    - name: ANSIBLE_TIMEOUT
      value: "30"
    # Enable pipelining to reduce SSH overhead
    # Requires sudo without requiretty
    - name: ANSIBLE_PIPELINING
      value: "True"
    # SSH connection plugin's own pipelining switch
    - name: ANSIBLE_SSH_PIPELINING
      value: "True"
    # Control SSH connection multiplexing
    - name: ANSIBLE_SSH_CONTROL_PATH_DIR
      value: "/tmp/ansible-ssh-%%h-%%p-%%r"
    # Set callback plugins for better output
    - name: ANSIBLE_STDOUT_CALLBACK
      value: "yaml"
    # Seconds between async job status polls
    - name: ANSIBLE_POLL_INTERVAL
      value: "5"
  nodes:
    compute-0:
      # ... node definitions

See https://docs.ansible.com/projects/ansible/latest/reference_appendices/config.html for a comprehensive list of Ansible configuration settings.
The following are the more common configuration settings related to performance tuning for large environments.
The ANSIBLE_FORKS setting is the most impactful tuning parameter.
Considerations:
- Control plane resources: More forks require more CPU/memory in the ansible-runner pod
- Network capacity: More simultaneous SSH connections
- Target node capacity: Nodes must handle concurrent configuration tasks
Recommendations:
- Small deployments (< 10 nodes): ANSIBLE_FORKS=5-10
- Medium deployments (10-50 nodes): ANSIBLE_FORKS=20-30
- Large deployments (50-100 nodes): ANSIBLE_FORKS=50-75
- Very large (100+ nodes): Consider multiple NodeSets instead of a very high fork count
env:
  - name: ANSIBLE_FORKS
    value: "50" # Tune based on your environment

Performance impact: A higher fork count lets Ansible configure more nodes per batch, which shortens deployment time whenever the NodeSet has more nodes than forks.
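The fork recommendations above can be written down as a small lookup. This is a sketch rather than an official sizing tool: the function name `suggested_forks` is hypothetical, and the handling of the overlapping boundaries (10, 50 and 100 nodes) is an assumption. The range is capped at the node count because forks beyond the number of hosts go unused.

```python
def suggested_forks(node_count):
    """Return a (low, high) ANSIBLE_FORKS range, or None above 100 nodes."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    if node_count > 100:
        # Very large: the guidance is to split into multiple NodeSets instead.
        return None
    if node_count < 10:
        low, high = 5, 10
    elif node_count <= 50:  # assumption: 10 and 50 count as "medium"
        low, high = 20, 30
    else:  # assumption: 51-100 count as "large"
        low, high = 50, 75
    # More forks than hosts gives no extra parallelism.
    return min(low, node_count), min(high, node_count)


for count in (7, 30, 100, 250):
    print(count, suggested_forks(count))
```

Treat the output as a starting point; the considerations listed earlier (runner pod resources, network and node capacity) still decide the final value.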
Pipelining reduces the number of SSH operations required for each task. See https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/ssh_connection.html#parameter-pipelining for more details.
Requirements:
- SSH user must have passwordless sudo or sudo without requiretty

env:
  - name: ANSIBLE_PIPELINING
    value: "True"
  - name: ANSIBLE_SSH_PIPELINING
    value: "True"

If you have slow networks or heavily loaded nodes:
env:
  - name: ANSIBLE_TIMEOUT
    value: "60" # SSH connection timeout in seconds
  - name: ANSIBLE_GATHER_TIMEOUT
    value: "60" # Fact gathering timeout

Reduce overhead from excessive logging:
env:
  - name: ANSIBLE_STDOUT_CALLBACK
    value: "yaml" # or "json" for structured output
  - name: ANSIBLE_DISPLAY_SKIPPED_HOSTS
    value: "False"
  - name: ANSIBLE_DISPLAY_OK_HOSTS
    value: "False" # Only show changed/failed
  - name: ANSIBLE_RETRY_FILES_ENABLED
    value: "False" # Disable retry files

The ansibleLimit field allows you to target specific nodes within your NodeSets without modifying the NodeSet definitions.
Deploy only to specific nodes:
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: limited-deployment
spec:
  nodeSets:
    - compute-nodes
  ansibleLimit: "compute-0,compute-5,compute-10"

This deploys only to compute-0, compute-5, and compute-10 within the compute-nodes NodeSet.
Use Ansible patterns for more flexible targeting. See https://docs.ansible.com/projects/ansible/latest/inventory_guide/intro_patterns.html for more details on using ansible limit and pattern matching.
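Wildcard patterns can match more hosts than expected, so it helps to preview them against your host list first. The sketch below uses Python's `fnmatch` as a stand-in for shell-style wildcard matching; it approximates how a glob such as `compute-1*` behaves and is not Ansible's own matching code. The 100-host list is hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical inventory: compute-0 through compute-99.
hosts = [f"compute-{i}" for i in range(100)]


def match_glob(pattern, hostnames):
    """Return the hostnames that match a shell-style wildcard pattern."""
    return [h for h in hostnames if fnmatchcase(h, pattern)]


matched = match_glob("compute-1*", hosts)
print(len(matched), matched)
```

With this host list the wildcard selects compute-1 plus compute-10 through compute-19, eleven hosts in total, which is worth knowing before using such a pattern for a "small" rollout.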
Example OpenStackDataPlaneDeployments using ansibleLimit and pattern matching:
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: pattern-limited-deployment
spec:
  nodeSets:
    - compute-nodes
  ansibleLimit: "compute-[0:9]" # First 10 nodes (compute-0 through compute-9)

spec:
  nodeSets:
    - compute-nodes
  ansibleLimit: "compute-1*" # Matches compute-1, compute-10, compute-11, etc.

Deploy to nodes incrementally to validate changes:
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: rollout-phase-1
spec:
  nodeSets:
    - compute-nodes
  ansibleLimit: "compute-[0:9]"
---
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: rollout-phase-2
spec:
  nodeSets:
    - compute-nodes
  ansibleLimit: "compute-[10:49]"

Fix issues on specific problematic nodes:
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: hotfix-deployment
spec:
  nodeSets:
    - compute-nodes
  ansibleLimit: "compute-15,compute-23,compute-67"
  servicesOverride:
    - "configure-os" # Only reconfigure OS

Test configuration changes on a canary node:
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: canary-test
spec:
  nodeSets:
    - compute-nodes
  ansibleLimit: "compute-0" # Single canary node

The ansibleLimit applies to all NodeSets in the deployment:
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: multi-nodeset-limited
spec:
  nodeSets:
    - compute-group-1
    - compute-group-2
    - compute-group-3
  ansibleLimit: "compute-[0:9]" # Applies to all three NodeSets

This deploys to compute-0 through compute-9, in whichever of the three referenced NodeSets those nodes are defined.
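To see which NodeSets a limit actually touches, you can intersect the limited hosts with each NodeSet's membership. The sketch below assumes the 25-nodes-per-group layout used earlier in this guide and models the limit as an explicit set of hostnames instead of parsing the pattern; both are illustrative assumptions.

```python
# Assumed layout: 25 consecutive hosts per NodeSet.
nodesets = {
    "compute-group-1": [f"compute-{i}" for i in range(0, 25)],
    "compute-group-2": [f"compute-{i}" for i in range(25, 50)],
    "compute-group-3": [f"compute-{i}" for i in range(50, 75)],
}

# The limit, written out as hostnames: compute-0 through compute-9.
limit = {f"compute-{i}" for i in range(10)}

# Keep only the limited hosts in each NodeSet, preserving their order.
selected = {
    name: [host for host in members if host in limit]
    for name, members in nodesets.items()
}

for name, members in selected.items():
    print(f"{name}: {len(members)} host(s) selected")
```

Under this layout all ten hosts fall in compute-group-1 and the other two NodeSets select nothing, so a limit shared across NodeSets is worth checking against where the hosts are actually defined.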
Let’s compare different strategies for deploying 100 compute nodes.
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneNodeSet
metadata:
  name: all-computes
spec:
  env:
    - name: ANSIBLE_FORKS
      value: "50"
  nodes:
    # compute-0 through compute-99 (100 nodes)

Characteristics:
- Ansible executions per service: 1
- Parallelism: Up to 50 nodes at once (limited by ANSIBLE_FORKS)
- ansible-runner pods per service: 1
- Fact gathering: Serial across 100 nodes (even with forks, each batch completes before the next starts)
- Failure handling: One failure may require redeploying all 100 nodes
- Resource usage: One large ansible-runner pod per service
Timeline example (assuming 10 services, 5 minutes per service per node):
- With forks=50: Two batches of 50 nodes
- Service 1: Batch 1 (50 nodes) runs in parallel = 5 min, then Batch 2 (50 nodes) = 5 min → 10 min total
- Service 2: Same pattern → 10 min total
- Total for 10 services: ~100 minutes
# compute-group-1: compute-0 through compute-24
# compute-group-2: compute-25 through compute-49
# compute-group-3: compute-50 through compute-74
# compute-group-4: compute-75 through compute-99
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneNodeSet
metadata:
  name: compute-group-1
spec:
  env:
    - name: ANSIBLE_FORKS
      value: "25"
  nodes:
    # 25 nodes
---
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: deploy-all-groups
spec:
  nodeSets:
    - compute-group-1
    - compute-group-2
    - compute-group-3
    - compute-group-4

Characteristics:
- Ansible executions per service: 4 (in parallel)
- Parallelism: 4 NodeSets × 25 nodes = 100 nodes effectively in parallel
- ansible-runner pods per service: 4
- Fact gathering: Parallel across 4 groups
- Failure handling: One group’s failure doesn’t block others
- Resource usage: Four medium ansible-runner pods per service
Timeline example (same assumptions):
- With forks=25: Each NodeSet processes all 25 nodes in one batch
- Service 1: All 4 groups run in parallel = 5 min total
- Service 2: All 4 groups run in parallel = 5 min total
- Total for all 10 services: ~50 minutes (2x faster than the single NodeSet)
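The timelines in this comparison follow from a simple batch model: each service takes one fixed-length pass per batch of `forks` nodes, services run one after another, and NodeSets of equal size run in parallel, so the total equals the time for one NodeSet. The sketch below encodes that model; the function name is hypothetical, and the forks value of 10 for the ten-NodeSet layout is an assumption, since it is not stated here.

```python
import math


def estimate_minutes(nodes_per_nodeset, forks, services, minutes_per_service):
    """Estimate wall-clock minutes for one NodeSet under the batch model."""
    # Nodes are processed in batches of at most `forks` hosts.
    batches = math.ceil(nodes_per_nodeset / forks)
    # Services run sequentially; each one needs a pass per batch.
    return services * batches * minutes_per_service


single = estimate_minutes(100, 50, services=10, minutes_per_service=5)
four = estimate_minutes(25, 25, services=10, minutes_per_service=5)
ten = estimate_minutes(10, 10, services=10, minutes_per_service=5)  # forks=10 assumed

print(f"1 NodeSet of 100: ~{single} min")
print(f"4 NodeSets of 25: ~{four} min")
print(f"10 NodeSets of 10: ~{ten} min")
```

The model shows why going from four to ten NodeSets does not shorten the estimate: once every node in a NodeSet fits in a single batch, the remaining time is the sequential service chain. Real runs vary per task and per node, so use the numbers only for relative comparison.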
| Aspect | Single NodeSet (100) | 4 NodeSets (25 each) | 10 NodeSets (10 each) |
|---|---|---|---|
| Parallelism | Limited (50 forks max) | High (4 executions) | Maximum (10 executions) |
| Deployment Time | ~100 minutes | ~50 minutes | ~50 minutes |
| Resource Overhead | Low (1 pod) | Medium (4 pods) | High (10 pods) |
| Failure Isolation | Poor (all-or-nothing) | Good (25% chunks) | Excellent (10% chunks) |
| Management Complexity | Simple (1 CR) | Moderate (4 CRs) | Complex (10 CRs) |
| Troubleshooting | Difficult | Easier | Easiest |
| Best For | Small deployments, uniform nodes | Balanced performance/management | Maximum speed, good isolation |
Use Single Large NodeSet when:
- The NodeSet has fewer nodes than a reasonable ANSIBLE_FORKS value
- Resources are constrained (can’t run multiple pods)
- Configuration is identical across all nodes
- Simplicity is more important than speed
Use Multiple Medium NodeSets when:
- You want balanced performance and manageability
- Some failure isolation is important
- You have sufficient cluster resources for multiple pods
Use Many Small NodeSets when:
- Maximum deployment speed is critical
- Strong failure isolation is required
- You have ample cluster resources
- You can manage the additional CRs
Use Role-Based NodeSets when:
- Nodes have different configurations
- Different hardware types exist
- You need to deploy to subsets frequently
- Organizational clarity is important
Begin with modest settings and increase based on observed performance:
# Initial deployment
env:
  - name: ANSIBLE_FORKS
    value: "10"

# After monitoring, if resources allow
env:
  - name: ANSIBLE_FORKS
    value: "30"

Before deploying to all nodes, test on a subset:
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: validation-deployment
spec:
  nodeSets:
    - compute-nodes
  ansibleLimit: "compute-0,compute-1"
---
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: production-deployment
spec:
  nodeSets:
    - compute-nodes