DataPlane Performance Tuning for large scale deployments

This guide covers strategies for scaling and optimizing performance when deploying and managing OpenStack External Data Plane Management (EDPM) nodes using the openstack-operator.

Overview

The openstack-operator uses Ansible to configure and manage external compute nodes through its DataPlane functionality. Understanding how to structure your NodeSets and tune Ansible execution is critical for achieving optimal deployment performance, especially in large-scale environments.

Key Performance Factors

NodeSet organization: How nodes are grouped affects parallelism
Ansible parallelism: Configuration of forks and execution strategy
Service execution order: Services run sequentially within each NodeSet
Network and hardware resources: Available SSH connections, CPU, memory

NodeSet Grouping Strategies

An OpenStackDataPlaneNodeSet represents a group of nodes with similar configuration. How you group nodes significantly impacts deployment performance and manageability.

Strategy 1: Single Large NodeSet

Group all similar nodes (e.g., all compute nodes) into one NodeSet.

Single Large NodeSet Example

apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneNodeSet
metadata:
  name: compute-nodes
spec:
  nodes:
    compute-0:
      hostName: compute-0
      ansible:
        ansibleHost: 192.168.122.100
    compute-1:
      hostName: compute-1
      ansible:
        ansibleHost: 192.168.122.101
    # ... Up to any large number of computes
    compute-99:
      hostName: compute-99
      ansible:
        ansibleHost: 192.168.122.199

Advantages:

Single Ansible execution handles all nodes
Ansible’s built-in parallelism (forks) manages concurrency
Simpler to manage - one OpenStackDataPlaneNodeSet CR to track
Consistent configuration across all nodes
Efficient OpenShift resource usage - fewer ansible-runner pods during OpenStackDataPlaneDeployment execution

Disadvantages:

Parallelism is capped by the Ansible forks setting
Single failure point - one playbook error may affect all nodes in the NodeSet
Harder to isolate problems to specific node subsets
Longer serial operations (e.g., gathering facts from 100 nodes)

Strategy 2: Multiple Smaller NodeSets

Divide nodes into multiple NodeSets, each with a subset of nodes.

Multiple NodeSets Example

apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneNodeSet
metadata:
  name: compute-group-1
spec:
  nodes:
    compute-0:
      hostName: compute-0
      ansible:
        ansibleHost: 192.168.122.100
    # ... compute-1 through compute-24
---
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneNodeSet
metadata:
  name: compute-group-2
spec:
  nodes:
    compute-25:
      hostName: compute-25
      ansible:
        ansibleHost: 192.168.122.125
    # ... compute-26 through compute-49

Advantages:

Increased parallelism - one ansible-runner pod per NodeSet, all executing simultaneously
Better failure isolation - one NodeSet’s failure doesn’t block others
Easier troubleshooting and isolation of issues
Can deploy groups incrementally or independently
Lower memory per ansible-runner pod

Disadvantages:

More NodeSet CRs to manage and monitor
Potential for configuration drift between NodeSets
Higher OpenShift overhead - multiple ansible-runner pods
More complex deployment orchestration
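When nodes follow a regular naming scheme, the per-group manifests can be generated rather than written by hand. The following is a minimal sketch, not part of the openstack-operator: the function name build_nodesets, the compute-N naming, and the sequential IP assignment starting at 192.168.122.100 are assumptions taken from the examples above, and a real NodeSet needs further fields that are omitted here.

```python
import ipaddress
import json


def build_nodesets(total_nodes, group_size, first_ip="192.168.122.100"):
    """Return a list of OpenStackDataPlaneNodeSet manifests as plain dicts.

    Hypothetical helper: node names and IPs follow the compute-N pattern
    used in this guide's examples.
    """
    base_ip = ipaddress.ip_address(first_ip)
    manifests = []
    for start in range(0, total_nodes, group_size):
        nodes = {}
        for index in range(start, min(start + group_size, total_nodes)):
            name = f"compute-{index}"
            nodes[name] = {
                "hostName": name,
                "ansible": {"ansibleHost": str(base_ip + index)},
            }
        manifests.append({
            "apiVersion": "dataplane.openstack.org/v1beta1",
            "kind": "OpenStackDataPlaneNodeSet",
            "metadata": {"name": f"compute-group-{start // group_size + 1}"},
            "spec": {"nodes": nodes},
        })
    return manifests


if __name__ == "__main__":
    # Print each manifest as JSON (which is also valid YAML) for review.
    for manifest in build_nodesets(total_nodes=50, group_size=25):
        print(json.dumps(manifest, indent=2))
```

Generating every group from one function is also one way to limit the configuration drift listed among the disadvantages, since all NodeSets share the same template.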

Strategy 3: Role-Based NodeSets

Group nodes by their role or function rather than arbitrarily.

Role-Based NodeSets Example

apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneNodeSet
metadata:
  name: compute-standard
spec:
  nodes:
    compute-0:
      hostName: compute-0
      ansible:
        ansibleHost: 192.168.122.100
    # ... standard compute nodes
---
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneNodeSet
metadata:
  name: compute-gpu
spec:
  nodeTemplate:
    ansible:
      ansibleVars:
        edpm_nova_pci_passthrough_whitelist:
          - '{"vendor_id": "10de", "product_id": "1b38"}'
  nodes:
    compute-gpu-0:
      hostName: compute-gpu-0
      ansible:
        ansibleHost: 192.168.122.210
    # ... GPU compute nodes

Advantages:

Different configurations for different node types
Clear organizational structure
Natural parallelism across different roles
Easier capacity planning and management

Disadvantages:

May not maximize parallelism if roles have different node counts
Configuration must be carefully managed across NodeSets

Role-based NodeSets also share the advantages and disadvantages of multiple smaller NodeSets.

Parallel Execution Patterns

Understanding how the openstack-operator executes deployments is key to optimization.

NodeSet-Level Parallelism

When you create an OpenStackDataPlaneDeployment with multiple NodeSets, the operator starts them sequentially but they execute in parallel:

Multiple NodeSets Deployment

apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: edpm-deployment
spec:
  nodeSets:
    - compute-group-1
    - compute-group-2
    - compute-group-3
    - compute-group-4

Execution flow:

Operator starts deployment for compute-group-1 → ansible-runner pod launches
Operator starts deployment for compute-group-2 → ansible-runner pod launches
Operator starts deployment for compute-group-3 → ansible-runner pod launches
Operator starts deployment for compute-group-4 → ansible-runner pod launches
All four ansible-runner pods execute in parallel

This means four separate Ansible executions run simultaneously, each processing its own NodeSet.

Service-Level Execution

Within each NodeSet, services execute sequentially (one after another). You cannot parallelize service execution within a single NodeSet, but multiple NodeSets executing in parallel means those services run in parallel across NodeSets.

Node-Level Parallelism Within Ansible

Within a single Ansible execution (one NodeSet), parallelism is controlled by Ansible’s forks setting. This determines how many nodes Ansible configures simultaneously.

Default behavior (from edpm-ansible playbooks):

Strategy: linear (waits for all hosts to complete a task before moving to next task)
Forks: Defaults to 5 (can be overridden with ANSIBLE_FORKS)
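To get a feel for what the forks setting means at scale, the number of batches per task can be approximated as the node count divided by forks, rounded up. This is a simplified model (the function name batches_per_task is hypothetical, and real Ansible scheduling is more fluid than fixed batches), but it shows why the default of 5 becomes a bottleneck with many nodes.

```python
import math


def batches_per_task(node_count, forks=5):
    """Approximate how many rounds one task needs to cover every node."""
    return math.ceil(node_count / forks)


if __name__ == "__main__":
    for forks in (5, 25, 50):
        rounds = batches_per_task(100, forks)
        print(f"forks={forks}: about {rounds} batch(es) per task for 100 nodes")
```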

Ansible Performance Tuning

Environment Variables

You can configure Ansible behavior by setting Ansible-specific environment variables in the NodeSet spec:

Complete Tuned NodeSet Example

apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneNodeSet
metadata:
  name: compute-optimized
spec:
  env:
    # Enable colored output for easier log reading
    - name: ANSIBLE_FORCE_COLOR
      value: "True"

    # Increase parallel execution (default: 5)
    # Set based on your control plane resources
    - name: ANSIBLE_FORKS
      value: "50"

    # Increase SSH connection timeout (default: 10)
    # Useful for slow networks or heavily loaded nodes
    - name: ANSIBLE_TIMEOUT
      value: "30"

    # Enable pipelining to reduce SSH overhead
    # Requires sudo without requiretty
    - name: ANSIBLE_PIPELINING
      value: "True"

    # Enable pipelining for the ssh connection plugin
    # (ssh-specific form of the pipelining setting above)
    - name: ANSIBLE_SSH_PIPELINING
      value: "True"

    # Control SSH connection multiplexing (ControlPath socket location)
    - name: ANSIBLE_SSH_CONTROL_PATH
      value: "/tmp/ansible-ssh-%%h-%%p-%%r"

    # Set callback plugins for better output
    - name: ANSIBLE_STDOUT_CALLBACK
      value: "yaml"

    # Poll async job status every 5 seconds (default: 15)
    - name: ANSIBLE_POLL_INTERVAL
      value: "5"
  nodes:
    compute-0:
      # ... node definitions

Common Ansible Tuning Techniques

See https://docs.ansible.com/projects/ansible/latest/reference_appendices/config.html for a comprehensive list of ansible configuration settings.

The following are the more common configuration settings related to performance tuning for large environments.

Optimize Fork Count

The ANSIBLE_FORKS setting is the most impactful tuning parameter.

Considerations:

Control plane resources: More forks require more CPU/memory in the ansible-runner pod
Network capacity: More simultaneous SSH connections
Target node capacity: Nodes must handle concurrent configuration tasks

Recommendations:

Small deployments (< 10 nodes): ANSIBLE_FORKS=5-10
Medium deployments (10-50 nodes): ANSIBLE_FORKS=20-30
Large deployments (50-100 nodes): ANSIBLE_FORKS=50-75
Very large (100+ nodes): Consider multiple NodeSets instead of very high fork count

env:
  - name: ANSIBLE_FORKS
    value: "50"  # Tune based on your environment

Performance impact: Increased parallelism can decrease deployment times.
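The recommendations above can be written down as a small lookup, which is handy when NodeSets are generated by tooling. This is only a sketch of the guidance in this section: the function name suggested_forks is hypothetical, and because the ranges above overlap at 50 and 100 nodes, the boundary handling here (50 counts as medium, 100 as large) is an assumption.

```python
def suggested_forks(node_count):
    """Return a (low, high) ANSIBLE_FORKS range for one NodeSet.

    Returns None above 100 nodes, where this guide suggests splitting
    into multiple NodeSets instead of raising forks further.
    """
    if node_count < 10:
        return (5, 10)
    if node_count <= 50:
        return (20, 30)
    if node_count <= 100:
        return (50, 75)
    return None


if __name__ == "__main__":
    for count in (8, 40, 100, 250):
        print(count, suggested_forks(count))
```

Treat the returned range as a starting point and adjust it against the ansible-runner pod's observed CPU and memory usage.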

Enable SSH Pipelining

Pipelining reduces the number of SSH operations required for each task. See https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/ssh_connection.html#parameter-pipelining for more details.

Requirements:

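The SSH user must be able to run sudo without a TTY: requiretty must be disabled in the sudoers configuration on the managed nodes (for example, passwordless sudo without requiretty)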

env:
  - name: ANSIBLE_PIPELINING
    value: "True"
  - name: ANSIBLE_SSH_PIPELINING
    value: "True"

Increase Timeouts for Slow Environments

If you have slow networks or heavily loaded nodes:

env:
  - name: ANSIBLE_TIMEOUT
    value: "60"  # SSH connection timeout in seconds
  - name: ANSIBLE_GATHER_TIMEOUT
    value: "60"  # Fact gathering timeout

Control Output and Logging

Reduce overhead from excessive logging:

env:
  - name: ANSIBLE_STDOUT_CALLBACK
    value: "yaml"  # or "json" for structured output
  - name: ANSIBLE_DISPLAY_SKIPPED_HOSTS
    value: "False"
  - name: ANSIBLE_DISPLAY_OK_HOSTS
    value: "False"  # Only show changed/failed
  - name: ANSIBLE_RETRY_FILES_ENABLED
    value: "False"  # Disable retry files

SSH Connection Multiplexing

Reuse SSH connections for better performance:

env:
  - name: ANSIBLE_SSH_CONTROL_PATH
    value: "/tmp/ansible-ssh-%%h-%%p-%%r"
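  # ControlPersist is passed through the SSH arguments;
  # this value overrides ssh_args, so keep ControlMaster enabled
  - name: ANSIBLE_SSH_ARGS
    value: "-C -o ControlMaster=auto -o ControlPersist=60s"  # Keep connections alive for 60 seconds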

Using ansibleLimit for Targeted Deployments

The ansibleLimit field allows you to target specific nodes within your NodeSets without modifying the NodeSet definitions.

Basic Usage

Deploy only to specific nodes:

apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: limited-deployment
spec:
  nodeSets:
    - compute-nodes
  ansibleLimit: "compute-0,compute-5,compute-10"

This deploys only to compute-0, compute-5, and compute-10 within the compute-nodes NodeSet.

Pattern Matching

Use Ansible patterns for more flexible targeting. See https://docs.ansible.com/projects/ansible/latest/inventory_guide/intro_patterns.html for more details on using ansible limit and pattern matching.

Example OpenStackDataPlaneDeployments using ansibleLimit and pattern matching:

apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: pattern-limited-deployment
spec:
  nodeSets:
    - compute-nodes
  ansibleLimit: "compute-[0:9]"  # First 10 nodes (compute-0 through compute-9)

Wildcard Patterns

spec:
  nodeSets:
    - compute-nodes
  ansibleLimit: "compute-1*"  # Matches compute-1, compute-10, compute-11, etc.
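Wildcards can match more hosts than intended, so it is worth previewing a pattern against your node names before deploying. The sketch below uses Python's fnmatch module as an approximation of Ansible's glob-style matching; the host list is hypothetical, and the result should be treated as a preview rather than a guarantee of what Ansible selects.

```python
from fnmatch import fnmatchcase

# Hypothetical inventory: compute-0 through compute-100.
hosts = [f"compute-{i}" for i in range(101)]


def preview_limit(pattern, candidates):
    """Return the candidate host names that a glob pattern would match."""
    return [host for host in candidates if fnmatchcase(host, pattern)]


if __name__ == "__main__":
    matched = preview_limit("compute-1*", hosts)
    print(len(matched), "hosts matched:")
    print(", ".join(matched))
```

In this hypothetical inventory the pattern also picks up compute-100, which is easy to overlook when the intent was "compute-1 and the teens".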

Use Cases for ansibleLimit

Gradual Rollout

Deploy to nodes incrementally to validate changes:

Phase 1: Deploy to first subset of nodes

apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: rollout-phase-1
spec:
  nodeSets:
    - compute-nodes
  ansibleLimit: "compute-[0:9]"

Phase 2: Deploy to next subset of nodes after validation

apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: rollout-phase-2
spec:
  nodeSets:
    - compute-nodes
  ansibleLimit: "compute-[10:49]"
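For rollouts with more than two phases, the ansibleLimit value for each phase can be computed from the node list. The following sketch is one way to do that under stated assumptions: the helper name rollout_limits is hypothetical, and it emits explicit comma-separated host lists (as in the Basic Usage example) rather than range patterns.

```python
def rollout_limits(hosts, phase_sizes):
    """Split hosts into consecutive phases and return one limit string each.

    Any hosts left over after the listed phases form a final phase.
    """
    limits = []
    start = 0
    for size in phase_sizes:
        phase = hosts[start:start + size]
        if phase:
            limits.append(",".join(phase))
        start += size
    remaining = hosts[start:]
    if remaining:
        limits.append(",".join(remaining))
    return limits


if __name__ == "__main__":
    computes = [f"compute-{i}" for i in range(50)]
    for number, limit in enumerate(rollout_limits(computes, [10, 40]), start=1):
        print(f"rollout-phase-{number}: ansibleLimit: \"{limit}\"")
```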

Hotfix Deployment

Fix issues on specific problematic nodes:

apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: hotfix-deployment
spec:
  nodeSets:
    - compute-nodes
  ansibleLimit: "compute-15,compute-23,compute-67"
  servicesOverride:
    - "configure-os"  # Only reconfigure OS

Testing Changes

Test configuration changes on a canary node:

apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: canary-test
spec:
  nodeSets:
    - compute-nodes
  ansibleLimit: "compute-0"  # Single canary node

Combining with Multiple NodeSets

The ansibleLimit applies to all NodeSets in the deployment:

apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: multi-nodeset-limited
spec:
  nodeSets:
    - compute-group-1
    - compute-group-2
    - compute-group-3
  ansibleLimit: "compute-[0:9]"  # Applies to all three NodeSets

This deploys to compute-0 through compute-9, in whichever of the three referenced NodeSets those nodes are defined.
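Because one limit spans every NodeSet in the deployment, it can help to check which NodeSets will actually have matching nodes. The sketch below does this for explicit comma-separated limits only (it does not interpret ranges or wildcards); the function name targeted_hosts and the NodeSet contents are hypothetical.

```python
def targeted_hosts(nodesets, limit):
    """Map each NodeSet name to the hosts named in a comma-separated limit."""
    wanted = {name.strip() for name in limit.split(",") if name.strip()}
    return {
        nodeset: [host for host in hosts if host in wanted]
        for nodeset, hosts in nodesets.items()
    }


if __name__ == "__main__":
    # Hypothetical layout: 25 nodes per group, as in Strategy 2.
    nodesets = {
        "compute-group-1": [f"compute-{i}" for i in range(0, 25)],
        "compute-group-2": [f"compute-{i}" for i in range(25, 50)],
        "compute-group-3": [f"compute-{i}" for i in range(50, 75)],
    }
    for nodeset, hosts in targeted_hosts(nodesets, "compute-3,compute-30").items():
        print(nodeset, hosts)
```

A NodeSet that maps to an empty list has no nodes selected by the limit.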

Scaling Strategy Comparison

Scenario: 100 Compute Nodes Deployment

Let’s compare different strategies for deploying 100 compute nodes.

Option 1: Single Large NodeSet

Configuration

apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneNodeSet
metadata:
  name: all-computes
spec:
  env:
    - name: ANSIBLE_FORKS
      value: "50"
  nodes:
    # compute-0 through compute-99 (100 nodes)

Characteristics:

Ansible executions per service: 1
Parallelism: Up to 50 nodes at once (limited by ANSIBLE_FORKS)
ansible-runner pods per service: 1
Fact gathering: Batched across 100 nodes (at most ANSIBLE_FORKS nodes at a time, and each batch completes before the next starts)
Failure handling: One failure may require redeploying all 100 nodes
Resource usage: One large ansible-runner pod per service

Timeline example (assuming 10 services, 5 minutes per service per node):

With forks=50: Two batches of 50 nodes
Service 1: Batch 1 (50 nodes) runs in parallel = 5 min, then Batch 2 (50 nodes) = 5 min → 10 min total
Service 2: Same pattern → 10 min total
Total for 10 services: 10 × 10 min = ~100 minutes

Option 2: Four NodeSets (25 nodes each)

Configuration

# compute-group-1: compute-0 through compute-24
# compute-group-2: compute-25 through compute-49
# compute-group-3: compute-50 through compute-74
# compute-group-4: compute-75 through compute-99

apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneNodeSet
metadata:
  name: compute-group-1
spec:
  env:
    - name: ANSIBLE_FORKS
      value: "25"
  nodes:
    # 25 nodes

Deployment

apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: deploy-all-groups
spec:
  nodeSets:
    - compute-group-1
    - compute-group-2
    - compute-group-3
    - compute-group-4

Characteristics:

Ansible executions per service: 4 (in parallel)
Parallelism: 4 NodeSets × 25 nodes = 100 nodes effectively in parallel
ansible-runner pods per service: 4
Fact gathering: Parallel across 4 groups
Failure handling: One group’s failure doesn’t block others
Resource usage: Four medium ansible-runner pods per service

Timeline example (same assumptions):

With forks=25: Each NodeSet processes all 25 nodes in one batch
Service 1: All 4 groups run in parallel = 5 min total
Service 2: All 4 groups run in parallel = 5 min total
Total for 10 services: 10 × 5 min = ~50 minutes (2x faster than Option 1)

Comparison Table

Aspect	Single NodeSet (100)	4 NodeSets (25 each)	10 NodeSets (10 each)
Parallelism	Limited (50 forks max)	High (4 executions)	Maximum (10 executions)
Deployment Time	~100 minutes	~50 minutes	~50 minutes
Resource Overhead	Low (1 pod)	Medium (4 pods)	High (10 pods)
Failure Isolation	Poor (all-or-nothing)	Good (25% chunks)	Excellent (10% chunks)
Management Complexity	Simple (1 CR)	Moderate (4 CRs)	Complex (10 CRs)
Troubleshooting	Difficult	Easier	Easiest
Best For	Small deployments, uniform nodes	Balanced performance/management	Maximum speed, good isolation
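The Deployment Time row can be reproduced with a small model, which also makes it easy to try other layouts. This is a back-of-the-envelope sketch under the same assumptions as the timelines above (10 services, 5 minutes per service per node, nodes handled in batches of forks); the function name estimate_minutes is hypothetical, and forks=10 for the 10-NodeSet layout is an assumption since the table does not state it.

```python
import math


def estimate_minutes(nodeset_sizes, forks, services=10, minutes_per_service=5):
    """Estimate wall-clock minutes when all NodeSets run in parallel."""
    # Each NodeSet works through its nodes in batches of `forks`.
    slowest_batches = max(math.ceil(size / forks) for size in nodeset_sizes)
    # Services run sequentially within each NodeSet, so the slowest
    # NodeSet determines the total.
    return services * slowest_batches * minutes_per_service


if __name__ == "__main__":
    layouts = {
        "Single NodeSet (100)": ([100], 50),
        "4 NodeSets (25 each)": ([25] * 4, 25),
        "10 NodeSets (10 each)": ([10] * 10, 10),  # forks=10 is an assumption
    }
    for name, (sizes, forks) in layouts.items():
        print(f"{name}: ~{estimate_minutes(sizes, forks)} minutes")
```

In this model, going from 4 to 10 NodeSets does not shorten the deployment because every NodeSet already finishes each service in a single batch; the remaining benefit is failure isolation.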

Recommendations

Use Single Large NodeSet when:

You have fewer nodes in the NodeSet than a reasonable ANSIBLE_FORKS value, so all nodes fit in one batch
Resources are constrained (can’t run multiple pods)
Configuration is identical across all nodes
Simplicity is more important than speed

Use Multiple Medium NodeSets when:

You want balanced performance and manageability
Some failure isolation is important
You have sufficient cluster resources for multiple pods

Use Many Small NodeSets when:

Maximum deployment speed is critical
Strong failure isolation is required
You have ample cluster resources
You can manage the additional CRs

Use Role-Based NodeSets when:

Nodes have different configurations
Different hardware types exist
You need to deploy to subsets frequently
Organizational clarity is important

Best Practices

Start Conservative, Then Optimize

Begin with modest settings and increase based on observed performance:

# Initial deployment
env:
  - name: ANSIBLE_FORKS
    value: "10"

# After monitoring, if resources allow
env:
  - name: ANSIBLE_FORKS
    value: "30"

Use ansibleLimit for Validation

Before deploying to all nodes, test on a subset:

Test deployment

apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: validation-deployment
spec:
  nodeSets:
    - compute-nodes
  ansibleLimit: "compute-0,compute-1"

Full deployment after validation

apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: production-deployment
spec:
  nodeSets:
    - compute-nodes
