fix: PUC-1722 Ironic conductor performance issues #2056
Open
skrobul wants to merge 4 commits into
In the past, the dnsmasq pods had to run on the same physical host as the Ironic conductors so that dnsmasq and Ironic could share a disk volume and exchange host information through it. We have since switched to PVC volumes, which can be mounted on multiple nodes when the access mode is set to RWX. To scale the number of Ironic conductors, we need to decouple dnsmasq from the Ironic conductor.
This prevents multiple conductor pods from being placed on the same host.
Currently most of the ironic-conductor's time is spent on power state syncs. This is partly because the BMCs are slow to respond, but also because CPU is very limited: the conductor will happily use more than one core, but it is currently limited to just 1. In theory we should scale up by increasing the number of conductors, but at the moment we are limited to 1 conductor per host due to host networking.
Rollback Plan: RWX → RWO
Use this if you need to revert back to the original RWO volumes. This assumes the backup files from Step 3 of the migration are still present in
Rollback Overview
Rollback Step 1: Disable ArgoCD
k get clusterrolebinding pvceng-argocd-is-cluster-admin -o yaml > /tmp/argocd-binding.yaml
k delete clusterrolebinding pvceng-argocd-is-cluster-admin
Rollback Step 2: Scale Down StatefulSets
k scale statefulset ironic-conductor --replicas=0
k scale statefulset ironic-dnsmasq --replicas=0
Rollback Step 3: Create Temporary RWO PVCs
No storage class is specified, so Kubernetes will use the cluster default.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dnsmasq-dhcp-old
  namespace: openstack
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 16Mi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dnsmasq-ironic-old
  namespace: openstack
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 16Mi
k apply -f temp-pvcs-rwo.yaml
Rollback Step 4: Migrate Data with
Once the PVCs are migrated, remember to apply
haseebsyed12
approved these changes
Jun 3, 2026
Overall, this PR aims to resolve CPU throttling for Ironic conductors by relaxing the existing, overly strict CPU limits and by scaling horizontally across multiple pods/nodes.
Summary of changes:
Unfortunately, changing the storage class requires a per-environment migration. Here is a rough plan (tested on the dev environment).
This partially solves https://rackspace.atlassian.net/browse/PUC-1722, but requires changes in the deploy repo to increase the number of replicas to 2 for each of the environments. These are done in https://github.com/RSS-Engineering/undercloud-deploy/pull/1799
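RWO → RWX Volume Migration plan
Prerequisites
pv-migrate kubectl plugin installed
ceph-fs-ec storage class will be used, which supports RWX
Overview
Scale down the ironic-conductor and ironic-dnsmasq StatefulSets
Back up the existing PVC definitions to /tmp
Create temporary PVCs (dnsmasq-dhcp-new, dnsmasq-ironic-new) on ceph-fs-ec
Migrate the data with pv-migrate
Recreate the PVCs with their original names on ceph-fs-ec
Step 1: Disable ArgoCD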
Before making any changes, disable ArgoCD on the cluster to prevent it from reconciling resources during the migration.
k get clusterrolebinding pvceng-argocd-is-cluster-admin -o yaml > /tmp/argocd-binding.yaml
k delete clusterrolebinding pvceng-argocd-is-cluster-admin
Step 2: Scale Down StatefulSets
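A minimal sketch of the scale-down, mirroring the commands from the rollback plan. It assumes both StatefulSets live in the `openstack` namespace (the namespace used by the PVC manifests) and spells out `kubectl` instead of the `k` shorthand. The `scale_down` helper name is hypothetical.

```shell
# Hypothetical helper: scale both StatefulSets to zero so no pod holds the volumes.
# Assumption: the StatefulSets are in the "openstack" namespace.
scale_down() {
  for sts in ironic-conductor ironic-dnsmasq; do
    kubectl scale statefulset "$sts" -n openstack --replicas=0 || return 1
  done
}
```

Run `scale_down` and confirm the pods are gone before touching the PVCs.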
Step 3: Back Up Existing PVC Definitions
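One possible way to take the backup is sketched below. The original PVC names (`dnsmasq-dhcp`, `dnsmasq-ironic`) are inferred from the temporary names used in this plan, and the backup file naming and the `backup_pvcs` helper are assumptions, not something taken from the PR.

```shell
# Hypothetical helper: save each PVC manifest so it can be restored during a rollback.
# Assumptions: PVC names dnsmasq-dhcp and dnsmasq-ironic, "openstack" namespace,
# and the <name>-pvc.yaml file naming.
backup_pvcs() {
  backup_dir="${1:-/tmp}"
  for pvc in dnsmasq-dhcp dnsmasq-ironic; do
    kubectl get pvc "$pvc" -n openstack -o yaml > "${backup_dir}/${pvc}-pvc.yaml" || return 1
  done
}
```

Calling `backup_pvcs` with no argument writes the files to `/tmp`; pass another directory if the backups need to outlive a reboot.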
Step 4: Create Temporary RWX PVCs
Since two PVCs cannot share the same name in the same namespace, use temporary names during migration.
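A sketch of how the temporary claims could be created. The manifest is modelled on the RWO manifests in the rollback plan, with the access mode switched to `ReadWriteMany` and the `ceph-fs-ec` storage class from the prerequisites. The 16Mi size and the helper names are assumptions.

```shell
# Hypothetical helper: print a ReadWriteMany PVC manifest for the given name.
# Assumptions: 16Mi size (copied from the rollback manifests) and the
# "openstack" namespace.
rwx_pvc_manifest() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
  namespace: openstack
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ceph-fs-ec
  resources:
    requests:
      storage: 16Mi
EOF
}

# Hypothetical helper: apply one manifest per temporary PVC.
create_temp_pvcs() {
  for pvc in dnsmasq-dhcp-new dnsmasq-ironic-new; do
    rwx_pvc_manifest "$pvc" | kubectl apply -f - || return 1
  done
}
```

The same `rwx_pvc_manifest` helper could be reused in Step 7 by passing the original PVC names.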
Step 5: Migrate Data with pv-migrate
pv-migrate will spin up the rsync job, stream progress logs, and clean up after itself automatically.
Step 6: Delete the Old RWO PVCs
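A minimal sketch of the deletion, assuming the original claims are named `dnsmasq-dhcp` and `dnsmasq-ironic` (inferred from the temporary names). The `delete_old_pvcs` helper is hypothetical.

```shell
# Hypothetical helper: remove the original RWO claims once their data has been copied.
# Assumption: the original PVCs are named dnsmasq-dhcp and dnsmasq-ironic.
delete_old_pvcs() {
  kubectl delete pvc dnsmasq-dhcp dnsmasq-ironic -n openstack
}
```

Whether the underlying PersistentVolume is removed as well depends on its reclaim policy, and a claim that is still mounted by a pod stays in `Terminating` until that pod is gone, so it is worth confirming the copy in Step 5 succeeded before running this.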
Step 7: Recreate the PVCs with Original Names as RWX
Step 8: Migrate Data into the Final PVCs
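A sketch of the second copy, from the temporary claims into the recreated ones. The flags follow the pv-migrate v2-style invocation as an assumption; older releases use a `migrate` subcommand with positional PVC names, so check `kubectl pv-migrate --help` for the installed version. The helper names and the final PVC names are also assumptions.

```shell
# Hypothetical helper: copy data from a source PVC to a destination PVC.
# Assumption: pv-migrate v2-style flags; verify with `kubectl pv-migrate --help`.
migrate_pvc() {
  kubectl pv-migrate \
    --source-namespace openstack --source "$1" \
    --dest-namespace openstack --dest "$2"
}

# Hypothetical helper: move both temporary claims into the final ones.
migrate_to_final() {
  migrate_pvc dnsmasq-dhcp-new dnsmasq-dhcp || return 1
  migrate_pvc dnsmasq-ironic-new dnsmasq-ironic
}
```

The Step 5 copy would be the same call with the arguments swapped, for example `migrate_pvc dnsmasq-dhcp dnsmasq-dhcp-new`.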
Step 9: Delete the Temporary PVCs
Step 10: Scale Back Up and Verify
Your workload manifests require no changes since the PVC names are preserved.
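One way the scale-up and verification could look. The replica count is passed in because this plan does not state it; the namespace, PVC names, timeout and helper names are assumptions.

```shell
# Hypothetical helper: scale a StatefulSet up and wait until its pods are ready.
# Assumptions: "openstack" namespace and a 5 minute timeout.
scale_up() {
  kubectl scale statefulset "$1" -n openstack --replicas="$2" || return 1
  kubectl rollout status statefulset "$1" -n openstack --timeout=5m
}

# Hypothetical helper: list the claims; the ACCESS MODES column should read RWX.
# Assumption: the final PVCs are named dnsmasq-dhcp and dnsmasq-ironic.
check_pvcs() {
  kubectl get pvc dnsmasq-dhcp dnsmasq-ironic -n openstack
}
```

For example, `scale_up ironic-dnsmasq 1 && scale_up ironic-conductor 1 && check_pvcs`, adjusting the counts to whatever the environment ran before the migration.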
Step 11: Restore ArgoCD Access
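A minimal sketch of the restore, assuming the ClusterRoleBinding saved in Step 1 is still at `/tmp/argocd-binding.yaml`. The `restore_argocd` helper is hypothetical.

```shell
# Hypothetical helper: re-create the ClusterRoleBinding saved in Step 1.
# Assumption: the backup file from Step 1 has not been moved or cleaned up.
restore_argocd() {
  binding="${1:-/tmp/argocd-binding.yaml}"
  # Refuse to continue if the backup is missing or empty.
  [ -s "$binding" ] || { echo "missing backup: $binding" >&2; return 1; }
  kubectl apply -f "$binding"
}
```

Once the binding is back, ArgoCD resumes reconciling, so the PVC definitions in the repo should already match the RWX claims created during the migration.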