CKS: retry if unable to drain node or unable to upgrade k8s node#8402
Conversation
I tried the CKS upgrade 16 times; 11 of 16 upgrades succeeded.

2 of 16 upgrades failed due to:

```
error: unable to drain node "testcluster-of7974-node-18c8c33c2c3" due to error:
[error when evicting pods/"cloud-controller-manager-5b8fc87665-5nwlh" -n "kube-system": Post "https://10.0.66.18:6443/api/v1/namespaces/kube-system/pods/cloud-controller-manager-5b8fc87665-5nwlh/eviction": unexpected EOF,
error when evicting pods/"coredns-5d78c9869d-h5nkz" -n "kube-system": Post "https://10.0.66.18:6443/api/v1/namespaces/kube-system/pods/coredns-5d78c9869d-h5nkz/eviction": unexpected EOF],
continuing command...
```

3 of 16 upgrades failed due to:

```
Error from server: error when retrieving current configuration of:
Resource: "rbac.authorization.k8s.io/v1, Resource=roles", GroupVersionKind: "rbac.authorization.k8s.io/v1, Kind=Role"
Name: "kubernetes-dashboard", Namespace: "kubernetes-dashboard"
from server for: "/mnt/k8sdisk//dashboard.yaml": etcdserver: leader changed
```
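Both failure modes above are transient, which suggests wrapping the flaky step in a bounded retry. A minimal sketch of that idea (the `retry` helper name, attempt count, and `RETRY_DELAY` back-off are illustrative assumptions, not the actual CloudStack patch):

```shell
# Retry a flaky command up to N times before giving up. Transient failures
# such as "unexpected EOF" during pod eviction can succeed on a later attempt.
retry() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    echo "attempt $i/$attempts failed: $*" >&2
    sleep "${RETRY_DELAY:-5}"   # back off before retrying (delay is an assumption)
    i=$((i + 1))
  done
  return 1
}

# Hypothetical usage during a node upgrade (node name is illustrative):
# retry 3 kubectl drain testcluster-node --ignore-daemonsets --delete-emptydir-data
```

The bound matters: a non-transient failure (for example, a PodDisruptionBudget that can never be satisfied) should still surface as an error after the final attempt.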
CKS: remove tests of creating/deleting HA clusters as they are covered by the upgrade test
it will take 15m * 20 = 5 hours
|
@blueorangutan package |
|
@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
Codecov Report — Attention:
Additional details and impacted files

```
@@            Coverage Diff            @@
##              4.18    #8402    +/-   ##
============================================
+ Coverage    13.12%   13.16%   +0.03%
- Complexity    9141     9203      +62
============================================
  Files         2720     2724       +4
  Lines       257726   258091     +365
  Branches     40177    40229      +52
============================================
+ Hits         33838    33988     +150
- Misses      219598   219796     +198
- Partials      4290     4307      +17
```

☔ View full report in Codecov by Sentry. |
|
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8132 |
|
@weizhouapache very few new tests 😜 |
|
@blueorangutan test matrix |
|
@weizhouapache a [SL] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-8667)
|
@shwstppr |
|
[SF] Trillian test result (tid-8669)
|
|
@blueorangutan test matrix |
|
@weizhouapache a [SL] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-8671)
|
|
@weizhouapache do we need to move some of the test cases to a component test? It is taking more than 5.5h now |
|
[SF] Trillian test result (tid-8673)
|
@shwstppr |
This reverts commit af93915.
|
@blueorangutan test matrix |
|
@weizhouapache a [SL] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests |
|
@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
@blueorangutan test rocky8 kvm-rocky8 |
|
@weizhouapache a [SL] Trillian-Jenkins test job (rocky8 mgmt + kvm-rocky8) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-8943)
|
|
@blueorangutan package |
|
@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8446 |
|
@blueorangutan test rocky8 kvm-rocky8 |
|
@weizhouapache a [SL] Trillian-Jenkins test job (rocky8 mgmt + kvm-rocky8) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-8952)
|
|
[SF] Trillian test result (tid-8958)
|
|
Ran 2 Trillian tests over the weekend.
Not perfect, but looks better than before |
|
@blueorangutan package |
|
@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8520 |
|
@blueorangutan test rocky8 kvm-rocky8 |
|
@weizhouapache a [SL] Trillian-Jenkins test job (rocky8 mgmt + kvm-rocky8) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-9078)
|
|
Merging based on 2 approvals and perfect smoke test results. |
This reverts commit 34e227a.
…che#8402)

* CKS: retry if unable to drain node or unable to upgrade k8s node

  I tried the CKS upgrade 16 times; 11 of 16 upgrades succeeded.

  2 of 16 upgrades failed due to:

  ```
  error: unable to drain node "testcluster-of7974-node-18c8c33c2c3" due to error:
  [error when evicting pods/"cloud-controller-manager-5b8fc87665-5nwlh" -n "kube-system": Post "https://10.0.66.18:6443/api/v1/namespaces/kube-system/pods/cloud-controller-manager-5b8fc87665-5nwlh/eviction": unexpected EOF,
  error when evicting pods/"coredns-5d78c9869d-h5nkz" -n "kube-system": Post "https://10.0.66.18:6443/api/v1/namespaces/kube-system/pods/coredns-5d78c9869d-h5nkz/eviction": unexpected EOF],
  continuing command...
  ```

  3 of 16 upgrades failed due to:

  ```
  Error from server: error when retrieving current configuration of:
  Resource: "rbac.authorization.k8s.io/v1, Resource=roles", GroupVersionKind: "rbac.authorization.k8s.io/v1, Kind=Role"
  Name: "kubernetes-dashboard", Namespace: "kubernetes-dashboard"
  from server for: "/mnt/k8sdisk//dashboard.yaml": etcdserver: leader changed
  ```

* CKS: remove tests of creating/deleting HA clusters as they are covered by the upgrade test
* Update PR 8402 as suggested
* test: remove CKS cluster if fail to create or verify
Description
This PR tries to fix the upgrade of an HA cluster by retrying the node drain and node upgrade steps when they fail with a transient error (for example, `unexpected EOF` during pod eviction, or `etcdserver: leader changed` while applying `dashboard.yaml`).
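The `etcdserver: leader changed` case differs from the drain case in that only some errors are worth retrying. A sketch of that distinction, assuming a hypothetical `apply_with_retry` helper (the matched error strings come from the logs above; the helper itself is not the actual CloudStack code):

```shell
# Re-apply a manifest only when the API server reports an error that looks
# transient; fail fast on anything else (e.g. a genuinely invalid manifest).
apply_with_retry() {
  manifest=$1
  attempts=${2:-3}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if out=$(kubectl apply -f "$manifest" 2>&1); then
      echo "$out"
      return 0
    fi
    case $out in
      *"etcdserver: leader changed"*|*"unexpected EOF"*)
        echo "transient error applying $manifest (attempt $i/$attempts), retrying" >&2
        sleep "${RETRY_DELAY:-5}"   # back-off interval is an assumption
        ;;
      *)
        echo "$out" >&2   # non-transient: surface the error immediately
        return 1
        ;;
    esac
    i=$((i + 1))
  done
  return 1
}
```

An HA control plane re-elects an etcd leader quickly, so a short bounded retry is usually enough for this class of failure.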
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
How did you try to break this feature and the system with this change?