
systemvm: fix keepalived is always restarted when update config for monitor service #4386

Closed
ustcweizhou wants to merge 1 commit into apache:master from ustcweizhou:4.15-rvr-keepalived

Conversation

@ustcweizhou
Contributor

Description

In 4.15, keepalived in redundant VRs restarts every minute.
After debugging, I found that this happens when the config for the monitor service is updated.
The cause is that the keepalived process command line changed in Debian 10.

In Debian 9 (systemvm for 4.14):

root@r-1969-VM:~# ps -ef|grep keepalived
root     16324     1  0 09:53 ?        00:00:04 /usr/sbin/keepalived
root     16325 16324  0 09:53 ?        00:00:04 /usr/sbin/keepalived
root     16326 16324  0 09:53 ?        00:00:14 /usr/sbin/keepalived

In Debian 10 (systemvm for 4.15), the process command lines end with "--dont-fork":

root@r-2040-VM:~# ps -ef|grep keepalived
root      5237     1  0 16:40 ?        00:00:00 /usr/sbin/keepalived --dont-fork
root      5239  5237  0 16:40 ?        00:00:03 /usr/sbin/keepalived --dont-fork

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):

How Has This Been Tested?

…onitor service

In 4.15, keepalived in redundant VRs restarts every minute.
After debugging, I found that this happens when the config for the monitor service is updated.
The cause is that the keepalived process command line changed in Debian 10.

in Debian 9 (systemvm for 4.14),
```
root@r-1969-VM:~# ps -ef|grep keepalived
root     16324     1  0 09:53 ?        00:00:04 /usr/sbin/keepalived
root     16325 16324  0 09:53 ?        00:00:04 /usr/sbin/keepalived
root     16326 16324  0 09:53 ?        00:00:14 /usr/sbin/keepalived
```

in Debian 10 (systemvm for 4.15), processes end with "--dont-fork"
```
root@r-2040-VM:~# ps -ef|grep keepalived
root      5237     1  0 16:40 ?        00:00:00 /usr/sbin/keepalived --dont-fork
root      5239  5237  0 16:40 ?        00:00:03 /usr/sbin/keepalived --dont-fork
```
Contributor

@DaanHoogland left a comment

code lgtm

@@ -194,7 +194,7 @@ def _redundant_on(self):
heartbeat_cron.commit()

proc = CsProcess(['/usr/sbin/keepalived'])
Contributor

@ustcweizhou specifying "/usr/sbin/keepalived" here may not be required, as grep is used instead.

I understand this issue is specific to the keepalived process, but more broadly, other processes might hit a similar issue later. Can this be fixed in CsProcess.py find()?
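
To illustrate the kind of generic fix suggested here, below is a minimal sketch of a process matcher that compares only the leading tokens of the command column, so trailing arguments such as "--dont-fork" do not prevent a match. This is not the actual CsProcess.py code: the function name `find_pids` and its parameters are hypothetical, and it assumes `ps -ef` output where the command starts at the eighth column.

```python
def find_pids(ps_lines, search):
    """Return PIDs of `ps -ef` lines whose command starts with `search`.

    Hypothetical helper for illustration only (not CsProcess.find()).
    `search` is a list of tokens, e.g. ['/usr/sbin/keepalived'].
    Arguments that follow the searched tokens are ignored.
    """
    pids = []
    for line in ps_lines:
        fields = line.split()
        # Assumed `ps -ef` columns: UID PID PPID C STIME TTY TIME CMD...
        if len(fields) < 8:
            continue
        command = fields[7:]
        # Prefix match: only the first len(search) tokens must be equal.
        if command[:len(search)] == search:
            pids.append(fields[1])
    return pids


# Sample lines taken from the PR description.
debian9 = [
    "root     16324     1  0 09:53 ?        00:00:04 /usr/sbin/keepalived",
    "root     16325 16324  0 09:53 ?        00:00:04 /usr/sbin/keepalived",
    "root     16326 16324  0 09:53 ?        00:00:14 /usr/sbin/keepalived",
]
debian10 = [
    "root      5237     1  0 16:40 ?        00:00:00 /usr/sbin/keepalived --dont-fork",
    "root      5239  5237  0 16:40 ?        00:00:03 /usr/sbin/keepalived --dont-fork",
]

print(find_pids(debian9, ['/usr/sbin/keepalived']))   # ['16324', '16325', '16326']
print(find_pids(debian10, ['/usr/sbin/keepalived']))  # ['5237', '5239']
```

A prefix match like this would treat both the Debian 9 and Debian 10 command lines as the same process, whereas a matcher that requires the whole command line to equal the searched tokens would miss the Debian 10 lines.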

Member

@sureshanaparti yes, of course.
To be sure that it does not cause a regression, I chose an easier way to fix it.

Contributor

OK @weizhouapache, got it. Note that if the same issue recurs for other processes later, CsProcess.py will have to be fixed.

@weizhouapache
Member

@rhtyd @DaanHoogland this needs to be merged ASAP.

The test failure below should be fixed by this PR.

Test | Result | Time (s) | Test File
test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers | Failure | 290.08 | test_vpc_redundant.py

@DaanHoogland
Contributor

OK @weizhouapache, will run the tests once.
@blueorangutan package

@blueorangutan

@DaanHoogland a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos7 ✔centos8 ✔debian. JID-2158

@apache apache deleted a comment from blueorangutan Oct 13, 2020
@DaanHoogland
Contributor

@blueorangutan test

@blueorangutan

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-2930)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 38742 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4386-t2930-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_kubernetes_clusters.py
Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 83 look OK, 2 have error(s)
Only failed tests results shown below:

Test | Result | Time (s) | Test File
test_07_deploy_kubernetes_ha_cluster | Failure | 3609.57 | test_kubernetes_clusters.py
test_08_deploy_and_upgrade_kubernetes_ha_cluster | Failure | 0.09 | test_kubernetes_clusters.py
test_09_delete_kubernetes_ha_cluster | Failure | 0.06 | test_kubernetes_clusters.py
ContextSuite context=TestKubernetesCluster>:teardown | Error | 75.82 | test_kubernetes_clusters.py
test_hostha_kvm_host_fencing | Error | 175.07 | test_hostha_kvm.py

@DaanHoogland
Contributor

None of these errors are related. @rhtyd @sureshanaparti or others, can we merge?

@yadvr
Member

yadvr commented Oct 14, 2020

The same PR was sent to 4.14 #4384 - should we close this one and merge the other one?

6 participants