Keeper priority #622
Conversation
|
semaphoreci failed at config_test.go:541 in one of the configurations, but I don't think this is related to this pull request -- seems like the database just couldn't restart in 10 seconds. Is there a way to restart the test? |
|
Any updates on this PR? Need this feature. |
|
|
// Return DB who can be new master. This function mostly takes care of
// sync mode; in async case, new master is just first element of findBestNewMasters.
func (s *Sentinel) findBestNewMaster(cd *cluster.ClusterData, curMasterDB *cluster.DB, logErrors bool) *cluster.DB {
findBestNewMaster returns the db who can be the new master. This function ....
	}
	if bestNewMasterDB == nil {
		if logErrors {
			log.Errorw("cannot choose synchronous standby since there's not match between the possible masters and the usable synchronousStandbys", "reported", curMasterDB.Status.SynchronousStandbys, "spec", curMasterDB.Spec.SynchronousStandbys, "common", commonSyncStandbys, "possibleMasters", bestNewMasters)

	commonSyncStandbys := util.CommonElements(curMasterDB.Status.SynchronousStandbys, curMasterDB.Spec.SynchronousStandbys)
	if len(commonSyncStandbys) == 0 {
		if logErrors {
			log.Errorw("cannot choose synchronous standby since there are no common elements between the latest master reported synchronous standbys and the db spec ones", "reported", curMasterDB.Status.SynchronousStandbys, "spec", curMasterDB.Spec.SynchronousStandbys)
	log.Infow("electing db as the new master", "db", bestNewMasterDB.UID, "keeper", bestNewMasterDB.Spec.KeeperUID)

	// Even if current master is ok, we probably still
	// want to change it if there is ready DB with higher
	// keeper priority.
So the priority will also be used to change the primary even when there's no failure? This should probably be documented.
Yes. I have added some words to --priority help describing this.
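The behavior discussed in this thread (priority triggering a planned switchover even when the current master is healthy) could be sketched roughly as below; the types, fields, and function name are illustrative stand-ins, not stolon's actual API:

```go
package main

import "fmt"

// Simplified stand-in for stolon's cluster.DB; fields are illustrative.
type DB struct {
	UID      string
	Healthy  bool
	Priority int
}

// shouldSwitchMaster sketches the idea: even when the current master is
// fine, prefer a healthy candidate whose keeper has a strictly higher
// priority. If none exists, keep the current master.
func shouldSwitchMaster(curMaster DB, candidates []DB) (DB, bool) {
	best := curMaster
	for _, c := range candidates {
		if c.Healthy && c.Priority > best.Priority {
			best = c
		}
	}
	return best, best.UID != curMaster.UID
}

func main() {
	cur := DB{UID: "db1", Healthy: true, Priority: 0}
	cands := []DB{{UID: "db2", Healthy: true, Priority: 5}}
	best, switched := shouldSwitchMaster(cur, cands)
	fmt.Println(best.UID, switched) // db2 true
}
```

Note that with equal priorities nothing changes, so the default configuration keeps today's behavior.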
	}
	return reflect.DeepEqual(cd1, cd2)

	equal, diff := util.DeepEqualVerbose(cd1, cd2)
Since this change could be really useful for debugging test failures (in other projects I also used github.com/google/go-cmp) but is not strictly related to this feature, can you please remove it from this PR and open a new dedicated PR?
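For illustration, here is a stdlib-only sketch of what a DeepEqualVerbose-style helper could look like; stolon's actual util.DeepEqualVerbose may be implemented differently, and go-cmp's cmp.Diff produces a far better structural diff:

```go
package main

import (
	"fmt"
	"reflect"
)

// deepEqualVerbose behaves like reflect.DeepEqual, but when the values
// differ it also returns a human-readable dump of both sides, making
// test failures easier to eyeball than a bare "not equal".
func deepEqualVerbose(a, b interface{}) (bool, string) {
	if reflect.DeepEqual(a, b) {
		return true, ""
	}
	return false, fmt.Sprintf("values differ:\n got: %#v\nwant: %#v", a, b)
}

func main() {
	equal, diff := deepEqualVerbose(map[string]int{"a": 1}, map[string]int{"a": 2})
	fmt.Println(equal)
	fmt.Println(diff)
}
```

A test can then log the diff only on mismatch, keeping passing runs quiet.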
	}
	pi := cd.Keepers[dbs[i].Spec.KeeperUID].Status.Priority
	pj := cd.Keepers[dbs[i].Spec.KeeperUID].Status.Priority
	return pi < pj
In line 700,

pj := cd.Keepers[dbs[i].Spec.KeeperUID].Status.Priority

dbs[i] should be replaced with dbs[j].
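The suggested fix can be sketched with simplified stand-in types (the real code reads cd.Keepers[...].Status.Priority from the cluster data):

```go
package main

import (
	"fmt"
	"sort"
)

// Simplified stand-ins for the cluster types quoted in the diff above.
type Keeper struct{ Priority int }
type DB struct{ KeeperUID string }

// sortByPriority orders dbs by their keeper's priority, ascending.
// Note the fix: pj must read dbs[j], not dbs[i]; with the original typo
// pi == pj always holds, so the comparator never reorders anything.
func sortByPriority(dbs []DB, keepers map[string]Keeper) {
	sort.Slice(dbs, func(i, j int) bool {
		pi := keepers[dbs[i].KeeperUID].Priority
		pj := keepers[dbs[j].KeeperUID].Priority
		return pi < pj
	})
}

func main() {
	keepers := map[string]Keeper{"k1": {Priority: 2}, "k2": {Priority: 5}, "k3": {Priority: 1}}
	dbs := []DB{{"k1"}, {"k2"}, {"k3"}}
	sortByPriority(dbs, keepers)
	fmt.Println(dbs) // [{k3} {k1} {k2}]
}
```

The buggy version silently leaves the slice untouched, which is why no test caught it: the output is still a valid ordering, just not the intended one.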
|
@arssher can you update this PR? |
|
Hi, I will try to look into this in a few days. |
|
Sorry for quite a delay. I have updated the PR:
|
|
BTW, when generating the docs (gen_commands_doc.sh) I noticed that they weren't updated for some recently added keeper options, e.g. pg-advertise-address. I have excluded those from this PR as well, as I believe they should go in a separate commit. Also, does it make sense to restamp all doc files with the current date, like "Auto generated by spf13/cobra on 30-Jun-2019"? I can revert them, leaving only the ones that have really changed. |
|
(Made a small cleanup just now: used the DefaultPriority const and purged the obsolete NotSpecifiedPriority const)
|
One more minor fix/cleanup: forgot to add generated doc for

Last CI build failed on Consul and passed on all other stores. Looking at the logs, I think this PR is again not the cause of the failure. The error was at ha_test.go:1574 in TestKeeperRemovalStolonCtl; removekeeper failed with: So most probably |
|
Hm, now two builds failed. One of them is exactly the same as the previous one: TestKeeperRemovalStolonCtl at ha_test.go:1576. Another is TestForceFailSyncReplStandbyCluster, at It is unclear to me why this happened. Since I didn't touch |
|
@sgotti would you please check this PR? |
|
@sgotti Any updates on this? |
|
@rearden-steel this PR requires an update/rebase, and we have to understand if it will also cover #696. Are you willing to take care of this PR? @lawrencejones @maksm90 |
6b7fdfb to ef591f2
Sentinel will promote a keeper with higher priority than the current one if this is possible. In async mode this is a bit non-deterministic because we always elect the node with the highest LSN, and under heavy load the prioritized node might never report an LSN higher than its stronger competitors. However, if the nodes are equal this should happen at some moment. In sync mode, we can just elect any of the synchronous standbys. Priority can be set during keeper start (--priority) or later with the new command 'stolonctl set keeperpriority'. The latter allows updating the priority without restarting the keeper (and its Postgres instance), which can be used for controlled failover. Implements sorintlab#492
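A rough sketch of the async-mode ordering the commit message describes, with illustrative types (not stolon's actual election code): the highest reported LSN wins, and priority only breaks ties, which is why a lagging high-priority node may never be elected under load:

```go
package main

import (
	"fmt"
	"sort"
)

// Illustrative candidate type; stolon's real cluster.DB is richer.
type candidate struct {
	UID      string
	LSN      uint64 // last reported WAL position
	Priority int
}

// orderAsync sorts candidates for election: highest LSN first, and
// priority only decides between nodes reporting the same LSN. A
// high-priority node that keeps lagging on LSN therefore never reaches
// the front -- the non-determinism the commit message warns about.
func orderAsync(cands []candidate) {
	sort.Slice(cands, func(i, j int) bool {
		if cands[i].LSN != cands[j].LSN {
			return cands[i].LSN > cands[j].LSN
		}
		return cands[i].Priority > cands[j].Priority
	})
}

func main() {
	cands := []candidate{
		{UID: "lagging-prio", LSN: 90, Priority: 10},
		{UID: "equal-a", LSN: 100, Priority: 0},
		{UID: "equal-b", LSN: 100, Priority: 5},
	}
	orderAsync(cands)
	fmt.Println(cands[0].UID) // equal-b
}
```

In sync mode this caveat disappears, since any synchronous standby is safe to elect and priority can decide outright.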
ef591f2 to 988568e
|
@sgotti, I've rebased this branch. I had to modify one place because of a linter error; now all existing tests pass and it works well for me. As for CI, I've found it to be quite unstable these days: first it failed on which seems not to be a matter of this PR and means that

BTW, in #696 you mentioned that it's worth checking the compatibility of this PR with 05b1b0f. I had a look at it, and it seems that everything works fine even without changes, since added in this PR However, it may be a bit misleading for a user, so maybe we should restrict usage of both |
Add --priority keeper option.
Sentinel will promote a keeper with higher priority than the current one if this
is possible. In async mode this is a bit non-deterministic because we always
elect the node with the highest LSN, and under heavy load the prioritized node
might never report an LSN higher than its stronger competitors. However, if the
nodes are equal this should happen at some moment. In sync mode, we can just
elect any of the synchronous standbys.
Implements #492
While here, make eyeballing of sentinel test failures easier.