bugfix:When the Master is down, the region or datacenter detects abno…#24

Open
lazzyfu wants to merge 1 commit into percona:master from lazzyfu:bugfix-failover-region-detects-abnormalities

@lazzyfu lazzyfu commented Jun 19, 2023

Hi,
My English is limited and I rely on Google Translate, so apologies in advance. Thank you.

Issue

Triggering conditions

  • PreventCrossRegionMasterFailover = true
  • PreventCrossDataCenterMasterFailover = true

Enabling either one of these settings (or both) is enough.

Trigger timing

The problem does not occur every time, but when it does the impact is severe.

How to reproduce

  1. ORC scans the instance while the Master node is still healthy and marks it instanceFound=true.
  2. The Master node is then shut down abruptly, or becomes unreachable for some other reason.
  3. ORC nevertheless continues and runs DetectRegionQuery, DetectDataCenterQuery and similar operations (the query-based detection, as opposed to the regular-expression matching in the configuration file).
  4. Because the Master node is already down, these queries return no result.
  5. The empty value is written to the table orchestrator.database_instance, where the Master node's region now shows as empty.
    20230619-112525
  6. If a failover runs at this point, analysisEntry.AnalyzedInstanceRegion is empty, so the region or datacenter check fails and the failover fails with it.
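
The sequence above can be sketched in a few lines of Go. This is only an illustration of the failure mode, not Orchestrator's actual code: instanceRecord, detectRegion and crossRegionBlocked are hypothetical names, and "region-a" is a made-up region value.

```go
package main

import "fmt"

// instanceRecord is a hypothetical stand-in for a row in
// orchestrator.database_instance.
type instanceRecord struct {
	Host   string
	Region string
}

// detectRegion is a hypothetical stand-in for running DetectRegionQuery
// against an instance. An unreachable instance yields an empty result.
func detectRegion(reachable bool) string {
	if !reachable {
		return ""
	}
	return "region-a"
}

// crossRegionBlocked mimics a PreventCrossRegionMasterFailover-style check:
// promotion is refused when the failed master and the candidate are
// recorded in different regions.
func crossRegionBlocked(prevent bool, failedRegion, candidateRegion string) bool {
	return prevent && failedRegion != candidateRegion
}

func main() {
	master := instanceRecord{Host: "master", Region: detectRegion(true)}
	replica := instanceRecord{Host: "replica", Region: "region-a"}

	// Master healthy: both records agree, so a failover would be allowed.
	fmt.Println(crossRegionBlocked(true, master.Region, replica.Region)) // false

	// Master goes down between the scan and the region query:
	// the empty result overwrites the stored region.
	master.Region = detectRegion(false)

	// The candidate is in the same physical region, yet the check now fails.
	fmt.Println(crossRegionBlocked(true, master.Region, replica.Region)) // true
}
```

The second call returns true only because the stored region was overwritten with an empty string, which matches the symptom described in step 6.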

Steps to reproduce

Topology

20230619-113005

Debug code

go/inst/instance_dao.go

Add a for loop directly below instanceFound = true that pauses for a few seconds. Keep the delay short: with a long delay, detection becomes too slow and the effect takes longer to show up.

image

Restart Orchestrator

go run go/cmd/orchestrator/main.go -config conf/orchestrator.conf.json  -debug http

Shutdown the Master node

Timing matters here. The debug loop pauses for 5 seconds, and the shutdown command must be executed during that 5-second window.

image

Observe the topology

At this point the topology recovery fails, and the failed recovery leaves a cascading topology.

image

Observe the Recovery log

image
image
image

Observe the records of the Orchestrator table

image

@kamil-holubicki kamil-holubicki force-pushed the master branch 2 times, most recently from d301f13 to 3b5b0f8 Compare January 15, 2026 08:43