fix(db): tolerate not-yet-ready members in Patroni cluster discovery#794
Conversation
Parse each /cluster member independently and drop any that don't fully deserialize, instead of failing the entire response over one member. Root cause: Patroni lists a node as a member as soon as it registers in the DCS, which happens before it publishes its conn_url. A replica mid-creation (or an uninitialized node) therefore appears without host/port/api_url. Those fields are required on ClusterMember, so a strict Vec<ClusterMember> parse failed the whole /cluster response with 'missing field host' — surfacing as 'Failed to refresh cluster state: Failed to parse cluster info: error decoding response body' and freezing topology discovery for every consumer (cloud-api + chat-api) whenever a new postgres instance was being added. Confirmed by repro test against the real struct (PARSE ERROR: missing field host). Same resilience philosophy as the existing deserialize_lag fix: one bad member must not poison the whole cluster view; the next refresh picks the node up once it finishes registering. Adds regression tests for a single initializing member and for the full live /cluster shape with an initializing replica appended.
There was a problem hiding this comment.
Code Review
This pull request introduces a custom deserializer, deserialize_members, for Patroni cluster members to handle half-registered nodes gracefully. Instead of failing the entire /cluster response parsing when a member is missing connection info, it drops the incomplete member and logs a warning. Unit tests were also added to verify this behavior. The review feedback suggests optimizing the deserialization loop by extracting the member's name before calling serde_json::from_value, thereby avoiding cloning the serde_json::Value on every iteration and allowing ownership transfer.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| for value in raw { | ||
| match serde_json::from_value::<ClusterMember>(value.clone()) { | ||
| Ok(member) => members.push(member), | ||
| Err(e) => { | ||
| let name = value | ||
| .get("name") | ||
| .and_then(serde_json::Value::as_str) | ||
| .unwrap_or("<unknown>"); | ||
| warn!("Skipping not-yet-ready cluster member {name}: {e}"); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
By extracting the member's name before deserialization, we can avoid cloning the entire serde_json::Value on every iteration. serde_json::from_value can then take ownership of the Value directly, which allows it to reuse the internal allocations (like strings) when constructing the ClusterMember struct.
| for value in raw { | |
| match serde_json::from_value::<ClusterMember>(value.clone()) { | |
| Ok(member) => members.push(member), | |
| Err(e) => { | |
| let name = value | |
| .get("name") | |
| .and_then(serde_json::Value::as_str) | |
| .unwrap_or("<unknown>"); | |
| warn!("Skipping not-yet-ready cluster member {name}: {e}"); | |
| } | |
| } | |
| } | |
| for value in raw { | |
| let name = value | |
| .get("name") | |
| .and_then(serde_json::Value::as_str) | |
| .map(String::from); | |
| match serde_json::from_value::<ClusterMember>(value) { | |
| Ok(member) => members.push(member), | |
| Err(e) => { | |
| let name = name.as_deref().unwrap_or("<unknown>"); | |
| warn!("Skipping not-yet-ready cluster member {name}: {e}"); | |
| } | |
| } | |
| } |
Review — PR #794: tolerate not-yet-ready Patroni membersFocused, well-tested fix for a real production failure. The root-cause writeup is empirically grounded and the approach mirrors the existing Correctness verified:
Minor (non-blocking) nits:
No critical issues. Logic, error paths, and tests all hold up. ✅ (approved) |
PierreLeGuen
left a comment
There was a problem hiding this comment.
The fix isolates member-level parse failures so a single not-yet-ready node can't poison the whole /cluster topology response, mirroring the existing deserialize_lag strategy. Small, correct, and covered by two new regression tests.
Non-blocking follow-ups:
- The
warn!for dropped members logs only member name + serde error, no customer data (patroni_discovery.rs:60-79) — confirmed safe; downstream consumers incluster_manager.rsonly see fully-deserialized members. - Doc comment in
test_full_cluster_with_initializing_replica_parsessays "4 ready members" but the fixture has 3 (the assertion correctly checkslen == 3) — cosmetic. cargo fmt --checkfails on two changed assertions inpatroni_discovery.rs, and CILintis red for the same diff — worth a quickcargo fmtbefore merge.
Checks: cargo test -p database patroni_discovery (6/6 passed, incl. 2 new regression tests) and cargo clippy -p database --lib clean; cargo fmt --check / CI Lint failed on formatting.
Symptom
Adding a new postgres replica triggered this in both cloud-api and chat-api (shared
databasecrate):Exact root cause (confirmed empirically)
Patroni lists a node as a
/clustermember as soon as it registers in the DCS — before it publishes itsconn_url. So a replica mid-creation (statecreating replica, or an uninitialized node) appears withouthost/port/api_url. Those fields are required onClusterMember, so a strictVec<ClusterMember>parse fails the entire response.Reproduced against the real struct in a unit test:
One half-registered member poisons the whole cluster view → discovery refresh errors for every consumer until the new node finishes registering (and startup discovery would fail outright during that window). This is the same class as the prior
lag:"unknown"poison — different field.Fix
Parse members individually (
deserialize_members) and drop any that don't fully deserialize — a member with no connection info isn't a usable leader/replica target anyway, and the next refresh picks it up once it finishes registering.ClusterMemberstays rigid, socluster_manageris untouched (every surviving member still hashost/port). Same philosophy as the existingdeserialize_lagresilience.Tests
test_initializing_member_does_not_poison_parse— leader + onecreating replica(no host/port) → parses, bad member dropped.test_full_cluster_with_initializing_replica_parses— real staging/clustershape + an initializing replica → 3 ready members survive, leader discoverable.patroni_discoverytests still pass (6 passed; 0 failed).Validation
Verified in dev (deterministic repro against the real struct). End-to-end validation: deploy this, then add a staging postgres replica and confirm discovery stays healthy through the member's initialization window. Unblocks deploying new postgres instances.