fix(db): tolerate not-yet-ready members in Patroni cluster discovery#307
Conversation
Mirror of nearai/cloud-api#794 — chat-api vendors the same database crate. Patroni lists a node as a /cluster member as soon as it registers in the DCS, before it publishes its conn_url, so a replica mid-creation appears without host/port/api_url. Those fields are required on ClusterMember, so a strict Vec<ClusterMember> parse failed the whole response with 'missing field host' — surfacing as 'Failed to refresh cluster state: ... error decoding response body' and freezing topology discovery whenever a new postgres instance was being added. Parse each member independently and drop any that don't fully deserialize; a member with no connection info is not a usable leader/replica target anyway, and the next refresh picks it up once it finishes registering. Same resilience philosophy as the existing deserialize_lag fix. Adds regression tests.
There was a problem hiding this comment.
Code Review
This pull request introduces a custom deserializer deserialize_members for Patroni cluster members in crates/database/src/patroni_discovery.rs. This change ensures that if a single cluster member is uninitialized or half-registered (missing required fields like host or port), it is gracefully skipped rather than causing the entire cluster deserialization to fail. Additionally, two unit tests were added to verify this resilient parsing behavior. There are no review comments, and I have no feedback to provide.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
Review — PR #307: tolerate not-yet-ready Patroni members Verdict: ✅ Approve. Correct, minimal, well-targeted fix that mirrors the existing Correctness
Minor (non-blocking)
let name = value.get("name").and_then(serde_json::Value::as_str)
.unwrap_or("<unknown>").to_owned();
match serde_json::from_value::<ClusterMember>(value) {
Ok(member) => members.push(member),
Err(e) => warn!("Skipping not-yet-ready cluster member {name}: {e}"),
}Refresh cadence is low, so this is purely a micro-optimization — fine to leave as-is. Tests cover both the isolated and full-cluster cases. No critical issues. ✅ |
PierreLeGuen
left a comment
There was a problem hiding this comment.
The change adds deserialize_members, which parses each Patroni /cluster member independently and drops any that fail to deserialize (e.g. a replica mid-creation lacking host/port), rather than failing the whole cluster parse. This is correct and well-scoped, mirroring the existing deserialize_lag pattern.
- Dropping not-yet-ready members is safe:
update_cluster_state(lines 178–192) only selects members inrunning/streamingstate usinghost/port, which dropped members lack — so selection behavior is unchanged. - The
warn!log only emits the cluster nodenameand the serde error, exposing no customer data. ClusterInfois deserialization-only, so the custom#[serde(deserialize_with)]cannot affect any serialization path. Nodeny_unknown_fields, so extra fields likeapi_urlare safely ignored.
Local checks: cargo test -p database (9/9 passed, including the two new regression tests) and cargo clippy -p database -- -D warnings (clean).
Mirror of nearai/cloud-api#794 — chat-api vendors its own copy of the
databasecrate, so it hit the identical failure.Symptom
Adding a new postgres replica logged, in chat-api and cloud-api:
Root cause (confirmed against the real struct)
Patroni lists a node as a
/clustermember as soon as it registers in the DCS — before it publishes itsconn_url, so a replica mid-creation appears withouthost/port/api_url. Those fields are required onClusterMember, so a strictVec<ClusterMember>parse fails the entire response (missing field \host``) — one half-registered member poisons discovery for the whole initialization window.Fix
deserialize_membersparses members individually and drops any that don't fully deserialize.ClusterMemberstays rigid, so the rest of the discovery/selection code is untouched (surviving members all havehost/port). Same philosophy as the existingdeserialize_lagresilience.Tests
9 passed; 0 failed— addstest_initializing_member_does_not_poison_parseandtest_full_cluster_with_initializing_replica_parses; fmt + clippy (-D warnings) clean.Unblocks deploying new postgres instances.