Skip to content

fix(db): tolerate not-yet-ready members in Patroni cluster discovery#307

Merged
think-in-universe merged 1 commit into
mainfrom
fix/patroni-tolerate-initializing-members
Jun 15, 2026
Merged

fix(db): tolerate not-yet-ready members in Patroni cluster discovery#307
think-in-universe merged 1 commit into
mainfrom
fix/patroni-tolerate-initializing-members

Conversation

@Evrard-Nil

Copy link
Copy Markdown
Contributor

Mirror of nearai/cloud-api#794 — chat-api vendors its own copy of the database crate, so it hit the identical failure.

Symptom

Adding a new postgres replica logged, in chat-api and cloud-api:

ERROR database::patroni_discovery: Failed to refresh cluster state:
Failed to parse cluster info: error decoding response body

Root cause (confirmed against the real struct)

Patroni lists a node as a /cluster member as soon as it registers in the DCS — before it publishes its conn_url, so a replica mid-creation appears without host/port/api_url. Those fields are required on ClusterMember, so a strict Vec<ClusterMember> parse fails the entire response (missing field \host``) — one half-registered member poisons discovery for the whole initialization window.

Fix

deserialize_members parses members individually and drops any that don't fully deserialize. ClusterMember stays rigid, so the rest of the discovery/selection code is untouched (surviving members all have host/port). Same philosophy as the existing deserialize_lag resilience.

Tests

9 passed; 0 failed — adds test_initializing_member_does_not_poison_parse and test_full_cluster_with_initializing_replica_parses; fmt + clippy (-D warnings) clean.

Unblocks deploying new postgres instances.

Mirror of nearai/cloud-api#794 — chat-api vendors the same database crate.

Patroni lists a node as a /cluster member as soon as it registers in the DCS,
before it publishes its conn_url, so a replica mid-creation appears without
host/port/api_url. Those fields are required on ClusterMember, so a strict
Vec<ClusterMember> parse failed the whole response with 'missing field host' —
surfacing as 'Failed to refresh cluster state: ... error decoding response
body' and freezing topology discovery whenever a new postgres instance was
being added.

Parse each member independently and drop any that don't fully deserialize; a
member with no connection info is not a usable leader/replica target anyway,
and the next refresh picks it up once it finishes registering. Same resilience
philosophy as the existing deserialize_lag fix. Adds regression tests.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a custom deserializer deserialize_members for Patroni cluster members in crates/database/src/patroni_discovery.rs. This change ensures that if a single cluster member is uninitialized or half-registered (missing required fields like host or port), it is gracefully skipped rather than causing the entire cluster deserialization to fail. Additionally, two unit tests were added to verify this resilient parsing behavior. There are no review comments, and I have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@claude

claude Bot commented Jun 15, 2026

Copy link
Copy Markdown

Review — PR #307: tolerate not-yet-ready Patroni members

Verdict: ✅ Approve. Correct, minimal, well-targeted fix that mirrors the existing deserialize_lag resilience pattern.

Correctness

  • Root cause is sound: host/port are required on ClusterMember, so one half-registered member (missing field host) poisons the whole /cluster parse. Parsing members individually and dropping unparseable ones is the right fix.
  • Downstream consumption is safe with dropped members: update_cluster_state handles an empty member list, an absent leader is Option + warn!("No leader found"), and get_random_replica/get_next_replica guard on replicas.is_empty(). Skipped members are never valid leader/replica targets, so dropping them cannot select a bad node — and the next refresh re-includes them once conn_url is published.
  • Backward compatible: well-formed responses deserialize identically; the derived Serialize on ClusterInfo is unaffected by deserialize_with.
  • Privacy: the warn! logs only the Patroni node name (an infra hostname) and a serde error — no customer data. Compliant with CLAUDE.md.

Minor (non-blocking)

  • serde_json::from_value::<ClusterMember>(value.clone()) clones every member on the happy path just to retain value for the error branch. Since it runs per-member on each refresh, extract the name first then consume value to avoid the allocation:
let name = value.get("name").and_then(serde_json::Value::as_str)
    .unwrap_or("<unknown>").to_owned();
match serde_json::from_value::<ClusterMember>(value) {
    Ok(member) => members.push(member),
    Err(e) => warn!("Skipping not-yet-ready cluster member {name}: {e}"),
}

Refresh cadence is low, so this is purely a micro-optimization — fine to leave as-is.

Tests cover both the isolated and full-cluster cases. No critical issues.

@PierreLeGuen PierreLeGuen left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change adds deserialize_members, which parses each Patroni /cluster member independently and drops any that fail to deserialize (e.g. a replica mid-creation lacking host/port), rather than failing the whole cluster parse. This is correct and well-scoped, mirroring the existing deserialize_lag pattern.

  • Dropping not-yet-ready members is safe: update_cluster_state (lines 178–192) only selects members in running/streaming state using host/port, which dropped members lack — so selection behavior is unchanged.
  • The warn! log only emits the cluster node name and the serde error, exposing no customer data.
  • ClusterInfo is deserialization-only, so the custom #[serde(deserialize_with)] cannot affect any serialization path. No deny_unknown_fields, so extra fields like api_url are safely ignored.

Local checks: cargo test -p database (9/9 passed, including the two new regression tests) and cargo clippy -p database -- -D warnings (clean).

@think-in-universe think-in-universe merged commit 06325a3 into main Jun 15, 2026
2 checks passed
@think-in-universe think-in-universe deleted the fix/patroni-tolerate-initializing-members branch June 15, 2026 11:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants