Stabilize sandbox Elasticsearch StatefulSet by arielr-lt · Pull Request #1013 · CredentialEngine/CredentialRegistry

arielr-lt · 2026-03-23T19:07:08Z

Summary

- Set podManagementPolicy: Parallel so that nodes start simultaneously and can discover each other during cluster formation. OrderedReady (the default) caused a deadlock: neither node could become ready without the other already being up.
- Set timeoutSeconds: 10 on the readiness and liveness probes. The default of 1s was too short for the wait_for_status=yellow health endpoint, so nodes were killed before they could join the cluster.
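
For reference, the two settings might sit in the StatefulSet roughly as follows. This is a sketch, not the manifest from this PR: the resource name, labels, replica count, image placeholder, ports, and the exact probe path are assumptions. Only podManagementPolicy, timeoutSeconds, the wait_for_status=yellow endpoint, and the elasticsearch-discovery service name come from the description above.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch                      # assumed name
spec:
  serviceName: elasticsearch-discovery     # governing service named in the description
  replicas: 2                              # assumed; the description speaks of two nodes
  podManagementPolicy: Parallel            # default is OrderedReady, which starts pods one at a time
  selector:
    matchLabels:
      app: elasticsearch                   # assumed label
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
        - name: elasticsearch
          image: "docker.elastic.co/elasticsearch/elasticsearch:<version>"  # placeholder, not from the PR
          ports:
            - name: http
              containerPort: 9200
            - name: transport
              containerPort: 9300
          readinessProbe:
            httpGet:
              path: /_cluster/health?wait_for_status=yellow   # assumed probe path
              port: http
            timeoutSeconds: 10             # Kubernetes default is 1
          livenessProbe:
            httpGet:
              path: /_cluster/health?wait_for_status=yellow   # assumed to match the readiness probe
              port: http
            timeoutSeconds: 10
```

With OrderedReady, the second pod is not created until the first reports ready, which is what makes the deadlock possible when readiness depends on cluster formation.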

The elasticsearch-discovery service already had publishNotReadyAddresses: true in the manifest, but the live cluster had drifted and the setting was not in effect. It has now been applied directly to the cluster.
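
A discovery service with this setting could look like the sketch below. The headless clusterIP, selector label, and transport port are assumptions; only the service name and publishNotReadyAddresses: true are taken from the description.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch-discovery
spec:
  clusterIP: None                  # assumed headless, as is typical for StatefulSet discovery
  publishNotReadyAddresses: true   # publish pod addresses in DNS before the pods pass readiness
  selector:
    app: elasticsearch             # assumed label
  ports:
    - name: transport
      port: 9300                   # assumed Elasticsearch transport port
```

Drift of this kind can usually be spotted by running kubectl diff -f against the manifest file, which compares the file with the live object without changing anything.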

These fixes were identified during an incident where a dead EC2 node (kubelet stopped posting status) caused the cluster to lose quorum and enter a CrashLoopBackOff cycle that could not self-recover.

Ariel Rolfo added 3 commits March 19, 2026 18:02

Remove unused OpenSearch and fix Elasticsearch to single-node in staging

ae8beb9

- Remove opensearch-deployment.yaml and opensearch-pvc.yaml (unused, no app references)
- Set elasticsearch replicas to 1 and configure discovery.type=single-node
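
As an illustration, the single-node change might appear in the staging workload spec as the fragment below. The container name is an assumption, and the fragment assumes the setting is passed as an environment variable, which the official Elasticsearch image accepts; the PR may set it another way.

```yaml
# Fragment of the Elasticsearch workload spec (other fields omitted)
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: elasticsearch        # assumed container name
          env:
            - name: discovery.type
              value: single-node     # skip multi-node discovery and form a one-node cluster
```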

(#1011) Increase prod ingress max body size to 10m

02ba07f

Missing proxy-body-size annotation caused nginx to use its default 1m limit, rejecting large publish payloads (~3.4MB) with 413 errors. Matches the limit already set in sandbox and staging.
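
A sketch of the annotation, assuming the community ingress-nginx controller (whose annotations use the nginx.ingress.kubernetes.io prefix). The Ingress name is hypothetical and the rest of the resource is omitted.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: registry-ingress             # hypothetical name
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"   # without this, nginx falls back to 1m
# spec omitted
```

A 10m limit leaves headroom over the ~3.4MB payloads that were being rejected with 413.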

edgarf self-requested a review March 26, 2026 06:01

edgarf approved these changes Mar 26, 2026

arielr-lt wants to merge 3 commits into master from fix/sandbox-elasticsearch-stability
