From 9f9f1970bb84d410f7718d2921739db95659de78 Mon Sep 17 00:00:00 2001
From: Arek Borucki <arekborucki@protonmail.com>
Date: Tue, 9 Jun 2026 15:57:08 +0200
Subject: [PATCH 1/4] docs: add scaling guide for replicas, shards, and Keeper
 quorum

---
 docs/guides/scaling.mdx | 113 ++++++++++++++++++++++++++++++++++++++++
 docs/navigation.json    |   3 +-
 2 files changed, 115 insertions(+), 1 deletion(-)
 create mode 100644 docs/guides/scaling.mdx
diff --git a/docs/guides/scaling.mdx b/docs/guides/scaling.mdx
new file mode 100644
index 00000000..cbb56103
--- /dev/null
+++ b/docs/guides/scaling.mdx
@@ -0,0 +1,113 @@
+---
+position: 4
+slug: /clickhouse-operator/guides/scaling
+title: 'Scaling'
+keywords: ['kubernetes', 'scaling', 'replicas', 'shards', 'keeper', 'quorum']
+description: 'How to scale ClickHouse replicas and shards, how to resize a Keeper quorum, and what the operator does automatically.'
+doc_type: 'guide'
+---
+
+# Scaling clusters
+
+You scale a cluster by editing the replica and shard counts on the Custom Resource. The operator reconciles the running cluster toward the new topology: it creates or removes the per-replica StatefulSets, keeps the schema in sync, and surfaces progress through status conditions.
+
+This guide covers how to scale `ClickHouseCluster` replicas and shards, how to scale a `KeeperCluster` quorum safely, and which conditions to watch while a scale operation is in flight.
+
+<Note>
+Scaling a `ClickHouseCluster` past one replica per shard requires a Keeper quorum and `ReplicatedMergeTree` tables — replication is what lets a second replica serve the same data. Point the cluster at a Keeper with `spec.keeperClusterRef` before you raise `spec.replicas`.
+</Note>
+
+## Scaling replicas {#scaling-replicas}
+
+`spec.replicas` sets the number of replicas in every shard. Each replica runs in its own StatefulSet named `<cluster>-clickhouse-<shard>-<replica>`, so a cluster with `shards: 2` and `replicas: 3` runs six StatefulSets.
+
+Raise or lower the count in place:
+
+```yaml
+spec:
+  replicas: 3   # was 1
+  keeperClusterRef:
+    name: my-keeper
+```
+
+On scale up the operator creates the new per-replica StatefulSets, waits for each pod to become ready, and then synchronizes the schema to the new replicas (see [Automatic schema sync](#automatic-schema-sync)). On scale down it removes the surplus StatefulSets and cleans up the stale replica metadata in Keeper.
+
+## Scaling shards {#scaling-shards}
+
+`spec.shards` sets the number of shards. Each new shard adds a full set of per-replica StatefulSets, and the operator creates one [PodDisruptionBudget per shard](/products/kubernetes-operator/guides/configuration#pod-disruption-budgets) so a disruption in one shard cannot count against another.
+
+```yaml
+spec:
+  shards: 3   # was 1
+  replicas: 2
+```
+
+Each shard holds a distinct slice of the data, and the operator does not copy or move rows between shards. A `Distributed` table or an explicit routing scheme decides which shard a row lands on, so adding a shard gives new writes somewhere to land without touching the rows already stored in the existing shards.
+
+## Automatic schema sync {#automatic-schema-sync}
+
+When `spec.settings.enableDatabaseSync` is `true` (the default), the operator keeps the schema aligned as the topology changes:
+
+- **On scale up** — once at least two replicas are ready, the operator replicates the database definitions to the newly created replicas, so a fresh replica joins with the same `Replicated` and integration databases as the rest of the cluster.
+- **On scale down** — before a replica disappears, the operator removes its registration from Keeper, so the shrunk cluster does not wait on metadata for a replica that no longer exists.
+
+This covers `Replicated` databases and integration database engines. It does not move table data — row data lives in `ReplicatedMergeTree` tables and replicates through Keeper independently of this schema sync. With a single ready replica there is nothing to replicate to, so the operator skips the step and logs that it has no target.
+
+Set `spec.settings.enableDatabaseSync` to `false` to turn the behavior off, for example when an external tool owns schema propagation. The operator then reports the `SchemaSyncDisabled` reason on the `SchemaInSync` condition.
+
+## Conditions to watch {#scaling-conditions}
+
+Inspect progress on the Custom Resource while a scale operation runs:
+
+```bash
+kubectl get clickhousecluster sample -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
+```
+
+| Condition | Reason | Meaning |
+|---|---|---|
+| `ClusterSizeAligned` | `UpToDate` | Running replica count matches the requested topology |
+| `ClusterSizeAligned` | `ScalingUp` | The operator is adding replicas |
+| `ClusterSizeAligned` | `ScalingDown` | The operator is removing replicas |
+| `SchemaInSync` | `ReplicasInSync` | Databases exist on all replicas and stale metadata is cleaned up |
+| `SchemaInSync` | `DatabasesNotCreated` | The operator has not finished creating databases on the new replicas |
+| `SchemaInSync` | `ReplicasNotCleanedUp` | Stale replica metadata from a scale down is not yet removed |
+| `SchemaInSync` | `SchemaSyncDisabled` | `enableDatabaseSync` is `false` |
+| `Ready` | `AllShardsReady` | Every shard has a ready replica |
+| `Ready` | `SomeShardsNotReady` | At least one shard has no ready replica |
+
+A scale operation is complete when `ClusterSizeAligned` reports `UpToDate`, `SchemaInSync` reports `ReplicasInSync`, and `Ready` reports `AllShardsReady`.
+
+## Scaling Keeper {#scaling-keeper}
+
+A `KeeperCluster` runs a Raft quorum, so the operator changes its membership **one replica at a time** and only while the cluster is in a stable state. This protects the quorum: a `2F+1` cluster tolerates `F` members down, so a 3-node cluster keeps working with one member missing and a 5-node cluster with two.
+
+```yaml
+spec:
+  replicas: 5   # was 3
+```
+
+On scale up the operator adds the lowest free replica ID to the quorum; on scale down it removes the highest ID. Each step waits for the quorum to settle before the next one starts. The [Keeper PodDisruptionBudget](/products/kubernetes-operator/guides/configuration#pod-disruption-budgets) defaults to `maxUnavailable: replicas/2` to preserve the quorum during voluntary disruptions.
+
+The `ScaleAllowed` condition reports whether the quorum can change membership right now:
+
+| Reason | Meaning |
+|---|---|
+| `ReadyToScale` | The quorum is stable and the operator can add or remove a member |
+| `ReplicaHasPendingChanges` | A replica still has a pending configuration change |
+| `ReplicaNotReady` | A replica is not ready, so membership changes wait |
+| `NoQuorum` | The cluster has no quorum and cannot change membership safely |
+| `WaitingFollowers` | The operator is waiting for followers to catch up |
+
+Scale Keeper in single steps and let `ScaleAllowed` return to `ReadyToScale` between changes. Jumping several members at once does not bypass the one-at-a-time reconcile — the operator still walks the quorum one member per step.
+
+## Known limitation: growing shards on an existing cluster {#shard-topology-limitation}
+
+A StatefulSet has immutable fields (for example `serviceName`, `selector`, and `volumeClaimTemplates`). When a topology change would require mutating those fields on a shard's existing StatefulSet, the operator cannot update it in place, and that shard keeps its original shape while the newly created shards run the new one. This is tracked in [issue #191](https://github.com/ClickHouse/clickhouse-operator/issues/191).
+
+If you hit this, delete the affected StatefulSet so the operator recreates it with the current spec:
+
+```bash
+kubectl delete statefulset <cluster>-clickhouse-0-0
+```
+
+Set `dataVolumeClaimSpec.persistentVolumeReclaimPolicy: Retain` so the PVCs survive the delete and the recreated StatefulSet re-binds to the same volumes.
diff --git a/docs/navigation.json b/docs/navigation.json
index f41a1541..0a36d44c 100644
--- a/docs/navigation.json
+++ b/docs/navigation.json
@@ -18,7 +18,8 @@
       "pages": [
         "products/kubernetes-operator/guides/introduction",
         "products/kubernetes-operator/guides/configuration",
-        "products/kubernetes-operator/guides/monitoring"
+        "products/kubernetes-operator/guides/monitoring",
+        "products/kubernetes-operator/guides/scaling"
       ]
     },
     {

From 963e060b39c769248cf0e5fd09d786fc9d31aadc Mon Sep 17 00:00:00 2001
From: Arek Borucki <arekborucki@protonmail.com>
Date: Tue, 9 Jun 2026 16:18:38 +0200
Subject: [PATCH 2/4] docs: clarify Keeper requirement and replica cleanup in
 scaling guide

---
 docs/guides/scaling.mdx | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/guides/scaling.mdx b/docs/guides/scaling.mdx
index cbb56103..8e217b8f 100644
--- a/docs/guides/scaling.mdx
+++ b/docs/guides/scaling.mdx
@@ -14,7 +14,7 @@ You scale a cluster by editing the replica and shard counts on the Custom Resour
 This guide covers how to scale `ClickHouseCluster` replicas and shards, how to scale a `KeeperCluster` quorum safely, and which conditions to watch while a scale operation is in flight.
 
 <Note>
-Scaling a `ClickHouseCluster` past one replica per shard requires a Keeper quorum and `ReplicatedMergeTree` tables — replication is what lets a second replica serve the same data. Point the cluster at a Keeper with `spec.keeperClusterRef` before you raise `spec.replicas`.
+A `ClickHouseCluster` always needs a Keeper, referenced through the required `spec.keeperClusterRef` field — the operator coordinates the cluster through it regardless of size. To run more than one replica per shard, the data must also live in `ReplicatedMergeTree` tables, since replication is what lets a second replica serve the same rows.
 </Note>
 
 ## Scaling replicas {#scaling-replicas}
@@ -30,7 +30,7 @@ spec:
     name: my-keeper
 ```
 
-On scale up the operator creates the new per-replica StatefulSets, waits for each pod to become ready, and then synchronizes the schema to the new replicas (see [Automatic schema sync](#automatic-schema-sync)). On scale down it removes the surplus StatefulSets and cleans up the stale replica metadata in Keeper.
+On scale up the operator creates the new per-replica StatefulSets, waits for each pod to become ready, and then synchronizes the schema to the new replicas (see [Automatic schema sync](#automatic-schema-sync)). On scale down it removes the surplus StatefulSets and cleans up the stale replicated-database replica registrations the removed replicas left behind.
 
 ## Scaling shards {#scaling-shards}
 
@@ -49,7 +49,7 @@ Each shard holds a distinct slice of the data, and the operator does not copy or
 When `spec.settings.enableDatabaseSync` is `true` (the default), the operator keeps the schema aligned as the topology changes:
 
 - **On scale up** — once at least two replicas are ready, the operator replicates the database definitions to the newly created replicas, so a fresh replica joins with the same `Replicated` and integration databases as the rest of the cluster.
-- **On scale down** — before a replica disappears, the operator removes its registration from Keeper, so the shrunk cluster does not wait on metadata for a replica that no longer exists.
+- **On scale down** — before a replica disappears, the operator drops that replica's registration from each `Replicated` database with `SYSTEM DROP DATABASE REPLICA`, so the shrunk cluster does not wait on a `Replicated` database replica that no longer exists.
 
 This covers `Replicated` databases and integration database engines. It does not move table data — row data lives in `ReplicatedMergeTree` tables and replicates through Keeper independently of this schema sync. With a single ready replica there is nothing to replicate to, so the operator skips the step and logs that it has no target.
 

From 95b134909a5f0a95ef21fcd1393e1a6c221aef7f Mon Sep 17 00:00:00 2001
From: Arek Borucki <arekborucki@protonmail.com>
Date: Wed, 10 Jun 2026 08:04:52 +0200
Subject: [PATCH 3/4] docs: correct PVC retention note in scaling guide

---
 docs/guides/scaling.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/guides/scaling.mdx b/docs/guides/scaling.mdx
index 8e217b8f..ec6e1d3d 100644
--- a/docs/guides/scaling.mdx
+++ b/docs/guides/scaling.mdx
@@ -110,4 +110,4 @@ If you hit this, delete the affected StatefulSet so the operator recreates it wi
 kubectl delete statefulset <cluster>-clickhouse-0-0
 ```
 
-Set `dataVolumeClaimSpec.persistentVolumeReclaimPolicy: Retain` so the PVCs survive the delete and the recreated StatefulSet re-binds to the same volumes.
+Deleting a StatefulSet does not delete its PVCs, so the recreated StatefulSet re-binds to the same volumes and the data is preserved. If you also need the underlying PersistentVolumes to survive a PVC deletion, use a StorageClass with `reclaimPolicy: Retain`.

From 589d3fe40b04ab76994c9731810e5ce1402e8f74 Mon Sep 17 00:00:00 2001
From: Arek Borucki <arekborucki@protonmail.com>
Date: Thu, 11 Jun 2026 07:12:06 +0200
Subject: [PATCH 4/4] docs: drop unreproduced shard-topology limitation section

---
 docs/guides/scaling.mdx | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/docs/guides/scaling.mdx b/docs/guides/scaling.mdx
index ec6e1d3d..6526a1e3 100644
--- a/docs/guides/scaling.mdx
+++ b/docs/guides/scaling.mdx
@@ -99,15 +99,3 @@ The `ScaleAllowed` condition reports whether the quorum can change membership ri
 | `WaitingFollowers` | The operator is waiting for followers to catch up |
 
 Scale Keeper in single steps and let `ScaleAllowed` return to `ReadyToScale` between changes. Jumping several members at once does not bypass the one-at-a-time reconcile — the operator still walks the quorum one member per step.
-
-## Known limitation: growing shards on an existing cluster {#shard-topology-limitation}
-
-A StatefulSet has immutable fields (for example `serviceName`, `selector`, and `volumeClaimTemplates`). When a topology change would require mutating those fields on a shard's existing StatefulSet, the operator cannot update it in place, and that shard keeps its original shape while the newly created shards run the new one. This is tracked in [issue #191](https://github.com/ClickHouse/clickhouse-operator/issues/191).
-
-If you hit this, delete the affected StatefulSet so the operator recreates it with the current spec:
-
-```bash
-kubectl delete statefulset <cluster>-clickhouse-0-0
-```
-
-Deleting a StatefulSet does not delete its PVCs, so the recreated StatefulSet re-binds to the same volumes and the data is preserved. If you also need the underlying PersistentVolumes to survive a PVC deletion, use a StorageClass with `reclaimPolicy: Retain`.