From 9f9f1970bb84d410f7718d2921739db95659de78 Mon Sep 17 00:00:00 2001
From: Arek Borucki
Date: Tue, 9 Jun 2026 15:57:08 +0200
Subject: [PATCH 1/4] docs: add scaling guide for replicas, shards, and Keeper quorum

---
 docs/guides/scaling.mdx | 113 ++++++++++++++++++++++++++++++++++++++++
 docs/navigation.json | 3 +-
 2 files changed, 115 insertions(+), 1 deletion(-)
 create mode 100644 docs/guides/scaling.mdx

diff --git a/docs/guides/scaling.mdx b/docs/guides/scaling.mdx
new file mode 100644
index 00000000..cbb56103
--- /dev/null
+++ b/docs/guides/scaling.mdx
@@ -0,0 +1,113 @@
+---
+position: 4
+slug: /clickhouse-operator/guides/scaling
+title: 'Scaling'
+keywords: ['kubernetes', 'scaling', 'replicas', 'shards', 'keeper', 'quorum']
+description: 'How to scale ClickHouse replicas and shards and Keeper quorum members, and what the operator does automatically.'
+doc_type: 'guide'
+---
+
+# Scaling clusters
+
+You scale a cluster by editing the replica and shard counts on the Custom Resource. The operator reconciles the running cluster toward the new topology: it creates or removes the per-replica StatefulSets, keeps the schema in sync, and surfaces progress through status conditions.
+
+This guide covers how to scale `ClickHouseCluster` replicas and shards, how to scale a `KeeperCluster` quorum safely, and which conditions to watch while a scale operation is in flight.
+
+
+Scaling a `ClickHouseCluster` past one replica per shard requires a Keeper quorum and `ReplicatedMergeTree` tables — replication is what lets a second replica serve the same data. Point the cluster at a Keeper with `spec.keeperClusterRef` before you raise `spec.replicas`.
+
+
+## Scaling replicas {#scaling-replicas}
+
+`spec.replicas` sets the number of replicas in every shard. Each replica runs in its own StatefulSet named `-clickhouse--`, so a cluster with `shards: 2` and `replicas: 3` runs six StatefulSets.
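The StatefulSet arithmetic in the paragraph above can be checked with a small shell sketch. The function names `statefulset_count` and `list_replica_indices` are hypothetical helpers, not part of the operator or of `kubectl`, and the zero-based shard and replica indices are an assumption made only for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: one StatefulSet per replica per shard,
# so the total is shards * replicas.
statefulset_count() {
  local shards=$1 replicas=$2
  echo $(( shards * replicas ))
}

# Hypothetical helper: list every (shard, replica) pair.
# Zero-based indices are an assumption, not confirmed operator behavior.
list_replica_indices() {
  local shards=$1 replicas=$2 s r
  for (( s = 0; s < shards; s++ )); do
    for (( r = 0; r < replicas; r++ )); do
      echo "shard=$s replica=$r"
    done
  done
}

# shards: 2 and replicas: 3, as in the example above.
statefulset_count 2 3
list_replica_indices 2 3
```

Raising either count multiplies the number of StatefulSets, which is worth keeping in mind when estimating pod and volume usage before a scale up.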
+
+Raise or lower the count in place:
+
+```yaml
+spec:
+  replicas: 3 # was 1
+  keeperClusterRef:
+    name: my-keeper
+```
+
+On scale up the operator creates the new per-replica StatefulSets, waits for each pod to become ready, and then synchronizes the schema to the new replicas (see [Automatic schema sync](#automatic-schema-sync)). On scale down it removes the surplus StatefulSets and cleans up the stale replica metadata in Keeper.
+
+## Scaling shards {#scaling-shards}
+
+`spec.shards` sets the number of shards. Each new shard adds a full set of per-replica StatefulSets, and the operator creates one [PodDisruptionBudget per shard](/products/kubernetes-operator/guides/configuration#pod-disruption-budgets) so a disruption in one shard cannot count against another.
+
+```yaml
+spec:
+  shards: 3 # was 1
+  replicas: 2
+```
+
+Each shard holds a distinct slice of the data, and the operator does not copy or move rows between shards. A `Distributed` table or an explicit routing scheme decides which shard a row lands on, so adding a shard gives new writes somewhere to land without touching the rows already stored in the existing shards.
+
+## Automatic schema sync {#automatic-schema-sync}
+
+When `spec.settings.enableDatabaseSync` is `true` (the default), the operator keeps the schema aligned as the topology changes:
+
+- **On scale up** — once at least two replicas are ready, the operator replicates the database definitions to the newly created replicas, so a fresh replica joins with the same `Replicated` and integration databases as the rest of the cluster.
+- **On scale down** — before a replica disappears, the operator removes its registration from Keeper, so the shrunk cluster does not wait on metadata for a replica that no longer exists.
+
+This covers `Replicated` databases and integration database engines. It does not move table data — row data lives in `ReplicatedMergeTree` tables and replicates through Keeper independently of this schema sync. With a single ready replica there is nothing to replicate to, so the operator skips the step and logs that it has no target.
+
+Set `enableDatabaseSync: false` to turn the behavior off, for example when an external tool owns schema propagation. The operator then reports the `SchemaSyncDisabled` reason on the `SchemaInSync` condition.
+
+## Conditions to watch {#scaling-conditions}
+
+Inspect progress on the Custom Resource while a scale operation runs:
+
+```bash
+kubectl get clickhousecluster sample -o yaml | sed -n '/conditions:/,/^[^ ]/p'
+```
+
+| Condition | Reason | Meaning |
+|---|---|---|
+| `ClusterSizeAligned` | `UpToDate` | Running replica count matches the requested topology |
+| `ClusterSizeAligned` | `ScalingUp` | The operator is adding replicas |
+| `ClusterSizeAligned` | `ScalingDown` | The operator is removing replicas |
+| `SchemaInSync` | `ReplicasInSync` | Databases exist on all replicas and stale metadata is cleaned up |
+| `SchemaInSync` | `DatabasesNotCreated` | The operator has not finished creating databases on the new replicas |
+| `SchemaInSync` | `ReplicasNotCleanedUp` | Stale replica metadata from a scale down is not yet removed |
+| `SchemaInSync` | `SchemaSyncDisabled` | `enableDatabaseSync` is `false` |
+| `Ready` | `AllShardsReady` | Every shard has a ready replica |
+| `Ready` | `SomeShardsNotReady` | At least one shard has no ready replica |
+
+A scale operation is complete when `ClusterSizeAligned` reports `UpToDate`, `SchemaInSync` reports `ReplicasInSync`, and `Ready` reports `AllShardsReady`.
+
+## Scaling Keeper {#scaling-keeper}
+
+A `KeeperCluster` runs a RAFT quorum, so the operator changes its membership **one replica at a time** and only while the cluster is in a stable state. This protects the quorum: a `2F+1` cluster tolerates `F` members down, so a 3-node cluster keeps working with one member missing and a 5-node cluster with two.
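The `2F+1` rule above is plain majority arithmetic, sketched below in shell. The function names `quorum_size` and `tolerated_failures` are hypothetical and exist only for this illustration; nothing here queries Keeper or the operator.

```bash
#!/usr/bin/env bash
# Hypothetical helper: a majority quorum needs more than half of the members.
quorum_size() {
  local members=$1
  echo $(( members / 2 + 1 ))
}

# Hypothetical helper: members that can be down while a majority remains,
# that is F for a cluster of 2F+1 members (integer division).
tolerated_failures() {
  local members=$1
  echo $(( (members - 1) / 2 ))
}

for members in 1 3 5; do
  echo "members=$members quorum=$(quorum_size "$members") tolerated_failures=$(tolerated_failures "$members")"
done
```

By the same arithmetic, an even member count adds no tolerance over the odd count just below it (four members still tolerate only one failure), which is one reason quorums are usually sized at odd counts.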
+
+```yaml
+spec:
+  replicas: 5 # was 3
+```
+
+On scale up the operator adds the lowest free replica ID to the quorum; on scale down it removes the highest ID. Each step waits for the quorum to settle before the next one starts. The [Keeper PodDisruptionBudget](/products/kubernetes-operator/guides/configuration#pod-disruption-budgets) defaults to `maxUnavailable: replicas/2` to preserve the quorum during voluntary disruptions.
+
+The `ScaleAllowed` condition reports whether the quorum can change membership right now:
+
+| Reason | Meaning |
+|---|---|
+| `ReadyToScale` | The quorum is stable and the operator can add or remove a member |
+| `ReplicaHasPendingChanges` | A replica still has a pending configuration change |
+| `ReplicaNotReady` | A replica is not ready, so membership changes wait |
+| `NoQuorum` | The cluster has no quorum and cannot change membership safely |
+| `WaitingFollowers` | The operator is waiting for followers to catch up |
+
+Scale Keeper in single steps and let `ScaleAllowed` return to `ReadyToScale` between changes. Jumping several members at once does not bypass the one-at-a-time reconcile — the operator still walks the quorum one member per step.
+
+## Known limitation: growing shards on an existing cluster {#shard-topology-limitation}
+
+A StatefulSet has immutable fields (for example `serviceName`, `selector`, and `volumeClaimTemplates`). When a topology change would require mutating those fields on a shard's existing StatefulSet, the operator cannot update it in place, and that shard keeps its original shape while the newly created shards run the new one. This is tracked in [issue #191](https://github.com/ClickHouse/clickhouse-operator/issues/191).
+
+If you hit this, delete the affected StatefulSet so the operator recreates it with the current spec:
+
+```bash
+kubectl delete statefulset -clickhouse-0-0
+```
+
+Set `dataVolumeClaimSpec.persistentVolumeReclaimPolicy: Retain` so the PVCs survive the delete and the recreated StatefulSet re-binds to the same volumes.
diff --git a/docs/navigation.json b/docs/navigation.json
index f41a1541..0a36d44c 100644
--- a/docs/navigation.json
+++ b/docs/navigation.json
@@ -18,7 +18,8 @@
       "pages": [
         "products/kubernetes-operator/guides/introduction",
         "products/kubernetes-operator/guides/configuration",
-        "products/kubernetes-operator/guides/monitoring"
+        "products/kubernetes-operator/guides/monitoring",
+        "products/kubernetes-operator/guides/scaling"
       ]
     },
     {

From 963e060b39c769248cf0e5fd09d786fc9d31aadc Mon Sep 17 00:00:00 2001
From: Arek Borucki
Date: Tue, 9 Jun 2026 16:18:38 +0200
Subject: [PATCH 2/4] docs: add scaling guide for replicas, shards, and Keeper quorum

---
 docs/guides/scaling.mdx | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/guides/scaling.mdx b/docs/guides/scaling.mdx
index cbb56103..8e217b8f 100644
--- a/docs/guides/scaling.mdx
+++ b/docs/guides/scaling.mdx
@@ -14,7 +14,7 @@ You scale a cluster by editing the replica and shard counts on the Custom Resour
 This guide covers how to scale `ClickHouseCluster` replicas and shards, how to scale a `KeeperCluster` quorum safely, and which conditions to watch while a scale operation is in flight.
 
 
-Scaling a `ClickHouseCluster` past one replica per shard requires a Keeper quorum and `ReplicatedMergeTree` tables — replication is what lets a second replica serve the same data. Point the cluster at a Keeper with `spec.keeperClusterRef` before you raise `spec.replicas`.
+A `ClickHouseCluster` always needs a Keeper, referenced through the required `spec.keeperClusterRef` field — the operator coordinates the cluster through it regardless of size. To run more than one replica per shard, the data must also live in `ReplicatedMergeTree` tables, since replication is what lets a second replica serve the same rows.
 
 
 ## Scaling replicas {#scaling-replicas}
@@ -30,7 +30,7 @@ spec:
     name: my-keeper
 ```
 
-On scale up the operator creates the new per-replica StatefulSets, waits for each pod to become ready, and then synchronizes the schema to the new replicas (see [Automatic schema sync](#automatic-schema-sync)). On scale down it removes the surplus StatefulSets and cleans up the stale replica metadata in Keeper.
+On scale up the operator creates the new per-replica StatefulSets, waits for each pod to become ready, and then synchronizes the schema to the new replicas (see [Automatic schema sync](#automatic-schema-sync)). On scale down it removes the surplus StatefulSets and cleans up the stale replicated-database replica registrations the removed replicas left behind.
 
 ## Scaling shards {#scaling-shards}
 
@@ -49,7 +49,7 @@ Each shard holds a distinct slice of the data, and the operator does not copy or
 When `spec.settings.enableDatabaseSync` is `true` (the default), the operator keeps the schema aligned as the topology changes:
 
 - **On scale up** — once at least two replicas are ready, the operator replicates the database definitions to the newly created replicas, so a fresh replica joins with the same `Replicated` and integration databases as the rest of the cluster.
-- **On scale down** — before a replica disappears, the operator removes its registration from Keeper, so the shrunk cluster does not wait on metadata for a replica that no longer exists.
+- **On scale down** — before a replica disappears, the operator drops that replica's registration from each `Replicated` database with `SYSTEM DROP DATABASE REPLICA`, so the shrunk cluster does not wait on a `Replicated` database replica that no longer exists.
 
 This covers `Replicated` databases and integration database engines. It does not move table data — row data lives in `ReplicatedMergeTree` tables and replicates through Keeper independently of this schema sync. With a single ready replica there is nothing to replicate to, so the operator skips the step and logs that it has no target.
 

From 95b134909a5f0a95ef21fcd1393e1a6c221aef7f Mon Sep 17 00:00:00 2001
From: Arek Borucki
Date: Wed, 10 Jun 2026 08:04:52 +0200
Subject: [PATCH 3/4] docs: add scaling guide for replicas, shards, and Keeper quorum

---
 docs/guides/scaling.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/guides/scaling.mdx b/docs/guides/scaling.mdx
index 8e217b8f..ec6e1d3d 100644
--- a/docs/guides/scaling.mdx
+++ b/docs/guides/scaling.mdx
@@ -110,4 +110,4 @@ If you hit this, delete the affected StatefulSet so the operator recreates it wi
 kubectl delete statefulset -clickhouse-0-0
 ```
 
-Set `dataVolumeClaimSpec.persistentVolumeReclaimPolicy: Retain` so the PVCs survive the delete and the recreated StatefulSet re-binds to the same volumes.
+Deleting a StatefulSet does not delete its PVCs, so the recreated StatefulSet re-binds to the same volumes and the data is preserved. If you also need the underlying PersistentVolumes to survive a PVC deletion, use a StorageClass with `reclaimPolicy: Retain`.

From 589d3fe40b04ab76994c9731810e5ce1402e8f74 Mon Sep 17 00:00:00 2001
From: Arek Borucki
Date: Thu, 11 Jun 2026 07:12:06 +0200
Subject: [PATCH 4/4] docs: drop unreproduced shard-topology limitation section

---
 docs/guides/scaling.mdx | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/docs/guides/scaling.mdx b/docs/guides/scaling.mdx
index ec6e1d3d..6526a1e3 100644
--- a/docs/guides/scaling.mdx
+++ b/docs/guides/scaling.mdx
@@ -99,15 +99,3 @@ The `ScaleAllowed` condition reports whether the quorum can change membership ri
 | `WaitingFollowers` | The operator is waiting for followers to catch up |
 
 Scale Keeper in single steps and let `ScaleAllowed` return to `ReadyToScale` between changes. Jumping several members at once does not bypass the one-at-a-time reconcile — the operator still walks the quorum one member per step.
-
-## Known limitation: growing shards on an existing cluster {#shard-topology-limitation}
-
-A StatefulSet has immutable fields (for example `serviceName`, `selector`, and `volumeClaimTemplates`). When a topology change would require mutating those fields on a shard's existing StatefulSet, the operator cannot update it in place, and that shard keeps its original shape while the newly created shards run the new one. This is tracked in [issue #191](https://github.com/ClickHouse/clickhouse-operator/issues/191).
-
-If you hit this, delete the affected StatefulSet so the operator recreates it with the current spec:
-
-```bash
-kubectl delete statefulset -clickhouse-0-0
-```
-
-Deleting a StatefulSet does not delete its PVCs, so the recreated StatefulSet re-binds to the same volumes and the data is preserved. If you also need the underlying PersistentVolumes to survive a PVC deletion, use a StorageClass with `reclaimPolicy: Retain`.