From bb696e018d66398fa0f4e0ac69de370967207cc8 Mon Sep 17 00:00:00 2001 From: Ben Schumacher Date: Wed, 15 Apr 2026 11:29:13 +0200 Subject: [PATCH 1/9] docs: clarify HA vs DR and active/passive support in backup-disaster-recovery Distinguish high availability (single-site clustering) from disaster recovery (multi-site failover), clarify that Mattermost supports active/passive DR only and does not support active/active deployments, and rename the "High Availability deployment" section to "Active/passive DR deployment" for accuracy. Co-Authored-By: Claude Sonnet 4.6 --- source/deployment-guide/backup-disaster-recovery.rst | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/source/deployment-guide/backup-disaster-recovery.rst b/source/deployment-guide/backup-disaster-recovery.rst index aed15d5fed5..6d2064a8182 100644 --- a/source/deployment-guide/backup-disaster-recovery.rst +++ b/source/deployment-guide/backup-disaster-recovery.rst @@ -29,11 +29,17 @@ To back up your Mattermost server: To restore a Mattermost instance from backup, restore your database, ``config.json`` file, and optionally the locally stored user files into the locations from which they were backed up. -Disaster recovery +Disaster recovery ----------------- An appropriate disaster recovery plan weighs the benefits of mitigating specific risks against the cost and complexity of setting up disaster recovery infrastructure and automation. +**High availability (HA) vs. disaster recovery (DR)** + +HA and DR are distinct concepts that are often confused. HA refers to a clustered deployment within a single site that eliminates single points of failure and keeps Mattermost running through individual component outages (e.g., a failed app node or database replica). DR addresses the broader scenario of an entire site or region becoming unavailable, and typically requires a secondary deployment in a separate data center or cloud region. 
+
+Mattermost supports active/passive DR, where a secondary site is kept in sync but only activated during a failover. Mattermost does not support active/active deployments, where both sites serve live traffic simultaneously.
+
 Automated backup
 ~~~~~~~~~~~~~~~~
@@ -46,12 +52,12 @@ Automating backups for a Mattermost server provides a copy of the server's state
 
 Recovering from a failure using a backup is typically a manual process and will incur downtime. The alternative is to automate recovery using a high availability deployment.
 
-High Availability deployment
+Active/passive DR deployment
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Enterprise customers who use Mattermost for mission-critical operations must ensure continuous availability and operational resilience. A robust disaster recovery strategy is essential to mitigate risks associated with data center failures, ensuring that users can access Mattermost seamlessly, even in the event of unexpected outages.
 
-This section details the steps needed to set up Mattermost in a disaster recovery mode, and how to fail over from one data center to another.
+This section details the steps needed to set up Mattermost in an active/passive disaster recovery configuration, and how to fail over from one data center to another.
 
 .. tip::

From 7fd420e66077a9757a4474a71c2169ebdc0eb448 Mon Sep 17 00:00:00 2001
From: Ben Schumacher
Date: Wed, 15 Apr 2026 13:04:18 +0200
Subject: [PATCH 2/9] docs: move AWS DR guide to dedicated subpage

Extract the AWS-specific active/passive DR deployment steps from backup-disaster-recovery.rst into a new disaster-recovery-aws.rst subpage. The main page now links to it via toctree, keeping the overview page concise and making room for future platform-specific guides.
Co-Authored-By: Claude Sonnet 4.6 --- .../backup-disaster-recovery.rst | 354 +---------------- .../disaster-recovery-aws.rst | 358 ++++++++++++++++++ 2 files changed, 362 insertions(+), 350 deletions(-) create mode 100644 source/deployment-guide/disaster-recovery-aws.rst diff --git a/source/deployment-guide/backup-disaster-recovery.rst b/source/deployment-guide/backup-disaster-recovery.rst index 6d2064a8182..1b8aba84234 100644 --- a/source/deployment-guide/backup-disaster-recovery.rst +++ b/source/deployment-guide/backup-disaster-recovery.rst @@ -55,358 +55,12 @@ Recovering from a failure using a backup is typically a manual process and will Active/passive DR deployment ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Enterprise customers who use Mattermost for mission-critical operations must ensure continuous availability and operational resilience. A robust disaster recovery strategy is essential to mitigate risks associated with data center failures, ensuring that users can access Mattermost seamlessly, even in the event of unexpected outages. +For step-by-step instructions on setting up Mattermost in an active/passive DR configuration across two data centers, including how to replicate the database, file storage, and search indices, and how to perform a failover, see the platform-specific guide: -This section details the steps needed to set up Mattermost in an active/passive disaster recovery configuration, and how to fail over from one data center to another. +.. toctree:: + :maxdepth: 1 -.. tip:: - - To learn how to safely upgrade your deployment in Kubernetes for High Availability and Active/Active support, see the :doc:`Upgrading Mattermost in Kubernetes and High Availability Environments ` documenation. - -Set up in one data center -^^^^^^^^^^^^^^^^^^^^^^^^^ - -As a first step, set up Mattermost in a single data center. At a very basic high level, this would be something like below: - -.. 
image:: ../images/dr1.png - :alt: An architecture diagram showing a single proxy that's forwarding traffic to 2 nodes, a database with single writer + n readers, and an S3 bucket and ES/OS using AWS OpenSearch service. - -The diagram above has a single proxy, forwarding traffic to 2 nodes. There's also a database with single writer + n readers and an S3 bucket and ES/OS using AWS OpenSearch service. - -At this stage, we are ignoring other details like LDAP/SAML, SMTP etc. - -.. tip:: - The following architecture would be implemented when an entire region goes down. It does not cover the case when a single server/service goes down. For example: - - - If a single app node goes down, follow best practices to provision a new node. - - If a database replica node goes down, create a new replica from AWS console. Or set a policy to do so automatically. - -Replicate database -^^^^^^^^^^^^^^^^^^ - -The next tasks include creating a global AWS Cluster. - -1. Select the RDS instance in the AWS Console, and expand the **Actions** menu to select **Add AWS Region**. - -2. Choose the secondary region and enter the other details. - -.. warning:: - - Select the **Enable write forwarding** option on the secondary cluster to help forward write operations from secondary to primary. See the `AmazonRDS write forwarding `_ documentation for details. - - Also verify the PostgreSQL version and ensure it allows ``write forwarding``. Not all PostgreSQL versions allow it. See the `Amazon RDS write forwarding region and version availability `_ documentation for details. - -You should now have a global cluster with the primary cluster in ``us-west-1``, and the secondary cluster in ``us-east-1``: - -.. image:: ../images/dr2.png - :alt: A screenshot of the AWS console with a global RDS cluster where the primary cluster is us-west-1 and the secondary cluster is us-east-1. - -Replicate S3 bucket -^^^^^^^^^^^^^^^^^^^ - -1. Create a new S3 bucket in the secondary region. - -2. 
Back in the original bucket, go to the **Properties** tab, and enable **Bucket versioning**. - -3. Go to the **Management** tab, scroll down to **Replication Rules**, and create a new replication rule. - -4. In the rule, select the source bucket, and then choose **Apply to all objects in the bucket** to replicate everything in the bucket. - -5. Choose the destination bucket. - -6. For the IAM role, select **Create new role**. - -.. warning:: - - Select the **Replica modification sync** option for the bucket to help keep the replica and source buckets in sync with each other. - -7. Select **Save**. - -8. Select **Yes** when prompted to start a job to replicate any existing objects to the secondary bucket or not. - -9. Perform these same steps on the secondary bucket. - -Now you have bi-directional replication working between these S3 replica and source buckets. - -Replicate ES/OS storage -^^^^^^^^^^^^^^^^^^^^^^^ - -1. To replicate ES/OS storage, set up CCR (cross-cluster replication) for AWS OpenSearch with the following requirements: - - - Elasticsearch 7.10 or OpenSearch 2.x - - Fine-grained access control enabled - - Node-to-node encryption enabled - -.. tip:: - - All you need is a recent OpenSearch version with fine-grained access control enabled. Node-to-node encryption is automatically enabled once you enable fine-grained access control. - -2. You also need to add the ``CrossClusterGet`` permission on the IAM policy for the OS cluster set under the **Security Configuration** tab for your OS domain. We recommend the following as per AWS, but feel free to fine-tune as necessary: - - .. code-block:: sh - - { - "Version": "2012-10-17", - "Statement": [ - { - "Effect": "Allow", - "Principal": { - "AWS": "*" - }, - "Action": "es:ESHttp*", - "Resource": "arn:aws:es:::domain//*" - }, - { - "Effect": "Allow", - "Principal": { - "AWS": "*" - }, - "Action": "es:ESCrossClusterGet", - "Resource": "arn:aws:es:::domain/" - } - ] - } - -To recap: - -- Use OpenSearch 2.x. 
-- Enable fine-grained access control. -- Create the master user, and note the server credentials. -- Set the IAM policy as above. - -.. warning:: - - After creating the master user, IP based access to the OS might not work from Mattermost application nodes. You may need to update the ``ElasticSearchSettings`` section in ``config.json`` to update the server :ref:`username ` and :ref:`password `. - -3. Create a new OS cluster in the secondary region. Follow the same steps again for this cluster. - - .. warning:: - - At this stage, ensure that you have all indices populated with data in the primary region. Run a bulk index to do that if you haven’t already. - -4. Begin replication from the primary to secondary region. - - a. First, create a connection from secondary to primary. Note that replication in OS works in a “pull“ model, so the secondary site pulls data from the primary. - - b. In the Amazon OpenSearch Service console, select the secondary domain, go to the **Connections** tab, and choose **Request**. - - c. For **Connection alias**, enter a name for your connection. - - d. Choose **connect to a domain in another AWS account or region**, and enter the **ARN** of the primary domain. - - e. Select **Request** to send a permission request to the primary domain. - - f. Open the primary domain to see and accept the incoming request under the **Connections** tab. - -5. Now set up the replication rules for indices. - - a. SSH into an app node in the secondary region to set up an auto-follow rule for the ``posts*`` indices because of the daily naming scheme and monthly aggregation. - - b. For the other indices, replicate each of them. You can also set up a rule with ``*`` to replicate everything, but that would also include the hidden and system indices which you don’t want. - - c. Set up the auto-follow for ``posts*`` indices: - - .. 
code-block:: sh - - curl -XPOST -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/_autofollow?pretty' -d ' - { - "leader_alias" : "", - "name": "autofollow-rule", - "pattern": "posts*", - "use_roles":{ - "leader_cluster_role": "all_access", - "follower_cluster_role": "all_access" - } - }' - - d. Check the status of the auto-follow rule: - - .. code-block:: sh - - curl -H 'Content-Type: application/json' -u 'username/password' 'https://<>/_plugins/_replication/autofollow_stats?pretty' - { - "num_success_start_replication" : 2, - "num_failed_start_replication" : 0, - "num_failed_leader_calls" : 0, - "failed_indices" : [ ], - "autofollow_stats" : [ - { - "name" : "autofollow-rule", - "pattern" : "posts*", - "num_success_start_replication" : 2, - "num_failed_start_replication" : 0, - "num_failed_leader_calls" : 0, - "failed_indices" : [ ], - "last_execution_time" : 1737699113927 - } - ] - } - - e. Next, set up replication for the other indices: - - .. code-block:: sh - - curl -XPUT -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/channels/_start?pretty' -d ' - { - "leader_alias": "", - "leader_index": "channels", - "use_roles":{ - "leader_cluster_role": "all_access", - "follower_cluster_role": "all_access" - } - }' - - curl -XPUT -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/users/_start?pretty' -d ' - { - "leader_alias": "", - "leader_index": "users", - "use_roles":{ - "leader_cluster_role": "all_access", - "follower_cluster_role": "all_access" - } - }' - - curl -XPUT -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/files/_start?pretty' -d ' - { - "leader_alias": "", - "leader_index": "files", - "use_roles":{ - "leader_cluster_role": "all_access", - "follower_cluster_role": "all_access" - } - }' - - f. Check the status of the replication rules: - - .. 
code-block:: sh - - curl -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/channels/_status?pretty' - curl -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/files/_status?pretty' - curl -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/users/_status?pretty' - curl -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/posts_/_status?pretty' - curl -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/posts_/_status?pretty' - Sample output: - { - "status" : "SYNCING", - "reason" : "User initiated", - "leader_alias" : "", - "leader_index" : "", - "follower_index" : "", - "syncing_details" : { - "leader_checkpoint" : 16, - "follower_checkpoint" : 16, - "seq_no" : 17 - } - } - - g. Check for indices. You should be able to see all the indices from the primary domain in the secondary domain: - - .. code-block:: sh - - curl -s -u ':' 'https:///_cat/indices?pretty' - -Replicate job servers -^^^^^^^^^^^^^^^^^^^^^ - -If the job scheduler is left running in the secondary region, it will pick up jobs and start running them. Therefore, set ``JobSettings.RunScheduler`` to ``false`` on all nodes in the secondary region. When a failover happens, you need to enable it for the new primary region, and deactivate it for the new secondary region. - -Test the secondary region -^^^^^^^^^^^^^^^^^^^^^^^^^ - -With the above steps complete, you have a fully functioning secondary region. You can replicate the same setup of nodes and a proxy server like the primary region. The app nodes in the secondary region won’t be able to come up the first time because Mattermost will try to run some DDL statements which are not allowed with write-forwarding. So it will be stuck in a loop trying to connect. Once you fail over the region, it will start working. The primary region will still be readable, and any periodic writes will be forwarded to the secondary (now primary). - -.. 
warning:: - - Ensure you have separate ``ClusterNames`` for the different clusters in two regions to use the same database across 2 clusters. - -Failover RDS to secondary -^^^^^^^^^^^^^^^^^^^^^^^^^ - -To perform the failover, go to the RDS global cluster, and under **Actions**, select **Switchover or Failover global database**, and then select **switchover** to switch over without any data loss (which will take more time to complete). Alternatively, you can choose **failover** for a quicker failover at the expense of data-loss. If the entire region is unavailable anyways, then **failover** is no worse than **switchover**. - -After this is done, the app nodes which were stuck trying to connect should move forward and everything should be functional. You can read/write, upload images and everything should be replicated. Everything except OpenSearch data. - -Failover ES/OS to secondary -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -ES/OS does not allow multi-writer for a single index. You can only write to 1 index at one time. Therefore, you need to perform some manual steps to reverse the replication direction, and start replicating from secondary to primary. - -For simplicity, let’s say ``site1`` is primary, and ``site2`` is secondary. Therefore, OS in ``site1`` is the leader domain, and in ``site2`` is the follower. The follower pulls from the leader. To switch the direction where ``site2`` becomes leader, and ``site1`` becomes follower. - -1. Remove the rule from ``site1`` > ``site 2`` in AWS Console. This will auto-pause the replication, but the indices in ``site2`` will still be read-only. Remove the replication rules for that. - -2. Remove auto-follow rule: - - .. code-block:: sh - - curl -XDELETE -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/_autofollow?pretty' -d ' - { - "leader_alias" : "", - "name": "autofollow-rule" - }' - -3. Check the status of the auto-follow rule as mentioned before. - -4. Remove replication rules: - - .. 
code-block:: sh - - curl -XPOST -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/channels/_stop?pretty' -d '{}' - curl -XPOST -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/files/_stop?pretty' -d '{}' - curl -XPOST -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/users/_stop?pretty' -d '{}' - -5. Check the status of replication rules as mentioned before. - -6. Now indices will become writable - -7. Add rule from ``site2`` > ``site1`` in AWS console. - -8. In ``site1``, make all the indices as followers. You must delete all indices first: - - .. code-block:: sh - - curl -XDELETE -u ':' 'https:///posts*?pretty' - curl -XDELETE -u ':' 'https:///channels?pretty' - curl -XDELETE -u ':' 'https:///files?pretty' - curl -XDELETE -u ':' 'https:///users?pretty' - -9. Refresh indices: - - .. code-block:: sh - - curl -XPOST -u ':' 'https:///_refresh?pretty' - -10. Confirm that everything is deleted: - - .. code-block:: sh - - curl -s -u ':' 'https:///_cat/indices?pretty' - -11. Add the auto-follow rule add replication rules. Follow the same steps as before. - -12. List the indices again to confirm that replication has started and indices are available. - -S3 bucket is auto-replicated both ways -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -There’s nothing you need to do to ensure the S3 bucket is auto-replicating both ways. - -Testing end to end -^^^^^^^^^^^^^^^^^^^ - -Once the failover has happened, and the ES/OS replication direction has been swapped, the new site can be used normally. - -This becomes the final architecture: - -.. image:: ../images/dr3.png - :alt: A diagram showing the final architecture with Mattermost set up in 2 data centers. - -You can use DNS to easily switch between PRIMARY to SECONDARY during a failover. - -.. tip:: - Websockets will still point to the old data center even if you have switched DNS. 
You need to roll over each app node gradually to move those connections to the new data center. If all your nodes are down, no action is necessary and the clients will automatically re-connect to the new data center. - -The S3 bucket is replicated bi-directionally while the database and ES/OS is replicated uni-directionally. + Active/passive DR on AWS Failover from Single Sign-On outage ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/source/deployment-guide/disaster-recovery-aws.rst b/source/deployment-guide/disaster-recovery-aws.rst new file mode 100644 index 00000000000..20a06169d2d --- /dev/null +++ b/source/deployment-guide/disaster-recovery-aws.rst @@ -0,0 +1,358 @@ +Active/passive DR deployment on AWS +===================================== + +.. include:: ../_static/badges/all-commercial.rst + :start-after: :nosearch: + +Enterprise customers who use Mattermost for mission-critical operations must ensure continuous availability and operational resilience. A robust disaster recovery strategy is essential to mitigate risks associated with data center failures, ensuring that users can access Mattermost seamlessly, even in the event of unexpected outages. + +This page details the steps needed to set up Mattermost in an active/passive disaster recovery configuration on AWS, and how to fail over from one data center to another. + +.. tip:: + + To learn how to safely upgrade your deployment in Kubernetes for High Availability and Active/Active support, see the :doc:`Upgrading Mattermost in Kubernetes and High Availability Environments ` documenation. + +Set up in one data center +-------------------------- + +As a first step, set up Mattermost in a single data center. At a very basic high level, this would be something like below: + +.. image:: ../images/dr1.png + :alt: An architecture diagram showing a single proxy that's forwarding traffic to 2 nodes, a database with single writer + n readers, and an S3 bucket and ES/OS using AWS OpenSearch service. 
+ +The diagram above has a single proxy, forwarding traffic to 2 nodes. There's also a database with single writer + n readers and an S3 bucket and ES/OS using AWS OpenSearch service. + +At this stage, we are ignoring other details like LDAP/SAML, SMTP etc. + +.. tip:: + The following architecture would be implemented when an entire region goes down. It does not cover the case when a single server/service goes down. For example: + + - If a single app node goes down, follow best practices to provision a new node. + - If a database replica node goes down, create a new replica from AWS console. Or set a policy to do so automatically. + +Replicate database +------------------ + +The next tasks include creating a global AWS Cluster. + +1. Select the RDS instance in the AWS Console, and expand the **Actions** menu to select **Add AWS Region**. + +2. Choose the secondary region and enter the other details. + +.. warning:: + + Select the **Enable write forwarding** option on the secondary cluster to help forward write operations from secondary to primary. See the `AmazonRDS write forwarding `_ documentation for details. + + Also verify the PostgreSQL version and ensure it allows ``write forwarding``. Not all PostgreSQL versions allow it. See the `Amazon RDS write forwarding region and version availability `_ documentation for details. + +You should now have a global cluster with the primary cluster in ``us-west-1``, and the secondary cluster in ``us-east-1``: + +.. image:: ../images/dr2.png + :alt: A screenshot of the AWS console with a global RDS cluster where the primary cluster is us-west-1 and the secondary cluster is us-east-1. + +Replicate S3 bucket +-------------------- + +1. Create a new S3 bucket in the secondary region. + +2. Back in the original bucket, go to the **Properties** tab, and enable **Bucket versioning**. + +3. Go to the **Management** tab, scroll down to **Replication Rules**, and create a new replication rule. + +4. 
In the rule, select the source bucket, and then choose **Apply to all objects in the bucket** to replicate everything in the bucket. + +5. Choose the destination bucket. + +6. For the IAM role, select **Create new role**. + +.. warning:: + + Select the **Replica modification sync** option for the bucket to help keep the replica and source buckets in sync with each other. + +7. Select **Save**. + +8. Select **Yes** when prompted to start a job to replicate any existing objects to the secondary bucket or not. + +9. Perform these same steps on the secondary bucket. + +Now you have bi-directional replication working between these S3 replica and source buckets. + +Replicate ES/OS storage +------------------------ + +1. To replicate ES/OS storage, set up CCR (cross-cluster replication) for AWS OpenSearch with the following requirements: + + - Elasticsearch 7.10 or OpenSearch 2.x + - Fine-grained access control enabled + - Node-to-node encryption enabled + +.. tip:: + + All you need is a recent OpenSearch version with fine-grained access control enabled. Node-to-node encryption is automatically enabled once you enable fine-grained access control. + +2. You also need to add the ``CrossClusterGet`` permission on the IAM policy for the OS cluster set under the **Security Configuration** tab for your OS domain. We recommend the following as per AWS, but feel free to fine-tune as necessary: + + .. code-block:: sh + + { + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { + "AWS": "*" + }, + "Action": "es:ESHttp*", + "Resource": "arn:aws:es:::domain//*" + }, + { + "Effect": "Allow", + "Principal": { + "AWS": "*" + }, + "Action": "es:ESCrossClusterGet", + "Resource": "arn:aws:es:::domain/" + } + ] + } + +To recap: + +- Use OpenSearch 2.x. +- Enable fine-grained access control. +- Create the master user, and note the server credentials. +- Set the IAM policy as above. + +.. 
warning:: + + After creating the master user, IP based access to the OS might not work from Mattermost application nodes. You may need to update the ``ElasticSearchSettings`` section in ``config.json`` to update the server :ref:`username ` and :ref:`password `. + +3. Create a new OS cluster in the secondary region. Follow the same steps again for this cluster. + + .. warning:: + + At this stage, ensure that you have all indices populated with data in the primary region. Run a bulk index to do that if you haven't already. + +4. Begin replication from the primary to secondary region. + + a. First, create a connection from secondary to primary. Note that replication in OS works in a "pull" model, so the secondary site pulls data from the primary. + + b. In the Amazon OpenSearch Service console, select the secondary domain, go to the **Connections** tab, and choose **Request**. + + c. For **Connection alias**, enter a name for your connection. + + d. Choose **connect to a domain in another AWS account or region**, and enter the **ARN** of the primary domain. + + e. Select **Request** to send a permission request to the primary domain. + + f. Open the primary domain to see and accept the incoming request under the **Connections** tab. + +5. Now set up the replication rules for indices. + + a. SSH into an app node in the secondary region to set up an auto-follow rule for the ``posts*`` indices because of the daily naming scheme and monthly aggregation. + + b. For the other indices, replicate each of them. You can also set up a rule with ``*`` to replicate everything, but that would also include the hidden and system indices which you don't want. + + c. Set up the auto-follow for ``posts*`` indices: + + .. 
code-block:: sh + + curl -XPOST -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/_autofollow?pretty' -d ' + { + "leader_alias" : "", + "name": "autofollow-rule", + "pattern": "posts*", + "use_roles":{ + "leader_cluster_role": "all_access", + "follower_cluster_role": "all_access" + } + }' + + d. Check the status of the auto-follow rule: + + .. code-block:: sh + + curl -H 'Content-Type: application/json' -u 'username/password' 'https://<>/_plugins/_replication/autofollow_stats?pretty' + { + "num_success_start_replication" : 2, + "num_failed_start_replication" : 0, + "num_failed_leader_calls" : 0, + "failed_indices" : [ ], + "autofollow_stats" : [ + { + "name" : "autofollow-rule", + "pattern" : "posts*", + "num_success_start_replication" : 2, + "num_failed_start_replication" : 0, + "num_failed_leader_calls" : 0, + "failed_indices" : [ ], + "last_execution_time" : 1737699113927 + } + ] + } + + e. Next, set up replication for the other indices: + + .. code-block:: sh + + curl -XPUT -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/channels/_start?pretty' -d ' + { + "leader_alias": "", + "leader_index": "channels", + "use_roles":{ + "leader_cluster_role": "all_access", + "follower_cluster_role": "all_access" + } + }' + + curl -XPUT -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/users/_start?pretty' -d ' + { + "leader_alias": "", + "leader_index": "users", + "use_roles":{ + "leader_cluster_role": "all_access", + "follower_cluster_role": "all_access" + } + }' + + curl -XPUT -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/files/_start?pretty' -d ' + { + "leader_alias": "", + "leader_index": "files", + "use_roles":{ + "leader_cluster_role": "all_access", + "follower_cluster_role": "all_access" + } + }' + + f. Check the status of the replication rules: + + .. 
code-block:: sh + + curl -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/channels/_status?pretty' + curl -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/files/_status?pretty' + curl -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/users/_status?pretty' + curl -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/posts_/_status?pretty' + curl -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/posts_/_status?pretty' + Sample output: + { + "status" : "SYNCING", + "reason" : "User initiated", + "leader_alias" : "", + "leader_index" : "", + "follower_index" : "", + "syncing_details" : { + "leader_checkpoint" : 16, + "follower_checkpoint" : 16, + "seq_no" : 17 + } + } + + g. Check for indices. You should be able to see all the indices from the primary domain in the secondary domain: + + .. code-block:: sh + + curl -s -u ':' 'https:///_cat/indices?pretty' + +Replicate job servers +---------------------- + +If the job scheduler is left running in the secondary region, it will pick up jobs and start running them. Therefore, set ``JobSettings.RunScheduler`` to ``false`` on all nodes in the secondary region. When a failover happens, you need to enable it for the new primary region, and deactivate it for the new secondary region. + +Test the secondary region +-------------------------- + +With the above steps complete, you have a fully functioning secondary region. You can replicate the same setup of nodes and a proxy server like the primary region. The app nodes in the secondary region won't be able to come up the first time because Mattermost will try to run some DDL statements which are not allowed with write-forwarding. So it will be stuck in a loop trying to connect. Once you fail over the region, it will start working. The primary region will still be readable, and any periodic writes will be forwarded to the secondary (now primary). + +.. 
warning:: + + Ensure you have separate ``ClusterNames`` for the different clusters in two regions to use the same database across 2 clusters. + +Failover RDS to secondary +-------------------------- + +To perform the failover, go to the RDS global cluster, and under **Actions**, select **Switchover or Failover global database**, and then select **switchover** to switch over without any data loss (which will take more time to complete). Alternatively, you can choose **failover** for a quicker failover at the expense of data-loss. If the entire region is unavailable anyways, then **failover** is no worse than **switchover**. + +After this is done, the app nodes which were stuck trying to connect should move forward and everything should be functional. You can read/write, upload images and everything should be replicated. Everything except OpenSearch data. + +Failover ES/OS to secondary +----------------------------- + +ES/OS does not allow multi-writer for a single index. You can only write to 1 index at one time. Therefore, you need to perform some manual steps to reverse the replication direction, and start replicating from secondary to primary. + +For simplicity, let's say ``site1`` is primary, and ``site2`` is secondary. Therefore, OS in ``site1`` is the leader domain, and in ``site2`` is the follower. The follower pulls from the leader. To switch the direction where ``site2`` becomes leader, and ``site1`` becomes follower. + +1. Remove the rule from ``site1`` > ``site 2`` in AWS Console. This will auto-pause the replication, but the indices in ``site2`` will still be read-only. Remove the replication rules for that. + +2. Remove auto-follow rule: + + .. code-block:: sh + + curl -XDELETE -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/_autofollow?pretty' -d ' + { + "leader_alias" : "", + "name": "autofollow-rule" + }' + +3. Check the status of the auto-follow rule as mentioned before. + +4. Remove replication rules: + + .. 
code-block:: sh + + curl -XPOST -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/channels/_stop?pretty' -d '{}' + curl -XPOST -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/files/_stop?pretty' -d '{}' + curl -XPOST -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/users/_stop?pretty' -d '{}' + +5. Check the status of replication rules as mentioned before. + +6. The indices will now be writable. + +7. Add rule from ``site2`` > ``site1`` in AWS console. + +8. In ``site1``, make all the indices followers. You must delete all indices first: + + .. code-block:: sh + + curl -XDELETE -u ':' 'https:///posts*?pretty' + curl -XDELETE -u ':' 'https:///channels?pretty' + curl -XDELETE -u ':' 'https:///files?pretty' + curl -XDELETE -u ':' 'https:///users?pretty' + +9. Refresh indices: + + .. code-block:: sh + + curl -XPOST -u ':' 'https:///_refresh?pretty' + +10. Confirm that everything is deleted: + + .. code-block:: sh + + curl -s -u ':' 'https:///_cat/indices?pretty' + +11. Add the auto-follow rule and replication rules. Follow the same steps as before. + +12. List the indices again to confirm that replication has started and indices are available. + +S3 bucket is auto-replicated both ways +---------------------------------------- + +There's nothing you need to do to ensure the S3 bucket is auto-replicating both ways. + +Testing end to end +------------------- + +Once the failover has happened and the ES/OS replication direction has been swapped, the new site can be used normally. + +This becomes the final architecture: + +.. image:: ../images/dr3.png + :alt: A diagram showing the final architecture with Mattermost set up in 2 data centers. + +You can use DNS to easily switch between PRIMARY and SECONDARY during a failover. + +.. tip:: + Websockets will still point to the old data center even if you have switched DNS. 
You need to roll over each app node gradually to move those connections to the new data center. If all your nodes are down, no action is necessary and the clients will automatically re-connect to the new data center. + +The S3 bucket is replicated bi-directionally while the database and ES/OS is replicated uni-directionally. From dde9aabc38680225f7ab03c4b90c54c06b4bf655 Mon Sep 17 00:00:00 2001 From: Ben Schumacher Date: Wed, 15 Apr 2026 13:49:11 +0200 Subject: [PATCH 3/9] docs: apply PR review feedback on disaster recovery pages MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Fix typo: "documenation" → "documentation" - Fix curl credentials: "username/password" → ":" and empty host placeholder - Remove duplicate posts_ status curl command - Change IAM policy code block language from sh to json - Add prerequisites section to disaster-recovery-aws.rst - Wrap HA vs DR explanation in a note directive Co-Authored-By: Claude Sonnet 4.6 --- .../backup-disaster-recovery.rst | 8 +++++--- .../deployment-guide/disaster-recovery-aws.rst | 17 +++++++++++++---- 2 files changed, 18 insertions(+), 7 deletions(-) diff --git a/source/deployment-guide/backup-disaster-recovery.rst b/source/deployment-guide/backup-disaster-recovery.rst index 1b8aba84234..0466d54f91d 100644 --- a/source/deployment-guide/backup-disaster-recovery.rst +++ b/source/deployment-guide/backup-disaster-recovery.rst @@ -34,11 +34,13 @@ Disaster recovery An appropriate disaster recovery plan weighs the benefits of mitigating specific risks against the cost and complexity of setting up disaster recovery infrastructure and automation. -**High availability (HA) vs. disaster recovery (DR)** +.. note:: + + **High availability (HA) vs. disaster recovery (DR)** -HA and DR are distinct concepts that are often confused. 
HA refers to a clustered deployment within a single site that eliminates single points of failure and keeps Mattermost running through individual component outages (e.g., a failed app node or database replica). DR addresses the broader scenario of an entire site or region becoming unavailable, and typically requires a secondary deployment in a separate data center or cloud region. + HA and DR are distinct concepts that are often confused. HA refers to a clustered deployment within a single site that eliminates single points of failure and keeps Mattermost running through individual component outages (e.g., a failed app node or database replica). DR addresses the broader scenario of an entire site or region becoming unavailable, and typically requires a secondary deployment in a separate data center or cloud region. -Mattermost supports active/passive DR, where a secondary site is kept in sync but only activated during a failover. Mattermost does not support active/active deployments, where both sites serve live traffic simultaneously. + Mattermost supports active/passive DR, where a secondary site is kept in sync but only activated during a failover. Mattermost does not support active/active deployments, where both sites serve live traffic simultaneously. Automated backup ~~~~~~~~~~~~~~~~ diff --git a/source/deployment-guide/disaster-recovery-aws.rst b/source/deployment-guide/disaster-recovery-aws.rst index 20a06169d2d..73a00e56cbb 100644 --- a/source/deployment-guide/disaster-recovery-aws.rst +++ b/source/deployment-guide/disaster-recovery-aws.rst @@ -4,13 +4,23 @@ Active/passive DR deployment on AWS .. 
include:: ../_static/badges/all-commercial.rst :start-after: :nosearch: +Before you begin, ensure the following are in place: + +- AWS account access with IAM permissions to manage RDS, S3, and OpenSearch resources in both regions +- A chosen primary and secondary AWS region pair for failover +- An existing, healthy Mattermost primary deployment +- DNS control over the domain used to reach Mattermost, so you can redirect traffic during failover +- Required RDS Aurora PostgreSQL global cluster permissions and a verified, restorable database backup +- OpenSearch 2.x with fine-grained access control available in both regions +- Network connectivity between the primary and secondary regions confirmed + Enterprise customers who use Mattermost for mission-critical operations must ensure continuous availability and operational resilience. A robust disaster recovery strategy is essential to mitigate risks associated with data center failures, ensuring that users can access Mattermost seamlessly, even in the event of unexpected outages. This page details the steps needed to set up Mattermost in an active/passive disaster recovery configuration on AWS, and how to fail over from one data center to another. .. tip:: - To learn how to safely upgrade your deployment in Kubernetes for High Availability and Active/Active support, see the :doc:`Upgrading Mattermost in Kubernetes and High Availability Environments ` documenation. + To learn how to safely upgrade your deployment in Kubernetes for High Availability and Active/Active support, see the :doc:`Upgrading Mattermost in Kubernetes and High Availability Environments ` documentation. Set up in one data center -------------------------- @@ -92,7 +102,7 @@ Replicate ES/OS storage 2. You also need to add the ``CrossClusterGet`` permission on the IAM policy for the OS cluster set under the **Security Configuration** tab for your OS domain. We recommend the following as per AWS, but feel free to fine-tune as necessary: - .. 
code-block:: sh + .. code-block:: json { "Version": "2012-10-17", @@ -172,7 +182,7 @@ To recap: .. code-block:: sh - curl -H 'Content-Type: application/json' -u 'username/password' 'https://<>/_plugins/_replication/autofollow_stats?pretty' + curl -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/autofollow_stats?pretty' { "num_success_start_replication" : 2, "num_failed_start_replication" : 0, @@ -233,7 +243,6 @@ To recap: curl -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/files/_status?pretty' curl -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/users/_status?pretty' curl -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/posts_/_status?pretty' - curl -H 'Content-Type: application/json' -u ':' 'https:///_plugins/_replication/posts_/_status?pretty' Sample output: { "status" : "SYNCING", From 995626c8774995dac79627d021ede41da263f913 Mon Sep 17 00:00:00 2001 From: Ben Schumacher Date: Wed, 15 Apr 2026 13:55:05 +0200 Subject: [PATCH 4/9] docs: demote SSO outage sub-sections to child headings Change the three SSO failover sub-sections from ~~~~ to ^^^^^ so they render as children of "Failover from Single Sign-On outage" in the sidebar TOC rather than at the same level. Co-Authored-By: Claude Sonnet 4.6 --- source/deployment-guide/backup-disaster-recovery.rst | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/source/deployment-guide/backup-disaster-recovery.rst b/source/deployment-guide/backup-disaster-recovery.rst index 0466d54f91d..f61ba782317 100644 --- a/source/deployment-guide/backup-disaster-recovery.rst +++ b/source/deployment-guide/backup-disaster-recovery.rst @@ -79,22 +79,22 @@ When using Single Sign-on with Mattermost Enterprise Edition an outage to your S In each case, the user cannot reach the SSO provider, and cannot log in. 
In this case, there are several potential mitigations: -Configure your SSO provider for High Availability -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Configure your SSO provider for High Availability +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If you're using a self-hosted Single Sign-on provider, several options are available for `High Availability configurations that protect your system from unplanned outages `_. For SaaS-based authentication providers, while you still have a dependency on service uptime, you can set up redundancy in source systems from which data is being pulled. For example, with the OneLogin SaaS-based authentication service, you can set up High Availability LDAP connectivity to further reduce the chances of an outage. -Set up your own IDP to provide an automated or manual SSO failover option -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Set up your own IDP to provide an automated or manual SSO failover option +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Create a custom Identity Provider for SAML authentication that connects to both an active and a standby authentication option, that can be manually or automatically switched in case of an outage. In this configuration, security should be carefully reviewed to prevent the standby SSO option from weakening your authentication protocols. -Set up a manual failover plan for SSO outages -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Set up a manual failover plan for SSO outages +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ When users are unable to reach your organization's SSO provider during an outage, an error message directing them to contact your support link (defined in your System Console settings) is displayed. 
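For context on the heading demotion above: reStructuredText has no fixed mapping from underline characters to heading levels — docutils assigns levels in the order each character first appears in a file. In this file ``~~~~`` already marks the level of "Failover from Single Sign-On outage", so re-underlining the three mitigation headings with the not-yet-used ``^^^^`` nests them one level deeper. A minimal sketch of the resulting hierarchy, using titles from the page:

```rst
Failover from Single Sign-On outage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Configure your SSO provider for High Availability
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```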
From f0f43c5bd769dafe2e4663c4f140c76884ba2f22 Mon Sep 17 00:00:00 2001 From: Ben Schumacher Date: Mon, 27 Apr 2026 10:27:32 +0200 Subject: [PATCH 5/9] docs: address ewwollesen review feedback on AWS DR guide MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Reword awkward "at a very basic high level" sentence (line 28) - Clarify OpenSearch tip to be scoped to OpenSearch 2.x users - Fix "site 2" → "site2" consistency (line 294) - Convert single-sentence S3 section heading to a note block - Add link to RunScheduler config docs for operators who need guidance - Add "Restore to primary data center" section for post-event recovery Co-Authored-By: Claude Sonnet 4.6 --- .../disaster-recovery-aws.rst | 28 ++++++++++++++----- 1 file changed, 21 insertions(+), 7 deletions(-) diff --git a/source/deployment-guide/disaster-recovery-aws.rst b/source/deployment-guide/disaster-recovery-aws.rst index 73a00e56cbb..6e6747aa71d 100644 --- a/source/deployment-guide/disaster-recovery-aws.rst +++ b/source/deployment-guide/disaster-recovery-aws.rst @@ -25,7 +25,7 @@ This page details the steps needed to set up Mattermost in an active/passive dis Set up in one data center -------------------------- -As a first step, set up Mattermost in a single data center. At a very basic high level, this would be something like below: +As a first step, set up Mattermost in a single data center. The following diagram illustrates a basic single data center architecture: .. image:: ../images/dr1.png :alt: An architecture diagram showing a single proxy that's forwarding traffic to 2 nodes, a database with single writer + n readers, and an S3 bucket and ES/OS using AWS OpenSearch service. @@ -98,7 +98,7 @@ Replicate ES/OS storage .. tip:: - All you need is a recent OpenSearch version with fine-grained access control enabled. Node-to-node encryption is automatically enabled once you enable fine-grained access control. 
+ If you are already running OpenSearch 2.x, all you need to do is enable fine-grained access control — node-to-node encryption is enabled automatically once fine-grained access control is turned on. 2. You also need to add the ``CrossClusterGet`` permission on the IAM policy for the OS cluster set under the **Security Configuration** tab for your OS domain. We recommend the following as per AWS, but feel free to fine-tune as necessary: @@ -266,7 +266,7 @@ To recap: Replicate job servers ---------------------- -If the job scheduler is left running in the secondary region, it will pick up jobs and start running them. Therefore, set ``JobSettings.RunScheduler`` to ``false`` on all nodes in the secondary region. When a failover happens, you need to enable it for the new primary region, and deactivate it for the new secondary region. +If the job scheduler is left running in the secondary region, it will pick up jobs and start running them. Therefore, set ``JobSettings.RunScheduler`` to ``false`` on all nodes in the secondary region. When a failover happens, you need to enable it for the new primary region, and deactivate it for the new secondary region. See the :ref:`RunScheduler configuration setting ` documentation for details. Test the secondary region -------------------------- @@ -291,7 +291,7 @@ ES/OS does not allow multi-writer for a single index. You can only write to 1 in For simplicity, let's say ``site1`` is primary, and ``site2`` is secondary. Therefore, OS in ``site1`` is the leader domain, and in ``site2`` is the follower. The follower pulls from the leader. To switch the direction where ``site2`` becomes leader, and ``site1`` becomes follower. -1. Remove the rule from ``site1`` > ``site 2`` in AWS Console. This will auto-pause the replication, but the indices in ``site2`` will still be read-only. Remove the replication rules for that. +1. Remove the rule from ``site1`` > ``site2`` in AWS Console. 
This will auto-pause the replication, but the indices in ``site2`` will still be read-only. Remove the replication rules for that. 2. Remove auto-follow rule: @@ -344,10 +344,9 @@ For simplicity, let's say ``site1`` is primary, and ``site2`` is secondary. Ther 12. List the indices again to confirm that replication has started and indices are available. -S3 bucket is auto-replicated both ways ----------------------------------------- +.. note:: -There's nothing you need to do to ensure the S3 bucket is auto-replicating both ways. + There's nothing you need to do to ensure the S3 bucket is auto-replicating both ways. Testing end to end ------------------- @@ -365,3 +364,18 @@ You can use DNS to easily switch between PRIMARY to SECONDARY during a failover. Websockets will still point to the old data center even if you have switched DNS. You need to roll over each app node gradually to move those connections to the new data center. If all your nodes are down, no action is necessary and the clients will automatically re-connect to the new data center. The S3 bucket is replicated bi-directionally while the database and ES/OS is replicated uni-directionally. + +Restore to primary data center +-------------------------------- + +When the disaster event is resolved and you are ready to restore normal operations, perform the same failover steps in reverse to return traffic to the original primary data center: + +1. Perform an RDS switchover back to the original primary region using the **Switchover or Failover global database** option in the RDS console. + +2. Reverse the ES/OS replication direction by following the same steps in the `Failover ES/OS to secondary`_ section, swapping the roles of ``site1`` and ``site2``. + +3. Update DNS to redirect traffic back to the original primary data center. + +4. Re-enable ``JobSettings.RunScheduler`` on the original primary nodes and disable it on the secondary nodes. + +5. 
Roll over app nodes gradually to move websocket connections back to the primary data center. From 593c1f34defab5a97a518218638e219cd932b7c1 Mon Sep 17 00:00:00 2001 From: Ben Schumacher Date: Mon, 27 Apr 2026 10:31:57 +0200 Subject: [PATCH 6/9] =?UTF-8?q?docs:=20fix=20ElasticSearchSettings=20typo?= =?UTF-8?q?=20=E2=86=92=20ElasticsearchSettings?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Sonnet 4.6 --- source/deployment-guide/disaster-recovery-aws.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/deployment-guide/disaster-recovery-aws.rst b/source/deployment-guide/disaster-recovery-aws.rst index 6e6747aa71d..c6ee04db4b3 100644 --- a/source/deployment-guide/disaster-recovery-aws.rst +++ b/source/deployment-guide/disaster-recovery-aws.rst @@ -135,7 +135,7 @@ To recap: .. warning:: - After creating the master user, IP based access to the OS might not work from Mattermost application nodes. You may need to update the ``ElasticSearchSettings`` section in ``config.json`` to update the server :ref:`username ` and :ref:`password `. + After creating the master user, IP based access to the OS might not work from Mattermost application nodes. You may need to update the ``ElasticsearchSettings`` section in ``config.json`` to update the server :ref:`username ` and :ref:`password `. 3. Create a new OS cluster in the secondary region. Follow the same steps again for this cluster. 
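For reference, the key corrected in the patch above must match the section name in ``config.json`` exactly. A minimal sketch of the relevant fragment — the endpoint and master-user values are placeholders, and the real ``ElasticsearchSettings`` section contains additional keys beyond those shown:

```json
{
  "ElasticsearchSettings": {
    "ConnectionUrl": "https://<os-domain-endpoint>",
    "Username": "<master-username>",
    "Password": "<master-password>"
  }
}
```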
From e5ebd13ed9a954a4b04980a27fc771ba7ccc93f7 Mon Sep 17 00:00:00 2001 From: Ben Schumacher Date: Mon, 27 Apr 2026 10:32:59 +0200 Subject: [PATCH 7/9] docs: expand job scheduler section into numbered procedure Co-Authored-By: Claude Sonnet 4.6 --- source/deployment-guide/disaster-recovery-aws.rst | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/source/deployment-guide/disaster-recovery-aws.rst b/source/deployment-guide/disaster-recovery-aws.rst index c6ee04db4b3..b4912ec84fb 100644 --- a/source/deployment-guide/disaster-recovery-aws.rst +++ b/source/deployment-guide/disaster-recovery-aws.rst @@ -266,7 +266,15 @@ To recap: Replicate job servers ---------------------- -If the job scheduler is left running in the secondary region, it will pick up jobs and start running them. Therefore, set ``JobSettings.RunScheduler`` to ``false`` on all nodes in the secondary region. When a failover happens, you need to enable it for the new primary region, and deactivate it for the new secondary region. See the :ref:`RunScheduler configuration setting ` documentation for details. +If the job scheduler is left running in the secondary region, it will pick up jobs and start running them. Follow these steps to manage it correctly. See the :ref:`RunScheduler configuration setting ` documentation for details. + +1. **Precondition:** Set ``JobSettings.RunScheduler`` to ``false`` on all nodes in the secondary region before enabling that region. + +2. **On failover:** Set ``JobSettings.RunScheduler`` to ``true`` on all nodes in the new primary region. + +3. **Immediately after:** Set ``JobSettings.RunScheduler`` to ``false`` on all nodes in the new secondary region. + +4. **Verify:** Confirm that jobs execute only in the active region. Submit a test job (for example, trigger an index rebuild) and check the job logs to ensure it runs on the new primary, not the secondary. 
Test the secondary region -------------------------- From 1ef6601dba2448ad054070c54087a74ec9b22fd6 Mon Sep 17 00:00:00 2001 From: Ben Schumacher Date: Mon, 4 May 2026 10:00:27 +0200 Subject: [PATCH 8/9] docs: fix grammar in SSO outage section per ewwollesen review - Add missing comma after "Mattermost Enterprise Edition" - Remove spurious "to" from "continue to using" Co-Authored-By: Claude Sonnet 4.6 --- source/deployment-guide/backup-disaster-recovery.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/source/deployment-guide/backup-disaster-recovery.rst b/source/deployment-guide/backup-disaster-recovery.rst index f61ba782317..860577deeb2 100644 --- a/source/deployment-guide/backup-disaster-recovery.rst +++ b/source/deployment-guide/backup-disaster-recovery.rst @@ -67,11 +67,11 @@ For step-by-step instructions on setting up Mattermost in an active/passive DR c Failover from Single Sign-On outage ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -When using Single Sign-on with Mattermost Enterprise Edition an outage to your SSO provider can cause a partial outage on your Mattermost instance. +When using Single Sign-on with Mattermost Enterprise Edition, an outage to your SSO provider can cause a partial outage on your Mattermost instance. **What happens during an SSO outage?** -- **Most people can still log in.** By default, when a user logs in to Mattermost they receive a session token lasting 30 days (the duration can be configured in the System Console). During an SSO outage, users with valid session tokens can continue to using Mattermost uninterrupted. +- **Most people can still log in.** By default, when a user logs in to Mattermost they receive a session token lasting 30 days (the duration can be configured in the System Console). During an SSO outage, users with valid session tokens can continue using Mattermost uninterrupted. 
- **Some people can't log in.** During an SSO outage, there are two situations under which a user cannot log in: * Users whose session token expires during the outage. From ebbb2c514b3b9c241bc31dbebe043f9c9b6e6315 Mon Sep 17 00:00:00 2001 From: Ben Schumacher Date: Mon, 4 May 2026 10:41:00 +0200 Subject: [PATCH 9/9] docs: apply remaining ewwollesen suggestions on backup-disaster-recovery.rst MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - "1 of" → "one of" in backup step 3 - "In this case, there are several potential mitigations:" → "In either case, several mitigations are available:" - Clean up SSO outage sentence: remove "issue", add "their email and", fix comma placement Co-Authored-By: Claude Sonnet 4.6 --- source/deployment-guide/backup-disaster-recovery.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/source/deployment-guide/backup-disaster-recovery.rst b/source/deployment-guide/backup-disaster-recovery.rst index 860577deeb2..6d74688bebd 100644 --- a/source/deployment-guide/backup-disaster-recovery.rst +++ b/source/deployment-guide/backup-disaster-recovery.rst @@ -17,7 +17,7 @@ To back up your Mattermost server: 2. Back up your server settings stored in ``config/config.json``. If you are using SAML configuration for Mattermost, your SAML certificate files will be saved in the ``config`` directory. Therefore, we recommend backing up the entire directory. -3. Back up files stored by your users with 1 of the following options: +3. Back up files stored by your users with one of the following options: - If you use local storage using the default ``./data`` directory, back up this directory. - If you use local storage using a non-default directory specified in the ``Directory`` setting in ``config.json``, back up files in that location. @@ -77,7 +77,7 @@ When using Single Sign-on with Mattermost Enterprise Edition, an outage to your * Users whose session token expires during the outage. 
* Users trying to log in to new devices. -In each case, the user cannot reach the SSO provider, and cannot log in. In this case, there are several potential mitigations: +In each case, the user cannot reach the SSO provider, and cannot log in. In either case, several mitigations are available: Configure your SSO provider for High Availability ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -98,6 +98,6 @@ Set up a manual failover plan for SSO outages When users are unable to reach your organization's SSO provider during an outage, an error message directing them to contact your support link (defined in your System Console settings) is displayed. -Once IT is contacted about an SSO outage issue, they can temporarily change a user's account from SSO to email-password using the System Console, and the end user can use password to claim the account, until the SSO outage is over and the account can be converted back to SSO. +Once IT is contacted about an SSO outage, they can temporarily change a user's account from SSO to email-password using the System Console, and the end user can use their email and password to claim the account until the SSO outage is over and the account can be converted back to SSO. When the outage is over, it's critical to switch everyone back to SSO from email-password to maintain consistency and security. \ No newline at end of file
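As an aside to the ``_status`` replication checks in the AWS DR guide above, the sample JSON output can be evaluated programmatically during a failover drill. A minimal sketch in Python — the helper name is hypothetical (not part of Mattermost or OpenSearch), and it assumes the response shape shown in the guide:

```python
import json

def replication_caught_up(raw: str) -> bool:
    """Return True when a follower index is actively syncing and has
    applied every operation the leader has written (equal checkpoints)."""
    status = json.loads(raw)
    if status.get("status") != "SYNCING":
        return False
    details = status.get("syncing_details", {})
    return details.get("leader_checkpoint") == details.get("follower_checkpoint")

# Sample output from the guide: both checkpoints are 16, so the
# follower has caught up with the leader.
sample = """{
  "status": "SYNCING",
  "reason": "User initiated",
  "syncing_details": {
    "leader_checkpoint": 16,
    "follower_checkpoint": 16,
    "seq_no": 17
  }
}"""
print(replication_caught_up(sample))  # True
```

In practice you would feed this the body of each ``_status`` call (channels, files, users, posts) and only proceed with the DNS switch once every index reports caught up.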