docs: move AWS DR guide to dedicated subpage #8893
…recovery Distinguish high availability (single-site clustering) from disaster recovery (multi-site failover), clarify that Mattermost supports active/passive DR only and does not support active/active deployments, and rename the "High Availability deployment" section to "Active/passive DR deployment" for accuracy. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extract the AWS-specific active/passive DR deployment steps from backup-disaster-recovery.rst into a new disaster-recovery-aws.rst subpage. The main page now links to it via toctree, keeping the overview page concise and making room for future platform-specific guides. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
📝 Walkthrough
Documentation reorganized: the main Disaster Recovery guide was trimmed of AWS-specific step-by-step instructions, replaced with an Active/passive DR overview, and a new AWS-specific active/passive DR page was added with end-to-end replication and failover procedures.
Sequence Diagram(s)
sequenceDiagram
autonumber
actor User
participant DNS as DNS
participant Primary as PrimaryRegion\n(App, RDS, S3, OpenSearch)
participant Secondary as SecondaryRegion\n(App replica, RDS replica, S3 replica, OpenSearch replica)
participant Admin as Admin
User->>DNS: Resolve app endpoint
DNS->>Primary: Route traffic to Primary App nodes
User->>Primary: App requests (reads/writes)
Primary->>RDS: DB writes/reads
Primary->>S3: Object writes (replicated)
Primary->>OpenSearch: Index writes (replicated)
Note over Primary,Secondary: Continuous replication configured\n(RDS global cluster, S3 replication, OpenSearch CCR)
alt Primary region failure
Admin->>DNS: Switch endpoint to Secondary
DNS->>Secondary: Route users to Secondary App nodes
Admin->>RDS: Promote secondary as writer
Admin->>Secondary: Stop/adjust Job scheduler (JobSettings.RunScheduler=false)
Admin->>OpenSearch: Reverse replication (remove leader rules, recreate in opposite direction)
Secondary->>S3: Accept replicated objects / sync
Admin->>Secondary: Enable scheduler and roll app nodes
end
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 3
🧹 Nitpick comments (2)
source/deployment-guide/disaster-recovery-aws.rst (2)
95-117: Use `json` instead of `sh` for the IAM policy block.
This block is a JSON policy document, not a shell command. Language labelling should match content.
Suggested minimal diff
- .. code-block:: sh
+ .. code-block:: json
As per coding guidelines, "Require code fences or code directives to identify the language when practical."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/deployment-guide/disaster-recovery-aws.rst` around lines 95 - 117, The code block showing the IAM policy is labeled as a shell snippet ("code-block:: sh") but contains JSON; update the directive to "code-block:: json" so the IAM policy document is correctly identified and syntax-highlighted; locate the block that currently begins with code-block:: sh and change that directive to code-block:: json (the JSON policy object with keys "Version" and "Statement") to match the content.
7-10: Add a short prerequisites block before procedural steps.
This page jumps into execution quickly. A compact prerequisites list (AWS account access, region pair selected, existing Mattermost primary deployment, DNS ownership, OpenSearch/RDS permissions) would reduce operator error for novice admins.
As per coding guidelines, "List prerequisites clearly at the top of documentation sections."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/deployment-guide/disaster-recovery-aws.rst` around lines 7 - 10, Add a short "Prerequisites" block at the top of the Mattermost AWS disaster recovery guide (before the procedural steps that start with the current introductory paragraphs) listing required items: AWS account access and IAM permissions, chosen region pair for failover, an existing Mattermost primary deployment, control/ownership of DNS for failover updates, required OpenSearch/RDS permissions and backups, and any tooling/CLI versions; ensure the block uses a clear bullet list and a brief note about verifying backups and network connectivity so novice operators see these checks before executing the procedure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@source/deployment-guide/disaster-recovery-aws.rst`:
- Line 13: Fix the typo in the cross-reference sentence by replacing
"documenation" with "documentation" in the sentence that references the
Upgrading Mattermost in Kubernetes and High Availability Environments doc (the
string containing ":doc:`Upgrading Mattermost in Kubernetes and High
Availability Environments
</administration-guide/upgrade/upgrade-mattermost-kubernetes-ha>`"). Ensure the
corrected sentence reads "...see the ... documentation." and keep the rest of
the cross-reference unchanged.
- Around line 235-237: Duplicate curl command checking the _status for
posts_<DATE> appears twice; remove the redundant line so only one curl -H
'Content-Type: application/json' -u '<USERNAME>:<PASSWORD>'
'https://<HOSTNAME>/_plugins/_replication/posts_<DATE>/_status?pretty' remains,
preserving the Sample output line that follows and keeping steps atomic and
numbered.
- Line 175: Replace the incorrect curl credential separator and clarify the host
placeholder: change the curl -u argument from "username/password" to the
required "username:password" format and update the URL placeholder (e.g., use a
clearer <hostname[:port]> or <elasticsearch-host>) in the example command string
shown in the diff so readers can substitute a real host.
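The credential fix above comes down to the `username:password` form that HTTP Basic authentication expects. As a rough illustration only, the sketch below builds (without sending) the same replication status request in Python. The function name `build_status_request` and the sample host, index, and credentials are hypothetical; only the `/_plugins/_replication/<index>/_status?pretty` path and the `Content-Type` header come from the reviewed command.

```python
import base64
from urllib.request import Request


def build_status_request(hostname, index, username, password):
    """Build (but do not send) a replication status request.

    Hypothetical helper: it mirrors the curl example from the review,
    where credentials are joined with a colon, not a slash.
    """
    url = f"https://{hostname}/_plugins/_replication/{index}/_status?pretty"
    # HTTP Basic auth encodes "username:password" as base64.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Basic {token}",
    }
    return Request(url, headers=headers)


if __name__ == "__main__":
    # Placeholder values, assumed for illustration only.
    request = build_status_request("search.example.com", "posts_2024-01-01", "admin", "secret")
    print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out here because it needs a reachable OpenSearch domain.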
📒 Files selected for processing (2)
source/deployment-guide/backup-disaster-recovery.rst
source/deployment-guide/disaster-recovery-aws.rst
|
Newest code from mattermost has been published to preview environment for Git SHA 7fd420e |
- Fix typo: "documenation" → "documentation"
- Fix curl credentials: "username/password" → "<USERNAME>:<PASSWORD>" and empty host placeholder
- Remove duplicate posts_<DATE> status curl command
- Change IAM policy code block language from sh to json
- Add prerequisites section to disaster-recovery-aws.rst
- Wrap HA vs DR explanation in a note directive

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Newest code from mattermost has been published to preview environment for Git SHA dde9aab |
Change the three SSO failover sub-sections from ~~~~ to ^^^^^ so they render as children of "Failover from Single Sign-On outage" in the sidebar TOC rather than at the same level. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Newest code from mattermost has been published to preview environment for Git SHA 995626c |
|
cc @mrckndt |
- Reword awkward "at a very basic high level" sentence (line 28)
- Clarify OpenSearch tip to be scoped to OpenSearch 2.x users
- Fix "site 2" → "site2" consistency (line 294)
- Convert single-sentence S3 section heading to a note block
- Add link to RunScheduler config docs for operators who need guidance
- Add "Restore to primary data center" section for post-event recovery

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@ewwollesen Thanks for the detailed feedback! PTAL |
Actionable comments posted: 2
🧹 Nitpick comments (1)
source/deployment-guide/disaster-recovery-aws.rst (1)
269-269: Split scheduler failover guidance into atomic steps with a quick success check.
This sentence packs prerequisite, failover action, rollback action, and reference into one block. Please convert it into short numbered steps and add a verification cue (for example, confirm job execution only on the active region after toggling).
As per coding guidelines, "Use numbered, atomic steps (one action per step) when providing procedural instructions" and "Include expected output or success checks after key steps to help readers verify progress."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/deployment-guide/disaster-recovery-aws.rst` at line 269, Split the single paragraph about scheduler failover into an explicit numbered procedure: 1) On all nodes in the secondary region set JobSettings.RunScheduler = false (precondition); 2) When failing over, enable JobSettings.RunScheduler = true on nodes in the new primary region; 3) Immediately disable JobSettings.RunScheduler = false on nodes in the new secondary region; 4) Add a quick verification step such as “confirm jobs execute only in the active region” (e.g., submit a test job and check it runs on the primary, not the secondary); keep the existing :ref:`RunScheduler configuration setting <administration-guide/configure/environment-configuration-settings:run scheduler>` link for details.
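The numbered procedure requested above can be modelled in a few lines. This is a minimal in-memory sketch, assuming each region's settings are available as a dictionary shaped like `config.json`. The helper names (`set_run_scheduler`, `fail_over`, `active_scheduler_regions`) and the region labels are hypothetical, and the sketch does not model the ordering of steps on live nodes or touch any real configuration.

```python
import copy


def set_run_scheduler(config, enabled):
    """Return a copy of a config dict with JobSettings.RunScheduler set."""
    updated = copy.deepcopy(config)
    updated.setdefault("JobSettings", {})["RunScheduler"] = enabled
    return updated


def fail_over(region_configs, new_primary):
    """Enable the scheduler only in the region that becomes primary."""
    if new_primary not in region_configs:
        raise KeyError(f"unknown region: {new_primary}")
    return {
        region: set_run_scheduler(config, region == new_primary)
        for region, config in region_configs.items()
    }


def active_scheduler_regions(region_configs):
    """Verification cue: list the regions where the scheduler is enabled."""
    return sorted(
        region
        for region, config in region_configs.items()
        if config.get("JobSettings", {}).get("RunScheduler") is True
    )


if __name__ == "__main__":
    # Hypothetical starting state: site1 is active, site2 is passive.
    regions = {
        "site1": {"JobSettings": {"RunScheduler": True}},
        "site2": {"JobSettings": {"RunScheduler": False}},
    }
    after = fail_over(regions, "site2")
    print(active_scheduler_regions(after))
```

The verification helper corresponds to the success check the review asks for: after a failover, exactly one region should report the scheduler as enabled.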
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@source/deployment-guide/disaster-recovery-aws.rst`:
- Line 138: The documentation uses the wrong config key name
"ElasticSearchSettings" which can mislead operators; update the text to
reference the correct Mattermost config key "ElasticsearchSettings" (the section
in config.json that contains server username and password references) so all
occurrences of ElasticSearchSettings are replaced with ElasticsearchSettings in
the sentence describing updating the server :ref:`username` and :ref:`password`.
- Around line 107-126: The OpenSearch policy uses an overly permissive Principal
("Principal": { "AWS": "*" }) in two places; replace these wildcard principals
with scoped, least-privilege ARNs (e.g., specific AWS account ARN or IAM
role/user ARNs) for both occurrences and ensure the ARNs match the target
Resource entries (the "Resource":
"arn:aws:es:<region>:<acc_num>:domain/<domain_name>/*" and the domain-level
resource). Keep actions ("es:ESHttp*" and "es:ESCrossClusterGet") unchanged but
limit the principals to the exact account/role/user ARNs intended to access the
domain.
📒 Files selected for processing (1)
source/deployment-guide/disaster-recovery-aws.rst
|
Newest code from mattermost has been published to preview environment for Git SHA f0f43c5 |
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Newest code from mattermost has been published to preview environment for Git SHA 61a0a39 |
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Newest code from mattermost has been published to preview environment for Git SHA 593c1f3 |
|
Newest code from mattermost has been published to preview environment for Git SHA e5ebd13 |
♻️ Duplicate comments (1)
source/deployment-guide/disaster-recovery-aws.rst (1)
107-127: ⚠️ Potential issue | 🟠 Major
Restrict the wildcard AWS principal in the OpenSearch IAM policy.
The IAM policy uses `"Principal": { "AWS": "*" }` on lines 112-113 and 120-121, which grants access to any AWS principal. This is overly permissive and violates the principle of least privilege. In a production DR setup, this could allow unauthorized cross-account access to your OpenSearch domain.
Replace the wildcard with explicit IAM role or account ARNs that correspond to your primary and secondary Mattermost deployments.
🛡️ Suggested security fix
  {
    "Effect": "Allow",
    "Principal": {
-     "AWS": "*"
+     "AWS": [
+       "arn:aws:iam::<PRIMARY_ACCOUNT_ID>:role/<PRIMARY_OPENSEARCH_ROLE>",
+       "arn:aws:iam::<SECONDARY_ACCOUNT_ID>:role/<SECONDARY_OPENSEARCH_ROLE>"
+     ]
    },
    "Action": "es:ESHttp*",
    "Resource": "arn:aws:es:<region>:<acc_num>:domain/<domain_name>/*"
  },
  {
    "Effect": "Allow",
    "Principal": {
-     "AWS": "*"
+     "AWS": [
+       "arn:aws:iam::<PRIMARY_ACCOUNT_ID>:role/<PRIMARY_OPENSEARCH_ROLE>",
+       "arn:aws:iam::<SECONDARY_ACCOUNT_ID>:role/<SECONDARY_OPENSEARCH_ROLE>"
+     ]
    },
    "Action": "es:ESCrossClusterGet",
    "Resource": "arn:aws:es:<region>:<acc_num>:domain/<domain_name>"
  }
Based on learnings, "When reviewing or iterating on documentation, evaluate it through the lens of Veteran Vince ... and flag content that is ... security-unsafe."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/deployment-guide/disaster-recovery-aws.rst` around lines 107 - 127, The policy currently uses a wildcard AWS principal ("Principal": { "AWS": "*" }) in the OpenSearch domain IAM policy (the statements that allow "es:ESHttp*" and "es:ESCrossClusterGet"), which is overly permissive; replace the wildcard principal value with explicit IAM role or account ARNs for your primary and secondary Mattermost deployments (for example the IAM role ARNs used by the DR cluster and the primary cluster) so only those principals can call es:ESHttp* and es:ESCrossClusterGet on the specified "Resource" ARN; ensure you update both statements and validate the ARNs are correct and scoped (prefer role ARNs over account-wide ARNs) before applying.
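One way to check a policy document for this finding before applying it is a small lint pass. The sketch below is an illustration under assumptions, not an AWS tool: the function name `find_wildcard_principals` is hypothetical, and it only inspects `Allow` statements whose `Principal` is `"*"` or whose `Principal.AWS` contains `"*"`. It does not evaluate conditions or other principal types.

```python
def find_wildcard_principals(policy):
    """Return (statement index, action) pairs for Allow statements open to any AWS principal."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        # A policy may hold a single statement object instead of a list.
        statements = [statements]

    findings = []
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal")
        if principal == "*":
            is_wildcard = True
        elif isinstance(principal, dict):
            aws = principal.get("AWS", [])
            values = [aws] if isinstance(aws, str) else list(aws)
            is_wildcard = "*" in values
        else:
            is_wildcard = False
        if is_wildcard:
            findings.append((index, statement.get("Action")))
    return findings


if __name__ == "__main__":
    # Policy shape taken from the review comment; placeholders left as-is.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": "*"},
                "Action": "es:ESHttp*",
                "Resource": "arn:aws:es:<region>:<acc_num>:domain/<domain_name>/*",
            },
            {
                "Effect": "Allow",
                "Principal": {"AWS": "*"},
                "Action": "es:ESCrossClusterGet",
                "Resource": "arn:aws:es:<region>:<acc_num>:domain/<domain_name>",
            },
        ],
    }
    for index, action in find_wildcard_principals(policy):
        print(f"statement {index} allows {action} for any AWS principal")
```

An empty result only means no wildcard principal was found in the forms listed above; it is not a full least-privilege review.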
🧹 Nitpick comments (2)
source/deployment-guide/disaster-recovery-aws.rst (2)
277-278: Consider clarifying where to check job logs for verification.
The verification step instructs users to "check the job logs" but doesn't specify where these logs are located. For a novice administrator, it may not be immediately clear whether to check server logs, System Console logs, or application logs.
Optional clarification
Consider adding a brief note about where to find job logs, such as:
4. **Verify:** Confirm that jobs execute only in the active region. Submit a test job (for example, trigger an index rebuild) and check the job logs to ensure it runs on the new primary, not the secondary.
+
+ You can monitor job execution in **System Console > Server Logs** or by reviewing the Mattermost server log files.
As per coding guidelines, "Define technical terms briefly inline on first use rather than assuming reader knowledge" and "Flag any step that assumes knowledge a novice IT administrator with 1-2 years of experience likely doesn't have."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/deployment-guide/disaster-recovery-aws.rst` around lines 277 - 278, Update the "Verify: Confirm that jobs execute only in the active region..." step to explicitly state where to find the job logs: add a short note telling users to check the application job logs via the System Console's "Jobs" (or "Background Jobs") page and, if needed, the server/application log files on the primary instance for more detail (e.g., /var/log/<app>-worker or platform-specific logging service). Edit the sentence in disaster-recovery-aws.rst that begins "Verify: Confirm that jobs execute only in the active region..." to include this brief inline guidance and an example (e.g., "trigger an index rebuild and view the job entry in System Console → Jobs; check server logs on the active primary if further detail is required").
37-42: Consider using `note` instead of `tip` for critical scope information.
This admonition clarifies the fundamental scope and limitations of the DR architecture (entire region failure vs. single service failure), which is important context that affects whether a reader should follow this guide. A `note` is more appropriate for clarifications and exceptions that help the reader understand the boundaries of the content.
Suggested change
-.. tip::
+.. note::
   The following architecture would be implemented when an entire region goes down. It does not cover the case when a single server/service goes down. For example:
As per coding guidelines, "Use `note` admonition for clarifications, exceptions, non-blocking caveats, or extra context that helps the reader."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/deployment-guide/disaster-recovery-aws.rst` around lines 37 - 42, Replace the ".. tip::" admonition with ".. note::" for the block that begins with ".. tip::" and contains the explanation about the architecture scope ("The following architecture would be implemented when an entire region goes down...") so the message is presented as a clarifying exception rather than a non-critical tip; update the opening token from "tip" to "note" and keep the inner text and bullet examples unchanged.
📒 Files selected for processing (1)
source/deployment-guide/disaster-recovery-aws.rst
|
Newest code from mattermost has been published to preview environment for Git SHA 93f3594 |
ewwollesen
left a comment
I went ahead and just made some grammar/wording updates that I probably missed the first time around.
- Add missing comma after "Mattermost Enterprise Edition"
- Remove spurious "to" from "continue to using"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Newest code from mattermost has been published to preview environment for Git SHA 1ef6601 |
…ery.rst
- "1 of" → "one of" in backup step 3
- "In this case, there are several potential mitigations:" → "In either case, several mitigations are available:"
- Clean up SSO outage sentence: remove "issue", add "their email and", fix comma placement

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Newest code from mattermost has been published to preview environment for Git SHA ebbb2c5 |
|
Thanks for the in-depth review @ewwollesen 👍 |
|
Newest code from mattermost has been published to preview environment for Git SHA 6950c44 |
ewwollesen
left a comment
Looks good. Approved.
Summary
The old page was 90% an AWS guide. I've split the page into two: one general overview page on how we do DR at Mattermost and a sub page for the AWS guide. I've also added clarifications to the main page on what is supported and what isn't.
AI Summary
- … `backup-disaster-recovery.rst` into a new `disaster-recovery-aws.rst` subpage
- … `toctree`, keeping the overview concise and making room for future platform-specific guides
- … `master`)
Preview
http://mattermost-docs-preview-pulls.s3-website-us-east-1.amazonaws.com/8893/deployment-guide/backup-disaster-recovery.html#
🤖 Generated with Claude Code