fix(e2e): webhook-driven, auto-versioned CMT + fix non-PR dispatch 422 by yasserfaraazkhan · Pull Request #90 · mattermost/matterwick

yasserfaraazkhan · 2026-05-25T04:17:44Z

Summary

Reworks Compatibility Matrix Testing (CMT) to a webhook-driven, self-updating model, fixes the bug that broke every non-PR E2E dispatch, and makes instance cleanup correct end-to-end across all trigger flows.

Root-cause fix: undeclared `mw_tracking_key` → HTTP 422

matterwick was sending an mw_tracking_key workflow_dispatch input that none of the target workflows declare. GitHub rejects a dispatch carrying an undeclared input with HTTP 422, so release, scheduled, and CMT dispatches all failed (PR-label worked only because it never sent it). Removed from every dispatch path.

CMT is webhook-driven (no HTTP endpoints)

Since the version set is resolved by matterwick, nothing needs to be passed from the workflow — so the /cmt_dispatch and /cleanup_e2e endpoints (and their CMTTriggerSecret/CleanupSecret) are removed. Instead a lightweight CMT Provisioner workflow fires → matterwick hears workflow_run.requested → handleCMTTrigger provisions one server per version → dispatches compatibility-matrix-testing.yml.

Trigger (replaces the monthly schedule): CMT now runs on release branches. Desktop fires on push to release-v* (the cut + subsequent pushes); mobile fires only on the release-v* branch cut (on: create). Because create delivers a workflow_run webhook for every branch/tag creation, matterwick gates with shouldTriggerCMT: provision only when the run was a manual workflow_dispatch (any branch) or ran on a release branch. Manual runs from the Actions tab still work for both repos.

Note — intentional duplicate coverage: a release-v* push/cut also triggers matterwick's normal release E2E (handlePushEventE2E: whole suite, latest server). CMT additionally runs its matrix (desktop: whole suite per version; mobile: smoke per version), whose set includes the latest server — so the latest server is exercised twice. Expected and accepted.

Auto-derived CMT version set (no monthly edits)

resolveCMTServerVersions() fetches the mattermost/mattermost GitHub releases at each trigger and builds the set: active ESR line(s) + latest 3 stable minor lines + current RC, latest patch each, deduped, capped at 10 (raised from 5). ESR lines are detected from the release notes ("Extended Support Release"). Config.CMTServerVersions is an optional manual override; defaultCMTServerVersions (10.11.18, 11.7.1) is only a fallback for a GitHub-API outage.

Cleanup — correct across every flow

Two layers, audited end-to-end:

Primary, completion-based (non-PR: CMT / nightly / push-release). On the test workflow's workflow_run completed event, instances are matched by commit-SHA suffix, scoped per flow (CMT vs non-CMT) so two runs sharing a SHA (e.g. the monthly nightly + CMT on the same master commit) can't tear down each other's servers.
- Fix — key on dispatch-time HEAD SHA. Non-PR flows dispatch with ref: <branch> (workflow_dispatch can't take a SHA), so the run's head_sha is the branch HEAD at dispatch time. Previously we keyed on the trigger SHA, so if the branch advanced during the long provisioning window the primary match missed its own instances. matterwick now resolves the branch HEAD via the commits API right before dispatch and keys on that, so it lines up with the completed run's head_sha. Falls back to the trigger SHA on API error.
Backstop, age-based periodic scan (cleanupStaleE2EInstances). Catches anything the primary path misses.
- Non-PR: E2EInstanceMaxAge (default 3h).
- PR: PR instances have no completion-based teardown (they're deliberately reused across label toggles and commits), so they now auto-expire at a longer, separate E2EPRInstanceMaxAge (default 24h). When a PR instance is reaped, its in-memory tracking entry is evicted — this is what makes a re-applied E2E/Run provision a fresh set instead of dispatching against deleted servers.

PR teardown triggers unchanged otherwise: E2E/Reset-Servers label = explicit tear-down-now; PR close = cleanup; removing E2E/Run is intentionally a no-op (so it doesn't kill the still-finishing run).

Key files

server/workflow_run.go — handleCMTTrigger; shouldTriggerCMT release-branch/manual gate; flow-scoped SHA cleanup; dispatch-time HEAD SHA re-key for CMT + nightly; removed handleCMTRunCleanup; dropped mw_tracking_key.
server/push_events.go — dispatch-time HEAD SHA re-key for push/release.
server/e2e_tests.go — resolveCMTServerVersions() + cmtServerVersions() + semver comparator; resolveBranchHeadSHA(); cleanupStaleE2EInstances (renamed, now also reaps aged-out PR instances) + evictReapedPRInstances(); removed mw_tracking_key from dispatch.
server/config.go + default JSON — CMTServerVersions / CMTTriggerWorkflowName / E2EPRInstanceMaxAge; removed the secrets.
Removed server/cmt_dispatch.go, server/cleanup_e2e.go (+ tests).

Companion PRs

mattermost/desktop#3829, mattermost/mattermost-mobile#9800, mattermost/gitops-platform#72.

Verification

go build ./..., go vet ./server/..., and go test ./server/... pass. New unit tests cover resolveBranchHeadSHA, the PR max-age knob, and evictReapedPRInstances (the re-provision-after-expiry guarantee). The CMT resolver was validated against live GitHub data — today it yields 10.11.19, 11.5.7, 11.6.4, 11.7.2, 11.8.0-rc3 (all confirmed as real Docker tags).

🤖 Generated with Claude Code

Release Note

NONE

GitHub's workflow_run webhook payload does not carry workflow_dispatch inputs (verified against the workflow_run schema), so CMT Provisioner runs were never able to start provisioning -- the server_versions input was always read as empty and the handler returned without dispatching compatibility-matrix-testing.yml. This change moves CMT trigger out of the webhook path and onto two new authenticated HTTP endpoints, called directly by the workflows: POST /cmt_dispatch -- cmt-provisioner.yml POSTs server_versions plus full context here; matterwick provisions in a goroutine and returns 202 immediately. POST /cleanup_e2e -- compatibility-matrix-testing.yml POSTs the CMT provisioner's run_id here on completion; matterwick destroys tracked instances under {repo}-cmt-{run_id}-* keys. Auth: each endpoint takes a shared secret in a request header (X-Trigger-Token, X-Cleanup-Token) compared with constant-time comparison. Endpoints reject 503 if the corresponding secret is unconfigured rather than running unauthenticated. The broken "requested" branch in handleWorkflowRunEventWithInputs that read payload.WorkflowRun.Inputs["server_versions"] is removed. The "completed" branch is kept as a defensive fallback for stuck instances. Co-Authored-By: Claude <noreply@anthropic.com>

Adds 18 unit tests covering the synchronous validation surface of both new endpoints: missing/wrong tokens, malformed JSON, missing required fields, edge inputs (",,," server_versions, unknown repo type, zero run_id), and the cleanup map-deletion logic across matching, non-matching, and multi-key scenarios. Also drops the CheckLimitRateAndAbortRequest call from both handlers. The /github_event handler keeps it because synchronous webhook responses may need to make GitHub API calls before answering, but the new endpoints only call GitHub from background goroutines (cmt_dispatch) or talk to the cloud provisioner (cleanup_e2e). Keeping the rate check would force every request to block on a live GitHub round-trip, which makes the endpoints untestable in unit tests (no live token) for no production benefit. Inline comments explain the rationale. End-to-end coverage of the happy path (provisioning + workflow dispatch) is deliberately left to e2e_dryrun_test.go, which already mocks the cloud and github clients; duplicating that mocking here adds noise without catching anything new. The validation rejection paths above are the bug-prone surface. Co-Authored-By: Claude <noreply@anthropic.com>

mm-cloud-bot · 2026-05-25T04:17:48Z

@yasserfaraazkhan: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

I understand the commands that are listed here

coderabbitai · 2026-05-25T04:22:13Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

CMT provisioning routing now matches a configured trigger name and reads configured server versions. Instance cleanup switches from tracking-key in-memory deletion to SHA-suffix scanning; all dispatch functions drop tracking-key inputs and call sites are updated.

Changes

SHA-Based Cleanup and CMT Routing Refactor

Layer / File(s)	Summary
CMT Configuration and Version Defaults `server/config.go`, `config/config-matterwick.default.json`	`MatterwickConfig` gains `CMTTriggerWorkflowName` and `CMTServerVersions`, a `defaultCMTServerVersions` fallback, and `CMTVersions()` accessor; default config adds trigger name (`"CMT Provisioner"`) and server versions (`["10.11.18","11.7.1"]`).
CMT Event Routing and Provisioning Trigger `server/workflow_run.go`	`handleWorkflowRunEventWithInputs` now routes only when `workflow_run.name` equals configured `CMTTriggerWorkflowName` and action is `requested`; adds `handleCMTTrigger`; `dispatchCMTWorkflow` no longer accepts or emits `mw_tracking_key`.
E2E Dispatch Signature Cascade `server/e2e_tests.go`, `server/push_events.go`, `server/workflow_run.go`	Desktop/mobile dispatch helpers remove `trackingKey` parameters; push-event trigger helpers and nightly dispatch call sites updated to stop passing tracking keys.
E2E Cleanup Architecture: SHA-Based Matching `server/workflow_run.go`, `server/e2e_dryrun_test.go`	Completion cleanup now uses `findAndDestroyInstancesBySHA` with a `cmtOnly` flag and SHA-suffix matching; `handleCMTRunCleanup` removed; new test `TestInstanceKeyMatchesSHA` validates matching semantics.

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

mattermost/matterwick#89: Overlaps earlier changes removing mw_tracking_key from workflow dispatch inputs and moving cleanup to SHA-based destruction.

Suggested labels

release-note-none

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 27.59% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title accurately reflects the main objective: fixing non-PR dispatch 422 errors (caused by undeclared mw_tracking_key) and reworking CMT to webhook-driven delivery.
Description check	✅ Passed	The PR description comprehensively explains the changes, including the root-cause fix for the HTTP 422 bug, the webhook-driven CMT model, cleanup strategy, and key files modified.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/cmt-direct-dispatch-and-cleanup-endpoint

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (2)

server/cmt_dispatch_test.go (1)

47-57: 💤 Low value

Consider using httptest.NewRequestWithContext to satisfy linter.

Static analysis flags httptest.NewRequest usage. While functionally equivalent in tests, using NewRequestWithContext aligns with best practices and silences the linter.

♻️ Suggested fix

 func doCMTDispatchRequest(t *testing.T, s *Server, body, token string) *httptest.ResponseRecorder {
 	t.Helper()
-	req := httptest.NewRequest(http.MethodPost, "/cmt_dispatch", bytes.NewBufferString(body))
+	req := httptest.NewRequestWithContext(context.Background(), http.MethodPost, "/cmt_dispatch", bytes.NewBufferString(body))
 	req.Header.Set("Content-Type", "application/json")
 	if token != "" {
 		req.Header.Set("X-Trigger-Token", token)
 	}
 	w := httptest.NewRecorder()
 	s.handleCMTDispatch(w, req)
 	return w
 }

You'll need to add "context" to the imports.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/cmt_dispatch_test.go` around lines 47 - 57, The test helper
doCMTDispatchRequest uses httptest.NewRequest which the linter flags; update
doCMTDispatchRequest to use httptest.NewRequestWithContext with a background
context (e.g., context.Background()) when creating the request, and add
"context" to the imports; specifically modify the call in doCMTDispatchRequest
(the NewRequest call that constructs req) to NewRequestWithContext and ensure
the X-Trigger-Token header handling and subsequent call to s.handleCMTDispatch
remain unchanged.

server/cleanup_e2e_test.go (1)

35-45: 💤 Low value

Consider using httptest.NewRequestWithContext to satisfy linter.

Same as in cmt_dispatch_test.go, static analysis flags httptest.NewRequest. Using NewRequestWithContext would align with the linter expectations.

♻️ Suggested fix

 func doCleanupE2ERequest(t *testing.T, s *Server, body, token string) *httptest.ResponseRecorder {
 	t.Helper()
-	req := httptest.NewRequest(http.MethodPost, "/cleanup_e2e", bytes.NewBufferString(body))
+	req := httptest.NewRequestWithContext(context.Background(), http.MethodPost, "/cleanup_e2e", bytes.NewBufferString(body))
 	req.Header.Set("Content-Type", "application/json")
 	if token != "" {
 		req.Header.Set("X-Cleanup-Token", token)
 	}
 	w := httptest.NewRecorder()
 	s.handleCleanupE2E(w, req)
 	return w
 }

You'll need to add "context" to the imports.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/cleanup_e2e_test.go` around lines 35 - 45, Replace httptest.NewRequest
with httptest.NewRequestWithContext in doCleanupE2ERequest and pass a context
(e.g., context.Background()); also add "context" to the imports. Specifically,
update the call in doCleanupE2ERequest (the helper that builds the POST to
"/cleanup_e2e") to use httptest.NewRequestWithContext(context.Background(),
http.MethodPost, ... ) and ensure the package imports include "context".

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@server/cleanup_e2e_test.go`:
- Around line 35-45: Replace httptest.NewRequest with
httptest.NewRequestWithContext in doCleanupE2ERequest and pass a context (e.g.,
context.Background()); also add "context" to the imports. Specifically, update
the call in doCleanupE2ERequest (the helper that builds the POST to
"/cleanup_e2e") to use httptest.NewRequestWithContext(context.Background(),
http.MethodPost, ... ) and ensure the package imports include "context".

In `@server/cmt_dispatch_test.go`:
- Around line 47-57: The test helper doCMTDispatchRequest uses
httptest.NewRequest which the linter flags; update doCMTDispatchRequest to use
httptest.NewRequestWithContext with a background context (e.g.,
context.Background()) when creating the request, and add "context" to the
imports; specifically modify the call in doCMTDispatchRequest (the NewRequest
call that constructs req) to NewRequestWithContext and ensure the
X-Trigger-Token header handling and subsequent call to s.handleCMTDispatch
remain unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7b08b625-de55-47f7-b1f4-061630a8a168

📥 Commits

Reviewing files that changed from the base of the PR and between 68ce3b7 and 1dd004a.

📒 Files selected for processing (7)

server/cleanup_e2e.go
server/cleanup_e2e_test.go
server/cmt_dispatch.go
server/cmt_dispatch_test.go
server/config.go
server/server.go
server/workflow_run.go

The "Primary" lookup at handleWorkflowRunEventWithInputs read payload.WorkflowRun.Inputs["mw_tracking_key"], which is the same field that broke CMT provisioning -- GitHub's workflow_run webhook payload does not include workflow_dispatch inputs (verified against the Octokit schema and the REST API for a real workflow run). The field is always empty in production, so the "primary" branch never fires; only the SHA-based fallback runs. That fallback works because every tracking-key format ends with -{sha}. Removing the dead read eliminates the misleading "Primary"/"Fallback" labels and makes the SHA-based scan the documented canonical mechanism. The mw_tracking_key value is still written when dispatching test workflows -- it remains queryable via the REST API for any future code path that wants to do a key-based lookup -- but the webhook-driven cleanup intentionally does not depend on it. Co-Authored-By: Claude <noreply@anthropic.com>

…g_key CMT now runs off a lightweight scheduled trigger workflow that matterwick hears as a workflow_run event, provisioning one server per version from a hardcoded set in config (CMTServerVersions) and dispatching compatibility-matrix-testing.yml. This removes the /cmt_dispatch and /cleanup_e2e HTTP endpoints (and their CMTTriggerSecret/CleanupSecret config), which existed only to pass server_versions past the workflow_run webhook. Also stop sending the undeclared `mw_tracking_key` workflow_dispatch input: GitHub rejects a dispatch carrying an undeclared input with HTTP 422, which broke every non-PR dispatch (release, scheduled, and CMT). Cleanup is keyed off the test workflow's workflow_run completed event, matched by commit SHA. - workflow_run.go: add handleCMTTrigger, provision on CMTTriggerWorkflowName 'requested'; remove handleCMTRunCleanup fallback - config.go: add CMTServerVersions (+ defaultCMTServerVersions) and CMTTriggerWorkflowName; remove CMTTriggerSecret/CleanupSecret - remove cmt_dispatch.go, cleanup_e2e.go and their tests - drop dead trackingKey plumbing from desktop/mobile/CMT dispatch paths Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/e2e_tests.go`:
- Around line 1051-1055: The dispatch uses branch "ref" which can advance and
desync cleanup keyed by commit SHA; update dispatchDesktopE2EWorkflow and
dispatchMobileE2EWorkflow to pin the run to the tracked commit by including the
commit SHA as a declared workflow input (e.g., "tracked_sha") and pass sha into
the workflow inputs, or alternatively set "ref" to the exact commit SHA when
calling the GitHub Actions API; also update the corresponding workflow files to
declare the tracked_sha input (so workflow_run.inputs can be consulted) and
adjust the cleanup path (findAndDestroyInstancesBySHA / createCloudInstallation
cleanup logic) to read workflow_run.inputs.tracked_sha (or ensure head_sha
equals the tracked SHA) so provisioning and cleanup use the same unique
identifier.

In `@server/workflow_run.go`:
- Around line 108-113: The cleanup currently calls
s.findAndDestroyInstancesBySHA(repoName, headSHA) which destroys all entries
matching {repo}-{sha} and will tear down concurrent runs; change the cleanup to
preserve and use the per-dispatch correlation key (runID) instead of SHA-only.
Update the webhook handling to extract runID from the payload and call a new or
updated method (e.g., s.findAndDestroyInstancesByRunID(repoName, headSHA, runID)
or s.findAndDestroyInstances(repoName, headSHA, runID)) so matching uses the
full tracking key that includes runID, and only fall back to SHA-only logic when
runID is absent; modify the matching/teardown logic inside the corresponding
method to filter by runID (and repoName/headSHA as needed) rather than deleting
all {repo}-{sha} entries.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c8ce2a0e-2229-48cb-b0e6-4517747ca6a5

📥 Commits

Reviewing files that changed from the base of the PR and between d0e3d60 and e0c42f2.

📒 Files selected for processing (5)

config/config-matterwick.default.json
server/config.go
server/e2e_tests.go
server/push_events.go
server/workflow_run.go

✅ Files skipped from review due to trivial changes (1)

config/config-matterwick.default.json

…t reaped When two flows fire on the same commit SHA (notably the monthly CMT trigger and the nightly trigger on the same master HEAD), the completing test workflow would destroy the other flow's still-active servers, because cleanup matched every {repo}-…-{sha} key. Scope cleanup to the completing workflow's flow: CMT test completions reap only {repo}-cmt-… keys, other E2E completions reap only non-CMT (push/scheduled) keys. We can't key on a run ID as CodeRabbit suggested: GitHub's workflow_run payload carries the *test* run's id (not the trigger run id embedded in the tracking key) and omits workflow_dispatch inputs, so head_sha + flow is the only viable correlation. Branch-advance SHA desync remains handled by the periodic orphan cleanup scan. Extracted instanceKeyMatchesSHA + added TestInstanceKeyMatchesSHA covering the collision. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

….18, 11.7.1) Use the latest patch of the v10.11 ESR line and the v11.7 feature line (both verified present on Docker Hub) instead of the .0 GA placeholders, so CMT tests the currently-shipping builds. Keeps the code default in sync with the gitops config (mattermost/gitops-platform#72). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

One cloud instance is provisioned per version, so 10 versions = 10 instances. Allows the full ESR-to-current version set to run in a single CMT run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Instead of a hardcoded list that needs monthly edits, matterwick now resolves the CMT version set at trigger time from the mattermost/mattermost GitHub releases: the active ESR line(s) + the latest 3 stable minor lines + the current release candidate, latest patch of each, deduped and capped at 10. - resolveCMTServerVersions() (server/e2e_tests.go): fresh fetch per CMT trigger (no cache; CMT runs rarely), per_page=100, parses tag_name/draft/prerelease/body. ESR lines detected via 'Extended Support Release' in the release body. RC included only when newer than the latest stable. Falls back to defaultCMTServerVersions on API error. Adds a small internal semver parser/comparator (no new dependency). - cmtServerVersions(): explicit Config.CMTServerVersions is an optional manual override; empty (normal) => auto-resolve. Removed the old Config.CMTVersions(). - handleCMTTrigger uses s.cmtServerVersions(). - Tests: auto-derived set, manual override (no API call), API-error fallback, RC exclusion when stale, and parseCMTVersion edge cases. Raised the cap test to 10. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

NARSimoes

Thanks @yasserfaraazkhan , just a few clarifications.

NARSimoes · 2026-06-04T12:10:37Z

-  "E2EInstanceMaxAge": 6
+  "E2EInstanceMaxAge": 6,
+  "CMTTriggerWorkflowName": "CMT Provisioner",
+  "CMTServerVersions": ["10.11.18", "11.7.1"]


How are we planning to update these versions? Might better discuss in DM.

If understood correctly, the logic is to set "CMTServerVersions: []" in order to allow matterwick get the latest supported version and provisions instances for those versions.

@NARSimoes https://github.com/mattermost/gitops-platform/pull/72/changes#diff-d242a4ffee35793c04e9591a799f9136234ba74b276617420390641ae142aa9fR96

the deployed values will be empty so it will fetch the latest versions. In this config the values is just placeholders for demo purpose.

NARSimoes · 2026-06-04T12:40:05Z

 	req, err := client.NewRequest("POST",
 		fmt.Sprintf("/repos/%s/%s/actions/workflows/compatibility-matrix-testing.yml/dispatches", repoOwner, repoName),
 		map[string]interface{}{
 			"ref":    branch,


Sanity check: It seems for the new CMT logic we are provisioning using headBranch as ref but the clean up findAndDestroyInstancesBySHA is using headSHA. If this is accurate we'll keep leftovers for CMT runs

Thanks @NARSimoes , the CMT will now use the SHA and clean up after completion

Non-PR flows (CMT, nightly/scheduled, push/release) provision instances keyed by the trigger SHA, then dispatch the test workflow with ref=branch (workflow_dispatch cannot take a commit SHA). GitHub records the dispatched run's head_sha as the branch HEAD at dispatch time, which can differ from the trigger SHA when the branch advances during the (often long) provisioning window. Completion cleanup matches instances by the run's head_sha, so a moved branch made the primary SHA-match miss its instances. Resolve the branch HEAD via the GitHub commits API right before dispatch and key the tracking entry on that SHA so it aligns with the run's head_sha. Falls back to the trigger SHA on error; the periodic age-based scan remains the backstop either way. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

PR E2E instances had no completion-based teardown and were explicitly skipped by the periodic scan, so a long-open PR could keep its servers alive indefinitely (only PR-close or the E2E/Reset-Servers label tore them down). Extend the periodic scan to also reap PR instances past a longer, separate max-age (E2EPRInstanceMaxAge, default 24h vs 3h for non-PR), and evict the reaped PR's in-memory tracking entry. Eviction is what makes a re-applied E2E/Run provision a fresh set: with the stale entry gone, the reuse path in handleE2ETestRequest finds no live instances and creates new servers instead of dispatching against deleted ones. Renames cleanupStaleNonPRE2EInstances -> cleanupStaleE2EInstances since it now handles both kinds. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

CMT is now driven by release-branch events instead of a monthly schedule: desktop fires on push to release-v* (the cut + subsequent pushes), mobile fires only on the release-v* branch cut via `on: create`. Because `create` delivers a workflow_run webhook for every branch/tag creation (even when the trigger workflow's job is skipped), Matterwick must gate independently. Add shouldTriggerCMT: provision CMT only when the run was a manual workflow_dispatch (any branch) or ran on a release branch (head_branch matches isReleaseBranch). This filters mobile's create-event noise and non-release pushes while keeping ad-hoc manual runs working. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The compatibility-matrix-testing.yml cleanup step that consumed cmt_run_id was removed (Matterwick now reaps servers by commit SHA on workflow_run completion), leaving the input dead. Drop it from the dispatch payload and remove the now-unused sha/runID params from dispatchCMTWorkflow. This pairs with removing the cmt_run_id input declaration from the desktop/mobile CMT workflows. Deploy ordering: this Matterwick image must be live before/with those workflow changes — once the input is undeclared, sending it would 422 (the mw_tracking_key class). Prod matterwick (v0.4.12) does not dispatch these workflows, so the coordinated rollout is safe. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Replace branch-based CMT triggering with RC-tag triggering. The desktop & mobile RC cut (e.g. v6.2.0-rc.1) is the moment the release team actually wants compatibility-matrix validation — earlier than a release branch push, scoped to the artifact that will ship. For tag pushes GitHub sets workflow_run.head_branch to the tag name (no refs/tags/ prefix), so the gate matches the tag string directly: - isRCTag: ^v?\d+\.\d+\.\d+-rc[.\-]?\d+$ matches v6.2.0-rc.1, v2.41.0-rc.10, 6.2.0-rc1; rejects GA tags (v6.2.0), beta/alpha (v1.0.22-beta), and nightly (6.3.0-nightly.x). - shouldTriggerCMT also keeps release-branch acceptance as defense-in-depth and continues to allow manual workflow_dispatch on any ref. Unit tests lock in the matching/rejection set so future tag conventions can't silently slip through. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

NARSimoes

Just two minor / nit comments that can be addressed in a a follow up. Looks good to me but would be nice a matterwick owner do a sanity check on edge cases before merge

NARSimoes · 2026-06-11T06:46:19Z

+// desktop and mobile repos. It identifies CMT test-workflow completions so SHA-based cleanup
+// only reaps CMT instances (keys "{repo}-cmt-…-{sha}") and leaves a concurrent nightly/push
+// run on the same SHA untouched (and vice versa).
+const cmtTestWorkflowName = "Compatibility Matrix Testing"


Curiosity: any particular reason to have CMT hardcoded while others are config driven?

NARSimoes · 2026-06-11T06:51:39Z

+
+	const maxVersions = 10
+	if len(chosen) > maxVersions {
+		chosen = chosen[len(chosen)-maxVersions:] // keep the newest if somehow over the cap


nit: I think it's effectively the same due the amount of versions. Just highlighting to double check: the resolveCMTServerVersions keeps the newest 10 while handleCMTWithServerVersions the first 10 entris

Mobile doesn't cut RC git tags; its analog is the `build-release-NNN` branch (3+ digit build number, e.g. build-release-786 / build-release-1100) created by the "Mattermost Mobile Release" workflow when a release build is dispatched to TestFlight / Play Store. Treat that as the RC moment for mobile and run CMT against it. - New isBuildReleaseBranch helper with regex `^build-release-\d{3,}$` (3+ digits — rejects test artifacts like build-release-1 without locking the upper bound, so a future move to 5-digit numbers needs no code change). Platform-specific variants build-release-ios-NNN, -sim-NNN, -android-NNN are intentionally excluded (don't start with a digit after the prefix). - shouldTriggerCMT extended to accept this in addition to manual dispatch / RC tag / release branch. - TestIsBuildReleaseBranch + extended TestShouldTriggerCMT cover the boundary cases: real 3/4/5-digit build numbers (true), 1- and 2-digit test artifacts (false), platform variants (false), no overlap with existing gates. The paired mobile workflow change adds the `build-release-[0-9][0-9][0-9]*` branch pattern to cmt-provisioner.yml's push trigger. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The release team treats RCs as their go/no-go test gate and re-cuts on failures. Desktop CMT already runs the whole E2E suite across the multi-version matrix at each RC tag cut, and that matrix includes the latest server — so the per-push release E2E was redundant coverage for desktop. Skip it for desktop release-branch pushes. Mobile is unchanged: mobile CMT runs only smoke tests, so the whole-suite release E2E on `release-*` push remains the per-commit signal there. Other auto-triggers are unaffected: - master/main push trigger: still fires - nightly trigger: still fires - PR label trigger: still fires - mobile release-branch push: still fires Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Every push to a mobile release-2.X branch was firing a full PR-style E2E suite — 3 cloud servers + 10 macOS shards + 10 ubuntu shards + iOS/Android builds, ~60-90 min wall-clock, ~$27/push in GitHub Actions alone. Recent cadence on real release branches: release-2.41: 191 commits (~$5,160) release-2.40: 158 commits (~$4,270) release-2.39: 114 commits (~$3,080) release-2.38: 90 commits (~$2,430) release-2.37: 52 commits (~$1,400) ~$16K across these 5 branches alone, with ~15% of runs triggered by pure noise (translation updates, "Bump app build number", docs/CI tweaks) that can't meaningfully fail differently than the previous run. The release team treats RC builds as the test gate (re-cut on failure), which makes per-cherry-pick E2E during stabilization redundant. Same logic that already removed the desktop release-branch trigger; now applied to mobile too. After this change, auto E2E coverage on release-side: - master/main push -> full PR-style E2E (unchanged, single per-commit signal) - build-release-NNN push -> CMT (smoke x N server versions, mobile only) - RC tag cut -> CMT (desktop only; mobile doesn't tag) - PR label E2E/Run -> full PR-style E2E (unchanged) Full-suite-on-build-release-NNN is NOT wired today (separate decision). E2EAutoTriggerOnRelease and E2EReleasePatternPrefix config fields remain in place — still used by isReleaseBranch() as defense-in-depth in the CMT workflow_run gate (shouldTriggerCMT). Leaving them avoids an unrelated gitops/config diff in this change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Every push to mobile main was firing a full PR-style E2E suite (3 cloud servers, 10 macOS shards + 10 ubuntu shards, ~$27/push, ~60-90 min). Main sees ~5-20 merges/day during active development -> tens of thousands of $/year for post-merge regression signal that PR-label E2E already covers pre-merge. Skip the master/main gate for mobile; desktop master push still fires. Coverage that remains for mobile main: - PR-label E2E pre-merge (developer-driven, single PR's branch tested) - Manual workflow_dispatch - CMT (smoke x N versions) on build-release-NNN cut Full-suite E2E on build-release-NNN remains ungated. If desired, two ways to wire it: 1. Separate matterwick push-handler gate for isBuildReleaseBranch 2. Promote latest version inside CMT to whole-suite (covers RC fully without a parallel matterwick path) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…er code The mobile CMT_MATRIX now marks the highest-semver server entry as `latest: true`. The mobile workflow uses that flag (via expressions in `with:`, which has matrix context unlike job-level `if:`) to run the whole detox suite against latest and smoke-only against older versions — closing the full-suite-on-RC gap without a separate matterwick path. Matterwick stays repo-agnostic: it only signals "which is latest". The test-directory / parallelism choices that flow from that live in the mobile workflow, where the repo's structure is known. Dead-code cleanup now that the surrounding triggers are gone: - handleNightlyE2ETrigger (~95 lines) + its workflow_run gate. Both repos removed e2e-nightly-trigger.yml in prior commits, so no workflow with name "E2E Nightly Trigger" can fire matterwick on workflow_run.requested any longer. - E2ENightlyTriggerWorkflowName config field — no readers left. - E2EAutoTriggerOnRelease config field — the release-branch push gate that read it was removed in the prior commit; no readers left. - runType MASTER/RELEASE branching in trigger*ForPushEvent — only master/main pushes reach those functions now (release-branch and mobile-main paths are skipped earlier), so RELEASE is unreachable. Hardcode "MASTER". - Stale log fields (auto_release, release_pattern_prefix, is_release_branch) in the push-event fallthrough log. What stays in use: - isReleaseBranch + E2EReleasePatternPrefix — defense-in-depth in shouldTriggerCMT (the CMT workflow_run gate also accepts release-* refs from manual dispatches). - cmtOnly cleanup scoping — cheap insurance against hypothetical same-SHA collisions between flows. Net: -158 lines (-95 nightly handler, -36 config refs, -27 misc), +new tests for the latest-marker boundary cases (5-element ESR+minors+RC, stable vs same-X.Y.Z RC, multi-digit rc.N, single version, all unparseable fallback). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Matterwick's nightly handler was removed in prior commits; this dispatch function still took a `nightly bool` parameter and sent `"nightly": ...` in the e2e-functional.yml workflow inputs map. Push-event callers always passed `false`, and the PR-label path was a separate function that never went through here at all — so the param was dead. Companion change in desktop (paired): remove the `nightly` workflow input from e2e-functional.yml + e2e-functional-template.yml and the `if inputs.nightly` block that built `desktop-nightly-${OS}` build IDs. - dispatchDesktopE2EWorkflow: drop `nightly bool` parameter - drop "nightly": fmt.Sprintf("%t", nightly) from workflowInputs - update sole caller (triggerDesktopE2EWorkflowForPushEvent) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Yasser Khan and others added 2 commits May 21, 2026 14:46

mm-cloud-bot added the kind/bug Categorizes issue or PR as related to a bug. label May 25, 2026

mm-cloud-bot added the do-not-merge/release-note-label-needed label May 25, 2026

yasserfaraazkhan mentioned this pull request May 25, 2026

ci(e2e): CMT scheduling + macOS E2E stabilization + CI/security hardening mattermost/desktop#3829

Merged

coderabbitai Bot reviewed May 25, 2026

View reviewed changes

coderabbitai Bot approved these changes May 25, 2026

View reviewed changes

yasserfaraazkhan requested review from NARSimoes and saturninoabril May 25, 2026 06:05

yasserfaraazkhan changed the title ~~fix(cmt): add /cmt_dispatch and /cleanup_e2e endpoints~~ fix(e2e): webhook-driven CMT + fix non-PR dispatch 422 (mw_tracking_key) Jun 1, 2026

coderabbitai Bot requested changes Jun 1, 2026

View reviewed changes

Comment thread server/e2e_tests.go Outdated

Comment thread server/workflow_run.go Outdated

coderabbitai Bot approved these changes Jun 1, 2026

View reviewed changes

yasserfaraazkhan and others added 3 commits June 1, 2026 15:52

feat(cmt): raise CMT version cap from 5 to 10

8af9a48

One cloud instance is provisioned per version, so 10 versions = 10 instances. Allows the full ESR-to-current version set to run in a single CMT run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

yasserfaraazkhan changed the title ~~fix(e2e): webhook-driven CMT + fix non-PR dispatch 422 (mw_tracking_key)~~ fix(e2e): webhook-driven, auto-versioned CMT + fix non-PR dispatch 422 Jun 2, 2026

yasserfaraazkhan requested a review from esarafianou June 2, 2026 10:37

NARSimoes requested changes Jun 4, 2026

View reviewed changes

yasserfaraazkhan and others added 2 commits June 5, 2026 00:44

yasserfaraazkhan requested a review from NARSimoes June 4, 2026 19:56

yasserfaraazkhan and others added 2 commits June 5, 2026 19:50

mm-cloud-bot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed labels Jun 5, 2026

NARSimoes approved these changes Jun 11, 2026

View reviewed changes

yasserfaraazkhan and others added 6 commits June 12, 2026 02:19

Conversation

yasserfaraazkhan commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root-cause fix: undeclared mw_tracking_key → HTTP 422

CMT is webhook-driven (no HTTP endpoints)

Auto-derived CMT version set (no monthly edits)

Cleanup — correct across every flow

Key files

Companion PRs

Verification

Release Note

Uh oh!

mm-cloud-bot commented May 25, 2026

Uh oh!

coderabbitai Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Possibly related PRs

Suggested labels

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

NARSimoes left a comment

Choose a reason for hiding this comment

Uh oh!

NARSimoes Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

yasserfaraazkhan Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

NARSimoes Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

yasserfaraazkhan Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

NARSimoes left a comment

Choose a reason for hiding this comment

Uh oh!

NARSimoes Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

NARSimoes Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

yasserfaraazkhan commented May 25, 2026 •

edited

Loading

Root-cause fix: undeclared `mw_tracking_key` → HTTP 422

coderabbitai Bot commented May 25, 2026 •

edited

Loading