Skip to content

fix(e2e): webhook-driven, auto-versioned CMT + fix non-PR dispatch 422#90

Open
yasserfaraazkhan wants to merge 19 commits into
masterfrom
fix/cmt-direct-dispatch-and-cleanup-endpoint
Open

fix(e2e): webhook-driven, auto-versioned CMT + fix non-PR dispatch 422#90
yasserfaraazkhan wants to merge 19 commits into
masterfrom
fix/cmt-direct-dispatch-and-cleanup-endpoint

Conversation

@yasserfaraazkhan

@yasserfaraazkhan yasserfaraazkhan commented May 25, 2026

Copy link
Copy Markdown
Contributor

Summary

Reworks Compatibility Matrix Testing (CMT) to a webhook-driven, self-updating model, fixes the bug that broke every non-PR E2E dispatch, and makes instance cleanup correct end-to-end across all trigger flows.

Root-cause fix: undeclared mw_tracking_key → HTTP 422

matterwick was sending an mw_tracking_key workflow_dispatch input that none of the target workflows declare. GitHub rejects a dispatch carrying an undeclared input with HTTP 422, so release, scheduled, and CMT dispatches all failed (PR-label worked only because it never sent it). Removed from every dispatch path.

CMT is webhook-driven (no HTTP endpoints)

Since the version set is resolved by matterwick, nothing needs to be passed from the workflow — so the /cmt_dispatch and /cleanup_e2e endpoints (and their CMTTriggerSecret/CleanupSecret) are removed. Instead a lightweight CMT Provisioner workflow fires → matterwick hears workflow_run.requestedhandleCMTTrigger provisions one server per version → dispatches compatibility-matrix-testing.yml.

Trigger (replaces the monthly schedule): CMT now runs on release branches. Desktop fires on push to release-v* (the cut + subsequent pushes); mobile fires only on the release-v* branch cut (on: create). Because create delivers a workflow_run webhook for every branch/tag creation, matterwick gates with shouldTriggerCMT: provision only when the run was a manual workflow_dispatch (any branch) or ran on a release branch. Manual runs from the Actions tab still work for both repos.

Note — intentional duplicate coverage: a release-v* push/cut also triggers matterwick's normal release E2E (handlePushEventE2E: whole suite, latest server). CMT additionally runs its matrix (desktop: whole suite per version; mobile: smoke per version), whose set includes the latest server — so the latest server is exercised twice. Expected and accepted.

Auto-derived CMT version set (no monthly edits)

resolveCMTServerVersions() fetches the mattermost/mattermost GitHub releases at each trigger and builds the set: active ESR line(s) + latest 3 stable minor lines + current RC, latest patch each, deduped, capped at 10 (raised from 5). ESR lines are detected from the release notes ("Extended Support Release"). Config.CMTServerVersions is an optional manual override; defaultCMTServerVersions (10.11.18, 11.7.1) is only a fallback for a GitHub-API outage.

Cleanup — correct across every flow

Two layers, audited end-to-end:

  1. Primary, completion-based (non-PR: CMT / nightly / push-release). On the test workflow's workflow_run completed event, instances are matched by commit-SHA suffix, scoped per flow (CMT vs non-CMT) so two runs sharing a SHA (e.g. the monthly nightly + CMT on the same master commit) can't tear down each other's servers.

    • Fix — key on dispatch-time HEAD SHA. Non-PR flows dispatch with ref: <branch> (workflow_dispatch can't take a SHA), so the run's head_sha is the branch HEAD at dispatch time. Previously we keyed on the trigger SHA, so if the branch advanced during the long provisioning window the primary match missed its own instances. matterwick now resolves the branch HEAD via the commits API right before dispatch and keys on that, so it lines up with the completed run's head_sha. Falls back to the trigger SHA on API error.
  2. Backstop, age-based periodic scan (cleanupStaleE2EInstances). Catches anything the primary path misses.

    • Non-PR: E2EInstanceMaxAge (default 3h).
    • PR: PR instances have no completion-based teardown (they're deliberately reused across label toggles and commits), so they now auto-expire at a longer, separate E2EPRInstanceMaxAge (default 24h). When a PR instance is reaped, its in-memory tracking entry is evicted — this is what makes a re-applied E2E/Run provision a fresh set instead of dispatching against deleted servers.

PR teardown triggers unchanged otherwise: E2E/Reset-Servers label = explicit tear-down-now; PR close = cleanup; removing E2E/Run is intentionally a no-op (so it doesn't kill the still-finishing run).

Key files

  • server/workflow_run.gohandleCMTTrigger; shouldTriggerCMT release-branch/manual gate; flow-scoped SHA cleanup; dispatch-time HEAD SHA re-key for CMT + nightly; removed handleCMTRunCleanup; dropped mw_tracking_key.
  • server/push_events.go — dispatch-time HEAD SHA re-key for push/release.
  • server/e2e_tests.goresolveCMTServerVersions() + cmtServerVersions() + semver comparator; resolveBranchHeadSHA(); cleanupStaleE2EInstances (renamed, now also reaps aged-out PR instances) + evictReapedPRInstances(); removed mw_tracking_key from dispatch.
  • server/config.go + default JSON — CMTServerVersions / CMTTriggerWorkflowName / E2EPRInstanceMaxAge; removed the secrets.
  • Removed server/cmt_dispatch.go, server/cleanup_e2e.go (+ tests).

Companion PRs

mattermost/desktop#3829, mattermost/mattermost-mobile#9800, mattermost/gitops-platform#72.

Verification

go build ./..., go vet ./server/..., and go test ./server/... pass. New unit tests cover resolveBranchHeadSHA, the PR max-age knob, and evictReapedPRInstances (the re-provision-after-expiry guarantee). The CMT resolver was validated against live GitHub data — today it yields 10.11.19, 11.5.7, 11.6.4, 11.7.2, 11.8.0-rc3 (all confirmed as real Docker tags).

🤖 Generated with Claude Code

Release Note

NONE

Yasser Khan and others added 2 commits May 21, 2026 14:46
GitHub's workflow_run webhook payload does not carry workflow_dispatch
inputs (verified against the workflow_run schema), so CMT Provisioner
runs were never able to start provisioning -- the server_versions input
was always read as empty and the handler returned without dispatching
compatibility-matrix-testing.yml.

This change moves CMT trigger out of the webhook path and onto two new
authenticated HTTP endpoints, called directly by the workflows:

  POST /cmt_dispatch  -- cmt-provisioner.yml POSTs server_versions plus
                         full context here; matterwick provisions in a
                         goroutine and returns 202 immediately.
  POST /cleanup_e2e   -- compatibility-matrix-testing.yml POSTs the
                         CMT provisioner's run_id here on completion;
                         matterwick destroys tracked instances under
                         {repo}-cmt-{run_id}-* keys.

Auth: each endpoint takes a shared secret in a request header
(X-Trigger-Token, X-Cleanup-Token) compared with constant-time
comparison. Endpoints reject 503 if the corresponding secret is
unconfigured rather than running unauthenticated.

The broken "requested" branch in handleWorkflowRunEventWithInputs that
read payload.WorkflowRun.Inputs["server_versions"] is removed. The
"completed" branch is kept as a defensive fallback for stuck instances.

Co-Authored-By: Claude <noreply@anthropic.com>
Adds 18 unit tests covering the synchronous validation surface of both
new endpoints: missing/wrong tokens, malformed JSON, missing required
fields, edge inputs (",,," server_versions, unknown repo type, zero
run_id), and the cleanup map-deletion logic across matching,
non-matching, and multi-key scenarios.

Also drops the CheckLimitRateAndAbortRequest call from both handlers.
The /github_event handler keeps it because synchronous webhook
responses may need to make GitHub API calls before answering, but the
new endpoints only call GitHub from background goroutines (cmt_dispatch)
or talk to the cloud provisioner (cleanup_e2e). Keeping the rate check
would force every request to block on a live GitHub round-trip, which
makes the endpoints untestable in unit tests (no live token) for no
production benefit. Inline comments explain the rationale.

End-to-end coverage of the happy path (provisioning + workflow
dispatch) is deliberately left to e2e_dryrun_test.go, which already
mocks the cloud and github clients; duplicating that mocking here adds
noise without catching anything new. The validation rejection paths
above are the bug-prone surface.

Co-Authored-By: Claude <noreply@anthropic.com>
@mm-cloud-bot mm-cloud-bot added the kind/bug Categorizes issue or PR as related to a bug. label May 25, 2026
@mm-cloud-bot

Copy link
Copy Markdown

@yasserfaraazkhan: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

I understand the commands that are listed here

@coderabbitai

coderabbitai Bot commented May 25, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

CMT provisioning routing now matches a configured trigger name and reads configured server versions. Instance cleanup switches from tracking-key in-memory deletion to SHA-suffix scanning; all dispatch functions drop tracking-key inputs and call sites are updated.

Changes

SHA-Based Cleanup and CMT Routing Refactor

Layer / File(s) Summary
CMT Configuration and Version Defaults
server/config.go, config/config-matterwick.default.json
MatterwickConfig gains CMTTriggerWorkflowName and CMTServerVersions, a defaultCMTServerVersions fallback, and CMTVersions() accessor; default config adds trigger name ("CMT Provisioner") and server versions (["10.11.18","11.7.1"]).
CMT Event Routing and Provisioning Trigger
server/workflow_run.go
handleWorkflowRunEventWithInputs now routes only when workflow_run.name equals configured CMTTriggerWorkflowName and action is requested; adds handleCMTTrigger; dispatchCMTWorkflow no longer accepts or emits mw_tracking_key.
E2E Dispatch Signature Cascade
server/e2e_tests.go, server/push_events.go, server/workflow_run.go
Desktop/mobile dispatch helpers remove trackingKey parameters; push-event trigger helpers and nightly dispatch call sites updated to stop passing tracking keys.
E2E Cleanup Architecture: SHA-Based Matching
server/workflow_run.go, server/e2e_dryrun_test.go
Completion cleanup now uses findAndDestroyInstancesBySHA with a cmtOnly flag and SHA-suffix matching; handleCMTRunCleanup removed; new test TestInstanceKeyMatchesSHA validates matching semantics.

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • mattermost/matterwick#89: Overlaps earlier changes removing mw_tracking_key from workflow dispatch inputs and moving cleanup to SHA-based destruction.

Suggested labels

release-note-none

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 27.59% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title accurately reflects the main objective: fixing non-PR dispatch 422 errors (caused by undeclared mw_tracking_key) and reworking CMT to webhook-driven delivery.
Description check ✅ Passed The PR description comprehensively explains the changes, including the root-cause fix for the HTTP 422 bug, the webhook-driven CMT model, cleanup strategy, and key files modified.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/cmt-direct-dispatch-and-cleanup-endpoint

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
server/cmt_dispatch_test.go (1)

47-57: 💤 Low value

Consider using httptest.NewRequestWithContext to satisfy linter.

Static analysis flags httptest.NewRequest usage. While functionally equivalent in tests, using NewRequestWithContext aligns with best practices and silences the linter.

♻️ Suggested fix
 func doCMTDispatchRequest(t *testing.T, s *Server, body, token string) *httptest.ResponseRecorder {
 	t.Helper()
-	req := httptest.NewRequest(http.MethodPost, "/cmt_dispatch", bytes.NewBufferString(body))
+	req := httptest.NewRequestWithContext(context.Background(), http.MethodPost, "/cmt_dispatch", bytes.NewBufferString(body))
 	req.Header.Set("Content-Type", "application/json")
 	if token != "" {
 		req.Header.Set("X-Trigger-Token", token)
 	}
 	w := httptest.NewRecorder()
 	s.handleCMTDispatch(w, req)
 	return w
 }

You'll need to add "context" to the imports.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/cmt_dispatch_test.go` around lines 47 - 57, The test helper
doCMTDispatchRequest uses httptest.NewRequest which the linter flags; update
doCMTDispatchRequest to use httptest.NewRequestWithContext with a background
context (e.g., context.Background()) when creating the request, and add
"context" to the imports; specifically modify the call in doCMTDispatchRequest
(the NewRequest call that constructs req) to NewRequestWithContext and ensure
the X-Trigger-Token header handling and subsequent call to s.handleCMTDispatch
remain unchanged.
server/cleanup_e2e_test.go (1)

35-45: 💤 Low value

Consider using httptest.NewRequestWithContext to satisfy linter.

Same as in cmt_dispatch_test.go, static analysis flags httptest.NewRequest. Using NewRequestWithContext would align with the linter expectations.

♻️ Suggested fix
 func doCleanupE2ERequest(t *testing.T, s *Server, body, token string) *httptest.ResponseRecorder {
 	t.Helper()
-	req := httptest.NewRequest(http.MethodPost, "/cleanup_e2e", bytes.NewBufferString(body))
+	req := httptest.NewRequestWithContext(context.Background(), http.MethodPost, "/cleanup_e2e", bytes.NewBufferString(body))
 	req.Header.Set("Content-Type", "application/json")
 	if token != "" {
 		req.Header.Set("X-Cleanup-Token", token)
 	}
 	w := httptest.NewRecorder()
 	s.handleCleanupE2E(w, req)
 	return w
 }

You'll need to add "context" to the imports.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/cleanup_e2e_test.go` around lines 35 - 45, Replace httptest.NewRequest
with httptest.NewRequestWithContext in doCleanupE2ERequest and pass a context
(e.g., context.Background()); also add "context" to the imports. Specifically,
update the call in doCleanupE2ERequest (the helper that builds the POST to
"/cleanup_e2e") to use httptest.NewRequestWithContext(context.Background(),
http.MethodPost, ... ) and ensure the package imports include "context".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@server/cleanup_e2e_test.go`:
- Around line 35-45: Replace httptest.NewRequest with
httptest.NewRequestWithContext in doCleanupE2ERequest and pass a context (e.g.,
context.Background()); also add "context" to the imports. Specifically, update
the call in doCleanupE2ERequest (the helper that builds the POST to
"/cleanup_e2e") to use httptest.NewRequestWithContext(context.Background(),
http.MethodPost, ... ) and ensure the package imports include "context".

In `@server/cmt_dispatch_test.go`:
- Around line 47-57: The test helper doCMTDispatchRequest uses
httptest.NewRequest which the linter flags; update doCMTDispatchRequest to use
httptest.NewRequestWithContext with a background context (e.g.,
context.Background()) when creating the request, and add "context" to the
imports; specifically modify the call in doCMTDispatchRequest (the NewRequest
call that constructs req) to NewRequestWithContext and ensure the
X-Trigger-Token header handling and subsequent call to s.handleCMTDispatch
remain unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7b08b625-de55-47f7-b1f4-061630a8a168

📥 Commits

Reviewing files that changed from the base of the PR and between 68ce3b7 and 1dd004a.

📒 Files selected for processing (7)
  • server/cleanup_e2e.go
  • server/cleanup_e2e_test.go
  • server/cmt_dispatch.go
  • server/cmt_dispatch_test.go
  • server/config.go
  • server/server.go
  • server/workflow_run.go

The "Primary" lookup at handleWorkflowRunEventWithInputs read
payload.WorkflowRun.Inputs["mw_tracking_key"], which is the same field
that broke CMT provisioning -- GitHub's workflow_run webhook payload
does not include workflow_dispatch inputs (verified against the Octokit
schema and the REST API for a real workflow run). The field is always
empty in production, so the "primary" branch never fires; only the
SHA-based fallback runs. That fallback works because every tracking-key
format ends with -{sha}.

Removing the dead read eliminates the misleading "Primary"/"Fallback"
labels and makes the SHA-based scan the documented canonical mechanism.
The mw_tracking_key value is still written when dispatching test
workflows -- it remains queryable via the REST API for any future code
path that wants to do a key-based lookup -- but the webhook-driven
cleanup intentionally does not depend on it.

Co-Authored-By: Claude <noreply@anthropic.com>
…g_key

CMT now runs off a lightweight scheduled trigger workflow that matterwick
hears as a workflow_run event, provisioning one server per version from a
hardcoded set in config (CMTServerVersions) and dispatching
compatibility-matrix-testing.yml. This removes the /cmt_dispatch and
/cleanup_e2e HTTP endpoints (and their CMTTriggerSecret/CleanupSecret config),
which existed only to pass server_versions past the workflow_run webhook.

Also stop sending the undeclared `mw_tracking_key` workflow_dispatch input:
GitHub rejects a dispatch carrying an undeclared input with HTTP 422, which
broke every non-PR dispatch (release, scheduled, and CMT). Cleanup is keyed off
the test workflow's workflow_run completed event, matched by commit SHA.

- workflow_run.go: add handleCMTTrigger, provision on CMTTriggerWorkflowName 'requested'; remove handleCMTRunCleanup fallback
- config.go: add CMTServerVersions (+ defaultCMTServerVersions) and CMTTriggerWorkflowName; remove CMTTriggerSecret/CleanupSecret
- remove cmt_dispatch.go, cleanup_e2e.go and their tests
- drop dead trackingKey plumbing from desktop/mobile/CMT dispatch paths

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@yasserfaraazkhan yasserfaraazkhan changed the title fix(cmt): add /cmt_dispatch and /cleanup_e2e endpoints fix(e2e): webhook-driven CMT + fix non-PR dispatch 422 (mw_tracking_key) Jun 1, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/e2e_tests.go`:
- Around line 1051-1055: The dispatch uses branch "ref" which can advance and
desync cleanup keyed by commit SHA; update dispatchDesktopE2EWorkflow and
dispatchMobileE2EWorkflow to pin the run to the tracked commit by including the
commit SHA as a declared workflow input (e.g., "tracked_sha") and pass sha into
the workflow inputs, or alternatively set "ref" to the exact commit SHA when
calling the GitHub Actions API; also update the corresponding workflow files to
declare the tracked_sha input (so workflow_run.inputs can be consulted) and
adjust the cleanup path (findAndDestroyInstancesBySHA / createCloudInstallation
cleanup logic) to read workflow_run.inputs.tracked_sha (or ensure head_sha
equals the tracked SHA) so provisioning and cleanup use the same unique
identifier.

In `@server/workflow_run.go`:
- Around line 108-113: The cleanup currently calls
s.findAndDestroyInstancesBySHA(repoName, headSHA) which destroys all entries
matching {repo}-{sha} and will tear down concurrent runs; change the cleanup to
preserve and use the per-dispatch correlation key (runID) instead of SHA-only.
Update the webhook handling to extract runID from the payload and call a new or
updated method (e.g., s.findAndDestroyInstancesByRunID(repoName, headSHA, runID)
or s.findAndDestroyInstances(repoName, headSHA, runID)) so matching uses the
full tracking key that includes runID, and only fall back to SHA-only logic when
runID is absent; modify the matching/teardown logic inside the corresponding
method to filter by runID (and repoName/headSHA as needed) rather than deleting
all {repo}-{sha} entries.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c8ce2a0e-2229-48cb-b0e6-4517747ca6a5

📥 Commits

Reviewing files that changed from the base of the PR and between d0e3d60 and e0c42f2.

📒 Files selected for processing (5)
  • config/config-matterwick.default.json
  • server/config.go
  • server/e2e_tests.go
  • server/push_events.go
  • server/workflow_run.go
✅ Files skipped from review due to trivial changes (1)
  • config/config-matterwick.default.json

Comment thread server/e2e_tests.go Outdated
Comment thread server/workflow_run.go Outdated
…t reaped

When two flows fire on the same commit SHA (notably the monthly CMT trigger and
the nightly trigger on the same master HEAD), the completing test workflow would
destroy the other flow's still-active servers, because cleanup matched every
{repo}-…-{sha} key. Scope cleanup to the completing workflow's flow: CMT test
completions reap only {repo}-cmt-… keys, other E2E completions reap only
non-CMT (push/scheduled) keys.

We can't key on a run ID as CodeRabbit suggested: GitHub's workflow_run payload
carries the *test* run's id (not the trigger run id embedded in the tracking
key) and omits workflow_dispatch inputs, so head_sha + flow is the only viable
correlation. Branch-advance SHA desync remains handled by the periodic orphan
cleanup scan.

Extracted instanceKeyMatchesSHA + added TestInstanceKeyMatchesSHA covering the
collision.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
yasserfaraazkhan and others added 3 commits June 1, 2026 15:52
….18, 11.7.1)

Use the latest patch of the v10.11 ESR line and the v11.7 feature line (both
verified present on Docker Hub) instead of the .0 GA placeholders, so CMT tests
the currently-shipping builds. Keeps the code default in sync with the gitops
config (mattermost/gitops-platform#72).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
One cloud instance is provisioned per version, so 10 versions = 10 instances. Allows
the full ESR-to-current version set to run in a single CMT run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Instead of a hardcoded list that needs monthly edits, matterwick now resolves the
CMT version set at trigger time from the mattermost/mattermost GitHub releases:
the active ESR line(s) + the latest 3 stable minor lines + the current release
candidate, latest patch of each, deduped and capped at 10.

- resolveCMTServerVersions() (server/e2e_tests.go): fresh fetch per CMT trigger (no
  cache; CMT runs rarely), per_page=100, parses tag_name/draft/prerelease/body.
  ESR lines detected via 'Extended Support Release' in the release body. RC included
  only when newer than the latest stable. Falls back to defaultCMTServerVersions on
  API error. Adds a small internal semver parser/comparator (no new dependency).
- cmtServerVersions(): explicit Config.CMTServerVersions is an optional manual
  override; empty (normal) => auto-resolve. Removed the old Config.CMTVersions().
- handleCMTTrigger uses s.cmtServerVersions().
- Tests: auto-derived set, manual override (no API call), API-error fallback, RC
  exclusion when stale, and parseCMTVersion edge cases. Raised the cap test to 10.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@yasserfaraazkhan yasserfaraazkhan changed the title fix(e2e): webhook-driven CMT + fix non-PR dispatch 422 (mw_tracking_key) fix(e2e): webhook-driven, auto-versioned CMT + fix non-PR dispatch 422 Jun 2, 2026

@NARSimoes NARSimoes left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yasserfaraazkhan , just a few clarifications.

"E2EInstanceMaxAge": 6
"E2EInstanceMaxAge": 6,
"CMTTriggerWorkflowName": "CMT Provisioner",
"CMTServerVersions": ["10.11.18", "11.7.1"]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we planning to update these versions? Might better discuss in DM.

If understood correctly, the logic is to set "CMTServerVersions: []" in order to allow matterwick get the latest supported version and provisions instances for those versions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NARSimoes https://github.com/mattermost/gitops-platform/pull/72/changes#diff-d242a4ffee35793c04e9591a799f9136234ba74b276617420390641ae142aa9fR96

the deployed values will be empty so it will fetch the latest versions. In this config the values is just placeholders for demo purpose.

Comment thread server/workflow_run.go
req, err := client.NewRequest("POST",
fmt.Sprintf("/repos/%s/%s/actions/workflows/compatibility-matrix-testing.yml/dispatches", repoOwner, repoName),
map[string]interface{}{
"ref": branch,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sanity check: It seems for the new CMT logic we are provisioning using headBranch as ref but the clean up findAndDestroyInstancesBySHA is using headSHA. If this is accurate we'll keep leftovers for CMT runs

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @NARSimoes , the CMT will now use the SHA and clean up after completion

Comment thread server/config.go
yasserfaraazkhan and others added 2 commits June 5, 2026 00:44
Non-PR flows (CMT, nightly/scheduled, push/release) provision instances
keyed by the trigger SHA, then dispatch the test workflow with ref=branch
(workflow_dispatch cannot take a commit SHA). GitHub records the dispatched
run's head_sha as the branch HEAD at dispatch time, which can differ from
the trigger SHA when the branch advances during the (often long)
provisioning window. Completion cleanup matches instances by the run's
head_sha, so a moved branch made the primary SHA-match miss its instances.

Resolve the branch HEAD via the GitHub commits API right before dispatch
and key the tracking entry on that SHA so it aligns with the run's
head_sha. Falls back to the trigger SHA on error; the periodic age-based
scan remains the backstop either way.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PR E2E instances had no completion-based teardown and were explicitly
skipped by the periodic scan, so a long-open PR could keep its servers
alive indefinitely (only PR-close or the E2E/Reset-Servers label tore
them down).

Extend the periodic scan to also reap PR instances past a longer,
separate max-age (E2EPRInstanceMaxAge, default 24h vs 3h for non-PR), and
evict the reaped PR's in-memory tracking entry. Eviction is what makes a
re-applied E2E/Run provision a fresh set: with the stale entry gone, the
reuse path in handleE2ETestRequest finds no live instances and creates
new servers instead of dispatching against deleted ones.

Renames cleanupStaleNonPRE2EInstances -> cleanupStaleE2EInstances since it
now handles both kinds.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
yasserfaraazkhan and others added 2 commits June 5, 2026 19:50
CMT is now driven by release-branch events instead of a monthly schedule:
desktop fires on push to release-v* (the cut + subsequent pushes), mobile
fires only on the release-v* branch cut via `on: create`. Because `create`
delivers a workflow_run webhook for every branch/tag creation (even when
the trigger workflow's job is skipped), Matterwick must gate independently.

Add shouldTriggerCMT: provision CMT only when the run was a manual
workflow_dispatch (any branch) or ran on a release branch (head_branch
matches isReleaseBranch). This filters mobile's create-event noise and
non-release pushes while keeping ad-hoc manual runs working.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The compatibility-matrix-testing.yml cleanup step that consumed cmt_run_id
was removed (Matterwick now reaps servers by commit SHA on workflow_run
completion), leaving the input dead. Drop it from the dispatch payload and
remove the now-unused sha/runID params from dispatchCMTWorkflow.

This pairs with removing the cmt_run_id input declaration from the
desktop/mobile CMT workflows. Deploy ordering: this Matterwick image must
be live before/with those workflow changes — once the input is undeclared,
sending it would 422 (the mw_tracking_key class). Prod matterwick (v0.4.12)
does not dispatch these workflows, so the coordinated rollout is safe.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@mm-cloud-bot mm-cloud-bot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed labels Jun 5, 2026
Replace branch-based CMT triggering with RC-tag triggering. The desktop &
mobile RC cut (e.g. v6.2.0-rc.1) is the moment the release team actually
wants compatibility-matrix validation — earlier than a release branch
push, scoped to the artifact that will ship.

For tag pushes GitHub sets workflow_run.head_branch to the tag name (no
refs/tags/ prefix), so the gate matches the tag string directly:
  - isRCTag: ^v?\d+\.\d+\.\d+-rc[.\-]?\d+$
    matches v6.2.0-rc.1, v2.41.0-rc.10, 6.2.0-rc1; rejects GA tags
    (v6.2.0), beta/alpha (v1.0.22-beta), and nightly (6.3.0-nightly.x).
  - shouldTriggerCMT also keeps release-branch acceptance as
    defense-in-depth and continues to allow manual workflow_dispatch on
    any ref.

Unit tests lock in the matching/rejection set so future tag conventions
can't silently slip through.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@NARSimoes NARSimoes left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two minor / nit comments that can be addressed in a a follow up. Looks good to me but would be nice a matterwick owner do a sanity check on edge cases before merge

Comment thread server/workflow_run.go
// desktop and mobile repos. It identifies CMT test-workflow completions so SHA-based cleanup
// only reaps CMT instances (keys "{repo}-cmt-…-{sha}") and leaves a concurrent nightly/push
// run on the same SHA untouched (and vice versa).
const cmtTestWorkflowName = "Compatibility Matrix Testing"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curiosity: any particular reason to have CMT hardcoded while others are config driven?

Comment thread server/e2e_tests.go

const maxVersions = 10
if len(chosen) > maxVersions {
chosen = chosen[len(chosen)-maxVersions:] // keep the newest if somehow over the cap

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think it's effectively the same due the amount of versions. Just highlighting to double check: the resolveCMTServerVersions keeps the newest 10 while handleCMTWithServerVersions the first 10 entris

yasserfaraazkhan and others added 6 commits June 12, 2026 02:19
Mobile doesn't cut RC git tags; its analog is the `build-release-NNN`
branch (3+ digit build number, e.g. build-release-786 / build-release-1100)
created by the "Mattermost Mobile Release" workflow when a release build
is dispatched to TestFlight / Play Store. Treat that as the RC moment for
mobile and run CMT against it.

  - New isBuildReleaseBranch helper with regex `^build-release-\d{3,}$`
    (3+ digits — rejects test artifacts like build-release-1 without
    locking the upper bound, so a future move to 5-digit numbers needs
    no code change). Platform-specific variants build-release-ios-NNN,
    -sim-NNN, -android-NNN are intentionally excluded (don't start with
    a digit after the prefix).
  - shouldTriggerCMT extended to accept this in addition to manual
    dispatch / RC tag / release branch.
  - TestIsBuildReleaseBranch + extended TestShouldTriggerCMT cover the
    boundary cases: real 3/4/5-digit build numbers (true), 1- and 2-digit
    test artifacts (false), platform variants (false), no overlap with
    existing gates.

The paired mobile workflow change adds the `build-release-[0-9][0-9][0-9]*`
branch pattern to cmt-provisioner.yml's push trigger.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The release team treats RCs as their go/no-go test gate and re-cuts on
failures. Desktop CMT already runs the whole E2E suite across the
multi-version matrix at each RC tag cut, and that matrix includes the
latest server — so the per-push release E2E was redundant coverage for
desktop. Skip it for desktop release-branch pushes.

Mobile is unchanged: mobile CMT runs only smoke tests, so the whole-suite
release E2E on `release-*` push remains the per-commit signal there.

Other auto-triggers are unaffected:
  - master/main push trigger: still fires
  - nightly trigger: still fires
  - PR label trigger: still fires
  - mobile release-branch push: still fires

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Every push to a mobile release-2.X branch was firing a full PR-style E2E
suite — 3 cloud servers + 10 macOS shards + 10 ubuntu shards + iOS/Android
builds, ~60-90 min wall-clock, ~$27/push in GitHub Actions alone.

Recent cadence on real release branches:
  release-2.41: 191 commits  (~$5,160)
  release-2.40: 158 commits  (~$4,270)
  release-2.39: 114 commits  (~$3,080)
  release-2.38:  90 commits  (~$2,430)
  release-2.37:  52 commits  (~$1,400)

~$16K across these 5 branches alone, with ~15% of runs triggered by pure
noise (translation updates, "Bump app build number", docs/CI tweaks) that
can't meaningfully fail differently than the previous run.

The release team treats RC builds as the test gate (re-cut on failure),
which makes per-cherry-pick E2E during stabilization redundant. Same
logic that already removed the desktop release-branch trigger; now
applied to mobile too.

After this change, auto E2E coverage on release-side:
  - master/main push -> full PR-style E2E (unchanged, single per-commit signal)
  - build-release-NNN push -> CMT (smoke x N server versions, mobile only)
  - RC tag cut -> CMT (desktop only; mobile doesn't tag)
  - PR label E2E/Run -> full PR-style E2E (unchanged)

Full-suite-on-build-release-NNN is NOT wired today (separate decision).

E2EAutoTriggerOnRelease and E2EReleasePatternPrefix config fields remain
in place — still used by isReleaseBranch() as defense-in-depth in the CMT
workflow_run gate (shouldTriggerCMT). Leaving them avoids an unrelated
gitops/config diff in this change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Every push to mobile main was firing a full PR-style E2E suite (3 cloud
servers, 10 macOS shards + 10 ubuntu shards, ~$27/push, ~60-90 min). Main
sees ~5-20 merges/day during active development -> tens of thousands of
$/year for post-merge regression signal that PR-label E2E already covers
pre-merge.

Skip the master/main gate for mobile; desktop master push still fires.

Coverage that remains for mobile main:
  - PR-label E2E pre-merge (developer-driven, single PR's branch tested)
  - Manual workflow_dispatch
  - CMT (smoke x N versions) on build-release-NNN cut

Full-suite E2E on build-release-NNN remains ungated. If desired, two ways
to wire it:
  1. Separate matterwick push-handler gate for isBuildReleaseBranch
  2. Promote latest version inside CMT to whole-suite (covers RC fully
     without a parallel matterwick path)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…er code

The mobile CMT_MATRIX now marks the highest-semver server entry as
`latest: true`. The mobile workflow uses that flag (via expressions in
`with:`, which has matrix context unlike job-level `if:`) to run the
whole detox suite against latest and smoke-only against older versions
— closing the full-suite-on-RC gap without a separate matterwick path.

Matterwick stays repo-agnostic: it only signals "which is latest". The
test-directory / parallelism choices that flow from that live in the
mobile workflow, where the repo's structure is known.

Dead-code cleanup now that the surrounding triggers are gone:

  - handleNightlyE2ETrigger (~95 lines) + its workflow_run gate. Both
    repos removed e2e-nightly-trigger.yml in prior commits, so no
    workflow with name "E2E Nightly Trigger" can fire matterwick on
    workflow_run.requested any longer.
  - E2ENightlyTriggerWorkflowName config field — no readers left.
  - E2EAutoTriggerOnRelease config field — the release-branch push gate
    that read it was removed in the prior commit; no readers left.
  - runType MASTER/RELEASE branching in trigger*ForPushEvent — only
    master/main pushes reach those functions now (release-branch and
    mobile-main paths are skipped earlier), so RELEASE is unreachable.
    Hardcode "MASTER".
  - Stale log fields (auto_release, release_pattern_prefix,
    is_release_branch) in the push-event fallthrough log.

What stays in use:
  - isReleaseBranch + E2EReleasePatternPrefix — defense-in-depth in
    shouldTriggerCMT (the CMT workflow_run gate also accepts release-*
    refs from manual dispatches).
  - cmtOnly cleanup scoping — cheap insurance against hypothetical
    same-SHA collisions between flows.

Net: -158 lines (-95 nightly handler, -36 config refs, -27 misc), +new
tests for the latest-marker boundary cases (5-element ESR+minors+RC,
stable vs same-X.Y.Z RC, multi-digit rc.N, single version, all
unparseable fallback).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Matterwick's nightly handler was removed in prior commits; this dispatch
function still took a `nightly bool` parameter and sent `"nightly": ...`
in the e2e-functional.yml workflow inputs map. Push-event callers always
passed `false`, and the PR-label path was a separate function that never
went through here at all — so the param was dead.

Companion change in desktop (paired): remove the `nightly` workflow input
from e2e-functional.yml + e2e-functional-template.yml and the
`if inputs.nightly` block that built `desktop-nightly-${OS}` build IDs.

  - dispatchDesktopE2EWorkflow: drop `nightly bool` parameter
  - drop "nightly": fmt.Sprintf("%t", nightly) from workflowInputs
  - update sole caller (triggerDesktopE2EWorkflowForPushEvent)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Categorizes issue or PR as related to a bug. release-note-none Denotes a PR that doesn't merit a release note.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants