Skip to content

xmtp_mls: Client::join_group_by_external_invite (atomic external-commit join)#3674

Open
tylerhawkes wants to merge 1 commit into
tyler/qr-invite-l10-create-invitefrom
tyler/qr-invite-l11-join-by-invite
Open

xmtp_mls: Client::join_group_by_external_invite (atomic external-commit join)#3674
tylerhawkes wants to merge 1 commit into
tyler/qr-invite-l10-create-invitefrom
tyler/qr-invite-l11-join-by-invite

Conversation

@tylerhawkes

@tylerhawkes tylerhawkes commented May 21, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds the join half of the QR-invite flow.

  • Client::join_group_by_external_invite(invite_payload_bytes, encrypted_blob_bytes) -> JoinByExternalInviteOutput<Context>
  • Decodes both protos, unwraps the encrypted GroupInfo, verifies the group_id_hash matches sha256(decrypted.group_id) (defense-in-depth against a malicious service swapping ciphertexts).
  • Validates the payload version and expiry.
  • Builds an atomic external commit: ExternalInit + by-value Adds for the joiner's other installations + AppDataUpdate(s) registering the joiner in the group membership component. Single commit, no fast-follow.
  • Publishes the commit, HPKE-Welcomes the other installations, then re-exports a refreshed GroupInfo at the new epoch (wrapped with the same key + fresh nonce) for the next scanner.
  • Output carries the new MlsGroup and the serialized refreshed blob.

Stack

Test plan

  • Integration coverage via PR-T1.
  • cargo check -p xmtp_mls passes.

Note

Add Client::join_group_by_external_invite for atomic external-commit group joining

  • Adds a new client/external_invite.rs module implementing Client::join_group_by_external_invite, which performs an atomic MLS external commit to join a group from a QR-style invite payload.
  • The method decodes and validates the ExternalInvitePayload, parses GroupInfo, fetches key packages for co-resident installations, and builds an external commit with inline Add proposals and an AppDataUpdate recording the new member.
  • After committing, it persists a StoredGroup row with Pending membership, publishes the commit with an HMAC binding, sends HPKE-wrapped Welcome messages to other installations, validates post-join policy, and returns a re-encrypted GroupInfo blob for upload back to the invite service.
  • Adds JoinByExternalInviteError for structured error reporting across all failure modes in the join flow.
📊 Macroscope summarized 3056397. 1 file reviewed, 0 issues evaluated, 0 issues filtered, 0 comments posted

🗂️ Filtered Issues

No issues evaluated.

@tylerhawkes tylerhawkes requested a review from a team as a code owner May 21, 2026 08:21
@claude

claude Bot commented May 21, 2026

Copy link
Copy Markdown

Claude finished @tylerhawkes's task —— View job


Code Review

  • Read and analyze the implementation
  • Identify issues not already covered by existing reviews
  • Post review feedback

Review Date: 2026-06-10 06:49 UTC

I've reviewed the implementation focusing on issues not already identified by macroscopeapp. Here are additional concerns:

Security & Correctness

1. VerifiableGroupInfo Not Cryptographically Verified Before Use ⚠️ Critical
external_invite.rs:242-245

The verifiable_group_info is extracted but never actually verified before being passed to build_group() at line 407. The variable name suggests it should be verified (typically via .verify() with the group context), but verification is missing. This means a malicious service could provide a GroupInfo with forged signatures or invalid extensions, and the client would blindly join.

Recommendation: Add cryptographic verification of the GroupInfo before use, or confirm that build_group() handles this internally (and add a comment explaining this).


2. Missing Error Detail for Key Package Fetch Failures 🟡 Medium
external_invite.rs:317

When kps_map.get(id) returns Some(Err(_)), the error is discarded and only counted as missing. This loses critical diagnostic information about why key package verification failed (expired, revoked, signature mismatch, etc.). In production, operators won't know whether to retry, refresh identity updates, or treat this as a security incident.

Recommendation:

Some(Err(e)) => {
    tracing::warn!("Key package verification failed for installation {}: {}", hex::encode(id), e);
    missing += 1;
}

3. other_installation_ids.clone() Unnecessary Allocation 🔵 Performance
external_invite.rs:298

other_installation_ids is cloned to pass to get_key_packages_for_installation_ids(), then immediately iterated again at line 305. Since get_key_packages_for_installation_ids likely doesn't need ownership (check signature), consider passing &other_installation_ids or other_installation_ids.as_slice() to avoid the clone.


4. app_data_payload.clone() at Line 367 🔵 Performance
external_invite.rs:367

The app_data_payload is cloned when constructing the AppDataUpdateProposal, but then reused at line 447 for apply_update_payload(). If AppDataUpdateProposal::update() takes ownership, this clone is necessary, but verify whether a reference would suffice.


5. Hardcoded |_| true PSK Acceptor 🟡 Medium
external_invite.rs:462

The external commit builder uses |_| true as the PSK acceptor predicate, meaning all PSKs are blindly accepted without validation. While external commits typically don't use PSKs, this could be a security issue if the group has PSK proposals in the ratchet tree.

Recommendation: Either use a stricter predicate or add a comment explaining why unconditional acceptance is safe in this context.


Code Quality

6. Missing Tracing for Critical Failure Paths 🔵 Observability

The function has good tracing at the entry point (line 212), but critical error paths like MissingInstallations, ExternalGroupIdMismatch, and PostJoinPolicyMissing have no logging before returning. In production, these silent failures make debugging QR-invite issues difficult.

Recommendation: Add tracing::warn! or tracing::error! before returning domain-specific errors.


7. Test Coverage Gap ⚠️ Critical
As noted in the PR description and codecov report, this 753-line feature has 0% test coverage with integration tests deferred to PR-T1. Given the complexity (crypto, network I/O, state persistence, error paths), unit tests for:

  • Error cases (expired blob, wrong key, missing policy, mismatched external_group_id)
  • Edge cases (no other installations, PSK handling, sequence_id=0)
  • Key zeroization (verify key.fill(0) is reached in all code paths)

are strongly recommended before merge.


Previously Identified Issues (Summary)

For completeness, macroscopeapp already identified:

  • High: Hardcoded message_cursor: 0 breaks sync cursors (line 527)
  • High: Publishing before validating external_group_id (line 493)
  • Medium: Error swallowing in persist_joined_group (line 750)
  • Medium: Publish failure leaves inconsistent local state (line 461)
  • Low: Stale timestamp for TTL calculation (line 582)

These should be addressed alongside the new issues above.


@macroscopeapp

macroscopeapp Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

Approvability

Verdict: Needs human review

4 blocking correctness issues found. This PR introduces a substantial new feature (753 lines) implementing QR-invite external commit join flow with complex MLS protocol interactions and network operations. Additionally, there are multiple unresolved review comments including 2 high-severity issues (publishing before validation, hardcoded sync cursor) that warrant human review.

No code changes detected at 3056397. Prior analysis still applies.

You can customize Macroscope's approvability policy. Learn more.

@tylerhawkes tylerhawkes force-pushed the tyler/qr-invite-l10-create-invite branch from ffaaf90 to c12079b Compare May 26, 2026 22:49
@tylerhawkes tylerhawkes force-pushed the tyler/qr-invite-l11-join-by-invite branch 2 times, most recently from df41a55 to dab038e Compare May 26, 2026 22:59
Comment on lines +749 to +750
// End-to-end test coverage for this entry point lives in the QR-invite
// integration test (T-1, libxmtp-integration-test). The join path

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium client/external_invite.rs:749

persist_joined_group discards the Result from stored_group.store_or_ignore(&context.db()) with let _ = , so database errors are silently swallowed and the function returns Ok(()) even when persistence fails. This leaves MLS state persisted (via openmls finalize) but the libxmtp StoredGroup row missing, causing the group to be invisible to libxmtp APIs. Consider propagating the error by returning store_or_ignore(...)? or handling the Err case explicitly.

-    let _ = stored_group.store_or_ignore(&context.db());
+    stored_group.store_or_ignore(&context.db()).map_err(|e| {
+        ClientError::Generic(format!(
+            "join_group_by_external_invite: store_or_ignore failed: {e}"
+        ))
+    })?;
     Ok(())
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file crates/xmtp_mls/src/client/external_invite.rs around lines 749-750:

`persist_joined_group` discards the `Result` from `stored_group.store_or_ignore(&context.db())` with `let _ = `, so database errors are silently swallowed and the function returns `Ok(())` even when persistence fails. This leaves MLS state persisted (via openmls finalize) but the libxmtp `StoredGroup` row missing, causing the group to be invisible to libxmtp APIs. Consider propagating the error by returning `store_or_ignore(...)?` or handling the `Err` case explicitly.

Evidence trail:
crates/xmtp_mls/src/client/external_invite.rs lines 728-751 (REVIEWED_COMMIT): `persist_joined_group` function discards `Result` on line 749 with `let _ =` and returns `Ok(())` on line 750. Caller at line 485 uses `?` operator expecting error propagation.

@codecov

codecov Bot commented May 26, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 177 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (tyler/qr-invite-l10-create-invite@57e7e30). Learn more about missing BASE report.

Files with missing lines Patch % Lines
crates/xmtp_mls/src/client/external_invite.rs 0.00% 177 Missing ⚠️
Additional details and impacted files
@@                         Coverage Diff                          @@
##             tyler/qr-invite-l10-create-invite    #3674   +/-   ##
====================================================================
  Coverage                                     ?   83.82%           
====================================================================
  Files                                        ?      411           
  Lines                                        ?    60484           
  Branches                                     ?        0           
====================================================================
  Hits                                         ?    50700           
  Misses                                       ?     9784           
  Partials                                     ?        0           

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@tylerhawkes tylerhawkes force-pushed the tyler/qr-invite-l11-join-by-invite branch from dab038e to f0d63d4 Compare May 27, 2026 18:17
@tylerhawkes tylerhawkes force-pushed the tyler/qr-invite-l10-create-invite branch from 5ebf785 to e8978a2 Compare May 27, 2026 18:33
@tylerhawkes tylerhawkes force-pushed the tyler/qr-invite-l11-join-by-invite branch from f0d63d4 to a6ec61d Compare May 27, 2026 18:33
@tylerhawkes tylerhawkes force-pushed the tyler/qr-invite-l10-create-invite branch 2 times, most recently from e8ceed3 to 57e7e30 Compare June 6, 2026 05:13
@tylerhawkes tylerhawkes force-pushed the tyler/qr-invite-l11-join-by-invite branch from a6ec61d to b5b9a53 Compare June 6, 2026 05:13
let welcome_bytes = welcome_msg
.tls_serialize_detached()
.map_err(JoinByExternalInviteError::GroupInfoTlsDecode)?;
let welcome_metadata_bytes = WelcomeMetadata { message_cursor: 0 }.encode_to_vec();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 High client/external_invite.rs:527

WelcomeMetadata { message_cursor: 0 } is hard-coded for HPKE-wrapped Welcomes to other installations, but process_new_welcome uses message_cursor to set the recipient's sync cursor. Each co-resident installation will store the membership-change message at cursor 0 instead of the actual commit position, causing subsequent syncs to replay history or mis-order messages. Consider passing the real cursor value from the external commit's publication response.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @crates/xmtp_mls/src/client/external_invite.rs around line 527:

`WelcomeMetadata { message_cursor: 0 }` is hard-coded for HPKE-wrapped Welcomes to other installations, but `process_new_welcome` uses `message_cursor` to set the recipient's sync cursor. Each co-resident installation will store the membership-change message at cursor 0 instead of the actual commit position, causing subsequent syncs to replay history or mis-order messages. Consider passing the real cursor value from the external commit's publication response.

}
if policy.external_group_id.as_slice() != payload_external_group_id.as_slice() {
return Err(JoinByExternalInviteError::ExternalGroupIdMismatch.into());
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low client/external_invite.rs:582

compute_refreshed_blob_expiry uses now_ns captured at the start of the function, so expire_in_ns policies produce a TTL shortened by the full join latency. When that latency exceeds the relative expiry window, the returned refreshed_encrypted_group_info blob is already expired. Consider capturing a fresh now_ns just before the expiry computation so the TTL reflects the actual remaining validity window.

-        let refreshed_expires_at_ns = compute_refreshed_blob_expiry(&policy, now);
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @crates/xmtp_mls/src/client/external_invite.rs around line 582:

`compute_refreshed_blob_expiry` uses `now_ns` captured at the start of the function, so `expire_in_ns` policies produce a TTL shortened by the full join latency. When that latency exceeds the relative expiry window, the returned `refreshed_encrypted_group_info` blob is already expired. Consider capturing a fresh `now_ns` just before the expiry computation so the TTL reflects the actual remaining validity window.

Comment on lines +461 to +469
let (mls_group, bundle) = builder
.build(provider.rand(), provider.crypto(), signer, |_| true)
.map_err(|e| {
JoinByExternalInviteError::ExternalCommitBuilderFailed(format!("build: {e}"))
})?
.finalize(&provider)
.map_err(|e| {
JoinByExternalInviteError::ExternalCommitBuilderFailed(format!("finalize: {e}"))
})?;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium client/external_invite.rs:461

If send_group_messages fails, the method returns PublishError after builder.finalize() has already persisted the new MLS group state locally and persist_joined_group has created the libxmtp wrapper row. The client is left thinking it joined the group at the new epoch even though the external commit never reached XMTP delivery, so the operation is not safely retryable and the local state diverges from the rest of the group. Consider reordering so that publishing happens before local persistence, or implementing cleanup/rollback logic on publish failure.

-        let (mls_group, bundle) = builder
-            .build(provider.rand(), provider.crypto(), signer, |_| true)
-            .map_err(|e| {
-                JoinByExternalInviteError::ExternalCommitBuilderFailed(format!("build: {e}"))
-            })?
-            .finalize(&provider)
-            .map_err(|e| {
-                JoinByExternalInviteError::ExternalCommitBuilderFailed(format!("finalize: {e}"))
-            })?;
+        let (commit, bundle) = builder
+            .build(provider.rand(), provider.crypto(), signer, |_| true)
+            .map_err(|e| {
+                JoinByExternalInviteError::ExternalCommitBuilderFailed(format!("build: {e}"))
+            })?;
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @crates/xmtp_mls/src/client/external_invite.rs around lines 461-469:

If `send_group_messages` fails, the method returns `PublishError` after `builder.finalize()` has already persisted the new MLS group state locally and `persist_joined_group` has created the libxmtp wrapper row. The client is left thinking it joined the group at the new epoch even though the external commit never reached XMTP delivery, so the operation is not safely retryable and the local state diverges from the rest of the group. Consider reordering so that publishing happens before local persistence, or implementing cleanup/rollback logic on publish failure.

created_at_ns,
);

// 8. Publish the PublicMessage external commit to XMTP delivery.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 High client/external_invite.rs:493

join_group_by_external_invite publishes the external commit and device welcomes (lines 503-561) before verifying that the joined group's EXTERNAL_COMMIT_POLICY.external_group_id matches the QR payload (lines 580-582). If the invite service returns a GroupInfo for the wrong group encrypted under the same symmetric key, the client joins and publishes to that wrong group before ExternalGroupIdMismatch is returned. Move the external_group_id validation (steps 10a-10b) before the publish steps (steps 8-9) so the mismatch is caught before any network side effects occur.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @crates/xmtp_mls/src/client/external_invite.rs around line 493:

`join_group_by_external_invite` publishes the external commit and device welcomes (lines 503-561) before verifying that the joined group's `EXTERNAL_COMMIT_POLICY.external_group_id` matches the QR payload (lines 580-582). If the invite service returns a `GroupInfo` for the wrong group encrypted under the same symmetric key, the client joins and publishes to that wrong group before `ExternalGroupIdMismatch` is returned. Move the external_group_id validation (steps 10a-10b) before the publish steps (steps 8-9) so the mismatch is caught before any network side effects occur.

@tylerhawkes tylerhawkes force-pushed the tyler/qr-invite-l10-create-invite branch from 88884e2 to 47d1f52 Compare June 10, 2026 06:42
@tylerhawkes tylerhawkes force-pushed the tyler/qr-invite-l11-join-by-invite branch from b5b9a53 to 3056397 Compare June 10, 2026 06:48
PostCommitGroupInfoEncrypt(#[source] EncryptedGroupInfoError),
}

impl From<JoinByExternalInviteError> for ClientError {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low client/external_invite.rs:170

The From<JoinByExternalInviteError> implementation discards all structured error variants and collapses them into ClientError::Generic, so bindings/node and bindings/wasm format errors via error_code() always report only Generic instead of the specific failure type (malformed invite, stale QR, missing installations, etc.). SDK callers lose the ability to distinguish these failure modes programmatically. Consider implementing a non-generic ClientError variant that preserves the structured error information.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @crates/xmtp_mls/src/client/external_invite.rs around line 170:

The `From<JoinByExternalInviteError>` implementation discards all structured error variants and collapses them into `ClientError::Generic`, so `bindings/node` and `bindings/wasm` format errors via `error_code()` always report only `Generic` instead of the specific failure type (malformed invite, stale QR, missing installations, etc.). SDK callers lose the ability to distinguish these failure modes programmatically. Consider implementing a non-generic `ClientError` variant that preserves the structured error information.

tylerhawkes added a commit that referenced this pull request Jun 11, 2026
…ves) + adapt invite::payload

Regenerates xmtp_proto from xmtp/proto#336 (typed SymmetricKey /
GroupStateHash / ServicePointer newtypes, ExternalCommitPolicyV1
max_uses + refresh_pointers, EncryptedGroupInfoBlobV1 AAD +
effective-expiry semantics, GroupMembershipEntry.V1
admitted_via_external_group_id).

Consuming changes kept to what main already contains:

- invite::payload rewritten for the typed shape: symmetric_key is a
  SymmetricKey submessage (MissingSymmetricKey + material-length
  checks), service_pointer is an optional ServicePointer (absent =
  application-resolved; present-but-empty oneof fails closed;
  https_url parsed and scheme-checked via url). New
  https_service_pointer / opaque_service_pointer /
  validate_service_pointer helpers.

- GroupMembershipEntry.V1 construction sites gain the new
  admitted_via_external_group_id field, empty everywhere: the
  migrator's synthesized entries are all Welcome/legacy members
  (absent is permanently correct), and the membership-update rewrite
  path documents that tag preservation MUST land together with the
  validator's write-once enforcement in the external-commit stack —
  nothing on main can set the tag yet, so empty is correct today.

The rest of the QR-invite stack (#3671 -> #3666 -> #3667 -> #3668 ->
#3673 -> #3674) rebases onto this and picks up the new fields layer
by layer.
tylerhawkes added a commit that referenced this pull request Jun 11, 2026
…ves) + adapt invite::payload

Regenerates xmtp_proto from xmtp/proto#336 (typed SymmetricKey /
GroupStateHash / ServicePointer newtypes, ExternalCommitPolicyV1
max_uses + refresh_pointers, EncryptedGroupInfoBlobV1 AAD +
effective-expiry semantics, GroupMembershipEntry.V1
admitted_via_external_group_id).

Consuming changes kept to what main already contains:

- invite::payload rewritten for the typed shape: symmetric_key is a
  SymmetricKey submessage (MissingSymmetricKey + material-length
  checks), service_pointer is an optional ServicePointer (absent =
  application-resolved; present-but-empty oneof fails closed;
  https_url parsed and scheme-checked via url). New
  https_service_pointer / opaque_service_pointer /
  validate_service_pointer helpers.

- GroupMembershipEntry.V1 construction sites gain the new
  admitted_via_external_group_id field, empty everywhere: the
  migrator's synthesized entries are all Welcome/legacy members
  (absent is permanently correct), and the membership-update rewrite
  path documents that tag preservation MUST land together with the
  validator's write-once enforcement in the external-commit stack —
  nothing on main can set the tag yet, so empty is correct today.

The rest of the QR-invite stack (#3671 -> #3666 -> #3667 -> #3668 ->
#3673 -> #3674) rebases onto this and picks up the new fields layer
by layer.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant