Skip to content

feat: add POST /admin-api/tee/fleet-join endpoint#2091

Merged
chefsale merged 5 commits intomasterfrom
feat/tee-fleet-join-endpoint
Apr 1, 2026
Merged

feat: add POST /admin-api/tee/fleet-join endpoint#2091
chefsale merged 5 commits intomasterfrom
feat/tee-fleet-join-endpoint

Conversation

@chefsale
Copy link
Copy Markdown
Member

@chefsale chefsale commented Mar 31, 2026

Summary

  • New endpoint POST /admin-api/tee/fleet-join { groupId } that triggers the full TEE HA join flow from the fleet node side
  • Called by the fleet sidecar on localhost after receiving an assignment from MDMA
  • Creates a new identity, subscribes to the group topic, generates TDX attestation with bound public key, broadcasts TeeAttestationAnnounce
  • Adds NodeClient::publish_on_group() for publishing on group gossip topics

Test plan

  • Endpoint returns 400 for invalid group_id
  • Endpoint generates attestation and broadcasts on group topic
  • Existing peers receive the announce and verify/admit

Made with Cursor


Note

Medium Risk
Touches TEE attestation generation and group gossip publishing, so failures or timing/peer-detection behavior could impact fleet join reliability. The endpoint is protected, but still introduces new security-sensitive flow around identity + attestation broadcast.

Overview
Adds a new protected admin endpoint POST /admin-api/tee/fleet-join that takes a groupId, creates a fresh identity, generates a non-mock TDX attestation bound to the identity (via SHA-256 hash), subscribes to the group gossip topic, and broadcasts a TeeAttestationAnnounce message (borsh-encoded) to the group; on broadcast failure it unsubscribes.

Extends NodeClient with publish_on_group() which publishes to group/<id> topics after waiting up to 10s for at least one mesh peer (then publishes anyway), and wires the new tee::protected_service() router into the authenticated admin routes. Workspace version is bumped to 0.10.1-rc.16, and server dependencies are updated to include borsh, sha2, and calimero-network-primitives.

Written by Cursor Bugbot for commit fd6231e. This will update automatically on new commits. Configure here.

Fleet sidecar calls this on localhost to trigger the full TEE HA join flow:
1. Creates a new identity for the fleet node
2. Subscribes to the group gossip topic
3. Generates TDX attestation with group_id as nonce and SHA256(public_key) bound to report_data
4. Broadcasts TeeAttestationAnnounce on the group topic
5. Returns status with public_key and attestation info

Also adds NodeClient::publish_on_group() for publishing arbitrary
payloads on group gossip topics.

Made-with: Cursor
Copy link
Copy Markdown

@meroreviewer meroreviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Reviewer

Reviewed by 2 agents | Quality score: 44% | Review time: 138.4s

🟡 2 warnings, 💡 3 suggestions, 📝 1 nitpicks. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-48692531


if let Err(err) = state.node_client.subscribe_group(group_id_bytes).await {
error!(error=?err, "Failed to subscribe to group topic");
return ApiError {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Resource leak: subscription not cleaned up on subsequent failures

If generate_attestation or the broadcast fails after subscribe_group succeeds, the group subscription is never cleaned up, leaking network resources on repeated failed calls.

Suggested fix:

Add cleanup logic (unsubscribe_group) in error paths after line 62, or use a guard/RAII pattern to ensure cleanup on early returns.

.publish_on_group(group_id_bytes, payload)
.await
{
Ok(_) => {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Mock attestation status exposed in API response

Returning is_mock in the response reveals whether the TEE is running in mock/test mode, which helps attackers identify non-production or weakly protected deployments.

Suggested fix:

Remove `is_mock` from the API response; if needed for debugging, only include it when a debug flag is explicitly enabled.

error!(error=?err, "Failed to create identity for fleet join");
return ApiError {
status_code: StatusCode::INTERNAL_SERVER_ERROR,
message: format!("Failed to create identity: {err}"),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Internal error details exposed to clients

Error messages like format!("Failed to create identity: {err}") leak internal implementation details that could aid attackers in understanding the system.

Suggested fix:

Return generic error messages to clients while logging the detailed error internally (already done via `error!` macro).

}
};

let our_public_key = match state.ctx_client.new_identity() {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Orphaned identity on partial failure

If any step after new_identity() fails, the created identity is never used; if identities are persisted, this accumulates orphaned entries over time.

Suggested fix:

Consider deferring identity creation until after subscription succeeds, or document that orphaned identities are acceptable/garbage-collected.

use crate::admin::handlers::validation::ValidatedJson;
use crate::admin::service::{ApiError, ApiResponse};
use crate::AdminState;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 No visible rate limiting for resource-intensive endpoint

This endpoint creates identities, generates TDX attestations, and broadcasts to the network—all potentially expensive operations with no visible rate limiting to prevent abuse.

Suggested fix:

Consider adding rate limiting middleware for this endpoint to prevent resource exhaustion attacks.

pub group_id: String,
}

impl Validate for FleetJoinRequest {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Nit: hex::decode called twice for the same input

Validation calls hex::decode to check validity, then the handler decodes again; the allocation is redundant.

Suggested fix:

Consider storing the decoded bytes in a validated wrapper type, or accept the duplication as acceptable for clarity.

Copy link
Copy Markdown

@meroreviewer meroreviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Reviewer

Reviewed by 2 agents | Quality score: 47% | Review time: 85.4s

🟡 2 warnings, 💡 2 suggestions. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-38e0ad35

}
};

info!(%our_public_key, "Created identity for fleet TEE node");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Resource leak: subscription not cleaned up on subsequent failures

If subscribe_group succeeds but generate_attestation, serialization, or publish_on_group fails, the group subscription remains active without cleanup, accumulating orphan subscriptions over repeated failed attempts.

Suggested fix:

Wrap operations in a transaction-like pattern: if any step after subscribe_group fails, call unsubscribe_group before returning the error.

}
.into_response();
}
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Identity persisted even when join ultimately fails

A new identity is created via new_identity() before subscribing and broadcasting; if subsequent steps fail, this identity may persist as orphaned state that cannot be retried or cleaned up.

Suggested fix:

Consider delaying identity creation until after subscription succeeds, or implement cleanup logic for the identity on failure.

.publish_on_group(group_id_bytes, payload)
.await
{
Ok(_) => {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Response leaks attestation mode (mock vs production)

Returning is_mock in the API response reveals whether the system runs with mock attestations (development) or real TDX hardware, which could inform attackers about the deployment environment.

Suggested fix:

Consider removing `is_mock` from the response payload, or restrict it to debug builds only.

Err(err) => {
error!(error=?err, "Failed to create identity for fleet join");
return ApiError {
status_code: StatusCode::INTERNAL_SERVER_ERROR,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verbose error messages may leak internal details

Error responses include raw internal error strings (e.g., format!("Failed to create identity: {err}")) which could expose implementation details to attackers.

Suggested fix:

Return generic error messages to clients and log detailed errors server-side only.

@chefsale chefsale force-pushed the feat/tee-fleet-join-endpoint branch from 99b06df to 95c14ad Compare April 1, 2026 05:57
Copy link
Copy Markdown

@meroreviewer meroreviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Reviewer

Reviewed by 2 agents | Quality score: 52% | Review time: 148.4s

🟡 2 warnings. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-02f02a44

};

info!(%our_public_key, "Created identity for fleet TEE node");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Subscription not cleaned up on subsequent failures

If any operation fails after subscribe_group succeeds (attestation generation, serialization, or publish), the node remains subscribed to the group topic but the join flow is incomplete, leaving orphaned state.

Suggested fix:

Call `unsubscribe_group` in the error paths after line 60, or restructure to only subscribe after all other operations succeed (attestation/serialization can be done before subscribing).

.into_response();
}
};

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Mock attestations accepted unlike /tee/attest endpoint

This endpoint accepts and broadcasts mock attestations (is_mock=true), while /tee/attest rejects them with NOT_IMPLEMENTED. Broadcasting mock attestations to the network could allow spoofed TEE nodes in non-TDX environments.

Suggested fix:

Add the same mock rejection logic as attest.rs: if attestation.is_mock { return ApiError with NOT_IMPLEMENTED }

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 1, 2026

E2E Rust Apps Failed

One or more E2E workflows (e2e-kv-store, xcall-example) failed after retries.

Please check the workflow logs for more details.

Made-with: Cursor
@chefsale chefsale force-pushed the feat/tee-fleet-join-endpoint branch from 95c14ad to fa08590 Compare April 1, 2026 06:30
Copy link
Copy Markdown

@meroreviewer meroreviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Reviewer

Reviewed by 2 agents | Quality score: 44% | Review time: 178.1s

🟡 3 warnings, 💡 2 suggestions. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-436597b5

};

info!(%our_public_key, "Created identity for fleet TEE node");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Resource leak: subscription not cleaned up on failure

If attestation generation, serialization, or broadcast fails after subscribing to the group topic (line 51), the subscription is never cleaned up.

Suggested fix:

Wrap the post-subscription logic in a pattern that calls `unsubscribe_group` on any error path, similar to `broadcast_specialized_node_invite` cleanup pattern in client.rs.

use crate::admin::service::{ApiError, ApiResponse};
use crate::AdminState;

pub async fn handler(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 No authorization check for group membership

The handler allows any caller to join any group_id without verifying the caller is authorized to join that specific group, potentially allowing unauthorized nodes to inject themselves into TEE groups.

Suggested fix:

Add authorization check to verify the requesting node/sidecar is permitted to join the specified group_id before creating identity and broadcasting attestation.

Ok(())
}

pub async fn publish_on_group(&self, group_id: [u8; 32], payload: Vec<u8>) -> eyre::Result<()> {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Unrestricted group topic publishing

The new publish_on_group method allows publishing arbitrary payloads to any group topic without validating the caller has membership or permission for that group.

Suggested fix:

Add a check that the caller is a member of the group before allowing publish, or document that callers must perform this check.

.into_response();
}
};

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Identity created but not cleaned up on subsequent failures

A new identity is persisted via new_identity() but if any later step fails (subscribe, attestation, broadcast), the orphan identity remains stored.

Suggested fix:

Consider deferring identity creation until after other validations succeed, or document that orphan identities are acceptable and will be garbage collected.


match state
.node_client
.publish_on_group(group_id_bytes, payload)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Information disclosure via is_mock field in response

Exposing is_mock in the API response reveals whether the system is running with real TDX attestation or mock mode, which could help attackers assess the security posture.

Suggested fix:

Consider removing `is_mock` from the response or only including it in debug builds.

Made-with: Cursor
Copy link
Copy Markdown

@meroreviewer meroreviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Reviewer

Reviewed by 2 agents | Quality score: 33% | Review time: 151.7s

🔴 1 critical, 🟡 3 warnings, 💡 1 suggestions. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-e9553ad6

use crate::admin::service::{ApiError, ApiResponse};
use crate::AdminState;

pub async fn handler(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Unauthenticated endpoint creates identities and broadcasts to P2P network

The /admin-api/tee/fleet-join endpoint is registered under public_routes (no session auth layer), allowing any network-reachable client to create identities and broadcast attestation announcements, potentially enabling DoS or network spam attacks.

Suggested fix:

Move `fleet_join` route to `protected_routes` in service.rs, or add explicit authentication check in the handler. If localhost-only access is intended, add middleware to verify requests originate from 127.0.0.1.

info!(%our_public_key, "Created identity for fleet TEE node");

if let Err(err) = state.node_client.subscribe_group(group_id_bytes).await {
error!(error=?err, "Failed to subscribe to group topic");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Missing cleanup on subsequent failures after subscribe_group

If generate_attestation or publish_on_group fails after subscribe_group succeeds, the node remains subscribed to the group topic but returns an error—leaving dangling subscription state.

Suggested fix:

Unsubscribe from the group topic in error paths after line 53, or wrap in a scope that cleans up on failure (e.g., using a guard or explicit unsubscribe calls in each error branch).

}
};

let our_public_key = match state.ctx_client.new_identity() {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 No rate limiting on identity creation

Each request to this endpoint creates a new cryptographic identity via new_identity(). Without rate limiting, an attacker can exhaust storage or memory by repeatedly calling this endpoint.

Suggested fix:

Add rate limiting middleware (e.g., tower-governor or custom limiter) or track pending join attempts to prevent resource exhaustion.

};

let our_public_key = match state.ctx_client.new_identity() {
Ok(pk) => pk,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Created identity not cleaned up on failure

A new identity is created at line 39-48 and persisted, but if any subsequent step fails (subscribe, attestation, publish), the orphaned identity remains stored with no cleanup.

Suggested fix:

Consider either deferring identity creation until all preconditions succeed, or implementing rollback logic to remove the identity on failure.

ValidatedJson(req): ValidatedJson<FleetJoinRequest>,
) -> impl IntoResponse {
info!(group_id=%req.group_id, "Fleet join: subscribing to group and broadcasting attestation");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Logging user-controlled group_id without sanitization

The group_id from the request is logged directly; while hex-validated, extremely long invalid inputs before validation could affect log integrity.

Suggested fix:

Move the info! log after the hex validation succeeds, or truncate the logged value to a safe length.

Copy link
Copy Markdown
Contributor

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

- Wire tee::protected_service() into protected_routes (endpoint was unreachable)
- Replace hardcoded 2s sleep with mesh_peer_count polling in publish_on_group
- Reorder operations: generate attestation + serialize before subscribing
- Reject mock attestations (match /tee/attest behavior)
- Use random nonce instead of static group_id for freshness binding
- Use generic error messages to avoid leaking internals
- Remove is_mock from API response
- Unsubscribe from group on publish failure

Made-with: Cursor
Copy link
Copy Markdown

@meroreviewer meroreviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Reviewer

Reviewed by 2 agents | Quality score: 54% | Review time: 136.8s

🟡 1 warnings, 💡 1 suggestions. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-f06c7f17

// Create identity first (before subscribe, so we can clean up if needed)
let our_public_key = match state.ctx_client.new_identity() {
Ok(pk) => pk,
Err(err) => {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Identity resource leak on error paths

If attestation generation, payload serialization, or group subscription fails after new_identity() succeeds, the created identity is never cleaned up.

Suggested fix:

Track created identity and clean it up in error paths, or use a guard pattern that removes the identity on drop unless explicitly committed.

Ok(())
}

pub async fn publish_on_group(&self, group_id: [u8; 32], payload: Vec<u8>) -> eyre::Result<()> {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 No cancellation or backpressure for 10-second polling wait

The polling loop can block the async handler for up to 10 seconds with no cancellation mechanism, which may cause HTTP client timeouts on slow mesh peer discovery.

Suggested fix:

Consider accepting an optional timeout parameter or using `tokio::select!` with a cancellation token for better responsiveness.

Copy link
Copy Markdown

@meroreviewer meroreviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Reviewer

Reviewed by 2 agents | Quality score: 51% | Review time: 125.9s

🟡 1 warnings, 💡 3 suggestions. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-20eff9da

// Create identity first (before subscribe, so we can clean up if needed)
let our_public_key = match state.ctx_client.new_identity() {
Ok(pk) => pk,
Err(err) => {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Identity not cleaned up on subsequent failures

If attestation generation, serialization, subscribe, or publish fails after new_identity() succeeds, the created identity is leaked and never removed.

Suggested fix:

Add cleanup logic to delete the identity on error paths, or document that orphaned identities are acceptable and garbage-collected elsewhere.

status_code: StatusCode::INTERNAL_SERVER_ERROR,
message: "Failed to subscribe to group".to_owned(),
}
.into_response();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Publish proceeds with zero mesh peers after timeout

After the 10-second wait, publishing proceeds even with no mesh peers, meaning the TeeAttestationAnnounce may be silently lost while returning success to the caller.

Suggested fix:

Consider returning an error or a warning status in the response when no mesh peers were available, so the caller knows the announcement may not have been received.


const MAX_WAIT: std::time::Duration = std::time::Duration::from_secs(10);
const POLL_INTERVAL: std::time::Duration = std::time::Duration::from_millis(500);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Repeated TopicHash clone in polling loop

topic.clone() allocates on each loop iteration (up to 20 times); consider using a reference if the API allows, or accepting the overhead given the infrequent calls.

Suggested fix:

If mesh_peer_count can accept &TopicHash, change to borrow; otherwise document this is acceptable given the 500ms interval.

// and if it fails we haven't leaked a subscription
let pk_hash: [u8; 32] = Sha256::digest(&*our_public_key).into();
let nonce: [u8; 32] = rand::random();
let report_data = build_report_data(&nonce, Some(&pk_hash));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Consider using OsRng for attestation nonce

For cryptographic nonces in TEE attestation, using OsRng directly (via getrandom or rand::rngs::OsRng) is preferred over rand::random() to avoid any theoretical state exposure from the userspace CSPRNG.

Suggested fix:

Replace `rand::random()` with `use rand::RngCore; let mut nonce = [0u8; 32]; rand::rngs::OsRng.fill_bytes(&mut nonce);`

@chefsale chefsale enabled auto-merge (squash) April 1, 2026 08:09
@chefsale chefsale disabled auto-merge April 1, 2026 08:26
@chefsale chefsale enabled auto-merge (squash) April 1, 2026 08:27
@chefsale chefsale disabled auto-merge April 1, 2026 08:27
@chefsale chefsale merged commit 6285f42 into master Apr 1, 2026
15 checks passed
@chefsale chefsale deleted the feat/tee-fleet-join-endpoint branch April 1, 2026 08:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant