Add Events design sketch proposal#1
- Mermaid sequence diagrams for poll/push/webhook delivery
- Appendix: end-to-end GitHub example with Redis sub store
- Push: clarify multiple concurrent streams allowed; note HTTP/2 dependency
- Webhook: specify 2xx ACK requires durability or end-consumer processing
- Use nextPollSeconds consistently
| - **SDK-level polling, not LLM-level polling.** The client SDK drives the polling loop without burning LLM inference tokens. The LLM is only invoked when events actually arrive. | ||
| - **No durable subscription state.** Poll holds no subscription state at all. Push scopes subscription state to connection lifetime. Webhook uses soft state with mandatory TTL — the server holds subscriptions in memory, but they expire automatically if the client stops refreshing. The client is always the source of truth across all three modes. (This principle is about *subscription* state. Servers that relay from upstream sources will still hold other state — upstream credentials, webhook registrations with the upstream, an in-memory event buffer — but none of it is owed to any particular MCP client and all of it is outside the protocol's concern.) | ||
| - **Client owns subscription state.** In all modes, the client holds the canonical list of subscriptions. For poll and push, the server has no subscription state at all. For webhook, the server holds TTL-scoped subscription records, but the client drives their lifecycle via periodic refresh. | ||
| - **Event payloads are untrusted data.** The spec must be explicit that event payloads carry the same injection risks as tool results. |
What should applications do about this? What is our current guidance on how to treat tool results as injection risks?
I think MCP is generally (and perhaps intentionally) a bit vague on this. In practice, it could mean things like scanning for prompt injection attacks, perhaps audit logging, etc.
|
One clarification - are UI-originated events (user interactions) in scope of this, or is this strictly about external/backend events? Because the delivery patterns and the 'how does it reach the LLM' answer may be quite different for the two cases? |
This is for MCP, so it is about events between an MCP client and an MCP server. Hosts are welcome to, e.g. spin up a local MCP server and publish events for UI stuff over an MCP channel if that helps them in some way, but I don't imagine that to be a common interaction pattern. For that, it would likely make sense for the host to just have its own in-process event system that doesn't use MCP (but could hook into the same handlers if needed). |
…ity, security hardening
Resolve internal contradictions and design gaps found in review:
- Webhook cursors are now client-owned (per-event retry, no server watermark); name/params/delivery.url are immutable identity fields; unsubscribe requires delivery.url
- Unify error/terminated shape across modes; define heartbeat and terminated notifications; renumber error codes to -32011..-32016 to avoid base-spec collision
- SSRF: validate at delivery time, block fe80::/10, MUST NOT follow redirects; HMAC: X-MCP-Subscription-Id header, hex-lowercase encoding, mandatory timestamp, retry regenerates
- events/stream scoped to event notifications only (existing GET SSE retained); StreamEventsResult send rules clarified per transport
- Broadcast emit now uses author-supplied match/transform hooks with ctx; poll lease keyed on (principal, eventName, hash(params)); eventId SHOULD be upstream's stable identifier
- Qualify at-least-once for emit-only types; refresh reactivates suspended delivery; capability uses listChanged
- Add Open Questions on multi-event-name subscriptions and ownership-verification handshake
Add design-sketch-revision-deltas.md documenting all changes for implementations built against the prior revision.
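The SSRF item in this commit (validate at delivery time, block fe80::/10) could look roughly like the sketch below. It is only an illustration: the function names `is_blocked_ip` and `resolve_for_delivery` are hypothetical, and the blocklist beyond link-local (loopback, private, multicast, unspecified, reserved ranges) is an assumption rather than something the commit message spells out.

```python
import ipaddress
import socket


def is_blocked_ip(address: str) -> bool:
    """Return True if a resolved address should never receive a webhook POST."""
    ip = ipaddress.ip_address(address)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so the IPv4 rules apply.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return (
        ip.is_link_local      # includes fe80::/10 and 169.254.0.0/16
        or ip.is_loopback
        or ip.is_private
        or ip.is_multicast
        or ip.is_unspecified
        or ip.is_reserved
    )


def resolve_for_delivery(host: str, port: int = 443) -> list:
    """Resolve the callback host right before each delivery and re-check it.

    Checking at delivery time (not only at subscribe time) guards against a
    hostname that is re-pointed at an internal address after subscription.
    """
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = [info[4][0] for info in infos]
    if not addresses or any(is_blocked_ip(a) for a in addresses):
        raise ValueError(f"refusing to deliver to {host}")
    return addresses
```

A sender would then connect to one of the returned addresses directly and treat any redirect response as a failure, since the commit says redirects MUST NOT be followed.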
|
|
||
| Push state is scoped to the lifetime of the `events/stream` connection — when the connection closes, all loops and listeners stop. | ||
|
|
||
| **For webhook mode:** The SDK uses the same two patterns (poll-driven or direct emit), but instead of writing to an SSE stream, it POSTs events to the subscriber's callback URL with HMAC signatures. The SDK holds webhook subscriptions in memory with TTL — no external storage is required. If the server restarts, all webhook subscriptions are lost. Clients will re-subscribe on their next refresh cycle, passing their last-persisted cursor; for event types backed by a durable upstream this resumes without gaps, while emit-only event types lose events that occurred during the outage. This is by design: the mandatory TTL + refresh mechanism eliminates the need for durable subscription storage. |
question: Is sending an event after TTL expiry considered non-conformance? Implicitly I think yes, but there are certainly some infrastructural considerations here that a softer TTL limit would help with.
I think it's allowed, but they should expect an error back from the client that the sub is not found. There are races here, and that's fine, server should just expect it near the TTL boundary.
|
|
||
| There are three delivery modes with different subscription mechanisms: | ||
|
|
||
| - **Poll mode:** Client calls `events/poll` with event name, params, and cursor. No separate subscribe step needed — the first poll with a null cursor bootstraps the subscription. Server holds no protocol-required state (the SDK MAY track an ephemeral poll lease for lifecycle hooks; see *Unsubscribe timing by mode*). |
question: Certainly more philosophical in nature: we could theoretically build this on top of resource templates. What would the relationship be between this shape and a resources/read response?
Not sure I follow the relationship to resource templates. Do you mean e.g. define a template that is something like event://{slack-channel-id}/messages/{cursor}?
| - `id` is a client-provided identifier for each subscription. It is opaque to the server and echoed back in responses to allow the client to correlate results with subscriptions. It must be unique within a single `events/poll` request. | ||
| - `cursor` is opaque to the client. The client stores it and passes it back on the next poll. A `null` cursor means "start from now" — the server returns no events and provides a fresh cursor for subsequent polls. | ||
| - `eventId` enables client-side deduplication across polls (e.g., after a crash/restart). It is **server-assigned**: when the upstream source provides a stable event identifier (Stripe `evt_*`, GitHub delivery GUID, Kafka offset, Gmail message ID), the server SHOULD use that value as `eventId` so that the same upstream event surfaced via multiple paths (e.g., webhook emit and poll backfill) carries the same `eventId` and dedup works. The SDK auto-generates an `eventId` only when the author supplies none. | ||
| - `maxEvents` is an optional top-level field that caps the number of events returned per subscription. If more events are available than the limit, the server returns a partial batch with an intermediate cursor and sets `hasMore: true`. The client SHOULD poll again immediately (ignoring `nextPollSeconds`) to drain the backlog. If omitted, the server uses its own default limit. |
nitpick: "events per subscription" is a bit ambiguous here, would be nice to make it clear it's per subscription in the poll context
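The poll fields described above (`cursor`, `eventId`, `maxEvents`, `hasMore`, `nextPollSeconds`) suggest a client loop along the following lines. This is a minimal sketch, not SDK code: `poll_until_drained` and the `poll_fn(cursor, max_events)` callable are hypothetical stand-ins for a single-subscription `events/poll` call that returns a dict with those fields.

```python
from typing import Callable, List, Optional, Set, Tuple


def poll_until_drained(
    poll_fn: Callable[[Optional[str], int], dict],
    cursor: Optional[str],
    seen_ids: Set[str],
    max_events: int = 100,
) -> Tuple[List[dict], Optional[str], Optional[float]]:
    """Poll repeatedly while the server reports a backlog.

    Returns (new_events, latest_cursor, suggested_delay_seconds).
    """
    delivered = []
    while True:
        result = poll_fn(cursor, max_events)
        for event in result["events"]:
            # eventId is server-assigned; skip anything already processed
            # (for example after a crash and restart with an older cursor).
            if event["eventId"] in seen_ids:
                continue
            seen_ids.add(event["eventId"])
            delivered.append(event)
        # The cursor is opaque: store it and echo it back on the next poll.
        cursor = result["cursor"]
        if not result.get("hasMore"):
            # Backlog drained: wait for the server-suggested interval.
            return delivered, cursor, result.get("nextPollSeconds")
        # hasMore is true: poll again immediately, ignoring nextPollSeconds.
```

The caller would persist the returned cursor and the `seen_ids` set before sleeping for the suggested delay, so a restart resumes without reprocessing.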
| - `events/subscribe` is idempotent within the caller's subscription scope (see *Subscription Identity*). If a subscription with the same scoped key exists, the server resets the TTL and updates mutable fields in place. If the subscription has expired — or the server has restarted and lost it — the server creates a fresh subscription using the provided cursor. | ||
| - The server holds subscription state (id, event name, params, callback URL, secret) in memory with TTL. No durable storage is required — if the server restarts, clients will re-subscribe on their next refresh cycle. For event types backed by a durable upstream, the client's persisted cursor recovers any events that occurred during the gap; for emit-only event types, events during the gap are not recoverable (see *Emit-only event types*). | ||
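The idempotent-subscribe and soft-state behaviour in the two bullets above can be sketched as a small in-memory store. Everything here is illustrative: the class names, the opaque `key` argument (standing in for the scoped subscription key), and the choice of `secret` as the mutable field are assumptions, not the SDK's actual design.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Optional


@dataclass
class WebhookSubscription:
    url: str
    secret: str
    cursor: Optional[str]
    expires_at: float


class InMemorySubscriptions:
    """Soft-state subscription table: entries vanish unless refreshed."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._subs: Dict[Hashable, WebhookSubscription] = {}

    def subscribe(self, key: Hashable, url: str, secret: str, cursor: Optional[str]) -> bool:
        """Upsert a subscription; returns True if a fresh one was created."""
        now = self._clock()
        existing = self._subs.get(key)
        if existing is not None and existing.expires_at > now:
            existing.secret = secret              # mutable field updated in place
            existing.expires_at = now + self._ttl  # TTL reset on refresh
            return False
        # Expired or unknown (e.g. after a restart): start from the client's cursor.
        self._subs[key] = WebhookSubscription(url, secret, cursor, now + self._ttl)
        return True

    def active(self, key: Hashable) -> Optional[WebhookSubscription]:
        """Return the live subscription, dropping it if the TTL has lapsed."""
        sub = self._subs.get(key)
        if sub is None or sub.expires_at <= self._clock():
            self._subs.pop(key, None)
            return None
        return sub
```

Because the table is a plain dict, a process restart empties it, which is the case the bullets cover with client-driven re-subscription.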
|
|
||
| #### Subscription Identity |
question: This section seems to push pretty hard into MCP server implementation internals, is the goal to say that principal confusion is such a huge issue we should be making a prescription here?
I agree to an extent, but subscription identity is a protocol issue, and especially since we need to deal with the webhook receiver that doesn't have the principal, so we do need to say something. Aligned with trying not to be over-prescriptive and please push back if you feel it is.
| The SDK intersects the event type's advertised `delivery` list with the modes the client is configured to use, and picks in this preference order: | ||
|
|
||
| 1. **Webhook** if the client has `WebhookConfig` set and the server lists it — low latency without tying up a persistent connection. | ||
| 2. **Push** if the transport supports streaming (stdio, or HTTP when webhook isn't configured) and the server lists it — low latency, but requires holding a connection open. |
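The preference order quoted above can be sketched as a small selection function. This is a hedged illustration: `choose_delivery_mode` and its parameters are hypothetical names, and falling back to poll when neither webhook nor push applies is an assumption, since the rest of the list is not shown in this excerpt.

```python
from typing import Iterable, Optional


def choose_delivery_mode(
    advertised: Iterable[str],
    webhook_configured: bool,
    transport_streams: bool,
) -> Optional[str]:
    """Pick a delivery mode from the event type's advertised `delivery` list."""
    modes = set(advertised)
    # 1. Webhook: low latency without holding a connection open.
    if webhook_configured and "webhook" in modes:
        return "webhook"
    # 2. Push: low latency, but needs a transport that can stream.
    if transport_streams and "push" in modes:
        return "push"
    # Assumed fallback: poll whenever the server lists it.
    if "poll" in modes:
        return "poll"
    # No usable mode: the client cannot subscribe to this event type.
    return None
```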
Should we also consider UDP in push mode? This would be super helpful for voice and video scenarios.
Today we leverage HTTP for the transports. Two places where this might be considered:
- SEP 2322, which relaxes some of the dependence on specific transport types to make the transport stateless overall
- UDP (as a transport-level requirement for events) would mean decoupling events from transports. I believe there is a Custom Transports effort that allows one to swap these in, but hosts (and the proxies/firewalls in between) would also need to support it
It would definitely be interesting to see this blossom, but I also suspect our shortest path to UDP might be HTTP/3?
Mentioned in the meeting, but will mention again here: I'd love for MCP to support voice and video! For this SEP and WG though (and perhaps this should be made clearer), I'm considering real-time media streaming as out of scope. My read is that the protocols and interactions needed for them are very different, e.g. media probably needs something like WebRTC or similar and I expect the signaling layer protocol to be quite different from what we're trying to do here (which is more discrete event delivery).
I'd actually be very interested in whether folks have LLM use cases for voice and video over MCP. It's not something I've explored.
The broader point I was trying to get at is whether the events model should leave room for more continuous or high-frequency event sources, not just discrete business events like “email received” or “incident created”.
Examples:
- long-running job progress: queued → running → 30% → retrying → completed/failed
- monitoring/agent telemetry: CPU/memory/error-rate threshold changes, health state transitions, periodic stats snapshots
- media-adjacent events: “audio stream started”, “speech segment detected”, “transcript chunk ready”, “video analysis result available”
So I agree media payload transport itself can be out of scope, but I’d like to get clarity on whether event streams can represent status/progress/telemetry-style updates, and where the boundary is between:
- discrete MCP events,
- progress/logging/notifications that already exist,
- high-frequency telemetry,
- actual real-time media streams.
Maybe the SEP should explicitly say real-time media transport is out of scope, while control-plane events and derived events from media or monitoring systems are in scope if they fit the cursor/dedup/delivery model.
The tradeoff I'd see with adding voice/video is that it couples the intent of events (as triggers of changes?) with specific app payloads. Today we don't have guarantees on the latency of events (or even notifications/pushes, etc.), and the lack of these guarantees means we are looking at best-effort scenarios for now? Real-time, to me, feels like starting to have clearer guarantees there, so keeping it out of scope makes sense.
@srikalyan Good point. One thing: for long-running job progress, tasks (SEP 2663) may be a better option (they already post notifications on progress, etc.). Events, to me, look like sources of progress where the work did NOT originate from an (MCP) host.
But the monitoring/telemetry is a good use case and may actually fit here - "start monitoring for cpu usage on X" and this would be a windowed listing (last N events or X seconds).
@srikalyan I'd be interested in the use cases you have in mind for the high-frequency events. I think this proposal doesn't explicitly disallow that, though if we're thinking about events that, e.g., can fire consistently multiple times per second, then we probably need to think more about things like flow control, aggregation operations, and rx-style stream handling.
Let me clarify my question: I’m not specifically asking for voice/video support. Those were examples where TCP-based push may not be the right fit.
The broader question is whether this SEP expects to support raw high-frequency event streams at all.
For example:
- per-request events from a high-QPS service
- raw observability / telemetry samples
- agent monitoring signals at high frequency
- market-data-like event feeds
- realtime experiment exposure or metric events
If the answer is “no, Events are intended for lower-volume discrete state changes only,” then I agree UDP/WebRTC/QUIC-style concerns are probably out of scope and the doc should maybe say that explicitly.
But if raw high-frequency events are in scope, then I wonder whether tying push delivery primarily to TCP-style transports creates the wrong assumptions around latency, backpressure, ordering, retransmission, and head-of-line blocking. In those cases, UDP or QUIC-like transports may be more appropriate, depending on whether the application prefers freshness over reliable delivery of every event.
So maybe the scope question is:
Should MCP Events support only reliable discrete event delivery, or should it also support best-effort event streams where dropping stale events is acceptable?
If it is the former, that helps clarify the boundary. If it is the latter, I think transport choices like UDP/QUIC become more relevant, possibly in coordination with the Transports WG rather than directly in this SEP.
For example, imagine an MCP server (datadog or sentry) exposing live request telemetry for a high-QPS service during an incident:
{
  "name": "service.request_sample",
  "params": {
    "service": "checkout",
    "sampleRate": 0.01,
    "fields": ["status", "latencyMs", "region", "variant"],
    "maxEventsPerSecond": 1000
  }
}
The server might emit sampled request events like:
{
  "eventId": "sample_789",
  "name": "service.request_sample",
  "timestamp": "2026-04-30T12:10:00Z",
  "data": {
    "status": 500,
    "latencyMs": 842,
    "region": "us-west",
    "variant": "B"
  }
}
In this kind of use case, the client may prefer fresh samples over reliable delivery of every sample. If the agent is already behind, old samples may be less useful than the latest view of the system. TCP-style ordered reliable delivery can create head-of-line blocking and backlog behavior that is actively undesirable.
Thanks for clarifying, understood that you're not asking specifically for voice/video.
My question then is around the specific use cases you have in mind, e.g.
> For example, imagine an MCP server (datadog or sentry) exposing live request telemetry for a high-QPS service during an incident:

I understand this use case in the abstract, but I'm wondering what the user journey that you have in mind looks like when an AI agent is involved. For this one, I guess I'm asking my agent to help debug some live incident. It then subscribes to this high-frequency event stream, and then what?
- Putting all events into context feels like a non-starter since it will quickly explode at 100s of events per second, and gives the agent no time to actually process things.
- Perhaps the agent is doing some code-mode things where it writes a script to subscribe, collect, and aggregate? I could imagine this!
- Or is the agent just subscribing to a small window of events, e.g. get the last 100 events, stick those in context and then go from there?
- Or are you imagining that we also introduce some stream-processing primitives into MCP, e.g. "subscribe to this event stream, but then transform this into an events-per-second stream, and filter by X, etc."
I think these are all potentially valid, but really need to understand the CUJ so that we can design appropriately. You are right that if we go down this path then we need to think more about other transports, backpressure, etc.
Right now the proposal does allow for out-of-order delivery and best effort (using cursor: null), but we've punted on things like flow control for now. I think the design does allow for alternative transports via the different delivery modes, though we'd need to consider that carefully: MCP generally tries to avoid introducing new transports where possible to avoid a large compatibility matrix. I think it could be justified if this use case is important and we have no other reasonable choice.
In summary: I think this is very interesting, but it would be useful to understand who is asking for this and to understand their specific use case and desired integration with AI applications.
Totally agree that putting all events directly into context is a non-starter. At 100s/sec the agent would never get time to reason, and the context would explode immediately.
The CUJ I have in mind is closer to your “code-mode script that subscribes, collects, and aggregates” example.
Concrete example: an agent is asked to monitor an A/B test on a high-QPS platform, backed by Statsig / Datadog / Sentry / internal experimentation infra:
“Watch experiment checkout-redesign for the next 30 minutes. If variant B causes error rate or p95 latency to exceed the guardrail, pause B / roll back to A.”
The agent would not put every exposure/request/error event into context. It might write or run a small collector that subscribes to a high-frequency stream, keeps a bounded rolling window, aggregates locally, and only surfaces summaries or threshold crossings back to the agent.
For example:
- subscribe to sampled request/error events for experiment checkout-redesign
- maintain a rolling 1m / 5m window
- compute error rate, p95 latency, sample size by variant
- only wake the agent when a guardrail is breached or confidence is high enough
- then the agent calls existing tools to inspect details and possibly pause/rollback the experiment, depending on permissions/policy
So the agent-facing context might be something like:
{
  "experiment": "checkout-redesign",
  "variant": "B",
  "windowSeconds": 300,
  "sampleSize": 250000,
  "errorRate": 0.043,
  "baselineErrorRate": 0.009,
  "p95LatencyMs": 420,
  "guardrail": "error_rate",
  "recommendation": "pause_variant"
}
The “small window” model also makes sense to me, but I’d think of it as bounded by both count and time, e.g. “latest 100 events within the last 5 minutes,” rather than an unbounded stream being pushed into context.
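As a rough illustration of the collector described in this comment, the sketch below keeps a window bounded by both count and time and produces a summary shaped like the JSON above. The class and function names are hypothetical, treating HTTP status 500 and above as an error is an assumption, and samples are assumed to arrive in timestamp order.

```python
from collections import deque


class RollingWindow:
    """Keeps at most max_events samples, none older than window_seconds."""

    def __init__(self, window_seconds: float, max_events: int):
        self.window_seconds = window_seconds
        # Count bound: deque drops the oldest sample when full.
        self.samples = deque(maxlen=max_events)

    def add(self, timestamp: float, status: int, latency_ms: float) -> None:
        self.samples.append((timestamp, status, latency_ms))

    def summary(self, now: float) -> dict:
        # Time bound: evict samples that fell out of the window.
        cutoff = now - self.window_seconds
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()
        n = len(self.samples)
        if n == 0:
            return {"sampleSize": 0, "errorRate": 0.0, "p95LatencyMs": None}
        errors = sum(1 for _, status, _ in self.samples if status >= 500)
        latencies = sorted(latency for _, _, latency in self.samples)
        rank = (95 * n + 99) // 100  # nearest-rank p95: integer ceil(0.95 * n)
        return {
            "sampleSize": n,
            "errorRate": errors / n,
            "p95LatencyMs": latencies[rank - 1],
        }


def breaches_guardrail(summary: dict, max_error_rate: float) -> bool:
    """Only a breach like this would wake the agent."""
    return summary["sampleSize"] > 0 and summary["errorRate"] > max_error_rate
```

Only the summary dict (or a guardrail breach) would reach the agent's context; the raw samples stay inside the collector.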
I agree the current proposal covers the MVP use case well: reliable-ish discrete event delivery with ordering/replay where the upstream supports it. My question is mostly whether we should explicitly call high-frequency best-effort streams out of scope, or leave room for them through patterns like bounded windows, sampling, aggregation, and possibly future transport/flow-control work.
|
||
| The client owns all subscription state across all three delivery modes. There is no server-side subscription listing method. The client maintains its own subscription registry and can reconstruct server-side state at any time via idempotent `events/subscribe` (for webhook mode) or by resuming poll/push with stored cursors. | ||
|
|
||
| Enterprise governance tools can inspect the client SDK's subscription registry for audit purposes. For webhook mode, orphaned subscriptions (e.g., from a crashed client) are cleaned up automatically by TTL expiry — no server-side listing is needed. |
How do server operators audit active webhook subscriptions, investigate abuse, or perform emergency shutdowns without relying only on client SDK state?
Are you asking whether (and how much) say clients should have in this? Today the server has full authority to do this. I suspect this is about how much of it should be notified to the client?
So you are saying that the server owner should keep track of all the subscriptions?
|
|
||
| - `delivery` lists the delivery modes this event type supports — any non-empty subset of `"poll"`, `"push"`, `"webhook"`. No mode is mandatory. A client that cannot use any of the listed modes cannot subscribe to this event type. | ||
| - `inputSchema` is a JSON Schema describing valid subscription parameters — these may include filters (which narrow the event stream), transforms (which modify payloads), or other server-defined configuration. This mirrors the `inputSchema` on tools for consistency. | ||
| - `payloadSchema` describes the shape of `data` in delivered events. |
How do event payloads evolve compatibly, do event names need namespaces/versioning, and does MCP reserve an mcp.* prefix?
It's a good question. So far, MCP hasn't decided on tool name namespacing and versioning, so my inclination is to not try to define it here and rather wait for broader guidance for MCP.
You did make me realise that we need a _meta for the event payload and event type definition (similar to tools). Will add that. _meta does reserve a prefix for MCP.
|
|
||
| **Subscribe-time:** The server MUST verify that the authenticated user has permission to subscribe to the requested event type with the given params. For example, a Slack server must verify the user has access to the channel specified in the params. | ||
|
|
||
| **Delivery-time:** The server SHOULD periodically re-verify permissions. If the user's access is revoked (e.g., removed from a Slack channel), the server terminates the subscription. The termination signal carries the same nested-`error` shape across all modes: |
Should authorization be checked on every poll, every push delivery, every webhook delivery, or only on a configured interval? And what happens to queued events after revocation?
This is a good one. The auth (and fine-grained-auth) WG is also looking at this, so I am thinking that deferring to that would keep it simpler here (though I am sure the integration won't be zero effort).
@panyam can you expand more on the keeping-it-simpler part? Auth once, periodically, or on every push?
|
Posting here as well as Discord for more feedback/visibility: I had an orthogonal proposal from some time ago on a way to avoid webhooks on the protocol layer completely by putting the responsibility on the transport layer. The premise is that if MCP is already bidirectional when stateful, push should just work, and we don't need to invent anything new for triggers/events, and webhooks exist primarily because the bidirectional connection isn't durable (the network might drop). So it becomes more of a question of: how do we ensure durable SSE? The idea is to patch the problem for Streamable HTTP: when a notification (server -> client) fails to get pulled by SSE, hit a webhook backup to deliver the events. In this setting, the current event proposal will focus solely on event semantics, and not on how it gets delivered. That separated concern goes into the transport layer. It feels more elegant because stdio servers, for example, don't seem to make a ton of sense supporting a webhook. Would love to hear if this idea has value! |
| for msg in history.messages: | ||
|     if matches_params(params, msg): | ||
|         data = {"messageId": msg.id, "from": msg.sender, ...} | ||
|         if params.get("redact_pii"): | ||
|             data = redact(data) | ||
|         events.append(Event(name="email.received", eventId=msg.id, data=data)) | ||
| return EventResult(events=events, cursor=history.historyId) |
thoughts on doing a yield-style setup?
Open to it! All the SDK stuff here is non-normative. I'd be interested in seeing prototypes of that to see how well it works.
This is interesting @caseychow-oai. Will share a demo of this. It was interesting because I am thinking (and this is obviously an impl detail) the Yield source and an actual Event source could have different expectations on whether the listing is driven by a tool call or is for just getting events.
|
|
||
| 6. **Should webhook subscription require an ownership-verification handshake?** Before activating delivery, the server would POST a challenge token to `delivery.url` and require the endpoint to echo it back (cf. Slack's URL verification, SNS `SubscriptionConfirmation`). This proves the subscriber controls the endpoint, preventing a client from pointing deliveries at a third party. Cost: an extra round-trip and an endpoint-side requirement. The current SSRF defenses (blocklist, no-redirect, delivery-time IP validation) mitigate internal-target abuse but not third-party-target abuse. | ||
|
|
||
| 7. **Should webhook deliveries support multiple signatures for zero-downtime secret rotation?** The current design has a single `X-MCP-Signature` header, and rotation happens via an atomic upsert of `delivery.secret`. But there's a race: webhooks in flight when the upsert lands were signed with the old secret, and a receiver that has already switched to validating against the new secret will reject them. Stripe-style multi-signature (e.g., `X-MCP-Signature: t=<ts>,v1=<sig_old>,v1=<sig_new>`) lets the server sign with both secrets during a grace window so the receiver can verify against either. Is the in-flight window small enough to ignore, or should the spec allow `delivery.secret` to be an array (or require servers to dual-sign for N seconds after rotation)? |
+1 to this, very hard to add after
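To make the dual-signing idea in open question 7 concrete, here is a hedged sketch of a sender that signs with both secrets and a receiver that accepts either. The header layout follows the `t=<ts>,v1=<sig>,v1=<sig>` example in the question and the hex-lowercase encoding mentioned in an earlier commit; signing the string `<timestamp>.<body>` and the five-minute tolerance are Stripe-style assumptions, and all function names are hypothetical.

```python
import hashlib
import hmac
from typing import List


def sign(secret: bytes, timestamp: str, body: bytes) -> str:
    """HMAC-SHA256 over '<timestamp>.<body>', hex-lowercase (assumed payload layout)."""
    return hmac.new(secret, timestamp.encode() + b"." + body, hashlib.sha256).hexdigest()


def build_header(timestamp: str, body: bytes, secrets: List[bytes]) -> str:
    """Sender side: one v1 entry per active secret during the rotation window."""
    parts = [f"t={timestamp}"] + [f"v1={sign(s, timestamp, body)}" for s in secrets]
    return ",".join(parts)


def verify(header: str, body: bytes, secret: bytes, now: int,
           tolerance_seconds: int = 300) -> bool:
    """Receiver side: accept if any v1 signature matches the receiver's secret."""
    timestamp = None
    candidates = []
    for part in header.split(","):
        key, _, value = part.partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            candidates.append(value)
    if timestamp is None or not timestamp.isdigit():
        return False
    if abs(now - int(timestamp)) > tolerance_seconds:
        return False  # stale or future-dated delivery: possible replay
    expected = sign(secret, timestamp, body)
    # compare_digest avoids leaking match position through timing.
    return any(hmac.compare_digest(expected, c) for c in candidates)
```

With this shape, a receiver that has already switched to the new secret still accepts in-flight deliveries, because the sender includes a signature for each secret during the grace window.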
…, SEP review fixes
Make poll a mandatory delivery mode so every events-capable server is reachable by every client (push/webhook remain optional optimizations). Add a Webhook Delivery Profile subsection with WAF-relevant invariants (POST-only, required headers, 256 KiB body SHOULD, TLS termination) and defer User-Agent to SEP-1336 and egress-IP discovery to SEP-2127. Incorporate further fixes from adversarial review.
…t, tuple-based identity
- Adopt Standard Webhooks as the delivery signature scheme (webhook-id/timestamp/signature, v1 base64, multi-sig rotation); keep X-MCP-Subscription-Id as the one MCP-specific header
- Webhook secret is now client-supplied and REQUIRED (whsec_ + 24-64B base64, server MUST reject otherwise); server never generates one
- Subscription identity is (principal?, delivery.url, name, params); the response returns a server-derived id used only for routing, never as input
- Revert to no-mode-mandatory for delivery modes
- TLS is now MUST for delivery.url
- Allow cursor: null in any delivery for event types without replay; clarify CursorExpired always implies a possible gap
…dening, decisions table
- Unify cursor expiry into in-band `truncated: true` (drop -32014 CursorExpired): poll result body, push re-sends notifications/events/active mid-stream, webhook refresh response + {"type":"gap"} control envelope; client never re-subscribes for a gap
- Add `maxAge` (seconds) to bound replay across all modes; servers may decline by truncating to now
- Heartbeat carries cursor (push) and refresh response carries the safe-watermark cursor (webhook) so the client's position advances during quiet periods
- Webhook now requires an authenticated principal (key is (principal, url, name, params)); drop body `id` (route via X-MCP-Subscription-Id); add Non-event webhook bodies subsection (gap/terminated/verification with `type` discriminator)
- Drop subscription `id` from poll/stream (one sub per request; correlate via JSON-RPC id / requestId)
- Rewrite Summary as a how-it-works overview; add Key Design Decisions table; add OQ5 (task state changes as events) and OQ6 (endpoint verification + asymmetric server signing)
- Consistency pass across diagrams, examples, error lists, and appendix step references
|
|
||
| ```jsonc | ||
| // params (optional) | ||
| { "cursor": "..." } // pagination |
events/list accepts a pagination cursor, but the response example does not include nextCursor / hasMore. Should this follow existing MCP list pagination semantics?
Yep, will update to include nextCursor in the example. hasMore doesn't seem to be used by MCP.
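For reference, a client draining `events/list` with the cursor/`nextCursor` convention discussed here might look like the sketch below. It is only an illustration: `list_fn` is a hypothetical callable that sends one `events/list` request with the given params, and the `"events"` result field name is an assumption because the response body is not shown in this excerpt.

```python
from typing import Callable, List


def list_all_event_types(list_fn: Callable[[dict], dict]) -> List[dict]:
    """Follow nextCursor until the server stops returning one."""
    items: List[dict] = []
    cursor = None
    while True:
        # The first request carries no cursor; later ones echo nextCursor back.
        params = {} if cursor is None else {"cursor": cursor}
        result = list_fn(params)
        items.extend(result.get("events", []))  # assumed field name
        cursor = result.get("nextCursor")
        if not cursor:
            # An absent nextCursor marks the last page; no hasMore flag is needed.
            return items
```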
events/list response now shows nextCursor for consistency with base MCP list pagination. Event-type definitions and EventOccurrence both gain optional _meta, matching Tool/Resource/Prompt.
… suffix
- Push notifications now carry the parent events/stream request id in params._meta["io.modelcontextprotocol/subscriptionId"] (per SEP-2575) instead of params.requestId
- Rename nextPollSeconds -> nextPollMs and maxAge -> maxAgeMs (integer milliseconds) to match the MCP duration-field convention; example values and prose updated
Records why poll, push, and webhook all exist as protocol-level modes rather than picking one: the deployment topologies are disjoint and no mode subsumes another. Acknowledges the departure from the usual one-way-to-do-it stance and the spec/SDK cost it carries, and notes the rejected alternative (one mode + proxy synthesis) and why.
- Replace Open Question 6 with normative endpoint-verification rules: a mandatory intent challenge (or allowlist) before activating webhook delivery, cached per (principal, url); add -32014 EndpointVerificationFailed and fill the verification control-envelope row.
- Add optional asymmetric server signing (Standard Webhooks v1a,) for known-server allowlists. The webhook-signing key is discovered from an origin the client already authenticates: inline webhooks.signingKeys on the SEP-2127 server card if it lands first, otherwise a standalone /.well-known/mcp-webhook-jwks.json document (one chosen at SEP-review time).
- Correct the webhook-secret rationale (HMAC stops deception, not flooding) and add Key Design Decisions rows for endpoint verification and server identity.
- Add a Threat Model summary table to Security Considerations.
Replace the seven event-specific error codes (-32011..-32017) with five general-purpose codes that carry their specifics in a typed `data` payload: NotFound, Forbidden, ResourceExhausted, Unsupported, CallbackEndpointError. Bad or non-https callback URLs and malformed secrets now fold into the standard InvalidParams (-32602). The remaining codes are contiguous (-32011..-32015) and named for reuse across the protocol rather than scoped to events; clients still distinguish every case they act on by code, by method, or by a typed `data` discriminator. Also corrects the stale -32002 ResourceNotFound reference (no longer present in the schema) and records the trade-off in Key Design Decisions.
|
|
||
| The transport mechanism differs by transport type: | ||
|
|
||
| - **Streamable HTTP:** The `events/stream` request is a POST that returns an SSE response stream. This stream carries event notifications (`notifications/events/*`); it is independent of, and does not replace, the transport's existing GET-based SSE stream, which continues to carry non-event server-initiated notifications (`notifications/tools/list_changed`, progress, logging, etc.). The client cancels by aborting the request stream (TCP close on HTTP/1.1, `RST_STREAM` on HTTP/2) — no explicit cancellation message is needed. |
There was a problem hiding this comment.
One quick question/comment here and curious about the trade-offs: The current shape collapses subscribe and stream-open into a single events/stream request. For monolithic servers that's clean, but for tiered deployments it forces auth, ACL, and event-registry state to live wherever the SSE connection terminates. In our architecture (and I'd guess this generalizes to anyone whose connection tier is separate from their app/auth tier - i.e. gateways fronting an upstream API, edge layers handling long-lived connections, etc.), that means the connection-termination tier has to either re-implement the auth/registry surface or make synchronous cross-tier RPCs at every checkpoint: subscribe-time validation, scope check, ACL evaluation, per-event re-check, termination signaling.
Webhook mode sidesteps this elegantly because subscribe and delivery are decoupled — events/subscribe lives in the auth/registry tier, delivery can fan out from anywhere, and the client-supplied secret means no out-of-band coordination. Poll mode is similar: each events/poll is a self-contained request that can be served by whichever tier holds the registry.
Would the WG be open to letting events/stream optionally accept a server-issued subscription handle from a prior events/subscribe call? Concretely:
- Push-mode servers that want today's behavior keep working: `events/stream` with `{name, params, Optional: cursor}` does subscribe-and-stream in one shot.
- Servers that want to split the tiers can advertise `events/subscribe` for push (the spec currently scopes it to webhook) and accept `events/stream` with `{subscription_id, Optional: cursor}` against a previously-minted handle. The connection-termination tier validates the handle, opens the stream, and stays out of the auth path.
This feels additive - no wire-format break, opt-in per server, and clients that don't care can ignore the new path, but would love to hear your thoughts.
Regarding the security shape, the handle could follow the same pattern Slack's Socket Mode already uses for apps.connections.open: the auth/registry tier mints a single-use, short-TTL ticket, embeds it in the URL it returns, and the connection-termination tier consumes the ticket exactly once at stream-open by calling back to the auth tier to validate. That gives single-use semantics, bounded replay window, principal binding (via the ticket), and a clean revocation seam. It's also orthogonal to the bearer token: the ticket authorizes opening this specific subscription, not arbitrary API access, so leaking it gets you one stream, not a session. Worth noting the spec wouldn't need to prescribe the format (subscription_id can stay opaque) - only that it's server-issued and consumed once at events/stream time.
Has this come up already? Would love some thoughts and feedback 🙇 !
There was a problem hiding this comment.
@vaishnavparth thanks for raising this.
I'm curious if you see this as any different from e.g. tools/call?
Would the WG be open to letting events/stream optionally accept a server-issued subscription handle from a prior events/subscribe call?
The problem with this is that it makes the protocol stateful again which is exactly what we're trying to avoid with the next MCP spec release. See SEP-2575. We just removed the GET stream and resources/subscribe for this very purpose, so I'd be quite strongly against bringing it back.
It would be fair to argue that Webhook mode is bringing this back in some form, but the hope was to limit it to webhooks (which are inherently stateful) rather than bringing statefulness back to the core protocol, especially for something analogous to resource subscriptions, which were made stateless.
Even if we did what was suggested, do you not need the x-tier call for per-event re-checks?
There was a problem hiding this comment.
Thanks @pja-ant, this is really helpful, and the SEP-2575 pointer was the part I was missing. Reading that thread, I take your point: the whole direction of travel is moving server-side state out of the protocol, and what I was suggesting would re-introduce it for push in the same shape that just got removed for resources. Sounds like Webhook's TTL'd subscription as a contained exception (because the server has to remember where to deliver) makes sense as the design line, and I shouldn't try to extend it.
On your question: "do you not need the x-tier call for per-event re-checks?" - In our case the per-event re-check would happen in our auth/registry tier as part of the emit path, before fanning out to whatever holds the connection, so the handle wasn't really about avoiding cross-tier traffic at delivery time. It was about letting the connection-termination tier stay auth-naive at stream-open.
The tools/call analogy is also fair. The thing that actually differs in our case is connection lifetime rather than auth shape, but that's a transport problem. In our architecture, tools/call is short-lived and served by a web application (which also handles auth and ACLs), whereas events/stream is long-lived and the server that holds the long-lived SSE connection sits outside the web application.
The tools/call vs. events/stream framing is the part I'd love to get your thoughts on, even if just to think out loud: short-lived auth-bearing requests and long-lived ones land in pretty different deployment shapes for a non-monolithic server, and I suspect this won't be the last time it comes up as more upstreams adopt events. Happy to be a sounding board if it's useful, or to drop it if you'd rather keep the WG focused. Either way - thanks again 🙇
There was a problem hiding this comment.
Hey! 👋 Saurabh here. I work on the Slack platform and have spent a lot of time deep in the internals of Slack’s Events API and delivery infrastructure: webhooks, retries, rate limiting, and related systems.
Really excited to see this proposal. The three-mode design (poll, push, webhook) maps closely to patterns we’ve converged on at Slack over years of operating large-scale event systems in production.
I’ve left a few comments throughout, mostly based on lessons learned from running event delivery at Slack scale. The hardest problems in these systems tend to be around failure modes, degradation behavior, retries, and protecting servers and clients from each other rather than the happy path.
Overall, I think there are some really strong design choices here, especially the Standard Webhooks adoption and endpoint verification approach. Looking forward to seeing this evolve. 🎉
| "params": { "severity": "P1" }, | ||
| "delivery": { | ||
| "mode": "webhook", | ||
| "url": "https://proxy.example.com/hooks/client123", |
There was a problem hiding this comment.
Have you considered additionally supporting a static callback URL per client rather than per subscription?
The model that has worked well for Slack is that each app registers a single Request URL, verifies domain ownership once, and all event subscriptions are delivered there. This keeps verification overhead low, makes callback URL rotation simpler for clients, and allows the server to manage health tracking at the client/app level rather than per subscription URL.
One possible approach here would be to support a mode where the client registers and verifies a single callback URL up front. Subsequent subscriptions could then omit delivery.url and inherit the established endpoint.
There was a problem hiding this comment.
Hey @saurabhsahni - I think this is a good suggestion. I need to spend a bit more time thinking through this, but I like the idea and I'm wondering if there is something we can learn from the CIMD work from our auth friends, i.e. something like the client exposes a myclient.com/.well-known/mcp/webhook-receiver.json as a DNS-backed proof of authenticity for the client that exposes the same-origin delivery URL to avoid the one-time verification step.
There was a problem hiding this comment.
Yeah, something along those lines could be compelling.
That said, I'd love to see this remain optional so platforms that already have developer-configured callback URLs can reuse their existing infrastructure. At Slack, for example, developers register a single Request URL in their app configuration, and we deliver all event subscriptions to that endpoint. We'd want to be able to use that existing URL for MCP subscriptions rather than requiring apps to also host and maintain a separate .well-known document.
There was a problem hiding this comment.
+1 to Saurabh's point about keeping this optional and reuse-friendly.
One thing I wasn't sure about and wanted to ask: does the .well-known/mcp/webhook-receiver.json approach assume the client can serve same-origin static content from the delivery domain? I ask because in tiered setups (gateways, edge tiers, managed integration platforms) the delivery endpoint often isn't the same origin as anything the client can publish a well-known doc on, so I wanted to make sure that case stays supported.
If I'm reading Saurabh's suggestion right, the verify-once-then-inherit model also seems like it could lean on the endpoint-verification challenge this proposal already has, without needing DNS or same-origin hosting. Would it make sense to treat .well-known as one optional authenticity mechanism for clients that can serve it, while keeping the verified-once callback URL as a baseline that doesn't constrain receiver topology? Curious whether that squares with what you both had in mind.
| - Deliveries follow the [Standard Webhooks](https://github.com/standard-webhooks/standard-webhooks/blob/main/spec/standard-webhooks.md) signature scheme. `webhook-id` carries the `eventId` for event deliveries (per-delivery, used for dedup; control envelopes use `msg_<type>_<random>` — see *Non-event webhook bodies*). `webhook-timestamp` is the Unix timestamp (seconds) of the request. `webhook-signature` is `v1,<base64>` where the value is `HMAC-SHA256(secret, webhook-id + "." + webhook-timestamp + "." + body)`; multiple space-delimited signatures MAY be present during secret rotation. `X-MCP-Subscription-Id` is an MCP-specific header (not part of Standard Webhooks) carrying the subscription `id` so the receiver can select the correct secret before parsing the body. The receiver MUST verify the signature before processing, SHOULD reject deliveries where `webhook-timestamp` is more than 5 minutes old, and SHOULD deduplicate on `webhook-id`. Each retry attempt regenerates the timestamp and signature. | ||
| - `eventId` in the body enables idempotent processing. The receiver SHOULD deduplicate by this value. | ||
| - `cursor` in the body is a **safe-to-persist watermark**: it represents a position such that every event at or before it has been acknowledged by the endpoint or abandoned by the server. The server computes this from its in-memory retry queue — it does not include `cursor_N` in event N's payload until events at positions `< N` have been acked or given up on. Computing this requires the server to hold, per active subscription, the upstream position and ack status of in-flight events; this is in-memory TTL-scoped state (lost on restart, reconstructed by client refresh), but it is per-subscription bookkeeping the SDK must implement. The client persists every cursor it receives; the most recently received value is always safe to supply on refresh. The server MAY send `cursor: null` for event types that do not support replay (see *Cursor Lifecycle*); in that case there is no recovery point to persist. The endpoint MUST make `cursor` and `eventId` available to the consuming client by whatever channel it uses to forward events; cursor-based recovery on resubscribe depends on the client receiving and persisting this value. | ||
| - **Delivery model.** The server retries each event independently with exponential backoff on non-`2xx` responses. This matches the dominant webhook convention (Stripe, GitHub, Shopify, the Standard Webhooks spec). Concurrent deliveries and retries may therefore arrive out of order; the receiver uses `eventId` for deduplication and `timestamp` for ordering if needed. Because the payload `cursor` is a watermark (not the event's own position), out-of-order arrival does not cause the client to persist an unsafe cursor. |
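For illustration, the receiver-side checks described in the Standard Webhooks bullet above (recompute the HMAC, accept any of several space-delimited signatures, reject stale timestamps) could look roughly like the Python sketch below. The function names `sign` and `verify`, the lowercase header keys, and the use of the secret as raw key bytes are assumptions made for this example; the proposal does not define them, and decoding the secret string into key bytes is left out.

```python
import base64
import hashlib
import hmac
import time


def sign(secret: bytes, webhook_id: str, timestamp: str, body: bytes) -> str:
    """Return a `v1,<base64>` signature over `webhook-id.webhook-timestamp.body`."""
    signed = webhook_id.encode() + b"." + timestamp.encode() + b"." + body
    digest = hmac.new(secret, signed, hashlib.sha256).digest()
    return "v1," + base64.b64encode(digest).decode()


def verify(secret: bytes, headers: dict, body: bytes,
           now: float = None, tolerance_s: int = 300) -> bool:
    """Check one delivery. `headers` is assumed to use lowercase keys."""
    now = time.time() if now is None else now
    try:
        webhook_id = headers["webhook-id"]
        timestamp = headers["webhook-timestamp"]
        signatures = headers["webhook-signature"]
        sent_at = int(timestamp)
    except (KeyError, ValueError):
        return False
    # Reject deliveries whose timestamp is more than 5 minutes old.
    if now - sent_at > tolerance_s:
        return False
    expected = sign(secret, webhook_id, timestamp, body).encode()
    # Several space-delimited signatures may be present during secret rotation;
    # one match is enough. compare_digest avoids timing side channels.
    return any(hmac.compare_digest(expected, candidate.encode())
               for candidate in signatures.split(" "))
```

A gateway holding several subscriptions would first use `X-MCP-Subscription-Id` to pick the secret and only then call `verify` on the raw, unparsed body.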
There was a problem hiding this comment.
A few retry semantics that have worked well for Slack webhooks and Events API integrations:
- Retry at most 3 times with bounded backoff (roughly immediate, 1 minute, 5 minutes)
- Include retry metadata headers on every attempt:
  - `X-Slack-Retry-Num`: retry attempt number
  - `X-Slack-Retry-Reason`: categorized failure reason (`http_timeout`, `connection_failed`, `http_error`, etc.)
- Allow the receiver to explicitly suppress retries for a specific delivery (Slack uses `X-Slack-No-Retry: 1`), which is useful when the endpoint intentionally rejects stale or already-processed events
Might be worth considering something similar in the Standard Webhooks profile here:
- Define recommended retry limits and a bounded retry window (for example, max 3 attempts within 10 minutes total)
- Standardize retry metadata headers so receivers can reason about retry behavior consistently
- Allow receivers to explicitly signal “do not retry this delivery” to avoid unnecessary retries and queue growth for intentionally rejected events
There was a problem hiding this comment.
These are good suggestions, will weave them in.
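To make the suggestion concrete, here is a minimal Python sketch of a bounded retry decision in the shape described above: at most three retries at roughly immediate, 1 minute, and 5 minutes, plus a receiver opt-out. The header name `x-no-retry` is hypothetical (modelled on Slack's `X-Slack-No-Retry`), as is the function name; nothing here is specified by the proposal.

```python
from typing import Mapping, Optional

# Delay before retry 1, 2 and 3: roughly immediate, 1 minute, 5 minutes.
RETRY_DELAYS_S = (0, 60, 300)


def next_retry_delay(retries_done: int,
                     status: Optional[int],
                     response_headers: Mapping[str, str]) -> Optional[int]:
    """Return seconds to wait before the next retry, or None to stop.

    `status` is None when no HTTP response arrived (timeout, connection
    failure). `response_headers` is assumed to use lowercase keys.
    """
    if status is not None and 200 <= status < 300:
        return None  # acknowledged, nothing to retry
    # Hypothetical opt-out: the receiver rejected this delivery on purpose.
    if response_headers.get("x-no-retry") == "1":
        return None
    if retries_done >= len(RETRY_DELAYS_S):
        return None  # retry budget exhausted, give up on this event
    return RETRY_DELAYS_S[retries_done]
```

Returning `None` covers both "acknowledged" and "given up"; a real server would distinguish the two when updating delivery health.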
| - **Delivery model.** The server retries each event independently with exponential backoff on non-`2xx` responses. This matches the dominant webhook convention (Stripe, GitHub, Shopify, the Standard Webhooks spec). Concurrent deliveries and retries may therefore arrive out of order; the receiver uses `eventId` for deduplication and `timestamp` for ordering if needed. Because the payload `cursor` is a watermark (not the event's own position), out-of-order arrival does not cause the client to persist an unsafe cursor. | ||
| - **Acknowledgement semantics.** A `2xx` response from the webhook endpoint signals that the event has been accepted and the server need not retry it. The endpoint SHOULD NOT return `2xx` until the event has been durably persisted or forwarded — an endpoint that ACKs and then loses the event leaves recovery dependent on the client's last-persisted cursor, which may predate the lost event. At-least-once delivery in webhook mode holds between server and endpoint; end-to-end delivery to the agent depends on the endpoint honouring this contract. | ||
| - **Subscribe/delivery race.** Because the secret is client-supplied, the receiver can verify the very first delivery — there is no window where a delivery arrives before the receiver knows the secret. A receiver that gets a delivery for an `id` it has not yet been told to route (e.g., the subscribe response has not propagated to the gateway) SHOULD return a retryable status (`503` or `425 Too Early`); the server's retry/backoff redelivers once the receiver is ready. `eventId` deduplication and cursor replay make this safe. | ||
| - After repeated failures (server-defined threshold), the server MAY suspend delivery (`deliveryStatus.active: false`). A subsequent successful refresh reactivates it (sets `active: true`) and the server resumes retrying pending events; if the client never refreshes, the subscription expires naturally at TTL. |
There was a problem hiding this comment.
It will be useful to include some recommended suspension thresholds here as implementation guidance. For example: “95% delivery failure rate over a rolling 60-minute window, with a minimum of 100 delivery attempts.”
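As a rough illustration of that example threshold (95% failures over a rolling 60-minute window, minimum 100 attempts), a server could track attempts per subscription along these lines. The class and method names are made up for this sketch, and the defaults simply restate the numbers in the comment above.

```python
from collections import deque


class DeliveryHealth:
    """Rolling-window failure tracker for one webhook subscription."""

    def __init__(self, window_s: float = 3600, min_attempts: int = 100,
                 max_failure_rate: float = 0.95) -> None:
        self.window_s = window_s
        self.min_attempts = min_attempts
        self.max_failure_rate = max_failure_rate
        self._attempts = deque()  # (timestamp, ok) pairs, oldest first

    def record(self, ts: float, ok: bool) -> None:
        self._attempts.append((ts, ok))

    def should_suspend(self, now: float) -> bool:
        # Drop attempts that have aged out of the rolling window.
        while self._attempts and self._attempts[0][0] <= now - self.window_s:
            self._attempts.popleft()
        total = len(self._attempts)
        if total < self.min_attempts:
            return False  # too little data to judge the endpoint
        failures = sum(1 for _, ok in self._attempts if not ok)
        return failures / total >= self.max_failure_rate
```

The minimum-attempts guard keeps a low-traffic subscription from being suspended after a handful of failures.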
| - `eventId` in the body enables idempotent processing. The receiver SHOULD deduplicate by this value. | ||
| - `cursor` in the body is a **safe-to-persist watermark**: it represents a position such that every event at or before it has been acknowledged by the endpoint or abandoned by the server. The server computes this from its in-memory retry queue — it does not include `cursor_N` in event N's payload until events at positions `< N` have been acked or given up on. Computing this requires the server to hold, per active subscription, the upstream position and ack status of in-flight events; this is in-memory TTL-scoped state (lost on restart, reconstructed by client refresh), but it is per-subscription bookkeeping the SDK must implement. The client persists every cursor it receives; the most recently received value is always safe to supply on refresh. The server MAY send `cursor: null` for event types that do not support replay (see *Cursor Lifecycle*); in that case there is no recovery point to persist. The endpoint MUST make `cursor` and `eventId` available to the consuming client by whatever channel it uses to forward events; cursor-based recovery on resubscribe depends on the client receiving and persisting this value. | ||
| - **Delivery model.** The server retries each event independently with exponential backoff on non-`2xx` responses. This matches the dominant webhook convention (Stripe, GitHub, Shopify, the Standard Webhooks spec). Concurrent deliveries and retries may therefore arrive out of order; the receiver uses `eventId` for deduplication and `timestamp` for ordering if needed. Because the payload `cursor` is a watermark (not the event's own position), out-of-order arrival does not cause the client to persist an unsafe cursor. | ||
| - **Acknowledgement semantics.** A `2xx` response from the webhook endpoint signals that the event has been accepted and the server need not retry it. The endpoint SHOULD NOT return `2xx` until the event has been durably persisted or forwarded — an endpoint that ACKs and then loses the event leaves recovery dependent on the client's last-persisted cursor, which may predate the lost event. At-least-once delivery in webhook mode holds between server and endpoint; end-to-end delivery to the agent depends on the endpoint honouring this contract. |
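The watermark bookkeeping described in the `cursor` bullet above can be sketched in a few lines of Python. This is only an illustration under the "every event at or before it has been acknowledged or abandoned" reading; `WatermarkTracker` and its methods are hypothetical names, and positions are simply the order in which events were queued for one subscription.

```python
from typing import Optional


class WatermarkTracker:
    """Per-subscription, in-memory tracking of the safe-to-persist cursor."""

    def __init__(self) -> None:
        self._cursors = []   # cursor after each event, indexed by position
        self._settled = []   # True once the event is acked or abandoned
        self._next = 0       # lowest position that is still in flight

    def enqueue(self, cursor: str) -> int:
        """Register an event about to be delivered; return its position."""
        self._cursors.append(cursor)
        self._settled.append(False)
        return len(self._cursors) - 1

    def settle(self, position: int) -> None:
        """Mark an event as acked by the endpoint or given up on."""
        self._settled[position] = True
        # Advance past every contiguous settled event.
        while self._next < len(self._settled) and self._settled[self._next]:
            self._next += 1

    def watermark(self) -> Optional[str]:
        """Cursor that is safe to put in an outgoing payload, if any."""
        return self._cursors[self._next - 1] if self._next > 0 else None
```

Because the watermark only advances over a contiguous settled prefix, an out-of-order ack for a later event never moves it past an earlier event that is still being retried.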
There was a problem hiding this comment.
Thoughts on specifying a webhook acknowledgment timeout?
There was a problem hiding this comment.
Makes sense. Any suggestions? Something fixed, or should the client say "give me this much time"? Thinking about clients that maybe do some DB checkpoint every X-seconds or something and don't want to ack until durable. Maybe overkill though, and probably clients shouldn't have too much control over timeouts since it is the server that pays for it.
There was a problem hiding this comment.
At Slack, a fixed, server-defined timeout (3 seconds), combined with clear documentation, has worked well in practice. Clients are expected to acknowledge requests quickly and perform any longer-running work asynchronously. If a client doesn't respond within the timeout window, we treat the delivery as failed and retry the request.
There was a problem hiding this comment.
From the meeting today: leaning towards making this a choice a server can make on its own, without wrapping it into the protocol.
|
|
||
| This design does not include a protocol-level flow control mechanism (e.g., credit-based or token-based backpressure). This is intentional for v1. | ||
|
|
||
| In practice, the dominant bottleneck in an MCP event pipeline is LLM inference — processing a single event may take seconds to minutes of model time. Compared to this, event delivery rates from typical upstream sources (email, Slack, PagerDuty) are negligible. The system is inherently consumer-bound, not producer-bound. |
There was a problem hiding this comment.
At Slack, we also enforce outbound event delivery rate limits. There is an upper bound on how many events we will dispatch to an app within a given window, and when that limit is exceeded we emit an app_rate_limited event so the app is aware that deliveries are being throttled.
It may be worth defining a standard “rate limited” or “delivery suspended” control event/notification in the spec that servers can emit when throttling or protective suppression kicks in. That gives clients explicit visibility into dropped, delayed, or suppressed deliveries instead of requiring them to infer it indirectly from gaps or reduced traffic.
There was a problem hiding this comment.
Is this to protect the server?
If you are dropping events, I think we have the "gap" message for that. For delayed events, we don't have anything. Agree a control event could be useful, just being mindful of feature creep here. How valuable do you think this is for clients? Is just a deliveryStatus sufficient?
There was a problem hiding this comment.
It's really about protecting both sides. The server avoids fan-out storms, and the client avoids getting overwhelmed by a sudden burst of deliveries. The rate limit acts as a circuit breaker.
I see the gap message as related but slightly different. gap is a generic, backward-looking signal that some events were skipped, but it doesn't explain why. The signal I'm describing is more specific: "the server is actively throttling outbound deliveries for your app." At Slack, we emit an app_rate_limited event when that happens, and it's been useful because otherwise developers often assume events are being silently lost.
That said, I agree this may not be worth adding to the initial spec. Just wanted to flag it as something that's proven valuable in practice.
There was a problem hiding this comment.
Deferring to Saurabh here since this is his area, just wanted to add one small data point on Peter's deliveryStatus question. The case that seems hardest to detect today is throttled vs. delayed — gap tells a client it missed something after the fact, but not that it's actively being throttled and a retry/poll right now won't help.
Would a lightweight deliveryStatus hint (e.g. { deliveryStatus: "throttled", retryAfterMs }) be enough to cover that without a full control event? The thing that seems valuable is just giving the client something actionable to back off on, so it doesn't read reduced traffic as "nothing happening". I agree that the broader control-event is probably post-v1; mostly wondering whether a small retryAfterMs - style field now would be cheap insurance against clients baking in the wrong assumption early.
There was a problem hiding this comment.
yeah I'm open to a lightweight deliveryStatus
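A client-side reading of such a hint might look like the Python sketch below. The `{ deliveryStatus: "throttled", retryAfterMs }` shape is only the illustrative one floated in this thread, not something the proposal defines, and `next_delay_ms` is a hypothetical helper.

```python
from typing import Any, Mapping


def next_delay_ms(hint: Mapping[str, Any], default_ms: int) -> int:
    """Pick how long to wait before the next poll or refresh.

    Honors a throttling hint when present and falls back to the client's
    normal interval otherwise.
    """
    if hint.get("deliveryStatus") == "throttled":
        retry_after = hint.get("retryAfterMs")
        usable = (isinstance(retry_after, int)
                  and not isinstance(retry_after, bool)
                  and retry_after > 0)
        if usable:
            # Never poll sooner than the server asked for.
            return max(retry_after, default_ms)
    return default_ms
```

The point of the hint is visible in the fallback branch: without it, the client has no way to tell throttling apart from a quiet upstream.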
| "id": 2, | ||
| "params": { | ||
| "name": "incident.created", | ||
| "params": { "severity": "P1" }, |
There was a problem hiding this comment.
Thoughts on renaming the inner params field to something like filter or criteria?
params.params is likely to create confusion in docs and code:
subscription.params.params.severity
filter also better communicates intent here: this field narrows which events are delivered rather than parameterizing a function call.
For example:
{
"name": "incident.created",
"filter": {
"severity": "P1"
}
}
There was a problem hiding this comment.
The intent was to match tools/call -- which I now realised I have failed at because tools/call uses arguments for the inner set...
Regarding filter, that may not be entirely accurate. While I do imagine filtering is the common use case, the arguments may not always be filters, e.g. a temperature event stream might have a parameter for Fahrenheit or Celsius, which is not really a filter but a transformation.
I do agree that params.params is confusing; hopefully params.arguments resolves that, and it has precedent in tool calling.
There was a problem hiding this comment.
arguments works great. The precedent from tools/call makes it consistent, and params.arguments reads much more naturally than params.params.
| | `name` | string | yes | Event type name | | ||
| | `timestamp` | string (ISO 8601) | yes | When the event occurred | | ||
| | `data` | object | yes | Payload conforming to the event type's `payloadSchema` | | ||
| | `cursor` | string \| null | no | Subscription position after this event (push/webhook only; poll carries cursor at the response level). `null` when the event type does not support replay — see *Cursor Lifecycle*. | |
There was a problem hiding this comment.
Would it be worth saying, in one sentence, that cursor is optional and nullable on both the request and the response, and that an absent cursor is treated identically to cursor: null (nothing to persist; start from / replay from now)? Concretely, it would help to have this spelled out symmetrically for both SDKs:
- Server SDK: when sending (subscribe/refresh response and delivered payloads), MAY omit `cursor` entirely instead of writing an explicit `null`; when receiving (the subscribe/poll request), MUST treat an absent `cursor` the same as `null` and MUST NOT fail on omission.
- Client SDK: when sending (the subscribe/poll request), MAY omit `cursor` instead of sending `null`; when receiving (responses and delivered payloads), MUST treat an absent `cursor` the same as `null` and MUST NOT fail on omission.
The motivation is emit-only upstreams like Slack, where there's no replay history and cursor is always null. If both SDKs know that "absent" and "null" are interchangeable, neither side has to special-case the field just to always write null, and an emit-only server that simply never emits it stays conformant.
Entirely possible this is already obvious from Required: no and I'm over-reading it; deferring to you on whether it's worth a line.
There was a problem hiding this comment.
What you've suggested is the intention. It can be clearer. Thanks!
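For what it's worth, the absent-equals-null rule is small enough to sketch. The Python below assumes messages are plain dicts after JSON decoding; `read_cursor` and `with_cursor` are hypothetical helper names, not SDK API.

```python
from typing import Any, Mapping, Optional


def read_cursor(message: Mapping[str, Any]) -> Optional[str]:
    """Receiving side: an absent `cursor` and `cursor: null` both yield None."""
    cursor = message.get("cursor")
    if cursor is not None and not isinstance(cursor, str):
        raise TypeError("cursor must be a string or null")
    return cursor


def with_cursor(message: Mapping[str, Any], cursor: Optional[str]) -> dict:
    """Sending side: omit the field entirely when there is nothing to send."""
    out = dict(message)
    out.pop("cursor", None)
    if cursor is not None:
        out["cursor"] = cursor
    return out
```

An emit-only server would always pass `None` to `with_cursor` and never write the field at all.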
|
|
||
| - **Poll mode:** Client calls `events/poll` with event name, params, and cursor. No separate subscribe step needed — the first poll with a null cursor bootstraps the subscription. Server holds no protocol-required state (the SDK MAY track an ephemeral poll lease for lifecycle hooks; see *Unsubscribe timing by mode*). | ||
| - **Push mode:** Client opens a long-lived `events/stream` request per subscription. Events are delivered on the SSE response stream (HTTP) or as notifications on stdout (stdio). Closing the request stream terminates the subscription. Server state is scoped to request lifetime. | ||
| - **Webhook mode:** Client calls `events/subscribe` to register a callback URL. The server POSTs events to that URL as they occur. Subscriptions have a mandatory TTL — the client must periodically refresh by re-calling `events/subscribe` before the TTL expires. If the client stops refreshing, the subscription expires and the server reclaims resources. Designed for remote servers where maintaining a long-lived connection is impractical. |
There was a problem hiding this comment.
I've been prototyping webhook delivery on a multi-tenant server, and two questions came up that I couldn't fully answer from the text. I may well be missing something you've already considered, so please read these as questions rather than gaps.
If I'm reading it right, the TTL value is server-defined while the existence of a TTL is mandated: refreshBefore is required in the subscribe response, the client re-calls events/subscribe before it, and the server resets the TTL on each refresh, with the duration surfacing as server SDK config (webhook_ttl, 30 minutes as an illustrative value). That split seems very sensible. Two smaller things I wasn't sure about:
1. Is a recommended TTL range worth considering?
Would it be worth the spec suggesting a TTL floor and a default order of magnitude (say, a floor of "at least 1 hour" and a default somewhere in the hours-to-~7-days range)? My thinking is that a generous TTL seems safe here because the server SHOULD re-check authorization at delivery time and emits a terminated envelope on access loss, so a long TTL doesn't obviously widen a stale-access window once events are flowing. The main cost seems to be slower reclamation of idle subscriptions, and an idle subscription isn't delivering anything anyway. The one thing I wasn't sure about: delivery-time re-auth is a SHOULD, and a fully idle subscription wouldn't get re-checked at all (nothing is delivered), so maybe leaning on re-auth as the safety argument is too strong. Curious whether you'd already weighed this.
2. The refresh model seems to assume a long-lived component for the timer
Re-calling events/subscribe on a schedule needs something in the deployment that can run a timer, and I think the spec already handles most of the cases. I want to make sure I'm reading it correctly:
- Agentic / long-lived clients look covered: the Client SDK section says the SDK "manages … webhook refresh cycles" and "runs a background refresh loop." So this population seems to get refresh for free.
- Forward-proxy deployments also look mostly covered, since the proxy is a long-lived service and a natural home for the refresh loop, and the callback endpoint "does not need to be the client itself."
The case I wasn't sure about is a deployment with no long-lived component at all, i.e. a pure serverless receiver with no SDK host and no proxy, just an endpoint that wakes when a request arrives. There doesn't seem to be a natural place there to run a refresh loop, and I'd worry that asking that integrator to stand up a cron purely to keep a subscription alive might be a bit of an adoption tax. But this may be a narrower case than I'm imagining, and "front it with a proxy" may simply be the intended answer.
If it's useful, a few directions I've been thinking about (very tentatively):
- Sliding-window refresh on delivery. Treating a successful delivery ack as an implicit TTL reset, so an actively-receiving subscription stays alive with no timer and only idle ones need explicit refresh. It seemed to compose well with the rest of the design, since active deliveries already carry the safe-to-persist cursor in each payload. I'd totally defer to you on whether it was considered and set aside for a reason.
- Optional long-lived / persistent subscriptions. For an endpoint whose intent is already verified (the challenge handshake or an allowlist), a subscription that doesn't require periodic refresh and ends only on `events/unsubscribe` or access revocation, i.e. roughly the Slack Events API model. If there's already a thread on static/persistent callback URLs, this probably belongs there; I looked through the Open Questions and couldn't spot it, so a pointer would help if it exists.
- Documented pattern for non-agentic receivers. Since the SDK refresh loop is already specced, maybe the missing piece is just guidance rather than mechanism, i.e. a recommended pattern for a receiver with no long-lived host (pair with a minimal scheduled refresh, front with a refreshing proxy, or the persistent mode above).
There was a problem hiding this comment.
Is a recommended TTL range worth considering?
Yes we should provide guidance around TTL.
I think the 1-7 days range feels too long though, but it depends on the server. Some servers will only have short-lived subscriptions, e.g. a coding agent subscribing to github just to keep track of CI. Others will be more long-running.
The cost of short TTL is having to send refreshes too frequently. Cost is O(1/TTL).
The cost of long TTL is subscription storage. Cost is O(TTL).
1/TTL is close to 0 for any reasonable TTL whereas subscription storage grows linearly. I'm thinking a bound like min 5 minutes to max 1 day feels more realistic, but depends on the expected rate of incoming subscriptions, which is server dependent. I don't see much value of a 7 day vs 1 day TTL. Sending a refresh once a day doesn't feel like much of a burden.
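A quick back-of-envelope in Python makes the asymmetry visible. It assumes the worst case for storage, where every new subscription is abandoned right after creation and lingers for one full TTL, so the steady-state record count is arrival rate times TTL (Little's law). The function names and the example arrival rate are illustrative only.

```python
DAY_S = 86_400


def refreshes_per_day(ttl_s: float) -> float:
    """Upper bound on refresh calls per subscription per day: O(1/TTL)."""
    return DAY_S / ttl_s


def lingering_records(arrivals_per_s: float, ttl_s: float) -> float:
    """Steady-state count of abandoned subscriptions still stored: O(TTL)."""
    return arrivals_per_s * ttl_s


for label, ttl_s in (("5 min", 300), ("1 day", DAY_S), ("7 days", 7 * DAY_S)):
    print(label, refreshes_per_day(ttl_s), lingering_records(2, ttl_s))
```

Going from a 1-day to a 7-day TTL saves less than one refresh per day per subscription, while the lingering-record count grows sevenfold.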
generous TTL seems safe here because the server SHOULD re-check authorization at delivery time and emits a terminated envelope on access loss
I agree, and I see TTL and access as independent concerns. I feel a server with a long TTL should probably either treat the delivery-time authorization re-check as closer to a MUST, or just make sure that events are pure notifications that carry nothing sensitive.
The case I wasn't sure about is a deployment with no long-lived component at all, i.e. a pure serverless receiver with no SDK host and no proxy, just an endpoint that wakes when a request arrives. There doesn't seem to be a natural place there to run a refresh loop, and I'd worry that asking that integrator to stand up a cron purely to keep a subscription alive might be a bit of an adoption tax.
If the server sent a heartbeat on the webhook, would this be a sufficient hint to wake up and refresh the subscription? That way you don't need to maintain the cron; the server does it for you. It's similar to your sliding-window idea in a way.
I don't think we can rely purely on unsubscribe since the server doesn't know if it will ever come. MCP sessions had this issue.
From the weekly meeting: it likely makes sense (for Slack's use cases especially) to allow a permanent subscription (i.e. infinite TTL) for server-to-server communication, as Slack does. Maybe we can support this by making the refreshBefore field optional; when it is missing, the subscription need not be refreshed.
The client now suggests a subscription lifetime via ttlMs on events/subscribe; the server's refreshBefore grant is authoritative and SHOULD only shorten (clamping up to a server minimum TTL is the one sanctioned exception). An explicit ttlMs: null requests no expiry, and refreshBefore: null grants it. This reframes server-side durability as the server's choice, coupled to the TTLs it grants: short grants keep the existing in-memory soft-state model, while long or no-expiry grants shift durability and cleanup onto the server (persist across restarts, persist verification status, failure-based GC instead of TTL expiry, no expectation of unsubscribe). Rationale: long-lived tenants with many subscriptions shouldn't be forced into a server-invented refresh cadence, and a one-year TTL is practically unlimited anyway — better to offer no expiry explicitly and attach the real obligations to it.
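The grant rule described above can be sketched roughly as follows. This is an illustration under assumptions, not the proposal's implementation: `min_ttl_ms`, `max_ttl_ms`, and `allow_no_expiry` are hypothetical server settings, `None` stands in for an explicit JSON `null`, `refreshBefore` is modeled as an epoch-milliseconds timestamp, and falling back to the server maximum when no-expiry is unsupported is one possible policy.

```python
from typing import Optional


def grant_ttl_ms(
    requested_ttl_ms: Optional[int],
    *,
    min_ttl_ms: int,
    max_ttl_ms: int,
    allow_no_expiry: bool,
) -> Optional[int]:
    """Return the granted TTL in milliseconds, or None for a no-expiry grant.

    The server only shortens the request, except that it may clamp up to
    its own minimum TTL.
    """
    if requested_ttl_ms is None:
        # Explicit ttlMs: null asks for no expiry; the server may decline
        # and fall back to its longest finite grant (assumed policy).
        return None if allow_no_expiry else max_ttl_ms
    return max(min_ttl_ms, min(requested_ttl_ms, max_ttl_ms))


def refresh_before_ms(now_ms: int, granted_ttl_ms: Optional[int]) -> Optional[int]:
    """Compute the refreshBefore value; None models refreshBefore: null."""
    return None if granted_ttl_ms is None else now_ms + granted_ttl_ms
```

A server that returns `None` here is taking on the durability and cleanup obligations listed above, which is why the sketch keeps no-expiry behind an explicit server-side switch.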
Initial design sketch for the MCP Events primitive — subscription model, delivery modes (poll/push/webhook), and protocol surface.
Draft for WG discussion.