feat(security): authenticate multi-player WebSocket connections #405
leotrs wants to merge 27 commits into
The multi-player Y.js server was previously open: anyone who could reach
the publicly-exposed Fly TCP port (1234) could enumerate rooms like
file-{id}-prod, read full document source, and inject Y.js updates that
the backend YDocClient would persist. This was the worst of the C1-C5
beta blockers.
Token model:
- Backend mints a short-lived (5 min) JWT on POST /files/{id}/collab/start,
signed with INTERNAL_SHARED_SECRET. Claims: sub, file_id, role, iat, exp.
- The role claim carries the user's actual permission (OWNER/EDITOR/
COMMENTER) or "backend"; gate widened from require_edit to require_view
so viewers/commenters can still get a token for read-only sessions.
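A minimal sketch of the token model above, using only the Python standard library. The HS256 algorithm, the function names `mint_token` and `verify_token`, and the `now` parameter are assumptions made for illustration; the real code in backend/aris/collaboration/auth.py may differ and would normally use a JWT library that also validates the header.

```python
import base64
import hashlib
import hmac
import json
import time

TOKEN_TTL_SECONDS = 5 * 60  # "short-lived (5 min)" per the commit message


def _b64url(data):
    """Base64url-encode bytes without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text):
    """Decode unpadded base64url text back to bytes."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret, signing_input):
    return hmac.new(secret.encode(), signing_input.encode("ascii"), hashlib.sha256).digest()


def mint_token(secret, user_id, file_id, role, now=None):
    """Build a signed token carrying the claims sub, file_id, role, iat, exp."""
    issued_at = int(time.time()) if now is None else int(now)
    header = {"alg": "HS256", "typ": "JWT"}  # algorithm is an assumption
    claims = {
        "sub": str(user_id),
        "file_id": file_id,
        "role": role,
        "iat": issued_at,
        "exp": issued_at + TOKEN_TTL_SECONDS,
    }
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, claims)
    )
    return signing_input + "." + _b64url(_sign(secret, signing_input))


def verify_token(secret, token, now=None):
    """Return the claims, or raise ValueError on a bad signature or expiry."""
    current = int(time.time()) if now is None else int(now)
    header_b64, claims_b64, signature_b64 = token.split(".")
    expected = _sign(secret, header_b64 + "." + claims_b64)
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(claims_b64))
    if claims["exp"] <= current:
        raise ValueError("token expired")
    return claims
```

Passing `now` explicitly keeps the expiry check deterministic in tests; production callers would leave it unset.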
Handshake:
- Client sends {type:'auth', token:'<jwt>'} as the first WS text frame.
- Multi-player server verifies signature + exp + file_id-matches-docName
(skipped for role='backend' — system trust), replies {type:'auth_ok'},
then yields to setupWSConnection + the existing auto-bootstrap flow.
- Any failure → server closes with code 4401 'auth-failed'.
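The server-side decision in the handshake can be sketched as follows. This is an illustration in Python, not the multi-player server's actual JavaScript: `handle_first_frame`, `expected_file_id`, and the injected `verify` callable are hypothetical names, and the room-name parsing assumes the `file-{id}-prod` shape mentioned earlier.

```python
import json

AUTH_FAILED_CLOSE_CODE = 4401  # close code named in the commit message


def expected_file_id(doc_name):
    """Extract the id from a room name shaped like 'file-{id}-{env}' (assumed)."""
    if not doc_name.startswith("file-"):
        return None
    file_id, sep, _env = doc_name[len("file-"):].rpartition("-")
    return file_id if sep and file_id else None


def handle_first_frame(frame_text, doc_name, verify):
    """Decide how to answer the first text frame of a connection.

    `verify` is any callable that returns the claims dict for a valid token
    and raises ValueError otherwise (signature and exp checks live there).
    Returns ("ok", reply_json) or ("close", 4401).
    """
    try:
        message = json.loads(frame_text)
        if not isinstance(message, dict) or message.get("type") != "auth":
            raise ValueError("first frame is not an auth message")
        claims = verify(message.get("token", ""))
    except ValueError:
        return ("close", AUTH_FAILED_CLOSE_CODE)

    # role='backend' skips the file_id check (system trust), per the handshake notes.
    if claims.get("role") != "backend":
        if str(claims.get("file_id")) != expected_file_id(doc_name):
            return ("close", AUTH_FAILED_CLOSE_CODE)

    return ("ok", json.dumps({"type": "auth_ok"}))
```

Every failure path returns the same close code, so a client cannot tell a bad signature from a room mismatch.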
Wiring:
- backend/aris/collaboration/auth.py mints + verifies tokens.
- YDocClient sends an auth message before SyncStep1; ?role=backend query
param is gone (role now comes from JWT).
- CLI fetches a token via /collab/start, then sends it on the WS.
- Frontend wraps WebsocketProvider with AuthedWebSocket which holds
y-websocket's onopen until auth_ok arrives, buffering any sync sends
in the meantime. Tokens are refreshed on each reconnect.
Tests: new multi-player/server.auth.test.js (16 cases), new
backend/tests/test_collaboration/test_auth.py (8 cases), updated
test_routes_collab.py + test_yjs_client_role.py to exercise the new
handshake. 33 multi-player + 74 backend collab/route tests pass.
Closes std-kab28c.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
✅ Deploy Preview for rsm-studio-frontend canceled.
✅ Deploy Preview for rsm-studio-site canceled.
Preview Deploy
Frontend: https://pr-405--rsm-studio-frontend.netlify.app
Test user:
This preview will be destroyed when the PR is closed.
The previous commit added jsonwebtoken to multi-player/package.json for WS auth, but the lockfile was not regenerated. Docker builds run `npm ci`, which requires exact lockfile alignment, so the whole stack failed to start in CI, bricking e2e-collab, e2e-frontend, and e2e-site (they all share the docker-compose stack).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EditorCodeMirror.vue calls `provider.value.ws?.addEventListener('error', ...)`
on the WebSocket polyfill. The wrapper only exposed `.on{open,message,error,
close}` setters, so the call threw a TypeError that aborted the watcher before
EditorView was created — `.cm-editor` never mounted, breaking every e2e-collab
test and any auth-content test that opens the editor.
Make AuthedWebSocket extend EventTarget and dispatch fresh open/message/error/
close events alongside the existing setter callbacks. y-websocket's internal
`.onmessage` etc. setters continue to work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… instances
y-websocket's broadcastMessage (line 234 of y-websocket.js) gates every Y.Doc update on `ws.readyState === ws.OPEN`, accessing OPEN on the wrapper instance. We had CONNECTING/OPEN/CLOSING/CLOSED only as static class properties, so the instance lookup returned undefined, `1 === undefined` was always false, and every local edit's update message was silently dropped on the client side.
The initial syncStep1 / awareness sends happen inside y-websocket's onopen handler with a direct websocket.send() (no readyState check), which is why provider.synced went true and initial document state propagated correctly, but no subsequent edit ever reached the server.
This explains the multi-tab and compile-persistence e2e failures: tabs see the initial DB content but nothing they type propagates anywhere, neither to other tabs nor to the backend YDocClient that persists to PostgreSQL.
Match the browser WebSocket contract by also exposing the constants on the prototype so instance access works. Regression test verifies both axes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Needs human review
What changed: Adds JWT-based authentication to the Y.js multi-player WebSocket server. The frontend editor, CLI
Review checklist:
What to look for:
Notes for reviewer:
Needs human review
This PR started as multi-player WebSocket auth and grew to also fix the deploy-config gaps that auth surfaced. Full scope below.
What changed
1. Multi-player WebSocket auth (the original change)
2. Deploy-config fixes (found while getting the preview reviewable)
3. Tests + docs
Live preview
Review checklist
What to look for / flag
CI note
All real checks are green. The single red
Resolve import conflict in backend/aris/routes/file.py — keep list_user_accessible_files (from #404) AND mint_collab_token (from this PR). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…upervisord
supervisord's `environment=` directive replaces the inherited env wholesale rather than extending it. The prod supervisord.conf only forwarded MULTIPLAYER_PORT and HOST to the multi-player Node process, so the INTERNAL_SHARED_SECRET set on the Fly app never reached the JWT verifier.
Result: every WS auth handshake added in this PR fails with code 4401, the editor's status bar shows "Offline", and CodeMirror never receives a Y.js sync (rendered HTML still shows because that's the cached file.html from GET /files).
dev (`supervisord.dev.conf`) already forwards both vars; the bug was prod-only. Tests didn't catch it because unit tests inject the secret directly and CI e2e runs against the dev compose stack.
Also forward BACKEND_INTERNAL_URL with the prod backend port (8080) so the multi-player's auto-bootstrap call to /internal/collab/start reaches uvicorn (the server.js default fallback is 8000, which is the dev port).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Frontend builds with VITE_MULTIPLAYER_URL='wss://...:1234' (HTTPS-origin pages can't use bare ws://), but port 1234 was previously configured as plain TCP with no handlers. Fly's edge therefore didn't terminate TLS on that port, and every wss:// handshake failed before the new auth message could even be sent, so the editor's status bar shows "Offline".
`handlers = ["tls", "http"]` tells Fly's edge to terminate TLS using the app's wildcard *.fly.dev cert and forward plain HTTP/WS to the container.
This is the second half of the offline-editor fix; the supervisord change in the previous commit forwards INTERNAL_SHARED_SECRET so the auth check succeeds once the TLS layer is working.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y gaps
Three tests to prevent the supervisord-env and fly-tls-handler bugs from
recurring (both root-caused while reviewing this PR):
1. backend/tests/test_docker_config.py
- test_multiplayer_env_forwards_internal_shared_secret: asserts
supervisord.conf's [program:multiplayer] env line includes
INTERNAL_SHARED_SECRET="%(ENV_INTERNAL_SHARED_SECRET)s" — without it,
JWT verify uses undefined and every WS auth 4401s.
- test_multiplayer_env_forwards_backend_internal_url: asserts
BACKEND_INTERNAL_URL points at localhost:8080 (prod backend port,
not the dev 8000 fallback in server.js).
- test_multiplayer_port_has_tls_handler: asserts backend/fly.toml port
1234 has `handlers = [..., "tls", ...]`. Without it, Fly's edge
never terminates TLS on that port and the frontend's wss:// fails.
2. scripts/smoke-test-preview.py
End-to-end smoke test: registers a user, creates a file, calls
/collab/start to mint a token, opens wss://...:1234, sends auth frame,
asserts auth_ok. Validates the whole live stack (TLS + JWT verify +
secret consistency) that unit tests can't see.
3. .github/workflows/preview.yml
Runs the smoke test as the final preview-deploy step. Fails the
preview deploy job if the editor would be Offline in a real browser,
so CI catches deploy-config drift instead of a human reviewer noticing
when they open the preview.
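The supervisord assertion described in item 1 could look roughly like the sketch below. It is not the contents of backend/tests/test_docker_config.py: `forwarded_env_vars` and `missing_required_vars` are hypothetical helpers, and the comma split is naive (it assumes no value contains a comma).

```python
import configparser

# Variables the multi-player process needs, per the commit messages above.
REQUIRED_VARS = ("INTERNAL_SHARED_SECRET", "BACKEND_INTERNAL_URL")


def forwarded_env_vars(conf_text, program="multiplayer"):
    """Return the variable names listed on a program's environment= line."""
    # interpolation=None keeps supervisord's %(ENV_X)s syntax as literal text.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    raw = parser.get("program:" + program, "environment", fallback="")
    names = set()
    # Continuation lines are joined with newlines; treat them like commas.
    for pair in raw.replace("\n", ",").split(","):
        name, sep, _value = pair.partition("=")
        if sep and name.strip():
            names.add(name.strip())
    return names


def missing_required_vars(conf_text, program="multiplayer"):
    """Return the required variables that the config does not forward."""
    return sorted(set(REQUIRED_VARS) - forwarded_env_vars(conf_text, program))
```

A test would read docker/supervisord.conf, call `missing_required_vars`, and assert that the result is an empty list.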
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Needs human review
What changed: The Y.js multi-player WebSocket now requires a short-lived JWT as the first frame (mints on
Review checklist:
What to look for:
The existing yjs-multi-tab.spec.js uses browser.newContext() per tab, which puts every tab in its own browser session. y-websocket's BroadcastChannel does NOT bridge across isolated contexts, so those tests only exercise the WebSocket relay path. The real-world failure mode reported during PR #405 review (opening two tabs in the same Chrome window and typing in both) runs through BroadcastChannel in addition to the WS relay and was uncovered by the suite.
Adds yjs-multi-tab-same-browser.spec.js with two cases that share one browser context across pages and use real keyboard input (not view.dispatch). Each case attaches page-error + console.error capture and asserts no "No tile at position undefined" leaks out of the CodeMirror measure cycle when remote edits land, which is the exact symptom captured in the review screenshot.
Also adds openSecondTab() to yjs-helpers.js so other multi-tab same-browser tests can reuse the shared-context setup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Needs human review
What changed: Adds short-lived JWT authentication to the Y.js multi-player WebSocket server. Every editor connection (browser, CLI, backend client) now performs an auth handshake before Y.js sync proceeds; the frontend goes through a new
Review checklist:
What to look for:
Needs human review
What changed: Multi-player WebSocket now requires a short-lived JWT as the first frame; the editor flow on the frontend, the backend Y.js client, and the CLI
Staging preview (use this exact URL):
Review checklist:
What to look for:
backend/Dockerfile stripped the editable rsm-lang reference from pyproject.toml and let uv sync pull rsm-lang from PyPI. The cloned aris-pub/rsm checkout in the build context was used only for the Node services (rsm-lsp, tree-sitter-rsm), so Python-side rsm fixes (anything in rsm/static/, the renderer) silently failed to propagate to staging and prod even after merging to rsm main.
Replace the sed-strip with `COPY rsm /rsm`. backend/pyproject.toml's [tool.uv.sources] points rsm-lang at "../../rsm", which resolves to /rsm from /app, so uv sync installs rsm-lang (and its tree-sitter-rsm dep) editable from the cloned GH-main checkout. tree-sitter-rsm's committed src/parser.c + src/scanner.c build as a C extension under the runtime stage's existing build-essential; python:3.13-slim ships the headers.
Staging now sources all three rsm pieces from aris-pub/rsm main, identical to Dev and CI. Documented the four environments and their differences in docs/environments.md, linked from the README.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit added `COPY rsm /rsm` but left pyproject.toml's
relative `path = "../../rsm"`. uv cannot normalize that inside the
container: from /app, ../../rsm escapes the filesystem root and uv
rejects it ("cannot normalize a relative path beyond the base
directory"). The deploy build failed at `uv sync`.
Rewrite the source to the absolute `/rsm` where the checkout was copied.
Verified locally: uv sync resolves rsm-lang + tree-sitter-rsm editable
from the absolute path, the tree-sitter-rsm C extension compiles, and
the resolved libraries.js carries the temml fix.
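To illustrate why the relative path fails inside the container but works in a deeper local checkout, here is a small depth-counting sketch. It is not uv's implementation; `escapes_root` is a hypothetical helper, and the local path in the usage note is made up.

```python
import posixpath


def escapes_root(base_dir, relative):
    """True when `relative`, resolved from `base_dir`, climbs above '/'."""
    # Number of directory levels between '/' and base_dir.
    depth = len([p for p in posixpath.normpath(base_dir).split("/") if p])
    for part in relative.split("/"):
        if part in ("", "."):
            continue
        if part == "..":
            depth -= 1
            if depth < 0:
                return True  # stepped above the filesystem root
        else:
            depth += 1
    return False
```

From `/app` there is only one level to climb, so `escapes_root("/app", "../../rsm")` is true, while the same relative path from a deeper directory such as `/home/dev/studio/backend` stays inside the filesystem. A plain `posixpath.normpath("/app/../../rsm")` silently clamps at the root and yields `/rsm`, which matches what the earlier commit expected; a stricter resolver rejects the path instead.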
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
multi-player/package.json has a `test` script (vitest run) covering server.auth.test.js, server.bootstrap.test.js, server.test.js — 33 tests including the 16 WebSocket-auth cases added in this PR. No CI job invoked it, so a regression in the multi-player auth handshake would not be caught. Add a unit-multiplayer job mirroring unit-site: checkout, pnpm install, `pnpm --filter ./multi-player run test`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The .hns-exit file is a harness exit signal and was committed by mistake in an earlier session. Untrack it and add it to .gitignore. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Needs human review
What changed: The Y.js multi-player collaborative editor now requires a short-lived JWT to connect; the frontend fetches it from
Review checklist:
What to look for:
Needs human review
What changed: Multi-player WebSocket connections now require a short-lived JWT: the editor performs an auth handshake before Y.js sync, the
Review checklist:
What to look for:
Needs human review
What changed: Adds JWT auth to the Y.js multi-player WebSocket (token minted by
Review checklist (against
What to look for:
Needs human review
What changed: Adds short-lived JWT authentication for the Y.js multi-player WebSocket: the editor now performs an auth handshake before connecting, and
Review checklist:
What to look for:
Needs human review
What changed: The collaborative editor's multi-player WebSocket connection now requires a short-lived JWT handshake (
Review checklist (use the deploy preview):
What to look for:
Needs human review
What changed: The Y.js multi-player WebSocket now requires a short-lived JWT auth handshake: the collaborative editor connects, mints a token via
Review checklist (deploy preview: https://pr-405--rsm-studio-frontend.netlify.app):
What to look for:
The per-PR preview (preview.yml) uploads the prebuilt frontend/dist directly via nwtgck/actions-netlify, which does not read netlify.toml. Without frontend/public/_redirects (copied into dist/ by Vite), every client-side route 404s on reload — the failure a reviewer hit while verifying this PR's WS-auth path. The fix itself landed on main (663897d); this adds the missing regression test so the file can't silently disappear again. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The initial-content assertion read manuscript-viewer textContent once, right after manuscript-container mounts — racing the compiled-HTML paint and intermittently seeing empty content. Failure snapshots showed 'Hello' present, confirming a read-too-early race (failing on main too, chromium + firefox). Switch to auto-retrying web-first assertions (toContainText / not.toContainText). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Needs human review
What changed: Multi-player collaborative editing now requires a short-lived JWT auth handshake on the WebSocket: the CodeMirror editor mints a token via
Review checklist (use the per-PR preview, NOT the inert
What to look for:
Stuck on step 3. Typing works (rendered output refreshes correctly) but after reloading, all edits are lost.
Needs human review
What changed: Real-time collaborative editing now requires a short-lived JWT on the multi-player WebSocket, and the
Review checklist (use the per-PR preview, not the inert
What to look for:
Stuck after 5 work cycles
Last agent notes: Needs human intervention to unblock.
Needs human review
What changed: The Y.js multi-player collaborative editor now authenticates its WebSocket connection with a short-lived JWT (minted by
Review checklist (use the per-PR preview, NOT the
What to look for:
Needs human review
What changed: The Y.js multi-player WebSocket now requires a short-lived JWT: the frontend editor (
Review checklist (use the real per-PR preview, not the inert
What to look for:
Needs human review
What changed: The Y.js multi-player WebSocket now requires a short-lived JWT auth handshake: the collaborative editor must complete an
This is the live collaborative editor path: if the handshake, token refresh, or BroadcastChannel sync regresses, the editor silently goes Offline or freezes. Worth a hands-on pass before merge.
Review checklist (use the per-PR preview, NOT the inert
What to look for:
When the last frontend disconnects, the multi-player server kicks the backend YDocClient with close code 4000 and deletes the room. The client flushes to the DB and reconnects, but _connect_and_run builds a fresh empty Doc and the one-time _has_seeded guard suppressed the DB re-seed, so the backend rejoined the recreated (empty) room holding an empty document and served it to the next editor that opened. This is the "edits lost after reload" bug: a single-tab reload tears the room down, and the reloaded page synced against the empty backend doc.
Clear _has_seeded on the 4000 (all-frontends-left) close so the next _connect_and_run re-seeds from the DB and broadcasts the content to the reloaded frontend. Plain reconnects (backend hot-reload with live frontends present) keep the guard, so we never re-seed stale DB content over the frontends' newer in-room edits.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
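The guard behaviour described in this commit can be sketched as a tiny state holder. `SeedGuard` and its method names are hypothetical; only `_has_seeded` and close code 4000 come from the commit message, and the real YDocClient does considerably more.

```python
ALL_FRONTENDS_LEFT = 4000  # close code sent when the last frontend leaves


class SeedGuard:
    """Tracks whether the next connect should seed the room from the DB."""

    def __init__(self):
        self._has_seeded = False

    def should_seed_on_connect(self):
        """Seed once; later plain reconnects must not overwrite live edits."""
        if self._has_seeded:
            return False
        self._has_seeded = True
        return True

    def on_close(self, code):
        """Re-arm seeding only when the room was torn down (code 4000)."""
        if code == ALL_FRONTENDS_LEFT:
            self._has_seeded = False
```

With this shape, a reconnect after any other close code keeps the guard set, while a reconnect after 4000 seeds the recreated room from the DB again.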
Stuck after 1 work cycle
Failing checks:
Last agent notes: Needs human intervention to unblock.
The _empty_room_server helper for the re-seed regression test was added after WS auth landed and never consumed the auth handshake that YDocClient now sends first. The auth JSON frame was read as the sync message, failing `assert raw[0] == 0` and killing the connection before re-seed could run, which failed test_reseed_from_db_on_reconnect_to_empty_room in CI.
Mirror the handshake already present in _room_server.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
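A simplified picture of the fix: a fake server has to consume the auth text frame before it treats anything as a sync message. The function below is a hypothetical stand-in that works on a plain list of frames rather than a real WebSocket, so it only shows the ordering, not the test helper's actual code.

```python
import json


def read_sync_message(frames):
    """Consume the leading auth frame, then return the first sync frame.

    `frames` is an iterable of received frames: str for text, bytes for binary.
    """
    frames = iter(frames)
    first = next(frames)
    if not isinstance(first, str) or json.loads(first).get("type") != "auth":
        raise AssertionError("expected an auth text frame first")
    raw = next(frames)
    # Same check the commit message quotes: sync messages start with byte 0.
    assert raw[0] == 0, "expected a Y.js sync message (type 0)"
    return raw
```

Reading the first frame directly as the sync message is what broke the test: the first character of the auth JSON is `{`, which never equals 0.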
Needs human review
What changed: Adds short-lived JWT auth to the Y.js multi-player WebSocket: the collaborative editor now performs an
Review checklist (use the per-PR preview, which talks to the PR backend, NOT the
What to look for:
Needs human review
What changed: Multi-player WebSocket connections now require a short-lived JWT (minted by
Review checklist (use the per-PR preview, not the inert deploy-preview notice):
What to look for:


Summary
Adds short-lived JWT authentication for the Y.js multi-player WebSocket server, then fixes the deploy-config gaps that auth surfaced once it ran on a real Fly deploy.
1. Multi-player WebSocket auth
- `file-{id}-prod` and inject Y.js updates the backend persisted.
- `POST /files/{id}/collab/start` (gate widened `require_edit` → `require_view`). Client presents it as the first WS frame; multi-player verifies signature + exp + that `file_id` matches the docName before yielding to Y.js sync. Wrong/missing/expired tokens close with code 4401.
- `YDocClient`, the CLI `studio edit` command, and frontend `EditorCodeMirror.vue` via an `AuthedWebSocket` wrapper that holds y-websocket's `onopen` until `auth_ok` and queues sync sends meanwhile.
2. Deploy-config fixes
- `docker/supervisord.conf`: the prod multi-player process now receives `INTERNAL_SHARED_SECRET` + `BACKEND_INTERNAL_URL`. supervisord's `environment=` replaces the inherited env wholesale; without these every WS auth 4401'd on prod-style deploys.
- `backend/fly.toml`: port 1234 now has `handlers = ["tls", "http"]` so Fly's edge terminates TLS for the `wss://` multi-player URL.
- `backend/Dockerfile`: installs `rsm-lang` editable from the cloned `aris-pub/rsm` checkout (absolute `/rsm` path) instead of PyPI, so staging + prod now source all three rsm pieces from GH main, identical to dev/CI.
- `frontend/netlify.toml` + `site/netlify.toml`: Netlify's automatic `deploy-preview-N--` URL was built without the per-PR backend URL, so it pointed at the production backend. The `[context.deploy-preview]` build now publishes an inert static notice instead of the app, closing a "preview frontend talks to prod" hazard. The repo's real per-PR previews remain the `pr-N--` aliases from `preview.yml`.
3. Tests + docs
- `backend/tests/test_docker_config.py`: regression asserts for the supervisord secret + fly.toml TLS handler.
- `scripts/smoke-test-preview.py` + a `preview.yml` step: end-to-end WS-auth smoke test that fails the deploy if the editor would be Offline.
- `docs/environments.md`: documents the four environments (dev / test / staging / prod) and how each sources rsm. Linked from the README.
Closes std-kab28c.
Test plan
- `multi-player/server.auth.test.js`, `backend/tests/test_collaboration/test_auth.py`: auth handshake + every failure mode.
- `test_routes_collab.py`, `test_yjs_client_role.py`, `test_yjs_client_crdt.py` updated for the handshake.
- `test_docker_config.py`: supervisord + fly.toml regression asserts.
- `pr-405` preview: editor Online, `auth_ok` completes, `rsm-lang` served from the clone (`/rsm/rsm/__init__.py`).
- `deploy-preview-405--` now serves the inert notice, not a prod-pointing app.
- `pr-405--rsm-studio-frontend.netlify.app` preview: see the "Needs human review" comment.
🤖 Generated with Claude Code