feat(apps-proxy): AJDA-2746 track WS activity per data frame instead of presence#2604
feat(apps-proxy): AJDA-2746 track WS activity per data frame instead of presence#2604sykora-ji wants to merge 10 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR changes apps-proxy WebSocket activity tracking from “connection is open” to “actual WebSocket data frames are flowing”, by wrapping upgraded WebSocket connections and observing RFC 6455 frame headers in both directions.
Changes:
- Removed presence-based periodic notify (the 30s ticker + active connection counting) from the upstream WebSocket proxy path.
- Added a new
wsactivitypackage that passively parses RFC 6455 frames (allocation-free) and triggers a callback once per non-control frame, plus unit tests. - Added/updated integration + test utilities to verify end-to-end per-frame notify behavior and to make notification counters concurrency-safe in tests.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/pkg/service/appsproxy/proxy/apphandler/upstream/upstream.go | Removes ticker-based WS activity tracking and wraps upgraded conns to notify per data frame. |
| internal/pkg/service/appsproxy/proxy/apphandler/upstream/wsactivity/frame.go | Adds a streaming, allocation-free WebSocket frame header parser. |
| internal/pkg/service/appsproxy/proxy/apphandler/upstream/wsactivity/frame_test.go | Adds unit tests + fuzzing for the frame parser across header shapes and streaming patterns. |
| internal/pkg/service/appsproxy/proxy/apphandler/upstream/wsactivity/conn.go | Adds an io.ReadWriteCloser wrapper that observes frames on both Read and Write. |
| internal/pkg/service/appsproxy/proxy/apphandler/upstream/wsactivity/conn_test.go | Adds wrapper pass-through + short-write/error/close semantics tests. |
| internal/pkg/service/appsproxy/proxy/proxy_test.go | Adds an integration test for “handshake notify once, idle WS does not re-notify, data frame re-notifies after throttle”. |
| internal/pkg/service/appsproxy/proxy/testutil/api.go | Adds locking + safe getters for notification counts used during concurrent tests. |
| internal/pkg/service/appsproxy/proxy/testutil/app.go | Adds a /ws-idle endpoint for integration testing of idle WebSockets. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@sykora-ji ideally put it on canary/testing and retest the functionality of new frame activity parser |
MiroCillik
left a comment
There was a problem hiding this comment.
Reviewed the per-frame WS activity tracking. Architecture is clean and very well-tested (fuzz, split-feeds, masked/unmasked, short-write-no-double-feed). Build + upstream/wsactivity test packages pass locally. One performance item worth addressing (or at least benchmarking) before rollout, plus two minor cleanups — inline.
| // uses context.WithoutCancel, so the in-flight call survives the WS | ||
| // timeout and any per-request cancellation. | ||
| reqCtx := res.Request.Context() | ||
| res.Body = wsactivity.Wrap(rwc, func() { u.notify(reqCtx) }) |
There was a problem hiding this comment.
Medium — per-frame goroutine + span churn; the 30s throttle does not absorb it.
This callback fires u.notify(reqCtx) once per data frame, and u.notify unconditionally (a) spawns a goroutine via u.manager.wg.Go, (b) starts an OTel span keboola.go.apps-proxy.upstream.notify, and only then (c) calls notify.Manager.Notify, where the 30s throttle check actually lives (notify.go:60).
So the throttle caps the outbound PATCH but not the goroutine spawn or the exported span. handleUpgradeResponse copies with a 32 KB buffer and the parser fires once per frame within each chunk — a data app streaming small frames (live dashboard, chat, log tail) can produce thousands of frames per Read → thousands of goroutines + thousands of discarded spans per second. The old ticker was one call / 30s / app.
Suggestion: gate cheaply before u.notify — e.g. a per-upstream atomic.Int64 next-allowed unix-nano compared against clock.Now(), only calling u.notify when the window has elapsed. The authoritative throttle in notify.Manager still guarantees correctness; this just keeps a cheap fast-path for the ~99.9% of frames that get throttled out. At minimum, benchmark on a high-frame-rate app before production.
| // non-activity: notify is skipped on a Running app and the request is rejected | ||
| // with 503 Retry-After on a Suspended one, requiring the user to perform a | ||
| // meaningful action (refresh, click into the UI) to wake the app. | ||
| func isFrameworkBackgroundPoll(path string) bool { |
There was a problem hiding this comment.
Minor — hardcodes Streamlit-internal paths. Exact-match is correctly tested (trailing-slash / query-string safe), but this couples apps-proxy to Streamlit's private /_stcore/* endpoints and will silently grow per framework. Consider framing it explicitly as a temporary/framework-specific workaround like the adjacent Origin-rewrite block already does, and note the long-term follow-up (server-side config / app-level beacon) so it doesn't quietly calcify.
|
|
||
| // NotificationsSnapshot returns a copy of the current Notifications map under | ||
| // a lock. | ||
| func (v *DataAppsAPI) NotificationsSnapshot() map[string]int { |
There was a problem hiding this comment.
Minor — dead code. NotificationsSnapshot is never called (only NotificationsCount is used by the new test). Exported test-helper methods won't trip the unused linter. Either drop it or use it — project guidance is to remove unused functions rather than keep them.
Validation note: will this fully solve SUPPORT-16324 (GRPN forgotten-tab autosuspend)?This PR correctly fixes the stated root cause — the presence ticker is gone, Tornado's One caveat worth confirming before we close the support ticket, since the report specifically says "logs show periodic Streamlit reruns from a tab held open":
If the GRPN app's "periodic reruns" are app-driven (e.g. autorefresh) rather than user-driven, this change will not suspend it — the server-pushed data frames keep it alive via the new path, just as the old ticker did. Suggested follow-ups before declaring SUPPORT-16324 resolved:
Possible refinement (design conversation, not a blocker — diverges from the bidirectional spec): counting only client→server data frames as activity would more closely match "a user actually did something" (genuine interaction → client |
…e parsing Stateful RFC 6455 frame parser with a passthrough io.ReadWriteCloser wrapper that drives one parser per direction. The wrapper invokes a callback once per non-control frame (opcodes 0x0..0x7 selected via `opcode & 0x8`) and never allocates based on the attacker-controlled payload length field — payload bytes are skipped via a uint64 counter. Includes unit tests covering 7/16/64-bit payload lengths, masked and unmasked frames, fragmented messages, multi-frame buffers, byte-by-byte streaming, reserved data/control opcodes, and a fuzz test for parser panic safety. Used by upstream proxy in a follow-up commit.
Existing table-driven proxy tests only read Notifications after process shutdown, so direct map access is race-free for them. Tests that observe notify behavior mid-flight (e.g. during a running WebSocket) need a locked accessor — add `sync.Mutex` around the map and expose `NotificationsCount` / `NotificationsSnapshot` getters.
Accepts a WebSocket and reads forever without writing any data frames. The library's Read loop continues to auto-respond to client pings with pongs, but no data ever leaves the server. Used by tests that need to distinguish data-frame activity from idle WebSockets exchanging only protocol-level keep-alive traffic.
…of presence Previously a 30s ticker fired `notify()` whenever any WebSocket was open, counted by an `activeWsCount` atomic. This kept apps awake indefinitely even when only protocol-level ping/pong was exchanged — the forgotten browser tab case reported in SUPPORT-16324 / PROF-104. Replace the ticker with per-frame activity tracking: install a `proxy.ModifyResponse` hook that wraps `res.Body` (which `httputil.ReverseProxy.handleUpgradeResponse` uses as the upstream-side io.ReadWriteCloser for bidi-copy after 101). The wsactivity wrapper parses RFC 6455 frame headers on both directions and invokes `u.notify(ctx)` once per non-control frame; the existing 30s-per-app throttle in notify.Manager absorbs the resulting burst. User-visible behavior shift: an open WebSocket that only exchanges ping/pong frames no longer keeps an app from auto-suspending. Apps that intentionally relied on the 30s heartbeat must now emit actual data frames to signal liveness. Add `TestWebsocketActivityTracking` integration test using FakeClock: verifies (a) idle WS past the throttle window does NOT produce additional notifies — the regression test for the removed ticker — and (b) a data frame past the throttle window DOES produce a second notify, proving the per-frame path is wired end-to-end.
…vity
CI golangci-lint flagged 9 issues, all in the wsactivity package:
- gosec G115 integer-overflow conversions in frame.go — annotated with
//nolint:gosec and an inline justification of the bounds that make
each conversion safe (extLen/maskKey/payload counters).
- intrange — switch `for k := 0; k < n; k++` to `for k := range n`
(Go 1.22+).
- prealloc — preallocate three test byte slices with computed capacity
to avoid intermediate growths.
- depguard — conn_test.go imported the stdlib `errors`; switch to the
project's `internal/pkg/utils/errors` package per coding convention.
Also addresses two Copilot PR review comments:
- testutil/app.go `/ws-idle` doc comment: corrected the description —
the handler drains reads only, it never writes ("echoes" was wrong).
- proxy_test.go TestWebsocketActivityTracking doc comment + assertion
message: clarified that the idle-WS check is not a hard regression
test against a future real-time ticker (the previous implementation
slept 30 s of wall-clock time, which 250 ms wall-clock cannot catch).
Code review must catch any reintroduction of presence-based polling.
No functional change.
… comment The inline comment claimed the short-write half-point lands inside the header, which is wrong for the 16-byte masked text frame used by the test (header is 6 bytes, half-point at 8 falls past header end). The test itself still proves the no-double-feed invariant: a buggy "feed full p on every Write" implementation would, on the caller's retry, re-walk the last bytes (mask key + payload prefix) and fire spurious callbacks. Reword the comment to describe what the test actually catches. No code change.
Address review feedback that the original byte-by-byte state machine felt C-like and could be more Go-idiomatic. The previous version had: - 7 mutable struct fields tracking partial state of each sub-phase, - 5 explicit states (header0/header1/extLen/maskKey/payload), - three near-duplicate "consume up to N bytes" loops, - manual big-endian accumulation via shift-and-or, - four //nolint:gosec annotations for narrowing conversions. Replace with a header-buffer approach: accumulate up to 14 bytes of the current frame's header into a fixed-size array, then decode the whole header at once using binary.BigEndian. Implicit state is now encoded in just three counters (headerLen, headerNeed, payloadLeft) with two obvious phases — drain payload, accumulate header. - 4 struct fields instead of 7. - 2 implicit phases instead of 5 explicit states. - 1 byte-copy code path (via copy()) instead of 3. - 0 gosec annotations: header offsets are statically bounded, payload bytes are skipped via uint64 counter only (still allocation-free against 2^63-1 lengths). - binary.BigEndian.Uint16/Uint64 instead of manual `<<8 | x` loops. Semantics are unchanged — all existing unit tests pass without modification except for one assertion that referenced the removed internal `state` field; updated to check the equivalent post-condition on headerLen/headerNeed/payloadLeft.
…nd polls Discovered during canary testing: the WebSocket per-frame fix alone does not deliver the customer-visible improvement, because the Streamlit frontend posts /_stcore/health and /_stcore/host-config to apps-proxy every WS reconnect cycle (~20 min, imposed by an external idle timeout). Each request flows through trace().GotConn → u.notify(), so lastRequestTimestamp keeps moving forward by reconnect alone — defeating auto-suspend on every forgotten tab. Add isFrameworkBackgroundPoll helper recognizing the two Streamlit background-poll paths and use it in two places: - trace() middleware: skip notify when the request path is a background poll. The upstream call still happens normally; we just do not treat the round-trip as user activity. - ServeHTTPOrError: add a new case to the suspended-state switch. When the app is auto-suspended (Stopped + auto-restart enabled, no DevMode) and the path is a background poll, reply 503 with Retry-After: 60 and do NOT trigger wakeup. The Streamlit client backs off; the user must perform a meaningful action (refresh, click in UI → GET / or similar) to wake the app, which lands on a non-poll path and falls into the default branch. Effect: forgotten tabs stay suspended indefinitely. autoSuspendAfterSeconds becomes a usable cost control again. The 503 vs spinner-page distinction also lets the Streamlit client surface a "Connection lost" state in the tab rather than infinitely spinning, which is the expected UX once the app has been auto-suspended. Unit test covers the helper (matched paths + comprehensive negatives: /_stcore/stream WS upgrade, /_stcore/upload_file user action, normal user paths, prefix-look-alike rejection). Two new integration cases: - public-app-framework-background-poll-skips-notify: three background polls against a Running app produce zero notifications (the existing path would have produced one per request). - public-app-framework-background-poll-on-suspended-returns-503-no-wakeup: background poll on a Stopped app returns 503 + Retry-After: 60 with an empty body (vs the spinner-page HTML served by the default branch), so wakeup is not triggered.
…ded background poll Canary testing showed that returning a bare empty 503 on a background poll of a suspended app leaves the data-app frontend (Streamlit) showing a generic, confusing "Connection error" with no guidance — and, unlike the spinner served on a real wakeup, no indication of what happened or what to do. Serve a plain-text message instead, via a new pagewriter.WriteSuspendedPage: "The application was paused due to inactivity. Refresh the page to start it again." Streamlit displays the health-check response body in its connection modal (the same mechanism behind the existing "re-starting, please wait" plain-text message — see IsStreamlitHealthCheck / AJDA-1935), so the user now sees an honest, actionable message. A reload hits GET /, a non-poll path, which is treated as real activity and triggers wakeup → spinner → app starts. This does not change the cost behavior (the app still stays suspended on background polls — no wakeup is triggered); it only improves the message the user sees. Full auto-recovery on tab refocus still requires the presence beacon, tracked separately. Update the integration test to assert the plain-text Content-Type and the "Refresh the page" body instead of an empty body / literal "60" Retry-After.
6368c8b to
2a4d6eb
Compare
…al scrollbar
Canary testing showed the previous message ("The application was paused due to
inactivity. Refresh the page to start it again.", 81 chars) wraps to several
lines in Streamlit's connection modal, which has limited height and ends up
showing a scrollbar.
The modal renders the response body as plain text — it honors neither newlines
nor HTML (only markdown [text](url) links), so wording length is the only lever.
Shorten to "App paused. Reload the page to restart." (39 chars), which still
states what happened and what to do but fits without scrolling.
Update the integration-test assertion accordingly (Reload vs Refresh).
link to issue
Description
Replaces presence-based WebSocket activity tracking with per-frame activity tracking in
apps-proxy.Removed. The 30 s ticker goroutine in
AppUpstream.NewUpstreamthat callednotify()wheneveractiveWsCount > 0, theactiveWsCount atomic.Int64field, and itsAdd(1)/Add(-1)bookkeeping innewWebsocketProxy.Added. A new internal package
internal/pkg/service/appsproxy/proxy/apphandler/upstream/wsactivity/:frame.go— a stateful, allocation-free RFC 6455 frame parser. State machine:header0 → header1 → extLen → maskKey → payload → header0. Discriminates control vs data viaopcode & 0x8(data: 0x0–0x7, control: 0x8–0xF). Handles 7-, 16-, and 64-bit payload lengths and masked frames. Payload bytes are skipped via auint64counter — no buffer is sized from the wire (payload length can be up to 2^63-1 per RFC).conn.go—Wrap(io.ReadWriteCloser, FrameCallback) io.ReadWriteCloserinstantiates one parser per direction. Read feeds the server→client parser, Write feeds the client→server parser; bytes pass through unchanged.Integration in
upstream.go.newWebsocketProxyinstalls aproxy.ModifyResponsehook that, on 101 Switching Protocols, wrapsres.Body— whichhttputil.ReverseProxy.handleUpgradeResponsetype-asserts toio.ReadWriteCloserand uses as the upstream end of the bidirectional copy with the hijacked client conn. A single wrap captures both directions of the WebSocket stream. The wrapper invokesu.notify(reqCtx)once per non-control frame; the existing 30 s-per-app throttle innotify.Managerabsorbs the burst (at most one Sandboxes Service PATCH per app per 30 s, regardless of frame rate). The existingtrace().GotConnnotify on the upstream handshake is preserved.Tests.
wsactivity/{frame,conn}_test.go— unit tests across all frame-header shapes (7/16/64-bit length, masked/unmasked), control-frame skipping, fragmented messages (one callback per frame), multi-frame buffers, byte-by-byte streaming, reserved opcode classification, fuzz test for parser panic safety, andWrappass-through / short-write / Close semantics.proxy_test.go— newTestWebsocketActivityTrackingusing aclockwork.FakeClock: opens an idle WS to a new/ws-idleendpoint, advances the fake clock past the 30 s throttle window, and asserts (a) notify count stays at 1 (regression test for the removed ticker) and (b) sending a real text frame past the window increments notify to 2 (proves the per-frame path is wired end-to-end).testutil/api.go—sync.MutexaroundDataAppsAPI.NotificationsplusNotificationsCount/NotificationsSnapshotgetters, needed for race-free reads while the proxy is still running.testutil/app.go—/ws-idleendpoint that accepts a WebSocket and reads forever without writing any data frames.Release Notes
autoSuspendAfterSecondssetting was effectively disabled for Streamlit data apps with regular viewers: a forgotten browser tab keeps a WebSocket open, the 30 s ticker resets the activity timestamp on every iteration, and the app never suspends. Compute is billed for sessions nobody is using.autoSuspendAfterSecondsbecomes a usable cost control again.apps-proxyrollout — test ondev-*tag first, have PM verify on testing stack, then promote stack by stack.apps-proxyand contains no schema or data migration. Re-deploying the prior image restores the old presence-based tracking.