Skip to content

feat(apps-proxy): AJDA-2746 track WS activity per data frame instead of presence#2604

Open
sykora-ji wants to merge 10 commits into
mainfrom
sykorajiri-AJDA-2746
Open

feat(apps-proxy): AJDA-2746 track WS activity per data frame instead of presence#2604
sykora-ji wants to merge 10 commits into
mainfrom
sykorajiri-AJDA-2746

Conversation

@sykora-ji
Copy link
Copy Markdown

link to issue

Description

Replaces presence-based WebSocket activity tracking with per-frame activity tracking in apps-proxy.

Removed. The 30 s ticker goroutine in AppUpstream.NewUpstream that called notify() whenever activeWsCount > 0, the activeWsCount atomic.Int64 field, and its Add(1) / Add(-1) bookkeeping in newWebsocketProxy.

Added. A new internal package internal/pkg/service/appsproxy/proxy/apphandler/upstream/wsactivity/:

  • frame.go — a stateful, allocation-free RFC 6455 frame parser. State machine: header0 → header1 → extLen → maskKey → payload → header0. Discriminates control vs data via opcode & 0x8 (data: 0x0–0x7, control: 0x8–0xF). Handles 7-, 16-, and 64-bit payload lengths and masked frames. Payload bytes are skipped via a uint64 counter — no buffer is sized from the wire (payload length can be up to 2^63-1 per RFC).
  • conn.goWrap(io.ReadWriteCloser, FrameCallback) io.ReadWriteCloser instantiates one parser per direction. Read feeds the server→client parser, Write feeds the client→server parser; bytes pass through unchanged.

Integration in upstream.go. newWebsocketProxy installs a proxy.ModifyResponse hook that, on 101 Switching Protocols, wraps res.Body — which httputil.ReverseProxy.handleUpgradeResponse type-asserts to io.ReadWriteCloser and uses as the upstream end of the bidirectional copy with the hijacked client conn. A single wrap captures both directions of the WebSocket stream. The wrapper invokes u.notify(reqCtx) once per non-control frame; the existing 30 s-per-app throttle in notify.Manager absorbs the burst (at most one Sandboxes Service PATCH per app per 30 s, regardless of frame rate). The existing trace().GotConn notify on the upstream handshake is preserved.

Tests.

  • wsactivity/{frame,conn}_test.go — unit tests across all frame-header shapes (7/16/64-bit length, masked/unmasked), control-frame skipping, fragmented messages (one callback per frame), multi-frame buffers, byte-by-byte streaming, reserved opcode classification, fuzz test for parser panic safety, and Wrap pass-through / short-write / Close semantics.
  • proxy_test.go — new TestWebsocketActivityTracking using a clockwork.FakeClock: opens an idle WS to a new /ws-idle endpoint, advances the fake clock past the 30 s throttle window, and asserts (a) notify count stays at 1 (regression test for the removed ticker) and (b) sending a real text frame past the window increments notify to 2 (proves the per-frame path is wired end-to-end).
  • testutil/api.gosync.Mutex around DataAppsAPI.Notifications plus NotificationsCount / NotificationsSnapshot getters, needed for race-free reads while the proxy is still running.
  • testutil/app.go/ws-idle endpoint that accepts a WebSocket and reads forever without writing any data frames.

Release Notes

  • Justification
    • A customer reported that the autoSuspendAfterSeconds setting was effectively disabled for Streamlit data apps with regular viewers: a forgotten browser tab keeps a WebSocket open, the 30 s ticker resets the activity timestamp on every iteration, and the app never suspends. Compute is billed for sessions nobody is using.
    • From now on, the proxy looks at actual WebSocket traffic. Idle tabs that only exchange protocol-level keep-alive pings stop counting as "active", so autoSuspendAfterSeconds becomes a usable cost control again.
  • Plans for Customer Communication
    • User-visible behavior shift: an open WebSocket that only exchanges ping/pong frames will no longer prevent an app from auto-suspending. Any app that relied on the 30 s heartbeat to stay alive without emitting data frames must now emit real frames as a liveness signal. This should be flagged in the apps-proxy operator notes / changelog when the PR lands. The reporting customer is awaiting this fix and should be notified.
  • Impact Analysis
    • Strict subset of the previous notify firing set (real activity ⊆ "WS is open"), so the only behavioral regression is for apps whose framework emits zero data frames between renders — undocumented and not a supported pattern.
    • Not behind a feature flag. The change is the explicit goal of the ticket and Pepa's spec; gating it would dilute the fix.
  • Deployment Plan
    • Continuous deployment via the standard apps-proxy rollout — test on dev-* tag first, have PM verify on testing stack, then promote stack by stack.
  • Rollback Plan
    • Trivial revert: the change is isolated to apps-proxy and contains no schema or data migration. Re-deploying the prior image restores the old presence-based tracking.
  • Post-Release Support Plan
    • Watch for new tickets reporting "my app suspended unexpectedly" — those will identify frameworks that silently relied on the old behavior. Otherwise no additional monitoring beyond the existing apps-proxy SLOs.

@linear
Copy link
Copy Markdown

linear Bot commented May 21, 2026

AJDA-2746

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR changes apps-proxy WebSocket activity tracking from “connection is open” to “actual WebSocket data frames are flowing”, by wrapping upgraded WebSocket connections and observing RFC 6455 frame headers in both directions.

Changes:

  • Removed presence-based periodic notify (the 30s ticker + active connection counting) from the upstream WebSocket proxy path.
  • Added a new wsactivity package that passively parses RFC 6455 frames (allocation-free) and triggers a callback once per non-control frame, plus unit tests.
  • Added/updated integration + test utilities to verify end-to-end per-frame notify behavior and to make notification counters concurrency-safe in tests.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
internal/pkg/service/appsproxy/proxy/apphandler/upstream/upstream.go Removes ticker-based WS activity tracking and wraps upgraded conns to notify per data frame.
internal/pkg/service/appsproxy/proxy/apphandler/upstream/wsactivity/frame.go Adds a streaming, allocation-free WebSocket frame header parser.
internal/pkg/service/appsproxy/proxy/apphandler/upstream/wsactivity/frame_test.go Adds unit tests + fuzzing for the frame parser across header shapes and streaming patterns.
internal/pkg/service/appsproxy/proxy/apphandler/upstream/wsactivity/conn.go Adds an io.ReadWriteCloser wrapper that observes frames on both Read and Write.
internal/pkg/service/appsproxy/proxy/apphandler/upstream/wsactivity/conn_test.go Adds wrapper pass-through + short-write/error/close semantics tests.
internal/pkg/service/appsproxy/proxy/proxy_test.go Adds an integration test for “handshake notify once, idle WS does not re-notify, data frame re-notifies after throttle”.
internal/pkg/service/appsproxy/proxy/testutil/api.go Adds locking + safe getters for notification counts used during concurrent tests.
internal/pkg/service/appsproxy/proxy/testutil/app.go Adds a /ws-idle endpoint for integration testing of idle WebSockets.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/pkg/service/appsproxy/proxy/testutil/app.go Outdated
Comment thread internal/pkg/service/appsproxy/proxy/proxy_test.go Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Comment thread internal/pkg/service/appsproxy/proxy/apphandler/upstream/wsactivity/conn_test.go Outdated
@sykora-ji sykora-ji requested a review from MiroCillik May 21, 2026 14:39
@sykora-ji sykora-ji marked this pull request as ready for review May 21, 2026 14:39
@sykora-ji sykora-ji requested a review from Matovidlo May 22, 2026 08:29
Copy link
Copy Markdown
Contributor

@Matovidlo Matovidlo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Matovidlo
Copy link
Copy Markdown
Contributor

@sykora-ji ideally put it on canary/testing and retest the functionality of new frame activity parser

Copy link
Copy Markdown
Member

@MiroCillik MiroCillik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the per-frame WS activity tracking. Architecture is clean and very well-tested (fuzz, split-feeds, masked/unmasked, short-write-no-double-feed). Build + upstream/wsactivity test packages pass locally. One performance item worth addressing (or at least benchmarking) before rollout, plus two minor cleanups — inline.

// uses context.WithoutCancel, so the in-flight call survives the WS
// timeout and any per-request cancellation.
reqCtx := res.Request.Context()
res.Body = wsactivity.Wrap(rwc, func() { u.notify(reqCtx) })
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — per-frame goroutine + span churn; the 30s throttle does not absorb it.

This callback fires u.notify(reqCtx) once per data frame, and u.notify unconditionally (a) spawns a goroutine via u.manager.wg.Go, (b) starts an OTel span keboola.go.apps-proxy.upstream.notify, and only then (c) calls notify.Manager.Notify, where the 30s throttle check actually lives (notify.go:60).

So the throttle caps the outbound PATCH but not the goroutine spawn or the exported span. handleUpgradeResponse copies with a 32 KB buffer and the parser fires once per frame within each chunk — a data app streaming small frames (live dashboard, chat, log tail) can produce thousands of frames per Read → thousands of goroutines + thousands of discarded spans per second. The old ticker was one call / 30s / app.

Suggestion: gate cheaply before u.notify — e.g. a per-upstream atomic.Int64 next-allowed unix-nano compared against clock.Now(), only calling u.notify when the window has elapsed. The authoritative throttle in notify.Manager still guarantees correctness; this just keeps a cheap fast-path for the ~99.9% of frames that get throttled out. At minimum, benchmark on a high-frame-rate app before production.

// non-activity: notify is skipped on a Running app and the request is rejected
// with 503 Retry-After on a Suspended one, requiring the user to perform a
// meaningful action (refresh, click into the UI) to wake the app.
func isFrameworkBackgroundPoll(path string) bool {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor — hardcodes Streamlit-internal paths. Exact-match is correctly tested (trailing-slash / query-string safe), but this couples apps-proxy to Streamlit's private /_stcore/* endpoints and will silently grow per framework. Consider framing it explicitly as a temporary/framework-specific workaround like the adjacent Origin-rewrite block already does, and note the long-term follow-up (server-side config / app-level beacon) so it doesn't quietly calcify.


// NotificationsSnapshot returns a copy of the current Notifications map under
// a lock.
func (v *DataAppsAPI) NotificationsSnapshot() map[string]int {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor — dead code. NotificationsSnapshot is never called (only NotificationsCount is used by the new test). Exported test-helper methods won't trip the unused linter. Either drop it or use it — project guidance is to remove unused functions rather than keep them.

@MiroCillik
Copy link
Copy Markdown
Member

Validation note: will this fully solve SUPPORT-16324 (GRPN forgotten-tab autosuspend)?

This PR correctly fixes the stated root cause — the presence ticker is gone, Tornado's PING/PONG are control frames (opcode & 0x8 != 0) and are ignored, and the /_stcore/health / /_stcore/host-config HTTP polls are filtered. So a truly idle tab (only ping/pong + health polls, user away) now produces zero activity notifies and autosuspend works. 👍

One caveat worth confirming before we close the support ticket, since the report specifically says "logs show periodic Streamlit reruns from a tab held open":

  • A Streamlit rerun pushes WebSocket data frames (ForwardMsg, binary opcode 0x2) server→client.
  • This PR counts data frames in both directions (per the spec), so any rerun — even one with no user behind it — counts as activity and resets the timer.
  • Periodic reruns happen without user interaction in common cases: st_autorefresh / streamlit-autorefresh, @st.fragment(run_every=...), a polling/time.sleep script loop, or the ~20 min WS reconnect cycle (each reconnect re-runs the script → full re-render → data frames; see the isFrameworkBackgroundPoll comment that references this reconnect cycle).

If the GRPN app's "periodic reruns" are app-driven (e.g. autorefresh) rather than user-driven, this change will not suspend it — the server-pushed data frames keep it alive via the new path, just as the old ticker did.

Suggested follow-ups before declaring SUPPORT-16324 resolved:

  1. Confirm what is producing the periodic reruns on that specific app (app-side autorefresh vs. genuine user). If app-driven, set expectations: this PR alone won't suspend it — that's the Cli tool skeleton #3 (user-active beacon) / Unit and functional tests #4 (configurable idle mode) follow-up.
  2. Verify the customer's autoSuspendAfterSeconds is shorter than the WS reconnect interval, otherwise reconnect-triggered reruns can still defeat it.

Possible refinement (design conversation, not a blocker — diverges from the bidirectional spec): counting only client→server data frames as activity would more closely match "a user actually did something" (genuine interaction → client BackMsg; app-driven refresh → only server ForwardMsg), at the cost of some edge cases (automatic client frames, initial server-side render).

Jiří Novák added 9 commits June 5, 2026 10:44
…e parsing

Stateful RFC 6455 frame parser with a passthrough io.ReadWriteCloser
wrapper that drives one parser per direction. The wrapper invokes a
callback once per non-control frame (opcodes 0x0..0x7 selected via
`opcode & 0x8`) and never allocates based on the attacker-controlled
payload length field — payload bytes are skipped via a uint64 counter.

Includes unit tests covering 7/16/64-bit payload lengths, masked and
unmasked frames, fragmented messages, multi-frame buffers, byte-by-byte
streaming, reserved data/control opcodes, and a fuzz test for parser
panic safety.

Used by upstream proxy in a follow-up commit.
Existing table-driven proxy tests only read Notifications after process
shutdown, so direct map access is race-free for them. Tests that observe
notify behavior mid-flight (e.g. during a running WebSocket) need a
locked accessor — add `sync.Mutex` around the map and expose
`NotificationsCount` / `NotificationsSnapshot` getters.
Accepts a WebSocket and reads forever without writing any data frames.
The library's Read loop continues to auto-respond to client pings with
pongs, but no data ever leaves the server. Used by tests that need to
distinguish data-frame activity from idle WebSockets exchanging only
protocol-level keep-alive traffic.
…of presence

Previously a 30s ticker fired `notify()` whenever any WebSocket was open,
counted by an `activeWsCount` atomic. This kept apps awake indefinitely
even when only protocol-level ping/pong was exchanged — the forgotten
browser tab case reported in SUPPORT-16324 / PROF-104.

Replace the ticker with per-frame activity tracking: install a
`proxy.ModifyResponse` hook that wraps `res.Body` (which
`httputil.ReverseProxy.handleUpgradeResponse` uses as the upstream-side
io.ReadWriteCloser for bidi-copy after 101). The wsactivity wrapper
parses RFC 6455 frame headers on both directions and invokes
`u.notify(ctx)` once per non-control frame; the existing 30s-per-app
throttle in notify.Manager absorbs the resulting burst.

User-visible behavior shift: an open WebSocket that only exchanges
ping/pong frames no longer keeps an app from auto-suspending. Apps that
intentionally relied on the 30s heartbeat must now emit actual data
frames to signal liveness.

Add `TestWebsocketActivityTracking` integration test using FakeClock:
verifies (a) idle WS past the throttle window does NOT produce
additional notifies — the regression test for the removed ticker — and
(b) a data frame past the throttle window DOES produce a second notify,
proving the per-frame path is wired end-to-end.
…vity

CI golangci-lint flagged 9 issues, all in the wsactivity package:

- gosec G115 integer-overflow conversions in frame.go — annotated with
  //nolint:gosec and an inline justification of the bounds that make
  each conversion safe (extLen/maskKey/payload counters).
- intrange — switch `for k := 0; k < n; k++` to `for k := range n`
  (Go 1.22+).
- prealloc — preallocate three test byte slices with computed capacity
  to avoid intermediate growths.
- depguard — conn_test.go imported the stdlib `errors`; switch to the
  project's `internal/pkg/utils/errors` package per coding convention.

Also addresses two Copilot PR review comments:

- testutil/app.go `/ws-idle` doc comment: corrected the description —
  the handler drains reads only, it never writes ("echoes" was wrong).
- proxy_test.go TestWebsocketActivityTracking doc comment + assertion
  message: clarified that the idle-WS check is not a hard regression
  test against a future real-time ticker (the previous implementation
  slept 30 s of wall-clock time, which 250 ms wall-clock cannot catch).
  Code review must catch any reintroduction of presence-based polling.

No functional change.
… comment

The inline comment claimed the short-write half-point lands inside the
header, which is wrong for the 16-byte masked text frame used by the
test (header is 6 bytes, half-point at 8 falls past header end). The
test itself still proves the no-double-feed invariant: a buggy "feed
full p on every Write" implementation would, on the caller's retry,
re-walk the last bytes (mask key + payload prefix) and fire spurious
callbacks. Reword the comment to describe what the test actually
catches.

No code change.
Address review feedback that the original byte-by-byte state machine
felt C-like and could be more Go-idiomatic. The previous version had:

- 7 mutable struct fields tracking partial state of each sub-phase,
- 5 explicit states (header0/header1/extLen/maskKey/payload),
- three near-duplicate "consume up to N bytes" loops,
- manual big-endian accumulation via shift-and-or,
- four //nolint:gosec annotations for narrowing conversions.

Replace with a header-buffer approach: accumulate up to 14 bytes of the
current frame's header into a fixed-size array, then decode the whole
header at once using binary.BigEndian. Implicit state is now encoded in
just three counters (headerLen, headerNeed, payloadLeft) with two
obvious phases — drain payload, accumulate header.

- 4 struct fields instead of 7.
- 2 implicit phases instead of 5 explicit states.
- 1 byte-copy code path (via copy()) instead of 3.
- 0 gosec annotations: header offsets are statically bounded, payload
  bytes are skipped via uint64 counter only (still allocation-free
  against 2^63-1 lengths).
- binary.BigEndian.Uint16/Uint64 instead of manual `<<8 | x` loops.

Semantics are unchanged — all existing unit tests pass without
modification except for one assertion that referenced the removed
internal `state` field; updated to check the equivalent post-condition
on headerLen/headerNeed/payloadLeft.
…nd polls

Discovered during canary testing: the WebSocket per-frame fix alone does
not deliver the customer-visible improvement, because the Streamlit
frontend posts /_stcore/health and /_stcore/host-config to apps-proxy
every WS reconnect cycle (~20 min, imposed by an external idle timeout).
Each request flows through trace().GotConn → u.notify(), so
lastRequestTimestamp keeps moving forward by reconnect alone — defeating
auto-suspend on every forgotten tab.

Add isFrameworkBackgroundPoll helper recognizing the two Streamlit
background-poll paths and use it in two places:

- trace() middleware: skip notify when the request path is a background
  poll. The upstream call still happens normally; we just do not treat
  the round-trip as user activity.

- ServeHTTPOrError: add a new case to the suspended-state switch. When
  the app is auto-suspended (Stopped + auto-restart enabled, no DevMode)
  and the path is a background poll, reply 503 with Retry-After: 60 and
  do NOT trigger wakeup. The Streamlit client backs off; the user must
  perform a meaningful action (refresh, click in UI → GET / or similar)
  to wake the app, which lands on a non-poll path and falls into the
  default branch.

Effect: forgotten tabs stay suspended indefinitely. autoSuspendAfterSeconds
becomes a usable cost control again. The 503 vs spinner-page distinction
also lets the Streamlit client surface a "Connection lost" state in the
tab rather than infinitely spinning, which is the expected UX once the
app has been auto-suspended.

Unit test covers the helper (matched paths + comprehensive negatives:
/_stcore/stream WS upgrade, /_stcore/upload_file user action, normal
user paths, prefix-look-alike rejection). Two new integration cases:

- public-app-framework-background-poll-skips-notify: three background
  polls against a Running app produce zero notifications (the existing
  path would have produced one per request).
- public-app-framework-background-poll-on-suspended-returns-503-no-wakeup:
  background poll on a Stopped app returns 503 + Retry-After: 60 with
  an empty body (vs the spinner-page HTML served by the default branch),
  so wakeup is not triggered.
…ded background poll

Canary testing showed that returning a bare empty 503 on a background poll
of a suspended app leaves the data-app frontend (Streamlit) showing a
generic, confusing "Connection error" with no guidance — and, unlike the
spinner served on a real wakeup, no indication of what happened or what to do.

Serve a plain-text message instead, via a new pagewriter.WriteSuspendedPage:

  "The application was paused due to inactivity. Refresh the page to start it again."

Streamlit displays the health-check response body in its connection modal
(the same mechanism behind the existing "re-starting, please wait" plain-text
message — see IsStreamlitHealthCheck / AJDA-1935), so the user now sees an
honest, actionable message. A reload hits GET /, a non-poll path, which is
treated as real activity and triggers wakeup → spinner → app starts.

This does not change the cost behavior (the app still stays suspended on
background polls — no wakeup is triggered); it only improves the message the
user sees. Full auto-recovery on tab refocus still requires the presence
beacon, tracked separately.

Update the integration test to assert the plain-text Content-Type and the
"Refresh the page" body instead of an empty body / literal "60" Retry-After.
@sykora-ji sykora-ji force-pushed the sykorajiri-AJDA-2746 branch from 6368c8b to 2a4d6eb Compare June 5, 2026 08:45
…al scrollbar

Canary testing showed the previous message ("The application was paused due to
inactivity. Refresh the page to start it again.", 81 chars) wraps to several
lines in Streamlit's connection modal, which has limited height and ends up
showing a scrollbar.

The modal renders the response body as plain text — it honors neither newlines
nor HTML (only markdown [text](url) links), so wording length is the only lever.
Shorten to "App paused. Reload the page to restart." (39 chars), which still
states what happened and what to do but fits without scrolling.

Update the integration-test assertion accordingly (Reload vs Refresh).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants