97 changes: 97 additions & 0 deletions agent_docs/tasks/2026-06-13-fresh-client-load-followups.md
@@ -0,0 +1,97 @@
# Fresh-Client Load Follow-ups: Map Asset Preload + News Feed SWR

Follow-up to the WebSocket snapshot work (`2026-06-13-fresh-client-snapshot-replay.md`).
Two further fresh-client slowness sources: the map JS bundle waterfall and the
dashboard's slow text feeds.

## Issue

1. **Map asset waterfall.** The default view is `TACTICAL` → `TacticalMap`,
which depends on the `deck-gl` (~1 MB) and map-engine (~1 MB MapLibre /
~1.7 MB Mapbox) vendor chunks. Since the views are lazy-loaded and the entry
was deliberately stopped from preloading vendors (`f32c30d`), a cold-cache
client only discovers these chunks **after** the entry + App chunks download,
parse, and the dynamic import fires — a multi-hop request waterfall before the
default view can paint.
2. **Dashboard text feeds slow to load.** `GET /api/news/feed` (NewsWidget)
fetched the 5 configured RSS feeds **sequentially**, each with a 10 s timeout
(5 × 10 s, so up to ~50 s worst case), and the Redis cache was populated
**lazily** by the requesting client. Every 15-minute cache expiry therefore
made the next caller block on the full upstream fetch. There was no news
poller pre-warming the cache.

## Solution

1. **Hoist `modulepreload` hints for the critical map chunks.** A small build-only
Vite plugin (`mapCriticalPreloadPlugin`) injects
`<link rel="modulepreload">` into `index.html` for `deck-gl`, the active GL
engine, and the `TacticalMap` view chunk, so the browser fetches them in
parallel with the entry instead of serially after it. The engine is chosen at
build time to mirror `mapStyles.ts`: Mapbox when a valid `VITE_MAPBOX_TOKEN`
is set (+ `VITE_ENABLE_MAPBOX !== "false"`), MapLibre otherwise. Only the
default view's engine is preloaded — the globe-only MapLibre in a Mapbox build
still loads on demand. The cacheable-vendor split is otherwise unchanged.
2. **Concurrent fetch + stale-while-revalidate for the news feed.**
- `_fetch_feeds` now fetches all sources with `asyncio.gather` (latency bounded
by the slowest single feed, not the sum) via a non-raising `_fetch_one`.
- The endpoint serves the cached payload immediately and, once it ages past the
15-minute freshness window, kicks off a **background** refresh
(`_trigger_refresh`) so callers never block on the upstream fetch. The data
is kept for `CACHE_HARD_TTL` (6 h) for stale serving; a `CACHE_FRESH_KEY`
marks freshness. Background refreshes are deduped within a worker (a held
task ref) and across workers (a Redis `SET NX` lock). Only a truly cold
cache (no data at all, or Redis down) fetches synchronously — and that fetch
is now concurrent.
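
The concurrent fetch in item 2 can be pictured with a minimal sketch. It is not the code from `news.py`: the feed names, the simulated latencies, and the item shape are assumptions made for illustration, and `asyncio.sleep` stands in for the real HTTP request and RSS parse. It only shows the pattern the text describes: a per-feed coroutine that never raises, gathered concurrently, then merged, sorted newest-first, and stripped of the internal `_ts` key.

```python
import asyncio

# Hypothetical feeds: name -> simulated latency in seconds (also used as a fake timestamp).
FEEDS = {"a": 0.05, "b": 0.10, "c": 0.02, "broken": 0.01}


async def fetch_one(name: str, delay: float) -> list:
    """Fetch one feed; on any failure return [] instead of raising."""
    try:
        await asyncio.sleep(delay)  # stands in for the HTTP request + parse
        if name == "broken":
            raise RuntimeError("upstream error")
        return [{"source": name, "title": f"{name} headline", "_ts": delay}]
    except Exception:
        # One bad feed must not fail the whole gather.
        return []


async def fetch_feeds() -> list:
    """Fetch all feeds concurrently, merge, and sort newest-first."""
    results = await asyncio.gather(*(fetch_one(n, d) for n, d in FEEDS.items()))
    merged = [item for feed in results for item in feed]
    merged.sort(key=lambda item: item["_ts"], reverse=True)
    # Drop the internal sort key before returning the payload.
    return [{k: v for k, v in item.items() if k != "_ts"} for item in merged]


if __name__ == "__main__":
    for item in asyncio.run(fetch_feeds()):
        print(item)
```

Because the coroutines run concurrently, the total wait is roughly that of the slowest feed rather than the sum, and the failing feed simply contributes no items.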

## Changes

- **`frontend/vite.config.ts`**
- Switched to the function form of `defineConfig` to read build env via
`loadEnv` and pick the engine chunk.
- Added `mapCriticalPreloadPlugin(engineChunk)` (uses `transformIndexHtml` with
`ctx.bundle` to resolve hashed chunk filenames and inject preload links).
- **`backend/api/routers/news.py`**
- Added `CACHE_FRESH_KEY`, `CACHE_REFRESH_LOCK`, `CACHE_HARD_TTL`,
`CACHE_REFRESH_LOCK_TTL`.
- Split `_fetch_feeds` into `_fetch_one` (per-feed, never raises) +
concurrent `gather`.
- Added `_store_feed`, `_refresh_and_release`, `_trigger_refresh`, and a module
`_refresh_task` ref.
- Rewrote `get_news_feed` for stale-while-revalidate.
- Added `warm_cache()` (non-blocking refresh delegating to the deduped
background refresh) and `prewarm_loop()` — a continuous pre-warmer that
refreshes on startup and then every `NEWS_PREWARM_INTERVAL`
(`NEWS_PREWARM_INTERVAL_SECONDS`, default 600 s — comfortably inside the
900 s freshness window so the cache is always warm even with no traffic).
- **`backend/api/main.py`**
- Lifespan launches `news.prewarm_loop()` as a supervised background task
after `broadcast_service.start()` and cancels it on shutdown alongside the
historian / RF-cleanup tasks. The feed cache is therefore kept warm
independent of client traffic, so a fresh dashboard never blocks on the
upstream RSS fetch.
- **`backend/api/tests/test_news_router.py`**
- Added tests: fresh cache served without refresh; stale cache served +
triggers refresh; cold cache fetches synchronously; `_fetch_feeds` merges +
sorts newest-first and strips `_ts`; `_trigger_refresh` NX-lock dedupe.

## Verification

- **Frontend** (`frontend`): `pnpm run typecheck` (covers `vite.config.ts`),
`pnpm run lint`, `pnpm run test` → 278 passed. `pnpm run build` succeeded;
`dist/index.html` now contains `modulepreload` links for `deck-gl`,
`maplibre` (no Mapbox token in this build → engine = MapLibre, and `mapbox`
is correctly *not* preloaded), and `TacticalMap`.
- **Backend API** (`backend/api`): `ruff check` on changed files passed;
`pytest` full suite → 172 passed (was 167; +5 news tests).

## Benefits

- **Map paints sooner on a cold client**: the two multi-MB critical chunks and
the default view chunk download in parallel with the entry, collapsing the
discover-then-fetch waterfall — without reverting the cacheable vendor split
or preloading the unused engine.
- **Dashboard text feeds load fast and stay fast**: concurrent fetching cuts
cold-cache latency from the sum of feed latencies to the slowest single feed,
and stale-while-revalidate means the periodic 15-minute cache expiry no longer
blocks a user — they get instant (slightly stale) data while a background
refresh runs. Background refreshes are deduped so a burst of clients triggers
at most one upstream fetch.
89 changes: 89 additions & 0 deletions agent_docs/tasks/2026-06-13-fresh-client-snapshot-replay.md
@@ -0,0 +1,89 @@
# Fresh-Client Snapshot Replay (Last-Value Cache)

## Issue

After the recent frontend rendering optimizations (cached static layers, lazy
map loading), the map paints almost instantly — but on a **fresh client** it
stays empty for a long time while entities trickle in. Aircraft, ships, and
satellites only arrive over the live WebSocket (`/api/tracks/live`), and the
broadcast consumer reads Kafka with `auto_offset_reset="latest"`. A late joiner
therefore receives **no backlog** — it must wait for each poller to re-emit its
next full sweep before the world populates. The orbital sweep alone is a
~15–37 s cycle (≈11k satellites), so a fresh client can sit on a near-empty map
for tens of seconds. The faster (now near-instant) render made this pre-existing
gap glaringly obvious.

## Solution

Add a **last-value cache (LVC)** to `BroadcastManager` and replay it to every
newly-connected client before live streaming begins.

- As the Kafka consume loop transforms each message to its TAK frame, it also
stores the latest frame in an in-memory cache, keyed by the entity `uid` and
stamped with a monotonic receive time. The cache is kept warm even when no
clients are connected, so it is ready the instant someone joins.
- On WebSocket connect, the per-client worker first replays the current cache
(the "snapshot") directly to that client, then enters the normal live-stream
drain loop. Frames are sent in the existing one-frame-per-entity wire format,
so **the frontend needs no changes**: a snapshot frame is indistinguishable
from a live update, and the client's existing `lastSourceTime` de-dup guard
harmlessly ignores any overlap between snapshot and live deltas.
- Stale entries (not re-emitted within `LIVE_SNAPSHOT_TTL_SECONDS`, default
300 s) are excluded from snapshots and periodically pruned; a hard cap
(`LIVE_SNAPSHOT_MAX_ENTITIES`, default 20 000) bounds memory.
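
The cache behaviour described in these bullets can be sketched as a small standalone class. This is an illustration, not the `BroadcastManager` code: the class and method names are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and evicting the oldest entries first when the hard cap is exceeded is an assumption (the text does not say which entries are dropped).

```python
import time


class LastValueCache:
    """Latest frame per uid, with TTL-based exclusion and a hard size cap."""

    def __init__(self, ttl_s: float = 300.0, max_entities: int = 20_000) -> None:
        self.ttl_s = ttl_s
        self.max_entities = max_entities
        self._entries = {}  # uid -> (received_at, frame)

    def __len__(self) -> int:
        return len(self._entries)

    def record(self, uid: str, frame: bytes, now=None) -> None:
        """Store the newest frame for a uid; blank uids are rejected."""
        if not uid:
            return
        received_at = time.monotonic() if now is None else now
        self._entries[uid] = (received_at, frame)

    def snapshot(self, now=None) -> list:
        """Return fresh frames only, copied so callers can iterate safely."""
        now = time.monotonic() if now is None else now
        cutoff = now - self.ttl_s
        return [frame for ts, frame in list(self._entries.values()) if ts >= cutoff]

    def prune(self, now=None) -> None:
        """Drop stale entries, then enforce the hard cap."""
        now = time.monotonic() if now is None else now
        cutoff = now - self.ttl_s
        self._entries = {u: e for u, e in self._entries.items() if e[0] >= cutoff}
        overflow = len(self._entries) - self.max_entities
        if overflow > 0:
            # Assumption: evict the least recently received entries first.
            oldest = sorted(self._entries, key=lambda u: self._entries[u][0])
            for uid in oldest[:overflow]:
                del self._entries[uid]
```

Excluding stale entries in `snapshot` as well as in `prune` means a snapshot never contains an expired entity, even if the periodic sweep has not run yet.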

### Why direct send, not the live queue

The per-client live queue is bounded at 256 messages (it intentionally drops
oldest under back-pressure). A multi-thousand-entity snapshot pushed through it
would be almost entirely dropped, so the snapshot is sent directly via
`send_bytes` with the same 3 s per-frame timeout, yielding to the event loop
every 256 frames so a large replay never starves the consume loop or other
clients.
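
A minimal sketch of such a direct, yielding replay is shown below. It is not the `_send_snapshot` implementation: `FakeWebSocket` is a hypothetical stand-in exposing only an async `send_bytes`, and treating `ConnectionError` as the disconnect signal is an assumption (a real WebSocket framework raises its own exception type). The 3 s timeout and the 256-frame yield interval come from the paragraph above.

```python
import asyncio

SEND_TIMEOUT_S = 3.0  # per-frame send timeout
YIELD_EVERY = 256     # frames between cooperative yields


class FakeWebSocket:
    """Hypothetical client that can be told to fail after N frames."""

    def __init__(self, fail_after=None) -> None:
        self.sent = []
        self.fail_after = fail_after

    async def send_bytes(self, data: bytes) -> None:
        if self.fail_after is not None and len(self.sent) >= self.fail_after:
            raise ConnectionError("client went away")
        self.sent.append(data)


async def send_snapshot(ws, frames) -> bool:
    """Replay frames directly to one client; return False if the client is lost."""
    for i, frame in enumerate(frames, start=1):
        try:
            await asyncio.wait_for(ws.send_bytes(frame), timeout=SEND_TIMEOUT_S)
        except (asyncio.TimeoutError, ConnectionError):
            return False  # bail out; the caller skips the live drain loop
        if i % YIELD_EVERY == 0:
            # Hand control back so the consume loop and other clients can run.
            await asyncio.sleep(0)
    return True


if __name__ == "__main__":
    ws = FakeWebSocket()
    ok = asyncio.run(send_snapshot(ws, [b"frame"] * 1000))
    print(ok, len(ws.sent))
```

The explicit `asyncio.sleep(0)` matters because a send that completes without suspending would otherwise let a multi-thousand-frame replay monopolise the event loop.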

## Changes

- **`backend/api/core/config.py`**
- Added `LIVE_SNAPSHOT_TTL_SECONDS` (default 300) and
`LIVE_SNAPSHOT_MAX_ENTITIES` (default 20 000).
- **`backend/api/services/broadcast.py`**
- Added `import time` and the `_LVC_PRUNE_INTERVAL_S` constant.
- `BroadcastManager.__init__`: added the `_lvc` cache and `_last_prune`.
- `_consume`: records every transformed frame into the LVC
(`_record_live`) before the early-out on zero clients.
- New helpers: `_record_live`, `_maybe_prune` (TTL sweep + hard cap),
`_snapshot_frames` (fresh frames, copied for safe concurrent iteration),
and `_send_snapshot` (direct, yielding, disconnect-aware replay).
- `_client_worker`: replays the snapshot before the live drain loop; bails
out cleanly if the client disconnects mid-snapshot.
- `stop()`: clears the cache.
- **`backend/api/tests/test_broadcast_snapshot.py`** (new)
- Covers LVC population/overwrite, blank-uid rejection, TTL exclusion, prune
(stale drop + hard cap), and snapshot send (all frames, empty no-op,
mid-stream disconnect, stale exclusion).

## Verification

Run on host (`backend/api`):

- `uv tool run ruff check services/broadcast.py core/config.py tests/test_broadcast_snapshot.py` → All checks passed.
- `uv run python -m pytest tests/test_broadcast_snapshot.py tests/test_tracks_validation.py -q` → 16 passed.
- `uv run python -m pytest -q` (full API suite) → 167 passed.

No frontend changes were required, so frontend suites were not run (per the
Targeted Verification rule).

## Benefits

- **Fresh clients paint the full picture immediately** instead of waiting up to
a full poller sweep (tens of seconds for satellites). The data-load latency a
late joiner perceives drops from "next sweep" to "one connect round-trip."
- **Backend-only, wire-compatible**: no frontend changes, no proto/worker
changes, no new service or DB query on connect — the snapshot is served from
memory.
- **Bounded and self-healing**: TTL + hard cap bound memory; stale entities
(landed aircraft, departed vessels) age out automatically and never appear in
a snapshot.
- **Back-pressure safe**: the snapshot bypasses the bounded live queue and
yields regularly, so a large replay cannot starve the consume loop or slow
other connected clients.
9 changes: 9 additions & 0 deletions backend/api/core/config.py
@@ -43,6 +43,15 @@ def DB_DSN(self) -> str:
# Kafka
KAFKA_BROKERS = os.getenv("KAFKA_BROKERS", "sovereign-redpanda:9092")

# Live-stream snapshot (last-value cache).
# A freshly-connected WebSocket client is replayed the current world state
# so the map paints immediately instead of waiting for each poller's next
# full sweep (the orbital sweep alone is a ~15-37 s cycle). Entities not
# re-emitted within the TTL are dropped from the snapshot; the hard cap
# bounds memory if the uid space ever runs away.
LIVE_SNAPSHOT_TTL_SECONDS = int(os.getenv("LIVE_SNAPSHOT_TTL_SECONDS", "300"))
LIVE_SNAPSHOT_MAX_ENTITIES = int(os.getenv("LIVE_SNAPSHOT_MAX_ENTITIES", "20000"))

# Authentication
# When AUTH_ENABLED=false all authentication checks are skipped (local dev only — NEVER in production).
AUTH_ENABLED: bool = os.getenv("AUTH_ENABLED", "true").lower() not in (
18 changes: 15 additions & 3 deletions backend/api/main.py
@@ -78,6 +78,7 @@ async def _historian_supervisor():
# Global task handles
historian_task_handle: asyncio.Task | None = None
rf_cleanup_task_handle: asyncio.Task | None = None
news_prewarm_task_handle: asyncio.Task | None = None


@asynccontextmanager
@@ -86,7 +87,7 @@ async def lifespan(app: FastAPI):
BUG-017: Replaced deprecated @app.on_event("startup") / @app.on_event("shutdown")
decorators with the modern lifespan context manager pattern (FastAPI >= 0.93).
"""
global historian_task_handle, rf_cleanup_task_handle
global historian_task_handle, rf_cleanup_task_handle, news_prewarm_task_handle
# --- Startup ---
settings.validate()
await db.connect()
@@ -121,12 +122,23 @@ async def lifespan(app: FastAPI):
historian_task_handle = asyncio.create_task(_historian_supervisor())
rf_cleanup_task_handle = asyncio.create_task(rf_sites_cleanup_task())
await broadcast_service.start()
logger.info("Database, Redis, Historian, RF Cleanup, and Broadcast Service started")
# Continuously pre-warm the news feed cache in the background so a fresh
# dashboard always hits a warm cache instead of blocking on the upstream
# RSS fetch (refreshes on startup, then on an interval).
news_prewarm_task_handle = asyncio.create_task(news.prewarm_loop())
logger.info(
"Database, Redis, Historian, RF Cleanup, Broadcast Service, "
"and News Pre-warm started"
)

yield

# --- Shutdown ---
for handle in (historian_task_handle, rf_cleanup_task_handle):
for handle in (
historian_task_handle,
rf_cleanup_task_handle,
news_prewarm_task_handle,
):
if handle:
handle.cancel()
try: