
Feat/fasta prep backend #248

Draft
t03i wants to merge 30 commits into main from feat/fasta-prep-backend

Conversation

@t03i
Collaborator

@t03i t03i commented May 6, 2026

TLDR

MVP backend implementation.

Description

Architecture

[Browser]
   │ POST /api/prepare (multipart FASTA)
   │ GET  /api/prepare/:id/events  (SSE)
   │ GET  /api/prepare/:id/bundle  (download)
   ▼
[Caddy reverse proxy] ── existing TLS termination, SSE-friendly by default
   │
   ▼
[protspace-prep container]
   │ FastAPI app (uvicorn, single process)
   │ ├─ in-memory job registry: dict[job_id → JobState]
   │ ├─ asyncio.Semaphore(5) bounding concurrent active jobs
   │ ├─ asyncio task per job; pipeline runs via asyncio.to_thread
   │ └─ /var/lib/protspace-prep/jobs/<job_id>/ for bundle artifacts
   │
   ▼
[Biocentral API] (HTTPS, anonymous today)
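
The in-process part of the diagram can be sketched roughly as follows. This is a minimal illustration under assumptions, not the code in this PR: `JobRegistry`, `submit`, `drain`, and the `JobState` fields are hypothetical names, and the pipeline is modelled as any blocking callable.

```python
import asyncio
import uuid
from dataclasses import dataclass
from typing import Any, Callable, Dict, Set


@dataclass
class JobState:
    # Hypothetical minimal state; the real JobState likely carries more fields.
    status: str = "queued"
    result: Any = None


class JobRegistry:
    """In-memory job registry with a cap on concurrently active jobs."""

    def __init__(self, max_active: int = 5) -> None:
        self.jobs: Dict[str, JobState] = {}
        self._slots = asyncio.Semaphore(max_active)
        self._tasks: Set[asyncio.Task] = set()

    def submit(self, pipeline: Callable[..., Any], *args: Any) -> str:
        """Register a job and start one asyncio task for it."""
        job_id = uuid.uuid4().hex
        self.jobs[job_id] = JobState()
        task = asyncio.create_task(self._run(job_id, pipeline, *args))
        # Keep a strong reference so the task is not garbage-collected early.
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return job_id

    async def _run(self, job_id: str, pipeline: Callable[..., Any], *args: Any) -> None:
        state = self.jobs[job_id]
        async with self._slots:  # jobs beyond the cap wait here as "queued"
            state.status = "running"
            try:
                # Run the blocking pipeline off the event loop thread.
                state.result = await asyncio.to_thread(pipeline, *args)
                state.status = "done"
            except Exception:
                state.status = "error"

    async def drain(self) -> None:
        """Wait for all currently known jobs to finish."""
        await asyncio.gather(*list(self._tasks))
```

Because the registry lives in process memory, a restart loses every entry, which is the trade-off the single-process design accepts.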

SLO:

  • 5 concurrent users supported
  • results for 1k sequences in ~1 min

Closes: #236

t03i and others added 24 commits May 6, 2026 09:12
Also fixes a pre-existing knip failure caused by playwright@1.57.0
crashing under jiti when the Playwright plugin loaded app/tests/
playwright.config.ts. Disable the plugin for the app workspace — specs
are already captured via the entry glob — and add a comment explaining
why.
…ment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…SE event queues

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds pipeline.py with run_protspace_prepare(), which launches the
protspace CLI as an async subprocess, parses stderr for stage
transitions (embedding, projecting, annotating, bundling), enforces a
configurable timeout, and raises PipelineFailure on non-zero exit or
missing bundle output. Also provides cleanup_job_dir() for the TTL
sweeper.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds api.py with POST /api/prepare, GET /api/prepare/{id}/events (SSE),
and GET /api/prepare/{id}/bundle. Updates app.py to accept an injectable
pipeline, fixes late-subscriber path in jobs.py to always synthesize a
queued event before replaying the terminal event, and hardens conftest.py
to set PREP_JOB_ROOT before module-level create_app() runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements isFastaFile helper and prepareFastaBundle that uploads a FASTA
file, streams SSE progress events, and resolves with a .parquetbundle File.
Uses @public JSDoc tags on FastaPrepStage/FastaPrepOptions so knip recognises
them as intentional public API ahead of Task 10 wiring.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ndle path intact

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… races)

- Fix 1: sanitize Content-Disposition filename via _safe_download_name()
  to prevent header injection from hostile original_name values
- Fix 2: add SSE keep-alive comment frames every _KEEPALIVE_INTERVAL_SECONDS
  (15 s default, monkeypatchable) using asyncio.wait_for on the subscribe iter
- Fix 3: register subscriber queue BEFORE yielding the synthetic queued event
  so terminal events published during the yield cannot be missed
- Fix 4: send None sentinel to all live subscriber queues in sweep_expired()
  before popping _subscribers, preventing indefinite hangs
- Fix 5: catch asyncio.CancelledError in _run(), publish error event, set
  ERROR status, then re-raise so cancellation propagates cleanly
- Fix 6: use peek_bundle/mark_consumed split so consumed flag is only set
  after a successful path.read_bytes(); OSError surfaces as HTTP 500
- Fix 7: register atexit handler in conftest.py to clean up the mkdtemp dir
  after the test session (previously leaked on every run)

New tests: 14 added (47 total, was 33). All pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
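
For illustration, the filename sanitization described in Fix 1 could look roughly like the sketch below. The name `safe_download_name`, the allowed character set, the length cap, and the fallback name are all assumptions; the actual `_safe_download_name()` may differ.

```python
import re


def safe_download_name(original_name: str, default: str = "bundle.parquetbundle") -> str:
    """Reduce an untrusted upload name to a header-safe filename.

    Assumed policy: drop any directory part, replace every character outside
    [A-Za-z0-9._-] (including CR, LF, and quotes) with "_", and fall back to a
    default when nothing usable is left.
    """
    base = original_name.replace("\\", "/").rsplit("/", 1)[-1]
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", base).strip("._")
    return cleaned[:100] or default
```

With this policy a CR/LF sequence can never reach the `Content-Disposition` header, because both characters fall outside the allow-list.
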
Extract abort handler to a named function so cleanup() can call
removeEventListener, preventing listener accumulation when the same
AbortController is reused across multiple prepareFastaBundle calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous keepalive loop wrapped each `aiter.__anext__()` in
`asyncio.wait_for`, which cancels the inner coroutine on timeout. That
cancellation exhausted the underlying async generator, so the first
keepalive frame silently truncated the stream and the EventSource client
fired an error event surfacing as "Bundle preparation failed."

Hold a single in-flight `__anext__()` task across keepalive ticks via
`asyncio.shield`, only creating a new one once an event has been
delivered. Regression test now asserts the stream keeps flowing past the
keepalive boundary and still delivers `event: done`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
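
A minimal sketch of the keepalive pattern described in this commit, assuming the event source is an async iterator of already-formatted SSE frames. `with_keepalive`, `_next_or_done`, and the frame text are hypothetical names and values, not the PR's code.

```python
import asyncio
from collections.abc import AsyncIterator

KEEPALIVE_FRAME = ": keepalive\n\n"  # SSE comment frame; exact text is an assumption
_DONE = object()


async def _next_or_done(it: AsyncIterator[str]) -> object:
    """Fetch one item, mapping exhaustion to a sentinel instead of an exception."""
    try:
        return await it.__anext__()
    except StopAsyncIteration:
        return _DONE


async def with_keepalive(events: AsyncIterator[str], interval: float) -> AsyncIterator[str]:
    """Yield events, emitting a keepalive frame whenever `interval` passes idle."""
    it = events.__aiter__()
    pending = None  # the single in-flight fetch, reused across keepalive ticks
    try:
        while True:
            if pending is None:
                pending = asyncio.create_task(_next_or_done(it))
            try:
                # shield(): a timeout cancels only the wrapper, never `pending`,
                # so the underlying generator is not torn down.
                item = await asyncio.wait_for(asyncio.shield(pending), timeout=interval)
            except asyncio.TimeoutError:
                yield KEEPALIVE_FRAME
                continue
            pending = None  # delivered; the next loop iteration starts a new fetch
            if item is _DONE:
                return
            yield item
    finally:
        # Client went away or the stream ended: stop any fetch still in flight.
        if pending is not None and not pending.done():
            pending.cancel()
```

The key difference from the broken loop is that a timeout never reaches the in-flight fetch, so the source keeps its position across any number of keepalive ticks.
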
Replace the single `protspace prepare` invocation with explicit calls to
`protspace embed`, `annotate`, `project`, and `bundle`. Embed (Biocentral)
and annotate (UniProt) are network-bound and independent, so they run
concurrently inside an `asyncio.TaskGroup`; project and bundle run
sequentially afterward. The whole run shares a single wall-clock budget
via `asyncio.timeout` so the SSE contract still has a deterministic upper
bound.

Stage events are now driven by the pipeline orchestrator rather than
parsed out of stderr, so the regex-based stage detector is gone. Each
step's stderr is still drained line-by-line (last 50 lines kept for
failure messages) so subprocesses never block on a full pipe, and
cancellation kills the subprocess before propagating.

Tests cover the success path, parallel execution of embed+annotate,
per-step failure surfaces, the missing-bundle and missing-H5 sentinels,
and timeout-driven subprocess kill.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the prod topology where the SPA and prep backend are hosted on
separate origins. The new compose service builds a custom Caddy image
with caddy-ratelimit baked in, fronts protspace-prep on
http://localhost:9090, and applies CORS headers, an OPTIONS preflight
short-circuit, a 9 MB submit body cap, and a 5-per-15min submit rate
limit so dev behavior matches what users will hit in prod.

Also adds PREP_SEQUENCE_MIN_COUNT=20 to the prep service env so the
floor enforced by the validator is configured at the deployment layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a consumer-supplied loadFromFileHandler rejected, the rejection
propagated to the caller but the data-loader never updated its `error`
property or fired `data-error`, so listeners (the explore runtime in
particular) had no signal to drop the loading overlay. Catch handler
errors at the boundary, set `this.error`, and dispatch `data-error`
with the original Error so existing listeners can branch on
`originalError.name === 'AbortError'` cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The prep submit path returned bare "Upload failed (HTTP 429)" messages
on rate-limit, oversize, and backend-unavailable responses. Map 429,
413, 503, and 504 to user-readable strings, parse Retry-After (seconds
or HTTP date) into a "try again in N minutes" hint, and fall back to a
generic but still helpful message when the header is missing or the
body is non-JSON (e.g. Caddy's plain-text 429).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires an AbortController through prepareFastaBundle and renders a Cancel
button on the loading overlay while the prep job runs. The dataset
controller's data-error handler now special-cases AbortError so a user
cancel resolves the load queue cleanly instead of surfacing as a
toast/error UI. The button is removed once the bundle handoff completes
or the prep call rejects.

The runtime now also reads VITE_PREP_API_BASE so the SPA can target a
separate backend origin (the new Caddy in front of protspace-prep) in
both dev and prod, falling back to same-origin when unset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ject

Splits the mocked end-to-end into two focused tests: one that exercises
the new Cancel button (asserting the bundle is never fetched and the
overlay closes) and one that completes the prep flow against the
mocked backend. Adds a `fasta-prep-live` playwright project that drives
a real Caddy + protspace-prep + Biocentral round-trip using a small
fixture FASTA, with a 6-minute timeout for cold starts. Playwright
baseURL now reads PLAYWRIGHT_BASE_URL so the live project can target
the dev origin (default localhost:8080) without editing the config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Type-check, knip, and the docs build still run on every commit; the
test suite now runs in CI only. Keeping `test:ci` in the precommit hook
made every commit a multi-minute wait, which encouraged --no-verify
detours that defeat the rest of the gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
embed/project keyed projections_data.identifier by the raw FASTA header
(sp|P12345|NAME_HUMAN) while annotate ran the same header through
parse_identifier (P12345). The frontend bundle join in
data-loader/utils/bundle.ts joins on projection.identifier, so for any
UniProt FASTA every lookup missed and annotations silently dropped.

Run both subprocesses against an input.normalized.fasta whose headers are
already passed through protspace's parse_identifier, so both downstream
tables agree on a single key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
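
To illustrate the normalization step, here is a small sketch that rewrites FASTA headers before any subprocess sees them. `parse_identifier_sketch` is a hypothetical stand-in that only handles the `sp|ACC|NAME` / `tr|ACC|NAME` shape; protspace's real `parse_identifier` may cover more formats.

```python
def parse_identifier_sketch(header: str) -> str:
    """Extract a join key from a FASTA header line.

    Stand-in for protspace's parse_identifier: UniProt-style headers
    (sp|P12345|NAME_HUMAN) reduce to the accession; anything else keeps
    its first whitespace-delimited token.
    """
    fields = header.lstrip(">").split()
    token = fields[0] if fields else ""
    parts = token.split("|")
    if len(parts) >= 3 and parts[0] in ("sp", "tr"):
        return parts[1]
    return token


def normalize_fasta(text: str) -> str:
    """Return FASTA text whose header lines carry only the normalized identifier."""
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            out.append(">" + parse_identifier_sketch(line))
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

Writing the normalized file once and pointing every step at it means the projection and annotation tables cannot disagree on the key.
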
Progress was hard-coded to 25% for every onProgress event, so the bar
froze for the entire pipeline. Map each stage (queued/embedding/
annotating/projecting/bundling) to its own percentage and clamp with
Math.max so out-of-order events can never roll the bar backwards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@t03i
Collaborator Author

t03i commented May 6, 2026

The 1 min target is currently unrealistic: 840 sequences take more than 3:30, likely because of Biocentral. More profiling is required.

Backend:
- Stream bundle via FileResponse + BackgroundTask (no full read into memory)
- ExceptionGroup handling joins all PipelineFailure messages instead of
  dropping siblings
- Switch JobState timestamps to time.time() to match sweep's mtime check
- FastaValidationError exception handler collapses three 400 blocks
- Drop dead consume_bundle, cleanup_job_dir, and # Fix N: markers
- Extract _force_put helper for the queue drop-oldest pattern
- functools.partial replaces _default_pipeline closure
- Misc: BOM escape, named nucleotide threshold, encoding="utf-8"

Caddy/Docker:
- Caddyfile.example: handle_path -> handle (was 404'ing every request)
- Extract (prep_backend) snippet to deduplicate dev/example
- Drop duplicate HEALTHCHECK from Dockerfile (compose owns it)
- Switch base image to ghcr.io/astral-sh/uv:python3.12-bookworm-slim

Frontend:
- loading-overlay scopes #progress-* lookups to the overlay element

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
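
The drop-oldest queue pattern mentioned in the backend list could be sketched like this. `force_put` is a hypothetical name for illustration; the PR's `_force_put` helper may differ in detail.

```python
import asyncio
from typing import Any


def force_put(queue: asyncio.Queue, item: Any) -> None:
    """Enqueue without blocking; when the queue is full, evict the oldest item."""
    try:
        queue.put_nowait(item)
    except asyncio.QueueFull:
        try:
            queue.get_nowait()  # drop the oldest entry to make room
        except asyncio.QueueEmpty:
            pass  # a consumer drained the queue in the meantime
        queue.put_nowait(item)
```

For progress events this favours freshness: a slow subscriber loses old stage updates rather than stalling the publisher.
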
@t03i t03i marked this pull request as ready for review May 6, 2026 12:16
@t03i t03i requested a review from tsenoner May 6, 2026 12:17
t03i added 2 commits May 6, 2026 14:21
- Add e2e-tests job that auto-starts the dev server via Playwright's
  webServer block and uploads the HTML report on failure.
- Drop branches:['**'] push trigger so feature pushes don't run twice
  (once for push, once for the PR).
- Gate fasta-prep-live behind RUN_LIVE_E2E so the default e2e run
  doesn't try to hit the real prep backend.
t03i and others added 3 commits May 6, 2026 15:52
The superpowers/ subtree holds local planning/spec notes that aren't part
of the user-facing docs site. Untracking + srcExclude keeps these files
local without breaking the docs:build pre-commit hook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… handling

Frontend:
- Estimate embedding time from sequence count and surface it as a sub-message.
- Smooth progress with an asymptotic creep between embedding/projecting stages.
- Show queue position when the job is waiting for a slot.
- Display a persistent "Got a larger dataset?" overlay note linking to the
  Colab notebook so users have a fallback when the lab service is busy or down.
- Wrap submit/SSE/download failures in a typed FastaPrepError that carries an
  optional server-supplied error code.

Backend:
- Tag the queued event with queue_position and running counts so the UI can
  show "Position N in queue" instead of a blank wait.
- Propagate an optional code on PipelineFailure into the SSE error payload.
- Classify Biocentral connection / 503 failures as BIOCENTRAL_UNAVAILABLE
  with a friendlier user-facing message that points at the Colab fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
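
A rough sketch of how a failure might be reclassified and turned into an SSE error payload. The marker strings, the friendly message text, and the helper names `classify_failure` and `error_payload` are assumptions for illustration; only `PipelineFailure` and the `BIOCENTRAL_UNAVAILABLE` code come from this PR.

```python
from typing import Dict, Optional


class PipelineFailure(Exception):
    """Pipeline error carrying an optional machine-readable code."""

    def __init__(self, message: str, code: Optional[str] = None) -> None:
        super().__init__(message)
        self.code = code


# Hypothetical markers; the real classifier may inspect different signals.
_UNAVAILABLE_MARKERS = ("connection refused", "connection reset", "503", "service unavailable")
# Placeholder text, not the actual _BIOCENTRAL_FRIENDLY_MESSAGE.
_FRIENDLY = "The embedding service is currently unavailable. Try again later or use the Colab notebook."


def classify_failure(failure: PipelineFailure) -> PipelineFailure:
    """Map connection-style failures to BIOCENTRAL_UNAVAILABLE; pass others through."""
    text = str(failure).lower()
    if failure.code is None and any(marker in text for marker in _UNAVAILABLE_MARKERS):
        return PipelineFailure(_FRIENDLY, code="BIOCENTRAL_UNAVAILABLE")
    return failure


def error_payload(failure: PipelineFailure) -> Dict[str, str]:
    """Build the error event body; `code` is included only when present."""
    payload = {"message": str(failure)}
    if failure.code is not None:
        payload["code"] = failure.code
    return payload
```

Keeping the code optional lets older clients read `message` unchanged while newer ones branch on `code`.
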
- docker-compose.prod.yml: pin protspace-prep + caddy-ratelimit to GHCR images
  via PREP_TAG / CADDY_TAG, override compose to listen on 0.0.0.0:8080, and
  pass CORS_ALLOWED_ORIGIN through to Caddy.
- config/Caddyfile.prod: rate limit + 9 MB body cap on POST /api/prepare,
  CORS for the configured SPA origin, /healthz endpoint. The lab edge
  gateway terminates TLS upstream; this Caddy listens on plain HTTP.
- scripts/deploy-vm.sh + update-vm.sh: first-time deploy and routine update
  helpers driven by .env.
- .github/workflows/publish-images.yml: build and push protspace-prep and
  caddy-ratelimit images to GHCR on main, tags, and PRs touching the prep
  service or Caddy Dockerfile.
- Split Playwright e2e off the main CI workflow into a scheduled +
  label-gated workflow (run-e2e label or manual dispatch) so PR CI stays
  fast.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@t03i
Collaborator Author

t03i commented May 6, 2026

Changed e2e to run only nightly or when labelled.
The Colab and progress changes are already in, @tsenoner.

@tsenoner tsenoner marked this pull request as draft May 8, 2026 07:37
@tsenoner
Owner

tsenoner commented May 8, 2026

Review notes

Likely bugs

  • app/src/explore/runtime.ts:140 — the embedding/annotating branch hardcodes 'Embedding sequences (~3 min)…', overriding the dynamic embeddingLabel from formatEmbeddingLabel(seqCount). A 50-seq job will show
    "~3 min" instead of "~20 sec". Should reuse embeddingLabel like the queued branch does.
  • api.py:134: mark_consumed(job_id) runs before the response body is sent; the bundle deletion is a
    BackgroundTask. If the SPA's download fetch fails mid-stream, consumed=True and the file is gone, so
    retry returns 410. Consider deferring mark_consumed until the BackgroundTask completes successfully.

Colab fallback UX

  • We want the Colab notebook
    (https://colab.research.google.com/github/tsenoner/protspace/blob/main/notebooks/ProtSpace_Preparation.ipynb)
    to be a visible alternative — not only during the prep flow as the "Got a larger dataset?" note. Consider
    surfacing it on the explore page itself (e.g. near the FASTA upload affordance) so users know they can prepare
    the data themselves.
  • When Biocentral is unavailable, the backend already returns BIOCENTRAL_UNAVAILABLE with a friendly message
    in _BIOCENTRAL_FRIENDLY_MESSAGE, but the message is plain text — the frontend currently doesn't render it
    as a clickable link to the Colab notebook. When the SPA receives an error with code === 'BIOCENTRAL_UNAVAILABLE', it should show a dedicated banner/notification with the Colab link and an
    explanation that the user can run the same pipeline there. Right now the user just sees an error toast.

Dead / inconsistent config

  • Settings.biocentral_endpoint (config.py:19, PREP_BIOCENTRAL_ENDPOINT) is loaded but never read. Either
    pipe it into the protspace embed subprocess or drop the field.
  • PREP_SEQUENCE_MIN_COUNT defaults to 1 in config.py, is overridden to 20 in docker-compose.yml, and
    isn't mentioned in services/protspace-prep/README.md. Pick one and document it.

Worth a TODO, not a blocker

  • In-memory JobRegistry — container restart orphans live jobs. Job dirs survive on the volume but _jobs is
    empty, so /bundle returns 404 until the sweeper runs. Worth flagging in the README and deciding how the SPA
    should react.
  • _classify_failure is only applied inside the embed/annotate TaskGroup catch (pipeline.py:115). A
    failure in project or bundle won't get the BIOCENTRAL_UNAVAILABLE reclassification. Not a real risk
    today, but a sharp edge if those steps ever go network-bound.

Smaller

  • _looks_like_nucleotide is all(ch in ACGTUN) over ≥50 residues — theoretically can false-positive on
    pathological sequences (e.g. poly-A peptides). Consider a ratio-based check if false positives ever appear.



Development

Successfully merging this pull request may close these issues.

[FEATURE] Installing the github protspace setup on one of our clusters

2 participants