GSoC 2026: PoC + Proposal#98

Open
alaotach wants to merge 7 commits into BaderLab:master from alaotach:poc
Overview

This PR brings together the GSoC 2026 proposal and proof-of-concept implementation for OpenPIP's modernized, multi-protocol data ingestion system. The core idea is a complete architectural shift away from synchronous Symfony-based uploads toward an async, event-driven pipeline that can handle diverse interaction data formats through pluggable parser support. The proposal and the working code are both here, so reviewers can evaluate not just the plan but the execution behind it.


Proposal Documentation

The proposal document (gsoc_openpip_2_proposal_draft.md) covers a 350-hour effort across six technical fronts: legacy archaeology, schema design, ingestion pipeline, async job system, API and search modernization, and frontend refactoring. Rather than staying high-level, it includes concrete code-level references that map existing Symfony controllers to their modern FastAPI equivalents, detailed parser architecture documentation, and infrastructure requirements. The goal was to write something reviewers could actually verify, not just read.


PoC Implementation

Demo video: Screen.Recording.2026-03-27.155919.mp4

Backend Architecture

The backend is built on FastAPI with modular router composition across uploads, search, exports, datasets, admin, and legacy compatibility. An explicit service layer (upload_service.py) handles orchestration with validation guards for job IDs and raw payloads. Data persistence (db.py) uses PostgreSQL with deterministic deduplication via SHA256 hashing of dataset and namespace identifier pairs, so duplicate canonical interactions cannot accumulate across dataset boundaries. The async job system (jobs.py) runs a two-phase validate-then-commit workflow backed by ARQ and Redis.
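The deterministic deduplication described above can be sketched as a content hash over the dataset and the sorted namespace:identifier pairs. This is an illustrative sketch, not the code in db.py: the field names, the undirected-interaction assumption, and the `|` separator are assumptions.

```python
import hashlib


def interaction_key(dataset_id: str, source_ns: str, source_id: str,
                    target_ns: str, target_id: str) -> str:
    """Deterministic SHA256 key for a canonical interaction.

    The two participants are sorted so that A-B and B-A hash identically
    (assumption: interactions are undirected). Because the same inputs always
    produce the same digest, re-ingesting a dataset cannot create duplicates.
    """
    a, b = sorted([f"{source_ns}:{source_id}", f"{target_ns}:{target_id}"])
    return hashlib.sha256(f"{dataset_id}|{a}|{b}".encode("utf-8")).hexdigest()
```

Stored with a unique constraint on this key, the database itself rejects duplicate canonical interactions rather than relying on application-level checks.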

The parser framework uses a Protocol-based InteractionParser contract, making it straightforward to add new formats without touching core logic. Two parsers ship with the PoC: a full PSI-MI TAB 2.5/2.7 parser with multi-format confidence normalization, and a CSV gene interaction parser with namespace validation. Row-level diagnostics surface as structured RowValidationError objects, with CSV export available for remediation.
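A Protocol-based contract like the one described might look as follows. This is a minimal sketch assuming simplified method names and dataclass fields; the actual `InteractionParser` and `RowValidationError` definitions in the PoC may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass
class RowValidationError:
    row: int
    message: str


@dataclass
class ParsedInteraction:
    source_id: str
    target_id: str


class InteractionParser(Protocol):
    """Structural contract: any class with these methods is a parser.

    New formats plug in without inheriting from a base class or touching
    core ingestion logic.
    """
    def can_parse(self, filename: str) -> bool: ...
    def parse(self, lines: Iterable[str]) -> tuple[list[ParsedInteraction],
                                                   list[RowValidationError]]: ...


class CsvGeneParser:
    """Toy CSV parser satisfying the protocol (illustrative only)."""

    def can_parse(self, filename: str) -> bool:
        return filename.endswith(".csv")

    def parse(self, lines):
        rows, errors = [], []
        for i, line in enumerate(lines, start=1):
            parts = [p.strip() for p in line.split(",")]
            if len(parts) < 2 or not all(parts[:2]):
                # Row-level diagnostics instead of failing the whole file
                errors.append(RowValidationError(i, "expected two identifiers"))
                continue
            rows.append(ParsedInteraction(parts[0], parts[1]))
        return rows, errors
```

The structural typing means the ingestion core can accept any object matching the protocol, which is what makes the framework pluggable.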

Frontend Integration

The React upload manager (upload-manager.tsx) includes UUID validation guards that prevent undefined job IDs from ever reaching an API call. Job progress streams in real time via Server-Sent Events. The UX follows a deliberate two-phase flow: validate first, review any errors, then commit. This pattern means users can fix specific rows and retry without starting over.
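The two-phase flow can be modeled as a small state machine in which validation results gate the commit. This is a conceptual sketch in Python (the real flow lives in the TypeScript upload manager); the phase names are assumptions.

```python
from enum import Enum


class JobPhase(str, Enum):
    VALIDATING = "validating"
    NEEDS_REVIEW = "needs_review"  # row errors present; user fixes and retries
    READY = "ready"                # validation clean; commit is allowed
    COMMITTING = "committing"
    DONE = "done"


def next_phase(phase: JobPhase, error_count: int) -> JobPhase:
    """Advance the validate-then-commit flow.

    Commit is only reachable from READY, so a job with outstanding row
    errors can never be committed by accident.
    """
    if phase is JobPhase.VALIDATING:
        return JobPhase.NEEDS_REVIEW if error_count else JobPhase.READY
    if phase is JobPhase.READY:
        return JobPhase.COMMITTING
    if phase is JobPhase.COMMITTING:
        return JobPhase.DONE
    return phase  # NEEDS_REVIEW waits for user action; DONE is terminal
```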

Infrastructure

The Docker Compose setup brings up PostgreSQL 16, Redis 7, the FastAPI service, an ARQ worker, and the Next.js frontend together. There are no sync mode fallbacks. Redis and PostgreSQL are required at startup and enforced through configuration validation, keeping the local development topology honest against production. Database migrations are managed through Alembic with typed schemas for upload jobs, canonical interactions, row errors, and source records.
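Fail-fast configuration validation of this kind can be sketched as a startup check that refuses to boot without its backing services. The variable names below are assumptions; the PoC may use pydantic-settings rather than a plain dict check.

```python
REQUIRED_SETTINGS = ("DATABASE_URL", "REDIS_URL")


def validate_settings(env: dict[str, str]) -> dict[str, str]:
    """Enforce required services at startup; no sync-mode fallback.

    Raising here keeps the local topology honest: if Redis or PostgreSQL
    is missing, the service refuses to start instead of silently degrading.
    """
    missing = [key for key in REQUIRED_SETTINGS if not env.get(key)]
    if missing:
        raise RuntimeError(f"missing required settings: {', '.join(missing)}")
    return {key: env[key] for key in REQUIRED_SETTINGS}
```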

Legacy Compatibility

A compatibility shim (legacy_compat.py) preserves the existing API surface during incremental cutover. The three legacy routes map cleanly: the upload process endpoint enqueues validation, the insert data endpoint handles import and commit, and the search endpoint proxies to the modernized interaction search. Nothing breaks for consumers that have not migrated yet.
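The shim's route mapping can be sketched as a lookup table from legacy paths to modern handlers. The paths and handler names below are hypothetical placeholders, not the actual routes in legacy_compat.py.

```python
# Hypothetical legacy-path -> modern-handler mapping (names illustrative)
LEGACY_ROUTES: dict[str, str] = {
    "/upload/process": "enqueue_validation",   # enqueues async validation
    "/insert/data": "import_and_commit",       # handles import and commit
    "/search": "interaction_search",           # proxies to modern search
}


def resolve_legacy(path: str) -> str:
    """Map a legacy route onto its modern equivalent, or fail loudly."""
    try:
        return LEGACY_ROUTES[path]
    except KeyError:
        raise ValueError(f"no legacy mapping for {path!r}") from None
```

Keeping the mapping explicit makes the cutover incremental: each legacy route can be retired independently once its consumers migrate.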

Test Coverage

Tests cover confidence normalization and identifier validation in the parser layer, format validation for both CSV and PSI-MI TAB inputs, and end-to-end integration tests for the validate-then-commit workflow including error collection behavior.


Problems Solved Along the Way

Three concrete issues came up during development and were resolved before this PR:

HTTP 422 errors and undefined job IDs were fixed by adding frontend UUID validation alongside backend normalization guards. Response serialization failures were fixed by converting raw JSON strings to dicts before Pydantic model construction. Sync mode conflicts were eliminated entirely by removing fallback paths and enforcing the Docker and Redis requirement from startup.


What Makes This Architecture Worth It

The parser pluggability means new formats can be added without touching core ingestion logic. Deterministic hashing keeps deduplication reliable across dataset boundaries without any manual intervention. Row-level error context gives users enough information to fix and retry specific records rather than re-uploading everything. And the Docker Compose setup means anyone picking this up locally gets an environment that actually reflects production topology.


Files Changed

Proposal: gsoc_openpip_2_proposal_draft.md

Backend: app/main.py, app/services/upload_service.py, app/db.py, app/models.py, app/parser.py, app/parsers.py, app/jobs.py

Routers: app/routers/uploads.py, app/routers/legacy_compat.py, and supporting routers

Frontend: frontend/components/upload-manager.tsx

Infrastructure: docker-compose.yml, migrations/

Tests: tests/test_parser_*.py, tests/test_integration_upload_commit.py


Validation Status

Full validate-then-commit workflow is integration tested. Parser framework is tested against both PSI-MI TAB and CSV formats. Error collection and CSV export are verified. EventSource streaming for job progress is confirmed working. Frontend UUID guards preventing undefined token propagation are tested. Legacy API route compatibility shim is functional.
