Overview
This PR brings together the GSoC 2026 proposal and proof-of-concept implementation for OpenPIP's modernized, multi-protocol data ingestion system. The core idea is a complete architectural shift away from synchronous Symfony-based uploads toward an async, event-driven pipeline that can handle diverse interaction data formats through pluggable parser support. The proposal and the working code are both here, so reviewers can evaluate not just the plan but the execution behind it.
Proposal Documentation
The proposal document (`gsoc_openpip_2_proposal_draft.md`) covers a 350-hour effort across six technical fronts: legacy archaeology, schema design, ingestion pipeline, async job system, API and search modernization, and frontend refactoring. Rather than staying high-level, it includes concrete code-level references that map existing Symfony controllers to their modern FastAPI equivalents, detailed parser architecture documentation, and infrastructure requirements. The goal was to write something reviewers could actually verify, not just read.

PoC Implementation
Screen.Recording.2026-03-27.155919.mp4
Backend Architecture
The backend is built on FastAPI with modular router composition across uploads, search, exports, datasets, admin, and legacy compatibility. An explicit service layer (`upload_service.py`) handles orchestration with validation guards for job IDs and raw payloads. Data persistence (`db.py`) uses PostgreSQL with deterministic deduplication via SHA-256 hashing of dataset and namespaced identifier pairs, so duplicate canonical interactions cannot accumulate across dataset boundaries. The async job system (`jobs.py`) runs a two-phase validate-then-commit workflow backed by ARQ and Redis.

The parser framework uses a Protocol-based `InteractionParser` contract, making it straightforward to add new formats without touching core logic. Two parsers ship with the PoC: a full PSI-MI TAB 2.5/2.7 parser with multi-format confidence normalization, and a CSV gene interaction parser with namespace validation. Row-level diagnostics surface as structured `RowValidationError` objects, with CSV export available for remediation.

Frontend Integration
The React upload manager (`upload-manager.tsx`) includes UUID validation guards that prevent undefined job IDs from ever reaching an API call. Job progress streams in real time via Server-Sent Events. The UX follows a deliberate two-phase flow: validate first, review any errors, then commit. This pattern means users can fix specific rows and retry without starting over.

Infrastructure
The Docker Compose setup brings up PostgreSQL 16, Redis 7, the FastAPI service, an ARQ worker, and the Next.js frontend together. There are no sync mode fallbacks. Redis and PostgreSQL are required at startup and enforced through configuration validation, keeping the local development topology honest against production. Database migrations are managed through Alembic with typed schemas for upload jobs, canonical interactions, row errors, and source records.
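The fail-fast requirement on Redis and PostgreSQL could be enforced by a small startup check along these lines; the setting names (`DATABASE_URL`, `REDIS_URL`) and the function name are illustrative assumptions, not the PoC's actual configuration keys:

```python
import os

# Hypothetical configuration keys; the real names live in the app's settings module.
REQUIRED_SETTINGS = ("DATABASE_URL", "REDIS_URL")

def validate_settings(env=os.environ) -> dict:
    """Fail fast at startup: no sync-mode fallback when a backing service is missing."""
    missing = [key for key in REQUIRED_SETTINGS if not env.get(key)]
    if missing:
        raise RuntimeError(
            "missing required settings: " + ", ".join(missing)
            + " (Redis and PostgreSQL are mandatory; there is no sync fallback)"
        )
    return {key: env[key] for key in REQUIRED_SETTINGS}
```

Raising at import/startup time rather than degrading to a sync path is what keeps the local topology honest against production.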
Legacy Compatibility
A compatibility shim (`legacy_compat.py`) preserves the existing API surface during incremental cutover. The three legacy routes map cleanly: the upload process endpoint enqueues validation, the insert data endpoint handles import and commit, and the search endpoint proxies to the modernized interaction search. Nothing breaks for consumers that have not migrated yet.

Test Coverage
Tests cover confidence normalization and identifier validation in the parser layer, format validation for both CSV and PSI-MI TAB inputs, and end-to-end integration tests for the validate-then-commit workflow including error collection behavior.
Problems Solved Along the Way
Three concrete issues came up during development and were resolved before this PR:
HTTP 422 errors and undefined job IDs were fixed by adding frontend UUID validation alongside backend normalization guards. Response serialization failures were fixed by converting raw JSON strings to dicts before Pydantic model construction. Sync mode conflicts were eliminated entirely by removing fallback paths and enforcing the Docker and Redis requirement from startup.
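Two of those fixes are small enough to sketch directly (shown in Python for consistency, although the UUID guard actually lives in the React frontend; function names are illustrative, and the Pydantic model construction step after coercion is omitted):

```python
import json
import uuid

def is_valid_job_id(value: object) -> bool:
    """Reject None, 'undefined', and malformed ids before they reach an API call."""
    if not isinstance(value, str):
        return False
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False

def coerce_payload(raw: object) -> dict:
    """Decode a raw JSON string into a dict before handing it to a Pydantic model."""
    decoded = json.loads(raw) if isinstance(raw, (str, bytes)) else raw
    if not isinstance(decoded, dict):
        raise TypeError(f"expected a JSON object, got {type(decoded).__name__}")
    return decoded
```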
What Makes This Architecture Worth It
The parser pluggability means new formats can be added without touching core ingestion logic. Deterministic hashing keeps deduplication reliable across dataset boundaries without any manual intervention. Row-level error context gives users enough information to fix and retry specific records rather than re-uploading everything. And the Docker Compose setup means anyone picking this up locally gets an environment that actually reflects production topology.
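The pluggability claim rests on structural typing: any class exposing the right methods satisfies the contract without inheriting from or registering with core code. A minimal sketch, where the method names and the `ParsedInteraction` shape are assumptions rather than the PoC's actual signatures:

```python
from dataclasses import dataclass
from typing import Iterable, Protocol

@dataclass
class ParsedInteraction:
    source_id: str   # namespaced identifier, e.g. "gene:BRCA1" (illustrative)
    target_id: str

@dataclass
class RowValidationError:
    row: int
    message: str

class InteractionParser(Protocol):
    """Structural contract: no inheritance from core ingestion code required."""
    def supports(self, filename: str) -> bool: ...
    def parse(
        self, lines: Iterable[str]
    ) -> tuple[list[ParsedInteraction], list[RowValidationError]]: ...

class CsvGeneParser:
    """Toy CSV parser; the real one also validates identifier namespaces."""
    def supports(self, filename: str) -> bool:
        return filename.endswith(".csv")

    def parse(self, lines):
        records, errors = [], []
        for i, line in enumerate(lines, start=1):
            parts = [part.strip() for part in line.split(",")]
            if len(parts) < 2 or not all(parts[:2]):
                errors.append(RowValidationError(row=i, message="expected two identifiers"))
                continue
            records.append(ParsedInteraction(parts[0], parts[1]))
        return records, errors

parser: InteractionParser = CsvGeneParser()  # type-checks structurally
records, errors = parser.parse(["gene:BRCA1,gene:TP53", "gene:only-one"])
```

Because the errors carry row numbers, they can be surfaced per record, which is what enables the fix-and-retry loop described above.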
Files Changed
Proposal: `gsoc_openpip_2_proposal_draft.md`
Backend: `app/main.py`, `app/services/upload_service.py`, `app/db.py`, `app/models.py`, `app/parser.py`, `app/parsers.py`, `app/jobs.py`
Routers: `app/routers/uploads.py`, `app/routers/legacy_compat.py`, and supporting routers
Frontend: `frontend/components/upload-manager.tsx`
Infrastructure: `docker-compose.yml`, `migrations/`
Tests: `tests/test_parser_*.py`, `tests/test_integration_upload_commit.py`

Validation Status
Full validate-then-commit workflow is integration tested. Parser framework is tested against both PSI-MI TAB and CSV formats. Error collection and CSV export are verified. EventSource streaming for job progress is confirmed working. Frontend UUID guards preventing undefined token propagation are tested. Legacy API route compatibility shim is functional.
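In miniature, the tested validate-then-commit workflow combined with deterministic SHA-256 deduplication behaves like the following sketch (in-memory store, hypothetical field choices; the real scheme lives in `jobs.py` and `db.py`, and sorting the identifier pair so A-B and B-A collapse is an assumption):

```python
import hashlib

def dedup_key(dataset_id: str, source_id: str, target_id: str) -> str:
    """Deterministic dedup key via SHA-256 over dataset and identifier pair."""
    a, b = sorted((source_id, target_id))  # make the pair order-insensitive (assumed)
    return hashlib.sha256("\x1f".join((dataset_id, a, b)).encode()).hexdigest()

def validate_rows(rows: list[tuple[str, str]]) -> tuple[list, list]:
    """Phase 1: collect row-level errors instead of aborting on the first bad row."""
    good, errors = [], []
    for i, (source, target) in enumerate(rows, start=1):
        if not source or not target:
            errors.append({"row": i, "message": "missing identifier"})
        else:
            good.append((source, target))
    return good, errors

def commit_rows(store: dict, dataset_id: str, rows: list[tuple[str, str]]) -> int:
    """Phase 2: insert validated rows, skipping duplicates by canonical key."""
    inserted = 0
    for source, target in rows:
        key = dedup_key(dataset_id, source, target)
        if key not in store:
            store[key] = (dataset_id, source, target)
            inserted += 1
    return inserted

store: dict = {}
rows = [("gene:BRCA1", "gene:TP53"), ("gene:TP53", "gene:BRCA1"), ("", "gene:EGFR")]
good, errors = validate_rows(rows)          # user reviews errors before committing
inserted = commit_rows(store, "ds1", good)  # the symmetric duplicate collapses to one row
```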