[GSoC 2026] Codebase Analysis & Proposed Migration Architecture :- Feedback Requested #95
About Me
Hi, I'm a GSoC 2026 applicant interested in the openPIP 2.0 modernization project
under NRNB. I've spent time going through the existing codebase, the published paper
(Helmy et al., JMB 2022), and the GSoC project description carefully before posting
this. I want to share my understanding of the current system and my proposed migration
approach, and get feedback from the mentors before I finalize my proposal.
Section 1: Current Stack — What I've Mapped Out
From the codebase and paper, here's my understanding of the existing architecture:
| Layer | Current (v1) |
|---|---|
| Language | PHP 7.2 |
| Framework | Symfony 2.x |
| ORM | Doctrine ORM |
| Frontend | jQuery, jQuery UI, Cytoscape.js, FooTable, qTip2, TinyMCE |
| Database | MySQL 8.0 |
| Server | Apache (php:7.2.0-apache) |
| Containerization | Docker Compose |
| Data Format | PSI-MI TAB v2.7 |
| External APIs | UniProt, Ensembl (protein annotation during upload) |
| Asset pipeline | Symfony Assetic |
| File uploads | vich/uploader-bundle |
Key entry points I've identified:
- `start.sh` / `populate_db.sh` — Docker startup and DB seeding scripts
- `src/AppBundle/Controller/` — page controllers (Search, Admin, Download, User)
- `src/AppBundle/Entity/` — Doctrine ORM entities (Protein, Interaction, Dataset, etc.)
- `web/` — public root with `app.php` and static assets
The current admin workflow is: PSI-MI TAB file upload → PHP parser →
MySQL population → UniProt/Ensembl annotation fetch → web display via Symfony/Twig.
Section 2: Proposed Migration — openPIP 2.0 Stack
Here's the mapping I'm proposing for the new stack:
| Layer | Current (v1) | Proposed (v2.0) |
|---|---|---|
| Language | PHP 7.2 | Python 3.11+ |
| Framework | Symfony 2.x | FastAPI (or Django REST) |
| ORM | Doctrine | SQLAlchemy (FastAPI) / Django ORM |
| Frontend | jQuery + Twig templates | React + Vite |
| Network viz | Cytoscape.js (jQuery-bound) | Cytoscape.js (React wrapper) |
| Database | MySQL 8.0 | PostgreSQL (or keep MySQL) |
| Containerization | Docker Compose | Docker Compose (retained + improved) |
| Data formats | PSI-MI TAB v2.7 only | PSI-MI TAB v2.7 + CSV |
| File uploads | vich/uploader-bundle | React drag-and-drop + FastAPI endpoints |
| External APIs | UniProt, Ensembl | UniProt REST API + Ensembl REST API |
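To make the external-API row concrete, here's a minimal sketch of what the v2 annotation fetch might look like. The endpoint path follows the public UniProt REST API; the function names are my own illustration, not existing openPIP code:

```python
import json
import urllib.request

UNIPROT_BASE = "https://rest.uniprot.org/uniprotkb"  # public UniProt REST API

def uniprot_url(accession: str) -> str:
    """Build the JSON endpoint URL for a single UniProtKB accession."""
    return f"{UNIPROT_BASE}/{accession}.json"

def fetch_annotation(accession: str, timeout: float = 10.0) -> dict:
    """Fetch one protein record; the real pipeline would add retries and batching."""
    with urllib.request.urlopen(uniprot_url(accession), timeout=timeout) as resp:
        return json.load(resp)
```

In the actual upload pipeline these calls would be batched and made async so a large MITAB file doesn't serialize thousands of HTTP round trips.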
My reasoning for FastAPI over Django: the core of openPIP 2.0 is data
ingestion + query serving — FastAPI's async nature suits this well, and
its automatic OpenAPI docs would give openPIP a REST API essentially for
free (something v1 explicitly lacks per the paper). However, I'm open to
Django if the mentors prefer it for its batteries-included admin panel.
Section 3: My Proposed Approach for the Data Upload Pipeline
This is the most critical component. The current PHP upload flow would
map to Python as:
[File Upload (React drag-drop)]
↓
[FastAPI endpoint receives file]
↓
[Python PSI-MI TAB v2.7 parser] ← I've already started a PoC of this
↓
[Validation layer — column count, format checks, controlled vocabulary]
↓
[UniProt/Ensembl REST API calls for protein annotation]
↓
[SQLAlchemy models → PostgreSQL]
↓
[Progress feedback via WebSocket or SSE to React frontend]
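The parse and validate steps above can be sketched as follows. This is a simplified illustration (column-count check and identifier splitting only); the field indices follow the PSI-MI TAB v2.7 column order, and the record shape is a hypothetical internal model, not the repo's actual schema:

```python
MITAB27_COLUMNS = 42  # PSI-MI TAB v2.7 defines 42 tab-separated columns

def split_identifier(field: str) -> tuple[str, str]:
    """Split a 'database:accession' field, e.g. 'uniprotkb:P12345'."""
    db, _, acc = field.partition(":")
    return db, acc

def parse_mitab_line(line: str, line_no: int = 0) -> dict:
    """Parse one MITAB 2.7 row into a small internal record.

    Raises ValueError on a wrong column count so the upload endpoint
    can report the failing line number back to the admin UI.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != MITAB27_COLUMNS:
        raise ValueError(
            f"line {line_no}: expected {MITAB27_COLUMNS} columns, got {len(cols)}"
        )
    db_a, acc_a = split_identifier(cols[0])  # column 1: unique ID, interactor A
    db_b, acc_b = split_identifier(cols[1])  # column 2: unique ID, interactor B
    return {
        "interactor_a": {"db": db_a, "accession": acc_a},
        "interactor_b": {"db": db_b, "accession": acc_b},
        "detection_method": cols[6],  # column 7: interaction detection method
    }
```

Controlled-vocabulary validation (checking MI ontology terms) would sit on top of this, which is what my proof-of-concept parser is working toward.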
For CSV support (new in v2.0), I'd add a normalization step that maps
CSV columns to the internal data model before hitting the same
validation/storage pipeline.
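That normalization step could be as simple as a header mapping. The column names below are hypothetical (the real mapping would depend on the answer to Q3); the point is that CSV rows land in the same internal shape the MITAB path produces:

```python
import csv
import io

# Hypothetical header mapping: lab CSV column -> internal field name.
CSV_TO_INTERNAL = {
    "Protein A": "interactor_a",
    "Protein B": "interactor_b",
    "Method": "detection_method",
}

def normalize_csv(text: str) -> list[dict]:
    """Map CSV rows onto the internal model used by the MITAB pipeline.

    Unmapped columns are dropped; a missing mapped column raises KeyError
    so the validation layer can surface it to the uploader.
    """
    reader = csv.DictReader(io.StringIO(text))
    return [
        {internal: row[csv_col] for csv_col, internal in CSV_TO_INTERNAL.items()}
        for row in reader
    ]
```

After this step, CSV and MITAB uploads share one validation/annotation/storage path, which keeps the new format from doubling the pipeline's surface area.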
Section 4: Questions for the Mentors
Before I finalize my proposal, I have 3 specific questions I'd genuinely
like your input on:
Q1 — Database: Is the plan to redesign the MySQL schema from scratch
for v2.0, or should the new schema preserve backward compatibility with
the existing one? This significantly affects how I scope the migration work.
Q2 — Framework preference: Do you have a preference between FastAPI
and Django for the backend? The paper mentions Python as preferred —
either works, but knowing your preference early would help me write a
more aligned proposal.
Q3 — Scope of "CSV support": When the project description mentions
a "simpler tabular CSV format," does this mean a simplified subset of
PSI-MI fields in CSV form, or a completely custom schema that labs can
define? This changes how complex the normalization layer needs to be.
I'm happy to discuss any of the above or adjust my approach based on
your feedback. I'll also be opening a small proof-of-concept PR
(a Python PSI-MI TAB parser) shortly as a concrete contribution.
Thanks for your time.