
[GSoC 2026] Codebase Analysis & Proposed Migration Architecture :- Feedback Requested #95

@Abhishek-Kumar-Rai5


About Me

Hi, I'm a GSoC 2026 applicant interested in the openPIP 2.0 modernization project
under NRNB. I've spent time going through the existing codebase, the published paper
(Helmy et al., JMB 2022), and the GSoC project description carefully before posting
this. I want to share my understanding of the current system and my proposed migration
approach, and get feedback from the mentors before I finalize my proposal.


Section 1: Current Stack — What I've Mapped Out

From the codebase and paper, here's my understanding of the existing architecture:

| Layer | Current (v1) |
| --- | --- |
| Language | PHP 7.2 |
| Framework | Symfony 2.x |
| ORM | Doctrine ORM |
| Frontend | jQuery, jQuery UI, Cytoscape.js, FooTable, qTip2, TinyMCE |
| Database | MySQL 8.0 |
| Server | Apache (`php:7.2.0-apache`) |
| Containerization | Docker Compose |
| Data format | PSI-MI TAB v2.7 |
| External APIs | UniProt, Ensembl (protein annotation during upload) |
| Asset pipeline | Symfony Assetic |
| File uploads | vich/uploader-bundle |

Key entry points I've identified:

- `start.sh` / `populate_db.sh` — Docker startup and DB seeding scripts
- `src/AppBundle/Controller/` — page controllers (Search, Admin, Download, User)
- `src/AppBundle/Entity/` — Doctrine ORM entities (Protein, Interaction, Dataset, etc.)
- `web/` — public root with `app.php` and static assets

The current admin workflow is: PSI-MI TAB file upload → PHP parser →
MySQL population → UniProt/Ensembl annotation fetch → web display via Symfony/Twig.


Section 2: Proposed Migration — openPIP 2.0 Stack

Here's the mapping I'm proposing for the new stack:

| Layer | Current (v1) | Proposed (v2.0) |
| --- | --- | --- |
| Language | PHP 7.2 | Python 3.11+ |
| Framework | Symfony 2.x | FastAPI (or Django REST) |
| ORM | Doctrine | SQLAlchemy (FastAPI) / Django ORM |
| Frontend | jQuery + Twig templates | React + Vite |
| Network viz | Cytoscape.js (jQuery-bound) | Cytoscape.js (React wrapper) |
| Database | MySQL 8.0 | PostgreSQL (or keep MySQL) |
| Containerization | Docker Compose | Docker Compose (retained and improved) |
| Data formats | PSI-MI TAB v2.7 only | PSI-MI TAB v2.7 + CSV |
| File uploads | vich/uploader-bundle | React drag-and-drop + FastAPI endpoints |
| External APIs | UniProt, Ensembl | UniProt REST API + Ensembl REST API |

My reasoning for FastAPI over Django: the core of openPIP 2.0 is data
ingestion + query serving — FastAPI's async nature suits this well, and
its automatic OpenAPI docs would give openPIP a REST API essentially for
free (something v1 explicitly lacks per the paper). However, I'm open to
Django if the mentors prefer it for its batteries-included admin panel.


Section 3: My Proposed Approach for the Data Upload Pipeline

This is the most critical component. The current PHP upload flow would
map to Python as:

[File Upload (React drag-drop)]
       ↓
[FastAPI endpoint receives file]
       ↓
[Python PSI-MI TAB v2.7 parser]  ← I've already started a PoC of this
       ↓
[Validation layer — column count, format checks, controlled vocabulary]
       ↓
[UniProt/Ensembl REST API calls for protein annotation]
       ↓
[SQLAlchemy models → PostgreSQL]
       ↓
[Progress feedback via WebSocket or SSE to React frontend]
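
As a sketch of the parser step, assuming the standard 42-column MITAB 2.7 layout; the record fields shown are an illustrative subset, not the final data model:

```python
from dataclasses import dataclass

MITAB27_COLUMN_COUNT = 42  # PSI-MI TAB v2.7 defines 42 tab-separated columns

@dataclass
class InteractionRecord:
    """A few core MITAB fields, for illustration only."""
    interactor_a: str      # column 1, e.g. "uniprotkb:P12345"
    interactor_b: str      # column 2
    detection_method: str  # column 7, e.g. 'psi-mi:"MI:0018"(two hybrid)'
    publication_ids: str   # column 9
    interaction_type: str  # column 12

def parse_mitab_line(line: str) -> InteractionRecord:
    """Parse one non-header MITAB 2.7 line; reject malformed rows early."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != MITAB27_COLUMN_COUNT:
        raise ValueError(
            f"expected {MITAB27_COLUMN_COUNT} columns, got {len(cols)}"
        )
    return InteractionRecord(
        interactor_a=cols[0],
        interactor_b=cols[1],
        detection_method=cols[6],
        publication_ids=cols[8],
        interaction_type=cols[11],
    )
```

The column-count check doubles as the first gate of the validation layer; controlled-vocabulary checks (MI terms) would follow on the parsed fields.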

For CSV support (new in v2.0), I'd add a normalization step that maps
CSV columns to the internal data model before hitting the same
validation/storage pipeline.
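
A minimal sketch of that normalization step; the CSV header names and the mapping below are hypothetical stand-ins, since the real mapping would target the internal data model:

```python
import csv
import io

# Hypothetical mapping from a lab's CSV headers to internal field names
COLUMN_MAP = {
    "protein_a": "interactor_a",
    "protein_b": "interactor_b",
    "method": "detection_method",
    "pmid": "publication_ids",
}

def normalize_csv(text: str) -> list:
    """Rename known CSV columns to internal names; ignore unknown columns."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {COLUMN_MAP[k]: v for k, v in row.items() if k in COLUMN_MAP}
        for row in reader
    ]
```

After this step the rows look the same as parsed MITAB records, so the existing validation and storage code runs unchanged on both input formats.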


Section 4: Questions for the Mentors

Before I finalize my proposal, I have 3 specific questions I'd genuinely
like your input on:

Q1 — Database: Is the plan to redesign the MySQL schema from scratch
for v2.0, or should the new schema preserve backward compatibility with
the existing one? This significantly affects how I scope the migration work.

Q2 — Framework preference: Do you have a preference between FastAPI
and Django for the backend? The paper mentions Python as preferred —
either works, but knowing your preference early would help me write a
more aligned proposal.

Q3 — Scope of "CSV support": When the project description mentions
a "simpler tabular CSV format," does this mean a simplified subset of
PSI-MI fields in CSV form, or a completely custom schema that labs can
define? This changes how complex the normalization layer needs to be.


I'm happy to discuss any of the above or adjust my approach based on
your feedback. I'll also be opening a small proof-of-concept PR
(a Python PSI-MI TAB parser) shortly as a concrete contribution.

Thanks for your time.
