[GSoC 2026] Add Python PSI-MI TAB v2.7 parser, CSV support and validation layer for openPIP 2.0 #97
Open
Abhishek-Kumar-Rai5 wants to merge 1 commit into BaderLab:master from
- mitab_parser.py: full PSI-MI TAB v2.7 parser mapping to openpip.sql schema
- csv_parser.py: new CSV format support normalizing to same ParsedInteraction model
- models.py: Python dataclasses mirroring protein, interaction, dataset, organism tables
- validator.py: per-row validation with errors and warnings before DB insertion
- tests/: 34 passing tests covering all parsers and validator
- Updated requirements.txt with pytest
Companion to issue #96.
The existing data-upload/uploader.py is a Selenium script that submits files through the live website UI; it contains no parsing logic. requirements.txt had only selenium. This PR adds the missing parsing, validation, and data modeling layer to data-upload/.

Files Added
models.py
Python dataclasses mapped field-by-field to the openpip.sql schema. Each dataclass corresponds to a table:
- Protein: protein
- Organism: organism
- Dataset: dataset
- InteractionCategory: interaction_category
- ParsedInteraction: interaction + junction tables
Every field name matches the actual DB column name from openpip.sql.
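As an illustration of the dataclass-per-table idea, a minimal sketch follows. Only score (varchar(10)), pubmed_id, and category_name are column names mentioned in this PR; every other field name, and the sample values, are hypothetical placeholders rather than the real openpip.sql schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dataset:
    # pubmed_id is named in this PR; "name" is a hypothetical placeholder.
    name: Optional[str] = None
    pubmed_id: Optional[str] = None


@dataclass
class InteractionCategory:
    # category_name is named in this PR.
    category_name: str = ""


@dataclass
class ParsedInteraction:
    # Hypothetical shape: two interactor identifiers plus related rows.
    interactor_a: str = ""
    interactor_b: str = ""
    score: Optional[str] = None  # kept as text; the column is varchar(10)
    dataset: Optional[Dataset] = None
    categories: List[InteractionCategory] = field(default_factory=list)


# Made-up sample values, for illustration only.
example = ParsedInteraction(
    interactor_a="P04637",
    interactor_b="Q00987",
    score="0.92",
    dataset=Dataset(pubmed_id="12345678"),
    categories=[InteractionCategory(category_name="two hybrid")],
)
```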
mitab_parser.py
Full PSI-MI TAB v2.7 parser. Output maps directly to models.py. It handles:
- the db:value(description) format with pipe-separated multiples
- quoted PSI-MI terms such as psi-mi:"MI:0018"(two hybrid)
- protein identifiers, mapped to protein table columns
- organism taxid, mapped to the organism table
- score, mapped to interaction.score and capped to varchar(10) per schema
- PubMed ID, mapped to dataset.pubmed_id
- interaction category, mapped to interaction_category.category_name
Two entry points:
- parse_mitab27(filepath): reads from disk
- parse_mitab27_from_string(content): reads raw string content, for FastAPI endpoints receiving uploaded files in memory
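To make the db:value(description) handling concrete, here is a rough sketch of how one MITAB column could be split into terms. It is not the code in mitab_parser.py; the names MitabTerm and parse_mitab_field are hypothetical, and the regular expressions assume that quoted values contain no embedded double quotes.

```python
import re
from typing import List, NamedTuple, Optional


class MitabTerm(NamedTuple):
    db: str
    value: str
    description: Optional[str]


# Split on "|" but keep pipes that sit inside double quotes.
_TOKEN = re.compile(r'(?:"[^"]*"|[^|])+')
# db:value or db:"quoted value", optionally followed by (description).
_TERM = re.compile(
    r'^(?P<db>[^:]+):(?P<value>"[^"]*"|[^(]*)(?:\((?P<desc>.*)\))?$'
)


def parse_mitab_field(raw: str) -> List[MitabTerm]:
    """Parse one MITAB column into a list of terms ("-" means empty)."""
    raw = raw.strip()
    if raw in ("", "-"):
        return []
    terms = []
    for token in _TOKEN.findall(raw):
        match = _TERM.match(token)
        if match is None:
            continue  # skip tokens that do not follow db:value(description)
        value = match.group("value").strip('"')
        terms.append(MitabTerm(match.group("db"), value, match.group("desc")))
    return terms


quoted = parse_mitab_field('psi-mi:"MI:0018"(two hybrid)')
multiple = parse_mitab_field("uniprotkb:P04637|ensembl:ENSG00000141510(gene)")
```

Here quoted holds a single term with db "psi-mi", value "MI:0018", and description "two hybrid", while multiple holds two terms, the first of which has no description.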
csv_parser.py
New format support per the GSoC project goals. Minimum required columns: protein_a, protein_b. Optional: interaction_type, score, publication, author, dataset, year.
Normalizes to the same ParsedInteraction model as the PSI-MI TAB parser. Both formats feed the same validation and insertion path.
Auto-detects UniProt accession vs gene name and sets correct fields.
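The required-column check and the accession-versus-gene-name detection could be sketched as below. This is an assumption-laden illustration, not the contents of csv_parser.py: the function names and the output keys ending in _kind are hypothetical, and the pattern is the accession regular expression published in the UniProt documentation.

```python
import csv
import io
import re
from typing import Dict, List, Optional

REQUIRED_COLUMNS = {"protein_a", "protein_b"}

# UniProt accession pattern as documented by UniProt.
UNIPROT_ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def classify_identifier(identifier: str) -> str:
    """Return "uniprot" for accession-shaped strings, else "gene_name"."""
    if UNIPROT_ACCESSION.match(identifier.strip()):
        return "uniprot"
    return "gene_name"


def read_interaction_csv(content: str) -> List[Dict[str, Optional[str]]]:
    """Read CSV text and fail early if a required column is absent."""
    reader = csv.DictReader(io.StringIO(content))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required column(s): {sorted(missing)}")
    rows = []
    for row in reader:
        a = (row.get("protein_a") or "").strip()
        b = (row.get("protein_b") or "").strip()
        rows.append({
            "protein_a": a,
            "protein_a_kind": classify_identifier(a),
            "protein_b": b,
            "protein_b_kind": classify_identifier(b),
            # Optional column: empty or absent becomes None.
            "score": (row.get("score") or "").strip() or None,
        })
    return rows


rows = read_interaction_csv("protein_a,protein_b,score\nP04637,MDM2,0.9\n")
```

In this sample, P04637 is classified as a UniProt accession and MDM2 as a gene name.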
validator.py
Per-row validation before DB insertion.
Errors (block insertion):
Warnings (logged, insertion proceeds):
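The split between blocking errors and non-blocking warnings could be modeled roughly as follows. The two rules shown are illustrative assumptions, not the rules implemented in validator.py, and ValidationResult and validate_row are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

MAX_SCORE_LENGTH = 10  # interaction.score is varchar(10) per the PR text


@dataclass
class ValidationResult:
    errors: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        # Only errors block insertion; warnings never do.
        return not self.errors


def validate_row(row: Dict[str, Optional[str]]) -> ValidationResult:
    result = ValidationResult()
    # Assumed error rule: both interactors must be present.
    for key in ("protein_a", "protein_b"):
        if not (row.get(key) or "").strip():
            result.errors.append(f"{key} is missing")
    # Assumed warning rule: an over-long score would not fit the column.
    score = row.get("score")
    if score is not None and len(score) > MAX_SCORE_LENGTH:
        result.warnings.append("score longer than 10 characters")
    return result


good = validate_row({"protein_a": "P04637", "protein_b": "Q00987"})
bad = validate_row({"protein_a": "P04637", "protein_b": ""})
noisy = validate_row(
    {"protein_a": "P04637", "protein_b": "Q00987", "score": "0.123456789012"}
)
```

A caller would insert a row only when result.ok is true and log result.warnings either way.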
tests/
34 tests, all passing.
Covers: field parsing, multi-value fields, quoted PSI-MI terms,
UniProt/Ensembl/gene name extraction, score handling, organism taxid,
interaction category, dataset pubmed mapping, CSV parsing, missing
column detection, validator errors and warnings.
requirements.txt
Added pytest.

Core Design Decision
Both PSI-MI TAB and CSV normalize to the same ParsedInteraction dataclass. Validation and DB insertion are written once regardless of input format. Adding new formats later only requires a new normalizer; the rest of the pipeline is unchanged.
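One way to picture this design is a small registry of normalizers that all return the same type. The sketch below uses toy normalizers and a two-field ParsedInteraction purely for illustration; the registry, the load function, and the field names are hypothetical and do not reflect the actual modules in this PR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ParsedInteraction:
    # Hypothetical minimal fields for illustration only.
    interactor_a: str
    interactor_b: str


Normalizer = Callable[[str], List[ParsedInteraction]]


def normalize_csv(content: str) -> List[ParsedInteraction]:
    # Toy normalizer: one header line, then "a,b" rows.
    lines = [ln for ln in content.splitlines() if ln.strip()]
    return [ParsedInteraction(*ln.split(",")[:2]) for ln in lines[1:]]


def normalize_mitab(content: str) -> List[ParsedInteraction]:
    # Toy normalizer: tab-separated rows; the first two columns hold
    # "db:identifier" for interactors A and B.
    result = []
    for ln in content.splitlines():
        if not ln.strip() or ln.startswith("#"):
            continue
        col_a, col_b = ln.split("\t")[:2]
        result.append(
            ParsedInteraction(col_a.split(":", 1)[-1], col_b.split(":", 1)[-1])
        )
    return result


# Supporting a new format means adding one entry here.
NORMALIZERS: Dict[str, Normalizer] = {
    "csv": normalize_csv,
    "mitab": normalize_mitab,
}


def load(content: str, fmt: str) -> List[ParsedInteraction]:
    try:
        normalizer = NORMALIZERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt}") from None
    return normalizer(content)


from_csv = load("protein_a,protein_b\nP04637,Q00987\n", "csv")
from_mitab = load("uniprotkb:P04637\tuniprotkb:Q00987\n", "mitab")
```

Both calls yield the same list of ParsedInteraction objects, so any downstream validation or insertion code needs to understand only that one type.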
Testing
    cd data-upload
    pip install pytest
    pytest tests/test_parser.py -v