[GSoC 2026] Add Python PSI-MI TAB v2.7 parser, CSV support and validation layer for openPIP 2.0 #97

Open
Abhishek-Kumar-Rai5 wants to merge 1 commit into BaderLab:master from Abhishek-Kumar-Rai5:python-mitab-parser

@Abhishek-Kumar-Rai5

Companion to issue #96.

The existing data-upload/uploader.py is a Selenium script that
submits files through the live website UI; it contains no parsing
logic, and requirements.txt listed only selenium. This PR adds the
missing parsing, validation, and data-modeling layer to data-upload/.


Files Added

models.py

Python dataclasses mapped field-by-field to the openpip.sql schema.

Dataclass             DB table
Protein               protein
Organism              organism
Dataset               dataset
InteractionCategory   interaction_category
ParsedInteraction     interaction + junction tables

Every field name matches the actual DB column name from openpip.sql.
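
As a rough illustration of the shape these dataclasses might take, here is a minimal sketch. Only `score`, `pubmed_id`, and `category_name` are column names mentioned in this PR description; every other field name below is an assumption made for the example and is not copied from openpip.sql.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Protein:
    # Hypothetical column names; the real ones are defined in openpip.sql.
    uniprot_id: Optional[str] = None
    ensembl_id: Optional[str] = None
    entrez_id: Optional[str] = None
    gene_name: Optional[str] = None


@dataclass
class Organism:
    taxon_id: Optional[str] = None       # assumed name
    common_name: Optional[str] = None    # assumed name


@dataclass
class Dataset:
    pubmed_id: Optional[str] = None      # column named in the PR description


@dataclass
class InteractionCategory:
    category_name: Optional[str] = None  # column named in the PR description


@dataclass
class ParsedInteraction:
    # One parsed row: two interactors plus the records that end up in
    # the interaction table and its junction tables.
    interactor_a: Protein
    interactor_b: Protein
    organism_a: Optional[Organism] = None
    organism_b: Optional[Organism] = None
    score: Optional[str] = None          # kept as text; the schema uses varchar(10)
    dataset: Optional[Dataset] = None
    categories: List[InteractionCategory] = field(default_factory=list)
```

Using `field(default_factory=list)` gives each instance its own list, so appending a category to one interaction does not leak into another.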

mitab_parser.py

Full PSI-MI TAB v2.7 parser. Output maps directly to models.py.

  • All 42 PSI-MI TAB v2.7 columns handled
  • db:value(description) format with pipe-separated multiples
  • Quoted PSI-MI ontology terms e.g. psi-mi:"MI:0018"(two hybrid)
  • Files with fewer than 42 columns padded gracefully
  • UniProt, Ensembl, Entrez → correct protein table columns
  • Gene name from alias field value where description = "gene name"
  • Taxon ID + common name → organism table
  • Score → interaction.score, limited to 10 characters to fit the schema's varchar(10)
  • PubMed ID → dataset.pubmed_id
  • Interaction type → interaction_category.category_name
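
The `db:value(description)` handling listed above can be sketched as follows. This is an illustrative example, not the code in mitab_parser.py: `MitabValue`, `parse_field`, and the regular expression are hypothetical names, and the sketch assumes that quoted values never contain a pipe character.

```python
import re
from typing import List, NamedTuple, Optional


class MitabValue(NamedTuple):
    db: str
    value: str
    description: Optional[str]


# db:value(description); the value may be double-quoted, as in
# psi-mi:"MI:0018"(two hybrid), and the description is optional.
_FIELD_RE = re.compile(
    r'^(?P<db>[^:"]+):'
    r'(?:"(?P<quoted>[^"]*)"|(?P<plain>[^(|]*))'
    r'(?:\((?P<desc>.*)\))?$'
)


def parse_field(raw: Optional[str]) -> List[MitabValue]:
    """Split one MITAB cell into its pipe-separated db:value(description) parts."""
    if raw is None or raw.strip() in ("", "-"):
        return []  # "-" marks an empty cell in PSI-MI TAB
    values = []
    # Simplification: a naive split would break on pipes inside quotes.
    for part in raw.split("|"):
        match = _FIELD_RE.match(part.strip())
        if match is None:
            continue  # skip parts that do not follow the expected shape
        value = match.group("quoted")
        if value is None:
            value = match.group("plain")
        desc = match.group("desc")
        if desc is not None:
            desc = desc.strip('"')
        values.append(MitabValue(match.group("db"), value, desc))
    return values
```

With a helper like this, the gene name lookup becomes a search for the first alias whose description equals "gene name".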

Two entry points:

  • parse_mitab27(filepath) — reads from disk
  • parse_mitab27_from_string(content) — reads raw string content,
    for FastAPI endpoints receiving uploaded files in memory
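
One way to arrange the two entry points is to have the file-based function delegate to the string-based one, so the parsing logic lives in a single place. The sketch below reuses the two function names from this PR, but the bodies are assumptions: it returns plain column lists rather than ParsedInteraction objects, and `MITAB27_COLUMNS` is a hypothetical constant. It also shows how short rows could be padded to 42 columns.

```python
from typing import List

MITAB27_COLUMNS = 42  # hypothetical constant for the v2.7 column count


def parse_mitab27_from_string(content: str) -> List[List[str]]:
    """Parse raw MITAB text into rows of exactly 42 columns."""
    rows = []
    for line in content.splitlines():
        # Skip blank lines and the "#"-prefixed header line.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Pad short rows with "-" (the MITAB empty marker); a negative
        # multiplier yields an empty list, so long rows are unaffected here.
        cols += ["-"] * (MITAB27_COLUMNS - len(cols))
        rows.append(cols[:MITAB27_COLUMNS])
    return rows


def parse_mitab27(filepath: str) -> List[List[str]]:
    """Read a MITAB file from disk and delegate to the string parser."""
    with open(filepath, encoding="utf-8") as handle:
        return parse_mitab27_from_string(handle.read())
```

In a FastAPI handler, the bytes returned by the uploaded file's `read()` would need to be decoded to `str` before being passed to the string-based entry point.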

csv_parser.py

New format support per the GSoC project goals. Minimum required
columns: protein_a, protein_b. Optional: interaction_type,
score, publication, author, dataset, year.

Rows are normalized to the same ParsedInteraction model as the
PSI-MI TAB parser, so both formats feed the same validation and
insertion path. Each identifier is auto-detected as either a UniProt
accession or a gene name and stored in the corresponding field.
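
The required-column check and the accession detection could look roughly like this. The function names and the returned dictionary keys are hypothetical, and the sketch returns plain dictionaries instead of ParsedInteraction objects. The regular expression is the accession pattern published in the UniProt documentation; it remains a heuristic, because a gene symbol could in principle match it.

```python
import csv
import io
import re
from typing import Dict, List

REQUIRED_COLUMNS = {"protein_a", "protein_b"}

# UniProt accession pattern as published in the UniProt documentation.
UNIPROT_RE = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def classify_identifier(value: str) -> Dict[str, str]:
    """Return the identifier under the key that matches its detected type."""
    value = value.strip()
    if UNIPROT_RE.match(value):
        return {"uniprot_id": value}
    return {"gene_name": value}


def parse_csv_string(content: str) -> List[dict]:
    """Parse CSV text, failing early if a required column is missing."""
    reader = csv.DictReader(io.StringIO(content))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(
            "Missing required column(s): " + ", ".join(sorted(missing))
        )
    rows = []
    for record in reader:
        rows.append({
            "a": classify_identifier(record["protein_a"]),
            "b": classify_identifier(record["protein_b"]),
            # Optional column: absent or empty values become None.
            "score": (record.get("score") or "").strip() or None,
        })
    return rows
```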

validator.py

Per-row validation before DB insertion.

Errors (block insertion):

  • Interactor A or B has no identifier of any kind

Warnings (logged, insertion proceeds):

  • No UniProt ID — UniProt REST annotation will be skipped
  • Score present but not numeric
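
The split between blocking errors and non-blocking warnings can be sketched as below. `ValidationResult`, `validate_row`, and the dictionary-based interactor representation are assumptions made for the example; validator.py may structure this differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ValidationResult:
    errors: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        # Only errors block insertion; warnings are just logged.
        return not self.errors


def _is_numeric(text: str) -> bool:
    try:
        float(text)
    except ValueError:
        return False
    return True


def validate_row(
    interactor_a: Dict[str, Optional[str]],
    interactor_b: Dict[str, Optional[str]],
    score: Optional[str] = None,
) -> ValidationResult:
    """Check one parsed row before it is inserted into the database."""
    result = ValidationResult()
    for label, ids in (("A", interactor_a), ("B", interactor_b)):
        if not any(ids.values()):
            result.errors.append(
                f"Interactor {label} has no identifier of any kind"
            )
        elif not ids.get("uniprot_id"):
            result.warnings.append(
                f"Interactor {label} has no UniProt ID; "
                "UniProt REST annotation will be skipped"
            )
    if score not in (None, "", "-") and not _is_numeric(score):
        result.warnings.append(f"Score {score!r} is not numeric")
    return result
```

A caller would insert the row only when `result.ok` is true and log `result.warnings` either way. Note that `float()` also accepts strings such as "nan" and "inf", which a stricter check might reject.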

tests/

34 tests, all passing.

pytest tests/test_parser.py -v
# 34 passed in 0.67s

Covers: field parsing, multi-value fields, quoted PSI-MI terms,
UniProt/Ensembl/gene name extraction, score handling, organism taxid,
interaction category, dataset pubmed mapping, CSV parsing, missing
column detection, validator errors and warnings.

requirements.txt

Added pytest.


Core Design Decision

Both PSI-MI TAB and CSV normalize to the same ParsedInteraction
dataclass. Validation and DB insertion are written once regardless
of input format. Adding new formats later only requires a new
normalizer — the rest of the pipeline is unchanged.
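
To illustrate the idea, the sketch below registers one normalizer per format and runs a single shared step afterwards. All names here (`NORMALIZERS`, `register`, `ingest`) are hypothetical, the normalizers are deliberately simplified stand-ins for the real parsers, and rows are plain dictionaries rather than ParsedInteraction objects.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Normalizer = Callable[[str], List[Row]]

# Registry of format name -> normalizer function.
NORMALIZERS: Dict[str, Normalizer] = {}


def register(fmt: str) -> Callable[[Normalizer], Normalizer]:
    """Decorator that adds a normalizer to the registry under `fmt`."""
    def decorator(func: Normalizer) -> Normalizer:
        NORMALIZERS[fmt] = func
        return func
    return decorator


@register("mitab")
def normalize_mitab(content: str) -> List[Row]:
    rows = []
    for line in content.splitlines():
        if line.strip() and not line.startswith("#"):
            cols = line.split("\t")
            rows.append({"a": cols[0], "b": cols[1] if len(cols) > 1 else ""})
    return rows


@register("csv")
def normalize_csv(content: str) -> List[Row]:
    lines = [ln for ln in content.splitlines() if ln.strip()]
    rows = []
    for line in lines[1:]:  # skip the header row
        cols = line.split(",")
        rows.append({"a": cols[0], "b": cols[1] if len(cols) > 1 else ""})
    return rows


def ingest(fmt: str, content: str) -> List[Row]:
    """Normalize with the format-specific function, then run the shared step."""
    try:
        normalizer = NORMALIZERS[fmt]
    except KeyError:
        raise ValueError(f"Unsupported format: {fmt}") from None
    rows = normalizer(content)
    # Shared step, written once: keep only rows with both interactors.
    return [row for row in rows if row["a"] and row["b"]]
```

Supporting another format would then amount to writing one more function decorated with `@register(...)`; `ingest` itself would not change.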


Testing

cd data-upload
pip install pytest
pytest tests/test_parser.py -v

- mitab_parser.py: full PSI-MI TAB v2.7 parser mapping to openpip.sql schema
- csv_parser.py: new CSV format support normalizing to same ParsedInteraction model
- models.py: Python dataclasses mirroring protein, interaction, dataset, organism tables
- validator.py: per-row validation with errors and warnings before DB insertion
- tests/: 34 passing tests covering all parsers and validator
- Updated requirements.txt with pytest
@Abhishek-Kumar-Rai5

@MoHelmy @gbader Please review this PR and share your suggestions and feedback. Any further guidance would be highly appreciated.
