chore: regenerate toxprot demo bundle (ProtT5 + ESM2-650M, mature peptides) by tsenoner · Pull Request #50 · tsenoner/protspace

tsenoner · 2026-05-02T08:22:23Z

Summary

Adds scripts/generate_toxprot_demo.py to regenerate the toxprot web demo bundle end-to-end: fetch UniProt sequences + signal-peptide positions, strip SPs, embed mature peptides with ProtT5 + ESM2-650M, run UMAP/PCA + annotations via protspace prepare, then post-process to match the original demo's column layout and styling.

Annotations: trimmed to the original 18-column layout (drops signal_peptide + InterPro/taxonomy auxiliaries — signal_peptide is misleading after SP cleavage). length is the mature length.
Defaults: column + projection ordering ensures ProtT5 — UMAP 2 and protein_families are picked by the web app on load.
Styling: top-9 categories for pfam / ec / superfamily / cath recomputed from the new dataset (Kelly's palette, __NA__ at zOrder 9). protein_families styling preserved byte-for-byte from the existing web bundle.
Output: data/toxins/data.parquetbundle — 7,831 proteins, 4 projections, ~12 MB. Copied to protspace_web/app/public/ manually.
All commits use chore: prefix — this is dev-only tooling and shouldn't trigger a semantic-release minor bump.
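The signal-peptide stripping step from the summary can be sketched as follows. This is a minimal illustration, not the code in scripts/generate_toxprot_demo.py: the function name `cleave_signal_peptide` is hypothetical, and it assumes the signal peptide always starts at residue 1 and that its end position is 1-based and inclusive, as in UniProt feature coordinates.

```python
from typing import Optional


def cleave_signal_peptide(sequence: str, sp_end: Optional[int]) -> str:
    """Return the mature peptide, i.e. the residues after the signal peptide.

    sp_end is the 1-based, inclusive end position of the signal peptide,
    or None when no cleanly bounded signal peptide is annotated.
    (Hypothetical helper; the real script may structure this differently.)
    """
    if sp_end is None:
        # No usable SP annotation: keep the full sequence.
        return sequence
    if not 0 < sp_end < len(sequence):
        # An SP that covers the whole sequence (or lies outside it) is suspicious.
        raise ValueError(f"signal peptide end {sp_end} out of range for length {len(sequence)}")
    # Python slices are 0-based, so slicing at sp_end drops residues 1..sp_end.
    return sequence[sp_end:]


mature = cleave_signal_peptide("MKTLLVAAGC", 3)
mature_length = len(mature)  # this is the value a `length` column would carry
```

With this convention the `length` annotation is simply `len()` of the cleaved sequence rather than the precursor length reported by UniProt.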

Test plan

uv run pytest tests/test_toxprot_demo.py -v — 9/9 passing
uv run ruff check scripts/ tests/ — clean
End-to-end run completed: 7,831 proteins, 4,432 SPs cleaved, 4 projections, 18 annotation columns
Verified protein_families settings match original byte-for-byte
Verified ProtT5 — UMAP 2 is row 0 in projections_metadata
Verified protein_families is first non-id annotation column
Spot-check the visualization at protspace.app once the bundle is copied to protspace_web/app/public/

🤖 Generated with Claude Code

count_h5_rows previously summed len() across all datasets, which returned total residues (or entries × embedding_dim) instead of the number of proteins. Replaced with a single-sample inspection that reports entries, dimension, and dtype.

inspect_bundle is a new helper that prints rows/cols/schema and a short preview for each table in a .parquetbundle, plus the settings keys when present. Reuses read_bundle from data.io.bundle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Design for recreating the demo .parquetbundle shipped at protspace_web/app/public/data.parquetbundle with two new behaviours: strip signal peptides before embedding, and add ESMC-300m alongside ProtT5. A standalone scripts/generate_toxprot_demo.py orchestrates fetch → strip → protspace prepare → length+settings post-process. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Seven-task TDD plan that scaffolds the orchestration script, builds parse_signal_peptides / write_mature_fasta / fetch_toxprot_tsv / postprocess_bundle with unit tests where they make sense, wires up main(), and finishes with a wipe + end-to-end verification step. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The script is dev tooling, not user-facing package functionality; chore: avoids triggering a minor bump from semantic-release. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Previously the uncertainty check ran against the entire Signal peptide field, so a cleanly-bounded SP with a /note or /evidence containing `?`, `<`, or `>` would be incorrectly skipped. Use the regex hit/miss itself as the uncertainty signal: SIGNAL_RE only matches digit bounds, so 0 hits + a SIGNAL keyword in the field == uncertain bounds. Also guard against blank Entry rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
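The hit/miss logic described in this commit can be sketched as below. The name SIGNAL_RE comes from the commit message, but the pattern itself, the helper name `parse_sp_end`, and the example field strings are assumptions made for illustration; the actual regex in the script may differ.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: matches only fully numeric bounds such as "SIGNAL 1..22".
# Bounds like "1..?" or "<1..22" deliberately do not match.
SIGNAL_RE = re.compile(r"SIGNAL\s+(\d+)\.\.(\d+)")


def parse_sp_end(field: Optional[str]) -> Tuple[Optional[int], bool]:
    """Return (sp_end, uncertain) for one UniProt "Signal peptide" field.

    sp_end is the 1-based end of the signal peptide when the bounds are
    clean digits, else None. uncertain is True when a SIGNAL keyword is
    present but the regex found no digit bounds.
    """
    text = (field or "").strip()
    match = SIGNAL_RE.search(text)
    if match:
        # Clean bounds: a "?" inside /note or /evidence is irrelevant here.
        return int(match.group(2)), False
    # Zero hits plus a SIGNAL keyword means the bounds themselves are uncertain.
    return None, "SIGNAL" in text


clean = parse_sp_end('SIGNAL 1..22; /note="Or 24?"; /evidence="ECO:0000255"')
fuzzy = parse_sp_end('SIGNAL 1..?; /evidence="ECO:0000255"')
```

Checking the match result rather than scanning the whole field for `?`, `<`, or `>` keeps qualifier text from affecting the decision.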


Address code-review nits on Task 4:
- Document that the cache key is out_path only.
- Use splitlines() instead of count("\n") so the empty-payload guard doesn't fire spuriously if UniProt ever returns the data row without a trailing newline.
- Pass encoding="utf-8" explicitly to write_text for symmetry with the decode step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
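To illustrate the splitlines() point, here is a small sketch of the two ways an empty-payload guard could be written. The function names and the "header plus at least one data row" threshold are assumptions; only the splitlines() versus count("\n") distinction comes from the commit message.

```python
def has_data_rows_by_count(payload: str) -> bool:
    # Fragile variant: a header plus one data row WITHOUT a trailing newline
    # contains only a single "\n", so this wrongly reports "no data".
    return payload.count("\n") >= 2


def has_data_rows(payload: str) -> bool:
    # Robust variant: splitlines() yields one item per line whether or not
    # the final line ends with a newline.
    return len(payload.splitlines()) >= 2


no_trailing_newline = "Entry\tSequence\nP12345\tMKT"
with_trailing_newline = no_trailing_newline + "\n"
header_only = "Entry\tSequence\n"
```

Both variants agree on `with_trailing_newline` and `header_only`; they differ only on `no_trailing_newline`, which is exactly the case the commit guards against.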


Address code-review nits on Task 5:
- Comment why we map mature lengths by protein_id rather than zipping positionally: the prepare pipeline can reorder rows during EmbeddingSet merging and dedup, so positional mapping would silently corrupt lengths.
- Enrich the missing-key error to include the bundle filename, the size of the mature_lengths map, and the first 5 missing IDs, which makes the live-run debug path much shorter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
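A minimal sketch of keyed (rather than positional) length mapping with an enriched error, using plain Python containers. The function name `attach_mature_lengths` and the exact error wording are hypothetical; the real postprocess_bundle works on the bundle's annotation table.

```python
from typing import Dict, List


def attach_mature_lengths(
    protein_ids: List[str],
    mature_lengths: Dict[str, int],
    bundle_name: str,
) -> List[int]:
    """Look up each row's mature length by protein_id.

    Keyed lookup stays correct even if the rows were reordered or
    deduplicated upstream; zip() over two lists would not.
    """
    missing = [pid for pid in protein_ids if pid not in mature_lengths]
    if missing:
        # Include enough context to debug a live run without re-running it.
        raise KeyError(
            f"{bundle_name}: {len(missing)} protein_id(s) missing from "
            f"mature_lengths (map size {len(mature_lengths)}); "
            f"first 5: {missing[:5]}"
        )
    return [mature_lengths[pid] for pid in protein_ids]


lengths = {"P1": 50, "P2": 70}
reordered = attach_mature_lengths(["P2", "P1"], lengths, "data.parquetbundle")
```

Here the rows arrive as P2, P1 while the map was built as P1, P2; the keyed lookup still assigns 70 and 50 to the right proteins.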


Apply code-review feedback on Task 6:
- Use protspace.cli.app.setup_logging instead of logging.basicConfig. This caps urllib3/requests at WARNING (else they spam DEBUG with -v) and routes through the tqdm-aware handler so subprocess progress bars don't get garbled.
- Switch -v to action="count" for parity with `protspace prepare`'s verbosity convention; default behaviour is unchanged (INFO).
- Use shlex.join when logging the prepare invocation so the printed command is copy-paste safe (METHODS contains a `;`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
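The last two points can be sketched with the standard library alone. This does not reproduce protspace.cli.app.setup_logging; the verbosity-to-level mapping is an assumption, and `some-cli` with its `--methods` flag is a stand-in command, not the real `protspace prepare` interface.

```python
import argparse
import logging
import shlex


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="verbosity sketch")
    # action="count" turns -v, -vv, ... into 1, 2, ...; default=0 keeps
    # the no-flag behaviour explicit instead of None.
    parser.add_argument("-v", "--verbose", action="count", default=0)
    return parser


def log_level(verbosity: int) -> int:
    # Assumed mapping: INFO by default, DEBUG once -v is given.
    return logging.DEBUG if verbosity >= 1 else logging.INFO


# Stand-in command; an argument containing ";" would be split by a shell
# if the command were printed with a plain " ".join().
cmd = ["some-cli", "prepare", "--methods", "pca2;umap2"]
printable = shlex.join(cmd)  # quotes only the arguments that need it
```

Logging `printable` instead of `" ".join(cmd)` means the line can be pasted into a shell without the `;` terminating the command early.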

- Use prot_t5 + esm2_650m as the two embedders (was prot_t5 + esmc_300m).
- Trim the bundle to the original demo's 18 annotation columns: drop signal_peptide (sequence was stripped in this pipeline so the SP annotation no longer applies) plus the InterPro/taxonomy auxiliaries brought in by the `interpro` and `taxonomy` annotation groups.
- Reorder columns so protein_families is the first non-id column, since the web app picks the first non-id column as the default annotation. Bundle also keeps ProtT5 — UMAP 2 as the first projection so it loads by default.
- Recompute the manual top-9 + __NA__ categories for pfam, ec, superfamily, and cath from the new dataset (split on `;`, drop the trailing `|score` / `|EVIDENCE`); preserve the hand-curated protein_families styling byte-for-byte from the existing web bundle.
- Drop stale tracked data/toxins legacy artifacts left over from the pre-regeneration layout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
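The top-9 recomputation can be sketched as a simple frequency count. The function name `top_categories`, the sample values, and the choice to count every `;`-separated item are assumptions; how empty rows map to __NA__ and how colours are assigned is left out.

```python
from collections import Counter
from typing import Iterable, List, Optional


def top_categories(values: Iterable[Optional[str]], n: int = 9) -> List[str]:
    """Return the n most frequent labels in a multi-valued annotation column.

    Each cell may hold several items separated by ";", and each item may
    carry a trailing "|score" or "|EVIDENCE" suffix that is dropped.
    """
    counts: Counter = Counter()
    for value in values:
        if not value:
            # Missing or empty cells contribute no label.
            continue
        for item in value.split(";"):
            # rsplit drops only the last "|..." segment of the item.
            label = item.rsplit("|", 1)[0].strip()
            if label:
                counts[label] += 1
    return [label for label, _ in counts.most_common(n)]


# Made-up example cells, not taken from the real dataset.
column = ["PF00014|12.3;PF00087|8.1", "PF00014|9.0", None, ""]
top = top_categories(column)
```

Calling `top_categories(column, n=9)` on a real column would give the labels that receive explicit styling, with everything else falling back to the default.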

Add the regenerated 7,831-protein toxprot demo bundle (ProtT5 + ESM2-650M, mature peptides) to the repo. data/toxins/ is whitelisted in .gitignore for exactly this purpose, matching the precedent of the legacy bundle files we deleted in the previous commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


tsenoner and others added 16 commits April 30, 2026 18:25

docs(plan): use chore: prefix for toxprot demo commits

4fdfb49

The script is dev tooling, not user-facing package functionality; chore: avoids triggering a minor bump from semantic-release. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore(scripts): scaffold generate_toxprot_demo

8d5ca82

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore(toxprot-demo): parse signal peptides from UniProt TSV

c32396a

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore(toxprot-demo): write mature FASTA with SPs cleaved

a411e84

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore(toxprot-demo): stream UniProt TSV with sequence + signal_peptide

6d7e2b1

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore(toxprot-demo): post-process bundle with mature length + settings

d5f56e8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore(toxprot-demo): wire main orchestration

90354a1

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tsenoner merged commit db12e33 into main May 2, 2026
4 checks passed

tsenoner deleted the feat/regenerate-toxprot-demo branch May 2, 2026 08:30

martinpycha pushed a commit to d0rr4/PP1 that referenced this pull request May 20, 2026

Merge pull request tsenoner#50 from tsenoner/feat/regenerate-toxprot-…

47bf5c1

…demo chore: regenerate toxprot demo bundle (ProtT5 + ESM2-650M, mature peptides)

tsenoner merged 16 commits into main from feat/regenerate-toxprot-demo
