chore: regenerate toxprot demo bundle (ProtT5 + ESM2-650M, mature peptides) #50
Merged
count_h5_rows previously summed len() across all datasets, which returned total residues (or entries × embedding_dim) instead of the number of proteins. Replaced with a single-sample inspection that reports entries, dimension, and dtype. inspect_bundle is a new helper that prints rows/cols/schema and a short preview for each table in a .parquetbundle, plus the settings keys when present. Reuses read_bundle from data.io.bundle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
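The miscount described above can be reproduced without HDF5. The sketch below uses a plain dict as a stand-in for an embeddings file holding one 1-D vector per protein. The dict layout and the name `inspect_embeddings` are illustrative assumptions, not the repository's code.

```python
# Stand-in for an HDF5 embeddings file: one dataset (here, a list) per protein.
# Hypothetical layout, for illustration only.
embeddings = {
    "P0001": [0.1] * 1024,
    "P0002": [0.2] * 1024,
    "P0003": [0.3] * 1024,
}


def count_rows_buggy(store):
    """Old behaviour: sums len() of every dataset (entries x embedding_dim)."""
    return sum(len(vec) for vec in store.values())


def inspect_embeddings(store):
    """Report entries, dimension, and element type from a single sample."""
    entries = len(store)
    if entries == 0:
        return {"entries": 0, "dim": None, "dtype": None}
    sample = next(iter(store.values()))
    return {
        "entries": entries,
        "dim": len(sample),
        "dtype": type(sample[0]).__name__,
    }


buggy_count = count_rows_buggy(embeddings)
summary = inspect_embeddings(embeddings)
print(buggy_count)
print(summary)
```

With three proteins and 1024-dimensional vectors, the summed `len()` reports 3072, while the single-sample inspection reports 3 entries.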
Design for recreating the demo .parquetbundle shipped at protspace_web/app/public/data.parquetbundle with two new behaviours: strip signal peptides before embedding, and add ESMC-300m alongside ProtT5. A standalone scripts/generate_toxprot_demo.py orchestrates fetch → strip → protspace prepare → length+settings post-process. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
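The strip step can be sketched as a single slice, assuming the signal-peptide end is given as a 1-based inclusive position (the UniProt convention). The function name and the example sequence are hypothetical and are not taken from `scripts/generate_toxprot_demo.py`.

```python
def strip_signal_peptide(sequence: str, sp_end: int) -> str:
    """Return the mature peptide, given the 1-based inclusive SP end position."""
    if not 0 <= sp_end < len(sequence):
        raise ValueError(
            f"signal peptide end {sp_end} out of range for length {len(sequence)}"
        )
    # Residues 1..sp_end are the signal peptide; everything after is mature.
    return sequence[sp_end:]


# Hypothetical precursor: 14-residue signal peptide followed by 5 mature residues.
precursor = "MKTLLVLAVCLGAACEGKC"
mature = strip_signal_peptide(precursor, 14)
print(mature, len(mature))
```

The mature length computed this way is what a `length` column should carry once signal peptides are removed.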
Seven-task TDD plan that scaffolds the orchestration script, builds parse_signal_peptides / write_mature_fasta / fetch_toxprot_tsv / postprocess_bundle with unit tests where they make sense, wires up main(), and finishes with a wipe + end-to-end verification step. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The script is dev tooling, not user-facing package functionality; chore: avoids triggering a minor bump from semantic-release. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the uncertainty check ran against the entire Signal peptide field, so a cleanly-bounded SP with a /note or /evidence containing `?`, `<`, or `>` would be incorrectly skipped. Use the regex hit/miss itself as the uncertainty signal: SIGNAL_RE only matches digit bounds, so 0 hits + a SIGNAL keyword in the field == uncertain bounds. Also guard against blank Entry rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
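A minimal sketch of the hit/miss logic, assuming the UniProt TSV field looks like `SIGNAL 1..21; /evidence="..."`. The pattern below is an assumed stand-in for the PR's `SIGNAL_RE`, whose exact definition is not shown, and `classify_signal_field` is a hypothetical helper name.

```python
import re

# Assumed pattern: matches only fully numeric bounds such as "SIGNAL 1..21".
SIGNAL_RE = re.compile(r"SIGNAL\s+(\d+)\.\.(\d+)")


def classify_signal_field(field: str):
    """Return (status, sp_end) where status is 'ok', 'uncertain', or 'none'."""
    hits = SIGNAL_RE.findall(field)
    if hits:
        # Digit bounds matched; qualifiers such as /note are ignored.
        return "ok", int(hits[0][1])
    if "SIGNAL" in field:
        # Keyword present but no numeric bounds (e.g. "1..?" or "<1..21").
        return "uncertain", None
    return "none", None


clean_with_noisy_note = 'SIGNAL 1..21; /note="Or 19?"; /evidence="ECO:0000255"'
unknown_end = 'SIGNAL 1..?; /evidence="ECO:0000255"'
fuzzy_start = 'SIGNAL <1..21; /evidence="ECO:0000255"'

print(classify_signal_field(clean_with_noisy_note))
print(classify_signal_field(unknown_end))
print(classify_signal_field(fuzzy_start))
print(classify_signal_field(""))
```

The first example is the case the fix targets: the bounds are clean, so the `?` inside `/note` no longer causes the entry to be skipped.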
Address code-review nits on Task 4:
- Document that the cache key is out_path only.
- Use splitlines() instead of count("\n") so the empty-payload guard
doesn't fire spuriously if UniProt ever returns the data row without
a trailing newline.
- Pass encoding="utf-8" explicitly to write_text for symmetry with the
decode step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
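The difference between the two guards shows up when the last row lacks a trailing newline. The sketch below assumes the guard requires a header plus at least one data row; the payloads and function names are made up for illustration.

```python
# Hypothetical TSV payloads: a header plus one data row.
payload_with_newline = "Entry\tSequence\nP01234\tMKT\n"
payload_without_newline = "Entry\tSequence\nP01234\tMKT"
header_only = "Entry\tSequence\n"


def has_data_rows_count(text: str) -> bool:
    """Old guard: counts newline characters, so it depends on a trailing newline."""
    return text.count("\n") >= 2


def has_data_rows_splitlines(text: str) -> bool:
    """New guard: counts lines, independent of a trailing newline."""
    return len(text.splitlines()) >= 2


for payload in (payload_with_newline, payload_without_newline, header_only):
    print(has_data_rows_count(payload), has_data_rows_splitlines(payload))
```

Only the `splitlines()` version accepts the payload without a trailing newline while still rejecting the header-only response.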
Address code-review nits on Task 5:
- Comment why we map mature lengths by protein_id rather than zipping positionally: the prepare pipeline can reorder rows during EmbeddingSet merging and dedup, so positional mapping would silently corrupt lengths.
- Enrich the missing-key error to include the bundle filename, the size of the mature_lengths map, and the first 5 missing IDs, which makes the live-run debug path much shorter.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
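A sketch of keyed mapping with an enriched error, using plain dicts in place of the bundle's table. The function name, row layout, and message wording are assumptions for illustration, not the PR's `postprocess_bundle` code.

```python
def attach_mature_lengths(rows, mature_lengths, bundle_name):
    """Set each row's length by protein_id lookup, never by position."""
    missing = [r["protein_id"] for r in rows if r["protein_id"] not in mature_lengths]
    if missing:
        raise KeyError(
            f"{bundle_name}: {len(missing)} protein IDs missing from mature_lengths "
            f"(map size {len(mature_lengths)}); first 5: {missing[:5]}"
        )
    return [{**r, "length": mature_lengths[r["protein_id"]]} for r in rows]


# Lengths recorded in FASTA order.
mature_lengths = {"P1": 40, "P2": 55, "P3": 62}
# Rows come back in a different order, as a merge or dedup step might produce.
rows = [{"protein_id": "P3"}, {"protein_id": "P1"}, {"protein_id": "P2"}]

fixed = attach_mature_lengths(rows, mature_lengths, "data.parquetbundle")
print(fixed)
```

Zipping `rows` with `mature_lengths.values()` here would assign 40 to P3; the keyed lookup keeps each length with its protein regardless of row order.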
Apply code-review feedback on Task 6:
- Use protspace.cli.app.setup_logging instead of logging.basicConfig. This caps urllib3/requests at WARNING (else they spam DEBUG with -v) and routes through the tqdm-aware handler so subprocess progress bars don't get garbled.
- Switch -v to action="count" for parity with `protspace prepare`'s verbosity convention; default behaviour is unchanged (INFO).
- Use shlex.join when logging the prepare invocation so the printed command is copy-paste safe (METHODS contains a `;`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
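The `action="count"` and `shlex.join` points can be illustrated with the standard library alone. The command below is a hypothetical placeholder (`example-cli` and its `--methods` flag are not the real `protspace prepare` arguments); it only needs to contain a `;` to show the quoting.

```python
import argparse
import shlex


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="verbosity sketch")
    # Each -v increments the counter: no flag -> 0, -v -> 1, -vv -> 2.
    parser.add_argument("-v", "--verbose", action="count", default=0)
    return parser


args = build_parser().parse_args(["-vv"])

# Hypothetical command; the value contains ';', which a shell treats
# as a command separator unless it is quoted.
cmd = ["example-cli", "run", "--methods", "umap2;pca2"]

naive = " ".join(cmd)       # unsafe to paste into a shell
printable = shlex.join(cmd)  # quotes the argument that needs it

print(args.verbose)
print(naive)
print(printable)
```

Pasting the naive string into a shell would run `pca2` as a second command, while the `shlex.join` output round-trips through `shlex.split` to the original argument list.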
- Use prot_t5 + esm2_650m as the two embedders (was prot_t5 + esmc_300m).
- Trim the bundle to the original demo's 18 annotation columns: drop signal_peptide (sequence was stripped in this pipeline so the SP annotation no longer applies) plus the InterPro/taxonomy auxiliaries brought in by the `interpro` and `taxonomy` annotation groups.
- Reorder columns so protein_families is the first non-id column, because the web app picks the first non-id column as the default annotation. Bundle also keeps ProtT5 — UMAP 2 as the first projection so it loads by default.
- Recompute the manual top-9 + __NA__ categories for pfam, ec, superfamily, and cath from the new dataset (split on `;`, drop the trailing `|score` / `|EVIDENCE`); preserve the hand-curated protein_families styling byte-for-byte from the existing web bundle.
- Drop stale tracked data/toxins legacy artifacts left over from the pre-regeneration layout.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
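The category recomputation (split on `;`, drop the trailing `|...` part, keep the most frequent labels plus `__NA__`) can be sketched as below. The function name, the sample values, and the choice to append `__NA__` last are assumptions for illustration, not the script's actual code.

```python
from collections import Counter

NA = "__NA__"


def top_categories(values, n=9):
    """Return the n most frequent labels, followed by the NA bucket."""
    counts = Counter()
    for raw in values:
        if not raw:
            continue  # empty cells fall into the NA bucket
        for item in raw.split(";"):
            # "PF00014|1e-5" -> "PF00014": drop only the trailing |suffix.
            label = item.rsplit("|", 1)[0].strip()
            if label:
                counts[label] += 1
    return [label for label, _ in counts.most_common(n)] + [NA]


# Made-up annotation cells in "label|score;label|score" form.
pfam_column = [
    "PF00014|1e-5;PF00087|2e-3",
    "PF00014|3e-7",
    "",
    "PF00087|1e-2;PF00014|4e-4",
    "PF07740|5e-9",
]

categories = top_categories(pfam_column, n=2)
print(categories)
```

With `n=9` the same function yields at most ten entries, matching the "top-9 + `__NA__`" layout described above.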
Add the regenerated 7,831-protein toxprot demo bundle (ProtT5 + ESM2-650M, mature peptides) to the repo. data/toxins/ is whitelisted in .gitignore for exactly this purpose, matching the precedent of the legacy bundle files we deleted in the previous commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
martinpycha pushed a commit to d0rr4/PP1 that referenced this pull request, May 20, 2026: …demo chore: regenerate toxprot demo bundle (ProtT5 + ESM2-650M, mature peptides)
Summary
Adds `scripts/generate_toxprot_demo.py` to regenerate the toxprot web demo bundle end-to-end: fetch UniProt sequences + signal-peptide positions, strip SPs, embed mature peptides with ProtT5 + ESM2-650M, run UMAP/PCA + annotations via `protspace prepare`, then post-process to match the original demo's column layout and styling.
- Dropped `signal_peptide` + InterPro/taxonomy auxiliaries (`signal_peptide` is misleading after SP cleavage).
- `length` is the mature length.
- `ProtT5 — UMAP 2` and `protein_families` are picked by the web app on load.
- `pfam`/`ec`/`superfamily`/`cath` categories recomputed from the new dataset (Kelly's palette, `__NA__` at zOrder 9).
- `protein_families` styling preserved byte-for-byte from the existing web bundle.
- `data/toxins/data.parquetbundle`: 7,831 proteins, 4 projections, ~12 MB. Copied to `protspace_web/app/public/` manually.
- `chore:` prefix: this is dev-only tooling and shouldn't trigger a semantic-release minor bump.
Test plan
- `uv run pytest tests/test_toxprot_demo.py -v`: 9/9 passing
- `uv run ruff check scripts/ tests/`: clean
- `protein_families` settings match original byte-for-byte
- `projections_metadata`
- `protein_families` is first non-id annotation column
- `protspace_web/app/public/`

🤖 Generated with Claude Code