chore: regenerate toxprot demo bundle (ProtT5 + ESM2-650M, mature peptides)#50

Merged
tsenoner merged 16 commits into main from feat/regenerate-toxprot-demo
May 2, 2026

@tsenoner tsenoner commented May 2, 2026

Summary

Adds scripts/generate_toxprot_demo.py to regenerate the toxprot web demo bundle end-to-end: fetch UniProt sequences + signal-peptide positions, strip SPs, embed mature peptides with ProtT5 + ESM2-650M, run UMAP/PCA + annotations via protspace prepare, then post-process to match the original demo's column layout and styling.

  • Annotations: trimmed to the original 18-column layout (drops signal_peptide + InterPro/taxonomy auxiliaries — signal_peptide is misleading after SP cleavage). length is the mature length.
  • Defaults: column + projection ordering ensures ProtT5 — UMAP 2 and protein_families are picked by the web app on load.
  • Styling: top-9 categories for pfam / ec / superfamily / cath recomputed from the new dataset (Kelly's palette, __NA__ at zOrder 9). protein_families styling preserved byte-for-byte from the existing web bundle.
  • Output: data/toxins/data.parquetbundle — 7,831 proteins, 4 projections, ~12 MB. Copied to protspace_web/app/public/ manually.
  • All commits use chore: prefix — this is dev-only tooling and shouldn't trigger a semantic-release minor bump.
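
The signal-peptide stripping step in the summary can be sketched as follows. This is a minimal illustration, not the code in scripts/generate_toxprot_demo.py: the function name `strip_signal_peptide`, the toy sequence, and the assumption that the SP end is a 1-based inclusive position (the UniProt convention) are all hypothetical.

```python
from typing import Optional


def strip_signal_peptide(sequence: str, sp_end: Optional[int]) -> str:
    """Return the mature peptide, i.e. the residues after the signal peptide.

    sp_end is the 1-based, inclusive position of the last SP residue
    (hypothetical input; None means there is no usable SP annotation).
    """
    if sp_end is None:
        return sequence
    if not 0 < sp_end < len(sequence):
        raise ValueError(f"SP end {sp_end} out of range for length {len(sequence)}")
    return sequence[sp_end:]


precursor = "MKTLLVAAVLLGSAQACDE"  # toy sequence, not a real toxin
mature = strip_signal_peptide(precursor, sp_end=15)
mature_length = len(mature)  # the value the `length` column would carry
```

Under these assumptions, the `length` annotation is the length of the string that gets embedded, which is why it differs from the precursor length for the cleaved entries.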

Test plan

  • uv run pytest tests/test_toxprot_demo.py -v — 9/9 passing
  • uv run ruff check scripts/ tests/ — clean
  • End-to-end run completed: 7,831 proteins, 4,432 SPs cleaved, 4 projections, 18 annotation columns
  • Verified protein_families settings match original byte-for-byte
  • Verified ProtT5 — UMAP 2 is row 0 in projections_metadata
  • Verified protein_families is first non-id annotation column
  • Spot-check the visualization at protspace.app once the bundle is copied to protspace_web/app/public/

🤖 Generated with Claude Code

tsenoner and others added 16 commits April 30, 2026 18:25
count_h5_rows previously summed len() across all datasets, which
returned total residues (or entries × embedding_dim) instead of the
number of proteins. Replaced with a single-sample inspection that
reports entries, dimension, and dtype.

inspect_bundle is a new helper that prints rows/cols/schema and a
short preview for each table in a .parquetbundle, plus the settings
keys when present. Reuses read_bundle from data.io.bundle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Design for recreating the demo .parquetbundle shipped at
protspace_web/app/public/data.parquetbundle with two new behaviours:
strip signal peptides before embedding, and add ESMC-300m alongside
ProtT5. A standalone scripts/generate_toxprot_demo.py orchestrates
fetch → strip → protspace prepare → length+settings post-process.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Seven-task TDD plan that scaffolds the orchestration script, builds
parse_signal_peptides / write_mature_fasta / fetch_toxprot_tsv /
postprocess_bundle with unit tests where they make sense, wires up
main(), and finishes with a wipe + end-to-end verification step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The script is dev tooling, not user-facing package functionality;
chore: avoids triggering a minor bump from semantic-release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the uncertainty check ran against the entire Signal peptide
field, so a cleanly-bounded SP with a /note or /evidence containing
`?`, `<`, or `>` would be incorrectly skipped. Use the regex hit/miss
itself as the uncertainty signal: SIGNAL_RE only matches digit bounds,
so 0 hits + a SIGNAL keyword in the field == uncertain bounds. Also
guard against blank Entry rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
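
The regex-as-uncertainty-signal idea from this commit could look roughly like the sketch below. The actual `SIGNAL_RE` pattern is not shown in this PR, so the pattern, the function name `parse_signal_peptide`, and the example field strings are assumptions made for illustration.

```python
import re

# Hypothetical pattern: matches only when both bounds are plain digits.
SIGNAL_RE = re.compile(r"SIGNAL (\d+)\.\.(\d+)")


def parse_signal_peptide(field: str):
    """Classify a "Signal peptide" field.

    Returns ("ok", (start, end)), ("uncertain", None), or ("none", None).
    """
    match = SIGNAL_RE.search(field)
    if match:
        return "ok", (int(match.group(1)), int(match.group(2)))
    if "SIGNAL" in field:
        # Keyword present but no digit-only bounds, e.g. "SIGNAL 1..?".
        return "uncertain", None
    return "none", None


# A "?" inside /note no longer causes a cleanly-bounded SP to be skipped.
clean = parse_signal_peptide('SIGNAL 1..21; /note="Or 19?"; /evidence="ECO:0000255"')
fuzzy = parse_signal_peptide('SIGNAL 1..?; /evidence="ECO:0000255"')
absent = parse_signal_peptide("")
```

The point of the design is that the check looks only at the bounds, so free text elsewhere in the field cannot trigger a false "uncertain" result.
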
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address code-review nits on Task 4:
- Document that the cache key is out_path only.
- Use splitlines() instead of count("\n") so the empty-payload guard
  doesn't fire spuriously if UniProt ever returns the data row without
  a trailing newline.
- Pass encoding="utf-8" explicitly to write_text for symmetry with the
  decode step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
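
A small sketch of why `splitlines()` is the safer guard here. The payload and the helper name `has_data_rows` are made up for illustration; the real guard in fetch_toxprot_tsv may be shaped differently.

```python
# Header plus one data row, with no trailing newline (toy payload).
payload = "Entry\tSequence\nTOY001\tMKTLL"

newline_count = payload.count("\n")     # counts separators, not rows
line_count = len(payload.splitlines())  # counts rows regardless of trailing newline


def has_data_rows(text: str) -> bool:
    """True when the TSV has at least one row besides the header."""
    return len(text.splitlines()) > 1
```

With `count("\n")`, this payload is indistinguishable from a header-only response that does end in a newline, which is the spurious case the commit avoids.
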
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address code-review nits on Task 5:
- Comment why we map mature lengths by protein_id rather than zipping
  positionally — the prepare pipeline can reorder rows during
  EmbeddingSet merging and dedup, so positional mapping would silently
  corrupt lengths.
- Enrich the missing-key error to include the bundle filename, the
  size of the mature_lengths map, and the first 5 missing IDs — makes
  the live-run debug path much shorter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
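
The ID-keyed mapping and the enriched error described above might be sketched like this. The function name `assign_mature_lengths`, its signature, and the toy IDs are hypothetical; only the behaviour (look up by protein_id, and report the bundle name, map size, and first 5 missing IDs) comes from the commit message.

```python
def assign_mature_lengths(protein_ids, mature_lengths, bundle_name="data.parquetbundle"):
    """Look up each bundle row's mature length by protein ID, never by position."""
    missing = [pid for pid in protein_ids if pid not in mature_lengths]
    if missing:
        raise KeyError(
            f"{bundle_name}: {len(missing)} IDs missing from mature_lengths "
            f"(map size {len(mature_lengths)}); first 5: {missing[:5]}"
        )
    return [mature_lengths[pid] for pid in protein_ids]


mature_lengths = {"A": 40, "B": 55, "C": 31}  # toy IDs, in FASTA order
bundle_ids = ["C", "A", "B"]                  # rows reordered by the pipeline
lengths = assign_mature_lengths(bundle_ids, mature_lengths)
```

Zipping `mature_lengths.values()` against `bundle_ids` positionally would assign 40 to "C" without raising any error, which is the silent corruption the comment warns about.
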
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply code-review feedback on Task 6:
- Use protspace.cli.app.setup_logging instead of logging.basicConfig.
  This caps urllib3/requests at WARNING (else they spam DEBUG with
  -v) and routes through the tqdm-aware handler so subprocess
  progress bars don't get garbled.
- Switch -v to action="count" for parity with `protspace prepare`'s
  verbosity convention; default behaviour is unchanged (INFO).
- Use shlex.join when logging the prepare invocation so the printed
  command is copy-paste safe (METHODS contains a `;`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
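
The two standard-library pieces mentioned here, `action="count"` and `shlex.join`, behave as in the sketch below. The command list is a placeholder: `example-tool`, its `--methods` flag, and the value are hypothetical and do not reproduce the real `protspace prepare` invocation.

```python
import argparse
import shlex

parser = argparse.ArgumentParser()
# Each repeated -v increments the counter; no flag leaves the default of 0.
parser.add_argument("-v", "--verbose", action="count", default=0)
args = parser.parse_args(["-vv"])

# Hypothetical command; the argument contains ";", which a shell would
# treat as a command separator if it were printed unquoted.
cmd = ["example-tool", "--methods", "pca2;umap2"]
printable = shlex.join(cmd)  # quotes arguments that need it
naive = " ".join(cmd)        # unsafe to paste into a shell
```

`shlex.split(printable)` round-trips back to `cmd`, which is what makes the logged line safe to copy and paste.
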
- Use prot_t5 + esm2_650m as the two embedders (was prot_t5 + esmc_300m).
- Trim the bundle to the original demo's 18 annotation columns: drop
  signal_peptide (sequence was stripped in this pipeline so the SP
  annotation no longer applies) plus the InterPro/taxonomy auxiliaries
  brought in by the `interpro` and `taxonomy` annotation groups.
- Reorder columns so protein_families is the first non-id column — the
  web app picks the first non-id column as the default annotation.
  Bundle also keeps ProtT5 — UMAP 2 as the first projection so it loads
  by default.
- Recompute the manual top-9 + __NA__ categories for pfam, ec,
  superfamily, and cath from the new dataset (split on `;`, drop the
  trailing `|score` / `|EVIDENCE`); preserve the hand-curated
  protein_families styling byte-for-byte from the existing web bundle.
- Drop stale tracked data/toxins legacy artifacts left over from the
  pre-regeneration layout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
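
The category recomputation (split on `;`, drop the trailing `|score` / `|EVIDENCE`, keep the most frequent labels) can be sketched as below. This is an assumption-laden illustration: the function name `top_categories` and the toy cell values are hypothetical, and the real script may handle ties and the `__NA__` bucket differently.

```python
from collections import Counter


def top_categories(values, n=9):
    """Count labels across multi-valued cells and return the n most common."""
    counts = Counter()
    for cell in values:
        if not cell:
            continue  # empty cells would fall under __NA__
        for item in cell.split(";"):
            label = item.split("|", 1)[0].strip()  # drop "|score" / "|EVIDENCE"
            if label:
                counts[label] += 1
    return [label for label, _ in counts.most_common(n)]


# Toy column values, not taken from the real dataset.
column = ["famA|12.3;famB|8.1", "famA|9.9;famB|3.3", "", "famC|5.0;famA|7.7"]
top = top_categories(column, n=2)
```

In the bundle's styling, the returned labels would each receive a palette colour, with everything else grouped under `__NA__`.
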
Add the regenerated 7,831-protein toxprot demo bundle (ProtT5 + ESM2-650M,
mature peptides) to the repo. data/toxins/ is whitelisted in .gitignore
for exactly this purpose, matching the precedent of the legacy bundle
files we deleted in the previous commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tsenoner tsenoner merged commit db12e33 into main May 2, 2026
4 checks passed
@tsenoner tsenoner deleted the feat/regenerate-toxprot-demo branch May 2, 2026 08:30
martinpycha pushed a commit to d0rr4/PP1 that referenced this pull request May 20, 2026
…demo

chore: regenerate toxprot demo bundle (ProtT5 + ESM2-650M, mature peptides)