diff --git a/docs/plans/README.md b/docs/plans/README.md new file mode 100644 index 0000000..e830dde --- /dev/null +++ b/docs/plans/README.md @@ -0,0 +1,27 @@ +# Migration plans + +Versioned plans for the GoaT data import pipeline migration. The newest +version is the source of truth; older versions are preserved verbatim so the +evolution of the plan stays auditable. + +## Current + +- [v2-current-plan.md](v2-current-plan.md) — Phase 1 complete, Phase 2 in + progress (cleanup tasks tracked), Phases 3–5 outlined. + +## History + +- [v1-initial-plan.md](v1-initial-plan.md) — original plan written at project + kickoff. Defines the five-phase framing, the gap analysis, and the + network-robustness / logging / conventions reference that v2 still relies on. + +## Versioning convention + +- The newest plan is always named `vN-current-plan.md`. +- When a new revision lands, the previous `current` file is renamed to + `vN-.md` (e.g. `v1-initial-plan.md`) and a new + `v(N+1)-current-plan.md` is added. +- Historical plans are never edited after archival — corrections go into the + new version's change log. +- Each new plan ends with a `Change log` section summarizing what changed + versus the previous version. diff --git a/docs/plans/v1-initial-plan.md b/docs/plans/v1-initial-plan.md new file mode 100644 index 0000000..7b04500 --- /dev/null +++ b/docs/plans/v1-initial-plan.md @@ -0,0 +1,571 @@ +# GoaT Data Import Pipeline Migration Plan + +## TL;DR + +Migrate all data fetching from the legacy `goat-data` GitHub Actions workflow to scheduled Prefect-backed updaters in the `data` repo, then wire up parsers and validators to produce import-ready TSV/YAML pairs on S3. Five phases: fetch (Phase 1), parse+validate (Phase 2), switch S3 source (Phase 3), replace import (Phase 4), full pipeline (Phase 5). 
+ +--- + +## Gap Analysis: Updater Coverage + +### Already Implemented (10 updaters + 1 shared helper) + +| Updater | Legacy Equivalent | Schedule | Notes | +| ---------------------------- | -------------------------------------------------- | -------------------- | -------------------------------------------------- | +| `update_ncbi_datasets` | fetch-ncbi-datasets-zip | Daily | NCBI Datasets CLI → JSONL | +| `update_ncbi_taxonomy` | fetch-ncbi-taxdump (commented out in legacy) | Weekly | FTP taxdump with MD5 verification | +| `update_ena_taxonomy_extra` | fetch-ena-taxonomy-extra (commented out in legacy) | Weekly | ENA REST API | +| `update_genomehubs_taxonomy` | fetch-genomehubs-taxonomy | Daily | blobtk collation from NCBI+ENA+OTT | +| `update_tolid_prefixes` | fetch-tolids (commented out in legacy) | Weekly | GitLab WTSI; 400k line minimum validation | +| `update_ott_taxonomy` | (no legacy equivalent) | Monthly | Open Tree of Life .tgz download | +| `update_tol_portal_status` | STS API (fully replaced via tol-sdk) | Daily (orchestrated) | Docker-isolated; replaces `STS_AUTHORIZATION_KEY` | +| `update_tol_genome_notes` | (new source — no legacy equivalent) | Daily (orchestrated) | Docker-isolated; tol-sdk | +| `update_nhm_status_list` | fetch-from-apis NHM (commented out in legacy) | Weekly | POST API with cursor pagination | +| `update_boat_config` | GoaT API + Lustre | Daily | Assembly QC config builder | +| `tol_utils` (shared) | — | — | Shared ToL Portal helper, not a standalone updater | + +### Missing — Need New Updaters (8 categories, ~11 updaters) + +| # | Source | Legacy Job | Priority | Schedule | Complexity | +| --- | ---------------------------------------- | --------------------------------------------------- | -------- | -------- | ------------------------------------------------ | +| 1 | **BlobToolKit** | fetch-blobtoolkit (Docker `genomehubs parse --btk`) | HIGH | Daily | Medium — API pagination + Docker | +| 2 | **RefSeq Organelles** | fetch-refseq-organelles 
(FTP + BioPython) | HIGH | Weekly | Medium — FTP + GenBank parsing | +| 3 | **VGP Status** | fetch-from-apis (GitHub YAML) | MEDIUM | Weekly | Low — simple HTTP + YAML parse | +| 4 | **JGI 1KFG** | fetch-from-apis (OAuth REST) | MEDIUM | Weekly | Medium — OAuth token exchange | +| 5 | **Ensembl Metadata** (×6) | fetch-assembly-links (6 JSON endpoints) | MEDIUM | Monthly | Low — HTTP + JSON→TSV, one parameterized updater | +| 6 | **UCSC Assembly Hubs** | fetch-assembly-links | LOW | Monthly | Low — HTTP + text parsing | +| 7 | **Google Sheets Status** (~20+ projects) | fetch-from-apis (R + Python) | HIGH | Weekly | High — rewrite R→Python, normalize tables | +| 8 | **SRA Data** | (parse_sra_data.py) | MEDIUM | Weekly | Medium — NCBI API + XML parsing | + +### Legacy-Only Binary Fetches (no migration needed) + +These legacy jobs fetch tool binaries, not data. They become package dependencies in the data repo: + +- **fetch-ncbi-datasets** — downloads `datasets` CLI executable → already a pip dependency (`ncbi-datasets-cli`) +- **fetch-genomehubs-api** — downloads GenomeHubs API binary → already a pip dependency (`genomehubs`) +- **fetch-genomehubs-ui** — downloads GenomeHubs UI binary → not needed for data pipeline + +### Static/Semi-Static Sources (no external fetch needed) + +These exist as curated YAML/TSV pairs in `goat-data/sources/` and are uploaded directly to S3: + +- **Genomesize/Karyotype** — 25 FILE\_ sources (genome size databases, chromosome counts) +- **Conservation** — CITES index (periodically updated manually) +- **UK Legislation** — 9 FILE\_ sources (very static) +- **Regional Lists** — 7 FILE\_ sources (static geographic lists) +- **Lineages** — ODB10 lineage mappings +- **OTT IDs** — OTT taxonomy mappings + +Note: **ToLIDs** are NOT static — the prefix list is actively fetched by `update_tolid_prefixes`. 
The `tolids.names.yaml` config in `goat-data/sources/tolids/` is a naming convention file that ships with the YAML configs, not a separate data source. + +These should be synced to S3 via a simple `sync_static_sources` utility or manually, not via updaters. + +--- + +## Phase 1: External Data Fetching + +### Goal + +All external data fetching implemented as Prefect updaters with scheduled deployments, uploading raw data to S3 and emitting events for downstream parsing. + +### Steps + +#### Group A: API-Based Updaters (parallel development) + +**Step 1: `update_vgp_status` — VGP Status List** + +- Fetch GitHub YAML from `https://raw.githubusercontent.com/vgl-hub/genome-portal/master/_data/table_tracker.yml` +- Parse YAML, extract fields: common_name, family, order, scientific_name, status, taxon_id, vgp_phase +- Write TSV to `s3://goat/resources/status-lists/vgp.tsv` +- Schedule: Weekly +- Reuse: `safe_get()` from `flows/lib/utils.py`, `parse_args/shared_args` pattern +- Reference: `goat-data/scripts/api/api_config.py` VGL handlers + +**Step 2: `update_jgi_status` — JGI 1KFG** + +- OAuth token exchange: offline_token → access_token via `https://signon.jgi.doe.gov/signon/create` +- Paginated API: `https://gold-ws.jgi.doe.gov/projects?studyGoldId=Gs0000001` +- Write TSV to `s3://goat/resources/status-lists/jgi_1kfg.tsv` +- Schedule: Weekly +- Requires: `JGI_OFFLINE_TOKEN` secret (Prefect Secret block or env var) +- Reference: `goat-data/scripts/jgi_to_tsv.py` +- Bug risk: Legacy code has fragile OAuth flow — add proper token refresh and expiry handling + +**Step 3: `update_ensembl_metadata` — Ensembl Species Metadata (6 databases)** + +- Single parameterized updater deployed 6 times with different division parameters +- Divisions: Fungi, Metazoa, Plants, Protists, Vertebrates, Rapid Release +- Fetch JSON from Ensembl REST API endpoints +- Transform JSON→TSV (replace legacy `jq` one-liners with explicit Python) +- Write to 
`s3://goat/resources/assembly-data/species_metadata_Ensembl{Division}.tsv.gz` +- Schedule: Monthly +- Reference: `goat-data/.github/workflows/fetch-resources.yml` fetch-assembly-links job + +**Step 4: `update_ucsc_assemblies` — UCSC Genome Browser** + +- Fetch assembly hub list from UCSC API +- Parse to TSV +- Write to `s3://goat/resources/assembly-data/ucsc_ids.tsv` +- Schedule: Monthly +- Reuse: `safe_get()`, standard arg parsing + +**Step 5: `update_sra_data` — SRA Metadata** + +- Fetch from NCBI SRA API (Entrez or BigQuery) +- Parse XML/JSON responses to TSV +- Write to `s3://goat/resources/sra/sra.tsv.gz` +- Schedule: Weekly +- Reference: `goat-data/scripts/parse_sra_data.py` +- Bug risk: Legacy script has hardcoded batch sizes and silent error swallowing + +#### Group B: Complex Updaters (sequential, more effort) + +**Step 6: `update_blobtoolkit` — BlobToolKit Analysis Data** + +- Approach A (preferred): Direct API fetch from `https://blobtoolkit.genomehubs.org/api/v1/search/Eukaryota` + per-assembly detail queries +- Approach B: Docker-isolated `genomehubs parse --btk` via orchestrator pattern (like tol_genome_notes) +- Outputs: `btk.tsv.gz` + `btk.files.yaml` to `s3://goat/resources/btk/` +- Schedule: Daily +- Reference: `goat-data/scripts/parse_blobtoolkit.py` +- Bug risk: Legacy has `print(plots)` debug line left in (line 66); pagination may miss entries + +**Step 7: `update_refseq_organelles` — RefSeq Organelle Data** + +- Fetch from NCBI FTP: `ftp.ncbi.nlm.nih.gov/refseq/release/` +- Parse GenBank flat files for mitochondrion/plastid sequences +- Extract: accession, taxon_id, organism, sequence_length, references +- Write to `s3://goat/resources/assembly-data/refseq_organelles.tsv.gz` +- Schedule: Weekly +- Reference: `goat-data/scripts/parse_refseq_organelles.py` (uses BioPython) +- Consideration: BioPython dependency may need Docker isolation (check pydantic conflicts) + +**Step 8: `update_google_sheets_status` — Google Sheets Project Status Lists** + 
+- Rewrite R script (`get_googlesheets.R`) entirely in Python +- Fetch TSVs from public Google Sheets URLs (no auth needed for public sheets) +- Use `import_status_lib.py` patterns for table normalization but rewrite cleanly: + - Replace pandas one-liners with explicit column mapping + - Handle encoding robustly (UTF-8 with fallback) + - Normalize species names, taxon IDs +- Projects list parameterized (deploy once per project group or batch) +- Outputs: One TSV per project to `s3://goat/resources/status-lists/{project}_expanded.tsv` +- Schedule: Weekly +- Sub-steps: + - 8a: Core fetcher function (reusable across all sheets) + - 8b: Table normalizer (species name cleaning, status field mapping) + - 8c: Per-project configuration (sheet URLs, field mappings, column renames) + - 8d: Deploy as single flow with project list parameter +- Reference: `goat-data/scripts/import_status_lib.py`, `goat-data/scripts/import_status.py` +- Bug risks in legacy: + - Code duplication (import_status_lib.py copied to ebp_import/) + - Silent encoding failures + - Hardcoded 24-project list + - Pandas operations that silently drop data on merge conflicts + +#### Group C: Infrastructure & Static Data + +**Step 9: `sync_static_sources` — Static YAML/TSV pairs** + +- Utility to upload curated YAML/TSV pairs from goat-data/sources/ to S3 +- Not a scheduled updater — run manually or on goat-data repo changes +- Covers: genomesize-karyotype, conservation, uk-legislation, regional-lists, lineages +- Could be triggered by a webhook on goat-data repo pushes + +**Step 10: Secrets & Configuration** + +- Configure Prefect Secret blocks for: `JGI_OFFLINE_TOKEN`, Google Sheets URLs +- STS_AUTHORIZATION_KEY no longer needed (replaced by tol-sdk) +- Add deployment entries to `flows/prefect.yaml` for all new updaters + +### Relevant Files (Phase 1) + +**New files to create:** + +- `flows/updaters/update_vgp_status.py` +- `flows/updaters/update_jgi_status.py` +- `flows/updaters/update_ensembl_metadata.py` 
+- `flows/updaters/update_ucsc_assemblies.py` +- `flows/updaters/update_sra_data.py` +- `flows/updaters/update_blobtoolkit.py` +- `flows/updaters/update_refseq_organelles.py` +- `flows/updaters/update_google_sheets_status.py` +- `flows/lib/google_sheets.py` (shared Google Sheets fetching utilities) +- `flows/lib/api_helpers.py` (shared API helpers: OAuth, pagination, JSON→TSV) + +**Existing files to modify:** + +- `flows/prefect.yaml` — add deployments for all new updaters +- `flows/lib/utils.py` — add any missing shared utilities +- `flows/lib/shared_args.py` — add new argument definitions if needed +- `requirements.txt` — add BioPython if needed for RefSeq parsing + +**Reference files (goat-data, read-only):** + +- `goat-data/scripts/api/api_config.py` — API endpoint definitions +- `goat-data/scripts/api/api_tools.py` — retry/pagination patterns +- `goat-data/scripts/jgi_to_tsv.py` — JGI OAuth flow +- `goat-data/scripts/parse_blobtoolkit.py` — BTK API parsing +- `goat-data/scripts/parse_refseq_organelles.py` — GenBank parsing +- `goat-data/scripts/parse_sra_data.py` — SRA parsing +- `goat-data/scripts/import_status_lib.py` — table normalization +- `goat-data/scripts/get_googlesheets.R` — Google Sheets URLs +- `goat-data/.github/workflows/fetch-resources.yml` — complete fetch workflow + +### Verification (Phase 1) + +1. Each updater runs locally with `SKIP_PREFECT=true` and produces valid output TSV +2. Output TSV format matches goat-data legacy output (diff comparison where possible) +3. S3 upload succeeds to `s3://goat/resources/` paths +4. Events emitted with correct resource types for downstream triggering +5. All tests pass: `python -m pytest tests/` +6. No secret values hardcoded; all auth via env vars or Prefect Secret blocks +7. 
`prefect deploy --prefect-file flows/prefect.yaml --all` succeeds + +### Decisions (Phase 1) + +- **Google Sheets**: Rewrite in Python (not R) for consistency with the rest of the codebase +- **BlobToolKit**: Prefer direct API approach over Docker genomehubs parse (simpler, avoids Docker-in-Docker); fall back to orchestrator pattern if API is insufficient +- **RefSeq Organelles**: Use BioPython in Docker container if pydantic conflicts arise +- **Static sources**: Not updaters — sync utility or manual upload +- **STS replaced by ToL Portal**: No migration needed (already done via `update_tol_portal_status`) + +--- + +## Phase 2: YAML-Backed Parsers & Validation + +### Goal + +All data sources processed by fetch-parse-validate pipeline. Parsing triggered by update events. Validated TSV/YAML pairs uploaded to new S3 directories (`s3://goat/validated/`). + +### Steps + +**Step 1: Implement `parse_sequencing_status` parser** + +- Handle all status list TSV formats (VGP, JGI, Google Sheets projects, NHM, ToL Portal) +- Config-driven: read YAML to determine column mappings +- Reuse `Config` class from `flows/lib/utils.py` +- One parser handles all ~65 status list YAML configs + +**Step 2: Implement `parse_refseq_organelles` parser** + +- Replace stub with working implementation +- Read YAML config, apply field mappings from `refseq_organelles.types.yaml` +- Validate organelle accessions, taxonomy + +**Step 3: Implement `parse_blobtoolkit` parser** + +- Parse BTK TSV using YAML config from `btk.types.yaml` +- Handle BUSCO stats, base composition, read mapping fields + +**Step 4: Implement `parse_ensembl_metadata` parser** + +- Handle all 6 Ensembl division TSVs +- Single generic parser, config-driven via YAML + +**Step 5: Implement `parse_sra_data` parser** + +- Parse SRA TSV with YAML config from `sra.types.yaml` + +**Step 6: Implement `parse_genomesize_karyotype` parser** + +- Handle the 25+ genomesize/karyotype FILE\_ sources +- Generic parser for simple TSV→validated 
TSV transformation + +**Step 7: Implement `parse_conservation` and `parse_legislation` parsers** + +- Static data validation parsers +- Check CITES categories, legislation references against YAML constraints + +**Step 8: Wire all fetch-parse-validate deployments** + +- Add trigger entries in `prefect.yaml` for each parser +- Events from Phase 1 updaters trigger corresponding parse-validate flows +- `validate_file_pair()` runs `blobtk validate` on each output +- Gate S3 upload on validation success + +**Step 9: Configure S3 output paths** + +- Validated outputs go to `s3://goat/validated/{directory}/` (NOT `s3://goat/resources/` or `s3://goat/sources/`) +- Both validated TSV and validated YAML uploaded +- Validation report (JSONL) uploaded alongside for audit + +### Relevant Files (Phase 2) + +- `flows/parsers/parse_sequencing_status.py` — complete implementation +- `flows/parsers/parse_refseq_organelles.py` — replace stub +- `flows/parsers/parse_blobtoolkit.py` — new +- `flows/parsers/parse_ensembl_metadata.py` — new +- `flows/parsers/parse_sra_data.py` — new +- `flows/parsers/parse_genomesize_karyotype.py` — new (generic) +- `flows/parsers/parse_conservation.py` — new +- `flows/orchestration/wrapper_fetch_parse_validate.py` — existing, may need updates +- `flows/validators/validate_file_pair.py` — existing, may need S3 path updates +- `flows/prefect.yaml` — add trigger entries +- Local copies of YAML configs from `goat-data/sources/` for development + +### Verification (Phase 2) + +1. Each parser produces TSV matching the YAML config headers +2. `blobtk validate -g ` passes for each output with ≥95% valid rows +3. Event chain works: updater → parse → validate → S3 upload +4. Validated files appear in `s3://goat/validated/` directories +5. Row counts comparable to legacy pipeline output +6. 
No data loss: compare parsed row counts against raw input counts + +### Decisions (Phase 2) + +- **S3 validated path**: `s3://goat/validated/` (separate from `resources/` and `sources/`) +- **Parser reuse**: `parse_sequencing_status` handles ALL status list formats via YAML config +- **Parser reuse**: `parse_genomesize_karyotype` handles ALL genomesize/karyotype sources generically +- **YAML configs**: Develop with local copies, production fetches from goat-data sources/ +- **Scope boundary**: Phase 2 does NOT change the legacy import at all + +--- + +## Phase 3: Switch Legacy Import to Validated Data + +### Goal + +Legacy import workflow reads from `s3://goat/validated/` instead of `s3://goat/resources/` or `s3://goat/sources/`, removing all fetch steps from the import. + +### Steps + +1. Verify data parity: compare `s3://goat/validated/` against `s3://goat/sources/` for all directories +2. Update `goat-data/.github/workflows/genomehubs-index.yml` to read from `s3://goat/validated/` +3. Remove fetch jobs from `goat-data/.github/workflows/fetch-resources.yml` (or disable) +4. Update `goat-data/.github/workflows/s3_release.yml` to skip fetch-resources +5. Run test release with validated data; compare with latest production release +6. Staged rollout: switch one directory at a time, verify, proceed + +### Verification (Phase 3) + +1. Test release produces identical (or improved) Elasticsearch indices +2. API test suite passes +3. UI test suite passes +4. Row counts match or exceed previous release +5. Rollback path confirmed: can revert to `s3://goat/sources/` if issues + +### Risk Mitigation + +- Keep `s3://goat/sources/` and `s3://goat/resources/` intact as rollback +- Phase 3 changes only S3 paths in workflow config, easily reversible +- Switch one source directory at a time (assembly-data first, then status-lists, etc.) 
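The Phase 3 parity check (step 1) and the "row counts match or exceed" verification item could be sketched roughly as follows. This is a minimal illustration, assuming both S3 prefixes have already been synced to local directories; `count_rows` and `compare_directories` are hypothetical helper names, not existing functions in the data repo.

```python
import gzip
from pathlib import Path


def count_rows(path):
    """Count data rows (excluding the header) in a plain or gzipped TSV."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return max(sum(1 for _ in handle) - 1, 0)


def compare_directories(sources_dir, validated_dir):
    """Report files whose validated copy is missing or has fewer rows.

    Returns {filename: (source_rows, validated_rows)}, where
    validated_rows is None when no validated copy exists.
    """
    problems = {}
    for source_file in sorted(Path(sources_dir).glob("*.tsv*")):
        validated_file = Path(validated_dir) / source_file.name
        source_rows = count_rows(source_file)
        validated_rows = (
            count_rows(validated_file) if validated_file.exists() else None
        )
        if validated_rows is None or validated_rows < source_rows:
            problems[source_file.name] = (source_rows, validated_rows)
    return problems
```

An empty result for a directory would be one signal that it is safe to switch; any entry is a reason to hold that directory back during the staged rollout. Row counts alone do not prove content parity, so a column-level diff is still worth running where practical.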
+ +--- + +## Phase 4: Replace Legacy Import (Future) + +### Goal + +Replace `genomehubs index` with updated import code that reads validated TSV/YAML pairs directly. + +### Scope + +- Requires new import code not yet available +- Skip validation/lookup steps (already done in Phase 2) +- Direct TSV→Elasticsearch indexing + +--- + +## Phase 5: Full Pipeline Migration (Future) + +### Goal + +Remove all GitHub Actions workflow dependencies; full pipeline runs in Prefect. + +### Scope + +- Yet to be defined +- Includes: ES init, indexing, fill, test, release promotion +- Replaces: s3_release.yml, genomehubs-init.yml, genomehubs-index.yml, genomehubs-fill.yml, genomehubs-test.yml + +--- + +## Network Robustness Review + +### Current `safe_get()` Implementation + +The existing `safe_get()` in `flows/lib/utils.py` provides: + +- Configurable timeout (default 300s) +- Supports GET/POST/HEAD methods +- Raises on HTTP errors via `response.raise_for_status()` +- **No built-in retry logic** — relies entirely on Prefect task-level retries + +### Current Retry Patterns Across Updaters + +| Updater | Task Retries | Delay | Notes | +| -------------------------- | ------------ | ----- | --------------------------------------------------------- | +| `update_tolid_prefixes` | 2 | 2s | Reasonable for a single file download | +| `update_ncbi_datasets` | 2 | 2s | Too few for NCBI rate-limited API; batches of 50 | +| `update_nhm_status_list` | 100 | 60s | Very aggressive — NHM API known to be unstable | +| `update_tol_portal_status` | 3 | 60s | Orchestrator-level retries re-run entire Docker container | +| `update_tol_genome_notes` | 3 | 60s | Same orchestrator pattern | +| `update_ncbi_taxonomy` | 2 | 2s | FTP download — may need longer delay | +| `update_ott_taxonomy` | 2 | 2s | HTTP download of .tgz — adequate | +| `update_boat_config` | (not set) | — | SSH-based, different failure modes | + +### Gaps & Recommendations for New Updaters + +**1. 
Add HTTP-level retry to `safe_get()`** (or create `resilient_get()`) + +- Use `urllib3.util.Retry` with `requests.adapters.HTTPAdapter` for transport-level retries +- Retry on: 429 (rate limit), 500, 502, 503, 504 +- Exponential backoff: 1s, 2s, 4s (3 attempts) +- This separates transient HTTP failures from task-level Prefect retries (which re-run the entire task) +- Existing updaters benefit automatically when `safe_get()` is hardened + +**2. Per-source timeout tuning** +| Source | Recommended Timeout | Rationale | +|--------|-------------------|-----------| +| VGP (GitHub raw) | 30s | Small YAML file, fast CDN | +| JGI API | 120s per page | Paginated, can be slow | +| Ensembl FTP | 300s | Large JSON files (>100MB for Vertebrates) | +| UCSC | 60s | Small text file | +| SRA API | 300s | Potentially large responses | +| BlobToolKit API | 120s per request | Many per-assembly detail calls | +| RefSeq FTP | 600s | Large GenBank files | +| Google Sheets | 60s per sheet | Can be slow on large sheets | + +**3. Partial failure handling for paginated APIs** + +- JGI, BlobToolKit, and SRA all paginate — a failure mid-pagination should not discard pages already fetched +- Write each page to a temp file; only assemble final TSV after all pages succeed +- If a page fails after retries, emit a warning event with partial count and halt gracefully + +**4. Idempotency and freshness checks** + +- `update_tolid_prefixes` already uses HTTP HEAD timestamp comparison — reuse this pattern +- New updaters should check `Last-Modified` or `ETag` before downloading, using `is_local_file_current_http()` +- For APIs without timestamp headers (JGI, BlobToolKit), compare MD5 of output against previous S3 version using `generate_md5()` + +**5. 
S3 upload atomicity** + +- Current `upload_to_s3()` uses `s3cmd put` — if interrupted, leaves partial file on S3 +- Recommendation: upload to a `.tmp` key first, then copy to final key and delete `.tmp` +- Or use boto3 multipart upload with automatic cleanup on failure + +**6. Connection pooling for high-volume API calls** + +- BlobToolKit updater will make ~10,000+ individual API calls (one per assembly) +- Use a `requests.Session()` to reuse TCP connections and benefit from connection pooling +- Add rate limiting (e.g., 10 req/s) to avoid overwhelming the BlobToolKit API + +**7. DNS and TLS failure handling** + +- `requests.exceptions.ConnectionError` and `requests.exceptions.SSLError` are not HTTP status codes — they won't be caught by status-code retry logic +- Ensure transport-level retries cover these cases +- Add explicit handling in updaters: log the error clearly, distinguish transient DNS vs permanent config errors + +--- + +## Logging Review + +### Current Logging Patterns in the Data Repo + +**Primary mechanism:** `print()` with `@task(log_prints=True)` + +- When running under Prefect, print statements are captured as INFO-level logs +- When `SKIP_PREFECT=true`, print goes to stdout (useful for local testing) +- No structured logging (no JSON, no log levels beyond print) + +**What's logged today (by updater):** + +| Updater | Logging Pattern | Gaps | +| ---------------------------- | ------------------------------------------------ | --------------------------------- | +| `update_ncbi_datasets` | Prints batch progress, line counts, match status | No timing info | +| `update_ncbi_taxonomy` | Prints MD5 comparison, extraction status | Good — includes checksums | +| `update_tolid_prefixes` | Prints line count, timestamp comparison result | Good — includes freshness check | +| `update_nhm_status_list` | Prints page count, record counts per page | Missing total elapsed time | +| `update_tol_portal_status` | Minimal — Docker output captured | Docker stdout 
mixed with app logs | +| `update_tol_genome_notes` | Minimal — Docker output captured | Same Docker stdout issue | +| `update_ena_taxonomy_extra` | Prints taxon counts | Missing API call timing | +| `update_genomehubs_taxonomy` | Prints blobtk command and result | Good — includes command | +| `update_ott_taxonomy` | Prints download size, extraction | Good | +| `update_boat_config` | Prints API queries, SSH commands | Good — verbose | + +### Recommendations for New Updaters + +**1. Standardize a logging helper** +Create a `log_progress()` utility in `flows/lib/utils.py` that: + +- Prints a timestamped message (ISO 8601) +- Includes the updater/task name as a prefix +- Works identically with and without Prefect (`print()`-based, not `logging` module) +- Example: `[2026-04-24T12:00:00Z] update_vgp_status: Fetched 1,234 records in 3.2s` + +**2. Log network call summaries** +Every HTTP request should log: + +- URL (redacted if contains secrets) +- Method (GET/POST/HEAD) +- Response status code +- Response size (bytes) +- Elapsed time (seconds) +- Whether the response was from cache/retry + +**3. Log output file summaries** +After writing each output file, log: + +- File path (local and S3) +- Row count +- File size +- MD5 hash +- Whether it matches previous version (changed/unchanged) + +**4. Log events emitted** +Print a summary when emitting Prefect events: + +- Event name +- `matches.previous` value +- Key payload fields (row count, etc.) +- This aids debugging when running with `SKIP_PREFECT=true` (since `emit_event` is a no-op) + +**5. Error context in exceptions** +Every caught exception should include: + +- The URL or resource that failed +- The HTTP status code (if applicable) +- The attempt number (if retrying) +- A hint about whether the error is transient or permanent +- Example: `RuntimeError("JGI OAuth token exchange failed (attempt 2/3): 401 Unauthorized — check JGI_OFFLINE_TOKEN is valid")` + +**6. 
Docker orchestrator logging** +For Docker-isolated flows (`tol_portal_status`, `tol_genome_notes`): + +- Capture and prefix Docker stdout/stderr separately +- Log Docker exit code explicitly +- Log the full Docker command (with secrets redacted) for reproducibility + +--- + +## Conventions Reference + +### YAML/TSV Pair Convention (goat-data) + +- **Prefix patterns**: `ATTR_` (attribute defs), `TAXON_` (taxonomy), `FILE_` (data sources), unprefixed (primary) +- **YAML structure**: `file:` metadata, `attributes:` field mappings, `taxonomy:` taxon matching, `identifiers:` ID columns +- **`needs:`** directive: lists dependent YAML files that must be co-located +- **TSV naming**: matches `file.name` in YAML config, often `.gz` compressed + +### Data Repo Code Conventions + +- Absolute imports: `from flows.lib import utils` +- Google-style docstrings with type hints +- `SKIP_PREFECT=true` for local testing +- `run_quoted()` for subprocess (never `shell=True`) +- `safe_get()` for HTTP requests +- `parse_args()` with `shared_args` constants for CLI +- Tasks: focused, idempotent, with `@task(retries=N, log_prints=True)` +- Events: `emit_event()` with `prefect.resource.id/type/matches.previous` +- Black formatter, 88-char line length + +### Legacy Code Bug Risks to Avoid + +1. `parse_blobtoolkit.py` line 66: debug `print(plots)` left in production +2. `import_status_lib.py`: duplicated across directories, encoding silently fails +3. `fetch-or-fallback.sh`: `|| exit 0` masks real errors +4. Google Sheets: hardcoded `gid` parameters break on URL changes +5. JGI OAuth: no token refresh/expiry handling +6. NCBI API: hardcoded 30s timeouts, silent failure on rate limit +7. 
Pandas merge conflicts silently drop data in status list processing diff --git a/docs/plans/v2-current-plan.md b/docs/plans/v2-current-plan.md new file mode 100644 index 0000000..eb4e4c2 --- /dev/null +++ b/docs/plans/v2-current-plan.md @@ -0,0 +1,172 @@ +# GoaT Data Import Pipeline Migration Plan — v2 (current) + +> **Status as of this revision:** Phase 1 complete · Phase 2 mostly complete (cleanup tasks remaining) · Phases 3–5 not yet started. +> +> For the original framing and historical context see [v1-initial-plan.md](v1-initial-plan.md). + +## TL;DR + +Phase 1 (external data fetching) and the bulk of Phase 2 (YAML-backed parsers + validation) are now implemented. Every legacy fetch job from `goat-data/.github/workflows/fetch-resources.yml` has a corresponding Prefect updater, and every source directory that needs transformation has either a parser or a `SKIP_PARSING` assignment plus a YAML config. What remains in Phase 2 is targeted cleanup: confirm parser-vs-skip assignments, ensure every status-list YAML can be validated end-to-end locally, get `blobtk` on the worker PATH, and finalize a couple of YAML schemas. After that, Phase 3 cuts the legacy import over to `s3://goat/validated/`. + +## Status at a glance + +| Phase | State | Notes | +| ------------------------------------ | -------------- | ----------------------------------------------------------------------------------- | +| Phase 1 — External data fetching | ✅ Complete | 19 updaters deployed (see inventory below). | +| Phase 2 — Parsers & validation | 🔄 In progress | 11 parsers + fetch-parse-validate wrapper landed. Cleanup tasks tracked below. | +| Phase 3 — Switch legacy import | ⬜ Not started | Gated on Phase 2 cleanup + a parity comparison between `validated/` and `sources/`. | +| Phase 4 — Replace `genomehubs index` | ⬜ Future | Requires new import code; out of scope for this revision. 
| +| Phase 5 — Full pipeline migration | ⬜ Future | Replaces remaining GitHub Actions workflows (release / init / index / fill / test). | + +--- + +## Phase 1 — External Data Fetching (COMPLETE) + +Every external fetch job from the legacy workflow now has a Prefect-backed updater that writes raw data to `s3://goat/resources/...` and emits an `update.*.finished` event. + +**Updater inventory (`flows/updaters/`):** + +- `update_ncbi_datasets.py` +- `update_ncbi_taxonomy.py` +- `update_ena_taxonomy_extra.py` +- `update_genomehubs_taxonomy.py` +- `update_tolid_prefixes.py` +- `update_ott_taxonomy.py` +- `update_tol_portal_status.py` +- `update_tol_genome_notes.py` +- `update_nhm_status_list.py` +- `update_boat_config.py` +- `update_vgp_status.py` +- `update_vgp_original_status.py` +- `update_jgi_status.py` +- `update_ensembl_metadata.py` +- `update_ucsc_assemblies.py` +- `update_sra_data.py` +- `update_blobtoolkit.py` +- `update_refseq_organelles.py` +- `update_google_sheets_status.py` + +Shared helpers: `tol_utils.py`, `flows/updaters/api/`, `flows/lib/utils.py`, `flows/lib/shared_args.py`. + +All deployments are wired in `flows/prefect.yaml`. There are no remaining Phase 1 items. + +--- + +## Phase 2 — YAML-Backed Parsers & Validation (IN PROGRESS) + +### What is in place + +**Parsers (`flows/parsers/`):** + +| Parser | Handles | +| --------------------------------------- | ------------------------------------------------------------------------ | +| `parse_ncbi_assemblies.py` | NCBI Datasets + data-freeze assembly TSVs. | +| `parse_refseq_organelles.py` | NCBI RefSeq mitochondrion / plastid GenBank → pivoted TSV. | +| `parse_sequencing_status.py` | JGI 1KFG (and any other status list whose source format needs pivoting). | +| `parse_blobtoolkit.py` | BlobToolKit analysis exports (stub; see cleanup). | +| `parse_sra_data.py` | SRA metadata TSV. | +| `parse_genomesize_karyotype.py` | Generic genomesize / karyotype `FILE_` sources. 
| +| `parse_conservation.py` | CITES + conservation sources. | +| `parse_legislation.py` | UK legislation FILE\_ sources. | +| `parse_skip_parsing.py` | Pass-through for inputs that already match their YAML schema. | +| `parse_backfill_historical_versions.py` | Historical assembly version backfill. | + +Discovery is automatic via `flows/parsers/register.py` (any `parse_*.py` is picked up). `Parser` enum members serialize to lowercase (e.g. `skip_parsing`) but the `PARSERS.parsers` dict is keyed by `Parser.name` (uppercase, e.g. `SKIP_PARSING`). + +**Wrappers, validators, and orchestrators:** + +- `flows/lib/wrapper_fetch_parse_validate.py` — production fetch → parse → validate → S3 upload pipeline. +- `flows/lib/local_fetch_parse_validate.py` — local equivalent: copies the YAML + TSV into a work directory, runs the parser, runs `validate_file_pair` with `s3_path=None`, and gracefully skips validation if the `blobtk` binary is not on PATH. Handles plain ↔ gz conversion so the input file matches the YAML's `file.name`. Handles the lowercase-enum / uppercase-dict-key mismatch when looking up parsers. +- `flows/lib/validate_file_pair.py` — wraps the `blobtk validate` Rust binary. +- `flows/orchestrators/batch_validate_status_lists.py` — triggered by `update.google.sheets.status.finished`. Iterates every `FILE_*.types.yaml` under `goat-data/sources/status-lists/`, calls the standard `fetch_parse_validate(parser=Parser.SKIP_PARSING, …)` for each TSV present, and reports pass / fail / skip-no-tsv / skip-config-error counts. CLI flags: `--yaml-dir --work-dir --taxdump-path --s3-path --dry-run --min-valid --min-assigned`. + +**Other lib modules in current use:** `conditional_import.py`, `fetch_genomehubs_target_list.py`, `fetch_previous_file_pair.py`, `for_each_record.py`, `index_assembly_features.py`, `process_features.py`, `shared_args.py`, `shared_tasks.py`, `utils.py`. 
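The lowercase-value / uppercase-key mismatch described above for parser lookup can be illustrated with a minimal sketch. The `Parser` enum and `PARSERS` dict below are stand-ins invented for illustration, not the real objects from `flows/parsers/register.py`, and upper-casing plain strings is an assumption of this sketch rather than confirmed behaviour of `local_fetch_parse_validate.py`.

```python
from enum import Enum


class Parser(Enum):
    """Illustrative stand-in: uppercase member names, lowercase values."""

    SKIP_PARSING = "skip_parsing"
    REFSEQ_ORGANELLES = "refseq_organelles"


# Hypothetical registry keyed by the uppercase member name (Parser.name).
PARSERS = {member.name: f"<parser for {member.name}>" for member in Parser}


def parser_key(parser) -> str:
    """Return the uppercase registry key for an enum member or a string."""
    if isinstance(parser, Enum):
        # Enum members carry the uppercase key in .name, not in .value.
        return parser.name
    # Assumption: a plain string may arrive in its serialized lowercase form.
    return str(parser).upper()


# Both spellings resolve to the same registry entry.
print(PARSERS[parser_key(Parser.SKIP_PARSING)])
print(PARSERS[parser_key("skip_parsing")])
```

Looking up by `.value` (`"skip_parsing"`) would raise a `KeyError` against a dict keyed this way, which is why the key is normalized before the lookup.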
+ +### Parser ↔ source assignment audit (current) + +| Source directory | Deployment | Parser | +| -------------------------------------------- | ------------------------------------- | ---------------------- | +| `assembly-data/ncbi_datasets` | `fpv-ncbi-datasets` | `NCBI_ASSEMBLIES` | +| `assembly-data/data_freeze` | `fpv-data-freeze` | `NCBI_ASSEMBLIES` | +| `assembly-data/refseq_organelles` | `fpv-refseq-organelles` | `REFSEQ_ORGANELLES` | +| `assembly-data/ucsc` | `fpv-ucsc` | `SKIP_PARSING` | +| `btk/` | `fpv-blobtoolkit` | `SKIP_PARSING` | +| `sra/` | `fpv-sra` | `SKIP_PARSING` | +| `status-lists/vgp` (FILE_VGP_Ordinal_Phase1) | `fpv-vgp` | `SKIP_PARSING` | +| `status-lists/nhm` | `fpv-nhm` | `SKIP_PARSING` | +| `status-lists/jgi_1kfg` | `fpv-jgi` | `SEQUENCING_STATUS` | +| `status-lists/google_sheets/*` | `batch-validate-google-sheets-status` | `SKIP_PARSING` (batch) | + +Rationale for `SKIP_PARSING` on BTK and UCSC: `blobtk validate` can derive the taxonomy columns from a `taxon_id` column automatically, so no pre-parse transformation is required. The YAML schema is the source of truth. + +### Phase 2 cleanup — remaining work + +1. **`blobtk` on worker PATH.** Validation currently no-ops locally on the developer machine because the binary is not installed. Add it to the worker image (and document a local install option) so `local_fetch_parse_validate.py` reports real validation outcomes instead of skipping. +2. **`BLOBTOOLKIT` parser placeholder.** `parse_blobtoolkit.py` exists but is a thin pass-through. Decide whether to keep `SKIP_PARSING` permanently for `btk/` (current production setting) or graduate to a real parser once the BTK API export gains structured fields the YAML cannot describe. +3. **`GENOMESIZE_KARYOTYPE` schema confirmation.** `parse_genomesize_karyotype.py` is generic, but a handful of `FILE_` sources still need their YAMLs cross-checked against the parser's column expectations. 
Walk every YAML under `goat-data/sources/genomesize-karyotype/` and run `local_fetch_parse_validate.py` once per file. +4. **JGI YAML.** Confirm `sources/status-lists/jgi_1kfg/jgi_1kfg.types.yaml` matches the columns emitted by `update_jgi_status` after the OAuth pagination rewrite. +5. **End-to-end parity check.** Run the batch validator (`batch_validate_status_lists.py`) over all current `status-lists/` YAMLs locally and record the pass / fail / skip rates. Fix anything that fails before Phase 3. + +### Verification (Phase 2) + +1. `python -m flows.lib.local_fetch_parse_validate --yaml … --tsv …` returns exit 0 for every (parser, source) pair in the table above. +2. `python -m flows.orchestrators.batch_validate_status_lists --dry-run` lists every `FILE_*.types.yaml` under `status-lists/` with the expected parser assignment. +3. `prefect deploy --prefect-file flows/prefect.yaml --all` succeeds and the trigger for `batch-validate-google-sheets-status` shows `update.google.sheets.status.finished`. + +--- + +## Phase 3 — Switch Legacy Import to Validated Data (NOT STARTED) + +Unchanged from v1. Recap: + +1. Confirm parity between `s3://goat/validated/` and `s3://goat/sources/` per directory. +2. Update `goat-data/.github/workflows/genomehubs-index.yml` to read from `validated/`. +3. Disable fetch jobs in `goat-data/.github/workflows/fetch-resources.yml` and skip them from `s3_release.yml`. +4. Test release; compare ES indices, API tests, UI tests. +5. Staged rollout: assembly-data first, then status-lists, then the rest. + +Rollback path: revert the S3 path in the workflow — `sources/` and `resources/` remain intact. + +## Phase 4 — Replace Legacy Import (FUTURE) + +Unchanged from v1. Requires the new import code (skips re-validation/lookup, reads validated TSV/YAML pairs directly into Elasticsearch). + +## Phase 5 — Full Pipeline Migration (FUTURE) + +Unchanged from v1. Move ES init, indexing, fill, test, and release promotion out of GitHub Actions into Prefect. 
+ +--- + +## Implemented surface area (snapshot) + +- **Parsers:** 11 (see Phase 2 table). +- **Updaters:** 19 (Phase 1 inventory). +- **Orchestrators:** 5 — `batch_validate_status_lists`, `tasks`, `tol_data_pipeline`, `tol_genome_notes_orchestration`, `tol_portal_status_orchestration`. +- **Lib modules:** 12 — `conditional_import`, `fetch_genomehubs_target_list`, `fetch_previous_file_pair`, `for_each_record`, `index_assembly_features`, `local_fetch_parse_validate`, `process_features`, `shared_args`, `shared_tasks`, `utils`, `validate_file_pair`, `wrapper_fetch_parse_validate`. +- **Deployments in `flows/prefect.yaml`:** 34 (including the new `batch-validate-google-sheets-status`). + +## Reference material carried forward from v1 + +The following sections of [v1-initial-plan.md](v1-initial-plan.md) remain authoritative and have not been duplicated here: + +- **Gap analysis** — historical record of which legacy jobs needed updaters. Now fully implemented. +- **Network robustness review** — `safe_get()` hardening guidance, per-source timeout table, paginated-API partial-failure handling, idempotency / freshness checks, S3 upload atomicity, connection pooling, DNS / TLS handling. +- **Logging review** — `log_progress()` helper proposal, network-call summaries, output-file summaries, event-emission logging, exception context, Docker orchestrator logging. +- **Conventions reference** — YAML/TSV pair conventions, repo coding conventions, list of legacy code bug risks to avoid. + +These are general-purpose engineering guidance and apply to any future updater or parser work. + +--- + +## Change log + +**v1 → v2 (this revision):** + +- Marked Phase 1 complete; replaced the "missing updaters" table with the implemented inventory. +- Marked Phase 2 mostly complete; added the parser-vs-source assignment audit table. 
+- Added the **Phase 2 cleanup** section enumerating the remaining items (blobtk PATH, BLOBTOOLKIT parser decision, GENOMESIZE_KARYOTYPE schema sweep, JGI YAML, end-to-end parity). +- Documented `flows/lib/local_fetch_parse_validate.py` and `flows/orchestrators/batch_validate_status_lists.py` (both new since v1). +- Recorded the BTK and UCSC `SKIP_PARSING` decision (auto-taxonomy in `blobtk validate`). +- Recorded the VGP YAML correction (`FILE_VGP_Ordinal_Phase1.types.yaml`). +- Phases 3–5 unchanged. +- Network-robustness, logging, and conventions sections kept in v1 by reference rather than duplicated. diff --git a/flows/.gitignore b/flows/.gitignore new file mode 100644 index 0000000..47b052f --- /dev/null +++ b/flows/.gitignore @@ -0,0 +1,2 @@ +!.gitignore +__pycache__/ diff --git a/flows/lib/__pycache__/__init__.cpython-313.pyc b/flows/lib/__pycache__/__init__.cpython-313.pyc new file mode 100644 index 0000000..a5e6e06 Binary files /dev/null and b/flows/lib/__pycache__/__init__.cpython-313.pyc differ diff --git a/flows/lib/__pycache__/conditional_import.cpython-313.pyc b/flows/lib/__pycache__/conditional_import.cpython-313.pyc new file mode 100644 index 0000000..a72da4a Binary files /dev/null and b/flows/lib/__pycache__/conditional_import.cpython-313.pyc differ diff --git a/flows/lib/__pycache__/shared_args.cpython-313.pyc b/flows/lib/__pycache__/shared_args.cpython-313.pyc new file mode 100644 index 0000000..8f8009e Binary files /dev/null and b/flows/lib/__pycache__/shared_args.cpython-313.pyc differ diff --git a/flows/lib/__pycache__/utils.cpython-313.pyc b/flows/lib/__pycache__/utils.cpython-313.pyc new file mode 100644 index 0000000..7597344 Binary files /dev/null and b/flows/lib/__pycache__/utils.cpython-313.pyc differ diff --git a/flows/lib/conditional_import.py b/flows/lib/conditional_import.py index ada08c4..89f0184 100644 --- a/flows/lib/conditional_import.py +++ b/flows/lib/conditional_import.py @@ -35,5 +35,6 @@ def emit_event(*_, **__): from 
prefect.events import emit_event from prefect.runtime.task_run import run_count +NO_CACHE = NO_CACHE -__all__ = ["flow", "task", "emit_event", "run_count", "skip_prefect", "NO_CACHE"] +__all__ = ["flow", "task", "emit_event", "run_count", "skip_prefect"] diff --git a/flows/lib/fetch_previous_file_pair.py b/flows/lib/fetch_previous_file_pair.py index d0acbeb..cbb070b 100644 --- a/flows/lib/fetch_previous_file_pair.py +++ b/flows/lib/fetch_previous_file_pair.py @@ -112,13 +112,16 @@ def copy_yaml_files(yaml_path: str, config: Config, work_dir: str) -> None: # Copy any dependencies to the working directory if "needs" in config.config["file"]: source_dir = os.path.dirname(yaml_path) - for file in config.config["file"]["needs"]: + needs = config.config["file"]["needs"] + if not isinstance(needs, list): + needs = [needs] + for file in needs: file_path = os.path.join(source_dir, file) shutil.copy(file_path, work_dir) @flow() -def fetch_previous_file_pair(yaml_path: str, s3_path: str, work_dir: str) -> None: +def fetch_previous_file_pair(yaml_path: str, s3_path: str, work_dir: str) -> bool: """ Fetch the previous YAML/TSV files and compare headers. @@ -128,7 +131,7 @@ def fetch_previous_file_pair(yaml_path: str, s3_path: str, work_dir: str) -> Non work_dir (str): Path to the working directory. """ config = utils.load_config(yaml_path) - (local_file, remote_file) = get_filenames(config, s3_path, work_dir) + local_file, remote_file = get_filenames(config, s3_path, work_dir) line_count = fetch_tsv_file(remote_file, local_file) copy_yaml_files(yaml_path, config, work_dir) status = compare_headers(config, local_file) diff --git a/flows/lib/local_fetch_parse_validate.py b/flows/lib/local_fetch_parse_validate.py new file mode 100644 index 0000000..e1e4757 --- /dev/null +++ b/flows/lib/local_fetch_parse_validate.py @@ -0,0 +1,266 @@ +"""Local fetch-parse-validate wrapper for testing without S3. 
+ +This mirrors the production wrapper_fetch_parse_validate.py but: +- Skips S3 fetch (uses a local input TSV directly) +- Copies YAML + input TSV into work_dir with expected names +- Runs the selected parser (or SKIP_PARSING) +- Runs blobtk validate locally (no S3 upload) + +Usage: + SKIP_PREFECT=true python -m flows.lib.local_fetch_parse_validate \ + -p SKIP_PARSING \ + --yaml-path ../goat-data/sources/status-lists/FILE_VGP_Ordinal_Phase1.types.yaml \ + --input-tsv tsv_examples/VGP_Ordinal_Phase1_plus.tsv \ + --work-dir /tmp/test-vgp + + SKIP_PREFECT=true python -m flows.lib.local_fetch_parse_validate \ + -p REFSEQ_ORGANELLES \ + --yaml-path ../goat-data/sources/assembly-data/refseq_organelles.types.yaml \ + --input-tsv tsv_examples/refseq_organelles.tsv \ + --work-dir /tmp/test-refseq +""" + +import argparse +import os +import shutil +import subprocess +import sys + +from flows.lib.conditional_import import flow +from flows.lib.utils import enum_action, load_config +from flows.parsers.register import register_plugins +from flows.validators.validate_file_pair import validate_file_pair + +PARSERS = register_plugins() + + +def _check_blobtk(): + """Verify blobtk is available on PATH.""" + try: + subprocess.run( + ["blobtk", "--version"], + stdout=subprocess.PIPE, + stderr=subprocess.PIPE, + ) + except FileNotFoundError: + print( + "[local] WARNING: 'blobtk' not found on PATH. " + "Validation step will be skipped.\n" + " Install via: cd blobtk/rust && cargo build --release && " + "export PATH=$PWD/target/release:$PATH", + file=sys.stderr, + ) + return False + return True + + +def _copy_yaml_to_workdir(yaml_path: str, work_dir: str) -> str: + """Copy the YAML and its dependencies into work_dir. + + Returns: + str: Path to the copied YAML in work_dir. + """ + os.makedirs(work_dir, exist_ok=True) + config = load_config(yaml_path) + dest = os.path.join(work_dir, os.path.basename(yaml_path)) + shutil.copy(yaml_path, dest) + + # Copy dependency YAML files (e.g. 
"needs:" references) + if "needs" in config.config.get("file", {}): + source_dir = os.path.dirname(yaml_path) + needs = config.config["file"]["needs"] + if isinstance(needs, str): + needs = [needs] + for dep in needs: + dep_path = os.path.join(source_dir, dep) + if os.path.exists(dep_path): + shutil.copy(dep_path, work_dir) + return dest + + +def _place_input_tsv(input_tsv: str, yaml_path: str, work_dir: str) -> str: + """Copy or symlink the input TSV into work_dir with the name expected by the YAML. + + Handles gzip: if YAML expects .gz but input is plain, compress on copy. + If YAML expects plain but input is .gz, decompress on copy. + + Returns: + str: Path to the TSV in work_dir. + """ + import gzip as gzip_mod + + config = load_config(yaml_path) + expected_name = os.path.basename(config.config["file"]["name"]) + dest = os.path.join(work_dir, expected_name) + + # If input already matches expected location, skip + if os.path.abspath(input_tsv) == os.path.abspath(dest): + return dest + + expects_gz = expected_name.endswith(".gz") + input_is_gz = input_tsv.endswith(".gz") + + if expects_gz and not input_is_gz: + # Compress plain input into .gz destination + with open(input_tsv, "rb") as f_in, gzip_mod.open(dest, "wb") as f_out: + shutil.copyfileobj(f_in, f_out) + elif not expects_gz and input_is_gz: + # Decompress .gz input into plain destination + with gzip_mod.open(input_tsv, "rb") as f_in, open(dest, "wb") as f_out: + shutil.copyfileobj(f_in, f_out) + else: + # Same format — straight copy + shutil.copy(input_tsv, dest) + return dest + + +@flow(log_prints=True) +def local_fetch_parse_validate( + parser: str, + yaml_path: str, + input_tsv: str, + work_dir: str, + taxdump_path: str = None, + append: bool = False, + min_valid: int = 0, + min_assigned: int = 0, +) -> bool: + """Run the parse-validate pipeline locally without S3. + + Args: + parser: Parser enum name (e.g. "SKIP_PARSING", "REFSEQ_ORGANELLES"). + yaml_path: Path to the source YAML configuration file. 
+ input_tsv: Path to the input TSV file from the updater. + work_dir: Working directory for intermediate files. + taxdump_path: Optional path to an NCBI taxdump for taxonomy validation. + append: Whether to run in append mode. + min_valid: Minimum expected valid row count. + min_assigned: Minimum expected assigned taxa count. + + Returns: + bool: True if validation passed. + """ + yaml_path = os.path.abspath(yaml_path) + input_tsv = os.path.abspath(input_tsv) + work_dir = os.path.abspath(work_dir) + + if not os.path.exists(yaml_path): + raise FileNotFoundError(f"YAML file not found: {yaml_path}") + if not os.path.exists(input_tsv): + raise FileNotFoundError(f"Input TSV not found: {input_tsv}") + + print(f"[local] Parser: {parser}") + print(f"[local] YAML: {yaml_path}") + print(f"[local] Input: {input_tsv}") + print(f"[local] Work: {work_dir}") + + # Step 1: Copy YAML to work_dir + working_yaml = _copy_yaml_to_workdir(yaml_path, work_dir) + print(f"[local] Copied YAML → {working_yaml}") + + # Step 2: Place input TSV with expected filename + tsv_dest = _place_input_tsv(input_tsv, yaml_path, work_dir) + print(f"[local] Input TSV → {tsv_dest}") + + # Step 3: Run parser + parser_key = parser.name if hasattr(parser, "name") else str(parser) + file_parser = PARSERS.parsers[parser_key] + print(f"[local] Running parser: {file_parser.name}") + file_parser.func( + working_yaml=working_yaml, + work_dir=work_dir, + append=append, + data_freeze_path=None, + ) + print("[local] Parser completed") + + # Step 4: Validate (no S3 upload — s3_path=None) + if _check_blobtk(): + print("[local] Running validation...") + status = validate_file_pair( + yaml_path=yaml_path, + work_dir=work_dir, + taxdump_path=taxdump_path, + s3_path=None, + min_valid=min_valid, + min_assigned=min_assigned, + ) + if status: + print("[local] ✓ Validation PASSED") + else: + print("[local] ✗ Validation FAILED") + else: + print("[local] ⚠ Validation SKIPPED (blobtk not available)") + status = None + return status 
+ + +def main(): + """CLI entry point.""" + arg_parser = argparse.ArgumentParser( + description="Local fetch-parse-validate (no S3).", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=__doc__, + ) + arg_parser.add_argument( + "-p", "--parser", + required=True, + type=str, + action=enum_action(PARSERS.ParserEnum), + help=f"Parser to use. Choices: {[e.name for e in PARSERS.ParserEnum]}", + ) + arg_parser.add_argument( + "--yaml-path", + required=True, + help="Path to the source YAML configuration file.", + ) + arg_parser.add_argument( + "--input-tsv", + required=True, + help="Path to the input TSV file (from the updater).", + ) + arg_parser.add_argument( + "--work-dir", + default="/tmp/local-fpv", + help="Working directory for intermediate files.", + ) + arg_parser.add_argument( + "--taxdump-path", + default=None, + help="Path to an NCBI taxdump directory.", + ) + arg_parser.add_argument( + "--append", + action="store_true", + help="Run in append mode.", + ) + arg_parser.add_argument( + "--min-valid", + type=int, + default=0, + help="Minimum expected valid row count.", + ) + arg_parser.add_argument( + "--min-assigned", + type=int, + default=0, + help="Minimum expected assigned taxa count.", + ) + + args = arg_parser.parse_args() + success = local_fetch_parse_validate( + parser=args.parser, + yaml_path=args.yaml_path, + input_tsv=args.input_tsv, + work_dir=args.work_dir, + taxdump_path=args.taxdump_path, + append=args.append, + min_valid=args.min_valid, + min_assigned=args.min_assigned, + ) + # Exit 0 if validation passed or was skipped (None), 1 if failed + sys.exit(0 if success is not False else 1) + + +if __name__ == "__main__": + main() diff --git a/flows/lib/shared_args.py b/flows/lib/shared_args.py index 4d5d801..4f9f4fb 100644 --- a/flows/lib/shared_args.py +++ b/flows/lib/shared_args.py @@ -57,6 +57,16 @@ }, } +DIVISION = { + "flags": ["--division"], + "keys": { + "help": "Ensembl division (fungi, metazoa, plants, protists, rapid, 
vertebrates).", + "type": str, + "default": "vertebrates", + }, +} + + DRY_RUN = { "flags": ["-d", "--dry_run"], "keys": { @@ -94,6 +104,15 @@ "keys": {"help": "Type of index to fetch.", "type": str}, } +INDEX_URL = { + "flags": ["--index_url"], + "keys": { + "help": "URL to the private index TSV listing project sheets.", + "type": str, + }, +} + + INPUT_PATH = { "flags": ["-i", "--input_path"], "keys": {"help": "Path to the input file.", "type": str}, @@ -197,9 +216,7 @@ } -def default( - arg: Dict[str, Any], default: Union[int, float, bool, str] -) -> Dict[str, Any]: +def default(arg: Dict[str, Any], default: Union[int, float, bool, str]) -> Dict[str, Any]: """Return an argument with a default value.""" # append/replace the default value to the help message default_re = r"\s*\(default: .*\)" diff --git a/flows/lib/utils.py b/flows/lib/utils.py index ed8f8bd..60e4855 100644 --- a/flows/lib/utils.py +++ b/flows/lib/utils.py @@ -1,6 +1,7 @@ #!/usr/bin/python3 import contextlib +import glob import gzip import hashlib import os @@ -18,6 +19,8 @@ import boto3 import requests from botocore.exceptions import ClientError +from requests.adapters import HTTPAdapter +from urllib3.util.retry import Retry from dateutil import parser from genomehubs import utils as gh_utils @@ -131,6 +134,152 @@ def parse(self, data: dict) -> dict: return parsed_data +def open_tsv(input_path: str): + """Open a TSV file (plain or gzipped) for reading. + + Args: + input_path (str): Path to the input TSV file (.tsv or .tsv.gz). + + Returns: + File handle in text mode. + """ + if input_path.endswith(".gz"): + return gzip.open(input_path, "rt", encoding="utf-8", newline="") + return open(input_path, "rt", encoding="utf-8", newline="") + + +def parse_tsv_with_config( + input_path: str, + config: "Config", + key_field: Optional[str] = None, + delimiter: str = "\t", +) -> Dict[str, dict]: + """Parse a TSV file row-by-row through a Config's parse functions. 
+ + Each row of the TSV is treated as a flat dict keyed by column header, + matching the expected ``path:`` references in YAML attribute definitions. + Rows are passed through ``gh_utils.parse_report_values`` to apply any + YAML-defined translations and field mappings, then keyed in the returned + dict by ``key_field`` (or by row index if not provided). + + Args: + input_path (str): Path to the input TSV file (.tsv or .tsv.gz). + config (Config): Loaded YAML configuration. + key_field (str): Optional input column name to use as the dict key. + If not provided, rows are keyed by sequential integer. + delimiter (str): Field delimiter in the TSV (default: tab). + + Returns: + Dict[str, dict]: Mapping of key → parsed row dict (YAML-named fields). + """ + parsed: Dict[str, dict] = {} + with open_tsv(input_path) as fh: + reader = DictReader(fh, delimiter=delimiter) + for index, record in enumerate(reader): + row = gh_utils.parse_report_values(config.parse_fns, record) + if key_field and key_field in record and record[key_field]: + key = record[key_field] + else: + key = str(index) + parsed[key] = row + return parsed + + +def locate_input_tsv(work_dir: str, expected_name: Optional[str] = None) -> str: + """Locate the input TSV in ``work_dir`` for a generic parser. + + Picks the single ``*.tsv`` or ``*.tsv.gz`` in ``work_dir`` whose basename + is not the expected output. Falls back to a direct hit on ``expected_name`` + when present. + + Args: + work_dir (str): Working directory. + expected_name (str): Output filename from ``config.meta["file_name"]``; + used to exclude the parser's intended output from candidate + inputs. + + Returns: + str: Path to the input TSV. + + Raises: + FileNotFoundError: If no candidate TSV is found. + ValueError: If multiple candidate TSVs are found. 
+ """ + candidates = sorted( + glob.glob(os.path.join(work_dir, "*.tsv")) + + glob.glob(os.path.join(work_dir, "*.tsv.gz")) + ) + if expected_name: + candidates = [ + c for c in candidates if os.path.basename(c) != expected_name + ] + if not candidates: + raise FileNotFoundError( + f"No TSV input found in {work_dir} (expected != {expected_name})" + ) + if len(candidates) > 1: + raise ValueError( + f"Multiple TSV inputs in {work_dir}: {candidates!r}" + ) + return candidates[0] + + +def run_generic_tsv_parser( + working_yaml: str, + work_dir: str, + append: bool = False, + key_field: Optional[str] = None, +) -> None: + """Run the generic flat-TSV-with-Config parsing pipeline. + + Locates the input TSV in ``work_dir``, loads the YAML config, + applies ``parse_report_values`` row-by-row, and writes the + canonical TSV to ``work_dir`` (preserving the YAML-defined + ``file_name`` for downstream validation). + + Args: + working_yaml (str): Path to the working YAML config file. + work_dir (str): Working directory. + append (bool): If True, load previous parsed data. + key_field (str): Optional input column to key parsed rows by. + """ + config = load_config(config_file=working_yaml, load_previous=append) + expected_name = config.meta["file_name"] + input_path = locate_input_tsv(work_dir, expected_name) + print(f"Parsing {input_path} with {working_yaml}") + + parsed = parse_tsv_with_config(input_path, config, key_field=key_field) + print(f"Parsed {len(parsed)} records") + + output_name = config.meta["file_name"] + config.meta["file_name"] = os.path.join( + work_dir, os.path.basename(output_name) + ) + try: + write_parsed_tsv(parsed, config) + finally: + config.meta["file_name"] = output_name + + +def write_parsed_tsv(parsed: Dict[str, dict], config: "Config") -> None: + """Write a parsed dict to TSV using config-defined headers and meta. + + Handles ``.gz`` filenames by writing uncompressed then gzipping. + + Args: + parsed (Dict[str, dict]): Mapping of key → row dict. 
+ config (Config): Loaded YAML configuration. + """ + file_name = config.meta["file_name"] + if file_name.endswith(".gz"): + config.meta["file_name"] = file_name[:-3] + gh_utils.write_tsv(parsed, config.headers, config.meta) + os.system(f"gzip -f {config.meta['file_name']}") + config.meta["file_name"] = file_name + else: + gh_utils.write_tsv(parsed, config.headers, config.meta) + + def format_entry(entry, key: str, meta: dict) -> str: """ Formats a single entry in a dictionary, handling the case where the entry is a list. @@ -550,13 +699,56 @@ def __call__(self, parser, namespace, values, option_string=None): return EnumAction +def _build_session(retries=3, backoff_factor=1.0, status_forcelist=None): + """Build a requests Session with transport-level retry logic. + + Args: + retries (int): Total number of retries per request. + backoff_factor (float): Backoff factor for exponential delay between retries. + status_forcelist (list): HTTP status codes to trigger a retry. + + Returns: + requests.Session: Configured session with retry adapter. + """ + if status_forcelist is None: + status_forcelist = [429, 500, 502, 503, 504] + retry = Retry( + total=retries, + backoff_factor=backoff_factor, + status_forcelist=status_forcelist, + allowed_methods=["GET", "POST", "HEAD"], + raise_on_status=False, + ) + adapter = HTTPAdapter(max_retries=retry) + session = requests.Session() + session.mount("https://", adapter) + session.mount("http://", adapter) + return session + + def safe_get(*args, method="GET", timeout=300, **kwargs): + """Make an HTTP request with transport-level retries. + + Retries automatically on 429/5xx status codes and connection errors + with exponential backoff (1s, 2s, 4s). Separate from Prefect task-level + retries which re-run the entire task. + + Args: + *args: Positional arguments passed to requests (typically the URL). + method (str): HTTP method — "GET", "POST", or "HEAD". + timeout (int): Request timeout in seconds. 
+ **kwargs: Additional keyword arguments passed to requests. + + Returns: + requests.Response: The HTTP response object. + """ + session = _build_session() if method == "GET": - return requests.get(*args, timeout=timeout, **kwargs) + return session.get(*args, timeout=timeout, **kwargs) elif method == "POST": - return requests.post(*args, timeout=timeout, **kwargs) + return session.post(*args, timeout=timeout, **kwargs) elif method == "HEAD": - return requests.head(*args, timeout=timeout, **kwargs) + return session.head(*args, timeout=timeout, **kwargs) def find_http_file(http_path: str, filename: str) -> str: diff --git a/flows/orchestrators/batch_validate_status_lists.py b/flows/orchestrators/batch_validate_status_lists.py new file mode 100644 index 0000000..29863bd --- /dev/null +++ b/flows/orchestrators/batch_validate_status_lists.py @@ -0,0 +1,111 @@ +"""Batch fetch-parse-validate for all Google Sheets status list TSVs. + +Triggered by update.google.sheets.status.finished, this flow iterates +through all FILE_*.types.yaml in the status-lists directory and runs +the full fetch-parse-validate pipeline (with SKIP_PARSING) for each one +whose corresponding TSV is present in work_dir. +""" + +import os +import sys +from glob import glob +from typing import Optional + +from flows.lib.conditional_import import flow +from flows.lib.utils import load_config +from flows.lib.wrapper_fetch_parse_validate import Parser, fetch_parse_validate + + +@flow(log_prints=True) +def batch_validate_status_lists( + yaml_dir: str, + work_dir: str, + s3_path: str = "s3://goat/sources/status-lists/", + taxdump_path: Optional[str] = None, + dry_run: bool = False, + min_valid: int = 0, + min_assigned: int = 0, +) -> bool: + """Run fetch-parse-validate (SKIP_PARSING) for all status list TSVs in work_dir. + + For each FILE_*.types.yaml in yaml_dir, checks if the corresponding + TSV exists in work_dir. If present, invokes the standard + fetch_parse_validate flow with SKIP_PARSING. 
+ + Args: + yaml_dir: Directory containing FILE_*.types.yaml files. + work_dir: Directory containing TSVs output by the updater. + s3_path: S3 path prefix for validated files. + taxdump_path: Optional NCBI taxdump path. + dry_run: If True, skip S3 upload. + min_valid: Minimum valid row count per file. + min_assigned: Minimum assigned taxa per file. + + Returns: + bool: True if all validations passed. + """ + yaml_files = sorted(glob(os.path.join(yaml_dir, "FILE_*.types.yaml"))) + if not yaml_files: + print(f"No FILE_*.types.yaml found in {yaml_dir}") + return False + + results = {} + + for yaml_path in yaml_files: + yaml_name = os.path.basename(yaml_path) + try: + config = load_config(yaml_path) + except Exception as e: + print(f" SKIP {yaml_name}: failed to load config — {e}") + results[yaml_name] = "skip-config-error" + continue + + tsv_name = os.path.basename(config.config["file"]["name"]) + tsv_path = os.path.join(work_dir, tsv_name) + + if not os.path.exists(tsv_path): + # TSV not present — updater may not have produced it this run + results[yaml_name] = "skip-no-tsv" + continue + + try: + fetch_parse_validate( + parser=Parser.SKIP_PARSING, + yaml_path=yaml_path, + s3_path=s3_path, + work_dir=work_dir, + taxdump_path=taxdump_path, + dry_run=dry_run, + min_valid=min_valid, + min_assigned=min_assigned, + ) + results[yaml_name] = "pass" + print(f" ✓ {yaml_name}") + except Exception as e: + print(f" ✗ {yaml_name}: {e}") + results[yaml_name] = "fail" + + # Summary + passed = sum(1 for v in results.values() if v == "pass") + failed = sum(1 for v in results.values() if v == "fail") + skipped = sum(1 for v in results.values() if v.startswith("skip")) + print(f"\nBatch fetch-parse-validate: {passed} passed, {failed} failed, {skipped} skipped") + + return failed == 0 + + +if __name__ == "__main__": + import argparse + + p = argparse.ArgumentParser(description="Batch validate status list TSVs.") + p.add_argument("--yaml-dir", required=True, help="Directory with 
FILE_*.types.yaml") + p.add_argument("--work-dir", required=True, help="Directory with updater TSVs") + p.add_argument("--taxdump-path", default=None) + p.add_argument("--s3-path", default=None) + p.add_argument("--dry-run", action="store_true") + p.add_argument("--min-valid", type=int, default=0) + p.add_argument("--min-assigned", type=int, default=0) + args = p.parse_args() + + success = batch_validate_status_lists(**vars(args)) + sys.exit(0 if success else 1) diff --git a/flows/parsers/parse_blobtoolkit.py b/flows/parsers/parse_blobtoolkit.py new file mode 100644 index 0000000..0971bf5 --- /dev/null +++ b/flows/parsers/parse_blobtoolkit.py @@ -0,0 +1,93 @@ +"""Parse BlobToolKit assembly summary TSV. + +The corresponding updater (``update_blobtoolkit``) emits a flat TSV +(one row per BTK dataset) whose column headers already match the +``header:`` values in ``btk.types.yaml``. This parser simply applies +the YAML parse functions and writes the canonical TSV. +""" + +import os +from glob import glob + +from flows.lib.conditional_import import flow +from flows.lib.utils import ( # noqa: E402 + Parser, + load_config, + parse_tsv_with_config, + write_parsed_tsv, +) +from flows.parsers.args import parse_args # noqa: E402 + + +def _locate_input_tsv(work_dir: str, expected_name: str) -> str: + """Find the input TSV in ``work_dir``. + + Args: + work_dir (str): Working directory. + expected_name (str): YAML-defined output filename. + + Returns: + str: Path to the input TSV. 
+ """ + candidates = sorted( + glob(os.path.join(work_dir, "*.tsv")) + + glob(os.path.join(work_dir, "*.tsv.gz")) + ) + candidates = [c for c in candidates if os.path.basename(c) != expected_name] + if not candidates: + raise FileNotFoundError(f"No BTK input TSV found in {work_dir}") + if len(candidates) > 1: + raise ValueError(f"Multiple TSV inputs in {work_dir}: {candidates!r}") + return candidates[0] + + +@flow(log_prints=True) +def parse_blobtoolkit( + working_yaml: str, + work_dir: str, + append: bool = False, + **kwargs, +) -> None: + """Parse BTK summary TSV using YAML schema. + + Args: + working_yaml (str): Path to the YAML configuration file. + work_dir (str): Working directory containing the input TSV. + append (bool): If True, load previous parsed data. + **kwargs: Ignored extra arguments from the wrapper. + """ + config = load_config(config_file=working_yaml, load_previous=append) + + expected_name = config.meta["file_name"] + input_path = _locate_input_tsv(work_dir, expected_name) + print(f"Parsing BlobToolKit summary: {input_path}") + + parsed = parse_tsv_with_config(input_path, config, key_field="accession") + print(f"Parsed {len(parsed)} BTK dataset records") + + output_name = config.meta["file_name"] + config.meta["file_name"] = os.path.join( + work_dir, os.path.basename(output_name) + ) + try: + write_parsed_tsv(parsed, config) + finally: + config.meta["file_name"] = output_name + + +def plugin(): + """Register the parser plugin.""" + return Parser( + name="BLOBTOOLKIT", + func=parse_blobtoolkit, + description="Parse BlobToolKit assembly summary TSV using YAML schema.", + ) + + +if __name__ == "__main__": + args = parse_args("Parse BlobToolKit assembly summary TSV.") + parse_blobtoolkit( + working_yaml=args.yaml_path, + work_dir=os.path.dirname(args.input_path) or ".", + append=args.append, + ) diff --git a/flows/parsers/parse_conservation.py b/flows/parsers/parse_conservation.py new file mode 100644 index 0000000..f86c549 --- /dev/null +++ 
b/flows/parsers/parse_conservation.py @@ -0,0 +1,50 @@ +"""Parse conservation-status source TSV using a YAML schema. + +Handles the ``FILE_CITES_full_index.types.yaml`` (and any future +``FILE_*.types.yaml``) configurations under ``sources/conservation``. +Delegates to the shared generic flat-TSV pipeline. +""" + +import os + +from flows.lib.conditional_import import flow +from flows.lib.utils import Parser, run_generic_tsv_parser +from flows.parsers.args import parse_args + + +@flow(log_prints=True) +def parse_conservation( + working_yaml: str, + work_dir: str, + append: bool = False, + **kwargs, +) -> None: + """Parse a conservation-status TSV using YAML schema. + + Args: + working_yaml (str): Path to the YAML configuration file. + work_dir (str): Working directory containing the input TSV. + append (bool): If True, load previous parsed data. + **kwargs: Ignored extra arguments from the wrapper. + """ + run_generic_tsv_parser( + working_yaml=working_yaml, work_dir=work_dir, append=append + ) + + +def plugin(): + """Register the parser plugin.""" + return Parser( + name="CONSERVATION", + func=parse_conservation, + description="Parse a conservation-status TSV using a YAML schema.", + ) + + +if __name__ == "__main__": + args = parse_args("Parse a conservation-status TSV using a YAML schema.") + parse_conservation( + working_yaml=args.yaml_path, + work_dir=os.path.dirname(args.input_path) or ".", + append=args.append, + ) diff --git a/flows/parsers/parse_genomesize_karyotype.py b/flows/parsers/parse_genomesize_karyotype.py new file mode 100644 index 0000000..715e4b7 --- /dev/null +++ b/flows/parsers/parse_genomesize_karyotype.py @@ -0,0 +1,53 @@ +"""Parse genome size & karyotype source TSV using a YAML schema. + +Generic parser for the ~25 ``FILE_*.types.yaml`` configurations under +``sources/genomesize-karyotype``. Each source is a flat TSV whose +columns map directly to YAML attribute headers; this parser delegates +to the shared generic flat-TSV pipeline. 
+""" + +import os + +from flows.lib.conditional_import import flow +from flows.lib.utils import Parser, run_generic_tsv_parser +from flows.parsers.args import parse_args + + +@flow(log_prints=True) +def parse_genomesize_karyotype( + working_yaml: str, + work_dir: str, + append: bool = False, + **kwargs, +) -> None: + """Parse a genome-size or karyotype TSV using YAML schema. + + Args: + working_yaml (str): Path to the YAML configuration file. + work_dir (str): Working directory containing the input TSV. + append (bool): If True, load previous parsed data. + **kwargs: Ignored extra arguments from the wrapper. + """ + run_generic_tsv_parser( + working_yaml=working_yaml, work_dir=work_dir, append=append + ) + + +def plugin(): + """Register the parser plugin.""" + return Parser( + name="GENOMESIZE_KARYOTYPE", + func=parse_genomesize_karyotype, + description="Parse a genome-size or karyotype TSV using a YAML schema.", + ) + + +if __name__ == "__main__": + args = parse_args( + "Parse a genome-size or karyotype TSV using a YAML schema." + ) + parse_genomesize_karyotype( + working_yaml=args.yaml_path, + work_dir=os.path.dirname(args.input_path) or ".", + append=args.append, + ) diff --git a/flows/parsers/parse_legislation.py b/flows/parsers/parse_legislation.py new file mode 100644 index 0000000..43d3aa5 --- /dev/null +++ b/flows/parsers/parse_legislation.py @@ -0,0 +1,51 @@ +"""Parse legislation source TSV using a YAML schema. + +Handles the ``FILE_*.types.yaml`` configurations under +``sources/uk-legislation`` (Council Directive 92/43/EEC, Conservation +of Habitats and Species Regulations 2017, Wildlife and Countryside +Act 1981, etc.). Delegates to the shared generic flat-TSV pipeline. 
+""" + +import os + +from flows.lib.conditional_import import flow +from flows.lib.utils import Parser, run_generic_tsv_parser +from flows.parsers.args import parse_args + + +@flow(log_prints=True) +def parse_legislation( + working_yaml: str, + work_dir: str, + append: bool = False, + **kwargs, +) -> None: + """Parse a legislation TSV using YAML schema. + + Args: + working_yaml (str): Path to the YAML configuration file. + work_dir (str): Working directory containing the input TSV. + append (bool): If True, load previous parsed data. + **kwargs: Ignored extra arguments from the wrapper. + """ + run_generic_tsv_parser( + working_yaml=working_yaml, work_dir=work_dir, append=append + ) + + +def plugin(): + """Register the parser plugin.""" + return Parser( + name="LEGISLATION", + func=parse_legislation, + description="Parse a legislation TSV using a YAML schema.", + ) + + +if __name__ == "__main__": + args = parse_args("Parse a legislation TSV using a YAML schema.") + parse_legislation( + working_yaml=args.yaml_path, + work_dir=os.path.dirname(args.input_path) or ".", + append=args.append, + ) diff --git a/flows/parsers/parse_refseq_organelles.py b/flows/parsers/parse_refseq_organelles.py index c7245bd..c441264 100644 --- a/flows/parsers/parse_refseq_organelles.py +++ b/flows/parsers/parse_refseq_organelles.py @@ -1,32 +1,143 @@ -from flows.lib.utils import Parser # noqa: E402 +"""Parse RefSeq organelle data into one-row-per-assembly TSV. + +The corresponding updater (``update_refseq_organelles``) emits one row per +organelle sequence (mitochondrion or plastid). The GoaT YAML schema +(``refseq_organelles.types.yaml``) expects one row per assembly with +combined ``mitochondrion*`` / ``plastid*`` columns. This parser pivots +the per-organelle rows by the assembly accession (genbank), then runs +the records through the YAML parse functions. 
+""" + +import os +from csv import DictReader +from glob import glob + +from genomehubs import utils as gh_utils + +from flows.lib.conditional_import import flow +from flows.lib.utils import ( # noqa: E402 + Parser, + load_config, + open_tsv, + write_parsed_tsv, +) from flows.parsers.args import parse_args # noqa: E402 +ORGANELLE_FIELDS = ("id", "assemblySpan", "gcPercent", "nPercent") + + +def _locate_input_tsv(work_dir: str, expected_name: str) -> str: + """Find the per-organelle input TSV in ``work_dir``.""" + expected_path = os.path.join(work_dir, expected_name) + if os.path.exists(expected_path): + return expected_path + candidates = sorted( + glob(os.path.join(work_dir, "*.tsv")) + + glob(os.path.join(work_dir, "*.tsv.gz")) + ) + if not candidates: + raise FileNotFoundError( + f"No TSV input found in {work_dir} (expected {expected_name})" + ) + if len(candidates) > 1: + raise ValueError( + f"Multiple TSV inputs in {work_dir}: {candidates!r}" + ) + return candidates[0] + + +def _pivot_by_assembly(input_path: str) -> dict: + """Group per-organelle rows by GenBank accession. + + Args: + input_path (str): Path to the per-organelle TSV. + + Returns: + dict: Mapping of assembly accession → nested record with + ``mitochondrion``/``plastid`` sub-dicts. 
+ """ + by_assembly: dict = {} + with open_tsv(input_path) as fh: + reader = DictReader(fh, delimiter="\t") + for row in reader: + assembly = row.get("genbankAccession") or row.get("id") + if not assembly: + continue + record = by_assembly.setdefault( + assembly, + { + "id": row.get("id", assembly), + "genbankAccession": assembly, + "bioproject": row.get("bioproject", ""), + "biosample": row.get("biosample", ""), + "releaseDate": row.get("releaseDate", ""), + "annotations": {"organism": row.get("organismName", "")}, + "taxonId": row.get("taxonId", ""), + "sourceAuthor": row.get("sourceAuthor", ""), + "sourceYear": row.get("sourceYear", ""), + "sourceTitle": row.get("sourceTitle", ""), + "pubmedId": row.get("pubmedId", ""), + "sampleLocation": row.get("sampleLocation", ""), + }, + ) + organelle = (row.get("organelle") or "").lower() + if organelle in ("mitochondrion", "plastid"): + record[organelle] = { + field: row.get(field, "") for field in ORGANELLE_FIELDS + } + return by_assembly + +@flow(log_prints=True) def parse_refseq_organelles( - working_yaml: str, work_dir: str, append: bool, **kwargs + working_yaml: str, + work_dir: str, + append: bool = False, + **kwargs, ) -> None: - """ - Wrapper function to parse the RefSeq organelles JSONL file. + """Pivot per-organelle TSV to per-assembly and apply YAML schema. Args: - working_yaml (str): Path to the working YAML file. - work_dir (str): Path to the working directory. - append (bool): Whether to append to the existing TSV file. - **kwargs: Additional keyword arguments. + working_yaml (str): Path to the YAML configuration file. + work_dir (str): Working directory containing the input TSV. + append (bool): If True, load previous parsed data. + **kwargs: Ignored extra arguments from the wrapper. 
""" - print("parsing RefSeq organelles files") + config = load_config(config_file=working_yaml, load_previous=append) + + expected_name = config.meta["file_name"] + input_path = _locate_input_tsv(work_dir, expected_name) + print(f"Parsing RefSeq organelles: {input_path}") + + grouped = _pivot_by_assembly(input_path) + print(f"Pivoted to {len(grouped)} assemblies") + + parsed = { + key: gh_utils.parse_report_values(config.parse_fns, record) + for key, record in grouped.items() + } + + output_name = config.meta["file_name"] + config.meta["file_name"] = os.path.join(work_dir, os.path.basename(output_name)) + try: + write_parsed_tsv(parsed, config) + finally: + config.meta["file_name"] = output_name def plugin(): - """Register the flow.""" + """Register the parser plugin.""" return Parser( name="REFSEQ_ORGANELLES", func=parse_refseq_organelles, - description="Parse the RefSeq organelles files in a directory.", + description="Pivot per-organelle TSV to per-assembly and apply YAML schema.", ) if __name__ == "__main__": - """Run the flow.""" - args = parse_args() - parse_refseq_organelles(**vars(args)) + args = parse_args("Parse RefSeq organelle data into one-row-per-assembly TSV.") + parse_refseq_organelles( + working_yaml=args.yaml_path, + work_dir=os.path.dirname(args.input_path) or ".", + append=args.append, + ) diff --git a/flows/parsers/parse_sequencing_status.py b/flows/parsers/parse_sequencing_status.py index 37915d7..c7091bf 100644 --- a/flows/parsers/parse_sequencing_status.py +++ b/flows/parsers/parse_sequencing_status.py @@ -1,32 +1,114 @@ -from flows.lib.utils import Parser # noqa: E402 +"""Parse status list TSVs using a YAML schema. + +Generic parser that handles all sequencing status list inputs: +VGP, JGI 1KFG, Google Sheets project lists, NHM, CNGB, +ToL Portal, ToL Genome Notes, and similar. + +The input is a tab-separated file produced by the corresponding +updater (one row per record). 
The YAML schema describes how each +input column maps to a GoaT attribute (and may translate values +via ``translate:`` blocks). This parser: + +1. Locates the input TSV in ``work_dir`` matching ``meta.file_name`` + (or, failing that, the single TSV in the directory). +2. Reads each row as a flat dict keyed by column header. +3. Applies YAML parse functions via ``gh_utils.parse_report_values``. +4. Writes the canonical TSV using YAML-defined headers. +""" + +import os +from glob import glob + +from flows.lib.conditional_import import flow +from flows.lib.utils import ( # noqa: E402 + Parser, + load_config, + parse_tsv_with_config, + write_parsed_tsv, +) from flows.parsers.args import parse_args # noqa: E402 +def _locate_input_tsv(work_dir: str, expected_name: str) -> str: + """Find the input TSV in ``work_dir``. + + Looks first for ``expected_name`` (matching ``meta.file_name`` from + the YAML); falls back to the only ``*.tsv`` or ``*.tsv.gz`` in + ``work_dir``, raising if none or several are found. + + Args: + work_dir (str): Working directory. + expected_name (str): The filename declared in YAML ``file.name``. + + Returns: + str: Path to the input TSV. + """ + expected_path = os.path.join(work_dir, expected_name) + if os.path.exists(expected_path): + return expected_path + + candidates = sorted( + glob(os.path.join(work_dir, "*.tsv")) + + glob(os.path.join(work_dir, "*.tsv.gz")) + ) + if not candidates: + raise FileNotFoundError( + f"No TSV input found in {work_dir} (expected {expected_name})" + ) + if len(candidates) > 1: + raise ValueError( + f"Multiple TSV inputs in {work_dir}: {candidates!r}; " + "place a single source TSV or name it to match YAML file.name." + ) + return candidates[0] + + +@flow(log_prints=True) def parse_sequencing_status( - working_yaml: str, work_dir: str, append: bool, **kwargs + working_yaml: str, + work_dir: str, + append: bool = False, + **kwargs, ) -> None: - """ - Wrapper function to parse the sequencing status files. 
+ """Parse a sequencing status TSV using a YAML schema. Args: - working_yaml (str): Path to the working YAML file. - work_dir (str): Path to the working directory. - append (bool): Whether to append to the existing TSV file. - **kwargs: Additional keyword arguments. + working_yaml (str): Path to the YAML configuration file. + work_dir (str): Working directory containing the input TSV. + append (bool): If True, load previous parsed data for incremental + updates. + **kwargs: Ignored extra arguments from the wrapper. """ - print("parsing sequencing status files") + config = load_config(config_file=working_yaml, load_previous=append) + + expected_name = config.meta["file_name"] + input_path = _locate_input_tsv(work_dir, expected_name) + print(f"Parsing sequencing status: {input_path}") + + parsed = parse_tsv_with_config(input_path, config) + print(f"Parsed {len(parsed)} rows") + + output_name = config.meta["file_name"] + config.meta["file_name"] = os.path.join(work_dir, os.path.basename(output_name)) + try: + write_parsed_tsv(parsed, config) + finally: + config.meta["file_name"] = output_name def plugin(): - """Register the flow.""" + """Register the parser plugin.""" return Parser( name="SEQUENCING_STATUS", func=parse_sequencing_status, - description="Parse the sequencing status files in a directory.", + description="Parse a sequencing status TSV using a YAML schema.", ) if __name__ == "__main__": - """Run the flow.""" - args = parse_args() - parse_sequencing_status(**vars(args)) + args = parse_args("Parse a sequencing status TSV using a YAML schema.") + parse_sequencing_status( + working_yaml=args.yaml_path, + work_dir=os.path.dirname(args.input_path) or ".", + append=args.append, + ) diff --git a/flows/parsers/parse_sra_data.py b/flows/parsers/parse_sra_data.py new file mode 100644 index 0000000..cc29e50 --- /dev/null +++ b/flows/parsers/parse_sra_data.py @@ -0,0 +1,55 @@ +"""Parse SRA accession TSV using a YAML schema. 
+ +The corresponding updater (``update_sra_data``) emits a flat TSV with +columns matching the headers in ``sra.types.yaml`` (``run_accession``, +``sra_accession``, ``platform``, ``library_source``, ``reads``, +``total_runs``, ``total_reads``, ``taxon_id``). This parser delegates +to the shared generic flat-TSV pipeline. +""" + +import os + +from flows.lib.conditional_import import flow +from flows.lib.utils import Parser, run_generic_tsv_parser +from flows.parsers.args import parse_args + + +@flow(log_prints=True) +def parse_sra_data( + working_yaml: str, + work_dir: str, + append: bool = False, + **kwargs, +) -> None: + """Parse SRA accession TSV using YAML schema. + + Args: + working_yaml (str): Path to the YAML configuration file. + work_dir (str): Working directory containing the input TSV. + append (bool): If True, load previous parsed data. + **kwargs: Ignored extra arguments from the wrapper. + """ + run_generic_tsv_parser( + working_yaml=working_yaml, + work_dir=work_dir, + append=append, + key_field="run_accession", + ) + + +def plugin(): + """Register the parser plugin.""" + return Parser( + name="SRA_DATA", + func=parse_sra_data, + description="Parse SRA accession TSV using a YAML schema.", + ) + + +if __name__ == "__main__": + args = parse_args("Parse SRA accession TSV using a YAML schema.") + parse_sra_data( + working_yaml=args.yaml_path, + work_dir=os.path.dirname(args.input_path) or ".", + append=args.append, + ) diff --git a/flows/prefect.yaml b/flows/prefect.yaml index 07ae523..ee29ca7 100644 --- a/flows/prefect.yaml +++ b/flows/prefect.yaml @@ -264,3 +264,362 @@ deployments: schedules: - *daily work_pool: *goat_data_work_pool + + # ----------------------------------------------------------------------- + # Phase 1 updaters — external data fetching (migrated from goat-data) + # ----------------------------------------------------------------------- + + - name: update-vgp-original-status + # Fetch VGP original status list from GitHub YAML tracker 
(monthly) + entrypoint: flows/updaters/update_vgp_original_status.py:update_vgp_original_status + parameters: + output_path: "/home/ubuntu/tmp/test/status-lists/vgp_original_status.tsv" + s3_path: s3://goat/resources/status-lists/vgp_original_status.tsv + min_records: 100 + schedules: + - *monthly + work_pool: *goat_data_work_pool + + - name: update-vgp-status + # Fetch VGP Ordinal Phase1+ status from the live Google Sheet + entrypoint: flows/updaters/update_vgp_status.py:update_vgp_status + parameters: + output_path: "/home/ubuntu/tmp/test/status-lists/VGP_Ordinal_Phase1_plus.tsv" + s3_path: s3://goat/resources/status-lists/VGP_Ordinal_Phase1_plus.tsv + min_records: 100 + schedules: + - *daily + work_pool: *goat_data_work_pool + + - name: update-ensembl-metadata-main + # Fetch species metadata from Ensembl main site + entrypoint: flows/updaters/update_ensembl_metadata.py:update_ensembl_metadata + parameters: + output_path: "/home/ubuntu/tmp/test/assembly-data/ensembl_metadata.tsv.gz" + division: vertebrates + s3_path: s3://goat/resources/assembly-data/ensembl_metadata.tsv.gz + schedules: + - *weekly + work_pool: *goat_data_work_pool + + - name: update-ensembl-metadata-metazoa + entrypoint: flows/updaters/update_ensembl_metadata.py:update_ensembl_metadata + parameters: + output_path: "/home/ubuntu/tmp/test/assembly-data/ensembl_metazoa_metadata.tsv.gz" + division: metazoa + s3_path: s3://goat/resources/assembly-data/ensembl_metazoa_metadata.tsv.gz + schedules: + - *weekly + work_pool: *goat_data_work_pool + + - name: update-ensembl-metadata-plants + entrypoint: flows/updaters/update_ensembl_metadata.py:update_ensembl_metadata + parameters: + output_path: "/home/ubuntu/tmp/test/assembly-data/ensembl_plants_metadata.tsv.gz" + division: plants + s3_path: s3://goat/resources/assembly-data/ensembl_plants_metadata.tsv.gz + schedules: + - *weekly + work_pool: *goat_data_work_pool + + - name: update-ensembl-metadata-fungi + entrypoint: 
flows/updaters/update_ensembl_metadata.py:update_ensembl_metadata + parameters: + output_path: "/home/ubuntu/tmp/test/assembly-data/ensembl_fungi_metadata.tsv.gz" + division: fungi + s3_path: s3://goat/resources/assembly-data/ensembl_fungi_metadata.tsv.gz + schedules: + - *weekly + work_pool: *goat_data_work_pool + + - name: update-ensembl-metadata-protists + entrypoint: flows/updaters/update_ensembl_metadata.py:update_ensembl_metadata + parameters: + output_path: "/home/ubuntu/tmp/test/assembly-data/ensembl_protists_metadata.tsv.gz" + division: protists + s3_path: s3://goat/resources/assembly-data/ensembl_protists_metadata.tsv.gz + schedules: + - *weekly + work_pool: *goat_data_work_pool + + - name: update-ensembl-metadata-rapid + entrypoint: flows/updaters/update_ensembl_metadata.py:update_ensembl_metadata + parameters: + output_path: "/home/ubuntu/tmp/test/assembly-data/ensembl_rapid_metadata.tsv.gz" + division: rapid + s3_path: s3://goat/resources/assembly-data/ensembl_rapid_metadata.tsv.gz + schedules: + - *weekly + work_pool: *goat_data_work_pool + + - name: update-ucsc-assemblies + # Fetch UCSC assembly hub accession list + entrypoint: flows/updaters/update_ucsc_assemblies.py:update_ucsc_assemblies + parameters: + output_path: "/home/ubuntu/tmp/test/assembly-data/ucsc_assemblies.tsv" + s3_path: s3://goat/resources/assembly-data/ucsc_assemblies.tsv + schedules: + - *weekly + work_pool: *goat_data_work_pool + + - name: update-jgi-status + # Fetch JGI 1KFG project data via OAuth API + entrypoint: flows/updaters/update_jgi_status.py:update_jgi_status + parameters: + output_path: "/home/ubuntu/tmp/test/assembly-data/jgi_1kfg_status.tsv" + s3_path: s3://goat/resources/assembly-data/jgi_1kfg_status.tsv + schedules: + - *weekly + work_pool: + name: goat-data + work_queue_name: default + job_variables: + env: + JGI_OFFLINE_TOKEN: "{{ $JGI_OFFLINE_TOKEN }}" + + - name: update-sra-data + # Fetch SRA data via NCBI E-utilities + entrypoint: 
flows/updaters/update_sra_data.py:update_sra_data + parameters: + output_path: "/home/ubuntu/tmp/test/assembly-data/sra_data.tsv.gz" + s3_path: s3://goat/resources/assembly-data/sra_data.tsv.gz + schedules: + - *weekly + work_pool: + name: goat-data + work_queue_name: default + job_variables: + env: + NCBI_API_KEY: "{{ $NCBI_API_KEY }}" + + - name: update-blobtoolkit + # Fetch BlobToolKit analysis data via API + entrypoint: flows/updaters/update_blobtoolkit.py:update_blobtoolkit + parameters: + output_path: "/home/ubuntu/tmp/test/assembly-data/btk.tsv.gz" + s3_path: s3://goat/resources/assembly-data/btk.tsv.gz + schedules: + - *weekly + work_pool: *goat_data_work_pool + + - name: update-refseq-organelles + # Fetch and parse RefSeq organelle data from NCBI FTP + entrypoint: flows/updaters/update_refseq_organelles.py:update_refseq_organelles + parameters: + output_path: "/home/ubuntu/tmp/test/assembly-data/refseq_organelles.tsv.gz" + s3_path: s3://goat/resources/assembly-data/refseq_organelles.tsv.gz + schedules: + - *weekly + work_pool: *goat_data_work_pool + + - name: update-google-sheets-status + # Fetch project status data from Google Sheets + entrypoint: flows/updaters/update_google_sheets_status.py:update_google_sheets_status + parameters: + output_path: "/home/ubuntu/tmp/test/status-lists/google-sheets" + # This needs to be set to the URL of a TSV file containing the list of Google Sheets to fetch + index_url: "" + s3_path: s3://goat/resources/status-lists/google-sheets/ + schedules: + - *daily + work_pool: *goat_data_work_pool + + # ----------------------------------------------------------------------- + # Phase 2 fetch-parse-validate — triggered by Phase 1 update events + # ----------------------------------------------------------------------- + + - name: batch-validate-google-sheets-status + # Triggered by update.google.sheets.status.finished + # Runs fetch-parse-validate (SKIP_PARSING) for each status list TSV + entrypoint: 
flows/orchestrators/batch_validate_status_lists.py:batch_validate_status_lists + parameters: + yaml_dir: "../goat-data-main/sources/status-lists" + work_dir: "/home/ubuntu/tmp/test/status-lists/google-sheets" + s3_path: "s3://goat/sources/status-lists/" + dry_run: true + triggers: + - enabled: true + type: event + match: + prefect.resource.type: google.sheets.status + expect: + - update.google.sheets.status.finished + parameters: + yaml_dir: "../goat-data-main/sources/status-lists" + work_dir: "/home/ubuntu/tmp/test/status-lists/google-sheets" + s3_path: "s3://goat/sources/status-lists/" + dry_run: true + work_pool: *goat_data_work_pool + + - name: fetch-parse-validate-blobtoolkit + # Triggered by update.blobtoolkit.finished + # Skip parsing — blobtk validate handles taxonomy column detection + entrypoint: flows/lib/wrapper_fetch_parse_validate.py:fetch_parse_validate + parameters: + parser: "ParserEnum.SKIP_PARSING" + yaml_path: "../goat-data-main/sources/btk/btk.types.yaml" + s3_path: "s3://goat/sources/btk/" + work_dir: "/home/ubuntu/tmp/test/assembly-data" + dry_run: true + triggers: + - enabled: true + type: event + match: + prefect.resource.type: blobtoolkit + expect: + - update.blobtoolkit.finished + parameters: + parser: "ParserEnum.SKIP_PARSING" + yaml_path: "../goat-data-main/sources/btk/btk.types.yaml" + s3_path: "s3://goat/sources/btk/" + work_dir: "/home/ubuntu/tmp/test/assembly-data" + dry_run: true + work_pool: *goat_data_work_pool + + - name: fetch-parse-validate-refseq-organelles + # Triggered by update.refseq.organelles.finished + entrypoint: flows/lib/wrapper_fetch_parse_validate.py:fetch_parse_validate + parameters: + parser: "ParserEnum.REFSEQ_ORGANELLES" + yaml_path: "../goat-data-main/sources/assembly-data/refseq_organelles.types.yaml" + s3_path: "s3://goat/sources/assembly-data/" + work_dir: "/home/ubuntu/tmp/test/assembly-data" + dry_run: true + triggers: + - enabled: true + type: event + match: + prefect.resource.type: refseq.organelles + 
expect: + - update.refseq.organelles.finished + parameters: + parser: "ParserEnum.REFSEQ_ORGANELLES" + yaml_path: "../goat-data-main/sources/assembly-data/refseq_organelles.types.yaml" + s3_path: "s3://goat/sources/assembly-data/" + work_dir: "/home/ubuntu/tmp/test/assembly-data" + dry_run: true + work_pool: *goat_data_work_pool + + - name: fetch-parse-validate-sra-data + # Triggered by update.sra.data.finished + # SRA updater output already matches YAML schema — skip parsing, validate only + entrypoint: flows/lib/wrapper_fetch_parse_validate.py:fetch_parse_validate + parameters: + parser: "ParserEnum.SKIP_PARSING" + yaml_path: "../goat-data-main/sources/sra/sra.types.yaml" + s3_path: "s3://goat/sources/sra/" + work_dir: "/home/ubuntu/tmp/test/assembly-data" + dry_run: true + triggers: + - enabled: true + type: event + match: + prefect.resource.type: sra.data + expect: + - update.sra.data.finished + parameters: + parser: "ParserEnum.SKIP_PARSING" + yaml_path: "../goat-data-main/sources/sra/sra.types.yaml" + s3_path: "s3://goat/sources/sra/" + work_dir: "/home/ubuntu/tmp/test/assembly-data" + dry_run: true + work_pool: *goat_data_work_pool + + - name: fetch-parse-validate-vgp-status + # Triggered by update.vgp.status.finished + # VGP Ordinal updater output is a superset of YAML columns — skip parsing, validate only + entrypoint: flows/lib/wrapper_fetch_parse_validate.py:fetch_parse_validate + parameters: + parser: "ParserEnum.SKIP_PARSING" + yaml_path: "../goat-data-main/sources/status-lists/FILE_VGP_Ordinal_Phase1.types.yaml" + s3_path: "s3://goat/sources/status-lists/" + work_dir: "/home/ubuntu/tmp/test/status-lists" + dry_run: true + triggers: + - enabled: true + type: event + match: + prefect.resource.type: vgp.status + expect: + - update.vgp.status.finished + parameters: + parser: "ParserEnum.SKIP_PARSING" + yaml_path: "../goat-data-main/sources/status-lists/FILE_VGP_Ordinal_Phase1.types.yaml" + s3_path: "s3://goat/sources/status-lists/" + work_dir: 
"/home/ubuntu/tmp/test/status-lists" + dry_run: true + work_pool: *goat_data_work_pool + + - name: fetch-parse-validate-jgi-status + # Triggered by update.jgi.status.finished + entrypoint: flows/lib/wrapper_fetch_parse_validate.py:fetch_parse_validate + parameters: + parser: "ParserEnum.SEQUENCING_STATUS" + yaml_path: "../goat-data-main/sources/assembly-data/1kfg_manual_bioprojects.types.yaml" + s3_path: "s3://goat/sources/assembly-data/" + work_dir: "/home/ubuntu/tmp/test/assembly-data" + dry_run: true + triggers: + - enabled: true + type: event + match: + prefect.resource.type: jgi.status + expect: + - update.jgi.status.finished + parameters: + parser: "ParserEnum.SEQUENCING_STATUS" + yaml_path: "../goat-data-main/sources/assembly-data/1kfg_manual_bioprojects.types.yaml" + s3_path: "s3://goat/sources/assembly-data/" + work_dir: "/home/ubuntu/tmp/test/assembly-data" + dry_run: true + work_pool: *goat_data_work_pool + + - name: fetch-parse-validate-ucsc-assemblies + # Triggered by update.ucsc.assemblies.finished + # Skip parsing — YAML column mapping can be adjusted later if needed + entrypoint: flows/lib/wrapper_fetch_parse_validate.py:fetch_parse_validate + parameters: + parser: "ParserEnum.SKIP_PARSING" + yaml_path: "../goat-data-main/sources/assembly-data/ucsc_ids.types.yaml" + s3_path: "s3://goat/sources/assembly-data/" + work_dir: "/home/ubuntu/tmp/test/assembly-data" + dry_run: true + triggers: + - enabled: true + type: event + match: + prefect.resource.type: ucsc.assemblies + expect: + - update.ucsc.assemblies.finished + parameters: + parser: "ParserEnum.SKIP_PARSING" + yaml_path: "../goat-data-main/sources/assembly-data/ucsc_ids.types.yaml" + s3_path: "s3://goat/sources/assembly-data/" + work_dir: "/home/ubuntu/tmp/test/assembly-data" + dry_run: true + work_pool: *goat_data_work_pool + + - name: fetch-parse-validate-nhm-status + # Triggered by update.nhm.status.finished (legacy NHM updater) + # NHM updater output is a superset of YAML columns — skip 
parsing, validate only + entrypoint: flows/lib/wrapper_fetch_parse_validate.py:fetch_parse_validate + parameters: + parser: "ParserEnum.SKIP_PARSING" + yaml_path: "../goat-data-main/sources/status-lists/FILE_DTOL_nhm.types.yaml" + s3_path: "s3://goat/sources/status-lists/" + work_dir: "/home/ubuntu/tmp/test/status-lists" + dry_run: true + triggers: + - enabled: true + type: event + match: + prefect.resource.type: nhm.status + expect: + - update.nhm.status.finished + parameters: + parser: "ParserEnum.SKIP_PARSING" + yaml_path: "../goat-data-main/sources/status-lists/FILE_DTOL_nhm.types.yaml" + s3_path: "s3://goat/sources/status-lists/" + work_dir: "/home/ubuntu/tmp/test/status-lists" + dry_run: true + work_pool: *goat_data_work_pool diff --git a/flows/updaters/__pycache__/__init__.cpython-313.pyc b/flows/updaters/__pycache__/__init__.cpython-313.pyc new file mode 100644 index 0000000..e0b0acd Binary files /dev/null and b/flows/updaters/__pycache__/__init__.cpython-313.pyc differ diff --git a/flows/updaters/__pycache__/tol_utils.cpython-312.pyc b/flows/updaters/__pycache__/tol_utils.cpython-312.pyc new file mode 100644 index 0000000..d813ef1 Binary files /dev/null and b/flows/updaters/__pycache__/tol_utils.cpython-312.pyc differ diff --git a/flows/updaters/__pycache__/tol_utils.cpython-313.pyc b/flows/updaters/__pycache__/tol_utils.cpython-313.pyc new file mode 100644 index 0000000..5ca0d68 Binary files /dev/null and b/flows/updaters/__pycache__/tol_utils.cpython-313.pyc differ diff --git a/flows/updaters/__pycache__/update_ensembl_metadata.cpython-313.pyc b/flows/updaters/__pycache__/update_ensembl_metadata.cpython-313.pyc new file mode 100644 index 0000000..a26e96a Binary files /dev/null and b/flows/updaters/__pycache__/update_ensembl_metadata.cpython-313.pyc differ diff --git a/flows/updaters/__pycache__/update_google_sheets_status.cpython-313.pyc b/flows/updaters/__pycache__/update_google_sheets_status.cpython-313.pyc new file mode 100644 index 0000000..fc2aa1d 
Binary files /dev/null and b/flows/updaters/__pycache__/update_google_sheets_status.cpython-313.pyc differ diff --git a/flows/updaters/__pycache__/update_sra_data.cpython-313.pyc b/flows/updaters/__pycache__/update_sra_data.cpython-313.pyc new file mode 100644 index 0000000..8632a26 Binary files /dev/null and b/flows/updaters/__pycache__/update_sra_data.cpython-313.pyc differ diff --git a/flows/updaters/__pycache__/update_tol_portal_status.cpython-312.pyc b/flows/updaters/__pycache__/update_tol_portal_status.cpython-312.pyc new file mode 100644 index 0000000..3b8c262 Binary files /dev/null and b/flows/updaters/__pycache__/update_tol_portal_status.cpython-312.pyc differ diff --git a/flows/updaters/__pycache__/update_tol_portal_status.cpython-313.pyc b/flows/updaters/__pycache__/update_tol_portal_status.cpython-313.pyc new file mode 100644 index 0000000..0671839 Binary files /dev/null and b/flows/updaters/__pycache__/update_tol_portal_status.cpython-313.pyc differ diff --git a/flows/updaters/__pycache__/update_ucsc_assemblies.cpython-313.pyc b/flows/updaters/__pycache__/update_ucsc_assemblies.cpython-313.pyc new file mode 100644 index 0000000..c1c1a71 Binary files /dev/null and b/flows/updaters/__pycache__/update_ucsc_assemblies.cpython-313.pyc differ diff --git a/flows/updaters/__pycache__/update_vgp_status.cpython-313.pyc b/flows/updaters/__pycache__/update_vgp_status.cpython-313.pyc new file mode 100644 index 0000000..cdaf4fb Binary files /dev/null and b/flows/updaters/__pycache__/update_vgp_status.cpython-313.pyc differ diff --git a/flows/updaters/api/__pycache__/__init__.cpython-313.pyc b/flows/updaters/api/__pycache__/__init__.cpython-313.pyc new file mode 100644 index 0000000..1c8256d Binary files /dev/null and b/flows/updaters/api/__pycache__/__init__.cpython-313.pyc differ diff --git a/flows/updaters/api/__pycache__/api_config.cpython-313.pyc b/flows/updaters/api/__pycache__/api_config.cpython-313.pyc new file mode 100644 index 0000000..50ecbb2 Binary files 
/dev/null and b/flows/updaters/api/__pycache__/api_config.cpython-313.pyc differ diff --git a/flows/updaters/api/__pycache__/api_tools.cpython-313.pyc b/flows/updaters/api/__pycache__/api_tools.cpython-313.pyc new file mode 100644 index 0000000..5fcaf5f Binary files /dev/null and b/flows/updaters/api/__pycache__/api_tools.cpython-313.pyc differ diff --git a/flows/updaters/update_blobtoolkit.py b/flows/updaters/update_blobtoolkit.py new file mode 100644 index 0000000..33721fc --- /dev/null +++ b/flows/updaters/update_blobtoolkit.py @@ -0,0 +1,274 @@ +import contextlib +import csv +import gzip +import os + +from flows.lib.conditional_import import emit_event, flow, task +from flows.lib.shared_args import MIN_RECORDS, OUTPUT_PATH, S3_PATH, parse_args, required +from flows.lib.utils import _build_session, is_safe_path, upload_to_s3 + +BTK_API = "https://blobtoolkit.genomehubs.org/api/v1" +BTK_VIEW = "https://blobtoolkit.genomehubs.org/view" + +TSV_FIELDNAMES = [ + "accession", + "taxid", + "species", + "taxon_name", + "subspecies", + "id", + "source", + "sourceSlug", + "sourceStub", + "busco_lineage", + "busco_string", + "busco_complete", + "nohit", + "target", + "at_percent", + "gc_percent", + "n_percent", +] + + +def _stream_datasets(root: str, session) -> list: + """Stream BlobToolKit dataset entries for a taxon root. + + Args: + root (str): Taxonomic root to query (e.g., "Eukaryota"). + session: A requests.Session with retry support. + + Returns: + list: List of dataset metadata dicts. + """ + url = f"{BTK_API}/search/{root}" + response = session.get(url, timeout=300) + response.raise_for_status() + return response.json() + + +def _extract_stats(meta: dict) -> dict: + """Extract BlobToolKit summary stats into a flat dict row. + + Args: + meta (dict): Raw BTK dataset metadata. + + Returns: + dict: Flat row dict matching TSV_FIELDNAMES. 
+ """ + summary = meta.get("summaryStats", {}) + row = { + "accession": meta.get("accession", ""), + "taxid": str(meta.get("taxid", "")), + "species": meta.get("species", meta.get("taxon_name", "")), + "taxon_name": meta.get("taxon_name", ""), + "subspecies": "", + "id": meta.get("id", ""), + "source": "BlobToolKit", + "sourceSlug": meta.get("id", ""), + "sourceStub": "https://blobtoolkit.genomehubs.org/view/dataset/", + "busco_lineage": "", + "busco_string": "", + "busco_complete": "", + "nohit": "", + "target": "", + "at_percent": "", + "gc_percent": "", + "n_percent": "", + } + + with contextlib.suppress(KeyError): + taxon_name = meta.get("taxon_name", "") + species = meta.get("species", "") + if species and taxon_name and len(taxon_name) > len(species): + row["subspecies"] = taxon_name + + if "busco" in summary: + for lineage, stats in summary["busco"].items(): + row["busco_lineage"] = lineage + row["busco_string"] = stats.get("string", "") + total = stats.get("t", 0) + if total > 0: + row["busco_complete"] = f"{stats.get('c', 0) / total * 100:.2f}" + break + + if "stats" in summary: + row["nohit"] = f"{summary['stats'].get('noHit', 0) * 100:.2f}" + with contextlib.suppress(KeyError): + row["target"] = f"{summary['stats']['target'] * 100:.2f}" + + if "baseComposition" in summary: + bc = summary["baseComposition"] + row["at_percent"] = f"{bc.get('at', 0) * 100:.2f}" + row["gc_percent"] = f"{bc.get('gc', 0) * 100:.2f}" + row["n_percent"] = f"{bc.get('n', 0) * 100:.2f}" + + return row + + +def _describe_files(meta: dict) -> list: + """Generate analysis file descriptors for a BlobToolKit dataset. + + Args: + meta (dict): Raw BTK dataset metadata. + + Returns: + list: List of file descriptor dicts. 
+ """ + plots = ["cumulative", "snail"] + summary = meta.get("summaryStats", {}) + if summary.get("readMapping"): + plots.append("blob") + + files = [] + dataset_id = meta.get("id", "") + accession = meta.get("accession", "") + taxid = str(meta.get("taxid", "")) + + for plot in plots: + if plot == "blob": + url = f"{BTK_API}/image/{dataset_id}/{plot}/circle?format=png" + else: + url = f"{BTK_API}/image/{dataset_id}/{plot}?format=png" + files.append( + { + "name": f"{plot}.png", + "url": url, + "source_url": f"{BTK_VIEW}/{dataset_id}/dataset/{dataset_id}/{plot}", + "analysis_id": f"btk-{dataset_id}", + "description": f"a {plot} plot from BlobToolKit analysis {dataset_id}", + "title": f"{plot} plot {dataset_id}", + "command": "blobtoolkit pipeline", + "assembly_id": accession, + "taxon_id": taxid, + "analysis": { + "name": "BlobToolKit", + "title": f"BlobToolKit analysis of {accession}", + "description": (f"Analysis of public assembly {accession} " f"using BlobToolKit"), + "source": "BlobToolKit", + "source_url": (f"https://blobtoolkit.genomehubs.org/view/dataset/{dataset_id}"), + }, + } + ) + return files + + +@task(retries=2, retry_delay_seconds=30, log_prints=True) +def fetch_blobtoolkit( + output_dir: str, + root: str = "Eukaryota", + min_records: int = 1, +) -> tuple[int, int]: + """Fetch BlobToolKit data and write TSV + files YAML. + + Uses a persistent session with connection pooling for the many API calls. + + Args: + output_dir (str): Directory to write btk.tsv.gz and btk.files.yaml. + root (str): Taxonomic root to query. + min_records (int): Minimum dataset count to accept. + + Returns: + tuple[int, int]: Number of dataset rows and file entries written. 
+ """ + session = _build_session() + print(f"Fetching BlobToolKit datasets for {root}") + datasets = _stream_datasets(root, session) + print(f"Found {len(datasets)} datasets") + + if len(datasets) < min_records: + raise RuntimeError(f"BlobToolKit returned fewer than {min_records} datasets: {len(datasets)}") + + tsv_path = os.path.join(output_dir, "btk.tsv") + gz_path = os.path.join(output_dir, "btk.tsv.gz") + files_path = os.path.join(output_dir, "btk.files.yaml") + + all_rows = [] + all_files = [] + for dataset in datasets: + meta = dataset if isinstance(dataset, dict) else {} + row = _extract_stats(meta) + all_rows.append(row) + files = _describe_files(meta) + all_files.extend(files) + + with open(tsv_path, "w", newline="") as f: + writer = csv.DictWriter(f, fieldnames=TSV_FIELDNAMES, delimiter="\t", lineterminator="\n") + writer.writeheader() + for row in all_rows: + writer.writerow(row) + + with open(tsv_path, "rb") as f_in, gzip.open(gz_path, "wb") as f_out: + f_out.write(f_in.read()) + os.remove(tsv_path) + + import yaml + + with open(files_path, "w") as f: + yaml.dump(all_files, f, default_flow_style=False) + + print(f"Wrote {len(all_rows)} rows to {gz_path}") + print(f"Wrote {len(all_files)} file entries to {files_path}") + return len(all_rows), len(all_files) + + +@task(log_prints=True) +def upload_s3_files(output_dir: str, s3_path: str) -> None: + """Upload BTK output files to S3.""" + for filename in ("btk.tsv.gz", "btk.files.yaml"): + local = os.path.join(output_dir, filename) + remote = f"{s3_path.rstrip('/')}/{filename}" + if os.path.exists(local): + print(f"Uploading {local} to {remote}") + upload_to_s3(local, remote) + + +@flow() +def update_blobtoolkit( + output_path: str, + s3_path: str = "", + min_records: int = 0, +) -> bool: + """Fetch BlobToolKit analysis data and optionally upload to S3. + + Args: + output_path (str): Directory to write output files. + s3_path (str): Optional S3 directory to upload results. 
+ min_records (int): Minimum dataset count to accept. + + Returns: + bool: True on success. + """ + if not is_safe_path(output_path): + raise ValueError(f"Unsafe output path: {output_path}") + + resolved_path = os.path.abspath(output_path) + filename = "btk.tsv.gz" + if ".tsv" in resolved_path: + filename = os.path.basename(resolved_path) + resolved_path = os.path.dirname(resolved_path) + os.makedirs(resolved_path, exist_ok=True) + + row_count, file_count = fetch_blobtoolkit(resolved_path, min_records=min_records) + + if s3_path: + # upload_s3_files takes the output directory and uploads both BTK files from it + upload_s3_files(resolved_path, s3_path) + + emit_event( + event="update.blobtoolkit.finished", + resource={ + "prefect.resource.id": f"update.btk.{resolved_path}", + "prefect.resource.type": "blobtoolkit", + }, + payload={"row_count": row_count, "file_count": file_count}, + ) + return True + + +if __name__ == "__main__": + args = parse_args( + [required(OUTPUT_PATH), S3_PATH, MIN_RECORDS], + "Fetch BlobToolKit analysis data.", + ) + update_blobtoolkit(**vars(args)) diff --git a/flows/updaters/update_boat_config.py b/flows/updaters/update_boat_config.py index 4ee5303..71b27cb 100644 --- a/flows/updaters/update_boat_config.py +++ b/flows/updaters/update_boat_config.py @@ -30,29 +30,16 @@ def taxon_id_to_ssh_path(ssh_host, taxon_id, assembly_name): ssh_host, "bash", "-c", - ( - f"'. /etc/profile && module load speciesops && " - f"speciesops getdir --taxon_id {taxon_id}'" - ), + (f"'.
/etc/profile && module load speciesops && " f"speciesops getdir --taxon_id {taxon_id}'"), ] result = run_quoted(command, capture_output=True, text=True) if result.returncode != 0: - print( - ( - f"WARNING: Error fetching directory for taxon_id {taxon_id}: " - f"{result.stderr}" - ) - ) + print((f"WARNING: Error fetching directory for taxon_id {taxon_id}: " f"{result.stderr}")) return # Filter the result to get the lustre path lustre_path = [line for line in result.stdout.splitlines() if "/lustre" in line] if not lustre_path: - print( - ( - f"WARNING: No lustre path found for taxon_id {taxon_id} in result: " - f"{result.stdout}" - ) - ) + print((f"WARNING: No lustre path found for taxon_id {taxon_id} in result: " f"{result.stdout}")) return # Use the first lustre path lustre_path = lustre_path[0].strip() @@ -60,6 +47,7 @@ def taxon_id_to_ssh_path(ssh_host, taxon_id, assembly_name): def lookup_buscos(ssh_host, file_path): + busco_dirs = [] if "lustre" in file_path: if not is_safe_path(ssh_host): raise ValueError(f"Unsafe ssh host: {ssh_host}") @@ -77,9 +65,7 @@ def lookup_buscos(ssh_host, file_path): if result.returncode != 0: return [] busco_dirs = [ - os.path.basename(os.path.normpath(line)) - for line in result.stdout.splitlines() - if "/busco" in line + os.path.basename(os.path.normpath(line)) for line in result.stdout.splitlines() if "/busco" in line ] return busco_dirs @@ -117,11 +103,9 @@ def assembly_id_to_busco_sets(alt_host, assembly_id): ] busco_sets = [] for lineage in lineages: - busco_url = ( - f"https://busco.cog.sanger.ac.uk/{assembly_id}/{lineage}/full_table.tsv" - ) + busco_url = f"https://busco.cog.sanger.ac.uk/{assembly_id}/{lineage}/full_table.tsv" response = safe_get(busco_url) - if response.status_code == 200: + if response is not None and response.status_code == 200: busco_sets.append(lineage) return f"https://busco.cog.sanger.ac.uk/{assembly_id}", busco_sets @@ -192,10 +176,10 @@ def fetch_goat_results(root_taxid: str, output_path: str) -> 
list[dict]: # fetch query_url with accept header tsv. use python module requests headers = {"Accept": "text/tab-separated-values"} response = safe_get(query_url, headers=headers) + if response is None: + raise RuntimeError("Error fetching BoaT config info: No response received") if response.status_code != 200: - raise RuntimeError( - f"Error fetching BoaT config info: {response.status_code} {response.text}" - ) + raise RuntimeError(f"Error fetching BoaT config info: {response.status_code} {response.text}") # Parse the TSV response if tsv_data := parse_tsv(response.text): @@ -294,29 +278,21 @@ def fetch_boat_config_info( int: Number of lines written to the output file. """ - tsv_data = fetch_goat_results(root_taxid) + tsv_data = fetch_goat_results(root_taxid, file_path) # Prepare output files and get visited assembly IDs visited_file_path = f"{os.path.splitext(file_path)[0]}.visited" - visited_assembly_ids, line_count = prepare_output_files( - file_path, visited_file_path, append - ) + visited_assembly_ids, line_count = prepare_output_files(file_path, visited_file_path, append) for row in tsv_data: taxon_id = row["taxon_id"] assembly_id = row["assembly_id"] # Skip if the assembly_id has already been visited if assembly_id in visited_assembly_ids: - print( - ( - f"Skipping already visited assembly_id {assembly_id} " - f"for taxon_id {taxon_id}." - ) - ) + print((f"Skipping already visited assembly_id {assembly_id} " f"for taxon_id {taxon_id}.")) continue print( - f"Processing taxon_id {taxon_id}, assembly_id {assembly_id} " - f"for assembly_name {row['assembly_name']}." + f"Processing taxon_id {taxon_id}, assembly_id {assembly_id} " f"for assembly_name {row['assembly_name']}." ) # Add the assembly_id to the new visited list with open(visited_file_path, "a") as f: @@ -336,8 +312,7 @@ def fetch_boat_config_info( if not busco_sets: print( - f"Warning: No BUSCO sets found for taxon_id {taxon_id} " - f"and assembly_name {assembly_name}. Skipping." 
+ f"Warning: No BUSCO sets found for taxon_id {taxon_id} " f"and assembly_name {assembly_name}. Skipping." ) continue @@ -358,9 +333,7 @@ def fetch_boat_config_info( line_count += 1 if line_count < min_lines: - print( - f"WARNING: File {file_path} has less than {min_lines} lines: {line_count}" - ) + print(f"WARNING: File {file_path} has less than {min_lines} lines: {line_count}") # Return the number of lines written to the file return line_count @@ -419,13 +392,7 @@ def generate_md5(file_path): def filter_buscos(buscos): # Exclude bacteria_odb and archaea_odb buscos = [ - b - for b in buscos - if not ( - b.startswith("bacteria_odb") - or b.startswith("archaea_odb") - or b.startswith("mm49_") - ) + b for b in buscos if not (b.startswith("bacteria_odb") or b.startswith("archaea_odb") or b.startswith("mm49_")) ] # Group by prefix before _odb prefix_map = defaultdict(list) @@ -446,9 +413,7 @@ def filter_buscos(buscos): @task(log_prints=True) -def filter_farm_data( - farm_results_path: str, goat_results_path: str, output_path: str -) -> None: +def filter_farm_data(farm_results_path: str, goat_results_path: str, output_path: str) -> None: """Filter farm results to include only assemblies with lepidoptera BUSCOs. Combine with GoaT results to add additional fields. 
@@ -513,9 +478,7 @@ def filter_farm_data( @flow() -def update_boat_config( - root_taxid: str, output_path: str, append: bool, s3_path: str -) -> None: +def update_boat_config(root_taxid: str, output_path: str, append: bool = False, s3_path: str = "") -> None: # fetch_goat_results(root_taxid, f"{output_path}/goat_results.tsv") # trawl_farm_data( diff --git a/flows/updaters/update_ena_taxonomy_extra.py b/flows/updaters/update_ena_taxonomy_extra.py index 132c557..44eae59 100644 --- a/flows/updaters/update_ena_taxonomy_extra.py +++ b/flows/updaters/update_ena_taxonomy_extra.py @@ -33,9 +33,7 @@ def read_ncbi_tax_ids(taxdump_path: str) -> set[str]: @task(log_prints=True) -def add_jsonl_tax_ids( - jsonl_path: str, tax_ids: set[str], allowed_tax_ids: set[str] | None = None -) -> None: +def add_jsonl_tax_ids(jsonl_path: str, tax_ids: set[str], allowed_tax_ids: set[str] | None = None) -> None: print(f"Reading previously fetched ENA taxids from {jsonl_path}") filtered_path = f"{jsonl_path}.filtered" try: @@ -43,9 +41,7 @@ def add_jsonl_tax_ids( for line in f: data = json.loads(line) tax_id = data["taxId"] - if ( - allowed_tax_ids is None or tax_id in allowed_tax_ids - ) and tax_id not in tax_ids: + if (allowed_tax_ids is None or tax_id in allowed_tax_ids) and tax_id not in tax_ids: f_out.write(line) tax_ids.add(tax_id) os.replace(filtered_path, jsonl_path) @@ -59,10 +55,7 @@ def get_ena_api_taxids(root_taxid: str) -> set[str]: print(f"Fetching taxids for tax_tree({root_taxid}) from ENA API") limit = 10000000 - url = ( - f"https://www.ebi.ac.uk/ena/portal/api/search?result=taxon" - f"&query=tax_tree({root_taxid})&limit={limit}" - ) + url = f"https://www.ebi.ac.uk/ena/portal/api/search?result=taxon" f"&query=tax_tree({root_taxid})&limit={limit}" # Stream the content of the URL column_index = None @@ -138,8 +131,8 @@ def upload_s3_jsonl(local_path: str, s3_path: str) -> None: @flow() def update_ena_taxonomy_extra( - root_taxid: str, taxdump_path: str, output_path: str, s3_path: 
str, append: bool -) -> None: + root_taxid: str, taxdump_path: str, output_path: str, s3_path: str = "", append: bool = False +) -> bool: """Update the ENA taxonomy JSONL file. Args: diff --git a/flows/updaters/update_ensembl_metadata.py b/flows/updaters/update_ensembl_metadata.py new file mode 100644 index 0000000..03420df --- /dev/null +++ b/flows/updaters/update_ensembl_metadata.py @@ -0,0 +1,198 @@ +import csv +import gzip +import os +from enum import Enum + +from flows.lib.conditional_import import emit_event, flow, task +from flows.lib.shared_args import DIVISION, OUTPUT_PATH, S3_PATH, parse_args, required +from flows.lib.utils import is_safe_path, safe_get, upload_to_s3 + + +class EnsemblDivision(Enum): + """Supported Ensembl genome database divisions.""" + + FUNGI = "fungi" + METAZOA = "metazoa" + PLANTS = "plants" + PROTISTS = "protists" + RAPID = "rapid" + VERTEBRATES = "vertebrates" + + +DIVISION_URLS = { + EnsemblDivision.FUNGI: ("http://ftp.ensemblgenomes.org/pub/current/fungi/" "species_metadata_EnsemblFungi.json"), + EnsemblDivision.METAZOA: ( + "http://ftp.ensemblgenomes.org/pub/current/metazoa/" "species_metadata_EnsemblMetazoa.json" + ), + EnsemblDivision.PLANTS: ("http://ftp.ensemblgenomes.org/pub/current/plants/" "species_metadata_EnsemblPlants.json"), + EnsemblDivision.PROTISTS: ( + "http://ftp.ensemblgenomes.org/pub/current/protists/" "species_metadata_EnsemblProtists.json" + ), + EnsemblDivision.RAPID: ("https://ftp.ensembl.org/pub/rapid-release/" "species_metadata.json"), + EnsemblDivision.VERTEBRATES: ("https://ftp.ensembl.org/pub/current/" "species_metadata_EnsemblVertebrates.json"), +} + +DIVISION_OUTPUT_NAMES = { + EnsemblDivision.FUNGI: "species_metadata_EnsemblFungi.tsv.gz", + EnsemblDivision.METAZOA: "species_metadata_EnsemblMetazoa.tsv.gz", + EnsemblDivision.PLANTS: "species_metadata_EnsemblPlants.tsv.gz", + EnsemblDivision.PROTISTS: "species_metadata_EnsemblProtists.tsv.gz", + EnsemblDivision.RAPID: 
"species_metadata_EnsemblRapid.tsv.gz", + EnsemblDivision.VERTEBRATES: "species_metadata_EnsemblVertebrates.tsv.gz", +} + + +def _extract_fields(record: dict, division: EnsemblDivision) -> list: + """Extract TSV fields from a single Ensembl metadata JSON record. + + Different divisions use slightly different JSON structures for the + same conceptual fields. This normalises them to a common 5-column + format: assembly_accession, name, release_date, strain, taxonomy_id. + + Args: + record (dict): A single species metadata JSON object. + division (EnsemblDivision): The Ensembl division. + + Returns: + list: A list of 5 string values, or None if the record is invalid. + """ + if division == EnsemblDivision.RAPID: + accession = record.get("assembly_accession", "") + name = record.get("ensembl_production_name", "") + release_date = record.get("release_date", "") + strain = record.get("strain", "") + elif division == EnsemblDivision.VERTEBRATES: + assembly = record.get("assembly", {}) + organism = record.get("organism", {}) + accession = assembly.get("assembly_accession", "") + name = organism.get("url_name", "") + release_date = record.get("release_date", "") + strain = organism.get("strain", "") + else: + organism = record.get("organism", {}) + accession = record.get("assembly_accession", "") + name = organism.get("url_name", "") + release_date = record.get("release_date", "") + strain = organism.get("strain", "") + taxonomy_id = str(record.get("taxonomy_id", "")) + if not accession: + return [] + return [accession, name, release_date, strain, taxonomy_id] + + +TSV_HEADERS = [ + "assembly_accession", + "name", + "release_date", + "strain", + "taxonomy_id", +] + + +@task(retries=2, retry_delay_seconds=10, log_prints=True) +def fetch_ensembl_division( + division: EnsemblDivision, + output_dir: str, +) -> tuple[str, int]: + """Fetch Ensembl species metadata JSON and convert to gzipped TSV. + + Args: + division (EnsemblDivision): Ensembl division to fetch. 
+ output_dir (str): Directory to write the output file. + + Returns: + tuple[str, int]: Path to the output file and number of records written. + """ + url = DIVISION_URLS[division] + output_name = DIVISION_OUTPUT_NAMES[division] + output_path = os.path.join(output_dir, output_name) + + print(f"Fetching Ensembl {division.value} from {url}") + response = safe_get(url, timeout=600) + if response is None: + raise RuntimeError(f"Failed to fetch Ensembl {division.value}: no response received") + if response.status_code != 200: + raise RuntimeError( + f"Failed to fetch Ensembl {division.value}: HTTP {response.status_code} — " + f"check the URL and your network connection" + ) + response.raise_for_status() + + records = response.json() + if not isinstance(records, list): + raise ValueError(f"Expected JSON array from {url}, got {type(records).__name__}") + + tsv_path = output_path.removesuffix(".gz") + row_count = 0 + with open(tsv_path, "w", newline="") as f: + writer = csv.writer(f, delimiter="\t", lineterminator="\n") + writer.writerow(TSV_HEADERS) + for record in records: + if row := _extract_fields(record, division): + writer.writerow(row) + row_count += 1 + + with open(tsv_path, "rb") as f_in, gzip.open(output_path, "wb") as f_out: + f_out.write(f_in.read()) + os.remove(tsv_path) + + print(f"Wrote {row_count} records to {output_path}") + return output_path, row_count + + +@task(log_prints=True) +def upload_s3_file(local_path: str, s3_path: str) -> None: + """Upload file to S3.""" + print(f"Uploading {local_path} to {s3_path}") + upload_to_s3(local_path, s3_path) + + +@flow() +def update_ensembl_metadata( + output_path: str, + s3_path: str = "", + division: str = "vertebrates", +) -> bool: + """Fetch Ensembl species metadata for a given division. + + Args: + output_path (str): Directory to write output files. + s3_path (str): Optional S3 directory to upload the result. + division (str): Ensembl division name (fungi, metazoa, plants, + protists, rapid, vertebrates). 
+ + Returns: + bool: True on success. + """ + if not is_safe_path(output_path): + raise ValueError(f"Unsafe output path: {output_path}") + + resolved_path = os.path.abspath(output_path) + os.makedirs(resolved_path, exist_ok=True) + + div = EnsemblDivision(division.lower()) + local_file, row_count = fetch_ensembl_division(div, resolved_path) + + if s3_path: + output_name = DIVISION_OUTPUT_NAMES[div] + remote_path = f"{s3_path.rstrip('/')}/{output_name}" + upload_s3_file(local_file, remote_path) + + emit_event( + event="update.ensembl.metadata.finished", + resource={ + "prefect.resource.id": f"update.ensembl.{division}.{resolved_path}", + "prefect.resource.type": "ensembl.metadata", + "prefect.resource.division": division, + }, + payload={"division": division, "row_count": row_count}, + ) + return True + + +if __name__ == "__main__": + args = parse_args( + [required(OUTPUT_PATH), S3_PATH, DIVISION], + "Fetch Ensembl species metadata for a given division.", + ) + update_ensembl_metadata(**vars(args)) diff --git a/flows/updaters/update_genomehubs_taxonomy.py b/flows/updaters/update_genomehubs_taxonomy.py index d4eb7e9..7d58ce1 100644 --- a/flows/updaters/update_genomehubs_taxonomy.py +++ b/flows/updaters/update_genomehubs_taxonomy.py @@ -20,12 +20,13 @@ def get_file_paths_from_config(config: dict, file_paths: dict) -> dict: key = config.get("xref_label") input_path = config.get("path") - output_path = config.get("out") + output_path = config.get("out", "./taxonomy.jsonl") if key is not None and input_path is not None: file_paths[key] = { "input": input_path, } - return output_path + file_paths["out"] = output_path + return file_paths @task(log_prints=True) @@ -39,9 +40,7 @@ def read_input_config(input_path: str) -> dict: print(f"Error reading {input_path}: {e}") exit() try: - output_path = get_file_paths_from_config(config, file_paths) - if output_path is not None: - file_paths["out"] = output_path + file_paths = get_file_paths_from_config(config, file_paths) for 
taxonomy in config.get("taxonomies", []): get_file_paths_from_config(taxonomy, file_paths) except Exception as e: @@ -79,8 +78,9 @@ def run_blobtk_taxonomy(root_taxid: str, input_path: str, output_path: str) -> N text=True, bufsize=1, ) - for line in process.stdout: - print(line, end="") + if process.stdout is not None: + for line in process.stdout: + print(line, end="") process.wait() if process.returncode != 0: print(f"Command failed with exit code {process.returncode}") @@ -103,9 +103,7 @@ def upload_s3_file(local_path: str, s3_path: str) -> None: @flow() -def update_genomehubs_taxonomy( - root_taxid: str, input_path: str, output_path: str, s3_path: str -) -> None: +def update_genomehubs_taxonomy(root_taxid: str, input_path: str, output_path: str, s3_path: str = "") -> None: """Update the GenomeHubs taxonomy JSONL file. Args: diff --git a/flows/updaters/update_google_sheets_status.py b/flows/updaters/update_google_sheets_status.py new file mode 100644 index 0000000..390389a --- /dev/null +++ b/flows/updaters/update_google_sheets_status.py @@ -0,0 +1,446 @@ +"""Fetch project status data from Google Sheets. + +Replaces the legacy R script (get_googlesheets.R) and Python pipeline +(import_status.py + import_status_lib.py). Fetches four categories: + +1. Project status sheets: a private TSV index pointing to ~26 project + spreadsheets that follow the GoaT schema 2.5 format. +2. DTOL Plant Genome Size Estimates: Kew genome size data. +3. DTOL assembly informatics status: tolqc kmer draft sizes. +4. CNGB project status. + +Outputs are per-project expanded TSV files matching legacy format. 
+""" + +import csv +import io +import os + +import numpy as np +import pandas as pd + +from flows.lib.conditional_import import emit_event, flow, task +from flows.lib.shared_args import ( + INDEX_URL, + MIN_RECORDS, + OUTPUT_PATH, + S3_PATH, + parse_args, + required, +) +from flows.lib.utils import is_safe_path, safe_get, upload_to_s3 + +# Google Sheets URLs for non-project-status data +DTOL_PLANT_GENOME_SIZE_URL = ( + "https://docs.google.com/spreadsheets/d/e/" + "2PACX-1vSt0R1T3MpoOM6UFNMaT_Q9gR5TYyUZC1wgLqW_6_cH9zzII8ehadrbHX8bpktjTv2_yt_KHaj3x_e1" + "/pub?output=tsv" +) +DTOL_TOLQC_STATUS_URL = ( + "https://docs.google.com/spreadsheets/d/e/" + "2PACX-1vTU-En_URbYPtfyjBueQhnz7wYHt-OHVxvRyv9tNvCUPCTX9EEzxOL41QCUh6hgVNv-Vv_gLSAMJXv-" + "/pub?gid=1442224132&single=true&output=tsv" +) +CNGB_URL = ( + "https://docs.google.com/spreadsheets/d/e/" + "2PACX-1vQeTqi-qnoNgNl58gWDBT4CcR8nF9SmFOkC82KC6pkH42CoEi94yInhBE25SfxBqNeMBeVbpeEVs9GI" + "/pub?gid=1726876704&single=true&output=tsv" +) + + +# --------------------------------------------------------------------------- +# Project status processing (port of import_status_lib.py) +# --------------------------------------------------------------------------- + + +def _open_google_spreadsheet(acronym: str, url: str, header_index: int) -> pd.DataFrame: + """Download a published Google Sheet as TSV and return a DataFrame.""" + encodings = ["utf-8", "ISO-8859-1", "latin1"] + response = safe_get(url, timeout=120) + if response is None: + raise RuntimeError(f"Failed to fetch sheet for {acronym}: no response received") + response.raise_for_status() + + df = None + for enc in encodings: + try: + content = response.content.decode(enc) + df = pd.read_csv( + io.StringIO(content), + delimiter="\t", + header=header_index, + dtype=object, + quoting=csv.QUOTE_NONE, + ) + break + except (UnicodeDecodeError, pd.errors.ParserError): + continue + + if df is None: + raise ValueError(f"Failed to decode sheet for {acronym}") + + 
df.rename(columns={"#NCBI_taxon_id": "NCBI_taxon_id"}, inplace=True) + df["project"] = acronym.upper() + return df + + +def _general_cleanup(df: pd.DataFrame) -> pd.DataFrame: + """Replace whitespace-only cells with NaN, drop empty rows/cols.""" + df = df.replace(r"^\s*$", np.nan, regex=True) + df = df.replace("publication_available", "published", regex=False) + df.dropna(how="all", axis=1, inplace=True) + df.dropna(how="all", axis=0, inplace=True) + df.rename(columns={"#NCBI_taxon_id": "NCBI_taxon_id"}, inplace=True) + return df + + +def _cleanup_headers(df: pd.DataFrame) -> pd.DataFrame: + """Normalise column headers: lowercase, underscored, no parens.""" + df.columns = ( + df.columns.str.replace(" ", "_") + .str.replace(r"\(", "", regex=True) + .str.replace(r"\)", "", regex=True) + .str.lower() + ) + return df + + +def _create_mandatory_columns(df: pd.DataFrame) -> pd.DataFrame: + """Ensure mandatory columns exist.""" + for col in [ + "ncbi_taxon_id", + "species", + "family", + "synonym", + "publication_id", + "contributing_project_lab", + "target_list_status", + "sequencing_status", + ]: + if col not in df.columns: + df[col] = None + return df + + +def _expand_target_status(df: pd.DataFrame, acronym: str) -> pd.DataFrame: + """Populate long_list, family_representative, other_priority columns.""" + for col in ["long_list", "family_representative", "other_priority"]: + if col not in df.columns: + df[col] = None + df["long_list"] = acronym + + lower = acronym.lower() + fr_mask = df["target_list_status"].isin([f"{lower}_family_representative", "family_representative"]) + df.loc[fr_mask, "family_representative"] = acronym + + op_mask = df["target_list_status"].isin([f"{lower}_other_priority", "other_priority"]) + df.loc[op_mask, "other_priority"] = acronym + return df + + +def _reduce_sequencing_status(df: pd.DataFrame, acronym: str) -> pd.DataFrame: + """Map project-prefixed statuses to simple GoaT statuses.""" + status_map = { + f"{acronym}_published": 
"published", + f"{acronym}_insdc_open": "insdc_open", + f"{acronym}_open": "open", + f"{acronym}_insdc_submitted": "in_progress", + f"{acronym}_in_assembly": "in_progress", + f"{acronym}_data_generation": "in_progress", + f"{acronym}_in_progress": "in_progress", + f"{acronym}_sample_acquired": "sample_acquired", + f"{acronym}_sample_collected": "sample_collected", + } + df["sequencing_status"] = df["sequencing_status"].replace(status_map) + return df + + +def _create_status_columns(df: pd.DataFrame, acronym: str) -> pd.DataFrame: + """Create and populate per-status columns.""" + statuses = [ + "sample_collected", + "sample_acquired", + "in_progress", + "data_generation", + "in_assembly", + "insdc_submitted", + "open", + "insdc_open", + "published", + ] + for s in statuses: + if s not in df.columns: + df[s] = None + df.loc[df["sequencing_status"] == s, s] = acronym + return df + + +def _expand_sequencing_status(df: pd.DataFrame, acronym: str) -> pd.DataFrame: + """Cascade statuses upward: published implies insdc_open, etc.""" + df.loc[df["published"] == acronym, "insdc_open"] = acronym + df.loc[df["insdc_open"] == acronym, "open"] = acronym + df.loc[df["open"] == acronym, "in_progress"] = acronym + df.loc[df["data_generation"] == acronym, "in_progress"] = acronym + df.loc[df["in_assembly"] == acronym, "in_progress"] = acronym + df.loc[df["in_progress"] == acronym, "data_generation"] = acronym + df.loc[df["in_progress"] == acronym, "sample_acquired"] = acronym + df.loc[df["sample_acquired"] == acronym, "sample_collected"] = acronym + return df + + +def _process_project(acronym: str, url: str, header_row: int) -> pd.DataFrame: + """Full processing pipeline for one project status sheet.""" + df = _open_google_spreadsheet(acronym, url, header_row) + df = _general_cleanup(df) + df = _cleanup_headers(df) + df = _create_mandatory_columns(df) + df = _expand_target_status(df, acronym) + df = _reduce_sequencing_status(df, acronym) + df = _create_status_columns(df, acronym) + 
df = _expand_sequencing_status(df, acronym) + return df + + +# --------------------------------------------------------------------------- +# Dedicated sheet fetchers (port of get_googlesheets.R) +# --------------------------------------------------------------------------- + + +def _fetch_dtol_plant_genome_sizes(output_path: str) -> int: + """Fetch DTOL Plant Genome Size Estimates from Kew.""" + response = safe_get(DTOL_PLANT_GENOME_SIZE_URL, timeout=120) + if response is None: + raise RuntimeError("Failed to fetch DTOL Plant Genome Size Estimates: no response received") + response.raise_for_status() + df = pd.read_csv(io.StringIO(response.text), delimiter="\t", dtype=str) + df.columns = ( + df.columns.str.strip() + .str.replace(" ", "_") + .str.replace(r"\(", "", regex=True) + .str.replace(r"\)", "", regex=True) + .str.lower() + ) + df = df.dropna(how="all", axis=0) + df = df[df["genus"].notna() & (df.get("project", pd.Series()) == "DTOL")] + df["primary"] = "1" + df.to_csv(output_path, sep="\t", index=False) + return len(df) + + +def _fetch_dtol_tolqc_status(output_path: str) -> int: + """Fetch DTOL assembly informatics status (kmer draft).""" + response = safe_get(DTOL_TOLQC_STATUS_URL, timeout=120) + if response is None: + raise RuntimeError("Failed to fetch DTOL assembly informatics status: no response received") + response.raise_for_status() + df = pd.read_csv( + io.StringIO(response.text), + delimiter="\t", + dtype=str, + na_values=["NA", "missing", "", "NULL"], + ) + df.columns = ( + df.columns.str.strip() + .str.replace(".", "") + .str.replace(" ", "_") + .str.replace(r"\(", "", regex=True) + .str.replace(r"\)", "", regex=True) + .str.lower() + ) + df = df.dropna(how="all", axis=0) + df = df[df["taxon"].notna()] + df = df[df["accession"].isna() | ~df["accession"].str.startswith("GCA_", na=False)] + df = df[~df["statussummary"].str.startswith("9", na=False)] + df = df[~df["statussummary"].str.startswith("5", na=False)] + df = df[["taxon", "est_size_mb", 
"length_mb"]].copy() + for col in ["est_size_mb", "length_mb"]: + df[col] = pd.to_numeric(df[col], errors="coerce") + df = df[df["est_size_mb"].notna() | df["length_mb"].notna()] + df.to_csv(output_path, sep="\t", index=False) + return len(df) + + +def _fetch_cngb(output_path: str) -> int: + """Fetch CNGB project status sheet.""" + response = safe_get(CNGB_URL, timeout=120) + if response is None: + raise RuntimeError("Failed to fetch CNGB project status sheet: no response received") + response.raise_for_status() + df = pd.read_csv( + io.StringIO(response.text), + delimiter="\t", + dtype=str, + na_values=["NA", "missing", "", "NULL"], + ) + df = df.dropna(how="all", axis=0) + df.to_csv(output_path, sep="\t", index=False) + return len(df) + + +# --------------------------------------------------------------------------- +# Prefect tasks and flow +# --------------------------------------------------------------------------- + + +@task(retries=2, retry_delay_seconds=30, log_prints=True) +def fetch_project_status_sheets(index_url: str, output_dir: str) -> dict: + """Fetch all project status sheets listed in the private index TSV. + + Args: + index_url (str): URL (or path) to the index TSV with columns + project_acronym, published_url, start_header_line. + output_dir (str): Directory to write per-project expanded TSVs. + + Returns: + dict: Mapping of project acronym to row count. 
+ """ + response = safe_get(index_url, timeout=60) + if response is None: + raise RuntimeError("Failed to fetch project status sheets index: no response received") + response.raise_for_status() + + index_df = pd.read_csv( + io.StringIO(response.text), + delimiter="\t", + usecols=["project_acronym", "published_url", "start_header_line"], + dtype={"project_acronym": str, "published_url": str, "start_header_line": int}, + ) + print(f"Found {len(index_df)} project sheets in index") + + results = {} + for _, row in index_df.iterrows(): + acronym = row["project_acronym"] + url = row["published_url"] + header_row = int(row["start_header_line"]) + print(f"Processing {acronym} (header row {header_row})") + try: + df = _process_project(acronym, url, header_row) + out_file = os.path.join(output_dir, f"{acronym}_expanded.tsv") + df.to_csv(out_file, sep="\t", index=False) + results[acronym] = len(df) + print(f" {acronym}: {len(df)} rows") + except Exception as exc: + print(f" {acronym}: FAILED — {exc}") + failed_path = os.path.join(output_dir, f"{acronym}_expanded.tsv.failed") + open(failed_path, "w").close() # noqa: SIM115 — legacy compat + results[acronym] = 0 + print(results) + return results + + +@task(retries=2, retry_delay_seconds=30, log_prints=True) +def fetch_other_sheets(output_dir: str) -> dict: + """Fetch the three non-project Google Sheets (DTOL plant, tolqc, CNGB). + + Args: + output_dir (str): Directory to write TSV files. + + Returns: + dict: Mapping of filename to row count. 
+ """ + results = {} + + plant_path = os.path.join(output_dir, "DTOL_Plant_Genome_Size_Estimates.tsv") + try: + results["DTOL_Plant_Genome_Size_Estimates"] = _fetch_dtol_plant_genome_sizes(plant_path) + print(f"Plant genome sizes: {results['DTOL_Plant_Genome_Size_Estimates']} rows") + except Exception as exc: + print(f"Plant genome sizes: FAILED — {exc}") + results["DTOL_Plant_Genome_Size_Estimates"] = 0 + + tolqc_path = os.path.join(output_dir, "DTOL_assembly_informatics_status_kmer_draft.tsv") + try: + results["DTOL_tolqc_status"] = _fetch_dtol_tolqc_status(tolqc_path) + print(f"DTOL tolqc status: {results['DTOL_tolqc_status']} rows") + except Exception as exc: + print(f"DTOL tolqc status: FAILED — {exc}") + results["DTOL_tolqc_status"] = 0 + + cngb_path = os.path.join(output_dir, "cngb.tsv") + try: + results["cngb"] = _fetch_cngb(cngb_path) + print(f"CNGB: {results['cngb']} rows") + except Exception as exc: + print(f"CNGB: FAILED — {exc}") + results["cngb"] = 0 + + return results + + +@task(log_prints=True) +def upload_s3_dir(local_dir: str, s3_path: str) -> None: + """Upload all TSV files in a directory to S3.""" + for fname in sorted(os.listdir(local_dir)): + if fname.endswith(".tsv") or fname.endswith(".tsv.gz"): + local_path = os.path.join(local_dir, fname) + remote_path = f"{s3_path.rstrip('/')}/{fname}" + print(f"Uploading {fname} to {remote_path}") + upload_to_s3(local_path, remote_path) + + +@flow() +def update_google_sheets_status( + output_path: str, + index_url: str, + s3_path: str = "", + min_records: int = 0, +) -> bool: + """Fetch all Google Sheets project status and supplementary data. + + Args: + output_path (str): Directory to write output TSVs. + index_url (str): URL to the private index TSV (from env + GOAT_SHEETS_INDEX_URL if not provided). + s3_path (str): Optional S3 path to upload results. + min_records (int): Minimum total records to accept. + + Returns: + bool: True on success. 
+ """ + if not is_safe_path(output_path): + raise ValueError(f"Unsafe output path: {output_path}") + + resolved_path = os.path.abspath(output_path) + os.makedirs(resolved_path, exist_ok=True) + + if index_url is None: + index_url = os.environ.get("GOAT_SHEETS_INDEX_URL", "") + + total = 0 + project_results = {} + if index_url: + project_results = fetch_project_status_sheets(index_url, resolved_path) + total += sum(project_results.values()) + else: + print("No index URL provided — skipping project status sheets") + + other_results = fetch_other_sheets(resolved_path) + total += sum(other_results.values()) + + if total < min_records: + raise RuntimeError(f"Google Sheets: fewer than {min_records} total records: {total}") + + if s3_path: + upload_s3_dir(resolved_path, s3_path) + + emit_event( + event="update.google.sheets.status.finished", + resource={ + "prefect.resource.id": f"update.google.sheets.status.{resolved_path}", + "prefect.resource.type": "google.sheets.status", + }, + payload={ + "total_records": total, + "projects": len(project_results), + "other_sheets": len(other_results), + }, + ) + return True + + +if __name__ == "__main__": + args = parse_args( + [required(OUTPUT_PATH), required(INDEX_URL), S3_PATH, MIN_RECORDS], + "Fetch project status data from Google Sheets.", + ) + update_google_sheets_status(**vars(args)) diff --git a/flows/updaters/update_jgi_status.py b/flows/updaters/update_jgi_status.py new file mode 100644 index 0000000..2910bf9 --- /dev/null +++ b/flows/updaters/update_jgi_status.py @@ -0,0 +1,214 @@ +import csv +import os + +from flows.lib.conditional_import import emit_event, flow, task +from flows.lib.shared_args import MIN_RECORDS, OUTPUT_PATH, S3_PATH, parse_args, required +from flows.lib.utils import is_safe_path, safe_get, upload_to_s3 + +JGI_BASE_URL = "https://gold-ws.jgi.doe.gov" +JGI_STUDY_ID = "Gs0000001" + +FIELDNAMES = [ + "projectGoldId", + "projectName", + "legacyGoldId", + "studyGoldId", + "biosampleGoldId", + 
"organismGoldId", + "itsProposalId", + "itsSpid", + "itsSampleId", + "pmoProjectId", + "gptsProposalId", + "ncbiBioProjectAccession", + "ncbiBioSampleAccession", + "projectStatus", + "sequencingStatus", + "jgiFundingProgram", + "jgiFundingYear", + "hmpId", + "modDate", + "addDate", + "sequencingStrategy", + "sequencingCenters", + "seqMethod", + "genomePublications", + "otherPublications", + "sraExperimentIds", + "ncbiTaxId", +] + + +def _exchange_token(offline_token: str) -> str: + """Exchange a JGI offline token for an access token. + + Args: + offline_token (str): The JGI offline (API) token. + + Returns: + str: A valid access token. + + Raises: + RuntimeError: If the token exchange fails. + """ + url = f"{JGI_BASE_URL}/exchange?offlineToken={offline_token}" + response = safe_get(url, timeout=30) + if response is None: + raise RuntimeError("JGI token exchange failed: no response received") + if response.status_code != 200: + raise RuntimeError( + f"JGI token exchange failed: HTTP {response.status_code} — " f"check that JGI_OFFLINE_TOKEN is valid" + ) + if not (token := response.content.decode().strip()): + raise RuntimeError("JGI token exchange returned empty access token") + return token + + +def _fetch_organisms(access_token: str) -> dict: + """Fetch organism-to-taxid mapping from JGI GOLD API. + + Args: + access_token (str): Valid JGI access token. + + Returns: + dict: Mapping of organismGoldId to ncbiTaxId. 
+ """ + headers = {"Authorization": f"Bearer {access_token}", "Accept": "application/json"} + url = f"{JGI_BASE_URL}/api/v1/organisms?studyGoldId={JGI_STUDY_ID}" + response = safe_get(url, headers=headers, timeout=120) + if response is None: + raise RuntimeError("JGI organism fetch failed: no response received") + if response.status_code != 200: + raise RuntimeError( + f"JGI organism fetch failed: HTTP {response.status_code} — " f"check that JGI_OFFLINE_TOKEN is valid" + ) + response.raise_for_status() + organisms = response.json() + return {org["organismGoldId"]: org.get("ncbiTaxId", "") for org in organisms} + + +def _fetch_projects(access_token: str) -> list: + """Fetch project records from JGI GOLD API. + + Args: + access_token (str): Valid JGI access token. + + Returns: + list: List of project dictionaries. + """ + headers = {"Authorization": f"Bearer {access_token}", "Accept": "application/json"} + url = f"{JGI_BASE_URL}/api/v1/projects?studyGoldId={JGI_STUDY_ID}" + response = safe_get(url, headers=headers, timeout=120) + if response is None: + raise RuntimeError("JGI project fetch failed: no response received") + if response.status_code != 200: + raise RuntimeError( + f"JGI project fetch failed: HTTP {response.status_code} — " f"check that JGI_OFFLINE_TOKEN is valid" + ) + response.raise_for_status() + return response.json() + + +@task(retries=2, retry_delay_seconds=10, log_prints=True) +def fetch_jgi_tsv(file_path: str, min_lines: int = 1) -> int: + """Fetch JGI 1KFG project data and write to TSV. + + Exchanges the offline token for an access token, fetches organism-taxid + mapping and project records, filters for whole genome sequencing projects, + and writes a TSV. + + Args: + file_path (str): Path to the output TSV file. + min_lines (int): Minimum number of data rows expected. + + Returns: + int: Number of lines written (including header). 
+ """ + offline_token = os.environ.get("JGI_OFFLINE_TOKEN") + if not offline_token: + raise RuntimeError( + "JGI_OFFLINE_TOKEN environment variable is not set — " "cannot authenticate with JGI GOLD API" + ) + + print("Exchanging JGI offline token for access token") + access_token = _exchange_token(offline_token) + + print(f"Fetching organisms for study {JGI_STUDY_ID}") + org_to_taxid = _fetch_organisms(access_token) + print(f"Found {len(org_to_taxid)} organisms") + + print(f"Fetching projects for study {JGI_STUDY_ID}") + projects = _fetch_projects(access_token) + print(f"Found {len(projects)} total projects") + + source_fields = [f for f in FIELDNAMES if f != "ncbiTaxId"] + row_count = 0 + with open(file_path, "w", newline="") as f: + writer = csv.writer(f, delimiter="\t", lineterminator="\n") + writer.writerow(FIELDNAMES) + for project in projects: + if project.get("sequencingStrategy") != "Whole Genome Sequencing": + continue + organism_id = project.get("organismGoldId", "") + taxid = org_to_taxid.get(organism_id, "") + row = [project.get(field, "") for field in source_fields] + [taxid] + writer.writerow(row) + row_count += 1 + + line_count = row_count + 1 # include header + if row_count < min_lines: + raise RuntimeError(f"JGI file has fewer than {min_lines} data rows: {row_count}") + print(f"Wrote {row_count} WGS projects to {file_path}") + return line_count + + +@task(log_prints=True) +def upload_s3_tsv(local_path: str, s3_path: str) -> None: + """Upload JGI TSV to S3.""" + print(f"Uploading JGI TSV from {local_path} to {s3_path}") + upload_to_s3(local_path, s3_path) + + +@flow() +def update_jgi_status( + output_path: str, + s3_path: str = "", + min_records: int = 0, +) -> bool: + """Fetch JGI 1KFG status list and optionally upload to S3. + + Args: + output_path (str): Path to the output TSV file. + s3_path (str): Optional S3 path to upload the result. + min_records (int): Minimum record count to accept the output. + + Returns: + bool: True on success. 
+ """ + if not is_safe_path(output_path): + raise ValueError(f"Unsafe output path: {output_path}") + resolved_path = os.path.abspath(output_path) + os.makedirs(os.path.dirname(resolved_path), exist_ok=True) + + line_count = fetch_jgi_tsv(resolved_path, min_records) + + if line_count > min_records and s3_path: + upload_s3_tsv(resolved_path, s3_path) + + emit_event( + event="update.jgi.status.finished", + resource={ + "prefect.resource.id": f"update.jgi.{output_path}", + "prefect.resource.type": "jgi.status", + }, + payload={"line_count": line_count}, + ) + return True + + +if __name__ == "__main__": + args = parse_args( + [required(OUTPUT_PATH), S3_PATH, MIN_RECORDS], + "Fetch JGI 1KFG status list.", + ) + update_jgi_status(**vars(args)) diff --git a/flows/updaters/update_ncbi_datasets.py b/flows/updaters/update_ncbi_datasets.py index e2f1e4c..c02ca1b 100644 --- a/flows/updaters/update_ncbi_datasets.py +++ b/flows/updaters/update_ncbi_datasets.py @@ -202,7 +202,7 @@ def generate_md5(file_path): def update_ncbi_datasets( root_taxid: str, output_path: str, - s3_path: str, + s3_path: str = "", data_freeze_path: Optional[str] = None, ) -> bool: line_count = fetch_ncbi_datasets_summary(root_taxid, file_path=output_path, data_freeze_path=data_freeze_path) diff --git a/flows/updaters/update_ncbi_taxonomy.py b/flows/updaters/update_ncbi_taxonomy.py index 890d11a..f6f9935 100644 --- a/flows/updaters/update_ncbi_taxonomy.py +++ b/flows/updaters/update_ncbi_taxonomy.py @@ -86,7 +86,7 @@ def taxonomy_is_up_to_date(local_path: str, http_path: str) -> bool: @flow() -def update_ncbi_taxonomy(output_path: str) -> None: +def update_ncbi_taxonomy(output_path: str) -> bool: """Fetch and the NCBI taxonomy dump. 
Args: diff --git a/flows/updaters/update_nhm_status_list.py b/flows/updaters/update_nhm_status_list.py index 707e15d..6e75055 100644 --- a/flows/updaters/update_nhm_status_list.py +++ b/flows/updaters/update_nhm_status_list.py @@ -43,9 +43,7 @@ def fetch_nhm_tsv( # If the file has less than min_records lines, raise an error if line_count < min_lines: - raise RuntimeError( - f"File {file_path} has less than {min_lines} lines: {line_count}" - ) + raise RuntimeError(f"File {file_path} has less than {min_lines} lines: {line_count}") # Return the line count return line_count @@ -57,7 +55,7 @@ def upload_s3_tsv(local_path: str, s3_path: str) -> None: @flow() -def update_nhm_status_list(output_path: str, s3_path: str, min_records: int) -> None: +def update_nhm_status_list(output_path: str, s3_path: str = "", min_records: int = 0) -> bool: """Update the NHM status list TSV file.""" os.makedirs(os.path.dirname(output_path), exist_ok=True) line_count = fetch_nhm_tsv(output_path, min_records) @@ -67,7 +65,7 @@ def update_nhm_status_list(output_path: str, s3_path: str, min_records: int) -> event="update.nhm.tsv.finished", resource={ "prefect.resource.id": f"update.nhm.{output_path}", - "prefect.resource.type": "nhm.tsv", + "prefect.resource.type": "nhm.status", }, payload={"line_count": line_count}, ) @@ -82,9 +80,3 @@ def update_nhm_status_list(output_path: str, s3_path: str, min_records: int) -> ) update_nhm_status_list(**vars(args)) - args = parse_args( - [required(OUTPUT_PATH), S3_PATH, MIN_RECORDS], - "Fetch species data from NHM.", - ) - - update_nhm_status_list(**vars(args)) diff --git a/flows/updaters/update_ott_taxonomy.py b/flows/updaters/update_ott_taxonomy.py index 7280a14..725df70 100644 --- a/flows/updaters/update_ott_taxonomy.py +++ b/flows/updaters/update_ott_taxonomy.py @@ -42,9 +42,7 @@ def fetch_ott_taxonomy( # Find the extracted subdirectory (should start with 'ott') extracted_dirs = [ - d - for d in os.listdir(local_path) - if 
os.path.isdir(os.path.join(local_path, d)) and d.startswith("ott") + d for d in os.listdir(local_path) if os.path.isdir(os.path.join(local_path, d)) and d.startswith("ott") ] if not extracted_dirs: raise RuntimeError("No extracted ott directory found.") @@ -111,28 +109,30 @@ def set_ott_url() -> str: # Extract required fields source = ott_json.get("source", "") - name = ott_json.get("name", "") - version = ott_json.get("version", "") # Replace "draft" with "." in source to get OTT_VERSION ott_version = source.replace("draft", ".") - ott_major_version = f"{name}{version}" - return ( - f"https://files.opentreeoflife.org/ott/" - f"{ott_major_version}/{ott_version}.tgz" - ) + # May need to restore this if OTT switches back to the major-version URL structure; for now the full version is used in the URL + # name = ott_json.get("name", "") + # version = ott_json.get("version", "") + # ott_major_version = f"{name}{version}" + + # return ( + # f"https://files.opentreeoflife.org/ott/" + # f"{ott_major_version}/{ott_version}.tgz" + # ) + return f"https://files.opentreeoflife.org/ott/{ott_version}/{ott_version}.tgz" @flow() -def update_ott_taxonomy(output_path: str) -> None: +def update_ott_taxonomy(output_path: str) -> bool: """Fetch the OTT taxonomy file. Args: output_path (str): Path to save the taxonomy dump.
""" http_path = set_ott_url() - status = None complete = False if ott_taxonomy_is_up_to_date(output_path, http_path): status = True diff --git a/flows/updaters/update_refseq_organelles.py b/flows/updaters/update_refseq_organelles.py new file mode 100644 index 0000000..68511b7 --- /dev/null +++ b/flows/updaters/update_refseq_organelles.py @@ -0,0 +1,322 @@ +import contextlib +import csv +import gzip +import os +import re +import tempfile +from collections import Counter + +from flows.lib.conditional_import import emit_event, flow, task +from flows.lib.shared_args import ( + MIN_RECORDS, + OUTPUT_PATH, + ROOT_TAXID, + S3_PATH, + parse_args, + required, +) +from flows.lib.utils import is_safe_path, safe_get, upload_to_s3 + +REFSEQ_FTP = "https://ftp.ncbi.nlm.nih.gov/refseq/release" + +ORGANELLE_FIELDNAMES = [ + "id", + "organelle", + "taxonId", + "genbankAccession", + "assemblySpan", + "gcPercent", + "nPercent", + "releaseDate", + "sourceAuthor", + "sourceYear", + "sourceTitle", + "pubmedId", + "bioproject", + "biosample", + "sampleLocation", +] + +MONTHS = { + "JAN": "01", + "FEB": "02", + "MAR": "03", + "APR": "04", + "MAY": "05", + "JUN": "06", + "JUL": "07", + "AUG": "08", + "SEP": "09", + "OCT": "10", + "NOV": "11", + "DEC": "12", +} + + +def _reformat_date(date_str: str) -> str: + """Convert DD-MMM-YYYY to YYYY-MM-DD.""" + parts = re.split(r"[:\-]", date_str) + if len(parts) < 3: + return date_str + return f"{parts[2]}-{MONTHS.get(parts[1].upper(), '00')}-{parts[0].zfill(2)}" + + +def _refseq_listing(collection: str) -> list: + """Fetch directory listing of GenBank files for a RefSeq collection. + + Args: + collection (str): Collection name (e.g., "mitochondrion"). + + Returns: + list: URLs of .genomic.gbff.gz files. 
+ """ + pattern = re.compile(r"(\w+\.\d+\.genomic\.gbff\.gz)") + url = f"{REFSEQ_FTP}/{collection}" + response = safe_get(url, timeout=120) + if response is None: + raise RuntimeError(f"Failed to fetch RefSeq listing for {collection}: no response received") + response.raise_for_status() + return [f"{url}/{match[1]}" for line in response.text.split("\n") if (match := pattern.search(line))] + + +def _parse_features(entry, fields: dict) -> None: + """Extract taxonId and sample location from SeqRecord features.""" + qualifiers = entry.features[0].qualifiers + if "db_xref" in qualifiers: + for xref in qualifiers["db_xref"]: + key, value = xref.split(":", 1) + if key == "taxon": + fields["taxonId"] = value + if "lat_lon" in qualifiers: + fields["sampleLocation"] = qualifiers["lat_lon"][0] + + +def _parse_references(entry, fields: dict) -> None: + """Extract reference metadata from SeqRecord annotations.""" + submitted_re = re.compile(r"Submitted\s\(\d{2}-\w{3}-(\d{4})\)") + published_re = re.compile(r"\s\((\d{4})\)[^(]*$") + for ref in entry.annotations.get("references", []): + if ref.journal == "Unpublished": + continue + if ref.journal.startswith("Submitted"): + if "sourceAuthor" in fields: + continue + if match := submitted_re.search(ref.journal): + fields["sourceYear"] = match[1] + elif "sourceAuthor" in fields: + continue + else: + if match := published_re.search(ref.journal): + fields["sourceYear"] = match[1] + if ref.title: + fields["sourceTitle"] = ref.title + if ref.pubmed_id: + fields["pubmedId"] = ref.pubmed_id + if ref.authors: + fields["sourceAuthor"] = ref.authors + elif ref.consrtm: + fields["sourceAuthor"] = ref.consrtm + + +def _parse_xrefs(entry, fields: dict) -> None: + """Extract BioProject/BioSample cross-references.""" + if not entry.dbxrefs: + return + bioprojects = [] + biosamples = [] + for dbxref in entry.dbxrefs: + with contextlib.suppress(ValueError): + key, value = dbxref.split(":", 1) + if key == "BioProject": + bioprojects.append(value) + 
elif key == "BioSample": + biosamples.append(value) + if bioprojects: + fields["bioproject"] = ";".join(bioprojects) + if biosamples: + fields["biosample"] = ";".join(biosamples) + + +def _parse_sequence(entry, fields: dict) -> bool: + """Compute sequence stats (GC%, N%, span). Returns False if all Ns.""" + seqstr = str(entry.seq.upper()) + counter = Counter(seqstr) + length = len(seqstr) + n_pct = counter["N"] / length * 100 if length > 0 else 100 + fields["nPercent"] = f"{n_pct:.2f}" + if n_pct == 100: + return False + gc = counter["G"] + counter["C"] + at = counter["A"] + counter["T"] + fields["gcPercent"] = f"{gc / (gc + at) * 100:.2f}" if (gc + at) > 0 else "0.00" + fields["assemblySpan"] = str(length) + return True + + +def _parse_flatfile(flatfile_path: str, organelle: str, root_taxon: str) -> list: + """Parse a single GenBank flatfile for organelle sequences. + + Args: + flatfile_path (str): Path to a gzipped GenBank file. + organelle (str): Organelle type ("mitochondrion" or "plastid"). + root_taxon (str): Optional taxonomic root to filter by. + + Returns: + list: List of row dicts. 
+ """ + from Bio import SeqIO + + comment_re = re.compile( + r"(?:derived|identical)\s(?:from|to)\s([\w\d]+).*COMPLETENESS: full length", + re.DOTALL, + ) + rows = [] + with gzip.open(flatfile_path, "rt") as fh: + for entry in SeqIO.parse(fh, "gb"): + if root_taxon and root_taxon not in entry.annotations.get("taxonomy", []): + continue + fields = {"id": entry.id, "organelle": organelle} + comment = entry.annotations.get("comment", "") + if comment: + if match := comment_re.search(comment): + fields["genbankAccession"] = match[1] + else: + continue + _parse_features(entry, fields) + _parse_references(entry, fields) + fields["releaseDate"] = _reformat_date(entry.annotations.get("date", "")) + _parse_xrefs(entry, fields) + try: + if not _parse_sequence(entry, fields): + continue + except Exception: + continue + rows.append(fields) + return rows + + +@task(retries=2, retry_delay_seconds=30, log_prints=True) +def fetch_and_parse_organelles( + output_path: str, + organelles: list, + root_taxon: str, +) -> int: + """Fetch RefSeq organelle data and parse to gzipped TSV. + + Downloads GenBank flatfiles from NCBI FTP for each organelle type, + parses sequence records, and writes a combined TSV. + + Args: + output_path (str): Path to write the output TSV (or .tsv.gz). + organelles (list): List of organelle types to parse. + root_taxon (str): Optional taxonomic root filter. + + Returns: + int: Number of rows written. 
+ """ + if not organelles: + organelles = ["mitochondrion", "plastid"] + + all_rows = [] + for organelle in organelles: + print(f"Fetching listing for {organelle}") + listing = _refseq_listing(organelle) + print(f"Found {len(listing)} files for {organelle}") + + for url in listing: + print(f"Downloading {url}") + response = safe_get(url, timeout=600) + if response is None: + raise RuntimeError(f"Failed to download {url}: no response received") + response.raise_for_status() + + with tempfile.NamedTemporaryFile(suffix=".gbff.gz", delete=False) as tmp: + tmp.write(response.content) + tmp_path = tmp.name + + try: + rows = _parse_flatfile(tmp_path, organelle, root_taxon) + all_rows.extend(rows) + print(f"Parsed {len(rows)} records from {os.path.basename(url)}") + finally: + os.unlink(tmp_path) + + tsv_path = output_path.removesuffix(".gz") + with open(tsv_path, "w", newline="") as f: + writer = csv.DictWriter( + f, + fieldnames=ORGANELLE_FIELDNAMES, + delimiter="\t", + lineterminator="\n", + extrasaction="ignore", + ) + writer.writeheader() + for row in all_rows: + writer.writerow(row) + + if output_path.endswith(".gz"): + with open(tsv_path, "rb") as f_in, gzip.open(output_path, "wb") as f_out: + f_out.write(f_in.read()) + os.remove(tsv_path) + + print(f"Wrote {len(all_rows)} total organelle records to {output_path}") + return len(all_rows) + + +@task(log_prints=True) +def upload_s3_file(local_path: str, s3_path: str) -> None: + """Upload file to S3.""" + print(f"Uploading {local_path} to {s3_path}") + upload_to_s3(local_path, s3_path) + + +@flow() +def update_refseq_organelles( + output_path: str, + root_taxid: str = "", + s3_path: str = "", + min_records: int = 0, +) -> bool: + """Fetch and parse RefSeq organelle data. + + Args: + output_path (str): Path to write the output TSV. + root_taxid (str): Optional root taxon filter. + s3_path (str): Optional S3 path to upload the result. + min_records (int): Minimum record count to accept. 
+ + Returns: + bool: True on success. + """ + if not is_safe_path(output_path): + raise ValueError(f"Unsafe output path: {output_path}") + resolved_path = os.path.abspath(output_path) + os.makedirs(os.path.dirname(resolved_path), exist_ok=True) + + row_count = fetch_and_parse_organelles( + resolved_path, organelles=["mitochondrion", "plastid"], root_taxon=root_taxid + ) + + if row_count < min_records: + raise RuntimeError(f"RefSeq organelles: fewer than {min_records} records: {row_count}") + + if s3_path: + upload_s3_file(output_path, s3_path) + + emit_event( + event="update.refseq.organelles.finished", + resource={ + "prefect.resource.id": f"update.refseq.organelles.{output_path}", + "prefect.resource.type": "refseq.organelles", + }, + payload={"row_count": row_count}, + ) + return True + + +if __name__ == "__main__": + args = parse_args( + [required(OUTPUT_PATH), ROOT_TAXID, S3_PATH, MIN_RECORDS], + "Fetch and parse RefSeq organelle data.", + ) + update_refseq_organelles(**vars(args)) diff --git a/flows/updaters/update_sra_data.py b/flows/updaters/update_sra_data.py new file mode 100644 index 0000000..e1ffe79 --- /dev/null +++ b/flows/updaters/update_sra_data.py @@ -0,0 +1,370 @@ +import csv +import gzip +import os +import xml.etree.ElementTree as ET +from collections import defaultdict +from datetime import date, timedelta +from itertools import groupby + +from flows.lib.conditional_import import emit_event, flow, task +from flows.lib.shared_args import ( + INPUT_PATH, + MIN_RECORDS, + OUTPUT_PATH, + ROOT_TAXID, + S3_PATH, + default, + parse_args, + required, +) +from flows.lib.utils import is_safe_path, run_quoted, upload_to_s3 + +SRA_FIELDNAMES = [ + "taxon_id", + "sra_accession", + "run_accession", + "library_source", + "platform", + "reads", + "total_reads", + "total_runs", +] + + +def _split_chunks(values, split_val): + """Split an iterable into chunks at occurrences of split_val. + + Args: + values: Iterable to split. + split_val: Value at which to split. 
+ + Yields: + (int, group) pairs. + """ + index = 0 + + def chunk_index(val): + nonlocal index + if val == split_val: + index += 1 + return index + + return groupby(values, chunk_index) + + +def _open_file(file_path, **kwargs): + """Open a file, decompressing gzip if needed.""" + if file_path.endswith(".gz"): + return gzip.open(file_path, "rt", encoding="utf8", **kwargs) + return open(file_path, "r", encoding="utf8", **kwargs) + + +def _read_exp_xml(node, obj): + """Extract fields from an ExpXml element.""" + for child in node: + tag = child.tag + if tag == "Bioproject": + obj["bioproject"] = child.text + elif tag == "Biosample": + obj["biosample"] = child.text + elif tag == "Organism": + obj["taxon_id"] = child.get("taxid") + elif tag == "Experiment": + obj["sra_accession"] = child.get("acc") + elif tag == "Summary": + obj["platform"] = child.findtext("Platform") or "" + elif tag == "Library_descriptor": + source = child.findtext("LIBRARY_SOURCE") + obj["library_source"] = source.lower() if source else "" + + +def _read_runs(node, obj): + """Extract run accessions and read counts from a Runs element.""" + if "runs" not in obj: + obj["runs"] = [] + for child in node: + obj["runs"].append({"accession": child.get("acc"), "reads": child.get("total_spots", "0")}) + + +def parse_sra_xml(xml_file: str) -> list: + """Parse an SRA efetch docsum XML file into row dicts. + + Args: + xml_file (str): Path to the XML (or .xml.gz) file. + + Returns: + list: List of dicts with taxon_id, sra_accession, runs, etc. 
+ """ + rows = [] + xml_header = '\n' + with _open_file(xml_file) as f: + for _, doc in _split_chunks(f, xml_header): + lines = list(doc) + try: + root = ET.fromstringlist(lines) + except ET.ParseError: + continue + for doc_summary in root.iter("DocumentSummary"): + obj = {"date": "", "runs": []} + for child in doc_summary: + tag = child.tag + if tag == "CreateDate": + obj["date"] = child.text or "" + elif tag == "ExpXml": + _read_exp_xml(child, obj) + elif tag == "Runs": + _read_runs(child, obj) + if "taxon_id" in obj and obj["runs"]: + rows.append(obj) + return rows + + +def group_by_taxon(rows: list, grouped: dict) -> list: + """Group SRA runs by taxon, keeping the 10 most recent per taxon. + + Args: + rows (list): Parsed SRA row dicts with runs. + grouped (dict): Optional existing grouped data to merge into. + + Returns: + list: One dict per taxon with aggregated fields. + """ + if not grouped: + grouped = defaultdict(lambda: {"count": 0, "reads": 0, "runs": []}) + for obj in sorted(rows, key=lambda r: r.get("date", "")): + taxon_id = obj.get("taxon_id") + if not taxon_id: + continue + for run in obj.get("runs", []): + try: + reads = int(run["reads"]) + except (ValueError, TypeError): + reads = 0 + row = { + "sra_accession": obj.get("sra_accession", ""), + "run_accession": run["accession"], + "library_source": obj.get("library_source", ""), + "platform": obj.get("platform", ""), + "reads": reads, + } + grouped[taxon_id]["runs"].insert(0, row) + grouped[taxon_id]["count"] += 1 + grouped[taxon_id]["reads"] += reads + if len(grouped[taxon_id]["runs"]) > 10: + grouped[taxon_id]["runs"].pop() + + return [ + { + "taxon_id": taxon_id, + "sra_accession": ";".join(r["sra_accession"] for r in grp["runs"]), + "run_accession": ";".join(r["run_accession"] for r in grp["runs"]), + "library_source": ";".join(r["library_source"] for r in grp["runs"]), + "platform": ";".join(r["platform"] for r in grp["runs"]), + "reads": ";".join(str(r["reads"]) for r in grp["runs"]), + 
"total_reads": grp["reads"], + "total_runs": grp["count"], + } + for taxon_id, grp in grouped.items() + ] + + +def load_previous_tsv(file_path: str) -> dict: + """Load previously grouped SRA data from a TSV for incremental updates. + + Args: + file_path (str): Path to the existing TSV (or .tsv.gz). + + Returns: + dict: Grouped data keyed by taxon_id, or empty dict if file missing. + """ + if not os.path.isfile(file_path): + return {} + grouped = defaultdict(lambda: {"count": 0, "reads": 0, "runs": []}) + with _open_file(file_path, newline="") as f: + reader = csv.DictReader(f, delimiter="\t") + for row in reader: + taxon_id = row["taxon_id"] + grouped[taxon_id]["count"] = int(row["total_runs"]) + grouped[taxon_id]["reads"] = int(row["total_reads"]) + run_accs = row["run_accession"].split(";") + sra_accs = row["sra_accession"].split(";") + lib_srcs = row["library_source"].split(";") + platforms = row["platform"].split(";") + reads_list = row["reads"].split(";") + for i, run_acc in enumerate(run_accs): + grouped[taxon_id]["runs"].append( + { + "run_accession": run_acc, + "sra_accession": sra_accs[i] if i < len(sra_accs) else "", + "library_source": lib_srcs[i] if i < len(lib_srcs) else "", + "platform": platforms[i] if i < len(platforms) else "", + "reads": int(reads_list[i]) if i < len(reads_list) else 0, + } + ) + return grouped + + +def _get_yesterday() -> str: + """Return yesterday's date as YYYY/MM/DD.""" + return (date.today() - timedelta(days=1)).strftime("%Y/%m/%d") + + +@task(retries=2, retry_delay_seconds=30, log_prints=True) +def fetch_sra_xml( + output_xml: str, + root_taxid: str = "2759", + min_date: str = "2024/01/01", +) -> str: + """Fetch SRA docsum XML from NCBI using esearch/efetch. + + Requires the NCBI Entrez Direct (edirect) tools and NCBI_API_KEY + environment variable. + + Args: + output_xml (str): Path to write the XML output. + root_taxid (str): Root taxon ID to query. + min_date (str): Start date for the query (YYYY/MM/DD). 
+ + Returns: + str: Path to the written XML file. + """ + max_date = _get_yesterday() + + query = f"(txid{root_taxid}[organism:exp])" + esearch_cmd = [ + "esearch", + "-db", + "sra", + "-query", + query, + ] + esearch_cmd.extend(["-mindate", min_date, "-maxdate", max_date]) + + efetch_cmd = ["efetch", "-db", "sra", "-format", "docsum"] + + print(f"Running esearch | efetch for taxid {root_taxid} ({min_date} to {max_date})") + esearch = run_quoted(esearch_cmd, capture_output=True, text=True, timeout=3000) + if esearch.returncode != 0: + raise RuntimeError(f"esearch failed: {esearch.stderr}") + + with open(output_xml, "w") as f: + efetch = run_quoted(efetch_cmd, input=esearch.stdout, capture_output=True, text=True, timeout=6000) + if efetch.returncode != 0: + raise RuntimeError(f"efetch failed: {efetch.stderr}") + f.write(efetch.stdout) + + print(f"Wrote SRA XML to {output_xml}") + return output_xml + + +@task(retries=2, retry_delay_seconds=5, log_prints=True) +def parse_and_write_sra( + xml_path: str, + output_path: str, + previous_path: str, +) -> int: + """Parse SRA XML and write grouped TSV. + + Args: + xml_path (str): Path to the SRA docsum XML. + output_path (str): Path to write the output TSV. + previous_path (str): Optional path to previous TSV for incremental merge. + + Returns: + int: Number of taxon rows written. 
+ """ + previous = load_previous_tsv(previous_path) if previous_path else {} + rows = parse_sra_xml(xml_path) + print(f"Parsed {len(rows)} records from XML") + + grouped_rows = group_by_taxon(rows, grouped=previous) + print(f"Grouped into {len(grouped_rows)} taxa") + + tsv_path = output_path.removesuffix(".gz") + with open(tsv_path, "w", newline="") as f: + writer = csv.DictWriter(f, fieldnames=SRA_FIELDNAMES, delimiter="\t", lineterminator="\n") + writer.writeheader() + for row in grouped_rows: + writer.writerow(row) + + if output_path.endswith(".gz"): + with open(tsv_path, "rb") as f_in, gzip.open(output_path, "wb") as f_out: + f_out.write(f_in.read()) + os.remove(tsv_path) + + print(f"Wrote {len(grouped_rows)} taxon rows to {output_path}") + return len(grouped_rows) + + +@task(log_prints=True) +def upload_s3_file(local_path: str, s3_path: str) -> None: + """Upload file to S3.""" + print(f"Uploading {local_path} to {s3_path}") + upload_to_s3(local_path, s3_path) + + +@flow() +def update_sra_data( + output_path: str, + input_path: str = "", + s3_path: str = "", + root_taxid: str = "2759", + min_records: int = 0, +) -> bool: + """Fetch and parse SRA data, writing grouped TSV output. + + If input_path is provided, parses that XML file directly. Otherwise + fetches fresh data from NCBI using esearch/efetch. + + Args: + output_path (str): Path to write the output TSV (or .tsv.gz). + input_path (str): Optional path to an existing SRA XML file. + root_taxid (str): Root taxon ID for the NCBI query. + s3_path (str): Optional S3 path to upload the result. + min_records (int): Minimum taxon count to accept the output. + + Returns: + bool: True on success. 
+ """ + if not is_safe_path(output_path): + raise ValueError(f"Unsafe output path: {output_path}") + resolved_path = os.path.abspath(output_path) + os.makedirs(os.path.dirname(resolved_path), exist_ok=True) + + if input_path and os.path.isfile(input_path): + xml_path = input_path + else: + xml_path = f"{resolved_path}.xml" + fetch_sra_xml(xml_path, root_taxid=root_taxid) + + row_count = parse_and_write_sra(xml_path, resolved_path, previous_path=resolved_path) + + if row_count < min_records: + raise RuntimeError(f"SRA output has fewer than {min_records} taxa: {row_count}") + + if s3_path: + upload_s3_file(output_path, s3_path) + + emit_event( + event="update.sra.data.finished", + resource={ + "prefect.resource.id": f"update.sra.{output_path}", + "prefect.resource.type": "sra.data", + }, + payload={"row_count": row_count}, + ) + return True + + +if __name__ == "__main__": + args = parse_args( + [ + required(OUTPUT_PATH), + INPUT_PATH, + default(ROOT_TAXID, "2759"), + S3_PATH, + MIN_RECORDS, + ], + "Fetch and parse SRA data into grouped TSV.", + ) + update_sra_data(**vars(args)) diff --git a/flows/updaters/update_tol_genome_notes.py b/flows/updaters/update_tol_genome_notes.py index 914dff6..c01ad94 100644 --- a/flows/updaters/update_tol_genome_notes.py +++ b/flows/updaters/update_tol_genome_notes.py @@ -101,7 +101,7 @@ def upload_s3_tsv(local_path: str, s3_path: str) -> None: @flow() -def update_tol_genome_notes(output_path: str, s3_path: str, min_records: int) -> None: +def update_tol_genome_notes(output_path: str, s3_path: str = "", min_records: int = 0) -> bool: """Update the ToL genome notes TSV file.""" os.makedirs(os.path.dirname(os.path.abspath(output_path)), exist_ok=True) line_count = fetch_tol_genome_notes(output_path, min_records) diff --git a/flows/updaters/update_tol_portal_status.py b/flows/updaters/update_tol_portal_status.py index 6fbe77f..bf10bc6 100644 --- a/flows/updaters/update_tol_portal_status.py +++ b/flows/updaters/update_tol_portal_status.py 
@@ -219,7 +219,7 @@ def upload_s3_tsv(local_path: str, s3_path: str) -> None: @flow() -def update_tol_portal_status(output_path: str, s3_path: str, min_records: int) -> None: +def update_tol_portal_status(output_path: str, s3_path: str = "", min_records: int = 0) -> None: """Update the ToL Portal Project Status TSV file.""" os.makedirs(os.path.dirname(os.path.abspath(output_path)), exist_ok=True) diff --git a/flows/updaters/update_ucsc_assemblies.py b/flows/updaters/update_ucsc_assemblies.py new file mode 100644 index 0000000..c2d9578 --- /dev/null +++ b/flows/updaters/update_ucsc_assemblies.py @@ -0,0 +1,97 @@ +import gzip +import os + +from flows.lib.conditional_import import emit_event, flow, task +from flows.lib.shared_args import OUTPUT_PATH, S3_PATH, parse_args, required +from flows.lib.utils import is_safe_path, safe_get, upload_to_s3 + +UCSC_URL = "https://hgdownload.soe.ucsc.edu/hubs/UCSC_GI.assemblyHubList.txt" +OUTPUT_FILENAME = "UCSC_GI.assemblyHubList.tsv.gz" + + +@task(retries=2, retry_delay_seconds=5, log_prints=True) +def fetch_ucsc_hub_list(output_dir: str) -> tuple[str, int]: + """Fetch the UCSC assembly hub accession list and write as gzipped TSV. + + The source file is a tab-separated text file served with ISO-8859-1 + encoding. We decode to UTF-8 for consistency. + + Args: + output_dir (str): Directory to write the output file. + + Returns: + tuple[str, int]: Path to the output file and number of data lines. 
+ """ + output_path = os.path.join(output_dir, OUTPUT_FILENAME) + tsv_path = output_path.removesuffix(".gz") + + print(f"Fetching UCSC hub list from {UCSC_URL}") + response = safe_get(UCSC_URL, timeout=60) + if response is None: + raise RuntimeError("Failed to fetch UCSC hub list: no response received") + response.raise_for_status() + response.encoding = "iso-8859-1" + text = response.text + + with open(tsv_path, "w", encoding="utf-8") as f: + f.write(text) + line_count = text.count("\n") + + with open(tsv_path, "rb") as f_in, gzip.open(output_path, "wb") as f_out: + f_out.write(f_in.read()) + os.remove(tsv_path) + + print(f"Wrote {line_count} lines to {output_path}") + return output_path, line_count + + +@task(log_prints=True) +def upload_s3_file(local_path: str, s3_path: str) -> None: + """Upload file to S3.""" + print(f"Uploading {local_path} to {s3_path}") + upload_to_s3(local_path, s3_path) + + +@flow() +def update_ucsc_assemblies( + output_path: str, + s3_path: str = "", +) -> bool: + """Fetch the UCSC assembly hub list and optionally upload to S3. + + Args: + output_path (str): Directory to write the output file. + s3_path (str): Optional S3 directory path to upload the result. + + Returns: + bool: True on success. 
+ """ + if not is_safe_path(output_path): + raise ValueError(f"Unsafe output path: {output_path}") + + resolved_path = os.path.abspath(output_path) + os.makedirs(resolved_path, exist_ok=True) + + local_file, line_count = fetch_ucsc_hub_list(resolved_path) + + if s3_path: + remote_path = f"{s3_path.rstrip('/')}/{OUTPUT_FILENAME}" + upload_s3_file(local_file, remote_path) + + emit_event( + event="update.ucsc.assemblies.finished", + resource={ + "prefect.resource.id": f"update.ucsc.{resolved_path}", + "prefect.resource.type": "ucsc.assemblies", + }, + payload={"line_count": line_count}, + ) + return True + + +if __name__ == "__main__": + args = parse_args( + [required(OUTPUT_PATH), S3_PATH], + "Fetch UCSC assembly hub accession list.", + ) + update_ucsc_assemblies(**vars(args)) diff --git a/flows/updaters/update_vgp_original_status.py b/flows/updaters/update_vgp_original_status.py new file mode 100644 index 0000000..8feb950 --- /dev/null +++ b/flows/updaters/update_vgp_original_status.py @@ -0,0 +1,88 @@ +import os + +from flows.lib.conditional_import import emit_event, flow, task +from flows.lib.shared_args import MIN_RECORDS, OUTPUT_PATH, S3_PATH, parse_args, required +from flows.lib.utils import upload_to_s3 +from flows.updaters.api import api_config as cfg +from flows.updaters.api import api_tools as at + + +@task(retries=2, retry_delay_seconds=5, log_prints=True) +def fetch_vgp_original_tsv( + file_path: str, + min_lines: int = 1, +) -> int: + """Fetch VGP original status list from the Vertebrate Genomes Project GitHub tracker. + + Downloads the VGP genome portal YAML tracker, extracts species records, + and writes a TSV with per-species status fields. + + Args: + file_path (str): Path to the output TSV file. + min_lines (int): Minimum number of rows expected. + + Returns: + int: Number of lines written to the output file. 
+ """ + at.get_from_source( + cfg.vgl_url_opener, + cfg.vgl_hub_count_handler, + cfg.vgl_row_handler, + cfg.vgl_fieldnames, + file_path, + ) + + with open(file_path, "r") as f: + line_count = sum(1 for _ in f) + + if line_count < min_lines: + raise RuntimeError(f"VGP file {file_path} has fewer than {min_lines} lines: {line_count}") + print(f"Wrote {line_count} lines to {file_path}") + return line_count + + +@task(log_prints=True) +def upload_s3_tsv(local_path: str, s3_path: str) -> None: + """Upload VGP TSV to S3.""" + print(f"Uploading VGP TSV from {local_path} to {s3_path}") + upload_to_s3(local_path, s3_path) + + +@flow() +def update_vgp_original_status(output_path: str, s3_path: str = "", min_records: int = 0) -> bool: + """Fetch the VGP original status list and optionally upload to S3. + + This is the rarely updated VGP source from the GitHub YAML tracker. + See update_vgp_status.py for the frequently updated live Google Sheet. + + Args: + output_path (str): Path to the output TSV file. + s3_path (str): Optional S3 path to upload the result. + min_records (int): Minimum record count to accept the output. + + Returns: + bool: True on success. 
+ """ + + resolved_path = os.path.abspath(output_path) + os.makedirs(os.path.dirname(resolved_path), exist_ok=True) + line_count = fetch_vgp_original_tsv(resolved_path, min_records) + if line_count > min_records and s3_path: + upload_s3_tsv(resolved_path, s3_path) + emit_event( + event="update.vgp.original.status.finished", + resource={ + "prefect.resource.id": f"update.vgp.original.{resolved_path}", + "prefect.resource.type": "vgp.original.status", + }, + payload={"line_count": line_count}, + ) + return True + + +if __name__ == "__main__": + args = parse_args( + [required(OUTPUT_PATH), S3_PATH, MIN_RECORDS], + "Fetch VGP original status list from the Vertebrate Genomes Project.", + ) + update_vgp_original_status(**vars(args)) diff --git a/flows/updaters/update_vgp_status.py b/flows/updaters/update_vgp_status.py new file mode 100644 index 0000000..5e0cba6 --- /dev/null +++ b/flows/updaters/update_vgp_status.py @@ -0,0 +1,282 @@ +"""Fetch VGP Ordinal Phase1+ status from the live Google Sheet. + +This updater replaces the legacy ``vgp_live_sheet_curation.py`` script +from goat-data. It downloads the live VGP spreadsheet, cleans headers, +translates project names to canonical acronyms, expands sequencing +status columns following the GoaT status hierarchy, and writes a TSV +matching the ``FILE_VGP_Ordinal_Phase1.types.yaml`` schema. + +The companion ``update_vgp_original_status.py`` fetches the less +frequently updated VGP GitHub YAML tracker source. 
+""" + +import io +import os + +import numpy as np +import pandas as pd + +from flows.lib.conditional_import import emit_event, flow, task +from flows.lib.shared_args import MIN_RECORDS, OUTPUT_PATH, S3_PATH, parse_args, required +from flows.lib.utils import is_safe_path, safe_get, upload_to_s3 + +# Published (export) link to the VGP Ordinal Phase1+ Google Sheet +VGP_SHEET_URL = ( + "https://docs.google.com/spreadsheets/d/1Jwjv6Kwc6VIn1UMMhnG6kvFCxjwGdC5b7p_HtbDOMOs" + "/export?format=tsv" + "&id=1Jwjv6Kwc6VIn1UMMhnG6kvFCxjwGdC5b7p_HtbDOMOs" + "&gid=1380659438" +) + +# Columns to import from the spreadsheet +SOURCE_COLUMNS = [ + "Order", + "Lineage", + "Superorder", + "Family Scientific Name", + "Scientific Name", + "English Name", + "NCBI taxon ID", + "Status", + "QV", + "IUCN (2016-2024)", + "CITES", + "Main project", + "Second project", +] + +# Map free-text project names to canonical EBP acronyms +PROJECT_ACRONYMS = { + "Sanger 25G": "25GP", + "Sanger 25G project": "25GP", + "AfricaBP": "AFRICABP", + "Cetacean GP": "CGP", + "DToL": "DTOL", + "DToL?": "DTOL", + "Yggdrasil": "YGG", + "CatalanBP": "CBP", + "Canadian Biogenome Project": "CANBP", + "Canada Biogenome Project": "CANBP", + "Threatened Species Initiative (TSI)": "TSI", + "Minderoo OceanOmics": "OG", + "DToL, ERGA": "DTOL,ERGA", + "Amazoomics : Genomics of Brazilian Biodiversity": "AMAZOOMICS,GBB", + "AmaZoomics : Genomics of Brazilian Biodiversity": "AMAZOOMICS,GBB", + "Individual, Google": "Individual,Google", +} + +# Numeric status code → GoaT sequencing status +STATUS_MAP = { + "0": "", + "1": "sample_collected", + "2": "", + "3": "in_progress", + "4": "open", + "5": "open", +} + +# Full ordered list of GoaT sequencing status columns +SEQUENCING_STATUSES = [ + "sample_collected", + "sample_acquired", + "in_progress", + "data_generation", + "in_assembly", + "insdc_submitted", + "open", + "insdc_open", + "published", +] + + +# --------------------------------------------------------------------------- 
+# Processing helpers +# --------------------------------------------------------------------------- + + +def _cleanup_table(df: pd.DataFrame) -> pd.DataFrame: + """Replace whitespace-only cells with NaN, drop empty rows/cols.""" + df = df.replace(r"^\s*$", np.nan, regex=True) + df = df.replace(r"^ +| +$", "", regex=True) + df.dropna(how="all", axis=1, inplace=True) + df.dropna(how="all", axis=0, inplace=True) + return df + + +def _cleanup_headers(df: pd.DataFrame) -> pd.DataFrame: + """Normalise column headers: lowercase, underscored, no parens.""" + df.columns = ( + df.columns.str.replace(" ", "_") + .str.replace(r"\(", "", regex=True) + .str.replace(r"\)", "", regex=True) + .str.lower() + ) + return df + + +def _get_acronym(project_name: str) -> str: + """Map a free-text project name to a canonical acronym.""" + return PROJECT_ACRONYMS.get(project_name, project_name) + + +def _translate_projects(df: pd.DataFrame) -> pd.DataFrame: + """Map free-text project names to canonical acronyms.""" + for col in ["main_project", "second_project", "project"]: + if col in df.columns: + df[col] = df[col].map(lambda v: _get_acronym(str(v)) if pd.notna(v) else v) + return df + + +def _build_all_projects(df: pd.DataFrame) -> pd.DataFrame: + """Create 'all_projects' column from project + main + second.""" + df["all_projects"] = df.apply( + lambda row: ",".join( + sorted( + { + x + for x in [ + row.get("project"), + row.get("main_project"), + row.get("second_project"), + ] + if pd.notna(x) + } + ) + ), + axis=1, + ) + return df + + +def _expand_sequencing_status(df: pd.DataFrame) -> pd.DataFrame: + """Map numeric status codes and cascade the GoaT status hierarchy.""" + # Ensure all status columns exist + for col in SEQUENCING_STATUSES: + if col not in df.columns: + df[col] = None + + # Map numeric codes to status names + df["sequencing_status"] = df["status"].map(STATUS_MAP) + + # Populate status columns with all_projects for matching rows + for status in SEQUENCING_STATUSES: + 
df.loc[df["sequencing_status"] == status, status] = df["all_projects"] + + # Cascade status hierarchy upward + df.loc[df["published"] == df["all_projects"], "insdc_open"] = df["all_projects"] + df.loc[df["insdc_open"] == df["all_projects"], "open"] = df["all_projects"] + df.loc[df["open"] == df["all_projects"], "in_progress"] = df["all_projects"] + df.loc[df["data_generation"] == df["all_projects"], "in_progress"] = df["all_projects"] + df.loc[df["in_assembly"] == df["all_projects"], "in_progress"] = df["all_projects"] + df.loc[df["in_progress"] == df["all_projects"], "data_generation"] = df["all_projects"] + df.loc[df["in_progress"] == df["all_projects"], "sample_acquired"] = df["all_projects"] + df.loc[df["sample_acquired"] == df["all_projects"], "sample_collected"] = df["all_projects"] + return df + + +def _process_vgp_sheet(raw_tsv: str) -> pd.DataFrame: + """Full processing pipeline for the VGP live sheet. + + Args: + raw_tsv (str): Raw TSV text content from Google Sheets. + + Returns: + pd.DataFrame: Cleaned, expanded DataFrame ready for export. + """ + df = pd.read_csv( + io.StringIO(raw_tsv), + sep="\t", + dtype=object, + engine="python", + on_bad_lines="warn", + usecols=SOURCE_COLUMNS, + ) + df = _cleanup_table(df) + df = _cleanup_headers(df) + df["project"] = "VGP" + df = _translate_projects(df) + df = _build_all_projects(df) + df = _expand_sequencing_status(df) + return df + + +# --------------------------------------------------------------------------- +# Prefect tasks and flow +# --------------------------------------------------------------------------- + + +@task(retries=2, retry_delay_seconds=30, log_prints=True) +def fetch_vgp_live_sheet(output_path: str, min_records: int = 0) -> int: + """Download the VGP Ordinal Phase1+ Google Sheet and write a TSV. + + Args: + output_path (str): Path to the output TSV file. + min_records (int): Minimum number of rows expected. + + Returns: + int: Number of data rows written. 
+ """ + response = safe_get(VGP_SHEET_URL, timeout=120) + if response is None: + raise RuntimeError("Failed to fetch VGP live sheet: no response received") + response.raise_for_status() + + df = _process_vgp_sheet(response.text) + row_count = len(df) + print(f"VGP live sheet: {row_count} rows after processing") + + if row_count < min_records: + raise RuntimeError(f"VGP live sheet has fewer than {min_records} rows: {row_count}") + + df.to_csv(output_path, sep="\t", index=False) + print(f"Wrote {output_path}") + return row_count + + +@task(log_prints=True) +def upload_s3_tsv(local_path: str, s3_path: str) -> None: + """Upload VGP TSV to S3.""" + print(f"Uploading {local_path} to {s3_path}") + upload_to_s3(local_path, s3_path) + + +@flow() +def update_vgp_status(output_path: str, s3_path: str = "", min_records: int = 0) -> bool: + """Fetch the VGP Ordinal Phase1+ live sheet and optionally upload to S3. + + Args: + output_path (str): Path to the output TSV file. + s3_path (str): Optional S3 path to upload the result. + min_records (int): Minimum record count to accept the output. + + Returns: + bool: True on success. 
+ """ + if not is_safe_path(output_path): + raise ValueError(f"Unsafe output path: {output_path}") + + resolved_path = os.path.abspath(output_path) + os.makedirs(os.path.dirname(resolved_path), exist_ok=True) + + row_count = fetch_vgp_live_sheet(resolved_path, min_records) + + if row_count > min_records and s3_path: + upload_s3_tsv(resolved_path, s3_path) + + emit_event( + event="update.vgp.status.finished", + resource={ + "prefect.resource.id": f"update.vgp.{resolved_path}", + "prefect.resource.type": "vgp.status", + }, + payload={"row_count": row_count}, + ) + return True + + +if __name__ == "__main__": + args = parse_args( + [required(OUTPUT_PATH), S3_PATH, MIN_RECORDS], + "Fetch VGP Ordinal Phase1+ status from the live Google Sheet.", + ) + update_vgp_status(**vars(args)) diff --git a/scripts/TESTING_README.md b/scripts/TESTING_README.md new file mode 100644 index 0000000..f6752e8 --- /dev/null +++ b/scripts/TESTING_README.md @@ -0,0 +1,121 @@ +# Parser Testing Wrapper + +## Usage + +Run all parser tests with comparison against S3 sources: + +```bash +python scripts/run_parse_validate_tests.py +``` + +Test a specific parser: + +```bash +python scripts/run_parse_validate_tests.py --parser REFSEQ_ORGANELLES +python scripts/run_parse_validate_tests.py --parser SRA_DATA +python scripts/run_parse_validate_tests.py --parser SEQUENCING_STATUS +``` + +Verbose output: + +```bash +python scripts/run_parse_validate_tests.py --verbose +``` + +## What the Script Does + +1. **Discovers parsers** from `flows/parsers/parse_*.py` +2. **Runs each parser** on example TSV files in `tsv_examples/` +3. **Compares output** to S3 source versions mirrored in `goat-data_s3_sources/` +4. **Validates**: + - Header names match between parsed output and S3 source + - Line counts are similar (within 10% tolerance) +5. 
**Generates a summary report** showing pass/fail status + +## Test Configuration + +Tests are configured in the `PARSER_INPUT_MAPPING` dict in the script: + +```python +PARSER_INPUT_MAPPING = { + "PARSER_NAME": [ + ("tsv_examples/input.tsv", "path/to/config.types.yaml", "s3_comparison_file.tsv"), + # (input_example, yaml_config, s3_source_name) + ], +} +``` + +- **input_example**: Path to example TSV in `tsv_examples/` +- **yaml_config**: Path to YAML schema (relative to data repo root) +- **s3_source_name**: Filename in `goat-data_s3_sources/` for comparison (or `None`) + +## Current Test Results + +### ✓ Passing (Perfect Match) + +- **REFSEQ_ORGANELLES**: 33,252 rows, 13 cols - matches S3 source exactly +- **SRA_DATA**: 27,606 rows, 8 cols - matches S3 source exactly + +### ✓ Partial Pass (Headers OK) + +- **SEQUENCING_STATUS (vgp)**: Headers match, but example is only a sample (292 vs 1093 rows) — expected +- **SEQUENCING_STATUS (AEGIS)**: Parser succeeds, no S3 source to compare +- **BLOBTOOLKIT**: Headers mostly match (23/25 cols), example input missing 'biosample'/'bioproject' — sample is incomplete + +### ❌ Known Issues + +- **GENOMESIZE_KARYOTYPE**: Genomehubs `write_tsv` error with composite headers (`header: [genus, species]`). This is a YAML schema complexity issue, not a parser bug. + +## Running Parsers Manually + +For direct parser invocation with SKIP_PREFECT: + +```bash +export SKIP_PREFECT=true +python -m flows.parsers.parse_refseq_organelles \ + -i path/to/input.tsv \ + -y path/to/config.types.yaml +``` + +## Adding New Tests + +1. Add example TSV to `tsv_examples/` or `tsv_examples//` +2. Update `PARSER_INPUT_MAPPING` with the new test config +3. 
Run the script to verify + +Example: + +```python +"MY_PARSER": [ + ("tsv_examples/my_example.tsv", "../goat-data/sources/my-category/my_config.types.yaml", "my_source.tsv"), +] +``` + +## Troubleshooting + +**"No output file generated"**: + +- Check that the YAML config specifies a different output filename than the input +- Verify the input TSV is accessible + +**"Headers mismatch"**: + +- Check if the example input has all expected columns +- Verify the YAML configuration includes all fields +- Example files may be incomplete samples + +**"Parser timeout"**: + +- Increase the timeout in the `run_parser()` function (currently 60 seconds) +- Check for issues in the parser logic or input data + +## Environment + +Requires: + +- `genomehubs >= 2.10.14` +- `boto3` +- `pyyaml` +- `requests` + +Install with: `pip install -q -r requirements.txt` diff --git a/scripts/run_parse_validate_tests.py b/scripts/run_parse_validate_tests.py new file mode 100755 index 0000000..052355e --- /dev/null +++ b/scripts/run_parse_validate_tests.py @@ -0,0 +1,330 @@ +#!/usr/bin/env python3 +"""Test wrapper: run parsers on example TSVs and compare against S3 sources. + +Usage: + python scripts/run_parse_validate_tests.py [--parser PARSER] [--verbose] + python scripts/run_parse_validate_tests.py --all + +This script: +1. Discovers parsers and example TSV inputs +2. Maps examples to YAML configs (from goat-data/sources) +3. Runs each parser and validates output +4. Compares output columns & line counts vs S3 source versions +5. 
Generates a test report + +Example files should be organized as: + - tsv_examples/*.tsv or *.tsv.gz + - tsv_examples//*.tsv or *.tsv.gz + +S3 source files are mirrored in: + - goat-data_s3_sources/{assembly-data,status-lists,sra,btk,conservation,genomesize-karyotype,uk-legislation}/imported/*.tsv +""" + +import argparse +import gzip +import json +import os +import subprocess +import sys +import tempfile +from csv import DictReader +from pathlib import Path +from typing import Optional, Tuple, List, Dict + +# Set SKIP_PREFECT before any imports from flows +os.environ["SKIP_PREFECT"] = "true" + +# Use absolute import paths +sys.path.insert(0, str(Path(__file__).parent.parent)) + + +PARSER_INPUT_MAPPING = { + # parser_name -> (example_input, yaml_config, comparison_file) + "SEQUENCING_STATUS": [ + ("tsv_examples/ATLASEA_expanded.tsv", "../goat-data/sources/assembly-data/FILE_ATLASEA.types.yaml", "ATLASEA_expanded.tsv"), + ("tsv_examples/AEGIS_expanded.tsv", "../goat-data/sources/status-lists/FILE_AEGIS.types.yaml", None), # no S3 source + ], + "REFSEQ_ORGANELLES": [ + ("tsv_examples/organelle_test.tsv", "../goat-data/sources/assembly-data/refseq_organelles.types.yaml", "refseq_organelles.tsv"), + ], + "BLOBTOOLKIT": [ + ("tsv_examples/blobtoolkit_test_results.tsv/btk.tsv.gz", "../goat-data/sources/btk/btk.types.yaml", "btk.tsv"), + ], + "SRA_DATA": [ + ("tsv_examples/sra.tsv", "../goat-data/sources/sra/sra.types.yaml", "sra.tsv"), + ], + "GENOMESIZE_KARYOTYPE": [ + ("tsv_examples/gsheets_test/DTOL_Plant_Genome_Size_Estimates.tsv", "../goat-data/sources/genomesize-karyotype/FILE_DTOL_Plant_Genome_Size_Estimates.types.yaml", None), + ], +} + +S3_SOURCES_ROOT = Path(__file__).parent.parent.parent / "goat-data_s3_sources" + + +def load_tsv_headers_and_count(path: str) -> Tuple[List[str], int]: + """Load TSV headers and line count (excluding header).""" + if path.endswith(".gz"): + fh = gzip.open(path, "rt", encoding="utf-8", newline="") + else: + fh = open(path, "rt", 
 encoding="utf-8", newline="") + + try: + reader = DictReader(fh, delimiter="\t") + headers = reader.fieldnames or [] + count = sum(1 for _ in reader) + return list(headers), count + finally: + fh.close() + + +def find_s3_source(expected_name: str) -> Optional[str]: + """Locate the S3 source file for a parser output.""" + # Search in all subdirectories + for root, dirs, files in os.walk(S3_SOURCES_ROOT): + for f in files: + if f == expected_name or f == f"{expected_name}.gz": + return os.path.join(root, f) + return None + + +def run_parser( + parser_name: str, + input_path: str, + yaml_path: str, + work_dir: str, +) -> Tuple[bool, Optional[str], Optional[str]]: + """Run a parser and return (success, output_file, error_msg).""" + try: + import shutil + + # Copy input to work_dir with a renamed prefix to avoid conflicting with output names + input_abs = Path(input_path).resolve() + work_path = Path(work_dir) + # Rename to avoid output file conflicts (e.g., input btk.tsv.gz vs output btk.tsv.gz) + work_input = work_path / f"_input_{input_abs.name}" + + if not work_input.exists(): + shutil.copy2(str(input_abs), str(work_input)) + + # Construct the Python module path from parser name + module_name = "flows.parsers." 
+ "parse_" + parser_name.lower() + + cmd = [ + sys.executable, + "-m", + module_name, + "-i", + str(work_input), + "-y", + yaml_path, + ] + + env = os.environ.copy() + env["SKIP_PREFECT"] = "true" + + result = subprocess.run( + cmd, + cwd=Path(__file__).parent.parent, + capture_output=True, + text=True, + timeout=60, + env=env, + ) + + if result.returncode != 0: + return False, None, f"Parser failed: {result.stderr}" + + # Find the output file in work_dir (excluding the input file and any _input_* files) + output_files = list(work_path.glob("*.tsv")) + list(work_path.glob("*.tsv.gz")) + output_files = [f for f in output_files if not f.name.startswith("_input_")] + + if not output_files: + return False, None, "No output file generated" + + output_file = str(output_files[0]) + return True, output_file, None + + except subprocess.TimeoutExpired: + return False, None, "Parser timeout" + except Exception as e: + return False, None, str(e) + + +def compare_outputs( + parsed_output: str, + s3_source: str, +) -> Dict[str, object]: + """Compare parsed output against S3 source.""" + try: + parsed_headers, parsed_count = load_tsv_headers_and_count(parsed_output) + s3_headers, s3_count = load_tsv_headers_and_count(s3_source) + + headers_match = set(parsed_headers) == set(s3_headers) + headers_extra = set(parsed_headers) - set(s3_headers) + headers_missing = set(s3_headers) - set(parsed_headers) + + count_diff = abs(parsed_count - s3_count) + count_pct_diff = 100.0 * count_diff / max(s3_count, 1) + + return { + "headers_match": headers_match, + "headers_extra": list(headers_extra), + "headers_missing": list(headers_missing), + "parsed_count": parsed_count, + "s3_count": s3_count, + "count_diff": count_diff, + "count_pct_diff": count_pct_diff, + "line_counts_similar": count_pct_diff < 10, # Allow 10% variance + } + except Exception as e: + return {"error": str(e)} + + +def run_tests(parser_name: Optional[str] = None, verbose: bool = False) -> int: + """Run tests 
for specified parser(s) and compare outputs."""
+    # Determine which parsers to test
+    if parser_name:
+        parsers_to_test = [parser_name.upper()]
+        if parser_name.upper() not in PARSER_INPUT_MAPPING:
+            print(f"Error: Parser {parser_name} not configured in PARSER_INPUT_MAPPING")
+            return 1
+    else:
+        parsers_to_test = list(PARSER_INPUT_MAPPING.keys())
+
+    results = {}
+
+    for pname in parsers_to_test:
+        print(f"\n{'='*70}")
+        print(f"Testing parser: {pname}")
+        print(f"{'='*70}")
+
+        if pname not in PARSER_INPUT_MAPPING:
+            print(f" ⚠️ No test configuration found")
+            continue
+
+        test_configs = PARSER_INPUT_MAPPING[pname]
+        parser_results = []
+
+        for input_path, yaml_path, comparison_file in test_configs:
+            input_abs = Path(__file__).parent.parent / input_path
+
+            if not input_abs.exists():
+                print(f" ⚠️ Input not found: {input_path}")
+                parser_results.append({"status": "skipped", "reason": "input_not_found"})
+                continue
+
+            print(f"\n Input: {input_path}")
+            print(f" YAML: {yaml_path}")
+
+            # Run parser in temp directory
+            with tempfile.TemporaryDirectory() as tmpdir:
+                yaml_abs = Path(__file__).parent.parent / yaml_path
+
+                success, output_file, error = run_parser(
+                    pname,
+                    str(input_abs),
+                    str(yaml_abs),
+                    tmpdir,
+                )
+
+                if not success:
+                    print(f" ❌ Parser failed: {error}")
+                    parser_results.append({"status": "failed", "error": error})
+                    continue
+
+                print(f" ✓ Parser succeeded")
+                output_headers, output_count = load_tsv_headers_and_count(output_file)
+                print(f" Output: {os.path.basename(output_file)} ({output_count} rows, {len(output_headers)} cols)")
+
+                # Compare against S3 source if available
+                if comparison_file:
+                    s3_source = find_s3_source(comparison_file)
+
+                    if s3_source:
+                        print(f" Comparing against S3 source: {comparison_file}")
+                        comparison = compare_outputs(output_file, s3_source)
+
+                        if "error" in comparison:
+                            print(f" ⚠️ Comparison failed: {comparison['error']}")
+                        else:
+                            s3_headers, s3_count = load_tsv_headers_and_count(s3_source)
+                            print(f" S3 source: {s3_count} rows, {len(s3_headers)} cols")
+
+                            if comparison["headers_match"]:
+                                print(f" ✓ Headers match")
+                            else:
+                                print(f" ❌ Headers mismatch:")
+                                if comparison["headers_extra"]:
+                                    print(f" Extra: {comparison['headers_extra']}")
+                                if comparison["headers_missing"]:
+                                    print(f" Missing: {comparison['headers_missing']}")
+
+                            if comparison["line_counts_similar"]:
+                                print(f" ✓ Line counts similar (~{comparison['count_pct_diff']:.1f}% diff)")
+                            else:
+                                print(f" ⚠️ Line counts differ substantially:")
+                                print(f" Parsed: {comparison['parsed_count']}, S3: {comparison['s3_count']} ({comparison['count_pct_diff']:.1f}% diff)")
+
+                        parser_results.append({
+                            "status": "success",
+                            "comparison": comparison,
+                        })
+                    else:
+                        print(f" ℹ️ No S3 source found for {comparison_file}")
+                        parser_results.append({
+                            "status": "success",
+                            "comparison": None,
+                        })
+                else:
+                    parser_results.append({
+                        "status": "success",
+                        "comparison": None,
+                    })
+
+        results[pname] = parser_results
+
+    # Summary
+    print(f"\n{'='*70}")
+    print("Summary")
+    print(f"{'='*70}")
+
+    all_passed = True
+    for pname, presults in results.items():
+        passed = sum(1 for r in presults if r.get("status") == "success" and (r.get("comparison") is None or r["comparison"].get("headers_match") and r["comparison"].get("line_counts_similar")))
+        total = len(presults)
+        status = "✓" if passed == total else "❌"
+        print(f"{status} {pname}: {passed}/{total} passed")
+        if passed < total:
+            all_passed = False
+
+    return 0 if all_passed else 1
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description=__doc__,
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+    )
+    parser.add_argument(
+        "--parser",
+        help="Test a specific parser by name (e.g., SEQUENCING_STATUS)",
+        default=None,
+    )
+    parser.add_argument(
+        "--all",
+        action="store_true",
+        help="Run all configured tests",
+    )
+    parser.add_argument(
+        "--verbose",
+        action="store_true",
+        help="Print verbose output",
+    )
+
+    args = parser.parse_args()
+
+    sys.exit(run_tests(parser_name=args.parser, verbose=args.verbose))
+
+
+if __name__ == "__main__":
+    main()