diff --git a/docs/plans/README.md b/docs/plans/README.md
new file mode 100644
index 0000000..e830dde
--- /dev/null
+++ b/docs/plans/README.md
@@ -0,0 +1,27 @@
+# Migration plans
+
+Versioned plans for the GoaT data import pipeline migration. The newest
+version is the source of truth; older versions are preserved verbatim so the
+evolution of the plan stays auditable.
+
+## Current
+
+- [v2-current-plan.md](v2-current-plan.md) — Phase 1 complete, Phase 2 in
+  progress (cleanup tasks tracked), Phases 3–5 outlined.
+
+## History
+
+- [v1-initial-plan.md](v1-initial-plan.md) — original plan written at project
+  kickoff. Defines the five-phase framing, the gap analysis, and the
+  network-robustness / logging / conventions reference that v2 still relies on.
+
+## Versioning convention
+
+- The newest plan is always named `vN-current-plan.md`.
+- When a new revision lands, the previous `current` file is renamed to
+  `vN-<short-descriptive-suffix>.md` (e.g. `v1-initial-plan.md`) and a new
+  `v(N+1)-current-plan.md` is added.
+- Historical plans are never edited after archival — corrections go into the
+  new version's change log.
+- Each new plan ends with a `Change log` section summarizing what changed
+  versus the previous version.
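The rotation described under "Versioning convention" in `docs/plans/README.md` could be scripted. The following is a minimal sketch, not part of this change: `rotate_plan` and the placeholder text it writes are hypothetical, and it assumes exactly one `vN-current-plan.md` exists in the directory. It does not update the links in the README, which would still need editing by hand.

```python
import re
from pathlib import Path

# Matches the "current" plan file name described in the README convention.
CURRENT_RE = re.compile(r"^v(\d+)-current-plan\.md$")


def rotate_plan(plans_dir: Path, suffix: str) -> Path:
    """Archive the current plan and create a stub for the next version.

    Hypothetical helper: renames ``vN-current-plan.md`` to
    ``vN-<suffix>.md`` and creates ``v(N+1)-current-plan.md`` ending
    with a ``Change log`` section.
    """
    matches = [
        (int(m.group(1)), p)
        for p in plans_dir.iterdir()
        if (m := CURRENT_RE.match(p.name))
    ]
    if len(matches) != 1:
        raise RuntimeError(f"expected one current plan, found {len(matches)}")
    version, current = matches[0]

    archived = plans_dir / f"v{version}-{suffix}.md"
    if archived.exists():
        # Historical plans are never edited, so refuse to overwrite one.
        raise FileExistsError(archived)
    current.rename(archived)

    new_plan = plans_dir / f"v{version + 1}-current-plan.md"
    new_plan.write_text(
        f"# Plan v{version + 1} (current)\n\n"
        "## Change log\n\n"
        f"- Changes versus v{version}: TODO\n",
        encoding="utf-8",
    )
    return new_plan
```

For example, `rotate_plan(Path("docs/plans"), "initial-plan")` on a directory containing only `v1-current-plan.md` would leave `v1-initial-plan.md` and a new `v2-current-plan.md`.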
diff --git a/docs/plans/v1-initial-plan.md b/docs/plans/v1-initial-plan.md
new file mode 100644
index 0000000..7b04500
--- /dev/null
+++ b/docs/plans/v1-initial-plan.md
@@ -0,0 +1,571 @@
+# GoaT Data Import Pipeline Migration Plan
+
+## TL;DR
+
+Migrate all data fetching from the legacy `goat-data` GitHub Actions workflow to scheduled Prefect-backed updaters in the `data` repo, then wire up parsers and validators to produce import-ready TSV/YAML pairs on S3. Five phases: fetch (Phase 1), parse+validate (Phase 2), switch S3 source (Phase 3), replace import (Phase 4), full pipeline (Phase 5).
+
+---
+
+## Gap Analysis: Updater Coverage
+
+### Already Implemented (10 updaters + 1 shared helper)
+
+| Updater                      | Legacy Equivalent                                  | Schedule             | Notes                                              |
+| ---------------------------- | -------------------------------------------------- | -------------------- | -------------------------------------------------- |
+| `update_ncbi_datasets`       | fetch-ncbi-datasets-zip                            | Daily                | NCBI Datasets CLI → JSONL                          |
+| `update_ncbi_taxonomy`       | fetch-ncbi-taxdump (commented out in legacy)       | Weekly               | FTP taxdump with MD5 verification                  |
+| `update_ena_taxonomy_extra`  | fetch-ena-taxonomy-extra (commented out in legacy) | Weekly               | ENA REST API                                       |
+| `update_genomehubs_taxonomy` | fetch-genomehubs-taxonomy                          | Daily                | blobtk collation from NCBI+ENA+OTT                 |
+| `update_tolid_prefixes`      | fetch-tolids (commented out in legacy)             | Weekly               | GitLab WTSI; 400k line minimum validation          |
+| `update_ott_taxonomy`        | (no legacy equivalent)                             | Monthly              | Open Tree of Life .tgz download                    |
+| `update_tol_portal_status`   | STS API (fully replaced via tol-sdk)               | Daily (orchestrated) | Docker-isolated; replaces `STS_AUTHORIZATION_KEY`  |
+| `update_tol_genome_notes`    | (new source — no legacy equivalent)                | Daily (orchestrated) | Docker-isolated; tol-sdk                           |
+| `update_nhm_status_list`     | fetch-from-apis NHM (commented out in legacy)      | Weekly               | POST API with cursor pagination                    |
+| `update_boat_config`         | GoaT API + Lustre                                  | Daily                | Assembly QC config builder                         |
+| `tol_utils` (shared)         | —                                                  | —                    | Shared ToL Portal helper, not a standalone updater |
+
+### Missing — Need New Updaters (8 categories, ~11 updaters)
+
+| #   | Source                                   | Legacy Job                                          | Priority | Schedule | Complexity                                       |
+| --- | ---------------------------------------- | --------------------------------------------------- | -------- | -------- | ------------------------------------------------ |
+| 1   | **BlobToolKit**                          | fetch-blobtoolkit (Docker `genomehubs parse --btk`) | HIGH     | Daily    | Medium — API pagination + Docker                 |
+| 2   | **RefSeq Organelles**                    | fetch-refseq-organelles (FTP + BioPython)           | HIGH     | Weekly   | Medium — FTP + GenBank parsing                   |
+| 3   | **VGP Status**                           | fetch-from-apis (GitHub YAML)                       | MEDIUM   | Weekly   | Low — simple HTTP + YAML parse                   |
+| 4   | **JGI 1KFG**                             | fetch-from-apis (OAuth REST)                        | MEDIUM   | Weekly   | Medium — OAuth token exchange                    |
+| 5   | **Ensembl Metadata** (×6)                | fetch-assembly-links (6 JSON endpoints)             | MEDIUM   | Monthly  | Low — HTTP + JSON→TSV, one parameterized updater |
+| 6   | **UCSC Assembly Hubs**                   | fetch-assembly-links                                | LOW      | Monthly  | Low — HTTP + text parsing                        |
+| 7   | **Google Sheets Status** (~20+ projects) | fetch-from-apis (R + Python)                        | HIGH     | Weekly   | High — rewrite R→Python, normalize tables        |
+| 8   | **SRA Data**                             | (parse_sra_data.py)                                 | MEDIUM   | Weekly   | Medium — NCBI API + XML parsing                  |
+
+### Legacy-Only Binary Fetches (no migration needed)
+
+These legacy jobs fetch tool binaries, not data. In the data repo they are either already package dependencies or not needed:
+
+- **fetch-ncbi-datasets** — downloads `datasets` CLI executable → already a pip dependency (`ncbi-datasets-cli`)
+- **fetch-genomehubs-api** — downloads GenomeHubs API binary → already a pip dependency (`genomehubs`)
+- **fetch-genomehubs-ui** — downloads GenomeHubs UI binary → not needed for data pipeline
+
+### Static/Semi-Static Sources (no external fetch needed)
+
+These exist as curated YAML/TSV pairs in `goat-data/sources/` and are uploaded directly to S3:
+
+- **Genomesize/Karyotype** — 25 FILE\_ sources (genome size databases, chromosome counts)
+- **Conservation** — CITES index (periodically updated manually)
+- **UK Legislation** — 9 FILE\_ sources (very static)
+- **Regional Lists** — 7 FILE\_ sources (static geographic lists)
+- **Lineages** — ODB10 lineage mappings
+- **OTT IDs** — OTT taxonomy mappings
+
+Note: **ToLIDs** are NOT static — the prefix list is actively fetched by `update_tolid_prefixes`. The `tolids.names.yaml` config in `goat-data/sources/tolids/` is a naming convention file that ships with the YAML configs, not a separate data source.
+
+The static sources listed above should be synced to S3 via a simple `sync_static_sources` utility or manually, not via updaters.
+
+---
+
+## Phase 1: External Data Fetching
+
+### Goal
+
+All external data fetching implemented as Prefect updaters with scheduled deployments, uploading raw data to S3 and emitting events for downstream parsing.
+
+### Steps
+
+#### Group A: API-Based Updaters (parallel development)
+
+**Step 1: `update_vgp_status` — VGP Status List**
+
+- Fetch GitHub YAML from `https://raw.githubusercontent.com/vgl-hub/genome-portal/master/_data/table_tracker.yml`
+- Parse YAML, extract fields: common_name, family, order, scientific_name, status, taxon_id, vgp_phase
+- Write TSV to `s3://goat/resources/status-lists/vgp.tsv`
+- Schedule: Weekly
+- Reuse: `safe_get()` from `flows/lib/utils.py`, `parse_args/shared_args` pattern
+- Reference: `goat-data/scripts/api/api_config.py` VGL handlers
+
+**Step 2: `update_jgi_status` — JGI 1KFG**
+
+- OAuth token exchange: offline_token → access_token via `https://signon.jgi.doe.gov/signon/create`
+- Paginated API: `https://gold-ws.jgi.doe.gov/projects?studyGoldId=Gs0000001`
+- Write TSV to `s3://goat/resources/status-lists/jgi_1kfg.tsv`
+- Schedule: Weekly
+- Requires: `JGI_OFFLINE_TOKEN` secret (Prefect Secret block or env var)
+- Reference: `goat-data/scripts/jgi_to_tsv.py`
+- Bug risk: Legacy code has fragile OAuth flow — add proper token refresh and expiry handling
+
+**Step 3: `update_ensembl_metadata` — Ensembl Species Metadata (6 databases)**
+
+- Single parameterized updater deployed 6 times with different division parameters
+- Divisions: Fungi, Metazoa, Plants, Protists, Vertebrates, Rapid Release
+- Fetch JSON from Ensembl REST API endpoints
+- Transform JSON→TSV (replace legacy `jq` one-liners with explicit Python)
+- Write to `s3://goat/resources/assembly-data/species_metadata_Ensembl{Division}.tsv.gz`
+- Schedule: Monthly
+- Reference: `goat-data/.github/workflows/fetch-resources.yml` fetch-assembly-links job
+
+**Step 4: `update_ucsc_assemblies` — UCSC Genome Browser**
+
+- Fetch assembly hub list from UCSC API
+- Parse to TSV
+- Write to `s3://goat/resources/assembly-data/ucsc_ids.tsv`
+- Schedule: Monthly
+- Reuse: `safe_get()`, standard arg parsing
+
+**Step 5: `update_sra_data` — SRA Metadata**
+
+- Fetch from NCBI SRA API (Entrez or BigQuery)
+- Parse XML/JSON responses to TSV
+- Write to `s3://goat/resources/sra/sra.tsv.gz`
+- Schedule: Weekly
+- Reference: `goat-data/scripts/parse_sra_data.py`
+- Bug risk: Legacy script has hardcoded batch sizes and silent error swallowing
+
+#### Group B: Complex Updaters (sequential, more effort)
+
+**Step 6: `update_blobtoolkit` — BlobToolKit Analysis Data**
+
+- Approach A (preferred): Direct API fetch from `https://blobtoolkit.genomehubs.org/api/v1/search/Eukaryota` + per-assembly detail queries
+- Approach B: Docker-isolated `genomehubs parse --btk` via orchestrator pattern (like tol_genome_notes)
+- Outputs: `btk.tsv.gz` + `btk.files.yaml` to `s3://goat/resources/btk/`
+- Schedule: Daily
+- Reference: `goat-data/scripts/parse_blobtoolkit.py`
+- Bug risk: Legacy has `print(plots)` debug line left in (line 66); pagination may miss entries
+
+**Step 7: `update_refseq_organelles` — RefSeq Organelle Data**
+
+- Fetch from NCBI FTP: `ftp.ncbi.nlm.nih.gov/refseq/release/`
+- Parse GenBank flat files for mitochondrion/plastid sequences
+- Extract: accession, taxon_id, organism, sequence_length, references
+- Write to `s3://goat/resources/assembly-data/refseq_organelles.tsv.gz`
+- Schedule: Weekly
+- Reference: `goat-data/scripts/parse_refseq_organelles.py` (uses BioPython)
+- Consideration: BioPython dependency may need Docker isolation (check pydantic conflicts)
+
+**Step 8: `update_google_sheets_status` — Google Sheets Project Status Lists**
+
+- Rewrite R script (`get_googlesheets.R`) entirely in Python
+- Fetch TSVs from public Google Sheets URLs (no auth needed for public sheets)
+- Use `import_status_lib.py` patterns for table normalization but rewrite cleanly:
+  - Replace pandas one-liners with explicit column mapping
+  - Handle encoding robustly (UTF-8 with fallback)
+  - Normalize species names, taxon IDs
+- Projects list parameterized (deploy once per project group or batch)
+- Outputs: One TSV per project to `s3://goat/resources/status-lists/{project}_expanded.tsv`
+- Schedule: Weekly
+- Sub-steps:
+  - 8a: Core fetcher function (reusable across all sheets)
+  - 8b: Table normalizer (species name cleaning, status field mapping)
+  - 8c: Per-project configuration (sheet URLs, field mappings, column renames)
+  - 8d: Deploy as single flow with project list parameter
+- Reference: `goat-data/scripts/import_status_lib.py`, `goat-data/scripts/import_status.py`
+- Bug risks in legacy:
+  - Code duplication (import_status_lib.py copied to ebp_import/)
+  - Silent encoding failures
+  - Hardcoded 24-project list
+  - Pandas operations that silently drop data on merge conflicts
+
+#### Group C: Infrastructure & Static Data
+
+**Step 9: `sync_static_sources` — Static YAML/TSV pairs**
+
+- Utility to upload curated YAML/TSV pairs from goat-data/sources/ to S3
+- Not a scheduled updater — run manually or on goat-data repo changes
+- Covers: genomesize-karyotype, conservation, uk-legislation, regional-lists, lineages
+- Could be triggered by a webhook on goat-data repo pushes
+
+**Step 10: Secrets & Configuration**
+
+- Configure Prefect Secret blocks for: `JGI_OFFLINE_TOKEN`, Google Sheets URLs
+- STS_AUTHORIZATION_KEY no longer needed (replaced by tol-sdk)
+- Add deployment entries to `flows/prefect.yaml` for all new updaters
+
+### Relevant Files (Phase 1)
+
+**New files to create:**
+
+- `flows/updaters/update_vgp_status.py`
+- `flows/updaters/update_jgi_status.py`
+- `flows/updaters/update_ensembl_metadata.py`
+- `flows/updaters/update_ucsc_assemblies.py`
+- `flows/updaters/update_sra_data.py`
+- `flows/updaters/update_blobtoolkit.py`
+- `flows/updaters/update_refseq_organelles.py`
+- `flows/updaters/update_google_sheets_status.py`
+- `flows/lib/google_sheets.py` (shared Google Sheets fetching utilities)
+- `flows/lib/api_helpers.py` (shared API helpers: OAuth, pagination, JSON→TSV)
+
+**Existing files to modify:**
+
+- `flows/prefect.yaml` — add deployments for all new updaters
+- `flows/lib/utils.py` — add any missing shared utilities
+- `flows/lib/shared_args.py` — add new argument definitions if needed
+- `requirements.txt` — add BioPython if needed for RefSeq parsing
+
+**Reference files (goat-data, read-only):**
+
+- `goat-data/scripts/api/api_config.py` — API endpoint definitions
+- `goat-data/scripts/api/api_tools.py` — retry/pagination patterns
+- `goat-data/scripts/jgi_to_tsv.py` — JGI OAuth flow
+- `goat-data/scripts/parse_blobtoolkit.py` — BTK API parsing
+- `goat-data/scripts/parse_refseq_organelles.py` — GenBank parsing
+- `goat-data/scripts/parse_sra_data.py` — SRA parsing
+- `goat-data/scripts/import_status_lib.py` — table normalization
+- `goat-data/scripts/get_googlesheets.R` — Google Sheets URLs
+- `goat-data/.github/workflows/fetch-resources.yml` — complete fetch workflow
+
+### Verification (Phase 1)
+
+1. Each updater runs locally with `SKIP_PREFECT=true` and produces valid output TSV
+2. Output TSV format matches goat-data legacy output (diff comparison where possible)
+3. S3 upload succeeds to `s3://goat/resources/` paths
+4. Events emitted with correct resource types for downstream triggering
+5. All tests pass: `python -m pytest tests/`
+6. No secret values hardcoded; all auth via env vars or Prefect Secret blocks
+7. `prefect deploy --prefect-file flows/prefect.yaml --all` succeeds
+
+### Decisions (Phase 1)
+
+- **Google Sheets**: Rewrite in Python (not R) for consistency with the rest of the codebase
+- **BlobToolKit**: Prefer direct API approach over Docker genomehubs parse (simpler, avoids Docker-in-Docker); fall back to orchestrator pattern if API is insufficient
+- **RefSeq Organelles**: Use BioPython in Docker container if pydantic conflicts arise
+- **Static sources**: Not updaters — sync utility or manual upload
+- **STS replaced by ToL Portal**: No migration needed (already done via `update_tol_portal_status`)
+
+---
+
+## Phase 2: YAML-Backed Parsers & Validation
+
+### Goal
+
+All data sources processed by fetch-parse-validate pipeline. Parsing triggered by update events. Validated TSV/YAML pairs uploaded to new S3 directories (`s3://goat/validated/`).
+
+### Steps
+
+**Step 1: Implement `parse_sequencing_status` parser**
+
+- Handle all status list TSV formats (VGP, JGI, Google Sheets projects, NHM, ToL Portal)
+- Config-driven: read YAML to determine column mappings
+- Reuse `Config` class from `flows/lib/utils.py`
+- One parser handles all ~65 status list YAML configs
+
+**Step 2: Implement `parse_refseq_organelles` parser**
+
+- Replace stub with working implementation
+- Read YAML config, apply field mappings from `refseq_organelles.types.yaml`
+- Validate organelle accessions, taxonomy
+
+**Step 3: Implement `parse_blobtoolkit` parser**
+
+- Parse BTK TSV using YAML config from `btk.types.yaml`
+- Handle BUSCO stats, base composition, read mapping fields
+
+**Step 4: Implement `parse_ensembl_metadata` parser**
+
+- Handle all 6 Ensembl division TSVs
+- Single generic parser, config-driven via YAML
+
+**Step 5: Implement `parse_sra_data` parser**
+
+- Parse SRA TSV with YAML config from `sra.types.yaml`
+
+**Step 6: Implement `parse_genomesize_karyotype` parser**
+
+- Handle the 25+ genomesize/karyotype FILE\_ sources
+- Generic parser for simple TSV→validated TSV transformation
+
+**Step 7: Implement `parse_conservation` and `parse_legislation` parsers**
+
+- Static data validation parsers
+- Check CITES categories, legislation references against YAML constraints
+
+**Step 8: Wire all fetch-parse-validate deployments**
+
+- Add trigger entries in `prefect.yaml` for each parser
+- Events from Phase 1 updaters trigger corresponding parse-validate flows
+- `validate_file_pair()` runs `blobtk validate` on each output
+- Gate S3 upload on validation success
+
+**Step 9: Configure S3 output paths**
+
+- Validated outputs go to `s3://goat/validated/{directory}/` (NOT `s3://goat/resources/` or `s3://goat/sources/`)
+- Both validated TSV and validated YAML uploaded
+- Validation report (JSONL) uploaded alongside for audit
+
+### Relevant Files (Phase 2)
+
+- `flows/parsers/parse_sequencing_status.py` — complete implementation
+- `flows/parsers/parse_refseq_organelles.py` — replace stub
+- `flows/parsers/parse_blobtoolkit.py` — new
+- `flows/parsers/parse_ensembl_metadata.py` — new
+- `flows/parsers/parse_sra_data.py` — new
+- `flows/parsers/parse_genomesize_karyotype.py` — new (generic)
+- `flows/parsers/parse_conservation.py` — new
+- `flows/orchestration/wrapper_fetch_parse_validate.py` — existing, may need updates
+- `flows/validators/validate_file_pair.py` — existing, may need S3 path updates
+- `flows/prefect.yaml` — add trigger entries
+- Local copies of YAML configs from `goat-data/sources/` for development
+
+### Verification (Phase 2)
+
+1. Each parser produces TSV matching the YAML config headers
+2. `blobtk validate -g <yaml>` passes for each output with ≥95% valid rows
+3. Event chain works: updater → parse → validate → S3 upload
+4. Validated files appear in `s3://goat/validated/` directories
+5. Row counts comparable to legacy pipeline output
+6. No data loss: compare parsed row counts against raw input counts
+
+### Decisions (Phase 2)
+
+- **S3 validated path**: `s3://goat/validated/` (separate from `resources/` and `sources/`)
+- **Parser reuse**: `parse_sequencing_status` handles ALL status list formats via YAML config
+- **Parser reuse**: `parse_genomesize_karyotype` handles ALL genomesize/karyotype sources generically
+- **YAML configs**: Develop with local copies; production fetches from goat-data `sources/`
+- **Scope boundary**: Phase 2 does NOT change the legacy import at all
+
+---
+
+## Phase 3: Switch Legacy Import to Validated Data
+
+### Goal
+
+Legacy import workflow reads from `s3://goat/validated/` instead of `s3://goat/resources/` or `s3://goat/sources/`, removing all fetch steps from the import.
+
+### Steps
+
+1. Verify data parity: compare `s3://goat/validated/` against `s3://goat/sources/` for all directories
+2. Update `goat-data/.github/workflows/genomehubs-index.yml` to read from `s3://goat/validated/`
+3. Remove fetch jobs from `goat-data/.github/workflows/fetch-resources.yml` (or disable)
+4. Update `goat-data/.github/workflows/s3_release.yml` to skip fetch-resources
+5. Run test release with validated data; compare with latest production release
+6. Staged rollout: switch one directory at a time, verify, proceed
+
+### Verification (Phase 3)
+
+1. Test release produces identical (or improved) Elasticsearch indices
+2. API test suite passes
+3. UI test suite passes
+4. Row counts match or exceed previous release
+5. Rollback path confirmed: can revert to `s3://goat/sources/` if issues arise
+
+### Risk Mitigation
+
+- Keep `s3://goat/sources/` and `s3://goat/resources/` intact as rollback
+- Phase 3 changes only S3 paths in workflow config, easily reversible
+- Switch one source directory at a time (assembly-data first, then status-lists, etc.)
+
+---
+
+## Phase 4: Replace Legacy Import (Future)
+
+### Goal
+
+Replace `genomehubs index` with updated import code that reads validated TSV/YAML pairs directly.
+
+### Scope
+
+- Requires new import code not yet available
+- Skip validation/lookup steps (already done in Phase 2)
+- Direct TSV→Elasticsearch indexing
+
+---
+
+## Phase 5: Full Pipeline Migration (Future)
+
+### Goal
+
+Remove all GitHub Actions workflow dependencies; full pipeline runs in Prefect.
+
+### Scope
+
+- Yet to be defined
+- Includes: ES init, indexing, fill, test, release promotion
+- Replaces: s3_release.yml, genomehubs-init.yml, genomehubs-index.yml, genomehubs-fill.yml, genomehubs-test.yml
+
+---
+
+## Network Robustness Review
+
+### Current `safe_get()` Implementation
+
+The existing `safe_get()` in `flows/lib/utils.py` provides:
+
+- Configurable timeout (default 300s)
+- Supports GET/POST/HEAD methods
+- Raises on HTTP errors via `response.raise_for_status()`
+- **No built-in retry logic** — relies entirely on Prefect task-level retries
+
+### Current Retry Patterns Across Updaters
+
+| Updater                    | Task Retries | Delay | Notes                                                     |
+| -------------------------- | ------------ | ----- | --------------------------------------------------------- |
+| `update_tolid_prefixes`    | 2            | 2s    | Reasonable for a single file download                     |
+| `update_ncbi_datasets`     | 2            | 2s    | Too few for NCBI rate-limited API; batches of 50          |
+| `update_nhm_status_list`   | 100          | 60s   | Very aggressive — NHM API known to be unstable            |
+| `update_tol_portal_status` | 3            | 60s   | Orchestrator-level retries re-run entire Docker container |
+| `update_tol_genome_notes`  | 3            | 60s   | Same orchestrator pattern                                 |
+| `update_ncbi_taxonomy`     | 2            | 2s    | FTP download — may need longer delay                      |
+| `update_ott_taxonomy`      | 2            | 2s    | HTTP download of .tgz — adequate                          |
+| `update_boat_config`       | (not set)    | —     | SSH-based, different failure modes                        |
+
+### Gaps & Recommendations for New Updaters
+
+**1. Add HTTP-level retry to `safe_get()`** (or create `resilient_get()`)
+
+- Use `urllib3.util.Retry` with `requests.adapters.HTTPAdapter` for transport-level retries
+- Retry on: 429 (rate limit), 500, 502, 503, 504
+- Exponential backoff: 1s, 2s, 4s (3 attempts)
+- This separates transient HTTP failures from task-level Prefect retries (which re-run the entire task)
+- Existing updaters benefit automatically when `safe_get()` is hardened
+
+**2. Per-source timeout tuning**
+| Source           | Recommended Timeout | Rationale                                 |
+| ---------------- | ------------------- | ----------------------------------------- |
+| VGP (GitHub raw) | 30s                 | Small YAML file, fast CDN                 |
+| JGI API          | 120s per page       | Paginated, can be slow                    |
+| Ensembl FTP      | 300s                | Large JSON files (>100MB for Vertebrates) |
+| UCSC             | 60s                 | Small text file                           |
+| SRA API          | 300s                | Potentially large responses               |
+| BlobToolKit API  | 120s per request    | Many per-assembly detail calls            |
+| RefSeq FTP       | 600s                | Large GenBank files                       |
+| Google Sheets    | 60s per sheet       | Can be slow on large sheets               |
+
+**3. Partial failure handling for paginated APIs**
+
+- JGI, BlobToolKit, and SRA all paginate — a failure mid-pagination should not discard pages already fetched
+- Write each page to a temp file; only assemble final TSV after all pages succeed
+- If a page fails after retries, emit a warning event with partial count and halt gracefully
+
+**4. Idempotency and freshness checks**
+
+- `update_tolid_prefixes` already uses HTTP HEAD timestamp comparison — reuse this pattern
+- New updaters should check `Last-Modified` or `ETag` before downloading, using `is_local_file_current_http()`
+- For APIs without timestamp headers (JGI, BlobToolKit), compare MD5 of output against previous S3 version using `generate_md5()`
+
+**5. S3 upload atomicity**
+
+- Current `upload_to_s3()` uses `s3cmd put` — if interrupted, leaves partial file on S3
+- Recommendation: upload to a `.tmp` key first, then copy to final key and delete `.tmp`
+- Or use boto3 multipart upload with automatic cleanup on failure
+
+**6. Connection pooling for high-volume API calls**
+
+- BlobToolKit updater will make ~10,000+ individual API calls (one per assembly)
+- Use a `requests.Session()` to reuse TCP connections and benefit from connection pooling
+- Add rate limiting (e.g., 10 req/s) to avoid overwhelming the BlobToolKit API
+
+**7. DNS and TLS failure handling**
+
+- `requests.exceptions.ConnectionError` and `requests.exceptions.SSLError` are not HTTP status codes — they won't be caught by status-code retry logic
+- Ensure transport-level retries cover these cases
+- Add explicit handling in updaters: log the error clearly, distinguish transient DNS vs permanent config errors
+
+---
+
+## Logging Review
+
+### Current Logging Patterns in the Data Repo
+
+**Primary mechanism:** `print()` with `@task(log_prints=True)`
+
+- When running under Prefect, print statements are captured as INFO-level logs
+- When `SKIP_PREFECT=true`, print goes to stdout (useful for local testing)
+- No structured logging (no JSON, no log levels beyond print)
+
+**What's logged today (by updater):**
+
+| Updater                      | Logging Pattern                                  | Gaps                              |
+| ---------------------------- | ------------------------------------------------ | --------------------------------- |
+| `update_ncbi_datasets`       | Prints batch progress, line counts, match status | No timing info                    |
+| `update_ncbi_taxonomy`       | Prints MD5 comparison, extraction status         | Good — includes checksums         |
+| `update_tolid_prefixes`      | Prints line count, timestamp comparison result   | Good — includes freshness check   |
+| `update_nhm_status_list`     | Prints page count, record counts per page        | Missing total elapsed time        |
+| `update_tol_portal_status`   | Minimal — Docker output captured                 | Docker stdout mixed with app logs |
+| `update_tol_genome_notes`    | Minimal — Docker output captured                 | Same Docker stdout issue          |
+| `update_ena_taxonomy_extra`  | Prints taxon counts                              | Missing API call timing           |
+| `update_genomehubs_taxonomy` | Prints blobtk command and result                 | Good — includes command           |
+| `update_ott_taxonomy`        | Prints download size, extraction                 | Good                              |
+| `update_boat_config`         | Prints API queries, SSH commands                 | Good — verbose                    |
+
+### Recommendations for New Updaters
+
+**1. Standardize a logging helper**
+Create a `log_progress()` utility in `flows/lib/utils.py` that:
+
+- Prints a timestamped message (ISO 8601)
+- Includes the updater/task name as a prefix
+- Works identically with and without Prefect (`print()`-based, not `logging` module)
+- Example: `[2026-04-24T12:00:00Z] update_vgp_status: Fetched 1,234 records in 3.2s`
+
+**2. Log network call summaries**
+Every HTTP request should log:
+
+- URL (redacted if contains secrets)
+- Method (GET/POST/HEAD)
+- Response status code
+- Response size (bytes)
+- Elapsed time (seconds)
+- Whether the response was from cache/retry
+
+**3. Log output file summaries**
+After writing each output file, log:
+
+- File path (local and S3)
+- Row count
+- File size
+- MD5 hash
+- Whether it matches previous version (changed/unchanged)
+
+**4. Log events emitted**
+Print a summary when emitting Prefect events:
+
+- Event name
+- `matches.previous` value
+- Key payload fields (row count, etc.)
+- This aids debugging when running with `SKIP_PREFECT=true` (since `emit_event` is a no-op)
+
+**5. Error context in exceptions**
+Every caught exception should include:
+
+- The URL or resource that failed
+- The HTTP status code (if applicable)
+- The attempt number (if retrying)
+- A hint about whether the error is transient or permanent
+- Example: `RuntimeError("JGI OAuth token exchange failed (attempt 2/3): 401 Unauthorized — check JGI_OFFLINE_TOKEN is valid")`
+
+**6. Docker orchestrator logging**
+For Docker-isolated flows (`tol_portal_status`, `tol_genome_notes`):
+
+- Capture and prefix Docker stdout/stderr separately
+- Log Docker exit code explicitly
+- Log the full Docker command (with secrets redacted) for reproducibility
+
+---
+
+## Conventions Reference
+
+### YAML/TSV Pair Convention (goat-data)
+
+- **Prefix patterns**: `ATTR_` (attribute defs), `TAXON_` (taxonomy), `FILE_` (data sources), unprefixed (primary)
+- **YAML structure**: `file:` metadata, `attributes:` field mappings, `taxonomy:` taxon matching, `identifiers:` ID columns
+- **`needs:`** directive: lists dependent YAML files that must be co-located
+- **TSV naming**: matches `file.name` in YAML config, often `.gz` compressed
+
+### Data Repo Code Conventions
+
+- Absolute imports: `from flows.lib import utils`
+- Google-style docstrings with type hints
+- `SKIP_PREFECT=true` for local testing
+- `run_quoted()` for subprocess (never `shell=True`)
+- `safe_get()` for HTTP requests
+- `parse_args()` with `shared_args` constants for CLI
+- Tasks: focused, idempotent, with `@task(retries=N, log_prints=True)`
+- Events: `emit_event()` with `prefect.resource.id/type/matches.previous`
+- Black formatter, 88-char line length
+
+### Legacy Code Bug Risks to Avoid
+
+1. `parse_blobtoolkit.py` line 66: debug `print(plots)` left in production
+2. `import_status_lib.py`: duplicated across directories, encoding silently fails
+3. `fetch-or-fallback.sh`: `|| exit 0` masks real errors
+4. Google Sheets: hardcoded `gid` parameters break on URL changes
+5. JGI OAuth: no token refresh/expiry handling
+6. NCBI API: hardcoded 30s timeouts, silent failure on rate limit
+7. Pandas merge conflicts silently drop data in status list processing
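Recommendation 3 in the Network Robustness Review of `v1-initial-plan.md` (partial failure handling for paginated APIs) can be sketched with the standard library alone. This is an illustrative sketch, not code from the data repo: `fetch_all_pages`, `PartialFetchError`, the `fetch_page` callback, the `work_dir` argument, and the `page_NNNNN.jsonl` naming are all assumptions. Each page is written to its own file as it arrives, the final TSV is assembled only after every page has succeeded, and a failed page raises an error carrying the partial row count while leaving the already-written page files on disk.

```python
import csv
import json
from pathlib import Path
from typing import Callable, Dict, List, Sequence


class PartialFetchError(RuntimeError):
    """Pagination stopped early; records how far it got."""

    def __init__(self, page: int, rows_so_far: int, cause: Exception):
        super().__init__(f"page {page} failed after {rows_so_far} rows: {cause}")
        self.page = page
        self.rows_so_far = rows_so_far


def fetch_all_pages(
    fetch_page: Callable[[int], List[Dict[str, str]]],
    work_dir: Path,
    out_tsv: Path,
    fieldnames: Sequence[str],
) -> int:
    """Fetch pages until an empty page is returned, then assemble one TSV.

    Hypothetical helper. `fetch_page(n)` is assumed to return the rows of
    page n (0-based) and an empty list once the pages are exhausted.
    """
    work_dir.mkdir(parents=True, exist_ok=True)
    page_files: List[Path] = []
    rows_so_far = 0
    page = 0
    while True:
        try:
            rows = fetch_page(page)
        except Exception as exc:
            # Page files written so far stay in work_dir for inspection.
            print(f"WARNING: halted at page {page} with {rows_so_far} rows fetched")
            raise PartialFetchError(page, rows_so_far, exc) from exc
        if not rows:
            break
        path = work_dir / f"page_{page:05d}.jsonl"
        with path.open("w", encoding="utf-8") as fh:
            for row in rows:
                fh.write(json.dumps(row) + "\n")
        page_files.append(path)
        rows_so_far += len(rows)
        page += 1

    # Every page succeeded: write next to the target, then rename into place.
    tmp_out = out_tsv.with_name(out_tsv.name + ".tmp")
    with tmp_out.open("w", encoding="utf-8", newline="") as out:
        writer = csv.DictWriter(
            out,
            fieldnames=fieldnames,
            delimiter="\t",
            lineterminator="\n",
            extrasaction="ignore",
        )
        writer.writeheader()
        for path in page_files:
            with path.open(encoding="utf-8") as fh:
                for line in fh:
                    writer.writerow(json.loads(line))
    tmp_out.replace(out_tsv)
    return rows_so_far
```

In a real updater, `fetch_page` would presumably wrap `safe_get()` with the per-source timeout, and the warning would be emitted as a Prefect event rather than printed; both are left out here to keep the sketch self-contained.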
diff --git a/docs/plans/v2-current-plan.md b/docs/plans/v2-current-plan.md
new file mode 100644
index 0000000..eb4e4c2
--- /dev/null
+++ b/docs/plans/v2-current-plan.md
@@ -0,0 +1,172 @@
+# GoaT Data Import Pipeline Migration Plan — v2 (current)
+
+> **Status as of this revision:** Phase 1 complete · Phase 2 mostly complete (cleanup tasks remaining) · Phases 3–5 not yet started.
+>
+> For the original framing and historical context see [v1-initial-plan.md](v1-initial-plan.md).
+
+## TL;DR
+
+Phase 1 (external data fetching) and the bulk of Phase 2 (YAML-backed parsers + validation) are now implemented. Every legacy fetch job from `goat-data/.github/workflows/fetch-resources.yml` has a corresponding Prefect updater, and every source directory that needs transformation has either a parser or a `SKIP_PARSING` assignment plus a YAML config. What remains in Phase 2 is targeted cleanup: confirm parser-vs-skip assignments, ensure every status-list YAML can be validated end-to-end locally, get `blobtk` on the worker PATH, and finalize a couple of YAML schemas. After that, Phase 3 cuts the legacy import over to `s3://goat/validated/`.
+
+## Status at a glance
+
+| Phase                                | State          | Notes                                                                               |
+| ------------------------------------ | -------------- | ----------------------------------------------------------------------------------- |
+| Phase 1 — External data fetching     | ✅ Complete    | 19 updaters deployed (see inventory below).                                         |
+| Phase 2 — Parsers & validation       | 🔄 In progress | 11 parsers + fetch-parse-validate wrapper landed. Cleanup tasks tracked below.      |
+| Phase 3 — Switch legacy import       | ⬜ Not started | Gated on Phase 2 cleanup + a parity comparison between `validated/` and `sources/`. |
+| Phase 4 — Replace `genomehubs index` | ⬜ Future      | Requires new import code; out of scope for this revision.                           |
+| Phase 5 — Full pipeline migration    | ⬜ Future      | Replaces remaining GitHub Actions workflows (release / init / index / fill / test). |
+
+---
+
+## Phase 1 — External Data Fetching (COMPLETE)
+
+Every external fetch job from the legacy workflow now has a Prefect-backed updater that writes raw data to `s3://goat/resources/...` and emits an `update.*.finished` event.
+
+**Updater inventory (`flows/updaters/`):**
+
+- `update_ncbi_datasets.py`
+- `update_ncbi_taxonomy.py`
+- `update_ena_taxonomy_extra.py`
+- `update_genomehubs_taxonomy.py`
+- `update_tolid_prefixes.py`
+- `update_ott_taxonomy.py`
+- `update_tol_portal_status.py`
+- `update_tol_genome_notes.py`
+- `update_nhm_status_list.py`
+- `update_boat_config.py`
+- `update_vgp_status.py`
+- `update_vgp_original_status.py`
+- `update_jgi_status.py`
+- `update_ensembl_metadata.py`
+- `update_ucsc_assemblies.py`
+- `update_sra_data.py`
+- `update_blobtoolkit.py`
+- `update_refseq_organelles.py`
+- `update_google_sheets_status.py`
+
+Shared helpers: `tol_utils.py`, `flows/updaters/api/`, `flows/lib/utils.py`, `flows/lib/shared_args.py`.
+
+All deployments are wired in `flows/prefect.yaml`. There are no remaining Phase 1 items.
+
+---
+
+## Phase 2 — YAML-Backed Parsers & Validation (IN PROGRESS)
+
+### What is in place
+
+**Parsers (`flows/parsers/`):**
+
+| Parser                                  | Handles                                                                  |
+| --------------------------------------- | ------------------------------------------------------------------------ |
+| `parse_ncbi_assemblies.py`              | NCBI Datasets + data-freeze assembly TSVs.                               |
+| `parse_refseq_organelles.py`            | NCBI RefSeq mitochondrion / plastid GenBank → pivoted TSV.               |
+| `parse_sequencing_status.py`            | JGI 1KFG (and any other status list whose source format needs pivoting). |
+| `parse_blobtoolkit.py`                  | BlobToolKit analysis exports (stub; see cleanup).                        |
+| `parse_sra_data.py`                     | SRA metadata TSV.                                                        |
+| `parse_genomesize_karyotype.py`         | Generic genomesize / karyotype `FILE_` sources.                          |
+| `parse_conservation.py`                 | CITES + conservation sources.                                            |
+| `parse_legislation.py`                  | UK legislation `FILE_` sources.                                          |
+| `parse_skip_parsing.py`                 | Pass-through for inputs that already match their YAML schema.            |
+| `parse_backfill_historical_versions.py` | Historical assembly version backfill.                                    |
+
+Discovery is automatic via `flows/parsers/register.py` (any `parse_*.py` is picked up). `Parser` enum members serialize to lowercase (e.g. `skip_parsing`) but the `PARSERS.parsers` dict is keyed by `Parser.name` (uppercase, e.g. `SKIP_PARSING`).
+
+**Wrappers, validators, and orchestrators:**
+
+- `flows/lib/wrapper_fetch_parse_validate.py` — production fetch → parse → validate → S3 upload pipeline.
+- `flows/lib/local_fetch_parse_validate.py` — local equivalent: copies the YAML + TSV into a work directory, runs the parser, runs `validate_file_pair` with `s3_path=None`, and gracefully skips validation if the `blobtk` binary is not on PATH. Handles plain ↔ gz conversion so the input file matches the YAML's `file.name`. Handles the lowercase-enum / uppercase-dict-key mismatch when looking up parsers.
+- `flows/lib/validate_file_pair.py` — wraps the `blobtk validate` Rust binary.
+- `flows/orchestrators/batch_validate_status_lists.py` — triggered by `update.google.sheets.status.finished`. Iterates every `FILE_*.types.yaml` under `goat-data/sources/status-lists/`, calls the standard `fetch_parse_validate(parser=Parser.SKIP_PARSING, …)` for each TSV present, and reports pass / fail / skip-no-tsv / skip-config-error counts. CLI flags: `--yaml-dir --work-dir --taxdump-path --s3-path --dry-run --min-valid --min-assigned`.
+
+**Other lib modules in current use:** `conditional_import.py`, `fetch_genomehubs_target_list.py`, `fetch_previous_file_pair.py`, `for_each_record.py`, `index_assembly_features.py`, `process_features.py`, `shared_args.py`, `shared_tasks.py`, `utils.py`.
+
+### Parser ↔ source assignment audit (current)
+
+| Source directory                             | Deployment                            | Parser                 |
+| -------------------------------------------- | ------------------------------------- | ---------------------- |
+| `assembly-data/ncbi_datasets`                | `fpv-ncbi-datasets`                   | `NCBI_ASSEMBLIES`      |
+| `assembly-data/data_freeze`                  | `fpv-data-freeze`                     | `NCBI_ASSEMBLIES`      |
+| `assembly-data/refseq_organelles`            | `fpv-refseq-organelles`               | `REFSEQ_ORGANELLES`    |
+| `assembly-data/ucsc`                         | `fpv-ucsc`                            | `SKIP_PARSING`         |
+| `btk/`                                       | `fpv-blobtoolkit`                     | `SKIP_PARSING`         |
+| `sra/`                                       | `fpv-sra`                             | `SKIP_PARSING`         |
+| `status-lists/vgp` (FILE_VGP_Ordinal_Phase1) | `fpv-vgp`                             | `SKIP_PARSING`         |
+| `status-lists/nhm`                           | `fpv-nhm`                             | `SKIP_PARSING`         |
+| `status-lists/jgi_1kfg`                      | `fpv-jgi`                             | `SEQUENCING_STATUS`    |
+| `status-lists/google_sheets/*`               | `batch-validate-google-sheets-status` | `SKIP_PARSING` (batch) |
+
+Rationale for `SKIP_PARSING` on BTK and UCSC: `blobtk validate` can derive the taxonomy columns from a `taxon_id` column automatically, so no pre-parse transformation is required. The YAML schema is the source of truth.
+
+### Phase 2 cleanup — remaining work
+
+1. **`blobtk` on worker PATH.** Validation currently no-ops locally on the developer machine because the binary is not installed. Add it to the worker image (and document a local install option) so `local_fetch_parse_validate.py` reports real validation outcomes instead of skipping.
+2. **`BLOBTOOLKIT` parser placeholder.** `parse_blobtoolkit.py` exists but is a thin pass-through. Decide whether to keep `SKIP_PARSING` permanently for `btk/` (current production setting) or graduate to a real parser once the BTK API export gains structured fields the YAML cannot describe.
+3. **`GENOMESIZE_KARYOTYPE` schema confirmation.** `parse_genomesize_karyotype.py` is generic, but a handful of `FILE_` sources still need their YAMLs cross-checked against the parser's column expectations. Walk every YAML under `goat-data/sources/genomesize-karyotype/` and run `local_fetch_parse_validate.py` once per file.
+4. **JGI YAML.** Confirm `sources/status-lists/jgi_1kfg/jgi_1kfg.types.yaml` matches the columns emitted by `update_jgi_status` after the OAuth pagination rewrite.
+5. **End-to-end parity check.** Run the batch validator (`batch_validate_status_lists.py`) over all current `status-lists/` YAMLs locally and record the pass / fail / skip rates. Fix anything that fails before Phase 3.
+
+### Verification (Phase 2)
+
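+1. `python -m flows.lib.local_fetch_parse_validate -p … --yaml-path … --input-tsv …` returns exit 0 for every (parser, source) pair in the table above. Run this with `blobtk` on PATH, because the script also exits 0 when validation is skipped.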
+2. `python -m flows.orchestrators.batch_validate_status_lists --dry-run` lists every `FILE_*.types.yaml` under `status-lists/` with the expected parser assignment.
+3. `prefect deploy --prefect-file flows/prefect.yaml --all` succeeds and the trigger for `batch-validate-google-sheets-status` shows `update.google.sheets.status.finished`.
+
+---
+
+## Phase 3 — Switch Legacy Import to Validated Data (NOT STARTED)
+
+Unchanged from v1. Recap:
+
+1. Confirm parity between `s3://goat/validated/` and `s3://goat/sources/` per directory.
+2. Update `goat-data/.github/workflows/genomehubs-index.yml` to read from `validated/`.
+3. Disable fetch jobs in `goat-data/.github/workflows/fetch-resources.yml` and skip them in `s3_release.yml`.
+4. Test release; compare ES indices, API tests, UI tests.
+5. Staged rollout: assembly-data first, then status-lists, then the rest.
+
+Rollback path: revert the S3 path in the workflow — `sources/` and `resources/` remain intact.
+
+## Phase 4 — Replace Legacy Import (FUTURE)
+
+Unchanged from v1. Requires the new import code (skips re-validation/lookup, reads validated TSV/YAML pairs directly into Elasticsearch).
+
+## Phase 5 — Full Pipeline Migration (FUTURE)
+
+Unchanged from v1. Move ES init, indexing, fill, test, and release promotion out of GitHub Actions into Prefect.
+
+---
+
+## Implemented surface area (snapshot)
+
+- **Parsers:** 11 (see Phase 2 table).
+- **Updaters:** 19 (Phase 1 inventory).
+- **Orchestrators:** 5 — `batch_validate_status_lists`, `tasks`, `tol_data_pipeline`, `tol_genome_notes_orchestration`, `tol_portal_status_orchestration`.
+- **Lib modules:** 12 — `conditional_import`, `fetch_genomehubs_target_list`, `fetch_previous_file_pair`, `for_each_record`, `index_assembly_features`, `local_fetch_parse_validate`, `process_features`, `shared_args`, `shared_tasks`, `utils`, `validate_file_pair`, `wrapper_fetch_parse_validate`.
+- **Deployments in `flows/prefect.yaml`:** 34 (including the new `batch-validate-google-sheets-status`).
+
+## Reference material carried forward from v1
+
+The following sections of [v1-initial-plan.md](v1-initial-plan.md) remain authoritative and have not been duplicated here:
+
+- **Gap analysis** — historical record of which legacy jobs needed updaters. Now fully implemented.
+- **Network robustness review** — `safe_get()` hardening guidance, per-source timeout table, paginated-API partial-failure handling, idempotency / freshness checks, S3 upload atomicity, connection pooling, DNS / TLS handling.
+- **Logging review** — `log_progress()` helper proposal, network-call summaries, output-file summaries, event-emission logging, exception context, Docker orchestrator logging.
+- **Conventions reference** — YAML/TSV pair conventions, repo coding conventions, list of legacy code bug risks to avoid.
+
+These are general-purpose engineering guidance and apply to any future updater or parser work.
+
+---
+
+## Change log
+
+**v1 → v2 (this revision):**
+
+- Marked Phase 1 complete; replaced the "missing updaters" table with the implemented inventory.
+- Marked Phase 2 mostly complete; added the parser-vs-source assignment audit table.
+- Added the **Phase 2 cleanup** section enumerating the remaining items (blobtk PATH, BLOBTOOLKIT parser decision, GENOMESIZE_KARYOTYPE schema sweep, JGI YAML, end-to-end parity).
+- Documented `flows/lib/local_fetch_parse_validate.py` and `flows/orchestrators/batch_validate_status_lists.py` (both new since v1).
+- Recorded the BTK and UCSC `SKIP_PARSING` decision (auto-taxonomy in `blobtk validate`).
+- Recorded the VGP YAML correction (`FILE_VGP_Ordinal_Phase1.types.yaml`).
+- Phases 3–5 unchanged.
+- Network-robustness, logging, and conventions sections kept in v1 by reference rather than duplicated.
diff --git a/flows/.gitignore b/flows/.gitignore
new file mode 100644
index 0000000..47b052f
--- /dev/null
+++ b/flows/.gitignore
@@ -0,0 +1,2 @@
+!.gitignore
+__pycache__/
diff --git a/flows/lib/__pycache__/__init__.cpython-313.pyc b/flows/lib/__pycache__/__init__.cpython-313.pyc
new file mode 100644
index 0000000..a5e6e06
Binary files /dev/null and b/flows/lib/__pycache__/__init__.cpython-313.pyc differ
diff --git a/flows/lib/__pycache__/conditional_import.cpython-313.pyc b/flows/lib/__pycache__/conditional_import.cpython-313.pyc
new file mode 100644
index 0000000..a72da4a
Binary files /dev/null and b/flows/lib/__pycache__/conditional_import.cpython-313.pyc differ
diff --git a/flows/lib/__pycache__/shared_args.cpython-313.pyc b/flows/lib/__pycache__/shared_args.cpython-313.pyc
new file mode 100644
index 0000000..8f8009e
Binary files /dev/null and b/flows/lib/__pycache__/shared_args.cpython-313.pyc differ
diff --git a/flows/lib/__pycache__/utils.cpython-313.pyc b/flows/lib/__pycache__/utils.cpython-313.pyc
new file mode 100644
index 0000000..7597344
Binary files /dev/null and b/flows/lib/__pycache__/utils.cpython-313.pyc differ
diff --git a/flows/lib/conditional_import.py b/flows/lib/conditional_import.py
index ada08c4..89f0184 100644
--- a/flows/lib/conditional_import.py
+++ b/flows/lib/conditional_import.py
@@ -35,5 +35,6 @@ def emit_event(*_, **__):
     from prefect.events import emit_event
     from prefect.runtime.task_run import run_count
 
+NO_CACHE = NO_CACHE
 
-__all__ = ["flow", "task", "emit_event", "run_count", "skip_prefect", "NO_CACHE"]
+__all__ = ["flow", "task", "emit_event", "run_count", "skip_prefect"]
diff --git a/flows/lib/fetch_previous_file_pair.py b/flows/lib/fetch_previous_file_pair.py
index d0acbeb..cbb070b 100644
--- a/flows/lib/fetch_previous_file_pair.py
+++ b/flows/lib/fetch_previous_file_pair.py
@@ -112,13 +112,16 @@ def copy_yaml_files(yaml_path: str, config: Config, work_dir: str) -> None:
     # Copy any dependencies to the working directory
     if "needs" in config.config["file"]:
         source_dir = os.path.dirname(yaml_path)
-        for file in config.config["file"]["needs"]:
+        needs = config.config["file"]["needs"]
+        if not isinstance(needs, list):
+            needs = [needs]
+        for file in needs:
             file_path = os.path.join(source_dir, file)
             shutil.copy(file_path, work_dir)
 
 
 @flow()
-def fetch_previous_file_pair(yaml_path: str, s3_path: str, work_dir: str) -> None:
+def fetch_previous_file_pair(yaml_path: str, s3_path: str, work_dir: str) -> bool:
     """
     Fetch the previous YAML/TSV files and compare headers.
 
@@ -128,7 +131,7 @@ def fetch_previous_file_pair(yaml_path: str, s3_path: str, work_dir: str) -> Non
         work_dir (str): Path to the working directory.
     """
     config = utils.load_config(yaml_path)
-    (local_file, remote_file) = get_filenames(config, s3_path, work_dir)
+    local_file, remote_file = get_filenames(config, s3_path, work_dir)
     line_count = fetch_tsv_file(remote_file, local_file)
     copy_yaml_files(yaml_path, config, work_dir)
     status = compare_headers(config, local_file)
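
The `needs` change in `copy_yaml_files` above wraps a scalar value in a list before iterating. Without that step, a `for` loop over a plain string would visit it one character at a time. A minimal sketch of the normalization, assuming `needs` is either a single filename or a list of filenames; `normalize_needs` and the filenames are hypothetical and not part of this repository:

```python
def normalize_needs(needs):
    """Return `needs` as a list, wrapping a scalar value.

    Hypothetical helper for illustration only; the diff inlines this logic.
    """
    if not isinstance(needs, list):
        needs = [needs]
    return needs


# A single dependency given as a bare string becomes a one-item list.
print(normalize_needs("shared.types.yaml"))
# A list passes through unchanged.
print(normalize_needs(["a.types.yaml", "b.types.yaml"]))
```
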
diff --git a/flows/lib/local_fetch_parse_validate.py b/flows/lib/local_fetch_parse_validate.py
new file mode 100644
index 0000000..e1e4757
--- /dev/null
+++ b/flows/lib/local_fetch_parse_validate.py
@@ -0,0 +1,266 @@
+"""Local fetch-parse-validate wrapper for testing without S3.
+
+This mirrors the production wrapper_fetch_parse_validate.py but:
+- Skips S3 fetch (uses a local input TSV directly)
+- Copies YAML + input TSV into work_dir with expected names
+- Runs the selected parser (or SKIP_PARSING)
+- Runs blobtk validate locally (no S3 upload)
+
+Usage:
+    SKIP_PREFECT=true python -m flows.lib.local_fetch_parse_validate \
+        -p SKIP_PARSING \
+        --yaml-path ../goat-data/sources/status-lists/FILE_VGP_Ordinal_Phase1.types.yaml \
+        --input-tsv tsv_examples/VGP_Ordinal_Phase1_plus.tsv \
+        --work-dir /tmp/test-vgp
+
+    SKIP_PREFECT=true python -m flows.lib.local_fetch_parse_validate \
+        -p REFSEQ_ORGANELLES \
+        --yaml-path ../goat-data/sources/assembly-data/refseq_organelles.types.yaml \
+        --input-tsv tsv_examples/refseq_organelles.tsv \
+        --work-dir /tmp/test-refseq
+"""
+
+import argparse
+import os
+import shutil
+import subprocess
+import sys
+
+from flows.lib.conditional_import import flow
+from flows.lib.utils import enum_action, load_config
+from flows.parsers.register import register_plugins
+from flows.validators.validate_file_pair import validate_file_pair
+
+PARSERS = register_plugins()
+
+
+def _check_blobtk():
+    """Verify blobtk is available on PATH."""
+    try:
+        subprocess.run(
+            ["blobtk", "--version"],
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+        )
+    except FileNotFoundError:
+        print(
+            "[local] WARNING: 'blobtk' not found on PATH. "
+            "Validation step will be skipped.\n"
+            "       Install via: cd blobtk/rust && cargo build --release && "
+            "export PATH=$PWD/target/release:$PATH",
+            file=sys.stderr,
+        )
+        return False
+    return True
+
+
+def _copy_yaml_to_workdir(yaml_path: str, work_dir: str) -> str:
+    """Copy the YAML and its dependencies into work_dir.
+
+    Returns:
+        str: Path to the copied YAML in work_dir.
+    """
+    os.makedirs(work_dir, exist_ok=True)
+    config = load_config(yaml_path)
+    dest = os.path.join(work_dir, os.path.basename(yaml_path))
+    shutil.copy(yaml_path, dest)
+
+    # Copy dependency YAML files (e.g. "needs:" references)
+    if "needs" in config.config.get("file", {}):
+        source_dir = os.path.dirname(yaml_path)
+        needs = config.config["file"]["needs"]
+        if isinstance(needs, str):
+            needs = [needs]
+        for dep in needs:
+            dep_path = os.path.join(source_dir, dep)
+            if os.path.exists(dep_path):
+                shutil.copy(dep_path, work_dir)
+    return dest
+
+
+def _place_input_tsv(input_tsv: str, yaml_path: str, work_dir: str) -> str:
+    """Copy the input TSV into work_dir with the name expected by the YAML.
+
+    Handles gzip: if YAML expects .gz but input is plain, compress on copy.
+    If YAML expects plain but input is .gz, decompress on copy.
+
+    Returns:
+        str: Path to the TSV in work_dir.
+    """
+    import gzip as gzip_mod
+
+    config = load_config(yaml_path)
+    expected_name = os.path.basename(config.config["file"]["name"])
+    dest = os.path.join(work_dir, expected_name)
+
+    # If input already matches expected location, skip
+    if os.path.abspath(input_tsv) == os.path.abspath(dest):
+        return dest
+
+    expects_gz = expected_name.endswith(".gz")
+    input_is_gz = input_tsv.endswith(".gz")
+
+    if expects_gz and not input_is_gz:
+        # Compress plain input into .gz destination
+        with open(input_tsv, "rb") as f_in, gzip_mod.open(dest, "wb") as f_out:
+            shutil.copyfileobj(f_in, f_out)
+    elif not expects_gz and input_is_gz:
+        # Decompress .gz input into plain destination
+        with gzip_mod.open(input_tsv, "rb") as f_in, open(dest, "wb") as f_out:
+            shutil.copyfileobj(f_in, f_out)
+    else:
+        # Same format — straight copy
+        shutil.copy(input_tsv, dest)
+    return dest
+
+
+@flow(log_prints=True)
+def local_fetch_parse_validate(
+    parser: str,
+    yaml_path: str,
+    input_tsv: str,
+    work_dir: str,
+    taxdump_path: str = None,
+    append: bool = False,
+    min_valid: int = 0,
+    min_assigned: int = 0,
+) -> bool:
+    """Run the parse-validate pipeline locally without S3.
+
+    Args:
+        parser: Parser enum name (e.g. "SKIP_PARSING", "REFSEQ_ORGANELLES").
+        yaml_path: Path to the source YAML configuration file.
+        input_tsv: Path to the input TSV file from the updater.
+        work_dir: Working directory for intermediate files.
+        taxdump_path: Optional path to an NCBI taxdump for taxonomy validation.
+        append: Whether to run in append mode.
+        min_valid: Minimum expected valid row count.
+        min_assigned: Minimum expected assigned taxa count.
+
+    Returns:
+        bool: True if validation passed, False if it failed, or None if skipped.
+    """
+    yaml_path = os.path.abspath(yaml_path)
+    input_tsv = os.path.abspath(input_tsv)
+    work_dir = os.path.abspath(work_dir)
+
+    if not os.path.exists(yaml_path):
+        raise FileNotFoundError(f"YAML file not found: {yaml_path}")
+    if not os.path.exists(input_tsv):
+        raise FileNotFoundError(f"Input TSV not found: {input_tsv}")
+
+    print(f"[local] Parser: {parser}")
+    print(f"[local] YAML:   {yaml_path}")
+    print(f"[local] Input:  {input_tsv}")
+    print(f"[local] Work:   {work_dir}")
+
+    # Step 1: Copy YAML to work_dir
+    working_yaml = _copy_yaml_to_workdir(yaml_path, work_dir)
+    print(f"[local] Copied YAML → {working_yaml}")
+
+    # Step 2: Place input TSV with expected filename
+    tsv_dest = _place_input_tsv(input_tsv, yaml_path, work_dir)
+    print(f"[local] Input TSV → {tsv_dest}")
+
+    # Step 3: Run parser
+    parser_key = parser.name if hasattr(parser, "name") else str(parser)
+    file_parser = PARSERS.parsers[parser_key]
+    print(f"[local] Running parser: {file_parser.name}")
+    file_parser.func(
+        working_yaml=working_yaml,
+        work_dir=work_dir,
+        append=append,
+        data_freeze_path=None,
+    )
+    print("[local] Parser completed")
+
+    # Step 4: Validate (no S3 upload — s3_path=None)
+    if _check_blobtk():
+        print("[local] Running validation...")
+        status = validate_file_pair(
+            yaml_path=yaml_path,
+            work_dir=work_dir,
+            taxdump_path=taxdump_path,
+            s3_path=None,
+            min_valid=min_valid,
+            min_assigned=min_assigned,
+        )
+        if status:
+            print("[local] ✓ Validation PASSED")
+        else:
+            print("[local] ✗ Validation FAILED")
+    else:
+        print("[local] ⚠ Validation SKIPPED (blobtk not available)")
+        status = None
+    return status
+
+
+def main():
+    """CLI entry point."""
+    arg_parser = argparse.ArgumentParser(
+        description="Local fetch-parse-validate (no S3).",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog=__doc__,
+    )
+    arg_parser.add_argument(
+        "-p", "--parser",
+        required=True,
+        type=str,
+        action=enum_action(PARSERS.ParserEnum),
+        help=f"Parser to use. Choices: {[e.name for e in PARSERS.ParserEnum]}",
+    )
+    arg_parser.add_argument(
+        "--yaml-path",
+        required=True,
+        help="Path to the source YAML configuration file.",
+    )
+    arg_parser.add_argument(
+        "--input-tsv",
+        required=True,
+        help="Path to the input TSV file (from the updater).",
+    )
+    arg_parser.add_argument(
+        "--work-dir",
+        default="/tmp/local-fpv",
+        help="Working directory for intermediate files.",
+    )
+    arg_parser.add_argument(
+        "--taxdump-path",
+        default=None,
+        help="Path to an NCBI taxdump directory.",
+    )
+    arg_parser.add_argument(
+        "--append",
+        action="store_true",
+        help="Run in append mode.",
+    )
+    arg_parser.add_argument(
+        "--min-valid",
+        type=int,
+        default=0,
+        help="Minimum expected valid row count.",
+    )
+    arg_parser.add_argument(
+        "--min-assigned",
+        type=int,
+        default=0,
+        help="Minimum expected assigned taxa count.",
+    )
+
+    args = arg_parser.parse_args()
+    success = local_fetch_parse_validate(
+        parser=args.parser,
+        yaml_path=args.yaml_path,
+        input_tsv=args.input_tsv,
+        work_dir=args.work_dir,
+        taxdump_path=args.taxdump_path,
+        append=args.append,
+        min_valid=args.min_valid,
+        min_assigned=args.min_assigned,
+    )
+    # Exit 0 if validation passed or was skipped (None), 1 if failed
+    sys.exit(0 if success is not False else 1)
+
+
+if __name__ == "__main__":
+    main()
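
Two small conventions in `local_fetch_parse_validate.py` are easy to miss: the parser registry is looked up by the uppercase enum member name, and the CLI exit code treats a skipped validation (`None`) as success. The sketch below illustrates both under stated assumptions. The `Parser` enum and `registry` dict are hypothetical stand-ins for `PARSERS.ParserEnum` and `PARSERS.parsers`, whose real definitions are not shown in this diff.

```python
from enum import Enum


class Parser(str, Enum):
    """Hypothetical stand-in: values are lowercase, names are uppercase."""

    SKIP_PARSING = "skip_parsing"
    REFSEQ_ORGANELLES = "refseq_organelles"


# Assumed registry shape: keyed by the uppercase member name.
registry = {member.name: f"parser for {member.value}" for member in Parser}


def resolve_key(parser):
    """Mirror the lookup in the flow: enum member -> name, string -> itself."""
    return parser.name if hasattr(parser, "name") else str(parser)


def exit_code(status):
    """Map the tri-state result to a process exit code.

    True (passed) and None (skipped) exit 0; only False (failed) exits 1.
    """
    return 0 if status is not False else 1


print(registry[resolve_key(Parser.SKIP_PARSING)])
print(registry[resolve_key("REFSEQ_ORGANELLES")])
print([exit_code(s) for s in (True, None, False)])
```

Because a skipped validation exits 0, a green run only proves the data validated when `blobtk` was actually available.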
diff --git a/flows/lib/shared_args.py b/flows/lib/shared_args.py
index 4d5d801..4f9f4fb 100644
--- a/flows/lib/shared_args.py
+++ b/flows/lib/shared_args.py
@@ -57,6 +57,16 @@
     },
 }
 
+DIVISION = {
+    "flags": ["--division"],
+    "keys": {
+        "help": "Ensembl division (fungi, metazoa, plants, protists, rapid, vertebrates).",
+        "type": str,
+        "default": "vertebrates",
+    },
+}
+
+
 DRY_RUN = {
     "flags": ["-d", "--dry_run"],
     "keys": {
@@ -94,6 +104,15 @@
     "keys": {"help": "Type of index to fetch.", "type": str},
 }
 
+INDEX_URL = {
+    "flags": ["--index_url"],
+    "keys": {
+        "help": "URL to the private index TSV listing project sheets.",
+        "type": str,
+    },
+}
+
+
 INPUT_PATH = {
     "flags": ["-i", "--input_path"],
     "keys": {"help": "Path to the input file.", "type": str},
@@ -197,9 +216,7 @@
 }
 
 
-def default(
-    arg: Dict[str, Any], default: Union[int, float, bool, str]
-) -> Dict[str, Any]:
+def default(arg: Dict[str, Any], default: Union[int, float, bool, str]) -> Dict[str, Any]:
     """Return an argument with a default value."""
     # append/replace the default value to the help message
     default_re = r"\s*\(default: .*\)"
diff --git a/flows/lib/utils.py b/flows/lib/utils.py
index ed8f8bd..60e4855 100644
--- a/flows/lib/utils.py
+++ b/flows/lib/utils.py
@@ -1,6 +1,7 @@
 #!/usr/bin/python3
 
 import contextlib
+import glob
 import gzip
 import hashlib
 import os
@@ -18,6 +19,8 @@
 import boto3
 import requests
 from botocore.exceptions import ClientError
+from requests.adapters import HTTPAdapter
+from urllib3.util.retry import Retry
 from dateutil import parser
 from genomehubs import utils as gh_utils
 
@@ -131,6 +134,152 @@ def parse(self, data: dict) -> dict:
         return parsed_data
 
 
+def open_tsv(input_path: str):
+    """Open a TSV file (plain or gzipped) for reading.
+
+    Args:
+        input_path (str): Path to the input TSV file (.tsv or .tsv.gz).
+
+    Returns:
+        File handle in text mode.
+    """
+    if input_path.endswith(".gz"):
+        return gzip.open(input_path, "rt", encoding="utf-8", newline="")
+    return open(input_path, "rt", encoding="utf-8", newline="")
+
+
+def parse_tsv_with_config(
+    input_path: str,
+    config: "Config",
+    key_field: Optional[str] = None,
+    delimiter: str = "\t",
+) -> Dict[str, dict]:
+    """Parse a TSV file row-by-row through a Config's parse functions.
+
+    Each row of the TSV is treated as a flat dict keyed by column header,
+    matching the expected ``path:`` references in YAML attribute definitions.
+    Rows are passed through ``gh_utils.parse_report_values`` to apply any
+    YAML-defined translations and field mappings, then keyed in the returned
+    dict by ``key_field`` (or by row index if not provided).
+
+    Args:
+        input_path (str): Path to the input TSV file (.tsv or .tsv.gz).
+        config (Config): Loaded YAML configuration.
+        key_field (str): Optional input column name to use as the dict key.
+            If not provided, rows are keyed by sequential integer.
+        delimiter (str): Field delimiter in the TSV (default: tab).
+
+    Returns:
+        Dict[str, dict]: Mapping of key → parsed row dict (YAML-named fields).
+    """
+    parsed: Dict[str, dict] = {}
+    with open_tsv(input_path) as fh:
+        reader = DictReader(fh, delimiter=delimiter)
+        for index, record in enumerate(reader):
+            row = gh_utils.parse_report_values(config.parse_fns, record)
+            if key_field and key_field in record and record[key_field]:
+                key = record[key_field]
+            else:
+                key = str(index)
+            parsed[key] = row
+    return parsed
+
+
+def locate_input_tsv(work_dir: str, expected_name: Optional[str] = None) -> str:
+    """Locate the input TSV in ``work_dir`` for a generic parser.
+
+    Picks the single ``*.tsv`` or ``*.tsv.gz`` in ``work_dir`` whose basename
+    is not the expected output. There is no fallback to ``expected_name``:
+    if that file is the only TSV present, ``FileNotFoundError`` is raised.
+
+    Args:
+        work_dir (str): Working directory.
+        expected_name (str): Output filename from ``config.meta["file_name"]``;
+            used to exclude the parser's intended output from candidate
+            inputs.
+
+    Returns:
+        str: Path to the input TSV.
+
+    Raises:
+        FileNotFoundError: If no candidate TSV is found.
+        ValueError: If multiple candidate TSVs are found.
+    """
+    candidates = sorted(
+        glob.glob(os.path.join(work_dir, "*.tsv"))
+        + glob.glob(os.path.join(work_dir, "*.tsv.gz"))
+    )
+    if expected_name:
+        candidates = [
+            c for c in candidates if os.path.basename(c) != expected_name
+        ]
+    if not candidates:
+        raise FileNotFoundError(
+            f"No TSV input found in {work_dir} (expected != {expected_name})"
+        )
+    if len(candidates) > 1:
+        raise ValueError(
+            f"Multiple TSV inputs in {work_dir}: {candidates!r}"
+        )
+    return candidates[0]
+
+
+def run_generic_tsv_parser(
+    working_yaml: str,
+    work_dir: str,
+    append: bool = False,
+    key_field: Optional[str] = None,
+) -> None:
+    """Run the generic flat-TSV-with-Config parsing pipeline.
+
+    Locates the input TSV in ``work_dir``, loads the YAML config,
+    applies ``parse_report_values`` row-by-row, and writes the
+    canonical TSV to ``work_dir`` (preserving the YAML-defined
+    ``file_name`` for downstream validation).
+
+    Args:
+        working_yaml (str): Path to the working YAML config file.
+        work_dir (str): Working directory.
+        append (bool): If True, load previous parsed data.
+        key_field (str): Optional input column to key parsed rows by.
+    """
+    config = load_config(config_file=working_yaml, load_previous=append)
+    expected_name = config.meta["file_name"]
+    input_path = locate_input_tsv(work_dir, expected_name)
+    print(f"Parsing {input_path} with {working_yaml}")
+
+    parsed = parse_tsv_with_config(input_path, config, key_field=key_field)
+    print(f"Parsed {len(parsed)} records")
+
+    output_name = config.meta["file_name"]
+    config.meta["file_name"] = os.path.join(
+        work_dir, os.path.basename(output_name)
+    )
+    try:
+        write_parsed_tsv(parsed, config)
+    finally:
+        config.meta["file_name"] = output_name
+
+
+def write_parsed_tsv(parsed: Dict[str, dict], config: "Config") -> None:
+    """Write a parsed dict to TSV using config-defined headers and meta.
+
+    Handles ``.gz`` filenames by writing uncompressed then gzipping.
+
+    Args:
+        parsed (Dict[str, dict]): Mapping of key → row dict.
+        config (Config): Loaded YAML configuration.
+    """
+    file_name = config.meta["file_name"]
+    if file_name.endswith(".gz"):
+        config.meta["file_name"] = file_name[:-3]
+        gh_utils.write_tsv(parsed, config.headers, config.meta)
+        os.system(f"gzip -f {config.meta['file_name']}")
+        config.meta["file_name"] = file_name
+    else:
+        gh_utils.write_tsv(parsed, config.headers, config.meta)
+
+
 def format_entry(entry, key: str, meta: dict) -> str:
     """
     Formats a single entry in a dictionary, handling the case where the entry is a list.
@@ -550,13 +699,56 @@ def __call__(self, parser, namespace, values, option_string=None):
     return EnumAction
 
 
+def _build_session(retries=3, backoff_factor=1.0, status_forcelist=None):
+    """Build a requests Session with transport-level retry logic.
+
+    Args:
+        retries (int): Total number of retries per request.
+        backoff_factor (float): Backoff factor for exponential delay between retries.
+        status_forcelist (list): HTTP status codes to trigger a retry.
+
+    Returns:
+        requests.Session: Configured session with retry adapter.
+    """
+    if status_forcelist is None:
+        status_forcelist = [429, 500, 502, 503, 504]
+    retry = Retry(
+        total=retries,
+        backoff_factor=backoff_factor,
+        status_forcelist=status_forcelist,
+        allowed_methods=["GET", "POST", "HEAD"],
+        raise_on_status=False,
+    )
+    adapter = HTTPAdapter(max_retries=retry)
+    session = requests.Session()
+    session.mount("https://", adapter)
+    session.mount("http://", adapter)
+    return session
+
+
 def safe_get(*args, method="GET", timeout=300, **kwargs):
+    """Make an HTTP request with transport-level retries.
+
+    Retries on 429/500/502/503/504 responses and connection errors with
+    exponential backoff scaled by ``backoff_factor``. This is separate from
+    Prefect task-level retries, which re-run the entire task.
+
+    Args:
+        *args: Positional arguments passed to requests (typically the URL).
+        method (str): HTTP method — "GET", "POST", or "HEAD".
+        timeout (int): Request timeout in seconds.
+        **kwargs: Additional keyword arguments passed to requests.
+
+    Returns:
+        requests.Response: The HTTP response object.
+    """
+    session = _build_session()
     if method == "GET":
-        return requests.get(*args, timeout=timeout, **kwargs)
+        return session.get(*args, timeout=timeout, **kwargs)
     elif method == "POST":
-        return requests.post(*args, timeout=timeout, **kwargs)
+        return session.post(*args, timeout=timeout, **kwargs)
     elif method == "HEAD":
-        return requests.head(*args, timeout=timeout, **kwargs)
+        return session.head(*args, timeout=timeout, **kwargs)
 
 
 def find_http_file(http_path: str, filename: str) -> str:
diff --git a/flows/orchestrators/batch_validate_status_lists.py b/flows/orchestrators/batch_validate_status_lists.py
new file mode 100644
index 0000000..29863bd
--- /dev/null
+++ b/flows/orchestrators/batch_validate_status_lists.py
@@ -0,0 +1,111 @@
+"""Batch fetch-parse-validate for all Google Sheets status list TSVs.
+
+Triggered by update.google.sheets.status.finished, this flow iterates
+through all FILE_*.types.yaml in the status-lists directory and runs
+the full fetch-parse-validate pipeline (with SKIP_PARSING) for each one
+whose corresponding TSV is present in work_dir.
+"""
+
+import os
+import sys
+from glob import glob
+from typing import Optional
+
+from flows.lib.conditional_import import flow
+from flows.lib.utils import load_config
+from flows.lib.wrapper_fetch_parse_validate import Parser, fetch_parse_validate
+
+
+@flow(log_prints=True)
+def batch_validate_status_lists(
+    yaml_dir: str,
+    work_dir: str,
+    s3_path: str = "s3://goat/sources/status-lists/",
+    taxdump_path: Optional[str] = None,
+    dry_run: bool = False,
+    min_valid: int = 0,
+    min_assigned: int = 0,
+) -> bool:
+    """Run fetch-parse-validate (SKIP_PARSING) for all status list TSVs in work_dir.
+
+    For each FILE_*.types.yaml in yaml_dir, checks if the corresponding
+    TSV exists in work_dir. If present, invokes the standard
+    fetch_parse_validate flow with SKIP_PARSING.
+
+    Args:
+        yaml_dir: Directory containing FILE_*.types.yaml files.
+        work_dir: Directory containing TSVs output by the updater.
+        s3_path: S3 path prefix for validated files.
+        taxdump_path: Optional NCBI taxdump path.
+        dry_run: If True, skip S3 upload.
+        min_valid: Minimum valid row count per file.
+        min_assigned: Minimum assigned taxa per file.
+
+    Returns:
+        bool: False if any file failed or no YAML was found, else True.
+    """
+    yaml_files = sorted(glob(os.path.join(yaml_dir, "FILE_*.types.yaml")))
+    if not yaml_files:
+        print(f"No FILE_*.types.yaml found in {yaml_dir}")
+        return False
+
+    results = {}
+
+    for yaml_path in yaml_files:
+        yaml_name = os.path.basename(yaml_path)
+        try:
+            config = load_config(yaml_path)
+        except Exception as e:
+            print(f"  SKIP {yaml_name}: failed to load config — {e}")
+            results[yaml_name] = "skip-config-error"
+            continue
+
+        tsv_name = os.path.basename(config.config["file"]["name"])
+        tsv_path = os.path.join(work_dir, tsv_name)
+
+        if not os.path.exists(tsv_path):
+            # TSV not present — updater may not have produced it this run
+            results[yaml_name] = "skip-no-tsv"
+            continue
+
+        try:
+            fetch_parse_validate(
+                parser=Parser.SKIP_PARSING,
+                yaml_path=yaml_path,
+                s3_path=s3_path,
+                work_dir=work_dir,
+                taxdump_path=taxdump_path,
+                dry_run=dry_run,
+                min_valid=min_valid,
+                min_assigned=min_assigned,
+            )
+            results[yaml_name] = "pass"
+            print(f"  ✓ {yaml_name}")
+        except Exception as e:
+            print(f"  ✗ {yaml_name}: {e}")
+            results[yaml_name] = "fail"
+
+    # Summary
+    passed = sum(1 for v in results.values() if v == "pass")
+    failed = sum(1 for v in results.values() if v == "fail")
+    skipped = sum(1 for v in results.values() if v.startswith("skip"))
+    print(f"\nBatch fetch-parse-validate: {passed} passed, {failed} failed, {skipped} skipped")
+
+    return failed == 0
+
+
+if __name__ == "__main__":
+    import argparse
+
+    p = argparse.ArgumentParser(description="Batch validate status list TSVs.")
+    p.add_argument("--yaml-dir", required=True, help="Directory with FILE_*.types.yaml")
+    p.add_argument("--work-dir", required=True, help="Directory with updater TSVs")
+    p.add_argument("--taxdump-path", default=None)
+    p.add_argument("--s3-path", default="s3://goat/sources/status-lists/")
+    p.add_argument("--dry-run", action="store_true")
+    p.add_argument("--min-valid", type=int, default=0)
+    p.add_argument("--min-assigned", type=int, default=0)
+    args = p.parse_args()
+
+    success = batch_validate_status_lists(**vars(args))
+    sys.exit(0 if success else 1)
diff --git a/flows/parsers/parse_blobtoolkit.py b/flows/parsers/parse_blobtoolkit.py
new file mode 100644
index 0000000..0971bf5
--- /dev/null
+++ b/flows/parsers/parse_blobtoolkit.py
@@ -0,0 +1,93 @@
+"""Parse BlobToolKit assembly summary TSV.
+
+The corresponding updater (``update_blobtoolkit``) emits a flat TSV
+(one row per BTK dataset) whose column headers already match the
+``header:`` values in ``btk.types.yaml``. This parser simply applies
+the YAML parse functions and writes the canonical TSV.
+"""
+
+import os
+from glob import glob
+
+from flows.lib.conditional_import import flow
+from flows.lib.utils import (  # noqa: E402
+    Parser,
+    load_config,
+    parse_tsv_with_config,
+    write_parsed_tsv,
+)
+from flows.parsers.args import parse_args  # noqa: E402
+
+
+def _locate_input_tsv(work_dir: str, expected_name: str) -> str:
+    """Find the single input TSV in ``work_dir``, ignoring the expected output.
+
+    Args:
+        work_dir (str): Working directory.
+        expected_name (str): YAML-defined output filename.
+
+    Returns:
+        str: Path to the input TSV.
+    """
+    skip = os.path.basename(expected_name)
+    candidates = sorted(
+        glob(os.path.join(work_dir, "*.tsv")) + glob(os.path.join(work_dir, "*.tsv.gz"))
+    )
+    candidates = [c for c in candidates if os.path.basename(c) != skip]
+    if not candidates:
+        raise FileNotFoundError(f"No BTK input TSV found in {work_dir}")
+    if len(candidates) > 1:
+        raise ValueError(f"Multiple TSV inputs in {work_dir}: {candidates!r}")
+    return candidates[0]
+
+
+@flow(log_prints=True)
+def parse_blobtoolkit(
+    working_yaml: str,
+    work_dir: str,
+    append: bool = False,
+    **kwargs,
+) -> None:
+    """Parse BTK summary TSV using YAML schema.
+
+    Args:
+        working_yaml (str): Path to the YAML configuration file.
+        work_dir (str): Working directory containing the input TSV.
+        append (bool): If True, load previous parsed data.
+        **kwargs: Ignored extra arguments from the wrapper.
+    """
+    config = load_config(config_file=working_yaml, load_previous=append)
+
+    expected_name = config.meta["file_name"]
+    input_path = _locate_input_tsv(work_dir, expected_name)
+    print(f"Parsing BlobToolKit summary: {input_path}")
+
+    parsed = parse_tsv_with_config(input_path, config, key_field="accession")
+    print(f"Parsed {len(parsed)} BTK dataset records")
+
+    output_name = config.meta["file_name"]
+    config.meta["file_name"] = os.path.join(
+        work_dir, os.path.basename(output_name)
+    )
+    try:
+        write_parsed_tsv(parsed, config)
+    finally:
+        config.meta["file_name"] = output_name
+
+
+def plugin():
+    """Register the parser plugin."""
+    return Parser(
+        name="BLOBTOOLKIT",
+        func=parse_blobtoolkit,
+        description="Parse BlobToolKit assembly summary TSV using YAML schema.",
+    )
+
+
+if __name__ == "__main__":
+    args = parse_args("Parse BlobToolKit assembly summary TSV.")
+    parse_blobtoolkit(
+        working_yaml=args.yaml_path,
+        work_dir=os.path.dirname(args.input_path) or ".",
+        append=args.append,
+    )
diff --git a/flows/parsers/parse_conservation.py b/flows/parsers/parse_conservation.py
new file mode 100644
index 0000000..f86c549
--- /dev/null
+++ b/flows/parsers/parse_conservation.py
@@ -0,0 +1,50 @@
+"""Parse conservation-status source TSV using a YAML schema.
+
+Handles the ``FILE_CITES_full_index.types.yaml`` (and any future
+``FILE_*.types.yaml``) configurations under ``sources/conservation``.
+Delegates to the shared generic flat-TSV pipeline.
+"""
+
+import os
+
+from flows.lib.conditional_import import flow
+from flows.lib.utils import Parser, run_generic_tsv_parser
+from flows.parsers.args import parse_args
+
+
+@flow(log_prints=True)
+def parse_conservation(
+    working_yaml: str,
+    work_dir: str,
+    append: bool = False,
+    **kwargs,
+) -> None:
+    """Parse a conservation-status TSV using YAML schema.
+
+    Args:
+        working_yaml (str): Path to the YAML configuration file.
+        work_dir (str): Working directory containing the input TSV.
+        append (bool): If True, load previous parsed data.
+        **kwargs: Ignored extra arguments from the wrapper.
+    """
+    run_generic_tsv_parser(
+        working_yaml=working_yaml, work_dir=work_dir, append=append
+    )
+
+
+def plugin():
+    """Register the parser plugin."""
+    return Parser(
+        name="CONSERVATION",
+        func=parse_conservation,
+        description="Parse a conservation-status TSV using a YAML schema.",
+    )
+
+
+if __name__ == "__main__":
+    args = parse_args("Parse a conservation-status TSV using a YAML schema.")
+    parse_conservation(
+        working_yaml=args.yaml_path,
+        work_dir=os.path.dirname(args.input_path) or ".",
+        append=args.append,
+    )
diff --git a/flows/parsers/parse_genomesize_karyotype.py b/flows/parsers/parse_genomesize_karyotype.py
new file mode 100644
index 0000000..715e4b7
--- /dev/null
+++ b/flows/parsers/parse_genomesize_karyotype.py
@@ -0,0 +1,53 @@
+"""Parse genome size & karyotype source TSV using a YAML schema.
+
+Generic parser for the ~25 ``FILE_*.types.yaml`` configurations under
+``sources/genomesize-karyotype``. Each source is a flat TSV whose
+columns map directly to YAML attribute headers; this parser delegates
+to the shared generic flat-TSV pipeline.
+"""
+
+import os
+
+from flows.lib.conditional_import import flow
+from flows.lib.utils import Parser, run_generic_tsv_parser
+from flows.parsers.args import parse_args
+
+
+@flow(log_prints=True)
+def parse_genomesize_karyotype(
+    working_yaml: str,
+    work_dir: str,
+    append: bool = False,
+    **kwargs,
+) -> None:
+    """Parse a genome-size or karyotype TSV using YAML schema.
+
+    Args:
+        working_yaml (str): Path to the YAML configuration file.
+        work_dir (str): Working directory containing the input TSV.
+        append (bool): If True, load previous parsed data.
+        **kwargs: Ignored extra arguments from the wrapper.
+    """
+    run_generic_tsv_parser(
+        working_yaml=working_yaml, work_dir=work_dir, append=append
+    )
+
+
+def plugin():
+    """Register the parser plugin."""
+    return Parser(
+        name="GENOMESIZE_KARYOTYPE",
+        func=parse_genomesize_karyotype,
+        description="Parse a genome-size or karyotype TSV using a YAML schema.",
+    )
+
+
+if __name__ == "__main__":
+    args = parse_args(
+        "Parse a genome-size or karyotype TSV using a YAML schema."
+    )
+    parse_genomesize_karyotype(
+        working_yaml=args.yaml_path,
+        work_dir=os.path.dirname(args.input_path) or ".",
+        append=args.append,
+    )
diff --git a/flows/parsers/parse_legislation.py b/flows/parsers/parse_legislation.py
new file mode 100644
index 0000000..43d3aa5
--- /dev/null
+++ b/flows/parsers/parse_legislation.py
@@ -0,0 +1,51 @@
+"""Parse legislation source TSV using a YAML schema.
+
+Handles the ``FILE_*.types.yaml`` configurations under
+``sources/uk-legislation`` (Council Directive 92/43/EEC, Conservation
+of Habitats and Species Regulations 2017, Wildlife and Countryside
+Act 1981, etc.). Delegates to the shared generic flat-TSV pipeline.
+"""
+
+import os
+
+from flows.lib.conditional_import import flow
+from flows.lib.utils import Parser, run_generic_tsv_parser
+from flows.parsers.args import parse_args
+
+
+@flow(log_prints=True)
+def parse_legislation(
+    working_yaml: str,
+    work_dir: str,
+    append: bool = False,
+    **kwargs,
+) -> None:
+    """Parse a legislation TSV using YAML schema.
+
+    Args:
+        working_yaml (str): Path to the YAML configuration file.
+        work_dir (str): Working directory containing the input TSV.
+        append (bool): If True, load previous parsed data.
+        **kwargs: Ignored extra arguments from the wrapper.
+    """
+    run_generic_tsv_parser(
+        working_yaml=working_yaml, work_dir=work_dir, append=append
+    )
+
+
+def plugin():
+    """Register the parser plugin."""
+    return Parser(
+        name="LEGISLATION",
+        func=parse_legislation,
+        description="Parse a legislation TSV using a YAML schema.",
+    )
+
+
+if __name__ == "__main__":
+    args = parse_args("Parse a legislation TSV using a YAML schema.")
+    parse_legislation(
+        working_yaml=args.yaml_path,
+        work_dir=os.path.dirname(args.input_path) or ".",
+        append=args.append,
+    )
diff --git a/flows/parsers/parse_refseq_organelles.py b/flows/parsers/parse_refseq_organelles.py
index c7245bd..c441264 100644
--- a/flows/parsers/parse_refseq_organelles.py
+++ b/flows/parsers/parse_refseq_organelles.py
@@ -1,32 +1,143 @@
-from flows.lib.utils import Parser  # noqa: E402
+"""Parse RefSeq organelle data into one-row-per-assembly TSV.
+
+The corresponding updater (``update_refseq_organelles``) emits one row per
+organelle sequence (mitochondrion or plastid). The GoaT YAML schema
+(``refseq_organelles.types.yaml``) expects one row per assembly with
+combined ``mitochondrion*`` / ``plastid*`` columns. This parser pivots
+the per-organelle rows by the assembly accession (genbank), then runs
+the records through the YAML parse functions.
+"""
+
+import os
+from csv import DictReader
+from glob import glob
+
+from genomehubs import utils as gh_utils
+
+from flows.lib.conditional_import import flow
+from flows.lib.utils import (  # noqa: E402
+    Parser,
+    load_config,
+    open_tsv,
+    write_parsed_tsv,
+)
 from flows.parsers.args import parse_args  # noqa: E402
 
+ORGANELLE_FIELDS = ("id", "assemblySpan", "gcPercent", "nPercent")
+
+
+def _locate_input_tsv(work_dir: str, expected_name: str) -> str:
+    """Find the per-organelle input TSV in ``work_dir``."""
+    expected_path = os.path.join(work_dir, expected_name)
+    if os.path.exists(expected_path):
+        return expected_path
+    candidates = sorted(
+        glob(os.path.join(work_dir, "*.tsv"))
+        + glob(os.path.join(work_dir, "*.tsv.gz"))
+    )
+    if not candidates:
+        raise FileNotFoundError(
+            f"No TSV input found in {work_dir} (expected {expected_name})"
+        )
+    if len(candidates) > 1:
+        raise ValueError(
+            f"Multiple TSV inputs in {work_dir}: {candidates!r}"
+        )
+    return candidates[0]
+
+
+def _pivot_by_assembly(input_path: str) -> dict:
+    """Group per-organelle rows by GenBank accession, falling back to ``id``.
+
+    Args:
+        input_path (str): Path to the per-organelle TSV.
+
+    Returns:
+        dict: Mapping of assembly accession → nested record with
+            ``mitochondrion``/``plastid`` sub-dicts.
+    """
+    by_assembly: dict = {}
+    with open_tsv(input_path) as fh:
+        reader = DictReader(fh, delimiter="\t")
+        for row in reader:
+            assembly = row.get("genbankAccession") or row.get("id")
+            if not assembly:
+                continue
+            record = by_assembly.setdefault(
+                assembly,
+                {
+                    "id": row.get("id", assembly),
+                    "genbankAccession": assembly,
+                    "bioproject": row.get("bioproject", ""),
+                    "biosample": row.get("biosample", ""),
+                    "releaseDate": row.get("releaseDate", ""),
+                    "annotations": {"organism": row.get("organismName", "")},
+                    "taxonId": row.get("taxonId", ""),
+                    "sourceAuthor": row.get("sourceAuthor", ""),
+                    "sourceYear": row.get("sourceYear", ""),
+                    "sourceTitle": row.get("sourceTitle", ""),
+                    "pubmedId": row.get("pubmedId", ""),
+                    "sampleLocation": row.get("sampleLocation", ""),
+                },
+            )
+            organelle = (row.get("organelle") or "").lower()
+            if organelle in ("mitochondrion", "plastid"):
+                record[organelle] = {
+                    field: row.get(field, "") for field in ORGANELLE_FIELDS
+                }
+    return by_assembly
 
+
+@flow(log_prints=True)
 def parse_refseq_organelles(
-    working_yaml: str, work_dir: str, append: bool, **kwargs
+    working_yaml: str,
+    work_dir: str,
+    append: bool = False,
+    **kwargs,
 ) -> None:
-    """
-    Wrapper function to parse the RefSeq organelles JSONL file.
+    """Pivot per-organelle TSV to per-assembly and apply YAML schema.
 
     Args:
-        working_yaml (str): Path to the working YAML file.
-        work_dir (str): Path to the working directory.
-        append (bool): Whether to append to the existing TSV file.
-        **kwargs: Additional keyword arguments.
+        working_yaml (str): Path to the YAML configuration file.
+        work_dir (str): Working directory containing the input TSV.
+        append (bool): If True, load previous parsed data.
+        **kwargs: Ignored extra arguments from the wrapper.
     """
-    print("parsing RefSeq organelles files")
+    config = load_config(config_file=working_yaml, load_previous=append)
+
+    expected_name = config.meta["file_name"]
+    input_path = _locate_input_tsv(work_dir, expected_name)
+    print(f"Parsing RefSeq organelles: {input_path}")
+
+    grouped = _pivot_by_assembly(input_path)
+    print(f"Pivoted to {len(grouped)} assemblies")
+
+    parsed = {
+        key: gh_utils.parse_report_values(config.parse_fns, record)
+        for key, record in grouped.items()
+    }
+
+    output_name = config.meta["file_name"]
+    config.meta["file_name"] = os.path.join(work_dir, os.path.basename(output_name))
+    try:
+        write_parsed_tsv(parsed, config)
+    finally:
+        config.meta["file_name"] = output_name
 
 
 def plugin():
-    """Register the flow."""
+    """Register the parser plugin."""
     return Parser(
         name="REFSEQ_ORGANELLES",
         func=parse_refseq_organelles,
-        description="Parse the RefSeq organelles files in a directory.",
+        description="Pivot per-organelle TSV to per-assembly and apply YAML schema.",
     )
 
 
 if __name__ == "__main__":
-    """Run the flow."""
-    args = parse_args()
-    parse_refseq_organelles(**vars(args))
+    args = parse_args("Parse RefSeq organelle data into one-row-per-assembly TSV.")
+    parse_refseq_organelles(
+        working_yaml=args.yaml_path,
+        work_dir=os.path.dirname(args.input_path) or ".",
+        append=args.append,
+    )
diff --git a/flows/parsers/parse_sequencing_status.py b/flows/parsers/parse_sequencing_status.py
index 37915d7..c7091bf 100644
--- a/flows/parsers/parse_sequencing_status.py
+++ b/flows/parsers/parse_sequencing_status.py
@@ -1,32 +1,114 @@
-from flows.lib.utils import Parser  # noqa: E402
+"""Parse status list TSVs using a YAML schema.
+
+Generic parser that handles all sequencing status list inputs:
+VGP, JGI 1KFG, Google Sheets project lists, NHM, CNGB,
+ToL Portal, ToL Genome Notes, and similar.
+
+The input is a tab-separated file produced by the corresponding
+updater (one row per record). The YAML schema describes how each
+input column maps to a GoaT attribute (and may translate values
+via ``translate:`` blocks). This parser:
+
+1. Locates the input TSV in ``work_dir`` matching ``meta.file_name``
+   (or, failing that, the single TSV in the directory).
+2. Reads each row as a flat dict keyed by column header.
+3. Applies YAML parse functions via ``gh_utils.parse_report_values``.
+4. Writes the canonical TSV using YAML-defined headers.
+"""
+
+import os
+from glob import glob
+
+from flows.lib.conditional_import import flow
+from flows.lib.utils import (  # noqa: E402
+    Parser,
+    load_config,
+    parse_tsv_with_config,
+    write_parsed_tsv,
+)
 from flows.parsers.args import parse_args  # noqa: E402
 
 
+def _locate_input_tsv(work_dir: str, expected_name: str) -> str:
+    """Find the input TSV in ``work_dir``.
+
+    Looks first for ``expected_name`` (matching ``meta.file_name`` from
+    the YAML); falls back to the only ``*.tsv`` or ``*.tsv.gz`` file in
+    ``work_dir`` and raises if there are none or several.
+
+    Args:
+        work_dir (str): Working directory.
+        expected_name (str): The filename declared in YAML ``file.name``.
+
+    Returns:
+        str: Path to the input TSV.
+    """
+    expected_path = os.path.join(work_dir, expected_name)
+    if os.path.exists(expected_path):
+        return expected_path
+
+    candidates = sorted(
+        glob(os.path.join(work_dir, "*.tsv"))
+        + glob(os.path.join(work_dir, "*.tsv.gz"))
+    )
+    if not candidates:
+        raise FileNotFoundError(
+            f"No TSV input found in {work_dir} (expected {expected_name})"
+        )
+    if len(candidates) > 1:
+        raise ValueError(
+            f"Multiple TSV inputs in {work_dir}: {candidates!r}; "
+            "place a single source TSV or name it to match YAML file.name."
+        )
+    return candidates[0]
+
+
+@flow(log_prints=True)
 def parse_sequencing_status(
-    working_yaml: str, work_dir: str, append: bool, **kwargs
+    working_yaml: str,
+    work_dir: str,
+    append: bool = False,
+    **kwargs,
 ) -> None:
-    """
-    Wrapper function to parse the sequencing status files.
+    """Parse a sequencing status TSV using a YAML schema.
 
     Args:
-        working_yaml (str): Path to the working YAML file.
-        work_dir (str): Path to the working directory.
-        append (bool): Whether to append to the existing TSV file.
-        **kwargs: Additional keyword arguments.
+        working_yaml (str): Path to the YAML configuration file.
+        work_dir (str): Working directory containing the input TSV.
+        append (bool): If True, load previous parsed data for incremental
+            updates.
+        **kwargs: Ignored extra arguments from the wrapper.
     """
-    print("parsing sequencing status files")
+    config = load_config(config_file=working_yaml, load_previous=append)
+
+    expected_name = config.meta["file_name"]
+    input_path = _locate_input_tsv(work_dir, expected_name)
+    print(f"Parsing sequencing status: {input_path}")
+
+    parsed = parse_tsv_with_config(input_path, config)
+    print(f"Parsed {len(parsed)} rows")
+
+    output_name = config.meta["file_name"]
+    config.meta["file_name"] = os.path.join(work_dir, os.path.basename(output_name))
+    try:
+        write_parsed_tsv(parsed, config)
+    finally:
+        config.meta["file_name"] = output_name
 
 
 def plugin():
-    """Register the flow."""
+    """Register the parser plugin."""
     return Parser(
         name="SEQUENCING_STATUS",
         func=parse_sequencing_status,
-        description="Parse the sequencing status files in a directory.",
+        description="Parse a sequencing status TSV using a YAML schema.",
     )
 
 
 if __name__ == "__main__":
-    """Run the flow."""
-    args = parse_args()
-    parse_sequencing_status(**vars(args))
+    args = parse_args("Parse a sequencing status TSV using a YAML schema.")
+    parse_sequencing_status(
+        working_yaml=args.yaml_path,
+        work_dir=os.path.dirname(args.input_path) or ".",
+        append=args.append,
+    )
diff --git a/flows/parsers/parse_sra_data.py b/flows/parsers/parse_sra_data.py
new file mode 100644
index 0000000..cc29e50
--- /dev/null
+++ b/flows/parsers/parse_sra_data.py
@@ -0,0 +1,55 @@
+"""Parse SRA accession TSV using a YAML schema.
+
+The corresponding updater (``update_sra_data``) emits a flat TSV with
+columns matching the headers in ``sra.types.yaml`` (``run_accession``,
+``sra_accession``, ``platform``, ``library_source``, ``reads``,
+``total_runs``, ``total_reads``, ``taxon_id``). This parser delegates
+to the shared generic flat-TSV pipeline.
+"""
+
+import os
+
+from flows.lib.conditional_import import flow
+from flows.lib.utils import Parser, run_generic_tsv_parser
+from flows.parsers.args import parse_args
+
+
+@flow(log_prints=True)
+def parse_sra_data(
+    working_yaml: str,
+    work_dir: str,
+    append: bool = False,
+    **kwargs,
+) -> None:
+    """Parse SRA accession TSV using YAML schema.
+
+    Args:
+        working_yaml (str): Path to the YAML configuration file.
+        work_dir (str): Working directory containing the input TSV.
+        append (bool): If True, load previous parsed data.
+        **kwargs: Ignored extra arguments from the wrapper.
+    """
+    run_generic_tsv_parser(
+        working_yaml=working_yaml,
+        work_dir=work_dir,
+        append=append,
+        key_field="run_accession",
+    )
+
+
+def plugin():
+    """Register the parser plugin."""
+    return Parser(
+        name="SRA_DATA",
+        func=parse_sra_data,
+        description="Parse SRA accession TSV using a YAML schema.",
+    )
+
+
+if __name__ == "__main__":
+    args = parse_args("Parse SRA accession TSV using a YAML schema.")
+    parse_sra_data(
+        working_yaml=args.yaml_path,
+        work_dir=os.path.dirname(args.input_path) or ".",
+        append=args.append,
+    )
diff --git a/flows/prefect.yaml b/flows/prefect.yaml
index 07ae523..ee29ca7 100644
--- a/flows/prefect.yaml
+++ b/flows/prefect.yaml
@@ -264,3 +264,362 @@ deployments:
     schedules:
       - *daily
     work_pool: *goat_data_work_pool
+
+  # -----------------------------------------------------------------------
+  # Phase 1 updaters — external data fetching (migrated from goat-data)
+  # -----------------------------------------------------------------------
+
+  - name: update-vgp-original-status
+    # Fetch VGP original status list from GitHub YAML tracker (monthly)
+    entrypoint: flows/updaters/update_vgp_original_status.py:update_vgp_original_status
+    parameters:
+      output_path: "/home/ubuntu/tmp/test/status-lists/vgp_original_status.tsv"
+      s3_path: s3://goat/resources/status-lists/vgp_original_status.tsv
+      min_records: 100
+    schedules:
+      - *monthly
+    work_pool: *goat_data_work_pool
+
+  - name: update-vgp-status
+    # Fetch VGP Ordinal Phase1+ status from the live Google Sheet
+    entrypoint: flows/updaters/update_vgp_status.py:update_vgp_status
+    parameters:
+      output_path: "/home/ubuntu/tmp/test/status-lists/VGP_Ordinal_Phase1_plus.tsv"
+      s3_path: s3://goat/resources/status-lists/VGP_Ordinal_Phase1_plus.tsv
+      min_records: 100
+    schedules:
+      - *daily
+    work_pool: *goat_data_work_pool
+
+  - name: update-ensembl-metadata-main
+    # Fetch species metadata from Ensembl main site
+    entrypoint: flows/updaters/update_ensembl_metadata.py:update_ensembl_metadata
+    parameters:
+      output_path: "/home/ubuntu/tmp/test/assembly-data/ensembl_metadata.tsv.gz"
+      division: vertebrates
+      s3_path: s3://goat/resources/assembly-data/ensembl_metadata.tsv.gz
+    schedules:
+      - *weekly
+    work_pool: *goat_data_work_pool
+
+  - name: update-ensembl-metadata-metazoa
+    entrypoint: flows/updaters/update_ensembl_metadata.py:update_ensembl_metadata
+    parameters:
+      output_path: "/home/ubuntu/tmp/test/assembly-data/ensembl_metazoa_metadata.tsv.gz"
+      division: metazoa
+      s3_path: s3://goat/resources/assembly-data/ensembl_metazoa_metadata.tsv.gz
+    schedules:
+      - *weekly
+    work_pool: *goat_data_work_pool
+
+  - name: update-ensembl-metadata-plants
+    entrypoint: flows/updaters/update_ensembl_metadata.py:update_ensembl_metadata
+    parameters:
+      output_path: "/home/ubuntu/tmp/test/assembly-data/ensembl_plants_metadata.tsv.gz"
+      division: plants
+      s3_path: s3://goat/resources/assembly-data/ensembl_plants_metadata.tsv.gz
+    schedules:
+      - *weekly
+    work_pool: *goat_data_work_pool
+
+  - name: update-ensembl-metadata-fungi
+    entrypoint: flows/updaters/update_ensembl_metadata.py:update_ensembl_metadata
+    parameters:
+      output_path: "/home/ubuntu/tmp/test/assembly-data/ensembl_fungi_metadata.tsv.gz"
+      division: fungi
+      s3_path: s3://goat/resources/assembly-data/ensembl_fungi_metadata.tsv.gz
+    schedules:
+      - *weekly
+    work_pool: *goat_data_work_pool
+
+  - name: update-ensembl-metadata-protists
+    entrypoint: flows/updaters/update_ensembl_metadata.py:update_ensembl_metadata
+    parameters:
+      output_path: "/home/ubuntu/tmp/test/assembly-data/ensembl_protists_metadata.tsv.gz"
+      division: protists
+      s3_path: s3://goat/resources/assembly-data/ensembl_protists_metadata.tsv.gz
+    schedules:
+      - *weekly
+    work_pool: *goat_data_work_pool
+
+  - name: update-ensembl-metadata-rapid
+    entrypoint: flows/updaters/update_ensembl_metadata.py:update_ensembl_metadata
+    parameters:
+      output_path: "/home/ubuntu/tmp/test/assembly-data/ensembl_rapid_metadata.tsv.gz"
+      division: rapid
+      s3_path: s3://goat/resources/assembly-data/ensembl_rapid_metadata.tsv.gz
+    schedules:
+      - *weekly
+    work_pool: *goat_data_work_pool
+
+  - name: update-ucsc-assemblies
+    # Fetch UCSC assembly hub accession list
+    entrypoint: flows/updaters/update_ucsc_assemblies.py:update_ucsc_assemblies
+    parameters:
+      output_path: "/home/ubuntu/tmp/test/assembly-data/ucsc_assemblies.tsv"
+      s3_path: s3://goat/resources/assembly-data/ucsc_assemblies.tsv
+    schedules:
+      - *weekly
+    work_pool: *goat_data_work_pool
+
+  - name: update-jgi-status
+    # Fetch JGI 1KFG project data via OAuth API
+    entrypoint: flows/updaters/update_jgi_status.py:update_jgi_status
+    parameters:
+      output_path: "/home/ubuntu/tmp/test/assembly-data/jgi_1kfg_status.tsv"
+      s3_path: s3://goat/resources/assembly-data/jgi_1kfg_status.tsv
+    schedules:
+      - *weekly
+    work_pool:
+      name: goat-data
+      work_queue_name: default
+      job_variables:
+        env:
+          JGI_OFFLINE_TOKEN: "{{ $JGI_OFFLINE_TOKEN }}"
+
+  - name: update-sra-data
+    # Fetch SRA data via NCBI E-utilities
+    entrypoint: flows/updaters/update_sra_data.py:update_sra_data
+    parameters:
+      output_path: "/home/ubuntu/tmp/test/assembly-data/sra_data.tsv.gz"
+      s3_path: s3://goat/resources/assembly-data/sra_data.tsv.gz
+    schedules:
+      - *weekly
+    work_pool:
+      name: goat-data
+      work_queue_name: default
+      job_variables:
+        env:
+          NCBI_API_KEY: "{{ $NCBI_API_KEY }}"
+
+  - name: update-blobtoolkit
+    # Fetch BlobToolKit analysis data via API
+    entrypoint: flows/updaters/update_blobtoolkit.py:update_blobtoolkit
+    parameters:
+      output_path: "/home/ubuntu/tmp/test/assembly-data/btk.tsv.gz"
+      s3_path: s3://goat/resources/assembly-data/
+    schedules:
+      - *weekly
+    work_pool: *goat_data_work_pool
+
+  - name: update-refseq-organelles
+    # Fetch and parse RefSeq organelle data from NCBI FTP
+    entrypoint: flows/updaters/update_refseq_organelles.py:update_refseq_organelles
+    parameters:
+      output_path: "/home/ubuntu/tmp/test/assembly-data/refseq_organelles.tsv.gz"
+      s3_path: s3://goat/resources/assembly-data/refseq_organelles.tsv.gz
+    schedules:
+      - *weekly
+    work_pool: *goat_data_work_pool
+
+  - name: update-google-sheets-status
+    # Fetch project status data from Google Sheets
+    entrypoint: flows/updaters/update_google_sheets_status.py:update_google_sheets_status
+    parameters:
+      output_path: "/home/ubuntu/tmp/test/status-lists/google-sheets"
+      # This needs to be set to the URL of a TSV file containing the list of Google Sheets to fetch
+      index_url: ""
+      s3_path: s3://goat/resources/status-lists/google-sheets/
+    schedules:
+      - *daily
+    work_pool: *goat_data_work_pool
+
+  # -----------------------------------------------------------------------
+  # Phase 2 fetch-parse-validate — triggered by Phase 1 update events
+  # -----------------------------------------------------------------------
+
+  - name: batch-validate-google-sheets-status
+    # Triggered by update.google.sheets.status.finished
+    # Runs fetch-parse-validate (SKIP_PARSING) for each status list TSV
+    entrypoint: flows/orchestrators/batch_validate_status_lists.py:batch_validate_status_lists
+    parameters:
+      yaml_dir: "../goat-data-main/sources/status-lists"
+      work_dir: "/home/ubuntu/tmp/test/status-lists/google-sheets"
+      s3_path: "s3://goat/sources/status-lists/"
+      dry_run: true
+    triggers:
+      - enabled: true
+        type: event
+        match:
+          prefect.resource.type: google.sheets.status
+        expect:
+          - update.google.sheets.status.finished
+        parameters:
+          yaml_dir: "../goat-data-main/sources/status-lists"
+          work_dir: "/home/ubuntu/tmp/test/status-lists/google-sheets"
+          s3_path: "s3://goat/sources/status-lists/"
+          dry_run: true
+    work_pool: *goat_data_work_pool
+
+  - name: fetch-parse-validate-blobtoolkit
+    # Triggered by update.blobtoolkit.finished
+    # Skip parsing — blobtk validate handles taxonomy column detection
+    entrypoint: flows/lib/wrapper_fetch_parse_validate.py:fetch_parse_validate
+    parameters:
+      parser: "ParserEnum.SKIP_PARSING"
+      yaml_path: "../goat-data-main/sources/btk/btk.types.yaml"
+      s3_path: "s3://goat/sources/btk/"
+      work_dir: "/home/ubuntu/tmp/test/assembly-data"
+      dry_run: true
+    triggers:
+      - enabled: true
+        type: event
+        match:
+          prefect.resource.type: blobtoolkit
+        expect:
+          - update.blobtoolkit.finished
+        parameters:
+          parser: "ParserEnum.SKIP_PARSING"
+          yaml_path: "../goat-data-main/sources/btk/btk.types.yaml"
+          s3_path: "s3://goat/sources/btk/"
+          work_dir: "/home/ubuntu/tmp/test/assembly-data"
+          dry_run: true
+    work_pool: *goat_data_work_pool
+
+  - name: fetch-parse-validate-refseq-organelles
+    # Triggered by update.refseq.organelles.finished
+    entrypoint: flows/lib/wrapper_fetch_parse_validate.py:fetch_parse_validate
+    parameters:
+      parser: "ParserEnum.REFSEQ_ORGANELLES"
+      yaml_path: "../goat-data-main/sources/assembly-data/refseq_organelles.types.yaml"
+      s3_path: "s3://goat/sources/assembly-data/"
+      work_dir: "/home/ubuntu/tmp/test/assembly-data"
+      dry_run: true
+    triggers:
+      - enabled: true
+        type: event
+        match:
+          prefect.resource.type: refseq.organelles
+        expect:
+          - update.refseq.organelles.finished
+        parameters:
+          parser: "ParserEnum.REFSEQ_ORGANELLES"
+          yaml_path: "../goat-data-main/sources/assembly-data/refseq_organelles.types.yaml"
+          s3_path: "s3://goat/sources/assembly-data/"
+          work_dir: "/home/ubuntu/tmp/test/assembly-data"
+          dry_run: true
+    work_pool: *goat_data_work_pool
+
+  - name: fetch-parse-validate-sra-data
+    # Triggered by update.sra.data.finished
+    # SRA updater output already matches YAML schema — skip parsing, validate only
+    entrypoint: flows/lib/wrapper_fetch_parse_validate.py:fetch_parse_validate
+    parameters:
+      parser: "ParserEnum.SKIP_PARSING"
+      yaml_path: "../goat-data-main/sources/sra/sra.types.yaml"
+      s3_path: "s3://goat/sources/sra/"
+      work_dir: "/home/ubuntu/tmp/test/assembly-data"
+      dry_run: true
+    triggers:
+      - enabled: true
+        type: event
+        match:
+          prefect.resource.type: sra.data
+        expect:
+          - update.sra.data.finished
+        parameters:
+          parser: "ParserEnum.SKIP_PARSING"
+          yaml_path: "../goat-data-main/sources/sra/sra.types.yaml"
+          s3_path: "s3://goat/sources/sra/"
+          work_dir: "/home/ubuntu/tmp/test/assembly-data"
+          dry_run: true
+    work_pool: *goat_data_work_pool
+
+  - name: fetch-parse-validate-vgp-status
+    # Triggered by update.vgp.status.finished
+    # VGP Ordinal updater output is a superset of YAML columns — skip parsing, validate only
+    entrypoint: flows/lib/wrapper_fetch_parse_validate.py:fetch_parse_validate
+    parameters:
+      parser: "ParserEnum.SKIP_PARSING"
+      yaml_path: "../goat-data-main/sources/status-lists/FILE_VGP_Ordinal_Phase1.types.yaml"
+      s3_path: "s3://goat/sources/status-lists/"
+      work_dir: "/home/ubuntu/tmp/test/status-lists"
+      dry_run: true
+    triggers:
+      - enabled: true
+        type: event
+        match:
+          prefect.resource.type: vgp.status
+        expect:
+          - update.vgp.status.finished
+        parameters:
+          parser: "ParserEnum.SKIP_PARSING"
+          yaml_path: "../goat-data-main/sources/status-lists/FILE_VGP_Ordinal_Phase1.types.yaml"
+          s3_path: "s3://goat/sources/status-lists/"
+          work_dir: "/home/ubuntu/tmp/test/status-lists"
+          dry_run: true
+    work_pool: *goat_data_work_pool
+
+  - name: fetch-parse-validate-jgi-status
+    # Triggered by update.jgi.status.finished
+    entrypoint: flows/lib/wrapper_fetch_parse_validate.py:fetch_parse_validate
+    parameters:
+      parser: "ParserEnum.SEQUENCING_STATUS"
+      yaml_path: "../goat-data-main/sources/assembly-data/1kfg_manual_bioprojects.types.yaml"
+      s3_path: "s3://goat/sources/assembly-data/"
+      work_dir: "/home/ubuntu/tmp/test/assembly-data"
+      dry_run: true
+    triggers:
+      - enabled: true
+        type: event
+        match:
+          prefect.resource.type: jgi.status
+        expect:
+          - update.jgi.status.finished
+        parameters:
+          parser: "ParserEnum.SEQUENCING_STATUS"
+          yaml_path: "../goat-data-main/sources/assembly-data/1kfg_manual_bioprojects.types.yaml"
+          s3_path: "s3://goat/sources/assembly-data/"
+          work_dir: "/home/ubuntu/tmp/test/assembly-data"
+          dry_run: true
+    work_pool: *goat_data_work_pool
+
+  - name: fetch-parse-validate-ucsc-assemblies
+    # Triggered by update.ucsc.assemblies.finished
+    # Skip parsing — YAML column mapping can be adjusted later if needed
+    entrypoint: flows/lib/wrapper_fetch_parse_validate.py:fetch_parse_validate
+    parameters:
+      parser: "ParserEnum.SKIP_PARSING"
+      yaml_path: "../goat-data-main/sources/assembly-data/ucsc_ids.types.yaml"
+      s3_path: "s3://goat/sources/assembly-data/"
+      work_dir: "/home/ubuntu/tmp/test/assembly-data"
+      dry_run: true
+    triggers:
+      - enabled: true
+        type: event
+        match:
+          prefect.resource.type: ucsc.assemblies
+        expect:
+          - update.ucsc.assemblies.finished
+        parameters:
+          parser: "ParserEnum.SKIP_PARSING"
+          yaml_path: "../goat-data-main/sources/assembly-data/ucsc_ids.types.yaml"
+          s3_path: "s3://goat/sources/assembly-data/"
+          work_dir: "/home/ubuntu/tmp/test/assembly-data"
+          dry_run: true
+    work_pool: *goat_data_work_pool
+
+  - name: fetch-parse-validate-nhm-status
+    # Triggered by update.nhm.status.finished (legacy NHM updater)
+    # NHM updater output is a superset of YAML columns — skip parsing, validate only
+    entrypoint: flows/lib/wrapper_fetch_parse_validate.py:fetch_parse_validate
+    parameters:
+      parser: "ParserEnum.SKIP_PARSING"
+      yaml_path: "../goat-data-main/sources/status-lists/FILE_DTOL_nhm.types.yaml"
+      s3_path: "s3://goat/sources/status-lists/"
+      work_dir: "/home/ubuntu/tmp/test/status-lists"
+      dry_run: true
+    triggers:
+      - enabled: true
+        type: event
+        match:
+          prefect.resource.type: nhm.status
+        expect:
+          - update.nhm.status.finished
+        parameters:
+          parser: "ParserEnum.SKIP_PARSING"
+          yaml_path: "../goat-data-main/sources/status-lists/FILE_DTOL_nhm.types.yaml"
+          s3_path: "s3://goat/sources/status-lists/"
+          work_dir: "/home/ubuntu/tmp/test/status-lists"
+          dry_run: true
+    work_pool: *goat_data_work_pool
diff --git a/flows/updaters/__pycache__/__init__.cpython-313.pyc b/flows/updaters/__pycache__/__init__.cpython-313.pyc
new file mode 100644
index 0000000..e0b0acd
Binary files /dev/null and b/flows/updaters/__pycache__/__init__.cpython-313.pyc differ
diff --git a/flows/updaters/__pycache__/tol_utils.cpython-312.pyc b/flows/updaters/__pycache__/tol_utils.cpython-312.pyc
new file mode 100644
index 0000000..d813ef1
Binary files /dev/null and b/flows/updaters/__pycache__/tol_utils.cpython-312.pyc differ
diff --git a/flows/updaters/__pycache__/tol_utils.cpython-313.pyc b/flows/updaters/__pycache__/tol_utils.cpython-313.pyc
new file mode 100644
index 0000000..5ca0d68
Binary files /dev/null and b/flows/updaters/__pycache__/tol_utils.cpython-313.pyc differ
diff --git a/flows/updaters/__pycache__/update_ensembl_metadata.cpython-313.pyc b/flows/updaters/__pycache__/update_ensembl_metadata.cpython-313.pyc
new file mode 100644
index 0000000..a26e96a
Binary files /dev/null and b/flows/updaters/__pycache__/update_ensembl_metadata.cpython-313.pyc differ
diff --git a/flows/updaters/__pycache__/update_google_sheets_status.cpython-313.pyc b/flows/updaters/__pycache__/update_google_sheets_status.cpython-313.pyc
new file mode 100644
index 0000000..fc2aa1d
Binary files /dev/null and b/flows/updaters/__pycache__/update_google_sheets_status.cpython-313.pyc differ
diff --git a/flows/updaters/__pycache__/update_sra_data.cpython-313.pyc b/flows/updaters/__pycache__/update_sra_data.cpython-313.pyc
new file mode 100644
index 0000000..8632a26
Binary files /dev/null and b/flows/updaters/__pycache__/update_sra_data.cpython-313.pyc differ
diff --git a/flows/updaters/__pycache__/update_tol_portal_status.cpython-312.pyc b/flows/updaters/__pycache__/update_tol_portal_status.cpython-312.pyc
new file mode 100644
index 0000000..3b8c262
Binary files /dev/null and b/flows/updaters/__pycache__/update_tol_portal_status.cpython-312.pyc differ
diff --git a/flows/updaters/__pycache__/update_tol_portal_status.cpython-313.pyc b/flows/updaters/__pycache__/update_tol_portal_status.cpython-313.pyc
new file mode 100644
index 0000000..0671839
Binary files /dev/null and b/flows/updaters/__pycache__/update_tol_portal_status.cpython-313.pyc differ
diff --git a/flows/updaters/__pycache__/update_ucsc_assemblies.cpython-313.pyc b/flows/updaters/__pycache__/update_ucsc_assemblies.cpython-313.pyc
new file mode 100644
index 0000000..c1c1a71
Binary files /dev/null and b/flows/updaters/__pycache__/update_ucsc_assemblies.cpython-313.pyc differ
diff --git a/flows/updaters/__pycache__/update_vgp_status.cpython-313.pyc b/flows/updaters/__pycache__/update_vgp_status.cpython-313.pyc
new file mode 100644
index 0000000..cdaf4fb
Binary files /dev/null and b/flows/updaters/__pycache__/update_vgp_status.cpython-313.pyc differ
diff --git a/flows/updaters/api/__pycache__/__init__.cpython-313.pyc b/flows/updaters/api/__pycache__/__init__.cpython-313.pyc
new file mode 100644
index 0000000..1c8256d
Binary files /dev/null and b/flows/updaters/api/__pycache__/__init__.cpython-313.pyc differ
diff --git a/flows/updaters/api/__pycache__/api_config.cpython-313.pyc b/flows/updaters/api/__pycache__/api_config.cpython-313.pyc
new file mode 100644
index 0000000..50ecbb2
Binary files /dev/null and b/flows/updaters/api/__pycache__/api_config.cpython-313.pyc differ
diff --git a/flows/updaters/api/__pycache__/api_tools.cpython-313.pyc b/flows/updaters/api/__pycache__/api_tools.cpython-313.pyc
new file mode 100644
index 0000000..5fcaf5f
Binary files /dev/null and b/flows/updaters/api/__pycache__/api_tools.cpython-313.pyc differ
diff --git a/flows/updaters/update_blobtoolkit.py b/flows/updaters/update_blobtoolkit.py
new file mode 100644
index 0000000..33721fc
--- /dev/null
+++ b/flows/updaters/update_blobtoolkit.py
@@ -0,0 +1,274 @@
+import contextlib
+import csv
+import gzip
+import os
+
+from flows.lib.conditional_import import emit_event, flow, task
+from flows.lib.shared_args import MIN_RECORDS, OUTPUT_PATH, S3_PATH, parse_args, required
+from flows.lib.utils import _build_session, is_safe_path, upload_to_s3
+
+BTK_API = "https://blobtoolkit.genomehubs.org/api/v1"
+BTK_VIEW = "https://blobtoolkit.genomehubs.org/view"
+
+TSV_FIELDNAMES = [
+    "accession",
+    "taxid",
+    "species",
+    "taxon_name",
+    "subspecies",
+    "id",
+    "source",
+    "sourceSlug",
+    "sourceStub",
+    "busco_lineage",
+    "busco_string",
+    "busco_complete",
+    "nohit",
+    "target",
+    "at_percent",
+    "gc_percent",
+    "n_percent",
+]
+
+
+def _stream_datasets(root: str, session) -> list:
+    """Fetch all BlobToolKit dataset entries for a taxon root in one request.
+
+    Args:
+        root (str): Taxonomic root to query (e.g., "Eukaryota").
+        session: A requests.Session with retry support.
+
+    Returns:
+        list: List of dataset metadata dicts.
+    """
+    url = f"{BTK_API}/search/{root}"
+    response = session.get(url, timeout=300)
+    response.raise_for_status()
+    return response.json()
+
+
+def _extract_stats(meta: dict) -> dict:
+    """Extract BlobToolKit summary stats into a flat dict row.
+
+    Args:
+        meta (dict): Raw BTK dataset metadata.
+
+    Returns:
+        dict: Flat row dict matching TSV_FIELDNAMES.
+    """
+    summary = meta.get("summaryStats", {})
+    row = {
+        "accession": meta.get("accession", ""),
+        "taxid": str(meta.get("taxid", "")),
+        "species": meta.get("species", meta.get("taxon_name", "")),
+        "taxon_name": meta.get("taxon_name", ""),
+        "subspecies": "",
+        "id": meta.get("id", ""),
+        "source": "BlobToolKit",
+        "sourceSlug": meta.get("id", ""),
+        "sourceStub": "https://blobtoolkit.genomehubs.org/view/dataset/",
+        "busco_lineage": "",
+        "busco_string": "",
+        "busco_complete": "",
+        "nohit": "",
+        "target": "",
+        "at_percent": "",
+        "gc_percent": "",
+        "n_percent": "",
+    }
+
+    # Treat a taxon_name longer than the species name as a subspecies-level name.
+    taxon_name = meta.get("taxon_name", "")
+    species = meta.get("species", "")
+    if species and taxon_name and len(taxon_name) > len(species):
+        row["subspecies"] = taxon_name
+
+    if "busco" in summary:
+        for lineage, stats in summary["busco"].items():
+            row["busco_lineage"] = lineage
+            row["busco_string"] = stats.get("string", "")
+            total = stats.get("t", 0)
+            if total > 0:
+                row["busco_complete"] = f"{stats.get('c', 0) / total * 100:.2f}"
+            break
+
+    if "stats" in summary:
+        row["nohit"] = f"{summary['stats'].get('noHit', 0) * 100:.2f}"
+        with contextlib.suppress(KeyError):
+            row["target"] = f"{summary['stats']['target'] * 100:.2f}"
+
+    if "baseComposition" in summary:
+        bc = summary["baseComposition"]
+        row["at_percent"] = f"{bc.get('at', 0) * 100:.2f}"
+        row["gc_percent"] = f"{bc.get('gc', 0) * 100:.2f}"
+        row["n_percent"] = f"{bc.get('n', 0) * 100:.2f}"
+
+    return row
+
+
+def _describe_files(meta: dict) -> list:
+    """Generate analysis file descriptors for a BlobToolKit dataset.
+
+    Args:
+        meta (dict): Raw BTK dataset metadata.
+
+    Returns:
+        list: List of file descriptor dicts.
+    """
+    plots = ["cumulative", "snail"]
+    summary = meta.get("summaryStats", {})
+    if summary.get("readMapping"):
+        plots.append("blob")
+
+    files = []
+    dataset_id = meta.get("id", "")
+    accession = meta.get("accession", "")
+    taxid = str(meta.get("taxid", ""))
+
+    for plot in plots:
+        if plot == "blob":
+            url = f"{BTK_API}/image/{dataset_id}/{plot}/circle?format=png"
+        else:
+            url = f"{BTK_API}/image/{dataset_id}/{plot}?format=png"
+        files.append(
+            {
+                "name": f"{plot}.png",
+                "url": url,
+                "source_url": f"{BTK_VIEW}/{dataset_id}/dataset/{dataset_id}/{plot}",
+                "analysis_id": f"btk-{dataset_id}",
+                "description": f"a {plot} plot from BlobToolKit analysis {dataset_id}",
+                "title": f"{plot} plot {dataset_id}",
+                "command": "blobtoolkit pipeline",
+                "assembly_id": accession,
+                "taxon_id": taxid,
+                "analysis": {
+                    "name": "BlobToolKit",
+                    "title": f"BlobToolKit analysis of {accession}",
+                    "description": f"Analysis of public assembly {accession} using BlobToolKit",
+                    "source": "BlobToolKit",
+                    "source_url": f"{BTK_VIEW}/dataset/{dataset_id}",
+                },
+            }
+        )
+    return files
+
+
+@task(retries=2, retry_delay_seconds=30, log_prints=True)
+def fetch_blobtoolkit(
+    output_dir: str,
+    root: str = "Eukaryota",
+    min_records: int = 1,
+) -> tuple[int, int]:
+    """Fetch BlobToolKit data and write TSV + files YAML.
+
+    Uses a session with retry support for the single search request.
+
+    Args:
+        output_dir (str): Directory to write btk.tsv.gz and btk.files.yaml.
+        root (str): Taxonomic root to query.
+        min_records (int): Minimum dataset count to accept.
+
+    Returns:
+        tuple[int, int]: Number of dataset rows and file entries written.
+    """
+    session = _build_session()
+    print(f"Fetching BlobToolKit datasets for {root}")
+    datasets = _stream_datasets(root, session)
+    print(f"Found {len(datasets)} datasets")
+
+    if len(datasets) < min_records:
+        raise RuntimeError(f"BlobToolKit returned fewer than {min_records} datasets: {len(datasets)}")
+
+    tsv_path = os.path.join(output_dir, "btk.tsv")
+    gz_path = os.path.join(output_dir, "btk.tsv.gz")
+    files_path = os.path.join(output_dir, "btk.files.yaml")
+
+    all_rows = []
+    all_files = []
+    for dataset in datasets:
+        meta = dataset if isinstance(dataset, dict) else {}
+        row = _extract_stats(meta)
+        all_rows.append(row)
+        files = _describe_files(meta)
+        all_files.extend(files)
+
+    with open(tsv_path, "w", newline="") as f:
+        writer = csv.DictWriter(f, fieldnames=TSV_FIELDNAMES, delimiter="\t", lineterminator="\n")
+        writer.writeheader()
+        for row in all_rows:
+            writer.writerow(row)
+
+    with open(tsv_path, "rb") as f_in, gzip.open(gz_path, "wb") as f_out:
+        f_out.write(f_in.read())
+    os.remove(tsv_path)
+
+    import yaml
+
+    with open(files_path, "w") as f:
+        yaml.dump(all_files, f, default_flow_style=False)
+
+    print(f"Wrote {len(all_rows)} rows to {gz_path}")
+    print(f"Wrote {len(all_files)} file entries to {files_path}")
+    return len(all_rows), len(all_files)
+
+
+@task(log_prints=True)
+def upload_s3_files(output_dir: str, s3_path: str) -> None:
+    """Upload BTK output files to S3."""
+    for filename in ("btk.tsv.gz", "btk.files.yaml"):
+        local = os.path.join(output_dir, filename)
+        remote = f"{s3_path.rstrip('/')}/{filename}"
+        if os.path.exists(local):
+            print(f"Uploading {local} to {remote}")
+            upload_to_s3(local, remote)
+
+
+@flow()
+def update_blobtoolkit(
+    output_path: str,
+    s3_path: str = "",
+    min_records: int = 0,
+) -> bool:
+    """Fetch BlobToolKit analysis data and optionally upload to S3.
+
+    Args:
+        output_path (str): Directory to write output files.
+        s3_path (str): Optional S3 directory to upload results.
+        min_records (int): Minimum dataset count to accept.
+
+    Returns:
+        bool: True on success.
+    """
+    if not is_safe_path(output_path):
+        raise ValueError(f"Unsafe output path: {output_path}")
+
+    resolved_path = os.path.abspath(output_path)
+    # fetch_blobtoolkit always writes btk.tsv.gz and btk.files.yaml, so if a file
+    # path was given, keep only its parent directory.
+    if ".tsv" in resolved_path:
+        resolved_path = os.path.dirname(resolved_path)
+    os.makedirs(resolved_path, exist_ok=True)
+
+    row_count, file_count = fetch_blobtoolkit(resolved_path, min_records=min_records)
+
+    if s3_path:
+        # upload_s3_files takes the output directory and uploads both BTK files itself.
+        upload_s3_files(resolved_path, s3_path)
+
+    emit_event(
+        event="update.blobtoolkit.finished",
+        resource={
+            "prefect.resource.id": f"update.btk.{resolved_path}",
+            "prefect.resource.type": "blobtoolkit",
+        },
+        payload={"row_count": row_count, "file_count": file_count},
+    )
+    return True
+
+
+if __name__ == "__main__":
+    args = parse_args(
+        [required(OUTPUT_PATH), S3_PATH, MIN_RECORDS],
+        "Fetch BlobToolKit analysis data.",
+    )
+    update_blobtoolkit(**vars(args))
diff --git a/flows/updaters/update_boat_config.py b/flows/updaters/update_boat_config.py
index 4ee5303..71b27cb 100644
--- a/flows/updaters/update_boat_config.py
+++ b/flows/updaters/update_boat_config.py
@@ -30,29 +30,16 @@ def taxon_id_to_ssh_path(ssh_host, taxon_id, assembly_name):
         ssh_host,
         "bash",
         "-c",
-        (
-            f"'. /etc/profile && module load speciesops && "
-            f"speciesops getdir --taxon_id {taxon_id}'"
-        ),
+        f"'. /etc/profile && module load speciesops && speciesops getdir --taxon_id {taxon_id}'",
     ]
     result = run_quoted(command, capture_output=True, text=True)
     if result.returncode != 0:
-        print(
-            (
-                f"WARNING: Error fetching directory for taxon_id {taxon_id}: "
-                f"{result.stderr}"
-            )
-        )
+        print(f"WARNING: Error fetching directory for taxon_id {taxon_id}: {result.stderr}")
         return
     # Filter the result to get the lustre path
     lustre_path = [line for line in result.stdout.splitlines() if "/lustre" in line]
     if not lustre_path:
-        print(
-            (
-                f"WARNING: No lustre path found for taxon_id {taxon_id} in result: "
-                f"{result.stdout}"
-            )
-        )
+        print(f"WARNING: No lustre path found for taxon_id {taxon_id} in result: {result.stdout}")
         return
     # Use the first lustre path
     lustre_path = lustre_path[0].strip()
@@ -60,6 +47,7 @@ def taxon_id_to_ssh_path(ssh_host, taxon_id, assembly_name):
 
 
 def lookup_buscos(ssh_host, file_path):
+    busco_dirs = []
     if "lustre" in file_path:
         if not is_safe_path(ssh_host):
             raise ValueError(f"Unsafe ssh host: {ssh_host}")
@@ -77,9 +65,7 @@ def lookup_buscos(ssh_host, file_path):
         if result.returncode != 0:
             return []
         busco_dirs = [
-            os.path.basename(os.path.normpath(line))
-            for line in result.stdout.splitlines()
-            if "/busco" in line
+            os.path.basename(os.path.normpath(line)) for line in result.stdout.splitlines() if "/busco" in line
         ]
     return busco_dirs
 
@@ -117,11 +103,9 @@ def assembly_id_to_busco_sets(alt_host, assembly_id):
     ]
     busco_sets = []
     for lineage in lineages:
-        busco_url = (
-            f"https://busco.cog.sanger.ac.uk/{assembly_id}/{lineage}/full_table.tsv"
-        )
+        busco_url = f"https://busco.cog.sanger.ac.uk/{assembly_id}/{lineage}/full_table.tsv"
         response = safe_get(busco_url)
-        if response.status_code == 200:
+        if response is not None and response.status_code == 200:
             busco_sets.append(lineage)
     return f"https://busco.cog.sanger.ac.uk/{assembly_id}", busco_sets
 
@@ -192,10 +176,10 @@ def fetch_goat_results(root_taxid: str, output_path: str) -> list[dict]:
     # fetch query_url with accept header tsv. use python module requests
     headers = {"Accept": "text/tab-separated-values"}
     response = safe_get(query_url, headers=headers)
+    if response is None:
+        raise RuntimeError("Error fetching BoaT config info: No response received")
     if response.status_code != 200:
-        raise RuntimeError(
-            f"Error fetching BoaT config info: {response.status_code} {response.text}"
-        )
+        raise RuntimeError(f"Error fetching BoaT config info: {response.status_code} {response.text}")
 
     # Parse the TSV response
     if tsv_data := parse_tsv(response.text):
@@ -294,29 +278,21 @@ def fetch_boat_config_info(
         int: Number of lines written to the output file.
     """
 
-    tsv_data = fetch_goat_results(root_taxid)
+    tsv_data = fetch_goat_results(root_taxid, file_path)
 
     # Prepare output files and get visited assembly IDs
     visited_file_path = f"{os.path.splitext(file_path)[0]}.visited"
-    visited_assembly_ids, line_count = prepare_output_files(
-        file_path, visited_file_path, append
-    )
+    visited_assembly_ids, line_count = prepare_output_files(file_path, visited_file_path, append)
 
     for row in tsv_data:
         taxon_id = row["taxon_id"]
         assembly_id = row["assembly_id"]
         # Skip if the assembly_id has already been visited
         if assembly_id in visited_assembly_ids:
-            print(
-                (
-                    f"Skipping already visited assembly_id {assembly_id} "
-                    f"for taxon_id {taxon_id}."
-                )
-            )
+            print(f"Skipping already visited assembly_id {assembly_id} for taxon_id {taxon_id}.")
             continue
         print(
-            f"Processing taxon_id {taxon_id}, assembly_id {assembly_id} "
-            f"for assembly_name {row['assembly_name']}."
+            f"Processing taxon_id {taxon_id}, assembly_id {assembly_id} for assembly_name {row['assembly_name']}."
         )
         # Add the assembly_id to the new visited list
         with open(visited_file_path, "a") as f:
@@ -336,8 +312,7 @@ def fetch_boat_config_info(
 
         if not busco_sets:
             print(
-                f"Warning: No BUSCO sets found for taxon_id {taxon_id} "
-                f"and assembly_name {assembly_name}. Skipping."
+                f"Warning: No BUSCO sets found for taxon_id {taxon_id} and assembly_name {assembly_name}. Skipping."
             )
             continue
 
@@ -358,9 +333,7 @@ def fetch_boat_config_info(
         line_count += 1
 
     if line_count < min_lines:
-        print(
-            f"WARNING: File {file_path} has less than {min_lines} lines: {line_count}"
-        )
+        print(f"WARNING: File {file_path} has less than {min_lines} lines: {line_count}")
 
     # Return the number of lines written to the file
     return line_count
@@ -419,13 +392,7 @@ def generate_md5(file_path):
 def filter_buscos(buscos):
     # Exclude bacteria_odb and archaea_odb
     buscos = [
-        b
-        for b in buscos
-        if not (
-            b.startswith("bacteria_odb")
-            or b.startswith("archaea_odb")
-            or b.startswith("mm49_")
-        )
+        b for b in buscos if not (b.startswith("bacteria_odb") or b.startswith("archaea_odb") or b.startswith("mm49_"))
     ]
     # Group by prefix before _odb
     prefix_map = defaultdict(list)
@@ -446,9 +413,7 @@ def filter_buscos(buscos):
 
 
 @task(log_prints=True)
-def filter_farm_data(
-    farm_results_path: str, goat_results_path: str, output_path: str
-) -> None:
+def filter_farm_data(farm_results_path: str, goat_results_path: str, output_path: str) -> None:
     """Filter farm results to include only assemblies with lepidoptera BUSCOs.
 
     Combine with GoaT results to add additional fields.
@@ -513,9 +478,7 @@ def filter_farm_data(
 
 
 @flow()
-def update_boat_config(
-    root_taxid: str, output_path: str, append: bool, s3_path: str
-) -> None:
+def update_boat_config(root_taxid: str, output_path: str, append: bool = False, s3_path: str = "") -> None:
     # fetch_goat_results(root_taxid, f"{output_path}/goat_results.tsv")
 
     # trawl_farm_data(
diff --git a/flows/updaters/update_ena_taxonomy_extra.py b/flows/updaters/update_ena_taxonomy_extra.py
index 132c557..44eae59 100644
--- a/flows/updaters/update_ena_taxonomy_extra.py
+++ b/flows/updaters/update_ena_taxonomy_extra.py
@@ -33,9 +33,7 @@ def read_ncbi_tax_ids(taxdump_path: str) -> set[str]:
 
 
 @task(log_prints=True)
-def add_jsonl_tax_ids(
-    jsonl_path: str, tax_ids: set[str], allowed_tax_ids: set[str] | None = None
-) -> None:
+def add_jsonl_tax_ids(jsonl_path: str, tax_ids: set[str], allowed_tax_ids: set[str] | None = None) -> None:
     print(f"Reading previously fetched ENA taxids from {jsonl_path}")
     filtered_path = f"{jsonl_path}.filtered"
     try:
@@ -43,9 +41,7 @@ def add_jsonl_tax_ids(
             for line in f:
                 data = json.loads(line)
                 tax_id = data["taxId"]
-                if (
-                    allowed_tax_ids is None or tax_id in allowed_tax_ids
-                ) and tax_id not in tax_ids:
+                if (allowed_tax_ids is None or tax_id in allowed_tax_ids) and tax_id not in tax_ids:
                     f_out.write(line)
                     tax_ids.add(tax_id)
         os.replace(filtered_path, jsonl_path)
@@ -59,10 +55,7 @@ def get_ena_api_taxids(root_taxid: str) -> set[str]:
     print(f"Fetching taxids for tax_tree({root_taxid}) from ENA API")
 
     limit = 10000000
-    url = (
-        f"https://www.ebi.ac.uk/ena/portal/api/search?result=taxon"
-        f"&query=tax_tree({root_taxid})&limit={limit}"
-    )
+    url = f"https://www.ebi.ac.uk/ena/portal/api/search?result=taxon&query=tax_tree({root_taxid})&limit={limit}"
 
     # Stream the content of the URL
     column_index = None
@@ -138,8 +131,8 @@ def upload_s3_jsonl(local_path: str, s3_path: str) -> None:
 
 @flow()
 def update_ena_taxonomy_extra(
-    root_taxid: str, taxdump_path: str, output_path: str, s3_path: str, append: bool
-) -> None:
+    root_taxid: str, taxdump_path: str, output_path: str, s3_path: str = "", append: bool = False
+) -> bool:
     """Update the ENA taxonomy JSONL file.
 
     Args:
diff --git a/flows/updaters/update_ensembl_metadata.py b/flows/updaters/update_ensembl_metadata.py
new file mode 100644
index 0000000..03420df
--- /dev/null
+++ b/flows/updaters/update_ensembl_metadata.py
@@ -0,0 +1,198 @@
+import csv
+import gzip
+import os
+from enum import Enum
+
+from flows.lib.conditional_import import emit_event, flow, task
+from flows.lib.shared_args import DIVISION, OUTPUT_PATH, S3_PATH, parse_args, required
+from flows.lib.utils import is_safe_path, safe_get, upload_to_s3
+
+
+class EnsemblDivision(Enum):
+    """Supported Ensembl genome database divisions."""
+
+    FUNGI = "fungi"
+    METAZOA = "metazoa"
+    PLANTS = "plants"
+    PROTISTS = "protists"
+    RAPID = "rapid"
+    VERTEBRATES = "vertebrates"
+
+
+DIVISION_URLS = {
+    EnsemblDivision.FUNGI: "http://ftp.ensemblgenomes.org/pub/current/fungi/species_metadata_EnsemblFungi.json",
+    EnsemblDivision.METAZOA: (
+        "http://ftp.ensemblgenomes.org/pub/current/metazoa/species_metadata_EnsemblMetazoa.json"
+    ),
+    EnsemblDivision.PLANTS: "http://ftp.ensemblgenomes.org/pub/current/plants/species_metadata_EnsemblPlants.json",
+    EnsemblDivision.PROTISTS: (
+        "http://ftp.ensemblgenomes.org/pub/current/protists/species_metadata_EnsemblProtists.json"
+    ),
+    EnsemblDivision.RAPID: "https://ftp.ensembl.org/pub/rapid-release/species_metadata.json",
+    EnsemblDivision.VERTEBRATES: "https://ftp.ensembl.org/pub/current/species_metadata_EnsemblVertebrates.json",
+}
+
+DIVISION_OUTPUT_NAMES = {
+    EnsemblDivision.FUNGI: "species_metadata_EnsemblFungi.tsv.gz",
+    EnsemblDivision.METAZOA: "species_metadata_EnsemblMetazoa.tsv.gz",
+    EnsemblDivision.PLANTS: "species_metadata_EnsemblPlants.tsv.gz",
+    EnsemblDivision.PROTISTS: "species_metadata_EnsemblProtists.tsv.gz",
+    EnsemblDivision.RAPID: "species_metadata_EnsemblRapid.tsv.gz",
+    EnsemblDivision.VERTEBRATES: "species_metadata_EnsemblVertebrates.tsv.gz",
+}
+
+
+def _extract_fields(record: dict, division: EnsemblDivision) -> list:
+    """Extract TSV fields from a single Ensembl metadata JSON record.
+
+    Different divisions use slightly different JSON structures for the
+    same conceptual fields. This normalises them to a common 5-column
+    format: assembly_accession, name, release_date, strain, taxonomy_id.
+
+    Args:
+        record (dict): A single species metadata JSON object.
+        division (EnsemblDivision): The Ensembl division.
+
+    Returns:
+        list: A list of 5 string values, or an empty list if the record has no assembly accession.
+    """
+    if division == EnsemblDivision.RAPID:
+        accession = record.get("assembly_accession", "")
+        name = record.get("ensembl_production_name", "")
+        release_date = record.get("release_date", "")
+        strain = record.get("strain", "")
+    elif division == EnsemblDivision.VERTEBRATES:
+        assembly = record.get("assembly", {})
+        organism = record.get("organism", {})
+        accession = assembly.get("assembly_accession", "")
+        name = organism.get("url_name", "")
+        release_date = record.get("release_date", "")
+        strain = organism.get("strain", "")
+    else:
+        organism = record.get("organism", {})
+        accession = record.get("assembly_accession", "")
+        name = organism.get("url_name", "")
+        release_date = record.get("release_date", "")
+        strain = organism.get("strain", "")
+    taxonomy_id = str(record.get("taxonomy_id", ""))
+    if not accession:
+        return []
+    return [accession, name, release_date, strain, taxonomy_id]
+
+
+TSV_HEADERS = [
+    "assembly_accession",
+    "name",
+    "release_date",
+    "strain",
+    "taxonomy_id",
+]
+
+
+@task(retries=2, retry_delay_seconds=10, log_prints=True)
+def fetch_ensembl_division(
+    division: EnsemblDivision,
+    output_dir: str,
+) -> tuple[str, int]:
+    """Fetch Ensembl species metadata JSON and convert to gzipped TSV.
+
+    Args:
+        division (EnsemblDivision): Ensembl division to fetch.
+        output_dir (str): Directory to write the output file.
+
+    Returns:
+        tuple[str, int]: Path to the output file and number of records written.
+    """
+    url = DIVISION_URLS[division]
+    output_name = DIVISION_OUTPUT_NAMES[division]
+    output_path = os.path.join(output_dir, output_name)
+
+    print(f"Fetching Ensembl {division.value} from {url}")
+    response = safe_get(url, timeout=600)
+    if response is None:
+        raise RuntimeError(f"Failed to fetch Ensembl {division.value}: no response received")
+    if response.status_code != 200:
+        raise RuntimeError(
+            f"Failed to fetch Ensembl {division.value}: HTTP {response.status_code} — "
+            "check the URL and your network connection"
+        )
+    response.raise_for_status()
+
+    records = response.json()
+    if not isinstance(records, list):
+        raise ValueError(f"Expected JSON array from {url}, got {type(records).__name__}")
+
+    tsv_path = output_path.removesuffix(".gz")
+    row_count = 0
+    with open(tsv_path, "w", newline="") as f:
+        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
+        writer.writerow(TSV_HEADERS)
+        for record in records:
+            if row := _extract_fields(record, division):
+                writer.writerow(row)
+                row_count += 1
+
+    with open(tsv_path, "rb") as f_in, gzip.open(output_path, "wb") as f_out:
+        f_out.write(f_in.read())
+    os.remove(tsv_path)
+
+    print(f"Wrote {row_count} records to {output_path}")
+    return output_path, row_count
+
+
+@task(log_prints=True)
+def upload_s3_file(local_path: str, s3_path: str) -> None:
+    """Upload file to S3."""
+    print(f"Uploading {local_path} to {s3_path}")
+    upload_to_s3(local_path, s3_path)
+
+
+@flow()
+def update_ensembl_metadata(
+    output_path: str,
+    s3_path: str = "",
+    division: str = "vertebrates",
+) -> bool:
+    """Fetch Ensembl species metadata for a given division.
+
+    Args:
+        output_path (str): Directory to write output files.
+        s3_path (str): Optional S3 directory to upload the result.
+        division (str): Ensembl division name (fungi, metazoa, plants,
+            protists, rapid, vertebrates).
+
+    Returns:
+        bool: True on success.
+    """
+    if not is_safe_path(output_path):
+        raise ValueError(f"Unsafe output path: {output_path}")
+
+    resolved_path = os.path.abspath(output_path)
+    os.makedirs(resolved_path, exist_ok=True)
+
+    div = EnsemblDivision(division.lower())
+    local_file, row_count = fetch_ensembl_division(div, resolved_path)
+
+    if s3_path:
+        output_name = DIVISION_OUTPUT_NAMES[div]
+        remote_path = f"{s3_path.rstrip('/')}/{output_name}"
+        upload_s3_file(local_file, remote_path)
+
+    emit_event(
+        event="update.ensembl.metadata.finished",
+        resource={
+            "prefect.resource.id": f"update.ensembl.{division}.{resolved_path}",
+            "prefect.resource.type": "ensembl.metadata",
+            "prefect.resource.division": division,
+        },
+        payload={"division": division, "row_count": row_count},
+    )
+    return True
+
+
+if __name__ == "__main__":
+    args = parse_args(
+        [required(OUTPUT_PATH), S3_PATH, DIVISION],
+        "Fetch Ensembl species metadata for a given division.",
+    )
+    update_ensembl_metadata(**vars(args))
diff --git a/flows/updaters/update_genomehubs_taxonomy.py b/flows/updaters/update_genomehubs_taxonomy.py
index d4eb7e9..7d58ce1 100644
--- a/flows/updaters/update_genomehubs_taxonomy.py
+++ b/flows/updaters/update_genomehubs_taxonomy.py
@@ -20,12 +20,13 @@
 def get_file_paths_from_config(config: dict, file_paths: dict) -> dict:
     key = config.get("xref_label")
     input_path = config.get("path")
-    output_path = config.get("out")
+    output_path = config.get("out", "./taxonomy.jsonl")
     if key is not None and input_path is not None:
         file_paths[key] = {
             "input": input_path,
         }
-    return output_path
+    file_paths.setdefault("out", output_path)
+    return file_paths
 
 
 @task(log_prints=True)
@@ -39,9 +40,7 @@ def read_input_config(input_path: str) -> dict:
         print(f"Error reading {input_path}: {e}")
         exit()
     try:
-        output_path = get_file_paths_from_config(config, file_paths)
-        if output_path is not None:
-            file_paths["out"] = output_path
+        file_paths = get_file_paths_from_config(config, file_paths)
         for taxonomy in config.get("taxonomies", []):
             get_file_paths_from_config(taxonomy, file_paths)
     except Exception as e:
@@ -79,8 +78,9 @@ def run_blobtk_taxonomy(root_taxid: str, input_path: str, output_path: str) -> N
             text=True,
             bufsize=1,
         )
-        for line in process.stdout:
-            print(line, end="")
+        if process.stdout is not None:
+            for line in process.stdout:
+                print(line, end="")
         process.wait()
         if process.returncode != 0:
             print(f"Command failed with exit code {process.returncode}")
@@ -103,9 +103,7 @@ def upload_s3_file(local_path: str, s3_path: str) -> None:
 
 
 @flow()
-def update_genomehubs_taxonomy(
-    root_taxid: str, input_path: str, output_path: str, s3_path: str
-) -> None:
+def update_genomehubs_taxonomy(root_taxid: str, input_path: str, output_path: str, s3_path: str = "") -> None:
     """Update the GenomeHubs taxonomy JSONL file.
 
     Args:
diff --git a/flows/updaters/update_google_sheets_status.py b/flows/updaters/update_google_sheets_status.py
new file mode 100644
index 0000000..390389a
--- /dev/null
+++ b/flows/updaters/update_google_sheets_status.py
@@ -0,0 +1,446 @@
+"""Fetch project status data from Google Sheets.
+
+Replaces the legacy R script (get_googlesheets.R) and Python pipeline
+(import_status.py + import_status_lib.py). Fetches four categories:
+
+1. Project status sheets — a private TSV index pointing to ~26 project
+   spreadsheets that follow the GoaT schema 2.5 format.
+2. DTOL Plant Genome Size Estimates — Kew genome size data.
+3. DTOL assembly informatics status — tolqc kmer draft sizes.
+4. CNGB project status.
+
+Outputs are per-project expanded TSV files matching legacy format.
+"""
+
+import csv
+import io
+import os
+
+import numpy as np
+import pandas as pd
+
+from flows.lib.conditional_import import emit_event, flow, task
+from flows.lib.shared_args import (
+    INDEX_URL,
+    MIN_RECORDS,
+    OUTPUT_PATH,
+    S3_PATH,
+    parse_args,
+    required,
+)
+from flows.lib.utils import is_safe_path, safe_get, upload_to_s3
+
+# Google Sheets URLs for non-project-status data
+DTOL_PLANT_GENOME_SIZE_URL = (
+    "https://docs.google.com/spreadsheets/d/e/"
+    "2PACX-1vSt0R1T3MpoOM6UFNMaT_Q9gR5TYyUZC1wgLqW_6_cH9zzII8ehadrbHX8bpktjTv2_yt_KHaj3x_e1"
+    "/pub?output=tsv"
+)
+DTOL_TOLQC_STATUS_URL = (
+    "https://docs.google.com/spreadsheets/d/e/"
+    "2PACX-1vTU-En_URbYPtfyjBueQhnz7wYHt-OHVxvRyv9tNvCUPCTX9EEzxOL41QCUh6hgVNv-Vv_gLSAMJXv-"
+    "/pub?gid=1442224132&single=true&output=tsv"
+)
+CNGB_URL = (
+    "https://docs.google.com/spreadsheets/d/e/"
+    "2PACX-1vQeTqi-qnoNgNl58gWDBT4CcR8nF9SmFOkC82KC6pkH42CoEi94yInhBE25SfxBqNeMBeVbpeEVs9GI"
+    "/pub?gid=1726876704&single=true&output=tsv"
+)
+
+
+# ---------------------------------------------------------------------------
+# Project status processing (port of import_status_lib.py)
+# ---------------------------------------------------------------------------
+
+
+def _open_google_spreadsheet(acronym: str, url: str, header_index: int) -> pd.DataFrame:
+    """Download a published Google Sheet as TSV and return a DataFrame."""
+    encodings = ["utf-8", "ISO-8859-1", "latin1"]
+    response = safe_get(url, timeout=120)
+    if response is None:
+        raise RuntimeError(f"Failed to fetch sheet for {acronym}: no response received")
+    response.raise_for_status()
+
+    df = None
+    for enc in encodings:
+        try:
+            content = response.content.decode(enc)
+            df = pd.read_csv(
+                io.StringIO(content),
+                delimiter="\t",
+                header=header_index,
+                dtype=object,
+                quoting=csv.QUOTE_NONE,
+            )
+            break
+        except (UnicodeDecodeError, pd.errors.ParserError):
+            continue
+
+    if df is None:
+        raise ValueError(f"Failed to decode sheet for {acronym}")
+
+    df.rename(columns={"#NCBI_taxon_id": "NCBI_taxon_id"}, inplace=True)
+    df["project"] = acronym.upper()
+    return df
+
+
+def _general_cleanup(df: pd.DataFrame) -> pd.DataFrame:
+    """Replace whitespace-only cells with NaN, drop empty rows/cols."""
+    df = df.replace(r"^\s*$", np.nan, regex=True)
+    df = df.replace("publication_available", "published", regex=False)
+    df.dropna(how="all", axis=1, inplace=True)
+    df.dropna(how="all", axis=0, inplace=True)
+    df.rename(columns={"#NCBI_taxon_id": "NCBI_taxon_id"}, inplace=True)
+    return df
+
+
+def _cleanup_headers(df: pd.DataFrame) -> pd.DataFrame:
+    """Normalise column headers: lowercase, underscored, no parens."""
+    df.columns = (
+        df.columns.str.replace(" ", "_")
+        .str.replace(r"\(", "", regex=True)
+        .str.replace(r"\)", "", regex=True)
+        .str.lower()
+    )
+    return df
+
+
+def _create_mandatory_columns(df: pd.DataFrame) -> pd.DataFrame:
+    """Ensure mandatory columns exist."""
+    for col in [
+        "ncbi_taxon_id",
+        "species",
+        "family",
+        "synonym",
+        "publication_id",
+        "contributing_project_lab",
+        "target_list_status",
+        "sequencing_status",
+    ]:
+        if col not in df.columns:
+            df[col] = None
+    return df
+
+
+def _expand_target_status(df: pd.DataFrame, acronym: str) -> pd.DataFrame:
+    """Populate long_list, family_representative, other_priority columns."""
+    for col in ["long_list", "family_representative", "other_priority"]:
+        if col not in df.columns:
+            df[col] = None
+    df["long_list"] = acronym
+
+    lower = acronym.lower()
+    fr_mask = df["target_list_status"].isin([f"{lower}_family_representative", "family_representative"])
+    df.loc[fr_mask, "family_representative"] = acronym
+
+    op_mask = df["target_list_status"].isin([f"{lower}_other_priority", "other_priority"])
+    df.loc[op_mask, "other_priority"] = acronym
+    return df
+
+
+def _reduce_sequencing_status(df: pd.DataFrame, acronym: str) -> pd.DataFrame:
+    """Map project-prefixed statuses to simple GoaT statuses."""
+    status_map = {
+        f"{acronym}_published": "published",
+        f"{acronym}_insdc_open": "insdc_open",
+        f"{acronym}_open": "open",
+        f"{acronym}_insdc_submitted": "in_progress",
+        f"{acronym}_in_assembly": "in_progress",
+        f"{acronym}_data_generation": "in_progress",
+        f"{acronym}_in_progress": "in_progress",
+        f"{acronym}_sample_acquired": "sample_acquired",
+        f"{acronym}_sample_collected": "sample_collected",
+    }
+    df["sequencing_status"] = df["sequencing_status"].replace(status_map)
+    return df
+
+
+def _create_status_columns(df: pd.DataFrame, acronym: str) -> pd.DataFrame:
+    """Create and populate per-status columns."""
+    statuses = [
+        "sample_collected",
+        "sample_acquired",
+        "in_progress",
+        "data_generation",
+        "in_assembly",
+        "insdc_submitted",
+        "open",
+        "insdc_open",
+        "published",
+    ]
+    for s in statuses:
+        if s not in df.columns:
+            df[s] = None
+        df.loc[df["sequencing_status"] == s, s] = acronym
+    return df
+
+
+def _expand_sequencing_status(df: pd.DataFrame, acronym: str) -> pd.DataFrame:
+    """Cascade statuses upward: published implies insdc_open, etc."""
+    df.loc[df["published"] == acronym, "insdc_open"] = acronym
+    df.loc[df["insdc_open"] == acronym, "open"] = acronym
+    df.loc[df["open"] == acronym, "in_progress"] = acronym
+    df.loc[df["data_generation"] == acronym, "in_progress"] = acronym
+    df.loc[df["in_assembly"] == acronym, "in_progress"] = acronym
+    df.loc[df["in_progress"] == acronym, "data_generation"] = acronym
+    df.loc[df["in_progress"] == acronym, "sample_acquired"] = acronym
+    df.loc[df["sample_acquired"] == acronym, "sample_collected"] = acronym
+    return df
+
+
+def _process_project(acronym: str, url: str, header_row: int) -> pd.DataFrame:
+    """Full processing pipeline for one project status sheet."""
+    df = _open_google_spreadsheet(acronym, url, header_row)
+    df = _general_cleanup(df)
+    df = _cleanup_headers(df)
+    df = _create_mandatory_columns(df)
+    df = _expand_target_status(df, acronym)
+    df = _reduce_sequencing_status(df, acronym)
+    df = _create_status_columns(df, acronym)
+    df = _expand_sequencing_status(df, acronym)
+    return df
+
+
+# ---------------------------------------------------------------------------
+# Dedicated sheet fetchers (port of get_googlesheets.R)
+# ---------------------------------------------------------------------------
+
+
+def _fetch_dtol_plant_genome_sizes(output_path: str) -> int:
+    """Fetch DTOL Plant Genome Size Estimates from Kew."""
+    response = safe_get(DTOL_PLANT_GENOME_SIZE_URL, timeout=120)
+    if response is None:
+        raise RuntimeError("Failed to fetch DTOL Plant Genome Size Estimates: no response received")
+    response.raise_for_status()
+    df = pd.read_csv(io.StringIO(response.text), delimiter="\t", dtype=str)
+    df.columns = (
+        df.columns.str.strip()
+        .str.replace(" ", "_")
+        .str.replace(r"\(", "", regex=True)
+        .str.replace(r"\)", "", regex=True)
+        .str.lower()
+    )
+    df = df.dropna(how="all", axis=0)
+    df = df[df["genus"].notna() & (df.get("project", pd.Series()) == "DTOL")]
+    df["primary"] = "1"
+    df.to_csv(output_path, sep="\t", index=False)
+    return len(df)
+
+
+def _fetch_dtol_tolqc_status(output_path: str) -> int:
+    """Fetch DTOL assembly informatics status (kmer draft)."""
+    response = safe_get(DTOL_TOLQC_STATUS_URL, timeout=120)
+    if response is None:
+        raise RuntimeError("Failed to fetch DTOL assembly informatics status: no response received")
+    response.raise_for_status()
+    df = pd.read_csv(
+        io.StringIO(response.text),
+        delimiter="\t",
+        dtype=str,
+        na_values=["NA", "missing", "", "NULL"],
+    )
+    df.columns = (
+        df.columns.str.strip()
+        .str.replace(".", "", regex=False)
+        .str.replace(" ", "_")
+        .str.replace(r"\(", "", regex=True)
+        .str.replace(r"\)", "", regex=True)
+        .str.lower()
+    )
+    df = df.dropna(how="all", axis=0)
+    df = df[df["taxon"].notna()]
+    df = df[df["accession"].isna() | ~df["accession"].str.startswith("GCA_", na=False)]
+    df = df[~df["statussummary"].str.startswith("9", na=False)]
+    df = df[~df["statussummary"].str.startswith("5", na=False)]
+    df = df[["taxon", "est_size_mb", "length_mb"]].copy()
+    for col in ["est_size_mb", "length_mb"]:
+        df[col] = pd.to_numeric(df[col], errors="coerce")
+    df = df[df["est_size_mb"].notna() | df["length_mb"].notna()]
+    df.to_csv(output_path, sep="\t", index=False)
+    return len(df)
+
+
+def _fetch_cngb(output_path: str) -> int:
+    """Fetch CNGB project status sheet."""
+    response = safe_get(CNGB_URL, timeout=120)
+    if response is None:
+        raise RuntimeError("Failed to fetch CNGB project status sheet: no response received")
+    response.raise_for_status()
+    df = pd.read_csv(
+        io.StringIO(response.text),
+        delimiter="\t",
+        dtype=str,
+        na_values=["NA", "missing", "", "NULL"],
+    )
+    df = df.dropna(how="all", axis=0)
+    df.to_csv(output_path, sep="\t", index=False)
+    return len(df)
+
+
+# ---------------------------------------------------------------------------
+# Prefect tasks and flow
+# ---------------------------------------------------------------------------
+
+
+@task(retries=2, retry_delay_seconds=30, log_prints=True)
+def fetch_project_status_sheets(index_url: str, output_dir: str) -> dict:
+    """Fetch all project status sheets listed in the private index TSV.
+
+    Args:
+        index_url (str): URL (or path) to the index TSV with columns
+            project_acronym, published_url, start_header_line.
+        output_dir (str): Directory to write per-project expanded TSVs.
+
+    Returns:
+        dict: Mapping of project acronym to row count.
+    """
+    response = safe_get(index_url, timeout=60)
+    if response is None:
+        raise RuntimeError("Failed to fetch project status sheets index: no response received")
+    response.raise_for_status()
+
+    index_df = pd.read_csv(
+        io.StringIO(response.text),
+        delimiter="\t",
+        usecols=["project_acronym", "published_url", "start_header_line"],
+        dtype={"project_acronym": str, "published_url": str, "start_header_line": int},
+    )
+    print(f"Found {len(index_df)} project sheets in index")
+
+    results = {}
+    for _, row in index_df.iterrows():
+        acronym = row["project_acronym"]
+        url = row["published_url"]
+        header_row = int(row["start_header_line"])
+        print(f"Processing {acronym} (header row {header_row})")
+        try:
+            df = _process_project(acronym, url, header_row)
+            out_file = os.path.join(output_dir, f"{acronym}_expanded.tsv")
+            df.to_csv(out_file, sep="\t", index=False)
+            results[acronym] = len(df)
+            print(f"  {acronym}: {len(df)} rows")
+        except Exception as exc:
+            print(f"  {acronym}: FAILED — {exc}")
+            failed_path = os.path.join(output_dir, f"{acronym}_expanded.tsv.failed")
+            open(failed_path, "w").close()  # noqa: SIM115 — legacy compat
+            results[acronym] = 0
+    print(results)
+    return results
+
+
+@task(retries=2, retry_delay_seconds=30, log_prints=True)
+def fetch_other_sheets(output_dir: str) -> dict:
+    """Fetch the three non-project Google Sheets (DTOL plant, tolqc, CNGB).
+
+    Args:
+        output_dir (str): Directory to write TSV files.
+
+    Returns:
+        dict: Mapping of filename to row count.
+    """
+    results = {}
+
+    plant_path = os.path.join(output_dir, "DTOL_Plant_Genome_Size_Estimates.tsv")
+    try:
+        results["DTOL_Plant_Genome_Size_Estimates"] = _fetch_dtol_plant_genome_sizes(plant_path)
+        print(f"Plant genome sizes: {results['DTOL_Plant_Genome_Size_Estimates']} rows")
+    except Exception as exc:
+        print(f"Plant genome sizes: FAILED — {exc}")
+        results["DTOL_Plant_Genome_Size_Estimates"] = 0
+
+    tolqc_path = os.path.join(output_dir, "DTOL_assembly_informatics_status_kmer_draft.tsv")
+    try:
+        results["DTOL_tolqc_status"] = _fetch_dtol_tolqc_status(tolqc_path)
+        print(f"DTOL tolqc status: {results['DTOL_tolqc_status']} rows")
+    except Exception as exc:
+        print(f"DTOL tolqc status: FAILED — {exc}")
+        results["DTOL_tolqc_status"] = 0
+
+    cngb_path = os.path.join(output_dir, "cngb.tsv")
+    try:
+        results["cngb"] = _fetch_cngb(cngb_path)
+        print(f"CNGB: {results['cngb']} rows")
+    except Exception as exc:
+        print(f"CNGB: FAILED — {exc}")
+        results["cngb"] = 0
+
+    return results
+
+
+@task(log_prints=True)
+def upload_s3_dir(local_dir: str, s3_path: str) -> None:
+    """Upload all TSV files in a directory to S3."""
+    for fname in sorted(os.listdir(local_dir)):
+        if fname.endswith(".tsv") or fname.endswith(".tsv.gz"):
+            local_path = os.path.join(local_dir, fname)
+            remote_path = f"{s3_path.rstrip('/')}/{fname}"
+            print(f"Uploading {fname} to {remote_path}")
+            upload_to_s3(local_path, remote_path)
+
+
+@flow()
+def update_google_sheets_status(
+    output_path: str,
+    index_url: str,
+    s3_path: str = "",
+    min_records: int = 0,
+) -> bool:
+    """Fetch all Google Sheets project status and supplementary data.
+
+    Args:
+        output_path (str): Directory to write output TSVs.
+        index_url (str): URL to the private index TSV (from env
+            GOAT_SHEETS_INDEX_URL if not provided).
+        s3_path (str): Optional S3 path to upload results.
+        min_records (int): Minimum total records to accept.
+
+    Returns:
+        bool: True on success.
+    """
+    if not is_safe_path(output_path):
+        raise ValueError(f"Unsafe output path: {output_path}")
+
+    resolved_path = os.path.abspath(output_path)
+    os.makedirs(resolved_path, exist_ok=True)
+
+    if not index_url:
+        index_url = os.environ.get("GOAT_SHEETS_INDEX_URL", "")
+
+    total = 0
+    project_results = {}
+    if index_url:
+        project_results = fetch_project_status_sheets(index_url, resolved_path)
+        total += sum(project_results.values())
+    else:
+        print("No index URL provided — skipping project status sheets")
+
+    other_results = fetch_other_sheets(resolved_path)
+    total += sum(other_results.values())
+
+    if total < min_records:
+        raise RuntimeError(f"Google Sheets: fewer than {min_records} total records: {total}")
+
+    if s3_path:
+        upload_s3_dir(resolved_path, s3_path)
+
+    emit_event(
+        event="update.google.sheets.status.finished",
+        resource={
+            "prefect.resource.id": f"update.google.sheets.status.{resolved_path}",
+            "prefect.resource.type": "google.sheets.status",
+        },
+        payload={
+            "total_records": total,
+            "projects": len(project_results),
+            "other_sheets": len(other_results),
+        },
+    )
+    return True
+
+
+if __name__ == "__main__":
+    args = parse_args(
+        [required(OUTPUT_PATH), required(INDEX_URL), S3_PATH, MIN_RECORDS],
+        "Fetch project status data from Google Sheets.",
+    )
+    update_google_sheets_status(**vars(args))
diff --git a/flows/updaters/update_jgi_status.py b/flows/updaters/update_jgi_status.py
new file mode 100644
index 0000000..2910bf9
--- /dev/null
+++ b/flows/updaters/update_jgi_status.py
@@ -0,0 +1,214 @@
+import csv
+import os
+
+from flows.lib.conditional_import import emit_event, flow, task
+from flows.lib.shared_args import MIN_RECORDS, OUTPUT_PATH, S3_PATH, parse_args, required
+from flows.lib.utils import is_safe_path, safe_get, upload_to_s3
+
+JGI_BASE_URL = "https://gold-ws.jgi.doe.gov"
+JGI_STUDY_ID = "Gs0000001"
+
+FIELDNAMES = [
+    "projectGoldId",
+    "projectName",
+    "legacyGoldId",
+    "studyGoldId",
+    "biosampleGoldId",
+    "organismGoldId",
+    "itsProposalId",
+    "itsSpid",
+    "itsSampleId",
+    "pmoProjectId",
+    "gptsProposalId",
+    "ncbiBioProjectAccession",
+    "ncbiBioSampleAccession",
+    "projectStatus",
+    "sequencingStatus",
+    "jgiFundingProgram",
+    "jgiFundingYear",
+    "hmpId",
+    "modDate",
+    "addDate",
+    "sequencingStrategy",
+    "sequencingCenters",
+    "seqMethod",
+    "genomePublications",
+    "otherPublications",
+    "sraExperimentIds",
+    "ncbiTaxId",
+]
+
+
+def _exchange_token(offline_token: str) -> str:
+    """Exchange a JGI offline token for an access token.
+
+    Args:
+        offline_token (str): The JGI offline (API) token.
+
+    Returns:
+        str: A valid access token.
+
+    Raises:
+        RuntimeError: If the token exchange fails.
+    """
+    url = f"{JGI_BASE_URL}/exchange?offlineToken={offline_token}"
+    response = safe_get(url, timeout=30)
+    if response is None:
+        raise RuntimeError("JGI token exchange failed: no response received")
+    if response.status_code != 200:
+        raise RuntimeError(
+            f"JGI token exchange failed: HTTP {response.status_code} — check that JGI_OFFLINE_TOKEN is valid"
+        )
+    if not (token := response.content.decode().strip()):
+        raise RuntimeError("JGI token exchange returned empty access token")
+    return token
+
+
+def _fetch_organisms(access_token: str) -> dict:
+    """Fetch organism-to-taxid mapping from JGI GOLD API.
+
+    Args:
+        access_token (str): Valid JGI access token.
+
+    Returns:
+        dict: Mapping of organismGoldId to ncbiTaxId.
+    """
+    headers = {"Authorization": f"Bearer {access_token}", "Accept": "application/json"}
+    url = f"{JGI_BASE_URL}/api/v1/organisms?studyGoldId={JGI_STUDY_ID}"
+    response = safe_get(url, headers=headers, timeout=120)
+    if response is None:
+        raise RuntimeError("JGI organism fetch failed: no response received")
+    if response.status_code != 200:
+        raise RuntimeError(
+            f"JGI organism fetch failed: HTTP {response.status_code} — check that JGI_OFFLINE_TOKEN is valid"
+        )
+    response.raise_for_status()
+    organisms = response.json()
+    return {org["organismGoldId"]: org.get("ncbiTaxId", "") for org in organisms}
+
+
+def _fetch_projects(access_token: str) -> list:
+    """Fetch project records from JGI GOLD API.
+
+    Args:
+        access_token (str): Valid JGI access token.
+
+    Returns:
+        list: List of project dictionaries.
+    """
+    headers = {"Authorization": f"Bearer {access_token}", "Accept": "application/json"}
+    url = f"{JGI_BASE_URL}/api/v1/projects?studyGoldId={JGI_STUDY_ID}"
+    response = safe_get(url, headers=headers, timeout=120)
+    if response is None:
+        raise RuntimeError("JGI project fetch failed: no response received")
+    if response.status_code != 200:
+        raise RuntimeError(
+            f"JGI project fetch failed: HTTP {response.status_code} — check that JGI_OFFLINE_TOKEN is valid"
+        )
+    response.raise_for_status()
+    return response.json()
+
+
+@task(retries=2, retry_delay_seconds=10, log_prints=True)
+def fetch_jgi_tsv(file_path: str, min_lines: int = 1) -> int:
+    """Fetch JGI 1KFG project data and write to TSV.
+
+    Exchanges the offline token for an access token, fetches organism-taxid
+    mapping and project records, filters for whole genome sequencing projects,
+    and writes a TSV.
+
+    Args:
+        file_path (str): Path to the output TSV file.
+        min_lines (int): Minimum number of data rows expected.
+
+    Returns:
+        int: Number of lines written (including header).
+    """
+    offline_token = os.environ.get("JGI_OFFLINE_TOKEN")
+    if not offline_token:
+        raise RuntimeError(
+            "JGI_OFFLINE_TOKEN environment variable is not set — cannot authenticate with JGI GOLD API"
+        )
+
+    print("Exchanging JGI offline token for access token")
+    access_token = _exchange_token(offline_token)
+
+    print(f"Fetching organisms for study {JGI_STUDY_ID}")
+    org_to_taxid = _fetch_organisms(access_token)
+    print(f"Found {len(org_to_taxid)} organisms")
+
+    print(f"Fetching projects for study {JGI_STUDY_ID}")
+    projects = _fetch_projects(access_token)
+    print(f"Found {len(projects)} total projects")
+
+    source_fields = [f for f in FIELDNAMES if f != "ncbiTaxId"]
+    row_count = 0
+    with open(file_path, "w", newline="") as f:
+        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
+        writer.writerow(FIELDNAMES)
+        for project in projects:
+            if project.get("sequencingStrategy") != "Whole Genome Sequencing":
+                continue
+            organism_id = project.get("organismGoldId", "")
+            taxid = org_to_taxid.get(organism_id, "")
+            row = [project.get(field, "") for field in source_fields] + [taxid]
+            writer.writerow(row)
+            row_count += 1
+
+    line_count = row_count + 1  # include header
+    if row_count < min_lines:
+        raise RuntimeError(f"JGI file has fewer than {min_lines} data rows: {row_count}")
+    print(f"Wrote {row_count} WGS projects to {file_path}")
+    return line_count
+
+
+@task(log_prints=True)
+def upload_s3_tsv(local_path: str, s3_path: str) -> None:
+    """Upload JGI TSV to S3."""
+    print(f"Uploading JGI TSV from {local_path} to {s3_path}")
+    upload_to_s3(local_path, s3_path)
+
+
+@flow()
+def update_jgi_status(
+    output_path: str,
+    s3_path: str = "",
+    min_records: int = 0,
+) -> bool:
+    """Fetch JGI 1KFG status list and optionally upload to S3.
+
+    Args:
+        output_path (str): Path to the output TSV file.
+        s3_path (str): Optional S3 path to upload the result.
+        min_records (int): Minimum record count to accept the output.
+
+    Returns:
+        bool: True on success.
+    """
+    if not is_safe_path(output_path):
+        raise ValueError(f"Unsafe output path: {output_path}")
+    resolved_path = os.path.abspath(output_path)
+    os.makedirs(os.path.dirname(resolved_path), exist_ok=True)
+
+    line_count = fetch_jgi_tsv(resolved_path, min_records)
+
+    if line_count > min_records and s3_path:
+        upload_s3_tsv(resolved_path, s3_path)
+
+    emit_event(
+        event="update.jgi.status.finished",
+        resource={
+            "prefect.resource.id": f"update.jgi.{output_path}",
+            "prefect.resource.type": "jgi.status",
+        },
+        payload={"line_count": line_count},
+    )
+    return True
+
+
+if __name__ == "__main__":
+    args = parse_args(
+        [required(OUTPUT_PATH), S3_PATH, MIN_RECORDS],
+        "Fetch JGI 1KFG status list.",
+    )
+    update_jgi_status(**vars(args))
diff --git a/flows/updaters/update_ncbi_datasets.py b/flows/updaters/update_ncbi_datasets.py
index e2f1e4c..c02ca1b 100644
--- a/flows/updaters/update_ncbi_datasets.py
+++ b/flows/updaters/update_ncbi_datasets.py
@@ -202,7 +202,7 @@ def generate_md5(file_path):
 def update_ncbi_datasets(
     root_taxid: str,
     output_path: str,
-    s3_path: str,
+    s3_path: str = "",
     data_freeze_path: Optional[str] = None,
 ) -> bool:
     line_count = fetch_ncbi_datasets_summary(root_taxid, file_path=output_path, data_freeze_path=data_freeze_path)
diff --git a/flows/updaters/update_ncbi_taxonomy.py b/flows/updaters/update_ncbi_taxonomy.py
index 890d11a..f6f9935 100644
--- a/flows/updaters/update_ncbi_taxonomy.py
+++ b/flows/updaters/update_ncbi_taxonomy.py
@@ -86,7 +86,7 @@ def taxonomy_is_up_to_date(local_path: str, http_path: str) -> bool:
 
 
 @flow()
-def update_ncbi_taxonomy(output_path: str) -> None:
+def update_ncbi_taxonomy(output_path: str) -> bool:
     """Fetch and the NCBI taxonomy dump.
 
     Args:
diff --git a/flows/updaters/update_nhm_status_list.py b/flows/updaters/update_nhm_status_list.py
index 707e15d..6e75055 100644
--- a/flows/updaters/update_nhm_status_list.py
+++ b/flows/updaters/update_nhm_status_list.py
@@ -43,9 +43,7 @@ def fetch_nhm_tsv(
 
     # If the file has less than min_records lines, raise an error
     if line_count < min_lines:
-        raise RuntimeError(
-            f"File {file_path} has less than {min_lines} lines: {line_count}"
-        )
+        raise RuntimeError(f"File {file_path} has fewer than {min_lines} lines: {line_count}")
     # Return the line count
     return line_count
 
@@ -57,7 +55,7 @@ def upload_s3_tsv(local_path: str, s3_path: str) -> None:
 
 
 @flow()
-def update_nhm_status_list(output_path: str, s3_path: str, min_records: int) -> None:
+def update_nhm_status_list(output_path: str, s3_path: str = "", min_records: int = 0) -> bool:
     """Update the NHM status list TSV file."""
     os.makedirs(os.path.dirname(output_path), exist_ok=True)
     line_count = fetch_nhm_tsv(output_path, min_records)
@@ -67,7 +65,7 @@ def update_nhm_status_list(output_path: str, s3_path: str, min_records: int) ->
         event="update.nhm.tsv.finished",
         resource={
             "prefect.resource.id": f"update.nhm.{output_path}",
-            "prefect.resource.type": "nhm.tsv",
+            "prefect.resource.type": "nhm.status",
         },
         payload={"line_count": line_count},
     )
@@ -82,9 +80,3 @@ def update_nhm_status_list(output_path: str, s3_path: str, min_records: int) ->
     )
 
     update_nhm_status_list(**vars(args))
-    args = parse_args(
-        [required(OUTPUT_PATH), S3_PATH, MIN_RECORDS],
-        "Fetch species data from NHM.",
-    )
-
-    update_nhm_status_list(**vars(args))
diff --git a/flows/updaters/update_ott_taxonomy.py b/flows/updaters/update_ott_taxonomy.py
index 7280a14..725df70 100644
--- a/flows/updaters/update_ott_taxonomy.py
+++ b/flows/updaters/update_ott_taxonomy.py
@@ -42,9 +42,7 @@ def fetch_ott_taxonomy(
 
     # Find the extracted subdirectory (should start with 'ott')
     extracted_dirs = [
-        d
-        for d in os.listdir(local_path)
-        if os.path.isdir(os.path.join(local_path, d)) and d.startswith("ott")
+        d for d in os.listdir(local_path) if os.path.isdir(os.path.join(local_path, d)) and d.startswith("ott")
     ]
     if not extracted_dirs:
         raise RuntimeError("No extracted ott directory found.")
@@ -111,28 +109,30 @@ def set_ott_url() -> str:
 
     # Extract required fields
     source = ott_json.get("source", "")
-    name = ott_json.get("name", "")
-    version = ott_json.get("version", "")
 
     # Replace "draft" with "." in source to get OTT_VERSION
     ott_version = source.replace("draft", ".")
-    ott_major_version = f"{name}{version}"
 
-    return (
-        f"https://files.opentreeoflife.org/ott/"
-        f"{ott_major_version}/{ott_version}.tgz"
-    )
+    # May need to restore this if OTT switches back to the major-version URL structure; for now the URL uses the full version.
+    # name = ott_json.get("name", "")
+    # version = ott_json.get("version", "")
+    # ott_major_version = f"{name}{version}"
+
+    # return (
+    #     f"https://files.opentreeoflife.org/ott/"
+    #     f"{ott_major_version}/{ott_version}.tgz"
+    # )
+    return f"https://files.opentreeoflife.org/ott/{ott_version}/{ott_version}.tgz"
 
 
 @flow()
-def update_ott_taxonomy(output_path: str) -> None:
+def update_ott_taxonomy(output_path: str) -> bool:
     """Fetch the OTT taxonomy file.
 
     Args:
         output_path (str): Path to save the taxonomy dump.
     """
     http_path = set_ott_url()
-    status = None
     complete = False
     if ott_taxonomy_is_up_to_date(output_path, http_path):
         status = True
diff --git a/flows/updaters/update_refseq_organelles.py b/flows/updaters/update_refseq_organelles.py
new file mode 100644
index 0000000..68511b7
--- /dev/null
+++ b/flows/updaters/update_refseq_organelles.py
@@ -0,0 +1,322 @@
+import contextlib
+import csv
+import gzip
+import os
+import re
+import tempfile
+from collections import Counter
+
+from flows.lib.conditional_import import emit_event, flow, task
+from flows.lib.shared_args import (
+    MIN_RECORDS,
+    OUTPUT_PATH,
+    ROOT_TAXID,
+    S3_PATH,
+    parse_args,
+    required,
+)
+from flows.lib.utils import is_safe_path, safe_get, upload_to_s3
+
+REFSEQ_FTP = "https://ftp.ncbi.nlm.nih.gov/refseq/release"
+
+ORGANELLE_FIELDNAMES = [
+    "id",
+    "organelle",
+    "taxonId",
+    "genbankAccession",
+    "assemblySpan",
+    "gcPercent",
+    "nPercent",
+    "releaseDate",
+    "sourceAuthor",
+    "sourceYear",
+    "sourceTitle",
+    "pubmedId",
+    "bioproject",
+    "biosample",
+    "sampleLocation",
+]
+
+MONTHS = {
+    "JAN": "01",
+    "FEB": "02",
+    "MAR": "03",
+    "APR": "04",
+    "MAY": "05",
+    "JUN": "06",
+    "JUL": "07",
+    "AUG": "08",
+    "SEP": "09",
+    "OCT": "10",
+    "NOV": "11",
+    "DEC": "12",
+}
+
+
+def _reformat_date(date_str: str) -> str:
+    """Convert DD-MMM-YYYY to YYYY-MM-DD."""
+    parts = re.split(r"[:\-]", date_str)
+    if len(parts) < 3:
+        return date_str
+    return f"{parts[2]}-{MONTHS.get(parts[1].upper(), '00')}-{parts[0].zfill(2)}"
+
+
+def _refseq_listing(collection: str) -> list:
+    """Fetch directory listing of GenBank files for a RefSeq collection.
+
+    Args:
+        collection (str): Collection name (e.g., "mitochondrion").
+
+    Returns:
+        list: URLs of .genomic.gbff.gz files.
+    """
+    pattern = re.compile(r"(\w+\.\d+\.genomic\.gbff\.gz)")
+    url = f"{REFSEQ_FTP}/{collection}"
+    response = safe_get(url, timeout=120)
+    if response is None:
+        raise RuntimeError(f"Failed to fetch RefSeq listing for {collection}: no response received")
+    response.raise_for_status()
+    return [f"{url}/{match[1]}" for line in response.text.split("\n") if (match := pattern.search(line))]
+
+
+def _parse_features(entry, fields: dict) -> None:
+    """Extract taxonId and sample location from SeqRecord features."""
+    qualifiers = entry.features[0].qualifiers if entry.features else {}
+    if "db_xref" in qualifiers:
+        for xref in qualifiers["db_xref"]:
+            key, value = xref.split(":", 1)
+            if key == "taxon":
+                fields["taxonId"] = value
+    if "lat_lon" in qualifiers:
+        fields["sampleLocation"] = qualifiers["lat_lon"][0]
+
+
+def _parse_references(entry, fields: dict) -> None:
+    """Extract reference metadata from SeqRecord annotations."""
+    submitted_re = re.compile(r"Submitted\s\(\d{2}-\w{3}-(\d{4})\)")
+    published_re = re.compile(r"\s\((\d{4})\)[^(]*$")
+    for ref in entry.annotations.get("references", []):
+        if ref.journal == "Unpublished":
+            continue
+        if ref.journal.startswith("Submitted"):
+            if "sourceAuthor" in fields:
+                continue
+            if match := submitted_re.search(ref.journal):
+                fields["sourceYear"] = match[1]
+        elif "sourceAuthor" in fields:
+            continue
+        else:
+            if match := published_re.search(ref.journal):
+                fields["sourceYear"] = match[1]
+            if ref.title:
+                fields["sourceTitle"] = ref.title
+            if ref.pubmed_id:
+                fields["pubmedId"] = ref.pubmed_id
+        if ref.authors:
+            fields["sourceAuthor"] = ref.authors
+        elif ref.consrtm:
+            fields["sourceAuthor"] = ref.consrtm
+
+
+def _parse_xrefs(entry, fields: dict) -> None:
+    """Extract BioProject/BioSample cross-references."""
+    if not entry.dbxrefs:
+        return
+    bioprojects = []
+    biosamples = []
+    for dbxref in entry.dbxrefs:
+        with contextlib.suppress(ValueError):
+            key, value = dbxref.split(":", 1)
+            if key == "BioProject":
+                bioprojects.append(value)
+            elif key == "BioSample":
+                biosamples.append(value)
+    if bioprojects:
+        fields["bioproject"] = ";".join(bioprojects)
+    if biosamples:
+        fields["biosample"] = ";".join(biosamples)
+
+
+def _parse_sequence(entry, fields: dict) -> bool:
+    """Compute sequence stats (GC%, N%, span). Returns False if all Ns."""
+    seqstr = str(entry.seq.upper())
+    counter = Counter(seqstr)
+    length = len(seqstr)
+    n_pct = counter["N"] / length * 100 if length > 0 else 100
+    fields["nPercent"] = f"{n_pct:.2f}"
+    if n_pct == 100:
+        return False
+    gc = counter["G"] + counter["C"]
+    at = counter["A"] + counter["T"]
+    fields["gcPercent"] = f"{gc / (gc + at) * 100:.2f}" if (gc + at) > 0 else "0.00"
+    fields["assemblySpan"] = str(length)
+    return True
+
+
+def _parse_flatfile(flatfile_path: str, organelle: str, root_taxon: str) -> list:
+    """Parse a single GenBank flatfile for organelle sequences.
+
+    Args:
+        flatfile_path (str): Path to a gzipped GenBank file.
+        organelle (str): Organelle type ("mitochondrion" or "plastid").
+        root_taxon (str): Optional root to filter by; matched against the record's taxonomy lineage names.
+
+    Returns:
+        list: List of row dicts.
+    """
+    from Bio import SeqIO
+
+    comment_re = re.compile(
+        r"(?:derived|identical)\s(?:from|to)\s([\w\d]+).*COMPLETENESS: full length",
+        re.DOTALL,
+    )
+    rows = []
+    with gzip.open(flatfile_path, "rt") as fh:
+        for entry in SeqIO.parse(fh, "gb"):
+            if root_taxon and root_taxon not in entry.annotations.get("taxonomy", []):
+                continue
+            fields = {"id": entry.id, "organelle": organelle}
+            comment = entry.annotations.get("comment", "")
+            if comment:
+                if match := comment_re.search(comment):
+                    fields["genbankAccession"] = match[1]
+                else:
+                    continue
+            _parse_features(entry, fields)
+            _parse_references(entry, fields)
+            fields["releaseDate"] = _reformat_date(entry.annotations.get("date", ""))
+            _parse_xrefs(entry, fields)
+            try:
+                if not _parse_sequence(entry, fields):
+                    continue
+            except Exception:
+                continue
+            rows.append(fields)
+    return rows
+
+
+@task(retries=2, retry_delay_seconds=30, log_prints=True)
+def fetch_and_parse_organelles(
+    output_path: str,
+    organelles: list,
+    root_taxon: str,
+) -> int:
+    """Fetch RefSeq organelle data and parse to gzipped TSV.
+
+    Downloads GenBank flatfiles from NCBI FTP for each organelle type,
+    parses sequence records, and writes a combined TSV.
+
+    Args:
+        output_path (str): Path to write the output TSV (or .tsv.gz).
+        organelles (list): List of organelle types to parse.
+        root_taxon (str): Optional taxonomic root filter.
+
+    Returns:
+        int: Number of rows written.
+    """
+    if not organelles:
+        organelles = ["mitochondrion", "plastid"]
+
+    all_rows = []
+    for organelle in organelles:
+        print(f"Fetching listing for {organelle}")
+        listing = _refseq_listing(organelle)
+        print(f"Found {len(listing)} files for {organelle}")
+
+        for url in listing:
+            print(f"Downloading {url}")
+            response = safe_get(url, timeout=600)
+            if response is None:
+                raise RuntimeError(f"Failed to download {url}: no response received")
+            response.raise_for_status()
+
+            with tempfile.NamedTemporaryFile(suffix=".gbff.gz", delete=False) as tmp:
+                tmp.write(response.content)
+                tmp_path = tmp.name
+
+            try:
+                rows = _parse_flatfile(tmp_path, organelle, root_taxon)
+                all_rows.extend(rows)
+                print(f"Parsed {len(rows)} records from {os.path.basename(url)}")
+            finally:
+                os.unlink(tmp_path)
+
+    tsv_path = output_path.removesuffix(".gz")
+    with open(tsv_path, "w", newline="") as f:
+        writer = csv.DictWriter(
+            f,
+            fieldnames=ORGANELLE_FIELDNAMES,
+            delimiter="\t",
+            lineterminator="\n",
+            extrasaction="ignore",
+        )
+        writer.writeheader()
+        for row in all_rows:
+            writer.writerow(row)
+
+    if output_path.endswith(".gz"):
+        with open(tsv_path, "rb") as f_in, gzip.open(output_path, "wb") as f_out:
+            f_out.write(f_in.read())
+        os.remove(tsv_path)
+
+    print(f"Wrote {len(all_rows)} total organelle records to {output_path}")
+    return len(all_rows)
+
+
+@task(log_prints=True)
+def upload_s3_file(local_path: str, s3_path: str) -> None:
+    """Upload file to S3."""
+    print(f"Uploading {local_path} to {s3_path}")
+    upload_to_s3(local_path, s3_path)
+
+
+@flow()
+def update_refseq_organelles(
+    output_path: str,
+    root_taxid: str = "",
+    s3_path: str = "",
+    min_records: int = 0,
+) -> bool:
+    """Fetch and parse RefSeq organelle data.
+
+    Args:
+        output_path (str): Path to write the output TSV.
+        root_taxid (str): Optional root taxon filter; compared against each record's lineage names.
+        s3_path (str): Optional S3 path to upload the result.
+        min_records (int): Minimum record count to accept.
+
+    Returns:
+        bool: True on success.
+    """
+    if not is_safe_path(output_path):
+        raise ValueError(f"Unsafe output path: {output_path}")
+    resolved_path = os.path.abspath(output_path)
+    os.makedirs(os.path.dirname(resolved_path), exist_ok=True)
+
+    row_count = fetch_and_parse_organelles(
+        resolved_path, organelles=["mitochondrion", "plastid"], root_taxon=root_taxid
+    )
+
+    if row_count < min_records:
+        raise RuntimeError(f"RefSeq organelles: fewer than {min_records} records: {row_count}")
+
+    if s3_path:
+        upload_s3_file(resolved_path, s3_path)
+
+    emit_event(
+        event="update.refseq.organelles.finished",
+        resource={
+            "prefect.resource.id": f"update.refseq.organelles.{output_path}",
+            "prefect.resource.type": "refseq.organelles",
+        },
+        payload={"row_count": row_count},
+    )
+    return True
+
+
+if __name__ == "__main__":
+    args = parse_args(
+        [required(OUTPUT_PATH), ROOT_TAXID, S3_PATH, MIN_RECORDS],
+        "Fetch and parse RefSeq organelle data.",
+    )
+    update_refseq_organelles(**vars(args))
diff --git a/flows/updaters/update_sra_data.py b/flows/updaters/update_sra_data.py
new file mode 100644
index 0000000..e1ffe79
--- /dev/null
+++ b/flows/updaters/update_sra_data.py
@@ -0,0 +1,370 @@
+import csv
+import gzip
+import os
+import xml.etree.ElementTree as ET
+from collections import defaultdict
+from datetime import date, timedelta
+from itertools import groupby
+
+from flows.lib.conditional_import import emit_event, flow, task
+from flows.lib.shared_args import (
+    INPUT_PATH,
+    MIN_RECORDS,
+    OUTPUT_PATH,
+    ROOT_TAXID,
+    S3_PATH,
+    default,
+    parse_args,
+    required,
+)
+from flows.lib.utils import is_safe_path, run_quoted, upload_to_s3
+
+SRA_FIELDNAMES = [
+    "taxon_id",
+    "sra_accession",
+    "run_accession",
+    "library_source",
+    "platform",
+    "reads",
+    "total_reads",
+    "total_runs",
+]
+
+
+def _split_chunks(values, split_val):
+    """Split an iterable into chunks at occurrences of split_val.
+
+    Args:
+        values: Iterable to split.
+        split_val: Value at which to split.
+
+    Yields:
+        (int, group) pairs.
+    """
+    index = 0
+
+    def chunk_index(val):
+        nonlocal index
+        if val == split_val:
+            index += 1
+        return index
+
+    return groupby(values, chunk_index)
+
+
+def _open_file(file_path, **kwargs):
+    """Open a file, decompressing gzip if needed."""
+    if file_path.endswith(".gz"):
+        return gzip.open(file_path, "rt", encoding="utf8", **kwargs)
+    return open(file_path, "r", encoding="utf8", **kwargs)
+
+
+def _read_exp_xml(node, obj):
+    """Extract fields from an ExpXml element."""
+    for child in node:
+        tag = child.tag
+        if tag == "Bioproject":
+            obj["bioproject"] = child.text
+        elif tag == "Biosample":
+            obj["biosample"] = child.text
+        elif tag == "Organism":
+            obj["taxon_id"] = child.get("taxid")
+        elif tag == "Experiment":
+            obj["sra_accession"] = child.get("acc")
+        elif tag == "Summary":
+            obj["platform"] = child.findtext("Platform") or ""
+        elif tag == "Library_descriptor":
+            source = child.findtext("LIBRARY_SOURCE")
+            obj["library_source"] = source.lower() if source else ""
+
+
+def _read_runs(node, obj):
+    """Extract run accessions and read counts from a Runs element."""
+    if "runs" not in obj:
+        obj["runs"] = []
+    for child in node:
+        obj["runs"].append({"accession": child.get("acc"), "reads": child.get("total_spots", "0")})
+
+
+def parse_sra_xml(xml_file: str) -> list:
+    """Parse an SRA efetch docsum XML file into row dicts.
+
+    Args:
+        xml_file (str): Path to the XML (or .xml.gz) file.
+
+    Returns:
+        list: List of dicts with taxon_id, sra_accession, runs, etc.
+    """
+    rows = []
+    xml_header = '<?xml version="1.0" encoding="UTF-8" ?>\n'
+    with _open_file(xml_file) as f:
+        for _, doc in _split_chunks(f, xml_header):
+            lines = list(doc)
+            try:
+                root = ET.fromstringlist(lines)
+            except ET.ParseError:
+                continue
+            for doc_summary in root.iter("DocumentSummary"):
+                obj = {"date": "", "runs": []}
+                for child in doc_summary:
+                    tag = child.tag
+                    if tag == "CreateDate":
+                        obj["date"] = child.text or ""
+                    elif tag == "ExpXml":
+                        _read_exp_xml(child, obj)
+                    elif tag == "Runs":
+                        _read_runs(child, obj)
+                if "taxon_id" in obj and obj["runs"]:
+                    rows.append(obj)
+    return rows
+
+
+def group_by_taxon(rows: list, grouped: dict) -> list:
+    """Group SRA runs by taxon, keeping the 10 most recent per taxon.
+
+    Args:
+        rows (list): Parsed SRA row dicts with runs.
+        grouped (dict): Optional existing grouped data to merge into.
+
+    Returns:
+        list: One dict per taxon with aggregated fields.
+    """
+    if not grouped:
+        grouped = defaultdict(lambda: {"count": 0, "reads": 0, "runs": []})
+    for obj in sorted(rows, key=lambda r: r.get("date", "")):
+        taxon_id = obj.get("taxon_id")
+        if not taxon_id:
+            continue
+        for run in obj.get("runs", []):
+            try:
+                reads = int(run["reads"])
+            except (ValueError, TypeError):
+                reads = 0
+            row = {
+                "sra_accession": obj.get("sra_accession", ""),
+                "run_accession": run["accession"],
+                "library_source": obj.get("library_source", ""),
+                "platform": obj.get("platform", ""),
+                "reads": reads,
+            }
+            grouped[taxon_id]["runs"].insert(0, row)
+            grouped[taxon_id]["count"] += 1
+            grouped[taxon_id]["reads"] += reads
+            if len(grouped[taxon_id]["runs"]) > 10:
+                grouped[taxon_id]["runs"].pop()
+
+    return [
+        {
+            "taxon_id": taxon_id,
+            "sra_accession": ";".join(r["sra_accession"] for r in grp["runs"]),
+            "run_accession": ";".join(r["run_accession"] for r in grp["runs"]),
+            "library_source": ";".join(r["library_source"] for r in grp["runs"]),
+            "platform": ";".join(r["platform"] for r in grp["runs"]),
+            "reads": ";".join(str(r["reads"]) for r in grp["runs"]),
+            "total_reads": grp["reads"],
+            "total_runs": grp["count"],
+        }
+        for taxon_id, grp in grouped.items()
+    ]
+
+
+def load_previous_tsv(file_path: str) -> dict:
+    """Load previously grouped SRA data from a TSV for incremental updates.
+
+    Args:
+        file_path (str): Path to the existing TSV (or .tsv.gz).
+
+    Returns:
+        dict: Grouped data keyed by taxon_id, or empty dict if file missing.
+    """
+    if not os.path.isfile(file_path):
+        return {}
+    grouped = defaultdict(lambda: {"count": 0, "reads": 0, "runs": []})
+    with _open_file(file_path, newline="") as f:
+        reader = csv.DictReader(f, delimiter="\t")
+        for row in reader:
+            taxon_id = row["taxon_id"]
+            grouped[taxon_id]["count"] = int(row["total_runs"])
+            grouped[taxon_id]["reads"] = int(row["total_reads"])
+            run_accs = row["run_accession"].split(";")
+            sra_accs = row["sra_accession"].split(";")
+            lib_srcs = row["library_source"].split(";")
+            platforms = row["platform"].split(";")
+            reads_list = row["reads"].split(";")
+            for i, run_acc in enumerate(run_accs):
+                grouped[taxon_id]["runs"].append(
+                    {
+                        "run_accession": run_acc,
+                        "sra_accession": sra_accs[i] if i < len(sra_accs) else "",
+                        "library_source": lib_srcs[i] if i < len(lib_srcs) else "",
+                        "platform": platforms[i] if i < len(platforms) else "",
+                        "reads": int(reads_list[i]) if i < len(reads_list) else 0,
+                    }
+                )
+    return grouped
+
+
+def _get_yesterday() -> str:
+    """Return yesterday's date as YYYY/MM/DD."""
+    return (date.today() - timedelta(days=1)).strftime("%Y/%m/%d")
+
+
+@task(retries=2, retry_delay_seconds=30, log_prints=True)
+def fetch_sra_xml(
+    output_xml: str,
+    root_taxid: str = "2759",
+    min_date: str = "2024/01/01",
+) -> str:
+    """Fetch SRA docsum XML from NCBI using esearch/efetch.
+
+    Requires the NCBI Entrez Direct (edirect) tools and NCBI_API_KEY
+    environment variable.
+
+    Args:
+        output_xml (str): Path to write the XML output.
+        root_taxid (str): Root taxon ID to query.
+        min_date (str): Start date for the query (YYYY/MM/DD).
+
+    Returns:
+        str: Path to the written XML file.
+    """
+    max_date = _get_yesterday()
+
+    query = f"(txid{root_taxid}[organism:exp])"
+    esearch_cmd = [
+        "esearch",
+        "-db",
+        "sra",
+        "-query",
+        query,
+    ]
+    esearch_cmd.extend(["-mindate", min_date, "-maxdate", max_date])
+
+    efetch_cmd = ["efetch", "-db", "sra", "-format", "docsum"]
+
+    print(f"Running esearch | efetch for taxid {root_taxid} ({min_date} to {max_date})")
+    esearch = run_quoted(esearch_cmd, capture_output=True, text=True, timeout=3000)
+    if esearch.returncode != 0:
+        raise RuntimeError(f"esearch failed: {esearch.stderr}")
+
+    with open(output_xml, "w") as f:
+        efetch = run_quoted(efetch_cmd, input=esearch.stdout, capture_output=True, text=True, timeout=6000)
+        if efetch.returncode != 0:
+            raise RuntimeError(f"efetch failed: {efetch.stderr}")
+        f.write(efetch.stdout)
+
+    print(f"Wrote SRA XML to {output_xml}")
+    return output_xml
+
+
+@task(retries=2, retry_delay_seconds=5, log_prints=True)
+def parse_and_write_sra(
+    xml_path: str,
+    output_path: str,
+    previous_path: str,
+) -> int:
+    """Parse SRA XML and write grouped TSV.
+
+    Args:
+        xml_path (str): Path to the SRA docsum XML.
+        output_path (str): Path to write the output TSV.
+        previous_path (str): Optional path to previous TSV for incremental merge.
+
+    Returns:
+        int: Number of taxon rows written.
+    """
+    previous = load_previous_tsv(previous_path) if previous_path else {}
+    rows = parse_sra_xml(xml_path)
+    print(f"Parsed {len(rows)} records from XML")
+
+    grouped_rows = group_by_taxon(rows, grouped=previous)
+    print(f"Grouped into {len(grouped_rows)} taxa")
+
+    tsv_path = output_path.removesuffix(".gz")
+    with open(tsv_path, "w", newline="") as f:
+        writer = csv.DictWriter(f, fieldnames=SRA_FIELDNAMES, delimiter="\t", lineterminator="\n")
+        writer.writeheader()
+        for row in grouped_rows:
+            writer.writerow(row)
+
+    if output_path.endswith(".gz"):
+        with open(tsv_path, "rb") as f_in, gzip.open(output_path, "wb") as f_out:
+            f_out.write(f_in.read())
+        os.remove(tsv_path)
+
+    print(f"Wrote {len(grouped_rows)} taxon rows to {output_path}")
+    return len(grouped_rows)
+
+
+@task(log_prints=True)
+def upload_s3_file(local_path: str, s3_path: str) -> None:
+    """Upload file to S3."""
+    print(f"Uploading {local_path} to {s3_path}")
+    upload_to_s3(local_path, s3_path)
+
+
+@flow()
+def update_sra_data(
+    output_path: str,
+    input_path: str = "",
+    s3_path: str = "",
+    root_taxid: str = "2759",
+    min_records: int = 0,
+) -> bool:
+    """Fetch and parse SRA data, writing grouped TSV output.
+
+    If input_path is provided, parses that XML file directly. Otherwise
+    fetches fresh data from NCBI using esearch/efetch.
+
+    Args:
+        output_path (str): Path to write the output TSV (or .tsv.gz).
+        input_path (str): Optional path to an existing SRA XML file.
+        s3_path (str): Optional S3 path to upload the result.
+        root_taxid (str): Root taxon ID for the NCBI query.
+        min_records (int): Minimum taxon count to accept the output.
+
+    Returns:
+        bool: True on success.
+    """
+    if not is_safe_path(output_path):
+        raise ValueError(f"Unsafe output path: {output_path}")
+    resolved_path = os.path.abspath(output_path)
+    os.makedirs(os.path.dirname(resolved_path), exist_ok=True)
+
+    if input_path and os.path.isfile(input_path):
+        xml_path = input_path
+    else:
+        xml_path = f"{resolved_path}.xml"
+        fetch_sra_xml(xml_path, root_taxid=root_taxid)
+
+    row_count = parse_and_write_sra(xml_path, resolved_path, previous_path=resolved_path)
+
+    if row_count < min_records:
+        raise RuntimeError(f"SRA output has fewer than {min_records} taxa: {row_count}")
+
+    if s3_path:
+        upload_s3_file(resolved_path, s3_path)
+
+    emit_event(
+        event="update.sra.data.finished",
+        resource={
+            "prefect.resource.id": f"update.sra.{output_path}",
+            "prefect.resource.type": "sra.data",
+        },
+        payload={"row_count": row_count},
+    )
+    return True
+
+
+if __name__ == "__main__":
+    args = parse_args(
+        [
+            required(OUTPUT_PATH),
+            INPUT_PATH,
+            default(ROOT_TAXID, "2759"),
+            S3_PATH,
+            MIN_RECORDS,
+        ],
+        "Fetch and parse SRA data into grouped TSV.",
+    )
+    update_sra_data(**vars(args))
diff --git a/flows/updaters/update_tol_genome_notes.py b/flows/updaters/update_tol_genome_notes.py
index 914dff6..c01ad94 100644
--- a/flows/updaters/update_tol_genome_notes.py
+++ b/flows/updaters/update_tol_genome_notes.py
@@ -101,7 +101,7 @@ def upload_s3_tsv(local_path: str, s3_path: str) -> None:
 
 
 @flow()
-def update_tol_genome_notes(output_path: str, s3_path: str, min_records: int) -> None:
+def update_tol_genome_notes(output_path: str, s3_path: str = "", min_records: int = 0) -> bool:
     """Update the ToL genome notes TSV file."""
     os.makedirs(os.path.dirname(os.path.abspath(output_path)), exist_ok=True)
     line_count = fetch_tol_genome_notes(output_path, min_records)
diff --git a/flows/updaters/update_tol_portal_status.py b/flows/updaters/update_tol_portal_status.py
index 6fbe77f..bf10bc6 100644
--- a/flows/updaters/update_tol_portal_status.py
+++ b/flows/updaters/update_tol_portal_status.py
@@ -219,7 +219,7 @@ def upload_s3_tsv(local_path: str, s3_path: str) -> None:
 
 
 @flow()
-def update_tol_portal_status(output_path: str, s3_path: str, min_records: int) -> None:
+def update_tol_portal_status(output_path: str, s3_path: str = "", min_records: int = 0) -> None:
     """Update the ToL Portal Project Status TSV file."""
 
     os.makedirs(os.path.dirname(os.path.abspath(output_path)), exist_ok=True)
diff --git a/flows/updaters/update_ucsc_assemblies.py b/flows/updaters/update_ucsc_assemblies.py
new file mode 100644
index 0000000..c2d9578
--- /dev/null
+++ b/flows/updaters/update_ucsc_assemblies.py
@@ -0,0 +1,97 @@
+import gzip
+import os
+
+from flows.lib.conditional_import import emit_event, flow, task
+from flows.lib.shared_args import OUTPUT_PATH, S3_PATH, parse_args, required
+from flows.lib.utils import is_safe_path, safe_get, upload_to_s3
+
+UCSC_URL = "https://hgdownload.soe.ucsc.edu/hubs/UCSC_GI.assemblyHubList.txt"
+OUTPUT_FILENAME = "UCSC_GI.assemblyHubList.tsv.gz"
+
+
+@task(retries=2, retry_delay_seconds=5, log_prints=True)
+def fetch_ucsc_hub_list(output_dir: str) -> tuple[str, int]:
+    """Fetch the UCSC assembly hub accession list and write as gzipped TSV.
+
+    The source file is a tab-separated text file served with ISO-8859-1
+    encoding. It is decoded and rewritten as UTF-8 for consistency.
+
+    Args:
+        output_dir (str): Directory to write the output file.
+
+    Returns:
+        tuple[str, int]: Path to the output file and number of data lines.
+    """
+    output_path = os.path.join(output_dir, OUTPUT_FILENAME)
+    tsv_path = output_path.removesuffix(".gz")
+
+    print(f"Fetching UCSC hub list from {UCSC_URL}")
+    response = safe_get(UCSC_URL, timeout=60)
+    if response is None:
+        raise RuntimeError("Failed to fetch UCSC hub list: no response received")
+    response.raise_for_status()
+    response.encoding = "iso-8859-1"
+    text = response.text
+
+    with open(tsv_path, "w", encoding="utf-8") as f:
+        f.write(text)
+    line_count = text.count("\n")
+
+    with open(tsv_path, "rb") as f_in, gzip.open(output_path, "wb") as f_out:
+        f_out.write(f_in.read())
+    os.remove(tsv_path)
+
+    print(f"Wrote {line_count} lines to {output_path}")
+    return output_path, line_count
+
+
+@task(log_prints=True)
+def upload_s3_file(local_path: str, s3_path: str) -> None:
+    """Upload file to S3."""
+    print(f"Uploading {local_path} to {s3_path}")
+    upload_to_s3(local_path, s3_path)
+
+
+@flow()
+def update_ucsc_assemblies(
+    output_path: str,
+    s3_path: str = "",
+) -> bool:
+    """Fetch the UCSC assembly hub list and optionally upload to S3.
+
+    Args:
+        output_path (str): Directory to write the output file.
+        s3_path (str): Optional S3 directory path to upload the result.
+
+    Returns:
+        bool: True on success.
+    """
+    if not is_safe_path(output_path):
+        raise ValueError(f"Unsafe output path: {output_path}")
+
+    resolved_path = os.path.abspath(output_path)
+    os.makedirs(resolved_path, exist_ok=True)
+
+    local_file, line_count = fetch_ucsc_hub_list(resolved_path)
+
+    if s3_path:
+        remote_path = f"{s3_path.rstrip('/')}/{OUTPUT_FILENAME}"
+        upload_s3_file(local_file, remote_path)
+
+    emit_event(
+        event="update.ucsc.assemblies.finished",
+        resource={
+            "prefect.resource.id": f"update.ucsc.{resolved_path}",
+            "prefect.resource.type": "ucsc.assemblies",
+        },
+        payload={"line_count": line_count},
+    )
+    return True
+
+
+if __name__ == "__main__":
+    args = parse_args(
+        [required(OUTPUT_PATH), S3_PATH],
+        "Fetch UCSC assembly hub accession list.",
+    )
+    update_ucsc_assemblies(**vars(args))
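
A minimal sketch of the decode-and-gzip step in `fetch_ucsc_hub_list`, assuming the goal is a UTF-8 gzipped copy of an ISO-8859-1 payload. The helper name `write_gzipped_utf8` and the sample bytes are hypothetical and not part of the patch. It writes straight into the gzip stream instead of going through a temporary `.tsv` file, which is one possible simplification rather than the updater's actual behaviour.

```python
import gzip
import os
import tempfile


def write_gzipped_utf8(raw: bytes, output_path: str, source_encoding: str = "iso-8859-1") -> int:
    """Decode raw bytes, write them as gzipped UTF-8 text, return the newline count."""
    text = raw.decode(source_encoding)
    # Text-mode gzip with an explicit encoding avoids relying on the platform default.
    with gzip.open(output_path, "wt", encoding="utf-8", newline="") as f_out:
        f_out.write(text)
    # Mirror the updater's line count: number of newline characters in the payload.
    return text.count("\n")


# Hypothetical two-line payload containing a non-ASCII character (e-acute).
raw = "accession\tname\nGCA_1\tCaf\xe9\n".encode("iso-8859-1")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.tsv.gz")
    line_count = write_gzipped_utf8(raw, path)
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f_in:
        round_trip = f_in.read()

print(line_count, repr(round_trip))
```

Passing `encoding="utf-8"` explicitly keeps the output identical across platforms whose default text encoding differs.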
diff --git a/flows/updaters/update_vgp_original_status.py b/flows/updaters/update_vgp_original_status.py
new file mode 100644
index 0000000..8feb950
--- /dev/null
+++ b/flows/updaters/update_vgp_original_status.py
@@ -0,0 +1,88 @@
+import os
+
+from flows.lib.conditional_import import emit_event, flow, task
+from flows.lib.shared_args import MIN_RECORDS, OUTPUT_PATH, S3_PATH, parse_args, required
+from flows.lib.utils import upload_to_s3
+from flows.updaters.api import api_config as cfg
+from flows.updaters.api import api_tools as at
+
+
+@task(retries=2, retry_delay_seconds=5, log_prints=True)
+def fetch_vgp_original_tsv(
+    file_path: str,
+    min_lines: int = 1,
+) -> int:
+    """Fetch VGP original status list from the Vertebrate Genomes Project GitHub tracker.
+
+    Downloads the VGP genome portal YAML tracker, extracts species records,
+    and writes a TSV with per-species status fields.
+
+    Args:
+        file_path (str): Path to the output TSV file.
+        min_lines (int): Minimum number of rows expected.
+
+    Returns:
+        int: Number of lines written to the output file.
+    """
+    at.get_from_source(
+        cfg.vgl_url_opener,
+        cfg.vgl_hub_count_handler,
+        cfg.vgl_row_handler,
+        cfg.vgl_fieldnames,
+        file_path,
+    )
+
+    with open(file_path, "r") as f:
+        line_count = sum(1 for _ in f)
+
+    if line_count < min_lines:
+        raise RuntimeError(f"VGP file {file_path} has fewer than {min_lines} lines: {line_count}")
+    print(f"Wrote {line_count} lines to {file_path}")
+    return line_count
+
+
+@task(log_prints=True)
+def upload_s3_tsv(local_path: str, s3_path: str) -> None:
+    """Upload VGP TSV to S3."""
+    print(f"Uploading VGP TSV from {local_path} to {s3_path}")
+    upload_to_s3(local_path, s3_path)
+
+
+@flow()
+def update_vgp_original_status(output_path: str, s3_path: str = "", min_records: int = 0) -> bool:
+    """Fetch the VGP original status list and optionally upload to S3.
+
+    This is the rarely updated VGP source from the GitHub YAML tracker.
+    See update_vgp_status.py for the frequently-updated live Google Sheet.
+
+    Args:
+        output_path (str): Path to the output TSV file.
+        s3_path (str): Optional S3 path to upload the result.
+        min_records (int): Minimum record count to accept the output.
+
+    Returns:
+        bool: True on success.
+    """
+
+    resolved_path = os.path.abspath(output_path)
+    os.makedirs(os.path.dirname(resolved_path), exist_ok=True)
+    line_count = fetch_vgp_original_tsv(resolved_path, min_records)
+    if line_count > min_records and s3_path:
+        upload_s3_tsv(resolved_path, s3_path)
+    emit_event(
+        event="update.vgp.original.status.finished",
+        resource={
+            "prefect.resource.id": f"update.vgp.original.{resolved_path}",
+            "prefect.resource.type": "vgp.original.status",
+        },
+        payload={"line_count": line_count},
+    )
+    return True
+
+
+if __name__ == "__main__":
+    args = parse_args(
+        [required(OUTPUT_PATH), S3_PATH, MIN_RECORDS],
+        "Fetch VGP original status list from the Vertebrate Genomes Project.",
+    )
+    update_vgp_original_status(**vars(args))
diff --git a/flows/updaters/update_vgp_status.py b/flows/updaters/update_vgp_status.py
new file mode 100644
index 0000000..5e0cba6
--- /dev/null
+++ b/flows/updaters/update_vgp_status.py
@@ -0,0 +1,282 @@
+"""Fetch VGP Ordinal Phase1+ status from the live Google Sheet.
+
+This updater replaces the legacy ``vgp_live_sheet_curation.py`` script
+from goat-data. It downloads the live VGP spreadsheet, cleans headers,
+translates project names to canonical acronyms, expands sequencing
+status columns following the GoaT status hierarchy, and writes a TSV
+matching the ``FILE_VGP_Ordinal_Phase1.types.yaml`` schema.
+
+The companion ``update_vgp_original_status.py`` fetches the less
+frequently updated VGP GitHub YAML tracker source.
+"""
+
+import io
+import os
+
+import numpy as np
+import pandas as pd
+
+from flows.lib.conditional_import import emit_event, flow, task
+from flows.lib.shared_args import MIN_RECORDS, OUTPUT_PATH, S3_PATH, parse_args, required
+from flows.lib.utils import is_safe_path, safe_get, upload_to_s3
+
+# Published (export) link to the VGP Ordinal Phase1+ Google Sheet
+VGP_SHEET_URL = (
+    "https://docs.google.com/spreadsheets/d/1Jwjv6Kwc6VIn1UMMhnG6kvFCxjwGdC5b7p_HtbDOMOs"
+    "/export?format=tsv"
+    "&id=1Jwjv6Kwc6VIn1UMMhnG6kvFCxjwGdC5b7p_HtbDOMOs"
+    "&gid=1380659438"
+)
+
+# Columns to import from the spreadsheet
+SOURCE_COLUMNS = [
+    "Order",
+    "Lineage",
+    "Superorder",
+    "Family Scientific Name",
+    "Scientific Name",
+    "English Name",
+    "NCBI taxon ID",
+    "Status",
+    "QV",
+    "IUCN (2016-2024)",
+    "CITES",
+    "Main project",
+    "Second project",
+]
+
+# Map free-text project names to canonical EBP acronyms
+PROJECT_ACRONYMS = {
+    "Sanger 25G": "25GP",
+    "Sanger 25G project": "25GP",
+    "AfricaBP": "AFRICABP",
+    "Cetacean GP": "CGP",
+    "DToL": "DTOL",
+    "DToL?": "DTOL",
+    "Yggdrasil": "YGG",
+    "CatalanBP": "CBP",
+    "Canadian Biogenome Project": "CANBP",
+    "Canada Biogenome Project": "CANBP",
+    "Threatened Species Initiative (TSI)": "TSI",
+    "Minderoo OceanOmics": "OG",
+    "DToL, ERGA": "DTOL,ERGA",
+    "Amazoomics : Genomics of Brazilian Biodiversity": "AMAZOOMICS,GBB",
+    "AmaZoomics : Genomics of Brazilian Biodiversity": "AMAZOOMICS,GBB",
+    "Individual, Google": "Individual,Google",
+}
+
+# Numeric status code → GoaT sequencing status
+STATUS_MAP = {
+    "0": "",
+    "1": "sample_collected",
+    "2": "",
+    "3": "in_progress",
+    "4": "open",
+    "5": "open",
+}
+
+# Full ordered list of GoaT sequencing status columns
+SEQUENCING_STATUSES = [
+    "sample_collected",
+    "sample_acquired",
+    "in_progress",
+    "data_generation",
+    "in_assembly",
+    "insdc_submitted",
+    "open",
+    "insdc_open",
+    "published",
+]
+
+
+# ---------------------------------------------------------------------------
+# Processing helpers
+# ---------------------------------------------------------------------------
+
+
+def _cleanup_table(df: pd.DataFrame) -> pd.DataFrame:
+    """Set blank cells to NaN, trim edge spaces, drop empty rows/cols."""
+    df = df.replace(r"^\s*$", np.nan, regex=True)
+    df = df.replace(r"^ +| +$", "", regex=True)
+    df.dropna(how="all", axis=1, inplace=True)
+    df.dropna(how="all", axis=0, inplace=True)
+    return df
+
+
+def _cleanup_headers(df: pd.DataFrame) -> pd.DataFrame:
+    """Normalise column headers: lowercase, underscored, no parens."""
+    df.columns = (
+        df.columns.str.replace(" ", "_")
+        .str.replace(r"\(", "", regex=True)
+        .str.replace(r"\)", "", regex=True)
+        .str.lower()
+    )
+    return df
+
+
+def _get_acronym(project_name: str) -> str:
+    """Map a free-text project name to a canonical acronym."""
+    return PROJECT_ACRONYMS.get(project_name, project_name)
+
+
+def _translate_projects(df: pd.DataFrame) -> pd.DataFrame:
+    """Map free-text project names to canonical acronyms."""
+    for col in ["main_project", "second_project", "project"]:
+        if col in df.columns:
+            df[col] = df[col].map(lambda v: _get_acronym(str(v)) if pd.notna(v) else v)
+    return df
+
+
+def _build_all_projects(df: pd.DataFrame) -> pd.DataFrame:
+    """Create 'all_projects' column from project + main + second."""
+    df["all_projects"] = df.apply(
+        lambda row: ",".join(
+            sorted(
+                {
+                    x
+                    for x in [
+                        row.get("project"),
+                        row.get("main_project"),
+                        row.get("second_project"),
+                    ]
+                    if pd.notna(x)
+                }
+            )
+        ),
+        axis=1,
+    )
+    return df
+
+
+def _expand_sequencing_status(df: pd.DataFrame) -> pd.DataFrame:
+    """Map numeric status codes and cascade the GoaT status hierarchy."""
+    # Ensure all status columns exist
+    for col in SEQUENCING_STATUSES:
+        if col not in df.columns:
+            df[col] = None
+
+    # Map numeric codes to status names
+    df["sequencing_status"] = df["status"].map(STATUS_MAP)
+
+    # Populate status columns with all_projects for matching rows
+    for status in SEQUENCING_STATUSES:
+        df.loc[df["sequencing_status"] == status, status] = df["all_projects"]
+
+    # Cascade: each status also implies the earlier statuses in the hierarchy
+    df.loc[df["published"] == df["all_projects"], "insdc_open"] = df["all_projects"]
+    df.loc[df["insdc_open"] == df["all_projects"], "open"] = df["all_projects"]
+    df.loc[df["open"] == df["all_projects"], "in_progress"] = df["all_projects"]
+    df.loc[df["data_generation"] == df["all_projects"], "in_progress"] = df["all_projects"]
+    df.loc[df["in_assembly"] == df["all_projects"], "in_progress"] = df["all_projects"]
+    df.loc[df["in_progress"] == df["all_projects"], "data_generation"] = df["all_projects"]
+    df.loc[df["in_progress"] == df["all_projects"], "sample_acquired"] = df["all_projects"]
+    df.loc[df["sample_acquired"] == df["all_projects"], "sample_collected"] = df["all_projects"]
+    return df
+
+
+def _process_vgp_sheet(raw_tsv: str) -> pd.DataFrame:
+    """Full processing pipeline for the VGP live sheet.
+
+    Args:
+        raw_tsv (str): Raw TSV text content from Google Sheets.
+
+    Returns:
+        pd.DataFrame: Cleaned, expanded DataFrame ready for export.
+    """
+    df = pd.read_csv(
+        io.StringIO(raw_tsv),
+        sep="\t",
+        dtype=object,
+        engine="python",
+        on_bad_lines="warn",
+        usecols=SOURCE_COLUMNS,
+    )
+    df = _cleanup_table(df)
+    df = _cleanup_headers(df)
+    df["project"] = "VGP"
+    df = _translate_projects(df)
+    df = _build_all_projects(df)
+    df = _expand_sequencing_status(df)
+    return df
+
+
+# ---------------------------------------------------------------------------
+# Prefect tasks and flow
+# ---------------------------------------------------------------------------
+
+
+@task(retries=2, retry_delay_seconds=30, log_prints=True)
+def fetch_vgp_live_sheet(output_path: str, min_records: int = 0) -> int:
+    """Download the VGP Ordinal Phase1+ Google Sheet and write a TSV.
+
+    Args:
+        output_path (str): Path to the output TSV file.
+        min_records (int): Minimum number of rows expected.
+
+    Returns:
+        int: Number of data rows written.
+    """
+    response = safe_get(VGP_SHEET_URL, timeout=120)
+    if response is None:
+        raise RuntimeError("Failed to fetch VGP live sheet: no response received")
+    response.raise_for_status()
+
+    df = _process_vgp_sheet(response.text)
+    row_count = len(df)
+    print(f"VGP live sheet: {row_count} rows after processing")
+
+    if row_count < min_records:
+        raise RuntimeError(f"VGP live sheet has fewer than {min_records} rows: {row_count}")
+
+    df.to_csv(output_path, sep="\t", index=False)
+    print(f"Wrote {output_path}")
+    return row_count
+
+
+@task(log_prints=True)
+def upload_s3_tsv(local_path: str, s3_path: str) -> None:
+    """Upload VGP TSV to S3."""
+    print(f"Uploading {local_path} to {s3_path}")
+    upload_to_s3(local_path, s3_path)
+
+
+@flow()
+def update_vgp_status(output_path: str, s3_path: str = "", min_records: int = 0) -> bool:
+    """Fetch the VGP Ordinal Phase1+ live sheet and optionally upload to S3.
+
+    Args:
+        output_path (str): Path to the output TSV file.
+        s3_path (str): Optional S3 path to upload the result.
+        min_records (int): Minimum record count to accept the output.
+
+    Returns:
+        bool: True on success.
+    """
+    if not is_safe_path(output_path):
+        raise ValueError(f"Unsafe output path: {output_path}")
+
+    resolved_path = os.path.abspath(output_path)
+    os.makedirs(os.path.dirname(resolved_path), exist_ok=True)
+
+    row_count = fetch_vgp_live_sheet(resolved_path, min_records)
+
+    if row_count > min_records and s3_path:
+        upload_s3_tsv(resolved_path, s3_path)
+
+    emit_event(
+        event="update.vgp.status.finished",
+        resource={
+            "prefect.resource.id": f"update.vgp.{resolved_path}",
+            "prefect.resource.type": "vgp.status",
+        },
+        payload={"row_count": row_count},
+    )
+    return True
+
+
+if __name__ == "__main__":
+    args = parse_args(
+        [required(OUTPUT_PATH), S3_PATH, MIN_RECORDS],
+        "Fetch VGP Ordinal Phase1+ status from the live Google Sheet.",
+    )
+    update_vgp_status(**vars(args))
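
The status cascade in `_expand_sequencing_status` can be read as a set of "implies" rules. The sketch below restates those rules in plain Python, without pandas, to show which status columns end up populated for a given starting status. The names `IMPLIES` and `implied_statuses` are hypothetical, and the table is transcribed from the assignments in the function above rather than taken from any GoaT specification.

```python
# Each key implies the listed statuses, following the order of the
# df.loc assignments in _expand_sequencing_status (hypothetical restatement).
IMPLIES = {
    "published": ["insdc_open"],
    "insdc_open": ["open"],
    "open": ["in_progress"],
    "data_generation": ["in_progress"],
    "in_assembly": ["in_progress"],
    "in_progress": ["data_generation", "sample_acquired"],
    "sample_acquired": ["sample_collected"],
}


def implied_statuses(status: str) -> set[str]:
    """Return the status plus every status it implies, directly or transitively."""
    seen: set[str] = set()
    stack = [status] if status else []
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against the in_progress <-> data_generation cycle
        seen.add(current)
        stack.extend(IMPLIES.get(current, []))
    return seen


# STATUS_MAP sends sheet codes "4" and "5" to "open" and code "1" to "sample_collected".
open_columns = implied_statuses("open")
collected_columns = implied_statuses("sample_collected")
print(sorted(open_columns))
print(sorted(collected_columns))
```

Under these rules a row with status code `4` or `5` fills five columns (`open`, `in_progress`, `data_generation`, `sample_acquired`, `sample_collected`), while code `1` fills only `sample_collected`.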
diff --git a/scripts/TESTING_README.md b/scripts/TESTING_README.md
new file mode 100644
index 0000000..f6752e8
--- /dev/null
+++ b/scripts/TESTING_README.md
@@ -0,0 +1,121 @@
+# Parser Testing Wrapper
+
+## Usage
+
+Run all parser tests with comparison against S3 sources:
+
+```bash
+python scripts/run_parse_validate_tests.py
+```
+
+Test a specific parser:
+
+```bash
+python scripts/run_parse_validate_tests.py --parser REFSEQ_ORGANELLES
+python scripts/run_parse_validate_tests.py --parser SRA_DATA
+python scripts/run_parse_validate_tests.py --parser SEQUENCING_STATUS
+```
+
+Verbose output:
+
+```bash
+python scripts/run_parse_validate_tests.py --verbose
+```
+
+## What the Script Does
+
+1. **Looks up parsers** configured in `PARSER_INPUT_MAPPING` (modules under `flows/parsers/parse_*.py`)
+2. **Runs each parser** on example TSV files in `tsv_examples/`
+3. **Compares output** to S3 source versions mirrored in `goat-data_s3_sources/`
+4. **Validates**:
+   - Header names match between parsed output and S3 source
+   - Line counts are similar (difference below 10% of the S3 source row count)
+5. **Generates a summary report** showing pass/fail status
+
+## Test Configuration
+
+Tests are configured in the `PARSER_INPUT_MAPPING` dict in the script:
+
+```python
+PARSER_INPUT_MAPPING = {
+    "PARSER_NAME": [
+        ("tsv_examples/input.tsv", "path/to/config.types.yaml", "s3_comparison_file.tsv"),
+        # (input_example, yaml_config, s3_source_name)
+    ],
+}
+```
+
+- **input_example**: Path to example TSV in `tsv_examples/`
+- **yaml_config**: Path to YAML schema (relative to data repo root)
+- **s3_source_name**: Filename in `goat-data_s3_sources/` for comparison (or `None`)
+
+## Current Test Results
+
+### ✓ Passing (Perfect Match)
+
+- **REFSEQ_ORGANELLES**: 33,252 rows, 13 cols - matches S3 source exactly
+- **SRA_DATA**: 27,606 rows, 8 cols - matches S3 source exactly
+
+### ✓ Partial Pass (Headers OK)
+
+- **SEQUENCING_STATUS (vgp)**: Headers match, but example is only a sample (292 vs 1093 rows) — expected
+- **SEQUENCING_STATUS (AEGIS)**: Parser succeeds, no S3 source to compare
+- **BLOBTOOLKIT**: Headers mostly match (23/25 cols), example input missing 'biosample'/'bioproject' — sample is incomplete
+
+### ❌ Known Issues
+
+- **GENOMESIZE_KARYOTYPE**: Genomehubs `write_tsv` error with composite headers (`header: [genus, species]`). This is a YAML schema complexity issue, not a parser bug.
+
+## Running Parsers Manually
+
+To invoke a parser directly, without Prefect, set `SKIP_PREFECT`:
+
+```bash
+export SKIP_PREFECT=true
+python -m flows.parsers.parse_refseq_organelles \
+  -i path/to/input.tsv \
+  -y path/to/config.types.yaml
+```
+
+## Adding New Tests
+
+1. Add example TSV to `tsv_examples/` or `tsv_examples/<name>/`
+2. Update `PARSER_INPUT_MAPPING` with the new test config
+3. Run the script to verify
+
+Example:
+
+```python
+"MY_PARSER": [
+    ("tsv_examples/my_example.tsv", "../goat-data/sources/my-category/my_config.types.yaml", "my_source.tsv"),
+]
+```
+
+## Troubleshooting
+
+**"No output file generated"**:
+
+- Check that the YAML config specifies a different output filename than the input
+- Verify the input TSV is accessible
+
+**"Headers mismatch"**:
+
+- Check if the example input has all expected columns
+- Verify the YAML configuration includes all fields
+- Example files may be incomplete samples
+
+**"Parser timeout"**:
+
+- Increase the timeout in the `run_parser()` function (currently 60 seconds)
+- Check for issues in the parser logic or input data
+
+## Environment
+
+Requires:
+
+- `genomehubs >= 2.10.14`
+- `boto3`
+- `pyyaml`
+- `requests`
+
+Install with: `pip install -q -r requirements.txt`
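
The header and row-count checks listed under "What the Script Does" can be sketched as below. This is a simplified, in-memory illustration, assuming TSV content is already available as strings; the function names `headers_and_rows` and `compare_tsv` and the sample data are hypothetical and do not exist in the script.

```python
import csv
import io


def headers_and_rows(tsv_text: str) -> tuple[list[str], int]:
    """Return the header names and the number of data rows in a TSV string."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    headers = list(reader.fieldnames or [])
    return headers, sum(1 for _ in reader)


def compare_tsv(parsed_text: str, source_text: str, tolerance_pct: float = 10.0) -> dict:
    """Compare header sets and row counts of a parsed TSV against a reference TSV."""
    parsed_headers, parsed_rows = headers_and_rows(parsed_text)
    source_headers, source_rows = headers_and_rows(source_text)
    # Percentage difference is measured relative to the reference row count.
    pct_diff = 100.0 * abs(parsed_rows - source_rows) / max(source_rows, 1)
    return {
        "headers_match": set(parsed_headers) == set(source_headers),
        "missing": sorted(set(source_headers) - set(parsed_headers)),
        "extra": sorted(set(parsed_headers) - set(source_headers)),
        "pct_diff": pct_diff,
        "rows_similar": pct_diff < tolerance_pct,
    }


# Hypothetical example: the parsed output lacks one column and one row.
parsed = "taxon_id\tname\n1\ta\n2\tb\n"
source = "taxon_id\tname\tbiosample\n1\ta\n2\tb\n3\tc\n"
report = compare_tsv(parsed, source)
print(report)
```

Because headers are compared as sets, column order is ignored; a check that also cares about order would compare the lists directly.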
diff --git a/scripts/run_parse_validate_tests.py b/scripts/run_parse_validate_tests.py
new file mode 100755
index 0000000..052355e
--- /dev/null
+++ b/scripts/run_parse_validate_tests.py
@@ -0,0 +1,330 @@
+#!/usr/bin/env python3
+"""Test wrapper: run parsers on example TSVs and compare against S3 sources.
+
+Usage:
+    python scripts/run_parse_validate_tests.py [--parser PARSER] [--verbose]
+    python scripts/run_parse_validate_tests.py --all
+
+This script:
+1. Reads parser/example pairs from PARSER_INPUT_MAPPING
+2. Maps examples to YAML configs (from goat-data/sources)
+3. Runs each parser and validates output
+4. Compares output columns & line counts vs S3 source versions
+5. Generates a test report
+
+Example files should be organized as:
+  - tsv_examples/*.tsv or *.tsv.gz
+  - tsv_examples/<name>/*.tsv or *.tsv.gz
+
+S3 source files are mirrored in:
+  - goat-data_s3_sources/{assembly-data,status-lists,sra,btk,conservation,genomesize-karyotype,uk-legislation}/imported/*.tsv
+"""
+
+import argparse
+import gzip
+import json
+import os
+import subprocess
+import sys
+import tempfile
+from csv import DictReader
+from pathlib import Path
+from typing import Optional, Tuple, List, Dict
+
+# Set SKIP_PREFECT before any imports from flows
+os.environ["SKIP_PREFECT"] = "true"
+
+# Use absolute import paths
+sys.path.insert(0, str(Path(__file__).parent.parent))
+
+
+PARSER_INPUT_MAPPING = {
+    # parser_name -> [(example_input, yaml_config, comparison_file), ...]
+    "SEQUENCING_STATUS": [
+        ("tsv_examples/ATLASEA_expanded.tsv", "../goat-data/sources/assembly-data/FILE_ATLASEA.types.yaml", "ATLASEA_expanded.tsv"),
+        ("tsv_examples/AEGIS_expanded.tsv", "../goat-data/sources/status-lists/FILE_AEGIS.types.yaml", None),  # no S3 source
+    ],
+    "REFSEQ_ORGANELLES": [
+        ("tsv_examples/organelle_test.tsv", "../goat-data/sources/assembly-data/refseq_organelles.types.yaml", "refseq_organelles.tsv"),
+    ],
+    "BLOBTOOLKIT": [
+        ("tsv_examples/blobtoolkit_test_results.tsv/btk.tsv.gz", "../goat-data/sources/btk/btk.types.yaml", "btk.tsv"),
+    ],
+    "SRA_DATA": [
+        ("tsv_examples/sra.tsv", "../goat-data/sources/sra/sra.types.yaml", "sra.tsv"),
+    ],
+    "GENOMESIZE_KARYOTYPE": [
+        ("tsv_examples/gsheets_test/DTOL_Plant_Genome_Size_Estimates.tsv", "../goat-data/sources/genomesize-karyotype/FILE_DTOL_Plant_Genome_Size_Estimates.types.yaml", None),
+    ],
+}
+
+S3_SOURCES_ROOT = Path(__file__).parent.parent.parent / "goat-data_s3_sources"
+
+
+def load_tsv_headers_and_count(path: str) -> Tuple[List[str], int]:
+    """Load TSV headers and line count (excluding header)."""
+    if path.endswith(".gz"):
+        fh = gzip.open(path, "rt", encoding="utf-8", newline="")
+    else:
+        fh = open(path, "rt", encoding="utf-8", newline="")
+
+    try:
+        reader = DictReader(fh, delimiter="\t")
+        headers = reader.fieldnames or []
+        count = sum(1 for _ in reader)
+        return list(headers), count
+    finally:
+        fh.close()
+
+
+def find_s3_source(expected_name: str) -> Optional[str]:
+    """Locate the S3 source file for a parser output."""
+    # Search in all subdirectories
+    for root, dirs, files in os.walk(S3_SOURCES_ROOT):
+        for f in files:
+            if f == expected_name or f == f"{expected_name}.gz":
+                return os.path.join(root, f)
+    return None
+
+
+def run_parser(
+    parser_name: str,
+    input_path: str,
+    yaml_path: str,
+    work_dir: str,
+) -> Tuple[bool, Optional[str], Optional[str]]:
+    """Run a parser and return (success, output_file, error_msg)."""
+    try:
+        import shutil
+
+        # Copy input to work_dir with a renamed prefix to avoid conflicting with output names
+        input_abs = Path(input_path).resolve()
+        work_path = Path(work_dir)
+        # Rename to avoid output file conflicts (e.g., input btk.tsv.gz vs output btk.tsv.gz)
+        work_input = work_path / f"_input_{input_abs.name}"
+
+        if not work_input.exists():
+            shutil.copy2(str(input_abs), str(work_input))
+
+        # Construct the Python module path from parser name
+        module_name = "flows.parsers.parse_" + parser_name.lower()
+
+        cmd = [
+            sys.executable,
+            "-m",
+            module_name,
+            "-i",
+            str(work_input),
+            "-y",
+            yaml_path,
+        ]
+
+        env = os.environ.copy()
+        env["SKIP_PREFECT"] = "true"
+
+        result = subprocess.run(
+            cmd,
+            cwd=Path(__file__).parent.parent,
+            capture_output=True,
+            text=True,
+            timeout=60,
+            env=env,
+        )
+
+        if result.returncode != 0:
+            return False, None, result.stderr.strip() or f"exit code {result.returncode}"
+
+        # Find the output file in work_dir (excluding the input file and any _input_* files)
+        output_files = list(work_path.glob("*.tsv")) + list(work_path.glob("*.tsv.gz"))
+        output_files = [f for f in output_files if not f.name.startswith("_input_")]
+
+        if not output_files:
+            return False, None, "No output file generated"
+
+        output_file = str(output_files[0])
+        return True, output_file, None
+
+    except subprocess.TimeoutExpired:
+        return False, None, "Parser timeout"
+    except Exception as e:
+        return False, None, str(e)
+
+
+def compare_outputs(
+    parsed_output: str,
+    s3_source: str,
+) -> Dict[str, object]:
+    """Compare parsed output against S3 source."""
+    try:
+        parsed_headers, parsed_count = load_tsv_headers_and_count(parsed_output)
+        s3_headers, s3_count = load_tsv_headers_and_count(s3_source)
+
+        headers_match = set(parsed_headers) == set(s3_headers)
+        headers_extra = set(parsed_headers) - set(s3_headers)
+        headers_missing = set(s3_headers) - set(parsed_headers)
+
+        count_diff = abs(parsed_count - s3_count)
+        count_pct_diff = 100.0 * count_diff / max(s3_count, 1)
+
+        return {
+            "headers_match": headers_match,
+            "headers_extra": list(headers_extra),
+            "headers_missing": list(headers_missing),
+            "parsed_count": parsed_count,
+            "s3_count": s3_count,
+            "count_diff": count_diff,
+            "count_pct_diff": count_pct_diff,
+            "line_counts_similar": count_pct_diff < 10,  # Allow 10% variance
+        }
+    except Exception as e:
+        return {"error": str(e)}
+
+
+def run_tests(parser_name: Optional[str] = None, verbose: bool = False) -> int:
+    """Run tests for specified parser(s) and compare outputs."""
+    # Determine which parsers to test
+    if parser_name:
+        parsers_to_test = [parser_name.upper()]
+        if parser_name.upper() not in PARSER_INPUT_MAPPING:
+            print(f"Error: Parser {parser_name} not configured in PARSER_INPUT_MAPPING")
+            return 1
+    else:
+        parsers_to_test = list(PARSER_INPUT_MAPPING.keys())
+
+    results = {}
+
+    for pname in parsers_to_test:
+        print(f"\n{'='*70}")
+        print(f"Testing parser: {pname}")
+        print(f"{'='*70}")
+
+        if pname not in PARSER_INPUT_MAPPING:
+            print("  ⚠️  No test configuration found")
+            continue
+
+        test_configs = PARSER_INPUT_MAPPING[pname]
+        parser_results = []
+
+        for input_path, yaml_path, comparison_file in test_configs:
+            input_abs = Path(__file__).parent.parent / input_path
+
+            if not input_abs.exists():
+                print(f"  ⚠️  Input not found: {input_path}")
+                parser_results.append({"status": "skipped", "reason": "input_not_found"})
+                continue
+
+            print(f"\n  Input: {input_path}")
+            print(f"  YAML:  {yaml_path}")
+
+            # Run parser in temp directory
+            with tempfile.TemporaryDirectory() as tmpdir:
+                yaml_abs = Path(__file__).parent.parent / yaml_path
+
+                success, output_file, error = run_parser(
+                    pname,
+                    str(input_abs),
+                    str(yaml_abs),
+                    tmpdir,
+                )
+
+                if not success:
+                    print(f"  ❌ Parser failed: {error}")
+                    parser_results.append({"status": "failed", "error": error})
+                    continue
+
+                print("  ✓ Parser succeeded")
+                output_headers, output_count = load_tsv_headers_and_count(output_file)
+                print(f"    Output: {os.path.basename(output_file)} ({output_count} rows, {len(output_headers)} cols)")
+
+                # Compare against S3 source if available
+                if comparison_file:
+                    s3_source = find_s3_source(comparison_file)
+
+                    if s3_source:
+                        print(f"  Comparing against S3 source: {comparison_file}")
+                        comparison = compare_outputs(output_file, s3_source)
+
+                        if "error" in comparison:
+                            print(f"    ⚠️  Comparison failed: {comparison['error']}")
+                        else:
+                            s3_headers, s3_count = load_tsv_headers_and_count(s3_source)
+                            print(f"    S3 source: {s3_count} rows, {len(s3_headers)} cols")
+
+                            if comparison["headers_match"]:
+                                print("    ✓ Headers match")
+                            else:
+                                print("    ❌ Headers mismatch:")
+                                if comparison["headers_extra"]:
+                                    print(f"       Extra: {comparison['headers_extra']}")
+                                if comparison["headers_missing"]:
+                                    print(f"       Missing: {comparison['headers_missing']}")
+
+                            if comparison["line_counts_similar"]:
+                                print(f"    ✓ Line counts similar (~{comparison['count_pct_diff']:.1f}% diff)")
+                            else:
+                                print("    ⚠️  Line counts differ substantially:")
+                                print(f"       Parsed: {comparison['parsed_count']}, S3: {comparison['s3_count']} ({comparison['count_pct_diff']:.1f}% diff)")
+
+                        parser_results.append({
+                            "status": "success",
+                            "comparison": comparison,
+                        })
+                    else:
+                        print(f"    ℹ️  No S3 source found for {comparison_file}")
+                        parser_results.append({
+                            "status": "success",
+                            "comparison": None,
+                        })
+                else:
+                    parser_results.append({
+                        "status": "success",
+                        "comparison": None,
+                    })
+
+        results[pname] = parser_results
+
+    # Summary
+    print(f"\n{'='*70}")
+    print("Summary")
+    print(f"{'='*70}")
+
+    all_passed = True
+    for pname, presults in results.items():
+        passed = sum(1 for r in presults if r.get("status") == "success" and (r.get("comparison") is None or (r["comparison"].get("headers_match") and r["comparison"].get("line_counts_similar"))))
+        total = len(presults)
+        status = "✓" if passed == total else "❌"
+        print(f"{status} {pname}: {passed}/{total} passed")
+        if passed < total:
+            all_passed = False
+
+    return 0 if all_passed else 1
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description=__doc__,
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+    )
+    parser.add_argument(
+        "--parser",
+        help="Test a specific parser by name (e.g., SEQUENCING_STATUS)",
+        default=None,
+    )
+    parser.add_argument(
+        "--all",
+        action="store_true",
+        help="Run all configured tests",
+    )
+    parser.add_argument(
+        "--verbose",
+        action="store_true",
+        help="Print verbose output",
+    )
+
+    args = parser.parse_args()
+
+    sys.exit(run_tests(parser_name=args.parser, verbose=args.verbose))
+
+
+if __name__ == "__main__":
+    main()