downloads.py re-writes #31

Open
dingyifei wants to merge 3 commits into SBRG:main from dingyifei:downloads-api-modernization

@dingyifei

  • Remove Selenium; use the NCBI Datasets API for scaffold N50
  • Add a method to query BV-BRC by taxon ID (for 1a)
  • Replace BV-BRC FTP downloads with HTTPS downloads

dingyifei and others added 3 commits February 17, 2026 11:28
get_scaffold_n50_for_species() used Selenium + Chrome to scrape NCBI
web pages, which fails in headless environments (WSL, CI). Replace with
a direct call to the NCBI Datasets v2 REST API endpoint:
  https://api.ncbi.nlm.nih.gov/datasets/v2/genome/taxon/{id}/dataset_report

Remove now-unused selenium, webdriver-manager, and beautifulsoup4
dependencies. Update test fixture to use exact API value (4641652)
instead of rounded Selenium scrape (4600000).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
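
For reviewers who want to see the shape of the Selenium replacement, here is a minimal sketch of calling the endpoint named above with only the standard library. It is not the code from this PR: the helper names are hypothetical, the `reports[].assembly_stats.scaffold_n50` field path is an assumption about the response layout, and pagination of large result sets is not handled.

```python
import json
import urllib.request

# Endpoint taken from the commit message; {taxon_id} is filled in per call.
NCBI_REPORT_URL = (
    "https://api.ncbi.nlm.nih.gov/datasets/v2/genome/taxon/{taxon_id}/dataset_report"
)


def extract_scaffold_n50s(payload):
    """Collect scaffold N50 values from a parsed dataset_report payload.

    Assumes each report carries assembly_stats.scaffold_n50; reports
    without that field are skipped rather than treated as errors.
    """
    values = []
    for report in payload.get("reports", []):
        n50 = report.get("assembly_stats", {}).get("scaffold_n50")
        if n50 is not None:
            values.append(int(n50))
    return values


def fetch_scaffold_n50s(taxon_id, timeout=30):
    """Fetch the first page of the dataset report and return its N50 values."""
    url = NCBI_REPORT_URL.format(taxon_id=taxon_id)
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_scaffold_n50s(json.load(response))


# Offline example using the exact value mentioned for the updated test fixture.
sample_payload = {
    "reports": [
        {"accession": "example-accession", "assembly_stats": {"scaffold_n50": 4641652}},
        {"accession": "no-stats-example"},
    ]
}
print(extract_scaffold_n50s(sample_payload))
```

Keeping the parsing separate from the network call lets the test fixture exercise the extraction logic without a browser or a live connection.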
Add a function to query the BV-BRC Data API for genome records by taxon
ID. Uses taxon_lineage_ids (not taxon_id) to include subspecies and
strain-level descendants. Supports optional filtering by genome_status
and genome_quality, with automatic pagination.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
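
The taxon query described above could look roughly like the sketch below. It is illustrative rather than the function added in this PR: the function names, the `https://www.bv-brc.org/api/genome/` base URL, the RQL-style `eq(...)` and `limit(count,offset)` syntax, and the default page size are all assumptions. Pagination stops when a page comes back shorter than the requested size.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL for the BV-BRC genome collection.
BVBRC_GENOME_URL = "https://www.bv-brc.org/api/genome/"


def build_genome_query(taxon_id, genome_status=None, genome_quality=None,
                       limit=1000, offset=0):
    """Build an RQL-style query string filtering on taxon_lineage_ids."""
    clauses = [f"eq(taxon_lineage_ids,{taxon_id})"]
    if genome_status is not None:
        clauses.append(f"eq(genome_status,{urllib.parse.quote(genome_status)})")
    if genome_quality is not None:
        clauses.append(f"eq(genome_quality,{urllib.parse.quote(genome_quality)})")
    clauses.append(f"limit({limit},{offset})")
    return "&".join(clauses)


def fetch_page(query, timeout=60):
    """Request one page of genome records as JSON."""
    request = urllib.request.Request(
        f"{BVBRC_GENOME_URL}?{query}",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def query_genomes_by_taxon(taxon_id, genome_status=None, genome_quality=None,
                           page_size=1000, fetch=fetch_page):
    """Collect all matching records, advancing the offset one page at a time."""
    records = []
    offset = 0
    while True:
        query = build_genome_query(
            taxon_id, genome_status, genome_quality, page_size, offset
        )
        page = fetch(query)
        records.extend(page)
        if len(page) < page_size:  # a short page means no more results
            return records
        offset += page_size
```

Filtering on `taxon_lineage_ids` rather than `taxon_id` is what pulls in subspecies and strain-level records, as the commit message notes. The `fetch` parameter exists only so the pagination loop can be exercised without network access.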
BV-BRC's FTP server now requires SSL/TLS on the control channel,
causing all genome downloads via urllib FTP to fail silently. Switch
download_genomes_bvbrc() to use HTTPS Data API endpoints with proper
content-type negotiation. Also fix stale loop variable bug in the
bad_genomes cleanup code (was using `genome` instead of `bad_genome`).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
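
As a rough illustration of the two changes in this commit, the sketch below builds an HTTPS request with an explicit `Accept` header and shows a cleanup loop that uses its own loop variable. None of it is copied from `download_genomes_bvbrc()`: the helper names, the `genome_sequence` endpoint, the `application/dna+fasta` media type, the `limit(25000)` clause, and the `.fna` file naming are assumptions made for the example.

```python
import urllib.request
from pathlib import Path

# Assumed base URL for the BV-BRC Data API.
BVBRC_API = "https://www.bv-brc.org/api"


def build_fasta_request(genome_id):
    """Build an HTTPS request asking for a genome's sequences as FASTA."""
    url = f"{BVBRC_API}/genome_sequence/?eq(genome_id,{genome_id})&limit(25000)"
    # The Accept header selects the response format (content negotiation).
    return urllib.request.Request(url, headers={"Accept": "application/dna+fasta"})


def download_fasta(genome_id, dest_dir, timeout=120):
    """Download one genome and fail loudly if the body is not FASTA."""
    dest = Path(dest_dir) / f"{genome_id}.fna"
    with urllib.request.urlopen(build_fasta_request(genome_id), timeout=timeout) as resp:
        data = resp.read()
    if not data.startswith(b">"):
        raise ValueError(f"No FASTA records returned for genome {genome_id}")
    dest.write_bytes(data)
    return dest


def remove_bad_genomes(dest_dir, bad_genomes):
    """Delete every file belonging to a bad genome and return the removed names.

    The loop body refers to bad_genome, the variable of this loop, and not
    to a leftover genome variable from an earlier loop.
    """
    removed = []
    for bad_genome in bad_genomes:
        for path in Path(dest_dir).glob(f"{bad_genome}.*"):
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Raising on an empty or non-FASTA body is one way to avoid the silent failures seen with the FTP path; the actual error handling in the PR may differ.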
