Refactor NCBI FTP Promote into CLI Tool#201
Conversation
There was a problem hiding this comment.
Pull request overview
This PR replaces the Phase 3 “NCBI promote” Jupyter notebook with a first-class ncbi_ftp_promote CLI pipeline, aligning promote execution with the existing CTS/CLI pipeline structure and updating tests/docs accordingly.
Changes:
- Added a new
ncbi_ftp_promotepipeline module withPromoteSettings+run_promote()orchestration and a console script entrypoint. - Refactored settings usage to follow the shared
CtsSettingspattern (including updating download settings/tests to providedlt_config). - Added a workaround in assembly summary parsing to handle malformed FTP-path columns (plus broad test/doc updates and notebook removal).
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/pipelines/test_ncbi_ftp_promote.py | New unit tests covering promote pipeline settings, CLI aliases, and orchestration behavior. |
| tests/pipelines/test_ncbi_ftp_download.py | Updates settings construction to supply dlt_config after DownloadSettings base-class change. |
| tests/ncbi_ftp/test_promote.py | Updates promote unit tests for required lakehouse_key_prefix parameter and related calls. |
| tests/ncbi_ftp/test_notebooks.py | Removes promote-notebook coverage after notebook deletion. |
| tests/integration/test_promote_e2e.py | Updates integration tests to pass explicit lakehouse_key_prefix and removes reliance on deleted constant. |
| tests/integration/test_manifest_e2e.py | Updates integration test constants after promote constant removal. |
| tests/integration/test_full_pipeline.py | Updates full pipeline integration to pass explicit promote key prefix. |
| src/cdm_data_loaders/pipelines/ncbi_ftp_promote.py | New promote CLI pipeline implementation (settings + runner). |
| src/cdm_data_loaders/pipelines/ncbi_ftp_download.py | Switches DownloadSettings to extend CtsSettings. |
| src/cdm_data_loaders/ncbi_ftp/promote.py | Makes lakehouse_key_prefix required and threads it through archiving calls. |
| src/cdm_data_loaders/ncbi_ftp/manifest.py | Adds fallback logic to recover FTP URL when the expected column is malformed. |
| pyproject.toml | Registers ncbi_ftp_promote as a console script. |
| notebooks/ncbi_ftp_promote.ipynb | Removes the promote notebook in favor of the CLI tool. |
| notebooks/ncbi_ftp_manifest.ipynb | Updates local-testing parameters and S3 credential initialization defaults. |
| docs/ncbi_ftp_e2e_walkthrough.md | Updates the walkthrough to use the promote CLI and adjusts local e2e steps. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if "https://ftp" not in ftp_url: | ||
| for col in row: | ||
| if "https://ftp" in col: | ||
| ftp_url = col | ||
| break | ||
| if "https://ftp" not in ftp_url: | ||
| msg = f"Missing ftp path for record {accession}." | ||
| logger.warning(msg) | ||
| continue |
There was a problem hiding this comment.
This is just a temporary fix until the CSV/TSV parser is included. I don't want to match anything that starts with https:// in case other URLs are included for things other than the FTP folder location.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #201 +/- ##
===========================================
+ Coverage 68.38% 68.57% +0.19%
===========================================
Files 72 73 +1
Lines 4432 4462 +30
===========================================
+ Hits 3031 3060 +29
- Misses 1401 1402 +1
Continue to review full report in Codecov by Harness.
🚀 New features to boost your workflow:
|
| ## 4. Phase 3 — Promote & archive (CLI) | ||
|
|
||
| Open `notebooks/ncbi_ftp_promote.ipynb`. | ||
| Phase 3 uses the `ncbi_ftp_promote` CLI to promote staged assemblies from |
| ) | ||
| from tests.ncbi_ftp.conftest import ACC_PATH_215, ACC_PATH_845, TEST_BUCKET | ||
|
|
||
| DEFAULT_LAKEHOUSE_KEY_PREFIX: PurePosixPath = PurePosixPath("tenant-general-warehouse/kbase/datasets/ncbi") |
There was a problem hiding this comment.
why define it here too?
There was a problem hiding this comment.
This was being imported and used in the tests. It really could be any path here, I just left it the same as the default used in the promote cli tool, but thought it made more sense to define in the test module instead of importing from the pipeline module.
ialarmedalien
left a comment
There was a problem hiding this comment.
Minor comments only. Looks great otherwise!
Description of PR purpose/changes
This PR replaces the NCBI Promote notebook with an equivalent command-line tool.
Minor Updates
DownloadSettingsconfig class to extendCtsSettingsto follow the pattern used elsewhere in the codeTesting Instructions
Regular unit and integration tests. Also can follow the NCBI FTP walk-through document to test locally using a containerized CEPH S3 store.
Dev Checklist:
formatto format my codecheckand fixed any errors that it uncoveredUpdating Version and Release Notes (if applicable)
N/A