- Build Name: SARS-CoV-2 Washington focused build
- Pathogen/Strain: SARS-CoV-2
- Scope: Whole Genome Sequences of SARS-CoV-2 in Washington state from the past year
- Purpose: This repository contains the Nextstrain build for the genomic surveillance of SARS-CoV-2 in Washington State for the past year.
- Nextstrain Build Location: Washington-focused SARS-CoV-2 genomic analysis: Past year
- Pathogen Epidemiology
- Scientific Decisions
- Getting Started
- Run the Build
- Customizing for Local Adaptation
- Contributing
- License
- Acknowledgements
- SARS-CoV-2 (SC2) is a single-stranded RNA virus that was first detected in Wuhan, China in December 2019.
- Infection with the SARS-CoV-2 virus can cause COVID-19, a respiratory illness. SC2 is a coronavirus that encodes a structural spike glycoprotein. This spike protein is the primary target of natural and vaccine immunity as well as the target for most monoclonal antibody therapies (O'Toole et al 2022) (Zhou et al 2020).
- The virus spreads through respiratory droplets released when an infected person coughs, sneezes, speaks, etc. (WHO). Most infected people will have mild to moderate respiratory illness, but in some cases the illness can be more severe and require medical attention.
- The Pango nomenclature system is a widely used tool for SARS-CoV-2 lineage classification.
- SC2 circulates endemically in the human population, with seasonality similar to that of other respiratory pathogens, peaking in late fall through spring (Wiekman et al 2023).
- Surveillance of SC2 provides insight into how the virus is evolving and spreading within Washington, and supports outbreak detection and response to better guide the public health response.
- SC2 genomic data allows for monitoring of lineage patterns, supports outbreak investigations, allows for monitoring of vaccine escape or antiviral resistance, and supports further understanding of transmission pathways.
- Subsampling:
- 1 year Washington focus sampling: Subsampling includes all Washington sequences (no maximum number of sequences) from the past year
- Contextual proximity sampling: Subsampling includes 1000 sequences sampled from 2020 through the present. This sampling helps to accurately reconstruct the number of introductions. Proximity sampling selects sequences as close as possible to the focal samples (currently set to Washington). The genetic proximity between sequences in the focal set and other sequences is calculated in the priorities.py script.
- Crowding penalty: The crowding penalty in proximity subsampling controls how strongly the subsampling penalizes sequences that are genetically similar to each other. This build sets the crowding penalty to 0. The default setting is 0.25. A crowding penalty value closer to 1 creates a bushier tree and discourages sequence redundancy. A crowding penalty closer to 0 allows more clustering. A crowding penalty of 0 disables crowding.
- Contextual random sampling: Subsampling includes 500 sequences sampled by month and year, which allows for accurate clade timing in the tree.
- Reference selection: MN908947 is used as the reference because it is the complete genome of the SARS-CoV-2 Wuhan strain collected in December 2019.
- Clade labeling: Internal clade labels are included in the tree through the main_workflow.smk workflow.
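The effect of the crowding penalty described above can be illustrated with a small sketch. This is not the code in priorities.py; it is a simplified, hypothetical model that assumes each contextual candidate has already been matched to its nearest focal sequence, and that every additional candidate attached to the same focal sequence is penalized a little more. The function name `rank_context_sequences` and the example sequence names are made up for illustration.

```python
from collections import defaultdict


def rank_context_sequences(closest_focal, distance, crowding_penalty=0.0):
    """Return candidate names sorted from highest to lowest priority.

    closest_focal: dict mapping candidate name -> nearest focal sequence name
    distance: dict mapping candidate name -> genetic distance to that focal sequence
    crowding_penalty: extra cost per additional candidate near the same focal sequence
    (Hypothetical helper for illustration only; not part of the ncov workflow.)
    """
    # Group candidates by the focal sequence they are nearest to.
    by_focal = defaultdict(list)
    for name, focal in closest_focal.items():
        by_focal[focal].append(name)

    priority = {}
    for names in by_focal.values():
        # Closest candidate first; later neighbors of the same focal
        # sequence receive a progressively larger penalty.
        ordered = sorted(names, key=lambda n: (distance[n], n))
        for rank, name in enumerate(ordered):
            priority[name] = -distance[name] - crowding_penalty * rank

    # Highest priority first; ties are broken by name for determinism.
    return sorted(priority, key=lambda n: (-priority[n], n))


closest = {"ctx_a": "WA_1", "ctx_b": "WA_1", "ctx_c": "WA_2"}
dist = {"ctx_a": 1.0, "ctx_b": 2.0, "ctx_c": 2.5}
no_penalty = rank_context_sequences(closest, dist, crowding_penalty=0.0)
with_penalty = rank_context_sequences(closest, dist, crowding_penalty=1.0)
```

With a penalty of 0, candidates are ranked purely by distance, so near-identical neighbors of the same focal sequence can all be selected. With a positive penalty, a second neighbor of WA_1 drops below a slightly more distant neighbor of WA_2.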
This build utilizes the Nextstrain.org remote datasets to produce a Washington-focused SC2 Nextstrain build that can be used for genomic surveillance purposes.
Some high-level build features and capabilities are:
- 1 year Washington focus sampling: All Washington sequences from the last year are included in this build.
- Tiered subsampling: Additional sequences from the rest of the USA & the world are selected by genetic similarity to the state-level sequences. Additionally, earlier sequences from Washington and globally are provided for temporal context.
This build uses NCBI data and the SARS-CoV-2 Global Remote Dataset available on Nextstrain.org. The Remote Dataset is sourced from GenBank and is cleaned and maintained by the Nextstrain team. This build pulls in Washington State sequences and metadata from GenBank, and pulls in contextual data from the Nextstrain Global Remote Dataset. Together these are the inputs to the ncov Nextstrain pipeline.
To include more contextualization, one could use the Full SARS-CoV-2 Remote Dataset for the contextual sequences. However, doing so may require AWS Batch to subsample from the dataset.
- Sequence Data: GenBank SARS-CoV-2 data from NCBI Datasets and the Nextstrain.org SC2 Remote Dataset sourced from GenBank
- Metadata: GenBank SARS-CoV-2 data from NCBI Datasets, the Nextstrain.org SC2 Remote Dataset sourced from GenBank, and WA DOH county-level data
- Expected Inputs:
  - ncov_wa/data/county_metadata.csv (contains the most recent line list of GenBank accession numbers and Washington State county designations)
  - Other sequence data and metadata will be automatically downloaded and ingested as part of the pipeline
Ensure that you have Nextstrain installed.
To check that Nextstrain is installed:
nextstrain check-setup
If Nextstrain is not installed, follow the Nextstrain installation guidelines.
Clone this repository by running:
git clone https://github.com/NW-PaGe/ncov.git
When running the build, county_metadata.csv should be updated to capture the most up-to-date county data. This metadata file is generated by WA DOH and contains two columns: SEQUENCE_GENBANK_STRAIN, containing GenBank accession IDs that match the sequence FASTA headers, and COUNTY_NAME, listing the associated county for each sequence.
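Before running the build, it can help to sanity-check the county metadata file. The sketch below is one possible check, assuming only the two column names described above; the helper name `check_county_metadata`, the specific checks (empty values, duplicate accessions), and the accession values in the example are illustrative assumptions, not part of the pipeline.

```python
import csv
import io

REQUIRED_COLUMNS = ("SEQUENCE_GENBANK_STRAIN", "COUNTY_NAME")


def check_county_metadata(handle):
    """Return a list of problems found in a county metadata CSV (empty if none).

    Hypothetical helper for illustration; the build itself does not call it.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return [f"missing column: {col}" for col in missing]

    problems = []
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        accession = (row["SEQUENCE_GENBANK_STRAIN"] or "").strip()
        county = (row["COUNTY_NAME"] or "").strip()
        if not accession:
            problems.append(f"line {line_no}: empty accession")
        elif accession in seen:
            problems.append(f"line {line_no}: duplicate accession {accession}")
        else:
            seen.add(accession)
        if not county:
            problems.append(f"line {line_no}: empty county")
    return problems


# Placeholder rows; real files come from WA DOH.
example = io.StringIO(
    "SEQUENCE_GENBANK_STRAIN,COUNTY_NAME\n"
    "XX000001,King\n"
    "XX000001,\n"
)
example_problems = check_county_metadata(example)
```

To check the real file, open it with `open("ncov_wa/data/county_metadata.csv", newline="")` and pass the handle to the function.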
To run the build, make sure you are in the top-level ncov directory of the cloned repository. The command below specifies how many CPUs to use and which config file to use. In this case, --configfile points to the config file under ncov_wa/config/ that holds our Washington-specific parameters.
nextstrain build --cpus=6 . --configfile ncov_wa/config/builds_ncbi.yaml
When you run the build using nextstrain build ., Nextstrain uses Snakemake as the workflow manager to automate the genomic analyses. The Snakefile in a Nextstrain build defines how the raw input data (sequences and metadata) are processed step by step. Nextstrain builds are powered by Augur (for phylogenetics) and Auspice (for visualization); Snakemake determines which of these steps to run, and in what order, based on file dependencies.
The file structure of the repository is as follows with * denoting folders that are the build's expected outputs.
.
├── README.md
├── Snakefile
├── auspice*
├── clade-labeling
├── config
├── new_data
├── results*
└── scripts
More details on the file structure of this build can be found here
After successfully running the build there will be two output folders containing the build results.
- auspice/ folder contains: .json files
- results/ folder contains:
- Dropping the .json files into auspice.us
nextstrain view auspice/*.json
Additional resources for tree interpretation:
- The jurisdiction-focused sampling time frame of the build can be changed. It is currently set up to focus on the last year of Washington sequences, but this time frame can be made shorter or longer by adjusting add_to_builds.smk and the build.yaml subsampling scheme.
- To adapt the build to a new jurisdiction, the current filters for Washington should be changed to filter for the jurisdiction of interest. These filtering steps are in the filter_wa_metadata.sh bash script, which pattern matches the metadata and is called within the filter_wa.smk workflow. Note: when working with bash scripts, be careful about editing the files in a Windows application, and be sure the files are saved with only the line feed character (LF) instead of the carriage return plus line feed (CRLF).
- county_metadata.csv should be updated to capture the most up-to-date county data. This metadata file contains two columns: SEQUENCE_GENBANK_STRAIN, containing accession IDs that match the sequence FASTA headers, and COUNTY_NAME, listing the associated county for each sequence.
- The colors.tsv file can be adapted to change the colors visualized in the Auspice Color-By. The tsv file should include the divisions of interest that are to appear in the Color-By.
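The note above about LF versus CRLF line endings in bash scripts can be checked automatically. The following is a minimal sketch, assuming the scripts of interest end in .sh and live somewhere under a directory you pass in; the helper names `has_crlf`, `to_lf`, and `find_crlf_scripts` are hypothetical and not part of this repository.

```python
from pathlib import Path


def has_crlf(data: bytes) -> bool:
    """Return True if the bytes contain any Windows-style (CRLF) line ending."""
    return b"\r\n" in data


def to_lf(data: bytes) -> bytes:
    """Convert CRLF line endings to LF, leaving other bytes untouched."""
    return data.replace(b"\r\n", b"\n")


def find_crlf_scripts(root):
    """Return the .sh files under root (searched recursively) that contain CRLF."""
    return sorted(
        path for path in Path(root).rglob("*.sh") if has_crlf(path.read_bytes())
    )
```

For example, `find_crlf_scripts(".")` run from the repository root lists the affected scripts, and each one can be repaired with `path.write_bytes(to_lf(path.read_bytes()))`.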
For any questions, please submit them to our Discussions page. Software issues and requests can be logged as a Git Issue.
This project is licensed under a modified GPL-3.0 License. You may use, modify, and distribute this work, but commercial use is strictly prohibited without prior written permission.
These data are generously shared by labs around the world and deposited in NCBI GenBank by the authors. Please contact these labs first if you plan to publish using these data. We gratefully acknowledge the authors and the originating and submitting laboratories of the genetic sequences and metadata for sharing their work. Please note that although data generators have generously shared data in an open fashion, that does not mean there is free license to publish on these data. Data generators should be cited where possible, and collaborations should be sought in some circumstances.
We also gratefully acknowledge the work done by the Bedford lab and Nextstrain team who were the original authors of this build.