NW-PaGe/ncov

 
 

SARS-CoV-2 Washington focused build

Build Overview

  • Build Name: SARS-CoV-2 Washington focused build
  • Pathogen/Strain: SARS-CoV-2
  • Scope: Whole Genome Sequences of SARS-CoV-2 in Washington state from the past year
  • Purpose: This repository contains the Nextstrain build for genomic surveillance of SARS-CoV-2 in Washington State over the past year.
  • Nextstrain Build Location: Washington-focused SARS-CoV-2 genomic analysis: Past year

Pathogen Epidemiology

Overview:

  • SARS-CoV-2 (SC2) is a single-stranded RNA virus that was first detected in Wuhan, China in December 2019.
  • Infection with the SARS-CoV-2 virus can cause COVID-19, a respiratory illness. SC2 is a coronavirus that encodes a structural spike glycoprotein. This spike protein is the primary target of natural and vaccine-induced immunity as well as the target for most monoclonal antibody therapies (O'Toole et al 2022) (Zhou et al 2020).
  • The virus spreads through respiratory droplets released when an infected person coughs, sneezes, speaks, etc. (WHO). Most infected people will have mild to moderate respiratory illness, but in some cases the illness can be more severe and require medical attention.

Taxonomic designations

Geographic distribution and seasonality

  • SC2 circulates endemically in the human population, with seasonality similar to that of other respiratory pathogens, peaking in late fall through spring (Wiekman et al 2023)

Public Health Importance

  • Surveillance of SC2 provides insight into how the virus is evolving and spreading within Washington, and supports outbreak detection to better guide the public health response.

Genomic Relevance

  • SC2 genomic data allows for monitoring of lineage patterns, vaccine escape, and antiviral resistance. It also supports outbreak investigations and furthers understanding of transmission pathways.

Scientific Decisions

  • Subsampling:
    • 1 year Washington focus sampling: Subsampling includes all Washington sequences (no maximum number of sequences) from the past year
    • Contextual proximity sampling: Subsampling includes 1000 sequences sampled from 2020 through the present. This sampling helps to accurately reconstruct the number of introductions. Proximity sampling selects sequences that are as genetically close as possible to the focal samples (currently set to Washington). The genetic proximity between sequences in the focal set and all other sequences is calculated in the priorities.py script.
      • Crowding penalty: The crowding penalty in proximity subsampling controls how strongly the subsampling penalizes sequences that are genetically similar to each other. This build sets the crowding penalty to 0. The default setting is 0.25. A crowding penalty closer to 1 creates a bushier tree and discourages sequence redundancy. A crowding penalty closer to 0 allows more clustering. A crowding penalty of 0 disables crowding.
    • Contextual random sampling: Subsampling includes 500 sequences sampled across year-month groups, which allows for accurate clade timing in the tree.
  • Reference selection: MN908947 is used as the reference because it is the complete genome of the SARS-CoV-2 Wuhan strain collected in December 2019.
  • Clade labeling: Internal clade labels are included in the tree through the main_workflow.smk
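
To make the effect of the crowding penalty concrete, here is a minimal Python sketch of proximity scoring. It is an illustration under stated assumptions, not the logic in priorities.py: the function names, the use of Hamming distance on aligned sequences, and the rank-based penalty are all assumptions, and the sequences are made up.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two aligned, equal-length sequences."""
    return sum(1 for x, y in zip(a, b) if x != y)


def proximity_priorities(focal, context, crowding_penalty=0.0):
    """Score context sequences by closeness to their nearest focal sequence.

    Higher scores mean higher priority. Context sequences that share the same
    nearest focal sequence are ranked by distance, and each one is penalized by
    crowding_penalty times its rank, so a penalty of 0 disables crowding.
    (Assumed scoring rule, for illustration only.)
    """
    neighbors = defaultdict(list)
    for name, seq in context.items():
        # Find the focal sequence with the smallest distance to this one.
        nearest, dist = min(
            ((f_name, hamming(seq, f_seq)) for f_name, f_seq in focal.items()),
            key=lambda pair: pair[1],
        )
        neighbors[nearest].append((dist, name))

    scores = {}
    for candidates in neighbors.values():
        for rank, (dist, name) in enumerate(sorted(candidates)):
            scores[name] = -dist - crowding_penalty * rank
    return scores


# Hypothetical toy sequences.
focal = {"WA-1": "ACGTACGT"}
context = {"ctx-A": "ACGTACGA", "ctx-B": "ACGTACGC", "ctx-C": "ACGTTTTT"}

no_crowding = proximity_priorities(focal, context, crowding_penalty=0.0)
with_crowding = proximity_priorities(focal, context, crowding_penalty=0.25)
```

With a penalty of 0, ctx-A and ctx-B receive the same score because both are one mutation away from WA-1. With a penalty of 0.25, the second of the two is scored lower, which is the sense in which a higher penalty discourages redundant context sequences.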

Getting Started

This build utilizes the Nextstrain.org remote datasets to produce a Washington-focused SC2 Nextstrain build that can be used for genomic surveillance purposes.

Some high-level build features and capabilities are:

  • 1 year Washington focus sampling: All Washington sequences from the last year are included in this build.
  • Tiered subsampling: Additional sequences from the rest of the USA & the world are selected by genetic similarity to the state-level sequences. Additionally, earlier sequences from Washington and globally are provided for temporal context.
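
The temporal-context tier can be pictured as random sampling within year-month groups, so that no single period dominates. The Python sketch below only illustrates that idea. In the build itself, subsampling is driven by the build config, and the function name, record layout, and strain names here are hypothetical.

```python
import random
from collections import defaultdict


def sample_by_year_month(records, total, seed=0):
    """Pick roughly `total` strains spread evenly across year-month groups.

    `records` is a list of (strain, date) pairs with dates as YYYY-MM-DD strings.
    """
    groups = defaultdict(list)
    for strain, date in records:
        groups[date[:7]].append(strain)  # "2021-03-15" -> "2021-03"
    if not groups:
        return []

    # Give every month the same quota, with at least one sequence per month.
    per_group = max(1, total // len(groups))
    rng = random.Random(seed)
    picked = []
    for month in sorted(groups):
        strains = groups[month]
        picked.extend(rng.sample(strains, min(per_group, len(strains))))
    return picked


# Hypothetical records.
records = [
    ("strain_1", "2021-01-05"),
    ("strain_2", "2021-01-20"),
    ("strain_3", "2021-02-11"),
    ("strain_4", "2021-03-02"),
    ("strain_5", "2021-03-30"),
]
picked = sample_by_year_month(records, total=3)
```

Here each of the three months contributes one strain, even though January and March have more sequences than February.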

Data Sources & Inputs

This build uses NCBI data and the SARS-CoV-2 Global Remote Dataset available on Nextstrain.org. The Remote Dataset is sourced from GenBank and is cleaned and maintained by the Nextstrain team. This build pulls in subsets of Washington State sequences and metadata from GenBank, and pulls in contextual data from the Nextstrain Global Remote Dataset. Together these are the inputs to the ncov Nextstrain pipeline.

To include more context, one could use the Full SARS-CoV-2 Remote Dataset for the contextual sequences. However, doing so may require AWS Batch to subsample from the dataset.

  • Sequence Data: GenBank SARS-CoV-2 data from NCBI Datasets and the Nextstrain.org SC2 Remote Dataset (sourced from GenBank)
  • Metadata: GenBank SARS-CoV-2 data from NCBI Datasets, the Nextstrain.org SC2 Remote Dataset (sourced from GenBank), and WA DOH county-level data
  • Expected Inputs:
    • ncov_wa/data/county_metadata.csv (contains most recent line list of GenBank accession number and Washington State county designation)
    • Other sequencing and metadata will be automatically downloaded and ingested as part of the pipeline

Setup & dependencies

Installation

Ensure that you have Nextstrain installed.

To check that Nextstrain is installed:

nextstrain check-setup

If Nextstrain is not installed, follow Nextstrain installation guidelines

Clone this ncov repository:

Clone this repository by running:

git clone https://github.com/NW-PaGe/ncov.git

Run the build

Files that need to be updated

When running the build, county_metadata.csv should be updated to capture the most up-to-date county data. This metadata file is generated by WA DOH and contains two columns: SEQUENCE_GENBANK_STRAIN, containing GenBank accession IDs that match the sequence FASTA headers, and COUNTY_NAME, listing the associated county for each sequence.
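
Before a run, it can help to confirm that the county file has the two expected columns and no obviously bad rows. The following Python sketch is one way to do that. The function name and the specific checks (empty values, duplicate accessions) are assumptions, not part of the build, and the example rows are made up.

```python
import csv
import io

REQUIRED_COLUMNS = {"SEQUENCE_GENBANK_STRAIN", "COUNTY_NAME"}


def check_county_metadata(handle):
    """Return (row_count, problems) for a county metadata CSV file handle."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return 0, [f"missing column(s): {', '.join(sorted(missing))}"]

    problems = []
    seen = set()
    rows = 0
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        rows += 1
        accession = (row["SEQUENCE_GENBANK_STRAIN"] or "").strip()
        county = (row["COUNTY_NAME"] or "").strip()
        if not accession or not county:
            problems.append(f"line {line_no}: empty accession or county")
        elif accession in seen:
            problems.append(f"line {line_no}: duplicate accession {accession}")
        seen.add(accession)
    return rows, problems


# Hypothetical example rows; real accessions come from the WA DOH line list.
example = io.StringIO(
    "SEQUENCE_GENBANK_STRAIN,COUNTY_NAME\n"
    "ABC00001,King\n"
    "ABC00002,Pierce\n"
    "ABC00001,Spokane\n"
)
rows, problems = check_county_metadata(example)
```

To check the real file, pass an open handle instead of the in-memory example, for instance open("ncov_wa/data/county_metadata.csv", newline="").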

To run the build, make sure you are in the top-level "ncov" directory of the cloned repository. The command below specifies how many CPUs to use and which config file to use. In this case, the --configfile argument points to the config file under ncov_wa/config/ that holds our Washington-specific parameters.

nextstrain build --cpus=6 . --configfile ncov_wa/config/builds_ncbi.yaml

When you run the build using nextstrain build ., Nextstrain uses Snakemake as the workflow manager to automate the genomic analyses. The Snakefile in a Nextstrain build defines how raw input data (sequences and metadata) are processed step by step. Nextstrain builds are powered by Augur (for phylogenetics) and Auspice (for visualization), and Snakemake automates the execution of these steps based on file dependencies.
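
The idea of running steps "based on file dependencies" can be shown with a toy resolver in Python. This is a conceptual sketch only, not how Snakemake is implemented, and the rule names and file paths are hypothetical rather than taken from this repository's Snakefile.

```python
def execution_order(rules, target):
    """Return rule names in the order needed to produce `target`.

    `rules` maps a rule name to {"input": [...], "output": [...]}. A rule runs
    only after every rule that produces one of its input files.
    """
    producer = {out: name for name, rule in rules.items() for out in rule["output"]}
    order = []
    visiting = set()

    def visit(path):
        name = producer.get(path)
        if name is None or name in order:
            return  # a raw input file, or a rule that is already scheduled
        if name in visiting:
            raise ValueError(f"cyclic dependency at rule {name!r}")
        visiting.add(name)
        for dep in rules[name]["input"]:
            visit(dep)
        visiting.discard(name)
        order.append(name)

    visit(target)
    return order


# Hypothetical rule and file names, loosely modeled on a phylogenetic workflow.
rules = {
    "export": {
        "input": ["results/tree.nwk", "data/metadata.tsv"],
        "output": ["auspice/build.json"],
    },
    "tree": {"input": ["results/aligned.fasta"], "output": ["results/tree.nwk"]},
    "align": {"input": ["data/sequences.fasta"], "output": ["results/aligned.fasta"]},
}
order = execution_order(rules, "auspice/build.json")
```

Asking for the final JSON file schedules the alignment first, then the tree, then the export, even though the rules were declared in a different order.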

Expected outputs

The file structure of the repository is as follows with * denoting folders that are the build's expected outputs.

.
├── README.md
├── Snakefile
├── auspice*
├── clade-labeling
├── config
├── new_data
├── results*
└── scripts

More details on the file structure of this build can be found here

After successfully running the build there will be two output folders containing the build results.

  • auspice/ folder contains: .json files
  • results/ folder contains:

Visualize Results

  • Drag and drop the .json files from the auspice/ folder onto auspice.us
  • Or run: nextstrain view auspice/
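
Before loading a dataset, a quick sanity check is to count how many tips it contains. The Python sketch below assumes the Auspice v2 layout with a single root under the "tree" key and child nodes nested under "children". The tiny dataset is hand-written and its strain names are hypothetical.

```python
import json


def count_tips(tree):
    """Count leaf nodes in an Auspice-style tree (nodes nest under "children")."""
    tips = 0
    stack = [tree]  # iterative walk, so very deep trees do not hit recursion limits
    while stack:
        node = stack.pop()
        children = node.get("children") or []
        if children:
            stack.extend(children)
        else:
            tips += 1
    return tips


# A tiny hand-written stand-in for a dataset file.
dataset = json.loads("""
{
  "version": "v2",
  "meta": {"title": "example"},
  "tree": {
    "name": "NODE_0",
    "children": [
      {"name": "strain_A"},
      {"name": "NODE_1", "children": [{"name": "strain_B"}, {"name": "strain_C"}]}
    ]
  }
}
""")
tip_count = count_tips(dataset["tree"])
```

For a real output, load one of the files in auspice/ with json.load and pass its "tree" entry to count_tips.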

Additional resources for tree interpretation:

Customization for Local Adaptation

  • The jurisdiction-focused sampling time frame of the build can be changed. It is currently set up to focus on the last year of Washington sequences, but this time frame can be altered to be shorter/longer by adjusting the add_to_builds.smk and the build.yaml subsampling scheme.
  • To adapt the build to a new jurisdiction, the current filters for Washington should be changed to filter for the jurisdiction of interest. These filtering steps are in the filter_wa_metadata.sh bash script, which pattern matches the metadata and is called within the filter_wa.smk workflow. Note: when working with bash scripts, be careful about editing the files in a Windows application, and be sure the files are saved with only the line feed character (LF) instead of the carriage return plus line feed (CRLF).
  • county_metadata.csv should be updated to capture the most up-to-date county data. This metadata file contains two columns: SEQUENCE_GENBANK_STRAIN, containing accession IDs that match the sequence FASTA headers, and COUNTY_NAME, listing the associated county for each sequence.
  • The colors.tsv file can be adapted to change colors visualized in Auspice Color-By. The tsv file should include the divisions of interest that are to appear in the Color-By.
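
The line-ending note above can be checked programmatically. This Python sketch detects CRLF line endings in a script's bytes and converts them to LF. The helper names are hypothetical and the script contents are simulated.

```python
def has_crlf(data: bytes) -> bool:
    """Return True if the byte string contains Windows (CRLF) line endings."""
    return b"\r\n" in data


def to_lf(data: bytes) -> bytes:
    """Convert CRLF line endings to LF, leaving all other bytes untouched."""
    return data.replace(b"\r\n", b"\n")


# Simulated contents of a shell script saved by a Windows editor.
windows_script = b"#!/bin/bash\r\necho filtering\r\n"
fixed_script = to_lf(windows_script)
```

To apply this to an edited script such as filter_wa_metadata.sh, read it with pathlib.Path(...).read_bytes(), and write the converted bytes back with write_bytes only if has_crlf returns True.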

Contributing

For any questions, please submit them to our Discussions page. Software issues and requests can be logged as a Git Issue.

License

This project is licensed under a modified GPL-3.0 License. You may use, modify, and distribute this work, but commercial use is strictly prohibited without prior written permission.

Acknowledgments

These data are generously shared by labs around the world and deposited in NCBI GenBank by the authors. Please contact these labs first if you plan to publish using these data. We gratefully acknowledge the authors and the originating and submitting laboratories of the genetic sequences and metadata for sharing their work. Please note that although data generators have generously shared data in an open fashion, that does not mean there is free license to publish on these data. Data generators should be cited where possible, and collaborations should be sought in some circumstances.

We also gratefully acknowledge the work done by the Bedford lab and Nextstrain team who were the original authors of this build.
