This repository provides a computational framework for identifying technical artifacts in viral genomic datasets. By detecting lab-specific biases and primer-associated variants, the pipeline enables targeted masking to improve phylogenetic inference.
LSBFILT extends the workflow originally developed by Turakhia et al. (2020) for SARS-CoV-2, making it applicable to other viral genomic datasets with customizable parameters and enhanced filtering and masking capabilities.
Before installing LSBFILT on Linux or macOS, install Docker Desktop and make sure it is running.
Linux:
git clone --recursive https://github.com/khourious/LSBFILT.git
cd LSBFILT
chmod +x -R INSTALL_Unix
bash INSTALL_Unix

macOS:
git clone --recursive https://github.com/khourious/LSBFILT.git
cd LSBFILT
chmod +x -R INSTALL_macOS
bash INSTALL_macOS

After installation, refresh your shell configuration:
source ~/.zshrc

To run this pipeline, you need to provide a FASTA alignment file (with the reference genome as the first sequence), a NEWICK tree file generated from the alignment, a METADATA table in TSV format, and a directory containing primer scheme BED files.
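Before running the pipeline, it can help to sanity-check the alignment. The following is a minimal sketch, not part of LSBFILT: the helper names `read_fasta` and `check_alignment` are hypothetical, and the checks (equal sequence lengths, unique IDs, first record treated as the reference) are assumptions about what a well-formed input looks like.

```python
from pathlib import Path


def read_fasta(path):
    """Return a list of (sequence_id, sequence) tuples in file order."""
    records = []
    seq_id, chunks = None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            # Assumption: the ID is the first whitespace-delimited header token.
            parts = line[1:].split()
            seq_id, chunks = (parts[0] if parts else ""), []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records


def check_alignment(records):
    """Return a list of human-readable problems (empty if none were found)."""
    if not records:
        return ["alignment contains no sequences"]
    problems = []
    # The first record is taken to be the reference genome.
    _, ref_seq = records[0]
    for seq_id, seq in records[1:]:
        if len(seq) != len(ref_seq):
            problems.append(
                f"{seq_id}: length {len(seq)} differs from reference length {len(ref_seq)}"
            )
    ids = [seq_id for seq_id, _ in records]
    for dup in sorted({i for i in ids if ids.count(i) > 1}):
        problems.append(f"duplicate sequence ID: {dup}")
    return problems
```

Calling `check_alignment(read_fasta("alignment.fasta"))` (the file name is a placeholder) returns an empty list when no problem is detected.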
The METADATA table (TSV format) must contain the following columns:
- sequence_id: unique identifier matching the FASTA alignment and NEWICK tree
- sequencing_lab: group/institution of the sequencing laboratory (e.g., KhouriLab, FIOCRUZ-BA)
- sequencing_lab_country: country of the sequencing laboratory (e.g., Brazil)
- sequencing_lib_prep: primer scheme identifier for amplicon data (e.g., Khouri_et_al_2026); otherwise, use Shotgun or Hybrid Capture
The BED files must be named according to the sequencing_lib_prep identifiers (e.g., Khouri_et_al_2026.bed for the entry Khouri_et_al_2026).
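The column and file-naming requirements above can be checked before a run. This is a minimal sketch under stated assumptions: `check_metadata` is a hypothetical helper, not part of LSBFILT, and it assumes that any `sequencing_lib_prep` value other than `Shotgun` or `Hybrid Capture` names a primer scheme that needs a matching `.bed` file.

```python
import csv
from pathlib import Path

# Column names as listed in this README.
REQUIRED_COLUMNS = [
    "sequence_id",
    "sequencing_lab",
    "sequencing_lab_country",
    "sequencing_lib_prep",
]
# Values that do not refer to a primer scheme.
NON_AMPLICON = {"Shotgun", "Hybrid Capture"}


def check_metadata(metadata_path, primers_dir):
    """Return a list of problems found in the metadata TSV (empty if none)."""
    with open(metadata_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = reader.fieldnames or []
        missing = [col for col in REQUIRED_COLUMNS if col not in header]
        if missing:
            return [f"missing column: {col}" for col in missing]
        rows = list(reader)

    problems = []
    seen_ids = set()
    schemes = set()
    for row in rows:
        seq_id = (row["sequence_id"] or "").strip()
        if seq_id in seen_ids:
            problems.append(f"duplicate sequence_id: {seq_id}")
        seen_ids.add(seq_id)
        lib_prep = (row["sequencing_lib_prep"] or "").strip()
        if not lib_prep:
            problems.append(f"{seq_id}: empty sequencing_lib_prep")
        elif lib_prep not in NON_AMPLICON:
            schemes.add(lib_prep)

    # Each amplicon scheme is expected to have a BED file named after it.
    for scheme in sorted(schemes):
        if not (Path(primers_dir) / f"{scheme}.bed").is_file():
            problems.append(f"no BED file for primer scheme: {scheme}.bed")
    return problems
```

An empty return value means the table has the four required columns and every amplicon scheme it mentions has a BED file in the primer directory.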
usage: LSBFILT.py [-h] -fasta FASTA -tree TREE -metadata METADATA -primers PRIMERS -outdir OUTDIR
                  [-minParsimony MINPARSIMONY] [-minLabAssociation MINLABASSOCIATION] [-minLdR2 MINLDR2]
                  [-minPSAC MINPSAC]

Lab-Specific Bias FILTer (LSBFILT)

options:
  -h, --help            show this help message and exit
  -fasta FASTA          path to alignment FASTA file
  -tree TREE            path to NEWICK TREE file
  -metadata METADATA    path to METADATA TSV file
  -primers PRIMERS      path to primer schemes BED files
  -outdir OUTDIR        path to output directory
  -minParsimony MINPARSIMONY
                        minimum parsimony for lab associations (default = 4)
  -minLabAssociation MINLABASSOCIATION
                        minimum fraction of allele calls from single source for lab associations (default = 0.6 =
                        60%)
  -minLdR2 MINLDR2      minimum R2 value to report linkage disequilibrium (default = 0.4)
  -minPSAC MINPSAC      minimum PS:AC ratio for masking alignment FASTA file (default = 0.5)
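
To give a feel for what the four thresholds mean, the sketch below computes each quantity on toy data. It is an illustration only, not the LSBFILT implementation: the function names are hypothetical, the inclusive (`>=`) comparisons are assumptions, PS:AC is read here as parsimony score divided by alternate allele count, and R2 is computed with the standard linkage disequilibrium formula on 0/1 allele vectors.

```python
from collections import Counter


def lab_fraction(labs_of_alt_calls):
    """Return (top_lab, fraction of alternate allele calls from that lab)."""
    counts = Counter(labs_of_alt_calls)
    top_lab, top_count = counts.most_common(1)[0]
    return top_lab, top_count / len(labs_of_alt_calls)


def is_lab_associated(parsimony, labs_of_alt_calls,
                      min_parsimony=4, min_lab_association=0.6):
    """Assumed rule: both thresholds must be met for a site to be flagged."""
    if not labs_of_alt_calls:
        return False
    _, fraction = lab_fraction(labs_of_alt_calls)
    return parsimony >= min_parsimony and fraction >= min_lab_association


def ld_r2(site_a, site_b):
    """Standard r^2 between two equal-length 0/1 allele vectors."""
    n = len(site_a)
    p_a = sum(site_a) / n
    p_b = sum(site_b) / n
    p_ab = sum(1 for a, b in zip(site_a, site_b) if a and b) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # one of the sites is invariant, so r^2 is undefined
    d = p_ab - p_a * p_b
    return d * d / denom


def passes_psac(parsimony, allele_count, min_psac=0.5):
    """Assumed reading: PS:AC = parsimony score / alternate allele count."""
    if allele_count == 0:
        return False
    return parsimony / allele_count >= min_psac


# Toy site: parsimony score 5, three of four alternate calls from one lab.
print(is_lab_associated(5, ["KhouriLab", "KhouriLab", "KhouriLab", "FIOCRUZ-BA"]))
print(ld_r2([1, 1, 0, 0], [1, 1, 0, 0]))
print(passes_psac(3, 4))
```

With the default values from the help text, the toy site is flagged because three of its four alternate calls (0.75) come from one lab and its parsimony score is at least 4.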