MAD4HATTER Amplicon Sequencing Pipeline

MAD4HATTER is a bioinformatics analysis pipeline used to process amplicon sequencing data.

A quick start guide for running the complete pipeline is below and comprehensive documentation can be found here.

Setup

The MAD4HATTER pipeline is built on Nextflow, which must be installed before you run the pipeline. Installation and usage instructions for the command line tool can be found on the Nextflow website. Nextflow is also available from package managers such as conda if you would like an alternative installation pathway.

Nextflow caches the results of the steps it has already completed, so if your pipeline fails midway you can fix the cause of the failure and rerun the same command with the -resume flag to pick up where you left off. See more information here.
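
As a sketch of the resume workflow described above, the snippet below takes the arguments of a previous run and appends -resume. The helper name `build_resume_cmd` is made up for illustration; only -resume and the pipeline parameters come from this guide. It prints the command rather than running it.

```shell
# Hypothetical helper: print a nextflow command with -resume appended.
# Arguments are passed through unchanged; nothing is executed here.
build_resume_cmd() {
    printf 'nextflow run main.nf %s -resume\n' "$*"
}

# First attempt (suppose it fails midway):
#   nextflow run main.nf --readDIR /path/to/data --pools D1,R1,R2 -profile docker
# After fixing the problem, rerun the same command with -resume added:
build_resume_cmd --readDIR /path/to/data --pools D1,R1,R2 -profile docker
```

Keeping the rest of the command identical matters, because Nextflow decides which cached steps can be reused based on their inputs.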

Setting Parameters

To view the parameters and examples on the command line, run:

nextflow run main.nf --help

Mandatory Parameters

Below are the parameters that are essential for running the complete pipeline. For more details on running just a portion of the pipeline steps see here.

Parameter	Description
pools	The pools that were used for sequencing. [Options: D1,R1,R2 - check panel.config for more options]
readDIR	Path to folder containing fastq files

Here is an example of running the complete workflow:

nextflow run main.nf --readDIR /path/to/data --pools D1,R1,R2 -profile sge,apptainer
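
Before launching, it can help to confirm that both mandatory parameters are usable. The following is a minimal sketch of such a check, not part of the pipeline: the function name `check_inputs` and the accepted file extensions are assumptions made for illustration.

```shell
# Hypothetical pre-flight check for the two mandatory parameters.
# Usage: check_inputs <readDIR> <pools>
check_inputs() {
    _dir=$1
    _pools=$2
    if [ ! -d "$_dir" ]; then
        echo "readDIR is not a directory: $_dir" >&2
        return 1
    fi
    if [ -z "$_pools" ]; then
        echo "pools must not be empty (for example D1,R1,R2)" >&2
        return 1
    fi
    # Assumption: reads are named *.fastq, *.fastq.gz, *.fq or *.fq.gz.
    _n=$(find "$_dir" -type f \( -name '*.fastq' -o -name '*.fastq.gz' \
        -o -name '*.fq' -o -name '*.fq.gz' \) | wc -l | tr -d ' ')
    if [ "$_n" -eq 0 ]; then
        echo "no fastq files found in $_dir" >&2
        return 1
    fi
    echo "OK: $_n fastq file(s) in $_dir, pools=$_pools"
}

# Example use (commented out so that nothing is launched here):
#   check_inputs /path/to/data D1,R1,R2 && \
#       nextflow run main.nf --readDIR /path/to/data --pools D1,R1,R2 -profile sge,apptainer
```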

Optional Parameters

Below are parameters that are optional for running the pipeline.

Parameter	Description
outDIR	The folder where you want the resulting data to be saved (default 'results')
workflow_name	Workflow option to be run [Options: complete (default), qc, postprocessing]

Nextflow parameters (these are Nextflow options, so they take a single dash, e.g. -profile)

Parameter	Description
profile	The infrastructure you wish to run the pipeline on. The different profiles are listed below under `Runtime Profiles`, including any setup that is required. Please read that section for more details.
config	Resource configurations for each process that will override any defaults set in the pipeline. It is recommended to use the provided `custom.config` file to make these resource modifications.

Below is an example of how you may run the pipeline setting the above parameters.

nextflow run main.nf --readDIR /path/to/data --outDIR /path/to/results --pools D1,R1,R2 -profile docker --workflow_name qc -config conf/custom.config
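
To show what a resource override passed with -config might look like, the sketch below writes a small Nextflow config file and prints a command that would use it. The file name `my_resources.config` and the cpus and memory values are placeholders, and the process name is assumed to match the DADA2_ANALYSIS module mentioned in this guide; the contents of the provided `custom.config` may differ.

```shell
# Write a hypothetical resource override file in the current directory.
# withName, cpus and memory are standard Nextflow process settings;
# the values below are placeholders, not recommendations.
cat > my_resources.config <<'EOF'
process {
    withName: 'DADA2_ANALYSIS' {
        cpus   = 4
        memory = '16 GB'
    }
}
EOF

# Print (do not run) a command that applies the override.
echo "nextflow run main.nf --readDIR /path/to/data --pools D1,R1,R2 -profile docker -config my_resources.config"
```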

DADA parameters

DADA2 infers amplicon sequences exactly and can be tuned depending on your needs. DADA2 is run in the DADA2 module of the pipeline (DADA2_ANALYSIS). Below are parameters that you can set to control your output.

Parameter	Description
omega_a	The p-value threshold DADA2 uses to decide whether a sequence is too abundant to be explained as an error, and is therefore likely a true variant. Smaller values are more conservative. (default `1e-120`)
dada2_pool	The method for information sharing across samples (default `pseudo`)
band_size	An alignment heuristic: sequences will not be aligned to each other if the number of indels between them exceeds this threshold (default `16`)
maxEE	During filtering and trimming, reads whose expected number of errors exceeds this value will be discarded (default `3`)
just_concatenate	Setting this to true will concatenate any forward and reverse reads that DADA2 was unable to merge. Concatenated reads will have 10 Ns separating the forward and reverse reads (i.e. `NNNNNNNNNN`). Setting this to false will discard read pairs that did not have enough overlapping bases to merge. The minimum overlap required to merge forward and reverse reads is 12 bases. (default `true`)
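
As a small illustration of the just_concatenate output described in the table, the sketch below joins two made-up sequences with a spacer of 10 Ns. The function name is hypothetical, and orientation handling of the reverse read is left out; DADA2 performs the real step inside the pipeline.

```shell
# Join a forward and a reverse sequence with a 10-N spacer.
# Illustration only: the sequences are invented.
concat_with_ns() {
    _spacer=NNNNNNNNNN   # 10 Ns, as described in the table
    printf '%s%s%s\n' "$1" "$_spacer" "$2"
}

concat_with_ns ACGTACGT TTGCAAGC
# prints ACGTACGTNNNNNNNNNNTTGCAAGC
```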

For more information about DADA2 and the parameters that can be set, please refer to the DADA2 documentation.
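
To make the maxEE filter more concrete: the expected number of errors in a read is the sum over its bases of 10^(-Q/10), where Q is the Phred quality score of each base. The sketch below computes that sum for a quality string, assuming Phred+33 encoding; the function name `expected_errors` is made up for illustration.

```shell
# Print the expected number of errors for a Phred+33 quality string.
expected_errors() {
    printf '%s\n' "$1" | awk '
        # Build a character-to-ASCII-code lookup table.
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        {
            ee = 0
            for (i = 1; i <= length($0); i++) {
                q = ord[substr($0, i, 1)] - 33   # Phred score of this base
                ee += 10 ^ (-q / 10)             # error probability of this base
            }
            printf "%.4f\n", ee
        }'
}

expected_errors 'IIIIIIII'   # eight Q40 bases, prints 0.0008
expected_errors '#####'      # five Q2 bases, prints 3.1548
```

With the default maxEE of 3, the first read would be kept and the second would be discarded.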

Below is an example of how you may use the above parameters on the command line:

nextflow run main.nf --readDIR /path/to/data --outDIR /path/to/results -profile docker --pools D1,R1,R2 -config conf/custom.config --omega_a 1e-120 --band_size 16 --dada2_pool pseudo

Post processing parameters

By default the pipeline will use the --pools parameter and panel.config to find the paths to the reference sequences for each of the pools. This can be overridden by setting either the refseq_fasta or genome parameter as detailed below.

Below are parameters that you can set to control the postprocessing module.

Parameter	Description
refseq_fasta or genome	Path to targeted reference sequence or a specified genome that covers all targets. If neither is specified then a reference will be built from the fasta files under `panel_information` based on the pools supplied.
homopolymer_threshold	Homopolymers longer than this threshold will be masked (default `5`)
trf_min_score	Used by Tandem Repeats Finder. This controls the alignment score required to call a sequence a tandem repeat and mask it (default `25`)
trf_max_period	Used by Tandem Repeats Finder. This sets the maximum pattern (period) size of a tandem repeat to be masked (default `3`)
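
The effect of homopolymer_threshold can be sketched on a toy sequence. The example below replaces every run of a single base that is longer than the threshold with Ns. Using N as the mask character is an assumption made for illustration, as is the function name; the pipeline's own masking may be represented differently.

```shell
# Mask homopolymer runs longer than a threshold with Ns.
# Usage: mask_homopolymers <threshold> <sequence>
mask_homopolymers() {
    printf '%s\n' "$2" | awk -v t="$1" '{
        out = ""; n = length($0); i = 1
        while (i <= n) {
            c = substr($0, i, 1); j = i
            # Extend j to the end of the current run of identical bases.
            while (j < n && substr($0, j + 1, 1) == c) j++
            run = j - i + 1
            if (run > t) { for (k = 0; k < run; k++) out = out "N" }
            else out = out substr($0, i, run)
            i = j + 1
        }
        print out
    }'
}

mask_homopolymers 5 ACGTTTTTTTAC   # run of 7 Ts, prints ACGNNNNNNNAC
mask_homopolymers 5 ACGTTTTTAC     # run of 5 Ts is not longer than 5, prints ACGTTTTTAC
```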

Below is a continuation of the example above that shows how these parameters may be modified on the command line. Note that you can provide a reference with either --refseq_fasta or --genome, or omit both. If neither --refseq_fasta nor --genome is set, the pipeline will build a targeted reference from the reference for each pool, stored under the resources directory.

nextflow run main.nf --readDIR /path/to/data --outDIR /path/to/results -profile sge,apptainer --refseq_fasta /path/to/targeted_reference --pools D1,R1,R2 -config conf/custom.config --omega_a 1e-120 --band_size 16 --dada2_pool pseudo --trf_min_score 25 --trf_max_period 3

nextflow run main.nf --readDIR /path/to/data --outDIR /path/to/results -profile sge,apptainer --genome /path/to/Whole_Genome.fasta --pools D1,R1,R2 -config conf/custom.config --omega_a 1e-120 --band_size 16 --dada2_pool pseudo --trf_min_score 25 --trf_max_period 3
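
The three ways of supplying a reference can be summarized in a small wrapper. This is a sketch under stated assumptions: the helper name `reference_args` is hypothetical, and rejecting both flags at once is a choice made here for illustration rather than documented pipeline behavior.

```shell
# Hypothetical helper: print the reference flag to add to the nextflow command.
# Usage: reference_args <refseq_fasta-or-empty> <genome-or-empty>
reference_args() {
    if [ -n "$1" ] && [ -n "$2" ]; then
        echo "set only one of --refseq_fasta or --genome" >&2
        return 1
    elif [ -n "$1" ]; then
        printf '%s\n' "--refseq_fasta $1"
    elif [ -n "$2" ]; then
        printf '%s\n' "--genome $2"
    fi
    # Neither set: print nothing, so the pipeline builds the reference from the pools.
}

reference_args /path/to/targeted_reference ""   # prints --refseq_fasta /path/to/targeted_reference
reference_args "" /path/to/Whole_Genome.fasta   # prints --genome /path/to/Whole_Genome.fasta
reference_args "" ""                            # prints nothing
```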

Resmarker Module Parameters

By default the pipeline will check if any of the markers in the principal list (panel_information/principal_resistance_marker_info_table.tsv) are covered by any of the loci in the panel. If markers are covered then the resmarker module will run. This can be overridden and a customized table can be supplied using the following parameter.

Parameter	Description
resmarker_info	Path to table containing resmarker information

Runtime Profiles

Runtime profiles provide all dependencies and setup needed for different computing environments. As an example, if you are using a cluster, grid or HPC environment, apptainer would be an appropriate profile as it supplies an image with all dependencies ready. If you are using a local computer, docker would be more appropriate. You can also install the dependencies yourself and run the pipeline that way, but this is not recommended.

Continuing with our example above, the command below could be used to run the pipeline using the SGE scheduler.

nextflow run main.nf --readDIR /path/to/data -profile sge,apptainer --pools D1,R1,R2
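
If you are unsure which profile fits a machine, one rough approach is to check which of the supported runtimes is installed. The sketch below is illustrative only: the function name `detect_profile` and the order of preference are assumptions, and on a cluster you would still add the scheduler yourself (for example sge,apptainer).

```shell
# Hypothetical helper: print the first runtime found on PATH
# among the profiles named in this section.
detect_profile() {
    for _tool in apptainer docker conda; do
        if command -v "$_tool" >/dev/null 2>&1; then
            echo "$_tool"
            return 0
        fi
    done
    echo "none of apptainer, docker or conda found on PATH" >&2
    return 1
}

# Print (do not run) a command using the detected profile, if any.
if _profile=$(detect_profile 2>/dev/null); then
    echo "nextflow run main.nf --readDIR /path/to/data -profile $_profile --pools D1,R1,R2"
fi
```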

Apptainer

Note: apptainer is a prerequisite.

Apptainer should be used if you are using a computing cluster or grid. All dependencies needed to run the pipeline are contained within the apptainer image. The image can be created by pulling the docker image from dockerhub, which will create a .sif image in your working directory named after the image and its tag (mad4hatter_v1.0.0.sif for the command below).

apptainer pull docker://eppicenter/mad4hatter:v1.0.0
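
The name of the image file that the pull creates depends on the tag. The sketch below derives it from a docker:// URI, assuming apptainer's default naming of image name, underscore, tag, with "latest" used when no tag is given. The function name `sif_name` is made up for illustration.

```shell
# Derive the default .sif file name for a docker:// URI.
# Assumption: default naming is <image>_<tag>.sif, with tag "latest" if omitted.
sif_name() {
    _ref=${1#docker://}      # drop the scheme
    _image=${_ref##*/}       # drop the namespace, e.g. "eppicenter/"
    case "$_image" in
        *:*) _name=${_image%%:*}; _tag=${_image#*:} ;;
        *)   _name=$_image; _tag=latest ;;
    esac
    printf '%s_%s.sif\n' "$_name" "$_tag"
}

sif_name docker://eppicenter/mad4hatter:v1.0.0   # prints mad4hatter_v1.0.0.sif
sif_name docker://eppicenter/mad4hatter          # prints mad4hatter_latest.sif
```

Checking for that file in your working directory after the pull is a quick way to confirm the image was created.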

Once you have the image, you must include the apptainer profile on the command line in order for it to be used.

Note: you should also include the job scheduler you will be using. In this case, sge is the job scheduler that will be used. Contact your system administrator if you are unsure about this setting.

nextflow run main.nf --readDIR single --pools D1,R1,R2 -profile sge,apptainer

Docker

Note: docker is a prerequisite.

The pipeline can be easily run with docker, which is the recommended way to run it when not using an HPC.

The EPPIcenter has a repository for images, and the docker image for the pipeline will be automatically pulled in the background when first running the pipeline. The image will then be stored locally on your machine and reused.

To run the pipeline with Docker, simply add -profile docker to your command.

nextflow run main.nf --readDIR /path/to/data -profile docker --pools D1,R1,R2

Alternatively, you can build the docker image on your machine using the Dockerfile recipe, although this is not the recommended way to set up the docker image. To do so, run the command below from the directory that contains the Dockerfile:

docker build -t eppicenter/mad4hatter:latest .

Conda

To use conda, you must first install either conda or miniconda. Once installed, include the conda profile on the command line.

nextflow run main.nf --readDIR /path/to/data -profile conda --pools D1,R1,R2
