lcr-modules: Standardizing genomic analyses

This project aims to become a collection of standard analytical modules for genomic and transcriptomic data. Too often we copy and paste from each other's pipelines, a practice with several pitfalls. Fortunately, standardized analytical modules solve these problems, and the benefits are many.

Documentation: https://github.com/LCR-BCCRC/lcr-modules/wiki

License: LICENSE

Installing compatible Snakemake

Run the following commands in your terminal to create the opv12 environment with all necessary dependencies.

conda deactivate
git clone https://github.com/LCR-BCCRC/lcr-modules.git
git clone https://github.com/LCR-BCCRC/lcr-scripts.git
cd lcr-modules/
conda env create -f demo/env.yaml

Always activate this environment before running any pipelines that use LCR-modules.

conda activate opv12
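Because every pipeline run depends on the environment being active, a wrapper script can check for it up front. The following is a minimal sketch, not part of LCR-modules: the function name `require_conda_env` is hypothetical, and it assumes the environment was activated by name, in which case conda sets `CONDA_DEFAULT_ENV` to that name (for environments activated by path, the variable may hold the path instead).

```bash
# Hypothetical helper: succeed only if the named conda environment is active.
# conda sets CONDA_DEFAULT_ENV when an environment is activated.
require_conda_env() {
    local expected="$1"
    if [ "${CONDA_DEFAULT_ENV:-}" != "$expected" ]; then
        echo "Expected conda environment '$expected', found '${CONDA_DEFAULT_ENV:-none}'." >&2
        return 1
    fi
}
```

A script could then start with `require_conda_env opv12 || exit 1` before invoking any pipeline.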

See the demo project for examples of how to use LCR-modules with each data type, for example to analyze capture (capture_Snakefile.smk) or mrna (mrna_Snakefile.smk) data.

cd demo
./dry-run.sh capture_Snakefile.smk
./dry-run.sh mrna_Snakefile.smk
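To check both demo workflows in one pass, the two commands above can be wrapped in a small loop. This is only a sketch: `dry_run_all` is a hypothetical helper, and it assumes the dry-run script exits with a non-zero status when the dry run fails.

```bash
# Hypothetical helper: run a command on each Snakefile and report the result.
# Returns 1 if any run failed, 0 otherwise.
dry_run_all() {
    local runner="$1"
    shift
    local failed=0 snakefile
    for snakefile in "$@"; do
        if "$runner" "$snakefile"; then
            echo "OK $snakefile"
        else
            echo "FAILED $snakefile"
            failed=1
        fi
    done
    return "$failed"
}
```

From inside the demo directory, this would be called as `dry_run_all ./dry-run.sh capture_Snakefile.smk mrna_Snakefile.smk`.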

Module levels overview

Level 1 modules perform low-level tasks such as adapter trimming, quality control, and alignment of sequencing files, and obtaining data from repositories such as the European Genome-phenome Archive (EGA). These modules also perform gene expression analyses, including alignment using STAR and calculating mRNA abundance using salmon.

Level 2 modules perform routine tasks for cancer analysis, such as detecting and annotating simple somatic mutations, copy-number alterations, and structural variations.

Level 3 modules perform analyses that rely on cohort-level aggregation. The cohorts and data sets can be flexibly defined based on different clinical characteristics through a set of configuration files. The modules at this level operate on the outputs of level 2 modules and perform tasks such as aggregation of individual files into cohort-level merges. Example workflows include analyses of mutation signatures, identification of significantly mutated genes, and sample classification into genetic subgroups.

[Figure: Module levels]

Currently available modules

Module overview

The tables below list the purpose of each module and supported sequencing types.

Table of Contents

Level 1

| Purpose | # modules |
| --- | --- |
| Alignment | 2 |
| Archive download | 1 |
| Fastq processing | 2 |
| Genome build conversion | 1 |
| Phasing long reads | 1 |
| QC | 2 |

Level 2

| Purpose | # modules |
| --- | --- |
| CNV calling | 5 |
| DNA modification analysis | 1 |
| Gene expression | 2 |
| Pathogen analysis | 1 |
| Phasing long reads | 3 |
| Structural variants | 3 |
| Structural variants long reads | 2 |
| TCR, IG, HLA analysis | 3 |
| Variant annotation | 1 |
| Variant calling | 6 |
| Variant calling long reads | 4 |

Level 3

| Purpose | # modules |
| --- | --- |
| Aggregation | 3 |
| Classifiers | 2 |
| Microenvironment | 1 |
| Mutation signatures | 1 |
| Mutation significance | 8 |

Level 1

Alignment

| module | seq_type | input_type | output_type | data_type |
| --- | --- | --- | --- | --- |
| bwa_mem | capture; genome | FASTQ | BAM/CRAM | Illumina short reads |
| star | mrna | FASTQ | BAM | Illumina short reads |

↑ Back to Table of Contents

Archive download

| module | seq_type | input_type | output_type | data_type |
| --- | --- | --- | --- | --- |
| ega_download | capture; genome; mrna | TSV | VARIOUS | Illumina short reads |

↑ Back to Table of Contents

Fastq processing

| module | seq_type | input_type | output_type | data_type |
| --- | --- | --- | --- | --- |
| bam2fastq | capture; genome; mrna | BAM/CRAM | FASTQ | Illumina short reads |
| cutadapt | capture; genome | FASTQ | FASTQ | Illumina short reads |

↑ Back to Table of Contents

Genome build conversion

| module | seq_type | input_type | output_type | data_type |
| --- | --- | --- | --- | --- |
| liftover | capture; genome | VARIOUS | VARIOUS | Illumina short reads |

↑ Back to Table of Contents

Phasing long reads

| module | seq_type | input_type | output_type | data_type |
| --- | --- | --- | --- | --- |
| phase_variants | promethION | BAM/CRAM | VCF | Long reads |

↑ Back to Table of Contents

QC

| module | seq_type | input_type | output_type | data_type |
| --- | --- | --- | --- | --- |
| picard_qc | capture; genome; mrna | BAM/CRAM | TSV | Illumina short reads |
| qc | capture; genome | BAM/CRAM | TSV | Illumina short reads |

↑ Back to Table of Contents

Level 2

CNV calling

| module | seq_type | input_type | output_type | data_type |
| --- | --- | --- | --- | --- |
| battenberg | capture; genome | BAM/CRAM | SEG | Illumina short reads |
| cnvkit | capture; genome | BAM/CRAM | SEG | Illumina short reads |
| controlfreec | genome | BAM/CRAM | SEG | Illumina short reads |
| ichorcna | genome | BAM/CRAM | SEG | Illumina short reads |
| sequenza | capture; genome | BAM/CRAM | SEG | Illumina short reads |

↑ Back to Table of Contents

DNA modification analysis

| module | seq_type | input_type | output_type | data_type |
| --- | --- | --- | --- | --- |
| modkit | promethION | BAM/CRAM | TSV | Long reads |

↑ Back to Table of Contents

Gene expression

| module | seq_type | input_type | output_type | data_type |
| --- | --- | --- | --- | --- |
| salmon | mrna | FASTQ | TSV | Illumina short reads |
| stringtie | mrna | BAM/CRAM | GTF | Illumina short reads |

↑ Back to Table of Contents

Pathogen analysis

| module | seq_type | input_type | output_type | data_type |
| --- | --- | --- | --- | --- |
| pathseq | capture; genome; mrna | BAM/CRAM | TSV | Illumina short reads |

↑ Back to Table of Contents

Phasing long reads

| module | seq_type | input_type | output_type | data_type |
| --- | --- | --- | --- | --- |
| freebayes | capture; genome | BAM/CRAM | VCF | Illumina short reads |
| nanomethphase | promethION | BAM/CRAM | TSV | Long reads |
| whatshap | genome; promethION | BAM/CRAM | VCF | Illumina short reads |

↑ Back to Table of Contents

Structural variants

| module | seq_type | input_type | output_type | data_type |
| --- | --- | --- | --- | --- |
| gridss | capture; genome | BAM/CRAM | VCF | Illumina short reads |
| hmftools | genome | BAM/CRAM | VCF | Illumina short reads |
| manta | capture; genome; mrna | BAM/CRAM | VCF | Illumina short reads |

↑ Back to Table of Contents

Structural variants long reads

| module | seq_type | input_type | output_type | data_type |
| --- | --- | --- | --- | --- |
| cutesv | promethION | BAM | VCF | Long reads |
| sniffles | promethION | BAM/CRAM | VCF | Long reads |

↑ Back to Table of Contents

TCR, IG, HLA analysis

| module | seq_type | input_type | output_type | data_type |
| --- | --- | --- | --- | --- |
| igcaller | capture; genome | BAM/CRAM | TSV | Illumina short reads |
| mixcr | genome; mrna | BAM/CRAM | TSV | Illumina short reads |
| spechla | capture; genome; mrna | BAM/CRAM | TSV | Illumina short reads |

↑ Back to Table of Contents

Variant annotation

| module | seq_type | input_type | output_type | data_type |
| --- | --- | --- | --- | --- |
| vcf2maf | capture; genome | VCF | MAF | Illumina short reads |

↑ Back to Table of Contents

Variant calling

| module | seq_type | input_type | output_type | data_type |
| --- | --- | --- | --- | --- |
| lofreq | capture; genome | BAM/CRAM | VCF | Illumina short reads |
| mutect2 | capture; genome | BAM/CRAM | VCF | Illumina short reads |
| sage | capture; genome | BAM/CRAM | VCF | Illumina short reads |
| slms_3 | capture; genome | BAM/CRAM | VCF | Illumina short reads |
| strelka | capture; genome | BAM/CRAM | VCF | Illumina short reads |
| varscan | capture; genome | BAM | VCF | Illumina short reads |

↑ Back to Table of Contents

Variant calling long reads

| module | seq_type | input_type | output_type | data_type |
| --- | --- | --- | --- | --- |
| clair3 | promethION | BAM | VCF | Long reads |
| clairs | promethION | BAM/CRAM | VCF | Long reads |
| clairs_to | promethION | BAM/CRAM | VCF | Long reads |
| nanopolish | promethION | BAM/CRAM | VCF | Long reads |

↑ Back to Table of Contents

Level 3

Aggregation

| module | seq_type | input_type | output_type | data_type |
| --- | --- | --- | --- | --- |
| cnv_master | capture; genome | SEG | merged SEG | Illumina short reads |
| starfish | capture; genome; mrna | VCF | VCF | Illumina short reads |
| svar_master | capture; genome | BEDPE | merged BEDPE | Illumina short reads |

↑ Back to Table of Contents

Classifiers

| module | seq_type | input_type | output_type | data_type |
| --- | --- | --- | --- | --- |
| dlbclass | capture; genome | VARIOUS | TSV | Illumina short reads |
| lymphgen | capture; genome | VARIOUS | TSV | Illumina short reads |

↑ Back to Table of Contents

Microenvironment

| module | seq_type | input_type | output_type | data_type |
| --- | --- | --- | --- | --- |
| ecotyper | mrna | TSV | TSV | Illumina short reads |

↑ Back to Table of Contents

Mutation signatures

| module | seq_type | input_type | output_type | data_type |
| --- | --- | --- | --- | --- |
| sigprofiler | capture; genome | MAF | TSV | Illumina short reads |

↑ Back to Table of Contents

Mutation significance

| module | seq_type | input_type | output_type | data_type |
| --- | --- | --- | --- | --- |
| dnds | capture; genome | MAF | TSV | Illumina short reads |
| fishhook | capture; genome | MAF | TSV | Illumina short reads |
| gistic2 | capture; genome | SEG | TSV | Illumina short reads |
| hotmaps | capture; genome | MAF | TSV | Illumina short reads |
| mutsig | capture; genome | MAF | TSV | Illumina short reads |
| oncodriveclustl | capture; genome | MAF | TSV | Illumina short reads |
| oncodrivefml | capture; genome | MAF | TSV | Illumina short reads |
| rainstorm | capture; genome | MAF | BED | Illumina short reads |

↑ Back to Table of Contents
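When planning a workflow, it can help to ask which modules accept a given seq_type. The sketch below copies a handful of rows from the tables above into tab-separated form and filters them with awk. The helper names `modules_tsv` and `modules_for_seq_type` are hypothetical and the row subset is illustrative only; LCR-modules does not necessarily ship such a listing.

```bash
# A few rows copied from the tables above, tab-separated:
# module, seq_type, input_type, output_type.
modules_tsv() {
    printf '%s\t%s\t%s\t%s\n' \
        bwa_mem "capture; genome" FASTQ BAM/CRAM \
        star mrna FASTQ BAM \
        manta "capture; genome; mrna" BAM/CRAM VCF \
        salmon mrna FASTQ TSV \
        sniffles promethION BAM/CRAM VCF
}

# Print the names of modules whose seq_type column lists the requested type.
modules_for_seq_type() {
    awk -F '\t' -v want="$1" '{
        n = split($2, types, /; */)
        for (i = 1; i <= n; i++) {
            if (types[i] == want) { print $1; break }
        }
    }'
}

# Lists star, manta and salmon, one per line.
modules_tsv | modules_for_seq_type mrna
```

The seq_type column is split on the `; ` separator used in the tables, so a query for `genome` matches `capture; genome` without also matching unrelated substrings.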

Known limitations

LCR-modules is not intended for installation and use on personal devices (phones, laptops, personal workstations). Because a number of the tools (GATK, STAR, hmftools, etc.) have high computational requirements, it is recommended for use on high-performance computers running a Unix OS. For processing large numbers of samples in parallel, we recommend computing clusters with a scheduling manager. We recommend running LCR-modules on Linux. Portability to other operating systems has not been tested, and is not supported where the file system is case-insensitive (APFS, NTFS).
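Whether a working directory sits on a case-insensitive file system can be checked before launching anything. The following is a minimal sketch, assuming `mktemp -d` with a template argument is available (as on GNU and BSD systems); the function name `is_case_sensitive_fs` is hypothetical and not part of LCR-modules.

```bash
# Hypothetical check: return 0 if the directory's file system is
# case-sensitive, 1 if it is case-insensitive, and 2 if the probe
# directory could not be created.
is_case_sensitive_fs() {
    local dir="${1:-.}"
    local probe result
    probe=$(mktemp -d "$dir/case_probe.XXXXXX") || return 2
    touch "$probe/lcr_probe"
    # On a case-insensitive file system the upper-case name resolves
    # to the lower-case file that was just created.
    if [ -e "$probe/LCR_PROBE" ]; then
        result=1
    else
        result=0
    fi
    rm -rf "$probe"
    return "$result"
}
```

For example, `is_case_sensitive_fs /path/to/project || echo "unsupported file system"` prints the warning when the check does not succeed; the path here is a placeholder.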
