All analysis parameters can be set in a runParams.yml file, which is then passed to nextflow with nextflow run -params-file runParams.yml.
Alternatively, each option in this file can also be set directly on the nextflow command line, overriding the value in the parameter file. E.g.
nextflow run /PATH/TO/ScaleMethyl -profile docker --samples=samples.foo.csv
Note that nextflow options are given with a single -, while workflow parameters (e.g. samples) are given with a double dash --.
The workflow can start from an Illumina sequencer runFolder (bcl files), a directory with fastq files or directories with bam files. Specify either
runFolder : "path/to/runFolder"
OR
fastqDir : "path/to/fastqs"
OR
bam1Dir : "path/to/bams"
bam2Dir : "path/to/bams"
where fastqDir is a directory containing all input fastq files. See Fastq Generation for details on file names, etc.
runFolder is the top-level directory for a sequencer output (containing RunInfo.xml). The workflow uses Illumina bcl-convert for automatic fastq generation.
When starting from bam files, refer to mergeBam for details on additional parameters, filename conventions, etc.
samples : "samples.csv"
A file listing all samples in the analysis with their names, sample barcodes and optional sample settings
aligner : ("bwa-meth","bsbolt","parabricks")
Selects the methylation aligner. Options are "bwa-meth" (default aligner, using bwa-mem2), "bsbolt" (original aligner) and "parabricks" (GPU aligner, using bwa-mem). If parabricks is used, the dockerGPU profile must be used; it does not currently support conda or singularity.
genome : "/genomes/grch38/genome.json"
Path to a genome.json file that contains the location of all sequence and index files as well as other parameters for the reference genome to use. Make sure the path to the correct reference genome is present for your chosen aligner above. If using "bwa-meth" or "parabricks", the "ref_fasta" field in the genome.json file is also required.
See nextflow.config for a list of all available parameters. The file also includes nextflow system options (compute resource requirements, etc.).
- fastqOut controls publishing of fastq files from bcl-convert and of the barcode-corrected, sample-demultiplexed fastqs from bcParser to the output directory. If fastq files are not needed for custom analysis, disabling fastq output will save compute time and storage space.
- trimOut controls publishing of trimmed fastq files (after barcode demultiplexing) to the trim folder in the output directory.
- bamOut controls .bam file output from the chosen aligner. Also, bamDedupOut controls deduplicated bam file output from scDedup. Enabling either of these options will increase compute time and output file storage space.
- bamMergeOut controls publishing of merged bam files to the output directory when starting the workflow from bam files, see mergeBam.
- covOut controls publishing of methylation extraction files to the output cov directory, with per-sample, per-context (CG/CH) files in bismark .cov format. These files contain the columns chr, pos, percent_methylated, methylated_count and unmethylated_count.
- allcOut will produce per-barcode methylation call files for analysis by the ALLCools package, with columns chr, pos, strand, context, mc, cov, methylated, as described in the ALLCools documentation.
- amethystOut will create and publish HDF5 files to the output directory with per-cell, per-context methylation calls for analysis with Amethyst v.1.0.0+. Files from runs prior to version 1.3 can be converted to the new Amethyst 1.0.0+ format through a utility provided with Amethyst.
The default cell thresholding algorithm used by the ScaleMethyl workflow imposes a one-dimensional threshold on the coordinate-collapsed, uniquely mapped reads per cell barcode.
Parameters:
- All cell-barcodes with over minUniqCount [1000] unique reads are considered as possible cells
- The read-count of top cells, topCount, is estimated as the topCellPercentile [99] of read-counts of cell-barcodes above minUniqCount
- The cell threshold is set at a fixed fraction of topCount, given by minCellRatio [20]: topCount / minCellRatio
You can adjust --minUniqCount, --topCellPercentile, --minCellRatio in the runParams.yml or on the command line. These can be overridden by providing the threshold column in the samples.csv.
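The three steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic as described here, not the workflow's actual code: the function names, the linear-interpolation percentile, and the handling of barcodes exactly at minUniqCount are all assumptions.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between closest ranks (an assumed convention)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty list is undefined")
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def cell_threshold(unique_reads, min_uniq_count=1000,
                   top_cell_percentile=99, min_cell_ratio=20):
    """Return topCount / minCellRatio, or None if no barcode exceeds minUniqCount."""
    # Step 1: barcodes with over minUniqCount unique reads are possible cells.
    candidates = [n for n in unique_reads if n > min_uniq_count]
    if not candidates:
        return None
    # Step 2: topCount is the topCellPercentile of the candidate read counts.
    top_count = percentile(candidates, top_cell_percentile)
    # Step 3: the threshold is a fixed fraction of topCount.
    return top_count / min_cell_ratio


# Toy input: unique read counts per cell barcode (made-up numbers).
counts = [500, 2000, 40000, 60000, 80000, 100000]
threshold = cell_threshold(counts)  # roughly 4960 for this toy input
passing = [n for n in counts if n > 1000 and n >= threshold]
print(threshold, passing)
```

With the defaults, a barcode passes in this sketch only if it is a candidate and has at least one twentieth of the read count of the top cells, so raising minCellRatio lowers the threshold and admits more barcodes.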
In the sample reports, you will find a plot of unique reads versus percent unique reads (relative to total reads per barcode) at report/<sample>/png/<sample>.complexityTotal.png. To filter passing cells for a range in the percent unique reads (x-axis) in addition to the unique read threshold, provide these options in addition to the cell calling options above:
- --minUniqTotal [1] minimum percent unique reads per barcode for cell calling.
- --maxUniqTotal [100] maximum percent unique reads per barcode for cell calling.
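As a rough illustration of how this range filter combines with the read-count threshold, the check for a single barcode might look like the following. The function name, the inclusive bounds, and the treatment of barcodes with zero total reads are assumptions made for this sketch.

```python
def passes_complexity_filter(unique_reads, total_reads,
                             min_uniq_total=1, max_uniq_total=100):
    """Check whether percent unique reads lies within [minUniqTotal, maxUniqTotal]."""
    if total_reads <= 0:
        # Assumption: a barcode without reads can never pass.
        return False
    percent_unique = 100 * unique_reads / total_reads
    return min_uniq_total <= percent_unique <= max_uniq_total


# A barcode with 5,000 unique reads out of 20,000 total is 25% unique.
print(passes_complexity_filter(5000, 20000))                     # default range
print(passes_complexity_filter(5000, 20000, min_uniq_total=30))  # stricter lower bound
```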
--chReadsThreshold [50] Reads with more than this percentage of CH methylation are discarded as having failed bisulfite conversion.
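A minimal sketch of such a per-read check is shown below. How the workflow actually counts CH sites within a read is not described here, so the inputs (methylated and total CH sites per read), the function name, and the decision to keep reads without any CH sites are all assumptions.

```python
def passes_ch_filter(methylated_ch, total_ch, ch_reads_threshold=50):
    """Keep a read unless more than chReadsThreshold percent of its CH sites are methylated."""
    if total_ch == 0:
        # Assumption: a read without CH sites gives no evidence of failed conversion.
        return True
    percent_ch_methylated = 100 * methylated_ch / total_ch
    return percent_ch_methylated <= ch_reads_threshold


print(passes_ch_filter(6, 10))  # 60% CH methylation, above the default of 50
print(passes_ch_filter(5, 10))  # exactly 50%, not greater than the threshold
```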
This assumes that you have run the workflow and would like to apply a different threshold than the one called by the cell thresholding algorithm.
If you resume the run with a samples.csv to which you have added the threshold column, the workflow will rerun from the beginning. However, the output matrices can easily be filtered for the passing cells if needed.
If you wish to rerun only the report with a more stringent threshold and can filter your matrices for downstream analysis, there is a reporting only entry point to the workflow.
- reportingOnly : true option to only run reporting from a previousOutDir for a completed run.
- previousOutDir would be the outDir of your original run.
- outDir is your new outDir for the new reports. It is recommended that you do not use the same folder as the original run, as it will overwrite all original results.
- samples.csv must be provided with the additional threshold column, see samples.csv for details.
By default, analysis is executed in a highly parallel fashion, controlled by the splitFastq parameter. When starting from a runFolder, this parameter controls whether library fastq files are split by i5 barcode in the bcl-convert step. Each of these fastq files is processed in parallel and demultiplexed by bcParser into sample-specific fastq files for each Tn5 barcode. This results in 288 sets of fastq files (R1 & R2), which then go through alignment, deduplication and extraction in parallel. Split sub-samples are merged again at the end (methylation outputs, QC reports).
When starting from fastq files, the first workflow steps are parallelized per set of input fastq files, so it is important to have the data in multiple input fastq files for better performance; see fastqGeneration.
Since single-cell methylation analysis is quite compute intensive, our recommendation is to keep splitFastq set to true. The caveat is that this configuration will launch a large number of compute jobs in parallel for each sample, so your infrastructure must be able to support that.
The Parabricks NVIDIA GPU aligner requires at least one GPU present to function. Multiple GPU machine instances from AWS through a scheduler like Seqera can be utilized for parallel execution by including the parabricksNumGpu option. Docker doesn't currently allow splitting of multiple GPUs on one machine/instance in an automated way, and we are looking into a solution for this in an upcoming version. If in doubt, continue to use the default: parabricksNumGpu: 1.
libStructure : "lib.json"
The library structure JSON file defines
- Barcode locations in paired-end reads: where in the reads cell-barcodes are found
- List of allowed barcode sequences for Tn5 barcodes, i7 and i5 PCR barcodes.
- Which parts of the reads represent genomic DNA and which should be masked (e.g. adapter sequences, randomer)
The default file, for our standard product configuration, is included in lib.json.
Previous alpha kit configuration is available with libStructure : "prev_pcr/lib.json" in the runParams.yml.