A Snakemake workflow for processing bulk RNA-seq data from NCBI GEO accession numbers or FASTQ files. This pipeline performs read quality control, alignment to a reference genome, and generates gene count matrices suitable for downstream analysis such as differential expression or metabolic modeling.
- Overview
- Requirements
- Installation
- Conda Environment Setup
- Configuration
- Profile Configuration
- Running the Pipeline
- Output Description
- Benchmarking
- Troubleshooting
- Citation
- Contributing
- License
This workflow:
- Downloads SRA files from NCBI/GEO (if using prefetch) or reads local FASTQ files
- Converts SRA files to FASTQ format using parallel-fastq-dump
- Performs quality control with FastQC before and after trimming
- (Optionally) Trims adapters and low-quality bases with Trim Galore
- Aligns reads to a chosen reference genome using STAR
- Quantifies gene counts using Salmon
- Collects RNA-seq metrics with Picard
- Calculates insert sizes and fragment sizes
- Screens for common contaminants
- Generates a MultiQC report combining all quality metrics
Clone the repository using Git:
```bash
git clone https://git.unl.edu/HelikarLab/FastqToGeneCounts.git
cd FastqToGeneCounts
```

Create and activate a new Conda environment with the required dependencies:
```bash
conda create -n fastq2genecounts python=3.10 snakemake
conda activate fastq2genecounts

# If you are running this pipeline and want to use SLURM, install the executor plugin:
pip install snakemake-executor-plugin-slurm
```

Verify the installation:

```bash
snakemake --version
```

The `config.yaml` file contains all customization options available to the pipeline. The options in this file are explained below.
The MASTER_CONTROL setting specifies the CSV file containing sample information. This file must have the
following four columns as a header row:
| Column | Description | Valid Values |
|---|---|---|
| `srr` | SRA accession code | `SRR######` (leave this empty if you are using local files) |
| `sample` | Sample identifier | Prefix with any alphanumeric string (e.g., tissue name), plus the "Study", "Run", and "Replicate" for that sample, separated by underscores |
| `endtype` | Sequencing type | `PE` (paired-end) or `SE` (single-end) |
| `prep_method` | Library preparation | `total` (total RNA) or `mrna` (PolyA/mRNA) |
Example CSV file:
```
srr,sample,endtype,prep_method
SRR101,tissueA_S1R1,PE,mrna
SRR102,tissueA_S1R2,SE,total
SRR103,tissueB_S1R1r1,PE,total
SRR104,tissueB_S1R1r2,PE,total
```

For local FASTQ files (see Processing Options below), leave the `srr` column empty. If the sample uses paired-end sequencing, the pipeline will automatically find the forward and reverse reads based on the sample name provided in the file (e.g., `tissueA_S1R1_1.fastq.gz` and `tissueA_S1R1_2.fastq.gz`):

```
srr,sample,endtype,prep_method
,tissueA_S1R1,PE,total
,tissueA_S1R2,SE,total
```

Other general settings in `config.yaml`:

| Setting | Description |
|---|---|
| `ROOTDIR` | Directory where results will be saved (default: `results`) |
| `EXPERIMENT_NAME` | Optional subdirectory name for organizing experiment outputs |
| `BENCHMARK_DIR` | Directory for benchmark output (default: `benchmarks`) |
| `BENCHMARK_TIMES` | Number of times to run each rule for benchmarking (default: `1`, minimal benchmarking performed) |
| `LOG_DIR` | Directory for log files (default: `logs`) |
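
For orientation, a minimal `config.yaml` fragment combining `MASTER_CONTROL` with these general settings might look like the sketch below; the CSV path and experiment name are placeholders, not values shipped with the pipeline.

```yaml
# Illustrative config.yaml fragment; paths and names are placeholders
MASTER_CONTROL: "controls.csv"    # CSV sample sheet described above (placeholder path)
ROOTDIR: "results"                # where results are written
EXPERIMENT_NAME: "my_experiment"  # optional subdirectory for this run (placeholder name)
BENCHMARK_DIR: "benchmarks"       # benchmark output directory
BENCHMARK_TIMES: 1                # values >1 repeat each rule to collect benchmark statistics
LOG_DIR: "logs"                   # log file directory
```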
The following processing options enable or disable individual pipeline steps:

| Setting | Description | Default |
|---|---|---|
| `PERFORM_DUMP_FASTQ` | Download + convert SRA files to FASTQ format | `True` |
| `PERFORM_TRIM` | Trim adapters and low-quality bases | `True` |
| `PERFORM_SCREEN` | Screen for contamination | `True` |
| `PERFORM_GET_RNASEQ_METRICS` | Collect RNA-seq metrics with Picard | `True` |
| `PERFORM_GET_INSERT_SIZE` | Calculate insert sizes | `True` |
| `PERFORM_GET_FRAGMENT_SIZE` | Calculate fragment sizes with RSeQC | `True` |
| `BYPASS_REPLICATE_VALIDATION` | Skip replicate count validation (for COMO) | `False` |
| `BYPASS_GENOME_VALIDATION` | Skip genome file validation before running | `False` |
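
Expressed in `config.yaml`, these toggles might look like the following sketch, using the defaults from the table above; whether they sit at the top level or under a subsection depends on the shipped configuration file.

```yaml
# Processing toggles (defaults from the table above)
PERFORM_DUMP_FASTQ: True            # download SRA files and convert to FASTQ
PERFORM_TRIM: True                  # adapter/quality trimming
PERFORM_SCREEN: True                # contamination screening
PERFORM_GET_RNASEQ_METRICS: True    # Picard RNA-seq metrics
PERFORM_GET_INSERT_SIZE: True       # insert size calculation
PERFORM_GET_FRAGMENT_SIZE: True     # fragment size calculation (RSeQC)
BYPASS_REPLICATE_VALIDATION: False  # skip replicate count validation (for COMO)
BYPASS_GENOME_VALIDATION: False     # skip genome file validation before running
```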
To process local FASTQ files instead of downloading from SRA (a minimal sketch follows this list):
- Set `PERFORM_DUMP_FASTQ: False`
- Set `LOCAL_FASTQ_FILES: "<dirname>"` (change `<dirname>` to the directory path)
- Place your FASTQ files in the specified directory
- Leave the `srr` column empty in your `MASTER_CONTROL` CSV
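
A minimal sketch of the two settings that change for a local-file run; the directory path is a placeholder.

```yaml
# Local FASTQ mode: skip the SRA download and read FASTQ files from disk
PERFORM_DUMP_FASTQ: False
LOCAL_FASTQ_FILES: "/path/to/fastq_dir"   # placeholder; point this at your FASTQ directory
```

Paired-end files in this directory should follow the `<sample>_1.fastq.gz` / `<sample>_2.fastq.gz` naming convention shown earlier.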
The pipeline automatically downloads and prepares reference genome files from Ensembl. Configure these settings under
the GENOME section:
| Setting | Description | Example |
|---|---|---|
| `SAVE_DIR` | Directory for genome files | `genome` |
| `TAXONOMY_ID` | NCBI taxonomy ID | `9606` (human), `10090` (mouse) |
| `VERSION` | Ensembl release version | `115`, `latest`, `release-112` |
| `SHOW_PROGRESS` | Show download progress | `True` |
| `FASTA_VERSION` | FASTA file type | `primary_assembly` or `toplevel` |
Find your taxonomy ID at https://www.ncbi.nlm.nih.gov/Taxonomy
Available Ensembl releases are listed at https://www.ftp.ensembl.org/pub/
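
Putting the table together, a `GENOME` block for human data might look like this sketch (values are examples only; the key nesting assumes the section layout described above):

```yaml
GENOME:
  SAVE_DIR: "genome"                  # where reference files are downloaded and stored
  TAXONOMY_ID: 9606                   # human; use 10090 for mouse
  VERSION: "latest"                   # or a specific Ensembl release such as 115
  SHOW_PROGRESS: True                 # show download progress
  FASTA_VERSION: "primary_assembly"   # or "toplevel"
```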
Snakemake profiles configure how jobs are submitted to your compute environment. Two profiles are included with this pipeline.
Use the SLURM profile for high-performance computing clusters:
```bash
snakemake --profile profiles/cluster
```

The SLURM profile configuration (`profiles/cluster/config.v8+.yaml`) includes:
| Setting | Description | Default |
|---|---|---|
| `cores` | Maximum cores used in workflow | `10` |
| `jobs` | Maximum concurrent jobs to submit to SLURM | `250` |
| `restart-times` | Retry attempts on failure | `0` |
| `use-conda` | Enable Conda environments | `True` |
> [!WARNING]
> The `use-conda` setting should always be kept as `True`; otherwise, Snakemake assumes all required dependencies are installed in the current environment. By keeping this value as `True`, Snakemake will create the required Conda environments and activate them for each rule that needs them.
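
For reference, the documented settings correspond to entries like these in `profiles/cluster/config.v8+.yaml` (a trimmed sketch; the shipped profile also contains SLURM submission options such as the `sbatch` template shown below):

```yaml
# Trimmed sketch of profiles/cluster/config.v8+.yaml
cores: 10          # maximum cores used by the workflow
jobs: 250          # maximum concurrent jobs submitted to SLURM
restart-times: 0   # retry attempts on failure
use-conda: True    # build and activate per-rule Conda environments
```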
Edit profiles/cluster/config.v8+.yaml to change the SLURM account:
```
sbatch
  --job-name=smk-{rule}-{wildcards}
  --account=YOUR_ACCOUNT_NAME   # Change to your account name: `sacctmgr show user $USER accounts`
  --cpus-per-task={threads}
  ...
```

For execution on a single machine without a cluster scheduler:
```bash
snakemake --profile profiles/local
```

> [!WARNING]
> Local execution may be slow for large datasets and, unless you are running this pipeline on a workstation, is not recommended for genome alignment due to memory constraints.
If you would like to change any of the default parameters in the profile, you can either:
- Modify the profile to your liking; this will become the new default for all future pipeline workflows, or
- Override the parameters on a per-run basis by including them on the command line. Execute `snakemake --help` to view all options.
A dry run can be performed to verify configuration:
```bash
snakemake --profile profiles/cluster --dry-run   # or `--profile profiles/local` if using a local workstation
```

The dry run will:
- Validate the Snakefile syntax
- Print the defined output directories
- Display all rules that will be executed
- Check that configuration is properly set up
- Verify input files are accessible
After confirming the dry run succeeds, run the full pipeline:
```bash
snakemake --profile profiles/cluster   # or `--profile profiles/local` if using a local workstation
```

Since Snakemake does not daemonize by default, use `screen` or `tmux` before running Snakemake to keep the main process running:
> [!NOTE]
> Screen is easier to set up and use, but tmux offers more powerful features.
```bash
# Start a screen session
screen -S snakemake

# Detach from the session: Ctrl+a, d

# Reattach to the session
screen -r snakemake
```

Log files are stored in the `logs` directory, organized by tissue and rule:
```
logs/   # or the defined `LOG_DIR` configuration option
  {tissue}/
    {rule}/
      {tissue}_{rule}.log
```
The pipeline supports benchmarking to measure execution time for each rule. To enable benchmarking:
- Set `BENCHMARK_TIMES` to your desired number of repetitions (e.g., `3`; see the sketch after this list)
- Run the pipeline normally
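
For example, in `config.yaml` (an illustrative value, not a recommendation):

```yaml
BENCHMARK_TIMES: 3   # each rule (and therefore the whole pipeline) runs three times
```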
Generate benchmark plots using the provided script:
```bash
python utils/benchmark.py
```

This script reads benchmark data from the `benchmarks/` directory and produces visualization plots.
> [!WARNING]
> The pipeline will run `BENCHMARK_TIMES` times for each input file. If set to `2`, the entire pipeline will run twice to collect benchmark statistics.
If you run into problems, please open an issue with:
- The exact error message
- Your configuration file
- The rule and sample that failed
- Relevant log files from `logs/`
If you use this pipeline in your research, please cite it as:
```bibtex
@software{FastqToGeneCounts,
  author = {Josh Loecker and Brandt Bessell and Bhanwar lal Puniya and Tomáš Helikar},
  title  = {FastqToGeneCounts: Automated Bulk RNA-seq Alignment},
  url    = {https://git.unl.edu/helikarlab/FastqToGeneCounts}
}
```

Contributions are welcome! Please open an issue or submit a merge request for:
- Bug reports
- Feature requests
- Documentation improvements
- Code enhancements