A modular, containerized NGS pipeline for RNA-seq, long-read, and metagenomic analysis
OmicsFlow is a production-ready bioinformatics pipeline built with Nextflow and Docker, designed for reproducible multi-omics data analysis. It supports three major sequencing technologies and workflows:
| Workflow | Technology | Key tools | Status |
|---|---|---|---|
| `rnaseq.nf` | Illumina short reads | FastQC · STAR · Salmon · DESeq2 | ✅ Stable |
| `longread.nf` | Oxford Nanopore (ONT) | NanoStat · Minimap2 · Samtools | 🚧 In development |
| `metagenomics.nf` | Illumina / ONT | Kraken2 · Bracken | 🚧 In development |
All workflows are fully containerized via Docker and can run locally, on HPC clusters (SLURM/PBS), or in the cloud (AWS Batch).
Benchmarked on the nf-core test dataset (S. cerevisiae, GSE110004, 4 samples × 50,000 reads):
| Metric | Value |
|---|---|
| Input reads per sample | 50,000 |
| Reads passing QC | 99.5% |
| Adapter contamination (auto-detected & removed) | 40.3% |
| Uniquely mapped reads (STAR) | 81.8% to 84.6% |
| Properly paired reads | 100% |
| Mismatch rate | 0.9% |
| Pipeline execution time (4 samples, 4 CPUs) | ~8 min |
| Docker image size | 4.63 GB |
OmicsFlow is flexible: you can use the full pipeline or individual tools, depending on your needs.
| Use case | What you need |
|---|---|
| Quality control only | FASTQ files |
| Trimming only | FASTQ files |
| Alignment (STAR) | FASTQ files + reference genome + GTF + STAR index |
| Quantification (Salmon) | FASTQ files + Salmon index |
| Statistics (Samtools) | An existing BAM file |
| Differential expression | Salmon counts + sample metadata |
| Python / R analysis | Your own data + scripts |
You do not need to prepare everything upfront. Start with what you have and add steps as needed.
If you plan to use STAR for alignment, you need a reference genome and its annotation.
If you already have a STAR index on your server or HPC, just point to it with `--genomeDir`. There is no need to rebuild it: any STAR-compatible index works, regardless of how it was generated.
If you need to build one (a one-time operation, ~45 min for the full human genome):
# Download reference genome (human GRCh38) into ./genome
mkdir -p genome
wget -P genome https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# Download gene annotation
wget -P genome https://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz
# STAR needs uncompressed FASTA and GTF files
gunzip genome/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip genome/Homo_sapiens.GRCh38.109.gtf.gz
# Build STAR index using OmicsFlow Docker
docker run --rm -v "$(pwd):/data" smill/omicsflow:1.0.0 \
  bash -c "mkdir -p /data/star_index && STAR --runMode genomeGenerate \
    --genomeDir /data/star_index \
    --genomeFastaFiles /data/genome/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
    --sjdbGTFfile /data/genome/Homo_sapiens.GRCh38.109.gtf \
    --runThreadN 8"
⚠️ Build the index once, store it, and reuse it for all your experiments.
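Before pointing `--genomeDir` at an existing directory, it can help to check that it actually looks like a STAR index. The sketch below is a heuristic only: it tests for a small subset of the files that STAR's `genomeGenerate` mode writes, and the function names are illustrative rather than part of OmicsFlow.

```python
from pathlib import Path

# A minimal subset of the files STAR writes into an index directory.
# Used here as a heuristic; this is not an exhaustive list.
REQUIRED_INDEX_FILES = ("Genome", "SA", "SAindex", "genomeParameters.txt")


def missing_index_files(index_dir):
    """Return the required index files that are absent from index_dir."""
    root = Path(index_dir)
    return [name for name in REQUIRED_INDEX_FILES if not (root / name).is_file()]


def looks_like_star_index(index_dir):
    """True if every required file is present (a heuristic, not a validation)."""
    return not missing_index_files(index_dir)
```

For example, `missing_index_files("star_index")` returns an empty list when all four files are present, and otherwise lists the files that are missing.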
Everything is already inside the Docker image:
| Tool | Without OmicsFlow | With OmicsFlow |
|---|---|---|
| FastQC | Manual install | ✅ Included |
| Trim Galore | Manual install | ✅ Included |
| STAR | Compile from source | ✅ Included |
| Salmon | Manual install | ✅ Included |
| Samtools | Compile from source | ✅ Included |
| DESeq2 | R + Bioconductor setup | ✅ Included |
| MultiQC | pip install | ✅ Included |
| BioPython | pip install | ✅ Included |
| numpy / pandas / matplotlib | pip install | ✅ Included |
| NanoStat / NanoPlot | pip install | ✅ Included |
| Kraken2 | Manual install | ✅ Included |
| Minimap2 | Manual install | ✅ Included |
The easiest way to use OmicsFlow: just Docker, with no other installation needed.
# Pull the image
docker pull smill/omicsflow:1.0.0
# Step 1: Quality control (FastQC)
# FastQC does not create the output directory, so create it first
docker run --rm -v "$(pwd)/data:/data" smill/omicsflow:1.0.0 \
  bash -c "mkdir -p /data/qc && fastqc /data/sample_R1.fastq.gz /data/sample_R2.fastq.gz --outdir /data/qc"
# Step 2: Adapter trimming (Trim Galore)
docker run --rm -v "$(pwd)/data:/data" smill/omicsflow:1.0.0 \
  bash -c "trim_galore --paired --cores 4 \
    /data/sample_R1.fastq.gz /data/sample_R2.fastq.gz \
    -o /data/trimmed"
# Step 3: Alignment (STAR)
# Create the output directory first; STAR may not create it for a file prefix
docker run --rm -v "$(pwd)/data:/data" smill/omicsflow:1.0.0 \
  bash -c "mkdir -p /data/aligned && STAR --runMode alignReads \
    --genomeDir /data/star_index \
    --readFilesIn /data/trimmed/sample_R1_val_1.fq.gz /data/trimmed/sample_R2_val_2.fq.gz \
    --readFilesCommand zcat \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix /data/aligned/sample. \
    --runThreadN 4"
# Step 4: Quantification (Salmon)
docker run --rm -v "$(pwd)/data:/data" smill/omicsflow:1.0.0 \
  bash -c "salmon quant \
    --index /data/salmon_index \
    --libType A \
    -1 /data/trimmed/sample_R1_val_1.fq.gz \
    -2 /data/trimmed/sample_R2_val_2.fq.gz \
    --output /data/counts/sample \
    --threads 4 \
    --validateMappings"
# Step 5: BAM statistics (Samtools)
docker run --rm -v "$(pwd)/data:/data" smill/omicsflow:1.0.0 \
  bash -c "samtools flagstat /data/aligned/sample.Aligned.sortedByCoord.out.bam"
# Step 6: Aggregated QC report (MultiQC)
docker run --rm -v "$(pwd)/data:/data" smill/omicsflow:1.0.0 \
  bash -c "multiqc /data --outdir /data/multiqc"
# Interactive R session (DESeq2, ggplot2...)
docker run --rm -it -v "$(pwd)/data:/data" smill/omicsflow:1.0.0 R
# Interactive Python session (biopython, pandas, matplotlib...)
docker run --rm -it -v "$(pwd)/data:/data" smill/omicsflow:1.0.0 python3

Windows users: replace `$(pwd)` with `%cd%` in CMD, or use the full path.
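Every step above follows the same pattern: mount a host directory at `/data`, then run one tool command inside the image. The sketch below shows one way that pattern could be wrapped in Python for scripting several samples. The helper names `docker_command` and `fastqc_step` are hypothetical and not part of OmicsFlow; only the image tag and the tool invocations come from this README.

```python
import shlex
from pathlib import Path

IMAGE = "smill/omicsflow:1.0.0"  # image tag used throughout this README


def docker_command(tool_command, data_dir="data", image=IMAGE):
    """Build the argument list for one containerized step.

    tool_command is the shell command to run inside the container, where
    the host directory data_dir is mounted at /data.
    """
    host_dir = Path(data_dir).resolve()  # Docker needs an absolute host path
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:/data",
        image,
        "bash", "-c", tool_command,
    ]


def fastqc_step(sample, outdir="/data/qc"):
    """Shell command for the FastQC step, assuming <sample>_R1/_R2.fastq.gz names."""
    reads = [f"/data/{sample}_R1.fastq.gz", f"/data/{sample}_R2.fastq.gz"]
    quoted_reads = " ".join(shlex.quote(read) for read in reads)
    quoted_out = shlex.quote(outdir)
    return f"mkdir -p {quoted_out} && fastqc {quoted_reads} --outdir {quoted_out}"
```

Passing the result to `subprocess.run(docker_command(fastqc_step("sample")), check=True)` would execute the step. Building an argument list rather than a single string avoids quoting problems when the host path contains spaces.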
- Nextflow ≥ 22.10
- Docker or Singularity
- Java 17+
# Clone the repository
git clone https://github.com/Millimono/OmicsFlow.git
cd OmicsFlow
# Run RNA-seq pipeline with test data
nextflow run workflows/rnaseq.nf \
--input data/test/samplesheet.csv \
--genome GRCh38 \
--outdir results/ \
-profile docker

The samplesheet is a CSV file with one row per sample:

sample,fastq_1,fastq_2,strandedness
ctrl_rep1,/path/to/ctrl_rep1_R1.fastq.gz,/path/to/ctrl_rep1_R2.fastq.gz,reverse
ctrl_rep2,/path/to/ctrl_rep2_R1.fastq.gz,/path/to/ctrl_rep2_R2.fastq.gz,reverse
treat_rep1,/path/to/treat_rep1_R1.fastq.gz,/path/to/treat_rep1_R2.fastq.gz,reverse
treat_rep2,/path/to/treat_rep2_R1.fastq.gz,/path/to/treat_rep2_R2.fastq.gz,reverse

Strandedness: use `reverse` for most Illumina TruSeq kits, `forward` for some stranded protocols, or `unstranded` if unsure.
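A malformed samplesheet is easier to catch before launching a run. The following is a minimal validation sketch, not the check the pipeline itself performs. It assumes the four-column header shown above, the three strandedness values named in the note, and, as an extra assumption, that read files end in `.fastq.gz` or `.fq.gz`.

```python
import csv

EXPECTED_COLUMNS = ["sample", "fastq_1", "fastq_2", "strandedness"]
ALLOWED_STRANDEDNESS = {"forward", "reverse", "unstranded"}
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz")  # assumption: gzipped FASTQ only


def validate_samplesheet(handle):
    """Return a list of human-readable problems; an empty list means OK."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        expected = ",".join(EXPECTED_COLUMNS)
        return [f"header should be {expected}, got {reader.fieldnames}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if not (row["sample"] or "").strip():
            problems.append(f"line {line_no}: empty sample name")
        for column in ("fastq_1", "fastq_2"):
            path = (row[column] or "").strip()
            if not path.endswith(FASTQ_SUFFIXES):
                problems.append(f"line {line_no}: {column} is not a gzipped FASTQ: '{path}'")
        strand = (row["strandedness"] or "").strip()
        if strand not in ALLOWED_STRANDEDNESS:
            problems.append(f"line {line_no}: unknown strandedness '{strand}'")
    return problems
```

A typical call would be `with open("samplesheet.csv", newline="") as fh: print(validate_samplesheet(fh))`.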
OmicsFlow/
├── workflows/
│   ├── rnaseq.nf           # ✅ RNA-seq Illumina pipeline (stable)
│   ├── longread.nf         # 🚧 Nanopore long-read pipeline (in development)
│   └── metagenomics.nf     # 🚧 Metagenomic pipeline (in development)
│
├── modules/
│   ├── qc/                 # FastQC, MultiQC, NanoStat
│   ├── alignment/          # STAR, Minimap2, Samtools
│   └── quantification/     # Salmon, DESeq2
│
├── analysis/
│   ├── deseq2.R            # Differential expression (DESeq2 / edgeR)
│   ├── plots.py            # Heatmaps, volcano plots, PCA
│   └── report.Rmd          # Automated HTML report template
│
├── containers/
│   └── Dockerfile          # All tools in one reproducible image
│
├── data/
│   └── test/               # Public mini-datasets for testing
│       ├── samplesheet.csv
│       └── reads/          # nf-core GSE110004 subset (Illumina)
│
├── docs/                   # Documentation (GitHub Pages)
├── .github/
│   └── workflows/
│       └── ci.yml          # GitHub Actions CI/CD
└── nextflow.config         # Profiles: local, cluster, cloud
Designed for bulk RNA-seq analysis from raw FASTQ to differential expression.
Input FASTQ
    │
    ▼
[FastQC] ──────────────────────────> QC report
    │
    ▼
[Trim Galore] ──> Trimmed reads
    │
    ▼
[STAR] ──> Aligned BAM + splice junctions
    │
    ▼
[Salmon] ──> Gene/transcript counts
    │
    ▼
[DESeq2 / edgeR] ──> Differential expression
    │
    ▼
[MultiQC] ──> Aggregated QC report (HTML)
Output files:
- `results/qc/`: FastQC + MultiQC reports
- `results/aligned/`: BAM files + index
- `results/counts/`: Salmon quantification
- `results/deseq2/`: DE results, volcano plots, heatmaps
- `results/report.html`: Full automated HTML report
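For a custom analysis outside the pipeline, the Salmon output under `results/counts/` can be read directly. The sketch below merges several `quant.sf` files into one transcript-by-sample table using only the standard library. It relies on the standard `quant.sf` columns (`Name`, `Length`, `EffectiveLength`, `TPM`, `NumReads`); the function names and the one-`quant.sf`-per-sample layout are assumptions for illustration.

```python
import csv


def read_quant_sf(path, column="NumReads"):
    """Map transcript name to one numeric column of a Salmon quant.sf file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["Name"]: float(row[column]) for row in reader}


def merge_counts(quant_files, column="NumReads"):
    """Combine {sample: quant.sf path} into {transcript: {sample: value}}.

    Transcripts missing from a sample are filled with 0.0.
    """
    matrix = {}
    for sample, path in quant_files.items():
        for name, value in read_quant_sf(path, column).items():
            matrix.setdefault(name, {})[sample] = value
    for row in matrix.values():
        for sample in quant_files:
            row.setdefault(sample, 0.0)
    return matrix
```

In R, the usual route into DESeq2 is the `tximport` package, which also handles transcript-to-gene summarization; this Python version is only meant for quick inspection.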
For Nanopore sequencing data. Coming soon; the tools are already available in the Docker image.
Input FASTQ (ONT)
    │
    ▼
[NanoStat / NanoPlot] ──> Read quality stats
    │
    ▼
[Minimap2] ──> Aligned BAM
    │
    ▼
[Samtools] ──> Sorted + indexed BAM
    │
    ▼
[MultiQC] ──> Aggregated report
In the meantime, you can use these tools individually via Docker (see the Docker section above).
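One of the read-length statistics NanoStat reports is the N50. To make that number concrete, here is a small standalone sketch of how an N50 can be computed from a FASTQ file. It is not NanoStat's implementation, and it assumes plain four-line FASTQ records (no wrapped sequence lines).

```python
import gzip


def fastq_read_lengths(path):
    """Return the length of every read in a FASTQ file (gzipped or plain)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # In a four-line record, the sequence is the second line (index 1).
        return [len(line.rstrip("\r\n")) for i, line in enumerate(handle) if i % 4 == 1]


def n50(read_lengths):
    """Length L such that reads of length >= L contain at least half of all bases."""
    if not read_lengths:
        raise ValueError("no reads")
    ordered = sorted(read_lengths, reverse=True)
    half_of_bases = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= half_of_bases:
            return length
```

For example, `n50(fastq_read_lengths("reads.fastq.gz"))` gives a single summary value that is less sensitive to many very short reads than the mean length.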
Taxonomic classification and abundance profiling. Coming soon; Kraken2 is already available in the Docker image.
Input FASTQ
    │
    ▼
[FastQC + Trim Galore] ──> Clean reads
    │
    ▼
[Kraken2] ──> Taxonomic classification
    │
    ▼
[Bracken] ──> Abundance re-estimation
In the meantime, you can run Kraken2 directly:
docker run --rm -v "$(pwd):/data" smill/omicsflow:1.0.0 \
  bash -c "kraken2 --db /data/kraken2_db --paired \
    /data/R1.fastq.gz /data/R2.fastq.gz \
    --output /data/kraken2_output.txt \
    --report /data/kraken2_report.txt"
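The file written by `--report` is a tab-separated table, which makes it easy to summarize until the full workflow is available. The sketch below assumes the standard six-column Kraken2 report (percentage, clade reads, directly assigned reads, rank code, NCBI taxid, indented name), which is what the command above produces without extra report options. The function names are illustrative.

```python
def parse_kraken2_report(lines):
    """Parse standard six-column Kraken2 report lines into dictionaries."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            continue  # skip blank lines or reports with extra columns
        percent, clade_reads, taxon_reads, rank, taxid, name = fields
        records.append({
            "percent": float(percent),
            "clade_reads": int(clade_reads),
            "taxon_reads": int(taxon_reads),
            "rank": rank.strip(),
            "taxid": int(taxid),
            "name": name.strip(),  # names are indented to show tree depth
        })
    return records


def top_taxa(records, rank="S", n=5):
    """Return the n taxa at a given rank code with the most clade-level reads."""
    selected = [record for record in records if record["rank"] == rank]
    return sorted(selected, key=lambda record: record["clade_reads"], reverse=True)[:n]
```

Calling `top_taxa(parse_kraken2_report(open("kraken2_report.txt")), rank="S")` lists the most abundant species; rank code `G` would do the same at genus level.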
| Category | Tools | Versions |
|---|---|---|
| Pipeline orchestration | Nextflow DSL2 | ≥ 22.10 |
| Containerization | Docker · Singularity | 28.x |
| QC | FastQC · MultiQC · NanoStat · NanoPlot | 0.12.1 · 1.35 · 1.6.0 |
| Alignment | STAR · Minimap2 | 2.7.11b · 2.31 |
| Quantification | Salmon | 1.12.0 |
| Variant calling | Samtools · BCFtools | 1.23.1 |
| Metagenomics | Kraken2 | 2.1.3 |
| Statistical analysis | DESeq2 · edgeR · R | R 4.5.2 |
| Visualization | ggplot2 · matplotlib · seaborn | - |
| Languages | Python · R · Bash · C · C++ | Python 3.x |
| CI/CD | GitHub Actions | - |
| Documentation | GitHub Pages | - |
Test data used during development (publicly available):
| Dataset | Source | Size | Used for |
|---|---|---|---|
| GSE110004 / SRR6357070-71 (4 samples) | nf-core test datasets | ~8 MB | RNA-seq validation |
| S. cerevisiae R64-1-1 genome | nf-core test datasets | ~230 KB | Reference genome |
| S. cerevisiae gene annotation | nf-core test datasets | ~200 KB | Gene annotation |
OmicsFlow supports multiple execution profiles defined in `nextflow.config`:
profiles {
    docker {
        docker.enabled = true
        process.executor = 'local'
    }
    cluster {
        process.executor = 'slurm'
        singularity.enabled = true
        process.queue = 'normal'
    }
    cloud {
        process.executor = 'awsbatch'
        aws.region = 'ca-central-1'
    }
    test {
        params.input = "${projectDir}/data/test/samplesheet.csv"
        params.outdir = 'results_test'
        docker.enabled = true
    }
}

Every run generates a timestamped output directory:
results/
├── qc/
│   ├── fastqc/                  # Per-sample FastQC reports (HTML)
│   └── multiqc_report.html      # Aggregated QC report
├── trimmed/
│   └── logs/                    # Trim Galore trimming reports
├── aligned/
│   ├── sample.Aligned.sortedByCoord.out.bam
│   └── sample.Log.final.out     # Mapping statistics
├── counts/
│   └── salmon/                  # Transcript-level quantification
│       └── quant.sf
├── deseq2/
│   ├── deseq2_results.csv       # DE genes table
│   ├── volcano_plot.pdf         # Volcano plot
│   ├── heatmap_top50.pdf        # Top 50 DE genes heatmap
│   └── pca_plot.pdf             # PCA plot
└── pipeline_info/
    ├── execution_report.html
    └── execution_timeline.html
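The mapping statistics in `sample.Log.final.out` are plain text, with one `name | value` pair per line, so they can be pulled out with a few lines of Python. This is a minimal sketch for spot-checking a run, not a replacement for the MultiQC report; the function names are illustrative.

```python
def parse_star_log(lines):
    """Parse a STAR Log.final.out into a {metric name: raw value} dictionary."""
    stats = {}
    for line in lines:
        if "|" not in line:
            continue  # section headers (e.g. "UNIQUE READS:") and blank lines
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def uniquely_mapped_percent(stats):
    """Return the 'Uniquely mapped reads %' metric as a float."""
    return float(stats["Uniquely mapped reads %"].rstrip("%"))
```

For example, `uniquely_mapped_percent(parse_star_log(open("results/aligned/sample.Log.final.out")))` returns the uniquely mapped percentage for that sample, which can then be compared against the range reported in the benchmark table.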
This pipeline was developed in conjunction with research in AI-based medical imaging and bioinformatics:
- MalariaScan: AI detection of malaria via blood microscopy. Prix Coup de cœur Jean-Marc Léger, UdeM 2025.
- HAtt-CNN: Adaptive visual attention supervision with heuristic masks for CNN interpretability. (Under review 2026)
- EpitopeNet: Backpropagation-free prototype learning inspired by B-cell dynamics for mammography classification. 76.03% accuracy on MiniDDSM. (Under review 2026)
The `analysis/` module is extensible: ML models from the above projects can be integrated as additional pipeline steps.
Full documentation available at: millimono.github.io/OmicsFlow
Contributions welcome! Please open a pull request.
git clone https://github.com/Millimono/OmicsFlow.git
cd OmicsFlow
git checkout -b feature/my-new-module

If you use OmicsFlow, please cite:

Millimono, S. (2026). OmicsFlow: A modular containerized NGS pipeline
for reproducible multi-omics analysis (v1.0.1). Zenodo.
https://doi.org/10.5281/zenodo.20677900
Sory Millimono
PhD Candidate in AI · Bioinformatician
Université de Montréal & Mohammed V University, ENSIAS
- millimono64.sm@gmail.com
- LinkedIn
- Google Scholar: h-index 1 · 24 citations
- ORCID: 0009-0005-1960-9136
MIT License: see LICENSE for details.



