Reproducible Nextflow pipeline benchmarking aligners and variant callers for whole-genome sequencing data, replicating Pucker et al. (2020).
- Workflow manager: Nextflow
- Aligners: BWA-MEM, Bowtie2, GEM3
- Variant callers: GATK HaplotypeCaller, FreeBayes
- Evaluation: hap.py (Illumina), Picard, samtools
- Reference genome: TAIR10 (Arabidopsis thaliana)
- Truth set: Schilbert et al. (2020) gold standard (Nd-1 vs Col-0)
These plots show the distribution of insert sizes before (first row) and after (second row) filtering. Reads with insert size lower than 100 or greater than 2000 were removed.
Before filtereing (raw SAM):
| BWA-MEM | Bowtie2 | GEM3 |
|---|---|---|
![]() |
![]() |
![]() |
After filtering (100 <= insert size <= 2000):
| BWA-MEM | Bowtie2 | GEM3 |
|---|---|---|
![]() |
![]() |
![]() |
The following are some plots showing main alignment metrics:
This table shows an overview of the data:
| Metric | BWA | Bowtie2 | GEM3 |
|---|---|---|---|
| Total reads | 24.0M | 4.85M | 28.2M |
| Mapped reads (%) | 100 | 100 | 100 |
| Properly paired (%) | 99.03 | 97.71 | 100 |
| Improperly paired (%) | 37.3 | 41.7 | 0 |
| Mismatch rate (%) | 0.78 | 2.94 | 1.06 |
| Error rate (%) | 0.70 | 2.18 | 0.76 |
| Indel rate (%) | 0.08 | 0.84 | 0.33 |
| MAPQ=0 reads (%) | 12.1 | 22.9 | 47.9 |
| Insert size (mean; bp) | 221.2 | 299.7 | 245.8 |
| Insert size (stdev; bp) | 79.3 | 52.2 | 131.4 |
Looking at alignment quality, all the reads were aligned to the reference genome for all aligners. It should be noted that the raw SAM out of the aligners were analyzed for insert size, and reads with insert sizes shorter than 100 or longer than 2000 were filtered out. The insert size histograms before and after filtering are shown above.
Most of the reads in BWA-MEM and Bowtie2 had their pair align as expected (matching orientation and distance from each other). 100% of the reads aligned using GEM3 achieved this.
Mapping quality (MAPQ) describes the confidence with which a read has been aligned to a specific region of the genome. A lower MAPQ indicates that a read aligned to multiple regions with equal confidence, increasing the ambiguity of its alignment. BWA-MEM has the lowest proportion of reads with a MAPQ=0, whereas 47.9% of reads aligned by GEM3 are ambiguous. All aligners show similarly high duplication rate, indicating that there are many PCR duplicates detected by all aligners in this dataset.
The second plot shows error metrics for the aligners. The error rate reported here is the proportion of high quality aligned bases (that is, they have Q20 or greater) that are mismatches. The mismatch rate is the overall proportion of all bases that shows mismatches to the reference. The indel rate is the frequency of insertions/deletions in the aligned reads. BWA-MEM consistently shows the lowest scores in all these metrics, while Bowtie2 shows the highest scores, suggesting that BWA-MEM alignments are likely more accurate and of higher quality.
These results align with literature: BWA uses Burrows-Wheeler Transform and Smith Waterman alignment, which allows for gapped alignment. This results in better alignment accuracy for reads with indels and mismatches, which is important for complex and variant-rich genomes such as A. thaliana. Without gapped alignment, many of these reads may map incorrectly with artificial mismatches spread across the indel region or fail to map entirely. While Bowtie2 also supports gapped alignment, it's algorithm prioritizes speed and sometimes chooses suboptimal alignments. BWA-MEM is relatively more conservative and optimized for variant calling accuracy. This supports the lower mismatch and indel rates observed in BWA-MEM compared to Bowtie2 here.
Bowtie2 is faster and more sensitive, and generally works better for aligning large datasets of short reads (<100bp). For example, it would work well for aligning transcriptomic data to transcriptomes. BWA generally offers higher accuracy, particularly for longer or more complex reads, and handles indels well, making it preferred for variant calling and complex genomes.
From these data, BWA-MEM appears to be the best aligner to use for this A. thaliana dataset, considering the low error rates and high alignment quality.
Variant callers being assessed in this project are:
- GATK HaplotypeCaller
- FreeBayes
These metrics are collected from Illumina's hap.py tool, which assesses performance of variant callers against a truth set. I used the gold standard of Nd1 vs TAIR10 sequence variants from Schilbert H, Rempel A & Pucker B (2020) as the truth set.
TP, FP and FN counts:
| Metric | HaplotypeCaller | FreeBayes |
|---|---|---|
| True Positives | 563,832 | 581,205 |
| False Positives | 262,693 | 414,366 |
| False Negatives | 940,135 | 922,762 |
Overall scores:
| Metric | HaplotypeCaller | FreeBayes |
|---|---|---|
| Precision | 0.682 | 0.584 |
| Recall | 0.375 | 0.386 |
| F1 Score | 0.481 | 0.464 |
Scores grouped by variant type:
| Variant Type | Caller | Precision | Recall | F1 Score | Ti/Tv (Query) | Het/Hom (Query) |
|---|---|---|---|---|---|---|
| SNP | HaplotypeCaller | 0.699 | 0.377 | 0.490 | 1.24 | 0.20 |
| SNP | FreeBayes | 0.590 | 0.402 | 0.478 | 1.23 | 0.33 |
| INDEL | HaplotypeCaller | 0.605 | 0.366 | 0.456 | — | 0.17 |
| INDEL | FreeBayes | 0.548 | 0.306 | 0.393 | — | 0.23 |
In general, both variant callers perform similarly based on these metrics. GATK HaplotypeCaller has a better precision and F1 score than FreeBayes for the data in this experiment, which indicates greater accuracy in the variants it is calling. On the other hand, FreeBayes has a slightly better recall (sensitivity) score, indicating that it finds a greater number of true variants. However, FreeBayes also has more false positives than HaplotypeCaller, which contributes to its lower precision score.
The Ti/Tv and Het/Hom ratios are also in expected ranges. Since I am finding intraspecies variants between Nd-1 and Col-0 ecotypes of A. thaliana, the expected Ti/Tv ratio for this comparison is 1.0-1.4. Likewise, a low Het/Hom (<0.5) ratio is expected due to the high amount of inbreeding resulting in a highly homozygous genome.
In the comparison of HaplotypeCaller and FreeBayes for variant calling in BWA-MEM aligned WGS data between A. thaliana ecotypes Col-0 and Nd-1, both tools demonstrated comparable overall performance against the Schilbert et al. (2020) gold standard, with F1 scores of 0.481 and 0.464, respectively.
HaplotypeCaller exhibited higher precision (0.682 vs 0.584) at the cost of marginally lower SNP recall, reflecting its more conservative calling strategy that prioritizes fewer false positives. FreeBayes showed slightly better SNP recall (0.402 vs 0.377) but lower precision, consistent with its multi-sample Bayesian approach that favors sensitivity. Both callers struggled more with indels (F1 0.456-0.393) than SNPs, as expected for short-read data against a PacBio-validated truth set containing complex structural variants.
These results align with published benchmarks for plant genomes, where short-read variant calling against high-confidence gold standards typically yields F1 scores in the 0.4-0.6 range due to coverage variability, repetitive regions, and cross-technology validation challenges. The observed Ti/Tv ratios (~1.24) and low heterozygous/homozygous ratios further reflect the specific biology of ecotype divergence rather than calling artifacts. Overall, HaplotypeCaller emerges as the preferable choice for precision-focused downstream analysis in this Arabidopsis WGS context, while both tools validate the robustness of the BWA-MEM alignment pipeline.
# Install Nextflow
curl -s https://get.nextflow.io | bash
# Run the pipeline
nextflow run vc_pipeline.nfDependencies (BWA, Bowtie2, GEM3, GATK, FreeBayes, samtools, Picard) should be available in your PATH or configured via Docker/conda.
Pucker et al. (2020). Comparison of Read Mapping and Variant Calling Tools for the Analysis of Plant NGS Data. Genes, 11(5), 543.







