WGS Variant Calling Pipeline Benchmarking

Reproducible Nextflow pipeline benchmarking aligners and variant callers for whole-genome sequencing data, replicating Pucker et al. (2020).

Tools & Technologies

Workflow manager: Nextflow
Aligners: BWA-MEM, Bowtie2, GEM3
Variant callers: GATK HaplotypeCaller, FreeBayes
Evaluation: hap.py (Illumina), Picard, samtools
Reference genome: TAIR10 (Arabidopsis thaliana)
Truth set: Schilbert et al. (2020) gold standard (Nd-1 vs Col-0)

Comparing alignment metrics

These plots show the distribution of insert sizes before (first row) and after (second row) filtering. Reads with insert size lower than 100 or greater than 2000 were removed.

Before filtereing (raw SAM):

BWA-MEM	Bowtie2	GEM3

After filtering (100 <= insert size <= 2000):

BWA-MEM	Bowtie2	GEM3

The following are some plots showing main alignment metrics:

This table shows an overview of the data:

Metric	BWA	Bowtie2	GEM3
Total reads	24.0M	4.85M	28.2M
Mapped reads (%)	100	100	100
Properly paired (%)	99.03	97.71	100
Improperly paired (%)	37.3	41.7	0
Mismatch rate (%)	0.78	2.94	1.06
Error rate (%)	0.70	2.18	0.76
Indel rate (%)	0.08	0.84	0.33
MAPQ=0 reads (%)	12.1	22.9	47.9
Insert size (mean; bp)	221.2	299.7	245.8
Insert size (stdev; bp)	79.3	52.2	131.4

Findings

Looking at alignment quality, all the reads were aligned to the reference genome for all aligners. It should be noted that the raw SAM out of the aligners were analyzed for insert size, and reads with insert sizes shorter than 100 or longer than 2000 were filtered out. The insert size histograms before and after filtering are shown above.

Most of the reads in BWA-MEM and Bowtie2 had their pair align as expected (matching orientation and distance from each other). 100% of the reads aligned using GEM3 achieved this.

Mapping quality (MAPQ) describes the confidence with which a read has been aligned to a specific region of the genome. A lower MAPQ indicates that a read aligned to multiple regions with equal confidence, increasing the ambiguity of its alignment. BWA-MEM has the lowest proportion of reads with a MAPQ=0, whereas 47.9% of reads aligned by GEM3 are ambiguous. All aligners show similarly high duplication rate, indicating that there are many PCR duplicates detected by all aligners in this dataset.

The second plot shows error metrics for the aligners. The error rate reported here is the proportion of high quality aligned bases (that is, they have Q20 or greater) that are mismatches. The mismatch rate is the overall proportion of all bases that shows mismatches to the reference. The indel rate is the frequency of insertions/deletions in the aligned reads. BWA-MEM consistently shows the lowest scores in all these metrics, while Bowtie2 shows the highest scores, suggesting that BWA-MEM alignments are likely more accurate and of higher quality.

These results align with literature: BWA uses Burrows-Wheeler Transform and Smith Waterman alignment, which allows for gapped alignment. This results in better alignment accuracy for reads with indels and mismatches, which is important for complex and variant-rich genomes such as A. thaliana. Without gapped alignment, many of these reads may map incorrectly with artificial mismatches spread across the indel region or fail to map entirely. While Bowtie2 also supports gapped alignment, it's algorithm prioritizes speed and sometimes chooses suboptimal alignments. BWA-MEM is relatively more conservative and optimized for variant calling accuracy. This supports the lower mismatch and indel rates observed in BWA-MEM compared to Bowtie2 here.

Bowtie2 is faster and more sensitive, and generally works better for aligning large datasets of short reads (<100bp). For example, it would work well for aligning transcriptomic data to transcriptomes. BWA generally offers higher accuracy, particularly for longer or more complex reads, and handles indels well, making it preferred for variant calling and complex genomes.

From these data, BWA-MEM appears to be the best aligner to use for this A. thaliana dataset, considering the low error rates and high alignment quality.

Variant Callers

Variant callers being assessed in this project are:

GATK HaplotypeCaller
FreeBayes

Comparing variant caller metrics

These metrics are collected from Illumina's hap.py tool, which assesses performance of variant callers against a truth set. I used the gold standard of Nd1 vs TAIR10 sequence variants from Schilbert H, Rempel A & Pucker B (2020) as the truth set.

TP, FP and FN counts:

Metric	HaplotypeCaller	FreeBayes
True Positives	563,832	581,205
False Positives	262,693	414,366
False Negatives	940,135	922,762

Overall scores:

Metric	HaplotypeCaller	FreeBayes
Precision	0.682	0.584
Recall	0.375	0.386
F1 Score	0.481	0.464

Scores grouped by variant type:

Variant Type	Caller	Precision	Recall	F1 Score	Ti/Tv (Query)	Het/Hom (Query)
SNP	HaplotypeCaller	0.699	0.377	0.490	1.24	0.20
SNP	FreeBayes	0.590	0.402	0.478	1.23	0.33
INDEL	HaplotypeCaller	0.605	0.366	0.456	—	0.17
INDEL	FreeBayes	0.548	0.306	0.393	—	0.23

Findings

In general, both variant callers perform similarly based on these metrics. GATK HaplotypeCaller has a better precision and F1 score than FreeBayes for the data in this experiment, which indicates greater accuracy in the variants it is calling. On the other hand, FreeBayes has a slightly better recall (sensitivity) score, indicating that it finds a greater number of true variants. However, FreeBayes also has more false positives than HaplotypeCaller, which contributes to its lower precision score.

The Ti/Tv and Het/Hom ratios are also in expected ranges. Since I am finding intraspecies variants between Nd-1 and Col-0 ecotypes of A. thaliana, the expected Ti/Tv ratio for this comparison is 1.0-1.4. Likewise, a low Het/Hom (<0.5) ratio is expected due to the high amount of inbreeding resulting in a highly homozygous genome.

Conclusions

In the comparison of HaplotypeCaller and FreeBayes for variant calling in BWA-MEM aligned WGS data between A. thaliana ecotypes Col-0 and Nd-1, both tools demonstrated comparable overall performance against the Schilbert et al. (2020) gold standard, with F1 scores of 0.481 and 0.464, respectively.

HaplotypeCaller exhibited higher precision (0.682 vs 0.584) at the cost of marginally lower SNP recall, reflecting its more conservative calling strategy that prioritizes fewer false positives. FreeBayes showed slightly better SNP recall (0.402 vs 0.377) but lower precision, consistent with its multi-sample Bayesian approach that favors sensitivity. Both callers struggled more with indels (F1 0.456-0.393) than SNPs, as expected for short-read data against a PacBio-validated truth set containing complex structural variants.

These results align with published benchmarks for plant genomes, where short-read variant calling against high-confidence gold standards typically yields F1 scores in the 0.4-0.6 range due to coverage variability, repetitive regions, and cross-technology validation challenges. The observed Ti/Tv ratios (~1.24) and low heterozygous/homozygous ratios further reflect the specific biology of ecotype divergence rather than calling artifacts. Overall, HaplotypeCaller emerges as the preferable choice for precision-focused downstream analysis in this Arabidopsis WGS context, while both tools validate the robustness of the BWA-MEM alignment pipeline.

Setup

# Install Nextflow
curl -s https://get.nextflow.io | bash

# Run the pipeline
nextflow run vc_pipeline.nf

Dependencies (BWA, Bowtie2, GEM3, GATK, FreeBayes, samtools, Picard) should be available in your PATH or configured via Docker/conda.

Reference

Pucker et al. (2020). Comparison of Read Mapping and Variant Calling Tools for the Analysis of Plant NGS Data. Genes, 11(5), 543.

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
nextflow		nextflow
plots		plots
README.md		README.md
var_caller_labnotebook.md		var_caller_labnotebook.md
vc_pipeline.nf		vc_pipeline.nf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

WGS Variant Calling Pipeline Benchmarking

Tools & Technologies

Comparing alignment metrics

Findings

Variant Callers

Comparing variant caller metrics

Findings

Conclusions

Setup

Reference

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

WGS Variant Calling Pipeline Benchmarking

Tools & Technologies

Comparing alignment metrics

Findings

Variant Callers

Comparing variant caller metrics

Findings

Conclusions

Setup

Reference

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages