Skip to content

layaasiv/variant-calling

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

40 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

WGS Variant Calling Pipeline Benchmarking

Reproducible Nextflow pipeline benchmarking aligners and variant callers for whole-genome sequencing data, replicating Pucker et al. (2020).

Tools & Technologies

  • Workflow manager: Nextflow
  • Aligners: BWA-MEM, Bowtie2, GEM3
  • Variant callers: GATK HaplotypeCaller, FreeBayes
  • Evaluation: hap.py (Illumina), Picard, samtools
  • Reference genome: TAIR10 (Arabidopsis thaliana)
  • Truth set: Schilbert et al. (2020) gold standard (Nd-1 vs Col-0)

Comparing alignment metrics

These plots show the distribution of insert sizes before (first row) and after (second row) filtering. Reads with insert size lower than 100 or greater than 2000 were removed.

Before filtereing (raw SAM):

BWA-MEM Bowtie2 GEM3

After filtering (100 <= insert size <= 2000):

BWA-MEM Bowtie2 GEM3

The following are some plots showing main alignment metrics:

This plot comapares various alignment metrics collected from picard and samtools.

This plot compares error rates between the aligners.

This table shows an overview of the data:

Metric BWA Bowtie2 GEM3
Total reads 24.0M 4.85M 28.2M
Mapped reads (%) 100 100 100
Properly paired (%) 99.03 97.71 100
Improperly paired (%) 37.3 41.7 0
Mismatch rate (%) 0.78 2.94 1.06
Error rate (%) 0.70 2.18 0.76
Indel rate (%) 0.08 0.84 0.33
MAPQ=0 reads (%) 12.1 22.9 47.9
Insert size (mean; bp) 221.2 299.7 245.8
Insert size (stdev; bp) 79.3 52.2 131.4

Findings

Looking at alignment quality, all the reads were aligned to the reference genome for all aligners. It should be noted that the raw SAM out of the aligners were analyzed for insert size, and reads with insert sizes shorter than 100 or longer than 2000 were filtered out. The insert size histograms before and after filtering are shown above.

Most of the reads in BWA-MEM and Bowtie2 had their pair align as expected (matching orientation and distance from each other). 100% of the reads aligned using GEM3 achieved this.

Mapping quality (MAPQ) describes the confidence with which a read has been aligned to a specific region of the genome. A lower MAPQ indicates that a read aligned to multiple regions with equal confidence, increasing the ambiguity of its alignment. BWA-MEM has the lowest proportion of reads with a MAPQ=0, whereas 47.9% of reads aligned by GEM3 are ambiguous. All aligners show similarly high duplication rate, indicating that there are many PCR duplicates detected by all aligners in this dataset.

The second plot shows error metrics for the aligners. The error rate reported here is the proportion of high quality aligned bases (that is, they have Q20 or greater) that are mismatches. The mismatch rate is the overall proportion of all bases that shows mismatches to the reference. The indel rate is the frequency of insertions/deletions in the aligned reads. BWA-MEM consistently shows the lowest scores in all these metrics, while Bowtie2 shows the highest scores, suggesting that BWA-MEM alignments are likely more accurate and of higher quality.

These results align with literature: BWA uses Burrows-Wheeler Transform and Smith Waterman alignment, which allows for gapped alignment. This results in better alignment accuracy for reads with indels and mismatches, which is important for complex and variant-rich genomes such as A. thaliana. Without gapped alignment, many of these reads may map incorrectly with artificial mismatches spread across the indel region or fail to map entirely. While Bowtie2 also supports gapped alignment, it's algorithm prioritizes speed and sometimes chooses suboptimal alignments. BWA-MEM is relatively more conservative and optimized for variant calling accuracy. This supports the lower mismatch and indel rates observed in BWA-MEM compared to Bowtie2 here.

Bowtie2 is faster and more sensitive, and generally works better for aligning large datasets of short reads (<100bp). For example, it would work well for aligning transcriptomic data to transcriptomes. BWA generally offers higher accuracy, particularly for longer or more complex reads, and handles indels well, making it preferred for variant calling and complex genomes.

From these data, BWA-MEM appears to be the best aligner to use for this A. thaliana dataset, considering the low error rates and high alignment quality.

Variant Callers

Variant callers being assessed in this project are:

  • GATK HaplotypeCaller
  • FreeBayes

Comparing variant caller metrics

These metrics are collected from Illumina's hap.py tool, which assesses performance of variant callers against a truth set. I used the gold standard of Nd1 vs TAIR10 sequence variants from Schilbert H, Rempel A & Pucker B (2020) as the truth set.

TP, FP and FN counts:

Metric HaplotypeCaller FreeBayes
True Positives 563,832 581,205
False Positives 262,693 414,366
False Negatives 940,135 922,762

Overall scores:

Metric HaplotypeCaller FreeBayes
Precision 0.682 0.584
Recall 0.375 0.386
F1 Score 0.481 0.464

Scores grouped by variant type:

Variant Type Caller Precision Recall F1 Score Ti/Tv (Query) Het/Hom (Query)
SNP HaplotypeCaller 0.699 0.377 0.490 1.24 0.20
SNP FreeBayes 0.590 0.402 0.478 1.23 0.33
INDEL HaplotypeCaller 0.605 0.366 0.456 0.17
INDEL FreeBayes 0.548 0.306 0.393 0.23

Findings

In general, both variant callers perform similarly based on these metrics. GATK HaplotypeCaller has a better precision and F1 score than FreeBayes for the data in this experiment, which indicates greater accuracy in the variants it is calling. On the other hand, FreeBayes has a slightly better recall (sensitivity) score, indicating that it finds a greater number of true variants. However, FreeBayes also has more false positives than HaplotypeCaller, which contributes to its lower precision score.

The Ti/Tv and Het/Hom ratios are also in expected ranges. Since I am finding intraspecies variants between Nd-1 and Col-0 ecotypes of A. thaliana, the expected Ti/Tv ratio for this comparison is 1.0-1.4. Likewise, a low Het/Hom (<0.5) ratio is expected due to the high amount of inbreeding resulting in a highly homozygous genome.

Conclusions

In the comparison of HaplotypeCaller and FreeBayes for variant calling in BWA-MEM aligned WGS data between A. thaliana ecotypes Col-0 and Nd-1, both tools demonstrated comparable overall performance against the Schilbert et al. (2020) gold standard, with F1 scores of 0.481 and 0.464, respectively.

HaplotypeCaller exhibited higher precision (0.682 vs 0.584) at the cost of marginally lower SNP recall, reflecting its more conservative calling strategy that prioritizes fewer false positives. FreeBayes showed slightly better SNP recall (0.402 vs 0.377) but lower precision, consistent with its multi-sample Bayesian approach that favors sensitivity. Both callers struggled more with indels (F1 0.456-0.393) than SNPs, as expected for short-read data against a PacBio-validated truth set containing complex structural variants.

These results align with published benchmarks for plant genomes, where short-read variant calling against high-confidence gold standards typically yields F1 scores in the 0.4-0.6 range due to coverage variability, repetitive regions, and cross-technology validation challenges. The observed Ti/Tv ratios (~1.24) and low heterozygous/homozygous ratios further reflect the specific biology of ecotype divergence rather than calling artifacts. Overall, HaplotypeCaller emerges as the preferable choice for precision-focused downstream analysis in this Arabidopsis WGS context, while both tools validate the robustness of the BWA-MEM alignment pipeline.

Setup

# Install Nextflow
curl -s https://get.nextflow.io | bash

# Run the pipeline
nextflow run vc_pipeline.nf

Dependencies (BWA, Bowtie2, GEM3, GATK, FreeBayes, samtools, Picard) should be available in your PATH or configured via Docker/conda.

Reference

Pucker et al. (2020). Comparison of Read Mapping and Variant Calling Tools for the Analysis of Plant NGS Data. Genes, 11(5), 543.

About

Comparing different tools used in the variant calling pipeline. Replicating the work done in this paper: Comparison of Read Mapping and Variant Calling Tools for the Analysis of Plant NGS Data (Pucker et al. 2020).

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors