853tony/VarFilter

VarFilter

A comprehensive variant filtering and population genetics analysis pipeline for whole-exome sequencing (WES) data from gnomAD v2.1.1, designed to identify population-specific genetic variants through systematic quality control and allele frequency analysis across seven ancestral populations.

This study identifies population-specific common and rare genetic variants by analyzing allele frequency differences across populations using gnomAD v2.1.1 exome sequencing data. We developed a customized filtering pipeline that performs rigorous quality control and stratified analysis across seven genetic ancestries: East Asian (EAS), South Asian (SAS), Non-Finnish European (NFE), Finnish (FIN), African (AFR), Admixed American (AMR), and Ashkenazi Jewish (ASJ).

Variant extraction

First, we used BCFtools to decompress the VCF files and compute variant statistics for each chromosome. Next, we wrote a Python script that uses the cyvcf2 package to extract allele frequencies and other relevant annotations from the VCF files and write the results in a standard TSV format.
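The extraction step could look roughly like the sketch below. It is not the repository's actual script: the function names, the column layout, and the `NA` placeholder are illustrative assumptions. The INFO keys (`AC_eas`, `AN_eas`, `AF_eas`, and so on) follow the per-population naming used in the gnomAD v2.1.1 exome VCFs.

```python
import csv

# The seven ancestries analysed in this pipeline, using gnomAD's lowercase codes.
POPULATIONS = ["eas", "sas", "nfe", "fin", "afr", "amr", "asj"]
METRICS = ["AC", "AN", "AF"]


def header():
    """Column names for the TSV: site fields, then AC/AN/AF per population."""
    cols = ["CHROM", "POS", "REF", "ALT"]
    cols += [f"{m}_{p}" for p in POPULATIONS for m in METRICS]
    return cols


def variant_to_row(variant):
    """Flatten one variant record into a list aligned with header().

    `variant` only needs CHROM, POS, REF, ALT and an INFO object with .get(),
    which is the interface cyvcf2 variants expose.
    """
    row = [variant.CHROM, variant.POS, variant.REF, ",".join(variant.ALT)]
    for pop in POPULATIONS:
        for metric in METRICS:
            value = variant.INFO.get(f"{metric}_{pop}")
            # Hypothetical convention: missing annotations are written as "NA".
            row.append("NA" if value is None else value)
    return row


def extract_to_tsv(vcf_path, tsv_path):
    """Stream a VCF with cyvcf2 and write one TSV row per variant."""
    from cyvcf2 import VCF  # imported lazily so the helpers above work without it

    with open(tsv_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(header())
        for variant in VCF(vcf_path):
            writer.writerow(variant_to_row(variant))
```

Streaming one variant at a time keeps memory use flat, which matters with roughly 17 million records.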

Variant quality control

In the variant filtering process for gnomAD v2.1.1, we initially performed quality control based on allele count (AC) and allele number (AN) values. We then employed two population genetic structure models, Model A and Model B, to account for different population stratification scenarios.

| Step | Description | Number of Variants |
|------|-------------|--------------------|
| 0 | Initial VCF extraction | 17,209,972 |
| 1 | AC QC: Keep variants with AC > 0 in at least one population | 15,425,384 |
| 2 | AN QC: Keep variants with AN > 0 in all seven populations | 15,417,683 |
| 3.1 | Call Rate 10% QC: AN > 10% of maximum AN in all populations | 15,408,487 |
| 3.2 | Call Rate 20% QC: AN > 20% of maximum AN in all populations | 15,404,555 |
| 3.3 | Call Rate 30% QC: AN > 30% of maximum AN in all populations | 15,401,073 |
| 3.4 | Call Rate 40% QC: AN > 40% of maximum AN in all populations | 15,397,425 |

Note

Rationale: Call rate thresholds ensure adequate sequencing coverage across all populations. Higher thresholds (e.g., 40%) impose the most stringent coverage requirement at the cost of a slightly smaller variant set.
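The QC steps can be sketched as a chain of simple predicates. This is a minimal illustration, not the pipeline's actual code. It assumes each variant is a dict with `AC_<pop>` and `AN_<pop>` keys, and that "maximum AN" means the largest AN observed for each population among variants that pass Step 2. Both are assumptions, and all function names are hypothetical.

```python
POPULATIONS = ["eas", "sas", "nfe", "fin", "afr", "amr", "asj"]


def passes_ac_qc(v):
    """Step 1: AC > 0 in at least one population."""
    return any(v[f"AC_{p}"] > 0 for p in POPULATIONS)


def passes_an_qc(v):
    """Step 2: AN > 0 in all seven populations."""
    return all(v[f"AN_{p}"] > 0 for p in POPULATIONS)


def max_an_per_population(variants):
    """Largest AN seen for each population (assumed meaning of 'maximum AN')."""
    return {p: max(v[f"AN_{p}"] for v in variants) for p in POPULATIONS}


def passes_call_rate_qc(v, max_an, fraction):
    """Step 3.x: AN exceeds `fraction` of the maximum AN in every population."""
    return all(v[f"AN_{p}"] > fraction * max_an[p] for p in POPULATIONS)


def run_qc(variants, fractions=(0.1, 0.2, 0.3, 0.4)):
    """Return the Step 2 dataset plus one dataset per call-rate threshold."""
    step2 = [v for v in variants if passes_ac_qc(v) and passes_an_qc(v)]
    max_an = max_an_per_population(step2) if step2 else {}
    datasets = {"step2": step2}
    for fraction in fractions:
        name = f"call_rate_{round(fraction * 100)}"
        datasets[name] = [
            v for v in step2 if passes_call_rate_qc(v, max_an, fraction)
        ]
    return datasets
```

Because the thresholds are nested, each stricter call-rate dataset is a subset of the looser one, which matches the decreasing counts in the table.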

Population-Specific Variant Filtering

For each target population, we applied 32 filtering combinations derived from two complementary approaches:

Model A: Common in target, rare in others (16 combinations)

  • Target population AC ≥ {1, 5, 10, 20}
  • Reference populations (all 6) AF ≤ {0.5, 0.1, 0.05, 0.01}
  • Example: EAS AC ≥ 10 AND (SAS, NFE, FIN, AFR, AMR, ASJ) all AF ≤ 0.01
    • Interpretation: Variants present in East Asians but rare in other populations

Model B: Rare in target, common in others (16 combinations)

  • Target population AF ≤ {0.5, 0.1, 0.05, 0.01}
  • Reference populations (all 6) AC ≥ {1, 5, 10, 20}
  • Example: EAS AF ≤ 0.01 AND (SAS, NFE, FIN, AFR, AMR, ASJ) all AC ≥ 10
    • Interpretation: Variants common in other populations but rare in East Asians
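The two models reduce to a pair of predicates over per-population AC and AF values. The sketch below assumes each variant is a dict keyed by `AC_<pop>` and `AF_<pop>`. The function and parameter names are illustrative, not taken from the repository.

```python
POPULATIONS = ["eas", "sas", "nfe", "fin", "afr", "amr", "asj"]


def reference_populations(target):
    """The six populations other than the target."""
    return [p for p in POPULATIONS if p != target]


def model_a(v, target, min_target_ac, max_ref_af):
    """Model A: target AC >= threshold AND every reference AF <= threshold."""
    return v[f"AC_{target}"] >= min_target_ac and all(
        v[f"AF_{p}"] <= max_ref_af for p in reference_populations(target)
    )


def model_b(v, target, max_target_af, min_ref_ac):
    """Model B: target AF <= threshold AND every reference AC >= threshold."""
    return v[f"AF_{target}"] <= max_target_af and all(
        v[f"AC_{p}"] >= min_ref_ac for p in reference_populations(target)
    )
```

For instance, `model_a(v, "eas", 10, 0.01)` encodes the Model A example above, and `model_b(v, "eas", 0.01, 10)` encodes the Model B example.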

Filtering Matrix

7 populations × 32 filtering combinations = 224 population-specific variant sets

These 224 filtering conditions were applied to each of the five quality-controlled datasets (Step 2 + Step 3.1–3.4), generating:

224 combinations × 5 call rate QC levels = 1,120 population-specific variant files
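The counts above can be checked by enumerating the grid directly. The sketch below is a hypothetical illustration. The QC-level labels and the output file naming scheme are assumptions, not the repository's actual conventions.

```python
from itertools import product

POPULATIONS = ["eas", "sas", "nfe", "fin", "afr", "amr", "asj"]
AC_THRESHOLDS = [1, 5, 10, 20]
AF_THRESHOLDS = [0.5, 0.1, 0.05, 0.01]
# Hypothetical labels for the five QC datasets (Step 2 and Steps 3.1 to 3.4).
QC_LEVELS = ["step2", "call_rate_10", "call_rate_20", "call_rate_30", "call_rate_40"]


def filter_grid():
    """All (target, model, AC threshold, AF threshold) combinations."""
    return [
        {"target": target, "model": model, "ac": ac, "af": af}
        for target, model, ac, af in product(
            POPULATIONS, ["A", "B"], AC_THRESHOLDS, AF_THRESHOLDS
        )
    ]


def output_names(grid):
    """One hypothetical output file name per QC level and filter combination."""
    return [
        f"{qc}.{c['target']}.model{c['model']}.AC{c['ac']}.AF{c['af']}.tsv"
        for qc in QC_LEVELS
        for c in grid
    ]


grid = filter_grid()
names = output_names(grid)
print(len(grid), len(names))
```

In Model A the AC threshold applies to the target and the AF threshold to the references. In Model B the roles are swapped, so a single (AC, AF) pair per combination covers both models.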

| Scenario | Target Pop | Target Pop AF Threshold | Ref Pops | Ref Pop AF Threshold | Interpretation |
|----------|------------|-------------------------|----------|----------------------|----------------|
| A | EAS | AF ≥ 0.20 | SAS, NFE, FIN, AFR, AMR, ASJ | AF ≤ 0.01 | East Asian-specific common variant |
| B | NFE | AF ≥ 0.01 | EAS, SAS, FIN, AFR, AMR, ASJ | AF ≤ 0.05 | European-enriched low-frequency variant with moderate-stringency filtering |
| C | AFR | AF ≤ 0.01 | EAS, SAS, NFE, FIN, AMR, ASJ | AF ≥ 0.20 | Pan-ancestral common variant depleted in Africans |
