A comprehensive variant filtering and population genetics analysis pipeline for whole-exome sequencing (WES) data from gnomAD v2.1.1, designed to identify population-specific genetic variants through systematic quality control and allele frequency analysis across seven ancestral populations.
This study identifies population-specific common and rare genetic variants by analyzing allele frequency differences across populations using gnomAD v2.1.1 exome sequencing data. We developed a customized filtering pipeline that performs rigorous quality control and stratified analysis across seven genetic ancestries: East Asian (EAS), South Asian (SAS), Non-Finnish European (NFE), Finnish (FIN), African (AFR), Admixed American (AMR), and Ashkenazi Jewish (ASJ).
First, we used BCFtools to decompress the VCF files and calculate variant statistics for each chromosome. Next, we developed a Python script that uses the cyvcf2 package to extract allele frequencies and other relevant fields from the VCF files and write the results in a standard TSV format.
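The extraction step can be sketched as follows. This is a minimal illustration, not the pipeline's actual script: the function names `variant_row` and `vcf_to_tsv` and the column layout are hypothetical. It assumes the per-population INFO keys follow the gnomAD v2.1.1 naming pattern (`AC_eas`, `AN_eas`, `AF_eas`, and so on) and that each record carries a single ALT allele.

```python
import csv
import sys

# Seven gnomAD ancestry labels, lowercase as they appear in INFO keys (assumed).
POPS = ["eas", "sas", "nfe", "fin", "afr", "amr", "asj"]
METRICS = ("AC", "AN", "AF")
HEADER = ["chrom", "pos", "ref", "alt"] + [f"{m}_{p}" for p in POPS for m in METRICS]


def variant_row(chrom, pos, ref, alt, info):
    """Build one TSV row; `info` is any object with a dict-style .get()."""
    row = [chrom, pos, ref, alt]
    for pop in POPS:
        for metric in METRICS:
            value = info.get(f"{metric}_{pop}")
            # Missing INFO fields are written as NA rather than dropped.
            row.append("NA" if value is None else value)
    return row


def vcf_to_tsv(vcf_path, out=sys.stdout):
    """Stream a VCF and write one TSV line per record (hypothetical helper)."""
    # Imported lazily so variant_row() can be used without cyvcf2 installed.
    from cyvcf2 import VCF

    writer = csv.writer(out, delimiter="\t")
    writer.writerow(HEADER)
    for v in VCF(vcf_path):
        alt = v.ALT[0] if v.ALT else "."  # assumes one ALT allele per record
        writer.writerow(variant_row(v.CHROM, v.POS, v.REF, alt, v.INFO))
```

Keeping the row builder separate from the VCF reader makes the column logic easy to check on plain dictionaries before running it on full chromosomes.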
In the variant filtering process for gnomAD v2.1.1, we initially performed quality control based on allele count (AC) and allele number (AN) values. We then employed two population genetic structure models, Model A and Model B, to account for different population stratification scenarios.
| Step | Description | Number of Variants |
|---|---|---|
| 0 | Initial VCF extraction | 17,209,972 |
| 1 | AC QC: Keep variants with AC > 0 in at least one population | 15,425,384 |
| 2 | AN QC: Keep variants with AN > 0 in all seven populations | 15,417,683 |
| 3.1 | Call Rate 10% QC: AN > 10% of maximum AN in all populations | 15,408,487 |
| 3.2 | Call Rate 20% QC: AN > 20% of maximum AN in all populations | 15,404,555 |
| 3.3 | Call Rate 30% QC: AN > 30% of maximum AN in all populations | 15,401,073 |
| 3.4 | Call Rate 40% QC: AN > 40% of maximum AN in all populations | 15,397,425 |
Note: Call rate thresholds ensure adequate sequencing coverage across all populations. Higher thresholds (e.g., 40%) give greater confidence at the cost of slightly fewer retained variants (15,397,425 at 40% versus 15,408,487 at 10%).
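The QC steps in the table can be sketched on in-memory records as below. This is an illustration under stated assumptions, not the pipeline's code: each variant is a plain dict keyed like `AC_eas` / `AN_eas`, the helper names are hypothetical, and "maximum AN" is taken to mean the largest AN observed for each population across the input, which is one possible reading of the table.

```python
POPS = ["eas", "sas", "nfe", "fin", "afr", "amr", "asj"]


def passes_ac_qc(v):
    """Step 1: AC > 0 in at least one population."""
    return any(v[f"AC_{p}"] > 0 for p in POPS)


def passes_an_qc(v):
    """Step 2: AN > 0 in all seven populations."""
    return all(v[f"AN_{p}"] > 0 for p in POPS)


def max_an_per_pop(variants):
    """Largest AN seen per population (assumed definition of 'maximum AN')."""
    return {p: max((v[f"AN_{p}"] for v in variants), default=0) for p in POPS}


def passes_call_rate(v, max_an, percent):
    """Step 3.x: AN > percent% of the maximum AN in every population."""
    # Integer arithmetic avoids floating-point edge cases at the threshold.
    return all(v[f"AN_{p}"] * 100 > percent * max_an[p] for p in POPS)


def run_qc(variants, percents=(10, 20, 30, 40)):
    """Return the Step 2 dataset plus one dataset per call rate threshold."""
    step1 = [v for v in variants if passes_ac_qc(v)]
    step2 = [v for v in step1 if passes_an_qc(v)]
    max_an = max_an_per_pop(variants)
    datasets = {"step2": step2}
    for pct in percents:
        datasets[f"call_rate_{pct}"] = [
            v for v in step2 if passes_call_rate(v, max_an, pct)
        ]
    return datasets
```

The five datasets returned by `run_qc` correspond to Step 2 and Steps 3.1 to 3.4; each call rate filter is applied to the Step 2 output independently rather than cumulatively, which yields the same result because the thresholds are nested.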
For each target population, we applied 32 filtering combinations derived from two complementary approaches:

Model A: Common in target, rare in others (16 combinations)
- Target population AC ≥ {1, 5, 10, 20}
- Reference populations (all 6) AF ≤ {0.5, 0.1, 0.05, 0.01}
- Example: EAS AC ≥ 10 AND (SAS, NFE, FIN, AFR, AMR, ASJ) all AF ≤ 0.01
- Interpretation: Variants present in East Asians but rare in other populations
Model B: Rare in target, common in others (16 combinations)
- Target population AF ≤ {0.5, 0.1, 0.05, 0.01}
- Reference populations (all 6) AC ≥ {1, 5, 10, 20}
- Example: EAS AF ≤ 0.01 AND (SAS, NFE, FIN, AFR, AMR, ASJ) all AC ≥ 10
- Interpretation: Variants common in other populations but rare in East Asians
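The two models reduce to a pair of predicates plus a 4 × 4 grid of thresholds each. The sketch below is illustrative only: the function names and the dict-based variant representation (keys such as `AC_eas`, `AF_eas`) are assumptions, not the pipeline's actual interface.

```python
from itertools import product

POPS = ["eas", "sas", "nfe", "fin", "afr", "amr", "asj"]
AC_THRESHOLDS = [1, 5, 10, 20]
AF_THRESHOLDS = [0.5, 0.1, 0.05, 0.01]


def reference_pops(target):
    """The six populations other than the target."""
    return [p for p in POPS if p != target]


def model_a(v, target, ac_min, af_max):
    """Model A: target AC >= ac_min AND every reference AF <= af_max."""
    return v[f"AC_{target}"] >= ac_min and all(
        v[f"AF_{p}"] <= af_max for p in reference_pops(target)
    )


def model_b(v, target, af_max, ac_min):
    """Model B: target AF <= af_max AND every reference AC >= ac_min."""
    return v[f"AF_{target}"] <= af_max and all(
        v[f"AC_{p}"] >= ac_min for p in reference_pops(target)
    )


def combinations_for(target):
    """All 32 (model, target, AC threshold, AF threshold) settings."""
    grid = list(product(AC_THRESHOLDS, AF_THRESHOLDS))
    return [("A", target, ac, af) for ac, af in grid] + [
        ("B", target, ac, af) for ac, af in grid
    ]
```

For instance, the Model A example above corresponds to `model_a(v, "eas", 10, 0.01)`, and the Model B example to `model_b(v, "eas", 0.01, 10)`.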
Filtering Matrix
7 populations × 32 filtering combinations = 224 population-specific variant sets
These 224 filtering conditions were applied to each of the five quality-controlled datasets (Step 2 + Step 3.1–3.4), generating:
224 combinations × 5 call rate QC levels = 1,120 population-specific variant files
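The counts above can be checked by enumerating the full filtering matrix. In this sketch the QC level labels and the output file naming scheme are hypothetical; only the population list and threshold sets come from the text.

```python
from itertools import product

POPS = ["EAS", "SAS", "NFE", "FIN", "AFR", "AMR", "ASJ"]
MODELS = ["A", "B"]
AC_THRESHOLDS = [1, 5, 10, 20]
AF_THRESHOLDS = [0.5, 0.1, 0.05, 0.01]
# Step 2 plus the four call rate levels (labels are illustrative).
QC_LEVELS = ["step2", "cr10", "cr20", "cr30", "cr40"]


def output_name(qc, pop, model, ac, af):
    """Hypothetical file name for one population-specific variant set."""
    return f"{qc}.{pop}.model{model}.AC{ac}.AF{af}.tsv"


# One tuple per filtering condition within a single QC level.
per_level = list(product(POPS, MODELS, AC_THRESHOLDS, AF_THRESHOLDS))

# One tuple per output file across all QC levels.
all_specs = [(qc, *spec) for qc in QC_LEVELS for spec in per_level]
file_names = {output_name(*spec) for spec in all_specs}

print(len(per_level))   # conditions per QC level
print(len(file_names))  # total output files
```

Building the names into a set confirms that every combination maps to a distinct file, so no output overwrites another.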
| Scenario | Target Pop | Target Pop AF Threshold | Ref Pops | Ref Pop AF Threshold | Interpretation |
|---|---|---|---|---|---|
| A | EAS | AF ≥ 0.20 | SAS, NFE, FIN, AFR, AMR, ASJ | AF ≤ 0.01 | East Asian-specific common variant |
| B | NFE | AF ≥ 0.01 | EAS, SAS, FIN, AFR, AMR, ASJ | AF ≤ 0.05 | European-enriched low-frequency variant (moderate stringency) |
| C | AFR | AF ≤ 0.01 | EAS, SAS, NFE, FIN, AMR, ASJ | AF ≥ 0.20 | Pan-ancestral common variant depleted in Africans |