Merged
4 changes: 3 additions & 1 deletion Readme.md
Original file line number Diff line number Diff line change
@@ -3,4 +3,6 @@
Genomics Kit integrates the utilities for bioinformatics analysis

1. [igv-report](./igv-report): Generate the IGV report html format
-2. [spark-on-slurm](./spark-on-slurm/): Spark on SLURM cluster configuration
+2. [spark-on-slurm](./spark-on-slurm/): Spark on SLURM cluster configuration, with support for:
+   - [Hail](https://hail.is/): Powering genomic analysis, at every scale
+3. [glnexus](https://github.com/dnanexus-rnd/GLnexus): Joint variant calling that builds a cohort VCF from DeepVariant gVCFs
2 changes: 2 additions & 0 deletions glnexus/.gitattributes
@@ -0,0 +1,2 @@
# SCM syntax highlighting & preventing 3-way merges
pixi.lock merge=binary linguist-language=YAML linguist-generated=true -diff
3 changes: 3 additions & 0 deletions glnexus/.gitignore
@@ -0,0 +1,3 @@
# pixi environments
.pixi/*
!.pixi/config.toml
1 change: 1 addition & 0 deletions glnexus/ALDH2_5kb.bed
@@ -0,0 +1 @@
chr12 111760000 111765000
6 changes: 6 additions & 0 deletions glnexus/Makefile
@@ -0,0 +1,6 @@
.PHONY: test
${HOME}/.pixi/bin/pixi:
	curl -sSL https://pixi.sh/install.sh | sh

test: ${HOME}/.pixi/bin/pixi
	${HOME}/.pixi/bin/pixi run bash joint_genotyping_cohort_vcf.sh
192 changes: 192 additions & 0 deletions glnexus/Readme.md
@@ -0,0 +1,192 @@
# GLnexus Joint Genotyping Configuration

A simple configuration for performing joint genotyping of genomic variants using **GLnexus**, demonstrating cohort analysis with sample data from GIAB and 1000 Genomes Project.

## Overview

This project automates joint variant calling on gVCF files from multiple samples using GLnexus. It includes:

- **Data Download**: Automatically retrieves sample gVCF files from public cloud storage (GIAB and 1000 Genomes)
- **Joint Calling**: Performs multi-sample variant calling on the ALDH2 region (5kb) using GLnexus
- **Output Conversion**: Converts BCF output to bgzip-compressed VCF format

## Quick Start

### Prerequisites

- [Pixi](https://pixi.sh/) - Package manager for reproducible environments
- Internet connectivity (for downloading sample data from cloud storage)
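- No AWS credentials: the 1000 Genomes files are fetched with `aws s3 cp --no-sign-request`, and the GIAB files come from a public Google Cloud Storage bucket via `gsutil`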
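
A preflight check can confirm that the command-line tools called by `joint_genotyping_cohort_vcf.sh` are on `PATH` before any download starts. This is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper name, and the tool list is simply the set of commands the script invokes.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): print each tool that is
# not on PATH and return non-zero if any is missing.
missing_tools() {
  local tool status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      status=1
    fi
  done
  return "$status"
}

# Commands invoked by joint_genotyping_cohort_vcf.sh; run this inside `pixi shell`.
if missing_tools glnexus_cli bcftools bgzip gsutil aws; then
  echo "all tools found"
fi
```

Running it inside the Pixi environment gives an early, readable failure instead of a `command not found` partway through the downloads.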

### Installation

1. Install Pixi if you haven't already:

```bash
curl -sSL https://pixi.sh/install.sh | sh
```

2. Clone or navigate to this repository:

```bash
cd /path/to/glnexus
```

### Running the Pipeline

Execute the complete joint genotyping pipeline:

```bash
make test
```

This will:

1. Install dependencies in a managed Pixi environment
2. Download gVCF files from GIAB and 1000 Genomes
3. Run GLnexus joint calling
4. Convert outputs to compressed VCF format

## Project Structure

```
glnexus/
├── joint_genotyping_cohort_vcf.sh # Main pipeline script
├── ALDH2_5kb.bed # Test region (BED format)
├── Makefile # Build automation
├── pixi.toml # Dependency configuration
├── pixi.lock # Locked dependency versions
├── cohort_vcf/ # Output directory structure
│ ├── GIAB/data/ # GIAB samples output
│ └── 1KGP/data/ # 1000 Genomes samples output
└── Readme.md # This file
```

## Dependencies

The pipeline uses the following tools (managed by Pixi):

- **GLnexus** (>=1.4.1, <2) - Multi-sample variant calling
- **BCFtools** (>=1.23.1, <2) - VCF/BCF manipulation
- **SAMtools** (>=1.23.1, <2) - Sequence manipulation
- **gsutil** - Google Cloud Storage access
- **google-cloud-storage** (>=2.10.0) - Cloud storage library

## Sample Data

### GIAB (Genome in a Bottle)

- HG002 (child)
- HG003 (parent 1)
- HG004 (parent 2)

Source: DeepVariant case study outputs (v1.10.0), from the DeepTrio WGS run (`deeptrio/wgs` in the bucket path)

### 1000 Genomes Project (1KGP)

- NA21144
- NA21143
- NA21142

Source: DRAGEN v4.2.7 processed individuals
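
The sample names above map directly onto the object paths that the pipeline script downloads. The sketch below rebuilds those URIs from the sample names as a dry run, without touching the network. The function names `giab_uri` and `kgp_uri` are hypothetical; the bucket prefixes are copied from `joint_genotyping_cohort_vcf.sh`.

```bash
#!/usr/bin/env bash
# Sketch only: build the download URIs used by joint_genotyping_cohort_vcf.sh
# from sample names. Function names are hypothetical; nothing is downloaded.
GIAB_PREFIX="gs://deepvariant/case-study-outputs/1.10.0/deeptrio/wgs"
KGP_PREFIX="s3://1000genomes-dragen-v4-2-7/data/individuals/hg38_alt_masked_graph_v3"

# GIAB objects are named <sample>.<role>.g.vcf.gz (role: child, parent1, parent2).
giab_uri() { echo "${GIAB_PREFIX}/$1.$2.g.vcf.gz"; }

# 1KGP objects live in a per-sample directory.
kgp_uri() { echo "${KGP_PREFIX}/$1/$1.final.hard-filtered.gvcf.gz"; }

for pair in HG002:child HG003:parent1 HG004:parent2; do
  giab_uri "${pair%%:*}" "${pair##*:}"
done
for sample in NA21144 NA21143 NA21142; do
  kgp_uri "$sample"
done
```

Each printed URI has a `.tbi` companion at the same path, and the script fetches both.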

## Test Region

The pipeline analyzes the **ALDH2 region** on chromosome 12 (5kb window):

- **Chromosome**: chr12
- **Start**: 111,760,000
- **End**: 111,765,000
- **File**: ALDH2_5kb.bed
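
BED intervals use a 0-based start and an exclusive end, so the window length is simply end minus start, and the columns must be separated by tabs. The sketch below writes the same record to a temporary file and computes its length; `write_bed` is a hypothetical helper, not something the repository provides.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write one tab-separated BED record (chrom, start, end)
# to the file given as the fourth argument.
write_bed() {
  printf '%s\t%s\t%s\n' "$1" "$2" "$3" > "$4"
}

bed_file="$(mktemp)"
write_bed chr12 111760000 111765000 "$bed_file"

# BED starts are 0-based and ends are exclusive, so length = end - start.
awk -F'\t' '{ print $1 ": " ($3 - $2) " bp" }' "$bed_file"
```

Using `printf` rather than typing the line by hand avoids the common mistake of separating BED columns with spaces.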

## Output Files

After running the pipeline, the following VCF files are generated:

```
cohort_vcf/GIAB/data/GIAB_ALDH2_5kb.vcf.gz # GIAB joint calls
cohort_vcf/1KGP/data/1KGP_ALDH2_5kb.vcf.gz # 1KGP joint calls
```

The script as written does not index these outputs. To create the `.tbi` index files, run `bcftools index -t` on each `.vcf.gz`.

## Configuration

### GLnexus Parameters

The pipeline uses the following GLnexus settings:

```bash
glnexus_cli --config DeepVariant --threads 4 --bed ALDH2_5kb.bed
```

- **Config**: DeepVariant (preset tuned for DeepVariant gVCF output; the script applies the same preset to the DRAGEN-produced 1KGP gVCFs)
- **Threads**: 4 (CPU cores for parallel processing)
- **BED region**: ALDH2_5kb.bed (limits analysis to test region)

### Customization

To modify the analysis:

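1. **Change region of interest**: The script regenerates `ALDH2_5kb.bed` on every run, so change the coordinates in the line of `joint_genotyping_cohort_vcf.sh` that writes the file; edits made only to the BED file are overwritten
2. **Adjust threading**: Change `--threads` on both `glnexus_cli` calls in `joint_genotyping_cohort_vcf.sh`
3. **Add more samples**: Add the matching `gsutil cp` or `aws s3 cp` commands to the script, and make sure the new file names match the glob passed to `glnexus_cli` (`*.g.vcf.gz` for GIAB, `*.gvcf.gz` for 1KGP)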
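
One way to make these changes without editing the script each time is to read the values from the environment. The sketch below is an assumption about how that could look, not how the current script works: `THREADS` and `BED_FILE` are hypothetical variable names, and their defaults mirror the hard-coded values.

```bash
#!/usr/bin/env bash
# Sketch only: THREADS and BED_FILE are hypothetical environment variables,
# not options read by the current script. Defaults mirror its hard-coded values.
THREADS="${THREADS:-4}"
BED_FILE="${BED_FILE:-ALDH2_5kb.bed}"

# Build the command as an array so paths containing spaces stay intact.
glnexus_cmd=(glnexus_cli --config DeepVariant --threads "$THREADS" --bed "$BED_FILE")

# Dry run: print the command. For real use, append the gVCF paths and execute it.
echo "${glnexus_cmd[*]}"
```

With this shape, a lower thread count or a different region becomes a one-off override on the command line rather than a change to the script.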

## Usage Examples

### Run complete pipeline

```bash
make test
```

### Manual execution

```bash
# Activate environment
pixi shell

# Run pipeline
bash joint_genotyping_cohort_vcf.sh
```

### Run specific cohort only

```bash
# GIAB only
cd cohort_vcf/GIAB/data
glnexus_cli --config DeepVariant --threads 4 --bed ../../../ALDH2_5kb.bed *.g.vcf.gz > GIAB_ALDH2_5kb.bcf
```

## Troubleshooting

### Low Memory

Adjust thread count to reduce memory usage:

```bash
--threads 2 # Instead of 4
```

### File Not Found

Ensure the `cohort_vcf/GIAB/data` and `cohort_vcf/1KGP/data` directories exist before running. The script changes into them with `cd` and does not create them, so recreate them if they are missing from your checkout:

```bash
mkdir -p cohort_vcf/GIAB/data cohort_vcf/1KGP/data
```
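
### Scratch Directory Already Exists

A rerun can also fail because `glnexus_cli` refuses to start when its scratch directory (by default `GLnexus.DB` in the working directory) is left over from an earlier run. The sketch below removes such leftovers from both cohort directories; `clear_glnexus_scratch` is a hypothetical helper name, and the paths assume you run it from the `glnexus/` directory.

```bash
#!/usr/bin/env bash
# Hypothetical helper: remove a leftover GLnexus scratch directory, if any,
# from the given cohort data directory.
clear_glnexus_scratch() {
  local data_dir=$1
  if [ -d "$data_dir/GLnexus.DB" ]; then
    rm -rf -- "$data_dir/GLnexus.DB"
    echo "removed $data_dir/GLnexus.DB"
  fi
}

# Assumes the current directory is glnexus/.
for cohort in GIAB 1KGP; do
  clear_glnexus_scratch "cohort_vcf/$cohort/data"
done
```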

## References

- [GLnexus Documentation](https://github.com/dnanexus-rnd/GLnexus)
- [DeepVariant Case Study](https://github.com/google/deepvariant/blob/r1.10/docs/case_study.md)
- [1000 Genomes Project](https://www.internationalgenome.org/)
- [Genome in a Bottle](https://www.nist.gov/programs/giab)

## License

This configuration is provided as-is for educational and research purposes.

## Author

nttg8100 <nttg8100@gmail>
5 changes: 5 additions & 0 deletions glnexus/cohort_vcf/.gitignore
@@ -0,0 +1,5 @@
*.gz
*gz.tbi
GLnexus.DB
*.bcf
*.bed
Empty file.
Empty file.
39 changes: 39 additions & 0 deletions glnexus/joint_genotyping_cohort_vcf.sh
@@ -0,0 +1,39 @@
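#!/usr/bin/env bash
set -euo pipefail  # stop on the first failed command, unset variable, or failed pipeline stage
################ Download data ##############################
# Run from the glnexus/ directory: every path below is resolved against it
HOME_DIR=$(pwd)
# GIAB
cd "${HOME_DIR}/cohort_vcf/GIAB/data"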
gsutil cp gs://deepvariant/case-study-outputs/1.10.0/deeptrio/wgs/HG002.child.g.vcf.gz .
gsutil cp gs://deepvariant/case-study-outputs/1.10.0/deeptrio/wgs/HG002.child.g.vcf.gz.tbi .

gsutil cp gs://deepvariant/case-study-outputs/1.10.0/deeptrio/wgs/HG003.parent1.g.vcf.gz .
gsutil cp gs://deepvariant/case-study-outputs/1.10.0/deeptrio/wgs/HG003.parent1.g.vcf.gz.tbi .

gsutil cp gs://deepvariant/case-study-outputs/1.10.0/deeptrio/wgs/HG004.parent2.g.vcf.gz .
gsutil cp gs://deepvariant/case-study-outputs/1.10.0/deeptrio/wgs/HG004.parent2.g.vcf.gz.tbi .


# 1KGP
cd "${HOME_DIR}/cohort_vcf/1KGP/data"
aws s3 cp --no-sign-request s3://1000genomes-dragen-v4-2-7/data/individuals/hg38_alt_masked_graph_v3/NA21144/NA21144.final.hard-filtered.gvcf.gz .
aws s3 cp --no-sign-request s3://1000genomes-dragen-v4-2-7/data/individuals/hg38_alt_masked_graph_v3/NA21144/NA21144.final.hard-filtered.gvcf.gz.tbi .

aws s3 cp --no-sign-request s3://1000genomes-dragen-v4-2-7/data/individuals/hg38_alt_masked_graph_v3/NA21143/NA21143.final.hard-filtered.gvcf.gz .
aws s3 cp --no-sign-request s3://1000genomes-dragen-v4-2-7/data/individuals/hg38_alt_masked_graph_v3/NA21143/NA21143.final.hard-filtered.gvcf.gz.tbi .

aws s3 cp --no-sign-request s3://1000genomes-dragen-v4-2-7/data/individuals/hg38_alt_masked_graph_v3/NA21142/NA21142.final.hard-filtered.gvcf.gz .
aws s3 cp --no-sign-request s3://1000genomes-dragen-v4-2-7/data/individuals/hg38_alt_masked_graph_v3/NA21142/NA21142.final.hard-filtered.gvcf.gz.tbi .


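################ Joint calling ##############################
# 1. GLnexus
# Create the test BED file (tab-separated; 0-based start, exclusive end)
cd "${HOME_DIR}"
printf 'chr12\t111760000\t111765000\n' > ALDH2_5kb.bed

# GIAB: joint-call the DeepTrio gVCFs, then convert BCF to bgzip-compressed VCF
cd "${HOME_DIR}/cohort_vcf/GIAB/data"
rm -rf GLnexus.DB  # glnexus_cli will not start if its scratch directory already exists
glnexus_cli --config DeepVariant --threads 4 --bed "${HOME_DIR}/ALDH2_5kb.bed" *.g.vcf.gz > GIAB_ALDH2_5kb.bcf
bcftools view GIAB_ALDH2_5kb.bcf | bgzip -@ 4 -c > GIAB_ALDH2_5kb.vcf.gz

# 1KGP: same steps for the DRAGEN gVCFs
cd "${HOME_DIR}/cohort_vcf/1KGP/data"
rm -rf GLnexus.DB  # same scratch-directory cleanup as above
glnexus_cli --config DeepVariant --threads 4 --bed "${HOME_DIR}/ALDH2_5kb.bed" *.gvcf.gz > 1KGP_ALDH2_5kb.bcf
bcftools view 1KGP_ALDH2_5kb.bcf | bgzip -@ 4 -c > 1KGP_ALDH2_5kb.vcf.gz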