The following README is duplicated from the raw data archive for this publication. It provides instructions for running these scripts, although the file locations may differ.
No installation is necessary for any scripts.
---------------------------------------
SEQUENCING ANALYSIS: INTERNAL BARCODES
---------------------------------------
All Python (*.py) scripts work on any machine running Python 3. Shell scripts (*.sh) may be run in a standard terminal.
Sequencing datasets are separated by dated subfolders for each sequencing run. Analysis scripts are included in each subfolder.
Demo instructions:
1. In a terminal, run
assemble_reads.sh.
This script performs data preprocessing tasks. It reads raw sequencing data from raw_reads/ and writes its outputs to deduplicated_reads/ and assembled_reads/.
2. Run
python alignment_score.py.
This script reads sequencing data in assembled_reads/ and attempts to match each read to the 96 internal barcodes by pairwise alignment. A read is assigned to a barcode if the alignment score between the read and that barcode exceeds a fixed threshold. Outputs are written to *.csv files located in match_analysis/.
3. (Optional) Plot analysis using
python csv2heatmap.py
python roc.py
Scripts should take less than an hour per dataset on a standard machine. Outputs are also included in the subfolders.
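Illustration of the matching idea in step 2 (a minimal sketch, not the code in alignment_score.py). It scores a read against each barcode with a plain Smith-Waterman local alignment and keeps the best barcode only if its score exceeds a threshold. The scoring values, the threshold, the function names, and the example barcodes are all assumptions made for illustration; the actual script may use a different aligner and different parameters.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty.

    The match/mismatch/gap values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores are never allowed to drop below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def match_read(read, barcodes, threshold):
    """Return (barcode_name, score) for the best-scoring barcode,
    or None if no score exceeds the threshold."""
    best_name, best_score = None, -1
    for name, seq in barcodes.items():
        score = local_alignment_score(read, seq)
        if score > best_score:
            best_name, best_score = name, score
    return (best_name, best_score) if best_score > threshold else None


if __name__ == "__main__":
    # Hypothetical barcodes; the real set contains 96 sequences.
    barcodes = {"bc01": "ACGTACGT", "bc02": "TTTTGGGG"}
    print(match_read("NNACGTACGTNN", barcodes, threshold=12))
```

A read that matches no barcode above the threshold returns None, which corresponds to leaving the read unassigned.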
---------------------------------------
SEQUENCING ANALYSIS: SARS-CoV-2
---------------------------------------
1. Clinical samples analysis:
A. align_reads.sh
Purpose: Automates alignment of sequencing reads to a reference genome, marks duplicate reads, calls genomic variants, filters variants, and generates consensus sequences and coverage metrics.
Prerequisites:
bwa for read alignment
samtools for indexing and statistics
GATK 4.3.0.0 for variant calling and metrics collection
Input Files:
Reference genome file (NC_045512.2.fa)
Raw sequencing reads in FASTQ format
Output Files:
Aligned and sorted BAM files
Raw and filtered variant files (VCF)
Consensus FASTA sequences
Alignment and coverage metrics
Usage:
./align_reads.sh
B. concordance.sh
Purpose: Calculates precision, recall, and F1-score between pairs of filtered variant call files to assess variant calling concordance across samples.
Prerequisites:
bcftools for variant file normalization, indexing, and comparison
Input Files:
Hard-filtered VCF files (*.enc.hard-filtered.vcf and *.hard-filtered.vcf)
Output Files:
concordance_results.tsv containing precision, recall, and F1-score metrics for each sample
Usage:
./concordance.sh
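For reference, the metrics reported by concordance.sh can be written out as follows. This is a minimal sketch, assuming each VCF has already been reduced to a set of (chromosome, position, ref, alt) tuples and that one file of each pair is treated as the reference call set; the function name and the choice of which file acts as the reference are assumptions, not taken from the script.

```python
def concordance_metrics(reference_calls, query_calls):
    """Precision, recall, and F1-score between two sets of variant calls.

    Each variant is a hashable key such as (chrom, pos, ref, alt).
    """
    tp = len(reference_calls & query_calls)   # called in both sets
    fp = len(query_calls - reference_calls)   # only in the query set
    fn = len(reference_calls - query_calls)   # only in the reference set
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up variants for illustration only.
    ref = {("NC_045512.2", 100, "A", "G"), ("NC_045512.2", 200, "C", "T"),
           ("NC_045512.2", 300, "G", "A")}
    qry = {("NC_045512.2", 200, "C", "T"), ("NC_045512.2", 300, "G", "A"),
           ("NC_045512.2", 400, "T", "C")}
    print(concordance_metrics(ref, qry))
```

Empty inputs return zeros rather than raising a division error.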
2. Synthetic RNA samples analysis:
A. align_reads.sh
Purpose: Automates alignment of synthetic RNA sequencing reads to a reference genome, removal of duplicate reads, variant calling with LoFreq, depth calculation, and variant demixing with freyja.
Prerequisites:
bwa for read alignment
samtools for sorting and depth calculation
GATK for marking duplicates
LoFreq for variant calling
freyja for variant demixing
Input Files:
Reference genome (NC_045512.2.fa)
FASTQ files containing synthetic RNA sequencing reads
Output Files:
Sorted BAM files
Variant Call Format (VCF) files
Depth metrics
Demixed variant data
Usage:
./align_reads.sh
B. amplicon_coverage.sh
Purpose: Calculates amplicon coverage statistics and read depth from BAM files generated during synthetic RNA analysis.
Prerequisites:
samtools
Input Files:
BAM files
Primer BED file (nCoV-2019.primer.bed)
Output Files:
Amplicon coverage statistics
Depth statistics
Usage:
./amplicon_coverage.sh
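As an illustration of the kind of depth statistics involved (a sketch, not the contents of amplicon_coverage.sh): the function below summarizes per-position depth text in the three-column, tab-separated layout written by samtools depth (reference name, 1-based position, depth). The function name and the min_depth default are assumptions. Positions absent from the input are not counted.

```python
def depth_summary(depth_text, min_depth=10):
    """Summarize tab-separated 'chrom<TAB>pos<TAB>depth' lines.

    min_depth is an illustrative threshold, not a value from the scripts.
    """
    depths = []
    for line in depth_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        depths.append(int(fields[2]))
    if not depths:
        return {"positions": 0, "mean_depth": 0.0, "fraction_covered": 0.0}
    covered = sum(1 for d in depths if d >= min_depth)
    return {
        "positions": len(depths),
        "mean_depth": sum(depths) / len(depths),
        "fraction_covered": covered / len(depths),
    }


if __name__ == "__main__":
    # Made-up depth values for illustration only.
    example = "NC_045512.2\t1\t5\nNC_045512.2\t2\t15\nNC_045512.2\t3\t10\n"
    print(depth_summary(example))
```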
C. samtools_grep.sh
Purpose: Extracts coverage information (FPCOV) from statistics generated by samtools and formats the data.
Prerequisites:
grep
awk
Input Files:
Output from amplicon_coverage.sh
Output Files:
Formatted coverage statistics
Usage:
./samtools_grep.sh
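The extraction step can also be sketched in Python (an illustration only; samtools_grep.sh itself uses grep and awk). The sketch assumes the statistics file is tab-separated with a record tag in the first column, and keeps rows whose tag begins with FPCOV. The function name and the example rows are hypothetical.

```python
def extract_fpcov(stats_text):
    """Return the fields following the tag for every row whose
    first tab-separated field starts with 'FPCOV'."""
    rows = []
    for line in stats_text.splitlines():
        fields = line.split("\t")
        if fields[0].startswith("FPCOV"):
            rows.append(fields[1:])
    return rows


if __name__ == "__main__":
    # Made-up statistics text for illustration only.
    example = (
        "# comment line\n"
        "FREADS\tsample1\t100\t200\n"
        "FPCOV-1\tsample1\t99.5\t100.0\n"
    )
    for row in extract_fpcov(example):
        print("\t".join(row))
```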