This is the first section of the pipeline and it allows to pre-process the raw Hi-C data (fastq files), in order to generate input files for the normalization step.
In order to start the Hi-C data analysis and preprocess your data you have to provide:
- Two fastq files, respectively of the first and the second reads of the pairs. If you wish to use public datasets on GEO and you need instructions to download the data, see section 1.1..
- The Bowtie2 indexes of your reference genome.
- The restriction enzyme used in the Hi-C experiment.
Note! To reproduce the results of this tutorial, use the public dataset with this GEO accession number: GSM1551550.
HiCtool allows to process Hi-C data generated for one of the following species:
- hg19
- hg38
- mm9
- mm10
- dm6
- susScr3
- susScr11
HiCtool allows to process Hi-C data generated using one or more of the following restriction enzymes:
- HindIII
- MboI
- DpnII
- Sau3AI
- BglII
- NcoI
- Hinfl
The Arima Kit uses a cocktail of restriction enzymes which includes MboI and Hinfl. If your experiment was performed using a different species or restriction enzyme(s), please contact Riccardo Calandrelli at rcalandrelli@eng.ucsd.edu.
If you do not have the Bowtie2 genome index, please run the following in order to generate it:
bowtie2-build hg38.fa index
hg38.fa is the reference genome sequence in FASTA format (in this case of hg38), the output files in .bt2 format are named with the prefix index.
The data preprocessing is performed with a single unix command line (replace parameters in the code below properly) and comprises the following steps:
- Pre-truncation of the reads that contain potential ligation junctions to keep the longest piece without a junction sequence (Ay et al., 2015).
- Independent mapping of the read pairs to the reference genome to avoid any proximity constraint.
- Removing the unmapped reads and selecting reads that were uniquely mapped with a MAPQ >= 30, i.e. the estimated probability of mapping error is <= 0.1% (this can be changed with the parameter
-q). - Deduplicating aligned reads. (Note that PCR duplicates were previously removed in the following section, while now this step has been also added here to allow the extraction of deduplicated data already from the bam or bedpe files).
# Make the bash script executable
chmod u+x ./HiCtool-master/scripts/HiCtool_run_preprocessing.sh
# Run the script
./HiCtool-master/scripts/HiCtool_run_preprocessing.sh \
-h ./HiCtool-master/scripts/ \
-o /your_output_directory/ \
-1 /myfastq_path/file1.fastq \
-2 /myfastq_path/file2.fastq \
-e MboI \
-q 30 \
-g /path_to_the_genome_indexes/index \
-p 32 \
-c 50000000
where:
-h: path to the HiCtool scripts with the final trailing slash/.-o: path to save the output files. If the folder does not exist, it is created automatically.-1: the fastq file with the first reads of the pairs.-2: the fastq file with the second reads of the pairs.-e: the restriction enzyme or enzymes passed between square brackets (example: [MboI,Hinfl] for the cocktail of the Arima Kit).-q: to filter mapped reads with MAPQ smaller than this threshold.-g: Bowtie2 genome indexes. Only the filename should be passed here without extension, in this caseindex.-p: the number of parallel threads (processors) to use for alignment and preprocessing. The more the fastest the process.-c: chunk size. If your data are very big, you may encounter a memory error when the fastq files are loaded for pre-truncated and downstream when the paired reads between the two mapped files are selected. Thus, you it is suggested always to use this parameter in order to split the two fastq files into several temporary files with-clines each (this means all the lines, i.e. 4 lines per each read), that are pre-truncated separately. The temporary files will be processed with multiple threads if you set-pgreater than 1. Therefore, setting-cmay help to speed up the pre-truncation process. In addition, setting-clets the program work with smaller temporary files at the pairing step as well to generate the output bam files.
The structure of the output directory is the following:
~/your_output_directory
|___ file1.trunc.fastq
|___ file2.trunc.fastq
|___ pre_truncation_log.txt
|___ HiCfile1_pair1.bam
|___ HiCfile2_pair1.bam
|___ HiCfile1_log.txt
|___ HiCfile2_log.txt
The following output files are generated:
file1.trunc.fastqandfile2.trunc.fastqare the two fastq file with pre-truncated reads.pre_truncation_log.txtwith the information about the percentage of reads that have been truncated. This is also printed on the console during pre-processing:
SRR1658570_1.fastq
202095066 reads (length = 101 bp); of these:
29851195 (14.78%) contained a potential ligation junction and have been truncated.
SRR1658570_2.fastq
202095066 reads (length = 101 bp); of these:
28681691 (14.2%) contained a potential ligation junction and have been truncated.
HiCfile_pair1.bamandHiCfile_pair2.bamthat are the bam files of the pre-truncated first and second reads in the pairs respectively, generated after alignment, filtering and deduplication.HiCfile1_log.txtandHiCfile2_log.txtare the log files with alignment and filtering statistics for the first and second reads in the pairs respectively.
HiCfile1_log.txt
202095066 reads; of these:
202095066 (100.00%) were unpaired; of these:
5770798 (2.86%) aligned 0 times
156759009 (77.57%) aligned exactly 1 time
39565259 (19.58%) aligned >1 times
97.14% overall alignment rate
----------
202095066 reads; of these:
146243025 (72.36%) aligned with MAPQ>=30 and are deduplicated; of these:
109354498 (74.77%) were paired and saved into HiCfile_pair1.bam
HiCfile2_log.txt
202095066 reads; of these:
202095066 (100.00%) were unpaired; of these:
13381441 (6.62%) aligned 0 times
149852422 (74.15%) aligned exactly 1 time
38861203 (19.23%) aligned >1 times
93.38% overall alignment rate
----------
202095066 reads; of these:
137124105 (67.85%) aligned with MAPQ>=30 and are deduplicated; of these:
109354498 (79.74%) were paired and saved into HiCfile_pair2.bam
The source data in sra format are downloaded via GEO accession number using the command fastq-dump of SRA Toolkit and coverted to fastq format.
To download the data related to a GEO accession number, go to the bottom of that page and click on the SRA number under the section “Relations”. After that, under the section “Runs” you will find the SRR files, then run the following:
fastq-dump SRRXXXXXXX --split-3
where SRRXXXXXXX has to be replaced with the specific number of the run you want to download (SRR1658570 in this documentation).
To be more specific, this code will either download the SRA file under the output directory of the SRA Toolkit installation folder (see here to setup a custom output directory), but it will also convert the SRA file into fastq files and dump each read into separate files in the current working directory (the files you will need), to produce:
SRRXXXXXXX_1.fastqSRRXXXXXXX_2.fastqSRRXXXXXXX.fastq
where paired-end reads in SRRXXXXXXX.sra are split and stored into SRRXXXXXXX_1.fastq and SRRXXXXXXX_2.fastq, and SRRXXXXXXX.fastq (if present) contains reads with no mates.
Note! To produce our final results, use this GEO accession number: GSM1551550.
This is section allows to generate the bedpe read pairs file (see here the bedpe format specifications). Note that this file is not required in order to proceed with this pipeline but may be useful for other types of analyses.
In order to generate the bedpe file, use the following unix code after having generated the two bam files above:
bedtools bamtobed -i HiCfile_pair1.bam | sort -k 4,4 > HiCfile_pair1.bed
bedtools bamtobed -i HiCfile_pair2.bam | sort -k 4,4 > HiCfile_pair2.bed
paste HiCfile_pair1.bed HiCfile_pair2.bed | awk -v OFS='\t' '{print $1, $2, $3, $7, $8, $9, $4, ".", $6, $12}' > HiCfile_paired.bedpe
rm HiCfile_pair1.bed
rm HiCfile_pair2.bed
The fragment-end (FEND) bed file is used to generate the observed contact matrices and normalize the data (if Yaffe-Tanay normalization approach is used). It contains restriction site coordinates and optional additional information related to fragment properties, such as GC content and mappability score (needed only if you wish to normalize your data using the explicit-factor approach from Yaffe and Tanay). Specifically, for each fragment the GC content of 200 bp upstream and downstream to the restriction site is computed. For the mappability score, the entire genome sequence is split into artificial reads (50 bp reads, starting every 10 bp) and then mapped back to the genome. For each fragment end the mappability score is then defined to be the portion of artificial reads mapped uniquely to the genome (MAPQ > 30) within a 500-bp window upstream and downstream to the fragment. Fragment ends with a mappability score less than 0.5 are then discarded (Yaffe and Tanay, 2011).
Since generating the FEND bed file may be time consuming, we provide the most common FEND files available for download (DpnII is the same restriction site than MboI):
- HindIII-hg38
- MboI-hg38 (or DpnII-hg38) (file used in this documentation)
- Arima_Kit-hg38 (restriction ezymes cocktail with MboI and Hinfl)
- NcoI-hg38
- HindIII-mm10
- MboI-mm10 (or DpnII-mm10)
- MboI-dm6 (or DpnII-dm6)
Perform the following steps ONLY if you need to generate a new fragment end bed file (because you are using another species or a different restriction enzyme than those provided above). Otherwise, download the file of interest above and go to the data normalization section.
The generation of the FEND file is performed with a single unix command line (replace parameters in the code below accordingly) and comprises the following steps:
- Generation of the restriction enzyme sequence.
- Multiple alignment of the restriction site using Bowtie 2, to locate all the coordinates of the restriction enzyme sites.
- Conversion of the output sam file to bed format.
# Make the bash script executable
chmod u+x ./HiCtool-master/scripts/HiCtool_generate_fend_file.sh
# Run the script
./HiCtool-master/scripts/HiCtool_generate_fend_file.sh \
-h ./HiCtool-master/scripts/ \
-o /your_output_directory/ \
-e MboI \
-g /path_to_the_genome_indexes/index \
-s hg38 \
-p 32
where:
-h: path to the HiCtool scripts with the final trailing slash/.-o: path to save the output files. If the folder does not exist, it is created automatically.-e: the restriction enzyme or enzymes passed between square brackets (example: [MboI,Hinfl] for the cocktail of the Arima Kit).-g: Bowtie2 genome indexes. Only the filename should be passed here without extension, in this caseindex.-s: species name.-p: the number of parallel threads (processors) to use for alignment and preprocessing. The more the fastest the process.
The structure of the output directory is the following:
~/your_output_directory
|___ restrictionsites.bed
|___ restrictionsites_log.txt
The following output files are generated:
-
restrictionsites.bedis the FEND bed file. -
restrictionsites_log.txtis the log file with the restriction site multiple alignment statistics. -
If you will be using the Hi-Corrector normalization approach, you do NOT need to perform the following steps, because additional information such as GC content or mappability score are not needed. You can use
restrictionsites.bedas the FEND file. -
Proceed with the following section ONLY if you will be using Yaffe and Tanay's normalization approach.
You can decide if adding both the GC content and mappability score information or only one of them. Our suggestion is to add both at this point, so you will have a complete FEND file, then you may decide later on if not considering one of these features for normalization. This process is very time consuming, so we suggest to use the highest number of processors available.
The update of the FEND file to add GC content and/or mappability score is performed with a single unix command line (replace parameters in the code below accordingly) and comprises the following steps:
- Splitting
restrictionsites.bedinto single bed files, one per each chromosome. - Adding GC content information per each single chromosome (if GC content selected).
- Adding mappability score information per each single chromosome (if mappability selected).
- Merging all the files, removing fragments with mappability score < 0.5, and parsing the file format to generate the final FEND bed file.
# Make the bash script executable
chmod u+x ./HiCtool-master/scripts/HiCtool_fend_file_add_gc_map.sh
# Run the script
./HiCtool-master/scripts/HiCtool_fend_file_add_gc_map.sh \
-h ./HiCtool-master/scripts/ \
-o /your_output_directory/ \
-s hg38 \
-p 32 \
-b /a_path/gc5Base.bw \
-m /a_path/hg38.fa
where:
-h: path to the HiCtool scripts with the final trailing slash/.-o: path to save the output files. If the folder does not exist, it is created automatically.-s: species name.-p: the number of parallel threads (processors) to use for alignment and preprocessing. The more the fastest the process.-b: gc5Base.bw file with the GC content information (download it from this website: hgdownload.cse.ucsc.edu/gbdb/your_species/bbi/). If the GC content information has been already calculated for the reference genome (i.e. you have run already this code using the same reference genome), you will have aGC_infodirectory with severaltxtfiles (one per each chromosome); therefore you can input this directory path to avoid this redundant step. If-bis not declared, the GC content information is not added.-m: the reference genome in fasta format to add the mappability score to the FEND file. If the mappability score has been already calculated for the reference genome (i.e. you have run already this code using the same reference genome), you will have amappability_infodirectory with severaltxtfiles (one per each chromosome); therefore you can input this directory path to avoid this redundant step. If-mis not declared, the mappability score information is not added.
If both -b and -m are declared, the structure of the output directory is the following (i stands for a general chromosome, and there are gonna be one of those files per each chromosome):
~/your_output_directory
|___ restrictionsites.bed
|___ chri.bed
|___ chri_gc.bed
|___ chri_map.bed
|___ restrictionsites_gc_map.bed
|___ GC_info
|___ chri.txt
|___ mappability_info
|___ chri.txt
|___ artificial_reads.log
The following output files are generated:
restrictionsites.bedis the FEND bed file without information of GC content or mappability score.chri.bedare single bed files, one per each chromosome, derived by the splitting ofrestrictionsites.bed.chri_gc.bedare the correspondent ofchri.bedwith GC content information added.chri_map.bedare the correspondent ofchri.bedwith mappability score information added.restrictionsites_gc_map.bedis the final FEND file with GC content and mappability score.GC_infois a folder which contains the GC content information, one file per chromosome, specific to the reference genome. If you wish to compute another FEND file for the same reference genome but different restriction enzyme, you may keep this folder and input it later in-b.mappability_infois a folder which contains the mappability score information, one file per chromosome, specific to the reference genome. If you wish to compute another FEND file for the same reference genome but different restriction enzyme, you may keep this folder and input it later in-m. This folder contains alsoartificial_reads.logwhich contains the alignment statistics of the artificial reads used to compute the mappability score of the fragments.
restrictionsites_gc_map.bed is the FEND file that you will use in Yaffe and Tanay's normalization approach.