SIP method
SIP was developed to analyze Hi-C experiments from a .hic or .mcool file. The first step of SIP is to obtain the raw interaction values at the resolution (bin size) chosen by the user.
For .hic files, we use Juicer Tools (https://github.com/aidenlab/juicer/wiki) to retrieve this information for each chromosomal chunk (the chunk size depends on the matrix size chosen by the user) (Durand et al., 2016). For .mcool files, cooler (https://github.com/mirnylab/cooler) and cooltools (https://github.com/mirnylab/cooltools) are used (Abdennur and Mirny, 2019).
Data are dumped at the resolution and with the normalization chosen by the user (the normalization options provided by Juicer are KR/NONE/VC/VC_SQRT). KR normalization is preferred. Furthermore, the coverage vector (VC) is also retrieved at the same resolution to filter out rows and columns of the matrix with insufficient data.
The genome is analyzed in sliding windows whose size depends on the resolution and matrix size specified by the user. For example, a resolution of 5 kb with a matrix size of 2000 (default parameters) covers a 10 Mb region. Retrieved data are used in two ways: (1) observed minus expected (OmE) and (2) distance-normalized values, computed by the formulas:
(1) OmE = observedValue - expected
(2) normalizedValue = (observedValue + 1)⁄(expected + 1)
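As a quick check of the arithmetic above, the following minimal sketch computes the window span and both formulas for a single pixel. The function names and the example pixel values are hypothetical and chosen only for illustration; they are not taken from the SIP source code.

```python
def window_span_bp(resolution_bp, matrix_size):
    """Genomic span covered by one square window, in base pairs."""
    return resolution_bp * matrix_size


def observed_minus_expected(observed, expected):
    """Formula (1): OmE = observedValue - expected."""
    return observed - expected


def distance_normalized(observed, expected):
    """Formula (2): normalizedValue = (observedValue + 1) / (expected + 1)."""
    return (observed + 1) / (expected + 1)


# Default parameters from the text: 5 kb bins, 2000 bins per window side.
span = window_span_bp(5_000, 2_000)

# Hypothetical pixel: 30 observed contacts where 9 are expected at that distance.
ome = observed_minus_expected(30, 9)
norm = distance_normalized(30, 9)

print(span, ome, norm)
```

The +1 added to numerator and denominator in formula (2) keeps the ratio defined when the expected value is zero.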
For .mcool files, SIP needs a recent version of cooltools (version >= 0.3.0) and cooler (version >= 0.8.6). First, cooltools compute-expected is used to compute the normalized expected vector, and then cooler dump is used to dump the weighted observed data.
As with .hic files, the genome is analyzed in sliding windows and the retrieved data are used to compute OmE and the distance-normalized value. Because the values obtained via cooler and cooltools are very small, a scaling factor is applied in each formula:
(1) OmE = observedValue × 1000 - expected × 1000
(2) normalizedValue = (observedValue × 10000 + 1)⁄(expected × 10000 + 1)
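The sketch below illustrates why a scaling factor helps, reading the 1000 and 10000 in the formulas above as multipliers (an assumption about the intended notation). The function names and the example values are hypothetical. With fractional inputs, the +1 terms dominate the unscaled ratio and push it towards 1; scaling first restores the contrast between observed and expected.

```python
def scaled_ome(observed, expected, factor=1000):
    """Formula (1) for .mcool input, with the factor read as a multiplier."""
    return observed * factor - expected * factor


def scaled_normalized(observed, expected, factor=10000):
    """Formula (2) for .mcool input, with the factor read as a multiplier."""
    return (observed * factor + 1) / (expected * factor + 1)


# Hypothetical weighted values: observed is four times the expected value.
observed, expected = 0.004, 0.001

unscaled = (observed + 1) / (expected + 1)       # close to 1: enrichment is hidden
scaled = scaled_normalized(observed, expected)   # 41 / 11: enrichment is visible
ome = scaled_ome(observed, expected)

print(unscaled, scaled, ome)
```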
OmE and normalizedValue are used, respectively, to detect loops and to compute loop scores (Figure 1).
Image processing methods are used to smooth the signal, to increase the contrast, to decrease the noise of the image, and to detect loop candidates.
The first step uses Gaussian blurring to smooth the Hi-C signal and avoid detecting outlier pixels. Contrast enhancement then increases the contrast between the background and the signal of interest (Schneider et al., 2012).
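To illustrate the smoothing step, here is a minimal pure-Python sketch that blurs a small matrix with a 3x3 binomial kernel, a common discrete approximation of a Gaussian. It is not the ImageJ implementation used by SIP; the kernel size, the edge handling (clamping) and the function name are assumptions made for this example.

```python
def blur_3x3(image):
    """Smooth a 2D list with the kernel [1, 2, 1] x [1, 2, 1] / 16.

    Pixels outside the image are replaced by the nearest edge pixel.
    """
    rows, cols = len(image), len(image[0])
    weights = (1, 2, 1)
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            for dr in (-1, 0, 1):
                rr = min(max(r + dr, 0), rows - 1)
                for dc in (-1, 0, 1):
                    cc = min(max(c + dc, 0), cols - 1)
                    total += weights[dr + 1] * weights[dc + 1] * image[rr][cc]
            out[r][c] = total / 16
    return out


# A single outlier pixel on an empty background.
image = [[0] * 5 for _ in range(5)]
image[2][2] = 16

blurred = blur_3x3(image)
for row in blurred:
    print(row)
```

The outlier is spread over its neighbors, so a single extreme pixel no longer stands out as sharply from its surroundings.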
Then a white top-hat transform (a mathematical morphology method (Beucher and Meyer, 1993)) from the MorphoLibJ plugin is used to homogenize the background and make bright structures easier to detect (Legland et al., 2016). The last step uses combinations of minimum and maximum filters (Schneider et al., 2012) to remove isolated pixels and further homogenize the background. These steps provide a corrected image of the interactions (Figure 1).
The regional maxima detection algorithm available in ImageJ is used to detect candidate loops (Schneider et al., 2012).
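The two operations above can be sketched in pure Python as follows. A white top-hat is the image minus its morphological opening (erosion followed by dilation); a minimum filter followed by a maximum filter is itself an opening and removes bright structures smaller than the window. The square structuring elements, the radii, the filter order and the function names are assumptions for illustration and do not reproduce the MorphoLibJ or ImageJ settings used by SIP.

```python
def window_filter(image, radius, reducer):
    """Apply `reducer` (min or max) over a square window clipped at the borders."""
    rows, cols = len(image), len(image[0])
    return [
        [
            reducer(
                image[rr][cc]
                for rr in range(max(r - radius, 0), min(r + radius + 1, rows))
                for cc in range(max(c - radius, 0), min(c + radius + 1, cols))
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def opening(image, radius):
    """Minimum filter (erosion) followed by maximum filter (dilation)."""
    return window_filter(window_filter(image, radius, min), radius, max)


def white_top_hat(image, radius):
    """Image minus its opening: keeps bright structures smaller than the window."""
    opened = opening(image, radius)
    return [
        [value - background for value, background in zip(row, opened_row)]
        for row, opened_row in zip(image, opened)
    ]


# Background of 10, a 3x3 bright blob, and one isolated bright pixel.
image = [[10] * 7 for _ in range(7)]
for r in range(2, 5):
    for c in range(2, 5):
        image[r][c] = 50
image[0][6] = 50

flattened = white_top_hat(image, radius=2)  # background becomes 0
cleaned = opening(flattened, radius=1)      # isolated pixel is removed

for row in cleaned:
    print(row)
```

In this toy example the top-hat subtracts the uniform background, and the subsequent min/max pass keeps the 3x3 blob while discarding the single isolated pixel.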
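As a simplified illustration of maxima detection, the sketch below reports pixels that are strictly greater than all of their 8-connected neighbors. This is a simplification: true regional maxima also include flat plateaus, and the ImageJ algorithm used by SIP is not reproduced here. The function name and the example matrix are hypothetical.

```python
def local_maxima(image):
    """Return (row, col) of pixels strictly greater than all 8-connected neighbors."""
    rows, cols = len(image), len(image[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            neighbors = [
                image[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
                if (rr, cc) != (r, c)
            ]
            if all(image[r][c] > n for n in neighbors):
                peaks.append((r, c))
    return peaks


image = [
    [0, 0, 0, 0, 0],
    [0, 5, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 7, 2],
    [0, 0, 0, 2, 0],
]

peaks = local_maxima(image)
print(peaks)
```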
Then the distance normalized values from the original matrix are used to remove potential false positives. This filtering includes several steps.
The first step is to exclude pixels near columns and rows with insufficient data (by default, any candidate with >= 6 zero-valued pixels in the surrounding 24-pixel neighborhood is filtered out).
The second filter removes pixels without heightened interactions compared to the surrounding 8-pixel neighborhood and the 24-pixel neighborhood. Additionally, loops must display a decay in value between these neighborhoods to avoid isolated enriched pixels.
A third filter removes low-signal pixels with a normalized value < 0.30. Candidate loops are then filtered so that the center pixel is at least 1.2-fold higher than nearby pixels (PA score). Loops are also filtered based on a Poisson CDF such that the probability that the center pixel is higher than the nearby pixels exceeds 0.9. Finally, candidate loops are removed if their PA score is lower than the PA scores of a top percentage of random sites (e.g. an FDR of 0.01 filters by the value of the top 1% of random sites).
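A minimal sketch of this first filter, assuming the 24-pixel neighborhood is the 5x5 square around the candidate without its center, and that positions outside the matrix are simply ignored. The function names and the toy matrix are hypothetical.

```python
def count_zero_neighbors(matrix, row, col, radius=2):
    """Count zero-valued pixels in the (2*radius+1)^2 - 1 neighborhood of a pixel."""
    rows, cols = len(matrix), len(matrix[0])
    zeros = 0
    for r in range(row - radius, row + radius + 1):
        for c in range(col - radius, col + radius + 1):
            if (r, c) == (row, col):
                continue  # the candidate itself is not part of its neighborhood
            if 0 <= r < rows and 0 <= c < cols and matrix[r][c] == 0:
                zeros += 1
    return zeros


def has_sufficient_data(matrix, row, col, max_zeros=6):
    """Keep a candidate only if it has fewer than `max_zeros` zero neighbors."""
    return count_zero_neighbors(matrix, row, col) < max_zeros


# Toy matrix where the two right-most columns have no data.
matrix = [[0 if c >= 5 else 3 for c in range(7)] for _ in range(7)]

print(has_sufficient_data(matrix, 3, 3))  # one empty column in range
print(has_sufficient_data(matrix, 3, 4))  # two empty columns in range
```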
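One possible reading of this second filter is sketched below: the center must exceed the mean of its 8-pixel neighborhood, which in turn must exceed the mean of its 24-pixel neighborhood. The exact comparison used by SIP is not given in the text, so the criterion, the function names and the toy matrices are assumptions. The sketch also assumes the candidate lies at least two pixels away from the matrix border.

```python
def neighborhood_mean(matrix, row, col, radius):
    """Mean of the square neighborhood around a pixel, excluding the pixel itself.

    radius=1 gives the 8-pixel neighborhood, radius=2 the 24-pixel neighborhood.
    """
    values = [
        matrix[r][c]
        for r in range(row - radius, row + radius + 1)
        for c in range(col - radius, col + radius + 1)
        if (r, c) != (row, col)
    ]
    return sum(values) / len(values)


def shows_decay(matrix, row, col):
    """True if the signal decreases from the center to the 8- and 24-pixel means."""
    center = matrix[row][col]
    mean8 = neighborhood_mean(matrix, row, col, 1)
    mean24 = neighborhood_mean(matrix, row, col, 2)
    return center > mean8 > mean24


# Loop-like signal: enrichment decays gradually away from the center.
loop_like = [
    [1, 1, 1, 1, 1],
    [1, 2, 2, 2, 1],
    [1, 2, 4, 2, 1],
    [1, 2, 2, 2, 1],
    [1, 1, 1, 1, 1],
]

# Isolated pixel: enriched center on a flat background.
isolated = [[1] * 5 for _ in range(5)]
isolated[2][2] = 4

print(shows_decay(loop_like, 2, 2), shows_decay(isolated, 2, 2))
```

Under this criterion the isolated pixel is rejected because its 8-pixel and 24-pixel means are equal, so there is no decay between the two neighborhoods.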
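The scoring filters above can be sketched as follows. Several details are assumptions made for illustration, since the text does not specify them: the PA score is taken as the center value divided by the mean of nearby pixels, the Poisson test evaluates the CDF at the center value with the nearby mean as the rate, and the FDR threshold is the lowest score among the top fraction of random-site scores. All function names and example values are hypothetical.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), with k a non-negative integer."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total


def fdr_threshold(random_scores, fdr=0.01):
    """Lowest score within the top `fdr` fraction of random-site scores."""
    ranked = sorted(random_scores, reverse=True)
    n_top = max(1, round(len(ranked) * fdr))
    return ranked[n_top - 1]


# Hypothetical candidate: center count and counts of nearby pixels.
center = 12
nearby = [6, 5, 7, 6, 6, 7, 5, 6]
nearby_mean = sum(nearby) / len(nearby)

pa_score = center / nearby_mean
probability = poisson_cdf(center, nearby_mean)

# Hypothetical PA scores measured at 200 random sites.
random_scores = [i / 100 for i in range(1, 201)]
threshold = fdr_threshold(random_scores, fdr=0.01)

keep = pa_score >= 1.2 and probability > 0.9 and pa_score >= threshold
print(pa_score, round(probability, 4), threshold, keep)
```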
Abdennur, N., Mirny, L., 2019. Cooler: scalable storage for Hi-C data and other genomically labeled arrays. Bioinformatics. https://doi.org/10.1093/bioinformatics/btz540
Durand, N.C., Shamim, M.S., Machol, I., Rao, S.S.P., Huntley, M.H., Lander, E.S., Aiden, E.L., 2016. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments. Cell Syst. 3, 95–98. https://doi.org/10.1016/j.cels.2016.07.002
Legland, D., Arganda-Carreras, I., Andrey, P., 2016. MorphoLibJ: integrated library and plugins for mathematical morphology with ImageJ. Bioinformatics 32, 3532–3534. https://doi.org/10.1093/bioinformatics/btw413
Schneider, C.A., Rasband, W.S., Eliceiri, K.W., 2012. NIH Image to ImageJ: 25 years of image analysis. Nat. Methods 9, 671–675. https://doi.org/10.1038/nmeth.2089
If you use SIP or SIPMeta please cite us.
Rowley MJ, Poulet A, Nichols M, Bixler B, Sanborn A, Brouhard E, Hermetz K, Linsenbaum H, Csankovszki G, Lieberman Aiden E, Corces VG. 2020. Analysis of Hi-C data using SIP effectively identifies loops in organisms from C. elegans to mammals. Genome Res. https://genome.cshlp.org/content/early/2020/03/03/gr.257832.119.long