Genome-wide cis-Regulatory Element Explorer
To install GREE, first install the devtools package, if it is not already installed:
install.packages("devtools") Then, install GREE from GitHub:
devtools::install_github("JiekaiLab/GREE")This should take no more than a few minutes.
Firstly , we create a tile matrix (user-defined tile size, default is 500 bp),the exon region, blacklist region and CpG island region overlapped by TSS are removed. This task is done by the function getBinByCellMatrix, and it takes as input a preprocessed fragment information in scATAC-seq data with 'RG' column containing cell barcodes. And genes, CpG, exon, blacklist region information with the same genomic version as the fragment information. These messages are in GRanges format. Then we can get the BinByCell matrix.
lnames = load("./data/data_examples.RData")
binByCell = getBinByCellMatrix(frag,genes,CpG,exon,Blacklist)Secondly, 1000 cells were randomly selected from each group, and sum the accessibility of cells extracted from each group. The summed accessibility then multiplied by normalization factors. The normalization factors to counterbalance sequencing depth were calculated as 10^7 divided by the sum of nFrags per group. The getBinByGroupMatrix function, takes as input the BinByCell matrix obtained by the getBinByCellMatrix function and a data frame containing metadata associated with each cell, minimally including the cell group labels. You can specify group column these using the group_col arguments:
binByGroup_norm=getBinByGroupMatrix(binByCell,cell_anno,group_col="CellTtypes")Thirdly, we filter out the bin that might be noisy, that is, the bin that has a smaller maximum accessibility. And select of highly variable bin based on the ratio of maximum to minimum accessibility of bin in each group. The getHVbinMatrix function, takes as input the binByGroup_norm matrix obtained by the getBinByGroupMatrix function and the threshold bin_max_thr and bin_fc_thr to get high variation bin.
binByGroup_norm_HVbin = getHVbinMatrix(binByGroup_norm,bin_max_thr=7,bin_fc_thr=70)
Finally, for each gene, GREE use getGeneByGroupMatrix function to sum the accessibility of highly variable bins that overlaps with the gene window (user-defined, default is 2 kb upstream and downstream of the gene body) and do not cross another gene region. And we can get the GeneByGroup matrix.
GeneByGroup = getGeneByGroupMatrix(genes,binByGroup_norm_HVbin)We can use the peatmap package to show some marker genes.
library(stringr)
library(pheatmap)
library(ggplot2)
gene = c('GAD1','SLC17A6','TFAP2A','LMX1A','FEZF2')
data = as.data.frame(t(as.matrix(GeneByGroup[gene,])))
options(repr.plot.width=8, repr.plot.height=5)
pal <- colorRampPalette( c("#7431A6","#000000","#F2F208"),bias = 1.7 )(255)
pheatmap(data,
fontsize_col=12,
color = pal,
border_color = "#dcdcdc",
scale = "column",
cluster_rows = F,
cluster_cols = F,
width = 6,
height = 3
)Get the top 10% bins for each cell type and perform motif enrichment analysis using Homer.
bin_list = getGroupBinList(binByGroup_norm_HVbin,probs = 0.90)
PATH_findMotifsGenome = '/public/home/yyzhao_gibh/opt/biosoft/homer-4.11/bin/findMotifsGenome.pl'
motifAnalysis(bin_list,PATH_findMotifsGenome,output = "motif_result")
# plot
KnownMotif = read.csv("./motif_result/motif.csv")
options(repr.plot.width=8, repr.plot.height=6)
tf = c('DLX1',"LHX2","SOX3","RFX","AP-2ALPHA")
KnownMotif2 = KnownMotif[KnownMotif$TF %in% tf,]
KnownMotif2$TF = factor(KnownMotif2$TF,levels = rev(tf))
KnownMotif2$cluster = factor(KnownMotif2$cluster,levels = c('GABAergic','Glutamate','RG','RFX4+','Neural_Crest','Retinal'))
KnownMotif2$rank[KnownMotif2$rank>15]=15
ggplot(KnownMotif2, aes(x=cluster, y=TF,color=-Log.P.value,size=-rank)) +
geom_point() + scale_colour_gradient2(low = "white",high = "red") +
theme_bw(base_size = 20)+
scale_size_continuous(breaks = -c(1,5,10,15,20), labels = c(1,5,10,15,20),range = c(1, 8), name = "Rank") +
ylab("") + xlab("") + labs(title="") +
theme(panel.grid=element_blank(),
axis.text.x = element_text(angle = 60,vjust = 1,hjust = 1))
