Differential chromatin binding analysis


ChIP-Seq Read Alignment

Reads for each data set were independently mapped to the human genome (hg19) using the BWA mem algorithm(v0.6.2). We filtered for reads that were mapped uniquely to the genome and removed duplicates.

bwa mem <reference genome>.fa <unaligned reads>.fq > <aligned reads>.sam

Identification of Enriched Regions

The MACS (version 2.01) callpeak algorithm was used to identify enriched peak summits in each of the data sets being assessed. A q-value cut off of 0.01 was used when identifying peaks. Only peaks with a p-value greater than \[10^-20\] where used in the analysis.

Unification of Enriched Regions

Before proceeding, the independently identified enriched summits were unified into a single set, the AFS. To form the AFS, the summit locations (p) from each data set (D) were combined into a single set P. P was partitioned such that the summit locations found wiqthin w (350bp) of one another were combined into a subset s, the set of all s being P^. Once all the summits had been grouped, a single summit location was determined for each s in P^. The single summit location was defined as the mean of all summits in a subset s. Note, this had no effect on the location of singleton members of P^. The resulting mean locations were defined as the set of unified enriched regions, i.e. the AFS.

Forming and Normalizing the Unified Density Matrix

For all data sets in the analysis, a vector of read count pile ups was determined. One pile up count per vector, from each member of the AFS. The resulting matrix had a width and height respective to the number of data sets in the analysis, and the number of peaks in the AFS. The read count pile up was defined as the number of reads which started or ended within w (350bp) of the peak summit. The read count density was determined by normalizing the read count pile ups by their respective peak widths (701bp). The matrix of normalized peak densities was defined as the UDM. To limit inter data set variation, the UDM was quantile normalized with respect to the data sets.

Spectral Decomposition of Unified Density Matrix

Spectral decomposition, in the form of single value decomposition (SVD), was applied to the quantile normalized UDM. Before the application of the SVD, the normalized UDM was row wise mean zero adjusted, resulting in matrix M. The mean normalized matrix M was split into the left (U) and right(V) matrices using SVD. The matrix U shares the dimensions with M and represents the principle components of the UDM. The first principle component vector (PCV) is the first column of the matrix U, and has a weighting for each peak in the AFS.

Isolating Subsets of the AFS

The dot product of the quantile normalized UDM and the matrix U resulted in weighted positions for all data sets in the analysis. Visual inspection of individual dimensions of the weighted positions, was used to identify which PCV separated the data sets of interest. The weights from the PCV that separated the data set of interest were used to define key peak sets. The peaks that had a weight greater or less than N standard deviation of the PCV mean, were considered key in the separation along that axis. These key peaks were separated into two sets, one for peaks greater than N standard deviations of the mean and one for those less than. This process was repeated for all PVC with weights that separated data sets of interest. Once all the peak sets had been determined, peaks that were shared by two or more sets were removed. This resulted in two sets of mutually exclusive peaks for each dimension that provided visual separation of relevant data sets.

Determining the Prevalence of Ebox Varients

The Ebox (CA**NN**TG) had 10 variants. We determined the prevalence of each variant for each set of key peaks. The prevalence was represented as the ratio of peaks containing the Ebox variant and the total number of peaks in the key peak set.

Motif De Novo Search and Annotation.

The Homer (v4.7) denvo algorithm was used to identify enriched motifs. Motifs were identified independently with fixed lengths of 6,7, and 8bp. The control set used when identifying enriched motifs consisted of peaks that fell within 0.25 standard deviations of the mean of each relevant PCV.

The STAMP webservice was used to annotate the motifs. The PCC was used as the column comparison metric, and the Ungapped Smith-Waterman algorithm was used as the alignment method. Both the TRANSFAC and JASPAR motif databases were used in the analysis.

Identifying Composite Motifs

The occurrence of known composite motifs, principally Ebox varients and Gata, were determined for each set of key peaks. Novel preferred distances between motifs were discovered by pairwise investigation of motif locations under sets of key peaks. Motifs that were both present in more than 50 key peaks were compared. The frequency of occurrence for the distance between motif start sites was plotted for a range from -100 to 100 bp. Pairs were flagged if significant frequency outliers were observed. The distance with the greatest frequency was defined as the preferred distance.

Gene Association

The GREAT Basal Plus Extension algorithm was used to associate peaks with genes. The basal region extended 5kbp upstream and 1kbp downstream of each gene’s transcription start site. The distal regions extended to the nearest basal region or a maximum of 1Mbp. The peaks were associated with the hg19 Refseq gene list.

Gene Ontology

DAVID was used for GO, following the May 2016 database update. GO terms from the Biological Process and Molecular Function categories were found for each set of genes associated to key peaks. Metascape was used to generate the GO term networks directly from the associated genes. A p-value cut off of 0.1 was used to restrict the number of genes displayed in the network.