My approach to determining differential binding in ChIP-Seq data sets

Proteins within cells have many roles, one of which is the regulation of genes. The presence of proteins at specific locations on the genome can turn a gene on or off, or change how often it is read. When specific genes are activated, the state of the cell can change. This process allows the many cell types in the body to form from a single source, stem cells. While this process is required for healthy development, mistakes caused by mutation can lead to the formation of cancer.

T-Cell Acute Leukemia 1 (TAL1) is a key protein required for blood cell development. It is needed for the formation of healthy red blood cells and platelets. However, if it is present in T-Cells, T-Cell acute lymphoblastic leukemia (T-ALL) will develop. Of all T-ALL cases, 65% are caused by the aberrant presence of TAL1. Understanding the novel role of TAL1 in T-Cells is critical for understanding the process of leukemogenesis, i.e. the development of cancer from healthy T-Cells.

The analysis of one protein’s binding locations in the genome of a single cell type is well understood. Identifying how these binding locations differ between cell types is not, because it requires comparing samples from more than one condition, in this case healthy and cancerous cells. This gap makes it difficult to explain how TAL1 can have such different effects in different cells. Thus, a new methodology must be developed in order to understand the mechanism through which TAL1 drives cancer development. My thesis focused on developing such a method.

My approach had two key steps. First, all of the TAL1 binding locations from the two conditions were combined into a single unified set. This set of locations represented the regions of the genome where TAL1 is most likely to bind. Once the unified location set was created, a unified data matrix (UDM) was generated, which condensed the information from the different data sets into a single matrix. Each column in the matrix represented a data set being compared and each row represented a member of the unified set of genomic locations.
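The first step can be illustrated with a minimal sketch. The thesis summary does not say how overlapping locations were merged or which value filled each cell of the UDM, so the choices here are assumptions made only for illustration: peaks are half-open (chromosome, start, end, score) tuples, overlapping or touching peaks are merged into one region, and each cell holds the highest score of any peak from that data set overlapping the region (0 if there is none). The function names and the toy peaks are hypothetical.

```python
from collections import defaultdict

def merge_peaks(samples):
    """Merge overlapping peaks from every sample into one unified region set."""
    by_chrom = defaultdict(list)
    for peaks in samples.values():
        for chrom, start, end, _score in peaks:
            by_chrom[chrom].append((start, end))
    unified = []
    for chrom in sorted(by_chrom):
        intervals = sorted(by_chrom[chrom])
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:  # overlaps or touches the current region
                cur_end = max(cur_end, end)
            else:
                unified.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        unified.append((chrom, cur_start, cur_end))
    return unified

def build_udm(unified, samples):
    """Rows are unified regions, columns are samples (sorted by name)."""
    names = sorted(samples)
    matrix = []
    for chrom, r_start, r_end in unified:
        row = []
        for name in names:
            best = 0.0  # assumption: 0 means no peak in this region
            for c, s, e, score in samples[name]:
                if c == chrom and s < r_end and e > r_start:
                    best = max(best, score)
            row.append(best)
        matrix.append(row)
    return names, matrix

# Toy input, invented for illustration only.
samples = {
    "healthy_1": [("chr1", 100, 200, 5.0), ("chr1", 500, 600, 2.0)],
    "leukemic_1": [("chr1", 150, 250, 7.0), ("chr2", 10, 50, 3.0)],
}
unified = merge_peaks(samples)
names, udm = build_udm(unified, samples)
for region, row in zip(unified, udm):
    print(region, row)
```

In this toy case the two overlapping chr1 peaks collapse into a single row, so both data sets are measured against the same location even though their original peak boundaries differed.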

Where the first step generated a uniform representation of the disparate data sets in the form of the UDM, the second step drew biologically relevant information out of the UDM. This was done using Principal Component Analysis (PCA). PCA identifies the most important aspects of variation in a data set, finding patterns in the noise. This enabled the variation in the data to be concentrated into a few dimensions, decreasing the number of columns needed to represent the information. Each new column could then be analyzed to determine whether the variance it captured was caused by biological factors.
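As a rough illustration of the second step, PCA on a matrix with locations as rows and data sets as columns can be sketched with NumPy's singular value decomposition. This is a generic textbook formulation, not necessarily the implementation used in the thesis. The function name `pca`, the toy matrix, and the decision to center each column without scaling are all assumptions.

```python
import numpy as np

def pca(matrix, n_components):
    """Return row scores, column loadings, and explained-variance ratios."""
    X = np.asarray(matrix, dtype=float)
    centered = X - X.mean(axis=0)  # center each column (data set)
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]  # rows in the new columns
    loadings = Vt[:n_components]  # weight of each original column
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, loadings, explained[:n_components]

# Toy matrix, invented for illustration: 6 locations x 4 data sets.
# Columns 0-1 stand in for healthy samples, columns 2-3 for leukemic ones.
toy = [
    [9, 8, 0, 1],
    [8, 9, 1, 0],
    [0, 1, 9, 8],
    [1, 0, 8, 9],
    [5, 5, 5, 5],
    [4, 5, 5, 4],
]
scores, loadings, explained = pca(toy, n_components=2)
print("explained variance ratios:", explained)
print("loadings on the first component:", loadings[0])
```

In the toy matrix, the first component's loadings take one sign for the first two columns and the opposite sign for the last two, which is the kind of pattern that would suggest the component reflects the difference between conditions. The overall sign of a component is arbitrary, so only the contrast is meaningful.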

My TAL1 project comprised 22 data sets, 10 healthy and 12 cancerous. This resulted in a UDM with 22 columns. Using PCA, 68% of the variance could be represented in 3 columns. The first of these columns was found to separate the leukemic data sets from the healthy ones, enabling the isolation of genomic locations that were important for each condition.
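Two small calculations sit behind a summary like this: counting how many leading components are needed to reach a given share of the variance, and splitting locations by their score on the first component. The sketch below shows one plausible way to do both. The summary does not state how locations were assigned to a condition, so the symmetric cutoff is an assumption, and every number in the example is invented rather than taken from the thesis.

```python
def components_needed(variance_ratios, target):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    total = 0.0
    for count, ratio in enumerate(variance_ratios, start=1):
        total += ratio
        if total >= target:
            return count
    return len(variance_ratios)

def split_by_score(pc1_scores, cutoff):
    """Partition row indices by first-component score.

    Rows scoring within +/- cutoff are left unassigned (assumed rule).
    """
    positive = [i for i, s in enumerate(pc1_scores) if s > cutoff]
    negative = [i for i, s in enumerate(pc1_scores) if s < -cutoff]
    return positive, negative

# Hypothetical inputs, for illustration only.
ratios = [0.5, 0.2, 0.1, 0.1, 0.1]
n_needed = components_needed(ratios, target=0.75)

pc1_scores = [8.0, 7.5, -8.2, -7.9, 0.3, -0.1]
positive_rows, negative_rows = split_by_score(pc1_scores, cutoff=2.0)
print(n_needed, positive_rows, negative_rows)
```

Which sign corresponds to which condition cannot be read from the scores alone, because the direction of a principal component is arbitrary. It has to be checked against which data sets load positively or negatively on that component.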

The final result of the analysis was the identification of 1,408 strictly healthy TAL1 binding locations and 1,291 strictly cancerous binding locations. These locations can be used in the future for the development of novel treatments and detection mechanisms. The identification of these locations also gives us insight into TAL1’s cancerous mechanism.