prediction of hi-c interaction counts and …derrtyle/downloads/tylerderr...supervised learner for...

Supervised Learner for the Prediction of Hi-C Interaction Counts and Determination of Influential Features

Tyler Derr @ Yue [email protected]

Background

● Hi-C is a chromosome conformation capture (3C) based technology that outputs the number of interactions between loci at the genome-wide scale. [3]

Background

● Recent 3D prediction tools such as BACH [1] and PASTIS [5] can use Hi-C data to produce 3D genome structures

○ BACH: “It utilizes a Poisson model that better fits the count data generated from Hi-C experiments than the Gaussian model used in MCMC5C[4]...”[1]

○ PASTIS: [5] presents 4 methods, 2 of which are based on a Poisson model.

● Thanks to the recent efforts of the ENCODE and Roadmap Epigenomics projects, we have access to the following data per region (40kb resolution):

○ GC content, mappability, number of HindIII cut sites, Pol II, and 6 histone modifications such as H3K36me3

Basis of our research

● Current tools such as BACH [1] and PASTIS [5] that predict 3D genome structures from Hi-C data have trouble dealing with the bias induced by the techniques used to gather the data

○ Hi-C data collection is time consuming and expensive, and has known biases

○ It seems that Dr. Ming Hu (the creator of BACH [1]) attempted to address the biases by taking into account the enzyme cutting frequency, GC content, and sequence uniqueness when making his 3D predictions

○ However, Dr. Hu has recently stated that, according to a recent Nature paper, the Poisson distribution assumption (which is crucial to BACH) is not appropriate for Hi-C data, which would invalidate any approach that relies on it. [2]

● Can we use machine learning techniques not only to alleviate the bias, but also perhaps to predict the Hi-C data?

Predicting Hi-C

We present two methods:

Method 1: Using the entire Hi-C matrix as training data for a single Random Forest (RF) and also a single Artificial Neural Network (ANN)

Method 2: Creating a separate RF for each diagonal of the matrix (i.e. any given RF is trained only on region pairs at a fixed distance; e.g. RF_2 is trained on all region pairs that are 2 regions apart, 80kb)

The reasoning behind Method 2 is that it shows which features are more meaningful for prediction at different distances.

Predicting Hi-C

We use mESC mm9 chrs to train and validate our models

Data Features used to Learn Hi-C: 10 for each 40kb region
GC content, number of HindIII cut sites, mappability, H3K4me1, H3K4me3, H3K27ac, H3K27me3, H3K36me3, Pol II, and CTCF

Method 1: Using RF and ANN

● Training input for predicting the interaction of two regions rI and rJ consists of the 10 features for both regions
○ plus an additional feature of the distance between the regions
○ [rI.GC, rI.HindIII, … , rI.CTCF, rJ.GC, rJ.HindIII, … , rJ.CTCF, distance]

● We attempt to use the above features to predict the Hi-C interaction value between the two given regions rI and rJ, for all pairs of regions in the chr.
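The Method 1 input layout can be sketched in plain Python as follows. This is only an illustration of the vector described above, not the code used for the reported results: the function names, the choice to include each pair once with i <= j, and the use of kilobases for the distance feature are all assumptions.

```python
# Feature order follows the slide; the labels are illustrative only.
FEATURES = ["GC", "HindIII", "Map", "H3K4me1", "H3K4me3",
            "H3K27ac", "H3K27me3", "H3K36me3", "PolII", "CTCF"]
BIN_SIZE_KB = 40  # one region = 40kb


def pair_features(region_features, i, j):
    """Return [features of region i, features of region j, distance in kb]."""
    row = list(region_features[i]) + list(region_features[j])
    row.append(abs(j - i) * BIN_SIZE_KB)  # assumption: distance expressed in kb
    return row


def build_training_set(region_features, hic):
    """Build one (input row, target count) example per region pair i <= j."""
    n = len(region_features)
    X, y = [], []
    for i in range(n):
        for j in range(i, n):  # assumption: each unordered pair is used once
            X.append(pair_features(region_features, i, j))
            y.append(hic[i][j])
    return X, y
```

Here `region_features` is a list with one 10-value row per 40kb region and `hic` is the square interaction matrix for the same chromosome; each row of `X` then has 21 values (10 + 10 + distance).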


Method 2: Using RF

● We attempt to use the above features to predict the Hi-C interaction value between the two given regions rI and rJ, for all pairs of regions at a specific distance in the chr.

● e.g. For training model RF_2 we use all pairs of regions that are 80kb apart

● Input for predicting the interaction of two regions rI and rJ consists of the 10 features for both regions (the distance is not used)
○ [rI.GC, rI.HindIII, …, rI.CTCF, rJ.GC, rJ.HindIII, …, rJ.CTCF] ➔ Interaction of rI,rJ
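Selecting the training pairs for one per-distance model amounts to walking a single diagonal of the Hi-C matrix. The sketch below shows that selection in plain Python; the function names are hypothetical and the actual RF training step is omitted.

```python
def diagonal_pairs(n_regions, offset):
    """All (i, j) region pairs that are exactly `offset` regions apart."""
    return [(i, i + offset) for i in range(n_regions - offset)]


def build_diagonal_training_set(region_features, hic, offset):
    """Training rows for the model of one diagonal (e.g. offset=2 for RF_2)."""
    X, y = [], []
    for i, j in diagonal_pairs(len(region_features), offset):
        # 10 features of each region; no distance feature, since it is constant
        X.append(list(region_features[i]) + list(region_features[j]))
        y.append(hic[i][j])
    return X, y
```

With 40kb regions, `offset=2` corresponds to the 80kb pairs used for RF_2, and `offset=1` to the 40kb pairs.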

What we have so far...

● Method 1: Training on all pairs of regions from chr1 and testing our model on all pairs from chr2
○ RMSE=2.309 & R-squared=0.869
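For reference, the two reported metrics can be computed as in the sketch below. It assumes R-squared means the coefficient of determination (1 - SS_res / SS_tot); the slides do not say which definition was used, so treat that as an assumption.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between real and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination; assumes y_true is not constant."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Both functions take the real chr2 interaction values and the model's predictions as equal-length lists.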

What we have planned for the near future:
○ Performing a leave-one-out cross validation using all the mESC mm9 chrs
○ Using higher resolution 1kb region sizes
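One way to read the planned cross validation is "leave one chromosome out": train on all chromosomes but one, test on the held-out one, and repeat for every chromosome. The sketch below generates those splits under that assumption; the function name is hypothetical.

```python
def leave_one_chromosome_out(chromosomes):
    """Yield (training chromosomes, held-out chromosome), once per chromosome."""
    for held_out in chromosomes:
        train = [c for c in chromosomes if c != held_out]
        yield train, held_out


# Example with three chromosomes; a real run would list every chr available.
for train, test in leave_one_chromosome_out(["chr1", "chr2", "chr3"]):
    print("train on", train, "test on", test)
```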

Scatter Plot of Real vs Predicted Hi-C Data (figure; x-axis: Predicted Interaction Values, y-axis: Real Interaction Values, axis ticks at 300 and 600)

3D Structure of mESC mm9 chr2

(figure; panels: Using Predicted Hi-C, Using raw Hi-C)

3D models generated using PASTIS (MDS) 3D prediction software [5]

Coloration corresponds to the distance from the starting point of the chr (blue, cyan, green, yellow, orange, red) [2]

Hi-C Heatmaps of mm9 Chr2 - (Entire Chr) (figure; panels: Predicted Data, Real Data)

Hi-C Heatmaps of mm9 Chr2 - (0 - 40Mbp) (figure; panels: Predicted Data, Real Data)

Feature Importances

Another part of our project is to determine which of the 10 features are more meaningful in determining the interaction between loci

Question: Are there differences in which features are more significant for the Hi-C values of paired regions that are close together compared to far apart?

Feature Importances

Using Method 2: Feature importances (in sorted order) for predicting the interaction between regions which are 40kb vs 2Mbp in distance

40kb                        2Mbp
H3k36me3_norm = 0.3571      HindIII = 0.238
HindIII = 0.2871            Map = 0.1686
Map = 0.1062                H3k27me3_norm = 0.0944
H3k27ac_norm = 0.0505       POL2_norm = 0.0862
POL2_norm = 0.0453          GC = 0.0794
H3k4me1_norm = 0.0359       H3k36me3_norm = 0.0721
H3k27me3_norm = 0.0358      CTCF_norm = 0.0711
GC = 0.0295                 H3k4me3_norm = 0.0642
CTCF_norm = 0.0269          H3k4me1_norm = 0.0632
H3k4me3_norm = 0.0258       H3k27ac_norm = 0.0606
totals = 100.03%            totals = 99.78%

Note: These values are obtained by analyzing the Decision Trees in a Random Forest model.

The feature importances are calculated by randomly permuting the values of a single feature among the training instances. The larger the change in prediction accuracy between using the correct feature values and the permuted values, the more meaningful/important the feature is for the prediction.
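The permutation idea described in the note can be sketched independently of any particular model. The version below is a minimal illustration, not the routine that produced the numbers above: it takes any `predict` function, measures accuracy as mean squared error, and uses hypothetical function names.

```python
import random


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, seed=0):
    """Increase in error when one feature column is shuffled across instances."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for k in range(len(X[0])):
        column = [row[k] for row in X]
        rng.shuffle(column)  # break the link between feature k and the target
        permuted = [row[:k] + [v] + row[k + 1:] for row, v in zip(X, column)]
        error = mean_squared_error(y, [predict(row) for row in permuted])
        importances.append(error - baseline)
    return importances
```

A feature the model ignores gets an importance of exactly zero, while a feature the prediction depends on gets a positive value. The raw differences returned here are not normalized; dividing each by their sum would put them on a scale where the values total 1.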

Feature Importances

Future Work Idea: Use data mining techniques to determine more information behind the correlation of features (and also pairs of features) with the Hi-C interaction values

Thank you

References

[1] Hu, Ming, et al. "Bayesian inference of spatial organizations of chromosomes." PLoS Computational Biology 9.1 (2013): e1002893.

[2] Kuang, Simon. 2014 Google Science Fair Poster.

[3] Lieberman-Aiden, Erez, et al. "Comprehensive mapping of long-range interactions reveals folding principles of the human genome." Science 326.5950 (2009): 289-293.

[4] Rousseau, Mathieu, et al. "Three-dimensional modeling of chromatin structure from interaction frequency data using Markov chain Monte Carlo sampling." BMC Bioinformatics 12.1 (2011): 414.

[5] Varoquaux, Nelle, et al. "A statistical approach for inferring the 3D structure of the genome." Bioinformatics 30.12 (2014): i26-i33.
