Supervised Learner for the Prediction of Hi-C Interaction Counts and Determination of Influential Features

Tyler Derr @ Yue Lab [email protected]


Background

● Hi-C is a chromosome conformation capture (3C) based technology that outputs the number of interactions between loci at genome-wide scale. [3]

Background

● Recent 3D prediction software such as BACH[1] and PASTIS[5] can use Hi-C data to produce 3D genome structures
  ○ BACH: “It utilizes a Poisson model that better fits the count data generated from Hi-C experiments than the Gaussian model used in MCMC5C[4]...”[1]
  ○ PASTIS: [5] presents 4 methods, 2 of which are based on a Poisson model.
● Thanks to the recent efforts of the ENCODE and Roadmap Epigenomics projects, we have access to the following data per region (40kb resolution):
  ○ GC content, mappability, number of HindIII cut sites, Pol II, and 6 histone modifications such as H3K36me3

Basis of our research

● Current software such as BACH[1] and PASTIS[5] that predicts 3D genome structures from Hi-C data has trouble dealing with the bias induced by the techniques used to gather the data
  ○ Hi-C data collection is time consuming and expensive, and has known biases
  ○ Dr. Ming Hu (the creator of BACH[1]) appears to have attempted to address the biases by taking into account the enzyme cutting frequency, GC content, and sequence uniqueness when making his 3D predictions
  ○ However, Dr. Hu has recently stated that, in light of a recent Nature paper, the Poisson distribution assumption (which is crucial to BACH) is not appropriate for Hi-C data, which would invalidate any approach that relies on that assumption. [2]
● Can we use machine learning techniques not only to alleviate the bias, but perhaps also to predict the Hi-C data?

Predicting Hi-C

We present two methods:

Method 1: Using the entire Hi-C matrix as training data for a single Random Forest (RF) and also a single Artificial Neural Network (ANN)

Method 2: Creating a separate RF for each diagonal of the matrix (i.e., any given RF is trained only on region pairs at a fixed distance; e.g., RF_2 is trained on all region pairs that are 2 regions apart, 80kb)

The reasoning behind Method 2 is that it shows which features are more meaningful for prediction at different distances.
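As a rough illustration of how Method 2 splits the matrix, the sketch below pulls the k-th diagonal out of a small symmetric contact matrix. The helper name `diagonal_pairs`, the constant `BIN_SIZE_KB`, and the toy counts are assumptions made for illustration; they are not taken from the project code.

```python
BIN_SIZE_KB = 40  # region size stated in the slides (40kb resolution)


def diagonal_pairs(matrix, k):
    """Return (i, j, count) for every region pair exactly k regions apart."""
    n = len(matrix)
    return [(i, i + k, matrix[i][i + k]) for i in range(n - k)]


# Toy 4x4 symmetric contact matrix (made-up counts, illustration only).
toy = [
    [9, 5, 2, 1],
    [5, 8, 4, 2],
    [2, 4, 7, 3],
    [1, 2, 3, 6],
]

pairs_k2 = diagonal_pairs(toy, 2)  # the pairs a model like RF_2 would see
distance_kb = 2 * BIN_SIZE_KB      # 2 regions apart corresponds to 80kb
```

Each diagonal shrinks by one pair as k grows, so models for large distances see less training data than models for short distances.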

Predicting Hi-C

We use mESC (mm9) chromosomes to train and validate our models

Data features used to learn Hi-C: 10 for each 40kb region
GC content, number of HindIII cut sites, mappability, H3K4me1, H3K4me3, H3K27ac, H3K27me3, H3K36me3, Pol II, and CTCF

Method 1: Using RF and ANN

● Training input for predicting the interaction of two regions rI and rJ consists of the 10 features for both regions
  ○ plus an additional feature: the distance between the regions
  ○ [rI.GC, rI.HindIII, … , rI.CTCF, rJ.GC, rJ.HindIII, … , rJ.CTCF, distance]
● We attempt to use the above features to predict the Hi-C interaction value between the two given regions rI and rJ, for all pairs of regions in the chr.
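The Method 1 input layout described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the dictionary-per-region representation, measuring distance in regions, and using only pairs with i < j are all choices made here, not details given in the slides.

```python
# Feature names follow the list on the slide; the exact keys are hypothetical.
FEATURES = ["GC", "HindIII", "Map", "H3K4me1", "H3K4me3",
            "H3K27ac", "H3K27me3", "H3K36me3", "PolII", "CTCF"]


def pair_features(regions, i, j):
    """Build [rI features..., rJ features..., distance] for regions i and j.

    Distance is measured in regions (|i - j|); the unit is an assumption.
    """
    row = [regions[i][f] for f in FEATURES]
    row += [regions[j][f] for f in FEATURES]
    row.append(abs(i - j))
    return row


def method1_dataset(regions, matrix):
    """One training row per region pair (i < j) of a single chromosome."""
    X, y = [], []
    n = len(regions)
    for i in range(n):
        for j in range(i + 1, n):
            X.append(pair_features(regions, i, j))
            y.append(matrix[i][j])  # observed Hi-C interaction value
    return X, y


# Toy data: 3 regions whose features are all 1.0, 2.0, 3.0 (made up).
regions = [{f: float(r + 1) for f in FEATURES} for r in range(3)]
matrix = [[0, 5, 2], [5, 0, 4], [2, 4, 0]]
X, y = method1_dataset(regions, matrix)
```

Each row has 21 entries (10 + 10 + distance), and the resulting `X`, `y` could be passed to any regression model.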

Predicting Hi-C

We use mESC (mm9) chromosomes to train and validate our models

Data features used to learn Hi-C: 10 for each 40kb region
GC content, number of HindIII cut sites, mappability, H3K4me1, H3K4me3, H3K27ac, H3K27me3, H3K36me3, Pol II, and CTCF

Method 2: Using RF

● We attempt to use the above features to predict the Hi-C interaction value between the two given regions rI and rJ, for all pairs of regions at a specific distance in the chr.
● e.g., for training model RF_2 we use all pairs of regions that are 80kb apart
● Input for predicting the interaction of two regions rI and rJ consists of the 10 features for both regions (the distance is not used)
  ○ [rI.GC, rI.HindIII, …, rI.CTCF, rJ.GC, rJ.HindIII, …, rJ.CTCF] ➔ Interaction of rI, rJ
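A minimal sketch of how the per-distance training sets for Method 2 might be assembled is shown below. The function name `method2_datasets`, the list-of-lists feature layout, and the toy values are assumptions for illustration; the slides do not show the actual data pipeline.

```python
from collections import defaultdict


def method2_datasets(features, matrix):
    """Group training rows by distance so each RF_d gets its own data.

    features: one list of feature values per region.
    Returns {d: (X, y)}; rows in X hold region i's features followed by
    region j's features, with no distance column.
    """
    datasets = defaultdict(lambda: ([], []))
    n = len(features)
    for i in range(n):
        for j in range(i + 1, n):
            X, y = datasets[j - i]
            X.append(features[i] + features[j])
            y.append(matrix[i][j])
    return dict(datasets)


# Toy data: 4 regions with 2 features each (made-up values).
features = [[0.1, 3], [0.2, 1], [0.3, 4], [0.4, 2]]
matrix = [
    [0, 5, 2, 1],
    [5, 0, 4, 2],
    [2, 4, 0, 3],
    [1, 2, 3, 0],
]
per_distance = method2_datasets(features, matrix)
X2, y2 = per_distance[2]  # the data a model like RF_2 would be fit on
```

Each `(X, y)` pair could then be handed to a separate random forest regressor, for example scikit-learn's `RandomForestRegressor`; which library was actually used is not stated in the slides.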

What we have so far...

● Method 1: Training on all pairs of regions from chr1 and testing our model on all pairs of regions from chr2
  ○ RMSE=2.309 & R-squared=0.869
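For reference, the two metrics reported above can be computed as in the sketch below. It assumes R-squared means the coefficient of determination (1 - SS_res / SS_tot); the slides do not say which definition was used, and the sample values are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up interaction values, for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
error = rmse(actual, predicted)
fit = r_squared(actual, predicted)
```

RMSE is in the same units as the interaction values, so it is best read alongside the range of counts in the test chromosome.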

What we have planned for the near future:
  ○ Performing a leave-one-out cross validation using all the mESC mm9 chrs
  ○ Using higher resolution 1kb region sizes
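One way to read the planned cross validation is "hold out one chromosome per fold". The sketch below generates such folds; that reading, the function name, and the chromosome list (mm9 autosomes plus chrX) are assumptions, not details confirmed by the slides.

```python
def leave_one_chromosome_out(chromosomes):
    """Yield (training_chromosomes, held_out_chromosome) for each fold."""
    for held_out in chromosomes:
        training = [c for c in chromosomes if c != held_out]
        yield training, held_out


# Assumed chromosome set: 19 mouse autosomes plus chrX.
MM9_CHRS = [f"chr{i}" for i in range(1, 20)] + ["chrX"]

folds = list(leave_one_chromosome_out(MM9_CHRS))
```

Each fold trains on 19 chromosomes and tests on the remaining one, which generalizes the chr1-to-chr2 experiment reported above.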

Scatter Plot of Real vs Predicted Hi-C Data

[Figure: scatter plot; x-axis: Predicted Interaction Values, y-axis: Real Interaction Values; axis ticks at 300 and 600]

3D Structure of mESC mm9 chr2

[Figure: two 3D models; panels: Using Predicted Hi-C, Using raw Hi-C]

3D models generated using the PASTIS (MDS) 3D prediction software [5]

Coloration corresponds to the distance from the starting point of the chr (blue, cyan, green, yellow, orange, red)[2]

Hi-C Heatmaps of mm9 Chr2 - (Entire Chr)

[Figure: two heatmaps; panels: Predicted Data, Real Data]

Hi-C Heatmaps of mm9 Chr2 - (0 - 40Mbp)

[Figure: two heatmaps; panels: Predicted Data, Real Data]

Feature Importances

Another part of our project is to determine which of the 10 features are more meaningful in determining the interaction between loci.

Question: Are there differences in which features are more significant for the Hi-C values of paired regions that are close together compared to those that are far apart?

Feature Importances

40kb
H3k36me3_norm = 0.3571
HindIII = 0.2871
Map = 0.1062
H3k27ac_norm = 0.0505
POL2_norm = 0.0453
H3k4me1_norm = 0.0359
H3k27me3_norm = 0.0358
GC = 0.0295
CTCF_norm = 0.0269
H3k4me3_norm = 0.0258
totals = 100.03%

2Mbp
HindIII = 0.238
Map = 0.1686
H3k27me3_norm = 0.0944
POL2_norm = 0.0862
GC = 0.0794
H3k36me3_norm = 0.0721
CTCF_norm = 0.0711
H3k4me3_norm = 0.0642
H3k4me1_norm = 0.0632
H3k27ac_norm = 0.0606
totals = 99.78%

Using Method 2: Feature importances (in sorted order) for predicting the interaction between regions that are 40kb vs 2Mbp apart

Note: These values are obtained by analyzing the Decision Trees in a Random Forest model.

The feature importances are calculated by randomly permuting the values of a single feature among the training instances. The larger the change in prediction accuracy between the correct feature values and the permuted values, the more meaningful/important the feature is for the prediction.
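The permutation idea described above can be sketched in a few lines. This is an illustrative version under assumptions: the model is any callable that maps a feature row to a prediction, accuracy is measured with mean squared error, and `toy_model` is a stand-in rather than a Random Forest. The values on the slide sum to roughly 100%, which suggests a normalization step that this sketch leaves out.

```python
import random


def mse(model, X, y):
    """Mean squared error of a callable model over rows X and targets y."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_features, seed=0):
    """Increase in error after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for f in range(n_features):
        column = [row[f] for row in X]
        rng.shuffle(column)  # break the link between feature f and the target
        permuted = [row[:f] + [v] + row[f + 1:] for row, v in zip(X, column)]
        importances.append(mse(model, permuted, y) - baseline)
    return importances


def toy_model(row):
    """Stand-in model that depends only on feature 0."""
    return 2.0 * row[0]


X = [[float(i), float(i % 3)] for i in range(20)]
y = [toy_model(row) for row in X]
scores = permutation_importance(toy_model, X, y, n_features=2)
```

Here shuffling feature 1 leaves the error unchanged because the stand-in model ignores it, while shuffling feature 0 raises the error, which is the signal the importance score is meant to capture.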

Feature Importances

Feature Importances

Future Work Idea:

Use data mining techniques to learn more about how features (and pairs of features) correlate with the Hi-C interaction values

Thank you

References

[1] Hu, Ming, et al. "Bayesian inference of spatial organizations of chromosomes." PLoS Computational Biology 9.1 (2013): e1002893.

[2] Kuang, Simon. 2014 Google Science Fair Poster.

[3] Lieberman-Aiden, Erez, et al. "Comprehensive mapping of long-range interactions reveals folding principles of the human genome." Science 326.5950 (2009): 289-293.

[4] Rousseau, Mathieu, et al. "Three-dimensional modeling of chromatin structure from interaction frequency data using Markov chain Monte Carlo sampling." BMC Bioinformatics 12.1 (2011): 414.

[5] Varoquaux, Nelle, et al. "A statistical approach for inferring the 3D structure of the genome." Bioinformatics 30.12 (2014): i26-i33.