
AN ENSEMBLE SVM MODEL FOR THE ACCURATE PREDICTION OF NON-CANONICAL MICRORNA TARGETS

Asish Ghoshal¹, Ananth Grama¹, Saurabh Bagchi², Somali Chaterji¹

¹ Computer Science, ² Electrical and Computer Engineering
Purdue University, West Lafayette, IN

MicroRNA (miRNA): “The genome’s guiding hand?”

• miRNAs are roughly 22-nucleotide (nt) strands of RNA that base-pair with messenger RNA (mRNA) to cause mRNA degradation or translational repression

• Can be thought of as biology’s dark matter: small regulatory RNAs that are abundant and encoded in the genome

• Dysregulation of miRNA may contribute to diverse diseases

• Canonical (i.e., exact) matches involve the miRNA’s seed region (nt 2-7) and the 3’ untranslated region (UTR) of the mRNA, and were long thought to be the only form of interaction

• Recent high-throughput experimental studies have indicated a high preponderance of “non-canonical” miRNA targets
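A canonical match can be sketched as a plain string search: take miRNA nucleotides 2-7 (the seed), reverse-complement them, and look for that 6-mer in the 3’ UTR. The function names and both sequences below are illustrative assumptions, not part of Avishkar.

```python
# Watson-Crick complements for RNA.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def seed_site(mirna: str) -> str:
    """Return the mRNA 6-mer (5'->3') that pairs perfectly with miRNA nt 2-7."""
    seed = mirna[1:7]  # nt 2-7 (1-based) as a 0-based slice
    return "".join(COMPLEMENT[base] for base in reversed(seed))

def canonical_sites(mirna: str, utr: str) -> list[int]:
    """0-based start positions in the UTR of perfect 6-mer seed matches."""
    site = seed_site(mirna)
    return [i for i in range(len(utr) - 5) if utr[i:i + 6] == site]

mirna = "UGAGGUAGUAGGUUGUAUAGUU"  # illustrative 22-nt miRNA, written 5'->3'
utr = "AAGCUACCUCAGGAUACCUCUU"    # made-up 3' UTR fragment, written 5'->3'
print(seed_site(mirna), canonical_sites(mirna, utr))
```

A non-canonical site is, by contrast, one that such an exact search would miss (mismatches, gaps, or GU wobbles in the seed pairing).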


Background

• microRNA
• mRNA
• Argonaute
• RISC (complex)

Marvin Jens & Nikolaus Rajewsky, “Competition between target sites of regulators shapes post-transcriptional gene regulation”, Nature Reviews Genetics, 2015

Computational miRNA Target Prediction

• We are attempting computational prediction of miRNA-mRNA interactions

• Challenging because of:
– Large number of features in miRNAs and mRNAs
– Noisy and incomplete data sets from experimental approaches
– Wide variety of positive miRNA-mRNA interactions

• Prior computational approaches are deterministic and rely on perfect seed matches of the miRNA and mRNA nucleotides [Nature Methods 2013, NAR 2010]

Our Contributions: Avishkar

• Avishkar provides an ensemble model classifier, specialized to each miRNA family

• It achieves the highest precision and recall among all competing computational approaches
– Shows how to use an ensemble non-linear SVM classifier
– Shows how to transform spatial features into smooth curves
– Shows how to handle perfect matches as well as imperfect matches in a unified manner

• Since training non-linear SVMs is computationally expensive, we provide an open source, efficient implementation of Cascade SVM [1], on top of Apache Spark

• Key Result: TPR of 76% at an FPR of 20%, with the AUC (ROC curve) for the ensemble non-linear model being 20% higher than for the simple linear model. This is an improvement of over 150% in the TPR over the best competing protocol.

[1] Hans Peter Graf, Eric Cosatto, Leon Bottou, Igor Durdanovic, Vladimir Vapnik, “Parallel Support Vector Machines: The Cascade SVM”, NIPS 2005


Problem Statement

• Predict if a miRNA targets an mRNA (segment)

• Ideally we would want some experimentally verified edge labels to train on

[Figure: miRNAs and mRNA segments connected by edges labeled 1 or 0. Legend: edges with ground truth labels; edges whose labels have to be predicted.]



Problem Statement

• We have labels on vertices from CLIP-Seq experiments, i.e., which mRNA segments were targeted.

• We do not know which miRNA targeted a particular mRNA segment.

[Figure: mRNA segments labeled 1 if contained within an immunoprecipitated (IP) region and 0 if not contained within an IP region; the miRNA-to-segment edge labels (1 or 0?) are unknown.]


Methods: Generating Ground Truth Data

• Consider only the most highly expressed miRNAs
– 10 in humans, 20 in mouse

• Label an edge ‘1’ if
– The mRNA segment is labeled ‘1’, AND
– The binding between the miRNA and the mRNA segment is strong enough, i.e., ΔG is below a certain threshold, OR
– There is at least a 6-mer seed match between the mRNA segment and the miRNA

• All other edges are labeled ‘0’
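The labeling rule can be sketched as a small predicate. This sketch assumes the grouping "segment labeled 1 AND (strong binding OR seed match)", which is one reading of the rule above; the `Edge` fields and the ΔG cutoff value are hypothetical, since the slides do not state the threshold.

```python
from dataclasses import dataclass

@dataclass
class Edge:
    segment_in_ip_region: bool  # vertex label from CLIP-Seq ('1' if True)
    delta_g: float              # duplex free energy in kcal/mol (more negative = stronger)
    seed_match_len: int         # longest contiguous seed match, in nt

DELTA_G_THRESHOLD = -20.0  # hypothetical cutoff; not given in the slides

def label_edge(edge: Edge, threshold: float = DELTA_G_THRESHOLD) -> int:
    """Assumed grouping: segment == 1 AND (dG below threshold OR >= 6-mer seed match)."""
    strong_binding = edge.delta_g < threshold
    seed_match = edge.seed_match_len >= 6
    return int(edge.segment_in_ip_region and (strong_binding or seed_match))

edges = [
    Edge(True, -25.0, 0),   # seedless but strongly bound
    Edge(True, -5.0, 6),    # weakly bound but 6-mer seed match
    Edge(True, -5.0, 3),    # neither condition holds
    Edge(False, -30.0, 7),  # segment not in an IP region
]
print([label_edge(e) for e in edges])
```

Under this reading, a seedless (non-canonical) interaction can still be labeled positive through the ΔG condition alone.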


Methods: Feature Construction

• The alternating blue and green regions denote the 13 consecutive windows around the miRNA target site (red). These are the windows where the average thermodynamic and sequence features are computed.

• Compute interaction profiles at two different resolutions
– Window size of 46 and using the entire miRNA: “site” curves
– Window size of 9 and only using the seed region of the miRNA: “seed” curves

• Use coefficients of B-spline basis functions as features for the classifier

• We hypothesize that the curves are different for the positive and negative samples.

[Figure labels: seed match site; RISC (complex)]
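To illustrate how B-spline coefficients can serve as a fixed-length feature vector, here is a minimal pure-Python sketch that fits a cubic B-spline to 13 window averages by least squares. The knot placement, the number of coefficients (6), the least-squares fitting, and the profile values are all assumptions made for illustration; the slides do not specify how Avishkar fits its curves.

```python
def basis(i, k, t, x):
    """Cox-de Boor recursion: value of the i-th degree-k B-spline at x."""
    if k == 0:
        last = x == t[-1] and t[i] < t[i + 1] == t[-1]  # close the final interval
        return 1.0 if (t[i] <= x < t[i + 1]) or last else 0.0
    out = 0.0
    if t[i + k] > t[i]:
        out += (x - t[i]) / (t[i + k] - t[i]) * basis(i, k - 1, t, x)
    if t[i + k + 1] > t[i + 1]:
        out += (t[i + k + 1] - x) / (t[i + k + 1] - t[i + 1]) * basis(i + 1, k - 1, t, x)
    return out

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def bspline_features(xs, ys, knots, degree=3):
    """Least-squares B-spline coefficients (via the normal equations)."""
    n = len(knots) - degree - 1
    design = [[basis(i, degree, knots, x) for i in range(n)] for x in xs]
    ata = [[sum(row[i] * row[j] for row in design) for j in range(n)] for i in range(n)]
    aty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(n)]
    return solve(ata, aty)

xs = [float(i) for i in range(13)]         # 13 window positions
knots = [0, 0, 0, 0, 4, 8, 12, 12, 12, 12]  # clamped cubic knots (assumed)
ys = [-3.1, -2.8, -3.5, -4.0, -5.2, -7.9, -9.4, -7.5, -5.0, -4.1, -3.6, -3.0, -3.2]  # made-up values
features = bspline_features(xs, ys, knots)
print(len(features), [round(c, 2) for c in features])
```

Whatever the length of the raw profile, the classifier sees the same number of coefficients, and fitting fewer coefficients than data points smooths out window-to-window noise.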


Methods: Metric for Non-Canonical Matches

• Encode the alignment of the miRNA seed region with mRNA nucleotides using a string of 1s (matches), 2s (mismatches), 3s (gaps), and 4s (GU wobbles)

• Compute occurrence frequencies of seed match patterns among positive miRNA-mRNA interactions

• If the pattern a occurs k times among n positive samples, we calculate α = Probability(pattern a occurs k of n by chance)

• Define seed enrichment score for the seed match pattern a as: SES = 1 – α

• The “seed enrichment score” captures, in a single unified numeric feature, the relative efficacy of various kinds of seed matches
– ML techniques handle numeric features better
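The encoding and the score can be sketched as follows. This sketch makes two assumptions that the slides leave open: the alignment is already given (gaps written as "-", target written 3'->5' so that positions pair up), and "by chance" is modeled as a binomial tail, α = P(X ≥ k) with X ~ Binomial(n, p0), where p0 is a hypothetical background probability of the pattern.

```python
from math import comb

WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}

def encode_alignment(seed: str, target: str) -> str:
    """Encode an already-aligned seed/target pair, position by position.
    1 = match, 2 = mismatch, 3 = gap ('-'), 4 = GU wobble."""
    codes = []
    for m, t in zip(seed, target):
        if m == "-" or t == "-":
            codes.append("3")
        elif (m, t) in WATSON_CRICK:
            codes.append("1")
        elif (m, t) in WOBBLE:
            codes.append("4")
        else:
            codes.append("2")
    return "".join(codes)

def seed_enrichment_score(k: int, n: int, p0: float) -> float:
    """SES = 1 - alpha, with alpha = P(X >= k) for X ~ Binomial(n, p0) (assumed model)."""
    alpha = sum(comb(n, j) * p0**j * (1 - p0) ** (n - j) for j in range(k, n + 1))
    return 1.0 - alpha

print(encode_alignment("GAGGUA", "CUCUAU"))  # one GU wobble
print(encode_alignment("GAGGUA", "CU-CGU"))  # one gap and one wobble
print(seed_enrichment_score(k=30, n=100, p0=0.1))
```

The direct sum is only practical for small n; at the scale of the CLIP data sets a log-space computation or a statistics library would be needed to avoid overflow.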


Validation Protocol used to Evaluate Avishkar


Improving Classification Performance with Kernel SVM

• Linear classifier suffers from high bias (large error even on training set)

• Solution: Use a more complex learning model
– Non-linear or Kernel SVM

• SVMs suffer from a widely recognized scalability problem in both memory use and compute time.

• Kernel SVM computational cost: O(n³), where n is the number of training examples

• Does not scale beyond a few thousand examples for a feature vector of dimension ~ 150
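A back-of-the-envelope calculation makes the scalability problem concrete. The sketch below assumes a dense kernel (Gram) matrix of 8-byte floats and compares the O(n³) work against a 5,000-example baseline; the baseline is an arbitrary choice, and the larger values of n are the positive-example counts from the data table in this deck.

```python
def kernel_matrix_gib(n: int, bytes_per_entry: int = 8) -> float:
    """Memory for a dense n x n kernel (Gram) matrix, in GiB."""
    return n * n * bytes_per_entry / 2**30

def relative_cost(n: int, n_ref: int) -> float:
    """How much more work an O(n^3) solver does at n than at n_ref."""
    return (n / n_ref) ** 3

for n in (5_000, 141_109, 861_208):
    print(f"n={n:>7}: kernel matrix {kernel_matrix_gib(n):10.1f} GiB, "
          f"cost x{relative_cost(n, 5_000):.3g} vs n=5000")
```

Real solvers avoid materializing the full matrix, so these numbers are an upper bound on memory, but the cubic growth in compute is what motivates partitioning the training set.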


Distributed Training for Kernel SVM

• Cascading SVM [Graf et al. NIPS 2005]

• Key ideas:
– Train on partitions of the whole data set and do this in parallel
– Merge support vectors (SVs) from each partition in a hierarchical manner
– Final step is serial and it is hoped that the number of SVs is reasonably small at that stage

• Implemented on top of Apache Spark: Open source release
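The cascade structure itself is independent of the solver, which the following single-pass sketch tries to show. It is not the Spark implementation: `toy_trainer` is a stand-in that is an exact hard-margin SVM only for separable 1-D data (where the support vectors are simply the two closest opposite-class points), and the round-robin partitioning is an assumption.

```python
from typing import Callable, List, Tuple

Sample = Tuple[float, int]                        # (1-D feature, label in {-1, +1})
Trainer = Callable[[List[Sample]], List[Sample]]  # returns the support vectors

def cascade(data: List[Sample], train: Trainer, n_parts: int) -> List[Sample]:
    """Train on partitions, then merge support vectors pairwise, level by level."""
    parts = [data[i::n_parts] for i in range(n_parts)]
    svs = [train(p) for p in parts]        # first layer: independent, parallelizable
    while len(svs) > 1:
        merged = [svs[i] + svs[i + 1] if i + 1 < len(svs) else svs[i]
                  for i in range(0, len(svs), 2)]
        svs = [train(m) for m in merged]   # each merge is again independent
    return svs[0]                          # result of the final, serial stage

def toy_trainer(samples: List[Sample]) -> List[Sample]:
    """Stand-in for a real SVM solver (assumes all negatives lie below all positives)."""
    neg = [s for s in samples if s[1] == -1]
    pos = [s for s in samples if s[1] == +1]
    if not neg or not pos:
        return samples                     # cannot train on one class; pass through
    return [max(neg), min(pos)]

data = [(x / 10, -1) for x in range(0, 40)] + [(x / 10, +1) for x in range(60, 100)]
print(cascade(data, toy_trainer, n_parts=4))
```

Each layer only ever sees the support vectors that survived the previous one, which is why the final serial step can stay small.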


Making Kernel SVM Scale Up

• Biological insight: miRNAs within an miRNA family share structural similarities

• Therefore, we create a separate non-linear classifier for each miRNA family

• Within each family, we train in parallel using Cascade SVM approach
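The per-family ensemble amounts to grouping the training data by miRNA family, fitting one model per group, and routing each prediction to its family's model. In this sketch `fit_threshold` is a deliberately trivial placeholder learner (not an SVM), and the family names and feature values are made up.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Features = Sequence[float]
Model = Callable[[Features], int]
Example = Tuple[str, Features, int]  # (miRNA family, features, label in {0, 1})

def train_family_ensemble(examples: List[Example],
                          fit: Callable[[List[Tuple[Features, int]]], Model]
                          ) -> Dict[str, Model]:
    """Fit one model per miRNA family; each fit is independent of the others."""
    by_family = defaultdict(list)
    for family, x, y in examples:
        by_family[family].append((x, y))
    return {family: fit(rows) for family, rows in by_family.items()}

def predict(models: Dict[str, Model], family: str, x: Features) -> int:
    """Route the example to the classifier of its own family."""
    return models[family](x)

def fit_threshold(rows):
    """Placeholder learner (NOT an SVM): threshold on the first feature."""
    pos = [x[0] for x, y in rows if y == 1]
    neg = [x[0] for x, y in rows if y == 0]
    pos_mean, neg_mean = sum(pos) / len(pos), sum(neg) / len(neg)
    mid = (pos_mean + neg_mean) / 2
    sign = 1 if pos_mean > mid else -1
    return lambda x: int(sign * (x[0] - mid) > 0)

examples = [
    ("let-7", [0.9], 1), ("let-7", [0.1], 0),
    ("miR-320", [0.2], 1), ("miR-320", [0.8], 0),
]
models = train_family_ensemble(examples, fit_threshold)
print(predict(models, "let-7", [0.7]), predict(models, "miR-320", [0.7]))
```

In the toy data the two families have opposite decision rules, which a single pooled model of this kind could not represent; in practice `fit` would be the Cascade SVM training routine.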


CLIP data set (species)   # Positive examples (seed, seedless)   # Negative examples   # mRNA   # miRNA   Positive target sites (3’ UTR / CDS / 5’ UTR)
HITS-CLIP (Mouse)         861,208 (6%, 94%)                      35,608,333            4,059    119       56% / 43% / 1%
PAR-CLIP (Human)          141,109 (8%, 92%)                      2,659,748             1,211    35        57% / 39% / 4%


Results: Misclassification Rate for Linear and Non-Linear Classifier

• Mean test error of the non-linear SVM is lower than that of the corresponding linear model for each of the miRNA families.

• The benefit of the non-linear SVM is more pronounced for larger miRNA families: e.g., for the let-7 and miR-320 families, the improvements over the linear model are 50% and 69.9%, respectively.

• Insight: The non-linear model can remove the prediction bias inherent in the linear model.


Results: ROC Curve for Linear and Non-linear SVM

• ROC curves for the ensemble linear model and ensemble non-linear model, obtained by varying the probability threshold for the output of the SVM.

• The misclassification error, true positive rate, and false positive rate were computed using 5-fold stratified cross-validation for seedless sites in the human data set.

• At one possible operating point, an FPR of 0.2, the TPR for the linear model is 0.469, while the TPR for the non-linear model is 0.756.
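How a ROC curve arises from sweeping the probability threshold can be sketched in a few lines. The scores and labels below are made up for illustration and are not Avishkar outputs; the function names are likewise illustrative.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping the decision threshold downward."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]  # made-up SVM probabilities
labels = [1, 1, 0, 1, 0, 1, 0, 0]                   # made-up ground truth
curve = roc_points(scores, labels)
print(curve)
print(auc(curve))
```

Each threshold gives one operating point; with the numbers on this slide, the non-linear model's relative TPR gain over the linear model at an FPR of 0.2 is (0.756 − 0.469) / 0.469, roughly 61%.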


Take-Away Points

• We have developed “Avishkar”, a machine-learning model based on support vector machines, to predict both canonical and non-canonical miRNA-mRNA interactions

• Avishkar extracts thermodynamic and sequence features and smooths them through curve fitting, in order to extract enriched features from CLIP data

• We use non-linear SVM to minimize bias and scale it up to the large biological data sizes through a biologically-driven parallelization strategy

• We achieve the best-in-class recall (true positive rate), with an improvement of over 150% over the best competing protocol

• Open source software releases:
https://bitbucket.org/cellsandmachines/avishkar
https://bitbucket.org/cellsandmachines/kernelsvmspark


Thanks!


EXTRA Slides + Notes


Background: CLIP Technology and Non-Canonical miRNA-mRNA Interactions

CLIP (crosslinking and immunoprecipitation) is an experimental method to map the binding sites of RNA-binding proteins across the transcriptome. Proteins are crosslinked to RNA using ultraviolet light, and an antibody is used to specifically isolate the RNA-binding protein of interest together with its RNA interaction partners, which are then sequenced.


Background: RNA interference


Methods: Capturing Spatial Interaction using Smooth Curves

• Compute thermodynamic interaction profile upstream and downstream of the target region
– Data is collected for fixed-size windows on both sides of the target region
– Averaging is done within each window

• Compute smooth curves to remove noise in the biological data set
– Used B-spline interpolation for smoothing out the points
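The windowing step can be sketched as follows. The function name, its parameters, and the choice of six flanking windows per side (giving 13 values including the site itself) are assumptions for illustration; the per-nucleotide profile is made up.

```python
def window_averages(profile, site_start, site_end, window, n_flank):
    """Average a per-nucleotide profile in fixed-size windows around a target site.

    Returns n_flank upstream windows, the site itself, and n_flank downstream
    windows, ordered 5'->3'. Windows that run off the sequence are truncated;
    empty windows yield None.
    """
    def mean(lo, hi):
        vals = profile[max(lo, 0):max(min(hi, len(profile)), 0)]
        return sum(vals) / len(vals) if vals else None

    upstream = [mean(site_start - (j + 1) * window, site_start - j * window)
                for j in reversed(range(n_flank))]
    downstream = [mean(site_end + j * window, site_end + (j + 1) * window)
                  for j in range(n_flank)]
    return upstream + [mean(site_start, site_end)] + downstream

profile = [float(i % 7) for i in range(200)]  # made-up per-nucleotide values
points = window_averages(profile, site_start=90, site_end=112, window=9, n_flank=6)
print(len(points), points)
```

The resulting short sequence of window means is the set of points that a smoothing curve would then be fitted through.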


Methods: Our Classifier

• Distributed linear SVM using gradient descent (Apache Spark)

• 9 Intel X86 nodes, 8 GB memory, 4 cores.
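For reference, a single-machine sketch of linear SVM training by gradient descent is given below. It uses full-batch subgradient descent on the L2-regularized hinge loss; the hyperparameters and toy data are assumptions, and no Spark API is used. In a distributed setting, the per-example gradient sum in the inner loop is the part that would be computed per partition and then aggregated.

```python
def train_linear_svm(xs, ys, lr=0.1, reg=0.01, epochs=200):
    """Full-batch subgradient descent on the L2-regularized hinge loss.
    xs: list of feature lists; ys: labels in {-1, +1}."""
    d, n = len(xs[0]), len(xs)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [reg * wi for wi in w], 0.0  # gradient of the regularizer
        for x, y in zip(xs, ys):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:                    # point violates the margin
                for j in range(d):
                    gw[j] -= y * x[j] / n
                gb -= y / n
        w = [wi - lr * gj for wi, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def classify(w, b, x):
    """Sign of the decision function."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

xs = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]]  # toy, linearly separable
ys = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(xs, ys)
print(w, b, [classify(w, b, x) for x in xs])
```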


Results: ROC Curves


Results: Feature Importance


Results

• Does the classification performance improve due to clustering by miRNA family or due to the use of a more complex model?

• Performance of linear classifier does not improve by clustering data.
