a n e nsemble svm m odel for the a ccurate p rediction of n on - c anonical m icro rna t argets...
TRANSCRIPT
AN ENSEMBLE SVM MODEL FOR THE ACCURATE PREDICTION OF NON-CANONICAL MICRORNA
TARGETS
Asish Ghoshal1, Ananth Grama1, Saurabh Bagchi2, Somali Chaterji1
1: Computer Science2: Electrical and Computer EngineeringPurdue University, West Lafayette, IN
MicroRNA (miRNA): “The genome’s guiding hand?”
• miRNA are 22 nucleotide (nt) strings of RNA, base-pairing with messenger RNA (mRNA) to cause mRNA degradation or translational repression
• Can be thought of as biology’s dark matter: small regulatory RNA that are abundant and encoded in the genome
• Dysregulation of miRNA may contribute to diverse diseases• Canonical (i.e., exact) matches involve the miRNA’s seed
region (nt 2-7) and the 3’ untranslated region (UTR) of mRNA and were thought of as the only form of interaction
• Recent high-throughput experimental studies have indicated the high-preponderance of “non-canonical” miRNA targets
miRNA
3
Background
• microRNA• mRNA• Argonaute• RISC (complex)
Competition between target sites of regulators shapes post-transcriptional gene regulation: Marvin Jens & Nikolaus Rajewsky, Nature Reviews Genetics, 2015
• We are attempting computational prediction of miRNA-mRNA interactions
• Challenging because of:– Large number of features in miRNAs and
mRNAs– Noisy and incomplete data sets from
experimental approaches– Wide variety of positive miRNA-mRNA
interactions
• Prior computational approaches are deterministic and rely on perfect seed matches of the miRNA and mRNA nucleotides [Nature Methods 2013, NAR 2010]
Computational miRNA Target Prediction
• Avishkar provides an ensemble model classifier, specialized to each miRNA family
• It achieves the highest precision and recall among all competitive computational approaches– Shows how to use an ensemble non-linear SVM classifier – Shows how to transform spatial features into smooth curves– Shows how to handle perfect matches as well as imperfect matches in a unified manner
• Since training non-linear SVMs is computationally expensive, we provide an open source, efficient implementation of Cascade SVM1, on top of Apache Spark
• Key Result: TPR of 76% and an FPR of 20%, with the AUC (ROC curve) for the ensemble non-linear model being 20% higher than for the simple linear model. This is an improvement of over 150% in the TPR over the best competitive protocol.
Our Contributions: Avishkar
Parallel Support Vector Machines: The Cascade SVM: Hans Peter Graf, Eric Cosatto, Leon Bottou, Igor Durdanovic, Vladimir Vapnik, NIPS 2005
6
Problem Statement
• Predict if a miRNA targets an mRNA (segment)
miRNAs mRNA segments
1 or 0 ?
• Ideally we would want some experimentally verified edge labels to train on
1 or 0 ?
1
0
1
Edges with ground truth labelsEdges whose labels have to be predicted
miRNAs mRNA segments
7
Problem Statement
• We have labels on vertices from CLIP-Seq experiments, i.e., which mRNA segments were targeted.
• We do not know which miRNA targeted a particular mRNA segment.
1 or 0 ? 0
miRNAs mRNA segments
1
0
0Not contained within an IP region
Contained within an Immuno-precipitated (IP) region
8
Methods: Generating Ground Truth Data
• Consider only the most highly expressed miRNAs– 10 in humans, 20 in mouse
• Label an edge ‘1’ if – The mRNA segment is labeled ‘1’, AND– The binding between a miRNA and
mRNA segment is strong enough, i.e., ΔG is below a certain threshold, OR,
– There is at least a 6-mer seed match between the mRNA segment and miRNA.
• All other edges are labeled ‘0’
1 0
miRNAs mRNA segments
1
0
0
0
0 1
9
Methods: Feature Construction
• The alternating blue and green regions denote the 13 consecutive windows around the miRNA target site (red). These are the windows where the average thermodynamic and sequence features are computed.
• Compute interaction profiles at two different resolutions– Window size of 46 and using the entire miRNA: “site” curves– Window size of 9 and only using the seed region of the miRNA: “seed” curves
• Use coefficients of B-spline basis functions as features for classifier• We hypothesize that the curves are different for the positive and negative samples.
Seed match siteRISC (complex)
10
Methods: Metric for Non-Canonical Matches
• Encode the alignment of the miRNA seed region with mRNA nucleotides using a string of 1s (matches), 2s (mismatches), 3s (gaps), and 4s (GU wobbles)
• Compute occurrence frequencies of seed match patterns among positive miRNA-mRNA interactions
• If the pattern a occurs k times among n positive samples, we calculate α = Probability(pattern a occurs k of n by chance)
• Define seed enrichment score for the seed match pattern a as: SES = 1 – α
• The “seed enrichment score” captures, in a single unified numeric feature, the relative efficacy of various kinds of seed matches – ML techniques handle numeric features better
11
Validation Protocol used to Evaluate Avishkar
12
Improving Classification Performance with Kernel SVM
• Linear classifier suffers from high bias (large error even on training set)
• Solution: Use more complex learning model– Non-linear or Kernel SVM
• SVMs suffer from a widely recognized scalability problem in both memory use and compute time.
• Kernel SVM computational cost: O(n3)• Does not scale beyond a few thousand examples for
feature vector of dimension ~ 150
13
Distributed Training for Kernel SVM
• Cascading SVM [Graf et al. NIPS 2005]• Key ideas:
– Train on partitions of the whole data set and do this in parallel– Merge SVs from each partition in a hierarchical manner– Final step is serial and it is hoped that the number of SVs is reasonably small at
that stage
• Implemented on top of Apache Spark: Open source release
1 2 3
14
Making Kernel SVM Scale Up
• Biological insight: miRNAs within an miRNA family share structural similarities
• Therefore, we create a separate non-linear classifier for each miRNA family
• Within each family, we train in parallel using Cascade SVM approach
15
CLIP Data SetSpecies # Positive
examples (Seed,
Seedless)
# Negative examples
# mRNA # miRNA Positive target sites
3’ UTR CDS 5’ UTR
HITS-CLIP (Mouse)
861,208 (6%, 94%)
35,608,333 4,059 119 56% 43% 1%
PAR-CLIP (Human)
141,109 (8%,92%)
2,659,748 1,211 35 57% 39% 4%
16
Results: Misclassification Rate for Linear and Non-Linear Classifier
• Mean test error of the non-linear SVMs for each of the miRNA families is less than the corresponding linear models.
• Benefit of non-linear SVM is more pronounced for larger miRNA families: e.g., for let-7 and miR-320 families, benefits are 50% and 69.9% over linear model.
• Insight: Non-linear model can remove the prediction bias inherent in linear model.
17
Results: ROC Curve for Linear and Non-linear SVM
• ROC curves for the ensemble linear model and ensemble non-linear model, obtained by varying the probability threshold for the output of the SVM.
• The misclassification error, true positive rate, and false positive rate were computed using 5-fold stratified cross-validation for seedless sites in the human data set.
• One possible operating region is with an FPR of 0.2, the TPR for the linear model is 0.469, while the TPR for the non-linear model is 0.756.
18
Take-Away Points• We have developed “Avishkar”, a machine-learning, support
vector machine-based model, to predict both canonical and non-canonical miRNA-mRNA interactions
• Avishkar extracts thermodynamic and sequence features and does smoothening through curve fitting, in order to extract enriched features from CLIP data
• We use non-linear SVM to minimize bias and scale it up to the large biological data sizes through a biologically-driven parallelization strategy
• We achieve the best-in-class recall (true positive rate), with an improvement of over 150%, over the best competitive protocol
• Open source software releases: https://bitbucket.org/cellsandmachines/avishkar
https://bitbucket.org/cellsandmachines/kernelsvmspark
19
Thanks!
20
EXTRA Slides + Notes
21
Background: CLIP Technology and Non-Canonical miRNA-mRNA Interactions
An experimental method to map the binding sites of RNA-binding proteins across the transcriptome. Proteins are crosslinked to RNA using ultraviolet light, and an antibody is used to specifically isolate the RNA-binding protein of interest together with its RNA interaction partners, which are subjected to sequencing.
22
23
Background: RNA interference
24
Methods: Capturing Spatial Interaction using Smooth Curves
• Compute thermodynamic interaction profile upstream and downstream of the target region
– Data is collected for fixed-size windows on both sides of the target region– Averaging is done within each window
• Compute smooth curves to remove noise in the biological data set– Used B-spline interpolation for smoothing out the points
25
Methods: Our Classifier
• Distributed linear SVM using gradient descent (Apache Spark)• 9 Intel X86 nodes, 8 GB memory, 4 cores.
26
Results: ROC Curves
27
Results: Feature Importance
28
Results
• Does the classification performance improve due to clustering by miRNA family or due to the use of a more complex model?
• Performance of linear classifier does not improve by clustering data.