machine learning prediction of oncology drug targets based

12
METHODOLOGY ARTICLE Open Access Machine learning prediction of oncology drug targets based on protein and network properties Zoltán Dezső 1* and Michele Ceccarelli 1,2,3* Abstract Background: The selection and prioritization of drug targets is a central problem in drug discovery. Computational approaches can leverage the growing number of large-scale human genomics and proteomics data to make in- silico target identification, reducing the cost and the time needed. Results: We developed a machine learning approach to score proteins to generate a druggability score of novel targets. In our model we incorporated 70 protein features which included properties derived from the sequence, features characterizing protein functions as well as network properties derived from the protein-protein interaction network. The advantage of this approach is that it is unbiased and even less studied proteins with limited information about their function can score well as most of the features are independent of the accumulated literature. We build models on a training set which consist of targets with approved drugs and a negative set of non-drug targets. The machine learning techniques help to identify the most important combination of features differentiating validated targets from non-targets. We validated our predictions on an independent set of clinical trial drug targets, achieving a high accuracy characterized by an Area Under the Curve (AUC) of 0.89. Our most predictive features included biological function of proteins, network centrality measures, protein essentiality, tissue specificity, localization and solvent accessibility. Our predictions, based on a small set of 102 validated oncology targets, recovered the majority of known drug targets and identifies a novel set of proteins as drug target candidates. Conclusions: We developed a machine learning approach to prioritize proteins according to their similarity to approved drug targets. We have shown that the method proposed is highly predictive on a validation dataset consisting of 277 targets of clinical trial drug confirming that our computational approach is an efficient and cost- effective tool for drug target discovery and prioritization. Our predictions were based on oncology targets and cancer relevant biological functions, resulting in significantly higher scores for targets of oncology clinical trial drugs compared to the scores of targets of trial drugs for other indications. Our approach can be used to make indication specific drug-target prediction by combining generic druggability features with indication specific biological functions. Keywords: Machine learning, Drug target prioritization, Network analysis © The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. * Correspondence: [email protected]; [email protected] 1 Computational Biology-Genomic Research Center, ABBVIE, Redwood City, CA, USA Full list of author information is available at the end of the article Dezső and Ceccarelli BMC Bioinformatics (2020) 21:104 https://doi.org/10.1186/s12859-020-3442-9

Upload: others

Post on 10-Apr-2022

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Machine learning prediction of oncology drug targets based

METHODOLOGY ARTICLE Open Access

Machine learning prediction of oncologydrug targets based on protein and networkpropertiesZoltán Dezső1* and Michele Ceccarelli1,2,3*

Abstract

Background: The selection and prioritization of drug targets is a central problem in drug discovery. Computationalapproaches can leverage the growing number of large-scale human genomics and proteomics data to make in-silico target identification, reducing the cost and the time needed.

Results: We developed a machine learning approach to score proteins to generate a druggability score of noveltargets. In our model we incorporated 70 protein features which included properties derived from the sequence,features characterizing protein functions as well as network properties derived from the protein-protein interactionnetwork. The advantage of this approach is that it is unbiased and even less studied proteins with limitedinformation about their function can score well as most of the features are independent of the accumulatedliterature. We build models on a training set which consist of targets with approved drugs and a negative set ofnon-drug targets. The machine learning techniques help to identify the most important combination of featuresdifferentiating validated targets from non-targets. We validated our predictions on an independent set of clinicaltrial drug targets, achieving a high accuracy characterized by an Area Under the Curve (AUC) of 0.89. Our mostpredictive features included biological function of proteins, network centrality measures, protein essentiality, tissuespecificity, localization and solvent accessibility. Our predictions, based on a small set of 102 validated oncologytargets, recovered the majority of known drug targets and identifies a novel set of proteins as drug targetcandidates.

Conclusions: We developed a machine learning approach to prioritize proteins according to their similarity toapproved drug targets. We have shown that the method proposed is highly predictive on a validation datasetconsisting of 277 targets of clinical trial drug confirming that our computational approach is an efficient and cost-effective tool for drug target discovery and prioritization. Our predictions were based on oncology targets andcancer relevant biological functions, resulting in significantly higher scores for targets of oncology clinical trial drugscompared to the scores of targets of trial drugs for other indications. Our approach can be used to make indicationspecific drug-target prediction by combining generic druggability features with indication specific biologicalfunctions.

Keywords: Machine learning, Drug target prioritization, Network analysis

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License,which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you giveappropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate ifchanges were made. The images or other third party material in this article are included in the article's Creative Commonslicence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commonslicence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtainpermission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to thedata made available in this article, unless otherwise stated in a credit line to the data.

* Correspondence: [email protected]; [email protected] Biology-Genomic Research Center, ABBVIE, Redwood City,CA, USAFull list of author information is available at the end of the article

Dezső and Ceccarelli BMC Bioinformatics (2020) 21:104 https://doi.org/10.1186/s12859-020-3442-9

Page 2: Machine learning prediction of oncology drug targets based

BackgroundDrug target identification is one of the most criticalsteps in the pre-clinical drug development pipeline andthe choice of the right target can significantly impact thechances of advancing the drug to clinical developmentand the success of the clinical trials. With the accumula-tion of approved and clinical trial drugs, it has becomeclear that successful drug targets share several importantfeatures, which include having a disease relevant bio-logical function and certain properties that would favorthe existence of binding sites, thus making the proteincapable of binding to small molecules.There are a variety of experimental techniques for target

identification including affinity pull downs, pooled RNAidata [1] and more recently genome-scale CRISPR–Cas9screens [2]; however, these methods are expensive and laborintensive and not without limitations. Computational ap-proaches on the other hand can leverage the growing num-ber of large-scale human genomics and proteomics data setsto make in-silico target identification, thus potentially redu-cing substantially the cost and the time needed to assess atarget. A targeted computational approaches for example isthe structure-based drug discovery where protein druggabil-ity is determined through molecular docking methodswhich can predict binding sites and binding affinity of thetarget proteins [3]. These methods however are limited infinding novel targets because the three-dimensional struc-ture of most proteins is not readily available.Several databases aim to systematically capture the

expanding number of approved and clinical trial drugs,including their indications and corresponding targets.Currently the DrugBank database [4, 5] contains 2594approved small molecule drugs and 1289 approved bio-tech drugs. Another comprehensive collection of druginformation, the Therapeutic Target Database (TTD)[6], contains 2544 drugs and 2589 corresponding targets,as well as drug resistance mutations and gene expressionprofiles after treatments for some of the drugs. The ex-istence of such databases allows us to find commoncharacteristics among the successful drug targets and touse these features in combination with other knowledgeto guide novel drug discovery research.Many data-driven approaches have focused on gene

expression changes after drug treatment to predict simi-larity between drugs and potentially predict shared tar-gets [7–9]. Several studies combined different data typesto improve the prediction of drug similarity [10] and topredict drug-target interactions [11, 12]. Other methodsfocused on predicting the probability of success in aclinical trial by estimating the toxicity based on severalchemical properties, drug-likeness measures of the mole-cules and the target properties [13].Computational approaches to identify novel targets are

often limited by the availability of data for less studied

proteins. We propose an unbiased approach wherebynovel target identification leverages the characterization ofall proteins based on properties that are known or pre-dicted based on protein sequence or genome-wide experi-mental data. Machine learning techniques provide theopportunity to identify the most important combinationof features differentiating validated targets from non-targets. In this approach, the first step is to build a modelon a training set which consists of targets with approveddrugs and a negative set of non-drug targets. The modelcan be used to generate a druggability score of the poten-tial novel targets. A similar approach was first used byBakheet and Doig [14, 15] and followed by several otherworks where the performance of the models was improvedby adding new features [16]. The protein features in thesemodels included properties derived from the sequenceand other protein functions such as gene ontology, essen-tiality based on mouse gene knockdown experiments andtissue specificity. The advantage of this approach is that isit unbiased and can be used to evaluate proteins with lim-ited information about biological function or essentialityusing a variety of properties determined from the aminoacid sequence of the protein.Here we hypothesized that druggability of a target can

be indirectly and automatically derived from a set ofproteins that have been successfully identified as gooddrug targets (positive set) with a machine learning ap-proach that discovers their shared properties comparedto a negative set of proteins. The small set of known val-idated drug targets and the large set of unknown poten-tial targets resulted in an unbalanced positive-onlylearning task [17, 18], requiring us to base our approachon bagging thousands of Random Forest classifierstrained on different instances of the negative set toachieve high accuracy predictions in identifying the trialdrugs in clinical trials not belonging to the training set.Network features have been shown to be particularly

important to score potential drug target proteins [19,20] There are several differences between our approachand the work presented in [19, 20], first, we focus juston oncology drug target and use as additional functionalfeature an ad hoc representation of the pathways whereeach candidate protein is involved, second we show thatjust five features encoding the protein network proper-ties are enough to efficiently score the drug target withhigh accuracy, and finally we adopt a bagging approachto take into account the lack of appropriate negative ex-amples of oncology drug targets. We also report the ap-plication of our approach to the whole drugbank datasetas was performed in [19, 20].We focused our predictions on oncological targets by

scoring proteins by the most cancer-relevant biologicalprocesses, molecular functions and signaling pathwaysbased on a manually curated database [21].

Dezső and Ceccarelli BMC Bioinformatics (2020) 21:104 Page 2 of 12

Page 3: Machine learning prediction of oncology drug targets based

The proposed approach can be readily used to makeindication-specific drug-target prediction by combininggeneric druggability features with indication-specific bio-logical functions.

ResultsProperties of drug targetsWe determined the set of approved oncology drugsbased on the current version of the Therapeutic TargetDatabase [6]. Approved, clinical and investigationaldrugs with corresponding indication and their respectivetargets were downloaded from the database. The datawas filtered and curated to determine a high confidencelist of approved and clinical trial oncology targets. Ourfinal list consisted of 102 targets for approved drugs andan additional 277 targets for clinical drugs (Table S1).Next, we determined a comprehensive list of 70 prop-

erties for all human proteins by combining the manuallycurated literature captured in the Swiss-prot database,the computational predictions for missing features andnetwork centrality properties calculated based on theprotein-protein interaction network (Methods). The setof features included standard characteristics based onthe protein sequences from Swiss-prot database [22](such as molecular weight and physicochemical classes)as well as several other properties that have been shownto differentiate drug-targets from non-drug targets basedon previous publications [14–16]. These features in-cluded: subcellular localization, post-translational modi-fications, enzyme classification, PEST region (peptidesequence enriched in proline, glutamic acid, serine andthreonine amino acids), secondary structure, signal pep-tide cleavage, protein essentiality, solvent accessibility,and tissue specificity (for a full list of features see TableS2). Protein essentiality was determined based on mousehomozygous loss-of function mutations that lead to le-thality and mapped through orthologs to human pro-teins [23].In addition, we calculated network centrality measures

for each protein based on the protein-protein network in-formation from the STRING database [24]. It has beenshown that network properties of proteins correlate withtheir biological functions, essentiality and tissue specificity[25, 26]. Therefore, the network information complementsthese other properties further helping with the evaluationof less studied proteins where information about bio-logical function and essentiality may be limited.To capture the known biological functions of the pro-

teins we used the comprehensive ontology database ofthe commercially available database of Metacore [21]and scored the proteins based on biological processes,molecular function, and signaling pathways. The scoringof the protein biological functions was done accordingto the ranked ontologies ordered by the enrichment

analysis of the targets in the training set (Methods). Thisprocedure captured the most significant gene ontologycategories for the validated oncology targets in the posi-tive training dataset.Next, we performed statistical tests to identify the fea-

ture which were significantly different between the set ofdrug targets and non-drug targets (Fig. 1). Our mainfindings were in good agreement with previous studies[14, 15]. For example, we found that targets were morelikely to be membrane proteins (p = 2.33*10− 7), wereenriched in enzymes (p = 2.8*10− 11) and tended to betissue-specific (1.6*10− 8). We found the presence ofmore glycosylation sites (p = 9.8*10− 20) possibly indicat-ing longer half-lives for drug targets. The target essenti-ality status, determined from mouse knock-out studies[23], was significantly more established compared to thenon-targets (p < 2.2*10− 16) reflecting the fact that thedrug targets are in general more studied proteins.Interestingly, the network properties of the proteins

showed large differences between drug- and non-drug tar-gets (p < 2.2*10− 16 for all network measure). Indeed, it hasbeen observed in previous studies that drug targets tend tohave a higher number of connections compared to non-target proteins based on unbiased high-throughput yeasttwo-hybrid system [25]. We found that the commonly usednetwork centrality measures showed some of the most sig-nificant differentiation between targets and non-targets(Fig. 1). We included a complete list of differentiating fea-tures in supplementary data (Table S2 and Figure S1).

Machine learning prediction of target “druggability”Our main objective is to use the set of 70 protein featuresdescribed above to prioritize and score proteins in order topromote the discovery of novel targets and/or to filter thecandidate list of targets. The set of 102 targets of approvedcancer drugs is our positive training set, this is the typicalsetting where a learner only has access to positive examplesand unlabeled data because in the set of proteins outsidethis small collection of approved targets there are many tar-gets (for example in the pipelines) or targets still to be con-sidered or discovered. This kind of problem is known inmachine learning as Positive Unlabeled (PU) [17, 27] withthe additional complication of the high unbalance [28] be-tween the positive set and the wide set on unlabeled sam-ples. Here we adopt an approach combining easy ensemble[28] and bagging [29] as shown in Fig. 2. In order to have abalanced training set for our model we generated negativetraining sets of the same size of the positive set by randomsampling without replacement all human proteins after ex-cluding both the approved and clinical trial oncology tar-gets. We built 10,000 random forest models using each ofthe random negative sets and made predictions based oneach model. We then assigned a drug target probabilityscore to each protein by averaging the predictions over the

Dezső and Ceccarelli BMC Bioinformatics (2020) 21:104 Page 3 of 12

Page 4: Machine learning prediction of oncology drug targets based

10,000 models. To evaluate the performance of our modelwe considered the independent set of 277 targets whichhad at least one oncology clinical trial drug targeting them,but no approved drugs. We achieved a high-accuracy pre-diction in identifying the clinical trial drugs resulting in anAUC of 0.89 (Fig. 3a and Table S3). Furthermore, we inves-tigated if the threshold on the protein-protein interactionconfidence would affect our results, but found no signifi-cant changes in prediction accuracy (Table S3A). We notethat we tested neural network models as well for the pre-dictions and the results and prediction accuracy was verysimilar to the random forest models (results not shown).The list of targets for approved cancer drugs used for thetraining set and the clinical trial drug targets for validationwas included in the supplementary data (Table S1).To further evaluate our predictions, we compared our

drug target score with a recent genome-scale CRISPRscreen by Behan et al. [2], where the authors defined apriority score based on a 324 cancer cell line panel. We

found that our prediction score had a significant correl-ation with the priority score calculated from the CRISPRscreen (p = 5.5*10− 12 significant correlation) validatingour findings.

Predictive features of target “druggability”In our models, we included all features, regardless ofhow well they differentiated drug targets form non-drugtargets. Most of the features included can be consideredas independent variables, however, there were few sub-sets such as the network centrality measures or the geneontology classifications which were strongly correlated(Fig. S2). Nevertheless, we found that the random forestmodels were able to pick the most important variablesfor the predictions. To characterize the relative import-ance of the protein features in the predictions we lookedat the mean decrease of Gini metric [30]. We averagedthe variable importance over 10,000 models giving us anoverall estimate of feature importance (Fig. 3b). The top

Fig. 1 Top features differentiating drug targets from non-targets. The features are significantly different between drug targets and non-targets.The asterisk marks the significance level of the difference: p < 0.05 indicated as *, p < 0.01 indicated as ** and p < 0.001 indicated as ***. The redand black bars correspond to drug and non-drug targets, respectively. a Degree or the number of interactions of the proteins. b Betweennesscentrality of the proteins. c Tissue specificity of proteins characterized by Shannon entropy. d Subcellular localization of proteins. e Enzymeclassification of proteins. f Essentiality of proteins. g Fraction of protein characterized by different post translational modifications

Dezső and Ceccarelli BMC Bioinformatics (2020) 21:104 Page 4 of 12

Page 5: Machine learning prediction of oncology drug targets based

measures include the biological function of proteins,network centrality measures (page-rank, closeness, be-tweenness, degree), protein essentiality, tissue specificity,localization, and solvent accessibility. Indeed, it is ex-pected that the biological function is a crucial feature ofa drug target protein, because a protein may be drug-gable based on its 3D structure and other properties, butif it is not involved in disease-relevant processes target-ing it with a drug will have minimal impact on the dis-ease state.The different network centrality measures and protein

essentiality are correlated as shown in previous studies[31]. The combination of biological function, essentialityand centrality measures reflects the fact that the targetnot only has to have the right biological function, butthe high centrality also assures that targeting it with adrug will have a high impact on those processes. An-other important feature for our predictions was tissue

specificity. Indeed, tissue-specific genes have been shownthat are more likely to be drug targets [26, 32] due tothe reduced risk of side effects.We compared our predictions scores for the approved

oncology targets, clinical targets and the rest of the pro-teins (Fig. 4a). As expected, our training data had thehighest score with a median of 0.96. Nevertheless, theindependent set of cancer clinical targets was also char-acterized by a high median score of 0.73 compared tothe rest of the non-target proteins which had a medianscore of only 0.11. It is expected that a large fraction ofproteins is not good drug targets because of poor drugg-ability, toxicity or disease irrelevant biological functions.However, the set of non-drug targets had a subset of2117 outliers, proteins characterized with scores largerthan 0.5 (Fig. 4a). We believe this subset of proteins maycontain potentially interesting candidates for novel drugtargets.

Fig. 2 Overview of the drug target druggability predictions. Our positive training set consisted of 102 approved oncology drug targets. Wegenerated negative training sets of the same size as the positive set by random sampling without replacement all human proteins afterexcluding both the approved and clinical trial oncology targets. We built a large number of 10,000 random forest models using each of therandom negative sets and made predictions based on each model. We then assign a drug target probability score to each protein by averagingthe predictions of the 10,000 models. To evaluate the performance of our model we considered the independent set of 277 targets which had atleast one oncology clinical trial drug targeting them, but no approved drugs

Dezső and Ceccarelli BMC Bioinformatics (2020) 21:104 Page 5 of 12

Page 6: Machine learning prediction of oncology drug targets based

Next, we compared our predictions with existing druginformation from the DrugBank (Fig. 4b). We found astrong positive correlation between the number of cur-rently approved drugs and the prediction scores of theircorresponding targets. Indeed, the proteins with scores ofat least 0.7 had in average at least two drugs targetingthem. Our top scored proteins (> 0.9) were well estab-lished targets, being targeted on average by nearly eightdrugs (Fig. 4b). These results show that using only a smallset of 102 validated targets we were able to recover themajority of known drug targets with high confidence. Al-though our predictions were based on a training set of on-cology targets, many of the druggability features in ourmodel most likely were not specific to oncology targets.However, because we scored higher the cancer specificbiological functions, we expected that cancer targets mayscore higher compared to the targets of other indications.In order to test this, we compared the scores of 540 non-cancer clinical trial targets with the scores of the 277

cancer clinical trial targets from TTD. Indeed, we found asmall but significant (p = 0.02) difference between thescores of non-cancer and cancer clinical trial targets (0.64versus 0.73). Additionally, we downloaded a set of 402 tar-gets of withdrawn drugs from the DrugBank. A small setof 20 targets of withdrawn drugs were identified as nothaving any approved drug targeting them. We found thatthe scores of targets associated only with withdrawn drugswere significantly smaller compared to the targets of drugswith ongoing clinical trials (0.37 vs 0.73, Figure S3). Wenote that the set of targets associated with withdrawn tar-gets was rather small, as most drugs fail not necessarilybecause of their targets, but issues related to efficacy, off-target effects and clinical trial design.Almost all the highest scored proteins not included in

the training set had existing drugs targeting them (TableS4). The target proteins with highest scores includedmany well validated oncology targets (for example:EGFR, VEGF, c-KIT and c-Met).

Fig. 3 Prediction performance and the most predictive features. a Receiver operating characteristic (ROC) curve showing the performance of themodel. The plot is characterized by an AUC (area under the curve) of 0.89. b The decrease in Gini for top features. The bar plot shows thefeatures ordered by decreasing importance of prediction in the models

Dezső and Ceccarelli BMC Bioinformatics (2020) 21:104 Page 6 of 12

Page 7: Machine learning prediction of oncology drug targets based

The indication specificity of the predictions can be fur-ther enhanced by filtering the candidate targets usingdata specific for the indication of interest. For example,oncology targets could be further filtered down by cor-relation analysis of gene expression with overall survivalin cancer based on public data such as the Cancer Gen-ome Atlas (TCGA) [33]. Similarly for immune-oncologytargets it is possible to focus on targets with immune cellspecific expression profiles based on publicly availabledatasets such as for example Database of Immune CellEQTLs (DICE) [34].

Comparisons with similar approachesWe compared our prediction performance with other simi-lar publications [15, 16]. Direct comparison between theseapproaches is difficult due to the differences in training and

validation sets, the use of different features and the differ-ences in the performance metrics reported. We would notethat the main difference between our approach and thepublications of Kim et al. and Bakheet at al [15, 16]. are inthe addition of network centrality measures and our rank-based scoring of the oncology relevant biological functions.Indeed, including these features had a significant impact onimproving the prediction accuracy (Table S3C) and also en-abled us to prioritize targets based on our indication ofinterest. Our leave-one -out cross validation averaged over100 random non-target sets achieved a sensitivity, specifi-city and accuracy around 0.91 and an AUC value of 0.97(Table S3A). These values show improvement compared tothe sensitivity between 0.64–0.85 reported by Kim et al.and the accuracy of 0.89 reported by Bakheet et al. Finally,the publication of Li et al. is similar in their choice to

Fig. 4 Predicting novel drug targets. a Predictions scores. We compared the distribution of predictions scores among the training set (approvedcancer drug targets), validation set (clinical drug targets) and the rest of the proteins. The median score for the approved targets is the highest(0.96), followed by the clinical trial targets (0.73) and the rest of non-target proteins (0.11). The subset of proteins from the non-target setcharacterized with high scores are good novel drug target candidates. b Correlation of the number of currently approved drugs and theprediction scores of their corresponding targets

Dezső and Ceccarelli BMC Bioinformatics (2020) 21:104 Page 7 of 12

Page 8: Machine learning prediction of oncology drug targets based

include several network properties. Direct comparison isdifficult since they use a very large set of different features.Nevertheless, we performed a cross validation on the 1100drug targets they utilized in their model and performed across validation using random sampling for the negativedrug target set (Table S3B). Our performance on this set oftargets was very similar within the sampling error to theprediction performance reported by Li et al. (sensitivity is0.87 vs 0.9; specificity is 0.85 vs 0.8; accuracy 0.87 vs 0.85)Our approach has higher prediction performance when fo-cusing only on oncology targets by applying a rank-orderedscoring method that can prioritize cancer specific biologicalfunctions and potentially can make predictions specific toany indication of interest.

DiscussionDespite large R&D investments in preclinical studies only4% of drug development pipelines yield licensed drugs[35] due to the low predictivity of such studies in terms oftherapeutic efficacy and due to the fact that the definitiveevidence of the validity of a new drug target for a diseaseis not validated until late phase development is completed.Therefore, the selection and prioritization of good drugtargets is a central problem in drug discovery.We presented a Machine Learning approach to prioritize

proteins according to their similarity to approved drug tar-gets. The main characteristic of our approach is the factthat it is completely unbiased. We computed a large collec-tion of protein features and let the learning method scorethe features that best discriminate between approved tar-gets from other proteins.The interaction between drugs and their targets acti-

vates signaling cascades through Protein-Protein Inter-action (PPI) networks causing downstream perturbationsin the cell’s transcriptome. A (PPI) network models thecascade of relationships between targets and proteins byusing physical contacts, genetic interactions, and func-tional relationships. Network analysis not only gives asystems-level understanding of drug action and diseasecomplexity but can also help to improve the efficiency ofdrug design [36]. Hence various network measures (e.g.,centrality measures, random walk, shortest path, nearestneighbor etc.) can be integrated with gene expressionprofiles to validate known drug targets or to identify es-sential proteins [20, 37]. Our results show, in a com-pletely unbiased approach, that network centralitymeasures are among the most important discriminativeproperties to identify novel drug targets and characterizeexisting ones.In addition to network centrality measures, other im-

portant features that our approach selects to score drugtarget are biological processes, essentiality and tissuespecificity. Intuitively all these aspects are extremely im-portant to characterize proteins associated with diseases.

Interestingly the high importance score that the learningsystem gives to tissue specificity confirms the assump-tions of large scale CRISPR-Cas9 screens adopted toprioritize cancer therapeutic targets and, surprisingly,our score is highly correlated to the one experimentallyderived CRISPR-Cas9 data set [2].The main bias toward cancer drug targets of our ap-

proach is related to the choice of the training set (102 ap-proved targets) and the representation of the functionalfeatures associated with biological functions and signalingpathways, nevertheless our approach is general enough tobe easily applied to other disease contexts with the appro-priate selection of the training set and the correspondingrepresentation of the functional features. We have shownthat the method proposed here is highly predictive on avalidation dataset consisting of 277 targets of clinical trialdrug confirming that our druggability score is an efficientand inexpensive tool to guide the main choices in theprocess of drug discovery and development.Finally, in order to apply an unbiased approach to

score every possible protein, the method described herelacks of features based on genomic evidence [38, 39] asthese features could not be extended to the whole prote-ome. Another important limitation of our method is thatit is completely indirect, i.e. it does not take into accountmolecular models that use protein and drug structuresto predict the binding between small molecules to theappropriate targets [40].However, we believe that the target druggability score

developed here is a complementary approach with re-spect to the ones based on genetic evidence and can beeasily integrated with these more focused studies.

ConclusionsWe developed a machine learning approach to prioritizeproteins according to their similarity to approved drugtargets. We used the majority of features known to bedifferentiating drug targets from non-targets based onprevious studies. We show that the network centralitymeasures are among the most unbiased predictive fea-tures to identify good drug targets and our results agreewith those reported from large scale CRISPR-Cas9screens. We made our predictions oncology specific byencoding the most cancer specific biological functions inour machine learning algorithm. This algorithm can beapplied for other therapeutic areas, by combining the ap-propriate target training set with the most disease rele-vant gene ontology categories. Our method was highlypredictive on a validation dataset consisting of clinicaltrial drug and recovered the vast majority of validatedtargets already targeted with many existing drugs. Ourmethod also identified subset of novel proteins with highdruggability scores but currently without any knowndrug targeting them.

Dezső and Ceccarelli BMC Bioinformatics (2020) 21:104 Page 8 of 12

Page 9: Machine learning prediction of oncology drug targets based

MethodsProtein propertiesThe sequences of all human proteins were downloadedfrom the Uniprot database. Basic protein features weredetermined using the pepstat program of the EuropeanMolecular Biology Open Software Suite (EMBOSS) [41].All properties generated from the Pepstat program wereincluded in the prediction.The Swiss-prot database was used to extract the post-

translational modifications such as phosphorylation andglycosylation sites as well as the enzyme classificationinformation.

Tissue specificityThe RNAseq tissue expression data was downloaded fromthe Genotype-Tissue Expression (GTEx) and the HumanProtein Atlas (HPA) [42, 43] databases. To determine thetissue specificity we calculated the entropy measure for agene expression profile as described below. We note thatShannon entropy or similar measures have been used be-fore to characterize tissue specificity of gene expressionprofiles [44]. The equation describes the tissue specificityof a gene (g), where eig is the gene’s expression in tissue i,N is the total number of tissues and Sg is the sum of ex-pression in all tissues considered:

E gð Þ ¼ −1Sg

XN

i¼1

eig log2eigSg

The entropy of a gene’s expression ranges from zero forgenes expressed in a single tissue to log2(N) for genes char-acterized by a uniform expression profile across all tissues.The entropy was calculated for both datasets separately andthe averaged values were used for the predictions.

Computational predictions of protein propertiesIn order to determine a complete set of features for allhuman proteins, we utilized a set of computational pre-diction methods. The input for all algorithms was the se-quence of the proteins. We selected algorithms based onavailability and feasibility in terms of computational

times. A full list of the different methods utilized hasbeen listed in Table 1.The list of computational methods utilized to predict

target features.

Statistical analysis and data normalizationOur protein features included continuous and categoricalvalues. We applied the Wilcoxon Rank Sum and the Chi-squared test for the continuous and categorical variables,respectively as implemented in the R statistical software.The determined features had a wide range of values and

applying appropriate scaling was necessary. First, the fea-tures characterized by a heavy-tailed distribution (such asthe network properties) were log-transformed. Second, allfeatures were scaled to a number between zero and oneby normalizing them to the difference between the max-imum (fmax) and the minimum of the feature (fmin):

f scaled ¼ f − f min

f max− f min:

Network propertiesThe protein interaction table was downloaded from theSTRING database [24] and the top 10% of the highestscored interactions were used for the degree and central-ity measure calculations. The degree, betweenness cen-trality, closeness centrality, pagerank and eigen centralitywas calculated using the igraph R package.

Identifying drug targetsWe downloaded from the TTD website [6] the full data-base (version 6.1.01). This database contained 2917unique protein targets. The downloaded data included345 approved, 903 clinical trial and 1669 research tar-gets. We filtered these targets based on the indicationsand kept only the ones with at least one indication re-lated to oncology. Our final list of oncology targets con-sisted of 102 approved drug targets and an additional of277 clinical trial drug targets.

Table 1 Computational methods to predict features

Name of the method Predicted Feature Reference website

NetPhos [45] phosphorylation sites http://www.cbs.dtu.dk/services/NetPhos/

GlycoMine [46] glycosylation sites http://glycomine.erc.monash.edu/Lab/GlycoMine/

WESA [47] solvent accessibility http://pipe.scs.fsu.edu/wesa/

Garnier [48] secondary protein structure http://www.bioinformatics.nl/cgi-bin/emboss/garnier

Epestfind [41] PEST motif http://emboss.bioinformatics.nl/cgi-bin/emboss/epestfind

SignalP [49] signal peptide cleavage site http://www.cbs.dtu.dk/services/SignalP/

CELLO [50] cellular sub-localization http://cello.life.nctu.edu.tw/

TMHMM [51] presence of transmembrane helices http://www.cbs.dtu.dk/services/TMHMM/

Dezső and Ceccarelli BMC Bioinformatics (2020) 21:104 Page 9 of 12

Page 10: Machine learning prediction of oncology drug targets based

We downloaded a second set of drug targets from theDrugBank database [5]. This data consisted of 2847 tar-get protein with corresponding drug target information.We used this large dataset to add existing drug informa-tion to our list of prioritized targets.

Gene ontologiesWe used three ontology categories based on the content ofMetabase: 1652 biological processes, 114 molecular func-tions and 246 signaling pathways. To determine the mostimportant ontologies for cancer drug targets we performedan enrichment analysis of the 102 targets of approveddrugs. The ontologies were rank-ordered based on their en-richment p-value (Fisher’s over-representation test). Ontol-ogies that did not overlap with any of the approved targetswere assigned a maximum rank value equal to the totalnumber of ontologies in that category. Each protein wasscored based on the rank of the ontology they belonged to.As in most cases, proteins were part of multiple ontologieswe assigned the top 3 ranks to score them. Using this scor-ing system, the proteins characterized by cancer-specificfunctions were assigned top ranks resulting in large differ-ences between the targets and non-targets (Figure S1).

Predictions of target druggability scoreThe random forest models, predictions and performanceevaluation were done using the “randomForest” and“caret” libraries in R statistical software. The training setwas generated by combining the set of 102 approved on-cology target drugs with a set of non-drug targets of thesame size. The non-target set was defined as all proteinsexcluding targets with known approved drugs regardlessof indications. This set of 18,604 non-drug target pro-teins was sampled without replacement to generate 102non-drug targets for training the model.The number of trees in each model was fixed at 1000

as any increase in trees above this value did not yieldany significant changes in the prediction probability. Weused the tuneRF function in the random forest packageto determine the optimal number of variables (“mtry”)sampled at each split. The stepfactor was set at 0.01 andthe “improve” parameter at 0.01 value. The predictionswere averaged over 10,000 models generated by the ran-dom sampling of the non-drug targets (Fig. 2).

Supplementary informationSupplementary information accompanies this paper at https://doi.org/10.1186/s12859-020-3442-9.

Additional file 1: Figure S1. The protein features differentiatingbetween drug-targets and non drug-targets. The asterisk marks the sig-nificance level of the difference: p < 0.05 indicated as *, p < 0.01 indi-cated as ** and p < 0.001 indicated as ***. The red and black barscorrespond to drug and non-drug targets, respectively.

Additional file 2: Figure S2. The protein feature correlation matrix plot.

Additional file 3: Figure S3. The drug target druggability scoredistributions for approved, clinical, withdrawn drugs and non-drugtargets.

Additional file 4: Table S1. The list of approved drug targets in thetraining set and the list of clinical targets in the validation set.

Additional file 5: Table S2. The list of protein features utilized in theprediction model and the statistical significance differentiating theprotein properties of drug targets from non-drug targets.

Additional file 6: Table S3. A. The prediction performance measuresbased on cross-validation as function of network interaction thresholds. B.Prediction performance based on the DrugBank targets. C. The predictionperformance without network and gene ontology features.

Additional file 7: Table S4. The top 100 highest scored drug targetand current drugs targeting them.

AbbreviationsTTD: Therapeutic Target Database; PU: Positive Unlabeled; AUC: Area Underthe Curve; TCGA: The Cancer Genome Atlas; DICE: Database of Immune CellEQTLs; GTEx: Genotype-Tissue Expression; HPA: Human Protein Atlas;EMBOSS: European Molecular Biology Open Software Suite; CELLO: AsubCELlular LOcalization; TMHMM: Transmembrane Helices Hidden MarkovModels; WESA: Weighted Ensemble Solvent Accessibility; PPI: Protein-Proteininteration

AcknowledgementsNot applicable.

Authors’ contributionsConception and design of the analysis: ZD, MC. Collection of data: ZD.Perform the analysis: ZD. Manuscript preparation: ZD, MC. The authors readand approved the final manuscript.

FundingThe research was founded by Abbvie, Inc. The research leading to theseresults has also received funding from AIRC under IG 2018 - ID. 21846project – P.I. Ceccarelli Michele. The funding body played no role in thedesign of the study and collection, analysis, and interpretation of data and inwriting the manuscript.

Availability of data and materialsThe matrix of features for each protein is available at: https://figshare.com/s/b6c6566e7d73297bdd0c.

Ethics approval and consent to participateNot applicable.

Consent for publicationNot applicable.

Competing interestsZD is an employee of ABBVIE Biotherapeutics, Redwood City, CA. MC hasbeen employee of ABBVIE.

Author details1Computational Biology-Genomic Research Center, ABBVIE, Redwood City,CA, USA. 2Department of Electrical Engineering and Information Technology(DIETI), University of Naples “Federico II”, 80128 Naples, Italy. 3Istituto diRicerche Genetiche “G. Salvatore”, Biogem s.c.ar.l, 83031 Ariano Irpino, Italy.

Received: 24 September 2019 Accepted: 4 March 2020

References1. McFarland JM, Ho ZV, Kugener G, Dempster JM, Montgomery PG, Bryan JG,

et al. Improved estimation of cancer dependencies from large-scale RNAiscreens using model-based normalization and data integration. NatCommun. 2018;9:4610. https://doi.org/10.1038/s41467-018-06916-5.

Dezső and Ceccarelli BMC Bioinformatics (2020) 21:104 Page 10 of 12

Page 11: Machine learning prediction of oncology drug targets based

2. Behan FM, Iorio F, Picco G, Gonçalves E, Beaver CM, Migliardi G, et al.Prioritization of cancer therapeutic targets using CRISPR-Cas9 screens.Nature. 2019;568:511–6. https://doi.org/10.1038/s41586-019-1103-9.

3. Salmaso V, Moro S. Bridging molecular docking to molecular dynamics inexploring ligand-protein recognition process: an overview. Front Pharmacol.2018;9:923. https://doi.org/10.3389/fphar.2018.00923.

4. Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, et al.DrugBank: a comprehensive resource for in silico drug discovery andexploration. Nucleic Acids Res. 2006;34(Database issue):D668–72. https://doi.org/10.1093/nar/gkj067.

5. Law V, Knox C, Djoumbou Y, Jewison T, Guo AC, Liu Y, et al. DrugBank 4.0:shedding new light on drug metabolism. Nucleic Acids Res. 2014;42(Database issue):D1091–7. https://doi.org/10.1093/nar/gkt1068.

6. Li YH, Yu CY, Li XX, Zhang P, Tang J, Yang Q, et al. Therapeutic targetdatabase update 2018: enriched resource for facilitating bench-to-clinicresearch of targeted therapeutics. Nucleic Acids Res. 2018;46:D1121–7.https://doi.org/10.1093/nar/gkx1076.

7. Wang K, Sun J, Zhou S, Wan C, Qin S, Li C, et al. Prediction of drug-targetinteractions for drug repositioning only based on genomic expressionsimilarity. PLoS Comput Biol. 2013;9:e1003315. https://doi.org/10.1371/journal.pcbi.1003315.

8. Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, et al. Theconnectivity map: using gene-expression signatures to connect smallmolecules, genes, and disease. Science. 2006;313:1929–35. https://doi.org/10.1126/science.1132939.

9. Lamb J. The connectivity map: a new tool for biomedical research. Nat RevCancer. 2007;7:54–60. https://doi.org/10.1038/nrc2044.

10. Ma’ayan A, Rouillard AD, Clark NR, Wang Z, Duan Q, Kou Y. Lean big dataintegration in systems biology and systems pharmacology. TrendsPharmacol Sci. 2014;35:450–60. https://doi.org/10.1016/j.tips.2014.07.001.

11. Perlman L, Gottlieb A, Atias N, Ruppin E, Sharan R. Combining drug andgene similarity measures for drug-target elucidation. J Comput Biol. 2011;18:133–45. https://doi.org/10.1089/cmb.2010.0213.

12. Fakhraei S, Huang B, Raschid L, Getoor L. Network-based drug-targetinteraction prediction with probabilistic soft logic. IEEE/ACM Trans ComputBiol Bioinform. 2014;11:775–87. https://doi.org/10.1109/TCBB.2014.2325031.

13. Gayvert KM, Madhukar NS, Elemento O. A data-driven approach topredicting successes and failures of clinical trials. Cell Chem Biol. 2016;23:1294–301. https://doi.org/10.1016/j.chembiol.2016.07.023.

14. Bull SC, Doig AJ. Properties of protein drug target classes. PLoS ONE. 2015;10:e0117955. https://doi.org/10.1371/journal.pone.0117955.

15. Bakheet TM, Doig AJ. Properties and identification of human protein drugtargets. Bioinformatics. 2009;25:451–7. https://doi.org/10.1093/bioinformatics/btp002.

16. Kim B, Jo J, Han J, Park C, Lee H. In silico re-identification of properties ofdrug target proteins. BMC Bioinformatics. 2017;18(Suppl 7):248. https://doi.org/10.1186/s12859-017-1639-3.

17. Cerulo L, Elkan C, Ceccarelli M. Learning gene regulatory networks fromonly positive and unlabeled data. BMC Bioinformatics. 2010;11:228. https://doi.org/10.1186/1471-2105-11-228.

18. Elkan C, Noto K. Learning classifiers from only positive and unlabeled data.In: Proceeding of the 14th ACM SIGKDD international conference onKnowledge discovery and data mining - KDD 08. New York: ACM Press;2008. p. 213. https://doi.org/10.1145/1401890.1401920.

19. Li Z-C, Zhong W-Q, Liu Z-Q, Huang M-H, Xie Y, Dai Z, et al. Large-scaleidentification of potential drug targets based on the topological features ofhuman protein-protein interaction network. Anal Chim Acta. 2015;871:18–27. https://doi.org/10.1016/j.aca.2015.02.032.

20. Isik Z, Baldow C, Cannistraci CV, Schroeder M. Drug target prioritization byperturbed gene expression and network information. Sci Rep. 2015;5:17417.https://doi.org/10.1038/srep17417.

21. Ekins S, Bugrim A, Brovold L, Kirillov E, Nikolsky Y, Rakhmatulin E, et al.Algorithms for network analysis in systems-ADME/Tox using the MetaCoreand MetaDrug platforms. Xenobiotica. 2006;36:877–901. https://doi.org/10.1080/00498250600861660.

22. Bairoch A, Boeckmann B. The SWISS-PROT protein sequence data bank. NucleicAcids Res. 1991;19(Suppl):2247–9. https://doi.org/10.1093/nar/19.suppl.2247.

23. Georgi B, Voight BF, Bućan M. From mouse to human: evolutionarygenomics analysis of human orthologs of essential genes. PLoS Genet. 2013;9:e1003484. https://doi.org/10.1371/journal.pgen.1003484.

24. Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, et al.STRING v11: protein-protein association networks with increased coverage,supporting functional discovery in genome-wide experimental datasets.Nucleic Acids Res. 2019;47:D607–13. https://doi.org/10.1093/nar/gky1131.

25. Yildirim MA, Goh K-I, Cusick ME, Barabási A-L, Vidal M. Drug-target network.Nat Biotechnol. 2007;25:1119–26. https://doi.org/10.1038/nbt1338.

26. Dezso Z, Nikolsky Y, Sviridov E, Shi W, Serebriyskaya T, Dosymbekov D, et al.A comprehensive functional analysis of tissue specificity of human geneexpression. BMC Biol. 2008;6:49. https://doi.org/10.1186/1741-7007-6-49.

27. Elkan C, Noto K. Learning classifiers from only positive and unlabeled data.portal.acm.org. 2008.

28. He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl DataEng. 2009;21:1263–84. https://doi.org/10.1109/TKDE.2008.239.

29. Breiman L. Bagging predictors. Mach Learn. 1996;24:123–40. https://doi.org/10.1007/BF00058655.

30. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and regressiontrees. Monterey: Wadsworth & Brooks/Cole Advanced Books & Software;1984. https://doi.org/10.1201/9781315139470.

31. Jeong H, Mason SP, Barabási AL, Oltvai ZN. Lethality and centrality inprotein networks. Nature. 2001;411:41–2. https://doi.org/10.1038/35075138.

32. Ryaboshapkina M, Hammar M. Tissue-specific genes as an underutilizedresource in drug discovery. Sci Rep. 2019;9:7233. https://doi.org/10.1038/s41598-019-43829-9.

33. Tomczak K, Czerwińska P, Wiznerowicz M. The Cancer Genome Atlas (TCGA):an immeasurable source of knowledge. Contemp Oncol (Pozn). 2015;19:A68–77. https://doi.org/10.5114/wo.2014.47136.

34. Schmiedel BJ, Singh D, Madrigal A, Valdovino-Gonzalez AG, White BM,Zapardiel-Gonzalo J, et al. Impact of genetic polymorphisms on humanimmune cell gene expression. Cell. 2018;175:1701–1715.e16. https://doi.org/10.1016/j.cell.2018.10.022.

35. Munos B. Lessons from 60 years of pharmaceutical innovation. Nat RevDrug Discov. 2009;8:959–68. https://doi.org/10.1038/nrd2961.

36. Csermely P, Korcsmáros T, Kiss HJM, London G, Nussinov R. Structure anddynamics of molecular networks: a novel paradigm of drug discovery: acomprehensive review. Pharmacol Ther. 2013;138:333–408. https://doi.org/10.1016/j.pharmthera.2013.01.016.

37. Li M, Zhang H, Wang J, Pan Y. A new essential protein discovery methodbased on the integration of protein-protein interaction and gene expressiondata. BMC Syst Biol. 2012;6:15. https://doi.org/10.1186/1752-0509-6-15.

38. Finan C, Gaulton A, Kruger FA, Lumbers RT, Shah T, Engmann J, et al. Thedruggable genome and support for target identification and validation indrug development. Sci Transl Med. 2017;9:eaag1166. https://doi.org/10.1126/scitranslmed.aag1166.

39. Floris M, Olla S, Schlessinger D, Cucca F. Genetic-driven druggable targetidentification and validation. Trends Genet. 2018;34:558–70. https://doi.org/10.1016/j.tig.2018.04.004.

40. Kitchen DB, Decornez H, Furr JR, Bajorath J. Docking and scoring in virtualscreening for drug discovery: methods and applications. Nat Rev DrugDiscov. 2004;3:935–49. https://doi.org/10.1038/nrd1549.

41. Rice P, Longden I, Bleasby A. EMBOSS: the european molecular biologyopen software suite. Trends Genet. 2000;16:276–7. https://doi.org/10.1016/s0168-9525(00)02024-2.

42. GTEx Consortium, Laboratory, Data Analysis &Coordinating Center(LDACC)—Analysis Working Group, Statistical Methods groups—AnalysisWorking Group, Enhancing GTEx (eGTEx) groups, NIH Common Fund, NIH/NCI, et al. Genetic effects on gene expression across human tissues. Nature.2017;550:204–13. https://doi.org/10.1038/nature24277.

43. Uhlén M, Fagerberg L, Hallström BM, Lindskog C, Oksvold P, Mardinoglu A,et al. Proteomics. Tissue-based map of the human proteome. Science. 2015;347:1260419. https://doi.org/10.1126/science.1260419.

44. Schug J, Schuller W-P, Kappen C, Salbaum JM, Bucan M, Stoeckert CJ.Promoter features related to tissue specificity as measured by Shannonentropy. Genome Biol. 2005;6:R33. https://doi.org/10.1186/gb-2005-6-4-r33.

45. Blom N, Gammeltoft S, Brunak S. Sequence and structure-based predictionof eukaryotic protein phosphorylation sites. J Mol Biol. 1999;294:1351–62.https://doi.org/10.1006/jmbi.1999.3310.

46. Li F, Li C, Wang M, Webb GI, Zhang Y, Whisstock JC, et al. GlycoMine: amachine learning-based approach for predicting N-, C- and O-linkedglycosylation in the human proteome. Bioinformatics. 2015;31:1411–9.https://doi.org/10.1093/bioinformatics/btu852.

Dezső and Ceccarelli BMC Bioinformatics (2020) 21:104 Page 11 of 12

Page 12: Machine learning prediction of oncology drug targets based

47. Chen H, Zhou H-X. Prediction of solvent accessibility and sites of deleteriousmutations from protein sequence. Nucleic Acids Res. 2005;33:3193–9.https://doi.org/10.1093/nar/gki633.

48. Garnier J, Osguthorpe DJ, Robson B. Analysis of the accuracy andimplications of simple methods for predicting the secondary structure ofglobular proteins. J Mol Biol. 1978;120:97–120. https://doi.org/10.1016/0022-2836(78)90297-8.

49. Armenteros JJA, Tsirigos KD, Sønderby CK, Petersen TN, Winther O, Brunak S,et al. SignalP 5.0 improves signal peptide predictions using deep neuralnetworks. Nat Biotechnol. 2019;37:420–3. https://doi.org/10.1038/s41587-019-0036-z.

50. Yu C-S, Chen Y-C, Lu C-H, Hwang J-K. Prediction of protein subcellularlocalization. Proteins. 2006;64:643–51. https://doi.org/10.1002/prot.21018.

51. Krogh A, Larsson B, von Heijne G, Sonnhammer ELL. Predictingtransmembrane protein topology with a hidden Markov model: applicationto complete genomes. J Mol Biol. 2001;305:567–80. https://doi.org/10.1006/jmbi.2000.4315.

Publisher’s NoteSpringer Nature remains neutral with regard to jurisdictional claims inpublished maps and institutional affiliations.

Dezső and Ceccarelli BMC Bioinformatics (2020) 21:104 Page 12 of 12