research article prediction of cancer proteins by...

Research ArticlePrediction of Cancer Proteins by Integrating ProteinInteraction Domain Frequency and Domain InteractionData Using Machine Learning Algorithms

Chien-Hung Huang1 Huai-Shun Peng1 and Ka-Lok Ng23

1Department of Computer Science and Information Engineering National Formosa University 64 Wen-Hwa RoadHuwei Yunlin 63205 Taiwan2Department of Biomedical Informatics Asia University Wufeng Shiang Taichung 41354 Taiwan3Department of Medical Research China Medical University Hospital China Medical University Taichung 40402 Taiwan

Correspondence should be addressed to Ka-Lok Ng ppiddigmailcom

Received 2 December 2014 Revised 25 February 2015 Accepted 3 March 2015

Academic Editor Xia Li

Copyright copy 2015 Chien-Hung Huang et al This is an open access article distributed under the Creative Commons AttributionLicense which permits unrestricted use distribution and reproduction in any medium provided the original work is properlycited

Many proteins are known to be associated with cancer diseases It is quite often that their precise functional role in diseasepathogenesis remains unclear A strategy to gain a better understanding of the function of these proteins is to make use of acombination of different aspects of proteomics data types In this study we extended Araguesrsquos method by employing the protein-protein interaction (PPI) data domain-domain interaction (DDI) data weighted domain frequency score (DFS) and cancer linkerdegree (CLD) data to predict cancer proteins Performances were benchmarked based on three kinds of experiments as follows (I)using individual algorithm (II) combining algorithms and (III) combining the same classification types of algorithms Whencompared with Araguesrsquos method our proposed methods that is machine learning algorithm and voting with the majorityare significantly superior in all seven performance measures We demonstrated the accuracy of the proposed method on twoindependent datasets The best algorithm can achieve a hit ratio of 894 and 728 for lung cancer dataset and lung cancermicroarray study respectively It is anticipated that the current research could help understand disease mechanisms and diagnosis

1 Introduction

It has been known for a long time that cancer is a result of lossof cell cycle control The loss of control is a result of series ofgenetic mutations involving activation of proto-oncogenes tooncogenes and inactivation of tumor-suppressing genesOncogenes and tumor suppressors may cause cancer byalternating the transcription factors such as the p53 and rasoncoproteins which in turn control expression of othergenes Therefore understanding how oncoprotein-oncopro-tein interacts and how oncoproteins drive the cell divisioncycle is indispensable for the study of molecular oncologyPredicting novel cancer-related proteins is an important topicin biomedical research experimental techniques such asmicroarrays are being used to characterize cancer Howeverthe process could be time consuming and labor-intensiveNagaraj and Reverter [1] proposed a Boolean logic based

approach to predict colorectal cancer genes Li et al [2] tookGO enrichment scores and KEGG enrichment scores asfeatures to predict retinoblastoma related genes The abovetwo studies are confined to predict specific cancers Forgeneral types of cancers Hosur et al [3] combined linearprogramming formulation for interface alignment to predictcancer related PPIs Aragues et al [4] used PPI data to predictcancer-related proteins In this study we extended Araguesrsquosstudy by employing PPI data and domain information toattain improved performance

Protein-protein interactions are inherent in almost everycellular process In fact PPI is the core of the entire interac-tomics system in living cells PPI appears when two or moreproteins bind together and perform a biological function[5] Almost all major research topics in molecular biologyinvolve PPI such as cellular function [6] genetic diseases[7] conserved patterns [8] and homologous relationships

Hindawi Publishing CorporationBioMed Research InternationalVolume 2015 Article ID 312047 15 pageshttpdxdoiorg1011552015312047

2 BioMed Research International

[9] The recent availability of PPI data has made it possibleto study human disease at a system level It is reported thatsince disease genes exhibit an increased tendency for theirprotein products to interact with one another they tendto be coexpressed in specific tissues and display coherentfunctions [10] Ideker and Sharan reported in a review article[11] on the applications of PPI networks to study diseasein four major areas (i) identifying new disease genes (ii)studying their network properties (iii) identifying disease-related subnetworks and (iv) performing network-baseddisease classification Another study [12] investigated thehuman cancer PPI network from a structural perspectivethat is protein interactions through their interfaces Theirfindings indicated that cancer-related proteins have smallermore planar more charged and less hydrophobic bindingsites than noncancer proteins

It is known that proteins are composed of multiple func-tional domains A domain is a unit of function associatedwith different catalytic functions or binding sites as foundin enzymes or regulatory proteins It is hypothesized thatcancer proteins also known as tumor associated genesmay share common functional domains [13] and thus aweighted domain score for each tumor associated genersquosdomain is determined Novel cancer proteins are determinedby translating full cDNA sequences to the corresponding pro-tein sequences and calculating the weighted domain scoresAnother work [14] used established methods to identifythe network topology of a cancer protein network Theyshowed that cancer proteins contain a high ratio of structuraldomains which have a high propensity for mediating proteininteractions Recently Clancy et al designed a statisticalmethod to infer the physical interactions between two com-plexes for the human and yeast species [15] Domains suchas the immunoglobulin domain Zinc-finger and the proteinkinase domains are the top three most frequently observedcancer protein domains Many other works also employedPPI and DDI to characterize disease networks [7 16ndash20] Inour previous works [21 22] a one-to-one DDI model wasproposed to obtain specific sets of DDI for oncoproteins andtumor suppressor proteins respectivelyThree specific sets ofDDI that is oncoprotein and oncoprotein tumor suppressorprotein and tumor suppressor protein and oncoprotein andtumor suppressor protein are derived from their PPIs

Weka (httpwwwcswaikatoacnzsimmlweka) [23] isa well known software tool which provides environmentsrelated to machine learning data mining text miningpredictive analysis and business analysis Machine learningand data mining algorithms have been widely used in bioin-formatics and computational biology [24ndash28] The presentauthors adopted amino acid composition profile informa-tion with the SVM classifier to improve protein complexesclassification [29] Additionally we also proposed identifyingmicroRNAs target of Arabidopsis thaliana by integratingprediction scores from PITA miRanda and RNAHybridalgorithms [30] Recently Li et al [31] used random forestmachine learning algorithm and topology features to identifythe functions of protein complexes

In this research we began by collecting cancerous proteininteraction data That is only interactions involved with

Collect cancer noncancer PPIs

Use domain information to annotate protein interaction

Calculate four types of feature scores as input for classifiers

Comparison with previous work

Predict cancer-related proteins by individual classifier

Predict cancer-related proteins by voting

Investigate performance by independent test dataset

Discover potential cancer proteins

Figure 1 System flowchart for this study

cancer proteins are consideredThese types of interactions areknown as cancerous PPIs Noncancerous protein interactionmeans either one or both of the proteins are not yet identifiedin relation to cancer Given the cancerous PPIs a set of DDIrules for cancer proteins are derived In addition to thisset of DDI we also considered other features the weighteddomain frequency scores (DFS) DFS C for cancer proteinsand DFS X for noncancer proteins and the cancer linkerdegree (CLD) score A total of four features (DDI DFS CDFS X and CLD) are adopted to make novel cancer proteinpredictions by using 39 machine learning algorithms fromthe Weka tool In addition we also verified the accuracy ofthe predictive model on two independent datasets Finallyusing differentially expressed genes found in lung cancermicroarray data as a case study we discovered some potentialcancer genes for further experimental investigation

2 Methods

21 System Flowchart In this study a system was set up topredict cancer proteins by integrating four types of featuresFirstly cancer and noncancer PPIs were collected from bio-logical databases and then those interactions were annotatedby using the domain information Next we determinedfour feature scores data normalization is needed to ensureconsistency in their distributionThe system flowchart of thisstudy is illustrated in Figure 1

22 Data Sources and Datasets Generation Cancer proteins(tumor suppressor protein (TSP) or oncoprotein (OCP)) ofthe learning datasetwere integrated from Institute of Biophar-maceutical Sciences of Taiwan National Yang Ming Univer-sity Tumor Associated Gene (TAG httpwwwbinfonckuedutwTAGGeneDocphp) database [13] and Memorial

BioMed Research International 3

PFamBioGrid

Swiss-Prot

PPI

Swiss

TaiwanYang Ming University

TAG

MSKCC

Cancer proteins (C)

Noncancer proteins (X)

Domain

OMIM

HLungDB

Learning datasetTest dataset

C-C PPIsC-XPPIsX-XPPIs

ID

Figure 2 Data sources and datasets generation

Sloan-Kettering Cancer Center (MSKCC httpcbiomskccorgresearchcancergenomicsindexhtml) Noncancer pro-teins were from BioGrid (httpthebiogridorg) [32] On theother hand cancer proteins of the independent test datasetfor Case Study 1 were obtained from two resources that isOnline Mendelian Inheritance in Man (OMIM httpwwwncbinlmnihgovomim) and the human lung cancer data-base (HLungDB) [33]

Protein domain informationwas downloaded fromPFamdatabase (httppfamxfamorg) [34] PPIs for both can-cer and noncancer proteins were retrieved from BioGriddatabase The Swiss ID of proteins was obtained from Swiss-Prot Database (httpwwwexpasyorgsprot)

There are three classified combinations of protein inter-action adopted as input data in the current study ldquo119862-119862rdquoindicates cancer-cancer protein interaction ldquo119862-119883rdquo indicatescancer-noncancer protein interaction and ldquo119883-119883rdquo indicatesnoncancer-noncancer protein interaction Figure 2 depictsinput data sources and dataset generation of this research

23 Feature ScoresGeneration Wederived four feature scoresbased on three different approaches which are domain-domain interaction weighted domain frequency and cancerlinker degree For the domain-domain interactions it mayhappen that some relationships are derived fromhomologoussequences which may produce a bias in the 10-fold valida-tion therefore some redundancies (by a certain homologydegree that includes domains) should be removed By extract-ing the homologous PPIs by using the ldquoUniRef50rdquo datasetobtained from the UniProt Reference Clusters (UniRefhttpwwwuniprotorguniref) we found that among the123751 edges derived from BioGrid database only 431 edges(approximately 0348) are homologous PPIs and 86 edgesare cancer PPIs it means that most of the PPIs are not homol-ogous PPIs We removed the 431 PPIs and the remaining123320 PPIs are used in our experiment In addition the 431homologous PPIs comprise 180 proteins in which 179 pro-teins still appear in other nonhomologous PPIs therefore thetotal number of domains that is 3970 remains unchanged

24 One-to-One Domain Interaction Model Assuming thatproteins 119875

119894and 119875

119895contain 119872 and 119873 domains respectively

and then given an interacting protein pair (119875119894 119875

119895) one consid-

ers that there are 119872119873 possible domain pairs The set ofdomain pairs of two proteins 119875

119894and 119875

119895 119878119894119895 is defined by

119878

119894119895= 119878 (119875

119894) times 119878 (119875

119895) (1)

where 119878(119875119894) and 119878(119875

119895) denote sets of protein domains in

proteins 119875119894and 119875

119895 respectively and times denotes the Cartesian

product of two sets 119878(119875119894) and 119878(119875

119895)

To measure the likelihood of a DDI combination a DDIpair interaction matrix 119868 is introduced The element 119868

120572120573

denotes the weighted combination probability of a domainpair (120572 120573) for a given protein pair (119875

119894 119875

119895) and it is given by

119868

120572120573

= sum

(119875119894 119875119895)

1

1003816

1003816

1003816

1003816

119878 (119875

119894)

1003816

1003816

1003816

1003816

lowast

1003816

1003816

1003816

1003816

1003816

119878 (119875

119895)

1003816

1003816

1003816

1003816

1003816

if 120572 in 119878 (119875119894) and 120573 in 119878 (119875

119895)

(2)

where |119878(119875119894)| and |119878(119875

119895)| denote the set sizes of 119878(119875

119894) and

119878(119875

119895) respectively and lowast is the multiplication operation

the summation is over all possible protein pairs of (119875119894 119875

119895)

such that 120572 and 120573 are an element of 119878(119875119894) and 119878(119875

119895)

respectively Subsequently protein domains are randomizedwhile maintaining the number of domain assignments foreach protein the same as the original set The randomizedcounterpart of 119868

120572120573 ⟨119868rand120572120573

⟩ is performed in order to justifythe protein domain pair calculation Then the domain pairscore of the domain pair (120572 120573) 119877

120572120573 is defined by

119877

120572120573=

119868

120572120573

⟨119868

rand120572120573

⟩

(3)

where ⟨119868rand120572120573

⟩ denotes the ensemble average (we randomizedthe data 20 and 40 times and it was found that the resultsconverge after 40 times) of the randomized counterpart of119868

120572120573 This result provides a criterion to rank the domain pairs

If the ratio 119877

120572120573is larger than one then the correlation is

stronger than the randomized counterpart so the domainpair (120572 120573) is a preferred DDI relation

Given the set of domain annotation for any two proteinsone can turn around and compute a score that signifies PPIbased on the set of119877

120572120573values for DDIThis derived PPI score

can answer the question whether any two proteins interact ornot given their domain components The DDI score for theprotein pair (119875

119894 119875

119895) DDI

119894119895 is defined as follows

DDI119894119895= sum

120572isin119878(119875119894)120573isin119878(119875119895)

119877

120572120573if 119877120572120573

gt 1 (4)

where 119878(119875119894) and 119878(119875

119895) denote the set of domains in proteins

119875

119894and 119875119895 respectively

25 Weighted Domain Frequency Score (DFS) The two fea-ture scores (DFS C and DFS X) are defined in this sectionas the variations from the study by Chan [35] Among thetotal of 3970 collected human domain types the numbers


of 381 and 2750 of them appear only in cancer proteins andnoncancer proteins respectively and 839 of them appearin both cancer proteins and noncancer proteins This resultsupports the propensity that certain domain types reside incancer and noncancer proteins

Let 119862 = (119862

1 119862

2 119862

119904) represent the set of cancer

proteins and let 119863119862 = (119889

119862

1 119889

119862

2 119889

119862

119898) be the set of domain

types that appear in the cancer proteins similarly let 119883 =

(119883

1 119883

2 119883

119905) denote the set of noncancer proteins and let

119863

119883= (119889

119883

1 119889

119883

2 119889

119883

119899) be the set of domain types that appear

in noncancer proteins For each domain 120572 let 119862(120572) and119883(120572)denote the numbers of occurrence of domain 120572 in cancerproteins and noncancer proteins respectively A higher scorevalue suggested that the domain has a high propensitywhich resides in cancer or noncancer proteins Then twoweightedDFS values for the protein pair (119875

119894 119875

119895) DFS 119862

119894119895and

DFS 119883119894119895 are defined by the following respectively

DFS 119862119894119895=

sum

120572isin119878(119875119894)119862 (120572) + sum

120573isin119878(119875119895)119862 (120573)

119898 + 119899

(5)

DFS 119883119894119895=

sum

120572isin119878(119875119894)119883 (120572) + sum

120573isin119878(119875119895)119883(120573)

119898 + 119899

(6)

where 119898 and 119899 are the total number of domain types thatappear in cancer and noncancer proteins respectively and119878(119875

119894) and 119878(119875

119895) denote sets of protein domains in proteins 119875

119894

and 119875119895 respectively

The weighted DFS is adopted to measure the propensityof domain occurrence in cancer and noncancer proteins

26 Cancer Linker Degree (CLD) The last feature is namedthe cancer linker degree (CLD) score which was adoptedfrom the model proposed by Aragues et al [4] In organismsproteins interact with each other to form a protein complexin order to perform special functions We can conjecture thecategory of function and the level of activity by observingtheir interaction partners For a given protein pair (119875

119894 119875

119895)

let 119899119862(119875119894 119875

119895) and 119899119883(119875

119894 119875

119895) denote the number of adjacent

cancer proteins and noncancer proteins in PPI respectivelyThen the cancer linker degree score for the protein pair(119875119894 119875

119895) CLD


CLD119894119895=

119899

119862(119875

119894 119875

119895)

119899

119862(119875

119894 119875

119895) + 119899

119883(119875

119894 119875

119895)

(7)

As an illustration an example is presented in Figure 3where 119862

119886and 119862

119887are cancer proteins and 119883

119886 119883119887 119883119888 and

119883

119889are noncancer proteinsThe CLD score represents the interaction ratio for a

specified PPI interacting with a cancer partner If the CLDscore is close to one it implies that the interaction edge isconnecting many cancer nodes and could be located in thecore of the cancer-related protein clusters

27 Data Normalization The four features scores consistof DDI score (DDI

119894119895) weighted domain frequency scores

(DFS 119862119894119895

and DFS 119883119894119895) and cancer linker degree score

CaCb

Xa

Xb Xd

XcPi-Pj

∙ CLDij = 2(2 + 4) = 13

Figure 3 The calculation of score CLD for the protein pair (119875119894 119875

119895)

(CLD119894119895) Due to the fact that the distributions of the numer-

ical values for the above four features are not consistentbetween each other data normalization is needed For thevalue after normalization 119884 is defined by

119884 =

119883

119863119882119862minusmin

maxminusmin

(8)

where 119883119863119882119862 denotes the unnormalized feature value andmax and min are the maximum and minimum values for119883

119863119882119862 respectively

28 Machine Learning Algorithms and Performance StatisticalMeasures Since different choices of machine learning algo-rithms resulted in different predictions of performance weconducted several comprehensive experiments to determinethe optimal combinations of the algorithms Thirty-ninemachine learning algorithms in Weka are discussed in thisstudy Readers may refer to [23] for detailed descriptionsabout these algorithms According to Weka machine learn-ing algorithms are divided into six classification types that isldquoBayesrdquo (6 algorithms) ldquofunctionsrdquo (6 algorithms) ldquoMiscrdquo (2algorithms) ldquolazyrdquo (4 algorithms) ldquorulesrdquo (9 algorithms) andldquotreesrdquo (12 algorithms) A rigorous 10-fold cross validation testis performed to test the classification performance

Six statistical measures are introduced to quantify theprediction performance that is accuracy (ACC) specificity(SPE) sensitivity (SEN) 119865-score (1198651) Matthewrsquos correlationcoefficient (MCC) and positive predictive value (PPV)which are defined in terms of TP TN FP and FN where theydenote true positive true negative false positive and falsenegative events respectivelyTheir definitions are listed in (9)SPE and SEN measure how well a true cancer protein or atrue noncancer protein is identified 1198651 conveys the balancebetween SPE and SEN ACC andMCC provide an integrativemeasure of correct identification PPV is positive predictivefraction In addition the AUC (area under the curve) scorewhich provides a global performance evaluation is alsoincluded

ACC =

TP + TNTP + TN + FP + FN

SPE = TNTN + FP


SEN =

TPTP + FN

119865

1=

2 times SPE times SENSPE + SEN

times 100

MCC

=

(TP lowast TN) minus (FN lowast FP)radic(TP + FN) lowast (TN + TP) lowast (TP + FP) lowast (TN + FN)

PPV =

TPTP + FP

(9)

3 Results

A total of 123320 PPIs which are composed of 15214 cancerPPIs and 108106 noncancer PPIs 2863 cancer proteins and3970 domains were used in our experiment A 10-fold crossvalidation test was conducted to determine the optimalthreshold settings for each classifier We assumed the ldquo119862-119862rdquo type PPI as a positive set and the rest as a negativeset According to our previous work [30] the balancedtrained dataset usually has better performance than theunbalanced one hence the algorithms are trained with anequal size ratio of 1 1 for the positive and negative datasetSince the sizes of the original positive and negative setsdiffer by a factor of about 6 (unbalanced learning set) togenerate a balanced learning set the 15214 positive targetinteractions (cancer PPIs) were kept and a total of 15214noncancer PPIs were randomly selected from the negativeset Later on the above-mentioned seven statistical measuresare determined For comparison the corresponding resultsof unbalanced dataset are listed in Supplementary File 1Appendix Tables S1 to S5 in Supplementary Material avail-able online at httpdxdoiorg1011552015312047 wherethe performance of MCC and PPV is much worse due tothe very large TN and very small TP Therefore the use ofbalanced datasets is more preferable

31 Performance Comparison by Individual Machine LearningAlgorithm and Voting with the Majority The performancecomparison for the individual algorithm is listed in Table 1The LMT algorithm of the ldquotreesrdquo type achieved the highestACC (0772) 1198651 (0774) and MCC (0548) among the39 algorithms Interestingly according to either ACC 1198651or MCC the top six algorithms (LMT SimpleCart J48J48graft REPTree and FT) are all of the ldquotreesrdquo type On theother hand the LWL algorithm of the ldquolazyrdquo type the VFIalgorithm of the ldquomiscrdquo type the ConjunctiveRule algorithmof the ldquorulesrdquo type and the DecisionStump algorithm of theldquotreesrdquo type achieved the highest SPE and PPV (1000) andthe Nnge algorithm of the ldquorulesrdquo type achieved the highestSEN (0858) while the Ridor algorithm of the ldquorulesrdquo typeachieved the highest AUC (0780) We also tried to combinethe subsets of the four features for predicting the cancerproteins but the predictive performance is not markedlyimproved

The individual classifier has its own strengths and weak-nesses therefore it is inspired to integratemultiple classifiersthat is voting with the majority system to improve theclassification performance Thirty-nine machine learningalgorithms in Weka were selected and integrated usingvarious types of votingwith themajority system In this studythe voting with the majority system involved an odd numberof algorithms For example top 5 in Table 2 represents com-bining the top 5 algorithms extracted from Table 1 which areLMT SimpleCart J48 J48graft and REPTree Performancecomparison by voting with the majority is listed in Table 2The best performance in voting is attained when the top 23algorithms are selected which have the highest 1198651 (0786)and PPV (0890) When comparing Table 1 with Table 2except SEN the top voting with the majority system (top23) is better than top individual algorithm (LMT) in theother six performance measures and it also outperformedall thirty-nine individual algorithm in ACC 1198651 MCC andAUCWe also noted that the best performance of voting withthemajority system (top 23) delivered lower FP events that is186 than those of the top three individual algorithms whichare 303 297 and 307 respectively In other words the votingapproach does not introduce spurious events

32 Performance Comparison with Group Voting with theMajority Performance comparison by voting with themajority for each group is listed in Table 3 (Misc type isomitted here because it contains only two algorithms) Forinstance under the ldquofunctionsrdquo type the ldquoTop 3rdquo classifierin Table 3 represents the combination of ldquofunctionsrdquo typealgorithms that is MultilayerPerceptron Logistic and Sim-pleLogistic algorithms (see Table 1)The results indicated thatldquotreesrdquo type achieved the highest ACC (0781) SEN (0749)1198651 (0786) MCC (0567) MCC (0513) and AUC (0788)while the ldquorulesrdquo type had the highest SPE (0838) and ldquolazyrdquotype achieved the highest PPV (0883)

For clarity Figure 4 illustrates the majority voting resultsfor various types of algorithms the results suggest that ldquotreesrdquotype algorithms perform better in most of the performancemeasures

33 Performance Comparison with the Competing StudySince there are four features that are considered in this studyit is necessary to study their significance in classificationperformance To study the prediction performance of the fourfeatures individually we evaluated the feature importanceby the area under the curve (AUC) value [36 37] Featureswith a higher AUC score are ranked as more important thanfeatures with a low score The results of AUC values for thefour features are given in Table 4 The DFS C feature ranksat the top in AUC value while the DFS X feature has thelowest AUC value These results suggest that DFS C featurehas the greatest discrimination information between positivedatasets and negative datasets

To demonstrate the effectiveness of the present studywe compared our results with the work by Aragues et al[4] which uses the CLD feature only As shown in Table 5and Figure 5 for a single classifier our method achievedbetter performance than Araguesrsquos in all seven performance


Table 1 Performance comparison for the individual algorithm sorted by 1198651 value

Type Algorithm ACC SPE SEN 1198651 MCC PPV AUCTrees LMT 0772 0802 0748 0774 0548 0821 0774Trees SimpleCart 0770 0804 0742 0773 0546 0825 0775Trees J48 0767 0799 0741 0770 0538 0818 0771Trees J48graft 0767 0799 0742 0769 0538 0818 0771Trees REPTree 0766 0796 0741 0767 0536 0818 0767Trees FT 0763 0798 0735 0766 0528 0821 0766Rules DTNB 0760 0804 0728 0763 0527 0833 0765Trees NBTree 0760 0796 0733 0763 0527 0819 0764Trees RandomForest 0760 0777 0744 0761 0524 0791 0761Rules Ridor 0718 0910 0650 0758 0492 0954 0780Rules Jrip 0754 0790 0726 0757 0512 0817 0757Rules DecisionTable 0752 0790 0722 0756 0509 0817 0757Rules PART 0744 0758 0745 0748 0494 0754 0751Lazy Kstar 0744 0766 0725 0744 0491 0784 0744Functions MultilayerPerceptron 0724 0792 0683 0734 0462 0835 0739Trees LADTree 0720 0762 0696 0727 0447 0788 0729Trees RandomTree 0721 0721 0720 0721 0440 0723 0721Lazy LWL 0610 1000 0560 0720 0350 1000 0780Misc VFI 0610 1000 0560 0720 0350 1000 0780Rules ConjunctiveRule 0610 1000 0560 0720 0350 1000 0780Trees DecisionStump 0610 1000 0560 0720 0350 1000 0780Trees ADTree 0709 0762 0685 0719 0433 0787 0724Bayes BayesNet 0714 0755 0683 0717 0431 0796 0719Lazy IB1 0713 0716 0711 0713 0427 0718 0713Lazy Ibk 0713 0716 0711 0713 0427 0718 0713Functions Logistic 0669 0652 0692 0672 0342 0606 0673Bayes BayesianLogisticRegression 0670 0655 0687 0671 0342 0622 0673Functions SimpleLogistic 0669 0652 0691 0670 0342 0605 0672Functions SMO 0665 0641 0699 0667 0334 0580 0670Functions VotedPerceptron 0642 0606 0721 0657 0305 0467 0664Rules Nnge 0522 0511 0858 0641 0128 0052 0686Rules OneR 0635 0629 0641 0635 0269 0610 0634Functions RBFNetwork 0624 0603 0655 0627 0254 0529 0629Bayes NaiveBayes 0598 0571 0660 0614 0214 0410 0615Bayes NaiveBayesUpdateable 0598 0571 0660 0614 0214 0410 0615Bayes NaiveBayesSimple 0598 0571 0660 0613 0214 0411 0614Bayes NaiveBayesMultinomial 0576 0554 0634 0590 0168 0366 0594Misc HyperPipes 0570 0568 0579 0571 0143 0534 0572Rules ZeroR 0510 0510 0510 0510 0010 0510 0510

measures From Tables 1 and 5 we can see that the best singleclassifier of the current method (LMT) outperforms the bestsingle classifier of Araguesrsquos method (SimpleCart) in all sevenperformance measures the difference value of ACC is 119

SPE is 151 SEN is 96 1198651 is 121 MCC is 245 PPV is174 and AUC is 121

From Tables 5 and 6 we can see that using the CLDfeature only the best performance in voting is attained when


Table 2 Performance comparison by voting with the majority sorted by 1198651 value

Classifiers ACC SPE SEN 1198651 MCC PPV AUCTOP 23 0774 0855 0721 0786 0562 0890 0787TOP 11 0781 0829 0744 0785 0569 0855 0787TOP 13 0781 0827 0746 0785 0567 0851 0786TOP 9 0783 0820 0751 0784 0568 0844 0786TOP 19 0776 0841 0734 0784 0566 0872 0787TOP 21 0775 0856 0723 0784 0564 0889 0789TOP 25 0775 0855 0723 0784 0562 0888 0789TOP 27 0774 0847 0727 0784 0562 0879 0787TOP 17 0779 0826 0742 0783 0565 0852 0784TOP 7 0780 0819 0748 0782 0563 0840 0784TOP 15 0779 0824 0742 0782 0564 0852 0785TOP 29 0773 0836 0730 0780 0556 0867 0783TOP 3 0777 0811 0749 0779 0558 0831 0780TOP 5 0776 0811 0748 0779 0556 0830 0780TOP 31 0775 0827 0738 0777 0556 0853 0783TOP 33 0774 0819 0738 0776 0552 0847 0778TOP 35 0771 0811 0738 0773 0545 0837 0775TOP 37 0768 0802 0739 0770 0540 0826 0772TOP 39 0768 0802 0739 0770 0541 0826 0771

Table 3 Performance comparison by voting with the majority for five classification types

Type Classifiers ACC SPE SEN 1198651 MCC PPV AUC

Bayes TOP 3 0672 0653 0695 0672 0346 0609 0673TOP 5 0599 0571 0660 0614 0215 0410 0614

Functions TOP 5 0673 0653 0700 0676 0349 0600 0676TOP 3 0669 0652 0692 0672 0342 0606 0673

Lazy TOP 3 0743 0837 0688 0755 0504 0883 0763

Rules

TOP 5 0764 0825 0721 0769 0537 0856 0774TOP 7 0764 0826 0721 0769 0537 0857 0774TOP 9 0765 0817 0726 0769 0534 0848 0771TOP 3 0755 0838 0705 0766 0528 0877 0771

Trees

TOP 11 0781 0831 0744 0786 0567 0859 0788TOP 9 0780 0820 0748 0783 0563 0842 0785TOP 7 0779 0816 0748 0781 0561 0838 0782TOP 3 0777 0811 0749 0779 0558 0831 0780TOP 5 0776 0811 0748 0779 0556 0830 0780

Table 4 The AUC value of the four features

Feature AUC RankDFS C 0677 1CLD 0651 2DDI 0546 3DFS X 0526 4

the top 3 algorithms (SimpleCart REPTree and FT) areselected using the CLD feature only but the performance of

voting with the majority is approximately equal to that of theindividual algorithm As shown in Table 6 and Figure 6 theproposed method significantly outperformed Araguesrsquos in allseven performancemeasures FromTables 2 and 6 we can seethat the best voting with the majority of the current method(top 23) outperforms the best voting with the majority ofAraguesrsquos method (top 3) in all seven performance measuresthe difference value of ACC is 122 SPE is 202 SEN is69 1198651 is 134 MCC is 258 PPV is 240 and AUCis 134 These results suggest that the proposed method issuperior to the competing study


Top 3 Top 5 Top 7 Top 9 Top 11Classifiers

BayesFunctionsLazy

MiscRulesTrees

04

06

08

1

AUC


0506070809

1

SPE

05

06

07

08


ACC


05

06

07

08

SEN


06

07

08

F1


02

04

06

MCC


04

06

08

1PP

V

Figure 4 The seven performance measures for group voting with the majority

4 Discussion

In this section we designed two case studies to demonstratethe performance of the proposedmethod and showed how todiscover potential cancer proteins respectively

41 Case Study 1 To investigate the performance of theproposed method we retrieved lung cancer protein datafrom OMIM and HLungDB databases Among the 2599experimentally confirmed lung cancer proteins there are atotal of 1302 cancer proteins not appearing in our originaltraining dataset which could be used as the independent

test dataset List of the 1302 cancer proteins can be found inSupplementary File 2 The LWL VFI ConjunctiveRule andDecisionStump were excluded from Case Study 1 because oftheir intrinsically high PPV which may bias the performanceestimation Consequently as shown in Table 7 the hit num-ber and hit ratio denote how many cancer proteins are truepositive events and true positive ratios respectively Mostof the algorithms had a hit ratio remarkably consistent over75 especially the Ridor algorithm which achieved 894Compared with classifiers with higher ranks in 1198651 (Table 1)the results appear to suggest that classifiers with high PPVachieve better hit ratios


Table 5 Performance comparison for the individual algorithm using the CLD feature sorted by 1198651

Type Algorithm ACC SPE SEN 1198651 MCC PPV AUCTrees SimpleCart 0653 0651 0652 0653 0303 0647 0653Trees REPTree 0650 0650 0649 0650 0300 0649 0650Trees FT 0645 0649 0641 0646 0288 0658 0646Rules DecisionTable 0642 0650 0639 0644 0288 0663 0644Rules DTNB 0642 0650 0639 0644 0288 0663 0644Trees NBTree 0643 0648 0639 0644 0287 0661 0644Bayes BayesNet 0642 0647 0640 0643 0286 0657 0643Trees ADTree 0642 0644 0641 0643 0283 0646 0643Rules Ridor 0614 0659 0639 0642 0257 0635 0647Trees LADTree 0642 0648 0641 0642 0287 0653 0644Trees LMT 0639 0656 0625 0640 0280 0693 0640Rules PART 0639 0655 0625 0639 0278 0695 0639Trees J48 0639 0655 0627 0639 0280 0689 0641Trees J48graft 0639 0655 0627 0639 0280 0689 0641Rules Jrip 0637 0639 0638 0638 0277 0634 0638Rules OneR 0635 0629 0641 0635 0269 0610 0634Functions VotedPerceptron 0611 0579 0697 0632 0248 0392 0638Lazy LWL 0612 0582 0683 0629 0246 0431 0632Trees DecisionStump 0612 0582 0684 0629 0246 0428 0632Rules ConjunctiveRule 0613 0583 0681 0628 0245 0437 0631Bayes NaiveBayes 0619 0595 0656 0623 0242 0496 0627Bayes NaiveBayesSimple 0619 0595 0656 0623 0242 0496 0627Bayes NaiveBayesUpdateable 0619 0595 0656 0623 0242 0496 0627Trees RandomForest 0623 0616 0631 0623 0247 0591 0624Lazy Ibk 0622 0613 0632 0622 0242 0586 0622Trees RandomTree 0622 0613 0632 0622 0242 0586 0622Bayes BayesianLogisticRegression 0621 0612 0630 0621 0240 0583 0621Functions Logistic 0621 0612 0631 0621 0241 0582 0621Functions SimpleLogistic 0621 0612 0633 0621 0241 0578 0621Functions SMO 0619 0607 0640 0621 0245 0549 0622Lazy Kstar 0619 0606 0640 0620 0242 0547 0623Functions MultilayerPerceptron 0618 0598 0649 0620 0242 0518 0621Functions RBFNetwork 0618 0598 0647 0620 0240 0517 0620Lazy IB1 0594 0564 0683 0619 0214 0351 0624Rules Nnge 0544 0601 0529 0562 0106 0832 0564Misc HyperPipes 0510 0510 0510 0510 0010 0510 0510Rules ZeroR 0510 0510 0510 0510 0010 0510 0510Bayes NaiveBayesMultinomial 0500 0561 0447 0480 minus0008 0489 0569Misc VFI 0500 NA 0500 NA NA 1000 NA

42 Case Study 2 It is known that interacting proteins areoften coexpressed one can identify differentially expressedgenes (DEGs) among a large number of gene expressions andunderstand themechanism of lung cancer formation inducedby these DEGs [38] We further explored the potential cancer

genes fromDEGs inmicroarray data Four sets of lung cancermicroarray data were downloaded from the GEO database[39] and summarized in Table 8 Experiments GSE7670 [40]andGSE10072 [41] use theHG-U133A array whereGSE19804[42] and GSE27262 [43] use HG-U133 plus 20 chip


0405060708

ACC

040608

1

SPE

040506070809

SEN

040506070809

F1

0020406

MCC

02040608

1

PPV

0010203040506070809

1

AD

Tree

Baye

sianL

ogist

icRe

gres

sion

Baye

sNet

Con

junc

tiveR

ule

Dec

ision

Stum

pD

ecisi

onTa

ble

DTN

B FTH

yper

Pipe

sIB

1Ib

kJ4

8J4

8gra

ftJri

pKs

tar

LAD

Tree

LMT

Logi

stic

LWL

Mul

tilay

erPe

rcep

tron

Nai

veBa

yes

Nai

veBa

yesM

ultin

omia

lN

aive

Baye

sSim

ple

Nai

veBa

yesU

pdat

eabl

eN

BTre

eN

nge

One

RPA

RTRa

ndom

Fore

stRa

ndom

Tree

RBFN

etw

ork

REPT

ree

Rido

rSi

mpl

eCar

tSi

mpl

eLog

istic

SMO

VFI

Vote

dPer

cept

ron

Zero

R

AUC

CLDFour features

Figure 5 The performance comparison of the individual algorithm for Aragues (blue) and the proposed method (red)

The tested DEGs are collected from the intersection setof the above four microarray datasets Among the 1345common DEGs in the four microarray datasets 360 DEGswere excluded because of their appearance in the originaltraining set another 209 DEGs are also removed due to theirlacking of domain data or PPI dataThe remaining 776 DEGsserve as input data

Five classifiers including the top three classifiers accord-ing to the 1198651measure LMT SimpleCart and J48 algorithmsas well as the top one classifier according to PPV measureLWL algorithm along with the top one classifier accordingto Case Study 1 Ridor algorithm were selected for evaluatingpotential cancer genes under strictly uniformed voting thatis only the one with five votes which all five classifiers


Table 6 Performance comparison by voting with the majority using the CLD feature sorted by 1198651


predict as a cancer protein was considered Among the 776DEGs a total of 565 DEGs (728 a higher value can beobtained if we relax the strictly uniformed requirement)were evaluated as potential cancer genes The complete 565potential cancer genes derived from the second case study arelisted in Supplementary File 3

To validate our findings we conducted a study of theliterature by randomly selecting five genes (DBF4 MCM2ID3 EXOSC4 and CDKN3) from the 565 potential cancergenes The results indicated that while EXOSC4 remainsunclear the others are in agreement with our predictionsBarkley proposed that miR-29a targets the 3-UTR of DBF4mRNA in lung cancer cells [44] and Bonte stated thatmost cell lines with increased Cdc7 protein levels also hadincreased DBF4 abundance and some tumor cell lines hadextra copies of the DBF4 gene [45] Alexandrow noted thatStat3-P and the proliferative markers MCM2 were expressedin mice lung tissues in vivo [46] Yang et al observed thatpatientswith higher levels ofMCM2and gelsolin experiencedshorter survival time than patients with low levels of MCM2and gelsolin [47] Langenfeld et al [48] indicated that Oct4cells give rise to lung cancer cells expressing nestin andorNeuN and BMP signaling is an important regulator of ID1and ID3 in both Oct4 and nestin cell populations Tangclaimed that CDKN3 has significant biological implicationsin tumor pathogenesis [49] In [50] a metasignature wasidentified in eight separate microarray analyses spanningseven types of cancer including lung adenocarcinoma andthese included many genes associated with cell proliferationand CDKN3 is among them

Given a single protein as testing data we can first treatthis protein as 119883 If both PPIs and domains information areavailable one can then apply the present method to classifythe interaction type ldquo119862-119883rdquo

Any ldquo119862-119883rdquo type is classified as ldquo119862-119862rdquo and then thereare two possible explanations for this (i) the classifier is notcompletely specific therefore one has FP prediction and(ii) prediction is a TP event If one can exclude the firstexplanation option then the present calculation provides apotential way to assign 119883 as 119862 in other words it provides afeasible solution for predicting cancer proteins

If the PPI information is missing given the FASTAsequence information one can make use of the STRINGdatabase [51] STRING is a database that provides known andpredicted PPI derived from four sources genomic contexthigh-throughput experiments conserved coexpression andpublished literature On domain prediction one can carry outthe analysis by using the online tool ldquoSEQUENCE SEARCHrdquounder PFam [34] to find matching domains Then given thePPIs and domains information one can conduct the sameanalysis as described in the last paragraph otherwise one isfacing a difficult task which requires further discussion orwork

5 Conclusion

Identifying cancer protein is a critical issue in treatingcancer however identifying cancer protein experimentallyis extremely time consuming and labor-intensive Alternative


06

07

08

09AC

C

06

07

08

09

SPE

06

07

08

SEN

06

07

08

09

F1

02

03

04

05

06

MCC

06

07

08

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

AUC

CLDFour features

06

07

08

09

1

PPV

Figure 6 The performance comparison of the voting with the majority for Aragues (blue) and the proposed method (red)

methods must be developed to discover cancer proteins Wehave integrated several proteomic data sources to develop amodel for predicting cancer protein-cancer protein interac-tions on a global scale based on domain-domain interactionsweighted domain frequency score and cancer linker degreeA one-to-one interaction model was introduced to quantifythe likelihood of cancer-specific DDI The weighted DFS isadopted to measure the propensity of domain occurrence incancer and noncancer proteins Finally the CLD is defined togauge cancer and noncancer proteinsrsquo interaction partnersAs a result voting with a majority system achieved ACC(0774) SPE (0855) SEN (0721) 1198651 (0786) MCC (0562)PPV (0890) and AUC (0787) when the top 23 algorithmswere selected which is better than the best single classifier(LMT) in six performance measures except SEN

We compared our performance with the previous work[4] It was shown that the present approach outperformed

Araguesrsquos in all seven performance measures in both indi-vidual algorithm and combining algorithms Effectivenessof the current research is further evaluated by two inde-pendent datasets experimental results demonstrated thatthe proposed method can identify cancer proteins withhigh hit ratios The current research not only significantlyimproves the prediction performance of cancer proteins butalso discovered some potential cancer proteins for futureexperimental investigation It is anticipated that the currentresearch could provide some insight into diseasemechanismsand diagnosis

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper


Table 7 Performance evaluation for OMIM and HLungDB data-sets

Type Algorithm Hitnumber

Hitratio

Rules Ridor 1164 0894Rules ZeroR 1159 0890Trees ADTree 1119 0859Misc HyperPipes 1115 0856Functions MultilayerPerceptron 1076 0826Trees LADTree 1061 0815Rules OneR 1047 0804Lazy IB1 1023 0786Lazy IBk 1023 0786Bayes BayesNet 1020 0783Rules PART 1019 0783Trees J48graft 1018 0782Trees J48 1017 0781Rules DecisionTable 1011 0776Trees FT 1004 0771Lazy KStar 999 0767Bayes BayesianLogisticRegression 998 0767Trees NBTree 995 0764Rules DTNB 991 0761Trees RandomTree 988 0759Trees LMT 983 0755Trees REPTree 975 0749Trees SimpleCart 973 0747Trees RandomForest 973 0747Rules JRip 966 0742Functions Logistic 964 0740Functions SimpleLogistic 964 0740Functions SMO 944 0725Functions RBFNetwork 925 0710Functions VotedPerceptron 844 0648Bayes NaiveBayesSimple 827 0635Bayes NaiveBayes 826 0634Bayes NaiveBayesUpdateable 826 0634Bayes NaiveBayesMultinomial 796 0611Rules NNge 94 0072

Acknowledgments

Thework of Chien-HungHuang is supported by theMinistryof Science and Technology of Taiwan (MOST) and NationalFormosa University under Grants nos NSC 101-2221-E-150-088-MY2 and EN103B-1005 respectively The work of Ka-Lok Ng is supported by MOST and Asia University underGrants nos NSC 102-2632-E-468-001-MY3 and 103-asia-06

Table 8 Summary of microarray datasets

GEO ID Organization name Number of DEGs

GSE7670 Taipei Veterans GeneralHospital 1874

GSE10072 National Cancer InstituteNIH 3138

GSE19804 National Taiwan University 5398

GSE27262 National Taiwan Yang MingUniversity 8476

1345 (intersection)

respectively The authors thank Yi-Wen Lin and NilubonKurubanjerdjit for their contributions to implementationassistance and providing the Case Study 1 dataset

References

[1] S H Nagaraj and A Reverter ldquoA Boolean-based systemsbiology approach to predict novel genes associated with cancerapplication to colorectal cancerrdquo BMC Systems Biology vol 5article 35 2011

[2] Z Li B-Q Li M Jiang et al ldquoPrediction and analysisof retinoblastoma related genes through gene ontology andKEGGrdquo BioMed Research International vol 2013 Article ID304029 8 pages 2013

[3] R Hosur J Xu J Bienkowska and B Berger ldquoIWRAP aninterface threading approach with application to prediction ofcancer-related protein-protein interactionsrdquo Journal of Molecu-lar Biology vol 405 no 5 pp 1295ndash1310 2011

[4] R Aragues C Sander and B Oliva ldquoPredicting cancer involve-ment of genes from heterogeneous datardquo BMC Bioinformaticsvol 9 article 172 2008

[5] L Hakes J W Pinney D L Robertson and S C LovellldquoProtein-protein interaction networks and biologymdashwhatrsquos theconnectionrdquo Nature Biotechnology vol 26 no 1 pp 69ndash722008

[6] C von Mering R Krause B Snel et al ldquoComparative assess-ment of large-scale data sets of protein-protein interactionsrdquoNature vol 417 no 6887 pp 399ndash403 2002

[7] B Schuster-Bockler and A Bateman ldquoProtein interactions inhuman genetic diseasesrdquo Genome Biology vol 9 no 1 articleR9 2008

[8] R Sharan S Suthram R M Kelley et al ldquoConserved patternsof protein interaction in multiple speciesrdquo Proceedings of theNational Academy of Sciences of the United States of Americavol 102 no 6 pp 1974ndash1979 2005

[9] C-C Chen C-Y Lin Y-S Lo and J-M Yang ldquoPPISearch aweb server for searching homologous protein-protein interac-tions acrossmultiple speciesrdquoNucleic Acids Research vol 37 no2 pp W369ndashW375 2009

[10] K I Goh M E Cusick D Valle B Childs M Vidal and A-L Barabasi ldquoThe human disease networkrdquo Proceedings of theNational Academy of Sciences of the United States of Americavol 104 no 21 pp 8685ndash8690 2007

[11] T Ideker and R Sharan ldquoProtein networks in diseaserdquo GenomeResearch vol 18 no 4 pp 644ndash652 2008

[12] G Kar A Gursoy and O Keskin ldquoHuman cancer protein-protein interaction network a structural perspectiverdquo PLoSComputational Biology vol 5 no 12 Article ID e1000601 2009


[13] J-S Chen W-S Hung H-H Chan S-J Tsai and H SunnySun ldquoIn silico identification of oncogenic potential of fyn-related kinase in hepatocellular carcinomardquo Bioinformatics vol29 no 4 pp 420ndash427 2013

[14] P F Jonsson and P A Bates ldquoGlobal topological features ofcancer proteins in the human interactomerdquo Bioinformatics vol22 no 18 pp 2291ndash2297 2006

[15] T Clancy E A Roslashdland S Nygard and E Hovig ldquoPredictingphysical interactions between protein complexesrdquo Molecularand Cellular Proteomics vol 12 no 6 pp 1723ndash1734 2013

[16] D P Ryan and J M Matthews ldquoProtein-protein interactions inhuman diseaserdquo Current Opinion in Structural Biology vol 15no 4 pp 441ndash446 2005

[17] J Xu and Y Li ldquoDiscovering disease-genes by topologicalfeatures in human protein-protein interaction networkrdquo Bioin-formatics vol 22 no 22 pp 2800ndash2805 2006

[18] A Platzer P Perco A Lukas and BMayer ldquoCharacterization ofprotein-interaction networks in tumorsrdquo BMC Bioinformaticsvol 8 article 224 2007

[19] M Oti B Snel M A Huynen and H G Brunner ldquoPredictingdisease genes using protein-protein interactionsrdquo Journal ofMedical Genetics vol 43 no 8 pp 691ndash698 2007

[20] J Goni F J Esteban N V de Mendizabal et al ldquoA com-putational analysis of protein-protein interaction networksin neurodegenerative diseasesrdquo BMC Systems Biology vol 2article 52 2008

[21] Y-L Lee J-WWengW-C Chiang et al ldquoInvestigating cancer-related proteins specific domain interactions and differentialprotein interactions caused by alternative splicingrdquo in Proceed-ings of the 11th IEEE International Conference on Bioinformaticsand Bioengineering (BIBE rsquo11) pp 33ndash38 Taichung TaiwanOctober 2011

[22] K L Ng J S Ciou and C H Huang ldquoPrediction of proteinfunctions based on function-function correlation relationsrdquoComputers in Biology and Medicine vol 40 no 3 pp 300ndash3052010

[23] H Ian and E F Witten Data Mining Practical Machine Learn-ing Tools and Techniques Morgan Kaufmann San FranciscoCalif USA 2nd edition 2005

[24] Y-D Cai L Lu L Chen and J-F He ldquoPredicting subcellu-lar location of proteins using integrated-algorithm methodrdquoMolecular Diversity vol 14 no 3 pp 551ndash558 2010

[25] C R Peng L Liu B Niu et al ldquoPrediction of RNA-Bindingproteins by voting systemsrdquo Journal of Biomedicine and Biotech-nology vol 2011 Article ID 506205 8 pages 2011

[26] C-Y Hor C-B Yang Z-J Yang and C-T Tseng ldquoPredictionof protein essentiality by the support vector machine withstatistical testsrdquo Evolutionary Bioinformatics Online vol 9 pp387ndash416 2013

[27] S K Dhanda D Singla A K Mondal and G P S RaghavaldquoDrugMint a webserver for predicting and designing of drug-like moleculesrdquo Biology Direct vol 8 no 1 article 28 2013

[28] T Wilhelm ldquoPhenotype prediction based on genome-wideDNA methylation datardquo BMC Bioinformatics vol 15 no 1article 193 2014

[29] C-H Huang S-Y Chou and K-L Ng ldquoImproving proteincomplex classification accuracy using amino acid compositionprofilerdquo Computers in Biology and Medicine vol 43 no 9 pp1196ndash1204 2013

[30] N Kurubanjerdjit C-H Huang Y-L Lee J J P Tsai and K-L Ng ldquoPrediction of microRNA-regulated protein interaction

pathways in Arabidopsis using machine learning algorithmsrdquoComputers in Biology and Medicine vol 43 no 11 pp 1645ndash1652 2013

[31] Z-C Li Y-H Lai L-L Chen Y Xie Z Dai and X-Y ZouldquoIdentifying functions of protein complexes based on topologysimilaritywith random forestrdquoMolecular BioSystems vol 10 no3 pp 514ndash525 2014

[32] A Chatr-Aryamontri B-J Breitkreutz S Heinicke et al ldquoTheBioGRID interaction database 2013 updaterdquo Nucleic AcidsResearch vol 41 no 1 pp D816ndashD823 2013

[33] L Wang Y Xiong Y Sun et al ldquoHlungDB an integrated data-base of human lung cancer researchrdquo Nucleic Acids Researchvol 38 no 1 Article ID gkp945 pp D665ndashD669 2009

[34] R D Finn A Bateman J Clements et al ldquoPfam the proteinfamilies databaserdquo Nucleic Acids Research vol 42 no 1 ppD222ndashD230 2014

[35] H H Chan Identification of novel tumor-associated gene (TAG)by bioinformatics analysis [MS thesis] National Cheng KungUniversity Tainan City Taiwan 2006

[36] E Alpaydin Introduction to Machine Learning The MIT PressCambridge Mass USA 2nd edition 2011

[37] I R Galatzer-Levy K-I Karstoft A Statnikov andA Y ShalevldquoQuantitative forecasting of PTSD from early trauma responsesamachine Learning applicationrdquo Journal of Psychiatric Researchvol 59 pp 68ndash76 2014

[38] C-H Huang M-HWu P M-H Chang C-Y Huang and K-L Ng ldquoIn silico identification of potential targets and drugs fornon-small cell lung cancerrdquo IET Systems Biology vol 8 no 2pp 56ndash66 2014

[39] T Barrett S EWilhite P Ledoux et al ldquoNCBIGEO archive forfunctional genomics data setsmdashupdaterdquoNucleic Acids Researchvol 41 no 1 pp D991ndashD995 2013

[40] L J Su C W Chang Y C Wu et al ldquoSelection of DDX5 asa novel internal control for Q-RT-PCR from microarray datausing a block bootstrap re-sampling schemerdquo BMC Genomicsvol 8 article 140 2007

[41] M T Landi T Dracheva M Rotunno et al ldquoGene expressionsignature of cigarette smoking and its role in lung adenocar-cinoma development and survivalrdquo PLoS ONE vol 3 no 2Article ID e1651 2008

[42] T-P Lu M-H Tsai J-M Lee et al ldquoIdentification of a novelbiomarker SEMA5A for non-small cell lung carcinoma innonsmoking womenrdquo Cancer Epidemiology Biomarkers andPrevention vol 19 no 10 pp 2590ndash2597 2010

[43] T-Y W Wei C-C Juan J-Y Hisa et al ldquoProtein argininemethyltransferase 5 is a potential oncoprotein that upregulatesG1 cyclinscyclin-dependent kinases and the phosphoinositide3-kinaseAKT signaling cascaderdquo Cancer Science vol 103 no 9pp 1640ndash1650 2012

[44] L R Barkley and C Santocanale ldquoMicroRNA-29a regulatesthe benzo[a]pyrene dihydrodiol epoxide-inducedDNAdamageresponse through Cdc7 kinase in lung cancer cellsrdquo Oncogene-sis vol 2 article e57 2013

[45] D Bonte C Lindvall H Liu K Dykema K Furge andM Weinreich ldquoCdc7-Dbf4 kinase overexpression in multiplecancers and tumor cell lines is correlated with p53 inactivationrdquoNeoplasia vol 10 no 9 pp 920ndash931 2008

[46] M G Alexandrow L J Song S Altiok J Gray E B Haura andN B Kumar ldquoCurcumin a novel Stat3 pathway inhibitor forchemoprevention of lung cancerrdquo European Journal of CancerPrevention vol 21 no 5 pp 407ndash412 2012


[47] J Yang N Ramnath K B Moysich et al ldquoPrognostic signif-icance of MCM2 Ki-67 and gelsolin in non-small cell lungcancerrdquo BMC Cancer vol 6 article 203 2006

[48] E Langenfeld M Deen E Zachariah and J Langenfeld ldquoSmallmolecule antagonist of the bone morphogenetic protein type Ireceptors suppresses growth and expression of Id1 and Id3 inlung cancer cells expressing Oct4 or nestinrdquoMolecular Cancervol 12 no 1 article 129 2013

[49] H Tang G Xiao C Behrens et al ldquoA 12-gene set predictssurvival benefits from adjuvant chemotherapy in non-small celllung cancer patientsrdquoClinical Cancer Research vol 19 no 6 pp1577ndash1586 2013

[50] D M MacDermed N N Khodarev S P Pitroda et alldquoMUC1-associated proliferation signature predicts outcomes inlung adenocarcinoma patientsrdquo BMCMedical Genomics vol 3article 16 2010

[51] A Franceschini D Szklarczyk S Frankild et al ldquoSTRING v91protein-protein interaction networks with increased coverageand integrationrdquoNucleic Acids Research vol 41 no 1 pp D808ndashD815 2013

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of


Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology


Molecular Biology International

GenomicsInternational Journal of


The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014


BioinformaticsAdvances in

Marine BiologyJournal of



Signal TransductionJournal of


BioMed Research International

Evolutionary BiologyInternational Journal of



Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Genetics Research International


Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational



Enzyme Research



Microbiology


[9] The recent availability of PPI data has made it possibleto study human disease at a system level It is reported thatsince disease genes exhibit an increased tendency for theirprotein products to interact with one another they tendto be coexpressed in specific tissues and display coherentfunctions [10] Ideker and Sharan reported in a review article[11] on the applications of PPI networks to study diseasein four major areas (i) identifying new disease genes (ii)studying their network properties (iii) identifying disease-related subnetworks and (iv) performing network-baseddisease classification Another study [12] investigated thehuman cancer PPI network from a structural perspectivethat is protein interactions through their interfaces Theirfindings indicated that cancer-related proteins have smallermore planar more charged and less hydrophobic bindingsites than noncancer proteins

It is known that proteins are composed of multiple func-tional domains A domain is a unit of function associatedwith different catalytic functions or binding sites as foundin enzymes or regulatory proteins It is hypothesized thatcancer proteins also known as tumor associated genesmay share common functional domains [13] and thus aweighted domain score for each tumor associated genersquosdomain is determined Novel cancer proteins are determinedby translating full cDNA sequences to the corresponding pro-tein sequences and calculating the weighted domain scoresAnother work [14] used established methods to identifythe network topology of a cancer protein network Theyshowed that cancer proteins contain a high ratio of structuraldomains which have a high propensity for mediating proteininteractions Recently Clancy et al designed a statisticalmethod to infer the physical interactions between two com-plexes for the human and yeast species [15] Domains suchas the immunoglobulin domain Zinc-finger and the proteinkinase domains are the top three most frequently observedcancer protein domains Many other works also employedPPI and DDI to characterize disease networks [7 16ndash20] Inour previous works [21 22] a one-to-one DDI model wasproposed to obtain specific sets of DDI for oncoproteins andtumor suppressor proteins respectivelyThree specific sets ofDDI that is oncoprotein and oncoprotein tumor suppressorprotein and tumor suppressor protein and oncoprotein andtumor suppressor protein are derived from their PPIs

Weka (httpwwwcswaikatoacnzsimmlweka) [23] isa well known software tool which provides environmentsrelated to machine learning data mining text miningpredictive analysis and business analysis Machine learningand data mining algorithms have been widely used in bioin-formatics and computational biology [24ndash28] The presentauthors adopted amino acid composition profile informa-tion with the SVM classifier to improve protein complexesclassification [29] Additionally we also proposed identifyingmicroRNAs target of Arabidopsis thaliana by integratingprediction scores from PITA miRanda and RNAHybridalgorithms [30] Recently Li et al [31] used random forestmachine learning algorithm and topology features to identifythe functions of protein complexes

In this research we began by collecting cancerous proteininteraction data That is only interactions involved with

Collect cancer noncancer PPIs

Use domain information to annotate protein interaction

Calculate four types of feature scores as input for classifiers

Comparison with previous work

Predict cancer-related proteins by individual classifier

Predict cancer-related proteins by voting

Investigate performance by independent test dataset

Discover potential cancer proteins

Figure 1 System flowchart for this study

cancer proteins are consideredThese types of interactions areknown as cancerous PPIs Noncancerous protein interactionmeans either one or both of the proteins are not yet identifiedin relation to cancer Given the cancerous PPIs a set of DDIrules for cancer proteins are derived In addition to thisset of DDI we also considered other features the weighteddomain frequency scores (DFS) DFS C for cancer proteinsand DFS X for noncancer proteins and the cancer linkerdegree (CLD) score A total of four features (DDI DFS CDFS X and CLD) are adopted to make novel cancer proteinpredictions by using 39 machine learning algorithms fromthe Weka tool In addition we also verified the accuracy ofthe predictive model on two independent datasets Finallyusing differentially expressed genes found in lung cancermicroarray data as a case study we discovered some potentialcancer genes for further experimental investigation

2 Methods

21 System Flowchart In this study a system was set up topredict cancer proteins by integrating four types of featuresFirstly cancer and noncancer PPIs were collected from bio-logical databases and then those interactions were annotatedby using the domain information Next we determinedfour feature scores data normalization is needed to ensureconsistency in their distributionThe system flowchart of thisstudy is illustrated in Figure 1

22 Data Sources and Datasets Generation Cancer proteins(tumor suppressor protein (TSP) or oncoprotein (OCP)) ofthe learning datasetwere integrated from Institute of Biophar-maceutical Sciences of Taiwan National Yang Ming Univer-sity Tumor Associated Gene (TAG httpwwwbinfonckuedutwTAGGeneDocphp) database [13] and Memorial


PFamBioGrid

Swiss-Prot

PPI

Swiss


TAG

MSKCC

Cancer proteins (C)


Domain

OMIM

HLungDB



ID







119894and 119875



119895) one consid-


119894and 119875

119895 119878119894119895 is defined by

119878

119894119895= 119878 (119875

119894) times 119878 (119875

119895) (1)

where 119878(119875119894) and 119878(119875


proteins 119875119894and 119875



119895)


120572120573


119894 119875


119868

120572120573

= sum

(119875119894 119875119895)

1

1003816

1003816

1003816

1003816

119878 (119875

119894)

1003816

1003816

1003816

1003816

lowast

1003816

1003816

1003816

1003816

1003816

119878 (119875

119895)

1003816

1003816

1003816

1003816

1003816

if 120572 in 119878 (119875119894) and 120573 in 119878 (119875

119895)

(2)

where |119878(119875119894)| and |119878(119875


119894) and

119878(119875



119895)


119895)


120572120573 ⟨119868rand120572120573



119877

120572120573=

119868

120572120573

⟨119868

rand120572120573

⟩

(3)

where ⟨119868rand120572120573



If the ratio 119877






119894 119875

119895) DDI


DDI119894119895= sum

120572isin119878(119875119894)120573isin119878(119875119895)

119877

120572120573if 119877120572120573

gt 1 (4)

where 119878(119875119894) and 119878(119875


119875





Let 119862 = (119862

1 119862

2 119862


proteins and let 119863119862 = (119889

119862

1 119889

119862

2 119889

119862



(119883

1 119883

2 119883


119863

119883= (119889

119883

1 119889

119883

2 119889

119883



119894 119875

119895) DFS 119862

119894119895and


DFS 119862119894119895=

sum

120572isin119878(119875119894)119862 (120572) + sum

120573isin119878(119875119895)119862 (120573)

119898 + 119899

(5)

DFS 119883119894119895=

sum

120572isin119878(119875119894)119883 (120572) + sum

120573isin119878(119875119895)119883(120573)

119898 + 119899

(6)


119894) and 119878(119875


119894




119894 119875

119895)

let 119899119862(119875119894 119875

119895) and 119899119883(119875

119894 119875



119895) CLD


CLD119894119895=

119899

119862(119875

119894 119875

119895)

119899

119862(119875

119894 119875

119895) + 119899

119883(119875

119894 119875

119895)

(7)


119886and 119862


119886 119883119887 119883119888 and

119883





(DFS 119862119894119895


CaCb

Xa

Xb Xd

XcPi-Pj

∙ CLDij = 2(2 + 4) = 13


119895)



119884 =

119883

119863119882119862minusmin

maxminusmin

(8)


119863119882119862 respectively



ACC =


SPE = TNTN + FP


SEN =

TPTP + FN

119865

1=


times 100

MCC

=


PPV =

TPTP + FP

(9)

3 Results



















Bayes TOP 3 0672 0653 0695 0672 0346 0609 0673TOP 5 0599 0571 0660 0614 0215 0410 0614


Lazy TOP 3 0743 0837 0688 0755 0504 0883 0763

Rules

TOP 5 0764 0825 0721 0769 0537 0856 0774TOP 7 0764 0826 0721 0769 0537 0857 0774TOP 9 0765 0817 0726 0769 0534 0848 0771TOP 3 0755 0838 0705 0766 0528 0877 0771

Trees








BayesFunctionsLazy

MiscRulesTrees

04

06

08

1

AUC


0506070809

1

SPE

05

06

07

08


ACC


05

06

07

08

SEN


06

07

08

F1


02

04

06

MCC


04

06

08

1PP

V


4 Discussion










0405060708

ACC

040608

1

SPE

040506070809

SEN

040506070809

F1

0020406

MCC

02040608

1

PPV

0010203040506070809

1

AD

Tree

Baye

sianL

ogist

icRe

gres

sion

Baye

sNet

Con

junc

tiveR

ule

Dec

ision

Stum

pD

ecisi

onTa

ble

DTN

B FTH

yper

Pipe

sIB

1Ib

kJ4

8J4

8gra

ftJri

pKs

tar

LAD

Tree

LMT

Logi

stic

LWL

Mul

tilay

erPe

rcep

tron

Nai

veBa

yes

Nai

veBa

yesM

ultin

omia

lN

aive

Baye

sSim

ple

Nai

veBa

yesU

pdat

eabl

eN

BTre

eN

nge

One

RPA

RTRa

ndom

Fore

stRa

ndom

Tree

RBFN

etw

ork

REPT

ree

Rido

rSi

mpl

eCar

tSi

mpl

eLog

istic

SMO

VFI

Vote

dPer

cept

ron

Zero

R

AUC

CLDFour features












5 Conclusion



06

07

08

09AC

C

06

07

08

09

SPE

06

07

08

SEN

06

07

08

09

F1

02

03

04

05

06

MCC

06

07

08

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

AUC

CLDFour features

06

07

08

09

1

PPV










Hitratio


Acknowledgments








1345 (intersection)


References






























































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology


PFamBioGrid

Swiss-Prot

PPI

Swiss


TAG

MSKCC

Cancer proteins (C)


Domain

OMIM

HLungDB



ID







119894and 119875



119895) one consid-


119894and 119875

119895 119878119894119895 is defined by

119878

119894119895= 119878 (119875

119894) times 119878 (119875

119895) (1)

where 119878(119875119894) and 119878(119875


proteins 119875119894and 119875



119895)


120572120573


119894 119875


119868

120572120573

= sum

(119875119894 119875119895)

1

1003816

1003816

1003816

1003816

119878 (119875

119894)

1003816

1003816

1003816

1003816

lowast

1003816

1003816

1003816

1003816

1003816

119878 (119875

119895)

1003816

1003816

1003816

1003816

1003816

if 120572 in 119878 (119875119894) and 120573 in 119878 (119875

119895)

(2)

where |119878(119875119894)| and |119878(119875


119894) and

119878(119875



119895)


119895)


120572120573 ⟨119868rand120572120573



119877

120572120573=

119868

120572120573

⟨119868

rand120572120573

⟩

(3)

where ⟨119868rand120572120573



If the ratio 119877






119894 119875

119895) DDI


DDI119894119895= sum

120572isin119878(119875119894)120573isin119878(119875119895)

119877

120572120573if 119877120572120573

gt 1 (4)

where 119878(119875119894) and 119878(119875


119875





Let 119862 = (119862

1 119862

2 119862


proteins and let 119863119862 = (119889

119862

1 119889

119862

2 119889

119862



(119883

1 119883

2 119883


119863

119883= (119889

119883

1 119889

119883

2 119889

119883



119894 119875

119895) DFS 119862

119894119895and


DFS 119862119894119895=

sum

120572isin119878(119875119894)119862 (120572) + sum

120573isin119878(119875119895)119862 (120573)

119898 + 119899

(5)

DFS 119883119894119895=

sum

120572isin119878(119875119894)119883 (120572) + sum

120573isin119878(119875119895)119883(120573)

119898 + 119899

(6)


119894) and 119878(119875


119894




119894 119875

119895)

let 119899119862(119875119894 119875

119895) and 119899119883(119875

119894 119875



119895) CLD


CLD119894119895=

119899

119862(119875

119894 119875

119895)

119899

119862(119875

119894 119875

119895) + 119899

119883(119875

119894 119875

119895)

(7)


119886and 119862


119886 119883119887 119883119888 and

119883





(DFS 119862119894119895


CaCb

Xa

Xb Xd

XcPi-Pj

∙ CLDij = 2(2 + 4) = 13


119895)



119884 =

119883

119863119882119862minusmin

maxminusmin

(8)


119863119882119862 respectively



ACC =


SPE = TNTN + FP


SEN =

TPTP + FN

119865

1=


times 100

MCC

=


PPV =

TPTP + FP

(9)

3 Results



















Bayes TOP 3 0672 0653 0695 0672 0346 0609 0673TOP 5 0599 0571 0660 0614 0215 0410 0614


Lazy TOP 3 0743 0837 0688 0755 0504 0883 0763

Rules

TOP 5 0764 0825 0721 0769 0537 0856 0774TOP 7 0764 0826 0721 0769 0537 0857 0774TOP 9 0765 0817 0726 0769 0534 0848 0771TOP 3 0755 0838 0705 0766 0528 0877 0771

Trees








BayesFunctionsLazy

MiscRulesTrees

04

06

08

1

AUC


0506070809

1

SPE

05

06

07

08


ACC


05

06

07

08

SEN


06

07

08

F1


02

04

06

MCC


04

06

08

1PP

V


4 Discussion










0405060708

ACC

040608

1

SPE

040506070809

SEN

040506070809

F1

0020406

MCC

02040608

1

PPV

0010203040506070809

1

AD

Tree

Baye

sianL

ogist

icRe

gres

sion

Baye

sNet

Con

junc

tiveR

ule

Dec

ision

Stum

pD

ecisi

onTa

ble

DTN

B FTH

yper

Pipe

sIB

1Ib

kJ4

8J4

8gra

ftJri

pKs

tar

LAD

Tree

LMT

Logi

stic

LWL

Mul

tilay

erPe

rcep

tron

Nai

veBa

yes

Nai

veBa

yesM

ultin

omia

lN

aive

Baye

sSim

ple

Nai

veBa

yesU

pdat

eabl

eN

BTre

eN

nge

One

RPA

RTRa

ndom

Fore

stRa

ndom

Tree

RBFN

etw

ork

REPT

ree

Rido

rSi

mpl

eCar

tSi

mpl

eLog

istic

SMO

VFI

Vote

dPer

cept

ron

Zero

R

AUC

CLDFour features












5 Conclusion



06

07

08

09AC

C

06

07

08

09

SPE

06

07

08

SEN

06

07

08

09

F1

02

03

04

05

06

MCC

06

07

08

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

AUC

CLDFour features

06

07

08

09

1

PPV










Hitratio


Acknowledgments








1345 (intersection)


References






























































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology



Let 119862 = (119862

1 119862

2 119862


proteins and let 119863119862 = (119889

119862

1 119889

119862

2 119889

119862



(119883

1 119883

2 119883


119863

119883= (119889

119883

1 119889

119883

2 119889

119883



119894 119875

119895) DFS 119862

119894119895and


DFS 119862119894119895=

sum

120572isin119878(119875119894)119862 (120572) + sum

120573isin119878(119875119895)119862 (120573)

119898 + 119899

(5)

DFS 119883119894119895=

sum

120572isin119878(119875119894)119883 (120572) + sum

120573isin119878(119875119895)119883(120573)

119898 + 119899

(6)


119894) and 119878(119875


119894




119894 119875

119895)

let 119899119862(119875119894 119875

119895) and 119899119883(119875

119894 119875



119895) CLD


CLD119894119895=

119899

119862(119875

119894 119875

119895)

119899

119862(119875

119894 119875

119895) + 119899

119883(119875

119894 119875

119895)

(7)


119886and 119862


119886 119883119887 119883119888 and

119883





(DFS 119862119894119895


CaCb

Xa

Xb Xd

XcPi-Pj

∙ CLDij = 2(2 + 4) = 13


119895)



119884 =

119883

119863119882119862minusmin

maxminusmin

(8)


119863119882119862 respectively



ACC =


SPE = TNTN + FP


SEN =

TPTP + FN

119865

1=


times 100

MCC

=


PPV =

TPTP + FP

(9)

3 Results



















Bayes TOP 3 0672 0653 0695 0672 0346 0609 0673TOP 5 0599 0571 0660 0614 0215 0410 0614


Lazy TOP 3 0743 0837 0688 0755 0504 0883 0763

Rules

TOP 5 0764 0825 0721 0769 0537 0856 0774TOP 7 0764 0826 0721 0769 0537 0857 0774TOP 9 0765 0817 0726 0769 0534 0848 0771TOP 3 0755 0838 0705 0766 0528 0877 0771

Trees








BayesFunctionsLazy

MiscRulesTrees

04

06

08

1

AUC


0506070809

1

SPE

05

06

07

08


ACC


05

06

07

08

SEN


06

07

08

F1


02

04

06

MCC


04

06

08

1PP

V


4 Discussion










0405060708

ACC

040608

1

SPE

040506070809

SEN

040506070809

F1

0020406

MCC

02040608

1

PPV

0010203040506070809

1

AD

Tree

Baye

sianL

ogist

icRe

gres

sion

Baye

sNet

Con

junc

tiveR

ule

Dec

ision

Stum

pD

ecisi

onTa

ble

DTN

B FTH

yper

Pipe

sIB

1Ib

kJ4

8J4

8gra

ftJri

pKs

tar

LAD

Tree

LMT

Logi

stic

LWL

Mul

tilay

erPe

rcep

tron

Nai

veBa

yes

Nai

veBa

yesM

ultin

omia

lN

aive

Baye

sSim

ple

Nai

veBa

yesU

pdat

eabl

eN

BTre

eN

nge

One

RPA

RTRa

ndom

Fore

stRa

ndom

Tree

RBFN

etw

ork

REPT

ree

Rido

rSi

mpl

eCar

tSi

mpl

eLog

istic

SMO

VFI

Vote

dPer

cept

ron

Zero

R

AUC

CLDFour features












5 Conclusion



06

07

08

09AC

C

06

07

08

09

SPE

06

07

08

SEN

06

07

08

09

F1

02

03

04

05

06

MCC

06

07

08

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

AUC

CLDFour features

06

07

08

09

1

PPV










Hitratio


Acknowledgments








1345 (intersection)


References






























































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology


SEN =

TPTP + FN

119865

1=


times 100

MCC

=


PPV =

TPTP + FP

(9)

3 Results



















Bayes TOP 3 0672 0653 0695 0672 0346 0609 0673TOP 5 0599 0571 0660 0614 0215 0410 0614


Lazy TOP 3 0743 0837 0688 0755 0504 0883 0763

Rules

TOP 5 0764 0825 0721 0769 0537 0856 0774TOP 7 0764 0826 0721 0769 0537 0857 0774TOP 9 0765 0817 0726 0769 0534 0848 0771TOP 3 0755 0838 0705 0766 0528 0877 0771

Trees








BayesFunctionsLazy

MiscRulesTrees

04

06

08

1

AUC


0506070809

1

SPE

05

06

07

08


ACC


05

06

07

08

SEN


06

07

08

F1


02

04

06

MCC


04

06

08

1PP

V


4 Discussion










0405060708

ACC

040608

1

SPE

040506070809

SEN

040506070809

F1

0020406

MCC

02040608

1

PPV

0010203040506070809

1

AD

Tree

Baye

sianL

ogist

icRe

gres

sion

Baye

sNet

Con

junc

tiveR

ule

Dec

ision

Stum

pD

ecisi

onTa

ble

DTN

B FTH

yper

Pipe

sIB

1Ib

kJ4

8J4

8gra

ftJri

pKs

tar

LAD

Tree

LMT

Logi

stic

LWL

Mul

tilay

erPe

rcep

tron

Nai

veBa

yes

Nai

veBa

yesM

ultin

omia

lN

aive

Baye

sSim

ple

Nai

veBa

yesU

pdat

eabl

eN

BTre

eN

nge

One

RPA

RTRa

ndom

Fore

stRa

ndom

Tree

RBFN

etw

ork

REPT

ree

Rido

rSi

mpl

eCar

tSi

mpl

eLog

istic

SMO

VFI

Vote

dPer

cept

ron

Zero

R

AUC

CLDFour features












5 Conclusion



06

07

08

09AC

C

06

07

08

09

SPE

06

07

08

SEN

06

07

08

09

F1

02

03

04

05

06

MCC

06

07

08

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

AUC

CLDFour features

06

07

08

09

1

PPV










Hitratio


Acknowledgments








1345 (intersection)


References






























































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology












Bayes TOP 3 0672 0653 0695 0672 0346 0609 0673TOP 5 0599 0571 0660 0614 0215 0410 0614


Lazy TOP 3 0743 0837 0688 0755 0504 0883 0763

Rules

TOP 5 0764 0825 0721 0769 0537 0856 0774TOP 7 0764 0826 0721 0769 0537 0857 0774TOP 9 0765 0817 0726 0769 0534 0848 0771TOP 3 0755 0838 0705 0766 0528 0877 0771

Trees








BayesFunctionsLazy

MiscRulesTrees

04

06

08

1

AUC


0506070809

1

SPE

05

06

07

08


ACC


05

06

07

08

SEN


06

07

08

F1


02

04

06

MCC


04

06

08

1PP

V


4 Discussion










0405060708

ACC

040608

1

SPE

040506070809

SEN

040506070809

F1

0020406

MCC

02040608

1

PPV

0010203040506070809

1

AD

Tree

Baye

sianL

ogist

icRe

gres

sion

Baye

sNet

Con

junc

tiveR

ule

Dec

ision

Stum

pD

ecisi

onTa

ble

DTN

B FTH

yper

Pipe

sIB

1Ib

kJ4

8J4

8gra

ftJri

pKs

tar

LAD

Tree

LMT

Logi

stic

LWL

Mul

tilay

erPe

rcep

tron

Nai

veBa

yes

Nai

veBa

yesM

ultin

omia

lN

aive

Baye

sSim

ple

Nai

veBa

yesU

pdat

eabl

eN

BTre

eN

nge

One

RPA

RTRa

ndom

Fore

stRa

ndom

Tree

RBFN

etw

ork

REPT

ree

Rido

rSi

mpl

eCar

tSi

mpl

eLog

istic

SMO

VFI

Vote

dPer

cept

ron

Zero

R

AUC

CLDFour features












5 Conclusion



06

07

08

09AC

C

06

07

08

09

SPE

06

07

08

SEN

06

07

08

09

F1

02

03

04

05

06

MCC

06

07

08

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

AUC

CLDFour features

06

07

08

09

1

PPV










Hitratio


Acknowledgments








1345 (intersection)


References






























































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology



BayesFunctionsLazy

MiscRulesTrees

04

06

08

1

AUC


0506070809

1

SPE

05

06

07

08


ACC


05

06

07

08

SEN


06

07

08

F1


02

04

06

MCC


04

06

08

1PP

V


4 Discussion










0405060708

ACC

040608

1

SPE

040506070809

SEN

040506070809

F1

0020406

MCC

02040608

1

PPV

0010203040506070809

1

AD

Tree

Baye

sianL

ogist

icRe

gres

sion

Baye

sNet

Con

junc

tiveR

ule

Dec

ision

Stum

pD

ecisi

onTa

ble

DTN

B FTH

yper

Pipe

sIB

1Ib

kJ4

8J4

8gra

ftJri

pKs

tar

LAD

Tree

LMT

Logi

stic

LWL

Mul

tilay

erPe

rcep

tron

Nai

veBa

yes

Nai

veBa

yesM

ultin

omia

lN

aive

Baye

sSim

ple

Nai

veBa

yesU

pdat

eabl

eN

BTre

eN

nge

One

RPA

RTRa

ndom

Fore

stRa

ndom

Tree

RBFN

etw

ork

REPT

ree

Rido

rSi

mpl

eCar

tSi

mpl

eLog

istic

SMO

VFI

Vote

dPer

cept

ron

Zero

R

AUC

CLDFour features












5 Conclusion



06

07

08

09AC

C

06

07

08

09

SPE

06

07

08

SEN

06

07

08

09

F1

02

03

04

05

06

MCC

06

07

08

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

AUC

CLDFour features

06

07

08

09

1

PPV










Hitratio


Acknowledgments








1345 (intersection)


References






























































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology







0405060708

ACC

040608

1

SPE

040506070809

SEN

040506070809

F1

0020406

MCC

02040608

1

PPV

0010203040506070809

1

AD

Tree

Baye

sianL

ogist

icRe

gres

sion

Baye

sNet

Con

junc

tiveR

ule

Dec

ision

Stum

pD

ecisi

onTa

ble

DTN

B FTH

yper

Pipe

sIB

1Ib

kJ4

8J4

8gra

ftJri

pKs

tar

LAD

Tree

LMT

Logi

stic

LWL

Mul

tilay

erPe

rcep

tron

Nai

veBa

yes

Nai

veBa

yesM

ultin

omia

lN

aive

Baye

sSim

ple

Nai

veBa

yesU

pdat

eabl

eN

BTre

eN

nge

One

RPA

RTRa

ndom

Fore

stRa

ndom

Tree

RBFN

etw

ork

REPT

ree

Rido

rSi

mpl

eCar

tSi

mpl

eLog

istic

SMO

VFI

Vote

dPer

cept

ron

Zero

R

AUC

CLDFour features












5 Conclusion



06

07

08

09AC

C

06

07

08

09

SPE

06

07

08

SEN

06

07

08

09

F1

02

03

04

05

06

MCC

06

07

08

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

AUC

CLDFour features

06

07

08

09

1

PPV










Hitratio


Acknowledgments








1345 (intersection)


References






























































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology









5 Conclusion



06

07

08

09AC

C

06

07

08

09

SPE

06

07

08

SEN

06

07

08

09

F1

02

03

04

05

06

MCC

06

07

08

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

AUC

CLDFour features

06

07

08

09

1

PPV










Hitratio


Acknowledgments








1345 (intersection)


References






























































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology


06

07

08

09AC

C

06

07

08

09

SPE

06

07

08

SEN

06

07

08

09

F1

02

03

04

05

06

MCC

06

07

08

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

Top

3To

p 5

Top

7To

p 9

Top

11

Top

13

Top

15

Top

17

Top

19

Top

21

Top

23

Top

25

Top

27

Top

29

Top

31

Top

33

Top

35

Top

37

Top

39

AUC

CLDFour features

06

07

08

09

1

PPV










Hitratio


Acknowledgments








1345 (intersection)


References






























































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology




Hitratio


Acknowledgments








1345 (intersection)


References






























































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology


















































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology

research article prediction of cancer proteins by...

Documents