
Multivariate selection of genetic markers in diagnostic classification

Griffin Weber a,b, Staal Vinterbo a, Lucila Ohno-Machado a,*

a Decision Systems Group, Division of Health Sciences and Technology, Harvard and MIT, Brigham and Women's Hospital, Thorn 310, 75 Francis Street, Boston, MA 02115, USA
b Division of Engineering and Applied Sciences, Harvard University, Pierce Hall, 29 Oxford Street, Cambridge, MA 02138, USA

    Received 28 February 2003; received in revised form 1 April 2003; accepted 16 January 2004

KEYWORDS: Microarray; Logistic regression; Variable selection; Classification; Clustering; Principal components analysis

Summary  Analysis of gene expression data obtained from microarrays presents a new set of challenges to machine learning modeling. In this domain, in which the number of variables far exceeds the number of cases, identifying relevant genes or groups of genes that are good markers for a particular classification is as important as achieving good classification performance. Although several machine learning algorithms have been proposed to address the latter, identification of gene markers has not been systematically pursued. In this article, we investigate several algorithms for selecting gene markers for classification. We test these algorithms using logistic regression, as this is a simple and efficient supervised learning algorithm. We demonstrate, using 10 different data sets, that a conditionally univariate algorithm constitutes a viable choice if a researcher is interested in quickly determining a set of gene expression levels that can serve as markers for disease. We show that the classification performance of logistic regression is not very different from that of more sophisticated algorithms that have been applied in previous studies, and that the gene selection in the logistic regression algorithm is reasonable in both cases. Furthermore, the algorithm is simple, its theoretical basis is well established, and our user-friendly implementation is now freely available on the internet, serving as a benchmarking tool for the development of new algorithms.
© 2004 Elsevier B.V. All rights reserved.

Artificial Intelligence in Medicine (2004) 31, 155-167
* Corresponding author. Tel.: +1-617-732-8543; fax: +1-617-732-9260. E-mail address: [email protected] (L. Ohno-Machado).
0933-3657/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.artmed.2004.01.011

1. Introduction

Analysis of gene expression using microarrays has become popular in the past few years. This high-throughput method generates data on the so-called transcriptome, since it measures the products of the DNA transcription process. In its initial phases, analysis of microarray data was based mainly on unsupervised learning methods, as there were not enough data to allow the construction of meaningful classification models. Even though larger series of cases are now becoming increasingly available, there are still several challenges in this analysis. Besides known problems related to the accuracy and reproducibility of measurements, there are important problems related to the type of learning that can be performed with these data. Analysis of high-throughput gene expression has resurrected some ghosts that have haunted regression modelers for a long time, such as the "large n (variables), small m (cases) problem" as it is known in the statistics

literature, or the "curse of dimensionality" as it is known in machine learning. Furthermore, as opposed to other domains in which optimizing the predictive ability of a model is the main goal, biomedical scientists want to determine which genes are the best markers for the classification at hand, and therefore it becomes important to find suitable heuristics to determine variable relevance [1]. Although the microarray analysis field has promoted the development of machine learning algorithms by making data easily available on the WWW, it has also created new demands: for example, the need to publish data in a standardized format such as that of OMG GED (http://www.geml.org/omg.htm) and to make available tools that can be used by a large community of researchers. (In other domains, such as clinical medicine, data are commonly not made available, and each researcher is expected to utilize his or her own tools.) In this article, we validate, in 10 different data sets, a tool that we developed for this purpose.

Unsupervised and supervised learning methods have been used to characterize expression patterns, and several algorithms have been recreated/renamed, modified, or newly developed to address the non-mutually-exclusive tasks of defining (a) clusters of genes that express similarly in different conditions, (b) genes that may be co-regulated, (c) new diagnostic categories based on response to pre-defined stimuli or expression over time periods, and (d) genes that can serve as markers for disease categories. Ideally, algorithms designed for the analysis of gene expression microarrays could be used for all these purposes.

In this article, we compare several algorithms that were combined with logistic regression to address item (d) from above: determining which genes are good markers for known diagnostic categories. Of note, the biological significance of these markers in causing disease is not the issue in question, but rather how successfully the gene expression levels of these markers can be used to diagnose previously unseen cases using a simple algorithm. We utilize a streamlined selection procedure to develop parsimonious regression models. We demonstrate, in two different data sets, that this algorithm can classify data nearly as well as more complex algorithms, and advocate its use for quick preliminary exploration of gene expression data. We developed a web implementation of the algorithm that can be easily accessed.

This article is structured as follows: Section 2 briefly reviews recently published applications of algorithms for classification of high-throughput gene expression data, and reviews the basic elements of logistic regression and variable selection. Section 3 describes the experiments we conducted to compare logistic regression and other classification algorithms, and Section 4 analyzes the results in terms of classification performance and gene selection. Section 5 verifies the results using eight additional data sets and explores alternative variable selection methods. Section 6 compares the variable selection methods tested. Section 7 discusses our algorithm in terms of advantages, limitations, and future directions.

2. Algorithms for classification of high-throughput gene expression data

The task of classifying cases into pre-defined categories has traditionally been approached with supervised machine learning algorithms. Interestingly, and mainly due to different expectations for the analysis, both unsupervised and supervised learning algorithms (or combinations of these two types) have been used for classification tasks. For example, two landmark publications [2,3] related to classification of tumors assess classification performance using a visual analysis of hierarchical clustering (an unsupervised algorithm). In those articles, cluster analysis is performed first, and the inclusion of a case in a pre-labeled cluster is used for visualization of its predicted category. In these seminal articles, the primary purpose seemed to be not only the demonstration that gene expression data could be used for classification of diseases, but also to report on larger scale explorations in search of candidate genes for certain types of cancer. The aim was to use gene expression data to discover new diagnostic or prognostic categories (e.g. differential response to medication, differential survival), not just to classify cases into known categories.

Recent influential publications have made extensive use of clustering, often producing results that require intensive human analysis and experimental validation. In unsupervised classification, such as agglomerative clustering, there is no guarantee that the resulting clusters will separate the space according to the outcome of interest [4]. In supervised classification, there is a gold standard upon which the model is constructed, therefore restricting the search space. Aims of variable selection in a supervised algorithm include reducing information cost, increasing model applicability and robustness, and revealing input variable relationships. As microarray technology becomes cheaper, more accurate, and more accepted, and the analysis becomes more targeted, it is expected that supervised algorithms will play a more prominent role, even in exploratory phases of the investigation.


For example, if the purpose is to determine whether a particular individual will respond to drug A, then a supervised algorithm can be used to specifically model that problem if appropriate data are available. Variable selection procedures can then be used in the modeling process to determine which genes are statistically more related to the outcome (i.e. the best markers for a given category). Although these markers should not be assumed to be good candidates for drug targeting until biological validation of their roles is conducted, trimming the space of candidate genes is nonetheless an important step towards this goal.

The literature has some examples of supervised classification. Support vector machines have been used successfully to classify cancer tissue samples [5,6] from gene expression data. Other methods such as classification trees [7], linear discriminant analysis, and nearest neighbor classifiers have also been used [8,9].

Logistic regression is a supervised algorithm for binary classification that has not infrequently been combined with variable selection methods [10]. A multiple logistic regression model (in which the variables are represented as $x_1, x_2, \ldots, x_n$) can be written as

$$\mathrm{logit}(p) = \log\left(\frac{p}{1-p}\right) = \mathrm{intercept} + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$$

The intercept is a constant, $b_i$ is a coefficient for variable $x_i$, and $p$ represents the estimated probability that a particular case belongs to the category assigned as "1" in the gold standard. The equation can be rearranged to

$$p = \frac{1}{1 + \exp\left(-\left(\mathrm{intercept} + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n\right)\right)}$$

In microarray data, the variables $x_i$ correspond to the expression levels of n genes. The training data set for the logistic regression model consists of m arrays. Because n is usually much larger than m, the regression model becomes an underdetermined system, and it is not desirable to perform the regression directly using all genes as variables.
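As a minimal illustration of this setup (a sketch, not the authors' code; the data frame d and gene columns g1 and g2 are hypothetical names), such a model can be fit in R with glm:

    # Sketch: fit a logistic regression on two pre-selected genes.
    # 'd' is a hypothetical data frame with one row per array; column 'y'
    # holds the 0/1 class labels and 'g1', 'g2' hold expression levels.
    fit <- glm(y ~ g1 + g2, family = binomial, data = d)
    p <- predict(fit, type = "response")  # estimated probability p per array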

In order to overcome the "large n, small m" problem, some authors [1,11] use principal components analysis (PCA) as a dimensionality reduction step, utilizing a limited number of derived variables (the selected principal components). The contribution of each gene to the overall classification can be determined indirectly from the coefficients of the principal components. Although popular, this approach is not ideal if the goal is to select markers for a diagnostic category, since it is not assured that diagnostic category separability coincides with the largest principal components.

Further, the information requirements of model application are not reduced by using a number of principal components, since all variables are used to compute the principal components. Others have used partial least squares (PLS) to account for the class label, but this approach still does not address the problem of deleting irrelevant variables from the model. Selecting a small number of relevant variables is extremely important in this domain, as validation studies are usually done on a time-consuming and expensive gene-by-gene basis. We argue that a direct variable selection approach is more appropriate, since it results in small sets of variables that can be subjected to targeted biological validation.

The problem of selecting candidate genes can be viewed as the problem of selecting variables in predictive models. Variable selection approaches can be classified into (a) univariate, (b) conditionally univariate, and (c) multivariate. Univariate selection does not take into account potential synergistic effects of genes. Conditionally univariate approaches account for a certain number of synergies, but utilize a simple heuristic that considers the addition or subtraction of one gene at a time. Multivariate selection is ideal, but its computational complexity is prohibitive.

For our initial experiments with logistic regression, we use a hybrid variable selection approach that involves two phases. The first is a univariate method in which we rank the genes according to their Pearson correlation coefficients with the outcome variables. The m − 1 genes with the highest correlations, where m is the number of training arrays, are considered in the second phase, which is stepwise variable selection. This is a conditionally univariate method based on the inclusion (or exclusion) of a single variable at a time, conditioned on the variables already included. Given a particular starting model, usually containing only the intercept, or containing all variables (and the intercept), stepwise variable selection repeatedly adds or deletes single variables according to some addition or deletion criterion. The process stops when no variables meet either criterion. In our case, we use backward variable selection and the Akaike Information Criterion [12] to determine when to stop removing variables from the model.

With our method, the first phase rapidly reduces the number of genes from thousands to a number small enough that the coefficients in the logistic regression models can be computed. The stepwise variable selection then streamlines the model so that as few genes as possible remain in the final model.


The two-phase approach was chosen for several reasons. (a) In preliminary experiments, using correlation only (since it is a very quick algorithm) to reduce the number of genes to fewer than m − 1 resulted in models that failed to fit the training arrays, whereas stepwise selection did not. (b) Pre-selection of m − 1 genes by phase 1 allowed us to use the built-in stepwise function in the statistical software package we used. (c) By reducing the number of genes, stepwise variable selection can be run relatively quickly. We could have opted to use more complex classification models; since an important goal was to compare variable selection procedures, and to make this benchmark tool available on the Web, we decided to use the simple, efficient, and well-known logistic regression model. We could also have opted to use more complex feature selection algorithms, but they would require an exceedingly large amount of time. For example, we could have used pairwise selection of genes as did Bo and Jonassen [13], but we would not have a clear rationale for stopping at pairs instead of triplets, and so on. In our approach, there is a systematic (albeit greedy) search for a suitable set of variables that can accurately classify the training set. We do not limit our search to variable pairs, and do not consider all pairs. Our goal is to develop a quick and simple method of microarray classification, which can be used as a benchmark for other analysis techniques.

It should be noted that although we believe this method of variable selection is sufficient for determining whether logistic regression can be used as a classification model for microarrays, we do not argue that it is the best approach. We are actively researching alternative methods of variable selection whose run times are similar to our hybrid algorithm, and we include some preliminary results towards the end of this paper.
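The two-phase procedure can be summarized in a few lines of R. This is an approximation of the method described above, not the Classify-R source; the matrix X (m arrays by n genes), the 0/1 outcome y, and the function name are assumptions for illustration, and stepAIC comes from the MASS package:

    library(MASS)  # provides stepAIC

    # X: hypothetical m x n matrix of expression levels (rows = arrays);
    # y: hypothetical 0/1 outcome vector of length m.
    two.phase.select <- function(X, y) {
      m <- nrow(X)
      # Phase 1: rank genes by absolute Pearson correlation with the
      # outcome and keep the top m - 1.
      r <- abs(as.vector(cor(X, y)))
      keep <- order(r, decreasing = TRUE)[seq_len(m - 1)]
      d <- data.frame(y = y, X[, keep, drop = FALSE])
      # Phase 2: backward stepwise selection under AIC, starting from the
      # model containing all pre-selected genes.
      full <- glm(y ~ ., family = binomial, data = d)
      stepAIC(full, direction = "backward", trace = FALSE)
    }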

3. Experiments: performance of logistic regression

We used two data sets that were published in Khan et al. [11] and Golub et al. [2]. The classification problem in the Khan study is related to the differential diagnosis of four small, round blue-cell tumors (SRBCT) of childhood: neuroblastoma (NB), rhabdomyosarcoma (RMS), Burkitt lymphoma (BL, a subset of non-Hodgkin lymphoma), and the Ewing family of tumors (EWS). A training set of 63 samples (23 EWS, 20 RMS, 12 NB, 8 BL) with 6567 genes was used to build the models, and a test set of 25 previously unseen samples (6 EWS, 5 RMS, 6 NB, 3 BL, and 5 other cancers) was used to assess generalization.

One of the classification problems in the Golub study is related to the differential diagnosis of acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). A training set of 38 samples (27 ALL, 11 AML) and a test set of 34 cases (20 ALL, 14 AML) were used.

We used the statistical package R (available from http://www.r-project.org) to perform the statistical analyses. As described above, we pre-selected a set of 62 genes for the SRBCT data and 37 genes for the leukemia data univariately by calculating the Pearson correlation coefficient of all genes with the outcome variables. This pre-selection greatly reduced the time needed to compute our models. The genes with the highest correlations (either positive or negative) were considered in stepwise selection for the logistic regression models, which was performed using the stepAIC function in R with its default parameters.

The genes that were included in the logistic regression models are listed in Table 1a (SRBCT) and Table 1b (leukemia). A † marks the genes that overlap with the ones listed in the Khan [11] and Golub et al. [2] papers. A * marks the genes found, in the literature, to be related to the type of tumor subject to classification. Table 2 lists the coefficients for each of these genes in the logistic regression equations.

Table 3 lists the classification results using logistic regression for (a) the SRBCT and (b) the leukemia data. For a given case, the model returns a numeric value for each tumor category. This value represents the probability that the case belongs to a given category.

Table 1  Genes that were included in the logistic regression models for (a) the small, round blue-cell tumor (SRBCT) data set and (b) the AML/ALL leukemia data set

(a) SRBCT data set
Gene        Category
H62098      BL
AA705225*†  RMS
AA461125*†  RMS
AA456008    NB
AA670200    NB
AA679180    EWS
AA430668    EWS

(b) Leukemia data set
Gene        Category
M23197*†    ALL/AML
X85116      ALL/AML

A * indicates that the gene is known to be associated with the corresponding tumor or tissue. A † indicates that the gene was included in the list of 96 used in the Khan SRBCT model [11] or was cited in the Golub paper [2].


The category that yields the highest value for the case becomes the final classification if that value is greater than 0.5. If there are no values greater than 0.5, then we determine that the case cannot be classified by the model. Although the 0.5 cutoff was arbitrarily chosen, we found that the model is not very sensitive to its value. A 0.95 threshold changed a single prediction in the Khan data set to unclassified and none in the Golub data set. Similarly, lowering the threshold to 0.05 changed one unclassified prediction in the Khan data to a correct prediction, but had no effect on the Golub data. The results of the logistic regression models are compared to those of the Khan model [11] in Table 3a and to those of the Golub model [2] in Table 3b. The actual histologic classification of each case is also listed.
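The decision rule reduces to a few lines of R; this sketch assumes a named vector probs of per-category probabilities (the helper name and the example values are illustrative):

    # Pick the category with the highest probability; if no probability
    # exceeds the threshold, declare the case unclassified.
    classify <- function(probs, threshold = 0.5) {
      best <- which.max(probs)
      if (probs[best] > threshold) names(probs)[best] else "unclassified"
    }
    classify(c(BL = 0.00, EWS = 0.98, NB = 0.01, RMS = 0.00))  # "EWS"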

Table 4 summarizes these results by listing the numbers of correct, incorrect, and unclassified cases for each model. Note that, in general, logistic regression performed as well as the other models in terms of the number of correctly classified cases. Although Khan's model had one more correct prediction than logistic regression, our model had two more correct predictions than Golub's. This is encouraging considering our somewhat arbitrary initial approach to variable selection, since it leaves room for further improvement of the logistic regression model by exploring alternative methods of reducing the number of genes.

4. Analysis of classification using logistic regression

Despite the simplicity of the logistic regression models, they performed quite similarly to the more complex models employed by Khan [11] and Golub et al. [2]. For each data set, the logistic function had approximately as many correct classifications as the corresponding previously published model. The logistic function produced a few incorrect classifications, while the other models had more unclassified cases. The difference was partially due to the 0.5 minimum score threshold used in the logistic model. A higher value could have changed several cases from incorrect to unclassified.

In addition to simplicity, the logistic regression models had the added benefit of using far fewer genes than either the Khan (7 genes compared to 96) or the Golub (2 genes compared to 50) models. Note that the number of genes we selected falls within the rule of thumb that is popular in statistics, in which it is recommended that the number of cases be around 10 times the number of variables, as opposed to the number of genes in the original publications. Five of the seven genes selected for the prediction of SRBCT were also present in the Khan gene set; one of the two genes selected for the leukemia classification was also present in the Golub gene set. The ultimate goal of any of these models is to reduce the number of genes that the biologist must study in order to gain an understanding of the disease process. The genes chosen by our two-phase selection method seem to be a reasonable starting point for further investigation.

Being able to correctly classify tissue is often not enough to warrant expensive and time-consuming analysis of a gene. It must be biologically plausible that there exists a link between the gene and the tissue or disease being studied. Several of the genes chosen by stepwise selection were not only successful in classifying tumor categories, but their connections to the tumors or the affected tissues were also supported in the literature.

Gene AA461125 (image: 796258) is a cDNA clone similar to adhalin, which has been shown to be a cause of muscular dystrophy [14].

Table 2  Coefficients for (a) the four logistic functions derived from the SRBCT data set [11] and (b) the two logistic functions derived from the leukemia data set [2]

Gene          Coefficient

(a) SRBCT data set
BL
  Intercept   41.72
  H62098      61.07
RMS
  Intercept   78.80
  AA705225    16.28
  AA461125    171.98
NB
  Intercept   41.65
  AA456008    12.39
  AA670200    7.10
EWS
  Intercept   23.24
  AA679180    7.13
  AA430668    10.72

(b) Leukemia data set
AML
  Intercept   34.10112
  M23197      0.03783
  X85116      0.01311
ALL
  Intercept   34.10112
  M23197      0.03783
  X85116      0.01311


Table 3  Classification results for (a) the small, round blue-cell tumor (SRBCT) data set and (b) the AML/ALL leukemia data set using logistic regression with stepwise gene selection (Classify-R), the ANN model described by Khan [11], and the self-organizing maps technique used by Golub et al. [2]

(a) SRBCT data set

        Classify-R                                      Khan et al.       Actual
Case    BL      EWS     NB      RMS     Class           Vote    Class
7       0.9986  0.0000  0.0000  0.0000  BL              0.93    BL        BL
15      0.6546  0.0000  0.0000  0.0000  BL              0.91    BL        BL
18      0.0776  0.0000  0.0000  0.0000  --              0.88    BL        BL
2       0.0000  0.0000  0.0000  0.0000  --              0.67    EWS       EWS
6       0.0000  1.0000  0.0000  0.0000  EWS             0.98    EWS       EWS
12      0.0000  1.0000  0.0000  0.0001  EWS             0.89    EWS       EWS
19      0.0000  1.0000  0.0000  0.0000  EWS             0.99    EWS       EWS
20      0.0000  0.0209  0.0000  0.0000  --              0.40    --        EWS
21      0.0000  0.9978  0.0007  0.0000  EWS             0.81    EWS       EWS
1       0.0000  0.0000  0.9905  0.0000  NB              0.76    NB        NB
8       0.0000  0.0000  1.0000  0.0000  NB              0.94    NB        NB
14      0.0000  0.0000  1.0000  0.0000  NB              0.90    NB        NB
16      0.0000  0.0000  1.0000  0.0000  NB              0.93    NB        NB
23      0.0000  0.0000  1.0000  0.0000  NB              0.70    NB        NB
25      0.0000  0.0000  1.0000  0.0000  NB              0.89    NB        NB
4       0.0000  0.0000  0.0000  1.0000  RMS             0.95    RMS       RMS
10      0.0000  0.9999  0.0000  1.0000  RMS             0.68    --        RMS
17      0.0000  0.0000  0.0000  1.0000  RMS             0.90    RMS       RMS
22      0.0000  0.0000  0.0000  1.0000  RMS             0.88    RMS       RMS
24      0.0000  0.0000  0.0000  1.0000  RMS             0.87    RMS       RMS
3       0.0000  0.0000  0.0000  0.0004  --              0.17    --        OTHER
5       0.0000  0.0000  0.0000  0.0000  --              0.25    --        OTHER
9       0.0000  0.0000  0.0000  1.0000  RMS             0.60    --        OTHER
11      0.0000  0.0000  0.0000  0.0000  --              0.39    --        OTHER
13      0.0000  0.0000  0.0000  1.0000  RMS             0.70    --        OTHER

(b) Leukemia data set

        Classify-R                      Golub et al.    Actual
Case    ALL     AML     Class           PS      Class
39      1.0000  0.0000  ALL             0.78    ALL     ALL
40      1.0000  0.0000  ALL             0.68    ALL     ALL
41      0.0010  0.9990  AML             0.99    ALL     ALL
42      1.0000  0.0000  ALL             0.42    ALL     ALL
43      1.0000  0.0000  ALL             0.66    ALL     ALL
44      1.0000  0.0000  ALL             0.97    ALL     ALL
45      1.0000  0.0000  ALL             0.88    ALL     ALL
46      0.9996  0.0004  ALL             0.84    ALL     ALL
47      1.0000  0.0000  ALL             0.81    ALL     ALL
48      1.0000  0.0000  ALL             0.94    ALL     ALL
49      1.0000  0.0000  ALL             0.84    ALL     ALL
55      1.0000  0.0000  ALL             0.73    ALL     ALL
56      1.0000  0.0000  ALL             0.84    ALL     ALL
59      1.0000  0.0000  ALL             0.68    ALL     ALL
67      0.9997  0.0003  ALL             0.15    --      ALL
68      1.0000  0.0000  ALL             0.80    ALL     ALL
69      1.0000  0.0000  ALL             0.85    ALL     ALL
70      1.0000  0.0000  ALL             0.73    ALL     ALL
71      1.0000  0.0000  ALL             0.30    --      ALL
72      1.0000  0.0000  ALL             0.77    ALL     ALL
50      0.0000  1.0000  AML             0.97    AML     AML
51      0.0053  0.9947  AML             1.00    AML     AML
52      0.0003  0.9997  AML             0.61    AML     AML
53      0.0000  1.0000  AML             0.89    AML     AML
54      0.0000  1.0000  AML             0.23    --      AML
57      0.0485  0.9515  AML             0.22    --      AML
58      0.9919  0.0081  ALL             0.74    AML     AML
60      0.0000  1.0000  AML             0.06    --      AML
61      0.9634  0.0366  ALL             0.40    AML     AML
62      0.0091  0.9909  AML             0.58    AML     AML
63      1.0000  0.0000  ALL             0.69    AML     AML
64      0.0000  1.0000  AML             0.52    AML     AML
65      0.0000  1.0000  AML             0.60    AML     AML
66      0.0000  1.0000  AML             0.27    --      AML

For the Classify-R model, the numeric predictions of each of the individual logistic regression equations are shown. The tumor classification category with the highest value for each case determines the final predicted Class of the Classify-R model when that value is greater than 0.5; otherwise, the Classify-R model cannot confidently classify the case (marked --). Also shown (in 3a) are the ANN committee Vote and final predicted Class of the Khan model; the threshold ANN committee vote is different for each Class. The predictions of the Golub model (in 3b) are shown with their prediction strengths (PS); the threshold prediction strength was 0.3. For comparison, the histologic diagnosis of each case (Actual) is given.

Gene AA705225 (image: 461425) has similarities to myosin light chain, which is expressed by RMS cells [15]. Both of these genes were selected as predictors for RMS. Gene M23197 is a CD33 antigen, which has been directly linked to myeloid leukemias [16]. This gene was a predictor for AML and (with a negative coefficient) for ALL.

    5. Verification of results

Logistic regression performed well on the Golub and Khan data sets; to verify these results, in a second series of experiments we applied our algorithm to eight additional published microarray data sets. Descriptions of all the data sets are given in Table 5. They were chosen to represent a wide variety of different types of microarray experiments. The number of arrays in the data sets ranged from 42 to 198

(mean 92), and the arrays contained between 2308 and 16,063 genes (mean 9128). Five of the 10 data sets were binary classification problems (such as ALL and AML for the Golub data), and the other 5 data sets (including Khan's SRBCT data) had between 4 and 18 categories. When a test versus training set was not specified for a given data set, the first two-thirds of the microarrays were chosen as training cases, while the remaining were used as test cases.

Logistic regression with the two-step variable selection method described above was applied to each of these data sets. To determine the effect of the variable selection method used, we tested six other variable selection techniques and variations of these. The first is random variable selection (RAND): from the original list of n genes, m − 1 are randomly selected. The second is maximum variance (VAR), which selects the genes with the largest variance across all training arrays before they are normalized. When combined with other algorithms, VAR indicates variance trimming, in which the top 10% of genes with the largest variance are pre-selected before being fed into the next variable selection method. The third base algorithm ranks genes according to the square of their Pearson correlation coefficients with the outcome variable (RSQ). Note that this is equivalent to seeing how well a univariate linear regression model, as opposed to a univariate logistic regression model, would fit the data. The advantage of a linear regression model is that it is much faster to build, yet it may give somewhat similar results to a logistic regression model.


Table 4  The number of correct, incorrect, and unclassified Class predictions of the logistic regression models (Classify-R) compared to the Khan [11] and Golub et al. [2] classification models

              SRBCT data            Leukemia data
              Classify-R   Khan     Classify-R   Golub
Correct       17           18       30           28
Incorrect     2            0        4            0
Unclassified  6            7        0            6


The fourth base algorithm uses a t-statistic to determine whether a gene's mean expression level for one class of microarrays differs from its mean expression level for the other class. Genes are ranked according to which ones have the greatest absolute values of their t-statistics (TSTAT). This algorithm assumes that the data are normally distributed, which is not necessarily the case with microarray data.
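Both rankings are straightforward to express in R. In this sketch, X (arrays by genes) and the 0/1 labels y are hypothetical, and the unpooled (Welch-style) variance is one plausible reading of the t-statistic, since the text does not specify the variance estimate used:

    # RSQ: squared Pearson correlation of each gene with the outcome.
    rsq <- as.vector(cor(X, y))^2

    # TSTAT: two-sample t-statistic comparing mean expression by class.
    tstat <- apply(X, 2, function(g) {
      a <- g[y == 1]; b <- g[y == 0]
      (mean(a) - mean(b)) / sqrt(var(a) / length(a) + var(b) / length(b))
    })

    # Rank genes, best first.
    rank.rsq   <- order(rsq, decreasing = TRUE)
    rank.tstat <- order(abs(tstat), decreasing = TRUE)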

The fifth method is principal components analysis. Rather than the entire data set, PCA was applied to only the top 1000 genes with the greatest variance before normalization, to limit the amount of RAM needed to calculate the principal components. Note that PCA is different from the other algorithms described in that instead of reducing the number of genes, it only reduces the number of variables. Regardless of the number of principal components used in the logistic regression models, the models will always be based on the expression levels of exactly 1000 genes.
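A sketch of this step with R's prcomp (variable names hypothetical; the number of components retained is illustrative):

    # Keep the 1000 highest-variance genes, then compute principal
    # components; their scores become the regression variables.
    v <- apply(X, 2, var)
    top <- order(v, decreasing = TRUE)[1:1000]
    pc <- prcomp(X[, top], center = TRUE, scale. = TRUE)
    Z <- pc$x[, 1:10]  # scores of the first 10 components, for example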

The sixth algorithm uses simple univariate partitioning. It begins by sorting the training arrays according to the expression level of a given gene. With an ideal gene, the first k sorted arrays would consist entirely of one class, while the last m − k arrays would contain samples belonging only to the second class. However, for most genes, the first k arrays will contain some number of samples from the second class, and the last m − k arrays will contain the same number of samples from the first class. The algorithm ranks genes according to the number of arrays that fall into the correct group when the sorted list is split into the first k and last m − k members. Because the mean gene expression level in arrays from the first class can be either greater than or less than that of the second class, the arrays are sorted in both ascending and descending order, and whichever yields a better split is used. Because there are only min(k, m − k) possible scores using this algorithm, there will be many ties among genes. Ties are broken either by ranking genes according to correlation coefficient (SPLITR) or by ranking them according to their t-statistics (SPLITT).
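The partitioning score for a single gene can be sketched as follows (names are hypothetical; k is taken as the size of class 1, and ties between genes would still need to be broken by correlation coefficient or t-statistic as described above):

    # Score one gene: sort arrays by its expression, cut the sorted list
    # after the first k arrays, and count how many arrays land in the
    # correct group; try both sort directions and keep the better one.
    split.score <- function(g, y) {
      k <- sum(y == 1)
      ord <- y[order(g)]   # labels in ascending order of expression
      asc  <- sum(ord[1:k] == 1) + sum(ord[-(1:k)] == 0)
      desc <- sum(rev(ord)[1:k] == 1) + sum(rev(ord)[-(1:k)] == 0)
      max(asc, desc)
    }
    scores <- apply(X, 2, split.score, y = y)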

For each of these base algorithms, there were several choices for the number of genes or variables to use in the logistic regression models. These included the top 1, 2, 3, 4, 5, 10, and m − 1 genes (denoted as 1, 2, 3, 4, 5, 10, and M, respectively). Stepwise variable selection was tested in two ways. First, m − 1 genes are pre-selected using a combination of the methods listed above, and these are then further selected by stepwise selection (MS). The second approach uses stepwise selection starting with 10 pre-selected genes (10S). Stepwise variable selection was performed using the default parameters of the stepAIC function within R. Throughout the rest of this paper, we refer to the actual models used as combinations of the abbreviations for the base algorithms and the number of genes or variables used. For example, RSQ_5 uses the top five genes based on their correlation coefficients. RAND_MS takes m − 1 random genes, then uses stepwise selection to further reduce this number.

Table 5  The 10 data sets for which logistic regression models were built

Data set        Total   Training  Test    Genes/  Classes  Training/  Description
                arrays  arrays    arrays  array            class
Arbeitman       75      50        25      3126    5        10         Drosophila life cycle
Bhattacharjee   156     113       43      12600   2        56.5       Lung carcinomas
Golub           72      38        34      7129    2        19         Leukemia (ALL and AML)
Khan            88      63        25      2308    4        15.75      SRBCT
Pomeroy1        42      25        17      7129    5        5          CNS tumors (various)
Pomeroy2        60      39        21      7129    2        19.5       Medulloblastomas
Pomeroy3        60      40        20      7129    2        20         CNS tumors (survival)
Ramaswamy1      64      42        22      16063   6        7          Solid tumor metastasis
Ramaswamy2      198     144       54      16063   18       8          Various cancers
Singh           102     68        34      12600   2        34         Prostate cancer
Average         91.7    62.2      29.5    9127.6  4.8      19.475

Listed are the 10 data sets used in this study: Arbeitman [19]; Bhattacharjee [20]; Golub et al. [2]; Khan [11]; Pomeroy1, Pomeroy2, and Pomeroy3 [22]; Ramaswamy1 [23]; Ramaswamy2 [24]; and Singh [21]. For each data set, the following are given: the total number of microarrays in the data set, the number of training arrays, the number of test arrays, the number of genes per array, the number of different classes or categories represented in the training set, the number of training arrays per class, and a brief description of the types of classes. Means across all 10 data sets are listed at the bottom of the table.


VAR_SPLITR_10S takes the top 10% of genes ranked by variance, sorts this list by how well splitting the training arrays divides the two classes, breaks ties by choosing the genes with the highest correlation coefficients, selects the top 10 genes, and finally reduces this number using stepwise selection. Note that RSQ_MS is equivalent to the two-step algorithm described in the first part of this paper.

As before, the statistical software package R was used to implement each algorithm and apply it to the various data sets. For each algorithm and data set pair, the percentage of correctly classified test and training arrays was recorded. To compare algorithms, the averages of these two metrics across all 10 data sets were calculated.
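This evaluation loop can be summarized as in the sketch below; fit.model and predict.class are hypothetical stand-ins for the routines actually used, not functions from the paper or from base R:

    # For each (algorithm, data set) pair: build the model on the training
    # arrays, then record the fraction of training and test arrays whose
    # predicted class matches the actual class.
    evaluate <- function(selector, datasets) {
      acc <- sapply(datasets, function(ds) {
        model <- fit.model(selector, ds$train.X, ds$train.y)  # hypothetical
        c(train = mean(predict.class(model, ds$train.X) == ds$train.y),
          test  = mean(predict.class(model, ds$test.X)  == ds$test.y))
      })
      rowMeans(acc)  # average the two metrics across all 10 data sets
    }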

The first set of algorithms we tested consisted of 11 types: RAND, VAR, RSQ, TSTAT, PCA, SPLITR, SPLITT, VAR_RAND, VAR_RSQ, VAR_TSTAT, and VAR_SPLITR. For each type, we built two models: one that used as many variables as possible, m − 1, where m is the number of training arrays (M); and one that pre-selected 10 variables and then used stepwise variable selection to reduce the number of variables to as few as possible (10S). The average numbers of correctly classified training and test arrays for each of the 22 models are listed in Table 6 and Fig. 1.

Next, we chose four of the algorithm types (SPLITR, RSQ, TSTAT, and PCA), and for each we sought to find the optimal number of genes that should be used in the logistic regression equations. This was primarily done to examine the effects of over-fitting.

Table 6  Comparison of variable selection methods

Model            Training  Test     Number
                 arrays    arrays   of genes

Averages across all 10 data sets
SPLITR_10S       0.979     0.680    2.248
RSQ_10S          0.977     0.651    2.826
TSTAT_10S        0.957     0.571    2.873
RAND_10S         0.759     0.450    3.943
VAR_SPLITR_10S   0.977     0.624    3.208
VAR_RSQ_10S      0.970     0.635    3.210
VAR_TSTAT_10S    0.938     0.580    3.332
VAR_RAND_10S     0.803     0.501    4.012
PCA_10S          0.905     0.606    3.308
SPLITT_10S       0.990     0.680    2.964
VAR_10S          0.773     0.510    3.596
SPLITR_M         1.000     0.477    60.661
RSQ_M            1.000     0.517    60.711
TSTAT_M          1.000     0.406    60.667
RAND_M           1.000     0.375    61.194
VAR_SPLITR_M     1.000     0.433    60.722
VAR_RSQ_M        1.000     0.440    60.606
VAR_TSTAT_M      1.000     0.389    60.594
VAR_RAND_M       1.000     0.362    61.159
PCA_M            1.000     0.728    61.200
SPLITT_M         1.000     0.446    60.694
VAR_M            1.000     0.437    60.600

Listed are the percentages of correct classifications for the training and test arrays, and the average number of genes per logistic regression equation, for models based on 11 different variable selection algorithms when either 10 pre-selected genes are filtered by stepwise selection (10S) or m − 1 genes are selected, where m is the number of training microarrays (M). The number of genes for the M models should ideally be 61.2; however, in large data sets with many arrays, the R statistical package occasionally could not construct models that used all m − 1 variables when there was insufficient independence among the genes.

Table 7  Effects of over-fitting

Variables  SPLITR  RSQ     TSTAT   PCA

Averages across all 10 data sets
Percent correctly classified: training arrays
1          0.875   0.858   0.780   0.480
2          0.922   0.903   0.832   0.607
3          0.953   0.933   0.882   0.688
4          0.956   0.950   0.907   0.744
5          0.962   0.961   0.913   0.802
10S        0.979   0.977   0.957   0.905
10         0.983   0.981   0.963   0.878
MS         1.000   1.000   1.000   1.000
M          1.000   1.000   1.000   1.000

Percent correctly classified: test arrays
1          0.665   0.663   0.550   0.381
2          0.699   0.677   0.567   0.468
3          0.698   0.693   0.629   0.547
4          0.694   0.656   0.625   0.587
5          0.694   0.680   0.614   0.623
10S        0.680   0.651   0.571   0.606
10         0.661   0.663   0.615   0.631
MS         0.666   0.693   0.571   0.585
M          0.477   0.517   0.406   0.728

Number of variables per model
10S        2.248   2.826   2.873   3.308
MS         2.924   3.196   3.329   3.905

Listed are the percentages of correctly classified training and test microarrays for four different variable selection algorithms (SPLITR, RSQ, TSTAT, and PCA) and nine choices of variable number: 1, 2, 3, 4, 5, and 10 variable models; m − 1 variables, where m is the number of training arrays (M); and stepwise variable selection starting with either 10 (10S) or m − 1 (MS) pre-selected variables. At the bottom of the table, the average number of variables per logistic regression equation is listed for the 10S and MS cases.


In addition to M and 10S, for the four algorithm types we created models with 1, 2, 3, 4, 5, and 10 variables, and we applied stepwise variable selection to the m − 1 genes selected in the M models (MS). The average numbers of correctly classified training and test arrays for each of these 36 models are listed in Table 7. Also given is the average number of genes per logistic regression equation for the 10S and MS models.

6. Comparing variable selection methods

Table 6 shows that the 11 M models correctly classified every training array in all 10 data sets. However, they did quite poorly on the test arrays: only two correctly classified more than half of them. The RSQ_M model correctly classified 51.7% of the test arrays. The outlier was the PCA_M model, which correctly classified 72.8% of the test arrays.

The 10S models correctly classified nearly all the training arrays. Although they were not as accurate as the M models on the training arrays, they did significantly better (P < 0.001) on the test arrays. All but random variable selection correctly identified more than 50% of the test arrays, with

SPLITR_10S having the highest percentage at 68.0%. SPLITR_10S also had the fewest average number of genes per logistic regression equation (2.248). In general, the fewer the genes, the better the model classified the training arrays (R-squared = 0.73).

The RAND models are a baseline. Any variable selection model worth considering must at least do better than the RAND models, and in fact all but VAR_RAND_M do. A final point to note about Table 6 is that, in general, adding the initial step of filtering all but the top 10% of genes with the greatest variance made the models less accurate. However, it would be interesting to study the effect of different variance cut-offs or other filtering methods, such as eliminating genes whose expression level is below a certain threshold (which is a common procedure in this type of analysis).

Table 7 compares the effect of the number of variables used on the performance of the logistic regression models. Both the M and MS models correctly classify all training arrays. However, instead of an average of 62 genes per logistic regression equation in the M models, the MS models used only 3 or 4. While the conditionally univariate MS algorithm correctly classified 100% of the training arrays, the simple univariate algorithms with three or four genes ranged from 68.8 to 95.6%. With the simple univariate algorithms, in all cases, the more variables, the better the models classified the training arrays.

Figure 1  Comparison of variable selection algorithms (bar chart, "11 Model Summary"; y-axis: percent correctly classified; x-axis: variable selection algorithm, sorted by TEST 10S rank; series: TRAIN M, TRAIN 10S, TEST M, TEST 10S). The percentages of training and test microarrays correctly classified by logistic regression models based on 11 different variable selection algorithms are shown. For each algorithm, two variations are given: the first uses m − 1 variables, where m is the number of training microarrays (M); the second begins by pre-selecting 10 variables, then further reducing the number by stepwise variable selection (10S). All models based on m − 1 variables perfectly fit the training arrays; the 10S models do slightly worse. As expected, the models are much less accurate in predicting the test arrays. Interestingly, though, the 10S models outperform the M models on the test arrays in all cases except the PCA models. Although none of the 10S models correctly classify more test arrays than the PCA_M model, they use only two or three genes compared to the 1000 used by the PCA_M model. The top-ranked 10S model, SPLITR_10S, is not far behind the PCA_M model in terms of the percentage of test arrays correctly classified.



The story is different, though, with the test arrays. The PCA_M model had the highest average percentage of correctly classified test arrays (72.8%). The second best method, though, was SPLITR_2, with 69.9% correct. RSQ_3 was close behind with 69.3%, and the best TSTAT algorithm also used three genes. For an improvement of less than 3%, it hardly seems worth the 998 additional genes required by the PCA_M model compared to the SPLITR_2 model. Furthermore, PCA only works well when many principal components are used. As the number of principal components used drops, the predictive ability of the PCA models approaches that of the RAND models. On the other hand, even with just one gene, both the SPLITR_1 and RSQ_1 models correctly classify about two-thirds of the test arrays. The fact that stepwise selection in the 10S and MS models results in only two to four genes also confirms that very few genes are needed for logistic regression models if the correct variable selection algorithm is chosen.

It should be noted that although the M models are the best in correctly classifying the training arrays, they are nearly 20% worse on the test arrays than any other model that uses 10 or fewer variables. This suggests that over-fitting is a significant factor in the performance of logistic regression, and thus variable selection is essential.

    7. Discussion

There are evidently trade-offs in terms of model parsimony, time efficiency, and classification performance. In our implementation, there are limitations inherent to the logistic function: a univariate pre-selection step that eliminates a large number of variables, and the subsequent simple conditionally univariate selection that trims most of the pre-selected variables. We have shown, however, that despite all this, the classification performance is not significantly affected, and that the selected genes are reasonable in our two initial data sets. By then applying our proposed algorithm, and variations of the variable selection step, to eight additional data sets, we show that logistic regression performs well on different types of data, and that while performance depends greatly on the method of variable selection used, the two-step method used on the Golub and Khan data sets was among the best tested. It is important to note that few publications have addressed this crucial problem of variable selection in microarray data analysis. In none of them have different methods been compared as in this study.


Our results should be viewed with caution and placed in the proper context: we have proposed to use a well-known statistical regression model for quick, first-step analysis of gene expression data and preliminary selection of candidate disease markers, not as a tool to investigate gene networks. The selected genes are just the tip of the iceberg, and the procedure cannot be used to explore deeper, more complex gene interactions that may in fact constitute the basis for the development of disease. Hence our distinction between genes that can serve as markers and genes that can be described as causes of a certain disease. The genes we identified may not be directly involved in causing the disease, but their expression levels at the snapshot in time in which they were measured may be sufficient to characterize the disease with a certain confidence. The same limitation applies to most of the literature reporting on microarray data analysis.

Other authors have proposed to use logistic regression for classification of gene expression. The approach based on PCA suggested by West et al. [1] is powerful because it allows for indirect multivariate variable selection; however, it potentially loses information, as the selected principal components do not account for the total variance. As we argue above, PCA is potentially ill-suited as a dimensionality-reducing technique for supervised learning. As pointed out by Yeung and Ruzzo [17], PCA often degrades cluster quality. Our second series of experiments showed that PCA performs quite poorly unless many principal components are used in constructing the logistic regression equations.

We believe the main advantages of our implementation are its simplicity, its reliance on platform-independent software, and its user-friendly interface. The analysis is quick, and the underlying algorithm is well known and widely accepted. It constitutes an important contribution to the machine learning community because it establishes a benchmark that can serve as a basis of comparison for new or newly-rediscovered algorithms. The implementation of this tool is a web-based program named Classify-R, which can be found at http://134.174.53.82/classifyr. It allows users to upload a file containing training and test microarray data via an Upload Data page (Fig. 2a). On the Design Experiment page, users indicate which arrays are the test arrays. Classify-R will then calculate the logistic regression models, and on the View Results page it will display the coefficients (Fig. 2b). Users may additionally view the full output from R and the predicted classification of each of the microarrays in their data sets. The framework is simple and somewhat related to the one proposed by Hastie et al. [18]. The main difference is that in our implementation there is no clustering step (each gene is a cluster), and its scope is more limited.



We transformed a multi-class classification problem into a combination of binary classification problems. The use of independent logistic regression models to classify cases can only be defended in terms of simplicity and speed of implementation; we plan to implement polychotomous logistic regression for multi-category problems. We have only dealt with diagnostic classification in our examples. We plan to extend the proposed framework to prognostic tasks by developing Cox proportional hazards and non-parametric survival analysis models. We believe that without an established benchmarking tool for the analysis of large data sets with thousands of variables and dozens of cases, there will be continued use (and misuse) of new algorithms for supervised classification without head-to-head comparison to simpler algorithms. Despite the excitement surrounding microarrays, we believe that it is extremely important for the bioinformatics and artificial intelligence communities to be diligent about determining the value of new algorithms. The work presented in this article is an initial step towards that goal.

    Acknowledgements

This work was funded by the National Library of Medicine (R01-LM07273) and the John and Virginia Taplin research fellowship from the Division of Health Sciences and Technology, Harvard/MIT.

    References

[1] West M, et al. DNA microarray data analysis and regression modeling for genetic expression profiling. Tech Report, Duke University; 2000.

[2] Golub TR, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286(5439):531-7.

Figure 2  Screen shots of Classify-R, an internet web site we developed to implement logistic regression with stepwise variable selection. (a) Users upload a data file containing the gene expression values for each of their training and test microarrays. Classify-R will then use the training arrays to develop the regression model, and it will use the test arrays to evaluate the model. (b) The coefficients of the logistic regression equations are displayed. Links at the bottom of the page allow users to view the predicted classifications of the model for each of the training and test arrays, and to access additional statistical information about the regression analysis. If gene IDs are specified in the input data set, then on the results page the variables listed in the tables will be links; if a gene is clicked, GenBank will automatically be searched for information about the gene.


[3] Alizadeh AA, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000;403(6769):503-11.

[4] D'Haeseleer P, Liang S, Somogyi R. Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics 2000;16(8):707-26.

[5] Furey TS, et al. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000;16(10):906-14.

[6] Yeang CH, et al. Molecular classification of multiple tumor types. Bioinformatics 2001;17(Suppl 1):S316-22.

[7] Zhang H, et al. Recursive partitioning for tumor classification with gene expression microarray data. Proc Natl Acad Sci USA 2001;98(12):6730-5.

[8] Dudoit S, Fridlyand J, Speed T. Comparison of discrimination methods for the classification of tumours using gene expression data. Tech Report #576. Berkeley: University of California; 2000.

[9] Ben-Dor A, et al. Tissue classification with gene expression profiles. J Comput Biol 2000;7(3-4):559-83.

[10] Hosmer DW, Lemeshow S. Applied logistic regression, second ed. Wiley series in probability and statistics. New York: Wiley; 2000.

[11] Khan J, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 2001;7(6):673-9.

[12] Akaike H. Information theory and an extension of the maximum likelihood principle. In: Second International Symposium on Information Theory. Budapest: Akademiai Kiado; 1973.

[13] Bo T, Jonassen I. New feature subset selection procedures for classification of expression profiles. Genome Biol 2002;3(4):Epub.

[14] Dua T, et al. Adhalin deficiency: an unusual case of muscular dystrophy. Indian J Pediatr 2001;68:1083-5.

[15] Urashima M, et al. Restoration of p16INK4A protein induces myogenic differentiation in RD rhabdomyosarcoma cells. Br J Cancer 1999;79(7-8):1032-6.

[16] Caron PC, Dumont L, Scheinberg DA. Supersaturating infusional humanized anti-CD33 monoclonal antibody HuM195 in myelogenous leukemia. Clin Cancer Res 1998;4(6):1421-8.

[17] Yeung KY, Ruzzo WL. Principal component analysis for clustering gene expression data. Bioinformatics 2001;17(9):763-74.

[18] Hastie T, et al. Supervised harvesting of expression trees. Genome Biol 2001;2(1):RESEARCH0003.

[19] Arbeitman MN, et al. Gene expression during the life cycle of Drosophila melanogaster. Science 2002;297:2270-5.

[20] Bhattacharjee A, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA 2001;98:13790-5.

[21] Singh D, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 2002;1:203-9.

[22] Pomeroy SL, et al. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 2002;415:436-42.

[23] Ramaswamy S, et al. Evidence for a molecular signature of metastasis in primary solid tumors; 2002.

[24] Ramaswamy S, et al. Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci USA 2001;98:15149-54.
