
Application of Boosting Classification and Regression to Modeling the Relationships Between Trace Elements and Diseases

Chao Tan & Hui Chen & Wanping Zhu

Received: 25 June 2009 / Accepted: 14 July 2009 / Published online: 23 July 2009
© Humana Press Inc. 2009

Abstract The study of the relationship between trace elements and diseases often requires building a classification/regression model. Furthermore, the accuracy of such a model is of particular importance and directly decides its applicability. The goal of this study is to explore the feasibility of applying boosting, i.e., a new strategy from machine learning, to model the relationship between trace elements and diseases. Two examples are employed to illustrate the technique in classification and regression, respectively. The first example involves the diagnosis of anorexia according to the concentrations of six elements (i.e., a classification task). Decision stump and support vector machine are used as the weak/base algorithm and the reference algorithm, respectively. The second example involves the prediction of breast cancer mortality based on the intake of trace elements (i.e., a regression task). Here, partial least squares is used both as the weak/base algorithm and as the reference algorithm. The results from both examples confirm the potential of boosting in modeling the relationship between trace elements and diseases.

Keywords Trace element · Boosting · Anorexia · Breast cancer · Classification · Regression

Biol Trace Elem Res (2010) 134:146–159
DOI 10.1007/s12011-009-8468-9

C. Tan (*) · W. Zhu
Department of Chemistry and Chemical Engineering, Yibin University, Yibin 644007, People’s Republic of China
e-mail: [email protected]

C. Tan
Key Laboratory of Computational Physics, Yibin University, Yibin 644007, People’s Republic of China

H. Chen
Hospital, Yibin University, Yibin 644007, People’s Republic of China

H. Chen
Clinical College, North Sichuan Medical College, Nanchong 637000, People’s Republic of China


Introduction

It is known that some trace elements play important roles in the biochemical processes of the human body and should be kept at optimum biological levels [1]. There are considerable reports that the disorder of trace element homeostasis leads to different diseases [2]. Thus, the study of the relationship between trace elements and diseases is valuable and has attracted many researchers over the past two decades [3–13]. However, such a relationship is quite complicated and is difficult to explain satisfactorily by investigating only one or a few trace elements, owing to interactions among the various trace elements [14]. Therefore, a classification/regression tool is often used to model the qualitative/quantitative relationship between trace elements and diseases [15–17]. Seeking an appropriate tool for such tasks has been a main mission in chemometrics. Up to now, for quantitative/regression modeling, partial least squares (PLS) is the most widely applied method because of its ability to reduce the interactions among variables [18], while for qualitative/classification modeling, many newly developed methods such as various neural networks [19] and support vector machine (SVM) [20] are available. For either classification or regression, the accuracy of a model is of particular importance and directly decides its applicability [21]. Traditionally, each of the above-mentioned methods builds a single model, i.e., one prediction rule, from the training data. In many situations, it is very difficult to obtain satisfactory accuracy in this way.

Recently, the so-called ensemble [22], which is based on the concept of building a series of models rather than a single model, has shown interesting properties for regression/classification modeling. It can transform a set of weak models, each of which performs better than random guessing but not as well as one would like, into a stronger one, thereby leading to increased accuracy and stability of the predictors [23]. Boosting, proposed by Freund and Schapire [24], is one of the popular ensemble techniques and has been successfully applied in various fields [25–28]. Boosting was originally developed to solve classification problems, but more recently it has been extended to the domain of regression. To use boosting, one must first select a certain “weak” or “base” learning algorithm and then call it repeatedly, each time feeding it a different subset of the training samples (or, to be more precise, a different distribution or weighting over the training samples). Each time it is called, a weak prediction rule/model is built, and after many rounds, all models are combined into a single prediction rule/model by a weighted vote/median that, hopefully, will be much better than any one of the weak models. In boosting, the samples misclassified or predicted with a large error by the preceding rules are given the most weight, which makes the later rules focus their attention on the “hardest” samples. It is precisely this weight-updating scheme that allows boosting to provide better and more reliable prediction than the individual models in most situations.

Motivated by the merits of boosting, the goal of this study is to explore the feasibility of applying boosting classification/regression to the study of the relationship between trace elements and diseases. Two examples are employed for illustration. The first example involves the diagnosis of anorexia according to the concentrations of six elements, while the second example involves the prediction of breast cancer mortality based on the intake of trace elements. Decision stump and partial least squares are used as the weak/base algorithm in boosting classification and regression modeling, respectively. For boosting classification, a comparison is performed with respect to support vector machine. For boosting regression, PLS is used as the reference. The results from both examples confirm the potential of boosting classification and regression in such modeling tasks.


Theory and Algorithm

Boosting Classification

The most popular boosting algorithm, i.e., AdaBoost [24, 25, 29], is used in this paper. Suppose there is a two-class classification problem. A training set has N samples $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, N$, each having p predictor variables, i.e., $x^{(i)} = (x_{i1}, \ldots, x_{ip})$. The samples are from two categories. One class is denoted by y=1 and the other is denoted by y=−1. The algorithm of AdaBoost can be implemented as follows:

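1. Initialize the weights of all training samples: $w_{1i} = 1/N,\; i = 1, 2, \ldots, N$.

2. For t=1 to T, repeat the following steps:

a. Fit a classifier $G_t(x)$ to the training data based on the current weights $w_{ti}$.

b. Compute:

$$\mathrm{err}_t = \frac{\sum_{i=1}^{N} w_{ti}\, I\left(y^{(i)} \neq G_t\left(x^{(i)}\right)\right)}{\sum_{i=1}^{N} w_{ti}} \qquad (1)$$

c. Compute the weight of classifier t:

$$\alpha_t = \frac{1}{2}\log\left[\frac{1 - \mathrm{err}_t}{\mathrm{err}_t}\right] \qquad (2)$$

d. Update the weights of all training samples:

$$w_{(t+1)i} = w_{ti}\exp\left[\alpha_t\, I\left(y^{(i)} \neq G_t\left(x^{(i)}\right)\right)\right],\quad i = 1, 2, \ldots, N \qquad (3)$$

3. Output:

$$G(x) = \mathrm{sign}\left[\sum_{t=1}^{T} \alpha_t G_t(x)\right] \qquad (4)$$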

Here, I(A) is the indicator function, i.e., I(A)=1 if A is true and I(A)=0 otherwise, and sign(·) is the sign function, i.e., sign(A)=1 if A>0 and sign(A)=−1 otherwise.

AdaBoost sequentially applies the same classification algorithm to repeatedly modified versions of the training data, thereby producing a sequence of classifiers Gt(x). Then, for a future sample, the final category is obtained from the combination of these classifiers by a weighted vote, as indicated in Eq. (4). The modified versions of the training data are obtained by assigning different weights to the training samples at every round. Initially, all samples are assigned the same weight. At the tth round, the weights of the training samples are updated as follows: the weights of correctly classified samples are decreased while those of misclassified samples are increased. The weight αt reflects the reliability of the tth classifier, i.e., the higher the value, the more reliable the classifier is. To optimize T, cross-validation is often applied in practice.
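As a small worked illustration of Eqs. (2) and (3), consider two hypothetical weighted error rates (the values 0.30 and 0.10 are chosen only for illustration, and the logarithm is assumed to be the natural logarithm, which the text does not state):

$$\mathrm{err}_t = 0.30:\quad \alpha_t = \tfrac{1}{2}\ln\frac{0.70}{0.30} \approx 0.424,\qquad e^{\alpha_t} = \sqrt{0.70/0.30} \approx 1.53$$

$$\mathrm{err}_t = 0.10:\quad \alpha_t = \tfrac{1}{2}\ln\frac{0.90}{0.10} \approx 1.099,\qquad e^{\alpha_t} = \sqrt{0.90/0.10} = 3$$

Under these assumptions, a classifier with a lower weighted error receives a larger vote in Eq. (4), and the samples it misclassifies have their weights multiplied by a larger factor in Eq. (3). Because Eq. (1) divides by the sum of the weights, the correctly classified samples lose weight in relative terms even though Eq. (3) leaves their raw weights unchanged.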

In this study, we use the decision stump as the base classifier algorithm. A decision stump is a one-level binary decision tree, i.e., a tree with only one split. Although a single decision stump often exhibits a relatively high error rate, the combination of stumps by a weighted vote is expected to yield a very accurate prediction.
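The steps of Eqs. (1)–(4) with decision stumps can be sketched as follows. This is a minimal NumPy illustration, not the statistical pattern recognition toolbox used in the study; the function names (`fit_stump`, `adaboost_fit`, `adaboost_predict`), the exhaustive threshold search, and the clipping of the error rate are assumptions made for the sketch.

```python
import numpy as np

def fit_stump(X, y, w):
    """Exhaustive search for the one-split rule with the lowest weighted error.
    Returns (feature index, threshold, polarity); a hypothetical helper."""
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(X[:, j] <= thr, -pol, pol)
                err = np.sum(w * (pred != y))
                if err < best[3]:
                    best = (j, thr, pol, err)
    return best[:3]

def stump_predict(stump, X):
    j, thr, pol = stump
    return np.where(X[:, j] <= thr, -pol, pol)

def adaboost_fit(X, y, T):
    """y must be coded as +1 / -1, as in the text."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # step 1: equal weights
    stumps, alphas = [], []
    for _ in range(T):
        stump = fit_stump(X, y, w)               # step 2a
        miss = stump_predict(stump, X) != y
        err = np.sum(w * miss) / np.sum(w)       # Eq. (1)
        err = np.clip(err, 1e-10, 1 - 1e-10)     # guard against log(0); an assumption
        alpha = 0.5 * np.log((1 - err) / err)    # Eq. (2), natural log assumed
        w = w * np.exp(alpha * miss)             # Eq. (3)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_predict(stumps, alphas, X):
    score = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.where(score > 0, 1, -1)            # Eq. (4)
```

A call such as `adaboost_fit(X_train, y_train, T=18)` followed by `adaboost_predict(stumps, alphas, X_test)` would mirror the ensemble size used later in the paper, although the stump search here may differ from the toolbox's implementation.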


Boosting Regression

There exist several versions of boosting for regression applications, such as AdaBoost.R [30] and stochastic gradient boosting [31]. The former is the most prominent and popular and is therefore introduced in this study. Let us consider a regression ensemble with T models and a training set with N observations/samples. The boosting regression can be performed as follows:

1. Initialize the weights of all training samples: $w_i^{(1)} = 1,\; i = 1, 2, \ldots, N$.

2. For t=1 to T, repeat the following steps:

a. Calculate the probability of being picked for each sample:

$$p_i^{(t)} = \frac{w_i^{(t)}}{\sum_{j=1}^{N} w_j^{(t)}} \quad (i = 1, 2, \ldots, N) \qquad (5)$$

b. Build a (weak) regression model, i.e., $G_t: x \to y$, on the current training set, which is produced by sampling with probability $p_i^{(t)}$.

c. Calculate the square-loss value for each sample:

$$L_i = \frac{\left|y_i^{(t)} - y_i\right|^2}{\max_j \left|y_j^{(t)} - y_j\right|^2} \quad (i, j = 1, 2, \ldots, N) \qquad (6)$$

Here, $y_i^{(t)}$ and $y_i$ are the predicted and observed values for the ith sample, and the denominator is the square of the maximum residual between the predicted and observed values.

d. Calculate the average loss of the current regression model:

$$L^{(t)} = \sum_{i=1}^{N} L_i\, p_i^{(t)} \qquad (7)$$

e. Calculate the confidence measure of the current regression model:

$$\beta^{(t)} = \frac{L^{(t)}}{1 - L^{(t)}} \qquad (8)$$

f. Update the weight of each sample:

$$w_i^{(t+1)} = w_i^{(t)} \left(\beta^{(t)}\right)^{(1 - L_i)} \qquad (9)$$

3. After T boosting cycles, T weak regression models are built. For a future sample, the T weak models give T predictions, and the final prediction is generated by a weighted median (this combination step is the so-called boosting), computed as follows:

a. Sort the T predictions $y_i^{(t)},\; t = 1, 2, \ldots, T$, for the ith sample in increasing order:

$$y_i^{(k_1)} \le y_i^{(k_2)} \le \cdots \le y_i^{(k_T)} \qquad (10)$$

where $k_1, k_2, \ldots, k_T$ is a permutation of 1, 2, …, T.

b. Sum $\log\left(1/\beta^{(k_t)}\right)$ over t until the following inequality is first satisfied:

$$\sum_{t=1}^{r} \log\frac{1}{\beta^{(k_t)}} \ge \frac{1}{2}\sum_{t=1}^{T} \log\frac{1}{\beta^{(k_t)}} \qquad (11)$$

Then, the prediction from the $k_r$th weak regression model is taken as the ensemble prediction for the ith sample.
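The regression procedure of Eqs. (5)–(11) can be sketched in NumPy as below. This is only an illustration under stated assumptions: the weak learner is ordinary least squares standing in for the PLS model used in the study, all function names are hypothetical, and the small numerical guards (on the maximum residual and on the average loss) are additions that the text does not describe.

```python
import numpy as np

def fit_linear(X, y):
    """Stand-in weak learner (assumption): least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def boost_regression_fit(X, y, T, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.ones(n)                                       # step 1
    models, betas = [], []
    for _ in range(T):
        p = w / w.sum()                                  # Eq. (5)
        idx = rng.choice(n, size=n, replace=True, p=p)   # resample with probability p
        coef = fit_linear(X[idx], y[idx])                # step 2b
        resid2 = (predict_linear(coef, X) - y) ** 2
        L = resid2 / max(resid2.max(), 1e-12)            # Eq. (6), guarded denominator
        # Eq. (7); clipping keeps 0 < beta < 1 (numerical guard, an assumption)
        L_avg = np.clip(np.sum(L * p), 1e-10, 0.5 - 1e-10)
        beta = L_avg / (1.0 - L_avg)                     # Eq. (8)
        w = w * beta ** (1.0 - L)                        # Eq. (9)
        models.append(coef)
        betas.append(beta)
    return models, np.array(betas)

def weighted_median(values, log_inv_beta):
    """Eqs. (10)-(11): sort the predictions, then accumulate log(1/beta)."""
    order = np.argsort(values)
    csum = np.cumsum(log_inv_beta[order])
    r = np.searchsorted(csum, 0.5 * csum[-1])            # first r reaching half the total
    return values[order[r]]

def boost_regression_predict(models, betas, X):
    preds = np.array([predict_linear(c, X) for c in models])   # shape (T, n)
    log_inv_beta = np.log(1.0 / betas)
    return np.array([weighted_median(preds[:, i], log_inv_beta)
                     for i in range(preds.shape[1])])
```

Replacing `fit_linear` and `predict_linear` with a PLS fit and predict pair would bring the sketch closer to the boosting PLS model described later, provided the PLS routine accepts the resampled training set in the same way.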

Experimental

Dataset

In this study, two datasets are used to verify the potential of boosting in classification and regression tasks, respectively.

The first dataset involves the diagnosis of anorexia and is taken from the work of Zhao et al. [32]. It contains 90 cases (62 nonanorexic cases and 28 anorexic cases). The diagnosis of each case was determined by medical experts according to clinical symptoms over about 1 month. Nonanorexic and anorexic cases are marked with the numbers 1 and 2, respectively. Each case is described by eight features: gender, age, and the concentrations of six elements (in hair), i.e., Zn, Fe, Mg, Cu, Ca, and Mn, which were assayed by Inductively Coupled Plasma 400 (ICP) (Perkin Elmer, USA). The statistical description of this dataset, including minimum, maximum, mean, and standard deviation (SD), is shown in Table 1.

The second dataset involves the prediction of breast cancer mortality based on the intake of several trace elements and has been published by Zhou et al. [33]. This dataset was collected from residents of 27 countries/regions; each sample is described by the intakes of seven elements, i.e., Se, Cu, Zn, Cd, Cr, Mn, and As (mg per year), and the corresponding mortality (1/100,000). The statistical description of this dataset is shown in Table 2.

Table 1 The Statistical Description of Dataset 1

| Group | Statistic | Zn (μg/g) | Fe (μg/g) | Mg (μg/g) | Cu (μg/g) | Ca (μg/g) | Mn (μg/g) |
|---|---|---|---|---|---|---|---|
| In all cases | Minimum | 6.5 | 10 | 8 | 6 | 250 | 0.47 |
| | Maximum | 220 | 115 | 200 | 93 | 4,000 | 6.5 |
| | Mean | 99.90 | 45.01 | 76.92 | 14.63 | 1,214.95 | 2.11 |
| | SD | 40.91 | 21.68 | 43.98 | 10.79 | 702.99 | 1.02 |
| Nonanorexic | Minimum | 42 | 15 | 16 | 8 | 250 | 0.69 |
| | Maximum | 220 | 115 | 200 | 93 | 4,000 | 6.5 |
| | Mean | 113.79 | 48.34 | 89.82 | 15.66 | 1,424.11 | 2.21 |
| | SD | 37.33 | 22.52 | 43.84 | 12.09 | 734.61 | 0.98 |
| Anorexic | Minimum | 6.5 | 10 | 8 | 6 | 336 | 0.47 |
| | Maximum | 138 | 86 | 155 | 35 | 1,200 | 5.7 |
| | Mean | 69.14 | 37.64 | 48.36 | 12.34 | 751.82 | 1.89 |
| | SD | 28.42 | 17.92 | 28.38 | 6.76 | 260.84 | 0.99 |
| P value | – | 6.92×10^−8 | 0.0295 | 1.51×10^−5 | 0.178 | 5.65×10^−6 | 0.0105 |


Software and Computation

All calculations were performed with MATLAB 7.0 under Windows XP on a Pentium IV with 256 RAM. The boosting classification was implemented with the statistical pattern recognition toolbox (http://cmp.felk.cvut.cz/cmp/software/stprtool), while the boosting regression was performed with a program written in-house.

Results and Discussion

Sample Set Partitioning

It is well known that the selection of a representative training set is a crucial step in modeling. Thus, one often needs to split a given dataset into a training set and a test set. The latter is indispensable for evaluating a model’s characteristics. Strictly speaking, the evaluation is valid only if the test set has the same information distribution as the training set. To achieve this, a representative sample selection method is combined with an alternating re-sampling. Specifically, for the classification application (the first dataset), the Kennard and Stone (KS) algorithm is first used to rank the samples of each class, resulting in two sequences (corresponding to the 62 nonanorexic and 28 anorexic cases, respectively). Next, the alternating re-sampling extracts two of every three samples from the two sequences to form the training set, while the remaining samples constitute the test set. However, a shortcoming of the KS algorithm is that the information in the dependent variable is not used during sample selection, which is not reasonable in regression. Therefore, for the second dataset, we use a newly developed algorithm named SPXY [34] (sample set partitioning based on joint x–y distances), which extends the KS algorithm by encompassing both x- and y-information in the distance calculation, to replace KS as the algorithm for sorting samples into a sequence. Similarly, this dataset is finally divided into a training set with 18 samples and a test set with nine samples, i.e., a 2:1 division.
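The partitioning described in this subsection can be illustrated with a short sketch. It assumes Euclidean distances for the Kennard and Stone ranking, and it assumes that the middle sample of every three ranked samples goes to the test set; the text does not state which of the three is held out, although this particular choice happens to give the test-set sizes reported in the paper (21 of 62, 9 of 28, and 9 of 27). Function names are hypothetical, and SPXY is not shown.

```python
import numpy as np

def kennard_stone_order(X):
    """Rank samples with the Kennard-Stone max-min distance rule."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    i, j = np.unravel_index(np.argmax(D), D.shape)   # the two most distant samples
    order = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in order]
    while remaining:
        # distance from each remaining sample to its nearest already-selected sample
        dmin = D[np.ix_(remaining, order)].min(axis=1)
        order.append(remaining.pop(int(np.argmax(dmin))))
    return order

def two_of_three_split(order):
    """Send two of every three ranked samples to training and one to test.
    Holding out the middle one of each triple is an assumption."""
    train = [s for pos, s in enumerate(order) if pos % 3 != 1]
    test = [s for pos, s in enumerate(order) if pos % 3 == 1]
    return train, test
```

For the classification dataset, the ranking and the split would be applied to each class separately and the two training (and test) lists then merged.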

Performance Measures

For classification, the misclassification rate (MCR) is used as the measure for evaluating a weak/ensemble classifier. In addition, sensitivity and specificity are considered. The former measures the proportion of actual positives that are correctly identified as such (e.g., the percentage of anorexic people who are identified as having the condition), and the latter measures the proportion of negatives that are correctly identified (e.g., the percentage of nonanorexic people who are identified as not having the condition). According to Freund et al. [24], two important measures, i.e., the weighted error of the tth classifier and the error bound of the ensemble of the first t classifiers, are also used in our case. For regression, two measures are defined based on the root-mean-square error (RMSE) of the residuals: the RMSE is termed RMSEC for the training set and RMSEP for the test set. The RMSE of cross-validation (RMSECV) on the training set is used to optimize model parameters.

Table 2 The Statistical Description of Dataset 2

| Items | Se (mg/year) | Cu (mg/year) | Zn (mg/year) | Cd (mg/year) | Cr (mg/year) | Mn (mg/year) | As (mg/year) | Mortality (1/100,000) |
|---|---|---|---|---|---|---|---|---|
| Minimum | 57.80 | 592.00 | 1,674.00 | 33.30 | 11.70 | 463.00 | 82.10 | 3.50 |
| Maximum | 107.60 | 1,125.00 | 6,948.00 | 123.90 | 25.40 | 1,169.00 | 273.40 | 26.00 |
| Mean | 77.76 | 774.48 | 4,208.85 | 79.46 | 18.38 | 832.67 | 153.49 | 16.17 |
| SD | 12.79 | 115.69 | 993.58 | 17.10 | 3.41 | 198.68 | 53.21 | 6.25 |
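The performance measures defined in this subsection (MCR, sensitivity, specificity, and RMSE) can be computed as in the following sketch. The function names and the convention that the positive (anorexic) class is labeled 1 are assumptions made for the example; the weighted error and error bound are not shown.

```python
import numpy as np

def classification_measures(y_true, y_pred, positive=1):
    """Return MCR, sensitivity, and specificity for a two-class problem."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_true == positive
    tp = np.sum(pos & (y_pred == positive))      # positives correctly identified
    tn = np.sum(~pos & (y_pred != positive))     # negatives correctly identified
    return {
        "mcr": float(np.mean(y_true != y_pred)),
        "sensitivity": float(tp / pos.sum()),
        "specificity": float(tn / (~pos).sum()),
    }

def rmse(y_true, y_pred):
    """Root-mean-square error; called RMSEC on training data, RMSEP on test data."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

RMSECV would be obtained by applying `rmse` to predictions collected across the cross-validation folds of the training set.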

Boosting Classification Application

The first dataset concerns the diagnosis of anorexia, i.e., a classification application, where the main goal is to develop a prediction model with acceptable accuracy. The quality of the model largely depends on the selection of features that contain valuable information for diagnosis. Thus, a paramount task is to determine which features qualify as inputs for training and applying the diagnostic model. Table 3 gives the statistical results for age and gender. It seems that there is a significant difference in age between nonanorexic and anorexic cases, while the ratios of male to female cases are almost the same in the anorexic, nonanorexic, and all cases. Some experiments have shown that age might also affect the absorption of trace elements, leading to an eating disorder and anorexia [35]. In this case, however, using age as a feature was attempted but failed to improve the results (not reported). Besides age and gender, many studies have shown that the concentrations of Zn, Fe, Cu, Mn, Ca, and Mg play important roles in anorexia [36–38]. These concentrations hardly change over a period of several months and can therefore represent the state of health of an individual. It is generally believed that Zn is an important component of more than 80 kinds of enzymes and also participates in cellular metabolism, particularly the synthesis and decomposition of proteins and nucleic acids. Anorexia is one of the early symptoms of zinc deficiency in children. During zinc deficiency, atrophy or hypertrophy of the papillae of the tongue may emerge, and the sensitivity of taste is severely reduced or even disappears, leading to loss of appetite. Once an anorexic symptom is present, it in turn influences the intake of other elements such as Cu, Mn, Fe, Mg, and Ca. Thus, the dynamic balance between trace elements is broken, which makes it possible to build a model for diagnosis.

Considering that correlation, often measured as a correlation coefficient, can indicate the strength of a linear relationship between two variables, we first calculate the correlation coefficient of each pair of elements. Among the six elements (variables), the correlation coefficient of any two elements is smaller than 0.61, and the elements are therefore considered independent. Based on a t test, the p values of which are listed in the last row of Table 1, the concentrations in nonanorexic cases are significantly higher than those in anorexic cases for most elements (the exception being Cu, with p = 0.178). However, the concentration values are also widely dispersed. Figure 1 gives the frequency histograms and the corresponding estimated probability distributions for both anorexic and nonanorexic groups. Among the six elements, the concentration difference of Zn between nonanorexic and anorexic cases is the most significant; the mean values of nonanorexic and anorexic cases are 113.79 and 69.14 μg/g, respectively. This is also confirmed by the lowest p value of 6.92×10^−8. It seems that the concentration of Zn can be used as a simple criterion to diagnose anorexia. However, as shown in Fig. 1, the populations of nonanorexic and anorexic cases are not completely separate, and the concentration distributions of Zn for the two groups overlap considerably. So, using only one element is not advisable, and it is difficult to achieve acceptable accuracy in this way.

Furthermore, principal component analysis (PCA) is used. PCA is a classic method of reducing data dimensionality by creating new orthogonal variables/components (scores/latent variables) that are linear combinations of the original variables (i.e., the concentrations of six elements). Only the first few new variables (also called PCs), which account for most of the variance of the original data, often contain meaningful information, while the last ones, which account for a small amount of variance, contain more noise and can be ignored. Although PCA itself cannot be used as a classification tool, it can indicate the data trend in a low-dimensional space suitable for visualization. Figure 2 gives the score plot for the first three PCs (PC1–PC2–PC3) and their projections on the PC1–PC2 plane. A calculation reveals that the first two and the first three PCs explain 70% and 89% of the variance, respectively, implying that the first three PCs convey most of the meaningful information. Two rough clusters can also be observed in Fig. 2. However, once again, they overlap noticeably. Introducing more PCs was attempted but did not improve the classification results. Such evidence indicates that the classification/diagnosis task is not easy and that the use of chemometrics is necessary to build a powerful model.

Table 3 The Statistical Results of the Age and Gender of Dataset 1

| | In all cases | Nonanorexic | Anorexic |
|---|---|---|---|
| Mean of age | 3.08 | 2.28 | 3.44 |
| Male/female | 48/42=1.14 | 15/13=1.15 | 33/29=1.14 |
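The PCA step described in this subsection, i.e., computing scores and the fraction of variance explained by each PC, can be sketched with a singular value decomposition. Autoscaling the six concentration variables (mean-centering and dividing by the standard deviation) is an assumption here, since the text does not say how the data were preprocessed, and the function name is hypothetical.

```python
import numpy as np

def pca_scores(X, n_components=3):
    """Return the PC scores and the fraction of variance explained by every PC."""
    # Autoscaling is an assumption; the paper does not state the preprocessing.
    Xc = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)              # variance fraction per PC
    scores = U[:, :n_components] * s[:n_components]  # coordinates on the first PCs
    return scores, explained
```

With the anorexia data, `np.cumsum(explained)[:3]` would give the cumulative variance of the first one, two, and three PCs, to be compared with the 70% and 89% quoted in the text; the exact values depend on the preprocessing actually used.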

Using AdaBoost with decision stumps, a series of ensemble classifiers with different ensemble sizes (the number of weak classifiers) is constructed. Figure 3 shows the curves of MCR versus the ensemble size. As can be seen in Fig. 3, with increasing ensemble size, the MCR values for both the training set and the test set drop quickly; when the ensemble size is equal to 18, the MCR for the training set reaches zero. Although the MCR on the test set remains unchanged after the first 12 classifiers are assembled, an ensemble of that size is not reliable, as implied by the MCR of the training set, which has not yet reached zero. Thus, the final ensemble classifier consists of the first 18 weak classifiers, i.e., the ensemble size at which the MCR of the training set equals zero. Figure 4 gives the weighted error and the error bound for ensemble sizes of 1–18. On average, successive weak classifiers take on higher weighted error values. This is because they pay more attention to the “hard” samples, and it is precisely by this means that the superiority of AdaBoost is realized. In general, the error bound curve indicates the risk of over-fitting to some extent and therefore provides a reference for controlling over-fitting. The optimal ensemble classifier, constructed on the training set and containing only 18 decision stumps, achieves a sensitivity of 89% and a specificity of 86% on the test set (with 21 nonanorexic and nine anorexic samples). As shown in Fig. 5, only four of the 30 test samples (also marked in Fig. 2) are misclassified.

Fig. 1 Frequency histograms and the corresponding estimated probability distributions for both anorexic (in red) and nonanorexic (in blue) groups

Fig. 2 Score plot of the first two/three PCs obtained by PCA of the anorexia dataset

Fig. 3 Misclassification rate (MCR) as a function of the ensemble size (the number of decision stumps in an ensemble classifier)
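The reported test-set figures can be checked for internal consistency with a short calculation. The individual counts are not given in the text, so the following assumes that 8 of the 9 anorexic and 18 of the 21 nonanorexic test cases are classified correctly:

$$\text{sensitivity} = \frac{8}{9} \approx 0.889,\qquad \text{specificity} = \frac{18}{21} \approx 0.857,\qquad \text{MCR} = \frac{1 + 3}{30} \approx 0.133$$

Under this assumption, the rounded values agree with the stated sensitivity of 89%, specificity of 86%, and four misclassified samples out of 30.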

Considering that the support vector machine [39, 40] has many attractive features and performs well in various applications, we have also developed an SVM classifier based on the radial basis function (RBF) kernel. A two-dimensional grid search coupled with cross-validation on the training set is used to optimize the regularization parameter and the RBF kernel parameter. The ranges of both parameters are preliminarily determined by trial and error. With a regularization parameter of 120 and a kernel parameter of 0.04, the best SVM classifier is obtained, which provides a sensitivity of 84% and a specificity of 86% on the test set, i.e., a slightly poorer sensitivity and the same specificity compared with the AdaBoost classifier of 18 decision stumps. Because sensitivity is generally more important when diagnosing a disease, boosting seems preferable.

Fig. 4 The weighted error curve and error bound related to each ensemble size

Fig. 5 The predictive ability of the final ensemble classifier on the test set


Boosting Regression Application

The second dataset concerns the prediction of breast cancer mortality, i.e., a regression application. Breast cancer is a hormone-dependent disease. Trace elements are very important structural and functional cofactors of various enzymes that are crucial for biochemical cell activities, and they act at the cellular/subcellular level through different mechanisms [41]. One of these mechanisms could be described as the interaction between trace elements and hormones that regulate the metabolism of higher biochemical substrates [42]. The correlation of Se, Zn, and Cu with cancer is now well established. Se exerts anticarcinogenic effects by several mechanisms, such as its protective action against carcinogen-induced chromosomal damage. Selenoproteins are part of the body's antioxidant defense system and are able to break down hydrogen peroxide and lipid hydroperoxides generated by free radicals, which can damage cell membranes and disrupt cellular functions. There is evidence that Se deficiency may be related to cancer [43]. Besides, Se may be adversely affected by chronic excessive ingestion of As, since high levels of chronic As ingestion accelerate the excretion of Se [44]. Zn and Cu have been recognized to play important roles as cofactors of superoxide dismutase. This enzyme protects cells against free-radical-producing agents and substances that might be involved in initiating the neoplastic process. Although the mechanism of action of Cr compounds on tissues has not been extensively studied, it is observed that they can generate reactive oxygen species during their reduction through successive oxidation states [45]. Many compounds of Cd, Cr, Mn, and Zn have been used to induce cancer in experimental animals, and these studies have shown that the elements can interact with nucleic acids to influence base pairing and conformation [46]. However, the relationship between these trace elements and the etiology of breast cancer is complex. To predict breast cancer mortality, a good model based on an appropriate algorithm is needed.

In this case, partial least squares is used as the reference algorithm as well as the weak algorithm in boosting (named boosting PLS). Figure 6 depicts RMSECV as a function of the number of latent variables (Lvs) in PLS models. Clearly, the best PLS regression model is based on seven Lvs. For simplicity, each weak PLS model in boosting is fixed at 11 Lvs. The ensemble size is optimized in the range of 2–30 by cross-validation on the training set. Finally, the best boosting PLS model consists of six weak PLS models. Figure 7 compares the performance of PLS and boosting PLS on the test set. Compared to PLS, boosting PLS provides a more satisfactory prediction of mortality. For samples 3, 7, 8, and 9, PLS shows a large prediction error while boosting PLS achieves a more reasonable prediction. Figure 8 compares the weak models in the final boosting PLS model in terms of RMSE. It is noteworthy from Fig. 8 that the RMSEC and RMSEP values of the weak PLS models vary in the ranges of (4.8, 8.3) and (5.8, 9.4), respectively. However, assembling these six weak models by boosting (weighted median) produces a better regression model (RMSEC=3.4 and RMSEP=5.5) that outperforms the best of its weak models, although, on average, the weak models do not display any advantage. This example shows the potential of boosting in regression modeling.

Fig. 6 Root-mean-squared error of cross-validation (RMSECV) as a function of Lvs in PLS

Fig. 7 Performance comparison of PLS and boosting PLS on the test set

Fig. 8 Performance of weak models in the final boosting PLS model


Conclusions

This study focuses on exploring the feasibility of using boosting to model the relationships between trace elements and diseases. Two examples are employed to illustrate the technique in the context of classification and regression, respectively. In comparison with the reference algorithms, both examples confirm that boosting is an effective and promising tool for modeling the relationship between trace elements and diseases and therefore deserves to be used in such modeling tasks.

Acknowledgements This work was supported by Sichuan Province Science Foundation for Youths (09ZQ026-066) and Scientific Research Startup Fund for Doctor, Yibin University (2008B06).

References

1. Zhai HL, Chen XG, Hu ZD (2003) Study on the relationship between intake of trace elements and breastcancer mortality with chemometric methods. Comput Biol Chem 27:581–586

2. Gaetke LM, Frederich RC, Oz HS, McClain CJ (2002) Decreased food intake rather than zinc deficiencyis associated with changes in plasma leptin, metabolic rate, and activity levels in zinc deficient rats. JNutr Biochem 13:237–244

3. Ren YL, Zhang ZY, Ren YQ, Li W, Wang MC, Xu G (1997) Diagnosis of lung cancer based on metalcontents in serum and hair using multivariate statistical methods. Talanta 44:1823–1831

4. Chan S, Gerson B, Subramaniam S (1998) The role of copper, molybdenum, selenium, and zinc innutrition and health. Clin Lab Med 18:673–685

5. Zhang ZY, Zhou HL, Liu SD, Harrington PB (2001) Classification of cancer patients based on elementalcontents of serums using bidirectional associative memory networks. Anal Chim Acta 436:281–291

6. Miura Y, Nakai K, Suwabe A, Sera K (2002) Trace elements in renal disease and hemodialysis. J NuclInstrum Methods Phys Res B 189:443–449

7. Douglas MT (2003) The importance of trace element speciation in biomedical science. Anal BioanalChem 375:1062–1066

8. HegdeP SML, Vengamma B, Rao TSS, Menon RB, Rao RV, Rao KSJ (2004) Serum trace element levelsand the complexity of inter-element relations in patients with Parkinson's disease. J Trace Elem Med Bio18:163–171

9. Forte G, Alimonti A, Violante N, Gregorio M, Senofonte O, Petrucci F, Sancesario G, Bocca B (2005)Calcium, copper, iron, magnesium, silicon and zinc content of hair in Parkinson's disease. J Trace ElemMed Bio 19:195–201

10. Zhang ZY, Zhou HL, Liu SD, Harrington P (2006) An application of Takagi-Sugeno fuzzy system to theclassification of cancer patients based on elemental contents in serum samples. Chemom Intell Lab Syst82:294–299

11. Gurusamy K, Davidson BR (2007) Trace element concentration in metastatic liver disease—a systematicreview. J Trace Elem Med Bio 21:169–177

12. Frisk P, Darnerud P, Friman G, Blomberg J, Ilbäck NG (2007) Sequential trace element changes in serumand blood during a common viral infection in mice. J Trace Elem Med Bio 21:29–36

13. Bianchi F, Maffini M, Mangia A, Marengo E, Mucchino C (2007) Experimental design optimization forthe ICP-AES determination of Li, Na, K, Al, Fe, Mn and Zn in human serum. J Pharm Biomed Anal43:659–665

14. Tan C, Chen H, Xia CY (2009) Early prediction of lung cancer based on the combination of traceelement analysis in urine and an Adaboost algorithm. J Pharm Biomed Anal 49:746–752

15. Greenlee RT, Hill-Harmon MB, Murray T, Thun M (2001) Cancer statistics. CA-Cancer J Clin 51:15–3616. Whelehan OP, Earll ME, Johansson E, Toft M, Eriksson L (2006) Detection of ovarian cancer using

chemometric analysis of proteomic profiles. Chemom Intell Lab Syst 84:82–8717. Huang ZW, Mcwilliams A, Lui H, Mclean D, Lan S, Zeng HS (2003) Near-infrared Raman spectroscopy

for optical diagnosis of lung cancer. Int J Cancer 107:1047–105218. Sorich MJ, Miners JO, McKinnon RA, Winkler DA, Burden FR, Smith PA (2003) Comparison of linear

and nonlinear classification algorithms for the prediction of drug and chemical metabolism by humanUDP-glucuronosyltransferase isoforms. J Chem Inf Comput Sci 43:2019–2024

158 Tan et al.

Page 14: Application of Boosting Classification and Regression to Modeling the Relationships Between Trace Elements and Diseases

19. Sboner A, Eccher C, Blanzieri E, Bauer P, Cristofolini M, Zumiani G, Forti S (2003) A multipleclassifier system for early melanoma diagnosis. AI Med 27:29–44

20. Liu HX, Zhang RS, Luan F, Yao XJ, Liu MC, Hu ZD, Fan BT (2003) Diagnosing breast cancer based onsupport vector machines. J Chem Inf Comput Sci 43:900–907

21. Tan C, Li ML, Qin X (2008) Random subspace regression ensemble for near-infrared spectroscopiccalibration of tobacco samples. Anal Sci 24:647–653

22. Brown G, Wyatt JL, Tino P (2005) Managing diversity in regression ensembles. J Mach Learn Res6:1621–1650

23. Mevik B-H, Segtnan VH, Næs T (2004) Ensemble methods and partial least squares regression. JChemometr 18:498–507

24. Freund Y, Schapire RE (1996) Experiments with a new boosting algorithm. Proceedings of theThirteenth International Conference, pp 148–156

25. He P, Xu CJ, Liang YZ, Fang KT (2004) Improving the classification accuracy in chemistry via boostingtechnique. Chemom Intell Lab Syst 70:39–46

26. Zhang MH, Xu QS, Massart DL (2005) Boosting partial least squares. Anal Chem 77:1423–143127. Shinzawa H, Jiang JH, Ritthiruangdej P, Ozaki Y (2006) Investigations of bagged kernel partial least

squares (KPLS) and boosting KPLS with applications to near-infrared (NIR) spectra. J Chemometr20:436–444

28. Zhou YP, Jiang JH, Wu HL, Shen GL, Yu RQ, Ozaki Y (2006) Dry film method with ytterbium as theinternal standard for near infrared spectroscopic plasma glucose assay coupled with boosting supportvector regression. J Chemometr 20:13–21

29. Tan C, Li ML, Qin X (2007) Study of the feasibility of distinguishing cigarettes of different brands usingan Adaboost algorithm and near-infrared spectroscopy. Anal Bioanal Chem 389:667–676

30. Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an applicationto boosting. J Comput Syst Sci 55:119–139

31. Friedman JH (2002) Stochastic gradient boosting. Comput Stat Data Anal 38:367–37832. Zhao CY, Zhang RS, Liu HX, Xue CX, Zhao SG, Zhou XF, Liu MC, Fan BT (2004) Diagnosing

anorexia based on partial least squares, back-propagation neural network, and support vector machines. JChem Inf Comput Sci 44:2040–2046

33. Zhou S (1996) Synthetometrics and optimization in chemistry and chemical engineering. HunanUniversity Press, Hunan p 69

34. Galváo RKH, Araújo MCU, José GE, Pontes MJC, Silva EC, Saldanha TCB (2005) A method forcalibration and validation subset partitioning. Talanta 67:736–740

35. Keller KA, Grider A, Coffield JA (2001) Age-dependent influence of dietary zinc restriction on short-term memory in male rats. Physiol Behav 72:339–348

36. Dalway JS (2000) Why trace elements are important. Fuel Process Technol 65:21–2337. Shay NF, Manigan HF (2000) Neurobiology of zinc-influenced eating behavior. J Nutr 130:1493–149938. Iyengara GV, Rappb A (2000) Human placenta as a ‘dual’ biomarker for monitoring fetal and maternal

environment with special reference to potentially toxic trace elements. Part 2: essential minor, trace andother nonessential elements in human placenta. Sci Total Environ 280:207–219

39. Vapnik VN (1995) The nature of statistical learning theory. Springer, New York40. Thissena U, Pepersb M, Stuna BU, Melssena WJ, Buydensa LMC (2004) Comparing support vector

machines to PLS for spectral regression applications. Chemom Intell Lab Syst 73:169–17941. Kuo HS, Chen SF, Wu CC, Chen DR, Lee JH (2002) Serum and tissue trace elements in patients with

breast cancer in Taiwan. Biol Trace Elem Res 89:1–1142. Magalova T, Bella V, Brtkova A, Beno I, Kudlackova M, Volkovova K (1999) Copper, zinc and

superoxide dismutase in precancerous, benign diseases and gastric, colorectal and breast cancer.Neoplasma 46:100–104

43. Spallholz JE, Mallory LB, Rhaman MM (2004) Environmental hypothesis: is poor dietary seleniumintake an underlying factor for arsenicosis and cancer in Bangladesh and West Bengal, India. Sci TotalEnviron 323:21–32

44. Conor R (1998) Selenium: a new entrant into the functional food arena. Trends Food Sci Technol 9:114–11845. Acharya UR, Mishra M, Mishra I (2004) Status of antioxidant defense system in chromium-induced

Swiss mice tissues. Environ Toxicol Pharmacol 17:117–12346. Garg AN, Weginwar RG, Sagdeo V (1990) Minor and trace elemental contents of cancerous breast tissue

measured by instrumental and radiochemical neutron activation analysis. Biol Trace Elem Res 26–27:485–496

Modeling the Relationships Between Trace Elements and Diseases 159