
A multiple kernel support vector machine scheme for feature selection and rule extraction from gene expression data of cancer tissue

    Zhenyu Chen a,b, Jianping Li a,*, Liwei Wei a,b

a Institute of Policy & Management, Chinese Academy of Sciences, Beijing 100080, China
b Graduate University of Chinese Academy of Sciences, Beijing 100039, China

    Received 30 November 2006; received in revised form 31 July 2007; accepted 31 July 2007

Artificial Intelligence in Medicine (2007) 41, 161–175

    http://www.intl.elsevierhealth.com/journals/aiim

    KEYWORDS

Multiple kernel learning; Support vector machine; Feature selection; Rule extraction; Gene expression data

Summary

Objective: Recently, gene expression profiling using microarray techniques has been shown to be a promising tool for improving the diagnosis and treatment of cancer. Gene expression data contain a high level of noise, and the number of genes overwhelms the number of available samples. This poses a great challenge for machine learning and statistical techniques. The support vector machine (SVM) has been successfully used to classify gene expression data of cancer tissue. In the medical field, it is crucial to deliver the user a transparent decision process, and how to explain the computed solutions and present the extracted knowledge has become a main obstacle for SVM.

Material and methods: A multiple kernel support vector machine (MK-SVM) scheme, consisting of feature selection, rule extraction and prediction modeling, is proposed to improve the explanation capacity of SVM. In this scheme, we show that the feature selection problem can be translated into an ordinary multiple-parameter learning problem, and a shrinkage approach, 1-norm based linear programming, is proposed to obtain the sparse parameters and the corresponding selected features. We also propose a novel rule extraction approach that uses the information provided by the separating hyperplane and the support vectors to improve the generalization capacity and comprehensibility of the rules and to reduce the computational complexity.

Results and conclusion: Two public gene expression datasets, the leukemia dataset and the colon tumor dataset, are used to demonstrate the performance of this approach. Using the small number of selected genes, MK-SVM achieves encouraging classification accuracy: more than 90% on both datasets. Moreover, very simple rules with linguistic labels are extracted. The rule sets have high diagnostic power because of their good classification performance.

© 2007 Elsevier B.V. All rights reserved.

This research has been partially supported by a grant from the National Natural Science Foundation of China (#70531040) and the 973 Project (#2004CB720103), Ministry of Science and Technology, China.

* Corresponding author. Tel.: +86 10 6263 4957; fax: +86 10 6254 2629. E-mail addresses: [email protected] (Z. Chen), [email protected] (J. Li), [email protected] (L. Wei).

0933-3657/$ – see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.artmed.2007.07.008


    1. Introduction

DNA microarray technology makes it possible to measure the expression levels of thousands of genes in a single experiment [1,2]. There are many potential applications for DNA microarray technology, such as the functional assignment of genes and the recognition of gene regulation networks [3]. The diagnosis and treatment of human diseases has become a major application field of this technology. In particular, DNA microarray technology is considered a promising tool for cancer diagnosis [1–4]. Tumors with similar histopathological appearance can show very different responses to the same therapies, which greatly limits the decision accuracy of traditional clinical methods relying on a limited set of historical and pathological features. Cancer is fundamentally a malfunction of genes, so utilizing gene expression data might be the most direct way of diagnosis [1–5]. There are some reported works in the literature that focus on constructing a decision support system to assist doctors and clinicians in their decision-making process [3,6]. That kind of information system usually consists of the following three phases: feature selection, modeling the problem and knowledge discovery.

Many machine learning technologies have been used to model gene expression data. Unsupervised learning methods, including hierarchical clustering [4,7], self-organizing maps [8], fuzzy adaptive resonance theory (ART) [9,10] and K-means clustering [11], are widely used in the functional assignment of novel genes, marker gene identification and class discovery. Recently, more researchers have paid attention to supervised learning techniques. The artificial neural network (ANN) [12–14] is the most popular supervised learning method utilized in medical research. In particular, the fuzzy neural network (FNN) [6,13,14], one of the advanced ANN models, can extract the causality between input and output variables as explicit IF-THEN rules, which increases the explanation capacity of the neural network. Other supervised learning methods include Bayesian approaches [15], decision trees [16] and the support vector machine (SVM) [17–19]. Besides, ensemble learning methods [20–22] are also used in this field to improve the performance of single approaches.

The most outstanding characteristic of gene expression data is that it contains a large number of gene expression values (several thousands to tens of thousands) and a relatively small sample size (a few dozen). Furthermore, many genes are highly correlated, which leads to redundancy in the data. Besides, gene expression data contain a high level of technical and biological noise. The above factors make the clustering or classification results susceptible to over-fitting and sometimes under-fitting [3,21,23,24]. Therefore, feature selection has to be performed prior to the implementation of a classification or clustering algorithm [3,6]. Besides, feature selection can improve the transparency of the computational model. Especially for gene expression data analysis, a small set of genes that is indicative of important differences in cell states can serve either as a convenient diagnosis panel or as the set of candidates for the very expensive and time-consuming analysis required to determine whether they could serve as useful targets for therapy.

Feature selection aims at finding a powerfully predictive subset of features within a database and reducing as far as possible the number of features presented to the modeling process [25–28]. The methods proposed to tackle the feature selection problem fall into two basic categories: filter and wrapper methods [29]. In filter methods, the data are preprocessed and some top-ranked features are selected using a quality metric, independent of the classifier. Because they are more efficient than wrapper methods, filter methods, such as the T-statistic, information entropy, information gain and a series of statistical impurity measures [1,30], are widely used on large-scale data such as gene expression data. In wrapper methods, the search for a good feature subset is conducted by using the classifier itself as a part of the evaluation function [29,31]. Wrapper methods usually obtain better predictive accuracy estimates than filter methods, but they usually require much more computational time.
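To make the filter idea concrete, here is a minimal sketch in Python with NumPy. The function name and the choice of a two-sample T-statistic score are our own illustration, not a method taken from this paper:

```python
import numpy as np

def t_statistic_filter(X, y, k):
    """Rank genes by a two-sample T-statistic and return the top-k indices.

    X: (n_samples, n_genes) expression matrix; y: labels in {-1, +1}.
    The score is computed independently of any classifier, which is what
    makes filter methods cheap enough for microarray-scale data.
    """
    pos, neg = X[y == 1], X[y == -1]
    mean_diff = pos.mean(axis=0) - neg.mean(axis=0)
    se = np.sqrt(pos.var(axis=0) / len(pos) + neg.var(axis=0) / len(neg))
    scores = np.abs(mean_diff) / (se + 1e-12)  # guard against zero variance
    return np.argsort(scores)[::-1][:k]        # indices of the top-k genes
```

A wrapper method would instead re-train the classifier on candidate subsets, which is exactly why it costs far more computation.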

A recent breakthrough in feature selection research is the development of SVM-based techniques, which scale to thousands of variables and typically exhibit excellent performance in reducing the number of variables while maintaining or improving classification accuracy [32,19]. Most of these SVM-based techniques make use of a forward




selection or backward elimination strategy [19,33,34]. It is very hard for these strategies to find global solutions [19]. Some approaches, including SVM-based recursive feature elimination (SVM-RFE) [19] and its improvements [35], have been proposed to deal with this issue. Clearly, they are computationally expensive.
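As an illustration of the backward elimination strategy, the sketch below runs SVM-RFE via scikit-learn's RFE wrapper around a linear SVM. The synthetic data, the target gene count and the elimination step are arbitrary placeholders, not the settings used in [19] or in this paper:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))              # toy stand-in for expression data
y = np.where(X[:, 0] - X[:, 1] > 0, 1, -1)  # toy labels

# SVM-RFE: repeatedly fit a linear SVM and discard the genes whose
# weights |w_j| are smallest, until the desired count remains.
selector = RFE(estimator=LinearSVC(C=1.0, dual=False),
               n_features_to_select=20,     # assumed target gene count
               step=0.1).fit(X, y)          # drop 10% of genes per round
selected_genes = np.flatnonzero(selector.support_)
```

Every elimination round re-trains the SVM, which is the computational expense noted above.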

Hardin et al. argue that linear SVM-based feature selection algorithms may remove the strongly relevant variables and keep the weakly relevant ones [32]. In fact, both SVM-RFE and the multiple kernel support vector machine (MK-SVM) proposed in this paper carry out a multivariable feature selection process. That is to say, they aim not to rank single marker genes but to find the gene groups that work together as pathway components and reflect the state of the cell [36]. New issues then arise: what is the contribution of each gene in the gene group, and what is the interaction among the genes? Besides, Hardin et al. also point out that in gene expression data even random gene subsets can give good classification performance [32]. So a more transparent model is required to give a further explanation of the selected gene subset. The extraction of human-comprehensible rules from the selected gene subset is necessary to answer the above doubts and debates.

Much research has focused on rule extraction from ANNs in the last decade [37–39]. The approaches to rule extraction from ANNs can be categorized as: link rule extraction techniques, black-box rule extraction techniques, extracting fuzzy rules, and extracting rules from recurrent networks [40]. However, few papers have been published on rule extraction from SVM. Nunez et al. propose an SVM + prototypes method in which support vectors and the prototype points defined by the K-means clustering algorithm are used to determine the rules [41]. But the introduction of K-means clustering makes the rule extraction process uncertain and sensitive to initialization. Fung et al. define rules as hypercubes and use linear programming to optimize the vertexes of the rules [42]. The drawback of this method is that it is only suitable for the linear kernel, whereas the nonlinear mapping and the kernel trick are among the most important characteristics of SVM. Fu et al. propose a rule-extraction approach (RuleExSVM) to obtain hyper-rectangular rules based on the information provided by the support vectors and the decision function [43]. Some other approaches treat SVM as a black box: following the training of the SVM, an interpretable model such as a decision tree [44,45] is used to generate rules. However, they cannot guarantee that the extracted rules have the same generalization performance as the SVM.
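The black-box strategy just mentioned is easy to sketch: train the SVM, relabel the training data with the SVM's own predictions, and fit a shallow decision tree to those predictions. The synthetic data and the depth limit below are our illustration, not the exact procedure of [44,45]:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))                 # 60 samples, 5 toy "genes"
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

svm = SVC(kernel="rbf", C=1.0).fit(X, y)
# The tree is fitted to the SVM's predictions, not the true labels,
# so its IF-THEN paths describe the SVM's decision surface.
tree = DecisionTreeClassifier(max_depth=2).fit(X, svm.predict(X))
print(export_text(tree, feature_names=[f"gene_{j}" for j in range(5)]))
```

The fidelity gap noted above is visible here: nothing forces the shallow tree to agree with the SVM outside the training sample.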

Multiple kernel learning considers multiple kernels, or parameterizations of kernels, to replace the single fixed kernel [46]. It provides more flexibility and the chance to choose a suitable kernel. Some efficient methods [47–49] have been proposed to perform the optimization over certain convex combinations of basic kernels. In the present paper, this idea is extended to perform feature selection and rule extraction in the framework of SVM. The multiple kernels are described as a convex combination of two kinds of single-feature basic kernels. A sparse optimization method, 1-norm based linear programming, is proposed to carry out the optimization of the parameter of each basic kernel (feature parameter). In this way, feature selection becomes equivalent to multiple-parameter learning [51], which is easy to carry out. Simple and comprehensible rules can then be extracted using the algorithm at a low computational cost.
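To see why a sparse kernel weight vector performs feature selection, consider the following sketch of a convex combination of single-feature kernels. We use per-feature RBF kernels purely for illustration; the paper's two kinds of basic kernels are defined later in Section 2:

```python
import numpy as np

def combined_kernel(X1, X2, beta, gamma=1.0):
    """Convex combination of single-feature RBF kernels (a sketch).

    beta: one non-negative weight per gene. A sparse beta, as a 1-norm
    (LASSO-style) linear program would produce, zeroes out whole kernels,
    so the genes with beta_d = 0 are effectively deselected.
    """
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for d, b in enumerate(beta):
        if b == 0:
            continue                                   # gene d drops out
        diff = X1[:, d:d + 1] - X2[:, d:d + 1].T       # pairwise differences
        K += b * np.exp(-gamma * diff ** 2)            # per-gene RBF kernel
    return K
```

Learning beta then plays the role of the multiple-parameter learning referred to above.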

This paper is organized as follows: Section 2 first describes SVM briefly; then the proposed MK-SVM scheme for feature selection and rule extraction is developed in detail. Section 3 presents the experimental results and analysis on two public gene expression datasets: the ALL-AML leukemia dataset and the colon tumor dataset. Section 4 summarizes the results and draws a general conclusion.

2. MK-SVM for feature selection and rule extraction

    2.1. Brief introduction of SVM

In this section we briefly describe the basic SVM concepts for typical two-class classification problems. These concepts can also be found in [23,25,19]. Given a training set of data points $G = \{(\vec{x}_i, y_i)\}_{i=1}^{n}$, $\vec{x}_i \in \mathbb{R}^m$ and $y_i \in \{-1, 1\}$, the nonlinear SVM maps the training samples from the input space into a higher-dimensional feature space via a mapping function $\phi$ and constructs an optimal hyperplane, defined by $\vec{w} \cdot \phi(\vec{x}) + b = 0$, to separate the examples of the two classes. For SVM with the L1 soft margin formulation, this is done by solving the primal problem:

$$\min\; J(\vec{w}, \vec{\xi}) = \frac{1}{2}\|\vec{w}\|^2 + C \sum_{i=1}^{n} \xi_i \tag{1}$$

$$\text{s.t.}\quad y_i(\vec{w}^{\mathrm{T}}\phi(\vec{x}_i) + b) \ge 1 - \xi_i,\quad i = 1, \ldots, n, \qquad \xi_i \ge 0 \tag{2}$$

where $\xi_i \ge 0$ are the non-negative slack variables, and the regularization parameter $C$ determines the trade-off between the maximum margin $1/\|\vec{w}\|_2$ and the minimum empirical risk.


The above quadratic optimization problem can be solved by finding the saddle point of the Lagrange function:

$$L_P(\vec{w}, b, \vec{\xi}; \vec{\alpha}, \vec{\delta}) = J(\vec{w}, \vec{\xi}) - \sum_{i=1}^{n} \alpha_i \{ y_i(\vec{w}^{\mathrm{T}}\phi(\vec{x}_i) + b) + \xi_i - 1 \} - \sum_{i=1}^{n} \delta_i \xi_i \tag{3}$$

where $\alpha_i$ and $\delta_i$ denote the Lagrange multipliers, hence $\alpha_i \ge 0$ and $\delta_i \ge 0$.

By differentiating with respect to $\vec{w}$, $b$ and $\xi_i$, the following equations are obtained:

$$\frac{\partial L}{\partial \vec{w}} = 0 \;\Rightarrow\; \vec{w} = \sum_{i=1}^{n} \alpha_i y_i \phi(\vec{x}_i) \tag{4}$$

$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} y_i \alpha_i = 0 \tag{5}$$

$$\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i = C - \delta_i,\quad i = 1, \ldots, n \tag{6}$$

Substituting Eqs. (4) and (5) into Eq. (3), $L_P$ is transformed into the dual Lagrangian:

$$\max \left\{ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j k(\vec{x}_i, \vec{x}_j) \right\} \tag{7}$$

$$\text{s.t.}\quad \sum_{i=1}^{n} y_i \alpha_i = 0, \qquad 0 \le \alpha_i \le C,\quad i = 1, \ldots, n$$
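The dual (7), with its box and equality constraints, is precisely the problem that standard SVM solvers optimize. As a quick sanity check (a sketch with synthetic data, not an experiment from this paper), one can fit scikit-learn's SVC and confirm that the recovered multipliers respect both constraints:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
y = np.where(X[:, 0] > 0, 1, -1)

C = 1.0
clf = SVC(kernel="rbf", C=C).fit(X, y)
# dual_coef_ stores y_i * alpha_i for the support vectors only;
# non-support vectors have alpha_i = 0 and do not appear.
alpha = np.abs(clf.dual_coef_).ravel()
signed = clf.dual_coef_.ravel()
assert np.all(alpha <= C + 1e-8)   # box constraint 0 <= alpha_i <= C
assert abs(signed.sum()) < 1e-8    # equality constraint sum(y_i alpha_i) = 0
print(f"{clf.support_.size} support vectors out of {len(X)} samples")
```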