
Identifying the impact of decision variables for nonlinear classification tasks

Steven H. Kim a,*, Sung Woo Shin a,b

a Graduate School of Management, Korea Advanced Institute of Science and Technology, Seoul, South Korea
b Samsung SDS Co. Ltd, Seoul, South Korea

* Corresponding author. Tel.: +82-958-3616; fax: +82-958-3604. E-mail addresses: [email protected] (S.H. Kim), [email protected] (S.W. Shin).

Abstract

This paper presents a novel procedure to improve a class of learning systems known as lazy learning algorithms by optimizing the selection of variables and their attendant weights through an artificial neural network and a genetic algorithm. The procedure utilizes its previous knowledge base, also called a case base, to select an effective subset for adaptation. In particular, the procedure explores a space of N variables and generates a reduced space of M dimensions. This is achieved through clustering and compaction. The clustering stage involves the minimization of distances among individuals within the same class while maximizing the distances among different classes. The compaction stage involves the elimination of irrelevant or redundant feature dimensions.

To achieve these two goals concurrently through the evolutionary process, new measures of fitness have been developed. The metrics lead to procedures which exhibit superior characteristics in terms of both accuracy and efficiency. The efficiency springs from a reduction in the number of features required for analysis, thereby saving on computational cost as well as data collection requirements. The utility of the new techniques is validated against a variety of data sets from natural and commercial sources. © 2000 Published by Elsevier Science Ltd. All rights reserved.

Keywords: Feature weighting; Similarity assessment; k-nearest neighbor; Lazy learning; Artificial neural network; Genetic algorithms

1. Introduction

A popular approach to learning is to store instances of experience as raw data, then retrieve them for subsequent adaptation to new problems. The approach was popularized by a knowledge representation called a "script", which is a type of "frame" used to encode experiences in everyday life, such as going to a restaurant or attending a meeting (Schank & Abelson, 1977).

Since the late 1980s, extensive research has been conducted on techniques for processing precedent cases, including methods for indexing, retrieving, and adaptation to new cases (Aha, Kibler & Albert, 1991; Kolodner, 1993; Leake, Kinley & Wilson, 1995; Schank & Riesbeck, 1990; Stanfill & Waltz, 1986; Watson, 1997). This collective framework of techniques is known as case-based reasoning. The class of lazy learning algorithms (LLAs) involves techniques which store precedent cases in memory with little or no preprocessing, then retrieve them on a selective basis as required by a problem situation (Mitchell, 1997; Wettschereck, Aha & Mohri, 1997). This approach has roots in the nearest neighbor (NN) retrieval algorithms used in pattern recognition (Cover & Hart, 1967; Dasarathy, 1991). Techniques in this category include the following algorithms: k-nearest neighbor (k-NN), case-based reasoning (CBR), memory-based reasoning, and instance-based learning.

The delayed processing of an LLA is reminiscent of an interpreter for a programming language, which postpones processing until the moment of need. The drawback of this approach is that it can be a time-consuming task each time the procedure is invoked.

In contrast, the eager learning algorithms such as neural networks and inductive decision trees construct an optimized internal model before real-time deployment (Mitchell, 1997; Wettschereck et al., 1997).

For a lazy learning algorithm, the selection of appropriate cases relies on a similarity metric which takes into account the distance between pairs of cases in their state space of variables, also commonly called "features" in the technical literature. Consequently, the performance of the metric and the weighting of features are keys to the reasoning process (Punch, Goodman, Min, Lai, Hovland & Enbody, 1993; Wettschereck et al., 1997). In contrast to most other procedures, LLA is effective for applications involving weak domain knowledge; that is, complex fields where human expertise is unavailable or even nonexistent. In such situations, the need for automatic feature selection and weight optimization is particularly acute.

Within the framework of CBR, the NN algorithm is the simplest version. In particular, the NN procedure retrieves the single precedent case in the casebase which most closely resembles the target problem at hand. In contrast to CBR, the NN approach attempts no substantive adaptations to a retrieved solution.

A generalization of NN is the k-NN algorithm. The NN and k-NN algorithms can perform poorly in retrieving precedents when the features of the cases are irrelevant, redundant, interdependent, or noisy (Langley & Iba, 1993; Punch et al., 1993; Wettschereck et al., 1997). Therefore, to minimize the bias caused by such features, it is imperative to identify the most salient features. Furthermore, since the k-NN classifies unknown cases directly using a distance metric, there is no preprocessed training stage. Even so, the relative contributions (weights) of the features for the application at hand must be determined to achieve a suitable level of performance.

The goal of a multistrategy approach is to combine two or more adaptive techniques in a synergistic fashion. In particular, an artificial neural network or a genetic algorithm may be used to optimize the weights in lazy learning algorithms.

This paper explores the potential of filter and wrapper approaches to weighting features in k-NN classification. More specifically, the following research issues are addressed. First is an investigation of a filter approach. In particular, the relative strengths of input features from a trained multilayer perceptron (MLP) serve as the weights of features for a k-NN classifier. Despite the high performance of an MLP for classification tasks, it is difficult to extract explicit knowledge about the discriminatory power of input features. A few studies have been conducted for automatically building a rulebase from a trained MLP network through the interpretation of its weight structure (Howes & Crook, 1999; Yoon, Guimaraes & Swales, 1994).

The weights determined by a neural network assume values which are independent of other classifiers such as k-NN. Furthermore, its computational complexity is lower than that of a GA-based wrapper approach; that is, the filter is the more efficient of the two.

The second issue relates to GA-based feature weighting using multiple criteria in the fitness function. Even though reasonable performance can be obtained with a simple fitness measure such as the hit rate, such metrics may fail to find appropriate weights due to the congestion problem during the evolutionary process. In other words, the procedure may converge on a strictly local rather than the global optimum.

A classifier with great flexibility for determining class boundaries, such as the k-NN classifier, can optimize the feature weights by measuring the cardinality of the minority patterns. The latter is the number of neighbors that are not used in the subsequent classification (Punch et al., 1993).

We investigate the effectiveness of multi-criteria metrics in directed search by incorporating an inter-class distance with Punch's measure.

The effectiveness of the proposed algorithms is verified through the hit rate on a variety of test samples. The computational complexity of the procedures is also examined.

2. Background

2.1. k-Nearest neighbor classifier

The k-NN classifier is a nonparametric pattern classification technique. Given a set of N stored cases, it seeks the k precedents closest in the feature space to a target case and assigns the latter to the class representing the majority of the retrieved neighbors. The accuracy of the k-NN approach is comparable to that of a multilayer perceptron using the backpropagation learning algorithm.

Each case or pattern in the casebase may be represented as a vector. In particular, we denote the ith pattern vector as

$$X^{(i)} = \{ x^{(i)}_1, x^{(i)}_2, \ldots, x^{(i)}_n, c^{(i)} \}$$

where $x^{(i)}_j$ represents the jth feature and $c^{(i)}$ the class label of the ith pattern.

Then the similarity between the target pattern Q and the ith pattern $X^{(i)}$ can be defined by using the standard Euclidean distance metric:

$$\mathrm{Similarity}(X^{(i)}, Q) = \sqrt{ \sum_{f=1}^{n} w_f \times \left( x^{(i)}_f - q_f \right)^2 }$$

In the k-NN approach, the classification process consists of two steps. The first is to find the appropriate value of k in the training sample using the preceding similarity metric. By employing the "leave-one-out" strategy, which never uses the point being classified as its own neighbor, we can estimate the best value of k for the training sample. The hit rate can be used as a performance metric. In this task, the number k of neighbors is defined as follows:

$$k = \sum_{i=1}^{c} k_i, \qquad k_i \geq 0$$

Here $k_i$ represents the number of prototype cases from class $\omega_i$ and $c$ denotes the number of classes.

The second step is to classify Q into the class $\omega_j$ according to $k_j = \max_i (k_i)$; that is, the expected value of the majority class among the k neighbors for Q is $k_j$.

It is well known that the k-NN error rate is never more than twice the Bayes optimal error rate as the size of the casebase goes to infinity (Duda & Hart, 1973; Mitchell, 1997). When the above conditions are satisfied, the following formula approximates the Bayes posterior probability:

$$p(\omega_i \mid X) = \frac{k_i}{k}, \qquad i = 1, \ldots, C$$
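As an illustration of this two-step scheme, the following sketch classifies a query with the weighted similarity metric above and reports the $k_j / k$ posterior estimate. It is our own rendering rather than the authors' original C++ implementation; the function name and the toy data are hypothetical.

import numpy as np
from collections import Counter

def weighted_knn_classify(X_train, y_train, q, weights, k=3):
    # distance(i) = sqrt( sum_f w_f * (x_f^(i) - q_f)^2 ), per the similarity metric
    d = np.sqrt((weights * (X_train - q) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]                 # indices of the k closest cases
    votes = Counter(y_train[nearest].tolist())  # k_i for each class among the neighbors
    label, k_j = votes.most_common(1)[0]        # majority class, with k_j = max_i(k_i)
    return label, k_j / k                       # k_j / k approximates the Bayes posterior

# Toy usage: two features, the second suppressed by a zero weight
X = np.array([[0.1, 0.9], [0.2, 0.1], [0.9, 0.8], [0.8, 0.2]])
y = np.array([0, 0, 1, 1])
print(weighted_knn_classify(X, y, np.array([0.15, 0.5]), np.array([1.0, 0.0])))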

S.H. Kim, S.W. Shin / Expert Systems with Applications 18 (2000) 201–214202

Page 3: Identifying the impact of decision variables for nonlinear classification tasks

2.2. Genetic algorithm

A genetic algorithm (GA) is a procedure modeled after the processes of genetic evolution and population dynamics in natural selection. In essence, the procedure selects highly fit individuals and their chromosomes at random to generate offspring; within the new population, the unfit are eliminated and the fittest survive to contribute genetic material to the subsequent generation.

The seminal work on genetic algorithms dates from the mid-1970s (Holland, 1975). Most applications of GAs involve the optimization of a performance function based on a domain of either continuous or discrete variables. Classical methods of optimization rely on the improvement of a single trajectory toward an optimum by computing the gradient at each step. In contrast, a GA promotes several solutions in parallel, and modifies them in a random fashion to obtain the subsequent iteration toward the optimum. The inherent parallelism and the advantage of directed random search permit the use of genetic algorithms for addressing computationally difficult problems even in the NP-hard category (Goldberg, 1989; Holland, 1975). In a GA, a chromosome is a sequence of symbols which represents an individual or candidate solution to the problem at hand. Often these symbols are encoded as numbers; the numbers in particular might be the binary digits "0" and "1". The collection of individuals at each iteration is called the population. The optimization process at each iteration or generation involves selection followed by crossover and/or mutation.

More specifically, the individuals in a population are evaluated through a fitness function which measures their relative worth. The fittest individuals are then chosen for further processing. The concept of inheritance is implemented by selecting two fit individuals and crossing or mixing their chromosomes: this crossover operation involves slicing their chromosomes at random locations and recombining corresponding sections from previously distinct individuals. Another way to effect variation in the subsequent population takes the form of mutation: a random location is selected on a chromosome and its symbol is changed to another feasible value at random.

A particular sequence of symbols for a chromosome is called a genotype. At times two different genotypes may exhibit the same appearance and behavior; in that case they constitute a single phenotype. For instance, an offspring may inherit two genes for brown hair from its parents, thereby exhibiting brown hair. Its cousin may inherit a gene for black hair and another for blond, and thereby end up with brown hair as well. The two offspring have different genotypes but the same phenotype in the context of hair color.

Occasionally, a population may reach a dead end in evolutionary terms. With insufficient variety in the common gene pool, the evolutionary process becomes "congested" and the population can attain only a locally optimal solution. To circumvent this possibility, the mutation operator introduces a new aspect at random, thereby allowing the population to escape the local extremum and search other terrains for a global optimum. The rate of mutation must be high enough to avoid long periods of stagnation in the evolutionary process, but low enough to ensure a measure of stability; that is, providing the population with a chance to reach the local optimum before leaping into distant terrain.

The appropriate type and extent of mutation, as well as crossover, will of course depend on the problem domain. Since the performance function is generally unknown in advance, the optimal configuration of the fitness function, as well as the nature of crossover and mutation, may be regarded as design issues which must be informed by detailed knowledge of the application area.

The basic structure of a GA is as follows, where P(t) denotes a population of candidate solutions to a given problem at generation t:

t := 0;
initialize P(t);
evaluate P(t);
while not (termination condition)
begin
    t := t + 1;
    reproduce P(t) from P(t - 1);
    recombine P(t);
    evaluate P(t);
end;
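A minimal runnable rendering of this loop is sketched below, assuming a population of fixed-length bit strings, roulette-wheel reproduction, single-point crossover, and bitwise mutation; all names and parameter defaults are illustrative, and the fitness is assumed to be positive.

import random

def genetic_algorithm(fitness, random_individual, n_pop=30, n_gen=10,
                      p_cross=0.6, p_mut=0.05):
    pop = [random_individual() for _ in range(n_pop)]    # initialize P(t)
    for _ in range(n_gen):
        scores = [fitness(ind) for ind in pop]           # evaluate P(t)
        # reproduce P(t) from P(t-1): roulette-wheel (fitness-proportional) selection
        pop = random.choices(pop, weights=scores, k=n_pop)
        # recombine P(t): single-point crossover on consecutive pairs (n_pop assumed even)
        next_pop = []
        for a, b in zip(pop[0::2], pop[1::2]):
            if random.random() < p_cross:
                cut = random.randrange(1, len(a))
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            next_pop += [a, b]
        # followed by bitwise mutation
        pop = [''.join(c if random.random() > p_mut else random.choice('01')
                       for c in ind) for ind in next_pop]
    return max(pop, key=fitness)

# Toy usage: evolve a 16-bit string with as many 1s as possible
best = genetic_algorithm(lambda s: 1 + s.count('1'),
                         lambda: ''.join(random.choice('01') for _ in range(16)))
print(best)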

3. Feature weighting approaches

When the input vector for a discrimination task contains too many features, extraneous noise can cause errors in the classification phase. Furthermore, it is very difficult to train a generalized model due to the computational complexity and the probability of convergence to a strictly local minimum. Thus feature weighting is a critical issue for competitive classifiers and for data reduction. Feature weighting involves assigning a real-valued weight to each feature.¹ The weight of each feature implies the relative importance of the attribute for the classification task.

The feature weighting algorithms fall into two categories based on whether or not they perform weighting independently of the learning algorithm. Independent weighting is known as a filter approach, whereas the dependent method is called a wrapper (John, Kohavi & Pfleger, 1994). Despite its computational efficiency, the major drawback of the filter approach is that the resulting weights may not be optimal for a particular task. On the other hand, the wrapper approach entails a heavy computational load. Even so, it can optimize weights for a specific classifier since it evaluates the weights based on the hit rate due to that classifier.

¹ Feature weighting includes feature selection, since selection is a special case of weighting with binary weights.

Among parametric, linear statistical models, feature selection methods include the use of the univariate t-test, principal component analysis, stepwise discriminant analysis, and ordinary least squares. For statistical methods, the main focus lies in feature selection based on the statistical significance of each feature.

In a distance-based piecewise linear approach such as k-NN, irrelevant features greatly hamper overall performance. For this reason, the relative importance (weight) of each feature is key to the similarity assessment. When nonparametric, nonlinear classifiers are used, the nonparametric, nonlinear feature-weighting methods based on the genetic algorithm appear to be superior to statistical methods (Shin & Han, 1999).

3.1. Nonlinear neural feature weighting algorithms

In 1991, Tarr suggested a simple but efficient feature-weighting method based on the stochastic gradient descent technique employed in the backpropagation procedure (Belue & Bauer, 1995; Looney, 1997; Tarr, 1991). The weighting scheme involves a "saliency metric" defined on the ith input feature:

$$\lambda_i = \sum_{j=1}^{h} w_{ji}^2$$

Here $w_{ji}$ is the weight from the ith input node to the jth hidden node of an MLP with a single hidden layer. The final value of the saliency metric $\lambda_i$ is obtained through the following procedure:

Step 1. Set the number of experiments R to 30 and randomly initialize the weights in the multilayer perceptron.
Step 2. Compute all the feature saliency metrics $\lambda_i^{(r)}$, $i = 1, \ldots, n$, on the rth experiment.
Step 3. If the above steps have been performed fewer than R times, then set r ← r + 1 and go to Step 1. Otherwise, for each $i = 1, \ldots, n$, calculate the average of the computed saliency metrics over all R iterations via $\lambda_i = (1/R)(\lambda_i^{(1)} + \cdots + \lambda_i^{(R)})$.

Most recently, Howes and Crook (1999) suggested a different criterion: the general influence (GI) of the input features on a trained MLP. In contrast to the saliency metric, GI assumes that the trained network has generalized well; that is, the final weights of the network are nearly optimal. GI considers the effect of the output node in addition to the normalization for the extreme weights. Unfortunately, the GI metric does not directly yield the overall influence (weight) of each feature on the output node (class). Therefore we need to modify the GI metric to discern the overall effect of a feature on the output node.

To this end, the overall weight (OW) is defined as

$$\mathrm{OW}(x_i) = \frac{\sum_{j=1}^{h} \sum_{k=1}^{o} |w_{ji}|\,|w_{kj}|}{\sum_{j=1}^{h} \sum_{k=1}^{o} |w_{kj}|} \qquad (1)$$

where $x_i$ denotes the ith feature of the input vector, $w_{ji}$ the connection weight from the ith input node to the jth hidden node, and $w_{kj}$ the connection weight from the jth hidden node to the kth output node.
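For concreteness, Eq. (1) can be computed directly from the two weight matrices of a trained single-hidden-layer MLP. The sketch below is our own rendering of the formula; the array names and shapes are assumptions.

import numpy as np

def overall_weight(W_ih, W_ho):
    # W_ih[i, j] holds w_ji (input i to hidden j); W_ho[j, k] holds w_kj
    # numerator_i = sum_j sum_k |w_ji| |w_kj| = sum_j |w_ji| * (sum_k |w_kj|)
    numer = np.abs(W_ih) @ np.abs(W_ho).sum(axis=1)
    denom = np.abs(W_ho).sum()   # sum_j sum_k |w_kj|, shared by all features
    return numer / denom         # one OW value per input feature

# Toy usage: 5 inputs, 3 hidden nodes, 1 output (the XOR architecture of Section 4)
rng = np.random.default_rng(0)
print(overall_weight(rng.normal(size=(5, 3)), rng.normal(size=(3, 1))))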

3.2. Genetic algorithms for feature weighting

3.2.1. Previous work

The seminal work on feature weighting using a GA is due to Siedlecki and Sklansky (1989). They introduced feature selection (0–1 weighting) algorithms based on genetic search, and showed the effectiveness of the GA through an experimental study using a 5-NN classifier. The GA can reduce the time for finding near-optimal feature subsets, especially on high-dimensional data that the classical branch-and-bound algorithm finds intractable.

Kelly and Davis (1991) proposed a GA-based, weighted k-NN approach (GA-WKNN), where search is guided by both training accuracy and recency. They showed that the GA-WKNN attained lower error rates than standard k-NN. Subsequently, Brill, Brown and Martin (1992) described experiments using a genetic algorithm for feature selection in the context of the counterpropagation neural network by approximating the k-NN classifier's performance.

Punch et al. (1993) extended the work of Kelly and Davis using the concept of the cardinality of the minority, which played a key role for class separation. Vafaie and De Jong (1993) proposed the wrapper approach for the AQ15 classifier based on genetic feature selection. Recently, a few studies of GA-based feature weighting have been reported (Ishii & Wang, 1998; Kim & Shin, 1998; Yang & Honavar, 1998). Yang and Honavar used a GA-based feature selector for their incremental neural network, DistAl. In contrast to much of the previous work focusing on the algorithm itself, Shin and Han (1999) applied the GA-based feature weighting approach for a NN classifier to a corporate bond rating problem. In the machine learning literature, especially on lazy learning, various feature weighting algorithms based on distance metrics have been proposed (Wettschereck et al., 1997).

3.2.2. Generalization capabilities for a GA optimization model

3.2.2.1. Considerations. At first glance, an appropriate metric for problems in forecasting or classification lies in the hit rate: the proportion of correct predictions. The hit rate has, in fact, been employed as the standard measure of performance in the literature. Unfortunately, the hit rate by itself suffers from the following limitations:

1. False validation of a local extremum as the global optimum. Even when a procedure lies far from the global optimum, the metric can yield a high hit rate, even 100% for the particular test set at hand. Such misleading results are especially likely to occur for small test sets. An example of false validation is depicted in Fig. 1.

2. Promotion of stagnation. As indicated in the previous section, the course of optimization can lead to congestion or stagnation. In such situations only a mutation is likely to deliver the system into virgin territory. However, in a software simulation, as in nature, most mutations are likely to be detrimental to the performance of an organism. This is true even if the mutation eventually leads to other adaptations which yield better performance in subsequent generations. Consequently, the single metric of hit rate will tend to block mutations and thereby reinforce stagnation in the evolutionary process.

3. A standard trade-off in predictive tasks lies in accuracy versus cost. The cost takes the form of processing time as well as data collection requirements (Dash & Liu, 1997; Siedlecki & Sklansky, 1989; Yang & Honavar, 1998). To minimize both varieties of cost, it is imperative to weight features or attributes in an efficacious way (Punch et al., 1993).

The preceding discussion highlights the importance of designing a performance metric which promotes high accuracy as well as low cost. This is the subject of the next subsection.

3.2.2.2. Fitness function. As a preliminary task toward developing an appropriate fitness function, it is necessary to define an effective clustering procedure for associating similar cases. To this end, we identify a metric of distance which yields high differentiation among the clusters; in other words, large gaps between clusters but small gaps among the cases within a single cluster.

More specifically, the inter-source distance (SD) is defined as the distance between two source or input vectors. The inter-class distance (CD) is defined in terms of the aggregate difference in distance between the collective SD of a class and the total SD of the other classes for the entire training set (one training example is compared against all the others).

In the case of a different (or same) class label, a large (or small) SD could yield a large value for CD. A dataset with a high value of CD indicates good classification performance. The motivation behind this policy is the assumption that "distinct phenomena have dissimilar causes while similar phenomena have similar causes". One beneficial consequence of this policy lies in feature reduction, which lowers the cost of data preprocessing.

The metric of SD between the ith and jth cases is defined as follows:

$$\mathrm{SD}_{i,j} = \sqrt{ \sum_{k=1}^{n} W_k \left( S_{i,k} - S_{j,k} \right)^2 }$$

Further, the CD metric is defined as

$$\mathrm{CD} = \sum_{i=1}^{p} \sum_{j=1}^{p} \mathrm{dSD}_{i,j} - \sum_{i=1}^{p} \sum_{j=1}^{p} \mathrm{sSD}_{i,j}$$


Fig. 1. Example of an adaptive transformation of feature weights for k-NN through the reduction of a dimension and re-scaling of axes. Here the number of neighbors is k = 3. From the initial state of 3 features, the weight of Feature 1 has been changed to 0.0 (feature-selection effect) and the others to greater or less than 1.0. The value of cMaP changed from 2 to 3 and cMiC from 1 to 0. Moreover, CD was increased sufficiently to yield source-level homogeneity within the same class (logical input vector (a, c, d) has been changed to (a, e, d) and (h, i, d) to (h, i, j) in a different class). In both situations, it is possible to get a hit rate of 1.0; however, the situation on the right is closer to the optimal solution. O: input pattern; K: stored pattern with desired class; W: stored pattern triggering misclassification.


where the following notations are used: n, the number of input features; $W_k$, the weight of the kth feature; $S_{i,k}$, the value of the kth input feature in the ith case; $S_{j,k}$, the value of the kth input feature in the jth case; p, the total number of training cases; $\mathrm{dSD}_{i,j}$ denotes $\mathrm{SD}_{i,j}$ when the ith and jth cases belong to different classes; $\mathrm{sSD}_{i,j}$ denotes $\mathrm{SD}_{i,j}$ when the ith and jth cases belong to the same class.
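The CD metric vectorizes naturally; the following sketch (ours, with hypothetical names) sums the weighted pairwise distances over different-class pairs and subtracts the sum over same-class pairs, exactly as in the formula above. The same-class sum harmlessly includes the zero-distance terms with i = j.

import numpy as np

def class_distance(X, y, w):
    diff = X[:, None, :] - X[None, :, :]        # pairwise feature differences
    sd = np.sqrt((w * diff ** 2).sum(axis=2))   # SD_{i,j} for every pair (i, j)
    same = y[:, None] == y[None, :]             # True where class labels match
    return sd[~same].sum() - sd[same].sum()     # CD = sum dSD - sum sSD

# Toy usage: two tight clusters of different classes give a large positive CD
X = np.array([[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]])
print(class_distance(X, np.array([0, 0, 1, 1]), np.array([1.0, 1.0])))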

Traditionally, the simplest fitness function used for a GA is:

$$\mathrm{Fitness} = \mathrm{HR} \qquad (2)$$

Here HR denotes the hit rate; that is, the number of correctly classified patterns divided by the total number of training patterns.

While this function gives reasonable performance, it does not generate a set of weights which yields the maximum class separation. Rather, it only produces weights that optimize separation based on the threshold for rating each case as a member of one class or another.

A better fitness metric considers the number of neighbors that are not used in the majority decision (Punch et al., 1993). The criterion using the near-neighbor minority set is as follows:

$$\mathrm{Fitness} = \alpha\,\mathrm{HR} + \beta \left( 1 / (\mathrm{cMiP} / (k \times p)) \right)$$

Here $\beta$ is a tunable parameter, k the number of neighbors in k-NN based on accuracy on the training set, cMiP the cardinality of minority patterns within the group of k neighbors, and p the total number of training patterns.

When a domain problem has many class labels, the cMiP measure is not a sufficient criterion for maximizing the inter-class separation. In such situations, the total number of minority class labels should be considered to distinguish a particular class from the others. For this reason, we can modify the cMiP metric into the cMaP criterion, which reflects the cardinality of the near-neighbor majority patterns. In addition, we define the cMiC criterion as the cardinality of the minority classes for k-NN classification:

$$\mathrm{Fitness} = \mathrm{HR} + \eta \left( (\mathrm{cMaP} - \mathrm{cMiC}) / (k \times p) \right) \qquad (3)$$

Here cMaP denotes the cardinality of the majority patterns in k neighbors, cMiC the cardinality of minority classes within the group of k neighbors, and $\eta$ the tunable parameter.
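One plausible reading of the two cardinality terms for a single retrieval is sketched below; this is our interpretation rather than code from the paper, with cMiC taken as the number of minority classes represented among the k neighbors.

from collections import Counter

def cardinality_terms(neighbor_labels):
    counts = Counter(neighbor_labels)   # k_i for each class among the k neighbors
    cMaP = max(counts.values())         # cardinality of the majority patterns
    cMiC = len(counts) - 1              # number of minority classes represented
    return cMaP, cMiC

# Toy usage: 5 neighbors drawn from classes {A, A, A, B, C} -> cMaP = 3, cMiC = 2
print(cardinality_terms(['A', 'A', 'A', 'B', 'C']))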

Finally, an extended fitness function may be defined in the following way:

$$\mathrm{Fitness} = \mathrm{HR} + \eta \left( (\mathrm{cMaP} - \mathrm{cMiC}) / (k \times p) \right) + \gamma (\mathrm{CD} / p) \qquad (4)$$

For example, in the Australian dataset with 14 features, consider the following string of 56 bits (= 14 attributes × 4 bits):

"01111011010000100101001001101111111000110001101100111100"

Within the string, the first 4 bits correspond to the first feature, with a value of 7 (= 0111 in base 2). Similarly, the weight of the last feature (1100) is 12, and so on. This string exhibits a hit rate of 0.8630, a second cardinality term η((cMaP − cMiC)/(k × p)) of 0.4365, a last distance term γ(CD/p) of 0.3018, and an overall fitness of 1.6013. A comprehensive example of a candidate GA-based feature weighting scheme is depicted in Fig. 2.

Fig. 2. Architecture for neuro-genetic synergistic feature weighting.
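A small sketch of the chromosome decoding and of Eq. (4) itself, assuming the scalar ingredients (hit rate, cardinalities, CD) have already been computed; the function names and default parameter values are illustrative.

def decode_weights(chromosome, bits_per_feature=4):
    # '0111' -> 7 for the first feature, ..., '1100' -> 12 for the last
    return [int(chromosome[i:i + bits_per_feature], 2)
            for i in range(0, len(chromosome), bits_per_feature)]

def extended_fitness(hr, cMaP, cMiC, cd, k, p, eta=1.0, gamma=0.05):
    # Eq. (4): hit rate + cardinality term + inter-class distance term
    return hr + eta * (cMaP - cMiC) / (k * p) + gamma * cd / p

print(decode_weights("01111011010000100101001001101111111000110001101100111100"))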

4. Experimental study

4.1. Data sets

For the experimental study, we employed six datasets. The first two are the well-known XOR and Parity-3 problems. For these two datasets, we generated some additional irrelevant features in a random fashion. The problems were used to investigate the effectiveness of our MLP-based feature weighting methods.

The third dataset, specified by the IBM Almaden Research Center, related to a customer marketing application (Agrawal et al., 1993).

The last three datasets were selected from the popular machine learning repository at the University of California at Irvine (UCI). For these three applications, relating to real-world domains, the number of classes ranged from 2 to 4.

The specification of the six data sets is listed in Table 1. In the first stage of the experiment, Eq. (1) was compared to the method of Belue and Bauer (1995).

4.1.1. XOR and Parity-3 problems

The input features for the XOR problem are the four potential combinations of two binary variables. The output is a single bit whose value is 1 if there is precisely one bit equal to 1, and 0 otherwise.

For the Parity-3 dataset, there are eight potential combinations of three binary variables. The output is a single bit equal to 1 if there is an odd number of high bits, or 0 if there is an even number of high bits. For each of these two datasets, additional noise features were introduced by appending extra columns with random binary values of 0 or 1.

4.1.2. Customer marketing dataset from IBM Almaden Research Center

The dataset, presented in Table 2, had been generated by Agrawal et al. to evaluate their database mining algorithm CDP (Agrawal et al., 1993; Setiono & Liu, 1997). A collection of classification functions was proposed as a test of the algorithm. Among these, we selected the fourth function (the second function used in the Setiono and Liu study) as a representative test case; this function exhibits three subtests with the greatest complexity. In our experiment, 1000 cases were generated from the specifications in Table 2. Among these, 700 patterns were used for training and the remaining 300 for testing.

The discrimination rule for Function 4 is as follows:


Table 1
Specification of datasets used in the experiments

Data set             Training size   Test size   # of Features   # of Classes
XOR                  70              30          5, 8            2
Parity-3             70              30          5, 8            2
Customer Marketing   700             300         9               2
Australian Credit    460             230         14              2
German Credit        670             330         24              2
Vehicle              566             280         18              4

Table 2
Attributes of the Customer Marketing dataset

salary       Salary                 Uniformly distributed from 20,000 to 150,000
commission   Commission             If salary > 75,000 then commission = 0, else uniformly distributed from 10,000 to 75,000
age          Age                    Uniformly distributed from 20 to 80
elevel       Education level        Uniformly chosen from [0, 1, 2, 3, 4]
car          Make of the car        Uniformly chosen from [1, 2, ..., 20]
zipcode      Zip code of the town   Uniformly chosen from 9 available zip codes
hvalue       Value of the house     Uniformly distributed from 0.5k × 100,000 to 1.5k × 100,000, where k ∈ {0, ..., 9} depends on zipcode
hyears       Years house owned      Uniformly chosen from [1, 2, ..., 30]
loan         Total amount of loan   Uniformly distributed from 1 to 500,000


Group A:
((age < 40) ∧ ((elevel ∈ [0 ... 2]) ? (25,000 ≤ salary ≤ 75,000) : (50,000 ≤ salary ≤ 100,000))) ∨
((40 ≤ age < 60) ∧ ((elevel ∈ [1 ... 3]) ? (50,000 ≤ salary ≤ 100,000) : (75,000 ≤ salary ≤ 125,000))) ∨
((age ≥ 60) ∧ ((elevel ∈ [2 ... 4]) ? (50,000 ≤ salary ≤ 100,000) : (25,000 ≤ salary ≤ 75,000)))

Group B: Otherwise.
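Rendered as executable code under our reading of the ternary conditions above, the Group A test becomes:

def function4_group(age, elevel, salary):
    if age < 40:
        in_a = (25000 <= salary <= 75000) if elevel in (0, 1, 2) \
               else (50000 <= salary <= 100000)
    elif age < 60:
        in_a = (50000 <= salary <= 100000) if elevel in (1, 2, 3) \
               else (75000 <= salary <= 125000)
    else:
        in_a = (50000 <= salary <= 100000) if elevel in (2, 3, 4) \
               else (25000 <= salary <= 75000)
    return 'A' if in_a else 'B'

print(function4_group(age=35, elevel=1, salary=60000))   # -> 'A'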

4.1.3. Real world datasets from the UCI Machine Learning Repository

As explained previously, the last three datasets were selected from the Machine Learning Repository at UCI. This resource contains datasets which serve as standard yardsticks for evaluating algorithms developed by the machine learning community. Two of these datasets involve the realm of finance. The data had been compiled in concert with a multinational project sponsored by the ESPRIT programme. The applications involved credit card approval based on customer demographics and usage histories.

The last dataset, Vehicle, embodies four types of vehicles described in terms of their silhouettes (Murphy, 1993).

4.1.3.1. Australian credit card dataset. This dataset involves credit card applications. All attribute names and values had been altered to meaningless symbols to protect the confidentiality of the data. This dataset is interesting because there is a good mix of attributes: continuous, nominal with a small number of values, and nominal with a larger variety of values. The dataset comprises 690 patterns. The 14 features consist of 6 numerical and 8 categorical variables.

4.1.3.2. German credit card dataset. For this dataset, the categorical attributes had been modified to make it suitable for algorithms which cannot cope with nominal variables. Several categorical variables were coded as integers, while others were replaced by indicator variables. The dataset contains 24 features and 1000 cases.

4.1.3.3. Vehicle dataset. This dataset contains four types of vehicles, described by features extracted from their silhouettes. The vehicles are perceived from a variety of different angles. The original purpose was to distinguish 3D objects within a 2D image based on shape features of the 2D silhouettes.

4.2. Preprocessing and experiments

To ensure a measure of consistency in the weights for employing LLAs, each feature of each dataset was normalized into the unit interval [0, 1]. The GA we employed was tailored after the Goldberg model (Goldberg, 1989). Moreover, our fitness function was employed in conjunction with a k-NN procedure.
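Such min-max normalization can be performed column by column; a short sketch (ours, not the authors' code) follows, with a guard against constant features.

import numpy as np

def normalize_unit_interval(X):
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero for constant columns
    return (X - lo) / span

print(normalize_unit_interval(np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 30.0]])))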

All codes were written in C++ and run on a Sun Sparc Unix machine. The parameter settings for the GA were selected largely as reported in the literature (Brill et al., 1992; Chambers, 1995; Goldberg, 1989; Punch et al., 1993; Yang & Honavar, 1998). More specifically, the parameter settings are displayed in Table 3.

During the experimental stage, each dataset was divided randomly into two partitions: a training segment and a test segment. The ratio of training to test instances was maintained at approximately 7 to 3. The primary purpose of our research lay in benchmarking various feature-weighting methods rather than building an optimized model for each dataset. Consequently, the assignment of cases to the partitions and the partitioning ratio did not take into consideration the decisions made in previous studies on the same datasets (see Murphy, 1993).

Before the weighting of features using the GA, the most promising neighborhood size k was to be determined from the training set. To this end, experiments on four datasets were conducted by varying k over the odd integers from 1 to 29. To obtain an unbiased value for the parameter k, the "leave-one-out" tactic was used for each dataset.


Table 3
Parameters for experimental study

Experimental attribute                Customer Marketing   Australian Credit   German Credit   Vehicle
MLP architecture                      9 × 15 × 2           14 × 5 × 2          24 × 5 × 2      18 × 4 × 4
Population size                       30                   30                  30              30
Number of generations                 10                   10                  10              10
Probability of crossover              0.6, 0.7             0.6, 0.7            0.6, 0.7        0.6, 0.7
Probability of mutation               0.05                 0.05                0.05            0.05
Probability of selection by ranking   0.6                  0.6                 0.6             0.6
Crossover point                       Single               Single              Single          Single
Coding scheme                         Binary               Binary              Binary          Binary
Parameter η                           0.9                  1.0                 0.9             1.0
Parameter γ                           0.03                 0.05                0.03            0.04
Parameter k                           5                    15                  7               5


Table 4
Preliminary experiment to verify the effectiveness of the MLP-based feature weighting method (feature weight vectors by number of training iterations)

XOR (5 × 3 × 1 architecture, hit rate 1.0):
  100: {8.01, 8.01, 0.00, 0.00, 0.00}
  200: {8.97, 8.47, 0.00, 0.00, 0.00}
  300: {8.97, 9.47, 0.00, 0.00, 0.00}
  400: {9.49, 9.49, 0.00, 0.00, 0.00}
  500: {9.47, 9.47, 0.00, 0.00, 0.00}

Parity-3 (5 × 3 × 1 architecture, hit rate 1.0):
  100: {0.78, 0.39, 1.18, 1.38, 0.98}
  200: {5.57, 5.83, 5.94, 0.00, 0.00}
  300: {7.73, 7.38, 7.38, 0.00, 0.00}
  400: {8.35, 8.35, 8.03, 0.00, 0.00}
  500: {8.40, 8.40, 8.40, 0.00, 0.00}


For the Customer Marketing, Australian, German, and Vehicle datasets, the best values of k were 5, 15, 7, and 5, respectively (see Table 3).
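A compact version of this leave-one-out search for k might look as follows; the sketch assumes integer class labels, and the names are illustrative.

import numpy as np

def best_k_leave_one_out(X, y, weights, candidates=range(1, 30, 2)):
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((weights * diff ** 2).sum(axis=2))   # weighted pairwise distances
    np.fill_diagonal(d, np.inf)                      # never use a point as its own neighbor
    order = np.argsort(d, axis=1)                    # neighbors sorted by distance
    best = (1, -1.0)
    for k in candidates:
        pred = np.array([np.bincount(y[row[:k]]).argmax() for row in order])
        hr = float(np.mean(pred == y))               # leave-one-out hit rate
        if hr > best[1]:
            best = (k, hr)
    return best

# Toy usage on a four-case dataset
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print(best_k_leave_one_out(X, np.array([0, 1, 1, 0]), np.array([1.0, 1.0]), candidates=[1, 3]))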

Most real world classification problems have decision boundaries which are nonlinearly separable and suffer from irrelevant features. In the face of such difficulties, how effective are the MLP-based feature weighting methods? To address this issue, we selected two binary classification problems, XOR and Parity-3, which have nonlinearly separable decision boundaries, then contaminated the data with artificially generated noise features.

For the first experiment, we injected three additional noise features into the XOR problem and two into the Parity-3 problem. To simplify our experiments, the network architectures for the two problems were both set to 5 × 3 × 1. Moreover, the number of iterations was incremented by 100 until the classification accuracy on the test set approached 1.0.

All the test samples were classified correctly after 300 training cycles (see Table 4). After 500 training cycles, the weights of the irrelevant features converged to 0.0, while the magnitudes of the weights for all relevant features were precisely the same: 9.47 for XOR, and 8.40 for Parity-3.

An interesting phenomenon occurred for the irrelevant features in both problems. For spurious features, all of the weights vanished to 0.0 within 100 or 200 iterations, long before the weights of the substantive features stabilized. This may be a useful characteristic when the primary goal is to select salient features. In such situations, it would be possible to detect irrelevant features early in the training stage.

The first experiment verified the effectiveness and stability of our OW metric defined in Eq. (1). In particular, the weights of relevant features converged to identical positive values while irrelevant features converged to 0.0.

A secondary set of experiments addressed more challenging tasks. These involved additional noise features and insufficient training cycles (see Table 5). The number of noise features was increased to six for XOR and to five for Parity-3. Moreover, the training cycles were both set to 300. For the XOR problem, all of the irrelevant weights converged to 0.0 after 300 training cycles, while the MLP correctly classified all the test samples. In contrast, for the Parity-3 problem with an 8 × 3 × 1 network, the deviation among the weights of relevant and irrelevant features was still relatively large, while the classification accuracy was 80%. However, an expanded model with 6 hidden nodes yielded 100% classification accuracy with smaller deviation.

These preliminary experiments demonstrated that the OW metric defined by Eq. (1) can produce more precise feature weights than the method of Tarr in terms of the standard deviation among features of the same nature (whether relevant or irrelevant).


Table 5
Secondary experiment with augmented noise features and limited training cycles (feature weight vectors by number of training iterations)

XOR (8 × 3 × 1 architecture, hit rate 1.0):
  100: {4.68, 8.98, 0.00, 0.00, 0.06, 0.00, 0.00, 0.00}
  200: {7.28, 7.31, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00}
  300: {8.30, 7.74, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00}

Parity-3 (8 × 3 × 1 architecture, hit rate 0.8):
  100: {2.28, 1.49, 2.67, 1.48, 0.69, 0.99, 0.00, 0.59}
  200: {6.32, 5.63, 8.28, 0.65, 0.00, 0.35, 0.35, 0.00}
  300: {7.01, 6.32, 9.62, 0.34, 0.00, 0.34, 0.00, 0.00}

Parity-3 (8 × 6 × 1 architecture, hit rate 1.0):
  100: {4.47, 4.61, 3.69, 1.14, 0.00, 0.00, 0.00, 0.00}
  200: {6.46, 6.21, 6.21, 0.70, 0.00, 0.00, 0.00, 0.00}
  300: {6.80, 6.60, 6.80, 0.73, 0.00, 0.00, 0.00, 0.00}

Table 6
Effect of the multi-criteria fitness functions on generalization

Dataset (# of patterns)    Method    HR      η(cMaP − cMiC)/(k × p)   γ(CD/p)   Fitness
Customer Marketing (700)   Eq. (2)   0.883   0.280                    −0.448    0.715
                           Eq. (3)   0.890   0.313                    −0.412    0.791
                           Eq. (4)   0.910   0.324                    −0.260    0.974
Australian Credit (460)    Eq. (2)   0.876   0.440                    0.204     1.520
                           Eq. (3)   0.876   0.442                    0.211     1.529
                           Eq. (4)   0.863   0.437                    0.302     1.602
German Credit (670)        Eq. (2)   0.740   0.515                    −0.581    0.674
                           Eq. (3)   0.739   0.518                    −0.505    0.752
                           Eq. (4)   0.708   0.513                    −0.429    0.792
Vehicle (556)              Eq. (2)   0.646   −0.411                   0.137     0.372
                           Eq. (3)   0.658   −0.369                   0.149     0.438
                           Eq. (4)   0.663   −0.371                   0.158     0.450


4.3. Experimental results on real world datasets

The salient difference between the single-criterion and multi-criteria fitness functions is apparent in Table 6. For two of the four datasets, Eq. (2) found the best weight vector while Eq. (4) yielded the worst vector in terms of the hit rate on the training samples. However, in the test stage, the classification performance of Eq. (4) was always superior to that of Eq. (2) in spite of the inferior hit rates and cardinality values for the training samples.

For most problems other than the Vehicle dataset, the classification rate for k-NN using Eq. (4) outperformed the other fitness measures in terms of generalization (or robustness) capability as k increases. In addition, the multi-criteria optimization methods based on Eqs. (3) and (4) displayed superior performance compared to the single-criterion approach of Eq. (2). The detailed results are presented in Tables 7–9.

In contrast to the GA-based methods, an MLP-based approach using Eq. (1) outperformed the others for the Vehicle dataset. Further, Eq. (1) exhibited competitive performance in various classification tasks involving several classes (2 or 4 categories).

5. Conclusion

The experimental study uncovered a number of notable results. First of all, the fitness measure using Eq. (4) showed a strong capacity for generalization. In particular, the accuracy on test samples rose as k increased. In contrast to the weighted k-NN method using Eq. (4), standard k-NN did not show any improvement as a function of k.

In all datasets, the lack of generalization is notable when we recall that the k-NN error rate follows the Bayes optimal error rate as k increases and the size of the set of stored cases goes to infinity.


Table 7
Estimated feature weight vectors in the training phase (normalized range from 0 to 1)

Customer Marketing (9 features):
  Best:    {1, 0, 1, 1, 0, 0, 0, 0, 0}
  Worst:   {0, 1, 0, 0, 1, 1, 1, 1, 1}
  Eq. (1): {1.000000, 0.156119, 0.168897, 0.110247, 0.005335, 0.000000, 0.001360, 0.009730, 0.001576}
  Eq. (2): {1.000000, 0.400000, 0.600000, 0.666667, 0.133333, 0.000000, 0.266667, 0.133333, 0.000000}
  Eq. (3): {1.000000, 0.333333, 0.733333, 0.933333, 0.000000, 0.000000, 0.266667, 0.000000, 0.000000}
  Eq. (4): {1.000000, 0.000000, 0.363636, 0.363636, 0.000000, 0.000000, 0.090909, 0.000000, 0.000000}

Australian Credit (14 features):
  Eq. (1): {0.042482, 0.106334, 0.000000, 0.042482, 0.659629, 0.149073, 0.276519, 1.000000, 0.595778, 0.149073, 0.319258, 0.510556, 0.404222, 0.149073}
  Eq. (2): {0.933333, 0.200000, 0.533333, 0.266667, 0.000000, 0.200000, 0.266667, 0.866667, 0.533333, 0.066667, 1.000000, 0.866667, 0.066667, 0.400000}
  Eq. (3): {0.933333, 0.200000, 0.800000, 0.400000, 0.000000, 0.200000, 0.400000, 0.933333, 0.533333, 0.600000, 1.000000, 0.866667, 0.066667, 0.266667}
  Eq. (4): {0.428571, 0.714286, 0.214286, 0.071429, 0.285714, 0.071429, 0.357143, 1.000000, 0.928571, 0.142857, 0.000000, 0.714286, 0.142857, 0.785714}

German Credit (24 features):
  Eq. (1): {1.000000, 0.750000, 0.611360, 0.500000, 0.555903, 0.639088, 0.416816, 0.889088, 0.611360, 0.194544, 0.750000, 0.278175, 0.000000, 0.722272, 0.305903, 0.916816, 0.611360, 0.361360, 0.528175, 0.055903, 0.583631, 0.222272, 0.750000, 0.472272}
  Eq. (2): {0.800000, 1.000000, 0.133333, 1.000000, 0.866667, 0.466667, 0.600000, 0.400000, 0.200000, 0.266667, 0.533333, 0.800000, 0.800000, 0.933333, 0.600000, 0.266667, 0.800000, 0.866667, 0.800000, 1.000000, 0.066667, 0.933333, 0.000000, 0.200000}
  Eq. (3): {0.800000, 0.800000, 0.200000, 0.666667, 0.800000, 0.600000, 0.800000, 0.600000, 0.400000, 0.333333, 0.333333, 0.333333, 0.066667, 0.533333, 0.933333, 0.666667, 0.333333, 0.600000, 0.733333, 0.133333, 0.133333, 1.000000, 0.000000, 0.066667}
  Eq. (4): {0.916667, 1.000000, 0.750000, 0.416667, 0.666667, 0.083333, 0.000000, 0.083333, 0.083333, 0.000000, 0.166667, 0.083333, 0.750000, 0.166667, 1.000000, 0.250000, 0.000000, 0.083333, 0.416667, 0.333333, 0.166667, 0.750000, 0.000000, 0.250000}

Vehicle (18 features):
  Eq. (1): {0.200000, 0.066667, 1.000000, 0.000000, 0.066667, 0.866667, 0.600000, 0.533333, 0.933333, 0.066667, 0.533333, 0.666667, 0.800000, 0.533333, 0.933333, 0.800000, 0.066667, 0.533333}
  Eq. (2): {0.466667, 0.066667, 0.733333, 0.000000, 0.600000, 0.933333, 0.600000, 0.733333, 1.000000, 0.066667, 0.800000, 0.533333, 0.933333, 0.800000, 0.466667, 0.800000, 0.066667, 0.533333}
  Eq. (3): {0.138778, 0.000000, 0.341633, 0.128154, 0.505312, 0.142320, 0.138778, 0.911023, 1.000000, 0.081895, 0.238490, 0.427069, 0.138778, 0.181496, 0.775786, 0.138778, 0.066667, 0.533333}
  Eq. (4): {0.200000, 0.066667, 0.466667, 0.000000, 0.000000, 0.400000, 0.733333, 0.666667, 0.400000, 0.466667, 0.666667, 0.533333, 1.000000, 0.533333, 0.466667, 0.400000, 0.066667, 0.533333}


However, this result is not surprising given the limited number of training samples coupled with the complexity of the classification tasks.

Second, the performance of the weighted k-NN classification estimated by MLP was reasonably competitive in terms of computational cost and overall accuracy. Typically, the computational complexity for feature weighting by MLP is much lower than that for GA: O(N) versus O(N²).

When the training sample size is much larger than the number of features or the number of hidden neurodes, the size of the training set dominates overall complexity. More specifically, the computational complexity of GA-based feature weighting coupled with a k-NN fitness metric is O(F × N² × P × G). The corresponding figure for the MLP is O((F × H + H × O) × N × I). Here F denotes the number of input features, N the number of training patterns, P the population size per generation, G the number of generations, H the number of hidden nodes, O the number of output nodes, and I the number of iterations.


Table 8
Accuracy (%) of feature weighting methods in comparison to standard k-NN, for k = 1, 3, 5, ..., 29, followed by the mean over all values of k

            k=1  k=3  k=5  k=7  k=9  k=11 k=13 k=15 k=17 k=19 k=21 k=23 k=25 k=27 k=29 | Mean

Customer Marketing:
  Best      94.3 93.0 92.3 89.7 90.0 90.7 90.0 90.0 89.3 89.3 90.3 88.3 90.3 89.3 91.0 | 90.5
  Worst     61.0 59.3 59.7 61.3 62.0 62.0 60.3 60.0 59.7 59.3 60.3 60.0 62.0 61.0 59.3 | 60.5
  Standard  72.3 74.7 76.3 75.0 75.3 75.3 74.7 71.7 71.7 71.7 70.0 70.3 69.7 70.7 70.3 | 72.6
  Eq. (1)   85.0 87.3 88.7 89.7 87.3 87.3 88.3 87.7 88.0 86.7 86.7 86.0 86.7 84.3 85.3 | 87.0
  Eq. (2)   85.3 86.0 87.0 86.7 86.3 86.7 84.7 84.7 84.0 84.7 85.7 84.7 83.0 84.3 82.3 | 85.1
  Eq. (3)   90.7 91.0 90.3 88.7 88.7 86.7 88.0 87.7 86.3 87.3 87.0 86.0 86.0 85.7 84.7 | 87.7
  Eq. (4)   93.3 94.7 92.3 92.0 91.0 88.0 90.0 91.0 90.3 90.3 90.0 89.7 89.0 89.0 88.3 | 90.6

Australian Credit:
  Standard  78.3 82.2 84.3 81.7 82.2 83.0 83.5 82.6 82.6 81.7 81.3 81.3 80.9 82.2 82.6 | 82.03
  Eq. (1)   78.3 83.0 83.5 86.1 84.3 84.3 84.3 85.6 83.9 83.9 84.8 82.6 81.7 80.0 80.0 | 83.09
  Eq. (2)   80.9 80.9 80.0 80.0 81.7 81.3 80.9 82.2 82.6 83.0 82.2 83.0 84.3 82.2 80.9 | 81.74
  Eq. (3)   82.2 80.4 79.6 80.0 81.7 81.3 81.7 83.0 83.0 83.5 82.6 83.5 83.5 81.7 81.3 | 81.93
  Eq. (4)   77.8 82.6 82.2 83.0 83.9 84.8 84.8 84.8 85.2 85.2 85.2 84.8 85.2 85.2 86.5 | 84.08

German Credit:
  Standard  67.9 70.6 70.9 73.6 73.3 71.8 70.3 71.5 72.4 73.6 73.3 74.8 73.3 74.8 73.9 | 72.4
  Eq. (1)   71.2 72.1 72.1 71.8 73.3 72.4 73.3 74.5 73.3 73.9 74.5 74.8 75.8 75.5 75.5 | 73.6
  Eq. (2)   70.3 72.1 73.6 73.3 74.5 76.1 73.6 73.9 72.7 73.6 73.3 72.4 73.9 73.9 74.2 | 73.43
  Eq. (3)   69.4 70.0 72.7 74.2 74.5 74.5 76.4 75.8 74.8 73.6 74.5 74.5 74.2 74.5 75.2 | 73.92
  Eq. (4)   70.0 75.8 75.2 76.7 78.2 77.9 77.9 76.7 77.3 76.7 77.0 76.7 76.4 76.4 77.3 | 76.41

Vehicle:
  Standard  66.4 68.6 69.6 67.1 68.9 68.9 67.9 68.9 69.6 69.6 69.6 68.9 66.8 65.4 65.0 | 68.1
  Eq. (1)   67.1 69.3 68.2 72.9 70.7 71.8 72.5 71.4 73.6 72.5 71.4 70.7 71.4 71.1 67.5 | 70.8
  Eq. (2)   67.9 70.0 69.3 70.4 68.9 69.6 68.2 70.4 71.4 68.9 68.6 66.4 68.2 67.9 68.9 | 69.0
  Eq. (3)   68.6 70.4 71.1 71.1 69.3 71.1 71.4 70.4 70.7 70.0 68.2 67.9 69.3 67.9 67.5 | 69.7
  Eq. (4)   67.9 69.3 71.8 70.0 70.0 71.1 70.4 69.6 70.7 71.4 71.4 70.7 70.7 69.3 68.9 | 70.2

Table 9
Wilcoxon matched pairs test for the accuracy of different weighting approaches. Each cell entry denotes the standardized Z-score. The symbol * indicates that the difference is significant at the p = .10 level, ** at p = .05, and *** at p = .01

Customer Marketing dataset:
            Best       Standard   Eq. (1)    Eq. (2)    Eq. (3)
  Standard  3.408***
  Eq. (1)   3.408***   3.408***
  Eq. (2)   3.408***   3.408***   3.408***
  Eq. (3)   3.408***   3.408***   0.314      3.296***
  Eq. (4)   0.314      3.408***   3.351***   3.408***   3.408***

Australian dataset:
            Standard   Eq. (1)    Eq. (2)    Eq. (3)
  Eq. (1)   1.852*
  Eq. (2)   0.524      1.959**
  Eq. (3)   0.114      1.675*     1.019
  Eq. (4)   2.897***   1.193      2.726***   2.613***

German dataset:
            Standard   Eq. (1)    Eq. (2)    Eq. (3)
  Eq. (1)   2.551**
  Eq. (2)   2.062**    0.753
  Eq. (3)   2.888***   0.691      1.258
  Eq. (4)   3.408***   3.237***   3.351***   3.401***

Vehicle dataset:
            Standard   Eq. (1)    Eq. (2)    Eq. (3)
  Eq. (1)   3.238***
  Eq. (2)   1.946**    2.726***
  Eq. (3)   2.953***   1.977**    1.992**
  Eq. (4)   3.408***   1.334      2.481**    1.537



Third, the most accurate weighting method tended to be Eq. (4). This was followed in turn by Eqs. (3), (1), (2), and the standard (no weighting) k-NN procedure.

Fourth, although Eq. (2) tended to select the weight vector that was most accurate on the training set, the performance on the test set was poor compared to Eqs. (3) and (4). This observation indicates that Eq. (2) overfits the training samples due to its sole reliance on the hit rate (see Table 6). Since Eq. (3) optimizes the feature space based on class separability, its performance on the test set was superior to that of Eq. (2).

Consequently, Eq. (4), which uses the CD distance metric in addition to the cardinality term in order to maximize class separability, had the best capability for generalization.

A fruitful direction for future research is to test these algorithms on a wider variety of applications. Another promising avenue lies in the incorporation of metalearning capabilities to automate the quest for improved algorithms.

References

Agrawal, R., Imielinski, T., & Swami, A. (1993). Database mining: a performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6), 914–925.

Aha, D. W., Kibler, D., & Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6, 37–66.

Belue, L. M., & Bauer, K. W. (1995). Determining input features for multilayer perceptrons. Neurocomputing, 7, 111–121.

Brill, F. Z., Brown, E., & Martin, W. N. (1992). Fast genetic selection of features for neural network classifiers. IEEE Transactions on Neural Networks, 3(2), 324–328.

Chambers, L. (1995). Practical handbook of genetic algorithms. Indianapolis: CRC Press.

Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.

Dasarathy, B. V. (1991). Nearest neighbor (NN) norms: NN pattern classification techniques. Los Alamitos, CA: IEEE Computer Society Press.

Dash, M., & Liu, H. (1997). Feature selection for classification. Intelligent Data Analysis, http://www-east.elsevier.com/ida/browse/0103/ida00013/article.htm.

Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.

Goldberg, D. E. (1989). Genetic algorithms in search, optimization, and machine learning. Reading, MA: Addison-Wesley.

Holland, J. H. (1975). Adaptation in natural and artificial systems. Ann Arbor, MI: University of Michigan Press.

Howes, P., & Crook, N. (1999). Using input parameter influences to support the decisions of feedforward neural networks. Neurocomputing, 24, 191–206.

Ishii, N., & Wang, Y. (1998). Learning feature weights for similarity algorithms. IEEE International Joint Symposia on Intelligence and Systems, 27–33.

John, G., Kohavi, R., & Pfleger, K. (1994). Irrelevant features and the subset selection problem. In: Int. Conf. on Machine Learning (pp. 121–129). San Francisco: Morgan Kaufmann.

Kelly Jr, J. D., & Davis, L. (1991). A hybrid genetic algorithm for classification. International Joint Conference on Artificial Intelligence, 645–650.

Kim, S. H., & Shin, S. W. (1998). Optimizing the retrieval of precedents in case-based reasoning through a genetic algorithm. Korean Expert Systems Society '98 Fall Conf. Proc. (pp. 123–129), November.

Kolodner, J. (1993). Case-based reasoning. San Francisco: Morgan Kaufmann.

Langley, P., & Iba, W. (1993). Average-case analysis of a nearest neighbor algorithm. International Joint Conference on Artificial Intelligence, 889–894.

Leake, D., Kinley, A., & Wilson, D. (1995). Learning to improve case adaptation by introspective reasoning and CBR. Int. Conf. on Case-Based Reasoning, Sesimbra, Portugal.

Looney, C. G. (1997). Pattern recognition using neural networks. New York: Oxford University Press.

Mitchell, T. M. (1997). Machine learning. New York: McGraw-Hill.

Murphy, P. (1993). UCI Repository for Machine Learning Databases. Irvine, CA: Department of Information and Computer Science, University of California.

Punch, W. F., Goodman, E. D., Min, P., Lai, C.-S., Hovland, P., & Enbody, R. (1993). Further research on feature selection and classification using genetic algorithms. International Conference on Genetic Algorithms, 557–564.

Schank, R. C., & Abelson, R. (1977). Scripts, plans, goals and understanding. Hillsdale, NJ: Lawrence Erlbaum.

Schank, R. C., & Riesbeck, C. (1990). Inside case-based reasoning. Hillsdale, NJ: Lawrence Erlbaum.

Setiono, R., & Liu, H. (1997). Neural-network feature selector. IEEE Transactions on Neural Networks, 8(3), 654–662.

Shin, K., & Han, I. (1999). Case-based reasoning supported by genetic algorithms for corporate bond rating. Expert Systems with Applications, 16, 85–95.

Siedlecki, W., & Sklansky, J. (1989). A note on genetic algorithms for large-scale feature selection. Pattern Recognition Letters, 10, 335–347.




Stanfill, C., & Waltz, D. (1986). Toward memory-based reasoning. Communications of the ACM, 1213–1228.

Tarr, G. (1991). Multi-layered feedforward neural networks. PhD dissertation, School of Engineering, Air Force Institute of Technology, Wright-Patterson AFB, OH.

Vafaie, H., & De Jong, K. (1993). Robust feature selection algorithms. International Conference on Tools with Artificial Intelligence, 57–65.

Watson, I. (1997). Applying case-based reasoning: techniques for enterprise systems. San Francisco: Morgan Kaufmann.

Wettschereck, D., Aha, D. W., & Mohri, T. (1997). A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms. Artificial Intelligence Review, 11, 273–314.

Yang, J. H., & Honavar, V. (1998). Feature subset selection using a genetic algorithm. IEEE Intelligent Systems, 44–49.

Yoon, Y., Guimaraes, T., & Swales, G. (1994). Integrating artificial neural networks with rule-based expert systems. Decision Support Systems, 11, 497–507.
