FAST: A ROC-based Feature Selection Metric for Small Samples and Imbalanced Data Classification Problems
Xue-wen Chen
Department of Electrical Engineering and Computer Science
The University of Kansas, Lawrence, KS 66045, USA

Michael Wasikowski
Department of Electrical Engineering and Computer Science
The University of Kansas, Lawrence, KS 66045, USA
ABSTRACT
The class imbalance problem is encountered in a large number of practical applications of machine learning and data mining, for example, information retrieval and filtering, and the detection of credit card fraud. It has been widely realized that this imbalance raises issues that are either nonexistent or less severe compared to balanced class cases and often results in a classifier's suboptimal performance. This is even more true when the imbalanced data are also high dimensional. In such cases, feature selection methods are critical to achieving optimal performance. In this paper, we propose a new feature selection method, Feature Assessment by Sliding Thresholds (FAST), which is based on the area under a ROC curve generated by moving the decision boundary of a single feature classifier, with thresholds placed using an even-bin distribution. FAST is compared to two commonly-used feature selection methods, correlation coefficient and RELevance In Estimating Features (RELIEF), for imbalanced data classification. The experimental results obtained on text mining, mass spectrometry, and microarray data sets showed that the proposed method outperformed both RELIEF and correlation methods on skewed data sets and was comparable on balanced data sets; when a small number of features is preferred, the classification performance of the proposed method was significantly better than that of the correlation and RELIEF-based methods.
Categories and Subject Descriptors
I.5.2 [Pattern Recognition]: Design Methodology – feature
evaluation and selection.
General Terms
Algorithms.
Keywords
Feature selection, imbalanced data classification, ROC.
1. INTRODUCTION
One of the greatest challenges in machine learning and data mining research is the class imbalance problem presented in real-world applications. The class imbalance problem refers to the issues that occur when a dataset is dominated by a class or classes that have significantly more samples than the other classes of the dataset. Imbalanced classes are seen in a variety of domains, and many carry major economic, commercial, and environmental concerns. Some examples include text classification, risk management, web categorization, medical diagnosis/monitoring, biological data analysis, credit card fraud detection, and oil spill identification from satellite images.
While the majority of learning methods are designed for well-balanced training data, data imbalance presents a uniquely challenging problem for classifier design when the misclassification costs for the two classes are different (i.e., cost-sensitive classification); accordingly, the overall classification rate is not appropriate for evaluating performance. The class imbalance problem can hinder the performance of standard machine learning methods. For example, it is entirely possible to achieve high classification accuracy by simply assigning all samples to the majority class. Practical applications of cost-sensitive classification arise frequently, for example, in medical diagnosis [1], in agricultural product inspection [2], in industrial production processes [3], and in automatic target detection [4]. Analyzing imbalanced data thus requires methods different from those used in the past.
The majority of current research on the class imbalance problem can be grouped into two categories: sampling techniques and algorithmic methods, as discussed in two workshops at the AAAI conference [5] and the ICML conference [6], and later in the sixth issue of SIGKDD Explorations (see, for example, a review by Weiss [7]). The sampling methods involve leveling the class samples so that they are no longer imbalanced. Typically, this is done by under-sampling the larger class [8-9], by over-sampling the smaller one [10-11], or by a combination of these techniques [12]. Algorithmic methods include adjusting the costs associated with misclassification so as to improve performance [13-15], shifting the bias of a classifier to favor the rare class [16-17], creating AdaBoost-like boosting schemes [18-19], and learning from one class [20].
The class imbalance problem is even more severe when the dimensionality is high. For example, in microarray-based cancer classification, the number of features is typically tens of thousands [21]; in text classification, the number of features in a
bag of words is often more than an order of magnitude larger than the number of training documents [22]. Both sampling techniques and algorithmic methods may not work well for high-dimensional class imbalance problems. Indeed, van der Putten and van Someren analyzed the COIL challenge 2000 datasets and concluded that, to overcome overfitting problems, feature selection is even more important than the choice of classification algorithm [23]. A similar observation was made by Forman for highly imbalanced data classification problems [22]. As pointed out by Forman, "no degree of clever induction can make up for a lack of predictive signal in the input space" [22]. This holds even for the SVM, which is engineered to work with hyper-dimensional datasets. Forman [22] found that the performance of the SVM could be improved by the judicious use of feature selection metrics. It is thus critical to develop effective feature selection methods for imbalanced data classification, especially if the data are also high dimensional.
While feature selection has been extensively studied [24-30], its importance to class imbalance problems in particular was only recently realized and has attracted increasing attention from the machine learning and data mining community. Mladenic and Grobelnik examined the performance of different feature selection metrics in classifying text mining data from the Yahoo hierarchy [31]. After applying one of nine different filters, they tested the classification power of the selected features using naïve Bayes classifiers. Their results showed that the best metrics choose common features and take into account the domain and the learning machine's inherent characteristics. Forman found improved results with the use of multiple different metrics, but the best-performing features were those selected by metrics that focused primarily on the minority class [22]. Zheng, Wu, and Srihari empirically tested different ratios of features indicating membership in a class versus features indicating lack of membership in a class [32]. This approach resulted in better accuracy than using one-sided metrics, which solely score features indicating membership in a class, or two-sided metrics, which simultaneously score features indicating membership and lack of membership.
One common problem with the standard evaluation statistics used in previous studies, like information gain and odds ratios, is that they depend on the counts of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). These counts are determined by a preset threshold. Consider imbalanced data classification with two different feature sets. The first feature set may yield a higher TP, but a lower TN, than the second feature set. By varying the decision threshold, the second feature set may produce a higher TP and a lower TN than the first feature set. Thus, a single threshold cannot tell us which feature set is better. This is an artifact of using a parametric statistic to evaluate a classifier's predictive power [33]. If we vary the classifier's decision threshold, we can compute these statistics for each threshold and see how they change based on where the threshold is placed. A receiver operating characteristic, or ROC, curve is one such non-parametric measure of a classifier's power that compares the true positive rate with the false positive rate. While the ROC curve has been extensively used for evaluating classification performance in class imbalance problems, it has not been directly applied to feature selection. In this paper, we construct a new feature selection metric based on an ROC curve generated from optimal simple linear discriminants and select the features with the highest area under the curve as the most relevant. Unlike other feature selection metrics, which depend on one particular decision boundary, our metric evaluates features in terms of their performance on multiple decision hyperplanes and is therefore more appropriate for class imbalance problems.
The rest of our paper is organized as follows. Section 2 provides a brief discussion of two commonly-used filter methods: the correlation coefficient (CC) and RELevance In Estimating Features (RELIEF). In section 3, we follow with a description of the proposed new method, Feature Assessment by Sliding Thresholds (FAST). In section 4, we present results comparing the performance of linear support vector machine (SVM) and 1-nearest neighbor (1-NN) classifiers using features selected by each metric. These results are measured on two microarray, two mass spectrometry, and one text mining datasets. Finally, we give our concluding remarks in section 5.
2. FEATURE SELECTION METHODS
In this section, we briefly review two commonly-used feature selection methods, CC and RELIEF.
2.1 Correlation Coefficient
The correlation coefficient is a statistical test that measures the strength and quality of the relationship between two variables. Correlation coefficients can range from -1 to 1. The absolute value of the coefficient gives the strength of the relationship; absolute values closer to 1 indicate a stronger relationship. The sign of the coefficient gives the direction of the relationship: a positive sign indicates that the two variables increase or decrease together, and a negative sign shows that one variable increases as the other decreases.
In machine learning problems, the correlation coefficient is used to evaluate how accurately a feature predicts the target independent of the context of other features. The features are then ranked based on the correlation score [25]. For problems where the covariance $\mathrm{cov}(X_i, Y)$ between a feature ($X_i$) and the target ($Y$) and the variances of the feature ($\mathrm{var}(X_i)$) and target ($\mathrm{var}(Y)$) are known, the correlation can be calculated directly:
$$R(i) = \frac{\mathrm{cov}(X_i, Y)}{\sqrt{\mathrm{var}(X_i) \cdot \mathrm{var}(Y)}} \qquad (1)$$
Equation 1 can only be used when the true values of the covariance and variances are known. When these values are unknown, an estimate of the correlation can be made using Pearson's product-moment correlation coefficient over a sample of the population $(x_i, y)$. This formula only requires the mean of each feature and of the target:
$$R(i) = \frac{\sum_{k=1}^{m} (x_{i,k} - \bar{x}_i)(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{m} (x_{i,k} - \bar{x}_i)^2 \cdot \sum_{k=1}^{m} (y_k - \bar{y})^2}} \qquad (2)$$

where $m$ is the number of data points.
Correlation coefficients can be used for both regressors and classifiers. When the learning machine is a regressor, the range of values of the target may be any ratio scale. When the learning machine is a classifier, we restrict the range of values for the target to $\pm 1$.
We then use the coefficient of determination, $R(i)^2$, to enforce a ranking of the features according to the goodness of linear fit between individual features and the target [25].
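To make the ranking concrete, the computation in equation 2 and the $R(i)^2$ criterion can be sketched in a few lines of NumPy. This sketch is our illustration rather than code from the paper; the function name and array layout are assumptions, and it assumes no feature is constant (a constant feature would give a zero denominator).

import numpy as np

def correlation_scores(X, y):
    # X: (m, d) array of m samples by d features; y: length-m array of +/-1 targets.
    # Returns R(i)^2 for every feature, i.e. equation 2 squared.
    Xc = X - X.mean(axis=0)                      # center each feature
    yc = y - y.mean()                            # center the target
    num = Xc.T @ yc                              # per-feature covariance terms
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return (num / den) ** 2                      # coefficient of determination

# Rank features: indices of the k highest-scoring features.
# top_k = np.argsort(correlation_scores(X, y))[::-1][:k]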
When using the correlation coefficient as a feature selection metric, we must remember that the correlation only finds linear relationships between a feature and the target. Thus, a feature and the target may be perfectly related in a non-linear manner, yet the correlation could still equal 0. We may lift this restriction by applying simple non-linear preprocessing to the feature before calculating the correlation coefficient, thereby measuring the goodness of a non-linear relationship between the feature and the target [25].
Another issue with using correlation coefficients comes from how we rank features. If features are ranked solely on their signed value, with positively scored features picked first or vice versa, then we risk not choosing the features that have the strongest relationship with the target. Conversely, if features are chosen based on their absolute value, Zheng, Wu, and Srihari argue that we may not select the ratio of positive to negative features that gives the best results for the imbalance in the data [32]. Finding this optimal ratio takes empirical testing, but it can yield extremely strong results.
2.2 RELIEF
RELIEF is a feature selection metric based on the nearest neighbor rule, designed by Kira and Rendell [34]. It evaluates a feature based on how well its values differentiate nearby points. When RELIEF selects any specific instance, it searches for two nearest neighbors: one from the same class (the nearest hit), and one from the other class (the nearest miss). We then calculate the relevance of each attribute A by the rule:

W(A) = P(different value of A | nearest miss)
     - P(different value of A | nearest hit)    (3)

This is justified by the reasoning that instances of different classes should have vastly different values, while instances of the same class should have very similar values. Because the true probabilities cannot be calculated, we must estimate the difference in equation 3. This is done by calculating the distance between random instances and their nearest hits and misses. For discrete variables, the distance is 0 if the values are the same and 1 if they differ; for continuous variables, we use the standard Euclidean distance. We may select any number of instances up to the number in the set, and more selections yield a better approximation [35].
Algorithm 1 details the pseudo-code for implementing RELIEF.
Algorithm 1 (RELIEF):

Set all W(A) = 0
FOR i = 1 to m
    Select instance R randomly
    Find nearest hit H and nearest miss M
    FOR A = 1 to number of features
        W(A) = W(A) - dist(A, R, H)/m
        W(A) = W(A) + dist(A, R, M)/m
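As a runnable counterpart to Algorithm 1, the following NumPy sketch (ours, not the authors') implements the basic update for continuous features, using per-attribute absolute differences for the weight update and Euclidean distance for the neighbor search; it assumes every class has at least two instances.

import numpy as np

def relief(X, y, m_iters, seed=0):
    # X: (n, d) array of continuous features; y: length-n class labels.
    # Returns the weight W(A) for each of the d attributes.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(m_iters):
        i = rng.integers(n)                            # pick instance R at random
        diffs = np.abs(X - X[i])                       # per-attribute distances to R
        total = np.linalg.norm(X - X[i], axis=1)       # Euclidean, for the neighbor search
        total[i] = np.inf                              # exclude R itself
        same = (y == y[i])
        hit = np.where(same, total, np.inf).argmin()   # nearest hit H
        miss = np.where(same, np.inf, total).argmin()  # nearest miss M
        W += (diffs[miss] - diffs[hit]) / m_iters      # update rule from Algorithm 1
    return W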
The original version of RELIEF suffered from several problems. First, the method searches for only one nearest hit and one nearest miss, so noisy data can make the approximation inaccurate. Second, if there are instances with missing feature values, the algorithm will fail because it cannot calculate the distance between those instances. Kononenko created multiple extensions of RELIEF to address these issues [35]. RELIEF-A allows the algorithm to check multiple nearest hits and misses. RELIEF-B, C, and D give the method different ways to handle missing values. Finally, RELIEF-E and F find a nearest miss in each different class, instead of just one, and use this to better estimate the separability of an instance from all other classes. These extensions added to RELIEF's adaptability to different types of problems.
3. METHOD DESCRIPTION: FAST
In this section, we propose to assess features based on the area under a ROC curve, which is determined by training a simple linear classifier on each feature and sliding the decision boundary for optimal classification. The new metric is called FAST (Feature Assessment by Sliding Thresholds).

Most single-feature classifiers set the decision boundary at the mid-point between the means of the two classes [25]. This may not be the best choice for the decision boundary. By sliding the decision boundary, we can increase the number of true positives we find at the expense of classifying more false positives. Alternatively, we could slide the threshold to decrease the number of true positives found in order to avoid misclassifying negatives. Thus, no single choice of decision boundary may be ideal for quantifying the separation between two classes.
We can avoid this problem by classifying the samples at multiple thresholds and gathering statistics about the performance at each boundary. If we calculate the true positive rate and false positive rate at each threshold, we can build an ROC curve and calculate the area under the curve. Because the area under the ROC curve is a strong predictor of performance, especially for imbalanced data classification problems, we can use this score as our feature ranking: we choose the features with the highest areas under the curve because they have the best predictive power for the dataset.
By using a ROC curve as the means to rank features, we have introduced another problem: deciding where to place the thresholds. If a large number of samples are clustered together in one region, we would like to place more thresholds between these points to find how separated the two classes are within this cluster. Likewise, if there is a region where samples are sparse and spread out, we want to avoid placing multiple, redundant thresholds between these points. One possible solution is to use a histogram to determine where to place the thresholds. A histogram fixes the bin width and varies the number of points in each bin. This method does not accomplish the goals detailed above. A particular histogram may have multiple neighboring bins that contain very few points; we would prefer that these bins be joined together so that the points fall into the same bin. Likewise, a histogram may have a bin that holds a significant proportion of the points; we would rather split this bin into multiple bins so that we can better differentiate inside this cluster of points.
We use a modified histogram, or an even-bin distribution, to
correct both of these problems. Instead of fixing the bin width
and varying the number of points in each bin, we fix the number of points that fall in each bin and vary the bin width. This even-bin distribution accomplishes both of the above goals: areas of the feature space that have fewer samples are covered by wider bins, and areas that have many samples are covered by narrower bins. We then take the mean of the samples in each bin as our threshold and classify each sample according to this threshold. Algorithm 2 gives the pseudo-code for implementing FAST.
Algorithm 2 (FAST):

K: number of bins
N: number of samples in dataset
M: number of features in dataset
Split = 0 to N with a step size of N/K
FOR i = 1 to M
    X is a vector of samples' values for feature i
    Sort X
    FOR j = 1 to K
        bottom = round(Split(j)) + 1
        top = round(Split(j+1))
        MU = mean(X(bottom to top))
        Classify X using MU as threshold
        tpr(i, j) = tp / # positive
        fpr(i, j) = fp / # negative
    Calculate area under ROC from tpr, fpr
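For concreteness, the following sketch scores one feature in the spirit of Algorithm 2. It is our illustration rather than the authors' code: it assumes labels in {+1, -1} and at least K samples, uses the bin means as thresholds, and integrates the resulting (fpr, tpr) points with the trapezoidal rule.

import numpy as np

def fast_score(x, y, k=10):
    # x: length-n values of a single feature; y: length-n labels in {+1, -1}.
    # Returns the approximate area under the feature's ROC curve.
    xs = np.sort(x)
    split = np.linspace(0, len(x), k + 1)        # even-bin boundaries over sorted values
    n_pos, n_neg = (y == 1).sum(), (y == -1).sum()
    tpr, fpr = [1.0], [1.0]                      # threshold below all samples
    for j in range(k):
        lo, hi = int(round(split[j])), int(round(split[j + 1]))
        mu = xs[lo:hi].mean()                    # threshold = mean of the bin
        pred = np.where(x >= mu, 1, -1)          # single-feature linear classifier
        tpr.append(((pred == 1) & (y == 1)).sum() / n_pos)
        fpr.append(((pred == 1) & (y == -1)).sum() / n_neg)
    tpr.append(0.0)                              # threshold above all samples
    fpr.append(0.0)
    f, t = zip(*sorted(zip(fpr, tpr)))           # order the ROC points by fpr
    auc = np.trapz(t, f)                         # trapezoidal-rule area
    return max(auc, 1.0 - auc)                   # two-sided score in [0.5, 1]

Ranking all M features then amounts to applying fast_score to each column of the data matrix and keeping the features with the highest scores.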
One potential issue with this implementation is how it compares to the standard ROC algorithm, which uses every possible threshold; the standard approach is simpler but requires more computation. We conducted a pilot study using the CNS dataset to measure the difference between the FAST algorithm and this standard. Our findings showed that with a parameter of K=10, 99% of the FAST scores were within ±0.02 of the exact AUC score, and 50% were within ±0.005. Additionally, the FAST algorithm was nearly ten times as fast. Thus, we concluded that the approximation scores were sufficient.
Note that the FAST method is a two-sided metric. The scores generated by the FAST method may range between 0.5 and 1. If a feature is irrelevant to classification, its score will be close to 0.5. If a feature is highly indicative of membership in the positive class, the negative class, or both, it will have a score closer to 1. Thus, this method has the potential to select both positive and negative features for use in classification.
4. EXPERIMENTAL RESULTS
4.1 Data Sets
We tested the effectiveness of correlation coefficient, RELIEF, and FAST features on five different data sets. Two of the data sets are microarray sets, two are mass spectrometry sets, and one is a bag-of-words set. Each of the microarray and mass spectrometry data sets has a small number of samples, a large number of features, and a significant imbalance between the two classes. The bag-of-words data set also has a small number of samples with a large number of features, but we artificially controlled the class skew to show differences in performance on highly imbalanced classes versus balanced classes. The microarray sets were not preprocessed. The mass spectrometry sets were minimally preprocessed by subtracting the baseline, reducing the amount of noise, trimming the range of inspected mass/charge ratios, and normalizing. The bag-of-words set was constructed using RAINBOW [36] to extract the word counts from text documents. These data sets are summarized in Table 1.
Because the largest data set has only 320 samples, we used 10-fold cross-validation to evaluate the trained models. Each fold had a class ratio equal to that of the full set. The results for the folds were combined to obtain test results for the entire data set. To stabilize the results, we repeated the cross-validation 20 times and averaged over the trials.
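This protocol can be reproduced roughly as follows; the sketch is ours (the paper does not name its tooling) and uses scikit-learn's StratifiedKFold to keep each fold's class ratio equal to that of the full set.

from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

def repeated_stratified_cv(model, X, y, score, n_reps=20, n_folds=10):
    # 10-fold cross-validation with stratified folds, repeated 20 times
    # with different shuffles; returns the score averaged over all folds.
    scores = []
    for rep in range(n_reps):
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=rep)
        for tr, te in skf.split(X, y):
            m = clone(model).fit(X[tr], y[tr])
            scores.append(score(y[te], m.predict(X[te])))
    return sum(scores) / len(scores)

Here score would be an imbalance-aware statistic such as the balanced error rate defined in section 4.2.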
Table 1. Data set descriptions

CNS: Central Nervous System Embryonal Tumor Data [37]. This data set contains 90 samples: 60 have medulloblastomas, and 30 have other types of tumors or no cancer. There are 7129 genes in this data set.

LYMPH: Lymphoma Data [38]. This data set contains 77 samples: 58 are diffuse large B-cell lymphomas, and 19 are follicular lymphomas. There are 7129 genes in this data set.

OVARY: Ovarian Cancer Data [39]. This data set contains 66 samples: 50 are benign tumors, and 16 are malignant tumors. There are 6000 mass/charge ratios in this data set.

PROST: Prostate Cancer Data [40]. This data set contains 89 samples: 63 have no evidence of cancer, and 26 have prostate cancer. There are 6000 mass/charge ratios in this data set.

NIPS: NIPS Bag-of-Words Data [41]. This data set contains 320 documents: 160 cover neurobiology topics, and 160 cover various application topics. There are 13649 words in this data set. The set was rebalanced to five separate class ratios: 1:1, 1:2, 1:4, 1:8, and 1:16. The neurobiology class was shrunk to produce these imbalances.
4.2 Evaluation Statistics
The standard accuracy and error statistics quantify the strength of a classifier over the overall data set. However, these statistics do not take the class distribution into account. Forman argued that this is because a trivial majority classifier can give good results on a very imbalanced distribution [22]. It is more important to classify samples in the minority class correctly, at the potential expense of misclassifying majority samples. However, the converse is true as well: a trivial minority classifier will give great results for the minority class, but such a classifier would have too many false alarms to be usable. An ideal classifier performs well on both the minority and the majority class.

The balanced error rate (BER) statistic looks at the performance of a classifier on both classes. It is defined as the average of the error rates of the two classes, as shown in equation 4. If the classes are balanced, the BER is equal to the global error rate. It is commonly
used for evaluating imbalanced data classification [42]. We used
this statistic to evaluate trained classifiers on test data.
$$\mathrm{BER} = \frac{1}{2}\left(\frac{FP}{FP + TN} + \frac{FN}{FN + TP}\right) \qquad (4)$$
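In code, equation 4 reduces to a few lines (our sketch, assuming labels in {+1, -1}):

import numpy as np

def balanced_error_rate(y_true, y_pred):
    # Mean of the two per-class error rates (equation 4).
    pos = (y_true == 1)
    neg = (y_true == -1)
    fn_rate = (y_pred[pos] == -1).mean()   # FN / (FN + TP)
    fp_rate = (y_pred[neg] == 1).mean()    # FP / (FP + TN)
    return 0.5 * (fp_rate + fn_rate)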
4.3 Results
We evaluated the performance of FAST-selected features by comparing them with features chosen by correlation coefficients and RELIEF. Many researchers have used standard learning algorithms that maximize accuracy to evaluate imbalanced datasets. Zheng [32] used the naïve Bayes classifier and logistic regression methods, and Forman [22] used the linear SVM and noted its superiority over decision trees, naïve Bayes, and logistic regression. The object of study in these papers, and in our research, was the performance of the feature selection metrics and not the induction algorithms. Thus, we chose to evaluate the metrics using the performance of the linear SVM and 1-NN classifiers. These classifiers were chosen for their differing classification philosophies. The 1-NN method is a lazy algorithm that defers computation until classification. In contrast, the SVM computes a maximum-margin separating hyperplane before classification.
The classification results are summarized in Figs. 1-10, where dashed lines with square markers indicate classifiers using RELIEF-selected features (with one nearest hit and miss), dashed lines with star markers indicate classifiers using correlation-selected features, and dashed lines with diamond markers indicate classifiers using FAST-selected features (with 10 bins). The solid black line indicates the baseline performance where all the features are used for classification.
Figures 1 and 2 show the BER versus the number of features selected using a 1-NN classifier and a linear SVM for the CNS data, respectively. FAST features significantly outperformed RELIEF and correlation features when using the 1-NN classifier. When using the SVM classifier, FAST features performed the best for fewer than 40 features; beyond 40 features, there was little difference between feature sets. In all cases, using a small set of features outperforms the baseline with all the original features. Similar results were obtained for the other datasets. For example, Figures 3 and 4 show the results for the LYMPH data with a 1-NN and a linear SVM, respectively. Due to page limits, we are not able to show the results for all four datasets. Instead, we include the average results here. Figures 5 and 6 show the BER scores averaged over the four datasets with a 1-NN classifier and a SVM, respectively. For comparison, the baseline performance of the classifier using all features is also included.
Another evaluation statistic commonly used on imbalanced datasets is the area under the ROC curve (AUC). This statistic is similar in nature to the BER in that it weights errors on the two classes differently. In this study, it lines up well with the design philosophy of FAST: FAST selects features that maximize the AUC, so it is reasonable to believe that a learning method using FAST-selected features would also maximize the AUC. We also used this statistic to evaluate trained classifiers on test data. Figures 7 and 8 show the AUC scores averaged over the four datasets with a 1-NN classifier and a SVM, respectively. Not surprisingly, FAST outperforms CC and RELIEF.
Figure 1. BER for CNS using a 1-NN classifier
Figure 2. BER for CNS using a SVM classifier
Figure 3. BER for LYMPH using 1-NN classifiers
Figure 4. BER for LYMPH using a SVM classifier
Figure 5. BER averaged over CNS, LYMPH, OVARY, and
PROST using a 1-NN classifier
Figure 6. BER averaged over CNS, LYMPH, OVARY, and
PROST using a SVM classifier
Figure 7. AUC averaged over CNS, LYMPH, OVARY, and
PROST using a 1-NN classifier
Figure 8. AUC averaged over CNS, LYMPH, OVARY, and
PROST using a SVM classifier
Figure 9. AUC for CNS using a SVM classifier
Figure 10. AUC for PROST using a SVM classifier
Figure 11. Training data distribution of CNS with the two
best RELIEF-selected features
The average results in Figures 6 and 8 agree with the belief that SVMs are robust to high-dimensional data. Up to 100 RELIEF-selected features did not improve the BER or the AUC of the SVM. Additionally, up to 100 correlation-selected features did not improve the BER. On the other hand, the SVM using more than 30 FAST-selected features did see a significant improvement in both BER and AUC. Thus, our results agree with the general finding that SVMs are resistant to feature selection, but also agree with the findings presented by Forman [22] that SVMs can benefit from prudent feature selection. Specific examples of this improvement in our datasets can be seen in Figures 2 and 4, which use FAST on the BER scores for the CNS and LYMPH datasets, respectively, and in Figures 9 and 10, which use FAST on the AUC scores for the CNS and PROST datasets, respectively.
The results for the 1-NN classifiers, seen in Figures 5 and 7, are even more striking. Both RELIEF and correlation-selected features improved significantly on the baseline performance of the classifier once a minimum of 45 features were selected. FAST-selected features saw a significant jump in performance over that seen using RELIEF and correlation-selected features; the 1-NN classifiers using only 15 FAST-selected features beat the baseline.
Figure 12. Training data distribution of CNS with the two
best correlation-selected features
Figure 13. Training data distribution of CNS with the two
best FAST-selected features
Why would FAST features outperform correlation and RELIEF features by such a significant margin for both 1-NN and SVM classifiers? We visualized the features selected by the correlation, RELIEF, and FAST methods to answer this question. We show the training data of the CNS dataset with the two best features. Figures 11-13 show the data using the best two RELIEF features, the best two correlation features, and the best two FAST features, respectively. FAST features appear to separate the two classes and group them into smaller clusters better than correlation and RELIEF features do. This may explain why FAST features perform better with both the SVM and 1-NN classifiers: SVMs try to maximize the distance between the two classes, and 1-NN classifiers give the best results when similar samples are clustered close together.
Finally, we show the effects of different class ratios on the performance of each feature selection metric. Figures 14 and 15 show the BER versus class ratio for the NIPS dataset with the SVM and 1-NN classifiers, respectively. Not surprisingly, as the class ratio increases, the BER tends to increase accordingly. For both the 1-NN and SVM classifiers, correlation and FAST features performed comparably well for datasets up to a 1:8 class ratio. At the 1:16 ratio, FAST features performed significantly better than correlation features. RELIEF features did not perform well on this dataset at any of the class ratios.
We conclude that FAST features perform better than RELIEF and correlation features; this boost in performance is especially large when the selected feature set is small and when the classes are extremely imbalanced. Because using fewer features helps classifiers avoid overfitting when the sample space is small, we believe the FAST metric is of interest for learning patterns in real-world datasets, especially those with imbalanced classes and high dimensionality.
Figure 14. BER for NIPS using SVM classifiers
Figure 15. BER for NIPS using 1-NN classifiers
5. CONCLUSION
Classification problems involving a small sample space and a large feature space are especially prone to overfitting. Feature selection methods are often used to increase the generalization potential of a classifier. However, when the dataset to be learned is imbalanced, the most-used metrics tend to select less relevant features. In this paper, we proposed and tested a feature selection metric, FAST, that evaluates the relevance of features using the area under the ROC curve obtained by sliding the decision boundary in the one-dimensional feature space. We compared the FAST metric with the commonly-used RELIEF and correlation coefficient scores on two mass spectrometry and two microarray datasets that have small sample sizes and imbalanced distributions. FAST features performed considerably better than RELIEF and correlation features; the increase in performance was magnified for smaller feature counts, and this makes FAST a practical candidate for feature selection.
One interesting finding from this research was that correlation features tended to outperform RELIEF features on class imbalance and small sample problems, especially when the SVM classifier was used. This may be because the correlation coefficient takes a global view of whether a feature accurately predicts the target; in contrast, RELIEF, especially when the number of nearest hits and misses selected is small, takes a local view of a feature's relevance to predicting the target. If there are small clusters of points that are near each other but far away from the main cluster of points, these points can act as each other's nearest hits while being a great distance from their nearest misses. Features with this quality can thus be scored rather highly when they are, in fact, highly irrelevant to classification. There is strong evidence for this claim in Fig. 11: there are multiple small clusters of points, some from the majority class and some from the minority class, that are close to each other but a significant distance away from the nearest miss. This would greatly inflate the score of these two features and make them appear more relevant. Figures 5-8 clearly point to this deficiency, as the performance of both SVM and 1-NN classifiers using RELIEF features is only marginally better (or worse) than chance and significantly behind classifiers using correlation or FAST features.
Our future work will investigate the use of other metrics for feature evaluation. For example, researchers have recently argued that precision-recall curves are preferable when dealing with highly skewed datasets [43]. Whether precision-recall curves are also appropriate for small sample and imbalanced data problems remains to be examined.
6. ACKNOWLEDGMENTS
This work is supported by US National Science Foundation Award IIS-0644366. We would also like to thank the reviewers for their valuable comments.
7. REFERENCES
[1] Nunez, M. 1991. The use of background knowledge in decision tree induction. Machine Learning, 6, 231-250.
[2] Casasent, D. and Chen, X.-W. 2003. New training strategies for RBF neural networks for X-ray agricultural product inspection. Pattern Recognition, 36(2), 535-547.
[3] Verdenius, F. 1991. A method for inductive cost optimization. Proceedings of the Fifth European Working Session on Learning, EWSL-91, 179-191. New York: Springer-Verlag.
[4] Casasent, D. and Chen, X.-W. 2004. Feature reduction and morphological processing for hyperspectral image data. Applied Optics, 43(2), 1-10.
[5] Japkowicz, N., editor. 2000. Proceedings of the AAAI'2000 Workshop on Learning from Imbalanced Data Sets. AAAI Tech Report WS-00-05.
[6] Chawla, N., Japkowicz, N., and Kolcz, A., editors. 2003. Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Data Sets.
[7] Weiss, G. 2004. Mining with rarity: A unifying framework. SIGKDD Explorations, 6(1), 7-19.
[8] Kubat, M. and Matwin, S. 1997. Addressing the curse of imbalanced training sets: one-sided selection. In Proc. of the
Fourteenth International Conference on Machine Learning, 179-186.
[9] Chen, X., Gerlach, B., and Casasent, D. 2005. Pruning support vectors for imbalanced data classification. In Proc. of the International Joint Conference on Neural Networks, 1883-88.
[10] Kubat, M. and Matwin, S. 1997. Learning when negative examples abound. In Proceedings of the Ninth European Conference on Machine Learning, ECML-97, 146-153.
[11] Chawla, N., Bowyer, K., Hall, L., and Kegelmeyer, P. 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.
[12] Estabrooks, A., Jo, T., and Japkowicz, N. 2004. A multiple resampling method for learning from imbalanced data sets. Computational Intelligence, 20(1), 18-36.
[13] Domingos, P. 1999. MetaCost: a general method for making classifiers cost-sensitive. Proc. of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 155-164.
[14] Elkan, C. 2001. The foundations of cost-sensitive learning. Proc. of the Seventeenth International Joint Conference on Artificial Intelligence, 973-978.
[15] Fawcett, T. and Provost, F. 1997. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1(3), 291-316.
[16] Huang, K., Yang, H., King, I., and Lyu, M. 2004. Learning classifiers from imbalanced data based on biased minimax probability machine. Proc. of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2(27), II-558 - II-563.
[17] Ting, K. 1994. The problem of small disjuncts: its remedy on decision trees. Proc. of the Tenth Canadian Conference on Artificial Intelligence, 91-97.
[18] Chawla, N., Lazarevic, A., Hall, L., and Bowyer, K. 2003. SMOTEBoost: Improving prediction of the minority class in boosting. Principles of Knowledge Discovery in Databases, LNAI 2838, 107-119.
[19] Sun, Y., Kamel, M., and Wang, Y. 2006. Boosting for learning multiple classes with imbalanced class distribution. Sixth International Conference on Data Mining, 592-602.
[20] Raskutti, A. and Kowalczyk, A. 2004. Extreme rebalancing for SVMs: a SVM study. SIGKDD Explorations, 6(1), 60-69.
[21] Xiong, H. and Chen, X. 2006. Kernel-based distance metric learning for microarray data classification. BMC Bioinformatics, 7, 299.
[22] Forman, G. 2003. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3, 1289-1305.
[23] van der Putten, P. and van Someren, M. 2004. A bias-variance analysis of a real world learning problem: the CoIL challenge 2000. Machine Learning, 57(1-2), 177-195.
[24] Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. 2002. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3), 389-422.
[25] Guyon, I. and Elisseeff, A. 2003. An introduction to variable and feature selection. Journal of Machine Learning Research, 3 (Special Issue on Variable and Feature Selection), 1157-1182.
[26] Weston, J. et al. 2000. Feature selection for support vector machines. In Advances in Neural Information Processing Systems.
[27] Chen, X. and Jeong, J. 2007. Minimum reference set based feature selection for small sample classifications. Proc. of the 24th International Conference on Machine Learning, 153-160.
[28] Chen, X. 2003. An improved branch and bound algorithm for feature selection. Pattern Recognition Letters, 24, 1925-1933.
[29] Yu, L. and Liu, H. 2004. Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 5, 1205-1224.
[30] Pudil, P., Novovicova, J., and Kittler, J. 1994. Floating search methods in feature selection. Pattern Recognition Letters, 15, 1119-1125.
[31] Mladenic, D. and Grobelnik, M. 1999. Feature selection for unbalanced class distribution and naïve Bayes. In Proc. of the 16th International Conference on Machine Learning, 258-267.
[32] Zheng, Z., Wu, X., and Srihari, R. 2004. Feature selection for text categorization on imbalanced data. SIGKDD Explorations, 6(1), 80-89.
[33] Lund, O., Nielsen, C., Lundegaard, C., and Brunak, S. 2005.Immunological Bioinformatics, 99-101. The MIT Press.
[34] Kira, K. and Rendell, L. 1992. The feature selection problem: Traditional methods and new algorithms. In Proc. of the 9th International Conference on Machine Learning, 249-256.
[35] Kononenko, I. 1994. Estimating attributes: Analysis and extension of RELIEF. In Proc. of the 7th European Conference on Machine Learning, 171-182.
[36] McCallum, A. 1996. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow.
[37] Pomeroy, S. et al. 2002. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415, 436-442.
[38] Shipp, M. et al. 2002. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine, 8, 68-74.
[39] Petricoin, E. et al. 2002. Use of proteomic patterns in serum to identify ovarian cancer. The Lancet, 359, 572-577.
[40] Petricoin, E. et al. 2002. Serum proteomic patterns for detection of prostate cancer. Journal of the National Cancer Institute, 94, 1576–1578.
[41] Roweis, S. 2008. http://www.cs.toronto.edu/~roweis.
[42] MPS, 2006. Performance prediction challenge – evaluation. http://www.modelselect.inf.ethz.ch/evaluation.php.
[43] Davis, J. and Goadrich, M. 2006. The relationship between precision-recall and ROC curves. In Proc. of the 23rd International Conference on Machine Learning, 30-38.