POLITEHNICA UNIVERSITY OF TIMIȘOARA, MASTER OF SOFTWARE ENGINEERING
Pattern Recognition Classification Methods and Case Study
Marcel Gheorghi
2/5/2010
Contents
1 Introduction
2 Neural Networks
3 Classification Methods
4 Case Study
   Data Exploration
   Preprocessing
   Results
5 Conclusions
References
1 Introduction
The overall goal of this project is to develop classifier models that generalize whether a
person (defined as an anonymous instance) has an annual income of at most fifty
thousand dollars or greater than fifty thousand dollars. The data set used was extracted
from a 1994 U.S. census database by Barry Becker based on the following conditions:
((AAGE > 16) && (AGI > 100) && (AFNLWGT > 1) && (HRSWK > 0)), resulting in 48,842
reasonably clean observations [BEC96].
For this project, the Weka (Waikato Environment for Knowledge Analysis) data mining
toolkit was used. This toolkit, written in Java at the University of Waikato, provides a
considerable library of algorithms and models for classifying and generalizing data
[HAR07]. Out of these algorithms, naïve Bayes, SMO and J48 were used to develop
models based on 80% of the data, with the remaining 20% used for validation.
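As a rough illustration of this workflow, the following sketch wires up an 80/20 split and evaluation through Weka's Java API (the file name adult.arff and the random seed are assumptions; the classes used are part of Weka's standard API, and NaiveBayes or SMO can be substituted for J48):

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    import java.util.Random;

    public class IncomeClassifier {
        public static void main(String[] args) throws Exception {
            // Load the census data (file name is an assumption).
            Instances data = new DataSource("adult.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1); // income is the last attribute

            // Shuffle, then hold out the last 20% for validation.
            data.randomize(new Random(1));
            int trainSize = (int) Math.round(data.numInstances() * 0.8);
            Instances train = new Instances(data, 0, trainSize);
            Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

            // Train the classifier on the 80% partition.
            J48 tree = new J48();
            tree.buildClassifier(train);

            // Evaluate on the held-out 20% and print the usual Weka summary.
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            System.out.println(eval.toSummaryString());
        }
    }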
In addition, NeuroShell2 was used to train and test neural networks based on various
algorithms. This required additional data preparation as later described in the case
study.
2 Neural Networks
A neural network (NN), or "artificial neural network" (ANN), is a mathematical
model or computational model that tries to simulate the structure and/or functional
aspects of biological neural networks. It consists of an interconnected group
of artificial neurons and processes information using a connectionist approach
to computation. In most cases an ANN is an adaptive system that changes its
structure based on external or internal information that flows through the network
during the learning phase. Neural networks are non-linear statistical data
modeling tools. They can be used to model complex relationships between inputs and
outputs or to find patterns in data [WIK01].
Many neural network architectures have been developed to better suit various
situations. In the following paragraphs, two of them are presented, as they have been
used in the case study.
A simple recurrent network (SRN) is a variation on the Multi-Layer Perceptron,
sometimes called an "Elman network" after its inventor, Jeff Elman. A three-layer
network is used, with the addition of a set of "context units" in the input layer. There
are connections from the middle (hidden) layer to these context units, fixed with a
weight of one. At each time step, the input is propagated in a standard feed-forward
fashion, and then a learning rule (usually back-propagation) is applied. The fixed back
connections result in the context units always maintaining a copy of the previous
values of the hidden units (since they propagate over the connections before the
learning rule is applied). Thus the network can maintain a sort of state, allowing it to
perform tasks such as sequence prediction that are beyond the power of a standard
Multi-Layer Perceptron.
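As a minimal sketch of the mechanism just described (the dimensions, the sigmoid activation, and the calling convention are assumptions, not details of the case study), one forward step of an Elman network looks like this:

    // One forward step of a simple recurrent (Elman) network.
    // The context array holds the previous hidden values and is updated in place.
    public class ElmanStep {
        static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

        static double[] step(double[] input, double[] context,
                             double[][] wIn, double[][] wCtx, double[][] wOut) {
            int nHidden = wIn.length;
            double[] hidden = new double[nHidden];
            for (int h = 0; h < nHidden; h++) {
                double sum = 0.0;
                for (int i = 0; i < input.length; i++) sum += wIn[h][i] * input[i];
                for (int c = 0; c < context.length; c++) sum += wCtx[h][c] * context[c];
                hidden[h] = sigmoid(sum);
            }
            // The context units keep a copy of the hidden values over fixed
            // weight-one connections, so the next step sees the previous state.
            System.arraycopy(hidden, 0, context, 0, nHidden);

            double[] output = new double[wOut.length];
            for (int o = 0; o < wOut.length; o++) {
                double sum = 0.0;
                for (int h = 0; h < nHidden; h++) sum += wOut[o][h] * hidden[h];
                output[o] = sigmoid(sum);
            }
            return output;
        }
    }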
In a fully recurrent network, every neuron receives inputs from every other neuron in
the network. These networks are not arranged in layers. Usually only a subset of the
neurons receive external inputs in addition to the inputs from all the other neurons,
and another disjoint subset of neurons report their output externally as well as
sending it to all the neurons. These distinctive inputs and outputs perform the
function of the input and output layers of a feed-forward or simple recurrent network,
and also join all the other neurons in the recurrent processing [WIK01].
Backpropagation, or propagation of error, is a common method of training artificial
neural networks to perform a given task. It was first described by Arthur E.
Bryson and Yu-Chi Ho in 1969, but it wasn't until 1986, through the work
of David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams, that it gained
recognition, and it led to a renaissance in the field of artificial neural network
research [WIK05].
The Backpropagation architecture with standard connections is the standard type of
Backpropagation network, in which every layer is connected or linked to the
immediately previous layer. NeuroShell 2 gives the option of using a three-, four-, or
five-layer network.
Through experience and literature reviews, it has been found that the three-layer
Backpropagation network with standard connections is suitable for almost all
problems if enough hidden neurons are used. When more than one hidden layer (the
layers between the input and output layers) is used, training time may increase by
as much as an order of magnitude. [NEU**]
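To make the method concrete, here is a minimal three-layer backpropagation sketch with sigmoid activations, squared error, and plain gradient descent (all sizes, the learning rate, the initialization, and the omission of bias terms are simplifying assumptions, not NeuroShell 2 settings):

    import java.util.Random;

    // Minimal three-layer (input-hidden-output) backpropagation sketch.
    public class Backprop {
        final int nIn, nHid, nOut;
        final double[][] w1, w2;   // input->hidden and hidden->output weights
        final double lr = 0.1;     // learning rate (assumed)

        Backprop(int nIn, int nHid, int nOut, long seed) {
            this.nIn = nIn; this.nHid = nHid; this.nOut = nOut;
            Random rnd = new Random(seed);
            w1 = new double[nHid][nIn];
            w2 = new double[nOut][nHid];
            for (double[] row : w1) for (int i = 0; i < nIn; i++) row[i] = rnd.nextGaussian() * 0.1;
            for (double[] row : w2) for (int h = 0; h < nHid; h++) row[h] = rnd.nextGaussian() * 0.1;
        }

        static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

        // One stochastic gradient step on a single (x, target) pair.
        void train(double[] x, double[] target) {
            // Forward pass through the hidden and output layers.
            double[] hid = new double[nHid], out = new double[nOut];
            for (int h = 0; h < nHid; h++) {
                double s = 0; for (int i = 0; i < nIn; i++) s += w1[h][i] * x[i];
                hid[h] = sigmoid(s);
            }
            for (int o = 0; o < nOut; o++) {
                double s = 0; for (int h = 0; h < nHid; h++) s += w2[o][h] * hid[h];
                out[o] = sigmoid(s);
            }
            // Backward pass: propagate the output error toward the input.
            double[] dOut = new double[nOut], dHid = new double[nHid];
            for (int o = 0; o < nOut; o++)
                dOut[o] = (out[o] - target[o]) * out[o] * (1 - out[o]);
            for (int h = 0; h < nHid; h++) {
                double s = 0; for (int o = 0; o < nOut; o++) s += w2[o][h] * dOut[o];
                dHid[h] = s * hid[h] * (1 - hid[h]);
            }
            // Gradient-descent weight updates.
            for (int o = 0; o < nOut; o++) for (int h = 0; h < nHid; h++) w2[o][h] -= lr * dOut[o] * hid[h];
            for (int h = 0; h < nHid; h++) for (int i = 0; i < nIn; i++) w1[h][i] -= lr * dHid[h] * x[i];
        }
    }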
3 Classification Methods
In the case study, three methods of classification have been used: naïve Bayes, SMO
(SVM) and J48 (decision tree), described in this section.
Support vector machines (SVMs) are a set of related supervised learning methods
used for classification and regression. In simple words, given a set of training
examples, each marked as belonging to one of two categories, an SVM training
algorithm builds a model that predicts whether a new example falls into one category
or the other. Intuitively, an SVM model is a representation of the examples as points in
space, mapped so that the examples of the separate categories are divided by a clear
gap that is as wide as possible. New examples are then mapped into that same space
and predicted to belong to a category based on which side of the gap they fall on.
More formally, a support vector machine constructs a hyperplane or set of
hyperplanes in a high- or infinite-dimensional space, which can be used for
classification, regression or other tasks. Intuitively, a good separation is achieved by
the hyperplane that has the largest distance to the nearest training data points of any
class (the so-called functional margin), since in general the larger the margin, the lower
the generalization error of the classifier. [WIK02]
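In the standard textbook formulation (not specific to the case study), for linearly separable data with labels $y_i \in \{-1, +1\}$, the maximum-margin hyperplane solves

    \[ \min_{w,\,b} \; \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 \;\; \text{for all } i, \]

where the width of the gap is $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin described above.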
A Bayes classifier is a simple probabilistic classifier based on applying Bayes'
theorem (from Bayesian statistics) with strong (naive) independence assumptions. A
more descriptive term for the underlying probability model would be
"independent feature model".
In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a
particular feature of a class is unrelated to the presence (or absence) of any other
feature. For example, a fruit may be considered to be an apple if it is red, round, and
about 4" in diameter. Even though these features depend on the existence of the other
features, a naive Bayes classifier considers all of these properties to independently
contribute to the probability that this fruit is an apple.
Depending on the precise nature of the probability model, naive Bayes classifiers can
be trained very efficiently in a supervised learning setting. In many practical
applications, parameter estimation for naive Bayes models uses the method
of maximum likelihood; in other words, one can work with the naive Bayes model
without believing in Bayesian probability or using any Bayesian methods.
An advantage of the naive Bayes classifier is that it requires a small amount of
training data to estimate the parameters (means and variances of the variables)
necessary for classification. Because independent variables are assumed, only the
variances of the variables for each class need to be determined and not the entire
covariance matrix. [WIK03]
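In symbols (the standard formulation), the naive independence assumption reduces the class posterior to

    \[ P(C \mid x_1, \ldots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C), \]

and classification picks $\hat{c} = \arg\max_c P(c) \prod_i P(x_i \mid c)$. For numeric attributes, each $P(x_i \mid c)$ is typically modeled as a Gaussian parameterized by the per-class mean and variance mentioned above.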
J48 is an open source Java implementation of the C4.5 algorithm in the Weka data
mining tool. C4.5 is an algorithm, developed by Ross Quinlan, used to generate a
decision tree. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees
generated by C4.5 can be used for classification, and for this reason, C4.5 is often
referred to as a statistical classifier.
C4.5 builds decision trees from a set of training data in the same way as ID3, using
the concept of information entropy. The training data is a set S = {s_1, s_2, ...} of already
classified samples. Each sample s_i = (x_1, x_2, ...) is a vector whose components
x_1, x_2, ... represent attributes or features of the sample. The training data is
augmented with a vector C = (c_1, c_2, ...), where c_1, c_2, ... represent the class to
which each sample belongs.
At each node of the tree, C4.5 chooses the attribute of the data that most effectively
splits its set of samples into subsets enriched in one class or the other. Its criterion is
the normalized information gain (difference in entropy) that results from choosing an
attribute for splitting the data. The attribute with the highest normalized information
gain is chosen to make the decision. The C4.5 algorithm then recurses on the smaller
sub-lists.
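Written out (standard definitions), with $H(S) = -\sum_c p_c \log_2 p_c$ the entropy of the class distribution in $S$, the information gain of splitting on attribute $A$ is

    \[ \mathrm{Gain}(S, A) \;=\; H(S) \;-\; \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v), \]

and C4.5 normalizes it by the split information $-\sum_v \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$ to obtain the gain ratio used for the decision.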
This algorithm has a few base cases:
o All the samples in the list belong to the same class. When this happens, C4.5
simply creates a leaf node for the decision tree saying to choose that class.
o None of the features provide any information gain. In this case, C4.5 creates a
decision node higher up the tree using the expected value of the class.
o An instance of a previously unseen class is encountered. Again, C4.5 creates a
decision node higher up the tree using the expected value. [WIK04]
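A compact, self-contained sketch of this recursion on nominal data follows; it uses plain information gain rather than C4.5's gain ratio and omits pruning and the unseen-class case, so it illustrates the structure above rather than Weka's actual J48 internals:

    import java.util.*;

    // Skeletal C4.5-style tree building on nominal data. Samples are String
    // arrays whose last element is the class label.
    public class C45Sketch {

        static class Node {
            String leafClass;                            // set for leaves
            int splitAttr = -1;                          // set for internal nodes
            Map<String, Node> children = new HashMap<>();
        }

        static String classOf(String[] s) { return s[s.length - 1]; }

        static double entropy(List<String[]> samples) {
            Map<String, Integer> counts = new HashMap<>();
            for (String[] s : samples) counts.merge(classOf(s), 1, Integer::sum);
            double h = 0.0, n = samples.size();
            for (int c : counts.values()) { double p = c / n; h -= p * Math.log(p) / Math.log(2); }
            return h;
        }

        static Map<String, List<String[]>> partition(List<String[]> samples, int attr) {
            Map<String, List<String[]>> parts = new HashMap<>();
            for (String[] s : samples) parts.computeIfAbsent(s[attr], k -> new ArrayList<>()).add(s);
            return parts;
        }

        static String majorityClass(List<String[]> samples) {
            Map<String, Integer> counts = new HashMap<>();
            for (String[] s : samples) counts.merge(classOf(s), 1, Integer::sum);
            return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
        }

        static Node build(List<String[]> samples, Set<Integer> attrs) {
            Node node = new Node();
            // Base case: all samples belong to one class -> leaf with that class.
            double baseH = entropy(samples);
            if (baseH == 0.0) { node.leafClass = classOf(samples.get(0)); return node; }

            // Pick the attribute with the highest information gain.
            int best = -1; double bestGain = 0.0;
            for (int a : attrs) {
                double splitH = 0.0;
                for (List<String[]> part : partition(samples, a).values())
                    splitH += part.size() / (double) samples.size() * entropy(part);
                if (baseH - splitH > bestGain) { bestGain = baseH - splitH; best = a; }
            }
            // Base case: no attribute provides information gain -> majority-class leaf.
            if (best < 0) { node.leafClass = majorityClass(samples); return node; }

            // Recursive case: split on the best attribute and recurse on each subset.
            node.splitAttr = best;
            Set<Integer> rest = new HashSet<>(attrs); rest.remove(best);
            for (Map.Entry<String, List<String[]>> e : partition(samples, best).entrySet())
                node.children.put(e.getKey(), build(e.getValue(), rest));
            return node;
        }
    }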
4 Case Study
Data Exploration
The following table summarizes the data set content, which is mostly comprised of
nominal attributes.
Figure 1 Survey Data [WIL**]
There were a total of 32,561 continuous and discrete instances [HAR07].
- There are 15 attributes.
- Two of the attributes listed mirror one another: Education and Education-Num,
where Education-Num is a numeric representation of the other.
- There is an odd-looking fnlwgt attribute, which contains numeric values.
Following research on this attribute [SHO07], it appears to have no relation to
the income of each instance. It therefore has no predictive power and can be
ignored.
- There is a mix of numeric and nominal attributes.
- Some of the data contained missing values, denoted by "?".
- Missing values only appeared to be in numeric variables.
- Using Weka visualization tools, particularly scatter plots, the analyzed data
did not show strong class separation.
- There did not appear to be any spelling mistakes that would cause an
instance to be incorrectly classified.
- Some attributes appear to have an imbalanced distribution of values:
  o Age, education, capital-gain and capital-loss are heavily skewed towards the
  lower values.
  o Capital-gain and capital-loss are dominated by 0 values, with little
  occurrence of other values. (Data pre-processing may discretize these values
  into two bins: 0, and 1 or more.)
Preprocessing
For NeuroShell2, I removed all the observations with missing values, which
totaled about 6% of the data set. The missing values were only part of continuous
attributes, and given the large data set, their removal does not influence the study
significantly (especially since I merged the data set with the available test data from
the same source, for a combined total of 48,842 entries). In addition, I mapped all the
nominal values to numerical ones: for example, if an attribute had 4 different nominal
values, each was assigned an integer starting from 0, so the attribute's set of values
would be {0, 1, 2, 3}.
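A minimal sketch of this nominal-to-integer mapping (the attribute values below are invented placeholders, not the study's actual categories):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Assigns each distinct nominal value the next integer, starting from 0.
    public class NominalEncoder {
        private final Map<String, Integer> codes = new LinkedHashMap<>();

        public int encode(String value) {
            return codes.computeIfAbsent(value, v -> codes.size());
        }

        public static void main(String[] args) {
            NominalEncoder enc = new NominalEncoder();
            // Hypothetical values of a 4-valued nominal attribute.
            for (String v : new String[]{"Private", "Self-emp", "Federal-gov", "Private", "Never-worked"})
                System.out.println(v + " -> " + enc.encode(v));
            // The attribute's value set becomes {0, 1, 2, 3}.
        }
    }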
Still in NeuroShell2's case, I discretized the values of capital gain and capital loss
to three different values, i.e. 0 for no gain or loss, 1 for a low gain or loss, and 2 for a
high gain or loss. A gain higher than the mean value of 1079 was considered a high
gain, and a loss higher than the mean value of 87 was considered a high loss. In
addition, I partitioned the age and hours-per-week attributes (survey$Age and
survey$Hours.Per.Week) into bins as well.
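A sketch of the capital-gain/loss discretization described above (the thresholds are the means quoted in the text; the binning of age and hours per week is analogous, with cut points not recoverable here):

    // Maps a capital-gain or capital-loss amount to {0, 1, 2}:
    // 0 = none, 1 = at or below the mean ("low"), 2 = above the mean ("high").
    public class Discretizer {
        static int bin(double amount, double mean) {
            if (amount == 0) return 0;
            return amount > mean ? 2 : 1;
        }

        public static void main(String[] args) {
            double meanGain = 1079, meanLoss = 87; // means quoted in the text
            System.out.println(bin(0, meanGain));    // 0: no gain
            System.out.println(bin(500, meanGain));  // 1: low gain
            System.out.println(bin(5000, meanGain)); // 2: high gain
            System.out.println(bin(200, meanLoss));  // 2: high loss
        }
    }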
Results
Backpropagation with standard connections (3 layer)
o Default settings in NeuroShell2
o Calibration of 50
o Min. average error: 0.886637
o Root MSE: 0.3674
Naïve Bayes
o With default settings in Weka, the results summary is as follows:
Correctly Classified Instances 8081 82.7293 %
Incorrectly Classified Instances 1687 17.2707 %
Kappa statistic 0.4524
Mean absolute error 0.1787
Root mean squared error 0.3732
Relative absolute error 49.2119 %
Root relative squared error 87.7545 %
Total Number of Instances 9768
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.944 0.55 0.847 0.944 0.893 0.89 <=50K
Weighted Avg. 0.827 0.433 0.816 0.827 0.812 0.89
=== Confusion Matrix ===
   a    b   <-- classified as
A second naïve Bayes configuration (the 85.37% result referenced in the conclusions)
performed noticeably better:
Correctly Classified Instances 8339 85.3706 %
Incorrectly Classified Instances 1429 14.6294 %
Kappa statistic 0.5735
Mean absolute error 0.1666
Root mean squared error 0.3283
Relative absolute error 45.873 %
Root relative squared error 77.19 %
Total Number of Instances 9768
Decision Tree J48
o With default settings, the summary looks as follows:
Correctly Classified Instances 8411 86.1077 %
Incorrectly Classified Instances 1357 13.8923 %
Kappa statistic 0.5822
Mean absolute error 0.2021
Root mean squared error 0.3214
Relative absolute error 55.6491 %
Root relative squared error 75.5758 %
Total Number of Instances 9768
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.945 0.41 0.881 0.945 0.912 0.878 <=50K
Weighted Avg. 0.861 0.326 0.855 0.861 0.854 0.878
=== Confusion Matrix ===
   a    b   <-- classified as
SMO
o Changing the complexity parameter did not yield better results than the default of 1.
o With default settings in Weka, the summary looks as follows:
Correctly Classified Instances 8295 84.9293 %
Incorrectly Classified Instances 1473 15.0707 %
Kappa statistic 0.4524
Mean absolute error 0.1787
Root mean squared error 0.3302
Relative absolute error 49.2119 %
Root relative squared error 87.7545 %
Total Number of Instances 9768
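The complexity experiment above corresponds to SMO's C parameter; as an illustration, it can be varied through Weka's Java API as follows (the value passed in is arbitrary; evaluation then proceeds as for the other models):

    import weka.classifiers.functions.SMO;
    import weka.core.Instances;

    public class SmoComplexity {
        // Builds an SMO model with a non-default complexity parameter C.
        static SMO withComplexity(Instances train, double c) throws Exception {
            SMO smo = new SMO();
            smo.setC(c);                 // same as the -C command-line option
            smo.buildClassifier(train);
            return smo;
        }
    }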
5 Conclusions
After building and testing all the various models, the J48 decision tree algorithm
yielded the best results, with an accuracy of 86.1%, approached only by naïve Bayes at
85.37% with the kernel estimator turned off. The weakest results were given by the two
neural networks built with NeuroShell2, of which the Backpropagation with
Standard Connections network outperformed the other. However, to be fair, these last
two were trained on a slightly different data set, as described in the preprocessing
section.
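For reference, the kernel-estimator switch mentioned above is a standard option of Weka's NaiveBayes classifier (off by default, in which case numeric attributes are modeled with a Gaussian); it is shown here only to make the configuration explicit:

    import weka.classifiers.bayes.NaiveBayes;

    public class NaiveBayesConfig {
        static NaiveBayes withoutKernelEstimator() {
            NaiveBayes nb = new NaiveBayes();
            nb.setUseKernelEstimator(false); // the default; -K enables the kernel estimator
            return nb;
        }
    }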
References
[BEC96] Census Income Data Set, Ronny Kohavi and Barry Becker, 1996,
http://archive.ics.uci.edu/ml/datasets/Census+Income
[HAR07] Mining Information from US Census Bureau Data, Sebastian Harvey, 2007, http://research.omegasoft.co.uk/publications/Mining%20Information%20from%20US%20Census%20Bureau%20Data.pdf
[NEU**] Backpropagation Architecture Standard Connections, NeuroShell2 Help, http://www.wardsystems.com/manuals/neuroshell2/index.html?probackproparchstandard.htm
[SHO07] Chris Shoemaker, 10 March 2007, www.cs.wpi.edu/~cs4341/C00/Projects/fnlwgt/
[WIL**] Survey Data, Graham Williams,
http://www.togaware.com/datamining/survivor/Survey_Data.html
[WIK01] Artificial Neural Networks, Wikipedia,
http://en.wikipedia.org/wiki/Artificial_neural_network
[WIK02] Support Vector Machine, Wikipedia,
http://en.wikipedia.org/wiki/Support_vector_machine
[WIK03] Naïve Bayes Classifier, Wikipedia,
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
[WIK04] C4.5 Algorithm, Wikipedia, http://en.wikipedia.org/wiki/C4.5_algorithm
[WIK05] Backpropagation, Wikipedia, http://en.wikipedia.org/wiki/Backpropagation