POLITEHNICA UNIVERSITY OF TIMIȘOARA, MASTER OF SOFTWARE ENGINEERING
Pattern Recognition Classification Methods and Case Study
Marcel Gheorghi
2/5/2010
Contents
1 Introduction
2 Neural Networks
3 Classification Methods
4 Case Study
   Data Exploration
   Preprocessing
   Results
5 Conclusions
References
1 Introduction
The overall goal of this project is to develop classifier models that generalize whether a
person (defined as an anonymous instance) has an annual income of at most fifty
thousand dollars or greater than fifty thousand dollars. The data set used was extracted
from a 1994 U.S. census database by Barry Becker based on the following conditions:
((AAGE > 16) && (AGI > 100) && (AFNLWGT > 1) && (HRSWK > 0)), resulting in 48,842
reasonably clean observations [BEC96].
For this project, the Weka (Waikato Environment for Knowledge Analysis) data mining
toolkit was used. This toolkit, written in Java at the University of Waikato, provides a
considerable library of algorithms and models for classifying and generalizing data
[HAR07]. Out of these algorithms, naïve Bayes, SMO and J48 were used to develop
models based on 80% of the data, with the remaining 20% used for validation.
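As a rough illustration of this workflow, the following sketch wires up an 80/20 split and evaluation through Weka's Java API (the file name adult.arff and the random seed are assumptions; the classes used are part of Weka's standard API, and NaiveBayes or SMO can be substituted for J48):

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    import java.util.Random;

    public class IncomeClassifier {
        public static void main(String[] args) throws Exception {
            // Load the census data (file name is an assumption).
            Instances data = new DataSource("adult.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1); // income is the last attribute

            // Shuffle, then hold out the last 20% for validation.
            data.randomize(new Random(1));
            int trainSize = (int) Math.round(data.numInstances() * 0.8);
            Instances train = new Instances(data, 0, trainSize);
            Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

            // Train the classifier on the 80% partition.
            J48 tree = new J48();
            tree.buildClassifier(train);

            // Evaluate on the held-out 20% and print the usual Weka summary.
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            System.out.println(eval.toSummaryString());
        }
    }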
In addition, NeuroShell2 was used to train and test neural networks based on various
algorithms. This required additional data preparation as later described in the case
study.
2 Neural Networks
A neural network (NN), or "artificial neural network" (ANN), is a mathematical
model or computational model that tries to simulate the structure and/or functional
aspects of biological neural networks. It consists of an interconnected group
of artificial neurons and processes information using a connectionist approach
to computation. In most cases an ANN is an adaptive system that changes its
structure based on external or internal information that flows through the network
during the learning phase. Neural networks are non-linear statistical data
modeling tools. They can be used to model complex relationships between inputs and
outputs or to find patterns in data [WIK01].
Many neural network architectures have been developed to better suit various
situations. In the following paragraphs, two of them are presented, as they have been
used in the case study.
A simple recurrent network (SRN) is a variation on the Multi-Layer Perceptron,
sometimes called an "Elman network" after its inventor, Jeff Elman. A three-layer
network is used, with the addition of a set of "context units" in the input layer. There
are connections from the middle (hidden) layer to these context units, fixed with a
weight of one. At each time step, the input is propagated in a standard feed-forward
fashion, and then a learning rule (usually back-propagation) is applied. The fixed back
connections result in the context units always maintaining a copy of the previous
values of the hidden units (since they propagate over the connections before the
learning rule is applied). Thus the network can maintain a sort of state, allowing it to
perform tasks such as sequence prediction that are beyond the power of a standard
Multi-Layer Perceptron.
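As a minimal sketch of the mechanism just described (the dimensions, the sigmoid activation, and the calling convention are assumptions, not details of the case study), one forward step of an Elman network looks like this:

    // One forward step of a simple recurrent (Elman) network.
    // The context array holds the previous hidden values and is updated in place.
    public class ElmanStep {
        static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

        static double[] step(double[] input, double[] context,
                             double[][] wIn, double[][] wCtx, double[][] wOut) {
            int nHidden = wIn.length;
            double[] hidden = new double[nHidden];
            for (int h = 0; h < nHidden; h++) {
                double sum = 0.0;
                for (int i = 0; i < input.length; i++) sum += wIn[h][i] * input[i];
                for (int c = 0; c < context.length; c++) sum += wCtx[h][c] * context[c];
                hidden[h] = sigmoid(sum);
            }
            // The context units keep a copy of the hidden values over fixed
            // weight-one connections, so the next step sees the previous state.
            System.arraycopy(hidden, 0, context, 0, nHidden);

            double[] output = new double[wOut.length];
            for (int o = 0; o < wOut.length; o++) {
                double sum = 0.0;
                for (int h = 0; h < nHidden; h++) sum += wOut[o][h] * hidden[h];
                output[o] = sigmoid(sum);
            }
            return output;
        }
    }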
In a fully recurrent network, every neuron receives inputs from every other neuron in
the network. These networks are not arranged in layers. Usually only a subset of the
neurons receive external inputs in addition to the inputs from all the other neurons,
and another disjoint subset of neurons report their output externally as well as
sending it to all the neurons. These distinctive inputs and outputs perform the
function of the input and output layers of a feed-forward or simple recurrent network,
and also join all the other neurons in the recurrent processing [WIK01].
Backpropagation, or propagation of error, is a common method of training artificial
neural networks to perform a given task. It was first described by Arthur E.
Bryson and Yu-Chi Ho in 1969, but it wasn't until 1986, through the work
of David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams, that it gained
recognition, and it led to a renaissance in the field of artificial neural network
research [WIK05].
The Backpropagation architecture with standard connections is the standard type of
Backpropagation network, in which every layer is connected or linked to the
immediately previous layer. NeuroShell 2 gives the option of using a three-, four-, or
five-layer network.
Through experience and literature reviews, it has been found that the three-layer
Backpropagation network with standard connections is suitable for almost all
problems if enough hidden neurons are used. When more than one hidden layer (the
layers between the input and output layers) is used, training time may increase by
as much as an order of magnitude. [NEU**]
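To make the method concrete, here is a minimal three-layer backpropagation sketch with sigmoid activations, squared error, and plain gradient descent (all sizes, the learning rate, the initialization, and the omission of bias terms are simplifying assumptions, not NeuroShell 2 settings):

    import java.util.Random;

    // Minimal three-layer (input-hidden-output) backpropagation sketch.
    public class Backprop {
        final int nIn, nHid, nOut;
        final double[][] w1, w2;   // input->hidden and hidden->output weights
        final double lr = 0.1;     // learning rate (assumed)

        Backprop(int nIn, int nHid, int nOut, long seed) {
            this.nIn = nIn; this.nHid = nHid; this.nOut = nOut;
            Random rnd = new Random(seed);
            w1 = new double[nHid][nIn];
            w2 = new double[nOut][nHid];
            for (double[] row : w1) for (int i = 0; i < nIn; i++) row[i] = rnd.nextGaussian() * 0.1;
            for (double[] row : w2) for (int h = 0; h < nHid; h++) row[h] = rnd.nextGaussian() * 0.1;
        }

        static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

        // One stochastic gradient step on a single (x, target) pair.
        void train(double[] x, double[] target) {
            // Forward pass through the hidden and output layers.
            double[] hid = new double[nHid], out = new double[nOut];
            for (int h = 0; h < nHid; h++) {
                double s = 0; for (int i = 0; i < nIn; i++) s += w1[h][i] * x[i];
                hid[h] = sigmoid(s);
            }
            for (int o = 0; o < nOut; o++) {
                double s = 0; for (int h = 0; h < nHid; h++) s += w2[o][h] * hid[h];
                out[o] = sigmoid(s);
            }
            // Backward pass: propagate the output error toward the input.
            double[] dOut = new double[nOut], dHid = new double[nHid];
            for (int o = 0; o < nOut; o++)
                dOut[o] = (out[o] - target[o]) * out[o] * (1 - out[o]);
            for (int h = 0; h < nHid; h++) {
                double s = 0; for (int o = 0; o < nOut; o++) s += w2[o][h] * dOut[o];
                dHid[h] = s * hid[h] * (1 - hid[h]);
            }
            // Gradient-descent weight updates.
            for (int o = 0; o < nOut; o++) for (int h = 0; h < nHid; h++) w2[o][h] -= lr * dOut[o] * hid[h];
            for (int h = 0; h < nHid; h++) for (int i = 0; i < nIn; i++) w1[h][i] -= lr * dHid[h] * x[i];
        }
    }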
3 Classification Methods
In the case study, three methods of classification have been used: naïve Bayes, SMO
(SVM) and J48 (decision tree), described in this section.
Support vector machines (SVMs) are a set of related supervised learning methods
used for classification and regression. In simple words, given a set of training
examples, each marked as belonging to one of two categories, an SVM training
algorithm builds a model that predicts whether a new example falls into one category
or the other. Intuitively, an SVM model is a representation of the examples as points in
space, mapped so that the examples of the separate categories are divided by a clear
gap that is as wide as possible. New examples are then mapped into that same space
and predicted to belong to a category based on which side of the gap they fall on.
More formally, a support vector machine constructs a hyperplane or set of
hyperplanes in a high- or infinite-dimensional space, which can be used for
classification, regression or other tasks. Intuitively, a good separation is achieved by
the hyperplane that has the largest distance to the nearest training data points of any
class (the so-called functional margin), since in general the larger the margin, the lower
the generalization error of the classifier. [WIK02]
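In the standard textbook formulation (not specific to the case study), for linearly separable data with labels $y_i \in \{-1, +1\}$, the maximum-margin hyperplane solves

    \[ \min_{w,\,b} \; \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 \;\; \text{for all } i, \]

where the width of the gap is $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin described above.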
A Bayes classifier is a simple probabilistic classifier based on applying Bayes'
theorem (from Bayesian statistics) with strong (naive) independence assumptions. A
more descriptive term for the underlying probability model would be
"independent feature model".
In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a
particular feature of a class is unrelated to the presence (or absence) of any other
feature. For example, a fruit may be considered to be an apple if it is red, round, and
about 4" in diameter. Even though these features depend on the existence of the other
features, a naive Bayes classifier considers all of these properties to independently
contribute to the probability that this fruit is an apple.
Depending on the precise nature of the probability model, naive Bayes classifiers can
be trained very efficiently in a supervised learning setting. In many practical
applications, parameter estimation for naive Bayes models uses the method
of maximum likelihood; in other words, one can work with the naive Bayes model
without believing in Bayesian probability or using any Bayesian methods.
An advantage of the naive Bayes classifier is that it requires a small amount of
training data to estimate the parameters (means and variances of the variables)
necessary for classification. Because independent variables are assumed, only the
variances of the variables for each class need to be determined and not the entire
covariance matrix. [WIK03]
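In symbols (the standard formulation), the naive independence assumption reduces the class posterior to

    \[ P(C \mid x_1, \ldots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C), \]

and classification picks $\hat{c} = \arg\max_c P(c) \prod_i P(x_i \mid c)$. For numeric attributes, each $P(x_i \mid c)$ is typically modeled as a Gaussian parameterized by the per-class mean and variance mentioned above.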
J48 is an open source Java implementation of the C4.5 algorithm in the Weka data
mining tool. C4.5 is an algorithm, developed by Ross Quinlan, used to generate a
decision tree. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees
generated by C4.5 can be used for classification, and for this reason, C4.5 is often
referred to as a statistical classifier.
C4.5 builds decision trees from a set of training data in the same way as ID3, using
the concept of information entropy. The training data is a set S = {s_1, s_2, ...} of already
classified samples. Each sample s_i = (x_1, x_2, ...) is a vector whose components
x_1, x_2, ... represent attributes or features of the sample. The training data is
augmented with a vector C = (c_1, c_2, ...), where c_1, c_2, ... represent the class to
which each sample belongs.
At each node of the tree, C4.5 chooses the attribute of the data that most effectively
splits its set of samples into subsets enriched in one class or the other. Its criterion is
the normalized information gain (difference in entropy) that results from choosing an
attribute for splitting the data. The attribute with the highest normalized information
gain is chosen to make the decision. The C4.5 algorithm then recurses on the smaller
sub-lists.
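Written out (standard definitions), with $H(S) = -\sum_c p_c \log_2 p_c$ the entropy of the class distribution in $S$, the information gain of splitting on attribute $A$ is

    \[ \mathrm{Gain}(S, A) \;=\; H(S) \;-\; \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v), \]

and C4.5 normalizes it by the split information $-\sum_v \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$ to obtain the gain ratio used for the decision.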
This algorithm has a few base cases:
o All the samples in the list belong to the same class. When this happens, C4.5
simply creates a leaf node for the decision tree saying to choose that class.
o None of the features provide any information gain. In this case, C4.5 creates a
decision node higher up the tree using the expected value of the class.
o An instance of a previously unseen class is encountered. Again, C4.5 creates a
decision node higher up the tree using the expected value. [WIK04]
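A compact, self-contained sketch of this recursion on nominal data follows; it uses plain information gain rather than C4.5's gain ratio and omits pruning and the unseen-class case, so it illustrates the structure above rather than Weka's actual J48 internals:

    import java.util.*;

    // Skeletal C4.5-style tree building on nominal data. Samples are String
    // arrays whose last element is the class label.
    public class C45Sketch {

        static class Node {
            String leafClass;                            // set for leaves
            int splitAttr = -1;                          // set for internal nodes
            Map<String, Node> children = new HashMap<>();
        }

        static String classOf(String[] s) { return s[s.length - 1]; }

        static double entropy(List<String[]> samples) {
            Map<String, Integer> counts = new HashMap<>();
            for (String[] s : samples) counts.merge(classOf(s), 1, Integer::sum);
            double h = 0.0, n = samples.size();
            for (int c : counts.values()) { double p = c / n; h -= p * Math.log(p) / Math.log(2); }
            return h;
        }

        static Map<String, List<String[]>> partition(List<String[]> samples, int attr) {
            Map<String, List<String[]>> parts = new HashMap<>();
            for (String[] s : samples) parts.computeIfAbsent(s[attr], k -> new ArrayList<>()).add(s);
            return parts;
        }

        static String majorityClass(List<String[]> samples) {
            Map<String, Integer> counts = new HashMap<>();
            for (String[] s : samples) counts.merge(classOf(s), 1, Integer::sum);
            return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
        }

        static Node build(List<String[]> samples, Set<Integer> attrs) {
            Node node = new Node();
            // Base case: all samples belong to one class -> leaf with that class.
            double baseH = entropy(samples);
            if (baseH == 0.0) { node.leafClass = classOf(samples.get(0)); return node; }

            // Pick the attribute with the highest information gain.
            int best = -1; double bestGain = 0.0;
            for (int a : attrs) {
                double splitH = 0.0;
                for (List<String[]> part : partition(samples, a).values())
                    splitH += part.size() / (double) samples.size() * entropy(part);
                if (baseH - splitH > bestGain) { bestGain = baseH - splitH; best = a; }
            }
            // Base case: no attribute provides information gain -> majority-class leaf.
            if (best < 0) { node.leafClass = majorityClass(samples); return node; }

            // Recursive case: split on the best attribute and recurse on each subset.
            node.splitAttr = best;
            Set<Integer> rest = new HashSet<>(attrs); rest.remove(best);
            for (Map.Entry<String, List<String[]>> e : partition(samples, best).entrySet())
                node.children.put(e.getKey(), build(e.getValue(), rest));
            return node;
        }
    }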
4 Case Study
Data Exploration
The following table summarizes the data set content, which is mostly comprised of
nominal attributes.
Figure 1 Survey Data [WIL**]
There were a total of 32,561 continuous and discrete instances [HAR07].
- There are 15 attributes.
- Two of the attributes listed mirror one another: Education and Education-Num,
where Education-Num is a numeric representation of the other.
- There is an odd-looking fnlwgt attribute, which contains numeric values.
Following research on this attribute [SHO07], it appears to have no relation to
the income of each instance. It therefore has no predictive power and can be
ignored.
- There is a mix of numeric and nominal attributes.
- Some of the data contained missing values, denoted by "?".
- Missing values only appeared to be in numeric variables.
- Using Weka visualization tools, particularly scatter plots, the analyzed data
did not show strong class separation.
- There did not appear to be any spelling mistakes that would cause an
instance to be incorrectly classified.
- Some attributes appear to have an imbalanced distribution of values:
  o Age, education, capital-gain and capital-loss are heavily skewed towards the
  lower values.
  o Capital-gain and capital-loss are dominated by 0 values, with little
  occurrence of other values. (Data pre-processing may discretize these values
  into two bins: 0, and 1 or more.)
Preprocessing
For NeuroShell2, I removed all the observations with missing values, which
totaled about 6% of the data set. The missing values were only part of continuous
attributes, and given the large data set, their removal does not influence the study
significantly (especially since I merged the data set with the available test data from
the same source, for a combined total of 48,842 entries). In addition, I mapped all the
nominal values to numerical ones: for example, if an attribute had 4 different nominal
values, each was assigned an integer starting from 0, so the attribute's set of values
would be {0, 1, 2, 3}.
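A minimal sketch of this nominal-to-integer mapping (the attribute values below are invented placeholders, not the study's actual categories):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Assigns each distinct nominal value the next integer, starting from 0.
    public class NominalEncoder {
        private final Map<String, Integer> codes = new LinkedHashMap<>();

        public int encode(String value) {
            return codes.computeIfAbsent(value, v -> codes.size());
        }

        public static void main(String[] args) {
            NominalEncoder enc = new NominalEncoder();
            // Hypothetical values of a 4-valued nominal attribute.
            for (String v : new String[]{"Private", "Self-emp", "Federal-gov", "Private", "Never-worked"})
                System.out.println(v + " -> " + enc.encode(v));
            // The attribute's value set becomes {0, 1, 2, 3}.
        }
    }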
Still in NeuroShell2's case, I discretized the values of capital gain and capital loss
to three different values, i.e. 0 for no gain or loss, 1 for a low gain or loss, and 2 for a
high gain or loss. A gain higher than the mean value of 1079 was considered a high
gain, and a loss higher than the mean value of 87 was considered a high loss. In
addition, I partitioned the age and hours-per-week attributes (survey$Age and
survey$Hours.Per.Week) into bins as well.
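A sketch of the capital-gain/loss discretization described above (the thresholds are the means quoted in the text; the binning of age and hours per week is analogous, with cut points not recoverable here):

    // Maps a capital-gain or capital-loss amount to {0, 1, 2}:
    // 0 = none, 1 = at or below the mean ("low"), 2 = above the mean ("high").
    public class Discretizer {
        static int bin(double amount, double mean) {
            if (amount == 0) return 0;
            return amount > mean ? 2 : 1;
        }

        public static void main(String[] args) {
            double meanGain = 1079, meanLoss = 87; // means quoted in the text
            System.out.println(bin(0, meanGain));    // 0: no gain
            System.out.println(bin(500, meanGain));  // 1: low gain
            System.out.println(bin(5000, meanGain)); // 2: high gain
            System.out.println(bin(200, meanLoss));  // 2: high loss
        }
    }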
Results
Backpropagation with standard connections (3 layer)
o Default settings in NeuroShell2
o Calibration of 50
o Min. average error: 0.886637
o Root MSE: 0.3674
Naïve Bayes
o With default settings in Weka, the results summary is as follows:
Correctly Classified Instances 8081 82.7293 %
Incorrectly Classified Instances 1687 17.2707 %
Kappa statistic 0.4524
Mean absolute error 0.1787
Root mean squared error 0.3732
Relative absolute error 49.2119 %
Root relative squared error 87.7545 %
Total Number of Instances 9768
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.944 0.55 0.847 0.944 0.893 0.89 <=50K
Weighted Avg. 0.827 0.433 0.816 0.827 0.812 0.89
=== Confusion Matrix ===
   a    b   <-- classified as
A second naïve Bayes configuration (the 85.37% result referenced in the conclusions)
performed noticeably better:
Correctly Classified Instances 8339 85.3706 %
Incorrectly Classified Instances 1429 14.6294 %
Kappa statistic 0.5735
Mean absolute error 0.1666
Root mean squared error 0.3283
Relative absolute error 45.873 %
Root relative squared error 77.19 %
Total Number of Instances 9768
Decision Tree J48
o With default settings, the summary looks as follows:
Correctly Classified Instances 8411 86.1077 %
Incorrectly Classified Instances 1357 13.8923 %
Kappa statistic 0.5822
Mean absolute error 0.2021
Root mean squared error 0.3214
Relative absolute error 55.6491 %
Root relative squared error 75.5758 %
Total Number of Instances 9768
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.945 0.41 0.881 0.945 0.912 0.878 <=50K
Weighted Avg. 0.861 0.326 0.855 0.861 0.854 0.878
=== Confusion Matrix ===
   a    b   <-- classified as
SMO
o Changing the complexity parameter did not yield better results than the default of 1.
o With default settings in Weka, the summary looks as follows:
Correctly Classified Instances 8295 84.9293 %
Incorrectly Classified Instances 1473 15.0707 %
Kappa statistic 0.4524
Mean absolute error 0.1787
Root mean squared error 0.3302
Relative absolute error 49.2119 %
Root relative squared error 87.7545 %
Total Number of Instances 9768
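The complexity experiment above corresponds to SMO's C parameter; as an illustration, it can be varied through Weka's Java API as follows (the value passed in is arbitrary; evaluation then proceeds as for the other models):

    import weka.classifiers.functions.SMO;
    import weka.core.Instances;

    public class SmoComplexity {
        // Builds an SMO model with a non-default complexity parameter C.
        static SMO withComplexity(Instances train, double c) throws Exception {
            SMO smo = new SMO();
            smo.setC(c);                 // same as the -C command-line option
            smo.buildClassifier(train);
            return smo;
        }
    }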
5 Conclusions
After building and testing all the various models, the J48 decision tree algorithm
yielded the best results, with an accuracy of 86.1%, approached only by naïve Bayes at
85.37% with the kernel estimator turned off. The weakest results were given by the two
neural networks built with NeuroShell2, of which the Backpropagation with
Standard Connections network outperformed the other. However, to be fair, these last
two were trained on a slightly different data set, as described in the preprocessing
section.
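For reference, the kernel-estimator switch mentioned above is a standard option of Weka's NaiveBayes classifier (off by default, in which case numeric attributes are modeled with a Gaussian); it is shown here only to make the configuration explicit:

    import weka.classifiers.bayes.NaiveBayes;

    public class NaiveBayesConfig {
        static NaiveBayes withoutKernelEstimator() {
            NaiveBayes nb = new NaiveBayes();
            nb.setUseKernelEstimator(false); // the default; -K enables the kernel estimator
            return nb;
        }
    }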
References
[BEC96] Census Income Data Set, Ronny Kohavi and Barry Becker, 1996,
http://archive.ics.uci.edu/ml/datasets/Census+Income
[HAR07] Mining Information from US Census Bureau Data, Sebastian Harvey, 2007, http://research.omegasoft.co.uk/publications/Mining%20Information%20from%20US%20Census%20Bureau%20Data.pdf
[NEU**] Backpropagation Architecture Standard Connections, NeuroShell2 Help, http://www.wardsystems.com/manuals/neuroshell2/index.html?probackproparchstandard.htm
[SHO07] Chris Shoemaker, 10 March 2007, www.cs.wpi.edu/~cs4341/C00/Projects/fnlwgt/
[WIL**] Survey Data, Graham Williams,
http://www.togaware.com/datamining/survivor/Survey_Data.html
[WIK01] Artificial Neural Networks, Wikipedia,
http://en.wikipedia.org/wiki/Artificial_neural_network
[WIK02] Support Vector Machine, Wikipedia,
http://en.wikipedia.org/wiki/Support_vector_machine
[WIK03] Naïve Bayes Classifier, Wikipedia,
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
[WIK04] C4.5 Algorithm, Wikipedia, http://en.wikipedia.org/wiki/C4.5_algorithm
[WIK05] Backpropagation, Wikipedia, http://en.wikipedia.org/wiki/Backpropagation