
"POLITEHNICA" UNIVERSITY OF TIMIŞOARA, MASTER OF SOFTWARE ENGINEERING

Pattern Recognition Classification Methods and Case Study

Marcel Gheorghiţă

2/5/2010

Contents

1 Introduction

2 Neural Networks

3 Classification Methods

4 Case Study

    Data Exploration

    Preprocessing

    Results

5 Conclusions

References

1 Introduction

The overall goal of this project is to develop classifier models that generalize whether a

person (defined as an anonymous instance) has an annual income of less than or equal to

fifty thousand dollars or greater than fifty thousand dollars. The data set used was extracted from a

    1994 U.S. census database by Barry Becker based on the following conditions:

    ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0)), resulting in 48,842

    reasonably clean observations [BEC96].

    For this project, the Weka (Waikato Environment for Knowledge Analysis) data mining

    toolkit was used. This toolkit, written in Java at the University of Waikato, provides a

    considerable library of algorithms and models for classifying and generalizing data

[HAR07]. Out of these algorithms, naive Bayes, SMO and J48 have been used to

    develop a model based on 80% of the data, with the remaining 20% used for

    validation.

    In addition, NeuroShell2 was used to train and test neural networks based on various

    algorithms. This required additional data preparation as later described in the case

    study.

2 Neural Networks

    A neural network (NN), or "artificial neural network" (ANN), is a mathematical

    model or computational model that tries to simulate the structure and/or functional

    aspects of biological neural networks. It consists of an interconnected group

    of artificial neurons and processes information using a connectionist approach

    to computation. In most cases an ANN is an adaptive system that changes its

    structure based on external or internal information that flows through the network

    during the learning phase. Neural networks are non-linear statistical data

    modeling tools. They can be used to model complex relationships between inputs and

    outputs or to find patterns in data [WIK01].

Many neural network architectures have been developed to better suit various

    situations. In the following paragraphs, two of these are presented as they have been

    used in the case study.

    A simple recurrent network (SRN) is a variation on the Multi-Layer Perceptron,

    sometimes called an "Elman network" due to its invention by Jeff Elman. A three-layer

    network is used, with the addition of a set of "context units" in the input layer. There

    are connections from the middle (hidden) layer to these context units fixed with a

    weight of one. At each time step, the input is propagated in a standard feed-forward

    fashion, and then a learning rule (usually back-propagation) is applied. The fixed back

    connections result in the context units always maintaining a copy of the previous

    values of the hidden units (since they propagate over the connections before the

    learning rule is applied). Thus the network can maintain a sort of state, allowing it to

perform such tasks as sequence prediction that are beyond the power of a standard

    Multi-Layer Perceptron.
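Written compactly (my notation, this equation is not part of the original description), the copy mechanism amounts to

\[ c(t) = h(t-1), \qquad h(t) = f\big(W_x\,x(t) + W_c\,c(t) + b_h\big), \qquad y(t) = g\big(W_y\,h(t) + b_y\big), \]

where x(t) is the input, c(t) are the context units, h(t) is the hidden layer and y(t) the output; the copy connections from h to c are fixed at weight one, while W_x, W_c and W_y are the weights adjusted by the learning rule.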

    In a fully recurrent network, every neuron receives inputs from every other neuron in

    the network. These networks are not arranged in layers. Usually only a subset of the

    neurons receive external inputs in addition to the inputs from all the other neurons,

and another disjoint subset of neurons report their output externally as well as

    sending it to all the neurons. These distinctive inputs and outputs perform the

    function of the input and output layers of a feed-forward or simple recurrent network,

    and also join all the other neurons in the recurrent processing [WIK01].

    Backpropagation, or propagation of error, is a common method of teaching artificial

    neural networks how to perform a given task. It was first described by Arthur E.

Bryson and Yu-Chi Ho in 1969, but it wasn't until 1986, through the work

    of David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams, that it gained

    recognition, and it led to a renaissance in the field of artificial neural network

    research [WIK05].


The Backpropagation architecture with standard connections is the standard type of

    Backpropagation network in which every layer is connected or linked to the

    immediately previous layer. NeuroShell 2 gives the option of using a three, four, or

    five layer network.

    Through experience and literature reviews, it has been found that the three layer

    Backpropagation network with standard connections is suitable for almost all

    problems if enough hidden neurons are used. When more than one hidden layer (the

    layers between the input and output layers) is used, training time may be increased by

    as much as an order of magnitude. [NEU**]
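To make the three-layer idea concrete, the following is a minimal sketch of standard online backpropagation (sigmoid units, squared error) on the XOR problem, written in plain Java. It only illustrates the algorithm described above and is not NeuroShell2 code; the network size, learning rate and number of epochs are arbitrary choices.

```java
import java.util.Arrays;
import java.util.Random;

/** Minimal three-layer network (input, hidden, output) trained with online backpropagation. */
public class TinyBackprop {

    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    public static void main(String[] args) {
        double[][] x = { {0, 0}, {0, 1}, {1, 0}, {1, 1} };    // inputs
        double[] t = { 0, 1, 1, 0 };                          // XOR targets
        int nIn = 2, nHid = 4;
        double eta = 0.5;                                     // learning rate

        Random rnd = new Random(1);
        double[][] w1 = new double[nHid][nIn + 1];            // hidden weights, last column = bias
        double[] w2 = new double[nHid + 1];                   // output weights, last entry = bias
        for (double[] row : w1) for (int j = 0; j < row.length; j++) row[j] = rnd.nextGaussian() * 0.5;
        for (int j = 0; j < w2.length; j++) w2[j] = rnd.nextGaussian() * 0.5;

        for (int epoch = 0; epoch < 20000; epoch++) {
            for (int p = 0; p < x.length; p++) {
                // forward pass
                double[] h = new double[nHid];
                for (int i = 0; i < nHid; i++) {
                    double s = w1[i][nIn];
                    for (int j = 0; j < nIn; j++) s += w1[i][j] * x[p][j];
                    h[i] = sigmoid(s);
                }
                double s = w2[nHid];
                for (int i = 0; i < nHid; i++) s += w2[i] * h[i];
                double y = sigmoid(s);

                // backward pass: propagate the output error back through the hidden layer
                double dOut = (t[p] - y) * y * (1 - y);
                for (int i = 0; i < nHid; i++) {
                    double dHid = dOut * w2[i] * h[i] * (1 - h[i]);
                    for (int j = 0; j < nIn; j++) w1[i][j] += eta * dHid * x[p][j];
                    w1[i][nIn] += eta * dHid;                 // hidden bias
                    w2[i] += eta * dOut * h[i];
                }
                w2[nHid] += eta * dOut;                       // output bias
            }
        }

        // report what the network learned
        for (double[] in : x) {
            double[] h = new double[nHid];
            for (int i = 0; i < nHid; i++) {
                double s = w1[i][nIn];
                for (int j = 0; j < nIn; j++) s += w1[i][j] * in[j];
                h[i] = sigmoid(s);
            }
            double s = w2[nHid];
            for (int i = 0; i < nHid; i++) s += w2[i] * h[i];
            System.out.printf("%s -> %.3f%n", Arrays.toString(in), sigmoid(s));
        }
    }
}
```

With enough hidden neurons this single hidden layer suffices for the task, in line with the observation above that a three-layer network with standard connections handles most problems.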

3 Classification Methods

In the case study, three methods of classification have been used: naive Bayes, SMO

    (SVM) and J48 (decision tree), described in this section.

    Support vector machines (SVMs) are a set of related supervised learning methods

    used for classification and regression. In simple words, given a set of training

    examples, each marked as belonging to one of two categories, an SVM training

    algorithm builds a model that predicts whether a new example falls into one category

    or the other. Intuitively, an SVM model is a representation of the examples as points in

    space, mapped so that the examples of the separate categories are divided by a clear

    gap that is as wide as possible. New examples are then mapped into that same space

    and predicted to belong to a category based on which side of the gap they fall on.

    More formally, a support vector machine constructs a hyperplane or set of

    hyperplanes in a high or infinite dimensional space, which can be used for

    classification, regression or other tasks. Intuitively, a good separation is achieved by

    the hyperplane that has the largest distance to the nearest training datapoints of any

    class (so-called functional margin), since in general the larger the margin the lower

    the generalization error of the classifier. [WIK02]
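In the usual textbook notation (not taken from the report), for training pairs (x_i, y_i) with y_i in {-1, +1} the maximum-margin hyperplane is found by solving

\[ \min_{w,b}\ \tfrac{1}{2}\lVert w\rVert^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 \ \text{for all } i, \]

and a new example x is assigned to the class given by the sign of w·x + b; since the margin width is 2/||w||, minimizing ||w|| maximizes the gap mentioned above.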

A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes'

    theorem (from Bayesian statistics) with strong (naive) independence assumptions. A

    more descriptive term for the underlying probability model would be

    "independent feature model".

    In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a

    particular feature of a class is unrelated to the presence (or absence) of any other

    feature. For example, a fruit may be considered to be an apple if it is red, round, and

    about 4" in diameter. Even though these features depend on the existence of the other

    features, a naive Bayes classifier considers all of these properties to independently

    contribute to the probability that this fruit is an apple.

    Depending on the precise nature of the probability model, naive Bayes classifiers can

    be trained very efficiently in a supervised learning setting. In many practical

    applications, parameter estimation for naive Bayes models uses the method

    of maximum likelihood; in other words, one can work with the naive Bayes model

    without believing in Bayesian probability or using any Bayesian methods.

    An advantage of the naive Bayes classifier is that it requires a small amount of

    training data to estimate the parameters (means and variances of the variables)

    necessary for classification. Because independent variables are assumed, only the

    variances of the variables for each class need to be determined and not the entire

    covariance matrix. [WIK03]
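The resulting decision rule can be written compactly (standard form, my notation): for a class c and observed features x_1, ..., x_n,

\[ \hat{c} = \arg\max_{c}\ P(c)\prod_{i=1}^{n} P(x_i \mid c), \]

which follows from Bayes' theorem, P(c | x_1, ..., x_n) ∝ P(c) P(x_1, ..., x_n | c), once the independence assumption factorizes the likelihood into per-feature terms whose parameters (e.g. means and variances) are estimated from the training data.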


J48 is an open source Java implementation of the C4.5 algorithm in the Weka data

    mining tool. C4.5 is an algorithm used to generate a decision tree developed by Ross

    Quinlan. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees

    generated by C4.5 can be used for classification, and for this reason, C4.5 is often

    referred to as a statistical classifier.

    C4.5 builds decision trees from a set of training data in the same way as ID3, using

the concept of information entropy. The training data is a set S = {s_1, s_2, ...} of already classified samples. Each sample s_i = (x_1, x_2, ...) is a vector whose components x_1, x_2, ... represent attributes or features of the sample. The training data is augmented with a vector C = (c_1, c_2, ...) whose entries c_1, c_2, ... give the class to which each sample belongs.

    At each node of the tree, C4.5 chooses one attribute of the data that most effectively

    splits its set of samples into subsets enriched in one class or the other. Its criterion is

    the normalized information gain (difference in entropy) that results from choosing an

    attribute for splitting the data. The attribute with the highest normalized information

gain is chosen to make the decision. The C4.5 algorithm then recurses on the smaller

    sub-lists.
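For reference, the quantities behind this criterion can be stated as follows (standard C4.5 notation, not reproduced from the report): with p_c the proportion of samples in S that belong to class c,

\[ H(S) = -\sum_{c} p_c \log_2 p_c, \qquad \mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v), \]

where S_v is the subset of S taking value v on attribute A; C4.5 divides this gain by the split information −Σ_v (|S_v|/|S|) log₂(|S_v|/|S|) to obtain the gain ratio, and the attribute with the highest ratio is chosen for the split.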

    This algorithm has a few base cases.

• All the samples in the list belong to the same class. When this happens, it simply creates a leaf node for the decision tree saying to choose that class.

• None of the features provide any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.

• Instance of previously-unseen class encountered. Again, C4.5 creates a decision node higher up the tree using the expected value. [WIK04]

4 Case Study

Data Exploration

The following table summarizes the data set content, which is mostly composed of nominal attributes.

Figure 1: Survey Data [WIL**]


• There were a total of 32,561 instances, containing both continuous and discrete attributes [HAR07].

• There are 15 attributes.

• Two of the attributes appear to be mirrors of one another: Education and Education-Num, where Education-Num is a numeric representation of Education.

• There is an odd-looking fnlwgt attribute, which contains numeric values. Following research on this attribute [SHO07], it appears to have no relation to the income of each instance; therefore it has no predictive power and can be ignored.

• There is a mix of numeric and nominal attributes.

• Some of the data contained missing values, denoted by "?" (a small Weka snippet for checking these counts is sketched after this list).

• Missing values only appeared to be in numeric variables.

• Using Weka visualization tools, particularly scatter plots, the analyzed data did not show strong class separation.

• There did not appear to be any spelling mistakes that would cause an instance to be incorrectly classified.

• Some attributes appear to have an imbalanced distribution of values.

    o Age, education, capital-gain and capital-loss are very skewed towards the lower values.

    o Capital-gain and capital-loss are dominated by 0 values, with little occurrence of other values (data pre-processing may discretize these values into two bins: 0, and 1 or more).
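Observations such as the missing-value counts and the number of distinct values per attribute can be double-checked through Weka's Java API. The snippet below is a small sketch under the assumption that the census data has been converted to an ARFF file named adult.arff (the file name and the printout format are mine).

```java
import weka.core.AttributeStats;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ExploreAdult {
    public static void main(String[] args) throws Exception {
        // Load the census data (ARFF conversion assumed) and mark the last attribute as the class.
        Instances data = DataSource.read("adult.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // One line per attribute: name, number of missing values, number of distinct values.
        for (int i = 0; i < data.numAttributes(); i++) {
            AttributeStats stats = data.attributeStats(i);
            System.out.printf("%-16s missing=%d distinct=%d%n",
                    data.attribute(i).name(), stats.missingCount, stats.distinctCount);
        }
    }
}
```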

    Preprocessing

For NeuroShell2, I have removed all the observations with missing values, which totaled about 6% of the data set. The missing values were only part of continuous attributes, and given the large data set, this does not influence our study significantly (especially since I merged the data set with the available test data from the same source, for a combined total of 48,842 entries). In addition, I have mapped all the nominal values to numerical ones. For example, if an attribute had 4 different nominal values, each value was assigned an integer starting from 0, so the attribute's set of values would be {0, 1, 2, 3}.


Still in NeuroShell2's case, I have discretized the values of capital gain and capital loss to three different values: 0 for no gain or loss, 1 for a low gain or loss, and 2 for a high gain or loss. A gain higher than the mean value of 1079 was considered a high gain, and a loss higher than the mean value of 87 was considered a high loss. In addition, I have partitioned the age and hours-per-week attributes into bins as well (originally done in R on the survey$Age and survey$Hours.Per.Week columns); a sketch of this preprocessing is given below.
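The following Java sketch illustrates the two transformations just described: nominal values are assigned integer codes in order of first appearance, and capital gain/loss is discretized into the three bins using the mean thresholds of 1079 and 87. It is an illustrative reconstruction, not the actual preprocessing script, and the example values are invented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class Preprocess {

    // Map each nominal value to an integer code assigned in order of first appearance: 0, 1, 2, ...
    static int encodeNominal(Map<String, Integer> codes, String value) {
        return codes.computeIfAbsent(value, v -> codes.size());
    }

    // Discretize capital gain/loss: 0 = none, 1 = at or below the mean, 2 = above the mean.
    static int discretize(double amount, double mean) {
        if (amount == 0.0) return 0;
        return amount > mean ? 2 : 1;
    }

    public static void main(String[] args) {
        Map<String, Integer> workclassCodes = new LinkedHashMap<>();
        System.out.println(encodeNominal(workclassCodes, "Private"));       // 0
        System.out.println(encodeNominal(workclassCodes, "Self-emp-inc"));  // 1
        System.out.println(encodeNominal(workclassCodes, "Private"));       // 0 again

        System.out.println(discretize(5000, 1079));  // 2: high gain, above the mean of 1079
        System.out.println(discretize(200, 1079));   // 1: low gain
        System.out.println(discretize(0, 87));       // 0: no loss
    }
}
```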

Results

• Backpropagation with standard connections (3 layer)

    o Default settings in NeuroShell2

    o Calibration of 50

    o Min. average error: 0.886637

o Root MSE: 0.3674

• Naive Bayes

o With default settings in Weka, the results summary is as follows:

    Correctly Classified Instances 8081 82.7293 %

    Incorrectly Classified Instances 1687 17.2707 %

    Kappa statistic 0.4524

    Mean absolute error 0.1787

    Root mean squared error 0.3732

    Relative absolute error 49.2119 %

    Root relative squared error 87.7545 %

    Total Number of Instances 9768

    === Detailed Accuracy By Class ===

    TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.944 0.55 0.847 0.944 0.893 0.89 <=50K

    Weighted Avg. 0.827 0.433 0.816 0.827 0.812 0.89


o Changing the kernel estimator option gave the following summary:

Correctly Classified Instances 8339 85.3706 %

Incorrectly Classified Instances 1429 14.6294 %

    Kappa statistic 0.5735

    Mean absolute error 0.1666

    Root mean squared error 0.3283

    Relative absolute error 45.873 %

    Root relative squared error 77.19 %

    Total Number of Instances 9768

• Decision Tree J48

    o With default settings, the summary looks as follows:

    Correctly Classified Instances 8411 86.1077 %

    Incorrectly Classified Instances 1357 13.8923 %

    Kappa statistic 0.5822

    Mean absolute error 0.2021

    Root mean squared error 0.3214

    Relative absolute error 55.6491 %

    Root relative squared error 75.5758 %

    Total Number of Instances 9768

    === Detailed Accuracy By Class ===

    TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.945 0.41 0.881 0.945 0.912 0.878 <=50K

    Weighted Avg. 0.861 0.326 0.855 0.861 0.854 0.878



• SMO

o Changing the complexity parameter did not yield better results than the default value of 1.

o With default settings in Weka, the summary looks as follows:

    Correctly Classified Instances 8295 84.9293 %

    Incorrectly Classified Instances 1473 15.0707 %

    Kappa statistic 0.4524

    Mean absolute error 0.1787

    Root mean squared error 0.3302

    Relative absolute error 49.2119 %

    Root relative squared error 87.7545 %

    Total Number of Instances 9768
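For completeness, the Weka runs above can be reproduced through Weka's Java API rather than the GUI. The sketch below loads the data, makes an 80/20 split and evaluates the three classifiers with their default settings; the file name and the particular way of splitting (one fold of a randomized 5-fold partition held out) are my assumptions, so the exact figures will differ slightly from those reported.

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AdultEvaluation {
    public static void main(String[] args) throws Exception {
        // Load the census data set (ARFF conversion assumed) and set the income class attribute.
        Instances data = DataSource.read("adult.arff");
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));

        // 80/20 split: train on four of five folds, validate on the remaining one.
        Instances train = data.trainCV(5, 4);
        Instances test = data.testCV(5, 4);

        Classifier[] models = { new NaiveBayes(), new J48(), new SMO() };   // default settings
        for (Classifier model : models) {
            model.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(model, test);
            System.out.println("=== " + model.getClass().getSimpleName() + " ===");
            System.out.println(eval.toSummaryString());        // accuracy, kappa, error measures
            System.out.println(eval.toClassDetailsString());   // per-class TP/FP rate, precision, recall
            System.out.println(eval.toMatrixString());         // confusion matrix
        }
    }
}
```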

5 Conclusions

After building and testing all the various models, the J48 decision tree algorithm yielded the best results, with an accuracy of 86.1%, approached only by naive Bayes at 85.37% with the kernel estimator turned off. The weakest results were given by the two neural networks built with NeuroShell2, of which the Backpropagation with Standard Connections network outperformed the other. To be fair, however, these last two were trained on a slightly different data set, as described in the preprocessing section.

References

[BEC96] Census Income Data Set, Ronny Kohavi and Barry Becker, 1996, http://archive.ics.uci.edu/ml/datasets/Census+Income

[HAR07] Mining Information from US Census Bureau Data, Sebastian Harvey, 2007, http://research.omegasoft.co.uk/publications/Mining%20Information%20from%20US%20Census%20Bureau%20Data.pdf

[NEU**] Backpropagation Architecture Standard Connections, NeuroShell2 Help, http://www.wardsystems.com/manuals/neuroshell2/index.html?probackproparchstandard.htm

[SHO07] Chris Shoemaker, 10 March 2007, www.cs.wpi.edu/~cs4341/C00/Projects/fnlwgt/

[WIL**] Survey Data, Graham Williams, http://www.togaware.com/datamining/survivor/Survey_Data.html

[WIK01] Artificial Neural Networks, Wikipedia, http://en.wikipedia.org/wiki/Artificial_neural_network

[WIK02] Support Vector Machine, Wikipedia, http://en.wikipedia.org/wiki/Support_vector_machine

[WIK03] Naive Bayes Classifier, Wikipedia, http://en.wikipedia.org/wiki/Naive_Bayes_classifier

[WIK04] C4.5 Algorithm, Wikipedia, http://en.wikipedia.org/wiki/C4.5_algorithm

[WIK05] Backpropagation, Wikipedia, http://en.wikipedia.org/wiki/Backpropagation