business systems intelligence: 5. classification 2

Post on 27-Jan-2016


DESCRIPTION

Dr. Brian Mac Namee (www.comp.dit.ie/bmacnamee). Business Systems Intelligence: 5. Classification 2. Acknowledgments. These notes are based (heavily) on those provided by the authors to accompany “Data Mining: Concepts & Techniques” by Jiawei Han and Micheline Kamber

TRANSCRIPT

Business Systems Intelligence:

5. Classification 2

Dr. Brian Mac Namee (www.comp.dit.ie/bmacnamee)

Acknowledgments

These notes are based (heavily) on those provided by the authors to accompany “Data Mining: Concepts & Techniques” by Jiawei Han and Micheline Kamber

Some slides are also based on trainer’s kits provided by

More information about the book is available at: www-sal.cs.uiuc.edu/~hanj/bk2/

And information on SAS is available at: www.sas.com

Classification & Prediction

Today we will look at:

– What are classification & prediction?
– Issues regarding classification and prediction
– Classification techniques:
  • Case based reasoning (k-nearest neighbour algorithm)
  • Decision tree induction
  • Bayesian classification
  • Neural networks
  • Support vector machines (SVM)
  • Classification based on concepts from association rule mining
  • Other classification methods
– Prediction
– Classification accuracy

Classification

Classification:

– Predicts categorical class labels

Typical Applications

– {CreditHistory, Salary} -> CreditApproval (Yes/No)
– {Temp, Humidity} -> Rain (Yes/No)

Mathematically

$$x \in X = \{0,1\}^n, \quad y \in Y = \{0,1\}$$

$$h : X \rightarrow Y$$

$$y = h(x)$$

Linear Classification

Binary Classification problem

The data above the red line belongs to class ‘x’

The data below the red line belongs to class ‘o’

Examples – SVM, Perceptron, Probabilistic Classifiers

[Figure: scatter plot of ‘x’ and ‘o’ data points separated by a red line]

Discriminative Classifiers

Advantages

– Prediction accuracy is generally high
– Robust, works when training examples contain errors
– Fast evaluation of the learned target function

Criticism

– Long training time
– Difficult to understand the learned function (weights)
– Not easy to incorporate domain knowledge

Artificial Neural Networks

A biologically inspired classification technique

Formed from interconnected layers of simple artificial neurons

ANN history:

– 1943: McCulloch & Pitts
– 1959: Rosenblatt (Perceptron)
– 1959: Widrow & Hoff (ADALINE and MADALINE)
– 1969: Marvin Minsky and Seymour Papert's
– 1974: Werbos (Backprop)
– 1982: John Hopfield

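An Artificial Neuron

The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping

[Figure: an artificial neuron with inputs x0, x1, …, xn, weights w0, w1, …, wn, a bias, and output f(x)]

$$f(x) = \mathrm{thresh}\left(\mathit{bias} + \sum_{i=0}^{n} w_i \cdot x_i\right)$$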
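The neuron's mapping can be sketched in a few lines of Python. This is only an illustration: the step form of `thresh` (fire when the net input is non-negative) and the AND weights are assumptions, not taken from the slides.

```python
from typing import Sequence


def thresh(net_input: float) -> int:
    """Step threshold: output 1 when the net input is non-negative (assumed convention)."""
    return 1 if net_input >= 0 else 0


def neuron_output(weights: Sequence[float], inputs: Sequence[float], bias: float) -> int:
    """Compute thresh(bias + sum of w_i * x_i) for one artificial neuron."""
    net_input = bias + sum(w * x for w, x in zip(weights, inputs))
    return thresh(net_input)


# Hypothetical weights and bias that make the neuron behave like a logical AND
and_weights = [1.0, 1.0]
and_bias = -1.5

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, neuron_output(and_weights, x, and_bias))
```

Only the input (1, 1) pushes the net input above zero, so only that case fires.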


ANN: Multi-Layer Perceptrons (MLPs)

Multi Layer Perceptrons (MLPs) are one of the best known ANN types

Composed of layers of fully interconnected artificial neurons

Training involves repeatedly presenting a series of training cases to the network and adjusting neurons’ weights and biases to minimise classification error

Typically the backpropagation of error algorithm is used for training

MLP Example

Remember our surfing example

An MLP can be built and trained to perform classification for this problem

[Figure: an MLP with an input layer (Wind Speed, Wind Direction, Temperature, Wave Size, Wave Period), a hidden layer, and an output layer (Good Surf)]

Network Training

The ultimate objective of training

– Obtain a set of weights that classifies almost all of the tuples in the training data correctly

Steps

– Initialize weights with random values
– Feed the input tuples into the network one by one
– For each unit
  • Compute the net input to the unit as a linear combination of all the inputs to the unit
  • Compute the output value using the activation function
  • Compute the error
  • Update the weights and the bias
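The steps above can be sketched for a single unit. Note that this sketch uses the simple perceptron update rule rather than backpropagation through several layers; the learning rate, the epoch limit and the AND training data are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]


def step(net_input: float) -> int:
    """Activation function: a step threshold (assumed for this sketch)."""
    return 1 if net_input >= 0 else 0


def predict(weights: Sequence[float], bias: float, inputs: Sequence[float]) -> int:
    """Net input as a linear combination of the inputs, then the activation."""
    return step(bias + sum(w * x for w, x in zip(weights, inputs)))


def train_unit(samples: List[Sample], learning_rate: float = 0.1,
               max_epochs: int = 500, seed: int = 0) -> Tuple[List[float], float]:
    rng = random.Random(seed)
    # Initialize weights (and bias) with random values
    weights = [rng.uniform(-0.5, 0.5) for _ in samples[0][0]]
    bias = rng.uniform(-0.5, 0.5)
    for _ in range(max_epochs):
        mistakes = 0
        # Feed the input tuples into the unit one by one
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            if error != 0:
                mistakes += 1
                # Update the weights and the bias in proportion to the error
                weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
                bias += learning_rate * error
        if mistakes == 0:  # every training tuple is classified correctly
            break
    return weights, bias


# Hypothetical training data: the logical AND of two inputs
and_samples: List[Sample] = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_unit(and_samples)
print([predict(weights, bias, x) for x, _ in and_samples])
```

Because AND is linearly separable, the perceptron rule is guaranteed to stop making mistakes after a finite number of updates; a full MLP would repeat the same compute-error-and-update cycle for every unit in every layer.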

Summary of ANN Classification

Strengths

– Fast classification
– Very good generalization capacity

Weaknesses

– No explanation capability – black box
– Training can be slow – eager learning
– Retraining is difficult

Lots of other network types, but MLP is probably the most common

Support Vector Machines (SVM)

In classification problems we try to create decision boundaries between classes

A choice must be made between possible boundaries

[Figure: data points from Class 1 and Class 2 with possible decision boundaries]

SVMs (cont…)

The decision boundary should be as far away from the data of both classes as possible

[Figure: decision boundary between Class 1 and Class 2 with margin m]

Margins

[Figure: support vectors shown for a small margin and for a large margin]

Linear Support Vector Machine

Given a set of points $x_i \in \mathbb{R}^n$ with labels $y_i \in \{-1, +1\}$

The SVM finds a hyperplane defined by the pair (w, b), where w is the normal to the plane and b determines its distance from the origin

$$y_i (w \cdot x_i + b) \geq 1, \quad i = 1, \ldots, N$$

Where:

• x - feature vector
• b - bias, y - class label
• 2/||w|| - margin
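As a small illustration of the linear SVM constraint, the sketch below checks whether a given pair (w, b) satisfies y_i (w · x_i + b) ≥ 1 for every point and reports the margin width 2/||w||. It does not solve the optimisation problem; the hyperplane and the data points are made-up values.

```python
import math
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def satisfies_constraints(w, b, points, labels) -> bool:
    """True when y_i * (w . x_i + b) >= 1 holds for every training point."""
    return all(y * (dot(w, x) + b) >= 1 for x, y in zip(points, labels))


def margin_width(w: Sequence[float]) -> float:
    """Distance between the two margin hyperplanes: 2 / ||w||."""
    return 2 / math.sqrt(dot(w, w))


# Hypothetical hyperplane and data: the plane x1 = 0 separates the two classes
w, b = (1.0, 0.0), 0.0
points = [(1.0, 0.0), (2.0, 1.0), (-1.0, 0.0), (-3.0, 2.0)]
labels = [1, 1, -1, -1]

print(satisfies_constraints(w, b, points, labels), margin_width(w))
```

The points (1, 0) and (-1, 0) meet the constraint with equality, so in this toy example they play the role of support vectors.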

SVMs: The Clever Bit!

What about when classes are not linearly separable?

Kernel functions and the kernel trick are used to transform data into a different linearly separable feature space

[Figure: a mapping that takes data points from the input space to the feature space]

SVMs: The Clever Bit! (cont...)

What if the data is not linearly separable?

Project the data to a high dimensional space where it is linearly separable and then we can use a linear SVM (using kernels)

[Figure: one-dimensional points at -1, 0 and +1 labelled + and -, and their projection onto the two-dimensional points (1,0), (0,0) and (0,1)]
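The idea of projecting into a higher dimensional space can be illustrated with a toy case. The labelling (+ at -1 and +1, - at 0) and the feature map x -> (x, x²) are assumptions chosen for this sketch and are not necessarily the mapping drawn on the slide.

```python
from typing import List, Sequence, Tuple


def feature_map(x: float) -> Tuple[float, float]:
    """Project a 1-D point into 2-D (assumed mapping: x -> (x, x^2))."""
    return (x, x * x)


def separable_by_threshold(xs: Sequence[float], ys: Sequence[int]) -> bool:
    """Try every single cut point on the line, in both orientations."""
    ordered = sorted(xs)
    cuts = [ordered[0] - 1] + [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    cuts.append(ordered[-1] + 1)
    for cut in cuts:
        for sign in (1, -1):
            if all((sign if x > cut else -sign) == y for x, y in zip(xs, ys)):
                return True
    return False


def linear_classify(w: Sequence[float], b: float, z: Sequence[float]) -> int:
    return 1 if w[0] * z[0] + w[1] * z[1] + b >= 0 else -1


points = [-1.0, 0.0, 1.0]
labels = [1, -1, 1]

# In the input space no single threshold separates the classes ...
print(separable_by_threshold(points, labels))

# ... but in the feature space the line x^2 = 0.5 does
w, b = (0.0, 1.0), -0.5
predictions: List[int] = [linear_classify(w, b, feature_map(x)) for x in points]
print(predictions)
```

A kernel function lets an SVM work with such a mapping implicitly, by computing inner products in the feature space without building the projected vectors.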

SVM Example

[Figure: example of a non-linear SVM]

SVM Example (cont…)

[Figure: results]

Summary of SVM Classification

Strengths

– Over-fitting is not common
– Works well with high dimensional data
– Fast classification
– Good generalization capacity

Weaknesses

– Retraining is difficult
– No explanation capability
– Slow training

At the cutting edge of machine learning

SVM vs. ANN

SVM

– Relatively new concept
– Nice generalization properties
– Hard to learn – learned in batch mode using quadratic programming techniques
– Using kernels can learn very complex functions

ANN

– Quite old
– Generalizes well but doesn’t have strong mathematical foundation
– Can easily be learned in incremental fashion
– To learn complex functions – use multilayer perceptron (not that trivial)

SVM Related Links

http://svm.dcs.rhbnc.ac.uk/

http://www.kernel-machines.org/

C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2), 1998.

SVMlight – Software (in C) http://ais.gmd.de/~thorsten/svm_light

BOOK: An Introduction to Support Vector Machines, N. Cristianini and J. Shawe-Taylor, Cambridge University Press

Association-Based Classification

Several methods for association-based classification

– ARCS: Quantitative association mining and clustering of association rules (Lent et al’97)
  • It beats C4.5 in (mainly) scalability and also accuracy
– Associative classification: (Liu et al’98)
  • It mines high support and high confidence rules in the form of “cond_set => y”, where y is a class label
– CAEP (Classification by aggregating emerging patterns) (Dong et al’99)
  • Emerging patterns (EPs): the itemsets whose support increases significantly from one class to another
  • Mine EPs based on minimum support and growth rate

What Is Prediction?

Prediction is similar to classification

– First, construct a model
– Second, use model to predict unknown value
  • Major method for prediction is regression
    – Linear and multiple regression
    – Non-linear regression

Prediction is different from classification

– Classification refers to predicting categorical class labels
– Prediction models continuous-valued functions

Regression Analysis and Log-Linear Models in Prediction

Linear regression: $Y = \alpha + \beta X$

– Two parameters, $\alpha$ and $\beta$, specify the line and are to be estimated by using the data at hand
– They are estimated by applying the least squares criterion to the known values of $Y_1, Y_2, \ldots, X_1, X_2, \ldots$

Multiple regression: $Y = b_0 + b_1 X_1 + b_2 X_2$

– Many nonlinear functions can be transformed into the above

Log-linear models:

– The multi-way table of joint probabilities is approximated by a product of lower-order tables
– Probability: $p(a, b, c, d) = \alpha_{ab} \beta_{ac} \chi_{ad} \delta_{bcd}$
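For the simple linear regression case, the least squares estimates of the two line parameters have a closed form. The sketch below computes them in plain Python; the function name and the data values are made up for illustration.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least squares estimates (intercept, slope) for the line Y = intercept + slope * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data lying exactly on the line Y = 1 + 2X
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
print(intercept + slope * 5.0)  # predicted value for an unseen X
```

Once the two parameters are estimated, prediction is just evaluating the line at a new X value.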

Prediction: Numerical Data

Prediction: Categorical Data


Concerns Over Classification Techniques

When choosing a technique for a specific classification problem we must consider the following issues:

– Classification accuracy
– Training speed
– Classification speed
– Danger of over-fitting
– Generalisation capacity
– Implications for retraining
– Explanation capability

Evaluating Classification Accuracy

During development, and in testing before deploying a classifier in the wild, we need to be able to quantify the performance of the classifier

– How accurate is the classifier?
– When the classifier is wrong, how is it wrong?

Useful to decide on which classifier (which parameters) to use and to estimate what the performance of the system will be

Evaluating Classifiers (cont…)

How we do this depends on how much data is available

If there is unlimited data available then there is no problem

Usually we have less data than we would like so we have to compromise

– Use hold-out testing sets
– Cross validation
  • K-fold cross validation
  • Leave-one-out validation
– Parallel live test

Hold-Out Testing Sets

Split the available data into a training set and a test set

Train the classifier on the training set and evaluate based on the test set

A couple of drawbacks

– We may not have enough data
– We may happen upon an unfortunate split

[Figure: the total number of available examples split into a training set and a test set]
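A hold-out split can be sketched as a shuffle followed by a cut. The 30% test fraction and the fixed seed are arbitrary choices for this illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def hold_out_split(examples: Sequence[T], test_fraction: float = 0.3,
                   seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the examples, then cut it into (training set, test set)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


examples = list(range(10))  # stand-in for ten labelled examples
training_set, test_set = hold_out_split(examples)
print(len(training_set), len(test_set))
```

Shuffling before the cut guards against any ordering in the data, but a single random split can still be an unfortunate one, which is what motivates cross validation.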

K-Fold Cross Validation

Divide the entire data set into k folds

For each of k experiments, use kth fold for testing and everything else for training

[Figure: the total number of available examples divided into folds, with a different fold used as the test set in each experiment (K = 0, K = 1, K = 2, K = 3)]

K-Fold Cross Validation (cont…)

The accuracy of the system is calculated as the average accuracy across the k folds

The main advantages of k-fold cross validation are that every example is used in testing at some stage and the problem of an unfortunate split is avoided

Any value can be used for k

– 10 is most common
– Depends on the data set
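K-fold cross validation can be sketched as below. The base learner (`train_threshold`, a midpoint-of-class-means rule on one-dimensional data) and the data set are hypothetical stand-ins; any training function that returns a classifier could be passed in.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def k_fold_splits(n_examples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training indices, test indices) for each of the k folds."""
    indices = list(range(n_examples))
    sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        yield indices[:start] + indices[start + size:], indices[start:start + size]
        start += size


def train_threshold(xs: Sequence[float], ys: Sequence[int]) -> Callable[[float], int]:
    """Hypothetical learner; assumes both classes (0 and 1) occur in the training data."""
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    cut = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > cut else 0


def cross_validate(xs, ys, k, train_fn) -> float:
    """Average accuracy over k experiments, each testing on a different fold."""
    accuracies = []
    for train_idx, test_idx in k_fold_splits(len(xs), k):
        model = train_fn([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        correct = sum(model(xs[i]) == ys[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


xs = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(cross_validate(xs, ys, 5, train_threshold))
```

Setting k to the number of examples turns the same routine into leave-one-out cross validation. In practice the data would normally be shuffled before the folds are cut.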

Leave-One-Out Cross Validation

Extreme case of k-fold cross validation

With N data examples perform N experiments with N-1 training cases and 1 test case

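[Figure: the total number of available examples, with a single different example held out as the test case in each experiment (K = 0, K = 1, K = 2, …, K = N)]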

Classifier Accuracy

The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier

– Often also referred to as recognition rate
– Error rate (or misclassification rate) is the complement of accuracy (1 - accuracy)


False Positives Vs False Negatives

While it is useful to generate the simple accuracy of a classifier, sometimes we need more

When is the classifier wrong?

– False positives vs false negatives
– Related to type I and type II errors in statistics

Often there is a different cost associated with false positives and false negatives

– Think about diagnosing diseases

Confusion Matrix

Device used to illustrate how a classifier is performing in terms of false positives and false negatives

Gives us more information than a single accuracy figure

Allows us to think about the cost of mistakes

Can be extended to any number of classes

                                  Classifier Result
                                  Class A (yes)   Class B (no)
Expected Result   Class A (yes)   tp              fn
                  Class B (no)    fp              tn
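The four cells of a two-class confusion matrix can be counted directly from the expected and predicted labels. In this sketch the label strings and the example results are made up, and Class A ("yes") is treated as the positive class.

```python
from typing import Dict, Sequence


def confusion_counts(expected: Sequence[str], predicted: Sequence[str],
                     positive: str = "yes") -> Dict[str, int]:
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for e, p in zip(expected, predicted):
        if e == positive and p == positive:
            counts["tp"] += 1      # expected yes, classified yes
        elif e == positive:
            counts["fn"] += 1      # expected yes, classified no
        elif p == positive:
            counts["fp"] += 1      # expected no, classified yes
        else:
            counts["tn"] += 1      # expected no, classified no
    return counts


# Hypothetical test-set results
expected = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
predicted = ["yes", "no", "yes", "no", "yes", "no", "no", "yes"]

counts = confusion_counts(expected, predicted)
accuracy = (counts["tp"] + counts["tn"]) / len(expected)
print(counts, accuracy)
```

The simple accuracy figure uses only the diagonal (tp and tn); the off-diagonal cells show how the classifier is wrong.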

Other Accuracy Measures

Sometimes a simple accuracy measure is not enough

$$\mathit{sensitivity} = \frac{t\_pos}{pos}$$

$$\mathit{specificity} = \frac{t\_neg}{neg}$$

$$\mathit{precision} = \frac{t\_pos}{t\_pos + f\_pos}$$
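These three measures follow directly from the confusion matrix counts. The counts used in the sketch below are invented for illustration.

```python
def sensitivity(t_pos: int, pos: int) -> float:
    """Fraction of the positive examples that were classified as positive."""
    return t_pos / pos


def specificity(t_neg: int, neg: int) -> float:
    """Fraction of the negative examples that were classified as negative."""
    return t_neg / neg


def precision(t_pos: int, f_pos: int) -> float:
    """Fraction of the examples classified as positive that really are positive."""
    return t_pos / (t_pos + f_pos)


# Hypothetical counts: 50 positive and 50 negative test examples
t_pos, f_neg = 40, 10
t_neg, f_pos = 45, 5
pos = t_pos + f_neg
neg = t_neg + f_pos

print(sensitivity(t_pos, pos), specificity(t_neg, neg), precision(t_pos, f_pos))
```

In a disease diagnosis setting, for example, sensitivity is the measure to watch when missing a true case is the more costly mistake.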

ROC Curves

Receiver Operating Characteristic (ROC) curves were originally used to make sense of noisy radio signals

Can be used to help us talk about classifier performance and determine the best operating point for a classifier

ROC Curves (cont…)

[Figure: ROC curve plotting True Positives (0 to 1.0) against False Positives (0 to 1.0)]

For some great ROC curve examples have a look here

Consider how the relationship between true positives and false positives can change

We need to choose the best operating point

ROC Curves (cont…)

[Figure: ROC curves plotting True Positives (0 to 1.0) against False Positives (0 to 1.0)]

ROC curves can be used to compare classifiers

The greater the area under the curve the more accurate the classifier
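One way to build an ROC curve is to rank the test examples by classifier score and lower the decision threshold one example at a time; the area under the curve then follows from the trapezoid rule. The sketch assumes every score is distinct, and the scores and labels are made-up values.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points; labels are 1 or 0."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lowering the threshold past each example turns it into a predicted positive
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoid rule over consecutive ROC points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical classifier scores (higher means "more likely positive")
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print(points)
print(area_under_curve(points))
```

Each point on the curve is a candidate operating point: a particular threshold with its own trade-off between true positives and false positives.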

Over-Fitting

When we train a classifier we are trying to learn a function approximated by the training data we happen to use

– What if the training data doesn’t cover the whole problem space?

We can learn the training data too closely which hampers the ability to generalise

This problem is known as overfitting

Depending on the type of classifier used there are different approaches to avoiding this

Ensembles

In order to improve classification accuracy we can aggregate the results of an ensemble of classifiers

[Figure: Classifier 0, Classifier 1, …, Classifier n feeding into an Aggregation step]

Bagging

Given a set S of s samples

Generate a bootstrap sample T from S

– Cases in S may not appear in T or may appear more than once

Repeat this sampling procedure, getting a sequence of k independent training sets

A corresponding sequence of classifiers C1,C2,…,Ck is constructed for each of these training sets, by using the same classification algorithm

Bagging (cont…)

To classify an unknown sample X, let each classifier predict or vote

The Bagged Classifier C* counts the votes and assigns X to the class with the “most” votes
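Bagging can be sketched as bootstrap sampling plus majority voting. The base learner here (`train_threshold`, a midpoint-of-class-means rule on one-dimensional data), the data set and the choice of k = 11 are assumptions for illustration; any classification algorithm could take its place.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]  # (feature value, class label 0 or 1)


def bootstrap_sample(samples: Sequence[Example], rng: random.Random) -> List[Example]:
    """Draw len(samples) cases with replacement, so cases may repeat or be missing."""
    return [rng.choice(samples) for _ in samples]


def train_threshold(sample: Sequence[Example]) -> Callable[[float], int]:
    """Hypothetical base learner: cut halfway between the two class means."""
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    if not zeros or not ones:  # bootstrap sample happened to contain one class only
        only = 1 if ones else 0
        return lambda x: only
    cut = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > cut else 0


def bagged_classifier(samples: Sequence[Example], train_fn, k: int,
                      seed: int = 0) -> Callable[[float], int]:
    """Train k classifiers on k bootstrap samples and combine them by voting."""
    rng = random.Random(seed)
    classifiers = [train_fn(bootstrap_sample(samples, rng)) for _ in range(k)]

    def classify(x: float) -> int:
        votes = Counter(classifier(x) for classifier in classifiers)
        return votes.most_common(1)[0][0]  # class with the most votes

    return classify


data = [(1, 0), (2, 0), (3, 0), (4, 0), (10, 1), (11, 1), (12, 1), (13, 1)]
classify = bagged_classifier(data, train_threshold, k=11)
print(classify(0.0), classify(20.0))
```

Using an odd number of classifiers avoids tied votes in a two-class problem.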

Boosting Technique: Algorithm

Assign every example an equal weight 1/N

For t = 1, 2, …, T Do

– Obtain a hypothesis (classifier) h(t) under w(t)
– Calculate the error of h(t) and re-weight the examples based on the error. Each classifier is dependent on the previous ones. Samples that are incorrectly predicted are weighted more heavily
– Normalize w(t+1) to sum to 1 (weights assigned to different classifiers sum to 1)

Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set
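The loop above can be sketched with an AdaBoost-style update. The slides do not name a specific boosting variant, so the exponential re-weighting formula, the decision stump base learner and the toy data are all assumptions made for this illustration.

```python
import math
from typing import List, Sequence, Tuple

Stump = Tuple[float, float, int]  # (vote weight, cut point, polarity)


def stump_predict(cut: float, polarity: int, x: float) -> int:
    return polarity if x > cut else -polarity


def train_stump(xs, ys, weights) -> Tuple[float, float, int]:
    """Find the cut and polarity with the lowest weighted error under the current weights."""
    ordered = sorted(set(xs))
    cuts = [ordered[0] - 1] + [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    best = None
    for cut in cuts:
        for polarity in (1, -1):
            error = sum(w for w, x, y in zip(weights, xs, ys)
                        if stump_predict(cut, polarity, x) != y)
            if best is None or error < best[0]:
                best = (error, cut, polarity)
    return best


def boost(xs: Sequence[float], ys: Sequence[int], rounds: int) -> List[Stump]:
    n = len(xs)
    weights = [1 / n] * n                       # every example starts with weight 1/N
    ensemble: List[Stump] = []
    for _ in range(rounds):
        error, cut, polarity = train_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)      # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)    # more accurate stumps get a bigger vote
        ensemble.append((alpha, cut, polarity))
        # Misclassified examples are weighted more heavily, then weights are normalised
        weights = [w * math.exp(-alpha * y * stump_predict(cut, polarity, x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble: Sequence[Stump], x: float) -> int:
    score = sum(alpha * stump_predict(cut, polarity, x) for alpha, cut, polarity in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical data that no single stump can classify perfectly
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = boost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this toy data any single stump misclassifies at least two examples, while the weighted vote of three stumps classifies all six correctly.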

Summary

Classification is an extensively studied problem

– Mainly in statistics and machine learning

Classification is probably one of the most widely used data mining techniques

Scalability is still an important issue for database applications

Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.

Questions?
