Classification, Decision Tree Lecture 7


Introducing Textual Data Visualization to Students in the Computational Mathematics Major

Outline
1. Overview of classification and decision trees
2. Algorithm to build a decision tree
3. Formulas to measure information
4. Weka, data preparation, and visualization

Illustration of the Classification Task
(Courtesy of Professor David Mease for the next 10 slides.)

[Figure: Training Set → Learning Algorithm → Model (induction); the Model is then applied to the Test Set (deduction).]

Classification: Definition

Given a collection of records (the training set), each record contains a set of attributes (x) plus one additional attribute, the class (y).

Find a model that predicts the class as a function of the values of the other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
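As a quick illustration of this workflow (not part of the original slides), here is a minimal Python sketch assuming scikit-learn is available; the lecture itself uses Weka, and the iris data set is just a stand-in for a labeled record set.

```python
# Minimal sketch of the train/test workflow described above (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)        # attributes x and class attribute y

# Divide the given data set into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Build the model on the training set ...
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# ... and use the test set of previously unseen records to estimate its accuracy.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```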

Classification Examples

Classifying credit card transactions as legitimate or fraudulent

Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil

Categorizing news stories as finance, weather, entertainment, sports, etc.

Predicting tumor cells as benign or malignant

Classification Techniques

There are many techniques/algorithms for carrying out classification

In this chapter we will study only decision trees

In Chapter 5 we will study other techniques, including some very modern and effective techniques

An Example of a Decision Tree

[Figure: the decision tree model built from the training data shown at the end of this transcript. Splitting attributes: Refund (categorical), MarSt (categorical), TaxInc (continuous); class label: Cheat. Refund = Yes → NO. Refund = No → MarSt. MarSt = Married → NO. MarSt = Single or Divorced → TaxInc. TaxInc < 80K → NO; TaxInc > 80K → YES.]

Applying the Tree Model to Predict the Class for a New Observation

Start from the root of the tree and follow the branch that matches each attribute value of the test record until a leaf is reached.

For the test record (Refund = No, Marital Status = Married, Taxable Income = 80K), the path ends at a leaf labeled NO, so the model assigns Cheat = No. (A small code sketch of this traversal appears after the characteristics list below.)

Decision Tree Characteristics
- Easy to understand; similar to the human decision process
- Handles both discrete and continuous features
- Simple, nonparametric classifier: no assumptions regarding probability distribution types
- Finding an optimal tree is NP-complete, but the greedy algorithms used in practice are computationally inexpensive
- Can represent arbitrarily complex decision boundaries
- Overfitting can be a problem
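To make the tree-application step concrete, here is a minimal sketch with the example tree hard-coded as nested conditions. The dictionary keys and the classify name are mine; the tree structure and the 80K threshold come from the slides.

```python
# The example tree hard-coded: Refund -> MarSt -> TaxInc, class label Cheat.
def classify(record):
    """Walk the tree from the root and return the predicted Cheat label."""
    if record["Refund"] == "Yes":
        return "No"                              # leaf NO
    if record["MaritalStatus"] == "Married":
        return "No"                              # leaf NO
    # Single or Divorced: test the continuous attribute
    return "No" if record["TaxableIncome"] < 80_000 else "Yes"

# Test record from the slides: Refund = No, Married, Taxable Income = 80K.
print(classify({"Refund": "No", "MaritalStatus": "Married", "TaxableIncome": 80_000}))
# -> No  (the tree assigns Cheat = No)
```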

2. Algorithm to Build Decision Trees

The tree is defined recursively:
- Select an attribute for the root node
- Use a greedy algorithm to create a branch for each possible value of the selected attribute
- Repeat recursively until either the instances or the attributes run out, or a predefined purity threshold is reached
- Use only branches that are reached by instances

Weather Example (9 yes / 5 no)

Splitting on Outlook gives three branches:

Outlook = Sunny (2 yes / 3 no):
Temp  Humidity  Windy  Play
Hot   high      FALSE  No
Hot   high      TRUE   No
Mild  high      FALSE  No
Cool  normal    FALSE  Yes
Mild  normal    TRUE   Yes

Outlook = Overcast (4 yes / 0 no):
Temp  Humidity  Windy  Play
Hot   high      FALSE  Yes
Cool  normal    TRUE   Yes
Mild  high      TRUE   Yes
Hot   normal    FALSE  Yes

Outlook = Rainy (3 yes / 2 no):
Temp  Humidity  Windy  Play
Mild  high      FALSE  Yes
Cool  normal    FALSE  Yes
Cool  normal    TRUE   No
Mild  normal    FALSE  Yes
Mild  high      TRUE   No

Decision Trees: Weather Example
[Figures: the corresponding candidate splits on Temperature (Hot / Mild / Cool), Humidity (High / Normal), and Windy (True / False).]

3. Information Measures

Selecting the attribute upon which to split requires a measure of purity/information:
- Entropy
- Gini index
- Classification error

A Graphical Comparison

[Figure: graphical comparison of the three measures.]

Entropy

Measures purity, similar to the Gini index

Used in C4.5

After the entropy is computed in each node, the overall entropy of the split is computed as the weighted average of the per-node entropies, just as with the Gini index

The decrease in Entropy is called information gain (page 160)

Entropy Examples for a Single Node

For a two-class node: Entropy = -P(C1) log2 P(C1) - P(C2) log2 P(C2).

P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = -0 log2(0) - 1 log2(1) = 0 - 0 = 0 (taking 0 log2 0 = 0)

P(C1) = 1/6, P(C2) = 5/6
Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65

P(C1) = 2/6, P(C2) = 4/6
Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92

3. Entropy: Calculating Information

All three measures are consistent with each other; we will use entropy as the example. The less the purity, the more bits are needed and the less the information.

With Outlook as the root:
Info[2, 3] = 0.971 bits (Sunny)
Info[4, 0] = 0.0 bits (Overcast)
Info[3, 2] = 0.971 bits (Rainy)
Total info = 5/14 × 0.971 + 4/14 × 0.0 + 5/14 × 0.971 = 0.693 bits

3. Selecting the Root Attribute

Initial info = Info[9, 5] = 0.940 bits
Gain(outlook) = 0.940 - 0.693 = 0.247 bits
Gain(temperature) = 0.029 bits
Gain(humidity) = 0.152 bits
Gain(windy) = 0.048 bits

So select Outlook as the root attribute for splitting.
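A short plain-Python sketch (the helper names are my own) that reproduces these numbers from the weather data listed at the end of this transcript:

```python
from collections import Counter
from math import log2

# (outlook, play) pairs for the 14 weather instances.
data = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
        ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
        ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
        ("overcast", "yes"), ("rainy", "no")]

def info(labels):
    """Entropy in bits: -sum p_i log2 p_i over the class proportions."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

initial = info([play for _, play in data])            # Info[9, 5], about 0.940 bits

# Weighted entropy after splitting on Outlook.
split = 0.0
for v in ("sunny", "overcast", "rainy"):
    subset = [play for outlook, play in data if outlook == v]
    split += len(subset) / len(data) * info(subset)   # Info[2,3], Info[4,0], Info[3,2]

print(f"{initial:.3f} {split:.3f} {initial - split:.3f}")
# about 0.940, 0.694 and gain about 0.247 (the slide truncates the split entropy to 0.693)
```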

[Figures (two slides): the resulting decision tree with Outlook at the root: Sunny → Humidity (High, Normal); Overcast; Rainy → Windy (FALSE, TRUE). The second slide highlights a contradicted training example.]

3. Hunt's Algorithm

Many algorithms use a version of a top-down or divide-and-conquer approach known as Hunt's algorithm (page 152):
- Let Dt be the set of training records that reach a node t.
- If Dt contains records that all belong to the same class yt, then t is a leaf node labeled yt.
- If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
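Below is a compact sketch of this recursion. The function names are my own, and it assumes categorical attributes, a greedy entropy-based choice of the test attribute as in the previous section, and a majority-class leaf when attributes run out.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def hunts(records, attributes):
    """records: list of (attribute_dict, class_label); returns a nested-dict tree."""
    labels = [y for _, y in records]
    if len(set(labels)) == 1:                 # Dt is pure -> leaf labeled with that class
        return labels[0]
    if not attributes:                        # no attributes left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]

    def weighted_entropy(attr):
        total = 0.0
        for value in {x[attr] for x, _ in records}:
            subset = [y for x, y in records if x[attr] == value]
            total += len(subset) / len(records) * entropy(subset)
        return total

    best = min(attributes, key=weighted_entropy)      # greedy: maximize information gain
    tree = {"split_on": best, "branches": {}}
    for value in {x[best] for x, _ in records}:       # only branches reached by instances
        subset = [(x, y) for x, y in records if x[best] == value]
        tree["branches"][value] = hunts(subset, [a for a in attributes if a != best])
    return tree
```

Run on the weather data with the attributes Outlook, Temperature, Humidity, and Windy, this should reproduce the tree sketched above, with Outlook at the root.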

An Example of Hunt's Algorithm

[Figure: the tree grows in steps. Step 1: a single leaf, Don't Cheat. Step 2: split on Refund (Yes → Don't Cheat, No → Don't Cheat). Step 3: under Refund = No, split on Marital Status (Single, Divorced → Cheat; Married → Don't Cheat). Step 4: under Single/Divorced, split on Taxable Income (< 80K → Don't Cheat, >= 80K → Cheat).]

How to Apply Hunt's Algorithm

Usually it is done in a greedy fashion.

Greedy means that the optimal split is chosen at each stage according to some criterion.

The resulting tree may not be globally optimal in the end, even for the same criterion.

However, the greedy approach is computationally efficient, so it is popular.

Using the greedy approach, we still have to decide three things:
#1) What attribute test conditions to consider
#2) What criterion to use to select the best split
#3) When to stop splitting

For #1 we will consider only binary splits for both numeric and categorical predictors as discussed on the next slide
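The slide describing those binary splits is not in this transcript, so as an illustration only, here is a sketch of one common way to pick a binary split on a numeric predictor: scan candidate thresholds over the Taxable Income values of the example training data and keep the one with the lowest weighted Gini index. The helper names are mine.

```python
# Try a threshold midway between each pair of adjacent sorted values and keep
# the split with the lowest weighted Gini index.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]                    # in K
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

pairs = sorted(zip(income, cheat))
best = None
for (lo, _), (hi, _) in zip(pairs, pairs[1:]):
    t = (lo + hi) / 2                                  # candidate threshold
    left  = [y for x, y in pairs if x <= t]
    right = [y for x, y in pairs if x > t]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
    if best is None or weighted < best[0]:
        best = (weighted, t)

print(best)   # -> (0.3, 97.5): splitting between 95K and 100K gives the lowest weighted Gini
```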

For #2 we will consider misclassification error, Gini index and entropy

#3 is a subtle business involving model selection. It is tricky because we don't want to overfit or underfit.

Misclassification Error

Misclassification error is usually our final metric, which we want to minimize on the test set, so there is a logical argument for using it as the split criterion.

It is simply the fraction of total cases misclassified

1 - Misclassification error = Accuracy (page 149)

Gini Index

This is commonly used in many algorithms like CART and the rpart() function in R

After the Gini index is computed in each node, the overall value of the Gini index is computed as the weighted average of the Gini index in each node

Gini Examples for a Single Node

P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0

P(C1) = 1/6, P(C2) = 5/6
Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278

P(C1) = 2/6, P(C2) = 4/6
Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444

Misclassification Error vs. Gini Index

[Figure: the parent node (7 records of C1, 3 of C2; Gini = 0.42) is split by the test A? into node N1 (Yes branch) and node N2 (No branch).]

Gini(N1) = 1 - (3/3)^2 - (0/3)^2 = 0
Gini(N2) = 1 - (4/7)^2 - (3/7)^2 = 0.490
Gini(Children) = 3/10 × 0 + 7/10 × 0.490 = 0.343

The Gini index decreases from 0.42 to 0.343, while the misclassification error stays at 30%. This illustrates why we often want to use a surrogate loss function like the Gini index even if we really only care about misclassification.
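A few lines of plain Python (my own helper names) that reproduce this comparison using the node counts from the figure:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def misclass_error(counts):
    return 1.0 - max(counts) / sum(counts)

parent = [7, 3]                 # C1 = 7, C2 = 3 records
n1, n2 = [3, 0], [4, 3]         # children of the A? split
w1, w2 = sum(n1) / sum(parent), sum(n2) / sum(parent)

print(round(gini(parent), 3), round(w1 * gini(n1) + w2 * gini(n2), 3))
# -> 0.42 0.343   (the Gini index decreases)
print(round(misclass_error(parent), 3), round(w1 * misclass_error(n1) + w2 * misclass_error(n2), 3))
# -> 0.3 0.3      (the misclassification error does not change)
```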

5. Discretization of Numeric Data

Data tables referenced in the slides

Training data for the decision tree example:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

The full weather data set:

ID  Outlook   Temperature  Humidity  Windy  Play
a   sunny     hot          high      FALSE  no
b   sunny     hot          high      TRUE   no
c   overcast  hot          high      FALSE  yes
d   rainy     mild         high      FALSE  yes
e   rainy     cool         normal    FALSE  yes
f   rainy     cool         normal    TRUE   no
g   overcast  cool         normal    TRUE   yes
h   sunny     mild         high      FALSE  no
i   sunny     cool         normal    FALSE  yes
j   rainy     mild         normal    FALSE  yes
k   sunny     mild         normal    TRUE   yes
l   overcast  mild         high      TRUE   yes
m   overcast  hot          normal    FALSE  yes
n   rainy     mild         high      TRUE   no
