
Chapter 7: Decision trees

Intelligent groupware 2007/2008

Bodo Naumann, 02.06.2008

Outline

• Introducing Decision trees
• Training of Decision trees
• Select the best argument
  – Gini-Index
  – Entropy
  – Gini-Index vs. Entropy
• Recursive tree building
• Display the tree
• Graphical Display
• Classifying new Observations
• Pruning the tree
• Dealing with Missing Data
• Dealing with numerical Results
• Modeling Home Prices
  – Zillow-API
• Hot or Not
• When to use Decision trees?


Example Goal

• Prediction of final membership levels
  – Sites tend to send mass emails to all users
    • Many users are just curious
      – and will probably not become members
  – Better: a goal-oriented strategy
    • Collect information through the server log
      » Don't ask the user annoying questions; get the facts from the log


Example Dataset
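The original slide shows the example table from the chapter. A minimal sketch of how that dataset might look in Python, reconstructed from the divideset output shown a few slides later (columns: referrer, location, read FAQ, pages viewed, chosen service):

# A few rows of the example dataset (treepredict.my_data); the last column
# is the outcome we want to predict (the membership level the user chose).
my_data = [
    ['slashdot',  'USA',         'yes', 18, 'None'],
    ['google',    'France',      'yes', 23, 'Premium'],
    ['digg',      'USA',         'yes', 24, 'Basic'],
    ['kiwitobes', 'France',      'yes', 23, 'Basic'],
    ['google',    'UK',          'no',  21, 'Premium'],
    ['(direct)',  'New Zealand', 'no',  12, 'None'],
    # ... remaining rows omitted
]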


Context

• Why do we need decision trees?
  – They are easy to interpret
• Bayes classifier
  – Gives us the importance of a word
    » Needs calculations to interpret
• Neural network
  – Even harder to interpret
    » The importance of the connections alone is not meaningful
• Decision trees can be interpreted just by looking at them


Introducing Decision trees

• One of the simpler machine learning methods
  – Looks like a series of if-then statements arranged in a tree
• Need an answer?
  – Follow the tree down
• Need a rationale?
  – Trace back up


Training of Decision trees

• We use the CART (Classification and Regression Trees) algorithm in this chapter
  – It starts by creating a root node and then keeps splitting
• Which attribute is the best one to split the dataset on?
• Function (a sketch follows below):
  – divideset(rows, column, value)
    • rows: the dataset that we want to split
    • column: the column we want to split the dataset on
    • value: the value we use to split the dataset on
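A minimal sketch of what divideset could look like, following the behaviour described above (numeric columns are split with a >= test, everything else with an equality test); the name matches the chapter's treepredict module, but the body here is a reconstruction:

def divideset(rows, column, value):
    # Choose a test: numeric columns split on >= value, all others on == value
    if isinstance(value, (int, float)):
        split_function = lambda row: row[column] >= value
    else:
        split_function = lambda row: row[column] == value

    # Divide the rows into two sets and return them
    set1 = [row for row in rows if split_function(row)]
    set2 = [row for row in rows if not split_function(row)]
    return (set1, set2)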


Training of Decision trees

• We call the function with
  – the dataset my_data
  – column 2, which tells us whether the user read the FAQ
  – the value 'yes', i.e. we split on "the user did read the FAQ"
• Both resulting sets are still mixed; we need a better way to split this dataset

>>> treepredict.divideset(treepredict.my_data,2,'yes')
([['slashdot', 'USA', 'yes', 18, 'None'], ['google', 'France', 'yes', 23, 'Premium'],
  ['digg', 'USA', 'yes', 24, 'Basic'], ['kiwitobes', 'France', 'yes', 23, 'Basic'],
  ['slashdot', 'France', 'yes', 19, 'None'], ['digg', 'New Zealand', 'yes', 12, 'Basic'],
  ['google', 'UK', 'yes', 18, 'Basic'], ['kiwitobes', 'France', 'yes', 19, 'Basic']],
 [['google', 'UK', 'no', 21, 'Premium'], ['(direct)', 'New Zealand', 'no', 12, 'None'],
  ['(direct)', 'UK', 'no', 21, 'Basic'], ['google', 'USA', 'no', 24, 'Premium'],
  ['digg', 'USA', 'no', 18, 'None'], ['google', 'UK', 'no', 18, 'None'],
  ['kiwitobes', 'UK', 'no', 19, 'None'], ['slashdot', 'UK', 'no', 21, 'None']])


Select the best argument

• Which attribute is the best one to split a dataset on?
  – We need a way to measure how mixed a specific dataset is
• First we count the occurrences of each unique outcome
  – uniquecounts(rows) (a sketch follows below)
    • rows: the dataset to examine
  – Result
    • None: 6
    • Premium: 3
    • Basic: 5
  – We need this dictionary for the following functions
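A minimal sketch of uniquecounts, assuming, as in the rest of the chapter, that the outcome is stored in the last column of each row:

def uniquecounts(rows):
    # Count how often each outcome (last column) appears in the dataset
    results = {}
    for row in rows:
        r = row[len(row) - 1]
        results[r] = results.get(r, 0) + 1
    return results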


Gini-Index

• The expected error rate if one of the results is randomly applied to one of the rows in the dataset


Gini-Index

def giniimpurity(rows):
    total = len(rows)
    counts = uniquecounts(rows)
    imp = 0
    for k1 in counts:
        p1 = float(counts[k1]) / total
        for k2 in counts:
            if k1 == k2: continue
            p2 = float(counts[k2]) / total
            imp += p1 * p2
    return imp

Walkthrough for None=6, Premium=3, Basic=5 (14 rows):

• p1 = 6/14 (None)
  – p2 = 3/14 (Premium), imp += 6/14 * 3/14
  – p2 = 5/14 (Basic), imp += 6/14 * 5/14
• p1 = 3/14 (Premium)
  – p2 = 6/14 (None), imp += 3/14 * 6/14
  – p2 = 5/14 (Basic), imp += 3/14 * 5/14
• …
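Summing these cross terms over every ordered pair of distinct outcomes is the same as computing 1 - sum of p(i)^2; with the counts used in the walkthrough above this gives 1 - (6/14)^2 - (3/14)^2 - (5/14)^2 ≈ 0.64, i.e. roughly a 64% chance that a randomly assigned label is wrong.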


Entropy

• Amount of disorder in a set
  – How mixed is the set?
• Formula (a sketch of the scoring function follows below)
  – p(i) = frequency(outcome) = count(outcome) / count(total rows)
  – Entropy = - sum of p(i) x log2(p(i)) over all outcomes
• Measures how different the outcomes are from each other
• The more mixed up a dataset is, the higher its entropy
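A minimal sketch of an entropy score function matching the formula above (base-2 logarithm, outcome in the last column); it reuses the uniquecounts helper:

from math import log

def entropy(rows):
    # Entropy = - sum over outcomes of p(i) * log2(p(i))
    results = uniquecounts(rows)
    ent = 0.0
    for r in results:
        p = float(results[r]) / len(rows)
        ent -= p * log(p, 2)
    return ent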


Gini-Index vs. Entropy

• Entropy
  – Peaks more slowly
    • So it penalizes mixed sets a bit more heavily
• We use entropy in the following because it is the more commonly used measure


Recursive tree building

• Calculate the entropy of the whole group
• Try to divide the group by each possible value of each attribute
• Calculate the entropy of the resulting new groups
• Calculate the information gain to find out which attribute is the best one to divide on
  – the difference between the current entropy and the weighted-average entropy of the two new groups
• The algorithm
  – calculates the information gain for every attribute
  – chooses the one with the highest information gain


Recursive tree building

• Find the root node
  – Create branches divided by true / false
• Find out whether each branch can be divided further
• A branch stops dividing when the information gain is no longer > 0
• Function (a sketch follows below):
  – buildtree(rows, scoref=entropy)
    • rows: the dataset
    • scoref: the scoring function (entropy by default)
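A minimal sketch of a node type and of buildtree, reconstructed from the procedure described above (greedy split on the attribute with the highest information gain, recursing until the gain is no longer positive); the class name decisionnode and its fields are assumptions here, not taken from the slides:

class decisionnode:
    def __init__(self, col=-1, value=None, results=None, tb=None, fb=None):
        self.col = col          # column index tested at this node
        self.value = value      # value the column is compared against
        self.results = results  # outcome counts (only set for leaf nodes)
        self.tb = tb            # branch followed when the test is true
        self.fb = fb            # branch followed when the test is false

def buildtree(rows, scoref=entropy):
    if len(rows) == 0:
        return decisionnode()
    current_score = scoref(rows)

    best_gain, best_criteria, best_sets = 0.0, None, None
    column_count = len(rows[0]) - 1   # last column is the outcome
    for col in range(column_count):
        # Try splitting on every distinct value of this column
        for value in set(row[col] for row in rows):
            set1, set2 = divideset(rows, col, value)
            p = float(len(set1)) / len(rows)
            # Information gain = current entropy - weighted entropy of the parts
            gain = current_score - p * scoref(set1) - (1 - p) * scoref(set2)
            if gain > best_gain and set1 and set2:
                best_gain, best_criteria, best_sets = gain, (col, value), (set1, set2)

    if best_gain > 0:
        return decisionnode(col=best_criteria[0], value=best_criteria[1],
                            tb=buildtree(best_sets[0], scoref),
                            fb=buildtree(best_sets[1], scoref))
    return decisionnode(results=uniquecounts(rows))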


Display the tree

• printtree(tree, indent='')
  – tree = treepredict.buildtree(treepredict.my_data)

>>> treepredict.printtree(tree)
0:google?
T-> 3:21?
  T-> {'Premium': 3}
  F-> 2:yes?
    T-> {'Basic': 1}
    F-> {'None': 1}
F-> 0:slashdot?
  T-> {'None': 3}
  F-> 2:yes?
    T-> {'Basic': 4}
    F-> 3:21?
      T-> {'Basic': 1}
      F-> {'None': 3}
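A minimal sketch of a printtree that produces output of the shape shown above, assuming the decisionnode fields used in the buildtree sketch earlier:

def printtree(tree, indent=''):
    if tree.results is not None:
        # Leaf node: print the outcome counts
        print(str(tree.results))
    else:
        # Print the test, then both branches with increased indentation
        print(str(tree.col) + ':' + str(tree.value) + '?')
        print(indent + 'T->', end=' ')
        printtree(tree.tb, indent + '  ')
        print(indent + 'F->', end=' ')
        printtree(tree.fb, indent + '  ')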


Graphical Display

• drawtree(tree, jpeg='tree.jpg')
  – tree: the tree we want to draw
  – jpeg: the name (and type) of the image file to write

>>> treepredict.drawtree(tree,jpeg='treeview.jpg')


Classifying new Observations

• Classify new observations according to the decision tree
  – Returns a prediction for a given data tuple
• Function (a sketch follows below):
  – classify(observation, tree)

>>> treepredict.classify(['(direct)','USA','yes',5],tree)
{'Basic': 4}
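A minimal sketch of classify for complete observations, again assuming the decisionnode fields from the buildtree sketch:

def classify(observation, tree):
    if tree.results is not None:
        # Reached a leaf: return its outcome counts
        return tree.results
    v = observation[tree.col]
    if isinstance(v, (int, float)):
        branch = tree.tb if v >= tree.value else tree.fb
    else:
        branch = tree.tb if v == tree.value else tree.fb
    return classify(observation, branch)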


Pruning the tree

• A tree can become overfitted
  – an overtrained tree
• It can give answers that are more certain than they really should be
• It keeps splitting the dataset because of small drops in entropy
  – caused by chance rather than by genuinely informative values


Pruning the tree

• Stopping splits at a minimum gain ("minimum delta") is one option
  – but if the entropies in a dataset are very close, a minimum delta can't handle those cases optimally
• Pruning (a sketch follows below)
  – Create the tree as before
  – Then prune the result
    • Merge branches back into their parent if their entropies are closer than a predefined delta
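A minimal sketch of a prune function along those lines (merge two leaves back into their parent when the entropy gained by keeping them apart is below mingain); the details are reconstructed, not taken from the slides:

def prune(tree, mingain):
    # First recurse into branches that are not leaves
    if tree.tb.results is None:
        prune(tree.tb, mingain)
    if tree.fb.results is None:
        prune(tree.fb, mingain)

    # If both sub-branches are now leaves, check whether they should be merged
    if tree.tb.results is not None and tree.fb.results is not None:
        tb, fb = [], []
        for v, c in tree.tb.results.items():
            tb += [[v]] * c
        for v, c in tree.fb.results.items():
            fb += [[v]] * c

        # Entropy reduction achieved by keeping the two leaves separate
        delta = entropy(tb + fb) - (entropy(tb) + entropy(fb)) / 2
        if delta < mingain:
            # Merge the leaves back into the parent node
            tree.tb, tree.fb = None, None
            tree.results = uniquecounts(tb + fb)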


Pruning the tree

>>> treepredict.prune(tree,0.1)
>>> treepredict.printtree(tree)
0:google?
T-> 3:21?
  T-> {'Premium': 3}
  F-> 2:yes?
    T-> {'Basic': 1}
    F-> {'None': 1}
F-> 0:slashdot?
  T-> {'None': 3}
  F-> 2:yes?
    T-> {'Basic': 4}
    F-> 3:21?
      T-> {'Basic': 1}
      F-> {'None': 3}

>>> treepredict.prune(tree,1.0)
>>> treepredict.printtree(tree)
0:google?
T-> 3:21?
  T-> {'Premium': 3}
  F-> 2:yes?
    T-> {'Basic': 1}
    F-> {'None': 1}
F-> {'None': 6, 'Basic': 5}


Dealing with Missing Data

• Decision trees can handle missing data
  – We need a new prediction function for that (a sketch follows below)
• If we have no information about which branch to follow
  – We follow both branches
    » Each branch gets a weight
    » based on the fraction of rows that fall into that branch
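A minimal sketch of mdclassify implementing the idea above (missing value: follow both branches and weight each result by the size of its branch); reconstructed, with the same decisionnode fields assumed as before:

def mdclassify(observation, tree):
    if tree.results is not None:
        return tree.results
    v = observation[tree.col]
    if v is None:
        # Missing value: classify down both branches and combine the results,
        # weighted by how many training rows ended up in each branch
        tr = mdclassify(observation, tree.tb)
        fr = mdclassify(observation, tree.fb)
        tcount = sum(tr.values())
        fcount = sum(fr.values())
        tw = float(tcount) / (tcount + fcount)
        fw = float(fcount) / (tcount + fcount)
        result = {}
        for k, c in tr.items():
            result[k] = c * tw
        for k, c in fr.items():
            result[k] = result.get(k, 0) + c * fw
        return result
    # Value present: descend exactly as classify does
    if isinstance(v, (int, float)):
        branch = tree.tb if v >= tree.value else tree.fb
    else:
        branch = tree.tb if v == tree.value else tree.fb
    return mdclassify(observation, branch)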


Dealing with Missing Data

• mdclassify(observation, tree)
• Observation format: [Referrer, Location, Read FAQ, Pages viewed]

>>> treepredict.mdclassify(['google',None,'yes',None],tree)
{'Premium': 2.25, 'Basic': 0.25}

>>> treepredict.mdclassify(['google','France',None,None],tree)
{'None': 0.125, 'Premium': 2.25, 'Basic': 0.125}


Dealing with numerical Results

• So far we used the Gini index and entropy to find an attribute to split on
  – Our outcomes were categories
  – The following examples return numerical results
  – We would have to group the values into categories to reuse those measures
• Instead, we can use variance as the score function and split the rows into "greater than" and "less than or equal" groups (a sketch follows below)
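A minimal sketch of such a variance score function, assuming as before that the (now numeric) outcome sits in the last column:

def variance(rows):
    # Variance of the numeric outcomes in the last column;
    # low variance means the outcomes in this set are close together
    if len(rows) == 0:
        return 0
    data = [float(row[len(row) - 1]) for row in rows]
    mean = sum(data) / len(data)
    return sum((d - mean) ** 2 for d in data) / len(data)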


Modeling Home Prices

• Decision trees are particularly useful if
  – there are lots of variables and
  – we are interested in the reasoning process


Zillow-API

• Zillow is a website that tracks real estate prices
  – It uses those prices to estimate the value of other houses
• getpricelist()
  – Collects data for some predefined addresses
  – We can build a tree from this data
  – We can then draw the decision tree


Zillow-API

>>> housedata=zillow.getpricelist()
>>> housetree=treepredict.buildtree(housedata,scoref=treepredict.variance)
>>> treepredict.drawtree(housetree,'housetree.jpg')

• The root node is the number of bathrooms in a house


Hot or Not

• On this site users can give each other ratings
  – The page has since changed into a dating site
• We get a list of users
>>> l1=hotornot.getrandomratings(500)
• We get some details about them
>>> pdata=hotornot.getpeopledata(l1)


Hot or Not

• We build, prune, and draw the tree
>>> hottree=treepredict.buildtree(pdata,scoref=treepredict.variance)
>>> treepredict.prune(hottree,0.5)
>>> treepredict.drawtree(hottree,'hottree.jpg')

• The best attribute to split the dataset on is gender
  – The rest of the tree is not easy to interpret


Hot or Not

• We can compare the hotness of people from the South with that of people from the Mid-Atlantic

>>> south=treepredict2.mdclassify((None,None,'South'),hottree)
>>> midat=treepredict2.mdclassify((None,None,'Mid Atlantic'),hottree)
>>> south[10]/sum(south.values())
0.055820815183261735
>>> midat[10]/sum(midat.values())
0.048972797320600864

• The fraction of people rated a perfect 10 is slightly higher in the South


When to use Decision trees?

• Advantages
  – Easier to interpret
  – We can see which attributes are important and which are not
    • The location isn't important in our example
      – It is an attribute that is hard to collect
        » so we can stop collecting it
    • We found out that no one who came from Slashdot signed up
      – We can use this information to optimize our strategy


When to use Decision trees?

• Decision trees can handle
  – categories
  – numeric values
• The data doesn't have to be normalized
• If some information is missing
  – we can still make predictions


When to use Decision trees?

• The more possible outcomes there are
  – the less effective a decision tree becomes
• If there are hundreds of possibilities
  – the decision tree becomes unclear
    » and would probably make bad predictions
• Because decision trees only use greater-than/less-than splits on numeric values
  – complex combinations of variables are hard to classify


When to use Decision trees?

• Use decision trees for
  – datasets with categories and numeric values
    • that can be split meaningfully
  – problems where the process of decision making is important
• Don't use decision trees for
  – datasets with huge numbers of numeric attributes
  – datasets with complex relations between numeric values


Questions?