CS 490 Sample Project: Mining the Mushroom Data Set
Kirk Scott

Page 1:

CS 490 Sample Project: Mining the Mushroom Data Set
Kirk Scott

Page 2:

Page 3:

Yellow Morels

Page 4:

Black Morels

Page 5:

• This set of overheads begins with the contents of the project check-off sheet.

• After that, an example project is given.

Page 6:

CS 490 Data Mining Project Check-Off Sheet

• Student's name: _______
• 1. Meets requirements for formatting. (No pts.) [ ]
• 2. Oral presentation given. (No pts.) [ ]
• 3. Attendance at Other Students' Presentations. Partial points for partial attendance. 20 pts. ____

Page 7:

I. Background Information on the Problem Domain and the Data Set

Page 8:

• Name of Data Set: _______
• I.A. Random Information Drawn from the Online Data Files Posted with the Data Set. 3 pts. ___
• I.B. Contents of the Data File. 3 pts. ___
• I.C. Summary of Background Information. 3 pts. ___
• I.D. Screen Shot of Open File. 3 pts. ___

Page 9:

II. Applications of Data Mining Algorithms to the Data Set

Page 10:

II. Case 1. This Needs to Be a Classification Algorithm

• Name of Algorithm: _______
• i. Output Results. 3 pts. ___
• ii. Explanation of Item. 2 pts. ___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

Page 11:

II. Case 2. This Needs to Be a Clustering Algorithm

• Name of Algorithm: _______
• i. Output Results. 3 pts. ___
• ii. Explanation of Item. 2 pts. ___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

Page 12:

II. Case 3. This Needs to Be an Association Mining Algorithm

• Name of Algorithm: _______
• i. Output Results. 3 pts. ___
• ii. Explanation of Item. 2 pts. ___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

Page 13:

II. Case 4. Any Kind of Algorithm

• Name of Algorithm: _______
• i. Output Results. 3 pts. ___
• ii. Explanation of Item. 2 pts. ___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

Page 14:

II. Case 5. Any Kind of Algorithm

• Name of Algorithm: _______
• i. Output Results. 3 pts. ___
• ii. Explanation of Item. 2 pts. ___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

Page 15:

II. Case 6. Any Kind of Algorithm

• Name of Algorithm: _______
• i. Output Results. 3 pts. ___
• ii. Explanation of Item. 2 pts. ___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

Page 16:

II. Case 7. Any Kind of Algorithm

• Name of Algorithm: _______
• i. Output Results. 3 pts. ___
• ii. Explanation of Item. 2 pts. ___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

Page 17:

II. Case 8. Any Kind of Algorithm

• Name of Algorithm: _______
• i. Output Results. 3 pts. ___
• ii. Explanation of Item. 2 pts. ___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

Page 18:

III. Choosing the Best Algorithm Among the Results

Page 19:

• III.A. Random Babbling. 6 pts. ___
• III.B. An Application of the Paired t-test. 6 pts. ___
• Total out of 100 points possible: _____

Page 20:

Example Project

• The point of this sample project is to illustrate what you should produce for your project.

• In addition to the content of the project, information given in italics provides instructions or commentary or background information.

Page 21:

• Needless to say, your project should simply contain all of the necessary content.

• You don't have to provide italicized commentary.

Page 22:

I. Background Information on the Problem Domain and the Data Set

• If you are working with your own data set you will have to produce this documentation entirely yourself.

• If you are working with a downloaded data set, you can use whatever information comes with the data set.

• You may paraphrase that information, rearrange it, do anything to it to help make your presentation clear.

Page 23:

• You don't have to follow academic practice and try to document or footnote what you did when presenting the information.

• The goal is simply adaptation for clear and complete presentation.

• What I'm trying to say is this: There will be no penalty for "plagiarism".

Page 24:

• What I would like you to avoid is simply copying and pasting, which leads to a mass of information that is not relevant or helpful to the reader (the teacher, who will be assigning the grade) in understanding what you were doing.

• Reorganize and edit as necessary in order to make it clear.

Page 25:

• Finally, include a screen shot of the explorer view of the data set after you've opened the file containing it.

• Already here you have a choice of what exactly to show and you need to write some text explaining what the screen shot displays.

Page 26:

I.A. Random Information Drawn from the Online Data Files Posted with the Data Set

• This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525).

• Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended.

• This latter class was combined with the poisonous one.

Page 27:

• The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.

Page 28:

• Number of Instances: 8124
• Number of Attributes: 22 (all nominally valued)
• Attribute Information: (classes: edible=e, poisonous=p)

Page 29:

• 1. cap-shape:bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s

• 2. cap-surface:fibrous=f,grooves=g,scaly=y,smooth=s

• 3. cap-color:brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y

Page 30:

• 4. bruises?: bruises=t,no=f
• 5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s

• 6. gill-attachment:attached=a,descending=d,free=f,notched=n

• 7. gill-spacing:close=c,crowded=w,distant=d

Page 31:

• 8. gill-size: broad=b,narrow=n
• 9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y

• 10. stalk-shape:enlarging=e,tapering=t

• 11. stalk-root:bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=?

Page 32:

• 12. stalk-surface-above-ring:fibrous=f,scaly=y,silky=k,smooth=s

• 13. stalk-surface-below-ring:fibrous=f,scaly=y,silky=k,smooth=s

• 14. stalk-color-above-ring:brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

Page 33:

• 15. stalk-color-below-ring:brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

• 16. veil-type: partial=p,universal=u
• 17. veil-color: brown=n,orange=o,white=w,yellow=y
• 18. ring-number: none=n,one=o,two=t

Page 34:

• 19. ring-type:cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z

• 20. spore-print-color:black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y

Page 35:

• 21. population:abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y

• 22. habitat:grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d

Page 36:

• Missing Attribute Values: 2480 of them (denoted by "?"), all for attribute #11.

• Class Distribution:
• -- edible: 4208 (51.8%)
• -- poisonous: 3916 (48.2%)
• -- total: 8124 instances

Page 37:

• Logical rules for the mushroom data sets.

• This is information derived by researchers who have already worked with the data set.

• The logical rules given below seem to be the simplest possible for the mushroom dataset and therefore should be treated as benchmark results.

Page 38:

• Disjunctive rules for poisonous mushrooms, from most general to most specific:

• P_1) odor=NOT(almond.OR.anise.OR.none)

• 120 poisonous cases missed, 98.52% accuracy

• P_2) spore-print-color=green
• 48 cases missed, 99.41% accuracy

Page 39:

• P_3) odor=none.AND.stalk-surface-below-ring=scaly.AND.(stalk-color-above-ring=NOT.brown)

• 8 cases missed, 99.90% accuracy
• P_4) habitat=leaves.AND.cap-color=white
• 100% accuracy
• Rule P_4) may also be
• P_4') population=clustered.AND.cap_color=white

Page 40:

• These rules involve 6 attributes (out of 22). Rules for edible mushrooms are obtained as negation of the rules given above, for example the rule:

• odor=(almond.OR.anise.OR.none).AND.spore-print-color=NOT.green

• gives 48 errors, or 99.41% accuracy on the whole dataset.
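• As a sanity check, a benchmark rule like the one just quoted can be counted against the data directly. The sketch below uses the Weka Java API rather than the Explorer; the file name mushroom.arff is only a placeholder for however you have saved the data, the attribute names and one-letter codes (a=almond, l=anise, n=none for odor; r=green for spore-print-color) are the ones listed in the attribute information above, and the class value (p/e) is assumed to be the first field of each record, as in the raw file shown in section I.B below. If the rule behaves as described, the count should come out at about 48 misclassified instances.

    import weka.core.Attribute;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class EdibleRuleCheck {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("mushroom.arff").getDataSet();  // placeholder file name
            data.setClassIndex(0);  // assumes the class (p/e) is the first field of each record
            Attribute odor  = data.attribute("odor");               // assumes the attribute names above
            Attribute spore = data.attribute("spore-print-color");
            int errors = 0;
            for (int i = 0; i < data.numInstances(); i++) {
                Instance inst = data.instance(i);
                String o = inst.stringValue(odor);
                String s = inst.stringValue(spore);
                // Edible rule quoted above: odor in {almond, anise, none} AND spore-print-color != green
                boolean predictedEdible =
                        (o.equals("a") || o.equals("l") || o.equals("n")) && !s.equals("r");
                boolean actuallyEdible = inst.stringValue(inst.classAttribute()).equals("e");
                if (predictedEdible != actuallyEdible) errors++;
            }
            System.out.println(errors + " of " + data.numInstances() + " instances misclassified");
        }
    }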

Page 41:

• Several slightly more complex variations on these rules exist, involving other attributes, such as gill_size, gill_spacing, stalk_surface_above_ring, but the rules given above are the simplest we have found.

Page 42:

I.B. Contents of the Data File

• Here is a snippet of five records from the data file:

• p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
• e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
• e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
• p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
• e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
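• The same placeholder setup can be used to load the file and confirm the basic facts quoted in section I.A (8124 instances, 22 attributes, and roughly the 4208/3916 class split). This is only a sketch of what the Explorer does when you open the file.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LoadMushrooms {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("mushroom.arff").getDataSet();  // placeholder file name
            data.setClassIndex(0);  // the class (p/e) is the first field in the records above
            System.out.println("Instances:  " + data.numInstances());
            System.out.println("Attributes: " + (data.numAttributes() - 1));  // not counting the class
            int[] counts = data.attributeStats(data.classIndex()).nominalCounts;
            for (int i = 0; i < counts.length; i++) {
                System.out.println(data.classAttribute().value(i) + ": " + counts[i]);
            }
        }
    }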

Page 43:

• Incidentally, the data file contents also exist in expanded form.

• Here is a record from that file:
• EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ALMOND,FREE,CROWDED,NARROW,WHITE,TAPERING,BULBOUS,SMOOTH,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,PURPLE,SEVERAL,WOODS

Page 44:

• Section I.C should be written by you. You should summarize the information given above, which is largely copy and paste, in a brief, well-organized paragraph that you write yourself and which conveys the basics in a concise way.

Page 45:

• The idea is that a reader who really doesn't want or need to know the details could go to this paragraph and find out everything they needed to know in order to keep reading the rest of your write-up and have some idea of what is going on.

Page 46:

I.C. Summary of Background Information

• The problem domain is the classification of mushrooms as either poisonous/inedible or non-poisonous/edible.

• The data set contains 8124 instances, each consisting of 22 nominal attributes.

• Roughly half of the instances are poisonous and half are non-poisonous.

Page 47:

• There are 2480 cases of missing attribute values, all on the same attribute.

• As is to be expected with non-original data sets, this set has already been extensively studied.

• Other researchers have provided sets of rules they have derived which would serve as benchmarks when considering the results of the application of further data mining algorithms to the data set.

Page 48:

I.D. Screen Shot of Open File

• ***What this shows:

• The cap-shape attribute is chosen out of the list on the left.

• Its different values are given in the table in the upper right.

• In the lower right, the Edible attribute is selected from a (hidden) drop down list.

Page 49:

• The graph shows the proportion of edible and inedible mushrooms among the instances containing different values of cap-shape.

Page 50: (screen shot)

Page 51:

II. Applications of Data Mining Algorithms to the Data Set

• The overall requirement is that you use the Weka explorer and run up to 8 different data mining algorithms on your data set.

• Here is a preview of what is involved:

Page 52:

• i. You will get full credit for all 8 cases if among the 8 there is at least one each of classification, clustering, and association rule mining.

• In order to make it clear that this has been done, the first case should be a classification, the second case should be a clustering, and the third case should be an application of association rule mining.

Page 53:

• The grading check-off sheet will reflect this requirement.

• All remaining cases can be of your choice, given in any order you want.

Page 54:

• ii. You will have to either copy a screen shot or copy certain information out of the Weka explorer interface and paste it into your report.

• The stuff you need to do this for in the different kinds of cases is simply illustrated.

• I won't try and list it all out here.

Page 55:

• At every point, ask yourself this question:

• "Was it immediately apparent to me what I was looking at and what it meant?"

• If the answer to that question was no, you should include explanatory remarks with whatever you chose to show from Weka.

Page 56:

• For consistency's sake in these cases you can label your remarks "***What this shows:".

Page 57:

• iii. The most obvious kind of results that you would reproduce would be the percent correct and percent incorrect classification for a classification scheme, for example.

• In addition to this, the Weka output would include things like a confusion matrix, the Kappa statistic, and so on.

Page 58:

• For each case that you examine, you will be expected to highlight one aspect of the output and to provide your own brief, written explanation of it.

• Note that this is an "educational" aspect of this project.

Page 59:

• On the job, the expectation would be that you as a user knew what it all meant.

• Here, as a student, the goal is to show that you know what it all meant.

Page 60:

• iv. Finally, there is an additional aspect of Weka that you should use and illustrate.

• I will not try to describe it in detail here. You will see examples in the content below.

Page 61:

• In short, for the different algorithms, if you right click on the results, you will be given options to create graphs, charts, and various other kinds of output.

• For each case that you cover you should take one of these options.

• Again, there is an educational, as opposed to practical, aspect to this.

Page 62:

• For the purposes of this project, just cycle through the different options that are available to show that you are familiar with them.

• For each one, provide a sentence or two making it clear that you know what this additional output means.

Page 63:

II. Case 1. This Needs to Be a Classification Algorithm

• Name of Algorithm: J48

Page 64:

i. Output Results

• ***What this shows:

• This shows the classifier tree generated by the J48 algorithm.

Page 65: (screen shot)

Page 66:

• ***What this shows:

• This gives the analysis of the output of the algorithm.

• The most notable thing that should jump out at you is that this is a "perfect" tree.

• The output shows 100% correct classification and no misclassification.
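• For the record, this run can also be reproduced outside the Explorer. The sketch below performs the same 10-fold cross-validated J48 run through the Weka Java API, with the same placeholder file name and class-first assumption as in the earlier sketches; it prints the accuracy figures and the confusion matrix.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Case1J48 {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("mushroom.arff").getDataSet();  // placeholder file name
            data.setClassIndex(0);
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));  // 10-fold cross-validation
            System.out.println(eval.toSummaryString());  // percent correct/incorrect, error measures
            System.out.println(eval.toMatrixString());   // the confusion matrix
        }
    }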

Page 67: (screen shot)

Page 68:

ii. Explanation of Item

• There is no need to repeat the screen shot.

• For this item I have chosen the confusion matrix.

• It is very easy to understand.

• It shows 0 false positives and 0 false negatives.

Page 69:

• It is interesting to note that you need to know the values for the attributes in the data file to be sure which number represents TP and which represents TN.

• Referring back to the earlier screen shot, the same is true for the bars.

• What do the blue and red parts of the bars represent, edible or inedible?

Page 70:

iii. Graphical or Other Special Purpose Additional Output

• ***What this shows:

• Going back to the previous screen shot, if you right click on the item highlighted in blue (the results of running J48 on the data set), you get several options.

• One of them is "Visualize tree".

• This screen shot shows the result of taking that option.

Page 71: (screen shot)

Page 72:

II. Case 2. This Needs to Be a Clustering Algorithm

• Name of Algorithm: SimpleKMeans

Page 73:

i. Output Results

• ***What this shows:

• This shows the results of the SimpleKMeans clustering algorithm with the edible/inedible attribute ignored.

• The results compare the clusters/classifications with the ignored attribute.

• The algorithm finds 2 clusters based on the remaining attributes.

Page 74: (screen shot)

Page 75:

ii. Explanation of Item

• At the bottom of the screen shot there is an item, "Incorrectly clustered instances".

• 37.6% of the clustered instances don't fall into the desired edible/inedible category.

• The algorithm finds 2 clusters, but these 2 clusters don't agree with the 2 classifications of the attribute that was ignored.
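• The "ignore the class attribute, then compare clusters to classes" setup can be scripted too. In the sketch below the class attribute is filtered out before clustering and only used afterward for the classes-to-clusters evaluation; the file name and class position are the same placeholders as before.

    import weka.clusterers.ClusterEvaluation;
    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class Case2KMeans {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("mushroom.arff").getDataSet();  // placeholder file name
            data.setClassIndex(0);
            // Filter out the class attribute so the clusterer never sees edible/inedible.
            Remove remove = new Remove();
            remove.setAttributeIndices("" + (data.classIndex() + 1));  // 1-based attribute index
            remove.setInputFormat(data);
            Instances dataNoClass = Filter.useFilter(data, remove);
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(2);                 // ask for 2 clusters, as in the Explorer run
            km.buildClusterer(dataNoClass);
            // Classes-to-clusters evaluation against the held-out class attribute.
            ClusterEvaluation ce = new ClusterEvaluation();
            ce.setClusterer(km);
            ce.evaluateClusterer(data);
            System.out.println(ce.clusterResultsToString());  // includes "Incorrectly clustered instances"
        }
    }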

Page 76:

iii. Graphical or Other Special Purpose Additional Output

• ***What this shows:

• Going back to the previous screen shot, if you right click on the item highlighted in blue (the results of running SimpleKMeans on the data set), you get several options.

• One of them is "Visualize cluster assignments".

Page 77:

• This screen shot shows the result of taking that option.

• Since it isn't possible to visualize the clusters in n-dimensional space, the screen provides the option of picking which individual attribute to visualize.

Page 78:

• This screen shows the instances in order by number along the x-axis.

• The y-axis shows the cluster placements for the different values for the cap-shape attribute.

• The drop down box allows you to change what the axes represent.

Page 79: (screen shot)

Page 80:

II. Case 3. This Needs to Be an Association Mining Algorithm

• Name of Algorithm: Apriori

Page 81:

i. Output Results

• ***What this shows:

• This shows the results of the Apriori association rule mining algorithm.

Page 82: (screen shot)

Page 83:

ii. Explanation of Item

• Various relevant parameters are shown on the screen shot.

• The system defaults to a minimum support level of .95 and a minimum confidence level of .9.

• The system lists the 10 best rules found.
• The first 9 have confidence levels of 1.
• On the one hand, this is good.

Page 84:

• From a practical point of view, what this tends to suggest is that the data are effectively redundant.

• Just to take the first rule for example, if you know the color of the veil, you know the type of the veil.

• The 10th rule provides an interesting reverse insight into this.

• It tells you that if you know the type, you only know the color with .98 confidence.
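• The Apriori run itself is short to script. The sketch below just uses Weka's default Apriori settings, which is what the Explorer run above used, and prints the best rules found along with their confidence values.

    import weka.associations.Apriori;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Case3Apriori {
        public static void main(String[] args) throws Exception {
            // Apriori works directly on the nominal attributes; no class index is needed.
            Instances data = new DataSource("mushroom.arff").getDataSet();  // placeholder file name
            Apriori apriori = new Apriori();      // default settings, as in the Explorer run above
            apriori.buildAssociations(data);
            System.out.println(apriori);          // prints the best rules with support and confidence
        }
    }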

Page 85:

iii. Graphical or Other Special Purpose Additional Output

• There don't appear to be any other output options for association rules.

• There is no standard visualization for them so nothing is included for this point.

Page 86:

II. Case 4. Any Kind of Algorithm

• Name of Algorithm: ADTree

Page 87:

i. Output Results

• ***What this shows:

• These are the results of running the ADTree classification algorithm.

• I haven't bothered to scroll up and show the ASCII representation of the tree.

• Instead, I've just shown the critical output at the bottom.

Page 88: (screen shot)

Page 89:

ii. Explanation of Item

• There are two items I'd like to highlight:

• a. Notice that this tree generation algorithm didn't get 100% classified correctly.

• If I'm reading the data correctly, there were 8 false positives on the attribute of interest, which is named Edible.

• This is not good.

Page 90:

• False negatives deprive you of a tasty gustatory and culinary experience.

• False positives deprive you of your health or your life.

• I point this out in contrast to the J48 results given above.

Page 91:

• b. Notice that the time taken to build the model was .73 seconds.

• This is about 10 times slower than J48, but I'm mainly interested in comparing with the following algorithm.

Page 92:

iii. Graphical or Other Special Purpose Additional Output

• ***What this shows:

• This is the visualization of the tree.

• There are other graphical options, but they are difficult to interpret for the mushroom data set, so this is given for comparison with the J48 tree.

Page 93: (screen shot)

Page 94:

II. Case 5. Any Kind of Algorithm

• Name of Algorithm: BFTree

Page 95:

i. Output Results

• ***What this shows:

• This shows the results of using the BFTree classification algorithm.

Page 96: (screen shot)

Page 97:

ii. Explanation of Item

• This algorithm also doesn't give a tree that classifies with 100% accuracy.

• It gives the same kind of error as the ADTree, although there are 3 fewer.

Page 98:

• The additional item I'd like to highlight is that the time taken to build the model was 12.42 seconds.

• As a matter of fact, that information came out first and then additional, significant amounts of time were taken to run through each fold of the data.

• This was quite time consuming compared to the other trees produced so far.

Page 99:

iii. Graphical or Other Special Purpose Additional Output

• ***What this shows:

• This screen shot is the result of taking the "Visualize classifier errors" option on the results of the algorithm.

• I believe what this screen illustrates is a decision point in the tree on the cap-surface attribute.

Page 100:

• In one of the cases, symbolized by the blue rectangle, an incorrect classification is made on this basis while 7 other instances classify correctly based on this attribute.

Page 101: (screen shot)

Page 102:

II. Case 6. Any Kind of Algorithm

• Name of Algorithm: Naïve Bayes

Page 103:

i. Output Results

• ***What this shows:

• This screen shot shows the bottom of the output for the Naïve Bayes classification algorithm.

• The upper part of the output shows conditional probability counts for all of the attributes in the data.

Page 104:

• If the cost of an error weren't so high, this algorithm by itself would do OK.

• Its time cost is only .03 seconds and it achieves 95.8% correct classification.
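• Those two numbers (build time and percent correct) can be reproduced roughly as follows; build time will of course vary from machine to machine. Same placeholder assumptions as in the earlier sketches.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Case6NaiveBayes {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("mushroom.arff").getDataSet();  // placeholder file name
            data.setClassIndex(0);
            long start = System.currentTimeMillis();
            new NaiveBayes().buildClassifier(data);   // time to build the model once on the full data
            System.out.println("Build time (ms): " + (System.currentTimeMillis() - start));
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
            System.out.printf("Correctly classified: %.4f%%  RMSE: %.4f%n",
                    eval.pctCorrect(), eval.rootMeanSquaredError());
        }
    }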

Page 105: (screen shot)

Page 106:

ii. Explanation of Item

• I'm running out of items to highlight which are particularly meaningful for the example in question.

• Notice that the output includes the Mean absolute error, the Root mean squared error, the Relative absolute error, and the Root relative squared error.

Page 107:

• These differ in magnitude because of the way they're calculated, but they are all indicators of the same general thing.

• As pointed out in the book, when comparing two different data mining approaches, if you compare the same measure for both, you will tend to get a valid comparison regardless of which of the measures you use.

Page 108:

iii. Graphical or Other Special Purpose Additional Output

• ***What this shows:

• Two graphical output screen shots are given below.

• They show a cost-benefit analysis.

• Such an analysis is more appropriate to something like direct mailing, but it is possible to illustrate something by changing one of the parameters in the display.

Page 109:

• Both screen shots show a threshold curve and a cost-benefit curve where the button to minimize cost/benefit has been clicked.

• In the first screen shot the costs of FP and FN are equal, at 1.

• In the second, the cost of a false positive has been raised to 1,000.

• Notice how the shape of the curve changes.

Page 110:

• Roughly speaking, I would interpret the second screenshot to mean that you have effectively no costs as long as you are correctly predicting TP, but your cost rises linearly with the increasing probability of FP predictions later in the data set.
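• One way to make that asymmetry concrete without the cost/benefit screen is to weight the confusion-matrix counts directly. The sketch below reuses a Naive Bayes cross-validation and applies the 1,000-to-1 costs from the example above; the class value codes e and p are the ones listed in the attribute information, and the other placeholder assumptions are unchanged.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CostOfErrors {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("mushroom.arff").getDataSet();  // placeholder file name
            data.setClassIndex(0);
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
            // Confusion matrix: rows are actual classes, columns are predicted classes.
            double[][] cm = eval.confusionMatrix();
            int e = data.classAttribute().indexOfValue("e");  // assumes the one-letter class codes
            int p = data.classAttribute().indexOfValue("p");
            double poisonousCalledEdible = cm[p][e];   // the dangerous error
            double edibleCalledPoisonous = cm[e][p];   // the missed meal
            double totalCost = 1000.0 * poisonousCalledEdible + 1.0 * edibleCalledPoisonous;
            System.out.println("Poisonous predicted edible: " + poisonousCalledEdible);
            System.out.println("Edible predicted poisonous: " + edibleCalledPoisonous);
            System.out.println("Total cost at 1000 to 1:    " + totalCost);
        }
    }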

Page 111: (screen shot)

Page 112: (screen shot)

Page 113:

II. Case 7. Any Kind of Algorithm

• Name of Algorithm: BayesNet

Page 114:

i. Output Results

• ***What this shows:

• This screen shot shows the results of the BayesNet classification algorithm.

Page 115: (screen shot)

Page 116:

ii. Explanation of Item

• This is not a new item to explain, but it is an observation related to the values and results previously obtained.

• The association rule mining algorithm seemed to suggest that there were heavy dependencies among some of the attributes in the data set.

• BayesNet is supposed to take these into account, while Naïve Bayes does not.

Page 117:

• However, when you compare the rate of correctly classified instances, here you get 96.2% vs. 95.8% for Naïve Bayes.

• It seems fair to ask what difference it really made to include the dependencies in the analysis.
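• Since the two percentages come from separate Explorer runs, scripting the two evaluations side by side makes the comparison easier to repeat. The sketch below uses the same placeholder assumptions as before and BayesNet with its default settings.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.BayesNet;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CompareBayes {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("mushroom.arff").getDataSet();  // placeholder file name
            data.setClassIndex(0);
            Evaluation nb = new Evaluation(data);
            nb.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
            Evaluation bn = new Evaluation(data);
            bn.crossValidateModel(new BayesNet(), data, 10, new Random(1));  // default settings
            System.out.printf("NaiveBayes: %.4f%%  BayesNet: %.4f%%%n",
                    nb.pctCorrect(), bn.pctCorrect());
        }
    }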

Page 118:

iii. Graphical or Other Special Purpose Additional Output

• ***What this shows:

• This shows the result of taking the "Visualize cost curve" option on the results of the data mining.

• Honestly, I've about reached the limit of what I understand without further research.

• I present this here without further explanation.

Page 119:

• This is one of the reasons I advertise this sample project write-up as an example of a B, rather than an A effort.

• Everything that has been asked for is included, but in this point, for example, the explanation isn't complete.

• It sure is pretty though…

Page 120: (screen shot)

Page 121:

II. Case 8. Any Kind of Algorithm

• Name of Algorithm: RIDOR

Page 122:

i. Output Results

• ***What this shows:

• This screen shows the results of applying the RIDOR algorithm to the data set.

• RIDOR was the technique based on rules and exceptions.

• Look at the top of the output.

• Here you see clearly that the default classification is edible, with exceptions listed underneath.

Page 123:

• Philosophically, this goes against my point of view on mushrooms.

• The logical default should be inedible, but there are more edible mushrooms in the data set than inedible.

• So it goes.

Page 124: (screen shot)

Page 125:

ii. Explanation of Item

• The last set of items that appears in these output screens consists of the Precision, Recall, F-Measure, and ROC values.

• This is probably not the best example for illustrating what they mean.

• It's apparent that things like recall would be better suited to document retrieval, for example.

Page 126:

• Maybe the best illustration that they don't really apply is that they are all 1 or .999.

• On the other hand, maybe that's realistic for a classification scheme that gives 99.95% correct results.
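• These four figures can also be pulled out programmatically. The sketch below assumes the Ridor classifier is available as weka.classifiers.rules.Ridor (in newer versions of Weka it is installed as a separate package) and reports the statistics for the edible class value, with the same placeholder file name as before.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.Ridor;   // in recent Weka releases this comes from a separate package
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Case8Ridor {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("mushroom.arff").getDataSet();  // placeholder file name
            data.setClassIndex(0);
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new Ridor(), data, 10, new Random(1));
            int e = data.classAttribute().indexOfValue("e");   // statistics for the edible class value
            System.out.printf("Precision: %.3f  Recall: %.3f  F-Measure: %.3f  ROC area: %.3f%n",
                    eval.precision(e), eval.recall(e), eval.fMeasure(e), eval.areaUnderROC(e));
        }
    }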

Page 127:

iii. Graphical or Other Special Purpose Additional Output

• Once again, the fact that this is a "B" example rather than an "A" example comes into play.

• I'm not showing a new bit of graphical output.

• I'm showing the cost curve, like for the previous data mining algorithm.

Page 128:

• The main reason for choosing to show it again is that this picture looks so much like the simple picture in the text that they used to illustrate some of the cost concepts graphically.

Page 129: (screen shot)

Page 130:

III. Choosing the Best Algorithm Among the Results

• Depending on the problem domain and your level of ambition, you might compare algorithms on the basis of lift charts, cost curves, and so on.

• For simple classification, the tools will give results showing the percent classified correctly and the percent classified incorrectly.

Page 131:

• It would be natural to simply choose the one with the highest percent classified correctly.

• However, this is not good enough for credit on this item.

• I have chosen to illustrate what you need to do with a simple basic example.

Page 132:

• I consider the two classification algorithms that gave the highest percent classified correctly.

• I then apply the paired t-test to see whether or not there is actually a statistically significant difference between them.

Page 133:

• If there is, that's the correct basis for preferring one over the other.

• For the purposes of illustration, I do this by hand and explain what I'm doing.

• You may find tools that allow you to make a valid comparison of results.

• That's OK, as long as you explain.

Page 134:

• The point simply is that it's not sufficient to just list a bunch of percents and pick the highest one.

• Illustrate the use of some advanced technique, whether involving concepts like lift charts or cost curves or statistics.

• You may also have noticed that Weka tells you the run time for doing an analysis.

Page 135:

• When making a decision about which algorithm is the best, at a minimum take into account an advanced comparison of the two apparent best, and you may want to make an observation about the apparent complexity or time cost of the algorithms.

Page 136:

III.A. Random Babbling

• The concept of "Cost of classification" seems relevant to this example.

• It takes a human expert to tell if a mushroom is poisonous.

• If you're not an expert, you can tell by eating a mushroom and seeing what happens.

Page 137:

• The cost of finding out that the mushroom is poisonous is about as high as it gets.

• I guess if you're truly dedicated, you'd be willing to die for science.

• Directly related to this is the cost of a misclassification.

• It seems to be on the infinite side…

Page 138:

• The J48 tree approach, given first, even though it's apparently been pruned, still classifies 100% correctly.

• This seems to be at odds with claims made at various points that you don't want a perfect classifier because it will tend to be overtrained.

Page 139:

• On the other hand, since the cost of a misclassification is so high, maybe it would be best to bias the training.

• Lots of false "It's poisonous" results would be desirable.

• I remember learning this rule from my parents:

• Don't eat any wild mushrooms.

Page 140:

• It's also interesting to compare with the commentary provided at the beginning.

• "Experts" who have examined the data wanted to get a minimal rule set.

• They apparently considered that a success.

• But they were willing to live with errors.

• I'm not sure living with errors is consistent with this data set.

Page 141:

III.B. An Application of the Paired t-test

• Pick any two of your results above, identify them and the success rate values they gave, and compare them using the paired t-test.

• Give a statistically valid statement that tells whether or not the two cases you're comparing are significantly different.

Page 142:

• What is shown is my attempt to interpret and apply what the book says about the paired t-test.

• I do not claim that I have necessarily done this correctly.

• Students who have recently taken statistics may reach different conclusions about how this is done.

Page 143:

• However, I have gone through the motions.

• To get credit for this section, you should do the same, whether following my example or following your own understanding.

Page 144:

• I have chosen to compare the percent of correct classifications by Naïve Bayes (NB) and BayesNet (BN) given above.

Page 145:

• Taken from Weka results:
• NB sample mean = 95.8272%
• NB root mean squared error = .1757
• Squaring the value above:
• NB mean squared error = .03087049

Page 146:

• Taken from Weka results:
• BN sample mean = 96.2211%
• BN root mean squared error = .1639
• Squaring the value above:
• BN mean squared error = .02686321

Page 147:

• This is my estimate of the standard deviation of the t statistic where the divisor is 10 because I opted for the default 10-fold cross-validation in Weka:

• Estimate of paired root mean squared error (EPRMSE)

• = square root(( NB mean squared error / 10) + (BN mean squared error / 10))

• = .075982695

Page 148:

• t statistic
• = |NB sample mean – BN sample mean| / EPRMSE
• = |95.8272 – 96.2211| / .075982695
• = 5.184
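• The arithmetic above is easy to slip on by hand, so here is the same computation spelled out in code. The input numbers are the ones quoted from the Weka output; the calculation follows my steps above rather than claiming to be a textbook-perfect paired t-test.

    public class PairedTByHand {
        public static void main(String[] args) {
            // Values quoted from the Weka output above.
            double nbMean = 95.8272, bnMean = 96.2211;   // percent correctly classified
            double nbRmse = 0.1757,  bnRmse = 0.1639;    // root mean squared errors
            double nbMse = nbRmse * nbRmse;              // 0.03087049
            double bnMse = bnRmse * bnRmse;              // 0.02686321
            // Estimate of the standard error with the divisor 10 (10-fold cross-validation).
            double eprmse = Math.sqrt(nbMse / 10 + bnMse / 10);   // about .075982695
            double t = Math.abs(nbMean - bnMean) / eprmse;        // about 5.184
            System.out.printf("EPRMSE = %.9f, t = %.3f%n", eprmse, t);
        }
    }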

Page 149:

• The book says this is a two-tailed test.

• For a 99% confidence interval I want to use a threshold of .5% in each tail.

• The book's table gives a value of 3.25.

Page 150:

• The computed value, 5.184, is greater than the table value of 3.25.

• This means you reject the null hypothesis that the means of the two distributions are the same.

Page 151:

• In other words, you conclude that there is a statistically significant difference between the percent of correct classifications resulting from the Naïve Bayes and the Bayesian Network algorithms on the mushroom data.

Page 152:

The End