CS 490 Sample Project: Mining the Mushroom Data Set
Kirk Scott

Page 1:

CS 490 Sample Project: Mining the Mushroom Data Set
Kirk Scott

Page 2:

Page 3:

Yellow Morels

Page 4:

Black Morels

Page 5:

• This set of overheads begins with the contents of the project check-off sheet.

• After that, an example project is given.

Page 6:

CS 490 Data Mining Project Check-Off Sheet

• Student's name: _______
• 1. Meets requirements for formatting. (No pts.) [ ]
• 2. Oral presentation given. (No pts.) [ ]
• 3. Attendance at Other Students' Presentations. Partial points for partial attendance. 20 pts. ____

Page 7:

I. Background Information on the Problem Domain and the Data Set

Page 8:

• Name of Data Set: _______
• I.A. Random Information Drawn from the Online Data Files Posted with the Data Set. 3 pts. ___
• I.B. Contents of the Data File. 3 pts. ___
• I.C. Summary of Background Information. 3 pts. ___
• I.D. Screen Shot of Open File. 3 pts. ___

Page 9:

II. Applications of Data Mining Algorithms to the Data Set

Page 10:

II. Case 1. This Needs to Be a Classification Algorithm

• Name of Algorithm: _______
• i. Output Results. 3 pts. ___
• ii. Explanation of Item. 2 pts. ___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

Page 11:

II. Case 2. This Needs to Be a Clustering Algorithm

• Name of Algorithm: _______
• i. Output Results. 3 pts. ___
• ii. Explanation of Item. 2 pts. ___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

Page 12:

II. Case 3. This Needs to Be an Association Mining Algorithm

• Name of Algorithm: _______
• i. Output Results. 3 pts. ___
• ii. Explanation of Item. 2 pts. ___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

Page 13:

II. Case 4. Any Kind of Algorithm

• Name of Algorithm: _______
• i. Output Results. 3 pts. ___
• ii. Explanation of Item. 2 pts. ___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

Page 14:

II. Case 5. Any Kind of Algorithm

• Name of Algorithm: _______
• i. Output Results. 3 pts. ___
• ii. Explanation of Item. 2 pts. ___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

Page 15:

II. Case 6. Any Kind of Algorithm

• Name of Algorithm: _______
• i. Output Results. 3 pts. ___
• ii. Explanation of Item. 2 pts. ___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

Page 16:

II. Case 7. Any Kind of Algorithm

• Name of Algorithm: _______
• i. Output Results. 3 pts. ___
• ii. Explanation of Item. 2 pts. ___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

Page 17:

II. Case 8. Any Kind of Algorithm

• Name of Algorithm: _______
• i. Output Results. 3 pts. ___
• ii. Explanation of Item. 2 pts. ___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

Page 18:

III. Choosing the Best Algorithm Among the Results

Page 19:

• III.A. Random Babbling. 6 pts. ___
• III.B. An Application of the Paired t-test. 6 pts. ___
• Total out of 100 points possible: _____

Page 20:

Example Project

• The point of this sample project is to illustrate what you should produce for your project.

• In addition to the content of the project, information given in italics provides instructions or commentary or background information.

Page 21:

• Needless to say, your project should simply contain all of the necessary content.

• You don't have to provide italicized commentary.

Page 22:

I. Background Information on the Problem Domain and the Data Set

• If you are working with your own data set you will have to produce this documentation entirely yourself.

• If you are working with a downloaded data set, you can use whatever information comes with the data set.

• You may paraphrase that information, rearrange it, do anything to it to help make your presentation clear.

Page 23:

• You don't have to follow academic practice and try to document or footnote what you did when presenting the information.

• The goal is simply adaptation for clear and complete presentation.

• What I'm trying to say is this: There will be no penalty for "plagiarism".

Page 24:

• What I would like you to avoid is simply copying and pasting, which leads to a mass of information that is not relevant or helpful to the reader (the teacher, who will be assigning the grade) in understanding what you were doing.

• Reorganize and edit as necessary in order to make it clear.

Page 25:

• Finally, include a screen shot of the explorer view of the data set after you've opened the file containing it.

• Already here you have a choice of what exactly to show and you need to write some text explaining what the screen shot displays.

Page 26:

I.A. Random Information Drawn from the Online Data Files Posted with the Data Set

• This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525).

• Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended.

• This latter class was combined with the poisonous one.

Page 27:

• The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.

Page 28:

• Number of Instances: 8124
• Number of Attributes: 22 (all nominally valued)
• Attribute Information: (classes: edible=e, poisonous=p)

Page 29:

• 1. cap-shape:bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s

• 2. cap-surface:fibrous=f,grooves=g,scaly=y,smooth=s

• 3. cap-color:brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y

Page 30:

• 4. bruises?: bruises=t,no=f
• 5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s

• 6. gill-attachment:attached=a,descending=d,free=f,notched=n

• 7. gill-spacing:close=c,crowded=w,distant=d

Page 31:

• 8. gill-size: broad=b,narrow=n
• 9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y

• 10. stalk-shape:enlarging=e,tapering=t

• 11. stalk-root:bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=?

Page 32:

• 12. stalk-surface-above-ring:fibrous=f,scaly=y,silky=k,smooth=s

• 13. stalk-surface-below-ring:fibrous=f,scaly=y,silky=k,smooth=s

• 14. stalk-color-above-ring:brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

Page 33:

• 15. stalk-color-below-ring:brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

• 16. veil-type: partial=p,universal=u
• 17. veil-color: brown=n,orange=o,white=w,yellow=y
• 18. ring-number: none=n,one=o,two=t

Page 34:

• 19. ring-type:cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z

• 20. spore-print-color:black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y

Page 35:

• 21. population:abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y

• 22. habitat:grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d

Page 36:

• Missing Attribute Values: 2480 of them (denoted by "?"), all for attribute #11.

• Class Distribution:
• -- edible: 4208 (51.8%)
• -- poisonous: 3916 (48.2%)
• -- total: 8124 instances

Page 37:

• Logical rules for the mushroom data sets.

• This is information derived by researchers who have already worked with the data set.

• The logical rules given below seem to be the simplest possible for the mushroom dataset and therefore should be treated as benchmark results.

Page 38:

• Disjunctive rules for poisonous mushrooms, from most general to most specific:

• P_1) odor=NOT(almond.OR.anise.OR.none)

• 120 poisonous cases missed, 98.52% accuracy

• P_2) spore-print-color=green
• 48 cases missed, 99.41% accuracy

Page 39:

• P_3) odor=none.AND.stalk-surface-below-ring=scaly.AND.(stalk-color-above-ring=NOT.brown)

• 8 cases missed, 99.90% accuracy
• P_4) habitat=leaves.AND.cap-color=white
• 100% accuracy
• Rule P_4) may also be
• P_4') population=clustered.AND.cap_color=white

Page 40:

• These rules involve 6 attributes (out of 22). Rules for edible mushrooms are obtained as negation of the rules given above, for example the rule:

• odor=(almond.OR.anise.OR.none).AND.spore-print-color=NOT.green

• gives 48 errors, or 99.41% accuracy on the whole dataset.
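• As a sanity check, a benchmark rule like the one just quoted can be counted against the data directly. The sketch below uses the Weka Java API rather than the Explorer; the file name mushroom.arff is only a placeholder for however you have saved the data, the attribute names and one-letter codes (a=almond, l=anise, n=none for odor; r=green for spore-print-color) are the ones listed in the attribute information above, and the class value (p/e) is assumed to be the first field of each record, as in the raw file shown in section I.B below. If the rule behaves as described, the count should come out at about 48 misclassified instances.

    import weka.core.Attribute;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class EdibleRuleCheck {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("mushroom.arff").getDataSet();  // placeholder file name
            data.setClassIndex(0);  // assumes the class (p/e) is the first field of each record
            Attribute odor  = data.attribute("odor");               // assumes the attribute names above
            Attribute spore = data.attribute("spore-print-color");
            int errors = 0;
            for (int i = 0; i < data.numInstances(); i++) {
                Instance inst = data.instance(i);
                String o = inst.stringValue(odor);
                String s = inst.stringValue(spore);
                // Edible rule quoted above: odor in {almond, anise, none} AND spore-print-color != green
                boolean predictedEdible =
                        (o.equals("a") || o.equals("l") || o.equals("n")) && !s.equals("r");
                boolean actuallyEdible = inst.stringValue(inst.classAttribute()).equals("e");
                if (predictedEdible != actuallyEdible) errors++;
            }
            System.out.println(errors + " of " + data.numInstances() + " instances misclassified");
        }
    }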

Page 41:

• Several slightly more complex variations on these rules exist, involving other attributes, such as gill_size, gill_spacing, stalk_surface_above_ring, but the rules given above are the simplest we have found.

Page 42:

I.B. Contents of the Data File

• Here is a snippet of five records from the data file:

• p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
• e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
• e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
• p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
• e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
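• The same placeholder setup can be used to load the file and confirm the basic facts quoted in section I.A (8124 instances, 22 attributes, and roughly the 4208/3916 class split). This is only a sketch of what the Explorer does when you open the file.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LoadMushrooms {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("mushroom.arff").getDataSet();  // placeholder file name
            data.setClassIndex(0);  // the class (p/e) is the first field in the records above
            System.out.println("Instances:  " + data.numInstances());
            System.out.println("Attributes: " + (data.numAttributes() - 1));  // not counting the class
            int[] counts = data.attributeStats(data.classIndex()).nominalCounts;
            for (int i = 0; i < counts.length; i++) {
                System.out.println(data.classAttribute().value(i) + ": " + counts[i]);
            }
        }
    }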

Page 43:

• Incidentally, the data file contents also exist in expanded form.

• Here is a record from that file:
• EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ALMOND,FREE,CROWDED,NARROW,WHITE,TAPERING,BULBOUS,SMOOTH,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,PURPLE,SEVERAL,WOODS

Page 44:

• Section I.C should be written by you. You should summarize the information given above, which is largely copy and paste, in a brief, well-organized paragraph that you write yourself and which conveys the basics in a concise way.

Page 45:

• The idea is that a reader who really doesn't want or need to know the details could go to this paragraph and find out everything they needed to know in order to keep reading the rest of your write-up and have some idea of what is going on.

Page 46:

I.C. Summary of Background Information

• The problem domain is the classification of mushrooms as either poisonous/inedible or non-poisonous/edible.

• The data set contains 8124 instances, each consisting of 22 nominal attributes.

• Roughly half of the instances are poisonous and half are non-poisonous.

Page 47:

• There are 2480 cases of missing attribute values, all on the same attribute.

• As is to be expected with non-original data sets, this set has already been extensively studied.

• Other researchers have provided sets of rules they have derived which would serve as benchmarks when considering the results of the application of further data mining algorithms to the data set.

Page 48:

I.D. Screen Shot of Open File

• ***What this shows:

• The cap-shape attribute is chosen out of the list on the left.

• Its different values are given in the table in the upper right.

• In the lower right, the Edible attribute is selected from a (hidden) drop down list.

Page 49:

• The graph shows the proportion of edible and inedible mushrooms among the instances containing different values of cap-shape.

Page 50: (screen shot)

Page 51:

II. Applications of Data Mining Algorithms to the Data Set

• The overall requirement is that you use the Weka explorer and run up to 8 different data mining algorithms on your data set.

• Here is a preview of what is involved:

Page 52:

• i. You will get full credit for all 8 cases if among the 8 there is at least one each of classification, clustering, and association rule mining.

• In order to make it clear that this has been done, the first case should be a classification, the second case should be a clustering, and the third case should be an application of association rule mining.

Page 53:

• The grading check-off sheet will reflect this requirement.

• All remaining cases can be of your choice, given in any order you want.

Page 54:

• ii. You will have to either copy a screen shot or copy certain information out of the Weka explorer interface and paste it into your report.

• The stuff you need to do this for in the different kinds of cases is simply illustrated.

• I won't try and list it all out here.

Page 55:

• At every point, ask yourself this question:

• "Was it immediately apparent to me what I was looking at and what it meant?"

• If the answer to that question was no, you should include explanatory remarks with whatever you chose to show from Weka.

Page 56:

• For consistency's sake in these cases you can label your remarks "***What this shows:".

Page 57:

• iii. The most obvious kind of results that you would reproduce would be the percent correct and percent incorrect classification for a classification scheme, for example.

• In addition to this, the Weka output would include things like a confusion matrix, the Kappa statistic, and so on.

Page 58:

• For each case that you examine, you will be expected to highlight one aspect of the output and to provide your own brief, written explanation of it.

• Note that this is an "educational" aspect of this project.

Page 59:

• On the job, the expectation would be that you as a user knew what it all meant.

• Here, as a student, the goal is to show that you know what it all meant.

Page 60:

• iv. Finally, there is an additional aspect of Weka that you should use and illustrate.

• I will not try to describe it in detail here. You will see examples in the content below.

Page 61:

• In short, for the different algorithms, if you right click on the results, you will be given options to create graphs, charts, and various other kinds of output.

• For each case that you cover you should take one of these options.

• Again, there is an educational, as opposed to practical, aspect to this.

Page 62:

• For the purposes of this project, just cycle through the different options that are available to show that you are familiar with them.

• For each one, provide a sentence or two making it clear that you know what this additional output means.

Page 63:

II. Case 1. This Needs to Be a Classification Algorithm

• Name of Algorithm: J48

Page 64:

i. Output Results

• ***What this shows:

• This shows the classifier tree generated by the J48 algorithm.

Page 65: (screen shot)

Page 66:

• ***What this shows:

• This gives the analysis of the output of the algorithm.

• The most notable thing that should jump out at you is that this is a "perfect" tree.

• The output shows 100% correct classification and no misclassification.
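• For the record, this run can also be reproduced outside the Explorer. The sketch below performs the same 10-fold cross-validated J48 run through the Weka Java API, with the same placeholder file name and class-first assumption as in the earlier sketches; it prints the accuracy figures and the confusion matrix.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Case1J48 {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("mushroom.arff").getDataSet();  // placeholder file name
            data.setClassIndex(0);
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));  // 10-fold cross-validation
            System.out.println(eval.toSummaryString());  // percent correct/incorrect, error measures
            System.out.println(eval.toMatrixString());   // the confusion matrix
        }
    }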

Page 67: (screen shot)

Page 68:

ii. Explanation of Item

• There is no need to repeat the screen shot.

• For this item I have chosen the confusion matrix.

• It is very easy to understand.

• It shows 0 false positives and 0 false negatives.

Page 69:

• It is interesting to note that you need to know the values for the attributes in the data file to be sure which number represents TP and which represents TN.

• Referring back to the earlier screen shot, the same is true for the bars.

• What do the blue and red parts of the bars represent, edible or inedible?

Page 70:

iii. Graphical or Other Special Purpose Additional Output

• ***What this shows:

• Going back to the previous screen shot, if you right click on the item highlighted in blue (the results of running J48 on the data set), you get several options.

• One of them is "Visualize tree".

• This screen shot shows the result of taking that option.

Page 71: (screen shot)

Page 72:

II. Case 2. This Needs to Be a Clustering Algorithm

• Name of Algorithm: SimpleKMeans

Page 73:

i. Output Results

• ***What this shows:

• This shows the results of the SimpleKMeans clustering algorithm with the edible/inedible attribute ignored.

• The results compare the clusters/classifications with the ignored attribute.

• The algorithm finds 2 clusters based on the remaining attributes.

Page 74: (screen shot)

Page 75:

ii. Explanation of Item

• At the bottom of the screen shot there is an item, "Incorrectly clustered instances".

• 37.6% of the clustered instances don't fall into the desired edible/inedible category.

• The algorithm finds 2 clusters, but these 2 clusters don't agree with the 2 classifications of the attribute that was ignored.
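• The "ignore the class attribute, then compare clusters to classes" setup can be scripted too. In the sketch below the class attribute is filtered out before clustering and only used afterward for the classes-to-clusters evaluation; the file name and class position are the same placeholders as before.

    import weka.clusterers.ClusterEvaluation;
    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class Case2KMeans {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("mushroom.arff").getDataSet();  // placeholder file name
            data.setClassIndex(0);
            // Filter out the class attribute so the clusterer never sees edible/inedible.
            Remove remove = new Remove();
            remove.setAttributeIndices("" + (data.classIndex() + 1));  // 1-based attribute index
            remove.setInputFormat(data);
            Instances dataNoClass = Filter.useFilter(data, remove);
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(2);                 // ask for 2 clusters, as in the Explorer run
            km.buildClusterer(dataNoClass);
            // Classes-to-clusters evaluation against the held-out class attribute.
            ClusterEvaluation ce = new ClusterEvaluation();
            ce.setClusterer(km);
            ce.evaluateClusterer(data);
            System.out.println(ce.clusterResultsToString());  // includes "Incorrectly clustered instances"
        }
    }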

Page 76:

iii. Graphical or Other Special Purpose Additional Output

• ***What this shows:

• Going back to the previous screen shot, if you right click on the item highlighted in blue (the results of running SimpleKMeans on the data set), you get several options.

• One of them is "Visualize cluster assignments".

Page 77:

• This screen shot shows the result of taking that option.

• Since it isn't possible to visualize the clusters in n-dimensional space, the screen provides the option of picking which individual attribute to visualize.

Page 78:

• This screen shows the instances in order by number along the x-axis.

• The y-axis shows the cluster placements for the different values for the cap-shape attribute.

• The drop down box allows you to change what the axes represent.

Page 79: (screen shot)

Page 80:

II. Case 3. This Needs to Be an Association Mining Algorithm

• Name of Algorithm: Apriori

Page 81:

i. Output Results

• ***What this shows:

• This shows the results of the Apriori association rule mining algorithm.

Page 82: (screen shot)

Page 83:

ii. Explanation of Item

• Various relevant parameters are shown on the screen shot.

• The system defaults to a minimum support level of .95 and a minimum confidence level of .9.

• The system lists the 10 best rules found.
• The first 9 have confidence levels of 1.
• On the one hand, this is good.

Page 84:

• From a practical point of view, what this tends to suggest is that the data are effectively redundant.

• Just to take the first rule for example, if you know the color of the veil, you know the type of the veil.

• The 10th rule provides an interesting reverse insight into this.

• It tells you that if you know the type, you only know the color with .98 confidence.
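• The Apriori run itself is short to script. The sketch below just uses Weka's default Apriori settings, which is what the Explorer run above used, and prints the best rules found along with their confidence values.

    import weka.associations.Apriori;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Case3Apriori {
        public static void main(String[] args) throws Exception {
            // Apriori works directly on the nominal attributes; no class index is needed.
            Instances data = new DataSource("mushroom.arff").getDataSet();  // placeholder file name
            Apriori apriori = new Apriori();      // default settings, as in the Explorer run above
            apriori.buildAssociations(data);
            System.out.println(apriori);          // prints the best rules with support and confidence
        }
    }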

Page 85:

iii. Graphical or Other Special Purpose Additional Output

• There don't appear to be any other output options for association rules.

• There is no standard visualization for them so nothing is included for this point.

Page 86:

II. Case 4. Any Kind of Algorithm

• Name of Algorithm: ADTree

Page 87:

i. Output Results

• ***What this shows:

• These are the results of running the ADTree classification algorithm.

• I haven't bothered to scroll up and show the ASCII representation of the tree.

• Instead, I've just shown the critical output at the bottom.

Page 88: (screen shot)

Page 89:

ii. Explanation of Item

• There are two items I'd like to highlight:

• a. Notice that this tree generation algorithm didn't get 100% classified correctly.

• If I'm reading the data correctly, there were 8 false positives on the attribute of interest, which is named Edible.

• This is not good.

Page 90:

• False negatives deprive you of a tasty gustatory and culinary experience.

• False positives deprive you of your health or your life.

• I point this out in contrast to the J48 results given above.

Page 91:

• b. Notice that the time taken to build the model was .73 seconds.

• This is about 10 times slower than J48, but I'm mainly interested in comparing with the following algorithm.

Page 92:

iii. Graphical or Other Special Purpose Additional Output

• ***What this shows:

• This is the visualization of the tree.

• There are other graphical options, but they are difficult to interpret for the mushroom data set, so this is given for comparison with the J48 tree.

Page 93: (screen shot)

Page 94:

II. Case 5. Any Kind of Algorithm

• Name of Algorithm: BFTree

Page 95:

i. Output Results

• ***What this shows:

• This shows the results of using the BFTree classification algorithm.

Page 96: (screen shot)

Page 97:

ii. Explanation of Item

• This algorithm also doesn't give a tree that classifies with 100% accuracy.

• It gives the same kind of error as the ADTree, although there are 3 fewer.

Page 98:

• The additional item I'd like to highlight is that the time taken to build the model was 12.42 seconds.

• As a matter of fact, that information came out first and then additional, significant amounts of time were taken to run through each fold of the data.

• This was quite time consuming compared to the other trees produced so far.

Page 99:

iii. Graphical or Other Special Purpose Additional Output

• ***What this shows:

• This screen shot is the result of taking the "Visualize classifier errors" option on the results of the algorithm.

• I believe what this screen illustrates is a decision point in the tree on the cap-surface attribute.

Page 100:

• In one of the cases, symbolized by the blue rectangle, an incorrect classification is made on this basis while 7 other instances classify correctly based on this attribute.

Page 101: (screen shot)

Page 102:

II. Case 6. Any Kind of Algorithm

• Name of Algorithm: Naïve Bayes

Page 103:

i. Output Results

• ***What this shows:

• This screen shot shows the bottom of the output for the Naïve Bayes classification algorithm.

• The upper part of the output shows conditional probability counts for all of the attributes in the data.

Page 104:

• If the cost of an error weren't so high, this algorithm by itself would do OK.

• Its time cost is only .03 seconds and it achieves 95.8% correct classification.
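• Those two numbers (build time and percent correct) can be reproduced roughly as follows; build time will of course vary from machine to machine. Same placeholder assumptions as in the earlier sketches.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Case6NaiveBayes {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("mushroom.arff").getDataSet();  // placeholder file name
            data.setClassIndex(0);
            long start = System.currentTimeMillis();
            new NaiveBayes().buildClassifier(data);   // time to build the model once on the full data
            System.out.println("Build time (ms): " + (System.currentTimeMillis() - start));
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
            System.out.printf("Correctly classified: %.4f%%  RMSE: %.4f%n",
                    eval.pctCorrect(), eval.rootMeanSquaredError());
        }
    }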

Page 105: (screen shot)

Page 106:

ii. Explanation of Item

• I'm running out of items to highlight which are particularly meaningful for the example in question.

• Notice that the output includes the Mean absolute error, the Root mean squared error, the Relative absolute error, and the Root relative squared error.

Page 107:

• These differ in magnitude because of the way they're calculated, but they are all indicators of the same general thing.

• As pointed out in the book, when comparing two different data mining approaches, if you compare the same measure for both, you will tend to get a valid comparison regardless of which of the measures you use.

Page 108:

iii. Graphical or Other Special Purpose Additional Output

• ***What this shows:

• Two graphical output screen shots are given below.

• They show a cost-benefit analysis.

• Such an analysis is more appropriate to something like direct mailing, but it is possible to illustrate something by changing one of the parameters in the display.

Page 109:

• Both screen shots show a threshold curve and a cost-benefit curve where the button to minimize cost/benefit has been clicked.

• In the first screen shot the costs of FP and FN are equal, at 1.

• In the second, the cost of a false positive has been raised to 1,000.

• Notice how the shape of the curve changes.

Page 110:

• Roughly speaking, I would interpret the second screenshot to mean that you have effectively no costs as long as you are correctly predicting TP, but your cost rises linearly with the increasing probability of FP predictions later in the data set.
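• One way to make that asymmetry concrete without the cost/benefit screen is to weight the confusion-matrix counts directly. The sketch below reuses a Naive Bayes cross-validation and applies the 1,000-to-1 costs from the example above; the class value codes e and p are the ones listed in the attribute information, and the other placeholder assumptions are unchanged.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CostOfErrors {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("mushroom.arff").getDataSet();  // placeholder file name
            data.setClassIndex(0);
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
            // Confusion matrix: rows are actual classes, columns are predicted classes.
            double[][] cm = eval.confusionMatrix();
            int e = data.classAttribute().indexOfValue("e");  // assumes the one-letter class codes
            int p = data.classAttribute().indexOfValue("p");
            double poisonousCalledEdible = cm[p][e];   // the dangerous error
            double edibleCalledPoisonous = cm[e][p];   // the missed meal
            double totalCost = 1000.0 * poisonousCalledEdible + 1.0 * edibleCalledPoisonous;
            System.out.println("Poisonous predicted edible: " + poisonousCalledEdible);
            System.out.println("Edible predicted poisonous: " + edibleCalledPoisonous);
            System.out.println("Total cost at 1000 to 1:    " + totalCost);
        }
    }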

Page 111: (screen shot)

Page 112: (screen shot)

Page 113:

II. Case 7. Any Kind of Algorithm

• Name of Algorithm: BayesNet

Page 114:

i. Output Results

• ***What this shows:

• This screen shot shows the results of the BayesNet classification algorithm.

Page 115: (screen shot)

Page 116:

ii. Explanation of Item

• This is not a new item to explain, but it is an observation related to the values and results previously obtained.

• The association rule mining algorithm seemed to suggest that there were heavy dependencies among some of the attributes in the data set.

• BayesNet is supposed to take these into account, while Naïve Bayes does not.

Page 117:

• However, when you compare the rate of correctly classified instances, here you get 96.2% vs. 95.8% for Naïve Bayes.

• It seems fair to ask what difference it really made to include the dependencies in the analysis.
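• Since the two percentages come from separate Explorer runs, scripting the two evaluations side by side makes the comparison easier to repeat. The sketch below uses the same placeholder assumptions as before and BayesNet with its default settings.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.BayesNet;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CompareBayes {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("mushroom.arff").getDataSet();  // placeholder file name
            data.setClassIndex(0);
            Evaluation nb = new Evaluation(data);
            nb.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
            Evaluation bn = new Evaluation(data);
            bn.crossValidateModel(new BayesNet(), data, 10, new Random(1));  // default settings
            System.out.printf("NaiveBayes: %.4f%%  BayesNet: %.4f%%%n",
                    nb.pctCorrect(), bn.pctCorrect());
        }
    }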

Page 118:

iii. Graphical or Other Special Purpose Additional Output

• ***What this shows:

• This shows the result of taking the "Visualize cost curve" option on the results of the data mining.

• Honestly, I've about reached the limit of what I understand without further research.

• I present this here without further explanation.

Page 119:

• This is one of the reasons I advertise this sample project write-up as an example of a B, rather than an A effort.

• Everything that has been asked for is included, but in this point, for example, the explanation isn't complete.

• It sure is pretty though…

Page 120: (screen shot)

Page 121:

II. Case 8. Any Kind of Algorithm

• Name of Algorithm: RIDOR

Page 122:

i. Output Results

• ***What this shows:

• This screen shows the results of applying the RIDOR algorithm to the data set.

• RIDOR was the technique based on rules and exceptions.

• Look at the top of the output.

• Here you see clearly that the default classification is edible, with exceptions listed underneath.

Page 123:

• Philosophically, this goes against my point of view on mushrooms.

• The logical default should be inedible, but there are more edible mushrooms in the data set than inedible.

• So it goes.

Page 124: (screen shot)

Page 125:

ii. Explanation of Item

• The last set of items that appears in these output screens consists of the Precision, Recall, F-Measure, and ROC values.

• This is probably not the best example for illustrating what they mean.

• It's apparent that things like recall would be better suited to document retrieval, for example.

Page 126:

• Maybe the best illustration that they don't really apply is that they are all 1 or .999.

• On the other hand, maybe that's realistic for a classification scheme that gives 99.95% correct results.
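• These four figures can also be pulled out programmatically. The sketch below assumes the Ridor classifier is available as weka.classifiers.rules.Ridor (in newer versions of Weka it is installed as a separate package) and reports the statistics for the edible class value, with the same placeholder file name as before.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.Ridor;   // in recent Weka releases this comes from a separate package
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Case8Ridor {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("mushroom.arff").getDataSet();  // placeholder file name
            data.setClassIndex(0);
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new Ridor(), data, 10, new Random(1));
            int e = data.classAttribute().indexOfValue("e");   // statistics for the edible class value
            System.out.printf("Precision: %.3f  Recall: %.3f  F-Measure: %.3f  ROC area: %.3f%n",
                    eval.precision(e), eval.recall(e), eval.fMeasure(e), eval.areaUnderROC(e));
        }
    }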

Page 127:

iii. Graphical or Other Special Purpose Additional Output

• Once again, the fact that this is a "B" example rather than an "A" example comes into play.

• I'm not showing a new bit of graphical output.

• I'm showing the cost curve, like for the previous data mining algorithm.

Page 128:

• The main reason for choosing to show it again is that this picture looks so much like the simple picture in the text that they used to illustrate some of the cost concepts graphically.

Page 129: (screen shot)

Page 130:

III. Choosing the Best Algorithm Among the Results

• Depending on the problem domain and your level of ambition, you might compare algorithms on the basis of lift charts, cost curves, and so on.

• For simple classification, the tools will give results showing the percent classified correctly and the percent classified incorrectly.

Page 131:

• It would be natural to simply choose the one with the highest percent classified correctly.

• However, this is not good enough for credit on this item.

• I have chosen to illustrate what you need to do with a simple basic example.

Page 132:

• I consider the two classification algorithms that gave the highest percent classified correctly.

• I then apply the paired t-test to see whether or not there is actually a statistically significant difference between them.

Page 133:

• If there is, that's the correct basis for preferring one over the other.

• For the purposes of illustration, I do this by hand and explain what I'm doing.

• You may find tools that allow you to make a valid comparison of results.

• That's OK, as long as you explain.

Page 134:

• The point simply is that it's not sufficient to just list a bunch of percents and pick the highest one.

• Illustrate the use of some advanced technique, whether involving concepts like lift charts or cost curves or statistics.

• You may also have noticed that Weka tells you the run time for doing an analysis.

Page 135:

• When making a decision about which algorithm is the best, at a minimum take into account an advanced comparison of the two apparent best, and you may want to make an observation about the apparent complexity or time cost of the algorithms.

Page 136:

III.A. Random Babbling

• The concept of "Cost of classification" seems relevant to this example.

• It takes a human expert to tell if a mushroom is poisonous.

• If you're not an expert, you can tell by eating a mushroom and seeing what happens.

Page 137:

• The cost of finding out that the mushroom is poisonous is about as high as it gets.

• I guess if you're truly dedicated, you'd be willing to die for science.

• Directly related to this is the cost of a misclassification.

• It seems to be on the infinite side…

Page 138:

• The J48 tree approach, given first, even though it's apparently been pruned, still classifies 100% correctly.

• This seems to be at odds with claims made at various points that you don't want a perfect classifier because it will tend to be overtrained.

Page 139:

• On the other hand, since the cost of a misclassification is so high, maybe it would be best to bias the training.

• Lots of false "It's poisonous" results would be desirable.

• I remember learning this rule from my parents:

• Don't eat any wild mushrooms.

Page 140:

• It's also interesting to compare with the commentary provided at the beginning.

• "Experts" who have examined the data wanted to get a minimal rule set.

• They apparently considered that a success.

• But they were willing to live with errors.

• I'm not sure living with errors is consistent with this data set.

Page 141:

III.B. An Application of the Paired t-test

• Pick any two of your results above, identify them and the success rate values they gave, and compare them using the paired t-test.

• Give a statistically valid statement that tells whether or not the two cases you're comparing are significantly different.

Page 142:

• What is shown is my attempt to interpret and apply what the book says about the paired t-test.

• I do not claim that I have necessarily done this correctly.

• Students who have recently taken statistics may reach different conclusions about how this is done.

Page 143:

• However, I have gone through the motions.

• To get credit for this section, you should do the same, whether following my example or following your own understanding.

Page 144:

• I have chosen to compare the percent of correct classifications by Naïve Bayes (NB) and BayesNet (BN) given above.

Page 145:

• Taken from Weka results:
• NB sample mean = 95.8272%
• NB root mean squared error = .1757
• Squaring the value above:
• NB mean squared error = .03087049

Page 146:

• Taken from Weka results:
• BN sample mean = 96.2211%
• BN root mean squared error = .1639
• Squaring the value above:
• BN mean squared error = .02686321

Page 147:

• This is my estimate of the standard deviation of the t statistic where the divisor is 10 because I opted for the default 10-fold cross-validation in Weka:

• Estimate of paired root mean squared error (EPRMSE)

• = square root(( NB mean squared error / 10) + (BN mean squared error / 10))

• = .075982695

Page 148:

• t statistic
• = |NB sample mean – BN sample mean| / EPRMSE
• = |95.8272 – 96.2211| / .075982695
• = 5.184
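• The arithmetic above is easy to slip on by hand, so here is the same computation spelled out in code. The input numbers are the ones quoted from the Weka output; the calculation follows my steps above rather than claiming to be a textbook-perfect paired t-test.

    public class PairedTByHand {
        public static void main(String[] args) {
            // Values quoted from the Weka output above.
            double nbMean = 95.8272, bnMean = 96.2211;   // percent correctly classified
            double nbRmse = 0.1757,  bnRmse = 0.1639;    // root mean squared errors
            double nbMse = nbRmse * nbRmse;              // 0.03087049
            double bnMse = bnRmse * bnRmse;              // 0.02686321
            // Estimate of the standard error with the divisor 10 (10-fold cross-validation).
            double eprmse = Math.sqrt(nbMse / 10 + bnMse / 10);   // about .075982695
            double t = Math.abs(nbMean - bnMean) / eprmse;        // about 5.184
            System.out.printf("EPRMSE = %.9f, t = %.3f%n", eprmse, t);
        }
    }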

Page 149:

• The book says this is a two-tailed test.

• For a 99% confidence interval I want to use a threshold of .5% in each tail.

• The book's table gives a value of 3.25.

Page 150:

• The computed value, 5.184, is greater than the table value of 3.25.

• This means you reject the null hypothesis that the means of the two distributions are the same.

Page 151:

• In other words, you conclude that there is a statistically significant difference between the percent of correct classifications resulting from the Naïve Bayes and the Bayesian Network algorithms on the mushroom data.

Page 152:

The End