![Page 1: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/1.jpg)
Data Mining: Classification
![Page 2: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/2.jpg)
Classification
• What is Classification?
  – Classifying tuples in a database
  – In training set E
    • each tuple consists of the same set of multiple attributes as the tuples in the large database W
    • additionally, each tuple has a known class identity
  – Derive the classification mechanism from the training set E, and then use this mechanism to classify general data (in W)
![Page 3: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/3.jpg)
Learning Phase
• Learning
  – Training data are analyzed by a classification algorithm
  – The class label attribute is credit_rating
  – The classifier is represented in the form of classification rules
![Page 4: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/4.jpg)
Testing Phase
• Testing (Classification)
  – Test data are used to estimate the accuracy of the classification rules
  – If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples
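The testing phase can be sketched as a holdout accuracy estimate. The slides name only the class label credit_rating, so the rule inside `classify`, the `income` attribute, and the test tuples below are all hypothetical stand-ins for whatever the learning phase actually produced.

```python
# Hypothetical rules standing in for the output of the learning phase.
def classify(tuple_):
    """Return a predicted credit_rating for one data tuple (a dict)."""
    if tuple_["income"] == "high":
        return "excellent"
    return "fair"


def accuracy(classifier, test_set):
    """Fraction of test tuples whose predicted class matches the known class."""
    correct = sum(1 for attrs, label in test_set if classifier(attrs) == label)
    return correct / len(test_set)


# Hypothetical test data: (attributes, known credit_rating).
test_set = [
    ({"income": "high"}, "excellent"),
    ({"income": "low"}, "fair"),
    ({"income": "high"}, "fair"),
    ({"income": "low"}, "fair"),
]

estimated = accuracy(classify, test_set)
print(f"estimated accuracy: {estimated:.2f}")
```

Whether the estimate is "acceptable" is a judgment the slides leave open; the test tuples must not be the ones used for learning, otherwise the estimate is optimistic.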
![Page 5: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/5.jpg)
Classification by Decision Tree
A top-down decision tree generation algorithm: ID3 and its extended version C4.5 (Quinlan’93): J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993
![Page 6: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/6.jpg)
Decision Tree Generation
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
• Attribute Selection
  – Favoring the partitioning which makes the majority of examples belong to a single class
• Tree Pruning (Overfitting Problem)
  – Aiming at removing tree branches that may lead to errors when classifying test data
    • Training data may contain noise, …
![Page 7: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/7.jpg)
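Another Example

| No. | Eye | Hair | Height | Oriental |
|---|---|---|---|---|
| 1 | Black | Black | Short | Yes |
| 2 | Black | White | Tall | Yes |
| 3 | Black | White | Short | Yes |
| 4 | Black | Black | Tall | Yes |
| 5 | Brown | Black | Tall | Yes |
| 6 | Brown | White | Short | Yes |
| 7 | Blue | Gold | Tall | No |
| 8 | Blue | Gold | Short | No |
| 9 | Blue | White | Tall | No |
| 10 | Blue | Black | Short | No |
| 11 | Brown | Gold | Short | No |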
![Page 8: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/8.jpg)
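• After the analysis, can you classify the following patterns?
  – (Black, Gold, Tall)
  – (Blue, White, Short)
• Example distributions

| Eye \ Hair, Height | Black, Short | Black, Tall | White, Short | White, Tall | Gold, Short | Gold, Tall |
|---|---|---|---|---|---|---|
| Black | + | + | + | + | | ? |
| Brown | | + | + | | ─ | |
| Blue | ─ | | ? | ─ | ─ | ─ |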
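As a first look at the two query patterns, the eleven training tuples can be tallied by eye colour. This is only a counting sketch, not the tree-building procedure itself; the names `DATA` and `counts` are illustrative.

```python
from collections import Counter, defaultdict

# Training tuples from the example table: (eye, hair, height, oriental).
DATA = [
    ("Black", "Black", "Short", "Yes"),
    ("Black", "White", "Tall", "Yes"),
    ("Black", "White", "Short", "Yes"),
    ("Black", "Black", "Tall", "Yes"),
    ("Brown", "Black", "Tall", "Yes"),
    ("Brown", "White", "Short", "Yes"),
    ("Blue", "Gold", "Tall", "No"),
    ("Blue", "Gold", "Short", "No"),
    ("Blue", "White", "Tall", "No"),
    ("Blue", "Black", "Short", "No"),
    ("Brown", "Gold", "Short", "No"),
]

# Count class labels separately for each eye colour.
counts = defaultdict(Counter)
for eye, hair, height, label in DATA:
    counts[eye][label] += 1

for eye, label_counts in counts.items():
    print(eye, dict(label_counts))
```

Every Black-eyed tuple is labelled Yes and every Blue-eyed tuple is labelled No, which suggests Yes for (Black, Gold, Tall) and No for (Blue, White, Short), although neither exact combination occurs in the training set.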
![Page 9: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/9.jpg)
Decision Tree
![Page 10: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/10.jpg)
Decision Tree
![Page 11: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/11.jpg)
Decision Tree Generation
• Attribute Selection (Split Criterion)
  – Information Gain (ID3/C4.5/See5)
  – Gini Index (CART/IBM Intelligent Miner)
  – Inference Power
• These measures are also called goodness functions and are used to select the attribute to split on at a tree node during the tree generation phase
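To make the idea of a goodness function concrete, here is a minimal sketch of information gain applied to the eye/hair/height example. It uses the standard entropy-based definition; the function names and the tuple layout are assumptions made for illustration.

```python
from collections import Counter
from math import log2

ATTRS = ("Eye", "Hair", "Height")

# Training tuples from the example table; the last field is the class label.
DATA = [
    ("Black", "Black", "Short", "Yes"), ("Black", "White", "Tall", "Yes"),
    ("Black", "White", "Short", "Yes"), ("Black", "Black", "Tall", "Yes"),
    ("Brown", "Black", "Tall", "Yes"), ("Brown", "White", "Short", "Yes"),
    ("Blue", "Gold", "Tall", "No"), ("Blue", "Gold", "Short", "No"),
    ("Blue", "White", "Tall", "No"), ("Blue", "Black", "Short", "No"),
    ("Brown", "Gold", "Short", "No"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the node minus the weighted entropy of its partitions."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in {row[attr_index] for row in rows}:
        subset = [row[-1] for row in rows if row[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: information_gain(DATA, i) for i, name in enumerate(ATTRS)}
best = max(gains, key=gains.get)
print(gains, "-> split on", best)
```

On this data Eye has by far the largest gain (about 0.74 bits, against about 0.40 for Hair and under 0.01 for Height), so it would be chosen at the root.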
![Page 12: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/12.jpg)
Decision Tree Generation
• Branching Scheme
  – Determining the tree branch to which a sample belongs
  – Binary vs. K-ary Splitting
• When to stop the further splitting of a node
  – Impurity Measure
• Labeling Rule
  – A node is labeled as the class to which most samples at the node belong
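The stopping test and the labeling rule can be sketched as below. The slides do not say which impurity measure or threshold is meant, so the misclassification impurity and the `threshold` parameter are assumptions chosen for simplicity.

```python
from collections import Counter


def majority_label(labels):
    """Labeling rule: the class to which most samples at the node belong.

    Ties are broken in favour of the label seen first.
    """
    return Counter(labels).most_common(1)[0][0]


def impurity(labels):
    """Misclassification impurity: share of samples outside the majority class."""
    counts = Counter(labels)
    return 1 - max(counts.values()) / len(labels)


def should_stop(labels, threshold=0.0):
    """Stop splitting once the node is pure enough (threshold is hypothetical)."""
    return impurity(labels) <= threshold


node = ["Yes", "Yes", "No"]  # class labels of the samples at one node
print(majority_label(node), round(impurity(node), 3), should_stop(node))
```

With `threshold=0.0` a node stops only when it is completely pure; a larger value stops earlier and acts as a crude guard against overfitting.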
![Page 13: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/13.jpg)
Decision Tree Generation Algorithm: ID3
(7.1) Entropy
ID: Iterative Dichotomiser
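The slide's own equation did not survive extraction. For reference, the standard entropy and information gain definitions used by ID3 are given here; the symbols $S$, $A$, and $S_v$ are notation assumed for this note, with $p_j$ the relative frequency of class $j$ in $S$:

$$H(S) = -\sum_{j=1}^{n} p_j \log_2 p_j$$

$$\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$

where $S_v$ is the subset of $S$ whose tuples take value $v$ on attribute $A$. ID3 splits each node on the attribute with the largest gain.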
![Page 14: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/14.jpg)
Decision Tree Algorithm: ID3
![Page 15: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/15.jpg)
Decision Tree Algorithm: ID3
![Page 16: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/16.jpg)
Decision Tree Algorithm: ID3
![Page 17: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/17.jpg)
Decision Tree Algorithm: ID3
![Page 18: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/18.jpg)
Decision Tree Algorithm: ID3
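The worked ID3 slides survive only as images, so the following is a minimal sketch of the recursive procedure under stated assumptions: categorical attributes, multiway splits, information gain as the split criterion, and no pruning. It is applied to the eye/hair/height example from earlier in the deck; the tree representation (nested dicts) is an illustrative choice.

```python
from collections import Counter
from math import log2

# Training tuples: (eye, hair, height, class). Attribute indices are 0, 1, 2.
DATA = [
    ("Black", "Black", "Short", "Yes"), ("Black", "White", "Tall", "Yes"),
    ("Black", "White", "Short", "Yes"), ("Black", "Black", "Tall", "Yes"),
    ("Brown", "Black", "Tall", "Yes"), ("Brown", "White", "Short", "Yes"),
    ("Blue", "Gold", "Tall", "No"), ("Blue", "Gold", "Short", "No"),
    ("Blue", "White", "Tall", "No"), ("Blue", "Black", "Short", "No"),
    ("Brown", "Gold", "Short", "No"),
]


def entropy(rows):
    """Shannon entropy (bits) of the class labels in rows."""
    counts = Counter(row[-1] for row in rows)
    return -sum(n / len(rows) * log2(n / len(rows)) for n in counts.values())


def gain(rows, i):
    """Information gain of splitting rows on attribute index i."""
    parts = Counter(row[i] for row in rows)
    remainder = sum(
        n / len(rows) * entropy([r for r in rows if r[i] == v])
        for v, n in parts.items()
    )
    return entropy(rows) - remainder


def id3(rows, attr_indices):
    """Return a class label (leaf) or a dict describing an internal node."""
    labels = [row[-1] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure or no attributes are left to split on.
    if len(set(labels)) == 1 or not attr_indices:
        return majority
    best = max(attr_indices, key=lambda i: gain(rows, i))
    remaining = [i for i in attr_indices if i != best]
    branches = {
        value: id3([r for r in rows if r[best] == value], remaining)
        for value in sorted({r[best] for r in rows})
    }
    return {"attr": best, "default": majority, "branches": branches}


def classify(tree, sample):
    """Follow branches until a leaf; unseen values fall back to the majority."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(sample[tree["attr"]], tree["default"])
    return tree


tree = id3(DATA, [0, 1, 2])
print(classify(tree, ("Black", "Gold", "Tall")))
print(classify(tree, ("Blue", "White", "Short")))
```

The sketch splits on Eye at the root and on Hair under the Brown branch, and it classifies (Black, Gold, Tall) as Yes and (Blue, White, Short) as No.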
![Page 19: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/19.jpg)
Another Example
![Page 20: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/20.jpg)
Another Example
![Page 21: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/21.jpg)
Decision Tree Generation Algorithm: ID3
![Page 22: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/22.jpg)
Decision Tree Generation Algorithm: ID3
![Page 23: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/23.jpg)
Decision Tree Generation Algorithm: ID3
![Page 24: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/24.jpg)
Gini Index
• If a data set $T$ contains examples from $n$ classes, the gini index, $\mathrm{gini}(T)$, is defined as

$$\mathrm{gini}(T) = 1 - \sum_{j=1}^{n} p_j^2$$

  where $p_j$ is the relative frequency of class $j$ in $T$.
• If a data set $T$ is split into two subsets $T_1$ and $T_2$ with sizes $N_1$ and $N_2$ respectively, the gini index of the split data, $\mathrm{gini}_{split}(T)$, is defined as

$$\mathrm{gini}_{split}(T) = \frac{N_1}{N}\,\mathrm{gini}(T_1) + \frac{N_2}{N}\,\mathrm{gini}(T_2)$$

  where $N = N_1 + N_2$ is the size of $T$.
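The gini index and the gini index of a binary split translate directly into code. The sketch below evaluates one candidate split of the eye/hair/height example (Eye = Blue versus the rest); the choice of that split and the function names are illustrative assumptions.

```python
from collections import Counter


def gini(labels):
    """gini(T) = 1 - sum of squared relative class frequencies."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_split(left, right):
    """Size-weighted gini index of a split of T into two subsets."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Class labels of the 11 training tuples, grouped by the candidate split.
blue_eyes = ["No", "No", "No", "No"]                         # Eye = Blue
other_eyes = ["Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]  # Eye != Blue

before = gini(blue_eyes + other_eyes)
after = gini_split(blue_eyes, other_eyes)
print(f"gini before split: {before:.3f}, after split: {after:.3f}")
```

A lower split value means purer subsets, so among candidate splits the one with the smallest gini index of the split would be preferred.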
![Page 25: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/25.jpg)
Inference Power of an Attribute
• A feature that is useful in inferring the group identity of a data tuple is said to have good inference power for that group identity.
• In Table 1, given attributes (features) “Gender”, “Beverage”, “State”, try to find their inference power for “Group id”
![Page 26: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/26.jpg)
![Page 27: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/27.jpg)
![Page 28: Data Mining: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists of the same](https://reader031.vdocuments.us/reader031/viewer/2022032203/56649e305503460f94b215c7/html5/thumbnails/28.jpg)
Generating Classification Rules
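The procedure on this slide is not shown in the transcript. One common approach, assumed here, is to emit one IF-THEN rule per root-to-leaf path of the decision tree. The tree below is a hand-written tree for the eye/hair/height example, and its tuple-based representation is hypothetical.

```python
# Assumed tree for the eye/hair/height example: internal nodes are
# (attribute, {value: subtree}) pairs, leaves are class labels.
TREE = ("Eye", {
    "Black": "Yes",
    "Blue": "No",
    "Brown": ("Hair", {"Black": "Yes", "White": "Yes", "Gold": "No"}),
})


def extract_rules(tree, conditions=()):
    """Return one IF-THEN rule string per root-to-leaf path."""
    if isinstance(tree, str):  # leaf: emit the rule for the path so far
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {antecedent} THEN Oriental = {tree}"]
    attr, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, conditions + ((attr, value),)))
    return rules


rules = extract_rules(TREE)
for rule in rules:
    print(rule)
```

Each rule's antecedent is the conjunction of attribute tests along one path, so the rule set covers exactly the same cases as the tree.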