Classification II
Data Mining Overview
• Data Mining
• Data warehouses and OLAP (On-Line Analytical Processing)
• Association Rules Mining
• Clustering: Hierarchical and Partitional approaches
• Classification: Decision Trees and Bayesian classifiers
Setting
Given old data about customers and payments, predict a new applicant’s loan eligibility.
[Figure: records of previous customers (Age, Salary, Profession, Location, Customer type) train a classifier that produces decision rules such as "Salary > 5 L" and "Prof. = Exec"; the rules are applied to a new applicant’s data to output Good/bad.]
Decision trees
A tree whose internal nodes are simple decision rules on one or more attributes and whose leaf nodes are predicted class labels.
[Figure: example tree with internal nodes "Salary < 1 M", "Prof = teaching" and "Age < 30", and leaves Good, Bad, Bad, Good.]
SLIQ (Supervised Learning In Quest)
Decision-tree classifier for data mining
Design goals:
• Able to handle large disk-resident training sets
• No restrictions on training-set size
Building tree

GrowTree(TrainingData D)
    Partition(D);

Partition(Data D)
    if (all points in D belong to the same class) then
        return;
    for each attribute A do
        evaluate splits on attribute A;
    use best split found to partition D into D1 and D2;
    Partition(D1);
    Partition(D2);
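The recursion above can be sketched in Python. This is a minimal in-memory illustration, not SLIQ itself: it assumes numeric attributes only, splits of the form value < threshold, and the Gini index as the splitting criterion. The names `gini`, `best_split` and `partition` and the nested-dict tree encoding are illustrative choices, not taken from the slides.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum_j p_j^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (score, attribute index, threshold) of the best split, or None."""
    best = None
    n = len(rows)
    for a in range(len(rows[0])):
        # Skip the smallest value: "value < min" would leave the left side empty.
        for threshold in sorted({row[a] for row in rows})[1:]:
            left = [y for row, y in zip(rows, labels) if row[a] < threshold]
            right = [y for row, y in zip(rows, labels) if row[a] >= threshold]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, a, threshold)
    return best

def partition(rows, labels):
    """Partition(D): returns a class label (leaf) or a nested dict (internal node)."""
    if len(set(labels)) == 1:          # all points belong to the same class
        return labels[0]
    split = best_split(rows, labels)
    if split is None:                  # identical rows, mixed labels: majority class
        return Counter(labels).most_common(1)[0][0]
    _, a, threshold = split
    left = [(r, y) for r, y in zip(rows, labels) if r[a] < threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[a] >= threshold]
    return {
        "attribute": a,
        "threshold": threshold,
        "left": partition(*map(list, zip(*left))),
        "right": partition(*map(list, zip(*right))),
    }

# Ages and risk labels of six hypothetical training records.
rows = [(23,), (17,), (43,), (68,), (32,), (20,)]
labels = ["High", "High", "High", "Low", "Low", "High"]
tree = partition(rows, labels)
print(tree["threshold"], tree["left"])
```

The sketch re-scans the data for every candidate threshold, which is exactly the cost that pre-sorted attribute lists are meant to avoid on large training sets.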
Data Setup
• One list for each attribute
• Entries in an attribute list consist of: attribute value, class list index
• A list for the classes with pointers to the tree nodes
• Lists for continuous attributes are in sorted order
• Attribute lists may be disk-resident
• Class list must be in main memory
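Data Setup

Training data:

| Age | Car Type | Risk |
|-----|----------|------|
| 23 | family | High |
| 17 | sports | High |
| 43 | sports | High |
| 68 | family | Low |
| 32 | truck | Low |
| 20 | family | High |

Attribute lists as first built (CLI = class list index):

| Age | CLI | Car Type | CLI |
|-----|-----|----------|-----|
| 23 | 0 | family | 0 |
| 17 | 1 | sports | 1 |
| 43 | 2 | sports | 2 |
| 68 | 3 | family | 3 |
| 32 | 4 | truck | 4 |
| 20 | 5 | family | 5 |

Age attribute list after pre-sorting (the Car Type list is unchanged):

| Age | CLI |
|-----|-----|
| 17 | 1 |
| 20 | 5 |
| 23 | 0 |
| 32 | 4 |
| 43 | 2 |
| 68 | 3 |

Class list (every record starts at the root node N1):

| Index | Risk | Leaf |
|-------|------|------|
| 0 | High | N1 |
| 1 | High | N1 |
| 2 | High | N1 |
| 3 | Low | N1 |
| 4 | Low | N1 |
| 5 | High | N1 |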
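As a rough sketch, the setup step for this example could look as follows in Python. The variable names and the in-memory lists are assumptions made for illustration; in SLIQ the attribute lists may live on disk while only the class list is kept in main memory.

```python
# Training records from the example: (Age, Car Type, Risk).
records = [
    (23, "family", "High"),
    (17, "sports", "High"),
    (43, "sports", "High"),
    (68, "family", "Low"),
    (32, "truck", "Low"),
    (20, "family", "High"),
]

# Class list: one entry per record; every record starts at the root node N1.
class_list = [{"risk": risk, "leaf": "N1"} for _, _, risk in records]

# Attribute lists hold (attribute value, class list index) pairs.
# The continuous attribute is sorted once, up front.
age_list = sorted((age, cli) for cli, (age, _, _) in enumerate(records))
# The categorical attribute keeps record order.
car_type_list = [(car, cli) for cli, (_, car, _) in enumerate(records)]

print(age_list)
# [(17, 1), (20, 5), (23, 0), (32, 4), (43, 2), (68, 3)]
```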
Evaluating Split Points
Gini index: if data D contains examples from c classes,

$$\mathrm{Gini}(D) = 1 - \sum_{j=1}^{c} p_j^2$$

where $p_j$ is the relative frequency of class j in D.
If D is split into D1 and D2 with n1 and n2 tuples respectively (n = n1 + n2),

$$\mathrm{Gini}_{split}(D) = \frac{n_1}{n}\,\mathrm{Gini}(D_1) + \frac{n_2}{n}\,\mathrm{Gini}(D_2)$$

Note: only class frequencies are needed to compute the index.
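A short worked example may help. Take the six training records of the Data Setup example (4 High, 2 Low) and, as an illustrative choice, the split Age < 32, which sends 3 High records to D1 and 1 High plus 2 Low records to D2:

```latex
\begin{align*}
\mathrm{Gini}(D)   &= 1 - \left(\tfrac{4}{6}\right)^2 - \left(\tfrac{2}{6}\right)^2
                    = \tfrac{4}{9} \approx 0.444 \\
\mathrm{Gini}(D_1) &= 1 - \left(\tfrac{3}{3}\right)^2 - \left(\tfrac{0}{3}\right)^2 = 0 \\
\mathrm{Gini}(D_2) &= 1 - \left(\tfrac{1}{3}\right)^2 - \left(\tfrac{2}{3}\right)^2
                    = \tfrac{4}{9} \approx 0.444 \\
\mathrm{Gini}_{split}(D) &= \tfrac{3}{6}\cdot 0 + \tfrac{3}{6}\cdot\tfrac{4}{9}
                    = \tfrac{2}{9} \approx 0.222
\end{align*}
```

The split lowers the index from 0.444 to 0.222 because one child becomes pure.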
Finding Split Points
For each attribute A do
    evaluate splits on attribute A using its attribute list
Key idea: to evaluate splits on a numerical attribute, the records at each node must be scanned in sorted order. If all attribute lists are pre-sorted once, no sorting is needed during the tree construction phase.
Keep the split with the lowest Gini index.
Finding Split Points: Continuous Attrib.
• Consider splits of the form: value(A) < x. Example: Age < 17
• Evaluate this split form for every value in an attribute list
To evaluate splits on attribute A for a given tree node:
    initialize class histograms of left and right children;
    for each record in the attribute list do
        find the corresponding entry in the Class List, and from it the class and leaf node;
        evaluate splitting index for value(A) < record.value;
        update the class histogram in the leaf;
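Example: evaluating splits on Age at node N1 by scanning the sorted Age list

| Age | CLI |
|-----|-----|
| 17 | 1 |
| 20 | 5 |
| 23 | 0 |
| 32 | 4 |
| 43 | 2 |
| 68 | 3 |

Class list:

| Index | Class | Leaf |
|-------|-------|------|
| 0 | High | N1 |
| 1 | High | N1 |
| 2 | High | N1 |
| 3 | Low | N1 |
| 4 | Low | N1 |
| 5 | High | N1 |

Class histograms and Gini index for the candidate splits shown:

| Position | Split | Left (High, Low) | Right (High, Low) | Gini index |
|----------|-------|------------------|-------------------|------------|
| 0 | Age < 17 | 0, 0 | 4, 2 | undefined |
| 1 | Age < 20 | 1, 0 | 3, 2 | 0.40 |
| 3 | Age < 32 | 3, 0 | 1, 2 | 0.22 |
| 4 | Age < 43 | 3, 1 | 1, 1 | 0.42 |

Best split: Age < 32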
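The scan can be sketched in Python as a single pass that moves one record at a time from the right histogram to the left one. This is a simplified illustration under stated assumptions: all records sit at one leaf (SLIQ updates the histograms of every current leaf in the same pass), attribute values are distinct, and the function name `evaluate_continuous_splits` is hypothetical.

```python
from collections import Counter

def gini(hist):
    """Gini index from a class histogram {label: count}."""
    n = sum(hist.values())
    return 1.0 - sum((c / n) ** 2 for c in hist.values())

# Pre-sorted Age attribute list: (value, class list index).
age_list = [(17, 1), (20, 5), (23, 0), (32, 4), (43, 2), (68, 3)]
# Class list, indexed by class list index: (class, leaf).
class_list = [("High", "N1"), ("High", "N1"), ("High", "N1"),
              ("Low", "N1"), ("Low", "N1"), ("High", "N1")]

def evaluate_continuous_splits(attr_list, class_list):
    """One pass over a sorted attribute list; returns {threshold: Gini index}."""
    left = Counter()                                   # records with value < threshold
    right = Counter(label for label, _ in class_list)  # all records start on the right
    n = len(attr_list)
    scores = {}
    for value, cli in attr_list:
        n_left = sum(left.values())
        if n_left > 0:  # with an empty left child the index is undefined
            scores[value] = (n_left / n) * gini(left) + ((n - n_left) / n) * gini(right)
        label, _leaf = class_list[cli]
        left[label] += 1    # move this record from the right child to the left
        right[label] -= 1
    return scores

scores = evaluate_continuous_splits(age_list, class_list)
best = min(scores, key=scores.get)
print(best)  # 32
```

Because the histograms are updated incrementally, each candidate threshold costs only a lookup in the class list plus a constant-size Gini computation.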
Finding Split Points: Categorical Attrib.
• Consider splits of the form: value(A) ∈ {x1, x2, ..., xn}. Example: CarType ∈ {family, sports}
• Evaluate this split form for subsets of domain(A)
To evaluate splits on attribute A for a given tree node:
    initialize class/value matrix of node to zeroes;
    for each record in the attribute list do
        increment appropriate count in matrix;
    evaluate splitting index for various subsets using the constructed matrix;
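Example: evaluating splits on Car Type at node N1

Car Type attribute list joined with the class list:

| Car Type | CLI | Risk | Leaf |
|----------|-----|------|------|
| family | 0 | High | N1 |
| sports | 1 | High | N1 |
| sports | 2 | High | N1 |
| family | 3 | Low | N1 |
| truck | 4 | Low | N1 |
| family | 5 | High | N1 |

Class/value matrix:

| Car Type | High | Low |
|----------|------|-----|
| family | 2 | 1 |
| sports | 2 | 0 |
| truck | 0 | 1 |

Candidate splits:

| Split | Left child (High, Low) | Right child (High, Low) | Gini index |
|-------|------------------------|-------------------------|------------|
| CarType in {family} | 2, 1 | 2, 1 | 0.444 |
| CarType in {truck} | 0, 1 | 4, 1 | 0.267 |
| CarType in {sports} | 2, 0 | 2, 2 | 0.333 |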
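The matrix-based evaluation can be sketched in Python as below. The helper names are hypothetical, and the exhaustive enumeration of subsets is an assumption that is only reasonable for small domains, since the number of subsets grows exponentially with the number of distinct values.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Car Type attribute list: (value, class list index).
car_type_list = [("family", 0), ("sports", 1), ("sports", 2),
                 ("family", 3), ("truck", 4), ("family", 5)]
# Risk class per class list index (all records are at node N1).
class_list = ["High", "High", "High", "Low", "Low", "High"]

def gini(hist):
    """Gini index from a class histogram {label: count}."""
    n = sum(hist.values())
    return 1.0 - sum((c / n) ** 2 for c in hist.values())

def class_value_matrix(attr_list, class_list):
    """Count records per (attribute value, class) pair in one scan."""
    matrix = defaultdict(Counter)
    for value, cli in attr_list:
        matrix[value][class_list[cli]] += 1
    return matrix

def evaluate_subset_splits(matrix):
    """Gini index of 'value in S' for every non-empty proper subset S."""
    values = sorted(matrix)
    total = sum(matrix.values(), Counter())
    n = sum(total.values())
    scores = {}
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            left = sum((matrix[v] for v in subset), Counter())
            right = total - left          # class counts of the complement
            n_left = sum(left.values())
            scores[subset] = n_left / n * gini(left) + (n - n_left) / n * gini(right)
    return scores

matrix = class_value_matrix(car_type_list, class_list)
scores = evaluate_subset_splits(matrix)
print(min(scores, key=scores.get))  # ('truck',)
```

A subset and its complement describe the same partition, so this sketch scores each split twice; a tighter version would enumerate only half of the subsets.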
![Page 15: Classification II](https://reader035.vdocuments.us/reader035/viewer/2022062305/56816331550346895dd3b324/html5/thumbnails/15.jpg)
Updating the Class List
The next step is to update the Class List with the new nodes: scan the attribute list that is used to split and update the corresponding leaf entry in the Class List.
For each attribute A used in a split
    traverse the attribute list of A
    for each value u in the attribute list
        find the corresponding entry e in the class list
        find the new class c to which u belongs
        update the class list for e to c
        update node reference in e to the node corresponding to class c
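A minimal Python sketch of this update, assuming the split Age < 32 was chosen at N1. The child node names N2 and N3 and the function `update_class_list` are hypothetical, introduced only for illustration.

```python
# Sorted Age attribute list: (value, class list index).
age_list = [(17, 1), (20, 5), (23, 0), (32, 4), (43, 2), (68, 3)]
# Class list: every record currently points at leaf N1.
class_list = [{"risk": r, "leaf": "N1"}
              for r in ["High", "High", "High", "Low", "Low", "High"]]

def update_class_list(attr_list, class_list, node, threshold, left_child, right_child):
    """Re-point records at `node` to a child using the test value < threshold."""
    for value, cli in attr_list:
        entry = class_list[cli]
        if entry["leaf"] != node:
            continue  # record belongs to a different leaf; leave it alone
        entry["leaf"] = left_child if value < threshold else right_child

update_class_list(age_list, class_list, "N1", 32, "N2", "N3")
print([entry["leaf"] for entry in class_list])
# ['N2', 'N2', 'N3', 'N3', 'N3', 'N2']
```

Only the attribute list of the splitting attribute is scanned; the other attribute lists stay untouched because they reach the new leaves through the class list index.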
Preventing overfitting
• A tree T overfits if there is another tree T’ that gives higher error on the training data yet gives lower error on unseen data.
• An overfitted tree does not generalize to unseen instances.
• Happens when the data contains noise or irrelevant attributes and the training set is small.
• Overfitting can reduce accuracy drastically: 10-25%, as reported in Mingers (1989), Machine Learning.
Approaches to prevent overfitting
Two approaches:
• Stop growing the tree beyond a certain point
• First over-fit, then post-prune (more widely used). Tree building is divided into two phases: a growth phase and a prune phase.
It is hard to decide when to stop growing the tree, so the second approach is more widely used.
Criteria for finding correct final tree size
Three criteria:
• Cross validation with separate test data
• Use some criteria function to choose the best size. Example: the minimum description length (MDL) criterion
• Statistical bounds: use all data for training but apply a statistical test to decide the right size
The minimum description length principle (MDL)
• MDL: a paradigm for statistical estimation, particularly model selection
• Given data D and a class of models M, our goal is to choose a model m in M such that data and model can be encoded using the smallest total length: L(D) = L(D|m) + L(m)
• How to find the encoding length? The answer is in information theory.
• Consider the problem of transmitting n messages, where p_i is the probability of seeing message i
• Shannon’s theorem: the expected length is minimized by assigning -log p_i bits to message i
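To make the code-length rule concrete, here is a small Python sketch. The message probabilities are made up for illustration, and logarithms are taken base 2 so that lengths are in bits.

```python
import math

def code_lengths(probabilities):
    """Ideal code length, in bits, for each message: -log2(p_i)."""
    return [-math.log2(p) for p in probabilities]

def expected_length(probabilities):
    """Expected bits per message under the ideal code lengths."""
    return sum(p * length
               for p, length in zip(probabilities, code_lengths(probabilities)))

probs = [0.5, 0.25, 0.125, 0.125]   # hypothetical message probabilities
print(code_lengths(probs))          # [1.0, 2.0, 3.0, 3.0]
print(expected_length(probs))       # 1.75
```

Frequent messages get short codes and rare ones get long codes, so the average of 1.75 bits beats the 2 bits a fixed-length code would need for four messages.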
Encoding data
• Assume t records of training data D
• First send tree m using L(m|M) bits
• Assume all but the class labels of the training data are known
• Goal: transmit the class labels using L(D|m) bits
• If the tree correctly predicts an instance, 0 bits
• Otherwise, log k bits, where k is the number of classes
• Thus, if there are e errors on the training data, the total cost is e log k + L(m|M) bits
• A complex tree will have higher L(m|M) but lower e
• Question: how to encode the tree?
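The trade-off can be illustrated with a short Python sketch. The three candidate trees, their error counts and their encoding lengths L(m|M) are invented numbers, since the encoding of the tree is left open here; logarithms are base 2.

```python
import math

def mdl_cost(errors, num_classes, tree_bits):
    """Total description length in bits: e * log2(k) + L(m|M)."""
    return errors * math.log2(num_classes) + tree_bits

# Hypothetical candidates: name -> (training errors e, tree encoding length L(m|M)).
candidates = {
    "stump": (30, 5.0),
    "medium tree": (8, 20.0),
    "large tree": (1, 60.0),
}
k = 2  # number of classes

costs = {name: mdl_cost(e, k, bits) for name, (e, bits) in candidates.items()}
best = min(costs, key=costs.get)
print(costs)  # {'stump': 35.0, 'medium tree': 28.0, 'large tree': 61.0}
print(best)   # medium tree
```

With these numbers the medium tree wins: the stump pays too much for its errors and the large tree pays too much for its own description.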
Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
Example:
    IF age = “<=30” AND student = “no” THEN buys_computer = “no”
    IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
    IF age = “31…40” THEN buys_computer = “yes”
    IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
    IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
SPRINT
• An improvement over SLIQ
• Does not need to keep a list in main memory
• Parallel version is straightforward
• Attribute lists are extended with a class field, so no Class List is needed
• Uses hashing to assign records to classes and nodes
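The difference in data layout can be sketched in Python using the six-record example from the Data Setup slide. This is an illustration under assumptions, not the SPRINT implementation: each attribute-list entry is taken to be (value, class label, record id), the split Age < 32 is assumed to have been chosen, and `split_lists` is a hypothetical helper.

```python
# SPRINT-style attribute lists: (value, class label, record id); no class list.
age_list = [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)]
car_type_list = [("family", "High", 0), ("sports", "High", 1), ("sports", "High", 2),
                 ("family", "Low", 3), ("truck", "Low", 4), ("family", "High", 5)]

def split_lists(winning_list, other_lists, test):
    """Split the winning attribute list by `test`, then route the other
    attribute lists to the same children through a record-id hash table."""
    goes_left = {rid: test(value) for value, _label, rid in winning_list}

    def route(attr_list):
        left = [entry for entry in attr_list if goes_left[entry[2]]]
        right = [entry for entry in attr_list if not goes_left[entry[2]]]
        return left, right

    return route(winning_list), [route(lst) for lst in other_lists]

(age_left, age_right), [(car_left, car_right)] = split_lists(
    age_list, [car_type_list], lambda age: age < 32)
print(car_left)
# [('family', 'High', 0), ('sports', 'High', 1), ('family', 'High', 5)]
```

Each child receives its own attribute lists with the original order preserved, so the sorted lists never need re-sorting.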
Pros and Cons of decision trees
• Cons
– Cannot handle complicated relationships between features
– Simple decision boundaries
– Problems with lots of missing data
• Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle a large number of features
More information: http://www.recursive-partitioning.com/