TRANSCRIPT
Classification
supplemental
Scalable Decision Tree Induction Methods in Data Mining Studies
• SLIQ (EDBT’96 — Mehta et al.)
  – builds an index for each attribute; only the class list and the current attribute list reside in memory
• SPRINT (VLDB’96 — J. Shafer et al.)
  – constructs an attribute list data structure
• PUBLIC (VLDB’98 — Rastogi & Shim)
  – integrates tree splitting and tree pruning: stops growing the tree earlier
• RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
  – separates the scalability aspects from the criteria that determine the quality of the tree
  – builds an AVC-list (attribute, value, class label)
SPRINT
For large data sets.
Training set:
Age  Car Type  Risk
23   Family    H
17   Sports    H
43   Sports    H
68   Family    L
32   Truck     L
20   Family    H

Resulting decision tree:
Age < 25?
  yes: H
  no:  Car Type = Sports?
         yes: H
         no:  L
Gini Index (IBM IntelligentMiner)
• If a data set T contains examples from n classes, the gini index gini(T) is defined as

    gini(T) = 1 - sum_j (p_j)^2

  where p_j is the relative frequency of class j in T.
• If T is split into two subsets T1 and T2, with N1 and N2 examples respectively (N = N1 + N2), the gini index of the split is defined as

    gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)

• The attribute that provides the smallest gini_split(T) is chosen to split the node (all possible splitting points must be enumerated for each attribute).
SPRINT
Partition(S):
  if all points of S are in the same class:
    return
  else:
    for each attribute A:
      evaluate_splits on A
    use the best split to partition S into S1 and S2
    Partition(S1)
    Partition(S2)
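A minimal in-memory Python sketch of this recursion (not SPRINT's out-of-core attribute-list machinery; `evaluate_splits` here is a hypothetical stand-in for the per-attribute split search):

```python
def partition(points, labels, evaluate_splits):
    """Recursive tree growth as in the pseudocode above.

    evaluate_splits(points, labels) must return a boolean mask selecting S1.
    """
    if len(set(labels)) <= 1:                   # all points in the same class
        return {"leaf": labels[0]}
    mask = evaluate_splits(points, labels)      # best split over all attributes
    s1 = [i for i, m in enumerate(mask) if m]
    s2 = [i for i, m in enumerate(mask) if not m]
    return {
        "left":  partition([points[i] for i in s1],
                           [labels[i] for i in s1], evaluate_splits),
        "right": partition([points[i] for i in s2],
                           [labels[i] for i in s2], evaluate_splits),
    }

# With an assumed split test Age <= 25 on a toy list:
tree = partition([17, 20, 32], ["H", "H", "L"],
                 lambda pts, _: [v <= 25 for v in pts])
print(tree)  # {'left': {'leaf': 'H'}, 'right': {'leaf': 'L'}}
```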
SPRINT Data Structures
Age attribute list:
Age  Risk  Tuple
17   H     1
20   H     5
23   H     0
32   L     4
43   H     3
68   L     2

Car Type attribute list:
Car Type  Risk  Tuple
Family    H     0
Sports    H     1
Sports    H     2
Family    L     3
Truck     L     4
Family    H     5

Training set:
Age  Car Type  Risk
23   Family    H
17   Sports    H
43   Sports    H
68   Family    L
32   Truck     L
20   Family    H
Splits

Candidate split: Age < 27.5

Age list:
Age  Risk  Tuple
17   H     1
20   H     5
23   H     0
32   L     4
43   H     3
68   L     2

Car Type list:
Car Type  Risk  Tuple
Family    H     0
Sports    H     1
Sports    H     2
Family    L     3
Truck     L     4
Family    H     5

Group 1 (Age < 27.5):
Age list: 17 H 1 / 20 H 5 / 23 H 0
Car Type list: Family H 0 / Sports H 1 / Family H 5

Group 2 (Age >= 27.5):
Age list: 32 L 4 / 43 H 3 / 68 L 2
Car Type list: Sports H 2 / Family L 3 / Truck L 4
Histograms
• For continuous attributes.
• Two histograms are associated with each node: C_below (class counts of the tuples already processed) and C_above (class counts of the tuples still to process).

Example
Age attribute list:
Age  Risk  Tuple
17   H     1
20   H     5
23   H     0
32   L     4
43   H     3
68   L     2

Position 0 (no tuples below the split):
          H  L
C_below   0  0
C_above   4  2
ginisplit0 = 0/6 gini(S1) + 6/6 gini(S2)
gini(S2) = 1 - [(4/6)^2 + (2/6)^2] = 0.444
ginisplit0 = 0.444

Position 1:
          H  L
C_below   1  0
C_above   3  2
ginisplit1 = 1/6 gini(S1) + 5/6 gini(S2)
gini(S1) = 1 - (1/1)^2 = 0
gini(S2) = 1 - [(3/5)^2 + (2/5)^2] = 0.480
ginisplit1 = 0.400

Position 2:
          H  L
C_below   2  0
C_above   2  2
ginisplit2 = 2/6 gini(S1) + 4/6 gini(S2)
gini(S1) = 1 - (2/2)^2 = 0
gini(S2) = 1 - [(2/4)^2 + (2/4)^2] = 0.5
ginisplit2 = 0.333

Position 3:
          H  L
C_below   3  0
C_above   1  2
ginisplit3 = 3/6 gini(S1) + 3/6 gini(S2)
gini(S1) = 1 - (3/3)^2 = 0
gini(S2) = 1 - [(1/3)^2 + (2/3)^2] = 0.444
ginisplit3 = 0.222

Position 4:
          H  L
C_below   3  1
C_above   1  1
ginisplit4 = 4/6 gini(S1) + 2/6 gini(S2)
gini(S1) = 1 - [(3/4)^2 + (1/4)^2] = 0.375
gini(S2) = 1 - [(1/2)^2 + (1/2)^2] = 0.5
ginisplit4 = 0.417

Position 5:
          H  L
C_below   4  1
C_above   0  1
ginisplit5 = 5/6 gini(S1) + 1/6 gini(S2)
gini(S1) = 1 - [(4/5)^2 + (1/5)^2] = 0.320
gini(S2) = 1 - (1/1)^2 = 0
ginisplit5 = 0.267

Position 6 (all tuples below the split):
          H  L
C_below   4  2
C_above   0  0
ginisplit6 = 6/6 gini(S1) + 0/6 gini(S2)
gini(S1) = 1 - [(4/6)^2 + (2/6)^2] = 0.444
ginisplit6 = 0.444

The minimum is ginisplit3 = 0.222, so the best continuous split is Age <= 27.5 (the midpoint of 23 and 32).
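The whole sweep above can be reproduced in one pass over the sorted Age list, moving each tuple's class count from C_above to C_below (a Python sketch with illustrative names):

```python
from collections import Counter

def gini(counts, n):
    """Gini from a class-count histogram over n tuples."""
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

ages   = [17, 20, 23, 32, 43, 68]           # sorted Age attribute list
labels = ["H", "H", "H", "L", "H", "L"]     # Risk of each tuple, same order
n = len(ages)

below, above = Counter(), Counter(labels)   # C_below empty, C_above = all tuples
best = None                                 # (gini_split, threshold)
for i in range(n + 1):                      # candidate split before position i
    g = i / n * gini(below, i) + (n - i) / n * gini(above, n - i)
    if 0 < i < n:                           # midpoint thresholds only
        thresh = (ages[i - 1] + ages[i]) / 2
        if best is None or g < best[0]:
            best = (g, thresh)
    if i < n:                               # move one tuple below the split
        below[labels[i]] += 1
        above[labels[i]] -= 1

print(best)  # → (0.222..., 27.5): the minimum gini_split is at Age <= 27.5
```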
Splitting categorical attributes
• A single scan through the attribute list collects the class counts into a count matrix: one row per attribute value, one column per class label.

Example
Car Type attribute list:
Car Type  Risk  Tuple
Family    H     0
Sports    H     1
Sports    H     2
Family    L     3
Truck     L     4
Family    H     5

Count matrix:
          H  L
Family    2  1
Sports    2  0
Truck     0  1

ginisplit(Family) = 3/6 gini(S1) + 3/6 gini(S2)
gini(S1) = 1 - [(2/3)^2 + (1/3)^2] = 4/9
gini(S2) = 1 - [(2/3)^2 + (1/3)^2] = 4/9
ginisplit(Family) = 0.444

ginisplit(Sports) = 2/6 gini(S1) + 4/6 gini(S2)
gini(S1) = 1 - (2/2)^2 = 0
gini(S2) = 1 - [(2/4)^2 + (2/4)^2] = 0.5
ginisplit(Sports) = 0.333

ginisplit(Truck) = 1/6 gini(S1) + 5/6 gini(S2)
gini(S1) = 1 - (1/1)^2 = 0
gini(S2) = 1 - [(4/5)^2 + (1/5)^2] = 0.32
ginisplit(Truck) = 0.267
Best categorical split: Car Type = Truck (ginisplit = 0.267)
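The count matrix and the three candidate splits above, as a Python sketch (one pass builds the matrix, then each value-vs-rest split is scored):

```python
from collections import Counter, defaultdict

def gini(counts):
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

# Car Type attribute list from the slide: (value, class label).
pairs = [("Family", "H"), ("Sports", "H"), ("Sports", "H"),
         ("Family", "L"), ("Truck", "L"), ("Family", "H")]

matrix = defaultdict(Counter)                 # value -> class-count row
for car, risk in pairs:
    matrix[car][risk] += 1

total = Counter(risk for _, risk in pairs)
n = len(pairs)
scores = {}
for value, inside in matrix.items():          # split: Car Type == value vs. rest
    outside = total - inside                  # Counter subtraction drops zeros
    n1 = sum(inside.values())
    scores[value] = n1 / n * gini(inside) + (n - n1) / n * gini(outside)

print({v: round(g, 3) for v, g in scores.items()})
# {'Family': 0.444, 'Sports': 0.333, 'Truck': 0.267}
```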
Example (2 attributes)
The winner is Age <= 27.5 (ginisplit = 0.222, smaller than 0.267 for Car Type = Truck).

Attribute lists before the split:
Age list: 17 H 1 / 20 H 5 / 23 H 0 / 32 L 4 / 43 H 3 / 68 L 2
Car Type list: Family H 0 / Sports H 1 / Sports H 2 / Family L 3 / Truck L 4 / Family H 5

Age <= 27.5?
  Y: all tuples are class H -> leaf H
  N: partition the attribute lists and recurse

Attribute lists on the N branch:
Age list: 32 L 4 / 43 H 3 / 68 L 2
Car Type list: Sports H 2 / Family L 3 / Truck L 4
Example for Bayes Rules
• The patient either has cancer or does not.
• Prior knowledge: over the entire population, 0.008 of people have cancer.
• The lab test result (+ or -) is imperfect. It returns:
  – a correct positive result in only 98% of the cases in which the cancer is actually present
  – a correct negative result in only 97% of the cases in which the cancer is not present
• Suppose the lab test returns + for a new patient. Should we diagnose cancer?
Example for Bayes Rules
Pr(cancer) = 0.008          Pr(not cancer) = 0.992
Pr(+ | cancer) = 0.98       Pr(- | cancer) = 0.02
Pr(+ | not cancer) = 0.03   Pr(- | not cancer) = 0.97

Pr(+ | cancer) Pr(cancer) = 0.98 * 0.008 = 0.0078
Pr(+ | not cancer) Pr(not cancer) = 0.03 * 0.992 = 0.0298
Hence, Pr(cancer | +) = 0.0078 / (0.0078 + 0.0298) = 0.21
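The arithmetic above, spelled out in a short Python sketch (variable names are illustrative):

```python
# Values from the slide.
p_cancer, p_not = 0.008, 0.992
p_pos_given_cancer = 0.98
p_pos_given_not = 0.03

# Bayes rule: Pr(cancer | +) = Pr(+ | cancer) Pr(cancer) / Pr(+).
num = p_pos_given_cancer * p_cancer          # 0.00784
den = num + p_pos_given_not * p_not          # 0.00784 + 0.02976 = 0.0376
p_cancer_given_pos = num / den

print(round(p_cancer_given_pos, 2))  # 0.21
```

Despite the positive test, the posterior probability of cancer is only about 21%, because the disease is so rare that false positives outnumber true positives.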