
Page 1:

Classification supplemental

Page 2:

Scalable Decision Tree Induction Methods in Data Mining Studies

• SLIQ (EDBT’96 — Mehta et al.)
– builds an index for each attribute; only the class list and the current attribute list reside in memory
• SPRINT (VLDB’96 — J. Shafer et al.)
– constructs an attribute-list data structure
• PUBLIC (VLDB’98 — Rastogi & Shim)
– integrates tree splitting and tree pruning: stops growing the tree earlier
• RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
– separates the scalability aspects from the criteria that determine the quality of the tree
– builds an AVC-list (attribute, value, class label), e.g., as sketched below
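A minimal Python sketch of the AVC idea (variable names are mine; this is not RainForest's actual interface): one pass aggregates (attribute value, class label) counts, and split evaluation then needs only these compact counts rather than the raw tuples.

from collections import Counter

# One attribute column of the slides' toy table: (Car Type, Risk) pairs.
rows = [("Family", "H"), ("Sports", "H"), ("Sports", "H"),
        ("Family", "L"), ("Truck", "L"), ("Family", "H")]

# AVC counts for the Car Type attribute: {(value, class_label): count}.
# Any split on this attribute can be scored from these counts alone.
avc = Counter(rows)
for (value, label), count in sorted(avc.items()):
    print(value, label, count)   # Family H 2, Family L 1, Sports H 2, Truck L 1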

Page 3:

SPRINT

A decision-tree classifier designed for large data sets.

Example training set:

Age  Car Type  Risk
23   Family    H
17   Sports    H
43   Sports    H
68   Family    L
32   Truck     L
20   Family    H

Resulting decision tree:

Age < 25?
yes -> H
no  -> Car Type = Sports?
       yes -> H
       no  -> L

Page 4:

Gini Index (IBM IntelligentMiner)

• If a data set T contains examples from n classes, the gini index gini(T) is defined as

gini(T) = 1 - sum_{j=1..n} p_j^2

where p_j is the relative frequency of class j in T.

• If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively (N = N1 + N2), the gini index of the split data is defined as

gini_split(T) = (N1/N) * gini(T1) + (N2/N) * gini(T2)

• The attribute providing the smallest gini_split(T) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).
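Both formulas fit in a few lines of Python (helper names are my own, not IntelligentMiner's API):

def gini(class_counts):
    """gini(T) = 1 - sum_j p_j^2 for the class counts in T."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def gini_split(counts1, counts2):
    """Weighted gini of a binary split: (N1/N)*gini(T1) + (N2/N)*gini(T2)."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)

# Class counts from the slides' training set: 4 H and 2 L.
print(gini([4, 2]))                # 0.444...
print(gini_split([3, 0], [1, 2]))  # 0.222... (3 H below vs 1 H / 2 L above)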

Page 5:

SPRINT

Partition(S):
    if all points of S are in the same class:
        return
    else:
        for each attribute A:
            evaluate_splits on A
        use the best split to partition S into S1, S2
        Partition(S1)
        Partition(S2)
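A runnable sketch of this recursion, assuming an in-memory row table rather than SPRINT's disk-resident attribute lists, and (for brevity) only equality splits rather than numeric thresholds:

from collections import Counter

def gini(rows):
    n = len(rows)
    return 1.0 - sum((c / n) ** 2 for c in Counter(r[-1] for r in rows).values())

def partition(rows, attrs, depth=0):
    if len({r[-1] for r in rows}) == 1:        # all points in the same class
        print("  " * depth + "leaf: " + rows[0][-1])
        return
    best = None                                 # (gini_split, attr, value)
    for a in attrs:                             # evaluate_splits on each attribute
        for v in {r[a] for r in rows}:
            s1 = [r for r in rows if r[a] == v]
            s2 = [r for r in rows if r[a] != v]
            if s1 and s2:
                g = len(s1) / len(rows) * gini(s1) + len(s2) / len(rows) * gini(s2)
                if best is None or g < best[0]:
                    best = (g, a, v)
    if best is None:                            # no useful split left: stop
        print("  " * depth + "leaf (mixed)")
        return
    g, a, v = best
    print("  " * depth + f"split on attr {a} == {v!r} (gini_split {g:.3f})")
    partition([r for r in rows if r[a] == v], attrs, depth + 1)   # S1
    partition([r for r in rows if r[a] != v], attrs, depth + 1)   # S2

# Training set from the slides: (Age, Car Type, Risk).
rows = [(23, "Family", "H"), (17, "Sports", "H"), (43, "Sports", "H"),
        (68, "Family", "L"), (32, "Truck", "L"), (20, "Family", "H")]
partition(rows, attrs=[0, 1])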

Page 6:

SPRINT Data Structures

Training set:

Tuple  Age  Car Type  Risk
0      23   Family    H
1      17   Sports    H
2      43   Sports    H
3      68   Family    L
4      32   Truck     L
5      20   Family    H

Attribute lists (the Age list is sorted on the attribute value):

Age  Risk  Tuple
17   H     1
20   H     5
23   H     0
32   L     4
43   H     2
68   L     3

Car Type  Risk  Tuple
Family    H     0
Sports    H     1
Sports    H     2
Family    L     3
Truck     L     4
Family    H     5
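Building these lists takes one pass plus a sort per continuous attribute; a minimal sketch:

# Each attribute-list entry is (attribute value, class label, tuple id);
# continuous attributes (Age) are sorted once up front.
training = [(23, "Family", "H"), (17, "Sports", "H"), (43, "Sports", "H"),
            (68, "Family", "L"), (32, "Truck", "L"), (20, "Family", "H")]

age_list = sorted((age, risk, tid) for tid, (age, car, risk) in enumerate(training))
car_list = [(car, risk, tid) for tid, (age, car, risk) in enumerate(training)]

print(age_list)
# [(17, 'H', 1), (20, 'H', 5), (23, 'H', 0), (32, 'L', 4), (43, 'H', 2), (68, 'L', 3)]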

Page 7:

Splits

Splitting the attribute lists on Age < 27.5:

Age  Risk  Tuple
17   H     1
20   H     5
23   H     0
32   L     4
43   H     2
68   L     3

Group 1 (Age < 27.5):

Age  Risk  Tuple
17   H     1
20   H     5
23   H     0

Car Type  Risk  Tuple
Family    H     0
Sports    H     1
Family    H     5

Group 2 (Age >= 27.5):

Age  Risk  Tuple
32   L     4
43   H     2
68   L     3

Car Type  Risk  Tuple
Sports    H     2
Family    L     3
Truck     L     4
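A sketch of how the lists above are distributed: the winning attribute's list is cut directly at the threshold, and every other attribute list is partitioned by probing a hash set of the tuple ids that went left, mirroring the hash table SPRINT uses for exactly this step:

age_list = [(17, "H", 1), (20, "H", 5), (23, "H", 0),
            (32, "L", 4), (43, "H", 2), (68, "L", 3)]
car_list = [("Family", "H", 0), ("Sports", "H", 1), ("Sports", "H", 2),
            ("Family", "L", 3), ("Truck", "L", 4), ("Family", "H", 5)]

left_age  = [e for e in age_list if e[0] < 27.5]
right_age = [e for e in age_list if e[0] >= 27.5]
left_ids  = {tid for _, _, tid in left_age}        # hash set of tuple ids

left_car  = [e for e in car_list if e[2] in left_ids]
right_car = [e for e in car_list if e[2] not in left_ids]
print(left_car)   # [('Family', 'H', 0), ('Sports', 'H', 1), ('Family', 'H', 5)]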

Page 8:

Histograms

For continuous attributes, two histograms are associated with each node:

C_below — the records already processed
C_above — the records still to be processed
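A sketch of the scan over the sorted Age list, moving each record's class count from C_above to C_below; every boundary between records is a candidate split scored from the two histograms:

from collections import Counter

age_list = [(17, "H"), (20, "H"), (23, "H"), (32, "L"), (43, "H"), (68, "L")]

c_below = Counter()                                  # already processed
c_above = Counter(label for _, label in age_list)    # still to process

for value, label in age_list:
    c_below[label] += 1
    c_above[label] -= 1
    print(f"after {value}: C_below={dict(c_below)} C_above={dict(+c_above)}")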

Page 9:

Example

Age attribute list:

Age  Risk  Tuple
17   H     1
20   H     5
23   H     0
32   L     4
43   H     2
68   L     3

Split 0 (before 17):
     H  L
C_b  0  0
C_a  4  2
gini_split0 = 0/6 gini(S1) + 6/6 gini(S2)
gini(S2) = 1 - [(4/6)^2 + (2/6)^2] = 0.444
gini_split0 = 0.444

Split 1 (after 17):
     H  L
C_b  1  0
C_a  3  2
gini_split1 = 1/6 gini(S1) + 5/6 gini(S2)
gini(S1) = 1 - [(1/1)^2] = 0
gini(S2) = 1 - [(3/5)^2 + (2/5)^2] = 0.48
gini_split1 = 0.400

Split 2 (after 20):
     H  L
C_b  2  0
C_a  2  2
gini_split2 = 2/6 gini(S1) + 4/6 gini(S2)
gini(S1) = 1 - [(2/2)^2] = 0
gini(S2) = 1 - [(2/4)^2 + (2/4)^2] = 0.5
gini_split2 = 0.333

Split 3 (after 23):
     H  L
C_b  3  0
C_a  1  2
gini_split3 = 3/6 gini(S1) + 3/6 gini(S2)
gini(S1) = 1 - [(3/3)^2] = 0
gini(S2) = 1 - [(1/3)^2 + (2/3)^2] = 0.444
gini_split3 = 0.222

Split 4 (after 32):
     H  L
C_b  3  1
C_a  1  1
gini_split4 = 4/6 gini(S1) + 2/6 gini(S2)
gini(S1) = 1 - [(3/4)^2 + (1/4)^2] = 0.375
gini(S2) = 1 - [(1/2)^2 + (1/2)^2] = 0.5
gini_split4 = 0.417

Split 5 (after 43):
     H  L
C_b  4  1
C_a  0  1
gini_split5 = 5/6 gini(S1) + 1/6 gini(S2)
gini(S1) = 1 - [(4/5)^2 + (1/5)^2] = 0.320
gini(S2) = 1 - [(1/1)^2] = 0
gini_split5 = 0.267

Split 6 (after 68):
     H  L
C_b  4  2
C_a  0  0
gini_split6 = 6/6 gini(S1) + 0/6 gini(S2)
gini(S1) = 1 - [(4/6)^2 + (2/6)^2] = 0.444
gini_split6 = 0.444

The minimum is gini_split3 = 0.222, so the best split on Age is Age <= 27.5 (the midpoint between 23 and 32).
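The whole scan can be checked with a short script (helper names are mine):

def gini(counts):
    n = sum(counts)
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts)

labels = ("H", "L")
age_list = [(17, "H"), (20, "H"), (23, "H"), (32, "L"), (43, "H"), (68, "L")]
total = {l: sum(1 for _, lab in age_list if lab == l) for l in labels}

below = {l: 0 for l in labels}
n = len(age_list)
for i in range(n + 1):                      # split i: first i records below
    if i > 0:
        below[age_list[i - 1][1]] += 1
    b = [below[l] for l in labels]          # C_below
    a = [total[l] - below[l] for l in labels]   # C_above
    print(f"gini_split{i} = {i/n * gini(b) + (n-i)/n * gini(a):.3f}")
# Prints 0.444, 0.400, 0.333, 0.222, 0.417, 0.267, 0.444 -> minimum at split 3.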

Page 10:

Splitting categorical attributes

A single scan through the attribute list collects counts in a count matrix, one cell for each combination of attribute value and class label (see the sketch below).
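A minimal sketch of that single scan:

from collections import defaultdict

# One pass over the categorical attribute list fills the count matrix:
# one row per attribute value, one column per class label.
car_list = [("Family", "H"), ("Sports", "H"), ("Sports", "H"),
            ("Family", "L"), ("Truck", "L"), ("Family", "H")]

matrix = defaultdict(lambda: defaultdict(int))
for value, label in car_list:
    matrix[value][label] += 1

for value in matrix:
    print(value, dict(matrix[value]))
# Family {'H': 2, 'L': 1} / Sports {'H': 2} / Truck {'L': 1}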

Page 11:

Example

Car Type attribute list:

Car Type  Risk  Tuple
Family    H     0
Sports    H     1
Sports    H     2
Family    L     3
Truck     L     4
Family    H     5

Count matrix:

        H  L
Family  2  1
Sports  2  0
Truck   0  1

gini_split(Family) = 3/6 gini(S1) + 3/6 gini(S2)
gini(S1) = 1 - [(2/3)^2 + (1/3)^2] = 4/9
gini(S2) = 1 - [(2/3)^2 + (1/3)^2] = 4/9
gini_split(Family) = 0.444

gini_split(Sports) = 2/6 gini(S1) + 4/6 gini(S2)
gini(S1) = 1 - [(2/2)^2] = 0
gini(S2) = 1 - [(2/4)^2 + (2/4)^2] = 0.5
gini_split(Sports) = 0.333

gini_split(Truck) = 1/6 gini(S1) + 5/6 gini(S2)
gini(S1) = 1 - [(1/1)^2] = 0
gini(S2) = 1 - [(4/5)^2 + (1/5)^2] = 0.32
gini_split(Truck) = 0.267

The best categorical split is Car Type = Truck (gini_split = 0.267).

Page 12:

Example (two attributes)

Comparing the best splits of both attributes, gini_split = 0.222 for Age <= 27.5 beats gini_split = 0.267 for Car Type = Truck, so the winner is Age <= 27.5.

Attribute lists before the split:

Age  Risk  Tuple
17   H     1
20   H     5
23   H     0
32   L     4
43   H     2
68   L     3

Car Type  Risk  Tuple
Family    H     0
Sports    H     1
Sports    H     2
Family    L     3
Truck     L     4
Family    H     5

The node splits on Age <= 27.5:
yes -> leaf H (tuples 1, 5, 0 are all class H)
no  -> attribute lists for the remaining tuples:

Age  Risk  Tuple
32   L     4
43   H     2
68   L     3

Car Type  Risk  Tuple
Sports    H     2
Family    L     3
Truck     L     4

Page 13:

Example for Bayes Rules

• The patient either has cancer or does not.
• Prior knowledge: over the entire population, 0.008 (0.8%) have cancer.
• The lab test result (+ or -) is imperfect. It returns
– a correct positive result in only 98% of the cases in which the cancer is actually present
– a correct negative result in only 97% of the cases in which the cancer is not present
• What should we conclude about a new patient for whom the lab test returns +?

Page 14:

Example for Bayes Rules

Pr(cancer) = 0.008          Pr(not cancer) = 0.992
Pr(+ | cancer) = 0.98       Pr(- | cancer) = 0.02
Pr(+ | not cancer) = 0.03   Pr(- | not cancer) = 0.97

Pr(+ | cancer) * Pr(cancer) = 0.98 * 0.008 = 0.0078
Pr(+ | not cancer) * Pr(not cancer) = 0.03 * 0.992 = 0.0298

Hence, Pr(cancer | +) = 0.0078 / (0.0078 + 0.0298) = 0.21
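The same computation as a short Python check:

# Bayes rule for the lab-test example: posterior is the normalized
# product of likelihood and prior.
p_cancer = 0.008
p_pos_given_cancer = 0.98
p_pos_given_healthy = 0.03

joint_cancer  = p_pos_given_cancer * p_cancer            # ~0.0078
joint_healthy = p_pos_given_healthy * (1 - p_cancer)     # ~0.0298
posterior = joint_cancer / (joint_cancer + joint_healthy)
print(f"Pr(cancer | +) = {posterior:.2f}")               # 0.21

Even with a positive test, "no cancer" remains the more probable hypothesis, because the prior for cancer is so small.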