Page 1: SLIQ: A Fast Scalable Classifier for Data Mining

SLIQ: A Fast Scalable Classifier for Data Mining

Manish Mehta, Rakesh Agrawal, Jorma Rissanen, 1996

Presentation by: Vladan Radosavljevic

Page 2: SLIQ: A Fast Scalable Classifier for Data Mining

Outline

Introduction

Motivation

SLIQ Algorithm

Building tree

Pruning

Example

Results

Conclusion

Page 3: SLIQ: A Fast Scalable Classifier for Data Mining

Introduction

Most classification algorithms are designed for memory-resident data, which limits their suitability for mining large datasets

Solution: build a scalable classifier, SLIQ

SLIQ stands for Supervised Learning In Quest; Quest was the data mining project at IBM

Page 4: SLIQ: A Fast Scalable Classifier for Data Mining

Motivation

Recall (ID3, C4.5, CART):

Page 5: SLIQ: A Fast Scalable Classifier for Data Mining

Motivation

Non-scalable decision trees:

The complexity lies in determining the best split for each attribute

The cost of evaluating splits for numerical attributes is dominated by the cost of sorting the values at each node

The cost of evaluating splits for categorical attributes is dominated by the cost of searching for the best subset

Pruning: cross-validation is inapplicable for large datasets; dividing the data into two parts (training and test set) leaves their sizes and distributions an open question

Page 6: SLIQ: A Fast Scalable Classifier for Data Mining

Motivation

Improve scalability of tree classifiers

Previous proposals:

Sampling the data at each node

Discretization of numerical attributes

Partitioning the input data and building a tree for each partition

All of these methods sacrifice accuracy!

SLIQ: improve learning time without loss of accuracy!

Page 7: SLIQ: A Fast Scalable Classifier for Data Mining

SLIQ

Key features:

Tree classifier that handles both numerical and categorical attributes

Presorts numerical attributes before the tree is built

Breadth-first growing strategy

Goodness test: Gini index

Inexpensive tree pruning algorithm based on Minimum Description Length (MDL)
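As an illustration of the goodness test, the Gini index of a set of labels is 1 minus the sum of squared class frequencies, and a binary split is scored by the size-weighted average of the Gini index of its two sides. The function names below are chosen for this sketch and are not taken from the slides.

```python
from collections import Counter


def gini(labels):
    """Gini index of a collection of class labels: 1 - sum_j p_j^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_split(left, right):
    """Size-weighted Gini index of a binary split; lower is better."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


print(gini(["G", "G", "B", "B"]))            # evenly mixed node
print(gini_split(["G", "G"], ["B", "B"]))    # split that separates the classes
```

A pure node has Gini index 0, so a split that separates the classes perfectly also scores 0.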

Page 8: SLIQ: A Fast Scalable Classifier for Data Mining

SLIQ - Algorithm

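Eliminate the need to sort the data at each node

Create a sorted attribute list for each numerical attribute; each entry holds an attribute value and the index of the record's entry in the class list

Create a class list; each entry holds a class label and a reference to the leaf node the record currently belongs to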
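A minimal sketch of the two data structures, assuming the training data arrives as a list of dictionaries. The function name `build_lists`, the record layout, and the choice of 0 as the root's leaf id are assumptions made for illustration.

```python
def build_lists(records, labels, numeric_attrs):
    """Build presorted attribute lists and the class list.

    records: list of dicts mapping attribute name -> value (assumed layout).
    labels: class label of each record, in the same order.
    """
    # Class list: one entry per record, [class label, current leaf id].
    # Every record starts in the root, which is called leaf 0 here.
    class_list = [[label, 0] for label in labels]
    # Attribute list: (value, record id) pairs, sorted once by value.
    attr_lists = {
        attr: sorted((rec[attr], rid) for rid, rec in enumerate(records))
        for attr in numeric_attrs
    }
    return attr_lists, class_list


records = [{"age": 30, "salary": 65}, {"age": 23, "salary": 15}, {"age": 40, "salary": 75}]
labels = ["G", "B", "G"]
attr_lists, class_list = build_lists(records, labels, ["age", "salary"])
print(attr_lists["age"])   # [(23, 1), (30, 0), (40, 2)]
```

The sort happens once, before any node is split; only the leaf ids in the class list change as the tree grows.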

Page 9: SLIQ: A Fast Scalable Classifier for Data Mining

SLIQ - Algorithm

Example:

Page 10: SLIQ: A Fast Scalable Classifier for Data Mining

SLIQ - Algorithm

Split evaluation:
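A rough sketch of split evaluation for one numerical attribute, assuming attribute lists of (value, record id) pairs and a class list of [label, leaf id] entries. A single scan of the presorted list updates a class histogram per leaf, so all current leaves are evaluated in the same pass. For simplicity it considers tests of the form A <= v at each distinct value v; the names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import groupby


def gini_from_counts(counts):
    """Gini index computed from a class histogram."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def evaluate_splits(attr_list, class_list):
    """Scan one presorted attribute list; return {leaf id: (gini, threshold)}."""
    # Per leaf: histogram of records with value <= v (below) and > v (above).
    above = defaultdict(Counter)
    for label, leaf in class_list:
        above[leaf][label] += 1
    below = defaultdict(Counter)
    best = {}
    # Handle equal values together so a threshold never separates ties.
    for value, group in groupby(attr_list, key=lambda entry: entry[0]):
        touched = set()
        for _, rid in group:
            label, leaf = class_list[rid]
            below[leaf][label] += 1
            above[leaf][label] -= 1
            touched.add(leaf)
        for leaf in touched:
            n_below = sum(below[leaf].values())
            n_above = sum(above[leaf].values())
            if n_above == 0:
                continue  # every record falls on one side: not a real split
            g = (n_below * gini_from_counts(below[leaf])
                 + n_above * gini_from_counts(above[leaf])) / (n_below + n_above)
            if leaf not in best or g < best[leaf][0]:
                best[leaf] = (g, value)
    return best
```

Running this for every attribute and keeping the lowest score per leaf would give the split chosen for each leaf at the current tree level.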

Page 11: SLIQ: A Fast Scalable Classifier for Data Mining

SLIQ - Algorithm

Example:

Page 12: SLIQ: A Fast Scalable Classifier for Data Mining

SLIQ - Algorithm

Update class list:
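A minimal sketch of the class-list update, assuming each split leaf has already been assigned a test of the form A <= threshold and two fresh child ids. The `splits` dictionary layout and the function name are assumptions for illustration.

```python
def update_class_list(attr_lists, class_list, splits):
    """Move each record from its split leaf to the matching child.

    attr_lists: {attribute: sorted list of (value, record id)}.
    class_list: [class label, leaf id] per record id, updated in place.
    splits: {leaf id: (attribute, threshold, left child id, right child id)}.
    Child ids are assumed to be fresh, i.e. never keys of `splits`.
    """
    used_attrs = {attr for attr, _, _, _ in splits.values()}
    for attr in used_attrs:
        # One scan per attribute that is used in some split at this level.
        for value, rid in attr_lists[attr]:
            split = splits.get(class_list[rid][1])
            if split is None or split[0] != attr:
                continue  # this record's leaf is not split on this attribute
            _, threshold, left, right = split
            class_list[rid][1] = left if value <= threshold else right


attr_lists = {"age": [(23, 1), (30, 0), (40, 2)]}
class_list = [["G", 0], ["B", 0], ["G", 0]]
update_class_list(attr_lists, class_list, {0: ("age", 23, 1, 2)})
print(class_list)   # [['G', 2], ['B', 1], ['G', 2]]
```

The attribute lists themselves are never reordered; only the leaf reference of each record changes.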

Page 13: SLIQ: A Fast Scalable Classifier for Data Mining

SLIQ - Algorithm

Example:

Page 14: SLIQ: A Fast Scalable Classifier for Data Mining

SLIQ - Algorithm

For large-cardinality categorical attributes (cardinality above a threshold), the best split is computed greedily; otherwise all possible subsets are evaluated

When a node becomes pure, stop splitting it, then condense the attribute lists by discarding the examples that belong to the pure node

SLIQ scales to large datasets with no loss in accuracy: the splits evaluated with or without presorting are identical
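The categorical case can be sketched as follows, assuming the per-value class histograms for one leaf are already available. Small attributes get an exhaustive subset search; larger ones get a greedy search that starts from the empty subset and keeps adding the value that improves the split most. The names and the default threshold `max_exhaustive=10` are assumptions of this sketch.

```python
from collections import Counter
from itertools import combinations


def gini_from_counts(counts):
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gini(value_counts, subset):
    """Weighted Gini of the split 'A in subset' versus the remaining values."""
    inside, outside = Counter(), Counter()
    for value, counts in value_counts.items():
        (inside if value in subset else outside).update(counts)
    n_in, n_out = sum(inside.values()), sum(outside.values())
    return (n_in * gini_from_counts(inside)
            + n_out * gini_from_counts(outside)) / (n_in + n_out)


def best_subset(value_counts, max_exhaustive=10):
    """value_counts: {attribute value: Counter of class labels}."""
    values = sorted(value_counts)
    if len(values) <= max_exhaustive:
        # Exhaustive search over all non-empty proper subsets.
        candidates = (frozenset(c) for r in range(1, len(values))
                      for c in combinations(values, r))
        return min(candidates, key=lambda s: split_gini(value_counts, s),
                   default=frozenset())
    # Greedy search: add one value at a time while the split improves.
    subset, best = frozenset(), float("inf")
    while len(subset) < len(values) - 1:
        g, v = min((split_gini(value_counts, subset | {x}), x)
                   for x in values if x not in subset)
        if g >= best:
            break
        subset, best = subset | {v}, g
    return subset
```

The greedy search is not guaranteed to find the optimal subset, but it avoids enumerating a number of subsets that grows exponentially with the cardinality.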

Page 15: SLIQ: A Fast Scalable Classifier for Data Mining

SLIQ - Pruning

Post-pruning algorithm based on the Minimum Description Length (MDL) principle

Find a model that minimizes:

Cost(M, D) = Cost(D|M) + Cost(M)

Cost(M): cost of encoding the model M

Cost(D|M): cost of encoding the data D given the model M

Page 16: SLIQ: A Fast Scalable Classifier for Data Mining

SLIQ - Pruning

Cost of the data: classification error

Cost of the model:

Encoding the tree: number of bits

Encoding the splits: a constant for a numerical attribute (empirically 1); depends on the cardinality for a categorical attribute

MDL pruning evaluates the code length at each node to determine whether to prune one or both children or leave the node intact

Page 17: SLIQ: A Fast Scalable Classifier for Data Mining

SLIQ - Pruning

Three pruning strategies:

Full: prune both children and convert the node to a leaf

Partial: convert the node to a leaf, prune the left child, prune the right child, or leave the node intact

Hybrid: apply the Full method first, then the Partial options (prune left, prune right, or leave intact)
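A minimal sketch of the Full strategy, assuming a leaf costs one node bit plus its classification errors and a split costs one node bit, the test cost, and the cost of both subtrees. The `Node` class, the one-bit node encoding, and pruning on ties are assumptions of this sketch; the Partial and Hybrid options are not covered.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    errors: int                       # training errors if this node were a leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    test_cost: float = 1.0            # split encoding cost (1 for a numerical attribute)


def prune_full(node, node_bits=1.0):
    """Prune bottom-up in place; return the MDL cost of the resulting subtree."""
    leaf_cost = node_bits + node.errors
    if node.left is None or node.right is None:
        return leaf_cost
    split_cost = (node_bits + node.test_cost
                  + prune_full(node.left, node_bits)
                  + prune_full(node.right, node_bits))
    if leaf_cost <= split_cost:
        node.left = node.right = None  # cheaper as a leaf: drop both children
        return leaf_cost
    return split_cost


tree = Node(errors=5,
            left=Node(errors=0),
            right=Node(errors=1, left=Node(errors=0), right=Node(errors=0)))
print(prune_full(tree))   # 5.0: the right subtree is collapsed, the root split is kept
```

In this example the right subtree saves only one error at a cost of four units, so it is collapsed, while the root split saves enough errors to pay for itself.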

Page 18: SLIQ: A Fast Scalable Classifier for Data Mining

Results

SLIQ was tested on the datasets:

Page 19: SLIQ: A Fast Scalable Classifier for Data Mining

Results

Pruning strategy comparison:

Page 20: SLIQ: A Fast Scalable Classifier for Data Mining

Results

Accuracy:

Page 21: SLIQ: A Fast Scalable Classifier for Data Mining

Results

Scalability:

Page 22: SLIQ: A Fast Scalable Classifier for Data Mining

Conclusion

SLIQ proves to be a fast, low-cost, and scalable classifier that builds accurate trees

In empirical tests comparing SLIQ with other tree-based classifiers, SLIQ achieves comparable accuracy while producing smaller decision trees

Scalability??? Memory problems arise as the number of attributes or the number of classes increases

Page 23: SLIQ: A Fast Scalable Classifier for Data Mining

References

[1] M. Mehta, R. Agrawal, and J. Rissanen, "SLIQ: A Fast Scalable Classifier for Data Mining," in Proceedings of the 5th International Conference on Extending Database Technology, Avignon, France, Mar. 1996.

Page 24: SLIQ: A Fast Scalable Classifier for Data Mining

THANK YOU!