sliq: a fast scalable classifier for data mining
DESCRIPTION
SLIQ: A Fast Scalable Classifier for Data Mining. Manish Mehta, Rakesh Agrawal, Jorma Rissanen 1996. Presentation by: Vladan Radosavljevic. Outline. Introduction Motivation SLIQ Algorithm Building tree Pruning Example Results Conclusion. Introduction. - PowerPoint PPT PresentationTRANSCRIPT
SLIQ: A Fast Scalable Classifier for Data Mining
Manish Mehta, Rakesh Agrawal, Jorma Rissanen1996.
Presentation by: Vladan Radosavljevic
Outline Introduction Motivation SLIQ Algorithm
Building tree Pruning Example
Results Conclusion
Introduction Most of the classification algorithms are
designed for memory resident data – limited suitability for mining large datasets
Solution – build a scalable classifier - SLIQ
SLIQ – Supervised Learning in Quest, Quest was the data mining project at the IBM
Motivation
Recall (ID3, C4.5, CART):
Motivation NON SCALABLE DECISION TREES: The complexity lies in determining the best
split for each attribute The cost of evaluating splits for numerical
attributes is dominated by the cost of sorting values at each node
The cost of evaluating splits for categorical attributes is dominated by the cost of searching for the best subset
Pruning crossvalidation inapplicable for large datasets divide data in two parts - training and test set - sizes,
distribution???
Motivation Improve scalability of tree classifiers Previous proposals:
Sampling data at each node Discretization of numerical attributes Partitioning input data and build tree for
each partition All methods achieve low accuracy!
SLIQ – improve learning time without loss in accuracy!
SLIQ Key features:
Tree classifier, handling both numerical and categorical attributes
Presort numerical attributes before tree has been built
Breadth first growing strategy Goodness test – Gini index Inexpensive tree pruning algorithm
based on Minimum Description Length (MDL)
SLIQ - Algorithm
Eliminate the need to sort the data at each node
Create sorted list for each numerical attribute
Create class list
SLIQ - Algorithm
Example:
SLIQ - Algorithm
Split evaluation:
SLIQ - Algorithm
Example:
SLIQ - Algorithm
Update class list:
SLIQ - Algorithm
Example:
SLIQ - Algorithm For large-cardinality categorical attributes
(determined based on threshold) the best split is computed in greedy way, otherwise all possible splits are evaluated
When node becomes pure stop splitting it, then condense attribute lists by discarding examples that correspond to the pure node
SLIQ is able to scale for large datasets with no loss in accuracy – the splits evaluated with or without pre-sorting are identical
SLIQ - Pruning
Post pruning algorithm based on Minimum Description Length principle
Find a model that minimizes:Cost(M,D) = Cost(D|M) + Cost(M)Cost(M) - cost of the modelCost(D|M) - cost of encoding the data D if model M is given
SLIQ - Pruning Cost of the data: classification error Cost of the model:
Encoding the tree: number of bits Encoding the splits:
numerical attribute - constant (empirically 1) categorical attribute - depends on cardinality
The MDL pruning evaluate the code length at each node to determine whether to prune one or both child or leave the node intact
SLIQ - pruning
Three pruning strategies: Full – pruning both children and
convert node to the leaf Partial – prune into the leaf or prune
the left child or prune the right child or leave node intact
Hybrid – apply Full method and then partial (prune left, prune right or leave intact)
Results
SLIQ was tested on the datasets:
Results
Pruning strategy comparison:
Results
Accuracy:
Results
Scalability:
Conclusion SLIQ demonstrates to be a fast, low-cost
and scalable classifier that builds accurate trees
Based on empirical test which compared SLIQ to other tree based classifiers, SLIQ achieves a comparable accuracy while producing smaller decision trees
Scalability??? Memory problem when increasing number of attributes or number of classes
References[1] M. Mehta, R. Agrawal and J. Rissanen, "SLIQ: A Fast Scalable
Classifier for Data Mining", in Proceedings of the 5th International Conference on Extending Database Technology, Avignon, France, Mar. 1996.
THANK YOU!