Classification and Prediction (based on the slides for the book Data Mining: Concepts and Techniques)
Posted on 21-Dec-2015
Classification and Prediction
(based on the slides for the book Data Mining: Concepts and Techniques)
2003/04, Sistemas de Apoio à Decisão (LEIC Tagus)
Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian Classification (TI)
- Classification by Back Propagation (Neural Networks) (TI)
- Support Vector Machines (SVM)
- Associative Classification: classification by association rule analysis
- Other Classification Methods: K-Nearest Neighbor, case-based reasoning, etc.
- Prediction: Regression (TI)
Classification vs. Prediction

Classification:
- predicts categorical class labels (discrete or nominal)
- constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data

Prediction:
- models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications: credit approval, target marketing, medical diagnosis, fraud detection
Classification—A Two-Step Process (1)

Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as classification rules, decision trees, or mathematical formulae
Classification—A Two-Step Process (2)

Model usage: classifying future or unknown objects
- Estimate the accuracy of the model:
  - The known label of each test sample is compared with the model's prediction
  - The accuracy rate is the percentage of test-set samples correctly classified by the model
  - The test set must be independent of the training set, otherwise overfitting will occur
- If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Classification Process (1): Model Construction

Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

A classification algorithm builds the classifier (model) from the training data, e.g.:

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Classification Process (2): Use the Model in Prediction

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) -> Tenured?
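The two-step process can be sketched with the slides' tenured example. This is a minimal Python illustration, not part of the original deck: the `model` function hand-codes the rule the slides show rather than learning it, and the variable names are mine.

```python
# Step 1 (model construction) produced the rule on the earlier slide;
# here it is hand-coded rather than learned from the training data.
def model(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2 (model usage): estimate accuracy on the independent test set.
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]
correct = sum(model(rank, years) == label for _, rank, years, label in test_set)
accuracy = correct / len(test_set)
print(accuracy)  # 0.75: the rule misclassifies Merlisa (years > 6 fires, but her label is "no")
```

If this 75% were judged acceptable, the model would then be applied to unseen tuples such as (Jeff, Professor, 4).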
Supervised vs. Unsupervised Learning

Supervised learning (classification):
- The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
- New data are classified based on the training set

Unsupervised learning (clustering):
- The class labels of the training data are unknown
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian Classification
- Classification by Back Propagation
- Support Vector Machines
- Associative Classification: classification by association rule analysis
Data Preparation
- Data cleaning: preprocess data to reduce noise and handle missing values
- Relevance analysis (feature selection): remove irrelevant or redundant attributes
- Data transformation: generalize and/or normalize data
Evaluating Classification Methods
- Accuracy: classifier accuracy and predictor accuracy
- Speed and scalability: time to construct the model (training time); time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability: understanding and insight provided by the model
- Other measures: e.g., goodness of rules, such as decision tree size or compactness of classification rules
Classification Accuracy: Estimating Error Rates
- Partition (training-and-testing): use two independent data sets, e.g., a training set (2/3) and a test set (1/3); used for data sets with a large number of samples
- Cross-validation: divide the data set into k sub-samples; use k-1 sub-samples as training data and one sub-sample as test data (k-fold cross-validation); for data sets of moderate size
- Bootstrapping (leave-one-out): for small data sets
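The k-fold procedure can be sketched in a few lines of Python. This is a toy illustration, not from the slides; the function and variable names are mine.

```python
def k_fold_splits(indices, k):
    # Divide the index list into k sub-samples; round i uses sub-sample i as
    # test data and the remaining k-1 sub-samples as training data.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

# 10 samples, 5-fold: each round trains on 8 samples and tests on 2
splits = list(k_fold_splits(list(range(10)), 5))
print([len(test) for _, test in splits])  # [2, 2, 2, 2, 2]
```

Each sample appears in exactly one test fold, so over the k rounds every sample is tested exactly once.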
Increasing Classifier Accuracy: Bagging

General idea: average the prediction over a collection of classifiers.

[Diagram: the training data is altered (resampled) several times; the same classification method (CM) is applied to each altered training set, yielding classifiers C1, C2, ...; their predictions are aggregated into a combined classifier C*.]
Bagging: The Algorithm
- Given a set S of s samples
- Generate a bootstrap sample T from S. Cases in S may not appear in T or may appear more than once.
- Repeat this sampling procedure, getting a sequence of k independent training sets
- A corresponding sequence of classifiers C1, C2, ..., Ck is constructed, one for each of these training sets, using the same classification algorithm
- To classify an unknown sample X, let each classifier predict, or vote
- The bagged classifier C* counts the votes and assigns X to the class with the most votes
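The two ingredients of the algorithm, bootstrap sampling and majority voting, can be sketched in Python. This is a toy illustration with hand-made stand-in classifiers; nothing in it comes from the slides, and all names are mine.

```python
import random
from collections import Counter

def bootstrap_sample(S, rng):
    # Draw |S| cases from S with replacement: some cases may not appear
    # in the sample, others may appear more than once.
    return [rng.choice(S) for _ in S]

def bagged_predict(classifiers, x):
    # C*: each classifier votes on x; x is assigned the class with most votes.
    votes = Counter(c(x) for c in classifiers)
    return votes.most_common(1)[0][0]

# Toy check: three stand-in "classifiers" vote on a sample
voters = [lambda x: "yes", lambda x: "yes", lambda x: "no"]
print(bagged_predict(voters, None))  # yes (2 votes to 1)
```

In a real run, each element of `voters` would be a classifier trained on one bootstrap sample.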
Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian Classification (TI)
- Classification by Back Propagation (Neural Networks) (TI)
- Support Vector Machines (SVM)
- Associative Classification: classification by association rule analysis
- Other Classification Methods: K-Nearest Neighbor, case-based reasoning, etc.
- Prediction: Regression (TI)
Decision Tree Induction: Training Dataset

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31...40  high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31...40  medium  no       excellent      yes
31...40  high    yes      fair           yes
>40      medium  no       excellent      no

This follows an example from Quinlan's ID3 (playing tennis).
Output: A Decision Tree for "buys_computer"

age?
  <=30    -> student?
               no  -> no
               yes -> yes
  31...40 -> yes
  >40     -> credit rating?
               excellent -> no
               fair      -> yes
Algorithm for Decision Tree Induction (ID3)

Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner
- At the start, all training examples are at the root
- Attributes are categorical (continuous-valued attributes are discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning:
1) All samples for a given node belong to the same class
2) There are no remaining attributes for further partitioning (majority voting is employed to classify the leaf)
3) There are no samples left
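The greedy loop and the stopping conditions can be sketched compactly in Python. This is an illustrative ID3-style implementation under simplifying assumptions, not the book's pseudocode: attributes are categorical, stopping condition 3 never arises here because branches are created only for observed values, and all names are mine. Run on the buys_computer training set from the slides (with income omitted for brevity), it reproduces the tree shown there.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:        # stop 1: all samples in one class
        return labels[0]
    if not attrs:                    # stop 2: no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    def expected_info(a):            # E(a): entropy of partitions, weighted by size
        groups = {}
        for row, lab in zip(rows, labels):
            groups.setdefault(row[a], []).append(lab)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    best = min(attrs, key=expected_info)   # highest gain == lowest E(a)
    branches = {}
    for value in {row[best] for row in rows}:
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = map(list, zip(*sub))
        branches[value] = id3(sub_rows, sub_labels, [a for a in attrs if a != best])
    return (best, branches)

# The slides' training set, reduced to three categorical attributes
data = [
    ("<=30", "no", "fair", "no"), ("<=30", "no", "excellent", "no"),
    ("31...40", "no", "fair", "yes"), (">40", "no", "fair", "yes"),
    (">40", "yes", "fair", "yes"), (">40", "yes", "excellent", "no"),
    ("31...40", "yes", "excellent", "yes"), ("<=30", "no", "fair", "no"),
    ("<=30", "yes", "fair", "yes"), (">40", "yes", "fair", "yes"),
    ("<=30", "yes", "excellent", "yes"), ("31...40", "no", "excellent", "yes"),
    ("31...40", "yes", "fair", "yes"), (">40", "no", "excellent", "no"),
]
attrs = ["age", "student", "credit_rating"]
rows = [dict(zip(attrs, t[:3])) for t in data]
labels = [t[3] for t in data]
tree = id3(rows, labels, attrs)
print(tree[0])  # age: the attribute with the highest information gain at the root
```

The recursion then picks student under age <= 30 and credit_rating under age > 40, matching the tree on the output slide.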
Attribute Selection Measure: Information Gain (ID3/C4.5)

- Select the attribute with the highest information gain
- S contains s_i tuples of class C_i, for i = 1, ..., m
- Information measure: the expected information required to classify an arbitrary tuple:

  I(s_1, s_2, ..., s_m) = - sum_{i=1}^{m} (s_i / s) log2(s_i / s)

- Entropy of attribute A with values {a_1, a_2, ..., a_v}, where s_ij is the number of tuples of class C_i in the subset with A = a_j:

  E(A) = sum_{j=1}^{v} ((s_1j + ... + s_mj) / s) I(s_1j, ..., s_mj)

- Information gained by branching on attribute A:

  Gain(A) = I(s_1, s_2, ..., s_m) - E(A)
Attribute Selection by Information Gain Computation

Class P: buys_computer = "yes" (9 samples)
Class N: buys_computer = "no" (5 samples)

I(p, n) = I(9, 5) = 0.940

Compute the entropy for age:

age      p_i  n_i  I(p_i, n_i)
<=30     2    3    0.971
30...40  4    0    0
>40      3    2    0.971

E(age) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694

Here (5/14) I(2, 3) means "age <= 30" has 5 of the 14 samples, with 2 yes's and 3 no's. Hence

Gain(age) = I(p, n) - E(age) = 0.246

Similarly:

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
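The computation above can be checked in a few lines of Python (a small sketch; the function names are mine). Note that with full precision Gain(age) is about 0.247; the slide's 0.246 comes from rounding I = 0.940 and E = 0.694 before subtracting.

```python
from math import log2

def info(counts):
    # I(s1, ..., sm) = -sum (si/s) log2(si/s) over the class counts.
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c)

def gain(class_counts, partitions):
    # Gain(A) = I(s1, ..., sm) - E(A); E(A) weights each partition's
    # info by its share of the samples.
    s = sum(class_counts)
    e_a = sum(sum(p) / s * info(p) for p in partitions)
    return info(class_counts) - e_a

# age partitions from the slide: <=30 has 2 yes / 3 no, 30...40 has 4/0, >40 has 3/2
print(round(info([9, 5]), 3))                     # 0.94
g = gain([9, 5], [[2, 3], [4, 0], [3, 2]])        # about 0.247 (slide rounds to 0.246)
```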
Extracting Classification Rules from Trees

- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand

Example:

IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31...40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
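The extracted rules amount to a chain of guarded returns. A minimal Python transcription of the five rules above (the function name is mine):

```python
def buys_computer(age, student, credit_rating):
    # One IF-THEN rule per root-to-leaf path; the tests on a path are ANDed.
    if age == "<=30" and student == "no":
        return "no"
    if age == "<=30" and student == "yes":
        return "yes"
    if age == "31...40":
        return "yes"
    if age == ">40" and credit_rating == "excellent":
        return "no"
    if age == ">40" and credit_rating == "fair":
        return "yes"

print(buys_computer("31...40", "no", "fair"))  # yes
```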
Avoid Overfitting in Classification

Overfitting: an induced tree may overfit the training data
- Too many branches, some of which may reflect anomalies due to noise or outliers
- Poor accuracy for unseen samples

Two approaches to avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold (it is difficult to choose an appropriate threshold)
- Postpruning: remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees; use a set of data different from the training data to decide which is the "best pruned tree"
Approaches to Determine the Final Tree Size

- Separate training (2/3) and testing (1/3) sets
- Use cross-validation, e.g., 10-fold cross-validation
- Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
- Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized
Enhancements to Basic Decision Tree Induction

- Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values
- Attribute construction: create new attributes based on existing ones that are sparsely represented; this reduces fragmentation, repetition, and replication
Computing Information Gain for Continuous-Valued Attributes

- Let A be a continuous-valued attribute
- The best split point for A must be determined:
  - Sort the values of A in increasing order
  - Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1})/2 is the midpoint between the values a_i and a_{i+1}
  - The point with the minimum expected information requirement for A is selected as the split point for A
- Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point
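The candidate split points, midpoints between adjacent sorted values, can be sketched in Python (a toy illustration; names are mine):

```python
def candidate_split_points(values):
    # Sort the distinct values of A in increasing order; each midpoint
    # between adjacent values is a possible split point.
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

print(candidate_split_points([30, 45, 40, 25]))  # [27.5, 35.0, 42.5]
```

Each candidate would then be evaluated by the expected information requirement, and the minimizing point chosen as the split point.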
Classification in Large Databases

- Classification: a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classify data sets with millions of examples and hundreds of attributes at reasonable speed

Why decision tree induction in data mining?
- Relatively fast learning speed (compared to other classification methods)
- Convertible to simple, easy-to-understand classification rules
- Can use SQL queries to access databases
- Classification accuracy comparable with other methods
Scalable Decision Tree Induction Methods

- SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute-list data structure
- PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning; stops growing the tree earlier
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label)
Presentation of Classification Results
Visualization of a Decision Tree in SGI/MineSet 3.0
Bibliography

- (Book) Data Mining: Concepts and Techniques, J. Han & M. Kamber, Morgan Kaufmann, 2001 (Sections 7.1 to 7.3 in the 2001 book; Sections 5.1 to 5.3 in the draft)
- (Book) Machine Learning, T. Mitchell, McGraw-Hill, 1997
Data Cube-Based Decision-Tree Induction

- Integration of generalization with decision-tree induction (Kamber et al. '97)
- Classification at primitive concept levels:
  - e.g., precise temperature, humidity, outlook, etc.
  - low-level concepts, scattered classes, bushy classification trees
  - semantic interpretation problems
- Cube-based multi-level classification:
  - relevance analysis at multiple levels
  - information-gain analysis with dimension + level