Classification and Prediction
(based on the slides for the book Data Mining: Concepts & Techniques)

2003/04, Sistemas de Apoio à Decisão (LEIC Tagus)


Classification and Prediction

What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification (TI)
Classification by Back Propagation (Neural Networks) (TI)
Support Vector Machines (SVM)
Associative Classification: classification by association rule analysis
Other Classification Methods: k-Nearest Neighbor, case-based reasoning, etc.
Prediction: Regression (TI)


Classification vs. Prediction

Classification:
predicts categorical class labels (discrete or nominal)
constructs a model based on the training set and the values (class labels) of a classifying attribute, then uses it to classify new data

Prediction:
models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications: credit approval, target marketing, medical diagnosis, fraud detection


Classification: A Two-Step Process (1)

Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class-label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, or mathematical formulae


Classification: A Two-Step Process (2)

Model usage: classifying future or unknown objects
First, estimate the accuracy of the model:
The known label of each test sample is compared with the classification produced by the model
The accuracy rate is the percentage of test-set samples correctly classified by the model
The test set must be independent of the training set, otherwise overfitting will occur
If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known


Classification Process (1): Model Construction

Training data:

NAME     RANK            YEARS  TENURED
Mike     Assistant Prof  3      no
Mary     Assistant Prof  7      yes
Bill     Professor       2      yes
Jim      Associate Prof  7      yes
Dave     Assistant Prof  6      no
Anne     Associate Prof  3      no

A classification algorithm applied to the training data produces the classifier (model), here in rule form:

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'


Classification Process (2): Using the Model in Prediction

The classifier is first evaluated on the testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

It can then be applied to unseen data, e.g., (Jeff, Professor, 4): tenured? (A minimal sketch of this step follows.)
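As a concrete illustration (not part of the original slides), here is a minimal Python sketch that applies the rule learned in step 1 to the testing data above, then to the unseen tuple. Note that Merlisa is misclassified, so the rule's accuracy on this test set is 3/4.

```python
def predict_tenured(rank, years):
    """The rule learned in step 1: IF rank = 'professor' OR years > 6 THEN 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, actual tenured label).
test_set = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

correct = sum(predict_tenured(rank, years) == label
              for _, rank, years, label in test_set)
print(f"accuracy: {correct}/{len(test_set)}")  # 3/4: Merlisa is misclassified
print(predict_tenured("Professor", 4))         # unseen tuple (Jeff) -> 'yes'
```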


Supervised vs. Unsupervised Learning

Supervised learning (classification):
The training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation
New data are classified based on the training set

Unsupervised learning (clustering):
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

(A minimal sketch of the contrast follows.)
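As a side-by-side illustration (mine, not the slides'; it assumes scikit-learn is installed), the sketch below trains a classifier with labels and a clusterer without them on the same toy measurements.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X = [[1.0], [1.2], [0.9], [5.0], [5.1], [4.8]]     # toy 1-D measurements
y = ["low", "low", "low", "high", "high", "high"]  # class labels

clf = DecisionTreeClassifier().fit(X, y)  # supervised: labels guide training
print(clf.predict([[1.1], [5.2]]))        # -> ['low' 'high']

km = KMeans(n_clusters=2, n_init=10).fit(X)  # unsupervised: no labels given
print(km.labels_)                            # discovered cluster assignments
```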



Data Preparation

Data cleaning: preprocess the data in order to reduce noise and handle missing values
Relevance analysis (feature selection): remove irrelevant or redundant attributes
Data transformation: generalize and/or normalize the data (a minimal normalization sketch follows)
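As one concrete example of the transformation step (my sketch, not the slides'), min-max normalization rescales a numeric attribute linearly to [0, 1]; the attribute values are made up.

```python
def min_max_normalize(values):
    """Rescale numeric attribute values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [23, 35, 47, 61]          # hypothetical raw attribute values
print(min_max_normalize(ages))   # [0.0, 0.315..., 0.631..., 1.0]
```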


Evaluating Classification Methods

Accuracy: classifier accuracy and predictor accuracy
Speed and scalability: time to construct the model (training time); time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency on disk-resident databases
Interpretability: understanding and insight provided by the model
Other measures: e.g., goodness of rules, such as decision-tree size or compactness of the classification rules


Classification Accuracy: Estimating Error Rates

Partition (training-and-testing): use two independent data sets, e.g., a training set (2/3) and a test set (1/3); used for data sets with a large number of samples
Cross-validation: divide the data set into k sub-samples; use k-1 sub-samples as training data and the remaining one as test data (k-fold cross-validation); for data sets of moderate size (see the sketch below)
Bootstrapping (leave-one-out): for small data sets
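A minimal sketch of k-fold cross-validation as just described; `train` and `evaluate` are hypothetical placeholders for any model-construction and accuracy-measuring functions.

```python
import random

def k_fold_cv(samples, k, train, evaluate):
    """Average accuracy over k train/test rounds on k disjoint sub-samples."""
    samples = samples[:]            # work on a shuffled copy
    random.shuffle(samples)
    folds = [samples[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test_fold = folds[i]
        train_folds = [s for j, f in enumerate(folds) if j != i for s in f]
        model = train(train_folds)              # build model on k-1 sub-samples
        accuracies.append(evaluate(model, test_fold))
    return sum(accuracies) / k                  # averaged accuracy estimate
```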


Increasing Classifier Accuracy: Bagging

General idea: average the prediction over a collection of classifiers

[Diagram: the training data is resampled into several altered training sets; the same classification method (CM) is applied to each, producing classifiers C1, C2, ...; their predictions are aggregated into the combined classifier C*]


Bagging: The Algorithm

Given a set S of s samples:
Generate a bootstrap sample T from S; cases in S may not appear in T or may appear more than once
Repeat this sampling procedure, obtaining a sequence of k independent training sets
Construct a corresponding sequence of classifiers C1, C2, ..., Ck for these training sets, using the same classification algorithm
To classify an unknown sample X, let each classifier predict (vote)
The bagged classifier C* counts the votes and assigns X to the class with the most votes (a minimal sketch follows)
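Here is a minimal sketch of that procedure, assuming a scikit-learn-style base learner with fit/predict; DecisionTreeClassifier stands in for "the same classification algorithm" purely as an example.

```python
import random
from collections import Counter

from sklearn.tree import DecisionTreeClassifier

def bagging(X, y, k):
    """Train k classifiers, each on a bootstrap sample of (X, y)."""
    n, classifiers = len(X), []
    for _ in range(k):
        # Bootstrap sample T from S: draw n cases with replacement, so some
        # cases repeat and others are left out.
        idx = [random.randrange(n) for _ in range(n)]
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        classifiers.append(DecisionTreeClassifier().fit(Xb, yb))
    return classifiers

def bagged_predict(classifiers, x):
    """The bagged classifier C*: each C_i votes; return the majority class."""
    votes = [c.predict([x])[0] for c in classifiers]
    return Counter(votes).most_common(1)[0][0]
```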



Decision Tree Induction: Training Dataset

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

This follows the style of Quinlan's ID3 example (playing tennis).


Output: A Decision Tree for "buys_computer"

age?
  <=30  -> student?
             no  -> buys_computer = no
             yes -> buys_computer = yes
  31…40 -> buys_computer = yes
  >40   -> credit_rating?
             excellent -> buys_computer = no
             fair      -> buys_computer = yes


Algorithm for Decision Tree Induction (ID3)

Basic algorithm (a greedy algorithm):
The tree is constructed in a top-down, recursive, divide-and-conquer manner
At the start, all training examples are at the root
Attributes are categorical (continuous-valued attributes are discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping the partitioning (a compact sketch follows the list):
1) All samples for a given node belong to the same class
2) There are no remaining attributes for further partitioning (majority voting is employed to classify the leaf)
3) There are no samples left
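A compact sketch of this greedy algorithm (my reconstruction, not the course's reference code): tuples are dicts of categorical attributes, `target` names the class-label attribute, and information gain (defined on the next slides) is the selection measure. Stopping condition 3 does not arise here because branches are created only for attribute values that actually occur.

```python
from collections import Counter
from math import log2

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())

def info_gain(rows, attribute, target):
    """Entropy reduction obtained by partitioning rows on one attribute."""
    total, remainder = len(rows), 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r for r in rows if r[attribute] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def build_tree(rows, attributes, target):
    classes = {r[target] for r in rows}
    if len(classes) == 1:                  # 1) all samples in the same class
        return classes.pop()
    if not attributes:                     # 2) no attributes left: majority vote
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    tree = {best: {}}
    for value in {r[best] for r in rows}:  # partition on the selected attribute
        subset = [r for r in rows if r[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, rest, target)
    return tree
```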


Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain.

Let S contain s_i tuples of class C_i, for i = 1, ..., m. The expected information needed to classify an arbitrary tuple is

  I(s_1, ..., s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}

The entropy of attribute A with values {a_1, a_2, ..., a_v}, whose value a_j selects s_{ij} tuples of class C_i, is

  E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \dots + s_{mj}}{s} \, I(s_{1j}, ..., s_{mj})

The information gained by branching on attribute A is

  Gain(A) = I(s_1, ..., s_m) - E(A)


Attribute Selection by Information Gain Computation

Class P: buys_computer = "yes" (p = 9 samples); Class N: buys_computer = "no" (n = 5 samples), over the training data in the table above.

  I(p, n) = I(9, 5) = 0.940

Compute the entropy for age:

age    p_i  n_i  I(p_i, n_i)
<=30   2    3    0.971
31…40  4    0    0
>40    3    2    0.971

Here (5/14) I(2, 3) means that "age <= 30" covers 5 out of 14 samples, with 2 yes's and 3 no's. Hence

  E(age) = \frac{5}{14} I(2, 3) + \frac{4}{14} I(4, 0) + \frac{5}{14} I(3, 2) = 0.694

  Gain(age) = I(p, n) - E(age) = 0.246

Similarly:

  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048

So age is selected as the test attribute at the root. (A sketch reproducing these numbers follows.)
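The following sketch reproduces these numbers directly from the formulas on the previous slide (the one discrepancy: the computed Gain(age) rounds to 0.247, while the slide truncates it to 0.246).

```python
from math import log2

def I(*counts):
    """Expected information (entropy) of a class distribution s1, ..., sm."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def E(partitions, total):
    """Entropy of an attribute whose values split S into the given partitions."""
    return sum(sum(p) / total * I(*p) for p in partitions)

print(round(I(9, 5), 3))               # 0.94
age = [(2, 3), (4, 0), (3, 2)]         # partitions for <=30, 31…40, >40
print(round(E(age, 14), 3))            # 0.694
print(round(I(9, 5) - E(age, 14), 3))  # Gain(age) = 0.247 (slide: 0.246)
```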


Extracting Classification Rules from Trees

Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand

Example (see the sketch after this list):

IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
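A minimal sketch of the extraction itself (mine, not the slides'): enumerate every root-to-leaf path of a nested-dict tree, such as the one built by the ID3 sketch above, and emit one IF-THEN rule per path.

```python
def extract_rules(tree, target, conditions=()):
    """Print one IF-THEN rule for each root-to-leaf path of a nested-dict tree."""
    if not isinstance(tree, dict):                  # leaf: emit one rule
        body = " AND ".join(f'{attr} = "{val}"' for attr, val in conditions)
        print(f'IF {body} THEN {target} = "{tree}"')
        return
    (attribute, branches), = tree.items()
    for value, subtree in branches.items():
        extract_rules(subtree, target, conditions + ((attribute, value),))

# The buys_computer tree from the earlier slide:
tree = {"age": {"<=30": {"student": {"no": "no", "yes": "yes"}},
                "31…40": "yes",
                ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}}}}
extract_rules(tree, "buys_computer")   # prints the five rules above
```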


Avoiding Overfitting in Classification

Overfitting: an induced tree may overfit the training data
Too many branches, some of which may reflect anomalies due to noise or outliers
Poor accuracy for unseen samples

Two approaches to avoid overfitting:
Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold (it is difficult to choose an appropriate threshold)
Postpruning: remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees; use a data set different from the training data to decide which is the "best pruned tree"


Approaches to Determine the Final Tree Size

Use separate training (2/3) and testing (1/3) sets
Use cross-validation, e.g., 10-fold cross-validation
Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node is likely to improve performance over the entire distribution
Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized


Enhancements to Basic Decision Tree Induction

Allow for continuous-valued attributes:
Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals

Handle missing attribute values (see the sketch after this list):
Assign the most common value of the attribute
Assign a probability to each of the possible values

Attribute construction:
Create new attributes based on existing ones that are sparsely represented
This reduces fragmentation, repetition, and replication
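For the missing-values item, a minimal sketch of the first strategy (mine, assuming pandas is available): fill a missing categorical value with the attribute's most common value.

```python
import pandas as pd

# Hypothetical attribute with one missing value.
df = pd.DataFrame({"credit_rating": ["fair", "excellent", None, "fair"]})
most_common = df["credit_rating"].mode()[0]              # 'fair'
df["credit_rating"] = df["credit_rating"].fillna(most_common)
```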


Computing Information Gain for Continuous-Valued Attributes

Let A be a continuous-valued attribute. The best split point for A must be determined:
Sort the values of A in increasing order
Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}
The point with the minimum expected information requirement for A is selected as the split point for A

Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point. (A minimal sketch of the search follows.)
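A minimal self-contained sketch of that search (mine, not the slides'); `rows` are (value-of-A, class-label) pairs and the toy data is made up.

```python
from collections import Counter
from math import log2

def info(labels):
    """Expected information of a list of class labels."""
    counts, total = Counter(labels), len(labels)
    return -sum(c / total * log2(c / total) for c in counts.values())

def expected_info(rows, split):
    d1 = [label for value, label in rows if value <= split]  # D1: A <= split-point
    d2 = [label for value, label in rows if value > split]   # D2: A > split-point
    total = len(rows)
    return sum(len(d) / total * info(d) for d in (d1, d2) if d)

def best_split_point(rows):
    """Try the midpoint of each adjacent pair of sorted values; keep the best."""
    values = sorted({value for value, _ in rows})
    midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(midpoints, key=lambda s: expected_info(rows, s))

rows = [(21, "no"), (25, "no"), (33, "yes"), (38, "yes"), (45, "no")]
print(best_split_point(rows))   # 29.0 for this toy data
```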


Classification in Large Databases

Classification: a classical problem extensively studied by statisticians and machine-learning researchers
Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed

Why decision tree induction in data mining?
Relatively fast learning speed (compared with other classification methods)
Convertible to simple, easy-to-understand classification rules
Can use SQL queries for accessing databases
Classification accuracy comparable with other methods


Scalable Decision Tree Induction Methods

SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute-list data structure
PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning, stopping tree growth earlier
RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label)


Presentation of Classification Results

[Figure omitted]


Visualization of a Decision Tree in SGI/MineSet 3.0

[Figure omitted]


Bibliography

(Book) Data Mining: Concepts and Techniques, J. Han & M. Kamber, Morgan Kaufmann, 2001 (Sections 7.1 to 7.3 in the 2001 book; Sections 5.1 to 5.3 in the draft)
(Book) Machine Learning, T. Mitchell, McGraw Hill, 1997


Data Cube-Based Decision-Tree Induction

Integration of generalization with decision-tree induction (Kamber et al., 1997)

Classification at primitive concept levels:
E.g., precise temperature, humidity, outlook, etc.
Low-level concepts, scattered classes, bushy classification trees
Semantic interpretation problems

Cube-based multi-level classification:
Relevance analysis at multiple levels
Information-gain analysis with dimension + level