
Classification and Prediction
Monash University, Semester 1, March 2006


Patterns

• Data mining aims to reveal knowledge about the data under consideration
• Knowledge is viewed as patterns within the data
• Patterns are also referred to as structures, models and relationships
• The learning approach is inherently linked to the patterns
• Which approach for what data? – the eternal question!


Taxonomy of Learning Approaches

• Verification Driven
– Query and reporting
– Statistical Analysis
• Discovery Driven
– Supervised Learning
> Predictive
> Typical Learning Techniques: Classification and Regression
– Unsupervised Learning
> Informative
> Typical Learning Techniques: Clustering, Association Analysis, Deviation/Outlier Detection


Problems and Uses of Verification Driven Techniques

• Statistics
– Assumption: you have a well-defined model
> In reality, you often don't
– Focus on model estimation rather than model selection
– Doesn't cope well with large amounts of data
> The time and effort required is very high
– Highlights linear relationships
– Non-linear statistics require knowledge of the non-linearity
> The ways in which variables interact
> In real multi-dimensional data this is often not known or specifiable
– Important lessons from Statistics
> Estimation of uncertainty and checking of assumptions is absolutely essential


Discovery Driven Data Mining

• Predictive/Supervised Learning
– Build patterns by making a prediction of some unknown attribute given the values of other known attributes
– Principal Techniques: Classification and Regression
– Supervision: training data (a set of observations/measurements) is accompanied by labels indicating the class of the observations
– New data is classified based on the training data.


Discovery Driven Data Mining

• Informative/Unsupervised Learning
– Does not present a solution to a known problem
– Presents “interesting” patterns for consideration by a domain expert
– Principal Techniques: Clustering and Association Analysis
– The labels of the training data are unknown
– Given a set of observations/measurements (data), the task is to establish the existence of classes or clusters in the data.


Classification

• Difficult to define
• Usually based on a set of classes that are pre-defined and a set of training samples that are used for building class models
• Each new object is then assigned to one of the classes already defined in the class model
• Typical Apps: Credit Approval, Target Marketing, Medical Diagnosis…


Classification vs. Regression

• Classification and Regression are the two major types of prediction techniques
• Classification
– Predicts nominal or discrete values
• Regression
– Predicts continuous or ordered values


Classification – A Two-Stage Process

• Model Construction
– Each tuple/sample is assumed to belong to a pre-defined class, as determined by the class label attribute
– The set of tuples used for model construction is the training set
– The model can be represented as a set of IF-THEN rules, decision trees or mathematical formulae


Classification Process (1): Model Construction

Training Data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

A classification algorithm is applied to produce the Classifier (Model):

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
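As a minimal sketch of this construction step in Python (assuming pandas and scikit-learn are available; the one-hot encoding and default tree settings are illustrative choices, not part of the slides):

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Training data from the slide: rank and years are the known attributes,
    # tenured is the class label attribute.
    train = pd.DataFrame({
        "rank": ["Assistant Prof", "Assistant Prof", "Professor",
                 "Associate Prof", "Assistant Prof", "Associate Prof"],
        "years": [3, 7, 2, 7, 6, 3],
        "tenured": ["no", "yes", "yes", "yes", "no", "no"],
    })

    # One-hot encode the categorical attribute so the tree can split on it.
    X = pd.get_dummies(train[["rank", "years"]])
    y = train["tenured"]

    model = DecisionTreeClassifier().fit(X, y)  # the constructed classifier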


Classification – A two stage process

• Model Usage
– For classifying future or unknown objects
– Need to estimate the accuracy of the model
– The known label of each test sample is compared with the classified result from the model
– The accuracy rate is the percentage of test-set samples that are correctly classified by the model
– The test set is independent of the training set


Classification Process (2): Use the Model in Prediction

The classifier is applied to the Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

and then to Unseen Data: (Jeff, Professor, 4) – Tenured?
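Continuing the earlier sketch (same assumed libraries; reindexing the dummy columns is an implementation detail, not from the slides), the model-usage stage estimates accuracy on the independent test set and then classifies the unseen tuple:

    test = pd.DataFrame({
        "rank": ["Assistant Prof", "Associate Prof", "Professor", "Assistant Prof"],
        "years": [2, 7, 5, 7],
        "tenured": ["no", "no", "yes", "yes"],
    })
    # Align the test set's one-hot columns with those seen during training.
    X_test = pd.get_dummies(test[["rank", "years"]]).reindex(columns=X.columns, fill_value=0)

    # Accuracy rate: percentage of test samples classified correctly.
    print(model.score(X_test, test["tenured"]))

    # Classify the unseen tuple (Jeff, Professor, 4): tenured?
    jeff = pd.get_dummies(pd.DataFrame({"rank": ["Professor"], "years": [4]}))
    print(model.predict(jeff.reindex(columns=X.columns, fill_value=0)))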


Evaluating Classification Methods

• Predictive accuracy
• Speed and scalability
– time to construct the model
– time to use the model
• Robustness
– handling noise and missing values
• Scalability
– efficiency with disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Goodness of rules
– decision tree size
– compactness of classification rules


Decision Trees

• Assumption:
– Data consists of records that have a number of independent attributes and a dependent attribute
– A flow-chart-like tree structure
– A structure to support a game of 20 questions:
> Each internal node represents a test on an attribute
> Each branch represents the outcome of the test
> Each leaf represents a class


Classification by Decision Tree Induction

• Decision tree generation consists of two phases
– Tree construction
> At the start, all the training examples are at the root
> Partition the examples recursively based on selected attributes
– Tree pruning
> Identify and remove branches that reflect noise or outliers
• Use of the decision tree: classifying an unknown sample
– Test the attribute values of the sample against the decision tree


Training Dataset

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

This follows an example from Quinlan’s ID3.


Output: A Decision Tree for “buys_computer”

age?
├─ <=30 → student?
│    ├─ no → no
│    └─ yes → yes
├─ 31…40 → yes
└─ >40 → credit rating?
     ├─ excellent → no
     └─ fair → yes


Decision Trees

• Classify objects based on the values of their independent attributes
• Classification is based on a tree structure where each node of the tree is a rule that involves a multi-way decision
• To classify an object, its attributes are compared with the tests in each node, starting from the root. The path followed leads to a leaf node, which is the class to which the object belongs.


Decision Trees

• Attractive due to ease of understanding
• Rules can be expressed in natural language
• Aim: build a DT consisting of a root node, a number of internal nodes and a set of pre-defined leaf nodes
• Nodes are split repeatedly, starting from the root, until the process is complete


Algorithm for Decision Tree Induction

• Basic algorithm (a greedy algorithm; a sketch follows below)
– The tree is constructed in a top-down, recursive, divide-and-conquer manner
– At the start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
– There are no samples left
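A minimal sketch of this greedy recursion in plain Python (the list-of-dicts data layout and the pluggable choose_attribute selector are assumptions for illustration; any measure such as information gain can be passed in):

    from collections import Counter

    def majority_class(rows, target):
        # Majority voting, used when no attributes remain for partitioning.
        return Counter(row[target] for row in rows).most_common(1)[0][0]

    def build_tree(rows, attributes, target, choose_attribute):
        classes = {row[target] for row in rows}
        if len(classes) == 1:              # all samples belong to one class
            return classes.pop()
        if not attributes:                 # no attributes left: majority vote
            return majority_class(rows, target)
        best = choose_attribute(rows, attributes, target)  # e.g. highest gain
        tree = {best: {}}
        # Partition the examples on the selected attribute and recurse;
        # only observed values occur, so "no samples left" never arises here.
        for value in {row[best] for row in rows}:
            subset = [row for row in rows if row[best] == value]
            rest = [a for a in attributes if a != best]
            tree[best][value] = build_tree(subset, rest, target, choose_attribute)
        return tree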



Decision Trees

• The quality of the tree depends on the quality of the training data
• A tree can be 100% accurate on the training data, but the training data is only a sample… the tree cannot be accurate for all data.


Finding the Split

• A variable has to be chosen to split the data
• This variable has to be such that it does the best job of discriminating among the target classes
• Finding which variable is most influential is non-trivial
• One approach: measure the data’s diversity and choose the variable that minimises the diversity among the children nodes (see the sketch below)
• Several techniques have been proposed – Information Gain/Entropy, Gini, Chi-Squared, Rough Sets, etc.
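As one concrete diversity measure, here is a small sketch of the Gini index in Python (the helper name and the example counts are illustrative, not from the slides):

    def gini(class_counts):
        # Gini diversity: 1 minus the sum of squared class proportions.
        # 0 means a pure node; for two classes, values near 0.5 mean heavy mixing.
        total = sum(class_counts)
        return 1 - sum((c / total) ** 2 for c in class_counts)

    print(gini([9, 5]))   # mixed node: about 0.459
    print(gini([4, 0]))   # pure node: 0.0

A split variable is then chosen to minimise the weighted average of this measure over the resulting children nodes.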


Attribute Selection Measure

• Information gain (ID3/C4.5)
– All attributes are assumed to be categorical
– Can be modified for continuous-valued attributes
• Gini index (IBM IntelligentMiner)
– All attributes are assumed to be continuous-valued
– Assumes there exist several possible split values for each attribute
– May need other tools, such as clustering, to get the possible split values
– Can be modified for categorical attributes


Information Theory – 101

• Suppose there is a variable s that can take either value a or value b.
• If s is always going to be a, then there is no uncertainty and no information.
• The most common measure of the amount of information (also known as entropy) is:
– $I = -\sum_i p_i \log_2 p_i$
• Let p(a) = 0.9 and p(b) = 0.1
• Let p(a) = 0.5 and p(b) = 0.5
• When is I maximised?
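A quick numeric check of the two cases (a plain-Python sketch; only the two distributions come from the slide):

    import math

    def entropy(probs):
        # I = -sum(p_i * log2(p_i)), skipping zero-probability values.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.9, 0.1]))  # about 0.469 bits: mostly predictable
    print(entropy([0.5, 0.5]))  # 1.0 bit: maximal uncertainty for two values

For a two-valued variable, I is maximised when p(a) = p(b) = 0.5.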


Information Gain (ID3/C4.5)

• Select the attribute with the highest information gain

• Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and n elements of class N
– The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as

$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$
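A direct transcription of this definition (a sketch; the function name info is an assumption):

    import math

    def info(p, n):
        # I(p, n): information needed to decide whether an arbitrary
        # example in S belongs to class P or class N.
        def term(count):
            f = count / (p + n)
            return -f * math.log2(f) if f > 0 else 0.0
        return term(p) + term(n)

    print(round(info(9, 5), 3))  # 0.94, matching I(9, 5) on a later slide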


Information Gain in Decision Tree Induction

• Assume that using attribute A, a set S will be partitioned into sets {S_1, S_2, …, S_v}
– If S_i contains p_i examples of P and n_i examples of N, the entropy, or the expected information needed to classify objects in all subtrees S_i, is

$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$

• The encoding information that would be gained by branching on A is

$Gain(A) = I(p, n) - E(A)$


Attribute Selection by Information Gain Computation

• Class P: buys_computer = “yes”
• Class N: buys_computer = “no”
• I(p, n) = I(9, 5) = 0.940

• Compute the entropy for age:

age    p_i  n_i  I(p_i, n_i)
<=30   2    3    0.971
30…40  4    0    0
>40    3    2    0.971

$E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$

• Hence

$Gain(age) = I(p, n) - E(age) = 0.246$

• Similarly

$Gain(income) = 0.029$
$Gain(student) = 0.151$
$Gain(credit\_rating) = 0.048$
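These values can be reproduced from the training dataset (a self-contained Python sketch; the row tuples mirror the Training Dataset slide, and the helper names are assumptions):

    import math
    from collections import Counter

    # (age, income, student, credit_rating, buys_computer)
    rows = [
        ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
        ("31…40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
        (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
        ("31…40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
        ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
        ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
        ("31…40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
    ]

    def info(labels):
        # I over the class distribution of a set of examples (here: yes/no).
        total = len(labels)
        return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

    def gain(col):
        # Gain(A) = I(p, n) - E(A); E(A) is the weighted info of each partition.
        e = 0.0
        for value in {r[col] for r in rows}:
            part = [r[-1] for r in rows if r[col] == value]
            e += len(part) / len(rows) * info(part)
        return info([r[-1] for r in rows]) - e

    for i, name in enumerate(["age", "income", "student", "credit_rating"]):
        print(name, round(gain(i), 3))
    # age has the highest gain (~0.247 here; the slide's 0.246 rounds the
    # intermediate values 0.940 and 0.694 first), so age becomes the split.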


Extracting Classification Rules from Trees

• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand


Extracting Classification Rules from Trees

• Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
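A small sketch of this extraction, assuming the nested-dict tree representation from the induction sketch earlier (the tree literal below is just the buys_computer tree written out by hand):

    def extract_rules(tree, conditions=()):
        # Emit one IF-THEN rule per root-to-leaf path; each attribute-value
        # pair along the path joins the rule body as a conjunction.
        if not isinstance(tree, dict):            # leaf: holds the prediction
            body = " AND ".join(f'{attr} = "{val}"' for attr, val in conditions)
            print(f'IF {body} THEN buys_computer = "{tree}"')
            return
        (attr, branches), = tree.items()
        for value, subtree in branches.items():
            extract_rules(subtree, conditions + ((attr, value),))

    extract_rules({"age": {
        "<=30": {"student": {"no": "no", "yes": "yes"}},
        "31…40": "yes",
        ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }})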


Avoid Overfitting in Classification

• The generated tree may overfit the training data
– Too many branches, some of which may reflect anomalies due to noise or outliers
– The result is poor accuracy on unseen samples


Avoid Overfitting in Classification

• Two approaches to avoid overfitting (a scikit-learn sketch follows this list)
– Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
> Difficult to choose an appropriate threshold
– Postpruning: remove branches from a “fully grown” tree, obtaining a sequence of progressively pruned trees
> Use a set of data different from the training data to decide which is the “best pruned tree”
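In scikit-learn terms, the two approaches look roughly like this (a sketch under an assumed dataset and illustrative parameter values; cost-complexity pruning stands in for the slide's generic postpruning):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    # Prepruning: halt construction early via thresholds on depth/leaf size.
    pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5).fit(X_train, y_train)

    # Postpruning: derive a sequence of progressively pruned trees and use
    # data separate from the training set to pick the best pruned tree.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
    best = max(
        (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
         for a in path.ccp_alphas),
        key=lambda t: t.score(X_val, y_val),
    )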


Approaches to Determine the Final Tree Size

• Separate training (2/3) and testing (1/3) sets
• Use cross-validation, e.g., 10-fold cross-validation (both of these options are sketched below)
• Use all the data for training
– but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
• Use the minimum description length (MDL) principle
– halt growth of the tree when the encoding is minimized
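The first two options in scikit-learn form (a sketch; the dataset is an assumed stand-in):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Separate training (2/3) and testing (1/3) sets.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
    print(DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te))

    # 10-fold cross-validation over all the data.
    print(cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10).mean())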


Decision Trees: Strengths and Weaknesses

• Strengths
– Ability to generate understandable rules
– Classify with minimal computational overhead
• Issues
– Need to find the right split variable
– Visualising trees is tedious, particularly as data dimensionality increases
– Handling both continuous and categorical data


Enhancements to basic decision tree induction

• Allow for continuous-valued attributes
– Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
• Handle missing attribute values
– Assign the most common value of the attribute
– Assign a probability to each of the possible values
• Attribute construction
– Create new attributes based on existing ones that are sparsely represented
– This reduces fragmentation, repetition, and replication


Classification in Large Databases

• Classification: a classical problem extensively studied by statisticians and machine learning researchers
• Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed
• Why decision tree induction in data mining?
– relatively faster learning speed (than other classification methods)
– convertible to simple and easy-to-understand classification rules
– can use SQL queries for accessing databases
– comparable classification accuracy with other methods


Other Classification Methods

• Bayesian approaches
• Neural networks
• k-nearest neighbour classifier
• Case-based reasoning
• Genetic algorithms
• Rough set approach
• Fuzzy set approaches