
Data mining: Probabilistic classifiers

Hamid Beigy

Sharif University of Technology

Fall 1394


Table of contents

1 Introduction

2 Introduction to probability

3 Bayes decision theory

4 Supervised learning of the Bayesian classifiers

5 Naive Bayes classifier

6 k−Nearest neighbor classifier


Introduction

In classification, the goal is to find a mapping from inputs X to outputs t given a labeled set of input-output pairs

S = {(x_1, t_1), (x_2, t_2), ..., (x_N, t_N)}.

S is called the training set.

In the simplest setting, each training input x is a D-dimensional vector of numbers.

Each component of x is called a feature, attribute, or variable, and x is called the feature vector.

The goal is to find a mapping from inputs X to outputs t, where t ∈ {1, 2, ..., C} with C being the number of classes.

When C = 2, the problem is called binary classification; in this case, we often assume that t ∈ {−1, +1} or t ∈ {0, 1}. When C > 2, the problem is called multi-class classification.

There are two main approaches for building a classifier.

Generative approach: first build a joint model of the form p(x, C_k) and then condition on x to derive p(C_k | x).

Discriminative approach: build a model of the form p(C_k | x) directly.


Introduction to probability

The probability of an event is the fraction of times that the event occurs out of some number of trials, as the number of trials approaches infinity.

The probability of an event X, denoted by p(X), lies in the range [0, 1].

We use p(X) to refer to a distribution over a random variable, and p(x_i) to refer to the distribution evaluated at a particular value.

Mutually exclusive events are those events that cannot simultaneously occur.

The sum of the probabilities of all mutually exclusive events must equal 1.


Joint probability

Let n_ij be the number of times events i and j simultaneously occur, and let N = ∑_i ∑_j n_ij.

The joint probability is

p(X = x_i, Y = y_j) = n_ij / N.

Let c_i = ∑_j n_ij and r_j = ∑_i n_ij.

The probability of X irrespective of Y is

p(X = x_i) = c_i / N.

Therefore, we can marginalize, or sum over Y:

p(X = x_i) = ∑_j p(X = x_i, Y = y_j).


Marginalization

Consider only the instances for which X = x_i, and take the fraction of those instances for which Y = y_j.

This is the conditional probability, written p(Y = y_j | X = x_i), the probability of Y given X:

p(Y = y_j | X = x_i) = n_ij / c_i.

Now consider

p(X = x_i, Y = y_j) = n_ij / N = (n_ij / c_i)(c_i / N) = p(Y = y_j | X = x_i) p(X = x_i).

If two events are independent, p(X, Y) = p(X) p(Y) and p(X | Y) = p(X).

Sum rule: p(X) = ∑_Y p(X, Y)

Product rule: p(X, Y) = p(Y | X) p(X)
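To make these definitions concrete, here is a minimal sketch in Python (NumPy assumed) that computes the joint, marginal, and conditional probabilities from a made-up table of counts n_ij.

```python
# A minimal sketch of joint, marginal, and conditional probabilities
# computed from a table of counts n_ij; the counts themselves are made up.
import numpy as np

n = np.array([[30., 10.],    # n_ij: rows index values of X, columns values of Y
              [20., 40.]])
N = n.sum()

joint = n / N                                    # p(X = x_i, Y = y_j) = n_ij / N
p_x = joint.sum(axis=1)                          # sum rule: p(X = x_i) = sum_j p(x_i, y_j)
p_y_given_x = n / n.sum(axis=1, keepdims=True)   # p(Y = y_j | X = x_i) = n_ij / c_i

# product rule check: p(X, Y) = p(Y | X) p(X)
assert np.allclose(joint, p_y_given_x * p_x[:, None])
print(joint, p_x, p_y_given_x, sep="\n")
```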


Expected value and variance

The expectation, expected value, or mean of a random variable X, denoted by E[X], is the average value of X in a large number of experiments:

E[X] = ∑_x x p(x)   (discrete case)

or

E[X] = ∫ x p(x) dx   (continuous case).

Variance measures how much X varies around the expected value:

Var(X) = E[(X − E[X])^2].

Covariance indicates the relationship between two random variables X and Y:

Cov(X, Y) = E[(X − E[X])(Y − E[Y])].
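As a quick illustration, the following sketch estimates E[X], Var(X), and Cov(X, Y) from samples; the generating parameters are arbitrary and only NumPy is assumed.

```python
# A short sketch: Monte Carlo estimates of E[X], Var(X), and Cov(X, Y)
# for correlated samples; the generating parameters are purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=100_000)
y = 0.5 * x + rng.normal(scale=0.5, size=100_000)

mean_x = x.mean()                                 # estimate of E[X]
var_x = ((x - mean_x) ** 2).mean()                # estimate of E[(X - E[X])^2]
cov_xy = ((x - mean_x) * (y - y.mean())).mean()   # estimate of Cov(X, Y)
print(mean_x, var_x, cov_xy)                      # roughly 2.0, 1.0, 0.5
```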


Normal (Gaussian) distribution

For a one-dimensional normal or Gaussian distributed variable X with mean µ and variance σ^2, denoted as N(µ, σ^2), we have

p(x) = N(µ, σ^2) = 1/(σ√(2π)) · exp{ −(x − µ)^2 / (2σ^2) }.

A D-dimensional (multivariate) Gaussian distribution is

p(x) = N(µ, Σ) = 1/((2π)^{D/2} |Σ|^{1/2}) · exp{ −(1/2)(x − µ)^T Σ^{−1} (x − µ) }.
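A small sketch of this density in code, assuming NumPy; the mean, covariance, and evaluation point below are illustrative.

```python
# A minimal sketch of the D-dimensional Gaussian density N(mu, Sigma)
# evaluated at one point; mu, Sigma, and x are chosen for illustration.
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    D = mu.shape[0]
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
x = np.array([0.5, -1.0])
print(gaussian_pdf(x, mu, Sigma))
```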


Bayes theorem

Bayes theorem:

p(Y | X) = p(X | Y) p(Y) / p(X) = p(X | Y) p(Y) / ∑_Y p(X | Y) p(Y)

p(Y) is called the prior of Y. This is the information we have before observing anything about the Y that was drawn.

p(Y | X) is called the posterior probability, or simply the posterior. This is the distribution of Y after observing X.

p(X | Y) is called the likelihood and is the conditional probability that an event Y has the associated observation X.

p(X) is called the evidence and is the marginal probability that an observation X is seen.

In other words,

posterior = (prior × likelihood) / evidence.


Maximum a posteriori estimation

In many learning scenarios, the learner considers some set 𝒴 of candidate values and is interested in finding the most probable Y ∈ 𝒴 given the observed data X.

This is called maximum a posteriori (MAP) estimation, and it can be computed using Bayes theorem.

Y_MAP = argmax_{Y ∈ 𝒴} p(Y | X)
      = argmax_{Y ∈ 𝒴} p(X | Y) p(Y) / p(X)
      = argmax_{Y ∈ 𝒴} p(X | Y) p(Y)

p(X) is dropped because it is constant and independent of Y.

Y_MAP = argmax_{Y ∈ 𝒴} p(X | Y) p(Y)
      = argmax_{Y ∈ 𝒴} {log p(X | Y) + log p(Y)}
      = argmin_{Y ∈ 𝒴} {−log p(X | Y) − log p(Y)}
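The toy sketch below contrasts Y_MAP and Y_ML over a small hypothesis set; the coin-flip observation and the prior values are invented for illustration, and only NumPy is assumed.

```python
# A hedged sketch of MAP vs ML estimation over a small hypothesis set:
# hypotheses are coin biases, the observation X is 7 heads in 10 flips,
# and the prior values are invented for illustration.
import numpy as np

thetas = np.array([0.3, 0.5, 0.7])            # candidate hypotheses Y
priors = np.array([0.1, 0.8, 0.1])            # p(Y)

# log P(X | Y) up to a constant (the binomial coefficient cancels in the argmax)
log_lik = 7 * np.log(thetas) + 3 * np.log(1 - thetas)
log_post = log_lik + np.log(priors)           # log P(X|Y) + log P(Y); P(X) is dropped

print("Y_ML  =", thetas[np.argmax(log_lik)])   # 0.7
print("Y_MAP =", thetas[np.argmax(log_post)])  # 0.5: the strong prior on 0.5 wins
```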


Maximum likelihood estimation

In some cases, we assume that every Y ∈ 𝒴 is equally probable. MAP estimation then reduces to maximum likelihood (ML) estimation.

Y_ML = argmax_{Y ∈ 𝒴} p(X | Y)
     = argmax_{Y ∈ 𝒴} log p(X | Y)
     = argmin_{Y ∈ 𝒴} {−log p(X | Y)}

Let x_1, x_2, ..., x_N be random samples drawn from p(X, Y). Assuming statistical independence between the different samples, we can form p(X | Y) as

p(X | Y) = p(x_1, x_2, ..., x_N | Y) = ∏_{n=1}^{N} p(x_n | Y)

This method estimates Y so that p(X | Y) takes its maximum value:

Y_ML = argmax_{Y ∈ 𝒴} ∏_{n=1}^{N} p(x_n | Y)


Maximum likelihood estimation (cont.)

A necessary condition for Y_ML to be a maximum is that the gradient of the likelihood function with respect to Y be zero:

∂( ∏_{n=1}^{N} p(x_n | Y) ) / ∂Y = 0

Because of the monotonicity of the logarithmic function, we define the log-likelihood function as

L(Y) = ln ∏_{n=1}^{N} p(x_n | Y)

Equivalently, we have

∂L(Y)/∂Y = ∑_{n=1}^{N} ∂ ln p(x_n | Y) / ∂Y = ∑_{n=1}^{N} (1 / p(x_n | Y)) ∂p(x_n | Y)/∂Y = 0
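For Gaussian samples this zero-gradient condition is solved by the sample mean and the (biased) sample variance; the sketch below checks this numerically on synthetic data, assuming NumPy.

```python
# A small sketch: for i.i.d. Gaussian samples, setting the gradient of the
# log-likelihood to zero gives the sample mean and biased sample variance.
# The data and the true parameters are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=10_000)

mu_ml = x.mean()                    # solves sum_n (x_n - mu) / sigma^2 = 0
var_ml = ((x - mu_ml) ** 2).mean()  # solves the corresponding equation for sigma^2

def log_lik(mu, var):
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))

# Nearby parameter values never beat the ML solution.
print(log_lik(mu_ml, var_ml) >= log_lik(mu_ml + 0.1, var_ml))  # True
print(mu_ml, np.sqrt(var_ml))                                  # roughly 3.0 and 2.0
```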


Bayes decision theory

Given a classification task of M classes, C_1, C_2, ..., C_M, and an input vector x, we can form M conditional probabilities

p(C_k | x)   for k = 1, 2, ..., M.

Without loss of generality, consider a two-class classification problem. From Bayes theorem, we have

p(C_k | x) = p(x | C_k) p(C_k) / p(x).

The Bayes classification rule is

if p(C_1 | x) > p(C_2 | x) then x is classified to C_1

if p(C_1 | x) < p(C_2 | x) then x is classified to C_2

if p(C_1 | x) = p(C_2 | x) then x is classified to either C_1 or C_2

Since p(x) is the same for all classes, it can be removed. Hence

p(x | C_1) p(C_1) ≶ p(x | C_2) p(C_2).

If p(C_1) = p(C_2) = 1/2, then we have

p(x | C_1) ≶ p(x | C_2).
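A minimal sketch of this two-class rule with one-dimensional Gaussian class-conditional densities; the means, variances, and priors are illustrative, and only NumPy is assumed.

```python
# A minimal sketch of the two-class Bayes rule with 1-D Gaussian
# class-conditional densities; the parameters are invented for illustration.
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def classify(x, p_c1=0.5, p_c2=0.5):
    # assign x to the class with the larger p(x | C_k) p(C_k)
    return "C1" if gauss(x, 0.0, 1.0) * p_c1 > gauss(x, 2.0, 1.0) * p_c2 else "C2"

print(classify(0.3))                      # C1 (closer to the C1 mean)
print(classify(1.2))                      # C2 with equal priors
print(classify(1.2, p_c1=0.9, p_c2=0.1))  # C1: a strong prior moves the boundary
```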


Bayes decision theory

Assume p(C_1) = p(C_2) = 1/2, and let R_1 and R_2 be the regions of the feature space in which we decide C_1 and C_2, respectively.

The region where the two class-conditional densities overlap may produce errors. The probability of error equals

P_e = p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1)
    = (1/2) ∫_{R_1} p(x | C_2) dx + (1/2) ∫_{R_2} p(x | C_1) dx


Discriminant function and decision surface

If two regions R_i and R_j happen to be contiguous, they are separated by a decision surface in the multi-dimensional feature space.

For the minimum error probability case, this surface is described by the equation

p(C_i | x) − p(C_j | x) = 0.

On one side of the surface this difference is positive, and on the other side it is negative.

Sometimes, instead of working directly with probabilities (or risks), it is more convenient to work with an equivalent function of them, such as

g_i(x) = f(p(C_i | x)),

where f(·) is a monotonically increasing function. (Why?)

The function g_i(x) is known as a discriminant function.

Now, the decision test is stated as: classify x in C_i if g_i(x) > g_j(x) for all j ≠ i.

The decision surfaces, separating contiguous regions, are described by

g_ij(x) = g_i(x) − g_j(x) = 0,   for all i, j = 1, 2, ..., M with j ≠ i.


Discriminant function for Normally distributed classes

The one-dimensional Gaussian distribution with mean µ and variance σ^2 is given by

p(x) = N(µ, σ^2) = 1/(σ√(2π)) · exp( −(x − µ)^2 / (2σ^2) ).

The D-dimensional Gaussian distribution with mean µ and covariance matrix Σ is

p(x) = N(µ, Σ) = 1/((2π)^{D/2} |Σ|^{1/2}) · exp( −(1/2)(x − µ)^T Σ^{−1} (x − µ) ).

What is the optimal classifier when the involved pdfs are N(µ, Σ)?

Because of the exponential form of the involved densities, it is preferable to work with the following discriminant functions:

g_i(x) = ln[p(x | C_i) p(C_i)] = ln p(x | C_i) + ln p(C_i)

g_i(x) = −(1/2)(x − µ_i)^T Σ_i^{−1} (x − µ_i) + w_{i0}

w_{i0} = −(D/2) ln(2π) − (1/2) ln|Σ_i| + ln p(C_i)

By expanding the above equation, we obtain the following quadratic form:

g_i(x) = −(1/2) x^T Σ_i^{−1} x + (1/2) x^T Σ_i^{−1} µ_i − (1/2) µ_i^T Σ_i^{−1} µ_i + (1/2) µ_i^T Σ_i^{−1} x + w_{i0}
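A hedged sketch of this discriminant for two Gaussian classes; the code evaluates ln p(x | C_i) + ln p(C_i) directly, and the means, covariances, and priors are invented for illustration (NumPy assumed).

```python
# A hedged sketch of the discriminant g_i(x) = ln p(x | C_i) + ln p(C_i) for
# Gaussian class-conditional densities; all parameters below are made up.
import numpy as np

def g(x, mu, Sigma, prior):
    D = mu.shape[0]
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff   # quadratic term
            - 0.5 * D * np.log(2 * np.pi)               # part of w_i0
            - 0.5 * np.log(np.linalg.det(Sigma))        # part of w_i0
            + np.log(prior))                            # ln p(C_i)

mu1, S1 = np.array([0.0, 0.0]), np.eye(2)
mu2, S2 = np.array([2.0, 2.0]), np.array([[2.0, 0.3], [0.3, 0.5]])

x = np.array([1.0, 1.5])
label = 1 if g(x, mu1, S1, 0.5) > g(x, mu2, S2, 0.5) else 2
print(label)   # 2 for this x; unequal covariances make the decision surface quadratic
```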


Discriminant function for Normally distributed classes (example)

For Normally distributed classes, we have the following quadratic-form classifier:

g_i(x) = −(1/2) x^T Σ_i^{−1} x + (1/2) x^T Σ_i^{−1} µ_i − (1/2) µ_i^T Σ_i^{−1} µ_i + (1/2) µ_i^T Σ_i^{−1} x + w_{i0}

Assume

Σ_i = [ σ_i^2   0
        0       σ_i^2 ]

Thus we have

g_i(x) = −(1/(2σ_i^2))(x_1^2 + x_2^2) + (1/σ_i^2)(µ_{i1} x_1 + µ_{i2} x_2) − (1/(2σ_i^2))(µ_{i1}^2 + µ_{i2}^2) + w_{i0}

Obviously, the associated decision curves g_i(x) − g_j(x) = 0 are quadratics.

In this case the Bayesian classifier is a quadratic classifier, i.e. the partition of the feature space is performed via quadratic decision surfaces.


Supervised learning of the Bayesian classifiers

So far we assumed that the class-conditional pdfs p(x | C_i) and the prior probabilities p(C_i) were known. In practice, this is never the case, and we now study supervised learning of the class-conditional pdfs.

For supervised learning we need training samples. The training set contains feature vectors from each class, and we re-arrange the training samples based on their classes:

S_i = {(x_{i1}, t_{i1}), (x_{i2}, t_{i2}), ..., (x_{iN_i}, t_{iN_i})}

N_i is the number of training samples from class C_i.

We assume that the training samples in the sets S_i are occurrences of independent random variables.

There are various ways to estimate the probability density functions.

If we know the type of the pdf, we can estimate its parameters, such as the mean and variance, from the available data. These methods are known as parametric methods.

In many cases, we may not have information about the type of the pdf, but we may know certain statistical parameters such as the mean and the variance. These methods are known as nonparametric methods.


Naive Bayes classifier

Bayesian classifiers estimate posterior probabilities based on the likelihood, the prior, and the evidence.

These classifiers first estimate p(x | C_i) and p(C_i) and then classify the given instance.

How much training data will be required to obtain reliable estimates of these distributions?

Consider the number of parameters that must be estimated when C = 2 and x is a vector of D boolean features.

In this case, we need to estimate a set of parameters

θ_ij = p(x_i | C_j)

where index i takes on 2^D possible values (one for each possible feature vector) and j takes on 2 possible values.

Therefore, we will need to estimate exactly 2(2^D − 1) such θ_ij parameters.

Unfortunately, this corresponds to two distinct parameters for each of the distinct instances in the instance space for x.

In order to obtain reliable estimates of each of these parameters, we will need to observe each of these distinct instances multiple times! This is clearly unrealistic in most practical learning domains.

For example, if x is a vector containing 30 boolean features, then we will need to estimate more than 2 billion parameters.
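A one-line check of that count, a quick sketch assuming only a Python interpreter:

```python
# Parameter count for D = 30 boolean features and 2 classes: 2 * (2^30 - 1),
# which is a little over 2.1 billion.
print(2 * (2 ** 30 - 1))   # 2147483646
```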


Naive Bayes classifier (cont.)

Given the intractable sample complexity of learning Bayesian classifiers in this way, we must look for ways to reduce this complexity.

The Naive Bayes classifier does this by making a conditional independence assumption that dramatically reduces the number of parameters to be estimated when modelling p(x_i | C_j), from our original 2(2^D − 1) down to just 2D.

Definition (Conditional Independence)

Given random variables x, y, and z, we say x is conditionally independent of y given z if and only if the probability distribution governing x is independent of the value of y given z; that is,

p(x_i | y_j, z_k) = p(x_i | z_k)   for all i, j, k.

The Naive Bayes algorithm is a classification algorithm based on Bayes rule that assumes the features x_1, x_2, ..., x_D are all conditionally independent of one another given the class label C_j. Thus we have

p(x_1, x_2, ..., x_D | C_j) = ∏_{i=1}^{D} p(x_i | C_j)

Note that when C and the x_i are boolean variables, we need only 2D parameters to define p(x_{ik} | C_j) for the necessary i, j, and k.


Naive Bayes classifier (cont.)

We derive the Naive Bayes algorithm assuming, in general, that C is any discrete-valued variable and the features x_1, x_2, ..., x_D are any discrete or real-valued features.

Our goal is to train a classifier that will output the probability distribution over possible values of C for each new instance x that we ask it to classify.

The probability that C will take on its k-th possible value equals

p(C_k | x_1, x_2, ..., x_D) = [p(C_k) p(x_1, x_2, ..., x_D | C_k)] / [∑_j p(C_j) p(x_1, x_2, ..., x_D | C_j)]

Now, assuming the x_i are conditionally independent given the class, we can rewrite this as

p(C_k | x_1, x_2, ..., x_D) = [p(C_k) ∏_i p(x_i | C_k)] / [∑_j p(C_j) ∏_i p(x_i | C_j)]

The Naive Bayes classification rule is

C = argmax_{C_k} [p(C_k) ∏_i p(x_i | C_k)] / [∑_j p(C_j) ∏_i p(x_i | C_j)]

Since the denominator does not depend on C_k, the rule simplifies to

C = argmax_{C_k} p(C_k) ∏_i p(x_i | C_k)
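A toy sketch of this rule on binary features, computed in log space for numerical stability; the class priors and per-feature probabilities are made up, and only NumPy is assumed.

```python
# A toy sketch of the Naive Bayes rule C = argmax_k p(C_k) prod_i p(x_i | C_k),
# using log-probabilities; all probabilities below are invented.
import numpy as np

priors = {"spam": 0.4, "ham": 0.6}                      # p(C_k)
cond = {                                                # p(x_i = 1 | C_k) per feature
    "spam": np.array([0.8, 0.6, 0.1]),
    "ham":  np.array([0.1, 0.3, 0.5]),
}

def predict(x):                                         # x is a binary feature vector
    scores = {}
    for c in priors:
        p = np.where(x == 1, cond[c], 1 - cond[c])      # Bernoulli likelihood per feature
        scores[c] = np.log(priors[c]) + np.log(p).sum()
    return max(scores, key=scores.get)

print(predict(np.array([1, 1, 0])))   # spam
print(predict(np.array([0, 0, 1])))   # ham
```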


Naive Bayes for discrete-valued inputs

When the D input features x_i each take on J possible discrete values and C is a discrete variable taking on M possible values, our learning task is to estimate two sets of parameters:

θ_ijk = p(x_i = x'_ij | C = C_k)   (the probability that feature x_i takes its j-th value x'_ij in class C_k)

π_k = p(C = C_k)

We can estimate these parameters using either ML estimates or Bayesian/MAP estimates.

θ_ijk = |x_i = x'_ij ∧ C = C_k| / |C_k|

This maximum likelihood estimate sometimes results in θ estimates of zero if the data does not happen to contain any training examples satisfying the condition in the numerator. To avoid this, it is common to use a smoothed estimate:

θ_ijk = (|x_i = x'_ij ∧ C = C_k| + l) / (|C_k| + lJ)

The value of l determines the strength of this smoothing. The corresponding estimates for π_k are

π_k = |C_k| / N   (maximum likelihood)   or   π_k = (|C_k| + l) / (N + lM)   (smoothed).
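The following sketch computes the smoothed θ_ijk and π_k for a single discrete feature; the counts and the smoothing strength l are illustrative, and NumPy is assumed.

```python
# A brief sketch of the smoothed estimates theta_ijk and pi_k for one
# discrete-valued feature; the tiny data set and l are illustrative.
import numpy as np

# one feature x_i with J = 3 possible values, class labels t in {0, 1}
x_i = np.array([0, 1, 1, 2, 0, 0, 2, 1])
t   = np.array([0, 0, 1, 1, 0, 1, 1, 0])
J, M, l = 3, 2, 1.0                       # feature arity, number of classes, smoothing

for k in range(M):
    in_class = (t == k)
    counts = np.array([(x_i[in_class] == v).sum() for v in range(J)])
    theta = (counts + l) / (in_class.sum() + l * J)   # smoothed p(x_i = v | C_k)
    pi = (in_class.sum() + l) / (len(t) + l * M)      # smoothed p(C_k)
    print(k, theta, pi)
```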


Naive Bayes for continuous inputs

When features are continuous, we must choose some other way to represent the distributions p(x_i | C_k).

One common approach is to assume that, for each possible class C_k, the distribution of each feature x_i is Gaussian, with a mean and variance specific to x_i and C_k. In order to train such a Naive Bayes classifier, we must therefore estimate the mean and standard deviation of each of these distributions:

µ_ik = E[x_i | C_k]

σ^2_ik = E[(x_i − µ_ik)^2 | C_k]

We must also estimate the prior on C:

π_k = p(C = C_k)

We can use either ML or MAP estimates for these parameters. The maximum likelihood estimator for µ_ik is

µ̂_ik = [∑_j x_ij δ(t_j = C_k)] / [∑_j δ(t_j = C_k)]

where x_ij is the value of feature x_i in the j-th training example and δ(·) equals 1 when its argument is true and 0 otherwise. The maximum likelihood estimator for σ^2_ik is

σ̂^2_ik = [∑_j (x_ij − µ̂_ik)^2 δ(t_j = C_k)] / [∑_j δ(t_j = C_k)]
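A minimal sketch of these estimates on a tiny synthetic data set, assuming NumPy; the data values are purely illustrative.

```python
# A minimal sketch of the ML estimates mu_ik and sigma^2_ik used by Gaussian
# Naive Bayes; the tiny data set below is made up.
import numpy as np

X = np.array([[1.0, 2.1], [0.8, 1.9], [3.1, 0.2], [2.9, 0.4]])  # rows: samples
t = np.array([0, 0, 1, 1])                                      # class labels

for k in np.unique(t):
    Xk = X[t == k]
    mu = Xk.mean(axis=0)                    # mu_ik for every feature i
    var = ((Xk - mu) ** 2).mean(axis=0)     # sigma^2_ik (ML, i.e. biased, estimate)
    prior = (t == k).mean()                 # pi_k
    print(k, mu, var, prior)
```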


k−Nearest neighbor classifier

Suppose that we have a data set with N_i points in class C_i and N points in total, so that ∑_i N_i = N.

To classify a new point x, we draw a sphere centered on x containing precisely k points, irrespective of their class.

Thus, to classify a new point, we identify the k nearest points from the training data set and then assign the new point to the class having the largest number of representatives amongst this set.

The particular case of k = 1 is called the nearest-neighbor rule, because a test point is simply assigned to the same class as the nearest point from the training set.

An interesting property of the nearest-neighbor (k = 1) classifier is that, in the limit N → ∞, its error rate is never more than twice the minimum achievable error rate of an optimal classifier.
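A compact sketch of the k-nearest-neighbor rule with Euclidean distance and majority voting; the training points are invented, and only NumPy and the standard library are assumed.

```python
# A compact sketch of a k-nearest-neighbor classifier with Euclidean distance
# and majority voting; the training points are invented for illustration.
import numpy as np
from collections import Counter

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
t_train = np.array([0, 0, 1, 1, 1])

def knn_predict(x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)      # distance to every training point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    return Counter(t_train[nearest]).most_common(1)[0][0]

print(knn_predict(np.array([0.1, 0.2]), k=1))  # 0 (nearest-neighbor rule)
print(knn_predict(np.array([0.6, 0.6]), k=3))  # 1 (majority of the 3 nearest)
```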


k−Nearest neighbor classifier (example)

The parameter k controls the degree of smoothing: a small k produces many small regions of each class, while a large k leads to fewer, larger regions.
