TRANSCRIPT
Statistical foundations of machine learning (INFO-F-422)
Gianluca Bontempi
Machine Learning Group
Computer Science Department
mlg.ulb.ac.be
Classification problem
Let x ∈ R^n denote a real valued random input vector and y a categorical random output variable that takes values in the set {c1, . . . , cK}. For example, let x be the month and y a categorical variable taking K = 2 possible values {RAIN, NO.RAIN}. We can distinguish two situations in classification:
Separable classes: given an input x, the output y always takes the same value. In other terms

∀x ∈ R^n ∃ck : Prob {y = ck |x} = 1

This is also known as the noiseless or degenerate situation.
Non separable classes: given an input x , we can have
realizations of y with different values. In other terms
∃x ∈ Rn : ∀ck Prob {y = ck |x} < 1
In the non-separable case a partition of the input space which returns a null classification error is not possible. Most interesting real problems are of this kind.
Stochastic setting
We consider a stochastic setting to model non-separable tasks.
This means that data are noisy and follow a statistical distribution. In other terms, given an input x, y does not always take the same value.
However, y follows a statistical distribution such that
∑_{k=1}^{K} Prob {y = ck |x} = 1
Binary classification problem
It is a problem where the output class y can take only K = 2 values.
Suppose for simplicity that y ∈ {c1 = 0, c2 = 1}. Let us denote

p0 = Prob {y = 0}, p1 = Prob {y = 1}

where p1 + p0 = 1.
Note that for a binary variable
E [y] = 0 · p0 + 1 · p1 = p1
Var [y] = E [(y − E [y])²] = p0 (0 − p1)² + p1 (1 − p1)² = p1(1 − p1)
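These identities can be checked with a few lines of Python (a sketch; p1 = 0.3 is an arbitrary illustrative value):

```python
# Check that for a binary variable E[y] = p1 and Var[y] = p1 (1 - p1).
def bernoulli_moments(p1):
    p0 = 1.0 - p1
    mean = 0 * p0 + 1 * p1                      # E[y]
    var = p0 * (0 - mean) ** 2 + p1 * (1 - mean) ** 2   # E[(y - E[y])^2]
    return mean, var

mean, var = bernoulli_moments(0.3)   # mean = 0.3, var = 0.3 * 0.7 = 0.21
```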
Degree of non separability
In the case of non-separable classes we can have classification problems with different degrees of separability. Let
Prob {y = 1|x} = p1(x), Prob {y = 0|x} = p0(x)
The degree of non-separability for an input x may be quantified by measures such as
Conditional variance
Var [y|x] = p1(x) (1 − p1(x))
Conditional entropy
H[y|x] = −p0(x) log p0(x) − p1(x) log p1(x)
Both measures attain their maximum value when p0 = p1 = 1/2
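A small sketch verifying that both measures peak at p1(x) = 1/2 (Python, natural logarithm for the entropy):

```python
import math

# Two measures of non-separability at an input x, as functions of p1(x).
def cond_variance(p1):
    return p1 * (1.0 - p1)

def cond_entropy(p1):
    # convention: 0 * log(0) = 0
    return -sum(p * math.log(p) for p in (p1, 1.0 - p1) if p > 0.0)

# Both are maximized at p1 = 1/2, i.e. full overlap of the two classes.
grid = [i / 100.0 for i in range(101)]
best_var = max(grid, key=cond_variance)
best_ent = max(grid, key=cond_entropy)
```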
The conditional distribution
The figure plots Prob {y = RAIN|x = month} and Prob {y = NO.RAIN|x = month} for each month. Note that for a fixed month
Prob {y = RAIN|x = month} + Prob {y = NO.RAIN|x = month} = 1
The classification statistical setting
Let x ∈ R^n denote a real valued random input vector and y a categorical random output variable that takes values in the set {c1, . . . , cK}. Let Prob {y = ck |x} be the probability that the output belongs to the kth class given the set of measurements x. It follows that
∑_{k=1}^{K} Prob {y = ck |x} = 1
An estimate c(x) of the class takes values in {c1, . . . , cK}. We define a [K, K] loss matrix L, null on the diagonal and non-negative elsewhere.
Loss matrix
(rows: predicted class, columns: real class)

           REAL
           c1    c2    c3
PRED  c1   L11   L12   L13
      c2   L21   L22   L23
      c3   L31   L32   L33
The element L(j,k) = L(cj, ck) denotes the cost of the misclassification when the predicted class is c(x) = cj and the correct class is ck.
Suppose that for a given x the classifier returns c(x). The average cost of this classification is
∑_{k=1}^{K} L(c(x), ck) Prob {y = ck |x}
The Bayes classifier
The goal of the classification procedure for a given x is to find the predictor c(x) that minimizes
∑_{k=1}^{K} L(c(x), ck) Prob {y = ck |x}
The optimal classifier (also known as the Bayes classifier) is the one that returns for all x

c∗(x) = arg min_{cj ∈ {c1,...,cK}} ∑_{k=1}^{K} L(j, k) Prob {y = ck |x}
The 0-1 case
In the case of a 0-1 loss function the optimal classifier returns
c∗(x) = arg min_{cj ∈ {c1,...,cK}} ∑_{k=1,...,K, k≠j} Prob {y = ck |x} =
= arg min_{cj ∈ {c1,...,cK}} (1 − Prob {y = cj |x}) =
= arg max_{cj ∈ {c1,...,cK}} Prob {y = cj |x}
The Bayes decision rule selects the maximum a posteriori class cj, j = 1, . . . , K, that is the class that maximizes the posterior probability Prob {y = cj |x}.
Discrete input example
CONDITIONAL PROB Prob {y = ck |x}:

 x   C1    C2    C3
 1   0.6   0.3   0.1
 2   0.2   0.8   0.0
 3   0.9   0.04  0.06
 4   0.5   0.25  0.25
 5   0.3   0.1   0.6

LOSS MATRIX (rows: PRED, columns: REAL):

      C1   C2   C3
 C1   0    1    5
 C2   20   0    10
 C3   2    1    0
The Bayes classification in x = 2 is given by

c∗(2) = arg min_{k=1,2,3} {
0.2 · 0 + 0.8 · 1 + 0.0 · 5,   (avg loss if c = 1)
0.2 · 20 + 0.8 · 0 + 0.0 · 10,   (avg loss if c = 2)
0.2 · 2 + 0.8 · 1 + 0.0 · 0   (avg loss if c = 3)
} = arg min_{k=1,2,3} {0.8, 4, 1.2} = 1
What would have been the Bayes classification in the 0-1 case?
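The worked example above can be recomputed in a few lines; a minimal Python sketch of the loss-based Bayes rule (1-based class indices):

```python
# Bayes classifier with a general loss matrix: L[j][k] is the cost of
# predicting c_{j+1} when the real class is c_{k+1} (values from the
# discrete input example).
L = [[0, 1, 5],
     [20, 0, 10],
     [2, 1, 0]]

def bayes_class(post, L):
    """Return the 1-based index of the class minimizing the average loss."""
    K = len(L)
    avg_loss = [sum(L[j][k] * post[k] for k in range(K)) for j in range(K)]
    return min(range(K), key=lambda j: avg_loss[j]) + 1, avg_loss

post_x2 = [0.2, 0.8, 0.0]            # Prob{y = c_k | x = 2}
cstar, losses = bayes_class(post_x2, L)
```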
The Bayes’ theorem
According to Bayes' theorem the following relations hold

Prob {y = ck |x = x} = Prob {x = x |y = ck} Prob {y = ck} / ∑_{k=1}^{K} Prob {x = x |y = ck} Prob {y = ck}

Prob {x = x |y = ck} = Prob {y = ck |x = x} Prob {x = x} / Prob {y = ck}
This means that by knowing the conditional distribution Prob {y = ck |x = x} and the a priori distribution Prob {x = x}, we can derive the conditional distribution Prob {x = x |y = ck}. Note that for a multivariate x ∈ R^n (for example, n = 40000 is the number of measured genes)
1. Prob {y = ck |x = x} is a multi-input single-output function which for each value of x returns a value in [0, 1].
2. for a given class, Prob {x = x |y = ck} is a multivariate n-dimensional distribution.
Inverse conditional distribution
The figure plots Prob {x = month|y = RAIN} and Prob {x = month|y = NO.RAIN} for each month, assuming that the a priori distribution is uniform. Note that

∑_month Prob {x = month|y = NO.RAIN} = ∑_month Prob {x = month|y = RAIN} = 1
Two dimensional input
[Figure: observed data in the (x1, x2) plane: 2 classes (red and green)]
Classification strategies
Optimal classification is possible only if the quantities Prob {y = ck |x}, k = 1, . . . , K are known. What happens if this is not the case? Three strategies are generally used.
Discriminant functions. A classifier can be represented in terms of a set of K discriminant functions

gk(x) = Prob {y = k |x}, k = 1, . . . , K
associated to the K a posteriori probabilities, such that the classifier assigns a feature vector x to the class c(x) = ck if

gk(x) > gj(x) for all j ≠ k

The discriminant functions divide the feature space into K decision regions. The regions are separated by decision boundaries, i.e. surfaces in the domain of x where ties occur among the largest discriminant functions.
Classification strategies (II)
Density estimation via the Bayes theorem. Since

Prob {y = ck |x} = p(x |y = ck) Prob {y = ck} / p(x)

an estimation of p(x |y = ck) allows an estimation of Prob {y = ck |x}.
Direct estimation via regression techniques. If the classification problem has K = 2 classes and if we denote them by y = 0 and y = 1
E [y|x] = 1·Prob {y = 1|x}+0·Prob {y = 0|x} = Prob {y = 1|x}
Then the classification problem can be put in the form of a regression problem where the output takes values in {0, 1}.
Misclassification probability
Let us consider a binary classification task, i.e. y ∈ {c0, c1}. What is the probability of misclassification for a generic classifier ŷ = h(x, αN) trained with a dataset DN?

MME(x) = Prob {ŷ ≠ y|x} =
= Prob {ŷ = c1|x} Prob {y = c0|x} + Prob {ŷ = c0|x} Prob {y = c1|x} =
= Prob {ŷ = c1|x} [1 − Prob {y = c1|x}] + Prob {ŷ = c0|x} [1 − Prob {y = c0|x}] =
= 1 − Prob {ŷ = c1|x} Prob {y = c1|x} − Prob {ŷ = c0|x} Prob {y = c0|x} =
= 1 − ∑_{j=0}^{1} Prob {ŷ = cj |x} Prob {y = cj |x}

Note that both y and ŷ = h(x, αN) are random variables.
Bias/variance decomposition in classification
Let us consider the squared sum:

(1/2) ∑_{j=0}^{1} (Prob {ŷ = cj} − Prob {y = cj})² =
= (1/2) ∑_{j=0}^{1} Prob {ŷ = cj}² + (1/2) ∑_{j=0}^{1} Prob {y = cj}² − ∑_{j=0}^{1} Prob {ŷ = cj} Prob {y = cj}
By adding one to both members we obtain the following decomposition

MME(x) = 1 − ∑_{j=0}^{1} Prob {ŷ = cj |x} Prob {y = cj |x} =
= (1/2) (1 − ∑_{j=0}^{1} Prob {y = cj}²)   "noise"
+ (1/2) ∑_{j=0}^{1} (Prob {ŷ = cj} − Prob {y = cj})²   "squared bias"
+ (1/2) (1 − ∑_{j=0}^{1} Prob {ŷ = cj}²)   "variance"
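The decomposition can be checked numerically; a minimal sketch in Python (the posteriors 0.7 and 0.9 are illustrative values, not from the slides):

```python
# Numerical check of the noise / squared-bias / variance decomposition of
# the misclassification probability MME(x) for binary posteriors.
def decomposition(p_hat, p):
    """p_hat = Prob{yhat = c1|x}, p = Prob{y = c1|x}."""
    ph = [1.0 - p_hat, p_hat]        # Prob{yhat = c_j|x}, j = 0, 1
    py = [1.0 - p, p]                # Prob{y = c_j|x}
    mme = 1.0 - sum(ph[j] * py[j] for j in range(2))
    noise = 0.5 * (1.0 - sum(q * q for q in py))
    bias2 = 0.5 * sum((ph[j] - py[j]) ** 2 for j in range(2))
    var = 0.5 * (1.0 - sum(q * q for q in ph))
    return mme, noise + bias2 + var

mme, decomposed = decomposition(0.7, 0.9)
```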
Discriminant functions
A classifier can be represented in terms of a set of K discriminant functions gk(x), k = 1, . . . , K such that the classifier assigns a feature vector x to the class c(x) = ck if

gk(x) > gj(x) for all j ≠ k

In the case of a zero-one loss function the optimal classifier corresponds to the maximum a posteriori discriminant function gk(x) = Prob {y = k |x}. The discriminant functions divide the feature space into K decision regions. The regions are separated by decision boundaries, i.e. surfaces in the domain of x where ties occur among the largest discriminant functions.
Discriminant functions and Bayes rule
We can multiply all the discriminant functions by the same positive constant or shift them by the same additive constant without influencing the decision.
More generally, if we replace every gk(x) by f(gk(x)), where f(·) is a monotonically increasing function, the resulting classification is unchanged.
Any of the following choices gives an identical classification result:

gk(x) = Prob {y = k |x} = p(x |y = k) P(y = k) / ∑_{k=1}^{K} p(x |y = k) P(y = k)
gk(x) = p(x |y = k) P(y = k)
gk(x) = ln p(x |y = k) + ln P(y = k)

and returns a minimum-error-rate classification.
Discriminant functions in the gaussian case
Let us consider the case where the densities are multivariate normal, i.e. p(x |y = k) ∼ N(µk, Σk), where x ∈ R^n, µk is a [n, 1] vector and Σk is a [n, n] covariance matrix.
Since

p(x |y = k) = 1 / ((√(2π))^n √(det(Σk))) exp{ −(1/2) (x − µk)^T Σk^{-1} (x − µk) }
the discriminant function is then

gk(x) = ln p(x |y = k) + ln P(y = k) =
= −(1/2) (x − µk)^T Σk^{-1} (x − µk) − (n/2) ln 2π − (1/2) ln det(Σk) + ln P(y = k)
In the following we will consider the simplest case: Σk = σ²I, where I is the [n, n] identity matrix.
Gaussian case: Σk = σ2I
This means that all the classes have a Gaussian distribution of x where the covariance matrix is identical and diagonal.
For each class, the x samples fall in equal-size spherical clusters which are parallel to the axes.
We have that the two terms

det(Σk) = σ^{2n},   Σk^{-1} = (1/σ²) I

are independent of k; they are then unimportant additive constants that can be ignored.
Thus we obtain the simple discriminant function

gk(x) = −‖x − µk‖² / (2σ²) + ln P(y = k) = −(x − µk)^T (x − µk) / (2σ²) + ln P(y = k) =
= −(1/(2σ²)) [x^T x − 2 µk^T x + µk^T µk] + ln P(y = k)
Since the quadratic term x^T x is the same for all k, making it an ignorable additive constant, this is equivalent to a linear discriminant function

gk(x) = wk^T x + wk0

where wk is a [n, 1] vector

wk = (1/σ²) µk

and the term

wk0 = −(1/(2σ²)) µk^T µk + ln P(y = k)

is called the bias or threshold.
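A minimal Python sketch of the linear discriminant for Σk = σ²I; the means, priors and σ² below are illustrative values, not from the slides:

```python
import math

# Linear discriminant g_k(x) = w_k^T x + w_k0 for the case Sigma_k = sigma^2 I.
def make_discriminant(mu_k, prior_k, sigma2):
    w = [m / sigma2 for m in mu_k]                                # w_k
    w0 = -sum(m * m for m in mu_k) / (2.0 * sigma2) + math.log(prior_k)  # w_k0
    def g(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + w0
    return g

mus = [[0.0, 0.0], [3.0, 3.0]]       # illustrative class means
priors = [0.5, 0.5]
g = [make_discriminant(m, p, sigma2=1.0) for m, p in zip(mus, priors)]

def classify(x):
    return max(range(len(g)), key=lambda k: g[k](x))
```

With equal priors the rule reduces to assigning x to the nearest mean.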
Decision boundary
In the two-class problem, the decision boundary (i.e. the set of points where g1(x) = g2(x)) is given by the hyperplane having equation

w^T (x − x0) = 0

where

w = µ1 − µ2

and

x0 = (1/2)(µ1 + µ2) − σ² / ‖µ1 − µ2‖² ln (Prob {y = 1} / Prob {y = 2}) (µ1 − µ2)

This equation defines a hyperplane through the point x0 and orthogonal to the vector w. See R script discri.R.
R script discri.R
[Figure: output of discri.R showing the decision boundary in the (x1, x2) plane]
Uniform prior case
If the prior probabilities P(y = k) are the same for all the K classes, then the term ln P(y = k) is an unimportant additive constant that can be ignored.
In this case, it can be shown that the optimum decision rule is a minimum-distance classifier.
This means that in order to classify an input x, it measures the Euclidean distance ‖x − µk‖² from x to each of the K mean vectors, and assigns x to the category of the nearest mean.
It can be shown that for the more general case Σk = Σ, the discriminant rule is based on minimizing the Mahalanobis distance

c(x) = arg min_k (x − µk)^T Σ^{-1} (x − µk)
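A sketch of the minimum Mahalanobis distance rule in R² (illustrative means and inverse covariance; for Σ = I it reduces to the nearest-mean rule):

```python
# Minimum Mahalanobis distance classifier, 2D sketch with illustrative data.
def mahalanobis2(x, mu, sigma_inv):
    """(x - mu)^T Sigma^{-1} (x - mu) for a 2x2 inverse covariance matrix."""
    d = [xi - mi for xi, mi in zip(x, mu)]
    return (d[0] * (sigma_inv[0][0] * d[0] + sigma_inv[0][1] * d[1])
            + d[1] * (sigma_inv[1][0] * d[0] + sigma_inv[1][1] * d[1]))

def mahalanobis_class(x, mus, sigma_inv):
    return min(range(len(mus)), key=lambda k: mahalanobis2(x, mus[k], sigma_inv))

mus = [[0.0, 0.0], [4.0, 0.0]]            # illustrative class means
identity = [[1.0, 0.0], [0.0, 1.0]]       # Sigma = I: Euclidean special case
```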
Hyperplanes
Consider an input space R^n and a hyperplane defined by the equation

h(x, β) = β0 + x^T β = 0

If we are in R², this equation represents a line.
Some properties hold:
Since for any two points x1 and x2 lying on the hyperplane we have (x1 − x2)^T β = 0, the vector normal to the hyperplane is given by

β∗ = β / ‖β‖

The signed distance of any point x to the hyperplane is given by

β∗^T (x − x0) = (x^T β − x0^T β) / ‖β‖ = (1/‖β‖) (x^T β + β0)

where x0 is a point on the hyperplane.
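The signed distance formula can be checked directly; a Python sketch with an illustrative hyperplane β = (3, 4), β0 = −5 in R²:

```python
import math

# Signed distance of a point x to the hyperplane beta0 + x^T beta = 0.
beta = [3.0, 4.0]
beta0 = -5.0
norm = math.hypot(*beta)          # ||beta|| = 5

def signed_distance(x):
    return (sum(b * xi for b, xi in zip(beta, x)) + beta0) / norm

# A point on the hyperplane (e.g. (1, 0.5), since 3 + 2 - 5 = 0) has
# distance 0; the sign of the distance tells on which side x lies.
```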
Hyperplane
[Figure: a hyperplane β0 + x^T β = 0 in the (x1, x2) plane, with its normal vector β∗ and the signed distance (1/‖β‖)(xq^T β + β0) of a query point xq]
Separating hyperplane
The figure shows points in two classes in R².
These data can be separated by a linear boundary. In this case there are infinitely many possible separating hyperplanes.
Perceptron
Classifiers that use the sign of the linear combination h(x, β) = β0 + β^T x to perform classification were called perceptrons in the engineering literature in the late 1950s.
The class returned by a perceptron for a given input xq is

1 if β0 + xq^T β > 0
−1 if β0 + xq^T β < 0

For all well-classified points in the training set the following relation holds

yi (xi^T β + β0) > 0
Misclassifications in the training set occur when

yi = 1 but β0 + β^T xi < 0
yi = −1 but β0 + β^T xi > 0
Perceptron (II)
The learning method of the perceptron is to minimize the quantity

Remp(β, β0) = − ∑_{i∈M} yi (xi^T β + β0)

where M is the subset of misclassified points in the training set.
Note that this quantity is non-negative and proportional to the distance of the misclassified points to the hyperplane.
Since the gradients are

∂Remp(β, β0)/∂β = − ∑_{i∈M} yi xi,   ∂Remp(β, β0)/∂β0 = − ∑_{i∈M} yi

a gradient descent minimization procedure can be adopted.
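The update rule above amounts to stochastic gradient descent on Remp; a minimal Python sketch on an illustrative separable dataset (labels in {−1, +1}):

```python
# Gradient-descent perceptron on a tiny linearly separable set.
X = [[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]]
y = [1, 1, -1, -1]

def train_perceptron(X, y, eta=0.1, epochs=100):
    beta = [0.0] * len(X[0])
    beta0 = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (sum(b * v for b, v in zip(beta, xi)) + beta0) <= 0:
                # misclassified: take a gradient step towards the point
                beta = [b + eta * yi * v for b, v in zip(beta, xi)]
                beta0 += eta * yi
    return beta, beta0

beta, beta0 = train_perceptron(X, y)
errors = sum(yi * (sum(b * v for b, v in zip(beta, xi)) + beta0) <= 0
             for xi, yi in zip(X, y))
```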
Perceptron (III)
Although the perceptron set the foundations for much of the following research in machine learning, a number of problems with this algorithm have to be mentioned:
When the data are separable, there are many possible solutions, and which one is found depends on the initialization of the gradient method.
When the data are not separable, the algorithm will not converge.
Even for a separable problem the convergence of the gradient minimization can be very slow.
A possible solution to the perceptron problems has been proposed by the idea of the optimal separating hyperplane in the SVM approach.
Optimal separating hyperplane
The idea of the optimal separating hyperplane is to separate the two classes by maximizing the distance to the closest point from either class (also known as the margin).
This provides a unique solution to the separating hyperplane problem and was shown to lead to better classification performance on test data. The search for the optimal hyperplane is modeled as the optimization problem

max_{β,β0} C
subject to (1/‖β‖) yi (xi^T β + β0) ≥ C for i = 1, . . . , N

The constraint ensures that all the points are at least a distance C from the decision boundary defined by β and β0.
We seek the largest C that satisfies the constraints and the associated parameters.
Optimal separating hyperplane (II)
Since the hyperplane is invariant when the parameters β0 and β are multiplied by a constant, we can set ‖β‖ = 1/C.
Maximizing C is like minimizing ‖β‖, which is like minimizing ‖β‖².
The maximization problem can be reformulated in a minimization form

min_{β,β0} (1/2)‖β‖²
subject to yi (xi^T β + β0) ≥ 1 for i = 1, . . . , N

The constraints impose a margin around the linear decision boundary of thickness 1/‖β‖.
Constrained optimization problem
Let us consider an optimization problem in the following form

minimize f(b)
subject to fi(b) ≤ 0, i = 1, . . . , m

where b ∈ R^n and b∗ is the optimal feasible solution.
The basic idea in Lagrangian duality is to take the constraints into account by augmenting the objective function with a weighted sum of the constraint functions

L(b, λ) = f(b) + ∑_{i=1}^{m} λi fi(b)

where λi ≥ 0 are the Lagrange multipliers (or dual variables) associated to the ith inequality constraint.
Lagrange dual function
The Lagrange dual function is the minimum value of the Lagrangian over b

g(λ) = inf_b L(b, λ)

which satisfies the property f(b∗) ≥ g(λ). The Lagrange dual function gives us a lower bound, depending on λ, on the optimal value f(b∗) of the optimization problem. The best lower bound that can be obtained from the Lagrange dual function is found by solving the dual optimization problem

maximize g(λ)
subject to λi ≥ 0, i = 1, . . . , m

This Lagrange dual problem is a convex optimization problem, since the objective to be maximized is concave and the constraint is convex. This is the case whether or not the primal problem is convex.
Optimal separating hyperplane (II)
The problem

min_{β,β0} (1/2)‖β‖²
subject to yi (xi^T β + β0) ≥ 1 for i = 1, . . . , N

is a convex optimization problem (quadratic criterion with linear inequality constraints) that can be put in the Lagrange (primal) form

LP(β, β0) = min_{β,β0} (1/2)‖β‖² − ∑_{i=1}^{N} αi [yi (xi^T β + β0) − 1]
Setting the derivatives with respect to β and β0 to zero we obtain:

β = ∑_{i=1}^{N} αi yi xi,   0 = ∑_{i=1}^{N} αi yi

Substituting these in the primal form we obtain the Wolfe dual to be maximized

LD = ∑_{i=1}^{N} αi − (1/2) ∑_{i=1}^{N} ∑_{k=1}^{N} αi αk yi yk xi^T xk

subject to αi ≥ 0.
This dual optimization problem can be solved using an effective algorithm (sequential minimal optimization).
It can be shown that the solution must satisfy the KKT condition

αi [yi (xi^T β + β0) − 1] = 0, ∀i

The above condition means that for each i we are in either of these two situations:
1. yi (xi^T β + β0) = 1, i.e. the point is on the boundary of the margin; then αi > 0
2. yi (xi^T β + β0) > 1, i.e. the point is not on the boundary of the margin; then αi = 0
The training points having an index i such that αi > 0 are called the support vectors.
SVM decision function
Given the solution β, the term β0 is obtained by

β0 = −(1/2) [β^T x∗(1) + β^T x∗(−1)]

where we denote by x∗(1) some (any) support vector belonging to the first class and by x∗(−1) a support vector belonging to the second class.
Now, the decision function can be written as

h(x, β, β0) = sign[x^T β + β0]

or equivalently

h(x, β, β0) = sign[∑_{support vectors} yi αi ⟨x, xi⟩ + β0]

where ⟨x, xi⟩ = x^T xi stands for the inner product.
Some remarks
Once we have found the αi, in order to make a prediction we have to calculate a quantity that depends only on the inner product between x and the points in the training set.
The αi will all be zero except for the support vectors.
Thus, many of the terms in the sum above will be zero, and we really need to find only the inner products between x and the support vectors (of which there is often only a small number) in order to make our prediction.
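A toy illustration of the dual problem. This is not SMO: it uses naive projected gradient ascent, and it absorbs β0 into β by appending a constant 1 to each input, which removes the equality constraint ∑ αi yi = 0 (a common simplification, so it only approximates the exact formulation). Data are illustrative:

```python
# Naive solver for the Wolfe dual on a tiny separable problem (sketch only).
X = [[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [-2.0, 0.0]]
y = [1, 1, -1, -1]
Xa = [xi + [1.0] for xi in X]          # augmented inputs, bias absorbed
N = len(Xa)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Q[i][k] = y_i y_k x_i^T x_k on the augmented inputs
Q = [[y[i] * y[k] * dot(Xa[i], Xa[k]) for k in range(N)] for i in range(N)]

alpha = [0.0] * N
eta = 0.01
for _ in range(5000):
    # gradient of LD = sum_i alpha_i - (1/2) alpha^T Q alpha
    grad = [1.0 - sum(Q[i][k] * alpha[k] for k in range(N)) for i in range(N)]
    # ascent step, then projection on the constraint alpha_i >= 0
    alpha = [max(0.0, a + eta * g) for a, g in zip(alpha, grad)]

# beta = sum_i alpha_i y_i x_i (the last coordinate plays the role of beta0)
beta = [sum(alpha[i] * y[i] * Xa[i][j] for i in range(N)) for j in range(3)]
support = [i for i in range(N) if alpha[i] > 1e-3]
```

Only the two points closest to the boundary end up with αi > 0, i.e. they are the support vectors.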
SVM in the nonlinear case
The extension of the Support Vector (SV) approach to nonlinear classification relies on the transformation of the input variables and on the possibility of effectively adapting the SVM procedure to a transformed input space (possibly of a much larger dimension).
The idea of transforming the input space by using a kernel function is an intuitive manner of extending linear techniques to a nonlinear setting.
Definition (Kernel)
A kernel is a function K : R^n × R^n → R such that for all xi, xj ∈ X

K(xi, xj) = ⟨z(xi), z(xj)⟩   (1)

where ⟨z1, z2⟩ = z1^T z2 stands for the inner product and z(·) is the mapping from the original to the feature space Z.
SVM in the nonlinear case
Let us suppose now that we want to perform a binary classification by SVM in a transformed space z ∈ Z. For the sake of simplicity we will consider a separable case. The parametric identification step requires the solution of a quadratic programming problem in the space Z

max_α ∑_{i=1}^{N} αi − (1/2) ∑_{i=1}^{N} ∑_{k=1}^{N} αi αk yi yk zi^T zk =
= max_α ∑_{i=1}^{N} αi − (1/2) ∑_{i=1}^{N} ∑_{k=1}^{N} αi αk yi yk K(xi, xk)
The resolution of this problem differs from the linear one only by the replacement of ⟨xi, xk⟩ with ⟨zi, zk⟩ = K(xi, xk).
What is interesting is that, once we know how to derive the kernel matrix, we do not even need to know the underlying feature transformation function z(x). This is known as the kernel trick.
Kernel matrix and functions
Which properties should be satisfied by a kernel function?
Let us consider a set of m vectors xi. The kernel matrix is the square [m, m] matrix whose (i, j) element is K(xi, xj).
The Mercer theorem states that for K to be a valid kernel function, it is necessary and sufficient that for any set of m points the corresponding kernel matrix is symmetric and positive semidefinite.
Examples of kernel functions are the polynomial kernel

K(x1, x2) = (a x1^T x2 + c)^d

and the Gaussian kernel

K(x1, x2) = exp(−‖x1 − x2‖² / σ²)
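Both points can be illustrated numerically; a Python sketch checking the feature-space identity for the polynomial kernel (a = 1, c = 0, d = 2) and a sampled positive semidefiniteness check of a Gaussian kernel matrix:

```python
import math
import random

# (i) The polynomial kernel (x1^T x2)^2 equals <z(x1), z(x2)> for the
# explicit feature map z below (n = 2, a = 1, c = 0, d = 2).
def poly_kernel(u, v):
    return sum(a * b for a, b in zip(u, v)) ** 2

def z(x):
    return [x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2]

# (ii) Sampled Mercer check for the Gaussian kernel: the kernel matrix of
# any point set should be symmetric and satisfy v^T K v >= 0.
def gauss_kernel(u, v, sigma2=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / sigma2)

random.seed(0)
pts = [[random.uniform(-2, 2), random.uniform(-2, 2)] for _ in range(6)]
K = [[gauss_kernel(u, v) for v in pts] for u in pts]

def quad_form(K, v):
    n = len(v)
    return sum(v[i] * K[i][j] * v[j] for i in range(n) for j in range(n))

vs = [[random.uniform(-1, 1) for _ in range(6)] for _ in range(20)]
min_quad = min(quad_form(K, v) for v in vs)
```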
Naive Bayes classifier
The Naive Bayes (NB) classifier has shown in some domains a performance comparable to that of neural networks and decision tree learning.
Consider a classification problem with n inputs and a random output variable y that takes values in the set {c1, . . . , cK}. The Bayes optimal classifier should return

c∗(x) = arg max_{j=1,...,K} Prob {y = cj |x}

We can use the Bayes theorem to rewrite this expression as

c∗(x) = arg max_{j=1,...,K} Prob {x |y = cj} Prob {y = cj} / Prob {x} =
= arg max_{j=1,...,K} Prob {x |y = cj} Prob {y = cj}
Naive Bayes classifier
How can we estimate these two terms from a finite dataset?
It is easy to estimate Prob {y = cj} with the frequency with which each target class occurs in the training set. The estimation of Prob {x |y = cj} is much harder.
NB is based on the simplifying assumption that the inputs are conditionally independent given the target value:

Prob {x |y = cj} = Prob {x1, . . . , xn|y = cj} = ∏_{h=1}^{n} Prob {xh|y = cj}

The NB classification is then

cNB(x) = arg max_{j=1,...,K} Prob {y = cj} ∏_{h=1}^{n} Prob {xh|y = cj}

If the inputs xh are discrete, the estimation of Prob {xh|y = cj} boils down to counting the frequencies of the occurrences of the different values of xh for a class cj.
Example
Obs x·1 x·2 x·3 y
1 P.LOW P.HIGH N.HIGH P.HIGH
2 N.LOW P.HIGH P.HIGH N.HIGH
3 P.LOW P.LOW N.LOW P.LOW
4 P.HIGH P.HIGH N.HIGH P.HIGH
5 N.LOW P.HIGH N.LOW P.LOW
6 N.HIGH N.LOW P.LOW N.LOW
7 P.LOW N.LOW N.HIGH P.LOW
8 P.LOW N.HIGH N.LOW P.LOW
9 P.HIGH P.LOW P.LOW N.LOW
10 P.HIGH P.LOW P.LOW P.LOW
What is the NB classification for the query x = [N.LOW, N.HIGH, N.LOW]?
Example
Prob {y = P.HIGH} = 2/10, Prob {y = P.LOW} = 5/10
Prob {y = N.HIGH} = 1/10, Prob {y = N.LOW} = 2/10

Prob {x·1 = N.LOW|y = P.HIGH} = 0/2, Prob {x·1 = N.LOW|y = P.LOW} = 1/5
Prob {x·1 = N.LOW|y = N.HIGH} = 1/1, Prob {x·1 = N.LOW|y = N.LOW} = 0/2
Prob {x·2 = N.HIGH|y = P.HIGH} = 0/2, Prob {x·2 = N.HIGH|y = P.LOW} = 1/5
Prob {x·2 = N.HIGH|y = N.HIGH} = 0/1, Prob {x·2 = N.HIGH|y = N.LOW} = 0/2
Prob {x·3 = N.LOW|y = P.HIGH} = 0/2, Prob {x·3 = N.LOW|y = P.LOW} = 3/5
Prob {x·3 = N.LOW|y = N.HIGH} = 0/1, Prob {x·3 = N.LOW|y = N.LOW} = 0/2

Prob {y = P.H|x = [N.L, N.H, N.L]} =
= P(x·1 = N.L|y = P.H) P(x·2 = N.H|y = P.H) P(x·3 = N.L|y = P.H) P(y = P.H) / P(x)

cNB(x) = arg max_{P.H, P.L, N.H, N.L} {0 · 0 · 0 · 2/10, 1/5 · 1/5 · 3/5 · 5/10, 1 · 0 · 0 · 1/10, 0 · 0 · 0 · 2/10} = P.LOW
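The counting above can be reproduced programmatically; a Python sketch of the NB computation on the ten observations of the example:

```python
# Naive Bayes on the 10-observation example table; reproduces the counts
# and the classification of the query [N.LOW, N.HIGH, N.LOW].
data = [
    (["P.LOW", "P.HIGH", "N.HIGH"], "P.HIGH"),
    (["N.LOW", "P.HIGH", "P.HIGH"], "N.HIGH"),
    (["P.LOW", "P.LOW", "N.LOW"], "P.LOW"),
    (["P.HIGH", "P.HIGH", "N.HIGH"], "P.HIGH"),
    (["N.LOW", "P.HIGH", "N.LOW"], "P.LOW"),
    (["N.HIGH", "N.LOW", "P.LOW"], "N.LOW"),
    (["P.LOW", "N.LOW", "N.HIGH"], "P.LOW"),
    (["P.LOW", "N.HIGH", "N.LOW"], "P.LOW"),
    (["P.HIGH", "P.LOW", "P.LOW"], "N.LOW"),
    (["P.HIGH", "P.LOW", "P.LOW"], "P.LOW"),
]
classes = ["P.HIGH", "P.LOW", "N.HIGH", "N.LOW"]

def nb_score(query, cj):
    rows = [x for x, yv in data if yv == cj]
    score = len(rows) / len(data)                       # prior Prob{y = cj}
    for h, val in enumerate(query):
        score *= sum(x[h] == val for x in rows) / len(rows)  # Prob{xh|y = cj}
    return score

query = ["N.LOW", "N.HIGH", "N.LOW"]
scores = {cj: nb_score(query, cj) for cj in classes}
c_nb = max(classes, key=lambda cj: scores[cj])
```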
A KNN classifier
Suppose a training set is available and that a classification is required for a [1, n] vector. Hereafter we will call this input vector a query point. The classification procedure of a K-NN classifier can be summarized in these steps:
1. Compute the distance between the query and the training samples according to a predefined metric.
2. Rank the neighbors on the basis of their distance to the query.
3. Select the subset of the K nearest neighbors. Each of these neighbors has an associated class.
4. Return the class which characterizes the majority of the K nearest neighbors.
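The four steps above can be sketched as follows (Python, Euclidean metric, illustrative toy data):

```python
import math
from collections import Counter

# Illustrative training set: (input vector, class label)
train = [([0.0, 0.0], "A"), ([0.1, 0.2], "A"), ([1.0, 1.0], "B"),
         ([0.9, 1.1], "B"), ([1.2, 0.8], "B")]

def knn_classify(query, train, K=3):
    # 1-2. compute distances to the query and rank the neighbors
    ranked = sorted(train, key=lambda p: math.dist(query, p[0]))
    # 3. keep the K nearest neighbors
    nearest = ranked[:K]
    # 4. majority vote among their classes
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```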
A K-nearest neighbor classifier
Suppose that the dataset has the form DN = {(x1, y1), . . . , (xN, yN)} where x ∈ R^n, y ∈ {c1, . . . , cK}, and that we want to build a classifier h(x) which returns the most probable class given x. Suppose that our dataset contains Nk points in class ck, i.e.

∑_{k=1}^{K} Nk = N

By Bayes' theorem we have

Prob {ck |x} = p(x |ck) Prob {ck} / p(x),   k = 1, . . . , K
A K-nearest neighbor classifier (II)
Using the kNN density estimator we have

p(x |ck) = Kk / (Nk V),   p(x) = K / (N V)

where Kk ≤ K is the number of points which belong to the class ck among the K points in the volume V.
The priors can be estimated by

Prob {ck} = Nk / N

It follows that

Prob {ck |x} = Kk / K,   k = 1, . . . , K

In order to minimize the misclassification probability, a vector x should be assigned to the class ck for which the ratio Kk/K is the largest.
Evaluation of a classifier
The most popular measure of performance is the error rate or misclassification rate. This is simply the proportion of test samples misclassified by the rule.
Misclassification error, though it is the default criterion, is not necessarily the most appropriate one. It implicitly assumes that the costs of different types of misclassification are equal and returns the accuracy for a specific threshold.
When there are only a few or a moderate number of classes, the confusion matrix is a convenient way of summarising the classifier performance.
In the following we will focus on evaluating two-class rules.
Confusion matrix in a two-class problem
Suppose we use the classifier to make N test classifications and that among the values to be predicted there are NP examples of class 1 and NN examples of class 0. Then we have

                          Negative (0)   Positive (1)   Total
Classified as negative    TN             FN             TN + FN
Classified as positive    FP             TP             FP + TP
Total                     NN             NP             N
FP is the number of False Positives
FN is the number of False Negatives
NN/N is an estimator of the a priori probability of class 0.
NP/N is an estimator of the a priori probability of class 1.
Balanced Error Rate
In a setting where the two classes are not balanced, the misclassification error rate

ER = (FP + FN) / N

can lead to a too optimistic interpretation of the rate of success.
For instance if NP = 90 and NN = 10, the naive classifier always returning the positive class would have ER = 0.1 since FN = 0 and FP = 10.
In this case it is preferable to adopt the balanced error rate, which is the average of the errors on each class:

BER = (1/2) (FP / (TN + FP) + FN / (FN + TP))
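The example above can be checked directly; a Python sketch for the always-positive classifier with NP = 90 and NN = 10:

```python
# Confusion-matrix counts for the naive always-positive classifier on a
# test set with NP = 90 positives and NN = 10 negatives.
TP, FN = 90, 0       # all positives classified as positive
FP, TN = 10, 0       # all negatives (wrongly) classified as positive
N = TP + FN + FP + TN

ER = (FP + FN) / N                                   # looks good: 0.1
BER = 0.5 * (FP / (TN + FP) + FN / (FN + TP))        # reveals the problem: 0.5
```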
Specificity and sensitivity
Sensitivity (true positive rate): the ratio (to be maximized)

SE = TP / (TP + FN) = TP / NP = (NP − FN) / NP = 1 − FN/NP,   0 ≤ SE ≤ 1

It increases by reducing the number of false negatives. This quantity is also called the recall in information retrieval.
Specificity (true negative rate): the ratio (to be maximized)

SP = TN / (FP + TN) = TN / NN = (NN − FP) / NN = 1 − FP/NN,   0 ≤ SP ≤ 1

It increases by reducing the number of false positives.
Specificity and sensitivity (II)
There exists a trade-off between these two quantities.
In the case of a classifier which always returns 0, no point is classified as positive: FP = 0, TN = NN, so SP = 1 but SE = 0.
In the case of a classifier which always returns 1, no point is classified as negative: FN = 0, TP = NP, so SE = 1 but SP = 0.
False Positive and False Negative Rate
False Positive Rate:

FPR = 1 − SP = 1 − TN / (FP + TN) = FP / (FP + TN) = FP / NN,   0 ≤ FPR ≤ 1

It decreases by reducing the number of false positives.
False Negative Rate:

FNR = 1 − SE = 1 − TP / (TP + FN) = FN / (TP + FN) = FN / NP,   0 ≤ FNR ≤ 1

It decreases by reducing the number of false negatives.
Predictive value
Positive Predictive Value: the ratio (to be maximized)

PPV = TP / (TP + FP),   0 ≤ PPV ≤ 1

where TP + FP is the number of examples classified as positive. This quantity is also called precision in information retrieval.
Negative Predictive Value: the ratio (to be maximized)

NPV = TN / (TN + FN),   0 ≤ NPV ≤ 1

False Discovery Rate: the ratio (to be minimized)

FDR = FP / (TP + FP) = 1 − PPV,   0 ≤ FDR ≤ 1
Receiver Operating Characteristic curve
The Receiver Operating Characteristic (ROC) is a plot of the true positive rate (i.e. sensitivity or power) against the false positive rate (1 − specificity) for different classification thresholds.
ROC visualizes the probability of detection vs. the probability of false alarm.
Different points on the curve correspond to different thresholds used in the classifier.
A classifier with a ROC curve following the bisectrix line would be useless: for each threshold we would have TP/NP = FP/NN, i.e. the same proportion of true positives and false positives. It would not separate the classes at all.
A perfect ROC curve would follow the two axes. Real-life classification rules produce ROC curves which lie between these two extremes.
By comparing ROC curves one can study relationships between classifiers.
Receiver Operating Characteristic curve
Consider an example where t+ ∼ N(1, 1) and t− ∼ N(−1, 1). Suppose that the examples are classed as positive if t > THR and negative if t < THR, where THR is a threshold.
If THR = −∞, all the examples are classed as positive: TN = FN = 0, which implies SE = TP/NP = 1 and FPR = FP/(FP + TN) = 1.
If THR = ∞, all the examples are classed as negative: TP = FP = 0, which implies SE = 0 and FPR = 0.
ROC curve
[Figure: ROC curve. On the x-axis we have FPR = FP/NN and on the y-axis we have TPR = TP/NP.]
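The ROC points of this example can be computed from Gaussian tail probabilities; a Python sketch:

```python
import math

# ROC for the example t+ ~ N(1, 1), t- ~ N(-1, 1): an example is classed
# positive if t > THR, so SE = Prob{t+ > THR} and FPR = Prob{t- > THR}.
def norm_sf(x, mu):
    # Prob{t > x} for t ~ N(mu, 1)
    return 0.5 * math.erfc((x - mu) / math.sqrt(2.0))

thresholds = [-4.0 + 0.1 * i for i in range(81)]
roc = [(norm_sf(t, -1.0), norm_sf(t, 1.0)) for t in thresholds]  # (FPR, SE)
```

Since the positive class lies to the right of the negative one, every point of the curve satisfies SE ≥ FPR, i.e. the curve lies above the bisectrix.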
Multi-class problems
So far we have simplified our analysis by limiting ourselves to binary classification tasks.
However, we often encounter multi-class problems in bioinformatics.
There exist several strategies to extend binary classifiers to handle multi-class tasks y ∈ {c1, . . . , cK}.
Multi-class problems
One-versus-the-rest: in the one-versus-the-rest method, classifiers for discriminating one class from all the others are assembled. For each class ck a binary classifier that separates this class from the rest is built. To predict the class label of a given data point, the output of each of the K classifiers is obtained. If there is a unique class label which is consistent with all the K predictions, the data point is assigned to that class. Otherwise, one of the K classes is selected randomly.
Pairwise: a classifier is trained for each pair of classes, so there are K(K − 1)/2 independently built binary classifiers. To predict the class label of a given data point, the prediction of each of the K(K − 1)/2 classifiers is calculated, and each prediction is viewed as a vote. The data point is assigned to the class which receives the largest number of votes, with ties broken randomly.
Multi-class problems
Coding: each class is coded by a binary vector of size d. Each binary classifier is designed to produce 0 or 1 as the class label. So, given a list of d classifiers, their outputs can be viewed as a (usually row) vector in {0, 1}^d. To predict the class label of an input x, the output word of the d classifiers on input x is compared against the codeword of each class, and the class having the smallest Hamming distance (the number of disagreements) to the output word is selected. Suppose that we have a problem with 8 output classes. Three binary classifiers can be used to handle this problem.

     f1  f2  f3
c1   0   0   0
c2   0   0   1
c3   0   1   0
c4   0   1   1
c5   1   0   0
c6   1   0   1
c7   1   1   0
c8   1   1   1
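The Hamming-distance decoding can be sketched as follows (Python, the 8-class/3-bit code of the table):

```python
# Output-coding decoder: select the class whose codeword has the smallest
# Hamming distance to the d = 3 classifier outputs (8 classes).
codewords = {f"c{k + 1}": [(k >> 2) & 1, (k >> 1) & 1, k & 1] for k in range(8)}

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))   # number of disagreements

def decode(output):
    return min(codewords, key=lambda c: hamming(codewords[c], output))
```

With this dense code every output word matches a codeword exactly; redundant codes (d larger than needed) would additionally allow some classifier errors to be corrected.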