TRANSCRIPT
Statistical foundations of machine learning (INFO-F-422)
Gianluca Bontempi
Machine Learning Group
Computer Science Department
mlg.ulb.ac.be
Classification problem
Let x ∈ R^n denote a real valued random input vector and y a categorical random output variable that takes values in the set {c1, . . . , cK}. For example, let x be the month and y a categorical variable taking K = 2 possible values {RAIN, NO.RAIN}. We can distinguish two situations in classification:
Separable classes: given an input x, the output y always takes the same value. In other terms

∀x ∈ R^n ∃ck : Prob {y = ck |x} = 1

This is also known as the noiseless or degenerate situation.
Non separable classes: given an input x , we can have
realizations of y with different values. In other terms
∃x ∈ Rn : ∀ck Prob {y = ck |x} < 1
In the non-separable case a partition of the input space which returns a null classification error is not possible. Most interesting real problems are of this kind.
Stochastic setting
We consider a stochastic setting to model non-separable tasks.
This means that data are noisy and follow a statistical distribution. In other terms, given an input x, y does not always take the same value.
However, y follows a statistical distribution such that
∑_{k=1}^{K} Prob {y = ck |x} = 1
Binary classification problem
It is a problem where the output class y can take only K = 2 values.
Suppose for simplicity that y ∈ {c1 = 0, c2 = 1}. Let us denote

p0 = Prob {y = 0}, p1 = Prob {y = 1}

where p1 + p0 = 1.
Note that for a binary variable
E [y] = 0 · p0 + 1 · p1 = p1
Var [y] = E [(y − E [y])²] = p0 (0 − p1)² + p1 (1 − p1)² = p1(1 − p1)
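These identities can be checked with a few lines of Python (a sketch; p1 = 0.3 is an arbitrary illustrative value):

```python
# Check that for a binary variable E[y] = p1 and Var[y] = p1 (1 - p1).
def bernoulli_moments(p1):
    p0 = 1.0 - p1
    mean = 0 * p0 + 1 * p1                      # E[y]
    var = p0 * (0 - mean) ** 2 + p1 * (1 - mean) ** 2   # E[(y - E[y])^2]
    return mean, var

mean, var = bernoulli_moments(0.3)   # mean = 0.3, var = 0.3 * 0.7 = 0.21
```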
Degree of non separability
In the case of non-separable classes we can have classification problems with different degrees of separability. Let
Prob {y = 1|x} = p1(x), Prob {y = 0|x} = p0(x)
The degree of non-separability for an input x may be quantified by measures such as
Conditional variance
Var [y|x] = p1(x) (1 − p1(x))
Conditional entropy
H[y|x] = −p0(x) log p0(x) − p1(x) log p1(x)
Both measures attain their maximum value when p0 = p1 = 1/2
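A small sketch verifying that both measures peak at p1(x) = 1/2 (Python, natural logarithm for the entropy):

```python
import math

# Two measures of non-separability at an input x, as functions of p1(x).
def cond_variance(p1):
    return p1 * (1.0 - p1)

def cond_entropy(p1):
    # convention: 0 * log(0) = 0
    return -sum(p * math.log(p) for p in (p1, 1.0 - p1) if p > 0.0)

# Both are maximized at p1 = 1/2, i.e. full overlap of the two classes.
grid = [i / 100.0 for i in range(101)]
best_var = max(grid, key=cond_variance)
best_ent = max(grid, key=cond_entropy)
```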
The conditional distribution
The figure plots Prob {y = RAIN|x = month} and Prob {y = NO.RAIN|x = month} for each month. Note that for a fixed month
Prob {y = RAIN|x = month} + Prob {y = NO.RAIN|x = month} = 1
The classification statistical setting
Let x ∈ R^n denote a real valued random input vector and y a categorical random output variable that takes values in the set {c1, . . . , cK}. Let Prob {y = ck |x} be the probability that the output belongs to the kth class given the set of measurements x. It follows that
∑_{k=1}^{K} Prob {y = ck |x} = 1
An estimate c(x) of the class takes values in {c1, . . . , cK}. We define a [K, K] loss matrix L, null on the diagonal and non-negative elsewhere.
Loss matrix
(rows: predicted class, columns: real class)

           REAL
           c1    c2    c3
PRED  c1   L11   L12   L13
      c2   L21   L22   L23
      c3   L31   L32   L33
The element L(j,k) = L(cj, ck) denotes the cost of the misclassification when the predicted class is c(x) = cj and the correct class is ck.
Suppose that for a given x the classifier returns c(x). The average cost of this classification is
∑_{k=1}^{K} L(c(x), ck) Prob {y = ck |x}
The Bayes classifier
The goal of the classification procedure for a given x is to find the predictor c(x) that minimizes
∑_{k=1}^{K} L(c(x), ck) Prob {y = ck |x}
The optimal classifier (also known as the Bayes classifier) is the one that returns for all x

c∗(x) = arg min_{cj ∈ {c1,...,cK}} ∑_{k=1}^{K} L(j, k) Prob {y = ck |x}
The 0-1 case
In the case of a 0-1 loss function the optimal classifier returns
c∗(x) = arg min_{cj ∈ {c1,...,cK}} ∑_{k=1,...,K, k≠j} Prob {y = ck |x} =
= arg min_{cj ∈ {c1,...,cK}} (1 − Prob {y = cj |x}) =
= arg max_{cj ∈ {c1,...,cK}} Prob {y = cj |x}
The Bayes decision rule selects the maximum a posteriori class cj, j = 1, . . . , K, that is the class that maximizes the posterior probability Prob {y = cj |x}.
Discrete input example
CONDITIONAL PROB Prob {y = ck |x}:

 x   C1    C2    C3
 1   0.6   0.3   0.1
 2   0.2   0.8   0.0
 3   0.9   0.04  0.06
 4   0.5   0.25  0.25
 5   0.3   0.1   0.6

LOSS MATRIX (rows: PRED, columns: REAL):

      C1   C2   C3
 C1   0    1    5
 C2   20   0    10
 C3   2    1    0
The Bayes classification in x = 2 is given by

c∗(2) = arg min_{k=1,2,3} {
0.2 · 0 + 0.8 · 1 + 0.0 · 5,   (avg loss if c = 1)
0.2 · 20 + 0.8 · 0 + 0.0 · 10,   (avg loss if c = 2)
0.2 · 2 + 0.8 · 1 + 0.0 · 0   (avg loss if c = 3)
} = arg min_{k=1,2,3} {0.8, 4, 1.2} = 1
What would have been the Bayes classification in the 0-1 case?
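The worked example above can be recomputed in a few lines; a minimal Python sketch of the loss-based Bayes rule (1-based class indices):

```python
# Bayes classifier with a general loss matrix: L[j][k] is the cost of
# predicting c_{j+1} when the real class is c_{k+1} (values from the
# discrete input example).
L = [[0, 1, 5],
     [20, 0, 10],
     [2, 1, 0]]

def bayes_class(post, L):
    """Return the 1-based index of the class minimizing the average loss."""
    K = len(L)
    avg_loss = [sum(L[j][k] * post[k] for k in range(K)) for j in range(K)]
    return min(range(K), key=lambda j: avg_loss[j]) + 1, avg_loss

post_x2 = [0.2, 0.8, 0.0]            # Prob{y = c_k | x = 2}
cstar, losses = bayes_class(post_x2, L)
```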
The Bayes’ theorem
According to Bayes' theorem the following relations hold

Prob {y = ck |x = x} = Prob {x = x |y = ck} Prob {y = ck} / ∑_{k=1}^{K} Prob {x = x |y = ck} Prob {y = ck}

Prob {x = x |y = ck} = Prob {y = ck |x = x} Prob {x = x} / Prob {y = ck}
This means that by knowing the conditional distribution Prob {y = ck |x = x} and the a priori distribution Prob {x = x}, we can derive the conditional distribution Prob {x = x |y = ck}. Note that for a multivariate x ∈ R^n (for example, n = 40000 is the number of measured genes)
1. Prob {y = ck |x = x} is a multi-input single-output function which for each value of x returns a value in [0, 1].
2. for a given class, Prob {x = x |y = ck} is a multivariate n-dimensional distribution.
Inverse conditional distribution
The figure plots Prob {x = month|y = RAIN} and Prob {x = month|y = NO.RAIN} for each month, assuming that the a priori distribution is uniform. Note that

∑_month Prob {x = month|y = NO.RAIN} = ∑_month Prob {x = month|y = RAIN} = 1
Two dimensional input
[Figure: observed data in the (x1, x2) plane: 2 classes (red and green)]
Classification strategies
Optimal classification is possible only if the quantities Prob {y = ck |x}, k = 1, . . . , K are known. What happens if this is not the case? Three strategies are generally used.
Discriminant functions. A classifier can be represented in terms of a set of K discriminant functions

gk(x) = Prob {y = k |x}, k = 1, . . . , K
associated to the K a posteriori probabilities, such that the classifier assigns a feature vector x to the class c(x) = ck if

gk(x) > gj(x) for all j ≠ k

The discriminant functions divide the feature space into K decision regions. The regions are separated by decision boundaries, i.e. surfaces in the domain of x where ties occur among the largest discriminant functions.
Classification strategies (II)
Density estimation via the Bayes theorem. Since

Prob {y = ck |x} = p(x |y = ck) Prob {y = ck} / p(x)

an estimation of p(x |y = ck) allows an estimation of Prob {y = ck |x}.
Direct estimation via regression techniques. If the classification problem has K = 2 classes and if we denote them by y = 0 and y = 1
E [y|x] = 1·Prob {y = 1|x}+0·Prob {y = 0|x} = Prob {y = 1|x}
Then the classification problem can be put in the form of a regression problem where the output takes values in {0, 1}.
Misclassification probability
Let us consider a binary classification task, i.e. y ∈ {c0, c1}. What is the probability of misclassification for a generic classifier ŷ = h(x, αN) trained with a dataset DN?

MME(x) = Prob {ŷ ≠ y|x} =
= Prob {ŷ = c1|x} Prob {y = c0|x} + Prob {ŷ = c0|x} Prob {y = c1|x} =
= Prob {ŷ = c1|x} [1 − Prob {y = c1|x}] + Prob {ŷ = c0|x} [1 − Prob {y = c0|x}] =
= 1 − Prob {ŷ = c1|x} Prob {y = c1|x} − Prob {ŷ = c0|x} Prob {y = c0|x} =
= 1 − ∑_{j=0}^{1} Prob {ŷ = cj |x} Prob {y = cj |x}

Note that both y and ŷ = h(x, αN) are random variables.
Bias/variance decomposition in classification
Let us consider the squared sum:

(1/2) ∑_{j=0}^{1} (Prob {ŷ = cj} − Prob {y = cj})² =
= (1/2) ∑_{j=0}^{1} Prob {ŷ = cj}² + (1/2) ∑_{j=0}^{1} Prob {y = cj}² − ∑_{j=0}^{1} Prob {ŷ = cj} Prob {y = cj}
By adding one to both members we obtain the following decomposition

MME(x) = 1 − ∑_{j=0}^{1} Prob {ŷ = cj |x} Prob {y = cj |x} =
= (1/2) (1 − ∑_{j=0}^{1} Prob {y = cj}²)   "noise"
+ (1/2) ∑_{j=0}^{1} (Prob {ŷ = cj} − Prob {y = cj})²   "squared bias"
+ (1/2) (1 − ∑_{j=0}^{1} Prob {ŷ = cj}²)   "variance"
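The decomposition can be checked numerically; a minimal sketch in Python (the posteriors 0.7 and 0.9 are illustrative values, not from the slides):

```python
# Numerical check of the noise / squared-bias / variance decomposition of
# the misclassification probability MME(x) for binary posteriors.
def decomposition(p_hat, p):
    """p_hat = Prob{yhat = c1|x}, p = Prob{y = c1|x}."""
    ph = [1.0 - p_hat, p_hat]        # Prob{yhat = c_j|x}, j = 0, 1
    py = [1.0 - p, p]                # Prob{y = c_j|x}
    mme = 1.0 - sum(ph[j] * py[j] for j in range(2))
    noise = 0.5 * (1.0 - sum(q * q for q in py))
    bias2 = 0.5 * sum((ph[j] - py[j]) ** 2 for j in range(2))
    var = 0.5 * (1.0 - sum(q * q for q in ph))
    return mme, noise + bias2 + var

mme, decomposed = decomposition(0.7, 0.9)
```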
Discriminant functions
A classifier can be represented in terms of a set of K discriminant functions gk(x), k = 1, . . . , K such that the classifier assigns a feature vector x to the class c(x) = ck if

gk(x) > gj(x) for all j ≠ k

In the case of a zero-one loss function the optimal classifier corresponds to the maximum a posteriori discriminant function gk(x) = Prob {y = k |x}. The discriminant functions divide the feature space into K decision regions. The regions are separated by decision boundaries, i.e. surfaces in the domain of x where ties occur among the largest discriminant functions.
Discriminant functions and Bayes rule
We can multiply all the discriminant functions by the same positive constant or shift them by the same additive constant without influencing the decision.
More generally, if we replace every gk(x) by f(gk(x)), where f(·) is a monotonically increasing function, the resulting classification is unchanged.
Any of the following choices gives an identical classification result:

gk(x) = Prob {y = k |x} = p(x |y = k) P(y = k) / ∑_{k=1}^{K} p(x |y = k) P(y = k)
gk(x) = p(x |y = k) P(y = k)
gk(x) = ln p(x |y = k) + ln P(y = k)

and returns a minimum-error-rate classification.
Discriminant functions in the gaussian case
Let us consider the case where the densities are multivariate normal, i.e. p(x |y = k) ∼ N(µk, Σk), where x ∈ R^n, µk is a [n, 1] vector and Σk is a [n, n] covariance matrix.
Since

p(x |y = k) = 1 / ((√(2π))^n √(det(Σk))) exp{ −(1/2) (x − µk)^T Σk^{-1} (x − µk) }
the discriminant function is then

gk(x) = ln p(x |y = k) + ln P(y = k) =
= −(1/2) (x − µk)^T Σk^{-1} (x − µk) − (n/2) ln 2π − (1/2) ln det(Σk) + ln P(y = k)
In the following we will consider the simplest case: Σk = σ²I, where I is the [n, n] identity matrix.
Gaussian case: Σk = σ2I
This means that all the classes have a Gaussian distribution of x where the covariance matrix is identical and diagonal.
For each class, the x samples fall in equal-size spherical clusters which are parallel to the axes.
We have that the two terms

det(Σk) = σ^{2n},   Σk^{-1} = (1/σ²) I

are independent of k; they are then unimportant additive constants that can be ignored.
Thus we obtain the simple discriminant function

gk(x) = −‖x − µk‖² / (2σ²) + ln P(y = k) = −(x − µk)^T (x − µk) / (2σ²) + ln P(y = k) =
= −(1/(2σ²)) [x^T x − 2 µk^T x + µk^T µk] + ln P(y = k)
Since the quadratic term x^T x is the same for all k, making it an ignorable additive constant, this is equivalent to a linear discriminant function

gk(x) = wk^T x + wk0

where wk is a [n, 1] vector

wk = (1/σ²) µk

and the term

wk0 = −(1/(2σ²)) µk^T µk + ln P(y = k)

is called the bias or threshold.
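A minimal Python sketch of the linear discriminant for Σk = σ²I; the means, priors and σ² below are illustrative values, not from the slides:

```python
import math

# Linear discriminant g_k(x) = w_k^T x + w_k0 for the case Sigma_k = sigma^2 I.
def make_discriminant(mu_k, prior_k, sigma2):
    w = [m / sigma2 for m in mu_k]                                # w_k
    w0 = -sum(m * m for m in mu_k) / (2.0 * sigma2) + math.log(prior_k)  # w_k0
    def g(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + w0
    return g

mus = [[0.0, 0.0], [3.0, 3.0]]       # illustrative class means
priors = [0.5, 0.5]
g = [make_discriminant(m, p, sigma2=1.0) for m, p in zip(mus, priors)]

def classify(x):
    return max(range(len(g)), key=lambda k: g[k](x))
```

With equal priors the rule reduces to assigning x to the nearest mean.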
Decision boundary
In the two-class problem, the decision boundary (i.e. the set of points where g1(x) = g2(x)) is given by the hyperplane having equation

w^T (x − x0) = 0

where

w = µ1 − µ2

and

x0 = (1/2)(µ1 + µ2) − σ² / ‖µ1 − µ2‖² ln (Prob {y = 1} / Prob {y = 2}) (µ1 − µ2)

This equation defines a hyperplane through the point x0 and orthogonal to the vector w. See R script discri.R.
R script discri.R
[Figure: output of discri.R showing the decision boundary in the (x1, x2) plane]
Uniform prior case
If the prior probabilities P(y = k) are the same for all the K classes, then the term ln P(y = k) is an unimportant additive constant that can be ignored.
In this case, it can be shown that the optimum decision rule is a minimum-distance classifier.
This means that in order to classify an input x, it measures the Euclidean distance ‖x − µk‖² from x to each of the K mean vectors, and assigns x to the category of the nearest mean.
It can be shown that for the more general case Σk = Σ, the discriminant rule is based on minimizing the Mahalanobis distance

c(x) = arg min_k (x − µk)^T Σ^{-1} (x − µk)
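A sketch of the minimum Mahalanobis distance rule in R² (illustrative means and inverse covariance; for Σ = I it reduces to the nearest-mean rule):

```python
# Minimum Mahalanobis distance classifier, 2D sketch with illustrative data.
def mahalanobis2(x, mu, sigma_inv):
    """(x - mu)^T Sigma^{-1} (x - mu) for a 2x2 inverse covariance matrix."""
    d = [xi - mi for xi, mi in zip(x, mu)]
    return (d[0] * (sigma_inv[0][0] * d[0] + sigma_inv[0][1] * d[1])
            + d[1] * (sigma_inv[1][0] * d[0] + sigma_inv[1][1] * d[1]))

def mahalanobis_class(x, mus, sigma_inv):
    return min(range(len(mus)), key=lambda k: mahalanobis2(x, mus[k], sigma_inv))

mus = [[0.0, 0.0], [4.0, 0.0]]            # illustrative class means
identity = [[1.0, 0.0], [0.0, 1.0]]       # Sigma = I: Euclidean special case
```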
Hyperplanes
Consider an input space R^n and a hyperplane defined by the equation

h(x, β) = β0 + x^T β = 0

If we are in R², this equation represents a line.
Some properties hold:
Since for any two points x1 and x2 lying on the hyperplane we have (x1 − x2)^T β = 0, the vector normal to the hyperplane is given by

β∗ = β / ‖β‖

The signed distance of any point x to the hyperplane is given by

β∗^T (x − x0) = (x^T β − x0^T β) / ‖β‖ = (1/‖β‖) (x^T β + β0)

where x0 is a point on the hyperplane.
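The signed distance formula can be checked directly; a Python sketch with an illustrative hyperplane β = (3, 4), β0 = −5 in R²:

```python
import math

# Signed distance of a point x to the hyperplane beta0 + x^T beta = 0.
beta = [3.0, 4.0]
beta0 = -5.0
norm = math.hypot(*beta)          # ||beta|| = 5

def signed_distance(x):
    return (sum(b * xi for b, xi in zip(beta, x)) + beta0) / norm

# A point on the hyperplane (e.g. (1, 0.5), since 3 + 2 - 5 = 0) has
# distance 0; the sign of the distance tells on which side x lies.
```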
Hyperplane
[Figure: a hyperplane β0 + x^T β = 0 in the (x1, x2) plane, with its normal vector β∗ and the signed distance (1/‖β‖)(xq^T β + β0) of a query point xq]
Separating hyperplane
The figure shows points in two classes in R².
These data can be separated by a linear boundary. In this case there are infinitely many possible separating hyperplanes.
Perceptron
Classifiers that use the sign of the linear combination h(x, β) = β0 + β^T x to perform classification were called perceptrons in the engineering literature in the late 1950s.
The class returned by a perceptron for a given input xq is

1 if β0 + xq^T β > 0
−1 if β0 + xq^T β < 0

For all well-classified points in the training set the following relation holds

yi (xi^T β + β0) > 0
Misclassifications in the training set occur when

yi = 1 but β0 + β^T xi < 0
yi = −1 but β0 + β^T xi > 0
Perceptron (II)
The learning method of the perceptron is to minimize the quantity

Remp(β, β0) = − ∑_{i∈M} yi (xi^T β + β0)

where M is the subset of misclassified points in the training set.
Note that this quantity is non-negative and proportional to the distance of the misclassified points to the hyperplane.
Since the gradients are

∂Remp(β, β0)/∂β = − ∑_{i∈M} yi xi,   ∂Remp(β, β0)/∂β0 = − ∑_{i∈M} yi

a gradient descent minimization procedure can be adopted.
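The update rule above amounts to stochastic gradient descent on Remp; a minimal Python sketch on an illustrative separable dataset (labels in {−1, +1}):

```python
# Gradient-descent perceptron on a tiny linearly separable set.
X = [[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]]
y = [1, 1, -1, -1]

def train_perceptron(X, y, eta=0.1, epochs=100):
    beta = [0.0] * len(X[0])
    beta0 = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (sum(b * v for b, v in zip(beta, xi)) + beta0) <= 0:
                # misclassified: take a gradient step towards the point
                beta = [b + eta * yi * v for b, v in zip(beta, xi)]
                beta0 += eta * yi
    return beta, beta0

beta, beta0 = train_perceptron(X, y)
errors = sum(yi * (sum(b * v for b, v in zip(beta, xi)) + beta0) <= 0
             for xi, yi in zip(X, y))
```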
Perceptron (III)
Although the perceptron set the foundations for much of the following research in machine learning, a number of problems with this algorithm have to be mentioned:
When the data are separable, there are many possible solutions, and which one is found depends on the initialization of the gradient method.
When the data are not separable, the algorithm will not converge.
Even for a separable problem the convergence of the gradient minimization can be very slow.
A possible solution to the perceptron problems has been proposed by the idea of the optimal separating hyperplane in the SVM approach.
Optimal separating hyperplane
The idea of the optimal separating hyperplane is to separate the two classes by maximizing the distance to the closest point from either class (also known as the margin).
This provides a unique solution to the separating hyperplane problem and was shown to lead to better classification performance on test data. The search for the optimal hyperplane is modeled as the optimization problem

max_{β,β0} C
subject to (1/‖β‖) yi (xi^T β + β0) ≥ C for i = 1, . . . , N

The constraint ensures that all the points are at least a distance C from the decision boundary defined by β and β0.
We seek the largest C that satisfies the constraints and the associated parameters.
Optimal separating hyperplane (II)
Since the hyperplane is invariant when the parameters β0 and β are multiplied by a constant, we can set ‖β‖ = 1/C.
Maximizing C is like minimizing ‖β‖, which is like minimizing ‖β‖².
The maximization problem can be reformulated in a minimization form

min_{β,β0} (1/2)‖β‖²
subject to yi (xi^T β + β0) ≥ 1 for i = 1, . . . , N

The constraints impose a margin around the linear decision boundary of thickness 1/‖β‖.
Constrained optimization problem
Let us consider an optimization problem in the following form

minimize f(b)
subject to fi(b) ≤ 0, i = 1, . . . , m

where b ∈ R^n and b∗ is the optimal feasible solution.
The basic idea in Lagrangian duality is to take the constraints into account by augmenting the objective function with a weighted sum of the constraint functions

L(b, λ) = f(b) + ∑_{i=1}^{m} λi fi(b)

where λi ≥ 0 are the Lagrange multipliers (or dual variables) associated to the ith inequality constraint.
Lagrange dual function
The Lagrange dual function is the minimum value of the Lagrangian over b

g(λ) = inf_b L(b, λ)

which satisfies the property f(b∗) ≥ g(λ). The Lagrange dual function gives us a lower bound, depending on λ, on the optimal value f(b∗) of the optimization problem. The best lower bound that can be obtained from the Lagrange dual function is found by solving the dual optimization problem

maximize g(λ)
subject to λi ≥ 0, i = 1, . . . , m

This Lagrange dual problem is a convex optimization problem, since the objective to be maximized is concave and the constraint is convex. This is the case whether or not the primal problem is convex.
Optimal separating hyperplane (II)
The problem

min_{β,β0} (1/2)‖β‖²
subject to yi (xi^T β + β0) ≥ 1 for i = 1, . . . , N

is a convex optimization problem (quadratic criterion with linear inequality constraints) that can be put in the Lagrange (primal) form

LP(β, β0) = min_{β,β0} (1/2)‖β‖² − ∑_{i=1}^{N} αi [yi (xi^T β + β0) − 1]
Setting the derivatives with respect to β and β0 to zero we obtain:

β = ∑_{i=1}^{N} αi yi xi,   0 = ∑_{i=1}^{N} αi yi

Substituting these in the primal form we obtain the Wolfe dual to be maximized

LD = ∑_{i=1}^{N} αi − (1/2) ∑_{i=1}^{N} ∑_{k=1}^{N} αi αk yi yk xi^T xk

subject to αi ≥ 0.
This dual optimization problem can be solved using an effective algorithm (sequential minimal optimization).
It can be shown that the solution must satisfy the KKT condition

αi [yi (xi^T β + β0) − 1] = 0, ∀i

The above condition means that for each i we are in either of these two situations:
1. yi (xi^T β + β0) = 1, i.e. the point is on the boundary of the margin; then αi > 0
2. yi (xi^T β + β0) > 1, i.e. the point is not on the boundary of the margin; then αi = 0
The training points having an index i such that αi > 0 are called the support vectors.
SVM decision function
Given the solution β, the term β0 is obtained by

β0 = −(1/2) [β^T x∗(1) + β^T x∗(−1)]

where we denote by x∗(1) some (any) support vector belonging to the first class and by x∗(−1) a support vector belonging to the second class.
Now, the decision function can be written as

h(x, β, β0) = sign[x^T β + β0]

or equivalently

h(x, β, β0) = sign[∑_{support vectors} yi αi ⟨x, xi⟩ + β0]

where ⟨x, xi⟩ = x^T xi stands for the inner product.
Some remarks
Once we have found the αi, in order to make a prediction we have to calculate a quantity that depends only on the inner product between x and the points in the training set.
The αi will all be zero except for the support vectors.
Thus, many of the terms in the sum above will be zero, and we really need to find only the inner products between x and the support vectors (of which there is often only a small number) in order to make our prediction.
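A toy illustration of the dual problem. This is not SMO: it uses naive projected gradient ascent, and it absorbs β0 into β by appending a constant 1 to each input, which removes the equality constraint ∑ αi yi = 0 (a common simplification, so it only approximates the exact formulation). Data are illustrative:

```python
# Naive solver for the Wolfe dual on a tiny separable problem (sketch only).
X = [[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [-2.0, 0.0]]
y = [1, 1, -1, -1]
Xa = [xi + [1.0] for xi in X]          # augmented inputs, bias absorbed
N = len(Xa)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Q[i][k] = y_i y_k x_i^T x_k on the augmented inputs
Q = [[y[i] * y[k] * dot(Xa[i], Xa[k]) for k in range(N)] for i in range(N)]

alpha = [0.0] * N
eta = 0.01
for _ in range(5000):
    # gradient of LD = sum_i alpha_i - (1/2) alpha^T Q alpha
    grad = [1.0 - sum(Q[i][k] * alpha[k] for k in range(N)) for i in range(N)]
    # ascent step, then projection on the constraint alpha_i >= 0
    alpha = [max(0.0, a + eta * g) for a, g in zip(alpha, grad)]

# beta = sum_i alpha_i y_i x_i (the last coordinate plays the role of beta0)
beta = [sum(alpha[i] * y[i] * Xa[i][j] for i in range(N)) for j in range(3)]
support = [i for i in range(N) if alpha[i] > 1e-3]
```

Only the two points closest to the boundary end up with αi > 0, i.e. they are the support vectors.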
SVM in the nonlinear case
The extension of the Support Vector (SV) approach to nonlinear classification relies on the transformation of the input variables and on the possibility of effectively adapting the SVM procedure to a transformed input space (possibly of a much larger dimension).
The idea of transforming the input space by using a kernel function is an intuitive manner of extending linear techniques to a nonlinear setting.
Definition (Kernel)
A kernel is a function K : R^n × R^n → R such that for all xi, xj ∈ X

K(xi, xj) = ⟨z(xi), z(xj)⟩   (1)

where ⟨z1, z2⟩ = z1^T z2 stands for the inner product and z(·) is the mapping from the original to the feature space Z.
SVM in the nonlinear case
Let us suppose now that we want to perform a binary classification by SVM in a transformed space z ∈ Z. For the sake of simplicity we will consider a separable case. The parametric identification step requires the solution of a quadratic programming problem in the space Z

max_α ∑_{i=1}^{N} αi − (1/2) ∑_{i=1}^{N} ∑_{k=1}^{N} αi αk yi yk zi^T zk =
= max_α ∑_{i=1}^{N} αi − (1/2) ∑_{i=1}^{N} ∑_{k=1}^{N} αi αk yi yk K(xi, xk)
The resolution of this problem differs from the linear one only by the replacement of ⟨xi, xk⟩ with ⟨zi, zk⟩ = K(xi, xk).
What is interesting is that, once we know how to derive the kernel matrix, we do not even need to know the underlying feature transformation function z(x). This is known as the kernel trick.
Kernel matrix and functions
Which properties should be satisfied by a kernel function?
Let us consider a set of m vectors xi. The kernel matrix is the square [m, m] matrix whose (i, j) element is K(xi, xj).
The Mercer theorem states that for K to be a valid kernel function, it is necessary and sufficient that for any set of m points the corresponding kernel matrix is symmetric and positive semidefinite.
Examples of kernel functions are the polynomial kernel

K(x1, x2) = (a x1^T x2 + c)^d

and the Gaussian kernel

K(x1, x2) = exp(−‖x1 − x2‖² / σ²)
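Both points can be illustrated numerically; a Python sketch checking the feature-space identity for the polynomial kernel (a = 1, c = 0, d = 2) and a sampled positive semidefiniteness check of a Gaussian kernel matrix:

```python
import math
import random

# (i) The polynomial kernel (x1^T x2)^2 equals <z(x1), z(x2)> for the
# explicit feature map z below (n = 2, a = 1, c = 0, d = 2).
def poly_kernel(u, v):
    return sum(a * b for a, b in zip(u, v)) ** 2

def z(x):
    return [x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2]

# (ii) Sampled Mercer check for the Gaussian kernel: the kernel matrix of
# any point set should be symmetric and satisfy v^T K v >= 0.
def gauss_kernel(u, v, sigma2=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / sigma2)

random.seed(0)
pts = [[random.uniform(-2, 2), random.uniform(-2, 2)] for _ in range(6)]
K = [[gauss_kernel(u, v) for v in pts] for u in pts]

def quad_form(K, v):
    n = len(v)
    return sum(v[i] * K[i][j] * v[j] for i in range(n) for j in range(n))

vs = [[random.uniform(-1, 1) for _ in range(6)] for _ in range(20)]
min_quad = min(quad_form(K, v) for v in vs)
```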
Naive Bayes classifier
The Naive Bayes (NB) classifier has shown in some domains a performance comparable to that of neural networks and decision tree learning.
Consider a classification problem with n inputs and a random output variable y that takes values in the set {c1, . . . , cK}. The Bayes optimal classifier should return

c∗(x) = arg max_{j=1,...,K} Prob {y = cj |x}

We can use the Bayes theorem to rewrite this expression as

c∗(x) = arg max_{j=1,...,K} Prob {x |y = cj} Prob {y = cj} / Prob {x} =
= arg max_{j=1,...,K} Prob {x |y = cj} Prob {y = cj}
Naive Bayes classifier
How can we estimate these two terms from a finite dataset?
It is easy to estimate Prob {y = cj} with the frequency with which each target class occurs in the training set. The estimation of Prob {x |y = cj} is much harder.
NB is based on the simplifying assumption that the inputs are conditionally independent given the target value:

Prob {x |y = cj} = Prob {x1, . . . , xn|y = cj} = ∏_{h=1}^{n} Prob {xh|y = cj}

The NB classification is then

cNB(x) = arg max_{j=1,...,K} Prob {y = cj} ∏_{h=1}^{n} Prob {xh|y = cj}

If the inputs xh are discrete, the estimation of Prob {xh|y = cj} boils down to counting the frequencies of the occurrences of the different values of xh for a class cj.
Example
Obs x·1 x·2 x·3 y
1 P.LOW P.HIGH N.HIGH P.HIGH
2 N.LOW P.HIGH P.HIGH N.HIGH
3 P.LOW P.LOW N.LOW P.LOW
4 P.HIGH P.HIGH N.HIGH P.HIGH
5 N.LOW P.HIGH N.LOW P.LOW
6 N.HIGH N.LOW P.LOW N.LOW
7 P.LOW N.LOW N.HIGH P.LOW
8 P.LOW N.HIGH N.LOW P.LOW
9 P.HIGH P.LOW P.LOW N.LOW
10 P.HIGH P.LOW P.LOW P.LOW
What is the NB classification for the query x = [N.LOW, N.HIGH, N.LOW]?
Example
Prob {y = P.HIGH} = 2/10, Prob {y = P.LOW} = 5/10
Prob {y = N.HIGH} = 1/10, Prob {y = N.LOW} = 2/10

Prob {x·1 = N.LOW|y = P.HIGH} = 0/2, Prob {x·1 = N.LOW|y = P.LOW} = 1/5
Prob {x·1 = N.LOW|y = N.HIGH} = 1/1, Prob {x·1 = N.LOW|y = N.LOW} = 0/2
Prob {x·2 = N.HIGH|y = P.HIGH} = 0/2, Prob {x·2 = N.HIGH|y = P.LOW} = 1/5
Prob {x·2 = N.HIGH|y = N.HIGH} = 0/1, Prob {x·2 = N.HIGH|y = N.LOW} = 0/2
Prob {x·3 = N.LOW|y = P.HIGH} = 0/2, Prob {x·3 = N.LOW|y = P.LOW} = 3/5
Prob {x·3 = N.LOW|y = N.HIGH} = 0/1, Prob {x·3 = N.LOW|y = N.LOW} = 0/2

Prob {y = P.H|x = [N.L, N.H, N.L]} =
= P(x·1 = N.L|y = P.H) P(x·2 = N.H|y = P.H) P(x·3 = N.L|y = P.H) P(y = P.H) / P(x)

cNB(x) = arg max_{P.H, P.L, N.H, N.L} {0 · 0 · 0 · 2/10, 1/5 · 1/5 · 3/5 · 5/10, 1 · 0 · 0 · 1/10, 0 · 0 · 0 · 2/10} = P.LOW
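The counting above can be reproduced programmatically; a Python sketch of the NB computation on the ten observations of the example:

```python
# Naive Bayes on the 10-observation example table; reproduces the counts
# and the classification of the query [N.LOW, N.HIGH, N.LOW].
data = [
    (["P.LOW", "P.HIGH", "N.HIGH"], "P.HIGH"),
    (["N.LOW", "P.HIGH", "P.HIGH"], "N.HIGH"),
    (["P.LOW", "P.LOW", "N.LOW"], "P.LOW"),
    (["P.HIGH", "P.HIGH", "N.HIGH"], "P.HIGH"),
    (["N.LOW", "P.HIGH", "N.LOW"], "P.LOW"),
    (["N.HIGH", "N.LOW", "P.LOW"], "N.LOW"),
    (["P.LOW", "N.LOW", "N.HIGH"], "P.LOW"),
    (["P.LOW", "N.HIGH", "N.LOW"], "P.LOW"),
    (["P.HIGH", "P.LOW", "P.LOW"], "N.LOW"),
    (["P.HIGH", "P.LOW", "P.LOW"], "P.LOW"),
]
classes = ["P.HIGH", "P.LOW", "N.HIGH", "N.LOW"]

def nb_score(query, cj):
    rows = [x for x, yv in data if yv == cj]
    score = len(rows) / len(data)                       # prior Prob{y = cj}
    for h, val in enumerate(query):
        score *= sum(x[h] == val for x in rows) / len(rows)  # Prob{xh|y = cj}
    return score

query = ["N.LOW", "N.HIGH", "N.LOW"]
scores = {cj: nb_score(query, cj) for cj in classes}
c_nb = max(classes, key=lambda cj: scores[cj])
```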
A KNN classifier
Suppose a training set is available and that a classification is required for a [1, n] vector. Hereafter we will call this input vector a query point. The classification procedure of a K-NN classifier can be summarized in these steps:
1. Compute the distance between the query and the training samples according to a predefined metric.
2. Rank the neighbors on the basis of their distance to the query.
3. Select the subset of the K nearest neighbors. Each of these neighbors has an associated class.
4. Return the class which characterizes the majority of the K nearest neighbors.
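The four steps above can be sketched as follows (Python, Euclidean metric, illustrative toy data):

```python
import math
from collections import Counter

# Illustrative training set: (input vector, class label)
train = [([0.0, 0.0], "A"), ([0.1, 0.2], "A"), ([1.0, 1.0], "B"),
         ([0.9, 1.1], "B"), ([1.2, 0.8], "B")]

def knn_classify(query, train, K=3):
    # 1-2. compute distances to the query and rank the neighbors
    ranked = sorted(train, key=lambda p: math.dist(query, p[0]))
    # 3. keep the K nearest neighbors
    nearest = ranked[:K]
    # 4. majority vote among their classes
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```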
A K-nearest neighbor classifier
Suppose that the dataset has the form DN = {(x1, y1), . . . , (xN, yN)} where x ∈ R^n, y ∈ {c1, . . . , cK}, and that we want to build a classifier h(x) which returns the most probable class given x. Suppose that our dataset contains Nk points in class ck, i.e.

∑_{k=1}^{K} Nk = N

By Bayes' theorem we have

Prob {ck |x} = p(x |ck) Prob {ck} / p(x),   k = 1, . . . , K
A K-nearest neighbor classifier (II)
Using the kNN density estimator we have

p(x |ck) = Kk / (Nk V),   p(x) = K / (N V)

where Kk ≤ K is the number of points which belong to the class ck among the K points in the volume V.
The priors can be estimated by

Prob {ck} = Nk / N

It follows that

Prob {ck |x} = Kk / K,   k = 1, . . . , K

In order to minimize the misclassification probability, a vector x should be assigned to the class ck for which the ratio Kk/K is the largest.
Evaluation of a classifier
The most popular measure of performance is the error rate or misclassification rate. This is simply the proportion of test samples misclassified by the rule.
Misclassification error, though it is the default criterion, is not necessarily the most appropriate one. It implicitly assumes that the costs of different types of misclassification are equal and returns the accuracy for a specific threshold.
When there are only a few or a moderate number of classes, the confusion matrix is a convenient way of summarising the classifier performance.
In the following we will focus on evaluating two-class rules.
Confusion matrix in a two-class problem
Suppose we use the classifier to make N test classifications and that among the values to be predicted there are NP examples of class 1 and NN examples of class 0. Then we have

                          Negative (0)   Positive (1)   Total
Classified as negative    TN             FN             TN + FN
Classified as positive    FP             TP             FP + TP
Total                     NN             NP             N
FP is the number of False Positives
FN is the number of False Negatives
NN/N is an estimator of the a priori probability of class 0.
NP/N is an estimator of the a priori probability of class 1.
Balanced Error Rate
In a setting where the two classes are not balanced, the misclassification error rate

ER = (FP + FN) / N

can lead to a too optimistic interpretation of the rate of success.
For instance if NP = 90 and NN = 10, the naive classifier always returning the positive class would have ER = 0.1 since FN = 0 and FP = 10.
In this case it is preferable to adopt the balanced error rate, which is the average of the errors on each class:

BER = (1/2) (FP / (TN + FP) + FN / (FN + TP))
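The example above can be checked directly; a Python sketch for the always-positive classifier with NP = 90 and NN = 10:

```python
# Confusion-matrix counts for the naive always-positive classifier on a
# test set with NP = 90 positives and NN = 10 negatives.
TP, FN = 90, 0       # all positives classified as positive
FP, TN = 10, 0       # all negatives (wrongly) classified as positive
N = TP + FN + FP + TN

ER = (FP + FN) / N                                   # looks good: 0.1
BER = 0.5 * (FP / (TN + FP) + FN / (FN + TP))        # reveals the problem: 0.5
```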
Specificity and sensitivity
Sensitivity (true positive rate): the ratio (to be maximized)

SE = TP / (TP + FN) = TP / NP = (NP − FN) / NP = 1 − FN/NP,   0 ≤ SE ≤ 1

It increases by reducing the number of false negatives. This quantity is also called the recall in information retrieval.
Specificity (true negative rate): the ratio (to be maximized)

SP = TN / (FP + TN) = TN / NN = (NN − FP) / NN = 1 − FP/NN,   0 ≤ SP ≤ 1

It increases by reducing the number of false positives.
Specificity and sensitivity (II)
There exists a trade-off between these two quantities.
In the case of a classifier which always returns 0, no point is classified as positive: FP = 0, TN = NN, so SP = 1 but SE = 0.
In the case of a classifier which always returns 1, no point is classified as negative: FN = 0, TP = NP, so SE = 1 but SP = 0.
False Positive and False Negative Rate
False Positive Rate:

FPR = 1 − SP = 1 − TN / (FP + TN) = FP / (FP + TN) = FP / NN,   0 ≤ FPR ≤ 1

It decreases by reducing the number of false positives.
False Negative Rate:

FNR = 1 − SE = 1 − TP / (TP + FN) = FN / (TP + FN) = FN / NP,   0 ≤ FNR ≤ 1

It decreases by reducing the number of false negatives.
Predictive value
Positive Predictive Value: the ratio (to be maximized)

PPV = TP / (TP + FP),   0 ≤ PPV ≤ 1

where TP + FP is the number of examples classified as positive. This quantity is also called precision in information retrieval.
Negative Predictive Value: the ratio (to be maximized)

NPV = TN / (TN + FN),   0 ≤ NPV ≤ 1

False Discovery Rate: the ratio (to be minimized)

FDR = FP / (TP + FP) = 1 − PPV,   0 ≤ FDR ≤ 1
Receiver Operating Characteristic curve
The Receiver Operating Characteristic (ROC) is a plot of the true positive rate (i.e. sensitivity or power) against the false positive rate (1 − specificity) for different classification thresholds.
ROC visualizes the probability of detection vs. the probability of false alarm.
Different points on the curve correspond to different thresholds used in the classifier.
A classifier with a ROC curve following the bisectrix line would be useless: for each threshold we would have TP/NP = FP/NN, i.e. the same proportion of true positives and false positives. It would not separate the classes at all.
A perfect ROC curve would follow the two axes. Real-life classification rules produce ROC curves which lie between these two extremes.
By comparing ROC curves one can study relationships between classifiers.
Receiver Operating Characteristic curve
Consider an example where t+ ∼ N(1, 1) and t− ∼ N(−1, 1). Suppose that the examples are classed as positive if t > THR and negative if t < THR, where THR is a threshold.
If THR = −∞, all the examples are classed as positive: TN = FN = 0, which implies SE = TP/NP = 1 and FPR = FP/(FP + TN) = 1.
If THR = ∞, all the examples are classed as negative: TP = FP = 0, which implies SE = 0 and FPR = 0.
ROC curve
[Figure: ROC curve. On the x-axis we have FPR = FP/NN and on the y-axis we have TPR = TP/NP.]
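The ROC points of this example can be computed from Gaussian tail probabilities; a Python sketch:

```python
import math

# ROC for the example t+ ~ N(1, 1), t- ~ N(-1, 1): an example is classed
# positive if t > THR, so SE = Prob{t+ > THR} and FPR = Prob{t- > THR}.
def norm_sf(x, mu):
    # Prob{t > x} for t ~ N(mu, 1)
    return 0.5 * math.erfc((x - mu) / math.sqrt(2.0))

thresholds = [-4.0 + 0.1 * i for i in range(81)]
roc = [(norm_sf(t, -1.0), norm_sf(t, 1.0)) for t in thresholds]  # (FPR, SE)
```

Since the positive class lies to the right of the negative one, every point of the curve satisfies SE ≥ FPR, i.e. the curve lies above the bisectrix.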
Multi-class problems
So far we have simplified our analysis by limiting ourselves to binary classification tasks.
However, we often encounter multi-class problems in bioinformatics.
There exist several strategies to extend binary classifiers to handle multi-class tasks y ∈ {c1, . . . , cK}.
Multi-class problems
One-versus-the-rest: in the one-versus-the-rest method, classifiers for discriminating one class from all the others are assembled. For each class ck a binary classifier that separates this class from the rest is built. To predict the class label of a given data point, the output of each of the K classifiers is obtained. If there is a unique class label which is consistent with all the K predictions, the data point is assigned to that class. Otherwise, one of the K classes is selected randomly.
Pairwise: a classifier is trained for each pair of classes, so there are K(K − 1)/2 independently built binary classifiers. To predict the class label of a given data point, the prediction of each of the K(K − 1)/2 classifiers is calculated, and each prediction is viewed as a vote. The data point is assigned to the class which receives the largest number of votes, with ties broken randomly.
Multi-class problems
Coding: each class is coded by a binary vector of size d. Each binary classifier is designed to produce 0 or 1 as the class label. So, given a list of d classifiers, their outputs can be viewed as a (usually row) vector in {0, 1}^d. To predict the class label of an input x, the output word of the d classifiers on input x is compared against the codeword of each class, and the class having the smallest Hamming distance (the number of disagreements) to the output word is selected. Suppose that we have a problem with 8 output classes. Three binary classifiers can be used to handle this problem.

     f1  f2  f3
c1   0   0   0
c2   0   0   1
c3   0   1   0
c4   0   1   1
c5   1   0   0
c6   1   0   1
c7   1   1   0
c8   1   1   1
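The Hamming-distance decoding can be sketched as follows (Python, the 8-class/3-bit code of the table):

```python
# Output-coding decoder: select the class whose codeword has the smallest
# Hamming distance to the d = 3 classifier outputs (8 classes).
codewords = {f"c{k + 1}": [(k >> 2) & 1, (k >> 1) & 1, k & 1] for k in range(8)}

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))   # number of disagreements

def decode(output):
    return min(codewords, key=lambda c: hamming(codewords[c], output))
```

With this dense code every output word matches a codeword exactly; redundant codes (d larger than needed) would additionally allow some classifier errors to be corrected.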