TRANSCRIPT
Engineering Part IIB: Module 4F10
Statistical Pattern Processing
Lecture 5: Single Layer Perceptrons & Estimating
Linear Classifiers
Phil Woodland: [email protected]
Michaelmas 2012
Cambridge University Engineering Department
Introduction

Previously, generative models with Gaussian & GMM class-conditional PDFs were discussed. For Gaussian PDFs with common covariance matrices, the decision boundary is linear. The linear decision boundary from the generative model was briefly compared with a discriminative model, logistic regression.

An alternative to modelling the class-conditional PDFs is to select a functional form for a discriminant function and construct a mapping from the observation to the class directly.

Here we will concentrate on the construction of linear classifiers; however, it is also possible to construct quadratic or other non-linear decision boundaries (this is how some types of neural networks operate).

There are several methods of classifier construction for a linear discriminant function. The following schemes will be examined:

• Iterative solution for the weights via the perceptron algorithm, which directly minimises the number of misclassifications

• Fisher linear discriminant: directly maximises a measure of class separability
Cambridge University Engineering Department
Simple Vowel Classifier

Select two vowels to classify with a linear decision boundary.
[Figure: the two vowel classes plotted in the Formant One / Formant Two plane, with the linear classifier decision boundary.]
Most of the data is correctly classified, but classification errors occur:

• realisations of vowels vary from speaker to speaker;

• the same vowel varies for the same speaker as well (vowel context etc.).
Single Layer Perceptron

[Figure: single layer perceptron. Inputs x1, . . . , xd are weighted by w1, . . . , wd, a constant input of 1 is weighted by w0, the sum Σ gives z, and a threshold activation maps z to y(x) ∈ {1, −1}.]
A threshold activation function is typically used. The d-dimensional input vector x and the scalar value z are related by

z = w′x + w0

z is then fed to the activation function to yield y(x).
The parameters of this system are

• weights: w = (w1, . . . , wd)′, which selects the direction of the decision boundary;

• bias: w0, which sets the position of the decision boundary.

The parameters are often combined into a single composite vector, w, and extended input vector, x:

w = (w1, . . . , wd, w0)′ ;  x = (x1, . . . , xd, 1)′
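As a minimal sketch, assuming NumPy, the extended-vector form and threshold activation can be written as follows; the weights used at the end are illustrative:

```python
import numpy as np

def extend(x):
    """Append a constant 1 so the bias w0 is absorbed into the weight vector."""
    return np.append(x, 1.0)

def perceptron_output(w, x):
    """z = w'x for the extended vectors; the threshold activation gives y(x)."""
    z = w @ extend(x)
    return 1 if z >= 0 else -1

# Illustrative weights: w = (0, 1)', bias w0 = -0.5, so z = x2 - 0.5.
w = np.array([0.0, 1.0, -0.5])
print(perceptron_output(w, np.array([1.0, 1.0])))   # z = 0.5  -> 1
print(perceptron_output(w, np.array([1.0, 0.0])))   # z = -0.5 -> -1
```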
Single Layer Perceptron (cont)
We can then write z = w′x
The task is to train the set of model parameters w. For this example a decision boundary is placed at z = 0. The decision rule is

y(x) = 1 if z ≥ 0, and y(x) = −1 if z < 0
If the training data is linearly separable in the d-dimensional space, then using an appropriate training algorithm, perfect classification (on the training data at least!) can be achieved.
[Figure: linearly separable two-class data in 2-D with one possible linear decision boundary.]
Is the solution unique?
The precise solution depends on the training criterion/algorithm used.
Parameter Optimisation

First a cost or loss function of the weights, E(w), is defined. A learning process which minimises this cost function is then used. One basic procedure is gradient descent:
1. Start with some initial estimate w[0], and set τ = 0.

2. Compute the gradient ∇E(w)|w[τ].

3. Update the weights by moving a small distance in the steepest downhill direction, to give the estimate of the weights at iteration τ + 1:

w[τ + 1] = w[τ] − η ∇E(w)|w[τ]

Set τ = τ + 1.

4. Repeat steps (2) and (3) until convergence, or until the optimisation criterion is satisfied.
If gradient descent is used, the cost function E(w) must be differentiable (and hence continuous). This means the misclassification count cannot be used as the cost function for gradient descent schemes. Note that gradient descent is not in general guaranteed to decrease the cost function at each step.
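The loop above can be sketched as follows, assuming NumPy; the quadratic cost E(w) = ||w − w*||² and its minimum w* are illustrative choices, not from the slides:

```python
import numpy as np

def gradient_descent(grad_E, w_init, eta=0.1, n_iters=100):
    """Steps 2-4: repeatedly move a small distance down the gradient."""
    w = np.asarray(w_init, dtype=float)
    for _ in range(n_iters):
        w = w - eta * grad_E(w)     # w[tau+1] = w[tau] - eta * grad E(w)|w[tau]
    return w

# Illustrative differentiable cost E(w) = ||w - w_star||^2,
# whose gradient is 2 (w - w_star).
w_star = np.array([2.0, -1.0])
grad_E = lambda w: 2.0 * (w - w_star)
w_hat = gradient_descent(grad_E, np.zeros(2))
print(w_hat)    # approaches [2, -1]
```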
Choice of η

In the previous slide the learning rate term η was used in the optimisation scheme.
[Figure: cost function E with the desired minimum. Too small a step size gives slow descent; too large a step size causes divergence.]
η is positive. When setting η we need to consider:
• if η is too small, convergence is slow;
• if η is too large, we may overshoot the solution and diverge.
Later in the course we will examine improvements to gradient descent schemes. Some of these give automated techniques for setting η.
Local Minima

Another problem with gradient descent is local minima. At all maxima and minima,
∇E(w) = 0
• gradient descent stops at all maxima/minima;

• the estimated parameters will depend on the starting point;

• there is a single solution if the function is convex.

[Figure: a criterion with several local minima and a single global minimum.]
Expectation-Maximisation, for example, is similarly only guaranteed to find a local maximum of the likelihood function.
Perceptron Criterion

The decision boundary should be constructed to minimise the number of misclassified training examples. For each training example xi,

w′xi > 0, xi ∈ ω1
w′xi < 0, xi ∈ ω2

The distance from the decision boundary is given by w′x, but we cannot simply sum the values of w′x as a cost function, since the sign depends on the class.
For training, we replace all the observations of class ω2 by their negative values. Thus

x → x if it belongs to ω1, and x → −x if it belongs to ω2

This means that for a correctly classified sample w′x > 0, and for a misclassified training example w′x < 0. We will refer to this as “normalising” the data.
Perceptron Solution Region

• Each sample xi places a constraint on the possible location of a solution vector that classifies all samples correctly.
• w′xi = 0 defines a hyperplane through the origin of the “weight-space” of w vectors, with xi as a normal vector.
• For normalised data, the solution vector must lie on the positive side of every such hyperplane.

• The solution vector, if it exists, is not unique: it lies anywhere within the solution region.

• The figure below (from DHS) shows 4 training points and the solution region for both normalised and un-normalised data.
[Figure (from DHS): the “separating” plane and the solution region for w, shown for un-normalised and for normalised data.]
Perceptron Criterion (cont)

The perceptron criterion may be expressed as

E(w) = ∑_{x∈Y} (−w′x)

where Y is the set of misclassified points. We now want to minimise the perceptron criterion, and can use gradient descent. It is simple to show that

∇E(w) = ∑_{x∈Y} (−x)

Hence the gradient descent update rule is

w[τ + 1] = w[τ] + η ∑_{x∈Y[τ]} x

where Y[τ] is the set of misclassified points using w[τ]. For the case when the samples are linearly separable, using a value of η = 1 is guaranteed to converge.
The basic algorithm is:

1. Take the extended observations x for class ω2 and invert the sign to give the normalised data.

2. Initialise the weight vector w[0], and set τ = 0.

3. Using w[τ], produce the set of misclassified samples Y[τ].

4. Use the update rule

w[τ + 1] = w[τ] + ∑_{x∈Y[τ]} x

then set τ = τ + 1.

5. Repeat steps (3) and (4) until the convergence criterion is satisfied.
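The steps above can be sketched as follows with η = 1, assuming NumPy; the label convention (+1 for ω1, −1 for ω2) and the toy data at the end are illustrative. As a small variant, samples not strictly on the positive side are counted as errors, which also handles z = 0 ties:

```python
import numpy as np

def train_perceptron(X, labels, w_init, max_iters=100):
    # Step 1: extend each observation and negate ("normalise") class omega_2.
    X_ext = np.hstack([X, np.ones((len(X), 1))])
    X_norm = X_ext * np.asarray(labels, dtype=float)[:, None]
    w = np.asarray(w_init, dtype=float)
    for _ in range(max_iters):
        # Step 3: samples not strictly on the positive side count as errors.
        Y = X_norm[X_norm @ w <= 0]
        if len(Y) == 0:                 # everything correct: converged
            break
        w = w + Y.sum(axis=0)           # Step 4: update rule with eta = 1
    return w

# Illustrative linearly separable data.
X = np.array([[2.0, 2.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, 0.0]])
labels = [1, 1, -1, -1]
w_hat = train_perceptron(X, labels, np.zeros(3))
```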
A Simple Example

Class 1 has points (1, 0)′, (1, 1)′, (0.6, 0.6)′, (0.7, 0.4)′

Class 2 has points (0, 0)′, (0, 1)′, (0.25, 1)′, (0.3, 0.4)′

Initial estimate: w[0] = (0, 1, −0.5)′
This yields the following initial estimate of the decision boundary. Given this initial estimate we need to train the decision boundary.
[Figure: the eight training points with the initial decision boundary x2 = 0.5.]
Simple Example (cont)

First use the current decision boundary to obtain the set of misclassified points. For the data from class ω1:

Point       1     2     3     4
z          −0.5   0.5   0.1  −0.1
Assigned    2     1     1     2

and for class ω2:

Point       1     2     3     4
z          −0.5   0.5   0.5  −0.1
Assigned    2     1     1     2

The set of misclassified points, Y[0], is

Y[0] = {1ω1, 4ω1, 2ω2, 3ω2}
From the perceptron update rule this yields the updated vector

w[1] = w[0] + (1, 0, 1)′ + (0.7, 0.4, 1)′ + (0, −1, −1)′ + (−0.25, −1, −1)′ = (1.45, −0.6, −0.5)′
Simple Example (cont)

Applying the decision rule again for the class ω1 data:

Point       1      2      3      4
z           0.95   0.35   0.01   0.275
Assigned    1      1      1      1

and for class ω2:

Point       1      2      3        4
z          −0.5   −1.1   −0.7375  −0.305
Assigned    2      2      2        2

All points are correctly classified, so the algorithm has converged.
This yields the following decision boundary. Is this a good decision boundary?
[Figure: the training data with the converged decision boundary.]
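The worked example can be checked numerically; a minimal sketch assuming NumPy, performing one perceptron update from w[0]:

```python
import numpy as np

X1 = np.array([[1, 0], [1, 1], [0.6, 0.6], [0.7, 0.4]])   # class omega_1
X2 = np.array([[0, 0], [0, 1], [0.25, 1], [0.3, 0.4]])    # class omega_2
ext = lambda X: np.hstack([X, np.ones((len(X), 1))])
X_norm = np.vstack([ext(X1), -ext(X2)])                   # negate class omega_2

w = np.array([0.0, 1.0, -0.5])                            # w[0]
misclassified = X_norm[X_norm @ w < 0]                    # the set Y[0]
w = w + misclassified.sum(axis=0)                         # update with eta = 1
print(w)                                                  # approx (1.45, -0.6, -0.5)
assert np.all(X_norm @ w > 0)   # second pass: everything correct, converged
```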
Fisher’s Discriminant Analysis

An alternative approach to training the perceptron uses Fisher’s discriminant analysis. The aim is to find the projection that maximises the distance between the class means, whilst minimising the within-class variance. Note that only the projection w is determined.
The following cost function is used:

E(w) = −(µ1 − µ2)² / (s1 + s2)

where sj and µj are the projected scatter and mean for class ωj. The projected scatter is defined as

sj = ∑_{xi∈ωj} (xi − µj)²

where xi here denotes the projected observation.
The cost function may be expressed as

E(w) = −(w′Sbw) / (w′Sww)

where

Sb = (µ1 − µ2)(µ1 − µ2)′
Sw = S1 + S2
Sj = ∑_{xi∈ωj} (xi − µj)(xi − µj)′

and the mean µj of class ωj is defined as usual.
Fisher’s Discriminant Analysis (cont)

Differentiating E(w) w.r.t. the weights, it is minimised when

(w′Sbw) Sww = (w′Sww) Sbw

From the definition of Sb,

Sbw = (µ1 − µ2)(µ1 − µ2)′w = ((µ1 − µ2)′w) (µ1 − µ2)

We therefore know that Sww ∝ (µ1 − µ2). Multiplying both sides by Sw⁻¹ yields

w ∝ Sw⁻¹(µ1 − µ2)
This gives the direction of the decision boundary; however, we still need the bias value b. If the data is separable using Fisher’s discriminant, it makes sense to select the value of b that maximises the margin. Simply put, this means that, given no additional information, the boundary should be equidistant from the two points either side of the decision boundary.
Example

Using the previously described data:

µ1 = (0.825, 0.500)′ ;  µ2 = (0.1375, 0.600)′ ;  Sw = [ 0.2044 0.0300 ; 0.0300 1.2400 ]

Solving yields the direction

w = (3.3878, −0.1626)′

Taking the midpoint between the observations closest to the decision boundary gives

w[1] = (3.3878, −0.1626, −1.4432)′
[Figure: the training data with the Fisher discriminant decision boundary.]
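These values can be reproduced from the data; a minimal sketch assuming NumPy. Note Sw here sums scatter matrices, not covariances, matching the slides:

```python
import numpy as np

X1 = np.array([[1, 0], [1, 1], [0.6, 0.6], [0.7, 0.4]])   # class omega_1
X2 = np.array([[0, 0], [0, 1], [0.25, 1], [0.3, 0.4]])    # class omega_2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
scatter = lambda X, mu: (X - mu).T @ (X - mu)             # S_j
Sw = scatter(X1, mu1) + scatter(X2, mu2)                  # S_w = S_1 + S_2

w = np.linalg.solve(Sw, mu1 - mu2)   # w proportional to Sw^-1 (mu1 - mu2)
print(Sw)   # matches the slide's Sw up to rounding
print(w)    # approx (3.3878, -0.1626)
```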
Logistic Regression/Classification

It is interesting to contrast the perceptron algorithm with logistic regression/classification. This is a linear classifier, but a discriminative model rather than a discriminant function. The training criterion is
L(w) = ∑_{i=1}^{n} [ yi log( 1 / (1 + exp(−w′xi)) ) + (1 − yi) log( 1 / (1 + exp(w′xi)) ) ]

where (note the change of label definition)

yi = 1 if xi was generated by class ω1, and yi = 0 if xi was generated by class ω2.
It is simple to show that (this has a nice intuitive feel)

∇L(w) = ∑_{i=1}^{n} ( yi − 1 / (1 + exp(−w′xi)) ) xi
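A sketch of gradient ascent on L(w), assuming NumPy; the learning rate, iteration count and reuse of the earlier example data are illustrative choices:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, eta=0.5, n_iters=10000):
    """Gradient ascent with grad L(w) = sum_i (y_i - sigmoid(w'x_i)) x_i."""
    X_ext = np.hstack([X, np.ones((len(X), 1))])   # extended observations
    y = np.asarray(y, dtype=float)
    w = np.zeros(X_ext.shape[1])
    for _ in range(n_iters):
        w = w + eta * (y - sigmoid(X_ext @ w)) @ X_ext
    return w

# Labels follow the slide convention: y_i = 1 for omega_1, y_i = 0 for omega_2.
X = np.array([[1, 0], [1, 1], [0.6, 0.6], [0.7, 0.4],
              [0, 0], [0, 1], [0.25, 1], [0.3, 0.4]])
y = [1, 1, 1, 1, 0, 0, 0, 0]
w_hat = train_logistic(X, y)
```

The result is a class posterior P(ω1|x) = sigmoid(w′x), used with Bayes' decision rule to pick a class.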
Maximising the log-likelihood is equivalent to minimising the negative log-likelihood, E(w) = −L(w). The update rule then becomes
w[τ + 1] = w[τ ] + η∇L(w)|w[τ ]
Compared to the perceptron algorithm:
• need an appropriate method to select η;
• does not necessarily correctly classify training data even if linearly separable;
• not necessary for data to be linearly separable;
• yields a class posterior (use Bayes’ decision rule to get hypothesis).
Kesler’s Construct
So far we have only examined binary classifiers; direct use of multiple binary classifiers can result in no-decision regions (see examples paper).
The multi-class problem can be converted to a two-class problem. Consider an extended observation x which belongs to class ω1. Then, to be correctly classified,

w′1x − w′jx > 0,  j = 2, . . . , K

There are therefore K − 1 inequalities, requiring that the K(d + 1)-dimensional vector

α = (w′1, . . . , w′K)′
correctly classifies all K − 1 of the K(d + 1)-dimensional samples

γ12 = (x′, −x′, 0′, . . . , 0′)′ ,  γ13 = (x′, 0′, −x′, . . . , 0′)′ ,  . . . ,  γ1K = (x′, 0′, 0′, . . . , −x′)′
The multi-class problem has now been transformed into a two-class problem, at the expense of increasing the effective dimensionality of the data and increasing the number of training samples. We now simply optimise for α.
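The construct can be sketched as follows, assuming NumPy; the observation, class index and K are illustrative (classes are 0-based here):

```python
import numpy as np

def kesler_samples(x_ext, i, K):
    """The K-1 samples gamma_ij for extended observation x_ext in class i:
    x_ext in block i, -x_ext in block j, zeros elsewhere."""
    d1 = len(x_ext)                         # d + 1
    samples = []
    for j in range(K):
        if j == i:
            continue
        gamma = np.zeros(K * d1)
        gamma[i * d1:(i + 1) * d1] = x_ext
        gamma[j * d1:(j + 1) * d1] = -x_ext
        samples.append(gamma)
    return samples

# alpha stacks w_1, ..., w_K, so alpha' gamma_ij = w_i' x - w_j' x.
x_ext = np.array([0.2, 0.7, 1.0])           # illustrative, d = 2
gammas = kesler_samples(x_ext, i=0, K=3)
```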
Limitations of Linear Decision Boundaries
Perceptrons were very popular until 1969, when it was realised that they could not solve the XOR problem.
We can use perceptrons to implement the binary logic operators AND, OR, NAND and NOR.
[Figure: four single layer perceptrons with threshold activation implementing the gates:

(a) AND operator: weights (1, 1), bias −1.5
(b) OR operator: weights (1, 1), bias −0.5
(c) NAND operator: weights (−1, −1), bias 1.5
(d) NOR operator: weights (−1, −1), bias 0.5]
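A sketch of these four gates, using 0/1 logic values fed straight into the threshold (the slides draw ±1 outputs, but the weights and structure are the same):

```python
def perceptron_gate(w1, w2, w0):
    """Single layer perceptron with threshold activation, 0/1 output."""
    return lambda x1, x2: 1 if w1 * x1 + w2 * x2 + w0 >= 0 else 0

# Weights (w1, w2) and bias w0 as read off the figure above.
AND  = perceptron_gate( 1,  1, -1.5)
OR   = perceptron_gate( 1,  1, -0.5)
NAND = perceptron_gate(-1, -1,  1.5)
NOR  = perceptron_gate(-1, -1,  0.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, AND(x1, x2), OR(x1, x2), NAND(x1, x2), NOR(x1, x2))
```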
XOR (cont)

But XOR may be written in terms of AND, NAND and OR gates.
[Figure: a two-layer network computing XOR. A NAND unit (weights (−1, −1), bias 1.5) and an OR unit (weights (1, 1), bias −0.5) both take inputs x1 and x2; their outputs feed an AND unit (weights (1, 1), bias −1.5).]
This yields the decision boundaries shown. So XOR can be solved using a two-layer network. The problem is then how to train multi-layer perceptrons. In the 1980s an algorithm for training such networks was proposed: error back-propagation.