PATTERN RECOGNITION AND MACHINE LEARNING
Slide Set 2: Estimation Theory
October 2019
Heikki Huttunen
[email protected]
Signal Processing
Tampere University
Classical Estimation and Detection Theory
• Before the machine learning part, we will take a look at classical estimation theory.
• Estimation theory has many connections to the foundations of modern machine learning.
• Outline of the next few hours:
  1 Estimation theory:
    • Fundamentals
    • Maximum likelihood
    • Examples
  2 Detection theory:
    • Fundamentals
    • Error metrics
    • Examples
2 / 37
Introduction - estimation
• Our goal is to estimate the values of a group of parameters from data.
• Examples: radar, sonar, speech, image analysis, biomedicine, communications, control, seismology, etc.
• Parameter estimation: Given an N-point data set x = {x[0], x[1], . . . , x[N−1]} which depends on an unknown parameter θ ∈ R, we wish to design an estimator g(·) for θ:

  θ = g(x[0], x[1], . . . , x[N−1]).

• The fundamental questions are:
1 What is the model for our data?
2 How to determine its parameters?
3 / 37
Introductory Example – Straight line
• Suppose we have the illustrated time series and would like to approximate the relationship of the two coordinates.
• The relationship looks linear, so we could assume the following model:

  y[n] = ax[n] + b + w[n],

with a ∈ R and b ∈ R unknown and w[n] ∼ N(0, σ²).
• N(0, σ²) is the normal distribution with mean 0 and variance σ².

[Figure: scatter plot of the data points (x, y).]
4 / 37
Introductory Example – Straight line
• Each pair of a and b represents one line.
• Which of the three lines would best describe the data set? Or some other line?
[Figure: the data with three candidate lines — model candidate 1 (a = 0.07, b = 0.49), model candidate 2 (a = 0.06, b = 0.33), model candidate 3 (a = 0.08, b = 0.51).]
5 / 37
Introductory Example – Straight line
• It can be shown that the best solution (in the maximum likelihood sense; to be defined later) is given by

  a = −6/(N(N+1)) · Σ_{n=0}^{N−1} y[n] + 12/(N(N²−1)) · Σ_{n=0}^{N−1} x[n] y[n]

  b = 2(2N−1)/(N(N+1)) · Σ_{n=0}^{N−1} y[n] − 6/(N(N+1)) · Σ_{n=0}^{N−1} x[n] y[n].

• Or, as we will later learn, in an easy matrix form:

  θ = (a, b)^T = (X^T X)^{−1} X^T y
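As a quick numerical sanity check, the matrix form can be evaluated with NumPy. The data, noise level, and seed below are illustrative choices, not the slide's data set:

```python
import numpy as np

# Synthetic data from the slide's model y[n] = a*x[n] + b + w[n]
# (a = 0.07, b = 0.49 and the noise level are illustrative).
rng = np.random.default_rng(0)
x = np.linspace(-10, 10, 100)
y = 0.07 * x + 0.49 + rng.normal(0, 0.1, x.size)

# Matrix form: X has rows [x[n], 1], and theta = (X^T X)^{-1} X^T y
X = np.column_stack([x, np.ones_like(x)])
a_hat, b_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in degree-1 polynomial fit
a_ref, b_ref = np.polyfit(x, y, 1)
```

Both routes solve the same least-squares problem, so the estimates agree to numerical precision.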
6 / 37
Introductory Example – Straight line
• In this case, a = 0.07401 and b = 0.49319, which produces the line shown on the right.
• The line also minimizes the squared distances (green dashed lines) between the model (blue line) and the data (red circles).
[Figure: best fit y = 0.0740x + 0.4932; sum of squares = 3.62.]
7 / 37
Introductory Example 2 – Sinusoid
• Consider transmitting the sinusoid below.
8 / 37
Introductory Example 2 – Sinusoid
• When the data is received, it is corrupted by noise, and the received samples look like below.
• Can we recover the parameters of the sinusoid?
9 / 37
Introductory Example 2 – Sinusoid
• In this case, the problem is to find good values for A, f0 and φ in the following model:

  x[n] = A cos(2πf0n + φ) + w[n],

with w[n] ∼ N(0, σ²).
10 / 37
Introductory Example 2 – Sinusoid
• It can be shown that the maximum likelihood estimates (MLE) of the parameters A, f0 and φ are given by

  f0 = the value of f that maximizes |Σ_{n=0}^{N−1} x[n] e^{−2πifn}|

  A = (2/N) |Σ_{n=0}^{N−1} x[n] e^{−2πif0n}|

  φ = arctan( −Σ_{n=0}^{N−1} x[n] sin(2πf0n) / Σ_{n=0}^{N−1} x[n] cos(2πf0n) ).
11 / 37
Introductory Example 2 – Sinusoid
• It turns out that the sinusoidal parameter estimation is very successful:

[Figure: estimated vs. original sinusoid; f0 = 0.068 (0.068), A = 0.692 (0.667), φ = 0.238 (0.609).]

• The blue curve is the original sinusoid, and the red curve is the one estimated from the green circles.
• The estimates are shown in the figure (true values in parentheses).
12 / 37
Introductory Example 2 – Sinusoid
• However, the results are different for each realization of the noise w[n].
[Figure: four noise realizations and their estimates —
 f0 = 0.068, A = 0.652, φ = −0.023;
 f0 = 0.066, A = 0.660, φ = 0.851;
 f0 = 0.067, A = 0.786, φ = 0.814;
 f0 = 0.459, A = 0.618, φ = 0.299.]
13 / 37
Introductory Example 2 – Sinusoid
• Thus, we are not so much interested in an individual case as in the distributions of the estimates:
  • What are the expectations E[f0], E[φ] and E[A]?
  • What are their respective variances?
  • Could there be a better formula that would yield a smaller variance?
  • If yes, how to discover the better estimators?
14 / 37
LEAST SQUARES
15 / 37
Least Squares
• The general solution for the linear case is easy to remember.
• Consider the line fitting case: y[n] = ax[n] + b + w[n].
• This can be written in matrix form as follows:
  [ y[0]   ]   [ x[0]    1 ]
  [ y[1]   ] = [ x[1]    1 ] [ a ]  + w.
  [  ...   ]   [  ...  ... ] [ b ]
  [ y[N−1] ]   [ x[N−1]  1 ]
  (vector y)   (matrix X)   (vector θ)
16 / 37
Least Squares
• Now the model is written compactly as
y = Xθ +w
• Solution: The value of θ minimizing the error w^T w is given by

  θ_LS = (X^T X)^{−1} X^T y.
• See sample Python code here: https://github.com/mahehu/SGN-41007/blob/master/code/Least_Squares.ipynb
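In practice, forming (X^T X)^{−1} explicitly can be numerically delicate when X is ill-conditioned; NumPy's SVD-based solver np.linalg.lstsq minimizes the same squared error more robustly. A small comparison on illustrative synthetic data:

```python
import numpy as np

# Illustrative line-fit data (values are assumptions, not the slide's)
rng = np.random.default_rng(2)
x = np.linspace(-10, 10, 50)
y = 0.07 * x + 0.49 + rng.normal(0, 0.1, x.size)
X = np.column_stack([x, np.ones_like(x)])

# Normal equations, as on the slide
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based least squares solver; preferable for ill-conditioned X
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

For a well-conditioned problem like this one, the two solutions coincide.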
17 / 37
LS Example with two variables
• Consider an example, where microscope images have an uneven illumination.
• The illumination pattern appears to have the highest brightness at the center.
• Let's try to fit a paraboloid to remove the effect:

  z(x, y) = c1x² + c2y² + c3xy + c4x + c5y + c6,

with z(x, y) the brightness at pixel (x, y).
18 / 37
LS Example with two variables
• In matrix form, the model looks like the following:

  [ z1  ]   [ x1²   y1²   x1y1   x1   y1   1 ] [ c1 ]
  [ z2  ] = [ x2²   y2²   x2y2   x2   y2   1 ] [ c2 ]  + ε,        (1)
  [ ... ]   [ ...   ...   ...    ...  ... ... ] [ ...]
  [ zN  ]   [ xN²   yN²   xNyN   xN   yN   1 ] [ c6 ]
  (vector z)            (matrix H)             (vector c)

with zk the grayscale value at the k'th pixel (xk, yk).
19 / 37
LS Example with two variables
• As a result we get the LS fit by c = (H^T H)^{−1} H^T z:

  c = (−0.000080, −0.000288, 0.000123, 0.022064, 0.284020, 106.538687)

• Or, in other words,

  z(x, y) = −0.000080x² − 0.000288y² + 0.000123xy + 0.022064x + 0.284020y + 106.538687.
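The fit above can be sketched end to end on a synthetic image. The image size, coefficients, and noise level below are illustrative assumptions, not the slide's microscope data:

```python
import numpy as np

# Synthetic "image" with a paraboloid illumination surface plus pixel noise
rng = np.random.default_rng(3)
yy, xx = np.mgrid[0:60, 0:80]                    # illustrative 60x80 image
x, y = xx.ravel().astype(float), yy.ravel().astype(float)

c_true = np.array([-8.0e-5, -2.9e-4, 1.2e-4, 0.022, 0.284, 106.5])
H = np.column_stack([x**2, y**2, x * y, x, y, np.ones_like(x)])
z = H @ c_true + rng.normal(0, 1.0, x.size)

# LS fit of the illumination surface: c = (H^T H)^{-1} H^T z
c_hat = np.linalg.solve(H.T @ H, H.T @ z)

# Remove the illumination component, keeping the mean brightness level
z_corrected = z - H @ c_hat + c_hat[-1]
```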
20 / 37
LS Example with two variables
• Finally, we remove the illumination component by subtraction.
21 / 37
MAXIMUM LIKELIHOOD
22 / 37
Maximum Likelihood Estimation
• Maximum likelihood (MLE) is the most popular estimation approach due to its applicability in complicated estimation problems.
• Maximization of likelihood also appears often as the optimality criterion in machine learning.
• The method was proposed by Fisher in 1922, though he published the basic principle already in 1912 as a third-year undergraduate.
• The basic principle is simple: find the parameter θ that is the most likely to have generated the data x.
• The MLE estimator may or may not be optimal in the minimum variance sense. It is not necessarily unbiased, either.
23 / 37
The Likelihood Function
• Consider again the problem of estimating the mean level A of noisy data.
• Assume that the data originates from the following model:

  x[n] = A + w[n],

where w[n] ∼ N(0, σ²): a constant plus Gaussian random noise with zero mean and variance σ².
[Figure: one realization of the noisy data x[n].]
24 / 37
The Likelihood Function
• The key to maximum likelihood is the likelihood function.
• If the probability density function (PDF) of the data is viewed as a function of the unknown parameter (with the data fixed), it is called the likelihood function.
• Often the likelihood function has an exponential form. Then it is usual to take the natural logarithm to get rid of the exponential. Note that the location of the maximum of the new log-likelihood function does not change.
[Figure: likelihood and log-likelihood of A, assuming a single observation x[0] = 3.]
25 / 37
MLE Example
• In our case of estimating the mean of a signal

  x[n] = A + w[n],    n = 0, 1, . . . , N−1,

the likelihood function is easy to construct.
• The noise samples w[n] ∼ N(0, σ²) are assumed independent, so the PDF of the whole batch of samples x = (x[0], . . . , x[N−1]) is obtained by the product rule:

  p(x; A) = ∏_{n=0}^{N−1} p(x[n]; A) = 1/(2πσ²)^{N/2} · exp[ −1/(2σ²) Σ_{n=0}^{N−1} (x[n] − A)² ]

• When we have observed the data x, we can turn the problem around and consider what is the most likely parameter A to have generated the data.
26 / 37
MLE Example
• Some authors emphasize this by turning the order around: p(A; x), or give the function a different name, such as L(A; x) or ℓ(A; x).
• So, consider p(x; A) as a function of A and try to maximize it.
27 / 37
MLE Example
• The picture below shows the likelihood function and the log-likelihood function for one possible realization of data.
• The data consists of 50 points, with true A = 5.
• The likelihood function gives the probability of observing these particular points with different values of A.
[Figure: likelihood (scale 1e−31) and log-likelihood of A for the 50 samples; maximum at A = 5.17.]
28 / 37
MLE Example
• Instead of finding the maximum from the plot, we wish to have a closed form solution.
• Closed form is faster, more elegant, more accurate and numerically more stable.
• Just for the sake of an example, below is the code for the brute-force version.

  import numpy as np

  def gaussian(x, mu, sigma):
      # N(mu, sigma^2) PDF, evaluated elementwise
      return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

  def gaussian_log(x, mu, sigma):
      # Log-PDF of N(mu, sigma^2)
      return -(x - mu)**2 / (2 * sigma**2) - 0.5 * np.log(2 * np.pi * sigma**2)

  # The samples are in an array called x0
  A_grid = np.linspace(2, 8, 200)
  likelihood = [gaussian(x0, A, 1).prod() for A in A_grid]
  log_likelihood = [gaussian_log(x0, A, 1).sum() for A in A_grid]

  print("Max likelihood is at %.2f" % A_grid[np.argmax(log_likelihood)])
29 / 37
MLE Example
• Maximization of p(x; A) directly is nontrivial. Therefore, we take the logarithm and maximize it instead:

  p(x; A) = 1/(2πσ²)^{N/2} · exp[ −1/(2σ²) Σ_{n=0}^{N−1} (x[n] − A)² ]

  ln p(x; A) = −(N/2) ln(2πσ²) − 1/(2σ²) Σ_{n=0}^{N−1} (x[n] − A)²

• The maximum is found via differentiation:

  ∂ ln p(x; A)/∂A = (1/σ²) Σ_{n=0}^{N−1} (x[n] − A)
30 / 37
MLE Example
• Setting this equal to zero gives

  (1/σ²) Σ_{n=0}^{N−1} (x[n] − A) = 0

  Σ_{n=0}^{N−1} (x[n] − A) = 0

  Σ_{n=0}^{N−1} x[n] − Σ_{n=0}^{N−1} A = 0

  Σ_{n=0}^{N−1} x[n] − NA = 0

  Σ_{n=0}^{N−1} x[n] = NA

  A = (1/N) Σ_{n=0}^{N−1} x[n]
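The derivation can be checked numerically: maximizing the log-likelihood over a grid of candidate values should land on the sample mean. A minimal sketch with illustrative data (true A = 5, σ = 1 assumed known):

```python
import numpy as np

# Illustrative data: 50 samples from N(5, 1)
rng = np.random.default_rng(4)
x0 = 5.0 + rng.normal(0, 1, 50)

A_grid = np.linspace(2, 8, 2001)   # grid spacing 0.003
# log-likelihood up to additive constants: -1/(2 sigma^2) * sum (x[n] - A)^2
ll = np.array([-0.5 * np.sum((x0 - A) ** 2) for A in A_grid])
A_mle = A_grid[np.argmax(ll)]
```

The grid maximizer agrees with x0.mean() up to the grid resolution, as the closed form predicts.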
31 / 37
Conclusion
• What did we actually do?
• We proved that the sample mean is the maximum likelihood estimator of the distribution mean.
• But I could have guessed this result from the beginning. What's the point?
• We can do the same thing for cases where you cannot guess.
32 / 37
Example: Sinusoidal Parameter Estimation
• Consider the model

  x[n] = A cos(2πf0n + φ) + w[n]

with w[n] ∼ N(0, σ²). It is possible to find the MLE for all three parameters: θ = [A, f0, φ]^T.
• The PDF is given as

  p(x; θ) = 1/(2πσ²)^{N/2} · exp[ −1/(2σ²) Σ_{n=0}^{N−1} (x[n] − A cos(2πf0n + φ))² ],

where the term inside the square is the noise w[n].
33 / 37
Example: Sinusoidal Parameter Estimation
• Instead of proceeding directly through the log-likelihood function, we note that the above function is maximized when

  J(A, f0, φ) = Σ_{n=0}^{N−1} (x[n] − A cos(2πf0n + φ))²

is minimized.
• The minimum of this function can be found, although it is a nontrivial task (about 10 slides).
• We skip the derivation; for details, see S. Kay, "Fundamentals of Statistical Signal Processing: Estimation Theory," 1993.
34 / 37
Sinusoidal Parameter Estimation
• The MLE of the frequency f0 is obtained by maximizing the periodogram over f:

  f0 = arg max_f |Σ_{n=0}^{N−1} x[n] exp(−j2πfn)|

• Once f0 is available, proceed by calculating the other parameters:

  A = (2/N) |Σ_{n=0}^{N−1} x[n] exp(−j2πf0n)|

  φ = arctan( −Σ_{n=0}^{N−1} x[n] sin(2πf0n) / Σ_{n=0}^{N−1} x[n] cos(2πf0n) )
35 / 37
Sinusoidal Parameter Estimation—Experiments
• Four example runs of the estimation algorithm are illustrated in the figures.
• The algorithm was also tested on 10000 realizations of a sinusoid with fixed θ and N = 160, σ² = 1.2.
• Note that the estimator is not unbiased.
[Figure: histograms of the 10000 estimates of f0 (true = 0.07), A (true = 0.67) and φ (true = 0.61).]
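A Monte Carlo experiment of this kind can be sketched along the following lines. The trial count is reduced to 200 (instead of 10000) to keep the run short, and the frequency grid density is an illustrative choice:

```python
import numpy as np

# Monte Carlo sketch of the frequency estimator's spread
# (N = 160 and sigma^2 = 1.2 as on the slide; 200 trials for speed)
rng = np.random.default_rng(5)
N, f0, A, phi = 160, 0.07, 0.67, 0.61
n = np.arange(N)
f_grid = np.linspace(0.001, 0.499, 1000)
E = np.exp(-2j * np.pi * np.outer(f_grid, n))   # grid of complex exponentials

f0_hats = []
for _ in range(200):
    x = A * np.cos(2 * np.pi * f0 * n + phi) + rng.normal(0, np.sqrt(1.2), N)
    f0_hats.append(f_grid[np.argmax(np.abs(E @ x))])
f0_hats = np.array(f0_hats)

# Most estimates cluster near the true f0, but occasional outlier trials
# land on a noise peak, giving the histogram its heavy tail
frac_near = np.mean(np.abs(f0_hats - f0) < 0.01)
```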
36 / 37
Estimation Theory—Summary
• We have seen a brief overview of estimation theory, with particular focus on maximum likelihood.
• If your problem is simple enough to be modeled by an equation, estimation theory is the answer.
  • Estimating the frequency of a sinusoid is completely solved by the classical theory.
  • Estimating the age of a person in a picture cannot possibly be modeled this simply, and the classical theory has no answer.
• Model-based estimation is the best answer when a model exists.
• Machine learning can be understood as a data-driven approach.
37 / 37