PATTERN RECOGNITION AND MACHINE LEARNING
Slide Set 2: Estimation Theory
October 2019
Heikki Huttunen
[email protected]
Signal Processing
Tampere University
Classical Estimation and Detection Theory
• Before the machine learning part, we will take a look at classical estimation theory.
• Estimation theory has many connections to the foundations of modern machine learning.
• Outline of the next few hours:
  1 Estimation theory:
    • Fundamentals
    • Maximum likelihood
    • Examples
  2 Detection theory:
    • Fundamentals
    • Error metrics
    • Examples
2 / 37
Introduction - estimation
• Our goal is to estimate the values of a group of parameters from data.
• Examples: radar, sonar, speech, image analysis, biomedicine, communications, control, seismology, etc.
• Parameter estimation: Given an N-point data set x = {x[0], x[1], . . . , x[N−1]} which depends on an unknown parameter θ ∈ R, we wish to design an estimator g(·) for θ:

  θ = g(x[0], x[1], . . . , x[N−1]).

• The fundamental questions are:
1 What is the model for our data?
2 How to determine its parameters?
3 / 37
Introductory Example – Straight line
• Suppose we have the illustrated time series and would like to approximate the relationship of the two coordinates.
• The relationship looks linear, so we could assume the following model:

  y[n] = ax[n] + b + w[n],

with a ∈ R and b ∈ R unknown and w[n] ∼ N(0, σ²).
• N(0, σ²) is the normal distribution with mean 0 and variance σ².

[Figure: scatter plot of the data points (x, y).]
4 / 37
Introductory Example – Straight line
• Each pair of a and b represents one line.
• Which of the three lines would best describe the data set? Or some other line?
[Figure: the data with three candidate lines — model candidate 1 (a = 0.07, b = 0.49), model candidate 2 (a = 0.06, b = 0.33), model candidate 3 (a = 0.08, b = 0.51).]
5 / 37
Introductory Example – Straight line
• It can be shown that the best solution (in the maximum likelihood sense; to be defined later) is given by

  a = −6/(N(N+1)) · Σ_{n=0}^{N−1} y[n] + 12/(N(N²−1)) · Σ_{n=0}^{N−1} x[n] y[n]

  b = 2(2N−1)/(N(N+1)) · Σ_{n=0}^{N−1} y[n] − 6/(N(N+1)) · Σ_{n=0}^{N−1} x[n] y[n].

• Or, as we will later learn, in an easy matrix form:

  θ = (a, b)^T = (X^T X)^{−1} X^T y
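As a quick numerical sanity check, the matrix form can be evaluated with NumPy. The data, noise level, and seed below are illustrative choices, not the slide's data set:

```python
import numpy as np

# Synthetic data from the slide's model y[n] = a*x[n] + b + w[n]
# (a = 0.07, b = 0.49 and the noise level are illustrative).
rng = np.random.default_rng(0)
x = np.linspace(-10, 10, 100)
y = 0.07 * x + 0.49 + rng.normal(0, 0.1, x.size)

# Matrix form: X has rows [x[n], 1], and theta = (X^T X)^{-1} X^T y
X = np.column_stack([x, np.ones_like(x)])
a_hat, b_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in degree-1 polynomial fit
a_ref, b_ref = np.polyfit(x, y, 1)
```

Both routes solve the same least-squares problem, so the estimates agree to numerical precision.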
6 / 37
Introductory Example – Straight line
• In this case, a = 0.07401 and b = 0.49319, which produces the line shown on the right.
• The line also minimizes the squared distances (green dashed lines) between the model (blue line) and the data (red circles).
[Figure: best fit y = 0.0740x + 0.4932; sum of squares = 3.62.]
7 / 37
Introductory Example 2 – Sinusoid
• Consider transmitting the sinusoid below.
8 / 37
Introductory Example 2 – Sinusoid
• When the data is received, it is corrupted by noise, and the received samples look like below.
• Can we recover the parameters of the sinusoid?
9 / 37
Introductory Example 2 – Sinusoid
• In this case, the problem is to find good values for A, f0 and φ in the following model:

  x[n] = A cos(2πf0n + φ) + w[n],

with w[n] ∼ N(0, σ²).
10 / 37
Introductory Example 2 – Sinusoid
• It can be shown that the maximum likelihood estimates (MLE) of the parameters A, f0 and φ are given by

  f0 = the value of f that maximizes |Σ_{n=0}^{N−1} x[n] e^{−2πifn}|

  A = (2/N) |Σ_{n=0}^{N−1} x[n] e^{−2πif0n}|

  φ = arctan( −Σ_{n=0}^{N−1} x[n] sin(2πf0n) / Σ_{n=0}^{N−1} x[n] cos(2πf0n) ).
11 / 37
Introductory Example 2 – Sinusoid
• It turns out that the sinusoidal parameter estimation is very successful:

[Figure: estimated vs. original sinusoid; f0 = 0.068 (0.068), A = 0.692 (0.667), φ = 0.238 (0.609).]

• The blue curve is the original sinusoid, and the red curve is the one estimated from the green circles.
• The estimates are shown in the figure (true values in parentheses).
12 / 37
Introductory Example 2 – Sinusoid
• However, the results are different for each realization of the noise w[n].
[Figure: four noise realizations and their estimates —
 f0 = 0.068, A = 0.652, φ = −0.023;
 f0 = 0.066, A = 0.660, φ = 0.851;
 f0 = 0.067, A = 0.786, φ = 0.814;
 f0 = 0.459, A = 0.618, φ = 0.299.]
13 / 37
Introductory Example 2 – Sinusoid
• Thus, we are not so much interested in an individual case as in the distributions of the estimates:
  • What are the expectations E[f0], E[φ] and E[A]?
  • What are their respective variances?
  • Could there be a better formula that would yield a smaller variance?
  • If yes, how to discover the better estimators?
14 / 37
LEAST SQUARES
15 / 37
Least Squares
• The general solution for the linear case is easy to remember.
• Consider the line fitting case: y[n] = ax[n] + b + w[n].
• This can be written in matrix form as follows:
  [ y[0]   ]   [ x[0]    1 ]
  [ y[1]   ] = [ x[1]    1 ] [ a ]  + w.
  [  ...   ]   [  ...  ... ] [ b ]
  [ y[N−1] ]   [ x[N−1]  1 ]
  (vector y)   (matrix X)   (vector θ)
16 / 37
Least Squares
• Now the model is written compactly as
y = Xθ +w
• Solution: The value of θ minimizing the error w^T w is given by

  θ_LS = (X^T X)^{−1} X^T y.
• See sample Python code here: https://github.com/mahehu/SGN-41007/blob/master/code/Least_Squares.ipynb
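In practice, forming (X^T X)^{−1} explicitly can be numerically delicate when X is ill-conditioned; NumPy's SVD-based solver np.linalg.lstsq minimizes the same squared error more robustly. A small comparison on illustrative synthetic data:

```python
import numpy as np

# Illustrative line-fit data (values are assumptions, not the slide's)
rng = np.random.default_rng(2)
x = np.linspace(-10, 10, 50)
y = 0.07 * x + 0.49 + rng.normal(0, 0.1, x.size)
X = np.column_stack([x, np.ones_like(x)])

# Normal equations, as on the slide
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based least squares solver; preferable for ill-conditioned X
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

For a well-conditioned problem like this one, the two solutions coincide.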
17 / 37
LS Example with two variables
• Consider an example, where microscope images have an uneven illumination.
• The illumination pattern appears to have the highest brightness at the center.
• Let's try to fit a paraboloid to remove the effect:

  z(x, y) = c1x² + c2y² + c3xy + c4x + c5y + c6,

with z(x, y) the brightness at pixel (x, y).
18 / 37
LS Example with two variables
• In matrix form, the model looks like the following:

  [ z1  ]   [ x1²   y1²   x1y1   x1   y1   1 ] [ c1 ]
  [ z2  ] = [ x2²   y2²   x2y2   x2   y2   1 ] [ c2 ]  + ε,        (1)
  [ ... ]   [ ...   ...   ...    ...  ... ... ] [ ...]
  [ zN  ]   [ xN²   yN²   xNyN   xN   yN   1 ] [ c6 ]
  (vector z)            (matrix H)             (vector c)

with zk the grayscale value at the k'th pixel (xk, yk).
19 / 37
LS Example with two variables
• As a result we get the LS fit by c = (H^T H)^{−1} H^T z:

  c = (−0.000080, −0.000288, 0.000123, 0.022064, 0.284020, 106.538687)

• Or, in other words,

  z(x, y) = −0.000080x² − 0.000288y² + 0.000123xy + 0.022064x + 0.284020y + 106.538687.
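The fit above can be sketched end to end on a synthetic image. The image size, coefficients, and noise level below are illustrative assumptions, not the slide's microscope data:

```python
import numpy as np

# Synthetic "image" with a paraboloid illumination surface plus pixel noise
rng = np.random.default_rng(3)
yy, xx = np.mgrid[0:60, 0:80]                    # illustrative 60x80 image
x, y = xx.ravel().astype(float), yy.ravel().astype(float)

c_true = np.array([-8.0e-5, -2.9e-4, 1.2e-4, 0.022, 0.284, 106.5])
H = np.column_stack([x**2, y**2, x * y, x, y, np.ones_like(x)])
z = H @ c_true + rng.normal(0, 1.0, x.size)

# LS fit of the illumination surface: c = (H^T H)^{-1} H^T z
c_hat = np.linalg.solve(H.T @ H, H.T @ z)

# Remove the illumination component, keeping the mean brightness level
z_corrected = z - H @ c_hat + c_hat[-1]
```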
20 / 37
LS Example with two variables
• Finally, we remove the illumination component by subtraction.
21 / 37
MAXIMUM LIKELIHOOD
22 / 37
Maximum Likelihood Estimation
• Maximum likelihood (MLE) is the most popular estimation approach due to its applicability in complicated estimation problems.
• Maximization of likelihood also appears often as the optimality criterion in machine learning.
• The method was proposed by Fisher in 1922, though he published the basic principle already in 1912 as a third-year undergraduate.
• The basic principle is simple: find the parameter θ that is the most likely to have generated the data x.
• The MLE estimator may or may not be optimal in the minimum variance sense. It is not necessarily unbiased, either.
23 / 37
The Likelihood Function
• Consider again the problem of estimating the mean level A of noisy data.
• Assume that the data originates from the following model:

  x[n] = A + w[n],

where w[n] ∼ N(0, σ²): a constant plus Gaussian random noise with zero mean and variance σ².
[Figure: one realization of the noisy data x[n].]
24 / 37
The Likelihood Function
• The key to maximum likelihood is the likelihood function.
• If the probability density function (PDF) of the data is viewed as a function of the unknown parameter (with the data fixed), it is called the likelihood function.
• Often the likelihood function has an exponential form. Then it is usual to take the natural logarithm to get rid of the exponential. Note that the location of the maximum of the new log-likelihood function does not change.
[Figure: likelihood and log-likelihood of A, assuming a single observation x[0] = 3.]
25 / 37
MLE Example
• In our case of estimating the mean of a signal

  x[n] = A + w[n],    n = 0, 1, . . . , N−1,

the likelihood function is easy to construct.
• The noise samples w[n] ∼ N(0, σ²) are assumed independent, so the PDF of the whole batch of samples x = (x[0], . . . , x[N−1]) is obtained by the product rule:

  p(x; A) = ∏_{n=0}^{N−1} p(x[n]; A) = 1/(2πσ²)^{N/2} · exp[ −1/(2σ²) Σ_{n=0}^{N−1} (x[n] − A)² ]

• When we have observed the data x, we can turn the problem around and consider what is the most likely parameter A to have generated the data.
26 / 37
MLE Example
• Some authors emphasize this by turning the order around: p(A; x), or give the function a different name, such as L(A; x) or ℓ(A; x).
• So, consider p(x; A) as a function of A and try to maximize it.
27 / 37
MLE Example
• The picture below shows the likelihood function and the log-likelihood function for one possible realization of data.
• The data consists of 50 points, with true A = 5.
• The likelihood function gives the probability of observing these particular points with different values of A.
[Figure: likelihood (scale 1e−31) and log-likelihood of A for the 50 samples; maximum at A = 5.17.]
28 / 37
MLE Example
• Instead of finding the maximum from the plot, we wish to have a closed form solution.
• Closed form is faster, more elegant, more accurate and numerically more stable.
• Just for the sake of an example, below is the code for the brute-force version.

  import numpy as np

  def gaussian(x, mu, sigma):
      # N(mu, sigma^2) PDF, evaluated elementwise
      return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

  def gaussian_log(x, mu, sigma):
      # Log-PDF of N(mu, sigma^2)
      return -(x - mu)**2 / (2 * sigma**2) - 0.5 * np.log(2 * np.pi * sigma**2)

  # The samples are in an array called x0
  A_grid = np.linspace(2, 8, 200)
  likelihood = [gaussian(x0, A, 1).prod() for A in A_grid]
  log_likelihood = [gaussian_log(x0, A, 1).sum() for A in A_grid]

  print("Max likelihood is at %.2f" % A_grid[np.argmax(log_likelihood)])
29 / 37
MLE Example
• Maximization of p(x; A) directly is nontrivial. Therefore, we take the logarithm and maximize it instead:

  p(x; A) = 1/(2πσ²)^{N/2} · exp[ −1/(2σ²) Σ_{n=0}^{N−1} (x[n] − A)² ]

  ln p(x; A) = −(N/2) ln(2πσ²) − 1/(2σ²) Σ_{n=0}^{N−1} (x[n] − A)²

• The maximum is found via differentiation:

  ∂ ln p(x; A)/∂A = (1/σ²) Σ_{n=0}^{N−1} (x[n] − A)
30 / 37
MLE Example
• Setting this equal to zero gives

  (1/σ²) Σ_{n=0}^{N−1} (x[n] − A) = 0

  Σ_{n=0}^{N−1} (x[n] − A) = 0

  Σ_{n=0}^{N−1} x[n] − Σ_{n=0}^{N−1} A = 0

  Σ_{n=0}^{N−1} x[n] − NA = 0

  Σ_{n=0}^{N−1} x[n] = NA

  A = (1/N) Σ_{n=0}^{N−1} x[n]
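The derivation can be checked numerically: maximizing the log-likelihood over a grid of candidate values should land on the sample mean. A minimal sketch with illustrative data (true A = 5, σ = 1 assumed known):

```python
import numpy as np

# Illustrative data: 50 samples from N(5, 1)
rng = np.random.default_rng(4)
x0 = 5.0 + rng.normal(0, 1, 50)

A_grid = np.linspace(2, 8, 2001)   # grid spacing 0.003
# log-likelihood up to additive constants: -1/(2 sigma^2) * sum (x[n] - A)^2
ll = np.array([-0.5 * np.sum((x0 - A) ** 2) for A in A_grid])
A_mle = A_grid[np.argmax(ll)]
```

The grid maximizer agrees with x0.mean() up to the grid resolution, as the closed form predicts.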
31 / 37
Conclusion
• What did we actually do?
• We proved that the sample mean is the maximum likelihood estimator of the distribution mean.
• But I could have guessed this result from the beginning. What's the point?
• We can do the same thing for cases where you cannot guess.
32 / 37
Example: Sinusoidal Parameter Estimation
• Consider the model

  x[n] = A cos(2πf0n + φ) + w[n]

with w[n] ∼ N(0, σ²). It is possible to find the MLE for all three parameters: θ = [A, f0, φ]^T.
• The PDF is given as

  p(x; θ) = 1/(2πσ²)^{N/2} · exp[ −1/(2σ²) Σ_{n=0}^{N−1} (x[n] − A cos(2πf0n + φ))² ],

where the term inside the square is the noise w[n].
33 / 37
Example: Sinusoidal Parameter Estimation
• Instead of proceeding directly through the log-likelihood function, we note that the above function is maximized when

  J(A, f0, φ) = Σ_{n=0}^{N−1} (x[n] − A cos(2πf0n + φ))²

is minimized.
• The minimum of this function can be found, although it is a nontrivial task (about 10 slides).
• We skip the derivation; for details, see S. Kay, "Fundamentals of Statistical Signal Processing: Estimation Theory," 1993.
34 / 37
Sinusoidal Parameter Estimation
• The MLE of the frequency f0 is obtained by maximizing the periodogram over f:

  f0 = arg max_f |Σ_{n=0}^{N−1} x[n] exp(−j2πfn)|

• Once f0 is available, proceed by calculating the other parameters:

  A = (2/N) |Σ_{n=0}^{N−1} x[n] exp(−j2πf0n)|

  φ = arctan( −Σ_{n=0}^{N−1} x[n] sin(2πf0n) / Σ_{n=0}^{N−1} x[n] cos(2πf0n) )
35 / 37
Sinusoidal Parameter Estimation—Experiments
• Four example runs of the estimation algorithm are illustrated in the figures.
• The algorithm was also tested on 10000 realizations of a sinusoid with fixed θ and N = 160, σ² = 1.2.
• Note that the estimator is not unbiased.
[Figure: histograms of the 10000 estimates of f0 (true = 0.07), A (true = 0.67) and φ (true = 0.61).]
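A Monte Carlo experiment of this kind can be sketched along the following lines. The trial count is reduced to 200 (instead of 10000) to keep the run short, and the frequency grid density is an illustrative choice:

```python
import numpy as np

# Monte Carlo sketch of the frequency estimator's spread
# (N = 160 and sigma^2 = 1.2 as on the slide; 200 trials for speed)
rng = np.random.default_rng(5)
N, f0, A, phi = 160, 0.07, 0.67, 0.61
n = np.arange(N)
f_grid = np.linspace(0.001, 0.499, 1000)
E = np.exp(-2j * np.pi * np.outer(f_grid, n))   # grid of complex exponentials

f0_hats = []
for _ in range(200):
    x = A * np.cos(2 * np.pi * f0 * n + phi) + rng.normal(0, np.sqrt(1.2), N)
    f0_hats.append(f_grid[np.argmax(np.abs(E @ x))])
f0_hats = np.array(f0_hats)

# Most estimates cluster near the true f0, but occasional outlier trials
# land on a noise peak, giving the histogram its heavy tail
frac_near = np.mean(np.abs(f0_hats - f0) < 0.01)
```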
36 / 37
Estimation Theory—Summary
• We have seen a brief overview of estimation theory, with particular focus on maximum likelihood.
• If your problem is simple enough to be modeled by an equation, estimation theory is the answer.
  • Estimating the frequency of a sinusoid is completely solved by the classical theory.
  • Estimating the age of a person in a picture cannot possibly be modeled this simply, and the classical theory has no answer.
• Model-based estimation is the best answer when a model exists.
• Machine learning can be understood as a data-driven approach.
37 / 37