
High Performance Computing for Science and Engineering II
Spring semester 2019
P. Koumoutsakos
ETH Zentrum, CLT E 13
CH-8092 Zürich

HW 2 - Probability & Bayesian Inference

Issued: March 4, 2019
Due Date: March 18, 2019 10:00am

1-Week Milestone: Solve tasks 1 to 3; install python (version > 3.0) and make yourself comfortable with python (task 4.1) (March 11).

Task 1: Probability Theory Reminders

In this exercise we fix the notation we will use during this course and refresh our memory on basic properties of random variables. Present your answers in detail.

a) [10pts] A random variable with normal (or Gaussian) distribution $X \sim \mathcal{N}(\mu, \sigma^2)$ has probability density function (pdf) given by

\[
f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} . \tag{1}
\]

Show that the mean and the variance of $X$ are given by $E[X] = \mu$ and $E\big[(X-\mu)^2\big] = \sigma^2$, respectively.

The mean of $X$ is given by

\[
\begin{aligned}
E[X] &= \int_{-\infty}^{+\infty} x f_X(x)\,dx \\
&= \int_{-\infty}^{+\infty} (x-\mu) f_X(x)\,dx + \mu \int_{-\infty}^{+\infty} f_X(x)\,dx \\
&= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{+\infty} y\, e^{-\frac{y^2}{2\sigma^2}}\,dy + \mu \\
&= \mu,
\end{aligned}
\]

where we have used that $\int_{-\infty}^{+\infty} f_X(x)\,dx = 1$ because $f_X$ is a density, and that $\int_{-\infty}^{+\infty} y\, e^{-y^2/(2\sigma^2)}\,dy = 0$ due to symmetry.

The variance of $X$ is given by

\[
E\big[(X-\mu)^2\big] = \int_{-\infty}^{+\infty} (x-\mu)^2 f_X(x)\,dx .
\]

By setting $y = x - \mu$ and $a = \frac{1}{2\sigma^2}$, we obtain

\[
\begin{aligned}
E\big[(X-\mu)^2\big] &= \sqrt{\frac{a}{\pi}} \int_{-\infty}^{+\infty} y^2 e^{-a y^2}\,dy \\
&= -\sqrt{\frac{a}{\pi}}\, \frac{\partial}{\partial a} \int_{-\infty}^{+\infty} e^{-a y^2}\,dy \\
&= -\sqrt{\frac{a}{\pi}}\, \frac{\partial}{\partial a} \sqrt{\frac{\pi}{a}} \\
&= \frac{1}{2a} = \sigma^2 .
\end{aligned}
\]
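As an optional numerical sanity check of these two results (not part of the required derivation; the parameter values below are arbitrary illustrative choices):

    import numpy as np

    mu, sigma = 1.5, 2.0                   # arbitrary illustrative parameters
    rng = np.random.default_rng(0)
    x = rng.normal(mu, sigma, size=10**6)  # samples from N(mu, sigma^2)

    print(x.mean())  # close to mu = 1.5
    print(x.var())   # close to sigma^2 = 4.0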

b) [10pts] The probability that a random variable $X$ with pdf $f_X$ is less than or equal to any $x \in \mathbb{R}$ is given by

\[
P(X \le x) = F_X(x) = \int_{-\infty}^{x} f_X(z)\,dz . \tag{2}
\]

The function $F_X$ is called the cumulative distribution function (cdf). The Laplace distribution with parameters $\mu$ and $\beta$ has pdf

\[
f(x) = \frac{1}{2\beta} \exp\!\left(-\frac{|x-\mu|}{\beta}\right) . \tag{3}
\]

i) Find the cdf of the Laplace distribution.

Combining eq. (2) and eq. (3), the cdf of the Laplace distribution is

\[
F_X(x) = \frac{1}{2\beta} \int_{-\infty}^{x} \exp\!\left(-\frac{|z-\mu|}{\beta}\right) dz .
\]

Substituting $s = z - \mu$,

\[
F_X(x) = \frac{1}{2\beta} \int_{-\infty}^{x-\mu} \exp\!\left(-\frac{|s|}{\beta}\right) ds .
\]

Due to the absolute value, we have two cases. For $x - \mu < 0$,

\[
F_X(x) = \frac{1}{2\beta} \int_{-\infty}^{x-\mu} \exp\!\left(\frac{s}{\beta}\right) ds = \frac{1}{2} \exp\!\left(\frac{x-\mu}{\beta}\right),
\]

and for $x - \mu \ge 0$,

\[
F_X(x) = \frac{1}{2\beta} \int_{-\infty}^{0} \exp\!\left(\frac{s}{\beta}\right) ds + \frac{1}{2\beta} \int_{0}^{x-\mu} \exp\!\left(-\frac{s}{\beta}\right) ds = 1 - \frac{1}{2} \exp\!\left(-\frac{x-\mu}{\beta}\right).
\]

Finally,

\[
F_X(x) =
\begin{cases}
\frac{1}{2} \exp\!\left(\frac{x-\mu}{\beta}\right), & x < \mu \\[4pt]
1 - \frac{1}{2} \exp\!\left(-\frac{x-\mu}{\beta}\right), & x \ge \mu .
\end{cases}
\]

ii) Use the cdf to find the median of the Laplace distribution.

The median is defined as the point $m$ that satisfies $F_X(m) = \frac{1}{2}$. For both cases, $x < \mu$ and $x \ge \mu$, it is easy to verify that $m = \mu$.
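Both results can be checked numerically against scipy's reference implementation (scipy.stats.laplace, which uses loc $= \mu$ and scale $= \beta$); the values of $\mu$ and $\beta$ below are arbitrary illustrative choices:

    import numpy as np
    from scipy.stats import laplace

    mu, beta = 0.5, 2.0

    def laplace_cdf(x, mu, beta):
        # piecewise cdf derived above
        x = np.asarray(x, dtype=float)
        return np.where(x < mu,
                        0.5 * np.exp((x - mu) / beta),
                        1.0 - 0.5 * np.exp(-(x - mu) / beta))

    xs = np.linspace(-10.0, 10.0, 101)
    assert np.allclose(laplace_cdf(xs, mu, beta),
                       laplace.cdf(xs, loc=mu, scale=beta))
    print(laplace_cdf(mu, mu, beta))  # 0.5, consistent with the median m = mu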

c) [10pts] The pdf of the quotient $Q = X/Y$ of two random variables $X, Y$ is given by

\[
f_Q(q) = \int_{-\infty}^{\infty} |x|\, f_{X,Y}(qx, x)\,dx , \tag{4}
\]

where $f_{X,Y}$ is the joint pdf of $X$ and $Y$. Assume that $X$ and $Y$ are independent random variables with pdfs $f_X(x) = \mathcal{N}(x \mid 0, \sigma_X^2)$ and $f_Y(y) = \mathcal{N}(y \mid 0, \sigma_Y^2)$.

i) Find the joint pdf of $X$ and $Y$.

Since $X$ and $Y$ are independent, it holds that $f_{X,Y} = f_X f_Y$:

\[
f_{X,Y}(x, y) = \mathcal{N}(x \mid 0, \sigma_X^2)\, \mathcal{N}(y \mid 0, \sigma_Y^2) = \frac{1}{2\pi\sigma_X\sigma_Y} \exp\!\left(-\frac{x^2}{2\sigma_X^2} - \frac{y^2}{2\sigma_Y^2}\right).
\]

ii) Show that $Q = X/Y$ follows a Cauchy distribution with zero location parameter and scale $\gamma = \sigma_X/\sigma_Y$. The pdf of a Cauchy distribution with location parameter $x_0$ and scale $\gamma$ is given by

\[
f(x) = \frac{1}{\pi} \frac{\gamma}{(x-x_0)^2 + \gamma^2} . \tag{5}
\]

From eq. (4) and point i), noting that the first argument of $f_{X,Y}$ is $qx$, we can write

\[
\begin{aligned}
f_Q(q) &= \frac{1}{2\pi\sigma_X\sigma_Y} \int_{-\infty}^{\infty} |x| \exp\!\left(-\frac{q^2 x^2}{2\sigma_X^2} - \frac{x^2}{2\sigma_Y^2}\right) dx \\
&= \frac{1}{2\pi\sigma_X\sigma_Y} \int_{-\infty}^{\infty} |x| \exp\!\left(-x^2 \left(\frac{q^2}{2\sigma_X^2} + \frac{1}{2\sigma_Y^2}\right)\right) dx \\
&= \frac{1}{\pi\sigma_X\sigma_Y} \int_{0}^{\infty} x \exp\!\left(-x^2 \left(\frac{q^2}{2\sigma_X^2} + \frac{1}{2\sigma_Y^2}\right)\right) dx ,
\end{aligned}
\]

where in the last step we used the fact that the integrand is symmetric around zero. Using the fact that

\[
\int_{0}^{\infty} x \exp(-k x^2)\,dx = \frac{1}{2k},
\]

for $k = \frac{q^2\sigma_Y^2 + \sigma_X^2}{2\sigma_X^2 \sigma_Y^2}$,

\[
f_Q(q) = \frac{1}{\pi} \frac{\sigma_X/\sigma_Y}{q^2 + (\sigma_X/\sigma_Y)^2} .
\]
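An optional Monte Carlo check of this result, comparing the empirical cdf of $Q = X/Y$ against the Cauchy cdf with scale $\gamma = \sigma_X/\sigma_Y$ (parameter values are arbitrary):

    import numpy as np
    from scipy.stats import cauchy

    sigma_x, sigma_y = 2.0, 0.5
    rng = np.random.default_rng(0)
    q = rng.normal(0.0, sigma_x, 10**6) / rng.normal(0.0, sigma_y, 10**6)

    gamma = sigma_x / sigma_y
    for t in (-4.0, 0.0, 3.0):
        # empirical P(Q <= t) vs. the Cauchy cdf with location 0 and scale gamma
        print((q <= t).mean(), cauchy.cdf(t, loc=0.0, scale=gamma))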


Task 2: Bayesian Inference

You are given a set of points $d = \{d_i\}_{i=1}^{N}$ with $d_i \in \mathbb{R}$. You make the modelling assumption that the points come from $N$ realisations of $N$ independent random variables $X_i$, $i = 1, \ldots, N$, that follow a normal distribution with unknown parameter $\mu$ and known parameter $\sigma = 1$.

a) [10pts] Formulate the likelihood function of $\mu$,

\[
L(\mu) := p(d \mid \mu) , \tag{6}
\]

where $p$ is the conditional pdf of $d$ conditioned on $\mu$.

From the modelling assumption, the $d_i$ are independent,

\[
p(d \mid \mu) = \prod_{i=1}^{N} p(d_i \mid \mu) ,
\]

and $X_i \sim \mathcal{N}(\mu, 1)$, so

\[
p(d \mid \mu) = \prod_{i=1}^{N} \mathcal{N}(d_i \mid \mu, 1) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{1}{2}(d_i - \mu)^2\right) = (2\pi)^{-N/2} \exp\!\left(-\frac{1}{2} \sum_{i=1}^{N} (d_i - \mu)^2\right) . \tag{7}
\]

The log-likelihood is given by

\[
\log p(d \mid \mu) = -\frac{N}{2} \log(2\pi) - \frac{1}{2} \sum_{i=1}^{N} (d_i - \mu)^2 .
\]

b) [15pts] Find the maximum likelihood estimate (MLE) of $\mu$, i.e.,

\[
\hat{\mu} = \arg\max_{\mu} L(\mu) . \tag{8}
\]

You may find it useful that $\arg\max_{\mu} L(\mu) = \arg\max_{\mu} \log L(\mu)$.

It holds that

\[
\begin{aligned}
\hat{\mu} &= \arg\max_{\mu}\, p(d \mid \mu) \\
&= \arg\max_{\mu}\, \log p(d \mid \mu) \\
&= \arg\max_{\mu} \left( -\frac{N}{2} \log(2\pi) - \frac{1}{2} \sum_{i=1}^{N} (d_i - \mu)^2 \right) \\
&= \arg\max_{\mu} \left( -\frac{1}{2} \sum_{i=1}^{N} (d_i - \mu)^2 \right) .
\end{aligned}
\]

The optimum $\hat{\mu}$ satisfies

\[
0 = \frac{d}{d\mu} \left( -\frac{1}{2} \sum_{i=1}^{N} (d_i - \mu)^2 \right) \Bigg|_{\mu = \hat{\mu}} = \sum_{i=1}^{N} d_i - N\hat{\mu} \quad \Rightarrow \quad \hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} d_i .
\]

c) [30pts] Before observing any data $d$ you had the belief that $\mu$ follows a normal distribution with mean $\mu_0$ and variance $\sigma_0^2$. After observing the dataset $d$ you update your belief by using Bayes' theorem. Identify the posterior distribution $p(\mu \mid d)$ of $\mu$ conditioned on $d$. Calculate the mean and the variance of $p(\mu \mid d)$.

The posterior is given by Bayes' theorem,

\[
p(\mu \mid d) = \frac{p(d \mid \mu)\, p(\mu)}{p(d)} ,
\]

where $p(\mu) = \mathcal{N}(\mu \mid \mu_0, \sigma_0^2)$. Working with the numerator and forgetting for the moment all the proportionality constants, we have

\[
\begin{aligned}
p(d \mid \mu)\, p(\mu) &= \prod_{i=1}^{N} \mathcal{N}(d_i \mid \mu, 1)\; \mathcal{N}(\mu \mid \mu_0, \sigma_0^2) \\
&= (2\pi)^{-N/2} \exp\!\left(-\frac{1}{2} \sum_{i=1}^{N} (d_i - \mu)^2\right) (2\pi\sigma_0^2)^{-1/2} \exp\!\left(-\frac{1}{2\sigma_0^2}(\mu - \mu_0)^2\right) \\
&\propto \exp\!\left(-\frac{1}{2} \sum_{i=1}^{N} (d_i^2 - 2\mu d_i + \mu^2) - \frac{1}{2\sigma_0^2}(\mu^2 - 2\mu\mu_0 + \mu_0^2)\right) \\
&= \exp\!\left(-\frac{(1 + N\sigma_0^2)\mu^2 - 2\mu\left(\mu_0 + \sigma_0^2 \sum_i d_i\right) + \mu_0^2 + \sigma_0^2 \sum_i d_i^2}{2\sigma_0^2}\right) \\
&= \exp\!\left(-\frac{1}{2}\; \frac{\mu^2 - 2\mu\, \frac{\mu_0 + \sigma_0^2 \sum_i d_i}{1 + N\sigma_0^2} + \frac{\mu_0^2 + \sigma_0^2 \sum_i d_i^2}{1 + N\sigma_0^2}}{\frac{\sigma_0^2}{1 + N\sigma_0^2}}\right) . 
\end{aligned}
\tag{9}
\]

In order to lighten the notation, we set

\[
\bar{\mu} = \frac{\mu_0 + \sigma_0^2 \sum_i d_i}{1 + N\sigma_0^2}, \qquad \bar{\sigma}^2 = \frac{\sigma_0^2}{1 + N\sigma_0^2}, \qquad \bar{c} = \frac{\mu_0^2 + \sigma_0^2 \sum_i d_i^2}{1 + N\sigma_0^2}.
\]

Now we work only with the numerator in the exponential of eq. (9) and complete the square,

\[
\mu^2 - 2\mu\bar{\mu} + \bar{c} = \mu^2 - 2\mu\bar{\mu} + \bar{\mu}^2 - \bar{\mu}^2 + \bar{c} = (\mu - \bar{\mu})^2 - \bar{\mu}^2 + \bar{c} . \tag{10}
\]

Substituting eq. (10) in eq. (9),

\[
p(d \mid \mu)\, p(\mu) \propto \exp\!\left(-\frac{1}{2}\, \frac{(\mu - \bar{\mu})^2 - \bar{\mu}^2 + \bar{c}}{\bar{\sigma}^2}\right) \propto \exp\!\left(-\frac{1}{2}\, \frac{(\mu - \bar{\mu})^2}{\bar{\sigma}^2}\right), \tag{11}
\]

where the last proportionality is true due to the fact that $\bar{\mu}$ and $\bar{c}$ do not depend on $\mu$. At this point, we use equality in eq. (11) and find the proportionality constant $C$ such that the posterior integrates to 1 (and therefore is a distribution),

\[
p(\mu \mid d) = C \exp\!\left(-\frac{1}{2}\, \frac{(\mu - \bar{\mu})^2}{\bar{\sigma}^2}\right) = \frac{1}{\sqrt{2\pi\bar{\sigma}^2}} \exp\!\left(-\frac{1}{2}\, \frac{(\mu - \bar{\mu})^2}{\bar{\sigma}^2}\right) = \mathcal{N}(\mu \mid \bar{\mu}, \bar{\sigma}^2) . \tag{12}
\]

d) [5pts] Find the maximum a posteriori (MAP) estimate of $\mu$, i.e.,

\[
\hat{\mu} = \arg\max_{\mu}\, p(\mu \mid d) . \tag{13}
\]

The normal distribution attains its maximum at the mean. From eq. (12),

\[
\hat{\mu} = \bar{\mu} = \frac{\mu_0 + \sigma_0^2 \sum_i d_i}{1 + N\sigma_0^2} . \tag{14}
\]
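The posterior update and the resulting MAP can be illustrated with a few lines of numpy (all values below, including the synthetic data, are assumptions made purely for the demonstration):

    import numpy as np

    rng = np.random.default_rng(1)
    d = rng.normal(2.0, 1.0, size=50)   # synthetic data, sigma = 1 as in the task
    mu0, sigma0_sq = 0.0, 4.0           # assumed prior N(mu0, sigma0^2)
    N = d.size

    mu_bar = (mu0 + sigma0_sq * d.sum()) / (1 + N * sigma0_sq)
    sigma_bar_sq = sigma0_sq / (1 + N * sigma0_sq)

    print('MLE:', d.mean())
    print('MAP:', mu_bar)               # shrinks the MLE towards mu0
    print('posterior variance:', sigma_bar_sq)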

e) [10pts] Perform (c) and (d) using as prior an uninformative distribution, i.e. a uniform distribution on $\mathbb{R}$, and compare the MAP with the MLE. Although this is not a distribution, since it is not integrable over $\mathbb{R}$, we are allowed to use it in Bayes' theorem as long as the posterior is a distribution. Such priors are called improper priors and are a common choice in practical applications when there is no prior information on the parameters.

The prior is proportional to 1, $p(\mu) \propto 1$, and the likelihood is given by eq. (7). Thus, the posterior is proportional to

\[
p(\mu \mid d) \propto (2\pi)^{-N/2} \exp\!\left(-\frac{1}{2} \sum_{i=1}^{N} (d_i - \mu)^2\right) \propto \exp\!\left(-\frac{N}{2} \left(\mu - \frac{1}{N}\sum_{i=1}^{N} d_i\right)^2\right),
\]

where we have rearranged the terms in the exponential and completed the square, in the same way as in Task 2c. It is easy to find the proportionality constant and write

\[
p(\mu \mid d) = \frac{1}{\sqrt{2\pi N^{-1}}} \exp\!\left(-\frac{N}{2} \left(\mu - \frac{1}{N}\sum_{i=1}^{N} d_i\right)^2\right) = \mathcal{N}\!\left(\mu \;\Big|\; \frac{1}{N}\sum_{i=1}^{N} d_i,\; \frac{1}{N}\right).
\]

For this choice of prior, the MAP coincides with the MLE.


Task 3: Bayesian Inference: Linear Model

[20pts] You are given the linear regression model that describes the relation between variables $x$ and $y$,

\[
y = \beta x + \varepsilon,
\]

where $\beta$ is the regression parameter, $y$ is the output quantity of interest (QoI) of the system, $x$ is the input variable and $\varepsilon$ is the random variable accounting for model and measurement errors. The model error is quantified by a Gaussian distribution $\varepsilon \sim \mathcal{N}(0, \sigma^2)$.

You are given one measurement data point, $D = \{x_0, y_0\}$. Consider an uninformative prior for $\beta$ and identify the posterior distribution of $\beta$ after observing $D$. Calculate the MAP and the standard deviation of $\beta \mid D$.

The posterior distribution of $\beta$, given the dataset $D$, is given by Bayes' theorem,

\[
p(\beta \mid D) = \frac{p(y_0 \mid \beta)\, p(\beta)}{p(y = y_0)} .
\]

Since $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, $p(y_0 \mid \beta) = \mathcal{N}(y_0 \mid \beta x_0, \sigma^2)$ and $p(\beta) \propto 1$, it follows that

\[
p(\beta \mid D) \propto \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(y_0 - \beta x_0)^2\right),
\]

or equivalently, by reordering the terms in the exponential,

\[
p(\beta \mid D) \propto \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{x_0^2}{2\sigma^2}\left(\beta - \frac{y_0}{x_0}\right)^2\right).
\]

We can conclude that the posterior distribution is the normal distribution $\mathcal{N}\!\left(\frac{y_0}{x_0}, \frac{\sigma^2}{x_0^2}\right)$. Hence the MAP estimate is $\hat{\beta} = y_0/x_0$ and the standard deviation of $\beta \mid D$ is $\sigma/|x_0|$.
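For illustration, the result for a single data point in a few lines of python (the measurement and noise level below are assumed values):

    import numpy as np

    x0, y0, sigma = 2.0, 3.0, 0.5   # assumed measurement and noise level

    beta_map = y0 / x0              # MAP estimate of beta
    beta_std = sigma / abs(x0)      # posterior standard deviation of beta | D

    print(beta_map, beta_std)       # 1.5 0.25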


Task 4: Data Analysis

All programming related solutions can be found in the python script task4_example_solution.py.

This task consists of python implementation and work on paper. Python is one of the most widespread¹ programming languages across research and industry, and hence we would like you to get your hands on it. Your task will be to analyse the data set provided in the file task4.npy and to implement some simple statistical functions. Finally you will visualize your findings in order to support your analysis. The prepared python script task4.py contains a few lines of code to start with and some recommendations in comment sections on how to complete this task. You do not have to follow our structure and you are free to implement your own solution.

¹ https://stackoverflow.blog/2017/09/06/incredible-growth-python/

1. Read Data [0pts]

• Make yourself comfortable with python and know how to call methods from python packages. Use the following command to make sure that your python version is newer than 3.0:

    $ python --version

On Euler you have to load a newer version:

    $ module load python/3.6.0

• Study and run the python script task4.py.

(Info: no documentation needed)

2. Histogram [0pts]

• Visualize the data $d = \{d_i\}_{i=1}^{20193}$ with the following three lines provided in the script:

    counts, bins = hist(data, numbins, continuous=True)
    plt.bar(bins, counts, width=0.5, align='edge')
    plt.show()

• Play around with the parameter numbins. Understand what happens inside the function hist(xarr, nbins, continuous=True).

(Info: no documentation needed)
(Hint: as you advance with your implementation you might want to comment the lines above in the script in order to avoid interruptions during testing)

3. Normalized Histogram [5pts]

Adjust the histogram function (where indicated in the script) such that it returns an estimated pdf for the data $d$ (for continuous and discrete distributions), i.e. for the input vector xarr.


After that, visualize the normalized histogram and self-check whether your solution is reasonable.

(Hint: for a continuous pdf it holds that $\int_{-\infty}^{\infty} f_X(u)\,du = 1$; for a discrete pdf it holds that $\sum_{i=1}^{N} p(x_i) = 1$)

A minimal sketch of this normalization step is given below.
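The sketch assumes uniform bin width; the function name normalize_counts and its signature are illustrative, since the provided hist() in task4.py may organize this differently:

    import numpy as np

    def normalize_counts(counts, width, continuous=True):
        # Normalize raw histogram counts into a pdf/pmf estimate.
        counts = np.asarray(counts, dtype=float)
        if continuous:
            # continuous case: bar areas must sum to 1, i.e. sum_i p_i * width = 1
            return counts / (counts.sum() * width)
        # discrete case: the probabilities themselves must sum to 1
        return counts / counts.sum()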

4. Distribution Function [5pts]

By now you should be familiar with the data; make an educated guess what the underlying distribution function could be. Document your hypothesis and rationalize your choice (2-5 sentences).

(Hint 1: we generated the data with a function provided in the numpy.random package: https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.random.html)
(Hint 2: it's not the Gaussian distribution)

From the histogram we can easily see that the $d_i$ are non-negative and that the distribution is not symmetric. Introducing a large bin number (or printing some data on the console) makes clear that the data is discrete. From this we can infer that the underlying distribution might be the Poisson distribution. Note that the Poisson distribution is the limit of the Binomial distribution for $n \to \infty$ and $p \to 0$ (with $np \to \lambda$), and hence the Binomial distribution would also be a reasonable guess, but it is more difficult to treat.

\[
P(k) = \frac{\lambda^k}{k!} e^{-\lambda} \tag{15}
\]

5. Likelihood and Log-likelihood [5+15pts]

• Write down the likelihood $L(\theta)$ and log-likelihood $\ell(\theta) = \log L(\theta)$ of your density function.
• Implement both functions in python where indicated in the script. The likelihood, respectively log-likelihood, should take an input for each of the parameters $\theta$ of the pdf, as well as an input for the data $d$. A possible implementation is sketched below.

(Info: for one-liners you might want to use lambda functions: https://www.programiz.com/python-programming/anonymous-function)

\[
L(\lambda) = p(d \mid \lambda) = \prod_{i=1}^{N} \frac{\lambda^{d_i} e^{-\lambda}}{d_i!} = \frac{1}{\prod_{i=1}^{N} d_i!} \cdot \lambda^{\sum_{i=1}^{N} d_i} \cdot e^{-N\lambda}
\]

Taking the logarithm of the equation above yields

\[
\ell(\lambda) = -\sum_{i=1}^{N} \log(d_i!) + \log(\lambda) \sum_{i=1}^{N} d_i - N\lambda .
\]
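One possible implementation with lambda functions (a sketch; the argument names are illustrative and the exact interface expected by task4.py may differ). Here gammaln(d + 1) = log(d!) from scipy.special avoids overflowing factorials:

    import numpy as np
    from scipy.special import gammaln

    # Poisson likelihood: product of lambda^d_i * exp(-lambda) / d_i!
    # (prone to underflow for large data sets, cf. subtask 7)
    likelihood = lambda lam, d: np.prod(
        np.exp(np.asarray(d) * np.log(lam) - lam - gammaln(np.asarray(d) + 1)))

    # Poisson log-likelihood:
    # -sum_i log(d_i!) + log(lambda) * sum_i d_i - N * lambda
    log_likelihood = lambda lam, d: (-np.sum(gammaln(np.asarray(d) + 1))
                                     + np.log(lam) * np.sum(d)
                                     - np.size(d) * lam)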


6. MLE [5+10pts]

• Derive the MLE of the parameters $\hat{\theta} = \arg\max L(\theta)$ and document your result.

We derive the MLE by setting the derivative of the log-likelihood function equal to zero:

\[
\frac{d}{d\lambda} \ell(\lambda) = 0 \iff 0 = \frac{\sum_{i=1}^{N} d_i}{\lambda} - N \iff \hat{\lambda} = \frac{\sum_{i=1}^{N} d_i}{N}
\]

• Add a few lines to the python script to calculate the MLE of the parameters $\theta$ from the full data set $d$ (a sketch follows after the hint below).

(Hint: you might want to start from the definition of the log-likelihood function in order to derive the MLE)
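A sketch of the corresponding lines, assuming the data set is loaded from task4.npy as in the provided script:

    import numpy as np

    data = np.load('task4.npy')
    lam_hat = data.mean()   # MLE: lambda_hat = (1/N) * sum_i d_i
    print('Poisson MLE lambda:', lam_hat)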

7. Comparison with Gaussian Distribution [5+15+10pts]

• In python, compute the maximum likelihood and maximum log-likelihood of your pdf given the full data set $d$ (reuse the functions implemented in subtask 5; a sketch is given after the hints below).
• Implement a likelihood and log-likelihood function for the Gaussian distribution too (according to subtasks 5 & 6).
• Compare the results from bullet 1 with the maximum likelihood, respectively maximum log-likelihood, of the Gaussian distribution (use the Gaussian MLE $\theta = (\hat{\mu}, \hat{\sigma}^2)$). Answer the following questions and document: Which distribution function provides a better likelihood, respectively log-likelihood? Do you encounter any problems during the calculation of the likelihood, or why is it advisable to work in log space? (in total 2-5 sentences)

– The log-likelihood of the Poisson distribution is $\ell(\hat{\lambda}) = -39581$ and the log-likelihood of the Gaussian distribution is $\ell(\hat{\mu}, \hat{\sigma}^2) = -40288$. From these quantities we can see that the Poisson distribution provides a better fit to the data.
– While calculating the likelihood of the Poisson and the Gaussian distribution we run into floating point precision issues: we multiply many numbers that are close to zero, and the result, which is even smaller (in an absolute sense), can no longer be represented by a floating point number. Hence comparing numbers in log space is much more accurate.

(Hint 1: the log-likelihood of the Gaussian pdf for the full data set equals -40287.57)
(Hint 2: you can self-check your implementations by comparing the output of the likelihood function with the exponential of the output of the log-likelihood function)
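A hedged sketch of the comparison, reusing the Poisson log-likelihood from subtask 5 and scipy.stats.norm for the Gaussian (variable names are illustrative):

    import numpy as np
    from scipy.special import gammaln
    from scipy.stats import norm

    data = np.load('task4.npy').astype(float)
    N = data.size

    # Poisson log-likelihood at the MLE lambda_hat = mean(data)
    lam_hat = data.mean()
    ll_poisson = (-np.sum(gammaln(data + 1))
                  + np.log(lam_hat) * data.sum() - N * lam_hat)

    # Gaussian log-likelihood at the MLE (mu_hat, sigma_hat^2)
    mu_hat, sigma_hat = data.mean(), data.std()
    ll_gauss = np.sum(norm.logpdf(data, loc=mu_hat, scale=sigma_hat))

    print(ll_poisson, ll_gauss)   # the Poisson value should be the larger one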

8. Visualization [5+5+5pts]

• Plot your distribution function on top of the normalized histogram from subtask 3.
• Plot the distribution function of the Gaussian on top of the other two graphs.
• Which graph provides a qualitatively better fit to the histogram from subtask 3? Do your findings agree with the numbers calculated in subtask 7? State your answer in 2-5 sentences and include the final plot in your report.

Figure 1: Task 4.8: Visualization of the Gaussian distribution and Poisson distribution fitted to the histogram of the data set.

The Poisson distribution qualitatively fits the data better, as predicted by the log-likelihood. The Poisson distribution is steeper for values smaller than the mode and it has a fatter tail ($d_i \ge 8$).

(Hint: you can uncomment plt.show() from subtasks 2 & 3, add two more plot statements plt.plot(x,y) and then insert plt.show() at the end of the script)


Task 5: 1-D Laplace Approximation

The Laplace method approximates a continuous distribution function $p(x)$ with a Gaussian distribution. For the approximation we perform a Taylor expansion of the logarithm of $p(x)$ around the maximum $\hat{x}$. In this task we derive the Laplace approximation of the Cauchy distribution and then plot both distributions.

1. Maximum [5pts]

Let $x \in \mathbb{R}$ be the argument of a continuously differentiable probability distribution function $p(x)$ (i.e. $p(x) \in C^1$) with a unique global maximum. Which conditions do not hold for the global maximum $\hat{x}$ of the pdf $p(x)$?

(i) $\frac{\partial p}{\partial x}\big|_{\hat{x}} < 0$
(ii) $\frac{\partial p}{\partial x}\big|_{\hat{x}} = 0$
(iii) $\frac{\partial p}{\partial x}\big|_{\hat{x}} > 0$
(iv) $\frac{\partial^2 p}{\partial x^2}\big|_{\hat{x}} < 0$
(v) $\frac{\partial^2 p}{\partial x^2}\big|_{\hat{x}} = 0$
(vi) $\frac{\partial^2 p}{\partial x^2}\big|_{\hat{x}} > 0$

At the maximum $\hat{x}$ it holds that $\frac{\partial p}{\partial x}\big|_{\hat{x}} = 0$ and $\frac{\partial^2 p}{\partial x^2}\big|_{\hat{x}} < 0$, and hence (i), (iii), (v) and (vi) are the conditions that do not hold.

Note that we thankfully updated the restrictions on $p(x)$ due to your input: $p(x) \in C^1$ and the maximum of $p(x)$ is unique. Otherwise (ii) may not be defined (e.g. Laplace distribution) and (iv) may equal zero if $p(x)$ is flat around the maximum (e.g. uniform distribution).

2. Taylor expansion [5pts]

The logarithm of the probability density is given by $\ell(x) = \log p(x)$. The Taylor expansion of the logarithm of $p(x)$ around the maximum $\hat{x}$ is given by

\[
\ell(x) = \ell(\hat{x}) + \frac{1}{2} \frac{\partial^2 \ell}{\partial x^2}\Bigg|_{\hat{x}} (x - \hat{x})^2 + \mathcal{O}\big((x - \hat{x})^3\big) . \tag{16}
\]

The 1-D Laplace approximation of the probability distribution $p(x)$ is then

\[
p(x) \approx A \exp\!\left(\frac{1}{2} \frac{\partial^2 \ell}{\partial x^2}\Bigg|_{\hat{x}} (x - \hat{x})^2\right) . \tag{17}
\]

Which steps and assumptions did we make to go from eq. (16) to eq. (17)? Also find an expression for the constant $A$.

The first-order term in eq. (16) vanishes because $\frac{\partial \ell}{\partial x}\big|_{\hat{x}} = 0$ at the maximum. We omit the terms of order higher than 2 in eq. (16), exponentiate $\ell(x)$ and set $A = \exp(\ell(\hat{x})) = p(\hat{x})$.


3. Laplace approximation [10pts]

One can see that the 1-D Laplace approximation (eq. (17)) has the form of a Gaussian distribution with mean $\mu = \hat{x}$. First, derive an expression for the variance $\sigma^2$. Second, rewrite the Laplace approximation (eq. (17)) as a product of the pdf of the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ and constant terms left to define.

Comparing eq. (17) with the definition of the Gaussian distribution, we see that

\[
\sigma^2 = -\frac{1}{\ell''(\hat{x})} .
\]

Rewriting yields the unnormalized distribution

\[
p(x) \approx \sqrt{2\pi\sigma^2}\; p(\mu)\; \mathcal{N}(\mu, \sigma^2) .
\]

4. Cauchy distribution [5+5+5pts]

The Cauchy distribution is given by

\[
p(x \mid x_0, \gamma) = \frac{1}{\pi} \frac{\gamma}{(x - x_0)^2 + \gamma^2} . \tag{18}
\]

• Derive analytically the location of the maximum $\hat{x}$ of the Cauchy distribution and document it.

\[
\frac{\partial p(x \mid x_0, \gamma)}{\partial x} = -\frac{\gamma}{\pi}\, \frac{2(x - x_0)}{\big((x - x_0)^2 + \gamma^2\big)^2} \overset{!}{=} 0 \quad \Rightarrow \quad \hat{x} = x_0 . \tag{19}
\]

In the next bullet we see that $\frac{\partial^2 \ell}{\partial x^2}\big|_{\hat{x}} < 0$ (which is equivalent to $\frac{\partial^2 p}{\partial x^2}\big|_{\hat{x}} < 0$, since at a critical point $\ell'' = p''/p$ with $p > 0$), so $\hat{x} = x_0$ is indeed a maximum.

• Write down the Taylor expansion up to order 2 of the logarithm of the Cauchy distribution around the maximum. Start with the derivation of the second derivative of $\ell(x \mid x_0, \gamma)$.

Using $\frac{\partial \log f(x)}{\partial x} = \frac{f'(x)}{f(x)}$,

\[
\frac{\partial \ell(x \mid x_0, \gamma)}{\partial x} = -\frac{2(x - x_0)}{(x - x_0)^2 + \gamma^2},
\]

\[
\frac{\partial^2 \ell(x \mid x_0, \gamma)}{\partial x^2} = -2\, \frac{(x - x_0)^2 + \gamma^2 + 2(x - x_0)(x_0 - x)}{\big[(x - x_0)^2 + \gamma^2\big]^2}, \qquad \frac{\partial^2 \ell}{\partial x^2}\Bigg|_{\hat{x}} = -\frac{2}{\gamma^2} .
\]

The Taylor expansion is given by

\[
\ell(x) = \ell(\hat{x}) - \frac{1}{\gamma^2}(x - \hat{x})^2 + \mathcal{O}\big((x - \hat{x})^3\big),
\]

with $\hat{x} = x_0$.

• Write down the Laplace approximation of the Cauchy distribution. Also define the normalization constant if you choose to use one.

The Laplace approximation is given by

\[
p(x) \approx \sqrt{2\pi\sigma^2}\; \underbrace{p(x_0 \mid x_0, \gamma)}_{=\,\frac{1}{\pi\gamma}}\; \mathcal{N}(x_0, \sigma^2),
\]

with $\sigma^2 = \frac{\gamma^2}{2}$. Normalization leads to

\[
p(x) \approx \frac{1}{\sqrt{\pi\gamma^2}} \exp\!\left(-\frac{(x - x_0)^2}{\gamma^2}\right) = \mathcal{N}\!\left(x_0, \frac{\gamma^2}{2}\right).
\]

5. Visualization [20pts]

In this subtask you are asked to create a plot with two graphs. You can solve it with any programming language you like. You might want to reuse your routines from task 4.

• Plot the Cauchy distribution $p(x \mid x_0, \gamma)$ with parameters $x_0 = -2.0$ and $\gamma = 1.0$.
• Add a graph of the Laplace approximation of $p(x \mid x_0, \gamma)$.

Include the final plot with both graphs in your report (code not needed). Do not forget to add labels for the axes and a legend for the graphs. Make sure that the modes are inferable from the plot. A possible plotting sketch is given after Figure 2.

Figure 2: Task 5.5: Visualization of the Cauchy distribution and the unnormalized Laplace approximation.
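A minimal matplotlib sketch that produces a figure of this kind (axis range and styling are arbitrary choices):

    import numpy as np
    import matplotlib.pyplot as plt

    x0, gamma = -2.0, 1.0
    x = np.linspace(-8.0, 4.0, 500)

    # Cauchy pdf, eq. (18)
    cauchy = gamma / (np.pi * ((x - x0)**2 + gamma**2))
    # unnormalized Laplace approximation from subtask 4:
    # p(x0) * exp(-(x - x0)^2 / gamma^2)
    laplace_approx = np.exp(-(x - x0)**2 / gamma**2) / (np.pi * gamma)

    plt.plot(x, cauchy, label='Cauchy distribution')
    plt.plot(x, laplace_approx, '--', label='Laplace approximation (unnormalized)')
    plt.xlabel('x')
    plt.ylabel('p(x)')
    plt.legend()
    plt.show()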

Guidelines for report submissions:

• Submit two files via Moodle: a pdf of your report (including plots from tasks 4 & 5) plus the solution to the coding exercise task4.py.
