TRANSCRIPT
Bayesian Scientific Computing, Spring 2013 (N. Zabaras)
Beta and Gamma Distributions
Prof. Nicholas Zabaras
School of Engineering
University of Warwick
Coventry CV4 7AL
United Kingdom
Email: [email protected]
URL: http://www.zabaras.com/
August 7, 2014
Contents

Beta Distribution, Gamma Function, Normalization of the Beta Distribution, Beta as a Prior to Bernoulli, Posterior and Predictive Distributions
A Frequentist View of Bayesian Learning, Variance Decomposition
Gamma Distribution
Exponential Distribution
Chi-Squared Distribution
Inverse Gamma Distribution
The Pareto Distribution

• Following closely Chris Bishop's PRML book, Chapter 2
• Kevin Murphy's Machine Learning: A Probabilistic Perspective, Chapter 2
Beta Distribution

The Beta(a,b) distribution, with x \in [0,1] and a, b > 0, is defined as follows:

Beta(x|a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} x^{a-1} (1-x)^{b-1}

where \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} = \frac{1}{B(a,b)} is the normalizing factor.

The expected value, mode and variance of a Beta random variable x with (hyper-)parameters a and b are:

E[x] = \frac{a}{a+b},   mode[x] = \frac{a-1}{a+b-2},   var[x] = \frac{ab}{(a+b)^2 (a+b+1)}
Beta Distribution

If a = b = 1, we obtain a uniform distribution.

If a and b are both less than 1, we get a bimodal distribution with spikes at 0 and 1.

If a and b are both greater than 1, the distribution is unimodal.

[Figure: Beta densities on [0,1] for (a,b) = (0.1,0.1), (1.0,1.0), (2.0,3.0), (8.0,4.0).]

Run betaPlotDemo from PMTK
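The three regimes above can be checked numerically. A minimal sketch in Python (scipy assumed; the slides themselves use the MATLAB/PMTK demo betaPlotDemo):

```python
# Illustrative check of the three Beta shape regimes (not from the slides).
from scipy.stats import beta

xs = [0.1 * i for i in range(1, 10)]  # interior grid, avoiding the endpoints

# a = b = 1: uniform density, pdf == 1 everywhere on (0, 1)
uniform_vals = [beta.pdf(x, 1, 1) for x in xs]

# a, b < 1: density blows up toward both endpoints ("spikes" at 0 and 1)
edge_heavy = beta.pdf(0.01, 0.1, 0.1) > beta.pdf(0.5, 0.1, 0.1)

# a, b > 1: unimodal, e.g. Beta(2,3) peaks at its mode (a-1)/(a+b-2) = 1/3
mode_23 = (2 - 1) / (2 + 3 - 2)
unimodal = beta.pdf(mode_23, 2, 3) > beta.pdf(0.9, 2, 3)
```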
Gamma Function

The gamma function extends the factorial to real numbers:

\Gamma(x) = \int_0^\infty u^{x-1} e^{-u}\, du

With integration by parts: \Gamma(x+1) = x\,\Gamma(x)

For integer n: \Gamma(n) = (n-1)!

The gamma function appears in the normalizing factor of the Beta density:

Beta(x|a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} x^{a-1} (1-x)^{b-1}
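The recursion and the factorial identity can be verified directly with the standard library; an illustrative check, not part of the original slides:

```python
# Check Gamma(x+1) = x * Gamma(x) and Gamma(n) = (n-1)! numerically.
import math

recursion_ok = math.isclose(math.gamma(4.3), 3.3 * math.gamma(3.3))
factorial_ok = all(math.isclose(math.gamma(n), math.factorial(n - 1))
                   for n in range(1, 10))
```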
Beta Distribution: Normalization

Showing that the Beta(a,b) distribution is normalized correctly is a bit tricky. We need to prove that:

\int_0^1 \mu^{a-1} (1-\mu)^{b-1}\, d\mu = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}

Indeed we follow the steps: (a) change the variable y to t = y + x; (b) change the order of integration in the triangular region 0 \le x \le t; and (c) change x to \mu via x = t\mu:

\Gamma(a)\Gamma(b) = \int_0^\infty e^{-x} x^{a-1}\, dx \int_0^\infty e^{-y} y^{b-1}\, dy
= \int_0^\infty x^{a-1} \left[ \int_x^\infty e^{-t} (t-x)^{b-1}\, dt \right] dx
= \int_0^\infty e^{-t} \left[ \int_0^t x^{a-1} (t-x)^{b-1}\, dx \right] dt
= \int_0^\infty e^{-t} t^{a+b-1}\, dt \int_0^1 \mu^{a-1} (1-\mu)^{b-1}\, d\mu
= \Gamma(a+b) \int_0^1 \mu^{a-1} (1-\mu)^{b-1}\, d\mu
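The normalization result can also be confirmed numerically. A sketch assuming scipy, with illustrative parameter values:

```python
# Numerically verify int_0^1 mu^(a-1) (1-mu)^(b-1) dmu = Gamma(a)Gamma(b)/Gamma(a+b).
import math
from scipy.integrate import quad

a, b = 2.5, 4.0  # illustrative parameters
integral, _ = quad(lambda mu: mu**(a - 1) * (1 - mu)**(b - 1), 0.0, 1.0)
closed_form = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
```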
Beta Distribution

[Figure: Beta densities Beta(0.1,0.1), Beta(1,1), Beta(2,3) and Beta(8,4), each plotted on [0,1].]

See Matlab implementation
Posterior Distribution

Assuming a Bernoulli likelihood and a Beta prior, we derive the posterior as:

p(D|\mu) \propto \mu^m (1-\mu)^{N-m},   Beta(\mu|a,b) \propto \mu^{a-1} (1-\mu)^{b-1}

p(\mu|D,a,b) \propto \mu^{m+a-1} (1-\mu)^{N-m+b-1}

This is also a Beta distribution:

p(\mu|D,a,b) = Beta(\mu|a_N, b_N),   a_N = a + m,   b_N = b + N - m

where m and N - m are the numbers of observations with x = 1 and x = 0. a and b are the effective numbers of observations of x = 1 and x = 0, respectively, introduced by the prior (they don't have to be integers).
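The conjugate update amounts to adding counts to the hyperparameters. A minimal sketch with hypothetical counts (the numbers are illustrative, not from the slides):

```python
# Beta-Bernoulli conjugate update: posterior is Beta(a + m, b + N - m).
a, b = 2.0, 2.0        # prior hyperparameters (illustrative)
N, m = 10, 7           # hypothetical data: 7 heads out of 10 flips

a_N, b_N = a + m, b + (N - m)       # posterior hyperparameters
posterior_mean = a_N / (a_N + b_N)  # E[mu | D]
```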
Posterior Mean and Variance

From the properties of the Beta distribution, we compute:

E[\mu|D] = \frac{a_N}{a_N + b_N},   var[\mu|D] = \frac{a_N b_N}{(a_N + b_N)^2 (a_N + b_N + 1)}

with a_N = a + m and b_N = b + N - m.

The posterior mean always lies in between the prior mean a/(a+b) and the MLE estimate m/N.

This can be shown easily by noticing that:

E[\mu|D] = \frac{a + m}{a + b + N} = \lambda \frac{a}{a+b} + (1-\lambda) \frac{m}{N},   \lambda = \frac{a+b}{a+b+N},   0 < \lambda < 1
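The convex-combination identity can be checked numerically; a sketch with illustrative numbers:

```python
# Posterior mean as a convex combination of prior mean and MLE.
a, b, N, m = 2.0, 2.0, 10, 7   # illustrative prior and data

prior_mean = a / (a + b)
mle = m / N
lam = (a + b) / (a + b + N)    # weight on the prior mean

posterior_mean = (a + m) / (a + b + N)
combination = lam * prior_mean + (1 - lam) * mle

# the posterior mean lies between the prior mean and the MLE
between = min(prior_mean, mle) <= posterior_mean <= max(prior_mean, mle)
```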
Posterior Distribution

For example, after observing a single head (N = m = 1) with a Beta(2,2) prior, the posterior is computed as Beta(3,2):

[Figure: Prior Beta(2,2), likelihood function (N = m = 1), and posterior Beta(3,2), each plotted on [0,1].]
Predictive Distribution

We can now compute the probability that the next coin flip is heads:

p(x=1|D,a,b) = \int_0^1 p(x=1|\mu)\, p(\mu|D,a,b)\, d\mu = \int_0^1 \mu\, Beta(\mu|a_N, b_N)\, d\mu = \frac{a_N}{a_N + b_N} = \frac{m + a}{N + a + b}

i.e. the predictive probability is the posterior mean.
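A quick check (scipy assumed) that the closed-form predictive probability matches the integral of μ against the posterior; the counts are illustrative:

```python
# Predictive p(x=1|D) = posterior mean, computed two ways.
from scipy.stats import beta as beta_dist
from scipy.integrate import quad

a, b, N, m = 2.0, 2.0, 10, 7          # illustrative prior and data
a_N, b_N = a + m, b + (N - m)

closed_form = (m + a) / (N + a + b)
integral, _ = quad(lambda mu: mu * beta_dist.pdf(mu, a_N, b_N), 0.0, 1.0)
```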
Properties of the Posterior Distribution

Consider the case of infinite data (N \to \infty):

a_N = a + m \approx m,   b_N = b + N - m \approx N - m

and the posterior mean and variance become:

E[\mu|D] = \frac{a_N}{a_N + b_N} \to \frac{m}{N}

var[\mu|D] = \frac{a_N b_N}{(a_N + b_N)^2 (a_N + b_N + 1)} \approx \frac{m(N-m)}{N^3} \to 0

For N \to \infty, the distribution as expected spikes around the MLE estimate with zero variance (i.e. the uncertainty decreases as N \to \infty). Is this a general property?
A Frequentist View of Bayesian Learning

Consider inference of a parameter \theta using data D. We expect that, because the posterior p(\theta|D) incorporates the information from the data D, it will imply less variability for \theta than the prior p(\theta).

We have the following identities:

E[\theta] = E_D[ E[\theta|D] ]

var[\theta] = E_D[ var[\theta|D] ] + var_D[ E[\theta|D] ]
A Frequentist View of Bayesian Learning

The identity E[\theta] = E_D[ E[\theta|D] ] means that, on average over the realizations of the data D, the conditional expectation E[\theta|D] is equal to E[\theta].

Also, the posterior variance is on average smaller than the prior variance by an amount that depends on the variation var_D[ E[\theta|D] ] of the posterior means over the distribution of possible data:

E_D[ var[\theta|D] ] = var[\theta] - var_D[ E[\theta|D] ]
Posterior Mean

Note the not-surprising result regarding the posterior mean: the posterior mean, averaged over the data, equals the prior mean.

E_D[ E[\theta|D] ] = \int \left[ \int \theta\, p(\theta|D)\, d\theta \right] p(D)\, dD = \int\!\!\int \theta\, p(\theta, D)\, dD\, d\theta = \int \theta\, p(\theta)\, d\theta = E[\theta]
Variance Decomposition Identity

If (\theta, D) are two scalar random variables then we have:

var[\theta] = E_D[ var[\theta|D] ] + var_D[ E[\theta|D] ]

Here is the proof:

var[\theta] = E[\theta^2] - E[\theta]^2
= E_D[ E[\theta^2|D] ] - ( E_D[ E[\theta|D] ] )^2
= E_D[ var[\theta|D] + E[\theta|D]^2 ] - ( E_D[ E[\theta|D] ] )^2
= E_D[ var[\theta|D] ] + var_D[ E[\theta|D] ]
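The decomposition can be verified by Monte Carlo for the Beta-Bernoulli model, where both conditional moments are available in closed form. A sketch with illustrative sample sizes (standard library only):

```python
# Monte Carlo check of var[theta] = E_D[var[theta|D]] + var_D[E[theta|D]].
# theta ~ Beta(a,b) prior; D = N Bernoulli(theta) flips, so D ~ m(D).
import random
import statistics

random.seed(0)
a, b, N = 2.0, 3.0, 20
prior_var = a * b / ((a + b) ** 2 * (a + b + 1))

post_means, post_vars = [], []
for _ in range(20000):
    theta = random.betavariate(a, b)                    # draw theta from prior
    m = sum(random.random() < theta for _ in range(N))  # draw data given theta
    aN, bN = a + m, b + (N - m)                         # conjugate posterior
    post_means.append(aN / (aN + bN))
    post_vars.append(aN * bN / ((aN + bN) ** 2 * (aN + bN + 1)))

lhs = prior_var
rhs = statistics.fmean(post_vars) + statistics.pvariance(post_means)
```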
Posterior Variability

We can derive a similar expression regarding the posterior variance:

E_D[ var[\theta|D] ] = var[\theta] - var_D[ E[\theta|D] ]

Thus on average (over the data), the variability in \theta decreases: the posterior variance, averaged over all data, equals the prior variance minus the variance of the posterior means. For a particular observed data set D, it is however possible that

var[\theta|D] > var[\theta]

These results implicitly assume that the data follow the marginal distribution:

m(D) = \int p(D|\theta)\, p(\theta)\, d\theta
Gamma Distribution

The Gamma distribution is a two-parameter family of continuous distributions. It has a scale parameter \theta > 0 and a shape parameter k > 0. If k is an integer then the distribution represents the sum of k independent exponentially distributed random variables, each of which has a mean of \theta (which is equivalent to a rate parameter of \theta^{-1}):

X \sim Gamma(k, \theta):   p(x) = \frac{x^{k-1} \exp(-x/\theta)}{\Gamma(k)\, \theta^k},   x > 0

More often, we also use the rate parametrization:

Gamma(x|a,b) = \frac{b^a}{\Gamma(a)} x^{a-1} \exp(-b x),   x > 0
Gamma Distribution: Rate Parametrization

The Gamma distribution is frequently a model for waiting times. It is more often parameterized in terms of a shape parameter a = k and an inverse scale parameter b = 1/\theta, called a rate parameter:

p(x|a,b) = \frac{b^a}{\Gamma(a)} x^{a-1} e^{-bx},   x > 0,   \Gamma(a) = \int_0^\infty u^{a-1} e^{-u}\, du

The mean, mode and variance with this parametrization are:

E[x] = \frac{a}{b},   mode[x] = \frac{a-1}{b} for a \ge 1 (0 otherwise),   var[x] = \frac{a}{b^2}
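These moment formulas can be checked against scipy's gamma distribution (which is parameterized by shape and scale, so scale = 1/b); a sketch with illustrative parameters:

```python
# Verify E[x] = a/b, var[x] = a/b^2 and the mode (a-1)/b for the
# rate parametrization, using scipy's shape/scale convention.
from scipy.stats import gamma

a, b = 3.0, 2.0                 # illustrative shape and rate
dist = gamma(a, scale=1.0 / b)  # rate b <-> scale 1/b

mean_ok = abs(dist.mean() - a / b) < 1e-12
var_ok = abs(dist.var() - a / b**2) < 1e-12

mode = (a - 1) / b              # maximizes the pdf for a >= 1
mode_ok = dist.pdf(mode) >= max(dist.pdf(mode - 0.1), dist.pdf(mode + 0.1))
```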
Gamma Distribution

Plots of

Gamma(x|a,b) = \frac{b^a}{\Gamma(a)} x^{a-1} \exp(-bx),   b = 1

[Figure: Gamma densities for (a,b) = (1.0,1.0), (1.5,1.0), (2.0,1.0).]

As we increase the rate b, the distribution squeezes leftwards and upwards.

Run gammaPlotDemo from PMTK
Gamma Distribution

An empirical PDF of rainfall data fitted with a Gamma distribution, with parameters estimated by the method of moments (MoM) and by maximum likelihood (MLE).

[Figure: Histogram of the rainfall data with the two fitted Gamma densities (MoM and MLE).]

Run MatLab function gammaRainfallDemo from PMTK
Exponential Distribution

This is defined as

Expon(x|\lambda) = Gamma(x|1, \lambda) = \lambda \exp(-\lambda x),   x \ge 0

Here \lambda is the rate parameter.

This distribution describes the times between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate \lambda.
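A quick check (scipy assumed) that the exponential pdf coincides with a Gamma pdf with shape 1 and rate λ:

```python
# Expon(x|lam) == Gamma(x | 1, rate lam); compare pdfs at a few points.
from scipy.stats import expon, gamma

lam = 1.7                       # illustrative rate
xs = [0.1, 0.5, 1.0, 2.0]
expon_pdf = [expon.pdf(x, scale=1.0 / lam) for x in xs]
gamma_pdf = [gamma.pdf(x, 1.0, scale=1.0 / lam) for x in xs]
```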
Chi-Squared Distribution

This is defined as

\chi^2(x|\nu) = Gamma\left(x \,\middle|\, \frac{\nu}{2}, \frac{1}{2}\right) = \frac{1}{2^{\nu/2}\, \Gamma(\nu/2)}\, x^{\nu/2 - 1} \exp(-x/2),   x > 0

This is the distribution of the sum of squared Gaussian random variables. More precisely:

Let Z_i \sim N(0,1) and S = \sum_{i=1}^{\nu} Z_i^2. Then S \sim \chi^2_\nu.
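Both facts can be checked numerically (scipy assumed): the χ² pdf equals the Gamma(ν/2, rate 1/2) pdf (scale 2 in scipy's convention), and a Monte Carlo sum of squared standard normals has mean close to ν. A sketch:

```python
# chi2(nu) == Gamma(nu/2, rate 1/2), and sums of squared N(0,1) draws
# concentrate around nu (the chi2 mean).
import random
from scipy.stats import chi2, gamma

nu = 5
xs = [0.5, 1.0, 3.0, 7.0]
chi2_pdf = [chi2.pdf(x, nu) for x in xs]
gamma_pdf = [gamma.pdf(x, nu / 2.0, scale=2.0) for x in xs]  # rate 1/2 = scale 2

random.seed(0)
draws = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(nu))
         for _ in range(20000)]
mc_mean = sum(draws) / len(draws)   # should be close to nu
```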
Inverse Gamma Distribution

This is defined as follows: if X \sim Gamma(X|a,b) then 1/X \sim InvGamma(X|a,b), where:

InvGamma(x|a,b) = \frac{b^a}{\Gamma(a)}\, x^{-(a+1)} \exp(-b/x),   x > 0

a is the shape and b the scale parameter.

It can be shown that:

E[x] = \frac{b}{a-1} (exists for a > 1),   mode[x] = \frac{b}{a+1},   var[x] = \frac{b^2}{(a-1)^2 (a-2)} (exists for a > 2)
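The moment formulas can be checked against scipy's invgamma (its scale parameter plays the role of b); a sketch with illustrative parameters:

```python
# Verify E[x] = b/(a-1) and var[x] = b^2 / ((a-1)^2 (a-2)) for InvGamma(a,b).
from scipy.stats import invgamma

a, b = 4.0, 2.0                # illustrative shape and scale (a > 2 so both exist)
dist = invgamma(a, scale=b)

mean_ok = abs(dist.mean() - b / (a - 1)) < 1e-12
var_ok = abs(dist.var() - b**2 / ((a - 1) ** 2 * (a - 2))) < 1e-12
```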
The Pareto Distribution

Used to model the distribution of quantities that exhibit long (heavy) tails:

Pareto(x|k,m) = k\, m^k\, x^{-(k+1)}\, I(x \ge m)

This density asserts that x must be greater than some constant m, but not too much greater; k controls what is "too much".

As k \to \infty, the distribution approaches \delta(x - m).

On a log-log scale, the pdf forms a straight line of the form log p(x) = a log x + c for some constants a and c (a power law; cf. Zipf's law).
The Pareto Distribution

Applications: modeling the frequency of words vs. their rank, the distribution of wealth (k = Pareto index), etc.

Pareto(x|k,m) = k\, m^k\, x^{-(k+1)}\, I(x \ge m)

E[x] = \frac{km}{k-1} if k > 1,   mode[x] = m,   var[x] = \frac{m^2 k}{(k-1)^2 (k-2)} if k > 2

[Figure: Pareto densities for (m,k) = (0.01, 0.10), (0.00, 0.50), (1.00, 1.00).]

ParetoPlot from PMTK
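The mean formula and the log-log linearity can be checked with scipy's pareto (shape k, scale m); an illustrative sketch:

```python
# Verify E[x] = km/(k-1) and that log p(x) is linear in log x with slope -(k+1).
import math
from scipy.stats import pareto

k, m = 3.0, 1.5                 # illustrative shape and lower bound
dist = pareto(k, scale=m)

mean_ok = abs(dist.mean() - k * m / (k - 1)) < 1e-12

# slope of log pdf vs log x should be exactly -(k+1) (the power-law exponent)
x1, x2 = 2.0, 4.0
slope = (math.log(dist.pdf(x2)) - math.log(dist.pdf(x1))) / (math.log(x2) - math.log(x1))
slope_ok = abs(slope + (k + 1)) < 1e-9
```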