TRANSCRIPT
Bayesian Scientific Computing, Spring 2013 (N. Zabaras)
Beta and Gamma Distributions
Prof. Nicholas Zabaras
School of Engineering
University of Warwick
Coventry CV4 7AL
United Kingdom
Email: [email protected]
URL: http://www.zabaras.com/
August 7, 2014
Contents

Beta Distribution, Gamma Function, Normalization of the Beta Distribution, Beta as a Prior to Bernoulli, Posterior and Predictive Distributions
A Frequentist View of Bayesian Learning, Variance Decomposition
Gamma Distribution
Exponential Distribution
Chi-Squared Distribution
Inverse Gamma Distribution
The Pareto Distribution

• Following closely Chris Bishop's PRML book, Chapter 2
• Kevin Murphy's Machine Learning: A Probabilistic Perspective, Chapter 2
Beta Distribution

The Beta(a,b) distribution, with x \in [0,1] and a, b > 0, is defined as follows:

Beta(x|a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} x^{a-1} (1-x)^{b-1}

where \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} = \frac{1}{B(a,b)} is the normalizing factor.

The expected value, mode and variance of a Beta random variable x with (hyper-)parameters a and b are:

E[x] = \frac{a}{a+b},   mode[x] = \frac{a-1}{a+b-2},   var[x] = \frac{ab}{(a+b)^2 (a+b+1)}
Beta Distribution

If a = b = 1, we obtain a uniform distribution.

If a and b are both less than 1, we get a bimodal distribution with spikes at 0 and 1.

If a and b are both greater than 1, the distribution is unimodal.

[Figure: Beta densities on [0,1] for (a,b) = (0.1,0.1), (1.0,1.0), (2.0,3.0), (8.0,4.0).]

Run betaPlotDemo from PMTK
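The three regimes above can be checked numerically. A minimal sketch in Python (scipy assumed; the slides themselves use the MATLAB/PMTK demo betaPlotDemo):

```python
# Illustrative check of the three Beta shape regimes (not from the slides).
from scipy.stats import beta

xs = [0.1 * i for i in range(1, 10)]  # interior grid, avoiding the endpoints

# a = b = 1: uniform density, pdf == 1 everywhere on (0, 1)
uniform_vals = [beta.pdf(x, 1, 1) for x in xs]

# a, b < 1: density blows up toward both endpoints ("spikes" at 0 and 1)
edge_heavy = beta.pdf(0.01, 0.1, 0.1) > beta.pdf(0.5, 0.1, 0.1)

# a, b > 1: unimodal, e.g. Beta(2,3) peaks at its mode (a-1)/(a+b-2) = 1/3
mode_23 = (2 - 1) / (2 + 3 - 2)
unimodal = beta.pdf(mode_23, 2, 3) > beta.pdf(0.9, 2, 3)
```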
Gamma Function

The gamma function extends the factorial to real numbers:

\Gamma(x) = \int_0^\infty u^{x-1} e^{-u}\, du

With integration by parts: \Gamma(x+1) = x\,\Gamma(x)

For integer n: \Gamma(n) = (n-1)!

The gamma function appears in the normalizing factor of the Beta density:

Beta(x|a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} x^{a-1} (1-x)^{b-1}
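The recursion and the factorial identity can be verified directly with the standard library; an illustrative check, not part of the original slides:

```python
# Check Gamma(x+1) = x * Gamma(x) and Gamma(n) = (n-1)! numerically.
import math

recursion_ok = math.isclose(math.gamma(4.3), 3.3 * math.gamma(3.3))
factorial_ok = all(math.isclose(math.gamma(n), math.factorial(n - 1))
                   for n in range(1, 10))
```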
Beta Distribution: Normalization

Showing that the Beta(a,b) distribution is normalized correctly is a bit tricky. We need to prove that:

\int_0^1 \mu^{a-1} (1-\mu)^{b-1}\, d\mu = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}

Indeed we follow the steps: (a) change the variable y to t = y + x; (b) change the order of integration in the triangular region 0 \le x \le t; and (c) change x to \mu via x = t\mu:

\Gamma(a)\Gamma(b) = \int_0^\infty e^{-x} x^{a-1}\, dx \int_0^\infty e^{-y} y^{b-1}\, dy
= \int_0^\infty x^{a-1} \left[ \int_x^\infty e^{-t} (t-x)^{b-1}\, dt \right] dx
= \int_0^\infty e^{-t} \left[ \int_0^t x^{a-1} (t-x)^{b-1}\, dx \right] dt
= \int_0^\infty e^{-t} t^{a+b-1}\, dt \int_0^1 \mu^{a-1} (1-\mu)^{b-1}\, d\mu
= \Gamma(a+b) \int_0^1 \mu^{a-1} (1-\mu)^{b-1}\, d\mu
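The normalization result can also be confirmed numerically. A sketch assuming scipy, with illustrative parameter values:

```python
# Numerically verify int_0^1 mu^(a-1) (1-mu)^(b-1) dmu = Gamma(a)Gamma(b)/Gamma(a+b).
import math
from scipy.integrate import quad

a, b = 2.5, 4.0  # illustrative parameters
integral, _ = quad(lambda mu: mu**(a - 1) * (1 - mu)**(b - 1), 0.0, 1.0)
closed_form = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
```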
Beta Distribution

[Figure: Beta densities Beta(0.1,0.1), Beta(1,1), Beta(2,3) and Beta(8,4), each plotted on [0,1].]

See Matlab implementation
Posterior Distribution

Assuming a Bernoulli likelihood and a Beta prior, we derive the posterior as:

p(D|\mu) \propto \mu^m (1-\mu)^{N-m},   Beta(\mu|a,b) \propto \mu^{a-1} (1-\mu)^{b-1}

p(\mu|D,a,b) \propto \mu^{m+a-1} (1-\mu)^{N-m+b-1}

This is also a Beta distribution:

p(\mu|D,a,b) = Beta(\mu|a_N, b_N),   a_N = a + m,   b_N = b + N - m

where m and N - m are the numbers of observations with x = 1 and x = 0. a and b are the effective numbers of observations of x = 1 and x = 0, respectively, introduced by the prior (they don't have to be integers).
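The conjugate update amounts to adding counts to the hyperparameters. A minimal sketch with hypothetical counts (the numbers are illustrative, not from the slides):

```python
# Beta-Bernoulli conjugate update: posterior is Beta(a + m, b + N - m).
a, b = 2.0, 2.0        # prior hyperparameters (illustrative)
N, m = 10, 7           # hypothetical data: 7 heads out of 10 flips

a_N, b_N = a + m, b + (N - m)       # posterior hyperparameters
posterior_mean = a_N / (a_N + b_N)  # E[mu | D]
```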
Posterior Mean and Variance

From the properties of the Beta distribution, we compute:

E[\mu|D] = \frac{a_N}{a_N + b_N},   var[\mu|D] = \frac{a_N b_N}{(a_N + b_N)^2 (a_N + b_N + 1)}

with a_N = a + m and b_N = b + N - m.

The posterior mean always lies in between the prior mean a/(a+b) and the MLE estimate m/N.

This can be shown easily by noticing that:

E[\mu|D] = \frac{a + m}{a + b + N} = \lambda \frac{a}{a+b} + (1-\lambda) \frac{m}{N},   \lambda = \frac{a+b}{a+b+N},   0 < \lambda < 1
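The convex-combination identity can be checked numerically; a sketch with illustrative numbers:

```python
# Posterior mean as a convex combination of prior mean and MLE.
a, b, N, m = 2.0, 2.0, 10, 7   # illustrative prior and data

prior_mean = a / (a + b)
mle = m / N
lam = (a + b) / (a + b + N)    # weight on the prior mean

posterior_mean = (a + m) / (a + b + N)
combination = lam * prior_mean + (1 - lam) * mle

# the posterior mean lies between the prior mean and the MLE
between = min(prior_mean, mle) <= posterior_mean <= max(prior_mean, mle)
```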
Posterior Distribution

For example, after observing a single head (N = m = 1) with a Beta(2,2) prior, the posterior is computed as Beta(3,2):

[Figure: Prior Beta(2,2), likelihood function (N = m = 1), and posterior Beta(3,2), each plotted on [0,1].]
Predictive Distribution

We can now compute the probability that the next coin flip is heads:

p(x=1|D,a,b) = \int_0^1 p(x=1|\mu)\, p(\mu|D,a,b)\, d\mu = \int_0^1 \mu\, Beta(\mu|a_N, b_N)\, d\mu = \frac{a_N}{a_N + b_N} = \frac{m + a}{N + a + b}

i.e. the predictive probability is the posterior mean.
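A quick check (scipy assumed) that the closed-form predictive probability matches the integral of μ against the posterior; the counts are illustrative:

```python
# Predictive p(x=1|D) = posterior mean, computed two ways.
from scipy.stats import beta as beta_dist
from scipy.integrate import quad

a, b, N, m = 2.0, 2.0, 10, 7          # illustrative prior and data
a_N, b_N = a + m, b + (N - m)

closed_form = (m + a) / (N + a + b)
integral, _ = quad(lambda mu: mu * beta_dist.pdf(mu, a_N, b_N), 0.0, 1.0)
```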
Properties of the Posterior Distribution

Consider the case of infinite data (N \to \infty):

a_N = a + m \approx m,   b_N = b + N - m \approx N - m

and the posterior mean and variance become:

E[\mu|D] = \frac{a_N}{a_N + b_N} \to \frac{m}{N}

var[\mu|D] = \frac{a_N b_N}{(a_N + b_N)^2 (a_N + b_N + 1)} \approx \frac{m(N-m)}{N^3} \to 0

For N \to \infty, the distribution as expected spikes around the MLE estimate with zero variance (i.e. the uncertainty decreases as N \to \infty). Is this a general property?
A Frequentist View of Bayesian Learning

Consider inference of a parameter \theta using data D. We expect that, because the posterior p(\theta|D) incorporates the information from the data D, it will imply less variability for \theta than the prior p(\theta).

We have the following identities:

E[\theta] = E_D[ E[\theta|D] ]

var[\theta] = E_D[ var[\theta|D] ] + var_D[ E[\theta|D] ]
A Frequentist View of Bayesian Learning

The identity E[\theta] = E_D[ E[\theta|D] ] means that, on average over the realizations of the data D, the conditional expectation E[\theta|D] is equal to E[\theta].

Also, the posterior variance is on average smaller than the prior variance by an amount that depends on the variation var_D[ E[\theta|D] ] of the posterior means over the distribution of possible data:

E_D[ var[\theta|D] ] = var[\theta] - var_D[ E[\theta|D] ]
Posterior Mean

Note the not-surprising result regarding the posterior mean: the posterior mean, averaged over the data, equals the prior mean.

E_D[ E[\theta|D] ] = \int \left[ \int \theta\, p(\theta|D)\, d\theta \right] p(D)\, dD = \int\!\!\int \theta\, p(\theta, D)\, dD\, d\theta = \int \theta\, p(\theta)\, d\theta = E[\theta]
Variance Decomposition Identity

If (\theta, D) are two scalar random variables then we have:

var[\theta] = E_D[ var[\theta|D] ] + var_D[ E[\theta|D] ]

Here is the proof:

var[\theta] = E[\theta^2] - E[\theta]^2
= E_D[ E[\theta^2|D] ] - ( E_D[ E[\theta|D] ] )^2
= E_D[ var[\theta|D] + E[\theta|D]^2 ] - ( E_D[ E[\theta|D] ] )^2
= E_D[ var[\theta|D] ] + var_D[ E[\theta|D] ]
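The decomposition can be verified by Monte Carlo for the Beta-Bernoulli model, where both conditional moments are available in closed form. A sketch with illustrative sample sizes (standard library only):

```python
# Monte Carlo check of var[theta] = E_D[var[theta|D]] + var_D[E[theta|D]].
# theta ~ Beta(a,b) prior; D = N Bernoulli(theta) flips, so D ~ m(D).
import random
import statistics

random.seed(0)
a, b, N = 2.0, 3.0, 20
prior_var = a * b / ((a + b) ** 2 * (a + b + 1))

post_means, post_vars = [], []
for _ in range(20000):
    theta = random.betavariate(a, b)                    # draw theta from prior
    m = sum(random.random() < theta for _ in range(N))  # draw data given theta
    aN, bN = a + m, b + (N - m)                         # conjugate posterior
    post_means.append(aN / (aN + bN))
    post_vars.append(aN * bN / ((aN + bN) ** 2 * (aN + bN + 1)))

lhs = prior_var
rhs = statistics.fmean(post_vars) + statistics.pvariance(post_means)
```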
Posterior Variability

We can derive a similar expression regarding the posterior variance:

E_D[ var[\theta|D] ] = var[\theta] - var_D[ E[\theta|D] ]

Thus on average (over the data), the variability in \theta decreases: the posterior variance, averaged over all data, equals the prior variance minus the variance of the posterior means. For a particular observed data set D, it is however possible that

var[\theta|D] > var[\theta]

These results implicitly assume that the data follow the marginal distribution:

m(D) = \int p(D|\theta)\, p(\theta)\, d\theta
Gamma Distribution

The Gamma distribution is a two-parameter family of continuous distributions. It has a scale parameter \theta > 0 and a shape parameter k > 0. If k is an integer then the distribution represents the sum of k independent exponentially distributed random variables, each of which has a mean of \theta (which is equivalent to a rate parameter of \theta^{-1}):

X \sim Gamma(k, \theta):   p(x) = \frac{x^{k-1} \exp(-x/\theta)}{\Gamma(k)\, \theta^k},   x > 0

More often, we also use the rate parametrization:

Gamma(x|a,b) = \frac{b^a}{\Gamma(a)} x^{a-1} \exp(-b x),   x > 0
Gamma Distribution: Rate Parametrization

The Gamma distribution is frequently a model for waiting times. It is more often parameterized in terms of a shape parameter a = k and an inverse scale parameter b = 1/\theta, called a rate parameter:

p(x|a,b) = \frac{b^a}{\Gamma(a)} x^{a-1} e^{-bx},   x > 0,   \Gamma(a) = \int_0^\infty u^{a-1} e^{-u}\, du

The mean, mode and variance with this parametrization are:

E[x] = \frac{a}{b},   mode[x] = \frac{a-1}{b} for a \ge 1 (0 otherwise),   var[x] = \frac{a}{b^2}
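These moment formulas can be checked against scipy's gamma distribution (which is parameterized by shape and scale, so scale = 1/b); a sketch with illustrative parameters:

```python
# Verify E[x] = a/b, var[x] = a/b^2 and the mode (a-1)/b for the
# rate parametrization, using scipy's shape/scale convention.
from scipy.stats import gamma

a, b = 3.0, 2.0                 # illustrative shape and rate
dist = gamma(a, scale=1.0 / b)  # rate b <-> scale 1/b

mean_ok = abs(dist.mean() - a / b) < 1e-12
var_ok = abs(dist.var() - a / b**2) < 1e-12

mode = (a - 1) / b              # maximizes the pdf for a >= 1
mode_ok = dist.pdf(mode) >= max(dist.pdf(mode - 0.1), dist.pdf(mode + 0.1))
```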
Gamma Distribution

Plots of

Gamma(x|a,b) = \frac{b^a}{\Gamma(a)} x^{a-1} \exp(-bx),   b = 1

[Figure: Gamma densities for (a,b) = (1.0,1.0), (1.5,1.0), (2.0,1.0).]

As we increase the rate b, the distribution squeezes leftwards and upwards.

Run gammaPlotDemo from PMTK
Gamma Distribution

An empirical PDF of rainfall data fitted with a Gamma distribution, with parameters estimated by the method of moments (MoM) and by maximum likelihood (MLE).

[Figure: Histogram of the rainfall data with the two fitted Gamma densities (MoM and MLE).]

Run MatLab function gammaRainfallDemo from PMTK
Exponential Distribution

This is defined as

Expon(x|\lambda) = Gamma(x|1, \lambda) = \lambda \exp(-\lambda x),   x \ge 0

Here \lambda is the rate parameter.

This distribution describes the times between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate \lambda.
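A quick check (scipy assumed) that the exponential pdf coincides with a Gamma pdf with shape 1 and rate λ:

```python
# Expon(x|lam) == Gamma(x | 1, rate lam); compare pdfs at a few points.
from scipy.stats import expon, gamma

lam = 1.7                       # illustrative rate
xs = [0.1, 0.5, 1.0, 2.0]
expon_pdf = [expon.pdf(x, scale=1.0 / lam) for x in xs]
gamma_pdf = [gamma.pdf(x, 1.0, scale=1.0 / lam) for x in xs]
```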
Chi-Squared Distribution

This is defined as

\chi^2(x|\nu) = Gamma\left(x \,\middle|\, \frac{\nu}{2}, \frac{1}{2}\right) = \frac{1}{2^{\nu/2}\, \Gamma(\nu/2)}\, x^{\nu/2 - 1} \exp(-x/2),   x > 0

This is the distribution of the sum of squared Gaussian random variables. More precisely:

Let Z_i \sim N(0,1) and S = \sum_{i=1}^{\nu} Z_i^2. Then S \sim \chi^2_\nu.
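Both facts can be checked numerically (scipy assumed): the χ² pdf equals the Gamma(ν/2, rate 1/2) pdf (scale 2 in scipy's convention), and a Monte Carlo sum of squared standard normals has mean close to ν. A sketch:

```python
# chi2(nu) == Gamma(nu/2, rate 1/2), and sums of squared N(0,1) draws
# concentrate around nu (the chi2 mean).
import random
from scipy.stats import chi2, gamma

nu = 5
xs = [0.5, 1.0, 3.0, 7.0]
chi2_pdf = [chi2.pdf(x, nu) for x in xs]
gamma_pdf = [gamma.pdf(x, nu / 2.0, scale=2.0) for x in xs]  # rate 1/2 = scale 2

random.seed(0)
draws = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(nu))
         for _ in range(20000)]
mc_mean = sum(draws) / len(draws)   # should be close to nu
```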
Inverse Gamma Distribution

This is defined as follows: if X \sim Gamma(X|a,b) then 1/X \sim InvGamma(X|a,b), where:

InvGamma(x|a,b) = \frac{b^a}{\Gamma(a)}\, x^{-(a+1)} \exp(-b/x),   x > 0

a is the shape and b the scale parameter.

It can be shown that:

E[x] = \frac{b}{a-1} (exists for a > 1),   mode[x] = \frac{b}{a+1},   var[x] = \frac{b^2}{(a-1)^2 (a-2)} (exists for a > 2)
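The moment formulas can be checked against scipy's invgamma (its scale parameter plays the role of b); a sketch with illustrative parameters:

```python
# Verify E[x] = b/(a-1) and var[x] = b^2 / ((a-1)^2 (a-2)) for InvGamma(a,b).
from scipy.stats import invgamma

a, b = 4.0, 2.0                # illustrative shape and scale (a > 2 so both exist)
dist = invgamma(a, scale=b)

mean_ok = abs(dist.mean() - b / (a - 1)) < 1e-12
var_ok = abs(dist.var() - b**2 / ((a - 1) ** 2 * (a - 2))) < 1e-12
```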
The Pareto Distribution

Used to model the distribution of quantities that exhibit long (heavy) tails:

Pareto(x|k,m) = k\, m^k\, x^{-(k+1)}\, I(x \ge m)

This density asserts that x must be greater than some constant m, but not too much greater; k controls what is "too much".

As k \to \infty, the distribution approaches \delta(x - m).

On a log-log scale, the pdf forms a straight line of the form log p(x) = a log x + c for some constants a and c (a power law; cf. Zipf's law).
The Pareto Distribution

Applications: modeling the frequency of words vs. their rank, the distribution of wealth (k = Pareto index), etc.

Pareto(x|k,m) = k\, m^k\, x^{-(k+1)}\, I(x \ge m)

E[x] = \frac{km}{k-1} if k > 1,   mode[x] = m,   var[x] = \frac{m^2 k}{(k-1)^2 (k-2)} if k > 2

[Figure: Pareto densities for (m,k) = (0.01, 0.10), (0.00, 0.50), (1.00, 1.00).]

ParetoPlot from PMTK
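The mean formula and the log-log linearity can be checked with scipy's pareto (shape k, scale m); an illustrative sketch:

```python
# Verify E[x] = km/(k-1) and that log p(x) is linear in log x with slope -(k+1).
import math
from scipy.stats import pareto

k, m = 3.0, 1.5                 # illustrative shape and lower bound
dist = pareto(k, scale=m)

mean_ok = abs(dist.mean() - k * m / (k - 1)) < 1e-12

# slope of log pdf vs log x should be exactly -(k+1) (the power-law exponent)
x1, x2 = 2.0, 4.0
slope = (math.log(dist.pdf(x2)) - math.log(dist.pdf(x1))) / (math.log(x2) - math.log(x1))
slope_ok = abs(slope + (k + 1)) < 1e-9
```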