TRANSCRIPT
Statistical Computing, University of Notre Dame, Notre Dame, IN, USA (Fall 2018, N. Zabaras)
Introduction to Probability and Statistics
(Continued)

Prof. Nicholas Zabaras
Center for Informatics and Computational Science
https://cics.nd.edu/
University of Notre Dame
Notre Dame, Indiana, USA
Email: [email protected]
URL: https://www.zabaras.com/
August 27, 2018
Contents

The binomial and Bernoulli distributions
The multinomial and multinoulli distributions
The Poisson distribution
Student's T distribution
The Laplace distribution
The Gamma distribution
The Beta distribution
The Pareto distribution
Introduction to covariance and correlation
References

• Following closely Chris Bishop's PRML book, Chapter 2
• Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 2
• Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.
• Bertsekas, D. and J. Tsitsiklis (2008). Introduction to Probability, 2nd Edition. Athena Scientific.
• Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer.
Binary Variables

Consider a coin flipping experiment with heads = 1 and tails = 0. With $\mu \in [0,1]$,

$$p(x=1 \mid \mu) = \mu, \qquad p(x=0 \mid \mu) = 1 - \mu$$

This defines the Bernoulli distribution as follows:

$$\mathrm{Bern}(x \mid \mu) = \mu^{x}(1-\mu)^{1-x}$$

Using the indicator function, we can also write this as:

$$\mathrm{Bern}(x \mid \mu) = \mu^{\mathbb{I}(x=1)}(1-\mu)^{\mathbb{I}(x=0)}$$
Bernoulli Distribution

Recall that in general

$$\mathbb{E}[f] = \sum_x p(x) f(x), \qquad \mathbb{E}[f] = \int p(x) f(x)\,dx, \qquad \mathrm{var}[f] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2$$

For the Bernoulli distribution $\mathrm{Bern}(x\mid\mu) = \mu^{x}(1-\mu)^{1-x}$, $x \in \{0,1\}$, we can easily show from the definitions:

$$\mathbb{E}[x] = \mu$$
$$\mathrm{var}[x] = \mu(1-\mu)$$
$$\mathbb{H}[x] = -\sum_{x} p(x\mid\mu)\ln p(x\mid\mu) = -\mu\ln\mu - (1-\mu)\ln(1-\mu)$$

Here H[𝑥] is the "entropy of the distribution".
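The moments and entropy above can be checked with a short Python sketch, computed directly from the definitions (the lecture's demos use MatLab/PMTK; this standalone snippet is only illustrative):

```python
import math

def bern_pmf(x, mu):
    """Bernoulli pmf: mu^x * (1 - mu)^(1 - x) for x in {0, 1}."""
    return mu**x * (1 - mu)**(1 - x)

def bern_moments(mu):
    """Mean, variance and entropy (nats) via E[f] = sum_x p(x) f(x)."""
    mean = sum(x * bern_pmf(x, mu) for x in (0, 1))
    var = sum((x - mean)**2 * bern_pmf(x, mu) for x in (0, 1))
    entropy = -sum(bern_pmf(x, mu) * math.log(bern_pmf(x, mu)) for x in (0, 1))
    return mean, var, entropy

# Closed forms: E[x] = mu, var[x] = mu(1-mu), H = -mu ln mu - (1-mu) ln(1-mu)
mean, var, entropy = bern_moments(0.3)
```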
Likelihood Function for Bernoulli Distribution

Consider the data set $\mathcal{D} = \{x_1, x_2, \ldots, x_N\}$, in which we have 𝑚 heads (𝑥 = 1) and 𝑁 − 𝑚 tails (𝑥 = 0).

The likelihood function takes the form:

$$p(\mathcal{D}\mid\mu) = \prod_{n=1}^{N} p(x_n\mid\mu) = \prod_{n=1}^{N} \mu^{x_n}(1-\mu)^{1-x_n} = \mu^{m}(1-\mu)^{N-m}$$
Binomial Distribution

Consider the discrete random variable $X \in \{0,1,2,\ldots,N\}$.

We define the Binomial distribution as follows:

$$\mathrm{Bin}(X=m \mid N,\mu) = \binom{N}{m}\mu^{m}(1-\mu)^{N-m}$$

In our coin flipping experiment, it gives the probability of getting 𝑚 heads in 𝑁 flips, with 𝜇 being the probability of getting heads in one flip.

It can be shown (see S. Ross, Introduction to Probability Models) that the limit of the binomial distribution as $N\to\infty$, $N\mu\to\lambda$, is the Poisson($\lambda$) distribution.

[Figure: the binomial distribution Bin(N = 10, μ = 0.25); Matlab code available.]
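As a sanity check, the pmf above can be evaluated in a few lines of Python (an illustrative sketch, not the slides' MatLab code); it also verifies the mean 𝑁𝜇 from the next slides:

```python
from math import comb

def binom_pmf(m, N, mu):
    """Bin(m | N, mu) = C(N, m) mu^m (1 - mu)^(N - m)."""
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

# The pmf sums to 1; for N = 10, mu = 0.25 the most likely count is m = 2
# (near N*mu = 2.5), matching the plotted histogram.
pmf = [binom_pmf(m, 10, 0.25) for m in range(11)]
```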
Binomial Distribution

The Binomial distribution $\mathrm{Bin}(N,\mu)$ for 𝑁 = 10 and 𝜇 ∈ {0.25, 0.9} is shown below using the MatLab function binomDistPlot from Kevin Murphy's PMTK.

[Figure: two histograms of Bin(10, μ), one for μ = 0.25 and one for μ = 0.9.]
Mean, Variance of the Binomial Distribution

For independent events, the mean of the sum is the sum of the means, and the variance of the sum is the sum of the variances.

Because 𝑚 = 𝑥₁ + … + 𝑥_N, and for each observation the mean and variance are known from the Bernoulli distribution:

$$\mathbb{E}[m] = \sum_{m=0}^{N} m\,\mathrm{Bin}(m\mid N,\mu) = \mathbb{E}[x_1+\ldots+x_N] = N\mu$$
$$\mathrm{var}[m] = \mathbb{E}\big[(m-\mathbb{E}[m])^2\big] = \sum_{m=0}^{N} (m-\mathbb{E}[m])^2\,\mathrm{Bin}(m\mid N,\mu) = \mathrm{var}[x_1+\ldots+x_N] = N\mu(1-\mu)$$

One can also compute $\mathbb{E}[m]$, $\mathbb{E}[m^2]$ by differentiating (twice) the identity

$$\sum_{m=0}^{N}\binom{N}{m}\mu^{m}(1-\mu)^{N-m} = 1$$

with respect to 𝜇. Try it!
Binomial Distribution: Normalization

To show that the Binomial is correctly normalized, we use the following identities.

Pascal's rule, which can be shown with direct substitution:

$$\binom{N}{m} + \binom{N}{m-1} = \binom{N+1}{m} \qquad (*)$$

Binomial theorem:

$$(1+x)^N = \sum_{m=0}^{N}\binom{N}{m}x^m \qquad (**)$$

This theorem is proved by induction using (*) and noticing:

$$(1+x)^{N+1} = (1+x)\sum_{m=0}^{N}\binom{N}{m}x^m = \sum_{m=0}^{N}\binom{N}{m}x^m + \sum_{m=1}^{N+1}\binom{N}{m-1}x^m = \sum_{m=0}^{N+1}\binom{N+1}{m}x^m$$

To finally show normalization using (**):

$$\sum_{m=0}^{N}\binom{N}{m}\mu^{m}(1-\mu)^{N-m} = (1-\mu)^N\sum_{m=0}^{N}\binom{N}{m}\left(\frac{\mu}{1-\mu}\right)^m = (1-\mu)^N\left(1+\frac{\mu}{1-\mu}\right)^N = 1$$
Generalization of the Bernoulli Distribution

We are now looking at discrete variables that can take on one of 𝐾 possible mutually exclusive states.

The variable is represented by a 𝐾-dimensional vector 𝒙 in which one of the elements 𝑥ₖ equals 1 and all remaining elements equal 0: 𝒙 = (0, 0, …, 1, 0, …, 0)ᵀ.

These vectors satisfy:

$$\sum_{k=1}^{K} x_k = 1$$

Let the probability of 𝑥ₖ = 1 be denoted as 𝜇ₖ. Then

$$p(\boldsymbol{x}\mid\boldsymbol{\mu}) = \prod_{k=1}^{K}\mu_k^{x_k}, \qquad \sum_{k=1}^{K}\mu_k = 1, \quad \mu_k \ge 0,$$

where 𝝁 = (𝜇₁, …, 𝜇_K)ᵀ.
Multinoulli/Categorical Distribution

The distribution is already normalized:

$$\sum_{\boldsymbol{x}} p(\boldsymbol{x}\mid\boldsymbol{\mu}) = \sum_{k=1}^{K}\mu_k = 1$$

The mean of the distribution is computed as:

$$\mathbb{E}[\boldsymbol{x}\mid\boldsymbol{\mu}] = \sum_{\boldsymbol{x}} p(\boldsymbol{x}\mid\boldsymbol{\mu})\,\boldsymbol{x} = (\mu_1,\ldots,\mu_K)^T = \boldsymbol{\mu},$$

similar to the result for the Bernoulli distribution.

The Multinoulli, also known as the Categorical distribution, is often denoted as follows (Mu here is the multinomial distribution):

$$\mathrm{Cat}(\boldsymbol{x}\mid\boldsymbol{\mu}) = \mathrm{Multinoulli}(\boldsymbol{x}\mid\boldsymbol{\mu}) = \mathrm{Mu}(\boldsymbol{x}\mid 1,\boldsymbol{\mu})$$

The parameter 1 emphasizes that we roll a 𝐾-sided die once (𝑁 = 1) – see next the multinomial distribution.
Likelihood: Multinoulli Distribution

Let us consider a data set $\mathcal{D} = \{\boldsymbol{x}_1,\ldots,\boldsymbol{x}_N\}$. The likelihood becomes:

$$p(\mathcal{D}\mid\boldsymbol{\mu}) = \prod_{n=1}^{N}\prod_{k=1}^{K}\mu_k^{x_{nk}} = \prod_{k=1}^{K}\mu_k^{\sum_n x_{nk}} = \prod_{k=1}^{K}\mu_k^{m_k}, \qquad \text{where } m_k := \sum_{n=1}^{N} x_{nk}$$

$m_k$ is the number of observations with 𝑥ₖ = 1.

$m_k$ is the "sufficient statistic" of the distribution.
MLE Estimate: Multinoulli Distribution

To compute the maximum likelihood (MLE) estimate of 𝝁, we maximize a log-likelihood augmented with a Lagrange multiplier 𝜆 enforcing the constraint:

$$\ln p(\mathcal{D}\mid\boldsymbol{\mu}) + \lambda\left(\sum_{k=1}^{K}\mu_k - 1\right) = \sum_{k=1}^{K} m_k\ln\mu_k + \lambda\left(\sum_{k=1}^{K}\mu_k - 1\right)$$

Setting the derivative wrt 𝜇ₖ equal to zero:

$$\mu_k = -\frac{m_k}{\lambda}$$

Substitution into the constraint $\sum_{k=1}^{K}\mu_k = 1$ gives $\lambda = -\sum_{k=1}^{K} m_k = -N$, so

$$\mu_k^{ML} = \frac{m_k}{N}$$

As expected, this is the fraction of the 𝑁 observations with 𝑥ₖ = 1.
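The closed-form MLE 𝜇ₖ = 𝑚ₖ/𝑁 can be sketched in Python from 1-of-𝐾 encoded data (an illustrative snippet with made-up observations, not part of the slides):

```python
def multinoulli_mle(data, K):
    """MLE of mu from 1-of-K encoded observations: mu_k = m_k / N,
    where m_k counts how often component k equals 1."""
    N = len(data)
    counts = [sum(x[k] for x in data) for k in range(K)]
    return [m / N for m in counts]

# Five observations of a K = 3 variable in 1-of-K encoding:
data = [(1, 0, 0), (0, 1, 0), (1, 0, 0), (0, 0, 1), (1, 0, 0)]
mu_hat = multinoulli_mle(data, 3)   # fractions of each state: 3/5, 1/5, 1/5
```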
Multinomial Distribution

We can also consider the joint distribution of 𝑚₁, …, 𝑚_K in 𝑁 observations, conditioned on the parameters 𝝁 = (𝜇₁, …, 𝜇_K).

From the expression for the likelihood given earlier, $p(\mathcal{D}\mid\boldsymbol{\mu}) = \prod_{k=1}^{K}\mu_k^{m_k}$, the multinomial distribution $\mathrm{Mu}(m_1,\ldots,m_K\mid N,\boldsymbol{\mu})$ with parameters 𝑁 and 𝝁 takes the form:

$$p(m_1, m_2, \ldots, m_K\mid N, \mu_1, \mu_2, \ldots, \mu_K) = \frac{N!}{m_1!\,m_2!\cdots m_K!}\,\mu_1^{m_1}\mu_2^{m_2}\cdots\mu_K^{m_K}, \qquad \text{where } \sum_{k=1}^{K} m_k = N$$
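The multinomial pmf above can be sketched directly (an illustrative Python snippet; the die probabilities are made up):

```python
from math import factorial

def multinomial_pmf(m, N, mu):
    """Mu(m_1..m_K | N, mu) = N!/(m_1!...m_K!) * prod_k mu_k^m_k."""
    assert sum(m) == N
    coef = factorial(N)
    p = 1.0
    for mk, muk in zip(m, mu):
        coef //= factorial(mk)
        p *= muk**mk
    return coef * p

# Probability of counts (3, 1, 1) in N = 5 rolls of a biased 3-sided die:
p = multinomial_pmf((3, 1, 1), 5, (0.5, 0.3, 0.2))
```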
Example: Biosequence Analysis

Consider a set of DNA sequences where there are 10 rows (sequences) and 15 columns (locations along the genome):

c g a t a c g g g g t c g a a
c a a t c c g a g a t c g c a
c a a t c c g t g t t g g g a
c a a t c g g c a t g c g g g
c g a g c c g c g t a c g a a
c a t a c g g a g c a c g a a
t a a t c c g g g c a t g t a
c g a g c c g a g t a c a g a
c c a t c c g c g t a a g c a
g g a t a c g a g a t g a c a

(Rows: sequences 𝑁 = 1:10; columns: location along the genome.)

Several locations are conserved by evolution (e.g., because they are part of a gene coding region), since the corresponding columns tend to be pure; e.g., column 7 is all 𝑔's.

To visualize the data (sequence logo), we plot the letters 𝐴, 𝐶, 𝐺 and 𝑇 with a font size proportional to their empirical probability, and with the most probable letter on top.
Example: Biosequence Analysis

The empirical probability distribution at location 𝑡 is obtained by normalizing the vector of counts (see the MLE estimate):

$$\boldsymbol{\theta}_t = \frac{1}{N}\left(\sum_{i=1}^{N}\mathbb{I}(X_{it}=1),\ \sum_{i=1}^{N}\mathbb{I}(X_{it}=2),\ \sum_{i=1}^{N}\mathbb{I}(X_{it}=3),\ \sum_{i=1}^{N}\mathbb{I}(X_{it}=4)\right)$$

This distribution is known as a motif.

We can also compute the most probable letter in each location; this is the consensus sequence.

[Figure: sequence logo (bits vs. sequence position 1–15). Use the MatLab function seqlogoDemo from Kevin Murphy's PMTK.]
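The motif and consensus sequence for the ten sequences on the previous slide can be computed with a short Python sketch (the slides use seqlogoDemo in MatLab; this is only an illustration):

```python
from collections import Counter

# The ten 15-letter sequences from the previous slide:
seqs = ["cgatacggggtcgaa", "caatccgagatcgca", "caatccgtgttggga",
        "caatcggcatgcggg", "cgagccgcgtacgaa", "catacggagcacgaa",
        "taatccgggcatgta", "cgagccgagtacaga", "ccatccgcgtaagca",
        "ggatacgagatgaca"]

N, T = len(seqs), len(seqs[0])

# Motif: the empirical (MLE) distribution over {a, c, g, t} at each location.
motif = []
for t in range(T):
    counts = Counter(s[t] for s in seqs)
    motif.append({c: counts[c] / N for c in "acgt"})

# Consensus sequence: the most probable letter at each location.
consensus = "".join(max(col, key=col.get) for col in motif)
```

Column 7 is fully conserved, so the motif there puts probability 1 on 𝑔.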
Summary of Discrete Distributions

A summary of the multinomial and related discrete distributions is given below in a table from Kevin Murphy's textbook:

𝑛 = 1 (one roll of the die), 𝑛 = − (𝑁 rolls of the die)
𝐾 = 1 (binary variables), 𝐾 = − (1-of-𝐾 encoding)
The Poisson Distribution

We say that $X \in \{0,1,2,3,\ldots\}$ has a Poisson distribution with parameter 𝜆 > 0, $X \sim \mathrm{Poi}(\lambda)$, if its pmf is

$$\mathrm{Poi}(X=x\mid\lambda) = e^{-\lambda}\,\frac{\lambda^x}{x!}$$

This is a model for counts of rare events.

[Figure: pmfs of Poi(λ = 1) and Poi(λ = 10). Use the MatLab function poissonPlotDemo from Kevin Murphy's PMTK.]
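The Poisson pmf, and its role as the 𝑁 → ∞, 𝑁𝜇 → 𝜆 limit of the binomial noted earlier, can be checked numerically with this illustrative Python sketch:

```python
from math import exp, factorial, comb

def poisson_pmf(x, lam):
    """Poi(x | lambda) = exp(-lambda) lambda^x / x!"""
    return exp(-lam) * lam**x / factorial(x)

def binom_pmf(m, N, mu):
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

# Poisson limit of the binomial: large N with N*mu = lambda held fixed.
lam = 3.0
approx = [binom_pmf(m, 10_000, lam / 10_000) for m in range(10)]
exact = [poisson_pmf(m, lam) for m in range(10)]
```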
The Empirical Distribution

Given data, 𝒟 = {𝑥₁, …, 𝑥_N}, we define the empirical distribution as:

$$p_{emp}(A) = \frac{1}{N}\sum_{i=1}^{N}\delta_{x_i}(A), \qquad \delta_{x}(A) := \begin{cases}1 & \text{if } x\in A\\ 0 & \text{if } x\notin A\end{cases} \quad (\text{Dirac measure})$$

We can also generalize this by associating weights with each sample:

$$p_{emp}(x) = \sum_{i=1}^{N} w_i\,\delta_{x_i}(x), \qquad 0 \le w_i \le 1, \qquad \sum_{i=1}^{N} w_i = 1$$

This corresponds to a histogram with spikes at each sample point, with height equal to the corresponding weight. This distribution assigns zero weight to any point not in the dataset.

Note that the "sample mean of 𝑓(𝑥)" is the expectation of 𝑓(𝑥) under the empirical distribution:

$$\mathbb{E}[f(x)] = \int f(x)\,\frac{1}{N}\sum_{i=1}^{N}\delta_{x_i}(x)\,dx = \frac{1}{N}\sum_{i=1}^{N} f(x_i)$$
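A minimal Python sketch of expectations under the (possibly weighted) empirical distribution, with made-up sample values:

```python
def emp_expectation(f, xs, ws=None):
    """E[f] under the empirical distribution: sum_i w_i f(x_i);
    uniform weights w_i = 1/N recover the sample mean of f."""
    if ws is None:
        ws = [1.0 / len(xs)] * len(xs)
    assert abs(sum(ws) - 1.0) < 1e-12   # weights must sum to 1
    return sum(w * f(x) for w, x in zip(ws, xs))

xs = [1.0, 2.0, 4.0]
mean_f = emp_expectation(lambda x: x**2, xs)                       # (1 + 4 + 16)/3
wmean_f = emp_expectation(lambda x: x**2, xs, [0.5, 0.25, 0.25])   # 0.5 + 1 + 4
```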
Student's T Distribution

$$\mathcal{T}(x\mid\mu,\lambda,\nu) = \frac{\Gamma\!\left(\frac{\nu}{2}+\frac{1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1+\frac{\lambda(x-\mu)^2}{\nu}\right]^{-\nu/2-1/2}$$

The parameter 𝜆 is called the precision of the 𝒯-distribution, even though it is not in general equal to the inverse of the variance (see below on behavior as 𝜐 → ∞).

The parameter 𝜐 is called the degrees of freedom.

For the particular case of 𝜐 = 1, the 𝒯-distribution reduces to the Cauchy distribution.

In the limit 𝜐 → ∞, the 𝒯-distribution 𝒯(𝑥|𝜇, 𝜆, 𝜐) becomes a Gaussian 𝒩(𝑥|𝜇, 𝜆⁻¹) with mean 𝜇 and precision 𝜆.
For 𝝊 → ∞, 𝓣(𝒙|𝝁, 𝝀, 𝝊) Becomes a Gaussian

We first write the distribution as follows:

$$\mathcal{T}(x\mid\mu,\lambda,\nu) \propto \left[1+\frac{\lambda(x-\mu)^2}{\nu}\right]^{-\nu/2-1/2} = \exp\left[\left(-\frac{\nu}{2}-\frac{1}{2}\right)\ln\left(1+\frac{\lambda(x-\mu)^2}{\nu}\right)\right]$$

For large 𝜐, we can approximate the log as $\ln(1+\varepsilon) = \varepsilon + O(\varepsilon^2)$, giving:

$$\mathcal{T}(x\mid\mu,\lambda,\nu) \propto \exp\left[\left(-\frac{\nu}{2}-\frac{1}{2}\right)\left(\frac{\lambda(x-\mu)^2}{\nu}+O(\nu^{-2})\right)\right] \;\longrightarrow\; \exp\left[-\frac{\lambda(x-\mu)^2}{2}\right]$$

In the limit 𝜐 → ∞, the 𝒯-distribution 𝒯(𝑥|𝜇, 𝜆, 𝜐) is indeed a Gaussian 𝒩(𝑥|𝜇, 𝜆⁻¹) with mean 𝜇 and precision 𝜆. The normalization of the 𝒯 is valid in this limit as well (so the Gaussian obtained is normalized).
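This limit can be checked numerically: the sketch below implements both densities (using log-gamma to keep the Gamma-function ratio stable for large 𝜐) and compares them on a grid. It is an illustrative Python snippet, not the lecture's MatLab code:

```python
from math import lgamma, exp, pi, sqrt

def student_pdf(x, mu, lam, nu):
    """Student's T density T(x | mu, lambda, nu); lgamma avoids
    overflow of the Gamma-function ratio when nu is very large."""
    coef = exp(lgamma(nu / 2 + 0.5) - lgamma(nu / 2)) * sqrt(lam / (pi * nu))
    return coef * (1 + lam * (x - mu)**2 / nu) ** (-nu / 2 - 0.5)

def gauss_pdf(x, mu, lam):
    """Gaussian N(x | mu, 1/lambda) parameterized by precision lambda."""
    return sqrt(lam / (2 * pi)) * exp(-0.5 * lam * (x - mu)**2)

# For nu = 1 (mu = 0, lam = 1) the T is the Cauchy density 1/(pi (1 + x^2));
# for very large nu it approaches the Gaussian with the same mu and lambda.
diff = max(abs(student_pdf(0.1 * k, 0.0, 2.0, 1e6) - gauss_pdf(0.1 * k, 0.0, 2.0))
           for k in range(-40, 41))
```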
Student's T Distribution

$$\mathcal{T}(x\mid\mu,\lambda,\nu) = \frac{\Gamma\!\left(\frac{\nu}{2}+\frac{1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1+\frac{\lambda(x-\mu)^2}{\nu}\right]^{-\nu/2-1/2}$$

Mean: 𝜇, for 𝜐 > 1
Mode: 𝜇
Var: 𝜐 / (𝜆(𝜐 − 2)), for 𝜐 > 2

[Figure: Student's T pdfs for 𝜇 = 0, 𝜆 = 1 and 𝜐 = 0.1, 1.0, 10; for 𝜐 → ∞ we obtain 𝒩(𝜇, 𝜆⁻¹). Matlab code available.]
Student's T Vs the Gaussian

We plot 𝒩(𝑥|0, 1), 𝒯(𝑥|0, 1, 1), and Lap(𝑥|0, 1/√2):

The mean and variance of the Student's T are undefined for 𝜐 = 1.

Logs of the PDFs: the Student's T is NOT log-concave.

When 𝜐 = 1, the distribution is known as the Cauchy or Lorentz distribution. Due to its heavy tails, the sample mean does not converge.

It is recommended to use 𝜐 = 4.

[Figure: pdfs and log-pdfs of the Gaussian, Student and Laplace distributions. Run the MatLab function studentLaplacePdfPlot from Kevin Murphy's PMTK.]
Student's T Distribution

The Student's T distribution can be seen from the equation below (see the following two slides for proof) as an infinite mixture of Gaussians, each with a different precision 𝜏 (governed by a Gamma distribution):

$$p(x\mid\mu,a,b) = \int_0^\infty \mathcal{N}(x\mid\mu,\tau^{-1})\,\mathrm{Gamma}(\tau\mid a,b)\,d\tau = \int_0^\infty \mathcal{N}(x\mid\mu,\tau^{-1})\,\frac{b^a\,e^{-b\tau}\,\tau^{a-1}}{\Gamma(a)}\,d\tau$$

The result is a distribution that in general has longer 'tails' than a Gaussian.

This gives the 𝒯-distribution robustness, i.e. the 𝒯-distribution is much less sensitive than the Gaussian to the presence of outliers.
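The mixture representation can be verified by quadrature: integrating the Gaussian–Gamma product over 𝜏 should reproduce the 𝒯 pdf with 𝜐 = 2𝑎 and 𝜆 = 𝑎/𝑏. An illustrative Python sketch (midpoint rule, made-up parameter values):

```python
from math import lgamma, exp, log, pi, sqrt

def student_pdf(x, mu, lam, nu):
    coef = exp(lgamma(nu / 2 + 0.5) - lgamma(nu / 2)) * sqrt(lam / (pi * nu))
    return coef * (1 + lam * (x - mu)**2 / nu) ** (-nu / 2 - 0.5)

def mixture_pdf(x, mu, a, b, n=50_000, tau_max=50.0):
    """Integrate N(x | mu, 1/tau) Gamma(tau | a, b) over tau by the
    midpoint rule; with nu = 2a, lambda = a/b this matches the T pdf."""
    h = tau_max / n
    total = 0.0
    for i in range(n):
        tau = (i + 0.5) * h
        log_gamma_pdf = a * log(b) - lgamma(a) + (a - 1) * log(tau) - b * tau
        log_gauss_pdf = 0.5 * log(tau / (2 * pi)) - 0.5 * tau * (x - mu)**2
        total += exp(log_gamma_pdf + log_gauss_pdf) * h
    return total

a, b = 3.0, 2.0   # so nu = 6 and lambda = 1.5
err = abs(mixture_pdf(1.0, 0.0, a, b) - student_pdf(1.0, 0.0, a / b, 2 * a))
```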
Appendix: Student's T as a Mixture of Gaussians

If we have a univariate Gaussian 𝒩(𝑥|𝜇, 𝜏⁻¹) together with a prior Gamma(𝜏|𝑎, 𝑏) and we integrate out the precision, we obtain the marginal distribution of 𝑥:

$$p(x\mid\mu,a,b) = \int_0^\infty \mathcal{N}(x\mid\mu,\tau^{-1})\,\mathrm{Gamma}(\tau\mid a,b)\,d\tau = \int_0^\infty \frac{b^a\,e^{-b\tau}\,\tau^{a-1}}{\Gamma(a)}\left(\frac{\tau}{2\pi}\right)^{1/2} e^{-\tau(x-\mu)^2/2}\,d\tau$$

Introduce the transformation $z = \tau A$, with $A = b + \frac{(x-\mu)^2}{2}$, to simplify:

$$p(x\mid\mu,a,b) = \frac{b^a}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\int_0^\infty \tau^{a-1/2}\,e^{-\tau A}\,d\tau = \frac{b^a}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2} A^{-a-1/2}\int_0^\infty z^{a-1/2}\,e^{-z}\,dz$$
Appendix: Student's T as a Mixture of Gaussians

Recalling the definition of the Gamma function:

$$\Gamma(a) = \int_0^\infty z^{a-1}e^{-z}\,dz,$$

we obtain

$$p(x\mid\mu,a,b) = \frac{b^a}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\left[b+\frac{(x-\mu)^2}{2}\right]^{-a-1/2}\Gamma\!\left(a+\tfrac{1}{2}\right)$$

It is common to redefine the parameters in this distribution as:

$$\nu = 2a, \qquad \lambda = \frac{a}{b},$$

which gives

$$p(x\mid\mu,\lambda,\nu) = \frac{\Gamma\!\left(\frac{\nu}{2}+\frac{1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1+\frac{\lambda(x-\mu)^2}{\nu}\right]^{-\nu/2-1/2}$$
Robustness of Student's T Distribution

The robustness of the 𝒯-distribution is illustrated here by comparing the maximum likelihood solutions for a Gaussian and a 𝒯-distribution (30 data points from the Gaussian are used).

The effect of a small number of outliers (figure on the right) is less significant for the 𝒯-distribution than for the Gaussian.

[Figure: left – on clean data the T-distribution and Gaussian fits nearly overlap; right – with outliers the Gaussian fit is distorted while the T-distribution fit changes little. Matlab code available.]
Robustness of Student's T Distribution

The earlier simulation is repeated here with the PMTK toolbox.

[Figure: Gaussian, Student T and Laplace fits, without and with outliers. Run the MatLab function robustDemo from Kevin Murphy's PMTK.]
The Laplace Distribution

Another distribution with heavy tails is the Laplace distribution, also known as the double-sided exponential distribution. It has the following pdf:

$$\mathrm{Lap}(x\mid\mu,b) = \frac{1}{2b}\exp\left(-\frac{|x-\mu|}{b}\right)$$

𝜇 is a location parameter and 𝑏 > 0 is a scale parameter.

$$\text{Mean} = \mu, \qquad \text{Mode} = \mu, \qquad \text{Var} = 2b^2$$

It is robust to outliers (see the earlier demonstration).

It puts more probability density at 0 than the Gaussian. This property is a useful way to encourage sparsity in a model.
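The normalization and the variance 2𝑏² can be checked by simple midpoint quadrature (an illustrative Python sketch with an arbitrary scale 𝑏 = 1.5):

```python
from math import exp

def laplace_pdf(x, mu, b):
    """Lap(x | mu, b) = exp(-|x - mu| / b) / (2b)."""
    return exp(-abs(x - mu) / b) / (2 * b)

# Midpoint quadrature on a wide grid: total mass ~ 1, second central
# moment ~ 2 b^2 (the tails beyond +/-40 are negligible for b = 1.5).
mu, b = 0.0, 1.5
n, lo, hi = 100_000, -40.0, 40.0
h = (hi - lo) / n
xs = [lo + (i + 0.5) * h for i in range(n)]
total = h * sum(laplace_pdf(x, mu, b) for x in xs)
var = h * sum((x - mu)**2 * laplace_pdf(x, mu, b) for x in xs)
```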
Beta Distribution

The Beta(𝛼, 𝛽) distribution, with 𝑥 ∈ [0,1] and 𝛼, 𝛽 > 0, is defined as follows:

$$\mathrm{Beta}(x\mid\alpha,\beta) = \underbrace{\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}}_{\text{normalizing factor}}\,x^{\alpha-1}(1-x)^{\beta-1} = \frac{1}{B(\alpha,\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}, \qquad B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$

The expected value, mode and variance of a Beta random variable 𝑥 with (hyper-)parameters 𝛼 and 𝛽 are:

$$\mathbb{E}[x] = \frac{\alpha}{\alpha+\beta}, \qquad \mathrm{mode}[x] = \frac{\alpha-1}{\alpha+\beta-2}, \qquad \mathrm{var}[x] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$

For more information visit this link.

[Figure: beta distributions for (α, β) = (0.1, 0.1), (1, 1), (2, 3), (8, 4).]
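The pdf, mean and mode formulas can be checked numerically; a Python sketch using midpoint quadrature for Beta(8, 4) (illustrative only):

```python
from math import gamma

def beta_pdf(x, a, b):
    """Beta(x | a, b) with normalizing constant Gamma(a+b)/(Gamma(a) Gamma(b))."""
    B = gamma(a) * gamma(b) / gamma(a + b)
    return x**(a - 1) * (1 - x)**(b - 1) / B

# Midpoint quadrature on [0, 1]: check normalization and mean a/(a+b).
a, b, n = 8.0, 4.0, 100_000
h = 1.0 / n
xs = [(i + 0.5) * h for i in range(n)]
total = h * sum(beta_pdf(x, a, b) for x in xs)
mean = h * sum(x * beta_pdf(x, a, b) for x in xs)
mode = (a - 1) / (a + b - 2)   # closed-form mode
```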
Beta Distribution

If 𝛼 = 𝛽 = 1, we obtain a uniform distribution.

If 𝛼 and 𝛽 are both less than 1, we get a bimodal distribution with spikes at 0 and 1.

If 𝛼 and 𝛽 are both greater than 1, the distribution is unimodal.

[Figure: beta pdfs for (α, β) = (0.1, 0.1), (1, 1), (2, 3), (8, 4). Run betaPlotDemo from PMTK.]
Beta Distribution

[Figure: four panels showing the Beta(0.1, 0.1), Beta(1, 1), Beta(2, 3) and Beta(8, 4) pdfs on 𝑥 ∈ [0, 1]. See the Matlab implementation.]
Gamma Function

The gamma function extends the factorial to real numbers:

$$\Gamma(x) = \int_0^\infty u^{x-1}e^{-u}\,du$$

With integration by parts:

$$\Gamma(x+1) = x\,\Gamma(x)$$

For integer 𝑛:

$$\Gamma(n) = (n-1)!$$

For more information visit this link.
Beta Distribution: Normalization

Showing that the Beta(𝛼, 𝛽) distribution is normalized correctly is a bit tricky. We need to prove that:

$$\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\int_0^1 \mu^{\alpha-1}(1-\mu)^{\beta-1}\,d\mu = 1$$

Follow the steps: (a) change the variable 𝑦 below via 𝑦 = 𝑡 − 𝑥; (b) change the order of integration over the triangular region 0 ≤ 𝑥 ≤ 𝑡 < ∞; and (c) change 𝑥 to 𝜇 via 𝑥 = 𝑡𝜇:

$$\Gamma(\alpha)\Gamma(\beta) = \int_0^\infty x^{\alpha-1}e^{-x}\,dx\int_0^\infty y^{\beta-1}e^{-y}\,dy = \int_0^\infty\int_x^\infty x^{\alpha-1}(t-x)^{\beta-1}e^{-t}\,dt\,dx$$
$$= \int_0^\infty e^{-t}\left[\int_0^t x^{\alpha-1}(t-x)^{\beta-1}\,dx\right]dt = \int_0^\infty e^{-t}\,t^{\alpha+\beta-1}\,dt\,\int_0^1 \mu^{\alpha-1}(1-\mu)^{\beta-1}\,d\mu$$
$$= \Gamma(\alpha+\beta)\int_0^1 \mu^{\alpha-1}(1-\mu)^{\beta-1}\,d\mu$$
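The resulting identity $B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$ can be spot-checked numerically; a Python sketch with arbitrary parameters 𝛼 = 2.5, 𝛽 = 3.5:

```python
from math import gamma

# Check Gamma(a) Gamma(b) / Gamma(a + b) = integral_0^1 mu^(a-1) (1-mu)^(b-1) d mu
# by midpoint quadrature (integrand vanishes at both endpoints for a, b > 1).
a, b = 2.5, 3.5
n = 200_000
h = 1.0 / n
integral = h * sum(((i + 0.5) * h)**(a - 1) * (1 - (i + 0.5) * h)**(b - 1)
                   for i in range(n))
closed_form = gamma(a) * gamma(b) / gamma(a + b)
```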
Gamma Distribution – Rate Parametrization

The Gamma distribution is frequently a model for waiting times. For important properties see here.

It is most often parameterized in terms of a shape parameter 𝑎 and an inverse scale parameter 𝑏 = 1/𝜃, called a rate parameter:

$$p(x\mid a,b) = \frac{b^a}{\Gamma(a)}\,x^{a-1}e^{-bx}, \qquad x > 0, \qquad \Gamma(a) = \int_0^\infty u^{a-1}e^{-u}\,du$$

The mean, mode and variance with this parametrization are:

$$\mathbb{E}[x] = \frac{a}{b}, \qquad \mathrm{mode}[x] = \begin{cases}\dfrac{a-1}{b} & \text{for } a \ge 1\\[4pt] 0 & \text{otherwise}\end{cases}, \qquad \mathrm{var}[x] = \frac{a}{b^2}$$
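The moments in the rate parametrization can be checked by quadrature (an illustrative Python sketch for Gamma(𝑎 = 2, 𝑏 = 1)):

```python
from math import gamma, exp

def gamma_pdf(x, a, b):
    """Gamma(x | a, b) = b^a / Gamma(a) * x^(a-1) exp(-b x), rate form."""
    return b**a / gamma(a) * x**(a - 1) * exp(-b * x)

# Midpoint quadrature check of mean a/b and variance a/b^2 for Gamma(2, 1);
# the tail beyond x = 60 is negligible.
a, b = 2.0, 1.0
n, hi = 100_000, 60.0
h = hi / n
xs = [(i + 0.5) * h for i in range(n)]
mean = h * sum(x * gamma_pdf(x, a, b) for x in xs)
var = h * sum((x - a / b)**2 * gamma_pdf(x, a, b) for x in xs)
```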
Gamma Distribution

Plots of $\mathrm{Gamma}(X\mid a,b) = \frac{b^a}{\Gamma(a)}\,x^{a-1}\exp(-bx)$ for 𝑏 = 1:

As we increase the rate 𝑏, the distribution squeezes leftwards and upwards. For 𝑎 < 1, the mode is at zero.

[Figure: Gamma pdfs for a = 1.0, 1.5, 2.0 with b = 1.0. Run gammaPlotDemo from PMTK.]
Gamma Distribution

An empirical PDF of rainfall data fitted with a Gamma distribution.

[Figure: two panels comparing the method-of-moments (MoM) and MLE fits. Run the MatLab function gammaRainfallDemo from PMTK.]
Exponential Distribution

This is defined as

$$\mathrm{Expon}(X\mid\lambda) = \mathrm{Gamma}(X\mid 1,\lambda) = \lambda\,e^{-\lambda x}, \qquad x \ge 0$$

Here 𝜆 is the rate parameter.

This distribution describes the times between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate 𝜆.
Chi-Squared Distribution

This is defined as

$$\chi^2(X\mid\nu) = \mathrm{Gamma}\!\left(X\,\Big|\,\frac{\nu}{2},\frac{1}{2}\right) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\,x^{\nu/2-1}\exp\!\left(-\frac{x}{2}\right), \qquad x > 0$$

This is the distribution of the sum of squared Gaussian random variables.

More precisely, let $Z_i \sim \mathcal{N}(0,1)$ and $S = \sum_{i=1}^{\nu} Z_i^2$; then $S \sim \chi^2_\nu$.
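The sum-of-squared-Gaussians characterization is easy to check by simulation: since $\chi^2_\nu = \mathrm{Gamma}(\nu/2, 1/2)$ has mean 𝜈, the sample mean of simulated sums should be close to 𝜈. An illustrative Python sketch (seeded for reproducibility):

```python
import random

random.seed(0)

def chi2_sample(nu):
    """Draw S = Z_1^2 + ... + Z_nu^2 with Z_i ~ N(0, 1)."""
    return sum(random.gauss(0.0, 1.0)**2 for _ in range(nu))

# chi^2_nu = Gamma(nu/2, 1/2) has mean nu (and variance 2*nu).
nu, n = 3, 100_000
samples = [chi2_sample(nu) for _ in range(n)]
sample_mean = sum(samples) / n
```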
Inverse Gamma Distribution

This is defined as follows:

$$\text{If } X \sim \mathrm{Gamma}(X\mid a,b), \text{ then } X^{-1} \sim \mathrm{InvGamma}(X^{-1}\mid a,b),$$

where:

$$\mathrm{InvGamma}(X\mid a,b) = \frac{b^a}{\Gamma(a)}\,x^{-(a+1)}\exp(-b/x), \qquad x > 0$$

𝑎 is the shape and 𝑏 the scale parameter. Note that 𝑏 is a scale parameter since:

$$\mathrm{InvGamma}(X\mid a,b) = \frac{1}{b}\,\mathrm{InvGamma}\!\left(\frac{X}{b}\,\Big|\,a,1\right)$$

It can be shown that:

$$\text{Mean} = \frac{b}{a-1}\ (\text{exists for } a>1), \qquad \text{Mode} = \frac{b}{a+1}, \qquad \mathrm{var} = \frac{b^2}{(a-1)^2(a-2)}\ (\text{exists for } a>2)$$
The Pareto Distribution

Used to model the distribution of quantities that exhibit long tails (heavy tails):

$$\mathrm{Pareto}(X\mid k,m) = k\,m^k\,x^{-(k+1)}\,\mathbb{I}(x \ge m)$$

This density asserts that 𝑥 must be greater than some constant 𝑚, but not too much greater; 𝑘 controls what is "too much".

Applications include modeling the frequency of words vs. their rank (e.g. "the", "of", etc.) and the wealth of people.*

As 𝑘 → ∞, the distribution approaches 𝛿(𝑥 − 𝑚).

On a log-log scale, the pdf forms a straight line of the form log 𝑝(𝑥) = 𝑎 log 𝑥 + 𝑐 for some constants 𝑎 and 𝑐 (a power law; Zipf's law).

* Basis of the distribution: a high proportion of a population has low income and only a few have very high incomes.
The Pareto Distribution

Applications: modeling the frequency of words vs. their rank, the distribution of wealth (𝑘 = Pareto index), etc.

$$\mathrm{Pareto}(X\mid k,m) = k\,m^k\,x^{-(k+1)}\,\mathbb{I}(x \ge m)$$
$$\text{Mean} = \frac{km}{k-1}\ (\text{if } k>1), \qquad \text{Mode} = m, \qquad \mathrm{var} = \frac{m^2 k}{(k-1)^2(k-2)}\ (\text{if } k>2)$$

[Figure: Pareto pdfs for (m, k) = (0.01, 0.10), (0.00, 0.50), (1.00, 1.00). ParetoPlot from PMTK.]
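The Pareto cdf $F(x) = 1 - (m/x)^k$ inverts in closed form, so sampling and checking the mean 𝑘𝑚/(𝑘 − 1) takes a few lines; an illustrative Python sketch with made-up parameters 𝑘 = 3, 𝑚 = 1:

```python
import random

def pareto_sample(k, m, u=None):
    """Inverse-CDF sampling: F(x) = 1 - (m/x)^k, so x = m (1-u)^(-1/k)."""
    if u is None:
        u = random.random()
    return m * (1.0 - u) ** (-1.0 / k)

random.seed(1)
k, m = 3.0, 1.0
n = 200_000
sample_mean = sum(pareto_sample(k, m) for _ in range(n)) / n
# Theory: mean = k m / (k - 1) = 1.5 for k = 3 (exists since k > 1).
```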
Covariance

Consider two random variables $X, Y : \Omega \to \mathbb{R}$.

The joint probability distribution is defined as:

$$P(X\in A,\ Y\in B) = \int\!\!\int \mathbb{1}_A(x)\,\mathbb{1}_B(y)\,p(x,y)\,dx\,dy$$

Two random variables are independent if

$$p(x,y) = p(x)\,p(y)$$

The covariance of X and Y is defined as:

$$\mathrm{cov}(X,Y) = \mathbb{E}\big[(X-\mathbb{E}(X))(Y-\mathbb{E}(Y))\big]$$

It is straightforward to verify that:

$$\mathrm{cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$$
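The covariance shortcut can be verified on a small discrete joint distribution (an illustrative Python sketch; the table of joint probabilities is made up):

```python
# A discrete joint distribution p(x, y) on {0,1} x {0,1}; verify
# cov(X, Y) = E[XY] - E[X] E[Y].
pxy = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}

Ex  = sum(p * x for (x, y), p in pxy.items())
Ey  = sum(p * y for (x, y), p in pxy.items())
Exy = sum(p * x * y for (x, y), p in pxy.items())

cov_def      = sum(p * (x - Ex) * (y - Ey) for (x, y), p in pxy.items())
cov_shortcut = Exy - Ex * Ey
```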
Correlation, Center Normalized Random Variables

Consider two random variables $X, Y : \Omega \to \mathbb{R}$.

The correlation coefficient of X and Y is defined as:

$$\mathrm{corr}(X,Y) = \frac{\mathrm{cov}(X,Y)}{\sigma_X\,\sigma_Y},$$

where the standard deviations of X and Y are $\sigma_X = \sqrt{\mathrm{cov}(X,X)}$, $\sigma_Y = \sqrt{\mathrm{cov}(Y,Y)}$.

The center normalized random variables are defined as:

$$\tilde{X} = \frac{X-\mathbb{E}[X]}{\sigma_X}, \qquad \tilde{Y} = \frac{Y-\mathbb{E}[Y]}{\sigma_Y}$$

It is straightforward to verify that:

$$\mathbb{E}[\tilde{X}] = \mathbb{E}[\tilde{Y}] = 0, \qquad \mathrm{var}[\tilde{X}] = \mathrm{var}[\tilde{Y}] = 1$$
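The claim that a center-normalized variable has mean 0 and variance 1 can be checked on a small discrete distribution (an illustrative Python sketch; the value/probability pairs are made up):

```python
from math import sqrt

# Standardizing a discrete random variable: X~ = (X - E[X]) / sigma_X.
xs = [(2.0, 0.2), (5.0, 0.5), (9.0, 0.3)]   # (value, probability) pairs

Ex = sum(p * x for x, p in xs)
var_x = sum(p * (x - Ex)**2 for x, p in xs)
sigma = sqrt(var_x)

tilde = [((x - Ex) / sigma, p) for x, p in xs]
mean_tilde = sum(p * x for x, p in tilde)
var_tilde = sum(p * (x - mean_tilde)**2 for x, p in tilde)
```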