TRANSCRIPT
Statistical Computing, University of Notre Dame, Notre Dame, IN, USA (Fall 2018, N. Zabaras)
Introduction to Probability and Statistics
(Continued)

Prof. Nicholas Zabaras
Center for Informatics and Computational Science
https://cics.nd.edu/
University of Notre Dame
Notre Dame, Indiana, USA
Email: [email protected]
URL: https://www.zabaras.com/
August 27, 2018
Contents

The binomial and Bernoulli distributions
The multinomial and multinoulli distributions
The Poisson distribution
Student's T distribution
The Laplace distribution
The Gamma distribution
The Beta distribution
The Pareto distribution
Introduction to covariance and correlation
References

• Following closely Chris Bishop's PRML book, Chapter 2
• Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 2
• Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.
• Bertsekas, D. and J. Tsitsiklis (2008). Introduction to Probability, 2nd Edition. Athena Scientific.
• Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer.
Binary Variables

Consider a coin flipping experiment with heads = 1 and tails = 0. With $\mu \in [0,1]$,

$$p(x=1 \mid \mu) = \mu, \qquad p(x=0 \mid \mu) = 1 - \mu$$

This defines the Bernoulli distribution as follows:

$$\mathrm{Bern}(x \mid \mu) = \mu^{x}(1-\mu)^{1-x}$$

Using the indicator function, we can also write this as:

$$\mathrm{Bern}(x \mid \mu) = \mu^{\mathbb{I}(x=1)}(1-\mu)^{\mathbb{I}(x=0)}$$
Bernoulli Distribution

Recall that in general

$$\mathbb{E}[f] = \sum_x p(x) f(x), \qquad \mathbb{E}[f] = \int p(x) f(x)\,dx, \qquad \mathrm{var}[f] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2$$

For the Bernoulli distribution $\mathrm{Bern}(x\mid\mu) = \mu^{x}(1-\mu)^{1-x}$, $x \in \{0,1\}$, we can easily show from the definitions:

$$\mathbb{E}[x] = \mu$$
$$\mathrm{var}[x] = \mu(1-\mu)$$
$$\mathbb{H}[x] = -\sum_{x} p(x\mid\mu)\ln p(x\mid\mu) = -\mu\ln\mu - (1-\mu)\ln(1-\mu)$$

Here H[𝑥] is the "entropy of the distribution".
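The moments and entropy above can be checked with a short Python sketch, computed directly from the definitions (the lecture's demos use MatLab/PMTK; this standalone snippet is only illustrative):

```python
import math

def bern_pmf(x, mu):
    """Bernoulli pmf: mu^x * (1 - mu)^(1 - x) for x in {0, 1}."""
    return mu**x * (1 - mu)**(1 - x)

def bern_moments(mu):
    """Mean, variance and entropy (nats) via E[f] = sum_x p(x) f(x)."""
    mean = sum(x * bern_pmf(x, mu) for x in (0, 1))
    var = sum((x - mean)**2 * bern_pmf(x, mu) for x in (0, 1))
    entropy = -sum(bern_pmf(x, mu) * math.log(bern_pmf(x, mu)) for x in (0, 1))
    return mean, var, entropy

# Closed forms: E[x] = mu, var[x] = mu(1-mu), H = -mu ln mu - (1-mu) ln(1-mu)
mean, var, entropy = bern_moments(0.3)
```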
Likelihood Function for Bernoulli Distribution

Consider the data set $\mathcal{D} = \{x_1, x_2, \ldots, x_N\}$, in which we have 𝑚 heads (𝑥 = 1) and 𝑁 − 𝑚 tails (𝑥 = 0).

The likelihood function takes the form:

$$p(\mathcal{D}\mid\mu) = \prod_{n=1}^{N} p(x_n\mid\mu) = \prod_{n=1}^{N} \mu^{x_n}(1-\mu)^{1-x_n} = \mu^{m}(1-\mu)^{N-m}$$
Binomial Distribution

Consider the discrete random variable $X \in \{0,1,2,\ldots,N\}$.

We define the Binomial distribution as follows:

$$\mathrm{Bin}(X=m \mid N,\mu) = \binom{N}{m}\mu^{m}(1-\mu)^{N-m}$$

In our coin flipping experiment, it gives the probability of getting 𝑚 heads in 𝑁 flips, with 𝜇 being the probability of getting heads in one flip.

It can be shown (see S. Ross, Introduction to Probability Models) that the limit of the binomial distribution as $N\to\infty$, $N\mu\to\lambda$, is the Poisson($\lambda$) distribution.

[Figure: the binomial distribution Bin(N = 10, μ = 0.25); Matlab code available.]
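As a sanity check, the pmf above can be evaluated in a few lines of Python (an illustrative sketch, not the slides' MatLab code); it also verifies the mean 𝑁𝜇 from the next slides:

```python
from math import comb

def binom_pmf(m, N, mu):
    """Bin(m | N, mu) = C(N, m) mu^m (1 - mu)^(N - m)."""
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

# The pmf sums to 1; for N = 10, mu = 0.25 the most likely count is m = 2
# (near N*mu = 2.5), matching the plotted histogram.
pmf = [binom_pmf(m, 10, 0.25) for m in range(11)]
```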
Binomial Distribution

The Binomial distribution $\mathrm{Bin}(N,\mu)$ for 𝑁 = 10 and 𝜇 ∈ {0.25, 0.9} is shown below using the MatLab function binomDistPlot from Kevin Murphy's PMTK.

[Figure: two histograms of Bin(10, μ), one for μ = 0.25 and one for μ = 0.9.]
Mean, Variance of the Binomial Distribution

For independent events, the mean of the sum is the sum of the means, and the variance of the sum is the sum of the variances.

Because 𝑚 = 𝑥₁ + … + 𝑥_N, and for each observation the mean and variance are known from the Bernoulli distribution:

$$\mathbb{E}[m] = \sum_{m=0}^{N} m\,\mathrm{Bin}(m\mid N,\mu) = \mathbb{E}[x_1+\ldots+x_N] = N\mu$$
$$\mathrm{var}[m] = \mathbb{E}\big[(m-\mathbb{E}[m])^2\big] = \sum_{m=0}^{N} (m-\mathbb{E}[m])^2\,\mathrm{Bin}(m\mid N,\mu) = \mathrm{var}[x_1+\ldots+x_N] = N\mu(1-\mu)$$

One can also compute $\mathbb{E}[m]$, $\mathbb{E}[m^2]$ by differentiating (twice) the identity

$$\sum_{m=0}^{N}\binom{N}{m}\mu^{m}(1-\mu)^{N-m} = 1$$

with respect to 𝜇. Try it!
Binomial Distribution: Normalization

To show that the Binomial is correctly normalized, we use the following identities.

Pascal's rule, which can be shown with direct substitution:

$$\binom{N}{m} + \binom{N}{m-1} = \binom{N+1}{m} \qquad (*)$$

Binomial theorem:

$$(1+x)^N = \sum_{m=0}^{N}\binom{N}{m}x^m \qquad (**)$$

This theorem is proved by induction using (*) and noticing:

$$(1+x)^{N+1} = (1+x)\sum_{m=0}^{N}\binom{N}{m}x^m = \sum_{m=0}^{N}\binom{N}{m}x^m + \sum_{m=1}^{N+1}\binom{N}{m-1}x^m = \sum_{m=0}^{N+1}\binom{N+1}{m}x^m$$

To finally show normalization using (**):

$$\sum_{m=0}^{N}\binom{N}{m}\mu^{m}(1-\mu)^{N-m} = (1-\mu)^N\sum_{m=0}^{N}\binom{N}{m}\left(\frac{\mu}{1-\mu}\right)^m = (1-\mu)^N\left(1+\frac{\mu}{1-\mu}\right)^N = 1$$
Generalization of the Bernoulli Distribution

We are now looking at discrete variables that can take on one of 𝐾 possible mutually exclusive states.

The variable is represented by a 𝐾-dimensional vector 𝒙 in which one of the elements 𝑥ₖ equals 1 and all remaining elements equal 0: 𝒙 = (0, 0, …, 1, 0, …, 0)ᵀ.

These vectors satisfy:

$$\sum_{k=1}^{K} x_k = 1$$

Let the probability of 𝑥ₖ = 1 be denoted as 𝜇ₖ. Then

$$p(\boldsymbol{x}\mid\boldsymbol{\mu}) = \prod_{k=1}^{K}\mu_k^{x_k}, \qquad \sum_{k=1}^{K}\mu_k = 1, \quad \mu_k \ge 0,$$

where 𝝁 = (𝜇₁, …, 𝜇_K)ᵀ.
Multinoulli/Categorical Distribution

The distribution is already normalized:

$$\sum_{\boldsymbol{x}} p(\boldsymbol{x}\mid\boldsymbol{\mu}) = \sum_{k=1}^{K}\mu_k = 1$$

The mean of the distribution is computed as:

$$\mathbb{E}[\boldsymbol{x}\mid\boldsymbol{\mu}] = \sum_{\boldsymbol{x}} p(\boldsymbol{x}\mid\boldsymbol{\mu})\,\boldsymbol{x} = (\mu_1,\ldots,\mu_K)^T = \boldsymbol{\mu},$$

similar to the result for the Bernoulli distribution.

The Multinoulli, also known as the Categorical distribution, is often denoted as follows (Mu here is the multinomial distribution):

$$\mathrm{Cat}(\boldsymbol{x}\mid\boldsymbol{\mu}) = \mathrm{Multinoulli}(\boldsymbol{x}\mid\boldsymbol{\mu}) = \mathrm{Mu}(\boldsymbol{x}\mid 1,\boldsymbol{\mu})$$

The parameter 1 emphasizes that we roll a 𝐾-sided die once (𝑁 = 1) – see next the multinomial distribution.
Likelihood: Multinoulli Distribution

Let us consider a data set $\mathcal{D} = \{\boldsymbol{x}_1,\ldots,\boldsymbol{x}_N\}$. The likelihood becomes:

$$p(\mathcal{D}\mid\boldsymbol{\mu}) = \prod_{n=1}^{N}\prod_{k=1}^{K}\mu_k^{x_{nk}} = \prod_{k=1}^{K}\mu_k^{\sum_n x_{nk}} = \prod_{k=1}^{K}\mu_k^{m_k}, \qquad \text{where } m_k := \sum_{n=1}^{N} x_{nk}$$

$m_k$ is the number of observations with 𝑥ₖ = 1.

$m_k$ is the "sufficient statistic" of the distribution.
MLE Estimate: Multinoulli Distribution

To compute the maximum likelihood (MLE) estimate of 𝝁, we maximize a log-likelihood augmented with a Lagrange multiplier 𝜆 enforcing the constraint:

$$\ln p(\mathcal{D}\mid\boldsymbol{\mu}) + \lambda\left(\sum_{k=1}^{K}\mu_k - 1\right) = \sum_{k=1}^{K} m_k\ln\mu_k + \lambda\left(\sum_{k=1}^{K}\mu_k - 1\right)$$

Setting the derivative wrt 𝜇ₖ equal to zero:

$$\mu_k = -\frac{m_k}{\lambda}$$

Substitution into the constraint $\sum_{k=1}^{K}\mu_k = 1$ gives $\lambda = -\sum_{k=1}^{K} m_k = -N$, so

$$\mu_k^{ML} = \frac{m_k}{N}$$

As expected, this is the fraction of the 𝑁 observations with 𝑥ₖ = 1.
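The closed-form MLE 𝜇ₖ = 𝑚ₖ/𝑁 can be sketched in Python from 1-of-𝐾 encoded data (an illustrative snippet with made-up observations, not part of the slides):

```python
def multinoulli_mle(data, K):
    """MLE of mu from 1-of-K encoded observations: mu_k = m_k / N,
    where m_k counts how often component k equals 1."""
    N = len(data)
    counts = [sum(x[k] for x in data) for k in range(K)]
    return [m / N for m in counts]

# Five observations of a K = 3 variable in 1-of-K encoding:
data = [(1, 0, 0), (0, 1, 0), (1, 0, 0), (0, 0, 1), (1, 0, 0)]
mu_hat = multinoulli_mle(data, 3)   # fractions of each state: 3/5, 1/5, 1/5
```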
Multinomial Distribution

We can also consider the joint distribution of 𝑚₁, …, 𝑚_K in 𝑁 observations, conditioned on the parameters 𝝁 = (𝜇₁, …, 𝜇_K).

From the expression for the likelihood given earlier, $p(\mathcal{D}\mid\boldsymbol{\mu}) = \prod_{k=1}^{K}\mu_k^{m_k}$, the multinomial distribution $\mathrm{Mu}(m_1,\ldots,m_K\mid N,\boldsymbol{\mu})$ with parameters 𝑁 and 𝝁 takes the form:

$$p(m_1, m_2, \ldots, m_K\mid N, \mu_1, \mu_2, \ldots, \mu_K) = \frac{N!}{m_1!\,m_2!\cdots m_K!}\,\mu_1^{m_1}\mu_2^{m_2}\cdots\mu_K^{m_K}, \qquad \text{where } \sum_{k=1}^{K} m_k = N$$
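The multinomial pmf above can be sketched directly (an illustrative Python snippet; the die probabilities are made up):

```python
from math import factorial

def multinomial_pmf(m, N, mu):
    """Mu(m_1..m_K | N, mu) = N!/(m_1!...m_K!) * prod_k mu_k^m_k."""
    assert sum(m) == N
    coef = factorial(N)
    p = 1.0
    for mk, muk in zip(m, mu):
        coef //= factorial(mk)
        p *= muk**mk
    return coef * p

# Probability of counts (3, 1, 1) in N = 5 rolls of a biased 3-sided die:
p = multinomial_pmf((3, 1, 1), 5, (0.5, 0.3, 0.2))
```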
Example: Biosequence Analysis

Consider a set of DNA sequences where there are 10 rows (sequences) and 15 columns (locations along the genome):

c g a t a c g g g g t c g a a
c a a t c c g a g a t c g c a
c a a t c c g t g t t g g g a
c a a t c g g c a t g c g g g
c g a g c c g c g t a c g a a
c a t a c g g a g c a c g a a
t a a t c c g g g c a t g t a
c g a g c c g a g t a c a g a
c c a t c c g c g t a a g c a
g g a t a c g a g a t g a c a

(Rows: sequences 𝑁 = 1:10; columns: location along the genome.)

Several locations are conserved by evolution (e.g., because they are part of a gene coding region), since the corresponding columns tend to be pure; e.g., column 7 is all 𝑔's.

To visualize the data (sequence logo), we plot the letters 𝐴, 𝐶, 𝐺 and 𝑇 with a font size proportional to their empirical probability, and with the most probable letter on top.
Example: Biosequence Analysis

The empirical probability distribution at location 𝑡 is obtained by normalizing the vector of counts (see the MLE estimate):

$$\boldsymbol{\theta}_t = \frac{1}{N}\left(\sum_{i=1}^{N}\mathbb{I}(X_{it}=1),\ \sum_{i=1}^{N}\mathbb{I}(X_{it}=2),\ \sum_{i=1}^{N}\mathbb{I}(X_{it}=3),\ \sum_{i=1}^{N}\mathbb{I}(X_{it}=4)\right)$$

This distribution is known as a motif.

We can also compute the most probable letter in each location; this is the consensus sequence.

[Figure: sequence logo (bits vs. sequence position 1–15). Use the MatLab function seqlogoDemo from Kevin Murphy's PMTK.]
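The motif and consensus sequence for the ten sequences on the previous slide can be computed with a short Python sketch (the slides use seqlogoDemo in MatLab; this is only an illustration):

```python
from collections import Counter

# The ten 15-letter sequences from the previous slide:
seqs = ["cgatacggggtcgaa", "caatccgagatcgca", "caatccgtgttggga",
        "caatcggcatgcggg", "cgagccgcgtacgaa", "catacggagcacgaa",
        "taatccgggcatgta", "cgagccgagtacaga", "ccatccgcgtaagca",
        "ggatacgagatgaca"]

N, T = len(seqs), len(seqs[0])

# Motif: the empirical (MLE) distribution over {a, c, g, t} at each location.
motif = []
for t in range(T):
    counts = Counter(s[t] for s in seqs)
    motif.append({c: counts[c] / N for c in "acgt"})

# Consensus sequence: the most probable letter at each location.
consensus = "".join(max(col, key=col.get) for col in motif)
```

Column 7 is fully conserved, so the motif there puts probability 1 on 𝑔.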
Summary of Discrete Distributions

A summary of the multinomial and related discrete distributions is given below in a table from Kevin Murphy's textbook:

𝑛 = 1 (one roll of the die), 𝑛 = − (𝑁 rolls of the die)
𝐾 = 1 (binary variables), 𝐾 = − (1-of-𝐾 encoding)
The Poisson Distribution

We say that $X \in \{0,1,2,3,\ldots\}$ has a Poisson distribution with parameter 𝜆 > 0, $X \sim \mathrm{Poi}(\lambda)$, if its pmf is

$$\mathrm{Poi}(X=x\mid\lambda) = e^{-\lambda}\,\frac{\lambda^x}{x!}$$

This is a model for counts of rare events.

[Figure: pmfs of Poi(λ = 1) and Poi(λ = 10). Use the MatLab function poissonPlotDemo from Kevin Murphy's PMTK.]
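The Poisson pmf, and its role as the 𝑁 → ∞, 𝑁𝜇 → 𝜆 limit of the binomial noted earlier, can be checked numerically with this illustrative Python sketch:

```python
from math import exp, factorial, comb

def poisson_pmf(x, lam):
    """Poi(x | lambda) = exp(-lambda) lambda^x / x!"""
    return exp(-lam) * lam**x / factorial(x)

def binom_pmf(m, N, mu):
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

# Poisson limit of the binomial: large N with N*mu = lambda held fixed.
lam = 3.0
approx = [binom_pmf(m, 10_000, lam / 10_000) for m in range(10)]
exact = [poisson_pmf(m, lam) for m in range(10)]
```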
The Empirical Distribution

Given data, 𝒟 = {𝑥₁, …, 𝑥_N}, we define the empirical distribution as:

$$p_{emp}(A) = \frac{1}{N}\sum_{i=1}^{N}\delta_{x_i}(A), \qquad \delta_{x}(A) := \begin{cases}1 & \text{if } x\in A\\ 0 & \text{if } x\notin A\end{cases} \quad (\text{Dirac measure})$$

We can also generalize this by associating weights with each sample:

$$p_{emp}(x) = \sum_{i=1}^{N} w_i\,\delta_{x_i}(x), \qquad 0 \le w_i \le 1, \qquad \sum_{i=1}^{N} w_i = 1$$

This corresponds to a histogram with spikes at each sample point, with height equal to the corresponding weight. This distribution assigns zero weight to any point not in the dataset.

Note that the "sample mean of 𝑓(𝑥)" is the expectation of 𝑓(𝑥) under the empirical distribution:

$$\mathbb{E}[f(x)] = \int f(x)\,\frac{1}{N}\sum_{i=1}^{N}\delta_{x_i}(x)\,dx = \frac{1}{N}\sum_{i=1}^{N} f(x_i)$$
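A minimal Python sketch of expectations under the (possibly weighted) empirical distribution, with made-up sample values:

```python
def emp_expectation(f, xs, ws=None):
    """E[f] under the empirical distribution: sum_i w_i f(x_i);
    uniform weights w_i = 1/N recover the sample mean of f."""
    if ws is None:
        ws = [1.0 / len(xs)] * len(xs)
    assert abs(sum(ws) - 1.0) < 1e-12   # weights must sum to 1
    return sum(w * f(x) for w, x in zip(ws, xs))

xs = [1.0, 2.0, 4.0]
mean_f = emp_expectation(lambda x: x**2, xs)                       # (1 + 4 + 16)/3
wmean_f = emp_expectation(lambda x: x**2, xs, [0.5, 0.25, 0.25])   # 0.5 + 1 + 4
```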
Student's T Distribution

$$\mathcal{T}(x\mid\mu,\lambda,\nu) = \frac{\Gamma\!\left(\frac{\nu}{2}+\frac{1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1+\frac{\lambda(x-\mu)^2}{\nu}\right]^{-\nu/2-1/2}$$

The parameter 𝜆 is called the precision of the 𝒯-distribution, even though it is not in general equal to the inverse of the variance (see below on behavior as 𝜐 → ∞).

The parameter 𝜐 is called the degrees of freedom.

For the particular case of 𝜐 = 1, the 𝒯-distribution reduces to the Cauchy distribution.

In the limit 𝜐 → ∞, the 𝒯-distribution 𝒯(𝑥|𝜇, 𝜆, 𝜐) becomes a Gaussian 𝒩(𝑥|𝜇, 𝜆⁻¹) with mean 𝜇 and precision 𝜆.
For 𝝊 → ∞, 𝓣(𝒙|𝝁, 𝝀, 𝝊) Becomes a Gaussian

We first write the distribution as follows:

$$\mathcal{T}(x\mid\mu,\lambda,\nu) \propto \left[1+\frac{\lambda(x-\mu)^2}{\nu}\right]^{-\nu/2-1/2} = \exp\left[\left(-\frac{\nu}{2}-\frac{1}{2}\right)\ln\left(1+\frac{\lambda(x-\mu)^2}{\nu}\right)\right]$$

For large 𝜐, we can approximate the log as $\ln(1+\varepsilon) = \varepsilon + O(\varepsilon^2)$, giving:

$$\mathcal{T}(x\mid\mu,\lambda,\nu) \propto \exp\left[\left(-\frac{\nu}{2}-\frac{1}{2}\right)\left(\frac{\lambda(x-\mu)^2}{\nu}+O(\nu^{-2})\right)\right] \;\longrightarrow\; \exp\left[-\frac{\lambda(x-\mu)^2}{2}\right]$$

In the limit 𝜐 → ∞, the 𝒯-distribution 𝒯(𝑥|𝜇, 𝜆, 𝜐) is indeed a Gaussian 𝒩(𝑥|𝜇, 𝜆⁻¹) with mean 𝜇 and precision 𝜆. The normalization of the 𝒯 is valid in this limit as well (so the Gaussian obtained is normalized).
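This limit can be checked numerically: the sketch below implements both densities (using log-gamma to keep the Gamma-function ratio stable for large 𝜐) and compares them on a grid. It is an illustrative Python snippet, not the lecture's MatLab code:

```python
from math import lgamma, exp, pi, sqrt

def student_pdf(x, mu, lam, nu):
    """Student's T density T(x | mu, lambda, nu); lgamma avoids
    overflow of the Gamma-function ratio when nu is very large."""
    coef = exp(lgamma(nu / 2 + 0.5) - lgamma(nu / 2)) * sqrt(lam / (pi * nu))
    return coef * (1 + lam * (x - mu)**2 / nu) ** (-nu / 2 - 0.5)

def gauss_pdf(x, mu, lam):
    """Gaussian N(x | mu, 1/lambda) parameterized by precision lambda."""
    return sqrt(lam / (2 * pi)) * exp(-0.5 * lam * (x - mu)**2)

# For nu = 1 (mu = 0, lam = 1) the T is the Cauchy density 1/(pi (1 + x^2));
# for very large nu it approaches the Gaussian with the same mu and lambda.
diff = max(abs(student_pdf(0.1 * k, 0.0, 2.0, 1e6) - gauss_pdf(0.1 * k, 0.0, 2.0))
           for k in range(-40, 41))
```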
Student's T Distribution

$$\mathcal{T}(x\mid\mu,\lambda,\nu) = \frac{\Gamma\!\left(\frac{\nu}{2}+\frac{1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1+\frac{\lambda(x-\mu)^2}{\nu}\right]^{-\nu/2-1/2}$$

Mean: 𝜇, for 𝜐 > 1
Mode: 𝜇
Var: 𝜐 / (𝜆(𝜐 − 2)), for 𝜐 > 2

[Figure: Student's T pdfs for 𝜇 = 0, 𝜆 = 1 and 𝜐 = 0.1, 1.0, 10; for 𝜐 → ∞ we obtain 𝒩(𝜇, 𝜆⁻¹). Matlab code available.]
Student's T Vs the Gaussian

We plot 𝒩(𝑥|0, 1), 𝒯(𝑥|0, 1, 1), and Lap(𝑥|0, 1/√2):

The mean and variance of the Student's T are undefined for 𝜐 = 1.

Logs of the PDFs: the Student's T is NOT log-concave.

When 𝜐 = 1, the distribution is known as the Cauchy or Lorentz distribution. Due to its heavy tails, the sample mean does not converge.

It is recommended to use 𝜐 = 4.

[Figure: pdfs and log-pdfs of the Gaussian, Student and Laplace distributions. Run the MatLab function studentLaplacePdfPlot from Kevin Murphy's PMTK.]
Student's T Distribution

The Student's T distribution can be seen from the equation below (see the following two slides for proof) as an infinite mixture of Gaussians, each with a different precision 𝜏 (governed by a Gamma distribution):

$$p(x\mid\mu,a,b) = \int_0^\infty \mathcal{N}(x\mid\mu,\tau^{-1})\,\mathrm{Gamma}(\tau\mid a,b)\,d\tau = \int_0^\infty \mathcal{N}(x\mid\mu,\tau^{-1})\,\frac{b^a\,e^{-b\tau}\,\tau^{a-1}}{\Gamma(a)}\,d\tau$$

The result is a distribution that in general has longer 'tails' than a Gaussian.

This gives the 𝒯-distribution robustness, i.e. the 𝒯-distribution is much less sensitive than the Gaussian to the presence of outliers.
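The mixture representation can be verified by quadrature: integrating the Gaussian–Gamma product over 𝜏 should reproduce the 𝒯 pdf with 𝜐 = 2𝑎 and 𝜆 = 𝑎/𝑏. An illustrative Python sketch (midpoint rule, made-up parameter values):

```python
from math import lgamma, exp, log, pi, sqrt

def student_pdf(x, mu, lam, nu):
    coef = exp(lgamma(nu / 2 + 0.5) - lgamma(nu / 2)) * sqrt(lam / (pi * nu))
    return coef * (1 + lam * (x - mu)**2 / nu) ** (-nu / 2 - 0.5)

def mixture_pdf(x, mu, a, b, n=50_000, tau_max=50.0):
    """Integrate N(x | mu, 1/tau) Gamma(tau | a, b) over tau by the
    midpoint rule; with nu = 2a, lambda = a/b this matches the T pdf."""
    h = tau_max / n
    total = 0.0
    for i in range(n):
        tau = (i + 0.5) * h
        log_gamma_pdf = a * log(b) - lgamma(a) + (a - 1) * log(tau) - b * tau
        log_gauss_pdf = 0.5 * log(tau / (2 * pi)) - 0.5 * tau * (x - mu)**2
        total += exp(log_gamma_pdf + log_gauss_pdf) * h
    return total

a, b = 3.0, 2.0   # so nu = 6 and lambda = 1.5
err = abs(mixture_pdf(1.0, 0.0, a, b) - student_pdf(1.0, 0.0, a / b, 2 * a))
```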
Appendix: Student's T as a Mixture of Gaussians

If we have a univariate Gaussian 𝒩(𝑥|𝜇, 𝜏⁻¹) together with a prior Gamma(𝜏|𝑎, 𝑏) and we integrate out the precision, we obtain the marginal distribution of 𝑥:

$$p(x\mid\mu,a,b) = \int_0^\infty \mathcal{N}(x\mid\mu,\tau^{-1})\,\mathrm{Gamma}(\tau\mid a,b)\,d\tau = \int_0^\infty \frac{b^a\,e^{-b\tau}\,\tau^{a-1}}{\Gamma(a)}\left(\frac{\tau}{2\pi}\right)^{1/2} e^{-\tau(x-\mu)^2/2}\,d\tau$$

Introduce the transformation $z = \tau A$, with $A = b + \frac{(x-\mu)^2}{2}$, to simplify:

$$p(x\mid\mu,a,b) = \frac{b^a}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\int_0^\infty \tau^{a-1/2}\,e^{-\tau A}\,d\tau = \frac{b^a}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2} A^{-a-1/2}\int_0^\infty z^{a-1/2}\,e^{-z}\,dz$$
Appendix: Student's T as a Mixture of Gaussians

Recalling the definition of the Gamma function:

$$\Gamma(a) = \int_0^\infty z^{a-1}e^{-z}\,dz,$$

we obtain

$$p(x\mid\mu,a,b) = \frac{b^a}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\left[b+\frac{(x-\mu)^2}{2}\right]^{-a-1/2}\Gamma\!\left(a+\tfrac{1}{2}\right)$$

It is common to redefine the parameters in this distribution as:

$$\nu = 2a, \qquad \lambda = \frac{a}{b},$$

which gives

$$p(x\mid\mu,\lambda,\nu) = \frac{\Gamma\!\left(\frac{\nu}{2}+\frac{1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1+\frac{\lambda(x-\mu)^2}{\nu}\right]^{-\nu/2-1/2}$$
Robustness of Student's T Distribution

The robustness of the 𝒯-distribution is illustrated here by comparing the maximum likelihood solutions for a Gaussian and a 𝒯-distribution (30 data points from the Gaussian are used).

The effect of a small number of outliers (figure on the right) is less significant for the 𝒯-distribution than for the Gaussian.

[Figure: left – on clean data the T-distribution and Gaussian fits nearly overlap; right – with outliers the Gaussian fit is distorted while the T-distribution fit changes little. Matlab code available.]
Robustness of Student's T Distribution

The earlier simulation is repeated here with the PMTK toolbox.

[Figure: Gaussian, Student T and Laplace fits, without and with outliers. Run the MatLab function robustDemo from Kevin Murphy's PMTK.]
The Laplace Distribution

Another distribution with heavy tails is the Laplace distribution, also known as the double-sided exponential distribution. It has the following pdf:

$$\mathrm{Lap}(x\mid\mu,b) = \frac{1}{2b}\exp\left(-\frac{|x-\mu|}{b}\right)$$

𝜇 is a location parameter and 𝑏 > 0 is a scale parameter.

$$\text{Mean} = \mu, \qquad \text{Mode} = \mu, \qquad \text{Var} = 2b^2$$

It is robust to outliers (see the earlier demonstration).

It puts more probability density at 0 than the Gaussian. This property is a useful way to encourage sparsity in a model.
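The normalization and the variance 2𝑏² can be checked by simple midpoint quadrature (an illustrative Python sketch with an arbitrary scale 𝑏 = 1.5):

```python
from math import exp

def laplace_pdf(x, mu, b):
    """Lap(x | mu, b) = exp(-|x - mu| / b) / (2b)."""
    return exp(-abs(x - mu) / b) / (2 * b)

# Midpoint quadrature on a wide grid: total mass ~ 1, second central
# moment ~ 2 b^2 (the tails beyond +/-40 are negligible for b = 1.5).
mu, b = 0.0, 1.5
n, lo, hi = 100_000, -40.0, 40.0
h = (hi - lo) / n
xs = [lo + (i + 0.5) * h for i in range(n)]
total = h * sum(laplace_pdf(x, mu, b) for x in xs)
var = h * sum((x - mu)**2 * laplace_pdf(x, mu, b) for x in xs)
```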
Beta Distribution

The Beta(𝛼, 𝛽) distribution, with 𝑥 ∈ [0,1] and 𝛼, 𝛽 > 0, is defined as follows:

$$\mathrm{Beta}(x\mid\alpha,\beta) = \underbrace{\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}}_{\text{normalizing factor}}\,x^{\alpha-1}(1-x)^{\beta-1} = \frac{1}{B(\alpha,\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}, \qquad B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$

The expected value, mode and variance of a Beta random variable 𝑥 with (hyper-)parameters 𝛼 and 𝛽 are:

$$\mathbb{E}[x] = \frac{\alpha}{\alpha+\beta}, \qquad \mathrm{mode}[x] = \frac{\alpha-1}{\alpha+\beta-2}, \qquad \mathrm{var}[x] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$

For more information visit this link.

[Figure: beta distributions for (α, β) = (0.1, 0.1), (1, 1), (2, 3), (8, 4).]
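The pdf, mean and mode formulas can be checked numerically; a Python sketch using midpoint quadrature for Beta(8, 4) (illustrative only):

```python
from math import gamma

def beta_pdf(x, a, b):
    """Beta(x | a, b) with normalizing constant Gamma(a+b)/(Gamma(a) Gamma(b))."""
    B = gamma(a) * gamma(b) / gamma(a + b)
    return x**(a - 1) * (1 - x)**(b - 1) / B

# Midpoint quadrature on [0, 1]: check normalization and mean a/(a+b).
a, b, n = 8.0, 4.0, 100_000
h = 1.0 / n
xs = [(i + 0.5) * h for i in range(n)]
total = h * sum(beta_pdf(x, a, b) for x in xs)
mean = h * sum(x * beta_pdf(x, a, b) for x in xs)
mode = (a - 1) / (a + b - 2)   # closed-form mode
```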
Beta Distribution

If 𝛼 = 𝛽 = 1, we obtain a uniform distribution.

If 𝛼 and 𝛽 are both less than 1, we get a bimodal distribution with spikes at 0 and 1.

If 𝛼 and 𝛽 are both greater than 1, the distribution is unimodal.

[Figure: beta pdfs for (α, β) = (0.1, 0.1), (1, 1), (2, 3), (8, 4). Run betaPlotDemo from PMTK.]
Beta Distribution

[Figure: four panels showing the Beta(0.1, 0.1), Beta(1, 1), Beta(2, 3) and Beta(8, 4) pdfs on 𝑥 ∈ [0, 1]. See the Matlab implementation.]
Gamma Function

The gamma function extends the factorial to real numbers:

$$\Gamma(x) = \int_0^\infty u^{x-1}e^{-u}\,du$$

With integration by parts:

$$\Gamma(x+1) = x\,\Gamma(x)$$

For integer 𝑛:

$$\Gamma(n) = (n-1)!$$

For more information visit this link.
Beta Distribution: Normalization

Showing that the Beta(𝛼, 𝛽) distribution is normalized correctly is a bit tricky. We need to prove that:

$$\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\int_0^1 \mu^{\alpha-1}(1-\mu)^{\beta-1}\,d\mu = 1$$

Follow the steps: (a) change the variable 𝑦 below via 𝑦 = 𝑡 − 𝑥; (b) change the order of integration over the triangular region 0 ≤ 𝑥 ≤ 𝑡 < ∞; and (c) change 𝑥 to 𝜇 via 𝑥 = 𝑡𝜇:

$$\Gamma(\alpha)\Gamma(\beta) = \int_0^\infty x^{\alpha-1}e^{-x}\,dx\int_0^\infty y^{\beta-1}e^{-y}\,dy = \int_0^\infty\int_x^\infty x^{\alpha-1}(t-x)^{\beta-1}e^{-t}\,dt\,dx$$
$$= \int_0^\infty e^{-t}\left[\int_0^t x^{\alpha-1}(t-x)^{\beta-1}\,dx\right]dt = \int_0^\infty e^{-t}\,t^{\alpha+\beta-1}\,dt\,\int_0^1 \mu^{\alpha-1}(1-\mu)^{\beta-1}\,d\mu$$
$$= \Gamma(\alpha+\beta)\int_0^1 \mu^{\alpha-1}(1-\mu)^{\beta-1}\,d\mu$$
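The resulting identity $B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$ can be spot-checked numerically; a Python sketch with arbitrary parameters 𝛼 = 2.5, 𝛽 = 3.5:

```python
from math import gamma

# Check Gamma(a) Gamma(b) / Gamma(a + b) = integral_0^1 mu^(a-1) (1-mu)^(b-1) d mu
# by midpoint quadrature (integrand vanishes at both endpoints for a, b > 1).
a, b = 2.5, 3.5
n = 200_000
h = 1.0 / n
integral = h * sum(((i + 0.5) * h)**(a - 1) * (1 - (i + 0.5) * h)**(b - 1)
                   for i in range(n))
closed_form = gamma(a) * gamma(b) / gamma(a + b)
```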
Gamma Distribution – Rate Parametrization

The Gamma distribution is frequently a model for waiting times. For important properties see here.

It is most often parameterized in terms of a shape parameter 𝑎 and an inverse scale parameter 𝑏 = 1/𝜃, called a rate parameter:

$$p(x\mid a,b) = \frac{b^a}{\Gamma(a)}\,x^{a-1}e^{-bx}, \qquad x > 0, \qquad \Gamma(a) = \int_0^\infty u^{a-1}e^{-u}\,du$$

The mean, mode and variance with this parametrization are:

$$\mathbb{E}[x] = \frac{a}{b}, \qquad \mathrm{mode}[x] = \begin{cases}\dfrac{a-1}{b} & \text{for } a \ge 1\\[4pt] 0 & \text{otherwise}\end{cases}, \qquad \mathrm{var}[x] = \frac{a}{b^2}$$
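The moments in the rate parametrization can be checked by quadrature (an illustrative Python sketch for Gamma(𝑎 = 2, 𝑏 = 1)):

```python
from math import gamma, exp

def gamma_pdf(x, a, b):
    """Gamma(x | a, b) = b^a / Gamma(a) * x^(a-1) exp(-b x), rate form."""
    return b**a / gamma(a) * x**(a - 1) * exp(-b * x)

# Midpoint quadrature check of mean a/b and variance a/b^2 for Gamma(2, 1);
# the tail beyond x = 60 is negligible.
a, b = 2.0, 1.0
n, hi = 100_000, 60.0
h = hi / n
xs = [(i + 0.5) * h for i in range(n)]
mean = h * sum(x * gamma_pdf(x, a, b) for x in xs)
var = h * sum((x - a / b)**2 * gamma_pdf(x, a, b) for x in xs)
```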
Gamma Distribution

Plots of $\mathrm{Gamma}(X\mid a,b) = \frac{b^a}{\Gamma(a)}\,x^{a-1}\exp(-bx)$ for 𝑏 = 1:

As we increase the rate 𝑏, the distribution squeezes leftwards and upwards. For 𝑎 < 1, the mode is at zero.

[Figure: Gamma pdfs for a = 1.0, 1.5, 2.0 with b = 1.0. Run gammaPlotDemo from PMTK.]
Gamma Distribution

An empirical PDF of rainfall data fitted with a Gamma distribution.

[Figure: two panels comparing the method-of-moments (MoM) and MLE fits. Run the MatLab function gammaRainfallDemo from PMTK.]
Exponential Distribution

This is defined as

$$\mathrm{Expon}(X\mid\lambda) = \mathrm{Gamma}(X\mid 1,\lambda) = \lambda\,e^{-\lambda x}, \qquad x \ge 0$$

Here 𝜆 is the rate parameter.

This distribution describes the times between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate 𝜆.
Chi-Squared Distribution

This is defined as

$$\chi^2(X\mid\nu) = \mathrm{Gamma}\!\left(X\,\Big|\,\frac{\nu}{2},\frac{1}{2}\right) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\,x^{\nu/2-1}\exp\!\left(-\frac{x}{2}\right), \qquad x > 0$$

This is the distribution of the sum of squared Gaussian random variables.

More precisely, let $Z_i \sim \mathcal{N}(0,1)$ and $S = \sum_{i=1}^{\nu} Z_i^2$; then $S \sim \chi^2_\nu$.
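The sum-of-squared-Gaussians characterization is easy to check by simulation: since $\chi^2_\nu = \mathrm{Gamma}(\nu/2, 1/2)$ has mean 𝜈, the sample mean of simulated sums should be close to 𝜈. An illustrative Python sketch (seeded for reproducibility):

```python
import random

random.seed(0)

def chi2_sample(nu):
    """Draw S = Z_1^2 + ... + Z_nu^2 with Z_i ~ N(0, 1)."""
    return sum(random.gauss(0.0, 1.0)**2 for _ in range(nu))

# chi^2_nu = Gamma(nu/2, 1/2) has mean nu (and variance 2*nu).
nu, n = 3, 100_000
samples = [chi2_sample(nu) for _ in range(n)]
sample_mean = sum(samples) / n
```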
Inverse Gamma Distribution

This is defined as follows:

$$\text{If } X \sim \mathrm{Gamma}(X\mid a,b), \text{ then } X^{-1} \sim \mathrm{InvGamma}(X^{-1}\mid a,b),$$

where:

$$\mathrm{InvGamma}(X\mid a,b) = \frac{b^a}{\Gamma(a)}\,x^{-(a+1)}\exp(-b/x), \qquad x > 0$$

𝑎 is the shape and 𝑏 the scale parameter. Note that 𝑏 is a scale parameter since:

$$\mathrm{InvGamma}(X\mid a,b) = \frac{1}{b}\,\mathrm{InvGamma}\!\left(\frac{X}{b}\,\Big|\,a,1\right)$$

It can be shown that:

$$\text{Mean} = \frac{b}{a-1}\ (\text{exists for } a>1), \qquad \text{Mode} = \frac{b}{a+1}, \qquad \mathrm{var} = \frac{b^2}{(a-1)^2(a-2)}\ (\text{exists for } a>2)$$
The Pareto Distribution

Used to model the distribution of quantities that exhibit long tails (heavy tails):

$$\mathrm{Pareto}(X\mid k,m) = k\,m^k\,x^{-(k+1)}\,\mathbb{I}(x \ge m)$$

This density asserts that 𝑥 must be greater than some constant 𝑚, but not too much greater; 𝑘 controls what is "too much".

Applications include modeling the frequency of words vs. their rank (e.g. "the", "of", etc.) and the wealth of people.*

As 𝑘 → ∞, the distribution approaches 𝛿(𝑥 − 𝑚).

On a log-log scale, the pdf forms a straight line of the form log 𝑝(𝑥) = 𝑎 log 𝑥 + 𝑐 for some constants 𝑎 and 𝑐 (a power law; Zipf's law).

* Basis of the distribution: a high proportion of a population has low income and only a few have very high incomes.
The Pareto Distribution

Applications: modeling the frequency of words vs. their rank, the distribution of wealth (𝑘 = Pareto index), etc.

$$\mathrm{Pareto}(X\mid k,m) = k\,m^k\,x^{-(k+1)}\,\mathbb{I}(x \ge m)$$
$$\text{Mean} = \frac{km}{k-1}\ (\text{if } k>1), \qquad \text{Mode} = m, \qquad \mathrm{var} = \frac{m^2 k}{(k-1)^2(k-2)}\ (\text{if } k>2)$$

[Figure: Pareto pdfs for (m, k) = (0.01, 0.10), (0.00, 0.50), (1.00, 1.00). ParetoPlot from PMTK.]
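The Pareto cdf $F(x) = 1 - (m/x)^k$ inverts in closed form, so sampling and checking the mean 𝑘𝑚/(𝑘 − 1) takes a few lines; an illustrative Python sketch with made-up parameters 𝑘 = 3, 𝑚 = 1:

```python
import random

def pareto_sample(k, m, u=None):
    """Inverse-CDF sampling: F(x) = 1 - (m/x)^k, so x = m (1-u)^(-1/k)."""
    if u is None:
        u = random.random()
    return m * (1.0 - u) ** (-1.0 / k)

random.seed(1)
k, m = 3.0, 1.0
n = 200_000
sample_mean = sum(pareto_sample(k, m) for _ in range(n)) / n
# Theory: mean = k m / (k - 1) = 1.5 for k = 3 (exists since k > 1).
```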
Covariance

Consider two random variables $X, Y : \Omega \to \mathbb{R}$.

The joint probability distribution is defined as:

$$P(X\in A,\ Y\in B) = \int\!\!\int \mathbb{1}_A(x)\,\mathbb{1}_B(y)\,p(x,y)\,dx\,dy$$

Two random variables are independent if

$$p(x,y) = p(x)\,p(y)$$

The covariance of X and Y is defined as:

$$\mathrm{cov}(X,Y) = \mathbb{E}\big[(X-\mathbb{E}(X))(Y-\mathbb{E}(Y))\big]$$

It is straightforward to verify that:

$$\mathrm{cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$$
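The covariance shortcut can be verified on a small discrete joint distribution (an illustrative Python sketch; the table of joint probabilities is made up):

```python
# A discrete joint distribution p(x, y) on {0,1} x {0,1}; verify
# cov(X, Y) = E[XY] - E[X] E[Y].
pxy = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}

Ex  = sum(p * x for (x, y), p in pxy.items())
Ey  = sum(p * y for (x, y), p in pxy.items())
Exy = sum(p * x * y for (x, y), p in pxy.items())

cov_def      = sum(p * (x - Ex) * (y - Ey) for (x, y), p in pxy.items())
cov_shortcut = Exy - Ex * Ey
```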
Correlation, Center Normalized Random Variables

Consider two random variables $X, Y : \Omega \to \mathbb{R}$.

The correlation coefficient of X and Y is defined as:

$$\mathrm{corr}(X,Y) = \frac{\mathrm{cov}(X,Y)}{\sigma_X\,\sigma_Y},$$

where the standard deviations of X and Y are $\sigma_X = \sqrt{\mathrm{cov}(X,X)}$, $\sigma_Y = \sqrt{\mathrm{cov}(Y,Y)}$.

The center normalized random variables are defined as:

$$\tilde{X} = \frac{X-\mathbb{E}[X]}{\sigma_X}, \qquad \tilde{Y} = \frac{Y-\mathbb{E}[Y]}{\sigma_Y}$$

It is straightforward to verify that:

$$\mathbb{E}[\tilde{X}] = \mathbb{E}[\tilde{Y}] = 0, \qquad \mathrm{var}[\tilde{X}] = \mathrm{var}[\tilde{Y}] = 1$$
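The claim that a center-normalized variable has mean 0 and variance 1 can be checked on a small discrete distribution (an illustrative Python sketch; the value/probability pairs are made up):

```python
from math import sqrt

# Standardizing a discrete random variable: X~ = (X - E[X]) / sigma_X.
xs = [(2.0, 0.2), (5.0, 0.5), (9.0, 0.3)]   # (value, probability) pairs

Ex = sum(p * x for x, p in xs)
var_x = sum(p * (x - Ex)**2 for x, p in xs)
sigma = sqrt(var_x)

tilde = [((x - Ex) / sigma, p) for x, p in xs]
mean_tilde = sum(p * x for x, p in tilde)
var_tilde = sum(p * (x - mean_tilde)**2 for x, p in tilde)
```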