Page 1

Statistical Computing, University of Notre Dame, Notre Dame, IN, USA (Fall 2018, N. Zabaras)

Introduction to Probability and Statistics

(Continued)

Prof. Nicholas Zabaras

Center for Informatics and Computational Science

https://cics.nd.edu/

University of Notre Dame

Notre Dame, Indiana, USA

Email: [email protected]

URL: https://www.zabaras.com/

August 27, 2018

Page 2

Contents

The binomial and Bernoulli distributions
The multinomial and multinoulli distributions
The Poisson distribution
Student's T
Laplace distribution
Gamma distribution
Beta distribution
Pareto distribution
Introduction to Covariance and Correlation

Page 3

References

• Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 2 (followed closely).
• Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. Chapter 2.
• Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.
• Bertsekas, D. and J. Tsitsiklis (2008). Introduction to Probability, 2nd Edition. Athena Scientific.
• Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer.

Page 4

Binary Variables

Consider a coin flipping experiment with heads = 1 and tails = 0, with

$$p(x=1\mid\mu)=\mu,\qquad p(x=0\mid\mu)=1-\mu,\qquad \mu\in[0,1]$$

This defines the Bernoulli distribution as follows:

$$\mathrm{Bern}(x\mid\mu)=\mu^{x}(1-\mu)^{1-x}$$

Using the indicator function, we can also write this as:

$$\mathrm{Bern}(x\mid\mu)=\mu^{\mathbb{I}(x=1)}(1-\mu)^{\mathbb{I}(x=0)}$$

Page 5

Bernoulli Distribution

Recall that in general

$$\mathbb{E}[f]=\sum_{x}p(x)f(x)\ \text{or}\ \int p(x)f(x)\,dx,\qquad \mathrm{var}[f]=\mathbb{E}[f(x)^{2}]-\mathbb{E}[f(x)]^{2}$$

For the Bernoulli distribution $\mathrm{Bern}(x\mid\mu)=\mu^{x}(1-\mu)^{1-x}$, $x\in\{0,1\}$, we can easily show from the definitions:

$$\mathbb{E}[x]=\mu,\qquad \mathrm{var}[x]=\mu(1-\mu)$$

$$\mathrm{H}[x]=-\sum_{x\in\{0,1\}}p(x\mid\mu)\ln p(x\mid\mu)=-\mu\ln\mu-(1-\mu)\ln(1-\mu)$$

Here H[$x$] is the "entropy of the distribution".
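The Bernoulli mean, variance, and entropy formulas above are easy to check numerically; a minimal sketch in plain Python (function name is my own):

```python
import math
import random

def bernoulli_stats(mu):
    """Closed-form mean, variance, and entropy of Bern(x|mu)."""
    mean = mu
    var = mu * (1.0 - mu)
    entropy = -mu * math.log(mu) - (1.0 - mu) * math.log(1.0 - mu)
    return mean, var, entropy

# Monte Carlo check of the mean against the formula
random.seed(0)
mu = 0.3
samples = [1 if random.random() < mu else 0 for _ in range(100_000)]
emp_mean = sum(samples) / len(samples)

mean, var, entropy = bernoulli_stats(mu)
print(mean, var, entropy)
print(abs(emp_mean - mean) < 0.01)
```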

Page 6

Likelihood Function for Bernoulli Distribution

Consider the data set $\mathcal{D}=\{x_1,x_2,\ldots,x_N\}$ in which we have $m$ heads ($x=1$) and $N-m$ tails ($x=0$).

The likelihood function takes the form:

$$p(\mathcal{D}\mid\mu)=\prod_{n=1}^{N}p(x_n\mid\mu)=\prod_{n=1}^{N}\mu^{x_n}(1-\mu)^{1-x_n}=\mu^{m}(1-\mu)^{N-m}$$
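Maximizing the log-likelihood $m\ln\mu+(N-m)\ln(1-\mu)$ wrt $\mu$ gives the familiar estimate $\hat\mu_{ML}=m/N$; a quick sketch (helper name is my own):

```python
def bernoulli_mle(data):
    """MLE of mu for 0/1 Bernoulli data: fraction of heads, mu_hat = m / N."""
    m = sum(data)       # number of heads (x = 1)
    return m / len(data)

data = [1, 0, 1, 1, 0, 1, 0, 1]    # m = 5 heads out of N = 8
print(bernoulli_mle(data))         # 0.625
```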

Page 7

Binomial Distribution

Consider the discrete random variable $X\in\{0,1,2,\ldots,N\}$.

We define the Binomial distribution as follows:

$$\mathrm{Bin}(X=m\mid N,\mu)=\binom{N}{m}\mu^{m}(1-\mu)^{N-m}$$

In our coin flipping experiment, it gives the probability in $N$ flips to get $m$ heads, with $\mu$ being the probability of getting heads in one flip.

It can be shown (see S. Ross, Introduction to Probability Models) that the limit of the binomial distribution as $N\to\infty$, $N\mu\to\lambda$, is the Poisson($\lambda$) distribution.

[Figure: binomial distribution, $\mathrm{Bin}(N=10,\mu=0.25)$. Matlab Code]
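The binomial pmf and its Poisson limit can be sketched with the standard library only (function names are my own):

```python
import math

def binom_pmf(m, n, mu):
    """Bin(m | N, mu) = C(N, m) * mu^m * (1 - mu)^(N - m)."""
    return math.comb(n, m) * mu**m * (1.0 - mu)**(n - m)

def poisson_pmf(x, lam):
    """Poi(x | lambda) = exp(-lambda) * lambda^x / x!."""
    return math.exp(-lam) * lam**x / math.factorial(x)

# The pmf is normalized over m = 0..N
print(abs(sum(binom_pmf(m, 10, 0.25) for m in range(11)) - 1.0) < 1e-12)

# Poisson limit: N large with N * mu = lambda held fixed
lam, n = 2.0, 10_000
print(abs(binom_pmf(3, n, lam / n) - poisson_pmf(3, lam)) < 1e-3)
```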

Page 8

Binomial Distribution

The Binomial distribution $\mathrm{Bin}(m\mid N,\mu)$ for $N=10$ and $\mu\in\{0.25,0.9\}$ is shown below using the MatLab function binomDistPlot from Kevin Murphy's PMTK.

[Figure: binomial histograms for $\mu=0.250$ (left) and $\mu=0.900$ (right).]

Page 9

Mean, Variance of the Binomial Distribution

Recall that for independent events the mean of the sum is the sum of the means, and the variance of the sum is the sum of the variances.

Because $m=x_1+\ldots+x_N$, and for each observation the mean and variance are known from the Bernoulli distribution:

$$\mathbb{E}[m]=\sum_{m=0}^{N}m\,\mathrm{Bin}(m\mid N,\mu)=\mathbb{E}[x_1+\ldots+x_N]=N\mu$$

$$\mathrm{var}[m]=\mathbb{E}[m^{2}]-\mathbb{E}[m]^{2}=\mathrm{var}[x_1+\ldots+x_N]=N\mu(1-\mu)$$

One can also compute $\mathbb{E}[m]$, $\mathbb{E}[m^{2}]$ by differentiating (twice) the identity

$$\sum_{m=0}^{N}\binom{N}{m}\mu^{m}(1-\mu)^{N-m}=1$$

wrt $\mu$. Try it!

Page 10

Binomial Distribution: Normalization

To show that the Binomial is correctly normalized, we use the following identities.

Can be shown with direct substitution:

$$\binom{N}{m}+\binom{N}{m-1}=\binom{N+1}{m}\qquad(*)$$

Binomial theorem:

$$(1+x)^{N}=\sum_{m=0}^{N}\binom{N}{m}x^{m}\qquad(**)$$

This theorem is proved by induction using (*) and noticing:

$$(1+x)^{N+1}=(1+x)\sum_{m=0}^{N}\binom{N}{m}x^{m}=\sum_{m=0}^{N}\binom{N}{m}x^{m}+\sum_{m=1}^{N+1}\binom{N}{m-1}x^{m}=\sum_{m=0}^{N+1}\binom{N+1}{m}x^{m}$$

To finally show normalization using (**):

$$\sum_{m=0}^{N}\binom{N}{m}\mu^{m}(1-\mu)^{N-m}=(1-\mu)^{N}\sum_{m=0}^{N}\binom{N}{m}\left(\frac{\mu}{1-\mu}\right)^{m}=(1-\mu)^{N}\left(1+\frac{\mu}{1-\mu}\right)^{N}=1$$

Page 11

Generalization of the Bernoulli Distribution

We are now looking at discrete variables that can take on one of $K$ possible mutually exclusive states.

The variable is represented by a $K$-dimensional vector $\boldsymbol{x}$ in which one of the elements $x_k$ equals 1, and all remaining elements equal 0: $\boldsymbol{x}=(0,0,\ldots,1,0,\ldots,0)^{T}$.

These vectors satisfy:

$$\sum_{k=1}^{K}x_{k}=1$$

Let the probability of $x_k=1$ be denoted as $\mu_k$. Then

$$p(\boldsymbol{x}\mid\boldsymbol{\mu})=\prod_{k=1}^{K}\mu_{k}^{x_{k}},\qquad \sum_{k=1}^{K}\mu_{k}=1,\ \mu_{k}\ge 0$$

where $\boldsymbol{\mu}=(\mu_1,\ldots,\mu_K)^{T}$.

Page 12

Multinoulli/Categorical Distribution

The distribution is already normalized:

$$\sum_{\boldsymbol{x}}p(\boldsymbol{x}\mid\boldsymbol{\mu})=\sum_{k=1}^{K}\mu_{k}=1$$

The mean of the distribution is computed as:

$$\mathbb{E}[\boldsymbol{x}\mid\boldsymbol{\mu}]=\sum_{\boldsymbol{x}}p(\boldsymbol{x}\mid\boldsymbol{\mu})\,\boldsymbol{x}=(\mu_1,\ldots,\mu_K)^{T}=\boldsymbol{\mu}$$

similar to the result for the Bernoulli distribution.

The Multinoulli, also known as the Categorical distribution, is often denoted as (Mu here is the multinomial distribution):

$$\mathrm{Cat}(\boldsymbol{x}\mid\boldsymbol{\mu})=\mathrm{Multinoulli}(\boldsymbol{x}\mid\boldsymbol{\mu})=\mathrm{Mu}(\boldsymbol{x}\mid 1,\boldsymbol{\mu})$$

The parameter 1 stands to emphasize that we roll a $K$-sided die once ($N=1$); see next for the multinomial distribution.

Page 13

Likelihood: Multinoulli Distribution

Let us consider a data set $\mathcal{D}=\{\boldsymbol{x}_1,\ldots,\boldsymbol{x}_N\}$. The likelihood becomes:

$$p(\mathcal{D}\mid\boldsymbol{\mu})=\prod_{n=1}^{N}\prod_{k=1}^{K}\mu_{k}^{x_{nk}}=\prod_{k=1}^{K}\mu_{k}^{\sum_{n}x_{nk}}=\prod_{k=1}^{K}\mu_{k}^{m_{k}},\qquad \text{where } m_{k}:=\sum_{n=1}^{N}x_{nk}$$

$m_k$ is the number of observations of $x_k=1$; $m_k$ is the "sufficient statistic" of the distribution.

Page 14

MLE Estimate: Multinoulli Distribution

To compute the maximum likelihood (MLE) estimate of $\boldsymbol{\mu}$, we maximize an augmented log-likelihood (with Lagrange multiplier $\lambda$):

$$\ln p(\mathcal{D}\mid\boldsymbol{\mu})+\lambda\left(\sum_{k=1}^{K}\mu_{k}-1\right)=\sum_{k=1}^{K}m_{k}\ln\mu_{k}+\lambda\left(\sum_{k=1}^{K}\mu_{k}-1\right)$$

Setting the derivative wrt $\mu_k$ equal to zero:

$$\frac{m_{k}}{\mu_{k}}+\lambda=0\ \Rightarrow\ \mu_{k}=-\frac{m_{k}}{\lambda}$$

Substitution into the constraint $\sum_{k=1}^{K}\mu_{k}=1$ gives $\lambda=-\sum_{k=1}^{K}m_{k}=-N$, and thus:

$$\mu_{k}^{ML}=\frac{m_{k}}{N}$$

As expected, this is the fraction of the $N$ observations with $x_k=1$.
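The closed-form MLE $\mu_k^{ML}=m_k/N$ is just normalized counting; a minimal sketch with states encoded as integers $0,\ldots,K-1$ instead of one-hot vectors (names are my own):

```python
from collections import Counter

def multinoulli_mle(data, k):
    """MLE for a K-state categorical distribution: mu_k = m_k / N."""
    counts = Counter(data)
    n = len(data)
    return [counts.get(state, 0) / n for state in range(k)]

data = [0, 2, 1, 2, 2, 0, 1, 2]
print(multinoulli_mle(data, 3))    # [0.25, 0.25, 0.5]
```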

Page 15

Multinomial Distribution

We can also consider the joint distribution of $m_1,\ldots,m_K$ in $N$ observations conditioned on the parameters $\boldsymbol{\mu}=(\mu_1,\ldots,\mu_K)$.

From the expression for the likelihood given earlier, $p(\mathcal{D}\mid\boldsymbol{\mu})=\prod_{k=1}^{K}\mu_{k}^{m_{k}}$, the multinomial distribution $\mathrm{Mu}(m_1,\ldots,m_K\mid N,\boldsymbol{\mu})$ with parameters $N$ and $\boldsymbol{\mu}$ takes the form:

$$p(m_1,m_2,\ldots,m_K\mid N,\mu_1,\mu_2,\ldots,\mu_K)=\frac{N!}{m_1!\,m_2!\cdots m_K!}\,\mu_1^{m_1}\mu_2^{m_2}\cdots\mu_K^{m_K},\qquad \text{where } \sum_{k=1}^{K}m_{k}=N$$
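The multinomial pmf follows directly from the formula; a sketch (function name is my own):

```python
import math

def multinomial_pmf(counts, mus):
    """Mu(m_1, ..., m_K | N, mu) = N! / (m_1! ... m_K!) * prod_k mu_k^m_k."""
    coef = math.factorial(sum(counts))
    for m in counts:
        coef //= math.factorial(m)
    p = float(coef)
    for m, mu in zip(counts, mus):
        p *= mu**m
    return p

# For K = 2 this reduces to the binomial: Bin(2 | 3, 0.5) = 3/8
print(multinomial_pmf([2, 1], [0.5, 0.5]))   # 0.375
```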

Page 16

Example: Biosequence Analysis

Consider a set of DNA sequences with 10 rows (sequences) and 15 columns (locations along the genome).

Several locations are conserved by evolution (e.g., because they are part of a gene coding region), since the corresponding columns tend to be pure; e.g., column 7 is all $g$'s.

To visualize the data (sequence logo), we plot the letters $A$, $C$, $G$ and $T$ with a font size proportional to their empirical probability, and with the most probable letter on the top.

c g a t a c g g g g t c g a a
c a a t c c g a g a t c g c a
c a a t c c g t g t t g g g a
c a a t c g g c a t g c g g g
c g a g c c g c g t a c g a a
c a t a c g g a g c a c g a a
t a a t c c g g g c a t g t a
c g a g c c g a g t a c a g a
c c a t c c g c g t a a g c a
g g a t a c g a g a t g a c a

(Rows: sequences $N=1{:}10$; columns: location along the genome.)

Page 17

Example: Biosequence Analysis

The empirical probability distribution at location $t$ is obtained by normalizing the vector of counts (see MLE estimate):

$$\hat{\boldsymbol{\theta}}_{t}=\frac{1}{N}\left(\sum_{i=1}^{N}\mathbb{I}(X_{it}=1),\ \sum_{i=1}^{N}\mathbb{I}(X_{it}=2),\ \sum_{i=1}^{N}\mathbb{I}(X_{it}=3),\ \sum_{i=1}^{N}\mathbb{I}(X_{it}=4)\right)$$

This distribution is known as a motif.

We can also compute the most probable letter in each location; this is the consensus sequence.

[Figure: sequence logo; x-axis: Sequence Position (1-15), y-axis: Bits. Use the MatLab function seqlogoDemo from Kevin Murphy's PMTK.]
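The motif and consensus sequence can be computed directly from the table of sequences above; a sketch in plain Python:

```python
from collections import Counter

# The 10 DNA sequences from the example (spaces removed)
seqs = [
    "cgatacggggtcgaa", "caatccgagatcgca", "caatccgtgttggga",
    "caatcggcatgcggg", "cgagccgcgtacgaa", "catacggagcacgaa",
    "taatccgggcatgta", "cgagccgagtacaga", "ccatccgcgtaagca",
    "ggatacgagatgaca",
]

n = len(seqs)
# Motif: empirical distribution at each location t (counts / N)
motif = [{b: c / n for b, c in Counter(col).items()} for col in zip(*seqs)]
# Consensus sequence: most probable letter at each location
consensus = "".join(max(d, key=d.get) for d in motif)

print(motif[6])      # column 7 (index 6) is conserved: {'g': 1.0}
print(len(consensus))
```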

Page 18

Summary of Discrete Distributions

A summary of the multinomial and related discrete distributions is given below, in a table from Kevin Murphy's textbook:

Multinomial: $n$ arbitrary ($N$ rolls of the dice), $K$ arbitrary (1-of-$K$ encoding)
Multinoulli: $n=1$ (one roll of the dice), $K$ arbitrary (1-of-$K$ encoding)
Binomial: $n$ arbitrary ($N$ rolls of the dice), $K=1$ (binary variables)
Bernoulli: $n=1$ (one roll of the dice), $K=1$ (binary variables)

Page 19

The Poisson Distribution

We say that $X\in\{0,1,2,3,\ldots\}$ has a Poisson distribution with parameter $\lambda>0$, $X\sim\mathrm{Poi}(\lambda)$, if its pmf is

$$\mathrm{Poi}(X=x\mid\lambda)=e^{-\lambda}\frac{\lambda^{x}}{x!}$$

This is a model for counts of rare events.

[Figure: pmfs of Poi($\lambda$=1.000) and Poi($\lambda$=10.000). Use the MatLab function poissonPlotDemo from Kevin Murphy's PMTK.]

Page 20

The Empirical Distribution

Given data $\mathcal{D}=\{x_1,\ldots,x_N\}$, we define the empirical distribution as:

$$p_{\mathrm{emp}}(A)=\frac{1}{N}\sum_{i=1}^{N}\delta_{x_{i}}(A),\qquad \delta_{x}(A)=\begin{cases}1 & \text{if } x\in A\\ 0 & \text{if } x\notin A\end{cases}\quad\text{(Dirac measure)}$$

We can also associate weights with each sample (generalized empirical distribution):

$$p_{\mathrm{emp}}(x)=\sum_{i=1}^{N}w_{i}\,\delta_{x_{i}}(x),\qquad 0\le w_{i}\le 1,\ \sum_{i=1}^{N}w_{i}=1$$

This corresponds to a histogram with spikes at each sample point with height equal to the corresponding weight. This distribution assigns zero weight to any point not in the dataset.

Note that the "sample mean of $f(x)$" is the expectation of $f(x)$ under the empirical distribution:

$$\mathbb{E}[f(x)]=\int f(x)\,\frac{1}{N}\sum_{i=1}^{N}\delta_{x_{i}}(x)\,dx=\frac{1}{N}\sum_{i=1}^{N}f(x_{i})$$
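The weighted empirical expectation is a one-liner; a sketch (names are my own):

```python
def emp_expectation(f, data, weights=None):
    """E[f(x)] under the (weighted) empirical distribution."""
    if weights is None:                       # uniform weights w_i = 1/N
        weights = [1.0 / len(data)] * len(data)
    return sum(w * f(x) for w, x in zip(weights, data))

data = [1.0, 2.0, 3.0, 4.0]
print(emp_expectation(lambda x: x, data))       # sample mean: 2.5
print(emp_expectation(lambda x: x * x, data))   # second moment: 7.5
```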

Page 21

Student's T Distribution

$$\mathcal{T}(x\mid\mu,\lambda,\nu)=\frac{\Gamma\!\left(\frac{\nu}{2}+\frac{1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1+\frac{\lambda(x-\mu)^{2}}{\nu}\right]^{-\nu/2-1/2}$$

The parameter $\lambda$ is called the precision of the $\mathcal{T}$-distribution, even though it is not in general equal to the inverse of the variance (see below on behavior as $\nu\to\infty$).

The parameter $\nu$ is called the degrees of freedom.

For the particular case of $\nu=1$, the T-distribution reduces to the Cauchy distribution.

In the limit $\nu\to\infty$, the $\mathcal{T}$-distribution $\mathcal{T}(x\mid\mu,\lambda,\nu)$ becomes a Gaussian $\mathcal{N}(x\mid\mu,\lambda^{-1})$ with mean $\mu$ and precision $\lambda$.

Page 22

For $\nu\to\infty$, $\mathcal{T}(x\mid\mu,\lambda,\nu)$ Becomes a Gaussian

We first write the distribution as follows:

$$\mathcal{T}(x\mid\mu,\lambda,\nu)\propto\left[1+\frac{\lambda(x-\mu)^{2}}{\nu}\right]^{-\nu/2-1/2}=\exp\left(-\left(\frac{\nu}{2}+\frac{1}{2}\right)\ln\left[1+\frac{\lambda(x-\mu)^{2}}{\nu}\right]\right)$$

For large $\nu$, we can approximate the log as $\ln(1+\varepsilon)=\varepsilon+O(\varepsilon^{2})$:

$$\mathcal{T}(x\mid\mu,\lambda,\nu)\propto\exp\left(-\left(\frac{\nu}{2}+\frac{1}{2}\right)\left[\frac{\lambda(x-\mu)^{2}}{\nu}+O(\nu^{-2})\right]\right)=\exp\left(-\frac{\lambda(x-\mu)^{2}}{2}+O(\nu^{-1})\right)$$

In the limit $\nu\to\infty$, the T-distribution $\mathcal{T}(x\mid\mu,\lambda,\nu)$ is indeed a Gaussian $\mathcal{N}(x\mid\mu,\lambda^{-1})$ with mean $\mu$ and precision $\lambda$. The normalization of the T is valid in this limit as well (so the Gaussian obtained is normalized).

Page 23

Student's T Distribution

$$\mathcal{T}(x\mid\mu,\lambda,\nu)=\frac{\Gamma\!\left(\frac{\nu}{2}+\frac{1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1+\frac{\lambda(x-\mu)^{2}}{\nu}\right]^{-\nu/2-1/2}$$

For $\mu=0$, $\lambda=1$, $\nu\to\infty$ we obtain $\mathcal{N}(0,1)$.

$$\text{Mean}:\ \mu\ (\nu>1),\qquad \text{Mode}:\ \mu,\qquad \text{Var}:\ \frac{1}{\lambda}\,\frac{\nu}{\nu-2}\ (\nu>2)$$

[Figure: student distribution for $\nu=10$, $\nu=1.0$, $\nu=0.1$. Matlab Code]

Page 24

Student's T Vs the Gaussian

We plot $\mathcal{N}(x\mid 0,1)$, $\mathcal{T}(x\mid 0,1,1)$, and $\mathrm{Lap}(x\mid 0,1/\sqrt{2})$:

The pdfs. The mean and variance of the Student's are undefined for $\nu=1$.

Logs of the pdfs. The Student's is NOT log concave.

When $\nu=1$, the distribution is known as the Cauchy or Lorentz distribution. Due to its heavy tails, the mean does not converge. It is recommended to use $\nu=4$.

[Figure: prob. density functions (left) and log prob. density functions (right) for Gauss, Student, Laplace. Run the MatLab function studentLaplacePdfPlot from Kevin Murphy's PMTK.]

Page 25

Student's T Distribution

The Student's T distribution can be seen from the equation below (see the following two slides for proof) as an infinite mixture of Gaussians, each of them with a different precision (governed by a Gamma distribution):

$$p(x\mid\mu,a,b)=\int_{0}^{\infty}\mathcal{N}(x\mid\mu,\tau^{-1})\,\mathrm{Gamma}(\tau\mid a,b)\,d\tau=\int_{0}^{\infty}\frac{b^{a}e^{-b\tau}\tau^{a-1}}{\Gamma(a)}\left(\frac{\tau}{2\pi}\right)^{1/2}\exp\left(-\frac{\tau(x-\mu)^{2}}{2}\right)d\tau$$

The result is a distribution that in general has longer 'tails' than a Gaussian.

This gives the T-distribution robustness, i.e. the T-distribution is much less sensitive than the Gaussian to the presence of outliers.
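The mixture representation suggests a two-stage sampler: draw a precision from the Gamma, then draw $x$ from the corresponding Gaussian. A sketch with the standard library (note that random.gammavariate takes a shape and a scale, so the rate $b$ enters as $1/b$; function name is my own):

```python
import random

def sample_student_t(mu, a, b, rng):
    """x ~ T(mu, lambda=a/b, nu=2a) via the scale mixture:
    tau ~ Gamma(shape=a, rate=b), then x ~ N(mu, 1/tau)."""
    tau = rng.gammavariate(a, 1.0 / b)
    return rng.gauss(mu, (1.0 / tau) ** 0.5)

rng = random.Random(0)
draws = sorted(sample_student_t(0.0, 2.0, 2.0, rng) for _ in range(50_000))
# nu = 4, lambda = 1: symmetric about mu, so the sample median is near 0
print(abs(draws[len(draws) // 2]) < 0.05)
```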

Page 26

Appendix: Student's T as a Mixture of Gaussians

If we have a univariate Gaussian $\mathcal{N}(x\mid\mu,\tau^{-1})$ together with a prior $\mathrm{Gamma}(\tau\mid a,b)$ and we integrate out the precision, we obtain the marginal distribution of $x$:

$$p(x\mid\mu,a,b)=\int_{0}^{\infty}\mathcal{N}(x\mid\mu,\tau^{-1})\,\mathrm{Gamma}(\tau\mid a,b)\,d\tau=\int_{0}^{\infty}\frac{b^{a}e^{-b\tau}\tau^{a-1}}{\Gamma(a)}\left(\frac{\tau}{2\pi}\right)^{1/2}\exp\left(-\frac{\tau(x-\mu)^{2}}{2}\right)d\tau$$

Collecting terms,

$$p(x\mid\mu,a,b)=\frac{b^{a}}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\int_{0}^{\infty}\tau^{a-1/2}\exp\left(-\tau\left[b+\frac{(x-\mu)^{2}}{2}\right]\right)d\tau$$

Introduce the transformation $z=\tau A$ with $A=b+\frac{(x-\mu)^{2}}{2}$ to simplify:

$$p(x\mid\mu,a,b)=\frac{b^{a}}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}A^{-a-1/2}\int_{0}^{\infty}z^{a-1/2}e^{-z}\,dz$$

Page 27

Appendix: Student's T as a Mixture of Gaussians

Recalling the definition of the Gamma function,

$$\Gamma(a)=\int_{0}^{\infty}z^{a-1}e^{-z}\,dz$$

the remaining integral evaluates to $\Gamma(a+1/2)$, giving:

$$p(x\mid\mu,a,b)=\frac{b^{a}}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\left[b+\frac{(x-\mu)^{2}}{2}\right]^{-a-1/2}\Gamma\!\left(a+\frac{1}{2}\right)$$

It is common to redefine the parameters in this distribution as $\nu=2a$, $\lambda=a/b$, which recovers the Student's T form:

$$\mathcal{T}(x\mid\mu,\lambda,\nu)=\frac{\Gamma\!\left(\frac{\nu}{2}+\frac{1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1+\frac{\lambda(x-\mu)^{2}}{\nu}\right]^{-\nu/2-1/2}$$

Page 28

Robustness of Student's T Distribution

The robustness of the T-distribution is illustrated here by comparing the "maximum likelihood solutions" for a Gaussian and a T-distribution (30 data points from the Gaussian are used).

The effect of a small number of outliers (figure on the right) is less significant for the T-distribution than for the Gaussian.

[Figure: left, the T-distribution and Gaussian fits nearly overlap; right, with outliers, the T-distribution fit is much less perturbed than the Gaussian. Matlab Code]

Page 29

Robustness of Student's T Distribution

The earlier simulation is repeated here with the PMTK toolbox.

[Figure: Gaussian, Student T, and Laplace fits without (left) and with (right) outliers.]

Run the MatLab function robustDemo from Kevin Murphy's PMTK.

Page 30

The Laplace Distribution

Another distribution with heavy tails is the Laplace distribution, also known as the double-sided exponential distribution. It has the following pdf:

$$\mathrm{Lap}(x\mid\mu,b)=\frac{1}{2b}\exp\left(-\frac{|x-\mu|}{b}\right)$$

$$\text{Mean}=\mu,\qquad \text{Mode}=\mu,\qquad \text{Var}=2b^{2}$$

Here $\mu$ is a location parameter and $b>0$ is a scale parameter.

It is robust to outliers (see the earlier demonstration).

It puts more probability density at 0 than the Gaussian. This property is a useful way to encourage sparsity in a model.

Page 31

Beta Distribution

The Beta($\alpha,\beta$) distribution with $x\in[0,1]$, $\alpha,\beta>0$ is defined as follows:

$$\mathrm{Beta}(x\mid\alpha,\beta)=\underbrace{\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}}_{\text{Normalizing factor}}x^{\alpha-1}(1-x)^{\beta-1}=\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}$$

The expected value, mode and variance of a Beta random variable $x$ with (hyper-)parameters $\alpha$ and $\beta$:

$$\mathbb{E}[x]=\frac{\alpha}{\alpha+\beta},\qquad \mathrm{mode}[x]=\frac{\alpha-1}{\alpha+\beta-2},\qquad \mathrm{var}[x]=\frac{\alpha\beta}{(\alpha+\beta)^{2}(\alpha+\beta+1)}$$

For more information visit this link.

[Figure: beta distributions for (a=0.1, b=0.1), (a=1.0, b=1.0), (a=2.0, b=3.0), (a=8.0, b=4.0).]
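The pdf and its mean can be checked with math.gamma; a sketch (function name is my own):

```python
import math

def beta_pdf(x, a, b):
    """Beta(x | alpha, beta) with the Gamma-function normalizer."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x**(a - 1) * (1.0 - x)**(b - 1)

# Beta(1,1) is the uniform distribution on [0,1]
print(beta_pdf(0.7, 1.0, 1.0))    # 1.0
# E[x] for Beta(2,3) is alpha/(alpha+beta) = 0.4 (midpoint quadrature)
n = 100_000
mean = sum(beta_pdf((i + 0.5) / n, 2.0, 3.0) * ((i + 0.5) / n) / n
           for i in range(n))
print(abs(mean - 0.4) < 1e-6)
```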

Page 32

Beta Distribution

If $\alpha=\beta=1$, we obtain a uniform distribution.

If $\alpha$ and $\beta$ are both less than 1, we get a bimodal distribution with spikes at 0 and 1.

If $\alpha$ and $\beta$ are both greater than 1, the distribution is unimodal.

Run betaPlotDemo from PMTK.

[Figure: beta distributions for (a=0.1, b=0.1): $\alpha,\beta<1$; (a=1.0, b=1.0): $\alpha=\beta=1$; (a=2.0, b=3.0) and (a=8.0, b=4.0): $\alpha,\beta>1$.]

Page 33

Beta Distribution

[Figure: Beta pdfs over $x\in[0,1]$ in four panels: Beta(0.1,0.1), Beta(1,1), Beta(2,3), Beta(8,4).]

See Matlab implementation.

Page 34

Gamma Function

The gamma function extends the factorial to real numbers:

$$\Gamma(x)=\int_{0}^{\infty}u^{x-1}e^{-u}\,du$$

With integration by parts:

$$\Gamma(x+1)=x\,\Gamma(x)$$

For integer $n$:

$$\Gamma(n)=(n-1)!$$

For more information visit this link.
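Both identities can be checked against math.gamma:

```python
import math

# Recurrence Gamma(x+1) = x * Gamma(x), for a non-integer x
x = 3.7
print(abs(math.gamma(x + 1) - x * math.gamma(x)) < 1e-9)

# For integer n, Gamma(n) = (n-1)!
print(math.gamma(6) == math.factorial(5))   # Gamma(6) = 5! = 120
```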

Page 35

Beta Distribution: Normalization

Showing that the Beta($\alpha,\beta$) distribution is normalized correctly is a bit tricky. We need to prove that:

$$\int_{0}^{1}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\mu^{\alpha-1}(1-\mu)^{\beta-1}\,d\mu=1$$

Follow the steps: (a) change the variable $y$ below to $y=t-x$; (b) change the order of integration in the triangular region $0\le x\le t<\infty$; and (c) change $x$ to $\mu$ via $x=t\mu$:

$$\Gamma(\alpha)\Gamma(\beta)=\int_{0}^{\infty}x^{\alpha-1}e^{-x}\,dx\int_{0}^{\infty}y^{\beta-1}e^{-y}\,dy=\int_{0}^{\infty}\int_{x}^{\infty}x^{\alpha-1}(t-x)^{\beta-1}e^{-t}\,dt\,dx$$

$$=\int_{0}^{\infty}e^{-t}\int_{0}^{t}x^{\alpha-1}(t-x)^{\beta-1}\,dx\,dt=\int_{0}^{\infty}e^{-t}t^{\alpha+\beta-1}\,dt\int_{0}^{1}\mu^{\alpha-1}(1-\mu)^{\beta-1}\,d\mu=\Gamma(\alpha+\beta)\int_{0}^{1}\mu^{\alpha-1}(1-\mu)^{\beta-1}\,d\mu$$

which rearranges to the required normalization.

Page 36

Gamma Distribution: Rate Parametrization

The Gamma distribution is frequently a model for waiting times. For important properties see here.

It is more often parameterized in terms of a shape parameter $a$ and an inverse scale parameter $b=1/\theta$, called a rate parameter:

$$\mathrm{Gamma}(x\mid a,b)=\frac{b^{a}}{\Gamma(a)}x^{a-1}e^{-xb},\qquad x>0,\qquad \Gamma(a)=\int_{0}^{\infty}u^{a-1}e^{-u}\,du$$

The mean, mode and variance with this parametrization are:

$$\mathbb{E}[x]=\frac{a}{b},\qquad \mathrm{mode}[x]=\begin{cases}\dfrac{a-1}{b} & \text{for } a\ge 1\\[4pt] 0 & \text{otherwise}\end{cases},\qquad \mathrm{var}[x]=\frac{a}{b^{2}}$$

Page 37

Gamma Distribution

Plots of

$$\mathrm{Gamma}(X\mid a,b)=\frac{b^{a}}{\Gamma(a)}x^{a-1}\exp(-xb),\qquad b=1$$

[Figure: Gamma distributions for (a=1.0, b=1.0), (a=1.5, b=1.0), (a=2.0, b=1.0). Run gammaPlotDemo from PMTK.]

As we increase the rate $b$, the distribution squeezes leftwards and upwards. For $a<1$, the mode is at zero.

Page 38

Gamma Distribution

An empirical PDF of rainfall data fitted with a Gamma distribution.

[Figure: rainfall histogram with Gamma fits; left: method of moments (MoM), right: MLE.]

Run the MatLab function gammaRainfallDemo from PMTK.

Page 39

Exponential Distribution

This is defined as

$$\mathrm{Expon}(X\mid\lambda)=\mathrm{Gamma}(X\mid 1,\lambda)=\lambda\exp(-\lambda x),\qquad x>0$$

Here $\lambda$ is the rate parameter.

This distribution describes the times between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate $\lambda$.
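The connection to the Poisson process can be sketched by simulating exponential inter-arrival times and counting events per unit window; the counts should then be Poisson with mean $\lambda$ (variable names are my own):

```python
import random

rng = random.Random(1)
lam, horizon = 4.0, 10_000.0

t, counts = 0.0, [0] * int(horizon)
while True:
    t += rng.expovariate(lam)     # Expon(lambda) inter-arrival time
    if t >= horizon:
        break
    counts[int(t)] += 1

# Events per unit window should average lambda
mean_count = sum(counts) / len(counts)
print(abs(mean_count - lam) < 0.1)
```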

Page 40

Chi-Squared Distribution

This is defined as

$$\chi^{2}_{\nu}(x)=\mathrm{Gamma}\!\left(x\,\Big|\,\frac{\nu}{2},\frac{1}{2}\right)=\frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\,x^{\nu/2-1}\exp\!\left(-\frac{x}{2}\right),\qquad x>0$$

This is the distribution of the sum of squared Gaussian random variables. More precisely,

$$\text{Let } Z_i\sim\mathcal{N}(0,1)\ \text{and}\ S=\sum_{i=1}^{\nu}Z_i^{2};\ \text{then}\ S\sim\chi^{2}_{\nu}$$
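The sum-of-squares characterization is easy to test by simulation; recall that $\chi^2_\nu$ has mean $\nu$ and variance $2\nu$:

```python
import random

rng = random.Random(2)
nu, n = 4, 50_000
draws = [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu)) for _ in range(n)]

mean = sum(draws) / n
var = sum((s - mean) ** 2 for s in draws) / n
print(abs(mean - nu) < 0.1)      # chi-squared mean is nu
print(abs(var - 2 * nu) < 0.3)   # chi-squared variance is 2 * nu
```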

Page 41

Inverse Gamma Distribution

This is defined as follows: if $X\sim\mathrm{Gamma}(X\mid a,b)$, then $1/X\sim\mathrm{InvGamma}(X\mid a,b)$, where:

$$\mathrm{InvGamma}(X\mid a,b)=\frac{b^{a}}{\Gamma(a)}\,x^{-(a+1)}\exp(-b/x),\qquad x>0$$

$a$ is the shape and $b$ the scale parameter.

Note that $b$ is a scale parameter since:

$$\mathrm{InvGamma}(X\mid a,b)=\frac{1}{b}\,\mathrm{InvGamma}\!\left(\frac{X}{b}\,\Big|\,a,1\right)$$

It can be shown that:

$$\text{Mean}=\frac{b}{a-1}\ (\text{exists for } a>1),\qquad \text{Mode}=\frac{b}{a+1},\qquad \mathrm{var}=\frac{b^{2}}{(a-1)^{2}(a-2)}\ (\text{exists for } a>2)$$

Page 42

The Pareto Distribution

Used to model the distribution of quantities that exhibit long (heavy) tails:

$$\mathrm{Pareto}(X\mid k,m)=k\,m^{k}\,x^{-(k+1)}\,\mathbb{I}(x\ge m)$$

This density asserts that $x$ must be greater than some constant $m$, but not too much greater; $k$ controls what is "too much".

Examples include modeling the frequency of words vs. their rank (e.g. "the", "of", etc.) and the wealth of people.*

As $k\to\infty$, the distribution approaches $\delta(x-m)$.

On a log-log scale, the pdf forms a straight line of the form $\log p(x)=a\log x+c$ for some constants $a$ and $c$ (power law, Zipf's law).

* Basis of the distribution: a high proportion of a population has low income and only a few have very high incomes.
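The power-law property (a straight line on a log-log scale) can be verified directly from the density; the slope is exactly $-(k+1)$ (function name is my own):

```python
import math

def pareto_pdf(x, k, m):
    """Pareto(x | k, m) = k * m^k * x^-(k+1) for x >= m, else 0."""
    return k * m**k * x ** (-(k + 1)) if x >= m else 0.0

# log p(x) = -(k+1) log x + log(k m^k): log-log slope is -(k+1)
k, m = 1.5, 1.0
x1, x2 = 2.0, 8.0
slope = (math.log(pareto_pdf(x2, k, m)) - math.log(pareto_pdf(x1, k, m))) \
        / (math.log(x2) - math.log(x1))
print(abs(slope + (k + 1)) < 1e-12)   # slope = -2.5
```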

Page 43

The Pareto Distribution

Applications: modeling the frequency of words vs. their rank, the distribution of wealth ($k$ = Pareto Index), etc.

$$\mathrm{Pareto}(X\mid k,m)=k\,m^{k}\,x^{-(k+1)}\,\mathbb{I}(x\ge m)$$

$$\text{Mean}=\frac{km}{k-1}\ (\text{if } k>1),\qquad \text{Mode}=m,\qquad \mathrm{var}=\frac{m^{2}k}{(k-1)^{2}(k-2)}\ (\text{if } k>2)$$

[Figure: Pareto distribution for (m=0.01, k=0.10), (m=0.00, k=0.50), (m=1.00, k=1.00). ParetoPlot from PMTK.]

Page 44

Covariance

Consider two random variables $X,Y:\Omega\to\mathbb{R}$.

The joint probability distribution is defined as:

$$P(X\in A,\ Y\in B)=P\!\left(X^{-1}(A)\cap Y^{-1}(B)\right)=\int_{A}\int_{B}p(x,y)\,dx\,dy$$

Two random variables are independent if

$$p(x,y)=p(x)\,p(y)$$

The covariance of $X$ and $Y$ is defined as:

$$\mathrm{cov}(X,Y)=\mathbb{E}\left[(X-\mathbb{E}(X))(Y-\mathbb{E}(Y))\right]$$

It is straightforward to verify that:

$$\mathrm{cov}(X,Y)=\mathbb{E}[XY]-\mathbb{E}[X]\,\mathbb{E}[Y]$$
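The identity $\mathrm{cov}(X,Y)=\mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y]$ can be verified on sample data (function name is my own):

```python
def covariance(xs, ys):
    """Covariance via E[XY] - E[X]E[Y] (population normalization)."""
    n = len(xs)
    ex, ey = sum(xs) / n, sum(ys) / n
    exy = sum(x * y for x, y in zip(xs, ys)) / n
    return exy - ex * ey

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]           # Y = 2X, so cov(X,Y) = 2 var(X) = 2.5
print(covariance(xs, ys))           # 2.5
print(covariance(xs, [5.0] * 4))    # constant Y is uncorrelated: 0.0
```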

Page 45

Correlation, Center Normalized Random Variables

Consider two random variables $X,Y:\Omega\to\mathbb{R}$.

The correlation coefficient of $X$ and $Y$ is defined as:

$$\mathrm{corr}(X,Y)=\frac{\mathrm{cov}(X,Y)}{\sigma_{X}\,\sigma_{Y}}$$

where the standard deviations of $X$ and $Y$ are $\sigma_{X}=\sqrt{\mathrm{cov}(X,X)}$, $\sigma_{Y}=\sqrt{\mathrm{cov}(Y,Y)}$.

The center normalized random variables are defined as:

$$\tilde{X}=\frac{X-\mathbb{E}[X]}{\sigma_{X}},\qquad \tilde{Y}=\frac{Y-\mathbb{E}[Y]}{\sigma_{Y}}$$

It is straightforward to verify that:

$$\mathbb{E}[\tilde{X}]=\mathbb{E}[\tilde{Y}]=0,\qquad \mathrm{var}[\tilde{X}]=\mathrm{var}[\tilde{Y}]=1$$