Lecture 4
Probability and what it has to do with data analysis
Please Read
Doug Martinson’s
Chapter 2: ‘Probability Theory’
Available on Courseworks
Abstraction
Random variable, x
it has no set value, until you ‘realize’ it
its properties are described by a distribution, p(x)
When you realize x
the probability that the value you get is
between x and x+dx
is p(x) dx
Probability density distribution
the probability, P, that the value you get is
between x1 and x2
is P = ∫_{x1}^{x2} p(x) dx
Note that it is written with a capital P
And represented by a fraction between
0 = never
And
1 = always
[Figure: p(x) versus x; the probability P that x is between x1 and x2 is proportional to the area under the curve between x1 and x2]
the probability that the value you get is something is unity
∫_{-∞}^{+∞} p(x) dx = 1
Or whatever the allowable range of x is …
[Figure: p(x) versus x; the probability that x is between -∞ and +∞ is unity, so the total area under the curve = 1]
Why all this is relevant …
Any measurement that contains noise is treated as a random variable, x
The distribution p(x) embodies both the ‘true value’ of the quantity being measured and the measurement noise
All quantities derived from a random variable are themselves random variables, so …
The algebra of random variables allows you to understand how measurement noise affects inferences made from the data
Basic Description of Distributions
[Figure: p(x) versus x, with the peak at x = xmode]
Mode
the value of x at which the distribution has its peak
the most-likely value of x
But modes can be deceptive …
[Figure: histogram of 100 realizations of x, binned in unit intervals from 0 to 10, with its peak (mode) in the 1-2 bin]

x range   N
0-1        3
1-2       18
2-3       11
3-4        8
4-5       11
5-6       14
6-7        8
7-8        7
8-9       11
9-10       9

Sure, the 1-2 range has the most counts, but most of the measurements are bigger than 2!
100 realizations of x
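As a quick numerical check, the same conclusion can be read straight off the binned counts; here is a short Python sketch using the hypothetical table above:

```python
import numpy as np

# Hypothetical counts from the table above: bins 0-1, 1-2, ..., 9-10
counts = np.array([3, 18, 11, 8, 11, 14, 8, 7, 11, 9])
bin_lo = np.arange(10)                           # left edge of each bin

mode_bin = bin_lo[np.argmax(counts)]             # bin with the most counts
frac_above_2 = counts[2:].sum() / counts.sum()   # fraction of measurements bigger than 2

print(f"mode is in the {mode_bin}-{mode_bin+1} bin")   # 1-2 bin
print(f"fraction of measurements above 2: {frac_above_2:.2f}")   # 0.79
```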
[Figure: p(x) versus x, with xmedian splitting the area under the curve into two 50% halves]
Median
50% chance x is smaller than xmedian
50% chance x is bigger than xmedian
No special reason the median needs to coincide with the peak
Expected value or ‘mean’
x you would get if you took the mean of lots of realizations of x
Let’s examine a discrete distribution, for simplicity ...
x N
1 20
2 80
3 40
Total 140
mean = [ 20×1 + 80×2 + 40×3 ] / 140
     = (20/140)×1 + (80/140)×2 + (40/140)×3
     = p(1)×1 + p(2)×2 + p(3)×3
     = Σi p(xi) xi
Hypothetical table of 140 realizations of x
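The same arithmetic, as a short Python sketch using the hypothetical table above:

```python
import numpy as np

x = np.array([1, 2, 3])
N = np.array([20, 80, 40])        # hypothetical counts of each value

p = N / N.sum()                   # empirical probabilities p(xi) = Ni / 140
mean = np.sum(p * x)              # sum_i p(xi) * xi

print(p)      # [0.143 0.571 0.286]
print(mean)   # 2.1428...  = [20*1 + 80*2 + 40*3] / 140
```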
By analogy, for a smooth distribution:
Expected value of x
E(x) = ∫_{-∞}^{+∞} x p(x) dx
By the way … you can compute the expected ("mean") value of any function of x this way:
E(x) = ∫_{-∞}^{+∞} x p(x) dx
E(x²) = ∫_{-∞}^{+∞} x² p(x) dx
E[f(x)] = ∫_{-∞}^{+∞} f(x) p(x) dx
etc.
Beware
E(x²) ≠ [E(x)]²
and, in general, E[f(x)] ≠ f[E(x)]
and so forth …
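A quick numerical illustration of this warning, reusing the p(xi) from the hypothetical table above:

```python
import numpy as np

x = np.array([1, 2, 3])
p = np.array([20, 80, 40]) / 140.0   # p(xi) from the hypothetical table

E_x  = np.sum(p * x)       # E(x)   = 2.1428...
E_x2 = np.sum(p * x**2)    # E(x^2) = 5.0
print(E_x2, E_x**2)        # 5.0 versus 4.59... -- not equal
```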
Width of a distribution
Here's a perfectly sensible way to define the width of a distribution …
[Figure: p(x) versus x; W50 is the width of the central interval containing 50% of the probability, with 25% in each tail]
… it's not used much, though
Width of a distribution
Here's another way …
[Figure: p(x) versus x plotted together with the parabola [x-E(x)]², which has its minimum at E(x)]
… multiply the two and integrate
Variance = σ² = ∫_{-∞}^{+∞} [x-E(x)]² p(x) dx
[Figure: the product [x-E(x)]² p(x) versus x; compute this total area …]
Idea is that if distribution is narrow, then most of the probability lines up with the low spot of the parabola
But if it is wide, then some of the probability lines up with the high parts of the parabola
[Figure: p(x) versus x, centered on E(x)]
variance = σ²
a measure of width …
we don't immediately know its relationship to area, though …
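These integrals are easy to evaluate numerically for any given p(x); here is a minimal Python sketch, assuming an exponential p(x) = e^(-x) on x ≥ 0 purely as an illustration:

```python
import numpy as np

# Simple Riemann-sum evaluation of the integrals defining E(x) and the variance.
# p(x) = exp(-x) on x >= 0 is assumed here purely as an illustration.
dx = 0.001
x = np.arange(0.0, 50.0, dx)
p = np.exp(-x)

area = np.sum(p) * dx                 # normalization check, ~1
Ex   = np.sum(x * p) * dx             # E(x) = ∫ x p(x) dx, ~1
var  = np.sum((x - Ex)**2 * p) * dx   # σ² = ∫ [x - E(x)]² p(x) dx, ~1

print(area, Ex, var)                  # all approximately 1
```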
the Gaussian or normal distribution
p(x) = [1/(√(2π) σ)] exp{ -(x - x̄)² / (2σ²) }
where x̄ is the expected value and σ² is the variance
Memorize me !
[Figure: examples of normal distributions: p(x) versus x for x̄ = 1, σ = 1 and for x̄ = 3, σ = 0.5]
Properties of the normal distribution
[Figure: p(x) versus x, with 95% of the area lying between x̄ - 2σ and x̄ + 2σ]
Expectation = Median = Mode = x̄
95% of probability within 2σ of the expected value
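A quick Monte Carlo check of the 2σ rule (a Python sketch; the particular x̄ and σ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
xbar, sigma = 3.0, 0.5                     # arbitrary mean and standard deviation
x = rng.normal(xbar, sigma, size=100_000)  # realizations of a normal random variable

inside = np.abs(x - xbar) < 2 * sigma      # within 2 sigma of the expected value
print(inside.mean())                       # ~0.954, i.e. about 95%
```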
Functions of a random variable
any function of a random variable is itself a random variable
If x has distribution p(x)
then y(x) has distribution
p(y) = p[x(y)] |dx/dy|
This follows from the rule for transforming integrals …
1 = ∫_{x1}^{x2} p(x) dx = ∫_{y1}^{y2} p[x(y)] |dx/dy| dy
Limits so that y1=y(x1), etc.
example
Let x have a uniform (white) distribution on [0,1]
[Figure: p(x) = 1 for 0 ≤ x ≤ 1; uniform probability that x is anywhere between 0 and 1]
Let y = x²
then x = y^½
y(x=0) = 0,  y(x=1) = 1,  dx/dy = ½ y^-½
p[x(y)] = 1
So p(y) = ½ y^-½ on the interval [0,1]
Numerical test: histograms of 1000 random numbers
[Figure, left: histogram of x, generated with Excel's rand() function, which claims to be based upon a uniform distribution; plausible that it's uniform]
[Figure, right: histogram of x², generated by squaring the x's from above; plausible that it's proportional to 1/√y]
multivariate distributions
example
Liberty Island is inhabited by both pigeons and seagulls
40% of the birds are pigeons and 60% of the birds are gulls
50% of pigeons are white and 50% are tan; 100% of gulls are white
Two variables
species s takes two values
pigeon p
and gull g
color c takes two values
white w
and tan t
Of 100 birds,
20 are white pigeons
20 are tan pigeons
60 are white gulls
0 are tan gulls
What is the probability that a bird has species s and color c ?
        c = w   c = t
s = p    20%     20%
s = g    60%      0%
Note: sum of all boxes is 100%
a random bird, that is
This is called the Joint Probability
and is written P(s,c)
Two continuous variables, say x1 and x2,
have a joint probability distribution written p(x1, x2)
The probability that x1 is between x1 and x1+dx1
and x2 is between x2 and x2+dx2
is p(x1, x2) dx1 dx2
so ∫∫ p(x1, x2) dx1 dx2 = 1
You would contour a joint probability distribution, and it would look something like:
[Figure: contour plot of p(x1, x2) in the (x1, x2) plane]
What is the probability that a bird has color c ?
start with P(s,c)

        c = w   c = t
s = p    20%     20%
s = g    60%      0%

and sum the columns to get P(c)

        c = w   c = t
         80%     20%
Of 100 birds,
20 are white pigeons
20 are tan pigeons
60 are white gulls
0 are tan gulls
What is the probability that a bird has species s ?
start with P(s,c)

        c = w   c = t
s = p    20%     20%
s = g    60%      0%

and sum the rows to get P(s)

s = p    40%
s = g    60%
Of 100 birds,
20 are white pigeons
20 are tan pigeons
60 are white gulls
0 are tan gulls
These operations make sense with distributions, too
[Figure: contoured joint distribution p(x1,x2), with the marginal distributions p(x1) and p(x2) plotted along the x1 and x2 axes]
p(x1) = ∫ p(x1,x2) dx2    the distribution of x1 (irrespective of x2)
p(x2) = ∫ p(x1,x2) dx1    the distribution of x2 (irrespective of x1)
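A sketch of how the marginals would be computed numerically from a gridded joint distribution; the Gaussian-bump form of p(x1,x2) below is only an illustration:

```python
import numpy as np

# Grid an example joint distribution p(x1, x2): an uncorrelated Gaussian bump
dx = 0.05
x1 = np.arange(-5.0, 5.0, dx)
x2 = np.arange(-5.0, 5.0, dx)
X1, X2 = np.meshgrid(x1, x2, indexing="ij")
p = np.exp(-0.5 * (X1**2 + X2**2)) / (2.0 * np.pi)   # example p(x1, x2)

# Marginals: integrate out the other variable
p_x1 = np.sum(p, axis=1) * dx    # p(x1) = ∫ p(x1, x2) dx2
p_x2 = np.sum(p, axis=0) * dx    # p(x2) = ∫ p(x1, x2) dx1

print(np.sum(p_x1) * dx)         # ~1 : each marginal is itself a normalized distribution
print(np.sum(p_x2) * dx)         # ~1
```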
Given that a bird is species s, what is the probability that it has color c ?

        c = w   c = t
s = p    50%     50%
s = g   100%      0%

Note: all rows sum to 100%
Of 100 birds,
20 are white pigeons
20 are tan pigeons
60 are white gulls
0 are tan gulls
This is called the Conditional Probability of c given s
and is written P(c|s)
similarly …
Given that a bird is color c, what is the probability that it has species s ?

        c = w   c = t
s = p    25%    100%
s = g    75%      0%

Note: all columns sum to 100%
Of 100 birds,
20 are white pigeons
20 are tan pigeons
60 are white gulls
0 are tan gulls
So 25% of white birds are pigeons
This is called the Conditional Probability of s given c
and is written P(s|c)
Beware! P(c|s) ≠ P(s|c)

P(c|s):
        c = w   c = t
s = p    50%     50%
s = g   100%      0%

P(s|c):
        c = w   c = t
s = p    25%    100%
s = g    75%      0%
note
P(s,c) = P(s|c) P(c)

P(s,c):
        c = w   c = t
s = p    20%     20%
s = g    60%      0%

equals P(s|c):
        c = w   c = t
s = p    25%    100%
s = g    75%      0%

times P(c):
        c = w   c = t
         80%     20%

Check: 25% of 80% is 20%
and
P(s,c) = P(c|s) P(s)

P(s,c):
        c = w   c = t
s = p    20%     20%
s = g    60%      0%

equals P(c|s):
        c = w   c = t
s = p    50%     50%
s = g   100%      0%

times P(s):
s = p    40%
s = g    60%

Check: 50% of 40% is 20%
and if
P(s,c) = P(s|c) P(c) = P(c|s) P(s)
then P(s|c) = P(c|s) P(s) / P(c)
and
P(c|s) = P(s|c) P(c) / P(s)
… which is called Bayes Theorem
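The whole bird example, written as a short Python check that the joint, marginal, and conditional tables above really do satisfy Bayes Theorem:

```python
import numpy as np

# Joint probability P(s,c): rows = species (pigeon, gull), columns = color (white, tan)
P_sc = np.array([[0.20, 0.20],
                 [0.60, 0.00]])

P_c = P_sc.sum(axis=0)             # P(c): sum over species (down each column) -> [0.80, 0.20]
P_s = P_sc.sum(axis=1)             # P(s): sum over color (across each row)    -> [0.40, 0.60]

P_c_given_s = P_sc / P_s[:, None]  # P(c|s) = P(s,c) / P(s); rows sum to 1
P_s_given_c = P_sc / P_c[None, :]  # P(s|c) = P(s,c) / P(c); columns sum to 1

# Bayes Theorem: P(s|c) = P(c|s) P(s) / P(c)
bayes = P_c_given_s * P_s[:, None] / P_c[None, :]
print(np.allclose(bayes, P_s_given_c))   # True
print(P_s_given_c)                       # [[0.25, 1.00], [0.75, 0.00]], as in the table
```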
Why Bayes Theorem is important
Consider the problem of fitting a straight line to data, d, where the intercept and slope are given by the vector m.
If we guess m and use it to predict d we are doing something like P(d|m)
But if we observe d and use it to estimate m then we are doing something like P(m|d)
Bayes Theorem provides a framework for relating what we do to get P(d|m) to what we do to get P(m|d)
Expectation, Variance, and Covariance
of a multivariate distribution
The expected values of x1 and x2 are calculated in a fashion analogous to the one-variable case:
E(x1) = ∫∫ x1 p(x1,x2) dx1 dx2        E(x2) = ∫∫ x2 p(x1,x2) dx1 dx2
Note
E(x1) = ∫∫ x1 p(x1,x2) dx1 dx2
      = ∫ x1 [ ∫ p(x1,x2) dx2 ] dx1
      = ∫ x1 p(x1) dx1
So the formula really is just the expectation of a one-variable distribution
The variances of x1 and x2 are calculated in a fashion analogous to the one-variable case, too:
σ²x1 = ∫∫ (x1 - x̄1)² p(x1,x2) dx1 dx2, with x̄1 = E(x1)
and similarly for σ²x2
Note, once again
σ²x1 = ∫∫ (x1 - x̄1)² p(x1,x2) dx1 dx2
     = ∫ (x1 - x̄1)² [ ∫ p(x1,x2) dx2 ] dx1
     = ∫ (x1 - x̄1)² p(x1) dx1
So the formula really is just the variance of a one-variable distribution
Note that in this distribution, if x1 is bigger than x̄1 then x2 is typically bigger than x̄2, and if x1 is smaller than x̄1 then x2 is typically smaller than x̄2
[Figure: contoured joint distribution p(x1,x2), with the expected value (x̄1, x̄2) marked]
This is a positive correlation
Conversely, in this distribution, if x1 is bigger than x̄1 then x2 is typically smaller than x̄2, and if x1 is smaller than x̄1 then x2 is typically bigger than x̄2
[Figure: contoured joint distribution p(x1,x2), with the expected value (x̄1, x̄2) marked]
This is a negative correlation
This correlation can be quantified by multiplying the distribution by a four-quadrant function
[Figure: the (x1, x2) plane divided into four quadrants about (x̄1, x̄2), marked + where the two deviations have the same sign and - where they differ]
and then integrating. The function (x1 - x̄1)(x2 - x̄2) works fine
cov(x1,x2) = ∫∫ (x1 - x̄1)(x2 - x̄2) p(x1,x2) dx1 dx2    called the "covariance"
Note that the vector x̄ with elements
x̄i = E(xi) = ∫∫ xi p(x1,x2) dx1 dx2 is the expectation of x
and the matrix Cx with elements Cx,ij = ∫∫ (xi - x̄i)(xj - x̄j) p(x1,x2) dx1 dx2
has diagonal elements equal to the variance of xi
Cx,ii = σ²xi
and off-diagonal elements equal to the covariance of xi and xj
Cx,ij = cov(xi, xj)
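In practice x̄ and Cx are often estimated from realizations rather than from these integrals; a minimal Python sketch (the "true" mean and covariance used to generate the samples are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw many realizations of a 2-vector x with a known mean and covariance
true_mean = np.array([1.0, 3.0])                  # illustrative values
true_cov  = np.array([[1.0, 0.8],
                      [0.8, 2.0]])                # positive covariance: positively correlated
x = rng.multivariate_normal(true_mean, true_cov, size=100_000)   # shape (N, 2)

xbar = x.mean(axis=0)             # estimate of the expectation, x̄
Cx   = np.cov(x, rowvar=False)    # estimate of the covariance matrix, Cx

print(xbar)   # ~[1.0, 3.0]
print(Cx)     # diagonal ~ variances, off-diagonal ~ cov(x1, x2) ~ 0.8
```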
“Center” of multivariate distribution: x̄
“Width” and “correlatedness” of multivariate distribution: Cx
Together they summarize a lot – but not everything – about a multivariate distribution
Functions of a set of random variables, x
A set of N random variables in a vector, x
Given y(x), do you remember how to transform the integral
∫…∫ p(x) dNx = ∫…∫ ? dNy
… ? dNy =
given y(x)
then
∫…∫ p(x) dNx = ∫…∫ p[x(y)] |dx/dy| dNy
where |dx/dy| is the Jacobian determinant, that is, the determinant of the matrix J whose elements are Jij = ∂xi/∂yj
But here’s something that’s EASIER …
Suppose y(x) is a linear function y=Mx
Then we can easily calculate the expectation of y
ȳi = E(yi) = ∫…∫ yi p(x1 … xN) dx1 … dxN
           = ∫…∫ Σj Mij xj p(x1 … xN) dx1 … dxN
           = Σj Mij ∫…∫ xj p(x1 … xN) dx1 … dxN
           = Σj Mij E(xj) = Σj Mij x̄j
So ȳ = M x̄
And we can easily calculate the covariance
Cy,ij = ∫…∫ (yi - ȳi)(yj - ȳj) p(x1 … xN) dx1 … dxN
      = ∫…∫ Σp Mip (xp - x̄p) Σq Mjq (xq - x̄q) p(x1 … xN) dx1 … dxN
      = Σp Mip Σq Mjq ∫…∫ (xp - x̄p)(xq - x̄q) p(x1 … xN) dx1 … dxN
      = Σp Mip Σq Mjq Cx,pq
So Cy = M Cx MT
Memorize!
Note that these rules work regardless of the distribution of x
if y is linearly related to x, y=Mx then
ȳ = M x̄ (rule for means)
Cy = M Cx MT (rule for propagating error)
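A Monte Carlo check of the two rules (a Python sketch; M, x̄, and Cx are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary linear map y = M x and arbitrary statistics for x (illustrative only)
M    = np.array([[1.0,  2.0],
                 [0.0,  1.0],
                 [3.0, -1.0]])
xbar = np.array([1.0, 2.0])
Cx   = np.array([[1.0, 0.5],
                 [0.5, 2.0]])

# The two rules: ȳ = M x̄ and Cy = M Cx MT
ybar_rule = M @ xbar
Cy_rule   = M @ Cx @ M.T

# Monte Carlo: realize many x's, map them through M, then measure the mean and covariance of y
x = rng.multivariate_normal(xbar, Cx, size=200_000)   # shape (N, 2)
y = x @ M.T                                           # shape (N, 3)

print(y.mean(axis=0), ybar_rule)        # the two agree to ~1%
print(np.cov(y, rowvar=False))          # ≈ Cy_rule
print(Cy_rule)
```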