Lecture 2: Probability and what it has to do with data analysis
![Page 1: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/1.jpg)
Lecture 2
Probability and what it has to do with
data analysis
![Page 2: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/2.jpg)
Abstraction
Random variable, x
it has no set value, until you ‘realize’ it
its properties are described by a probability, P
![Page 3: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/3.jpg)
One way to think about it: a pot containing an infinite number of x’s
Drawing one x from the pot “realizes” x
[Figure: distribution p(x) plotted against x]
![Page 4: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/4.jpg)
Describing P
If x can take on only discrete values,
say (1, 2, 3, 4, or 5)
then a table would work:
x 1 2 3 4 5
P 10% 30% 40% 15% 5%
Probabilities should sum to 100%
15% probability that x=4
![Page 5: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/5.jpg)
Sometimes you see probabilities written as fractions, instead of percentages
x 1 2 3 4 5
P 0.10 0.30 0.40 0.15 0.05
Probabilities should sum to 1
0.15 probability that x=4
And sometimes you see probabilities plotted as a histogram
[Figure: histogram of P(x) against x = 1 to 5; the bar at x=4 has height 0.15]
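The table description can be checked numerically. The following is a minimal sketch, using the percentages from the first table written as fractions; the seed and the number of draws are arbitrary choices made here, not part of the lecture.

```python
import random

# Table from the slides: values of x and their probabilities (as fractions).
values = [1, 2, 3, 4, 5]
probs = [0.10, 0.30, 0.40, 0.15, 0.05]

# The probabilities must sum to 1 (up to floating-point rounding).
total = sum(probs)

# "Realize" x many times by drawing from the pot.
rng = random.Random(0)  # fixed seed, an arbitrary choice for repeatability
draws = rng.choices(values, weights=probs, k=10_000)

# The fraction of draws equal to 4 should be close to P(4) = 0.15.
frac_4 = draws.count(4) / len(draws)
print(f"sum of P = {total:.2f}, fraction of draws with x=4 = {frac_4:.3f}")
```

The more realizations are drawn, the closer the observed fractions get to the tabulated probabilities.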
![Page 6: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/6.jpg)
If x can take on any value, then use a smooth function (or “distribution”) p(x) instead of a table
[Figure: curve p(x) against x, with the area between x1 and x2 shaded]
The probability that x is between x1 and x2 is equal to the shaded area
Mathematically, $P(x_1 < x < x_2) = \int_{x_1}^{x_2} p(x)\,dx$
![Page 7: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/7.jpg)
[Figure: curve p(x) against x]
Probability that x is between $-\infty$ and $+\infty$ is 100%, so total area = 1
Mathematically, $\int_{-\infty}^{+\infty} p(x)\,dx = 1$
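Both integrals can be approximated numerically. This sketch assumes a hypothetical density, the exponential p(x) = exp(-x) for x ≥ 0, chosen only because its integrals are known in closed form; the trapezoid rule and the upper cutoff of 50 are likewise illustrative choices.

```python
import math

def p(x):
    """Hypothetical density (an assumption): p(x) = exp(-x) for x >= 0."""
    return math.exp(-x) if x >= 0 else 0.0

def integrate(f, a, b, n=100_000):
    """Approximate the integral of f from a to b with the trapezoid rule."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        s += f(a + i * h)
    return s * h

# P(1 < x < 2) is the area under p(x) between x1 = 1 and x2 = 2.
prob_1_to_2 = integrate(p, 1.0, 2.0)

# The total area should be 1; the tail beyond x = 50 is negligible here.
total_area = integrate(p, 0.0, 50.0)
print(prob_1_to_2, total_area)
```

For this density the exact answer is exp(-1) - exp(-2), so the numerical area can be compared against it.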
![Page 8: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/8.jpg)
One Reason Why all this is relevant …
Any measurement of data that contains noise is treated as a random variable, d
and …
![Page 9: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/9.jpg)
The distribution p(d) embodies both the ‘true value’ of the datum being measured and the measurement noise
and …
![Page 10: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/10.jpg)
All quantities derived from a random variable are themselves random variables,
so …
![Page 11: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/11.jpg)
The algebra of random variables allows you to understand how …
… measurement noise affects inferences made from the data
![Page 12: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/12.jpg)
Basic Description of Distributions
want two basic numbers
1) something that describes what x’s commonly occur
2) something that describes the variability of the x’s
![Page 13: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/13.jpg)
1) something that describes what x’s commonly occur
that is, where the distribution is centered
![Page 14: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/14.jpg)
Mode
x at which the distribution has its peak
most-likely value of x
[Figure: curve p(x) against x, with the peak marked at x = xmode]
![Page 15: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/15.jpg)
The most popular car in the US is the Honda CR-V
But the next car you see on the highway will probably not be a Honda CR-V
Where’s a CR-V?
Honda CR-V
![Page 16: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/16.jpg)
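But modes can be deceptive …
[Figure: p(x) against x from 0 to 10, with the peak marked at xmode]
100 realizations of x

| x | N |
|---|---|
| 0-1 | 3 |
| 1-2 | 18 |
| 2-3 | 11 |
| 3-4 | 8 |
| 4-5 | 11 |
| 5-6 | 14 |
| 6-7 | 8 |
| 7-8 | 7 |
| 8-9 | 11 |
| 9-10 | 9 |

Sure, the 1-2 range has the most counts, but most of the measurements are bigger than 2!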
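The claim about the histogram can be verified directly from the counts. A short sketch, with the bin labels and counts copied from the table of 100 realizations:

```python
# Histogram of 100 realizations of x, copied from the slide.
bins = ["0-1", "1-2", "2-3", "3-4", "4-5", "5-6", "6-7", "7-8", "8-9", "9-10"]
counts = [3, 18, 11, 8, 11, 14, 8, 7, 11, 9]

total = sum(counts)

# The mode is the bin with the largest count.
mode_bin = bins[counts.index(max(counts))]

# Number of realizations bigger than 2: every bin after "1-2".
above_2 = sum(counts[2:])

print(f"mode bin: {mode_bin}, realizations above 2: {above_2} of {total}")
```

So the modal bin holds only 18 of the 100 realizations, while 79 of them lie above it.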
![Page 17: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/17.jpg)
Median
50% chance x is smaller than xmedian
50% chance x is bigger than xmedian
No special reason the median needs to coincide with the peak
[Figure: curve p(x) against x, split at xmedian into two areas of 50% each]
![Page 18: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/18.jpg)
Expected value or ‘mean’
value you would get if you took the mean of lots of realizations of x
Let’s examine a discrete distribution, for simplicity ...
[Figure: bar chart of P(x) against x = 1, 2, 3]
![Page 19: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/19.jpg)
Hypothetical table of 140 realizations of x

| x | N |
|---|---|
| 1 | 20 |
| 2 | 80 |
| 3 | 40 |
| Total | 140 |

mean = [ 20×1 + 80×2 + 40×3 ] / 140
= (20/140)×1 + (80/140)×2 + (40/140)×3
= p(1)×1 + p(2)×2 + p(3)×3
= Σi p(xi) xi
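The two ways of writing the mean give the same number. A small sketch using exact fractions, with the counts taken from the table of 140 realizations:

```python
from fractions import Fraction

# Counts N for each value of x, from the table of 140 realizations.
counts = {1: 20, 2: 80, 3: 40}
n_total = sum(counts.values())

# Direct mean: sum of all realized values divided by their number.
mean_direct = Fraction(sum(n * x for x, n in counts.items()), n_total)

# Probability-weighted mean: sum over i of p(x_i) * x_i, with p = N / total.
p = {x: Fraction(n, n_total) for x, n in counts.items()}
mean_weighted = sum(p[x] * x for x in p)

print(mean_direct, mean_weighted, float(mean_direct))
```

Both expressions reduce to 300/140 = 15/7, a little above 2.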
![Page 20: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/20.jpg)
By analogy, for a smooth distribution
Expected (or mean) value of x
$E(x) = \int_{-\infty}^{+\infty} x\,p(x)\,dx$
![Page 21: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/21.jpg)
2) something that describes the variability of the x’s
that is, the width of the distribution
![Page 22: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/22.jpg)
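Here’s a perfectly sensible way to define the width of a distribution…
[Figure: curve p(x) against x; W50 is the width of the central interval containing 50% of the probability, with 25% in each tail]
… it’s not used much, though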
![Page 23: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/23.jpg)
Width of a distribution
Here’s another way…
[Figure: curve p(x) against x, with the parabola $[x-E(x)]^2$ centered on E(x)]
… multiply and integrate
![Page 24: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/24.jpg)
Variance $= \sigma^2 = \int_{-\infty}^{+\infty} [x-E(x)]^2\,p(x)\,dx$
[Figure: three panels against x: p(x), the parabola $[x-E(x)]^2$, and their product $[x-E(x)]^2\,p(x)$, each with E(x) marked]
Compute the total area under the product …
Idea is that if distribution is narrow, then most of the probability lines up with the low spot of the parabola
But if it is wide, then some of the probability lines up with the high parts of the parabola
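The same multiply-and-sum idea works for a discrete distribution, where the integral becomes a sum. This sketch reuses the 20/80/40 table of realizations as the wide case; the narrow distribution is a hypothetical one added here for comparison.

```python
from fractions import Fraction

def mean_and_variance(p):
    """Return (E(x), variance) for a discrete distribution {x: p(x)}."""
    mean = sum(px * x for x, px in p.items())
    variance = sum(px * (x - mean) ** 2 for x, px in p.items())
    return mean, variance

# Distribution from the table of 140 realizations.
p_table = {1: Fraction(20, 140), 2: Fraction(80, 140), 3: Fraction(40, 140)}

# Hypothetical narrower distribution (an assumption, for comparison only).
p_narrow = {1: Fraction(1, 20), 2: Fraction(18, 20), 3: Fraction(1, 20)}

mean_table, var_table = mean_and_variance(p_table)
mean_narrow, var_narrow = mean_and_variance(p_narrow)
print(var_table, var_narrow)
```

In the narrow case most of the probability sits at the low spot of the parabola, so the weighted sum is smaller.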
![Page 25: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/25.jpg)
$\sqrt{\text{variance}} = \sigma$
A measure of width …
we don’t immediately know its relationship to area, though …
[Figure: curve p(x) against x, with a width of σ marked about E(x)]
![Page 26: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/26.jpg)
the Gaussian or normal distribution
$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\{ -(x-\bar{x})^2 / 2\sigma^2 \}$
$\bar{x}$ is expected value
$\sigma^2$ is variance
Memorize me !
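The formula translates directly into code. A minimal sketch; the function name `normal_pdf`, the step size, and the integration limits of ten standard deviations are choices made here for illustration.

```python
import math

def normal_pdf(x, mean, sigma):
    """Gaussian density with expected value `mean` and variance sigma**2."""
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return norm * math.exp(-((x - mean) ** 2) / (2.0 * sigma ** 2))

# Check that the total area is 1 for mean = 1, sigma = 1, using a simple
# Riemann sum from mean - 10 sigma to mean + 10 sigma.
dx = 0.001
area = sum(normal_pdf(-9.0 + i * dx, 1.0, 1.0) for i in range(20_001)) * dx

# The peak height is 1 / (sqrt(2 pi) sigma), so halving sigma doubles it.
peak_wide = normal_pdf(1.0, 1.0, 1.0)
peak_narrow = normal_pdf(3.0, 3.0, 0.5)
print(area, peak_wide, peak_narrow)
```

The factor in front of the exponential is what keeps the total area equal to 1 whatever the width.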
![Page 27: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/27.jpg)
Examples of Normal Distributions
[Figure: two plots of p(x) against x, one with $\bar{x} = 1$, $\sigma = 1$ and one with $\bar{x} = 3$, $\sigma = 0.5$]
![Page 28: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/28.jpg)
Properties of the normal distribution
Expectation = Median = Mode = $\bar{x}$
95% of probability within $2\sigma$ of the expected value
[Figure: curve p(x) against x, with 95% of the area between $\bar{x}-2\sigma$ and $\bar{x}+2\sigma$]
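The 95% figure is a rounded value, which can be checked with the error function from Python's standard library. For a normal distribution, the probability of falling within k standard deviations of the expected value is erf(k/√2); the helper name `prob_within` is introduced here.

```python
import math

def prob_within(k):
    """Probability that a normal variable lies within k sigma of its mean."""
    return math.erf(k / math.sqrt(2.0))

within_1 = prob_within(1.0)
within_2 = prob_within(2.0)
print(f"within 1 sigma: {within_1:.4f}, within 2 sigma: {within_2:.4f}")
```

The 2σ interval holds slightly more than 95% of the probability; the interval that holds 95% to four decimal places is about 1.96σ wide on each side.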
![Page 29: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/29.jpg)
Again, Why all this is relevant …
Inference depends on data …
You use measurement, d, to deduce the values of some underlying parameter of interest, m.
e.g. use measurements of travel time, d, to deduce the seismic velocity, m, of the earth
![Page 30: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/30.jpg)
model parameter, m, depends on measurement, d
so m is a function of d, m(d)
so …
![Page 31: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/31.jpg)
If data, d, is a random variable
then so is model parameter, m
All inferences made from uncertain data are themselves uncertain
Model parameters are described by a distribution, p(m)
![Page 32: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/32.jpg)
Functions of a random variable
any function of a random variable is itself a random variable
![Page 33: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/33.jpg)
Special case of a linear relationship and a normal distribution
Normal p(d) with mean $\bar{d}$ and variance $\sigma_d^2$
Linear relationship m = a d + b
Normal p(m) with mean $a\bar{d}+b$ and variance $a^2\sigma_d^2$
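A quick simulation illustrates the rule for the mean and variance of m. This is a sketch under assumed values: the numbers chosen for the mean and standard deviation of d, for a and b, the seed, and the sample size are all hypothetical.

```python
import random

rng = random.Random(1)  # fixed seed, an arbitrary choice for repeatability

# Hypothetical values (assumptions): d is normal with mean 2 and sigma 0.5.
d_mean, d_sigma = 2.0, 0.5
a, b = 3.0, 1.0

# Realize d many times and push each realization through m = a d + b.
d = [rng.gauss(d_mean, d_sigma) for _ in range(100_000)]
m = [a * di + b for di in d]

# Sample mean and variance of m.
m_mean = sum(m) / len(m)
m_var = sum((mi - m_mean) ** 2 for mi in m) / len(m)

# Values predicted by the rule: mean a*d_mean + b, variance a^2 * sigma_d^2.
predicted_mean = a * d_mean + b
predicted_var = a ** 2 * d_sigma ** 2
print(m_mean, predicted_mean, m_var, predicted_var)
```

The sample values agree with the predictions only up to sampling noise, which shrinks as more realizations are drawn.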
![Page 34: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/34.jpg)
multivariate distributions
![Page 35: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/35.jpg)
Example
Liberty island is inhabited by both pigeons and seagulls
40% of the birds are pigeons and 60% of the birds are gulls
50% of pigeons are white and 50% are grey
100% of gulls are white
![Page 36: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/36.jpg)
Two variables
species s takes two values
pigeon p
and gull g
color c takes two values
white w
and tan (grey) t
Of 100 birds,
20 are white pigeons
20 are grey pigeons
60 are white gulls
0 are grey gulls
![Page 37: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/37.jpg)
What is the probability that a random bird has species s and color c ?

| s \ c | w | t |
|---|---|---|
| p | 20% | 20% |
| g | 60% | 0% |

Note: sum of all boxes is 100%
![Page 38: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/38.jpg)
This is called the Joint Probability
and is written
P(s,c)
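The joint probability table follows from the bird counts by dividing by the total. A minimal sketch; the dictionary layout keyed by (species, color) is a choice made here, with p/g for pigeon/gull and w/t for the two colors.

```python
# Counts of birds by (species, color), from the example of 100 birds.
counts = {
    ("p", "w"): 20,  # white pigeons
    ("p", "t"): 20,  # grey pigeons
    ("g", "w"): 60,  # white gulls
    ("g", "t"): 0,   # grey gulls
}

n_birds = sum(counts.values())

# Joint probability P(s, c): fraction of all birds in each box.
joint = {key: n / n_birds for key, n in counts.items()}

for (s, c), prob in joint.items():
    print(f"P(s={s}, c={c}) = {prob:.2f}")
```

Because every bird falls in exactly one box, the entries of the table sum to 1.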
![Page 39: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/39.jpg)
Two continuous variables, say x1 and x2,
have a joint probability distribution, written
p(x1, x2)
with $\iint p(x_1, x_2)\,dx_1\,dx_2 = 1$
![Page 40: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/40.jpg)
You would contour a joint probability distribution
and it would look something like
[Figure: contour plot of p(x1, x2), with x1 and x2 as axes]
![Page 41: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/41.jpg)
What is the probability that a bird has color c ?
Start with P(s,c) and sum columns to get P(c)

| s \ c | w | t |
|---|---|---|
| p | 20% | 20% |
| g | 60% | 0% |
| P(c) | 80% | 20% |

Of 100 birds, 20 are white pigeons, 20 are grey pigeons, 60 are white gulls, 0 are grey gulls
![Page 42: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/42.jpg)
What is the probability that a bird has species s ?
Start with P(s,c) and sum rows to get P(s)

| s \ c | w | t | P(s) |
|---|---|---|---|
| p | 20% | 20% | 40% |
| g | 60% | 0% | 60% |

Of 100 birds, 20 are white pigeons, 20 are grey pigeons, 60 are white gulls, 0 are grey gulls
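Summing rows and columns of the joint table is a short loop. This sketch assumes the joint probabilities of the bird example stored in a dictionary keyed by (species, color); the variable names are choices made here.

```python
# Joint probability P(s, c) for the bird example.
joint = {
    ("p", "w"): 0.20, ("p", "t"): 0.20,
    ("g", "w"): 0.60, ("g", "t"): 0.00,
}

P_s = {}  # probability of each species, irrespective of color
P_c = {}  # probability of each color, irrespective of species
for (s, c), prob in joint.items():
    P_s[s] = P_s.get(s, 0.0) + prob  # sum along a row
    P_c[c] = P_c.get(c, 0.0) + prob  # sum down a column

print("P(s):", P_s)
print("P(c):", P_c)
```

The result matches the description of the island: 40% pigeons and 60% gulls, with 80% of all birds white.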
![Page 43: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/43.jpg)
These operations make sense with distributions, too
[Figure: contour plots of p(x1, x2), integrated along x2 to give p(x1) and along x1 to give p(x2)]
distribution of x1 (irrespective of x2): $p(x_1) = \int p(x_1,x_2)\,dx_2$
distribution of x2 (irrespective of x1): $p(x_2) = \int p(x_1,x_2)\,dx_1$
![Page 44: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/44.jpg)
Given that a bird is species s, what is the probability that it has color c ?

| s \ c | w | t |
|---|---|---|
| p | 50% | 50% |
| g | 100% | 0% |

Note, all rows sum to 100%
Of 100 birds, 20 are white pigeons, 20 are grey pigeons, 60 are white gulls, 0 are grey gulls
![Page 45: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/45.jpg)
This is called the Conditional Probability of c given s
and is written P(c|s)
similarly …
![Page 46: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/46.jpg)
Given that a bird is color c, what is the probability that it has species s ?

| s \ c | w | t |
|---|---|---|
| p | 25% | 100% |
| g | 75% | 0% |

Note, all columns sum to 100%
Of 100 birds, 20 are white pigeons, 20 are grey pigeons, 60 are white gulls, 0 are grey gulls
So 25% of white birds are pigeons
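Both conditional tables come from the joint table by dividing each entry by the appropriate row or column sum. A sketch for the bird example; the nested-dictionary layout is a choice made here.

```python
# Joint probability P(s, c) for the bird example.
joint = {
    ("p", "w"): 0.20, ("p", "t"): 0.20,
    ("g", "w"): 0.60, ("g", "t"): 0.00,
}
species = ("p", "g")
colors = ("w", "t")

# Row and column sums.
P_s = {s: sum(joint[(s, c)] for c in colors) for s in species}
P_c = {c: sum(joint[(s, c)] for s in species) for c in colors}

# P(c|s): divide each row by its sum, so every row sums to 1.
P_c_given_s = {s: {c: joint[(s, c)] / P_s[s] for c in colors} for s in species}

# P(s|c): divide each column by its sum, so every column sums to 1.
P_s_given_c = {c: {s: joint[(s, c)] / P_c[c] for s in species} for c in colors}

print("P(w|p) =", P_c_given_s["p"]["w"])
print("P(p|w) =", P_s_given_c["w"]["p"])
```

The two printed numbers differ: half of all pigeons are white, but only a quarter of white birds are pigeons.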
![Page 47: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/47.jpg)
This is called the Conditional Probability of s given c
and is written
P(s|c)
![Page 48: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/48.jpg)
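Beware! P(c|s) ≠ P(s|c)

P(c|s):

| s \ c | w | t |
|---|---|---|
| p | 50% | 50% |
| g | 100% | 0% |

P(s|c):

| s \ c | w | t |
|---|---|---|
| p | 25% | 100% |
| g | 75% | 0% |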
![Page 49: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/49.jpg)
Actor Patrick Swayze, pancreatic cancer victim
Lots of errors occur from confusing the two:
Probability that, if you have pancreatic cancer, you will die from it: 90%
Probability that, if you die, you will have died of pancreatic cancer: 1.4%
![Page 50: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/50.jpg)
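note
P(s,c) = P(s|c) P(c)

P(s,c), in %:

| s \ c | w | t |
|---|---|---|
| p | 20 | 20 |
| g | 60 | 0 |

P(s|c), in %:

| s \ c | w | t |
|---|---|---|
| p | 25 | 100 |
| g | 75 | 0 |

P(c), in %:

| c | w | t |
|---|---|---|
| P(c) | 80 | 20 |

Each entry of P(s,c) is the matching entry of P(s|c) times P(c) for its column, e.g. 25% of 80 is 20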
![Page 51: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/51.jpg)
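and
P(s,c) = P(c|s) P(s)

P(s,c), in %:

| s \ c | w | t |
|---|---|---|
| p | 20 | 20 |
| g | 60 | 0 |

P(c|s), in %:

| s \ c | w | t |
|---|---|---|
| p | 50 | 50 |
| g | 100 | 0 |

P(s), in %:

| s | P(s) |
|---|---|
| p | 40 |
| g | 60 |

Each entry of P(s,c) is the matching entry of P(c|s) times P(s) for its row, e.g. 50% of 40 is 20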
![Page 52: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/52.jpg)
and if
P(s,c) = P(s|c) P(c) = P(c|s) P(s)
then
P(s|c) = P(c|s) P(s) / P(c)
and
P(c|s) = P(s|c) P(c) / P(s)
… which is called Bayes Theorem
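Bayes Theorem can be exercised on the bird example by starting only from P(c|s) and P(s) and recovering P(s|c). A minimal sketch; P(c) is obtained here by summing P(c|s) P(s) over species, and the data layout is a choice made for illustration.

```python
species = ("p", "g")
colors = ("w", "t")

# Inputs: P(s) and P(c|s) from the bird example.
P_s = {"p": 0.4, "g": 0.6}
P_c_given_s = {
    "p": {"w": 0.5, "t": 0.5},
    "g": {"w": 1.0, "t": 0.0},
}

# P(c) = sum over s of P(c|s) P(s).
P_c = {c: sum(P_c_given_s[s][c] * P_s[s] for s in species) for c in colors}

# Bayes Theorem: P(s|c) = P(c|s) P(s) / P(c).
P_s_given_c = {
    c: {s: P_c_given_s[s][c] * P_s[s] / P_c[c] for s in species}
    for c in colors
}

print("P(p|w) =", P_s_given_c["w"]["p"])
print("P(p|t) =", P_s_given_c["t"]["p"])
```

The recovered values match the table built directly from the counts: a white bird is a pigeon 25% of the time, and a grey bird is always a pigeon.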
![Page 53: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/53.jpg)
In this example
bird color is the observable, the “data”
bird species is the “model parameter”
P(c|s), “color given species”, or P(d|m), is
“making a prediction based on the model”
Given a pigeon, what is the probability that it’s grey?
P(s|c), “species given color”, or P(m|d), is
“making an inference from the data”
Given a grey bird, what is the probability that it’s a pigeon?
![Page 54: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/54.jpg)
Why Bayes Theorem is important
It provides a framework for relating
making a prediction from the model, P(d|m)
to
making an inference from the data, P(m|d)
![Page 55: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/55.jpg)
Bayes Theorem also implies that the joint distribution of data and model parameters
p(d, m)
is the fundamental quantity
If you know p(d, m), you know everything there is to know …
![Page 56: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/56.jpg)
Expectation
Variance
And
Covariance
Of a multivariate distribution
![Page 57: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/57.jpg)
The expectation is computed by first reducing the distribution to one dimension
[Figure: contour plots of p(x1, x2) reduced to the one-dimensional distributions p(x1) and p(x2)]
take the expectation of p(x1) to get $\bar{x}_1$
take the expectation of p(x2) to get $\bar{x}_2$
![Page 58: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/58.jpg)
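The variance is also computed by first reducing the distribution to one dimension
[Figure: contour plots of p(x1, x2) reduced to the one-dimensional distributions p(x1) and p(x2), with widths $\sigma_1$ and $\sigma_2$ marked]
take the variance of p(x1) to get $\sigma_1^2$
take the variance of p(x2) to get $\sigma_2^2$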
![Page 59: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/59.jpg)
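Note that in this distribution, if x1 is bigger than $\bar{x}_1$, then x2 tends to be bigger than $\bar{x}_2$, and if x1 is smaller than $\bar{x}_1$, then x2 tends to be smaller than $\bar{x}_2$
[Figure: contour plot of p(x1, x2) with the expected value ($\bar{x}_1$, $\bar{x}_2$) marked]
This is a positive correlation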
![Page 60: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/60.jpg)
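Conversely, in this distribution, if x1 is bigger than $\bar{x}_1$, then x2 tends to be smaller than $\bar{x}_2$, and if x1 is smaller than $\bar{x}_1$, then x2 tends to be bigger than $\bar{x}_2$
[Figure: contour plot of p(x1, x2) with the expected value ($\bar{x}_1$, $\bar{x}_2$) marked]
This is a negative correlation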
![Page 61: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/61.jpg)
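This correlation can be quantified by multiplying the distribution by a four-quadrant function
[Figure: contour plot of p(x1, x2) overlaid with four quadrants centered on ($\bar{x}_1$, $\bar{x}_2$), signed + where x1 and x2 deviate in the same direction and - where they deviate in opposite directions]
And then integrating. The function $(x_1-\bar{x}_1)(x_2-\bar{x}_2)$ works fine
$C = \iint (x_1-\bar{x}_1)(x_2-\bar{x}_2)\,p(x_1,x_2)\,dx_1\,dx_2$
Called the “covariance”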
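For a discrete joint distribution the double integral becomes a double sum. This sketch uses a hypothetical joint distribution over x1, x2 ∈ {0, 1} (an assumption made here, not from the lecture), once with most of the probability where x1 and x2 agree and once where they disagree.

```python
def covariance(joint):
    """Covariance of x1 and x2 for a discrete joint {(x1, x2): p}."""
    mean1 = sum(p * x1 for (x1, x2), p in joint.items())
    mean2 = sum(p * x2 for (x1, x2), p in joint.items())
    return sum(p * (x1 - mean1) * (x2 - mean2) for (x1, x2), p in joint.items())

# Hypothetical joint distribution: x1 and x2 usually take the same value.
joint_pos = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Hypothetical joint distribution: x1 and x2 usually take opposite values.
joint_neg = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.4, (1, 1): 0.1}

cov_pos = covariance(joint_pos)
cov_neg = covariance(joint_neg)
print(cov_pos, cov_neg)
```

The sign of the result follows the four-quadrant picture: probability in the + quadrants pushes the covariance up, probability in the - quadrants pushes it down.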
![Page 62: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/62.jpg)
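Note that the matrix C with elements
$C_{ij} = \iint (x_i-\bar{x}_i)(x_j-\bar{x}_j)\,p(x_1,x_2)\,dx_1\,dx_2$
has diagonal elements of $\sigma_{x_i}^2$, the variance of xi,
and off-diagonal elements of cov(xi,xj), the covariance of xi and xj
For three variables:
$$C = \begin{pmatrix} \sigma_1^2 & \mathrm{cov}(x_1,x_2) & \mathrm{cov}(x_1,x_3) \\ \mathrm{cov}(x_1,x_2) & \sigma_2^2 & \mathrm{cov}(x_2,x_3) \\ \mathrm{cov}(x_1,x_3) & \mathrm{cov}(x_2,x_3) & \sigma_3^2 \end{pmatrix}$$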
![Page 63: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/63.jpg)
The “vector of means” of a multivariate distribution, $\bar{\mathbf{x}}$,
and the “covariance matrix” of a multivariate distribution, $C_x$,
summarize a lot (but not everything) about a multivariate distribution
![Page 64: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/64.jpg)
Functions of a set of random variables, x
A set of N random variables in a vector, x
![Page 65: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/65.jpg)
Special Case
linear function y=Mx
the expectation of y is
$\bar{y} = M\bar{x}$
Memorize!
![Page 66: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/66.jpg)
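the covariance of y is
$C_y = E[(y-\bar{y})(y-\bar{y})^T] = E[M(x-\bar{x})(x-\bar{x})^T M^T] = M\,E[(x-\bar{x})(x-\bar{x})^T]\,M^T$
So $C_y = M C_x M^T$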
Memorize!
![Page 67: Lecture 2 Probability and what it has to do with data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062517/5681386a550346895da01b33/html5/thumbnails/67.jpg)
Note that these rules work regardless of the distribution of x
if y is linearly related to x, y=Mx, then
$\bar{y} = M\bar{x}$ (rule for means)
$C_y = M C_x M^T$ (rule for propagating error)
Memorize!
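Both rules are plain matrix products. The sketch below applies them to a hypothetical two-variable case; the matrix M (sum and difference of x1 and x2), the means, and the covariance matrix are assumptions chosen so the result can be checked by hand, and the small helper functions stand in for a linear algebra library.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]

def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*A)]

# Hypothetical inputs (assumptions): y1 = x1 + x2 and y2 = x1 - x2.
M = [[1.0, 1.0], [1.0, -1.0]]
x_mean = [[1.0], [2.0]]            # column vector of means of x
C_x = [[1.0, 0.5], [0.5, 2.0]]     # covariance matrix of x

# Rule for means: y_mean = M x_mean.
y_mean = matmul(M, x_mean)

# Rule for propagating error: C_y = M C_x M^T.
C_y = matmul(matmul(M, C_x), transpose(M))

print("y_mean =", y_mean)
print("C_y =", C_y)
```

The diagonal of C_y agrees with the familiar scalar results: var(x1 + x2) = 1 + 2 + 2(0.5) = 4 and var(x1 - x2) = 1 + 2 - 2(0.5) = 2.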