Mathematical Tools for Neural and Cognitive Science
Probability, Statistics and Inference
Fall semester, 2017
Probability: an abstract mathematical framework for describing random quantities (e.g., measurements)
Statistics: use of probability to summarize, analyze, interpret data. Fundamental to all experimental science.
Probabilistic Middleville
The stork delivers boys and girls randomly, with equal probability.
probabilistic model
data
statistical inference
In Middleville, every family has two children, brought by the stork.
You pick a family at random and discover that one of the children is a girl.
What is the probability that the other child is a girl?
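The question can be checked by simulation. The sketch below assumes that "one of the children is a girl" means "at least one child is a girl"; the function name and the number of simulated families are illustrative choices, not part of the slides.

```python
import random

def simulate_two_child_families(n_families, seed=0):
    """Estimate P(both girls | at least one girl) by simulation."""
    rng = random.Random(seed)
    at_least_one_girl = 0
    both_girls = 0
    for _ in range(n_families):
        # The stork delivers each child independently, B or G with equal probability.
        children = [rng.choice("BG") for _ in range(2)]
        if "G" in children:
            at_least_one_girl += 1
            if children == ["G", "G"]:
                both_girls += 1
    return both_girls / at_least_one_girl

estimate = simulate_two_child_families(200_000)
print(f"P(other child is a girl | at least one girl) ~ {estimate:.3f}")  # close to 1/3
```

Under this reading the answer is 1/3, not 1/2: of the three equally likely families with at least one girl (GG, GB, BG), only one has two girls.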
Statistical Middleville
In Middleville, every family has two children, brought by the stork.
You pick a family at random and discover that one of the children is a girl.
What is the probability that the other child is a girl?
In a survey of 100 Middleville families, 32 have two girls, 24 have two boys, and the remainder have one of each.
The stork delivers boys and girls randomly, with equal probability.
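With survey data in hand, the same question can be answered from counts rather than from the equal-probability model. A minimal sketch, again assuming "one of the children is a girl" means "at least one girl" (variable names are illustrative):

```python
# Survey counts from the slide: 100 families.
n_families = 100
n_two_girls = 32
n_two_boys = 24
n_mixed = n_families - n_two_girls - n_two_boys  # the remainder: one of each

# Families consistent with the observation "at least one girl".
n_at_least_one_girl = n_two_girls + n_mixed

estimate_from_data = n_two_girls / n_at_least_one_girl
print(f"{n_two_girls}/{n_at_least_one_girl} = {estimate_from_data:.3f}")
```

The estimate from the survey differs from the model-based answer of 1/3; whether that difference is meaningful for a sample of 100 families is exactly the kind of question statistical inference addresses.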
- Efron & Tibshirani, Introduction to the Bootstrap
Some historical context
• 1600’s: Early notions of data summary/averaging
• 1700’s: Bayesian prob/statistics (Bayes, Laplace)
• 1920’s: Frequentist statistics for science (e.g., Fisher)
• 1940’s: Statistical signal analysis and communication, estimation/decision theory (Shannon, Wiener, etc)
• 1970’s: Computational optimization and simulation (e.g., Tukey)
• 1990’s: Machine learning (large-scale computing + statistical inference + lots of data)
• Since 1950’s: statistical neural/cognitive models
Scientific process
Summarize/fit, compare with predictions
Create/modify hypothesis/model
Generate predictions, design experiment
Observe / measure data
Estimating model parameters
• How do I compute the estimate? (mathematics vs. numerical optimization)
• How “good” are my estimates?
• How well does my model explain the data? Future data (prediction/generalization)?
• How do I compare two (or more) models?
Outline of what’s coming
Themes:
• Uni-variate vs. multi-variate
• Discrete vs. continuous
• Math vs. simulation
• Bayesian vs. frequentist inference
Topics:
• Descriptive statistics
• Basic probability theory: univariate, multivariate
• Model parameter estimation
• Hypothesis testing / model comparison
Example: Localization
Issues: Mean and variability (accuracy and precision)
Descriptive statistics: Central tendency
• We often summarize data with the average. Why?
• Average minimizes the squared error (think regression!)
$$\arg\min_x \frac{1}{N}\sum_{n=1}^{N}(x_n - x)^2 = \frac{1}{N}\sum_{n=1}^{N} x_n$$
• More generally, for Lp norms:
$$\left[\frac{1}{N}\sum_{n=1}^{N}|x_n - x|^p\right]^{1/p}$$
• minimum L1 norm: median
• minimum L0 norm: mode
• Issues: Data from a common source, outliers, asymmetry, bimodality
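The claims that the average minimizes squared error and the median minimizes the L1 error can be checked by brute force. This is a sketch on a small hypothetical data set with one outlier; the grid search stands in for an analytic derivation.

```python
import statistics

data = [1.0, 2.0, 2.0, 3.0, 10.0]  # hypothetical sample with one outlier

def mean_squared_error(center):
    return sum((x - center) ** 2 for x in data) / len(data)

def mean_absolute_error(center):
    return sum(abs(x - center) for x in data) / len(data)

# Brute-force search over a fine grid of candidate centers in [0, 10].
candidates = [i / 100 for i in range(0, 1001)]
best_l2 = min(candidates, key=mean_squared_error)
best_l1 = min(candidates, key=mean_absolute_error)

print(best_l2, statistics.mean(data))    # L2 minimizer agrees with the mean
print(best_l1, statistics.median(data))  # L1 minimizer agrees with the median
```

Note how the outlier pulls the mean well above the median, which is one reason the choice of norm matters.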
Descriptive statistics: Dispersion
• Sample variance
$$s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2$$
• Why N-1?
• Sample standard deviation
• Mean absolute deviation
$$\frac{1}{N}\sum_{i=1}^{N}|x_i - \bar{x}|$$
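One way to see why the sample variance divides by N-1 is to simulate many small samples and average the two candidate estimators. The sketch below assumes Gaussian data with a true variance of 4; the sample size and repeat count are arbitrary illustrative choices.

```python
import random

rng = random.Random(1)
true_variance = 4.0   # samples drawn from a Gaussian with sigma = 2 (assumed)
n = 5                 # small sample size, where the correction matters most
n_repeats = 100_000

sum_biased = 0.0
sum_unbiased = 0.0
for _ in range(n_repeats):
    xs = [rng.gauss(0.0, 2.0) for _ in range(n)]
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)   # squared deviations from the SAMPLE mean
    sum_biased += ss / n                    # divide by N
    sum_unbiased += ss / (n - 1)            # divide by N-1

avg_biased = sum_biased / n_repeats
avg_unbiased = sum_unbiased / n_repeats
print(avg_biased, avg_unbiased)   # roughly (N-1)/N * 4 = 3.2 versus 4.0
```

Dividing by N underestimates the variance on average because deviations are measured from the sample mean, which is itself fit to the data.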
Example: Localization
I find that $\bar{x} \neq 0$. Is that convincing? Is the apparent bias real?
To answer this, we need tools from probability…
Probability: notation
let X, Y, Z be random variables
they can take on values (like ‘heads’ or ‘tails’; or integers 1-6; or real-valued numbers)
let x, y, z stand generically for values they can take, and also, in shorthand, for events like X = x
we write the probability that X takes on value x as P(X = x), or $P_X(x)$, or sometimes just P(x)
P(x) is a function over x, which we call the probability “distribution” function (pdf) (or, for continuous variables only, “density”)
Discrete pdf, P(x): a distribution (the sum of 2 dice rolls)
Continuous pdf, p(x): another distribution (IQ of a randomly chosen person)
Normalization
Discrete, P(x): $0 < P(x_i) < 1$, $\sum_i P(x_i) = 1$
Continuous, p(x): $0 < p(x)$, $\int_{-\infty}^{\infty} p(x)\,dx = 1$, $P(a < x < b) = \int_a^b p(x)\,dx$
Probability basics
• discrete probability distributions
• continuous probability densities
• cumulative distributions
• translation and scaling of distributions
• monotonic nonlinear transformations
• drawing samples from a distribution. Uniform. Inverse cumulative mapping
• example densities/distributions
[on board]
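The "inverse cumulative mapping" item can be illustrated with an exponential density, whose cumulative distribution has a closed-form inverse. This is a sketch only: the choice of the exponential and the rate value are assumptions made for illustration.

```python
import math
import random

def sample_exponential(rate, rng):
    """Inverse-CDF sampling: F(x) = 1 - exp(-rate * x), so x = -ln(1 - u) / rate."""
    u = rng.random()              # uniform sample on [0, 1)
    return -math.log(1.0 - u) / rate

rng = random.Random(2)
rate = 2.0                        # hypothetical rate parameter
samples = [sample_exponential(rate, rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
print(sample_mean)                # should be near 1 / rate = 0.5
```

The same recipe works for any distribution whose cumulative function can be inverted: draw a uniform sample, then map it through the inverse cumulative.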
Example distributions
[Figure: example distributions. Panels: a not-quite-fair coin; roll of a fair die; sum of two rolled fair dice; clicks of a Geiger counter, in a fixed time interval (... and, time between clicks); horizontal velocity of gas molecules exiting a fan]
Expected value - discrete
$$E(X) = \sum_{i=1}^{N} x_i\, p(x_i)$$ [the mean, $\mu$]
[Figure: histogram of # of students (×10^4) against # of credit cards (0 to 4), and the corresponding distribution P(x)]
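As a worked instance of the discrete expected value, the sketch below uses the sum of two fair dice (one of the example distributions mentioned earlier), since the credit-card counts from the figure are not recoverable. Exact fractions avoid rounding.

```python
from fractions import Fraction
from itertools import product

# Distribution of the sum of two fair dice, built by enumerating all 36 outcomes.
p = {}
for a, b in product(range(1, 7), repeat=2):
    p[a + b] = p.get(a + b, 0) + Fraction(1, 36)

assert sum(p.values()) == 1                          # normalization
expected_value = sum(x * px for x, px in p.items())  # E(X) = sum_i x_i p(x_i)
print(expected_value)                                # 7
```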
Expected value - continuous
$$E(x) = \int x\, p(x)\, dx$$ [the mean, $\mu$]
$$E(x^2) = \int x^2\, p(x)\, dx$$ [the “second moment”]
$$E\left((x-\mu)^2\right) = \int (x-\mu)^2\, p(x)\, dx = \int x^2\, p(x)\, dx - \mu^2$$ [the variance, $\sigma^2$]
$$E(f(x)) = \int f(x)\, p(x)\, dx$$
note: an inner product, and thus linear, i.e.,
$$E(a f(X) + b g(X)) = a E(f(X)) + b E(g(X))$$
Joint and conditional probability - discrete
P(Ace) P(Heart) P(Ace & Heart) P(Ace | Heart) P(not Jack of Diamonds) P(Ace | not Jack of Diamonds)
“Independence”
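The card probabilities listed above can be computed by counting over a standard 52-card deck. A minimal sketch; the helper `prob` and the event names are illustrative, not from the slides.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))               # 52 equally likely cards

def prob(event, given=lambda card: True):
    """P(event | given), computed by counting cards."""
    conditioned = [c for c in deck if given(c)]
    return Fraction(sum(1 for c in conditioned if event(c)), len(conditioned))

is_ace = lambda c: c[0] == "A"
is_heart = lambda c: c[1] == "hearts"
not_jd = lambda c: c != ("J", "diamonds")

print(prob(is_ace))                               # 1/13
print(prob(is_heart))                             # 1/4
print(prob(lambda c: is_ace(c) and is_heart(c)))  # 1/52
print(prob(is_ace, given=is_heart))               # 1/13, equal to P(Ace)
print(prob(not_jd))                               # 51/52
print(prob(is_ace, given=not_jd))                 # 4/51, not equal to P(Ace)
```

Knowing the suit says nothing about whether the card is an ace, so those two events are independent; knowing the card is not the Jack of Diamonds changes the odds slightly, so those two are not.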
• Joint distributions
• Marginals (integrating)
• Conditionals (slicing)
• Bayes’ Rule (inverting)
• Statistical independence (separability)
Multi-variate probability
[on board]
Joint distribution
p(x, y)
Marginal distribution
$$p(x) = \int p(x, y)\,dy$$
Conditional probability
[Venn diagram: A, B, their overlap A & B, and the region outside both (neither A nor B)]
p(A | B) = probability of A given that B is asserted to be true = p(A & B) / p(B)
Conditional distribution
[Figure: joint distribution p(x, y) and the conditional slice P(x|Y=68)]
$$p(x|y=68) = p(x, y=68) \Big/ \int p(x, y=68)\,dx = p(x, y=68) \big/ p(y=68)$$
slice joint distribution, then normalize (by marginal)
More generally: p(x|y) = p(x, y)/p(y)
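The slice-and-normalize recipe is easy to see on a small discrete table. The joint distribution below is hypothetical, chosen only so the arithmetic is simple.

```python
# Hypothetical joint distribution P(x, y) over x in {0, 1, 2} and y in {0, 1}.
joint = {
    (0, 0): 0.10, (1, 0): 0.20, (2, 0): 0.10,
    (0, 1): 0.30, (1, 1): 0.15, (2, 1): 0.15,
}

def marginal_y(y):
    """P(y): sum the joint over x."""
    return sum(p for (xi, yi), p in joint.items() if yi == y)

def conditional_x_given_y(y):
    """P(x | y): slice the joint at y, then normalize by the marginal P(y)."""
    p_y = marginal_y(y)
    return {xi: p / p_y for (xi, yi), p in joint.items() if yi == y}

cond = conditional_x_given_y(1)
print(cond)   # the slice at y = 1, rescaled so that it sums to 1
```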
Bayes’ Rule
[Venn diagram: A, B, and their overlap A & B]
p(A | B) = probability of A given that B is asserted to be true = p(A & B) / p(B)
$$p(A \,\&\, B) = p(B)\,p(A|B) = p(A)\,p(B|A)$$
$$\Rightarrow\; p(A|B) = \frac{p(B|A)\,p(A)}{p(B)}$$
Bayes’ Rule
p(x|y) = p(y|x) p(x)/p(y)
(a direct consequence of the definition of conditional probability)
Conditional vs. marginal
[Figure: the conditional P(x|Y=120) compared with the marginal P(x)]
In general, these differ. When are they the same? In particular, when are all conditionals equal to the marginal?
Statistical independence
Random variables X and Y are statistically independent if (and only if):
p(x, y) = p(x) p(y) ∀ x, y
[note: for discrete distributions, this is an outer product!]
Independence implies that all conditionals are equal to the corresponding marginal:
p(x | y) = p(x, y) / p(y) = p(x) ∀ x, y
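The outer-product remark can be made concrete. In this sketch the two marginals are hypothetical; the joint table is built as their outer product and each conditional is then recovered by slicing and normalizing.

```python
# Hypothetical marginals for two independent discrete variables.
p_x = [0.2, 0.5, 0.3]
p_y = [0.6, 0.4]

# Under independence, the joint table is the outer product of the marginals.
joint = [[px * py for py in p_y] for px in p_x]

# Every conditional p(x | y) then equals the marginal p(x).
for j, py in enumerate(p_y):
    conditional = [joint[i][j] / py for i in range(len(p_x))]
    print(conditional)   # matches p_x up to floating-point rounding
```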
Sums of independent RVs
For any two random variables (independent or not):
E(X + Y) = E(X) + E(Y)
Suppose X and Y are independent, then
E(XY) = E(X)E(Y)
$$\sigma_{X+Y}^2 = E\left(\left((X+Y) - (\mu_X + \mu_Y)\right)^2\right) = \sigma_X^2 + \sigma_Y^2$$
and $p_{X+Y}(z)$ is a convolution
Implications: (1) Sums of Gaussians are Gaussian, (2) Properties of the sample average
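The convolution and variance-addition properties can be verified exactly for two fair dice. A sketch with illustrative helper names, using exact fractions:

```python
from fractions import Fraction

die = {k: Fraction(1, 6) for k in range(1, 7)}   # fair die

def convolve(p, q):
    """Distribution of X + Y for independent X ~ p and Y ~ q."""
    out = {}
    for x, px in p.items():
        for y, qy in q.items():
            out[x + y] = out.get(x + y, 0) + px * qy
    return out

def mean(p):
    return sum(x * px for x, px in p.items())

def variance(p):
    m = mean(p)
    return sum((x - m) ** 2 * px for x, px in p.items())

two_dice = convolve(die, die)
print(mean(two_dice), variance(two_dice))        # 7 and 35/6
print(2 * mean(die), 2 * variance(die))          # the same: means and variances add
```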
• Mean and variance summarize centroid/width
• translation and rescaling of random variables
• nonlinear transformations - “warping”
• Mean/variance of weighted sum of random variables
• The sample average
• ... converges to true mean (except for bizarre distributions)
• ... with variance $\sigma^2/N$ (for N independent samples)
• ... most common choice for an estimate ...
Mean and variance
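For N independent samples with variance σ², the sample average has variance σ²/N. The simulation below is a sketch of that fact under assumed values (Gaussian data, σ = 2, N = 25).

```python
import random
import statistics

rng = random.Random(3)
sigma = 2.0          # assumed true standard deviation
n = 25               # samples per average
n_repeats = 20_000

# Compute many independent sample averages, each from n Gaussian samples.
averages = [
    sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
    for _ in range(n_repeats)
]
var_of_average = statistics.pvariance(averages)
print(var_of_average, sigma ** 2 / n)   # the two numbers should be close
```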
• Estimator: Any function of the data, intended to compute an estimate of the true value of a parameter
• The most common estimator is the sample average, used to estimate the true mean of the distribution.
• Statistically-motivated examples:
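- Maximum likelihood (ML): the parameter value that maximizes the likelihood, p(data | parameter value)
- Max a posteriori (MAP): the parameter value that maximizes the posterior, p(parameter value | data)
- Min Mean Squared Error (MMSE): the mean of the posterior, which minimizes the expected squared error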
Point Estimates
Example: Estimate the bias of a coin
Bayes’ Rule and Estimation
p(parameter value | data) = p(data | parameter value) p(parameter value) / p(data)
Posterior: p(parameter value | data)
Likelihood: p(data | parameter value)
Prior: p(parameter value)
Nuisance normalizing term: p(data)
[Figure: likelihood for 1 head and likelihood for 1 tail, followed by a grid of posteriors for H = 0, 1, 2, 3 (more heads) and T = 0, 1, 2, 3 (more tails)]
Posteriors, p(x|H,T) ∝ p(H,T|x), assuming prior p(x)=1
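A posterior over the coin's probability of heads can be computed numerically on a grid. This sketch assumes a flat prior, as on the slide, and hypothetical data of 2 heads and 1 tail; it also reads off two point estimates.

```python
# Grid over x, the probability of heads.
n_grid = 1001
xs = [i / (n_grid - 1) for i in range(n_grid)]
dx = xs[1] - xs[0]

heads, tails = 2, 1                                   # hypothetical data
prior = [1.0 for _ in xs]                             # flat prior, p(x) = 1
likelihood = [x ** heads * (1 - x) ** tails for x in xs]

unnormalized = [l * p for l, p in zip(likelihood, prior)]
norm = sum(unnormalized) * dx                         # approximates p(data)
posterior = [u / norm for u in unnormalized]

map_estimate = xs[posterior.index(max(posterior))]               # posterior mode
mmse_estimate = sum(x * p for x, p in zip(xs, posterior)) * dx   # posterior mean
print(map_estimate, mmse_estimate)
```

With a flat prior the MAP estimate coincides with the maximum likelihood estimate H/(H+T), while the posterior mean is pulled toward 0.5.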
Example: infer whether a coin is fair by flipping it repeatedly. Here, x is the probability of heads (50% is fair), and y_1, ..., y_n are the outcomes of the flips.
Consider three different priors: suspect fair, suspect biased, no idea.
[Figure: one column of panels per prior (prior fair, prior biased, prior uncertain), updated one flip at a time:
prior × likelihood (heads) = posterior
previous posteriors × likelihood (heads) = new posterior
previous posteriors × likelihood (tails) = new posterior]
Posteriors after observing 75 heads, 25 tails
→ prior differences are ultimately overwhelmed by data
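The sequential updating can be sketched on a grid. The three prior shapes below are hypothetical stand-ins for "suspect fair", "suspect biased", and "no idea"; the slides do not specify their functional forms.

```python
n_grid = 1001
xs = [i / (n_grid - 1) for i in range(n_grid)]   # candidate values of x = P(heads)

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

# Three hypothetical priors over x (assumed shapes, not taken from the slides).
priors = {
    "suspect fair": normalize([x ** 5 * (1 - x) ** 5 for x in xs]),   # peaked at 0.5
    "suspect biased": normalize([(x - 0.5) ** 2 for x in xs]),        # mass away from 0.5
    "no idea": normalize([1.0 for _ in xs]),                          # flat
}

def update(belief, flip):
    """One Bayesian update: multiply by the likelihood of one flip, renormalize."""
    like = [x if flip == "H" else 1 - x for x in xs]
    return normalize([b * l for b, l in zip(belief, like)])

flips = ["H"] * 75 + ["T"] * 25   # the flip order does not change the final posterior

posterior_means = {}
for name, belief in priors.items():
    for flip in flips:
        belief = update(belief, flip)
    posterior_means[name] = sum(x * b for x, b in zip(xs, belief))

print(posterior_means)   # all three land near 0.75 despite different priors
```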
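Confidence
[Figure: posterior PDFs and CDFs for 2H / 1T, 10H / 5T, and 20H / 10T. The .025 and .975 levels of the CDF mark the interval endpoints; labeled endpoints include .19 and .93, and .49 and .80.]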
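Interval endpoints like these can be read off a numerically computed posterior CDF. The sketch below assumes a flat prior on x and uses the 2.5% and 97.5% levels; the function name is illustrative.

```python
n_grid = 10_001
xs = [i / (n_grid - 1) for i in range(n_grid)]

def credible_interval(heads, tails, lower=0.025, upper=0.975):
    """Read interval endpoints off the posterior CDF (flat prior on x assumed)."""
    density = [x ** heads * (1 - x) ** tails for x in xs]
    total = sum(density)
    cdf, running = [], 0.0
    for d in density:
        running += d / total
        cdf.append(running)
    # First grid point at which the CDF reaches each level.
    lo = next(x for x, c in zip(xs, cdf) if c >= lower)
    hi = next(x for x, c in zip(xs, cdf) if c >= upper)
    return lo, hi

for heads, tails in [(2, 1), (10, 5), (20, 10)]:
    lo, hi = credible_interval(heads, tails)
    print(f"{heads}H / {tails}T: [{lo:.2f}, {hi:.2f}]")
```

The ratio of heads to tails is the same in all three cases, but the interval narrows as the number of flips grows.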
Bias & Variance
• Mean squared error = bias^2 + variance
• Bias is difficult to assess (requires knowing the “true” value). Variance is easier.
• Classical statistics generally aims for an unbiased estimator, with minimal variance (“MVUE”).
• The MLE is asymptotically unbiased (under fairly general conditions), but this is only useful if
- the likelihood model is correct
- the optimum can be computed
- you have enough data
• More general/modern view: estimation is about trading off bias and variance, through model selection, “regularization”, or Bayesian priors.
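The decomposition MSE = bias^2 + variance, and the trade-off it implies, can be computed exactly for a coin. This sketch compares the ML estimate H/N with a shrunken estimate (H+1)/(N+2), which is the posterior mean under a flat prior; the true bias of 0.6 and N = 10 are hypothetical.

```python
from math import comb

def estimator_stats(estimator, n, p_true):
    """Exact bias, variance, and MSE of a coin-bias estimator over all outcomes."""
    pmf = [comb(n, h) * p_true ** h * (1 - p_true) ** (n - h) for h in range(n + 1)]
    estimates = [estimator(h, n) for h in range(n + 1)]
    mean = sum(w * e for w, e in zip(pmf, estimates))
    bias = mean - p_true
    variance = sum(w * (e - mean) ** 2 for w, e in zip(pmf, estimates))
    mse = sum(w * (e - p_true) ** 2 for w, e in zip(pmf, estimates))
    return bias, variance, mse

n, p_true = 10, 0.6                                                  # hypothetical experiment
ml = estimator_stats(lambda h, n: h / n, n, p_true)                  # unbiased
shrunk = estimator_stats(lambda h, n: (h + 1) / (n + 2), n, p_true)  # biased toward 0.5

for name, (bias, variance, mse) in [("ML", ml), ("shrunk", shrunk)]:
    print(name, bias, variance, mse, bias ** 2 + variance)
```

In this particular setting the biased estimator has the lower mean squared error, because its reduction in variance outweighs the squared bias it introduces.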
• Is the coin fair? Compared to what?
• Point hypotheses:
Bayesian Model Comparison
M1 : p = p1 = 0.5
M2 : p = p2 = 0.6
$$p(M_1|D) = \frac{p(D|M_1)P(M_1)}{p(D)} = \frac{p(D|M_1)P(p_1)}{p(D)}$$
Assuming equal priors over models the Bayes factor is
$$\frac{p(M_1|D)}{p(M_2|D)} = \frac{p(D|M_1)P(M_1)}{p(D|M_2)P(M_2)} = \frac{p(D|M_1)P(p_1)}{p(D|M_2)P(p_2)}$$
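For two point hypotheses the Bayes factor is just a likelihood ratio. The sketch below uses hypothetical data (60 heads, 40 tails) and works in logs to avoid numerical underflow.

```python
import math

def log_likelihood(p, heads, tails):
    """Log probability of a specific flip sequence with the given counts."""
    return heads * math.log(p) + tails * math.log(1 - p)

heads, tails = 60, 40          # hypothetical data D
p1, p2 = 0.5, 0.6              # the two point hypotheses from the slide

log_bf = log_likelihood(p1, heads, tails) - log_likelihood(p2, heads, tails)
bayes_factor = math.exp(log_bf)        # p(D | M1) / p(D | M2)
print(bayes_factor)                    # below 1: these data favor M2
```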
• Is the coin fair? Compared to what?
• Alternative hypothesis:
Bayesian Model Comparison
M1 : p = p1 = 0.5
M2 : p ≠ 0.5
$$p(M_2|D) = \frac{p(D|M_2)\,p(M_2)}{p(D)} = \frac{\left[\int_0^1 p(D|M_2, p_{coin})\, p(p_{coin})\, dp_{coin}\right] P(M_2)}{p(D)}$$
Compute Bayes factor as before.
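When the alternative is "any bias other than 0.5", the evidence for M2 is an integral over the unknown bias. This sketch assumes a uniform prior on the bias under M2 (an assumption; the slides do not state the prior) and hypothetical data of 60 heads and 40 tails.

```python
from math import factorial

heads, tails = 60, 40                  # hypothetical data D (a specific sequence)
n = heads + tails

# M1: p = 0.5 exactly.
evidence_m1 = 0.5 ** n

# M2: p unknown, uniform prior on [0, 1] (an assumption, not from the slides).
# The integral of p^H (1-p)^T dp over [0, 1] equals H! T! / (N + 1)!.
evidence_m2 = factorial(heads) * factorial(tails) / factorial(n + 1)

bayes_factor = evidence_m1 / evidence_m2   # p(D | M1) / p(D | M2)
print(bayes_factor)
```

The broad alternative spreads its prior over many biases that fit the data poorly, so it is penalized relative to a sharp alternative such as p = 0.6.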
The Gaussian
• parameterized by mean and stdev (position / width)
• joint density of two indep Gaussian RVs is circular! [easy]
• product of two Gaussians is Gaussian! [easy]
• conditionals of a Gaussian are Gaussian! [easy]
• sum of Gaussian RVs is Gaussian! [moderate]
• marginals of a Gaussian are Gaussian! [moderate]
• central limit theorem: sum of many RVs is Gaussian! [hard]
• most random (max entropy) density with this variance! [moderate]
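Product of Gaussians is Gaussian
$$p(x|y) \propto p(y|x)\,p(x)$$
$$\propto e^{-\frac{1}{2}\frac{1}{\sigma_n^2}(x-y)^2}\; e^{-\frac{1}{2}\frac{1}{\sigma_x^2}(x-\mu_x)^2}$$
$$= e^{-\frac{1}{2}\left[\left(\frac{1}{\sigma_n^2}+\frac{1}{\sigma_x^2}\right)x^2 - 2\left(\frac{y}{\sigma_n^2}+\frac{\mu_x}{\sigma_x^2}\right)x + \ldots\right]}$$
Completing the square shows that this posterior is also Gaussian, with
$$\frac{1}{\sigma^2} = \frac{1}{\sigma_n^2}+\frac{1}{\sigma_x^2}, \qquad \mu = \left(\frac{y}{\sigma_n^2}+\frac{\mu_x}{\sigma_x^2}\right)\Big/\left(\frac{1}{\sigma_n^2}+\frac{1}{\sigma_x^2}\right)$$
(average, weighted by inverse variances!)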
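The inverse-variance weighting can be checked numerically by multiplying a Gaussian likelihood and a Gaussian prior on a grid. All numerical values below are hypothetical.

```python
import math

y, sigma_n = 2.0, 1.0        # hypothetical measurement and noise std
mu_x, sigma_x = 0.0, 2.0     # hypothetical prior mean and std

# Closed form from completing the square.
precision = 1 / sigma_n ** 2 + 1 / sigma_x ** 2
post_mean = (y / sigma_n ** 2 + mu_x / sigma_x ** 2) / precision
post_var = 1 / precision

# Numerical check: multiply likelihood and prior on a grid, then normalize.
xs = [-10 + 0.001 * i for i in range(20_001)]
unnorm = [
    math.exp(-0.5 * (x - y) ** 2 / sigma_n ** 2)
    * math.exp(-0.5 * (x - mu_x) ** 2 / sigma_x ** 2)
    for x in xs
]
total = sum(unnorm)
grid_mean = sum(x * u for x, u in zip(xs, unnorm)) / total
grid_var = sum((x - grid_mean) ** 2 * u for x, u in zip(xs, unnorm)) / total
print(post_mean, grid_mean)   # should agree
print(post_var, grid_var)     # should agree
```

The posterior mean sits closer to the measurement than to the prior mean here because the measurement noise is smaller than the prior width.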
Gaussian densities
[Figure: 2D Gaussian density with mean: [0.2, 0.8], cov: [1.0 -0.3; -0.3 0.4]]
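$\vec{x} \sim N(\vec{\mu}, C)$, let $P = C^{-1}$ (known as the “precision” matrix)
Conditional: $p(x_1|x_2)$ is Gaussian, with: mean $\mu_1 - P_{11}^{-1}P_{12}(x_2 - \mu_2)$, variance $P_{11}^{-1}$
Marginal: $p(x_1) = \int p(\vec{x})\,dx_2$ is Gaussian, with: mean $\mu_1$, variance $C_{11}$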
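Generalized marginals of a Gaussian
$z = u^T \vec{x}$ is Gaussian, with:
$$\mu_z = u^T \vec{\mu}_x$$
$$\sigma_z^2 = u^T C_x u$$
[Figure: density over (x1, x2) with projection direction u and the resulting density p(z); other labels: w]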
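The projection formulas can be checked by sampling. This sketch uses a 2D Gaussian with mean [0.2, 0.8] and covariance [1.0 -0.3; -0.3 0.4], and a hypothetical unit vector u; samples are drawn with a hand-written 2x2 Cholesky factor so that only the standard library is needed.

```python
import math
import random

mu = [0.2, 0.8]                       # mean of the 2D Gaussian
C = [[1.0, -0.3], [-0.3, 0.4]]        # covariance of the 2D Gaussian
u = [0.6, 0.8]                        # hypothetical unit-length projection vector

# Predicted mean and variance of z = u^T x.
mu_z = u[0] * mu[0] + u[1] * mu[1]
var_z = sum(u[i] * C[i][j] * u[j] for i in range(2) for j in range(2))

# Draw samples of x using a hand-written 2x2 Cholesky factor of C.
l11 = math.sqrt(C[0][0])
l21 = C[1][0] / l11
l22 = math.sqrt(C[1][1] - l21 ** 2)
rng = random.Random(4)
zs = []
for _ in range(100_000):
    a, b = rng.gauss(0, 1), rng.gauss(0, 1)
    x1 = mu[0] + l11 * a
    x2 = mu[1] + l21 * a + l22 * b
    zs.append(u[0] * x1 + u[1] * x2)   # project each sample onto u

sample_mean = sum(zs) / len(zs)
sample_var = sum((z - sample_mean) ** 2 for z in zs) / (len(zs) - 1)
print(mu_z, sample_mean)   # should agree
print(var_z, sample_var)   # should agree
```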
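[Figure: true density (true mean: [0 0.8], true cov: [1.0 -0.25; -0.25 0.3]) and 700 samples (sample mean: [-0.05 0.83], sample cov: [0.95 -0.23; -0.23 0.29]). Measurement (sampling) goes from the true density to the samples; inference goes from the samples back to the density.]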
Central limit for a uniform distribution...
10k samples, uniform density (sigma=1)
[Figure: four histograms over the range −4 to 4. Panels: 10^4 samples of uniform dist; (u+u)/sqrt(2); (u+u+u+u)/sqrt(4); 10 u’s divided by sqrt(10)]
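The convergence toward a Gaussian can be reproduced in a few lines. This sketch draws uniform samples scaled to have sigma = 1 and, as a simple summary, reports the fraction of scaled sums within one standard deviation of zero, which is about 0.683 for a Gaussian. The summary statistic is an illustrative choice; the slides show histograms instead.

```python
import math
import random

rng = random.Random(5)
half_width = math.sqrt(3)     # uniform on [-sqrt(3), sqrt(3)] has sigma = 1

def scaled_sum(k):
    """Sum of k independent uniforms, divided by sqrt(k) to keep sigma = 1."""
    return sum(rng.uniform(-half_width, half_width) for _ in range(k)) / math.sqrt(k)

def fraction_within_one_sigma(k, n_samples=10_000):
    samples = [scaled_sum(k) for _ in range(n_samples)]
    return sum(1 for s in samples if abs(s) < 1) / n_samples

for k in [1, 2, 4, 10]:
    print(k, fraction_within_one_sigma(k))   # moves from about 0.58 toward about 0.68
```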
Central limit for a binary distribution...
[Figure: five histograms over the range 0 to 1. Panels: one coin; avg of 4 coins; avg of 16 coins; avg of 64 coins; avg of 256 coins]