
Page 1: Probability, Statistics and Inference

Mathematical Tools for Neural and Cognitive Science

Probability, Statistics and Inference

Fall semester, 2017

Probability: an abstract mathematical framework for describing random quantities (e.g., measurements)

Statistics: use of probability to summarize, analyze, interpret data. Fundamental to all experimental science.

Page 2: Probability, Statistics and Inference

Probabilistic Middleville

The stork delivers boys and girls randomly, with equal probability.

[diagram: probabilistic model → data (probability); data → probabilistic model (statistical inference)]

In Middleville, every family has two children, brought by the stork.

You pick a family at random and discover that one of the children is a girl.

What is the probability that the other child is a girl?

Statistical Middleville

In Middleville, every family has two children, brought by the stork.

You pick a family at random and discover that one of the children is a girl.

What is the probability that the other child is a girl?

In a survey of 100 Middleville families, 32 have two girls, 24 have two boys, and the remainder have one of each.

The stork delivers boys and girls randomly, with equal probability.

Page 3: Probability, Statistics and Inference

- Efron & Tibshirani, Introduction to the Bootstrap

Some historical context

• 1600’s: Early notions of data summary/averaging

• 1700’s: Bayesian prob/statistics (Bayes, Laplace)

• 1920’s: Frequentist statistics for science (e.g., Fisher)

• 1940’s: Statistical signal analysis and communication, estimation/decision theory (Shannon, Wiener, etc)

• 1970’s: Computational optimization and simulation (e.g., Tukey)

• 1990’s: Machine learning (large-scale computing + statistical inference + lots of data)

• Since 1950’s: statistical neural/cognitive models

Page 4: Probability, Statistics and Inference

Scientific process

Summarize/fit, compare with predictions

Create/modify hypothesis/model

Generate predictions, design experiment

Observe / measure data

Estimating model parameters

• How do I compute the estimate? (mathematics vs. numerical optimization)

• How “good” are my estimates?

• How well does my model explain the data? Future data (prediction/generalization)?

• How do I compare two (or more) models?

Page 5: Probability, Statistics and Inference

Outline of what’s coming

Themes:

• Uni-variate vs. multi-variate

• Discrete vs. continuous

• Math vs. simulation

• Bayesian vs. frequentist inference

Topics:

• Descriptive statistics

• Basic probability theory: univariate, multivariate

• Model parameter estimation

• Hypothesis testing / model comparison

Example: Localization

Issues: Mean and variability (accuracy and precision)

Page 6: Probability, Statistics and Inference

Descriptive statistics: Central tendency

• We often summarize data with the average. Why?

• Average minimizes the squared error (think regression!)

• More generally, for Lp norms:

• minimum L1 norm: median

• minimum L0 norm: mode

• Issues: Data from a common source, outliers, asymmetry, bimodality

$$\arg\min_{\bar{x}}\; \frac{1}{N}\sum_{n=1}^{N}\left(x_n - \bar{x}\right)^2 \;=\; \frac{1}{N}\sum_{n=1}^{N} x_n$$

and, for general $L^p$ norms, the objective is

$$\left[\frac{1}{N}\sum_{n=1}^{N}\left|x_n - \bar{x}\right|^p\right]^{1/p}$$

Descriptive statistics: Dispersion

• Sample variance

• Why N-1?

• Sample standard deviation

• Mean absolute deviation

$$s^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2 \qquad \text{(sample variance)}$$

$$\frac{1}{N}\sum_{i=1}^{N}\left|x_i - \bar{x}\right| \qquad \text{(mean absolute deviation)}$$

Page 7: Probability, Statistics and Inference

Example: Localization

I find that $\bar{x} \neq 0$. Is that convincing? Is the apparent bias real? To answer this, we need tools from probability…

Probability: notation

let X, Y, Z be random variables

they can take on values (like ‘heads’ or ‘tails’; or integers 1-6; or real-valued numbers)

let x, y, z stand generically for values they can take, and also, in shorthand, for events like X = x

we write the probability that X takes on value x as P(X = x), or $P_X(x)$, or sometimes just P(x)

P(x) is a function over x, which we call the probability “distribution” function (pdf) (or, for continuous variables only, “density”)

Page 8: Probability, Statistics and Inference

A distribution (the sum of 2 dice rolls)

Another distribution (IQ of a randomly chosen person)

Normalization

Discrete pdf $P(x)$:
$$0 \le P(x) \le 1, \qquad \sum_i P(x_i) = 1$$

Continuous pdf $p(x)$:
$$p(x) \ge 0, \qquad \int_{-\infty}^{\infty} p(x)\,dx = 1, \qquad P(a < x < b) = \int_a^b p(x)\,dx$$

Page 9: Probability, Statistics and Inference

Probability basics

• discrete probability distributions

• continuous probability densities

• cumulative distributions

• translation and scaling of distributions

• monotonic nonlinear transformations

• drawing samples from a distribution. Uniform. Inverse cumulative mapping (see the sketch after this list)

• example densities/distributions

[on board]

Example distributions

[figures: a not-quite-fair coin; roll of a fair die; sum of two rolled fair dice; clicks of a Geiger counter, in a fixed time interval; horizontal velocity of gas molecules exiting a fan... and, time between clicks]

Page 10: Probability, Statistics and Inference

Expected value - discrete

$$E(X) = \sum_{i=1}^{N} x_i\, p(x_i) \qquad \text{[the mean, $\mu$]}$$

[figures: histogram of # of credit cards over # of students (×10⁴), and the corresponding distribution $P(x)$]

Expected value - continuous

$$E(X) = \int x\, p(x)\, dx \qquad \text{[the mean, $\mu$]}$$

$$E(X^2) = \int x^2\, p(x)\, dx \qquad \text{[the “second moment”]}$$

$$E\left((X - \mu)^2\right) = \int (x - \mu)^2\, p(x)\, dx = \int x^2\, p(x)\, dx - \mu^2 \qquad \text{[the variance, $\sigma^2$]}$$

$$E\left(f(X)\right) = \int f(x)\, p(x)\, dx$$

note: an inner product, and thus linear, i.e., $E(af(X) + bg(X)) = aE(f(X)) + bE(g(X))$

Page 11: Probability, Statistics and Inference

Joint and conditional probability - discrete

P(Ace), P(Heart), P(Ace & Heart), P(Ace | Heart), P(not Jack of Diamonds), P(Ace | not Jack of Diamonds)

“Independence”

Page 12: Probability, Statistics and Inference

• Joint distributions

• Marginals (integrating)

• Conditionals (slicing)

• Bayes’ Rule (inverting)

• Statistical independence (separability)

Multi-variate probability

[on board]

Joint distribution $p(x, y)$ [figure]

Page 13: Probability, Statistics and Inference

Marginal distribution

$$p(x) = \int p(x, y)\, dy$$

[figure: joint density $p(x, y)$ and its marginal]

Conditional probability

[Venn diagram: events A and B, with overlap A & B; the region outside both is “neither A nor B”]

$$p(A \mid B) = \text{probability of A, given that B is asserted to be true} = \frac{p(A \& B)}{p(B)}$$

Page 14: Probability, Statistics and Inference

Conditional distribution

$$p(x \mid y=68) \;=\; p(x, y=68) \Big/ \!\int p(x, y=68)\, dx \;=\; p(x, y=68)\,/\,p(y=68)$$

[figure: joint $p(x, y)$, sliced at $y = 68$ and normalized to give $p(x \mid y=68)$]

Slice the joint distribution, then normalize (by the marginal).

More generally: $p(x \mid y) = p(x, y)/p(y)$

Page 15: Probability, Statistics and Inference

Bayes’ Rule

[Venn diagram: events A and B, with overlap A & B]

$$p(A \& B) = p(B)\, p(A \mid B) = p(A)\, p(B \mid A)$$

$$\Rightarrow\; p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}$$

(using $p(A \mid B) = p(A \& B)/p(B)$, the definition of conditional probability)

Bayes’ Rule

p(x|y) = p(y|x) p(x)/p(y)

(a direct consequence of the definition of conditional probability)

Page 16: Probability, Statistics and Inference

Conditional vs. marginal

[figure: $P(x \mid Y=120)$ vs. the marginal $P(x)$]

In general, these differ. When are they the same? In particular, when are all conditionals equal to the marginal?


Statistical independence

Random variables X and Y are statistically independent if (and only if):

Independence implies that all conditionals are equal to the corresponding marginal:

$$p(x, y) = p(x)\, p(y) \quad \forall\, x, y$$

$$p(x \mid y) = p(x, y)\,/\,p(y) = p(x) \quad \forall\, x, y$$

[note: for discrete distributions, this is an outer product!]

Page 17: Probability, Statistics and Inference

Sums of independent RVs

Suppose X and Y are independent. Then

$$E(XY) = E(X)\,E(Y)$$

$$\sigma^2_{X+Y} = E\!\left(\left((X+Y) - (\mu_X + \mu_Y)\right)^2\right) = \sigma_X^2 + \sigma_Y^2$$

and $p_{X+Y}(z)$ is a convolution: $p_{X+Y} = p_X * p_Y$.

Implications: (1) Sums of Gaussians are Gaussian, (2) Properties of the sample average

For any two random variables (independent or not): $E(X+Y) = E(X) + E(Y)$

• Mean and variance summarize centroid/width

• translation and rescaling of random variables

• nonlinear transformations - “warping”

• Mean/variance of weighted sum of random variables

• The sample average

• ... converges to the true mean (except for bizarre distributions)

• ... with variance $\sigma^2/N$

• ... the most common choice for an estimate ...

Mean and variance

Page 18: Probability, Statistics and Inference

• Estimator: Any function of the data, intended to compute an estimate of the true value of a parameter

• The most common estimator is the sample average, used to estimate the true mean of the distribution.

• Statistically-motivated examples:

- Maximum likelihood (ML): $\hat{\theta} = \arg\max_{\theta}\, p(\text{data} \mid \theta)$

- Max a posteriori (MAP): $\hat{\theta} = \arg\max_{\theta}\, p(\theta \mid \text{data})$

- Min Mean Squared Error (MMSE): $\hat{\theta} = E(\theta \mid \text{data})$

Point Estimates

Example: Estimate the bias of a coin

Page 19: Probability, Statistics and Inference

Bayes’ Rule and Estimation

$$\underbrace{p(\text{parameter value} \mid \text{data})}_{\text{Posterior}} \;=\; \frac{\overbrace{p(\text{data} \mid \text{parameter value})}^{\text{Likelihood}}\;\; \overbrace{p(\text{parameter value})}^{\text{Prior}}}{\underbrace{p(\text{data})}_{\text{Nuisance normalizing term}}}$$

Page 20: Probability, Statistics and Inference

[figures: likelihood after observing 1 head, and after observing 1 tail; a grid of posteriors $p(H, T \mid x)$ for $H = 0,\dots,3$ heads and $T = 0,\dots,3$ tails, assuming prior $p(x) = 1$; more heads shifts mass toward $x = 1$, more tails toward $x = 0$]

Page 21: Probability, Statistics and Inference

Example: infer whether a coin is fair by flipping it repeatedly. Here, $x$ is the probability of heads (50% is fair), and $y_1, \dots, y_n$ are the outcomes of the flips.

Consider three different priors: suspect fair, suspect biased, no idea.

[figure: each prior (fair / biased / uncertain) × likelihood (heads) = posterior]

Page 22: Probability, Statistics and Inference

[figures: previous posteriors × likelihood (heads) = new posterior; previous posteriors × likelihood (tails) = new posterior]

Page 23: Probability, Statistics and Inference

Posteriors after observing 75 heads, 25 tails

→ prior differences are ultimately overwhelmed by data

Confidence

[figures: PDFs and CDFs of the three posteriors after 2H/1T, 10H/5T, and 20H/10T; confidence intervals read from the CDFs between the .025 and .975 levels]

Page 24: Probability, Statistics and Inference

Bias & Variance

• Mean squared error = bias² + variance

• Bias is difficult to assess (requires knowing the “true” value). Variance is easier.

• Classical statistics generally aims for an unbiased estimator, with minimal variance (“MVUE”).

• The MLE is asymptotically unbiased (under fairly general conditions), but this is only useful if:
  - the likelihood model is correct
  - the optimum can be computed
  - you have enough data

• More general/modern view: estimation is about trading off bias and variance, through model selection, “regularization”, or Bayesian priors.

• Is the coin fair? Compared to what?

• Point hypotheses:

Bayesian Model Comparison

$$M_1: p = p_1 = 0.5 \qquad\quad M_2: p = p_2 = 0.6$$

$$p(M_1 \mid D) = \frac{p(D \mid M_1)\, P(M_1)}{p(D)} = \frac{p(D \mid M_1)\, P(p_1)}{p(D)}$$

Assuming equal priors over models, the Bayes factor is

$$\frac{p(M_1 \mid D)}{p(M_2 \mid D)} = \frac{p(D \mid M_1)\, P(M_1)}{p(D \mid M_2)\, P(M_2)} = \frac{p(D \mid M_1)\, P(p_1)}{p(D \mid M_2)\, P(p_2)}$$

Page 25: Probability, Statistics and Inference

• Is the coin fair? Compared to what?

• Alternative hypothesis:

Bayesian Model Comparison

$$M_1: p = p_1 = 0.5 \qquad\quad M_2: p \neq 0.5$$

$$p(M_2 \mid D) = \frac{p(D \mid M_2)\, p(M_2)}{p(D)} = \frac{\left[\int_0^1 p(D \mid M_2, p_{\text{coin}})\, p(p_{\text{coin}})\, dp_{\text{coin}}\right] P(M_2)}{p(D)}$$

Compute Bayes factor as before.

The Gaussian

• parameterized by mean and stdev (position / width)

• joint density of two indep Gaussian RVs is circular! [easy]

• product of two Gaussians is Gaussian! [easy]

• conditionals of a Gaussian are Gaussian! [easy]

• sum of Gaussian RVs is Gaussian! [moderate]

• marginals of a Gaussian are Gaussian! [moderate]

• central limit theorem: sum of many RVs is Gaussian! [hard]

• most random (max entropy) density with this variance! [moderate]

Page 26: Probability, Statistics and Inference

Product of Gaussians is Gaussian

$$p(x \mid y) \;\propto\; p(y \mid x)\, p(x) \;\propto\; e^{-\frac{1}{2}\frac{(x - y)^2}{\sigma_n^2}}\, e^{-\frac{1}{2}\frac{(x - \mu_x)^2}{\sigma_x^2}} \;=\; e^{-\frac{1}{2}\left[\left(\frac{1}{\sigma_n^2} + \frac{1}{\sigma_x^2}\right)x^2 \,-\, 2\left(\frac{y}{\sigma_n^2} + \frac{\mu_x}{\sigma_x^2}\right)x \,+\, \dots\right]}$$

Completing the square shows that this posterior is also Gaussian, with mean the average of $y$ and $\mu_x$, weighted by inverse variances!

Page 27: Probability, Statistics and Inference

Gaussian densities

$$\vec{x} \sim N(\vec{\mu}, C), \quad \text{let } P = C^{-1} \quad \text{(known as the “precision” matrix)}$$

[figure: 2-D Gaussian density with mean [0.2, 0.8] and cov [1.0 -0.3; -0.3 0.4]]

Marginal: $p(x_1) = \int p(\vec{x})\, dx_2$ is Gaussian.

Conditional: also Gaussian.

Page 28: Probability, Statistics and Inference

Generalized marginals of a Gaussian

$$z = u^T \vec{x} \;\text{ is Gaussian, with: }\; \mu_z = u^T \vec{\mu}_x, \qquad \sigma_z^2 = u^T C_x\, u$$

[figure: samples of $(x_1, x_2)$ projected onto the direction $u$, yielding the 1-D density $p(z)$]

[diagram: true density → (measurement/sampling) → 700 samples → (inference) → sample estimates; true mean [0, 0.8], true cov [1.0 -0.25; -0.25 0.3]; sample mean [-0.05, 0.83], sample cov [0.95 -0.23; -0.23 0.29]]

Page 29: Probability, Statistics and Inference

Central limit for a uniform distribution...

[figures: histograms of 10⁴ samples of a uniform density (σ = 1), of (u+u)/√2, of (u+u+u+u)/√4, and of 10 u’s divided by √10; the rescaled sums approach a Gaussian]

Central limit for a binary distribution...

[figures: histograms of 10k samples: one coin, avg of 4 coins, avg of 16 coins, avg of 64 coins, avg of 256 coins; the averages approach a Gaussian]