
Lecture 1: Basic Statistical Tools

Bruce Walsh lecture notes
Uppsala EQG 2012 course

version 28 Jan 2012


Discrete Random Variables

A random variable (RV) is an outcome (realization) that is not a fixed value, but rather is drawn from some probability distribution.

A discrete RV x takes on values X1, X2, ..., Xk

Probability distribution: Pi = Pr(x = Xi), with Pi > 0 and Σ Pi = 1

(probabilities are non-negative and sum to one)

Example: Binomial random variable. Let p = probability of a success. Then

Pr(k successes in n trials) = n!/[(n − k)! k!] p^k (1 − p)^(n−k)

Example: Poisson random variable. Let λ = expected number of successes. Then

Pr(k successes) = λ^k exp(−λ)/k!
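The two pmfs above are easy to check numerically. A minimal Python sketch (the parameter values n = 10, p = 0.3, and λ = 2.5 are arbitrary illustrations, not from the slides) confirms that each set of probabilities is non-negative and sums to one:

```python
# A minimal check of the two pmf formulas above, standard library only.
# The parameter values (n = 10, p = 0.3, lam = 2.5) are arbitrary illustrations.
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Pr(k successes in n trials) = n!/[(n-k)! k!] p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """Pr(k successes) = lam^k exp(-lam) / k!."""
    return lam**k * exp(-lam) / factorial(k)

n, p, lam = 10, 0.3, 2.5
print(sum(binomial_pmf(k, n, p) for k in range(n + 1)))   # 1.0: probabilities sum to one
print(sum(poisson_pmf(k, lam) for k in range(100)))       # ~1.0 (infinite support, truncated here)
print(binomial_pmf(3, n, p), poisson_pmf(3, lam))         # two example probabilities
```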


Continuous Random Variables

A continuous RV x can take on any possible value in some interval (or set of intervals). The probability distribution is defined by the probability density function, p(x), with

p(x) ≥ 0 and ∫_{−∞}^{∞} p(x) dx = 1

P(x1 ≤ x ≤ x2) = ∫_{x1}^{x2} p(x) dx

Probability is the area under the curve

Finally, the cdf, or cumulative probability function, is defined as cdf(z) = Pr(x < z):

cdf(z) = ∫_{−∞}^{z} p(x) dx


Example: The normal (or Gaussian) distribution

Mean µ, variance σ²

Unit normal (mean 0, variance 1)


Mean (µ) = peak of distribution

The variance is a measure of spread about the mean. The smaller σ², the narrower the distribution about the mean.
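As a small illustration of the normal density and cdf (the values µ = 0, σ = 1, x1 = −1, x2 = 1 are arbitrary choices, not from the slides), the sketch below evaluates Pr(x1 ≤ x ≤ x2) as a difference of cdfs, i.e., the area under the curve between x1 and x2:

```python
# Minimal sketch: normal density and cdf, and Pr(x1 <= x <= x2) as the area
# under the curve. mu, sigma, x1, x2 are illustrative values, not from the slides.
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    return exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, sigma = 0.0, 1.0          # the unit normal
x1, x2 = -1.0, 1.0
print(normal_pdf(0.0))                                         # height at the mean, ~0.399
print(normal_cdf(x2, mu, sigma) - normal_cdf(x1, mu, sigma))   # Pr(-1 <= x <= 1) ~ 0.683
```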


Joint and Conditional Probabilities

The probability for a pair (x, y) of random variables is specified by the joint probability density function, p(x, y):

P(y1 ≤ y ≤ y2, x1 ≤ x ≤ x2) = ∫_{y1}^{y2} ∫_{x1}^{x2} p(x, y) dx dy

The marginal density of x, p(x):

p(x) = ∫_{−∞}^{∞} p(x, y) dy


Joint and Conditional Probabilities

p(y|x), the conditional density of y given x

Relationships among p(x), p(x,y), p(y|x):

p(x, y) = p(y | x) p(x), hence p(y | x) = p(x, y)/p(x)

P(y1 ≤ y ≤ y2 | x) = ∫_{y1}^{y2} p(y | x) dy

x and y are said to be independent if p(x,y) = p(x)p(y)

Note that p(y|x) = p(y) if x and y are independent
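These density relationships have exact discrete analogues that are easy to verify numerically. The sketch below uses a made-up 2 x 2 joint probability table (not from the slides) to compute a marginal, a conditional, and to check independence:

```python
# Discrete analogue of the relationships above, with a made-up 2x2 joint table.
joint = {            # p(x, y) for x, y in {0, 1}
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.20, (1, 1): 0.40,
}
p_x = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (0, 1)}

# Conditional density: p(y | x) = p(x, y) / p(x)
p_y_given_x = {(x, y): joint[(x, y)] / p_x[x] for (x, y) in joint}

print(p_x)                         # marginal of x: {0: 0.4, 1: 0.6}
print(p_y_given_x[(0, 1)])         # p(y = 1 | x = 0) = 0.30 / 0.40 = 0.75
# Independence would require p(x, y) = p(x) p(y) in every cell: not true here.
print(all(abs(joint[(x, y)] - p_x[x] * p_y[y]) < 1e-12 for (x, y) in joint))   # False
```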


Bayes' Theorem

Suppose an unobservable RV takes on values b1 .. bn

Suppose that we observe the outcome A of an RV correlated with b. What can we say about b given A?

Bayes' theorem:

Pr(bj | A) = Pr(bj) Pr(A | bj) / Pr(A) = Pr(bj) Pr(A | bj) / Σ_{i=1}^{n} Pr(bi) Pr(A | bi)

A typical application in genetics is that A is some phenotype and b indexes some underlying (but unknown) genotype


Example: BRCA1/2 & Breast cancer

• NCI statistics:
– 12% is the lifetime risk of breast cancer in females
– 60% is the lifetime risk if carrying a BRCA1 or BRCA2 mutation
– One estimate of the BRCA1 or BRCA2 allele frequency is around 2.3%.
– Question: Given a patient has breast cancer, what is the chance that she has a BRCA1 or BRCA2 mutation?


• Here
– Event B = has a BRCA mutation
– Event A = has breast cancer

• Bayes: Pr(B|A) = Pr(A|B) * Pr(B) / Pr(A)
– Pr(A) = 0.12
– Pr(B) = 0.023
– Pr(A|B) = 0.60
– Hence, Pr(BRCA | Breast cancer) = [0.60 * 0.023] / 0.12 = 0.115

• Hence, for the assumed BRCA frequency (2.3%), 11.5% of all patients with breast cancer have a BRCA mutation
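The calculation on this slide is a one-line application of Bayes' theorem; a minimal sketch with the same numbers:

```python
# Bayes' theorem with the numbers on this slide.
pr_A = 0.12           # Pr(breast cancer)
pr_B = 0.023          # Pr(BRCA1/2 mutation)
pr_A_given_B = 0.60   # Pr(breast cancer | BRCA mutation)

pr_B_given_A = pr_A_given_B * pr_B / pr_A
print(pr_B_given_A)   # 0.115
```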


Second example: Suppose height > 70. What is the probability the individual is QQ, Qq, or qq?

Genotype                     QQ     Qq     qq
Freq(genotype)               0.5    0.3    0.2
Pr(height > 70 | genotype)   0.3    0.6    0.9

Pr(height > 70) = 0.3*0.5 + 0.6*0.3 + 0.9*0.2 = 0.51

Pr(QQ | height > 70) = Pr(QQ) * Pr(height > 70 | QQ) / Pr(height > 70) = 0.5*0.3 / 0.51 = 0.294
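The same calculation, written as a full posterior over all three genotypes (the numbers are those in the table above):

```python
# Full posterior over the three genotypes, using the table above.
freq = {"QQ": 0.5, "Qq": 0.3, "qq": 0.2}     # Pr(genotype)
tall = {"QQ": 0.3, "Qq": 0.6, "qq": 0.9}     # Pr(height > 70 | genotype)

pr_tall = sum(freq[g] * tall[g] for g in freq)             # 0.51
posterior = {g: freq[g] * tall[g] / pr_tall for g in freq}
print(pr_tall)       # 0.51
print(posterior)     # QQ: ~0.294, Qq: ~0.353, qq: ~0.353
```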


Expectations of Random Variables

The expected value, E[f(x)], of some function f of the random variable x is just the average value of that function

E[x] = the (arithmetic) mean, µ, of a random variable x:

E(x) = µ = ∫_{−∞}^{+∞} x p(x) dx

E[f(x)] = Σ_i Pr(x = Xi) f(Xi)          (x discrete)

E[f(x)] = ∫_{−∞}^{+∞} f(x) p(x) dx      (x continuous)


Expectations of Random Variables

E[(x − µ)²] = σ², the variance of x:

E[(x − µ)²] = σ² = ∫_{−∞}^{+∞} (x − µ)² p(x) dx

More generally, the rth moment about the mean is given by E[(x − µ)^r]:
r = 2: variance. r = 3: skew. r = 4: (scaled) kurtosis

Useful properties of expectations:

E[g(x) + f(y)] = E[g(x)] + E[f(y)]
E(cx) = c E(x)
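A minimal numerical illustration of the discrete expectation formula (the three-point distribution below is made up for illustration): the mean, variance, and third central moment are all expectations of functions of x:

```python
# Discrete expectation formula applied to a made-up three-point distribution.
values = [1.0, 2.0, 4.0]
probs = [0.2, 0.5, 0.3]

def expect(f):
    """E[f(x)] = sum_i Pr(x = X_i) f(X_i)."""
    return sum(p * f(x) for x, p in zip(values, probs))

mu = expect(lambda x: x)                  # the mean
var = expect(lambda x: (x - mu) ** 2)     # 2nd moment about the mean (variance)
m3 = expect(lambda x: (x - mu) ** 3)      # 3rd moment about the mean (skew)
print(mu, var, m3)
```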


The truncated normal

Only consider values of T or above in a normal (T = the truncation point)

Let π_T = Pr(z > T)

Density function: p(z | z > T) = p(z) / Pr(z > T) = p(z) / ∫_{T}^{∞} p(z) dz

Mean of the truncated distribution:

E[z | z > T] = ∫_{T}^{∞} z p(z)/π_T dz = µ + σ · p_T/π_T

Here p_T is the height of the normal at the truncation point,

p_T = (2π)^{−1/2} exp[ −(T − µ)²/(2σ²) ]


The truncated normal

Variance of the truncated distribution:

Var(z | z > T) = [ 1 + (p_T·(T − µ)/σ)/π_T − (p_T/π_T)² ] σ²

p_T = height of the normal at T
π_T = area of the curve to the right of T
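A quick check of the truncated-normal mean formula against simulation, for the unit normal (µ = 0, σ = 1) and an arbitrary truncation point T = 1; with σ = 1 the expression for p_T above is just the normal height at T:

```python
# Check of the truncated-normal mean formula against simulation for the unit
# normal (mu = 0, sigma = 1) and an arbitrary truncation point T = 1.
from math import erf, exp, pi, sqrt
import random

mu, sigma, T = 0.0, 1.0, 1.0
pT = exp(-(T - mu) ** 2 / (2 * sigma**2)) / sqrt(2 * pi)   # height of the normal at T
piT = 1 - 0.5 * (1 + erf((T - mu) / (sigma * sqrt(2))))    # Pr(z > T)

print(mu + sigma * pT / piT)                               # formula: ~1.525

random.seed(1)
draws = [z for z in (random.gauss(mu, sigma) for _ in range(200_000)) if z > T]
print(sum(draws) / len(draws))                             # simulation: close to the formula
```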


Covariances

• Cov(x,y) = E[(x − µ_x)(y − µ_y)] = E[xy] − E[x]·E[y]

(Figure: scatterplot of Y against X with cov(X,Y) > 0)

Cov(x,y) > 0, positive (linear) association between x & y


Cov(x,y) < 0, negative (linear) association between x & y

(Figure: scatterplot of Y against X with cov(X,Y) < 0)

Cov(x,y) = 0, no linear association between x & y

(Figure: scatterplot of Y against X with cov(X,Y) = 0)


Cov(x,y) = 0 DOES NOT imply no association

(Figure: scatterplot of Y against X showing a clear nonlinear association, yet cov(X,Y) = 0)


Correlation

Cov = 10 tells us nothing about the strength of an association

What is needed is an absolute measure of association

This is provided by the correlation, r(x,y)

r = 1 implies a perfect (positive) linear association

r = - 1 implies a perfect (negative) linear association

r(x, y) = Cov(x, y) / √(Var(x) Var(y))
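A minimal sketch computing the sample covariance and correlation for simulated data (the slope 0.5, sample size, and seed are arbitrary choices):

```python
# Sample covariance and correlation for simulated data (arbitrary values).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 0.5 * x + rng.normal(size=10_000)          # positive linear association

cov_xy = np.cov(x, y)[0, 1]
r_xy = cov_xy / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))
print(cov_xy, r_xy)
print(np.corrcoef(x, y)[0, 1])                 # same correlation, computed directly
```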


Useful Properties of Variances and Covariances

• Symmetry, Cov(x,y) = Cov(y,x)

• The covariance of a variable with itself is the variance, Cov(x,x) = Var(x)

• If a is a constant, then
– Cov(ax, y) = a Cov(x,y)

• Var(ax) = a² Var(x)
– Var(ax) = Cov(ax, ax) = a² Cov(x,x) = a² Var(x)

• Cov(x+y,z) = Cov(x,z) + Cov(y,z)


Var(x + y) = Var(x) + Var(y) + 2 Cov(x, y)

Hence, the variance of a sum equals the sum of the variances ONLY when the elements are uncorrelated

More generally,

Cov( Σ_{i=1}^{n} x_i , Σ_{j=1}^{m} y_j ) = Σ_{i=1}^{n} Σ_{j=1}^{m} Cov(x_i, y_j)

Question: What is Var(x-y)?
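A numerical check of Var(x + y) = Var(x) + Var(y) + 2 Cov(x, y), together with the analogous identity for x − y asked about above (simulated data; slope, sizes, and seed are arbitrary):

```python
# Numerical check of the variance-of-a-sum identity, plus the x - y version.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50_000)
y = 0.3 * x + rng.normal(size=50_000)

vx, vy = np.var(x, ddof=1), np.var(y, ddof=1)
cxy = np.cov(x, y)[0, 1]
print(np.var(x + y, ddof=1), vx + vy + 2 * cxy)   # agree up to rounding
print(np.var(x - y, ddof=1), vx + vy - 2 * cxy)   # Var(x - y) picks up -2 Cov(x, y)
```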


Regressions

Consider the best (linear) predictor of y given we know x:

ŷ = ȳ + b_{y|x} (x − x̄)

The slope of this linear regression is a function of Cov(x,y):

b_{y|x} = Cov(x, y) / Var(x)

The fraction of the variation in y accounted for by knowing x is r²; the residual variance Var(ŷ − y) equals (1 − r²) Var(y)
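A minimal sketch of this regression: estimate the slope as Cov(x, y)/Var(x), form the predictor ŷ, and compare with an off-the-shelf least-squares fit (the simulated data, with true intercept 1 and slope 2, are arbitrary):

```python
# Slope as Cov(x, y)/Var(x), the predictor yhat, and a library least-squares fit.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=5_000)
y = 1.0 + 2.0 * x + rng.normal(size=5_000)     # simulated: intercept 1, slope 2

b = np.cov(x, y)[0, 1] / np.var(x, ddof=1)     # b_{y|x}
yhat = y.mean() + b * (x - x.mean())           # best linear predictor of y from x
print(b)                                       # ~2.0
print(np.polyfit(x, y, 1))                     # least-squares [slope, intercept], ~[2.0, 1.0]
print(yhat[:3], y[:3])                         # a few predictions vs. observations
```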


Relationship between the correlation and the regression slope:

r(x, y) = Cov(x, y) / √(Var(x) Var(y)) = b_{y|x} √(Var(x)/Var(y))

If Var(x) = Var(y), then b_{y|x} = b_{x|y} = r(x, y)

In this case, the fraction of variation accounted for by the regression is b²


(Figure: two scatterplots with fitted regression lines, one with r² = 0.6 and one with r² = 0.9)


Properties of Least-squares Regressions

The slope and intercept obtained by least-squares minimize the sum of squared residuals:

Σ e_i² = Σ (y_i − ŷ_i)² = Σ (y_i − a − b·x_i)²

• The average value of the residual is zero

• The LS solution maximizes the amount of variation in y that can be explained by a linear regression on x

• The residual errors around the least-squares regression are uncorrelated with the predictor variable x

• Fraction of variance in y accounted for by the regression is r²

• Homoscedastic vs. heteroscedastic residual variances
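A quick numerical check of the properties listed above, continuing with simulated data (values are arbitrary): the residuals average to zero, are uncorrelated with x, and the fraction of variance explained equals r²:

```python
# Residual properties of the least-squares fit, on simulated data.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=5_000)
y = 1.0 + 2.0 * x + rng.normal(size=5_000)

b = np.cov(x, y)[0, 1] / np.var(x, ddof=1)     # LS slope
a = y.mean() - b * x.mean()                    # LS intercept
e = y - (a + b * x)                            # residuals

r2 = np.corrcoef(x, y)[0, 1] ** 2
print(e.mean())                                           # ~0: residuals average to zero
print(np.cov(e, x)[0, 1])                                 # ~0: uncorrelated with x
print(1 - np.var(e, ddof=1) / np.var(y, ddof=1), r2)      # fraction explained equals r^2
```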


Different methods of analysis

• Parameters of these various models can be estimated in a number of frameworks

• Method of moments
– Very few assumptions about the underlying distribution. Typically, the mean of some statistic has an expected value equal to the parameter
– Example: The estimate of the mean µ is given by the sample mean, x̄, since E(x̄) = µ
– While estimation does not require distribution assumptions, confidence intervals and hypothesis testing do

• Distribution-based estimation
– The explicit form of the distribution is used


Distribution-based estimation

• Maximum likelihood estimation
– MLE
– REML
– More in Lynch & Walsh (book) Appendix 3

• Bayesian
– Marginal posteriors
– Conjugating priors
– MCMC/Gibbs sampling
– More in Walsh & Lynch (online chapters = Vol 2) Appendices 2, 3


Maximum Likelihood

p(x1, ..., xn | θ) = density of the observed data (x1, ..., xn) given the (unknown) distribution parameter(s) θ

Fisher suggested the method of maximum likelihood: given the data (x1, ..., xn), find the value(s) of θ that maximize p(x1, ..., xn | θ)

We usually express p(x1, ..., xn | θ) as a likelihood function ℓ(θ | x1, ..., xn) to remind us that it is dependent on the observed data

The Maximum Likelihood Estimator (MLE) of θ is the value (or values) of θ that maximizes the likelihood function ℓ given the observed data x1, ..., xn.


(Figure: likelihood surface ℓ(θ | x) plotted against θ, with the peak marking the MLE of θ)

The curvature of the likelihood surface in the neighborhood of the MLE informs us as to the precision of the estimator. A narrow peak = high precision. A broad peak = lower precision

This is formalized by looking at the log-likelihood surface, L = ln[ℓ(θ | x)]. Since ln is a monotonic function, the value of θ that maximizes ℓ also maximizes L

The larger the curvature, the smaller the variance:

Var(MLE) = −1 / (∂²L(θ | z)/∂θ²), evaluated at the MLE
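A minimal sketch (not from the slides) of these ideas for the mean µ of a normal with known variance: the log-likelihood is maximized on a grid, and the curvature at the peak gives −1/L″, which should be close to the familiar σ²/n:

```python
# MLE of the mean of a normal with known variance, via the log-likelihood,
# and its variance from the curvature at the peak (all values illustrative).
import numpy as np

rng = np.random.default_rng(4)
sigma, n = 2.0, 200
z = rng.normal(loc=1.0, scale=sigma, size=n)       # data with true mu = 1.0

def logL(mu):
    return -0.5 * np.sum((z - mu) ** 2) / sigma**2 - n * np.log(sigma * np.sqrt(2 * np.pi))

grid = np.linspace(0.0, 2.0, 2001)
mu_hat = grid[np.argmax([logL(m) for m in grid])]  # grid MLE, ~ the sample mean

h = 1e-3                                           # numerical second derivative at the MLE
curvature = (logL(mu_hat + h) - 2 * logL(mu_hat) + logL(mu_hat - h)) / h**2
print(mu_hat, z.mean())
print(-1 / curvature, sigma**2 / n)                # curvature-based variance vs. sigma^2/n
```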


Likelihood Ratio tests

Hypothesis testing in the ML framework occurs through likelihood-ratio (LR) tests

For large sample sizes (generally) LR approaches a chi-square distribution with r df (r = number of parameters assigned fixed values under the null)

LR = −2 ln[ ℓ(Θ_r | z) / ℓ(Θ̂ | z) ] = −2[ L(Θ_r | z) − L(Θ̂ | z) ]

Θ_r is the MLE under the restricted conditions (some parameters specified, e.g., var = 1)

Θ̂ is the MLE under the unrestricted conditions (no parameters specified)
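A minimal sketch (illustrative, not from the slides) of an LR test of the null hypothesis µ = 0 for a normal sample with known variance 1; one parameter is fixed under the null, so LR is compared to a chi-square with 1 df, whose tail area can be written with the normal cdf:

```python
# LR test of H0: mu = 0 for a normal sample with known variance 1.
from math import erf, sqrt
import random

random.seed(0)
n = 100
z = [random.gauss(0.3, 1.0) for _ in range(n)]     # true mean 0.3, so H0 is false

def logL(mu):                                      # log-likelihood up to a constant
    return -0.5 * sum((zi - mu) ** 2 for zi in z)

mu_hat = sum(z) / n                                # unrestricted MLE
LR = -2 * (logL(0.0) - logL(mu_hat))               # restricted (mu = 0) vs. unrestricted

def normal_cdf(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

p_value = 2 * (1 - normal_cdf(sqrt(LR)))           # chi-square (1 df) tail area
print(LR, p_value)
```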


Bayesian Statistics

An extension of likelihood is Bayesian statistics

p(θ | x) = C · ℓ(x | θ) p(θ)

Instead of simply producing a point estimate (e.g., the MLE), the goal is to estimate the entire distribution for the unknown parameter θ given the data x

p(θ | x) is the posterior distribution for θ given the data x

ℓ(x | θ) is just the likelihood function

p(θ) is the prior distribution on θ.


Bayesian Statistics

Why Bayesian?

• Exact for any sample size

• Marginal posteriors

• Efficient use of any prior information

• MCMC (such as Gibbs sampling) methods

Priors quantify the strength of any prior information. Often these are taken to be diffuse (with a high variance), so the prior weight on θ is spread over a wide range of possible values.


Marginal posteriors

• Often we are interested in a particular set of parameters (say some subset of the fixed effects). However, we also have to estimate all of the other parameters.

• How do uncertainties in these nuisance parameters factor into the uncertainty in the parameters of interest?

• A Bayesian marginal posterior takes this into account by integrating the full posterior over the nuisance parameters

• While this sounds complicated, it is easy to do with MCMC (Markov Chain Monte Carlo)


Conjugating priors

For any particular likelihood, we can often find a conjugating prior, such that the product of the likelihood and the prior returns a known distribution.

Example: For the mean µ in a normal, taking the prior on the mean to also be normal returns a posterior for µ that is normal.

Example: For the variance σ² in a normal, taking the prior on the variance to be an inverse chi-square distribution returns a posterior for σ² that is also an inverse chi-square (details in WL Appendix 2).


A normal prior on the mean, with mean µ0 and variance σ0² (the larger σ0², the more diffuse the prior)

If the likelihood for the mean is a normal distribution, the resulting posterior is also normal.

Note that if σ0² is large, the mean of the posterior is very close to the sample mean.
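A minimal sketch of the standard normal-normal conjugate update, assuming the data variance σ² is known (the numerical values are illustrative, not from the slides); it shows how a diffuse prior (large σ0²) pulls the posterior mean toward the sample mean:

```python
# Standard normal-normal conjugate update for the mean, with known data
# variance sigma^2. Prior: N(mu0, sigma0_sq). Posterior: normal with
#   var  = 1 / (1/sigma0_sq + n/sigma_sq)
#   mean = var * (mu0/sigma0_sq + n*xbar/sigma_sq)
def normal_posterior(mu0, sigma0_sq, xbar, sigma_sq, n):
    post_var = 1.0 / (1.0 / sigma0_sq + n / sigma_sq)
    post_mean = post_var * (mu0 / sigma0_sq + n * xbar / sigma_sq)
    return post_mean, post_var

# Tight prior: the posterior mean is pulled toward the prior mean (0.0).
print(normal_posterior(mu0=0.0, sigma0_sq=0.1, xbar=2.0, sigma_sq=1.0, n=20))    # ~(1.33, 0.033)
# Diffuse prior: the posterior mean is essentially the sample mean.
print(normal_posterior(mu0=0.0, sigma0_sq=100.0, xbar=2.0, sigma_sq=1.0, n=20))  # ~(2.00, 0.050)
```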


If x follows a chi-square distribution, then 1/x follows an inverse chi-square distribution.

The scaled inverse chi-square has two parameters, allowing more control over the mean and variance of the prior.


MCMC

Analytic expressions for posteriors can be complicated, but the method of MCMC (Markov Chain Monte Carlo) is a general approach to simulating draws from just about any distribution (details in WL Appendix 3).

Generating several thousand such draws from the posterior returns an empirical distribution that we can use.

For example, we can compute a 95% credible interval, the region of the distribution containing 95% of the probability.
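A minimal sketch of reading a 95% credible interval off a set of posterior draws (here the "draws" are simulated directly rather than produced by MCMC, purely for illustration):

```python
# 95% credible interval from posterior draws (simulated here for illustration).
import numpy as np

rng = np.random.default_rng(5)
draws = rng.normal(loc=1.3, scale=0.2, size=10_000)   # stand-in for MCMC output

lo, hi = np.percentile(draws, [2.5, 97.5])
print(lo, hi)                                         # ~ (0.91, 1.69)
```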


Gibbs Sampling

• A very powerful version of MCMC is the Gibbs Sampler

• Assume we are sampling from a vector of parameters, and that the full conditional distribution of each parameter (given the current values of all the others) is known

• For example, given a current value for all the fixed effects (but one, say β1) and the variances, conditioning on these values the distribution of β1 is a normal, whose parameters are now functions of the current values of the other parameters. A random draw is then generated from this distribution.

• Likewise, conditioning on all the fixed effects and all variances but one, the distribution of this variance is an inverse chi-square


This generates one cycle of the sampler. Using these new values, a second cycle is generated.

Full details in WL Appendix 3.
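A minimal Gibbs-sampler sketch, not the mixed-model sampler described above: it estimates the mean and variance of a single normal sample, assuming a flat prior on the mean and the usual 1/σ² reference prior on the variance, so that the full conditionals take the simple forms used below.

```python
# Gibbs sampler for the mean and variance of a single normal sample, under a
# flat prior on mu and a 1/sigma^2 prior on the variance, so the full
# conditionals are: mu | sigma^2 ~ N(ybar, sigma^2/n) and
# sigma^2 | mu ~ (scaled inverse chi-square) = sum((y - mu)^2) / chi-square(n).
import numpy as np

rng = np.random.default_rng(6)
y = rng.normal(loc=5.0, scale=2.0, size=100)    # simulated data: mu = 5, sigma^2 = 4
n, ybar = len(y), y.mean()

mu, sigma2 = 0.0, 1.0                           # arbitrary starting values
draws = []
for it in range(6_000):
    # 1. Draw mu from its full conditional given the current sigma2.
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))
    # 2. Draw sigma2 from its full conditional given the new mu.
    ss = np.sum((y - mu) ** 2)
    sigma2 = ss / rng.chisquare(n)
    if it >= 1_000:                             # discard burn-in, keep the rest
        draws.append((mu, sigma2))

mus, s2s = np.array(draws).T
print(mus.mean(), s2s.mean())                   # near the simulated 5.0 and 4.0
print(np.percentile(mus, [2.5, 97.5]))          # 95% credible interval for mu
```

Each pass through the loop is one cycle of the sampler; after burn-in, the retained draws form an empirical approximation to the joint posterior, from which marginal posteriors and credible intervals can be read directly.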
