
Page 1

BIOINFORMATIK II

PROBABILITY & STATISTICS

Summer semester 2006

The University of Zürich and ETH Zürich

Lecture 2a: Statistical estimation.

Prof. Andrew Barbour

Dr. Béatrice de Tilière

Adapted from a course by

Dr. D. Schuhmacher & Dr. D. Svensson.

Page 2

Problems in statistics:

Given: a probability model for some (chance) experiment:

X ∼ Pθ.

Here, Pθ is a probability distribution (given by a probability function

pθ(x) or a distribution function Fθ(x)) for any θ. The Pθ are all known,

but the actual value of the parameter θ is unknown.

(Ex: X ∼ Bin(100, p) but the probability p is unknown.)

Two main areas in statistics are:

• ESTIMATION: estimate the unknown value of θ given observations

of X (a single observation is usually not enough).

• TESTING: test a hypothesis about the unknown value of θ. Base

acceptance/rejection upon observations of X (a single observation is

usually not enough).

Page 3: Bioinformatic_2

3

Statistical estimation:

Given: a probability model: X ∼ Pθ,

where Pθ are known, but actual value of the parameter θ is unknown.

(ex: X ∼ Bin(100, p), but the probability p is unknown.)

• To be able to estimate the value of θ we repeat the experiment n

times independently, which gives x1, x2, . . . , xn. These are n

observations of X .

• Next step: use the observations x1, x2, . . . , xn to compute an

estimate of θ. (Observe the values, and then ‘take a good guess’ )

The collection x1, x2, . . . , xn is called an (observed) sample of random

variables X1, X2, . . . , Xn; the latter are independent and have the same

distribution as X .

The collection X1, X2, . . . , Xn is called a (random) sample.

Page 4

Estimator, estimate

Def: An estimator of θ is a function of the random variables

X1, X2, . . . , Xn, written θ(X1, X2, . . . , Xn).

For “theory” and principles of estimation. It is RANDOM!

******

Def: An estimate of θ is the quantity θ(x1, x2, . . . , xn) calculated from

the observed values x1, x2, . . . , xn of X1, X2, . . . , Xn.

For “practice”. The value computed after the experiment .

It is not random.

Page 5: Bioinformatic_2

5

Example: Suppose X ∼ Bin(100, θ), where the value of θ is unknown.

Let X1, . . . , X20 be a random sample; i.e.

Xi ∼ Bin(100, θ) for i = 1, 2, . . . , 20

and independent. Then:

θ1(X1, X2, . . . , X20) :=X1 + X2 + . . . + X20

2000

is an estimator of the unknown value of θ, and

θ2(X1, X2, . . . , X20) := X1 + X2 + . . . + X20

is another one.

However θ2(X1, X2, . . . , X20) is not very useful since it might be larger

than 1 (but θ has to be between 0 and 1 since here θ is a probability).

There are many possible estimators. How can we find a ‘good’ one?
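
As an illustration (not part of the original slides), here is a minimal Python sketch computing both estimates from a simulated sample; numpy is assumed available, and the ‘true’ value θ = 0.3 is an arbitrary choice used only to generate data:

    import numpy as np

    rng = np.random.default_rng(0)
    theta_true = 0.3                            # the 'unknown' parameter; fixed here only to simulate
    x = rng.binomial(100, theta_true, size=20)  # observed sample x1, ..., x20, each from Bin(100, theta)
    theta1_hat = x.sum() / 2000                 # (x1 + ... + x20)/2000: always lies in [0, 1]
    theta2_hat = x.sum()                        # x1 + ... + x20: typically far larger than 1
    print(theta1_hat, theta2_hat)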

Page 6: Bioinformatic_2

6

Some principles for finding ‘good’ estimators

• Consistency: For every ε > 0 and every possible value of θ:

( ∣

∣ θ(X1, X2, . . . , Xn) − θ∣

∣ > ε)

→ 0 as n → ∞

≈ “the more observations, the closer to the truth”.

• The mean square error MSE[ θ ] as low as possible:

MSEθ[ θ ] := Eθ

[

(

θ(X1, X2, . . . , Xn) − θ)2

]

for every θ.

‘Not too much variation’, ‘not too far away from truth’ .

(Note: MSEθ[ θ ] = Varθ[ θ ] if θ is unbiased , i.e. if Eθ[ θ ] = θ.)

• ‘Nice’ if the estimator θ(X1, X2, . . . , Xn) has a known probability

distribution (at least, in an asymptotic sense).

Maximum likelihood estimators have these properties!

Page 7: Bioinformatic_2

7

Maximum Likelihood Estimation (here: for discrete RVs)

Suppose X1, X2, . . . , Xn is a (random) sample from some distribution Pθ

with probability function pθ(x), where the value of θ is unknown.

Let

L(x1, x2, . . . , xn; θ) := Pθ(X1 = x1, X2 = x2, . . . , Xn = xn).

This is the probability for observing the sample x1, x2, . . . , xn if the

unknown parameter takes the value θ.

L(x1, x2, . . . , xn; θ) is called the likelihood function.

By independence of X1, . . . , Xn,

L(x1, x2, . . . , xn; θ) = pθ(x1) · pθ(x2) · · · pθ(xn).

Page 8: Bioinformatic_2

8

Now suppose that we really did observe the values x1, x2, . . . , xn.

Def: The maximum likelihood estimate (ML estimate) of θ is

the value θml of θ that maximizes the likelihood function

L(x1, x2, . . . , xn; θ).

That is, the value θml is the ML-estimate of θ if

L(x1, x2, . . . , xn; θml) > L(x1, x2, . . . , xn; θ)

for all other values θ 6= θml.

INTERPRETATION: “It is more likely to observe the sample

x1, x2, . . . , xn if the parameter θ is equal to θml than for any other value

of θ.”

“θml is the value of θ that best explains the data.”

Page 7

7

Maximum Likelihood Estimation (here: for discrete RVs)

Suppose X1, X2, . . . , Xn is a (random) sample from some distribution Pθ
with probability function pθ(x), where the value of θ is unknown.

Let

that maximizes

L(X1, X2, . . . , Xn; θ).

The maximization is carried out with respect to θ.

L(x1, x2, . . . , xn; θ) := Pθ(X1 = x1, X2 = x2, . . . , Xn = xn).

This is the probability for observing the sample x1, x2, . . . , xn if the
unknown parameter takes the value θ.

L(x1, x2, . . . , xn; θ) is called the likelihood function.

By independence of X1, . . . , Xn,

Two important things to remember:

L(x1, x2, . . . , xn; θ) = pθ(x1) · pθ(x2) · · · pθ(xn).

good properties!

2: How to compute the estimate:

In code, this product is direct to evaluate. A sketch (Python with scipy.stats; the Bin(100, θ) model of the running example and the sample values are assumptions for illustration):
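
    import numpy as np
    from scipy.stats import binom

    def likelihood(x, theta):
        # L(x1, ..., xn; theta) = p_theta(x1) * ... * p_theta(xn) for i.i.d. Bin(100, theta) data
        return np.prod(binom.pmf(x, 100, theta))

    x = np.array([28, 31, 25, 30])          # hypothetical observed sample
    for theta in (0.20, 0.25, 0.30, 0.35):
        print(theta, likelihood(x, theta))  # largest for the value closest to mean(x)/100 = 0.285

Page 8

8

Now suppose that we really did observe the values x1, x2, . . . , xn.

Def: The maximum likelihood estimate (ML estimate) of θ is
the value θ̂ml of θ that maximizes the likelihood function
L(x1, x2, . . . , xn; θ).

That is, the value θ̂ml is the ML estimate of θ if

L(x1, x2, . . . , xn; θ̂ml) > L(x1, x2, . . . , xn; θ)

for all other values θ ≠ θ̂ml.

INTERPRETATION: “It is more likely to observe the sample
x1, x2, . . . , xn if the parameter θ is equal to θ̂ml than for any other value
of θ.”

“θ̂ml is the value of θ that best explains the data.”

Page 9

9

X1, X2, . . . , Xn is a random sample from a probability distribution Pθ,
with θ unknown.

Def: The maximum likelihood estimator is the (random!) value θ̂ml
that maximizes

L(X1, X2, . . . , Xn; θ).

The maximization is carried out with respect to θ.

The maximization depends upon the random variables X1, . . . , Xn, so θ̂ml
is a function of X1, . . . , Xn, written θ̂ml(X1, X2, . . . , Xn).

(The maximum likelihood estimate is the value obtained by computing
θ̂ml(x1, x2, . . . , xn) using the observed sample x1, x2, . . . , xn.)

Page 10

10

Two important things to remember:

1: The maximum likelihood estimator θ̂ml(X1, X2, . . . , Xn) has
good properties!

2: How to compute the estimate:

Plug the observed data into the likelihood function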

L(x1, x2, . . . , xn; θ)

and vary the value of θ until you find the value that maximizes
L(x1, x2, . . . , xn; θ). (Mathematically, the ML estimate is usually found
by differentiation.)

Page 11

Example: Suppose that we want to find the maximum likelihood

estimate for θ in the Bin(100, θ)-distribution based on an observed

sample x1, x2, . . . , xn.

Then

pθ(x) = (100 choose x) θ^x (1 − θ)^(100−x) for x = 0, 1, . . . , 100;

and hence

L(x1, . . . , xn; θ) = (100 choose x1) θ^(x1) (1 − θ)^(100−x1) · · · (100 choose xn) θ^(xn) (1 − θ)^(100−xn).

The binomial coefficients do not vary with θ, so finding the θ that
maximizes the above expression amounts to finding the θ that maximizes

θ^(x1) (1 − θ)^(100−x1) · · · θ^(xn) (1 − θ)^(100−xn) = θ^(∑ xi) (1 − θ)^(100n − ∑ xi).

This can be done by differentiating in θ and setting the result to zero . . .

Page 12

...differentiating in θ and setting the result to zero:

Set s := ∑_{i=1}^n xi, and solve

0 = d/dθ [ θ^s (1 − θ)^(100n−s) ]
  = s θ^(s−1) (1 − θ)^(100n−s) − (100n − s) θ^s (1 − θ)^(100n−s−1)
  = [ s(1 − θ) − (100n − s)θ ] · θ^(s−1) · (1 − θ)^(100n−s−1)
  = [ s − 100nθ ] · θ^(s−1) · (1 − θ)^(100n−s−1).

Possible solutions are θ = 0, θ = 1 (→ no maximum!), and

θ = s/(100n) = (1/(100n)) ∑_{i=1}^n xi (→ maximum!).

Page 13

Hence, the maximum likelihood estimate for θ in the

Bin(100, θ)-distribution is given by

θ̂(x1, x2, . . . , xn) = (1/(100n)) ∑_{i=1}^n xi.

(Compare this with the two estimators on Slide 5: θ̂1(X1, X2, . . . , X20) is
the ML estimator!)
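
The closed form can be checked against a brute-force search over a grid of θ values. A sketch in Python (scipy.stats assumed available; the sample values are hypothetical):

    import numpy as np
    from scipy.stats import binom

    x = np.array([28, 31, 25, 30])       # hypothetical observations, each from Bin(100, theta)
    n = len(x)
    grid = np.linspace(0.001, 0.999, 999)
    # maximize the log-likelihood (numerically safer than the raw product)
    loglik = [binom.logpmf(x, 100, t).sum() for t in grid]
    print(grid[np.argmax(loglik)])       # numeric maximizer: 0.285
    print(x.sum() / (100 * n))           # closed form s/(100 n) = 114/400 = 0.285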

Page 14

Likelihoods are not just for independent observations!

• If x1, x2, . . . , xn is an observed sample of independent and identically
distributed random variables X1, . . . , Xn, then the likelihood function

is

L(x1, x2, . . . , xn; θ) = pθ(x1) · pθ(x2) · · · pθ(xn).

(A product, due to independence).

• Be careful with dependent variables, and with random variables

having different distributions! (E.g., observations from a Markov

chain.)

The likelihood function L(x1, x2, . . . , xn; θ) is then still defined as the

probability for observing the sample, but L(x1, x2, . . . , xn; θ) cannot

be computed as the product above!

Page 15

Unbiased estimators

Let θ̂(X1, X2, . . . , Xn) be some estimator of the unknown value of θ.

If

Eθ[ θ̂(X1, X2, . . . , Xn) ] = θ

holds for every possible value of θ, we say that θ̂ is unbiased.

It is ‘nice’ if our estimator has this property, but it is not a good

principle to rely on in order to find good estimators!

• One often gets unbiasedness only at the expense of other nice

properties.

• Sometimes unbiased estimators are useless
(e.g. the unbiased estimator for p in the Geo(p)-distribution).

• Sometimes there is no unbiased estimator at all
(e.g. there is no unbiased estimator for 1/p in the
Bin(n, p)-distribution).
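
For the binomial ML estimator from Page 13, unbiasedness does hold, since Eθ[∑ Xi/(100n)] = 100nθ/(100n) = θ. A quick Monte Carlo sketch (numpy; θ = 0.3 is an assumed value):

    import numpy as np

    rng = np.random.default_rng(2)
    theta, n = 0.3, 20
    samples = rng.binomial(100, theta, size=(100000, n))
    theta_hat = samples.sum(axis=1) / (100 * n)
    print(theta_hat.mean())    # very close to 0.3, as unbiasedness predicts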

Page 16

BIOINFORMATIK II

PROBABILITY & STATISTICS

Summer semester 2006

The University of Zürich and ETH Zürich

Lecture 2b: Statistical hypothesis testing.

Prof. Andrew Barbour

Dr. Béatrice de Tilière

Adapted from a course by

Dr. D. Schuhmacher & Dr. D. Svensson.

Page 17

Statistical testing problem:

Given: a probability model for some (chance) experiment:

X ∼ Pθ.

Typically, the form of the probability distribution Pθ is known for each θ,

but the actual value of the parameter θ is unknown.

(Ex: X ∼ Bin(100, p) but the probability p is unknown.)

Want: to be able to test the plausibility of certain hypotheses concerning

the probability model.

Typically: the hypotheses specify values for the parameter θ.

(Ex: X ∼ Bin(100, p);

‘null’ hypothesis: p = 0.25,

‘alternative’ hypothesis: p = 0.35.)

Page 18

Statistical hypothesis testing involves the test of a

null hypothesis H0 against an alternative hypothesis HA.

• H0 is the ‘default’ hypothesis, taken to be the truth unless

convincing evidence against it is found.

(‘Default’? In the sense of:

‘two given sequences are not evolutionarily related’,

‘there is no life on Mars’).

• HA is more ‘controversial’, and its acceptance (in place of H0)
requires strong evidence.

(‘Controversial?’ Like:

‘two given sequences are evolutionarily related’,

‘there is life on Mars’)

Page 19

Example (sequence matching):

Two random DNA sequences of length N , where

• the letters within each sequence are independently generated;

• each letter equals a, c, g, or t with uniform probabilities, i.e.

0.25, 0.25, 0.25, 0.25.

Let X = the number of matches; then

X ∼ Bin(N, p) ; p = P(‘match’).

H0 : p = 0.25 (i.e. the sequences are not evolutionarily related).

HA : p > 0.25 (sequences are evolutionarily related).

(The value of p depends on whether the sequences are dependent or not,

i.e. on the joint probabilities!).

If X takes an unexpectedly large value, we reject H0 and accept HA as

being the truth.

Page 20

The decision to make:

Accept H0, or reject it in favour of HA.

How?

The decision taken is based upon the observed value of some function

T (X1, X2, . . . , Xn)

of the sample X1, X2, . . . , Xn.

This function is called a test statistic.

It is a random variable!

Page 21

Critical value(s), rejection region.

Ideally, the distribution of T (X1, X2, . . . , Xn) is known under the

assumption that H0 is true.

From this null hypothesis distribution, one or several critical values can

be determined.

If H0 is true, it is unlikely that these critical values will be reached by T.

But if that happens, then H0 is rejected

(since H0 then explains the data X1, X2, . . . , Xn badly).

(Exactly what does ‘unlikely’ mean? That depends upon the ‘significance

level’ of the test; determined by the experimenter!)

Page 22

Type I error.

Ex: Suppose that the null hypothesis H0 will be rejected if

T (X1, X2, . . . , Xn) > C for some constant C.

And otherwise, if T (X1, X2, . . . , Xn) ≤ C, H0 is accepted.

(C is the critical value in this example).

NOTE: Even if H0 is true it is (in general) possible that

T (X1, X2, . . . , Xn) > C, since the data are random!

If this occurs, a Type I error is being made: rejection of a true null

hypothesis.

Page 23

Significance level

(Type I error = rejection of a true null hypothesis).

The probability α of this type of incorrect decision, the significance
level, should be kept (reasonably) low:

α = P( T (X1, X2, . . . , Xn) > C | H0 true ) = P( Type I error ).

Typically, one takes α equal to some low probability (often 0.05 or 0.01),

and then determines the corresponding value of C.

C depends upon α!
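
A sketch of this computation (Python with scipy.stats), using the Bin(20, 0.25) null distribution that reappears in the power example on Page 27: search for the smallest C whose rejection probability under H0 does not exceed α.

    from scipy.stats import binom

    alpha = 0.05
    n, p0 = 20, 0.25
    C = 0
    while binom.sf(C, n, p0) > alpha:   # binom.sf(C, n, p0) = P(X > C | H0 true)
        C += 1
    print(C, binom.sf(C, n, p0))        # C = 8, attained level approx. 0.041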

Page 24

‘Statistical Significance’

Our test (at significance level α) is

Reject H0 ⇐⇒ T (X1, X2, . . . , Xn) > C.

If we observe values x1, . . . , xn such that T (x1, x2, . . . , xn) > C, then we
reject H0 and say that we have ‘statistical significance’.

***

The statement ‘statistical significance’ is always relative to some
significance level α.

(‘Statistical significance’ does not automatically mean ‘good scientific
evidence’: if α = 1, then any experimental outcome would be ‘statistically
significant’.)

Page 25

Type II error

Another type of incorrect decision can also be made:

Type II error: acceptance of a false null hypothesis.

(Type II is usually a less serious error than type I).

Suppose that the significance level is fixed (α = 0.05 or some other

value), and the critical value C has been determined such that

α = P( T (X1, X2, . . . , Xn) > C | H0 true )

holds.

Then, the probability of a Type II error is

β = P( T (X1, X2, . . . , Xn) ≤ C | H0 false ).

Page 26

Power, Type II error

Furthermore, the power of the test is then defined as

1 − β = P( T (X1, X2, . . . , Xn) > C | H0 false ).

The power describes how good the test is at detecting that the null
hypothesis is false.

Page 27

Power, Type II errors

The power is typically more complicated to compute than the significance
level.

In fact, the power might depend upon ‘how false H0 is’, in the following

sense:

Ex: X ∼ Bin(20, p). H0 : p = 0.25, HA : p > 0.25. Suppose the

significance level is fixed at α = 0.041. Then

P(X > 8|H0 true) = P(Bin(20, 0.25) > 8) = 0.041

The power (1 − β)? Assume H0 false (i.e., p > 0.25).

1 − β = P(X > 8|p = 0.26) = P(Bin(20, 0.26) > 8) = 0.0515.

1 − β = P(X > 8|p = 0.3) = P(Bin(20, 0.3) > 8) = 0.1133.

....

1 − β = P(X > 8|p = 0.9) = P(Bin(20, 0.9) > 8) ≈ 1.
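
The same calculations in code (a sketch with scipy.stats; C = 8 is the critical value fixed above):

    from scipy.stats import binom

    C, n = 8, 20
    print(binom.sf(C, n, 0.25))      # significance level approx. 0.041
    for p in (0.26, 0.30, 0.90):     # 'how false H0 is'
        print(p, binom.sf(C, n, p))  # power: approx. 0.0515, 0.1133, approx. 1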

Page 28

p-values:

The p-value is a probability, and can only be computed after the data

have been observed.

Suppose that the test is: we reject H0 if and only if

T (X1, X2, . . . , Xn) > C, for some critical value C.

The significance level is

α = P( T (X1, X2, . . . , Xn) > C | H0 true ).

Now suppose we observe x1, x2, . . . , xn.

Compute the observed test statistic t := T (x1, x2, . . . , xn).

The p-value is defined as

P( T (X1, X2, . . . , Xn) ≥ t | H0 true ).

(Interpretation: the probability of seeing something at least as extreme
as what was just observed... “how unlikely the observed value is”.)
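
A sketch of the p-value computation for the running binomial example (scipy.stats; the observed value t = 9 is hypothetical). Since T takes integer values, P(T ≥ t) = P(T > t − 1):

    from scipy.stats import binom

    t_obs = 9                                  # hypothetical observed test statistic
    p_value = binom.sf(t_obs - 1, 20, 0.25)    # P(T >= t_obs | H0 true)
    print(p_value)                             # approx. 0.041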

Page 29

The five main steps in statistical testing:

1. Declare the hypotheses H0 and HA. (Before the data are seen!!!)

2. Determine a test statistic.

3. Choose the significance level α ∈ (0, 1).

4. Determine those observed values of the test statistic that lead to

rejection of H0 (determine the critical value(s)).

5. Obtain the data and determine whether the observed value of the test

statistic is equal to or more extreme than the critical value(s) calculated

in step 4.

Page 30

NOTE: Point 4 requires the knowledge of the distribution of the test

statistic under the assumption that H0 is true.

4. Determine those observed values of the test statistic that lead to

rejection of H0 (determine the critical value(s)).

This means typically:

Given the significance level α ∈ (0, 1), for which C do we have

P( T (X1, X2, . . . , Xn) > C | H0 true ) = α ?

If this distribution is unknown or too complicated, it can often be

approximately determined from a computer simulation.

Page 31

How can we find a good test statistic?

To be able to perform a statistical test, one is required to find a suitable

test statistic.

2. Determine a test statistic.

In many cases it is optimal to use the likelihood ratio as a statistic.

(Optimal in certain probabilistic senses.)

Page 32

Simple hypotheses

Suppose that we have a test problem where the hypotheses are simple,

i.e. they completely specify the probability function.

Ex: X ∼ Bin(N, p). H0 : p = 0.25, HA : p = 0.35, which is equivalent to

H0 : P(X = k) = (N choose k) 0.25^k (1 − 0.25)^(N−k)

and

HA : P(X = k) = (N choose k) 0.35^k (1 − 0.35)^(N−k).

Page 33

Likelihood ratio test:

Let X1, X2, . . . , Xn be the sample (independent RVs), and let pθ0(x) and

pθ1(x) be the probability functions specified by the simple hypotheses H0

and HA, respectively.

Define the likelihood ratio LR as

LR := L(X1, X2, . . . , Xn; θ1) / L(X1, X2, . . . , Xn; θ0)
    = [ pθ1(X1) · pθ1(X2) · · · pθ1(Xn) ] / [ pθ0(X1) · pθ0(X2) · · · pθ0(Xn) ].

Choose a constant C such that

P( LR ≥ C | H0 true ) = α.

Then this yields the most powerful test at significance level α for
this testing problem (the Neyman–Pearson lemma).

There are good reasons for using likelihood ratios in statistics
(good test properties, a well-studied topic).

Page 34

Example (sequence matching):

Two DNA sequences of length n. We want to test if they are

evolutionarily related.

Assumption: letters in each sequence independently generated with

uniform probabilities (probability 1/4 for each letter).

Let Xi = 1 if there is a match at position i, otherwise Xi = 0. Suppose

that X1, X2, . . . , Xn are i.i.d. (independent and identically distributed)

with match probability p.

(Note that in this case P(Xi = x) = p^x (1 − p)^(1−x) for x ∈ {0, 1}.)

We now want to test

H0: p = 0.25 (i.e. the sequences are not evolutionarily related) against

HA: p = 0.35 (sequences are evolutionarily related and there is some

evidence that in that case p ≈ 0.35)

at a significance level of α = 0.05 (or approximately so). → LR-Test

Page 35

We make a likelihood ratio test:

Likelihood: For x1, . . . , xn ∈ {0, 1},

L(x1, . . . , xn; p) = p^(x1) (1 − p)^(1−x1) · · · p^(xn) (1 − p)^(1−xn).

Therefore, with p0 = 0.25 and p1 = 0.35,

LR = p1^(∑ xi) (1 − p1)^(n − ∑ xi) / [ p0^(∑ xi) (1 − p0)^(n − ∑ xi) ]
   = (p1/p0)^s · ((1 − p1)/(1 − p0))^(n−s)
   = (7/5)^s (13/15)^(n−s),

where s = ∑_{i=1}^n xi is the total number of matches. But this LR is just an
increasing function of s! So instead of rejecting H0 if LR is ‘too big’,
we can also use the test that rejects H0 if s is ‘too big’. What does
‘too big’ mean?

Page 36

What is ‘too big’? What is the critical value C for rejecting H0?

We want significance level α ≈ 0.05, that is, for S := ∑_{i=1}^n Xi, we choose
C in such a way that

P(S > C | H0 true) ≈ 0.05.

We know that, under the assumption that H0 is true, S has the

Bin(n, 0.25) distribution, so we just have to find out at what value C the

distribution function of the Bin(n, 0.25)-distribution jumps from below

0.95 to above 0.95.

Choose n = 1000 (say); then

P(S ≤ 272|H0 true) ≈ 0.9488, so P(S > 272|H0 true) ≈ 0.0512;

P(S ≤ 273|H0 true) ≈ 0.9559, so P(S > 273|H0 true) ≈ 0.0441.

Page 37

Thus for an observed sample (x1, x2, . . . , x1000), our test at significance
level ≈ 0.0441 says: we can reject H0 if ∑_{i=1}^{1000} xi > 273, and we cannot
reject it if ∑_{i=1}^{1000} xi ≤ 273.

This test has power

P(S > C | H0 false) = P(S > 273 | p = 0.35) ≈ 0.9999998 ≈ 1.

(Which is so good because we have such a big sample!!)

***

NOTE: If one wants a significance level of exactly 0.05, one usually
decides randomly whether to reject H0 in the ‘critical case’ that
∑_{i=1}^{1000} xi = 273!
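
These numbers can be reproduced directly from the Bin(1000, p) distributions (a sketch with scipy.stats):

    from scipy.stats import binom

    n = 1000
    print(binom.cdf(272, n, 0.25))   # approx. 0.9488, so P(S > 272 | H0 true) approx. 0.0512
    print(binom.cdf(273, n, 0.25))   # approx. 0.9559, so P(S > 273 | H0 true) approx. 0.0441
    print(binom.sf(273, n, 0.35))    # power approx. 0.9999998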

Page 38

The introductory example, revisited

For the sequences from the first lecture (Slide 5), one obtains the test
(n = 26): “Reject H0 if ∑_{i=1}^{26} xi > 10, and do not reject H0 if
∑_{i=1}^{26} xi ≤ 10”, which is at a significance level of ≈ 0.0401.

This means that we have statistical significance for rejecting the
null hypothesis that the two sequences are not evolutionarily related.

(The power of the test is this time only about 0.278.)

***

Note, however, that if we wanted a lower significance level, say α ≈ 0.01,

we would not be able to reject H0 (the p-value for 11 matches in two

sequences of length 26 is about 0.0155).
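
The n = 26 numbers can be checked the same way (a sketch with scipy.stats; binom.sf(k, n, p) gives P(X > k)):

    from scipy.stats import binom

    n = 26
    print(binom.sf(10, n, 0.25))   # P(S > 10 | H0 true)  approx. 0.0401: the significance level
    print(binom.sf(10, n, 0.35))   # P(S > 10 | p = 0.35) approx. 0.278: the power
    print(binom.sf(11, n, 0.25))   # P(S > 11 | H0 true)  approx. 0.0155: the p-value quoted above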