BIOINFORMATIK II

PROBABILITY & STATISTICS

Summer semester 2006

University of Zürich and ETH Zürich

Lecture 1: Basic probability.

Prof. Andrew Barbour

Dr. Béatrice de Tilière

Adapted from a course by Dr. D. Schuhmacher & Dr. D. Svensson.

Web page: http://www.math.unizh.ch/baps/lectures/bioinf2.html

Upload: bharatbioinormaticshau

Post on 17-Jul-2016



Web page

http://www.math.unizh.ch/baps/lectures/bioinf2.html

You will find there:

• Up-to-date information;

• Transparencies of the lectures;

• Exercise sheets with solutions (...in due time);

• Additional background material.


Course content

• Basic probability concepts

’probability distribution’, ’independence’, ’conditional probability’,

’expectation’, ’standard deviation’, . . .

• Concepts and principles in statistics

’estimation’, ’hypothesis testing’, ’maximum likelihood’, ’likelihood

ratio’, ’significance’, ’p-value’, . . .

• Markov chains

’transition matrix’, ’stationary distribution’, ’reversibility’, ’random

walks’, ’hidden Markov models’, . . .

• Models and algorithms in bioinformatics

’sequence alignment’, ’models for evolution’,

’PAM/BLOSUM-matrices’, ’BLAST’, . . .

Principles, rather than ’How-to’ !


[Figure: three related DNA sequences linked by ’evolutionary changes’:

a g g t g a c c c t . . . g t c a t t t

t g g a g c c a t . . . g t c g a t t

a c g t c a c c c t . . . g a c a t t t]


Why probability and statistics in bioinformatics?

Given: Two sequences from two species. Common ancestor?

g g a g a c t g t a g a c a g c t a a t g c t a t a

g a a c g c c c t a g c c a c g a g c c c t t a t c

• Sequence length: 26 nucleotides.

• 11 of 26 positions agree.

Conclusion? Generated ’purely by chance’ or by some other

mechanism?

To be able to answer this, one needs to understand properties of

random sequences.
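The match count quoted above can be checked with a few lines of Python. This is only a minimal sketch: the function name `count_matches` is introduced here for illustration and is not part of the course material.

```python
# The two example sequences from the slide (26 nucleotides each).
seq1 = "ggagactgtagacagctaatgctata"
seq2 = "gaacgccctagccacgagcccttatc"

def count_matches(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b))

matches = count_matches(seq1, seq2)
print(f"{matches} of {len(seq1)} positions agree")  # 11 of 26 positions agree
```

Counting is the easy part; deciding whether 11 matches is unusually many requires the distribution of the match count for random sequences.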


Probability & statistics in bioinformatics?

For ...

• modelling sequence evolution (Markov chains);

• inferring phylogenetic trees (maximum likelihood trees);

• gene prediction (hidden Markov models);

• analysis of microarray data (multiple testing, multivariate statistics);

• evaluating sequence similarity in BLAST searches (extreme values, random walks);

• much more!


Random variables (RVs)

A random variable = a ’numerical’ quantity whose value depends on the outcome of some chance experiment.

Ex 1. Flip a coin and let X=1 if ’head’ occurs, otherwise let X = 0.

Then X is a random variable.

Ex 2. Two DNA sequences are randomly chosen from a database. Then

X = the number of matches between the sequences is a RV.

Ex 3. Let X = the waiting time until a certain event occurs; e.g., time

until a nucleotide substitution first occurs at a specified position in a

genome. Then X is a RV.

There are two main types of random variables:

• either DISCRETE (as in examples 1 and 2)

• or CONTINUOUS (as in example 3).


Probability distribution of a RV

The important feature of a random variable is its probability distribution.

The probability distribution of a random variable X is basically the mechanism (mathematically: the function) which tells us with what probability the random variable takes each of its values.

For a discrete random variable the probability distribution can be

expressed either by its probability function or by its distribution function.


Probability and distribution functions (DISCRETE RVs)

Let X be any discrete random variable, and denote the set of the

possible values with S (=’sample space’).

Associated with the random variable X are

• The probability function $p_X$:

$$p_X(i) := P(X = i) \in [0, 1], \quad i \in S;$$

• and the (cumulative) distribution function $F_X$:

$$F_X(j) := P(X \le j) = \sum_{i \in S;\, i \le j} P(X = i) \in [0, 1], \quad j \in S.$$

A mathematical analysis of a random variable typically requires explicit

formulas for these functions!

In principle, these two functions contain all essential information

concerning properties and behavior of the random variable.


Similar formulas hold for continuous random variables X. Then the (cumulative) distribution function is given by

$$F_X(t) := P(X \le t) = \int_{x \le t} f_X(x)\, dx, \quad t \in S,$$

for some probability density function $f_X$ with $f_X(x) \ge 0$.

Caution with the interpretation in the continuous case:

• the density $f_X(x)$ is NOT equal to P(X = x) (which is in fact always zero for continuous RVs!),

• $f_X(x)$ is NOT a probability (that is, $f_X(x)$ might be > 1 ...).

Think of it as P(t ≤ X ≤ t + h) ≈ $f_X(t)$ · h, where h > 0 is small.

Discrete random variables are perhaps more important to

bioinformatics... (in some sense)
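The caution about densities can be illustrated numerically. The sketch below assumes an exponentially distributed waiting time with rate 2; this distribution and the names `density` and `cdf` are chosen here for illustration only and do not come from the slides.

```python
import math

rate = 2.0  # assumed rate of an exponential waiting time (illustrative choice)

def density(x: float) -> float:
    """Density f_X(x) = rate * exp(-rate * x) for x >= 0, and 0 otherwise."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def cdf(t: float) -> float:
    """Distribution function F_X(t) = 1 - exp(-rate * t) for t >= 0."""
    return 1.0 - math.exp(-rate * t) if t >= 0 else 0.0

t, h = 0.1, 1e-4
exact = cdf(t + h) - cdf(t)   # P(t <= X <= t + h)
approx = density(t) * h       # f_X(t) * h

print(density(0.0))           # 2.0, a density value larger than 1
print(exact, approx)          # nearly equal for small h
```

The density exceeds 1 at x = 0, yet every probability computed from it stays between 0 and 1.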


Example: Flip a coin and let X=1 if ’head’ occurs, otherwise let X = 0.

Then the sample space is S = {0, 1} and the probability function

pX(i) := P(X = i) is given by

$$p_X(0) = \frac{1}{2}, \qquad p_X(1) = \frac{1}{2}.$$

The distribution function is given by

$$F_X(0) := P(X \le 0) = \frac{1}{2}$$

and

$$F_X(1) := P(X \le 1) = P(X = 0) + P(X = 1) = \frac{1}{2} + \frac{1}{2} = 1.$$
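The step from probability function to distribution function is just a cumulative sum. A minimal sketch for the coin example, using exact fractions; the helper name `distribution_function` is hypothetical.

```python
from fractions import Fraction

# Probability function of the coin variable X (values 0 and 1, fair coin).
pmf = {0: Fraction(1, 2), 1: Fraction(1, 2)}

def distribution_function(pmf: dict, j) -> Fraction:
    """F_X(j): sum of P(X = i) over all possible values i <= j."""
    return sum((p for i, p in pmf.items() if i <= j), Fraction(0))

print(distribution_function(pmf, 0))  # 1/2
print(distribution_function(pmf, 1))  # 1
```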


Ex: “Random DNA sequences” with i.i.d. letters.

Two sequences of N letters are randomly generated, i.e.

• the letters are independently generated,

• each position equals a, c, g, or t with probabilities pa, pc, pg, pt

Seq1: gtacacgggata...tacgtgact

Seq2: cgaggtagtcga...tttatacga

Let X = the number of matches. Then the probability function of X is

$$P(X = k) = \binom{N}{k}\, p^k (1 - p)^{N-k}, \quad \text{where} \quad \binom{N}{k} = \frac{N!}{k!\,(N - k)!},$$

for some match probability p ∈ [0, 1]. [n! = 1 · 2 · . . . · (n − 1) · n].

This is known as the binomial distribution with parameters N and p.


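$$P(X = k) = \binom{N}{k}\, p^k (1 - p)^{N-k}.$$

Why?

Step 1: Fix any position (j, say).

Let p = P(match in the position considered) (which does not depend on the position chosen). The four ways of matching exclude one another, and the two letters are generated independently, so

$$p = P(\text{two 'a', or two 'c', or two 'g', or two 't'}) = P(\text{two 'a'}) + P(\text{two 'c'}) + P(\text{two 'g'}) + P(\text{two 't'}) = p_a \cdot p_a + p_c \cdot p_c + p_g \cdot p_g + p_t \cdot p_t.$$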


So the match probability is $p = p_a^2 + p_c^2 + p_g^2 + p_t^2$

(p = 1/4 if $p_a = p_c = p_g = p_t = 1/4$).

The probability for a mismatch is (1 − p).
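As a quick numerical check of Step 1, the match probability can be computed from any set of letter frequencies. The function name and the skewed frequencies below are made up for illustration.

```python
def match_probability(freqs: dict) -> float:
    """p = sum of squared letter probabilities (two independent letters agree)."""
    if abs(sum(freqs.values()) - 1.0) > 1e-9:
        raise ValueError("letter probabilities must sum to 1")
    return sum(q * q for q in freqs.values())

uniform = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
skewed = {"a": 0.4, "c": 0.1, "g": 0.1, "t": 0.4}   # hypothetical frequencies

print(match_probability(uniform))   # 0.25
print(match_probability(skewed))    # about 0.34
```

Unequal letter frequencies always give a match probability above 1/4, so matches become more likely by chance alone.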

Step 2: What is P(X = k)?

Exactly k matches and N − k mismatches can occur in different ways: for

example,

[Figure: Seq 1 aligned with Seq 2, showing k matches (probability $p^k$) and N − k mismatches (probability $(1 - p)^{N-k}$).]

Each such configuration has probability $p^k (1 - p)^{N-k}$.


We have to add the probabilities for the different configurations to get the

total probability:

$$P(X = k) = p^k (1 - p)^{N-k} + \ldots + p^k (1 - p)^{N-k}.$$

How many terms are there?

Combinatorial arguments: there are

$$\binom{N}{k} = \frac{N!}{k!\,(N - k)!}$$

possible configurations of k matches and N − k mismatches.

Therefore

$$P(X = k) = \binom{N}{k}\, p^k (1 - p)^{N-k}.$$
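The counting argument can be verified by brute force for a small N. The sketch below enumerates every match/mismatch pattern and adds up the probabilities; N = 8 and p = 0.25 are illustrative values, and both function names are hypothetical.

```python
from itertools import product
from math import comb

def pmf_by_enumeration(n: int, k: int, p: float) -> float:
    """Add p^k (1-p)^(n-k) over every match/mismatch pattern with k matches."""
    total = 0.0
    for pattern in product([True, False], repeat=n):   # True marks a match
        if sum(pattern) == k:
            total += p ** k * (1 - p) ** (n - k)
    return total

def pmf_by_formula(n: int, k: int, p: float) -> float:
    """Binomial probability function C(n, k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 8, 0.25   # small values, so that all 2^n patterns can be listed
for k in range(n + 1):
    assert abs(pmf_by_enumeration(n, k, p) - pmf_by_formula(n, k, p)) < 1e-12

# The probabilities over k = 0, ..., n add up to 1 (up to rounding).
print(sum(pmf_by_formula(n, k, p) for k in range(n + 1)))
```

Enumeration costs 2^N steps, which is why the closed formula matters for realistic sequence lengths.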


The binomial distribution

Any random variable Y having the probability function

$$p(k) = \binom{N}{k}\, p^k (1 - p)^{N-k}, \quad k = 0, 1, \ldots, N,$$

is said to be binomially distributed (important in general, not just for counting matches between random sequences!).

In general: imagine that

• N independent trials are carried out,

• for each trial, P(’success’ ) = p, and P(’failure’ ) = 1 − p.

Let X = the number of successes. Then X is binomially distributed with

parameters N and p. Notation:

X ∼ Bin(N, p).
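With the binomial distribution in hand, one can return to the opening example of 11 matches in 26 positions. The sketch below assumes i.i.d. letters with equal frequencies (so p = 1/4), which is a modelling assumption rather than something established for those two sequences; the function names are hypothetical.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Bin(n, p)."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))

# Assumption: 26 independent positions, each matching with probability 1/4.
print(upper_tail(11, 26, 0.25))
```

The printed value is the chance of seeing at least 11 matches between two purely random sequences under this model.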


...other distributions?

In general, different RVs have different probability distributions (i.e. different probability functions).

Consider the random sequence example again, and define

Y = the first position where a match occurs

(counted from left to right).

Seq1 : c g t c g t ... g

Seq2 : g a c c c t ... t

Then the probability function of Y would be

$$p_Y(k) = (1 - p)^{k-1} \cdot p$$

for k = 1, 2, 3, . . ., where p is the probability for having a match at a

fixed position i, 1 ≤ i ≤ N .

This is called the geometric distribution with parameter p.
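A small sketch of the geometric probability function, together with a helper that finds the first matching position in the visible part of the example sequences. Both function names are introduced here for illustration, and p = 0.25 assumes equally likely letters.

```python
def geometric_pmf(k: int, p: float) -> float:
    """P(Y = k) = (1 - p)^(k - 1) * p for k = 1, 2, 3, ..."""
    if k < 1:
        return 0.0
    return (1 - p) ** (k - 1) * p

def first_match_position(a: str, b: str):
    """1-based position of the first match, or None if no position matches."""
    for pos, (x, y) in enumerate(zip(a, b), start=1):
        if x == y:
            return pos
    return None

print(first_match_position("cgtcgt", "gaccct"))   # 4
print(geometric_pmf(4, 0.25))                     # 0.10546875
```

Three mismatches followed by a match has probability (3/4)^3 · (1/4), which is the second printed value.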


Another important distribution is the uniform distribution.

Suppose that each of the (finitely many) possible values of X is equally likely, that is,

$$P(X = k) = \frac{1}{N}$$

for each possible value k, where N is the number of possible values.

Then X is said to be uniformly distributed.


...some important distributions?

There are infinitely many possible probability distributions but some

appear over and over again in applications.

Some examples are

• Binomial

• Geometric

• Uniform

• Poisson

• Normal

• Exponential

• Chi-square

• ...


Probabilities of events

Let S be the set of possible outcomes of some ’experiment’

(S is the sample space).

An event is something that either will or will not occur when the

experiment is conducted (mathematically, E ⊂ S).

Ex.1 Experiment: counting matches between two sequences of length

1000. Then S = {0, 1, . . . , 1000}.

The event E = ’at least 50% identity’ is E = {500, 501, . . . , 1000}.

Ex.2 Experiment: Rolling a dice once. Then S = {1, 2, 3, 4, 5, 6}.

Then E1 = ’the number turning up is at least 3’ = {3, 4, 5, 6},

and E2 = ’the number turning up is odd’ = {1, 3, 5}.


Let E, E1 and E2 be some events.

Interpretations:

• $E^c$ = ’the event E does not occur’;

• E1 ∪ E2 = ’at least one of the events E1 and E2 occurs’;

• E1 ∩ E2 = ’both the events E1 and E2 occur’.

If the events E1 and E2 cannot occur together, then they are said to be

mutually exclusive.

(Mathematically, two events are mutually exclusive if E1 ∩ E2 is the

empty set)


How to compute probabilities of events

• P(S) = 1.

• For any event E ⊂ S, 0 ≤ P(E) ≤ 1.

• $P(E^c) = 1 − P(E)$.

• For mutually exclusive events E1 and E2,

P(E1 ∪ E2) = P(E1) + P(E2).

• For any two events E1 and E2,

P(E1 ∪ E2) = P(E1) + P(E2) − P(E1 ∩ E2).
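These rules can be checked on the dice events from the earlier example by treating events as Python sets. The sketch assumes a fair dice, so that every outcome has probability 1/6; the helper `prob` is hypothetical.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}    # sample space of one roll
E1 = {3, 4, 5, 6}         # 'the number turning up is at least 3'
E2 = {1, 3, 5}            # 'the number turning up is odd'

def prob(event: set) -> Fraction:
    """P(E) for a fair dice: favourable outcomes / possible outcomes."""
    return Fraction(len(event & S), len(S))

assert prob(S) == 1
assert prob(S - E1) == 1 - prob(E1)                           # complement rule
assert prob(E1 | E2) == prob(E1) + prob(E2) - prob(E1 & E2)   # union rule

print(prob(E1 | E2))   # 5/6
```

Simply adding P(E1) and P(E2) would give 7/6 here, because the outcomes 3 and 5 would be counted twice.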


Conditional probabilities

A fair dice is rolled once. Suppose that it is known that the number turning up is less than or equal to three.

How likely is it then that it is an odd number?

P(number is odd | number less than or equal to 3) = 2/3

The information given is: the number is 1, 2 or 3.

Two of these three outcomes are odd: therefore 2/3.

(NOTE: Without the additional information given,

P(number is odd ) = 1/2).


...conditional probabilities...

Suppose that E1 and E2 are two events associated with some random

experiment.

Then the conditional probability P(E1|E2) that E1 occurs, given that

E2 occurs, is defined as

$$P(E_1 \mid E_2) = \frac{P(E_1 \cap E_2)}{P(E_2)}.$$

Here we assume that P(E2) > 0.


The conditional probability formula:

$$P(E_1 \mid E_2) = \frac{P(E_1 \cap E_2)}{P(E_2)}.$$

***

Ex: The dice example again:

$$P(\text{number is odd} \mid \text{number} \le 3) = \frac{P(\text{number is odd and} \le 3)}{P(\text{number} \le 3)} = \frac{P(\{1, 3\})}{P(\{1, 2, 3\})} = \frac{2/6}{3/6} = \frac{2}{3}.$$
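The same calculation as a minimal sketch in Python, again assuming a fair dice and using sets for events; `prob` and `conditional` are names chosen here for illustration.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # outcomes of one roll of a fair dice

def prob(event: set) -> Fraction:
    """P(E) when all outcomes in S are equally likely."""
    return Fraction(len(event & S), len(S))

def conditional(e1: set, e2: set) -> Fraction:
    """P(E1 | E2) = P(E1 and E2) / P(E2), defined only if P(E2) > 0."""
    if prob(e2) == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(e1 & e2) / prob(e2)

odd = {1, 3, 5}
at_most_three = {1, 2, 3}

print(conditional(odd, at_most_three))   # 2/3
print(prob(odd))                         # 1/2
```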


Independence

Mathematically, two events E1 and E2 are said to be independent if

and only if

P(E1 ∩ E2) = P(E1) · P(E2)

holds. When P(E1) > 0 and P(E2) > 0, this is equivalent to

P(E1|E2) = P(E1) and P(E2|E1) = P(E2),

so for independent events E1 and E2, the information about the

experiment contained in E2 says nothing about the occurrence of E1 (and

vice versa).

Think of two random variables X and Y as being independent if the

value of one does not in any way affect the probabilities associated with

the possible values of the other one.
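The product criterion is easy to test on concrete events. The sketch below uses a fair dice; the particular events and the helper `independent` are chosen here for illustration and do not appear in the slides.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # outcomes of one roll of a fair dice

def prob(event: set) -> Fraction:
    """P(E) when all outcomes in S are equally likely."""
    return Fraction(len(event & S), len(S))

def independent(e1: set, e2: set) -> bool:
    """True if P(E1 and E2) = P(E1) * P(E2)."""
    return prob(e1 & e2) == prob(e1) * prob(e2)

even = {2, 4, 6}
at_most_four = {1, 2, 3, 4}
odd = {1, 3, 5}
at_most_three = {1, 2, 3}

print(independent(even, at_most_four))   # True:  1/3 = 1/2 * 2/3
print(independent(odd, at_most_three))   # False: 1/3 != 1/2 * 1/2
```

The second pair is the one from the conditional probability example: knowing that the number is at most 3 raises the probability of 'odd' from 1/2 to 2/3, so the events are dependent.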


Two sequences linked by evolution are dependent ...


...Independence...

Once again: match counts, two random sequences...

Seq1: gtacacgggata...tacgtgact

Seq2: cgaggtagtcga...tttatacga

Each position equals a, c, g, or t with probabilities pa, pc, pg, pt, and we

define X = the number of matches.

• If the positions in the sequences are independently generated, then

X ∼ Bin(N, p).

• X will not be binomially distributed if successive nucleotides are

dependent (i.e. if neighbors are dependent on each other)!

Why...?


...dependence...

The positions in the sequences can be dependent in different ways... One

extreme case is:

• Let the letter in the first position be a, c, g or t with probabilities

pa, pc, pg, pt.

• Let the other letters in positions 2, 3, . . . , N be equal to the first

letter!

Then the sequences will be of the following form:

aaaaaaaa...aaaaaaa, cccccccc...ccccccccc,

ggggggg...gggggggg, or ttttttt...ttttttttttttt.

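Then the possible values of X are 0 and N: either the first letters of the two sequences agree and all N positions match, or they differ and no position matches. A binomially distributed variable with 0 < p < 1 takes every value 0, 1, . . . , N with positive probability. Hence (for N ≥ 2), X cannot be binomially distributed.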
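The distribution of X in this extreme case can be written down by going through the 16 possible pairs of first letters. A minimal sketch, assuming equal letter frequencies and N = 10; the function name is hypothetical.

```python
from itertools import product

def match_count_distribution(freqs: dict, n: int) -> dict:
    """Distribution of X when every letter of a sequence copies its first letter."""
    dist = {}
    for (l1, q1), (l2, q2) in product(freqs.items(), repeat=2):
        x = n if l1 == l2 else 0    # all n positions match, or none do
        dist[x] = dist.get(x, 0.0) + q1 * q2
    return dist

freqs = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}   # assumed frequencies
dist = match_count_distribution(freqs, 10)
print(dist)   # all probability mass sits on the values 0 and 10
```

The match probability at each single position is still 1/4, exactly as in the independent case; only the joint behaviour of the positions has changed.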


Expected value, variance, standard deviation

Associated with each random variable X (and each probability

distribution) are three important quantities:

• the expected value $\mu = E[X]$,

• the variance $\sigma^2 = \mathrm{Var}[X]$,

• the standard deviation $\sigma = \mathrm{SD}[X] = \sqrt{\mathrm{Var}[X]}$.

They contain useful information about the random variable X , and they

can be computed from the probability function.


Expected value

Once again: two random sequences of length N = 1000, where the letters (nucleotides) in each position are equally likely,

$$P(a) = P(c) = P(g) = P(t) = \frac{1}{4},$$

and X = the number of matches.

Since the nucleotides are equally probable, the match probability is

p = 0.25, and X ∼ Bin(1000, 0.25).

• How many matches would we expect to see?

The intuitive answer is: ’about 1000 · 0.25 = 250’.

This is in fact the expected value E[X] of this random variable X: if X ∼ Bin(N, p), then one can prove that E[X] = N · p, which is 250 in our case.


In general, with X being a discrete RV, the expected value (also called

the expectation or the mean) µ = E[X ] is defined as

$$E[X] = \sum_{k \in S} k \cdot P(X = k)$$

where S is the set of possible values of X .

Ex: If X ∼ Bin(N, p), then S = {0, 1, . . . , N − 1, N} and

E[X ] = 0 · P(X = 0) + 1 · P(X = 1) + . . . + N · P(X = N)

which can be shown to be equal to N · p.

Ex: If a dice is rolled, and X = the number turning up,

$$E[X] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + \ldots + 6 \cdot \frac{1}{6} = 3.5$$

NOTE: The value 3.5 is not a possible value of X !
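Both examples can be reproduced directly from the definition. The sketch below uses exact fractions; the helper `expectation` and the choice N = 20, p = 1/4 for the binomial check are illustrative.

```python
from fractions import Fraction
from math import comb

def expectation(pmf: dict):
    """E[X] = sum over possible values k of k * P(X = k)."""
    return sum(k * q for k, q in pmf.items())

# Fair dice: each of the values 1, ..., 6 has probability 1/6.
dice = {k: Fraction(1, 6) for k in range(1, 7)}
print(expectation(dice))              # 7/2

# Bin(n, p) with small illustrative parameters.
n, p = 20, Fraction(1, 4)
binom = {k: comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)}
print(expectation(binom) == n * p)    # True
```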


If E[X] is not necessarily a possible value of X, how can it be an ’expected’ value...?

Interpretation: If we repeat the experiment many times and observe independent copies $X_1, X_2, \ldots, X_n$ of X, then the average

$$\frac{1}{n}\left(X_1 + \ldots + X_n\right)$$

will be close to E[X]!

Convergence: the average tends closer and closer to E[X] as n increases.

(A more precise statement is possible: the law of large numbers.)

Roll a dice 1000 times, and compute the average: it will be close to 3.5!
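This can be tried out in a short simulation. The sketch below uses Python's standard pseudo-random generator with a fixed seed (an arbitrary choice made here so that runs are repeatable); the function name is hypothetical.

```python
import random

def average_of_rolls(n: int, rng: random.Random) -> float:
    """Average of n simulated rolls of a fair dice."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n

rng = random.Random(2006)   # fixed seed, chosen arbitrarily
for n in (10, 1000, 100000):
    print(n, average_of_rolls(n, rng))
```

The averages for small n can still be visibly off, while those for large n settle near 3.5.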


Expectations of linear combinations

Let X1, . . . , Xn be (independent or dependent!) random variables,

and let c1, . . . , cn be real numbers. Then

$$E[c_1 X_1 + c_2 X_2 + \ldots + c_n X_n] = c_1 E[X_1] + c_2 E[X_2] + \ldots + c_n E[X_n].$$

Expectations of products

Let X and Y be two random variables.

If they are independent then

E[X ·Y ] = E[X ] · E[Y ].

This is generally NOT true if they are dependent.
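A small exact computation shows both cases. In the sketch below, X and Y are either two independent rolls of a fair dice or the same roll used twice (Y = X); this pair of scenarios is chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
sixth = Fraction(1, 6)

e_x = sum(k * sixth for k in faces)    # E[X] = 7/2 for one fair dice

# Independent case: X and Y are two separate rolls.
e_xy_indep = sum(x * y * sixth * sixth for x, y in product(faces, repeat=2))

# Dependent case: Y is the same roll as X, so X * Y = X^2.
e_xy_dep = sum(x * x * sixth for x in faces)

print(e_xy_indep == e_x * e_x)   # True
print(e_xy_dep, e_x * e_x)       # 91/6 49/4
```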


More expectation formulas:

Let X be a random variable with the set S of possible values.

Then

$$E[X^2] = \sum_{k \in S} k^2 \cdot P(X = k)$$

and

$$E[g(X)] = \sum_{k \in S} g(k) \cdot P(X = k)$$

for functions g.


Random variation...

If X ∼ Bin(1000, 0.25) then E[X ] = 1000 · 0.25 = 250,

so we would expect to see approximately 250 matches in the sequence

matching example.

That is, 251 or 249 would not be a surprising result...

But what about 240? 280? 350? ...

X is a random variable, so there will be some ’variability’ around its

expected value... How much variation is expected?


Standard deviation


The ’expected variation’ of X around its expected value is captured by the standard deviation:

$$\sigma := \mathrm{SD}[X] := \sqrt{\mathrm{Var}[X]} = \sqrt{E\left[(X - E[X])^2\right]}.$$

Note: $(X - E[X])^2$ = the (squared) distance between X and its mean.


The deviation from the mean is (in a sense) on average σ.
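To get a feeling for the earlier question about 240, 280 or 350 matches, one can compute σ for X ∼ Bin(1000, 0.25) directly from the probability function and express each count as a number of standard deviations from the mean. This is only a sketch: working on the log scale via `lgamma` is a numerical convenience chosen here, not something taken from the slides.

```python
from math import exp, lgamma, log, sqrt

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Bin(n, p) probability function, computed on the log scale to avoid overflow."""
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_comb + k * log(p) + (n - k) * log(1 - p))

n, p = 1000, 0.25
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * q for k, q in enumerate(pmf))
sd = sqrt(sum((k - mean) ** 2 * q for k, q in enumerate(pmf)))

for observed in (240, 280, 350):
    # distance from the mean, measured in standard deviations
    print(observed, (observed - mean) / sd)
```

A count within about one σ of the mean is unremarkable, whereas a count many σ away would be very surprising for random sequences.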


Variance formulas

Definition:

$$\sigma^2 = \mathrm{Var}[X] := E\left[(X - E[X])^2\right].$$

Alternative formula:

$$\mathrm{Var}[X] = E[X^2] - \left(E[X]\right)^2.$$

Let a and b be constants. Then

$$\mathrm{Var}[a + b \cdot X] = b^2 \cdot \mathrm{Var}[X].$$

Let X and Y be independent random variables. Then

$$\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y].$$
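The first three formulas can be checked exactly on a fair dice. In the sketch below, the helpers `expectation` and `variance` and the constants a = 3, b = 2 are chosen for illustration.

```python
from fractions import Fraction

dice = {k: Fraction(1, 6) for k in range(1, 7)}   # fair dice

def expectation(pmf: dict, g=lambda k: k):
    """E[g(X)] = sum over k of g(k) * P(X = k)."""
    return sum(g(k) * q for k, q in pmf.items())

def variance(pmf: dict):
    """Var[X] from the definition E[(X - E[X])^2]."""
    mu = expectation(pmf)
    return expectation(pmf, lambda k: (k - mu) ** 2)

var_def = variance(dice)
var_alt = expectation(dice, lambda k: k * k) - expectation(dice) ** 2
print(var_def, var_alt)    # 35/12 35/12

a, b = 3, 2
shifted = {a + b * k: q for k, q in dice.items()}   # distribution of a + b*X
print(variance(shifted) == b ** 2 * var_def)        # True
```

Adding the constant a shifts every value equally and leaves the spread unchanged, which is why only b appears in the result.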


If X and Y are arbitrary (possibly dependent) random variables, then

$$\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2 \cdot \mathrm{Cov}[X, Y],$$

where the last term involves the covariance:

$$\mathrm{Cov}[X, Y] := E\left[(X - E[X]) \cdot (Y - E[Y])\right] = E[X \cdot Y] - E[X] \cdot E[Y].$$

The covariance measures the linear dependence between X and Y (it is 0 in the independent case).
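A worked check with a dependent pair: let X be the first of two fair dice rolls and Y the sum of both rolls, so that Y depends on X. This pair and the helper names are chosen here for illustration; all arithmetic is exact.

```python
from fractions import Fraction
from itertools import product

sixth = Fraction(1, 6)

# Joint probability function of (X, Y): X = first roll, Y = first roll + second roll.
joint = {}
for d1, d2 in product(range(1, 7), repeat=2):
    key = (d1, d1 + d2)
    joint[key] = joint.get(key, 0) + sixth * sixth

def expect(g):
    """E[g(X, Y)] under the joint probability function."""
    return sum(g(x, y) * q for (x, y), q in joint.items())

def var(g):
    """Var[g(X, Y)] from the definition."""
    m = expect(g)
    return expect(lambda x, y: (g(x, y) - m) ** 2)

cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)
lhs = var(lambda x, y: x + y)
rhs = var(lambda x, y: x) + var(lambda x, y: y) + 2 * cov

print(cov)          # 35/12
print(lhs == rhs)   # True
```

Leaving out the covariance term would understate Var[X + Y] here, because X and Y tend to be large or small together.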
