bioinformatic_1

1

BIOINFORMATIK II

PROBABILITY & STATISTICS

Summer semester 2006

University of Zurich and ETH Zurich

Lecture 1: Basic probability.

Prof. Andrew Barbour

Dr. Beatrice de Tiliere

Adapted from a course by

Dr. D. Schuhmacher & Dr. D. Svensson.

Web page: http://www.math.unizh.ch/baps/lectures/bioinf2.html

2

Web page

http://www.math.unizh.ch/baps/lectures/bioinf2.html

You will find there:

• Up-to-date information;

• Transparencies of the lectures;

• Exercise sheets with solutions (...in due time);

• Additional background material.


3

Course content

• Basic probability concepts

’probability distribution’, ’independence’, ’conditional probability’,

’expectation’, ’standard deviation’, . . .

• Concepts and principles in statistics

’estimation’, ’hypothesis testing’, ’maximum likelihood’, ’likelihood

ratio’, ’significance’, ’p-value’, . . .

• Markov chains

’transition matrix’, ’stationary distribution’, ’reversibility’, ’random

walks’, ’hidden Markov models’, . . .

• Models and algorithms in bioinformatics

’sequence alignment’, ’models for evolution’,

’PAM/BLOSUM-matrices’, ’BLAST’, . . .

Principles, rather than ’How-to’ !


4

[Figure: sequences related by evolutionary changes]

a g g t g a c c c t . . . g t c a t t t

t g g a g c c a t . . . g t c g a t t a c g t c a c c c t . . . g a c a t t t


5

Why probability and statistics in bioinformatics?

Given: Two sequences from two species. Common ancestor?

Seq 1: g g a g a c t g t a g a c a g c t a a t g c t a t a

Seq 2: g a a c g c c c t a g c c a c g a g c c c t t a t c

• Sequence length: 26 nucleotides.

• 11 of 26 positions agree.

Conclusion? Generated ’purely by chance’ or by some other

mechanism?

To be able to answer this, one needs to understand properties of

random sequences.
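As a small illustration of the count quoted above, the following Python sketch compares the two sequences position by position. The function name count_matches is introduced here for illustration only; it is not part of the course material.

```python
# The two 26-nucleotide sequences from this slide (spaces removed).
seq1 = "ggagactgtagacagctaatgctata"
seq2 = "gaacgccctagccacgagcccttatc"

def count_matches(a, b):
    """Number of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b))

matches = count_matches(seq1, seq2)
print(len(seq1), matches)
```

Whether this number of matches is surprising depends on how such counts behave for random sequences, which is the subject of the following slides.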


6

Probability & statistics in bioinformatics?

For ...

• modelling sequence evolution (Markov chains).

• inferring phylogenetic trees (maximum likelihood trees).

• gene prediction (hidden Markov models).

• analysis of microarray data (multiple testing, multivariate

statistics)

• evaluating sequence similarity in BLAST searches (extreme values,

random walks)

• much more!


7

Random variables (RVs)

A random variable = ’numerical’ quantity whose value depends on the

outcome of some chance experiment.

Ex 1. Flip a coin and let X=1 if ’head’ occurs, otherwise let X = 0.

Then X is a random variable.

Ex 2. Two DNA sequences are randomly chosen from a database. Then

X = the number of matches between the sequences is a RV.

Ex 3. Let X = the waiting time until a certain event occurs; e.g., time

until a nucleotide substitution first occurs at a specified position in a

genome. Then X is a RV.

There are two main types of random variables:

• either DISCRETE (as in examples 1 and 2)

• or CONTINUOUS (example 3).


8

Probability distribution of a RV

The important feature of a random variable is its probability distribution.

The probability distribution of a random variable X is basically the

mechanism (mathematically: the function) which tells us with what

probability the random variable takes each of its possible values.

For a discrete random variable the probability distribution can be

expressed either by its probability function or by its distribution function.


9

Probability and distribution functions (DISCRETE RVs)

Let X be any discrete random variable, and denote the set of its

possible values by S (= ’sample space’).

Associated with the random variable X are

• The probability function p_X:

$$p_X(i) := P(X = i) \in [0, 1], \quad i \in S;$$

• and the (cumulative) distribution function F_X:

$$F_X(j) := P(X \le j) = \sum_{i \in S,\ i \le j} P(X = i) \in [0, 1], \quad j \in S.$$

A mathematical analysis of a random variable typically requires explicit

formulas for these functions!

In principle, these two functions contain all essential information

concerning properties and behavior of the random variable.


10

Similar formulas hold for a continuous random variable X. Its

(cumulative) distribution function is given by

$$F_X(t) := P(X \le t) = \int_{x \le t} f_X(x)\,dx, \quad t \in S,$$

for some probability density function f_X with f_X(x) ≥ 0.

Caution with the interpretation in the continuous case:

• the density f_X(x) is NOT equal to P(X = x) (which is in fact always

zero for continuous RVs!),

• f_X(x) is NOT a probability (that is, f_X(x) might be > 1 ...).

Think of it as P(t ≤ X ≤ t + h) ≈ f_X(t) · h, where h > 0 is small.

Discrete random variables are perhaps more important to

bioinformatics... (in some sense)
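To illustrate the caution about densities, here is a minimal Python sketch. It assumes an exponentially distributed waiting time with rate 3; both the choice of distribution and the rate are illustrative assumptions, not taken from the lecture. The density exceeds 1 at x = 0, yet f(t) · h is still a good approximation of P(t ≤ X ≤ t + h) for small h.

```python
import math

rate = 3.0  # assumed rate of an exponential waiting time (illustrative choice)

def density(x):
    """Exponential density f(x) = rate * exp(-rate * x) for x >= 0."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def cdf(t):
    """Distribution function F(t) = 1 - exp(-rate * t) for t >= 0."""
    return 1.0 - math.exp(-rate * t) if t >= 0 else 0.0

t, h = 0.1, 1e-4
exact = cdf(t + h) - cdf(t)   # P(t <= X <= t + h)
approx = density(t) * h       # f(t) * h
print(density(0.0), exact, approx)
```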


11

Example: Flip a coin and let X=1 if ’head’ occurs, otherwise let X = 0.

Then the sample space is S = {0, 1} and the probability function

p_X(i) := P(X = i) is given by

$$p_X(0) = \frac{1}{2}, \qquad p_X(1) = \frac{1}{2}.$$

The distribution function is given by

$$F_X(0) := P(X \le 0) = \frac{1}{2}$$

and

$$F_X(1) := P(X \le 1) = P(X = 0) + P(X = 1) = \frac{1}{2} + \frac{1}{2} = 1.$$
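The step from the probability function to the distribution function is a running sum, which can be sketched in Python as follows. The helper name cdf_from_pmf is hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import accumulate

# Probability function of the coin-flip variable X from the example above.
pmf = {0: Fraction(1, 2), 1: Fraction(1, 2)}

def cdf_from_pmf(pmf):
    """Return F(j) = P(X <= j) for every possible value j, in increasing order."""
    values = sorted(pmf)
    running = accumulate(pmf[v] for v in values)
    return dict(zip(values, running))

cdf = cdf_from_pmf(pmf)
print(cdf)
```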


12

Ex: “Random DNA sequences” with i.i.d. letters.

Two sequences of N letters are randomly generated, i.e.

• the letters are independently generated,

• each position equals a, c, g, or t with probabilities pa, pc, pg, pt

Seq1: gtacacgggata...tacgtgact

Seq2: cgaggtagtcga...tttatacga

Let X = the number of matches. Then the probability function of X is

$$P(X = k) = \binom{N}{k} p^k (1 - p)^{N-k}, \quad \text{where} \quad \binom{N}{k} = \frac{N!}{k!\,(N - k)!},$$

for some match probability p ∈ [0, 1]. [n! = 1 · 2 · · · (n − 1) · n.]

This is known as the binomial distribution with parameters N and p.


13

$$P(X = k) = \binom{N}{k} p^k (1 - p)^{N-k}.$$

Why?

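Step 1: Fix any position (j, say).

Let p = P(match in the position considered), which does not depend on the

position chosen. The four ways of matching cannot occur together, and the

two letters are generated independently, so

$$\begin{aligned}
p &= P(\text{two 'a', or two 'c', or two 'g', or two 't'}) \\
  &= P(\text{two 'a'}) + P(\text{two 'c'}) + P(\text{two 'g'}) + P(\text{two 't'}) \\
  &= p_a \cdot p_a + p_c \cdot p_c + p_g \cdot p_g + p_t \cdot p_t.
\end{aligned}$$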


14

So the match probability is

$$p = p_a^2 + p_c^2 + p_g^2 + p_t^2$$

(p = 1/4 if p_a = p_c = p_g = p_t = 1/4).

The probability for a mismatch is (1 − p).
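The match probability can be computed directly from the letter probabilities. The following Python sketch does this for the equally likely case and for one skewed set of frequencies; the skewed values are hypothetical and serve only as an example.

```python
from fractions import Fraction

def match_probability(freqs):
    """p = sum of squared letter probabilities (two independent i.i.d. sequences)."""
    if sum(freqs.values()) != 1:
        raise ValueError("letter probabilities must sum to 1")
    return sum(q * q for q in freqs.values())

uniform = {c: Fraction(1, 4) for c in "acgt"}
skewed = {"a": Fraction(4, 10), "c": Fraction(1, 10),
          "g": Fraction(1, 10), "t": Fraction(4, 10)}  # hypothetical frequencies

p_uniform = match_probability(uniform)
p_skewed = match_probability(skewed)
print(p_uniform, p_skewed)
```

In this example the skewed frequencies give a larger match probability than the equally likely case.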

Step 2: What is P(X = k)?

Exactly k matches and N − k mismatches can occur in different ways: for

example,

[Figure: Seq 1 aligned with Seq 2, with the matching positions marked; the k matches have probability p^k and the N − k mismatches have probability (1 − p)^(N−k).]

Each such configuration has probability $p^k (1 - p)^{N-k}$.


15

We have to add the probabilities for the different configurations to get the

total probability:

$$P(X = k) = p^k (1 - p)^{N-k} + \dots + p^k (1 - p)^{N-k}.$$

How many terms?

Combinatorial argument: there are

$$\binom{N}{k} = \frac{N!}{k!\,(N - k)!}$$

possible configurations of k matches and N − k mismatches.

Therefore

$$P(X = k) = \binom{N}{k} p^k (1 - p)^{N-k}.$$
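The counting step can be checked by brute force for a small sequence length. The Python sketch below lists every match/mismatch pattern for N = 6 (an arbitrary small value chosen for illustration) and compares the number of patterns with k matches to the binomial coefficient.

```python
from itertools import product
from math import comb

N = 6  # small illustrative length so that all 2**N patterns can be listed

# Each pattern is a tuple of 0/1 flags: 1 = match, 0 = mismatch.
counts = [0] * (N + 1)
for pattern in product((0, 1), repeat=N):
    counts[sum(pattern)] += 1

print(counts)
print([comb(N, k) for k in range(N + 1)])
```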


16

The binomial distribution

Any random variable Y having this probability function

$$p(k) = \binom{N}{k} p^k (1 - p)^{N-k}$$

is said to be binomially distributed (important in general, not just for

counting matches between random sequences!).

In general: imagine that

• N independent trials are carried out,

• for each trial, P(’success’ ) = p, and P(’failure’ ) = 1 − p.

Let X = the number of successes. Then X is binomially distributed with

parameters N and p. Notation:

X ∼ Bin(N, p).
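A minimal Python sketch of the binomial probability function follows. As an example it evaluates P(X ≥ 11) for N = 26, the situation of the two 26-letter sequences earlier in the lecture, under the assumption that letters are independent and equally likely (p = 1/4). The function name binom_pmf is chosen here for illustration.

```python
from fractions import Fraction
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 26, Fraction(1, 4)   # assumption: equally likely, independent letters
total = sum(binom_pmf(k, n, p) for k in range(n + 1))
tail = sum(binom_pmf(k, n, p) for k in range(11, n + 1))  # P(X >= 11)
print(total, float(tail))
```

The probabilities sum to 1, as they must. The tail probability indicates how often 11 or more matches would arise purely by chance under these assumptions.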


17

...other distributions?

In general, different RVs have different probability distributions (i.e.

different probability functions).

Consider the random sequence example again, and define

Y = the first position where a match occurs

(counted from left to right).

Seq1 : c g t c g t ... g

Seq2 : g a c c c t ... t

Then the probability function of Y would be

$$p_Y(k) = (1 - p)^{k-1} \cdot p$$

for k = 1, 2, 3, . . . (as long as k ≤ N), where p is the probability of a match at a

fixed position i, 1 ≤ i ≤ N.

This is called the geometric distribution with parameter p.
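The geometric probability function can be sketched in Python as below, assuming p = 1/4 (equally likely letters). The sketch also locates the first match in the six letters shown for Seq1 and Seq2 above; the helper names are hypothetical.

```python
from fractions import Fraction

def geom_pmf(k, p):
    """P(Y = k) = (1 - p)**(k - 1) * p: first match occurs at position k."""
    return (1 - p) ** (k - 1) * p

def first_match(a, b):
    """1-based position of the first match, or None if there is no match."""
    for i, (x, y) in enumerate(zip(a, b), start=1):
        if x == y:
            return i
    return None

p = Fraction(1, 4)  # assumption: equally likely letters
K = 10
partial = sum(geom_pmf(k, p) for k in range(1, K + 1))
print(partial == 1 - (1 - p) ** K)      # probability of a match within K positions
print(first_match("cgtcgt", "gaccct"))  # the letters shown for Seq1 and Seq2
```

The partial sum equals 1 − (1 − p)^K, the probability that at least one of the first K positions matches.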


18

Another important distribution is the uniform distribution.

Suppose that each of the (finitely many) possible values of X is equally

likely, that is

$$P(X = k) = \frac{1}{N}$$

for each possible value k, where N is the number of possible values.

Then X is said to be uniformly distributed.


19

...some important distributions?

There are infinitely many possible probability distributions but some

appear over and over again in applications.

Some examples are

• Binomial

• Geometric

• Uniform

• Poisson

• Normal

• Exponential

• Chi-square

• ...


20

Probabilities of events

Let S be the set of possible outcomes of some ’experiment’

(S is the sample space).

An event is something that either will or will not occur when the

experiment is conducted (mathematically, E ⊂ S).

Ex.1 Experiment: counting matches between two sequences of length

1000. Then S = {0, 1, . . . , 1000}.

The event E = ’at least 50% identity’ is E = {500, 501, . . . , 1000}.

Ex.2 Experiment: Rolling a dice once. Then S = {1, 2, 3, 4, 5, 6}.

Then E1 = ’the number turning up is at least 3’ = {3, 4, 5, 6},

and E2 = ’the number turning up is odd’ = {1, 3, 5}.


21

Let E, E1 and E2 be some events.

Interpretations:

• Ec = ’the event E does not occur’;

• E1 ∪ E2 = ’at least one of the events E1 and E2 occurs’;

• E1 ∩ E2 = ’both the events E1 and E2 occur’.

If the events E1 and E2 cannot occur together, then they are said to be

mutually exclusive.

(Mathematically, two events are mutually exclusive if E1 ∩ E2 is the

empty set)


22

How to compute probabilities of events

• P(S) = 1.

• For any event E ⊂ S, 0 ≤ P(E) ≤ 1.

• P(Ec) = 1 − P(E).

• For mutually exclusive events E1 and E2,

P(E1 ∪ E2) = P(E1) + P(E2).

• For any two events E1 and E2,

P(E1 ∪ E2) = P(E1) + P(E2) − P(E1 ∩ E2).
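These rules can be checked on the dice example by treating events as sets. The Python sketch below assumes a fair die, so that every outcome has probability 1/6; the helper prob is introduced for illustration.

```python
from fractions import Fraction

S = frozenset(range(1, 7))   # sample space of one roll of a fair die

def prob(event):
    """P(E) when all outcomes in S are equally likely."""
    return Fraction(len(event & S), len(S))

E1 = frozenset({3, 4, 5, 6})   # 'at least 3'
E2 = frozenset({1, 3, 5})      # 'odd'

complement_rule = prob(S - E1) == 1 - prob(E1)
union_rule = prob(E1 | E2) == prob(E1) + prob(E2) - prob(E1 & E2)
print(prob(E1 | E2), complement_rule, union_rule)
```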


23

Conditional probabilities

A fair dice is rolled once. Suppose that it is known that the number

turning up is less than or equal to three.

How likely is it then that it is an odd number?

P(number is odd | number less than or equal to 3) = 2/3

The information given is: the number is 1, 2 or 3.

Two of these three outcomes are odd: therefore 2/3.

(NOTE: Without the additional information given,

P(number is odd ) = 1/2).


24

...conditional probabilities...

Suppose that E1 and E2 are two events associated with some random

experiment.

Then the conditional probability P(E1|E2) that E1 occurs, given that

E2 occurs, is defined as

$$P(E_1 \mid E_2) = \frac{P(E_1 \cap E_2)}{P(E_2)}.$$

Here we assume that P(E2) > 0.


25

The conditional probability formula:

$$P(E_1 \mid E_2) = \frac{P(E_1 \cap E_2)}{P(E_2)}.$$

***

Ex: The dice example again:

$$P(\text{number is odd} \mid \text{number} \le 3) = \frac{P(\text{number is odd and} \le 3)}{P(\text{number} \le 3)} = \frac{P(\{1, 3\})}{P(\{1, 2, 3\})} = \frac{2/6}{3/6} = \frac{2}{3}$$
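The same computation, written as a short Python sketch with events as sets (fair die assumed; the function names are illustrative):

```python
from fractions import Fraction

S = frozenset(range(1, 7))  # fair die

def prob(event):
    """P(E) when all outcomes in S are equally likely."""
    return Fraction(len(event & S), len(S))

def cond_prob(e1, e2):
    """P(E1 | E2) = P(E1 and E2) / P(E2); requires P(E2) > 0."""
    if prob(e2) == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(e1 & e2) / prob(e2)

odd = frozenset({1, 3, 5})
at_most_3 = frozenset({1, 2, 3})
print(cond_prob(odd, at_most_3), prob(odd))
```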


26

Independence

Mathematically, two events E1 and E2 are said to be independent if

and only if

P(E1 ∩ E2) = P(E1) · P(E2)

holds. When P(E1) > 0 and P(E2) > 0, this is equivalent to

P(E1|E2) = P(E1) and P(E2|E1) = P(E2),

so for independent events E1 and E2, the information about the

experiment contained in E2 says nothing about the occurrence of E1 (and

vice versa).

Think of two random variables X and Y as being independent if the

value of one does not in any way affect the probabilities associated with

the possible values of the other one.
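The product criterion is easy to test on the dice example. In the Python sketch below (fair die assumed, event choices made for illustration), 'odd' turns out to be independent of 'at most 2' but not of 'at most 3'.

```python
from fractions import Fraction

S = frozenset(range(1, 7))

def prob(event):
    """P(E) when all outcomes in S are equally likely."""
    return Fraction(len(event & S), len(S))

def independent(e1, e2):
    """True if P(E1 and E2) = P(E1) * P(E2)."""
    return prob(e1 & e2) == prob(e1) * prob(e2)

odd = frozenset({1, 3, 5})
at_most_2 = frozenset({1, 2})
at_most_3 = frozenset({1, 2, 3})
print(independent(odd, at_most_2), independent(odd, at_most_3))
```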


27

Two sequences linked by evolution are dependent ...


28

...Independence...

Once again: match counts, two random sequences...

Seq1: gtacacgggata...tacgtgact

Seq2: cgaggtagtcga...tttatacga

Each position equals a, c, g, or t with probabilities pa, pc, pg, pt, and we

define X = the number of matches.

• If the positions in the sequences are independently generated, then

X ∼ Bin(N, p).

• X will in general not be binomially distributed if successive nucleotides are

dependent (i.e. if neighbors are dependent on each other)!

Why...?


29

...dependence...

The positions in the sequences can be dependent in different ways... One

extreme case is:

• Let the letter in the first position be a, c, g or t with probabilities

pa, pc, pg, pt.

• Let the other letters in positions 2, 3, . . . , N be equal to the first

letter!

Then the sequences will be of the following form:

aaaaaaaa...aaaaaaa, cccccccc...ccccccccc,

ggggggg...gggggggg, or ttttttt...ttttttttttttt.

Then, the possible values of X are 0 and N . Hence, X cannot be

binomially distributed.
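The extreme case can be worked out exactly by listing the possible first letters. The Python sketch below assumes that both sequences are built this way, independently of each other, with equally likely first letters and N = 8; these specific values are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

N = 8
freqs = {c: Fraction(1, 4) for c in "acgt"}  # assumed letter probabilities

# Enumerate the first letters of both sequences; all other letters are copies.
dist = {}
for (l1, q1), (l2, q2) in product(freqs.items(), repeat=2):
    seq1, seq2 = l1 * N, l2 * N
    x = sum(a == b for a, b in zip(seq1, seq2))
    dist[x] = dist.get(x, 0) + q1 * q2

print(dist)
```

Only the values 0 and N receive positive probability, whereas a Bin(N, p) variable with 0 < p < 1 takes every value from 0 to N with positive probability.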


30

Expected value, variance, standard deviation

Associated with each random variable X (and each probability

distribution) are three important quantities:

• the expected value µ = E[X],

• the variance σ² = Var[X],

• the standard deviation σ = SD[X] = √(Var[X]).

They contain useful information about the random variable X , and they

can be computed from the probability function.


31

Expected value

Once again: two random sequences of length N = 1000, where the letters

(nucleotides) in each position are equally likely:

$$P(a) = P(c) = P(g) = P(t) = \frac{1}{4},$$

and X = the number of matches.

Since the nucleotides are equally probable, the match probability is

p = 0.25, and X ∼ Bin(1000, 0.25).

• How many matches would we expect to see?

The intuitive answer is: ’about 1000 · 0.25 = 250’.

This is in fact the expected value E[X] of this random variable X: if

X ∼ Bin(N, p), then one can prove that E[X] = N · p, which equals 250 in our case.


32

In general, with X being a discrete RV, the expected value (also called

the expectation or the mean) µ = E[X ] is defined as

$$E[X] = \sum_{k \in S} k \cdot P(X = k)$$

where S is the set of possible values of X .

Ex: If X ∼ Bin(N, p), then S = {0, 1, . . . , N − 1, N} and

E[X ] = 0 · P(X = 0) + 1 · P(X = 1) + . . . + N · P(X = N)

which can be shown to be equal to N · p.

Ex: If a dice is rolled, and X = the number turning up,

$$E[X] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + \dots + 6 \cdot \frac{1}{6} = 3.5$$

NOTE: The value 3.5 is not a possible value of X !
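The defining sum translates directly into code. The Python sketch below evaluates it for the dice example and for a binomial variable with N = 20 and p = 1/4 (values chosen only for illustration), using exact fractions.

```python
from fractions import Fraction
from math import comb

def expectation(pmf):
    """E[X] = sum over all possible values k of k * P(X = k)."""
    return sum(k * q for k, q in pmf.items())

die = {k: Fraction(1, 6) for k in range(1, 7)}

N, p = 20, Fraction(1, 4)
binom = {k: comb(N, k) * p**k * (1 - p)**(N - k) for k in range(N + 1)}

print(expectation(die), expectation(binom), N * p)
```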


33

If the value E[X] is not necessarily a possible value of X, how can it be

an ’expected’ value...?

Interpretation: If we repeat the experiment many times and observe

independent copies X_1, X_2, ..., X_n of X, then the average

$$\frac{1}{n} (X_1 + \dots + X_n)$$

will be close to E[X]!

Convergence: the average tends closer and closer to E[X ] as n increases.

(A more precise statement is possible.)

Roll a dice 1000 times, and compute the average: it will be close to 3.5!
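This experiment is easy to simulate. The Python sketch below uses the standard library's pseudo-random generator with an arbitrarily chosen fixed seed; the simulated averages depend on that seed and are only an illustration of the convergence, not a proof.

```python
import random

rng = random.Random(2006)  # fixed seed so that the run is reproducible

def average_of_rolls(n):
    """Average of n simulated rolls of a fair six-sided die."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n

for n in (10, 1000, 100000):
    print(n, average_of_rolls(n))
```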


34

Expectations of linear combinations

Let X1, . . . , Xn be (independent or dependent!) random variables,

and let c1, . . . , cn be real numbers. Then

$$E[c_1 X_1 + c_2 X_2 + \dots + c_n X_n] = c_1 E[X_1] + c_2 E[X_2] + \dots + c_n E[X_n].$$

Expectations of products

Let X and Y be two random variables.

If they are independent then

E[X ·Y ] = E[X ] · E[Y ].

This is generally NOT true if they are dependent.
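Both statements can be checked exactly on dice. In the Python sketch below, the independent case uses two separate rolls, and the dependent case takes Y to be the same roll as X; this particular dependent pair is an example chosen here, not one from the lecture.

```python
from fractions import Fraction
from itertools import product

die = range(1, 7)
sixth = Fraction(1, 6)

EX = sum(k * sixth for k in die)

# Independent case: X and Y are two separate rolls.
E_sum_indep = sum((x + y) * sixth * sixth for x, y in product(die, die))
E_prod_indep = sum(x * y * sixth * sixth for x, y in product(die, die))

# Dependent case: Y is the same roll as X.
E_sum_dep = sum((x + x) * sixth for x in die)
E_prod_dep = sum(x * x * sixth for x in die)

print(E_sum_indep, E_sum_dep, 2 * EX)
print(E_prod_indep, E_prod_dep, EX * EX)
```

The expectation of the sum is the same in both cases, while the product rule holds only for the independent pair.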


35

More expectation formulas:

Let X be a random variable with the set S of possible values.

Then

$$E[X^2] = \sum_{k \in S} k^2 \cdot P(X = k)$$

and

$$E[g(X)] = \sum_{k \in S} g(k) \cdot P(X = k)$$

for functions g.
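A short Python sketch of the general formula, applied to the dice example; the function name expectation_of is hypothetical.

```python
from fractions import Fraction

die = {k: Fraction(1, 6) for k in range(1, 7)}

def expectation_of(g, pmf):
    """E[g(X)] = sum over k of g(k) * P(X = k)."""
    return sum(g(k) * q for k, q in pmf.items())

mean = expectation_of(lambda k: k, die)
second_moment = expectation_of(lambda k: k * k, die)
print(mean, second_moment, mean**2)
```

Note that E[X²] and (E[X])² differ here, so the function g cannot simply be pulled outside the expectation.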


36

Random variation...

If X ∼ Bin(1000, 0.25) then E[X ] = 1000 · 0.25 = 250,

so we would expect to see approximately 250 matches in the sequence

matching example.

That is, 251 or 249 would not be a surprising result...

But what about 240? 280? 350? ...

X is a random variable, so there will be some ’variability’ around its

expected value... How much variation is expected?


37

Standard deviation

X is a random variable, so there will be some ’variability’ around its

expected value... How much variation is expected?

This ’expected variation’ is captured by the standard deviation:

$$\sigma := \mathrm{SD}[X] := \sqrt{\mathrm{Var}[X]} = \sqrt{E[(X - E[X])^2]}.$$

Note: (X − E[X])² = the (squared) distance between X and its mean.


38

The deviation from the mean is (in a sense) on average σ.
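For the sequence matching example, the standard deviation can be computed directly from the probability function. The Python sketch below does this for X ∼ Bin(1000, 0.25) with exact fractions and then expresses the match counts 240, 280 and 350 asked about earlier as distances from the mean in units of σ.

```python
from fractions import Fraction
from math import comb, sqrt

N, p = 1000, Fraction(1, 4)
pmf = {k: comb(N, k) * p**k * (1 - p)**(N - k) for k in range(N + 1)}

mean = sum(k * q for k, q in pmf.items())
var = sum((k - mean) ** 2 * q for k, q in pmf.items())
sd = sqrt(var)

for observed in (240, 280, 350):
    print(observed, round((observed - float(mean)) / sd, 2))
```

With σ close to 13.7, a count of 240 lies less than one standard deviation from the mean, 280 lies a little over two, and 350 lies more than seven standard deviations away.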


39

Variance formulas

Definition:

$$\sigma^2 = \mathrm{Var}[X] := E[(X - E[X])^2].$$

Alternative formula:

$$\mathrm{Var}[X] = E[X^2] - (E[X])^2.$$

Let a and b be constants. Then

$$\mathrm{Var}[a + b \cdot X] = b^2 \cdot \mathrm{Var}[X].$$

Let X and Y be independent random variables. Then

Var[X + Y ] = Var[X ] + Var[Y ].
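These rules can be verified exactly on the dice example. In the Python sketch below, the constants a = 5 and b = 3 are arbitrary illustrative choices, and the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

die = {k: Fraction(1, 6) for k in range(1, 7)}

def mean(pmf):
    return sum(k * q for k, q in pmf.items())

def variance(pmf):
    """Var[X] = E[X^2] - (E[X])^2."""
    return sum(k * k * q for k, q in pmf.items()) - mean(pmf) ** 2

def transform(pmf, a, b):
    """Probability function of a + b*X (b != 0)."""
    return {a + b * k: q for k, q in pmf.items()}

# Probability function of X + Y for two independent rolls.
two_dice = {}
for (x, qx), (y, qy) in product(die.items(), repeat=2):
    two_dice[x + y] = two_dice.get(x + y, 0) + qx * qy

print(variance(die), variance(transform(die, 5, 3)), variance(two_dice))
```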


40

For any two random variables X and Y (dependent or not),

$$\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2 \cdot \mathrm{Cov}[X, Y],$$

where the last term is the covariance:

$$\mathrm{Cov}[X, Y] := E[(X - E[X]) \cdot (Y - E[Y])] = E[X \cdot Y] - E[X] \cdot E[Y].$$

The covariance measures the linear dependence between X and Y (which

is 0 in the independent case).
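As a check of the formula with the covariance term, the Python sketch below uses a strongly dependent pair: X is one roll of a fair die and Y is the same roll. This pair is an example chosen here for illustration.

```python
from fractions import Fraction

sixth = Fraction(1, 6)
# Joint probability function of (X, Y) with Y = X: the same roll counted twice.
joint = {(k, k): sixth for k in range(1, 7)}

def E(f):
    """Expectation of f(X, Y) under the joint probability function."""
    return sum(f(x, y) * q for (x, y), q in joint.items())

EX, EY = E(lambda x, y: x), E(lambda x, y: y)
var_x = E(lambda x, y: (x - EX) ** 2)
var_y = E(lambda x, y: (y - EY) ** 2)
cov = E(lambda x, y: x * y) - EX * EY
var_sum = E(lambda x, y: (x + y - EX - EY) ** 2)

print(cov, var_sum, var_x + var_y + 2 * cov)
```

Here the covariance is positive, so Var[X + Y] is larger than Var[X] + Var[Y].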

