genome evolution

Genome Evolution © Amos Tanay, The Weizmann Institute. Genome evolution. Lecture 2: population genetics I: models and drift

Upload: sona

Post on 23-Feb-2016



TRANSCRIPT

Page 1: Genome evolution

Genome Evolution © Amos Tanay, The Weizmann Institute

Genome evolution

Lecture 2: population genetics I: models and drift

Page 2: Genome evolution


Reading
• Course slides on the web
• Hartl and Clark, Principles of Population Genetics
• Rick Durrett, Probability Models for DNA Sequence Evolution
• Gillespie, Population Genetics: A Concise Guide
• Hein et al., Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory
• Wakeley, Coalescent Theory: An Introduction
• Graur and Li, Intro to molecular evolution
• Classics: Kimura, Dobzhansky, ...

Page 3: Genome evolution


Studying Populations

Models

A set of individuals, genomes
Ancestry relations or hierarchies

Experiments

Field studies, diversity/genotyping
Experimental evolution

Åland Islands, Glanville fritillary population

mtDNA human migration patterns

Page 4: Genome evolution


Human population

Growth:

Year      Estimate (millions)
-10,000   6
0         252
1750      771
1950      2521
2010      6055

Page 5: Genome evolution


The Data: the HapMap project

1 million SNPs (single nucleotide polymorphisms)

4 populations:
• 30 trios (parents/child) from Nigeria (Yoruba - YRI)
• 30 trios (parents/child) from Utah (CEU)
• 45 Han Chinese (Beijing - CHB)
• 44 Japanese (Tokyo - JPT)

Haplotyping – each SNP/individual.

Not just determining heterozygosity/homozygosity – haplotyping completely resolves the genotypes (phasing)

Page 6: Genome evolution


The Data: the HapMap project

Because of linkage, the partial SNP map largely determines all other SNPs!

The idea is that a group of “tag SNPs” can be used to represent all genetic variation in the human population.

This is extremely important in association studies that look for the genetic cause of disease.

Page 7: Genome evolution


Correlation on SNPs between populations

Page 8: Genome evolution


Modeling…

Page 9: Genome evolution


The Hardy-Weinberg Model

• Diploid organisms
  Two copies of each allele/gene/base
  Homozygous / heterozygous

• Sexual reproduction
  Mating haplotypes

• Large population, no migration
  Fixed size, closed system

• Non-overlapping generations
  Synchronous process
  Not as bad as it may look

• Random mating
  The new generation is selected from the existing haplotypes with replacement

• No mutations, no selection (will add these later)

Page 10: Genome evolution


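The Hardy-Weinberg Model

Allele frequencies:

$P(A) = p, \quad P(a) = q$

Hardy-Weinberg equilibrium:

$P(AA) = p^2, \quad P(Aa) = 2pq, \quad P(aa) = q^2$

[Diagram: genotypes AA, Aa, aa; gametes A, a; genotypes AA, Aa, aa in the next generation, under random mating and non-overlapping generations]

With the model assumptions, equilibrium is reached within one generation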
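The one-generation claim can be checked numerically. The following is a minimal Python sketch, assuming only the model assumptions listed for Hardy-Weinberg; the function name `next_generation` and the starting frequencies are illustrative choices, not part of the lecture.

```python
def next_generation(f_AA, f_Aa, f_aa):
    """Genotype frequencies after one round of random mating.

    Assumes the Hardy-Weinberg conditions: random mating, no selection,
    no mutation, and non-overlapping generations.
    """
    p = f_AA + 0.5 * f_Aa  # frequency of allele A among the gametes
    q = 1.0 - p            # frequency of allele a
    return p * p, 2 * p * q, q * q

# Start far from equilibrium: only homozygotes, no heterozygotes.
start = (0.5, 0.0, 0.5)
gen1 = next_generation(*start)
gen2 = next_generation(*gen1)
print(gen1, gen2)
```

The first generation already has the proportions p^2, 2pq, q^2, and a second round of mating leaves them unchanged.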


Page 11: Genome evolution


Frequency estimates

We will be dealing with estimation of allele frequencies.

To remind you, when sampling n times from a population with an allele of frequency p, the number of copies i that we observe is distributed as a binomial variable:

$P(i) = \binom{n}{i} p^i (1-p)^{n-i}$

This can be further approximated using a normal distribution:

$B(n; p) \approx N(np, \; np(1-p))$

When estimating the frequency from the number of successes, $\hat{p} = i/n$, we therefore have a standard error of:

$s = \sqrt{\hat{p}(1-\hat{p})/n}$
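As a small worked example of the error formula, here is a Python sketch that turns a count into a frequency estimate and its standard error. The function name `allele_frequency_estimate` and the sample numbers (30 copies in 100 draws) are made up for illustration.

```python
from math import sqrt

def allele_frequency_estimate(successes, n):
    """Return (p_hat, standard error) for `successes` copies seen in n draws.

    Uses the normal approximation to the binomial, so it is only
    reasonable when n is not too small and p_hat is not near 0 or 1.
    """
    p_hat = successes / n
    std_err = sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, std_err

p_hat, se = allele_frequency_estimate(30, 100)
print(f"p_hat = {p_hat:.2f}, standard error = {se:.3f}")
```

Since the error shrinks as 1/sqrt(n), halving it requires four times as many sampled alleles.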

Page 12: Genome evolution


Testing Hardy-Weinberg (HW) using chi-square statistics

HW oversimplifies everything, but it can be used as a baseline to test whether interesting evolution is going on for some allele.

A classical example is the blood group genotypes M/N (Sanger 1975). This genotype determines the expression of a polysaccharide on red blood cell surfaces, so it was quantifiable before the genomic era:

Genotype   Observed   Expected (HW)
MM         298        294.3
MN         489        496
NN         213        209.3

The expected counts come from the HW proportions, with p estimated from the observed counts:

$P(MM) = p^2, \quad P(MN) = 2pq, \quad P(NN) = q^2$

$\chi^2 = \sum \frac{(obs - exp)^2}{exp} = 0.22$

Chi-square significance can be computed from the chi-square distribution with df degrees of freedom.

Here: df = #classes - #parameters - 1 = 3 (MM/MN/NN) - 1 (p) - 1 = 1
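The test can be reproduced from the observed M/N counts alone. The sketch below is one way to do it in Python; the function name `hardy_weinberg_chi_square` is hypothetical, and the p-value uses the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal, so the code applies only to the df = 1 case.

```python
from math import erfc, sqrt

def hardy_weinberg_chi_square(n_MM, n_MN, n_NN):
    """Chi-square statistic and p-value (df = 1) for HW proportions."""
    total = n_MM + n_MN + n_NN
    # Estimate the allele frequency p from the observed genotype counts.
    p = (2 * n_MM + n_MN) / (2 * total)
    q = 1.0 - p
    observed = (n_MM, n_MN, n_NN)
    expected = (total * p * p, total * 2 * p * q, total * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of chi-square with 1 df: P(|Z| > sqrt(chi2)).
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value

chi2, p_value = hardy_weinberg_chi_square(298, 489, 213)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.2f}")
```

With these counts the statistic comes out near 0.22, so there is no evidence against Hardy-Weinberg proportions at this locus.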

Page 13: Genome evolution


Modeling population: the Wright-Fisher model

[Diagram, haploid model: alleles 1, 2, 3, 4, ..., 2N in generation t and in generation t+1]

[Diagram, diploid model: Nf females and Nm males in generation t and in generation t+1]

Page 14: Genome evolution


Wright-Fisher model for genetic drift

We follow the frequency of an allele in the population, until fixation (f=2N) or loss (f=0)

We can model the frequency as a Markov process on a variable X (the number of A alleles) with transition probabilities:

$T_{ij} = \binom{2N}{j} \left(\frac{i}{2N}\right)^j \left(1 - \frac{i}{2N}\right)^{2N-j}$

(Sampling j A alleles from a population of 2N alleles, i of which are A.)

In a larger population the frequency changes more slowly (the variance of the sampled frequency is pq/2N, so sampling does not change it by much)

[Diagram: states 0 (loss), 1, ..., 2N-1, 2N (fixation)]
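The drift process can be sketched as a short simulation. This is a minimal Python version, assuming the haploid Wright-Fisher sampling described above; the function names, the seed, and the population size (2N = 20) are illustrative choices.

```python
import random

def wright_fisher_step(i, two_n, rng):
    """Draw the next generation: 2N alleles sampled with replacement.

    Each new allele is A with probability i / 2N, so the result is a
    binomial(2N, i/2N) variable, matching the transition probabilities T_ij.
    """
    p = i / two_n
    return sum(1 for _ in range(two_n) if rng.random() < p)

def simulate_until_absorbed(i, two_n, rng):
    """Follow the number of A alleles until loss (0) or fixation (2N)."""
    trajectory = [i]
    while 0 < trajectory[-1] < two_n:
        trajectory.append(wright_fisher_step(trajectory[-1], two_n, rng))
    return trajectory

rng = random.Random(0)  # fixed seed so the run is repeatable
trajectory = simulate_until_absorbed(10, 20, rng)
outcome = "fixation" if trajectory[-1] == 20 else "loss"
print(len(trajectory) - 1, "generations until", outcome)
```

Rerunning with a larger `two_n` shows the slower change in frequency that the binomial variance pq/2N predicts.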

Page 15: Genome evolution


Coalescent and fixation

Page 16: Genome evolution


Drift and fixation probability

Theorem (fixation in drift): In the Wright-Fisher model, the probability of fixation in the A allele state, given a population of 2N alleles out of which i are A, is:

$P_i(X_\tau = 2N) = \frac{i}{2N}$

Proof: The mean of the binomial sample in the n'th step is np:

$E(X_{n+1} \mid X_n = i) = 2N \cdot \frac{i}{2N} = i = X_n$

Which means that the expected number of A's is constant in time.

Since 0 and 2N are absorbing states, given sufficient time the Wright-Fisher process will converge to either 0 or 2N. Define:

$\tau = \min\{n : X_n = 0 \text{ or } X_n = 2N\}$

Intuitively:

$i = E_i(X_\tau) = 2N \cdot P_i(X_\tau = 2N)$

More formally:

$i = E_i(X_n) = E_i(X_\tau; \tau \le n) + E_i(X_n; \tau > n) = E_i(X_\tau) + o(1)$
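The theorem can also be checked without any randomness by iterating the Markov chain itself. The Python sketch below builds the binomial transition matrix for a small population and pushes the probability mass forward until almost all of it sits in the absorbing states. The function names and the choice 2N = 20, i = 5 are assumptions made for the example.

```python
from math import comb

def transition_matrix(two_n):
    """T[i][j] = P(j copies of A next generation | i copies now)."""
    matrix = []
    for i in range(two_n + 1):
        p = i / two_n
        row = [comb(two_n, j) * p ** j * (1 - p) ** (two_n - j)
               for j in range(two_n + 1)]
        matrix.append(row)
    return matrix

def expected_next_count(i, two_n):
    """E(X_{n+1} | X_n = i); the proof relies on this being equal to i."""
    row = transition_matrix(two_n)[i]
    return sum(j * prob for j, prob in enumerate(row))

def fixation_probability(i, two_n, generations=500):
    """Probability mass at state 2N after many generations, starting from i."""
    T = transition_matrix(two_n)
    size = two_n + 1
    dist = [0.0] * size
    dist[i] = 1.0
    for _ in range(generations):
        dist = [sum(dist[a] * T[a][b] for a in range(size)) for b in range(size)]
    return dist[two_n]

print(expected_next_count(5, 20))   # close to 5: the mean does not drift
print(fixation_probability(5, 20))  # close to 5/20
```

The number of generations is a convenience parameter; for 2N = 20 the unabsorbed mass is negligible long before 500 generations.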

Page 17: Genome evolution


Figure 7.4

Drift

Experiments with drifting fly populations: 107 Drosophila melanogaster populations. Each originally consisted of 16 brown eyes (bw) heterozygotes. At each generation, 8 males and 8 females were selected at random from the progeny of the previous generation. The bars show the distribution of allele frequencies in the 107 populations.

[Figure axes: number of bw75 alleles; generation; number of populations]

Page 18: Genome evolution


The geometric distribution: reminder

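Rolling a die, and recording the time until the first appearance of k (waiting time):

$P(T = j) = (1-p)^{j-1} p$

Lack of memory:

$P(T > t_2 \mid T > t_1) = P(T > t_2 - t_1)$

Moments:

$E(T) = 1/p, \quad Var(T) = \frac{1-p}{p^2}$

“Intersection” (S geometric with parameter p, T geometric with parameter p'):

$\min(S, T) \sim geo(p + p' - pp')$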

Page 19: Genome evolution


The exponential distribution: reminder

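The limit of the geometric distribution when the time step is going to 0:

$P(T > j) = (1-p)^j = \left(1 - \frac{a}{M}\right)^{tM} \to e^{-at}$, with time step 1/M, $p = a/M$ and $j = tM$

[Figure: probability for M=2 and M=4; in a short interval Δt the event probability is P = aΔt. Memoryless!]

$P(U < t) = 1 - e^{-at}$

Density:

$\frac{d P(U < t)}{dt} = a e^{-at}$

Moments:

$E(U) = 1/a, \quad Var(U) = \frac{1}{a^2}$

“Intersection” (U exponential with rate a, V exponential with rate b):

$\min(U, V) \sim Exp(a + b)$

$P(U < V) = \frac{a}{a+b}$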
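The limit can be illustrated numerically. In this Python sketch the time axis is cut into M steps per unit time, each with success probability a/M, and the geometric tail is compared with the exponential tail. The function name `geometric_tail` and the values a = 1.5, t = 2 are arbitrary illustrative choices (a/M must stay at or below 1).

```python
from math import exp

def geometric_tail(a, t, M):
    """P(no success in the first t*M steps) with success probability a/M per step."""
    p = a / M
    steps = int(round(t * M))
    return (1.0 - p) ** steps

a, t = 1.5, 2.0
for M in (2, 4, 100, 10 ** 6):
    print(M, geometric_tail(a, t, M))
print("exp(-a*t) =", exp(-a * t))
```

The coarse grids (M = 2 and M = 4) underestimate the tail, and the values approach exp(-a*t) as M grows.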

Page 20: Genome evolution


Coalescence

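Coalescent at time -1?

$P = \frac{1}{2N}$

Coalescent at time -T?

$P = \left(1 - \frac{1}{2N}\right)^{T-1} \frac{1}{2N}$

No coalescence for k samples?

$P = \frac{2N-1}{2N} \cdot \frac{2N-2}{2N} \cdots \frac{2N-(k-1)}{2N} = \prod_{i=1}^{k-1} \left(1 - \frac{i}{2N}\right) = 1 - \sum_{i=1}^{k-1} \frac{i}{2N} + O\left(\frac{1}{N^2}\right) = 1 - \binom{k}{2} \frac{1}{2N} + O\left(\frac{1}{N^2}\right)$

Distribution of time from k to k-1:

$P(T_k = t) \approx \left(1 - \binom{k}{2} \frac{1}{2N}\right)^{t-1} \binom{k}{2} \frac{1}{2N}$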
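To see how good the first-order approximation is, the exact product can be compared with 1 - C(k,2)/2N. This is a small Python sketch; the function names and the values k = 5, 2N = 20000 are illustrative assumptions.

```python
from math import comb, prod

def prob_no_coalescence(k, two_n):
    """Exact probability that k sampled alleles all have distinct parents."""
    return prod(1 - i / two_n for i in range(1, k))

def prob_no_coalescence_approx(k, two_n):
    """First-order approximation, dropping the O(1/N^2) terms."""
    return 1 - comb(k, 2) / two_n

k, two_n = 5, 20000
exact = prob_no_coalescence(k, two_n)
approx = prob_no_coalescence_approx(k, two_n)
print(exact, approx, exact - approx)
```

The gap between the two is of order 1/N^2, which is why the approximation is safe whenever the sample is small relative to the population.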

Page 21: Genome evolution


The continuous time coalescent

When looking at k individuals, we can trace their coalescent backwards and ask when they had k-1, k-2, or one common ancestor.

When sampling k new individuals, the chance of picking the same parent twice is roughly:

$\frac{k(k-1)}{2} \cdot \frac{1}{2N} + O\left(\frac{1}{N^2}\right)$

Theorem: The amount of time during which there are k lineages, $t_k$, has approximately an exponential distribution with mean 2N * (2/(k(k-1)))

[Diagram: genealogy of samples 1-5 from present to past, with E(T_5) = 2N/10, E(T_4) = 2N/6, E(T_3) = 2N/3, E(T_2) = 2N]

Proof: the probability of not merging k lineages in n generations is:

$\left(1 - \frac{k(k-1)}{2} \cdot \frac{1}{2N}\right)^n \approx \exp\left(-\frac{k(k-1)}{2} \cdot \frac{n}{2N}\right)$

We scale the time by the population size, $t = n/2N$, so the waiting time is like an exponential: $e^{-\lambda t}$ with $\lambda = k(k-1)/2$

This is correct for any k, so going backward from the present time, we can estimate the time to coalescence at each step

The expected value, in generations, is

$E(t_k) = \frac{2N}{\lambda} = \frac{4N}{k(k-1)}$

Page 22: Genome evolution


The coalescent

When looking at n individuals, we can trace their coalescent backwards and ask when they had n-1, n-2, or one common ancestor.

The expected time to the common ancestor of n individuals:

$E(T_1) = \sum_{k=2}^{n} \frac{4N}{k(k-1)} = 4N \sum_{k=2}^{n} \left(\frac{1}{k-1} - \frac{1}{k}\right) = 4N \left(1 - \frac{1}{n}\right)$

[Diagram: genealogy of samples 1-5 from present to past, with E(T_5) = 2N/10, E(T_4) = 2N/6, E(T_3) = 2N/3, E(T_2) = 2N]

Theorem: The probability that the most recent common ancestor of a sample of size n is the same as that of the population converges to (n-1)/(n+1) as the population size increases.

4N is the magic number
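The telescoping sum and the exponential waiting times can be put side by side in a short simulation. The Python sketch below assumes the continuous-time approximation (each stage with k lineages lasts an exponential time with mean 4N/(k(k-1)) generations); the function names, the seed, and the values n = 5, N = 1000 are illustrative.

```python
import random

def expected_tmrca(n, N):
    """Sum of the expected stage lengths; equals 4N(1 - 1/n)."""
    return sum(4 * N / (k * (k - 1)) for k in range(2, n + 1))

def simulate_tmrca(n, N, rng):
    """Draw one time to the most recent common ancestor, in generations."""
    total = 0.0
    for k in range(n, 1, -1):
        mean_wait = 4 * N / (k * (k - 1))
        # expovariate takes the rate, i.e. 1 / mean.
        total += rng.expovariate(1.0 / mean_wait)
    return total

rng = random.Random(1)
n, N = 5, 1000
samples = [simulate_tmrca(n, N, rng) for _ in range(20000)]
sim_mean = sum(samples) / len(samples)
print(expected_tmrca(n, N), sim_mean)
```

Most of the total comes from the last stage (k = 2, mean 2N), which is why the expected time approaches 4N no matter how large the sample is.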

Page 23: Genome evolution


Diffusion approximation and Kimura’s solution

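Fisher, and then Kimura, approximated the drift process using a diffusion equation (heat equation):

$\phi(x,t)$: the density of populations with frequency x..x+dx at time t

$J(x,t)$: the flux of probability at time t and frequency x

The change in the density equals the difference between the fluxes J(x,t) and J(x+dx,t). Taking dx to the limit we have:

$\frac{\partial \phi(x,t)}{\partial t} = -\frac{\partial J(x,t)}{\partial x}$

If M(x) is the mean change in allele frequency when the frequency is x, and V(x) is the variance of that change, then the probability flux equals:

$J(x,t) = M(x)\phi(x,t) - \frac{1}{2} \frac{\partial}{\partial x}\left[V(x)\phi(x,t)\right]$

$\frac{\partial \phi(x,t)}{\partial t} = -\frac{\partial}{\partial x}\left[M(x)\phi(x,t)\right] + \frac{1}{2} \frac{\partial^2}{\partial x^2}\left[V(x)\phi(x,t)\right]$

For pure drift:

$M = 0, \quad V(x) = \frac{x(1-x)}{2N}$

$\frac{\partial \phi(x,t)}{\partial t} = \frac{1}{4N} \frac{\partial^2}{\partial x^2}\left[x(1-x)\phi(x,t)\right]$

Also known as: heat diffusion, Fokker-Planck, Kolmogorov forward equation.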

Page 24: Genome evolution


Diffusion approximation and Kimura’s solution

Fisher, and then Kimura, approximated the drift process using a diffusion equation (heat equation). We start by working with a time step dt and a frequency step dx.

$\phi(x,t)$: the probability that the population has allele frequency x at time t

$M(x)$: the probability that the frequency increased from x by dx, due to mutation/selection

$V(x)/2$: the probability of a dx increase, or of a dx decrease, due to drift

We limit changes from t to t+dt to steps of ±dx. The population can be at x at t+dt if:

It was at x and stayed there: $\phi(x,t)\,(1 - M(x) - V(x))$

It was at x-dx and moved to x: $\phi(x-dx,t)\,(M(x-dx) + V(x-dx)/2)$

It was at x+dx and moved to x: $\phi(x+dx,t)\,V(x+dx)/2$

Summing the three cases:

$\phi(x,t+dt) - \phi(x,t) = -[M(x)\phi(x,t) - M(x-dx)\phi(x-dx,t)] + \frac{1}{2}[V(x+dx)\phi(x+dx,t) - V(x)\phi(x,t)] - \frac{1}{2}[V(x)\phi(x,t) - V(x-dx)\phi(x-dx,t)]$

Taking dt and dx to the limit:

$\frac{\partial \phi(x,t)}{\partial t} = -\frac{\partial}{\partial x}\left[M(x)\phi(x,t)\right] + \frac{1}{2} \frac{\partial^2}{\partial x^2}\left[V(x)\phi(x,t)\right]$

Page 25: Genome evolution


Diffusion approximation and Kimura’s solution

Fisher, and then Kimura, approximated the drift process using a diffusion equation (heat equation). We start by working with a time step dt and a frequency step dx.

$\phi(x,t)$: the probability that the population has allele frequency x at time t

$M(x)$: the probability that the frequency increased from x by dx, due to mutation/selection

$V(x)/2$: the probability of a dx increase, or of a dx decrease, due to drift

$\frac{\partial \phi(x,t)}{\partial t} = -\frac{\partial}{\partial x}\left[M(x)\phi(x,t)\right] + \frac{1}{2} \frac{\partial^2}{\partial x^2}\left[V(x)\phi(x,t)\right]$

For drift the variance is binomial:

$V(x) = \frac{x(1-x)}{2N}$

And we assume no selection:

$M(x) = 0$

Still not easy to solve analytically…

Page 26: Genome evolution


Changes in allele-frequencies, Fisher-Wright model

After about 4N generations, just 10% of the cases are not fixed and the distribution becomes flat.

Page 27: Genome evolution


Absorption time and Time to fixation

According to Kimura’s solution, the mean time to fixation of an allele, given an initial frequency p and given that it was not lost, is:

$\hat{t}_1(p) = -\frac{4N(1-p)\log(1-p)}{p}$

The mean time to allele loss is (the fixation time of the complement event):

$\hat{t}_0(p) = -\frac{4N\,p\,\log(p)}{1-p}$
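The two formulas are easy to evaluate directly. The Python sketch below does that, with hypothetical function names and N = 1000 chosen only for illustration; the logarithm is taken to be the natural logarithm.

```python
from math import log

def mean_fixation_time(p, N):
    """Mean generations to fixation, conditional on fixation, from frequency p."""
    return -4 * N * (1 - p) * log(1 - p) / p

def mean_loss_time(p, N):
    """Mean generations to loss, conditional on loss, from frequency p."""
    return -4 * N * p * log(p) / (1 - p)

N = 1000
new_mutant = 1 / (2 * N)  # a single new copy among 2N alleles
print(mean_fixation_time(new_mutant, N))  # close to 4N
print(mean_loss_time(new_mutant, N))      # much shorter
print(mean_fixation_time(0.5, N), mean_loss_time(0.5, N))
```

For a single new copy, the conditional fixation time is close to 4N generations while the conditional loss time is only a handful of generations, which matches the picture of most new alleles disappearing quickly.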

Page 28: Genome evolution


Effective population size

4N generations looks like a huge number (in a population of billions!)

But in fact, the Wright-Fisher model (like the Hardy-Weinberg model) is based on many unrealistic assumptions, including random mating – any two individuals can mate

The effective population size is defined as the size of an idealized population for which the predicted dynamics of changes in allele frequency are similar to the observed ones

For each measurable statistic of population dynamics, a different effective population size can be computed

For example, the expected variance in allele frequency is expressed as:

$V(p_{t+1}) = \frac{p_t(1-p_t)}{2N}$

But we can use the same formula to define the effective population size given the variance:

$V(p_{t+1}) = \frac{p_t(1-p_t)}{2N_e}$

Page 29: Genome evolution


Effective population size: changing populations

If the population is changing over time, the dynamics will be affected by the harmonic mean of the sizes:

$\frac{1}{N_e} = \frac{1}{t}\left(\frac{1}{N_0} + \frac{1}{N_1} + \dots + \frac{1}{N_{t-1}}\right)$

So the effective population size is dominated by the size of the smallest bottleneck

Bottlenecks can occur during migration, environmental stress, isolation

Such effects greatly decrease heterozygosity (founder effect – for example Tay-Sachs in “ashkenazim”)

Bottlenecks can accelerate fixation of neutral or even deleterious mutations as we shall see later.

Human effective population size in the recent 2My is estimated around 10,000 (due to bottlenecks). (so when was our T1?)
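A short calculation shows how strongly a single bottleneck pulls the harmonic mean down. This Python sketch uses a made-up history of nine generations at size 1000 and one generation at size 10; the function name `effective_size_harmonic` is hypothetical.

```python
def effective_size_harmonic(sizes):
    """Harmonic mean of per-generation population sizes N_0 .. N_{t-1}."""
    t = len(sizes)
    return t / sum(1.0 / n for n in sizes)

history = [1000] * 9 + [10]  # illustrative: one severe bottleneck
n_e = effective_size_harmonic(history)
arithmetic_mean = sum(history) / len(history)
print(f"N_e = {n_e:.1f}, arithmetic mean = {arithmetic_mean:.1f}")
```

The arithmetic mean stays near 900, while the effective size falls below 100, close to ten times the bottleneck size.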

Page 30: Genome evolution

Effective population size: unequal sex ratio, and sex chromosomes

$N_a = N_m + N_f$

If there are more females than males, or there are fewer males participating in reproduction, then the effective population size will be smaller:

$N_e = \frac{4 N_m N_f}{N_m + N_f}$ (any combination of alleles from a male and a female)

So if there are 10 times more females than males in the population (N_m = x, N_f = 10x), the effective population size is 4*x*10x/(11x) = 40x/11 ≈ 3.6x, much less than the size of the population (11x).

Another example is the X chromosome, which is present in only one copy in males.

$p = \frac{1}{3} p_m + \frac{2}{3} p_f, \quad Var(p) = \frac{1}{9} \cdot \frac{p_m q_m}{N_m} + \frac{4}{9} \cdot \frac{p_f q_f}{2N_f}$

With $p_m = p_f = p$:

$Var(p) = \frac{pq}{9N_m} + \frac{4pq}{18N_f} = \frac{pq\,(4N_m + 2N_f)}{18 N_m N_f}$

Equating this with $pq/(2N_e)$ gives:

$N_e = \frac{9 N_m N_f}{4N_m + 2N_f}$
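Both effective-size formulas can be evaluated for a few sex ratios. The following Python sketch uses hypothetical function names and illustrative population sizes.

```python
def ne_autosomal(n_m, n_f):
    """Effective size for an autosomal locus with n_m males and n_f females."""
    return 4 * n_m * n_f / (n_m + n_f)

def ne_x_linked(n_m, n_f):
    """Effective size for an X-linked locus (one copy in males, two in females)."""
    return 9 * n_m * n_f / (4 * n_m + 2 * n_f)

# Ten times more females than males (x = 100): N_e is about 3.6x, not 11x.
print(ne_autosomal(100, 1000))
# Equal sex ratio: the autosomal N_e equals the census size,
# and the X-linked N_e is three quarters of it.
print(ne_autosomal(500, 500), ne_x_linked(500, 500))
```

With equal numbers of males and females the X-linked value is 3/4 of the autosomal one, which reflects the three X copies carried per breeding pair versus four autosomal copies.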

Page 31: Genome evolution


Population genetics

Drift: The process by which allele frequencies are changing through generations

Mutation: The process by which new alleles are being introduced

Recombination: the process by which multi-allelic genomes are mixed

Selection: the effect of fitness on the dynamics of allele drift

Epistasis: the drift effects of fitness dependencies among different alleles

“Organismal” effects: Ecology, Geography, Behavior