Variation within and between species

Post on 27-May-2015

Category:

Education

DESCRIPTION

Case study: Are Neanderthals still among us?

TRANSCRIPT

1

Introduction to Bioinformatics

2

Introduction to Bioinformatics.

LECTURE 5: Variation within and between species

* Chapter 5: Are Neanderthals among us?

3

Neandertal, Germany, 1856

Initial interpretations:

* bear skull

* pathological idiot

* Old Dutchman ...

7

5.1 Variation in DNA sequences

* Even closely related individuals differ in genetic sequences

* (Point) mutations: copy errors at a particular location

* Sexual reproduction – diploid genome

8

Diploid chromosomes

9

Mitosis: diploid reproduction

10

Meiosis: diploid (=double) → haploid (=single)

11

* Typing error rate of a very good typist: 1 error per 1,000 typed letters

* All our diploid cells constantly copy 7 billion letters

* Typical cell copying error rate: ~1 error per 1 Gbp

12

GERM LINE

Reverse time and follow your cells:

• Now you count ~$10^{13}$ cells

• One generation ago you had 2 cells ‘somewhere’ in your parents' bodies

• Small T generations ago you had ($2^T$ – multiple ancestors) cells

• Large T generations ago you counted #(fertile ancestors) cells

• Congratulations: you are 3.4 billion years old!

Fast-forward time and follow your cells:

• Only a few cells in your reproductive organs have a chance to live on in the next generations

• The rest (including you) will die …

13

GERM LINE MUTATIONS

This potentially immortal lineage of (germ) cells is called the GERM LINE

All mutations that we have accumulated are en route on the germ line

14

* Polymorphism: multiple possibilities for a nucleotide; each variant is an allele

* Single Nucleotide Polymorphism, SNP (“snip”): a point mutation, for example AAATAAA vs AAACAAA

* Humans: about 1 SNP per 1500 bases = 0.067%

* STR = Short Tandem Repeats (microsatellites), for example CACACACACACACACACA …

* Transitions (purine ↔ purine or pyrimidine ↔ pyrimidine) versus transversions (purine ↔ pyrimidine)

15

Purines – Pyrimidines

16

Transitions – Transversions
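The distinction between transitions and transversions can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the code assumes aligned sequences that contain nothing but A, C, G and T.

```python
# Purines and pyrimidines; a substitution within one class is a transition,
# a substitution between the classes is a transversion.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a, b):
    """Return 'identical', 'transition' or 'transversion' for two bases.

    Assumes both bases are one of A, C, G, T (no ambiguity codes).
    """
    a, b = a.upper(), b.upper()
    if a == b:
        return "identical"
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_substitutions(seq1, seq2):
    """Count (transitions, transversions) between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    kinds = [classify_substitution(a, b) for a, b in zip(seq1, seq2)]
    return kinds.count("transition"), kinds.count("transversion")


# The SNP example from the slides: T versus C is a pyrimidine-pyrimidine change.
print(count_substitutions("AAATAAA", "AAACAAA"))
```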

17

5.2 Mitochondrial DNA

* Mitochondria are inherited only via the maternal line!

* mtDNA is very suitable for comparing evolution: it is not reshuffled by recombination

18

H. sapiens mitochondrion

19

EM photograph of H. sapiens mtDNA

21

5.3 Variation between species

* Genetic variation accounts for morphological, physiological, and behavioral variation

* Genetic variation (that is, genetic distance) relates to phylogenetic relationship

* We therefore need a way to measure distances between sequences: a metric

22

Substitution rate

* Mutations originate in single individuals

* Mutations can become fixed in a population

* Mutation rate: rate at which new mutations arise

* Substitution rate: rate at which a species fixes new mutations

* For neutral mutations, the substitution rate equals the mutation rate

23

Substitution rate and mutation rate

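* For neutral mutations in a diploid population of N individuals, 2Nμ new mutations arise per generation (μ is the mutation rate), and each one becomes fixed with probability 1/(2N), so:

* $\rho = 2N\mu \cdot \frac{1}{2N} = \mu$

* $\rho = K/(2T)$, where K is the genetic distance between two species that diverged a time T ago (the factor 2 counts both lineages)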
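As a small numerical illustration of the two relations above, the sketch below evaluates them in Python. The function names and the example numbers (population size, mutation rate, distance, divergence time) are made up for illustration and are not taken from the lecture.

```python
def neutral_substitution_rate(N, mu):
    """rho = (2N * mu) * 1/(2N): new mutations per generation times fixation probability."""
    new_mutations_per_generation = 2 * N * mu
    fixation_probability = 1.0 / (2 * N)
    return new_mutations_per_generation * fixation_probability


def substitution_rate_from_distance(K, T):
    """rho = K / (2T): distance K accumulated along two lineages of length T each."""
    if T <= 0:
        raise ValueError("divergence time T must be positive")
    return K / (2.0 * T)


# Hypothetical numbers: the population size cancels out, so rho equals mu.
print(neutral_substitution_rate(N=10_000, mu=1e-8))
print(neutral_substitution_rate(N=1_000_000, mu=1e-8))

# Hypothetical numbers: K = 0.1 substitutions per site, T = 5 million generations.
print(substitution_rate_from_distance(K=0.1, T=5e6))
```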

24

5.4 Estimating genetic distance

* Substitutions are independent (?)

* Substitutions are random

* Multiple substitutions may occur

* Back-mutations mutate a nucleotide back to an earlier value

25

Multiple substitutions and back-mutations conceal the real genetic distance

GACTGATCCACCTCTGATCCTTTGGAACTGATCGT
TTCTGATCCACCTCTGATCCTTTGGAACTGATCGT
TTCTGATCCACCTCTGATCCATCGGAACTGATCGT
GTCTGATCCACCTCTGATCCATTGGAACTGATCGT

observed: 2 (= d)

actual: 4 (= K)

26

* Saturation: on average one substitution per site

* Two random sequences of equal length will match for approximately ¼ of their sites

* In saturation the two sequences therefore differ at ¾ of their sites: the proportional genetic distance is ¾
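The claim about random sequences is easy to check numerically. The following is a minimal sketch that assumes all four nucleotides are equally likely; the helper names, the seed and the sequence length are arbitrary choices.

```python
import random


def random_sequence(n, rng):
    """Draw n nucleotides uniformly at random from A, C, G, T."""
    return "".join(rng.choice("ACGT") for _ in range(n))


def match_fraction(seq1, seq2):
    """Fraction of aligned sites at which the two sequences agree."""
    return sum(a == b for a, b in zip(seq1, seq2)) / len(seq1)


rng = random.Random(0)  # fixed seed so that the run is reproducible
n = 100_000
f = match_fraction(random_sequence(n, rng), random_sequence(n, rng))
print(f"matching sites: {f:.3f}, differing sites: {1 - f:.3f}")
```

With a long sequence the matching fraction should come out close to 0.25, so the fraction of differing sites is close to 0.75.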

27

* True genetic distance (proportion): K

* Observed proportion of differences: d

* Due to back-mutations K ≥ d

28

SEQUENCE EVOLUTION is a Markov process: a sequence at generation (= time) t depends only on the sequence at generation t-1

29

The Jukes-Cantor model

Correction for multiple substitutions

Substitution probability per site per generation is α

Substitution means there are 3 possible replacements (e.g. C → {A,G,T})

Non-substitution means there is 1 possibility (e.g. C → C)

30

Therefore, the one-step Markov process has the following transition matrix:

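$$
M_{JC} =
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & 1-\alpha & \alpha/3 & \alpha/3 & \alpha/3 \\
C & \alpha/3 & 1-\alpha & \alpha/3 & \alpha/3 \\
G & \alpha/3 & \alpha/3 & 1-\alpha & \alpha/3 \\
T & \alpha/3 & \alpha/3 & \alpha/3 & 1-\alpha
\end{array}
$$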

31

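After t generations the matrix of substitution probabilities is:

$M(t) = M_{JC}^t$

Eigenvalues and eigenvectors of $M_{JC}$ (normalized so that $v_i^T v_j = \delta_{ij}$):

$\lambda_1 = 1$ (multiplicity 1): $v_1 = \frac{1}{2}(1\ 1\ 1\ 1)^T$

$\lambda_{2..4} = 1 - 4\alpha/3$ (multiplicity 3): $v_2 = \frac{1}{2}(-1\ -1\ 1\ 1)^T$

$v_3 = \frac{1}{2}(1\ -1\ -1\ 1)^T$

$v_4 = \frac{1}{2}(1\ -1\ 1\ -1)^T$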

32

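Spectral decomposition of M(t):

$$M_{JC}^t = \sum_i \lambda_i^t v_i v_i^T$$

Define M(t) as:

$$
M_{JC}^t =
\begin{pmatrix}
r(t) & s(t) & s(t) & s(t) \\
s(t) & r(t) & s(t) & s(t) \\
s(t) & s(t) & r(t) & s(t) \\
s(t) & s(t) & s(t) & r(t)
\end{pmatrix}
$$

Therefore, substitution probability s(t) per site after t generations is:

$$s(t) = \tfrac{1}{4} - \tfrac{1}{4}\left(1 - 4\alpha/3\right)^t$$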
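The closed form for s(t) can be checked against a brute-force matrix power. The sketch below uses plain Python lists; the helper names and the values α = 0.01 and t = 50 are illustrative assumptions, not values from the lecture.

```python
def jc_matrix(alpha):
    """One-step Jukes-Cantor matrix: 1 - alpha on the diagonal, alpha/3 elsewhere."""
    return [[1 - alpha if i == j else alpha / 3 for j in range(4)] for i in range(4)]


def mat_mul(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def mat_pow(M, t):
    """M raised to the integer power t by repeated multiplication."""
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for _ in range(t):
        result = mat_mul(result, M)
    return result


def s_closed_form(alpha, t):
    """Off-diagonal entry of M^t: s(t) = 1/4 - 1/4 (1 - 4 alpha / 3)^t."""
    return 0.25 - 0.25 * (1 - 4 * alpha / 3) ** t


alpha, t = 0.01, 50  # illustrative values
Mt = mat_pow(jc_matrix(alpha), t)
print("off-diagonal entry of M^t:", Mt[0][1])
print("closed form s(t):         ", s_closed_form(alpha, t))
print("row sum:                  ", sum(Mt[0]))
```

Each row of M(t) sums to 1, so the diagonal entry is r(t) = 1 - 3 s(t).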

33

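substitution probability s(t) per site after t generations, for each of the 3 possible replacement nucleotides:

$s(t) = \frac{1}{4} - \frac{1}{4}\left(1 - 4\alpha/3\right)^t$

observed genetic distance d after t generations ≈ 3 s(t):

$d = \frac{3}{4} - \frac{3}{4}\left(1 - 4\alpha/3\right)^t$

For small α, using $\ln(1 - 4\alpha/3) \approx -4\alpha/3$:

$\alpha t \approx -\frac{3}{4}\ln\left(1 - \frac{4}{3}d\right)$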

34

For small α the observed genetic distance d satisfies:

$\alpha t \approx -\frac{3}{4}\ln\left(1 - \frac{4}{3}d\right)$

The actual genetic distance is (of course):

K = αt

So:

$K \approx -\frac{3}{4}\ln\left(1 - \frac{4}{3}d\right)$

This is the Jukes-Cantor formula: independent of α and t.

35

The Jukes-Cantor formula:

$K \approx -\frac{3}{4}\ln\left(1 - \frac{4}{3}d\right)$

For small d, using ln(1+x) ≈ x: K ≈ d. So: actual distance ≈ observed distance

For saturation: d ↑ ¾: K → ∞. So: if the observed distance corresponds to the distance between random sequences, then the actual distance becomes indeterminate
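Both limits of the Jukes-Cantor formula show up clearly when it is evaluated numerically. The function below is a minimal sketch; its name and the choice to raise an error at d ≥ ¾ are assumptions made for this example.

```python
import math


def jukes_cantor(d):
    """Jukes-Cantor corrected distance K = -3/4 ln(1 - 4d/3) for 0 <= d < 3/4."""
    if not 0 <= d < 0.75:
        raise ValueError("observed distance d must satisfy 0 <= d < 0.75")
    return -0.75 * math.log(1 - 4 * d / 3)


# For small d the correction is negligible; near d = 3/4 it grows without bound.
for d in (0.01, 0.1, 0.3, 0.5, 0.7, 0.74):
    print(f"d = {d:4.2f}   K = {jukes_cantor(d):6.3f}")
```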

36

Jukes-Cantor

37

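Variance in K

If K = f(d) then:

$$\delta K = \frac{\partial K}{\partial d}\,\delta d \;\Rightarrow\; \delta K^2 = \left(\frac{\partial K}{\partial d}\right)^2 \delta d^2$$

So:

$$\mathrm{Var}(K) = \left(\frac{\partial K}{\partial d}\right)^2 \mathrm{Var}(d)$$

Generation of a sequence of length n with substitution rate d is a binomial process:

$$\mathrm{Prob}(k) = \binom{n}{k}\, d^k (1-d)^{n-k}$$

and therefore with variance: Var(d) = d(1-d)/n

Because of the Jukes-Cantor formula:

$$\frac{\partial K}{\partial d} = \frac{1}{1 - \frac{4}{3}d}$$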

38

Variance in K

Variance: Var(d) = d(1-d)/n

Jukes-Cantor:

$$\frac{\partial K}{\partial d} = \frac{1}{1 - \frac{4}{3}d}$$

So:

$$\mathrm{Var}(K) \approx \frac{d(1-d)}{n\left(1 - \frac{4}{3}d\right)^2}$$
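To get a feel for how quickly the uncertainty in K grows as d approaches saturation, the variance formula can be tabulated. This is a sketch only: the function name is hypothetical and n = 1000 is an arbitrary illustrative sequence length.

```python
import math


def jc_variance(d, n):
    """Var(K) ~ d (1 - d) / (n (1 - 4d/3)^2) for a sequence of n sites."""
    if not 0 <= d < 0.75:
        raise ValueError("observed distance d must satisfy 0 <= d < 0.75")
    return d * (1 - d) / (n * (1 - 4 * d / 3) ** 2)


n = 1000  # illustrative sequence length
for d in (0.05, 0.1, 0.3, 0.5, 0.7):
    sd = math.sqrt(jc_variance(d, n))
    print(f"d = {d:4.2f}   Var(K) = {jc_variance(d, n):.2e}   sd(K) = {sd:.3f}")
```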

39

Var(K)

40

EXAMPLE 5.4 on page 90

* Create artificial data with n = 1000: generate K* mutations

* Count d

* With Jukes-Cantor relation reconstruct estimate K(d)

* Plot K(d) – K*
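The steps above can be sketched as a small simulation. This is not the textbook's own code: it is a minimal Python sketch in which K* is interpreted as the number of applied substitutions divided by n, sites are chosen uniformly at random, and the difference K(d) - K* is printed rather than plotted. The helper names and the seed are arbitrary.

```python
import math
import random


def jukes_cantor(d):
    """Jukes-Cantor corrected distance for an observed proportion d < 3/4."""
    return -0.75 * math.log(1 - 4 * d / 3)


def simulate(n, n_mutations, rng):
    """Apply n_mutations random substitutions to a random sequence of length n.

    Returns (K_true, d): substitutions per site actually applied, and the
    observed proportion of sites that differ from the original sequence.
    """
    original = [rng.choice("ACGT") for _ in range(n)]
    mutated = list(original)
    for _ in range(n_mutations):
        site = rng.randrange(n)
        # A substitution always changes the base to one of the 3 other bases.
        mutated[site] = rng.choice([b for b in "ACGT" if b != mutated[site]])
    d = sum(a != b for a, b in zip(original, mutated)) / n
    return n_mutations / n, d


rng = random.Random(1)  # fixed seed for reproducibility
n = 1000
for n_mutations in (50, 200, 500, 1000):
    k_true, d = simulate(n, n_mutations, rng)
    k_est = jukes_cantor(d)
    print(f"K* = {k_true:.2f}   d = {d:.3f}   K(d) = {k_est:.3f}   "
          f"K(d) - K* = {k_est - k_true:+.3f}")
```

The observed d never exceeds K*, because multiple hits and back-mutations at the same site are counted at most once.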

44

Example 5.4 on page 90 (= Fig. 5.3)

45

The Kimura 2-parameter model

Include substitution bias in correction factor

Transition probability (G↔A and T↔C) per site per generation is α

Transversion probability (G↔T, G↔C, A↔T, and A↔C) per site per generation is β

46

The one-step Markov process substitution matrix now becomes:

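$$
M_{K2P} =
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & 1-\alpha-\beta & \beta & \alpha & \beta \\
C & \beta & 1-\alpha-\beta & \beta & \alpha \\
G & \alpha & \beta & 1-\alpha-\beta & \beta \\
T & \beta & \alpha & \beta & 1-\alpha-\beta
\end{array}
$$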

47

After t generations the matrix of substitution probabilities is:

$M(t) = M_{K2P}^t$

Determine the eigenvalues $\{\lambda_i\}$ and eigenvectors $\{v_i\}$ of $M_{K2P}$

48

Spectral decomposition of M(t):

$M_{K2P}^t = \sum_i \lambda_i^t v_i v_i^T$

Determine fraction of transitions per site after t generations: P(t)

Determine fraction of transversions per site after t generations: Q(t)

Genetic distance: $K \approx -\frac{1}{2}\ln(1 - 2P - Q) - \frac{1}{4}\ln(1 - 2Q)$

Fraction of substitutions d = P + Q; without transition bias (Q = 2P) this reduces to the Jukes-Cantor formula
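The Kimura distance is straightforward to evaluate once P and Q have been counted. The sketch below uses hypothetical function names and made-up values of P and Q; it also checks numerically that the formula coincides with Jukes-Cantor when P = d/3 and Q = 2d/3.

```python
import math


def kimura_2p(P, Q):
    """Kimura 2-parameter distance from transition (P) and transversion (Q) fractions."""
    a = 1 - 2 * P - Q
    b = 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("P and Q are too large: the distance is undefined")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


def jukes_cantor(d):
    """Jukes-Cantor corrected distance for an observed proportion d < 3/4."""
    return -0.75 * math.log(1 - 4 * d / 3)


# Made-up example with a transition bias: many more transitions than transversions.
P, Q = 0.15, 0.05
print("K2P:", kimura_2p(P, Q), "  JC on d = P + Q:", jukes_cantor(P + Q))

# Without bias (P = d/3, Q = 2d/3) the two formulas give the same value.
d = 0.3
print("K2P:", kimura_2p(d / 3, 2 * d / 3), "  JC:", jukes_cantor(d))
```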

49

Other models for nucleotide evolution

* Different types of transitions/transversions

* Pairwise substitution rates: the GTR (= General Time Reversible) model

* Amino-acid substitution matrices

* …

50

Other models for nucleotide evolution

SHORTCOMING:

All of the above models assume symmetric substitution probabilities:

prob(A→T) = prob(T→A)

There is now strong evidence that this assumption does not hold

Challenge: incorporate this in a self-consistent model

51

5.5 CASE STUDY: Neanderthals

* mtDNA of 206 H. sapiens from different regions

* Fragments of mtDNA of 2 H. neanderthalensis, including the original 1856 specimen

* All 208 sequences were taken from GenBank

* A homologous sequence of 800 bp of the HVR (hypervariable region) could be found in all 208 specimens

52

* Pairwise genetic difference, corrected with the Jukes-Cantor formula

* d(i,j) is the JC-corrected genetic difference between pair (i,j)

* $d^T = d$: the distance table is symmetric

* MDS (Multi-Dimensional Scaling): translate the distance table d into an n-dimensional map X, here a 2D map
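The first two steps, building the symmetric table of JC-corrected distances, might look like the sketch below. The three short sequences are made up for illustration and are not the GenBank data used in the case study; the MDS step itself is not implemented here, although the resulting table is the kind of input an MDS routine expects.

```python
import math


def p_distance(seq1, seq2):
    """Observed proportion of differing sites between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)


def jukes_cantor(d):
    """Jukes-Cantor corrected distance for an observed proportion d < 3/4."""
    return -0.75 * math.log(1 - 4 * d / 3)


def distance_matrix(seqs):
    """Symmetric table of JC-corrected distances with zeros on the diagonal."""
    n = len(seqs)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = jukes_cantor(p_distance(seqs[i], seqs[j]))
    return D


# Made-up toy alignment (hypothetical data, for illustration only).
toy = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAT"]
D = distance_matrix(toy)
for row in D:
    print("  ".join(f"{x:.3f}" for x in row))
```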

53

distance map d(i,j)

54

MDS map: H. sapiens and H. neanderthalensis are well-separated

55

Phylogenetic tree

56

END of LECTURE 5
