Post on 21-Dec-2015

Estimating recombination rates using three-site likelihoods

Jeff Wall

Program in Molecular and Computational Biology, USC

DNA sequence variation

Patterns of DNA sequence variation are affected by

• mutation
• recombination
• population structure
• changes in population size
• natural selection
• genetic drift

Standard double strand break model of recombination

Gene conversion

Crossover (with gene conversion)

Slide courtesy of M. Przeworski

Approximated as

• Gene conversion
• Crossover

Patchworks are ignored.

Gene conversion

• Most population genetic models ignore gene conversion. However, gene conversion has a strong effect on the levels of linkage disequilibrium between closely linked sites.

Crossing over: recombinants are produced at a rate proportional to the genetic distance between the sites.

Gene conversion: recombinants are produced at a rate that is roughly independent of the distance between the sites.

Effect of gene conversion on patterns of linkage disequilibrium (LD)

Gene conversion leads to a steeper decay of LD at short distances.

[Figure: average r2 plotted against physical distance between markers (0 to 20,000 bp), with one curve for no gene conversion and one for gene conversion. Figure courtesy of M. Przeworski.]
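The LD measure on the vertical axis, r2, can be computed for a single pair of biallelic sites from the sampled haplotypes. The following is a minimal sketch, not code from the talk; the function name `r_squared` and the input layout (one two-allele tuple per sampled chromosome) are assumptions made for illustration.

```python
def r_squared(haplotypes):
    """Return r^2 between two biallelic sites.

    haplotypes: list of (allele_at_site_1, allele_at_site_2) tuples,
    one per sampled chromosome (hypothetical input layout).
    """
    n = len(haplotypes)
    # Use the alleles carried by the first chromosome as reference alleles.
    a, b = haplotypes[0]
    p_a = sum(1 for h in haplotypes if h[0] == a) / n
    p_b = sum(1 for h in haplotypes if h[1] == b) / n
    p_ab = sum(1 for h in haplotypes if h == (a, b)) / n
    # D is the deviation of the haplotype frequency from independence.
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic in the sample")
    return d * d / denom
```

Averaging this quantity over all pairs of sites that fall within a given distance bin gives a curve of the kind shown in the figure.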

Implications of high levels of gene conversion

• To detect natural selection (Andolfatto and Nordborg 1998; Berry and Barbadilla 2000)

• For linkage disequilibrium-based association studies

Parameters

ρ = 4 N_e r_co, where N_e is the effective population size and r_co is the crossover rate per bp per generation

f = r_gc / r_co, where r_gc is the rate of gene conversion initiation per bp per generation

t = mean gene conversion tract length. We assume that gene conversion tract lengths follow a geometric distribution.
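To see how ρ, f and t combine, the sketch below computes the population-scaled exchange rate between two sites d bp apart. It is an illustration under stated assumptions, not the talk's own code: tracts are assumed to extend in one direction from their initiation point, the geometric tract-length distribution is approximated by an exponential with mean t, and a tract only breaks up a pair of sites if it covers exactly one of them. The function names are hypothetical.

```python
import math


def crossover_rho(d, rho):
    """Scaled crossover rate between two sites d bp apart (rho is per bp)."""
    return rho * d


def conversion_rho(d, rho, f, t):
    """Scaled rate at which a conversion tract covers exactly one of two sites.

    Assumes one-directional tracts with exponentially distributed length
    (mean t bp), as an approximation to the geometric distribution.
    """
    return 2.0 * f * rho * t * (1.0 - math.exp(-d / t))


def effective_rho(d, rho, f, t):
    """Total scaled exchange rate between the two sites."""
    return crossover_rho(d, rho) + conversion_rho(d, rho, f, t)
```

Under these assumptions the crossover term grows linearly with d, while the gene conversion term is close to 2fρd at short distances and levels off at 2fρt once d is much larger than t. With ρ = 0.001 / bp, f = 4 and t = 125 bp, that plateau is 1.0.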

General Approach

Ideally we would calculate the probability of the data as a function of the recombination parameters.

However, full likelihood methods (e.g., Fearnhead & Donnelly 2001) are too computationally intensive.

The composite likelihood approach calculates likelihoods for small subsets of the data, then multiplies these likelihoods over many subsets.
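The multiplication over subsets is usually carried out as a sum of log-likelihoods. A minimal sketch follows, assuming the subsets are all combinations of a fixed number of sites; `subset_log_likelihood` is a hypothetical callable standing in for the per-subset likelihood, which in practice would come from coalescent simulations or lookup tables and is not specified here.

```python
import math
from itertools import combinations


def composite_log_likelihood(sites, subset_size, subset_log_likelihood, params):
    """Sum per-subset log-likelihoods over all subsets of the given size.

    Summing logs is equivalent to multiplying the subset likelihoods.
    """
    return sum(
        subset_log_likelihood(subset, params)
        for subset in combinations(sites, subset_size)
    )


def toy_log_likelihood(subset, params):
    """Hypothetical stand-in: every subset gets the same likelihood params['p']."""
    return math.log(params["p"])
```

With four sites, `subset_size=2` sums over six pairs and `subset_size=3` sums over four triplets. Because the subsets overlap, the result is not a true likelihood, only a tractable surrogate for one.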

Composite likelihood over pairs of sites (Frisse et al. 2001)

Sequence 1 a c c g a t g c g t a a g c t

Sequence 2 g t a g a t g c g t c a g c t

Sequence 3 g t a g t c g t g t c g g c c

Sequence 4 a c a g t c g t g t c g g t t

Sequence 5 a c a g t c g t g t a g g t t

Sequence 6 a c c g a c g c c c a a g c t

Sequence 7 a c c g a t g c c c a a g c t

Sequence 8 a c c g a t g c c c a a g c c

Sequence 9 a c c t a t g c g t a a g c t

Sequence 10 a c c g a t a c g t c g g t t

Sequence 11 a c a g a c g c g t c g c c t

Sequence 12 g t a g a t g c c c a a g c t

Composite likelihood over triplets of sites (Wall 2004)

Sequence 1 a c c g a t g c g t a a g c t

Sequence 2 g t a g a t g c g t c a g c t

Sequence 3 g t a g t c g t g t c g g c c

Sequence 4 a c a g t c g t g t c g g t t

Sequence 5 a c a g t c g t g t a g g t t

Sequence 6 a c c g a c g c c c a a g c t

Sequence 7 a c c g a t g c c c a a g c t

Sequence 8 a c c g a t g c c c a a g c c

Sequence 9 a c c t a t g c g t a a g c t

Sequence 10 a c c g a t a c g t c g g t t

Sequence 11 a c a g a c g c g t c g c c t

Sequence 12 g t a g a t g c c c a a g c t
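The difference between the pair and triplet methods is which subsets of segregating sites are tabulated. The sketch below enumerates both for the twelve example sequences shown above; the helper names are hypothetical, and how each tabulated configuration is turned into a likelihood is not shown.

```python
from collections import Counter
from itertools import combinations

# The twelve example sequences, copied from the slide.
ALIGNMENT = [
    "a c c g a t g c g t a a g c t",
    "g t a g a t g c g t c a g c t",
    "g t a g t c g t g t c g g c c",
    "a c a g t c g t g t c g g t t",
    "a c a g t c g t g t a g g t t",
    "a c c g a c g c c c a a g c t",
    "a c c g a t g c c c a a g c t",
    "a c c g a t g c c c a a g c c",
    "a c c t a t g c g t a a g c t",
    "a c c g a t a c g t c g g t t",
    "a c a g a c g c g t c g c c t",
    "g t a g a t g c c c a a g c t",
]
SEQUENCES = [row.replace(" ", "") for row in ALIGNMENT]


def segregating_sites(seqs):
    """Return the 0-based column indices where more than one allele is present."""
    return [i for i in range(len(seqs[0])) if len({s[i] for s in seqs}) > 1]


def configuration(seqs, sites):
    """Count the haplotypes observed at the given tuple of column indices."""
    return Counter("".join(s[i] for i in sites) for s in seqs)


SITES = segregating_sites(SEQUENCES)
PAIRS = list(combinations(SITES, 2))      # subsets used by the pair method
TRIPLETS = list(combinations(SITES, 3))   # subsets used by the triplet method
```

For example, `configuration(SEQUENCES, (0, 1))` tabulates the two-site haplotypes at the first two columns, and `configuration(SEQUENCES, (0, 1, 2))` does the same for the first three.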

Simulations

We ran simulations of 5 kb loci with n = 50, θ = ρ = 0.001 / bp, f = 4 and t = 125 bp.

We analyzed each locus individually as well as groups of 5, 20 and 100 loci (assuming each locus is evolutionarily independent). For each group, we estimated f over a grid of values using the methods of Frisse et al. (2001) and Wall (2004).
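Because the loci are treated as independent, a group estimate can be obtained by adding the per-locus log composite likelihoods at each grid point and taking the grid value with the largest total. The sketch below illustrates this step only. The grid values are an assumption read off the axis labels of the histograms in this talk, and the per-locus log-likelihood surfaces are taken as given inputs.

```python
# Assumed grid of f values (taken from the histogram axis labels).
F_GRID = [0, 1, 1.4, 2, 2.8, 4, 5.6, 8, 11.2, 16]


def estimate_f(per_locus_loglik, grid=F_GRID):
    """Return (best_f, totals) for a group of independent loci.

    per_locus_loglik: one list per locus, giving the log composite
    likelihood of that locus at each value in `grid`.
    """
    for locus in per_locus_loglik:
        if len(locus) != len(grid):
            raise ValueError("each locus needs one value per grid point")
    # Independence across loci means log-likelihoods add.
    totals = [sum(locus[i] for locus in per_locus_loglik) for i in range(len(grid))]
    best = max(range(len(grid)), key=totals.__getitem__)
    return grid[best], totals
```

A single locus is the special case of a group of size one, so the same function covers the 1, 5, 20 and 100 locus analyses.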

Distribution of estimates of f (1 locus)

[Figure: histogram of frequency against the estimated value of f (grid values 0, 1, 1.4, 2, 2.8, 4, 5.6, 8, 11.2, 16), comparing the triplet method and the pair method.]

Distribution of estimates of f (5 loci)

[Figure: histogram of frequency against the estimated value of f (grid values 0, 1, 1.4, 2, 2.8, 4, 5.6, 8, 11.2, 16), comparing the triplet method and the pair method.]

Distribution of estimates of f (20 loci)

[Figure: histogram of frequency against the estimated value of f (grid values 0, 1, 1.4, 2, 2.8, 4, 5.6, 8, 11.2, 16), comparing the triplet method and the pair method.]

Distribution of estimates of f (100 loci)

[Figure: histogram of frequency against the estimated value of f (grid values 0, 1, 1.4, 2, 2.8, 4, 5.6, 8, 11.2, 16), comparing the triplet method and the pair method.]

Estimating ρ and f jointly

[Figure: probability (0 to 1) against number of loci (axis ticks at 1, 10, 100 and 1000), for the triplet method and the pair method.]

Conclusions

• For estimating gene conversion rates, the triplet composite likelihood method is slightly more accurate than the pairwise composite likelihood method.

• Neither method is very accurate on an absolute scale.

Further directions

• Modify method to handle unphased data, missing data, ascertainment bias, etc.

• Variation in recombination rates

• Confounding factors:
– Multiple hits
– Sequencing errors
– Population history
– Natural selection