Estimating recombination rates using three-site likelihoods Jeff Wall Program in Molecular and Computational Biology, USC

Post on 21-Dec-2015


Page 1

Estimating recombination rates using three-site likelihoods

Jeff Wall

Program in Molecular and Computational Biology, USC

Page 2

DNA sequence variation

Patterns of DNA sequence variation are affected by

• mutation
• recombination
• population structure
• changes in population size
• natural selection
• genetic drift

Page 5

Standard double strand break model of recombination

[Figure not reproduced. Outcomes of the model: gene conversion, and crossover (with gene conversion). Approximated as: gene conversion, and crossover. Ignore patchworks.]

Slide courtesy of M. Przeworski

Page 6

Gene conversion

• Most population genetic models ignore gene conversion. However, gene conversion has a strong effect on the levels of linkage disequilibrium between closely linked sites.

Crossing over: recombinants are produced at a rate proportional to the genetic distance between the sites.

Gene conversion: recombinants are produced at a rate that is roughly independent of the distance between the sites.

Page 7

Effect of gene conversion on patterns of linkage disequilibrium (LD)

Gene conversion leads to a steeper decay of LD at short distances.

[Figure not reproduced. Average r^2 (y-axis, 0 to 0.4) against physical distance between markers in bps (x-axis, 0 to 20000), with one curve for no gene conversion and one for gene conversion.]

Figure courtesy of M. Przeworski
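
As a rough illustration of why extra exchange between nearby sites lowers LD, the sketch below evaluates the classical Ohta and Kimura approximation (10 + rho) / (22 + 13 rho + rho^2) as a stand-in for average r^2. This approximation is brought in here as an assumption and is not taken from the slides; it ignores sample size and allele-frequency conditioning. The function name `expected_r2` is made up for illustration.

```python
def expected_r2(rho_total):
    """Ohta-Kimura approximation to LD between two sites.

    rho_total is the population-scaled rate of exchange (4 * Ne * r)
    between the two sites, summed over all sources of exchange.
    """
    if rho_total < 0:
        raise ValueError("rho_total must be non-negative")
    return (10.0 + rho_total) / (22.0 + 13.0 * rho_total + rho_total ** 2)


# The approximation decreases as rho_total grows, so any exchange added
# on top of crossing over (such as gene conversion) lowers LD at a
# given physical distance.
for rho_total in (0.0, 1.0, 5.0, 20.0):
    print(rho_total, round(expected_r2(rho_total), 3))
```

Because gene conversion adds most of its contribution within roughly a tract length, the reduction is concentrated at short distances, which is the steeper initial decay described on the slide.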

Page 9

Implications of high levels of gene conversion

• To detect natural selection (Andolfatto and Nordborg 1998; Berry and Barbadilla 2000)

• For linkage disequilibrium-based association studies

[Figure not reproduced. Panels A, B and C, each showing markers 1, 2 and 3.]

Page 10

Parameters

ρ = 4 N_e r_co, where N_e is the effective population size and r_co is the crossover rate per bp per generation

f = r_gc / r_co, where r_gc is the rate of gene conversion initiation per bp per generation

t = mean gene conversion tract length. We assume that gene conversion tract lengths follow a geometric distribution.
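
To make the roles of these three parameters concrete, here is a minimal sketch of the scaled rate of exchange between two sites d bp apart. It assumes, beyond what the slide states, that tracts start at a uniformly placed position and extend in one direction, with a geometric length of mean t. Under those assumptions a tract separates the two sites when it covers exactly one of them, which gives the term 2 ρ f t (1 - (1 - 1/t)^d). The function name `pairwise_rho` is hypothetical.

```python
def pairwise_rho(d, rho, f, t):
    """Scaled rate of exchange between two sites d bp apart.

    rho: 4 * Ne * r_co per bp; f: r_gc / r_co; t: mean tract length in bp.
    Assumes geometric tract lengths with mean t and tracts that extend
    in one direction from a uniformly placed initiation point.
    """
    if d < 0 or t < 1:
        raise ValueError("need d >= 0 and t >= 1")
    q = 1.0 - 1.0 / t                 # chance a tract extends one more bp
    crossover = rho * d               # grows linearly with distance
    conversion = 2.0 * rho * f * t * (1.0 - q ** d)  # levels off at 2*rho*f*t
    return crossover + conversion


# Values used in the simulations later in the talk: rho = 0.001, f = 4, t = 125.
for d in (1, 100, 1000, 5000):
    print(d, round(pairwise_rho(d, 0.001, 4, 125), 4))
```

The crossover term keeps growing with d, while the gene conversion term saturates once d is several tract lengths, so gene conversion matters most for closely linked sites.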

Page 11

General Approach

Ideally we would calculate the probability of the data as a function of the recombination parameters.

However, full likelihood methods (e.g., Fearnhead & Donnelly 2001) are too computationally intensive.

The composite likelihood approach calculates likelihoods for small subsets of the data, then multiplies these likelihoods over many subsets.
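
The multiplication step can be sketched as a sum of log-likelihoods over subsets of sites. This is only a skeleton: `subset_likelihood` is a hypothetical placeholder for whatever routine supplies the probability of the data at a small set of sites, and all names here are illustrative rather than taken from any published implementation.

```python
import math
from itertools import combinations


def composite_log_likelihood(sites, subset_size, subset_likelihood, params):
    """Sum log-likelihoods over all subsets of `subset_size` sites.

    `subset_likelihood(subset, params)` is a hypothetical user-supplied
    function returning the probability of the data at those sites.
    Summing logs is the same as multiplying the subset likelihoods.
    """
    total = 0.0
    for subset in combinations(sites, subset_size):
        total += math.log(subset_likelihood(subset, params))
    return total


# Toy check with a constant placeholder likelihood: 4 sites give 6 pairs.
toy = lambda subset, params: 0.5
print(composite_log_likelihood(range(4), 2, toy, None))
```

The subsets overlap and so are not independent, which is why the product is a composite likelihood rather than a true likelihood.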

Page 12

Composite likelihood (Frisse et al. 2001)

Sequence 1 a c c g a t g c g t a a g c t

Sequence 2 g t a g a t g c g t c a g c t

Sequence 3 g t a g t c g t g t c g g c c

Sequence 4 a c a g t c g t g t c g g t t

Sequence 5 a c a g t c g t g t a g g t t

Sequence 6 a c c g a c g c c c a a g c t

Sequence 7 a c c g a t g c c c a a g c t

Sequence 8 a c c g a t g c c c a a g c c

Sequence 9 a c c t a t g c g t a a g c t

Sequence 10 a c c g a t a c g t c g g t t

Sequence 11 a c a g a c g c g t c g c c t

Sequence 12 g t a g a t g c c c a a g c t
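
As a sketch of the raw input to a pairwise composite likelihood, the code below takes the twelve sequences from the slide and tabulates, for every pair of segregating sites, how many sequences carry each two-site haplotype. The function names are made up for illustration, and turning these counts into likelihoods is a separate step that is not shown.

```python
from collections import Counter
from itertools import combinations

# The alignment from the slide, one string per sequence.
ALIGNMENT = [
    "accgatgcgtaagct", "gtagatgcgtcagct", "gtagtcgtgtcggcc",
    "acagtcgtgtcggtt", "acagtcgtgtaggtt", "accgacgcccaagct",
    "accgatgcccaagct", "accgatgcccaagcc", "acctatgcgtaagct",
    "accgatacgtcggtt", "acagacgcgtcgcct", "gtagatgcccaagct",
]


def segregating_sites(alignment):
    """Indices of columns that carry exactly two alleles."""
    return [i for i in range(len(alignment[0]))
            if len({seq[i] for seq in alignment}) == 2]


def pair_configurations(alignment):
    """Map each pair of segregating sites to its two-site haplotype counts."""
    sites = segregating_sites(alignment)
    return {(i, j): Counter(seq[i] + seq[j] for seq in alignment)
            for i, j in combinations(sites, 2)}


configs = pair_configurations(ALIGNMENT)
print(len(configs))        # number of site pairs
print(configs[(0, 1)])     # haplotype counts for the first two sites
```

All 15 columns in this alignment are segregating, so there are 105 pairs. The first two sites show only two haplotypes (`ac` in nine sequences and `gt` in three), which is the kind of pattern expected when no exchange has occurred between them.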

Page 16

Composite likelihood (Wall 2004)

Sequence 1 a c c g a t g c g t a a g c t

Sequence 2 g t a g a t g c g t c a g c t

Sequence 3 g t a g t c g t g t c g g c c

Sequence 4 a c a g t c g t g t c g g t t

Sequence 5 a c a g t c g t g t a g g t t

Sequence 6 a c c g a c g c c c a a g c t

Sequence 7 a c c g a t g c c c a a g c t

Sequence 8 a c c g a t g c c c a a g c c

Sequence 9 a c c t a t g c g t a a g c t

Sequence 10 a c c g a t a c g t c g g t t

Sequence 11 a c a g a c g c g t c g c c t

Sequence 12 g t a g a t g c c c a a g c t
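
The three-site analogue can be sketched in the same way: tabulate haplotype counts over triplets of sites instead of pairs. The example uses a small 0/1-coded toy sample (hypothetical, not from the slides) and assumes every column is a segregating site. Whether the method of Wall (2004) uses every triplet or only some of them is not stated on the slide, so taking all triplets is an assumption here.

```python
from collections import Counter
from itertools import combinations


def triplet_configurations(haplotypes):
    """Three-site haplotype counts for every triplet of columns.

    Assumes every column of `haplotypes` is a segregating site.
    """
    n_sites = len(haplotypes[0])
    return {trip: Counter("".join(h[i] for i in trip) for h in haplotypes)
            for trip in combinations(range(n_sites), 3)}


toy = ["0000", "0110", "1011", "1101"]   # hypothetical 0/1-coded sample
configs = triplet_configurations(toy)
print(len(configs))          # number of triplets among 4 sites
print(configs[(0, 1, 2)])    # haplotype counts for the first three sites
```

With S segregating sites there are S(S-1)(S-2)/6 triplets against S(S-1)/2 pairs, so 15 sites give 455 triplets against 105 pairs.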

Page 20

Simulations

We ran simulations of 5 kb loci with n = 50, θ = ρ = 0.001 / bp, f = 4 and t = 125 bp.

We analyzed each locus individually as well as in groups of 5, 20 and 100 loci (assuming each locus is evolutionarily independent). For each group, we estimated f over a grid of values using the methods of Frisse et al. (2001) and Wall (2004).
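
Because the loci are treated as independent, their log composite likelihoods add, and the estimate for a group is the grid value with the largest sum. The sketch below shows only that combining step. The grid is read off the x-axis of the histograms on the following slides and may not be the exact grid used, and the per-locus values are invented toy numbers.

```python
# Grid of f values, as read from the histogram axes (an assumption).
F_GRID = [0, 1, 1.4, 2, 2.8, 4, 5.6, 8, 11.2, 16]


def estimate_f(per_locus_loglik, f_grid=F_GRID):
    """Return the grid value of f that maximizes the summed log-likelihood.

    per_locus_loglik holds one list per locus, each giving the log
    composite likelihood at every value in f_grid (hypothetical inputs).
    """
    totals = [sum(column) for column in zip(*per_locus_loglik)]
    best = max(range(len(f_grid)), key=totals.__getitem__)
    return f_grid[best]


# Toy values: locus 1 peaks at f = 5.6 and locus 2 at f = 2.8.
locus_1 = [-9, -8, -7, -6, -5, -4, -3.5, -5, -6, -7]
locus_2 = [-9, -8, -7, -6, -3.5, -4, -5, -6, -7, -8]
print(estimate_f([locus_1]), estimate_f([locus_2]), estimate_f([locus_1, locus_2]))
```

In this toy case the two single-locus estimates disagree, and the combined estimate falls between them at f = 4.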

Page 21

Distribution of estimates of f (1 locus)

[Figure not reproduced. Histogram of frequency (y-axis, 0 to 0.25) against estimated value of f (x-axis: 0, 1, 1.4, 2, 2.8, 4, 5.6, 8, 11.2, 16), for the triplet method and the pair method.]

Page 22

Distribution of estimates of f (5 loci)

[Figure not reproduced. Histogram of frequency (y-axis, 0 to 0.3) against estimated value of f (x-axis: 0, 1, 1.4, 2, 2.8, 4, 5.6, 8, 11.2, 16), for the triplet method and the pair method.]

Page 23

Distribution of estimates of f (20 loci)

[Figure not reproduced. Histogram of frequency (y-axis, 0 to 0.45) against estimated value of f (x-axis: 0, 1, 1.4, 2, 2.8, 4, 5.6, 8, 11.2, 16), for the triplet method and the pair method.]

Page 24

Distribution of estimates of f (100 loci)

[Figure not reproduced. Histogram of frequency (y-axis, 0 to 0.8) against estimated value of f (x-axis: 0, 1, 1.4, 2, 2.8, 4, 5.6, 8, 11.2, 16), for the triplet method and the pair method.]

Page 25

Estimating ρ and f jointly

[Figure not reproduced. Probability (y-axis, 0 to 1) against number of loci (x-axis ticks at 1, 10, 100, 1000), for the triplet method and the pair method.]

Page 26

Conclusions

• For estimating gene conversion rates, the triplet composite likelihood method is slightly more accurate than the pairwise composite likelihood method.

• Neither method is very accurate on an absolute scale.

Page 27

Further directions

• Modify method to handle unphased data, missing data, ascertainment bias, etc.

• Variation in recombination rates

• Confounding factors:
  – Multiple hits
  – Sequencing errors
  – Population history
  – Natural selection