jack kamm, richard durbin · 2021. 7. 20. · learning history from the sfs of low- and...

1
Learning history from the SFS of low- and high-coverage ancient DNA Jack Kamm, Richard Durbin Human Genetics Programme, Wellcome Trust Sanger Institute Department of Genetics, University of Cambridge Joint site frequency spectrum (SFS) Distn. of SNP derived allele counts x Can be used to infer demographic his- tory Computed under coalescent model x =(x 1 , x 2 , x 3 ) = (1, 1, 0) n =(n 1 , n 2 , n 3 )=(2, 2, 1) SFS = P(x) Ancient DNA Added resolution and power See further into the past Can estimate dates without mutation rates Modern pops can be modeled as mixture of an- cient pops But errors, bias distort SFS Differences in coverage Different rates of variant discovery Deamination momi: compute SFS using Moran model + Bayesian graph 1. Represent demography as DAG 2. Convert DAG to tree 1 2 3 4,7 5,6 5,7 8 ent tree T . Each internal v c 3. Compute likelihoods at each node Likelihoods and other statistics momi algorithm is a DP on the tree To compute P(x 1 , x 2 , x 3 ): set leafs to e x 1 , e x 2 , e x 3 Propagate likelihoods up tree “Tree-peeling” Also efficiently compute: Total branch length TMRCA f 2 , f 3 , f 4 statistics F ST Tajima’s D E[f (X 1 )g (X 2 )h(X 3 )] Estimating ”Basal Eurasian” gene flow Lazaridis et al 2014, 2016: f 4 (HG , Farmer ; East , Out ) < 0; Posit “Basal Eurasian” component in Europe/Middle East We estimated parameters for 8 population model with SFS Used 300 nonparametric bootstraps for confidence intervals Correcting Coverage & Ascertainment Bias Ascertain in high-coverage Random allele call in low cov- erage Normalize by subtree branch length to com- pute conditional probability P(x | x 1 , x 2 , x 3 polymorphic) Assessing model fit ABBA/BABA statistics Like qpGraph (Patterson et al., 2012), but with recent mutations Compare model expectation to observed ABBA/BABA Basal Eurasian model fits reasonably well |Z | < 3.2(p >.05 after Bonferonni) Pairwise similarity Comparing expectation to observed, Mbuti and Sardinian have excess pairwise similarity; UstIshim has excess dissimilarity Mutation rate estimation We used the expected within-population nucleotide diversity to estimate mutation rate. We estimated 1.11 to 1.26 × 10 -8 depending on which populations we included in the model. This research is supported by the Wellcome Trust (grant 206194) and the Human Frontiers Science Program (LT000402/2017). {jk21,rd}@sanger.ac.uk

Upload: others

Post on 21-Aug-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Jack Kamm, Richard Durbin · 2021. 7. 20. · Learning history from the SFS of low- and high-coverage ancient DNA Jack Kamm, Richard Durbin Human Genetics Programme, Wellcome Trust

Learning history from the SFS of low- and high-coverageancient DNA

Jack Kamm, Richard Durbin

Human Genetics Programme, Wellcome Trust Sanger InstituteDepartment of Genetics, University of Cambridge

Joint site frequency spectrum (SFS)Distn. of SNP derived allele counts xCan be used to infer demographic his-toryComputed under coalescent model

x = (x1, x2, x3) = (1,1,0)n = (n1,n2,n3) = (2,2,1)

SFS = P(x)

Ancient DNAAdded resolution and power

See further into the pastCan estimate dates without mutation ratesModern pops can be modeled as mixture of an-cient pops

But errors, bias distort SFSDifferences in coverage

Different rates of variant discoveryDeamination

momi: compute SFS using Moran model + Bayesian graph

1. Represent demography as DAG 2. Convert DAG to tree

5

µ80

µ40

µ10

µ50 µ6

0

µ20 µ3

0

µ70

(a) A demographic history with lookdown Moran model. µv0 is the number of derived alleles (blue

stars) at the bottom of population v. The observed configuration is x = (µ10, µ

20, µ

30) = (1, 1, 0). The

coalescent is in solid lines.

1

2 3

4

5 6 7

8

(b) The DAG G. Each vertex v corresponds toa population, with a path from v to w i↵ w hasancestry in v.

1

23

4,7

5,6

5,7

8

(c) The event tree T . Each internal v correspondsto a join or split event, and is labeled by a subsetof contemporaneous populations.

Figure 2: Summary of important notation and data structures. (TODO: change the pulse so thereis positive time between the split and join) (TODO: make the non-ancestral alleles haveempty filling)

Algorithm 1 defines a dynamic program (DP) over `v,0µ , ⇠v, using equations (1), (2), (3), (4), (5), (6)

to be defined shortly. For appropriate inputs the DP computes the SFS:

Theorem 1. For polymorphic x = (x1, . . . , xD) 6= 0,n and leaf population d 2 {1, . . . , D}, let exd=

(0, . . . , 1, . . . , 0) 2 R(nd+1) have 1 at coordinate xd and 0 elsewhere. Then

⇠x = DP(ex1, . . . , exD ).

We now present the formulas used by Algorithm 1, in a series of lemmas that also prove Theorem 1.We start with a formula to compute ⇠v from `v,0

µ and the partial SFS at the child events CT (v).

Lemma 1. For v 2 V (T ) and w =S

CT (v) = {w 2 V (G) | w 2 w0,w0 2 CT (v)},

⇠v = ⇠w +X

v2v\w

nvX

k=1

fvnv

(k)`v,0kev

(1)

3. Compute likelihoods at each node

Likelihoods and other statistics

momi algorithm is a DP onthe treeTo compute P(x1, x2, x3):

set leafs to ex1,ex2,ex3

Propagate likelihoods up tree“Tree-peeling”

Also efficiently compute:Total branch lengthTMRCAf2, f3, f4 statisticsFSTTajima’s DE[f (X1)g(X2)h(X3)]

Estimating ”Basal Eurasian” gene flow

MbutiLBK Sardinian

Loschbour

MA1 Han UstIshimNeanderthal

104

5 × 104

105

5 × 105

9.1%10.5%

4.8%

N1.0e+034.0e+031.6e+046.4e+04

0.000

0.025

0.050

0.075

0.100

0.125

0.150

0.175

0.200

Lazaridis et al 2014, 2016: f4(HG,Farmer ;East ,Out) < 0;Posit “Basal Eurasian” component in Europe/Middle East

We estimated parameters for 8 population model with SFSUsed 300 nonparametric bootstraps for confidence intervals

Correcting Coverage & Ascertainment Bias

Ascertain in high-coverageRandom allele call in low cov-erageNormalize by subtreebranch length to com-pute conditional probabilityP(x | x1, x2, x3 polymorphic)

Assessing model fitABBA/BABA statistics

Like qpGraph (Patterson et al., 2012), but with recent mutationsCompare model expectation to observed ABBA/BABA

Basal Eurasian model fits reasonably well|Z | < 3.2 (p > .05 after Bonferonni)

Pairwise similarityComparing expectation to observed, Mbuti and Sardinian have excesspairwise similarity; UstIshim has excess dissimilarity

Mutation rate estimationWe used the expected within-population nucleotide diversity to estimate mutation rate. We estimated 1.11 to 1.26× 10−8 dependingon which populations we included in the model.

This research is supported by the Wellcome Trust (grant 206194) and the HumanFrontiers Science Program (LT000402/2017). {jk21,rd}@sanger.ac.uk