bayesian estimators of time to most recent common ancestry

Bayesian Estimators of Time to Most Recent Common Ancestry

Ecology and Evolutionary Biology

Adjunct Appointments: Molecular and Cellular Biology, Plant Sciences, Epidemiology & Biostatistics, Animal Sciences

Bruce [email protected]

Definitions

MRCA - Most Recent Common Ancestor

TMRCA - Time to Most Recent Common Ancestor

Question: Given molecular marker information from a pair of individuals, what is the estimated time back to their most recent common ancestor?

With even a small number of highly polymorphic autosomal markers, it is trivial to assess a TMRCA of zero (same subject/biological sample) or one (parent-offspring) generation

Problems with Autosomal Markers

Often we are very interested in MRCAs that are modest (5-10 generations) or large (100s to 10,000s of generations)

Unlinked autosomal markers simply don't work over these time scales.

Reason: IBD probabilities for individuals sharing a MRCA 5 or more generations ago are extremely small and hence very hard to estimate (need a VERY large number of markers).

MRCA-I vs. MRCA-G

We need to distinguish between the MRCA for a pair of individuals (MRCA-I) and the MRCA for a particular genetic marker G (MRCA-G).

MRCA-G varies between any two individuals over recombination units.

For example, we could easily have for a pair of relatives
MRCA (mtDNA) = 180 generations
MRCA (Y) = 350 generations
MRCA (one -globulin allele) = 90 generations
MRCA (other -globulin allele) = 400 generations

MRCA-G > MRCA-I

[Figure: pedigree sketch marking the MRCA-I of the two individuals, a lost lineage, and the MRCA-G for each of two markers]

mtDNA and Y Chromosomes

So how can we accurately estimate TMRCA for a modest to large number of generations?

Answer: Use a set of completely linked markers

With autosomes, unlinked markers assort each generation, leaving only a small amount of IBD information on each marker, which we must then multiply together. IBD information decays on the order of 1/2 each generation.

With completely linked marker loci, information on IBD does not assort away via recombination. IBD information decay is on the order of the mutation rate.

Y chromosome microsatellite mutation rates - I

Estimate of u      Source                  Reference
0.0028             Y chromosome            Kayser et al. 2000
0.0021             Y chromosome            Heyer et al. 1997
0.001 - 0.0021     Autosomal chromosomes   Wong & Weber 1993; Brinkmann 1998

Estimates of the human mutation rate in microsatellites are fairly consistent over both the Y and the autosomes

Basic Structure of the Problem

What is the probability that the two marker alleles at a haploid locus from two related individuals agree, given that their MRCA was t generations ago?

Phrased another way, what is their probability of identity by state (IBS), given they are identical by descent (IBD), when their TMRCA is t generations?

Infinite Alleles Model

The first step in answering this question is to assume a particular mutational model. Our (initial) assumption will be the infinite alleles model (IAM)

The key assumption of this model (originally due to Kimura and Crow, 1964) is that each new mutation gives rise to a new allele.

The IAM was the first population-genetics model to attempt to formally incorporate the structure of DNA into a model

Key: Under the infinite alleles model, two alleles that are identical in state and also IBD have not experienced any mutations since their MRCA.

Let q(t) = Probability two alleles with a MRCA t generations ago are identical in state

If u = per generation mutation rate, then

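$$q(t) = (1-u)^{2t}$$

[Figure: the MRCA with two descendant lineages, A and B, each t generations long; each lineage is free of mutation with probability $(1-u)^t$]

Pr(No mutation from MRCA->A) = $(1-u)^t$

Pr(No mutation from MRCA->B) = $(1-u)^t$

$$q(t) = (1-u)^{2t} \approx e^{-2ut} = e^{-\tau}, \qquad \tau = 2ut$$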
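The quality of the exponential approximation can be checked numerically. The following is a minimal sketch; the function names `q_iam` and `q_iam_approx` are illustrative, and u = 0.002 is simply the typical microsatellite rate quoted in these slides.

```python
import math

def q_iam(t, u):
    """Exact IAM match probability: no mutation on either lineage over t generations."""
    return (1.0 - u) ** (2 * t)

def q_iam_approx(t, u):
    """Exponential approximation exp(-tau), with tau = 2*u*t."""
    return math.exp(-2.0 * u * t)

if __name__ == "__main__":
    u = 0.002  # assumed per-generation mutation rate
    for t in (10, 100, 500):
        print(t, q_iam(t, u), q_iam_approx(t, u))
```

For small u the two expressions agree closely, which is why the remaining formulas can be written in terms of $\tau = 2ut$.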

Building the Likelihood Function for n Loci

For any single marker locus, the probability of IBS given a TMRCA of t generations is $q(t) = e^{-\tau}$, with $\tau = 2ut$.

The probability that k of n marker loci are IBS is just

$$\Pr(k) = \frac{n!}{(n-k)!\,k!}\, q(t)^k\, [1-q(t)]^{n-k}$$

a binomial distribution with success parameter q(t). This gives the likelihood function for t given k of n matches:

$$L(t \mid n, k) = \frac{n!}{(n-k)!\,k!}\, e^{-k\tau} \left(1 - e^{-\tau}\right)^{n-k}$$

ML Analysis of TMRCA

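It would seem that we now have all the pieces in hand for a likelihood analysis of TMRCA given the marker data (k of n matches)

Likelihood function ($\tau = 2ut$):

$$L(t \mid n, k) = \frac{n!}{(n-k)!\,k!}\, e^{-k\tau} \left(1 - e^{-\tau}\right)^{n-k}$$

MLE for t is the solution of $\partial L/\partial t = 0$. With p = k/n the fraction of matches,

$$\hat{\tau} = 2\hat{t}u = \ln\left(\frac{n}{k}\right) = \ln\left(\frac{1}{p}\right) = -\ln(p)$$

In particular, the MLE for t becomes

$$\hat{t} = \frac{1}{2u}\ln\left(\frac{n}{k}\right) = -\frac{1}{2u}\ln(p)$$

Likewise, the precision of this estimator follows from the (negative) inverse of the 2nd derivative of the log-likelihood function evaluated at the MLE,

$$\mathrm{Var}(\hat{t}) = -\left(\left.\frac{\partial^2 \ln L(t \mid n, k)}{\partial t^2}\right|_{t=\hat{t}}\right)^{-1} = \frac{1}{4u^2}\,\frac{1}{n}\left(\frac{1-p}{p}\right)$$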
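The MLE and its large-sample variance are simple enough to evaluate directly. A minimal sketch follows; the function name `tmrca_mle` and the example values (n = 20, k = 18, u = 0.002) are illustrative assumptions, not taken from the slides.

```python
import math

def tmrca_mle(n, k, u):
    """MLE of TMRCA under the IAM and its asymptotic variance.

    n: number of markers, k: number of matches (1 <= k <= n),
    u: per-generation mutation rate.
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n (the MLE is infinite when k = 0)")
    p = k / n                                  # fraction of matching markers
    t_hat = math.log(1.0 / p) / (2.0 * u)      # t_hat = -ln(p) / (2u)
    var = (1.0 - p) / (4.0 * u ** 2 * n * p)   # (1/(4u^2)) (1/n) ((1-p)/p)
    return t_hat, var

if __name__ == "__main__":
    t_hat, var = tmrca_mle(n=20, k=18, u=0.002)
    print(t_hat, math.sqrt(var))
```

Note that both quantities are exactly zero when k = n, whatever the value of n.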

Likewise, we can easily (numerically) find one-LOD support intervals for t and hence construct approximate 95% confidence intervals for TMRCA

Finally, hypothesis testing, say Ho: TMRCA = t0, is easily accomplished by computing -2 times the natural log of the ratio of the likelihood at t = t0 to the likelihood at the MLE, $t = \hat{t}$

The resulting likelihood ratio statistic LR is (asymptotically) chi-square distributed with one degree of freedom

Trouble in Paradise

The ML machinery seems to have done its job, giving us an estimate, its approximate sampling error, approximate confidence intervals, and a scheme for hypothesis testing.

Hence, all seems well.

Problem: Look at k = n (= complete match at all markers).
MLE (TMRCA) = 0 (independent of n)
Var(MLE) = 0 (ouch!)

With n = k, the value of the likelihood function is

$$L(t) = (1-u)^{2tn} \approx e^{-2tun}$$

What about one-LOD support intervals (95% CI)?

L has a maximum value of one at the MLE (t = 0)

Hence, any value of t that gives a likelihood value of 0.1 or larger is in the one-LOD support interval

Solving, the one-LOD support interval is from t = 0 to

$$t = \frac{1}{2n}\left[\frac{-\ln(10)}{\ln(1-u)}\right] \approx \frac{1}{n}\left[\frac{\ln(10)}{2u}\right]$$

For u = 0.002, the CI is (0, 575/n)
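The upper end of the one-LOD interval for a complete match can be computed from either form of the expression. This is a small sketch; the function name `one_lod_upper` is an illustrative choice.

```python
import math

def one_lod_upper(n, u):
    """Upper end of the one-LOD support interval when all n markers match.

    Returns (exact, approx): the exact solution of (1-u)^(2tn) = 0.1
    and the approximation ln(10) / (2 u n).
    """
    exact = -math.log(10.0) / (2.0 * n * math.log(1.0 - u))
    approx = math.log(10.0) / (2.0 * n * u)
    return exact, approx

if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, one_lod_upper(n, 0.002))
```

With u = 0.002 the approximate form is ln(10)/0.004, about 575.6, divided by n.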

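With n = k, the likelihood function reduces to $L(t) = (1-u)^{2tn} \approx e^{-2tun}$

[Figure: L(t) against t for n = 5, 10, and 20 (plots for u = 0.002). MLE(t) = 0 for all values of n. A horizontal line marks 0.1 of the maximum value (1) of the likelihood function; the curves cross it at the one-LOD limits t ≈ 115 (n = 5), t ≈ 58 (n = 10), and t ≈ 29 (n = 20).]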

What about Hypothesis testing?

Again recall that for k = n the likelihood at t = t0 is $L(t_0) \approx \exp(-2t_0un)$

Hence, the LR test statistic for Ho: t = t0 is just

$$LR = -2\ln\left[\frac{L(t_0)}{L(0)}\right] = -2\ln\left[\frac{\exp(-2t_0un)}{1}\right] = 4t_0un$$

Thus the probability for the test that TMRCA = t0 is just $\Pr(\chi^2_1 > 4t_0un)$
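The tail probability of a chi-square variable with one degree of freedom has a closed form in terms of the complementary error function, $\Pr(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$, so the test can be sketched with the standard library alone. The function name and the example values below are illustrative assumptions.

```python
import math

def lr_test_all_match(t0, n, u):
    """Likelihood-ratio test of Ho: TMRCA = t0 when all n markers match.

    Returns (LR, p_value) using the asymptotic chi-square(1) reference
    distribution, Pr(chi2_1 > x) = erfc(sqrt(x / 2)).
    """
    lr = 4.0 * t0 * u * n
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value

if __name__ == "__main__":
    # Hypothetical example: 10 markers, u = 0.002, Ho: t0 = 48 generations
    print(lr_test_all_match(48, 10, 0.002))
```

This only evaluates the asymptotic recipe; it says nothing about whether the chi-square approximation is trustworthy when the MLE sits at t = 0.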

The problem(s) with ML

The expressions developed for the sampling variance, approximate confidence intervals, and hypothesis testing are all large-sample approximations

Problem 1: Here our sample size is the number of markers scored in the two individuals. Not likely to be large.

Problem 2: These expressions are obtained by taking appropriate limits of the likelihood function. If the MLE is exactly at the boundary of the admissible space on the likelihood surface, this limit may not formally exist, and hence the above approximations are incorrect.

The Solution? Go Bayesian

An extension of likelihood is Bayesian statistics

$$p(\theta \mid x) = C \cdot \ell(x \mid \theta)\, p(\theta)$$

Instead of simply computing a point estimate (e.g., the MLE), the goal is to estimate the entire distribution for the unknown parameter $\theta$ given the data x

$p(\theta \mid x)$: posterior distribution of $\theta$ given x
$\ell(x \mid \theta)$: likelihood function for $\theta$ given the data x
$p(\theta)$: prior distribution for $\theta$
C: the appropriate constant so that the posterior integrates to one.

Why Bayesian?
• Exact for any sample size
• Marginal posteriors
• Efficient use of any prior information
• MCMC (such as Gibbs sampling) methods

The Prior on TMRCA

The first step in any Bayesian analysis is the choice of an appropriate prior distribution p(t) -- our thoughts on the distribution of TMRCA in the absence of any of the marker data

Standard approach: Use a flat or uninformative prior, with p(t) = a constant over the admissible range of the parameter. This can cause problems if the likelihood function does not have a finite integral (integrates to infinity)

In our case, population-genetic theory provides the prior: under very general settings, the time to MRCA for a pair of individuals follows a geometric distribution

In particular, for a haploid gene, TMRCA follows a geometric distribution with parameter 1/Ne (mean ≈ Ne generations).

Hence, our prior is just

$$p(t) = \lambda(1-\lambda)^t \approx \lambda e^{-\lambda t}, \qquad \lambda = 1/N_e$$

Hence, we can use an exponential prior with hyperparameter (the parameter fully characterizing the distribution) $\lambda = 1/N_e$.

The posterior thus becomes

$$p(t \mid k) \propto L(t \mid n, k)\, p(t) \propto \exp[-(2uk+\lambda)t]\,\left(1-\exp[-2ut]\right)^{n-k}$$

Here $\exp[-2ukt]\,(1-\exp[-2ut])^{n-k}$ is the previous likelihood function (ignoring constants that cancel when we compute the normalizing factor C) and $\exp[-\lambda t]$ is the prior, with prior hyperparameter $\lambda = 1/N_e$.

The Normalizing constant

$$p(t \mid k) = \frac{\exp[-(2uk+\lambda)t]\,\left(1-\exp[-2ut]\right)^{n-k}}{I(u,k,n,\lambda)}$$

where

$$I(u,k,n,\lambda) = \int_0^\infty \exp[-(2uk+\lambda)t]\,\left(1-\exp[-2ut]\right)^{n-k}\,dt$$

I ensures that the posterior distribution integrates to one, and hence is formally a probability distribution

What is the effect of the hyperparameter?

$$p(t \mid k) = \frac{\exp[-(2uk+\lambda)t]\,\left(1-\exp[-2ut]\right)^{n-k}}{I(u,k,n,\lambda)}$$

If $2uk \gg \lambda$, then there is essentially no dependence on the actual value of $\lambda$ chosen.

Hence, if $2N_euk \gg 1$, there is essentially no dependence on the (hyperparameter) assumptions of the prior.

For a typical microsatellite rate of u = 0.002, this is just Ne·k >> 250, which is a very weak assumption. For example, with k = 10 matches, Ne >> 25. Even with only 1 match (k = 1), we just require Ne >> 250.

Closed-form Solutions for the Posterior Distribution

Complete analytic solutions for the posterior can be obtained by using a series expansion (of the $(1-e^{x})^n$ term) to give

$$\begin{aligned}
\exp[-(2uk+\lambda)t]\,\left(1-\exp[-2ut]\right)^{n-k}
&= \exp[-(2uk+\lambda)t]\left(\sum_{i=0}^{n-k}(-1)^i\frac{(n-k)!}{i!\,(n-k-i)!}\exp[-2uti]\right) \\
&= \sum_{i=0}^{n-k}(-1)^i\frac{(n-k)!}{i!\,(n-k-i)!}\exp[-(2u(k+i)+\lambda)t]
\end{aligned}$$

Each term is just $a\,e^{bt}$, which is easily integrated

$$\begin{aligned}
I(u,k,n,\lambda)
&= \sum_{i=0}^{n-k}(-1)^i\frac{(n-k)!}{i!\,(n-k-i)!}\int_0^\infty \exp[-(2u(k+i)+\lambda)t]\,dt \\
&= \sum_{i=0}^{n-k}(-1)^i\frac{(n-k)!}{i!\,(n-k-i)!}\,\frac{1}{2u(k+i)+\lambda} \\
&= \frac{2^{n-k}\,(n-k)!\,u^{n-k}}{\prod_{i=0}^{n-k}[\lambda+2u(n-i)]}
\end{aligned}$$

With the assumption of a flat prior, $\lambda = 0$, this reduces to

$$I(u,k,n,0) = \frac{(n-k)!\,(k-1)!}{(2u)\,n!}$$
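The series and product forms of the normalizing constant can be checked against each other, and the same expansion gives the posterior mean, since each term integrates as $\int_0^\infty t\,e^{-at}\,dt = 1/a^2$. The sketch below assumes the IAM posterior with an exponential prior of rate `lam`; all function names are illustrative.

```python
import math

def _series_terms(u, k, n, lam):
    """Coefficient and decay rate of each exponential term in the expansion."""
    m = n - k
    return [((-1) ** i * math.comb(m, i), 2.0 * u * (k + i) + lam)
            for i in range(m + 1)]

def norm_const_series(u, k, n, lam):
    """I(u, k, n, lam) as the alternating sum of 1 / (2u(k+i) + lam)."""
    return sum(c / a for c, a in _series_terms(u, k, n, lam))

def norm_const_product(u, k, n, lam):
    """I(u, k, n, lam) in product form; needs k >= 1 or lam > 0."""
    m = n - k
    denom = math.prod(lam + 2.0 * u * (n - i) for i in range(m + 1))
    return (2.0 * u) ** m * math.factorial(m) / denom

def posterior_mean(u, k, n, lam):
    """Posterior mean of t: sum of c_i / a_i^2 divided by sum of c_i / a_i."""
    terms = _series_terms(u, k, n, lam)
    return sum(c / a ** 2 for c, a in terms) / sum(c / a for c, a in terms)

if __name__ == "__main__":
    print(norm_const_series(0.002, 4, 6, 1 / 5000))
    print(norm_const_product(0.002, 4, 6, 1 / 5000))
    print(posterior_mean(0.002, 10, 10, 0.0))
```

The alternating sum loses precision through cancellation when n - k is large, so the product form is the safer of the two for computing I. For a complete match with a flat prior, `posterior_mean(0.002, 10, 10, 0.0)` returns 25 generations, the value 1/(2nu).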

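Hence, the complete analytic solution of the posterior is

$$p(t \mid k, \lambda) = \left(\frac{\prod_{i=0}^{n-k}[\lambda+2u(n-i)]}{2^{n-k}\,(n-k)!\,u^{n-k}}\right)\frac{\left(1-\exp[-2ut]\right)^{n-k}}{\exp[t(2uk+\lambda)]}$$

Suppose k = n (no mismatches). In this case, the posterior is simply an exponential distribution with rate $2un + \lambda$ (mean $1/(2un+\lambda)$),

$$p(t \mid k=n) = (\lambda+2nu)\exp[-(2un+\lambda)t]$$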

Analysis of n = k case

Mean TMRCA and its standard deviation:

$$\mu_t = \sigma_t = \frac{1}{\lambda+2nu} \simeq \frac{1}{2nu}$$

Cumulative probability:

$$\Pr(t \le T) = \int_0^T p(t \mid k=n)\,dt = 1-\exp(-(2un+\lambda)T)$$

In particular, the time $T_\alpha$ satisfying $P(t < T_\alpha) = \alpha$ is

$$T_\alpha = \frac{-\ln(1-\alpha)}{2un+\lambda}$$

For a flat prior ($\lambda = 0$), the upper limit of the 95% (one-sided) confidence interval is thus given by -ln(0.05)/(2nu) ≈ 1.50/(nu)

Hence, under a Bayesian analysis for u = 0.002, the 95% upper confidence limit is ≈ 749/n

Recall that the one-LOD support interval (approximate 95% CI) under an ML analysis is ≈ 575/n

The ML solution's asymptotic approximation significantly underestimates the true interval relative to the exact analysis under a full Bayesian model
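Because the posterior for a complete match is exponential, its quantiles follow directly from $T_\alpha = -\ln(1-\alpha)/(2un+\lambda)$. A minimal sketch, with an illustrative function name:

```python
import math

def upper_limit_all_match(n, u, lam=0.0, alpha=0.95):
    """Time T with posterior Pr(t < T) = alpha when all n markers match.

    lam is the prior rate 1/Ne; lam = 0 corresponds to a flat prior.
    """
    return -math.log(1.0 - alpha) / (2.0 * u * n + lam)

if __name__ == "__main__":
    for n in (10, 20):
        bayes = upper_limit_all_match(n, 0.002)
        one_lod = math.log(10.0) / (2.0 * 0.002 * n)  # ML one-LOD limit
        print(n, bayes, one_lod)
```

The ratio of the two limits is ln(20)/ln(10), about 1.3, regardless of n and u.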

Why the difference?

Under ML, we plot the likelihood function and look for the 0.1 value

Under a Bayesian analysis, we look at the posterior probability distribution (likelihood adjusted to integrate to one) and find the values that give an area of 0.95

[Figure: posterior density and cumulative probability Pr(TMRCA < t) against t. For n = 20, the area to the left of t = 38 is 0.95 (t0.95 = 38); for n = 10, the area to the left of t = 75 is 0.95 (t0.95 = 75).]

Sample Posteriors for u = 0.002

[Figures: posterior density p(t | k) against time t to MRCA, one curve per number of matches k. Panels: n = 10 (k = 5 to 10), n = 20 (k = 15 to 20), n = 50 (k = 45 to 50), n = 100 (k = 95 to 100), and n = 100 (k = 89 to 94).]

Key points

• By using the appropriate number of markers we can get accurate estimates for TMRCA for even just a few generations. 20-50 markers will do.

• By using markers on a non-recombining chromosomal section, we can estimate TMRCA over much, much longer time scales than with unlinked autosomal markers

• Hence, we have a fairly large window of resolution for TMRCA when using a modest number of completely linked markers.

Extensions I: Different Mutation Rates

Let marker locus k have mutation rate $u_k$.

Code the observations as $x_k = 1$ if a match, otherwise code $x_k = 0$

The posterior becomes:

$$p(t \mid x) \propto \exp\left[-t\left(\lambda + 2\sum_{k=1}^{n} u_k x_k\right)\right]\prod_{k=1}^{n}\left[1-e^{-2tu_k}\right]^{1-x_k}$$
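With locus-specific rates the posterior, proportional to $\exp[-t(\lambda + 2\sum_k u_k x_k)]\prod_k [1-e^{-2tu_k}]^{1-x_k}$, is easy to evaluate on a grid. The sketch below integrates it with a simple midpoint rule; the function name, grid settings, and example rates are assumptions made for illustration.

```python
import math

def posterior_mean_varying_rates(rates, matches, lam, t_max=5000.0, steps=50000):
    """Posterior mean of t when locus i has mutation rate rates[i].

    matches[i] is 1 for a match and 0 for a mismatch; lam = 1/Ne.
    Uses midpoint-rule integration on (0, t_max), so t_max must be
    large enough that the posterior tail beyond it is negligible.
    """
    matched_rate = sum(u for u, x in zip(rates, matches) if x == 1)
    mismatched = [u for u, x in zip(rates, matches) if x == 0]
    dt = t_max / steps
    total = 0.0
    weighted = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt  # midpoints avoid log(0) at t = 0
        log_p = -t * (lam + 2.0 * matched_rate)
        for u in mismatched:
            log_p += math.log(-math.expm1(-2.0 * t * u))  # log(1 - e^(-2tu))
        p = math.exp(log_p)
        total += p
        weighted += t * p
    return weighted / total

if __name__ == "__main__":
    rates = [0.001, 0.002, 0.002, 0.003, 0.004]  # hypothetical per-locus rates
    print(posterior_mean_varying_rates(rates, [1, 1, 0, 1, 1], lam=1 / 5000))
```

When every locus matches and all rates are equal, the result reduces to the exponential-posterior mean 1/(2un + λ), which is a convenient sanity check.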

Stepwise Mutation Model (SMM)

The infinite alleles model (IAM) is not especially realistic with microsatellite data, unless the fraction of matches is very high.

Microsatellite allelic variants are scored by their number of repeat units. Hence, two “matching” alleles can actually hide multiple mutations (and hence more time to the MRCA)

[Figure: two alleles with the same repeat count after two mutations (Mutation 1, Mutation 2). Under IAM, this is scored as a match, and hence no mutations. In reality, there are two mutations.]

Y chromosome microsatellite mutation rates - II

The SMM model is an attempt to correct for multiple hits by fully accounting for the mutational structure.

Good fit to array sizes in natural populations when assuming the symmetric single-step model
• Equal probability of a one-step move up or down

In direct studies of (Y chromosome) microsatellites, 35 of 37 detected mutations in pedigrees were single step; the other 2 were two-step

SMM0 model -- match/no match under SMM

The simplest implementation of the SMM model is to simply replace the match probabilities q(t) under the IAM model with those for the SMM model.

This simply codes the marker loci as match / no match. We refer to this as the SMM0 model

Formally, the SMM model assumes the following transition probabilities

$$\Pr(X(t+1)=i-1 \mid X(t)=i) = \Pr(X(t+1)=i+1 \mid X(t)=i) = \frac{u}{2}$$

$$\Pr(X(t+1)=i \mid X(t)=i) = 1-u$$

$$\Pr(|X(t+1)-X(t)| \ge 2 \mid X(t)=i) = 0$$

Note that two alleles can match only if they have experienced an even number of mutations in total between them. In such cases, the match probabilities become

$$\Pr(\text{match} \mid 2M \text{ moves}) = \frac{1}{2^{2M}}\binom{2M}{M} = \frac{1}{2^{2M}}\,\frac{(2M)!}{(M!)^2}$$

Number of mutations   Prob(Match)
2                     0.500
4                     0.375
6                     0.313
8                     0.273
10                    0.246
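The match probability given 2M single-step moves, $\binom{2M}{M}/2^{2M}$, can be evaluated with exact integer binomials. A minimal sketch with an illustrative function name:

```python
from math import comb

def match_prob_given_moves(M):
    """Pr(match | 2M single-step moves): M of the 2M moves must go up
    and M down, each direction having probability 1/2."""
    return comb(2 * M, M) / 4 ** M

if __name__ == "__main__":
    for M in range(1, 6):
        print(2 * M, match_prob_given_moves(M))
```

The probability falls off slowly with the number of mutations, roughly as $1/\sqrt{\pi M}$, which is why hidden mutations matter under the stepwise model.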

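$$\begin{aligned}
q(t) &= \sum_{M=0}^{\infty}\Pr(\text{match} \mid 2M \text{ moves})\,\Pr(2M \text{ moves} \mid t) \\
&= \sum_{M=0}^{\infty}\left(\frac{1}{2^{2M}}\,\frac{(2M)!}{(M!)^2}\right)\left(\frac{(2ut)^{2M}}{(2M)!}\right)\exp(-2tu) \\
&= \exp(-2tu)\left(\sum_{M=0}^{\infty}\frac{(ut)^{2M}}{(M!)^2}\right)
\end{aligned}$$

The sum is the zero-order modified Type I Bessel function, since

$$\sum_{k=0}^{\infty}\frac{x^{2k}}{(k!)^2} = I_0(2x)$$

Hence, with $\tau = 2ut$,

$$q(\tau) = \exp(-\tau)\,I_0(\tau)$$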
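The stepwise match probability $q(\tau) = e^{-\tau} I_0(\tau)$ can be compared with the infinite-alleles value $e^{-\tau}$ by summing the Bessel power series directly. This is a sketch using only the standard library; the function names are illustrative, and the plain series is only meant for small to moderate $\tau$.

```python
import math

def bessel_i(j, x, tol=1e-16, max_terms=500):
    """Modified Bessel function of the first kind, I_j(x), from its power
    series sum_m (x/2)^(2m+j) / (m! (m+j)!). Suitable for moderate x >= 0."""
    half = x / 2.0
    term = half ** j / math.factorial(j)
    total = term
    for m in range(1, max_terms):
        term *= half * half / (m * (m + j))  # ratio of consecutive terms
        total += term
        if term <= tol * total:
            break
    return total

def q_smm(tau):
    """Match probability under the symmetric single-step model."""
    return math.exp(-tau) * bessel_i(0, tau)

def q_iam(tau):
    """Match probability under the infinite alleles model."""
    return math.exp(-tau)

if __name__ == "__main__":
    for tau in (0.1, 0.5, 1.0, 5.0):
        print(tau, q_iam(tau), q_smm(tau))
```

If SciPy is available, `scipy.special.ive(0, tau)` computes the exponentially scaled function $e^{-\tau}I_0(\tau)$ directly and avoids overflow at large $\tau$.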

[Figures: match probability q(t) against τ = 2ut under the infinite-alleles and stepwise models, shown for τ from 0 to 1 and for τ from 0 to 20.]

Under the SMM model, the prior hyperparameter can now become important.

This is the case when the number n of markers is small and/or k/n is not very close to one

Why? Under the prior, TMRCA is forced by a geometric with parameter 1/Ne. Under the IAM model, for most values this is still much more time than the likelihood function predicts from the marker data

Under the SMM model, the likelihood alone predicts a much longer value, so that the forcing effect of the initial geometric can come into play

[Figure: cumulative probability Pr(TMRCA < t) against time t for n = 5, k = 3, u = 0.02. Curves: IAM (flat prior and Ne = 5000 coincide); SMM0, Ne = 5000; SMM0, flat prior; prior with Ne = 5000.]

An Exact Treatment: SMME

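With a little work we can show that the probability two sites differ by j steps is just

$$q^{(j)}(\tau) = 2\exp(-\tau)\,I_j(\tau) \quad \text{for } j \ge 1$$

where $I_j$ is the jth-order modified Type I Bessel function.

The resulting likelihood thus becomes

$$L(t \mid n_0, \cdots, n_k) = \frac{n!}{n_0!\,n_1!\cdots n_k!}\prod_{j=0}^{k}\left[q^{(j)}(2ut)\right]^{n_j}$$

where $n_j$ is the number of sites that differ by j (observed) steps, so that $n_0$ is the number of exact matches and $n_k$ the number of k-step differences

With this likelihood, the resulting posterior becomes

$$p(t \mid n_0, \cdots, n_k) \propto \prod_{j=0}^{k}\left[q^{(j)}(2ut)\right]^{n_j} e^{-\lambda t}$$

This rearranges to give the general posterior under the Exact SMM model (SMME) as

$$p(t \mid n_0, \cdots, n_k) = \frac{e^{-(\lambda+2un)t}\prod_{j=0}^{k}\left[I_j(2ut)\right]^{n_j}}{\int_0^\infty e^{-(\lambda+2un)t}\prod_{j=0}^{k}\left[I_j(2ut)\right]^{n_j}\,dt}$$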

Example

Consider comparing haplotypes 1 and 3 from Thomas et al.'s (2000) study on the Lemba and Cohen Y chromosome modal haplotypes. Here six markers were used: four exactly match, one differs by one repeat, and the other by two repeats

Hence, n = 6, k = 4 for IAM and SMM0 models

n0 = 4, n1 = 1, n2 = 1, n = 6 under SMME model

Assume Hammer’s value of Ne=5000 for the prior

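[Figure: posterior density P(t | markers) against time to MRCA, t, for the IAM, SMM0, and SMME models]

TMRCA for Lemba and Cohen Y

Model used   Mean    Median   2.5%   97.5%
IAM          152.3   135.4    31.1   371
SMM0         454.7   233.7    40.4   2389
SMME         422.3   286.2    65.1   1631

[Figure: cumulative probability Pr(TMRCA < t) against time to MRCA, t, for the IAM, SMM0, and SMME models]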
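The SMME posterior, proportional to $e^{-(\lambda+2un)t}\prod_j [I_j(2ut)]^{n_j}$, has no simple closed form, but it can be evaluated on a grid. The sketch below does this for counts $n_0 = 4$, $n_1 = 1$, $n_2 = 1$ with Ne = 5000. The mutation rate u = 0.002 is an assumption (the example does not state the value used), as are the function names and grid settings, so the output is not expected to reproduce the tabulated values exactly.

```python
import math

def bessel_i(j, x, tol=1e-16, max_terms=1000):
    """Modified Bessel function I_j(x) from its power series (moderate x)."""
    half = x / 2.0
    term = half ** j / math.factorial(j)
    total = term
    for m in range(1, max_terms):
        term *= half * half / (m * (m + j))
        total += term
        if term <= tol * total:
            break
    return total

def smme_posterior(counts, u, lam, t_max=10000.0, dt=2.0):
    """Grid approximation to the SMME posterior.

    counts[j] is the number of loci differing by j repeat units.
    Returns (grid, density), normalized by the midpoint rule on (0, t_max).
    """
    n = sum(counts)
    steps = int(t_max / dt)
    grid, log_post = [], []
    for i in range(steps):
        t = (i + 0.5) * dt
        lp = -(lam + 2.0 * u * n) * t
        for j, nj in enumerate(counts):
            if nj:
                lp += nj * math.log(bessel_i(j, 2.0 * u * t))
        grid.append(t)
        log_post.append(lp)
    peak = max(log_post)  # subtract the peak before exponentiating
    weights = [math.exp(lp - peak) for lp in log_post]
    norm = sum(weights) * dt
    return grid, [w / norm for w in weights]

def posterior_mean(grid, density, dt=2.0):
    """Mean of a gridded density by the midpoint rule."""
    return sum(t * d for t, d in zip(grid, density)) * dt

if __name__ == "__main__":
    grid, dens = smme_posterior([4, 1, 1], u=0.002, lam=1 / 5000)  # u assumed
    print(posterior_mean(grid, dens))
```

The grid is truncated at `t_max`, which is adequate here because the prior and the polynomial decay of the Bessel terms leave very little posterior mass beyond 10,000 generations.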

Technology Transfer

Family Tree DNA (ftDNA) -- provides Y chromosome marker kits for genealogical studies

So far, ftDNA has processed over 10,000 such kits

This amounts to a rough gross of around 3 million dollars.

The expressions developed above have direct commercial applications

Forensic applications of the Y

• A not uncommon situation is that the only DNA is from fingernail scrapings.
• The result is a mixture wherein the victim's DNA often overwhelms the DNA of the perpetrator.
• Result: only modest match probability, as many autosomal markers cannot be detected
• One solution: use Y chromosome markers. Easily amplified over (female) background

Problem: How do we combine Y match with autosomal match?

NRC 1996 recommendations (autosomal loci):
• Product rule across markers
• Population substructure correction within markers

Prob(Y match)*Prob(autosomal match)

Problem: Y markers may provide information about population substructure membership. For example, a particular haplotype may be restricted to a certain subpopulation, e.g., Native Americans

Correcting for Y substructure

Let y denote the observed Y haplotype and A the multilocus autosomal marker genotype

P(y,A) = P(A | y)*P(y)

Simple approach: knowledge of y indicates membership in a particular subpopulation, with P(A) computed using allele frequencies for that subpopulation.

Suggestion: Multiply freq(y) * max Freq(A over subgroups)

A more precise accounting

Suppose two individuals share the same y haplotype. What is their average coancestry, θ?

Balding and Nichols give expressions for autosomal single-locus genotype frequencies given that the population shows structure with coancestry θ.

Second approach: Compute θ from haplotype matching, and use this value in the Balding-Nichols expressions to compute (single-locus) autosomal frequencies.

Posterior distribution for a match at all n markers with a prior of $\lambda = 1/N_e$:

$$P(t \mid t \ge k) = \frac{(1-u)^{2nt}\,(1-\lambda)^{t-1}}{\sum_{\tau=k}^{\infty}(1-u)^{2n\tau}\,(1-\lambda)^{\tau-1}}$$

With a MRCA of t generations, $\theta = (1/2)^{2t+1}$, so

$$E[\theta \mid t \ge k] = \sum_{t=k}^{\infty}\left(\frac{1}{2}\right)^{2t+1} P(t \mid t \ge k)$$

Typical situation is where we can exclude father-son and paternal half-sibs, t ≥ 2:

$$E[\theta \mid t \ge 2] \simeq \frac{\lambda+2nu}{24-8(\lambda+2nu)}$$

Typical values, n = 11, u = 1/500

• For λ = 1/5000, E[θ] = 0.00186
• For λ = 1/500, E[θ] = 0.00194
• For λ = 1/50, E[θ] = 0.00272
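The listed values can be reproduced by evaluating the approximation $E[\theta \mid t \ge 2] \simeq (\lambda+2nu)/(24-8(\lambda+2nu))$ with n = 11 and u = 1/500. This sketch only evaluates that stated approximation, not the exact sum; the function name is illustrative.

```python
def expected_theta_approx(n, u, lam):
    """Approximate expected coancestry for a full Y match with t >= 2,
    using (lam + 2nu) / (24 - 8 (lam + 2nu)); lam = 1/Ne."""
    a = lam + 2.0 * n * u
    return a / (24.0 - 8.0 * a)

if __name__ == "__main__":
    for ne in (5000, 500, 50):
        print(ne, expected_theta_approx(11, 1 / 500, 1 / ne))
```

The result changes little between Ne = 5000 and Ne = 500, because $2nu$ dominates $\lambda$ for these values.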

For these values, unless $p_i$ < 0.01, the Balding-Nichols expressions are essentially Hardy-Weinberg (HW).
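To see how close the corrected frequencies are to Hardy-Weinberg at small θ, the sketch below uses the standard Balding-Nichols conditional genotype probabilities as they are usually written (homozygote and heterozygote forms). These formulas are not reproduced in the slides, so treat them, the function names, and the example allele frequencies as assumptions of this illustration.

```python
def bn_homozygote(p, theta):
    """Balding-Nichols conditional probability of genotype A_i A_i."""
    num = (2.0 * theta + (1.0 - theta) * p) * (3.0 * theta + (1.0 - theta) * p)
    return num / ((1.0 + theta) * (1.0 + 2.0 * theta))

def bn_heterozygote(p_i, p_j, theta):
    """Balding-Nichols conditional probability of genotype A_i A_j."""
    num = 2.0 * (theta + (1.0 - theta) * p_i) * (theta + (1.0 - theta) * p_j)
    return num / ((1.0 + theta) * (1.0 + 2.0 * theta))

if __name__ == "__main__":
    theta = 0.002
    for p in (0.2, 0.1, 0.01):
        # Ratio of the corrected homozygote probability to the HW value p^2
        print(p, bn_homozygote(p, theta) / p ** 2)
```

With θ = 0.002 the homozygote ratio stays within about 10% of one for p = 0.1 but is more than double for p = 0.01, consistent with the caveat about rare alleles.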

Formal procedure

Estimate P(y) from a database (counting methods, Bayesian estimators)

Compute multilocus autosomal frequencies for each major ethnic group using the product of the single-locus genotype frequencies, computed using group-specific allele frequencies and a θ = 0.002 correction.

Conservative P(y,A) = P(y)*max P(A)

Thanks for assistance from

Jay Taylor
Mike Hammer
Matt Kaplan
