Population assignment likelihoods in a phylogenetic and demographic model. Jody Hey, Rutgers University

Upload: lynette-dorsey

Post on 01-Jan-2016


TRANSCRIPT

Page 1: Population assignment likelihoods in a phylogenetic and demographic model. Jody Hey Rutgers University

Population assignment likelihoods in a phylogenetic and demographic model.

Jody Hey

Rutgers University

Page 2:

DNA Barcoding is great!

• But it is useful to keep in mind that species taxa are provisional: they are hypotheses to be revised with more data

• Taxa are tools, not truth

• Mitochondrial-based DNA barcodes

– Can be misleading due to chance factors (different genes have different histories)

– Can be misleading due to deterministic factors (mitochondria are a large target for natural selection)

Page 3:

A general problem…

You have some genetic data

For example, a gene sequenced multiple times

Or a microsatellite locus genotyped in a number of individuals

Suppose you are willing to assume that positive or balancing selection has not played a big role in the history of the data

What could you figure out about the history of the organisms from which the genes came?

Page 4:

A General Parameterization for questions on population demography, population divergence, speciation, population identification etc.

X  genetic data (e.g. aligned sequences, microsatellites)
   – may (or may not) come with population labels
   – may (or may not) be given as diploid genotypes
   – may include multiple loci for each sampled organism
P  population phylogeny
T  splitting times, i.e. the times of branch points in the phylogeny P
Θ  Demography: population size and migration rate parameters
I  Population labels: assignment of genes to populations (which genes came from which populations or species)
G  Genealogy: the gene tree for the data

G is a necessary 'nuisance' parameter: it provides a mathematical connection between X and (P, T, Θ and I).

It is possible to calculate the probability of G as a function of P, T, Θ and I, p(G | P, T, Θ, I), using coalescent models.

It is possible to calculate the probability of a data set given G, p(X | G), using mutation models.
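The coalescent calculation of p(G | P, T, Θ, I) can be illustrated for the simplest possible case. The following is a minimal sketch, assuming a single constant-size population (so there is no phylogeny P, no splitting time T, no migration and no labels I), with time measured in generations and a diploid effective size `n_e`. The function name and its arguments are illustrative and do not come from the talk.

```python
import math

def log_coalescent_density(interval_lengths, n_e):
    """Log density of a genealogy's coalescence times under the
    standard (Kingman) coalescent for ONE constant-size population.

    interval_lengths[i] is the time (in generations) during which
    there were k = n - i lineages, where n = len(interval_lengths) + 1.
    n_e is the diploid effective population size (an assumed parameter).
    """
    n = len(interval_lengths) + 1
    log_density = 0.0
    for i, t in enumerate(interval_lengths):
        k = n - i                      # lineages present in this interval
        pairs = k * (k - 1) / 2        # number of pairs that could coalesce
        pair_rate = 1.0 / (2.0 * n_e)  # coalescence rate of one specific pair
        # One specific pair coalesces at the end of the interval,
        # and no pair coalesces before that.
        log_density += math.log(pair_rate) - pairs * pair_rate * t
    return log_density
```

In the full model the same kind of product runs over intervals within each population of the phylogeny, with additional terms for migration events; that extension is not shown here.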

Page 5:

Connecting Data to the General Model: Parts 1-3
For unlabeled data, without information on the number of populations, or on which populations were sampled

1  Unlabeled data, for example:

Sequence1  ACgTACgACgCACgAAT
Sequence2  ACgTACgACgCACgAAT
Sequence3  ACCTTCgACgTACgAGT
Sequence4  ACgTTCgACgTACgAAT
Sequence5  ACCTTCgACgTACgAAT
Sequence6  ACgTTCgACgTATgAAT

2  Specify a random G with topology and branch lengths

[Figure: genealogy with tips in the order 5 6 4 3 1 2]

3  With a mutation model, and a value of G, we can calculate the probability of the data given G: p(X|G)
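The calculation of the probability of the data given a genealogy, p(X|G), can be sketched with Felsenstein's pruning algorithm. This is a minimal illustration, assuming the Jukes-Cantor mutation model (the talk does not say which mutation model is used), branch lengths in expected substitutions per site, and lowercase `g` read as `G`. The tree encoding and the two-tip example genealogy are hypothetical.

```python
import math

BASES = "ACGT"

def jc69_prob(a, b, t):
    """Jukes-Cantor probability that base a becomes base b along a
    branch of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def partials(node, site, seqs):
    """Conditional likelihoods of the data below `node`, for each base.

    A node is either a tip name (str) or a tuple
    ((left_child, left_length), (right_child, right_length)).
    """
    if isinstance(node, str):
        observed = seqs[node][site].upper()  # assumption: 'g' means 'G'
        return {x: 1.0 if x == observed else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, length in node:
        below = partials(child, site, seqs)
        for x in BASES:
            result[x] *= sum(jc69_prob(x, y, length) * below[y] for y in BASES)
    return result

def log_likelihood(tree, seqs):
    """log p(X | G): sum over sites of the log site likelihood,
    with equal base frequencies at the root."""
    n_sites = len(next(iter(seqs.values())))
    total = 0.0
    for site in range(n_sites):
        root = partials(tree, site, seqs)
        total += math.log(sum(0.25 * root[x] for x in BASES))
    return total

# Hypothetical two-tip genealogy using two of the sequences above.
seqs = {"Sequence1": "ACgTACgACgCACgAAT", "Sequence3": "ACCTTCgACgTACgAGT"}
tree = (("Sequence1", 0.1), ("Sequence3", 0.1))
print(log_likelihood(tree, seqs))
```

Larger genealogies are written by nesting tuples, for example `((("Sequence1", 0.05), ("Sequence2", 0.05)), 0.1)` as one child of a deeper node.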

Page 6:

Connecting Data to the General Model: Parts 4&5

4  Specify a random phylogeny P with multiple populations and with splitting times T, for example:

[Figure: population phylogeny. Sampled populations Pop 1, Pop 2 and Pop 3 with sizes N_1, N_2 and N_3; ancestral population Pop (2,3) with size N_(2,3); root population Pop (2,3),1 with size N_(2,3),1; splitting times T_1 and T_2 marked on the time axis.]

5  With a phylogeny that depicts populations in time, we can also pick random values for population sizes and migration rates: Θ = {N_1, N_2, ..., m_1>2, m_2>1, ...}

Page 7:

Connecting Data to the General Model: Parts 6-8

6  Overlay the genealogy on the phylogeny

[Figure: genealogy with tips in the order 5 6 4 3 1 2 drawn inside the population phylogeny]

7  Add implied migration events and other random migration events to the phylogeny

8  Identify I, the data labels representing the populations containing the data

[Figure: tips 5 6 4 3 1 2 placed in Pop 3, Pop 2 and Pop 1]

Page 8:

Calculating the likelihood of P, T, Θ, and I, given the data

$$L(P, T, \Theta, I \mid X) = p(X \mid P, T, \Theta, I) = \int p(X \mid G)\, p(G \mid P, T, \Theta, I)\, dG$$

• If we can solve this then we can obtain maximum likelihood estimates of P, T, I and Θ

• We know how to calculate p(X|G) and p(G|P,T,I,Θ)
– The math is not the hard part

• The greatest challenge is finding efficient ways to sample the space of genealogies and the space of P, T, Θ, and I
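The likelihood as an integral over genealogies can be made concrete with a toy case in which the answer is known in closed form. This is a sketch under strong assumptions that are not from the talk: two sequences, one constant-size population, infinite-sites mutation, and a single parameter θ = 4Nμ standing in for (P, T, Θ, I). Here G is just the coalescence time τ in mutational units, and the estimate uses naive Monte Carlo sampling from p(G|θ), not whatever sampler the full method relies on.

```python
import math
import random

def poisson_pmf(k, mean):
    """Probability of k events for a Poisson variable with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def exact_likelihood(theta, k):
    """Closed form for two sequences: P(k differences | theta)."""
    return theta ** k / (1.0 + theta) ** (k + 1)

def monte_carlo_likelihood(theta, k, n_samples=50000, seed=1):
    """Estimate the integral of p(X|G) p(G|theta) dG by drawing
    genealogies from p(G|theta) and averaging p(X|G)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # p(G|theta): coalescence time of the pair, in mutational units.
        tau = rng.expovariate(2.0 / theta)
        # p(X|G): k mutations on two branches of length tau each.
        total += poisson_pmf(k, 2.0 * tau)
    return total / n_samples

print(monte_carlo_likelihood(1.0, 2), exact_likelihood(1.0, 2))
```

With more sequences and more parameters the space of genealogies is far too large for this kind of direct sampling, which is the sampling challenge named in the last bullet.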

Page 9:

Genetic Data and different types of data labels

Often Population Labels are known (come with data)

Aligned DNA Sequences   Population Labels
ACgTACgACgCACgAAT       A
ACgTACgACgCACgAAT       A
ACCTTCgACgTACgAGT       B
ACgTTCgACgTACgAAT       B
ACCTTCgACgTACgAAT       C
ACgTTCgACgTATgAAT       C

Population labels are already known and do not need to be estimated. Parameter I (population labels) is not included in the model.

Page 10:

Case 1: Data has no labeling at all

Aligned DNA Sequences   Population Labels
ACgTACgACgCACgAAT       ?
ACgTACgACgCACgAAT       ?
ACCTTCgACgTACgAGT       ?
ACgTTCgACgTACgAAT       ?
ACCTTCgACgTACgAAT       ?
ACgTTCgACgTATgAAT       ?

Page 11:

Case 2: No population labels, but data comes in diploid genotype pairs

Aligned DNA Sequences   Population Labels   Genotype Pairs
ACgTACgACgCACgAAT       ?                   Individual #1
ACgTACgACgCACgAAT       ?                   Individual #1
ACCTTCgACgTACgAGT       ?                   Individual #2
ACgTTCgACgTACgAAT       ?                   Individual #2
ACCTTCgACgTACgAAT       ?                   Individual #3
ACgTTCgACgTATgAAT       ?                   Individual #3

Gene copies are identified in genotype pairs only.

Parameter I (Population labels) is unknown (?) and needs to be estimated.

Page 12:

Two kinds of data sets without population labels

1. Alleles or gene copies provided without any additional information on populations
- e.g. locus may be haploid
- or for whatever reason, data not collected in a way that yields diploid genotypes

2. Alleles or sequences provided in diploid (genotype) pairs
This is a common situation for population assignment

Page 13:

Case 1: Alleles or gene copies come without any additional information on populations

• The only available information on population labels (parameter I) and all other parameters (P, T, Θ) is in the actual variation in the data

• This is a lot to ask of a single-locus data set.

• With multiple loci, it can be possible to estimate P, T, Θ, and I

• Can include information from a database on the same locus (loci), i.e. DNA barcoding

Page 14:

Case 2: Data comes in diploid (genotype) pairs

• Such data contains two types of information for population identification:
– Patterns of variation (as in case 1)
– Knowledge that both gene copies from a single individual must come from the same population (assume no hybrids)

• This problem (identifying populations based on diploid genotypes) is traditionally called population assignment

Page 15:

Population Assignment based on diploid genotype data

• Many methods exist for population assignment, using allelic data, based on an assumption of Hardy-Weinberg equilibrium within populations

• These methods do not otherwise incorporate phylogenetics or population genetics (no P, T, or Θ)

• Have to overcome difficulty of not knowing the underlying allele frequencies
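The Hardy-Weinberg approach to assignment can be sketched as follows. This is a minimal illustration, not any specific published method: it assumes the allele frequencies in each candidate population are known (exactly the difficulty named in the last bullet), that loci are independent, and that every observed allele has a nonzero frequency. The populations, alleles and frequencies are hypothetical.

```python
import math

def genotype_log_prob(genotype, freqs):
    """Log probability of one diploid single-locus genotype under
    Hardy-Weinberg proportions: p^2 for homozygotes, 2pq for heterozygotes."""
    a, b = genotype
    p, q = freqs[a], freqs[b]
    return math.log(p * p) if a == b else math.log(2.0 * p * q)

def assign(individual, pop_freqs):
    """Return the population with the highest multilocus log-likelihood.

    individual: list of genotypes, one (allele, allele) pair per locus.
    pop_freqs: {population: [{allele: frequency} for each locus]}.
    """
    scores = {
        pop: sum(genotype_log_prob(g, f) for g, f in zip(individual, loci))
        for pop, loci in pop_freqs.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical allele frequencies, treated as known for illustration.
pop_freqs = {
    "A": [{"x": 0.9, "y": 0.1}, {"u": 0.8, "v": 0.2}],
    "B": [{"x": 0.2, "y": 0.8}, {"u": 0.3, "v": 0.7}],
}
best, scores = assign([("x", "x"), ("u", "v")], pop_freqs)
print(best, scores)
```

Nothing in this calculation uses a phylogeny, splitting times or demography, which is the contrast drawn in the second bullet.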

Page 16:

Considering the probability of a particular genotype configuration, D

1  ACgTACgACgCACgAAT
2  ACgTACgACgCACgAAT
3  ACCTTCgACgTACgAGT
4  ACgTTCgACgTACgAAT
5  ACCTTCgACgTACgAAT
6  ACgTTCgACgTATgAAT

6 Sequences, 3 Genotype pairs

The actual configuration D that comes with the data is one of many possible configurations.
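To see how many configurations are possible, the ways of grouping six gene copies into three unordered genotype pairs can be enumerated directly. This is a small sketch; the function name is illustrative, and treating the observed configuration as the consecutive pairing (1,2), (3,4), (5,6) is an assumption about the slide's layout.

```python
def pairings(items):
    """Yield every way of splitting `items` into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail

configs = list(pairings([1, 2, 3, 4, 5, 6]))
print(len(configs))   # 15
print(configs[0])     # [(1, 2), (3, 4), (5, 6)]
```

In general 2n gene copies can be paired in (2n)! / (2^n n!) ways, so the number of configurations grows quickly with sample size.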

Page 17:

Calculating the probability of a particular genotype configuration, D

• Assume that genes come together and form zygotes at random with respect to their time of common ancestry
– This is a genealogical version of the assumption of random mating that is usually made with respect to segregating alleles (e.g. in Hardy-Weinberg)

• Assume that both gene copies within an individual are in the same population

Page 18:

Given a genealogy, G,

Some genotype configurations are more probable than others under an assumption of random union of gametes