e quilibria in populations cse280vineet bafna. 0100 0011 1000 population data recall that we often...

49
EQUILIBRIA IN POPULATIONS CSE280 Vineet Bafna

Upload: caroline-wade

Post on 01-Jan-2016

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet Bafna

EQUILIBRIA IN POPULATIONS

CSE280

Page 2: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

0100001100110011100010001000

Population data

• Recall that we often study a population in the form of a SNP matrix– Rows correspond to individuals (or individual chromosomes), columns

correspond to SNPs– The matrix is binary (why?)– The underlying genealogy is hidden. If the span is large, the genealogy is

not a tree any more. Why?

Page 3: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

Basic Principles

• In a ‘stable’ population, the distribution of alleles obeys certain laws– Not really, and the deviations are interesting

• HW Equilibrium – (due to mixing in a population)

• Linkage (dis)-equilibrium– Due to recombination

Page 4: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

Hardy Weinberg equilibrium

• Assume a population of diploid individuals;• Consider a locus with 2 alleles, A, a• Define p (respectively, q): the frequency of A (resp. a)

in the population• 3 Genotypes: AA, Aa, aa• Q: What is the frequency of each genotypes

If various assumptions are satisfied, (such as random mating, no natural selection), Then• PAA=p2

• PAa=2pq• Paa=q2

Page 5: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

Hardy Weinberg: why?

• Assumptions:– Diploid– Sexual reproduction– Random mating– Bi-allelic sites– Large population size, …

• Why? Each individual randomly picks his two parental chromosomes. Therefore, Prob. (Aa) = pq+qp = 2pq, and so on.

0100001100110011100010001000

01001000

Page 6: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

Hardy Weinberg: Generalizations

• Multiple alleles with frequencies– By HW,

• Multiple loci?

Page 7: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

Hardy Weinberg: Implications

• The allele frequency does not change from generation to generation. True or false?

• It is observed that 1 in 10,000 caucasians have the disease phenylketonuria. The disease mutation(s) are all recessive. What fraction of the population carries the mutation?

• Males are 100 times more likely to have the ‘red’ type of color blindness than females. Why?

• Individuals homozygous for S have the sickle-cell disease. In an experiment, the ratios A/A:A/S: S/S were 9365:2993:29. Is HWE violated? Is there a reason for this violation?

• A group of individuals was chosen in NYC. Would you be surprised if HWE was violated?

• Conclusion: While the HW assumptions are rarely satisfied, the principle is still important as a baseline assumption, and significant deviations are interesting.

Page 8: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet Bafna

A modern example of HW application• SNP-chips can give us the allelic

value at each polymorphic site based on hybridization.

• Typically, the sites sampled have common SNPs– Minor allele frequency > 10%

• We simply plot the 3 genotypes at each locus on 3 separate horizontal lines.

• What is peculiar in the picture?• What is your conclusion?

CSE280

Page 9: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Linkage Equilibrium

• HW equilibrium is the equilibrium in allele frequencies at a single locus.

• Linkage equilibrium refers to the equilibrium between allele occurrences at two loci.– LE is due to recombination

events– First, we will consider the case

where no recombination occurs.

HWE

Page 10: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Perfect phylogeny

Page 11: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

What if there were no recombinations?

• Life would be simpler• Each individual sequence would have a single parent

(even for higher ploidy)• True for mtDNA (maternal) , and y-chromosome

(paternal)• The genealogical relationship is expressed as a tree.

This principle can be used to track ancestry of an individual

Page 12: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

0100001100110011100010001000

Phylogeny

• Recall that we often study a population in the form of a SNP matrix– Rows correspond to individuals (or individual chromosomes), columns

correspond to SNPs– The matrix is binary (why?)– The underlying genealogy is hidden. If the span is large, the genealogy is

not a tree any more. Why?

Page 13: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

Reconstructing Perfect Phylogeny

• Input: SNP matrix M (n rows, m columns/sites/mutations)

• Output: a tree with the following properties– Rows correspond to leaf nodes– We add mutations to edges– each edge labeled with i splits

the individuals into two subsets. • Individuals with a 1 in column i• Individuals with a 0 in column i

i

1 in position i0 in position i

Page 14: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

1234567801000000001101100011010000110000100000001000100010001001

Reconstructing perfect phylogeny

• Each mutation can be labeled by the column number

• Goal is to reconstruct the phylogeny• genographic atlas

1

3 4

2

85

67

Page 15: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows
Page 16: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

An algorithm for constructing a perfect phylogeny

• We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This will be fixed later.

• In any tree, each node (except the root) has a single parent.– It is sufficient to construct a parent for every node.

• In each step, we add a column and refine some of the nodes containing multiple children.

• Stop if all columns have been considered.

Page 17: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

Columns

• Define i0: taxa (individuals) with a 0 at the i’th column

• Define i1: taxa (individuals) with a 1 at the i’th column

Page 18: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

Inclusion Property

• For any pair of columns i,j, one of the folowing holds– i1 j1

– j1 i1

– i1 j1 = • For any pair of columns i,j

– i < j if and only if i1 j1 • Note that if i<j then the edge

containing i is an ancestor of the edge containing j

i

j

Page 19: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

Perfect Phylogeny via Example

1 2 3 4 5A 1 1 0 0 0B 0 0 1 0 0C 1 1 0 1 0D 0 0 1 0 1E 1 0 0 0 0

r

A B C D E

Initially, there is a single clade r, and each node has r as its parent

Page 20: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

Sort columns

• Sort columns according to the inclusion property (note that the columns are already sorted here).

• This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order

1 2 3 4 5A 1 1 0 0 0B 0 0 1 0 0C 1 1 0 1 0D 0 0 1 0 1E 1 0 0 0 0

Page 21: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

Add first column• In adding column i– Check each individual

and decide which side you belong.

– Finally add a node if you can resolve a clade

r

A BC DE

1 2 3 4 5A 1 1 0 0 0B 0 0 1 0 0C 1 1 0 1 0D 0 0 1 0 1E 1 0 0 0 0

u

Page 22: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

Adding other columns• Add other

columns on edges using the ordering property r

E B

C

D

A

1 2 3 4 5A 1 1 0 0 0B 0 0 1 0 0C 1 1 0 1 0D 0 0 1 0 1E 1 0 0 0 0

1

2

4

3

5

Page 23: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

Unrooted case

• Important point is that the perfect phylogeny condition does not change when you interchange 1s and 0s at a column.

• Alg (Unrooted)– Switch the values in each column, so that 0 is the majority

element. – Apply the algorithm for the rooted case.– Relabel columns and individuals.

• Show that this is a correct algorithm.

Page 24: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Unrooted perfect phylogeny

• We transform matrix M to a 0-major matrix M0.

• if M0 has a directed perfect phylogeny, M has a perfect phylogeny.

• If M has a perfect phylogeny, does M0 have a directed perfect phylogeny?

Page 25: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet Bafna

Unrooted case

• Theorem: If M has a perfect phylogeny, there exists a relabeling, and a perfect phylogeny s.t.– Root is all 0s– For any SNP (column), #1s <= #0s– All edges are mutated 01

CSE280

Page 26: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet Bafna

Proof• Consider the perfect phylogeny of M.• Find the center: none of the clades has

greater than n/2 nodes.– Is this always possible?

• Root at one of the 3 edges of the center, and direct all mutations from 01 away from the root. QED

• If the theorem is correct, then simply relabeling all columns so that the majority element is 0 is sufficient.

CSE280

Page 27: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet Bafna

‘Homework’ Problems

• What if there is missing data? (An entry that can be 0 or 1)?

• What if there are recurrent mutations?

CSE280

Page 28: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet Bafna

Linkage Disequilibrium

CSE280

Page 29: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

Quiz• Recall that a SNP data-set is a ‘binary’ matrix.– Rows are individual (chromosomes)– Columns are alleles at a specific locus

• Suppose you have 2 SNP datasets of a contiguous genomic region but no other information – One from an African population, and one from a

European Population.– Can you tell which is which?– How long does the genomic region have to be?

Page 30: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

Linkage (Dis)-equilibrium (LD)

• Consider sites A &B• Case 1: No recombination• Each new individual

chromosome chooses a parent from the existing ‘haplotype’

A B0 10 10 00 01 01 01 01 0

1 0

Page 31: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

Linkage (Dis)-equilibrium (LD)

• Consider sites A &B• Case 2: diploidy and

recombination• Each new individual chooses a

parent from the existing alleles

A B0 10 10 00 01 01 01 01 0

1 1

Page 32: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

Linkage (Dis)-equilibrium (LD)

• Consider sites A &B• Case 1: No recombination• Each new individual chooses a parent

from the existing ‘haplotype’– Pr[A,B=0,1] = 0.25

• Linkage disequilibrium• Case 2: Extensive recombination• Each new individual simply chooses and

allele from either site– Pr[A,B=(0,1)]=0.125

• Linkage equilibrium

A B0 10 10 00 01 01 01 01 0

Page 33: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

LD

• In the absence of recombination, – Correlation between columns– The joint probability Pr[A=a,B=b] is different from

P(a)P(b)• With extensive recombination– Pr(a,b)=P(a)P(b)

Page 34: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

Measures of LD

• Consider two bi-allelic sites with alleles marked with 0 and 1

• Define– P00 = Pr[Allele 0 in locus 1, and 0 in locus 2]

– P0* = Pr[Allele 0 in locus 1]

• Linkage equilibrium if P00 = P0* P*0

• D = (P00 - P0* P*0) = -(P01 - P0* P*1) = …

Page 35: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

Other measures of LD• D’ is obtained by dividing D by the largest possible

value– Suppose D = (P00 - P0* P*0) >0.

– Then the maximum value of Dmax= min{P0* P*1, P1* P*0}

– If D<0, then maximum value is max{-P0* P*0, -P1* P*1}

– D’ = D/ Dmax

Site 10

Site 2

1

10

D -D

-D D

Page 36: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

Other measures of LD• D’ is obtained by dividing D by the largest possible value

– Ex: D’ = abs(P11- P1* P*1)/ Dmax

• = D/(P1* P0* P*1 P*0)1/2

• Let N be the number of individuals• Show that 2N is the 2 statistic between the two sites

Site 10

Site 2

1

10

P00N P0*N

Page 37: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Digression: The χ2 test0 1

1

0 O1

O3 O4

O2

• The statistic behaves like a χ2 distribution (sum of squares of normal variables).

• A p-value can be computed directly

Page 38: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Observed and expected

Site 10

1

P00N

10 10

P01N

P10N P11N

P0*P*0N

P1*P*0N P1*P*1N

P0*P*1N

• = D/(P1* P0* P*1 P*0)1/2

• Verify that 2N is the 2 statistic between the two sites

Page 39: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

LD over time and distance

• The number of recombination events between two sites, can be assumed to be Poisson distributed.

• Let r denote the recombination rate between two adjacent sites

• r = # crossovers per bp per generation• The recombination rate between two sites l

apart is rl

Page 40: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

LD over time• Decay in LD– Let D(t) = LD at time t between two sites– r’=lr– P(t)

00 = (1-r’) P(t-1)00 + r’ P(t-1)

0* P(t-1)*0

– D(t) = P(t)00 - P(t)

0* P(t)*0 = P(t)

00 - P(t-1)0* P(t-1)

*0 (Why?)– D(t) =(1-r’) D(t-1) =(1-r’)t D(0)

Page 41: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

LD over distance

• Assumption– Recombination rate increases linearly with

distance and time– LD decays exponentially.

• The assumption is reasonable, but recombination rates vary from region to region, adding to complexity

• This simple fact is the basis of disease association mapping.

Page 42: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

LD and disease mapping

• Consider a mutation that is causal for a disease. • The goal of disease gene mapping is to discover which

gene (locus) carries the mutation.• Consider every polymorphism, and check:

– There might be too many polymorphisms – Multiple mutations (even at a single locus) that lead to the same

disease

• Instead, consider a dense sample of polymorphisms that span the genome

Page 43: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

LD can be used to map disease genes

• LD decays with distance from the disease allele.• By plotting LD, one can short list the region containing the

disease gene.

011001

DNNDDN

LD

Page 44: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

• 269 individuals– 90 Yorubans– 90 Europeans (CEPH)– 44 Japanese– 45 Chinese

• ~1M SNPs

Page 45: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet Bafna

Haplotype blocks

• It was found that recombination rates vary across the genome– How can the recombination rate be measured?

• In regions with low recombination, you expect to see long haplotypes that are conserved. Why?

• Typically, haplotype blocks do not span recombination hot-spots

CSE280

Page 46: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet Bafna

Long haplotypes• Chr 2 region with high r2 value (implies little/no recombination)• History/Genealogy can be explained by a tree ( a perfect phylogeny)• Large haplotypes with high frequency

CSE280

Page 47: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet Bafna

19q13

CSE280

Page 48: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet Bafna

LD variation across populations• LD is maintained upto

60kb in swedish population, 6kb in Yoruban population

CSE280

Reich et al.Nature 411, 199-204(10 May 2001)

Page 49: E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna. 0100 0011 1000 Population data Recall that we often study a population in the form of a SNP matrix – Rows

Vineet BafnaCSE280

Understanding Population Data• We described various population genetic concepts (HW,

LD), and their applicability• The values of these parameters depend critically upon the

population assumptions.– What if we do not have infinite populations– No random mating (Ex: geographic isolation)– Sudden growth– Bottlenecks– Ad-mixture

• It would be nice to have a simulation of such a population to test various ideas. How would you do this simulation?