
Upload: agatha-barnett

Post on 18-Dec-2015


Fine Scale Mapping and the Coalescent

• The Fundamental Problem

• The Data

• Genotype to Phenotype Functions

• Types of Mapping

• Population Set-up & Measures of Dependency

• The Calculations

• Practical Considerations

Genotype and Phenotype Covariation: Gene Mapping

Time

Result: The Mapping Function

Reich et al. (2001)

Decay of local dependency

A set of characters.

Binary decision (0,1).

Quantitative Character.

Dominant/Recessive.

Penetrance

Spurious Occurrence

Heterogeneity

Genotype --> Phenotype Function

Sampling Genotypes and Phenotypes

Pedigree Analysis:

Pedigree known

Few meioses (max 100s)

Resolution: cMorgans (Mbases)

Association Mapping:

Pedigree unknown

Many meioses (>10^4)

Resolution: 10^-5 Morgans (Kbases)

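[Figure: genealogies spanning 2N generations, each showing a disease locus D and a marker M separated by recombination distance r]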

Adapted from McVean and others

Pedigree Analysis & Association Mapping

[Figure: disease locus D and marker M, shown at time t ago and now]

Creates LD        Breaks down LD
Drift             Recombination
Selection         Gene conversion
Admixture

Causes of linkage disequilibrium

Disease locus and marker locus

Test for independence in a 2 × 2 contingency table:

         A          a          Total
B        X_{A,B}    X_{a,B}    X_{.,B}
b        X_{A,b}    X_{a,b}    X_{.,b}
Total    X_{A,.}    X_{a,.}    X_{.,.}

Significance of a Single Association
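A minimal sketch of the independence test for a single 2 × 2 table, assuming the Pearson chi-square statistic with one degree of freedom (the slide does not say which test statistic was used). The function name and the example counts are hypothetical.

```python
import math

def chi_square_2x2(x_AB, x_aB, x_Ab, x_ab):
    """Pearson chi-square test of independence for a 2 x 2 table of counts."""
    n = x_AB + x_aB + x_Ab + x_ab
    row_B, row_b = x_AB + x_aB, x_Ab + x_ab      # marginals X_{.,B}, X_{.,b}
    col_A, col_a = x_AB + x_Ab, x_aB + x_ab      # marginals X_{A,.}, X_{a,.}
    chi2 = n * (x_AB * x_ab - x_aB * x_Ab) ** 2 / (row_B * row_b * col_A * col_a)
    # Upper tail of a chi-square with 1 degree of freedom: P(Z^2 > c) = erfc(sqrt(c / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts, for illustration only.
chi2, p = chi_square_2x2(10, 20, 30, 40)
print(f"chi2 = {chi2:.4f}, p = {p:.3f}")
```

With small expected counts an exact test would be the more cautious choice; the chi-square approximation is used here only because it is short.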

Measuring Linkage Disequilibrium between 2 Loci with 2 Alleles

Remade from McVean

$$D_{A,B} = f_{A,B} - f_A f_B = -D_{a,B} = -D_{A,b} = D_{a,b}$$

Correlation coefficient measure [0,1], Hill & Robertson (1968):

$$r^2 = \rho^2 = \frac{D_{AB}^2}{f_A f_a f_B f_b}$$

Range constrained by allele frequencies [0,1], Lewontin (1964):

$$D' = \begin{cases} \dfrac{D_{AB}}{\min(f_A f_b,\; f_a f_B)} & \text{if } D_{AB} > 0 \\[2ex] \dfrac{-D_{AB}}{\min(f_A f_B,\; f_a f_b)} & \text{else} \end{cases}$$

Odds-ratio formulation, Devlin & Risch (1995): δ [formula not recoverable from the transcript]
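As an illustration of the pairwise measures, the sketch below computes D, r² and D' from one haplotype frequency and two allele frequencies. It assumes the standard definitions of Hill & Robertson's r² and Lewontin's D'; the function name and the example frequencies are hypothetical.

```python
def ld_measures(f_AB, f_A, f_B):
    """Return (D, r2, D_prime) for two biallelic loci.

    f_AB is the frequency of the A-B haplotype; f_A and f_B are allele frequencies.
    Both loci are assumed polymorphic (0 < f_A < 1 and 0 < f_B < 1).
    """
    f_a, f_b = 1.0 - f_A, 1.0 - f_B
    d = f_AB - f_A * f_B
    r2 = d * d / (f_A * f_a * f_B * f_b)
    if d > 0:
        d_prime = d / min(f_A * f_b, f_a * f_B)
    else:
        d_prime = -d / min(f_A * f_B, f_a * f_b)
    return d, r2, d_prime

# Hypothetical frequencies: complete association, independence, negative association.
for f_AB in (0.5, 0.25, 0.1):
    print(f_AB, ld_measures(f_AB, 0.5, 0.5))
```

Both r² and D' are unsigned here, matching the [0,1] ranges quoted on the slide; the sign of the association is carried by D.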

Disease locus

Marker loci

Combine Single (Pairwise) to Multiple Tests

Bonferroni

Sharper bounds using linkage information.
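A minimal sketch of the Bonferroni correction for combining single-marker tests. It treats the tests as if they were independent, which is conservative when markers are in LD (the "sharper bounds" mentioned above are not implemented). The p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the per-test threshold, adjusted p-values and significance flags."""
    m = len(p_values)
    threshold = alpha / m                              # each test is run at level alpha / m
    adjusted = [min(1.0, p * m) for p in p_values]     # equivalent adjusted p-values
    significant = [p <= threshold for p in p_values]
    return threshold, adjusted, significant

# Hypothetical single-marker p-values.
threshold, adjusted, significant = bonferroni([0.001, 0.02, 0.04, 0.3])
print(threshold, adjusted, significant)
```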

Examples of Associations: Pairwise, Triple,...

Martin et al 2000

6 markers with low association

Causative SNP

ApoE and Alzheimers Syndrome

Adapted from Hudson 1990

[Figure panels: Recombination; Gene Conversion]

The coalescent with recombination or gene conversion

[Figure: local trees along a sequence for samples 1 to 4. Gene conversion: Tree 1, Tree 2, Tree 1. Recombination: Tree 1, Tree 2, Tree 3.]

Local trees for recombination and gene conversion

[Figure: a target tree on samples 1 to 5 compared with local trees at other positions. Panels: same tree as target (region with no recombination); same topology as target; same MRCA as target.]

Measures of tree similarity

Local trees of the target and other positions

Only recombination, r=2. Also gene conversion g/ρ

Sample size = 20

From Mikkel Schierup

Recombination/gene conversion rate

                                   R=2, G=0    R=2, G=8
#segments with same tree           1.02        1.8
  P(target segment not largest)    0.2%        14%
#segments same topology            1.02        2.1
  P(target segment not largest)    0.3%        20%
#segments same TMRCA               1.1         2.9
  P(target segment not largest)    1.5%        25%

Probability that the largest segment does not include the target

From Mikkel Schierup

A and B are the most distant markers in significant LD with the target.

[Figure: markers A and B on either side of the target]

           G=0    G=16
Rho=4      56%    33%

Quantifying the mosaicism caused by Gene Conversion

What is the proportion of markers between these also in significant LD?

From Mikkel Schierup

Based on Morris et al. 2002

Single Marker Methods
• Kaplan et al. (1995), Rannala & Slatkin (1998)
Problem: difficult to combine markers.

Haplotype methods with star-shaped genealogies
• Terwilliger (1995), Graham & Thompson (1998), McPeek & Strahs (1999), Morris et al. (2000)
Problem: wrong genealogy, gives overconfidence in result.

Haplotype methods based on the coalescent
• Rannala & Reeve (2001), Morris et al. (2002), Larribe et al. (2003)
Problem: computationally intensive.

Development of multi-locus association methods

Probability of Data I:

3 step approach:

I. Probability of Data given topology and branch lengths: Felsenstein (1981) for each column, then multiply over all columns

II. Integrate over branch lengths

III. Sum over topologies

Example sequences: TCAGCCT, TCAGCAT, GCAGGTT

Conclusion: Exact Calculation Computationally Intractable!!
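Step I can be sketched for the three example sequences as follows. This is an illustration of Felsenstein's pruning algorithm under assumptions the slide does not state: a Jukes-Cantor substitution model, a fixed rooted topology ((1,2),3) and arbitrary branch lengths. All names and numbers are hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"

def jc69(t):
    """Jukes-Cantor transition probabilities P(b | a) along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return {a: {b: (same if a == b else diff) for b in BASES} for a in BASES}

def partial(node, column):
    """Conditional likelihoods P(tips below node | base at node) for one column."""
    if isinstance(node, int):                       # leaf: index of a sequence
        return {x: 1.0 if x == column[node] else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:                           # internal node: list of (child, branch length)
        p, below = jc69(t), partial(child, column)
        for x in BASES:
            result[x] *= sum(p[x][y] * below[y] for y in BASES)
    return result

def likelihood(tree, sequences):
    """Multiply the per-column likelihoods, assuming a uniform root distribution."""
    total = 1.0
    for column in zip(*sequences):
        root = partial(tree, column)
        total *= sum(0.25 * root[x] for x in BASES)
    return total

# Hypothetical topology ((1,2),3) with arbitrary branch lengths.
TREE = [([(0, 0.1), (1, 0.1)], 0.2), (2, 0.3)]
print(likelihood(TREE, ["TCAGCCT", "TCAGCAT", "GCAGGTT"]))

# Sanity check: the probabilities of all 64 possible columns sum to 1.
print(sum(likelihood(TREE, [a, b, c]) for a, b, c in product(BASES, repeat=3)))
```

Steps II and III would wrap this function in an integral over branch lengths and a sum over topologies, which is where the cost becomes prohibitive.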

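$$P(\text{Data} \mid \text{Topo}) = \int_0^\infty \!\!\cdots\! \int_0^\infty P(\text{Data} \mid \text{Topo}, t_k, t_{k-1}, \ldots, t_2)\; \binom{k}{2} e^{-\binom{k}{2} t_k}\; \binom{k-1}{2} e^{-\binom{k-1}{2} t_{k-1}} \cdots e^{-t_2}\; dt_k\, dt_{k-1} \cdots dt_2$$

Number of topologies to sum over for n sequences:

$$\prod_{j=3}^{n} \binom{j}{2}$$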
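To give a feel for why the sum over topologies is intractable, the sketch below counts ranked coalescent topologies (labelled histories) for a sample of n sequences as the product of C(j,2) for j = 2..n. That the slide's product refers to this count is an assumption; the counting formula itself is standard.

```python
import math

def labelled_histories(n):
    """Number of ranked coalescent topologies for n sampled sequences."""
    count = 1
    for j in range(2, n + 1):
        count *= math.comb(j, 2)   # choose which pair coalesces while j lineages remain
    return count

for n in (3, 4, 10, 20):
    print(n, labelled_histories(n))
```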

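Probability of Data II: Griffiths & Tavaré (1994), TPB 46(2):131-159

[Figure: sample state n with sequences ACCTAGGAT, TCCTAGGAT, TCCTAGGAT, and an ancestral state with sequences ACCTAGGAT, TCCTAGGAT. Labels: 3*9*3 mutations; (1,2) coalescence.]

$$q(n) = \sum_{n' \to n} q(n')\, f(n, n')$$

q(n'') is determined by the equilibrium distribution.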

Griffiths-Ethier-Tavaré Recursions

[Figure: type configurations over types 1, 2, 3, labelled n=(3,1,2), n=(2,1,2) and n=(3,1,2)]

Griffiths-Marjoram (1996) included recombination in the equations.

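[Equation: recursion for p(T, n), with a coalescence term summed over types k with n_k ≥ 2 and a mutation term weighted by θ; the full expression is not recoverable from the transcript]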

Example: Solving Linear System

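$$q(x) = \sum_{y \in A} r(x,y)\, q(y) + \sum_{z \in B} r(x,z)\, q(z) \qquad \text{for } x \in B$$

q(x) is known when x ∈ A and unknown when x ∈ B.

Substituting the equation into itself repeatedly:

$$q(x) = \sum_{y \in A} r(x,y)\, q(y) + \sum_{y_1 \in B} \sum_{y \in A} r(x,y_1)\, r(y_1,y)\, q(y) + \sum_{y_1 \in B} \sum_{y_2 \in B} \sum_{y \in A} r(x,y_1)\, r(y_1,y_2)\, r(y_2,y)\, q(y) + \cdots$$

$$q(x) = \sum_{k=0}^{\infty} \Big\{ \sum_{y_1 \in B} \cdots \sum_{y_k \in B} \sum_{y \in A} r(x,y_1) \cdots r(y_k,y)\, q(y) \Big\}$$

[Figure: graph of states with known and unknown values q(·), connected by coefficients r(·,·)]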

Example: Solving Linear System

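$$q(x) = E_x\Big\{ q(X_\tau) \prod_{j=1}^{\tau} \frac{r(X_{j-1}, X_j)}{A(X_{j-1}, X_j)} \Big\}, \qquad X_0 = x$$

$$q(x) \approx \frac{1}{m} \sum_{k=1}^{m} q(X^k_{\tau_k}) \prod_{j=1}^{\tau_k} \frac{r(X^k_{j-1}, X^k_j)}{A(X^k_{j-1}, X^k_j)}$$

Here τ is the first time the chain hits the set A, and the m chains X^1, ..., X^m are simulated independently.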

Construct Markov transition function, A(x,y), with following properties:

i) A(x,y) > 0 when r(x,y) >0

ii) The chain visits A with certainty.

• Introduced in coalescence theory by Griffiths & Tavaré (1994)

• Griffiths & Marjoram (1996) included recombination

• Donnelly-Stephens-Fearnhead (2000-) accelerated these algorithms
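The Monte Carlo solution of the linear system can be sketched on a toy example. Everything below is hypothetical: a four-state system in which states 0 and 3 form the known set A and states 1 and 2 the unknown set B, arbitrary coefficients r(x,y), and a uniform proposal chain A(x,y). The estimate is compared with a direct fixed-point solution of the same system.

```python
import random

KNOWN = {0: 2.0, 3: 5.0}                              # q(y) for y in A
R = {1: {0: 0.3, 2: 0.5}, 2: {1: 0.2, 3: 0.6}}        # coefficients r(x, y) for x in B
PROPOSAL = {1: {0: 0.5, 2: 0.5}, 2: {1: 0.5, 3: 0.5}} # Markov chain A(x, y) > 0 where r > 0

def estimate_q(x, m, rng):
    """Average of q(X_tau) * prod r/A over m simulated chains started at x."""
    total = 0.0
    for _ in range(m):
        state, weight = x, 1.0
        while state not in KNOWN:                     # run until the chain hits A
            targets = list(PROPOSAL[state])
            probs = [PROPOSAL[state][t] for t in targets]
            nxt = rng.choices(targets, weights=probs)[0]
            weight *= R[state][nxt] / PROPOSAL[state][nxt]
            state = nxt
        total += weight * KNOWN[state]
    return total / m

def solve_exact(iterations=200):
    """Fixed-point iteration of q(x) = sum_y r(x, y) q(y) on the unknown states."""
    q = {x: 0.0 for x in R}
    for _ in range(iterations):
        q = {x: sum(c * (KNOWN[y] if y in KNOWN else q[y]) for y, c in R[x].items())
             for x in R}
    return q

rng = random.Random(1)
print("Monte Carlo:", estimate_q(1, 100_000, rng))
print("Exact:      ", solve_exact()[1])
```

The choice of proposal chain only affects the variance of the estimate, not its expectation, which is why later work could accelerate these algorithms by designing better proposals.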

The position of the marker locus is missing data

Larribe and Lessard (2002)

Data: haplotype, phenotype, multiplicity

[Table: the haplotype and phenotype columns were graphical; the multiplicities are 15, 3, 6, 2, 1, 2, 1]

Where is the disease-causing mutation?

Likelihood as function of disease locus position

$$f(\text{parameters} \mid \text{data}) = \frac{P(\text{data} \mid \text{parameters})\, f(\text{parameters})}{P(\text{data})}$$

Continuous version of Bayes formula

f(parameters) = prior distribution of parameters
P(data | parameters) = L(parameters) = likelihood function
f(parameters | data) = posterior distribution of parameters given data

The evolutionary parameter (e.g. disease location) is given a prior distribution that encodes any prior knowledge we may have, and we learn about the parameters through the data.

Advantage: f(parameters | data) is the full distribution of the parameters of interest given the data, so interval estimates (e.g. confidence intervals) can be read off directly.

Bayesian approach to LD mapping
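A minimal sketch of Bayes formula on a discrete grid of candidate disease positions. The prior and likelihood values are made up for illustration; in practice the likelihood would come from a genealogical model.

```python
def grid_posterior(prior, likelihood):
    """Posterior over grid points: likelihood * prior, normalised by P(data)."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(joint)                     # P(data), the normalising constant
    return [j / evidence for j in joint]

# Hypothetical example: four candidate positions with a uniform prior.
prior = [0.25, 0.25, 0.25, 0.25]
likelihood = [0.1, 0.4, 0.4, 0.1]             # made-up values of P(data | position)
print(grid_posterior(prior, likelihood))
```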

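$$f(\text{parameters} \mid \text{data}) = \frac{P(\text{data} \mid \text{parameters})\, f(\text{parameters})}{P(\text{data})}$$

Marginal posterior distribution of disease position:

$$P(\text{disease position } x \mid \text{data}) = \int\!\!\int\!\!\int_{\text{parameters except } x} P(\text{parameters} \mid \text{data})\; dP_1 \cdots dP_n$$

The basic equation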

Parameters in Shattered Coalescent Model

Morris, Whittaker and Balding (2001, 2003, 2004, ...)

$$P(x, h, \tau, T, z, N, \rho \mid A, U) \propto L(A, U \mid x, h, \tau, T, z, N)\; \pi(\tau, T, z \mid \rho)\; \pi(\rho)$$

π(ρ) = 2ρ
π(τ, T, z | ρ): prior distribution of genealogies (coalescent-like)

x      Location of disease locus
h      Population marker-haplotype proportions
τ      Branch lengths of genealogical tree
T      Topology (branching pattern)
z      Parental status
N      Effective population size
ρ      Shattering parameter
A, U   Cases, controls

Probability of Haplotypes associated Mutant

At recombination markers are incorporated from the population distribution.

Morris et al: The Shattered Coalescent

Advantages: Allows for multiple origins of the disease mutant + sporadic occurrences of the disease without the mutation

Coalescent tree

Morris, Whittaker & Balding,2002

$$P(\text{disease position } x \mid \text{data}) = \int\!\!\int\!\!\int_{\text{parameters except } x} P(\text{parameters} \mid \text{data})\; dP_1 \cdots dP_n$$

• Evaluate the function at the current point p: f(p) = x

• Suggest a new point, p'

• Evaluate the function at this point: f(p') = y

• If x ≤ y, go to point p'

• If x > y, go to point p' with probability y/x

Monte-Carlo (Metropolis) sampling and integration

Metropolis et al. (1953)
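The steps above can be sketched as a small random-walk Metropolis sampler. The target here is a hypothetical unnormalised two-dimensional Gaussian, and the Gaussian proposal is an assumption (any symmetric proposal fits the acceptance rule y/x). Keeping only the first coordinate of each sample approximates the marginal distribution, i.e. the integration over the remaining parameter.

```python
import math
import random

def metropolis(f, start, n_steps, step_size, rng):
    """Random-walk Metropolis sampling from a density proportional to f."""
    p = list(start)
    x = f(p)                                    # f at the current point
    samples = []
    for _ in range(n_steps):
        p_new = [c + rng.gauss(0.0, step_size) for c in p]   # suggest a new point
        y = f(p_new)
        if y >= x or rng.random() < y / x:      # always accept uphill, else with prob y/x
            p, x = p_new, y
        samples.append(tuple(p))
    return samples

def unnormalised_target(p):
    """Hypothetical target: standard bivariate normal without its constant."""
    return math.exp(-0.5 * (p[0] ** 2 + p[1] ** 2))

rng = random.Random(0)
samples = metropolis(unnormalised_target, (0.0, 0.0), 50_000, 1.0, rng)
first = [s[0] for s in samples]                 # projection on one axis = marginal
mean = sum(first) / len(first)
var = sum((v - mean) ** 2 for v in first) / len(first)
print(f"marginal mean ~ {mean:.3f}, marginal variance ~ {var:.3f}")
```

Only ratios of f are used, so the normalising constant P(data) never has to be computed.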

Due to Jesper Nymann

Projection on one axis equivalent to integration over the remaining parameters


Monte-Carlo (Metropolis)

Due to Jesper Nymann

Morris et al. (2002).


Example 1 - Cystic fibrosis

Due to Jesper Nymann

1132 Cases, 54 with known mutation

758 Controls

Iceland Genomics Corporation:

Example 2 - BRCA2

Due to Jesper Nymann

[Figure: two panels with marker positions 1 to 15 on the horizontal axis; the true location is marked]

Panel 1: Multipoint calculation for the full BRCA2 dataset.

Panel 2: Multipoint calculation where the 54 cases with a known mutation have been removed.

Example 2 - BRCA2 continued

Due to Jesper Nymann

Simulation Parameters:
• Recombination rate = 50
• Number of leaf nodes = 1000
• Number of markers = 10
• Diseased haplotype fraction: 0.08 – 0.12
• No heterogeneity
• Simulated under the assumption of constant population size
• Diplotypes (phase known)

Type of simulation      50% quantile
Basic (red curve)       0.044

The Basic Setup

Due to Jesper Nymann

Type of simulation                                          50% quantile
19 markers (blue curve)                                     0.0292
19 markers and recombination rate = 100 (yellow curve)      0.02321
Basic (red curve)                                           0.044

The effect of marker density

Due to Jesper Nymann

[Figure: two phased haplotypes of 0/1 alleles at 10 markers are combined into unphased genotype data, in which heterozygous markers appear as 0/1]

Type of simulation                  50% quantile
With genotype data (blue curve)     0.05857
Basic (red curve)                   0.044

The effect of knowing phase

Due to Jesper Nymann

Type of simulation                    50% quantile
With known genealogy (blue curve)     0.03516
Basic (red curve)                     0.044

The Effect of knowing gene genealogy

Due to Jesper Nymann

Type of simulation                             50% quantile
Disease fraction 12% - 14% (blue curve)        0.0353
Disease fraction 18% - 22% (yellow curve)      0.03229
Basic (red curve)                              0.044

The effect of disease fraction

Due to Jesper Nymann

Type of simulation                  50% quantile
With heterogeneity (blue curve)     0.065587
Basic (red curve)                   0.044

The effect of Heterogeneity

Due to Jesper Nymann

Type of simulation                         50% quantile
With mixed cases/controls (blue curve)     0.1518
Basic (red curve)                          0.044

33% of the cases are moved to the controls, and a similar number of controls are moved to the cases.

[Figure: cases and controls]

The effect of Impurity of cases and controls

Due to Jesper Nymann

Type of simulation                 50% quantile
LD in background (blue curve)      0.0419
Basic (red curve)                  0.044

Gene pool

Example haplotype: 0 1 0 0 0 0 1 1 1 1

LD in background: P(0) P(1|0) P(1|1) P(0|1) P(0|0) P(1|0) P(1|1) P(0|1) P(1|0) P(0|1)

No LD in background: P(0) P(1) P(1) P(0) P(0) P(1) P(1) P(0) P(1) P(0)

LD in background population

Due to Jesper Nymann

Simulation Type          Mean     50% Quantile   70% Quantile   95% Quantile
Basic                    0.059    0.044          0.070          0.193
19 markers, rho=100      0.044    0.023          0.043          0.142
19 markers               0.053    0.029          0.047          0.176
18% - 22% cases          0.046    0.032          0.052          0.146
12% - 14% cases          0.048    0.035          0.050          0.136
Fixed topology           0.047    0.035          0.058          0.111
LD in background         0.078    0.042          0.072          0.273
Genotype data            0.087    0.059          0.099          0.305
Heterogeneity            0.088    0.066          0.092          0.246
33% impure               0.173    0.152          0.217          0.452
Random                   0.303    0.273          0.407          0.696

Comparing the different scenarios

Due to Jesper Nymann

Summary

The Fundamental Problem

The Data

Genotype to Phenotype Functions

Types of Mapping

Population Set-up & Measures of Dependency

Methods:

Pure Coalescent Based

The Shattered Coalescent

Factors influencing mapping error.

Beaumont, M. A. and B. Rannala (2004) The Bayesian Revolution in genetics. Nature Reviews Genetics 5: 251.

Botstein, D. and N. Risch (2003) Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet 33 Suppl: 228-237.

Cardon, L. and J. Bell (2001) Association Study Designs for Complex Diseases. Nature Reviews Genetics.

Daly, M. J., Rioux, J. D., Schaffner, S. F., Hudson, T. J. & Lander, E. S. (2001) High-resolution haplotype structure in the human genome. Nat Genet 29(2): 229-232.

Devlin, B. & Roeder, K. (1999) Genomic control for association studies. Biometrics 55(4): 997-1004.

Frisse, L. et al. (2001) Gene Conversion and Different Population Histories May Explain the Contrast between Polymorphisms and LD Levels. AJHG 69.

Gabriel, S. B. et al. (2002) The structure of haplotype blocks in the human genome. Science 296(5576): 2225-2229.

Griffiths, R. and S. Tavaré (1994) Simulating probability distributions in the coalescent. Theor. Pop. Biol. 46(2): 131-159.

Griffiths, R. and P. Marjoram (1996) Ancestral inference from samples of DNA sequences with recombination. J. Comput. Biol.

Hudson, R. R. (1990) Gene genealogies and the coalescent process. In "Oxford Surveys in Evolutionary Biology" (D. Futuyma and J. Antonovics, Eds.) Vol. 7, pp. 1-44. Oxford Univ. Press, Oxford, UK.

Kerem, B., J. M. Rommens, J. A. Buchanan, D. Markiewicz, T. K. Cox, A. Chakravarti, M. Buchwald and L. C. Tsui (1989) Identification of the Cystic Fibrosis Gene: Genetic Analysis. Science 245: 1073-1080.

Kong, A. et al. (2002) A high-resolution recombination map of the human genome. Nat Genet 31: 241-247.

Laitinen et al. (2004) Characterization of a common susceptibility locus for asthma-related traits. Science 304: 300-304.

Larribe, F., S. Lessard and N. Schork (2002) Gene Mapping via the Ancestral Recombination Graph. Theor. Pop. Biol. 62: 215-229.

Liu, J. et al. (2001) Bayesian Analysis of Haplotypes for Linkage Disequilibrium Mapping. Genome Research 11: 1716-1724.

Martin, E. R. et al. (2000) SNPing away at complex diseases: analysis of single-nucleotide polymorphisms around APOE in Alzheimer disease. Am J Hum Genet 67: 383-394.

McVean, G. (2002) A Genealogical Interpretation of Linkage Disequilibrium. Genetics 162: 987-991.

Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller and E. Teller (1953) Equation of state calculations by fast computing machines. J. Chem. Phys. 21: 1087-1092.

Morris, A., J. C. Whittaker and D. Balding (2002) Fine-Scale Mapping of Disease Loci via Shattered Coalescent Modeling of Genealogies. AJHG 70: 686-707.

Morris, A. P., J. C. Whittaker and D. J. Balding (2004) Little loss of information due to unknown phase for fine-scale LD mapping with SNP genotype data. AJHG 74: 945-953.

Morris, A. P., J. C. Whittaker, C.-F. Xu, L. K. Hosking and D. J. Balding (2003) Multipoint linkage-disequilibrium mapping narrows location interval and identifies mutation heterogeneity. PNAS 100: 13442-13446.

Articles I

McVean, G. A., Myers, S. R., Hunt, S., Deloukas, P., Bentley, D. R. and Donnelly, P. (2004) The fine-scale structure of recombination rate variation in the human genome. Science 304: 581-584.

Patil, N. et al. (2001) Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294: 1719-1723.

Pritchard, J. and M. Przeworski (2001) Linkage Disequilibrium in Humans: Models and Data. AJHG 69: 1-14.

Pritchard, J. K., Stephens, M., Rosenberg, N. A. & Donnelly, P. (2000) Association mapping in structured populations. Am J Hum Genet 67(1): 170-181.

Pritchard and Cox (2002) The allelic architecture of human disease genes: common disease-common variant … or not. Human Molecular Genetics 11(20): 2417-2

Rannala, B. and J. P. Reeve (2001) High-Resolution Multipoint Linkage-Disequilibrium Mapping in the Context of a Human Genome Sequence. AJHG 69: 159-178.

Reich, D. E. et al. (2001) Linkage disequilibrium in the human genome. Nature 411(6834): 199-204.

Reich, D. E. and Lander, E. On the allelic spectrum of human diseases. Trends in Genetics 19: 502-510.

Reich, D. E. et al. (2002) Human genome sequence variation and the influence of gene history, mutation and recombination. Nat Genet 32(1): 135-142.

Risch, N. and Merikangas, K. (1996) The future of genetic studies of complex human diseases. Science 273: 1516-1517.

Spielman, R. S. and W. J. Ewens (1996) The TDT and other family-based tests for linkage disequilibrium and association. Am. J. Hum. Genet. 59: 983-989.

Stefansson, H. et al. (2003) Association of neuregulin 1 with schizophrenia confirmed in a Scottish population. Am J Hum Genet 72(1): 83-87.

Stephens, J. C. et al. (2001) Haplotype variation and linkage disequilibrium in 313 human genes. Science 293(5529): 489-493.

Strachan, T. & Read, A. P. (2003) Human Molecular Genetics 3. BIOS Scientific Publishers Ltd, Wiley, New York.

Tabor, Risch and Myers (2002) Candidate-gene approaches for studying complex genetic traits: practical considerations. Nature Reviews Genetics 3 (May): 1-7.

Terwilliger, J. D. et al. (2002) A bias-ed assessment of the use of SNPs in human complex traits. Curr. Opin. Genetics & Development 12: 726-734.

The International HapMap Consortium (2003) The International HapMap Project. Nature 426: 789-795.

Weiss, K. M. and Clark, A. G. (2002) Linkage disequilibrium and the mapping of complex human traits. Trends in Genetics 18: 19-24.

Weiss, K. and Terwilliger, J. (2000) How many diseases does it take to map a gene with SNPs? Nature Genetics 26 (Oct.).

Articles II

Books

Encyclopedia of the Human Genome (2003) Nature Publishing Group

Liu, J. S. (2001) "Monte Carlo Strategies in Scientific Computing" Springer Verlag

Ott, J. (1999) Analysis of Human Genetic Linkage, 3rd edition. Publisher: Johns Hopkins

Strachan & Read (2004) Human Molecular Genetics III Publisher: Biosciences

Weiss,K.(1993) “Genetic Variation and Human Disease” Cambridge University Press.

Web-sites

www.stats.ox.ac.uk/mcvean

Jeff Reeve and Bruce Rannala: DMLE+, a multipoint linkage disequilibrium disease mapping program that allows genotype data to be used directly and allows estimation of allele ages.
http://dmle.org/

Liu, J.S., Sabatti, C., Teng, J., Keats, B.J.B. and N. Risch (version upgraded by Xin Lu, June 9, 2002): software for the Bayesian haplotype analysis method developed in the article "Bayesian Analysis of Haplotypes for Linkage Disequilibrium Mapping", Genome Research 11:1716, 2001.
http://www.people.fas.harvard.edu/~junliu/TechRept/03folder/bladev2.tar

J. N. Madsen, M. H. Schierup, C. Storm, L. Schauser and T. Mailund: CoaSim, a tool for simulating the coalescent process with recombination and gene conversion under the assumption of exponential population growth.
http://www.birc.dk/Software/CoaSim/

Books & Www-sites