Fine Scale Mapping and the Coalescent
•The Fundamental Problem
•The Data
•Genotype to Phenotype Functions
•Types of Mapping
•Population Set-up & Measures of Dependency
•The Calculations
•Practical Considerations
Genotype and Phenotype Covariation: Gene Mapping
[Figure: time axis]
Result: The Mapping Function
Decay of local dependency (Reich et al., 2001)
A set of characters.
Binary decision (0,1).
Quantitative Character.
Dominant/Recessive.
Penetrance
Spurious Occurrence
Heterogeneity
Genotype --> Phenotype Function
Sampling Genotypes and Phenotypes
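Pedigree Analysis:
•Pedigree known
•Few meioses (max 100s)
•Resolution: cMorgans (Mbases)
Association Mapping:
•Pedigree unknown
•Many meioses (>10^4)
•Resolution: 10^-5 Morgans (Kbases)
[Figure: a pedigree and a genealogy spanning 2N generations, each showing disease locus D and marker locus M separated by recombination distance r.]
Adapted from McVean and others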
Pedigree Analysis & Association Mapping
[Figure: disease locus D and marker locus M on a chromosome, shown at time t ago and now.]
Creates LD: Drift, Selection, Admixture
Breaks down LD: Recombination, Gene conversion
Causes of linkage disequilibrium
Test for independence in a 2 x 2 contingency table

          A          a          Total
B         X_{A,B}    X_{a,B}    X_{.,B}
b         X_{A,b}    X_{a,b}    X_{.,b}
Total     X_{A,.}    X_{a,.}    X_{.,.}
Significance of a Single Association
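The test of independence on the 2 x 2 table can be sketched as a Pearson chi-square test with one degree of freedom. This is a minimal sketch, not the slides' own code: the function name `chi_square_2x2` and the example counts are hypothetical, and the p-value uses the identity P(chi2_1 > x) = erfc(sqrt(x/2)).

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test for a 2 x 2 table [[X_AB, X_aB], [X_Ab, X_ab]]."""
    row_totals = [sum(r) for r in table]            # X_{.,B}, X_{.,b}
    col_totals = [sum(c) for c in zip(*table)]      # X_{A,.}, X_{a,.}
    n = sum(row_totals)                             # X_{.,.}
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts, chosen only for illustration.
chi2, p_value = chi_square_2x2([[30, 10], [20, 40]])
print(f"chi2 = {chi2:.3f}, p = {p_value:.2e}")
```

The sketch assumes every expected count is positive; with small counts an exact test would be the more cautious choice.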
Measuring Linkage Disequilibrium between 2 Loci with 2 Alleles
Remade from McVean
$D_{A,B} = f_{A,B} - f_A f_B = -D_{a,B} = -D_{A,b} = D_{a,b}$
Correlation Coefficient Measure [0,1], Hill & Robertson (1968):
$r^2 = \rho^2 = \frac{D_{AB}^2}{f_A f_a f_B f_b}$
Range constrained by allele frequencies [0,1], Lewontin (1964):
$D' = \frac{D_{AB}}{\min(f_A f_b, f_a f_B)}$ if $D_{AB} > 0$, else $D' = \frac{-D_{AB}}{\min(f_A f_B, f_a f_b)}$
Odds-ratio formulation, Devlin & Risch (1995):
$\delta_{AB} = \frac{f_{AB}}{f_A f_B}$
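The pairwise measures above can be computed from one haplotype frequency and two allele frequencies. The following is a small sketch under the assumption that both loci are polymorphic (all four allele frequencies strictly between 0 and 1); the function name `ld_measures` and the example frequencies are hypothetical.

```python
def ld_measures(f_AB, f_A, f_B):
    """Return (D, r^2, D') for two biallelic loci.

    f_AB is the frequency of the AB haplotype; f_A and f_B are allele frequencies.
    Assumes 0 < f_A < 1 and 0 < f_B < 1, otherwise r^2 is undefined.
    """
    f_a, f_b = 1.0 - f_A, 1.0 - f_B
    D = f_AB - f_A * f_B
    r2 = D * D / (f_A * f_a * f_B * f_b)
    # The largest attainable |D| depends on the sign of D (Lewontin's normalisation).
    if D > 0:
        d_max = min(f_A * f_b, f_a * f_B)
    else:
        d_max = min(f_A * f_B, f_a * f_b)
    d_prime = abs(D) / d_max
    return D, r2, d_prime


# Hypothetical frequencies, for illustration only.
D, r2, d_prime = ld_measures(f_AB=0.4, f_A=0.5, f_B=0.5)
print(D, r2, d_prime)
```

Taking the absolute value keeps D' in [0,1], as on the slide; some authors keep the sign instead.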
Disease locus
Marker loci
Combine Single (Pairwise) to Multiple Tests
Bonferroni
Sharper bounds using linkage information.
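As a sketch of the Bonferroni step: with m single-marker tests, each p-value is compared against alpha/m. The function name and the p-values below are hypothetical, and the correction is conservative when markers are in LD, which is why sharper bounds using linkage information are of interest.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag the tests that remain significant after Bonferroni correction."""
    threshold = alpha / len(p_values)  # per-test significance level
    return [p <= threshold for p in p_values]


# Hypothetical p-values from five single-marker tests.
flags = bonferroni_significant([0.001, 0.02, 0.04, 0.0004, 0.3])
print(flags)
```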
Examples of Associations: Pairwise, Triple,...
Adapted from Hudson 1990
[Figure panels: Recombination; Gene Conversion]
The coalescent with recombination or gene conversion
[Figure: local trees along the sequence for a sample of four. Gene conversion: Tree 1, Tree 2, Tree 1. Recombination: Tree 1, Tree 2, Tree 3.]
Local trees for recombination and gene conversion
[Figure: a target tree on samples 1-5 compared with local trees at other positions. Categories: same tree as target (region with no recombination), same topology as target, same MRCA as target.]
Measures of tree similarity
Local trees of the target and other positions
Only recombination, r=2. Also gene conversion g/ρ
Sample size = 20
From Mikkel Schierup
Recombination/gene conversion rate      R=2, G=0    R=2, G=8
#segments with same tree                1.02        1.8
  P(target segment not largest)         0.2%        14%
#segments same topology                 1.02        2.1
  P(target segment not largest)         0.3%        20%
#segments same TMRCA                    1.1         2.9
  P(target segment not largest)         1.5%        25%
Probability that the largest segment does not include the target
From Mikkel Schierup
A and B are the most distant markers in significant LD with the target.
[Figure: marker A, Target, marker B along the sequence]
          G=0    G=16
Rho=4     56%    33%
Quantifying the mosaicism caused by Gene Conversion
What is the proportion of markers between these also in significant LD?
From Mikkel Schierup
Based on Morris et al. 2002
Single Marker Methods
•Kaplan et al. (1995), Rannala & Slatkin (1998)
Problem: difficult to combine markers.
Haplotype methods with star-shaped genealogies
•Terwilliger (1995), Graham & Thompson (1998), McPeek & Strahs (1999), Morris et al. (2000)
Problem: wrong genealogy, which gives overconfidence in the result.
Haplotype methods based on the coalescent
•Rannala & Reeve (2001), Morris et al. (2002), Larribe et al. (2003)
Problem: computationally intensive.
Development of multi-locus association methods
Probability of Data I:
3 step approach:
I Probability of Data given topology and branch lengths
Felsenstein (1981) for each column. Multiply for all columns.
II Integrate over branch lengths
$\int_0^\infty \cdots \int_0^\infty P(\mathrm{Data} \mid \mathrm{Topo}, t_k, t_{k-1}, \ldots, t_2) \prod_{j=2}^{k} \binom{j}{2} e^{-\binom{j}{2} t_j} \, dt_k \, dt_{k-1} \cdots dt_2$
III Sum over topologies
Number of topologies (labelled histories) for n sequences: $\prod_{j=3}^{n} \binom{j}{2}$
[Figure: tree with the sequences TCAGCCT, TCAGCAT, GCAGGTT at the leaves]
Conclusion: Exact Calculation Computationally Intractable!!
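To give a feel for why the sum over topologies is intractable, the product of binomial coefficients (the number of labelled histories of n sequences) can be evaluated directly. This is an illustrative sketch; the function name `labelled_histories` is an assumption of this example.

```python
import math


def labelled_histories(n):
    """Number of labelled histories of n sequences: product of C(j, 2) for j = 3..n."""
    count = 1
    for j in range(3, n + 1):
        count *= math.comb(j, 2)
    return count


for n in (4, 10, 20):
    print(n, labelled_histories(n))
```

Even before integrating over branch lengths, the number of terms grows faster than exponentially in the sample size.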
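Probability of Data II: Griffiths & Tavaré TPB 46.2.131-159
[Figure: sample configuration n of three sequences (ACCTAGGAT, TCCTAGGAT, TCCTAGGAT). Possible previous events: 3*9*3 mutations, or a (1,2) coalescence leading to the configuration (ACCTAGGAT, TCCTAGGAT).]
q(n'') – determined by equilibrium distribution.
$q(n) = \sum_{n' \to n} q(n') f(n, n')$
Griffiths-Ethier-Tavaré Recursions
[Figure: trees on types 1, 2, 3 with configurations n=(3,1,2), n=(2,1,2), n=(3,1,2)]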
Griffiths-Marjoram (1996) included recombination in the equations.
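[Equation: recursion expressing p(T,n) as a sum over k with n_k ≥ 2 of terms in (n_k − 1) p(T, n − e_k) (coalescence), plus terms in θ p(T', n') (mutation), normalised by n − 1 + θ.]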
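Example: Solving Linear System
q(x) is known when x ∈ A and unknown when x ∈ B.
$q(x) = \sum_{y \in A} r(x,y) q(y) + \sum_{z \in B} r(x,z) q(z)$ for $x \in B$
Expanding repeatedly:
$q(x) = \sum_{y \in A} r(x,y) q(y) + \sum_{y_1 \in B} \sum_{y \in A} r(x,y_1) r(y_1,y) q(y) + \sum_{y_1 \in B} \sum_{y_2 \in B} \sum_{y \in A} r(x,y_1) r(y_1,y_2) r(y_2,y) q(y) + \cdots$
$= \sum_{k=0}^{\infty} \{ \sum_{y_1 \in B} \cdots \sum_{y_k \in B} \sum_{y \in A} r(x,y_1) \cdots r(y_k,y) q(y) \}$
[Figure: graph of states, some with known q( ) and some unknown (??), connected by weights r( , ).]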
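Example: Solving Linear System
$q(x) = E_x \{ q(X_\tau) \prod_{k=1}^{\tau} \frac{r(X_{k-1}, X_k)}{A(X_{k-1}, X_k)} \}$, with $X_0 = x$ and $\tau$ the first time the chain enters A
$\hat{q} = \frac{1}{m} \sum_{j=1}^{m} q(X^j_{\tau_j}) \prod_{k=1}^{\tau_j} \frac{r(X^j_{k-1}, X^j_k)}{A(X^j_{k-1}, X^j_k)}$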
Construct Markov transition function, A(x,y), with following properties:
i) A(x,y) > 0 when r(x,y) >0
ii) The chain visits A with certainty.
•Introduced in coalescence theory by Griffiths & Tavare (1994)
•Griffiths & Marjoram (1996) included recombination
•Donnelly-Stephens-Fearnhead (2000-) accelerated these algorithms
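The importance-sampling idea above can be tried on a toy linear system. The sketch below is not the Griffiths-Tavaré coalescent recursion itself: the states, the weights `R`, and the known values `Q_KNOWN` are all hypothetical, and the proposal chain is simply A(x,y) = r(x,y) / sum_y r(x,y), which satisfies both listed properties for this example.

```python
import random

# Hypothetical toy system: states 0 and 1 form B (q unknown),
# states 2 and 3 form A (q known).
R = {
    0: {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.2},
    1: {0: 0.2, 1: 0.1, 2: 0.1, 3: 0.4},
}
Q_KNOWN = {2: 1.0, 3: 2.0}


def sample_path_weight(x, rng):
    """Run one chain from x until it enters A; return q(X_tau) * prod r/A."""
    weight = 1.0
    while x not in Q_KNOWN:
        targets = list(R[x])
        row_sum = sum(R[x].values())
        # Proposal A(x, y) = r(x, y) / row_sum, so each step contributes r/A = row_sum.
        x = rng.choices(targets, weights=[R[x][y] for y in targets])[0]
        weight *= row_sum
    return weight * Q_KNOWN[x]


def estimate_q(x, m, rng):
    """Average the weighted path values over m independent chains."""
    return sum(sample_path_weight(x, rng) for _ in range(m)) / m


def exact_q0():
    """Solve the 2 x 2 system for q(0) directly (Cramer's rule), for comparison."""
    a11, a12 = 1.0 - R[0][0], -R[0][1]
    a21, a22 = -R[1][0], 1.0 - R[1][1]
    b1 = sum(R[0][y] * Q_KNOWN[y] for y in Q_KNOWN)
    b2 = sum(R[1][y] * Q_KNOWN[y] for y in Q_KNOWN)
    det = a11 * a22 - a12 * a21
    return (b1 * a22 - a12 * b2) / det


rng = random.Random(1)
print("Monte Carlo:", estimate_q(0, 20000, rng), "exact:", exact_q0())
```

In the coalescent setting the state space is far too large to solve directly, so only the simulation route remains; the choice of A(x,y) then largely determines the variance of the estimate.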
The position of the marker locus is missing data
Larribe and Lessard (2002)
Data: haplotype, phenotype, multiplicity
[Table: haplotypes and phenotypes not recoverable; multiplicities 15, 3, 6, 2, 1, 2, 1]
Where is the disease-causing mutation?
Likelihood as function of disease locus position
$f(\mathrm{parameters} \mid \mathrm{data}) = \frac{P(\mathrm{data} \mid \mathrm{parameters}) \, f(\mathrm{parameters})}{P(\mathrm{data})}$
Continuous version of Bayes formula
f(parameters) = prior distribution of parameters
P(data|parameters) = L(parameters) = likelihood function
f(parameters|data) = posterior distribution of parameters given data
The evolutionary parameter (e.g. disease location) is considered to have prior distribution (any prior knowledge we may have)
and we learn about parameters through data
Advantage: f (parameters|data) is the full distribution of parameters of interest given data, e.g. confidence intervals
Bayesian approach to LD mapping
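Marginal posterior distribution of disease position:
$P(\text{disease position } x \mid \mathrm{data}) = \int\!\!\int\!\!\int_{\text{parameters except } x} P(\mathrm{parameters} \mid \mathrm{data}) \, dP_1 \cdots dP_n$
$f(\mathrm{parameters} \mid \mathrm{data}) = \frac{P(\mathrm{data} \mid \mathrm{parameters}) \, f(\mathrm{parameters})}{P(\mathrm{data})}$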
The basic equation
Parameters in Shattered Coalescent Model
Morris, Whittaker and Balding (2001, 2003, 2004, ...)
P(x,h,τ,T,z,N,ρ|A,U) ~ L(A,U|x,h,τ,T,z,N) π(τ,T,z|ρ) π(ρ)
π(ρ) = 2ρ
π(τ,T,z|ρ): prior distribution of genealogies (coalescent like)
x       Location of disease locus
h       Population marker-haplotype proportions
τ       Branch lengths of genealogical tree
T       Topology (branching pattern)
z       Parental status
N       Effective population size
ρ       Shattering parameter
A, U    Cases, controls
Probability of Haplotypes associated with the Mutant
At recombination, markers are incorporated from the population distribution.
Morris et al: The Shattered Coalescent
Advantages: Allows for multiple origins of the disease mutant + sporadic occurrences of the disease without the mutation
Coalescent tree
Morris, Whittaker & Balding, 2002
$P(\text{disease position } x \mid \mathrm{data}) = \int\!\!\int\!\!\int_{\text{parameters except } x} P(\mathrm{parameters} \mid \mathrm{data}) \, dP_1 \cdots dP_n$
•Evaluate the function at the current point p: f(p) = x
•Suggest a new point, p'
•Evaluate the function at this point: f(p') = y
•If x < y, go to point p'
•If x > y, go to point p' with probability y/x
Monte-Carlo (Metropolis) sampling and integration
Metropolis et al. (1953)
Due to Jesper Nymann
Projection on one axis equivalent to integration over the remaining parameters
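The Metropolis steps and the projection remark can be illustrated on a two-parameter toy target. This is a sketch under stated assumptions: the target density, the Gaussian random-walk proposal, the step size, and the burn-in length are all hypothetical choices, not the settings used for the results in these slides.

```python
import math
import random


def target(p):
    """Unnormalised toy density: x ~ N(0, 1) and y ~ N(1, 4), independent."""
    x, y = p
    return math.exp(-0.5 * (x * x + (y - 1.0) ** 2 / 4.0))


def metropolis(f, start, n_steps, step_sd, rng):
    """Random-walk Metropolis sampler following the accept rule on the slide."""
    p, fx = start, f(start)
    samples = []
    for _ in range(n_steps):
        candidate = tuple(c + rng.gauss(0.0, step_sd) for c in p)
        fy = f(candidate)
        # Always move uphill; move downhill with probability y/x.
        if fy >= fx or rng.random() < fy / fx:
            p, fx = candidate, fy
        samples.append(p)
    return samples


samples = metropolis(target, (0.0, 0.0), 50000, 1.0, random.Random(0))
kept = samples[5000:]  # discard an (assumed) burn-in period

# Projection on the first axis: ignoring y integrates it out.
xs = [s[0] for s in kept]
mean_x = sum(xs) / len(xs)
var_x = sum((v - mean_x) ** 2 for v in xs) / len(xs)
print("marginal mean of x:", mean_x, "marginal variance of x:", var_x)
```

Histogramming `xs` alone gives the marginal distribution of the first parameter, which is how a marginal posterior for the disease position would be read off a sampled chain.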
Monte-Carlo (Metropolis)
Due to Jesper Nymann
1132 Cases, 54 with known mutation
758 Controls
Iceland Genomics Corporation:
Example 2 - BRCA2
Due to Jesper Nymann
[Figure: two panels with marker positions 1-15 on the horizontal axis and the true location indicated.]
Left panel: multipoint calculation for the full BRCA2 dataset.
Right panel: multipoint calculation where the 54 known mutation cases have been removed.
Example 2 - BRCA2 continued
Due to Jesper Nymann
Simulation Parameters:
•Recombination rate = 50
•Number of leaf nodes = 1000
•Number of markers = 10
•Diseased haplotype fraction: 0.08 – 0.12
•No heterogeneity
•Simulated under the assumption of constant population size
•Diplotypes (phase known)

Type of simulation      50% quantile
Basic (red curve)       0.044
The Basic Setup
Due to Jesper Nymann
Type of simulation                                           50% quantile
19 markers (blue curve)                                      0.0292
19 markers and recombination rate = 100 (yellow curve)       0.02321
Basic (red curve)                                            0.044
The effect of marker density
Due to Jesper Nymann
Haplotypes (phase known):
0 1 0 0 0 0 1 1 1 1
0 0 0 0 1 0 1 0 1 0
Genotypes (phase unknown):
0 0/1 0 0 0/1 0 1 0/1 1 0/1

Type of simulation                  50% quantile
With Genotype data (blue curve)     0.05857
Basic (red curve)                   0.044
The effect of knowing phase
Due to Jesper Nymann
Type of simulation                    50% quantile
With known genealogy (blue curve)     0.03516
Basic (red curve)                     0.044
The Effect of knowing gene genealogy
Due to Jesper Nymann
Type of simulation                              50% quantile
Disease fraction 12% - 14% (blue curve)         0.0353
Disease fraction 18% - 22% (yellow curve)       0.03229
Basic (red curve)                               0.044
The effect of disease fraction
Due to Jesper Nymann
Type of simulation                  50% quantile
With Heterogeneity (blue curve)     0.065587
Basic (red curve)                   0.044
The effect of Heterogeneity
Due to Jesper Nymann
Type of simulation                          50% quantile
With mixed cases/controls (blue curve)      0.1518
Basic (red curve)                           0.044

33% of the cases are moved to the controls and a similar number of controls are moved to the cases.
[Figure: Cases and Controls]
The effect of Impurity of cases and controls
Due to Jesper Nymann
Type of simulation                  50% quantile
LD in background (blue curve)       0.0419
Basic (red curve)                   0.044

Gene pool
0 1 0 0 0 0 1 1 1 1
LD in background: P(0) P(1|0) P(1|1) P(0|1) P(0|0) P(1|0) P(1|1) P(0|1) P(1|0) P(0|1)
No LD in background: P(0) P(1) P(1) P(0) P(0) P(1) P(1) P(0) P(1) P(0)
LD in background population
Due to Jesper Nymann
Simulation Type        Mean     50% Quantile   70% Quantile   95% Quantile
Basic                  0.059    0.044          0.070          0.193
19 markers, rho=100    0.044    0.023          0.043          0.142
19 markers             0.053    0.029          0.047          0.176
18% - 22% cases        0.046    0.032          0.052          0.146
12% - 14% cases        0.048    0.035          0.050          0.136
Fixed topology         0.047    0.035          0.058          0.111
LD in background       0.078    0.042          0.072          0.273
Genotype Data          0.087    0.059          0.099          0.305
Heterogeneity          0.088    0.066          0.092          0.246
33% impure             0.173    0.152          0.217          0.452
Random                 0.303    0.273          0.407          0.696
Comparing the different scenarios
Due to Jesper Nymann
Summary
The Fundamental Problem
The Data
Genotype to Phenotype Functions
Types of Mapping
Population Set-up & Measures of Dependency
Methods:
Pure Coalescent Based
The Shattered Coalescent
Factors influencing mapping error.
Beaumont, M. A. and Rannala, B. (2004) The Bayesian revolution in genetics. Nature Reviews Genetics 5, 251.
Botstein, D. and Risch, N. (2003) Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet 33 Suppl, 228-237.
Cardon, L. and Bell, J. (2001) Association study designs for complex diseases. Nature Reviews Genetics.
Daly, M. J., Rioux, J. D., Schaffner, S. F., Hudson, T. J. and Lander, E. S. (2001) High-resolution haplotype structure in the human genome. Nat Genet 29(2), 229-232.
Devlin, B. and Roeder, K. (1999) Genomic control for association studies. Biometrics 55(4), 997-1004.
Frisse, L. et al. (2001) Gene conversion and different population histories may explain the contrast between polymorphisms and LD levels. AJHG 69, ?-?
Gabriel, S. B. et al. (2002) The structure of haplotype blocks in the human genome. Science 296(5576), 2225-2229.
Griffiths, R. and Tavaré, S. (1994) Simulating probability distributions in the coalescent. Theor. Pop. Biol. 46.2, 131-159.
Griffiths, R. and Marjoram, P. (1996) Ancestral inference from samples of DNA sequences with recombination. J. Comput. Biol.
Hudson, R. R. (1990) Gene genealogies and the coalescent process. In Oxford Surveys in Evolutionary Biology (D. Futuyma and J. Antonovics, Eds.) Vol. 7, pp. 1-44. Oxford Univ. Press, Oxford, UK.
Kerem, B., Rommens, J. M., Buchanan, J. A., Markiewicz, D., Cox, T. K., Chakravarti, A., Buchwald, M. and Tsui, L. C. (1989) Identification of the cystic fibrosis gene: genetic analysis. Science 245, 1073-1080.
Kong, A. et al. (2002) A high-resolution recombination map of the human genome. Nat Genet 31, 241-247.
Laitinen et al. (2004) Characterization of a common susceptibility locus for asthma-related traits. Science 304, 300-304.
Martin, E. R. et al. (2000) SNPing away at complex diseases: analysis of single-nucleotide polymorphisms around APOE in Alzheimer disease. Am J Hum Genet 67, 383-394.
Larribe, F., Lessard, S. and Schork, N. (2002) Gene mapping via the ancestral recombination graph. Theor. Pop. Biol. 62, 215-229.
Liu, J. et al. (2001) Bayesian analysis of haplotypes for linkage disequilibrium mapping. Genome Research 11, 1716-1724.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953) Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087-1092.
McVean, G. (2002) A genealogical interpretation of linkage disequilibrium. Genetics 162, 987-991.
Morris, A., Whittaker, J. C. and Balding, D. (2002) Fine-scale mapping of disease loci via shattered coalescent modeling of genealogies. AJHG 70, 686-707.
Morris, A. P., Whittaker, J. C. and Balding, D. J. (2004) Little loss of information due to unknown phase for fine-scale LD mapping with SNP genotype data. AJHG 74, 945-953.
Morris, A. P., Whittaker, J. C., Xu, C.-F., Hosking, L. K. and Balding, D. J. (2003) Multipoint linkage-disequilibrium mapping narrows location interval and identifies mutation heterogeneity. PNAS 100, 13442-13446.
Articles I
McVean, G. A., Myers, S. R., Hunt, S., Deloukas, P., Bentley, D. R. and Donnelly, P. (2004) The fine-scale structure of recombination rate variation in the human genome. Science 304, 581-584.
Patil, N. et al. (2001) Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294, 1719-1723.
Reich, D. E. et al. (2001) Linkage disequilibrium in the human genome. Nature 411(6834), 199-204.
Reich, D. E. and Lander, E. On the allelic spectrum of human diseases. Trends in Genetics 19, 502-510.
Reich, D. E. et al. (2002) Human genome sequence variation and the influence of gene history, mutation and recombination. Nat Genet 32(1), 135-142.
Risch, N. and Merikangas, K. (1996) The future of genetic studies of complex human diseases. Science 273, 1516-1517.
Pritchard, J. K., Stephens, M., Rosenberg, N. A. and Donnelly, P. (2000) Association mapping in structured populations. Am J Hum Genet 67(1), 170-181.
Stefansson, H. et al. (2003) Association of neuregulin 1 with schizophrenia confirmed in a Scottish population. Am J Hum Genet 72(1), 83-87.
Stephens, J. C. et al. (2001) Haplotype variation and linkage disequilibrium in 313 human genes. Science 293(5529), 489-493.
Strachan, T. and Read, A. P. (2003) Human Molecular Genetics 3. BIOS Scientific Publishers Ltd, Wiley, New York.
Spielman, R. S. and Ewens, W. J. (1996) The TDT and other family-based tests for linkage disequilibrium and association. Am. J. Hum. Genet. 59, 983-989.
The International HapMap Consortium (2003) The International HapMap Project. Nature 426, 789-795.
Weiss, K. M. and Clark, A. G. (2002) Linkage disequilibrium and the mapping of complex human traits. Trends in Genetics 18, 19-24.
Pritchard, J. and Przeworski, M. (2001) Linkage disequilibrium in humans: models and data. AJHG 69, 1-14.
Pritchard and Cox (2002) The allelic architecture of human disease genes: common disease-common variant … or not. Human Molecular Genetics 11.20.2417-2
Rannala, B. and Reeve, J. P. (2001) High-resolution multipoint linkage-disequilibrium mapping in the context of a human genome sequence. AJHG 69, 159-178.
Tabor, Risch and Myers (2002) Candidate-gene approaches for studying complex genetic traits: practical considerations. Nature Reviews Genetics 3.May.1-7
Terwilliger, J. D. et al. (2002) A bias-ed assessment of the use of SNPs in human complex traits. Curr. Opin. Genetics & Development 12, 726-734.
Weiss, K. and Terwilliger, J. (2000) How many diseases does it take to map a gene with SNPs? Nature Genetics vol. 26, Oct.
Articles II
Books
Encyclopedia of the Human Genome (2003) Nature Publishing Group
Liu, J. (2001) “Monte Carlo Strategies in Scientific Computing” Springer Verlag
Ott, J. (1999) Analysis of Human Genetic Linkage, 3rd edition. Publisher: Johns Hopkins
Strachan & Read (2004) Human Molecular Genetics III Publisher: Biosciences
Weiss,K.(1993) “Genetic Variation and Human Disease” Cambridge University Press.
Web-sites
www.stats.ox.ac.uk/mcvean
Jeff Reeve and Bruce Rannala: a multipoint linkage disequilibrium disease mapping program (DMLE+) that allows genotype data to be used directly and allows estimation of allele ages.
http://dmle.org/
Liu, J.S., Sabatti, C., Teng, J., Keats, B.J.B. and N. Risch (version upgraded by Xin Lu, June/9/2002): software for the Bayesian haplotype analysis method developed by the same authors in the article Bayesian Analysis of Haplotypes for Linkage Disequilibrium Mapping. Genome Research 11:1716, 2001.
http://www.people.fas.harvard.edu/~junliu/TechRept/03folder/bladev2.tar
J. N. Madsen, M.H. Schierup, C. Storm, L. Schauser and T. Mailund: CoaSim is a tool for simulating the coalescent process with recombination and gene conversion under the assumption of exponential population growth.
http://www.birc.dk/Software/CoaSim/
Books & Www-sites