

Genetic Epidemiology 33: 604–616 (2009)

Bayesian Intervals for Linkage Locations

Ritwik Sinha,1,2 Robert P. Igo Jr,1 Shiv K. Saini,3 Robert C. Elston,1 and Yuqun Luo1*

1 Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio

2 Bristol-Myers Squibb Company, Wallingford, Connecticut

3 Department of Economics, University of Wisconsin, Madison, Wisconsin

Intermediate fine mapping has received considerable attention recently, with the goal of providing statistically precise and valid chromosomal regions for fine mapping following initial identification of broad regions that are linked to a disease. The following classes of methods have been proposed and compared in the literature: (1) LOD-support intervals, (2) generalized estimating equations, (3) bootstrap, and (4) confidence set inference framework. These methods provide confidence intervals either with coverage levels deviating from the nominal confidence levels or that are not fully efficient. Here, we propose a novel Bayesian method for constructing such intervals using affected sibling pair data. The susceptibility gene location is treated as a parameter in this method, with a uniform prior. A Metropolis-Hastings algorithm is implemented to sample from the posterior distribution and highest posterior density intervals of the disease gene locations are constructed. Correct coverage levels are maintained by our method. Both simulation studies and an application to a rheumatoid arthritis dataset demonstrate the improved efficiency of the Bayesian intervals compared with existing methods. Genet. Epidemiol. 33:604–616, 2009. © 2009 Wiley-Liss, Inc.

Key words: BILL; disease gene localization; intermediate fine mapping; model-free linkage

Additional Supporting Information can be found in the online version of this article.

Contract grant sponsor: US Public Health Service Resource; Contract grant numbers: RR03655; R01 HG003054; R01 GM28356; Contract grant sponsor: Cancer Center Support; Contract grant number: P30CAD43703; Contract grant sponsor: National Institutes of Health; Contract grant numbers: R01 GM031575; N01-AR-2-2263; R01-AR-44422; Contract grant sponsor: National Arthritis Foundation.

*Correspondence to: Dr. Yuqun Luo, Department of Epidemiology and Biostatistics, Case Western Reserve University, 10900 Euclid Ave., Cleveland, OH 44106-7281. E-mail: [email protected]

Received 31 May 2008; Revised 19 November 2008; Accepted 23 December 2008

Published online 4 February 2009 in Wiley InterScience (www.interscience.wiley.com).

DOI: 10.1002/gepi.20412

INTRODUCTION

Confidence sets for the location of a disease gene constructed after initially identifying a linkage region can help narrow the region in which fine mapping studies are conducted. Along with reducing the financial cost of follow-up studies, such confidence sets can also alleviate the multiple testing associated with fine mapping studies. Until recently, LOD-support intervals, where a 95% confidence interval (CI) is taken to be the chromosomal region where the LOD score has dropped less than 1 from the linkage peak, had been the most widely applied. Note that this CI construction approach can be applied to different types of LOD scores, including the Kong and Cox (KAC) LOD score [Kong and Cox, 1997] and the maximum LOD score (MLS) [Risch, 1990]. However, the true coverage level of 1-LOD-support intervals can vary wildly under different scenarios [Papachristou and Lin, 2006a]. Thus, several methods have been proposed recently for affected sib pairs (ASPs). These include bootstrap methods [Papachristou and Lin, 2006a], generalized estimating equations (GEE) [Liang et al., 2001], and the confidence set inference (CSI) framework [Papachristou and Lin, 2006c; Sinha and Luo, 2007a,b].

All these methods construct CIs for the unknown yet unique location of the trait locus on a broad chromosomal region that has been shown by linkage studies to potentially contain a trait locus. Before we provide a brief review of the above approaches, let us first consider the difficulty of constructing a CI based on the maximum likelihood estimate (MLE) of the disease gene location. Let $\lambda$ be the location of the disease gene and let $(z_0, z_1, z_2)$ be the probabilities of sharing 0, 1, or 2 alleles identical-by-descent (IBD) at the disease locus by an ASP ($z_0 + z_1 + z_2 = 1$). Maximization of the likelihood, under the triangle constraints ($z_1 \le \tfrac{1}{2}$ and $2z_0 \le z_1$) of Holmans [1993], provides a consistent estimator of the location of the putative gene [Risch, 1990]. In theory, this MLE, together with an estimate of its standard error, may be used to provide asymptotic CIs for the parameters. Several issues complicate this approach: (1) it is difficult to obtain an estimate of the standard error of the MLE; (2) asymptotic normality might not hold when the maximization of the likelihood is conducted under the triangle constraints [Papachristou and Lin, 2006a]; (3) asymptotic normality will be violated if the disease gene is located near or at the end of a chromosome.


As a common tool to approximate the complex sampling distributions of estimators, two bootstrap approaches have been proposed by Papachristou and Lin [2006a] to construct a CI for $\lambda$: the non-parametric bootstrap (NPB) and the parametric bootstrap (PMB). A bootstrap sample either consists of ASP units sampled with replacement from the observed sample of ASPs (under NPB), or consists of ASPs simulated using the MLEs of $(\lambda, z_0, z_1, z_2)$ obtained from the original ASP sample (under PMB). The totality of MLEs from each of the bootstrap samples constitutes an approximation to the sampling distribution of the MLE from the original sample and a CI can be obtained from this sampling distribution.

On the other hand, Liang et al. [2001] explored formal modeling of the genetic effect and the gene location. Let $\mu(\lambda)$ be the expected number of alleles shared IBD at the disease gene location $\lambda$ by a sib pair given that both sibs are affected. Suppose there are $k$ markers. Let $M_i$ be the map location of marker $i$ and let $\theta_i(\lambda)$ be the recombination fraction between this marker and $\lambda$. The GEE approach views the imputed IBD sharing at the $k$ markers for an ASP as repeated measures with a mean function

$$
\mu(M_i; \lambda) = 1 + (1 - 2\theta_i(\lambda))^2 (\mu(\lambda) - 1) = 1 + C\,(1 - 2\theta_i(\lambda))^2, \quad i = 1, 2, \ldots, k,
$$

that is parameterized as a function of the gene effect ($C$) and the gene location ($\lambda$). Solving the GEE provides an estimate of $(C, \lambda)$ and the associated standard errors. CIs can be constructed by invoking normality as an approximation of the sampling distribution of the estimates.
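As a rough illustration of the GEE mean function, the sketch below evaluates $\mu(M_i; \lambda)$ at a few marker positions. The text does not say which map function converts map distance to the recombination fraction $\theta_i(\lambda)$; the Haldane map function is assumed here, and the function names, marker positions, and the value $C = 0.2$ are hypothetical choices made for the example.

```python
import math

def haldane_theta(distance_cm):
    """Recombination fraction for a map distance in cM.

    ASSUMPTION: Haldane map function; the source does not name one.
    """
    return 0.5 * (1.0 - math.exp(-2.0 * abs(distance_cm) / 100.0))

def expected_ibd_sharing(marker_cm, locus_cm, c):
    """Mean function mu(M_i; lambda) = 1 + C * (1 - 2 * theta_i(lambda))**2."""
    theta = haldane_theta(marker_cm - locus_cm)
    return 1.0 + c * (1.0 - 2.0 * theta) ** 2

# Hypothetical marker map (cM), trait locus at 30 cM, gene effect C = 0.2.
markers = [20.0, 25.0, 30.0, 35.0, 40.0]
means = [expected_ibd_sharing(m, 30.0, 0.2) for m in markers]
```

Under these assumptions the expected sharing peaks at the marker that coincides with the trait locus and decays symmetrically toward the null value of 1 as the marker moves away.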

In contrast to the bootstrap and GEE approaches, which focus on approximating the sampling distributions of the location estimates, the CSI framework [Papachristou and Lin, 2006c] employs yet another general approach to CI construction: the duality of CI construction and hypothesis testing. Specifically, each genomic location within the candidate region is tested under the null hypothesis that it is the disease gene location, using a level $\alpha$ statistical test. A $(1 - \alpha)$ CI then consists of all the genomic locations where the null is accepted. In order to carry out a CSI approach, the IBD sharing probabilities at the trait locus, $(z_0, z_1, z_2)$, or a function of them, need to be known. Thus, a two-step procedure is often adopted: estimates of $(z_0, z_1, z_2)$ from the first step are used as known in the second, CI construction, step. Counterparts to various traditional linkage test statistics (under the null of no linkage) can be used as the test statistics in the CSI framework: using the MLS test statistics (CSI-MLS) results in some improvement over the use of the Mean statistics in the original CSI (CSI-Mean) [Sinha and Luo, 2007b].

With all these recent developments, Papachristou and Lin [2006a] conducted a thorough comparison of their performance. This revealed the need for improvement: the GEE and the 1-LOD-support intervals can be very liberal, especially for relatively small sample sizes and moderate genetic effects attributable to the trait locus; CSI-Mean is too conservative and thus results in longer intervals than justified; and the bootstrap methods are not efficient even though the nominal coverage is well maintained. Though most methods, notably the 1-LOD-support, improve considerably when the sample size is sufficiently large, the linkage data for complex traits that are of interest nowadays will almost certainly fall within the scenarios where a more efficient method is desirable.

One attractive alternative to the above approaches is to utilize Risch's likelihood in a Bayesian framework for estimating the location of the disease gene. Under certain conditions, a Bayesian credible interval is asymptotically equivalent to a CI of the same level [Gelman et al., 2004]. Noting the similarity of estimating a trait locus under certain idealized models (affected half-sibling design, completely informative markers, etc.) to the general class of problems of estimating a change-point, which has been studied extensively, Siegmund [1998] suggested that a uniform prior on the disease gene location provided satisfactory confidence regions even for small sample sizes. Application of the uniform prior to quantitative trait loci mapping in experimental crosses confirmed this expectation [Dupuis and Siegmund, 1999]. In what follows, we first define our Bayesian model with a uniform prior on $\lambda$ and describe an algorithm for sampling from the posterior distribution. Extensive simulation studies and an application of our Bayesian Intervals for Linkage Locations (BILL) program to a rheumatoid arthritis (RA) dataset follow.

METHODS

MODEL

The data consist of $n$ independent ASPs. Let $G_i$ be the genotypes of all markers in the candidate chromosomal region under consideration for the $i$th ASP (and possibly other family members), $i = 1, \ldots, n$. The main parameter of interest is $\lambda$, the chromosomal position of the susceptibility locus. To completely specify the probability of the observed data, the probabilities of IBD sharing are needed as additional parameters. Specifically, $z_j = \Pr(\lambda\mathrm{IBD}_i = j \mid \mathrm{ASP})$, $j = 0, 1, 2$, where $\lambda\mathrm{IBD}_i$ denotes the number of alleles shared IBD by the $i$th pair of siblings at position $\lambda$. Let $\pi(\lambda, z_0, z_1, z_2)$ be the prior distribution of the parameter vector. Our inference is based on the posterior distribution

$$
\pi(\lambda, z_0, z_1, z_2 \mid \mathrm{Data}) \propto \pi(\lambda, z_0, z_1, z_2)\,\Pr(\mathrm{Data} \mid \lambda, z_0, z_1, z_2). \tag{1}
$$

The probability of the data given the parameter vector or, equivalently, the likelihood of the parameter vector given the observed data, needs to be evaluated for all possible values of the parameter vector, which is computationally impractical. For computational reasons, we allow $\lambda$ to take on values only from a fine grid of positions (say every 0.25 cM) within the candidate region. Further manipulation of the likelihood [Risch, 1990; Sinha and Luo, 2007a] greatly reduces the computational demand and takes advantage of quantities that are readily available from several well-tested software packages. Specifically,

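$$
\begin{aligned}
\Pr(\mathrm{Data} \mid \lambda, z_0, z_1, z_2)
&= \prod_{i=1}^{n} \Pr(G_i \mid \lambda, z_0, z_1, z_2) \\
&= \prod_{i=1}^{n} \sum_{j=0}^{2} \Pr(G_i \mid \lambda\mathrm{IBD}_i = j)\, z_j \\
&= \prod_{i=1}^{n} \Pr(G_i) \sum_{j=0}^{2} \frac{\Pr(\lambda\mathrm{IBD}_i = j \mid G_i)}{\Pr(\lambda\mathrm{IBD}_i = j)}\, z_j \\
&\propto \prod_{i=1}^{n} \sum_{j=0}^{2} \frac{\Pr(\lambda\mathrm{IBD}_i = j \mid G_i)}{\Pr(\lambda\mathrm{IBD}_i = j)}\, z_j.
\end{aligned}
\tag{2}
$$

We are able to omit $\Pr(G_i)$ from the last expression, since it is not needed in the Markov chain Monte Carlo (MCMC) sampling of the posterior in (1) (details will be given later).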


Note that neither the denominator nor the numerator of $\Pr(\lambda\mathrm{IBD}_i = j \mid G_i)/\Pr(\lambda\mathrm{IBD}_i = j)$ in (2) is conditioned on the fact that both siblings are affected. That is, the $\Pr(\lambda\mathrm{IBD}_i = j)$ are $\tfrac{1}{4}, \tfrac{1}{2}, \tfrac{1}{4}$ for $j = 0, 1, 2$, respectively. Hidden Markov models and algorithms have been used by several genetic epidemiology software packages to compute $\Pr(\lambda\mathrm{IBD}_i = j \mid G_i)$ (e.g., S.A.G.E. [2008], Merlin [Abecasis et al., 2002], and Allegro [Gudbjartsson et al., 2000]). These quantities need to be computed only once for each possible value of $\lambda$, and can then be combined with any value of the $z_j$ in equation (2) to provide the corresponding likelihood.
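To make the computational shortcut in (2) concrete, here is a minimal sketch of the log-likelihood at a single grid position. It assumes the multipoint probabilities $\Pr(\lambda\mathrm{IBD}_i = j \mid G_i)$ have already been exported from an IBD program and loaded as one triple per ASP; the function name and the toy numbers are hypothetical, not part of the BILL program.

```python
import math

# Pr(lambda-IBD_i = j) for a sib pair, j = 0, 1, 2 (not conditioned on affection).
PRIOR_IBD = (0.25, 0.5, 0.25)

def log_likelihood(ibd_post, z):
    """Log of the last line of expression (2), up to an additive constant.

    ibd_post: one (p0, p1, p2) triple per ASP, Pr(IBD = j | G_i) at this position.
    z: (z0, z1, z2), IBD sharing probabilities for an ASP at the trait locus.
    """
    weights = [zj / pj for zj, pj in zip(z, PRIOR_IBD)]
    total = 0.0
    for probs in ibd_post:
        total += math.log(sum(p * w for p, w in zip(probs, weights)))
    return total

# Toy data: two pairs with excess sharing at this position.
toy_pairs = [(0.05, 0.35, 0.60), (0.10, 0.40, 0.50)]
ll_null = log_likelihood(toy_pairs, PRIOR_IBD)
ll_linked = log_likelihood(toy_pairs, (0.20, 0.50, 0.30))
```

Because the IBD triples depend only on the position, they can be cached per grid point and reused for every $(z_0, z_1, z_2)$ visited by the sampler.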

The parameterization in (1) and (2) with the $z_j$ is straightforward, but the $z_j$ should conform to Holmans' triangle constraints ($z_1 \le \tfrac{1}{2}$ and $2z_0 \le z_1$). This poses difficulties for the specification of appropriate prior distributions and the MCMC sampling from the posterior distribution. After some experimentation, we found that a practical and equivalent alternative is to use the genetic variance components. Consider a single-locus disease model. Let $K$ be the prevalence of the disease, $\alpha = V_A/[K(1-K)]$ and $\delta = V_D/[K(1-K)]$, where $V_A$ is the additive genetic variance and $V_D$ is the dominance genetic variance. Then [James, 1971]

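$$
\begin{cases}
z_0(\alpha, \delta) = \dfrac{K/(1-K)}{4K/(1-K) + 2\alpha + \delta}, \\[2ex]
z_1(\alpha, \delta) = \dfrac{2K/(1-K) + \alpha}{4K/(1-K) + 2\alpha + \delta}, \\[2ex]
z_2(\alpha, \delta) = \dfrac{K/(1-K) + \alpha + \delta}{4K/(1-K) + 2\alpha + \delta}.
\end{cases}
\tag{3}
$$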

Substituting $(K, \alpha, \delta)$ for $(z_0, z_1, z_2)$ in (1) and (2), it is then easier to specify sound priors for $(\alpha, \delta)$ and to sample from the posterior distribution of these parameters, together with $\lambda$, using MCMC. However, the practical parameterization comes with a price; the data, consisting only of ASPs, provide no information on $K$. Thus, we assume $K$ is known without uncertainty. Because accurate prior estimates of $K$ are not always available, we investigate the effect of misspecification of $K$ on the performance of our method later.

The alternative parameterization might seem unnecessarily contrived at first sight. It might further be argued that it is better to put a prior on $K$ rather than fixing it at a more or less arbitrary value. First note that $K(1-K)$ is the phenotypic variance for a binary trait. Thus $\alpha$ is the narrow sense heritability under a single-locus disease model. Both $\alpha$ and $\delta$ lie between 0 and 1. It is evident that the $z_j$ as defined in (3) automatically satisfy Holmans' triangle constraints for any $\alpha$ and $\delta$ in this range. In addition, it is the relative magnitudes of $K/(1-K)$, $\alpha$, and $\delta$ that determine the $z_j$, not their absolute magnitudes. Thus, allowing $K$ to vary would create a non-identifiability problem.
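The two claims in this paragraph (the triangle constraints hold automatically, and only relative magnitudes matter) can be checked numerically with a short sketch of equation (3). The function names and the parameter values below are illustrative assumptions.

```python
def ibd_sharing_probs(k, alpha, delta):
    """(z0, z1, z2) from prevalence K and variance ratios (alpha, delta), equation (3)."""
    r = k / (1.0 - k)
    denom = 4.0 * r + 2.0 * alpha + delta
    return (r / denom, (2.0 * r + alpha) / denom, (r + alpha + delta) / denom)

def satisfies_triangle(z, tol=1e-12):
    """Holmans' triangle constraints: z1 <= 1/2 and 2*z0 <= z1."""
    z0, z1, _ = z
    return z1 <= 0.5 + tol and 2.0 * z0 <= z1 + tol

z = ibd_sharing_probs(0.05, 0.10, 0.05)

# Doubling K/(1-K), alpha, and delta together leaves the z's unchanged,
# which is the non-identifiability that arises if K is allowed to vary.
r2 = 2.0 * 0.05 / (1.0 - 0.05)
k2 = r2 / (1.0 + r2)
z_scaled = ibd_sharing_probs(k2, 0.20, 0.10)
```

With $\alpha = \delta = 0$ the function returns the null sharing probabilities $(\tfrac{1}{4}, \tfrac{1}{2}, \tfrac{1}{4})$, which is a convenient sanity check.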

Because $K$, $\alpha$, and $\delta$ can be interpreted as anything that is proportional to a baseline configuration, anchoring $K$ at roughly the prevalence enables selecting the appropriate priors on $\alpha$ and $\delta$ such that the configuration of the $z_j$ is likely to be supported by the data. This is a clear advantage over the original parameterization. At times of doubt, one should intuitively fix $K$ at the lowest reasonable guess, since the prior we put on $\alpha$ and $\delta$ (a beta distribution for $\alpha + \delta$) does not allow them to go over 1, but there is substantial probability for them to be quite close to 0. This suggestion has been corroborated by later results on the effect of "misspecifying" $K$.

Alternatively, one might even obtain the MLE of the $z_j$ from the data and make an educated selection of $K$ so that the corresponding $\alpha$ and $\delta$ solved from the MLE of the $z_j$ would lie in roughly the center of their prior distribution (see below). Lastly, the parameterization in terms of (3) is necessary when, in addition to ASPs, the data contain other types of affected relative pairs (e.g., uncle-nephew and half-sib). The allele sharing probabilities at the trait locus for different types of affected relative pairs will have some functional relationship that is better represented by using $K$, $\alpha$, and $\delta$.

PRIOR DISTRIBUTIONS

To complete the model specification in (1), the prior distribution, $\pi(\lambda, \alpha, \delta)$, is needed. Let $\lambda$ be independent of $(\alpha, \delta)$. Following Siegmund [1998], the prior of $\lambda$ is chosen to be a uniform distribution on a fine grid of positions within the candidate region, which is every 0.25 cM in our current implementation.

For the prior on $\alpha$ and $\delta$, we first provide a prior for $\gamma = \alpha + \delta$, the locus-specific broad sense heritability. A natural and convenient prior is a beta distribution with shape parameters $\beta_1$ and $\beta_2$. Conditional on $\gamma$, we let the prior on $\alpha$ be a uniform distribution between 0 and $\gamma$. Transformation leads to the joint prior on $(\alpha, \delta)$

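$$
\pi(\alpha, \delta; \beta_1, \beta_2) = \frac{1}{B(\beta_1, \beta_2)}\,(\alpha + \delta)^{\beta_1 - 2}\,(1 - \alpha - \delta)^{\beta_2 - 1}, \quad 0 \le \alpha, \delta; \ \alpha + \delta \le 1, \tag{4}
$$

where $B(\beta_1, \beta_2)$ is the beta function.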

Choice of hyperparameters. Judicious choice of the hyperparameters $(\beta_1, \beta_2)$ may be important for the performance of our method. To this end, we conducted a survey on the possible range of locus-specific heritabilities of complex traits that our method is designed for. The heritability of RA was estimated to be 65% in a Finnish population and 53% in a UK population [MacGregor et al., 2000]. About one-third of this heritability may be attributable to the HLA locus on chromosome 6 [Deighton et al., 1989]. This places the locus-specific heritability of the HLA locus for RA between 0.17 and 0.22. An estimate of the heritability of early age-related maculopathy is 45% [Hammond et al., 2002], which places heritability for a major locus in the range of 0.1–0.2 if there is one locus that contributes a significant fraction to this heritability. These heritability estimates, together with findings for a wide variety of complex traits in diverse populations, led us to postulate that a reasonable prior for $\gamma$ should take values between 0 and 0.45, with a peak around 0.15, when $K$ is specified as the population prevalence. We chose $\beta_1 = 3$ and $\beta_2 = 13$, which results in a mode of 0.14, a mean of 0.19, and a standard deviation of 0.095. The 99% quantile is 0.45. Sensitivity analysis (section "Results") showed that the estimation of the susceptibility locus $\lambda$ is not sensitive to the choice of $\beta_1$ and $\beta_2$.
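The quoted summaries of the Beta(3, 13) prior can be reproduced with the standard beta-distribution formulas. The sketch below uses only the Python standard library; the function names are hypothetical, and the closed-form CDF relies on the shape parameters being integers.

```python
import math

def beta_summary(b1, b2):
    """Mode, mean, and standard deviation of Beta(b1, b2), for b1, b2 > 1."""
    mode = (b1 - 1.0) / (b1 + b2 - 2.0)
    mean = b1 / (b1 + b2)
    var = b1 * b2 / ((b1 + b2) ** 2 * (b1 + b2 + 1.0))
    return mode, mean, math.sqrt(var)

def beta_cdf_int(x, b1, b2):
    """CDF of Beta(b1, b2) for INTEGER shapes, via the binomial identity
    I_x(a, b) = Pr(Bin(a + b - 1, x) >= a)."""
    n = b1 + b2 - 1
    return sum(math.comb(n, j) * x ** j * (1.0 - x) ** (n - j)
               for j in range(b1, n + 1))

mode, mean, sd = beta_summary(3, 13)
mass_below_045 = beta_cdf_int(0.45, 3, 13)
```

Here `mode`, `mean`, and `sd` round to the values stated above, and `mass_below_045` comes out close to 0.99, in line with the reported 99% quantile.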

SAMPLING FROM THE POSTERIOR USING MARKOV CHAIN MONTE CARLO

The inference of $\lambda$ is based on a sample drawn from the posterior distribution of $(\lambda, \alpha, \delta)$. A Metropolis-Hastings algorithm [Gilks et al., 1996] was employed to construct the Markov chain from which the posterior was drawn. We briefly review a generic Metropolis-Hastings algorithm here. Let $x$ be the observed data and $\theta$ be the parameter vector. In a Bayesian framework, the posterior, $\pi(\theta \mid x) \propto \pi(\theta)\Pr(x \mid \theta)$, is the target distribution that we wish to draw a sample from. A proposal distribution $q(\theta' \mid \theta_t)$ is required, which proposes the next possible update of the parameter ($\theta'$) when the current value of the parameter is $\theta_t$. The newly proposed $\theta'$ will be accepted as the state of the chain in the next iteration with probability

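$$
A(\theta_t, \theta') = \min\left\{1, \frac{\pi(\theta' \mid x)\, q(\theta_t \mid \theta')}{\pi(\theta_t \mid x)\, q(\theta' \mid \theta_t)}\right\}
= \min\left\{1, \frac{\pi(\theta')\Pr(x \mid \theta')\, q(\theta_t \mid \theta')}{\pi(\theta_t)\Pr(x \mid \theta_t)\, q(\theta' \mid \theta_t)}\right\}. \tag{5}
$$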

Otherwise the chain will take the current realization of $\theta$ in the next step. Note that the acceptance probability is a ratio, thus the posterior needs only be known up to a constant, as does the probability of the observed data. This is the basis for the manipulation of the likelihood expression in (2) to improve computational efficiency. We start the chain with an arbitrary value $\theta_0$ for $\theta$. Then a candidate $\theta'$ is sampled from $q(\theta' \mid \theta_0)$ and

$$
\theta_1 =
\begin{cases}
\theta' & \text{with probability } A(\theta_0, \theta'), \\
\theta_0 & \text{otherwise.}
\end{cases}
$$

Further realizations of $\theta$ are then similarly sampled in subsequent iterations. When the chain is run long enough, $\theta_0, \theta_1, \ldots, \theta_n$ will form a correlated sample from the target distribution $\pi(\theta \mid x)$ and inference on $\theta$ can be based on this sample.
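The generic algorithm reviewed above can be written in a few lines. The following is a minimal sketch, not the BILL implementation: the function and argument names are hypothetical, the computation is done on the log scale for numerical stability, and the usage example targets a standard normal density purely for illustration.

```python
import math
import random

def metropolis_hastings(log_target, propose, log_q, theta0, n_iter, rng):
    """Generic Metropolis-Hastings sampler using acceptance probability (5).

    log_target(theta): log of pi(theta) * Pr(x | theta), up to a constant.
    propose(theta, rng): draws a candidate theta' from q(. | theta).
    log_q(a, b): log q(a | b); may return 0.0 for a symmetric proposal.
    """
    chain = [theta0]
    theta = theta0
    for _ in range(n_iter):
        cand = propose(theta, rng)
        log_ratio = (log_target(cand) + log_q(theta, cand)
                     - log_target(theta) - log_q(cand, theta))
        # Accept with probability min(1, ratio); otherwise keep the current state.
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = cand
        chain.append(theta)
    return chain

# Illustration only: standard normal target with a symmetric random-walk proposal.
rng = random.Random(0)
chain = metropolis_hastings(
    log_target=lambda t: -0.5 * t * t,
    propose=lambda t, r: t + r.gauss(0.0, 1.0),
    log_q=lambda a, b: 0.0,
    theta0=0.0,
    n_iter=20000,
    rng=rng,
)
```

Only differences of `log_target` enter the acceptance step, which is why the normalizing constant of the posterior, and the factor $\Pr(G_i)$ in (2), can be dropped.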

Proposal distribution. After some experimenting, we adopted a proposal distribution for our problem at hand that seems to perform well. Each iteration consists of two steps, where $\lambda$ and $(\alpha, \delta)$ are updated separately. The proposal for a new $\lambda$ is a uniform distribution on seven points in the grid of possible chromosomal positions, consisting of the current location and three immediately adjacent locations on either side of the current location. The proposal distribution $q(\alpha', \delta' \mid \alpha_t, \delta_t)$ consists of independent proposals $\alpha' \mid \alpha_t \sim N(\alpha_t, \sigma_0^2)$ and $\delta' \mid \delta_t \sim N(\delta_t, \sigma_0^2)$. We update $\alpha$ and $\delta$ simultaneously out of two considerations. First, these two parameters represent the proportions of trait variance attributable to the additive and dominance components at the locus. Given a fixed heritability, these two parameters are negatively correlated. One way to overcome possible poor mixing of the Markov chain when parameters in a Bayesian problem are correlated is to update all parameters simultaneously [Gilks et al., 1996]. Second, joint updating speeds up the chain because the requirement to compute likelihoods is only half that required for separate updating. Typical values of $\alpha$ and $\delta$ will be in the range (0, 0.45), thus we let $\sigma_0 = 0.01$, which seems to produce well-mixing chains. For each dataset, the Markov chain was run for 50,000 iterations, after allowing for a burn-in of 5,000 iterations (see section "Results" for convergence diagnostics that led to these choices). The output of every 50th iteration was retained to provide a practically independent sample from the posterior distribution. Using this sample, highest posterior density credible intervals were constructed for $\lambda$ [Chen and Shao, 1999] and the posterior mode was used as a point estimate of the gene location.
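A rough sketch of the two proposal steps, the thinning schedule, and an empirical highest posterior density interval follows. Several details are assumptions rather than statements from the text: proposals that fall outside the grid or outside the support of the prior are returned as `None` (to be rejected by the caller), the interval is taken as the shortest window containing the required share of sorted draws, which presumes a unimodal posterior, and all function names are hypothetical.

```python
import math
import random

def propose_position(idx, n_grid, rng):
    """Uniform proposal on the current grid index and three neighbours on each side.
    ASSUMPTION: off-grid candidates are returned as None and rejected."""
    cand = idx + rng.randint(-3, 3)
    return cand if 0 <= cand < n_grid else None

def propose_variances(alpha, delta, rng, sigma0=0.01):
    """Joint update: independent normal steps for alpha and delta.
    ASSUMPTION: candidates outside the prior support are returned as None."""
    a, d = rng.gauss(alpha, sigma0), rng.gauss(delta, sigma0)
    return (a, d) if a >= 0.0 and d >= 0.0 and a + d <= 1.0 else None

def thin(chain, burn_in=5000, step=50):
    """Drop the burn-in and keep every `step`-th draw."""
    return chain[burn_in::step]

def hpd_interval(samples, level=0.95):
    """Shortest interval holding at least `level` of the sorted draws."""
    xs = sorted(samples)
    n = len(xs)
    m = max(1, math.ceil(level * n))  # draws the interval must contain
    best = min(range(n - m + 1), key=lambda i: xs[i + m - 1] - xs[i])
    return xs[best], xs[best + m - 1]

# Toy posterior draws of the location (cM), concentrated near 30 cM.
draws = [29.5] * 3 + [30.0] * 94 + [45.0] * 3
interval = hpd_interval(draws)
```

With a burn-in of 5,000 and 50,000 further iterations, `thin` keeps 1,000 draws per chain under this schedule.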

SIMULATION STUDY

SIMULATION SETTING

Disease Models. To compare the BILL approach to other available methods for intermediate fine mapping, we carried out simulation studies employing three single-locus and eight two-locus disease models. In the two-locus models, the trait loci were assumed to segregate independently and be in linkage equilibrium. The trait locus (loci) in each model is diallelic and follows Hardy-Weinberg proportions. The three one-locus disease models (Table I) have prevalences ranging from 4.5 to 8.5% and heritabilities ranging from 0.094 to 0.143. The modes of inheritance include Recessive, Intermediate, and Dominant. These models represent characteristics observed in complex traits of interest today. The Dominant model has been investigated in other studies [Papachristou and Lin, 2006b; Sinha and Luo, 2007b].

TABLE I. Single-locus disease models for the simulation study

Model         K      H = VG/VT (a)  VA/VT (a)  VD/VT (a)  PD (b)  fdd (b)  fDd (b)  fDD (b)
Recessive     0.045  0.094          0.014      0.08       0.08    0.04     0.040    0.840
Intermediate  0.058  0.110          0.050      0.06       0.10    0.04     0.100    0.800
Dominant      0.085  0.143          0.143      0.00       0.01    0.07     0.827    0.865

(a) VT, total variance of the trait; VG, genetic variance of the trait; VA, additive genetic variance; VD, dominance genetic variance; H, broad-sense heritability.
(b) D, high-risk allele; d, low-risk allele; PD, allele frequency of D; (fDD, fDd, fdd), penetrances for genotypes (DD, Dd, dd).

In addition to a linked major locus, given a set of ASPs, for many complex diseases of interest today we expect other factors, genetic and/or environmental, to contribute to disease susceptibility. To investigate the performance of the intermediate fine mapping methods under a more realistic setting, additional data were simulated with eight two-locus disease models. These are a subset of the 12 models considered by Knapp et al. [1994]. They chose the allele frequencies and penetrances for each model such that the prevalence is roughly 0.1, the total offspring risk is 0.3, and the two loci contribute equally to the trait prevalence. Only four of the models are not symmetric in the two loci and have been considered in the intermediate fine mapping context (e.g., Papachristou and Lin, 2006c; Sinha and Luo, 2007b). These four models were termed EP-2 (epistatic recessive-dominant), HET-2 (heterogeneous recessive-dominant), S-2 (special case of HET-2 with the second locus fully penetrant), and EP-4 (epistatic). In addition to these four two-locus models, we considered another four two-locus models in Knapp et al. [1994] that are symmetric in the two trait loci: EP-1 (epistatic dominant-dominant), HET-1 (heterogeneous dominant-dominant), EP-3 (epistatic recessive-recessive), and HET-3 (heterogeneous recessive-recessive).

Let A and a be the high-risk and low-risk alleles, respectively, at the 1st locus and similarly define B and b at the 2nd locus. Let $p_1$ and $p_2$ be the frequency of the high-risk alleles at the two loci, respectively. Among the eight two-locus models, four are heterogeneous: HET-1, HET-2, S-2, and HET-3. Each of these models is determined by two parameters: $f_1$ as the penetrance of two-locus genotypes where the 1st locus genotype is high risk and the 2nd locus genotype is low risk, and $f_2$ as the penetrance of two-locus genotypes where the 2nd locus genotype is high risk and the 1st locus genotype is low risk. Penetrance for genotypes that are high risk at both loci is $\varphi = f_1 + f_2 - f_1 f_2$, while it is zero for those that are low risk at both loci. For example, under the heterogeneous recessive-dominant (HET-2) model, penetrance is $f_1$ for genotype AAbb, $f_2$ for (AaBB, aaBB, AaBb, aaBb), $\varphi$ for (AABB, AABb), and zero otherwise. These heterogeneous models are approximately additive when both $f_1$ and $f_2$ are small. The other models are epistatic, each determined by a single parameter $\varphi$, the common penetrance of high-risk genotypes. EP-1, EP-2, and EP-3 are all multiplicative. For example, under EP-2 (epistatic recessive-dominant), the penetrance is $\varphi$ for genotypes made up of AA at the 1st locus and BB or Bb at the 2nd locus, and is zero for all the other genotypes. Lastly, the high-risk genotypes under EP-4 include AABB, AABb, AaBB, and aaBB. The other models can be similarly visualized from their suggestive names. The parameters $(p_1, p_2, \varphi)$ are (0.6, 0.199, 0.778) under EP-2, and (0.372, 0.243, 0.911) under EP-4, respectively. Under HET-2, $p_1 = 0.279$, $p_2 = 0.04$, and $f_1 = f_2 = 0.66$. Under S-2, $p_1 = 0.228$, $p_2 = 0.045$, and $f_2 = 0.66$. The other models are symmetric and each is determined by a single allele frequency and a single penetrance parameter; these are (0.210, 0.707) for EP-1, (0.577, 0.9) for EP-3, (0.053, 0.495) for HET-1, and (0.194, 1) for HET-3.
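As a consistency check on the model parameters, the prevalence implied by the allele frequencies and penetrances can be recomputed under Hardy-Weinberg proportions and independent loci, as stated for the simulation. This is a small sketch with hypothetical function names; the inputs are the Table I values and the HET-2 parameters quoted above.

```python
def single_locus_prevalence(p, f_dd, f_Dd, f_DD):
    """Prevalence K for a diallelic locus in Hardy-Weinberg proportions.
    p is the frequency of the high-risk allele D."""
    q = 1.0 - p
    return q * q * f_dd + 2.0 * p * q * f_Dd + p * p * f_DD

def het2_prevalence(p1, p2, f1, f2):
    """Prevalence under HET-2 (recessive at locus 1, dominant at locus 2)."""
    high1 = p1 ** 2                  # AA
    high2 = 1.0 - (1.0 - p2) ** 2    # BB or Bb
    phi = f1 + f2 - f1 * f2          # penetrance when both loci are high risk
    return (f1 * high1 * (1.0 - high2)
            + f2 * (1.0 - high1) * high2
            + phi * high1 * high2)

k_recessive = single_locus_prevalence(0.08, 0.04, 0.040, 0.840)
k_intermediate = single_locus_prevalence(0.10, 0.04, 0.100, 0.800)
k_dominant = single_locus_prevalence(0.01, 0.07, 0.827, 0.865)
k_het2 = het2_prevalence(0.279, 0.04, 0.66, 0.66)
```

The three single-locus values agree with the K column of Table I to three decimals, and the HET-2 value is close to the target prevalence of roughly 0.1.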

Datasets of 100, 250, and 500 ASPs were simulated under the one-locus models and datasets of 500 ASPs were simulated under the two-locus models.

Marker maps. For one-locus models, a chromosome of 60 cM was simulated with one SNP (minor allele frequency 0.3) every 0.25 cM. Linkage equilibrium was assumed among all loci, including the trait locus. The trait locus was placed at 30.125 cM. Both disease chromosomes had the same characteristics as described above when two-locus models were considered. The genotype data of the parents of each ASP were available unless otherwise noted.

Confidence set construction. Confidence sets of the disease susceptibility locus were constructed, for each set of simulated ASPs, using five methods: (1) BILL, (2) PMB [Papachristou and Lin, 2006a], since it was shown that the parametric version performed better than the non-parametric version, (3) CSI-MLS [Sinha and Luo, 2007b], (4) LOD-support interval with 1-LOD drop using the KAC LOD score, and (5) GEE [Liang et al., 2001]. Since all these methods were proposed as intermediate fine mapping tools, we shall employ these methods only in regions that have shown evidence of linkage. Specifically, we divided the marker data into two maps, a step-one map and a step-two map, each consisting of half of the SNPs with 0.5 cM spacing. Confidence sets were constructed using the step-two map only when the KAC LOD score using data from the step-one map exceeded 2.33, a threshold commonly used to flag suggestive linkage. We note that it is commonly known now [e.g., Papachristou and Lin, 2006a] that using the totality of the two maps to construct CIs in the 2nd step would yield essentially the same result. Furthermore, Papachristou and Lin [2006a] suggested focusing the CI construction effort to within 25 cM on either side of the linkage peak. The size of the candidate region should be dictated by the premise that there is only one trait locus within the region. Most of the five methods run sufficiently fast that markers on the entire chromosome can be used to construct a CI if there is no danger of having more than one trait locus. BILL, GEE, and CSI-MLS, in their current implementations, require input files from other software (e.g., Merlin, S.A.G.E. GENIBD) that contain IBD sharing probabilities at all potential disease gene locations. The size of these IBD files limits their applications to candidate regions of no more than 100 cM. The candidate region is the entire 60 cM in the simulated data. For BILL, $K$ was given the value of the population prevalence of the disease under the simulation model. For CSI-MLS, the step-one map is used to estimate the MLEs of the $z_j$ to be supplied to the 2nd, CI construction, step.

RESULTS

SENSITIVITY ANALYSIS

To investigate the sensitivity of the posterior distribution to the choice of prior distribution on (a, d), we simulated two datasets from the Intermediate disease model (Table I), one with 100 ASPs and the other with 500 ASPs. Posterior distributions from five different beta priors applied to these two datasets are summarized in Table II. The inference on the main parameter of interest, the position of the susceptibility locus l, is not sensitive to the choice of prior distribution, even with only 100 ASPs. While the posterior distributions of a and d show some sensitivity to prior choice when the sample size is small, the influence is minimal.
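The prior means and standard deviations reported in Table II follow directly from the beta hyperparameters (b1, b2). As a minimal sketch, assuming the usual Beta(b1, b2) parameterization, they can be reproduced as follows; the function name is a hypothetical one chosen for this example.

```python
import math


def beta_mean_sd(b1, b2):
    """Mean and standard deviation of a Beta(b1, b2) distribution."""
    total = b1 + b2
    mean = b1 / total
    variance = b1 * b2 / (total ** 2 * (total + 1))
    return mean, math.sqrt(variance)


# The five hyperparameter pairs listed in Table II
for b1, b2 in [(1, 9), (2, 8), (2, 13), (3, 12), (1, 1)]:
    mean, sd = beta_mean_sd(b1, b2)
    print(f"Beta({b1}, {b2}): mean = {mean:.2f}, SD = {sd:.2f}")
```

The last pair, Beta(1, 1), is the uniform distribution on (0, 1), so the five priors range from quite concentrated near small values to entirely flat.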

CONVERGENCE DIAGNOSTICS

A chain should have migrated into the target distribution for the inference based on the chain sample to be valid. While convergence characteristics for each chain will

TABLE II. Sensitivity of posterior inference to choice of hyperparameters

         Prior distribution        Posterior summary
                                   l (cM)          a                d
# ASP    b1   b2   Mean   SD      Mean    SD      Mean    SD       Mean    SD
100       1    9   0.10   0.09    26.44   3.01    0.091   0.060    0.064   0.042
          2    8   0.20   0.12    25.90   3.51    0.121   0.068    0.062   0.046
          2   13   0.13   0.08    26.67   3.11    0.097   0.058    0.092   0.059
          3   12   0.20   0.10    26.13   2.26    0.118   0.061    0.095   0.052
          1    1   0.50   0.29    26.74   2.95    0.127   0.076    0.049   0.037
500       1    9   0.10   0.09    29.95   0.62    0.089   0.026    0.036   0.019
          2    8   0.20   0.12    30.01   0.81    0.080   0.025    0.033   0.018
          2   13   0.13   0.08    29.94   0.69    0.076   0.024    0.035   0.018
          3   12   0.20   0.10    29.99   0.80    0.081   0.025    0.035   0.017
          1    1   0.50   0.29    29.99   0.69    0.083   0.025    0.028   0.017

ASP, affected sib pair.


be different, we performed some convergence diagnostics to decide, a priori, on the required running length. Our test dataset, for this purpose, consisted of 100 ASPs simulated under the Intermediate model (Table I). First, visual inspection of an extremely long chain suggested that the chain converged after around 5,000 iterations. A formal diagnostic was also performed. The Potential Scale Reduction Factor (PSRF) [Gelman and Rubin, 1992] is the square root of the ratio of the "between" and "within" variances, computed from several independent Markov chain sequences started from an initial sample of a distribution overdispersed relative to the target distribution. When each chain has been run indefinitely, the distribution of every sequence will be the same as the target distribution and thus the PSRF equals 1. Because of the overdispersion of the starting points, the PSRF will be greater than 1 when the chains have not yet converged to the target distribution. Gelman et al. [2004] suggest that a PSRF less than 1.1 indicates convergence. The PSRFs of the three parameters were computed based on four chains, each of 200,000 iterations and started from an overdispersed distribution. Since a and d are proportions, a "logit" transform was applied to them. The PSRF of (l, a, d) is (1.00, 1.03, 1.03) at iteration 10,000, reduces to (1.00, 1.00, 1.02) at iteration 50,000, and stays stable thereafter. Therefore, a sample of 50,000 iterations following a burn-in of 5,000 iterations was retained for inference based on each dataset.
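The PSRF computation described above can be sketched as follows. This is a minimal illustration using the common Gelman-Rubin form, in which a pooled variance estimate (a weighted combination of within-chain and between-chain variance) is divided by the within-chain variance; it is an assumption that this matches the exact variant used for BILL. The function name and the simulated toy chains are hypothetical.

```python
import random
import statistics


def psrf(chains):
    """Potential scale reduction factor for one scalar parameter.

    chains: list of equal-length lists of draws, one list per chain.
    """
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    # W: average of the within-chain sample variances
    within = statistics.fmean(statistics.variance(c) for c in chains)
    # B: n times the sample variance of the chain means
    between = n * statistics.variance(chain_means)
    # Pooled estimate of the target variance
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


rng = random.Random(0)
# Four toy chains that all sample the same distribution (converged case)
mixed = [[rng.gauss(0, 1) for _ in range(2000)] for _ in range(4)]
# Four toy chains stuck around different values (non-converged case)
stuck = [[rng.gauss(shift, 1) for _ in range(2000)] for shift in (0, 2, 4, 6)]
print("converged chains:", round(psrf(mixed), 3))
print("stuck chains:", round(psrf(stuck), 3))
```

For proportions such as a and d, the draws would first be mapped through the logit transform, log(p / (1 − p)), before calling the function, as the text describes.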

PRECISION AND COVERAGE OF NOMINAL CONFIDENCE SETS

We investigated the empirical coverage and precision of 95% CIs constructed using each of the five intermediate fine mapping approaches, for data simulated under the one-locus or two-locus disease models. For each model, 250 replicates were simulated. Empirical coverage is defined as the ratio of the number of intervals that captured the location of the trait locus to the number of intervals that signaled linkage (KAC LOD score > 2.33). Precision is measured by the length of the interval. The number of replicates showing suggestive linkage was 243, 217, and 155, for datasets of 100 ASPs simulated under each of the three one-locus models, respectively. This number exceeds 230 for datasets of 250 ASPs and equals 250 for samples of 500 ASPs. This suggests that if the true coverage of a CI estimator is 95%, the empirical coverage should fall between 93 and 97% for at least 80% of the simulations.
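The 80% figure can be checked with a short binomial calculation. The sketch below assumes that each replicate covers the trait locus independently with probability 0.95 and that the empirical coverage is the covered fraction of n replicates; the function name is hypothetical.

```python
import math


def prob_coverage_within(n, p, lo, hi):
    """P(lo <= X / n <= hi) for X ~ Binomial(n, p)."""
    k_min = math.ceil(lo * n)
    k_max = math.floor(hi * n)
    return sum(
        math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(k_min, k_max + 1)
    )


# Replicate counts corresponding to the 250- and 500-ASP simulations
for n in (230, 250):
    prob = prob_coverage_within(n, 0.95, 0.93, 0.97)
    print(f"n = {n}: P(0.93 <= empirical coverage <= 0.97) = {prob:.3f}")
```

Smaller numbers of replicates showing linkage, such as the 155 obtained under one of the 100-ASP settings, give a wider sampling spread for the empirical coverage.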

Tables III and IV show the average length and the empirical coverage of the intervals constructed from data simulated under the single-locus and two-locus disease models, respectively. Increased sample size results in more precise interval estimates (Table III). Only the Bayesian and the Bootstrap methods maintained the nominal coverage level well. The intervals constructed using BILL are, on average, only 50–81% of the length of those using Bootstrap, except for the HET-3 model, where the two methods perform comparably. CSI-MLS is conservative, with empirical coverage being 100% under most settings, and consequently provides CIs that are 1.64 to 4.76 times as long as the BILL intervals. The GEE intervals exhibit substantial undercoverage, with most empirical coverages being around 85% and the most extreme one being only 72%. Even with this level of undercoverage, 80% of the GEE intervals are longer than the BILL intervals, sometimes twice as long. LOD-support intervals show a moderate level of undercoverage. When the empirical coverage of the LOD-support intervals is close to 95% (defined as between 93 and 97%, which occurred in 8 out of 21 simulation settings), 62.5% of the settings result in shorter BILL intervals, while 37.5% of the settings yield shorter LOD intervals (the bold entries in the two tables). We will examine the relative performance of these two approaches in more detail in what follows.
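The comparison of BILL and LOD-support intervals in the settings where the LOD-support coverage is close to nominal can be retraced from the tabulated values. The following sketch transcribes the BILL length, the LOD-support length, and the LOD-support empirical coverage for the 21 settings of Tables III and IV; the setting labels and variable names are hypothetical conveniences.

```python
# (setting, BILL mean length, LOD-support mean length, LOD-support coverage)
settings = [
    ("Recessive/100", 8.91, 11.97, 0.88), ("Recessive/250", 3.30, 5.17, 0.94),
    ("Recessive/500", 1.81, 2.90, 0.94), ("Intermediate/100", 10.44, 11.69, 0.93),
    ("Intermediate/250", 6.48, 5.90, 0.90), ("Intermediate/500", 2.92, 3.32, 0.92),
    ("Dominant/100", 16.10, 10.87, 0.82), ("Dominant/250", 8.89, 6.58, 0.93),
    ("Dominant/500", 4.60, 4.05, 0.91),
    ("EP-2/locus 1", 3.36, 3.61, 0.91), ("EP-2/locus 2", 3.20, 2.36, 0.96),
    ("EP-4/locus 1", 10.29, 8.83, 0.88), ("EP-4/locus 2", 1.66, 1.88, 0.93),
    ("HET-2/locus 1", 3.63, 5.09, 0.91), ("HET-2/locus 2", 4.87, 2.85, 0.90),
    ("S-2/locus 1", 2.01, 3.09, 0.95), ("S-2/locus 2", 7.25, 4.42, 0.92),
    ("EP-1/locus 1", 3.25, 2.62, 0.93), ("EP-3/locus 1", 2.63, 3.42, 0.92),
    ("HET-1/locus 1", 8.27, 4.22, 0.92), ("HET-3/locus 1", 2.88, 3.12, 0.92),
]

# Settings where LOD-support coverage is within 93-97%
calibrated = [s for s in settings if 0.93 <= s[3] <= 0.97]
# Among those, settings where the BILL interval is the shorter one
bill_shorter = [s for s in calibrated if s[1] < s[2]]

print(len(calibrated), "of", len(settings), "settings have LOD coverage in 93-97%")
print("BILL shorter in", f"{len(bill_shorter) / len(calibrated):.1%}", "of them")
for name, bill_len, lod_len, _ in calibrated:
    print(f"  {name}: BILL {bill_len:.2f} cM vs LOD {lod_len:.2f} cM")
```

Filtering on the rounded coverages printed in the tables selects 8 settings, with BILL shorter in 5 of them, which is the 62.5% versus 37.5% split quoted in the text.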

It is evident from Table III that the Recessive model results in more precise intervals than does the Intermediate model. The Dominant model yields the least precision. The heritability of these three models, however, follows the reverse order.

PRECISION OF CONFIDENCE SETS UNDER CONTROLLED EMPIRICAL COVERAGE

The widely divergent empirical coverage levels of the different methods render direct comparison of efficiency difficult, although it is quite clear from the last paragraph that BILL is the most favorable, with LOD intervals being a

TABLE III. Mean lengths in cM (L) and empirical coverages (CP) of nominal 95% confidence sets from data simulated under the single-locus models

                         BILL            CSI-MLS         LOD-Support     GEE             Bootstrap
Model          # ASPs    L       CP      L       CP      L       CP      L       CP      L       CP
Recessive      100        8.91   0.95    23.03   0.97    11.97   0.88    10.95   0.79    15.87   0.93
               250        3.30   0.96    13.25   1.00     5.17   0.94     6.68   0.84     5.07   0.97
               500        1.81   0.97     8.69   1.00     2.90   0.94     4.63   0.80     2.53   0.95
Intermediate   100       10.44   0.94    25.57   0.98    11.69   0.93    11.80   0.83    21.08   0.94
               250        6.48   0.94    16.81   1.00     5.90   0.90     7.80   0.80     8.82   0.94
               500        2.92   0.95    11.81   1.00     3.32   0.92     5.58   0.84     4.41   0.92
Dominant       100       16.10   0.93    26.53   0.99    10.87   0.82    12.97   0.72    26.89   0.95
               250        8.89   0.94    19.55   1.00     6.58   0.93    10.25   0.85    16.21   0.95
               500        4.60   0.93    13.82   1.00     4.05   0.91     7.06   0.84     6.97   0.95

ASP, affected sib pair; CSI, confidence set inference; GEE, generalized estimating equations; BILL, Bayesian Intervals for Linkage Locations.


serious contender. Another experiment was conducted: empirical coverage was fixed at several levels and the precision of the intervals was compared. The results are displayed in Figs. 1 (one-locus models), 2 (non-symmetric two-locus models), and 3 (symmetric two-locus models; note that only results from one locus are plotted, in view of the symmetry). The GEE intervals are the least precise. While the differences in precision of the other intervals tend to narrow with increased sample size (Fig. 1), lengths of the GEE intervals remain widely separated

TABLE IV. Mean lengths in cM (L) and empirical coverages (CP) of nominal 95% confidence intervals from 500 ASPs simulated under two-locus models

                  BILL            CSI-MLS         LOD-Support     GEE             Bootstrap
Model    Locus    L       CP      L       CP      L       CP      L       CP      L       CP
EP-2     1         3.36   0.94    10.71   1.00    3.61    0.91    5.54    0.85     4.39   0.95
         2         3.20   0.94    11.03   1.00    2.36    0.96    5.77    0.85     4.69   0.94
EP-4     1        10.29   0.95    19.10   0.98    8.83    0.88    9.78    0.84    14.48   0.93
         2         1.66   0.96     7.43   1.00    1.88    0.93    3.74    0.87     2.16   0.96
HET-2    1         3.63   0.94    12.88   1.00    5.09    0.91    6.92    0.88     6.12   0.93
         2         4.87   0.93    12.60   1.00    2.85    0.90    6.77    0.86     6.41   0.92
S-2      1         2.01   0.96     8.71   1.00    3.09    0.95    4.63    0.88     2.79   0.96
         2         7.25   0.94    16.79   0.99    4.42    0.92    8.47    0.84    11.36   0.94
EP-1     1         3.25   0.94    12.50   1.00    2.62    0.93    3.54    0.92     5.02   0.96
EP-3     1         2.63   0.93    11.22   1.00    3.42    0.92    3.10    0.90     3.87   0.93
HET-1    1         8.27   0.95    17.66   1.00    4.22    0.92    4.86    0.87    10.21   0.96
HET-3    1         2.88   0.96     9.51   1.00    3.12    0.92    2.67    0.93     2.70   0.95

BILL, Bayesian Intervals for Linkage Locations; CSI, confidence set inference; GEE, generalized estimating equations.

Fig. 1. Average lengths of confidence intervals over fixed empirical coverages. Data were simulated under the single-locus models.


from those of the others. The Bootstrap intervals are also long compared to others for small sample sizes (100 ASPs) and large empirical coverage levels. CSI-MLS intervals are often close to the most efficient intervals in precision.

The comparison of BILL and LOD intervals confirms the findings from the nominal intervals. For single-locus disease models, BILL intervals are the most precise except for the case of 250 ASPs under the Dominant model,

Fig. 2. Average lengths of confidence intervals over fixed empirical coverages. Data were simulated under the non-symmetric two-locus models.


where they are slightly less precise than the LOD and CSI-MLS intervals. LOD intervals outperform BILL intervals in five cases under two-locus models: EP-1, HET-1, and the second locus of each of EP-2, HET-2, and S-2. The mode of inheritance of each of these disease loci is dominant. Close examination of these disease models shows that the probability that two affected siblings share 1 allele IBD at each of these loci is close to 0.5, which means that d ≈ 0. Irrespective of this fact, the Bayesian method still estimates d in the model. To investigate whether this is the reason for BILL not outperforming LOD-support intervals, a variant of BILL, BILL-V2, where d is set to zero, was also developed. BILL-V2 was applied to datasets of 500 ASPs simulated under some of the models with a dominant marginal mode of inheritance. BILL-V2 intervals are improved over the BILL intervals, but are still inferior to the LOD intervals for the dominant loci under the non-symmetric two-locus models (Supplementary Table S1).

While the LOD intervals are more precise than the BILL intervals for data simulated under dominant models, it is not clear what drop size corresponds to a desired level of coverage. Both mode of inheritance and sample size influence the drop size, which ranged from 1.04 to 1.62 when the three single-locus models and sample sizes of 100, 250, and 500 were considered (Supplementary Table S2). The drop size can be as much as 1.48 for 500 ASPs. Other uninvestigated factors might also play a role, further complicating the choice.
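To make the role of the drop size concrete, the following minimal sketch extracts a LOD-support interval from a LOD curve evaluated on a grid of positions. It assumes the interval is the contiguous stretch around the peak where the LOD score stays within the chosen drop of the maximum; the function name and the toy quadratic LOD curve are hypothetical and are not taken from the simulated data.

```python
def lod_support_interval(positions, lods, drop=1.0):
    """Contiguous region around the peak where LOD >= max(LOD) - drop.

    positions: grid positions in cM, in increasing order.
    lods: LOD score at each grid position.
    """
    peak_idx = max(range(len(lods)), key=lods.__getitem__)
    threshold = lods[peak_idx] - drop
    # Walk left from the peak while the score stays above the threshold
    left = peak_idx
    while left > 0 and lods[left - 1] >= threshold:
        left -= 1
    # Walk right from the peak in the same way
    right = peak_idx
    while right < len(lods) - 1 and lods[right + 1] >= threshold:
        right += 1
    return positions[left], positions[right]


# Toy LOD curve peaking at 30 cM on a 0.5 cM grid over 60 cM
positions = [0.5 * i for i in range(121)]
lods = [5.0 - ((x - 30.0) / 5.0) ** 2 for x in positions]

for drop in (1.0, 1.6):
    lo, hi = lod_support_interval(positions, lods, drop)
    print(f"{drop}-LOD-support interval: {lo} to {hi} cM (length {hi - lo})")
```

A larger drop widens the interval and raises its coverage, which is why the drop size needed for a given coverage level matters in comparing LOD-support intervals with BILL.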

ESTIMATES OF SUSCEPTIBILITY GENE LOCATIONS

Four of the intermediate fine mapping methods (BILL, Bootstrap, GEE, and LOD) also provide estimates of the susceptibility gene location. For LOD, the location with the maximum KAC LOD score is used as the point estimate. Root mean squared errors of these estimates are given in Table V for data simulated under the one-locus models. BILL estimates are the best under both the Recessive and the Intermediate models, while Bootstrap estimates are less precise for small sample size (100 ASPs) and are just as precise for sample sizes of 250 ASPs or more. The estimates from GEE and LOD are much less precise. On the other hand, under the Dominant model, LOD provides the most precise estimate with 250 ASPs and slightly better estimates than those from Bootstrap and BILL with 500 ASPs, while its estimate with 100 ASPs is much less precise than those from the other three methods.

EFFECT OF THE SPECIFICATION OF K

As discussed following equation (3), the role of K is that of a scaling parameter. When it is fixed at the prevalence of the disease attributable to the trait locus under consideration, the other two parameters, a and d, have the interpretation of narrow sense heritability and the proportion of dominance genetic variance, respectively, under a single-locus disease model. This interpretation facilitates selection of reasonable priors for a and d, as demonstrated through the good performance of BILL when applied to all the simulated data. The effect of specifying K as z × (prevalence) is shown in Table VI for five different values of z (0.1, 0.5, 1.0, 2.0, and 4.0), based on samples of 250 ASPs simulated under the single-locus models. The effect of z on the length of the credible interval is minimal and hence is not shown. A value of z ≤ 1 has little or no effect on the empirical coverage, while z ≥ 2 leads to

Fig. 3. Average lengths of confidence intervals over fixed empirical coverages. Data were simulated under the symmetric two-locus models.


reduced coverage. This effect is more pronounced for the Dominant model (bold entries).

EFFECT OF MISSING PARENTAL GENOTYPES

We also investigated the effect of missing parental marker data on the performance of the five intermediate fine mapping methods, based on samples of 250 ASPs simulated under the single-locus Intermediate model (Supplementary Table S3). Although an increasing percentage of missing parental marker data leads to slightly longer intervals, the empirical coverage and efficiency comparisons of these methods confirm what was observed when all parental data were available. The change in the length of the CI with increasing percentage of missing parental information is more pronounced with the LOD-support intervals than with the other approaches, probably due to their unstable coverage level.

APPLICATION TO DATA FROM THE NORTH AMERICAN RHEUMATOID ARTHRITIS CONSORTIUM STUDY

DATA DESCRIPTION, PREPROCESSING, AND ANALYSIS STRATEGY

To further evaluate BILL and other competing methods, we analyzed the RA data contributed to Problem 2 of Genetic Analysis Workshop 15 by the North American Rheumatoid Arthritis Consortium (NARAC) [Amos et al., 2007]. Microsatellite marker data for 511 families and SNP data for 757 families were provided. The locus HLA-DRB1 has been consistently implicated by numerous studies as a highly significant risk factor for RA. In our analysis we used only the SNP data, as numerous papers have shown that the SNP panels available nowadays are satisfactory. Only the physical map of the SNPs was provided in the GAW15 Problem 2 data. Both physical distances and genetic map positions were available, however, for a denser map of SNPs on chromosome 6 in the GAW15 Problem 3 data. Thus, we obtained the genetic map positions for SNPs in the Problem 2 data by matching information on SNPs in the two problems, using interpolation when necessary. HLA-DRB1 is located at 49.46 cM in this genetic map. There were 404 available SNPs on chromosome 6. Because marker-marker linkage disequilibrium (LD) can lead to biased estimation of IBD sharing probabilities and the SNPs are sufficiently dense, we opted for filtering out SNPs in LD instead of incorporating the LD in the methods. Haploview [Barrett et al., 2005] was employed to filter the SNPs so that the remaining SNPs had pairwise r² less than 0.02. This resulted in 284 SNPs across 197.4 cM. The candidate region was set to be the first 100 cM (142 SNPs), around half of chromosome 6.
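The map-matching step can be illustrated with a small sketch. The text does not state which interpolation scheme was used; the example below assumes simple linear interpolation between the two flanking SNPs of a reference map with known physical and genetic positions, and clamps positions outside the reference range to the nearest end. The function name and all map values are hypothetical.

```python
from bisect import bisect_right


def interpolate_cm(bp, ref_bp, ref_cm):
    """Genetic position (cM) at physical position bp, by linear interpolation.

    ref_bp: physical positions of reference SNPs, strictly increasing.
    ref_cm: genetic positions of the same reference SNPs.
    """
    if bp <= ref_bp[0]:
        return ref_cm[0]
    if bp >= ref_bp[-1]:
        return ref_cm[-1]
    # Index of the first reference SNP located strictly to the right of bp
    i = bisect_right(ref_bp, bp)
    x0, x1 = ref_bp[i - 1], ref_bp[i]
    y0, y1 = ref_cm[i - 1], ref_cm[i]
    return y0 + (y1 - y0) * (bp - x0) / (x1 - x0)


# Hypothetical reference map: physical positions (bp) and genetic positions (cM)
ref_bp = [1_000_000, 2_000_000, 4_000_000]
ref_cm = [1.0, 2.5, 4.5]

for bp in (1_500_000, 2_000_000, 3_000_000):
    print(bp, "bp ->", interpolate_cm(bp, ref_bp, ref_cm), "cM")
```

A SNP present in both maps receives its reference genetic position unchanged, so interpolation only comes into play for SNPs without a direct match.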

Of the 757 families, 722 independent ASPs, together with genotypes from other family members who might provide information on IBD sharing, comprised our full set of data. The unusually large sample size and the substantial genetic effect of HLA-DRB1 on RA presented an opportunity to investigate in some detail the performance of the intermediate fine mapping methods, including their asymptotic behavior. To this end, we applied the methods to seven nested samples of sizes 150, 200, 300, 400, 500, 600, and 722 ASPs, respectively. The corresponding maximum KAC LOD scores are 2.51, 3.15, 4.22, 5.21, 8.91, 10.54, and 16.33. We chose the minimum size of 150 ASPs so that the maximum LOD score is still > 2.33, the threshold we used for deciding when it is worth pursuing intermediate fine mapping. For CSI-MLS, the two-step procedure was applied to each sample with the zj's estimated in the 1st step using all 142 SNPs in the 100 cM region. The same set of SNPs was used in the 2nd step for CI construction. Since the underlying IBD process is the same for the same set of ASPs and the SNPs are sufficiently dense for tracking the IBD process, this strategy yields essentially the same result as one that used mutually exclusive sets of SNPs in the two steps. GEE (implemented in GeneFinder version 1.0) requires IBD sharing probabilities that are output from GENEHUNTER. GENEHUNTER eliminated some pedigrees, claiming numerical underflow during likelihood calculation or insufficient information available on IBD sharing. The numbers of pedigrees dropped were 8, 8, 16, 18, 25, 31, and 38 for sample sizes increasing from 150 ASPs to 722 ASPs.

RESULTS

Kong and Cox LOD scores and 95% confidence/credible intervals constructed using the five methods are displayed in Fig. 4. Numerical details are provided in Supplementary Table S4. As the sample size increases, point estimates of the disease gene location converge to the HLA-DRB1 locus for BILL, 1-LOD-support, and Bootstrap. CSI-MLS does not provide a point estimate. On the other hand, GEE

TABLE VI. Empirical coverage of BILL when K is fixed at z × (prevalence) using 250 ASPs simulated under the single-locus models

                              z
Model          Prevalence     0.1     0.5     1.0     2.0     4.0
Recessive      0.045          0.970   0.975   0.960   0.955   0.940
Intermediate   0.058          0.970   0.970   0.950   0.940   0.915
Dominant       0.085          0.935   0.945   0.945   0.895   0.775

ASP, affected sib pair.

TABLE V. Root Mean Squared Error (RMSE, in cM) of the estimate of the susceptibility gene location

                        RMSE of l̂
Model          # ASP    GEE     Bootstrap   BILL    LOD
Recessive      100      3.94    4.47        3.08    6.28
               250      2.51    1.06        1.02    1.84
               500      1.73    0.62        0.66    1.01
Intermediate   100      4.68    4.79        4.33    7.31
               250      3.00    2.00        1.90    2.57
               500      1.98    1.20        1.16    1.81
Dominant       100      5.90    6.20        6.47    8.20
               250      3.77    3.10        3.79    2.89
               500      2.61    2.03        2.08    2.00

ASP, affected sib pair; GEE, generalized estimating equations; BILL, Bayesian Intervals for Linkage Locations.


consistently yielded point estimates on the opposite side of the true location, compared to the other three methods. The series of point estimates seems to converge toward the true location for sample sizes going from 150 to 300, only to move further away thereafter, with the point estimate being more than 5 cM away from the HLA-DRB1 locus for the full data set. As a consequence, although the width of the CI decreases from 18.84 to 7.80 cM from the 150-ASP sample to the 722-ASP sample, the CI does not contain the true location when the sample size is at or above 500. Each of the other four methods provides a series of CIs that capture the true location and consistently become more precise with increasing sample size. The most dramatic improvement occurs with Bootstrap: the CI from the full data set is 5.3 times as precise as that from the 150-ASP sample; this number is 1.9, 3.6, and 4.0, for CSI-MLS, BILL, and 1-LOD-support intervals, respectively. A few exceptions to the monotone trend do occur: at sample size 600 for CSI-MLS, 200 for Bootstrap, and 400 for BILL and LOD-support.

Comparing the precision of the intervals from all methods, except those from the GEE approach owing to their bias, shows that BILL is quite efficient relative to the other methods. Using BILL intervals as baseline, the ratios of CSI-MLS interval lengths to BILL interval lengths are 1.9, 2.0, 2.1, 2.0, 3.1, 3.3, and 3.5 for sample sizes going from 150 to 722. These numbers are 1.6, 2.4, 1.3, 1.09, 1.7, 1.16, and 1.04 for Bootstrap and 1.16, 1.00, 0.90, 0.87, 1.18, 1.11, and 1.05 for 1-LOD-support intervals. Thus the conservative nature of CSI-MLS is evident. Bootstrap is apparently not efficient compared to BILL, despite the fact that they both maintain the nominal coverage of the intervals well. BILL and 1-LOD-support intervals are mostly similar. However, the uncertainty about the appropriate drop size to obtain a 95% confidence level and the LOD-support intervals' frequent tendency to be liberal make it difficult to conduct an overall fair comparison of the LOD-support approach with BILL. Recall that our limited investigation showed that the appropriate drop size can go from 1.04 to 1.6. The ratios of the lengths of the 1.05-LOD-support intervals to those of the BILL intervals are, going from the 150-ASP sample to the full sample, 1.17, 1.02, 0.98, 0.92, 1.21, 1.18, and 1.14. These ratios become 1.27, 1.18, 1.07, 1.02, 1.29, 1.21, and 1.33 for a drop size of 1.20, and 1.92, 1.42, 1.41, 1.24, 1.61, 1.46, and 1.52 for a drop size of 1.60 (Supplementary Table S5).

Convergence of the Markov chain we used for BILL was examined for one of the samples. The PSRFs, computed after running four chains started from an overdispersed distribution, were 1.01, 1.03, and 1.04, for l, a, and d, respectively. Thus, there was no evidence against convergence. We set K = 0.0107 for BILL.

DISCUSSION

After identifying a genomic region that is potentially linked to a complex disease, construction of precise yet valid statistical CIs of disease gene locations within this region is an important component in positional cloning. Several approaches have been proposed for this endeavor, but comparison of their performance indicates there is much room for improvement. In this article, we have

[Figure 4: two panels with horizontal axes in cM. Panel A: KAC LOD score by position, one curve per sample size (150, 200, 300, 400, 500, 600, 722 ASPs). Panel B: intervals by number of ASPs for BILL, KAC, BOOTS, GEE, and CSI-MLS. Interval widths in cM printed in panel B, BILL / 1-LOD-support, for 150, 200, 300, 400, 500, 600, and 722 ASPs: 18.75 / 21.75, 15.50 / 15.50, 14.50 / 13.00, 15.50 / 13.50, 7.00 / 8.25, 7.00 / 7.75, 5.25 / 5.50.]

Fig. 4. Confidence intervals (panel B) constructed using the five intermediate fine mapping methods and KAC LOD scores (panel A) across the first 100 cM of chromosome 6. The vertical line marks the HLA-DRB1 locus. Numbers in panel A are the sample sizes. Points in panel B are point estimates of the disease gene location and numbers are the width for BILL or 1-LOD-support intervals. CSI-MLS does not provide point estimates.


proposed a novel Bayesian approach, BILL, that jointly estimates the disease location and the IBD sharing probabilities at the trait locus. The posterior distribution of these parameters is approximated using Markov chain Monte Carlo. The posterior mode of the disease gene location serves as a point estimator and the highest posterior density intervals are obtained for desired credible levels.
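For a unimodal posterior, a highest posterior density interval can be approximated from MCMC draws as the shortest interval containing the desired fraction of the sorted sample. The sketch below illustrates that general idea only; it is an assumption that this corresponds to how BILL computes its intervals, and the function name and the skewed toy sample are hypothetical.

```python
import math
import random


def hpd_interval(samples, level=0.95):
    """Shortest interval containing a fraction `level` of the draws.

    Appropriate as an HPD approximation when the posterior is unimodal.
    """
    xs = sorted(samples)
    n = len(xs)
    # Number of draws the interval must contain (guarding against float error)
    k = min(n, max(1, math.ceil(level * n - 1e-9)))
    # Slide a window of k consecutive order statistics and keep the narrowest
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]


# Toy right-skewed sample standing in for posterior draws
rng = random.Random(1)
draws = [rng.expovariate(1.0) for _ in range(5000)]

lo, hi = hpd_interval(draws, 0.95)
xs = sorted(draws)
# Equal-tailed 95% interval for comparison (2.5% of draws cut from each side)
eq_lo, eq_hi = xs[125], xs[4874]
print(f"HPD interval:         {lo:.3f} to {hi:.3f} (length {hi - lo:.3f})")
print(f"Equal-tailed interval: {eq_lo:.3f} to {eq_hi:.3f} (length {eq_hi - eq_lo:.3f})")
```

On a skewed sample the shortest interval is noticeably narrower than the equal-tailed one, which is the practical reason for preferring HPD intervals when the posterior of the gene location is asymmetric.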

Comparison of BILL with existing methods (LOD-support, GEE, CSI-MLS, Bootstrap), under simulations of single- and two-locus disease models and an extensive analysis of the NARAC RA data, suggests that BILL is statistically both valid and efficient. In addition to corroborating the findings of others [Papachristou and Lin, 2006a], the comparison makes some surprising discoveries. In terms of faithfulness to the nominal coverage, intervals from CSI-MLS are conservative, while those from GEE and 1-LOD-support are liberal, sometimes substantially so. BILL and Bootstrap are the only two methods that remain truthful to the nominal coverage across all simulation settings, and BILL intervals are from 1.64 to 4.76 times as precise as Bootstrap intervals. When empirical coverage was held constant so that efficiency of the information utilization could be separated from variation in empirical coverage resulting from the methods' ability to approximate the sampling distributions, LOD-support emerges as a serious contender to BILL under simulation models with z1 ≈ 0.5. This might be due to the fact that the LOD-support intervals were based on KAC LOD scores, which are efficient LOD scores for disease models with only additive genetic components. A fairer comparison might be to base the LOD-support interval on the MLS. It was a surprise to discover via the analysis of the RA data that GEE is substantially biased.

Why do these methods behave so differently, given that each one is based on sound statistical reasoning? Efficiency of CSI-MLS is close to the optimal among the methods compared, so it is not the test statistic (MLS) that is causing the coverage level to be conservative. We believe the reason lies in that the CSI framework, to be practical, uses a two-step procedure but does not account for the inflation in the coverage level that results from using the first step estimates of the zj's as known without error in the CI construction step. Improvement of this framework, thus, would need to focus on ways to correct for this inflation. On the other hand, as a general tool for maintaining nominal coverage level, the PMB, which uses MLEs as the point estimates of the disease gene location, is faithful but not fully efficient, despite its reliance on the efficient MLEs. This lack of efficiency might arise because the sample size required, for the parametric model estimated from the sample to closely approximate the sampling distribution, is larger than most current linkage studies can provide. Suggestive evidence for this comes from the improvement of the Bootstrap intervals in the RA analysis with increasing sample size, to the point of being very similar to the BILL intervals (the optimal) for the 722-ASP data. The remaining approaches, BILL, GEE, and LOD-support with the KAC LOD score, share one commonality: modeling the data with both a parameter for the gene location and a set of nuisance parameters that can be loosely interpreted as genetic effects of the locus, in the form of zj's, or mean IBD sharing at the trait locus, or the KAC parameter in a cleverly constructed linear allele-sharing model, respectively. Allowing the nuisance parameters to vary with the postulated gene location, together with full likelihood specification, might account for the efficiency of BILL and LOD-support intervals. The alarming bias of the GEE intervals warrants further investigation: is this due to the unsatisfactory modeling of the discrete number of alleles shared IBD with the first two statistical moments, is it due to some implementation idiosyncrasy, or is it due to some other factors? The formal modeling aspect of the GEE framework is desirable, as covariates can then be incorporated relatively easily, compared to the other frameworks, to account for linkage heterogeneity [Glidden et al., 2003]. Lastly, the fidelity of BILL intervals to nominal coverage levels stems, we believe, from its employing Monte Carlo simulation to approximate the posterior distribution, which is only possible with today's computing power, instead of invoking an asymptotic approximation. Furthermore, the posterior distributions of the gene location from the NARAC samples are skewed, so HPD intervals are more efficient than symmetric credible intervals around the mode, and the effort to obtain HPD intervals is insignificant compared to the other components of BILL.

In the simulation study, we used equally informative, equally spaced SNPs (one per 0.25 cM). The extensive coverage of the human genome by today's marker maps implies that sufficiently dense and informative markers will almost always be readily available to recover the underlying IBD process at the resolution suitable for linkage studies. The application of the methods to the NARAC data, with real SNP maps, yielded the same conclusions as those drawn from the simulation. Furthermore, CI construction by these methods will only be invoked when suggestive linkage has been indicated, so any data that call for the application of these methods would have had IBD sharing distortion at the trait locus that is sufficiently substantial to yield a linkage signal with a realistic sample size. The two-locus and one-locus disease models employed in the simulation studies have been commonly used to study properties of linkage methods in the face of complex traits/diseases. Similarity between conclusions from the simulation study and those from the RA data analysis lends support to the generality of the relative performance of the investigated approaches. Further simulations with more realistic disease models could be of interest, but the trait locus under consideration will need to make a substantial contribution to the disease to be able to pass the "suggestive linkage" screen.

As a reviewer pointed out, the rapidly decreasing cost of large-scale sequencing may prompt a resurgence of linkage studies in the near future. Rare but high-penetrance variants, which tend to be missed by genetic association studies, are easier to map using large pedigrees enriched with the disease. Thus, it is highly desirable to develop intermediate fine mapping approaches applicable to general pedigrees. The statistical problem would, undoubtedly, be much more difficult. The observations obtained from the current study, however, provide insights that will be helpful in guiding this effort. LOD-support intervals based on traditional parametric LOD scores are a start. But the mode of inheritance in such LOD score computation is fixed across genomic locations in the candidate region. Thus such intervals cannot be expected to perform similarly to the KAC-based LOD-support intervals. KAC LOD scores were designed for general pedigrees, but there are several difficulties associated


with extending this LOD-support approach to general pedigrees: (1) Simulations by others [Papachristou and Lin, 2006a] showed that even for ASPs, the KAC-LOD-support intervals can be extremely liberal, to the point that a 95% nominal level yields only a 26% empirical coverage. This problem would be aggravated when the data consist of a small number of pedigrees. (2) Optimal scoring functions and weighting factors will depend on the mode of inheritance and the ascertainment scheme, which are rarely known. (3) KAC scores make use of the inheritance vector distribution, the exact computation of which places limits on the size and complexity of pedigrees that can be analyzed. We are currently developing an extension to BILL to address this problem.

ELECTRONIC RESOURCES

The following URL provides the program implementing BILL and the supplementary material: http://darwin.cwru.edu/~rsinha/work/bill.

ACKNOWLEDGMENTS

This work is partly supported by a US Public Health Service Resource Grant RR03655, Research Grants R01 HG003054 and R01 GM28356, and Cancer Center Support Grant P30CAD43703. GAW15 was supported by Research Grant R01 GM031575. The NARAC data contributed to GAW15 were gathered with the support of grants from the National Institutes of Health (N01-AR-2-2263 and R01-AR-44422) and the National Arthritis Foundation. We would like to express our gratitude to two anonymous reviewers whose insightful comments and questions have resulted in a much improved manuscript.

REFERENCES

Abecasis GR, Cherny SS, Cookson WO, Cardon LR. 2002. Merlin—rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 30:97–101.

Amos CI, Chen W, Remmers E, Siminovitch KA, Seldin MF, Criswell LA, Lee AT, John S, Shephard ND, Worthington J, Cornelis F, Plenge RM, Begovich AB, Dyer TD, Kastner DL, Gregersen PK. 2007. Data for Genetic Analysis Workshop (GAW) 15 Problem 2, genetic causes of rheumatoid arthritis and associated traits. BMC Proc 1:S3.

Barrett JC, Fry B, Maller J, Daly MJ. 2005. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 21:263–265.

Chen M, Shao Q. 1999. Monte Carlo estimation of Bayesian credible and HPD intervals. J Comput Graph Stat 8:69–92.

Deighton CM, Walker DJ, Griffiths ID, Roberts DF. 1989. The contribution of HLA to rheumatoid arthritis. Clin Genet 36:178–182.

Dupuis J, Siegmund D. 1999. Statistical methods for mapping quantitative trait loci from a dense set of markers. Genetics 151:373–386.

Gelman A, Rubin DB. 1992. Inference from iterative simulation using multiple sequences. Stat Sci 7:457–511.

Gelman A, Carlin JB, Stern HS, Rubin DB. 2004. Bayesian Data Analysis. Boca Raton, FL: Chapman & Hall.

Gilks WR, Richardson S, Spiegelhalter D. 1996. Markov Chain Monte Carlo in Practice. London: Chapman & Hall.

Glidden DV, Liang KY, Chiu YF, Pulver AE. 2003. Multipoint affected sibpair linkage methods for localizing susceptibility genes of complex diseases. Genet Epidemiol 24:107–117.

Gudbjartsson DF, Jonasson K, Frigge M, Kong A. 2000. Allegro, a new computer program for multipoint linkage analysis. Nat Genet 24:12–13.

Hammond CJ, Webster AR, Snieder H, Bird AC, Gilbert CE, Spector TD. 2002. Genetic influence on early age-related maculopathy: a twin study. Ophthalmology 109:730–736.

Holmans P. 1993. Asymptotic properties of affected-sib-pair linkage analysis. Am J Hum Genet 52:362–374.

James JW. 1971. Frequency in relatives for an all-or-none trait. Ann Hum Genet 35:47–49.

Knapp M, Seuchter SA, Baur MP. 1994. Two-locus disease models with two marker loci: the power of affected sib-pair tests. Am J Hum Genet 55:1030–1041.

Kong A, Cox NJ. 1997. Allele-sharing models: LOD scores and accurate linkage tests. Am J Hum Genet 61:1179–1188.

Liang K-Y, Chiu Y-F, Beaty TH. 2001. A robust identity-by-descent procedure using affected sib pairs: multipoint mapping for complex diseases. Hum Hered 51:64–78.

MacGregor AJ, Snieder H, Rigby AS, Koskenvuo M, Kaprio J, Aho K, Silman AJ. 2000. Characterizing the quantitative genetic contribution to rheumatoid arthritis using data from twins. Arthritis Rheum 43:30–37.

Papachristou C, Lin S. 2006a. A comparison of methods for intermediate fine mapping. Genet Epidemiol 30:677–689.

Papachristou C, Lin S. 2006b. Microsatellites versus single-nucleotide polymorphisms in confidence interval estimation of disease loci. Genet Epidemiol 30:3–17.

Papachristou C, Lin S. 2006c. A two-step procedure for constructing confidence intervals of trait loci with application to a rheumatoid arthritis dataset. Genet Epidemiol 30:18–29.

Risch N. 1990. Linkage strategies for genetically complex traits. III. The effect of marker polymorphism on analysis of affected relative pairs. Am J Hum Genet 46:242–253.

S.A.G.E. 2008. Statistical Analysis for Genetic Epidemiology, Version 5.4. http://darwin.cwru.edu/sage

Siegmund D. 1998. Genetic linkage analysis: an irregular statistical problem. Doc Math J DMV Extra Vol ICM III:291–300.

Sinha R, Luo Y. 2007a. Efficient intermediate fine mapping: confidence set inference with likelihood ratio test statistic. Genet Epidemiol 31:922–936.

Sinha R, Luo Y. 2007b. Two-step intermediate fine mapping with likelihood ratio test statistics: applications to Problems 2 and 3 data of GAW15. BMC Proc 1:S146.
