
Genetic Epidemiology 31: 922–936 (2007)

Efficient Intermediate Fine Mapping: Confidence Set Inference with Likelihood Ratio Test Statistic

Ritwik Sinha and Yuqun Luo*

Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio

In positional cloning of disease causing genes, identification of a linked chromosomal region via linkage studies is often followed by fine mapping via association studies. Efficiency can be gained with an intermediate step where confidence regions for the locations of disease genes are constructed. The confidence set inference [CSI; Papachristou and Lin, 2006b] achieves this goal by replacing the traditional null hypothesis of no linkage with a new set of null hypotheses where the chromosomal position under consideration is in tight linkage with a trait locus. This approach was shown to perform favorably compared with several competing methods. Using the duality of confidence sets and hypothesis testing, CSI was proposed for the Mean test statistics with affected sibling pair data (CSI-Mean). We postulate that more efficient confidence sets will result if more efficient test statistics are used in the CSI framework. One promising candidate, the maximum LOD score (MLS) statistic, makes maximum use of available identity by descent information, in addition to handling markers with incomplete polymorphism naturally. We propose a procedure that tests the CSI null hypotheses using the MLS statistic (CSI-MLS). Compared with CSI-Mean, CSI-MLS provides tighter confidence regions over a range of single and two-locus disease models. The MLS test is also shown to be more powerful than the Mean test in testing the CSI null over a wide range of disease models, the advantage being most pronounced for recessive models. In addition, CSI-MLS is computationally much more efficient than CSI-Mean. Genet. Epidemiol. 31:922–936, 2007. © 2007 Wiley-Liss, Inc.

Key words: disease gene localization; model-free linkage; maximum LOD score; effect of disease models; CSI-MLS

Contract grant sponsor: National Center for Research Resources; Contract grant number: RR03655.
*Correspondence to: Dr. Yuqun Luo, Department of Epidemiology and Biostatistics, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, OH 44106-7281. E-mail: [email protected]
Received 10 January 2007; Revised 29 March 2007; Accepted 21 May 2007
Published online 5 July 2007 in Wiley InterScience (www.interscience.wiley.com).
DOI: 10.1002/gepi.20252

INTRODUCTION

Recent advances in genotyping technology have led to a paradigm shift in linkage analysis. While in the past the identification of genomic regions linked to a trait was of primary importance, the much denser maps available today promise better localization. Thus, it is desirable to have efficient approaches to constructing confidence intervals of disease susceptibility gene locations after initial linkage signals are found, so as to better guide ensuing fine mapping efforts [e.g., Bull et al., 2005; Lewinger et al., 2005]. While the problem of supplying a confidence interval for a Quantitative Trait Locus has been extensively studied in experimental organisms [e.g., Lander and Botstein, 1989; Mangin et al., 1994; Visscher et al., 1996], its investigation in humans poses additional challenges [e.g., Liang et al., 2000].

For binary traits in humans, a 1-LOD support approach is often employed to construct confidence intervals of disease gene locations using data from linkage studies. In this approach, the confidence interval is the region around the linkage peak with LOD-score dropping less than one from the peak [Kristjansson et al., 2002]. However, the statistical properties of such intervals are not clear [Nemesure et al., 1995; Dupuis and Siegmund, 1999].

Focusing on affected sibling pairs (ASPs), Liang et al. [2001] proposed a generalized estimating equations (GEE) approach to simultaneously estimate susceptibility gene locations and their effect sizes. This approach has since been extended to include covariates [Glidden et al., 2003; Chiou et al., 2005], to use information from other relative types [Schaid et al., 2005], and to localize two linked disease loci simultaneously [Biernacka et al., 2005].

In parallel, the confidence set inference (CSI), based on the duality of confidence set construction and hypothesis testing, was developed [Lin et al., 2001; Lin, 2002]. This method also focuses on ASPs and formulates the traditional Mean test and Proportion test to test a new set of hypotheses. The CSI was extended to handle markers with incomplete polymorphisms and to handle high-density single nucleotide polymorphism (SNP) markers [Papachristou and Lin, 2005, 2006b]. Contrary to the GEE approach's simultaneous estimation of disease model-related parameters and the disease gene location, the CSI approaches require knowledge of certain parameters related to the disease model, relative risks of siblings and offspring being one possibility. A two-step approach that precedes CSI with the estimation of these disease parameters was shown to perform well [Papachristou and Lin, 2006c]. In addition to the above three widely used or more fully developed approaches, there exist other less studied methods [e.g., Hauser et al., 1996; Hossjer, 2003].

Recent studies have shown that the confidence intervals produced by the GEE method might have reduced coverage [Lebrec et al., 2006; Papachristou and Lin, 2006a], the bias being severe in regions where the allelic identity by descent (IBD) probability estimates are less precise. In a comparison of the GEE, CSI, 1-LOD support and two bootstrap methods, Papachristou and Lin [2006a] showed that when the observed coverage probability (CP) was held to be the same, the intervals obtained by the CSI were better able to localize the disease causing loci.

Encouraged by the sound statistical properties of the CSI-based intermediate fine mapping approaches, we explore other test statistics to improve the efficiency of CSI and to reduce its computational burden. The current CSI is based on the Mean test (CSI-Mean), which compares the mean observed number of alleles shared IBD at a genomic location to what is expected under the CSI null, that this location is in tight linkage with a trait locus. In its current implementation, CSI-Mean requires Monte-Carlo simulations to estimate the mean and the variance of the test statistic, and thus is computationally demanding. Because CSI is based on the idea of converting a group of hypothesis tests at various genomic locations to obtain the confidence sets for the true disease gene location(s), we postulate that more powerful test procedures should lead to more efficient estimators of confidence sets. The maximum LOD score (MLS) test [Risch, 1990c] is a likelihood ratio test (LRT) in traditional linkage analysis that compares the distribution of the IBD sharing to that expected under the null of no linkage, with incomplete marker information handled naturally. Linkage tests based on MLS have been shown to perform well over a range of disease models [Davis and Weeks, 1997]. Thus, we propose to reformulate the MLS statistic in the CSI framework (CSI-MLS). An immediate benefit is that Monte-Carlo simulation is no longer needed, which improves computational efficiency.

In what follows, we review the principle of the CSI framework and the CSI-Mean formulation. We then derive the CSI-MLS procedures, both single-point and multi-point. Extensive simulations under disease models of varying complexity are performed to compare the efficiency of CSI-Mean and CSI-MLS in terms of coverage probabilities and precision of the confidence sets. Ranges of disease models where CSI-MLS tests perform better are also identified. The CSI framework requires knowledge of some disease model-related parameters. In their absence, Papachristou and Lin [2006c] proposed a two-step procedure: the required disease parameters are estimated in the first step and used in the second step to construct the confidence set. In this article, we assume that the required parameters are known, so as to focus the comparison of CSI-MLS and CSI-Mean on the effect of a number of interesting factors to gain a comprehensive understanding. In particular, the space of potential disease models over which CSI-MLS outperforms CSI-Mean has been identified. Comparison of CSI-Mean and CSI-MLS in a practical setting where the required disease model parameters are not available has been performed using the GAW15 real and simulated data [Sinha and Luo, 2007].

METHODS

REVIEW OF CONFIDENCE SET INFERENCE

To construct precise confidence regions with sound statistical properties for disease gene locations, Lin et al. [2001] proposed the CSI framework. This framework builds upon one general way of finding confidence sets via inverting the acceptance regions of a group of hypothesis tests. Contrary to traditional linkage analysis, where the null is that the genomic location under consideration is unlinked, the CSI tests the null that the genomic location is in tight linkage to a disease locus. For ASP study designs, both single-point and multi-point CSI-Mean procedures have been proposed [Lin, 2002; Papachristou and Lin, 2005, 2006b].

Single-point. The single-point CSI procedure [Lin, 2002] tests, for each marker $m$, the hypothesis

$$H_{0m}^{(\mathrm{new})}: \theta_m \le \theta_0 \quad \text{versus} \quad H_{Am}^{(\mathrm{new})}: \theta_m > \theta_0, \tag{1}$$

where $\theta_m$ is the distance between marker $m$ and the putative disease locus, and $\theta_0$ is chosen to control the maximum number of null hypotheses, a priori, that can be true. If the markers are equally spaced, one choice of $\theta_0$ may be half the distance between any two adjacent markers. This choice of $\theta_0$ renders multiplicity adjustment unnecessary, if the disease is monogenic [Lin et al., 2001]. Otherwise one only needs to correct for the number of disease genes. Only information from marker $m$ is used in testing (1), thus the name "single-point". The set $A = \{m : H_{0m}^{(\mathrm{new})} \text{ not rejected at level } \alpha\}$ contains markers within distance $\theta_0$ of the disease locus with probability $(1-\alpha)$. The confidence region $R = \bigcup_{m \in A} [\tau_m - \theta_0, \tau_m + \theta_0]$, $\tau_m$ being the location of marker $m$, contains the disease gene with probability no less than $(1-\alpha)$.
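The construction of the region R from the set of non-rejected markers can be sketched as follows. This is only an illustration under stated assumptions: the function name `single_point_region`, the input layout, and the example marker map are hypothetical, and marker positions and the half-width are assumed to be expressed in the same map units.

```python
def single_point_region(marker_pos, rejected, theta0):
    """Union of [t_m - theta0, t_m + theta0] over markers whose CSI null was NOT rejected.

    marker_pos: marker locations (assumed map units, e.g. cM)
    rejected:   booleans, True if the CSI null at that marker was rejected
    theta0:     half-width, in the same units as marker_pos
    Returns a sorted list of disjoint (low, high) intervals.
    """
    intervals = sorted(
        (pos - theta0, pos + theta0)
        for pos, rej in zip(marker_pos, rejected)
        if not rej
    )
    merged = []
    for lo, hi in intervals:
        if merged and lo <= merged[-1][1]:
            # Touching or overlapping intervals are fused into one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


# Hypothetical map: markers every 10 units, theta0 = half the spacing.
region = single_point_region([0, 10, 20, 30, 40],
                             [True, False, False, True, True],
                             theta0=5)
```

With this toy input the nulls at the markers located at 10 and 20 are retained, so the two intervals around them fuse into a single region.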

Multi-point. The hypothesis in (1) does not make a distinction between locations to the left and to the right of a marker, thus the resulting confidence sets are wider than desired. To overcome this problem, Papachristou and Lin [2006b] proposed to test the following hypothesis using multi-point marker genotype information. At each location $\tau$ on the genome, test

$$H_{0\tau}^{(\mathrm{new})}: \tau = \tau^{*} \quad \text{versus} \quad H_{A\tau}^{(\mathrm{new})}: \tau \ne \tau^{*}, \tag{2}$$

where $\tau^{*}$ is the true but unknown disease location. The set $R = \{\tau : H_{0\tau}^{(\mathrm{new})} \text{ not rejected at level } \alpha\}$ is a $(1-\alpha)$ confidence set of the disease location. Presumably, any traditional linkage test statistic could be reformulated to test the CSI hypotheses (1) and (2). An important step in conducting such a test is to derive the distribution of the test statistics under the CSI null of tight linkage, which requires knowledge of certain disease model-related parameters.

CSI-Mean. Papachristou and Lin [2006b] reformulated the Mean test statistic [Blackwelder and Elston, 1985] to test the CSI hypothesis, at genomic location $d$,

$$M_d = \frac{\sum_{j=1}^{n} \sum_{i=0}^{2} i\, P({}_{d}\mathrm{IBD}_j = i \mid G_j)}{n} = \frac{\sum_{j=1}^{n} M_d(G_j)}{n}, \tag{3}$$

where ${}_{d}\mathrm{IBD}_j$ is the number of alleles shared IBD by the $j$th ASP at location $d$, $G_j$ denotes the appropriate family marker data for the $j$th ASP, $j = 1, \ldots, n$, and $M_d(G_j)$ is the average number of alleles shared IBD at locus $d$ by the $j$th ASP given $G_j$. Let $B$ denote the event that both siblings are affected. The standardized Mean statistic is

$$t_d = \frac{M_d - E(M_d \mid \tau, B)}{\sqrt{\mathrm{Var}(M_d \mid \tau, B)}} = \frac{M_d - \mu_d(\tau, B)}{\sigma_d(\tau, B)/\sqrt{n}}. \tag{4}$$

The mean ($\mu_d(\tau, B)$) and the variance ($\sigma_d^2(\tau, B)$) are calculated under the null hypothesis that the trait locus is located at $\tau$. Papachristou and Lin [2006b] defined two multi-point variants, one of which is CSI-Mean-V3, where $d = \tau$. However, the IBD sharing at locations between markers cannot always be determined with much precision; hence they proposed another multi-point variant, CSI-Mean-V2, that only used IBD sharing information at markers, i.e., $d = m(\tau)$, where $m(\tau)$ is the marker closest to $\tau$. Asymptotically, the test statistic $t_d$ follows the standard normal distribution under the null hypothesis. A two-sided test was suggested.
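As a rough illustration of (3) and (4), the sketch below computes the Mean statistic from per-pair posterior IBD probabilities and standardizes it. The function names and the input layout are hypothetical. The null mean `mu` and per-pair standard deviation `sigma` are assumed to be supplied from elsewhere, and the example values passed for them are made up for illustration only.

```python
import math


def mean_statistic(post_ibd):
    """M_d in (3): average over pairs of the expected number of alleles shared IBD.

    post_ibd: list of (p0, p1, p2) posterior IBD probabilities, one tuple per ASP.
    """
    per_pair = [p1 + 2.0 * p2 for (_p0, p1, p2) in post_ibd]
    return sum(per_pair) / len(per_pair)


def standardized_mean(post_ibd, mu, sigma):
    """t_d in (4), given the null mean mu and per-pair standard deviation sigma."""
    n = len(post_ibd)
    return (mean_statistic(post_ibd) - mu) / (sigma / math.sqrt(n))


def two_sided_p(t):
    """Two-sided p-value under the standard normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2.0))


# Hypothetical posterior IBD probabilities for four ASPs.
pairs = [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0), (0.25, 0.5, 0.25), (0.0, 0.5, 0.5)]
m_d = mean_statistic(pairs)
# mu and sigma below are placeholder values, not estimates from any data set.
t_d = standardized_mean(pairs, mu=1.375, sigma=0.5)
```

The hard part in practice is obtaining `mu` and `sigma` under the CSI null, which this sketch deliberately leaves outside its scope.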

The computation of the confidence set requires knowledge of the parameters $\mu_d(\tau, B)$ and $\sigma_d^2(\tau, B)$. Assuming that $M_d(G_j)$ are independent and identically distributed,

$$\mu_d(\tau, B) = E(M_d(G_j) \mid \tau, B) = \sum_{G_j} M_d(G_j)\, P(G_j \mid \tau, B). \tag{5}$$

The variance can be written as a similar summation. Because the summation is over all possible multi-locus genotypes, the computational burden grows exponentially with the number of markers on the map and soon becomes intractable. These two parameters depend on the IBD sharing distribution at the trait locus by an ASP, the type of pertinent relatives genotyped in addition to the siblings and the number and heterozygosity of the markers considered. Note that $\sum_{i=0}^{2} i\, P({}_{d}\mathrm{IBD}_j = i \mid \tau, B) \ne \mu_d(\tau, B)$, because of incomplete IBD information. To overcome the computational difficulty, Papachristou and Lin [2006b] proposed a Monte-Carlo algorithm to estimate $\mu_d(\tau, B)$ and $\sigma_d^2(\tau, B)$, recommending 6000 Monte-Carlo samples per null hypothesis.

Practical considerations. As mentioned earlier, conducting the proposed test under the CSI formulation requires some knowledge about the mode of inheritance that determines the IBD sharing probabilities of an affected sib pair at the disease locus. For a monogenic disease, it is sufficient to know the relative risk to an offspring ($\lambda_O$) and to a sibling ($\lambda_S$) of an affected individual. While these parameters can be dependably estimated from a randomly collected sample, complex ascertainment schemes in genetic epidemiological studies can lead to biases in the estimates [Cordell and Olson, 2000; Olson and Cordell, 2000; Zou and Zhao, 2004]. Moreover, for diseases caused by multiple genes, locus-specific relative risks are not easily determined. If a mode of interaction between the multiple disease susceptibility loci is assumed (e.g., a multiplicative or heterogeneous model), one might be able to estimate the locus-specific relative risks from the population relative risks [Risch, 1990a; Papachristou and Lin, 2006b]. However, the number of interacting loci cannot be accurately estimated [Schliekelman and Slatkin, 2002] and the mode of interaction between the loci cannot be exactly determined.

In view of the difficulties of obtaining the relevant parameters from an outside source, a two-step CSI was proposed as an alternative [Papachristou and Lin, 2006c]. In this approach, there must be different sets of marker data collected on the same set of ASPs, for example, Microsatellite (MS) and SNP markers. Alternatively, if there is only one set of marker data, it might be partitioned into a coarse grid and a fine grid. Step 1 identifies putative disease gene locations and estimates the required IBD sharing probabilities by an ASP at these putative disease loci. With an initial whole-genome linkage scan using the coarse marker data (e.g., MS data), linkage peaks reaching suggestive evidence for linkage are identified, for example, with a prespecified cut-off of 2.33 for the nonparametric linkage statistic proposed by Kong and Cox [1997]. Each such peak is declared a putative disease locus and the IBD sharing probabilities by an ASP at this locus are estimated using a maximum likelihood method [Risch, 1990c; Holmans, 1993]. Focusing on a preliminary broad linkage region, say 25 cM to the left and to the right, around each putative locus identified in step 1, step 2 employs the parameter estimates from step 1 to construct CSI intervals with the fine grid marker data (e.g., SNPs). Papachristou and Lin [2006c] showed that such a two-step procedure maintains the nice properties of CSI, including preserving the nominal coverage. They also showed that estimating the parameters from the data did not lead to biases in these estimates to the extent that diminishes the advantages of the CSI method. Only linkage peaks that are more significant than the cut-off for suggestive linkage in step 1 will be followed up to construct confidence intervals, and a CI will always be constructed around a linkage peak that passes the prespecified cut-off. A liberal cut-off in step 1 will lead to a higher overall probability of including the true disease loci, together with a higher probability of including unlinked portions of the chromosomes. The two-step CSI framework assumes that there is exactly one disease locus within each preliminary broad linkage region.

In what follows, we assume that the required parameters are known and focus the comparison of CSI-MLS and CSI-Mean on the effects of a number of interesting factors. A comparison of the two procedures in a practical setting where the required parameters are not available has been performed using both the simulated and real data contributed to GAW15 [Sinha and Luo, 2007]. Some results of this comparison will be discussed in the Discussion section.

MORE EFFICIENT CSI WITH LIKELIHOOD RATIO TESTS: CSI-MLS

As reviewed in the previous section, CSI is a general framework whose hypotheses can be tested by reformulation of any reasonable traditional linkage test statistic. CSI-Mean is a logical first step given the popularity and early application of the Mean test. In view of CSI's nature of obtaining the confidence sets via inverting the acceptance regions of hypothesis tests, we postulate that tests that are more powerful in traditional linkage studies might carry this advantage over to testing the CSI nulls and thus result in more efficient confidence set estimators. The MLS [Risch, 1990c] was one of the first model-free linkage methods which can properly account for the uncertainty in the number of alleles shared IBD. Furthermore, it makes maximum use of the information available from allelic IBD. Davis and Weeks [1997] compared a number of linkage test statistics and showed that eight of the 10 best statistics were likelihood-based tests (the other two being the Mean test and the Haseman-Elston regression). We derive here the CSI-MLS that tests the CSI nulls using the LRT, and discuss some of the immediate benefits before we compare its efficiency to that of CSI-Mean.

SINGLE-POINT CSI-MLS

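Recall that the hypothesis for each marker $m$, in single-point CSI, is

$$H_{0m}^{(\mathrm{new})}: \theta_m \le \theta_0 \quad \text{versus} \quad H_{Am}^{(\mathrm{new})}: \theta_m > \theta_0. \tag{6}$$

Let ${}_{m}\mathrm{IBD}$ be the number of alleles shared IBD by an ASP at marker $m$. Let $z_i(\theta) = P({}_{m}\mathrm{IBD} = i \mid B, \theta)$, where $\theta$ is the distance between the trait locus and marker $m$. Further, let $G_j^m$ be the genotype data at marker $m$ for the $j$th ASP (or family). The likelihood is [Risch, 1990c]

$$\begin{aligned}
L(\theta) &= \prod_{j=1}^{n} P(G_j^m \mid \theta) \\
&= \prod_{j=1}^{n} \sum_{i=0}^{2} P(G_j^m \mid {}_{m}\mathrm{IBD}_j = i)\, z_i(\theta) \\
&= \prod_{j=1}^{n} \sum_{i=0}^{2} \frac{P({}_{m}\mathrm{IBD}_j = i \mid G_j^m)\, P(G_j^m)}{P({}_{m}\mathrm{IBD}_j = i)}\, z_i(\theta).
\end{aligned} \tag{7}$$

Define $w_{ij} = P({}_{m}\mathrm{IBD}_j = i \mid G_j^m)\, P(G_j^m) / P({}_{m}\mathrm{IBD}_j = i)$.

Then the LRT statistic for hypothesis (6) is

$$\Lambda_m = \frac{\sup_{\theta \le \theta_0} \prod_{j=1}^{n} \left( \sum_{i=0}^{2} z_i(\theta)\, w_{ij} \right)}{\sup_{\theta \le 1/2} \prod_{j=1}^{n} \left( \sum_{i=0}^{2} z_i(\theta)\, w_{ij} \right)}. \tag{8}$$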

For single-locus disease models, the probabilities $z_i(\theta)$ can be expressed in terms of $\lambda_S$, $\lambda_O$ and $\theta$ [Risch, 1990b],

$$\begin{aligned}
z_0(\theta) &= \frac{2\Psi - 1 + 2(2\Psi - 1)(1 - \Psi)\lambda_O + 4(1 - \Psi)^2 \lambda_S}{4\lambda_S}, \\
z_1(\theta) &= \frac{2(2\Psi - 1)^2 \lambda_O + 8\Psi(1 - \Psi)\lambda_S}{4\lambda_S}, \\
z_2(\theta) &= \frac{1 - 2\Psi - 2\Psi(2\Psi - 1)\lambda_O + 4\Psi^2 \lambda_S}{4\lambda_S},
\end{aligned} \tag{9}$$

where $\Psi = \theta^2 + (1 - \theta)^2$. As discussed above, $\lambda_O$ and $\lambda_S$ can either be estimated from independent genetic epidemiological studies or from an initial coarse linkage scan. By representing the $z_i(\theta)$'s as above, we have written likelihood (7) in terms of a single unknown parameter $\theta$. Maximization in the numerator and denominator of (8) is linear in the number of ASPs and thus computationally fast. The parametric estimates of $z_i(\theta)$ obtained in this manner satisfy the possible triangle constraints [Holmans, 1993].
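Equation (9) is simple to evaluate numerically. The following sketch (the function name `ibd_probs` and the example relative risks are hypothetical) computes the three sharing probabilities and can be used to check two limiting cases: no linkage (recombination fraction 1/2) recovers the usual sibling probabilities (1/4, 1/2, 1/4), and the three probabilities always sum to one.

```python
def ibd_probs(theta, lam_s, lam_o):
    """IBD sharing probabilities (z0, z1, z2) of an ASP, following equation (9).

    theta: recombination fraction between the location and the trait locus
    lam_s: relative risk to a sibling; lam_o: relative risk to an offspring
    """
    psi = theta ** 2 + (1.0 - theta) ** 2
    denom = 4.0 * lam_s
    z0 = (2 * psi - 1
          + 2 * (2 * psi - 1) * (1 - psi) * lam_o
          + 4 * (1 - psi) ** 2 * lam_s) / denom
    z1 = (2 * (2 * psi - 1) ** 2 * lam_o + 8 * psi * (1 - psi) * lam_s) / denom
    z2 = (1 - 2 * psi - 2 * psi * (2 * psi - 1) * lam_o
          + 4 * psi ** 2 * lam_s) / denom
    return z0, z1, z2


# Hypothetical relative risks, chosen only for illustration.
unlinked = ibd_probs(0.5, lam_s=2.0, lam_o=1.5)
at_locus = ibd_probs(0.0, lam_s=2.0, lam_o=1.5)
```

At the trait locus itself the expressions reduce to 1/(4λ_S) for sharing zero alleles and λ_O/(2λ_S) for sharing one allele, which is a convenient sanity check on any implementation.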

Of the probabilities involved in (7), $P(G_j^m)$ cancels in the LRT (8). The probabilities $P({}_{m}\mathrm{IBD}_j = i)$ are simply the traditional null IBD sharing probabilities; for siblings they are (1/4, 1/2, 1/4) for sharing (0, 1, 2) alleles IBD, respectively. The third probability, $P({}_{m}\mathrm{IBD}_j = i \mid G_j^m)$, depends on the observed genotypes of the $j$th family. These probabilities are well-known quantities [Haseman and Elston, 1972] and are provided by a number of genetic software packages (e.g., S.A.G.E. [2006]; Merlin [Abecasis et al., 2002]; and Allegro [Gudbjartsson et al., 2000]).

Standard regularity conditions are not satisfied for the LRT (8) to have a simple $\chi^2$ distribution. The asymptotic distribution of the statistic $-2 \ln \Lambda_m$ is a 50:50 mixture of $\chi^2_0$ and $\chi^2_1$ (our situation satisfies the conditions of Case 5 of Self and Liang [1987]). Assuming $\theta_0$ to be half the maximum distance between consecutive markers, the set $A = \{m : H_{0m}^{(\mathrm{new})} \text{ not rejected at level } \alpha\}$ contains at least one marker within distance $\theta_0$ of the disease locus with probability $\ge 1 - \alpha$. The confidence region $R = \bigcup_{m \in A} [\tau_m - \theta_0, \tau_m + \theta_0]$, $\tau_m$ being the location of marker $m$, contains the disease gene with probability no less than $(1 - \alpha)$ [Papachristou and Lin, 2005].

A distinction between the single-point CSI-MLS and CSI-Mean is their requirement for stochastic ordering in their respective test statistic. While there is no such requirement for CSI-MLS, the test statistic in the single-point CSI-Mean needs to be stochastically decreasing in $\theta$ in order for the test to have the correct size. Though such a property is intuitive and was treated heuristically in Papachristou and Lin's [2005] study, a strict mathematical proof is nontrivial and is not available at this point.

MULTI-POINT CSI-MLS

Genotypes at all markers on a chromosome potentially provide additional information about IBD sharing at each location on a chromosome. Such information can be extracted using a Hidden-Markov model [Lander and Green, 1987]. For a MS map, this means that the CSI test can now be conducted at locations between markers. With the advent of high throughput SNP technology we now have maps with thousands of low polymorphism markers. While not very informative individually, when information is extracted from them in a multi-point fashion these maps can be extremely useful for detecting linkage. Hence, the null hypothesis in multi-point CSI is, at each location ($\tau$) on the genome, that $\tau$ is the trait locus. Let $\theta_\tau$ be the recombination fraction between location $\tau$ and the disease locus $\tau^{*}$. Then the multi-point hypothesis is

$$H_{0\tau}^{(\mathrm{new})}: \theta_\tau = 0 \quad \text{versus} \quad H_{A\tau}^{(\mathrm{new})}: \theta_\tau > 0, \tag{10}$$

for every $\tau$ on the genome. The LRT statistic for hypothesis (10) is

$$\Lambda_\tau = \frac{\prod_{j=1}^{n} \left( \sum_{i=0}^{2} z_i(0)\, w_{ij} \right)}{\sup_{\theta \le 1/2} \prod_{j=1}^{n} \left( \sum_{i=0}^{2} z_i(\theta)\, w_{ij} \right)}. \tag{11}$$

The probabilities $z_i(\theta)$ are given in (9). Define $w_{ij} = P({}_{\tau}\mathrm{IBD}_j = i \mid G_j)\, P(G_j) / P({}_{\tau}\mathrm{IBD}_j = i)$, where $G_j$ denotes multi-point marker genotypes of the $j$th family. The terms $P({}_{\tau}\mathrm{IBD}_j = i \mid G_j)$ in $w_{ij}$ are multi-point probabilities which can be calculated using the Hidden Markov model and are available from standard software packages. The distribution of $-2 \ln \Lambda_\tau$ is also a 50:50 mixture of $\chi^2_0$ and $\chi^2_1$. The maximization required is still one-dimensional and hence fast.
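A minimal sketch of the multi-point test of (10) is given below, assuming the posterior IBD probabilities for each ASP have already been obtained from linkage software. All names (`csi_mls`, `log_lik`, `NULL_IBD`) are hypothetical, and a plain grid search over the recombination fraction stands in for whatever one-dimensional optimizer is actually used; the article does not specify one. The factor P(G_j) is dropped because it cancels in the ratio.

```python
import math

NULL_IBD = (0.25, 0.5, 0.25)  # prior IBD sharing probabilities for siblings


def ibd_probs(theta, lam_s, lam_o):
    """(z0, z1, z2) as functions of theta, lambda_S and lambda_O (equation (9))."""
    psi = theta ** 2 + (1.0 - theta) ** 2
    d = 4.0 * lam_s
    z0 = (2 * psi - 1 + 2 * (2 * psi - 1) * (1 - psi) * lam_o
          + 4 * (1 - psi) ** 2 * lam_s) / d
    z1 = (2 * (2 * psi - 1) ** 2 * lam_o + 8 * psi * (1 - psi) * lam_s) / d
    z2 = (1 - 2 * psi - 2 * psi * (2 * psi - 1) * lam_o + 4 * psi ** 2 * lam_s) / d
    return z0, z1, z2


def log_lik(theta, weights, lam_s, lam_o):
    """Log of prod_j sum_i z_i(theta) * w_ij."""
    z = ibd_probs(theta, lam_s, lam_o)
    return sum(math.log(sum(zi * wi for zi, wi in zip(z, w))) for w in weights)


def csi_mls(post_ibd, lam_s, lam_o, grid_size=501):
    """Return (-2 ln Lambda, p-value) for the CSI null theta = 0.

    post_ibd: list of (p0, p1, p2) posterior IBD probabilities per ASP.
    The p-value uses the 50:50 mixture of chi-square(0) and chi-square(1).
    """
    weights = [tuple(p / q for p, q in zip(post, NULL_IBD)) for post in post_ibd]
    null_ll = log_lik(0.0, weights, lam_s, lam_o)
    thetas = [0.5 * k / (grid_size - 1) for k in range(grid_size)]  # grid on [0, 1/2]
    best_ll = max(log_lik(t, weights, lam_s, lam_o) for t in thetas)
    stat = max(0.0, 2.0 * (best_ll - null_ll))
    # P(chi-square(1) > x) = erfc(sqrt(x / 2)); the point mass at 0 gives p = 1.
    p = 1.0 if stat == 0.0 else 0.5 * math.erfc(math.sqrt(stat / 2.0))
    return stat, p


# Hypothetical, fully informative data: ten pairs sharing 2 alleles, ten sharing 0.
stat_linked, p_linked = csi_mls([(0.0, 0.0, 1.0)] * 10, lam_s=2.0, lam_o=1.5)
stat_unlinked, p_unlinked = csi_mls([(1.0, 0.0, 0.0)] * 10, lam_s=2.0, lam_o=1.5)
```

Under the 50:50 mixture, a 5% level corresponds to a threshold of about 2.71, the 0.90 quantile of a chi-square distribution with one degree of freedom.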

Construction of confidence regions. The confidence region for the trait locus is given by the set $A = \{\tau : H_{0\tau}^{(\mathrm{new})} \text{ is not rejected at level } \alpha\}$. Because it is infeasible to perform the test at every point on the genome, we follow Papachristou and Lin [2006b] in conducting the hypothesis tests at a grid of points and then constructing the confidence region by interpolating the test statistics. Let the set of points at which tests are conducted be $\tau_l$, $l = 1, \ldots, L$. Let $T$ be a random variable with the same distribution as $-2 \ln \Lambda_\tau$, under the null hypothesis that $\tau = \tau^{*}$. Define $T_\alpha$ to be a threshold such that $P(T > T_\alpha) = \alpha$. A small value of $-2 \ln \Lambda_\tau$ is evidence in support of the null hypothesis. Therefore, in constructing the confidence set, first put in it all closed intervals $[\tau_l, \tau_{l+1}]$ such that $\max(-2 \ln \Lambda_{\tau_l}, -2 \ln \Lambda_{\tau_{l+1}}) < T_\alpha$. Do not include in the confidence set any intervals $[\tau_l, \tau_{l+1}]$ for which $\min(-2 \ln \Lambda_{\tau_l}, -2 \ln \Lambda_{\tau_{l+1}}) \ge T_\alpha$. In the situation where neither of the above two conditions is satisfied, interpolate $-2 \ln \Lambda_\tau$ with a linear function of $\tau$. If $-2 \ln \Lambda_{\tau_l} > (<)\, T_\alpha$ and $-2 \ln \Lambda_{\tau_{l+1}} < (>)\, T_\alpha$ then include in the confidence region the closed interval $[\tau_0, \tau_{l+1}]$ ($[\tau_l, \tau_0]$), where $\tau_0$ is the point at which the line joining $(\tau_l, -2 \ln \Lambda_{\tau_l})$ and $(\tau_{l+1}, -2 \ln \Lambda_{\tau_{l+1}})$ crosses $T_\alpha$.
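The interpolation rule can be written compactly. The sketch below is one possible reading of the rule, with a hypothetical function name and toy inputs: it walks over consecutive grid points, keeps whole intervals where both statistics are below the threshold, drops intervals where both are at or above it, and otherwise keeps the part on the low side of the linear crossing point.

```python
def confidence_set(grid, stats, threshold):
    """Confidence region from test statistics on a grid, by linear interpolation.

    grid:      increasing locations at which the test was carried out
    stats:     values of -2 ln(Lambda) at those locations
    threshold: critical value T_alpha
    Returns a list of disjoint (low, high) intervals.
    """
    pieces = []
    for l in range(len(grid) - 1):
        t0, t1 = grid[l], grid[l + 1]
        s0, s1 = stats[l], stats[l + 1]
        if max(s0, s1) < threshold:
            pieces.append((t0, t1))          # whole interval retained
        elif min(s0, s1) >= threshold:
            continue                         # whole interval excluded
        else:
            # Point where the line through (t0, s0) and (t1, s1) crosses the threshold.
            cross = t0 + (threshold - s0) * (t1 - t0) / (s1 - s0)
            pieces.append((cross, t1) if s0 > s1 else (t0, cross))
    merged = []
    for lo, hi in pieces:
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


# Toy example: statistics dip below the threshold in the middle of the grid.
cs = confidence_set([0, 1, 2, 3, 4], [5.0, 3.0, 1.0, 1.0, 4.0], threshold=2.0)
```

Adjacent retained pieces are fused at the end so that the result reads as a set of disjoint intervals rather than a list of grid cells.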

POWERS OF CSI-MLS AND CSI-MEAN

We will investigate the range of disease models where the MLS and the Mean test have differing powers for the CSI hypothesis and potentially different efficiencies in CSI. The power of the MLS statistic to test hypothesis (10) is given by $P(-2 \ln \Lambda_\tau > T_\alpha \mid \theta_\tau > 0)$. A closed form of the power function of the MLS test is not available. We simplified the problem by considering a location $\tau$ with complete information about IBD sharing and that is $\theta_\tau$ away from the disease locus in a single-locus disease model. In this case, the number of pairs sharing (0, 1, 2) alleles IBD, $(n_0, n_1, n_2)$, out of $n$ ASPs is distributed as a multinomial $(n; z_0(\theta), z_1(\theta), z_2(\theta))$. Using this fact, Monte-Carlo simulations can be performed to estimate the power of using the MLS test statistic to test hypothesis (10). While the asymptotic power function for a completely informative marker is available for the Mean test statistic [Lin, 2002], we obtained the Monte-Carlo estimate of the power of CSI-Mean for consistency. Papachristou and Lin [2006b] suggested using a two-sided test for hypothesis (10). However, for a completely informative marker, a one-sided Mean test is appropriate and more powerful [Liang et al., 2001; Papachristou and Lin, 2006b]. Hence, we used a one-sided CSI-Mean test when carrying out the comparison.
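The Monte-Carlo power estimate for the MLS test under complete IBD information might be sketched as below. This is an illustration rather than the simulation code used in the study: the function names, the grid search, the number of replicates, the seed, and the example relative risks are all assumptions. The default threshold of 2.706 is the 5% critical value of the 50:50 mixture of chi-square(0) and chi-square(1).

```python
import math
import random


def ibd_probs(theta, lam_s, lam_o):
    """(z0, z1, z2) as functions of theta, lambda_S and lambda_O (equation (9))."""
    psi = theta ** 2 + (1.0 - theta) ** 2
    d = 4.0 * lam_s
    z0 = (2 * psi - 1 + 2 * (2 * psi - 1) * (1 - psi) * lam_o
          + 4 * (1 - psi) ** 2 * lam_s) / d
    z1 = (2 * (2 * psi - 1) ** 2 * lam_o + 8 * psi * (1 - psi) * lam_s) / d
    z2 = (1 - 2 * psi - 2 * psi * (2 * psi - 1) * lam_o + 4 * psi ** 2 * lam_s) / d
    return z0, z1, z2


def mls_stat(counts, lam_s, lam_o, grid_size=101):
    """-2 ln(Lambda) for IBD counts (n0, n1, n2) under complete information."""
    def ll(theta):
        z = ibd_probs(theta, lam_s, lam_o)
        return sum(c * math.log(zi) for c, zi in zip(counts, z))
    thetas = [0.5 * k / (grid_size - 1) for k in range(grid_size)]
    return 2.0 * (max(ll(t) for t in thetas) - ll(0.0))


def mc_power(theta, n, lam_s, lam_o, threshold=2.706, reps=400, seed=1):
    """Fraction of simulated samples of n ASPs in which the CSI null is rejected."""
    rng = random.Random(seed)
    z = ibd_probs(theta, lam_s, lam_o)
    hits = 0
    for _ in range(reps):
        # Multinomial draw of the IBD states of n pairs at the tested location.
        draws = rng.choices((0, 1, 2), weights=z, k=n)
        counts = [draws.count(i) for i in range(3)]
        if mls_stat(counts, lam_s, lam_o) > threshold:
            hits += 1
    return hits / reps


# Hypothetical single-locus model with lambda_S = 4 and lambda_O = 2.
size_at_null = mc_power(0.0, n=200, lam_s=4.0, lam_o=2.0)
power_unlinked = mc_power(0.5, n=200, lam_s=4.0, lam_o=2.0)
```

Evaluating the function at a recombination fraction of zero gives an empirical check of the size of the test, while larger values trace out the power curve.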

For a single-locus disease model, the power ofboth tests depend on three parameters,ðy;VA=K2;VD=K2Þ, where K is the populationprevalence, VA is the additive genetic variance, andVD is the dominance genetic variance. The powerfunction should be increasing in all threeparameters. Note that VA=K2 ¼ ð1� KÞ=K � ðnarrowsense heritabilityÞ. Thus, the parameter VA/K2 is adecreasing function of prevalence and an increasingfunction of heritability. A similar interpretation holdsfor VD/K2. We can map ðVA=K2;VD=K2Þ into twoparameters that are more directly related to the effectof disease models on power

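$$\alpha = \frac{V_A/2 + V_D/4}{K^2 + V_A/2 + V_D/4} = \frac{\lambda_S - 1}{\lambda_S}, \tag{12}$$

$$\delta = \frac{V_D/4}{K^2 + V_A/2 + V_D/4} = \frac{\lambda_S - \lambda_O}{\lambda_S}. \tag{13}$$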

The parameter $\alpha$ lies between 0 and 1 and essentially contains information about the relative importance of the locus in the disease prevalence. The parameter $\delta$ must lie between 0 and $\alpha$ and contains information about how recessive the disease is [Feingold and Siegmund, 1997]. These parameters are closely related to the expected number of alleles shared IBD by an ASP at the disease locus, the relation being $z_1(0) + 2z_2(0) = 1 + (\alpha + \delta)/2$. The higher $\alpha$ is, the more genetic the trait is, and the easier it is to reject the CSI null hypothesis at a marker that is not tightly linked to the disease locus.
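As a numerical check of equations (12) and (13), the sketch below recomputes $\alpha$ and $\delta$ from the relative risks listed in Table I. It is a minimal illustration, not part of the original analysis. It assumes the standard single-locus relations $\lambda_O = 1 + V_A/(2K^2)$, $\lambda_S = 1 + V_A/(2K^2) + V_D/(4K^2)$, $z_0(0) = 1/(4\lambda_S)$, and $z_1(0) = \lambda_O/(2\lambda_S)$, which are not restated in this section. All function names are hypothetical.

```python
def alpha_delta(lambda_o, lambda_s):
    """Equations (12) and (13) expressed through the relative risks."""
    return (lambda_s - 1.0) / lambda_s, (lambda_s - lambda_o) / lambda_s


def alpha_delta_from_variances(va_over_k2, vd_over_k2):
    """Equations (12) and (13) expressed through (V_A/K^2, V_D/K^2)."""
    excess = va_over_k2 / 2.0 + vd_over_k2 / 4.0   # equals lambda_S - 1 (assumed relation)
    return excess / (1.0 + excess), (vd_over_k2 / 4.0) / (1.0 + excess)


def expected_ibd_sharing(lambda_o, lambda_s):
    """Expected number of alleles shared IBD by an ASP at the trait locus."""
    z0 = 0.25 / lambda_s
    z1 = 0.5 * lambda_o / lambda_s
    z2 = 1.0 - z0 - z1
    return z1 + 2.0 * z2


# (lambda_O, lambda_S) for models I, II, and III as listed in Table I.
table1 = {"I": (10.68, 56.68), "II": (2.25, 2.75), "III": (1.78, 1.78)}
for model, (lam_o, lam_s) in table1.items():
    a, d = alpha_delta(lam_o, lam_s)
    print(model, round(a, 3), round(d, 3), round(expected_ibd_sharing(lam_o, lam_s), 3))
```

For models I and II the rounded values reproduce the $\alpha$ and $\delta$ columns of Table I. For model III the table lists $\delta = 0.001$ while the rounded relative risks give 0, which is presumably a rounding effect in the tabulated $\lambda_O$ and $\lambda_S$.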

SIMULATION STUDY SETTINGS

Extensive simulations have been carried out to compare the efficiency of CSI-MLS and CSI-Mean, both single-point and multi-point. The effects of various factors have been studied. These factors include the types of markers and the density of the marker map (MS and SNP maps), disease models with varying degrees of complexity and heritability, and sample sizes. One criterion of comparison is the length of the confidence set that contains the trait locus with a predetermined probability. In addition, we investigate the power of the CSI-MLS and CSI-Mean tests as a function of $(\theta, V_A/K^2, V_D/K^2)$ and identify ranges of the disease models where one is preferred over the other. This power comparison is important in that a difference in power will translate into a corresponding difference in efficiency of the confidence set estimators.

All the marker and trait loci considered are assumed to be in Hardy-Weinberg equilibrium and in linkage equilibrium with each other. The trait locus is always assumed to be diallelic, and the two trait loci are assumed to be on different chromosomes when two-locus trait models are considered. The unit of simulation is an ASP with marker genotypes available for the ASP and their parents. The nominal CP of the CSI confidence sets is set to 95%. Unless otherwise noted, 1,000 replicates are simulated under each setting.

In what follows, we briefly discuss the disease models employed in the simulation. These models have been studied extensively in Papachristou and Lin [2006a,b]. Tables I, II, and III are adapted fully or partially from these two papers.

Single-locus disease models. Three models with heritability ranging from high to low (71%, 50%, 14%), as displayed in Table I, are used in the simulation to compare the efficiency of CSI-MLS and CSI-Mean, both single-point and multi-point. These models also cover modes of inheritance ranging from recessive through additive to dominant.

TABLE I. Three single-locus disease models and their parameters relevant to the power of CSI tests

Model  $K$                     $\lambda_O$  $\lambda_S$  $H^2 = V_G/V_T$  $\alpha$  $\delta$  $P_D$  $f_{DD}$  $f_{Dd}$  $f_{dd}$
I      $3.495 \times 10^{-3}$  10.68        56.68        0.71             0.982     0.812     0.050  0.999     0.001     0.001
II     $1.000 \times 10^{-1}$  2.25         2.75         0.50             0.636     0.182     0.226  0.998     0.105     0.020
III    $8.463 \times 10^{-2}$  1.78         1.78         0.14             0.438     0.001     0.010  0.865     0.827     0.070

$V_G$, genetic variance of the trait; $V_T$, total variance of the trait; $H^2$, broad sense heritability; D, high-risk allele; d, low-risk allele; $P_D$, allele frequency of D; $(f_{DD}, f_{Dd}, f_{dd})$, penetrances of the genotypes (DD, Dd, dd).


To compare the single-point CSI-MLS and CSI-Mean under these three models, we simulated 30 MS markers equally spaced on a 290-cM chromosome (10-cM density). Each marker has 10 equally frequent alleles. The disease locus is at 141 cM, 1 cM from the 15th marker. Each replicate comprises 250 nuclear families with two affected siblings.

We also compare the multi-point versions of the two procedures under these three single-locus trait models. In addition, the effect of having an MS map or a SNP map has been investigated. The MS map comprises 17 MS markers equally spaced on a 120-cM chromosome (7.5-cM density), each with eight equally frequent alleles. The trait locus is at 61.2 cM, 1.2 cM away from the nearest marker. The SNP map comprises one SNP every 0.25 cM along a 60-cM chromosome, each with a minor allele frequency of 0.3. The trait locus is at 30 cM. Sample sizes of 100, 250, 500, and 1,000 (the last for model III only) are simulated.
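The three marker maps described above can be laid out in a few lines, which makes the marker counts and the position of each trait locus relative to its nearest marker easy to verify. This is a sketch under the assumption that the first marker of each map sits at 0 cM and the last at the end of the chromosome. The helper names are hypothetical.

```python
def marker_map(length_cm, spacing_cm):
    """Positions (cM) of equally spaced markers from 0 to length_cm inclusive."""
    n_markers = round(length_cm / spacing_cm) + 1
    return [i * spacing_cm for i in range(n_markers)]


def nearest_marker(positions, locus_cm):
    """1-based index of, and distance (cM) to, the marker closest to locus_cm."""
    idx = min(range(len(positions)), key=lambda i: abs(positions[i] - locus_cm))
    return idx + 1, abs(positions[idx] - locus_cm)


single_point_map = marker_map(290, 10)    # single-point comparison, MS markers
ms_map = marker_map(120, 7.5)             # multi-point MS map
snp_map = marker_map(60, 0.25)            # multi-point SNP map

print(len(single_point_map), nearest_marker(single_point_map, 141))
print(len(ms_map), nearest_marker(ms_map, 61.2))
print(len(snp_map), nearest_marker(snp_map, 30))
```

Under this layout the single-point map has 30 markers with the trait locus 1 cM from the 15th, the MS map has 17 markers with the trait locus 1.2 cM from the nearest one, and the SNP map has 241 SNPs with the trait locus falling on a SNP position.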

Two-locus disease models. Diseases of interest to genetic epidemiologists today are largely complex in nature. Four two-locus trait models (Tables II and III), two epistatic and two heterogeneous, are considered. A fuller discussion of these models can be found in Papachristou and Lin [2006a,b]. We simulate 17 MS markers equally spaced along a 120-cM chromosome for each of the two trait loci, each marker with eight equally frequent alleles. Each trait locus is at 61.2 cM on its respective chromosome. Sample sizes of 100 and 500 families are considered. For each setting we simulated 500 replicates.

RESULTS

COMPARISON BETWEEN SINGLE-POINT CSI-MLS AND CSI-MEAN

The CSI framework was developed to address the interest in better susceptibility gene localization in the era of dense marker maps. Single-point CSI procedures are thus not of practical interest. However, they form the foundation of their multi-point counterparts, and their investigation aids the understanding of the more complex procedures. Figure 1 provides the relative frequency of each marker being included in the confidence sets constructed by CSI-MLS and by CSI-Mean. The mean length of the confidence sets in centimorgan (M), the standard deviation (SD) of the lengths, and the CP are also provided. The mean lengths of the confidence sets are comparable for CSI-MLS and CSI-Mean. The probability of a marker being included in the confidence sets decreases as the marker gets farther away from the trait locus, the rate of decrease being faster for a model with higher heritability. A single-point confidence set does not distinguish between chromosomal positions that are symmetrically located on either side of a marker, and this could lead to confidence regions wider than desired. This fact is reflected in the actual CP of the confidence sets being larger than the nominal.

TABLE II. Two-locus disease models

          Penetrances (a)                                                                           Parameter values (b)
Model     $f_{22}$  $f_{21}$  $f_{20}$  $f_{12}$  $f_{11}$  $f_{10}$  $f_{02}$  $f_{01}$  $f_{00}$  $p_1$  $p_2$  $\phi$
EP-2      $\phi$    $\phi$    0         0         0         0         0         0         0         0.600  0.199  0.778
EP-4      $\phi$    $\phi$    0         $\phi$    0         0         $\phi$    0         0         0.372  0.243  0.911
Het-2     $u$ (c)   $u$ (c)   $\phi$    $\phi$    $\phi$    0         $\phi$    $\phi$    0         0.279  0.040  0.660
S-2       1         1         1         $\phi$    $\phi$    0         $\phi$    $\phi$    0         0.228  0.045  0.574

(a) $f_{ij}$ ($i, j = 0, 1, 2$), the penetrance of a genotype with $i$ copies of the high-risk allele at the first locus and $j$ copies of the high-risk allele at the second locus.
(b) $p_i$ ($i = 1, 2$), frequency of the high-risk allele at the $i$th trait locus.
(c) $u = 2\phi - \phi^2$, the penetrance of a genotype with the high-risk genotypes at each of the two component trait loci, under a heterogeneity model.

TABLE III. Parameters relevant to the power of the CSI tests for the two-locus trait models in Table II

          Trait locus I (a)                                    Trait locus II
Model     $z_0$   $z_1$   $z_2$   $\lambda_O$  $\lambda_S$    $z_0$   $z_1$   $z_2$   $\lambda_O$  $\lambda_S$
EP-2      0.1406  0.4688  0.3906  1.67         1.78           0.1355  0.4866  0.3779  1.80         1.85
EP-4      0.1940  0.4667  0.3393  1.20         1.29           0.1024  0.4331  0.4645  2.11         2.44
Het-2     0.1753  0.4414  0.3833  1.26         1.43           0.1468  0.4979  0.3553  1.70         1.70
S-2       0.1461  0.4047  0.4492  1.39         1.71           0.1696  0.4981  0.3323  1.47         1.47

(a) $z_k$ ($k = 0, 1, 2$), marginal IBD probability for an ASP at the trait locus under consideration; $\lambda_O$ and $\lambda_S$, locus-specific relative risks to an offspring and to a sibling of an affected individual, respectively.

COMPARISON BETWEEN MULTI-POINT CSI-MLS AND CSI-MEAN

We first discuss the simulation results under the three single-locus models given in Table I.

Effects of marker map, heritability, and sample size. The results for an MS map of one MS marker per 7.5 cM are presented in Table IV. The results for a SNP map of one SNP per 0.25 cM, designed to simulate commercially available chips (e.g., the 10K SNP arrays by Affymetrix, Inc., Santa Clara, CA [Matsuzaki et al., 2004] or Illumina, Inc., San Diego, CA [Murray et al., 2004]), are presented in Table V. First of all, Table IV shows no systematic deviation of the CP from the nominal level of 95%, and hence the three methods can be compared in terms of precision, i.e., the length of the constructed confidence set. Similar consistency is observed for the results under the SNP map. For the MS map, two variants of CSI-Mean, CSI-Mean-V2 and CSI-Mean-V3, are compared with our proposed CSI-MLS. Unlike CSI-MLS or CSI-Mean-V3, CSI-Mean-V2 does not use IBD sharing probability estimates at locations between markers, reasoning that these may not be estimated precisely. This consideration seems to be advantageous for a highly heritable trait (Model I) under the MS map, as reflected in CSI-Mean-V2 outperforming both CSI-MLS and CSI-Mean-V3 (Table IV). This advantage vanishes as we consider more realistic models with moderate heritability (Models II and III), where CSI-MLS consistently outperforms both variants of CSI-Mean, reducing the average length of the confidence sets by as much as around 20%.

For the SNP map (Table V), CSI-Mean-V2 is not considered, as there is no difference between CSI-Mean-V3 and CSI-Mean-V2 at this map density. CSI-MLS consistently outperforms CSI-Mean under all three models, reducing the average length of the confidence sets by 7–20%.

Comparing results in Tables IV and V, it is apparent that the current panel of SNP maps provides confidence sets much more precise than those obtained from MS maps, sometimes reducing the length by more than 50%. The underlying etiology of the disease also determines the precision with which one is able to localize the susceptibility gene. The more genetic a trait is, the easier it is to localize the susceptibility gene, as demonstrated in both Tables IV and V.

Fig. 1. The relative frequency (%) of each marker being included in the confidence sets obtained by the single-point confidence set inference (CSI) procedures. The three rows correspond to the three single-locus models in Table I. A total of 30 markers at a density of one per 10 cM are considered. We set h0 to be 5 cM in equation (6). There are 250 affected sibling pairs per replicate. The true trait locus is marked by "X". Each tick mark represents a marker.

The reformulation of the traditional hypothesis to the CSI null, that the location being tested is the trait locus, guarantees that, asymptotically, a CSI test will produce an interval that includes only the trait location, provided the assumptions are correct. In other words, as the sample size increases, the length of a level $(1-\alpha)$ confidence set decreases. This property of the CSI framework, using either CSI-MLS or CSI-Mean, is apparent in Tables IV and V. For each disease model and marker map, the length of the CSI intervals decreases as the sample size increases, irrespective of which version of CSI is used. In Table V, increasing the sample size from 100 to 250 reduces the length of the confidence set to around 60% of its original value. A further increase of the sample size to 500 reduces the length to around 40%. For model I, 100 ASPs are more than enough to provide a tight region to be followed up on, whereas 250 ASPs are needed under model II and the sample size should be around 500 under model III.

Effect of complexity of the disease model. To compare CSI-MLS and CSI-Mean in their efficiency in localizing the susceptibility genes involved in complex traits, the four two-locus models in Table II are simulated on an MS map with one marker per 7.5 cM. The results are presented in Table VI. As before, we assumed that the exact IBD sharing at each of the trait loci by an ASP is known, which amounts to knowing the locus-specific relative risks. Both variants of CSI-Mean were investigated and they yielded essentially the same result under each model. This is intuitive as, judging from the locus-specific relative risks, none of the trait loci in the four two-locus models has a stronger genetic effect than that of model II among the three one-locus models. Again, the coverage probabilities are approximately equal to the nominal level and thus the efficiency of the confidence sets can be compared based on their lengths. CSI-MLS consistently provides confidence sets that are 11–23% shorter than those provided by CSI-Mean. The lengths are reduced to around 40% by increasing the sample size from 100 to 500.

The effect of the underlying disease model is clearly seen in the ability of the methods to narrow trait locus regions. For the epistatic model EP-2, both loci have similar contributions and each method is able to localize the two loci with similar precision. For the other epistatic model, EP-4, the second trait locus has a greater contribution to the disease and is identified with much greater precision. The lengths of the confidence sets are similar for the two loci in the heterogeneity model Het-2. For S-2, the first trait locus has a much higher sibling relative risk than the second locus and is easier to localize. Comparing Tables IV and VI, we observe that for each method, the confidence set length is approximately a decreasing function of $(\lambda_O, \lambda_S)$, as discussed before. For example, the effect of the second locus in Het-2 ($\lambda_O = 1.70$, $\lambda_S = 1.70$) is slightly less than that of model III ($\lambda_O = 1.78$, $\lambda_S = 1.78$), and this leads to slightly wider confidence sets obtained from CSI-MLS (21.4 versus 20.8 cM with 500 ASPs).

TABLE V. Confidence sets from multi-point CSI procedures on the SNP map

                   CSI-MLS          CSI-Mean-V3
Model  No. ASPs    M      SS (%)    M      SS (%)    RR (%)
I      100         4.2    –         4.5    –         6.7
I      250         2.6    62        2.9    64        10.3
I      500         1.9    45        2.3    51        17.3
II     100         18.5   –         21.4   –         13.6
II     250         10.5   57        12.1   57        13.2
II     500         7.3    39        8.3    39        12.0
III    100         35.0   –         40.6   –         13.8
III    250         21.2   61        25.7   63        17.5
III    500         13.7   39        17.0   42        19.4
III    1000        9.1    26        11.4   28        20.2

Information includes mean length of confidence sets in centimorgan (M), effect of sample size (SS = Length of CI / Length of CI with 100 ASPs × 100%), and relative reduction in length obtained by using CSI-MLS instead of CSI-Mean (RR = (Length of CSI-Mean − Length of CSI-MLS) / Length of CSI-Mean × 100%).
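The two derived columns, SS and RR, follow directly from the mean lengths. The short sketch below recomputes them for model II on the SNP map from the rounded lengths printed in Table V. It is only an illustration of the two definitions, and the function and variable names are hypothetical.

```python
def sample_size_effect(length, length_at_100):
    """SS: length as a percentage of the length obtained with 100 ASPs."""
    return 100.0 * length / length_at_100


def relative_reduction(length_mean, length_mls):
    """RR: percentage reduction in length from using CSI-MLS instead of CSI-Mean."""
    return 100.0 * (length_mean - length_mls) / length_mean


# Mean lengths (cM) for model II on the SNP map: n_asps -> (CSI-MLS, CSI-Mean-V3).
model2 = {100: (18.5, 21.4), 250: (10.5, 12.1), 500: (7.3, 8.3)}

for n_asps, (mls, mean) in model2.items():
    ss_mls = sample_size_effect(mls, model2[100][0])
    ss_mean = sample_size_effect(mean, model2[100][1])
    rr = relative_reduction(mean, mls)
    print(n_asps, round(ss_mls), round(ss_mean), round(rr, 1))
```

Because the inputs are the rounded table entries, a recomputed value can differ from the tabulated one in the last digit. For instance, model I with 500 ASPs gives RR of about 17.4 from the rounded lengths against the tabulated 17.3, presumably because the table was computed from unrounded lengths.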

TABLE IV. Confidence sets from the three multi-point CSI procedures on the MS map

                   CSI-MLS               CSI-Mean-V3           CSI-Mean-V2
Model  No. ASPs    M      SD     CP      M      SD     CP      M      SD     CP
I      100         10.6   3.5    0.945   10.6   2.7    0.949   9.1    2.4    0.950
I      250         8.8    3.3    0.940   8.8    2.1    0.942   6.1    1.3    0.935
I      500         7.7    3.0    0.943   7.8    1.8    0.932   4.4    0.8    0.932
II     100         25.4   11.8   0.942   28.7   12.0   0.941   28.3   11.8   0.938
II     250         16.5   6.5    0.957   18.7   6.4    0.953   17.7   6.2    0.957
II     500         13.0   4.7    0.953   14.3   4.5    0.935   13.1   4.1    0.943
III    100         53.0   23.8   0.951   65.3   23.7   0.949   65.8   22.9   0.948
III    250         29.5   14.4   0.931   36.0   15.2   0.935   36.0   15.0   0.943
III    500         20.8   9.5    0.941   24.6   9.6    0.935   23.8   9.4    0.933
III    1000        14.9   6.4    0.944   17.4   6.2    0.950   16.3   6.1    0.934

Information includes mean length of confidence sets in centimorgan (M), standard deviation (SD) of the lengths, and coverage probability (CP).


POWER OF CSI-MLS AND CSI-MEAN

In the above two sections, we have demonstrated that CSI-MLS is more efficient than CSI-Mean under three single-locus and four two-locus models, representing a wide range of heritability and mode of inheritance. It is of interest to see how widely we can generalize this conclusion. The CSI framework builds upon the duality of the acceptance regions of hypothesis tests and the confidence sets. More powerful tests should lead to more efficient confidence set estimators. Although the power of the traditional MLS and Mean tests, together with other linkage tests, has been extensively studied [e.g., Davis and Weeks, 1997], the CSI-MLS and CSI-Mean tests use the same linkage data to test the CSI nulls, which are dramatically different from the traditional linkage null. Thus, it is a new problem that warrants more investigation.

In this section, we explore the power to reject the null hypothesis of complete linkage for markers with varying degrees of linkage to a disease locus. With this aim in mind, we proceed in two ways. First, we limit our attention to the three single-locus models where we already have the comparison of efficiency of confidence sets obtained from both CSI procedures. Figure 2 shows the empirical power curves for CSI-MLS and CSI-Mean-V3. Both tests maintain the nominal type I error rate of 5% at the trait locus. CSI-MLS is always more powerful than CSI-Mean across the rest of the chromosome. In general, both tests attain high power under a highly heritable model, reflected in the power going quickly to 1 under model I (with heritability of 0.71). As the genetic contribution decreases (going from model I to II, and then to III), the power of both tests decreases and the power difference between the two tests becomes more pronounced. These conclusions agree with those drawn from the efficiency comparison of the two CSI procedures presented in the previous sections.

Next, we investigate the effect of a comprehensive range of disease models on the power of the two CSI procedures to reject the CSI null at a completely informative marker located at a range of distances from the trait locus. The sample size was fixed at 500 ASPs. We sample 10,000 replicates for each setting, which results in the standard error of the estimate of the power being no more than 0.005. As discussed in the Methods section, the effect of the disease models is entirely determined by sets of model-related parameters. There are at least three sets of one-to-one parameters that play this role: $(\lambda_O, \lambda_S)$, $(V_A/K^2, V_D/K^2)$, and $(\alpha, \delta)$. While the relative risks are appealing to epidemiologists, it is the locus-specific relative risks that are relevant in our investigation. When the disease model involves more than one trait locus, this set of parameters loses its interpretational appeal. While the third set of parameters is more directly related to the power of the tests, it is far less familiar to the researchers in the field than the second set of parameters. Thus, we investigate the power of the two CSI tests over a wide range of disease models as a function of $(\theta, V_A/K^2, V_D/K^2)$, $\theta$ being the recombination fraction between the marker and the trait loci. The values of $(V_A/K^2, V_D/K^2)$ at which the powers are estimated were selected so that $\alpha < 0.65$, which covers most diseases of current interest with the trait locus contributing only a portion of the trait variability. The results are summarized in Table VII and Figure 3.
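The stated bound on the standard error follows from the binomial variance of an estimated rejection probability, $\sqrt{p(1-p)/N}$, which is largest at $p = 0.5$. A minimal check, with a hypothetical function name:

```python
import math


def proportion_se(p, n_reps):
    """Standard error of a Monte-Carlo estimate of a probability p from n_reps replicates."""
    return math.sqrt(p * (1.0 - p) / n_reps)


worst_case = proportion_se(0.5, 10_000)     # largest possible SE with 10,000 replicates
at_nominal = proportion_se(0.05, 10_000)    # SE when the true rejection rate is 5%
with_1000 = proportion_se(0.5, 1_000)       # worst case with 1,000 replicates

print(round(worst_case, 4), round(at_nominal, 4), round(with_1000, 4))
```

The worst case with 10,000 replicates is 0.005, so estimates reported to two decimal places are stable. With 1,000 replicates the corresponding bound would be roughly three times larger.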

From Table VII, the power increases with increasing distance from the trait locus. At the trait locus (0 cM), 5% of the null hypotheses are expected to be rejected, and this is satisfied up to two decimal places. At a locus that is 200 cM away from the trait locus, the power is almost 1 over the entire range of disease models. The power increases with both $V_A/K^2$ and $V_D/K^2$. Figure 3 shows the contour plot of the relative increase in power of CSI-MLS versus CSI-Mean at a locus that is 5 or 10 cM away from the trait locus. At 5 cM, CSI-MLS is almost always more powerful than CSI-Mean, with the increase in power ranging from 5 to 40%. Similar conclusions result for a marker that is 10 cM away. At large distances, both methods have power close to 1 and hence there is no visible difference in power between the two methods. Greater gains in power by using CSI-MLS instead of CSI-Mean are observed in situations where the dominance variance is the larger part of the trait variance. Such situations represent recessive diseases.

TABLE VI. Confidence sets from CSI procedures on two-locus disease models

                   Trait locus I                               Trait locus II
                   CSI-MLS       CSI-Mean-V3                   CSI-MLS       CSI-Mean-V3
Model  No. ASPs    M      SS     M      SS     RR              M      SS     M      SS     RR
EP-2   100         48.1   –      57.8   –      16.8            50.1   –      60.1   –      16.7
EP-2   500         18.7   39     21.5   37     13.0            19.4   39     22.3   37     13.0
EP-4   100         78.1   –      91.5   –      23.0            29.3   –      34.5   –      15.1
EP-4   500         34.2   44     44.1   48     22.4            13.3   45     15.0   43     11.3
Het-2  100         59.3   –      74.8   –      20.1            58.3   –      69.7   –      16.4
Het-2  500         22.1   37     27.4   37     19.3            21.4   36     25.5   37     16.0
S-2    100         36.9   –      48.1   –      23.3            74.9   –      85.5   –      12.4
S-2    500         16.2   44     19.6   41     17.3            29.5   39     35.8   42     17.6

Information includes mean length of confidence sets in centimorgan (M), effect of sample size (SS = Length of CI / Length of CI with 100 ASPs × 100%), and relative reduction in length obtained by using CSI-MLS instead of CSI-Mean (RR = (Length of CSI-Mean − Length of CSI-MLS) / Length of CSI-Mean × 100%).

A subtlety that is not immediately apparent from Figure 3 is that, for disease models with very little genetic effect, CSI-MLS is conservative. For example, when $V_A/K^2 = 0$ and $V_D/K^2 = 0$, CSI-MLS has a type I error rate of 0% (compared with the nominal level of 5%). This is due to the non-identifiability of the likelihood when the parameters $V_A/K^2$ and $V_D/K^2$ are small. This results in the dark region near the origin in the plots of Figure 3. CSI-Mean maintains the type I error rates in such situations. However, no method will have any power to detect the trait locus under these disease models and hence they are of little interest. For a model with a genetic effect as modest as $\alpha = 0.06$, CSI-MLS has correct type I error rates and better power than CSI-Mean (Table VII).

Fig. 2. Power curves of CSI-MLS (solid line) and CSI-Mean-V3 (dashed line) under the three single-locus models given in Table I, estimated with 1,000 replicates of 250 affected sibling pairs each. The SNP map is used. The dotted line is the level of the test (5%). The trait locus is marked.

Fig. 3. Contour plots of relative differences (%) in power in terms of $(V_A/K^2, V_D/K^2)$, at a marker that is 5 or 10 cM away from the trait locus. The relative difference is defined as (Power of CSI-MLS − Power of CSI-Mean) / Power of CSI-Mean × 100%.

DISCUSSION

In this article, we have proposed a novel approach, CSI-MLS, that adds to the growing literature responding to the need for confidence sets of disease gene locations to better focus fine-mapping efforts. The single-point CSI-MLS performs comparably with its motivator, CSI-Mean. The multi-point CSI-MLS almost always provides more precise confidence sets than CSI-Mean, over a wide range of disease models (monogenic or otherwise), marker types, and sample sizes. The only exception is the situation where the disease model is monogenic and highly heritable (with a heritability of 71%) and the map is a 7.5-cM MS one. This type of model is unrealistic for any complex trait of interest. The reduction in the lengths of the confidence sets obtained by replacing CSI-Mean with CSI-MLS is of the order of 10–20% in many situations. There are several reasons for CSI-MLS being more efficient: (1) the MLS statistic makes maximum use of available IBD information; and (2) the MLS detects any departure from the expected IBD sharing probabilities, whereas the Mean test focuses on detecting only departure in the mean IBD sharing.

The underlying disease model is a critical factor in determining the size of the confidence sets for the trait locus. The more genetic a trait is, the shorter the interval will be. We have shown that over a large subspace of the disease models, there is considerable advantage in using CSI-MLS over CSI-Mean, with a power gain of as much as 40%. The gain in power is more pronounced for recessive diseases.

TABLE VII. Power of CSI-MLS and CSI-Mean

                           $V_D/K^2$
$\theta$ (cM)  $V_A/K^2$   0.12               0.36        0.72        1.08        1.76        2.88
0              0.06        0.05, 0.05 (a,b)   0.05, 0.05  0.05, 0.05  0.05, 0.06  0.05, 0.05  0.05, 0.05
0              0.18        0.05, 0.05         0.05, 0.05  0.05, 0.05  0.05, 0.05  0.05, 0.05  0.05, 0.05
0              0.36        0.05, 0.05         0.05, 0.05  0.05, 0.05  0.05, 0.05  0.05, 0.05  0.05, 0.05
0              0.54        0.05, 0.05         0.05, 0.05  0.05, 0.05  0.05, 0.06  0.05, 0.05  0.05, 0.05
0              0.88        0.05, 0.05         0.05, 0.05  0.05, 0.05  0.05, 0.05  0.05, 0.05  0.05, 0.05
0              1.44        0.05, 0.05         0.05, 0.05  0.05, 0.05  0.05, 0.05  0.05, 0.05  0.05, 0.05
5              0.06        0.09, 0.08         0.19, 0.14  0.37, 0.22  0.55, 0.33  0.81, 0.52  0.96, 0.76
5              0.18        0.11, 0.11         0.20, 0.15  0.38, 0.25  0.55, 0.35  0.80, 0.54  0.96, 0.77
5              0.36        0.15, 0.14         0.23, 0.21  0.38, 0.29  0.54, 0.40  0.77, 0.56  0.95, 0.78
5              0.54        0.19, 0.18         0.26, 0.24  0.40, 0.35  0.54, 0.44  0.75, 0.59  0.94, 0.79
5              0.88        0.26, 0.26         0.32, 0.32  0.44, 0.41  0.56, 0.50  0.74, 0.64  0.92, 0.82
5              1.44        0.39, 0.38         0.44, 0.44  0.51, 0.50  0.60, 0.60  0.75, 0.72  0.90, 0.85
10             0.06        0.13, 0.11         0.37, 0.25  0.74, 0.49  0.93, 0.71  1, 0.93     1, 1
10             0.18        0.18, 0.17         0.42, 0.31  0.75, 0.56  0.92, 0.76  1, 0.95     1, 1
10             0.36        0.28, 0.27         0.48, 0.43  0.77, 0.65  0.92, 0.82  0.99, 0.96  1, 1
10             0.54        0.38, 0.38         0.57, 0.53  0.80, 0.72  0.93, 0.86  0.99, 0.97  1, 1
10             0.88        0.56, 0.56         0.69, 0.69  0.85, 0.82  0.94, 0.91  0.99, 0.98  1, 1
10             1.44        0.79, 0.78         0.85, 0.85  0.91, 0.91  0.96, 0.96  0.99, 0.99  1, 1
30             0.06        0.28, 0.23         0.81, 0.65  1, 0.97     1, 1        1, 1        1, 1
30             0.18        0.46, 0.44         0.88, 0.80  1, 0.98     1, 1        1, 1        1, 1
30             0.36        0.71, 0.70         0.95, 0.93  1, 1        1, 1        1, 1        1, 1
30             0.54        0.88, 0.88         0.98, 0.98  1, 1        1, 1        1, 1        1, 1
30             0.88        0.99, 0.98         1, 1        1, 1        1, 1        1, 1        1, 1
30             1.44        1, 1               1, 1        1, 1        1, 1        1, 1        1, 1
200            0.06        0.42, 0.36         0.95, 0.90  1, 1        1, 1        1, 1        1, 1
200            0.18        0.70, 0.68         0.98, 0.97  1, 1        1, 1        1, 1        1, 1
200            0.36        0.93, 0.93         1, 1        1, 1        1, 1        1, 1        1, 1
200            0.54        0.99, 0.99         1, 1        1, 1        1, 1        1, 1        1, 1
200            0.88        1, 1               1, 1        1, 1        1, 1        1, 1        1, 1
200            1.44        1, 1               1, 1        1, 1        1, 1        1, 1        1, 1

(a) Each cell gives (probability of rejection for CSI-MLS, probability of rejection for CSI-Mean).
(b) Each value is correct to 2 decimal places with a standard error of at most 0.005.


A second advantage of CSI-MLS over CSI-Mean is its computational efficiency. Using an AMD Athlon 2800+ (running Linux) for 100 ASPs with a 60-cM SNP map of 241 SNPs, the multi-point CSI-MLS took 2 min, while CSI-Mean took 31 min. This is not surprising given that CSI-Mean needs to perform a large number of Monte-Carlo simulations. A further advantage of CSI-MLS is that it does not need the assumption of no missing genotypes that is required for the Monte-Carlo simulations of CSI-Mean.

The SNP and MS map densities considered in the simulations are representative of current standards. At these densities, the SNP map clearly leads to shorter confidence intervals than the MS maps. Our findings agree with those of Papachristou and Lin [2006b]. These results, together with the technology and cost efficiency of dense SNP panels, imply that SNP panels might be increasingly used in future linkage studies. The optimal density of the SNP chips is the one that maximizes density without inducing linkage disequilibrium (LD) between markers. LD between markers can lead to bias in the estimation of multi-point IBD sharing probabilities [Xing et al., 2006] and can thus affect the performance of our method. For the 10K SNP chip, a simple remedy is to remove some SNPs that are in LD with other SNPs, say using a cut-off of $r^2 = 0.2$. Our experience indicates that this will not result in much loss of efficiency.
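One way to apply such a cut-off is a greedy pass along the map that keeps a SNP only if its $r^2$ with every SNP already kept does not exceed the threshold. The sketch below is a hypothetical illustration, not the procedure used by the authors, who do not specify one. It assumes phased haplotypes coded 0/1 per SNP and uses the usual definition $r^2 = D^2 / (p_A(1-p_A)\,p_B(1-p_B))$ with $D = p_{AB} - p_A p_B$. The toy data and function names are invented for the example.

```python
def r_squared(snp_a, snp_b):
    """Pairwise LD measure r^2 between two SNPs given 0/1 alleles on the same haplotypes."""
    n = len(snp_a)
    p_a = sum(snp_a) / n
    p_b = sum(snp_b) / n
    p_ab = sum(x * y for x, y in zip(snp_a, snp_b)) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denom if denom > 0 else 0.0   # monomorphic SNPs carry no LD


def prune_by_ld(snps, cutoff=0.2):
    """Greedy pruning in map order: keep a SNP if r^2 <= cutoff with all kept SNPs."""
    kept = []
    for idx, snp in enumerate(snps):
        if all(r_squared(snp, snps[k]) <= cutoff for k in kept):
            kept.append(idx)
    return kept


# Toy data: eight haplotypes at three SNPs (hypothetical values).
toy_snps = [
    [0, 0, 1, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 1, 0, 1],   # identical to the first SNP, so r^2 = 1
    [0, 1, 1, 1, 0, 0, 1, 0],   # uncorrelated with the first SNP, so r^2 = 0
]
print(prune_by_ld(toy_snps))
```

With real, unphased genotype data, $r^2$ would have to be estimated, for example from inferred haplotype frequencies, and a distance window would normally limit the number of pairwise comparisons.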

On the practical side, the CSI framework requires knowledge of the IBD sharing probabilities of an ASP at the disease gene locus. In this article, we assume that these quantities are known. Papachristou and Lin [2006c] proposed a two-step procedure to use CSI as an intermediate mapping approach, where the required parameters were estimated in the first step. They have shown that CSI-Mean maintained the desirable statistical properties with this two-step approach. In a recent application of both CSI-MLS and CSI-Mean to the real and simulated data contributed to GAW15 [Sinha and Luo, 2007], we have found that the advantage of CSI-MLS over CSI-Mean is even more pronounced in a two-step CSI procedure, particularly when parental genotypes are not available and when the sample size is large. The reduction in the length of the CI reaches close to 50% in some cases.

We have considered ASPs. CSI-MLS might be improved by using information from other relative types as well. Risch [1990b] showed that for diseases with large relative risks, distant relatives offer greater power to detect linkage. The likelihood in eq. (11) can be easily generalized to other relative types. Such a generalization enables one to extract linkage information from larger pedigrees with many different relative types.

In the current report, we assume that the chromosome investigated contains exactly one disease gene. In the two-step approach, we assume that the preliminary broad linkage region contains at most one disease gene. Admittedly, this does not reflect the fine-scale complexity underlying most diseases of interest today. The effect that a departure from the one-gene assumption has on the localization ability of CSI will depend on how the IBD sharing probabilities of an ASP at the susceptibility loci are distorted. In some situations, a broad linkage peak might be due to two linked susceptibility genes, perhaps more than 10 cM apart. Localization of such genes has been taken up by Biernacka et al. [2005]. If the effect of one gene is much stronger than the other, CSI might be able to localize the gene with the stronger effect. CSI will fail if the two genes are of similar effect and are close to each other. However, as all the information that is used to construct the CSI intervals is the IBD sharing probabilities at the resolution of current linkage studies, mutations that are very close to each other, say within 0.1 cM of each other, will effectively appear to be one "super locus" relative to the resolution of a linkage study, and CSI may perform well in localizing this "super locus".

Our approach aims at providing a confidence set based on sound statistical principles and efficiency. The issue of how big a region to follow up on with fine mapping will depend also on nonstatistical considerations, including the budget and the characteristics of the implicated genomic regions (such as whether this is a gene-poor or gene-rich region, knowledge about candidate genes, and how big the candidate genes are). Our analyses show that, for a number of disease models with small effects, CSI-MLS is able to produce a confidence region of <15 cM with 500 ASPs. When one is constrained by resources or time, the proper strategy would be to start by constructing a confidence set with a suitable CP that leads to a feasible region that can be studied in detail. Depending on resources, one can then systematically examine the confidence sets with greater coverage probabilities.

ELECTRONIC RESOURCES

The URL for the R code implementing CSI-MLS is: http://darwin.cwru.edu/~rsinha/work/csi-mls/.

ACKNOWLEDGMENTS

The authors acknowledge the use of the software S.A.G.E. [2006] and R [R Development Core Team, 2005]. In addition, the authors are grateful to Dr. Shili Lin and Dr. Sudha Iyengar for helpful discussions. We thank the reviewers and the editor for constructive comments.

REFERENCESAbecasis GR, Cherny SS, Cookson WO, Cardon LR. 2002.

Merlin–rapid analysis of dense genetic maps using sparse

gene flow trees. Nat Genet 30:97–101.Biernacka JM, Sun L, Bull SB. 2005. Simultaneous localization of

two linked disease susceptibility genes. Genet Epidemiol 28:33–47.

Blackwelder WC, Elston RC. 1985. A comparison of sibpairlinkage tests for disease susceptibility loci. Genet Epidemiol 2:85–97.

Bull SB, John S, Briollais L. 2005. Fine mapping by linkage andassociation in nuclear family and case-control designs. GenetEpidemiol 29(Suppl):S48–S58.

Chiou JM, Liang KY, Chiu YF. 2005. Multipoint linkage mapping using sibpairs: non-parametric estimation of trait effects with quantitative covariates. Genet Epidemiol 28:58–69.

Cordell HJ, Olson JM. 2000. Correcting for ascertainment bias of relative-risk estimates obtained using affected-sib-pair linkage data. Genet Epidemiol 18:307–321.

Davis S, Weeks DE. 1997. Comparison of nonparametric statistics for detection of linkage in nuclear families: single-marker evaluation. Am J Hum Genet 61:1431–1444.

Dupuis J, Siegmund D. 1999. Statistical methods for mapping quantitative trait loci from a dense set of markers. Genetics 151:373–386.

Feingold E, Siegmund DO. 1997. Strategies for mapping heterogeneous recessive traits by allele-sharing methods. Am J Hum Genet 60:965–978.

Glidden DV, Liang KY, Chiu YF, Pulver AE. 2003. Multipoint affected sibpair linkage methods for localizing susceptibility genes of complex diseases. Genet Epidemiol 24:107–117.

Gudbjartsson DF, Jonasson K, Frigge M, Kong A. 2000. Allegro, a new computer program for multipoint linkage analysis. Nat Genet 25:12–13.

Haseman JK, Elston RC. 1972. The investigation of linkage between a quantitative trait and a marker locus. Behav Genet 2:3–19.

Hauser ER, Boehnke M, Guo SW, Risch N. 1996. Affected-sib-pair interval mapping and exclusion for complex genetic traits: sampling considerations. Genet Epidemiol 13:117–137.

Holmans P. 1993. Asymptotic properties of affected sib-pair linkage analysis. Am J Hum Genet 52:362–374.

Hossjer O. 2003. Assessing accuracy in linkage analysis by means of confidence regions. Genet Epidemiol 25:59–72.

Kong A, Cox NJ. 1997. Allele-sharing models: Lod scores and accurate linkage tests. Am J Hum Genet 61:1179–1188.

Kristjansson K, Manolescu A, Kristinsson A, Hardarson T, Knudsen H, Ingason S, Thorleifsson G, Frigge ML, Kong A, Gulcher JR, Stefansson K. 2002. Linkage of essential hypertension to chromosome 18q. Hypertension 39:1044–1049.

Lander ES, Botstein D. 1989. Mapping mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121:185–199.

Lander ES, Green P. 1987. Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci USA 84:2363–2367.

Lebrec J, Putter H, van Houwelingen JC. 2006. Potential bias in generalized estimating equations linkage methods under incomplete information. Genet Epidemiol 30:94–100.

Lewinger JP, Lee SS, Biernacka J, Wu LY, Shi HS, Bull SB. 2005. Comparison of family-based association tests in chromosome regions selected by linkage-based confidence intervals. BMC Genet 30(6 Suppl 1):S62.

Liang KY, Chiu YF, Beaty TH. 2001. A robust identity-by-descent procedure using affected sib pairs: multipoint mapping for complex diseases. Hum Hered 51:64–78.

Liang KY, Huang CY, Beaty TH. 2000. A unified sampling approach for multipoint analysis of qualitative and quantitative traits in sib pairs. Am J Hum Genet 66:1631–1641.

Lin S. 2002. Construction of a confidence set of markers for the location of a disease gene using affected sibpair data. Hum Hered 53:103–112.

Lin S, Rogers JA, Hsu JC. 2001. A confidence-set approach for finding tightly linked genomic regions. Am J Hum Genet 68:1219–1228.

Mangin B, Goffinet B, Rebai A. 1994. Constructing confidence intervals for QTL location. Genetics 138:1301–1308.

Matsuzaki H, Dong SL, Loi H, Di XJ, Liu GY, Hubbell E, Law J, Bernsten T, Chadha M, Hui H, Yang GR, Kennedy GC, Webster TA, Cawley S, Walsh PS, Jones KW, Fodor SPA, Mei R. 2004. Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays. Nat Methods 1:109–111.

Murray SS, Oliphant A, Shen R, McBride C, Steeke RJ, Shannon SG, Rubano T, Kermani BG, Fan JB, Chee MS, Hansen MST. 2004. A highly informative SNP linkage panel for human genetic studies. Nat Methods 1:113–117.

Nemesure BB, Greenberg DA, Mendell NR. 1995. Simulation study comparing interval estimates for the recombination fraction. Genet Epidemiol 12:351–359.

Olson JM, Cordell HJ. 2000. Ascertainment bias in the estimation of sibling genetic risk parameters. Genet Epidemiol 18:217–235.

Papachristou C, Lin S. 2005. A confidence set inference procedure for gene mapping using markers with incomplete polymorphism. Hum Hered 59:1–13.

Papachristou C, Lin S. 2006a. A comparison of methods for intermediate fine mapping. Genet Epidemiol 30:677–689.

Papachristou C, Lin S. 2006b. Microsatellites versus single-nucleotide polymorphisms in confidence interval estimation of disease loci. Genet Epidemiol 30:3–17.

Papachristou C, Lin S. 2006c. A two-step procedure for constructing confidence intervals of trait loci with application to a rheumatoid arthritis dataset. Genet Epidemiol 30:18–29.

R Development Core Team. 2005. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org. ISBN 3-900051-07-0.

Risch N. 1990a. Linkage strategies for genetically complex traits. I. Multilocus models. Am J Hum Genet 46:222–228.

Risch N. 1990b. Linkage strategies for genetically complex traits. II. The power of affected relative pairs. Am J Hum Genet 46:229–241.

Risch N. 1990c. Linkage strategies for genetically complex traits. III. The effect of marker polymorphism on analysis of affected relative pairs. Am J Hum Genet 46:242–253.

S.A.G.E. 2006. Statistical Analysis for Genetic Epidemiology, Release 5.2. URL http://genepi.cwru.edu/.

Schaid DJ, Sinnwell JP, Thibodeau SN. 2005. Robust multipoint identical-by-descent mapping for affected relative pairs. Am J Hum Genet 76:128–138.

Schliekelman P, Slatkin M. 2002. Multiplex relative risk and estimation of the number of loci underlying an inherited disease. Am J Hum Genet 71:1369–1385.

Self SG, Liang KY. 1987. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J Am Stat Assoc 82:605–610.

Sinha R, Luo Y. 2007. Two-step intermediate fine mapping with likelihood ratio test statistics: applications to problems 2 and 3 datasets of GAW15. BMC Genet (In Press).

Visscher PM, Thompson R, Haley CS. 1996. Confidence intervals in QTL mapping by bootstrapping. Genetics 143:1013–1020.

Xing C, Sinha R, Xing G, Lu Q, Elston RC. 2006. The affected-discordant sib-pair design can guarantee validity of multipoint model-free linkage analysis of incomplete pedigrees when there is marker-marker disequilibrium. Am J Hum Genet 79:396–401.

Zou G, Zhao H. 2004. The estimation of sibling genetic risk parameters revisited. Genet Epidemiol 26:286–293.
