

Copyright © 2007 by the Genetics Society of America
DOI: 10.1534/genetics.107.081299

Mapping Quantitative Trait Loci From a Single-Tail Sample of the Phenotype Distribution Including Survival Data

Mikko J. Sillanpää*,1 and Fabian Hoti*,†

*Department of Mathematics and Statistics, University of Helsinki, FIN-00014 Helsinki, Finland and †National Public Health Institute, Department of Vaccines, FIN-00300 Helsinki, Finland

Manuscript received August 29, 2007
Accepted for publication October 5, 2007

ABSTRACT

A new effective Bayesian quantitative trait locus (QTL) mapping approach for the analysis of single-tail selected samples of the phenotype distribution is presented. The approach extends the affected-only tests to single-tail sampling with quantitative traits such as the log-normal survival time or censored/selected traits. A great benefit of the approach is that it enables the utilization of multiple-QTL models, is easy to incorporate into different data designs (experimental and outbred populations), and can potentially be extended to epistatic models. In inbred lines, the method exploits the fact that the parental mating type and the linkage phases (haplotypes) are known by definition. In outbred populations, two-generation data are needed, for example, selected offspring and one of the parents (the sires) in breeding material. The idea is to statistically (computationally) generate a fully complementary, maximally dissimilar, observation for each offspring in the sample. Bayesian data augmentation is then used to sample the space of possible trait values for the pseudoobservations. The benefits of the approach are illustrated using simulated data sets and a real data set on the survival of F2 mice following infection with Listeria monocytogenes.

1 Corresponding author: Department of Mathematics and Statistics, P.O. Box 68, University of Helsinki, FIN-00014 Helsinki, Finland. E-mail: [email protected]

Genetics 177: 2361–2377 (December 2007)

Quantitative trait locus (QTL) mapping methods often assume that the trait, conditionally on the effects of the QTL, follows a normal distribution. However, nonrandom missing data patterns resulting from single-tail sampling may violate this assumption. The target in single-tail sampling is to increase the expected genotype–phenotype correlation of a sample with respect to the original population parameters. By sampling (ascertaining) individuals from the right tail of the phenotype distribution, the genotype frequencies for QTL with positive phenotype effects are potentially enriched. Similarly, sampling individuals from the left tail of the phenotype distribution can increase our chances to find QTL with negative effects. Single-tail sampling may also arise from censoring or if a quantitative trait exhibits measurable values only for a portion of the individuals, i.e., there is a spike in the phenotype distribution (Broman 2003). However, due to single-tail sampling, the phenotypic variation of a sample may become too small for standard QTL mapping methods to work properly, i.e., the signal is totally masked by the error. Therefore current approaches to QTL mapping of data resulting from single-tail sampling of the phenotype distribution consider the deviation of the allele- (or genotype-) frequency distribution at the marker loci from their Mendelian expectation, use logistic regression-based analysis strategies, or combine both of these approaches (Henshall and Goddard 1999; Beasley et al. 2004; Tenesa et al. 2005). Alternatively one can apply nonparametric/semiparametric methods, rank-based statistical procedures, or a robust mixture model to analyze such data (Kruglyak and Lander 1995; Zou et al. 2002, 2003; Broman 2003; Feenstra and Skovgaard 2004). A disadvantage of these approaches is that a single-QTL model is implicitly assumed, since only a single chromosomal position is tested at a time.

As stated in Luo et al. (2005), the viability (survival) of an individual can be simply defined as a binary phenotype indicating whether an individual has survived (y = 1) or not (y = 0). For continuous survival (or failure) time data, such as time to tumor or time to death (measured in logarithmic scale), the single-tail sampling approach can be considered (Broman 2003). Alternatively, methods exist for survival phenotypes (Diao et al. 2004; Moreno et al. 2005). In controlled crosses, several methods have been designed specially to map viability loci, the gene positions that have an influence on the fitness or the survival of an individual (e.g., Vogl and Xu 2000; Luo and Xu 2003; Luo et al. 2005; Nixon 2006). In outbred populations, similar/related methods are adopted to locate the signatures of selection, i.e., the genomic regions having been under selective pressure (subject to natural or artificial selection). It is well known that (1) the variability (diversity) is reduced, (2) the linkage disequilibrium is enriched, and (3) the segregation ratios depart from their Mendelian expectations in the genomic regions at the immediate surroundings of the gene positions that influence survival. The size of the effect (selection intensity) the position has on the survival can be indirectly monitored via the extent of the above influences and their decay as a function of the genetic distance. Thus, the general rationale behind the mapping methods of such loci is in testing distorted segregation, testing linkage disequilibrium patterns, or comparing levels of genetic variability between the particular genomic position and other parts of the genome or between species. Again, as a drawback, a single-QTL model is usually implicitly assumed in these methods. Moreover, a common difficulty in applying these methods in outbred populations is that the demographic history (population growth or recent expansion) leaves kinds of local signs in the genome similar to those of selection (e.g., Schlötterer 2003).

For case–control and association studies of binary traits in human genetics, it is common that one samples affected individuals only (see Greenland 1999). There is the affecteds-only test for trios, where genotyped or haplotyped parents and their affected offspring are both collected (e.g., Falk and Rubinstein 1987; Terwilliger and Ott 1992; Lander and Schork 1994; Gauderman et al. 1999). Such a test is constructed between the cases and the controls where the individuals of the control sample, so-called pseudocontrols, “artificial controls,” or “antisibs,” are created from the genetic material that was not transmitted from the parents to the cases (their chromosomes are mirror images of the case chromosomes). The information needed to generate the chromosomes for the antisibs is obtained from the parental haplotypes by taking the complement of the genetic material of the parents that was transmitted to the affected offspring (see Figure 1). The ability to derive such a complement on the basis of the parental genotype data depends only on the parental mating type for the marker (marker informativeness); e.g., it is easy to derive complemental observations for mating type AB × CD. Some single-locus tests also utilize genotypic pseudoobservations in 1:3 proportions. By adopting the pseudocontrol approach, one can obtain well-matched controls and avoid spurious associations due to ethnic confounding, i.e., closer kinship (higher degree of background linkage disequilibrium) in the affected sample (Terwilliger and Weiss 1998).

Here we bring this antisib idea to mapping QTL in experimental crosses (e.g., backcross and F2) of inbred lines as well as provide a theoretical basis for applying this method to outbred populations using a multiple-QTL model. To illustrate the methodology, we extend the affected-only tests to single-tail sampling with quantitative traits. In the method, continuous-trait values are generated for all of the antisibs on the basis of Bayesian hierarchical modeling and data augmentation. For data augmentation, see Albert and Chib (1993), Rubin (1996), and Van Dyk and Meng (2001). Unlike many others who consider mapping QTL (for quantitative, viability, or survival data) from single-tail samples, we use a multiple-QTL model.

To map signatures of selection, viability, and other binary traits from case-only data (where only selected/survived individuals are in our sample and have phenotypic value one), it is straightforward to genotype individuals from the single-phenotypic group only (i.e., survived individuals). In the continuous-trait case, one can selectively sample backcross or F2 progenies so that the (case) individuals from only one tail of the phenotype distribution are genotyped. Pseudocontrols, corresponding to observations from the other tail of the phenotype distribution, can then be created as mirror images for each case individual. The binary phenotypic value of the pseudocontrol individuals is zero and their continuous-scale observations (so-called liability values) can be predicted using data augmentation. (Note that the liability values of the case individuals are already observed.) The genotype data for these artificial observations can be created on the basis of the genotypes and linkage phases of the parents, which in inbred line-cross designs are known by definition. For example, in backcross, one can follow the principle that the mirror image of genotype AA is AB and vice versa. In F2, the mirror images of the three genotypes AA, AB, BB are BB, AB, AA, respectively. The same principle applies for outbred crosses and populations where the parental genotypes and haplotypes are known or estimated.

Figure 1. A general representation on how chromosomes of the “antisib” (pseudoobservation) are created on the basis of parental information. The genetic material is numbered (and shaded) according to the four original grandparental sources and the recombination points are indicated with vertical lines. Note that the genetic material of the antisib is complementary (each allele originating from the other grandparent) to that of the real offspring. Also the recombinations occur in constant places and make chromosomes (inherited from the same parent) equally probable between real and pseudoobservations. Because the offspring and its antisib have different QTL genotypes ({1, 3} and {2, 4}) and they do not share any QTL alleles with each other, they also are expected (in the presence of the 1-QTL model) to have a different status in their binary phenotypes (ovals).
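The mirroring rules for inbred line crosses stated above can be written as a small lookup table. The following is a minimal sketch, assuming genotypes are stored as the strings "AA", "AB", and "BB"; the names `MIRROR` and `mirror_genotypes` are illustrative and do not come from the article.

```python
# Mirror-image (antisib) genotypes for inbred line crosses.
# Assumption: genotypes are coded as the strings "AA", "AB", "BB".
MIRROR = {
    "backcross": {"AA": "AB", "AB": "AA"},
    "f2": {"AA": "BB", "AB": "AB", "BB": "AA"},
}


def mirror_genotypes(genotypes, cross):
    """Return the antisib genotype at every marker of one individual."""
    table = MIRROR[cross]
    return [table[g] for g in genotypes]


offspring = ["AA", "AB", "BB", "AB"]
antisib = mirror_genotypes(offspring, "f2")
print(antisib)  # ['BB', 'AB', 'AA', 'AB']
```

Applying the mapping twice returns the original genotypes, which is what "mirror image" implies for these designs.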

MODEL

Notation: Let us consider an inbred line-cross experiment (e.g., backcross, double haploids, or F2) with $N_{\mathrm{gen}}$ possible genotypes. We assume that measurements of a quantitative trait and marker genotypes at $N$ loci have been obtained from $N_{\mathrm{ind}}$ individuals sampled from a single tail of the phenotype distribution. To consider a pseudocontrol idea (Figure 1) for quantitative traits, it is easy to adopt a liability and threshold model framework (e.g., Albert and Chib 1993). Further, we assume that each sample has a mirror image (hidden observation) in the unsampled part of the phenotype distribution. Using case–control and threshold model terminology this assumption means that for each case individual (whose liability value is measured) we need one control individual (whose liability value is systematically lower/greater than that in case individuals).

Denote the phenotype and marker data of the observed offspring as $(y^o, M^o)$ and the hidden phenotype and marker measurements of the offspring data as $(y^h, M^h)$. From here on we refer to individual $i$ and its unobserved counterpart as pair $i$. (Note that survival data, where genotypes $M^h$ are observed for censored individuals, are a special case of this setting.) The observed and hidden phenotype vectors, $y^o = (y^o_1, \dots, y^o_{N_{\mathrm{ind}}})$ and $y^h = (y^h_1, \dots, y^h_{N_{\mathrm{ind}}})$, include the observed and hidden phenotypes of pair $i$, $y^o_i$ and $y^h_i$, respectively. Similarly, $M^o = (m^o_{i,j})$ and $M^h = (m^h_{i,j})$ are the observed and hidden marker matrices where the elements, $m^o_{i,j}$ and $m^h_{i,j}$, are the coded genotypes from the set $\{1, \dots, N_{\mathrm{gen}}\}$ for pair $i$ on marker $j$. To allow that some marker genotypes may be missing among the observed half of the individuals, the incomplete form of $M^o$ is denoted by $M^* = (m^*_{i,j})$. Note that if there are no missing marker genotypes, then $M^o = M^*$.

To exploit the “mirroring idea” by applying data augmentation on the hidden observations $(y^h, M^h)$, it is helpful to consider that the observed phenotypes ($y^o$) give rise to the discrete auxiliary variables (discrete phenotypes). Let $z^o = (z^o_1, \dots, z^o_{N_{\mathrm{ind}}})$ denote a discrete phenotype vector where for individual $i$, $z^o_i = 1_{\{y^o_i > T\}}$. Here $T$ is a (known) discretization threshold and the binary phenotype $z^o_i$ obtains value 1 if the underlying (observed) continuous phenotype $y^o_i$ is higher than the threshold $T$ and $z^o_i = 0$ otherwise. Similarly, for the hidden observations $y^h$, we have a discrete vector $z^h = (z^h_1, \dots, z^h_{N_{\mathrm{ind}}})$, where $z^h_i = 1_{\{y^h_i > T\}} = 1 - z^o_i$. To conclude, in single-tail sampling, (1) the threshold $T$ uniquely determines the proportion of individuals selected from the phenotype distribution, (2) all elements in vector $z^o$ are either 0 or 1, and (3) all elements in vector $z^h$ are either 0 or 1 and opposite to $z^o$. In the case that $T$ is unknown, depending on which tail has been sampled we can define $T$ as the smallest or the highest phenotype value.
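The discretization just described is short enough to spell out. This is a minimal sketch under the assumption that phenotypes are held in a plain Python list; the function names `default_threshold` and `discretize` and the example values are hypothetical.

```python
def default_threshold(y_obs, tail):
    """Fallback when T is unknown: the smallest observed value for a
    right-tail sample, the largest one for a left-tail sample."""
    return min(y_obs) if tail == "right" else max(y_obs)


def discretize(y_obs, threshold):
    """Return (z_obs, z_hid): z_obs[i] = 1 if y_obs[i] > T, and the hidden
    counterpart always takes the opposite value, z_hid[i] = 1 - z_obs[i]."""
    z_obs = [1 if y > threshold else 0 for y in y_obs]
    z_hid = [1 - z for z in z_obs]
    return z_obs, z_hid


# Right-tail sample with a known threshold T = 2.0 (made-up numbers).
y_obs = [2.4, 3.1, 2.9, 4.0]
z_obs, z_hid = discretize(y_obs, threshold=2.0)
print(z_obs, z_hid)  # [1, 1, 1, 1] [0, 0, 0, 0]
```

With a right-tail sample every observed individual is a "case" and every hidden counterpart a "control", so the two vectors are constant and opposite, as points (2) and (3) state.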

Phenotype model: We adopt the additive multiple-QTL model considered earlier by Xu (2003). This model is closely related to the model of Meuwissen et al. (2001) and can be viewed as a submodel of Hoti and Sillanpää (2006). Although it is straightforward to include also pairwise epistatic interaction terms in the design matrix of this model (see Zhang and Xu 2005; Xu 2007), we omit such extensions here. For considering other models and designs, see the discussion. In the model, let us assume that the putative QTL can be placed only at marker points. However, this is not a very restrictive assumption because in experimental designs, arbitrary map positions (putative QTL) can be included into the analysis as pseudomarkers (Sen and Churchill 2001). Given the overall mean $\alpha$ and the effect-specific coefficients $\beta = (\beta_{j,k})$ of marker $j$, the phenotypes $y^s_i$, $s = o, h$, of the pair $i$ can be expressed as

\[
y^s_i = \alpha + \sum_{j=1}^{N} \sum_{k=1}^{N_{\mathrm{gen}}} \beta_{j,k}\, 1_{\{m^s_{i,j} = k\}} + \varepsilon^s_i, \tag{1}
\]

where the residuals (the phenotypes after correcting for QTL effects) are assumed to be normally distributed, $\varepsilon^s_i \sim N(0, \sigma^2_e)$, with unknown variance $\sigma^2_e$. Note that this same model is assumed for both the observed $(y^o_i)$ and the hidden $(y^h_i)$ phenotypes. The indicator variable $1_{\{m^s_{i,j} = k\}} = 1$ if the marker observation $m^s_{i,j}$ equals genotype code $k$ and is 0 otherwise. For each marker $j$ we introduce the constraint $\beta_{j,1} = 0$. Thus for a backcross or double haploids, where $N_{\mathrm{gen}} = 2$, only a single coefficient $\beta_{j,2}$ at each marker is needed to capture the contrast between the two genotypes. Similar treatment for F2, where $N_{\mathrm{gen}} = 3$, leads to two coefficients, $\beta_{j,2}$ and $\beta_{j,3}$, that can be estimated for each marker. Here we use a random-variance model, where the genetic coefficients $\beta_{j,k}$, for $k > 1$, are assumed to be normally distributed $N(0, \sigma^2_{j,k})$ with unknown variances $\sigma^2_{j,k}$. In the following, we denote all unknown QTL parameters together as $\theta = (\alpha, \beta, \sigma^2, \sigma^2_e)$, where $\sigma^2 = (\sigma^2_{j,k})$ is a vector of the effect-specific variances.

Key assumptions (which should be considered jointly because assumption 1 is a necessary condition for assumption 2):

1. Given haplotypes for parents, genotype data of the hidden observations is (generated to be) maximally dissimilar to the observed data (cf. Terwilliger and Ott 1992): When we create a mirror image (see Figure 1), we actually produce a maximally dissimilar pseudoindividual for each observation with respect to the marker data. This means that by doing so we maximize the information content of the sample (see O’Brien and Funk 2003). Note the related approaches that try to maximize the information in the sample by selecting individuals or pairs of individuals to be phenotyped on the basis of their genetic dissimilarity (Jin et al. 2004; Jannink 2005; Xu et al. 2005; Fu and Jansen 2006).

2. The phenotypes of the hidden observations are (generated using data augmentation) either on the left or the right side of the truncation point and the observed phenotypes: We briefly consider what is assumed at the genetic level, in the presence of the additive multiple-QTL model (1), when “mirroring” of the genotypes is performed. Now, we assume ordering of the phenotypes with respect to the threshold $T$:

\[
y^h_i \le T \le y^o_i \quad \text{for all pairs } i. \tag{2}
\]

To see what this means at the genetic level, we substitute model (1) into both sides of Equation 2 so that $\alpha$ cancels out, and we obtain

\[
\sum_{j=1}^{N} \sum_{k=1}^{N_{\mathrm{gen}}} \beta_{j,k}\, 1_{\{m^h_{i,j} = k\}} + \varepsilon^h_i \;\le\; T \;\le\; \sum_{j=1}^{N} \sum_{k=1}^{N_{\mathrm{gen}}} \beta_{j,k}\, 1_{\{m^o_{i,j} = k\}} + \varepsilon^o_i, \tag{3}
\]

where residuals $\varepsilon^h_i$ and $\varepsilon^o_i$ are both independently normally distributed with mean zero. Let us consider the following cases:

i. When the residuals are ordered $\varepsilon^h_i \le \varepsilon^o_i$ in the same way as phenotypes of Equation 2: For such pairs $i$, Equation 3 imposes an ordering constraint for the sum of the QTL effects that the mirrored genotypes at the QTL loci need to fulfill. However, this ordering constraint is relaxed by the nonnegative factor $f_i = \varepsilon^o_i - \varepsilon^h_i$. This means that the model can cope with some proportion of phenocopies (i.e., such data individuals whose phenotype is not in agreement with the QTL model).

ii. When the residuals are ordered $\varepsilon^h_i > \varepsilon^o_i$ in the opposite way as the phenotypes of Equation 2: For such pairs $i$, the ordering constraint is adjusted by the negative factor $f_i = \varepsilon^o_i - \varepsilon^h_i$. This means that for some of the phenotypes it is required that the ordering constraint is fulfilled in a tighter form.

We now represent Equation 3 as

\[
\sum_{j=1}^{N} \sum_{k=1}^{N_{\mathrm{gen}}} \beta_{j,k}\, 1_{\{m^h_{i,j} = k\}} \;\le\; \sum_{j=1}^{N} \sum_{k=1}^{N_{\mathrm{gen}}} \beta_{j,k}\, 1_{\{m^o_{i,j} = k\}} + f_i, \tag{4}
\]

where $f_i = \varepsilon^o_i - \varepsilon^h_i$ is a relaxation/adjustment factor of the ordering constraint, whose sign and size depend on the rank and the difference of the two residuals, respectively. Further understanding of this question requires simulation studies that are not in the scope of this article.

Assumptions 1 and 2 together imply that each sample has a mirror image (hidden observation) in the unsampled part of the phenotype distribution.
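To make the ordering constraint of Equation 4 concrete, the sketch below evaluates the genetic part of model (1) for one backcross pair and checks the inequality for given residuals. It assumes genotype codes 1 and 2 with the backcross mirror 1 ↔ 2; the effect values, the residuals, and the function names are made up for illustration only.

```python
def genetic_value(genotypes, beta):
    """Sum of QTL effects in model (1) for one individual.
    beta[j][k] is the effect of genotype code k at marker j (beta[j][1] = 0)."""
    return sum(beta[j][g] for j, g in enumerate(genotypes))


def ordering_holds(g_hid, g_obs, eps_hid, eps_obs):
    """Equation 4: g_hid <= g_obs + f_i with f_i = eps_obs - eps_hid."""
    f_i = eps_obs - eps_hid
    return g_hid <= g_obs + f_i


# Hypothetical effects for three markers in a backcross (N_gen = 2).
beta = [{1: 0.0, 2: 3.0}, {1: 0.0, 2: -2.0}, {1: 0.0, 2: 1.0}]

m_obs = [2, 1, 2]               # observed genotype codes
m_hid = [3 - g for g in m_obs]  # backcross mirror image: 1 <-> 2

g_obs = genetic_value(m_obs, beta)  # 3.0 + 0.0 + 1.0 = 4.0
g_hid = genetic_value(m_hid, beta)  # 0.0 - 2.0 + 0.0 = -2.0

# Case ii (eps_hid > eps_obs): f_i = -1.0 tightens the constraint,
# which still holds here because -2.0 <= 4.0 - 1.0.
print(ordering_holds(g_hid, g_obs, eps_hid=0.5, eps_obs=-0.5))  # True
```

A pair whose genetic values are ordered the wrong way can still satisfy Equation 4 if $f_i$ is large enough, which corresponds to the phenocopy situation described in case i.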

Hierarchical model: In Bayesian analysis, the aim is to obtain an estimate for the posterior distribution of the model parameters given the data, $p(\theta, y^h, M^h, M^o, z^o, z^h \mid y^o, M^*)$. This can be achieved by using Markov chain Monte Carlo (MCMC) methods, exploiting the fact that the posterior is proportional to the joint distribution of the parameters and the data, $p(\theta, y^h, M^h, M^o, z^o, z^h, y^o, M^*)$. By adopting suitable conditional independence assumptions (leading to the graphical model of Figure 2), the joint distribution of the parameters and the data can be presented as

\[
\begin{aligned}
p(\theta, y^h, M^h, M^o, z^o, z^h, y^o, M^*) ={}& p(z^h \mid z^o, y^h)\, p(z^o \mid y^o)\, p(y^o, y^h \mid \theta, M^o, M^h) \\
& \times p(M^h \mid M^o)\, p(M^* \mid M^o)\, p(M^o)\, p(\theta),
\end{aligned}
\]

where the likelihood can be factorized as

\[
\begin{aligned}
p(y^o, y^h \mid \theta, M^o, M^h) &= p(y^o \mid \theta, M^o)\, p(y^h \mid \theta, M^h) \\
&= \prod_{i=1}^{N_{\mathrm{ind}}} \prod_{j=1}^{N} p(y^o_i \mid \theta, M^o_{i,j}) \prod_{i=1}^{N_{\mathrm{ind}}} \prod_{j=1}^{N} p(y^h_i \mid \theta, M^h_{i,j}).
\end{aligned}
\]

The functional forms of $p(y^o_i \mid \theta, M^o_{i,j})$ and $p(y^h_i \mid \theta, M^h_{i,j})$ are normal densities of the residuals $\varepsilon^o_i$ and $\varepsilon^h_i$ of model (1) with mean zero and variance $\sigma^2_e$ (see, e.g., Sillanpää and Arjas 1998).

Constraining priors: This model includes an exceptionally high number of constraining priors, which introduce restrictions into the (MCMC) sampling scheme and take care of the consistency between variables. Their specific forms are given below. The prior for the discretized hidden observations is $p(z^h \mid z^o, y^h) = \prod_{i=1}^{N_{\mathrm{ind}}} p(z^h_i \mid z^o_i, y^h_i)$, where $p(z^h_i \mid z^o_i, y^h_i) = p(z^h_i \mid z^o_i)\, p(z^h_i \mid y^h_i)$. Here, $p(z^h_i = 1 \mid z^o_i) = 1_{\{z^o_i = 0\}}$, $p(z^h_i = 0 \mid z^o_i) = 1_{\{z^o_i = 1\}}$, $p(z^h_i = 1 \mid y^h_i) = 1_{\{y^h_i > T\}}$, and $p(z^h_i = 0 \mid y^h_i) = 1_{\{y^h_i \le T\}}$. Moreover, the prior for the discretized “nonhidden” observations is $p(z^o \mid y^o) = \prod_{i=1}^{N_{\mathrm{ind}}} p(z^o_i \mid y^o_i)$, where $p(z^o_i = 1 \mid y^o_i) = 1_{\{y^o_i > T\}}$ and $p(z^o_i = 0 \mid y^o_i) = 1_{\{y^o_i \le T\}}$. The markers of the pseudoobservations are created according to the mirroring prior $p(M^h \mid M^o) = \prod_{i=1}^{N_{\mathrm{ind}}} \prod_{j=1}^{N} p(m^h_{i,j} \mid m^o_{i,j})$, where $p(m^h_{i,j} = 1 \mid m^o_{i,j}) = 1_{\{m^o_{i,j} = 0\}}$ and $p(m^h_{i,j} = 0 \mid m^o_{i,j}) = 1_{\{m^o_{i,j} = 1\}}$. Also the indicator function prior $p(M^* \mid M^o) = 1_{\{M^o \text{ is consistent with } M^*\}}$ is used to ensure that the complete marker observations are compatible with the observed data.
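Because every constraining prior above is an indicator, its logarithm is either 0 or minus infinity, and a sampler only needs to know whether the current state is admissible. The sketch below checks the phenotype-related indicators and the consistency prior $p(M^* \mid M^o)$; the mirroring prior $p(M^h \mid M^o)$ is left out for brevity. It assumes missing genotypes in $M^*$ are coded as `None`; this coding and all function names are illustrative assumptions, not taken from the article.

```python
import math


def log_indicator(condition):
    """log of an indicator function: 0 if the condition holds, -inf otherwise."""
    return 0.0 if condition else -math.inf


def log_constraining_prior(y_obs, y_hid, z_obs, z_hid, m_obs, m_star, T):
    """Sum of log indicator priors; returns 0.0 for an admissible state."""
    total = 0.0
    for yo, yh, zo, zh in zip(y_obs, y_hid, z_obs, z_hid):
        total += log_indicator(zo == (1 if yo > T else 0))  # p(z^o | y^o)
        total += log_indicator(zh == 1 - zo)                # p(z^h | z^o)
        total += log_indicator(zh == (1 if yh > T else 0))  # p(z^h | y^h)
    for row_obs, row_star in zip(m_obs, m_star):
        for g_obs, g_star in zip(row_obs, row_star):
            # p(M* | M^o): every non-missing entry of M* must match M^o.
            total += log_indicator(g_star is None or g_star == g_obs)
    return total


# One pair, two markers, right-tail sample with T = 2.0 (made-up values).
ok = log_constraining_prior(
    y_obs=[3.0], y_hid=[1.0], z_obs=[1], z_hid=[0],
    m_obs=[[2, 1]], m_star=[[2, None]], T=2.0,
)
print(ok)  # 0.0
```

Any proposed hidden phenotype above the threshold, or any imputed genotype that contradicts an observed one, drives the sum to minus infinity, so such a state has zero posterior probability.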

Other priors: In inbred line-cross data, the prior used to handle missing observations can be presented as a Markov chain $p(M^o) = \prod_{i=1}^{N_{\mathrm{ind}}} \left[ p(M^o_{i,1}) \prod_{j=2}^{N} p(M^o_{i,j} \mid M^o_{i,j-1}) \right]$. For the actual forms of the transition probabilities $p(M^o_{i,j} \mid M^o_{i,j-1})$ in various designs, see Jiang and Zeng (1997) and Sillanpää and Arjas (1998). The prior for the QTL parameters can be factorized as $p(\theta) = p(\alpha)\, p(\beta \mid \sigma^2)\, p(\sigma^2)\, p(\sigma^2_e)$, where $p(\alpha) \propto 1$, $p(\beta \mid \sigma^2) = \prod_{j=1}^{N} \prod_{k=2}^{N_{\mathrm{gen}}} p(\beta_{j,k} \mid \sigma^2_{j,k})$, and $p(\sigma^2) = \prod_{j=1}^{N} \prod_{k=2}^{N_{\mathrm{gen}}} p(\sigma^2_{j,k})$. Further, $p(\beta_{j,k} \mid \sigma^2_{j,k})$ is the density function of a normal distribution with mean zero and variance $\sigma^2_{j,k}$, and $p(\sigma^2_{j,k}) \propto 1/\sigma^2_{j,k}$ is the Jeffreys scale invariant prior having most of the support (mass) in values near zero. Also, for the residual variance, we choose $p(\sigma^2_e) \propto 1/\sigma^2_e$. The use of effect-specific variance components together with Jeffreys’ prior is well justified because the prior adaptively shrinks QTL variances at unlinked positions to zero, which then leads to the positioning of QTL with nonnegligible effects (see Xu 2003; Hoti and Sillanpää 2006). Note that Meuwissen et al. (2001) fitted a common variance for all coefficients at a single locus, which, however, does not lead to an equally sparse solution. Although Jeffreys’ prior seems to work extremely well in this context, theoretically the posterior is improper because Jeffreys’ prior has an infinite amount of mass near zero (e.g., Hobert and Casella 1996; ter Braak et al. 2005). One way to avoid the theoretical problem is to specify a small positive number as a lower bound for the parameter in the prior, which we, however, did not apply here.
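The shrinkage behavior of the effect-specific variances can be illustrated with a single Gibbs-style draw. Multiplying the normal density of $\beta_{j,k}$ by the prior $1/\sigma^2_{j,k}$ gives a conditional density proportional to $(\sigma^2_{j,k})^{-3/2} \exp\{-\beta^2_{j,k}/(2\sigma^2_{j,k})\}$, which is a scaled inverse chi-square distribution with one degree of freedom. The sketch below draws from it as $\beta^2/\chi^2_1$. It is an illustration under these stated assumptions, not necessarily the exact update used by the authors (the article defers to Xu 2003 and Hoti and Sillanpää 2006 for the scheme); the function name is hypothetical.

```python
import random


def sample_effect_variance(b, rng):
    """Draw sigma^2 from the density proportional to
    (sigma^2)^(-3/2) * exp(-b^2 / (2 sigma^2)), i.e. b^2 / chi-square(1)."""
    chi2_1 = 0.0
    while chi2_1 == 0.0:                   # guard against an exact zero draw
        chi2_1 = rng.gauss(0.0, 1.0) ** 2  # squared standard normal = chi-square, 1 df
    return b * b / chi2_1


rng = random.Random(2007)
small = sorted(sample_effect_variance(0.01, rng) for _ in range(1001))
large = sorted(sample_effect_variance(1.0, rng) for _ in range(1001))
# The draws scale with b^2, so a coefficient near zero keeps its variance near zero.
print(small[500] < large[500])  # True
```

The draw is proportional to $\beta^2_{j,k}$, which is the mechanism behind the adaptive shrinkage: once a coefficient is sampled close to zero, its variance is pulled toward zero as well.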

Parameter estimation: We use a MCMC algorithm (e.g., Casella and George 1992; Chib and Greenberg 1995) to estimate the posterior distribution of the unknown model parameters. Here we assume that the truncation point $T$ of the original population is known, equals the smallest (or the highest) phenotypic value in the data, or has been successfully estimated before the analysis. Also if the phenotypic mean $\bar{y}$ of the original population is available, we can utilize it as a starting value for $\alpha$; otherwise we initialize $\alpha$ to zero (i.e., we set $\bar{y} = 0$). We use nonzero starting values for the variances so that nonzero values are proposed for all effects, i.e., all positions initially explain the phenotype. In the following, we outline the MCMC sampling scheme used for continuous traits. For survival data, if genotypes $M^h$ are observed/available for censored individuals, we can omit the generation of mirror images in steps 1 and 3. For binary traits, step 4 below is replaced by the step “the updating liabilities for a binary trait” found in earlier articles (Kilpikari and Sillanpää 2003; Hoti and Sillanpää 2006):

1. Specify initial values $\alpha = \bar{y}$, ($\beta_{j,k} = 0$, $\sigma^2_{j,k} = 0.5$, $j = 1, \dots, N$, $k = 2, \dots, N_{\mathrm{gen}}$), $\sigma^2_e = 0.5$; initialize the missing genotypes in $M^o$ from their prior distribution; and generate mirror images $M^h$ conditionally on $M^o$.

2. Update the QTL parameters ($\theta$) needed in the phenotype model according to the Gibbs sampling scheme outlined elsewhere (Xu 2003; Hoti and Sillanpää 2006).

3. Update missing values in $M^o$ and the corresponding mirror images in $M^h$ using a separate Metropolis–Hastings step for each individual and each marker. Propose the genotypes from their prior distribution $p(M^o)$. The acceptance ratio contains only the likelihood $p(y^o, y^h \mid \theta, M^o, M^h)$. Note, however, that each change in $M^o$ also changes the value of $p(y^h \mid \theta, M^h)$ because $M^h$ contains the mirror image of the proposed value.

4. Update $y^h$ using Gibbs sampling. A new $y_i^h$ is sampled (for each individual separately) from $p(y_i^h \mid \theta, M_i^h, y_i^o > T)$ if the observations $y^o$ have been collected from the right tail of the phenotype distribution, and $y_i^h$ is sampled from $p(y_i^h \mid \theta, M_i^h, y_i^o \le T)$ if $y^o$ are from the left tail. The fully conditional posterior distributions are

$$p(y_i^h \mid \theta, M_i^h, y_i^o > T) = \frac{1}{\sqrt{2\pi\sigma_e^2}} \exp\left(-\tfrac{1}{2}\,\mu_i^2/\sigma_e^2\right) \Big/ \phi(\mu_i/\sigma_e) \times 1_{\{y_i^h \le T\}}$$

and

$$p(y_i^h \mid \theta, M_i^h, y_i^o \le T) = \frac{1}{\sqrt{2\pi\sigma_e^2}} \exp\left(-\tfrac{1}{2}\,\mu_i^2/\sigma_e^2\right) \Big/ \bigl(1 - \phi(\mu_i/\sigma_e)\bigr) \times 1_{\{y_i^h > T\}},$$

where $\mu_i = \alpha + \sum_{j=1}^{N} \sum_{k=1}^{N_{gen}} \beta_{j,k} 1_{\{m^h_{i,j}=k\}}$ is the predictive mean and $\phi(\cdot)$ is the cumulative distribution function of the standard normal distribution. Similarly, as in data augmentation algorithms for binary traits (Albert and Chib 1993; Hoti and Sillanpää 2006), this Gibbs sampling step requires sampling from a truncated normal distribution (for algorithms, see Devroye 1986). Note that the condition $y_i^o > T$ uniquely determines the values for $z_i^o$ and $z_i^h$, which again imply the constraint for the possible values of $y_i^h$.

5. Repeat steps 2–4 until a prespecified number of rounds have been reached.

Figure 2. A graphical display of the hierarchical structure of the model. The data and the priors are presented using boxes and the unknown variables are shown as ovals. The hierarchical dependency and the conditional independence assumptions are visible in the graph. The dependency can be either deterministic (dashed line) or stochastic (solid line) and its direction is indicated with an arrow.
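The Gibbs update of step 4 amounts to drawing each pseudo-observation from a normal distribution truncated at $T$. The Python sketch below shows one possible way to do this by inverse-CDF sampling; it is not the authors' Matlab implementation. It assumes the standard truncated-normal form, that is, a normal density in $y_i^h$ centred at the predictive mean $\mu_i$ with standard deviation $\sigma_e$ and renormalized over the permitted side of $T$. The function names `sample_truncated_normal` and `update_pseudo_phenotypes` are hypothetical.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for cdf and inverse cdf


def sample_truncated_normal(mean, sd, threshold, above, rng=random):
    """Draw from N(mean, sd^2) restricted to (threshold, inf) if `above`,
    otherwise to (-inf, threshold]. Inverse-CDF method (an assumed choice)."""
    p_t = _STD_NORMAL.cdf((threshold - mean) / sd)  # mass at or below threshold
    low, high = (p_t, 1.0) if above else (0.0, p_t)
    u = low + (high - low) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # inv_cdf needs 0 < u < 1
    value = mean + sd * _STD_NORMAL.inv_cdf(u)
    # Guard against rounding pushing the draw across the threshold.
    return max(value, threshold) if above else min(value, threshold)


def update_pseudo_phenotypes(pred_means, resid_sd, threshold, observed_tail, rng=random):
    """One Gibbs sweep over the pseudo-observations y^h.
    Observed right-tail data (y^o > T) force y^h <= T, and vice versa."""
    above = observed_tail == "left"
    return [sample_truncated_normal(m, resid_sd, threshold, above, rng)
            for m in pred_means]
```

Inverse-CDF sampling is chosen here only for brevity; for thresholds far in the tail of the conditional distribution, the rejection algorithms referenced in the text (Devroye 1986) are likely to be more accurate.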

DATA ANALYSIS

In the following we present example analyses and comparisons of our method under different sampling schemes (random, single-tail, and two-tail sampling), using simulated backcross data in cases of unlinked and linked QTL. We consider both the average performance (assessed by analyzing 50 or 100 data replicates) and performance under a single realization of a data set (assessed by analyzing several single data sets with small heritability in each). For reasons why correction methods based on the truncated normal distribution as an "incomplete-data" likelihood are not used here, see the discussion. We used unrealistically large (QTL) heritabilities and small sample sizes in our example data sets to reduce computation time when analyzing data replicates. Although this treatment may appear unrealistic, the analyses presented here arguably correspond to analyses with smaller heritabilities and larger samples. One can use existing power tables to find a rough correspondence between the two cases (e.g., Van Ooijen 1992; Carbonell et al. 1993; Beavis 1998). For example, using a traditional approach and backcross data, the probability of finding a QTL with heritability 0.05 in a sample of 400 is roughly comparable to that of finding a QTL with heritability 0.16 in a sample of 100 (Lander and Botstein 1989). Additionally, we illustrate the performance of our method with survival data and censored observations, using previously analyzed real F2 mice data that have some degree of randomly missing genotypes (Broman 2003).

SIMULATION ANALYSIS OF UNLINKED QTL

Simulated data: The performance of the new approach was tested using simulated data, which were generated in two phases. First, linked marker data for a population of 250 backcross individuals were generated using the QTL Cartographer software (Basten et al. 1996). The produced offspring data consisted of 33 markers spanning three 100-cM-long chromosomes. Each chromosome had 11 equidistant markers, one every 10 cM. To generate phenotypes, we selected 3 markers (nos. 3, 17, and 30) of 33 as QTL with additive genetic effects $\beta_3 = 3$, $\beta_{17} = -2$, and $\beta_{30} = 1$, respectively. For each individual i, a quantitative phenotype y(i) was generated using an additive genetic model

$$y(i) = \beta_3 1_{\{g_3(i)=AB\}} + \beta_{17} 1_{\{g_{17}(i)=AB\}} + \beta_{30} 1_{\{g_{30}(i)=AB\}} + e(i), \qquad (5)$$

where the indicator functions $1_{\{g_3(i)=AB\}}$, $1_{\{g_{17}(i)=AB\}}$, and $1_{\{g_{30}(i)=AB\}}$ take value 1 if individual i has genotype AB at positions 3, 17, and 30, respectively. The additive error e(i) was generated from the normal distribution with mean zero and variance 4. This resulted in a heritability ≈0.5. Note that there were no missing values in the marker data.
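As a rough check on the stated heritability, the following Python sketch computes the value implied by Equation 5 and simulates phenotypes from it. It is an illustration under stated assumptions, not the original simulation: the three QTL sit on different chromosomes, so their genotypes are drawn independently with probability 1/2 of AB instead of being taken from QTL Cartographer marker data. The function names are hypothetical.

```python
import random


def expected_heritability_unlinked(effects, error_variance):
    """Heritability implied by unlinked backcross QTL: each AB indicator is
    Bernoulli(1/2), so it contributes effect^2 / 4 to the genetic variance."""
    genetic_variance = sum(0.25 * b * b for b in effects)
    return genetic_variance / (genetic_variance + error_variance)


def simulate_phenotypes(n, effects, error_sd, rng):
    """Simulate n phenotypes from the additive model with independent QTL."""
    phenotypes = []
    for _ in range(n):
        genetic = sum(b for b in effects if rng.random() < 0.5)  # AB with prob 1/2
        phenotypes.append(genetic + rng.gauss(0.0, error_sd))
    return phenotypes


effects = [3.0, -2.0, 1.0]  # beta_3, beta_17, beta_30 from Equation 5
h2 = expected_heritability_unlinked(effects, error_variance=4.0)  # 3.5 / 7.5
y = simulate_phenotypes(250, effects, error_sd=2.0, rng=random.Random(0))
```

Under these assumptions the genetic variance is 3.5 and the implied heritability is 3.5/7.5, about 0.47, in line with the value of roughly 0.5 reported above.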

Sampled subdata: In the following, to distinguish between the original simulated data set (250 individuals) and a smaller sample (≈40 individuals), which is obtained by sampling according to the sampling scheme, we call these two alternatives data and subdata, respectively.

Analyses: To demonstrate the efficiency of the proposed pseudocontrol approach and to address the sampling variation around the estimates, six different sampling schemes were compared by analyzing simulated subdata replicates. Sampling of the individuals and analysis of the resulting subdata were repeated 100 times under each sampling scheme, using 100 different simulated data sets. All simulated data sets included the same genotype data of 250 offspring (see above) but a different set of phenotypes was generated each time by using the same generating model (Equation 5). (The method is expected to be more sensitive to sampling variation due to phenotypes than due to genotypes.) In each repetition, subdata were sampled from the same phenotype distribution of the 250 individuals according to different sampling schemes. Analyses using six different schemes of sampling from the phenotype distribution were considered: (A) an analysis using a random subdata sample, (B) an analysis using a sample from both the left and the right tails, (C) an analysis using the right-tail sample without any correction with respect to truncation, (D) the pseudoobservation analysis using the right-tail sample, (E) an analysis using the left-tail sample only without any correction with respect to truncation, and (F) the pseudoobservation analysis using the left-tail sample. Analyses D and F were carried out for all the subsamples, using the approach presented in the model section. Analyses A–C and E were carried out using our implementation of the approach of Xu (2003); for details, see Hoti and Sillanpää (2006). Analyses C and D (and likewise E and F) correspond to the same analysis without and with generating pseudoobservations, respectively. The truncation threshold T, which determines the sampled individuals in the subdata, was defined as $T_r = \bar{y} + \sigma_y$ for the right-tail sampling analyses (C and D) and as $T_l = \bar{y} - \sigma_y$ for the left-tail sampling analyses (E and F). In the two-tail sampling analysis (B), both thresholds $T_l$ and $T_r$ were used. Due to the resampling of the phenotype, the sample sizes used in analyses C–F varied in each repetition. Therefore, to maintain comparability between the schemes (in each repetition), the size of the random sample (A) was chosen to equal the mean of the sample sizes of the left- and right-tail samples (rounded upward). In the two-tail sampling analysis (B), we randomly sampled half of the individuals (rounded upward) from both tails. The sample sizes varied in the range [35, 51] with median 41, which closely coincides with the theoretical expectation (16% of the samples in a normal distribution should be beyond one standard deviation in one direction, here corresponding to 0.16 × 250 ≈ 40 individuals).
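The tail selection and the sample-size rule described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code. It assumes the sample standard deviation (divisor n − 1) for the threshold and treats the left tail as inclusive of the threshold; neither detail is stated in the text. The function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def tail_subdata(phenotypes):
    """Return indices of left- and right-tail individuals and both thresholds."""
    y_bar, s_y = mean(phenotypes), stdev(phenotypes)  # sample SD (n - 1), assumed
    t_left, t_right = y_bar - s_y, y_bar + s_y
    left = [i for i, y in enumerate(phenotypes) if y <= t_left]
    right = [i for i, y in enumerate(phenotypes) if y > t_right]
    return left, right, t_left, t_right


def random_sample_size(n_left, n_right):
    """Size of the random sample (scheme A): mean tail size, rounded upward."""
    return math.ceil((n_left + n_right) / 2)


# Expected size of one tail for 250 normally distributed phenotypes.
expected_tail = 250 * (1.0 - NormalDist().cdf(1.0))  # about 39.7
```

The exact normal tail probability beyond one standard deviation is about 0.159, so the expected tail size of roughly 39.7 agrees with the rounded figure of 40 individuals quoted above.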

Results: We implemented the methods using Matlab software on a personal computer. The posterior estimation (of the effects) for each of the 100 repetitions was based on 10,000 Markov chain Monte Carlo (MCMC) cycles. In each MCMC run the first 1000 cycles were discarded from the chain as "burn-in" rounds and thinning of 10 was applied (by saving the values at every 10th cycle) to reduce autocorrelation between the samples. Owing to the rather simple data generation model, the MCMC sampler converged rapidly in all 100 cases.
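The burn-in and thinning scheme determines how many draws remain for estimation. A small Python sketch of the bookkeeping is given below; the list `chain` is only a stand-in for stored MCMC output, and which cycle within each block of 10 is saved is an assumption.

```python
def retained_draws(chain, burn_in, thin):
    """Drop the first `burn_in` cycles, then keep every `thin`-th cycle."""
    return chain[burn_in::thin]


chain = list(range(10_000))  # stand-in for 10,000 stored MCMC cycles
kept = retained_draws(chain, burn_in=1000, thin=10)
```

With 10,000 cycles, a burn-in of 1000, and thinning of 10, this leaves (10,000 − 1000)/10 = 900 draws per run.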

Instead of using the estimated effect size directly to summarize the results, we use a standardized form of the effect size, because then the selected QTL threshold 0.1 (giving a definition for QTL as in Hoti and Sillanpää 2006) is directly comparable/applicable to other traits (sampling schemes) and marker data. For a backcross, at marker j, the standardized effect is $\theta_j = \beta_{j,2} \times \sigma_j/\sigma_y$, where $\sigma_j$ is the empirical standard deviation of the genotypes at marker j and $\sigma_y$ is the empirical standard deviation of the phenotype (calculated from augmented data in D and F). Following Hoti and Sillanpää (2006), we define an indicator variable for the event that the absolute value of the standardized effect size is larger than the given QTL threshold 0.1. This enables us to estimate the QTL occupancy probability (as a function of the posterior distribution of the QTL effects) in models such as those of Xu (2003), which do not originally include model selection indicators. Thus, we present the results in the form of the posterior probability of the QTL occupancy $P(|\theta_j| > 0.1 \mid \text{data})$, which we calculate as the proportion of MCMC rounds where $|\theta_j| > 0.1$. To summarize the posterior QTL occupancy over the 100 repetitions of data for each of the six sampling schemes, we calculated the mean value of the estimated posterior QTL occupancy probability at each locus j as

$$P_j(0.1) = \frac{1}{100} \sum_{r=1}^{100} P(|\theta_j| > 0.1 \mid \text{data}_r)$$

by taking the average over the 100 subdata analyses in the different sampling schemes (Figure 3). In Figure 3, the corresponding means of the standardized effects (calculated over the MCMC samples in each subdata analysis where $|\theta_j| > 0.1$) are shown using the curve. At each repetition, the calculation of the QTL occupancy and the posterior mean of the estimated (standardized) effect size was based on 900 effective MCMC samples. [The reason for not taking the average over repetitions with respect to the correctly identified QTL was that we wanted especially to monitor the magnitude of the signals (cf. Broman and Speed 2002).]

In Figure 3, A–F, the QTL occupancy probability (bars) and the mean standardized QTL effect (curve) summarize the QTL evidence at each locus over the 100 repetitions. Note that the same scales of the y-axis are used throughout. As expected, the analyses without any correction for truncated data performed very badly, showing practically no signals (Figure 3, C and E). It becomes evident from the graphs that the single-tail sampling analyses with pseudoobservations (Figure 3, D and F) have clearly more power than the analyses based on random subdata samples (Figure 3A) or the analyses that do not utilize the pseudoobservations (Figure 3, C and E). This is in the sense that analyses D and F on average show higher or elevated signals (QTL occupancy probabilities) around the true positions (3, 17, and 30). Surprisingly, in the case of the unlinked QTL, one can even conclude that the single-tail sampling analyses with pseudoobservations (Figure 3, D and F) showed power comparable to the analysis with two-tail sampling (Figure 3B). On the other hand, the mean of the estimated standardized effect size is practically at the same level in most of the analyses, which indicates that standardized effects seem to be (on average) comparable across the different analyses. In Figure 3, D and F, note the negligible bias of the QTL position (around locus 17) in the opposite direction from the direction of sampling. The simulated effect size at position 30 was apparently too small to detect, because the position (on average) stayed undetected in most of the cases.
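The occupancy summary defined above reduces to a few lines of arithmetic on the stored MCMC draws. The Python sketch below illustrates it for a single marker. As a simplification it assumes one fixed phenotype standard deviation per analysis, whereas in analyses D and F this quantity is computed from the augmented data. The function names are hypothetical.

```python
def occupancy_summary(beta_draws, genotype_sd, phenotype_sd, threshold=0.1):
    """Return (occupancy probability, mean standardized effect over the
    MCMC rounds with |theta_j| > threshold) for one marker."""
    theta = [b * genotype_sd / phenotype_sd for b in beta_draws]
    hits = [t for t in theta if abs(t) > threshold]
    occupancy = len(hits) / len(theta)
    mean_effect = sum(hits) / len(hits) if hits else None  # undefined without hits
    return occupancy, mean_effect


def average_occupancy(per_replicate_occupancies):
    """P_j(0.1): mean occupancy probability over the analysed replicates."""
    return sum(per_replicate_occupancies) / len(per_replicate_occupancies)
```

For example, draws of 0.0, 0.1, 0.4, and −0.6 with a genotype standard deviation of 0.5 and a phenotype standard deviation of 1.0 give standardized effects 0.0, 0.05, 0.2, and −0.3, hence an occupancy of 0.5 and a conditional mean effect of −0.05.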

Heritability estimation: In schemes A–F in Table 1, the posterior mean heritability was estimated using the formula

$$h^2 \approx \frac{1}{r} \sum_{t=1}^{r} \left( \frac{\sigma_y^{2(t)} - \sigma_e^{2(t)}}{\sigma_y^{2(t)}} \right),$$

where $\sigma_y^{2(t)}$ is the empirical phenotypic variance (from augmented data in D and F), $\sigma_e^{2(t)}$ is the residual variance at round t, and r is the total number of MCMC rounds (after burn-in). The mean and the standard deviation were used to summarize the distribution of the estimated heritability over the 100 repetitions for each of the six sampling schemes. The random-sampling analysis (scheme A) seemed to give (on average) slightly too small heritability estimates and the standard deviation (sampling variance) was very large. The heritability estimates were negligible (or negative) in almost all analyses based on the single-tail sampling without any correction with respect to truncation (schemes C and E). (The negative values arise because the residual variance was not restricted in the prior to be smaller than the phenotypic variance.) On the other hand, the heritability was overestimated in most of the analyses and the sampling variance was relatively small for analyses using single- or two-tail sampling (schemes B, D, and F) and pseudoobservations.
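The heritability estimator is a plain average of per-round variance ratios. A minimal Python sketch follows, assuming the phenotypic and residual variances have been stored once per retained MCMC round; the function name is hypothetical.

```python
def posterior_mean_heritability(phenotypic_variances, residual_variances):
    """Average (s_y^2 - s_e^2) / s_y^2 over the retained MCMC rounds.
    Both arguments hold one value per round."""
    ratios = [(vy - ve) / vy
              for vy, ve in zip(phenotypic_variances, residual_variances)]
    return sum(ratios) / len(ratios)
```

The sketch also makes the source of negative estimates visible: any round in which the sampled residual variance exceeds the empirical phenotypic variance contributes a negative ratio.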

To enable comparison to more traditional methods (based on hypothesis testing) and to obtain an understanding of their potential performance on these data, Figure 4 shows the deviation of the marker genotype frequencies from their expected values in a typical realization of the data after right-tail sampling. One can clearly see the dependence of the frequencies over linked loci and the potential difficulty of controlling false positives by choosing the significance threshold. From these data, the value of the multiple-QTL model is also easy to understand.
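The quantity displayed in Figure 4 can be computed directly from the genotype matrix of the tail sample. The Python sketch below assumes a hypothetical 0/1 coding (1 for genotype AB) and the backcross expectation that half of the individuals carry AB at every marker.

```python
def ab_count_deviation(genotypes):
    """genotypes[i][j] is 1 if individual i is AB at marker j, else 0.
    Returns, per marker, observed AB count minus the expected n / 2."""
    n = len(genotypes)
    n_markers = len(genotypes[0])
    return [sum(row[j] for row in genotypes) - n / 2 for j in range(n_markers)]
```

In a right-tail sample, markers near QTL with positive effects would be expected to show positive deviations and markers near QTL with negative effects negative ones, with neighbouring linked markers deviating in the same direction.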

Figure 3. Unlinked QTL: 100 data sets. (A–F) The posterior QTL occupancy probability $P_j(0.1) = \frac{1}{100} \sum_{r=1}^{100} P(|\theta_j| > 0.1 \mid \text{data}_r)$ (indicated with a bar at each marker locus j) averaged over 100 subdata analyses in the different sampling schemes. The corresponding mean of the standardized effect (calculated only over such MCMC rounds where $|\theta_j| > 0.1$) is shown using the curve. The true positions of the simulated QTL are marked with an asterisk. The right column indicates the sampling scheme used/part of the distribution sampled and arrows indicate the utilization of the pseudoobservations (mirroring idea) in the analysis. Shown are the analysis of the random subdata sample (A), the two-tail sample analysis (B), the direct and the mirror analysis of the sample from the right tail of the phenotype distribution (C and D), and the direct and the mirror analysis of the sample from the left tail of the phenotype distribution (E and F).


SIMULATION ANALYSIS OF TWO LINKED QTL

Simulated data: A base population of 2500 backcross individuals was generated using the QTL Cartographer software (Basten et al. 1996). Each individual had a single chromosome with 33 linked loci, one every 1 cM. Two closely linked QTL (in coupling and 14 cM apart from each other) were placed at 12 and 26 cM with additive genetic effects $\beta_{12} = 1$ and $\beta_{26} = 1$, respectively. Only 11 markers (at 1, 4, 7, 10, 13, 16, 19, 22, 25, 28, and 31 cM) of the original 33 were included in the offspring data used in the analysis step.

We sampled 50 data replicates of size 1000 from the base population. For each replicate, a quantitative phenotype y(i) was generated using the additive genetic model

$$y(i) = \beta_{12} 1_{\{g_{12}(i)=AB\}} + \beta_{26} 1_{\{g_{26}(i)=AB\}} + e(i), \qquad (6)$$

where the indicator functions $1_{\{g_{12}(i)=AB\}}$ and $1_{\{g_{26}(i)=AB\}}$ take value 1 only if individual i has genotype AB at positions 12 and 26, respectively. The additive error e(i) was generated from the standard normal distribution. This resulted in a heritability value ≈0.47. Also, here no missing values were introduced into the marker data.
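With linked QTL the genetic variance also contains a covariance term between the two genotype indicators. The Python sketch below computes the heritability implied by Equation 6. It assumes Haldane's map function to convert the 14-cM distance into a recombination fraction; the text does not state which map function was used in the simulation, and the function names are hypothetical.

```python
import math


def haldane_recombination(distance_cm):
    """Recombination fraction under Haldane's map function (assumed here)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def two_qtl_heritability(beta_a, beta_b, distance_cm, error_variance):
    """Heritability implied by two linked backcross QTL with additive effects."""
    r = haldane_recombination(distance_cm)
    covariance = 0.25 * (1.0 - 2.0 * r)  # between the two AB indicators
    genetic_variance = (0.25 * beta_a ** 2 + 0.25 * beta_b ** 2
                        + 2.0 * beta_a * beta_b * covariance)
    return genetic_variance / (genetic_variance + error_variance)


h2_coupling = two_qtl_heritability(1.0, 1.0, 14.0, 1.0)    # effects of Equation 6
h2_repulsion = two_qtl_heritability(-1.0, 1.0, 14.0, 1.0)  # opposite signs
```

Under these assumptions the coupling configuration gives a heritability of about 0.47, matching the value reported above. Reversing the sign of one effect turns the covariance term negative and lowers the implied heritability to about 0.11, which is close to the value of roughly 0.10 quoted for the repulsion case later in this section.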

Analyses: For each of the 50 data replicates, subdata were sampled from the phenotype distribution of the 1000 individuals according to the different sampling schemes. Again, the six different analyses were considered: (A) the analysis using a random subdata sample, (B) the analysis using a sample from both left and right tails, (C) the analysis using the right-tail sample without any correction with respect to truncation, (D) the analysis using the right-tail sample with pseudoobservations, (E) the analysis using the left-tail sample only without any correction with respect to truncation, and (F) the analysis using the left-tail sample with pseudoobservations. The sample sizes used in analyses C–F varied in each repetition, with mean 170 in C and D and mean 172 in E and F. Therefore, as in the first simulation study, the size of the random sample (A) was chosen to equal the mean of the sample sizes of the left- and right-tail samples (rounded upward), which resulted in a sample size of ≈171. In the two-tail sampling analysis (B), we randomly sampled half of the individuals from both tails (rounded upward), which also resulted in a sample size of ≈171. The use of a larger sample size in the simulation analysis with linked QTL was partly motivated by not including the QTL (loci 12 and 26) in the marker set used in the analyses.

Results: For each data replicate, a Matlab implementation of the method was run for 20,000 MCMC cycles, from which 2000 burn-in rounds were discarded and only every 10th sample was stored (thinning), resulting in 1800 effective MCMC samples. In Figure 5, A–F, the QTL occupancy probability (bars) and the mean standardized effect (curve) over the 50 repetitions are shown at each locus for each of the sampling schemes. As expected, the QTL localization was clearly more difficult for two linked QTL (Figure 5) than it was for the unlinked QTL (Figure 3). No QTL were found in analyses C and E in Figure 5. In analyses A, B, D, and F in Figure 5, all markers in the region 10–31 cM showed elevated QTL occupancy. Arguably, in all cases (Figure 5, A, B, D, and F), the average QTL occupancy increased clearly at the flanking markers, at 10 and 13 cM and at 25 and 28 cM, and showed the highest value at the marker closest to the QTL. One can conclude that in the case of two linked QTL, the single-tail sampling analyses with pseudoobservations (Figure 5, D and F) showed power roughly comparable to the analysis with two-tail sampling (Figure 5B). Further, the power for the analyses using pseudoobservations (Figure 5, D and F) was clearly better than that for the analyses that did not utilize pseudoobservations (Figure 5, C and E), but it was only slightly better than that for the analyses based on random subdata samples (Figure 5A).

TABLE 1

Heritability estimates

                   Unlinked QTL                  Linked QTL
Sampling scheme    Mean     Standard deviation   Mean     Standard deviation
h^2                0.50                          0.47
A                  0.34     0.170                0.45     0.062
B                  0.72     0.093                0.82     0.045
C                  −0.03    0.045                −0.01    0.018
D                  0.68     0.091                0.68     0.044
E                  −0.02    0.065                −0.01    0.005
F                  0.64     0.072                0.67     0.033

The mean value and the standard deviation of the heritability point estimates (posterior mean) were calculated, respectively, over 100 and 50 subdata analyses for the unlinked QTL and the linked QTL (in coupling) in the different sampling schemes (A–F). The simulated mean true heritability (h^2), the analysis of the random subdata sample (A), the two-tail sample analysis (B), the direct and the mirror analysis of the sample from the right tail of the phenotype distribution (C and D), and the direct and the mirror analysis of the sample from the left tail of the phenotype distribution (E and F) are shown.

Figure 4. Unlinked QTL: a typical realization of the data obtained by sampling only the subset belonging to the right tail of the phenotype distribution of 250 simulated backcross individuals. At each locus the bar indicates the deviation of the number of individuals with genotype AB from the Mendelian expectation. The true positions of the simulated QTL are marked with an asterisk.

Heritability estimation: The mean value and the standard deviation of the posterior mean heritability calculated over the 50 repetitions of the linked QTL data are shown for all six sampling schemes (A–F) in Table 1. As earlier, the random-sampling analysis (A) resulted (on average) in small heritability estimates and in a large standard deviation (sampling variance). Also, negligible (or negative) heritability estimates were obtained in the analyses that used single-tail sampling without any correction with respect to truncation (C and E). Again, the analyses with single- or two-tail sampling and data augmentation (B, D, and F) produced overestimated heritabilities with small standard deviations. The overestimation was highest for the two-tail sampling analysis (B). The most probable reason for the smaller standard deviations here (for linked QTL) compared to the case of unlinked QTL is that we used a larger sample size and a smaller number of repetitions.

Figure 5. Linked QTL in coupling: 50 data sets. (A–F) The posterior QTL occupancy probability $P_j(0.1) = \frac{1}{50} \sum_{r=1}^{50} P(|\theta_j| > 0.1 \mid \text{data}_r)$ (indicated with a bar at each marker locus j) averaged over 50 subdata analyses in the different sampling schemes. The corresponding mean of the standardized effect (calculated only over MCMC rounds where $|\theta_j| > 0.1$) is shown using the curve. The true positions of the simulated QTL are marked with an asterisk. The right column indicates the sampling scheme used/part of the distribution sampled and arrows indicate the utilization of the pseudoobservations (mirroring idea) in the analysis. Shown are the analysis of the random subdata sample (A), the two-tail sample analysis (B), the direct and the mirror analysis of the sample from the right tail of the phenotype distribution (C and D), and the direct and the mirror analysis of the sample from the left tail of the phenotype distribution (E and F).

Two linked QTL in repulsion ($h^2 \approx 0.1$): For comparison, we also simulated five replicates of two linked QTL (in repulsion and placed at 12 and 26 cM) with additive genetic effects $\beta_{12} = -1$ and $\beta_{26} = 1$, respectively. The heritability was ≈0.10. The same six schemes of sampling from the phenotype distribution were again considered for each data replicate. Also here, each analysis was run for 20,000 MCMC rounds, which resulted in 1800 effective MCMC samples (after burn-in and thinning). No notable differences were found in the results when compared to the case of QTL in coupling. As earlier, the flanking markers showed elevated QTL occupancy in analyses B, D, and F. The QTL evidence was notably smaller in analysis A and no QTL were found in analyses C and E. In Figure 6 (left), the QTL occupancy probabilities (bars) and the unstandardized effect estimates (curve) are shown for one of the data replicates (heritability 0.10); the scale of the y-axis differs from the others in Figure 6B. In Table 2, for the same data replicate, the estimated posterior means and 90% credible intervals are shown for the heritabilities and the unstandardized and standardized QTL effects (for analyses B, D, and F) at the loci with highest QTL occupancy. Note that the effect estimates of analyses D and F are surprisingly close to their true simulated values while in B they are clearly overestimated. However, the QTL were not exactly at the markers, which may partly downweight the estimates. The posterior mean heritabilities are highly overestimated in analyses B, D, and F but the true heritability value falls inside the 90% credible interval in all cases. In Table 2, the fact that the credible interval of the QTL effect includes zero is an indication of a somewhat lower QTL occupancy at the locus and therefore downward weighting of the estimate. In general, these kinds of model-averaged estimates are shown to be robust to upward bias of small-effect QTL (see Ball 2001).

Two linked QTL in coupling (two realistic scenarios): To closely monitor the performance of our method under a realistic single realization of a data set with large sample size and small heritability, we simulated two additional data sets, with the two linked QTL (in coupling and again placed at 12 and 26 cM) in each. The first data set has additive genetic effects $\beta_{12} = 0.5$ and $\beta_{26} = 0.2$, and the second set has $\beta_{12} = 0.6$ and $\beta_{26} = 0.3$, respectively. The heritabilities for the two sets were ≈0.11 and 0.15. The same six schemes of sampling from the phenotype distribution were considered for both data sets. All analyses were based on 20,000 MCMC rounds and 1800 effective MCMC samples (after burn-in and thinning). We first sampled 1000 and 1500 individuals from 2500, which was the size of the base population. The sample size in the first data set (subdata) was 162 for the left-tail sample and 161 for the right-tail sample, and in the second data set it was 232 for both the left- and the right-tail samples. Comparable sample sizes were used in other schemes. The QTL occupancy probabilities (bars) and the unstandardized effect estimates (curve) are shown in Figure 6 (center column) for the first data set (heritability 0.11) and in Figure 6 (right column) for the second data set (heritability 0.15); the scale of the y-axis differs from the others in Figure 6B on the right. See Table 3 for the estimated posterior means and 90% credible intervals for the heritabilities.

In Figure 6 (center column), the flanking markers showed the highest QTL occupancy probabilities for one of the simulated QTL in analyses A and B, but the other QTL was undetected. In analyses C and E, all loci produced zero QTL occupancy probabilities. In analyses D and F, the loci around simulated QTL gained elevated QTL occupancy probabilities but the highest QTL occupancy probability was not necessarily obtained for the flanking markers. In Figure 6 (right column), one of the flanking markers showed an elevated QTL occupancy probability around one of the simulated QTL in A and around both QTL in B. In contrast to E, analysis C also showed some elevated QTL occupancy probabilities. Again in analyses D and F, the loci around simulated QTL gained elevated QTL occupancy probabilities but the highest peaks did not necessarily occur at the flanking markers. To conclude the results of mirroring analyses D and F (Figure 6, center and right columns), it seems that the position estimates can be somewhat biased in the presence of two closely linked QTL and small heritability.

The heritabilities were badly overestimated with these data in analyses B, D, and F and the true heritability falls outside the 90% credible interval in all cases (Table 3).

SURVIVAL DATA ANALYSIS

Real mice data: We selected survival data of Boyartchuk et al. (2001) that were previously analyzed by Broman (2003) and Diao et al. (2004) using several different methods. Additionally, Jin et al. (2007) analyzed chromosomes 1, 5, and 13 using the same data. A quantitative trait, the log time to death (in log hours) following Listeria monocytogenes infection, was measured from 116 female F2 mice, where for ≈30% of the mice the phenotype was censored (they survived to the end of the experiment, 264 hr). The F2 mice were obtained from an intercross between BALB/cByJ and C57BL/6ByJ strains. The marker map is known and the genotypes, with some random missing values/partial information, are available at 133 markers covering chromosomes 1–19 and X. We omitted data (2 markers) on the X chromosome because it needs to be treated differently from autosomes in the QTL mapping (Broman et al. 2006). Also, the last marker on chromosome 19 was omitted from the mapping panel because it contained a large proportion of missing values. For simplicity we treated partial genotype information as completely missing here.

Analyses: We analyzed the mice data using three different methods: (I) the Bayesian multiple-QTL analysis with only those mice that died, without generating pseudoobservations; (II) the Bayesian multiple-QTL analysis with only those mice that died (with generating pseudoobservations and using the highest noncensored phenotype as T); and (III) the Bayesian multiple-QTL analysis with all 116 mice, using data augmentation to impute phenotypes (liabilities) for the censored mice with T = log(264). Analysis I was carried out using our implementation of Xu (2003); for details, see Hoti and Sillanpää (2006). Analyses II and III were carried out using the approach presented in the model section.
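The three analyses differ only in which mice enter the model and in the threshold T. The Python sketch below shows how the inputs could be assembled from raw survival times. It assumes that censored mice are recorded with the end-of-experiment time of 264 hr and that the natural logarithm is used; both are assumptions about the data coding, and the function name and dictionary keys are hypothetical.

```python
import math


def prepare_survival_analyses(survival_hours, end_hours=264.0):
    """Split log survival times into the inputs of analyses I-III.
    Mice recorded at `end_hours` or later are treated as censored (assumed coding)."""
    log_times = [math.log(h) for h in survival_hours]
    censored = [h >= end_hours for h in survival_hours]
    died = [y for y, c in zip(log_times, censored) if not c]
    return {
        "died_log_times": died,                # phenotypes used in analyses I and II
        "threshold_II": max(died),             # highest noncensored phenotype
        "threshold_III": math.log(end_hours),  # T = log(264)
        "censored": censored,                  # phenotypes imputed in analysis III
    }
```

In this reading, analysis II treats the mice that died as a left-tail sample with respect to `threshold_II`, while analysis III keeps all mice and imputes phenotypes above `threshold_III` for the censored ones.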

Figure 6. Analyses of single realizations: linked QTL in repulsion, $h^2 \approx 0.10$ (left); linked QTL in coupling, $h^2 \approx 0.11$ (center); and linked QTL in coupling, $h^2 \approx 0.15$ (right). (A–F) The posterior QTL occupancy probabilities $P(|\theta_j| > 0.1 \mid \text{data})$ estimated at different marker positions j for a single subdata analysis in the different sampling schemes. The corresponding posterior mean of the unstandardized effect (calculated over all MCMC rounds) is shown using the curve. The true positions of the simulated QTL are marked with an asterisk. Shown are the analysis of the random subdata sample (A), the two-tail sample analysis (B), the direct and the mirror analysis of the sample from the right tail of the phenotype distribution (C and D), and the direct and the mirror analysis of the sample from the left tail of the phenotype distribution (E and F).


Note that approach III is in spirit similar to the one proposed for Gaussian mixed-effects models by Sorensen et al. (1998). In all analyses, we adopted the F2 transition probabilities $p(M^o_{i,j} \mid M^o_{i,j-1})$ as presented in Sillanpää and Arjas (1998).

Results: The Matlab implementation of the method was run for 30,000 MCMC cycles. The first 15,000 rounds were discarded as burn-in and of the remaining samples only every 10th sample was used in the estimation. For the F2 design, at marker j, $\beta_{j,1} = 0$ for heterozygotes and the standardized effect for $k \in \{2, 3\}$ is obtained as $\theta_{j,k} = \beta_{j,k} \times \sigma_{j,k}/\sigma_y$, where $\sigma_{j,k}$ is the empirical standard deviation of the indicator $1_{\{m^s_{i,j}=k\}}$, and $\sigma_y$ is the empirical standard deviation of the phenotype (calculated from augmented data in II and III). The posterior QTL occupancy probabilities $P(|\theta_{j,k}| > 0.1 \mid \text{data})$ are separately calculated for $k \in \{2, 3\}$ over chromosomes. In Figure 7, only the maximum of the two, $\max[P(|\theta_{j,2}| > 0.1 \mid \text{data}), P(|\theta_{j,3}| > 0.1 \mid \text{data})]$, is shown at each position j for the three different methods. Our results closely agree with the results of Broman (2003), who found QTL on chromosomes 1, 5, 13, and 15 using a single-QTL model. As in Broman (2003), our analyses support the conclusion that the QTL on chromosome 1 has an effect on the time to death only among the nonsurvivors (a peak is present only in analysis I). Similarly, QTL on chromosome 5 appear to have an effect only on the chance of survival (a peak is present more or less only in analyses II and III). Again consistently with Broman (2003), QTL on chromosomes 13 and 15 have an effect on both (on the time to death among the nonsurvivors and on the chance of survival; a peak is present more or less in all the analyses), where the latter chromosome actually has a weak QTL. In contrast to Diao et al. (2004), we did not find any QTL evidence on chromosome 6. However, a little support for QTL (with an effect only on the chance of survival) was found on chromosomes 12 and 18 in analyses II and III, respectively. This may indicate higher efficiency in detecting QTL by our multiple-QTL analysis.

Heritability estimates: Bayesian point estimates (posterior means) of the heritability for analyses I, II, and III were 0.32, 0.49, and 0.41, respectively. The corresponding 90% credible intervals were [0.09, 0.50], [0.34, 0.61], and [0.25, 0.57].
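Point estimates and credible intervals of this kind are simple summaries of the retained MCMC draws. The Python sketch below computes a posterior mean and an equal-tailed interval. Whether the reported intervals are equal-tailed or highest-posterior-density intervals is not stated in the text, so the equal-tailed form is an assumption, and the function names are hypothetical.

```python
def quantile(sorted_values, p):
    """Linear-interpolation quantile of an already sorted list, 0 <= p <= 1."""
    position = p * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1.0 - weight) + sorted_values[upper] * weight


def posterior_summary(draws, level=0.90):
    """Posterior mean and equal-tailed credible interval from MCMC draws."""
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0
    mean = sum(ordered) / len(ordered)
    return mean, (quantile(ordered, tail), quantile(ordered, 1.0 - tail))
```

Applied to the per-round heritability values of one analysis, `posterior_summary` would return a point estimate together with the 5% and 95% quantiles of the draws.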

DISCUSSION

In the consideration of the single-tail problem it isimportant to make a clear distinction between the phe-notype distribution and the conditional phenotypedistribution (i.e., phenotype after correcting for QTL).In principle it is possible to use a truncated normaldistribution function as an incomplete-data likelihood

TABLE 2

Posterior estimates of the heritability and the QTL effects

| Sampling scheme | Heritability estimate | Locus | Unstandardized QTL effect | Standardized QTL effect |
|---|---|---|---|---|
| $h^2$ | 0.10 | | | |
| A | 0.01 [-0.19, 0.19] | 13 | -1.95 [-2.60, -1.29] | -0.58 [-0.77, -0.38] |
| B | 0.16 [-0.01, 0.31] | 25 | 1.31 [0.00, 2.22] | 0.39 [0.00, 0.66] |
| C | -0.01 [-0.21, 0.16] | 13 | -0.93 [-1.17, -0.69] | -0.62 [-0.76, -0.48] |
| D | 0.22 [0.10, 0.33] | 25 | 0.84 [0.61, 1.07] | 0.56 [0.42, 0.70] |
| E | -0.02 [-0.23, 0.16] | 13 | -0.72 [-1.09, 0.00] | -0.47 [-0.69, 0.00] |
| F | 0.22 [0.10, 0.33] | 28 | 0.85 [0.58, 1.09] | 0.55 [0.39, 0.70] |

The heritability (posterior mean as a point estimate and 90% credible interval) and the unstandardized and standardized QTL effects (posterior mean and 90% credible interval) from the subdata analysis of the linked QTL (in repulsion) in the different sampling schemes (A–F) are shown. The estimates are based on the whole MCMC sample after burn-in (no thinning). Simulated true heritability ($h^2$), the analysis of the random subdata sample (A), two-tail sample analysis (B), the direct and the mirror analysis of the sample from the right tail of the phenotype distribution (C and D), and the direct and the mirror analysis of the sample from the left tail of the phenotype distribution (E and F) are shown.

TABLE 3

Heritability estimates

| Sampling scheme | Data set 1 estimate | Data set 2 estimate |
|---|---|---|
| $h^2$ | 0.11 | 0.15 |
| A | 0.01 [-0.19, 0.19] | 0.19 [0.06, 0.30] |
| B | 0.28 [0.14, 0.40] | 0.42 [0.33, 0.51] |
| D | 0.28 [0.18, 0.38] | 0.35 [0.26, 0.42] |
| F | 0.18 [0.07, 0.29] | 0.35 [0.26, 0.42] |

The estimated heritabilities (posterior mean and 90% credible interval) from two subdata analyses of the linked QTL (in coupling) in the different sampling schemes (A, B, D, and F) are shown. The estimates are based on the whole MCMC sample after burn-in (no thinning). Simulated true heritability ($h^2$), the analysis of the random subdata sample (A), two-tail sample analysis (B), the mirror analysis of the sample from the right tail of the phenotype distribution (D), and the mirror analysis of the sample from the left tail of the phenotype distribution (F) are shown.


for a single-tail sample in a way similar to that of Carriquiry et al. (1987); see also Schmee and Hahn (1979). However, these expressions are difficult to deal with analytically, which means that the fully conditional posterior distributions are not available (Sorensen et al. 1998). Also, such a model is very sensitive to small values of the trait (Cox and Oakes 1984). In contrast, the MCMC sampling distributions concerning QTL model parameters are unaffected by the correction in methods that use a nontruncated normal distribution as the full-data likelihood. Following the latter, we have presented a new method that can be applied together with a multiple-QTL model to improve the efficiency of QTL mapping using single-tail samples. The method effectively utilizes additional information available from the parents in the data augmentation scheme. This is done by generating artificial sample points with the genotypes obtained deductively from the parental mating type and the phenotypes via data augmentation. Generally, the use of data augmentation and missing-data imputation is very common in Bayesian analyses (e.g., Albert and Chib 1993; Sillanpää and Arjas 1998; Sorensen et al. 1998; Baker et al. 2005). Note that, unlike methods that rely on the Mendelian inheritance assumption, this method can be applied to fitness traits, to loci that are associated with the survival to birth, and to loci that suffer from segregation distortion. When mapping viability (selection) in F2/outbred populations, both the selection intensity and the dominance can be estimated.

Different QTL models and designs: The performance of the method was demonstrated using a multiple-QTL

Figure 7. Survival data analysis: the maximum posterior QTL occupancy estimated at different marker positions over chromosomes 1–19 for data on survival time following infection with Listeria monocytogenes in 116 F2 mice. Shown are the direct analysis with only those mice that died (analysis I, top), the mirror analysis with only those mice that died (analysis II, middle), and the analysis with all data including the observed genotypes of the censored mice (analysis III, bottom). The chromosomes are separated by dashed vertical lines.


model where only markers (or pseudomarkers) were considered as putative QTL positions (e.g., Xu 2003). However, the presented data augmentation for hidden observations can in principle be applied together with the majority of the existing Bayesian QTL mapping methods for inbred line-cross data, including interval mapping (e.g., Sillanpää and Arjas 1998) and epistatic models (e.g., Yi and Xu 2002; Yi et al. 2003; Zhang and Xu 2005). It can also be added to Bayesian models that consider outbred line-cross (e.g., Sillanpää and Arjas 1999) or outbred family/trio data (e.g., Lee and Thomas 2000), given the known or estimated haplotypes in parents. Recall the generality of Figure 1 in generating mirror images of the genotype data. The computational feasibility of each extension depends on the available computer capacity and on the complexity of the model considered. In binary traits, the pseudocontrol sample (based on parental data) can be created directly and analyzed with standard QTL/association mapping methods and software packages designed for binary trait locus/association mapping (e.g., Xu and Atchley 1996; Visscher et al. 1996; Yi and Xu 2000; Kilpikari and Sillanpää 2003; Sillanpää and Bhattacharjee 2005), by assuming that there are no missing marker data. In a frequentist setting, one may try to pursue the implementation of the presented method for quantitative traits by using the EM algorithm (Dempster et al. 1977; Pettitt 1986; Smith and Helms 1995).

The underlying key assumptions made in the mirroring approach for quantitative traits are (1) given haplotypes for parents, the genotype data of the hidden observations are made to be maximally dissimilar to the observed data and (2) conditionally on genotypes, all updated (predicted) phenotypes for the hidden observations are either on the left or on the right side of the truncation point and the observed phenotypes. Note that these assumptions correspond to assuming that observations can be (in the light of the current genetic model) divided into two ordered parts: all hidden phenotypes are smaller (or greater) than the observed phenotypes. However, this assumption (see the model section) does not directly rule out certain genetic models; e.g., for an additive multiple-QTL model, it imposes a constraint on the sum of the QTL effects rather than restricting each QTL individually. Similar kinds of assumptions (for liabilities) are also needed for discrete/binary phenotypes in the case of a multiple-QTL model. Such assumptions do not exclude us from using, for example, epistatic QTL models in mapping, but it would be valuable to assess the limitations and possible negative influences of these assumptions for several model types in the future.

Mapping selection in breeding populations by using single-parent data: Gomez-Raya et al. (2002) proposed a method to find the loci responsible for artificial or natural selection on the basis of testing distorted segregation among selected and nonselected offspring or gametes (obtained by single-sperm typing) of widely used bulls in cattle. As stated in Goddard (2003), this method shows significant departure from the expected 1:1 ratio for two sire alleles, when measured from the selected offspring or gametes, at a locus linked to the selection. Utilization of the pseudoobservation and data augmentation scheme (for quantitative traits) together with this kind of design is also possible. As an advantage, a multiple-QTL model can be applied. Let us assume that our data consist of a small number of sires and their offspring groups, where each sire has a large number of offspring. Let us then assume that only selected offspring (with phenotype value $y = 1$ or a tail-sampled quantitative phenotype) from each sire group are genotyped. Again, to form the complete mapping population, ungenotyped individuals can now be generated by applying the pseudoobservation idea. For a heterozygous sire AB, the pseudoobservation corresponding to the offspring genotype A* is B* and vice versa. (Here * indicates the other/maternal allele.)
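The pseudoobservation rule for a heterozygous sire can be sketched in a few lines of Python. The sketch assumes that the paternally inherited allele of each selected offspring is known at every locus and is coded "A" or "B"; the maternal allele (the * in the text) is left out, and all names and data are hypothetical.

```python
from collections import Counter

# The two alleles of a heterozygous sire AB; each maps to its complement.
FLIP = {"A": "B", "B": "A"}


def mirror_haplotype(paternal_haplotype):
    """Complementary sire haplotype: A becomes B and B becomes A at every locus."""
    return [FLIP[allele] for allele in paternal_haplotype]


def augment(selected_offspring):
    """Selected offspring followed by one pseudoobservation per offspring."""
    return selected_offspring + [mirror_haplotype(h) for h in selected_offspring]


def allele_counts(haplotypes, locus):
    """Counts of the sire alleles A and B at one locus."""
    return Counter(h[locus] for h in haplotypes)


# Toy sample: paternal alleles of three selected offspring at three loci.
selected = [["A", "A", "B"], ["A", "A", "A"], ["A", "B", "B"]]
augmented = augment(selected)

print(allele_counts(selected, 0))
print(allele_counts(augmented, 0))
```

In the toy sample, all selected offspring carry the sire allele A at the first locus, which is the kind of departure from the 1:1 ratio that the segregation test looks for. After augmentation the two alleles are balanced by construction, and the distortion is instead carried by the contrast between observed and pseudoobserved individuals.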

Heritability and QTL-effect estimates: As illustrated in the simulated examples, our method tends to overestimate the heritabilities; also, even if not detected here, there may still be some upward bias present in the QTL-effect estimates. In any case, the upward bias does not necessarily imply an increased number of false positives, because the calculation of the QTL occupancy is based on standardized effects. To correct for the potential bias in the heritability/effect estimates, one could estimate the heritability (or the QTL effects) separately from the selected samples using the approach of Henshall and Goddard (1999) or from the full data set where also the phenotypes of the ungenotyped individuals, if available, are included (cf. Lander and Botstein 1989; Xu and Vogl 2000). The latter approach should be based on a model without pseudoobservations and where updating of the genotypes of the ungenotyped individuals is done via data augmentation (without mirroring). Another equally useful approach is to estimate the QTL effects (and corresponding heritability) afterward using logistic regression, complete marker data (for the tail sample and the pseudoobservations), and the phenotype data in a discretized form. This is because the logit-link function is robust to ascertainment bias (Kagan 2001; Neuhaus 2002; Grunewald 2004). Note that different types of pseudoobservation approaches have also been applied to estimate the sampling probabilities and to correct for ascertainment bias in association studies (Clayton 2003; Grunewald 2004).

QTL position estimates: QTL mapping accuracy is not reduced if 20–25% of the individuals are sampled from the high and low extremes (Darvasi 1997). However, it is known that the position estimates of two linked QTL are biased in the case of selective genotyping (Lin and Ritland 1996). The same can be expected to be true (and was actually found in our examples) in single-tail sampling using the pseudocontrol approach, because an augmented single-tail sample intuitively corresponds to data sampled from two tails. Again, to correct the bias in the position estimates of two linked QTL, one could estimate the locations from the full data set where also the phenotypes of the ungenotyped individuals, if available, are included (Ronin et al. 1998).

Efficiency of the sampling: Selective genotyping, two-tail sampling from the phenotype distribution, has been suggested as a "state-of-the-art" sampling scheme to improve the power of the analysis (Lander and Botstein 1989; Darvasi and Soller 1992; Sen et al. 2005). In our simulation analyses, we were able to demonstrate comparable power using data resulting from single-tail sampling. (Improvement of the power by generating pseudoobservations for two-tail sampled data is an open question that needs to be studied in the future.) However, it is well known that the power of any selection scheme depends on the underlying genetic architecture of the trait. Due to this, the two-tail sampling approach may lead to an unexpected drop of power, for example, in the presence of epistatic interactions (see Allison et al. 1998). Also, if the QTL effects are small, selective genotyping does not adversely affect the detection of epistasis (Sen et al. 2005). The same is evidently true for the single-tail sampling approach. On the other hand, in a number of situations it is possible to obtain data from a single-tail sample only, for example, if the crossing experiments suffer from the lethal effects of inbreeding depression, in which case our method may prove to be an irreplaceable tool.
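For reference, the sampling schemes compared here can be written down in a few lines. The Python sketch below selects individuals for genotyping from one or both tails of the phenotype distribution. The function name, the `tail` argument, and the convention that `fraction` is the share of individuals taken from each selected tail are assumptions made for illustration only.

```python
def tail_sample(phenotypes, fraction, tail="upper"):
    """Indices of the individuals selected from one or both phenotype tails."""
    n = len(phenotypes)
    k = max(1, round(fraction * n))            # individuals taken per tail
    order = sorted(range(n), key=lambda i: phenotypes[i])
    if tail == "upper":                        # single-tail sample, right tail
        return order[-k:]
    if tail == "lower":                        # single-tail sample, left tail
        return order[:k]
    if tail == "both":                         # two-tail (selective genotyping)
        if 2 * k > n:
            raise ValueError("fraction too large for two-tail sampling")
        return order[:k] + order[-k:]
    raise ValueError("tail must be 'upper', 'lower', or 'both'")


# Toy phenotypes for ten individuals (made-up numbers).
phenotypes = [5, 1, 4, 2, 3, 9, 7, 8, 6, 0]
print(tail_sample(phenotypes, 0.2, "upper"))  # [7, 5]
print(tail_sample(phenotypes, 0.2, "both"))   # [9, 1, 7, 5]
```

Only the individuals returned by such a selection step would be genotyped; in the single-tail case the complementary part of the sample is then represented by pseudoobservations rather than by real individuals.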

Survival data: Finally, we briefly comment on analyzing survival data with our method. Diao et al. (2004) analyzed censored observations by using parametric proportional hazard models and Broman (2003) used a mixture model for the same purpose. Both methods assumed that genotype data have been observed from all the individuals. Additionally, Broman (2003) assumed that the censoring time is equal among all the study subjects. We made the same assumption in our analysis above. In our method, it is possible to relax both of these assumptions. To account for individual-specific censoring times $T_i$, Equation 2 becomes "$y^h_i \le T_i \le y^o_i$ for all pairs $i$," and the hidden observations are updated by the Gibbs sampling distribution where each individual has its own truncation point (step 4 in parameter estimation). A Matlab implementation of the method is available from the authors upon request.
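The update with individual-specific truncation points can be illustrated by a minimal Python sketch; it is not the authors' Matlab implementation. It assumes that, conditionally on the current model parameters, each hidden phenotype is normal with a model-predicted mean and a common residual standard deviation, truncated from above at that individual's own $T_i$ (matching the inequality quoted in the text). The inverse-CDF sampler and all names are choices made here for brevity; a dedicated truncated-normal sampler would be preferable far out in the tail.

```python
import random
from statistics import NormalDist


def sample_hidden_phenotype(mean, sd, upper, rng):
    """Draw one value from N(mean, sd^2) truncated to (-inf, upper] by inverse CDF."""
    dist = NormalDist(mean, sd)
    p_upper = dist.cdf(upper)                  # probability mass below the truncation point
    u = rng.uniform(0.0, p_upper)
    u = min(max(u, 1e-12), 1.0 - 1e-12)        # keep inv_cdf away from 0 and 1
    return min(dist.inv_cdf(u), upper)         # guard against rounding past the bound


def update_hidden(means, sd, truncation_points, rng):
    """One sweep over all hidden observations, each with its own truncation point."""
    return [
        sample_hidden_phenotype(m, sd, t, rng)
        for m, t in zip(means, truncation_points)
    ]


rng = random.Random(2007)
predicted_means = [0.3, -0.1, 0.8]             # hypothetical model predictions
censoring_times = [0.0, 0.5, 1.2]              # hypothetical individual T_i
print(update_hidden(predicted_means, 1.0, censoring_times, rng))
```

With a common censoring time, as assumed in the analyses reported above, `truncation_points` would simply repeat the same value for every individual.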

We are grateful to Karl Broman for his comments on the Listeria monocytogenes data and to two anonymous referees for their constructive comments on the manuscript. This work was supported by a research grant (no. 202324) from the Academy of Finland.

LITERATURE CITED

Albert, J. H., and S. Chib, 1993 Bayesian analysis of binary and polychotomous response data. J. Am. Stat. Assoc. 88: 669–679.

Allison, D. B., N. L. Schork, S. L. Wong and R. C. Elston, 1998 Extreme selection strategies in gene mapping studies of oligogenic quantitative traits do not always increase power. Hum. Hered. 15: 261–267.

Baker, P., K. Mengersen and G. Davis, 2005 A Bayesian solution to reconstructing centrally censored distributions. J. Agric. Biol. Environ. Soc. 10: 61–84.

Ball, R. D., 2001 Bayesian methods for quantitative trait loci mapping based on model selection; approximate analysis using the Bayesian information criterion. Genetics 159: 1351–1364.

Basten, C. J., B. S. Weir and Z.-B. Zeng, 1996 QTL Cartographer, the Reference Manual and Tutorial for QTL Mapping. North Carolina State University, Raleigh, NC.

Beasley, T. M., D. Yang, N. Yi, D. C. Bullard, E. L. Travis et al., 2004 Joint tests for quantitative trait loci in experimental crosses. Genet. Sel. Evol. 36: 601–619.

Beavis, W. D., 1998 QTL analyses: power, precision, and accuracy, pp. 145–162 in Molecular Dissection of Complex Traits, edited by A. H. Paterson. CRC Press, Boca Raton, FL.

Boyartchuk, V. L., K. W. Broman, R. E. Mosher, S. E. F. D'Orazio, M. N. Starnback et al., 2001 Multigenic control of Listeria monocytogenes susceptibility in mice. Nat. Genet. 27: 259–260.

Broman, K. W., 2003 Mapping quantitative trait loci in the case of a spike in the phenotype distribution. Genetics 163: 1169–1175.

Broman, K. W., and T. P. Speed, 2002 A model selection approach for identification of quantitative trait loci in experimental crosses. J. R. Stat. Soc. B 64: 641–656.

Broman, K. W., S. Sen, S. E. Owen, A. Manichaikul, E. M. Southard-Smith et al., 2006 The X chromosome in quantitative trait locus mapping. Genetics 174: 2151–2158.

Carbonell, E. A., M. J. Asins, M. Baselga, E. Balansard and T. M. Gerig, 1993 Power studies in the estimation of genetic parameters and the localization of quantitative trait loci for backcross and doubled haploid populations. Theor. Appl. Genet. 86: 411–416.

Carriquiry, A. L., D. Gianola and R. L. Fernando, 1987 Mixed-model analysis of a censored normal distribution with reference to animal breeding. Biometrics 43: 929–939.

Casella, G., and E. I. George, 1992 Explaining the Gibbs sampler. Am. Stat. 46: 167–174.

Chib, S., and E. Greenberg, 1995 Understanding the Metropolis-Hastings algorithm. Am. Stat. 49: 327–335.

Clayton, D., 2003 Conditional likelihood inference under complex ascertainment using data augmentation. Biometrika 90: 976–981.

Cox, D. R., and D. Oakes, 1984 Analysis of Survival Data. Chapman & Hall, London.

Darvasi, A., 1997 The effect of selective genotyping on QTL mapping accuracy. Mamm. Genome 8: 67–68.

Darvasi, A., and M. Soller, 1992 Selective genotyping for determining of linkage between a marker locus and a quantitative trait locus. Theor. Appl. Genet. 85: 353–359.

Dempster, A. P., N. M. Laird and D. B. Rubin, 1977 Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B 39: 1–38.

Devroye, L., 1986 Non-Uniform Random Variable Generation. Springer-Verlag, New York.

Diao, G., D. Y. Lin and F. Zou, 2004 Mapping quantitative trait loci with censored observations. Genetics 168: 1689–1698.

Falk, C. T., and P. Rubinstein, 1987 Haplotype relative risks: an easy reliable way to construct a proper control sample for risk calculations. Ann. Hum. Genet. 51: 227–233.

Feenstra, B., and I. M. Skovgaard, 2004 A quantitative trait locus mixture model that avoids spurious LOD score peaks. Genetics 167: 959–965.

Fu, J., and R. C. Jansen, 2006 Optimal design and analysis of genetic studies on gene expression. Genetics 172: 1993–1999.

Gauderman, W. J., J. S. Witte and D. C. Thomas, 1999 Family-based association studies. J. Natl. Cancer Inst. Monogr. 26: 31–37.

Greenland, S., 1999 A unified approach to the analysis of case-distribution (case-only) studies. Stat. Med. 18: 1–15.

Goddard, M. E., 2003 Detecting selection. Heredity 90: 277–277.

Gomez-Raya, L., H. G. Olsen, F. Lingaas, H. Klungland, D. I. Vage et al., 2002 The use of genetic markers to measure genomic response to selection in livestock. Genetics 162: 1381–1388.

Grunewald, M., 2004 Genetic association studies with complex ascertainment. Licentiate Thesis, Stockholm University, Stockholm.

Henshall, J. M., and M. E. Goddard, 1999 Multiple-trait mapping of quantitative trait loci after selective genotyping using logistic regression. Genetics 151: 885–894.


Hopert, J. P., and G. Casella, 1996 The effect of improper priors on Gibbs sampling in hierarchical mixed models. J. Am. Stat. Assoc. 91: 1461–1473.

Hoti, F., and M. J. Sillanpää, 2006 Bayesian mapping of genotype × expression interactions in quantitative and qualitative traits. Heredity 97: 4–18.

Jannink, J.-L., 2005 Selective phenotyping to accurately map quantitative trait loci. Crop Sci. 45: 901–908.

Jiang, C., and Z.-B. Zeng, 1997 Mapping quantitative trait loci with dominant and missing markers in various crosses from two inbred lines. Genetica 101: 47–58.

Jin, C., H. Lan, A. D. Attie, G. A. Churchill, D. Bulutuglo et al., 2004 Selective phenotyping for increased efficiency in genetic mapping studies. Genetics 168: 2285–2293.

Jin, C., J. P. Fine and B. S. Yandell, 2007 A unified semiparametric framework for quantitative trait loci analysis, with application to spike phenotypes. J. Am. Stat. Assoc. 102: 56–67.

Kagan, A., 2001 A note on the logistic link function. Biometrika 88: 599–601.

Kilpikari, R., and M. J. Sillanpää, 2003 Bayesian analysis of multilocus association in quantitative and qualitative traits. Genet. Epidemiol. 25: 122–135.

Kruglyak, L., and E. S. Lander, 1995 A nonparametric approach for mapping quantitative trait loci. Genetics 139: 1421–1428.

Lander, E. S., and D. Botstein, 1989 Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121: 185–199.

Lander, E. S., and N. J. Schork, 1994 Genetic dissection of complex traits. Science 265: 2037–2048.

Lee, J. K., and D. C. Thomas, 2000 Performance of Markov chain-Monte Carlo approaches for mapping genes in oligogenic models with an unknown number of loci. Am. J. Hum. Genet. 67: 1232–1250.

Lin, J.-Z., and K. Ritland, 1996 The effects of selective genotyping on estimates of proportion of recombination between linked quantitative trait loci. Theor. Appl. Genet. 93: 1261–1266.

Luo, L., and S. Xu, 2003 Mapping viability loci using molecular markers. Heredity 90: 459–467.

Luo, L., Y.-M. Zhang and S. Xu, 2005 A quantitative genetics model for viability selection. Heredity 94: 347–355.

Meuwissen, T. H. E., B. J. Hayes and M. E. Goddard, 2001 Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819–1829.

Moreno, C. R., J. M. Elsen, P. Le Roy and V. Ducrocq, 2005 Interval mapping methods for detecting QTL affecting survival and time-to-event phenotypes. Genet. Res. 85: 139–149.

Neuhaus, J. M., 2002 Bias due to ignoring the sample design in case-control studies. Aust. N. Z. J. Stat. 44: 285–293.

Nixon, J., 2006 Testing for segregation distortion in genetic scoring data from backcross or doubled haploid populations. Heredity 96: 290–297.

O'Brien, T. E., and G. M. Funk, 2003 A gentle introduction to optimal design for regression models. Am. Stat. 57: 265–267.

Pettitt, A. N., 1986 Censored observations, repeated measures, and mixed effect models: an approach using the EM algorithm and normal errors. Biometrika 73: 635–643.

Ronin, Y., A. B. Korol and J. I. Weller, 1998 Selective genotyping to detect quantitative trait loci affecting multiple traits: interval mapping analysis. Theor. Appl. Genet. 97: 1169–1178.

Rubin, D. B., 1996 Multiple imputation after 181 years (with discussion). J. Am. Stat. Assoc. 91: 473–519.

Schlotterer, C., 2003 Hitchhiking mapping: functional genomics from the population genetics perspective. Trends Genet. 19: 32–38.

Schmee, J., and G. J. Hahn, 1979 A simple method for regression analysis with censored data. Technometrics 21: 417–432.

Sen, S., and G. A. Churchill, 2001 A statistical framework for quantitative trait mapping. Genetics 159: 371–387.

Sen, S., J. M. Satagopan and G. A. Churchill, 2005 Quantitative trait locus study design from an information perspective. Genetics 170: 447–464.

Sillanpää, M. J., and E. Arjas, 1998 Bayesian mapping of multiple quantitative trait loci from incomplete inbred line cross data. Genetics 148: 1373–1388.

Sillanpää, M. J., and E. Arjas, 1999 Bayesian mapping of multiple quantitative trait loci from incomplete outbred offspring data. Genetics 151: 1605–1619.

Sillanpää, M. J., and M. Bhattacharjee, 2005 Bayesian association-based fine mapping in small chromosomal segments. Genetics 169: 427–439.

Smith, F. B., and R. W. Helms, 1995 EM mixed model analysis of data from informatively censored normal distributions. Biometrics 51: 425–436.

Sorensen, D. A., D. Gianola and I. R. Korsgaard, 1998 Bayesian mixed-effect model analysis of a censored normal distribution with animal breeding applications. Acta Agric. Scand. 48: 222–229.

ter Braak, C. J. F., M. P. Boer and M. C. A. M. Bink, 2005 Extending Xu's Bayesian model for estimating polygenic effects using markers of the entire genome. Genetics 170: 1435–1438.

Terwilliger, J. D., and J. Ott, 1992 A haplotype-based haplotype relative risk approach to detecting allelic associations. Hum. Hered. 42: 337–346.

Terwilliger, J. D., and K. M. Weiss, 1998 Linkage disequilibrium mapping of complex disease: Fantasy or reality? Curr. Opin. Biotechnol. 9: 578–594.

Tenesa, A., P. M. Visscher, A. D. Carothers and S. A. Knott, 2005 Mapping quantitative trait loci using linkage disequilibrium: marker- versus trait-based methods. Behav. Genet. 35: 219–228.

van Dyk, D. A., and X.-L. Meng, 2001 The art of data augmentation (with discussion). J. Comput. Graph. Stat. 10: 1–111.

van Ooijen, J. W., 1992 Accuracy of mapping quantitative trait loci in autogamous species. Theor. Appl. Genet. 84: 803–811.

Visscher, P. M., C. S. Haley and S. A. Knott, 1996 Mapping QTLs for binary traits in backcross and F2 populations. Genet. Res. 68: 55–63.

Vogl, C., and S. Xu, 2000 Multipoint mapping of viability and segregation distortion loci using molecular markers. Genetics 155: 1439–1447.

Xu, S., 2003 Estimating polygenic effects using markers of the entire genome. Genetics 163: 789–801.

Xu, S., 2007 An empirical Bayes method for estimating epistatic effects of quantitative trait loci. Biometrics 63: 513–521.

Xu, S., and W. R. Atchley, 1996 Mapping quantitative trait loci for complex binary diseases using line crosses. Genetics 143: 1417–1424.

Xu, S., and C. Vogl, 2000 Maximum likelihood analysis of quantitative trait loci under selective genotyping. Heredity 84: 525–537.

Xu, Z., F. Zou and T. J. Vision, 2005 Improving quantitative trait loci mapping resolution in experimental crosses by the use of genotypically selected samples. Genetics 170: 401–408.

Yi, N., and S. Xu, 2000 Bayesian mapping of quantitative trait loci for complex binary traits. Genetics 155: 1391–1403.

Yi, N., and S. Xu, 2002 Mapping quantitative trait loci with epistatic effects. Genet. Res. 79: 185–198.

Yi, N., S. Xu and D. B. Allison, 2003 Bayesian model choice and search strategies for mapping interacting quantitative trait loci. Genetics 165: 867–883.

Zhang, Y.-M., and S. Xu, 2005 A penalized maximum likelihood method for estimating epistatic effects of QTL. Heredity 95: 96–104.

Zou, F., J. P. Fine and B. S. Yandell, 2002 On empirical likelihood for a semiparametric mixture model. Biometrika 89: 61–75.

Zou, F., B. S. Yandell and J. P. Fine, 2003 Rank-based statistical methodologies for quantitative trait locus mapping. Genetics 165: 1599–1605.

Communicating editor: A. D. Long
