
The Bayesian controversy in animal breeding

A. Blasco¹,²

Departamento de Ciencia Animal, Universidad Politécnica de Valencia, Valencia 46071, Spain

ABSTRACT: Frequentist and Bayesian approaches to scientific inference in animal breeding are discussed. Routine methods in animal breeding (selection index, BLUP, ML, REML) are presented under the hypotheses of both schools of inference, and their properties are examined in both cases. The Bayesian approach is discussed in cases in which prior information is available, prior information is available under certain hypotheses, prior information is vague, and there is no prior information. Bayesian prediction of genetic values and genetic parameters is presented. Finally, the frequentist and Bayesian approaches are compared from a theoretical and a practical point of view. Some problems for which Bayesian methods can be particularly useful are discussed. Both Bayesian and frequentist schools of inference are established, and now neither of them has operational difficulties, with the exception of some complex cases. There is software available to analyze a large variety of problems from either point of view. The choice of one school or the other should be related to whether there are solutions in one school that the other does not offer, to how easily the problems are solved, and to how comfortable scientists feel with the way they convey their results.

Key Words: Animal Breeding, Statistical Analysis

2001 American Society of Animal Science. All rights reserved. J. Anim. Sci. 2001. 79:2023–2046

Introduction

Gianola and Foulley (1982) introduced Bayesian methods in animal breeding in the context of threshold traits, and soon afterward Gianola and Fernando (1986) highlighted additional possibilities exploiting Bayesian techniques. Some years before, Ronningen (1971) and Dempfle (1977) had called attention to the fact that BLUP could be interpreted as a Bayesian estimator, and Harville (1974) offered a Bayesian interpretation of REML. However, although Bayesian methods were theoretically powerful, they usually led to formulas in which multiple integrals had to be solved in order to obtain the marginal posterior distributions used for a complete Bayesian inference. Because these integrals could not be calculated, even using approximate methods, Bayesian inference was based on the mode of posterior distributions, often giving results rather similar to the REML approaches. When Markov chain Monte Carlo (MCMC) methods were applied to estimate marginal posterior distributions, the computation problems were solved and interest in Bayesian methods was renewed. However, frequentist statisticians are reluctant to use Bayesian methods, mainly due to difficulties found in the use of prior information inherent to the Bayesian paradigm.

The objective of this article is to discuss the frequentist and Bayesian approaches to scientific inference in animal breeding, showing their advantages and disadvantages, and their application to particular topics in animal breeding for which the Bayesian approach can be useful. Stuart and Ord (1991) give a frequentist viewpoint of statistics, whereas Bernardo and Smith (1994) provide a Bayesian account. Inferences based on the likelihood can be found in Edwards (1992). There is an excellent book on comparative statistical inference by Barnett (1999).

¹This paper has been revised by a considerable number of colleagues. I would like to express my gratitude to all of them, especially to my friends Daniel Gianola, Daniel Sorensen, and Susie Bayarri for their detailed comments and to Luis Varona for providing me an intuitive way of explaining Gibbs sampling. I also want to express my gratitude to one of the referees for his/her detailed revision and suggestions.

²Correspondence: P.O. Box 22012 (phone: +34963877433; fax: +34963877439; E-mail: [email protected]).

Received February 7, 2000. Accepted March 28, 2001.

Scientific Inference

A constant theme in the development of statistics has been the search for justification for what statisticians do. To read the textbooks one might get the distorted idea that 'Student' proposed his t-test because it was the Uniformly Most Powerful Unbiased test for a Normal mean, but it would be more accurate to say that the concept of UMPU gains much of its appeal because it produces the t-test, and everyone knows that the t-test is a good thing.

Dawid (1976), as quoted by Robinson (1991)


The recent irruption of Bayesian methods in animal breeding has caused a certain perplexity among animal breeders. The supporters of these methods say they give a precise answer to some problems that are not well solved in animal breeding; for example:

Suppose some animal model holds, and a standard analysis coupling maximum likelihood (ML) with BLUP methodology is carried out. With respect to the first question [What can be said about genetic variance in a base population?], we would obtain a point estimate of genetic variance and a single measure of uncertainty, which, technically speaking is only meaningful in large samples and if the data are normally distributed. To tackle the second one [How does one account for uncertainty about breeding values when location and dispersion parameters are unknown?] we would have to place ourselves conditionally on the ML estimates of dispersion parameters, ignoring the error. The issue of response to selection [Given a selection scheme or experiment, how does one assess effectiveness of selection, and attach to this a measure of uncertainty?] is even more complicated. Estimation of responses can be derived from an animal model, but their properties are unknown. . . . An alternative is to adopt the Bayesian position. For each of the questions raised above, the Bayesian answer resides in arriving at the marginal posterior distribution of the unknown of interest. This distribution provides an exact account . . . of the uncertainty about the unknown parameter.

Gianola et al. (1994)

Conversely, well-known statisticians working in animal breeding show their skepticism, if not their open opposition, to the use of these methods:

I also pointed out why some Bayesian estimators have unfortunate properties, in that they are likely to give estimates of zero for between group variance components in well designed experiments. However this type of estimator is still advocated.

Thompson (1987)

The controversy between the frequentist school and the Bayesian school of inference has huge implications. From the viewpoint of the data analysis the implication is that all of the procedures exposited in books like Snedecor and Cochran's Statistical Methods are not merely poor but have no logical foundation. Also, every reporting of any investigation must lead to the investigator's making statements of the form: "My probability that a parameter, θ, of my model lies between, say, 2.3 and 5.7 is 0.90."

Kempthorne (1984)

Although only the Bayesian and frequentist schools are discussed, because they are the only relevant schools of inference today, inferences based on likelihood are intermediate. On one hand, the likelihood plays a central role in Bayesian inference as the function expressing all the information derived from the data. On the other hand, the method of ML has interesting frequentist properties. The animal breeder may be interested in finding efficient solutions for estimating breeding values, and not necessarily in the underlying philosophy. Thus, it is reassuring that when a data set is sufficiently large, the results are very similar in most cases. It is also pleasing that the methods used most frequently in animal breeding can be derived under either the Bayesian or the frequentist paradigms. Harville and Carriquiry (1992) studied the properties of closely related estimators for animal breeding values, deduced from frequentist and Bayesian approaches, and stressed the similarities between results of both procedures. Robinson (1991) even states that the differences between schools are minimal for estimation of breeding values, and that most of the debate comes from the insistence of supporters of both schools on stressing their differences rather than their similarities. We may get the impression that the differences between schools are merely "philosophical" ones, with no practical consequences. It is true that when data sets are large, problems arise more from computational difficulties and limits than from inferential reasons. However, in most studies, experimental design tries to maximize efficiency by optimizing the size of the experiment, and a large data set may not be available in many cases. Moreover, differences between schools in their approach to inference are notable, even from a practical point of view.

First, the expression of uncertainty about unknowns is completely different. The frequentist school studies the distribution of the estimator. In this framework, a standard error from this distribution is usually given, and sometimes the complete sampling distribution of the estimator is obtained using techniques such as the bootstrap. The Bayesian school aims to obtain the probability density function of the parameter for a given set of data. From this density function one can get the most probable value of the parameter, or the probability that the parameter lies within certain limits. The frequentist way of inference is based on how a large number of estimates would be distributed around the true value if a large number of samples were taken or an infinite number of repetitions of the experiment were performed, whereas Bayesians examine the probability distribution of the true value, given the data. For a frequentist, the true value is usually fixed and the sample is variable, whereas for a Bayesian the sample is fixed and the parameter of interest is a random variable. This does not mean that a true value of the parameter does not exist; for a Bayesian the true value exists, but because the Bayesian does not know which value it is, the Bayesian speaks about the probability that this parameter has a particular value.

Second, some statistical concepts currently used in animal breeding do not have a Bayesian interpretation, for example, "bias" and the difference between fixed and random effects. In a Bayesian context "bias" does not exist, because conceptual repetitions of the experiment are not considered. Also, all effects are random, because the Bayesian way of expressing uncertainty is to draw density functions of all unknowns, and thus all unknowns are considered random variables. This can be surprising to an animal breeder working with BLUP (Henderson, 1973) or REML (Patterson and Thompson, 1971), but the property of unbiasedness has been discussed even within the frequentist school. For example, Henderson himself stated that biased methods could be more efficient (e.g., Henderson, 1984). Fisher considered the property of unbiasedness largely irrelevant (Fisher, 1956), mainly because it is not invariant to reparametrization. For example, the expectation of an unbiased estimator of the variance is the population variance, but the square root of this estimator is not an unbiased estimator of the standard deviation. With respect to fixed effects, Fisher thought that the distinction between fixed and random effects was not a clear improvement to the analysis of variance (see the introduction to the new edition of Fisher's statistical books, Yates, 1990). Henderson decided to use fixed effects in genetic evaluation, introducing mixed models, because of some technical problems related to the correlations between herd and genetic effects, as will be seen later, but noting that random effects have a lower risk of estimation.

Third, inferences obtained from both schools are not always coincident, particularly for small samples and when the Bayesian analysis uses prior information (e.g., Wang et al., 1994).

Fourth, some problems that have no solution (or only a rough approximation) in the frequentist school can be solved unambiguously with the Bayesian approach (see the previous quote of Gianola et al., 1994).

Given the apparent advantages of the Bayesian school on the first and fourth points, two obvious questions arise. Why were the Bayesian techniques abandoned in the past? Or, as Brad Efron asks in a brief paper with an interesting discussion, Why isn't everyone a Bayesian? (Efron, 1986). And why have frequentist techniques persisted in the field of animal breeding? The first question has a simple answer: Bayesian methods usually require solving complicated multiple integrals, many times requiring the use of numerical methods (e.g., Cantet et al., 1992) that may not be feasible. The recent development of MCMC techniques (Gibbs sampling and others) has given a solution to many problems that were unsolvable before, due to the impossibility of evaluating these integrals. The second question has a more difficult answer and requires a detailed comparison of how inference is made in each of the schools. Subsequent sections of this paper provide this comparison and present reasons for and against the use of one or the other form of inference in the common statistical problems of animal breeding.

The Frequentist School

The frequentist school was developed in the 1930s and 1940s and was based on the earlier work of Karl Pearson and Ronald Fisher. The most relevant names associated with this school are Jerzy Neyman and Egon Pearson (son of Karl Pearson), who dealt with inference, and Abraham Wald, who worked in decision theory.

Inference and Precision

Most animal breeding work relates to estimation of genetic values or parameters and to testing hypotheses; only recently has some work related to decision theory been published (e.g., Woolliams and Meuwissen, 1993). Below are some of the features of inference in the frequentist school.

Repeating an experiment conceptually an infinite number of times, we can arrive at an infinite number of confidence intervals, which would include the true value of the parameter in, say, 95% of the cases. Notice that when a confidence interval is given, for example [0.15, 0.25] for h2, this does not mean that the probability of h2 being between 0.15 and 0.25 is 95%. What we say is that if the experiment were repeated an infinite number of times, we would get an infinite number of confidence intervals (of which [0.15, 0.25] is just an example) that would contain the true value of h2 in 95% of the cases. Neyman and Pearson (1933) suggested that the "scientific behavior" should be to act as if the interval obtained was the true one, being sure that, in the long run, we will be right in 95% of the cases.
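As a numerical illustration of this repeated-sampling interpretation (not part of the original text), the following sketch simulates many samples from a normal population and counts how often the classical t-interval covers the true mean; the population, sample size, and interval type are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, sd, n, reps = 2.0, 0.5, 20, 10_000

covered = 0
for _ in range(reps):
    y = rng.normal(true_mean, sd, size=n)
    se = y.std(ddof=1) / np.sqrt(n)
    t = stats.t.ppf(0.975, df=n - 1)          # 97.5% quantile of Student's t
    lo, hi = y.mean() - t * se, y.mean() + t * se
    covered += (lo <= true_mean <= hi)        # does this interval contain the true mean?

print(covered / reps)   # close to 0.95: coverage is a property of the procedure,
                        # not a probability statement about any single interval
```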

Assuming the risk of considering a hypothesis to be true when it is not, or of rejecting the hypothesis when it is true, some regions of acceptance or rejection can be defined, based on the distribution of the sample under the hypothesis to be tested when the experiment is repeated an infinite number of times. The hypothesis is accepted or rejected depending on whether the actual sample falls in one of the regions or in the other. After the hypothesis is accepted, the "scientific behavior" should be to act as though this hypothesis were true.

Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong.

Neyman and Pearson (1933)

This procedure is considered to be based on a "subjective belief" by some philosophers of science (Howson and Urbach, 1996), a statement that may be considered surprising by many members of this school of inference. Note that the method gives no clue about how probable one hypothesis is relative to another. The answer given by a test of hypothesis is just yes or no, because the risk and the corresponding regions are fixed before the experiment is performed. This is a rather weak answer from an inference point of view, although it is helpful for making decisions.

There are some problems associated with common practice. For example, a hypothesis may be that two breeds have the same mean for some trait. Note that by augmenting sample size sufficiently, this hypothesis will always be rejected. Thus, in a well-designed experiment, it should be decided when (that is, how large the difference between means should be) a hypothesis should be rejected. The problem arises in poorly designed experiments, in field data, or in tests for traits not considered when designing the experiment. In all these cases, significance will not be associated with relevant differences, because these differences were not considered when the experiment was designed. Thus, the answer provided by the test (to accept or reject the hypothesis) might be unsatisfactory, because we can find relevant differences that are not significant or irrelevant differences that are significant. Moreover, if we consider several traits, and the design has been made for only one trait, we will find significant differences for some of these traits even when the null hypothesis is true, just because there is always some probability of falling into the rejection region. The probability of falling into this region for at least one trait increases with the number of traits considered.
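The last point is simple arithmetic: if k independent traits are each tested at the 5% level and all null hypotheses are true, the probability of at least one spurious significant result is 1 − 0.95^k. A minimal check (the independence assumption and the trait counts are mine):

```python
# Probability of at least one false positive among k independent tests at alpha = 0.05
alpha = 0.05
for k in (1, 5, 10, 20):
    print(k, 1 - (1 - alpha) ** k)
# 1 -> 0.05, 5 -> ~0.23, 10 -> ~0.40, 20 -> ~0.64
```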

One of the main problems of scientific inference is the choice between several models that can explain the results obtained. Many times these models are nested. For example, it has to be decided whether or not a maternal effect explains a part of the variability of the data, whether or not there is a threshold for one or more traits, whether or not to include the effect of a QTL, and so on. For nested models, the significance of the ratio of the likelihoods can be approximated using the chi-square distribution, if there are enough data. Then, a hypothesis test can be performed and the null or the alternative hypothesis selected. After deciding whether the alternative hypothesis is accepted or not, the researcher will behave as though the model chosen were the right one, dismissing the other models as false. The main problems of this procedure are that it can only be used for nested models and that it is based on asymptotic properties. Thus, it is sometimes difficult to decide whether we have enough data to make the requisite choices among models.
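A sketch of this frequentist recipe for nested models follows; the data and the single extra effect are invented, and the chi-square reference assumes the asymptotic approximation holds.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 200
group = rng.integers(0, 2, n)                      # a hypothetical two-level effect
y = 10 + 0.3 * group + rng.normal(0, 1.0, n)

def max_loglik(resid):
    """Log-likelihood of a normal model evaluated at the ML variance (divides by n)."""
    s2 = np.mean(resid ** 2)
    return -0.5 * n * (np.log(2 * np.pi * s2) + 1)

ll0 = max_loglik(y - y.mean())                      # null model: common mean
means = [y[group == g].mean() for g in (0, 1)]
ll1 = max_loglik(y - np.take(means, group))         # alternative: group effect included

lr = 2 * (ll1 - ll0)                                # likelihood-ratio statistic
p = stats.chi2.sf(lr, df=1)                         # asymptotic chi-square with 1 df
print(lr, p)                                        # the answer is accept or reject, nothing in between
```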

Selection Indices and BLUP as Frequentist Predictors

Selection indices appeared in a paper by Karl Pearson on natural selection (Pearson, 1903), although they were used much later in animal breeding. First, Smith (1936) proposed selection indices for several traits in plant breeding. Apparently, it was Fisher who proposed the method to Smith, because Smith says in his article that section I [Theory] is little more than a transcription of Fisher's suggestions. Later, Hazel (1943) applied selection indices to the field of animal improvement. In 1949, BLUP first appeared presented as an ML method, which turned out not to be the case (Henderson, 1949, 1950), and it was practically forgotten for more than 20 yr. Henderson (1973) gives details of the history and development of BLUP. Here, we will examine the frequentist properties of the two methods.

Consider the linear model

y = Xb + Zu + e [1]

where y is the data vector, b the vector of fixed effects, u the vector of random effects (breeding values), e the vector of residuals, and X and Z are incidence matrices. The predictor

û = GZ′V−1(y − Xb̂)

where G and V are the known (co)variance matrices of the random effects and of the data, respectively, and b̂ is the generalized least squares estimator of the fixed effects, gives the BLUP of u. Frequentist statisticians distinguish between estimating a fixed effect, which remains constant for each conceptual repetition of the experiment, and predicting a random value, which changes in each repetition. However, this distinction seems to be rather unnecessary. When there are no fixed effects, removing Xb̂ from the formula we obtain the expression of the selection index, or best linear predictor.

Inverting V is not easy when thousands of data are available, as usually happens in animal breeding. Henderson (1963) demonstrated that solving the equations

[ X′X    X′Z       ] [ b̂ ]   [ X′y ]
[ Z′X    Z′Z + G−1 ] [ û ] = [ Z′y ]        [2]

resulted in the same b̂ and û as before. Later, he found an algorithm to calculate G−1 directly from pedigrees, which simplified the problem and made Eq. [2] practical. Since then, this method has been widely used by animal breeders. These equations [2] are called mixed-model equations in the animal breeding literature.
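As a numerical check (entirely illustrative; the toy records, the single fixed effect, the unrelated animals, and the unit residual variance are my assumptions), the direct predictor û = GZ′V−1(y − Xb̂) and the solution of Eq. [2] can be verified to coincide:

```python
import numpy as np

rng = np.random.default_rng(3)
n_rec, n_anim = 12, 5
lev = np.repeat([0, 1], 6)                        # a single fixed effect with two levels
anim = np.arange(n_rec) % n_anim                  # which animal produced each record
X = np.zeros((n_rec, 2)); X[np.arange(n_rec), lev] = 1
Z = np.zeros((n_rec, n_anim)); Z[np.arange(n_rec), anim] = 1
G = 0.4 * np.eye(n_anim)                          # var(u); unrelated animals for simplicity
R = np.eye(n_rec)                                 # residual variance set to 1 so Eq. [2] holds as written
y = rng.normal(10.0, 1.0, n_rec)

# Direct predictor: generalized least squares for b, then u_hat = G Z' V^-1 (y - X b_hat)
V = Z @ G @ Z.T + R
Vi = np.linalg.inv(V)
b_hat = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)
u_hat = G @ Z.T @ Vi @ (y - X @ b_hat)

# Henderson's mixed-model equations [2]
lhs = np.block([[X.T @ X, X.T @ Z],
                [Z.T @ X, Z.T @ Z + np.linalg.inv(G)]])
rhs = np.concatenate([X.T @ y, Z.T @ y])
sol = np.linalg.solve(lhs, rhs)

print(np.allclose(sol, np.concatenate([b_hat, u_hat])))   # True: both routes agree
```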

It should be noted that BLUP is said to be unbiased in a sense differing somewhat from that employed in frequentist statistics. Consider one of the elements of b, say, b for simplicity. Taking an infinite number of samples, the values of b̂ obtained will have an average value

E(b̂) = b

That is, the estimates b̂ are distributed around the true value b. Then we say that b̂ is an unbiased estimator of b. However, this condition cannot be met for random values, because they change in each repetition of the experiment. For random values we use the expression

E(û) = E(u)

to express unbiasedness. This latter property is much less appealing than the former one. The predicted values û in infinite repetitions of the experiment are not distributed around the true value u, because in each repetition we obtain not only a different û but also a different u. The risk of the estimator is also different.


In general, when a quadratic loss function ℓ(θ̂, θ) = (θ̂ − θ)2 is used,

Risk(θ̂) = E[ℓ(θ̂, θ)] = E(θ̂ − θ)2

= E[(θ̂ − θ) − E(θ̂ − θ) + E(θ̂ − θ)]2

= var(θ̂ − θ) + [E(θ̂) − E(θ)]2

Thus, the risk can be decomposed into two quantities: the variance of the error of estimation, θ̂ − θ, and the square of the bias. For fixed effects, because b is a fixed value and does not change in the conceptual repetitions of the experiment, E(b) = b, and the variance of the error of estimation becomes var(b̂ − b) = var(b̂). Then,

Risk(b̂) = [bias(b̂)]2 + variance(b̂)

However, for random effects³, the variance of the error of estimation is as follows:

variance(û − u) = variance(û) − 2 cov(û, u) + variance(u)

= variance(û) − 2 variance(û) + variance(u)

= variance(u) − variance(û).

Then,

Risk(û) = [bias(û)]2 + variance(u) − variance(û).

The variance(u) cannot be changed because it is a population parameter that depends on the genetic determination of the trait. As information on the random effect increases, variance(û) increases, tending toward variance(u), and the quadratic risk decreases. However, for fixed effects, variance(b̂) decreases as information increases, and the quadratic risk also decreases. This different behavior of variance(û) and variance(b̂) is a typical source of confusion.
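A quick Monte Carlo check of the decomposition risk = variance of the estimation error + squared bias, using an invented, deliberately shrunken estimator of a fixed parameter:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 5.0, 10, 200_000
est = 0.9 * rng.normal(theta, 2.0, (reps, n)).mean(axis=1)   # a deliberately biased (shrunken) estimator

risk = np.mean((est - theta) ** 2)            # E(theta_hat - theta)^2 over conceptual repetitions
bias = est.mean() - theta
decomposition = est.var() + bias ** 2         # var(theta_hat - theta) + bias^2 (theta is fixed here)
print(risk, decomposition)                    # the two numbers agree up to Monte Carlo noise
```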

There may be an infinite number of estimators having the same risk but with different bias and variance. To solve the problem, the predictor with minimum variance could be chosen from among the unbiased ones, even if unbiased predictors may have a higher variance of the error of estimation, and consequently a higher risk, than some biased predictors; BLUP is derived in this way. To compute BLUP, the variances of the random effects u and e in model [1] should be known. This is usually not the case, so it is often recommended to estimate them from the data by ML methods, or to use estimates from the literature.

³Here bias(û) = [E(û) − E(u)], which is different from that normally used in frequentist statistics for parameter estimation. Bias should be [E(û|u) − u], but this is different from zero and, consequently, BLUP is biased. To state that BLUP is unbiased by changing the usual definition of bias seems to be a rather liberal use of the language. Besides, the term best is somewhat misleading. BLUP is only best among a restricted class of predictors (the "unbiased" ones), and only when the true variances of the random effects are used.

The Method of Maximum Likelihood

The concept of likelihood and the method of ML were developed by Fisher between 1912 and 1922, although there are historical precedents attributed to Bernoulli (1782, translated by Kendall, 1961). By 1912 the theory of estimation was in an early state and the method was practically ignored. However, Fisher (1922) published a paper in which the properties of estimators were defined, and he found that this method produced estimators with good properties, at least asymptotically. The method was then accepted by the scientific community and has since been used frequently (Fisher, 1912, 1922).

Suppose we know how the samples are distributed in infinite repetitions of an experiment, or how infinite samples would be distributed. We will use the following notation: f(y|u), which means the density function of the sample y for a "given" value of u. Thus, here the variable is y, because we examine how it is distributed when repeating the experiment, and u is a fixed value that we provide in order to examine the probability density distribution of y for this given value. The ML method consists of finding which value of u maximizes the probability of our sample y. To explain the concept of likelihood and the rationale for using its maximum, we put forth an example.

Consider finding the average weight of rabbits of a breed at 8 wk of age. We take a sample of one rabbit, and its weight is y0 = 1.6 kg. Figure 1 shows the density functions of several possible populations from which this rabbit can come, with population means m1 = 1.50 kg, m2 = 1.60 kg, and m3 = 1.80 kg. Notice that at y0 the probability densities of the first and third populations, f(y0|m1) and f(y0|m3), are lower than that of the second one, f(y0|m2). Therefore, it seems more likely that the rabbit comes from the second population. All the values f(y0|m1), f(y0|m2), f(y0|m3), . . . define a curve with a maximum at f(y0|m2). This curve varies with m, and the sample y0 is a fixed value for all those density functions. Each value represents an "instant probability," as Fisher (1912) first called it, in the sense that it belongs to a density function. However, it is obvious that the new function defined by these values is not a density function, because each value belongs to a different probability density function. Fisher (1912) proposed to take the value of m that maximized f(y0|m) because, of all the populations defined by f(y0|m1), f(y0|m2), f(y0|m3), . . ., it is the one that makes the sample y0 most probable. Here the word probability can lead to some confusion, because these values belong to different density functions and the function defined with all of them is not a probability function. Thus, Fisher preferred to use the word likelihood for all the values considered together. Notice that the method of ML is not the one that makes the sample most probable. This method provides a value of the parameter such that if this were the true value the sample would be most probable. Here we face a problem of notation: speaking about a set of density functions f(y|m1), f(y|m2), f(y|m3), . . . for a given y is the same as speaking about a function L(m|y) that is not a density function and consequently does not represent probabilities⁴. We will use the notation L(m|y) as the one that expresses the likelihood most clearly.

Figure 1. Construction of the likelihood curve.
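The construction of Figure 1 is easy to reproduce numerically. The sketch below evaluates normal densities at the single observation y0 = 1.6 kg over a grid of candidate means; the standard deviation of 0.2 kg is an assumption of mine, not a value given in the text.

```python
import numpy as np
from scipy import stats

y0, sd = 1.6, 0.2                       # observed rabbit weight; sd assumed for the sketch
m_grid = np.linspace(1.3, 1.9, 121)     # candidate population means, including 1.50, 1.60, 1.80

# Each point of the likelihood curve is a density from a *different* population,
# evaluated at the same fixed observation y0: L(m | y0) = f(y0 | m).
L = stats.norm.pdf(y0, loc=m_grid, scale=sd)

for m in (1.50, 1.60, 1.80):
    print(m, stats.norm.pdf(y0, loc=m, scale=sd))
print("ML estimate:", m_grid[np.argmax(L)])   # 1.6, the mean that would make y0 most probable
```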

Fisher (1912, 1922) not only proposed a method of estimation, but also proposed the likelihood as a degree of belief different from probability, but one that allowed uncertainty to be expressed in a similar manner. What Fisher proposed is to use the whole likelihood curve, not only its maximum, a practice rather unusual nowadays with the exception of QTL analyses. In these cases there is the risk of confusing likelihood and probability. For example, Von Mises (1957) accused Fisher of exposing with great care the differences between likelihood and probability, only to forget it later and use the word likelihood as we use probability in common language. Today, frequentist statisticians typically use only the maximum of the curve because it has good properties in repeated sampling. The method is asymptotically unbiased, sufficient when sufficient estimators exist, efficient, optimum asymptotically normal, and so on. Repeating the experiment an infinite number of times, the estimator will be distributed near the true value, with a variance that can also be estimated. But all those properties are asymptotic, and thus there is no guarantee with small samples about the goodness of the estimator. The animal breeder usually works with a large number of animals, but here "small samples" does not mean a small total number of data. Depending on the problem and on the distribution of the data, the information for estimating some parameters can come from a reduced number of data. Besides, the ML estimator is not necessarily the estimator that minimizes the risk⁵. Nevertheless, the method has an interesting property apart from its frequentist properties: any reparametrization leads to the same type of estimator. For example, the ML estimator of the variance is the square of the ML estimator of the standard deviation, and a function of ML estimators is also an ML estimator.

⁴Classic texts of statistics such as Kendall's (Stuart and Ord, 1991) contribute to the confusion by using the notation L(y|m) for the likelihood. Moreover, some authors distinguish between "given a parameter" (always fixed) and "given the data" (which are random variables). They use (y|m) for the first case and (m;y) for the second. Thus, the likelihood can be found in textbooks as L(m|y), L(y|m), f(y|m), and L(m;y).

⁵A well-known example is the James-Stein paradox: Suppose we have K populations normally distributed with the same variance, and we try to estimate the means mi with minimum risk E[Σ(m̂i − mi)2]. The maximum likelihood estimator takes the sample mean of each population, m̂ = [x̄1, x̄2, . . ., x̄K]. James and Stein (1961) found an estimator of lower risk for K > 2.

From a practical point of view, the ML estimator is an important tool for the applied researcher. The frequentist school developed a list of properties that good estimators should have but does not give rules about how to find them. Maximum likelihood is a way of obtaining estimators with (asymptotically) desirable properties. It is also possible to find a measurement of precision from the likelihood function itself. If the likelihood function is sharp, its maximum gives a more likely value of the parameter than other values near it. Conversely, if the likelihood function is rather flat, other values of the parameter will be almost as likely as the one that gives the maximum to the function. Fisher (1922) introduced the concept of "amount of information" in order to measure the accuracy of the estimation. This "amount of information" is defined as

E[d logL(u|y)/du]2

Logarithms are taken to facilitate an additive measure of information. For example, if n individuals of a sample are independent,

L(m|y1, . . ., yn) = L(m|y1) × L(m|y2) × . . . × L(m|yn),

and log[L(m|y1, . . ., yn)] = log[L(m|y1)] + log[L(m|y2)] + . . . + log[L(m|yn)]

This quantity is an indicator of the sharpness of the likelihood function. If the function is sharp, the quantity [d log L(u|y)/du] will be high in absolute value, and thus the amount of information will be large (the square prevents negative values). Conversely, if the function is rather flat, this quantity will be low, meaning that some solutions far away from the maximum have almost the same likelihood or "degree of belief." In small samples, the likelihood can be asymmetrical or multimodal, giving varying chances (degrees of belief) to points that are not near the maximum.
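Both points, the additivity of the log-likelihood and the link between sharpness and information, can be illustrated with a normal sample of known variance (my choice of setting):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
sd = 1.0

def loglik(m, y):
    return stats.norm.logpdf(y, loc=m, scale=sd).sum()   # additivity: a sum over independent data

def curvature(y, m0, h=1e-4):
    # numerical second derivative of the log-likelihood at m0; its negative measures the information
    return (loglik(m0 + h, y) - 2 * loglik(m0, y) + loglik(m0 - h, y)) / h ** 2

for n in (5, 50, 500):
    y = rng.normal(2.0, sd, n)
    print(n, -curvature(y, y.mean()))   # close to n/sd^2: larger samples give sharper likelihoods
```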

The frequentist school has reduced the problem by making inferences only at the maximum of this function. With infinite repetitions of the experiment, the likelihood function would converge asymptotically to a normal with mean equal to the true value and variance equal to the inverse of the amount of information (e.g., Stuart and Ord, 1991). The main problem is that this approximation is based on the central limit theorem, needing large samples, and it is not possible to determine how large the samples should be to ensure normality.

Because likelihood means "rational degree of belief," it would be expected that ratios of likelihoods would indicate differences among models in hypothesis tests, and some statisticians (Edwards, 1992) recommend the use of the likelihood ratio reasoning in this way. Unfortunately, likelihoods are not quantities that can be treated as probabilities; a likelihood four times higher than another one does not lead to a "degree of rational belief" four times higher. Nevertheless, the likelihood ratio has good asymptotic frequentist properties and is currently used in testing hypotheses.

REML as a Frequentist Estimator

Although Fisher proposed both the method of ML and the analysis of variance, he never intended to estimate variance components by ML. It was Hartley and Rao (1967) who first proposed this. The ML procedure leads to complex equations that must be solved approximately using iterative algorithms. This causes computing difficulties, and because of them ML procedures were not used in the field of animal breeding until recently. A conceptual difficulty also appeared when mixed models were used. The ML estimation of variance components does not take into account the loss of degrees of freedom caused by estimation of the fixed effects. This can be worrying when the model includes fixed effects with many levels, as in national genetic evaluations of dairy cattle. The solution, proposed for a simple model by Thompson (1962) and generalized by Patterson and R. Thompson (1971), was called residual or restricted ML (REML). In recent years, REML has been the preferred method of animal breeders for variance component estimation. Following Harville's (1977) presentation of REML, this method is based on projecting the data into a subspace free of fixed effects and maximizing the likelihood in this subspace. The REML method has the advantage of giving the same results as ANOVA in balanced designs (ANOVA is unbiased with minimum variance for such designs). Alternatively, ML estimation produces results different from ANOVA in balanced designs, which is somewhat disturbing. To better explain the differences between ML and REML, consider a very simple model:

y = Xµ + e = 1µ + e

where X = 1 is a vector of 1's. The (co)variance matrix of the residuals is V = Iσ2, and its determinant is |V| = (σ2)n. The likelihood function, when y ∼ N(1µ, Iσ2) and the sample is of size n, is

L(σ2|y) = constant × (σ2)−n/2 exp[−(y − 1µ)′(y − 1µ)/2σ2].

The value of σ2 that maximizes L(σ2|y) is

σ̂2 = (1/n)(y − 1µ)′(y − 1µ) = (1/n)Σ(yi − µ)2

Notice that µ must be known to obtain the ML estimate of the variance. Because we do not know µ, we substitute the ML estimate of µ:

µ̂ = (1/n)Σyi

The ML estimate of the variance is:

σ̂2ML = (1/n)(y − 1µ̂)′(y − 1µ̂)

Although this estimate is a function of another estimate, it is still an ML estimate, and thus it has all the asymptotic properties mentioned before, and there is no formal reason to reject it.
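A two-line check of this behavior in repeated sampling (simulated data, arbitrary parameters): dividing the sum of squares about the sample mean by n underestimates σ2 on average, whereas dividing by n − 1 does not.

```python
import numpy as np

rng = np.random.default_rng(6)
sigma2, n, reps = 4.0, 10, 100_000
y = rng.normal(0.0, np.sqrt(sigma2), (reps, n))

ss = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
print((ss / n).mean())        # ML-type estimate: averages near sigma2 * (n-1)/n = 3.6
print((ss / (n - 1)).mean())  # n-1 estimate: averages near 4.0
```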

To calculate the REML estimates, the data are projected into a subspace without fixed effects. A projection matrix K that satisfies


K′y = K′1µ + K′e = K′e

is found (i.e., K has been chosen to satisfy the condition K′1 = 0). For example,

K′ = [ 1  −1   0   0  . . .   0 ]
     [ 1   0  −1   0  . . .   0 ]
     [ 1   0   0  −1  . . .   0 ]
     [ .   .   .   .  . . .   . ]
     [ 1   0   0   0  . . .  −1 ]

satisfies this condition. The variance of K′y is K′VK = K′Kσ2, and the likelihood is as follows:

L(σ2|K′y) = constant × |K′Kσ2|−1/2 exp[−(K′y)′(K′K)−1(K′y)/2σ2]

where

|K′Kσ2| = n (σ2)n−1

The value of σ2 that maximizes L(σ2|K′y) is:

σ̂2 = [1/(n − 1)] y′K(K′K)−1K′y

It can be shown that, in general,

y′K(K′VK)−1K′y = [y − X(X′V−1X)−X′V−1y]′ V−1 [y − X(X′V−1X)−X′V−1y]

In this case, X = 1, V = Iσ2. Thus,

X(X′V−1X)−X′V−1y = 1[(1/σ2)1′1]−1(1/σ2)1′y = 1[(1/σ2)n]−1(1/σ2)Σyi = 1(1/n)Σyi

and

y′K(K′K)−1K′y = [y − 1(1/n)Σyi]′[y − 1(1/n)Σyi] = (y − 1µ̂)′(y − 1µ̂)

Thus, the REML estimate of the variance is

σ̂2REML = [1/(n − 1)](y − 1µ̂)′(y − 1µ̂),

which is similar to the ML estimate, but dividing by n − 1 instead of by n. Despite the similarity of both estimates, here there is no substitution of a true value by its estimate. Rather, as a result of the estimation process, an expression in the formula, (1/n)Σyi, coincides with µ̂. The degree of freedom lost when estimating µ is reflected in dividing by (n − 1) instead of n. When there are many fixed effects, this distinction is important because the ML estimate is

σ̂2ML = (1/n)(y − Xb̂)′(y − Xb̂)

whereas the REML estimate is

σ̂2REML = [1/(n − r(X))](y − Xb̂)′(y − Xb̂)

where r(X) is the rank of X and b̂ is the ML estimate of b.

With the proposed matrix K, the analysis of variance is made on the vector

(K′y)′ = [y1 − y2, y1 − y3, . . ., y1 − yn]

in order to avoid the estimation of the mean µ. By using differences between data,

yi − yj = (µ + ei) − (µ + ej) = ei − ej

µ is removed. Several matrices satisfy the same condition K′1 = 0. This analysis could also be made on

(K′y)′ = [y1 − y2, y2 − y3, . . ., yn−1 − yn]

and the same result obtained. This does not mean that every K would be useful; for example, a K with half of the rows null will satisfy the condition K′1 = 0, as will the matrix K = 0. It is intended to find matrices that do not lose information relative to the dispersion parameters, which implies including in K a maximum number of independent linear contrasts of residuals. It is not important which K is used. In fact, the usual way in which REML is found in the literature is in terms of solutions to the mixed-model equations [2] and components of the mixed-model equations, an expression in which K does not appear explicitly. Notice that in the example K has dimensions (n − 1) × n (there are only n − 1 pairs of data from which to calculate differences) and a degree of freedom is lost. Thus, summarized information was used, because the new data vector K′y has only n − 1 elements. To obtain a geometrical representation of this sample an n-dimensional space is needed, but using REML only an (n − 1)-dimensional space is required to represent the sample that will be used in estimation; thus the reference to working in a restricted space, having lost a degree of freedom. When there are many fixed effects or many levels of one effect, this is much more noticeable.
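The projection argument can be verified numerically. The sketch below builds the K′ shown above (rows y1 − yi), checks that K′1 = 0, and confirms that [1/(n − 1)] y′K(K′K)−1K′y reproduces the familiar estimator that divides by n − 1; the data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 8
y = rng.normal(3.0, 1.5, n)

# K' with rows (y1 - y2), (y1 - y3), ..., (y1 - yn): dimensions (n-1) x n
Kt = np.hstack([np.ones((n - 1, 1)), -np.eye(n - 1)])
print(np.allclose(Kt @ np.ones(n), 0))                 # K'1 = 0: the mean is projected out

Ky = Kt @ y
reml = Ky @ np.linalg.solve(Kt @ Kt.T, Ky) / (n - 1)   # [1/(n-1)] y'K(K'K)^{-1}K'y
print(reml, y.var(ddof=1))                             # identical: the estimator dividing by n - 1
```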

As Searle et al. (1992) stated, there is not a clear (frequentist) argument for preferring REML to ML. In the former example, the REML estimate was unbiased but had a higher risk than the ML estimate, although this may not happen in more complex cases. In other circumstances, the risk of REML will be higher or lower than the risk of ML, depending on the true values of the parameters and the structure of the data. Restricted ML became the preferred method among animal breeders, particularly among dairy cattle breeders, for indirect reasons rather than for clear statistical advantages such as minimizing risk. One reason to prefer REML is that in practical problems such as genetic evaluation of dairy cattle there may be many levels of herd-year-season and other effects, and ML does not take into account the loss in degrees of freedom due to the estimation of these effects. However, REML is used in the field of animal breeding as the method of choice independent of whether there are few or many levels of fixed effects, perhaps because, as Dawid (1976, cited by Robinson, 1991) stated for the t-test in the quote that opens this paper, "everyone knows that REML is a good thing."

The Bayesian School

The Bayesian school was, in practice, founded by Count Laplace through several works published from 1774 to 1812, and it had a preponderant role in scientific inference during the 19th century (Stigler, 1986). Some years before Laplace's first paper on the matter, the same principle had been formalized in a posthumous paper presented at the Royal Society of London and attributed to a rather obscure priest, the Rev. Thomas Bayes (who never published a mathematical paper during his life). Apparently, the principle upon which Bayesian inference is based was formulated even earlier. Stigler (1983) attributes it to Saunderson (1683–1739), a blind professor of optics who published a large number of papers in several fields of mathematics. Fisher's work on likelihood in the 1920s and the work of the frequentist school in the 1930s and 1940s eclipsed Bayesian inference until the 1960s, when a "revival" started that has grown since. The Bayesian paradigm was introduced in animal breeding by Daniel Gianola, first in work on threshold traits published with J. L. Foulley, and later in a series of papers in which virtually all topics in animal breeding were addressed (see Gianola and Fernando, 1986, and Gianola et al., 1990, for reviews).

Inference and Precision

In a Bayesian context, the objective is, given the data, to describe the uncertainty about the true value of some parameter, using probability as a measurement of this uncertainty. For example, if the parameter of interest is the heritability of some trait, the aim of Bayesian inference is to find a probability density of the heritability given the data, f(h2|y), where y is the vector of observations (Figure 2). When this distribution is obtained, inferences can be made in multiple manners: for example, we can calculate the probability of h2 being between 0.1 and 0.3 by integrating the function between these values. We can also find the shortest interval in which the probability of finding h2 is more than 95%. If we are interested in point estimation, we can give several values of h2 calculated from the distribution f(h2|y). The mode is the value that maximizes f(h2|y), i.e., the most probable value of h2 that we can infer given the data. It can be shown that this is the value that minimizes risk when the loss function has a null value for ĥ2 = h2 and is 1 in other cases. The probability that the true value of h2 is greater than the median is the same as the probability of it being less. Thus, the median minimizes risk when the loss function is |ĥ2 − h2|. Finally, the mean minimizes the risk when the loss function is the quadratic one, (ĥ2 − h2)2.
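Once a marginal posterior is available, these summaries are mechanical to obtain. The sketch uses an asymmetric Beta density on a grid as a stand-in posterior for h2 (the shape parameters are invented) and reads off mode, median, mean, and the probability that h2 lies between 0.1 and 0.3.

```python
import numpy as np
from scipy import stats

h2 = np.linspace(0.0, 1.0, 2001)
post = stats.beta.pdf(h2, 2.5, 9.0)            # a hypothetical, asymmetric posterior f(h2|y)
post /= np.trapz(post, h2)                     # normalize on the grid

cdf = np.cumsum(post) * (h2[1] - h2[0])
print("mode  :", h2[np.argmax(post)])                       # minimizes the 0/1 loss
print("median:", h2[np.searchsorted(cdf, 0.5)])             # minimizes the absolute loss
print("mean  :", np.trapz(h2 * post, h2))                   # minimizes the quadratic loss
keep = (h2 > 0.1) & (h2 < 0.3)
print("P(0.1 < h2 < 0.3):", np.trapz(post[keep], h2[keep]))
```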

In order to make all of these inferences, the density function f(h2|y) should be obtained. To do this, we will use the Bayes theorem. The probability of two events happening together is

P(A,B) = P(A|B) × P(B) = P(B|A) × P(A) [3]

Thus,

P(A|B) = P(B|A) × P(A)/P(B)

In the present case:

f(h2|y) = f(y|h2) f(h2)/f(y) ∝ f(y|h2) f(h2) [4]

where ∝ means "proportional to." Notice that f(h2|y) is a function of h2, but not of y, which is given (it is "fixed"); likewise, f(y|h2) is a function of h2 but not of y, which means that f(y|h2) is the likelihood, as we defined it before. Moreover, f(y) is a constant for the same reason: it does not depend on h2, and y is fixed. Finally, f(h2) is the probability of h2 that does not depend on our data. The criticism of Bayesian inference is focused on this last probability, called prior probability because it can be defined independently of the data and before the start of the experiment (e.g., Glymour, 1981). In defining likelihood, we stressed that it was the probability of the sample if the value of the parameter were the true value. Now we weight this by the probability that the parameter is indeed the true value. Sometimes this prior probability is well defined (e.g., the prior probability of obtaining a recessive individual when crossing two heterozygous individuals is 1/4), but for heritability the value of this prior probability may not be obvious. The density function f(h2|y) is called posterior probability, posterior density, or, more commonly, posterior distribution. A common expression of this posterior density is also

f(h2|y) ∝ L(h2|y) f(h2)

containing L(h2|y), which is the density function of the data, f(y|h2), but when considered as a function of h2 it is the likelihood. We will use this notation because it is more coherent with the use of the bar "|", which means "given." Thus, L(h2|y) is a function of h2 given the data.
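When the parameter is one-dimensional, expression [4] can be applied directly on a grid: multiply the likelihood by the prior and renormalize, so the constant f(y) never has to be computed explicitly. The toy example below (binomial data with a Beta prior, both invented) only illustrates the mechanics.

```python
import numpy as np
from scipy import stats

# Toy example: posterior for an unknown proportion theta after y successes in n trials.
y, n = 7, 20
theta = np.linspace(0.001, 0.999, 999)

likelihood = stats.binom.pmf(y, n, theta)      # L(theta | y): the data are fixed
prior = stats.beta.pdf(theta, 2.0, 2.0)        # f(theta): a mild prior, chosen arbitrarily

posterior = likelihood * prior                 # Bayes theorem, Eq. [4], up to a constant
posterior /= np.trapz(posterior, theta)        # renormalizing replaces division by f(y)

print("posterior mode:", theta[np.argmax(posterior)])
print("posterior mean:", np.trapz(theta * posterior, theta))
```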

Strictly speaking, all probabilities of a parameter are based on a certain set of assumptions. These probabilities cannot be constructed without them (e.g., the assumption that the sample results from a random process, that the sample space is uniform, that the observer does not influence the observation, etc.). If we call H the set of hypotheses used to describe f(h2), the right notation should be f(h2|H), and therefore the theorem should be expressed as f(h2|y,H) ∝ L(h2|y,H) f(h2|H), but we will assume this implicitly in order to simplify the notation.

In the Bayesian school, there is not a unique possibility for point estimation, and the point estimate of choice can be based on the risk function or on other grounds. For example, if the posterior density is bimodal, the mean, which minimizes the quadratic risk, can have a much lower probability than the modes. In this case, the point estimate is not consistent with the uncertainty about the parameter h2. Another example of the risk of using point estimation is the case in which an asymmetric posterior density of a genetic correlation gives a negative mode but a much higher probability of being positive (Figure 3).

Confidence intervals for the point estimates can be asymmetric. In the case of asymmetric posterior densities, if we prefer the shortest confidence interval, it is necessarily asymmetric. There is a practical difficulty in the estimation of confidence intervals, or "credible regions" as Bayesians prefer to call them. They require knowledge of the posterior density, and it is necessary to integrate this density between the limits of the interval in order to know the probability of the interval. Moreover, to calculate the posterior density it is necessary to know f(y), a value difficult to calculate, because

f(y) = ∫ f(y|h2) f(h2) dh2

Because y is a vector, all these integrals are multidimensional and difficult, if not impossible, to calculate, even by approximate methods. Until recently this has been a weak point in the application of Bayesian methods, and it has been the main hindrance to their development in animal breeding. Now, MCMC methods (Gibbs sampling, Metropolis-Hastings, etc.) have solved the problem.

Figure 2. Example of a probability density for heritability (h2) given the data (y) and estimates from the posterior distribution.
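As a hint of why MCMC removes the need for these integrals, a minimal random-walk Metropolis sampler is sketched below for a one-dimensional posterior; it only evaluates the unnormalized product likelihood × prior, so f(y) never appears. The target, proposal width, and chain length are arbitrary choices, and Gibbs sampling would follow the same logic using conditional distributions instead.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
y_obs, n = 7, 20

def log_unnorm_post(theta):
    # log[ likelihood x prior ]; the normalizing constant f(y) is never needed
    if not 0.0 < theta < 1.0:
        return -np.inf
    return stats.binom.logpmf(y_obs, n, theta) + stats.beta.logpdf(theta, 2.0, 2.0)

chain, theta = [], 0.5
for _ in range(20_000):
    prop = theta + rng.normal(0, 0.1)                       # random-walk proposal
    if np.log(rng.uniform()) < log_unnorm_post(prop) - log_unnorm_post(theta):
        theta = prop                                        # accept the move
    chain.append(theta)

draws = np.array(chain[2_000:])                             # discard burn-in
print(draws.mean(), np.quantile(draws, [0.025, 0.975]))     # posterior mean and 95% interval
```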

Because there are no infinite repetitions of the experiment, there is nothing that can be considered as "bias." Moreover, there are no "fixed effects," because all parameters are random variables in a Bayesian context, including those that are considered "fixed" in a frequentist context. Uncertainty is expressed as the probability of a parameter or an effect having a particular true value, so they must be random variables.

The result of a test of hypothesis is different in the Bayesian school. For example, if the hypothesis to be tested (H0) is that two breeds have the same mean for a trait, and the alternative (H1) is that the means are different, in a Bayesian context the quantity calculated is

P (H0|y)/P(H1|y)

Thus, if this ratio is, for example, 7, the hypothesis of equal means is seven times more probable than the hypothesis of different means. This ratio of posterior probabilities can be expressed as

P(H0|y)/P(H1|y) = [P(y|H0) × P(H0)/P(y)] / [P(y|H1) × P(H1)/P(y)] = BF × P(H0)/P(H1)

wherein the ratio of the likelihoods is called the Bayes factor (BF). Because the prior probabilities of both hypotheses are often considered equal (many times in order to give a false impression of objectivity), Bayes factors coincide with the ratio of posterior probabilities and are used to decide which model is chosen.

Figure 3. Asymmetrical posterior distribution for a genetic correlation (rg), illustrating an inconsistency in sign between the mode and the majority of estimates.

Notice that for nested models the frequentist school also uses the likelihood ratio; thus, sometimes the numerical result of a frequentist test and a Bayesian test can be the same. However, the interpretation is completely different. In a frequentist context, this ratio is used because under infinite repetitions of the experiment it will be approximately distributed as a chi-square. Thus, areas of acceptance and rejection of the hypothesis are established and a decision is made based on this distribution. In the Bayesian context, the ratio of likelihoods is used because (under equal prior probabilities) it is equal to the ratio of posterior probabilities and gives an exact account of the evidence for one hypothesis over the other. This is neither restricted to nested hypotheses nor an approximate result. The procedure is also not restricted to only two hypotheses. The ratio of the likelihoods is the part of the inference that is supported by the data and shows how the data modify the prior state of knowledge to obtain the posterior one. Lavine and Schervish (1997) give a detailed comparison of Bayes factors and P-values from likelihood-ratio tests.
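A small numeric illustration of the Bayesian reading (the data and the two point hypotheses are invented; with simple hypotheses the Bayes factor is exactly the likelihood ratio):

```python
from scipy import stats

# Two simple hypotheses about a proportion; data: 7 successes in 20 trials.
y, n = 7, 20
p_H0, p_H1 = 0.5, 0.25

bf = stats.binom.pmf(y, n, p_H0) / stats.binom.pmf(y, n, p_H1)   # Bayes factor, H0 vs H1

prior_odds = 1.0                      # P(H0)/P(H1): equal priors, often left implicit
posterior_odds = bf * prior_odds
print(bf, posterior_odds)             # posterior odds of H0 relative to H1, given the data
```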

Although Bayes factors summarize the information provided by the data, it is important to notice that, in a Bayesian context, they do not have any inferential property without considering the prior information. Notice that P(y|H0) is the probability of obtaining the current data only if the hypothesis H0 is true. We need to multiply P(y|H0) by the probability that this hypothesis is true, P(H0), and to do the same with P(y|H1) and P(H1), in order to make proper inferences⁶. When a model is chosen only on the information provided by the Bayes factor, equal prior probabilities are implicit.

An interesting procedure for inference that has no counterpart in frequentist statistics is Bayesian model averaging. It consists of simultaneously using several models for inference, weighted according to their posterior probabilities. For example, if we are interested in estimating the heritability of a trait and we have two models constructed under the hypotheses H0 and H1 (for example, with and without a particular effect), then

P(h² | y) = P(h², H0 | y) + P(h², H1 | y)
          = P(h² | H0, y) P(H0 | y) + P(h² | H1, y) P(H1 | y)

where P(h² | H0, y) is the posterior distribution of h² under the first model and P(H0 | y) is the probability that this model is true, and the same holds for the second model. This can be generalized to several models simultaneously. A tutorial on Bayesian model averaging has been published by Hoeting et al. (1999).
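A sketch of the averaging step, assuming the within-model posterior densities and the posterior model probabilities have already been obtained (the Beta densities and the 0.7/0.3 weights below are only placeholders):

```python
# Toy sketch of Bayesian model averaging for a heritability.
# p_h2_given_H0 and p_h2_given_H1 stand in for the posterior densities of h2
# under each model; in practice they would come from the actual analyses.
import numpy as np
from scipy.stats import beta

h2_grid = np.linspace(0.001, 0.999, 999)

p_h2_given_H0 = beta.pdf(h2_grid, a=3, b=12)   # placeholder density under H0
p_h2_given_H1 = beta.pdf(h2_grid, a=5, b=9)    # placeholder density under H1
p_H0_given_y, p_H1_given_y = 0.7, 0.3          # posterior model probabilities

# Model-averaged posterior: P(h2|y) = sum over models of P(h2|Hk, y) P(Hk|y)
p_h2_given_y = p_h2_given_H0 * p_H0_given_y + p_h2_given_H1 * p_H1_given_y

# The mixture still integrates to one and can be summarized as usual.
print(np.trapz(p_h2_given_y, h2_grid))          # ~1.0
print(h2_grid[np.argmax(p_h2_given_y)])         # model-averaged posterior mode
```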

6For example, if I want to infer the height of the typical Scotsman and I take a sample of one of them, obtaining a height of 1.70 m, the hypothesis that maximizes the probability of my sample is not that the average Scotsman measures 1.70 m, but that all Scotsmen measure 1.70 m. Because I have prior information about them, I know they are humans, and that the height of humans follows an approximately normal distribution, and so on, I give a null prior probability to this hypothesis and infer on other grounds.



The Use of Prior Information

Sometimes there is well-defined prior information on the parameters to be estimated. In these cases the use of prior information is not controversial, as we will now see with an example from Fisher. Let us take a trait controlled by a single gene with two alleles with dominance; for example, black skin is dominant over brown skin in some mice. We want to know whether a black mouse, offspring of a heterozygous mating (Aa × Aa), is homozygous (AA) or heterozygous (Aa). In order to assess this, we mate this mouse with a brown (recessive) mouse, and we obtain seven offspring, all of them black. What is the probability that the black mouse is heterozygous, given these data? According to the Bayes theorem shown in Eq. [4],

P(AA | y = 7 black descendants) = L(AA | y = 7 black descendants) · P(AA) / P(y = 7 black descendants)

P(Aa | y = 7 black descendants) = L(Aa | y = 7 black descendants) · P(Aa) / P(y = 7 black descendants)

We are trying to test whether the black mouse we have is or is not homozygous. The expectation from a mating between two heterozygotes is ¹⁄₄ homozygous black, ¹⁄₂ heterozygotes, and ¹⁄₄ homozygous brown. A black mouse from such a mating has, prior to any test-mating, a probability of 1/3 of being homozygous and of 2/3 of being heterozygous, because mating Aa with Aa gives three types of black animals, AA, Aa, and aA, with the same probability. We represent as Aa both of the heterozygotes Aa and aA. Thus, we have P(AA) = 1/3 and P(Aa) = 2/3.

After performing the experiment, the likelihood of the black mouse being homozygous (AA), given seven black offspring, is 1, because if the black father’s genotype is AA, all descendants from crossing an AA mouse with brown (aa) mice are expected to be black (Aa). Thus, L(AA | y = 7 black descendants) = 1. Further, the likelihood of the black mouse being heterozygous, given seven black offspring, is (1/2)⁷, because if the black father’s genotype is Aa, the probability of obtaining one black descendant is 1/2, given that the descendants of a cross between Aa and a brown (aa) mouse are expected to be Aa (black) and aa (brown) in equal proportion. Thus, the probability of obtaining seven black mice is (1/2)⁷ and L(Aa | y = 7 black descendants) = (1/2)⁷.

Now, there are two possibilities of having seven black descendants: either the individual is homozygous and had seven black descendants, or it is heterozygous and had seven black descendants. Thus, the probability of having seven black descendants is:

P(y = 7 black descendants) = Prob(AA and y = 7 black descendants) + Prob(Aa and y = 7 black descendants)

According to the laws of probability shown in Eq. [3],

Prob(y = 7 black descendants, and AA) = P(y = 7 black descendants | AA) · P(AA)
= L(AA | y = 7 black descendants) · P(AA) = 1 · 1/3 = 1/3

Prob(y = 7 black descendants, and Aa) = P(y = 7 black descendants | Aa) · P(Aa)
= L(Aa | y = 7 black descendants) · P(Aa) = (1/2)⁷ · (2/3) = 1/192

and thus,

P(y = 7 black descendants) = 1/3 + 1/192

Thus, the probability that our black mouse is homozygous (AA), given that seven descendants were black, is

P(AA | y = 7 black descendants) = (1 · 1/3) / (1/3 + 1/192)

and the probability that our black mouse is heterozygous (Aa), given that its seven descendants from the cross with a brown (aa) mouse were black, is

P(Aa | y = 7 black descendants) = (1/2)⁷ · (2/3) / (1/3 + 1/192)

This is the desired answer; a short numerical check is sketched below.
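The arithmetic can be verified with a few lines of code (a direct transcription of the calculation above; nothing here goes beyond the numbers already given):

```python
# Posterior probability that the black mouse is AA, given 7 black offspring.
prior_AA, prior_Aa = 1 / 3, 2 / 3
lik_AA = 1.0            # an AA father always gives black (Aa) offspring
lik_Aa = (1 / 2) ** 7   # an Aa father gives a black offspring with probability 1/2

p_data = lik_AA * prior_AA + lik_Aa * prior_Aa   # = 1/3 + 1/192
post_AA = lik_AA * prior_AA / p_data
post_Aa = lik_Aa * prior_Aa / p_data

print(post_AA)   # 0.9846... (= 64/65)
print(post_Aa)   # 0.0153... (= 1/65)
```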

Selection indexes and other related methods (BLUP or combined selection) use genetic parameters that are presumed to be known. This is a common practice in animal breeding and its use is not controversial. Let us consider the model

y = Z u + e

where y is the data vector, u the genetic values, e the residuals, and Z the incidence matrix. Our objective is to find the expression of the posterior density of the genetic values f(u|y). We suppose

u ∼ N(0, Aσ²u),  e ∼ N(0, Iσ²e)

thus,

f(u) = constant · exp{(−1/2σ²u) u′A⁻¹u},

L(u|y) = constant · exp{(−1/2σ²e) (y − Zu)′(y − Zu)}, and

f(u|y) ∝ L(u|y) · f(u) = constant · exp{(−1/2σ²e) [(y − Zu)′(y − Zu) + α u′A⁻¹u]}

where α = σ²e/σ²u. This posterior density f(u|y) has its maximum at

u = (Z′Z + A⁻¹α)⁻¹ Z′y



which is a well-known result of the theory of selection indexes; it is the best linear prediction of the genetic values. From a Bayesian point of view, because the posterior density is symmetric, this point estimator is at the same time the mean, median, and mode. Confidence intervals can be obtained from f(u|y) as we have seen before.
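A minimal numerical sketch of that result (the relationship matrix A, the records y, and the variance ratio α are all invented toy values):

```python
# Posterior mode of the genetic values: u = (Z'Z + A^{-1} * alpha)^{-1} Z'y.
# All inputs below are invented toy values.
import numpy as np

Z = np.eye(3)                                  # one record per animal
A = np.array([[1.0, 0.5, 0.25],
              [0.5, 1.0, 0.25],
              [0.25, 0.25, 1.0]])              # toy relationship matrix
y = np.array([1.2, -0.4, 0.7])                 # records expressed as deviations
alpha = 3.0                                    # sigma2_e / sigma2_u, assumed known

lhs = Z.T @ Z + np.linalg.inv(A) * alpha
u_hat = np.linalg.solve(lhs, Z.T @ y)
print(u_hat)   # mean = median = mode of the normal posterior f(u|y)
```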

In most cases, it is difficult to quantify the prior information pertaining to a parameter. For example, how should the prior information about the heritability of litter size be quantified? In these cases, non-Bayesian statisticians contend that it is not possible to apply Bayes theorem and that the problem has no solution if we want to find probabilities about the true value of the parameter. Within the Bayesian school, several solutions have been proposed.

For many cases, probability defined as a frequency is too restrictive a concept to help in solving the inference problem. Thus, expressing the probability as a subjective value between 0 and 1 that indicates the degree of belief of the researcher about the value of the parameter is preferred. Notice that subjective does not mean arbitrary. As Keynes (1921) stressed, this subjective probability should be well founded, and a scientist should be able to persuade other scientists about the value of this subjective probability. For example, let us consider the market price of a rabbit carcass. Before taking samples from different markets, we can easily agree that it will be less than 100 euro/kg and more than 0 euro/kg. We can be more precise and think that it is unlikely that the price will be more than 10 euro/kg or less than 1 euro/kg. Continuing in this way, we can assign probabilities to different prices until a probability distribution of general agreement is obtained. If not, we can compare the results given by two or three possible prior densities and base our decision on the one we believe in more. This is quite easy to do for univariate problems but becomes practically impossible for multivariate problems. For example, in estimating genetic parameters for three traits we have to determine the prior probability that the first trait has a heritability of 0.1, that the second trait has a heritability of 0.3, that the third one has a heritability of 0.3, that the genetic correlation of the first and second is −0.7, that of the first and third is 0.1, that of the second and third is 0.4, and so on. And this should be done for every possible case! Trying to determine all possibilities can lead the researcher to a psychological crisis; therefore, this way of working is not useful for many animal breeding problems.

With enough data, the prior probability has little influence on the final result (the posterior density). This argument is based on the practical fact that experiments are often performed when the information about the topic of the experiment is scarce; prior opinions are then vague, and in most cases it should not be too expensive to collect enough data to override the prior opinion. In these cases, prior opinion is not particularly well defined and it is usual to take prior density functions that facilitate computing the posterior density, avoiding paradoxes and inadmissible results. This argument was introduced in the literature by Jeffreys (1931) and has become popular given the difficulties of using prior information accurately (e.g., Blasco et al., 1998). Although it is somewhat deceptive to use a theory based on the harmonic use of prior information and information provided by the data, this solution is practically acceptable. Analyzing small experiments with expensive data remains problematic, because samples are necessarily small and prior information will affect the results.

Thus far, even though the procedure of associating degrees of belief with probabilities is controversial, no theoretical difficulties have appeared. There will be researchers who prefer not to establish the prior probability subjectively. However, even if this position is accepted, the resulting inference does not lead to paradoxes or contradictions. The problem arises when it is intended to represent prior ignorance. The first Bayesians (Bayes included) used a technique that Keynes (1921) called the “Principle of indifference” to describe a situation in which there is no prior information and all possible events have the same probability. To admit this principle implies that in the familiar example of a bag of white and black balls that is often used to teach probability, all possible drawings have the same prior probability. For continuous variables this means that the prior density is a line parallel to the x-axis in a determined interval, and because of that such priors are known as “flat priors.” This terrible postulate is the origin not only of philosophical, but also of mathematical, problems. For example, if we want to represent absolute ignorance about the value of the heritability and we say that all possible prior values have the same probability in the interval [0, 1], then P(h² < 0.5) = ¹⁄₂. But the event h² < 0.5 is the same as the event h⁴ < 0.25, and therefore their probability should be the same: P(h⁴ < 0.25) = ¹⁄₂. This shows that h⁴ is more likely to be close to 0 than to 1. Consequently, we would be saying that we have no information about h² but we do have information about h⁴, which is absurd. The final consequence is that, as Bernardo (1997) remarked, non-informative priors do not exist. There is no easy answer to this problem.

The admission of the validity of flat priors for representing ignorance is a popular solution, but it does not seem to be well founded. Flat priors are frequently called “non-informative priors.” This is not true, because all priors are informative, as we have seen: we would be maintaining that we have no information about h² while having information about h⁴. For example, if we admit that h¹⁶ has a flat prior between 0 and 1, this implies P(h¹⁶ > 0.5) = ¹⁄₂, and thus P(h² > 0.5^(1/8)) = P(h² > 0.92) = ¹⁄₂, which few people would find plausible. Box and Tiao (1973) sustain this rather unconvincing proposal. As Edwards (1992) remarks, ignorance about a parameter implies ignorance about any transformation of it. To say we ignore the prior values of a parameter is not the same as saying we think all of them have the same probability.
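The point is easy to verify by simulation (a quick numerical check, nothing more): drawing h² uniformly on [0, 1] and looking at the implied distribution of its powers shows that the claimed “ignorance” does not survive a change of scale.

```python
# If h2 is uniform on [0, 1], its powers are not: "flat" is informative on other scales.
import numpy as np

rng = np.random.default_rng(1)
h2 = rng.uniform(0.0, 1.0, size=1_000_000)

print(np.mean(h2 < 0.5))        # ~0.5, by construction
print(np.mean(h2**2 < 0.25))    # also ~0.5: same event, so same probability
print(np.mean(h2**2 < 0.5))     # ~0.71, not 0.5: h4 piles up near zero
print(np.mean(h2**8 > 0.5))     # ~0.08: a flat prior on h16 would force this to be 0.5
```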

Alternatively, ignorance may be represented with respect to the likelihood. The supporters of this solution think ignorance is only a scale problem and that, by changing the scale, prior densities can be found that do not change with reparametrization. First, we have to define what we mean by absence of information. From the Bayes theorem we see that all information coming from the data is contained in the likelihood, whereas the prior information is independent of the data. Prior ignorance should therefore be expressed in a scale in which the likelihood does not change. Jeffreys (1961) first proposed this solution, and it can be found fully developed in Box and Tiao (1973). Because it is not always possible to find such a scale, approximations may be required, and they work well with large samples (Box and Tiao say “moderate sample sizes” to avoid obvious criticism). However, when prior ignorance should be expressed about two parameters at the same time, such as a mean and a variance, these functions either do not exist or they lead to paradoxes. Thus, Jeffreys priors have not been applied successfully in animal breeding, despite being currently used in other fields of scientific inference. This is most likely because animal breeding problems are multivariate in one way or another (we usually have environmental effects, permanent or litter effects, multitrait problems, etc.).

Many times the suggested prior functions are not proper probabilities (they do not integrate to 1). For example, a flat prior gives the same probability values from −∞ to +∞. These functions can lead to posterior densities that do not integrate to 1 and thus are not probabilities, making the inference impossible. It is not always easy to detect an improper posterior density, particularly when it is estimated using MCMC methods such as Gibbs sampling. For example, Hoeschele and Tier (1995) found improper posteriors for the herd-year-season effect in a threshold model when they used an improper flat prior for this effect. Due to this difficulty, Hobert and Casella (1996) recommend always using proper prior densities, unless the improper prior being used belongs to a class of improper priors that leads to proper posterior probabilities. However, in using improper priors the rather philosophical problem of using a description of prior uncertainty that is not a probability still persists. Moreover, it is difficult, from a theoretical point of view, to sustain that prior ignorance depends on the true value of an unknown population parameter.

With prior ignorance, or in checking the robustness of results by comparing them with the case in which there is no prior information, Bernardo (1979) proposed calculating posterior densities that give maximum weight to the data. The advantage of this solution is that the posterior density is directly calculated, ensuring that it is a proper probability density. For these authors:

The problem of characterizing a “non-informative” or “objective” prior distribution, representing “prior ignorance”, “vague prior knowledge” and “letting the data speak for themselves” is far more complex than the apparent intuitive immediacy of these words would suggest . . . “vague” is itself much too vague an idea to be useful. There is no “objective” prior that represents ignorance . . . the reference prior component of the analysis is simply a mathematical tool.

Bernardo and Smith (1994)

The main problem with this solution is an operative one: all of these priors are difficult to calculate, particularly in the case of selected data, and they change with the model used. In the multivariate case, there is also a distinction between the parameter of interest and the nuisance parameters. Depending on the order in which the nuisance parameters are taken for conditioning, the reference prior obtained for the parameter of interest can be different (e.g., Robert, 1992). This is somewhat puzzling, because it means that the “minimum amount of information” changes with the order in which the analysis is conducted. Of course, if it is admitted that priors are “simply mathematical tools,” then the problem vanishes.

What can be done when there is no prior information? The easiest solution is to use several priors and observe their effect on the results. For example, in estimating the heritability of a new trait, the use of vague prior information around several values of h² may result in different posterior distributions. If the results are similar regardless of the prior distribution used, the information in the data has overwhelmed the prior information and the lack of prior information is of no consequence. Although the posterior distribution was obtained under the false hypothesis of the existence of prior information, this hypothesis is irrelevant to the results obtained.
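One way of organizing such a check is sketched below for a conjugate normal-mean toy problem rather than a genetic analysis, so that the posterior is available in closed form; all numbers are invented.

```python
# Sensitivity check: same data, several priors, compare the posteriors.
# Conjugate normal-mean toy problem with known residual variance.
import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(loc=10.0, scale=2.0, size=200)      # invented data
sigma2 = 4.0                                        # residual variance, taken as known
n, ybar = len(y), y.mean()

for prior_mean, prior_var in [(0.0, 100.0), (5.0, 1.0), (15.0, 1.0)]:
    # Normal prior on the mean gives a normal posterior with these moments:
    post_var = 1.0 / (n / sigma2 + 1.0 / prior_var)
    post_mean = post_var * (n * ybar / sigma2 + prior_mean / prior_var)
    print(f"prior N({prior_mean}, {prior_var}) -> posterior N({post_mean:.2f}, {post_var:.3f})")

# With 200 records the three posteriors are very similar: the data have
# overwhelmed the prior, which is the robustness described in the text.
```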

BLUP and Selection Indexes as Bayesian Estimators

Ronningen (1971) and Dempfle (1977) remarked that BLUP can be considered as a Bayesian estimator. Given the linear model [1]

y = Xb + Zu + e = Wt + e

where t′ = [b′ u′] and W = [X Z], the Bayesian objective consists of finding the function f(u|y) in order to be able to make inferences. In a Bayesian context there are no fixed effects; thus, selection indexes are still a particular case without the b effects. The objective is to estimate the vector t from the data y. First, determine f(t|y). Applying Bayes theorem [3] results in

f(t|y) = constant · f(y|t) f(t)

If f(y|t) = N(Wt, Iσ²e), then

f(y|t) ∝ exp[−(y − Wt)′(y − Wt)]   [4]

Suppose that the prior density of t is also normal. This may be reasonable for the genetic values, but it may not be reasonable for the environmental values. Then,



E(t′) = E[b′ u′] = [mb′ 0] = m, and

V = Var(t′) = Var[b′ u′] =
[ S  0 ]
[ 0  G ]

where mb and S are the prior mean and variance of the environmental effects. Also suppose that environmental and genetic values are uncorrelated, although this is not always true. For example, in dairy cattle it is well known that the “best” farms tend to buy the “best” semen. Avoiding this correlation between environmental and genetic effects was one of the reasons Henderson (1973) proposed considering farm effects as fixed. With small farms the herd effects are not well estimated, which has consequences for the estimation of genetic effects and creates a problem with no easy solution.

f(t) ∝ exp[−(t − m)′V⁻¹(t − m)]

f(t|y) ∝ f(y|t) f(t) ∝ exp{−[(y − Wt)′(y − Wt) + (t − m)′V⁻¹(t − m)]}

There are three Bayesian estimators that can be taken from the posterior distribution f(t|y): mean, median, and mode. Because this is a normal distribution, all of them coincide. We will calculate the mode, which is the maximum of the distribution. Differentiating with respect to t and equating to zero, we obtain the equation

[W′W + V⁻¹] t = W′y + V⁻¹m

or, in expanded form

[ X′X + S⁻¹   X′Z       ] [ b ]   [ X′y + S⁻¹mb ]
[ Z′X         Z′Z + G⁻¹ ] [ u ] = [ Z′y         ]

These are similar to the mixed-model equations [2], but here we include the mean of the environmental effects mb, which is not null, and the matrix S⁻¹, which for the environmental effects is the equivalent of the G⁻¹ matrix. If we apply the principle of indifference and the fixed effects can have any value in the interval ]−∞, +∞[, their variance a priori tends to infinity and consequently S⁻¹ tends to zero. If S⁻¹ is null, we obtain the mixed-model equations [2]. Therefore, BLUP is a Bayesian estimator (mode, median, and mean of a normal posterior distribution) constructed using flat priors for the environmental effects and a normal prior with zero mean and variance G for the genetic effects. This does not mean that BLUP is an optimal Bayesian estimator. It is not reasonable to suppose that the environmental effects can take any value with the same probability. Usually, prior information is available, at least for the maximum and minimum values that an environmental effect can take. Frequently, prior uncertainty about the possible values of an environmental effect could be expressed by defining a prior normal distribution. The only justification for flat priors is their use as reference priors and the fact that they facilitate calculation of the posterior density.
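A small numerical sketch of this limit (all matrices and records are invented, and the simplified variance scaling of the equations above is used as printed): building the equations with an informative prior on b and then letting S⁻¹ tend to zero reproduces the mixed-model equations.

```python
# Bayesian normal equations [W'W + V^{-1}] t = W'y + V^{-1} m, and the
# flat-prior limit S^{-1} -> 0 that gives the mixed-model (BLUP) equations.
# All inputs are invented toy values.
import numpy as np
from scipy.linalg import block_diag

X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])   # 2 "fixed" levels
Z = np.eye(4)                                                     # 4 animals
y = np.array([2.0, 1.0, 3.0, 2.5])
W = np.hstack([X, Z])

G_inv = np.linalg.inv(0.5 * np.eye(4))      # prior precision of u (toy G)
m_b = np.array([1.5, 2.5])                  # prior mean of the environmental effects
m = np.concatenate([m_b, np.zeros(4)])

def solve_t(S_inv):
    V_inv = block_diag(S_inv, G_inv)
    lhs = W.T @ W + V_inv
    rhs = W.T @ y + V_inv @ m
    return np.linalg.solve(lhs, rhs)

print(solve_t(np.eye(2) * 1.0))    # informative prior on b
print(solve_t(np.eye(2) * 1e-8))   # S^{-1} ~ 0: numerically the mixed-model solutions
```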

ML and REML as Bayesian Estimators

If any value of h² has the same prior probability7, f(h²) = constant (which is only admissible in the interval [0, 1]), then Eq. [3] becomes

f(h²|y) ∝ f(y|h²)

and thus the posterior distribution is calculated directly from the likelihood. This distribution is usually easy to calculate. Often the data are thought to be distributed either normally or according to another known function. Notice that, although the same function is used, the Bayesian inference is based on a completely different principle from ML. Here, the likelihood is used because it is proportional to f(h²|y), which is a distribution of probabilities, whereas frequentist inferences based on the likelihood result from the good properties that its maximum has in conceptually infinite repetitions of the experiment.
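A sketch of this idea on a grid, using a toy binomial likelihood instead of the normal likelihood of a genetic model (the data are invented): under a flat prior, normalizing the likelihood gives the posterior density directly.

```python
# Posterior obtained by normalizing the likelihood over a grid, under a flat
# prior on [0, 1]. Toy binomial likelihood with invented data.
import numpy as np

theta = np.linspace(0.001, 0.999, 999)
k, n = 7, 20                                    # invented data: 7 "successes" in 20 trials
log_lik = k * np.log(theta) + (n - k) * np.log(1 - theta)
lik = np.exp(log_lik - log_lik.max())           # rescale to avoid underflow

posterior = lik / np.trapz(lik, theta)          # flat prior: posterior proportional to likelihood
print(theta[np.argmax(posterior)])              # posterior mode, close to 7/20
print(np.trapz(theta * posterior, theta))       # posterior mean
```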

Harville (1974) was the first to realize that REML could be considered a Bayesian estimator. Taking the mixed model [1], Bayesian inferences are made from the complete posterior density f(b, u, σ²u, σ²e | y). The ML estimators of b, σ²u, and σ²e coincide with the mode of the marginal posterior density

f(b, σ²u, σ²e | y) = ∫ f(b, u, σ²u, σ²e | y) du

whereas the REML estimators of σ²u and σ²e coincide with the mode of the marginal posterior density

f(σ²u, σ²e | y) = ∫∫ f(b, u, σ²u, σ²e | y) db du

when the priors are assumed to be flat for b and normal for u, as in the case of BLUP. Note that there is no REML estimation of the “fixed” effects b, because the estimation is made in a space free of these effects.

The mode of the multivariate distribution f(b, σ²u, σ²e | y) should not be the same, in general, as the mode of the bivariate distribution f(σ²u, σ²e | y). Integrating out the environmental effects means considering all possible values of b, weighting them by their probabilities, and summing all these weighted values. This is a Bayesian justification for using REML instead of ML, and it is clearer than the frequentist one.

7Strictly speaking, all intervals of the same length have the same probability; the probability of a single value is zero, because there are infinitely many points in the interval [0, 1]. The same can be said for the flat priors of the previous section.



Notice also that, from a Bayesian point of view, the weighting argument can be continued. Thus, it can be proposed to estimate the variance components from the modes of the marginal posterior densities

f(σ²u | y) = ∫∫∫ f(b, u, σ²u, σ²e | y) db du dσ²e

f(σ²e | y) = ∫∫∫ f(b, u, σ²u, σ²e | y) db du dσ²u

This is what Gianola and Foulley (1990) proposed. Originally, their reason for taking the mode was a practical one: it could be calculated by finding the maximum of the marginal posterior densities, without integration. Now, modern MCMC techniques (Gibbs sampling, Metropolis-Hastings, etc.) make finding the mode unnecessary, because the whole posterior density can be inferred.
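As an illustration of the MCMC machinery only, the following is a toy Gibbs sampler for the mean and variance of a single normal sample (not the animal-model sampler of Appendix I), with a flat prior on the mean and a prior proportional to 1/σ² on the variance:

```python
# Toy Gibbs sampler for (mu, sigma2) of a normal sample, p(mu, sigma2) ∝ 1/sigma2.
# Full conditionals: mu | sigma2, y ~ N(ybar, sigma2/n)
#                    sigma2 | mu, y ~ Inv-Gamma(n/2, sum((y - mu)^2)/2)
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.5, size=50)     # invented data
n, ybar = len(y), y.mean()

n_iter, burn_in = 5000, 1000
mu, sigma2 = ybar, y.var()                      # starting values
samples = np.empty((n_iter, 2))
for i in range(n_iter):
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))
    ss = np.sum((y - mu) ** 2)
    sigma2 = 1.0 / rng.gamma(shape=n / 2.0, scale=2.0 / ss)   # inverse-gamma draw
    samples[i] = mu, sigma2

kept = samples[burn_in:]
print(kept.mean(axis=0))                        # posterior means of mu and sigma2
print(np.percentile(kept[:, 1], [2.5, 97.5]))   # 95% interval for sigma2, no approximation
```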

Bayesian Estimators of Genetic Values and Dispersion Parameters

Independent of any Bayesian interpretation of the methods used in animal breeding that were derived from a frequentist perspective, there is a Bayesian way of estimating breeding values and genetic parameters. We will maintain the notation of the mixed model [1], although there are no fixed effects in a Bayesian context; thus, here b is a random vector of the environmental effects. The most general problem is to find the posterior density f(b, u, σ²u, σ²e | y) and to make inferences from it or from the marginal densities f(b|y), f(u|y), f(σ²u|y), and f(σ²e|y). The likelihood and the prior distributions are needed.

f(b, u, σ²u, σ²e | y) = f(y | b, u, σ²u, σ²e) f(b, u, σ²u, σ²e) / f(y)

It is usually assumed that the priors for the dispersion parameters are mutually independent and independent of the priors for the genetic and environmental values. Moreover, because we do not know the joint prior distribution f(b, u), we will assume that the priors for b and u are independent. Thus,

f(b, u, σ²u, σ²e) = f(b) f(u) f(σ²u) f(σ²e)

In some cases b and u are not independent a priori, as in the example of the “best” dairy farms buying the “best” semen to generate replacements for their cows. However, f(b) can be assumed constant to circumvent the problem, as in the Bayesian interpretation of BLUP and REML, with similar advantages and disadvantages. If b and u are assumed independent, it seems appropriate to give informative prior values for b. This will often lead to a symmetric function (e.g., a normal distribution). In the case of very vague prior opinions, a flat prior may be used within the limits of an interval that fixes the maximum and minimum prior values for the effect. However, our prior opinion about the genetic determination of a trait is usually based on our prior opinion about the heritability of the trait and is thus affected by σ²u and σ²e at the same time, so it is not the best solution to suppose that both variance components are independent. One way of facilitating the expression of this prior opinion is to reparameterize the prior density and express it as a function of the heritability, as Sorensen (1999) proposed. Nevertheless, in an experiment it is expected that prior opinions will be vague and that most of the information will come from the data. If prior opinions about the parameters of interest are very sharp, then there is no reason to perform the experiment.

In the past, it was important to find prior functions that “conjugate” with the likelihood, in order to obtain known functions and simplify the problem of deriving inferences from a posterior density. The problem was that if the posterior density was not a function of some type that had been studied before, then it was complicated to derive inferences. For example, if the posterior density was normal, about 95% of its mass was known to lie within approximately 2 SD of its mean. However, if the posterior density was not a function of some known type, then it was difficult to find probability areas, because it was necessary to integrate the function within limits. Another difficulty was the calculation of f(y), for which multiple integrals had to be solved. Bayesian procedures were limited to the use of asymptotic properties and to calculating the mode of the distributions. Nowadays, the use of MCMC simulation allows deriving inferences without regard to the functional form of the posterior density. Nevertheless, conjugate functions are still used because they facilitate the numeric task of Gibbs sampling (i.e., sampling from conditional distributions). Thus, inverted chi-square functions are used as variance component priors instead of the reparameterization that allows expressing prior opinions about heritabilities. Inverted chi-square functions are used not only for mathematical convenience, but also because these functions depend on two parameters that allow the expression of almost any kind of prior opinion (see, for example, the graphs of Lee, 1997). Animal breeders tend to call these parameters the “degree of belief ν” and the “dispersion parameter S,” thinking that the first determines how sharp the density is and the second controls its dispersion, but this is incorrect. The shape of the function depends on both, and a prior inverted chi-square can be made sharper by changing S. Moreover, ν and S do not have to be natural numbers, because they do not represent degrees of freedom. They should be taken only as two parameters that describe the shape of a prior belief, and nothing else. The values of these parameters should not be established by appealing to external “objective” arguments, such as the degrees of freedom of the sampling distribution of the variance or other reasons not related to the real prior opinion of the researcher. These values are completely arbitrary insofar as they lead to a prior function that describes a prior state of opinion about the variance.
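A small sketch of how ν and S jointly determine the shape of such a prior, writing the scaled inverted chi-square density out directly so that no particular library parameterization is assumed:

```python
# Scaled inverted chi-square prior for a variance:
#   p(sigma2 | nu, S) proportional to sigma2^-(nu/2 + 1) * exp(-nu * S / (2 * sigma2))
# Both nu and S change the shape, not just the "sharpness" or the "dispersion".
import numpy as np

def scaled_inv_chi2_pdf(sigma2, nu, S):
    log_p = -(nu / 2 + 1) * np.log(sigma2) - nu * S / (2 * sigma2)
    p = np.exp(log_p - log_p.max())            # unnormalized density on the grid
    return p / np.trapz(p, sigma2)             # normalize numerically

sigma2 = np.linspace(0.01, 10.0, 2000)
for nu, S in [(4, 0.5), (4, 2.0), (10, 2.0)]:
    p = scaled_inv_chi2_pdf(sigma2, nu, S)
    mode = sigma2[np.argmax(p)]
    mass_below_1 = np.trapz(p[sigma2 < 1], sigma2[sigma2 < 1])
    print(f"nu={nu:>2}, S={S}: mode = {mode:.2f}, P(sigma2 < 1) = {mass_below_1:.2f}")
```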



Bayesian estimators of the breeding values based on the marginal density f(u|y) take into account the uncertainty about the variance components and the uncertainty about the environmental effects. All possible values of variance components and environmental effects are weighted according to their probability and considered in the estimation of the breeding values, because

f(u|y) = ∫∫∫ f(b, u, σ²u, σ²e | y) db dσ²u dσ²e

This is particularly useful for traits that are difficult or expensive to measure (for example, meat quality traits or traits that require surgical interventions), because in these cases the database may not be large enough to estimate the variance components with precision and the literature may be scarce. It is also useful when the database is large and the variance components are estimated using subsets of it.

Bayesian estimators of variance components based on the marginal densities of the variance components have the advantage of taking into account the uncertainty associated with the other parameters. When they are estimated with MCMC techniques, there is a supplementary advantage: the posterior densities of the heritability and the genetic correlations, f(h²|y) and f(rg|y), can be calculated immediately (see Appendix I).
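For instance, given joint MCMC samples of the two variance components (simulated stand-ins below), the posterior of h² is obtained simply by transforming each draw:

```python
# Posterior of h2 = sigma2_u / (sigma2_u + sigma2_e) from MCMC samples of the
# variance components. The "samples" below are simulated stand-ins.
import numpy as np

rng = np.random.default_rng(3)
sigma2_u = rng.gamma(shape=20.0, scale=0.01, size=10_000)   # stand-in for MCMC draws
sigma2_e = rng.gamma(shape=60.0, scale=0.01, size=10_000)

h2 = sigma2_u / (sigma2_u + sigma2_e)       # one h2 value per joint draw

print(h2.mean(), np.median(h2))             # posterior mean and median of h2
print(np.percentile(h2, [2.5, 97.5]))       # 95% credible interval, no Taylor approximation
```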

Discussion

Bayesian Methods and Scientific Inference

Until recently, it was thought that the main difference between the two schools of inference lay in the use of prior information. The frequentist school produces inferences based on the data and on prior knowledge of the distribution of the estimators in the sampling space. The problem of the frequentist school lies in the use of probabilities without using prior information, as its founders wanted (see Pearson, 1966). In this case, the distribution of the estimator is used for inferences instead of the distribution of the parameter, which leads to a rather unnatural form of expressing uncertainty about the results of an experiment. Fisher was perfectly conscious of the loss of clarity that this method implied, and wrote:

Bayes perceived the fundamental importance of this problem and framed an axiom, which, if its truth were granted, would suffice to bring this large class of inductive inferences within the domain of the theory of probability; so that, after a sample had been observed, statements about the population could be made, uncertain inferences, indeed, but having the well-defined type of uncertainty characteristic of statements of probability

Fisher (1936)

The Bayesian school makes inferences from probabilities associated with values of the parameter of interest, which is a more natural way of expressing uncertainty. The main problem of this school is to be sure that these probabilities are indeed probabilities, because the prior information used to derive them can be nonexistent or difficult to quantify precisely.

Frequentist and Bayesian methods produce different answers to the problem of induction8. Basically, the principle of induction consists of admitting that it is possible to make inferences about the population parameters from samples. The problem of induction appears when this possibility is denied, even with some degree of probability (Popper and Miller, 1983). The solution proposed by Hume (1738) was psychological: when I observe the repetition of an event (for example, Friesian cows produce more than Jersey cows), I expect that in the future the behavior of the event will be the same. This is an unsatisfactory solution. From the early attempts of Kant (1781) until the more modern results of Russell (1948) and Popper (1972), there has been a search for a rational justification for believing that causation can be associated with repeated phenomena and that general conclusions can be derived from samples. Russell (1948) proposes five postulates that, if admitted (and this is arbitrary), would justify inductive inference. Popper (1972) simply helps in a better demarcation of the problem without bringing anything new to bear. It is currently admitted that the induction problem does not have a logical-deductive solution, and research is focused on examining the “good reasons” for believing in the induction principle, and the nature of these reasons (Bird, 1998). Frequentist statisticians, such as Neyman and Pearson, preferred to speak about “the scientist’s behavior” instead of induction. However, Fisher believed that we could associate probabilities with events, and he and Sir Harold Jeffreys remarked on the principal differences between both schools facing this problem (see Lane, 1980, for a summary of the discussion). The main problem of the frequentist solution is that inference is associated with distributional assumptions and often with unknown parameters of the population. The main problem of the Bayesian school is that probabilities often reflect states of belief instead of “objective” probabilities.

Nevertheless, a good way of understanding the induction problem is to look at the Bayes theorem. In order to associate a probability with an event, it is necessary to know its a priori probability.

8The problem of induction was presented in its modern form by the Scottish philosopher Hume, although, as Copleston (1985) remarks, the essence of the problem (the difficulty of associating effects with causes) was outlined at least from the second century. Kant looked for a rational justification of the principle in prior knowledge. But the prior knowledge of Kant only has its name in common with the Bayesian prior knowledge. Kant says: “It is one of the first and most important questions to know whether there is some knowledge independent of experience. Call this knowledge ‘a priori’ and distinguish it from the empirical knowledge, whose sources are ‘a posteriori’, based in the experience” (Kant, 1983). Of course, no Bayesian scientist would use prior knowledge as something not based in previous experiences. Although there is a Bayesian statistician who quotes Kant (Robert, 1992), I do not find any relationship between Kant’s philosophy and the Bayesian paradigm.



For example, to know how Spanish Landrace performs for litter size, a sample of 20 sows can be evaluated for litter size in their first parity. Say an average of five piglets born is observed. This seems a very unlikely outcome a priori. Because no external a priori probability is available, previous beliefs are used to express this a priori knowledge. These beliefs are not arbitrary, but are based on knowledge of swine performance and on the performance of Landrace in other countries. Notice that these beliefs can be easily shared by many other colleagues, and they should not represent any problem in the analysis, insofar as they are vague (if they are not, there is no reason to perform the experiment). Frequentist scientists also use subjective beliefs in discussing results and comparing them with other published results. If a frequentist scientist finds a very unlikely value for a parameter, then the tone of the discussion will cast doubt on its validity, if it is published at all. Frequentists and Bayesians do not differ in the use of subjective prior information, but in how they use the information.

The prior density is constructed from a given set of assumptions (for example, that previous experiments are reliable, the samples are taken at random, the sampling space does not contain irregularities, etc.). Because the truthfulness of these assumptions cannot be known (and this is the heart of the induction problem), strictly speaking, inferences about the population parameters cannot be made. In the future, it may be discovered that the samples were not taken at random or that the sampling space was bimodal and only one region of it was sampled. In this case, the process should be restarted and inferences made based on the new state of knowledge. The induction problem does not make science impossible; theories or alternatives can still be evaluated based on reasonable sets of assumptions.

A different matter, which places Bayesians in an uncomfortable situation, is the impossibility of fixing these previous beliefs in the multivariate case. Due to this difficulty, the modern Bayesian tends to ignore prior information, considering the prior density as a mathematical artifact that allows him to make inferences based on probabilities. In a classic Bayesian paper, Lindley and Smith (1972) argued that “it is typically true that there is available prior information about the parameters and that this may be exploited to find improved, and sometimes substantially improved, estimates [with respect to the least squares ones].”

Compare this with Bernardo and Smith (1994), quoted at the end of the section on prior information. The problem in the multivariate case is that, because even flat priors are somewhat informative, how they affect the posterior distributions should be ascertained. However, subjective priors cannot be prepared as in the univariate case. One way of dealing with this difficulty is to try several priors constructed under different hypotheses (for example, independence of the parameters, dependence described from values taken from the literature, flat priors) and check whether the results are robust with respect to the change of prior information.

Bayesian Methods Applied

Until recently, the main criticism made of Bayesian methods was their practical sterility: they did not produce results. The methodology implies the resolution of complicated multidimensional integrals or the use of more or less accurate approximations. The recent use of MCMC techniques has solved most of these problems, although it has generated new ones. Most of these new problems are related to the convergence of the chains. Fortunately, these new problems are easier to handle, particularly when the distribution of the data is normal. The MCMC techniques provide random samples of the joint and marginal posterior distributions, and thus the mean, standard deviation, and confidence intervals can be directly estimated from the samples without the need of integration. New variables such as ratios, squares, and so on can be derived from the components of the joint posterior distribution samples obtained, and confidence intervals can be derived from these distributions without the need of any approximation.

There are many cases in animal breeding in which the frequentist approach gives an accurate and rapid answer, and Bayesian methods are not needed (for example, the estimation of breeding values by BLUP or indexes). The main problems of the frequentist techniques when they are applied to animal breeding are of two sorts. One type is derived from the use of large databases, and the other is related to the difficulties in obtaining accurate estimates of the standard errors in many complex situations. An example of the first type is the difficulty of obtaining multivariate REML estimates of the variance components when the database is large (e.g., in dairy cattle). An example of the second type is the problem of taking into account the error of estimation of the variance components in the prediction of breeding values. This is a major problem when the database is not large and references about variance components cannot be found in the literature, as often happens for meat quality traits or litter size components, for example. For large databases, MCMC techniques permit drawing posterior distributions of genetic parameters that may be easier to compute than the corresponding REML solutions. The Bayesian approach also has advantages in some complex situations.

The Bayesian way of attacking an estimation problem is always the same: to derive the posterior distribution of the parameters given the data. After calculating the likelihood and determining the prior distributions, MCMC techniques can be used to obtain samples of the marginal posterior distributions. New problems can be solved, in principle, just by defining them correctly. This is in some ways similar to ML techniques. Although many genetic programs have a massive amount of data, there are still experiments in which the database is limited by the cost or the available facilities.



Even when the amount of data is substantial, it can happen that some levels of the fixed effects contain only a small amount of information. An example of large data sets with lack of information is the case of heterogeneity of variance with respect to herds in dairy cattle. The model may incorrectly suppose the same additive and residual variances for all herds. However, particularly when herds are managed in many different environments, there may not be enough information to estimate the genetic variance within each herd (see Gianola et al., 1992, for a proposal of Bayesian estimation of heterogeneous variances in dairy cattle). Maximum likelihood methods have good frequentist properties, but only asymptotically. When the data are scarce, it is unknown whether these good properties are conserved. Posterior densities are more precise in these cases for describing the state of ignorance, and by examining different priors it is possible to determine how much the results are affected by the lack of information. Moreover, with few data, likelihoods are rather flat, making it difficult to find the absolute maximum, and a local maximum may be found instead. Posterior marginal densities take this imprecision into account and have the advantage of being univariate functions. Samples of these functions can be obtained by MCMC techniques, and their representation better illustrates the accuracy of the estimation. Finally, when the amount of data is small, REML estimates of dispersion parameters are not reliable. This may be a serious problem, because breeding values depend on them. Marginal posterior densities weight all possible values of the variance components by their probability when breeding values are estimated. If some estimates of variance components can be found in the literature, the frequentist way of estimating breeding values implies the use of a single value based on them (their average, for example). The Bayesian way weights every possible value by its posterior probability (e.g., see Blasco et al. [1998] for a prior weighting of estimates of variance components based on the literature and Piles et al. [2000b] for a comparison of meat quality traits between a selected and a control line).

Standard errors of the heritability, genetic correlations, or other functions of the variance components are usually obtained from Taylor series approximations (see, for example, Bulmer, 1985). Using modern MCMC techniques, no approximation is needed to draw the marginal posterior distribution of a function of the variance components or of other estimated parameters (an example is given in Appendix I). On other occasions, for example in nonlinear problems, the sampling distribution of the parameters is not known. Here again, MCMC techniques permit drawing the marginal posterior distribution of the parameters of interest, expressing the uncertainty about these parameters (e.g., see Mignon-Gastreau et al. [2000] and Piles et al. [2000a] for examples with nonlinear growth curves).

When two lactation curves or two growth curves are to be compared, the frequentist may use transformations or approximations (for example, working on a logarithmic scale or approximating nonlinear functions by linear ones using Taylor series). In order to apply these transformations, more hypotheses are needed; for example, if a logarithmic scale is used, the errors are assumed to be multiplicative instead of additive. Tests of hypotheses to compare curve parameters become difficult to apply and are often based on approximated likelihoods used for likelihood ratios. Moreover, the standard error of a parameter in the original scale cannot be properly obtained from the standard error calculated in the transformed scale. Bayesian solutions for these problems are straightforward and have been developed by Varona et al. (1997) in a general form, Varona et al. (1998) for lactation curves, and Piles et al. (2000a,b) for growth curves. Gianola et al. (1999) derived a general solution for nonlinear fitting of longitudinal data when the data are selected for other traits (for example, growth curves in populations selected for growth rate, or lactation curves in populations selected for milk, fat, and protein). These solutions always consist, again, of deriving the posterior distribution of the parameters given the data, and no transformations or linear approximations are then needed.

When a parameter is needed for estimating another one, for example, when variance components are needed to estimate breeding values, Bayesian techniques allow consideration of the error of estimation of the nested parameter (the variance, in this case). Inferences are made on the marginal posterior distributions, after integrating out the variance components (Sorensen et al., 1994), that is, after having considered all possible values of the variance components, weighted by their probability. Frequentist techniques are usually based on tests of robustness, which are complicated to perform in multivariate cases. There are other cases in which we can find more levels of hierarchy. For example, in nonlinear growth or lactation curves, the parameters of the curve may be affected by environmental and genetic effects. Here there are three levels of hierarchy: variance components are needed to estimate the effects, and the effects are needed to estimate the growth curve parameters (see Gianola et al., 1999; Varona et al., 1998; and Piles et al., 2000a,b).

There is a technique called data augmentation that simplifies the computation for the models used for inferences. For example, when we have censored data, it would be easier to use a complete normal distribution rather than a truncated one. In multivariate analyses with different design matrices (for example, growth traits and reproductive traits), it would be easier to solve a case in which all traits have the same design matrices. In these cases, the technique consists of augmenting the data vector by generating data to fill the gaps. If we call z the vector of generated data, inferences on the heritability (for example) are made using MCMC techniques to draw the posterior distribution P(h², z|y), because P(h², z|y) ∝ P(z, y|h²) P(h²) and it is easier to use MCMC techniques (to derive the conditionals, as explained in Appendix I) with the complete set of data (z, y).



After obtaining samples of P(h², z|y), the samples generated for z are simply ignored and only the samples for h² are considered, which yields a draw from the marginal posterior distribution P(h²|y), as shown in Appendix I (notice the similarity to the EM algorithm used for ML estimation). A detailed account of this procedure can be found in Tanner (1995). It has been used in a variety of problems, for example, for threshold models (Sorensen et al., 1995) or in QTL mapping (Satagopan et al., 1996).
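A sketch of the augmentation step alone, for right-censored normal records (the parameter values and the censoring point are invented; within a Gibbs sampler this draw would alternate with the draws of the model parameters):

```python
# Data augmentation for right-censored records: fill each censored observation
# with a draw from the normal model truncated to the censoring region.
# Toy values; inside a Gibbs sampler, mu and sigma would be the current draws.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
mu, sigma = 2.0, 1.0          # current values of the model parameters (invented)
censor_point = 2.5            # records above this value were recorded only as ">= 2.5"
n_censored = 4

# Inverse-CDF sampling from N(mu, sigma^2) truncated to (censor_point, +inf)
lower = norm.cdf(censor_point, loc=mu, scale=sigma)
u = rng.uniform(lower, 1.0, size=n_censored)
z = norm.ppf(u, loc=mu, scale=sigma)
print(z)    # augmented records; the rest of the sampler treats them as data
```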

A common need in QTL detection is to choose between models with no QTL, one QTL, two or more QTL, and so forth. All the advantages of the Bayesian approach for model choice hold here. Scheler et al. (1998) compared the frequentist and Bayesian approaches in closely related models and found the Bayesian estimates to be more accurate. Because Bayes factors are not easy to calculate, alternative MCMC techniques (such as “reversible jump”) for model choice have been proposed. Uimari and Hoeschele (1997) compared several approaches, and Sillanpaa and Arjas (1998, 1999) used reversible jump for mapping multiple QTL in a model in which the number of QTL is an unobserved random variable. New research is being done in this field to overcome the numerical difficulties. A tutorial about reversible jump can be found in Waagepetersen and Sorensen (2000).

Threshold traits are an example of three levels of hierarchy, where the underlying liabilities depend on fixed effects and breeding values, and these effects depend on the variance components to be estimated. The first work using Bayesian techniques in animal breeding, by Gianola and Foulley (1982), was precisely on threshold traits, and it proposed calculating the mode of a joint posterior density. Now, with MCMC techniques, marginal posterior distributions can be drawn. Bayesian techniques also allow the joint analysis of threshold and continuous traits (Janss and Foulley, 1993). A detailed description of how to arrive at the marginal posterior distributions using MCMC techniques can be found in Sorensen et al. (1995) and Sorensen (1999).

Survival analyses require complex proportional hazards models (Weibull and Cox models). These models can be extended to include genetic effects; mixed survival models are called “frailty” models. There are frequentist approaches to these models using the EM algorithm, and MCMC techniques have also been derived to estimate the posterior distributions, but they are computationally difficult. A Bayesian solution proposed by Ducroq and Casella (1996), using an approximate method common in Bayesian analyses (Laplacian integration), results in a more straightforward solution.

Consider one of the main problems of dairy cattle genetic evaluation: undeclared preferential treatment. Some farmers provide more feed, for example, to the cows obtained from good sires, but they do not declare this. Because the normal distribution has thin tails (i.e., individuals placed one, two, or three standard deviations above the mean become much rarer than in the case of other distributions), this preferential treatment can have a substantial effect on the evaluation of sires. Other distributions, such as Student’s t, have thicker tails and this overvaluation has a much smaller effect; the analysis becomes more “robust.” The “thickness” of the tail depends on the degrees of freedom, which is a parameter to be estimated. Bayesian techniques permit handling problems related to robustness by using t-distributions (Stranden and Gianola, 1999). The inconvenience is that multivariate t-distributions do not have the nice properties of the multinormal world (e.g., a null covariance no longer implies independence, the regression of one variable on another is no longer linear, etc.), and the analyses become more complicated.

Conclusions

If the animal breeder is not interested in the philosophical problems associated with induction, but in tools to solve problems, both the Bayesian and the frequentist schools of inference are well established and it is not necessary to justify why one or the other school is preferred. Neither of them now has operational difficulties, with the exception of some complex cases. There is software available to analyze a large variety of problems from both points of view. The choice of one school or the other should be related to whether there are solutions in one school that the other does not offer, to how easily the problems are solved, and to how comfortable the scientist feels with the particular way of expressing the results. Both schools present formal problems and paradoxes, although problems and paradoxes associated with more familiar methods are often better tolerated. For example, a REML analysis with few data may be more easily accepted than a Bayesian analysis in which the prior information affects the results. Statistical refinements should not lead the scientist to forget the necessity of having sufficient experimental information to achieve an appropriate level of precision in the expression of the results. The anguish of the scientist searching for exact results instead of probable inferences is not new. We can find in the Georgics of Virgil, in an agricultural context, the exclamation “Felix qui potuit rerum cognoscere causas”: happy the man who can know the causes of things!

Implications

If the animal breeder is not interested in the philosophical problems associated with induction, but in tools to solve problems, both the Bayesian and the frequentist schools of inference are well established and it is not necessary to justify why one or the other school is preferred. Neither of them now has operational difficulties, with the exception of some complex cases. There is software available to analyze a large variety of problems from both points of view. The choice of one school or the other should be related to whether there are solutions in one school that the other does not offer, to how easily the problems are solved, and to how comfortable the scientist feels with the particular way of expressing the results.

Literature Cited

Barnett, V. 1999. Comparative Statistical Inference. 3rd ed. John Wiley & Sons, Chichester, U.K.
Bernardo, J. M. 1979. Reference posterior distributions for Bayesian inference. J. R. Stat. Soc. B 41:113–147.
Bernardo, J. M. 1997. Noninformative priors do not exist: A discussion. J. Stat. Planning Inf. 65:159–189.
Bernardo, J. M., and A. F. M. Smith. 1994. Bayesian Theory. John Wiley & Sons, Chichester, U.K.
Bernouilli, D. 1778. Dijudicatio maxime probabilis plurium observationum discrepantium atque verisimillima inductio inde formanda. Acta Acad. Scien. Imp. Petropolitanae. pp 3–23. Reprinted in translation in Kendall, 1961.
Bird, A. 1998. Philosophy of Science. UCL Press, London.
Blasco, A., D. Sorensen, and J. P. Bidanel. 1998. A Bayesian analysis of genetic parameters and selection response for litter size components in pigs. Genetics 149:301–306.
Box, G. E. P., and G. C. Tiao. 1973. Bayesian Inference in Statistical Analysis. John Wiley & Sons, New York.
Bulmer, M. G. 1985. The Mathematical Theory of Quantitative Genetics. Clarendon Press, Oxford, U.K.
Cantet, R. J. C., R. L. Fernando, and D. Gianola. 1992. Bayesian inference about dispersion parameters of univariate mixed models with maternal effects: Theoretical considerations. Genet. Sel. Evol. 24:107–135.
Copleston, F. 1959. A History of Philosophy. Vol. V: Hobbes to Hume. Reprinted. p 286. Bantam Doubleday Dell Publishing, New York.
Dawid, D. 1976. Discussion of the paper of O. Barndorff-Nielsen "Plausibility inference." J. R. Stat. Soc. B 38:123–125.
Dempfle, L. 1977. Relation entre BLUP (Best Linear Unbiased Prediction) et estimateurs bayesiens. Ann. Genet. Sel. Anim. 9:27–32.
Ducrocq, V., and G. Casella. 1996. A Bayesian analysis of mixed survival models. Genet. Sel. Evol. 28:505–529.
Edwards, A. W. F. 1992. Likelihood. Cambridge University Press, Cambridge, U.K.
Efron, B. 1986. Why isn't everyone a Bayesian? Am. Stat. 40:1–11.
Fisher, R. A. 1912. On an absolute criterion for fitting frequency curves. Messeng. Math. 41:155–160.
Fisher, R. A. 1922. On the mathematical foundations of theoretical statistics. Phil. Trans. A 222:309–368.
Fisher, R. A. 1936. Uncertain inference. Proc. Am. Acad. Arts Sci. 71:245–258.
Fisher, R. A. 1956. Statistical Methods and Scientific Inference. Reprinted. pp 146–147. Oxford University Press, New York.
Gelman, A., J. Carlin, H. S. Stern, and D. B. Rubin. 1995. Bayesian Data Analysis. Chapman and Hall, London.
Gianola, D., and R. L. Fernando. 1986. Bayesian methods in animal breeding theory. J. Anim. Sci. 63:217–244.
Gianola, D., and J. L. Foulley. 1982. Non linear prediction of latent genetic liability with binary expression: An empirical Bayes approach. In: Proc. 2nd World Congr. Genet. Appl. Livest. Prod., Madrid, Spain. 7:293–303.
Gianola, D., and J. L. Foulley. 1990. Variance estimation from integrated likelihoods. Genet. Sel. Evol. 22:403–417.
Gianola, D., J. L. Foulley, R. L. Fernando, C. R. Henderson, and K. A. Weigel. 1992. Estimation of heterogeneous variances using empirical Bayes methods: Theoretical considerations. J. Dairy Sci. 75:2805–2823.
Gianola, D., S. Im, and F. W. Macedo. 1990. A framework for prediction of breeding value. In: D. Gianola and K. Hammond (ed.) Statistical Methods for Genetic Improvement of Livestock. Springer-Verlag, Berlin, Germany.
Gianola, D., M. Piles, and A. Blasco. 1999. Bayesian inference about parameters of a longitudinal trajectory when selection operates on a correlated trait. In: Proc. Int. Symp. Anim. Breed. Genet., Univ. Federal de Vicosa, Brazil. pp 101–132.
Gianola, D., S. Rodriguez-Zas, and G. E. Shook. 1994. The Gibbs sampler in the animal model: A primer. In: J. L. Foulley and M. Molenat (ed.) Seminaire Modele Animal. INRA Departement de Genetique Animale, La Colle sur Loup, France. pp 47–56.
Gilks, W. R., S. Richardson, and D. J. Spiegelhalter. 1996. Markov Chain Monte Carlo in Practice. Chapman & Hall, London.
Glymour, C. 1981. Why I am not a Bayesian. Reprinted in: D. Papineau (ed.) 1996. The Philosophy of Science. Oxford University Press, New York.
Hartley, H. O., and J. N. K. Rao. 1967. Maximum likelihood estimation for the mixed analysis of variance model. Biometrika 54:93–108.
Harville, D. 1974. Bayesian inference for variance components using only error contrasts. Biometrika 61:383–385.
Harville, D. 1977. Maximum likelihood approaches to variance component estimation and to related problems. J. Am. Stat. Assoc. 72:320–342.
Harville, D., and A. Carriquiry. 1992. Classical and Bayesian predictions as applied to an unbalanced mixed linear model. Biometrics 48:987–1003.
Hazel, L. N. 1943. The genetic basis for constructing selection indices. Genetics 38:476–490.
Henderson, C. R. 1949. Estimation of changes in herd environment. J. Dairy Sci. 32:706–715.
Henderson, C. R. 1950. Estimation of genetic parameters. Ann. Math. Stat. 21:309–310.
Henderson, C. R. 1963. Selection index and expected genetic advance. In: W. D. Hanson and H. F. Robinson (ed.) Statistical Genetics and Plant Breeding. pp 141–153. National Academy of Sciences-National Research Council, Washington, DC.
Henderson, C. R. 1973. Sire evaluation and genetic trends. In: Proc. Anim. Breed. and Genet. Symp. in Honor of Dr. J. L. Lush. Am. Soc. Anim. Sci., Champaign, IL. pp 10–41.
Henderson, C. R. 1984. Applications of Linear Models in Animal Breeding. University of Guelph, Ontario, Canada.
Hobert, J. P., and G. Casella. 1996. The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. J. Am. Stat. Assoc. 91:1461–1473.
Hoeschele, I., and B. Tier. 1995. Estimation of variance components of threshold characters by marginal posterior modes and means via Gibbs sampling. Genet. Sel. Evol. 27:519–540.
Hoeting, J. A., D. Madigan, A. E. Raftery, and C. T. Volinsky. 1999. Bayesian model averaging: A tutorial. Stat. Sci. 14:382–417.
Hume, D. 1738. Treatise of Human Nature. Reprinted. Orbis S.A., Madrid, Spain.
James, W., and C. Stein. 1961. Estimation with quadratic loss. In: Proc. 4th Berkeley Symp., University of California Press, Berkeley. pp 361–380.
Janss, L. L. G., and J. L. Foulley. 1993. Bivariate analysis for one continuous and one threshold dichotomous trait with unequal design matrices and an application to birth weight and calving difficulty. Livest. Prod. Sci. 33:183–198.
Jeffreys, H. 1931. Scientific Inference. Cambridge University Press, Cambridge, U.K.
Jeffreys, H. 1961. Theory of Probability. Oxford University Press, Oxford, U.K.
Kant, E. 1781. Critique of Pure Reason. p 98. Reprinted in translation. Orbis S.A., Madrid, Spain.
Kempthorne, O. 1984. Revisiting the past and anticipating the future. In: Statistics: An Appraisal. Proc. 50th Anniversary Iowa State Stat. Lab. The Iowa State University Press, Ames. pp 31–52.
Kendall, M. G. 1961. Daniel Bernoulli on maximum likelihood. Biometrika 48:1–8.
Keynes, J. M. 1921. A Treatise on Probability. Macmillan Publishing Co., London.
Lane, D. A. 1980. Fisher, Jeffreys, and the Nature of Probability. In: S. E. Fienberg and D. V. Hinkley (ed.) Lecture Notes in Statistics. pp 148–160. Springer-Verlag, Berlin, Germany.

Lavine, M., and J. Schervish. 1997. Bayes Factors: What they are and what they are not. Am. Stat.
Lee, P. M. 1997. Bayesian Statistics. pp 78–79. Arnold, London.
Lindley, D. V., and A. F. M. Smith. 1972. Bayes estimates for the linear model. J. R. Stat. Soc. B 34:1–41.
Mignon-Grasteau, S., M. Piles, L. Varona, H. Rochambeau, J. P. Poivey, A. Blasco, and C. Beaumont. 2000. Genetic analysis of growth curve parameters for male and female chickens resulting from selection based on juvenile and adult body weights simultaneously. J. Anim. Sci. 78:2515–2524.
Neyman, J., and E. Pearson. 1933. On the problem of the most efficient test of statistical hypotheses. Phil. Trans. R. Soc. 231A:289–337.
Patterson, H. D., and R. Thompson. 1971. Recovery of interblock information when block sizes are unequal. Biometrika 58:545–554.
Pearson, E. S. 1966. Some thoughts on statistical inference. In: The Selected Papers of E. S. Pearson. pp 276–283. Cambridge University Press, Cambridge, U.K.
Pearson, K. 1903. Mathematical contributions to the theory of evolution. XI. On the influence of natural selection on the variability and correlation of organs. Phil. Trans. R. Soc. 200A:1–66.
Piles, M., A. Blasco, D. Gianola, and L. Varona. 2000a. Bayesian inference about parameters of growth curves in rabbits when selection operates on growth rate. In: Proc. 51st Ann. Mtg. Europ. Assoc. Anim. Prod., The Hague, The Netherlands.
Piles, M., A. Blasco, and M. Pla. 2000b. The effect of selection for growth rate on carcass composition and meat characteristics of rabbit. Meat Sci. 54:347–355.
Popper, K. 1972. Objective Knowledge. Tecnos, Madrid, Spain.
Popper, K., and D. Miller. 1983. A proof of the impossibility of inductive probability. Nature (Lond.) 302:687–688.
Robert, C. P. 1992. L'Analyse Statistique Bayesienne. pp 109–111, 336. Economica, Paris, France.
Robert, C. P., and G. Casella. 1999. Monte Carlo Statistical Methods. Springer, New York.
Robinson, G. K. 1991. That BLUP is a good thing: The estimation of random effects. Stat. Sci. 6:15–51.
Ronningen, K. 1971. Some properties of the selection index derived by "Henderson's Mixed Model Method." Z. Tierz. Zuechtbiol. 83:186–193.
Russell, B. 1948. Human Knowledge: Its Scope and Limits. p 491. Orbis S.A., Madrid, Spain.
Satagopan, J. M., B. S. Yandell, M. A. Newton, and T. C. Osborn. 1996. A Bayesian approach to detect quantitative trait loci using Markov Chain Monte Carlo. Genetics 144:805–816.
Scheler, P., B. Mangin, B. Goffinet, P. LeRoy, D. Boichard, and J. M. Elsen. 1998. Properties of a Bayesian approach to detect QTL compared to the flanking markers regression method. J. Anim. Breed. Genet. 115:87–95.
Searle, S. R., G. Casella, and C. E. McCulloch. 1992. Variance Components. John Wiley & Sons, Chichester, U.K.
Sillanpaa, M. J., and E. Arjas. 1998. Bayesian mapping of multiple quantitative trait loci from incomplete inbred line cross data. Genetics 148:1373–1388.
Sillanpaa, M. J., and E. Arjas. 1999. Bayesian mapping of multiple quantitative trait loci from incomplete outbred line cross data. Genetics 151:1605–1619.
Smith, H. F. 1936. A discriminant function for plant selection. Ann. Eugen. 7:240–256.
Sorensen, D. 1999. Gibbs sampling in quantitative genetics. Natl. Inst. Anim. Sci. Internal Rep. No. 82. Tjele, Denmark.
Sorensen, D. A., C. S. Wang, J. Jensen, and D. Gianola. 1994. Bayesian analysis of genetic change due to selection using Gibbs sampling. Genet. Sel. Evol. 26:333–360.
Sorensen, D. A., S. Andersen, D. Gianola, and I. Korsgaard. 1995. Bayesian inference in threshold models using Gibbs sampling. Genet. Sel. Evol. 27:229–249.
Stigler, S. M. 1983. Who discovered Bayes's theorem? Am. Stat. 37:290–296.
Stigler, S. M. 1986. The History of Statistics: The Measurement of Uncertainty before 1900. Harvard University Press, Cambridge, MA.
Stranden, I. J., and D. Gianola. 1999. Mixed effects linear models with t-distributions for quantitative genetic analysis: A Bayesian approach. Genet. Sel. Evol. 31:25–42.
Stuart, A., and J. K. Ord. 1991. Kendall's Advanced Theory of Statistics. Vol. 2. Arnold, London.
Tanner, M. A. 1996. Tools for Statistical Inference. 3rd ed. Springer-Verlag, New York.
Thompson, R. 1987. Statistical methods and applications to animal breeding. Ph.D. dissertation. University of Edinburgh, U.K.
Thompson, W. A. 1962. The problem of negative estimates of variance components. Ann. Math. Stat. 33:273–289.
Uimari, P., and I. Hoeschele. 1997. Mapping linked quantitative trait loci using Bayesian analysis and Markov Chain Monte Carlo algorithms. Genetics 146:735–743.
Varona, L., C. Moreno, L. A. Garcia-Cortes, and J. Altarriba. 1997. Multiple trait analysis of underlying biological variables of production functions. Livest. Prod. Sci. 47:201–209.
Varona, L., C. Moreno, L. A. Garcia-Cortes, and J. Altarriba. 1998. Bayesian analysis of the Wood's lactation curve for Spanish dairy cows. J. Dairy Sci. 81:1469–1478.
Von Mises, R. 1928/1957. Probability, Statistics and Truth. Macmillan Publishing Co., London.
Waagepetersen, R., and D. Sorensen. 2000. A tutorial on Reversible Jump MCMC with a view toward applications in QTL mapping. Stat. Rev. (In press).
Wang, C. S., D. Gianola, D. A. Sorensen, J. Jensen, A. Christensen, and J. J. Rutledge. 1994. Response to selection for litter size in Danish Landrace pigs: A Bayesian analysis. Theor. Appl. Genet. 88:220–230.
Woolliams, J. A., and T. H. E. Meuwissen. 1993. Decision rules and variance of response in breeding schemes. Anim. Prod. 56:179–186.
Yates, F. 1990. Foreword. In: R. A. Fisher (ed.) Statistical Methods, Experimental Design and Scientific Inference. Oxford University Press, Oxford, U.K.

Appendix I. Gibbs sampling

Consider the model [1]

y = Xb + Zu + e

The objective is to obtain random samples from the marginal posterior distributions of all unknowns; i.e., from

f(b1|y), f(b2|y), ..., f(u1|y), f(u2|y), ..., f(σ²u|y), f(σ²e|y)

and also for any combination of unknowns, for example

f[σ²u/(σ²u + σ²e) | y] = f(h²|y)

These samples will be used for inferences. They can be used for drawing histograms that give an approximate idea of the shape of the posterior distribution; they provide estimates of the mean, mode, and median of the distribution; and they can also be used for estimating confidence intervals (called "credibility intervals" by Bayesians). To obtain these samples, first obtain random samples of the complete posterior distribution f(b, u, σ²u, σ²e|y). This leads to a matrix in which each row is one sample of every unknown, and each column is a set of randomly sampled points of the marginal posterior distribution of one unknown.

b11   b21   ...   u11   u21   ...   σ²u1   σ²e1
...   ...   ...   ...   ...   ...   ...    ...
b1i   b2i   ...   u1i   u2i   ...   σ²ui   σ²ei
...   ...   ...   ...   ...   ...   ...    ...

Thus, the set of values (b11, ..., b1i, ...) is a random sample of the posterior distribution f(b1|y) and will be used to make inferences about b1, and the same can be said about the other columns; the set (σ²u1, ..., σ²ui, ...) is a sample of the posterior distribution f(σ²u|y), and so on. To make inferences about the heritability, create a new column using the σ²u and σ²e of each row. Then the set

[σ²u1/(σ²u1 + σ²e1), ..., σ²ui/(σ²ui + σ²ei), ...]

is a random sample of the posterior distribution f(h²|y).
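As a minimal sketch of this calculation (the array names and the illustrative draws below are hypothetical; in practice the two columns would be the stored Gibbs samples of σ²u and σ²e after the burn-in), the new column and its posterior summaries could be obtained as follows:

import numpy as np

# Stand-in columns of sampled variance components, one value per Gibbs cycle.
# These made-up draws exist only to make the sketch runnable.
rng = np.random.default_rng(0)
sigma2_u = rng.chisquare(df=10, size=5000) * 0.05
sigma2_e = rng.chisquare(df=50, size=5000) * 0.02

# New column: one h2 value per sampled row of the matrix above
h2 = sigma2_u / (sigma2_u + sigma2_e)

# Posterior summaries from the marginal sample of f(h2|y)
post_mean = h2.mean()
post_median = np.median(h2)
ci_95 = np.percentile(h2, [2.5, 97.5])   # 95% credibility interval

print(f"mean = {post_mean:.3f}, median = {post_median:.3f}, "
      f"95% interval = [{ci_95[0]:.3f}, {ci_95[1]:.3f}]")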

It is not possible to directly obtain a random sample of the complete posterior distribution f(b, u, σ²u, σ²e|y), because this is a high-dimensional function, the product of several multidimensional functions; for example, if we take normal priors for b and u, and inverted chi-square priors for the variance components, the posterior distribution is a multidimensional function, a product of normal and inverted chi-square distributions, and it is divided by a constant f(y) that is virtually impossible to calculate. To reduce the dimensionality of the problem and make it possible to obtain random samples of the complete posterior distribution, Monte-Carlo Markov Chain (MCMC) techniques are used. The most used among them is called "Gibbs sampling" because originally it was used to sample the type of distributions called "Gibbs distributions." The procedure consists of sampling from conditional distributions in an iterative way, as follows. Write the conditional function

f(b1|b2, b3, ..., bi, ..., u1, u2, u3, ..., ui, ..., σ²u, σ²e, y),

then give arbitrary values to b2, b3, ..., bi, ..., u1, u2, u3, ..., ui, ..., σ²u, σ²e, and sample a random value for b1 from this conditional distribution. To do this, the conditional distribution should be of a type for which sampling algorithms are available; otherwise, other MCMC techniques (for example, Metropolis-Hastings) should be used to sample a random number from this conditional distribution. With the randomly sampled value for b1, and the other former arbitrary values, take a random sample of b2 from the conditional distribution

f(b2|b1, b3, ..., bi, ..., u1, u2, u3, ..., ui, ..., σ²u, σ²e, y)

With the sampled values for b1 and b2, and the other former values, take a random sample from the conditional distribution of b3, and proceed this way until a sample from the conditional distribution of each unknown has been taken. After this, begin again and repeat the cycle. After some cycles of iteration, the random numbers extracted from the conditionals are also random samples from the posterior distribution. These first iterations (the "burn-in") are disregarded.
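As an illustration, and only as a sketch under assumptions not restated here (normal priors for b and u as mentioned above, with Var(u) = Aσ²u, where A is the covariance structure of u, and scaled inverted chi-square priors νuS²u/χ²(νu) and νeS²e/χ²(νe) for the variances), a standard derivation (see, e.g., Wang et al., 1994; Sorensen, 1999) gives full conditional distributions that are easy to sample from:

each element of b and of u, given all other unknowns and y, has a univariate normal conditional distribution;

σ²u | b, u, σ²e, y ~ (u'A⁻¹u + νuS²u)/χ²(q + νu), where q is the number of elements of u;

σ²e | b, u, σ²u, y ~ (e'e + νeS²e)/χ²(n + νe), where e = y − Xb − Zu and n is the number of records.

One Gibbs cycle then samples each of these conditionals in turn, exactly as described above.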

Consider a simple example in order to understand this intuitively: obtaining samples from a posterior distribution f(x,y) by sampling from the conditionals f(x|y) and f(y|x). Take an arbitrary value for x, say x = 0. Figure 4 shows f(x,y) represented as lines of equal probability (as level curves on a map). Our arbitrary value x = 0 permits sampling a random number from f(y|x = 0), which lies along the y-axis. Because the probability between a and b is higher than in other parts of f(y|x = 0), the number sampled will more probably be found between a and b than elsewhere in the conditional function. Suppose y = 2; now sample from f(x|y = 2). The line y = 2 cuts f(x,y) and gives an interval (c,d) in which the next sampled value will more probably be found, thus x will probably lie between c and d: say, x = 4. Now sample from f(y|x = 4); this number will probably lie in the interval (e,f). Observe the tendency to sample from the highest areas of probability more often than from the lowest areas. At the beginning, x = 0 and y = 2 were points of the posterior distribution, but they were not random extractions, and thus we were not interested in them. However, after many iterations, we will find more samples in the highest areas of probability than in the lowest areas, and thus we obtain random samples from the posterior distribution. This explains why the first points sampled should be discarded, and only after some cycles of iteration are the samples taken at random.
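To make this concrete, here is a minimal numerical sketch of the two-variable example, assuming (purely for illustration) that f(x,y) is a bivariate normal density with zero means, unit variances, and correlation 0.8, so that both conditionals are univariate normal distributions:

import numpy as np

rng = np.random.default_rng(1)
rho = 0.8                       # assumed correlation of the illustrative bivariate normal
n_iter, burn_in = 20_000, 1_000
x, y = 0.0, 0.0                 # arbitrary starting values, as in the text

draws = np.empty((n_iter, 2))
for i in range(n_iter):
    # f(y | x): normal with mean rho*x and variance 1 - rho**2
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    # f(x | y): normal with mean rho*y and variance 1 - rho**2
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    draws[i] = (x, y)

kept = draws[burn_in:]            # discard the burn-in iterations
print(kept.mean(axis=0))          # close to (0, 0)
print(np.corrcoef(kept.T)[0, 1])  # close to 0.8

After discarding the burn-in, the retained pairs behave as draws from f(x,y), so their sample means and correlation approach the values of the assumed distribution.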

The main problems associated with this procedure are the following.

First, strictly speaking, it cannot be demonstrated that we are finally sampling from a posterior distribution. A Markov chain must be irreducible to converge to a posterior distribution. Although it can be demonstrated that some chains are not irreducible, there is no general procedure to ensure irreducibility.

Second, even in the case in which the chain is irreducible, it is not known when sampling from the posterior distribution begins. By using several chains, a point is reached at which the variability among chains can be attributed to Monte-Carlo sampling error, which supports the belief that samples are being drawn from the posterior distribution. There are tests to check whether this is the situation; good practical textbooks are Gelman et al. (1995), Gilks et al. (1996), and Robert and Casella (1999).
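One widely used check of this kind compares the variability among several chains with the variability within them (the potential scale reduction factor described in Gelman et al., 1995). A minimal sketch for one parameter, assuming the chains are stored as the rows of an array, is:

import numpy as np

def potential_scale_reduction(chains):
    """Gelman-Rubin R-hat for an (m chains x n iterations) array of one parameter."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_plus = (n - 1) / n * W + B / n       # pooled variance estimate
    return np.sqrt(var_plus / W)             # values near 1 suggest convergence

# Illustrative use with fabricated chains; real chains would come from the sampler.
rng = np.random.default_rng(2)
fake_chains = rng.normal(size=(4, 2000))
print(potential_scale_reduction(fake_chains))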

Third, even with an irreducible chain and when the tests indicate convergence, the chain may not have reached its stationary distribution. Sometimes there are long sequences of samples that give the impression of stability, and after many iterations the chains move to another area of stability.

The above problems are not trivial, and they occupy a part of the research in MCMC methods. Practically speaking, what people do is to launch several chains with different starting values and to observe their behavior. No pathologies are expected for a large set of problems (for example, when using multivariate distributions), but some more complicated models (for example, threshold models with environmental effects in which no positives are observed in some level of one of the effects) should be examined with care.

Figure 4. Posterior distribution and conditionals.

It should be noted that these difficulties are similar to those of finding a global maximum of a multivariate likelihood with several fixed and random effects. With a large database, second-derivative algorithms cannot be used, and thus there is a formal uncertainty about whether the maximum found is global or local. Here again, people use several starting values and examine the behavior of their results. Maximum likelihood methods present additional estimation problems when the model becomes more complex. Thus, the difficulties of MCMC methods should not be considered a major obstacle to using Bayesian techniques; they are at least no more difficult than the ones found when using advanced frequentist techniques, for which MCMC methods are sometimes also used.