BGLR: A Statistical Package for Whole-Genome Regression

Paulino Pérez Rodríguez
Socio Economía Estadística e Informática,
Colegio de Postgraduados, México
[email protected]

Gustavo de los Campos
Department of Biostatistics, Section on Statistical Genetics,
University of Alabama at Birmingham, USA
[email protected]

Abstract

Many modern genomic data analysis problems require implementing regressions where the number of unknowns (p, e.g., the number of marker effects) vastly exceeds sample size (n). Implementing these large-p-with-small-n regressions poses several statistical and computational challenges. Some of these challenges can be confronted using Bayesian methods, and the Bayesian approach allows integrating various parametric and non-parametric shrinkage and variable selection procedures in a unified and consistent manner. The BGLR R-package implements a large collection of Bayesian regression models, including various parametric regressions where regression coefficients are allowed to have different types of prior densities (flat, normal, scaled-t, double-exponential and various finite mixtures of the spike-slab family) and semi-parametric methods (Bayesian reproducing kernel Hilbert spaces regressions, RKHS). The software was originally developed as an extension of the BLR package and with a focus on genomic applications; however, the methods implemented are useful for many non-genomic applications as well. The response can be continuous (censored or not) or categorical (either binary or ordinal). The algorithm is based on a Gibbs sampler with scalar updates, and the implementation takes advantage of efficient compiled C and Fortran routines. In this article we describe the methods implemented in BGLR, present examples of the use of the package and discuss practical issues emerging in real-data analysis.

Key words: Bayesian Methods, Regression, Whole Genome Regression, Whole Genome Prediction, Genome Wide Regression, Variable Selection, Shrinkage, Semi-parametric Regression, RKHS, R.

1 Introduction

Many modern statistical learning problems involve the analysis of highly dimensional data; this is particularly common in genetic studies where, for instance, phenotypes are regressed on large numbers of predictor variables (e.g., SNPs) concurrently. Implementing these large-p-with-small-n regressions poses several statistical and computational challenges, including how to confront the so-called ‘curse of dimensionality’ (Bellman, 1961) as well as the complexity of a genetic mechanism that can involve various types of interactions between alleles and with environmental factors. Recent developments in the areas of shrinkage estimation, both in the penalized and Bayesian regression frameworks, as well as in computational methods, have made the implementation of these large-p-with-small-n regressions feasible. Consequently, whole-genome-regression approaches (Meuwissen et al., 2001) are becoming increasingly popular for the analysis and prediction of complex traits in plants (e.g. Crossa et al., 2010), animals (e.g. Hayes et al., 2009, VanRaden et al., 2009) and humans (e.g. de los Campos et al., 2013, Makowsky et al., 2011, Vazquez et al., 2012, Yang et al., 2010).

In the last decade a large collection of parametric and non-parametric methods have been proposed, and empirical evidence has demonstrated that there is no single approach that performs best across data sets and traits. Indeed, the choice of the model depends on multiple factors such as the genetic architecture of the trait, marker density, sample size and the span of linkage disequilibrium (e.g., de los Campos et al., 2013). Although various software packages exist (BLR, Perez et al. 2010; rrBLUP, Endelman 2011; synbreed, Wimmer et al. 2012; GEMMA, Zhou and Stephens 2012), most implement only a few types of methods, and there is a need to integrate these methods in a unified statistical and computational framework. Motivated by this need we have developed the R (R Core Team, 2012) package BGLR (de los Campos and Perez, 2013). The package is available at CRAN and at the R-forge website https://r-forge.r-project.org/projects/bglr/.

Models. BGLR can be used with continuous (censored or not) and categorical traits (binary and ordinal). The user controls the prior assigned to effects, and this can be used to control the extent and type of shrinkage of estimates. For parametric linear regressions on covariates (e.g., genetic markers, non-genetic covariates) the user can choose among a variety of prior densities, from flat priors (the so-called ‘fixed effects’, a method that does not induce shrinkage of estimates) to priors that induce different types of shrinkage, including: Gaussian (Bayesian Ridge Regression, BRR), scaled-t (BayesA, Meuwissen et al., 2001), double-exponential (Bayesian LASSO, BL, Park and Casella, 2008), and two-component mixtures with a point of mass at zero and a slab that can be either Gaussian (BayesC, Habier et al., 2011) or scaled-t (BayesB, Habier et al., 2011). The BGLR package also implements Bayesian Reproducing Kernel Hilbert Spaces regressions (RKHS, Wahba, 1990) using Gaussian processes with arbitrary user-defined covariance structures. This class of models allows implementing semi-parametric regressions for various types of problems, including scatter-plot smoothing (e.g., smoothing splines, Wahba, 1990), spatial smoothing (Cressie, 1988), Genomic BLUP (VanRaden, 2008), non-parametric RKHS genomic regressions (de los Campos et al., 2010, Gianola et al., 2006, Gianola and van Kaam, 2008) and pedigree-BLUP (Henderson, 1975).

All the above-mentioned prior densities (e.g., Gaussian, double-exponential, scaled-t, finite mixtures) are indexed by regularization parameters that control the extent of shrinkage of estimates; rather than fixing them to some user-specified values, we treat them as random. Consequently, at a deeper level of the hierarchical model these regularization parameters are assigned prior densities.

Algorithms. In BGLR samples from the posterior density are drawn using a Gibbs sampler (Casella and George, 1992, Geman and Geman, 1984) with scalar updating. This approach is very flexible but computationally demanding. To confront the computational challenges emerging in Markov Chain Monte Carlo (MCMC) implementations we have adopted a strategy that combines:


(a) the use of built-in R functions for operations that can be vectorized with (b) customized compiled code (C and Fortran) developed to perform operations that cannot be vectorized. Thus, the kernel of our software is written in R, but the computationally demanding steps are carried out using customized routines written in C and Fortran. The implementation makes use of the BLAS routines daxpy and ddot. The computational performance of the algorithm can be greatly improved if R is linked against a tuned BLAS implementation with multi-thread support, for example OpenBLAS, ATLAS or Intel MKL.
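As a quick check of this, one can inspect from within R which BLAS library the session is linked against and time a BLAS-bound operation; the lines below are only a minimal sketch (sessionInfo() reports the BLAS/LAPACK paths in R 3.4.0 and later, and the matrix size used for timing is arbitrary):

## Which BLAS/LAPACK libraries is this R session linked against?
sessionInfo()
## Rough timing of a BLAS-bound operation; a tuned multi-threaded BLAS
## completes this noticeably faster than the reference BLAS
X <- matrix(rnorm(2000 * 2000), ncol = 2000)
system.time(tcrossprod(X))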

Ancillary functions and data sets. In addition to the main function (BGLR) the package comes with: (a) functions to read and write from the R-console *.ped and *.bed files (Purcell et al., 2007), (b) two publicly available data sets (see Section 4 for further details) and (c) various examples (type demo(package='BGLR') in the R-console).

In what remains of the article we discuss in detail the methods implemented (Section 2), the user interface (Section 3), and the data sets included in the BGLR package (Section 4). Application examples are given in Section 5. A small benchmark is given in Section 6. Finally, the article closes in Section 7 with a few concluding remarks.

2 Statistical Models and Algorithms

BGLR supports models for continuous (censored or not) and categorical (binary or ordinal multinomial) traits. We begin by considering the case of a continuous response without censoring; categorical and censored data are considered later on.

2.1 Conditional distribution of the data

For a continuous response (yi; i = 1, ..., n) the data equation is represented as yi = ηi + εi, where ηi is a linear predictor (the expected value of yi given predictors) and εi are independent normal model residuals with mean zero and variance wi²σε². Here, the wi's are user-defined weights (by default BGLR sets wi = 1 for all data-points) and σε² is a residual variance parameter. In matrix notation we have

y = η + ε,

where y = {y1, ..., yn}, η = {η1, ..., ηn} and ε = {ε1, ..., εn}.

The linear predictor represents the conditional expectation function, and it is structured as follows:

$$\eta = 1\mu + \sum_{j=1}^{J} X_j \beta_j + \sum_{l=1}^{L} u_l, \qquad (1)$$


where µ is an intercept, Xj are design matrices for predictors, Xj = {xijk}, βj = {βjk} are vectors of effects associated with the columns of Xj, and ul = {ul1, ..., uln} are vectors of random effects. The only element of the linear predictor included by default is the intercept; the other elements are user-specified. Collecting the above assumptions, we have the following likelihood:

$$p(y|\theta) = \prod_{i=1}^{n} N\!\left(y_i \,\Big|\, \mu + \sum_{j=1}^{J}\sum_{k=1}^{K_j} x_{ijk}\beta_{jk} + \sum_{l=1}^{L} u_{li},\; \sigma_\varepsilon^2 w_i^2\right),$$

where θ represents the collection of unknowns, including the intercept, regression coefficients, random effects and the residual variance.
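For concreteness, the following sketch simulates data from this model with one set of regression coefficients and one vector of random effects (dimensions, effect sizes and weights are arbitrary illustrative choices, not defaults of the package):

# Simulate y = 1*mu + X %*% beta + u + e with weighted residual variances
set.seed(1)
n <- 200; p <- 50
mu <- 10
X <- matrix(rnorm(n * p), ncol = p)        # covariates (e.g., markers)
beta <- rnorm(p, sd = 0.1)                 # regression coefficients
u <- rnorm(n, sd = 0.5)                    # a vector of random effects
w <- rep(1, n)                             # weights (BGLR default: w_i = 1)
varE <- 1                                  # residual variance
e <- rnorm(n, sd = sqrt(varE) * w)         # residuals with variance w_i^2 * varE
y <- mu + as.vector(X %*% beta) + u + e    # data equation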

2.2 Prior densities

The residual variance is assigned a scaled-inverse chi-square density, p(σε²) = χ⁻²(σε²|Sε, dfε), with degrees of freedom dfε (>0) and scale parameter Sε (>0), and the intercept (µ) is assigned a flat prior. In the parameterization used in BGLR, the prior expectation of the scaled-inverse chi-square density χ⁻²(·|S, df) is given by S/(df − 2).

Regression coefficients {βjk} can be assigned either un-informative (i.e., flat) or informative priors. Those coefficients assigned flat priors, the so-called ‘fixed’ effects, are estimated based solely on the information contained in the likelihood. For the coefficients assigned informative priors, the choice of the prior plays an important role in determining the type of shrinkage of estimates of effects induced. Figure 1 provides a graphical representation of the prior densities available in BGLR. The Gaussian prior induces shrinkage of estimates similar to that of Ridge Regression (RR, Hoerl and Kennard, 1970), where all effects are shrunk to a similar extent; we refer to this model as the Bayesian Ridge Regression (BRR). The scaled-t and double-exponential (DE) densities have higher mass at zero and thicker tails than the normal density, and they induce a type of shrinkage of estimates that is size-of-effect dependent (Gianola, 2013). The scaled-t density is the prior used in model BayesA (Meuwissen et al., 2001), and the DE or Laplace prior is the one used in the BL (Park and Casella, 2008). Finally, BGLR implements two finite-mixture priors: a mixture of a point of mass at zero and a Gaussian slab, a model usually referred to in the genomic selection (GS) literature as BayesC (Habier et al., 2011), and a mixture of a point of mass at zero and a scaled-t slab, a model known as BayesB (Meuwissen et al., 2001). By assigning a non-null prior probability for a marker effect to be equal to zero, the priors used in BayesB and BayesC can induce variable selection.
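Curves of the type displayed in Figure 1 can be reproduced with a few lines of R code; the sketch below plots three of these densities with null mean and unit variance (the BayesC mixture is omitted for brevity, and the unit-variance scaling factors follow from the standard variance formulas of the double-exponential and t densities):

# Prior densities with null mean and unit variance (compare with Figure 1)
b <- seq(-6, 6, length.out = 501)
dGauss <- dnorm(b)                               # Gaussian (BRR)
dDE <- (sqrt(2) / 2) * exp(-sqrt(2) * abs(b))    # double-exponential (BL)
s <- sqrt(3 / 5)                                 # scales a t(5) to unit variance
dScT <- dt(b / s, df = 5) / s                    # scaled-t with 5 df (BayesA)
plot(b, dGauss, type = "l", ylim = c(0, 0.8),
     xlab = expression(beta[j]), ylab = expression(p(beta[j])))
lines(b, dDE, col = 2)
lines(b, dScT, col = 4)
legend("topright", legend = c("Gaussian", "Double-Exponential", "Scaled-t (5 df)"),
       col = c(1, 2, 4), lty = 1, bty = "n")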

Hyper-parameters. Each of the prior distributions described above is indexed by one or more parameters that control the type and extent of shrinkage induced. We treat these regularization parameters as unknown; consequently, a prior is assigned to them. Table 1 lists, for each of the prior densities implemented, the corresponding set of hyper-parameters. Further details about how regularization parameters are inferred from the data are given in the Appendix.

Combining priors. Different priors can be specified for each of the elements of the linear predictor, {X1, X2, ..., XJ, u1, u2, ..., uL}, giving the user great flexibility in building models for data analysis; an example illustrating how to combine different priors in a model is given in Box 3a of Section 5.


Figure 1: Prior Densities of Regression Coefficients Implemented in BGLR (Gaussian, Double-Exponential, Scaled-t with 5 df, and BayesC with π = 0.25). All the densities displayed correspond to random variables with null mean and unit variance. [Plot not reproduced; axes: βj (x) and p(βj) (y).]


Gaussian Processes. The vectors of random effects ul are assigned multivariate-normal priors with mean equal to zero and covariance matrix Cov(ul) = Klσul², where Kl is an n × n symmetric positive semi-definite matrix and σul² is a variance parameter with prior density σul² ∼ χ⁻²(dfl, Sl). Special classes of models that can be implemented using these random effects include standard pedigree-regression models (Henderson, 1975), in which case Kl is a pedigree-derived covariance matrix; Genomic BLUP (VanRaden, 2008), in which case Kl may be a marker-derived relationship matrix; and models for spatial regressions (Cressie, 1988), in which case Kl may be a covariance matrix derived from spatial information. Illustrations of the inclusion of these Gaussian processes in models for data analysis are given in the examples of Section 5.

2.3 Algorithms

The R-package BGLR draws samples from the posterior density using a Gibbs sampler (Casella and George, 1992, Geman and Geman, 1984) with scalar updating. For computational convenience the scaled-t and DE densities are represented as infinite mixtures of scaled normal densities (Andrews and Mallows, 1974), and the finite-mixture priors are implemented using latent random Bernoulli variables linking effects to components of the mixtures.
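As an illustration of the scale-mixture representation, the following Monte Carlo sketch (an illustrative check, not code from the package; it uses the exponential mixing density on the variance given by Park and Casella, 2008, with an arbitrary value of λ) verifies that a normal effect whose variance is exponentially distributed has a double-exponential marginal density:

# Double-exponential prior as a scale mixture of normals (Monte Carlo check)
set.seed(2)
lambda <- 2
tau2 <- rexp(5e5, rate = lambda^2 / 2)          # mixing distribution of the variance
beta <- rnorm(5e5, mean = 0, sd = sqrt(tau2))   # conditionally normal effects
hist(beta, breaks = 200, freq = FALSE, main = "", xlab = expression(beta))
curve(0.5 * lambda * exp(-lambda * abs(x)), add = TRUE, col = 2, lwd = 2)  # DE density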


Table 1: Prior densities available for regression coefficients in the BGLR package.

Model (prior density)       Hyper-parameters                  Treatment in BGLR¹
Flat (FIXED)                Mean (µβ)                         µβ = 0
                            Variance (σβ²)                    σβ² = 1 × 10¹⁰
Gaussian (BRR)              Mean (µβ)                         µβ = 0
                            Variance (σβ²)                    σβ² ∼ χ⁻²
Scaled-t (BayesA)           Degrees of freedom (dfβ)          User-specified (default value, 5)
                            Scale (Sβ)                        Sβ ∼ Gamma
Double-Exponential (BL)     λ                                 Fixed, user-specified; or
                                                              λ² ∼ Gamma; or
                                                              λ/λmax ∼ Beta²
Gaussian Mixture (BayesC)   π (prop. of non-null effects)     π ∼ Beta
                            dfβ                               User-specified (default value, 5)
                            Sβ                                Sβ ∼ Gamma
Scaled-t Mixture (BayesB)   π (prop. of non-null effects)     π ∼ Beta
                            dfβ                               User-specified (default value, 5)
                            Sβ                                Sβ ∼ Gamma

¹ Further details are given in the Appendix. ² This approach is further discussed in de los Campos et al. (2009).


Categorical traits. The argument response_type is used to indicate BGLR whether the response should be regarded as continuous (the default) or ‘ordinal’. For continuous traits the response vector should be coercible to numeric; for ordinal traits the response can take on K possible (ordered) values yi ∈ {1, ..., K} (the case K = 2 corresponds to a binary outcome), and the response vector should be coercible to a factor. For categorical traits we use the probit link; here, the probability of each of the categories is linked to the linear predictor according to the following link function:

$$P(y_i = k) = \Phi(\gamma_k - \eta_i) - \Phi(\gamma_{k-1} - \eta_i),$$

where Φ(·) is the standard normal cumulative distribution function, ηi is the linear predictor, specified as described above, and the γk are threshold parameters, with γ0 = −∞, γk ≥ γk−1 and γK = ∞. The probit link is implemented using data augmentation (Tanner and Wong, 1987); this is done by introducing a latent variable (so-called liability) li = ηi + εi and a measurement model yi = k if γk−1 ≤ li ≤ γk. For identification purposes, the residual variance is set equal to one. At each iteration of the Gibbs sampler the un-observed liability scores are sampled from truncated normal densities; once the un-observed liability has been sampled, the Gibbs sampler proceeds as if li were observed (see Albert and Chib, 1993, for further details).
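As a worked example of this link function, the probabilities of K = 3 hypothetical categories can be computed for a given value of the linear predictor as follows (the thresholds are arbitrary illustrative values):

# Probit-link probabilities for one observation (hypothetical thresholds)
eta <- 0.3                             # linear predictor
gamma <- c(-Inf, -0.5, 0.7, Inf)       # gamma_0, gamma_1, gamma_2, gamma_3 (K = 3)
K <- length(gamma) - 1
probs <- pnorm(gamma[-1] - eta) - pnorm(gamma[-(K + 1)] - eta)
probs                                  # P(y = 1), P(y = 2), P(y = 3)
sum(probs)                             # probabilities sum to one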


Missing data. The response vector can contain missing values. Internally, at each iteration of the Gibbs sampler missing values are sampled from the corresponding fully-conditional density. Missing values in predictors are not allowed.

Censored data. Censored data in BGLR are described by a triplet {ai, yi, bi}; the elements of this triplet must satisfy ai < yi < bi. Here, yi is the observed response (e.g., a time-to-event variable, observable only in un-censored data-points and otherwise missing, NA) and ai and bi define lower and upper bounds for the response, respectively. Table 2 gives the configuration of the triplet for the different types of data-points. The triplets are provided to BGLR in the form of three vectors (y, a, b). The vectors a and b have NULL as default value; therefore, if only y is provided this is interpreted as a continuous trait without censoring. If a and b are provided together with y, the data are treated as censored. We treat censoring as a missing-data problem; the values of yi that are missing due to censoring are sampled from truncated normal densities that satisfy ai < yi < bi. Further details about models for censored data are given in the examples of Section 5.

Table 2: Configuration of the triplet used to describe censored data-points in BGLR.

Type of point        ai     yi    bi
Un-censored          NULL   yi    NULL
Right censored       ai     NA    Inf
Left censored        -Inf   NA    bi
Interval censored    ai     NA    bi

3 Software interface

The R-package BGLR (de los Campos and Perez, 2013) inherits part of its user interface from BLR (de los Campos and Perez, 2010). A detailed description of that package can be found in Perez et al. (2010); however, we have modified key elements of the user interface, and of the internal implementation, to provide the user more flexibility in building models for data analysis. All the arguments of the BGLR function have default values, except the vector of phenotypes. Therefore, the simplest call to the BGLR program is as follows:

Box 1a: Fitting an intercept model

library(BGLR)

y<-50+rnorm(100)

fm<-BGLR(y=y)

When the call fm<-BGLR(y=y) is made, BGLR fits an intercept model: a total of 1500 cycles of a Gibbs sampler are run, and the first 500 samples are discarded. As the Gibbs sampler collects samples, some are saved to the hard drive (only the most recent samples are retained in memory) in files with extension *.dat, and the running means required for computing estimates of the posterior means and of the posterior standard deviations are updated; by default a thinning of 5 is used, but this can be modified by the user through the thin argument of BGLR. Once the iteration process finishes, BGLR returns a list with estimated posterior means and several arguments used in the call.

Inputs

Box 1b displays a list of the main arguments of the BGLR function; a short description follows:

• y, a, b (y, coercible to either numeric or factor; a and b of type numeric) and response_type (character) are used to define the response.

• ETA (of type list) is used to specify the linear predictor. By default it is set to NULL, in which case only the intercept is included. Further details about the specification of this argument are given below.

• nIter, burnIn and thin (all of type integer) control the number of iterations of the sampler, the number of samples discarded and the thinning used to compute posterior means.

• saveAt (character) can be used to indicate BGLR where to store the samples and to provide a prefix to be pre-pended to the names of the files where samples are stored. By default samples are saved in the current working directory and no prefix is added to the file names.

• S0, df0, R2 (numeric) define the prior assigned to the residual variance: df0 defines the degrees of freedom and S0 the scale. If the scale is NULL, its value is chosen so that the prior mode of the residual variance matches the variance of phenotypes times 1-R2 (a sketch of this rule is given below; see the Appendix for further details).
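For illustration only, the rule just described can be re-computed by hand as in the sketch below, assuming that the mode of the scaled-inverse chi-square density χ⁻²(·|S, df) is S/(df + 2) under the parameterization of Section 2.2; the rule actually used by BGLR is documented in the Appendix:

# Hypothetical re-computation of the default prior scale of the residual variance
y <- 50 + rnorm(100)                  # toy phenotypes
df0 <- 5; R2 <- 0.5                   # default values shown in Box 1b
S0 <- var(y) * (1 - R2) * (df0 + 2)   # chosen so that the prior mode ...
S0 / (df0 + 2)                        # ... equals var(y) * (1 - R2)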

Box 1b: Partial list of arguments of the BGLR function

BGLR( y, a = NULL, b = NULL, response_type = "gaussian",

ETA = NULL,

nIter = 1500, burnIn = 500, thin = 5,

saveAt = "",

S0 = NULL, df0 = 5, R2 = 0.5,...

)

Return

The function BGLR returns a list with estimated posterior means and estimated posterior standard deviations; the parameters used to fit the model are also returned within the list. Box 1c shows the structure of the object returned after fitting the intercept model of Box 1a. The first element of the list (y) is the response vector used in the call to BGLR, and $whichNa gives the indices of the entries in y that were missing. These two elements are followed by several entries describing the call (omitted in Box 1c), which in turn are followed by the estimated posterior means and estimated posterior standard deviations of the linear predictor ($yHat and $SD.yHat), the intercept ($mu and $SD.mu) and the residual variance ($varE and $SD.varE). Finally, $fit gives a list with DIC and DIC-related statistics (Spiegelhalter et al., 2002).


Box 1c: Structure of the object returned by BGLR (after running the code in Box 1a)

str(fm)

List of 20

$ y : num [1:100] 50.4 48.2 48.5 50.5 50.2 ...

$ whichNa : int(0)

.

.

.

$ yHat : num [1:100] 49.7 49.7 49.7 49.7 49.7 ...

$ SD.yHat : num [1:100] 0.112 0.112 0.112 0.112 0.112 ...

$ mu : num 49.7

$ SD.mu : num 0.112

$ varE : num 1.11

$ SD.varE : num 0.152

$ fit :List of 4

..$ logLikAtPostMean: num -147

..$ postMeanLogLik : num -148

..$ pD : num 2.02

..$ DIC : num 298

-attr(*, "class")= chr "BGLR"

Output files

Box 1d shows an example of the files generated after executing the commands given in Box 1a. In this case samples of the intercept (mu.dat) and of the residual variance (varE.dat) were stored. These samples can be used to assess convergence and to estimate Monte Carlo error. The R-package coda (Plummer et al., 2006) provides several useful functions for the analysis of samples from Monte Carlo algorithms; a minimal example is sketched after Box 1d.

Box 1d: Files generated by BGLR (after running the code in Box 1a)

list.files()

[1] "mu.dat" "varE.dat"

plot(scan("varE.dat"),type="o")
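For instance, a minimal convergence check with coda might look as follows (a sketch, assuming the coda package is installed and that the files written by the run in Box 1a are in the working directory; with nIter = 1500, burnIn = 500 and thin = 5, the first 100 stored samples correspond to the burn-in period):

library(coda)
varE <- scan("varE.dat")
varE <- mcmc(varE[-(1:100)])   # discard samples collected during burn-in
plot(varE)                     # trace and density plots
effectiveSize(varE)            # effective sample size (Monte Carlo error)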

4 Datasets

The BGLR package comes with two genomic datasets involving phenotypes, markers, pedigree and other covariates.

Mice data set. This data set is from the Wellcome Trust (http://gscan.well.ox.ac.uk) and has been used for detection of Quantitative Trait Loci (QTL) by Valdar et al. (2006) and for whole-genome regression by Legarra et al. (2008), de los Campos et al. (2009) and Okut et al. (2011). The data set consists of genotypes and phenotypes of 1,814 mice. Several phenotypes are available in the data frame mice.pheno. Each mouse was genotyped at 10,346 SNPs. We removed SNPs with minor allele frequency (MAF) smaller than 0.05, and missing marker genotypes were imputed with the corresponding average genotype, calculated with estimates of allele frequencies derived from the same data (these steps are sketched below). In addition, an additive relationship matrix (mice.A) is provided; this was computed using the R-package pedigreemm (Bates and Vazquez, 2009, Vazquez et al., 2010).
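The following sketch illustrates, on a small simulated genotype matrix (hypothetical toy data, not the mice data), the type of quality-control steps just described, namely removal of SNPs with MAF < 0.05 and mean-imputation of missing genotypes:

# Toy 0/1/2 genotype matrix with some missing calls (hypothetical data)
set.seed(3)
X <- matrix(rbinom(1000 * 200, size = 2, prob = 0.3), nrow = 1000)
X[sample(length(X), 500)] <- NA
# 1. Remove SNPs with minor allele frequency smaller than 0.05
pHat <- colMeans(X, na.rm = TRUE) / 2     # estimated allele frequencies
maf <- pmin(pHat, 1 - pHat)               # minor allele frequencies
X <- X[, maf >= 0.05]
# 2. Impute missing genotypes with the per-marker average genotype
for (j in 1:ncol(X)) {
  xj <- X[, j]
  X[is.na(xj), j] <- mean(xj, na.rm = TRUE)
}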

Wheat data set. This data set is from CIMMYT's global wheat breeding program and comprises phenotypic, genotypic and pedigree information of 599 wheat lines. The data set was made publicly available by Crossa et al. (2010). Lines were evaluated for grain yield (average of two replicates) at four different environments; phenotypes (wheat.Y) were centered and standardized to unit variance within environment. Each of the lines was genotyped for 1,279 Diversity Array Technology (DArT) markers. At each marker two homozygous genotypes were possible and these were coded as 0/1. Marker genotypes are given in the object wheat.X. Finally, the matrix wheat.A provides pedigree relationships between lines computed from the pedigree (see Crossa et al. 2010 for further details). Box 2 illustrates how to load the mice and wheat data sets.

Box 2: Loading the mice and wheat data sets included in BGLR

library(BGLR)

data(mice)

data(wheat)

ls()

5 Application Examples

In this section we illustrate the use of BGLR with examples.

Fitting Models for Fixed and Random Effects for a Continuous Response

We illustrate how to fit models with various sets of predictors using the mice data set. Valdar et al. (2006) pointed out that the cage where mice were housed had an important effect on the physiological covariates, and Legarra et al. (2008) and de los Campos et al. (2009) used models that accounted for sex, litter size, cage, familial relationships and markers. One possible linear model that we can fit to some of the continuous traits available in the mice data set is as follows:

y = 1µ+X1β1 +X2β2 +X3β3 + ε,

where µ is an intercept, X1 is a design matrix for the effects of sex and litter size and β1 is the corresponding vector of effects, which will be treated as ‘fixed’; X2 is the design matrix for the effects of cage and β2 is the vector of cage effects, which will be treated as random (in the example of Box 3a we assign a Gaussian prior to these effects); X3 is the matrix of marker genotypes and β3 is the corresponding vector of marker effects to which, in the example below, we assign IID double-exponential priors.


Fitting the model. The code provided in Box 3a illustrates how to fit the model described above using BGLR. The first block of code, #1#, loads the data. In the second block of code we set the linear predictor. This is specified using a two-level list; each of the inner lists is used to specify one element of the linear predictor. We can specify the predictors to be included in each of the inner lists either by providing the design matrix or by using a formula; when the formula is used, the design matrix is created internally using the model.matrix() function of R. Finally, in the third block of code we fit the model by calling the BGLR() function.

Box 3a: Fitting a model to markers and non-genetic effects in BGLR

#1# Loading and preparing the input data

library(BGLR); data(mice);

Y<-mice.pheno; X<-mice.X; A=mice.A;

y<-Y$Obesity.BMI; y<-(y-mean(y))/sd(y)

#2# Setting the linear predictor

ETA<-list( list(~factor(GENDER)+factor(Litter),

data=Y,model=’FIXED’),

list(~factor(cage),data=Y, model=’BRR’),

list(X=X, model=’BL’)

)

#3# Fitting the model

fm<-BGLR(y=y,ETA=ETA, nIter=12000, burnIn=2000)

save(fm,file=’fm.rda’)

When BGLR begins to run, a message warns the user that hyper-parameters were not provided and that, consequently, they were set using built-in rules; further details about these rules are given in the Appendix.

Extracting results. Once the model has been fitted, one can extract from the list returned by BGLR the estimated posterior means and the estimated posterior standard deviations, as well as measures of model goodness of fit and of model complexity. Also, as BGLR runs, it saves samples of some of the parameters; these samples can be brought into the R-environment for posterior analysis. Box 3b illustrates how to extract from the returned object estimates of the posterior means and of the posterior standard deviations, and how to create trace and density plots.

The first block of code (#1#) in Box 3b shows how to extract estimated posterior means and posterior standard deviations of effects. In this case we extract those corresponding to the third element of the linear predictor (fm$ETA[[3]]), which corresponds to the markers, but the same could be done for any of the elements of the linear predictor. For models involving linear regressions, $b and $SD.b give the estimated posterior means and posterior standard deviations of effects. The second block of code (#2#) of Box 3b shows how to extract the estimated posterior mean of the linear predictor, and also how to compute the estimated posterior mean of particular elements of the linear predictor; in this case we illustrate with genomic values (gHat). The third block of code (#3#) illustrates how to extract DIC (Spiegelhalter et al., 2002) and related statistics; finally, the fourth block of code (#4#) shows how to retrieve samples from the posterior distribution and produce trace plots. The plots produced by the code in Box 3b are given in Figure 2.


Box 3b: Extracting results from a model fitted using BGLR (continues from Box 3a)

#1# Estimated Marker Effects & posterior SDs

bHat<- fm$ETA[[3]]$b

SD.bHat<- fm$ETA[[3]]$SD.b

plot(bHat^2, ylab=’Estimated Squared-Marker Effect’,

type=’o’,cex=.5,col=4,main=’Marker Effects’)

#2# Predictions

# Total prediction

yHat<-fm$yHat

tmp<-range(c(y,yHat))

plot(yHat~y,xlab=’Observed’,ylab=’Predicted’,col=2,

xlim=tmp,ylim=tmp); abline(a=0,b=1,col=4,lwd=2)

# Just the genomic part

gHat<-X%*%fm$ETA[[3]]$b

plot(gHat~y,xlab=’Phenotype’,

ylab=’Predicted Genomic Value’,col=2,

xlim=tmp,ylim=tmp); abline(a=0,b=1,col=4,lwd=2)

#3# Goodness of fit and related statistics

fm$fit

fm$varE # compare to var(y)

#4# Trace plots

list.files()

# Residual variance

varE<-scan(’varE.dat’)

plot(varE,type=’o’,col=2,cex=.5,ylab=expression(var[e]));

abline(h=fm$varE,col=4,lwd=2);

abline(v=fm$burnIn/fm$thin,col=4)

# lambda (regularization parameter of the Bayesian Lasso)

lambda<-scan(’ETA_3_lambda.dat’)

plot(lambda,type=’o’,col=2,cex=.5,ylab=expression(lambda));

abline(h=fm$ETA[[3]]$lambda,col=4,lwd=2);

abline(v=fm$burnIn/fm$thin,col=4)


Figure 2: Squared-Estimated Marker Effects (top-left), phenotype versus predicted genomic values (top-right), trace plot of the residual variance (lower-left) and trace plot of the regularization parameter of the Bayesian Lasso (lower-right).

Fitting a Pedigree+Markers ‘BLUP’ model using BGLR

In the following example we illustrate how to incorporate in the model Gaussian random effects with user-defined covariance structures. These types of random effects appear both in pedigree and in genomic models. The example presented here uses the wheat data set included with the package. In the example of Box 4a we include two random effects, one representing a regression on the pedigree, a ∼ N(0, Aσa²), where A is a pedigree-derived numerator relationship matrix, and one representing a linear regression on markers, g ∼ N(0, Gσg²), where G is a marker-derived genomic relationship matrix. The implementation of Gaussian processes in BGLR exploits the equivalence between these processes and random regressions on principal components (de los Campos et al., 2010, Janss et al., 2012). The user can implement an RKHS regression either by providing the covariance matrix (K) or its eigenvalue decomposition (see the example in Box 4a). When the covariance matrix is provided, the eigenvalue decomposition is computed internally.

Box 4a: Fitting a Pedigree + Markers regression using Gaussian Processes

#1# Loading and preparing the input data

library(BGLR); data(wheat);

Y<-wheat.Y; X<-wheat.X; A<-wheat.A;

y<-Y[,1]


#2# Computing the genomic relationship matrix

X<-scale(X,center=TRUE,scale=TRUE)

G<-tcrossprod(X)/ncol(X)

#3# Computing the eigen-value decomposition of G

EVD <-eigen(G)

#4# Setting the linear predictor

ETA<-list(list(K=A, model=’RKHS’),

list(V=EVD$vectors,d=EVD$values, model=’RKHS’)

)

#5# Fitting the model

fm<-BGLR(y=y,ETA=ETA, nIter=12000, burnIn=2000,saveAt=’PGBLUP_’)

save(fm,file=’fmPG_BLUP.rda’)

Box 4b shows how to extract estimates, predictions and samples from the fitted model. The first block of code (#1#) shows how to obtain the predictions. The second block of code shows how to extract some goodness-of-fit related statistics. The third block of code shows how to extract the posterior means of the variance components σa² and σg². Note that in order to obtain these estimates it is necessary to specify the component number (1 or 2); this is done by typing fm$ETA[[1]]$varU and fm$ETA[[2]]$varU, respectively. Finally, the fourth block of code shows how to produce the trace plots for σa², σg² and σε² (graphs not shown).

Box 4b: Extracting estimates, predictions and samples from Reproducing Kernel Hilbert Spaces regressions (continues from Box 4a)

#1# Predictions

# Total prediction

yHat<-fm$yHat

tmp<-range(c(y,yHat))

plot(yHat~y,xlab=’Observed’,ylab=’Predicted’,col=2,

xlim=tmp,ylim=tmp); abline(a=0,b=1,col=4,lwd=2)

#2# Goodness of fit and related statistics

fm$fit

fm$varE # compare to var(y)

#3# Variance components associated with the genomic and pedigree

# matrices

fm$ETA[[1]]$varU

fm$ETA[[2]]$varU

#4# Trace plots

list.files()

# Residual variance

varE<-scan(’PGBLUP_varE.dat’)

plot(varE,type=’o’,col=2,cex=.5);


#varA and varU

varA<-scan(’PGBLUP_ETA_1_varU.dat’)

plot(varA,type=’o’,col=2,cex=.5);

varU<-scan(’PGBLUP_ETA_2_varU.dat’)

plot(varU,type=’o’,col=2,cex=.5)

Reproducing Kernel Hilbert Spaces Regressions

Reproducing Kernel Hilbert Spaces regressions (RKHS) have been used for regression (e.g., smoothing splines, Wahba, 1990), spatial smoothing (e.g., kriging, Cressie, 1988) and classification problems (e.g., support vector machines, Vapnik, 1998). Gianola et al. (2006) proposed to use this approach for genomic prediction, and since then several methodological and applied articles have been published (de los Campos et al., 2009, 2010, Gianola and de los Campos, 2008, Gonzalez-Recio et al., 2008).

Single-Kernel Models. In RKHS the regression function is a linear combination of the basis functions provided by the reproducing kernel (RK); therefore, the choice of the RK constitutes one of the central elements of model specification. The RK is a function that maps pairs of points in input space onto the real line and must be positive semi-definite. For instance, if the information set is given by vectors of marker genotypes, the RK, K(xi, xi′), maps pairs of vectors of genotypes, {xi, xi′}, onto the real line and must satisfy $\sum_i \sum_{i'} \alpha_i \alpha_{i'} K(x_i, x_{i'}) \geq 0$ for any non-null sequence of coefficients αi. Following de los Campos et al. (2009) the Bayesian RKHS regression can be represented as follows:

$$\left\{\begin{array}{l} y = 1\mu + u + \varepsilon, \quad \text{with} \\ p(\mu, u, \varepsilon) \propto N(u|0, K\sigma_u^2)\, N(\varepsilon|0, I\sigma_\varepsilon^2) \end{array}\right. \qquad (2)$$

where K = {K(xi, xi′)} is an (n × n) matrix whose entries are the evaluations of the RK at pairs of points in input space. The structure of the model described by (2) is that of the standard Animal Model (Quaas and Pollak, 1980) with the pedigree-derived numerator relationship matrix (A) replaced by the kernel matrix (K). Box 5 features an example using a Gaussian kernel evaluated at the (average) squared-Euclidean distance between genotypes, that is:

$$K(x_i, x_{i'}) = \exp\left\{-h \times \frac{\sum_{k=1}^{p}(x_{ik} - x_{i'k})^2}{p}\right\}.$$

In the example genotypes were centered and standardized, but this is not strictly needed. The bandwidth parameter h controls how fast the covariance function drops as the distance between pairs of genotype vectors increases; this parameter plays an important role. In this example we have chosen the bandwidth parameter to be equal to 0.5; further discussion about this parameter is given in the next example.


Box 5: Fitting a Single Kernel Model in BGLR

#1# Loading and preparing the input data

library(BGLR); data(wheat);

Y<-wheat.Y; X<-wheat.X; n<-nrow(X); p<-ncol(X)

y<-Y[,1]

#2# Computing the distance matrix and then the kernel

X<-scale(X,center=TRUE,scale=TRUE)

D<-(as.matrix(dist(X,method=’euclidean’))^2)/p

h<-0.5

K<-exp(-h*D)

#3# Single Kernel Regression using BGLR

ETA<-list(list(K=K,model=’RKHS’))

fm<-BGLR(y=y,ETA=ETA,nIter=12000, burnIn=2000,saveAt=’RKHS_h=0.5_’)

Multi-Kernel Models. The bandwidth parameter of the Gaussian kernel can be chosen either using cross-validation (CV) or with Bayesian methods. The CV approach requires fitting models over a grid of values of h. The Bayesian approach estimates h, and all the other model unknowns, from the data concurrently. The fully Bayesian treatment, which consists of treating h as unknown, is computationally demanding because, any time h is updated, the RK needs to be re-computed. To overcome this problem de los Campos et al. (2010) proposed a multi-kernel approach (named Kernel Averaging, KA) consisting of: (a) defining a sequence of kernels based on a set of reasonable values of h, and (b) fitting a multi-kernel model with as many random effects as kernels in the sequence. The model has the following form:

$$\left\{\begin{array}{l} y = 1\mu + \sum_{l=1}^{L} u_l + \varepsilon, \quad \text{with} \\ p(\mu, u_1, ..., u_L, \varepsilon) \propto \prod_{l=1}^{L} N(u_l|0, K_l\sigma_{u_l}^2)\, N(\varepsilon|0, I\sigma_\varepsilon^2) \end{array}\right. \qquad (3)$$

where Kl is the RK evaluated at the l-th value of the bandwidth parameter in the sequence {h1, ..., hL}. It can be shown (e.g., de los Campos et al., 2010) that if variance components are known, the model of expression (3) is equivalent to a model with a single random effect whose distribution is N(u|0, K̄σu²), where K̄ is a weighted average of all the RKs used in (3), with weights proportional to the corresponding variance components (hence the name, Kernel Averaging).

Performing a grid search or implementing a multi-kernel model requires defining a reasonable range for h. One possibility is to choose as a focal point for that range a value of h that gives an RK similar to the one given by the G-matrix (which represents the kernel for an additive model). The entries of the distance matrix

$$D = \left\{ D_{ii'} = \frac{\sum_{k=1}^{p}(x_{ik} - x_{i'k})^2}{p} \right\}$$

can be calculated from the entries of the G-matrix

$$G = \left\{ G_{ii'} = \frac{\sum_{k=1}^{p} x_{ik} x_{i'k}}{p} \right\};$$

indeed, Dii′ = Gii + Gi′i′ − 2Gii′. What value of h makes exp{−h × Dii′} ≈ Gii′? Consider for instance a pair of full sibs in an outbred population; in this case E[Gii] = E[Gi′i′] = 1 and E[Gii′] = 0.5, therefore E[Dii′] = 1, and using h = 0.8 we get K(xi, xi′) = exp{−0.8} ≈ 0.4. Similarly, for a pair of half sibs we have E[Gii′] = 0.25, therefore E[Dii′] = 1.5 and K(xi, xi′) = exp{−0.8 × 1.5} ≈ 0.3, which gives values of the RK for that type of relatives close to the ones given by the G-matrix. In inbred populations smaller values of h will be needed because E[Gii] > 1. Box 6 illustrates how to fit a multi-kernel model using h = 0.5 × {1/5, 1, 5}. With this choice of values of the bandwidth parameter the model includes kernels that give correlations much smaller (h = 2.5), similar (h = 0.5), and much higher (h = 0.1) than the ones given by the G matrix. This is illustrated in Figure 3, which displays the entries of the 1st row of the kernel matrix evaluated at each of the values of the bandwidth parameter in the grid.
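These back-of-the-envelope values can be verified directly in R (h = 0.8 as above; 0.5 and 0.25 are the expected G-matrix entries for full and half sibs in an outbred population):

# Expected Gaussian-kernel entries at h = 0.8 for the two types of relatives
h <- 0.8
exp(-h * 1.0)    # full sibs:  E[D] = 1.0; compare with E[G] = 0.5
exp(-h * 1.5)    # half sibs:  E[D] = 1.5; compare with E[G] = 0.25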

Box 6: Fitting a RKHS Using a Multi-Kernel Methods (Kernel Averaging)

#1# Loading and preparing the input data

library(BGLR); data(wheat);

Y<-wheat.Y; X<-wheat.X; n<-nrow(X); p<-ncol(X)

y<-Y[,1]

#2# Computing D and then K

X<-scale(X,center=TRUE,scale=TRUE)

D<-(as.matrix(dist(X,method=’euclidean’))^2)/p

h<-0.5*c(1/5,1,5)

#3# Kernel Averaging using BGLR

ETA<-list(list(K=exp(-h[1]*D),model=’RKHS’),

list(K=exp(-h[2]*D),model=’RKHS’),

list(K=exp(-h[3]*D),model=’RKHS’))

fm<-BGLR(y=y,ETA=ETA,nIter=5000, burnIn=1000,saveAt=’RKHS_KA_’)

#4# Variance Components

fm$ETA[[1]]$varU ; fm$ETA[[2]]$varU; fm$ETA[[3]]$varU

Assessment of Prediction Accuracy

The simplest way of assessing prediction accuracy consists of partitioning the data set into two disjoint sets: one used for model training (TRN) and one used for testing (TST). Box 7 shows code that fits a G-BLUP model in a TRN-TST setting using the wheat data set. The code randomly assigns 100 individuals to the TST set. The variable tst is a vector that indicates which data-points belong to the TST data set; for these entries we put missing values in the phenotypic vector (see Box 7). Once the model is fitted, predictions for individuals in the TST set can be obtained by typing fm$yHat[tst] in the R command line. Figure 4 plots observed versus predicted phenotypes for individuals in the TRN and TST sets.


Figure 3: Entries of the 1st row of the (Gaussian) kernel matrix evaluated at three different values of the bandwidth parameter, h = 0.5 × {1/5, 1, 5}. [Plot not reproduced; axes: individual (x) and K(1,i) (y).]

Box 7: Assessment of Prediction Accuracy: Continuous Response

#1# Loading and preparing the input data

library(BGLR); data(wheat);

Y<-wheat.Y; X<-wheat.X; n<-nrow(X); p<-ncol(X)

y<-Y[,1]

#2# Creating a Testing set

yNA<-y

set.seed(123)

tst<-sample(1:n,size=100,replace=FALSE)

yNA[tst]<-NA

#3# Computing G

X<-scale(X,center=TRUE,scale=TRUE)

G<-tcrossprod(X)/p

#4# Fits the G-BLUP model

ETA<-list(list(K=G,model=’RKHS’))

fm<-BGLR(y=yNA,ETA=ETA,nIter=5000, burnIn=1000,saveAt=’RKHS_’)

plot(fm$yHat,y,xlab="Phenotype",

ylab="Pred. Gen. Value" ,cex=.8,bty="L")

points(x=y[tst],y=fm$yHat[tst],col=2,cex=.8,pch=19)

legend("topleft", legend=c("training","testing"),bty="n",

pch=c(1,19), col=c("black","red"))

#5# Assessment of correlation in TRN and TST data sets

cor(fm$yHat[tst],y[tst]) #TST

cor(fm$yHat[-tst],y[-tst]) #TRN


Figure 4: Estimated genetic values for training and testing sets. Predictions were derived using the G-BLUP model (see Box 7). [Plot not reproduced; axes: Phenotype (x) and Pred. Gen. Value (y); legend: training, testing.]

A cross-validation is simply a generalization of the TRN-TST evaluation presented in Box 7. For a K-fold cross-validation there are K TRN-TST partitions; in each fold, the individuals assigned to that particular fold are used for TST and the remaining individuals are used for TRN. A sketch of a K-fold cross-validation is given below.
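The following is a minimal sketch of a 5-fold cross-validation with the G-BLUP model of Box 7 (the number of folds, the seed, the saveAt prefixes and the number of iterations are arbitrary choices):

#1# Data and genomic relationship matrix (as in Box 7)
library(BGLR); data(wheat)
y <- wheat.Y[, 1]
X <- scale(wheat.X, center = TRUE, scale = TRUE)
G <- tcrossprod(X) / ncol(X)
ETA <- list(list(K = G, model = "RKHS"))
#2# 5-fold cross-validation
set.seed(123)
n <- length(y)
folds <- sample(rep(1:5, length.out = n))        # random fold assignment
corTST <- rep(NA, 5)
for (k in 1:5) {
  yNA <- y
  tst <- which(folds == k)
  yNA[tst] <- NA                                 # mask phenotypes of the k-th fold
  fm <- BGLR(y = yNA, ETA = ETA, nIter = 5000, burnIn = 1000,
             saveAt = paste0("CV_", k, "_"))
  corTST[k] <- cor(fm$yHat[tst], y[tst])         # predictive correlation in the fold
}
corTST; mean(corTST)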

5.1 Regression with Ordinal and Binary Traits

For categorical traits BGLR uses the probit link and the phenotype vector should be coercible to a factor. The type of response is defined by setting the argument response_type; by default this argument is set equal to ‘gaussian’. For binary and ordinal outcomes we should set response_type='ordinal'. Box 8 provides a simple example that uses the wheat data set with a discretized phenotype. The second block of code, #2#, presents the analysis of a binary outcome, and the third one, #3#, that of an ordinal trait. Figure 5 shows, for the binary outcome, a plot of predicted probability versus realized value in the TRN and TST data sets. The estimated posterior means and posterior standard deviations of marker effects and the posterior means of the linear predictor are retrieved as described before (e.g., fm$ETA[[1]]$b, fm$yHat). For continuous outcomes the posterior mean of the linear predictor is also the conditional expectation function. For binary outcomes, the conditional expectation is simply the success probability; therefore, in this case BGLR also returns the estimated probabilities of each of the categories (fm$probs).

Box 8: Fitting models with binary and ordinal responses

#1# Loading and preparing the input data

library(BGLR); data(wheat);

Y<-wheat.Y; X<-wheat.X; A<-wheat.A;

y<-Y[,1]

tst<-sample(1:nrow(X),size=150)

#2# Binary outcome

yBin<-ifelse(y>0,1,0)

yBinNA<-yBin ; yBinNA[tst]<-NA

ETA<-list(list(X=X,model=’BL’))

fmBin<-BGLR(y=yBinNA,response_type=’ordinal’, ETA=ETA,

nIter=1200,burnIn=200)

head(fmBin$probs)

par(mfrow=c(1,2))

boxplot(fmBin$probs[-tst,2]~yBin[-tst],main=’Training’,ylab=’Estimated prob.’)

boxplot(fmBin$probs[tst,2]~yBin[tst],main=’Testing’, ylab=’Estimated prob.’)

#3# Ordinal outcome

yOrd<-ifelse(y<quantile(y,1/4),1,ifelse(y<quantile(y,3/4),2,3))

yOrdNA<-yOrd ; yOrdNA[tst]<-NA

ETA<-list(list(X=X,model=’BL’))

fmOrd<-BGLR(y=yOrdNA,response_type=’ordinal’, ETA=ETA,

nIter=1200,burnIn=200)

head(fmOrd$probs)

5.2 Regression with Censored Outcomes

Box 9 illustrates how to fit a model to a censored trait. Note that in the case of a censored trait the response is specified using a triplet (ai, yi, bi) (see Table 2 for further details). For assessment of prediction accuracy (not done in Box 9, but sketched immediately after it), one can set ai = −∞, yi = NA, bi = ∞ for individuals in the testing set; this way no information about the ith phenotype is available for the model fit.


Figure 5: Estimated probability by category, versus observed category (binary response). [Plot not reproduced; two panels (Training, Testing); y-axis: Estimated prob.]

Box 9: Fitting censored traits

#1# Loading and preparing the input data

library(BGLR); data(wheat);

Y<-wheat.Y; X<-wheat.X; A<-wheat.A;

y<-Y[,1]

#censored

n<-length(y)

cen<-sample(1:n,size=200)

yCen<-y

yCen[cen]<-NA

a<-rep(NA,n)

b<-rep(NA,n)

a[cen]<-y[cen]-runif(min=0,max=1,n=200)

b[cen]<-Inf

#models

ETA<-list(list(X=X,model=’BL’))

fm<-BGLR(y=yCen,a=a,b=b,ETA=ETA,nIter=12000,burnIn=2000)

cor(y[cen],fm$yHat[cen])
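Following the suggestion made above for assessing prediction accuracy with censored data, the sketch below (a hypothetical continuation of Box 9; the size of the testing set is an arbitrary choice) treats an additional set of records as fully censored (ai = -Inf, yi = NA, bi = Inf) and correlates their predictions with the observed phenotypes:

# Prediction accuracy with censored data (continues from Box 9)
tst <- sample(setdiff(1:n, cen), size = 100)   # testing set, disjoint from censored set
yCen[tst] <- NA                                # no phenotypic information for TST records
a[tst] <- -Inf
b[tst] <- Inf
fm2 <- BGLR(y = yCen, a = a, b = b, ETA = ETA, nIter = 12000, burnIn = 2000)
cor(y[tst], fm2$yHat[tst])                     # accuracy in the testing set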


6 Benchmark of parametric models

We carried out a benchmark evaluation by fitting a BRR to data sets involving three different sample sizes (n = 1K, 2K and 5K, K = 1,000) and four different marker densities (p = 5K, 10K, 50K and 100K). The evaluation was carried out on an Intel(R) Xeon(R) processor at 2 GHz. Computing times, expressed in seconds per thousand iterations of the Gibbs sampler, are given in Figure 6. R was executed in a single thread and was linked against OpenBLAS. Computing time scales approximately proportionally to the product of the number of records and the number of effects. For the most demanding scenario (n = 5K, p = 100K) it took approximately 11 minutes to complete 1,000 iterations of the Gibbs sampler.

Figure 6: Seconds per 1000 iterations of the Gibbs sampler by number of markers and sample size. The benchmark was carried out by fitting a Gaussian regression (BRR) using an Intel(R) Xeon(R) processor at 2 GHz. Computations were carried out using a single thread. [Plots not reproduced; left panel: seconds/1000 iterations versus thousands of markers, by number of individuals (1K, 2K, 5K); right panel: seconds/1000 iterations versus sample size, by number of markers (5K, 10K, 50K, 100K).]

In general, the computational times of models BayesA and BL are slightly longer than that of the BRR (∼10% longer). The computational times of models using finite-mixture priors (e.g., BayesB or BayesC) tend to be higher than those of BayesA, BL and BRR, unless the proportion of markers entering the model is low.


7 Concluding Remarks

In BGLR we implemented, in a unified Bayesian framework, several methods commonly used in genome-enabled prediction, including various parametric models as well as Gaussian processes that can be used for parametric or semi-parametric regression/prediction. The package supports continuous (censored or not) as well as binary and ordinal traits. The user interface gives the user great latitude in combining different modeling approaches for data analysis. Operations that can be vectorized are performed using built-in R functions, but most of the computationally intensive tasks are performed using compiled routines written in the C and Fortran languages. The package is also able to take advantage of multi-threaded BLAS implementations in both Windows and UNIX-like systems. Finally, together with the package we have included two data sets and ancillary functions that can be used to read into the R-environment genotype files written in ped and bed formats.

The Gibbs sampler implemented is computationally very intensive and our current implementation stores genotypes in memory; therefore, despite the effort made in the development of BGLR to make the algorithm computationally efficient, performing regressions with hundreds of thousands of markers requires access to large amounts of RAM and the computational time can be considerable. Certainly, faster algorithms could be conceived, but these are in general not as flexible, in terms of the class of models that can be implemented, as the ones implemented in BGLR.

Future developments. Although some of the computationally intensive algorithms implemented in BGLR can benefit from multi-threaded computing, there is ample room to further improve the computational performance of the software by making more intensive use of parallel computing. In future releases we plan to exploit parallel computing to a much greater extent. Also, we are currently working on modifying the software so that genotypes do not need to be stored in memory. Future releases including these and other features will be made at the R-Forge website (https://r-forge.r-project.org/R/?group_id=1525) first and, after considerable testing, at CRAN.

Acknowledgements

In the development of BGLR, Paulino Perez and Gustavo de los Campos had financial support provided by NIH grants R01GM099992 and R01GM101219.

References

Albert, J. H. and S. Chib (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88 (422), 669–679.

Andrews, D. F. and C. L. Mallows (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society. Series B (Methodological) 36 (1), 99–102.

Bates, D. and A. I. Vazquez (2009). pedigreemm: Pedigree-based mixed-effects models. R package version 0.2-4.


Bellman, R. E. (1961). Adaptive Control Processes: A Guided Tour. Princeton, New Jersey, U.S.A.: Princeton University Press.

Casella, G. and E. I. George (1992). Explaining the Gibbs sampler. The American Statistician 46 (3), 167–174.

Cressie, N. (1988). Spatial prediction and ordinary kriging. Mathematical Geology 20 (4), 405–421.

Crossa, J., G. de los Campos, P. Perez, D. Gianola, J. Burgueno, J. L. Araus, D. Makumbi, R. P. Singh, S. Dreisigacker, J. B. Yan, V. Arief, M. Banziger, and H. J. Braun (2010). Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers. Genetics 186 (2), 713–724.

de los Campos, G., D. Gianola, and G. J. M. Rosa (2009). Reproducing kernel Hilbert spaces regression: A general framework for genetic evaluation. Journal of Animal Science 87 (6), 1883–1887.

de los Campos, G., D. Gianola, G. J. M. Rosa, K. A. Weigel, and J. Crossa (2010). Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods. Genetics Research 92, 295–308.

de los Campos, G., J. M. Hickey, R. Pong-Wong, H. D. Daetwyler, and M. P. L. Calus (2013). Whole genome regression and prediction methods applied to plant and animal breeding. Genetics 193, 327–345.

de los Campos, G., H. Naya, D. Gianola, J. Crossa, A. Legarra, E. Manfredi, K. Weigel, and J. M. Cotes (2009). Predicting quantitative traits with regression models for dense molecular markers and pedigree. Genetics 182 (1), 375–385.

de los Campos, G. and P. Perez (2010). BLR: Bayesian linear regression R package. R package version 1.2.

de los Campos, G. and P. Perez (2013). BGLR: Bayesian generalized regression R package. R package version 1.0.

de los Campos, G., A. I. Vazquez, R. L. Fernando, Y. C. Klimentidis, and D. Sorensen (2013). Prediction of complex human traits using the genomic best linear unbiased predictor. PLoS Genetics 9 (7), e1003608.

Endelman, J. B. (2011). Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4, 250–255.

Geman, S. and D. Geman (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6 (6), 721–741.

Gianola, D. (2013). Priors in whole-genome regression: The Bayesian alphabet returns. Genetics 194 (3), 573–596.

Gianola, D. and G. de los Campos (2008). Inferring genetic values for quantitative traits non-parametrically. Genetics Research 90, 525–540.

Gianola, D., R. L. Fernando, and A. Stella (2006). Genomic-assisted prediction of genetic value with semiparametric procedures. Genetics 173 (3), 1761–1776.


Gianola, D. and J. B. C. H. M. van Kaam (2008). Reproducing kernel Hilbert spaces regression methods for genomic assisted prediction of quantitative traits. Genetics 178 (4), 2289–2303.

Gonzalez-Recio, O., D. Gianola, N. Long, K. A. Weigel, G. J. M. Rosa, and S. Avendano (2008). Nonparametric methods for incorporating genomic information into genetic evaluations: An application to mortality in broilers. Genetics 178 (4), 2305–2313.

Habier, D., R. Fernando, K. Kizilkaya, and D. Garrick (2011). Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics 12 (1), 186.

Hayes, B., P. Bowman, A. Chamberlain, and M. Goddard (2009). Invited review: Genomic selection in dairy cattle: Progress and challenges. Journal of Dairy Science 92 (2), 433–443.

Henderson, C. R. (1975). Best linear unbiased estimation and prediction under a selection model. Biometrics 31 (2), 423–447.

Hoerl, A. E. and R. W. Kennard (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12 (1), 55–67.

Janss, L., G. de los Campos, N. Sheehan, and D. Sorensen (2012). Inferences from genomic models in stratified populations. Genetics 192 (2), 693–704.

Legarra, A., C. Robert-Granie, E. Manfredi, and J.-M. Elsen (2008). Performance of genomic selection in mice. Genetics 180 (1), 611–618.

Makowsky, R., N. M. Pajewski, Y. C. Klimentidis, A. I. Vazquez, C. W. Duarte, D. B. Allison, and G. de los Campos (2011). Beyond missing heritability: Prediction of complex traits. PLoS Genetics 7 (4), e1002051.

Meuwissen, T. H. E., B. J. Hayes, and M. E. Goddard (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics 157 (4), 1819–1829.

Okut, H., D. Gianola, G. J. M. Rosa, and K. A. Weigel (2011). Prediction of body mass index in mice using dense molecular markers and a regularized neural network. Genetics Research 93, 189–201.

Park, T. and G. Casella (2008). The Bayesian lasso. Journal of the American Statistical Association 103 (482), 681–686.

Perez, P., G. de los Campos, J. Crossa, and D. Gianola (2010). Genomic-enabled prediction based on molecular markers and pedigree using the Bayesian linear regression package in R. Plant Genome 3 (2), 106–116.

Plummer, M., N. Best, K. Cowles, and K. Vines (2006). CODA: Convergence diagnosis and output analysis for MCMC. R News 6 (1), 7–11.

Purcell, S., B. Neale, K. Todd-Brown, L. Thomas, M. A. R. Ferreira, D. Bender, J. Maller, P. Sklar, P. I. W. de Bakker, M. J. Daly, and P. C. Sham (2007). PLINK: A tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics 81, 559–575.

Quaas, R. L. and E. J. Pollak (1980). Mixed model methodology for farm and ranch beef cattle testing programs. Journal of Animal Science 51 (6), 1277–1287.


R Core Team (2012). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0.

Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. Van Der Linde (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64 (4), 583–639.

Tanner, M. A. and W. H. Wong (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association 82 (398), 528–540.

Valdar, W., L. C. Solberg, D. Gauguier, S. Burnett, P. Klenerman, W. O. Cookson, M. S. Taylor, J. N. P. Rawlins, R. Mott, and J. Flint (2006). Genome-wide genetic association of complex traits in heterogeneous stock mice. Nature Genetics 38, 879–887.

Valdar, W., L. C. Solberg, D. Gauguier, W. O. Cookson, J. N. P. Rawlins, R. Mott, and J. Flint (2006). Genetic and environmental effects on complex traits in mice. Genetics 174 (2), 959–984.

VanRaden, P., C. V. Tassell, G. Wiggans, T. Sonstegard, R. Schnabel, J. Taylor, and F. Schenkel (2009). Invited review: Reliability of genomic predictions for North American Holstein bulls. Journal of Dairy Science 92 (1), 16–24.

VanRaden, P. M. (2008). Efficient methods to compute genomic predictions. Journal of Dairy Science 91 (11), 4414–4423.

Vapnik, V. (1998). Statistical Learning Theory (1st ed.). Wiley.

Vazquez, A. I., D. M. Bates, G. J. M. Rosa, D. Gianola, and K. A. Weigel (2010). Technical note: An R package for fitting generalized linear mixed models in animal breeding. Journal of Animal Science 88 (2), 497–504.

Vazquez, A. I., G. de los Campos, Y. C. Klimentidis, G. J. M. Rosa, D. Gianola, N. Yi, and D. B. Allison (2012). A comprehensive genetic approach for improving prediction of skin cancer risk in humans. Genetics 192 (4), 1493–1502.

Wahba, G. (1990). Spline Models for Observational Data. Society for Industrial and Applied Mathematics.

Wimmer, V., T. Albrecht, H.-J. Auinger, and C.-C. Schoen (2012). synbreed: a framework for the analysis of genomic prediction data using R. Bioinformatics 28 (15), 2086–2087.

Yang, J., B. Benyamin, B. P. McEvoy, S. Gordon, A. K. Henders, D. R. Nyholt, P. A. Madden, A. C. Heath, N. G. Martin, G. W. Montgomery, M. E. Goddard, and P. M. Visscher (2010). Common SNPs explain a large proportion of the heritability for human height. Nature Genetics 42 (7), 565–569.

Zhou, X. and M. Stephens (2012). Genome-wide efficient mixed-model analysis for association studies. Nature Genetics 44 (7), 821–824.


Appendices

1. Prior Densities Used in the BGLR R-Package

In this appendix we describe the prior distributions assigned to the location parameters, (βj, ul), entering in the linear predictor of eq. (1). For each of the unknown effects included in the linear predictor, {β1, ..., βJ, u1, ..., uL}, the prior density assigned is specified via the argument model in the corresponding entry of the list (see Box 3a for an example). Table A1 describes, for each of the options implemented, the prior density used. A brief description is given below.

FIXED. In this case regression coefficients are assigned flat priors; specifically, we use a Gaussian prior with mean zero and variance equal to 1 × 10¹⁰.

BRR. When this option is used regression coefficients are assigned IID normal distributions, with mean zero and variance σ²β. In a 2nd level of the hierarchy, the variance parameter is assigned a scaled-inverse Chi-squared density with parameters dfβ and Sβ. This density is parameterized in a way that the prior expected value and mode are E(σ²β) = Sβ/(dfβ − 2) and Mode(σ²β) = Sβ/(dfβ + 2), respectively. By default, if dfβ and Sβ are not provided, BGLR sets dfβ = 5 and solves for the scale parameter to match the R-squared of the model (see default rules to set hyper-parameters below). An analysis with fixed variance parameter can be obtained by setting the degree of freedom parameter to a very large value (e.g., 1 × 10¹⁰) and solving for the scale using Sβ = σ²β × (dfβ + 2); this gives a prior that collapses to a point of mass at σ²β.
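
As a sketch of this last device (not part of the original examples), the snippet below fixes the marker-effect variance of a BRR entry at an assumed value sigma2.beta; X and y are assumed to be a marker matrix and phenotype vector already loaded.

# Sketch: fixing the marker-effect variance in a BRR term of the linear predictor.
sigma2.beta <- 0.01              # user-chosen value, for illustration only
df0 <- 1e10                      # very large degrees of freedom
S0  <- sigma2.beta * (df0 + 2)   # scale chosen so the prior collapses at sigma2.beta

ETA <- list(list(X = X, model = "BRR", df0 = df0, S0 = S0))
# fm <- BGLR(y = y, ETA = ETA, nIter = 6000, burnIn = 1000)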

BayesA. In this model the marginal distribution of marker effects is a scaled-t density with parameters dfβ and Sβ. For computational convenience this density is implemented as an infinite mixture of scaled-normal densities. In a first level of the hierarchy marker effects are assigned normal densities with zero mean and marker-specific variance parameters, σ²βjk. In a 2nd level of the hierarchy these variance parameters are assigned IID scaled-inverse Chi-squared densities with degree of freedom and scale parameters dfβ and Sβ, respectively. The degree of freedom parameter is regarded as known; if the user does not provide a value for this parameter BGLR sets dfβ = 5. The scale parameter is treated as unknown, and BGLR assigns to this parameter a gamma density with rate and shape parameters r and s, respectively. The mode and coefficient of variation (CV) of the gamma density are Mode(Sβ) = (s − 1)/r (for s > 1) and CV(Sβ) = 1/√s. If the user does not provide shape and rate parameters, BGLR sets s = 1.1, which gives a relatively un-informative prior with a CV of approximately 95%, and then solves for the rate so that the total contribution of the linear predictor matches the R-squared of the model (see default rules to set hyper-parameters, below). If one wants to run the analysis with fixed scale one can choose a very large value for the shape parameter (e.g., 1 × 10¹⁰) and then solve for the rate so that the prior mode matches the desired value of the scale parameter using r = (s − 1)/Sβ.
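
A parallel sketch for BayesA, again assuming X and y are available; the fixed scale S.beta is an arbitrary value, and the argument names df0, shape0 and rate0 follow Table A1.

# Sketch: running BayesA with the scale of the t-prior effectively fixed.
S.beta <- 0.005                  # assumed value, for illustration only
shape0 <- 1e10                   # very large shape parameter
rate0  <- (shape0 - 1) / S.beta  # rate chosen so the prior mode of the scale equals S.beta

ETA <- list(list(X = X, model = "BayesA", df0 = 5, shape0 = shape0, rate0 = rate0))
# fm <- BGLR(y = y, ETA = ETA, nIter = 6000, burnIn = 1000)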

Bayesian LASSO (BL). In this model the marginal distribution of marker effects is double-exponential. Following Park and Casella (2008) we implement the double-exponential density as a mixture of scaled normal densities. In the first level of the hierarchy, marker effects are assigned independent normal densities with null mean and marker-specific variance parameters τ²jk × σ²ε.


The residual variance is assigned a scaled-inverse Chi-square density, and the marker-specific scale parameters, τ²jk, are assigned IID exponential densities with rate parameter λ²/2. Finally, in the last level of the hierarchy, λ² is either regarded as fixed (this is obtained by setting in the linear predictor the option type='FIXED'), assigned a Gamma density (λ² ∼ Gamma(r, s), if type='gamma'), or, if type='beta', λ/max is assigned a Beta prior, where max is a user-defined parameter representing the maximum value that λ can take. If nothing is specified, BGLR sets type='gamma' and s = 1.1, and solves for the rate parameter to match the expected R-squared of the model (see section 2 of this appendix).
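
The three treatments of λ can be requested as sketched below; the numeric values (shape, probIn, counts, max, lambda) are illustrative assumptions only, and X is assumed to be a marker matrix already in memory.

# Sketch: the three ways of handling lambda in the Bayesian LASSO (BL).
eta.gamma <- list(list(X = X, model = "BL", type = "gamma", shape = 1.1))   # lambda^2 ~ Gamma
eta.beta  <- list(list(X = X, model = "BL", type = "beta",
                       probIn = 0.5, counts = 2, max = 200))                # lambda/max ~ Beta
eta.fixed <- list(list(X = X, model = "BL", type = "FIXED", lambda = 30))   # lambda regarded as known
# fm <- BGLR(y = y, ETA = eta.gamma, nIter = 6000, burnIn = 1000)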

BayesB-C. In these models marker effects are assigned IID priors that are mixtures of a point of mass at zero and a slab that is either normal (BayesC) or a scaled-t density (BayesB). The slab is structured as in the BRR (in the case of BayesC) or as in BayesA (in the case of BayesB). Therefore, BayesB and BayesC extend BayesA and BRR, respectively, by introducing an additional parameter π which, in the case of BGLR, represents the prior proportion of non-zero effects. This parameter is treated as unknown and is assigned a Beta prior, π ∼ Beta(p0, π0), with p0 > 0 and π0 ∈ [0, 1]. The Beta prior is parameterized so that the prior expected value is E(π) = π0; on the other hand, p0 can be interpreted as the number of prior counts (prior "successes" plus prior "failures"), and the variance of the Beta distribution is then Var(π) = π0(1 − π0)/(p0 + 1), which is inversely proportional to p0. Choosing p0 = 2 and π0 = 0.5 gives a uniform prior on the interval [0, 1]. Choosing a very large value for p0 gives a prior that collapses to a point of mass at π0.
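
A small sketch of how these two hyper-parameters are passed to BGLR (via the probIn and counts arguments of Table A1) and of the prior moments they imply; the values chosen reproduce the uniform-prior case mentioned above, and X is an assumed marker matrix.

# Sketch: the probIn/counts arguments and the Beta(p0, pi0) prior on pi they imply.
probIn <- 0.5   # pi0, the prior proportion of non-zero effects
counts <- 2     # p0, the number of prior counts; probIn = 0.5, counts = 2 gives a uniform prior
ETA <- list(list(X = X, model = "BayesC", probIn = probIn, counts = counts))

# Implied prior mean and variance of pi:
E.pi   <- probIn
Var.pi <- probIn * (1 - probIn) / (counts + 1)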

2. Default rules for choosing hyper-parameters

BGLR has built-in rules to set the values of hyper-parameters. The default rules assign proper, but weakly informative, priors with prior modes chosen in a way that, a priori, they obey a variance partition of the phenotype into components attributable to the error term and to each of the elements of the linear predictor. The user can control this variance partition by setting the argument R2 (representing the model R-squared) of the BGLR function to the desired value. By default the model R2 is set equal to 0.5, in which case hyper-parameters are chosen to match a variance partition where 50% of the variance of the response is attributable to the linear predictor and 50% to model residuals. Each of the elements of the linear predictor has its own R2 parameter (see last column of Table A1). If these are not provided, the R2 attributable to each element of the linear predictor equals the R-squared of the model divided by the number of elements in the linear predictor. Once the R2 parameters are set, BGLR checks whether each of the hyper-parameters has been specified and, if not, the built-in rules are used to set values for these hyper-parameters. Next we briefly describe the built-in rules implemented in BGLR; these are based on formulas similar to those described by de los Campos et al. (2013), implemented using the prior mode instead of the prior mean.
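
For illustration, a hedged sketch of a two-term model in which the user overrides the default partition; the 0.5/0.1 split, the marker matrix X and the relationship matrix A are assumptions made only for the example.

# Sketch: controlling the prior variance partition through R2.
ETA <- list(MRK = list(X = X, model = "BRR",  R2 = 0.5),   # 50% of variance assigned to markers
            PED = list(K = A, model = "RKHS", R2 = 0.1))   # 10% assigned to the pedigree term
# fm <- BGLR(y = y, ETA = ETA, R2 = 0.6, nIter = 6000, burnIn = 1000)  # remaining 40% to residuals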

Variance parameters. The residual variance (σ²ε), the variances of the random effects of the RKHS model (σ²ul), and the variance of marker effects of the BRR (σ²β) are assigned scaled-inverse Chi-square densities, which are indexed by a scale and a degree of freedom parameter. By default, if the degree of freedom parameter is not specified, it is set equal to 5 (this gives a relatively un-informative scaled-inverse Chi-square density and guarantees a finite prior variance) and the scale parameter is solved for to match the desired variance partition.


Table A1. Prior densities implemented in BGLR. For each option of the argument model, the joint distribution of effects and hyper-parameters is given, followed by the specification of the corresponding element of the linear predictor.

model="FIXED"
  p(βj) ∝ 1
  list(X=, model="FIXED")

model="BRR"
  p(βj, σ²β) = {∏k N(βjk | 0, σ²β)} χ⁻²(σ²β | dfβ, Sβ)
  list(X=, model="BRR", df0=, S0=, R2=)

model="BayesA"
  p(βj, σ²βj, Sβ) = {∏k N(βjk | 0, σ²βjk) χ⁻²(σ²βjk | dfβ, Sβ)} G(Sβ | r, s)
  list(X=, model="BayesA", df0=, rate0=, shape0=, R2=)

model="BL" (1)
  type="gamma": p(βj, τ²j, λ² | σ²ε) = {∏k N(βjk | 0, τ²jk × σ²ε) Exp(τ²jk | λ²/2)} × G(λ² | r, s)
    list(X=, model="BL", lambda=, type="gamma", rate=, shape=, R2=)
  type="beta": p(βj, τ²j, λ | σ²ε, max) = {∏k N(βjk | 0, τ²jk × σ²ε) Exp(τ²jk | λ²/2)} × B(λ/max | p0, π0)
    list(X=, model="BL", lambda=, type="beta", probIn=, counts=, max=, R2=)
  type="FIXED": p(βj, τ²j | σ²ε, λ) = {∏k N(βjk | 0, τ²jk × σ²ε) Exp(τ²jk | λ²/2)}
    list(X=, model="BL", lambda=, type="FIXED")

model="BayesC" (2)
  p(βj, σ²β, π) = {∏k [π N(βjk | 0, σ²β) + (1 − π) 1(βjk = 0)]} × χ⁻²(σ²β | dfβ, Sβ) B(π | p0, π0)
  list(X=, model="BayesC", df0=, S0=, probIn=, counts=, R2=)

model="BayesB" (2)
  p(βj, σ²βj, π, Sβ) = {∏k [π N(βjk | 0, σ²βjk) + (1 − π) 1(βjk = 0)] χ⁻²(σ²βjk | dfβ, Sβ)} B(π | p0, π0) × G(Sβ | r, s)
  list(X=, model="BayesB", df0=, rate0=, shape0=, probIn=, counts=, R2=)

model="RKHS"
  p(ul, σ²ul) = N(ul | 0, Kl × σ²ul) χ⁻²(σ²ul | dfl, Sl)
  Either list(K=, model="RKHS", df0=, S0=, R2=) or list(V=, d=, model="RKHS", df0=, S0=, R2=) (3)

N(·|·,·), χ⁻²(·|·,·), G(·|·,·), Exp(·|·) and B(·|·,·) denote normal, scaled-inverse Chi-squared, gamma, exponential and beta densities, respectively. (1) type can take values "FIXED", "gamma", or "beta". (2) probIn represents the prior probability of a marker having a non-null effect (π0); counts (the number of 'prior counts') can be used to control how informative the prior is. (3) V and d represent the eigen-vectors and eigen-values of K, respectively.


For instance, in the case of the residual variance the scale is calculated using Sε = var(y) × (1 − R2) × (dfε + 2); this gives a prior mode for the residual variance equal to var(y) × (1 − R2). Similar rules are used for the other variance parameters; for example, if one element of the linear predictor involves a linear regression of the form Xβ with model='BRR', then Sβ = var(y) × R2 × (dfβ + 2)/MSx, where MSx is the sum of the sample variances of the columns of X and R2 is the proportion of phenotypic variance a priori assigned to that particular element of the linear predictor. The selection of the scale parameter when the model is the RKHS regression is modified relative to the above rule to account for the fact that the average diagonal value of K may be different from 1; specifically, we choose the scale parameter according to the formula Sl = var(y) × R2 × (dfl + 2)/mean(diag(K)).
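
These rules are easy to reproduce in R; the sketch below computes the default scales for a Gaussian model with a single BRR term, assuming y and X are the phenotype vector and marker matrix.

# Sketch: default scale parameters for a Gaussian model with one BRR term.
R2   <- 0.5
df.e <- 5
df.b <- 5
MSx  <- sum(apply(X, 2, var))            # sum of the sample variances of the columns of X

S.e <- var(y) * (1 - R2) * (df.e + 2)    # scale of the residual variance
S.b <- var(y) * R2 * (df.b + 2) / MSx    # scale of the marker-effect variance (BRR)

# For an RKHS term with kernel K the rule becomes:
# S.l <- var(y) * R2 * (df.l + 2) / mean(diag(K))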

In models BayesA and BayesB the scale parameter indexing the t-prior assigned to marker effects is assigned a Gamma density with rate and shape parameters r and s, respectively. By default, BGLR sets s = 1.1 and solves for the rate parameter using r = (s − 1)/Sβ with Sβ = var(y) × R2 × (dfβ + 2)/MSx; here, as before, MSx represents the sum of the sample variances of the columns of X.

For the BL, the default is to set type='gamma', fix the shape parameter of the gamma density to 1.1, and solve for the rate parameter to match the expected proportion of variance accounted for by the corresponding element of the linear predictor, as specified by the argument R2. Specifically, we set the rate to be (s − 1)/(2 × (1 − R2)/R2 × MSx).

For models BayesB and BayesC, the default rule is to set π0 = 0.5 and p0 = 10. This gives a weakly informative Beta prior for π with a prior mode at 0.5. The scale and degree-of-freedom parameters entering in the priors of these two models are treated as in models BayesA (in the case of BayesB) and BRR (in the case of BayesC), but the rules are modified by considering that only a fraction of the markers (π) have non-null effects; therefore, in BayesC we use Sβ = var(y) × R2 × (dfβ + 2)/MSx/π and in BayesB we set r = (s − 1)/Sβ with Sβ = var(y) × R2 × (dfβ + 2)/MSx/π.
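
Continuing the sketch above (same R2, df.b and MSx), the π-adjusted defaults for BayesC and BayesB can be written as:

# Sketch: pi-adjusted default scales for BayesC and BayesB, using the prior value pi0.
pi0 <- 0.5
s   <- 1.1
S.b.BayesC <- var(y) * R2 * (df.b + 2) / MSx / pi0   # slab-variance scale used in BayesC
S.b.BayesB <- var(y) * R2 * (df.b + 2) / MSx / pi0   # used to solve for the gamma rate in BayesB
r.BayesB   <- (s - 1) / S.b.BayesB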
