Bayesian Regularization

Hedibert F. Lopes

Insper - Institute of Education and Research

August 4th, 2015

1 Least absolute shrinkage and selection operator

2 Bridge regression and elastic net

3 Bayesian Lasso

4 Spike and slab variable selection

5 Horseshoe prior

6 Normal-gamma prior

7 Support vector machines

8 Sparse factor models
   Case 1: Constructing Economically Justified Aggregates
   Case 2: Exchange rates

Least absolute shrinkage and selection operator

In the linear regression set-up with p standardized regressors,

$$y_i = x_i'\beta + \varepsilon_i,$$

Tibshirani's (1996) lasso solves the L1-penalization problem

$$\hat\beta = \arg\min_{\beta} \Big\{ l(\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \Big\}.$$

Ridge regression: the penalty is $\lambda \sum_{j=1}^{p} \beta_j^2$.

Variable selection/shrinkage: The lasso does both variable selection and shrinkage, whereas ridge regression only shrinks.

Bayesian interpretation: The lasso estimate is the maximum a posteriori (MAP) estimate under a double-exponential prior.
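A minimal sketch of the selection-versus-shrinkage contrast. It assumes that the loss is the residual sum of squares, l(β) = ||y − Xβ||², and that the design is orthonormal (X'X = I); neither is stated on the slide, and the function names are hypothetical. Under these assumptions the lasso solution soft-thresholds the OLS estimate at λ/2, while ridge rescales it by 1/(1 + λ).

```python
import numpy as np

def lasso_orthonormal(b_ols, lam):
    """Lasso solution when X'X = I: soft-threshold the OLS estimate at lam / 2."""
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam / 2.0, 0.0)

def ridge_orthonormal(b_ols, lam):
    """Ridge solution when X'X = I: shrink every coefficient by 1 / (1 + lam)."""
    return b_ols / (1.0 + lam)

b_ols = np.array([3.0, -0.4, 0.2, -2.0])  # illustrative OLS estimates
lam = 1.0
b_lasso = lasso_orthonormal(b_ols, lam)   # small coefficients become exactly 0
b_ridge = ridge_orthonormal(b_ols, lam)   # all coefficients shrink, none is 0
```

Coefficients whose OLS magnitude is below λ/2 are set exactly to zero by the lasso rule, which is the "selection" part; the ridge rule never produces an exact zero.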

Figure 2 of Tibshirani (1996)

Other penalties

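Breiman's (1995) non-negative garrotte: minimize, with respect to $c_j$,

$$\sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} c_j x_{ij} \hat\beta_j \Big)^2 \quad \text{subject to } c_j \ge 0, \ \sum_{j=1}^{p} c_j \le t,$$

where the $\hat\beta_j$ are the OLS estimates.

Frank and Friedman's (1993) bridge regression, where the penalty is

$$\lambda \sum_{j=1}^{p} |\beta_j|^{\gamma},$$

with λ and γ estimated from the data.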

Bridge regression and elastic net

Frank and Friedman's (1993) bridge regression generalizes both lasso and ridge regression.

The bridge estimator can be viewed as the Bayes posterior mode under the prior

$$p(\beta \mid \lambda, q) \propto \exp\{-\lambda \|\beta\|_q^q\}.$$

Ridge regression (q = 2): Gaussian prior.
Lasso regression (q = 1): Double-exponential prior.

Lasso + ridge: The elastic net penalty corresponds to a new prior given by

$$p(\beta \mid \lambda, \alpha) \propto \exp\{-\lambda(\alpha \|\beta\|_2^2 + (1-\alpha)\|\beta\|_1)\},$$

a compromise between the Gaussian and double-exponential priors.
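The two prior kernels above can be written as negative log priors (penalties). The sketch below is only illustrative, with hypothetical function names; it shows that the bridge penalty reduces to the lasso penalty at q = 1 and the ridge penalty at q = 2, and that the elastic net interpolates between the two as α moves from 0 to 1.

```python
import numpy as np

def bridge_neg_log_prior(beta, lam, q):
    """lam * sum_j |beta_j|^q, i.e. minus the log of the bridge prior kernel."""
    return lam * np.sum(np.abs(beta) ** q)

def elastic_net_neg_log_prior(beta, lam, alpha):
    """lam * (alpha * ||beta||_2^2 + (1 - alpha) * ||beta||_1)."""
    return lam * (alpha * np.sum(beta ** 2) + (1.0 - alpha) * np.sum(np.abs(beta)))

beta = np.array([0.5, -1.0, 0.0, 2.0])  # illustrative coefficient vector
lam = 0.7
lasso_pen = bridge_neg_log_prior(beta, lam, q=1)  # double-exponential prior
ridge_pen = bridge_neg_log_prior(beta, lam, q=2)  # Gaussian prior
mixed_pen = elastic_net_neg_log_prior(beta, lam, alpha=0.5)
```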

Naïve elastic net

The naïve elastic net estimator is defined as

$$\hat\beta = \arg\min_{\beta} \{ \|y - X\beta\|^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2 \}.$$

Zou and Hastie (2005) argue that ". . . the elastic net often outperforms the lasso, while enjoying a similar sparsity of representation . . . encourages a grouping . . . is particularly useful when p is much bigger than n."

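Theorem 2 of Zou and Hastie (2005)

Given data (y, X) and (λ1, λ2), the elastic net estimates $\hat\beta$ are given by

$$\hat\beta = \arg\min_{\beta} \ \beta' \Big( \frac{X'X + \lambda_2 I}{1 + \lambda_2} \Big) \beta - 2 y'X\beta + \lambda_1 \|\beta\|_1.$$

It is easy to see that

$$\hat\beta(\text{lasso}) = \arg\min_{\beta} \ \beta'(X'X)\beta - 2 y'X\beta + \lambda_1 \|\beta\|_1.$$

Also,

$$\frac{X'X + \lambda_2 I}{1 + \lambda_2} = (1 - \gamma) X'X + \gamma I,$$

where γ = λ2/(1 + λ2).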
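The last identity says that the elastic net replaces X'X with a version shrunk towards the identity matrix. A quick numerical check on simulated data (the design matrix and the value of λ2 are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))   # arbitrary simulated design
lam2 = 3.0
gamma = lam2 / (1.0 + lam2)

XtX = X.T @ X
lhs = (XtX + lam2 * np.eye(4)) / (1.0 + lam2)
rhs = (1.0 - gamma) * XtX + gamma * np.eye(4)
identity_holds = np.allclose(lhs, rhs)
```

With λ2 = 0 the matrix is X'X itself (the lasso case); as λ2 grows, γ approaches 1 and the matrix approaches I.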

Table 1 of Tibshirani (2011)

Bayesian Lasso

Park and Casella (2008) consider a fully Bayesian analysis using a conditional Laplace prior specification of the form

$$p(\beta \mid \sigma^2) \propto \prod_{j=1}^{p} \frac{\lambda}{2\sqrt{\sigma^2}} \exp\{-\lambda |\beta_j| / \sqrt{\sigma^2}\}.$$

They show that "the Gibbs sampler for the Bayesian Lasso exploits the following representation of the Laplace distribution as a scale mixture of normals (with an exponential mixing density)" (Andrews and Mallows, 1974):

$$p(z \mid \lambda) = \frac{\lambda}{2} e^{-\lambda |z|} = \int_0^{\infty} N(z \mid 0, \tau^2) \, E(\tau^2 \mid \lambda^2/2) \, d\tau^2.$$
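The scale-mixture representation can be checked by simulation. The sketch below draws τ² from an exponential distribution with rate λ²/2, then z given τ² from N(0, τ²), and compares two sample moments with those of a Laplace density with rate λ (E|z| = 1/λ and Var(z) = 2/λ²). The value of λ, the seed and the number of draws are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 2.0
n_draws = 200_000

# tau2 ~ Exponential with rate lam^2 / 2 (NumPy takes scale = 1 / rate).
tau2 = rng.exponential(scale=2.0 / lam**2, size=n_draws)
# z | tau2 ~ N(0, tau2)
z = rng.standard_normal(n_draws) * np.sqrt(tau2)

# Laplace(rate lam) moments: E|z| = 1 / lam and Var(z) = 2 / lam^2.
mean_abs_z = np.mean(np.abs(z))
var_z = np.var(z)
```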

Related work: Fernandes and Steel (2000), Figueiredo (2003), Bae and Mallick (2004), Yuan and Lin (2005), Hans (2009, 2010), Balakrishnan and Madigan (2010).

The Bayesian bridge: Polson, Scott and Windle (2013).

Hierarchical model

The hierarchical representation of the full model is given by

$$y \mid \mu, X, \beta, \sigma^2 \sim N(\mu 1_n + X\beta, \ \sigma^2 I_n)$$

$$\beta \mid \sigma^2, \tau_1^2, \ldots, \tau_p^2 \sim N(0, \ \sigma^2 D_\tau)$$

$$\sigma^2, \tau_1^2, \ldots, \tau_p^2 \sim p(\sigma^2) \, d\sigma^2 \prod_{j=1}^{p} \frac{\lambda^2}{2} \exp\{-\lambda^2 \tau_j^2 / 2\} \, d\tau_j^2$$

where $D_\tau = \mathrm{diag}(\tau_1^2, \ldots, \tau_p^2)$.

The $\tau_j^2$ are known as the local shrinkage parameters.

The full conditional distributions of the $1/\tau_j^2$ are inverse-Gaussian.

Note: Approximate analytical methods proposed by Tibshirani (1996) and Fan and Li (2001) fail to provide reasonable standard error estimates for the parameters estimated to be 0.
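A minimal sketch of a Gibbs sampler for this hierarchy, following the full conditionals given in Park and Casella (2008). Several choices here are assumptions made for illustration and are not on the slide: λ is held fixed, y is centered so that µ drops out, the prior on σ² is taken to be p(σ²) ∝ 1/σ², and the function name and simulated data are hypothetical. The inverse-Gaussian draw uses NumPy's `Generator.wald(mean, scale)`.

```python
import numpy as np

def bayesian_lasso_gibbs(X, y, lam=1.0, n_iter=2000, burn=500, seed=0):
    """Gibbs sampler sketch for the Bayesian lasso with fixed lam.

    Assumes centered y and the prior p(sigma2) proportional to 1 / sigma2.
    Returns the post-burn-in draws of beta, one row per iteration.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    y = y - y.mean()
    XtX, Xty = X.T @ X, X.T @ y
    sigma2, inv_tau2 = 1.0, np.ones(p)
    draws = np.empty((n_iter - burn, p))
    for it in range(n_iter):
        # beta | rest ~ N(A^{-1} X'y, sigma2 * A^{-1}), with A = X'X + D_tau^{-1}
        A_inv = np.linalg.inv(XtX + np.diag(inv_tau2))
        A_inv = (A_inv + A_inv.T) / 2.0          # enforce exact symmetry
        beta = rng.multivariate_normal(A_inv @ Xty, sigma2 * A_inv)
        # sigma2 | rest ~ Inverse-Gamma(shape, scale)
        resid = y - X @ beta
        shape = (n - 1) / 2.0 + p / 2.0
        scale = resid @ resid / 2.0 + beta @ (inv_tau2 * beta) / 2.0
        sigma2 = scale / rng.gamma(shape)
        # 1 / tau_j^2 | rest ~ Inverse-Gaussian(mean, shape = lam^2)
        mean_ig = np.sqrt(lam**2 * sigma2 / beta**2)
        inv_tau2 = rng.wald(mean_ig, lam**2)
        if it >= burn:
            draws[it - burn] = beta
    return draws

# Simulated example with a sparse coefficient vector.
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 5))
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
y = X @ beta_true + 0.5 * rng.standard_normal(100)
draws = bayesian_lasso_gibbs(X, y)
post_mean = draws.mean(axis=0)
```

Unlike the lasso point estimate, the posterior draws give interval estimates for every coefficient, including those shrunk close to zero, for example via `np.percentile(draws, [2.5, 97.5], axis=0)`.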

Elastic net & orthant normal prior

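Hans (2011) shows that a modified version of Zou and Hastie's (2005) elastic net penalty,

$$p(\beta \mid \alpha, \lambda, \sigma^2) \propto \exp\Big[ -\frac{\lambda}{2\sigma^2} \{ \alpha \|\beta\|_2^2 + (1 - \alpha) \|\beta\|_1 \} \Big],$$

can be rewritten as

$$\prod_{j=1}^{p} \Big[ 0.5 \, N^{-}\Big( \beta_j \,\Big|\, \frac{1-\alpha}{2\alpha}, \frac{\sigma^2}{\lambda\alpha} \Big) + 0.5 \, N^{+}\Big( \beta_j \,\Big|\, -\frac{1-\alpha}{2\alpha}, \frac{\sigma^2}{\lambda\alpha} \Big) \Big],$$

where $N^{-}$ and $N^{+}$ are properly normalized pdfs of normals truncated to the negative and positive half-lines, respectively. Or, with $\lambda_1 = \lambda(1-\alpha)$ and $\lambda_2 = \lambda\alpha$,

$$\sum_{z \in Z} 2^{-p} \, N^{[z]}\Big( \beta \,\Big|\, -\frac{\lambda_1}{2\lambda_2} z, \ \frac{\sigma^2}{\lambda_2} I_p \Big) 1(\beta \in O_z),$$

where $Z = \{-1, 1\}^p$ and $O_z$ is the orthant of $z \in Z$.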

Spike and slab variable selection

Let us step back a bit (SSVS)

George and McCulloch (1993) is a seminal paper in the Bayesian literature regarding variable selection via sparsity:

$$\beta_i \mid \gamma_i \sim (1 - \gamma_i) N(0, \tau_i^2) + \gamma_i N(0, c_i^2 \tau_i^2).$$

Prior of γ:

$$p(\gamma) = \prod_{i=1}^{p} p_i^{\gamma_i} (1 - p_i)^{(1 - \gamma_i)}.$$

See also George and McCulloch (1997), who describe and compare various hierarchical mixture prior formulations of variable selection uncertainty in normal linear regression models.
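Under this mixture prior, the probability that a coefficient belongs to the wide (slab) component, given its current value, follows from Bayes' rule: P(γ_i = 1 | β_i) is proportional to p_i N(β_i | 0, c_i² τ_i²). A minimal sketch of that calculation, with hypothetical function names and illustrative hyperparameter values:

```python
import numpy as np

def normal_pdf(x, sd):
    """Density of N(0, sd^2) evaluated at x."""
    return np.exp(-0.5 * (x / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def inclusion_prob(beta_i, tau_i, c_i, p_i):
    """P(gamma_i = 1 | beta_i) under the two-component normal mixture prior."""
    slab = p_i * normal_pdf(beta_i, c_i * tau_i)
    spike = (1.0 - p_i) * normal_pdf(beta_i, tau_i)
    return slab / (slab + spike)

prob_small = inclusion_prob(0.01, tau_i=0.1, c_i=10.0, p_i=0.5)  # near the spike
prob_large = inclusion_prob(1.00, tau_i=0.1, c_i=10.0, p_i=0.5)  # far out in the slab
```

A coefficient close to zero is attributed mostly to the narrow component, while a coefficient ten spike standard deviations away from zero is attributed almost entirely to the slab.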

Spike and slab approaches

By a spike and slab model, Ishwaran and Rao (2005) mean a Bayesian model specified by the following prior hierarchy:

$$y_i \mid x_i, \beta, \sigma^2 \sim N(x_i'\beta, \sigma^2)$$

$$(\beta \mid \gamma) \sim N(0, \Gamma)$$

$$\gamma \sim \pi(d\gamma)$$

$$\sigma^2 \sim \mu(d\sigma^2)$$

The literature designs hierarchical priors over parameter and model spaces; see Mitchell and Beauchamp (1988), Chipman (1996), Clyde, DeSimone and Parmigiani (1996), Geweke (1996), Kuo and Mallick (1998), Chipman, George and McCulloch (2001), O'Hara and Sillanpää (2009).

Gibbs sampling is used to identify promising models.

The choice of priors is often tricky.

Barbieri and Berger (2004) have shown that in many circumstances the highest-probability model is not the optimal predictive model, whereas the median probability model is predictively optimal.

Horseshoe prior

The horseshoe prior

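Carvalho, Polson and Scott (2010) propose a hierarchical prior as follows:

$$\theta_i \mid \lambda_i \sim N(0, \lambda_i^2)$$

$$\lambda_i \mid \tau \sim C^{+}(0, \tau)$$

$$\tau \mid \sigma \sim C^{+}(0, \sigma),$$

where $C^{+}(0, a)$ is a half-Cauchy distribution on the positive reals with scale parameter a.

Horseshoe prior: When $\sigma^2 = \tau^2 = 1$ and $y_i \mid \theta_i \sim N(\theta_i, \sigma^2)$:

$$E(\theta_i \mid y) = \int_0^1 (1 - \kappa_i) \, y_i \, p(\kappa_i \mid y) \, d\kappa_i = \{1 - E(\kappa_i \mid y)\} \, y_i,$$

where $\kappa_i = 1/(1 + \lambda_i^2)$, and $E(\kappa_i \mid y)$ is the amount of shrinkage towards zero.

The half-Cauchy prior on $\lambda_i$ implies a horseshoe-shaped Be(1/2, 1/2) prior for the shrinkage coefficient $\kappa_i$.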
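The Be(1/2, 1/2) claim can be checked by simulation: draw λ_i from a standard half-Cauchy, transform to κ_i = 1/(1 + λ_i²), and compare with the moments of Be(1/2, 1/2), which has mean 1/2 and variance 1/8. The seed and number of draws below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n_draws = 200_000

# lambda_i ~ C+(0, 1): absolute value of a standard Cauchy draw.
lam = np.abs(rng.standard_cauchy(n_draws))
kappa = 1.0 / (1.0 + lam**2)

# Be(1/2, 1/2) has mean 1/2 and variance 1/8, with mass piled up near 0 and 1.
kappa_mean = kappa.mean()
kappa_var = kappa.var()
share_extreme = np.mean((kappa < 0.1) | (kappa > 0.9))
```

Roughly 40% of the prior mass on κ_i lies within 0.1 of either endpoint, which is the horseshoe shape: a priori, a coefficient is either left almost untouched (κ_i near 0) or shrunk almost entirely to zero (κ_i near 1).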

Normal-gamma prior

Normal-gamma prior: Griffin and Brown (2010)

The normal-gamma distribution arises by assuming that the mixing distribution in a scale mixture of normals (SMN) has the density $g(x) = \mathrm{Ga}(x \mid \lambda, 1/(2\gamma^2))$:

$$p(\beta_i) = \frac{1}{\sqrt{\pi} \, 2^{\lambda - 1/2} \, \gamma^{\lambda + 1/2} \, \Gamma(\lambda)} \, |\beta_i|^{\lambda - 1/2} \, K_{\lambda - 1/2}(|\beta_i| / \gamma),$$

where K is the modified Bessel function of the third kind.

The variance of $\beta_i$ is $2\lambda\gamma^2$ and the excess kurtosis is $3/\lambda$.

Small λ:

Large mass close to zero.

Heavy tails.
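The variance and excess kurtosis stated above can be checked by simulating from the scale mixture directly. This is a sketch under illustrative parameter values (λ = 2, γ = 0.5, both arbitrary), with the gamma mixing distribution parameterized by shape λ and rate 1/(2γ²).

```python
import numpy as np

rng = np.random.default_rng(3)
lam, gam = 2.0, 0.5   # shape lambda and scale gamma (illustrative values)
n_draws = 500_000

# psi ~ Ga(lam, rate = 1 / (2 gam^2)); NumPy's gamma takes scale = 1 / rate.
psi = rng.gamma(shape=lam, scale=2.0 * gam**2, size=n_draws)
# beta | psi ~ N(0, psi)
beta = rng.standard_normal(n_draws) * np.sqrt(psi)

var_mc = beta.var()                                  # theory: 2 * lam * gam^2
excess_kurt_mc = np.mean(beta**4) / var_mc**2 - 3.0  # theory: 3 / lam
```

Lowering λ while holding the variance fixed raises the excess kurtosis 3/λ, which is the "more mass near zero, heavier tails" behaviour listed above.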

Hedibert Lopes (Insper) Brazilian School of Times Series and Econometrics August 3rd, 2015 18 / 38

Figure 1 of Griffin and Brown (2010)

Normal-gamma prior with a variance of 2 and λ = 0.1 (solid line), λ = 0.333 (dot-dashed line) and λ = 1 (dashed line).

The distribution is a member of the generalized hyperbolic family (Barndorff-Nielsen and Blaesild 1981).

The prior was considered by Griffin and Brown (2007), but the shape of the density made it difficult to obtain MAP estimates.

More recently, Caron and Doucet (2008) have looked at MAP estimation and drawn a link to Lévy processes.

See also Griffin and Brown (2013).

Support vector machines

Polson and Scott (2011) introduce a latent variable representation of regularized support vector machines (SVM), binary classifiers that are often used with extremely high-dimensional covariates. This representation enables EM, ECME or MCMC algorithms to provide parameter estimates. See also Tipping (2001).

The $L^{\alpha}$-norm regularized support vector classifier chooses a set of coefficients β to minimize the objective function

$$d_\alpha(\beta, \nu) = \sum_{i=1}^{n} \max(1 - y_i x_i'\beta, \ 0) + \nu^{-\alpha} \sum_{j=1}^{p} |\beta_j / \sigma_j|^{\alpha},$$

where $\sigma_j$ is the standard deviation of the j'th element of x and ν is a tuning parameter.

Minimizing the above equation is equivalent to finding the mode of the pseudo-posterior distribution $p(\beta \mid \nu, \alpha, y)$ defined by

$$p(\beta \mid \nu, \alpha, y) \propto \exp\{-d_\alpha(\beta, \nu)\} \propto C_\alpha(\nu) \, L(y \mid \beta) \, p(\beta \mid \nu, \alpha).$$
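A minimal sketch of the objective function d_α(β, ν): the hinge loss summed over observations plus the scaled L^α penalty. It assumes labels coded as −1/+1 and uses the sample standard deviation of each column of X (with divisor n) for σ_j; the function name and the toy data are hypothetical.

```python
import numpy as np

def svm_objective(beta, X, y, nu, alpha):
    """Hinge loss plus the penalty nu^(-alpha) * sum_j |beta_j / sigma_j|^alpha."""
    sigma = X.std(axis=0)                                # column standard deviations
    hinge = np.maximum(1.0 - y * (X @ beta), 0.0).sum()  # sum_i max(1 - y_i x_i'beta, 0)
    penalty = nu ** (-alpha) * np.sum(np.abs(beta / sigma) ** alpha)
    return hinge + penalty

X = np.array([[1.0, 0.0], [-1.0, 2.0], [1.0, 4.0], [-1.0, 2.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])                     # labels coded as -1 / +1
beta = np.array([2.0, 0.0])
value = svm_objective(beta, X, y, nu=1.0, alpha=1.0)
value_at_zero = svm_objective(np.zeros(2), X, y, nu=1.0, alpha=1.0)
```

In this toy example β separates the classes with margin 2, so the hinge term vanishes and the objective equals the penalty alone; at β = 0 every observation contributes a hinge loss of 1.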

Local shrinkage rules and Lévy processes

Polson and Scott (2012) use Lévy processes to generate joint prior distributions for a location parameter β = (β1, . . . , βp) as p grows large.

This generalizes the class of local-global shrinkage rules based on scale mixtures of normals. See also Polson and Scott (2011).

Sparse factor models

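The basic, most common, linear and Gaussian factor structure is

$$y_i \mid f_i \sim N(\beta f_i, \Sigma)$$

$$f_i \sim N(0, H)$$

where $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_p^2)$ and $H = \mathrm{diag}(h_1, \ldots, h_k)$, such that

$$\mathrm{Var}(y_i) = \beta H \beta' + \Sigma.$$

Sparsity: West (2003), Carvalho et al. (2008) and Frühwirth-Schnatter and Lopes (2009)

$$\beta_{ij} \sim (1 - \pi_{ij}) \, \delta_0(\beta_{ij}) + \pi_{ij} \, N(\beta_{ij} \mid 0, \tau_j)$$

$$\pi_{ij} \sim (1 - \rho_j) \, \delta_0(\pi_{ij}) + \rho_j \, \mathrm{Be}(\pi_{ij} \mid a_j m_j, \ a_j(1 - m_j)),$$

where Be(am, a(1 − m)) is a beta distribution with mean m and precision a.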
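A small numerical illustration of the covariance decomposition Var(y_i) = βHβ' + Σ with a sparse loading matrix. All numbers are hypothetical; the zeros in β stand in for loadings that the sparsity prior has set exactly to zero. The implied covariance is compared with the sample covariance of simulated data.

```python
import numpy as np

# Hypothetical sparse loadings: p = 4 series, k = 2 factors.
beta = np.array([[1.0, 0.0],
                 [0.8, 0.0],
                 [0.0, 1.0],
                 [0.0, 0.5]])
H = np.diag([1.0, 2.0])                 # factor variances h_1, h_2
Sigma = np.diag([0.1, 0.2, 0.1, 0.3])   # idiosyncratic variances

cov_y = beta @ H @ beta.T + Sigma       # Var(y_i) = beta H beta' + Sigma

# Simulate f_i ~ N(0, H) and y_i | f_i ~ N(beta f_i, Sigma), then compare.
rng = np.random.default_rng(4)
f = rng.standard_normal((100_000, 2)) * np.sqrt(np.diag(H))
eps = rng.standard_normal((100_000, 4)) * np.sqrt(np.diag(Sigma))
y = f @ beta.T + eps
cov_mc = np.cov(y, rowvar=False)
```

Series that load on disjoint sets of factors (here series 1 and 2 versus series 3 and 4) have zero implied covariance, so sparsity in β translates directly into a block structure in Var(y_i).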

Figure 1(a) of Carvalho et al. (2008)

See Lucas et al. (2006), Knowles and Ghahramani (2011), Ma and Zhao (2012), Runcie and Mukherjee (2013), Mayrink and Lucas (2013), Gao, Brown and Engelhardt (2013) and Zhao, Gao, Mukherjee and Engelhardt (2014) for additional contributions related to genomics.

Pati, Bhattacharya, Pillai and Dunson (2014)

Sparse factor models in econometrics

Frühwirth-Schnatter and Lopes (2009/2015): Parsimonious/Sparse Bayesian Factor Analysis when the Number of Factors is Unknown

Conti, Heckman, Lopes and Piatek (2011): Constructing Economically Justified Aggregates: An Application to the Early Origins of Health

Hahn and Lopes (2013): Factor model shrinkage for linear instrumental variable analysis with many instruments

Kastner, Frühwirth-Schnatter and Lopes (2015): Sparse Bayesian Latent Factor Stochastic Volatility Models for High-Dimensional Financial Time Series

Case 1: Constructing Economically Justified Aggregates

Application to the 1970 British Cohort Study to analyze the effect of child cognition and mental/physical health on educational choices and adult economic and health outcomes.

A survey of all babies born (alive or dead) after the 24th week of gestation from 00.01 hours on Sunday, 5th April to 24.00 hours on Saturday, 11th April, 1970 in England, Scotland, Wales and Northern Ireland.

Follow-ups (so far): 1975, 1980, 1986, 1996, 2000, 2004, 2008.

Background characteristics:

Cognitive, mental, physical health measurements (age 10)
Education and adult outcomes (age 30)

Sample size: 5,105 women and 5,420 men.

Overall model

[Matrix equation: a stacked vector of latent variables, ending in Y*_01, Y*_11, . . . , Y*_0S, Y*_1S, is written as a loading matrix with rows (α'_1, 0), . . . , (α'_Q, 0), (α'_D, α'_Z), (α'_01, 0), (α'_11, 0), . . . , (α'_0S, 0), (α'_1S, 0) applied to θ, plus an error term.]

Two main latent factors

ADHD: Attention Deficit Hyperactivity Disorder. It is loaded highly by items which are related to the child's inability to pay attention or to the child's hyperactivity, as perceived by the teacher.

IQ: Cognitive Ability. It is loaded highly by cognitive ability tests.

Case 2: Exchange rates

EUR exchange rates. The data stem from the European Central Bank's Statistical Data Warehouse and comprise T = 3140 observations of 20 currencies, ranging from January 3, 2000 to April 4, 2012.

Factor stochastic volatility model

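The basic FSV model is written as

$$y_t \mid f_t \sim N(\beta f_t, \Sigma_t)$$

$$f_t \sim N(0, H_t)$$

$$\Sigma_t = \mathrm{diag}(\sigma_{1t}^2, \ldots, \sigma_{pt}^2)$$

$$H_t = \mathrm{diag}(\sigma_{p+1,t}^2, \ldots, \sigma_{p+k,t}^2)$$

and

$$\log \sigma_{it}^2 \mid \log \sigma_{i,t-1}^2 \sim N(\mu_i + \phi_i(\log \sigma_{i,t-1}^2 - \mu_i), \ \tau_i^2)$$

for i = 1, . . . , p + k.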
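A minimal simulation sketch of the basic FSV model, in which each of the p + k log-variances follows its own AR(1) process. The function name and all parameter values are hypothetical, and starting each log-variance from its stationary distribution is an assumption made here for convenience (it requires |φ_i| < 1).

```python
import numpy as np

def simulate_fsv(beta, mu, phi, tau, T, seed=0):
    """Simulate y_t from the basic FSV model with AR(1) log-variances."""
    rng = np.random.default_rng(seed)
    p, k = beta.shape
    # Assumption: start each log-variance from its stationary distribution.
    log_s2 = mu + tau / np.sqrt(1.0 - phi**2) * rng.standard_normal(p + k)
    y = np.empty((T, p))
    for t in range(T):
        # log sigma2_it | log sigma2_i,t-1 ~ N(mu_i + phi_i (. - mu_i), tau_i^2)
        log_s2 = mu + phi * (log_s2 - mu) + tau * rng.standard_normal(p + k)
        sd = np.exp(log_s2 / 2.0)
        f_t = sd[p:] * rng.standard_normal(k)                # f_t ~ N(0, H_t)
        y[t] = beta @ f_t + sd[:p] * rng.standard_normal(p)  # y_t | f_t ~ N(beta f_t, Sigma_t)
    return y

beta = np.array([[1.0], [0.5], [0.0]])   # p = 3 series, k = 1 factor
mu = np.full(4, -1.0)                    # one entry per log-variance process
phi = np.full(4, 0.95)
tau = np.full(4, 0.2)
y_sim = simulate_fsv(beta, mu, phi, tau, T=500)
```

The first p entries of the log-variance vector drive the idiosyncratic variances in Σ_t and the last k entries drive the factor variances in H_t, matching the index range i = 1, . . . , p + k.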

Key features:

Griffin and Brown's (2010) normal-gamma prior on β.
Rue's (2001) and McCausland, Miller, and Pelletier's (2011) All WithOut a Loop (AWOL) sampling.
Yu and Meng's (2011) Ancillarity-Sufficiency Interweaving Strategy (ASIS).

Time-varying Covariances


