

David J. Olive

Theory for Linear Models

January 29, 2019

Springer


Preface

Many statistics departments offer a one semester graduate course in linear model theory. Three good books on linear model theory, in increasing order of difficulty, are Myers and Milton (1991), Seber and Lee (2003), and Christensen (2011). Other texts include Freedman (2005), Graybill (1976, 2000), Guttman (1982), Hocking (2013), Monahan (2008), Muller and Stewart (2006), Rao (1973), Rao et al. (2008), Ravishanker and Dey (2002), Rencher and Schaalje (2008), Scheffe (1959), Searle and Gruber (2017), Sengupta and Jammalamadaka (2003), Stapleton (2009), and Wang and Chow (1994). A good summary is Olive (2017a, ch. 11).

The prerequisites for this text are i) a calculus based course in statistics at the level of Chihara and Hesterberg (2011), Hogg and Tanis (2005), Larsen and Marx (2011), Wackerly, Mendenhall and Scheaffer (2008), or Walpole, Myers, Myers and Ye (2006). ii) Linear algebra at the level of Anton and Rorres (2014) and Leon (2015). iii) A calculus based course in multiple linear regression at the level of Abraham and Ledolter (2006), Kutner et al. (2005), Olive (2017a), and Cook and Weisberg (1999).

Some highlights of this text follow.

• Prediction intervals are given that can be useful even if n < p.
• The response plot is useful for checking the model.
• The large sample theory for the elastic net, lasso, and ridge regression is greatly simplified.
• The bootstrap is used for inference after variable selection if n ≥ 10p.
• Half set inference is used for inference after variable selection or model building.

Downloading the book's R functions linmodpack.txt and data files linmoddata.txt into R: the commands

source("http://lagrange.math.siu.edu/Olive/linmodpack.txt")

source("http://lagrange.math.siu.edu/Olive/linmoddata.txt")


The R software is used in this text. See R Core Team (2016). Some packages used in the text include glmnet Friedman et al. (2015), leaps Lumley (2009), MASS Venables and Ripley (2010), and pls Mevik et al. (2015).

Acknowledgements

Teaching this course in Math 584 at Southern Illinois University was very useful.


Contents

1 Introduction
  1.1 Overview
  1.2 Response Plots and Response Transformations
    1.2.1 Response and Residual Plots
    1.2.2 Response Transformations
  1.3 Large Sample Theory
    1.3.1 The CLT and the Delta Method
    1.3.2 Modes of Convergence and Consistency
    1.3.3 Slutsky's Theorem and Related Results
    1.3.4 Multivariate Limit Theorems
  1.4 Mixture Distributions
  1.5 The Multivariate Normal Distribution
  1.6 Elliptically Contoured Distributions
  1.7 Summary
  1.8 Complements
  1.9 Problems

2 Full Rank Linear Models
  2.1 Projection Matrices and the Column Space
  2.2 Quadratic Forms
  2.3 Least Squares Theory
    2.3.1 Hypothesis Testing
  2.4 Summary
  2.5 Complements
  2.6 Problems

3 Nonfull Rank Linear Models and Cell Means Models
  3.1 Nonfull Rank Linear Models
  3.2 Cell Means Models
  3.3 Summary
  3.4 Complements
  3.5 Problems

4 Variable Selection When n >> p
  4.1 Prediction Intervals
  4.2 Prediction Regions
  4.3 Bootstrapping Hypothesis Tests and Confidence Regions
    4.3.1 Bootstrapping the Population Coefficient of Multiple Determination
    4.3.2 The Bootstrap
    4.3.3 The Prediction Region Method for Hypothesis Testing
    4.3.4 Theory for the Prediction Region Method
    4.3.5 Bootstrapping Variable Selection
  4.4 Summary
  4.5 Complements
  4.6 Problems

5 Statistical Learning Alternatives to OLS When n >> p
  5.1 The MLR Model
  5.2 Forward Selection
  5.3 Principal Components Regression
  5.4 Partial Least Squares
  5.5 Ridge Regression
  5.6 Lasso
  5.7 Relaxed Lasso
  5.8 The Elastic Net
  5.9 Prediction Intervals
  5.10 Choosing from M Models
  5.11 Hypothesis Testing After Model Selection, n/p Large
  5.12 Two Set Inference
  5.13 Simulations
  5.14 Summary
  5.15 Complements
  5.16 Problems

6 What if n is not >> p?
  6.1 Two Set Inference
  6.2 Summary
  6.3 Complements
  6.4 Problems

7 Robust Regression
  7.1 Outlier Detection
    7.1.1 The Location Model
    7.1.2 Outlier Detection with Mahalanobis Distances
    7.1.3 The RMVN and RFCH Estimators
    7.1.4 Outlier Detection if p > n
  7.2 Resistant Multiple Linear Regression
  7.3 MLR Outlier Detection
  7.4 Robust Regression
    7.4.1 MLR Breakdown and Equivariance
    7.4.2 A Practical High Breakdown Consistent Estimator
  7.5 Summary
  7.6 Complements
    7.6.1 More on Robust Regression
  7.7 Problems

8 Multivariate Linear Regression
  8.1 Introduction
  8.2 Plots for the Multivariate Linear Regression Model
  8.3 Asymptotically Optimal Prediction Regions
  8.4 Testing Hypotheses
  8.5 An Example and Simulations
    8.5.1 Simulations for Testing
    8.5.2 Simulations for Prediction Regions
  8.6 Two Robust Estimators
    8.6.1 The rmreg Estimator
    8.6.2 The rmreg2 Estimator
  8.7 Bootstrap
    8.7.1 Parametric Bootstrap
    8.7.2 Residual Bootstrap
    8.7.3 Nonparametric Bootstrap
  8.8 Two Set Inference
  8.9 Summary
  8.10 Complements
  8.11 Problems

9 One Way MANOVA Type Models
  9.1 Two Set Inference
  9.2 Summary
  9.3 Complements
  9.4 Problems

10 1D Regression
  10.1 Estimating the Sufficient Predictor
  10.2 Visualizing 1D Regression
  10.3 Predictor Transformations
  10.4 Variable Selection
  10.5 Inference
  10.6 Complements
  10.7 Problems

11 Stuff for Students
  11.1 R
  11.2 Hints for Selected Problems
  11.3 Projects
  11.4 Tables

Index


Chapter 1

Introduction

This chapter provides a preview of the book, and some techniques useful for visualizing data in the background of the data. Some large sample theory is presented in Section 1.3, Section 1.4 covers mixture distributions, and Section 1.5 covers the multivariate normal distribution.

1.1 Overview

Linear Model Theory provides theory for the multiple linear regression model. This text will also give theory for some related regression models such as generalized linear models and the multivariate linear regression model where there are m ≥ 2 response variables. Some Statistical Learning techniques will also be covered.

Statistical Learning could be defined as the statistical analysis of multivariate data. Machine learning, data mining, big data, analytics, business analytics, data analytics, and predictive analytics are synonymous terms. The techniques are useful for Data Science and Statistics, the science of extracting information from data. The R software will be used. See R Core Team (2016).

Let z = (z1, ..., zk)^T where z1, ..., zk are k random variables. Often z = (Y, x^T)^T where x^T = (x1, ..., xp) is the vector of predictors and Y is the variable of interest, called a response variable. Predictor variables are also called independent variables, covariates, or features. The response variable is also called the dependent variable. Usually context will be used to decide whether z is a random vector or the observed random vector.

Definition 1.1. A case or observation consists of k random variables measured for one person or thing. The ith case zi = (zi1, ..., zik)^T. The training data consists of z1, ..., zn. A statistical model or method is fit (trained) on the training data. The test data consists of zn+1, ..., zn+m, and the test data is often used to evaluate the quality of the fitted model.

Following James et al. (2013, p. 30), the previously unseen test data is not used to train the Statistical Learning method, but interest is in how well the method performs on the test data. If the training data is (x1, Y1), ..., (xn, Yn), and the previously unseen test data is (xf, Yf), then particular interest is in the accuracy of the estimator Ŷf of Yf obtained when the Statistical Learning method is applied to the predictor xf. The two Pelawa Watagoda and Olive (2018) prediction intervals, developed in Section 2.2, will be tools for evaluating Statistical Learning methods for the additive error regression model Yi = m(xi) + ei = E(Yi|xi) + ei for i = 1, ..., n where E(W) is the expected value of the random variable W. The multiple linear regression (MLR) model Yi = β1 + xi,2 β2 + · · · + xi,p βp + ei = xi^T β + ei is an important special case.

The estimator Ŷf is a prediction if the response variable Yf is continuous, as occurs in regression models. If Yf is categorical, then Ŷf is a classification. For example, if Yf can be 0 or 1, then xf is classified to belong to group i if Ŷf = i for i = 0 or 1.

Following Marden (2006, pp. 5-6), the focus of supervised learning is predicting a future value of the response variable Yf given xf and the training data (Y1, x1), ..., (Yn, xn). Hence the focus is not on hypothesis testing, confidence intervals, parameter estimation, or which model fits best, although these four inference topics can be useful for better prediction.

Notation: Typically lower case boldface letters such as x denote column vectors, while upper case boldface letters such as S or Y are used for matrices or column vectors. If context is not enough to determine whether y is a random vector or an observed random vector, then Y = (Y1, ..., Yp)^T may be used for the random vector, and y = (y1, ..., yp)^T for the observed value of the random vector. An upper case letter such as Y will usually be a random variable. A lower case letter such as x1 will also often be a random variable. An exception to this notation is the generic multivariate location and dispersion estimator (T, C) where the location estimator T is a p × 1 vector such as T = x̄. C is a p × p dispersion estimator and conforms to the above notation.

The main focus of the first seven chapters is developing tools to analyze the multiple linear regression (MLR) model Yi = xi^T β + ei for i = 1, ..., n. Classical regression techniques use (ordinary) least squares (OLS) and assume n >> p, but Statistical Learning methods often give useful results if p >> n. OLS forward selection, lasso, ridge regression, the elastic net, and principal components regression (PCR) will be some of the techniques examined.

For classical regression and multivariate analysis, we often want n ≥ 10p, and a model with n < 5p is overfitting: the model does not have enough data to estimate parameters accurately. Statistical Learning methods often use a model with a crude degrees of freedom d, where n ≥ Jd with J ≥ 5 and preferably J ≥ 10.

Acronyms are widely used in regression and Statistical Learning, and some of the more important acronyms appear in Table 1.1. Also see the text's index.

Table 1.1 Acronyms

Acronym   Description
AER       additive error regression
AP        additive predictor = SP for a GAM
cdf       cumulative distribution function
cf        characteristic function
CI        confidence interval
CLT       central limit theorem
CV        cross validation
EC        elliptically contoured
EAP       estimated additive predictor = ESP for a GAM
ESP       estimated sufficient predictor
ESSP      estimated sufficient summary plot = response plot
GAM       generalized additive model
GLM       generalized linear model
iid       independent and identically distributed
lasso     an MLR method
LR        logistic regression
MAD       the median absolute deviation
MCLT      multivariate central limit theorem
MED       the median
mgf       moment generating function
MLD       multivariate location and dispersion
MLR       multiple linear regression
MVN       multivariate normal
OLS       ordinary least squares
pdf       probability density function
PI        prediction interval
pmf       probability mass function
SE        standard error
SP        sufficient predictor
SSP       sufficient summary plot

Remark 1.1. There are several important Statistical Learning principles.
1) There is more interest in prediction or classification, e.g. producing Ŷf, than in other types of inference such as parameter estimation, hypothesis testing, confidence intervals, or which model fits best.
2) Often the focus is on extracting useful information when n/p is not large, e.g. p > n. If d is a crude estimator of the fitted model degrees of freedom, we want n/d large. A sparse model has few nonzero coefficients. We can have sparse population models and sparse fitted models. Sometimes sparse fitted models are useful even if the population model is dense (not sparse). Often the number of nonzero coefficients of a sparse fitted model = d. Sparse fitted models are often useful for prediction.
3) Interest is in how well the method performs on test data. Performance on training data is overly optimistic for estimating performance on test data.
4) Some methods are flexible while others are inflexible. For inflexible regression methods, the sufficient predictor is often a hyperplane SP = x^T β (see Definition 1.2), and often the mean function E(Y|x) = M(x^T β) where the function M is known but the p × 1 vector of parameters β is unknown and must be estimated (GLMs). Flexible methods tend to be useful for more complicated regression methods where E(Y|x) = m(x) for an unknown function m or SP ≠ x^T β (GAMs). Flexibility tends to increase with d.

1.2 Response Plots and Response Transformations

This section will consider tools for visualizing the regression model in the background of the data. The definitions in this section tend not to depend on whether n/p is large or small, but the estimator ĥ tends to be better if n/p is large. In regression, the response variable is the variable of interest: the variable you want to predict. The predictors or features x1, ..., xp are variables used to predict Y.

Definition 1.2. Regression investigates how the response variable Y changes with the value of a p × 1 vector x of predictors. Often this conditional distribution Y|x is described by a 1D regression model, where Y is conditionally independent of x given the sufficient predictor SP = h(x), written

Y ⫫ x | SP or Y ⫫ x | h(x),    (1.1)

where the real valued function h : R^p → R. The estimated sufficient predictor ESP = ĥ(x). An important special case is a model with a linear predictor h(x) = α + β^T x where ESP = α̂ + β̂^T x. This class of models includes the generalized linear model (GLM). Another important special case is a generalized additive model (GAM), where Y is independent of x = (x1, ..., xp)^T given the additive predictor AP = α + ∑_{j=1}^p Sj(xj) for some (usually unknown) functions Sj. The estimated additive predictor EAP = ESP = α̂ + ∑_{j=1}^p Ŝj(xj).
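As a concrete illustration of Definition 1.2, the following sketch computes an ESP for a GLM and an EAP for a GAM in R. The mgcv package, the simulated data, and the variable names are assumptions chosen for illustration; they are not part of the text's software.

# A minimal sketch (simulated data) of computing the ESP for a GLM and the
# EAP for a GAM; mgcv is an assumed add-on package, not the text's linmodpack.
library(mgcv)                      # provides gam() and s()
set.seed(1)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y <- rbinom(n, 1, prob = 1/(1 + exp(-(1 + x1 - x2))))  # a logistic 1D model
# GLM: SP = alpha + beta^T x, so the ESP is the estimated linear predictor.
glm_fit <- glm(y ~ x1 + x2, family = binomial)
ESP <- predict(glm_fit, type = "link")
# GAM: AP = alpha + S_1(x1) + S_2(x2), so the EAP is the estimated additive predictor.
gam_fit <- gam(y ~ s(x1) + s(x2), family = binomial)
EAP <- predict(gam_fit, type = "link")
plot(ESP, y)   # a response plot: ESP on the horizontal axis, Y on the vertical axis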

Notation. Often the index i will be suppressed. For example, the multiple linear regression model

Yi = α + β^T xi + ei    (1.2)

for i = 1, ..., n where β is a p × 1 unknown vector of parameters, and ei is a random error. This model could be written Y = α + β^T x + e. More accurately, Y|x = α + β^T x + e, but the conditioning on x will often be suppressed. Often the errors e1, ..., en are iid (independent and identically distributed) from a distribution that is known except for a scale parameter. For example, the ei's might be iid from a normal (Gaussian) distribution with mean 0 and unknown standard deviation σ. For this Gaussian model, estimation of α, β, and σ is important for inference and for predicting a new future value of the response variable Yf given a new vector of predictors xf.

1.2.1 Response and Residual Plots

Definition 1.3. An estimated sufficient summary plot (ESSP) or response plot is a plot of the ESP versus Y. A residual plot is a plot of the ESP versus the residuals.

Notation: In this text, a plot of x versus Y will have x on the horizontal axis, and Y on the vertical axis. For the additive error regression model Y = m(x) + e, the ith residual is ri = Yi − m̂(xi) = Yi − Ŷi where Ŷi = m̂(xi) is the ith fitted value.

For the additive error regression model, the response plot is a plot of Ŷ versus Y where the identity line with unit slope and zero intercept is added as a visual aid. The residual plot is a plot of Ŷ versus r. Assume the errors ei are iid from a unimodal distribution that is not highly skewed. Then the plotted points should scatter about the identity line and the r = 0 line (the horizontal axis) with no other pattern if the fitted model (that produces m̂(x)) is good.
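The two plots just described are easy to make in base R. The following sketch uses simulated data and plain plot() calls rather than the text's MLRplot function; the data and variable names are assumptions for illustration only.

# A minimal sketch (simulated data, base R graphics) of a response plot with
# the identity line and a residual plot with the r = 0 line.
set.seed(2)
n <- 100
x <- matrix(rnorm(n * 3), nrow = n)
y <- 10 + x %*% c(2, -1, 3) + rnorm(n)
fit <- lsfit(x, y)                       # OLS fit, as used elsewhere in this chapter
res <- fit$res
yhat <- y - res                          # fitted values
par(mfrow = c(1, 2))
plot(yhat, y, main = "Response Plot");  abline(0, 1)    # identity line
plot(yhat, res, main = "Residual Plot"); abline(h = 0)  # r = 0 line
par(mfrow = c(1, 1))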

Fig. 1.1 Residual and Response Plots for the Tremearne Data

Example 1.1. Tremearne (1911) presents a data set of about 17 measurements on 115 people of Hausa nationality. We deleted 3 cases because of missing values and used height as the response variable Y. Along with a constant xi,1 ≡ 1, the five additional predictor variables used were height when sitting, height when kneeling, head length, nasal breadth, and span (perhaps from left hand to right hand). Figure 1.1 presents the (ordinary) least squares (OLS) response and residual plots for this data set. These plots show that an MLR model Y = x^T β + e should be a useful model for the data since the plotted points in the response plot are linear and follow the identity line while the plotted points in the residual plot follow the r = 0 line with no other pattern (except for a possible outlier marked 44). Note that many important acronyms, such as OLS and MLR, appear in Table 1.1.

To use the response plot to visualize the conditional distribution of Y|x^T β, use the fact that the fitted values Ŷ = x^T β̂. For example, suppose the height given fit = 1700 is of interest. Mentally examine the plot about a narrow vertical strip about fit = 1700, perhaps from 1685 to 1715. The cases in the narrow strip have a mean close to 1700 since they fall close to the identity line. Similarly, when the fit = w for w between 1500 and 1850, the cases have heights near w, on average.

Cases 3, 44, and 63 are highlighted. The 3rd person was very tall while the 44th person was rather short. Beginners often label too many points as outliers: cases that lie far away from the bulk of the data. Mentally draw a box about the bulk of the data ignoring any outliers. Double the width of the box (about the identity line for the response plot and about the horizontal line for the residual plot). Cases outside of this imaginary doubled box are potential outliers. Alternatively, visually estimate the standard deviation of the residuals in both plots. In the residual plot look for residuals that are more than 5 standard deviations from the r = 0 line. In Figure 1.1, the standard deviation of the residuals appears to be around 10. Hence cases 3 and 44 are certainly worth examining.

The identity line can also pass through or near an outlier or a cluster of outliers. Then the outliers will be in the upper right or lower left of the response plot, and there will be a large gap between the cluster of outliers and the bulk of the data. Figure 1.1 was made with the following R commands, using the slpack function MLRplot and the major.lsp data set from the text's webpage.

major <- matrix(scan(),nrow=112,ncol=7,byrow=T)

#copy and paste the data set, then press enter

major <- major[,-1]

X<-major[,-6]

Y <- major[,6]

MLRplot(X,Y) #left click the 3 highlighted cases,

#then right click Stop for each of the two plots

A problem with response and residual plots is that there can be a lot of black in the plot if the sample size n is large (more than a few thousand). A variant of the response plot for the additive error regression model would plot the identity line and the two lines parallel to the identity line corresponding to the Section 2.2 large sample 100(1 − δ)% prediction intervals for Yf that depend on Ŷf. Then plot points corresponding to training data cases that do not lie in their 100(1 − δ)% PI. Use δ = 0.01 or 0.05. Try the following commands that used δ = 0.2 since n is small. The commands use the slpack function AERplot.

out<-lsfit(X,Y)


res<-out$res

yhat<-Y-res

AERplot(yhat,Y,res=res,d=2,alph=1) #usual response plot

AERplot(yhat,Y,res=res,d=2,alph=0.2)

#plots data outside the 80% pointwise PIs

n<-100000; q<-7

b <- 0 * 1:q + 1

x <- matrix(rnorm(n * q), nrow = n, ncol = q)

y <- 1 + x %*% b + rnorm(n)

out<-lsfit(x,y)

res<-out$res

yhat<-y-res

dd<-length(out$coef)

AERplot(yhat,y,res=res,d=dd,alph=1) #usual response plot

AERplot(yhat,y,res=res,d=dd,alph=0.01)

#plots data outside the 99% pointwise PIs

AERplot2(yhat,y,res=res,d=2)

#response plot with 90% pointwise prediction bands

1.2.2 Response Transformations

A response transformation Y = tλ(Z) can make the MLR model or additive error regression model hold if the variable of interest Z is measured on the wrong scale. For MLR, Y = tλ(Z) = x^T β + e, while for additive error regression, Y = tλ(Z) = m(x) + e. Predictor transformations are used to remove gross nonlinearities in the predictors, and this technique is often very useful. However, if there are hundreds or more predictors, graphical methods for predictor transformations take too long. Olive (2017a, Section 3.1) describes graphical methods for predictor transformations.

Power transformations are particularly effective, and a power transformation has the form x = tλ(w) = w^λ for λ ≠ 0 and x = t0(w) = log(w) for λ = 0. Often λ ∈ ΛL where

ΛL = {−1, −1/2, −1/3, 0, 1/3, 1/2, 1}    (1.3)

is called the ladder of powers. Often when a power transformation is needed, a transformation that goes "down the ladder," e.g. from λ = 1 to λ = 0, will be useful. If the transformation goes too far down the ladder, e.g. if λ = 0 is selected when λ = 1/2 is needed, then it will be necessary to go back "up the ladder." Additional powers such as ±2 and ±3 can always be added. The following rules are useful for both response transformations and predictor transformations.


a) The log rule states that a positive variable that has the ratio between the largest and smallest values greater than ten should be transformed to logs. So W > 0 and max(W)/min(W) > 10 suggests using log(W).

b) The ladder rule appears in Cook and Weisberg (1999a, p. 86), and is used for a plot of two variables, such as ESP versus Y for response transformations or x1 versus x2 for predictor transformations.
Ladder rule: To spread small values of a variable, make λ smaller. To spread large values of a variable, make λ larger.

Consider the ladder of powers. Often no transformation (λ = 1) is best, then the log transformation, then the square root transformation, then the reciprocal transformation.
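The log rule in a) is easy to check numerically. The sketch below is an illustration only: the function name logrule and the simulated predictor matrix are assumptions, not part of the text's software.

# A minimal sketch (simulated data) of applying the log rule from a) above
# to each column of a matrix of positive predictors.
logrule <- function(w) {
  if (min(w) > 0 && max(w)/min(w) > 10) log(w) else w
}
set.seed(3)
x <- cbind(runif(100, 1, 2), rexp(100) + 0.01)  # second column has max/min > 10
xt <- apply(x, 2, logrule)                      # transform only the flagged columns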

Fig. 1.2 Plots to Illustrate the Ladder Rule

Example 1.2. Examine Figure 1.2. Since w is on the horizontal axis, mentally add a narrow vertical slice to the plot. If a large amount of data falls in the slice at the left of the plot, then small values need spreading. Similarly, if a large amount of data falls in the slice at the right of the plot (compared to the middle and left of the plot), then large values need spreading. For the variable on the vertical axis, make a narrow horizontal slice. If the plot looks roughly like the northwest corner of a square then small values of the horizontal and large values of the vertical variable need spreading. Hence in Figure 1.2a, small values of w need spreading. If the plot looks roughly like the northeast corner of a square, then large values of both variables need spreading. Hence in Figure 1.2b, large values of x need spreading. If the plot looks roughly like the southwest corner of a square, as in Figure 1.2c, then small values of both variables need spreading. If the plot looks roughly like the southeast corner of a square, then large values of the horizontal and small values of the vertical variable need spreading. Hence in Figure 1.2d, small values of x need spreading.

Consider the additive error regression model Y = m(x) + e. Then the response transformation model is Y = tλ(Z) = mλ(x) + e, and the graphical method for selecting the response transformation is to plot m̂λi(x) versus tλi(Z) for several values of λi, choosing the value of λ = λ0 where the plotted points follow the identity line with unit slope and zero intercept. For the multiple linear regression model, m̂λi(x) = x^T β̂λi where β̂λi can be found using the desired fitting method, e.g. OLS or lasso.

Definition 1.4. Assume that all of the values of the "response" Zi are positive. A power transformation has the form Y = tλ(Z) = Z^λ for λ ≠ 0 and Y = t0(Z) = log(Z) for λ = 0 where

λ ∈ ΛL = {−1, −1/2, −1/3, 0, 1/3, 1/2, 1}.


Definition 1.5. Assume that all of the values of the "response" Zi are positive. Then the modified power transformation family

tλ(Zi) ≡ Zi^(λ) = (Zi^λ − 1)/λ    (1.4)

for λ ≠ 0 and Zi^(0) = log(Zi). Generally λ ∈ Λ where Λ is some interval such as [−1, 1] or a coarse subset such as ΛL. This family is a special case of the response transformations considered by Tukey (1957).
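A short R function makes Definitions 1.4 and 1.5 concrete. The sketch below is for illustration only; the function name tlam is an assumption and is not a function from the text's software.

# A minimal sketch of the power transformation of Definition 1.4 and the
# modified power transformation (1.4) of Definition 1.5.
tlam <- function(z, lambda, modified = FALSE) {
  if (lambda == 0) return(log(z))
  if (modified) (z^lambda - 1)/lambda else z^lambda
}
z <- c(1, 10, 100, 1000)
tlam(z, 0)                     # log transformation
tlam(z, 1/2)                   # square root transformation
tlam(z, -1, modified = TRUE)   # modified power transformation with lambda = -1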

A graphical method for response transformations refits the model using the same fitting method: changing only the "response" from Z to tλ(Z). Compute the "fitted values" Ŵi using Wi = tλ(Zi) as the "response." Then a transformation plot of Ŵi versus Wi is made for each of the seven values of λ ∈ ΛL with the identity line added as a visual aid. Vertical deviations from the identity line are the "residuals" ri = Wi − Ŵi. Then a candidate response transformation Y = tλ∗(Z) is reasonable if the plotted points follow the identity line in a roughly evenly populated band if the MLR or additive error regression model is reasonable for Y = W and x. Curvature from the identity line suggests that the candidate response transformation is inappropriate.

Notice that the graphical method is equivalent to making "response plots" for the seven values of W = tλ(Z), and choosing the "best response plot" where the MLR model seems "most reasonable." The seven "response plots" are called transformation plots below. Our convention is that a plot of X versus Y means that X is on the horizontal axis and Y is on the vertical axis.

Definition 1.6. A transformation plot is a plot of Ŵ versus W with the identity line added as a visual aid.

Fig. 1.3 Four Transformation Plots for the Textile Data

There are several reasons to use a coarse grid of powers. First, several of the powers correspond to simple transformations such as the log, square root, and cube root. These powers are easier to interpret than λ = 0.28, for example. According to Mosteller and Tukey (1977, p. 91), the most commonly used power transformations are the λ = 0 (log), λ = 1/2, λ = −1, and λ = 1/3 transformations in decreasing frequency of use. Secondly, if the estimator λ̂n can only take values in ΛL, then sometimes λ̂n will converge (e.g. in probability) to λ∗ ∈ ΛL. Thirdly, Tukey (1957) showed that neighboring power transformations are often very similar, so restricting the possible powers to a coarse grid is reasonable. Note that powers can always be added to the grid ΛL. Useful powers are ±1/4, ±2/3, ±2, and ±3. Powers from numerical methods can also be added.

Application 1.1. This graphical method for selecting a response transformation is very simple. Let Wi = tλ(Zi). Then for each of the seven values of λ ∈ ΛL, perform the regression fitting method, such as OLS or lasso, on (Wi, xi) and make the transformation plot of Ŵi versus Wi. If the plotted points follow the identity line for λ∗, then take λ̂o = λ∗, that is, Y = tλ∗(Z) is the response transformation.

If more than one value of λ ∈ ΛL gives a linear plot, take the simplest or most reasonable transformation or the transformation that makes the most sense to subject matter experts. Also check that the corresponding "residual plots" of Ŵ versus W − Ŵ look reasonable. The values of λ in decreasing order of importance are 1, 0, 1/2, −1, and 1/3. So the log transformation would be chosen over the cube root transformation if both transformation plots look equally good.

After selecting the transformation, the usual checks should be made. In particular, the transformation plot for the selected transformation is the response plot, and a residual plot should also be made. The following example illustrates the procedure, and the plots show W = tλ(Z) on the vertical axis. The label "TZHAT" on the horizontal axis denotes the "fitted values" Ŵ that result from using W = tλ(Z) as the "response" in the OLS software.
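An OLS version of Application 1.1 can be coded in a few lines. The sketch below uses simulated data whose true transformation is log(Z); the data, variable names, and layout are assumptions for illustration, not the text's software or data.

# A minimal sketch of Application 1.1 with OLS: for each lambda in the ladder
# of powers, refit with W = t_lambda(Z) as the "response" and plot What vs W.
set.seed(4)
n <- 100
x <- matrix(rnorm(n * 3), nrow = n)
z <- exp(1 + x %*% c(1, 0.5, -0.5) + rnorm(n, sd = 0.3))   # so log(Z) follows an MLR model
lambdas <- c(-1, -1/2, -1/3, 0, 1/3, 1/2, 1)
par(mfrow = c(3, 3))
for (lam in lambdas) {
  w <- if (lam == 0) log(z) else z^lam
  fit  <- lsfit(x, w)
  what <- w - fit$res                  # "fitted values" TZHAT
  plot(what, w, xlab = "TZHAT", ylab = "W",
       main = paste("lambda =", round(lam, 2)))
  abline(0, 1)                         # identity line
}
par(mfrow = c(1, 1))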

Example 1.3: Textile Data. In their pioneering paper on response transformations, Box and Cox (1964) analyze data from a 3³ experiment on the behavior of worsted yarn under cycles of repeated loadings. The "response" Z is the number of cycles to failure and a constant is used along with the three predictors length, amplitude, and load. Using the normal profile log likelihood for λo, Box and Cox determine λ̂o = −0.06 with approximate 95 percent confidence interval −0.18 to 0.06. These results give a strong indication that the log transformation may result in a relatively simple model, as argued by Box and Cox. Nevertheless, the numerical Box–Cox transformation method provides no direct way of judging the transformation against the data.

Shown in Figure 1.3 are transformation plots of Ŵ versus W = Z^λ for four values of λ, except log(Z) is used if λ = 0. The plots show how the transformations bend the data to achieve a homoscedastic linear trend. Perhaps more importantly, they indicate that the information on the transformation is spread throughout the data in the plot since changing λ causes all points along the curvilinear scatter in Figure 1.3a to form along a linear scatter in Figure 1.3c. Dynamic plotting using λ as a control seems quite effective for judging transformations against the data and the log response transformation does indeed seem reasonable.

Note the simplicity of the method: Figure 1.3a shows that a response transformation is needed since the plotted points follow a nonlinear curve while Figure 1.3c suggests that Y = log(Z) is the appropriate response transformation since the plotted points follow the identity line. If all 7 plots were made for λ ∈ ΛL, then λ = 0 would be selected since this plot is linear. Also, Figure 1.3a suggests that the log rule is reasonable since max(Z)/min(Z) > 10.

1.3 Large Sample Theory

The first three subsections will review large sample theory for the univariate case, then multivariate theory will be given.

1.3.1 The CLT and the Delta Method

Large sample theory, also called asymptotic theory, is used to approximate the distribution of an estimator when the sample size n is large. This theory is extremely useful if the exact sampling distribution of the estimator is complicated or unknown. To use this theory, one must determine what the estimator is estimating, the rate of convergence, the asymptotic distribution, and how large n must be for the approximation to be useful. Moreover, the (asymptotic) standard error (SE), an estimator of the asymptotic standard deviation, must be computable if the estimator is to be useful for inference. Often the bootstrap can be used to compute the SE. See Sections 2.3 and 3.12.
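As a concrete illustration of computing an SE with the bootstrap, the following sketch estimates the SE of the sample median by resampling. The simulated data and the number of resamples B are assumptions for illustration; this is not the text's bootstrap code.

# A minimal sketch (simulated data) of a bootstrap standard error: the SE of the
# sample median is estimated by the standard deviation of B resampled medians.
set.seed(5)
y <- rexp(100)                                  # observed sample
B <- 1000
medstar <- replicate(B, median(sample(y, replace = TRUE)))
SEboot <- sd(medstar)                           # bootstrap SE of the median
SEboot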

Theorem 1.2: the Central Limit Theorem (CLT). Let $Y_1, ..., Y_n$ be iid with $E(Y) = \mu$ and $\mathrm{VAR}(Y) = \sigma^2$. Let the sample mean $\overline{Y}_n = \frac{1}{n}\sum_{i=1}^n Y_i$. Then

$$\sqrt{n}(\overline{Y}_n - \mu) \xrightarrow{D} N(0, \sigma^2).$$

Hence

$$\sqrt{n}\left(\frac{\overline{Y}_n - \mu}{\sigma}\right) = \sqrt{n}\left(\frac{\sum_{i=1}^n Y_i - n\mu}{n\sigma}\right) \xrightarrow{D} N(0, 1).$$

Note that the sample mean is estimating the population mean $\mu$ with a $\sqrt{n}$ convergence rate, the asymptotic distribution is normal, and the $SE = S/\sqrt{n}$ where $S$ is the sample standard deviation. For distributions "close" to the normal distribution, the central limit theorem provides a good approximation if the sample size $n \geq 30$. Hesterberg (2014, pp. 41, 66) suggests $n \geq 5000$ is needed for moderately skewed distributions. A special case of the CLT is proven after Theorem 1.15.


Notation. The notation $X \sim Y$ and $X \stackrel{D}{=} Y$ both mean that the random variables $X$ and $Y$ have the same distribution. Hence $F_X(y) = F_Y(y)$ for all real $y$. The notation $Y_n \xrightarrow{D} X$ means that for large $n$ we can approximate the cdf of $Y_n$ by the cdf of $X$. The distribution of $X$ is the limiting distribution or asymptotic distribution of $Y_n$. For the CLT, notice that

$$Z_n = \sqrt{n}\left(\frac{\overline{Y}_n - \mu}{\sigma}\right) = \frac{\overline{Y}_n - \mu}{\sigma/\sqrt{n}}$$

is the z–score of $\overline{Y}_n$. If $Z_n \xrightarrow{D} N(0, 1)$, then the notation $Z_n \approx N(0, 1)$, also written as $Z_n \sim AN(0, 1)$, means approximate the cdf of $Z_n$ by the standard normal cdf. See Definition 1.20. Similarly, the notation

$$\overline{Y}_n \approx N(\mu, \sigma^2/n),$$

also written as $\overline{Y}_n \sim AN(\mu, \sigma^2/n)$, means approximate the cdf of $\overline{Y}_n$ as if $\overline{Y}_n \sim N(\mu, \sigma^2/n)$. The distribution of $X$ does not depend on $n$, but the approximate distribution $\overline{Y}_n \approx N(\mu, \sigma^2/n)$ does depend on $n$.

The two main applications of the CLT are to give the limiting distribution of $\sqrt{n}(\overline{Y}_n - \mu)$ and the limiting distribution of $\sqrt{n}(Y_n/n - \mu_X)$ for a random variable $Y_n$ such that $Y_n = \sum_{i=1}^n X_i$ where the $X_i$ are iid with $E(X) = \mu_X$ and $\mathrm{VAR}(X) = \sigma^2_X$.

Example 1.9. a) Let $Y_1, ..., Y_n$ be iid Ber($\rho$). Then $E(Y) = \rho$ and $\mathrm{VAR}(Y) = \rho(1 - \rho)$. (The Bernoulli ($\rho$) distribution is the binomial (1, $\rho$) distribution.) Hence

$$\sqrt{n}(\overline{Y}_n - \rho) \xrightarrow{D} N(0, \rho(1-\rho))$$

by the CLT.

b) Now suppose that $Y_n \sim \mathrm{BIN}(n, \rho)$. Then $Y_n \stackrel{D}{=} \sum_{i=1}^n X_i$ where $X_1, ..., X_n$ are iid Ber($\rho$). Hence

$$\sqrt{n}\left(\frac{Y_n}{n} - \rho\right) \xrightarrow{D} N(0, \rho(1-\rho))$$

since

$$\sqrt{n}\left(\frac{Y_n}{n} - \rho\right) \stackrel{D}{=} \sqrt{n}(\overline{X}_n - \rho) \xrightarrow{D} N(0, \rho(1-\rho))$$

by a).

c) Now suppose that $Y_n \sim \mathrm{BIN}(k_n, \rho)$ where $k_n \to \infty$ as $n \to \infty$. Then

$$\sqrt{k_n}\left(\frac{Y_n}{k_n} - \rho\right) \approx N(0, \rho(1-\rho))$$

or

$$\frac{Y_n}{k_n} \approx N\left(\rho, \frac{\rho(1-\rho)}{k_n}\right) \quad \text{or} \quad Y_n \approx N(k_n\rho, k_n\rho(1-\rho)).$$
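A small simulation (not from the text) illustrates part a): for Bernoulli data, the Monte Carlo distribution of the standardized sample proportion is close to its N(0, ρ(1 − ρ)) limit. The chosen ρ, n, and number of runs are assumptions for illustration.

# A minimal sketch (simulated data) illustrating Example 1.9 a): the Monte
# Carlo distribution of sqrt(n)(Ybar_n - rho) is close to N(0, rho(1 - rho)).
set.seed(6)
rho <- 0.3; n <- 200; nruns <- 5000
zn <- replicate(nruns, sqrt(n) * (mean(rbinom(n, 1, rho)) - rho))
c(mean(zn), var(zn), rho * (1 - rho))   # mean near 0, variance near 0.21
hist(zn, prob = TRUE, breaks = 40)
curve(dnorm(x, 0, sqrt(rho * (1 - rho))), add = TRUE)  # limiting pdf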

Theorem 1.3: the Delta Method. If $g$ does not depend on $n$, $g'(\theta) \neq 0$, and

$$\sqrt{n}(T_n - \theta) \xrightarrow{D} N(0, \sigma^2),$$

then

$$\sqrt{n}(g(T_n) - g(\theta)) \xrightarrow{D} N(0, \sigma^2[g'(\theta)]^2).$$

Example 1.10. Let $Y_1, ..., Y_n$ be iid with $E(Y) = \mu$ and $\mathrm{VAR}(Y) = \sigma^2$. Then by the CLT,

$$\sqrt{n}(\overline{Y}_n - \mu) \xrightarrow{D} N(0, \sigma^2).$$

Let $g(\mu) = \mu^2$. Then $g'(\mu) = 2\mu \neq 0$ for $\mu \neq 0$. Hence

$$\sqrt{n}((\overline{Y}_n)^2 - \mu^2) \xrightarrow{D} N(0, 4\sigma^2\mu^2)$$

for $\mu \neq 0$ by the delta method.

Example 1.11. Let $X \sim \mathrm{Binomial}(n, p)$ where the positive integer $n$ is large and $0 < p < 1$. Find the limiting distribution of

$$\sqrt{n}\left[\left(\frac{X}{n}\right)^2 - p^2\right].$$

Solution. Example 1.9 b) gives the limiting distribution of $\sqrt{n}(X/n - p)$. Let $g(p) = p^2$. Then $g'(p) = 2p$ and by the delta method,

$$\sqrt{n}\left[\left(\frac{X}{n}\right)^2 - p^2\right] = \sqrt{n}\left(g\left(\frac{X}{n}\right) - g(p)\right) \xrightarrow{D} N(0, p(1-p)(g'(p))^2) = N(0, p(1-p)4p^2) = N(0, 4p^3(1-p)).$$

Example 1.12. Let $X_n \sim \mathrm{Poisson}(n\lambda)$ where the positive integer $n$ is large and $\lambda > 0$.

a) Find the limiting distribution of $\sqrt{n}\left(\dfrac{X_n}{n} - \lambda\right)$.

b) Find the limiting distribution of $\sqrt{n}\left[\sqrt{\dfrac{X_n}{n}} - \sqrt{\lambda}\,\right]$.

Solution. a) $X_n \stackrel{D}{=} \sum_{i=1}^n Y_i$ where the $Y_i$ are iid Poisson($\lambda$). Hence $E(Y) = \lambda = \mathrm{Var}(Y)$. Thus by the CLT,

$$\sqrt{n}\left(\frac{X_n}{n} - \lambda\right) \stackrel{D}{=} \sqrt{n}\left(\frac{\sum_{i=1}^n Y_i}{n} - \lambda\right) \xrightarrow{D} N(0, \lambda).$$

b) Let $g(\lambda) = \sqrt{\lambda}$. Then $g'(\lambda) = \frac{1}{2\sqrt{\lambda}}$ and by the delta method,

$$\sqrt{n}\left[\sqrt{\frac{X_n}{n}} - \sqrt{\lambda}\,\right] = \sqrt{n}\left(g\left(\frac{X_n}{n}\right) - g(\lambda)\right) \xrightarrow{D} N(0, \lambda(g'(\lambda))^2) = N\left(0, \lambda\frac{1}{4\lambda}\right) = N\left(0, \frac{1}{4}\right).$$
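A quick simulation (not from the text) checks part b): the sampled values of √n(√(Xn/n) − √λ) should have variance near 1/4 regardless of λ. The chosen λ, n, and number of runs are assumptions for illustration.

# A minimal sketch (simulated data) checking Example 1.12 b): the delta method
# gives limiting variance 1/4 for sqrt(n)(sqrt(Xn/n) - sqrt(lambda)).
set.seed(7)
lambda <- 3; n <- 500; nruns <- 5000
xn <- rpois(nruns, n * lambda)                    # Xn ~ Poisson(n lambda)
wn <- sqrt(n) * (sqrt(xn/n) - sqrt(lambda))
c(mean(wn), var(wn))                              # variance should be near 0.25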

Example 1.13. Let $Y_1, ..., Y_n$ be independent and identically distributed (iid) from a Gamma($\alpha, \beta$) distribution.

a) Find the limiting distribution of $\sqrt{n}(\overline{Y} - \alpha\beta)$.

b) Find the limiting distribution of $\sqrt{n}((\overline{Y})^2 - c)$ for appropriate constant $c$.

Solution: a) Since $E(Y) = \alpha\beta$ and $V(Y) = \alpha\beta^2$, by the CLT

$$\sqrt{n}(\overline{Y} - \alpha\beta) \xrightarrow{D} N(0, \alpha\beta^2).$$

b) Let $\mu = \alpha\beta$ and $\sigma^2 = \alpha\beta^2$. Let $g(\mu) = \mu^2$ so $g'(\mu) = 2\mu$ and $[g'(\mu)]^2 = 4\mu^2 = 4\alpha^2\beta^2$. Then by the delta method,

$$\sqrt{n}((\overline{Y})^2 - c) \xrightarrow{D} N(0, \sigma^2[g'(\mu)]^2) = N(0, 4\alpha^3\beta^4)$$

where $c = \mu^2 = \alpha^2\beta^2$.

1.3.2 Modes of Convergence and Consistency

Definition 1.20. Let $\{Z_n, n = 1, 2, ...\}$ be a sequence of random variables with cdfs $F_n$, and let $X$ be a random variable with cdf $F$. Then $Z_n$ converges in distribution to $X$, written $Z_n \xrightarrow{D} X$, or $Z_n$ converges in law to $X$, written $Z_n \xrightarrow{L} X$, if

$$\lim_{n\to\infty} F_n(t) = F(t)$$

at each continuity point $t$ of $F$. The distribution of $X$ is called the limiting distribution or the asymptotic distribution of $Z_n$.

An important fact is that the limiting distribution does not depend on the sample size $n$. Notice that the CLT and delta method give the limiting distributions of $Z_n = \sqrt{n}(\overline{Y}_n - \mu)$ and $Z_n = \sqrt{n}(g(T_n) - g(\theta))$, respectively.

Convergence in distribution is useful if the distribution of $X_n$ is unknown or complicated and the distribution of $X$ is easy to use. Then for large $n$ we can approximate the probability that $X_n$ is in an interval by the probability that $X$ is in the interval. To see this, notice that if $X_n \xrightarrow{D} X$, then $P(a < X_n \leq b) = F_n(b) - F_n(a) \to F(b) - F(a) = P(a < X \leq b)$ if $F$ is continuous at $a$ and $b$.

Warning: convergence in distribution says that the cdf $F_n(t)$ of $X_n$ gets close to the cdf $F(t)$ of $X$ as $n \to \infty$ provided that $t$ is a continuity point of $F$. Hence for any $\varepsilon > 0$ there exists $N_t$ such that if $n > N_t$, then $|F_n(t) - F(t)| < \varepsilon$. Notice that $N_t$ depends on the value of $t$. Convergence in distribution does not imply that the random variables $X_n \equiv X_n(\omega)$ converge to the random variable $X \equiv X(\omega)$ for all $\omega$.

Example 1.14. Suppose that $X_n \sim U(-1/n, 1/n)$. Then the cdf $F_n(x)$ of $X_n$ is

$$F_n(x) = \begin{cases} 0, & x \leq \frac{-1}{n} \\ \frac{nx}{2} + \frac{1}{2}, & \frac{-1}{n} \leq x \leq \frac{1}{n} \\ 1, & x \geq \frac{1}{n}. \end{cases}$$

Sketching $F_n(x)$ shows that it has a line segment rising from 0 at $x = -1/n$ to 1 at $x = 1/n$ and that $F_n(0) = 0.5$ for all $n \geq 1$. Examining the cases $x < 0$, $x = 0$, and $x > 0$ shows that as $n \to \infty$,

$$F_n(x) \to \begin{cases} 0, & x < 0 \\ \frac{1}{2}, & x = 0 \\ 1, & x > 0. \end{cases}$$

Notice that the right hand side is not a cdf since right continuity does not hold at $x = 0$. Notice that if $X$ is a random variable such that $P(X = 0) = 1$, then $X$ has cdf

$$F_X(x) = \begin{cases} 0, & x < 0 \\ 1, & x \geq 0. \end{cases}$$

Since $x = 0$ is the only discontinuity point of $F_X(x)$ and since $F_n(x) \to F_X(x)$ for all continuity points of $F_X(x)$ (i.e. for $x \neq 0$), $X_n \xrightarrow{D} X$.

Example 1.15. Suppose $Y_n \sim U(0, n)$. Then $F_n(t) = t/n$ for $0 < t \leq n$ and $F_n(t) = 0$ for $t \leq 0$. Hence $\lim_{n\to\infty} F_n(t) = 0$ for $t \leq 0$. If $t > 0$ and $n > t$, then $F_n(t) = t/n \to 0$ as $n \to \infty$. Thus $\lim_{n\to\infty} F_n(t) = 0$ for all $t$, and $Y_n$ does not converge in distribution to any random variable $Y$ since $H(t) \equiv 0$ is not a cdf.

Definition 1.21. A sequence of random variables $X_n$ converges in distribution to a constant $\tau(\theta)$, written $X_n \xrightarrow{D} \tau(\theta)$, if $X_n \xrightarrow{D} X$ where $P(X = \tau(\theta)) = 1$. The distribution of the random variable $X$ is said to be degenerate at $\tau(\theta)$ or to be a point mass at $\tau(\theta)$.

Definition 1.22. A sequence of random variables $X_n$ converges in probability to a constant $\tau(\theta)$, written $X_n \xrightarrow{P} \tau(\theta)$, if for every $\varepsilon > 0$,

$$\lim_{n\to\infty} P(|X_n - \tau(\theta)| < \varepsilon) = 1 \quad \text{or, equivalently,} \quad \lim_{n\to\infty} P(|X_n - \tau(\theta)| \geq \varepsilon) = 0.$$

The sequence $X_n$ converges in probability to $X$, written $X_n \xrightarrow{P} X$, if $X_n - X \xrightarrow{P} 0$.

Notice that $X_n \xrightarrow{P} X$ if for every $\varepsilon > 0$,

$$\lim_{n\to\infty} P(|X_n - X| < \varepsilon) = 1, \quad \text{or, equivalently,} \quad \lim_{n\to\infty} P(|X_n - X| \geq \varepsilon) = 0.$$

Definition 1.23. Let the parameter space $\Theta$ be the set of possible values of $\theta$. A sequence of estimators $T_n$ of $\tau(\theta)$ is consistent for $\tau(\theta)$ if $T_n \xrightarrow{P} \tau(\theta)$ for every $\theta \in \Theta$. If $T_n$ is consistent for $\tau(\theta)$, then $T_n$ is a consistent estimator of $\tau(\theta)$.

Consistency is a weak property that is usually satisfied by good estimators. $T_n$ is a consistent estimator for $\tau(\theta)$ if the probability that $T_n$ falls in any neighborhood of $\tau(\theta)$ goes to one, regardless of the value of $\theta \in \Theta$.

Definition 1.24. For a real number $r > 0$, $Y_n$ converges in rth mean to a random variable $Y$, written $Y_n \xrightarrow{r} Y$, if

$$E(|Y_n - Y|^r) \to 0$$

as $n \to \infty$. In particular, if $r = 2$, $Y_n$ converges in quadratic mean to $Y$, written $Y_n \xrightarrow{2} Y$ or $Y_n \xrightarrow{qm} Y$, if

$$E[(Y_n - Y)^2] \to 0$$

as $n \to \infty$.

Lemma 1.4: Generalized Chebyshev's Inequality. Let $u : \mathbb{R} \to [0, \infty)$ be a nonnegative function. If $E[u(Y)]$ exists then for any $c > 0$,

$$P[u(Y) \geq c] \leq \frac{E[u(Y)]}{c}.$$

If $\mu = E(Y)$ exists, then taking $u(y) = |y - \mu|^r$ and replacing $c$ by $c^r$ gives Markov's Inequality: for $r > 0$ and any $c > 0$,

$$P[|Y - \mu| \geq c] = P[|Y - \mu|^r \geq c^r] \leq \frac{E[|Y - \mu|^r]}{c^r}.$$

If $r = 2$ and $\sigma^2 = \mathrm{VAR}(Y)$ exists, then we obtain Chebyshev's Inequality:

$$P[|Y - \mu| \geq c] \leq \frac{\mathrm{VAR}(Y)}{c^2}.$$

Proof. The proof is given for pdfs. For pmfs, replace the integrals by sums. Now

$$E[u(Y)] = \int_{\mathbb{R}} u(y)f(y)\,dy = \int_{\{y: u(y) \geq c\}} u(y)f(y)\,dy + \int_{\{y: u(y) < c\}} u(y)f(y)\,dy \geq \int_{\{y: u(y) \geq c\}} u(y)f(y)\,dy$$

since the integrand $u(y)f(y) \geq 0$. Hence

$$E[u(Y)] \geq c\int_{\{y: u(y) \geq c\}} f(y)\,dy = c\,P[u(Y) \geq c]. \qquad \square$$
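A quick numeric check of Chebyshev's inequality (not from the text) compares the exact tail probability of an exponential random variable to the bound VAR(Y)/c²; the distribution and the values of c are assumptions chosen for illustration.

# A minimal sketch checking Chebyshev's inequality for Y ~ Exp(1), so that
# mu = 1 and VAR(Y) = 1: P(|Y - 1| >= c) should be at most 1/c^2.
cvals <- c(1.5, 2, 3)
exact <- pexp(1 - cvals) + (1 - pexp(1 + cvals))   # P(Y <= 1 - c) + P(Y > 1 + c)
bound <- 1/cvals^2
cbind(cvals, exact, bound)                         # exact tail <= Chebyshev bound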

The following proposition gives sufficient conditions for $T_n$ to be a consistent estimator of $\tau(\theta)$. Notice that $E_{\theta}[(T_n - \tau(\theta))^2] = \mathrm{MSE}_{\tau(\theta)}(T_n) \to 0$ for all $\theta \in \Theta$ is equivalent to $T_n \xrightarrow{qm} \tau(\theta)$ for all $\theta \in \Theta$.

Proposition 1.5. a) If

$$\lim_{n\to\infty} \mathrm{MSE}_{\tau(\theta)}(T_n) = 0$$

for all $\theta \in \Theta$, then $T_n$ is a consistent estimator of $\tau(\theta)$.

b) If

$$\lim_{n\to\infty} \mathrm{VAR}_{\theta}(T_n) = 0 \quad \text{and} \quad \lim_{n\to\infty} E_{\theta}(T_n) = \tau(\theta)$$

for all $\theta \in \Theta$, then $T_n$ is a consistent estimator of $\tau(\theta)$.

Proof. a) Using Lemma 1.4 with $Y = T_n$, $u(T_n) = (T_n - \tau(\theta))^2$ and $c = \varepsilon^2$ shows that for any $\varepsilon > 0$,

$$P_{\theta}(|T_n - \tau(\theta)| \geq \varepsilon) = P_{\theta}[(T_n - \tau(\theta))^2 \geq \varepsilon^2] \leq \frac{E_{\theta}[(T_n - \tau(\theta))^2]}{\varepsilon^2}.$$

Hence

$$\lim_{n\to\infty} E_{\theta}[(T_n - \tau(\theta))^2] = \lim_{n\to\infty} \mathrm{MSE}_{\tau(\theta)}(T_n) \to 0$$

is a sufficient condition for $T_n$ to be a consistent estimator of $\tau(\theta)$.

b) Recall that

$$\mathrm{MSE}_{\tau(\theta)}(T_n) = \mathrm{VAR}_{\theta}(T_n) + [\mathrm{Bias}_{\tau(\theta)}(T_n)]^2$$

where $\mathrm{Bias}_{\tau(\theta)}(T_n) = E_{\theta}(T_n) - \tau(\theta)$. Since $\mathrm{MSE}_{\tau(\theta)}(T_n) \to 0$ if both $\mathrm{VAR}_{\theta}(T_n) \to 0$ and $\mathrm{Bias}_{\tau(\theta)}(T_n) = E_{\theta}(T_n) - \tau(\theta) \to 0$, the result follows from a). $\square$

The following result shows that estimators that converge at a $\sqrt{n}$ rate are consistent. Use this result and the delta method to show that $g(T_n)$ is a consistent estimator of $g(\theta)$. Note that b) follows from a) with $X_{\theta} \sim N(0, v(\theta))$. The WLLN shows that $\overline{Y}$ is a consistent estimator of $E(Y) = \mu$ if $E(Y)$ exists.

Proposition 1.6. a) Let $X_{\theta}$ be a random variable with distribution depending on $\theta$, and $0 < \delta \leq 1$. If

$$n^{\delta}(T_n - \tau(\theta)) \xrightarrow{D} X_{\theta},$$

then $T_n \xrightarrow{P} \tau(\theta)$.

b) If

$$\sqrt{n}(T_n - \tau(\theta)) \xrightarrow{D} N(0, v(\theta))$$

for all $\theta \in \Theta$, then $T_n$ is a consistent estimator of $\tau(\theta)$.

Definition 1.25. A sequence of random variables $X_n$ converges almost everywhere (or almost surely, or with probability 1) to $X$ if

$$P(\lim_{n\to\infty} X_n = X) = 1.$$

This type of convergence will be denoted by $X_n \xrightarrow{ae} X$. Notation such as "$X_n$ converges to $X$ ae" will also be used. Sometimes "ae" will be replaced with "as" or "wp1." We say that $X_n$ converges almost everywhere to $\tau(\theta)$, written $X_n \xrightarrow{ae} \tau(\theta)$, if $P(\lim_{n\to\infty} X_n = \tau(\theta)) = 1$.

Theorem 1.7. Let $Y_n$ be a sequence of iid random variables with $E(Y_i) = \mu$. Then

a) Strong Law of Large Numbers (SLLN): $\overline{Y}_n \xrightarrow{ae} \mu$, and

b) Weak Law of Large Numbers (WLLN): $\overline{Y}_n \xrightarrow{P} \mu$.

Proof of WLLN when $V(Y_i) = \sigma^2$: By Chebyshev's inequality, for every $\varepsilon > 0$,

$$P(|\overline{Y}_n - \mu| \geq \varepsilon) \leq \frac{V(\overline{Y}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0$$

as $n \to \infty$. $\square$

In proving consistency results, there is an infinite sequence of estimators that depend on the sample size n. Hence the subscript n will be added to the estimators.

Definition 1.26. Lehmann (1999, pp. 53-54): a) A sequence of random variables $W_n$ is tight or bounded in probability, written $W_n = O_P(1)$, if for every $\varepsilon > 0$ there exist positive constants $D_{\varepsilon}$ and $N_{\varepsilon}$ such that

$$P(|W_n| \leq D_{\varepsilon}) \geq 1 - \varepsilon$$

for all $n \geq N_{\varepsilon}$. Also $W_n = O_P(X_n)$ if $|W_n/X_n| = O_P(1)$.

b) The sequence $W_n = o_P(n^{-\delta})$ if $n^{\delta} W_n = o_P(1)$, which means that $n^{\delta} W_n \xrightarrow{P} 0$.

c) $W_n$ has the same order as $X_n$ in probability, written $W_n \asymp_P X_n$, if for every $\varepsilon > 0$ there exist positive constants $N_{\varepsilon}$ and $0 < d_{\varepsilon} < D_{\varepsilon}$ such that

$$P\left(d_{\varepsilon} \leq \left|\frac{W_n}{X_n}\right| \leq D_{\varepsilon}\right) = P\left(\frac{1}{D_{\varepsilon}} \leq \left|\frac{X_n}{W_n}\right| \leq \frac{1}{d_{\varepsilon}}\right) \geq 1 - \varepsilon$$

for all $n \geq N_{\varepsilon}$.

d) Similar notation is used for a $k \times r$ matrix $A_n = A = [a_{i,j}(n)]$ if each element $a_{i,j}(n)$ has the desired property. For example, $A = O_P(n^{-1/2})$ if each $a_{i,j}(n) = O_P(n^{-1/2})$.

Definition 1.27. Let $W_n = \|\hat{\mu}_n - \mu\|$.

a) If $W_n \asymp_P n^{-\delta}$ for some $\delta > 0$, then both $W_n$ and $\hat{\mu}_n$ have (tightness) rate $n^{\delta}$.

b) If there exists a constant $\kappa$ such that

$$n^{\delta}(W_n - \kappa) \xrightarrow{D} X$$

for some nondegenerate random variable $X$, then both $W_n$ and $\hat{\mu}_n$ have convergence rate $n^{\delta}$.

Proposition 1.8. Suppose there exists a constant $\kappa$ such that

$$n^{\delta}(W_n - \kappa) \xrightarrow{D} X.$$

a) Then $W_n = O_P(n^{-\delta})$.

b) If $X$ is not degenerate, then $W_n \asymp_P n^{-\delta}$.

The above result implies that if $W_n$ has convergence rate $n^{\delta}$, then $W_n$ has tightness rate $n^{\delta}$, and the term "tightness" will often be omitted. Part a) is proved, for example, in Lehmann (1999, p. 67).

The following result shows that if $W_n \asymp_P X_n$, then $X_n \asymp_P W_n$, $W_n = O_P(X_n)$, and $X_n = O_P(W_n)$. Notice that if $W_n = O_P(n^{-\delta})$, then $n^{\delta}$ is a lower bound on the rate of $W_n$. As an example, if the CLT holds then $\overline{Y}_n = O_P(n^{-1/3})$, but $\overline{Y}_n \asymp_P n^{-1/2}$.

Proposition 1.9. a) If $W_n \asymp_P X_n$, then $X_n \asymp_P W_n$.
b) If $W_n \asymp_P X_n$, then $W_n = O_P(X_n)$.
c) If $W_n \asymp_P X_n$, then $X_n = O_P(W_n)$.
d) $W_n \asymp_P X_n$ iff $W_n = O_P(X_n)$ and $X_n = O_P(W_n)$.

Proof. a) Since $W_n \asymp_P X_n$,

$$P\left(d_{\varepsilon} \leq \left|\frac{W_n}{X_n}\right| \leq D_{\varepsilon}\right) = P\left(\frac{1}{D_{\varepsilon}} \leq \left|\frac{X_n}{W_n}\right| \leq \frac{1}{d_{\varepsilon}}\right) \geq 1 - \varepsilon$$

for all $n \geq N_{\varepsilon}$. Hence $X_n \asymp_P W_n$.

b) Since $W_n \asymp_P X_n$,

$$P(|W_n| \leq |X_n D_{\varepsilon}|) \geq P\left(d_{\varepsilon} \leq \left|\frac{W_n}{X_n}\right| \leq D_{\varepsilon}\right) \geq 1 - \varepsilon$$

for all $n \geq N_{\varepsilon}$. Hence $W_n = O_P(X_n)$.

c) Follows by a) and b).

d) If $W_n \asymp_P X_n$, then $W_n = O_P(X_n)$ and $X_n = O_P(W_n)$ by b) and c). Now suppose $W_n = O_P(X_n)$ and $X_n = O_P(W_n)$. Then

$$P(|W_n| \leq |X_n| D_{\varepsilon/2}) \geq 1 - \varepsilon/2$$

for all $n \geq N_1$, and

$$P(|X_n| \leq |W_n| \, 1/d_{\varepsilon/2}) \geq 1 - \varepsilon/2$$

for all $n \geq N_2$. Hence

$$P(A) \equiv P\left(\left|\frac{W_n}{X_n}\right| \leq D_{\varepsilon/2}\right) \geq 1 - \varepsilon/2$$

and

$$P(B) \equiv P\left(d_{\varepsilon/2} \leq \left|\frac{W_n}{X_n}\right|\right) \geq 1 - \varepsilon/2$$

for all $n \geq N = \max(N_1, N_2)$. Since $P(A \cap B) = P(A) + P(B) - P(A \cup B) \geq P(A) + P(B) - 1$,

$$P(A \cap B) = P\left(d_{\varepsilon/2} \leq \left|\frac{W_n}{X_n}\right| \leq D_{\varepsilon/2}\right) \geq 1 - \varepsilon/2 + 1 - \varepsilon/2 - 1 = 1 - \varepsilon$$

for all $n \geq N$. Hence $W_n \asymp_P X_n$. $\square$

The following result is used to prove Theorem 1.11, which says that if there are K estimators Tj,n of a parameter β such that ‖Tj,n − β‖ = OP(n^{−δ}) where 0 < δ ≤ 1, and if T*_n picks one of these estimators, then ‖T*_n − β‖ = OP(n^{−δ}).

Proposition 1.10: Pratt (1959). Let X_{1,n}, ..., X_{K,n} each be OP(1) where K is fixed. Suppose Wn = X_{i_n,n} for some i_n ∈ {1, ..., K}. Then

Wn = OP(1).     (1.5)

Proof. Note that

P(max{X_{1,n}, ..., X_{K,n}} ≤ x) = P(X_{1,n} ≤ x, ..., X_{K,n} ≤ x) ≤ F_{Wn}(x) ≤ P(min{X_{1,n}, ..., X_{K,n}} ≤ x) = 1 − P(X_{1,n} > x, ..., X_{K,n} > x).

Since K is finite, there exist B > 0 and N such that P(X_{i,n} ≤ B) > 1 − ε/(2K) and P(X_{i,n} > −B) > 1 − ε/(2K) for all n > N and i = 1, ..., K. Bonferroni's inequality states that P(∩_{i=1}^K A_i) ≥ Σ_{i=1}^K P(A_i) − (K − 1). Thus

F_{Wn}(B) ≥ P(X_{1,n} ≤ B, ..., X_{K,n} ≤ B) ≥ K(1 − ε/(2K)) − (K − 1) = K − ε/2 − K + 1 = 1 − ε/2

and

−F_{Wn}(−B) ≥ −1 + P(X_{1,n} > −B, ..., X_{K,n} > −B) ≥ −1 + K(1 − ε/(2K)) − (K − 1) = −1 + K − ε/2 − K + 1 = −ε/2.

Hence F_{Wn}(B) − F_{Wn}(−B) ≥ 1 − ε for n > N. □

Theorem 1.11. Suppose ‖Tj,n − β‖ = OP(n^{−δ}) for j = 1, ..., K where 0 < δ ≤ 1. Let T*_n = T_{i_n,n} for some i_n ∈ {1, ..., K} where, for example, T_{i_n,n} is the Tj,n that minimized some criterion function. Then

‖T*_n − β‖ = OP(n^{−δ}).     (1.6)

Proof. Let Xj,n = n^δ ‖Tj,n − β‖. Then Xj,n = OP(1), so by Proposition 1.10, n^δ ‖T*_n − β‖ = OP(1). Hence ‖T*_n − β‖ = OP(n^{−δ}). □

1.3.3 Slutsky's Theorem and Related Results

Theorem 1.12: Slutsky's Theorem. Suppose Yn →D Y and Wn →P w for some constant w. Then
a) Yn + Wn →D Y + w,
b) Yn Wn →D wY, and
c) Yn/Wn →D Y/w if w ≠ 0.

Theorem 1.13. a) If Xn →P X, then Xn →D X.
b) If Xn →ae X, then Xn →P X and Xn →D X.
c) If Xn →r X, then Xn →P X and Xn →D X.
d) Xn →P τ(θ) iff Xn →D τ(θ).
e) If Xn →P θ and τ is continuous at θ, then τ(Xn) →P τ(θ).
f) If Xn →D θ and τ is continuous at θ, then τ(Xn) →D τ(θ).

Suppose that for all θ ∈ Θ, Tn →D τ(θ), Tn →r τ(θ), or Tn →ae τ(θ). Then Tn is a consistent estimator of τ(θ) by Theorem 1.13. We are assuming that the function τ does not depend on n.

Example 1.16. Let Y1, ..., Yn be iid with mean E(Yi) = µ and variance V(Yi) = σ². Then the sample mean Ȳn is a consistent estimator of µ since i) the SLLN holds (use Theorems 1.7 and 1.13), ii) the WLLN holds, and iii) the CLT holds (use Proposition 1.6). Since

lim_{n→∞} VAR_µ(Ȳn) = lim_{n→∞} σ²/n = 0 and lim_{n→∞} E_µ(Ȳn) = µ,

Ȳn is also a consistent estimator of µ by Proposition 1.5b. By the delta method and Proposition 1.6b, Tn = g(Ȳn) is a consistent estimator of g(µ) if g′(µ) ≠ 0 for all µ ∈ Θ. By Theorem 1.13e, g(Ȳn) is a consistent estimator of g(µ) if g is continuous at µ for all µ ∈ Θ.

Theorem 1.14. Assume that the function g does not depend on n.

a) Generalized Continuous Mapping Theorem: If Xn →D X and the function g is such that P[X ∈ C(g)] = 1, where C(g) is the set of points where g is continuous, then g(Xn) →D g(X).

b) Continuous Mapping Theorem: If Xn →D X and the function g is continuous, then g(Xn) →D g(X).

Remark 1.2. For Theorem 1.13, a) follows from Slutsky's Theorem by taking Yn ≡ X = Y and Wn = Xn − X. Then Yn →D Y = X and Wn →P 0. Hence Xn = Yn + Wn →D Y + 0 = X. The convergence in distribution parts of b) and c) follow from a). Part f) follows from d) and e). Part e) implies that if Tn is a consistent estimator of θ and τ is a continuous function, then τ(Tn) is a consistent estimator of τ(θ). Theorem 1.14 says that convergence in distribution is preserved by continuous functions, and even some discontinuities are allowed as long as the set of continuity points is assigned probability 1 by the asymptotic distribution. Equivalently, the set of discontinuity points is assigned probability 0.

Example 1.17. (Ferguson 1996, p. 40): If Xn →D X, then 1/Xn →D 1/X if X is a continuous random variable, since P(X = 0) = 0 and x = 0 is the only discontinuity point of g(x) = 1/x.

Example 1.18. Show that if Yn ∼ tn, a t distribution with n degrees of freedom, then Yn →D Z where Z ∼ N(0, 1).

Solution: Yn =D Z/√(Vn/n) where Z and Vn are independent and Vn ∼ χ²_n. If Wn = √(Vn/n) →P 1, then the result follows by Slutsky's Theorem. But Vn =D Σ_{i=1}^n Xi where the Xi are iid χ²_1. Hence Vn/n →P 1 by the WLLN, and √(Vn/n) →P 1 by Theorem 1.13e.
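A quick numerical check of Example 1.18 (a sketch; the degrees of freedom shown are arbitrary choices): the tn quantiles approach the corresponding N(0, 1) quantile.

# t_n quantiles approach the N(0,1) quantile as the degrees of freedom grow
for(n in c(5, 30, 100, 1000)) cat("df =", n, " 0.975 quantile =", qt(0.975, df = n), "\n")
cat("N(0,1) 0.975 quantile =", qnorm(0.975), "\n")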

Theorem 1.15: Continuity Theorem. Let Yn be a sequence of random variables with characteristic functions φn(t). Let Y be a random variable with characteristic function (cf) φ(t).

a) Yn →D Y iff φn(t) → φ(t) for all t ∈ R.

b) Also assume that Yn has moment generating function (mgf) mn and Y has mgf m. Assume that all of the mgfs mn and m are defined on |t| ≤ d for some d > 0. Then if mn(t) → m(t) as n → ∞ for all |t| < c where 0 < c < d, then Yn →D Y.

Application: Proof of a Special Case of the CLT. Following Rohatgi (1984, pp. 569-9), let Y1, ..., Yn be iid with mean µ, variance σ², and mgf mY(t) for |t| < to. Then

Zi = (Yi − µ)/σ

has mean 0, variance 1, and mgf mZ(t) = exp(−tµ/σ) mY(t/σ) for |t| < σ to. We want to show that

Wn = √n (Ȳn − µ)/σ →D N(0, 1).

Notice that

Wn = n^{−1/2} Σ_{i=1}^n Zi = n^{−1/2} Σ_{i=1}^n (Yi − µ)/σ = [n^{−1/2} (Σ_{i=1}^n Yi − nµ)]/σ = (n^{−1/2}/(1/n)) (Ȳn − µ)/σ.

Thus

m_{Wn}(t) = E(e^{t Wn}) = E[exp(t n^{−1/2} Σ_{i=1}^n Zi)] = E[exp(Σ_{i=1}^n t Zi/√n)]
= Π_{i=1}^n E[e^{t Zi/√n}] = Π_{i=1}^n mZ(t/√n) = [mZ(t/√n)]^n.

Set ψ(x) = log(mZ(x)). Then

log[m_{Wn}(t)] = n log[mZ(t/√n)] = n ψ(t/√n) = ψ(t/√n)/(1/n).

Now ψ(0) = log[mZ(0)] = log(1) = 0. Thus by L'Hôpital's rule (where the derivative is with respect to n), lim_{n→∞} log[m_{Wn}(t)] =

lim_{n→∞} ψ(t/√n)/(1/n) = lim_{n→∞} [ψ′(t/√n)(−t/(2 n^{3/2}))]/(−1/n²) = (t/2) lim_{n→∞} ψ′(t/√n)/(1/√n).

Now

ψ′(0) = m′_Z(0)/mZ(0) = E(Zi)/1 = 0,

so L'Hôpital's rule can be applied again, giving lim_{n→∞} log[m_{Wn}(t)] =

(t/2) lim_{n→∞} [ψ″(t/√n)(−t/(2 n^{3/2}))]/(−1/(2 n^{3/2})) = (t²/2) lim_{n→∞} ψ″(t/√n) = (t²/2) ψ″(0).

Now

ψ″(t) = d/dt [m′_Z(t)/mZ(t)] = [m″_Z(t) mZ(t) − (m′_Z(t))²]/[mZ(t)]².

So

ψ″(0) = m″_Z(0) − [m′_Z(0)]² = E(Z²_i) − [E(Zi)]² = 1.

Hence lim_{n→∞} log[m_{Wn}(t)] = t²/2 and

lim_{n→∞} m_{Wn}(t) = exp(t²/2),

which is the N(0,1) mgf. Thus by the continuity theorem,

Wn = √n (Ȳn − µ)/σ →D N(0, 1). □
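A small simulation sketch of the CLT (not from the text; the Exponential(1) population, sample size, and number of runs are arbitrary choices): the standardized sample means behave like N(0, 1) draws.

# CLT check: Wn = sqrt(n)(Ybar_n - mu)/sigma with Y ~ Exponential(1), mu = sigma = 1
set.seed(3)
n <- 50
w <- replicate(5000, sqrt(n) * (mean(rexp(n)) - 1) / 1)
cat("mean(Wn) =", mean(w), " var(Wn) =", var(w), "\n")          # should be near 0 and 1
cat("P(Wn <= 1.96) =", mean(w <= 1.96), " Phi(1.96) =", pnorm(1.96), "\n")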

1.3.4 Multivariate Limit Theorems

Many of the univariate results of the previous three subsections can be extended to random vectors. For the limit theorems, the vector X is typically a k × 1 column vector and X^T is a row vector. Let ‖x‖ = √(x1² + · · · + xk²) be the Euclidean norm of x.

Definition 1.28. Let Xn be a sequence of random vectors with joint cdfs Fn(x) and let X be a random vector with joint cdf F(x).

a) Xn converges in distribution to X, written Xn →D X, if Fn(x) → F(x) as n → ∞ for all points x at which F(x) is continuous. The distribution of X is the limiting distribution or asymptotic distribution of Xn.

b) Xn converges in probability to X, written Xn →P X, if for every ε > 0, P(‖Xn − X‖ > ε) → 0 as n → ∞.

c) Let r > 0 be a real number. Then Xn converges in rth mean to X, written Xn →r X, if E(‖Xn − X‖^r) → 0 as n → ∞.

d) Xn converges almost everywhere to X, written Xn →ae X, if P(lim_{n→∞} Xn = X) = 1.

Theorems 1.16 and 1.17 below are the multivariate extensions of the limit theorems in Subsection 1.3.1. When the limiting distribution of Zn = √n (g(Tn) − g(θ)) is multivariate normal Nk(0, Σ), approximate the joint cdf of Zn with the joint cdf of the Nk(0, Σ) distribution. Thus to find probabilities, manipulate Zn as if Zn ≈ Nk(0, Σ). To see that the CLT is a special case of the MCLT below, let k = 1, E(X) = µ, and V(X) = Σx = σ².

Theorem 1.16: the Multivariate Central Limit Theorem (MCLT). If X1, ..., Xn are iid k × 1 random vectors with E(X) = µ and Cov(X) = Σx, then

√n (X̄n − µ) →D Nk(0, Σx)

where the sample mean

X̄n = (1/n) Σ_{i=1}^n Xi.

To see that the delta method is a special case of the multivariate delta method, note that if Tn and the parameter θ are real valued, then Dg(θ) = g′(θ).


Theorem 1.17: the Multivariate Delta Method. If g does not depend on n and

√n (Tn − θ) →D Nk(0, Σ),

then

√n (g(Tn) − g(θ)) →D Nd(0, Dg(θ) Σ Dg(θ)^T)

where the d × k Jacobian matrix of partial derivatives is

Dg(θ) = ( ∂g1(θ)/∂θ1 · · · ∂g1(θ)/∂θk )
        (     ...            ...      )
        ( ∂gd(θ)/∂θ1 · · · ∂gd(θ)/∂θk ).

Here the mapping g : R^k → R^d needs to be differentiable in a neighborhood of θ ∈ R^k.
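A numerical sketch of the multivariate delta method (the bivariate population and the map g(a, b) = ab are arbitrary illustration choices, not from the text). Here θ = (1, 2)^T, Σ has rows (1, 1) and (1, 2), so Dg(θ) = (2, 1) and Dg(θ) Σ Dg(θ)^T = 10, which the simulated variance of √n (g(Tn) − g(θ)) should approximate.

# Multivariate delta method check for g(a, b) = a*b
set.seed(4)
n <- 200; nruns <- 5000
gdiff <- replicate(nruns, {
  z1 <- rnorm(n); z2 <- rnorm(n)
  x1 <- 1 + z1; x2 <- 2 + z1 + z2            # E(x) = (1, 2), Cov rows (1,1) and (1,2)
  sqrt(n) * (mean(x1) * mean(x2) - 1 * 2)    # sqrt(n)(g(Tn) - g(theta))
})
cat("simulated variance =", var(gdiff), " asymptotic variance =", 10, "\n")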

Definition 1.29. If the estimator g(Tn) →P g(θ) for all θ ∈ Θ, then g(Tn) is a consistent estimator of g(θ).

Proposition 1.18. If 0 < δ ≤ 1, X is a random vector, and

n^δ (g(Tn) − g(θ)) →D X,

then g(Tn) →P g(θ).

Theorem 1.19. If X1, ..., Xn are iid, E(‖X‖) < ∞, and E(X) = µ, then

a) WLLN: X̄n →P µ, and
b) SLLN: X̄n →ae µ.

Theorem 1.20: Continuity Theorem. Let Xn be a sequence of k × 1 random vectors with characteristic functions φn(t), and let X be a k × 1 random vector with cf φ(t). Then

Xn →D X iff φn(t) → φ(t)

for all t ∈ R^k.

Theorem 1.21: Cramér Wold Device. Let Xn be a sequence of k × 1 random vectors, and let X be a k × 1 random vector. Then

Xn →D X iff t^T Xn →D t^T X

for all t ∈ R^k.

Application: Proof of the MCLT Theorem 1.16. Note that for fixed t, the t^T Xi are iid random variables with mean t^T µ and variance t^T Σ t. Hence by the CLT, t^T √n (X̄n − µ) →D N(0, t^T Σ t). The right hand side has the distribution of t^T X where X ∼ Nk(0, Σ). Hence by the Cramér Wold Device, √n (X̄n − µ) →D Nk(0, Σ). □

Theorem 1.22. a) If Xn →P X, then Xn →D X.
b) Xn →P g(θ) iff Xn →D g(θ).

Let g(n) ≥ 1 be an increasing function of the sample size n: g(n) ↑ ∞, e.g. g(n) = √n. See White (1984, p. 15). If a k × 1 random vector Tn − µ converges to a nondegenerate multivariate normal distribution with convergence rate √n, then Tn has (tightness) rate √n.

Definition 1.30. Let An = [a_{i,j}(n)] be an r × c random matrix.
a) An = OP(Xn) if a_{i,j}(n) = OP(Xn) for 1 ≤ i ≤ r and 1 ≤ j ≤ c.
b) An = oP(Xn) if a_{i,j}(n) = oP(Xn) for 1 ≤ i ≤ r and 1 ≤ j ≤ c.
c) An ≍P 1/g(n) if a_{i,j}(n) ≍P 1/g(n) for 1 ≤ i ≤ r and 1 ≤ j ≤ c.
d) Let A_{1,n} = Tn − µ and A_{2,n} = Cn − cΣ for some constant c > 0. If A_{1,n} ≍P 1/g(n) and A_{2,n} ≍P 1/g(n), then (Tn, Cn) has (tightness) rate g(n).

Theorem 1.23: Continuous Mapping Theorem. Let Xn ∈ R^k. If Xn →D X and if the function g : R^k → R^j is continuous, then g(Xn) →D g(X).

The following two theorems are taken from Severini (2005, pp. 345-349, 354).

Theorem 1.24. Let Xn = (X_{1n}, ..., X_{kn})^T be a sequence of k × 1 random vectors, let Yn be a sequence of k × 1 random vectors, and let X = (X1, ..., Xk)^T be a k × 1 random vector. Let Wn be a sequence of k × k nonsingular random matrices, and let C be a k × k constant nonsingular matrix.

a) Xn →P X iff X_{in} →P Xi for i = 1, ..., k.

b) Slutsky's Theorem: If Xn →D X and Yn →P c for some constant k × 1 vector c, then i) Xn + Yn →D X + c and ii) Yn^T Xn →D c^T X.

c) If Xn →D X and Wn →P C, then Wn Xn →D C X, Xn^T Wn →D X^T C, Wn^{−1} Xn →D C^{−1} X, and Xn^T Wn^{−1} →D X^T C^{−1}.

Theorem 1.25. Let Wn, Xn, Yn, and Zn be sequences of random variables such that Yn > 0 and Zn > 0. (Often Yn and Zn are deterministic, e.g. Yn = n^{−1/2}.)

a) If Wn = OP(1) and Xn = OP(1), then Wn + Xn = OP(1) and Wn Xn = OP(1); thus OP(1) + OP(1) = OP(1) and OP(1) OP(1) = OP(1).

b) If Wn = OP(1) and Xn = oP(1), then Wn + Xn = OP(1) and Wn Xn = oP(1); thus OP(1) + oP(1) = OP(1) and OP(1) oP(1) = oP(1).

c) If Wn = OP(Yn) and Xn = OP(Zn), then Wn + Xn = OP(max(Yn, Zn)) and Wn Xn = OP(Yn Zn); thus OP(Yn) + OP(Zn) = OP(max(Yn, Zn)) and OP(Yn) OP(Zn) = OP(Yn Zn).

Theorem 1.26. i) Suppose √n (Tn − µ) →D Np(θ, Σ). Let A be a q × p constant matrix. Then A √n (Tn − µ) = √n (A Tn − A µ) →D Nq(A θ, A Σ A^T).

ii) Let Σ > 0. If (T, C) is a consistent estimator of (µ, s Σ) where s > 0 is some constant, then D²x(T, C) = (x − T)^T C^{−1} (x − T) = s^{−1} D²x(µ, Σ) + oP(1), so D²x(T, C) is a consistent estimator of s^{−1} D²x(µ, Σ).

iii) Let Σ > 0. If √n (T − µ) →D Np(0, Σ) and if C is a consistent estimator of Σ, then n (T − µ)^T C^{−1} (T − µ) →D χ²_p. In particular,

n (x̄ − µ)^T S^{−1} (x̄ − µ) →D χ²_p.

Proof: ii) D²x(T, C) = (x − T)^T C^{−1} (x − T) =
(x − µ + µ − T)^T [C^{−1} − s^{−1} Σ^{−1} + s^{−1} Σ^{−1}] (x − µ + µ − T)
= (x − µ)^T [s^{−1} Σ^{−1}] (x − µ) + (x − T)^T [C^{−1} − s^{−1} Σ^{−1}] (x − T)
+ (x − µ)^T [s^{−1} Σ^{−1}] (µ − T) + (µ − T)^T [s^{−1} Σ^{−1}] (x − µ)
+ (µ − T)^T [s^{−1} Σ^{−1}] (µ − T) = s^{−1} D²x(µ, Σ) + oP(1).
(Note that D²x(T, C) = s^{−1} D²x(µ, Σ) + OP(n^{−δ}) if (T, C) is a consistent estimator of (µ, s Σ) with rate n^δ where 0 < δ ≤ 0.5, if [C^{−1} − s^{−1} Σ^{−1}] = OP(n^{−δ}).)

Alternatively, D²x(T, C) is a continuous function of (T, C) if C > 0 for n > 10p. Hence D²x(T, C) →P D²x(µ, sΣ).

iii) Note that Zn = √n Σ^{−1/2} (T − µ) →D Np(0, Ip). Thus Zn^T Zn = n (T − µ)^T Σ^{−1} (T − µ) →D χ²_p. Now n (T − µ)^T C^{−1} (T − µ) =
n (T − µ)^T [C^{−1} − Σ^{−1} + Σ^{−1}] (T − µ) = n (T − µ)^T Σ^{−1} (T − µ) +
n (T − µ)^T [C^{−1} − Σ^{−1}] (T − µ) = n (T − µ)^T Σ^{−1} (T − µ) + oP(1) →D χ²_p since
√n (T − µ)^T [C^{−1} − Σ^{−1}] √n (T − µ) = OP(1) oP(1) OP(1) = oP(1). □

Example 1.19. Suppose that xn and yn are independent for n = 1, 2, .... Suppose xn →D x and yn →D y, where x and y are independent. Then

(xn^T, yn^T)^T →D (x^T, y^T)^T

by Theorem 1.20. To see this, let t = (t1^T, t2^T)^T, zn = (xn^T, yn^T)^T, and z = (x^T, y^T)^T. Since xn and yn are independent and x and y are independent, the characteristic function

φ_{zn}(t) = φ_{xn}(t1) φ_{yn}(t2) → φ_x(t1) φ_y(t2) = φ_z(t).

Hence g(zn) →D g(z) by Theorem 1.23.


1.4 Mixture Distributions

Mixture distributions are useful for model and variable selection since β̂_{Imin,0} is a mixture distribution of the β̂_{Ij,0}, and the lasso estimator β̂_L is a mixture distribution of β̂_{L,λi} for i = 1, ..., M. See Sections 2.3, 3.2, and 3.6. A random vector u has a mixture distribution if u equals a random vector uj with probability πj for j = 1, ..., J. See Definition 1.18 for the population mean and population covariance matrix of a random vector.

Definition 1.31. The distribution of a g × 1 random vector u is a mixture distribution if the cumulative distribution function (cdf) of u is

F_u(t) = Σ_{j=1}^J πj F_{uj}(t)     (1.7)

where the probabilities πj satisfy 0 ≤ πj ≤ 1 and Σ_{j=1}^J πj = 1, J ≥ 2, and F_{uj}(t) is the cdf of a g × 1 random vector uj. Then u has a mixture distribution of the uj with probabilities πj.

Theorem 1.27. Suppose E(h(u)) and the E(h(uj)) exist. Then

E(h(u)) = Σ_{j=1}^J πj E[h(uj)].     (1.8)

Hence

E(u) = Σ_{j=1}^J πj E[uj],     (1.9)

and Cov(u) = E(u u^T) − E(u) E(u^T) = E(u u^T) − E(u)[E(u)]^T = Σ_{j=1}^J πj E[uj uj^T] − E(u)[E(u)]^T =

Σ_{j=1}^J πj Cov(uj) + Σ_{j=1}^J πj E(uj)[E(uj)]^T − E(u)[E(u)]^T.     (1.10)

If E(uj) = θ for j = 1, ..., J, then E(u) = θ and

Cov(u) = Σ_{j=1}^J πj Cov(uj).

This theorem is easy to prove if the uj are continuous random vectors with (joint) probability density functions (pdfs) f_{uj}(t). Then u is a continuous random vector with pdf

f_u(t) = Σ_{j=1}^J πj f_{uj}(t), and E(h(u)) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(t) f_u(t) dt

= Σ_{j=1}^J πj ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(t) f_{uj}(t) dt = Σ_{j=1}^J πj E[h(uj)]

where E[h(uj)] is the expectation with respect to the random vector uj. Note that

E(u)[E(u)]^T = Σ_{j=1}^J Σ_{k=1}^J πj πk E(uj)[E(uk)]^T.     (1.11)
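The moment formulas (1.9) and (1.10) are easy to use in R. The sketch below computes E(u) and Cov(u) for a two-component bivariate normal mixture (the component parameters are arbitrary illustration choices, not from the text).

# Mixture mean and covariance via (1.9) and (1.10)
pi1 <- 0.3; pi2 <- 0.7
mu1 <- c(0, 0); mu2 <- c(2, 1)
S1 <- diag(2);  S2 <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
Eu <- pi1 * mu1 + pi2 * mu2                              # E(u) from (1.9)
Cu <- pi1 * S1 + pi2 * S2 +
      pi1 * mu1 %*% t(mu1) + pi2 * mu2 %*% t(mu2) -
      Eu %*% t(Eu)                                       # Cov(u) from (1.10)
Eu
Cu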

1.5 The Multivariate Normal Distribution

Although usually random vectors in this text are denoted by x, y, or z, this section will usually use the notation X = (X1, ..., Xp)^T and Y for the random vectors, and x = (x1, ..., xp)^T for the observed value of the random vector. This notation will be useful to avoid confusion when studying conditional distributions such as Y |X = x.

Definition 1.32: Rao (1965, p. 437). A p × 1 random vector X has a p-dimensional multivariate normal distribution Np(µ, Σ) iff t^T X has a univariate normal distribution for any p × 1 vector t.

If Σ is positive definite, then X has pdf

f(z) = [1/((2π)^{p/2} |Σ|^{1/2})] e^{−(1/2)(z − µ)^T Σ^{−1} (z − µ)}     (1.12)

where |Σ|^{1/2} is the square root of the determinant of Σ. Note that if p = 1, then the quadratic form in the exponent is (z − µ)(σ²)^{−1}(z − µ) and X has the univariate N(µ, σ²) pdf. If Σ is positive semidefinite but not positive definite, then X has a degenerate distribution. For example, the univariate N(0, 0²) distribution is degenerate (the point mass at 0).

Definition 1.33. The population mean of a random p × 1 vector X = (X1, ..., Xp)^T is

E(X) = (E(X1), ..., E(Xp))^T

and the p × p population covariance matrix

Cov(X) = E[(X − E(X))(X − E(X))^T] = (σij).

That is, the ij entry of Cov(X) is Cov(Xi, Xj) = σij.


The covariance matrix is also called the variance–covariance matrix and variance matrix. Sometimes the notation Var(X) is used. Note that Cov(X) is a symmetric positive semidefinite matrix. If X and Y are p × 1 random vectors, a a conformable constant vector, and A and B are conformable constant matrices, then

E(a + X) = a + E(X) and E(X + Y) = E(X) + E(Y)     (1.13)

and

E(AX) = A E(X) and E(AXB) = A E(X) B.     (1.14)

Thus

Cov(a + AX) = Cov(AX) = A Cov(X) A^T.     (1.15)

Some important properties of multivariate normal (MVN) distributions are given in the following three propositions. These propositions can be proved using results from Johnson and Wichern (1988, pp. 127-132) or Severini (2005, ch. 8).

Proposition 1.28. a) If X ∼ Np(µ, Σ), then E(X) = µ and Cov(X) = Σ.

b) If X ∼ Np(µ, Σ), then any linear combination t^T X = t1 X1 + · · · + tp Xp ∼ N1(t^T µ, t^T Σ t). Conversely, if t^T X ∼ N1(t^T µ, t^T Σ t) for every p × 1 vector t, then X ∼ Np(µ, Σ).

c) The joint distribution of independent normal random variables is MVN. If X1, ..., Xp are independent univariate normal N(µi, σ²i) random variables, then X = (X1, ..., Xp)^T is Np(µ, Σ) where µ = (µ1, ..., µp)^T and Σ = diag(σ²1, ..., σ²p) (so the off diagonal entries σij = 0 while the diagonal entries of Σ are σii = σ²i).

d) If X ∼ Np(µ, Σ) and if A is a q × p matrix, then AX ∼ Nq(Aµ, AΣA^T). If a is a p × 1 vector of constants and b is a constant, then a + bX ∼ Np(a + bµ, b²Σ). (Note that bX = b Ip X with A = b Ip.)

It will be useful to partition X, µ, and Σ. Let X1 and µ1 be q × 1 vectors, let X2 and µ2 be (p − q) × 1 vectors, let Σ11 be a q × q matrix, let Σ12 be a q × (p − q) matrix, let Σ21 be a (p − q) × q matrix, and let Σ22 be a (p − q) × (p − q) matrix. Then

X = (X1^T, X2^T)^T,  µ = (µ1^T, µ2^T)^T,  and  Σ = ( Σ11  Σ12 )
                                                   ( Σ21  Σ22 ).

Proposition 1.29. a) All subsets of a MVN are MVN: (X_{k1}, ..., X_{kq})^T ∼ Nq(µ̃, Σ̃) where µ̃i = E(X_{ki}) and Σ̃ij = Cov(X_{ki}, X_{kj}). In particular, X1 ∼ Nq(µ1, Σ11) and X2 ∼ N_{p−q}(µ2, Σ22).

b) If X1 and X2 are independent, then Cov(X1, X2) = Σ12 = E[(X1 − E(X1))(X2 − E(X2))^T] = 0, a q × (p − q) matrix of zeroes.

c) If X ∼ Np(µ, Σ), then X1 and X2 are independent iff Σ12 = 0.

d) If X1 ∼ Nq(µ1, Σ11) and X2 ∼ N_{p−q}(µ2, Σ22) are independent, then

(X1^T, X2^T)^T ∼ Np( (µ1^T, µ2^T)^T,  ( Σ11   0  ) )
                                      (  0   Σ22 ).

Proposition 1.30. The conditional distribution of a MVN is MVN. If X ∼ Np(µ, Σ), then the conditional distribution of X1 given that X2 = x2 is multivariate normal with mean µ1 + Σ12 Σ22^{−1} (x2 − µ2) and covariance matrix Σ11 − Σ12 Σ22^{−1} Σ21. That is,

X1 | X2 = x2 ∼ Nq(µ1 + Σ12 Σ22^{−1} (x2 − µ2), Σ11 − Σ12 Σ22^{−1} Σ21).

Example 1.20. Let p = 2 and let (Y, X)^T have a bivariate normal distribution. That is,

(Y, X)^T ∼ N2( (µ_Y, µ_X)^T, Σ ),   Σ = (   σ²_Y      Cov(Y, X) )
                                        ( Cov(X, Y)     σ²_X    ).

Also, recall that the population correlation between X and Y is given by

ρ(X, Y) = Cov(X, Y)/(√VAR(X) √VAR(Y)) = σ_{X,Y}/(σ_X σ_Y)

if σ_X > 0 and σ_Y > 0. Then Y | X = x ∼ N(E(Y |X = x), VAR(Y |X = x)) where the conditional mean

E(Y |X = x) = µ_Y + Cov(Y, X) (1/σ²_X) (x − µ_X) = µ_Y + ρ(X, Y) √(σ²_Y/σ²_X) (x − µ_X)

and the conditional variance

VAR(Y |X = x) = σ²_Y − Cov(X, Y) (1/σ²_X) Cov(X, Y)
= σ²_Y − ρ(X, Y) √(σ²_Y/σ²_X) ρ(X, Y) √(σ²_X) √(σ²_Y)
= σ²_Y − ρ²(X, Y) σ²_Y = σ²_Y [1 − ρ²(X, Y)].

Also aX + bY is univariate normal with mean aµ_X + bµ_Y and variance

a²σ²_X + b²σ²_Y + 2ab Cov(X, Y).
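The conditional moment formulas of Example 1.20 are easy to compute. The following R sketch uses arbitrary illustration values (not from the text).

# E(Y|X = x) and VAR(Y|X = x) for a bivariate normal
muY <- 10; muX <- 5; varY <- 9; varX <- 4; covXY <- 3; x <- 6
condmean <- muY + covXY / varX * (x - muX)   # conditional mean
condvar  <- varY - covXY^2 / varX            # conditional variance
cat("E(Y|X=6) =", condmean, " VAR(Y|X) =", condvar, "\n")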


Remark 1.3. There are several common misconceptions. First, it is not true that every linear combination t^T X of normal random variables is a normal random variable, and it is not true that all uncorrelated normal random variables are independent. The key condition in Proposition 1.28b and Proposition 1.29c is that the joint distribution of X is MVN. It is possible that X1, X2, ..., Xp each has a marginal distribution that is univariate normal, but the joint distribution of X is not MVN. See Seber and Lee (2003, p. 23), Kowalski (1973), and examine the following example from Rohatgi (1976, p. 229). Suppose that the joint pdf of X and Y is a mixture of two bivariate normal distributions both with EX = EY = 0 and VAR(X) = VAR(Y) = 1, but Cov(X, Y) = ±ρ. Hence f(x, y) =

(1/2) [1/(2π√(1 − ρ²))] exp( −(x² − 2ρxy + y²)/(2(1 − ρ²)) )
+ (1/2) [1/(2π√(1 − ρ²))] exp( −(x² + 2ρxy + y²)/(2(1 − ρ²)) ) ≡ (1/2) f1(x, y) + (1/2) f2(x, y)

where x and y are real and 0 < ρ < 1. Since both marginal distributions of fi(x, y) are N(0,1) for i = 1 and 2 by Proposition 1.29 a), the marginal distributions of X and Y are N(0,1). Since ∫∫ x y fi(x, y) dx dy = ρ for i = 1 and −ρ for i = 2, X and Y are uncorrelated, but X and Y are not independent since f(x, y) ≠ f_X(x) f_Y(y).

Remark 1.4. In Proposition 1.30, suppose that X = (Y, X2, ..., Xp)^T. Let X1 = Y and X2 = (X2, ..., Xp)^T. Then E[Y |X2] = β1 + β2 X2 + · · · + βp Xp and VAR[Y |X2] is a constant that does not depend on X2. Hence Y |X2 = β1 + β2 X2 + · · · + βp Xp + e follows the multiple linear regression model.
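A simulation sketch of Remark 1.4 (the MVN parameters below are arbitrary choices, not from the text): for MVN data, the OLS intercept and slopes from regressing Y on the other coordinates estimate the population quantities µ_Y − β^T µ_X and β = Σ_XX^{−1} Σ_XY.

# For MVN data, E(Y | x) is linear with slopes beta = solve(SigXX) %*% SigXY
set.seed(6)
n <- 1e5
SigXX <- matrix(c(2, 0.5, 0.5, 1), 2, 2); SigXY <- c(1, 0.3)
muX <- c(1, 2); muY <- 3; varY <- 2
Sigma <- rbind(c(varY, SigXY), cbind(SigXY, SigXX))       # Cov of (Y, X2, X3)
L <- t(chol(Sigma))
z <- L %*% matrix(rnorm(3 * n), nrow = 3) + c(muY, muX)   # columns are MVN draws
dat <- data.frame(y = z[1, ], x2 = z[2, ], x3 = z[3, ])
coef(lm(y ~ x2 + x3, data = dat))                         # sample intercept and slopes
c(muY - sum(solve(SigXX, SigXY) * muX), solve(SigXX, SigXY))  # population values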

1.6 Elliptically Contoured Distributions

Definition 1.34: Johnson (1987, pp. 107-108). A p × 1 random vector X has an elliptically contoured distribution, also called an elliptically symmetric distribution, if X has joint pdf

f(z) = kp |Σ|^{−1/2} g[(z − µ)^T Σ^{−1} (z − µ)],     (1.16)

and we say X has an elliptically contoured ECp(µ, Σ, g) distribution.

If X has an elliptically contoured (EC) distribution, then the characteristic function of X is

φ_X(t) = exp(i t^T µ) ψ(t^T Σ t)     (1.17)

for some function ψ. If the second moments exist, then

E(X) = µ     (1.18)

and

Cov(X) = c_X Σ     (1.19)

where

c_X = −2 ψ′(0).

Definition 1.35. The population squared Mahalanobis distance

U ≡ D² = D²(µ, Σ) = (X − µ)^T Σ^{−1} (X − µ).     (1.20)

For elliptically contoured distributions, U has pdf

h(u) = [π^{p/2}/Γ(p/2)] kp u^{p/2−1} g(u).     (1.21)

For c > 0, an ECp(µ, cI, g) distribution is spherical about µ where I is the p × p identity matrix. The multivariate normal distribution Np(µ, Σ) has kp = (2π)^{−p/2}, ψ(u) = g(u) = exp(−u/2), and h(u) is the χ²_p pdf.

The following lemma is useful for proving properties of EC distributions without using the characteristic function (1.17). See Eaton (1986) and Cook (1998, pp. 57, 130).

Lemma 1.31. Let X be a p × 1 random vector with 1st moments; i.e., E(X) exists. Let B be any constant full rank p × r matrix where 1 ≤ r ≤ p. Then X is elliptically contoured iff for all such conforming matrices B,

E(X | B^T X) = µ + M_B B^T (X − µ) = a_B + M_B B^T X     (1.22)

where the p × 1 constant vector a_B and the p × r constant matrix M_B both depend on B.

A useful fact is that a_B and M_B do not depend on g:

a_B = µ − M_B B^T µ = (Ip − M_B B^T) µ,

and

M_B = Σ B (B^T Σ B)^{−1}.

See Problem 1.19. Notice that in the formula for M_B, Σ can be replaced by cΣ where c > 0 is a constant. In particular, if the EC distribution has 2nd moments, Cov(X) can be used instead of Σ.

To use Lemma 1.31 to prove interesting properties, partition X, µ, and Σ. Let X1 and µ1 be q × 1 vectors, let X2 and µ2 be (p − q) × 1 vectors. Let Σ11 be a q × q matrix, let Σ12 be a q × (p − q) matrix, let Σ21 be a (p − q) × q matrix, and let Σ22 be a (p − q) × (p − q) matrix. Then

X = (X1^T, X2^T)^T,  µ = (µ1^T, µ2^T)^T,  and  Σ = ( Σ11  Σ12 )
                                                   ( Σ21  Σ22 ).

Also assume that the (p + 1) × 1 vector (Y, X^T)^T is EC_{p+1}(µ, Σ, g) where Y is a random variable, X is a p × 1 vector, and use

(Y, X^T)^T,  µ = (µ_Y, µ_X^T)^T,  and  Σ = ( Σ_YY  Σ_YX )
                                           ( Σ_XY  Σ_XX ).

Proposition 1.32. Let X ∼ ECp(µ, Σ, g) and assume that E(X) exists.

a) Any subset of X is EC, in particular X1 is EC.

b) (Cook 1998, p. 131; Kelker 1970). If Cov(X) is nonsingular,

Cov(X | B^T X) = d_g(B^T X)[Σ − Σ B (B^T Σ B)^{−1} B^T Σ]

where the real valued function d_g(B^T X) is constant iff X is MVN.

Proof of a). Let A be an arbitrary full rank q × r matrix where 1 ≤ r ≤ q. Let B be the p × r matrix with top block A and bottom block 0, so that B^T X = A^T X1 and B^T (X − µ) = A^T (X1 − µ1). Partition M_B into blocks M_{1B} (q × r) and M_{2B} ((p − q) × r) stacked on top of each other. Then

E[X | B^T X] = E[X | A^T X1] = µ + M_B B^T (X − µ)

by Lemma 1.31. Hence E[X1 | A^T X1] = µ1 + M_{1B} A^T (X1 − µ1). Since A was arbitrary, X1 is EC by Lemma 1.31. Notice that Σ B has top block Σ11 A and B^T Σ B = A^T Σ11 A, so

M_B = Σ B (B^T Σ B)^{−1} = ( M_{1B} )
                           ( M_{2B} ).

Hence

M_{1B} = Σ11 A (A^T Σ11 A)^{−1}

and X1 is EC with location and dispersion parameters µ1 and Σ11. □


Proposition 1.33. Let (Y, X^T)^T be EC_{p+1}(µ, Σ, g) where Y is a random variable.

a) Assume that E[(Y, X^T)^T] exists. Then E(Y |X) = α + β^T X where α = µ_Y − β^T µ_X and

β = Σ_XX^{−1} Σ_XY.

b) Even if the first moment does not exist, the conditional median

MED(Y |X) = α + β^T X

where α and β are given in a).

Proof. a) The trick is to choose B so that Lemma 1.31 applies. Let B be the (p + 1) × p matrix with first row 0^T and remaining rows Ip, so that B^T (Y, X^T)^T = X. Then B^T Σ B = Σ_XX and

Σ B = ( Σ_YX )
      ( Σ_XX ).

Now

E[ (Y, X^T)^T | X ] = E[ (Y, X^T)^T | B^T (Y, X^T)^T ] = µ + Σ B (B^T Σ B)^{−1} (X − µ_X)

by Lemma 1.31. The right hand side of the last equation has first coordinate

µ_Y − Σ_YX Σ_XX^{−1} µ_X + Σ_YX Σ_XX^{−1} X

and remaining coordinates X, and the result follows since

β^T = Σ_YX Σ_XX^{−1}.

b) See Croux et al. (2001) for references. □

Example 1.21. This example illustrates another application of Lemma 1.31. Suppose that X comes from a mixture of two multivariate normals with the same mean and proportional covariance matrices. That is, let

X ∼ (1 − γ) Np(µ, Σ) + γ Np(µ, cΣ)

where c > 0 and 0 < γ < 1. Since the multivariate normal distribution is elliptically contoured (and see Theorem 1.27),

E(X | B^T X) = (1 − γ)[µ + M1 B^T (X − µ)] + γ[µ + M2 B^T (X − µ)]
= µ + [(1 − γ) M1 + γ M2] B^T (X − µ) ≡ µ + M B^T (X − µ).

Since M_B only depends on B and Σ, it follows that M1 = M2 = M = M_B. Hence X has an elliptically contoured distribution by Lemma 1.31. See Problem 1.13 for a related result.

Let x ∼ Np(µ, Σ) and y ∼ χ²_d be independent. Let wi = xi/(y/d)^{1/2} for i = 1, ..., p. Then w has a multivariate t-distribution with parameters µ and Σ and degrees of freedom d, an important elliptically contoured distribution. Cornish (1954) showed that the covariance matrix of w is Cov(w) = [d/(d − 2)] Σ for d > 2. The case d = 1 is known as a multivariate Cauchy distribution. The joint pdf of w is

f(z) = [Γ((d + p)/2) |Σ|^{−1/2} / ((πd)^{p/2} Γ(d/2))] [1 + d^{−1} (z − µ)^T Σ^{−1} (z − µ)]^{−(d+p)/2}.

See Mardia et al. (1979, pp. 43, 57). See Johnson and Kotz (1972, p. 134) for the special case where the xi ∼ N(0, 1).
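A simulation sketch of the multivariate t construction above (the dimension, degrees of freedom, and Σ are arbitrary choices, and µ = 0 is used for simplicity): the sample covariance matrix of w is close to [d/(d − 2)]Σ.

# w = x / sqrt(y/d) with x ~ N_p(0, Sigma) and y ~ chi^2_d independent
set.seed(7)
p <- 2; d <- 5; n <- 1e5
Sigma <- matrix(c(2, 1, 1, 3), 2, 2)
x <- matrix(rnorm(n * p), ncol = p) %*% chol(Sigma)  # rows are N_p(0, Sigma)
y <- rchisq(n, df = d)
w <- x / sqrt(y / d)           # each row of x divided by its own sqrt(y/d)
cov(w)
(d / (d - 2)) * Sigma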

The following ECp(µ, Σ, g) distribution for a p × 1 random vector x is the uniform distribution on a hyperellipsoid, where f(z) = c for z in the hyperellipsoid and c is the reciprocal of the volume of the hyperellipsoid. The pdf of the distribution is

f(z) = [Γ(p/2 + 1)/((p + 2)π)^{p/2}] |Σ|^{−1/2} I[(z − µ)^T Σ^{−1} (z − µ) ≤ p + 2].

Then E(x) = µ by symmetry and it can be shown that Cov(x) = Σ.

If x ∼ Np(µ, Σ) and ui = exp(xi) for i = 1, ..., p, then u has a multivariate lognormal distribution with parameters µ and Σ. This distribution is not an elliptically contoured distribution. See Problem 1.5.

1.7 Summary

1) Statistical Learning techniques extract information from multivariate data. A case or observation consists of k random variables measured for one person or thing. The ith case zi = (zi1, ..., zik)^T. The training data consists of z1, ..., zn. A statistical model or method is fit (trained) on the training data. The test data consists of z_{n+1}, ..., z_{n+m}, and the test data is often used to evaluate the quality of the fitted model.

2) The focus of supervised learning is predicting a future value of the response variable Yf given xf and the training data (Y1, x1), ..., (Yn, xn). The focus of unsupervised learning is to group x1, ..., xn into clusters. Data mining is looking for relationships in large data sets.

3) For classical regression and multivariate analysis, we often want n ≥ 10p, and a model with n < 5p is overfitting: the model does not have enough data to estimate parameters accurately if x is p × 1. Statistical Learning methods often use a model with a "plug in" degrees of freedom d, where n ≥ Jd with J ≥ 5 and preferably J ≥ 10. A model is underfitting if it omits important predictors. Fix p; if the probability that a model underfits goes to 0 as the sample size n → ∞, then overfitting may not be too serious if n ≥ Jd. Underfitting can cause the model to fail to hold.

4) There are several important Statistical Learning principles.
i) There is more interest in prediction or classification, e.g. producing Yf, than in other types of inference.
ii) Often the focus is on extracting useful information when n/p is not large, e.g. p > n. If d is a "plug in" estimator of the fitted model degrees of freedom, we want n/d large. A sparse model has few nonzero coefficients. We can have sparse population models and sparse fitted models. Sometimes sparse fitted models are useful even if the population model is dense (not sparse). Often the number of nonzero coefficients of a sparse fitted model = d.
iii) Interest is in how well the method performs on test data. Performance on training data is overly optimistic for estimating performance on test data.
iv) Some methods are flexible while others are unflexible. For unflexible methods, the sufficient predictor is often a hyperplane SP = x^T β and often the mean function E(Y |x) = M(x^T β) where the function M is known but the p × 1 vector of parameters β is unknown and must be estimated (GLMs). Flexible methods tend to be useful for more complicated regression methods where E(Y |x) = m(x) for an unknown function m or SP ≠ x^T β (GAMs). Flexibility tends to increase with d.

5) Regression investigates how the response variable Y changes with the value of a p × 1 vector x of predictors. For a 1D regression model, Y is conditionally independent of x given the sufficient predictor SP = h(x), written Y ⫫ x | h(x), where the real valued function h : R^p → R. The estimated sufficient predictor ESP = ĥ(x). A response plot is a plot of the ESP versus the response Y. Often SP = x^T β and ESP = x^T β̂. A residual plot is a plot of the ESP versus the residuals. Tip: if the model for Y (more accurately for Y |x) depends on x only through the real valued function h(x), then SP = h(x).

6) a) The log rule states that a positive variable that has the ratio between the largest and smallest values greater than ten should be transformed to logs. So W > 0 and max(W)/min(W) > 10 suggests using log(W).

b) The ladder rule: to spread small values of a variable, make λ smaller; to spread large values of a variable, make λ larger.

7) Let the ladder of powers ΛL = {−1, −1/2, −1/3, 0, 1/3, 1/2, 1}. Let tλ(Z) = Z^λ for λ ≠ 0 and Y = t0(Z) = log(Z) for λ = 0. Consider the additive error regression model Y = m(x) + e. Then the response transformation model is Y = tλ(Z) = mλ(x) + e. Compute the "fitted values" Ŵi using Wi = tλ(Zi) as the "response." Then a transformation plot of Ŵi versus Wi is made for each of the seven values of λ ∈ ΛL with the identity line added as a visual aid. Make the transformations for λ ∈ ΛL, and choose the transformation with the best transformation plot, where the plotted points scatter about the identity line.

13) If X and Y are p × 1 random vectors, a a conformable constant vector, and A and B are conformable constant matrices, then

E(X + Y) = E(X) + E(Y), E(a + Y) = a + E(Y), & E(AXB) = A E(X) B.

Also

Cov(a + AX) = Cov(AX) = A Cov(X) A^T.

Note that E(AY) = A E(Y) and Cov(AY) = A Cov(Y) A^T.

14) If X ∼ Np(µ, Σ), then E(X) = µ and Cov(X) = Σ.

15) If X ∼ Np(µ, Σ) and if A is a q × p matrix, then AX ∼ Nq(Aµ, AΣA^T). If a is a p × 1 vector of constants, then X + a ∼ Np(µ + a, Σ).

16) All subsets of a MVN are MVN: (X_{k1}, ..., X_{kq})^T ∼ Nq(µ̃, Σ̃) where µ̃i = E(X_{ki}) and Σ̃ij = Cov(X_{ki}, X_{kj}). In particular, X1 ∼ Nq(µ1, Σ11) and X2 ∼ N_{p−q}(µ2, Σ22). If X ∼ Np(µ, Σ), then X1 and X2 are independent iff Σ12 = 0.

17) Let

(Y, X)^T ∼ N2( (µ_Y, µ_X)^T, Σ ),   Σ = (   σ²_Y      Cov(Y, X) )
                                        ( Cov(X, Y)     σ²_X    ).

Also recall that the population correlation between X and Y is given by

ρ(X, Y) = Cov(X, Y)/(√VAR(X) √VAR(Y)) = σ_{X,Y}/(σ_X σ_Y)

if σ_X > 0 and σ_Y > 0.

18) The conditional distribution of a MVN is MVN. If X ∼ Np(µ, Σ), then the conditional distribution of X1 given that X2 = x2 is multivariate normal with mean µ1 + Σ12 Σ22^{−1} (x2 − µ2) and covariance matrix Σ11 − Σ12 Σ22^{−1} Σ21. That is,

X1 | X2 = x2 ∼ Nq(µ1 + Σ12 Σ22^{−1} (x2 − µ2), Σ11 − Σ12 Σ22^{−1} Σ21).

19) Notation:

X1 | X2 ∼ Nq(µ1 + Σ12 Σ22^{−1} (X2 − µ2), Σ11 − Σ12 Σ22^{−1} Σ21).

20) Be able to compute the above quantities if X1 and X2 are scalars.

21) A p × 1 random vector X has an elliptically contoured distribution if X has joint pdf

f(z) = kp |Σ|^{−1/2} g[(z − µ)^T Σ^{−1} (z − µ)],     (1.23)

and we say X has an elliptically contoured ECp(µ, Σ, g) distribution. If the second moments exist, then

E(X) = µ     (1.24)

and

Cov(X) = c_X Σ     (1.25)

for some constant c_X > 0.

22) Let Xn be a sequence of random vectors with joint cdfs Fn(x) and let X be a random vector with joint cdf F(x).

a) Xn converges in distribution to X, written Xn →D X, if Fn(x) → F(x) as n → ∞ for all points x at which F(x) is continuous. The distribution of X is the limiting distribution or asymptotic distribution of Xn. Note that X does not depend on n.

b) Xn converges in probability to X, written Xn →P X, if for every ε > 0, P(‖Xn − X‖ > ε) → 0 as n → ∞.

23) Multivariate Central Limit Theorem (MCLT): If X1, ..., Xn are iid k × 1 random vectors with E(X) = µ and Cov(X) = Σx, then

√n (X̄n − µ) →D Nk(0, Σx)

where the sample mean X̄n = (1/n) Σ_{i=1}^n Xi.

24) Suppose √n (Tn − µ) →D Np(θ, Σ). Let A be a q × p constant matrix. Then A √n (Tn − µ) = √n (A Tn − A µ) →D Nq(A θ, A Σ A^T).

25) Suppose A is a conformable constant matrix and Xn →D X. Then A Xn →D A X.

26) A g × 1 random vector u has a mixture distribution of the uj with probabilities πj if u is equal to uj with probability πj. The cdf of u is F_u(t) = Σ_{j=1}^J πj F_{uj}(t) where the probabilities πj satisfy 0 ≤ πj ≤ 1 and Σ_{j=1}^J πj = 1, J ≥ 2, and F_{uj}(t) is the cdf of a g × 1 random vector uj. Then E(u) = Σ_{j=1}^J πj E[uj] and Cov(u) = E(u u^T) − E(u) E(u^T) = E(u u^T) − E(u)[E(u)]^T = Σ_{j=1}^J πj E[uj uj^T] − E(u)[E(u)]^T = Σ_{j=1}^J πj Cov(uj) + Σ_{j=1}^J πj E(uj)[E(uj)]^T − E(u)[E(u)]^T. If E(uj) = θ for j = 1, ..., J, then E(u) = θ and Cov(u) = Σ_{j=1}^J πj Cov(uj). Note that E(u)[E(u)]^T = Σ_{j=1}^J Σ_{k=1}^J πj πk E(uj)[E(uk)]^T.


1.8 Complements

Graphical response transformation methods similar to those in Section 1.2 include Cook and Olive (2001) and Olive (2004b, 2017a: section 3.2). A numerical method is given by Zhang and Yang (2017).

Section 1.3 followed Olive (2014, ch. 8) closely, which is a good Master's level treatment of large sample theory. There are several PhD level texts on large sample theory including, in roughly increasing order of difficulty, Lehmann (1999), Ferguson (1996), Sen and Singer (1993), and Serfling (1980). White (1984) considers asymptotic theory for econometric applications.

For a nonsingular matrix, the inverse of the matrix, the determinant of the matrix, and the eigenvalues of the matrix are continuous functions of the matrix. Hence if Σ̂ is a consistent estimator of Σ, then the inverse, determinant, and eigenvalues of Σ̂ are consistent estimators of the inverse, determinant, and eigenvalues of Σ > 0. See, for example, Bhatia et al. (1990), Stewart (1969), and Severini (2005, pp. 348-349).

Big Data
Sometimes n is huge and p is small. Then importance sampling and sequential analysis with sample size less than 1000 can be useful for inference for regression and time series models. Sometimes n is much smaller than p, for example with microarrays. Sometimes both n and p are large.

1.9 Problems

crancap hdlen hdht Data for 1.1

1485 175 132

1450 191 117

1460 186 122

1425 191 125

1430 178 120

1290 180 117

90 75 51

1.1∗. The table (W) above represents 3 head measurements on 6 people and one ape. Let X1 = cranial capacity, X2 = head length, and X3 = head height. Let x = (X1, X2, X3)^T. Several multivariate location estimators, including the coordinatewise median and sample mean, are found by applying a univariate location estimator to each random variable and then collecting the results into a vector. a) Find the coordinatewise median MED(W).

b) Find the sample mean x̄.

1.2. The table W shown below represents 4 measurements on 5 people.

age breadth cephalic size
39.00 149.5 81.9 3738
35.00 152.5 75.9 4261
35.00 145.5 75.4 3777
19.00 146.0 78.1 3904
0.06 88.5 77.6 933

a) Find the sample mean x̄.
b) Find the coordinatewise median MED(W).

1.3. Suppose x1, ..., xn are iid p × 1 random vectors from a multivariate t-distribution with parameters µ and Σ with d degrees of freedom. Then E(xi) = µ and Cov(xi) = [d/(d − 2)] Σ for d > 2. Assuming d > 2, find the limiting distribution of √n (x̄ − c) for appropriate vector c.

1.4. Suppose x1, ..., xn are iid p × 1 random vectors where E(xi) = e^{0.5} 1 and Cov(xi) = (e² − e) Ip. Find the limiting distribution of √n (x̄ − c) for appropriate vector c.

1.5. Suppose x1, ..., xn are iid 2 × 1 random vectors from a multivariate lognormal LN(µ, Σ) distribution. Let xi = (Xi1, Xi2)^T. Following Press (2005, pp. 149-150), E(Xij) = exp(µj + σ²j/2),

V(Xij) = exp(σ²j)[exp(σ²j) − 1] exp(2µj) for j = 1, 2, and

Cov(Xi1, Xi2) = exp[µ1 + µ2 + 0.5(σ²1 + σ²2) + σ12][exp(σ12) − 1]. Find the limiting distribution of √n (x̄ − c) for appropriate vector c.

1.6. The most used Poisson regression model is Y |x ∼ Poisson(exp(xT β)).What is the sufficient predictor SP = h(x)?

Fig. 1.4 Elastic Net Transformation Plots for Problem 1.7.

1.7. Let Z be the variable of interest and let Y = t(Z) be the response variable for the multiple linear regression model Y = x^T β + e. For the four transformation plots shown in Figure 1.4, n = 1000, and p = 4. The fitting method was the elastic net. What response transformation should be used?

1.8. The data set follows the multiple linear regression model Y = x^T β + e with n = 100 and p = 101. The response plots for two methods are shown. Which method fits the data better, lasso or ridge regression? For ridge regression, is anything wrong with yhat = Y?

Fig. 1.5 Response Plots for Problem 1.8.


1.9. For the Buxton (1920) data with multiple linear regression, height was the response variable while an intercept, head length, nasal height, bigonal breadth, and cephalic index were used as predictors in the multiple linear regression model. Observation 9 was deleted since it had missing values. Five individuals, cases 61–65, were reported to be about 0.75 inches tall with head lengths well over five feet! The response plot shown in Figure 1.4a) is for lasso. The response plot in Figure 1.4b) did lasso for the cases in the covmb2 set B applied to the predictors, and set B included all of the clean cases and omitted the 5 outliers. The response plot was made for all of the data, including the outliers. Both plots include the identity line and prediction interval bands.

Which method is better: Fig. 1.4 a) or Fig. 1.4 b) for data analysis?

1.10∗. Suppose that (X1, X2, X3, X4)^T ∼ N4(µ, Σ) with

µ = (49, 100, 17, 7)^T  and  Σ = (  3   1  −1   0 )
                                 (  1   6   1  −1 )
                                 ( −1   1   4   0 )
                                 (  0  −1   0   2 ).

a) Find the distribution of X2.

b) Find the distribution of (X1, X3)T .

c) Which pairs of random variables Xi and Xj are independent?

d) Find the correlation ρ(X1 , X3).

1.11∗. Recall that if X ∼ Np(µ, Σ), then the conditional distribution of X1 given that X2 = x2 is multivariate normal with mean µ1 + Σ12 Σ22^{−1} (x2 − µ2) and covariance matrix Σ11 − Σ12 Σ22^{−1} Σ21.

Let σ12 = Cov(Y, X) and suppose Y and X follow a bivariate normal distribution

(Y, X)^T ∼ N2( (49, 100)^T,  ( 16   σ12 ) )
                             ( σ12   25 ).

a) If σ12 = 0, find Y |X. Explain your reasoning.

b) If σ12 = 10, find E(Y |X).

c) If σ12 = 10, find Var(Y |X).

1.12. Let σ12 = Cov(Y, X) and suppose Y and X follow a bivariate normal distribution

(Y, X)^T ∼ N2( (15, 20)^T,  ( 64   σ12 ) )
                            ( σ12   81 ).

a) If σ12 = 10, find E(Y |X).


b) If σ12 = 10, find Var(Y |X).

c) If σ12 = 10, find ρ(Y,X), the correlation between Y and X.

1.13. Suppose that

X ∼ (1 − γ) ECp(µ, Σ, g1) + γ ECp(µ, cΣ, g2)

where c > 0 and 0 < γ < 1. Following Example 1.21, show that X has an elliptically contoured distribution assuming that all relevant expectations exist.

1.14. In Proposition 1.32b, show that if the second moments exist, then Σ can be replaced by Cov(X).

1.15. Using the notation in Proposition 1.33, show that if the second moments exist, then

Σ_XX^{−1} Σ_XY = [Cov(X)]^{−1} Cov(X, Y).

1.16. Using the notation under Lemma 1.31, show that if X is elliptically contoured, then the conditional distribution of X1 given that X2 = x2 is also elliptically contoured.

1.17∗. Suppose Y ∼ Nn(Xβ, σ²I). Find the distribution of (X^T X)^{−1} X^T Y if X is an n × p full rank constant matrix and β is a p × 1 constant vector.

1.18. Recall that Cov(X, Y) = E[(X − E(X))(Y − E(Y))^T]. Using the notation of Proposition 1.33, let (Y, X^T)^T be EC_{p+1}(µ, Σ, g) where Y is a random variable. Let the covariance matrix of (Y, X^T)^T be

Cov((Y, X^T)^T) = c ( Σ_YY  Σ_YX )  =  (  VAR(Y)     Cov(Y, X) )
                    ( Σ_XY  Σ_XX )     ( Cov(X, Y)    Cov(X)   )

where c is some positive constant. Show that E(Y |X) = α + β^T X where α = µ_Y − β^T µ_X and

β = [Cov(X)]^{−1} Cov(X, Y).

1.19. (Due to R.D. Cook.) Let X be a p × 1 random vector with E(X) = 0 and Cov(X) = Σ. Let B be any constant full rank p × r matrix where 1 ≤ r ≤ p. Suppose that for all such conforming matrices B,

E(X | B^T X) = M_B B^T X

where M_B is a p × r constant matrix that depends on B.

Using the fact that Σ B = Cov(X, B^T X) = E(X X^T B) = E[E(X X^T B | B^T X)], compute Σ B and show that M_B = Σ B (B^T Σ B)^{−1}. Hint: what acts as a constant in the inner expectation?

1.20. Let x be a p × 1 random vector with covariance matrix Cov(x). Let A be an r × p constant matrix and let B be a q × p constant matrix. Find Cov(Ax, Bx) in terms of A, B, and Cov(x).

1.21. Suppose that (X1, X2, X3, X4)^T ∼ N4(µ, Σ) with

µ = (9, 16, 4, 1)^T  and  Σ = (   1     0.8   −0.4    0 )
                              (  0.8     1    −0.56   0 )
                              ( −0.4   −0.56    1     0 )
                              (   0      0      0     1 ).

a) Find the distribution of X3.

b) Find the distribution of (X2, X4)T .

c) Which pairs of random variables Xi and Xj are independent?

d) Find the correlation ρ(X1 , X3).

1.22. Suppose x1, ..., xn are iid p × 1 random vectors where

xi ∼ (1 − γ) Np(µ, Σ) + γ Np(µ, cΣ)

with 0 < γ < 1 and c > 0. Then E(xi) = µ and Cov(xi) = [1 + γ(c − 1)]Σ. Find the limiting distribution of √n (x̄ − d) for appropriate vector d.

1.23. Let X be an n × p constant matrix and let β be a p × 1 constantvector. Suppose Y ∼ Nn(Xβ, σ2I). Find the distribution of HY if HT =H = H2 is an n× n matrix and if HX = X . Simplify.

1.24. Recall that if X ∼ Np(µ, Σ), then the conditional distribution of X1 given that X2 = x2 is multivariate normal with mean µ1 + Σ12 Σ22^{−1} (x2 − µ2) and covariance matrix Σ11 − Σ12 Σ22^{−1} Σ21. Let Y and X follow a bivariate normal distribution

(Y, X)^T ∼ N2( (134, 96)^T,  ( 24.5   1.1  ) )
                             (  1.1   23.0 ).

a) Find E(Y |X).

b) Find Var(Y |X).

1.25. Suppose that (X1, X2, X3, X4)^T ∼ N4(µ, Σ) with

µ = (1, 7, 3, 0)^T  and  Σ = ( 4  0  2  1 )
                             ( 0  1  0  0 )
                             ( 2  0  3  1 )
                             ( 1  0  1  5 ).

a) Find the distribution of (X1, X4)^T.

b) Which pairs of random variables Xi and Xj are independent?

c) Find the correlation ρ(X1, X4).

1.26. Suppose that (X1, X2, X3, X4)^T ∼ N4(µ, Σ) with

µ = (3, 4, 2, 3)^T  and  Σ = ( 3  2  1  1 )
                             ( 2  4  1  0 )
                             ( 1  1  2  0 )
                             ( 1  0  0  3 ).

a) Find the distribution of (X1, X3)^T.

b) Which pairs of random variables Xi and Xj are independent?

c) Find the correlation ρ(X1, X3).

1.27. Suppose that (X1, X2, X3, X4)^T ∼ N4(µ, Σ) with

µ = (49, 25, 9, 4)^T  and  Σ = (  2  −1   3  0 )
                               ( −1   5  −3  0 )
                               (  3  −3   5  0 )
                               (  0   0   0  4 ).

a) Find the distribution of (X1, X3)^T.

b) Which pairs of random variables Xi and Xj are independent?

c) Find the correlation ρ(X1, X3).

1.28. Recall that if X ∼ Np(µ, Σ), then the conditional distribution of X1 given that X2 = x2 is multivariate normal with mean µ1 + Σ12 Σ22^{−1} (x2 − µ2) and covariance matrix Σ11 − Σ12 Σ22^{−1} Σ21. Let Y and X follow a bivariate normal distribution

(Y, X)^T ∼ N2( (49, 17)^T,  (  3  −1 ) )
                            ( −1   4 ).

a) Find E(Y |X).

b) Find Var(Y |X).

1.29. Following Srivastava and Khatri (1979, p. 47), let X = (X1^T, X2^T)^T ∼ Np(µ, Σ) with µ = (µ1^T, µ2^T)^T, where X1 and µ1 are q × 1, X2 and µ2 are (p − q) × 1, and Σ is partitioned into the corresponding blocks Σ11, Σ12, Σ21, Σ22.

a) Show that the nonsingular linear transformation

( I  −Σ12 Σ22^{−1} ) ( X1 )   ( X1 − Σ12 Σ22^{−1} X2 )
( 0        I       ) ( X2 ) = (          X2          )

∼ Np( ( µ1 − Σ12 Σ22^{−1} µ2 ),  ( Σ11 − Σ12 Σ22^{−1} Σ21    0  ) )
      (          µ2           )  (            0             Σ22 ).

b) Then X1 − Σ12 Σ22^{−1} X2 and X2 are independent, and

X1 − Σ12 Σ22^{−1} X2 ∼ Nq(µ1 − Σ12 Σ22^{−1} µ2, Σ11 − Σ12 Σ22^{−1} Σ21).

By independence, X1 − Σ12 Σ22^{−1} X2 has the same distribution as (X1 − Σ12 Σ22^{−1} X2) | X2, and the term −Σ12 Σ22^{−1} X2 is a constant, given X2. Use this result to show that

X1 | X2 ∼ Nq(µ1 + Σ12 Σ22^{−1} (X2 − µ2), Σ11 − Σ12 Σ22^{−1} Σ21).

R Problem

Use the command source("G:/slpack.txt") to download the functions and the command source("G:/sldata.txt") to download the data. See Preface or Section 11.1. Typing the name of the slpack function, e.g. tplot2, will display the code for the function. Use the args command, e.g. args(tplot2), to display the needed arguments for the function. For the following problem, the R command can be copied and pasted from (http://lagrange.math.siu.edu/Olive/slrhw.txt) into R.

1.30. This problem uses some of the R commands at the end of Section 1.2.1. A problem with response and residual plots is that there can be a lot of black in the plot if the sample size n is large (more than a few thousand). A variant of the response plot for the additive error regression model Y = m(x) + e would plot the identity line and the two lines parallel to the identity line corresponding to the Section 2.1 large sample 100(1 − δ)% prediction intervals for Yf that depend on Ŷf. Then plot points corresponding to training data cases that do not lie in their 100(1 − δ)% PI. We will use δ = 0.01, n = 100000, and p = 8.

a) Copy and paste the commands for this part into R. They make the usual response plot with a lot of black. Do not include the plot in Word.

b) Copy and paste the commands for this part into R. They make the response plot with the points within the pointwise 99% prediction interval bands omitted. Include this plot in Word. For example, left click on the plot and hit the Ctrl and c keys at the same time to make a copy. Then paste the plot into Word, e.g., get into Word and hit the Ctrl and v keys at the same time.

c) The additive error regression model is a 1D regression model. What is the sufficient predictor SP = h(x)?

1.31. The slpack function tplot2 makes transformation plots for the multiple linear regression model Y = t(Z) = x^T β + e. Type = 1 for full model OLS and should not be used if n < 5p, type = 2 for elastic net, 3 for lasso, 4 for ridge regression, 5 for PLS, 6 for PCR, and 7 for forward selection with Cp if n ≥ 10p and EBIC if n < 10p. These methods are discussed in Chapter 3.

Copy and paste the three library commands near the top of slrhw into R. For parts a) and b), n = 100, p = 4 and Y = log(Z) = 0x1 + x2 + 0x3 + 0x4 + e = x2 + e. (Y and Z are swapped in the R code.)

a) Copy and paste the commands for this part into R. This makes the response plot for the elastic net using Y = Z and x when the linear model needs Y = log(Z). Do not include the plot in Word, but explain why the plot suggests that something is wrong with the model Z = x^T β + e.

b) Copy and paste the command for this part into R. Right click Stop 3 times until the horizontal axis has log(z). This is the response plot for the true model Y = log(Z) = x^T β + e = x2 + e. Include the plot in Word. Right click Stop 3 more times so that the cursor returns in the command window.

c) Is the response plot linear?

For the remaining parts, n = p − 1 = 100 and Y = log(Z) = 0x1 + x2 + 0x3 + · · · + 0x101 + e = x2 + e. Hence the model is sparse.

d) Copy and paste the commands for this part into R. Right click Stop 3 times until the horizontal axis has log(z). This is the response plot for the true model Y = log(Z) = x^T β + e = x2 + e. Include the plot in Word. Right click Stop 3 more times so that the cursor returns in the command window.

e) Is the plot linear?

f) Copy and paste the commands for this part into R. Right click Stop 3 times until the horizontal axis has log(z). This is the response plot for the true model Y = log(Z) = x^T β + e = x2 + e. Include the plot in Word. Right click Stop 3 more times so that the cursor returns in the command window. PLS is probably overfitting since the identity line nearly interpolates the fitted points.

1.32. Get the R commands for this problem. The data is such that Y = 2 + x2 + x3 + x4 + e where the zero mean errors are iid [exponential(2) − 2]. Hence the residual and response plots should show high skew. Note that β = (2, 1, 1, 1)^T. The R code uses 3 nontrivial predictors and a constant, and the sample size n = 1000.

a) Copy and paste the commands for part a) of this problem into R. Include the response plot in Word. Is the lowess curve fairly close to the identity line?

b) Copy and paste the commands for part b) of this problem into R. Include the residual plot in Word: press the Ctrl and c keys at the same time. Then use the menu command "Paste" in Word. Is the lowess curve fairly close to the r = 0 line? The lowess curve is a flexible scatterplot smoother.

c) The output out$coef gives β̂. Write down β̂ or copy and paste β̂ into Word. Is β̂ close to β?

1.33. For the Buxton (1920) data with multiple linear regression, height was the response variable while an intercept, head length, nasal height, bigonal breadth, and cephalic index were used as predictors in the multiple linear regression model. Observation 9 was deleted since it had missing values. Five individuals, cases 61–65, were reported to be about 0.75 inches tall with head lengths well over five feet!

a) Copy and paste the commands for this problem into R. Include the lasso response plot in Word. The identity line passes right through the outliers, which are obvious because of the large gap. Prediction interval (PI) bands are also included in the plot.

b) Copy and paste the commands for this problem into R. Include the lasso response plot in Word. This did lasso for the cases in the covmb2 set B applied to the predictors, which included all of the clean cases and omitted the 5 outliers. The response plot was made for all of the data, including the outliers.

c) Copy and paste the commands for this problem into R. Include the DD plot in Word. The outliers are in the upper right corner of the plot.

1.34. Consider the Gladstone (1905) data set that has 12 variables on 267 persons after death. There are 5 infants in the data set. The response variable was brain weight. Head measurements were breadth, circumference, head height, length, and size as well as cephalic index and brain weight. Age, height, and three categorical variables cause, ageclass (0: under 20, 1: 20-45, 2: over 45) and sex were also given. The constant x1 was the first variable. The variables cause and ageclass were not coded as factors. Coding as factors might improve the fit.

a) Copy and paste the commands for this problem into R. Include the lasso response plot in Word. The identity line passes right through the infants, which are obvious because of the large gap. Prediction interval (PI) bands are also included in the plot.

b) Copy and paste the commands for this problem into R. Include the lasso response plot in Word. This did lasso for the cases in the covmb2 set B applied to the nontrivial predictors which are not categorical (omit the constant, cause, ageclass and sex), which omitted 8 cases, including the 5 infants. The response plot was made for all of the data.

c) Copy and paste the commands for this problem into R. Include the DD plot in Word. The infants are in the upper right corner of the plot.

1.35. The slpack function mldsim6 compares 7 estimators: FCH, RFCH, CMVE, RCMVE, RMVN, covmb2, and MB described in Olive (2017b, ch. 4). Most of these estimators need n > 2p, need a nonsingular dispersion matrix, and work best with n > 10p. The function generates data sets and counts how many times the minimum Mahalanobis distance Di(T, C) of the outliers is larger than the maximum distance of the clean data. The value pm controls how far the outliers need to be from the bulk of the data, and pm roughly needs to increase with √p.

For data sets with p > n possible, the function mldsim7 used the Euclidean distances Di(T, Ip) and the Mahalanobis distances Di(T, Cd) where Cd is the diagonal matrix with the same diagonal entries as C, and (T, C) is the covmb2 estimator using j concentration type steps. Dispersion matrices are affected more by outliers than good robust location estimators, so when the outlier proportion is high, it is expected that the Euclidean distances Di(T, Ip) will outperform the Mahalanobis distances Di(T, Cd) for many outlier configurations. Again the function counts the number of times the minimum outlier distance is larger than the maximum distance of the clean data.

Both functions used several outlier types. The simulations generated 100 data sets. The clean data had xi ∼ Np(0, diag(1, ..., p)). Type 1 had outliers in a tight cluster (near point mass) at the major axis (0, ..., 0, pm)^T. Type 2 had outliers in a tight cluster at the minor axis (pm, 0, ..., 0)^T. Type 3 had mean shift outliers xi ∼ Np((pm, ..., pm)^T, diag(1, ..., p)). Type 4 changed the pth coordinate of the outliers to pm. Type 5 changed the 1st coordinate of the outliers to pm. (If the outlier xi = (xi1, ..., xip)^T, then xi1 = pm.)

Table 1.2 Number of Times All Outlier Distances > Clean Distances, otype=1

n    p   γ     osteps  pm  FCH  RFCH  CMVE  RCMVE  RMVN  covmb2  MB
100  10  0.25  0       20   85    85    85     85    86      67  89

a) Table 1.2 suggests that with osteps = 0, covmb2 had the worst count. When pm is increased to 25, all counts become 100. Copy and paste the commands for this part into R and make a table similar to Table 1.2, but now osteps = 9 and p = 45 is close to n/2 for the second line where pm = 60. Your table should have 2 lines from output.

Table 1.3 Number of Times All Outlier Distances > Clean Distances, otype=1

n    p     γ    osteps  pm    covmb2  diag
100  1000  0.4  0       1000  100     41
100  1000  0.4  9        600  100     42

b) Copy and paste the commands for this part into R and make a table similar to Table 1.3, but type 2 outliers are used.

c) When you have two reasonable outlier detectors, there are outlier configurations where one will beat the other. Simulations by Wang (2018) suggest that “covmb2” using Di(T, Ip) outperforms “diag” using Di(T, Cd) for many outlier configurations, but there are some exceptions. Copy and paste the commands for this part into R and make a table similar to Table 1.3, but type 3 outliers are used.
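The following is a minimal base R sketch of the distance comparison described in this problem; it does not use the slpack/linmodpack functions. The coordinatewise median plays the role of the robust location estimator T and coordinatewise MADs give the diagonal entries of Cd, and the values of n, p, γ, and pm below are illustrative choices rather than the ones used in the tables.

# base R sketch: clean N_p(0, diag(1,...,p)) data with type 3 mean shift outliers;
# compare Euclidean distances D_i(T, I_p) with diagonal distances D_i(T, C_d)
set.seed(1)
n <- 100; p <- 1000; gam <- 0.4; pm <- 60
nout <- ceiling(gam * n)
x <- matrix(rnorm(n * p), n, p) %*% diag(sqrt(1:p))   # clean data
x[1:nout, ] <- x[1:nout, ] + pm                       # type 3 mean shift outliers
ctr  <- apply(x, 2, median)                           # robust location estimator
scl  <- apply(x, 2, mad)                              # robust scales for C_d
cx   <- sweep(x, 2, ctr)                              # centered data
deuc  <- sqrt(rowSums(cx^2))                          # D_i(T, I_p)
ddiag <- sqrt(rowSums(sweep(cx, 2, scl, "/")^2))      # D_i(T, C_d)
# is every outlier distance larger than the maximum clean distance?
min(deuc[1:nout])  > max(deuc[-(1:nout)])
min(ddiag[1:nout]) > max(ddiag[-(1:nout)])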


Chapter 2

Full Rank Linear Models

2.1 Projection Matrices and the Column Space

Vector spaces, subspaces, and column spaces should be familiar from linear algebra, but are reviewed below.

Definition 2.1. A set V ⊂ R^k is a vector space if for any vectors x, y, z ∈ V, and scalars a and b, the operations of vector addition and scalar multiplication are defined as follows.
1) (x + y) + z = x + (y + z).
2) x + y = y + x.
3) There exists 0 ∈ V such that x + 0 = x = 0 + x.
4) For any x ∈ V, there exists y = −x such that x + y = y + x = 0.
5) a(x + y) = ax + ay.
6) (a + b)x = ax + bx.
7) (ab)x = a(bx).
8) 1x = x.

Hence for a vector space, addition is associative and commutative, there is an additive identity vector 0, there is an additive inverse −x for each x ∈ V, scalar multiplication is distributive and associative, and 1 is the scalar identity element.

Two important vector spaces are R^k and V = {0}. Showing that a set M is a subspace is a common method to show that M is a vector space.

Definition 2.2. Let M be a nonempty subset of a vector space V. If i) ax ∈ M ∀ x ∈ M and for any scalar a, and ii) x + y ∈ M ∀ x, y ∈ M, then M is a vector space known as a subspace.

Definition 2.3. The set of all linear combinations of x1, ..., xn is the vector space known as span(x1, ..., xn) = {y ∈ R^k : y = Σ_{i=1}^n ai xi for some constants a1, ..., an}.


Definition 2.4. Let x1, ..., xk ∈ V. If ∃ scalars α1, ..., αk not all zero such that Σ_{i=1}^k αi xi = 0, then x1, ..., xk are linearly dependent. If Σ_{i=1}^k αi xi = 0 only if αi = 0 ∀ i = 1, ..., k, then x1, ..., xk are linearly independent. Suppose {x1, ..., xk} is a linearly independent set and V = span(x1, ..., xk). Then {x1, ..., xk} is a linearly independent spanning set for V, known as a basis.

Definition 2.5. Let A = [a1 a2 ... am] be an n × m matrix. The space spanned by the columns of A = column space of A = C(A). Then C(A) = {y ∈ R^n : y = Aw for some w ∈ R^m} = {y : y = w1 a1 + w2 a2 + · · · + wm am for some scalars w1, ..., wm} = span(a1, ..., am).

The space spanned by the rows of A is the row space of A. The row space of A is the column space C(A^T) of A^T. Note that

Aw = [a1 a2 ... am] (w1, ..., wm)^T = Σ_{i=1}^m wi ai.

With the design matrix X, different notation is used to denote the columns of X since both the columns and rows of X are important. Let

X = [v1 v2 ... vp] = (x1^T, ..., xn^T)^T

be an n × p matrix, with jth column vj and ith row xi^T. Note that C(X) = {y ∈ R^n : y = Xb for some b ∈ R^p}. Hence Xb is a typical element of C(X) and Aw is a typical element of C(A). Note that

Xb = (x1^T b, ..., xn^T b)^T = [v1 v2 ... vp] (b1, ..., bp)^T = Σ_{i=1}^p bi vi.

If the function Xf(b) = Xb, where the f indicates that the operation Xf : R^p → R^n is being treated as a function, then C(X) is the range of Xf.

Hence some authors call the column space of A the range of A. Let B be n × k, and let A be n × m. One way to show C(A) = C(B) is to show that i) ∀ x ∈ R^m, ∃ y ∈ R^k such that Ax = By ∈ C(B), so C(A) ⊆ C(B), and ii) ∀ y ∈ R^k, ∃ x ∈ R^m such that By = Ax ∈ C(A), so C(B) ⊆ C(A). Another way to show C(A) = C(B) is to show that a basis for C(A) is also a basis for C(B).

Definition 2.6. The dimension of a vector space V = dim(V) = the number of vectors in a basis of V. The rank of a matrix A = rank(A) = dim(C(A)), the dimension of the column space of A. Let A be n × m. Then rank(A) = rank(A^T) ≤ min(m, n). If rank(A) = min(m, n), then A has full rank, or A is a full rank matrix.

Definition 2.7. The null space of A = N(A) = {x : Ax = 0} = kernel of A. The nullity of A = dim[N(A)]. The subspace V⊥ = {y ∈ R^k : y ⊥ V} is the orthogonal complement of V, where y ⊥ V means y^T x = 0 ∀ x ∈ V. N(A^T) = [C(A)]⊥, so N(A) = [C(A^T)]⊥.

Theorem 2.1: Rank Nullity Theorem. Let A be n × m. Then rank(A) + dim(N(A)) = m.

Generalized inverses are useful for the non-full rank linear model and for defining projection matrices.

Definition 2.8. A generalized inverse of an n × m matrix A is any m × n matrix A^− satisfying A A^− A = A.

Other names are conditional inverse, pseudo inverse, g-inverse, and p-inverse. Usually a generalized inverse is not unique, but if A^{-1} exists, then A^− = A^{-1} is unique.

Notation: G := A^− means G is a generalized inverse of A.

Recall that if A is idempotent, then A² = A. A matrix A is tripotent if A³ = A. For both these cases, A := A^− since AAA = A. It will turn out that symmetric idempotent matrices are projection matrices.

Definition 2.9. Let V be a subspace of R^n. Then every y ∈ R^n can be expressed uniquely as y = w + z where w ∈ V and z ∈ V⊥. Let X = [v1 v2 ... vp] be n × p, and let V = C(X) = span(v1, ..., vp). Then the n × n matrix P_V = P_X is a projection matrix on C(X) if P_X y = w ∀ y ∈ R^n. (Here y = w + z = w_y + z_y, so w depends on y.)

Note: Some authors call a projection matrix an “orthogonal projection matrix,” and call an idempotent matrix a “projection matrix.”

Theorem 2.2: Projection Matrix Theorem. a) P_X is unique.
b) P_X = X(X^T X)^− X^T where (X^T X)^− is any generalized inverse of X^T X.
c) A is a projection matrix on C(A) iff A is symmetric and idempotent. Hence P_X is a projection matrix on C(P_X) = C(X), and P_X is symmetric and idempotent. Also, each column pi of P_X satisfies P_X pi = pi ∈ C(X).
d) I_n − P_X is the projection matrix on [C(X)]⊥.
e) A = P_X iff i) y ∈ C(X) implies Ay = y and ii) y ⊥ C(X) implies Ay = 0.
f) P_X X = X, and P_X W = W if each column of W ∈ C(X).
g) P_X vi = vi.
h) If C(X_R) is a subspace of C(X), then P_X P_{X_R} = P_{X_R} P_X = P_{X_R}.
i) The eigenvalues of P_X are 0 or 1.
j) Let tr(A) = trace(A). Then rank(P_X) = tr(P_X).
k) P_X is singular unless X is a nonsingular n × n matrix, and then P_X = I_n.
l) Let X = [Z X_r] where rank(X) = rank(X_r) = r, so the columns of X_r form a basis for C(X). Then

( 0   0                  )
( 0   (X_r^T X_r)^{-1}   )

is a generalized inverse of X^T X, and P_X = X_r (X_r^T X_r)^{-1} X_r^T.

Some results from linear algebra are needed to prove parts of the above theorem. Unless told otherwise, matrices in this text are real. Then the eigenvalues of a symmetric matrix A are real. If A is symmetric, then rank(A) = number of nonzero eigenvalues of A. Recall that if AB is a square matrix, then tr(AB) = tr(BA). Similarly, if A1 is m1 × m2, A2 is m2 × m3, ..., A_{k−1} is m_{k−1} × m_k, and A_k is m_k × m1, then tr(A1 A2 · · · Ak) = tr(Ak A1 A2 · · · A_{k−1}) = tr(A_{k−1} Ak A1 A2 · · · A_{k−2}) = · · · = tr(A2 A3 · · · Ak A1). Also note that a scalar is a 1 × 1 matrix, so tr(a) = a.

For part d), note that if y = w + z, then (I_n − P_X)y = z ∈ [C(X)]⊥. Hence the result follows from the definition of a projection matrix by interchanging the roles of w and z. Part e) follows from the definition of a projection matrix since if y ∈ C(X) then y = y + 0 where y = w and 0 = z. If y ⊥ C(X) then y = 0 + y where 0 = w and y = z. Part g) is a special case of f). In k), P_X is singular unless p = n since rank(X) = r ≤ p < n unless p = n, and P_X is an n × n matrix. Need rank(P_X) = n for P_X to be nonsingular. For h), P_X P_{X_R} = P_{X_R} by f) since each column of P_{X_R} ∈ C(P_X). Taking transposes and using symmetry shows P_{X_R} P_X = P_{X_R}. For i), if λ is an eigenvalue of P_X, then for some x ≠ 0, λx = P_X x = P_X² x = λ² x since P_X is idempotent by c). Hence λ = λ² is real since P_X is symmetric, so λ = 0 or λ = 1. Then j) follows from i) since rank(P_X) = number of nonzero eigenvalues of P_X = tr(P_X).

2.2 Quadratic Forms

Definition 2.10. Let A be an n × n matrix and let x ∈ R^n. Then a quadratic form is x^T A x = Σ_{i=1}^n Σ_{j=1}^n a_{ij} x_i x_j, and a linear form is Ax. Suppose A is a symmetric matrix. Then A is positive definite (A > 0) if x^T A x > 0 ∀ x ≠ 0, and A is positive semidefinite (A ≥ 0) if x^T A x ≥ 0 ∀ x.

Notation: The matrix A in a quadratic form x^T A x will be symmetric unless told otherwise. Suppose B is not symmetric. Since the quadratic form is a scalar, x^T B x = (x^T B x)^T = x^T B^T x = x^T (B + B^T) x / 2, and the matrix A = (B + B^T)/2 is symmetric. If A ≥ 0 then the eigenvalues λi of A are real and nonnegative. If A ≥ 0, let λ1 ≥ λ2 ≥ · · · ≥ λn ≥ 0. If A > 0, then λn > 0. Some authors say symmetric A is nonnegative definite if A ≥ 0, and that A is positive semidefinite if A ≥ 0 and there exists a nonzero x such that x^T A x = 0. Then A is singular.

The spectral decomposition theorem is very useful. One application for linear models is defining the square root matrix.

Theorem 2.3: Spectral Decomposition Theorem. Let A be an n × n symmetric matrix with eigenvector eigenvalue pairs (λ1, t1), (λ2, t2), ..., (λn, tn) where ti^T ti = 1 and ti^T tj = 0 if i ≠ j for i = 1, ..., n. Hence A ti = λi ti. Then the spectral decomposition of A is

A = Σ_{i=1}^n λi ti ti^T = λ1 t1 t1^T + · · · + λn tn tn^T.

Let T = [t1 t2 · · · tn] be the n × n orthogonal matrix with ith column ti. Then T T^T = T^T T = I. Let Λ = diag(λ1, ..., λn) and let Λ^{1/2} = diag(√λ1, ..., √λn). Then A = T Λ T^T.

Definition 2.11. If A is a positive definite n × n symmetric matrix with spectral decomposition A = Σ_{i=1}^n λi ti ti^T, then A = T Λ T^T and

A^{-1} = T Λ^{-1} T^T = Σ_{i=1}^n (1/λi) ti ti^T.

The square root matrix A^{1/2} = T Λ^{1/2} T^T is a positive definite symmetric matrix such that A^{1/2} A^{1/2} = A.
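The square root matrix and the spectral form of the inverse are easy to compute with eigen() in R; the short sketch below is only an illustration for a randomly generated positive definite A.

# square root matrix A^{1/2} = T Lambda^{1/2} T^T (Definition 2.11)
set.seed(1)
Z <- matrix(rnorm(25), 5, 5)
A <- crossprod(Z) + diag(5)          # a positive definite symmetric matrix
es <- eigen(A, symmetric = TRUE)     # A = T Lambda T^T
Ahalf <- es$vectors %*% diag(sqrt(es$values)) %*% t(es$vectors)
Ainv  <- es$vectors %*% diag(1 / es$values) %*% t(es$vectors)
all.equal(Ahalf %*% Ahalf, A)        # A^{1/2} A^{1/2} = A
all.equal(Ainv, solve(A))            # spectral form of A^{-1}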

The following theorem is often useful. Both the expected value and trace are linear operators. Hence tr(A + B) = tr(A) + tr(B), and E[tr(X)] = tr(E[X]) when the expected value of the random matrix X exists.

Theorem 2.4: expected value of a quadratic form. Let x be a random vector with E(x) = µ and Cov(x) = Σ. Then

E(x^T A x) = tr(AΣ) + µ^T A µ.

Proof. Two proofs are given. i) Searle (1971, p. 55): Note that E(xx^T) = Σ + µµ^T. Since the quadratic form is a scalar and the trace is a linear operator, E[x^T A x] = E[tr(x^T A x)] = E[tr(A x x^T)] = tr(E[A x x^T]) = tr(AΣ + Aµµ^T) = tr(AΣ) + tr(Aµµ^T) = tr(AΣ) + µ^T A µ.


ii) Graybill (1976, p. 140): Using E(x_i x_j) = σ_{ij} + µ_i µ_j, E[x^T A x] = Σ_{i=1}^n Σ_{j=1}^n a_{ij} E(x_i x_j) = Σ_{i=1}^n Σ_{j=1}^n a_{ij}(σ_{ij} + µ_i µ_j) = tr(AΣ) + µ^T A µ. □
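Theorem 2.4 is easy to check by simulation. The sketch below is a minimal Monte Carlo illustration with a multivariate normal x (any distribution with the stated mean and covariance would do); the particular µ, Σ, and A are arbitrary choices.

# Monte Carlo check of E(x^T A x) = tr(A Sigma) + mu^T A mu
set.seed(1)
p <- 4
mu <- c(1, 2, -1, 0.5)
Sigma <- 0.5^abs(outer(1:p, 1:p, "-"))        # an AR(1)-type covariance matrix
A <- crossprod(matrix(rnorm(p * p), p, p))    # a symmetric matrix
R <- chol(Sigma)                              # Sigma = R^T R
x <- matrix(rnorm(1e5 * p), 1e5, p) %*% R + rep(mu, each = 1e5)
qf <- rowSums((x %*% A) * x)                  # x_i^T A x_i for each row
mean(qf)                                      # simulated E(x^T A x)
sum(diag(A %*% Sigma)) + drop(t(mu) %*% A %*% mu)   # tr(A Sigma) + mu^T A mu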

Many of the theoretical results for quadratic forms assume that the e_i are iid N(0, σ²). These exact results are often special cases of large sample theory that holds for a large class of iid zero mean error distributions that have V(e_i) ≡ σ². For linear models, Y is typically an n × 1 random vector. The following theorem from statistical inference will be useful.

Theorem 2.5. Suppose x ⫫ y (x and y are independent), g(x) is a function of x alone, and h(y) is a function of y alone. Then g(x) ⫫ h(y).

The following theorem shows that independence of linear forms implies independence of quadratic forms.

Theorem 2.6. If A and B are symmetric matrices and AY ⫫ BY, then Y^T A Y ⫫ Y^T B Y.

Proof. Let g(AY) = Y^T A^T A^− A Y = Y^T A A^− A Y = Y^T A Y, and let h(BY) = Y^T B^T B^− B Y = Y^T B B^− B Y = Y^T B Y. Then the result follows by Theorem 2.5. □

Theorem 2.7. Let Y ∼ Nn(µ, Σ). Let u = AY and w = BY. Then AY ⫫ BY iff Cov(u, w) = AΣB^T = 0 iff BΣA^T = 0. Note that if Σ = σ²I_n, then AY ⫫ BY iff AB^T = 0 iff BA^T = 0.

Proof. Note that the stacked vector

( u )   ( AY )   ( A )
( w ) = ( BY ) = ( B ) Y

has a multivariate normal distribution. Hence AY ⫫ BY iff Cov(u, w) = 0. Taking transposes shows Cov(u, w) = AΣB^T = 0 iff BΣA^T = 0. □

One of the most useful theorems for proving that Y^T A Y ⫫ Y^T B Y is Craig’s Theorem. Taking transposes shows AΣB = 0 iff BΣA = 0. Note that if AΣB = 0, then (∗) holds. Note that AΣB = 0 is a sufficient condition for Y^T A Y ⫫ Y^T B Y if Σ ≥ 0, but necessary and sufficient if Σ > 0.


Theorem 2.8: Craig’s Theorem. Let Y ∼ Nn(µ, Σ).
a) If Σ > 0, then Y^T A Y ⫫ Y^T B Y iff AΣB = 0 iff BΣA = 0.
b) If Σ ≥ 0, then Y^T A Y ⫫ Y^T B Y if AΣB = 0 (or if BΣA = 0).
c) If Σ ≥ 0, then Y^T A Y ⫫ Y^T B Y iff

(∗) ΣAΣBΣ = 0, ΣAΣBµ = 0, ΣBΣAµ = 0, and µ^T AΣB µ = 0.

Proof. For a) and b), AΣB = 0 implies Y^T A Y ⫫ Y^T B Y by c), or by Theorems 2.5, 2.6, and 2.7. See Reid and Driscoll (1988) for why Y^T A Y ⫫ Y^T B Y implies AΣB = 0 in a).

c) See Driscoll and Krasnicka (1995).

For the following theorem, note that if A = A^T = A², then A is a projection matrix since A is symmetric and idempotent. An n × n projection matrix A is not a full rank matrix unless A = I_n. See Theorem 2.2 j) and k). Often results are given for Y ∼ Nn(0, I), and then the Y ∼ Nn(0, σ²I) case is handled as in b) below, since Y/σ ∼ Nn(0, I).

Theorem 2.9. Let A = A^T be symmetric.
a) Let Y ∼ Nn(0, I). Then Y^T A Y ∼ χ²_r iff A is idempotent of rank r = tr(A), since then A is a projection matrix.
b) Let Y ∼ Nn(0, σ²I). Then Y^T A Y / σ² ∼ χ²_r, or Y^T A Y ∼ σ² χ²_r, iff A is idempotent of rank r.
c) If Y ∼ Nn(0, Σ) where Σ > 0, then Y^T A Y ∼ χ²_r iff AΣ is idempotent of rank r = rank(A).

Note that a) is a special case of c). To see that b) holds, note Z = Y/σ ∼ Nn(0, I). Hence by a),

Y^T A Y / σ² = Z^T A Z ∼ χ²_r

iff A is idempotent of rank r.

The following theorem is a corollary of Craig’s Theorem.

Theorem 2.10. Let Y ∼ Nn(0, I_n), with A and B symmetric. If Y^T A Y ∼ χ²_r and Y^T B Y ∼ χ²_d, then Y^T A Y ⫫ Y^T B Y iff AB = 0.

Theorem 2.11. If Y ∼ Nn(µ, Σ) with Σ > 0, then the population squared Mahalanobis distance (Y − µ)^T Σ^{-1} (Y − µ) ∼ χ²_n.

Proof. Let Z = Σ^{-1/2}(Y − µ) ∼ Nn(0, I). Then Z = (Z1, ..., Zn)^T where the Zi are iid N(0, 1). Hence (Y − µ)^T Σ^{-1}(Y − µ) = Z^T Z = Σ_{i=1}^n Zi² ∼ χ²_n. □


For large sample theory, the noncentral χ² distribution is important. If Z1, ..., Zn are independent N(0, 1) random variables, then Σ_{i=1}^n Zi² ∼ χ²_n. The noncentral χ²(n, γ) distribution is the distribution of Σ_{i=1}^n Yi² where Y1, ..., Yn are independent N(µi, 1) random variables. Note that if Y ∼ N(µ, 1), then Y² ∼ χ²(n = 1, γ = µ²/2), and if Y ∼ N(√(2γ), 1), then Y² ∼ χ²(n = 1, γ).

Definition 2.12. Suppose Y1, ..., Yn are independent N(µi, 1) random variables so that Y = (Y1, ..., Yn)^T ∼ Nn(µ, I_n). Then Y^T Y = Σ_{i=1}^n Yi² ∼ χ²(n, γ = µ^T µ/2), a noncentral χ²(n, γ) distribution, with n degrees of freedom and noncentrality parameter γ = µ^T µ/2 = (1/2) Σ_{i=1}^n µi² ≥ 0. The noncentrality parameter δ = µ^T µ = 2γ is also used. If W ∼ χ²_n, then W ∼ χ²(n, 0), so γ = 0. The χ²_n distribution is also called the central χ² distribution.

Some of the proof ideas for the following theorem came from Marden (2012, pp. 48, 96-97). Recall that if Y1, ..., Yk are independent with moment generating functions (mgfs) mYi(t), then the mgf of Σ_{i=1}^k Yi is m_{ΣYi}(t) = Π_{i=1}^k mYi(t). If Y ∼ χ²(n, γ), then the probability density function (pdf) of Y is rather hard to use, but is given by

f(y) = Σ_{j=0}^∞ [e^{−γ} γ^j / j!] [y^{n/2+j−1} e^{−y/2} / (2^{n/2+j} Γ(n/2 + j))] = Σ_{j=0}^∞ pγ(j) f_{n+2j}(y)

where pγ(j) = P(W = j) is the probability mass function of a Poisson(γ) random variable W, and f_{n+2j}(y) is the pdf of a χ²_{n+2j} random variable. If γ = 0, define γ^0 = 1 in the first sum, and p0(0) = 1 with p0(j) = 0 for j > 0 in the second sum. For computing moments and the moment generating function, the integration and summation operations can be interchanged. Hence ∫_0^∞ f(y) dy = Σ_{j=0}^∞ pγ(j) ∫_0^∞ f_{n+2j}(y) dy = Σ_{j=0}^∞ pγ(j) = 1. Similarly, if m_{n+2j}(t) = (1 − 2t)^{−(n+2j)/2} is the mgf of a χ²_{n+2j} random variable, then the mgf of Y is mY(t) = E(e^{tY}) = ∫_0^∞ e^{ty} f(y) dy = Σ_{j=0}^∞ pγ(j) ∫_0^∞ e^{ty} f_{n+2j}(y) dy = Σ_{j=0}^∞ pγ(j) m_{n+2j}(t).

Theorem 2.12. a) If Y ∼ χ²(n, γ), then the moment generating function of Y is mY(t) = (1 − 2t)^{−n/2} exp(−γ[1 − (1 − 2t)^{−1}]) = (1 − 2t)^{−n/2} exp[2γt/(1 − 2t)] for t < 0.5.

b) If Yi ∼ χ²(ni, γi) are independent for i = 1, ..., k, then Σ_{i=1}^k Yi ∼ χ²(Σ_{i=1}^k ni, Σ_{i=1}^k γi).

c) If Y ∼ χ²(n, γ), then E(Y) = n + 2γ and V(Y) = 2n + 8γ.

Proof. Two proofs are given for each part. a) i) From the above remarks, and using e^x = Σ_{j=0}^∞ x^j/j!,

mY(t) = Σ_{j=0}^∞ [e^{−γ} γ^j / j!] (1 − 2t)^{−(n+2j)/2} = (1 − 2t)^{−n/2} Σ_{j=0}^∞ e^{−γ} [γ/(1 − 2t)]^j / j!
= (1 − 2t)^{−n/2} exp(−γ + γ/(1 − 2t)) = (1 − 2t)^{−n/2} exp(2γt/(1 − 2t)).

ii) Let W ∼ N(√δ, 1) where δ = 2γ. Then W² ∼ χ²(1, δ/2) = χ²(1, γ). Let W ⫫ X where X ∼ χ²_{n−1} ∼ χ²(n − 1, 0), and let Y = W² + X ∼ χ²(n, γ) by b). Then

mW²(t) = E(e^{tW²}) = ∫_{−∞}^∞ e^{tw²} (1/√(2π)) exp[−(1/2)(w − √δ)²] dw
= ∫_{−∞}^∞ (1/√(2π)) exp[tw² − (1/2)(w² − 2√δ w + δ)] dw
= ∫_{−∞}^∞ (1/√(2π)) exp[−(1/2)(w² − 2tw² − 2√δ w + δ)] dw
= ∫_{−∞}^∞ (1/√(2π)) exp[−(1/2)(w²(1 − 2t) − 2√δ w + δ)] dw
= ∫_{−∞}^∞ (1/√(2π)) exp[−(1/2) A] dw

where A = [√(1 − 2t) (w − b)]² + c with

b = √δ/(1 − 2t) and c = −2tδ/(1 − 2t)

after algebra. Hence

mW²(t) = e^{−c/2} √(1/(1 − 2t)) ∫_{−∞}^∞ [1/(√(2π) √(1/(1 − 2t)))] exp[−(w − b)²/(2 [1/(1 − 2t)])] dw = e^{−c/2} √(1/(1 − 2t))

since the integral = 1 = ∫_{−∞}^∞ f(w) dw where f(w) is the N(b, 1/(1 − 2t)) pdf. Thus

mW²(t) = [1/√(1 − 2t)] exp(tδ/(1 − 2t)).

So mY(t) = mW²+X(t) = mW²(t) mX(t) =

[1/√(1 − 2t)] exp(tδ/(1 − 2t)) [1/(1 − 2t)]^{(n−1)/2} = (1 − 2t)^{−n/2} exp(tδ/(1 − 2t)) = (1 − 2t)^{−n/2} exp(2γt/(1 − 2t)).

b) i) By a),

m_{ΣYi}(t) = Π_{i=1}^k mYi(t) = Π_{i=1}^k (1 − 2t)^{−ni/2} exp(−γi[1 − (1 − 2t)^{−1}])
= (1 − 2t)^{−Σ ni/2} exp(−Σ_{i=1}^k γi [1 − (1 − 2t)^{−1}]),

the χ²(Σ_{i=1}^k ni, Σ_{i=1}^k γi) mgf.

ii) Let Yi = Zi^T Zi where the Zi ∼ N_{ni}(µi, I_{ni}) are independent. Stack the Zi into Z = (Z1^T, Z2^T, ..., Zk^T)^T ∼ N_{Σni}(µ_Z, I_{Σni}) where µ_Z = (µ1^T, ..., µk^T)^T. Then

Z^T Z = Σ_{i=1}^k Zi^T Zi = Σ_{i=1}^k Yi ∼ χ²(Σ_{i=1}^k ni, γ_Z) where γ_Z = µ_Z^T µ_Z / 2 = Σ_{i=1}^k µi^T µi / 2 = Σ_{i=1}^k γi.

c) i) Let W ∼ χ²(1, γ) ⫫ X ∼ χ²_{n−1} ∼ χ²(n − 1, 0). Then by b), Y = W + X ∼ χ²(n, γ). Let Z ∼ N(0, 1) and δ = 2γ. Then √δ + Z ∼ N(√δ, 1), and W = (√δ + Z)². Thus E(W) = E[(√δ + Z)²] = δ + 2√δ E(Z) + E(Z²) = δ + 1. Using the binomial theorem

(x + y)^n = Σ_{i=0}^n (n choose i) x^i y^{n−i}

with x = √δ, y = Z, and n = 4,

E(W²) = E[(√δ + Z)^4] = E[δ² + 4δ^{3/2} Z + 6δZ² + 4√δ Z³ + Z^4] = δ² + 6δ + 3

since E(Z) = E(Z³) = 0, and E(Z^4) = 3 by Problem 2.8. Hence V(W) = E(W²) − [E(W)]² = δ² + 6δ + 3 − (δ + 1)² = δ² + 6δ + 3 − δ² − 2δ − 1 = 4δ + 2. Thus E(Y) = E(W) + E(X) = δ + 1 + n − 1 = n + δ = n + 2γ, and V(Y) = V(W) + V(X) = 4δ + 2 + 2(n − 1) = 2n + 4δ = 2n + 8γ.

ii) Let Zi ∼ N(µi, 1) so E(Zi²) = σ² + µi² = 1 + µi². By Problem 2.8, E(Zi³) = µi³ + 3µi, and E(Zi^4) = µi^4 + 6µi² + 3. Hence Y ∼ χ²(n, γ) where Y = Z^T Z = Σ_{i=1}^n Zi² with Z ∼ Nn(µ, I). So E(Y) = Σ_{i=1}^n E(Zi²) = Σ_{i=1}^n (1 + µi²) = n + µ^T µ = n + 2γ, and V(Y) = Σ_{i=1}^n V(Zi²) =

Σ_{i=1}^n [E(Zi^4) − (E[Zi²])²] = Σ_{i=1}^n [µi^4 + 6µi² + 3 − µi^4 − 2µi² − 1] = Σ_{i=1}^n [4µi² + 2] = 2n + 4µ^T µ = 2n + 8γ. □
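The mean and variance in Theorem 2.12 c) are easy to check by simulation; the sketch below is only an illustration. Note that R parameterizes the noncentral χ² by ncp = δ = 2γ (the sum of the squared means), not by γ.

# simulation check of E(Y) = n + 2*gamma and V(Y) = 2n + 8*gamma
set.seed(1)
n <- 5; gam <- 3
y <- rchisq(1e6, df = n, ncp = 2 * gam)   # R's ncp is delta = 2*gamma
c(mean(y), n + 2 * gam)                   # simulated vs theoretical mean
c(var(y), 2 * n + 8 * gam)                # simulated vs theoretical variance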

For the following theorem, see Searle (1971, p. 57). Most of the results in Theorem 2.14 are corollaries of Theorem 2.13. Recall that the matrix in a quadratic form is symmetric, unless told otherwise.

Theorem 2.13. If Y ∼ Nn(µ, Σ) where Σ > 0, then Y^T A Y ∼ χ²(rank(A), µ^T A µ/2) iff AΣ is idempotent.

Theorem 2.14. a) If Y ∼ Nn(0, Σ) where Σ is a projection matrix, then Y^T Y ∼ χ²(rank(Σ)) where rank(Σ) = tr(Σ).
b) If Y ∼ Nn(0, I), then Y^T A Y ∼ χ²_r iff A is idempotent with rank(A) = tr(A) = r.
c) If Y ∼ Nn(0, Σ) where Σ > 0, then Y^T A Y ∼ χ²_r iff AΣ is idempotent and rank(A) = r = rank(AΣ).
d) If Y ∼ Nn(µ, σ²I), then Y^T Y / σ² ∼ χ²(n, µ^T µ/(2σ²)).
e) If Y ∼ Nn(µ, I), then Y^T A Y ∼ χ²(r, µ^T A µ/2) iff A is idempotent with rank(A) = tr(A) = r.
f) If Y ∼ Nn(µ, σ²I), then Y^T A Y / σ² ∼ χ²(r, µ^T A µ/(2σ²)) iff A is idempotent with rank(A) = tr(A) = r.

The following theorem is useful for constructing ANOVA tables. See Searle (1971, pp. 60-61).

Theorem 2.15: Generalized Cochran’s Theorem. Let Y ∼ Nn(µ, Σ). Let Ai = Ai^T have rank ri for i = 1, ..., k, and let A = Σ_{i=1}^k Ai = A^T have rank r. Then Y^T Ai Y ∼ χ²(ri, µ^T Ai µ/2), the Y^T Ai Y are independent, and Y^T A Y ∼ χ²(r, µ^T A µ/2), iff
I) any 2 of a) AiΣ are idempotent ∀ i, b) AiΣAj = 0 ∀ i < j, c) AΣ is idempotent, are true;
or II) c) is true and d) r = Σ_{i=1}^k ri;
or III) c) is true and e) A1Σ, ..., A_{k−1}Σ are idempotent and AkΣ ≥ 0 is singular.

2.3 Least Squares Theory

Definition 2.13. Estimating equations are used to find estimators of unknown parameters. The least squares criterion and the log likelihood for maximum likelihood estimators are important examples.


Estimating equations are often used with a model, like Y = Xβ + e, and often have a variable β that is used in the equations to find the estimator β̂ of the vector of parameters in the model. For example, the log likelihood log(L(β, σ²)) has β and σ² as variables for a parametric statistical model where β and σ² are fixed unknown parameters, and maximizing the log likelihood with respect to these variables gives the maximum likelihood estimators of the parameters β and σ². So the term β is both a variable in the estimating equations, which could be replaced by another variable such as η, and a vector of parameters in the model. In the theorem below, we could replace η by β, where β is a vector of parameters in the linear model and a variable in the least squares criterion, which is an estimating equation.

Theorem 2.16. Let θ = Xη ∈ C(X) where Yi = xi^T η + ri(η) and the residual ri(η) depends on η. The least squares estimator β̂ is the value of η ∈ R^p that minimizes the least squares criterion

Σ_{i=1}^n ri²(η) = ‖Y − Xη‖².

Proof. Following Seber and Lee (2003, p. 36), let Ŷ = θ̂ = P_X Y ∈ C(X), r = (I − P_X)Y ∈ [C(X)]⊥, and θ ∈ C(X). Then (Y − θ̂)^T (θ̂ − θ) = (Y − P_X Y)^T (P_X Y − P_X θ) = Y^T (I − P_X) P_X (Y − θ) = 0 since P_X θ = θ. Thus

‖Y − θ‖² = (Y − θ̂ + θ̂ − θ)^T (Y − θ̂ + θ̂ − θ) = ‖Y − θ̂‖² + ‖θ̂ − θ‖² + 2(Y − θ̂)^T (θ̂ − θ) ≥ ‖Y − θ̂‖²

with equality iff ‖θ̂ − θ‖² = 0 iff θ̂ = θ = Xη. Since θ̂ = Xβ̂, the result follows. □

Definition 2.14. The normal equations are

X^T X β̂ = X^T Y.

To see that the normal equations hold, note that r = Y − Ŷ = (I − P_X)Y ⊥ C(X). Thus r ∈ [C(X)]⊥ = N(X^T), and X^T (Y − Ŷ) = 0. Hence X^T Y = X^T Ŷ = X^T X β̂.
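The normal equations can be solved directly in R and compared with lm(); the sketch below uses simulated data and is only an illustration of the computation, not part of the theory.

# solve X^T X betahat = X^T Y directly and compare with lm()
set.seed(1)
n <- 100; p <- 4
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
beta <- c(2, 1, 1, 1)
Y <- drop(X %*% beta + rnorm(n))
bhat <- solve(t(X) %*% X, t(X) %*% Y)   # normal equations
drop(bhat)
coef(lm(Y ~ X[, -1]))                   # same estimates from lm()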

The maximum likelihood estimator uses the log likelihood as an estimating equation. Note that it is crucial to observe that the likelihood function is a function of θ (and that y1, ..., yn act as fixed constants). Also, if the MLE θ̂ exists, then θ̂ ∈ Θ, the parameter space.

Definition 2.15. Let f(y|θ) be the joint pdf of Y1, ..., Yn. If Y = y is observed, then the likelihood function L(θ) = f(y|θ). For each sample point y = (y1, ..., yn), let θ̂(y) be a parameter value at which L(θ|y) attains its maximum as a function of θ with y held fixed. Then a maximum likelihood estimator (MLE) of the parameter θ based on the sample Y is θ̂(Y).


Definition 2.16. Let the log likelihood of θ1 and θ2 be log[L(θ1, θ2)]. If θ̂2 is the MLE of θ2, then the log profile likelihood is log[Lp(θ1)] = log[L(θ1, θ̂2)].

We can often fix σ and then show that β̂ is the MLE by direct maximization. Then the MLE σ̂ or σ̂² can be found by maximizing the log profile likelihood function log[Lp(σ)] or log[Lp(σ²)] where Lp(σ) = L(σ, β = β̂).

Remark 2.1. a) Know how to find the max and min of a function h that is continuous on an interval [a, b] and differentiable on (a, b). Solve h′(x) ≡ 0 and find the places where h′(x) does not exist. These values are the critical points. Evaluate h at a, b, and the critical points. One of these values will be the min and one the max.
b) Assume h is continuous. Then a critical point θo is a local max of h(θ) if h is increasing for θ < θo in a neighborhood of θo and if h is decreasing for θ > θo in a neighborhood of θo. The first derivative test is often used.
c) If h is strictly concave (d²h(θ)/dθ² < 0 for all θ), then any local max of h is a global max.
d) Suppose h′(θo) = 0. The 2nd derivative test states that if d²h(θo)/dθ² < 0, then θo is a local max.
e) If h(θ) is a continuous function on an interval with endpoints a < b (not necessarily finite), and differentiable on (a, b), and if the critical point is unique, then the critical point is a global maximum if it is a local maximum (because otherwise there would be a local minimum and the critical point would not be unique). To show that θ̂ is the MLE (the global maximizer of h(θ) = log L(θ)), show that log L(θ) is differentiable on (a, b). Then show that θ̂ is the unique solution to the equation d log L(θ)/dθ = 0 and that the 2nd derivative evaluated at θ̂ is negative: d² log L(θ)/dθ² |_{θ̂} < 0. Similar remarks hold for finding σ̂² using the profile likelihood.

Theorem 2.17. Let Y = Xβ + e = Ŷ + r where X is full rank, and Y ∼ Nn(Xβ, σ²I). Then the MLE of β is the least squares estimator β̂, and the MLE of σ² is RSS/n = (n − p)MSE/n.

Proof. The likelihood function is

L(β, σ²) = (2πσ²)^{−n/2} exp(−‖y − Xβ‖²/(2σ²)).

For fixed σ², maximizing the likelihood is equivalent to maximizing exp(−‖y − Xβ‖²/(2σ²)), which is equivalent to minimizing ‖y − Xβ‖². But the least squares estimator minimizes ‖y − Xβ‖² by Theorem 2.16. Hence β̂ is the MLE of β.

Let Q = ‖y − Xβ̂‖². Then the MLE of σ² can be found by maximizing the log profile likelihood log(LP(σ²)) where

LP(σ²) = (2πσ²)^{−n/2} exp(−Q/(2σ²)).

Let τ = σ². Then log(LP(σ²)) = c − (n/2) log(σ²) − Q/(2σ²), and log(LP(τ)) = c − (n/2) log(τ) − Q/(2τ). Hence

d log(LP(τ))/dτ = −n/(2τ) + Q/(2τ²),

and setting this derivative equal to 0 gives −nτ + Q = 0, or nτ = Q, or

τ̂ = Q/n = σ̂² = Σ_{i=1}^n ri²/n = [(n − p)/n] MSE,

which is a unique solution. Now

d² log(LP(τ))/dτ² |_{τ=τ̂} = [n/(2τ²) − 2Q/(2τ³)] |_{τ=τ̂} = n/(2τ̂²) − 2nτ̂/(2τ̂³) = −n/(2τ̂²) < 0.

Thus by Remark 2.1, σ̂² is the MLE of σ². □
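A short numerical illustration of Theorem 2.17 is given below for simulated normal data: the MLE of σ² is RSS/n, which equals (n − p)MSE/n from the lm() fit. The data generating values are arbitrary choices.

# check that the MLE of sigma^2 equals RSS/n = (n - p) MSE / n
set.seed(1)
n <- 50; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
Y <- drop(X %*% c(1, 2, -1) + rnorm(n, sd = 2))
fit <- lm(Y ~ X[, -1])
RSS <- sum(resid(fit)^2)
MSE <- RSS / (n - p)
c(RSS / n, (n - p) * MSE / n)   # two forms of the MLE of sigma^2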

There are two ways to compute β̂: use β̂ = (X^T X)^{-1} X^T Y, or use sample covariance matrices. The population OLS coefficients are defined below. Let xi^T = (1 ui^T) where ui is the vector of nontrivial predictors. Let (1/n) Σ_{j=1}^n X_{jk} = X̄_{ok} = ū_{ok} for k = 2, ..., p. The subscript “ok” means sum over the first subscript j. Let ū = (ū_{o,2}, ..., ū_{o,p})^T be the sample mean of the ui. Note that regressing on u is equivalent to regressing on x if there is an intercept β1 in the model.

Definition 2.17. Using the above notation, let xi^T = (1 ui^T), and let β^T = (β1 βS^T) where β1 is the intercept and the slopes vector βS = (β2, ..., βp)^T. Let the population covariance matrices

Cov(u) = E[(u − E(u))(u − E(u))^T] = Σu, and Cov(u, Y) = E[(u − E(u))(Y − E(Y))] = Σ_{uY}.

Then the population coefficients from an OLS regression of Y on u (even if a linear model does not hold) are

β1 = E(Y) − βS^T E(u) and βS = Σu^{-1} Σ_{uY}.

Definition 2.18. Let the sample covariance matrices be

Σ̂u = [1/(n − 1)] Σ_{i=1}^n (ui − ū)(ui − ū)^T and Σ̂_{uY} = [1/(n − 1)] Σ_{i=1}^n (ui − ū)(Yi − Ȳ).

Let the method of moments or maximum likelihood estimators be

Σ̃u = (1/n) Σ_{i=1}^n (ui − ū)(ui − ū)^T and Σ̃_{uY} = (1/n) Σ_{i=1}^n (ui − ū)(Yi − Ȳ) = (1/n) Σ_{i=1}^n ui Yi − ū Ȳ.

Definition 2.19. The notation “θ̂ →P θ as n → ∞” means that θ̂ is a consistent estimator of θ, or, equivalently, that θ̂ converges in probability to θ.

Lemma 2.18: Seber and Lee (2003, p. 106). Let X = (1 X1). Then

X^T Y = ( nȲ      )   ( nȲ              )
        ( X1^T Y  ) = ( Σ_{i=1}^n ui Yi ),   X^T X = ( n     nū^T     )
                                                     ( nū    X1^T X1  ),

and

(X^T X)^{-1} = ( 1/n + ū^T D^{-1} ū    −ū^T D^{-1} )
               ( −D^{-1} ū              D^{-1}     )

where the (p − 1) × (p − 1) matrix D^{-1} = [(n − 1) Σ̂u]^{-1} = Σ̂u^{-1}/(n − 1).

Theorem 2.19: Second way to compute β̂. Suppose that wi = (Yi, ui^T)^T are iid random vectors such that σ²_Y, Σu^{-1}, and Σ_{uY} exist. Then β̂1 = Ȳ − β̂S^T ū →P β1 and

β̂S = [n/(n − 1)] Σ̂u^{-1} Σ̃_{uY} = Σ̂u^{-1} Σ̂_{uY} = Σ̃u^{-1} Σ̃_{uY} →P βS as n → ∞.

The above theorem can be proved using Lemma 2.18 and the fact that the sample covariance matrices are consistent estimators of the population covariance matrices. It is important to note that this result is for iid wi with second moments: a linear model Y = Xβ + e does not need to hold. Also, X is a random matrix, and the least squares regression is conditional on X. Some properties of the least squares estimators and related quantities are given below, where X is a constant matrix.
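The second way to compute β̂ is easy to verify numerically; the sketch below builds the slopes from sample covariance matrices and the intercept from the sample means, and compares the result with lm(). It is an illustration with simulated data, not part of the theory.

# Theorem 2.19: betahat_S = Sigmahat_u^{-1} Sigmahat_uY, betahat_1 = Ybar - betahat_S^T ubar
set.seed(1)
n <- 200; q <- 3                      # q = p - 1 nontrivial predictors
u <- matrix(rnorm(n * q), n, q)
Y <- drop(1 + u %*% c(2, -1, 0.5) + rnorm(n))
bS <- solve(cov(u), cov(u, Y))        # slopes from sample covariance matrices
b1 <- mean(Y) - drop(crossprod(bS, colMeans(u)))
c(b1, drop(bS))
coef(lm(Y ~ u))                       # same values from lm()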

Theorem 2.20. Let Y = Xβ + e = Ŷ + r where X is full rank, E(e) = 0, and Cov(e) = σ²I. Let P = P_X be the projection matrix on C(X), so Ŷ = P Y, r = Y − Ŷ = (I − P)Y, and P X = X so X^T P = X^T.
i) The predictor variables and residuals are orthogonal. Hence the columns of X and the residual vector are orthogonal: X^T r = 0.
ii) E(Y) = Xβ.
iii) Cov(Y) = Cov(e) = σ²I.
iv) The fitted values and residuals are uncorrelated: Cov(r, Ŷ) = 0.
v) The least squares estimator β̂ is an unbiased estimator of β: E(β̂) = β.
vi) Cov(β̂) = σ²(X^T X)^{-1}.

Proof. i) X^T r = X^T (I − P)Y = 0Y = 0, while ii) and iii) are immediate.

iv) Cov(r, Ŷ) = E([r − E(r)][Ŷ − E(Ŷ)]^T) = E([(I − P)Y − (I − P)E(Y)][P Y − P E(Y)]^T) = E[(I − P)[Y − E(Y)][Y − E(Y)]^T P] = (I − P) σ²I P = σ²(I − P)P = 0.

v) E(β̂) = E[(X^T X)^{-1} X^T Y] = (X^T X)^{-1} X^T E[Y] = (X^T X)^{-1} X^T Xβ = β.

vi) Cov(β̂) = Cov[(X^T X)^{-1} X^T Y] = Cov(AY) = A Cov(Y) A^T = σ²(X^T X)^{-1} X^T I X (X^T X)^{-1} = σ²(X^T X)^{-1}. □

Definition 2.20. Let a, b, and c be n × 1 constant vectors. A linear estimator a^T Y of c^T θ is the best linear unbiased estimator (BLUE) of c^T θ if E(a^T Y) = c^T θ, and for any other unbiased linear estimator b^T Y of c^T θ, Var(a^T Y) ≤ Var(b^T Y).

The following theorem is useful for finding the BLUE when X has full rank. Note that if Z is a random variable, then the covariance matrix of Z is Cov(Z) = Cov(Z, Z) = V(Z). Note that the theorem shows that c^T X β̂ = a^T β̂ is the BLUE of c^T Xβ = a^T β where a^T = c^T X and θ = Xβ.

Theorem 2.21: Gauss Markov Theorem. Let Y = Xβ + e where X is full rank, E(e) = 0, and Cov(e) = σ²I. Then a^T β̂ is the BLUE of a^T β for every constant p × 1 vector a.

Proof. (Guttman (1982, pp. 141-142):) Note that a^T (X^T X)^{-1} X^T Y = a^T β̂ = d^T Y is linear, and by Theorem 2.20 vi),

V(a^T β̂) = V(d^T Y) = Cov(a^T β̂) = a^T Cov(β̂) a = σ² a^T (X^T X)^{-1} a.   (2.1)

Let c^T Y be any other linear unbiased estimator of a^T β. Then E(c^T Y) = a^T β = c^T E(Y) = c^T Xβ for any β ∈ R^p, the parameter space of β. Hence a^T = c^T X. Recall that

V(Z − X) = Cov(Z − X) = V(Z) + V(X) − 2 Cov(Z, X).   (2.2)

Now Cov(c^T Y, a^T β̂) = Cov(c^T Y, d^T Y) = c^T Cov(Y) d = σ² c^T d = σ² c^T X (X^T X)^{-1} a = σ² a^T (X^T X)^{-1} a = V(a^T β̂) by (2.1). Hence

Cov(c^T Y, d^T Y) = V(a^T β̂) = V(d^T Y).   (2.3)

By (2.2), 0 ≤ V(c^T Y − a^T β̂) = V(c^T Y − d^T Y) = V(c^T Y) + V(d^T Y) − 2 Cov(c^T Y, d^T Y) = V(c^T Y) + V(d^T Y) − 2 V(d^T Y) = V(c^T Y) − V(d^T Y) by (2.3). Thus V(c^T Y) ≥ V(d^T Y) = V(a^T β̂). □

The following theorem gives some properties of the least squares estimator β̂ and the MSE under the normal least squares model. Similar properties will be developed without the normality assumption.

Theorem 2.22. Suppose Y = Xβ + e where X is full rank, e ∼ Nn(0, σ²I), and Y ∼ Nn(Xβ, σ²I).
a) β̂ ∼ Np(β, σ²(X^T X)^{-1}).
b) (β̂ − β)^T X^T X (β̂ − β) / σ² ∼ χ²_p.
c) β̂ ⫫ MSE.
d) RSS/σ² = (n − p)MSE/σ² ∼ χ²_{n−p}.

Proof. Let P = P_X.

a) Since A = (X^T X)^{-1} X^T is a constant matrix,

β̂ = AY ∼ Np(A E(Y), A Cov(Y) A^T) ∼ Np((X^T X)^{-1} X^T Xβ, σ²(X^T X)^{-1} X^T I X (X^T X)^{-1}) ∼ Np(β, σ²(X^T X)^{-1}).

b) The population Mahalanobis distance of β̂ is

(β̂ − β)^T X^T X (β̂ − β) / σ² = (β̂ − β)^T [Cov(β̂)]^{-1} (β̂ − β) ∼ χ²_p

by Theorem 2.11.

c) Since Cov(β̂, r) = Cov((X^T X)^{-1} X^T Y, (I − P)Y) = σ²(X^T X)^{-1} X^T I (I − P) = 0, β̂ ⫫ r. Thus β̂ ⫫ RSS = ‖r‖², and β̂ ⫫ MSE.

d) Since P X = X and X^T P = X^T, it follows that X^T (I − P) = 0 and (I − P)X = 0. Thus RSS = r^T r = Y^T (I − P)Y = (Y − Xβ)^T (I − P)(Y − Xβ) = e^T (I − P)e.

Since e ∼ Nn(0, σ²I), by Theorem 2.9 b), e^T (I − P)e / σ² ∼ χ²_{n−p} where n − p = rank(I − P) = tr(I − P). □


2.3.1 Hypothesis Testing

Suppose Y = Xβ + e where rank(X) = p, E(e) = 0, and Cov(e) = σ²I. Let L be an r × p constant matrix with rank(L) = r, let c be an r × 1 constant vector, and consider testing H0 : Lβ = c. First theory will be given for when e ∼ Nn(0, σ²I). The large sample theory will be given for when the iid zero mean ei have V(ei) = σ². Note that the normal model will satisfy the large sample theory conditions.

The partial F test, and its special cases the ANOVA F test and the Wald t test, use c = 0. Let the full model use Y, x1 ≡ 1, x2, ..., xp, and let the reduced model use Y, x1 = x_{j1} ≡ 1, x_{j2}, ..., x_{jk} where {j1, ..., jk} ⊂ {1, ..., p} and j1 = 1. Here 1 ≤ k < p, and if k = 1, then the model is Yi = β1 + ei. Hence the full model is Yi = β1 + β2 xi,2 + · · · + βp xi,p + ei, while the reduced model is Yi = β1 + β_{j2} xi,j2 + · · · + β_{jk} xi,jk + ei. In matrix form, the full model is Y = Xβ + e and the reduced model is Y = X_R β_R + e_R where the columns of X_R are a proper subset of the columns of X. i) The partial F test has H0 : β_{j_{k+1}} = · · · = β_{j_p} = 0, or H0 : the reduced model is good, or H0 : Lβ = 0 where L is a (p − k) × p matrix where the ith row of L has a 1 in the j_{k+i}th position and zeroes elsewhere. In particular, if β1, ..., βk are the only βi in the reduced model, then L = [0 I_{p−k}] and 0 is a (p − k) × k matrix. Hence r = p − k = number of predictors in the full model but not in the reduced model. ii) The ANOVA F test is the special case of the partial F test where the reduced model is Yi = β1 + ei. Hence H0 : β2 = · · · = βp = 0, or H0 : none of the nontrivial predictors x2, ..., xp are needed in the linear model, or H0 : Lβ = 0 where L = [0 I_{p−1}] and 0 is a (p − 1) × 1 vector. Hence r = p − 1. iii) The Wald t test uses the reduced model that deletes the jth predictor from the full model. Hence H0 : βj = 0, or H0 : the jth predictor xj is not needed in the linear model given that the other predictors are in the model, or H0 : Lj β = 0 where Lj = [0, ..., 0, 1, 0, ..., 0] is a 1 × p row vector with a 1 in the jth position for j = 1, ..., p. Hence r = 1.

A way to get the test statistic FR for the partial F test is to fit the full model and the reduced model. Let RSS be the RSS of the full model, and let RSS(R) be the RSS of the reduced model. Similarly, let MSE and MSE(R) be the MSE of the full and reduced models. Let dfR = n − k and dfF = n − p be the degrees of freedom for the reduced and full models. Then

FR = [RSS(R) − RSS] / (r MSE)

where r = dfR − dfF = p − k = number of predictors in the full model but not in the reduced model.

If β̂ ∼ Np(β, σ²(X^T X)^{-1}), then

Lβ̂ − c ∼ Nr(Lβ − c, σ² L(X^T X)^{-1} L^T).

If H0 is true, then Lβ̂ − c ∼ Nr(0, σ² L(X^T X)^{-1} L^T), and by Theorem 2.11,

rF1 = (1/σ²)(Lβ̂ − c)^T [L(X^T X)^{-1} L^T]^{-1} (Lβ̂ − c) ∼ χ²_r.

Let rFR = σ² rF1 / MSE. If H0 is true, rFR →D χ²_r for a large class of zero mean error distributions. See Theorem 2.25 c).

Definition 2.21. If Zn →D Z as n → ∞, then Zn converges in distribution to the random vector Z, and “Z is the limiting distribution of Zn” means that the distribution of Z is the limiting distribution of Zn. The notation Zn →D Nk(µ, Σ) means Z ∼ Nk(µ, Σ).

Remark 2.2. a) Z is the limiting distribution of Zn, and does not depend on the sample size n (since Z is found by taking the limit as n → ∞).
b) When Zn →D Z, the distribution of Z can be used to approximate probabilities P(Zn ≤ c) ≈ P(Z ≤ c) at continuity points c of the cdf F_Z(z). Often the limiting distribution is a continuous distribution, so all points c are continuity points.
c) Often the two quantities Zn →D Nk(µ, Σ) and Zn ∼ Nk(µ, Σ) behave similarly. A big difference is that the distribution on the RHS (right hand side) can depend on n for ∼ but not for →D. In particular, if Zn →D Nk(µ, Σ), then AZn + b →D Nm(Aµ + b, AΣA^T), provided the RHS does not depend on n, where A is an m × k constant matrix and b is an m × 1 constant vector.
d) We often want a normal approximation where the RHS can depend on n. Write Zn ∼ ANk(µ, Σ) for an approximate multivariate normal distribution where the RHS may depend on n. For the normal linear model, if e ∼ Nn(0, σ²I), then β̂ ∼ Np(β, σ²(X^T X)^{-1}). If the ei are iid with E(ei) = 0 and V(ei) = σ², use the multivariate normal approximation β̂ ∼ ANp(β, σ²(X^T X)^{-1}) or β̂ ∼ ANp(β, MSE(X^T X)^{-1}). The RHS depends on n since the number of rows of X is n.

Theorem 2.23. Suppose Σn and Σ are positive definite and symmetric. If Wn →D Nk(µ, Σ) and Σn →P Σ, then Zn = Σn^{-1/2}(Wn − µ) →D Nk(0, I), and Zn^T Zn = (Wn − µ)^T Σn^{-1} (Wn − µ) →D χ²_k.

Theorem 2.24. If Wn ∼ F_{r,dn} where the positive integer dn → ∞ as n → ∞, then rWn →D χ²_r.

Proof. If X1 ∼ χ²_{d1} ⫫ X2 ∼ χ²_{d2}, then

(X1/d1) / (X2/d2) ∼ F_{d1,d2}.

If Ui ∼ χ²_1 are iid then Σ_{i=1}^k Ui ∼ χ²_k. Let d1 = r and k = d2 = dn. Hence if X2 ∼ χ²_{dn}, then

X2/dn = Σ_{i=1}^{dn} Ui / dn = Ū →P E(Ui) = 1

by the law of large numbers. Hence if Wn ∼ F_{r,dn}, then rWn →D χ²_r. □

By the LS CLT (Theorem 1.8),

√n (β̂ − β) →D Np(0, σ² W)   (2.4)

where X^T X / n → W^{-1}. Equivalently,

(X^T X)^{1/2} (β̂ − β) →D Np(0, σ² I_p).   (2.5)

If Σ = σ² W, then Σn = n MSE (X^T X)^{-1}. Hence β̂ ∼ ANp(β, MSE(X^T X)^{-1}), and

rFR = (1/MSE)(Lβ̂ − c)^T [L(X^T X)^{-1} L^T]^{-1} (Lβ̂ − c) →D χ²_r   (2.6)

as n → ∞.

Definition 2.22. A test with test statistic Tn is a large sample right tail δ test if the test rejects H0 if Tn > an and P(Tn > an) = δn → δ as n → ∞ when H0 is true.

Typically we want δ ≤ 0.1, and the values δ = 0.05 or δ = 0.01 are common. (An analogy is a large sample 100(1 − δ)% confidence interval or prediction interval.)

Remark 2.3. Suppose P(W ≤ χ²_q(1 − δ)) = 1 − δ and P(W > χ²_q(1 − δ)) = δ where W ∼ χ²_q. Suppose P(W ≤ F_{q,dn}(1 − δ)) = 1 − δ when W ∼ F_{q,dn}. Also write χ²_q(1 − δ) = χ²_{q,1−δ} and F_{q,dn}(1 − δ) = F_{q,dn,1−δ}. Suppose P(W > z_{1−δ}) = δ when W ∼ N(0, 1), and P(W > t_{dn,1−δ}) = δ when W ∼ t_{dn}.
i) Theorem 2.24 is important because it can often be shown that a statistic Tn = rWn →D χ²_r when H0 is true. Then tests that reject H0 when Tn > χ²_r(1 − δ) or when Tn/r = Wn > F_{r,dn}(1 − δ) are both large sample right tail δ tests if the positive integer dn → ∞ as n → ∞. Large sample F tests and intervals are used instead of χ² tests and intervals since the F tests and intervals are more accurate for moderate n.
ii) An analogy is that if the test statistic Tn →D N(0, 1) when H0 is true, then tests that reject H0 if Tn > z_{1−δ} or if Tn > t_{dn,1−δ} are both large sample right tail δ tests if the positive integer dn → ∞ as n → ∞. Large sample t tests and intervals are used instead of Z tests and intervals since the t tests and intervals are more accurate for moderate n.
iii) Often n ≥ 10p starts to give good results for the OLS output for error distributions not too far from N(0, 1). Larger values of n tend to be needed if the zero mean iid errors have a distribution that is far from a normal distribution. Also see Proposition 2.5.

Page 81: Theory for Linear Models - Southern Illinois University ...lagrange.math.siu.edu/Olive/linmodbook.pdf · LinearModel Theory provides theory for the multiplelinearregression model

2.3 Least Squares Theory 73

Theorem 2.25, Partial F Test Theorem. Suppose H0 : Lβ = 0 is true for the partial F test. Under the OLS full rank model,
a) FR = [1/(r MSE)] (Lβ̂)^T [L(X^T X)^{-1} L^T]^{-1} (Lβ̂).
b) If e ∼ Nn(0, σ²I), then FR ∼ F_{r,n−p}.
c) For a large class of zero mean error distributions, rFR →D χ²_r.
d) The partial F test that rejects H0 : Lβ = 0 if FR > F_{r,n−p}(1 − δ) is a large sample right tail δ test for the OLS model for a large class of zero mean error distributions.

Proof sketch. a), b) Let Y ∼ Nn(Xβ, σ²I_n) where X is an n × p matrix of rank p. Let X = [X1 X2] and β = (β1^T β2^T)^T where X1 is an n × k matrix and r = p − k. Consider testing H0 : β2 = 0. (The columns of X can be rearranged so that H0 corresponds to the partial F test.) Let P be the projection matrix on C(X). Then r^T r = Y^T (I − P)Y = e^T (I − P)e = (Y − Xβ)^T (I − P)(Y − Xβ) since P X = X and X^T P = X^T imply that X^T (I − P) = 0 and (I − P)X = 0.

Suppose that H0 : β2 = 0 is true so that Y ∼ Nn(X1 β1, σ²I_n). Let P1 be the projection matrix on C(X1). By the above argument, r_R^T r_R = Y^T (I − P1)Y = (Y − X1β1)^T (I − P1)(Y − X1β1) = e_R^T (I − P1)e_R where e_R ∼ Nn(0, σ²I_n) when H0 is true. Or expand the RHS as Y^T (I − P1)Y − β1^T X1^T (I − P1)Y + β1^T X1^T (I − P1)X1β1 − Y^T (I − P1)X1β1, and the last three terms equal 0 since X1^T (I − P1) = 0 and (I − P1)X1 = 0.

Hence

Y^T (I − P)Y / σ² ∼ χ²_{n−p} ⫫ Y^T (P − P1)Y / σ² ∼ χ²_r

by Theorem 2.9 b) using e and e_R instead of Y, and Craig’s Theorem 2.8 b), since n − p = rank(I − P) = tr(I − P), r = rank(P − P1) = tr(P − P1) = p − k, and (I − P)(P − P1) = 0.

If X1 ∼ χ²_{d1} ⫫ X2 ∼ χ²_{d2}, then (X1/d1)/(X2/d2) ∼ F_{d1,d2}. Hence

[Y^T (P − P1)Y / r] / [Y^T (I − P)Y / (n − p)] = Y^T (P − P1)Y / (r MSE) ∼ F_{r,n−p}

when H0 is true. Since RSS = Y^T (I − P)Y and RSS(R) = Y^T (I − P1)Y, RSS(R) − RSS = Y^T (I − P1 − [I − P])Y = Y^T (P − P1)Y, and thus

FR = Y^T (P − P1)Y / (r MSE) ∼ F_{r,n−p}.

Seber and Lee (2003, p. 100) show that

RSS(R) − RSS = (Lβ̂)^T [L(X^T X)^{-1} L^T]^{-1} (Lβ̂).

c) By (2.6), rFR →D χ²_r as n → ∞ where c = 0.

d) By Theorem 2.24, if Wn ∼ F_{r,dn}, then rWn →D χ²_r as n → ∞ and dn → ∞. Hence the result follows by c). □

Let X ∼ t_{n−p}. Then X² ∼ F_{1,n−p}. The two tail Wald t test for H0 : βj = 0 versus H1 : βj ≠ 0 is equivalent to the corresponding right tailed F test, since rejecting H0 if |X| > t_{n−p}(1 − δ) is equivalent to rejecting H0 if X² > F_{1,n−p}(1 − δ).

Definition 2.23. The pvalue of a test is the probability, assuming H0 is true, of observing a test statistic as extreme as the test statistic Tn actually observed. For a right tail test, pvalue = P_{H0}(observing a test statistic ≥ Tn).

Under the OLS model where FR ∼ F_{q,n−p} when H0 is true (so the ei are iid N(0, σ²)), the pvalue = P(W > FR) where W ∼ F_{q,n−p}. In general, we can only estimate the pvalue. Let pval be the estimated pvalue. Then pval = P(W > FR) where W ∼ F_{q,n−p}, and pval →P pvalue as n → ∞ for the large sample partial F test. The pvalues in output are usually actually pvals (estimated pvalues).

Definition 2.24. Let Y ∼ F(d1, d2) ∼ F(d1, d2, 0). Let X1 ∼ χ²(d1, γ) ⫫ X2 ∼ χ²(d2, 0). Then W = (X1/d1)/(X2/d2) ∼ F(d1, d2, γ), a noncentral F distribution with d1 and d2 numerator and denominator degrees of freedom, and noncentrality parameter γ.

Theorem 2.26, distribution of FR under normality when H0 may not hold. Assume Y = Xβ + e where e ∼ Nn(0, σ²I). Let X = [X1 X2] be full rank, and let the reduced model be Y = X1β1 + e_R. Then

FR = [Y^T (P − P1)Y / r] / [Y^T (I − P)Y / (n − p)] ∼ F(r, n − p, β^T X^T (P − P1)Xβ / (2σ²)).

If H0 : β2 = 0 is true, then γ = 0.

Proof. Note that the denominator is the MSE, and (n − p)MSE/σ² ∼ χ²_{n−p} by the proof of Theorem 2.25. By Theorem 2.14 f),

Y^T (P − P1)Y / σ² ∼ χ²(r, β^T X^T (P − P1)Xβ / (2σ²))

where r = rank(P − P1) = tr(P − P1) = p − k since P − P1 is a projection matrix (symmetric and idempotent). □

2.4 Summary

1) The set of all linear combinations of x1, ..., xn is the vector space known as span(x1, ..., xn) = {y ∈ R^k : y = Σ_{i=1}^n ai xi for some constants a1, ..., an}.
2) Let A = [a1 a2 ... am] be an n × m matrix. The space spanned by the columns of A = column space of A = C(A). Then C(A) = {y ∈ R^n : y = Aw for some w ∈ R^m} = {y : y = w1 a1 + w2 a2 + · · · + wm am for some scalars w1, ..., wm} = span(a1, ..., am).
3) A generalized inverse of an n × m matrix A is any m × n matrix A^− satisfying A A^− A = A.
4) The projection matrix P = P_X onto the column space of X is unique, symmetric, and idempotent. P X = X, and P W = W if each column of W ∈ C(X). The eigenvalues of P_X are 0 or 1. Rank(P) = tr(P). Hence P is singular unless X is a nonsingular n × n matrix, and then P = I_n. If C(X_R) is a subspace of C(X), then P_X P_{X_R} = P_{X_R} P_X = P_{X_R}.
5) I_n − P is the projection matrix on [C(X)]⊥.
6) Let A be a positive definite symmetric matrix. The square root matrix A^{1/2} is a positive definite symmetric matrix such that A^{1/2} A^{1/2} = A.
7) The matrix A in a quadratic form x^T A x will be symmetric unless told otherwise.
8) Theorem 2.4. Let x be a random vector with E(x) = µ and Cov(x) = Σ. Then E(x^T A x) = tr(AΣ) + µ^T A µ.
9) Theorem 2.6. If A and B are symmetric matrices and AY ⫫ BY, then Y^T A Y ⫫ Y^T B Y.
10) The important part of Craig’s Theorem is that if Y ∼ Nn(µ, Σ), then Y^T A Y ⫫ Y^T B Y if AΣB = 0.
11) Theorem 2.9. Let A = A^T be symmetric. a) If Y ∼ Nn(0, I), then Y^T A Y ∼ χ²_r iff A is idempotent of rank r. b) If Y ∼ Nn(0, σ²I), then Y^T A Y ∼ σ² χ²_r iff A is idempotent of rank r.
12) Often theorems are given for when Y ∼ Nn(0, I). If Y ∼ Nn(0, σ²I), then apply the theorem using Z = Y/σ ∼ Nn(0, I).
13) Suppose Y1, ..., Yn are independent N(µi, 1) random variables so that Y = (Y1, ..., Yn)^T ∼ Nn(µ, I_n). Then Y^T Y = Σ_{i=1}^n Yi² ∼ χ²(n, γ = µ^T µ/2), a noncentral χ²(n, γ) distribution, with n degrees of freedom and noncentrality parameter γ = µ^T µ/2 = (1/2) Σ_{i=1}^n µi² ≥ 0. The noncentrality parameter δ = µ^T µ = 2γ is also used.
14) Theorem 2.16. Let θ = Xη ∈ C(X) where Yi = xi^T η + ri(η) and the residual ri(η) depends on η. The least squares estimator β̂ is the value of η ∈ R^p that minimizes the least squares criterion Σ_{i=1}^n ri²(η) = ‖Y − Xη‖².


15) Let xi^T = (1 ui^T), and let β^T = (β1 βS^T) where β1 is the intercept and the slopes vector βS = (β2, ..., βp)^T. Let the population covariance matrices Cov(u) = Σu, and Cov(u, Y) = Σ_{uY}. If the (Yi, ui^T)^T are iid, then the population coefficients from an OLS regression of Y on u are

β1 = E(Y) − βS^T E(u) and βS = Σu^{-1} Σ_{uY}.

16) Theorem 2.19: Second way to compute β̂. Suppose that wi = (Yi, ui^T)^T are iid random vectors such that σ²_Y, Σu^{-1}, and Σ_{uY} exist. Then β̂1 = Ȳ − β̂S^T ū →P β1 and

β̂S = [n/(n − 1)] Σ̂u^{-1} Σ̃_{uY} = Σ̂u^{-1} Σ̂_{uY} = Σ̃u^{-1} Σ̃_{uY} →P βS as n → ∞.

17) Theorem 2.20. Let Y = Xβ + e = Ŷ + r where X is full rank, E(e) = 0, and Cov(e) = σ²I. Let P = P_X be the projection matrix on C(X), so Ŷ = P Y, r = Y − Ŷ = (I − P)Y, and P X = X so X^T P = X^T.
i) The predictor variables and residuals are orthogonal. Hence the columns of X and the residual vector are orthogonal: X^T r = 0.
ii) E(Y) = Xβ.
iii) Cov(Y) = Cov(e) = σ²I.
iv) The fitted values and residuals are uncorrelated: Cov(r, Ŷ) = 0.
v) The least squares estimator β̂ is an unbiased estimator of β: E(β̂) = β.
vi) Cov(β̂) = σ²(X^T X)^{-1}.
18) Theorem 2.25, Partial F Test Theorem. Suppose H0 : Lβ = 0 is true for the partial F test. Under the OLS full rank model,
a) FR = [1/(r MSE)] (Lβ̂)^T [L(X^T X)^{-1} L^T]^{-1} (Lβ̂).
b) If e ∼ Nn(0, σ²I), then FR ∼ F_{r,n−p}.
c) For a large class of zero mean error distributions, rFR →D χ²_r.
d) The partial F test that rejects H0 : Lβ = 0 if FR > F_{r,n−p}(1 − δ) is a large sample right tail δ test for the OLS model for a large class of zero mean error distributions.

2.5 Complements

A good reference for quadratic forms and the noncentral χ², t, and F distributions is Johnson and Kotz (1970, ch. 28-31).

Least squares theory can be extended in at least two ways. The first ex-tension follows Chang and Olive (2010) closely. Suppose β = (α βT

U )T

where the intercept is α and the vector of slopes is βU . Then for the linear

Page 85: Theory for Linear Models - Southern Illinois University ...lagrange.math.siu.edu/Olive/linmodbook.pdf · LinearModel Theory provides theory for the multiplelinearregression model

2.5 Complements 77

model, α = β1 and βS = βU = (β2 , ..., βp)T . Let x = (1 uT )T , and sup-

pose Y u|uT βU , e.g. Yi = α + uTi βU + ei, or Yi = m(uT

i βU ) + ei, or aGLM (generalized linear model). These models are 1D regression models. Ifthe ui are iid from an elliptically contoured distribution, then often the OLS

estimator βSP→ cβU for some constant c 6= 0.

Let βT = (α βTU ) and suppose the full model is Y u|(α + uT βU ).

Consider testing AβU = 0. Let the full model be Y u|(α+uTRβR +uT

OβO),and let the reduced model be Y u|(α + uT

RβR) where uT = (uTR uT

O)and uO denotes the terms outside of the reduced model. Notice that OLSANOVA F test corresponds to Ho: βU = 0 and uses A = Ip−1. The tests forH0 : βi = 0 use A = (0, ..., 0, 1, 0, ..., 0) where the 1 is in the ith position andare equivalent to the OLS t tests. The test Ho: βO = 0 uses A = [0 Ij] if βO

is a j×1 vector. For simplicity, let βU = (β1, ..., βp−1)T (start the numbering

at 1 instead of 2 and use α for the intercept). For the linear model, we coulduse L = (a A) where a is a known r×1 vector and r = k if A is k× (p−1).

Assume Y u|(α + βTUu), which is equivalent to Y u|βT

Uu. Let thepopulation OLS residual

v = Y − α− βTSu

withτ2 = E[(Y − α− βT

Su)2] = E(v2),

and let the OLS residual be

r = Y − α− βT

Su. (2.7)

Then under regularity conditions, results i) – iv) below hold.

i) Li and Duan (1989): The OLS slopes estimator βS = cβU for someconstant c.

ii) Li and Duan (1989) and Chen and Li (1998):

√n(βS − cβU )

D→ Np−1(0,COLS)

where

COLS = Σ−1u E[(Y − α− βT

S u)2(u − E(u))(u − E(u))T ]Σ−1u .

iii) Chen and Li (1998): Let A be a known full rank constant k × (p − 1)matrix. If the null hypothesis Ho: AβU = 0 is true, then

√n(AβS − cAβU ) =

√nAβS

D→ Nk(0,ACOLSAT )

andACOLSAT = τ2AΣ−1

u AT .

To create test statistics, the estimator

τ̂² = MSE = (1/(n − p)) Σ_{i=1}^n r_i² = (1/(n − p)) Σ_{i=1}^n (Y_i − α̂ − β̂_S^T u_i)²

will be useful. The estimator

Ĉ_OLS = Σ̂_u^{−1} [ (1/n) Σ_{i=1}^n (Y_i − α̂ − β̂_S^T u_i)²(u_i − ū)(u_i − ū)^T ] Σ̂_u^{−1}

can also be useful. Notice that for general 1D regression models, the OLS MSE estimates τ² rather than the error variance σ².

iv) Result iii) suggests that a test statistic for Ho: A β_U = 0 is

W_OLS = n β̂_S^T A^T [A Σ̂_u^{−1} A^T]^{−1} A β̂_S / τ̂² →^D χ²_k.

Under regularity conditions, if Ho: A β_U = 0 is true, then the test statistic

F_R = [(n − 1)/(kn)] W_OLS →^D χ²_k / k

as n → ∞. This result means that the OLS partial F tests are large sample tests for a large class of nonlinear models where Y ⊥ u | u^T β_U.
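The statistic W_OLS is easy to compute from standard OLS output. The sketch below (my own code with simulated data, not a linmodpack function) tests Ho: β_3 = 0 with A = (0, 0, 1), so k = 1.

set.seed(2)
n <- 200; p <- 4
u <- matrix(rnorm(n*(p-1)), ncol = p-1)       # predictors u = (x_2,...,x_p)
y <- 1 + u[,1] + u[,2] + rnorm(n)             # the third slope is truly zero
fit <- lm(y ~ u)
betaS <- coef(fit)[-1]                        # OLS slope estimates
tau2  <- sum(resid(fit)^2)/(n - p)            # tau^2-hat = MSE
Sigu  <- cov(u)                               # sample covariance of u
A <- matrix(c(0, 0, 1), nrow = 1)             # k = 1: test beta_3 = 0
WOLS <- drop(n * t(A %*% betaS) %*%
        solve(A %*% solve(Sigu) %*% t(A)) %*% (A %*% betaS) / tau2)
c(WOLS = WOLS, pvalue = pchisq(WOLS, df = 1, lower.tail = FALSE))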

The second extension of least squares theory is to an autoregressive AR(p) time series model: Y_t = φ_0 + φ_1 Y_{t−1} + · · · + φ_p Y_{t−p} + e_t. In matrix form, this model is Y = Xβ + e, or

[ Y_{p+1} ]   [ 1  Y_p      Y_{p−1}  ...  Y_1     ] [ φ_0 ]   [ e_{p+1} ]
[ Y_{p+2} ] = [ 1  Y_{p+1}  Y_p      ...  Y_2     ] [ φ_1 ] + [ e_{p+2} ]
[   ...   ]   [ ...  ...      ...    ...   ...    ] [ ... ]   [   ...   ]
[ Y_n     ]   [ 1  Y_{n−1}  Y_{n−2}  ...  Y_{n−p} ] [ φ_p ]   [ e_n     ]

If the AR(p) model is stationary, then under regularity conditions, OLS partial F tests are large sample tests for this model. See Anderson (1971, pp. 210–217).
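As a minimal sketch (simulated stationary data, my own code), the lagged design matrix above can be built with R's embed() and the OLS partial F test applied with anova():

set.seed(3)
n <- 300
Yt <- arima.sim(list(ar = c(0.5, 0.2)), n = n)  # stationary AR(2) data
Z <- embed(as.numeric(Yt), 3)                   # columns: Y_t, Y_{t-1}, Y_{t-2}
y <- Z[, 1]; lag1 <- Z[, 2]; lag2 <- Z[, 3]
full <- lm(y ~ lag1 + lag2)                     # OLS fit of the AR(2) model
red  <- lm(y ~ lag1)                            # reduced AR(1) model
anova(red, full)                                # large sample partial F test for phi_2 = 0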

2.6 Problems

2.1. Suppose Y_i = x_i^T β + ε_i where the errors are independent N(0, σ²). Then the likelihood function is

L(β, σ²) = (2πσ²)^{−n/2} exp( −‖y − Xβ‖² / (2σ²) ).

a) Since the least squares estimator β̂ minimizes ‖y − Xβ‖², show that β̂ is the MLE of β.
b) Then find the MLE σ̂² of σ².

2.2. Suppose Y_i = x_i^T β + e_i where the errors are iid double exponential (0, σ) where σ > 0. Then the likelihood function is

L(β, σ) = (1/2^n)(1/σ^n) exp( −(1/σ) Σ_{i=1}^n |Y_i − x_i^T β| ).

Suppose that β̂ is a minimizer of Σ_{i=1}^n |Y_i − x_i^T β|.
a) By direct maximization, show that β̂ is an MLE of β regardless of the value of σ.
b) Find an MLE σ̂ of σ by maximizing

L(σ) ≡ L(β̂, σ) = (1/2^n)(1/σ^n) exp( −(1/σ) Σ_{i=1}^n |Y_i − x_i^T β̂| ).

2.3. Suppose Y_i = x_i^T β + ε_i where the errors are independent N(0, σ²/w_i) where the w_i > 0 are known constants. Then the likelihood function is

L(β, σ²) = ( Π_{i=1}^n √w_i ) (1/√(2π))^n (1/σ^n) exp( −(1/(2σ²)) Σ_{i=1}^n w_i (y_i − x_i^T β)² ).

a) Suppose that β̂_W minimizes Σ_{i=1}^n w_i(y_i − x_i^T β)². Show that β̂_W is the MLE of β.
b) Then find the MLE σ̂² of σ².

2.4. Suppose Y ∼ N_n(Xβ, σ²V) for a known positive definite n × n matrix V. Then the likelihood function is

L(β, σ²) = (1/√(2π))^n (1/|V|^{1/2}) (1/σ^n) exp( −(1/(2σ²)) (y − Xβ)^T V^{−1}(y − Xβ) ).

a) Suppose that β̂_G minimizes (y − Xβ)^T V^{−1}(y − Xβ). Show that β̂_G is the MLE of β.
b) Find the MLE σ̂² of σ².

2.5. Find the vector a such that a^T Y is an unbiased estimator for E(Y_i) if the usual linear model holds.

2.6. Write the following quantities as b^T Y or Y^T A Y or AY.
a) Ȳ,  b) Σ_i (Y_i − Ŷ_i)²,  c) Σ_i Ŷ_i²,  d) β̂,  e) Ŷ


2.7. Show that I − H = I − X(X^T X)^{−1}X^T is idempotent, that is, show that (I − H)(I − H) = (I − H)² = I − H.

2.8. Let Y ∼ N(µ, σ²) so that E(Y) = µ and Var(Y) = σ² = E(Y²) − [E(Y)]². If k ≥ 2 is an integer, then

E(Y^k) = (k − 1)σ² E(Y^{k−2}) + µ E(Y^{k−1}).

Let Z = (Y − µ)/σ ∼ N(0, 1). Hence µ_k = E(Y − µ)^k = σ^k E(Z^k). Use this fact and the above recursion relationship E(Z^k) = (k − 1)E(Z^{k−2}) to find a) µ_3 and b) µ_4.

2.9. Let A and B be matrices with the same number of rows. If C is another matrix such that A = BC, is it true that rank(A) = rank(B)? Prove or give a counterexample.

2.10. Let x be an n × 1 vector and let B be an n × n matrix. Show that x^T B x = x^T B^T x.
(The point of this problem is that if B is not a symmetric n × n matrix, then x^T B x = x^T A x where A = (B + B^T)/2 is a symmetric n × n matrix.)

2.11. Consider the model Y_i = β_1 + β_2 X_{i,2} + · · · + β_p X_{i,p} + e_i = x_i^T β + e_i. The least squares estimator β̂ minimizes

Q_OLS(η) = Σ_{i=1}^n (Y_i − x_i^T η)²

and the weighted least squares estimator minimizes

Q_WLS(η) = Σ_{i=1}^n w_i (Y_i − x_i^T η)²

where the w_i, Y_i, and x_i are known quantities. Show that

Σ_{i=1}^n w_i (Y_i − x_i^T η)² = Σ_{i=1}^n (Ỹ_i − x̃_i^T η)²

by identifying Ỹ_i and x̃_i. (Hence the WLS estimator is obtained from the least squares regression of Ỹ_i on x̃_i without an intercept.)

2.12. Suppose that X is an n × p matrix but the rank of X < p < n. Then the normal equations X^T X β = X^T Y have infinitely many solutions. Let β̂ be a solution to the normal equations, so X^T X β̂ = X^T Y. Let G = (X^T X)^− be a generalized inverse of (X^T X). Assume that E(Y) = Xβ and Cov(Y) = σ²I. It can be shown that all solutions to the normal equations have the form b_z given below.
a) Show that b_z = G X^T Y + (G X^T X − I)z is a solution to the normal equations, where the p × 1 vector z is arbitrary.
b) Show that E(b_z) ≠ β.
(Hence some authors suggest that b_z should be called a solution to the normal equations but not an estimator of β.)
c) Show that Cov(b_z) = σ² G X^T X G^T.
d) Although G is not unique, the projection matrix P = X G X^T onto C(X) is unique. Use this fact to show that Ŷ = X b_z does not depend on G or z.
e) There are two ways to show that a^T β is an estimable function. Either show that there exists a vector c such that E(c^T Y) = a^T β, or show that a ∈ C(X^T). Suppose that a = X^T w for some fixed vector w. Show that E(a^T b_z) = a^T β.
(Hence a^T β is estimable by a^T b_z where b_z is any solution of the normal equations.)
f) Suppose that a = X^T w for some fixed vector w. Show that Var(a^T b_z) = σ² w^T P w.


Chapter 3
Nonfull Rank Linear Models and Cell Means Models

3.1 Nonfull Rank Linear Models

Definition 3.1. The nonfull rank linear model is Y = Xβ + e where X has rank r < p ≤ n, and X is an n × p matrix.

Nonfull rank models are often used in experimental design. Much of the nonfull rank model theory is similar to that of the full rank model, but there are some differences. Now the generalized inverse (X^T X)^− is not unique. Similarly, β̂ is a solution to the normal equations, but depends on the generalized inverse and is not unique. Some properties of the least squares estimators are summarized below. Let P = P_X be the projection matrix on C(X). Recall that projection matrices are symmetric and idempotent but singular unless P = I. Also recall that P X = X, so X^T P = X^T.

Theorem 3.1. i) P = X(X^T X)^− X^T is the unique projection matrix on C(X) and does not depend on the generalized inverse (X^T X)^−.
ii) β̂ = (X^T X)^− X^T Y does depend on (X^T X)^− and is not unique.
iii) Ŷ = X β̂ = P Y, r = Y − Ŷ = Y − X β̂ = (I − P)Y, and RSS = r^T r are unique and so do not depend on (X^T X)^−.
iv) β̂ is a solution to the normal equations: X^T X β̂ = X^T Y.
v) Rank(P) = r and rank(I − P) = n − r.
vi) Let θ = Xβ and θ̂ = X β̂. Suppose there exists a constant vector c such that E(c^T θ̂) = c^T θ. Then among the class of linear unbiased estimators of c^T θ, the least squares estimator c^T θ̂ is BLUE.
vii) If Cov(Y) = Cov(e) = σ²I, then MSE = RSS/(n − r) = r^T r/(n − r) is an unbiased estimator of σ².
viii) Let the columns of X_1 form a basis for C(X). For example, take r linearly independent columns of X to form X_1. Then P = X_1(X_1^T X_1)^{−1} X_1^T.
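A small numerical illustration of Theorem 3.1 i) and viii) (my own sketch, using MASS::ginv and a made-up one-way design): the projection matrix onto C(X) is the same whether it is computed from a generalized inverse of X^T X or from a full rank basis X_1 of C(X).

library(MASS)                                     # for ginv()
X  <- cbind(1, c(1,1,1,0,0,0), c(0,0,0,1,1,1))    # rank 2 < p = 3 (cell means design)
P1 <- X %*% ginv(t(X) %*% X) %*% t(X)             # P = X (X^T X)^- X^T
X1 <- X[, 1:2]                                    # 2 linearly independent columns
P2 <- X1 %*% solve(t(X1) %*% X1) %*% t(X1)        # P = X1 (X1^T X1)^{-1} X1^T
max(abs(P1 - P2))                                 # essentially 0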


Definition 3.2. Let a and b be constant vectors. Then a^T β is estimable if there exists a linear unbiased estimator b^T Y so E(b^T Y) = a^T β.

The term “estimable” is misleading since there are nonestimable quantities a^T β that can be estimated with biased estimators. For full rank models, a^T β is estimable for any p × 1 constant vector a since a^T β̂ is a linear unbiased estimator of a^T β. See the Gauss Markov Theorem 11.21. Estimable quantities tend to go with the nonfull rank linear model. We can avoid nonestimable functions by using a full rank model instead of a nonfull rank model (delete columns of X until it is full rank).

Theorem 3.2. a) The quantity a^T β is estimable iff a^T = b^T X iff a = X^T b (for some constant vector b) iff a ∈ C(X^T).
b) If a^T β is estimable and a least squares estimator β̂ is any solution to the normal equations X^T X β̂ = X^T Y, then a^T β̂ is unique and a^T β̂ is the BLUE of a^T β.

Remark 3.1. There are several ways to show whether a^T β is estimable or nonestimable.
i) For the full rank model, a^T β is estimable: use the BLUE a^T β̂.
Now consider the nonfull rank model. ii) If a^T β is estimable: use the BLUE a^T β̂.
iii) There are two more ways to check whether a^T β is estimable.
a) If there is a constant vector b such that E(b^T Y) = a^T β, then a^T β is estimable.
b) If a^T = b^T X or a = X^T b or a ∈ C(X^T), then a^T β is estimable.

3.2 Cell Means Models

3.3 Summary

3.4 Complements

3.5 Problems


Chapter 4

Variable Selection When n >> p

This chapter considers variable selection when n >> p. The prediction region and bootstrap prediction region method are for n >> p, but two regression prediction intervals are developed that can be used for n > p or p > n as long as n >> d where d is often found as in Remark 4.1.

Remark 4.1. A d is often obtained by plugging in the degrees of freedom formula as if model selection did not occur. Note that the “plug in degrees of freedom” is rarely an actual degrees of freedom. Often d is the number of “variables” used by the method: the number of variables with nonzero coefficients. The false degrees of freedom is useful for prediction intervals since we want n/d large where d is an integer and 1 ≤ d ≤ p: if there are d estimated parameters, then we want n ≥ 10d. The prediction interval length will be increased (penalized) if n/d is not large. Often the “degrees of freedom” (df) used in literature examples is a false degrees of freedom since the df was computed ignoring the fact that model selection occurred. As an example, if Ŷ = H_λ Y, then df = trace(H_λ) if λ is selected before examining the data. If model selection is used to pick λ̂, then a false df = d = trace(H_λ̂). Suppose an estimator T_n is equal to estimator T_{jn} with probability π_{jn} for j = 1, ..., J, where Ŷ_{jn} = H_{λ_{jn}} Y. A conjecture is that df = Σ_j π_{jn} trace(H_{λ_{jn}}) = Σ_j π_{jn} df(T_{jn}) where df(T_{jn}) is the degrees of freedom for estimator T_{jn}. For many estimators, the degrees of freedom is not known if model selection is used.

4.1 Prediction Intervals

Definition 4.1. Consider predicting a future test value Y_f given a p × 1 vector of predictors x_f and training data (Y_1, x_1), ..., (Y_n, x_n). A large sample 100(1 − δ)% prediction interval (PI) for Y_f has the form [L_n, U_n] where P(L_n ≤ Y_f ≤ U_n) → 1 − δ as the sample size n → ∞.


The interpretation of a 100(1 − δ)% PI for a random variable Y_f is similar to that of a confidence interval (CI). See Definition 4.5. Collect data, then form the PI, and repeat for a total of k times where the k trials are independent from the same population. If Y_{fi} is the ith random variable and PI_i is the ith PI, then the number j of PIs with Y_{fi} ∈ PI_i follows a binomial(k, ρ = 1 − δ) distribution. Hence if 100 95% PIs are made, ρ = 0.95 and Y_{fi} ∈ PI_i happens about 95 times. For a large sample 100(1 − δ)% PI, replace 1 − δ by 1 − δ_n where δ_n → δ as n → ∞.

There are two big differences between CIs and PIs. First, the length of the CI goes to 0 as the sample size n goes to ∞ while the length of the PI converges to some nonzero number J, say. Secondly, many confidence intervals work well for large classes of distributions while many prediction intervals assume that the distribution of the data is known up to some unknown parameters. Usually the N(µ, σ²) distribution is assumed, and the parametric PI may not perform well if the normality assumption is violated. This section will develop two nonparametric PIs for the additive error regression model, Y = m(x) + e, that work well for a large class of unknown zero mean error distributions.

First we will consider the location model, Y_i = µ + e_i, where Y_1, ..., Y_n, Y_f are iid and there are no vectors of predictors x_i and x_f. The shorth(c) estimator is useful for making prediction intervals. Let Z_{(1)}, ..., Z_{(n)} be the order statistics of Z_1, ..., Z_n. Then let the shortest closed interval containing at least c of the Z_i be

shorth(c) = [Z_{(s)}, Z_{(s+c−1)}].   (4.1)

Let

k_n = ⌈n(1 − δ)⌉   (4.2)

where ⌈x⌉ is the smallest integer ≥ x, e.g., ⌈7.7⌉ = 8. Let Z_i = Y_i. Then Frey (2013) showed that for large nδ and iid data, the shorth(k_n) PI has maximum undercoverage ≈ 1.12√(δ/n), and used the shorth(c) estimator as the large sample 100(1 − δ)% PI where

c = min(n, ⌈n[1 − δ + 1.12√(δ/n)]⌉).   (4.3)

A problem with the prediction intervals that cover ≈ 100(1 − δ)% of the training data cases Y_i (such as (4.1) using c = k_n given by (4.2)) is that they have coverage lower than the nominal coverage of 1 − δ for moderate n. This result is not surprising since empirically statistical methods perform worse on test data. For iid data, Frey (2013) used (4.3) to correct for undercoverage.

Example 4.1. Given below were votes for a preseason 1A basketball poll from Nov. 22, 2011 WSIL News where the 778 was a typo: the actual value was 78. As shown below, finding shorth(3) from the ordered data is simple. If the outlier was corrected, shorth(3) = [76,78].

111 89 778 78 76

order data: 76 78 89 111 778

13 = 89 - 76
33 = 111 - 78
689 = 778 - 89

shorth(3) = [76,89]
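A short sketch (my own code, not the linmodpack shorth routines) computes shorth(c) by scanning the c-point windows of the order statistics; it reproduces the hand computation of Example 4.1.

shorth <- function(z, c) {
  z <- sort(z)
  len <- z[c:length(z)] - z[1:(length(z) - c + 1)]   # window lengths
  s <- which.min(len)                                # leftmost shortest window
  c(z[s], z[s + c - 1])
}
votes <- c(111, 89, 778, 78, 76)
shorth(votes, 3)                        # [76, 89]
shorth(c(111, 89, 78, 78, 76), 3)       # [76, 78] after correcting the typo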

Fig. 4.1 The 36.8% Highest Density Region is [0,1]

For a random variable Y, the highest density region is a union of k ≥ 1 disjoint intervals such that the sum of the k interval lengths is as small as possible. Suppose that f(z) is a unimodal pdf that has interval support, and that the pdf f(z) of Y decreases rapidly as z moves away from the mode. Let [a, b] be the shortest interval such that F_Y(b) − F_Y(a) = 1 − δ where the cdf F_Y(z) = P(Y ≤ z). Then the interval [a, b] is the 100(1 − δ)% highest density region. To find the 100(1 − δ)% highest density region of a pdf, move a horizontal line down from the top of the pdf. The line will intersect the pdf or the boundaries of the support of the pdf at [a_1, b_1], ..., [a_k, b_k] for some k ≥ 1. Stop moving the line when the area under the pdf corresponding to the intervals is equal to 1 − δ. As an example, let f(z) = e^{−z} for z > 0. See Figure 4.1 where the area under the pdf from 0 to 1 is 0.368. Hence [0,1] is the 36.8% highest density region. The shorth PI estimates the highest density interval, which is the highest density region for a distribution with a unimodal pdf. Often the highest density region is an interval [a, b] where f(a) = f(b), especially if the support where f(z) > 0 is (−∞, ∞).

Next we consider prediction intervals for the additive error regression model Y_i = m(x_i) + e_i = E(Y_i|x_i) + e_i for i = 1, ..., n where x is a p × 1 vector of predictors and the mean function m(x) is estimated by m̂(x). Variable selection models for the multiple linear regression model in Chapter 3 will have this form with m(x) = x^T β. The ith residual is r_i = Y_i − m̂(x_i).

The Pelawa Watagoda and Olive (2019b) PI that can work if n > p or p > n is defined below. The positive real number d is an estimated value such that if n ≥ 20d, it is expected that the residuals estimate the errors fairly well for a large class of zero mean error distributions. If n/p is large, d = p usually worked. Often d is a false degrees of freedom (df) (plug in the degrees of freedom formula as if model selection did not occur). See Section 5.9 for the recommended values of d for forward selection, principal component regression, partial least squares, lasso, relaxed lasso, and ridge regression. Let q_n = min(1 − δ + 0.05, 1 − δ + d/n) for δ > 0.1 and

q_n = min(1 − δ/2, 1 − δ + 10δd/n), otherwise.   (4.4)

If 1 − δ < 0.999 and q_n < 1 − δ + 0.001, set q_n = 1 − δ. Let

c = ⌈n q_n⌉,   (4.5)

and let

b_n = (1 + 15/n) √((n + 2d)/(n − d))   (4.6)

if d ≤ 8n/9, and

b_n = 5(1 + 15/n),

otherwise. Compute the shorth(c) of the residuals = [r_{(s)}, r_{(s+c−1)}] = [ξ̂_{δ1}, ξ̂_{1−δ2}]. Let Ŷ_f = m̂(x_f). Then a 100(1 − δ)% large sample PI for Y_f is

[m̂(x_f) + b_n ξ̂_{δ1}, m̂(x_f) + b_n ξ̂_{1−δ2}].   (4.7)

Note that this PI roughly uses the shorth of the pseudodata Ŷ_f + r_i for i = 1, ..., n. The correction factors (4.5) and (4.6) are meant to approximate the Frey correction factor (4.3) provided that n >> d. Olive (2013a) used a similar PI with d = p, but this PI fails unless n ≥ 10p, roughly. If we had a sample Y_{1f}, ..., Y_{nf} of size n of responses where Y_{if} = m(x_f) + e_i, the shorth(c) PI of the Y_{if} would be a good PI for Y_f. If we knew m(x_f) = E(Y|x_f) and had a sample of iid e_i, we could take Y_{if} = m(x_f) + e_i. Instead, we use Ŷ_f = m̂(x_f) and r_i ≈ e_i to form pseudodata. In the response plot, the pointwise prediction interval bands for n/d large are approximately 2 lines parallel to the identity line covering 100(1 − δ)% of the training data with the smallest vertical length L.
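The sketch below (my own code, not the linmodpack PI functions) computes PI (4.7) for OLS with d = p and simulated data, following (4.4)–(4.7); the new predictor vector xf is made up for illustration.

set.seed(4)
n <- 100; p <- 3; delta <- 0.05
x <- matrix(rnorm(n*(p-1)), ncol = p-1)
y <- drop(1 + x %*% c(1, 2)) + rt(n, df = 5)      # nonnormal zero mean errors
fit <- lm(y ~ x); res <- resid(fit); d <- p
qn <- if (delta > 0.1) min(1 - delta + 0.05, 1 - delta + d/n) else
        min(1 - delta/2, 1 - delta + 10*delta*d/n)
if (1 - delta < 0.999 && qn < 1 - delta + 0.001) qn <- 1 - delta    # (4.4)
cc <- ceiling(n*qn)                                                  # (4.5)
bn <- if (d <= 8*n/9) (1 + 15/n)*sqrt((n + 2*d)/(n - d)) else 5*(1 + 15/n)  # (4.6)
r <- sort(res)                                    # shorth(cc) of the residuals
s <- which.min(r[cc:n] - r[1:(n - cc + 1)])
xf <- c(1, 0.5, -0.5)                             # hypothetical new case
yfhat <- sum(coef(fit) * xf)
c(lower = yfhat + bn*r[s], upper = yfhat + bn*r[s + cc - 1])         # PI (4.7)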

Remark 4.2. Note that correction factors b_n → 1 are used in large sample confidence intervals and tests if the limiting distribution is N(0,1) or χ²_p, but a t_{d_n} or p F_{p,d_n} cutoff is used: t_{d_n,1−δ}/z_{1−δ} → 1 and p F_{p,d_n,1−δ}/χ²_{p,1−δ} → 1 if d_n → ∞ as n → ∞. Using correction factors for large sample confidence intervals, tests, prediction intervals, prediction regions, and bootstrap confidence regions improves the performance for moderate sample size n.

The second PI randomly divides the data into two half sets H and V where H has n_H = ⌈n/2⌉ of the cases and V has the remaining n_V = n − n_H cases i_1, ..., i_{n_V}. The estimator m̂_H(x) is computed using the training data set H. Then the validation residuals v_j = Y_{i_j} − m̂_H(x_{i_j}) are computed for the j = 1, ..., n_V cases in the validation set V. Find the Frey PI [v_{(s)}, v_{(s+c−1)}] of the validation residuals (replacing n in (4.3) by n_V = n − n_H). Then the second new 100(1 − δ)% large sample PI for Y_f is

[m̂_H(x_f) + v_{(s)}, m̂_H(x_f) + v_{(s+c−1)}].   (4.8)
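A companion sketch (again my own code) of the split PI (4.8) with OLS: fit on the half set H, then apply the Frey shorth PI (4.3) to the validation residuals from V.

set.seed(5)
n <- 200; delta <- 0.05
x <- rnorm(n); y <- 1 + 2*x + rexp(n) - 1
H <- sample(n, ceiling(n/2)); V <- setdiff(1:n, H)
fitH <- lm(y ~ x, subset = H)
v <- sort(y[V] - predict(fitH, newdata = data.frame(x = x[V])))
nV <- length(v)
cc <- min(nV, ceiling(nV*(1 - delta + 1.12*sqrt(delta/nV))))   # Frey correction (4.3)
s <- which.min(v[cc:nV] - v[1:(nV - cc + 1)])
xf <- 0.3                                                       # hypothetical new x
predict(fitH, newdata = data.frame(x = xf)) + c(v[s], v[s + cc - 1])   # PI (4.8)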


The PIs (4.7) and (4.8) are asymptotically equivalent if p is fixed and n → ∞, but m̂_H has about half the efficiency of m̂. When PI (4.7) has severe undercoverage because m̂ is a poor estimator of m, it is expected that PI (4.8) may have coverage closer to the nominal coverage. For example, if m̂ interpolates the data and m̂_H interpolates the training data from H, then the validation residuals will be huge. Hence PI (4.8) will be long compared to PI (4.7).

We can also motivate PI (4.7) by modifying the justification for the Lei et al. (2017) split conformal prediction interval [m̂_H(x_f) − a_q, m̂_H(x_f) + a_q] where a_q is an appropriate quantile of the absolute validation residuals. Suppose (Y_i, x_i) are iid for i = 1, ..., n, n + 1 where (Y_f, x_f) = (Y_{n+1}, x_{n+1}). Compute m̂_H(x) from the cases in H. For example, get β̂_H from the cases in H. Consider the validation residuals v_i for i = 1, ..., n_V and the validation residual v_{n_V+1} for case (Y_f, x_f). Since these n_V + 1 cases are iid, the probability that v_t has rank j for j = 1, ..., n_V + 1 is 1/(n_V + 1) for each t, i.e., the ranks follow the discrete uniform distribution. Let t = n_V + 1 and let v_{(i)} be the ordered residuals using i = 1, ..., n_V. That is, get the order statistics without using the unknown validation residual v_{n_V+1}. Then v_{(i)} has rank i if v_{(i)} < v_{n_V+1} but rank i + 1 if v_{(i)} > v_{n_V+1}. Thus

P(Y_f ∈ [m̂_H(x_f) + v_{(k)}, m̂_H(x_f) + v_{(k+b−1)}]) = P(v_{(k)} ≤ v_{n_V+1} ≤ v_{(k+b−1)}) ≥
P(v_{n_V+1} has rank between k + 1 and k + b − 1 and there are no tied ranks)
≥ (b − 1)/(n_V + 1) ≈ 1 − δ if b = ⌈(n_V + 1)(1 − δ)⌉ + 1 and k + b − 1 ≤ n_V.

This probability statement holds for a fixed k such as k = ⌈n_V δ/2⌉. The statement is not true when the shorth(b) estimator is used since the shortest interval using k = s can have s change with the data set. That is, s is not fixed. Hence if PIs were made from J independent data sets, the PIs with fixed k would contain Y_f about J(1 − δ) times, but this value would be smaller for the shorth(b) prediction intervals where s can change with the data set.

The above argument works if the estimator m̂(x) is “symmetric in the data.” The assumption of iid cases is stronger than that of iid errors e_i. The split conformal PI can have good coverage, but PI (4.7) does not need the error distribution to be symmetric to be asymptotically optimal.

Remark 4.3. For a good fitting model, residuals r_i tend to be smaller in magnitude than the errors e_i, while validation residuals v_i tend to be larger in magnitude than the e_i. Hence PI (4.7) needs a stronger correction factor than PI (4.8).


4.2 Prediction Regions

Consider predicting a p × 1 future test value x_f, given past training data x_1, ..., x_n where x_1, ..., x_n, x_f are iid. Much as confidence regions and intervals give a measure of precision for the point estimator θ̂ of the parameter θ, prediction regions and intervals give a measure of precision of the point estimator T = x̂_f of the future random vector x_f.

Definition 4.2. A large sample 100(1 − δ)% prediction region is a set A_n such that P(x_f ∈ A_n) → 1 − δ as n → ∞.

Some researchers define a large sample prediction region A_n such that P(x_f ∈ A_n) ≥ 1 − δ, asymptotically. A PI is a prediction region where p = 1. Open intervals are often used. Mahalanobis distances D_x and D_i = √(D²_i) are defined in Definition 1.15. The sample mean and covariance matrix (x̄, S) are defined in Definition 1.13.

Consider the hyperellipsoid

A_n = {x : D²_x(x̄, S) ≤ D²_{(c)}} = {x : D_x(x̄, S) ≤ D_{(c)}}.   (4.9)

If n is large, we can use c = k_n = ⌈n(1 − δ)⌉. If n is not large, using c = U_n, where U_n decreases to k_n, can improve small sample performance. U_n will be defined in the paragraph below Equation (4.13). Olive (2013a) showed that (4.9) is a large sample 100(1 − δ)% prediction region under mild conditions, although regions with smaller volumes may exist. Note that the result follows since if Σ_x and S are nonsingular, then the Mahalanobis distance is a continuous function of (x̄, S). Let µ = E(x) and D = D(µ, Σ_x). Then D_i →^D D and D²_i →^D D². Hence the sample percentiles of the D_i are consistent estimators of the population percentiles of D at continuity points of the cumulative distribution function (cdf) of D. The prediction region (4.9) estimates the highest density region for a large class of elliptically contoured distributions. (These distributions have a highest density region which is a hyperellipsoid determined by a population Mahalanobis distance. See Olive (2017b, ch. 3). The multivariate normal distribution is a special case.)

A problem with the prediction regions that cover ≈ 100(1 − δ)% of the training data cases x_i (such as (4.9) for appropriate c) is that they have coverage lower than the nominal coverage of 1 − δ for moderate n. This result is not surprising since empirically statistical methods perform worse on test data. Increasing c will improve the coverage for moderate samples. Empirically for many distributions, for n ≈ 20p, the prediction region (4.9) applied to iid data using c = k_n = ⌈n(1 − δ)⌉ tended to have undercoverage as high as 5%. The undercoverage decreases rapidly as n increases. Let q_n = min(1 − δ + 0.05, 1 − δ + p/n) for δ > 0.1 and

q_n = min(1 − δ/2, 1 − δ + 10δp/n), otherwise.   (4.10)


If 1 − δ < 0.999 and q_n < 1 − δ + 0.001, set q_n = 1 − δ. Using

c = ⌈n q_n⌉   (4.11)

in (4.9) decreased the undercoverage. Note that Equations (4.4) and (4.5) are similar to Equations (4.10) and (4.11), but replace p by d.

If (T, C) is a √n consistent estimator of (µ, dΣ) for some constant d > 0 where Σ is nonsingular, then

D²(T, C) = (x − T)^T C^{−1}(x − T)
= (x − µ + µ − T)^T [C^{−1} − d^{−1}Σ^{−1} + d^{−1}Σ^{−1}](x − µ + µ − T)
= d^{−1}D²(µ, Σ) + o_P(1).

Thus the sample percentiles of D²_i(T, C) are consistent estimators of the percentiles of d^{−1}D²(µ, Σ) (at continuity points D_{1−δ} of the cdf of D²(µ, Σ)). If x ∼ N_m(µ, Σ), then D²_x(µ, Σ) = D²(µ, Σ) ∼ χ²_m.

Suppose (T, C) = (x̄_M, b S_M) is the sample mean and scaled sample covariance matrix applied to some subset of the data. The classical and RMVN estimators satisfy this assumption. For h > 0, the hyperellipsoid

{z : (z − T)^T C^{−1}(z − T) ≤ h²} = {z : D²_z ≤ h²} = {z : D_z ≤ h}   (4.12)

has volume equal to

(2π^{p/2} / (p Γ(p/2))) h^p √det(C) = (2π^{p/2} / (p Γ(p/2))) h^p b^{p/2} √det(S_M).   (4.13)

A future observation (random vector) x_f is in the region (4.12) if D_{x_f} ≤ h. If (T, C) is a consistent estimator of (µ, dΣ) for some constant d > 0 where Σ is nonsingular, then (4.12) is a large sample 100(1 − δ)% prediction region if h = D_{(U_n)} where D_{(U_n)} is the 100 q_n th sample quantile of the D_i and q_n is defined above (4.11). If x_1, ..., x_n and x_f are iid, then region (4.12) is asymptotically optimal on a large class of elliptically contoured distributions in that its volume converges in probability to the volume of the highest density region.

The Olive (2013a) nonparametric prediction region uses (T, C) = (x̄, S). For the classical prediction region, see Chew (1966) and Johnson and Wichern (1988, pp. 134, 151). Refer to the above paragraph for D_{(U_n)}.

Definition 4.3. The large sample 100(1 − δ)% nonparametric prediction region for a future value x_f given iid data x_1, ..., x_n is

{z : D²_z(x̄, S) ≤ D²_{(U_n)}},   (4.14)

while the large sample 100(1 − δ)% classical prediction region is

{z : D²_z(x̄, S) ≤ χ²_{p,1−δ}}.   (4.15)
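A simplified sketch (my own code, not the linmodpack predrgn function used later) of the nonparametric prediction region (4.14): compute the cutoff D²_{(U_n)} from the training data and check whether a new x_f falls inside.

set.seed(6)
n <- 200; p <- 4; delta <- 0.05
x <- matrix(rexp(n*p), ncol = p)                 # skewed iid training data
xf <- rexp(p)                                    # future value
xbar <- colMeans(x); S <- cov(x)
D2 <- mahalanobis(x, center = xbar, cov = S)     # D^2_i(xbar, S)
qn <- if (delta > 0.1) min(1 - delta + 0.05, 1 - delta + p/n) else
        min(1 - delta/2, 1 - delta + 10*delta*p/n)
if (1 - delta < 0.999 && qn < 1 - delta + 0.001) qn <- 1 - delta
Un <- ceiling(n*qn)                              # (4.11) with c = Un
cutoff <- sort(D2)[Un]                           # D^2_(Un)
mahalanobis(xf, center = xbar, cov = S) <= cutoff   # TRUE if xf is in (4.14)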


If p is small, Mahalanobis distances tend to be right skewed with a population shorth that discards the right tail. For p = 1 and n ≥ 20, the finite sample correction factors c/n for c given by (4.3) and (4.11) do not differ by much more than 3% for 0.01 ≤ δ ≤ 0.5. See Figure 4.2 where ol = (Eq. 4.11)/n is plotted versus fr = (Eq. 4.3)/n for n = 20, 21, ..., 500. The top plot is for δ = 0.01, while the bottom plot is for δ = 0.3. The identity line is added to each plot as a visual aid. The value of n increases from 20 to 500 from the right of the plot to the left of the plot. Examining the axes of each plot shows that the correction factors do not differ greatly. R code to create Figure 4.2 is shown below.

cmar <- par("mar"); par(mfrow = c(2, 1))

par(mar=c(4.0,4.0,2.0,0.5))

frey(0.01); frey(0.3)

par(mfrow = c(1, 1)); par(mar=cmar)

Fig. 4.2 Correction Factor Comparison when δ = 0.01 (Top Plot) and δ = 0.3 (Bottom Plot)

Remark 4.4. The nonparametric prediction region (4.14) is useful if x_1, ..., x_n, x_f are iid from a distribution with a nonsingular covariance matrix, and the sample size n is large enough. The distribution could be continuous, discrete, or a mixture. The asymptotic coverage is 1 − δ if the 100(1 − δ)th percentile D_{1−δ} of D is a continuity point of the distribution of D, although prediction regions with smaller volume may exist. If D_{1−δ} is not a continuity point of the distribution of D, then the asymptotic coverage tends to be ≥ 1 − δ since a sample percentile with cutoff q_n that decreases to 1 − δ is used and a closed region is used. Often D has a continuous distribution and hence has no discontinuity points for 0 < δ < 1. (If there is a jump in the distribution from 0.9 to 0.96 at discontinuity point a, and the nominal coverage is 0.95, we want 0.96 coverage instead of 0.9. So we want the sample percentile to decrease to a.) The nonparametric prediction region (4.14) contains U_n of the training data cases x_i provided that S is nonsingular, even if the model is wrong. For many distributions, the coverage started to be close to 1 − δ for n ≥ 10p where the coverage is the simulated percentage of times that the prediction region contained x_f.

Remark 4.5. The most used prediction regions assume that the error vectors are iid from a multivariate normal distribution. The ratio of the volumes of regions (4.15) and (4.14) is

( χ²_{p,1−δ} / D²_{(U_n)} )^{p/2},

which can become close to zero rapidly as p gets large if the x_i are not from the light tailed multivariate normal distribution. For example, suppose χ²_{4,0.5} ≈ 3.33 and D²_{(U_n)} ≈ D²_{x,0.5} = 6. Then the ratio is (3.33/6)² ≈ 0.308. Hence if the data is not multivariate normal, severe undercoverage can occur if the classical prediction region is used, and the undercoverage tends to get worse as the dimension p increases. The coverage need not go to 0, since by the multivariate Chebyshev’s inequality, P(D²_x(µ, Σ_x) ≤ γ) ≥ 1 − p/γ > 0 for γ > p where the population covariance matrix Σ_x = Cov(x). See Budny (2014), Chen (2011), and Navarro (2014, 2016). Using γ = h² = p/δ in (4.12) usually results in prediction regions with volume and coverage that is too large.

Remark 4.6. The nonparametric prediction region (4.14) starts to have good coverage for n ≥ 10p for a large class of distributions. Olive (2013a) suggests n ≥ 50p may be needed for the prediction region to have a good volume. Of course for any n there are error distributions that will have severe undercoverage.

For the multivariate lognormal distribution with n = 20p, the Olive (2013a) large sample nonparametric 95% prediction region (4.14) had coverages 0.970, 0.959, and 0.964 for p = 100, 200, and 500. Some R code is below.

nruns=1000 #lognormal, p = 100

count<-0

for(i in 1:nruns){

x <- exp(matrix(rnorm(200000),ncol=100,nrow=2000))

xff <- exp(as.vector(rnorm(100)))

count <- count + predrgn(x,xf=xff)$inr}

count #970/1000

Notice that for the training data x_1, ..., x_n, if C^{−1} exists, then c ≈ 100 q_n% of the n cases are in the prediction regions for x_f = x_i, and q_n → 1 − δ even if (T, C) is not a good estimator. Hence the coverage q_n of the training data is robust to model assumptions. Of course the volume of the prediction region could be large if a poor estimator (T, C) is used or if the x_i do not come from an elliptically contoured distribution. Also notice that q_n = 1 − δ/2 or q_n = 1 − δ + 0.05 for n ≤ 20p, and q_n → 1 − δ as n → ∞. If q_n ≡ 1 − δ and (T, C) is a consistent estimator of (µ, dΣ) where d > 0 and Σ is nonsingular, then (4.12) is a large sample prediction region, but taking q_n given by (4.10) improves the finite sample performance of the prediction region. Taking q_n ≡ 1 − δ does not take into account variability of (T, C), and for small n the resulting prediction region tended to have undercoverage as high as min(0.05, δ/2). Using (4.10) helped reduce undercoverage for small n due to the unknown variability of (T, C).


4.3 Bootstrapping Hypothesis Tests and Confidence Regions

This section shows that, under regularity conditions, applying the nonparametric prediction region of Section 4.2 to a bootstrap sample results in a confidence region. The volume of a confidence region → 0 as n → ∞, while the volume of a prediction region goes to that of a population region that would contain a new x_f with probability 1 − δ. The nominal coverage is 100(1 − δ). If the actual coverage 100(1 − δ_n) > 100(1 − δ), then the region is conservative. If 100(1 − δ_n) < 100(1 − δ), then the region is liberal. A region that is 5% conservative is considered “much better” than a region that is 5% liberal.

When teaching confidence intervals, it is often noted that by the central limit theorem, the probability that Ȳ_n is within two standard deviations (2SD(Ȳ_n) = 2σ/√n) of θ = µ is about 95%. Hence the probability that θ is within two standard deviations of Ȳ_n is about 95%. Thus the interval [θ − 1.96 S/√n, θ + 1.96 S/√n] is a large sample 95% prediction interval for a future value of the sample mean Ȳ_{n,f} if θ is known, while [Ȳ_n − 1.96 S/√n, Ȳ_n + 1.96 S/√n] is a large sample 95% confidence interval for the population mean θ. Note that the lengths of the two intervals are the same. Where the interval is centered, at the parameter θ or the statistic Ȳ_n, determines whether the interval is a prediction or a confidence interval.

Definition 4.4. A large sample 100(1 − δ)% confidence region for a vector of parameters θ is a set A_n such that P(θ ∈ A_n) → 1 − δ as n → ∞.

Some researchers define a large sample confidence region A_n such that P(θ ∈ A_n) ≥ 1 − δ, asymptotically. The following theorem shows that the hyperellipsoid R_{1,c} centered at the statistic T_n is a large sample 100(1 − δ)% confidence region for θ, but the hyperellipsoid R_{1,p} centered at known θ is a large sample 100(1 − δ)% prediction region for a future value of the statistic T_{f,n}. The result uses the fact that the squared distance of θ from a statistic T_n in a hyperellipsoid R_{1,c} centered at T_n is equal to the squared distance of T_n from a parameter θ in a hyperellipsoid R_{1,p} centered at θ:

D²_θ(T_n, Σ_T) = (θ − T_n)^T Σ_T^{−1}(θ − T_n) = (T_n − θ)^T Σ_T^{−1}(T_n − θ) = D²_{T_n}(θ, Σ_T).

As in Remark 4.4, if the 100(1 − δ)th percentile of D² is not a continuity point of the distribution of D², then the asymptotic coverage tends to be ≥ 1 − δ if a sample percentile with cutoff q_n that decreases to 1 − δ is used, since a closed region is used. Often D² has a continuous distribution and hence has no discontinuity points for 0 < δ < 1.


Theorem 4.1. Let the 100(1 − δ)th percentile D²_{1−δ} be a continuity point of the distribution of D². Assume that D²_θ(T_n, Σ_T) →^D D², D²_θ(T_n, Σ̂_T) →^D D², and D̂²_{1−δ} →^P D²_{1−δ} where P(D² ≤ D²_{1−δ}) = 1 − δ.
i) Then R_{1,c} = {w : D²_w(T_n, Σ̂_T) ≤ D̂²_{1−δ}} is a large sample 100(1 − δ)% confidence region for θ, and if θ is known, then R_{1,p} = {w : D²_w(θ, Σ̂_T) ≤ D̂²_{1−δ}} is a large sample 100(1 − δ)% prediction region for a future value of the statistic T_{f,n}.
ii) Region R_{1,c} contains θ iff region R_{1,p} contains T_n.

Proof. i) From the discussion above, D²_θ(T_n, Σ̂_T) = D²_{T_n}(θ, Σ̂_T). Thus the probability that R_{1,c} contains θ is P(D²_θ(T_n, Σ̂_T) ≤ D̂²_{1−δ}) → 1 − δ, and the probability that R_{1,p} contains T_{f,n} is P(D²_θ(T_{f,n}, Σ̂_T) ≤ D̂²_{1−δ}) → 1 − δ, as n → ∞.
ii) D²_θ(T_n, Σ̂_T) ≤ D̂²_{1−δ} iff D²_{T_n}(θ, Σ̂_T) ≤ D̂²_{1−δ} since D²_θ(T_n, Σ̂_T) = D²_{T_n}(θ, Σ̂_T). □

The prediction region method centers the hyperellipsoid at the sample mean T̄* of the bootstrap sample, and more regularity conditions are needed. The following theorem is called the geometric argument since it uses hyperellipsoids. Let D²_{(U_B)} be the cutoff for the nonparametric prediction region (4.14) computed from the D²_i(T̄, S_T) for i = 1, ..., B. Hence n is replaced by B. See Figure 4.3 and the discussion after Remark 4.11 for some of the geometry of the bootstrap confidence regions. Since T_n depends on the sample size n, we need n Σ_{T_n} and (n S_T)^{−1} to be fairly well behaved (not too ill conditioned) for each n ≥ 20g, say.

Theorem 4.2: Geometric Argument. Suppose √n(T_n − θ) →^D u with E(u) = 0 and Cov(u) = Σ_u. Assume T_1, ..., T_B are iid with nonsingular covariance matrix Σ_{T_n}. Assume (n S_T)^{−1} is “not too ill conditioned.” Then the large sample 100(1 − δ)% prediction region R_p = {w : D²_w(T̄, S_T) ≤ D²_{(U_B)}} centered at T̄ contains a future value of the statistic T_f with probability 1 − δ_B → 1 − δ as B → ∞. Hence the region R_c = {w : D²_w(T_n, S_T) ≤ D²_{(U_B)}} is a large sample 100(1 − δ)% confidence region for θ.

Proof. The region R_c centered at a randomly selected T_n contains T̄ with probability 1 − δ_B. Since the √n(T_i − θ) are iid random vectors,

(√n(T_1 − θ), ..., √n(T_B − θ)) →^D (v_1, ..., v_B)

where the v_i are iid with the same distribution as u. (Use Theorems 1.20 and 1.21, and see Example 1.19.) For fixed B, the average of these random vectors is

√n(T̄ − θ) →^D (1/B) Σ_{i=1}^B v_i ∼ AN_g(0, Σ_u/B)

by Theorem 1.23. Hence (T̄ − θ) = O_P((nB)^{−1/2}), and T̄ gets arbitrarily close to θ compared to T_n as B → ∞. Thus R_c is a large sample 100(1 − δ)% confidence region for θ as n, B → ∞. □

Hence if there was an iid sample T_{1,n}, ..., T_{B,n} of the statistic, the large sample 100(1 − δ)% nonparametric prediction region {w : D²_w(T̄, S_T) ≤ D²_{(U_B)}} for T_{f,n} contains θ with asymptotic coverage ≥ 1 − δ. To make the asymptotic coverage equal to 1 − δ, use the large sample 100(1 − δ)% confidence region {w : D²_w(T_{1,n}, S_T) ≤ D²_{(U_B)}}. The prediction region method bootstraps this procedure by using a bootstrap sample of the statistic T*_{1,n}, ..., T*_{B,n}. See Section 4.3.4. Centering the region at T*_{1,n} instead of T̄* is not needed since the bootstrap sample is centered near T_n: the distribution of √n(T_n − θ) is approximated by the distribution of √n(T* − T_n) or by the distribution of √n(T* − T̄*). See Equations (4.18) and (4.25)–(4.27). Also note that if w*_{i1}, ..., w*_{in} is a permutation of w_1, ..., w_n, if T_n = t(w_1, ..., w_n), and if T*_{in} = t(w*_{i1}, ..., w*_{in}), then T*_{in} = T_n if t is permutation invariant.

A bootstrap sample is a sample from the empirical distribution: there are n cases, and a sample of size n of the cases is drawn with replacement and each case is equally likely to be drawn. The residual bootstrap draws a bootstrap sample from the residuals. Section 4.3.2 will help clarify ideas.

When g = 1, the Frey (2013) shorth(c) interval given by (4.1), with

c = min(B, ⌈B[1 − δ + 1.12√(δ/B)]⌉),   (4.16)

can be applied to the bootstrap sample to obtain a CI for θ. One sided confidence intervals give a lower or upper confidence bound for θ. A large sample 100(1 − δ)% lower confidence interval (−∞, U_n] uses an upper confidence bound U_n and is in the lower tail of the distribution of θ. A large sample 100(1 − δ)% upper confidence interval [L_n, ∞) uses a lower confidence bound L_n and is in the upper tail of the distribution of θ. These CIs can be useful if θ ∈ [a, b] and θ = a or θ = b. For example, [a, b] = [0, 1] if θ = ρ², the squared population correlation. Then use [a, U_n] and [L_n, b] as CIs.

Definition 4.5. The interval [L_n, U_n] is a large sample 100(1 − δ)% confidence interval for θ if P(L_n ≤ θ ≤ U_n) → 1 − δ as n → ∞. The interval (−∞, U_n] is a large sample 100(1 − δ)% lower confidence interval for θ if P(θ ≤ U_n) → 1 − δ as n → ∞. The interval [L_n, ∞) is a large sample 100(1 − δ)% upper confidence interval for θ if P(θ ≥ L_n) → 1 − δ as n → ∞.

The percentile method for obtaining a one sided CI is to lop off the appropriate tail of the bootstrap sample. If T*_{(1)}, ..., T*_{(B)} are the order statistics of the bootstrap sample, then (−∞, T*_{(⌈B(1−δ)⌉)}] and [T*_{(⌈Bδ⌉)}, ∞) are used.


See Chen (2016). The percentile CI is a prediction interval applied to the bootstrap sample, and PI theory suggests that the percentile CIs have undercoverage for moderate B since a PI that covers 100(1 − δ)% of the training data will have lower coverage on test data.

PI theory estimates the undercoverage and then compensates. Hence the following large sample CIs give less undercoverage if n is large and B is moderate, say B ≥ 50. Let c be given by (4.16).

Definition 4.6. The large sample 100(1 − δ)% lower shorth CI for θ is (−∞, T*_{(c)}], while the large sample 100(1 − δ)% upper shorth CI for θ is [T*_{(B−c+1)}, ∞). The large sample 100(1 − δ)% shorth(c) CI uses the interval [T*_{(1)}, T*_{(c)}], [T*_{(2)}, T*_{(c+1)}], ..., [T*_{(B−c+1)}, T*_{(B)}] of shortest length.
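A sketch of Definition 4.6 (my own code, not the shorthLU function used later in the text): the shorth(c), lower, and upper shorth CIs from a bootstrap sample Tstar of a scalar statistic, with c given by (4.16).

shorthCIs <- function(Tstar, delta = 0.05) {
  B <- length(Tstar); Ts <- sort(Tstar)
  cc <- min(B, ceiling(B*(1 - delta + 1.12*sqrt(delta/B))))   # (4.16)
  s <- which.min(Ts[cc:B] - Ts[1:(B - cc + 1)])
  list(shorth = c(Ts[s], Ts[s + cc - 1]),
       lower  = c(-Inf, Ts[cc]),            # (-infinity, T*_(c)]
       upper  = c(Ts[B - cc + 1], Inf))     # [T*_(B-c+1), infinity)
}
# usage: bootstrap the sample median of some data x
set.seed(7); x <- rexp(50)
Tstar <- replicate(1000, median(sample(x, replace = TRUE)))
shorthCIs(Tstar)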

The shorth confidence interval is a practical implementation of the Hall (1988) shortest bootstrap interval based on all possible bootstrap samples. Note that if √n(T_n − θ) →^D u and √n(T*_i − T_n) →^D u where u has a unimodal probability density function symmetric about 0, then the shorth confidence interval, the usual percentile method (two sided) confidence interval that uses the central proportion of the bootstrap sample, and the prediction region method CI are asymptotically equivalent. (See the discussion below Remark 4.10. Apply the geometric argument of Theorem 4.2 to justify the three CIs. Symmetry is needed so that the first two CIs are centered at T̄*, asymptotically.)

4.3.1 Bootstrapping the Population Coefficient of Multiple Determination

This subsection illustrates a case where the shorth(c) bootstrap CI fails, but the lower shorth CI can be useful. See Definition 4.6. Another application is given in Fan (2016).

The multiple linear regression (MLR) model is

Y_i = β_1 + x_{i,2} β_2 + · · · + x_{i,p} β_p + e_i = x_i^T β + e_i

for i = 1, ..., n. Here n is the sample size and the random variable e_i is the ith error. When a constant is in the model, let the total sum of squares

SSTO = Σ_{i=1}^n (Y_i − Ȳ)²,

the regression sum of squares

SSR = Σ_{i=1}^n (Ŷ_i − Ȳ)²,

and the residual sum of squares or error sum of squares

SSE = Σ_{i=1}^n (Y_i − Ŷ_i)² = Σ_{i=1}^n r_i².

Then SSTO = SSE + SSR, and if SSTO ≠ 0, the coefficient of multiple determination

R² = [corr(Y_i, Ŷ_i)]² = SSR/SSTO = 1 − SSE/SSTO

where corr(Y_i, Ŷ_i) is the sample correlation of Y_i and Ŷ_i. Assume that the variance of the errors is σ²_e and that the variance of Y is σ²_Y. Let the linear combination L = Σ_{i=2}^p x_i β_i where Y = β_1 + Σ_{i=2}^p x_i β_i + e = β_1 + L + e. Let the variance of L be σ²_L. Then

R² = 1 − [Σ_{i=1}^n r_i² / Σ_{i=1}^n (Y_i − Ȳ)²] →^P τ² = 1 − σ²_e/σ²_Y = 1 − σ²_e/(σ²_e + σ²_L).

Here we assume that e is independent of the predictors x_2, ..., x_p. Hence e is independent of L and the variance σ²_Y = V(L + e) = V(L) + V(e) = σ²_L + σ²_e. One of the sufficient conditions for the shorth(c) interval to be a large sample CI for θ is √n(T − θ) →^D N(0, σ²). If the function t(θ) has an inverse, and √n(t(T) − t(θ)) →^D N(0, v²), then the above condition typically holds by the delta method. See the paragraph below Definition 4.6.

For T = R² and θ = τ², the test statistic F_0 for testing H0: β_2 = · · · = β_p = 0 in the ANOVA F test has (p − 1)F_0 →^D χ²_{p−1} for a large class of error distributions when H0 is true, where

F_0 = [R²/(1 − R²)] [(n − p)/(p − 1)]

if the MLR model has a constant. If H0 is false, then F_0 has an asymptotic scaled noncentral χ² distribution. These results suggest that the large sample distribution of √n(R² − τ²) may not be N(0, σ²) if H0 is false so τ² > 0. If τ² = 0, we may have √n(R² − 0) →^D N(0, 0), the point mass at 0. Hence the shorth CI may not be a large sample CI for τ². The lower shorth CI should be useful for testing H0: τ² = 0 versus HA: τ² > a where 0 < a ≤ 1 since the coverage is 1 and the length of the CI converges to 0. So reject H0 if a is not in the CI.

The simulation simulated iid data w with u = A w and A_{ij} = ψ for i ≠ j and A_{ii} = 1 where 0 ≤ ψ < 1 and u = (x_2, ..., x_p)^T. Hence cor(x_i, x_j) = ρ = [2ψ + (p − 3)ψ²]/[1 + (p − 2)ψ²] for i ≠ j. If ψ = 1/√(kp), then ρ → 1/(k + 1) as p → ∞ where k > 0. We used w ∼ N_{p−1}(0, I_{p−1}). If ψ is high or if p is large with ψ ≥ 0.5, then the data are clustered tightly about the line with direction 1 = (1, ..., 1)^T, and there is a dominant principal component with eigenvector 1 and eigenvalue λ_1. We used ψ = 0, 1/√p, and 0.9. Then ρ = 0, ρ → 0.5, or ρ → 1 as p → ∞. We also used V(x_2) = · · · = V(x_p) = σ²_x. If p > 2, then Cov(x_i, x_j) = ρσ²_x for i ≠ j and Cov(x_i, x_j) = V(x_i) = σ²_x for i = j. Then V(Y) = σ²_Y = σ²_L + σ²_e where

σ²_L = V(L) = V(Σ_{i=2}^p β_i x_i) = Cov(Σ_{i=2}^p β_i x_i, Σ_{j=2}^p β_j x_j) = Σ_{i=2}^p Σ_{j=2}^p β_i β_j Cov(x_i, x_j)
= Σ_{i=2}^p β_i² σ²_x + 2ρσ²_x Σ_{i=2}^p Σ_{j=i+1}^p β_i β_j.

The simulations took β_i ≡ 0 or β_i ≡ 1 for i = 2, ..., p. For the latter case,

σ²_L = V(L) = (p − 1)σ²_x + 2ρσ²_x (p − 1)(p − 2)/2.

The zero mean errors e_i were from 5 distributions: i) N(0,1), ii) t_3, iii) EXP(1) − 1, iv) uniform(−1, 1), and v) (1 − ε)N(0, 1) + εN(0, (1 + s)²) with ε = 0.1 and s = 9 in the simulation. Then Y = 1 + b x_2 + b x_3 + · · · + b x_p + e with b = 0 or b = 1.

Remark 4.7. Suppose the simulation uses K runs and W_i = 1 if µ is in the ith CI, and W_i = 0 otherwise, for i = 1, ..., K. Then the W_i are iid binomial(1, 1 − δ_n) where ρ_n = 1 − δ_n is the true coverage of the CI when the sample size is n. Let ρ̂_n = W̄. Since Σ_{i=1}^K W_i ∼ binomial(K, ρ_n), the standard error SE(W̄) = √(ρ_n(1 − ρ_n)/K). For K = 5000 and ρ_n near 0.9, we have 3SE(W̄) ≈ 0.01. Hence an observed coverage ρ̂_n within 0.01 of the nominal coverage 1 − δ suggests that there is no reason to doubt that the CI has the nominal coverage. So for a large sample 95% CI, we want the observed coverage to be between 0.94 and 0.96. Also a difference of 0.01 is not large. Coverage slightly higher than the nominal coverage is better than coverage slightly lower than the nominal coverage.

Bootstrapping confidence intervals for quantities like ρ² and τ² is notoriously difficult. If β_2 = · · · = β_p = 0, then σ²_L = 0 and τ² = 0. However, the probability that R²*_i > 0 is 1. Hence the usual two sided bootstrap percentile and shorth intervals for τ² will never contain 0. The one sided bootstrap CI [0, T*_{(c)}] always contains 0, and is useful if the length of the CI goes to 0 as n → ∞. In the table below, β_i = b for i = 2, ..., p. If b = 0, then τ² = 0.

The simulation for the table used 5000 runs with the bootstrap sample size B = 1000. When n = 400, the shorth(c) CI never contains τ² = 0 and the average length of the CI is 0.035. See ccov and clen. The lower shorth CI always contained τ² = 0 with lcov = 1, and the average CI length was llen = 0.036. The upper shorth CI never contains τ² = 0, and the average length is near 1.

Table 4.1 Bootstrapping τ² with R² and B = 1000

etype   n    p  b  ψ  τ²  ccov  clen    lcov  llen    ucov  ulen
1       100  4  0  0  0   0     0.135   1     0.137   0     0.990
1       200  4  0  0  0   0     0.0693  1     0.0702  0     0.995
1       400  4  0  0  0   0     0.0354  1     0.0358  0     0.988

Three slpack functions were used in the simulation. The function shorthLU gets the shorth(c) CI, the lower shorth CI, and the upper shorth CI. The function Rsqboot bootstraps R², while the function Rsqbootsim does the simulation. Some R code for the first line of Table 4.1 is below, where b = cc.

Rsqbootsim(n=100,p=4,BB=1000,nruns=5000,type=1,psi=0,

cc=0)

$rho

[1] 0

$sigesq

[1] 1

$sigLsq

[1] 0

$poprsq

[1] 0

$cicov

[1] 0

$avelen

[1] 0.1348881

$lcicov

[1] 1

$lavelen

[1] 0.13688

$ucicov

[1] 0

$uavelen

[1] 0.9896608

4.3.2 The Bootstrap

Definition 4.7. Suppose that data x_1, ..., x_n has been collected and observed. Often the data is a random sample (iid) from a distribution with cdf F. The empirical distribution is a discrete distribution where the x_i are the possible values, and each value is equally likely. If w is a random variable having the empirical distribution, then p_i = P(w = x_i) = 1/n for i = 1, ..., n. The cdf of the empirical distribution is denoted by F_n.

Example 4.2. Let w be a random variable having the empirical distribution given by Definition 4.7. Show that E(w) = x̄ ≡ x̄_n and Cov(w) = ((n − 1)/n) S ≡ ((n − 1)/n) S_n.

Solution: Recall that for a discrete random vector, the population expected value E(w) = Σ x_i p_i where the x_i are the values that w takes with positive probability p_i. Similarly, the population covariance matrix

Cov(w) = E[(w − E(w))(w − E(w))^T] = Σ (x_i − E(w))(x_i − E(w))^T p_i.

Hence

E(w) = Σ_{i=1}^n x_i (1/n) = x̄,

and

Cov(w) = Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^T (1/n) = ((n − 1)/n) S. □

Example 4.3. If W_1, ..., W_n are iid from a distribution with cdf F_W, then the empirical cdf F_n corresponding to F_W is given by

F_n(y) = (1/n) Σ_{i=1}^n I(W_i ≤ y)

where the indicator I(W_i ≤ y) = 1 if W_i ≤ y and I(W_i ≤ y) = 0 if W_i > y. Fix n and y. Then n F_n(y) ∼ binomial(n, F_W(y)). Thus E[F_n(y)] = F_W(y) and V[F_n(y)] = F_W(y)[1 − F_W(y)]/n. By the central limit theorem,

√n(F_n(y) − F_W(y)) →^D N(0, F_W(y)[1 − F_W(y)]).

Thus F_n(y) − F_W(y) = O_P(n^{−1/2}), and F_n is a reasonable estimator of F_W if the sample size n is large.

Suppose there is data w_1, ..., w_n collected into an n × p matrix W. Let the statistic T_n = t(W) = T(F_n) be computed from the data. Suppose the statistic estimates µ = T(F), and let t(W*) = t(F*_n) = T*_n indicate that t was computed from an iid sample from the empirical distribution F_n: a sample w*_1, ..., w*_n of size n was drawn with replacement from the observed sample w_1, ..., w_n. This notation is used for von Mises differentiable statistical functions in large sample theory. See Serfling (1980, ch. 6). The empirical distribution is also important for the influence function (widely used in robust statistics). The empirical bootstrap or nonparametric bootstrap or naive bootstrap draws B samples of size n from the rows of W, e.g. from the empirical distribution of w_1, ..., w_n. Then T*_{jn} is computed from the jth bootstrap sample for j = 1, ..., B.

Example 4.4. Suppose the data is 1, 2, 3, 4, 5, 6, 7. Then n = 7 and the sample median T_n is 4. Using R, we drew B = 2 bootstrap samples (samples of size n drawn with replacement from the original data) and computed the sample medians T*_{1,n} = 3 and T*_{2,n} = 4.

b1 <- sample(1:7,replace=T)

b1

[1] 3 2 3 2 5 2 6

median(b1)

[1] 3

b2 <- sample(1:7,replace=T)

b2

[1] 3 5 3 4 3 5 7

median(b2)

[1] 4

The bootstrap has been widely used to estimate the population covariance matrix of the statistic Cov(T_n), for testing hypotheses, and for obtaining confidence regions (often confidence intervals). An iid sample T_{1n}, ..., T_{Bn} of size B of the statistic would be very useful for inference, but typically we only have one sample of data and one value T_n = T_{1n} of the statistic. Often T_n = t(w_1, ..., w_n), and the bootstrap sample T*_{1n}, ..., T*_{Bn} is formed where T*_{jn} = t(w*_{j1}, ..., w*_{jn}). Section 4.3.4 will show that T*_{1n} − T_n, ..., T*_{Bn} − T_n is pseudodata for T_{1n} − θ, ..., T_{Bn} − θ when n is large in that √n(T_n − θ) →^D u and √n(T* − T_n) →^D u.

The residual bootstrap is often useful for additive error regression models of the form Y_i = m(x_i) + e_i = m̂(x_i) + r_i = Ŷ_i + r_i for i = 1, ..., n where the ith residual r_i = Y_i − Ŷ_i. Let Y = (Y_1, ..., Y_n)^T, r = (r_1, ..., r_n)^T, and let X be an n × p matrix with ith row x_i^T. Then the fitted values Ŷ_i = m̂(x_i), and the residuals are obtained by regressing Y on X. Here the errors e_i are iid, and it would be useful to be able to generate B iid samples e_{1j}, ..., e_{nj} from the distribution of e_i where j = 1, ..., B. If the m(x_i) were known, then we could form a vector Y_j where the ith element Y_{ij} = m(x_i) + e_{ij} for i = 1, ..., n. Then regress Y_j on X. Instead, draw samples r*_{1j}, ..., r*_{nj} with replacement from the residuals, then form a vector Y*_j where the ith element Y*_{ij} = m̂(x_i) + r*_{ij} for i = 1, ..., n. Then regress Y*_j on X.

Example 4.5. For multiple linear regression, Y_i = x_i^T β + e_i is written in matrix form as Y = Xβ + e. Regress Y on X to obtain β̂, r, and Ŷ with ith element Ŷ_i = m̂(x_i) = x_i^T β̂. For j = 1, ..., B, regress Y*_j on X to form β̂*_{1,n}, ..., β̂*_{B,n} using the residual bootstrap.
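A sketch of the residual bootstrap of Example 4.5 for OLS (my own code with simulated data): resample the residuals, rebuild Y*_j = Ŷ + r*_j, and refit to obtain the β̂*_j.

set.seed(8)
n <- 100; B <- 1000
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2*x1 - x2 + rnorm(n)
X <- cbind(1, x1, x2)
fit <- lm(y ~ x1 + x2)
yhat <- fitted(fit); res <- resid(fit)
betastar <- t(replicate(B, {
  ystar <- yhat + sample(res, n, replace = TRUE)   # Y*_j = Yhat + r*_j
  lm.fit(X, ystar)$coefficients                    # OLS refit on the same X
}))
colMeans(betastar)      # centered near coef(fit)
cov(betastar)           # approximately ((n-p)/n) * MSE * (X^T X)^{-1}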


Now examine the OLS model. Let Ŷ = Ŷ_OLS = X β̂_OLS = HY be the fitted values from the OLS full model where H = X(X^T X)^{−1}X^T. Let r^W denote an n × 1 random vector of elements selected with replacement from the OLS full model residuals. Following Efron (1982, p. 36), Y* = X β̂_OLS + r^W follows a standard linear model where the elements r^W_i of r^W are iid from the empirical distribution of the OLS full model residuals r_i. Hence

E(r^W_i) = (1/n) Σ_{i=1}^n r_i = 0,   V(r^W_i) = σ²_n = (1/n) Σ_{i=1}^n r_i² = ((n − p)/n) MSE,

E(r^W) = 0, and Cov(Y*) = Cov(r^W) = σ²_n I_n.

Let β̂ = β̂_OLS. Then β̂* = (X^T X)^{−1}X^T Y* with Cov(β̂*) = σ²_n (X^T X)^{−1} = ((n − p)/n) MSE (X^T X)^{−1}, and E(β̂*) = (X^T X)^{−1}X^T E(Y*) = (X^T X)^{−1}X^T HY = β̂ = β̂_n since HX = X. The expectations are with respect to the bootstrap distribution where Y acts as a constant.

For the OLS estimator β̂ = β̂_OLS, the estimated covariance matrix of β̂_OLS is Ĉov(β̂_OLS) = MSE(X^T X)^{−1}. The sample covariance matrix of the β̂* is estimating Cov(β̂*) as B → ∞. Hence the residual bootstrap standard error SE(β̂*_i) ≈ √((n − p)/n) SE(β̂_i) for i = 1, ..., p where β̂_OLS = β̂ = (β̂_1, ..., β̂_p)^T. The LS CLT Theorem 3.1 says

√n(β̂ − β) →^D N_p(0, lim_{n→∞} n Cov(β̂_OLS)) ∼ N_p(0, σ²V)

where n(X^T X)^{−1} → V. Since Y* = X β̂_OLS + r^W follows a standard linear model, it may not be surprising that

√n(β̂* − β̂_OLS) →^D N_p(0, lim_{n→∞} n Cov(β̂*)) ∼ N_p(0, σ²V).

See Freedman (1981).

Example 4.6. Suppose there is training data (y_i, x_i) for the model y_i = m(x_i) + ε_i for i = 1, ..., n, and it is desired to predict a future test value y_f given x_f and the training data. The model can be fit and the residual vectors formed. One method for obtaining a prediction region for y_f is to form the pseudodata ŷ_f + ε̂_i for i = 1, ..., n, and apply the nonparametric prediction region (4.14) to the pseudodata. See Olive (2017b, 2018). The residual bootstrap could also be used to make a bootstrap sample ŷ_f + ε̂*_1, ..., ŷ_f + ε̂*_B where the ε̂*_j are selected with replacement from the residual vectors for j = 1, ..., B. As B → ∞, the bootstrap sample will take on the n values ŷ_f + ε̂_i (the pseudodata) with probabilities converging to 1/n for i = 1, ..., n.


Suppose there is a statistic $T_n$ that is a $g \times 1$ vector. Let
$$\overline{T}^* = \frac{1}{B}\sum_{i=1}^B T_i^* \quad \text{and} \quad S_T^* = \frac{1}{B-1}\sum_{i=1}^B (T_i^* - \overline{T}^*)(T_i^* - \overline{T}^*)^T \quad (4.17)$$
be the sample mean and sample covariance matrix of the bootstrap sample $T_1^*, ..., T_B^*$ where $T_i^* = T_{i,n}^*$. Fix $n$, and let $E(T_{i,n}^*) = \theta_n$ and $Cov(T_{i,n}^*) = \Sigma_n$. We will often assume that $Cov(T_n) = \Sigma_T$ and $\sqrt{n}(T_n - \theta) \stackrel{D}{\rightarrow} N_g(0, \Sigma_A)$. Often $n\Sigma_T \stackrel{P}{\rightarrow} \Sigma_A$. For example, using least squares and the residual bootstrap for the multiple linear regression model, $\Sigma_n = \frac{n-p}{n}MSE(X^TX)^{-1}$, $\theta_n = \hat{\beta}$, $\theta = \beta$, and $\Sigma_A = \sigma^2 \lim_{n\to\infty}(X^TX/n)^{-1}$.

Suppose the $T_i^* = T_{i,n}^*$ are iid from some distribution with cdf $F_n$. For example, if $T_{i,n}^* = t(F_n^*)$ where iid samples from $F_n$ are used, then $F_n$ is the cdf of $t(F_n^*)$. With respect to $F_n$, both $\theta_n$ and $\Sigma_n$ are parameters, but with respect to $F$, $\theta_n$ is a random vector and $\Sigma_n$ is a random matrix. For fixed $n$, by the multivariate central limit theorem,
$$\sqrt{B}(\overline{T}^* - \theta_n) \stackrel{D}{\rightarrow} N_g(0, \Sigma_n) \quad \text{and} \quad B(\overline{T}^* - \theta_n)^T [S_T^*]^{-1}(\overline{T}^* - \theta_n) \stackrel{D}{\rightarrow} \chi^2_g$$
as $B \to \infty$.
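As a minimal R sketch of (4.17) and the corresponding squared Mahalanobis distances, assuming the bootstrap sample is stored as the rows of a B x g matrix betas (for example, the betas matrix returned by the residual bootstrap sketch above):

# Sample mean, covariance matrix (4.17), and squared Mahalanobis distances.
Tbar <- colMeans(betas)                              # bootstrap sample mean
ST   <- cov(betas)                                   # bootstrap sample covariance S*_T
d2   <- mahalanobis(betas, center = Tbar, cov = ST)  # D^2_i(Tbar, S*_T) for i = 1,...,B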

Remark 4.8. For Examples 4.2, 4.5, and 4.6, the bootstrap works but is expensive compared to alternative methods. For Example 4.2, fix $n$; then $\overline{T}^* \stackrel{P}{\rightarrow} \theta_n = \overline{x}$ and $S_T^* \stackrel{P}{\rightarrow} (n-1)S/n$ as $B \to \infty$, but using $(\overline{x}, S)$ makes more sense. For Example 4.5, using $\hat{\beta}$ and the classical estimated covariance matrix $\widehat{Cov}(\hat{\beta}) = MSE(X^TX)^{-1}$ makes more sense than using the bootstrap. For Example 4.6, use the pseudodata instead of the residual bootstrap. For these three examples, it is known how the bootstrap sample behaves as $B \to \infty$. The bootstrap can be very useful when $\sqrt{n}(T_n - \theta) \stackrel{D}{\rightarrow} N_g(0, \Sigma_A)$ but it is not known how to estimate $\Sigma_A$ without using a resampling method like the bootstrap. The bootstrap may also be useful when $\sqrt{n}(T_n - \theta) \stackrel{D}{\rightarrow} u$ but the limiting distribution (the distribution of $u$) is unknown.

Remark 4.9. From Example 4.5, $Cov(\hat{\beta}^*) = \frac{n-p}{n}MSE(X^TX)^{-1} = \frac{n-p}{n}\widehat{Cov}(\hat{\beta})$ where $\widehat{Cov}(\hat{\beta}) = MSE(X^TX)^{-1}$ starts to give good estimates of $Cov(\hat{\beta}) = \Sigma_T$ for many error distributions if $n \geq 10p$ and $T = \hat{\beta}$. For the residual bootstrap with large $B$, note that $S_T^* \approx 0.95\,\widehat{Cov}(\hat{\beta})$ for $n = 20p$ and $S_T^* \approx 0.99\,\widehat{Cov}(\hat{\beta})$ for $n = 100p$. Hence we may need $n >> p$ before $S_T^*$ is a good estimator of $Cov(T) = \Sigma_T$. The distribution of $\sqrt{n}(T_n - \theta)$ is approximated by the distribution of $\sqrt{n}(T^* - T_n)$ or by the distribution of $\sqrt{n}(T^* - \overline{T}^*)$, but $n$ may need to be large before the approximation is good.


Suppose the bootstrap sample mean $\overline{T}^*$ estimates $\theta$, and the bootstrap sample covariance matrix $S_T^*$ estimates $c_n Cov(T_n) \approx c_n \Sigma_T$ where $c_n$ increases to 1 as $n \to \infty$. Then $S_T^*$ is not a good estimator of $Cov(T_n)$ until $c_n \approx 1$ ($n \geq 100p$ for OLS $\hat{\beta}$), but the squared Mahalanobis distance $D^{2*}_{w}(\overline{T}^*, S_T^*) \approx D^2_{w}(\theta, \Sigma_T)/c_n$ and $D^{2*}_{(U_B)} \approx D^2_{1-\delta}/c_n$. Hence the prediction region method, described below, has a cutoff $D^{2*}_{(U_B)}$ that estimates the cutoff $D^2_{1-\delta}/c_n$. Thus the prediction region method may give good results for much smaller $n$ than a bootstrap method that uses a $\chi^2_{g,1-\delta}$ cutoff when a cutoff $\chi^2_{g,1-\delta}/c_n$ should be used for moderate $n$.

4.3.3 The Prediction Region Method for Hypothesis Testing

Consider testing $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$ where $\theta_0$ is a known $g \times 1$ vector. If a confidence region can be constructed for $\theta - \theta_0$, then fail to reject $H_0$ if $0$ is in the confidence region, and reject $H_0$ if $0$ is not in the confidence region. For example, let $\theta = A\beta$ where $\beta$ is a $p \times 1$ vector of parameters, and $A$ is a known full rank $g \times p$ matrix with $1 \leq g \leq p$.

The prediction region method makes a bootstrap sample $w_i = \hat{\theta}^*_i - \theta_0$ for $i = 1, ..., B$. Make the nonparametric prediction region (4.14) for the $w_i$, and reject $H_0$ if $0$ is not in the prediction region. As shown below, the prediction region method is a special case of the percentile method when $g = 1$, and a special case of bootstrapping a test statistic.

For $g = 1$, the percentile method uses an interval that contains $U_B \approx k_B = \lceil B(1-\delta) \rceil$ of the $T^*_{i,n}$ from a bootstrap sample $T^*_{1,n}, ..., T^*_{B,n}$ where the statistic $T_n$ is an estimator of $\theta$ based on a sample of size $n$. Often the $n$ is suppressed in the double subscripts. Here $\lceil x \rceil$ is the smallest integer $\geq x$, e.g. $\lceil 7.8 \rceil = 8$. Let $T^*_{(1)}, T^*_{(2)}, ..., T^*_{(B)}$ be the order statistics of the bootstrap sample. Then one version of the percentile method discards the largest and smallest $\lceil B\delta/2 \rceil$ order statistics, resulting in an interval $[L_B, R_B]$ that is a large sample $100(1-\delta)\%$ confidence interval (CI) for $\theta$, and also a large sample $100(1-\delta)\%$ prediction interval (PI) for a future bootstrap value $T^*_{f,n}$. Olive (2014: p. 283, 2017ab, 2018, 2019) recommends using the shorth(c) estimator for the percentile method. The shorth interval tends to be shorter than the interval that deletes the smallest and largest $\lceil B\delta/2 \rceil$ observations $W_i = T^*_{i,n}$ when the $W_i$ do not come from a symmetric distribution. Frey (2013) showed that for large $B\delta$ and iid data, the shorth($k_B$) PI has maximum undercoverage $\approx 1.12\sqrt{\delta/B}$.
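A minimal R sketch of a shorth(c) interval for a univariate bootstrap sample follows, assuming w is a vector of B bootstrap values; the function name shorthc is hypothetical, while the text's shorth3 function serves this purpose.

# shorth(c): the shortest interval containing c of the B values in w.
shorthc <- function(w, alpha = 0.05){
  B <- length(w)
  # Frey (2013) correction applied through the choice of c
  cc <- min(B, ceiling(B * (1 - alpha + 1.12 * sqrt(alpha / B))))
  ws <- sort(w)
  lengths <- ws[cc:B] - ws[1:(B - cc + 1)]   # lengths of candidate intervals
  j <- which.min(lengths)                    # start of the shortest interval
  c(ws[j], ws[j + cc - 1])
}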

Consider testing $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$ with the statistic $T_n = \hat{\theta} - \theta_0$. The prediction region method bootstraps the squared Mahalanobis distances, forming the bootstrap sample $w_i = T^*_i = \hat{\theta}^*_i - \theta_0$ and the squared Mahalanobis distances $D^2_i = D^2_i(\overline{T}^*, S_T^*) = (T^*_i - \overline{T}^*)^T [S_T^*]^{-1} (T^*_i - \overline{T}^*)$ where $(\overline{T}^*, S_T^*)$ are the sample mean and sample covariance matrix of $T^*_1, ..., T^*_B$. See (4.17). Then the percentile method that contains the smallest $U_B$ distances is used to get the closed interval $[0, D_{(U_B)}]$. Let $D^2_0 = \overline{T}^{*T}[S_T^*]^{-1}\overline{T}^*$, and fail to reject $H_0$ if $D_0 \leq D_{(U_B)}$ and reject $H_0$ if $D_0 > D_{(U_B)}$. This percentile method is equivalent to computing the prediction region (4.14) on the $w_i = T^*_i$ and checking whether $0$ is in the prediction region.

Remark 4.10. For $g = 1$ we will use the shorth(c) intervals with $c = \min(B, \lceil B(1-\delta+1.12\sqrt{\delta/B})\rceil)$. For $g > 1$, let $q_B = \min(1-\delta+0.05, 1-\delta+g/B)$ for $\delta > 0.1$, and let $q_B = \min(1-\delta/2, 1-\delta+10\delta g/B)$ for $\delta \leq 0.1$. If $1-\delta < 0.999$ and $q_B < 1-\delta+0.001$, set $q_B = 1-\delta$. Let $c = \lceil Bq_B \rceil$, and let $D_{(U_B)}$ be the $100q_B$th percentile of the $D_i$. We may need $n \geq 50g$ and $B \geq \max(100, n, 50g)$. If $d$ is the model degrees of freedom, we also want $n \geq 20d$. Sometimes much larger $n$ is needed to avoid undercoverage.
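A minimal R sketch of the Remark 4.10 cutoff, assuming d2 is the vector of squared distances $D^2_i$ from the earlier sketch; the function name prcutoff is hypothetical.

# Cutoff D_(U_B) from Remark 4.10 for a g x 1 parameter and nominal level 1 - alpha.
prcutoff <- function(d2, g, alpha = 0.05){
  B <- length(d2)
  if(alpha > 0.1) qB <- min(1 - alpha + 0.05, 1 - alpha + g/B)
  else            qB <- min(1 - alpha/2, 1 - alpha + 10*alpha*g/B)
  if(1 - alpha < 0.999 && qB < 1 - alpha + 0.001) qB <- 1 - alpha
  cc <- ceiling(B * qB)
  sqrt(sort(d2)[cc])       # D_(U_B): the 100 qB-th percentile of the D_i
}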

Note that the percentile method makes an interval that contains $U_B$ of the scalar valued $T^*_i$. The prediction region method makes a hyperellipsoid that contains $U_B$ of the $g \times 1$ vectors $T^*_i = w_i$, and equivalently, makes an interval $[0, D_{(U_B)}]$ that contains $U_B$ of the $D_i$.

When $g = 1$, a hyperellipsoid is an interval. Suppose the parameter of interest is $\theta$, and there is a bootstrap sample $T^*_1, ..., T^*_B$. Let $a_i = |T^*_i - \overline{T}^*|$. Let $\overline{T}^*$ and $S^{*2}_T$ be the sample mean and variance of the $T^*_i$. Then the squared Mahalanobis distance $D^2_\theta = (\theta - \overline{T}^*)^2/S^{*2}_T \leq D^2_{(U_B)}$ is equivalent to $\theta \in [\overline{T}^* - S^*_T D_{(U_B)}, \overline{T}^* + S^*_T D_{(U_B)}] = [\overline{T}^* - a_{(U_B)}, \overline{T}^* + a_{(U_B)}]$, which is an interval centered at $\overline{T}^*$ just long enough to cover $U_B$ of the $T^*_i$. Hence the prediction region method is a special case of the percentile method if $g = 1$. Note that when $g = 1$, $S^*_T$ and $D_{(U_B)}$ do not need to be computed. Also see the discussion under Definition 4.6.

Bootstrapping test statistics is well known, and the prediction region method is a special case of this technique using $D^2_i = D^2_i(\overline{T}^*, S_T^*)$ as the test statistic. See Bickel and Ren (2001).

Example 4.7, Bootstrapping Multiple Linear Regression with the Prediction Region Method. Following Example 4.5, we suggest using the residual bootstrap. If the $z_i = (Y_i, x_i^T)^T$ are iid observations from some population, then a sample of size $n$ can be drawn with replacement from $z_1, ..., z_n$. Then the response and predictor variables can be formed into the vector $Y^*_1$ and design matrix $X^*_1$. Then $Y^*_1$ is regressed on $X^*_1$, resulting in the estimator $\hat{\beta}^*_1$. This process is repeated $B$ times, resulting in the estimators $\hat{\beta}^*_1, ..., \hat{\beta}^*_B$. If the $z_i$ are the rows of a matrix $Z$, then this nonparametric bootstrap uses the empirical distribution of the $z_i$.

Consider testing $H_0: A\beta = \theta_0$ where $A$ is a $g \times p$ matrix with full rank $g$ and $\theta = A\beta$. To use the prediction region method to perform the test, suppose a bootstrap sample $\hat{\beta}^*_1, ..., \hat{\beta}^*_B$ has been generated. Form the prediction region (4.14) for $w_1 = A\hat{\beta}^*_1 - \theta_0, ..., w_B = A\hat{\beta}^*_B - \theta_0$. If $0$ is in the prediction region, fail to reject $H_0$; otherwise reject $H_0$.

Following Seber and Lee (2003, p. 100), the classical test statistic for testing $H_0$ is
$$F_R = \frac{(A\hat{\beta} - \theta_0)^T [MSE\, A(X^TX)^{-1}A^T]^{-1}(A\hat{\beta} - \theta_0)}{g},$$
and when $H_0$ is true, $gF_R \stackrel{D}{\rightarrow} \chi^2_g$ for a large class of error distributions. The sample covariance matrix $S_w$ of the $w_i$ is estimating $\frac{n-p}{n}MSE\, A(X^TX)^{-1}A^T$, and $\overline{w} \approx 0$ when $H_0$ is true. Thus under $H_0$, the squared distance
$$D^2_i = (w_i - \overline{w})^T S_w^{-1}(w_i - \overline{w}) \approx \frac{n}{n-p}(A\hat{\beta}^*_i - \theta_0)^T [MSE\, A(X^TX)^{-1}A^T]^{-1}(A\hat{\beta}^*_i - \theta_0),$$
and we expect $D^2_{(U_B)} \approx \frac{n}{n-p}\chi^2_{g,1-\delta}$ for large $n$ and $B$, and small $p$.
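A minimal sketch of this test in R, assuming betas is a B x p matrix of residual bootstrap coefficient vectors (as in the sketch after Example 4.5) and reusing the hypothetical prcutoff helper above; the A and theta0 shown are only illustrative, and the text's predreg function automates the same steps.

# Prediction region method test of H0: A beta = theta0.
A <- rbind(c(0, 1, 0, 0), c(0, 0, 1, 0))   # hypothetical A for p = 4, g = 2
theta0 <- c(0, 0)
w <- sweep(betas %*% t(A), 2, theta0)      # w_i = A beta*_i - theta0
wbar <- colMeans(w); Sw <- cov(w)
d2 <- mahalanobis(w, wbar, Sw)             # squared distances D^2_i
cut <- prcutoff(d2, g = length(theta0))    # cutoff D_(U_B)
D0 <- sqrt(mahalanobis(rep(0, length(theta0)), wbar, Sw))
D0 > cut                                   # TRUE means reject H0 (0 is outside the region)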

4.3.4 Theory for the Prediction Region Method

When the bootstrap is used, a large sample $100(1-\delta)\%$ confidence region for a $g \times 1$ parameter vector $\theta$ is a set $A_{n,B}$ such that $P(\theta \in A_{n,B}) \to 1-\delta$ as $n, B \to \infty$. Assume $nS^*_T \stackrel{P}{\rightarrow} \Sigma_A$ as $n, B \to \infty$ where $\Sigma_A$ and $S^*_T$ are nonsingular $g \times g$ matrices, and $T_n$ is an estimator of $\theta$ such that
$$\sqrt{n}(T_n - \theta) \stackrel{D}{\rightarrow} u \quad (4.18)$$
as $n \to \infty$. Then
$$\sqrt{n}\,\Sigma_A^{-1/2}(T_n - \theta) \stackrel{D}{\rightarrow} \Sigma_A^{-1/2}u = z, \qquad n(T_n - \theta)^T \hat{\Sigma}_A^{-1}(T_n - \theta) \stackrel{D}{\rightarrow} z^Tz = D^2$$
as $n \to \infty$ where $\hat{\Sigma}_A$ is a consistent estimator of $\Sigma_A$, and
$$(T_n - \theta)^T [S^*_T]^{-1}(T_n - \theta) \stackrel{D}{\rightarrow} D^2 \quad (4.19)$$
as $n, B \to \infty$. Assume the cumulative distribution function (cdf) of $D^2$ is continuous and increasing in a neighborhood of $D^2_{1-\delta}$ where $P(D^2 \leq D^2_{1-\delta}) = 1-\delta$. If the distribution of $D^2$ is known, then a common bootstrap large sample $100(1-\delta)\%$ confidence region for $\theta$ is $\{w : D^2_w(T_n, S^*_T) \leq D^2_{1-\delta}\}$
$$= \{w : (w - T_n)^T [S^*_T]^{-1}(w - T_n) \leq D^2_{1-\delta}\}. \quad (4.20)$$
Often by a central limit theorem or the multivariate delta method, $\sqrt{n}(T_n - \theta) \stackrel{D}{\rightarrow} N_g(0, \Sigma_A)$ and $D^2 \sim \chi^2_g$. Note that $[S^*_T]^{-1}$ could be replaced by $n\hat{\Sigma}_A^{-1}$. Machado and Parente (2005) provide sufficient conditions and references for when $nS^*_T$ is a consistent estimator of $\Sigma_A$.

The prediction region method large sample $100(1-\delta)\%$ confidence region for $\theta$ is $\{w : (w - \overline{T}^*)^T [S^*_T]^{-1}(w - \overline{T}^*) \leq D^2_{(U_B)}\}$
$$= \{w : D^2_w(\overline{T}^*, S^*_T) \leq D^2_{(U_B)}\} \quad (4.21)$$

where $D^2_{(U_B)}$ is computed from $D^2_i = (T^*_i - \overline{T}^*)^T [S^*_T]^{-1}(T^*_i - \overline{T}^*)$ for $i = 1, ..., B$. Note that the corresponding test for $H_0: \theta = \theta_0$ rejects $H_0$ if $(\overline{T}^* - \theta_0)^T [S^*_T]^{-1}(\overline{T}^* - \theta_0) > D^2_{(U_B)}$. This procedure is basically the one sample Hotelling's $T^2$ test applied to the $T^*_i$, using $S^*_T$ as the estimated covariance matrix and replacing the $\chi^2_{g,1-\delta}$ cutoff by $D^2_{(U_B)}$.

The prediction region method for testing $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$ is simple. Let $\hat{\theta}$ be a consistent estimator of $\theta$ and make a bootstrap sample $w_i = \hat{\theta}^*_i - \theta_0$ for $i = 1, ..., B$. Make the region (4.21) for the $w_i$ and fail to reject $H_0$ if $0$ is in the confidence region (if $D_0 \leq D_{(U_B)}$); reject $H_0$ otherwise.

Bickel and Ren (2001) use $n\hat{\Sigma}_A^{-1}$ instead of $[S^*_T]^{-1}$, and replace the $D^2$ cutoff in (4.20) by $D^2_{(k_B)}$ where $D^2_{(k_B)}$ is computed from $D^2_i = n(T^*_i - T_n)^T \hat{\Sigma}_A^{-1}(T^*_i - T_n)$ for $i = 1, ..., B$. Note that $k_B = \lceil B(1-\delta) \rceil$. See Equation (4.2). Let $nS^*_T = \hat{\Sigma}_A$. The modified Bickel and Ren (2001) large sample $100(1-\delta)\%$ confidence region for $\theta$ is $\{w : (w - T_n)^T [S^*_T]^{-1}(w - T_n) \leq D^2_{(U_B,T)}\}$
$$= \{w : D^2_w(T_n, S^*_T) \leq D^2_{(U_B,T)}\} \quad (4.22)$$
where $D^2_{(U_B,T)}$ is computed from $D^2_i = (T^*_i - T_n)^T [S^*_T]^{-1}(T^*_i - T_n)$.

Shift region (4.21) to have center $T_n$, or equivalently, change the cutoff of region (4.22) to $D^2_{(U_B)}$, to get the hybrid large sample $100(1-\delta)\%$ confidence region for $\theta$: $\{w : (w - T_n)^T [S^*_T]^{-1}(w - T_n) \leq D^2_{(U_B)}\}$
$$= \{w : D^2_w(T_n, S^*_T) \leq D^2_{(U_B)}\}. \quad (4.23)$$
Hyperellipsoids (4.21) and (4.23) have the same volume since they are the same region shifted to have a different center. The ratio of the volumes of regions (4.21) and (4.22) is

$$\left(\frac{D_{(U_B)}}{D_{(U_B,T)}}\right)^g. \quad (4.24)$$

Given (4.18) and (4.19), a sufficient condition for (4.22) to be a confidence region is
$$\sqrt{n}(T^*_i - T_n) \stackrel{D}{\rightarrow} u, \quad (4.25)$$
while sufficient conditions for (4.21) to be a confidence region are
$$\sqrt{n}(T^*_i - \overline{T}^*) \stackrel{D}{\rightarrow} u \quad (4.26)$$
and
$$\sqrt{n}(\overline{T}^* - \theta) \stackrel{D}{\rightarrow} u. \quad (4.27)$$
Note that (4.26) and (4.27) follow from (4.25) and (4.18) if $\sqrt{n}(T_n - \overline{T}^*) \stackrel{P}{\rightarrow} 0$, so $T_n - \overline{T}^* = o_P(n^{-1/2})$.

Following Bickel and Ren (2001), let the vector of parameters $\theta = T(F)$, the statistic $T_n = T(F_n)$, and the bootstrapped statistic $T^* = T(F^*_n)$ where $F^*_n$ is the empirical cdf of $w^*_1, ..., w^*_n$, a sample from $F_n$ using the nonparametric bootstrap. If $\sqrt{n}(F_n - F) \stackrel{D}{\rightarrow} z_F$, a Gaussian random process, and if $T$ is sufficiently smooth (Hadamard differentiable with a well behaved Hadamard derivative $\dot{T}(F)$), then (4.18) and (4.25) hold with $u = \dot{T}(F)z_F$. Note that $F_n$ is a perfectly good cdf "$F$" and $F^*_n$ is a perfectly good empirical cdf from $F_n =$ "$F$." Thus if $n$ is fixed, and a sample of size $m$ is drawn with replacement from the empirical distribution, then $\sqrt{m}(T(F^*_m) - T_n) \stackrel{D}{\rightarrow} \dot{T}(F_n)z_{F_n}$. Now let $n \to \infty$ with $m = n$. Then bootstrap theory gives $\sqrt{n}(T^*_i - T_n) \stackrel{D}{\rightarrow} \lim_{n\to\infty} \dot{T}(F_n)z_{F_n} = \dot{T}(F)z_F \sim u$.

If i) $\sqrt{n}(T_n - \theta) \stackrel{D}{\rightarrow} u$, then under regularity conditions, ii) $\sqrt{n}(T^*_i - T_n) \stackrel{D}{\rightarrow} u$, iii) $\sqrt{n}(\overline{T}^* - \theta) \stackrel{D}{\rightarrow} u$, iv) $\sqrt{n}(T^*_i - \overline{T}^*) \stackrel{D}{\rightarrow} u$, and v) $nS^*_T \stackrel{P}{\rightarrow} Cov(u)$. Use $z_n \sim AN_g(\mu_n, \Sigma_n)$ to indicate that an asymptotic multivariate normal approximation is used: $z_n \approx N_g(\mu_n, \Sigma_n)$.

To justify the prediction region method, assume i) and ii) hold with $E(u) = 0$ and $Cov(u) = \Sigma_u$. With respect to the bootstrap sample, $T_n$ is a constant and the $\sqrt{n}(T^*_i - T_n)$ are iid for $i = 1, ..., B$. Let $\sqrt{n}(T^*_i - T_n) \stackrel{D}{\rightarrow} v_i \sim u$ where the $v_i$ are iid with the same distribution as $u$. Fix $B$. Then, similar to Example 1.19, the average of the $\sqrt{n}(T^*_i - T_n)$ is
$$\sqrt{n}(\overline{T}^* - T_n) \stackrel{D}{\rightarrow} \frac{1}{B}\sum_{i=1}^B v_i \sim AN_g\left(0, \frac{\Sigma_u}{B}\right).$$
Hence as $B \to \infty$, $\sqrt{n}(\overline{T}^* - T_n) \stackrel{P}{\rightarrow} 0$, and iii) and iv) hold. If $B$ is fixed and $u \sim N_g(0, \Sigma_u)$, then

$$\frac{1}{B}\sum_{i=1}^B v_i \sim N_g\left(0, \frac{\Sigma_u}{B}\right) \quad \text{and} \quad \sqrt{B}\sqrt{n}(\overline{T}^* - T_n) \stackrel{D}{\rightarrow} N_g(0, \Sigma_u).$$
Hence the prediction region method gives a large sample confidence region for $\theta$ provided that the sample percentile $D^2_{1-\delta}$ of the $D^2_{T^*_i}(\overline{T}^*, S^*_T) = \sqrt{n}(T^*_i - \overline{T}^*)^T(nS^*_T)^{-1}\sqrt{n}(T^*_i - \overline{T}^*)$ is a consistent estimator of the percentile $D^2_{n,1-\delta}$ of the random variable $D^2_\theta(\overline{T}^*, S^*_T) = \sqrt{n}(\theta - \overline{T}^*)^T(nS^*_T)^{-1}\sqrt{n}(\theta - \overline{T}^*)$ in that $D^2_{1-\delta} - D^2_{n,1-\delta} \stackrel{P}{\rightarrow} 0$. Since iii) and iv) hold, the sample percentile will be consistent under much weaker conditions than v) if $\Sigma_u$ is nonsingular. See Remark 4.11 below. Olive (2017b: §5.3.3, 2017c) proved that the prediction region method gives a large sample confidence region under the much stronger conditions of v) and $u \sim N_g(0, \Sigma_u)$, but the above proof is simpler.

Following Efron (2014), $\overline{T}^*$ is the bagging or smoothed bootstrap estimator of $\mu$, which often outperforms $T_n$ for inference. See Bühlmann and Yu (2002) and Friedman and Hall (2007) for theory and references for the bagging estimator.

Remark 4.11. Under reasonable conditions, i) $\sqrt{n}(T_n - \theta) \stackrel{D}{\rightarrow} u$, ii) $\sqrt{n}(T^*_i - T_n) \stackrel{D}{\rightarrow} u$, iii) $\sqrt{n}(\overline{T}^* - \theta) \stackrel{D}{\rightarrow} u$, and iv) $\sqrt{n}(T^*_i - \overline{T}^*) \stackrel{D}{\rightarrow} u$. Suppose $(nS^*_T)^{-1}$ is "not too ill conditioned." Then
$$D^2_1 = D^2_{T^*_i}(\overline{T}^*, S^*_T) = \sqrt{n}(T^*_i - \overline{T}^*)^T(nS^*_T)^{-1}\sqrt{n}(T^*_i - \overline{T}^*),$$
$$D^2_2 = D^2_\theta(T_n, S^*_T) = \sqrt{n}(T_n - \theta)^T(nS^*_T)^{-1}\sqrt{n}(T_n - \theta),$$
$$D^2_3 = D^2_\theta(\overline{T}^*, S^*_T) = \sqrt{n}(\overline{T}^* - \theta)^T(nS^*_T)^{-1}\sqrt{n}(\overline{T}^* - \theta), \quad \text{and}$$
$$D^2_4 = D^2_{T^*_i}(T_n, S^*_T) = \sqrt{n}(T^*_i - T_n)^T(nS^*_T)^{-1}\sqrt{n}(T^*_i - T_n)$$
are well behaved. If $(nS^*_T)^{-1} \stackrel{P}{\rightarrow} \Sigma_A^{-1}$, then $D^2_j \stackrel{D}{\rightarrow} D^2 = u^T\Sigma_A^{-1}u$. If $(nS^*_T)^{-1}$ is "not too ill conditioned," then $D^2_j \approx u^T(nS^*_T)^{-1}u$ for large $n$, and the prediction regions (4.21), (4.22), and (4.23) will have coverage near $1-\delta$. The regularity conditions for the prediction region method are weaker when $g = 1$, since $S^*_T$ does not need to be computed.

Fig. 4.3 Confidence Regions for 2 Statistics with MVN Distributions

The prediction region method will often simulate well even if $B$ is rather small. Figure 4.3 shows 10%, 30%, 50%, 70%, 90%, and 98% prediction regions for a future value of $T_f$ for two multivariate normal statistics. The plotted points are iid $T_1, ..., T_B$. If the $T^*_i$ are iid from the bootstrap distribution, then $Cov(\overline{T}^*) \approx Cov(T)/B \approx \Sigma_A/(nB)$. Consider the 90% region. Suppose many iid samples are generated to produce $\overline{T}^*$. By Theorem 4.2, if $\overline{T}^*$ is in the 90% prediction region with probability near 90%, then the confidence region should give simulated coverage near 90%, and the volume of the confidence region should be near that of the 90% prediction region. If $B = 100$, then $\overline{T}^*$ falls in a covering region of the same shape as the prediction region, but centered near $T_n$ and with the lengths of the axes divided by $\sqrt{B}$. Hence if $B = 100$, then the axes lengths are about one tenth of those in Figure 4.3. Hence when $T_n$ falls within the 70% prediction region, the probability that $\overline{T}^*$ falls in the 90% prediction region is near one. If $T_n$ is just within or just without the boundary of the 90% prediction region, $\overline{T}^*$ tends to be just within or just without the 90% prediction region. Hence the coverage and volume of the prediction region confidence region are near the nominal coverage 90% and near the volume of the 90% prediction region.

Hence $B$ does not need to be large provided that $n$ and $B$ are large enough so that $S^*_T \approx Cov(T^*) \approx \Sigma_A/n$. If $n$ is large, the sample covariance matrix starts to be a good estimator of the population covariance matrix when $B \geq Jg$ where $J = 20$ or 50. For small $g$, using $B = 1000$ often led to good simulations, but $B = \max(50g, 100)$ may work well.

Often $D^2$ is unknown, and we use $D^2_{(U_B)}$ to estimate $D^2_{1-\delta}$ instead of assuming $D^2 \sim \chi^2_g$. For $g = 1$, Efron (2014) uses confidence intervals $\overline{T}^* \pm z_{1-\delta}SE(\overline{T}^*)$ where $P(Z \leq z_{1-\delta}) = 1-\delta$ if $Z \sim N(0,1)$. Efron uses a delta method estimate of $SE(\overline{T}^*)$ to avoid using the computationally expensive double bootstrap, and assumes that $\overline{T}^*$ is asymptotically normal. The prediction region method, $\overline{T}^* \pm S^*_T D_{(U_B)}$, estimates the cutoff using quantiles of the Mahalanobis distances of the $T^*_{i,n}$ from $\overline{T}^*$. The shorth(c) estimator is recommended since it can be much shorter.

Remark 4.12. Remark 4.9 suggests that even if the statistic $T_n$ is asymptotically normal, so the Mahalanobis distances are asymptotically $\chi^2_g$, the prediction region method can give better results for moderate $n$ by using the cutoff $D^2_{(U_B)}$ instead of the cutoff $\chi^2_{g,1-\delta}$. Theorem 4.2 says that the hyperellipsoidal prediction and confidence regions have exactly the same volume. We compensate for the prediction region undercoverage when $n$ is moderate by using $D^2_{(U_n)}$. If $n$ is large, by using $D^2_{(U_B)}$, the prediction region method confidence region compensates for undercoverage when $B$ is moderate, say $B \geq Jg$ where $J = 20$ or 50. See Remark 4.10. This result can be useful if a simulation with $B = 1000$ or $B = 10000$ is much slower than a simulation with $B = Jg$. The price to pay is that the prediction region method confidence region is inflated to have better coverage, so the power of the hypothesis test is decreased if moderate $B$ is used instead of larger $B$.

Software. The prediction region method will be used several times in the text, sometimes as an "exploratory test." The slpack function predrgn makes the nonparametric prediction region and determines whether $x_f$ is in the region. The function predreg also makes the nonparametric prediction region, and determines if $0$ is in the region. For multiple linear regression, the function regboot does the residual bootstrap for multiple linear regression, regbootsim simulates the residual bootstrap for regression, and the function rowboot does the empirical nonparametric bootstrap. The function vsbootsim simulates the bootstrap for all subsets variable selection, so it needs $p$ small, while vsbootsim2 simulates the prediction region method for forward selection. The functions fselboot and vselboot bootstrap the forward selection and all subsets variable selection estimators that minimize $C_p$. See Examples 4.8 and 4.9. The shorth3 function computes the shorth(c) intervals with the Frey (2013) correction used when $g = 1$.

4.3.5 Bootstrapping Variable Selection

Variable selection, also called subset or model selection, is the search for a subset of predictor variables that can be deleted without important loss of information. A model for variable selection in multiple linear regression can be described by $Y = x^T\beta + e = \beta^Tx + e = x_S^T\beta_S + e$ where
$$x^T\beta = x_S^T\beta_S + x_E^T\beta_E = x_S^T\beta_S, \quad (4.28)$$
$e$ is an error, $Y$ is the response variable, $x = (x_S^T, x_E^T)^T$ is a $p \times 1$ vector of predictors, $x_S$ is an $a_S \times 1$ vector, and $x_E$ is a $(p - a_S) \times 1$ vector. Given that $x_S$ is in the model, $\beta_E = 0$, and $E$ denotes the subset of terms that can be eliminated given that the subset $S$ is in the model.

Since $S$ is unknown, candidate subsets will be examined. Let $x_I$ be the vector of $k$ terms from a candidate subset indexed by $I$, and let $x_O$ be the vector of the remaining predictors (out of the candidate submodel). Then
$$Y = x_I^T\beta_I + x_O^T\beta_O + e. \quad (4.29)$$
Suppose that $S$ is a subset of $I$ and that model (4.28) holds. Then
$$x^T\beta = x_S^T\beta_S = x_S^T\beta_S + x_{I/S}^T\beta_{(I/S)} + x_O^T 0 = x_I^T\beta_I \quad (4.30)$$
where $x_{I/S}$ denotes the predictors in $I$ that are not in $S$.

The model $Y = x^T\beta + e$ that uses all of the predictors is called the full model. A model $Y = x_I^T\beta_I + e$ that only uses a subset $x_I$ of the predictors is called a submodel, and the full model is a submodel. Criteria such as $C_p(I)$ and $AIC(I)$ are often used to select a subset. See Olive and Hawkins (2005) and Olive (2010: section 3.4, 2017a: section 3.4).

Suppose model $I$ is selected after variable selection. Then least squares output for the model $Y = X_I\beta_I + e$ can be obtained, but the least squares output is not correct for inference. In particular, $MSE(I)(X_I^TX_I)^{-1}$ is not the correct estimated covariance matrix of $\hat{\beta}_I$. The selected model tends to fit the data too well, so $SE(\hat{\beta}_i)$ from the incorrect estimated covariance matrix tends to be too small. Hence the confidence intervals for $\beta_i$ are too short, and hypothesis tests reject $H_0: \beta_i = 0$ too often. For MSE and $C_p$, see Definition 4.5.

Hastie et al. (2009, p. 57) note that variable selection is a shrinkage estimator: the coefficients are shrunk to 0 for the omitted variables. Suppose $n \geq 10p$. If $\hat{\beta}_I$ is $a \times 1$, form $\hat{\beta}_{I,0}$ from $\hat{\beta}_I$ by adding 0s corresponding to the omitted variables. Then $\hat{\beta}_{I,0}$ is a nonlinear estimator of $\beta$, and the residual bootstrap method can be applied. For example, suppose $\hat{\beta} = \hat{\beta}_{I_{min},0}$ is formed from the model $I_{min}$ that minimizes $C_p$ from some variable selection method such as forward selection, backward elimination, stepwise selection, or all subsets variable selection. Instead of computing the least squares estimator from regressing $Y^*_i$ on $X$, perform variable selection on $Y^*_i$ and $X$, fit the model that minimizes the criterion, and add 0s corresponding to the omitted variables, resulting in estimators $\hat{\beta}^*_1, ..., \hat{\beta}^*_B$ where $\hat{\beta}^*_i = \hat{\beta}^*_{I_{min},0,i}$, as in the sketch below.
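A rough R sketch of this residual bootstrap for min-$C_p$ all subsets variable selection with zero padding, assuming leaps::regsubsets and a predictor matrix x without the intercept column; the function name vsboot is hypothetical, while the text's vselboot and fselboot functions do this job.

# Residual bootstrap for min-Cp variable selection with zero padding: a rough sketch.
library(leaps)
vsboot <- function(x, y, B = 1000){
  x <- as.matrix(x); n <- length(y); p <- ncol(x) + 1
  full <- lsfit(x, y)
  res <- full$residuals; fit <- y - res
  betas <- matrix(0, nrow = B, ncol = p)        # rows are beta*_{Imin,0}
  for(j in 1:B){
    ystar <- fit + sample(res, n, replace = TRUE)
    out <- summary(regsubsets(x, ystar, nvmax = p - 1))
    imin <- which.min(out$cp)                   # model Imin minimizing Cp
    keep <- out$which[imin, -1]                 # selected nontrivial predictors
    if(sum(keep) == 0){ betas[j, 1] <- mean(ystar); next }
    sub <- lsfit(x[, keep, drop = FALSE], ystar)
    betas[j, c(TRUE, keep)] <- sub$coefficients # pad omitted coefficients with 0
  }
  betas
}

For forward selection, the call regsubsets(x, ystar, method = "forward", nvmax = p - 1) could be used instead, which is roughly what the text's fselboot function does.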

Suppose the variable selection method, such as forward selection or all subsets, produces $K$ models. Let model $I_{min}$ be the model that minimizes the criterion, e.g. $C_p(I)$ or $AIC(I)$. Following Seber and Lee (2003, p. 448), Nishii (1984), and Shao (1993), the probability that model $I_{min}$ from $C_p$ or AIC underfits goes to zero as $n \to \infty$. Hence $P(S \subseteq I_{min}) \to 1$ as $n \to \infty$. This result holds for all subsets regression and variable selection methods, such as forward selection and backward elimination, that produce a sequence of nested models including the full model. Since there are a finite number of regression models $I$ that contain the true model, and each model gives a consistent estimator $\hat{\beta}_{I,0}$ of $\beta$, the probability that $I_{min}$ picks one of these models goes to one as $n \to \infty$. So $\hat{\beta}_{I_{min},0}$ is a $\sqrt{n}$ consistent estimator of $\beta$. See Remark 4.14.

Note that if $S \subseteq I$ and $Y = X_I\beta_I + e$, then $\sqrt{n}(\hat{\beta}_I - \beta_I) \stackrel{D}{\rightarrow} N_k(0, \sigma^2 V_I)$ under mild regularity conditions where $n(X_I^TX_I)^{-1} \to V_I$. Hence $\sqrt{n}(\hat{\beta}_{I,0} - \beta) \stackrel{D}{\rightarrow} N_p(0, \sigma^2 V_{I,0})$ where $V_{I,0}$ has a column and row of zeroes added for each variable not in $I$. Note that $V_{I,0}$ is singular unless $I$ corresponds to the full model. For example, if $p = 3$ and model $I$ uses a constant $x_1 \equiv 1$ and $x_3$ with
$$V_I = \begin{bmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{bmatrix}, \quad \text{then} \quad V_{I,0} = \begin{bmatrix} V_{11} & 0 & V_{12} \\ 0 & 0 & 0 \\ V_{21} & 0 & V_{22} \end{bmatrix}.$$

Theory for bootstrapping variable selection is difficult since $\hat{\beta}_{I_{min},0}$ has a mixture distribution of the $\hat{\beta}_{I_j,0}$, and $\hat{\beta}^*_{I_{min},0}$ has a mixture distribution of the $\hat{\beta}^*_{I_j,0}$. See Section 1.5. If $T_n$ has a mixture distribution of the $T_{jn}$ with probabilities $\pi_{jn}$, then theory for the prediction region method of Section 4.3.4 no longer holds since $T_n$ is no longer smooth. We will have $\sqrt{n}(T_n - \theta) \stackrel{D}{\rightarrow} u$, but $\sqrt{n}(\overline{T}^* - \theta) \stackrel{D}{\rightarrow} w$, where in general $u$ and $w$ do not have the same distribution.

Now suppose that $T_n$ is equal to the estimator $T_{jn}$ with probability $\pi_{jn}$ for $j = 1, ..., J$ where $\sum_j \pi_{jn} = 1$, $\pi_{jn} \to \pi_j$ as $n \to \infty$, and $\sqrt{n}(T_{jn} - \theta) \stackrel{D}{\rightarrow} u_j$ with $E(u_j) = 0$ and $Cov(u_j) = \Sigma_j$. Then the cumulative distribution function (cdf) of $T_n$ is $F_{T_n}(z) = \sum_j \pi_{jn}F_{T_{jn}}(z)$ where $F_{T_{jn}}(z)$ is the cdf of $T_{jn}$. Hence $\sqrt{n}(T_n - \theta)$ has a mixture distribution of the $\sqrt{n}(T_{jn} - \theta)$, and
$$\sqrt{n}(T_n - \theta) \stackrel{D}{\rightarrow} u \quad (4.31)$$
where the cdf of $u$ is $F_u(z) = \sum_j \pi_jF_{u_j}(z)$ and $F_{u_j}(z)$ is the cdf of $u_j$. Thus $u$ has a mixture distribution of the $u_j$ with probabilities $\pi_j$, $E(u) = 0$, and $Cov(u) = \Sigma_u = \sum_j \pi_j\Sigma_j$. See Theorem 1.27.

For the bootstrap, suppose that $T^*_i$ is equal to $T^*_{ij}$ with probability $\rho_{jn}$ for $j = 1, ..., J$ where $\sum_j \rho_{jn} = 1$ and $\rho_{jn} \to \pi_j$ as $n \to \infty$. Let $B_{jn}$ count the number of times $T^*_i = T^*_{ij}$ in the bootstrap sample. Then the bootstrap sample $T^*_1, ..., T^*_B$ can be written as
$$T^*_{1,1}, ..., T^*_{B_{1n},1}, ..., T^*_{1,J}, ..., T^*_{B_{Jn},J}$$
where the $B_{jn}$ follow a multinomial distribution and $B_{jn}/B \stackrel{P}{\rightarrow} \rho_{jn}$ as $B \to \infty$. Conditionally on the $B_{jn}$ and with respect to the bootstrap sample, the $T^*_{ij}$ are independent. Denote $T^*_{1j}, ..., T^*_{B_{jn},j}$ as the $j$th bootstrap component of the bootstrap sample with sample mean $\overline{T}^*_j$ and sample covariance matrix $S^*_{T,j}$. Then
$$\overline{T}^* = \frac{1}{B}\sum_{i=1}^B T^*_i = \sum_j \frac{B_{jn}}{B}\,\frac{1}{B_{jn}}\sum_{i=1}^{B_{jn}} T^*_{ij} = \sum_j \frac{B_{jn}}{B}\,\overline{T}^*_j.$$
Note that $T^*_i = \sum_j I(T^*_i = T^*_{ij})T^*_{ij}$ where the indicator is one or zero and $\sum_j I(T^*_i = T^*_{ij}) = 1$ since exactly one indicator will equal one.

Consider an iid sample of size $B$ of the statistic $T_1, ..., T_B$ with sample mean $\overline{T}$. Here $B$ is a constant such as $B = \max(1000, n)$. Let $T_{in} = T_{ijn}$ if $T_{jn}$ is chosen $D_{jn}$ times, where the random variables $D_{jn}/B \stackrel{P}{\rightarrow} \pi_{jn}$. The $D_{jn}$ follow a multinomial distribution. Then the iid sample can be written as
$$T_{1,1}, ..., T_{D_{1n},1}, ..., T_{1,J}, ..., T_{D_{Jn},J},$$
where the $T_{ij}$ are not iid. Denote $T_{1j}, ..., T_{D_{jn},j}$ as the $j$th component of the iid sample with sample mean $\overline{T}_j$ and sample covariance matrix $S_{T,j}$. Thus
$$\overline{T} = \frac{1}{B}\sum_{i=1}^B T_{ijn} = \sum_j \frac{D_{jn}}{B}\,\frac{1}{D_{jn}}\sum_{i=1}^{D_{jn}} T_{ij} = \sum_j \frac{D_{jn}}{B}\,\overline{T}_j.$$

Hence $\overline{T}$ is a random linear combination of the $\overline{T}_j$. Conditionally on the $D_{jn}$, the $T_{ij}$ are independent, and $\overline{T}$ is a linear combination of the $\overline{T}_j$. Note that $Cov(\overline{T}) = Cov(T_n)/B$, and that $T_n = \sum_j I(T_n = T_{jn})T_{jn}$ where the indicator is one or zero and $\sum_j I(T_n = T_{jn}) = 1$ since exactly one indicator will equal one.

To examine the bagging estimator, assume that each bootstrap component satisfies vi) $\sqrt{n}(T_{jn} - \theta) \stackrel{D}{\rightarrow} u_j \sim N_g(0, \Sigma_j)$, vii) $\sqrt{n}(T^*_{ij} - T_{jn}) \stackrel{D}{\rightarrow} u_j$, viii) $\sqrt{n}(\overline{T}^*_j - \theta) \stackrel{D}{\rightarrow} u_j$, ix) $\sqrt{n}(T^*_{ij} - \overline{T}^*_j) \stackrel{D}{\rightarrow} u_j$, x) $nS^*_{T,j} \stackrel{P}{\rightarrow} \Sigma_j$, and xi) $\sqrt{n}(T_{jn} - \overline{T}^*_j) \stackrel{P}{\rightarrow} 0$ as $B_{jn} \to \infty$ and $n \to \infty$. (For the nonparametric bootstrap, cases are sampled with replacement, and the above conditions hold if each component bootstraps correctly.)

Consider the random vectors
$$Z_n = \sum_j \frac{B_{jn}}{B}T_{jn} \quad \text{and} \quad W_n = \sum_j \rho_{jn}T_{jn}.$$
By xi),
$$\sqrt{n}(Z_n - \overline{T}^*) = \sqrt{n}\left(\sum_j \frac{B_{jn}}{B}T_{jn} - \overline{T}^*\right) = \sum_j \frac{B_{jn}}{B}\sqrt{n}(T_{jn} - \overline{T}^*_j) \stackrel{P}{\rightarrow} 0.$$
Also,
$$\sqrt{n}(Z_n - \theta) - \sqrt{n}(W_n - \theta) = \sum_j \left(\frac{B_{jn}}{B} - \rho_{jn}\right)\sqrt{n}(T_{jn} - \theta) = \sum_j O_P(1)O_P(n^{-1/2}) \stackrel{P}{\rightarrow} 0.$$
Assume the $u_{nj} = \sqrt{n}(T_{jn} - \theta) \stackrel{D}{\rightarrow} u_j$ are such that
$$\sqrt{n}(W_n - \theta) = \sum_j \rho_{jn}\sqrt{n}(T_{jn} - \theta) \stackrel{D}{\rightarrow} w = \sum_j \pi_ju_j.$$
Note that $E(w) = 0$ and $Cov(w) = \Sigma_w = \sum_j\sum_k \pi_j\pi_kCov(u_j, u_k)$. Hence
$$\sqrt{n}(\overline{T}^* - \theta) \stackrel{D}{\rightarrow} w. \quad (4.32)$$
Since $w$ is a weighted mean of the $u_j \sim N_g(0, \Sigma_j)$, a normal approximation is $w \approx N_g(0, \Sigma_w)$. The approximation is exact if the $u_j$ with positive $\pi_j$ have a joint multivariate normal distribution.

Now consider variable selection for model (4.28) with $\theta = A\beta$ where $A$ is a known full rank $g \times p$ matrix with $1 \leq g \leq p$. Olive (2017a: p. 128, 2018) showed that the prediction region method can simulate well for the $p \times 1$ vector $\hat{\beta}_{I_{min},0}$. Assume $p$ is fixed, $n \geq 20p$, and that the error distribution is unimodal and not highly skewed. The response plot and residual plot are plots with $\hat{Y} = x^T\hat{\beta}$ on the horizontal axis and $Y$ or $r$ on the vertical axis, respectively. Then the plotted points in these plots should scatter in roughly even bands about the identity line (with unit slope and zero intercept) and the $r = 0$ line, respectively. See Figure 3.1. If the error distribution is skewed or multimodal, then much larger sample sizes may be needed.

For the residual bootstrap, we use the fitted values and residuals from the OLS full model, but fit $\beta$ with forward selection, where each component uses a $\hat{\beta}_{I_j}$. As in Example 4.5, $Y^* = X\hat{\beta}_{OLS} + r^W$ follows a standard linear model with $E(r^W) = 0$ and $Cov(Y^*) = Cov(r^W) = \sigma^2_nI_n$ where $\sigma^2_n = \frac{1}{n}\sum_{i=1}^n r_i^2 = \frac{n-p}{n}MSE$. Then $\hat{\beta}^*_{I_j} = (X_{I_j}^TX_{I_j})^{-1}X_{I_j}^TY^* = D_jY^*$ with $Cov(\hat{\beta}^*_{I_j}) = \sigma^2_n(X_{I_j}^TX_{I_j})^{-1}$ and $E(\hat{\beta}^*_{I_j}) = (X_{I_j}^TX_{I_j})^{-1}X_{I_j}^TE(Y^*) = (X_{I_j}^TX_{I_j})^{-1}X_{I_j}^THY = \hat{\beta}_{I_j}$ since $HX_{I_j} = X_{I_j}$. The expectations are with respect to the bootstrap distribution where $\hat{Y} = HY = X\hat{\beta}_{OLS}$ acts as a constant.

For the above residual bootstrap with forward selection and $C_p$, let $T_n = A\hat{\beta}_{I_{min},0}$ and $T_{jn} = A\hat{\beta}_{I_j,0} = AD_{j,0}Y$ where $D_{j,0}$ adds rows of zeroes to $D_j$ corresponding to the $x_i$ not in $I_j$. If $S \subseteq I_j$, then $\sqrt{n}(\hat{\beta}_{I_j} - \beta_{I_j}) \stackrel{D}{\rightarrow} N_{a_j}(0, \sigma^2V_j)$ and $\sqrt{n}(\hat{\beta}_{I_j,0} - \beta) \stackrel{D}{\rightarrow} u_j \sim N_p(0, \sigma^2V_{j,0})$ where $V_{j,0}$ adds columns and rows of zeroes corresponding to the $x_i$ not in $I_j$. Then under regularity conditions, (4.31) and (4.32) hold ($\sqrt{n}(T_n - \theta) \stackrel{D}{\rightarrow} u$ and $\sqrt{n}(\overline{T}^* - \theta) \stackrel{D}{\rightarrow} w$) where $\sqrt{n}(\sum_j \rho_{jn}T_{jn} - \theta) \stackrel{D}{\rightarrow} w$, and the sum is over $j: S \subseteq I_j$. Thus $E(T^*) = \sum_j \rho_{jn}A\hat{\beta}_{I_j,0}$, and $S^*_T$ is a consistent estimator of
$$Cov(T^*) = \sum_j \rho_{jn}Cov(T^*_{jn}) + \sum_j \rho_{jn}A\hat{\beta}_{I_j,0}\hat{\beta}_{I_j,0}^TA^T - E(T^*)[E(T^*)]^T$$
where asymptotically the sum is over $j: S \subseteq I_j$. If $\theta_0 = 0$, then $nS^*_T = \Sigma_A + O_P(1)$ where
$$n\,Cov(T_n) \stackrel{P}{\rightarrow} \Sigma_A = \sum_j \sigma^2\pi_jAV_{j,0}A^T.$$
Then $(nS^*_T)^{-1}$ tends to be "well behaved" if $\Sigma_A$ is nonsingular.

Now examine the geometric argument of Theorem 4.2. When $T_n$ does not have a mixture distribution, the bootstrap confidence regions approximate $R_c$. With a mixture distribution, the bootstrap sample shifts the data cloud to be centered at $\overline{T}^*$ where $\sqrt{n}(\overline{T}^* - \sum_j \rho_{jn}T_{jn}) \stackrel{P}{\rightarrow} 0$. The $T_{jn}$ are computed from the same data set and hence are correlated. Suppose $\sqrt{n}(T_n - \theta) \stackrel{D}{\rightarrow} u$, $\sqrt{n}(\overline{T}^* - \theta) \stackrel{D}{\rightarrow} w$, and $(nS^*_T)^{-1}$ is "not too ill conditioned." Then
$$D^2_1 = D^2_{T^*_i}(\overline{T}^*, S^*_T) = \sqrt{n}(T^*_i - \overline{T}^*)^T(nS^*_T)^{-1}\sqrt{n}(T^*_i - \overline{T}^*),$$
$$D^2_2 = D^2_\theta(T_n, S^*_T) = \sqrt{n}(T_n - \theta)^T(nS^*_T)^{-1}\sqrt{n}(T_n - \theta), \quad \text{and}$$
$$D^2_3 = D^2_\theta(\overline{T}^*, S^*_T) = \sqrt{n}(\overline{T}^* - \theta)^T(nS^*_T)^{-1}\sqrt{n}(\overline{T}^* - \theta)$$

are well behaved in that there exist cutoffs $D^2_{i,1-\delta}$ that would result in good confidence regions for $i = 2$ and 3.

For the residual bootstrap with forward selection, $n\,Cov(T_{jn})$ and $n\,Cov(T^*_{jn})$ both converge in probability to $\sigma^2AV_{j,0}A^T$, and
$$Cov(T^*_{jn}) = \frac{(n-p)MSE}{n\sigma^2}Cov(T_{jn}) \approx \frac{n-p}{n}Cov(T_{jn})$$
are close for $n \geq 20p$. Hence the $j$th component of an iid sample $T_1, ..., T_B$ and the $j$th component of the bootstrap sample $T^*_1, ..., T^*_B$ have the same variability asymptotically. Since $E(T_{jn}) = \theta$, each component of the iid sample is centered at $\theta$. Since $E(T^*_{jn}) = T_{jn} = A\hat{\beta}_{I_j,0}$, the bootstrap components have the same asymptotic variability as the iid sample components, but are centered at $T_{jn}$. Thus the variability of $T^*_n$ is larger than that of $T_n$ for a mixture distribution, asymptotically. Hence the prediction region applied to the bootstrap sample is slightly larger than the prediction region applied to the iid sample, asymptotically (we want $n \geq 20p$). Hence the cutoff $D^2_{2,1-\delta} = D^2_{(U_B)}$ gives coverage close to or higher than the nominal coverage for prediction regions (4.21) and (4.23), using the geometric argument. The deviation $T^*_i - T_n$ tends to be larger in magnitude than the deviations $\overline{T}^* - \theta$, $T_n - \theta$, and $T^*_i - \overline{T}^*$. Hence the cutoff $D^2_{(U_B,T)}$ tends to be larger than $D^2_{(U_B)}$, and region (4.22) tends to have higher coverage than region (4.23) for a mixture distribution. In simulations with $n \geq 20p$ and regions (4.21) and (4.23), the coverage tends to get close to $1-\delta$ for $B \geq \max(400, 50p)$ so that $S^*_T$ is a good estimator of $Cov(T^*)$.

To see that $T^*$ has more variability than $T_n$, asymptotically, look at Figure 4.3. Imagine that $n$ is huge and the $J = 6$ ellipsoids are 99.9% covering regions for the component data clouds corresponding to $T_{jn}$ for $j = 1, ..., J$. Separating the clouds slightly, without rotation, increases the variability of the overall data cloud. The bootstrap distribution of $T^*$ corresponds to the separated clouds. The shape of the overall data cloud does not change much, but the volume does increase.

Remark 4.13. The bootstrap confidence regions are expected to be effective for lasso and elastic net, provided $\lambda_1 = 0$, by Chapter 3 and Zhang (1992). To bootstrap $\overline{T}^*$, the computationally expensive double bootstrap could be used to estimate $Cov(\overline{T}^*)$.

In the simulations where $S$ is not the full model, inference with forward selection with $I_{min}$ using $C_p$ appears to be more precise than inference with the OLS full model if $n \geq 20p$ and $B \geq 50p$. Higher than nominal coverage can occur because of the zero padding. It is possible that $S^*_T$ is singular if a column of the bootstrap sample is equal to 0.

Np(0,Σd). This special case occurs for Cp if aS = p so S is the full model,and for methods like BIC that choose IS with probability going to one. Knightand Fu (2000) had similar bootstrap results for this case.

The second special case occurs if for each πj > 0, Auj ∼ Ng(0,AΣjAT ) =

Ng(0,AΣAT ). Then Au ∼ Ng(0,AΣAT ). This special case occurs for βS

if the nontrivial predictors are orthogonal or uncorrelated with zero meanso XT X/n → diag(d1, ..., dp) as n → ∞ where each di > 0. Then βS hasthe same multivariate normal limiting distribution for Imin and for the OLSfull model. The bootstrap distribution for βE is a mixture of zeros and adistribution that would produce a confidence region for AβE = 0 that hasasymptotic coverage of 0 equal to 100(1−δ)%. Hence the asymptotic coverageis greater than the nominal coverage provided that nS∗

T in nonsingular with

probability going to one (e.g., p − aS is small), where T = AβE,Imin,0. Foruncorrelated predictors with zero mean, the number of bootstrap samplesB ≥ 50p may work well for the shorth confidence intervals and for testingAβS = 0.

Remark 4.14. By the above discussion for forward selection with $C_p$, if $p$ is fixed and $n \to \infty$, then under mild conditions $\hat{\beta}_{I_{min},0}$ is a $\sqrt{n}$ consistent estimator of $\beta$ under model (4.28), and
$$\sqrt{n}(\hat{\beta}_{I_{min},0} - \beta) \stackrel{D}{\rightarrow} u \quad (4.33)$$
where the cdf
$$F_u(t) = \sum_{j=1}^J \pi_jF_{u_j}(t) \quad \text{and} \quad u_j \sim N_p(0, \sigma^2V_{j,0}),$$
$0 \leq \pi_j \leq 1$, $\sum_{j=1}^J \pi_j = 1$, and $J = 2^{p-a_S}$ is the number of subsets $I_j$ that contain $S$. Olive (2017a, p. 123) noted that $\hat{\beta}_{I_{min},0}$ is a consistent estimator of $\beta$. Note that $\sqrt{n}$ consistency also follows since, with probability tending to 1, $I_{min}$ will pick one of the $J$ models with $S \subseteq I$, and picking from a fixed number of $\sqrt{n}$ consistent estimators results in a $\sqrt{n}$ consistent estimator by Pratt (1959). See Proposition 1.10 and Theorem 1.11. With different values of $\pi_j$, (4.33) also holds for other OLS variable selection methods with $C_p$, such as backward elimination, all subsets, and, if $\lambda_1 = 0$, the variant of relaxed lasso that computes the OLS submodel for the subset corresponding to $\lambda_i$ for $i = 1, ..., M$. These other OLS variable selection methods can also be bootstrapped in the same way as forward selection.

In the simulations for forward selection, coverages did not change much as $\rho$ was increased from zero to near one, where $\rho$ was the correlation between any two nontrivial predictors. Under model (4.28), we still have that $\hat{\beta}_{I_j,0}$ is a $\sqrt{n}$ consistent asymptotically normal estimator of $\beta = (\beta_S^T, \beta_E^T)^T$ where $\beta_E = 0$. By Remark 4.14, the limiting distribution of $\sqrt{n}(\hat{\beta}_{i,I_{min},0} - \beta_i)$ is a mixture of $N(0, \sigma^2_{ij})$ distributions. For a $\beta_i$ that is a component of $\beta_S$, the symmetric mixture distribution has a pdf. Then the simulated shorth confidence intervals have coverage near the nominal coverage if $n$ and $B$ are large enough.

Let $\beta_O$ be a vector component of $\beta_E$, and consider testing $H_0: A\beta_O = 0$. If $A\hat{\beta}^*_{O,i} = 0$ for more than $B\delta$ of the bootstrap samples $i = 1, ..., B$, then the $100(1-\delta)\%$ prediction region method confidence region will contain $0$, and the test will fail to reject $H_0$.

Suppose we want to bootstrap $T = \hat{\beta}_O$, where $\beta = (\beta_I^T, \beta_O^T)^T$, and all $\hat{\beta}^*_{O,i} = 0$ for $i = 1, ..., B$. Then $S^*_T$ is singular, but the singleton set $\{0\}$ is the large sample prediction region method $100(1-\delta)\%$ confidence region for $\beta_O$ for $\delta \in (0, 1)$, and the pvalue for $H_0: \beta_O = 0$ is one. For large sample theory tests, the pvalue estimates the population pvalue. For the $I_{min}$ model from forward selection, there may be strong evidence that $x_O$ is not needed in the model given that $x_I$ is in the model if the "100%" confidence region is $\{0\}$, $n \geq 20p$, $B \geq 50p$, and the error distribution is unimodal and not highly skewed. (Since the pvalue is one, this technique may be useful for data snooping: applying OLS theory to submodel $I$ may have negligible selection bias.)

Note that there are several important variable selection models, including the model given by Equation (4.28) where $x^T\beta = x_S^T\beta_S$. Another model is $x^T\beta = x_{S_i}^T\beta_{S_i}$ for $i = 1, ..., J$. Then there are $J \geq 2$ competing "true" nonnested submodels where $\beta_{S_i}$ is $a_{S_i} \times 1$. For example, suppose the $J = 2$ models have predictors $x_1, x_2, x_3$ for $S_1$ and $x_1, x_2, x_4$ for $S_2$. Then $x_3$ and $x_4$ are likely to be selected and omitted often by forward selection for the $B$ bootstrap samples. Hence omitting all predictors $x_i$ that have a $\hat{\beta}^*_{ij} = 0$ for at least one of the bootstrap samples $j = 1, ..., B$ could result in underfitting, e.g. using just $x_1$ and $x_2$ in the above $J = 2$ example. If $n$ and $B$ are large enough, the singleton set $\{0\}$ could still be the "100%" confidence region for a vector $\beta_O$.

Suppose the predictors $x_i$ have been standardized. Then another important regression model has the $\beta_i$ taper off rapidly, but no coefficients are equal to zero. For example, $\beta_i = e^{-i}$ for $i = 1, ..., p$.

Inference techniques for the variable selection model have not had much success. Efron (2014) let $t(Z)$ be a scalar valued statistic, based on all of the data $Z$, that estimates a parameter of interest $\mu$. Form a bootstrap sample $Z^*_i$ and $t(Z^*_i)$ for $i = 1, ..., B$. Then $\hat{\mu} = s(Z) = \frac{1}{B}\sum_{i=1}^B t(Z^*_i)$, a "bootstrap smoothing" or "bagging" estimator. In the regression setting with variable selection, $Z^*_i$ can be formed with the nonparametric or residual bootstrap using the full model. The prediction region method can also be applied to $t(Z)$. For example, when $A$ is $1 \times p$, the prediction region method uses $\mu = A\beta - c$, $t(Z) = A\hat{\beta} - c$, and $\overline{T}^* = \hat{\mu}$. Efron (2014) used the confidence interval $\overline{T}^* \pm z_{1-\delta}SE(\overline{T}^*)$, which is symmetric about $\overline{T}^*$. The prediction region method uses $\overline{T}^* \pm S^*_TD_{(U_B)}$, which is also a symmetric interval centered at $\overline{T}^*$. If both the prediction region method and Efron's method are large sample confidence intervals for $\mu$, then they have the same asymptotic length (scaled by multiplying by $\sqrt{n}$), since otherwise the shorter interval will have lower asymptotic coverage. Since the prediction region interval is a percentile interval, the shorth(c) interval could have much shorter length than the Efron interval and the prediction region interval if the bootstrap distribution is not symmetric.

The prediction region method can be used for vector valued statistics and parameters. Prediction intervals and regions can have higher than the nominal coverage $1-\delta$ if the distribution is discrete or a mixture of a discrete distribution and some other distribution. In particular, coverage can be high if the distribution of $T$ is a mixture with a point mass at 0 and the method checks whether 0 is in the prediction region. Such a mixture often occurs for variable selection methods. The bootstrap sample for the $W_i = \hat{\beta}^*_{ij}$ can contain many zeroes and be highly skewed if the $j$th predictor is weak. Then the computer program may fail because $S^*_T$ is singular, but if all or nearly all of the $\hat{\beta}^*_{ij} = 0$, then there is strong evidence that the $j$th predictor is not needed given that the other predictors are in the variable selection method.

As an extreme simulation case, suppose $\hat{\beta}^*_{ij} = 0$ for $i = 1, ..., B$ and for each run in the simulation. Consider testing $H_0: \beta_j = 0$. Then regardless of the nominal coverage $1-\delta$, the closed interval $[0, 0]$ will contain 0 for each run, and the observed coverage will be $1 > 1-\delta$. Using the open interval $(0, 0)$ would give observed coverage 0. Also, intervals $[0, b]$ and $[a, 0]$ correctly suggest failing to reject $H_0: \beta_j = 0$, while intervals $(0, b)$ and $(a, 0)$ incorrectly suggest rejecting $H_0: \beta_j = 0$. Hence closed regions and intervals make sense.

Warning: The bootstrap tests for variable selection are exploratory tests: for variable selection where $X^TX/n$ does not converge to a diagonal matrix, it has not yet been proven that i) the prediction region method is a large sample test, and ii) the shorth interval is a large sample CI for $\beta_i$. For forward selection with $C_p$, the geometric argument suggests that the coverages will be close to or higher than the nominal coverage for large $n$ and $B$ ($n \geq 20p$ and $B \geq 50p$ are lower bounds). The prediction region method tends to fail if $S^*_T$ is singular. Singularity will occur if $\hat{\beta}^*_{ij} = 0$ for $i = 1, ..., B$ for some predictor $j$. (If $n$ and $B$ are large enough, then such a result gives strong evidence that the predictor $x_j$ is not needed in the model given that the other predictors are in the model.) Singularity is likely if many $\beta_i = 0$.


Example 4.8. Cook and Weisberg (1999, pp. 351, 433, 447) give a data set on 82 mussels sampled off the coast of New Zealand. Let the response variable be the logarithm log(M) of the muscle mass, and let the predictors be the length L and height H of the shell in mm, the logarithm log(W) of the shell width W, the logarithm log(S) of the shell mass S, and a constant. Inference for the full model is shown below along with the shorth(c) nominal 95% confidence intervals for $\beta_i$ computed using the nonparametric and residual bootstraps. As expected, the residual bootstrap intervals are close to the classical least squares confidence intervals $\approx \hat{\beta}_i \pm 1.96\,SE(\hat{\beta}_i)$. In the output, W and S are actually log(W) and log(S) since R tends not to change labels when transformations are made.

large sample full model inference

Est. SE t Pr(>|t|) nparboot 95% shorth CI

i -1.249 0.838 -1.49 0.14 [-2.93,-0.048][-3.138,0.194]

L -0.001 0.002 -0.28 0.78 [-0.005,0.003][-0.005,0.004]

W 0.130 0.374 0.35 0.73 [-0.384,0.827][-0.555,0.971]

H 0.008 0.005 1.50 0.14 [-0.002,0.018][-0.003,0.017]

S 0.640 0.169 3.80 0.00 [ 0.188,1.001][ 0.276,0.955]

output and shorth intervals for the min Cp submodel

Est. SE t Pr(>|t|) 95% shorth CI

int -0.9573 0.1519 -6.3018 0.0000 [-2.769, 0.460]

L 0 [-0.004, 0.004]

W 0 [-0.595, 0.869]

H 0.0072 0.0047 1.5490 0.1254 [ 0.000, 0.016]

S 0.6530 0.1160 5.6297 0.0000 [ 0.324, 0.913]

The minimum $C_p$ model from all subsets variable selection uses a constant, H, and log(S). The shorth(c) nominal 95% confidence intervals for $\beta_i$ using the residual bootstrap are shown. Note that the interval for H is right skewed and contains 0 when closed intervals are used instead of open intervals. The least squares output is also shown, but it should only be used for inference if the model was selected before looking at the data.

It was expected that log(S) might be the only predictor needed, along with a constant, since log(S) and log(M) are both log(mass) measurements and likely highly correlated. Hence we want to test $H_0: \beta_2 = \beta_3 = \beta_4 = 0$ with the $I_{min}$ model selected by all subsets variable selection. (Of course this test would be easy to do with the full model using least squares theory.) Then $H_0: A\beta = (\beta_2, \beta_3, \beta_4)^T = 0$. Using the prediction region method with the full model gave an interval $[0, 2.930]$ with $D_0 = 1.641$. Note that $\sqrt{\chi^2_{3,0.95}} = 2.795$. So fail to reject $H_0$. Using the prediction region method with the $I_{min}$ variable selection model gave $[0, D_{(U_B)}] = [0, 3.293]$ with $D_0 = 1.134$. So fail to reject $H_0$. The R code used to produce the above output is shown below.

library(leaps)

y <- log(mussels[,5]); x <- mussels[,1:4]


x[,4] <- log(x[,4]); x[,2] <- log(x[,2])

out <- regboot(x,y,B=1000)

tem <- rowboot(x,y,B=1000)

outvs <- vselboot(x,y,B=1000) #get bootstrap CIs,

apply(out$betas,2,shorth3);

apply(tem$betas,2,shorth3);

apply(outvs$betas,2,shorth3)

ls.print(outvs$full)

ls.print(outvs$sub)

#test if beta_2 = beta_3 = beta_4 = 0

Abeta <- out$betas[,2:4]

#prediction region method with residual bootstrap

predreg(Abeta)

Abeta <- outvs$betas[,2:4]

#prediction region method with Imin

predreg(Abeta)

Example 4.9. Consider the Gladstone (1905) data set that has 12 variables on 267 persons after death. The response variable was brain weight. Head measurements were breadth, circumference, head height, length, and size, as well as cephalic index and brain weight. Age, height, and two categorical variables, ageclass (0: under 20, 1: 20-45, 2: over 45) and sex, were also given. The eight predictor variables shown in the output were used.

Output is shown below for the full model and the bootstrapped minimum $C_p$ forward selection estimator. Note that the shorth intervals for length and sex are quite long. These variables are often in and often deleted from the bootstrap forward selection. Model $I_I$ is the model with the fewest predictors such that $C_p(I_I) \leq C_p(I_{min}) + 1$. For this data set, $I_I = I_{min}$. The bootstrap CIs differ due to different random seeds.

large sample full model inference for Ex. 4.9

Estimate SE t Pr(>|t|) 95% shorth CI

Int -3021.255 1701.070 -1.77 0.077 [-6549.8,322.79]

age -1.656 0.314 -5.27 0.000 [ -2.304,-1.050]

breadth -8.717 12.025 -0.72 0.469 [-34.229,14.458]

cephalic 21.876 22.029 0.99 0.322 [-20.911,67.705]

circum 0.852 0.529 1.61 0.109 [ -0.065, 1.879]

headht 7.385 1.225 6.03 0.000 [ 5.138, 9.794]

height -0.407 0.942 -0.43 0.666 [ -2.211, 1.565]

len 13.475 9.422 1.43 0.154 [ -5.519,32.605]

sex 25.130 10.015 2.51 0.013 [ 6.717,44.19]

output and shorth intervals for the min Cp submodel

Estimate SE t Pr(>|t|) 95% shorth CI

Int -1764.516 186.046 -9.48 0.000 [-6151.6,-415.4]

age -1.708 0.285 -5.99 0.000 [ -2.299,-1.068]

breadth 0 [-32.992, 8.148]


cephalic 5.958 2.089 2.85 0.005 [-10.859,62.679]

circum 0.757 0.512 1.48 0.140 [ 0.000, 1.817]

headht 7.424 1.161 6.39 0.000 [ 5.028, 9.732]

height 0 [ -2.859, 0.000]

len 6.716 1.466 4.58 0.000 [ 0.000,30.508]

sex 25.313 9.920 2.55 0.011 [ 0.000,42.144]

output and shorth for I_I model

Estimate Std.Err t-val Pr(>|t|) 95% shorth CI

Int -1764.516 186.046 -9.48 0.000 [-6104.9,-778.2]

age -1.708 0.285 -5.99 0.000 [ -2.259,-1.003]

breadth 0 [-31.012, 6.567]

cephalic 5.958 2.089 2.85 0.005 [ -6.700,61.265]

circum 0.757 0.512 1.48 0.140 [ 0.000, 1.866]

headht 7.424 1.161 6.39 0.000 [ 5.221,10.090]

height 0 [ -2.173, 0.000]

len 6.716 1.466 4.58 0.000 [ 0.000,28.819]

sex 25.313 9.920 2.55 0.011 [ 0.000,42.847]

The R code used to produce the above output is shown below. The last four commands are useful for examining the variable selection output.

x<-cbrainx[,c(1,3,5,6,7,8,9,10)]

y<-cbrainy

library(leaps)

out <- regboot(x,y,B=1000)

outvs <- fselboot(x,cbrainy) #get bootstrap CIs,

apply(out$betas,2,shorth3)

apply(outvs$betas,2,shorth3)

ls.print(outvs$full)

ls.print(outvs$sub)

outvs <- modIboot(x,cbrainy) #get bootstrap CIs,

apply(outvs$betas,2,shorth3)

ls.print(outvs$sub)

tem<-regsubsets(x,y,method="forward")

tem2<-summary(tem)

tem2$which

tem2$cp

A small simulation study was done in R using $B = \max(1000, n, 20p)$ and 5000 runs. The regression model used $\beta = (1, 1, 0, 0)^T$ with $n = 100$, $p = 4$, and various zero mean iid error distributions. Thus $\beta_S = (\beta_1, \beta_2)^T$ and $\beta_E = (\beta_3, \beta_4)^T$. The simulation used the same type of data as that for the prediction interval simulation in Section 5.12. Hence $\psi = 0$ corresponded to uncorrelated nontrivial predictors. Then the full model least squares confidence intervals for $\beta_i$ should have length near $2t_{96,0.975}\sigma/\sqrt{n} \approx 2(1.96)\sigma/10 = 0.392\sigma$ when the iid zero mean errors have variance $\sigma^2$. The simulation computed the shorth(c) interval for each $\beta_i$, and the three confidence regions were used to test $H_0: \beta_E = 0$ and $H_0: \beta_S = 1$. See Remark 4.10. The nominal coverage was 0.95 with $\delta = 0.05$. Observed coverage between 0.94 and 0.96 would suggest coverage is close to the nominal value. Models with the first $(k+1)$ $\beta_i = 1$ and the last $(p-k-1)$ $\beta_i = 0$ were also considered.

Table 4.2 Bootstrapping OLS Forward Selection with Cp, ei ~ N(0,1)

 psi      beta1  beta2  beta3  beta4  pm0    hyb0   br0    pm1    hyb1   br1
 reg,0    0.9458 0.9500 0.9474 0.9484 0.9400 0.9408 0.9410 0.9368 0.9362 0.9370
 len      0.3955 0.3990 0.3987 0.3982 2.4508 2.4508 2.4496 2.4508 2.4496 2.4508
 vs,0     0.9480 0.9496 0.9972 0.9958 0.9910 0.9786 0.9914 0.9384 0.9394 0.9402
 len      0.3954 0.3987 0.3233 0.3231 2.6987 2.6987 3.0020 2.4497 2.4497 2.4570
 reg,0.5  0.9458 0.9442 0.9456 0.9448 0.9380 0.9380 0.9382 0.9344 0.9358 0.9364
 len      0.3955 0.6612 0.6609 0.6613 2.4509 2.4509 2.4520 2.4509 2.4509 2.4524
 vs,0.5   0.9474 0.9676 0.9972 0.9978 0.9926 0.9844 0.9930 0.9548 0.9552 0.9634
 len      0.3954 0.6580 0.5372 0.5389 2.7031 2.7031 2.9939 2.4609 2.4609 2.5773
 reg,0.9  0.9458 0.9410 0.9442 0.9498 0.9402 0.9398 0.9400 0.9354 0.9350 0.9352
 len      0.3955 3.2567 3.2534 3.2586 2.4507 2.4507 2.4517 2.4509 2.4509 2.4523
 vs,0.9   0.9474 0.9684 0.9944 0.9962 0.9918 0.9808 0.9918 0.9620 0.9594 0.9696
 len      0.3954 2.7511 2.7252 2.7353 2.7158 2.7158 2.9714 2.4965 2.4965 2.5986

The regression models used the residual bootstrap on the full model least squares estimator and on the forward selection estimator $\hat{\beta}_{I_{min},0}$. Results are shown for when the iid errors $e_i \sim N(0, 1)$. Table 4.2 shows two rows for each model giving the observed confidence interval coverages and average lengths of the confidence intervals. The term "reg" is for the full model regression, and the term "vs" is for forward selection. For the tests, the last six columns give the "length" = cutoff ($D_{(U_B)}$ or $D_{(U_B,T)}$) and coverage = P(fail to reject $H_0$). See Equation (17). The cutoff will often be near $\sqrt{\chi^2_{g,0.95}}$ if the statistic $T$ is asymptotically normal. Note that $\sqrt{\chi^2_{2,0.95}} = 2.448$ is close to 2.45 for the full model regression bootstrap tests. The coverages were near 0.95 for the regression bootstrap on the full model, although there was slight undercoverage for the tests since $(n-p)/n = 0.96$ when $n = 25p$.

The inference for forward selection was about as precise as or more precise than the inference for the full model. Suppose $\psi = 0$. Then from the discussion above Remark 4.14, $\hat{\beta}_S$ has the same limiting distribution for $I_{min}$ and the full model. Note that the average lengths and coverages were similar for the full model and forward selection $I_{min}$ for $\beta_1$, $\beta_2$, and $\beta_S = (\beta_1, \beta_2)^T$. Forward selection inference was more precise for $\beta_E = (\beta_3, \beta_4)^T$. Simulations for lasso were similar, but the confidence intervals were longer than those for forward selection. Ridge regression only simulated well for $\psi = 0$.

For $\psi > 0$ and $I_{min}$, the coverages for the $\beta_i$ corresponding to $\beta_S$ were near 0.95, but the average length could be shorter since $I_{min}$ tends to have less multicollinearity than the full model. For $\psi \geq 0$, the $I_{min}$ coverages were higher than 0.95 for $\beta_3$ and $\beta_4$ and for testing $H_0: \beta_E = 0$ since zeros often occurred for $\hat{\beta}^*_j$ for $j = 3, 4$. The average CI lengths were shorter for $I_{min}$ than for the OLS full model for $\beta_3$ and $\beta_4$. Note that for $I_{min}$, the coverage for testing $H_0: \beta_S = 1$ was higher than that for the OLS full model.

Some R code for the simulation is shown below.

record coverages and ‘‘lengths" for

b1, b2, bp-1, bp, pm0, hyb0, br0, pm1, hyb1, br1

regbootsim3(n=100,p=4,k=1,nruns=5000,type=1,psi=0)

$cicov

[1] 0.9458 0.9500 0.9474 0.9484 0.9400 0.9408 0.9410

0.9368 0.9362 0.9370

$avelen

[1] 0.3955 0.3990 0.3987 0.3982 2.4508 2.4508 2.4521

[8] 2.4496 2.4496 2.4508

$beta

[1] 1 1 0 0

$k

[1] 1

library(leaps)

vsbootsim4(n=100,p=4,k=1,nruns=5000,type=1,psi=0)

$cicov

[1] 0.9480 0.9496 0.9972 0.9958 0.9910 0.9786 0.9914

0.9384 0.9394 0.9402

$avelen

[1] 0.3954 0.3987 0.3233 0.3231 2.6987 2.6987 3.0020

[8] 2.4497 2.4497 2.4570

$beta

[1] 1 1 0 0

$k

[1] 1

Remark 4.15. In spite of a massive literature on variable selection and model selection, to my knowledge, Table 4.2 and the simulations in Pelawa Watagoda (2017ab) and Pelawa Watagoda and Olive (2019a) may be the first simulations where the simulated model selection inference is at least as good as that of the OLS full model when n/p is large. When n/p is small, prediction can be much better for model selection estimators than for the OLS full model, which interpolates the data if n = p. Using shorth based prediction intervals results in asymptotically optimal PIs for unimodal zero mean error distributions with second moments. The best competing PIs in 2019, which do not use the shorth of the residuals, also needed the distributions to be symmetric for asymptotic optimality.


4.4 Summary

1) For h > 0, the hyperellipsoid {z : (z − T)ᵀC⁻¹(z − T) ≤ h²} = {z : D²z ≤ h²} = {z : Dz ≤ h}. A future observation (random vector) xf is in this region if Dxf ≤ h. A large sample 100(1 − δ)% prediction region is a set An such that P(xf ∈ An) →P 1 − δ where 0 < δ < 1.

2) The classical 100(1 − δ)% large sample prediction region is {z : D²z(x̄, S) ≤ χ²p,1−δ}, and works well if n is large and the data are iid MVN.

3) Let qn = min(1 − δ + 0.05, 1 − δ + p/n) for δ > 0.1 and qn = min(1 − δ/2, 1 − δ + 10δp/n) otherwise. If qn < 1 − δ + 0.001, set qn = 1 − δ. If (T, C) is a consistent estimator of (µ, dΣ), then {z : Dz(T, C) ≤ h} is a large sample 100(1 − δ)% prediction region if h = D(Un) where D(Un) is the 100 qn th sample quantile of the Di. The large sample 100(1 − δ)% nonparametric prediction region {z : D²z(x̄, S) ≤ D²(Un)} uses (T, C) = (x̄, S).

4) Suppose m independent large sample 100(1 − δ)% prediction regions are made where x1, ..., xn, xf are iid from the same distribution for each of the m runs. Let Y count the number of times xf is in the prediction region. Then Y ∼ binomial(m, 1 − δn) where 1 − δn is the true coverage and 1 − δn → 1 − δ as n → ∞. Simulation can be used to see if the true or actual coverage 1 − δn is close to the nominal coverage 1 − δ. A prediction region with 1 − δn < 1 − δ is liberal and a region with 1 − δn > 1 − δ is conservative. It is better to be conservative by 5% than liberal by 5%. Parametric prediction regions tend to have large undercoverage and so are too liberal.

5) For the nonparametric prediction region, we want n ≥ 10p for good coverage and n ≥ 50p for good volume.

6) Consider testing H0 : θ = θ0 versus H1 : θ ≠ θ0 where θ0 is a known g × 1 vector. The prediction region method makes a bootstrap sample wi = θ̂*i − θ0 for i = 1, ..., B. Make the nonparametric prediction region (4.14) for the wi, and reject H0 if 0 is not in the prediction region.
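A rough R sketch of the prediction region method test in point 6) is shown below. The function name predrgntest and the toy bootstrap sample thetastar are only for illustration and are not part of linmodpack; use the linmodpack functions for serious work.

predrgntest <- function(thetastar, theta0, delta = 0.05){
  #thetastar is a B x g matrix of bootstrap estimates
  B <- nrow(thetastar); g <- ncol(thetastar)
  w <- sweep(thetastar, 2, theta0)  #w_i = thetastar_i - theta0
  wbar <- colMeans(w); Sw <- var(w)
  D2 <- mahalanobis(w, center = wbar, cov = Sw)
  #sample quantile from point 3) with p replaced by g and n by B
  qB <- if(delta > 0.1) min(1 - delta + 0.05, 1 - delta + g/B) else
    min(1 - delta/2, 1 - delta + 10*delta*g/B)
  if(qB < 1 - delta + 0.001) qB <- 1 - delta
  cutoff <- quantile(D2, qB)
  D02 <- mahalanobis(matrix(0, 1, g), center = wbar, cov = Sw)
  list(reject = (D02 > cutoff), D0 = sqrt(D02), cutoff = sqrt(cutoff))
}
set.seed(1)  #toy bootstrap sample with B = 1000 and g = 2
thetastar <- matrix(rnorm(2000, mean = c(0.1, 0)), ncol = 2, byrow = TRUE)
predrgntest(thetastar, theta0 = c(0, 0))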

4.5 Complements

This chapter followed Olive (2017b, ch. 5) and Pelawa Watagoda and Olive (2019ab) closely. Also see Olive (2018, 2019) and Pelawa Watagoda (2017a).

For forward selection with n/p large, the minimum Cp model Imin is known to overfit, but empirically does not overfit much. Suppose the Imin overfit is negligible for models containing t + 1 or more predictors than model S. Then β̂Imin,0 could still be a mixture of

C(p − aS, 0) + · · · + C(p − aS, t)

distributions, where C(m, j) is the binomial coefficient “m choose j.”

The Pelawa Watagoda and Olive (2019b) PI can be regarded as the Olive (2013a) PI, modified to work even if p > n as long as n >> d. The nonparametric prediction region is due to Olive (2013a). The bootstrap prediction region method is due to Olive (2017b: section 5.3, 2017c, 2018).

Good references for the bootstrap include Efron (1982), Efron and Hastie (2016, ch. 10–11), and Efron and Tibshirani (1993). Also see Chen (2016), Hesterberg (2014), Olive (2019), and Machado and Parente (2005). Another method for a bootstrap confidence region is given in Ghosh and Polansky (2014).

The shorth(kB) CI is the shortest percentile CI that contains kB of the T*i, but the shorth(c) CI should be used to avoid undercoverage. The shorth confidence interval is a practical implementation of the Hall (1988) shortest bootstrap interval based on all possible bootstrap samples.

The Cauchy Schwarz inequality says |aᵀb| ≤ ‖a‖ ‖b‖. Suppose √n(β̂ − β) = OP(1) is bounded in probability. This will occur if √n(β̂ − β) →D Np(0, Σ), e.g. if β̂ is the OLS estimator. Then

|ri − ei| = |Yi − xiᵀβ̂ − (Yi − xiᵀβ)| = |xiᵀ(β̂ − β)|.

Hence

√n max_{i=1,...,n} |ri − ei| ≤ (max_{i=1,...,n} ‖xi‖) ‖√n(β̂ − β)‖ = OP(1)

since max ‖xi‖ = OP(1) or there is extrapolation. Hence OLS residuals behave well if the zero mean error distribution of the iid ei has a finite variance σ².

The Hall et al. (1989) double bootstrap technique is likely useful for correlation statistics like T = R² if p is very small, but otherwise makes the simulation time too long.

One of the sufficient conditions for the bootstrap confidence region is that T has a well behaved Hadamard derivative. Fréchet differentiability implies Hadamard differentiability, and many statistics are shown to be Hadamard differentiable in Bickel and Ren (2001), Clarke (1986, 2000), Fernholtz (1983), Gill (1989), Ren (1991), and Ren and Sen (1995). Bickel and Ren (2001) showed that their method can work when Hadamard differentiability fails.

We can get a prediction region by randomly dividing the data into two half sets H and V where H has nH = ⌈n/2⌉ of the cases and V has the remaining m = nV = n − nH cases i1, ..., im. Compute (x̄H, SH) from the cases in H. Then compute the distances D²i = (xi − x̄H)ᵀSH⁻¹(xi − x̄H) for the m vectors xi in V. Then the large sample 100(1 − δ)% prediction region for xF is {x : D²x(x̄H, SH) ≤ D²(cm)} where cm is given by (4.3) with n replaced by m. The prediction region {x : D²x(x̄, S) ≤ D²(cm)} should also work. D²(Un) for the nonparametric prediction region (4.14) and D²(cm) should be close for large n. D²(cm) may give better results if 5p ≤ n ≤ 20p or if the distribution of x is not elliptically contoured.

The following result is also informative for the prediction region method.

Let Ti = Ti,n, and assume T1, ..., TB are iid where

(n/B) ∑_{i=1}^B (Ti − θ)(Ti − θ)ᵀ →P ΣA and (n/B) ∑_{i=1}^B (T*i − T̄*)(T*i − T̄*)ᵀ →P ΣA.

Then

(n/B) ∑_{i=1}^B (Ti − θ)(Ti − θ)ᵀ − (n/B) ∑_{i=1}^B (T*i − T̄*)(T*i − T̄*)ᵀ →P 0,   (4.34)

the g × g matrix of zeroes. The trace is a continuous linear function. Postmultiply both sides of (4.34) by [S*T]⁻¹, and take the trace of both sides to get

(n/B) ∑_{i=1}^B (Ti − µ)ᵀ[S*T]⁻¹(Ti − µ) − (n/B) ∑_{i=1}^B (T*i − T̄*)ᵀ[S*T]⁻¹(T*i − T̄*) →P 0.   (4.35)

Now (Ti − µ)ᵀ[S*T]⁻¹(Ti − µ) − n(Ti − µ)ᵀΣA⁻¹(Ti − µ) →P 0. Hence the first sum in (4.35) behaves like a sum of iid nonnegative terms that each converge in distribution to D². If n is fixed, then the T*i are iid with respect to the bootstrap distribution where T̄* ≈ E(T*i) = µn and S*T ≈ Cov(T*i) = Σn with respect to the bootstrap distribution. Hence the second sum in (4.35) behaves like a sum of iid nonnegative terms with respect to the bootstrap distribution.

The following theorem will provide some intuition for why the percentile method works if Tn has a central limit theorem. Let Zδ be the 100δth percentile of Z: P(Z ≤ Zδ) = δ, and let P(ZδL ≤ Z ≤ ZδU) = 1 − δ. Let Tn,δ be the 100δth percentile of (the sampling distribution of) Tn. Then a population prediction interval for Tf,n is [Tn,δL, Tn,δU], which can be estimated by the sample percentiles [T(cL), T(cU)] when there is an iid sample T1, ..., Tn. The shortest such interval can be estimated by the shorth.

Theorem 4.3. Suppose g = 1, √n(Tn − µ) →D X, and √n(Tn − µ)/σ →D (1/σ)X = W. If the percentiles are continuity points of the distribution of W, then for each large sample 100(1 − δ)% PI [T(cL), T(cU)] for Tf,n, there is a large sample 100(1 − δ)% CI [Tn − WδU σ/√n, Tn − WδL σ/√n] for µ with approximately the same length.

Proof. Note that 1 − δ = P(Tn,δL ≤ Tn ≤ Tn,δU) ≈ P(T(cL) ≤ Tn ≤ T(cU)) ≈ P(WδL ≤ √n(Tn − µ)/σ ≤ WδU) = P(Tn − WδU σ/√n ≤ µ ≤ Tn − WδL σ/√n) = P(WδL σ/√n + µ ≤ Tn ≤ WδU σ/√n + µ). Hence Tn,δL ≈ WδL σ/√n + µ and Tn,δU ≈ WδU σ/√n + µ. Thus T(cU) − T(cL) ≈ (σ/√n)(WδU − WδL) ≈ Tn,δU − Tn,δL.

Theorem 4.3 suggests that the Frey (2013) shorth(c) interval applied to the bootstrap sample estimates the shortest large sample 100(1 − δ)% CI [Tn − WδU σ/√n, Tn − WδL σ/√n] based on the asymptotic pivot. Note that if Zi = Tn + µ − Ti for i = 1, ..., n, then P(Z(cL) ≤ µ ≤ Z(cU)) ≈ P(Tn + µ − Tn,δU ≤ µ ≤ Tn + µ − Tn,δL) = P(Tn,δL ≤ Tn ≤ Tn,δU) = 1 − δ. Then the Zi are centered at Tn with deviations equal to µ − Ti. Note that the distribution of Tn − µ is the same as the distribution of Ti − µ: Ti − µ =D Tn − µ. Now the bootstrap approximation says that the distribution of Tn − µ can be approximated by the distribution of T*i − Tn. Thus Ti − µ =D Tn − µ ≈ T*i − Tn, or T*i ≈ Ti + Tn − µ. If the distribution of Tn − µ is approximately the same as the distribution of µ − Tn (asymptotic symmetry), then the percentile method should work. Since √n(Tn − µ) →D X, we have n^γ(Tn − µ) →D 0 if 0 < γ < 0.5. The point mass at 0 is a symmetric distribution, and n^γ(Ti + Tn − µ) ≈ n^γ µ for large n.
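A minimal sketch of a shorth interval applied to a bootstrap sample follows; the function name shorthCI, the simulated data, and the exact form of the correction term in cc are only illustrative assumptions here, and the linmodpack shorth function should be preferred.

shorthCI <- function(Tstar, delta = 0.05){
  B <- length(Tstar)
  #use cc slightly larger than ceiling((1-delta)B) to avoid undercoverage;
  #the 1.12*sqrt(delta/B) correction term is an assumption in this sketch
  cc <- min(B, ceiling(B*(1 - delta + 1.12*sqrt(delta/B))))
  sT <- sort(Tstar)
  lens <- sT[cc:B] - sT[1:(B - cc + 1)]  #lengths of intervals holding cc values
  i <- which.min(lens)
  c(sT[i], sT[i + cc - 1])  #shortest interval containing cc of the Tstar_i
}
set.seed(2)
x <- rexp(50)  #original data
Tstar <- replicate(1000, mean(sample(x, replace = TRUE)))  #bootstrap sample means
shorthCI(Tstar)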

For forward selection with Cp, θ = Aβ, and Tn = Aβ̂Imin,0 is a mixture of Tjn = Aβ̂Ij,0 where √n(Aβ̂Ij,0 − θ) →D uj ∼ Np(0, σ² A Vj,0 Aᵀ). Hence

√n(Tn − θ) →D u

where the cdf of u is Fu(z) = ∑_j πj Fuj(z) and Fuj(z) is the cdf of uj. Under regularity conditions √n(T̄* − θ) →D w where

√n(∑_j πj Tjn − θ) →D w = ∑_{j=1}^J πj uj.   (4.36)

Note that ∑_j πj Tjn = ∑_j πj A Dj,0 Y = CY is a linear combination of the Tjn, not a mixture distribution of the Tjn. Note that if Y ∼ Nn(Xβ, σ²In), then CY ∼ Ng(θ, σ²CCᵀ). Equation (4.36) holds if

(√n(Aβ̂I1,0 − θ)ᵀ, ..., √n(Aβ̂IJ,0 − θ)ᵀ)ᵀ = (√n(AD1,0 Y − θ)ᵀ, ..., √n(ADJ,0 Y − θ)ᵀ)ᵀ →D (u1ᵀ, ..., uJᵀ)ᵀ.   (4.37)


We conjecture that Equation (4.37) holds because each entry of the middle vector has the same random vector Y = Yn (we conjecture that the marginal convergence in distribution implies joint convergence in distribution for this situation).

4.6 Problems

4.1. Consider the Cushny and Peebles data set (see Staudte and Sheather 1990, p. 97) listed below. Find shorth(7). Show work.

0.0 0.8 1.0 1.2 1.3 1.3 1.4 1.8 2.4 4.6

4.2. Find shorth(5) for the following data set. Show work.

6 76 90 90 94 94 95 97 97 1008

4.3. Find shorth(5) for the following data set. Show work.

66 76 90 90 94 94 95 95 97 98

4.4. Suppose you are estimating the mean θ of losses with the maximum likelihood estimator (MLE) X̄ assuming an exponential (θ) distribution. Compute the sample mean of the fourth bootstrap sample.

actual losses 1, 2, 5, 10, 50: X̄ = 13.6
bootstrap samples:
2, 10, 1, 2, 2: X̄ = 3.4
50, 10, 50, 2, 2: X̄ = 22.8
10, 50, 2, 1, 1: X̄ = 12.8
5, 2, 5, 1, 50: X̄ = ?

4.5. The data below are the sorted residuals from a least squares regression where n = 100 and p = 4. Find shorth(97) of the residuals.

number 1 2 3 4 ... 97 98 99 100

residual -2.39 -2.34 -2.03 -1.77 ... 1.76 1.81 1.83 2.16

4.6. To find the sample median of a list of n numbers where n is odd, order the numbers from smallest to largest and the median is the middle ordered number. The sample median estimates the population median. Suppose the sample is {14, 3, 5, 12, 20, 10, 9}. Find the sample median for each of the three bootstrap samples listed below.
Sample 1: 9, 10, 9, 12, 5, 14, 3
Sample 2: 3, 9, 20, 10, 9, 5, 14
Sample 3: 14, 12, 10, 20, 3, 3, 5

4.7. Suppose you are estimating the mean µ of losses with T = X̄.
actual losses 1, 2, 5, 10, 50: X̄ = 13.6.
a) Compute T*1, ..., T*4, where T*i is the sample mean of the ith bootstrap sample.
bootstrap samples:
2, 10, 1, 2, 2:
50, 10, 50, 2, 2:
10, 50, 2, 1, 1:
5, 2, 5, 1, 50:
b) Now compute the bagging estimator, which is the sample mean of the T*i: the bagging estimator T̄* = (1/B) ∑_{i=1}^B T*i where B = 4 is the number of bootstrap samples.

4.8. Consider the output above Example 4.8 for the minimum Cp forward selection model.
a) What is β̂Imin?
b) What is β̂Imin,0?
c) The large sample 95% shorth CI for H is [0, 0.016]. Is H needed in the minimum Cp model given that the other predictors are in the model?
d) The large sample 95% shorth CI for log(S) is [0.324, 0.913]. Is log(S) needed in the minimum Cp model given that the other predictors are in the model? (S in the output is actually log(S).)
e) Suppose x1 = 1, x4 = H = 130, and x5 = log(S) = 5.075. Find Ŷ = (x1 x4 x5) β̂Imin. (W and S in the output are actually log(W) and log(S). R tends not to change labels when transformations are made. Note that Ŷ = log(M).)

R Problem. Use the command source(“G:/slpack.txt”) to download the functions and the command source(“G:/sldata.txt”) to download the data. See the Preface or Section 11.1. Typing the name of the slpack function, e.g. regbootsim2, will display the code for the function. Use the args command, e.g. args(regbootsim2), to display the needed arguments for the function. For the following problem, the R command can be copied and pasted from (http://lagrange.math.siu.edu/Olive/slrhw.txt) into R.

4.9. a) Type the R command predsim() and paste the output into Word. This program computes xi ∼ N4(0, diag(1, 2, 3, 4)) for i = 1, ..., 100 and xf = x101. One hundred such data sets are made, and ncvr, scvr, and mcvr count the number of times xf was in the nonparametric, semiparametric, and parametric MVN 90% prediction regions. The volumes of the prediction regions are computed, and voln, vols, and volm are the average ratio of the volume of the ith prediction region over that of the semiparametric region. Hence vols is always equal to 1. For multivariate normal data, these ratios should converge to 1 as n → ∞.

b) Were the three coverages near 90%?


4.10. Consider the multiple linear regression model Yi = β1 + β2xi,2 + β3xi,3 + β4xi,4 + ei where β = (1, 1, 0, 0)ᵀ. The function regbootsim2 bootstraps the regression model, finds bootstrap confidence intervals for βi and a bootstrap confidence region for (β3, β4)ᵀ corresponding to the test H0 : β3 = β4 = 0 versus HA : not H0. See the R code above Remark 4.14. The lengths of the CIs along with the proportion of times the CI for βi contained βi are given. The fifth interval gives the length of the interval [0, D(c)] where H0 is rejected if D0 > D(c), and the fifth “coverage” is the proportion of times the test fails to reject H0. Since nominal 95% CIs were used and the nominal level of the test is 0.05 when H0 is true, we want the coverages near 0.95. The CI lengths for the first 4 intervals should be near 0.392. The residual bootstrap is used.

Copy and paste the commands for this problem into R, and include the output in Word.


Chapter 5
Statistical Learning Alternatives to OLS When n >> p

5.1 The MLR Model

Suppose that the response variable Yi and at least one predictor variable xi,j are quantitative with xi,1 ≡ 1. Let xiᵀ = (xi,1, ..., xi,p) and β = (β1, ..., βp)ᵀ where β1 corresponds to the intercept. Then the multiple linear regression (MLR) model is

Yi = β1 + xi,2β2 + · · · + xi,pβp + ei = xiᵀβ + ei   (5.1)

for i = 1, ..., n. This model is also called the full model. Here n is the sample size and the random variable ei is the ith error. Assume that the ei are iid with variance V(ei) = σ². In matrix notation, these n equations become

Y = Xβ + e,   (5.2)

where Y is an n × 1 vector of dependent variables, X is an n × p matrix of predictors, β is a p × 1 vector of unknown coefficients, and e is an n × 1 vector of unknown errors. Equivalently,

[Y1]   [x1,1 x1,2 . . . x1,p] [β1]   [e1]
[Y2]   [x2,1 x2,2 . . . x2,p] [β2]   [e2]
[ ⋮ ] = [  ⋮    ⋮   . . .  ⋮ ] [ ⋮ ] + [ ⋮ ]
[Yn]   [xn,1 xn,2 . . . xn,p] [βp]   [en]     (5.3)

There are many methods for estimating β, including (ordinary) least squares (OLS) for the full model, forward selection with OLS, principal components regression (PCR), partial least squares (PLS), lasso, relaxed lasso, and ridge regression (RR). For some of these methods, it is convenient to use centered or scaled data. Suppose U has observed values U1, ..., Un. For example, if Ui = Yi then U corresponds to the response variable Y. The observed values of a random variable V are centered if their sample mean is 0. The centered values of U are Vi = Ui − Ū for i = 1, ..., n. If the sample variance of the Ui is

σ̂²g = (1/(n − g)) ∑_{i=1}^n (Ui − Ū)²,

then the sample standard deviation of the Ui is σ̂g. If the values of the Ui are not all the same, then σ̂g > 0, and the standardized values of the Ui are

Wi = (Ui − Ū)/σ̂g.

Typically g = 1 or g = 0 is used: g = 1 gives an unbiased estimator of σ² while g = 0 gives the method of moments estimator. Note that the standardized values are centered, W̄ = 0, and the sample variance of the standardized values

(1/(n − g)) ∑_{i=1}^n W²i = 1.   (5.4)

Remark 5.1. Let the nontrivial predictors uiᵀ = (xi,2, ..., xi,p) = (ui,1, ..., ui,p−1). Then xi = (1 uiᵀ)ᵀ. Let the n × (p − 1) matrix of standardized nontrivial predictors be Wg = (Wij) when the predictors are standardized using σ̂g. Thus ∑_{i=1}^n Wij = 0 and ∑_{i=1}^n W²ij = n − g for j = 1, ..., p − 1. Hence

Wij = (xi,j+1 − x̄j+1)/σ̂j+1 where σ̂²j+1 = (1/(n − g)) ∑_{i=1}^n (xi,j+1 − x̄j+1)²

is σ̂g for the (j+1)th variable xj+1. Note that the sample correlation matrix of the nontrivial predictors ui is

Ru = WgᵀWg/(n − g).

Let Z = Y − Ȳ where Ȳ = Ȳ 1. Since the R software tends to use g = 0, let W = W0. Note that the n × (p − 1) matrix W does not include a vector 1 of ones. Then regression through the origin is used for the model

Z = Wη + e   (5.5)

where Z = (Z1, ..., Zn)ᵀ and η = (η1, ..., ηp−1)ᵀ. The vector of fitted values Ŷ = Ȳ + Ẑ.
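As a quick numerical check of Remark 5.1 (a sketch with simulated data, not part of linmodpack), the R code below standardizes the nontrivial predictors with g = 0 and verifies that WᵀW/n equals the sample correlation matrix Ru.

set.seed(10)
n <- 30; u <- matrix(rnorm(n*3), n, 3)  #nontrivial predictors
sig0 <- apply(u, 2, function(v) sqrt(mean((v - mean(v))^2)))  #sigma with g = 0
W0 <- scale(u, center = TRUE, scale = sig0)  #standardized predictors, g = 0
all.equal(cor(u), t(W0) %*% W0 / n, check.attributes = FALSE)  #TRUE: this is R_u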

Remark 5.2. i) Interest is in model (5.1): estimate Yf and β. Equation (5.5) is a tool for estimating model (5.1). For many regression models, Ŷ depends on the units of the predictors, e.g. inches, yards, centimeters, meters. Suppose g = 0. The method of moments estimator of the variance σ²w is

σ̂²g=0 = S²M = (1/n) ∑_{i=1}^n (wi − w̄)².

When data xi are standardized to have w̄ = 0 and S²M = 1, the standardized data wi have no units. ii) Hence the estimators Ẑ and η̂ do not depend on the units of measurement of the xi if standardized data and Equation (5.5) are used. Linear combinations of the wi are linear combinations of the ui, which are linear combinations of the xi. (Note that γᵀu = (0 γᵀ) x.) Thus the estimators Ŷ and β̂ are obtained using Ẑ, η̂, and Ȳ. iii) Also, since W̄j = 0 and S²M,j = 1, the standardized predictor variables have similar spread, and the magnitude of ηj is a measure of the importance of the predictor variable Wj for predicting Y.

For MLR, let

    [x1ᵀ]   [x1,1 x1,2 . . . x1,p]
X = [ ⋮ ] = [  ⋮    ⋮   . . .  ⋮ ] = [v1 v2 . . . vp]
    [xnᵀ]   [xn,1 xn,2 . . . xn,p]

where the ith row of X is xiᵀ, and the jth column vj of X corresponds to n measurements of the jth random variable xj for j = 1, ..., p where v1 = 1 is a vector of ones. Let ui be the nontrivial predictors for the ith case, and let wiᵀ = (wi,1, ..., wi,p−1) be the standardized vector of nontrivial predictors for the ith case. Then the sample covariance matrix of the wi is the sample correlation matrix ρ̂u of the ui, where the sample covariance matrix of the ui is

Σ̂u,g = (1/(n − g)) ∑_{i=1}^n (ui − ū)(ui − ū)ᵀ

and ρ̂u = Ru = (rij) where rij is the sample correlation of ui = xi+1 and uj = xj+1. The unbiased estimator with g = 1 is often denoted by Σ̂u = Su = (Sij). Since the standardized data are also centered, w̄ = 0. Note that the standardization and Σ̂u,g depend on g, but the sample correlation matrix Ru does not depend on g.

For fitting the MLR model (5.1), let σ̂j be the sample standard deviation of variable xj for j = 2, ..., p. The fitted model satisfies

Yi = β̂1 + xi,2β̂2 + · · · + xi,pβ̂p + ri = Ŷi + ri

where the ith residual ri = Yi − Ŷi and the ith fitted value is

Ŷi = β̂1 + xi,2β̂2 + · · · + xi,pβ̂p = xiᵀβ̂.   (5.6)

Remark 5.3. If standardized nontrivial predictors are used, then

Ŷi = γ̂ + wi,1η̂1 + · · · + wi,p−1η̂p−1 = γ̂ + ((xi,2 − x̄2)/σ̂2)η̂1 + · · · + ((xi,p − x̄p)/σ̂p)η̂p−1 = γ̂ + wiᵀη̂   (5.7)

where
η̂j = σ̂j+1 β̂j+1   (5.8)

for j = 1, ..., p − 1. Often the fitted values are defined so that Ŷi = Ȳ if xi,j = x̄j for j = 2, ..., p. Then γ̂ = Ȳ. Alternatively,

γ̂ = β̂1 + (x̄2/σ̂2)η̂1 + · · · + (x̄p/σ̂p)η̂p−1.

If standardized nontrivial predictors are used with the centered response Zi = Yi − Ȳ where Ẑi = Ŷi − Ȳ, then Ẑi = α̂ + wiᵀη̂ where α̂ = γ̂ − Ȳ and Ŷi = Ẑi + Ȳ. Often the standardization uses g = 0 as in Remark 5.1, and the fitted values are defined so that α̂ = 0, and the vector of fitted values Ŷ = Ȳ + Ẑ.
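The relation η̂j = σ̂j+1 β̂j+1 in Remark 5.3 can be checked numerically. The sketch below uses g = 1 scaling (the default for the R functions scale and sd); the simulated data are only for illustration.

set.seed(6)
n <- 50; x <- matrix(rnorm(n*2, sd = c(1, 5)), n, 2, byrow = TRUE)
y <- drop(2 + x %*% c(1, -0.3) + rnorm(n))
bhat <- coef(lm(y ~ x))[-1]  #betahat_2, betahat_3 from the raw predictors
z <- y - mean(y); w <- scale(x)  #centered response, standardized predictors
etahat <- coef(lm(z ~ w - 1))  #regression through the origin
rbind(etahat, apply(x, 2, sd) * bhat)  #the two rows agree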

Notation. The symbol A ≡ B = f(c) means that A and B are equivalent and equal, and that f(c) is the formula used to compute A and B.

Definition 5.1. Given an estimate b of β, the corresponding vector of predicted values or fitted values is Ŷ ≡ Ŷ(b) = Xb. Thus the ith fitted value

Ŷi ≡ Ŷi(b) = xiᵀb = xi,1b1 + · · · + xi,pbp.

The vector of residuals is r ≡ r(b) = Y − Ŷ(b). Thus the ith residual ri ≡ ri(b) = Yi − Ŷi(b) = Yi − xi,1b1 − · · · − xi,pbp.

Most regression methods attempt to find an estimate β̂ of β which minimizes some criterion function Q(b) of the residuals.

Definition 5.2. Suppose the MLR model (5.1) holds where the n × p matrix X has full rank p. The ordinary least squares (OLS) estimator β̂OLS minimizes

QOLS(b) = ∑_{i=1}^n r²i(b),   (5.9)

and β̂OLS = (XᵀX)⁻¹XᵀY.

The vector of predicted or fitted values ŶOLS = Xβ̂OLS = HY where the hat matrix H = X(XᵀX)⁻¹Xᵀ provided the inverse exists. Typically the subscript OLS is omitted, and the least squares regression equation is Ŷ = β̂1x1 + β̂2x2 + · · · + β̂pxp where x1 ≡ 1 if the model contains a constant.

The following model is useful for the centered response and standardized nontrivial predictors, or if Z = Y, W = XI, and η = βI corresponds to a submodel I. See the next three paragraphs.

Definition 5.3. If Z = Wη + e, where the n × q matrix W has full rank q, then the OLS estimator is

η̂OLS = (WᵀW)⁻¹WᵀZ.

The vector of predicted or fitted values ẐOLS = Wη̂OLS = HZ where H = W(WᵀW)⁻¹Wᵀ. The vector of residuals is r = r(Z, W) = Z − Ẑ = (I − H)Z.

Remark 5.4: Let p be the number of predictors in the full model, including a constant. Let q = p − 1 be the number of nontrivial predictors in the full model. Let a = aI be the number of predictors in the submodel I, including a constant. Let k = kI = aI − 1 be the number of nontrivial predictors in the submodel. For submodel I, think of I as indexing the predictors in the model, including the constant. Let A index the nontrivial predictors in the model. Hence I adds the constant (trivial predictor) to the collection of nontrivial predictors in A. Often there will be a “true submodel” Y = XSβS + e where all of the elements of βS are nonzero but all of the elements of β that are not elements of βS are zero. See the following paragraph. Then a = aS is the number of predictors in the submodel, including a constant, and k = kS is the number of active predictors = number of nonnoise variables = number of nontrivial predictors in the true model S = IS. Then there are p − a noise variables (xi that have coefficient βi = 0) in the full model. The true model is generally only known in simulations.

Variable selection is the search for a subset of predictor variables that can be deleted without important loss of information if n/p is large (and the search for a useful subset of predictors if n/p is not large). Following Olive and Hawkins (2005), a model for variable selection can be described by

xᵀβ = xSᵀβS + xEᵀβE = xSᵀβS   (5.10)

where x = (xSᵀ, xEᵀ)ᵀ, xS is an aS × 1 vector, and xE is a (p − aS) × 1 vector. Given that xS is in the model, βE = 0 and E denotes the subset of terms that can be eliminated given that the subset S is in the model. Let xI be the vector of a terms from a candidate subset indexed by I, and let xO be the vector of the remaining predictors (out of the candidate submodel). Suppose that S is a subset of I and that model (5.10) holds. Then

xᵀβ = xSᵀβS = xSᵀβS + xI/SᵀβI/S + xOᵀ0 = xIᵀβI   (5.11)

where xI/S denotes the predictors in I that are not in S. Since this is true regardless of the values of the predictors, βO = 0 if S ⊆ I. We also assume that if xᵀβ = xIᵀβI, then S ⊆ I. Hence S is the unique smallest subset of predictors such that xᵀβ = xSᵀβS.

Model selection generates M models. Then a hopefully good model is selected from these M models. Variable selection is a special case of model selection. Many methods for variable and model selection have been suggested for the MLR model. We will consider several R functions including i) forward selection computed with the regsubsets function from the leaps library, ii) principal components regression (PCR) with the pcr function from the pls library, iii) partial least squares (PLS) with the plsr function from the pls library, iv) ridge regression with the cv.glmnet or glmnet function from the glmnet library, v) lasso with the cv.glmnet or glmnet function from the glmnet library, and vi) relaxed lasso, which is OLS applied to the lasso active set (nontrivial predictors with nonzero coefficients) and a constant.

These six methods produce M models and use a criterion to select the final model (e.g. Cp or 10-fold cross validation (CV)). See Section 5.10. The number of models M depends on the method. One of the models is the full model (5.1) that uses all p − 1 nontrivial predictors. The full model is (approximately) fit with (ordinary) least squares. For one of the M models, some of the methods use η = 0 and fit the model Yi = β1 + ei with Ŷi ≡ Ȳ that uses none of the nontrivial predictors. Forward selection, PCR, and PLS use variables v1 = 1 (the constant or trivial predictor) and vj = γjᵀx that are linear combinations of the predictors for j = 2, ..., p. Model Ii uses variables v1, v2, ..., vi for i = 1, ..., M where M ≤ p and often M ≤ min(p, n/10). Then M models Ii are used. (For forward selection and PCR, OLS is used to regress Y (or Z) on v1, ..., vi.) Then a criterion chooses the final submodel Id from candidates I1, ..., IM.

For forward selection, PCR, PLS, lasso, and relaxed lasso, let d be the number of predictors vj = γjᵀx in the final model (with nonzero coefficients), including a constant v1. Then d is a false degrees of freedom for the model. For forward selection, lasso, and relaxed lasso, vj corresponds to a single nontrivial predictor, say vj = x*j = xkj.

Overfitting or “fitting noise” occurs when there is not enough data to estimate the p × 1 vector β well with the estimation method, such as OLS. The OLS model is overfitting if n < 5p. When n > p, X is not invertible, but if n = p, then Ŷ = HY = X(XᵀX)⁻¹XᵀY = InY = Y regardless of how bad the predictors are. If n < p, then the OLS program fails or Ŷ = Y: the fitted regression plane interpolates the training data response variables Y1, ..., Yn. The following rule of thumb is useful for many regression methods. Note that d = p for the full OLS model.
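Before the rule of thumb, here is a tiny illustration of the interpolation claim above; this is only a sketch with made up noise predictors.

set.seed(9)
n <- 5; p <- 5
X <- cbind(1, matrix(rnorm(n*(p-1)), n, p-1))  #pure noise nontrivial predictors
Y <- rnorm(n)
Yhat <- X %*% solve(t(X) %*% X, t(X) %*% Y)  #OLS fitted values with n = p
max(abs(Yhat - Y))  #essentially 0: the fitted plane interpolates the data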


Rule of thumb 5.1. We want n ≥ 10d to avoid overfitting. Occasionally n as low as 5d is used, but models with n < 5d are overfitting.

Lasso and ridge regression have a parameter λ. When λ = 0, the OLS full model is used. Otherwise, the centered response and scaled nontrivial predictors are used with Z = Wη + e. See Remark 5.1. These methods also use a maximum value λM of λ and a grid of M λ values 0 ≤ λ1 < λ2 < · · · < λM−1 < λM where often λ1 = 0. For lasso, λM is the smallest value of λ such that η̂λM = 0. Hence η̂λi ≠ 0 for i < M.

See James et al. (2013, ch. 6) for more details about forward selection, PCR, PLS, ridge regression, lasso, and relaxed lasso. These methods will also be described in more detail in the following six sections.

Remark 5.5. Use Zn ∼ ANr(µn, Σn) to indicate that a normal approximation is used: Zn ≈ Nr(µn, Σn). Let a be a constant, let A be a k × r constant matrix (often with full rank k ≤ r), and let c be a k × 1 constant vector. If √n(θ̂n − θ) →D Nr(0, V), then aZn = aIrZn with A = aIr,

aZn ∼ ANr(aµn, a²Σn), and AZn + c ∼ ANk(Aµn + c, AΣnAᵀ),

θ̂n ∼ ANr(θ, V/n), and Aθ̂n + c ∼ ANk(Aθ + c, AVAᵀ/n).

The following theorem is analogous to the central limit theorem and the theory for the t–interval for µ based on Ȳ and the sample standard deviation (SD) SY. If the data Y1, ..., Yn are iid with mean µ and variance σ², then Ȳ is asymptotically normal and the t–interval will perform well if the sample size is large enough. The result below suggests that the OLS estimators Ŷi and β̂ are good if the sample size is large enough. The condition max hi → 0 in probability usually holds if the researcher picked the design matrix X or if the xi are iid random vectors from a well behaved population. Outliers can cause the condition to fail. See Sen and Singer (1993, p. 280) for the theorem, which implies that β̂ ≈ Np(β, σ²(XᵀX)⁻¹) or β̂ ∼ ANp(β, MSE(XᵀX)⁻¹). See Remark 5.5 for notation. The ith leverage hi is the ith diagonal element of the hat matrix H = X(XᵀX)⁻¹Xᵀ.

Theorem 5.1, LS CLT (Least Squares Central Limit Theorem): Consider the MLR model Yi = xiᵀβ + ei and assume that the zero mean errors are iid with E(ei) = 0 and VAR(ei) = σ². Assume p is fixed and n → ∞. Also assume that max(h1, ..., hn) → 0 in probability as n → ∞ and

XᵀX/n → V⁻¹

as n → ∞. Then the least squares (OLS) estimator β̂ satisfies

√n(β̂ − β) →D Np(0, σ²V).   (5.12)

Equivalently,

(XᵀX)^{1/2}(β̂ − β) →D Np(0, σ²Ip).   (5.13)
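A small simulation sketch of Theorem 5.1 follows; the design and the error distribution are assumptions chosen only for illustration. The simulated standard deviation of a slope should be close to the asymptotic standard deviation σ√[(XᵀX)⁻¹]jj.

set.seed(4)
n <- 200; p <- 3; beta <- c(1, 2, 0); sigma <- 1
X <- cbind(1, matrix(rnorm(n*(p-1)), n, p-1))  #fixed design
bhat2 <- replicate(5000, {
  y <- X %*% beta + sigma * rnorm(n)
  solve(t(X) %*% X, t(X) %*% y)[2]  #OLS slope for the first nontrivial predictor
})
c(mean(bhat2), sd(bhat2), sigma*sqrt(solve(t(X) %*% X)[2, 2]))
#the simulated mean and SD are near beta_2 = 2 and the asymptotic SD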

When minimizing or maximizing a real valued function Q(η) of the k × 1 vector η, the solution η̂ is found by setting the gradient of Q(η) equal to 0. The following definition and lemma follow Graybill (1983, pp. 351-352) closely. Maximum likelihood estimators are examples of estimating equations: there is a vector of parameters η, and the gradient of the log likelihood function log L(η) is set to zero. The solution η̂ is the MLE, an estimator of the parameter vector η, but in the log likelihood, η is a dummy variable vector, not the fixed unknown parameter vector.

Definition 5.4. Let Q(η) be a real valued function of the k × 1 vector η. The gradient of Q(η) is the k × 1 vector

∇Q = ∇Q(η) = ∂Q/∂η = ∂Q(η)/∂η = (∂Q(η)/∂η1, ∂Q(η)/∂η2, ..., ∂Q(η)/∂ηk)ᵀ.

Suppose there is a model with unknown parameter vector η. A set of estimating equations f(η) is used to maximize or minimize Q(η) where η is a dummy variable vector. Often f(η) = ∇Q, and we solve f(η) = ∇Q = 0 for the solution η̂, where f : Rᵏ → Rᵏ. Note that η̂ is an estimator of the unknown parameter vector η in the model, but η is a dummy variable in Q(η). Hence we could use Q(b) instead of Q(η), but the solution of the estimating equations would still be b = η̂.

As a mnemonic (memory aid) for the following lemma, note that the derivatives (d/dx)(ax) = (d/dx)(xa) = a and (d/dx)(ax²) = (d/dx)(xax) = 2ax.

Lemma 5.2. a) If Q(η) = aᵀη = ηᵀa for some k × 1 constant vector a, then ∇Q = a.
b) If Q(η) = ηᵀAη for some k × k constant matrix A, then ∇Q = 2Aη.
c) If Q(η) = ∑_{i=1}^k |ηi| = ‖η‖1, then ∇Q = s = sη where si = sign(ηi), where sign(ηi) = 1 if ηi > 0 and sign(ηi) = −1 if ηi < 0. This gradient is only defined for η where none of the k values of ηi are equal to 0.


Example 5.1. If Z = Wη + e, then the OLS estimator minimizes Q(η) = ‖Z − Wη‖²2 = (Z − Wη)ᵀ(Z − Wη) = ZᵀZ − 2ZᵀWη + ηᵀ(WᵀW)η. Using Lemma 5.2 with aᵀ = ZᵀW and A = WᵀW shows that ∇Q = −2WᵀZ + 2(WᵀW)η. Let ∇Q(η̂) denote the gradient evaluated at η̂. Then the OLS estimator satisfies the normal equations (WᵀW)η̂ = WᵀZ.

Example 5.2. The Hebbler (1847) data was collected from n = 26 districts in Prussia in 1843. We will study the relationship between Y = the number of women married to civilians in the district with the predictors x1 = constant, x2 = pop = the population of the district in 1843, x3 = mmen = the number of married civilian men in the district, x4 = mmilmen = the number of married men in the military in the district, and x5 = milwmn = the number of women married to husbands in the military in the district. Sometimes the person conducting the survey would not count a spouse if the spouse was not at home. Hence Y is highly correlated but not equal to x3. Similarly, x4 and x5 are highly correlated but not equal. We expect that Y = x3 + e is a good model, but n/p = 5.2 is small. See the following output.

; ls.print(out)

Residual Standard Error=392.8709

R-Square=0.9999, p-value=0

F-statistic (df=4, 21)=67863.03

Estimate Std.Err t-value Pr(>|t|)

Intercept 242.3910 263.7263 0.9191 0.3685

pop 0.0004 0.0031 0.1130 0.9111

mmen 0.9995 0.0173 57.6490 0.0000

mmilmen -0.2328 2.6928 -0.0864 0.9319

milwmn 0.1531 2.8231 0.0542 0.9572

res<-out$res

yhat<-Y-res #d = 5 predictors used including x_1

AERplot2(yhat,Y,res=res,d=5)

#response plot with 90% pointwise PIs

$respi #90% PI for a future residual

[1] -950.4811 1445.2584 #90% PI length = 2395.74

5.2 Forward Selection

Forward selection uses variables v1 = x1 = x*1 = 1 (the constant or trivial predictor) and vj = γjᵀx = x*j = xkj for j = 1, ..., p. Hence the linear combination of variables vj is just one of the predictors, x*j. Then M models Ii are used where OLS is used to regress Y on 1, x*2, ..., x*i. Then a criterion chooses which model Id from candidates I1, ..., IM is to be used as the final submodel.


Let submodel I be the model Y = XIβI + e that only uses a subset xI of the predictors. For submodel I, let the vector of fitted values be Ŷ(I) = ŶI and let the vector of residuals be r(I) = rI.

We suggest the following method of forming a sequence of submodels I1, ..., IM where Ij uses j predictors including the constant. Let I1 use x*1 = x1 ≡ 1: the model has a constant but no nontrivial predictors. To form I2, consider all models I with two predictors including x*1. Compute Q2(I) = SSE(I) = RSS(I) = rᵀ(I)r(I) = ∑_{i=1}^n r²i(I) = ∑_{i=1}^n (Yi − Ŷi(I))², where RSS stands for residual sum of squares and SSE stands for sum of squared errors. Let I2 minimize Q2(I) for the p − 1 models I that contain x*1 and one other predictor. Denote the predictors in I2 by x*1, x*2. In general, to form Ij consider all models I with j predictors including variables x*1, ..., x*j−1. Compute Qj(I) = rᵀ(I)r(I) = ∑_{i=1}^n r²i(I) = ∑_{i=1}^n (Yi − Ŷi(I))². Let Ij minimize Qj(I) for the p − j + 1 models I that contain x*1, ..., x*j−1 and one other predictor not already selected. Denote the predictors in Ij by x*1, ..., x*j. Continue in this manner for j = 2, ..., M. Often M = min(⌈n/J⌉, p) for some positive integer J such as J = 5, 10, or 20. Here ⌈x⌉ is the smallest integer ≥ x, e.g., ⌈7.7⌉ = 8.

Remark 5.6. Suppose n/J is an integer. If p ≤ n/J, then forward selection fits (p − 1) + (p − 2) + · · · + 2 + 1 = p(p − 1)/2 ≈ p²/2 models, where p − i models are fit at step i for i = 1, ..., (p − 1). If n/J < p, then forward selection uses (n/J) − 1 steps and fits approximately (p − 1) + (p − 2) + · · · + (p − (n/J) + 1) = p((n/J) − 1) − (1 + 2 + · · · + ((n/J) − 1)) =

p(n/J − 1) − (n/J)(n/J − 1)/2 ≈ (n/J)(2p − n/J)/2

models. Thus forward selection can be slow if n and p are both large, although the R package leaps uses a branch and bound algorithm that likely eliminates many of the possible fits. Note that after step i, the model has i + 1 predictors, including the constant.
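The counts in Remark 5.6 are easy to verify with a short function; this is an illustrative sketch assuming n/J is an integer.

nmods <- function(n, p, J){  #number of models fit by forward selection
  m <- n/J
  if(p <= m) p*(p - 1)/2 else p*(m - 1) - m*(m - 1)/2
}
nmods(n = 100, p = 10, J = 5)   #p <= n/J: p(p-1)/2 = 45 models
nmods(n = 100, p = 200, J = 5)  #n/J < p: 200(19) - 20(19)/2 = 3610 models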

When there is a sequence of M submodels, the final submodel Id needs to be selected. Let xI and βI be a × 1. Hence the candidate model contains a terms, including a constant. Suppose the ei are independent and identically distributed (iid) with variance V(ei) = σ². Then there are many criteria used to select the final submodel Id. A simple method is to take the model that uses d = min(⌈n/J⌉, p) variables. If p is fixed, the method will use the OLS full model once n/J ≥ p. For a given data set, p, n, and σ² act as constants, and a criterion below may add a constant or be divided by a positive constant without changing the subset Imin that minimizes the criterion.

Let criteria CS(I) have the form

CS(I) = SSE(I) + aKnσ̂².

These criteria need a good estimator of σ². The criterion Cp(I) = AICS(I) uses Kn = 2, while the BICS(I) criterion uses Kn = log(n). Typically σ̂² is the OLS full model

MSE = ∑_{i=1}^n r²i / (n − p)

when n/p is large. Then σ̂² = MSE is a √n consistent estimator of σ² under mild conditions by Su and Cook (2012).

It is hard to get a good estimator of σ² when n/p is not large. The following criteria are described in Burnham and Anderson (2004), but still need n/p large. AIC is due to Akaike (1973), AICC is due to Hurvich and Tsai (1991), and BIC to Schwarz (1978) and Akaike (1977, 1978).

AIC(I) = n log(SSE(I)/n) + 2a,

AICC(I) = n log(SSE(I)/n) + 2a(a + 1)/(n − a − 1),

and

BIC(I) = n log(SSE(I)/n) + a log(n).

Let Imin be the submodel that minimizes the criterion. Following Seber and Lee (2003, p. 448), Nishi (1984), and Shao (1993), the probability that model Imin from Cp or AIC underfits goes to zero as n → ∞. If β̂I is a × 1, form the p × 1 vector β̂I,0 from β̂I by adding 0s corresponding to the omitted variables. Since fewer than 2^p regression models I contain the true model, and each such model gives a √n consistent estimator β̂I,0 of β, the probability that Imin picks one of these models goes to one as n → ∞. Hence β̂Imin,0 is a √n consistent estimator of β under model (5.10). Olive (2017a: p. 123, 2017b: p. 176) showed that β̂Imin,0 is a consistent estimator.

The EBIC criterion given in Luo and Chen (2013) may work when n/p is not large. Let 0 ≤ γ ≤ 1 and |I| = a ≤ min(n, p) if β̂I is a × 1. We may use a ≤ min(n/5, p). Then

EBIC(I) = n log(SSE(I)/n) + a log(n) + 2γ log[C(p, a)] = BIC(I) + 2γ log[C(p, a)]

where C(p, a) is the binomial coefficient “p choose a.” This criterion can give good results if p = pn = O(n^k) and γ > 1 − 1/(2k). Hence we will use γ = 1. Then minimizing EBIC(I) is equivalent to minimizing BIC(I) − 2 log[(p − a)!] − 2 log(a!) since log(p!) is a constant.

The above criteria can be applied to forward selection and relaxed lasso. The Cp criterion can also be applied to lasso. See, for example, Efron and Hastie (2016, pp. 221, 231).


Next we give some alternative forms for Cp(I). For multiple linear regression, if the candidate model of xI has a terms (including the constant), then the partial F statistic for testing whether the p − a predictor variables in xO can be deleted is

FI = [(SSE(I) − SSE)/((n − a) − (n − p))] / [SSE/(n − p)] = ((n − p)/(p − a)) [SSE(I)/SSE − 1]

where SSE and MSE are for the full model, and SSE(I) and MSE(I) are for submodel I.

Definition 5.5. Let n > p, and let model I use a predictors including a constant. Then the Cp criterion is

Cp(I) = SSE(I)/MSE + 2a − n = (p − a)(FI − 1) + a.

Minimizing Cp(I) is equivalent to minimizing

MSE [Cp(I)] = SSE(I) + (2a − n)MSE = rᵀ(I)r(I) + (2a − n)MSE.
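The Cp(I) formula is simple to compute by hand, as in the sketch below; the simulated data and the submodel choice are assumptions for illustration only.

set.seed(5)
n <- 100; x <- matrix(rnorm(n*4), n, 4)
y <- drop(1 + x[,1] + rnorm(n))
full <- lm(y ~ x)  #full model with p = 5 terms including the constant
MSE <- sum(resid(full)^2)/(n - 5)
sub <- lm(y ~ x[,1])  #submodel I with a = 2 terms
Cp <- sum(resid(sub)^2)/MSE + 2*2 - n
Cp  #near a = 2 when submodel I does not underfit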

If n ≥ Jp where J ≥ 5 and preferably J ≥ 10, then we will often use the OLS forward selection model Imin that minimizes Cp(I) for models I1, I2, ..., IM where M = p. Olive (2017a, section 3.4) has some useful results for Cp when p is fixed and n → ∞. See Section 4.3.5 for more information on OLS forward selection.

If p > n/5, we may use M = ⌈n/J⌉, but a different criterion than Cp(I) should be used. The R function regsubsets can be used for forward selection if p < n, and also if p ≥ n provided the maximum number of variables used is less than n; then warning messages are common. Some R code is shown below.

#regsubsets works if p < n, e.g. p = n-1, and works

#if p > n with warnings if nvmax is small enough

set.seed(13)

n<-100

p<-200

k<-19 #the first 19 nontrivial predictors are active

J<-5

q <- p-1

b <- 0 * 1:q

b[1:k] <- 1 #beta = (1, 1, ..., 1, 0, 0, ..., 0)ˆT

x <- matrix(rnorm(n * q), nrow = n, ncol = q)

y <- 1 + x %*% b + rnorm(n)

nc <- ceiling(n/J)-1 #the constant will also be used

nc <- min(nc,q)

nc <- max(nc,1) #nc is the maximum number of

#nontrivial predictors used by forward selection


pp <- nc+1 #d = pp is the false degrees of freedom

vars <- as.vector(1:(p-1))

temp<-regsubsets(x,y,nvmax=nc,method="forward")

out<-summary(temp)

num <- length(out$cp)

mod <- out$which[num,] #use the last model

#do not need the constant in vin

vin <- vars[mod[-1]]

out$rss

[1] 1496.49625 1342.95915 1214.93174 1068.56668

973.36395 855.15436 745.35007 690.03901

638.40677 590.97644 542.89273 503.68666

467.69423 420.94132 391.41961 328.62016

242.66311 178.77573 79.91771

out$bic

[1] -9.4032 -15.6232 -21.0367 -29.2685

-33.9949 -42.3374 -51.4750 -54.5804

-57.7525 -60.8673 -64.7485 -67.6391

-70.4479 -76.3748 -79.0410 -91.9236

-117.6413 -143.5903 -219.498595

tem <- lsfit(x[,1:19],y) #last model used the

sum(tem$residˆ2) #first 19 predictors

[1] 79.91771 #SSE(I) = RSS(I)

n*log(out$rss[19]/n) + 20*log(n)

[1] 69.68613 #BIC(I)

for(i in 1:19) #a formula for BIC(I)

print( n*log(out$rss[i]/n) + (i+1)*log(n) )

bic <- c(279.7815, 273.5616, 268.1480, 259.9162,

255.1898, 246.8474, 237.7097, 234.6043, 231.4322,

228.3175, 224.4362, 221.5456, 218.7368, 212.8099,

210.1437, 197.2611, 171.5435, 145.5944, 69.6861)

tem<-lsfit(bic,out$bic)

tem$coef

Intercept X

-289.1846831 0.9999998 #bic - 289.1847 = out$bic

xx <- 1:min(length(out$bic),p-1)+1

ebic <- out$bic+2*log(dbinom(x=xx,size=p,prob=0.5))

#actually EBIC(I) - 2 p log(2).

Example 5.2, continued. The output below shows results from forward selection for the marry data. The minimum Cp model Imin uses a constant and mmen. The forward selection PIs are shorter than the OLS full model PIs.

library(leaps);Y <- marry[,3]; X <- marry[,-3]


temp<-regsubsets(X,Y,method="forward")

out<-summary(temp)

Selection Algorithm: forward

pop mmen mmilmen milwmn

1 ( 1 ) " " "*" " " " "

2 ( 1 ) " " "*" "*" " "

3 ( 1 ) "*" "*" "*" " "

4 ( 1 ) "*" "*" "*" "*"

out$cp

[1] -0.8268967 1.0151462 3.0029429 5.0000000

#mmen and a constant = Imin

mincp <- out$which[out$cp==min(out$cp),]

#do not need the constant in vin

vin <- vars[mincp[-1]]

sub <- lsfit(X[,vin],Y)

ls.print(sub)

Residual Standard Error=369.0087

R-Square=0.9999

F-statistic (df=1, 24)=307694.4

Estimate Std.Err t-value Pr(>|t|)

Intercept 241.5445 190.7426 1.2663 0.2175

X 1.0010 0.0018 554.7021 0.0000

res<-sub$res

yhat<-Y-res #d = 2 predictors used including x_1

AERplot2(yhat,Y,res=res,d=2)

#response plot with 90% pointwise PIs

$respi #90% PI for a future residual

[1] -778.2763 1336.4416 #length 2114.72

Consider forward selection where xI is a × 1. Underfitting occurs if S is not a subset of I, so xI is missing important predictors. A special case of underfitting is d = a < aS. Overfitting for forward selection occurs if i) n < 5a so there is not enough data to estimate the a parameters in βI well, or ii) S ⊆ I but S ≠ I. Overfitting is serious if n < 5a, but “not much of a problem” if n > Jp where J = 10 or 20 for many data sets. Underfitting is a serious problem. Let Y = xIᵀβI + eI. Then V(eI,i) may not be a constant σ²: V(eI,i) could depend on case i, and the model may no longer be linear. Check model I with response and residual plots.

Forward selection is a shrinkage method: p models are produced and, except for the full model, some |β̂i| are shrunk to 0. Lasso and ridge regression are also shrinkage methods. Ridge regression is a shrinkage method, but |β̂i| is not shrunk to 0. Shrinkage methods that shrink β̂i to 0 are also variable selection methods. See Sections 5.5, 5.6, and 5.8.

Definition 5.6. Suppose the population MLR model has βS an aS × 1 vector. The population MLR model is sparse if aS is small. The population MLR model is dense or abundant if n/aS < J where J = 5 or J = 10, say. The fitted model β̂ = β̂Imin,0 is sparse if d = the number of nonzero coefficients is small. The fitted model is dense if n/d < J where J = 5 or J = 10.

5.3 Principal Components Regression

Some notation for eigenvalues, eigenvectors, orthonormal eigenvectors, positive definite matrices, and positive semidefinite matrices will be useful before defining principal components regression, which is also called principal component regression.

Notation: Recall that a square symmetric p × p matrix A has an eigenvalue λ with corresponding eigenvector x ≠ 0 if

Ax = λx.   (5.14)

The eigenvalues of A are real since A is symmetric. Note that if constant c ≠ 0 and x is an eigenvector of A, then c x is an eigenvector of A. Let e be an eigenvector of A with unit length ‖e‖2 = √(eᵀe) = 1. Then e and −e are eigenvectors with unit length, and A has p eigenvalue eigenvector pairs (λ1, e1), (λ2, e2), ..., (λp, ep). Since A is symmetric, the eigenvectors are chosen such that the ei are orthonormal: eiᵀei = 1 and eiᵀej = 0 for i ≠ j. The symmetric matrix A is positive definite iff all of its eigenvalues are positive, and positive semidefinite iff all of its eigenvalues are nonnegative. If A is positive semidefinite, let λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0. If A is positive definite, then λp > 0.

Theorem 5.3. Let A be a p × p symmetric matrix with eigenvector eigenvalue pairs (λ1, e1), (λ2, e2), ..., (λp, ep) where eiᵀei = 1 and eiᵀej = 0 if i ≠ j for i = 1, ..., p. Then the spectral decomposition of A is

A = ∑_{i=1}^p λi ei eiᵀ = λ1 e1 e1ᵀ + · · · + λp ep epᵀ.

Using the same notation as Johnson and Wichern (1988, pp. 50-51), let P = [e1 e2 · · · ep] be the p × p orthogonal matrix with ith column ei. Then PPᵀ = PᵀP = I. Let Λ = diag(λ1, ..., λp) and let Λ^{1/2} = diag(√λ1, ..., √λp). If A is a positive definite p × p symmetric matrix with spectral decomposition A = ∑_{i=1}^p λi ei eiᵀ, then A = PΛPᵀ and

A⁻¹ = PΛ⁻¹Pᵀ = ∑_{i=1}^p (1/λi) ei eiᵀ.

Theorem 5.4. Let A be a positive definite p × p symmetric matrix with spectral decomposition A = ∑_{i=1}^p λi ei eiᵀ. The square root matrix A^{1/2} = PΛ^{1/2}Pᵀ is a positive definite symmetric matrix such that A^{1/2}A^{1/2} = A.
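The R function eigen gives the quantities in Theorems 5.3 and 5.4, as the following small check shows; the matrix A is an arbitrary positive definite matrix chosen for illustration.

A <- matrix(c(4, 1, 1, 3), 2, 2)  #symmetric positive definite
ev <- eigen(A, symmetric = TRUE)
P <- ev$vectors; lam <- ev$values
all.equal(A, P %*% diag(lam) %*% t(P))  #spectral decomposition
Ahalf <- P %*% diag(sqrt(lam)) %*% t(P)  #square root matrix
all.equal(A, Ahalf %*% Ahalf)
all.equal(solve(A), P %*% diag(1/lam) %*% t(P))  #A^{-1} from the decomposition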

Principal components regression (PCR) uses OLS regression on the principal components of the correlation matrix Ru of the p − 1 nontrivial predictors u1 = x2, ..., up−1 = xp. Suppose Ru has eigenvalue eigenvector pairs (λ1, e1), ..., (λK, eK) where λ1 ≥ λ2 ≥ · · · ≥ λK ≥ 0 and K = min(n, p − 1). Then Ruei = λiei for i = 1, ..., K. Since Ru is a symmetric positive semidefinite matrix, the λi are real and nonnegative.

The eigenvectors ei are orthonormal: eiᵀei = 1 and eiᵀej = 0 for i ≠ j. If the eigenvalues are unique, then ei and −ei are the only orthonormal eigenvectors corresponding to λi. For example, the eigenvalue eigenvector pairs can be found using the singular value decomposition of the matrix Wg/√(n − g) where Wg is the matrix of the standardized nontrivial predictors wi, the sample covariance matrix

Σ̂w = WgᵀWg/(n − g) = (1/(n − g)) ∑_{i=1}^n (wi − w̄)(wi − w̄)ᵀ = (1/(n − g)) ∑_{i=1}^n wi wiᵀ = Ru,

and usually g = 0 or g = 1. If n > K = p − 1, then the spectral decomposition of Ru is

Ru = ∑_{i=1}^{p−1} λi ei eiᵀ = λ1 e1 e1ᵀ + · · · + λp−1 ep−1 ep−1ᵀ,

and ∑_{i=1}^{p−1} λi = p − 1.

Let w1, ..., wn denote the standardized vectors of nontrivial predictors. Then the K principal components corresponding to the jth case wj are Pj1 = e1ᵀwj, ..., PjK = eKᵀwj. Following Hastie et al. (2001, p. 62), the ith eigenvector ei is known as the ith principal component direction or Karhunen Loeve direction of Wg.

Principal components have a nice geometric interpretation if n > K = p − 1. If n > K and Ru is nonsingular, then the hyperellipsoid

{w | D²w(0, Ru) ≤ h²} = {w : wᵀRu⁻¹w ≤ h²}

is centered at 0. The volume of the hyperellipsoid is

(2π^{K/2} / (K Γ(K/2))) |Ru|^{1/2} h^K.

Then points at squared distance wᵀRu⁻¹w = h² from the origin lie on the hyperellipsoid centered at the origin whose axes are given by the eigenvectors ei, where the half length in the direction of ei is h√λi. Let j = 1, ..., n. Then the first principal component Pj1 is obtained by projecting the wj on the (longest) major axis of the hyperellipsoid, the second principal component Pj2 is obtained by projecting the wj on the next longest axis of the hyperellipsoid, ..., and the (p − 1)th principal component Pj,p−1 is obtained by projecting the wj on the (shortest) minor axis of the hyperellipsoid. Examine Figure 4.3 for two ellipsoids with 2 nontrivial predictors. The axes of the hyperellipsoid are a rotation of the usual axes about the origin.

Let the random variable Vi correspond to the ith principal component, and let (P1i, ..., Pni)ᵀ = (V1i, ..., Vni)ᵀ be the observed data for Vi. Let g = 1. Then the sample mean

V̄i = (1/n) ∑_{k=1}^n Vki = (1/n) ∑_{k=1}^n eiᵀwk = eiᵀw̄ = eiᵀ0 = 0,

and the sample covariance of Vi and Vj is

Cov(Vi, Vj) = (1/n) ∑_{k=1}^n (Vki − V̄i)(Vkj − V̄j) = (1/n) ∑_{k=1}^n eiᵀwkwkᵀej = eiᵀRuej = λjeiᵀej = 0 for i ≠ j

since the sample covariance matrix of the standardized data is

(1/n) ∑_{k=1}^n wkwkᵀ = Ru

and Ruej = λjej. Hence Vi and Vj are uncorrelated.

PCR uses linear combinations of the standardized data as predictors. Let Vj = ejᵀw for j = 1, ..., K. Let model Ji contain V1, ..., Vi. Then for model Ji, use OLS regression of Z = Y − Ȳ on V1, ..., Vi with Ŷ = Ẑ + Ȳ. Since linear combinations of w are linear combinations of x, Ŷ = Xβ̂PCR,Ij where the model Ij uses a constant and the first j − 1 PCR components.
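A hedged sketch of PCR “by hand” is given below: the principal components of the standardized predictors are computed with eigen, checked to be uncorrelated, and the centered response is regressed on the first j components with OLS. The simulated data and the choice j = 2 are assumptions; the pls package's pcr function should be used in practice.

set.seed(7)
n <- 100; x <- matrix(rnorm(n*4), n, 4)
y <- drop(1 + x %*% c(1, 1, 0, 0) + rnorm(n))
w <- scale(x)  #standardized nontrivial predictors
e <- eigen(cor(x), symmetric = TRUE)$vectors  #eigenvectors of R_u
V <- w %*% e  #principal components V_1, ..., V_4
round(cor(V), 10)  #off diagonal entries are 0: components are uncorrelated
j <- 2  #regress the centered response Z on V_1, ..., V_j
fit <- lm(I(y - mean(y)) ~ V[, 1:j] - 1)
yhat <- mean(y) + fitted(fit)  #fitted values on the original scale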

Notation: Just as we use xi or Xi to denote the ith predictor, we will use vj or Vj to denote predictors that are linear combinations of the original predictors: e.g. vj = Vj = γjᵀx or vj = Vj = γjᵀu.

Remark 5.7. The set of (p − 1) × 1 vectors {(1, 0, ..., 0)ᵀ, (0, 1, 0, ..., 0)ᵀ, ..., (0, ..., 0, 1)ᵀ} is the standard basis for R^{p−1}. The set of vectors {e1, ..., ep−1} is also a basis for R^{p−1}. For PCR and some constants θi, ∑_{i=1}^j θi eiᵀw = ∑_{i=1}^{p−1} ηiwi if j = p − 1, but not if j < p − 1 in general. Hence PCR tends to give inconsistent estimators unless P(j = p − 1) = P(PCR uses the OLS full model) goes to one.

There are at least two problems with PCR. i) In general, β̂PCR,Ij is an inconsistent estimator of β unless P(j → p − 1) = P(β̂PCR,Ij → β̂OLS) → 1 as n → ∞. ii) Generally there is no reason why the predictors should be ranked from best to worst by V1, V2, ..., VK. For example, the last few principal components (and a constant) could be much better for prediction than the other principal components. See Jolliffe (1983). If n ≥ 10p, often PCR needs to use all p − 1 components (i.e., PCR = OLS full model) to be competitive with other regression models. Performing OLS forward selection or lasso on V1, ..., VK may be more effective. There is one exception. Suppose ∑_{i=1}^J λi ≥ q(p − 1) where 0.5 ≤ q ≤ 1, e.g. q = 0.8, where J is a lot smaller than p − 1. Then the J predictors V1, ..., VJ capture much of the information of the standardized nontrivial predictors w1, ..., wp−1. Then regressing Y on 1, V1, ..., VJ may be competitive with regressing Y on 1, w1, ..., wp−1. This exception tends to occur when p is very small, and is an example of dimension reduction. PCR is equivalent to OLS on the full model when Y is regressed on a constant and all K of the principal components. PCR can also be useful if X is singular or nearly singular (ill conditioned).

Example 5.2, continued. The PCR output below shows results for the marry data where 10-fold CV was used. The OLS full model was selected.

library(pls); y <- marry[,3]; x <- marry[,-3]

z <- as.data.frame(cbind(y,x))

out<-pcr(y~.,data=z,scale=T,validation="CV")

tem<-MSEP(out)

tem

(Int) 1 comps 2 comps 3 comps 4 comps

CV 1.743e+09 449479706 8181251 371775 197132

cvmse<-tem$val[,,1:(out$ncomp+1)][1,]

nc <-max(which.min(cvmse)-1,1)

res <- out$residuals[,,nc]

yhat<-y-res #d = 5 predictors used including constant

AERplot2(yhat,y,res=res,d=5)

#response plot with 90% pointwise PIs

$respi #90% PI same as OLS full model

-950.4811 1445.2584 #PI length = 2395.74

5.4 Partial Least Squares

Partial least squares (PLS) uses variables v1 = 1 (the constant or trivial predictor) and “PLS components” vj = γj^T x for j = 2, ..., p. Next let the response Y be used with the standardized predictors Wj. Let the “PLS components” Vj = gj^T w. Let model Ji contain V1, ..., Vi. Often k–fold cross validation is used to pick the PLS model from J1, ..., JM. PLS seeks directions gj such that the PLS components Vj are highly correlated with Y, subject to being uncorrelated with other PLS components Vi for i ≠ j. Note that PCR components are formed without using Y.

Remark 5.8. PLS may or may not give a consistent estimator of β ifp/n does not go to zero: rather strong regularity conditions have been usedto prove consistency or inconsistency if p/n does not go to zero. See Chunand Keles (2010), Cook (2018), Cook et al. (2013), and Cook and Forzani(2018ab).

Following Hastie et al. (2009, pp. 80-81), let W = [s1, ..., sp−1] so sj is the vector corresponding to the standardized jth nontrivial predictor. Let g1j = sj^T Y be n times the least squares coefficient from regressing Y on sj. Then the first PLS direction g1 = (g11, ..., g1,p−1)^T. Note that W gi = (Vi1, ..., Vin)^T = pi is the ith PLS component. This process is repeated using matrices W^k = [s1^k, ..., sp−1^k] where W^0 = W and W^k is orthogonalized with respect to pk for k = 1, ..., p − 2. So sj^k = sj^{k−1} − [pk^T sj^{k−1}/(pk^T pk)] pk for j = 1, ..., p − 1. If the PLS model Ii uses a constant and PLS components V1, ..., Vi−1, let ŶIi be the predicted values from the PLS model using Ii. Then ŶIi = ŶIi−1 + θi pi where ŶI0 = Ȳ 1 and θi = pi^T Y/(pi^T pi). Since linear combinations of w are linear combinations of x, Ŷ = XβPLS,Ij where Ij uses a constant and the first j − 1 PLS components. If j = p, then the PLS model Ip is the OLS full model.

Example 5.2, continued. The PLS output below shows results for the marry data where 10-fold CV was used. The OLS full model was selected.

library(pls); y <- marry[,3]; x <- marry[,-3]

z <- as.data.frame(cbind(y,x))

out<-plsr(y~.,data=z,scale=T,validation="CV")

tem<-MSEP(out)

tem

(Int) 1 comps 2 comps 3 comps 4 comps

CV 1.743e+09 256433719 6301482 249366 206508

cvmse<-tem$val[,,1:(out$ncomp+1)][1,]

nc <-max(which.min(cvmse)-1,1)

res <- out$residuals[,,nc]

yhat<-y-res #d = 5 predictors used including constant

AERplot2(yhat,y,res=res,d=5)

$respi #90% PI same as OLS full model

-950.4811 1445.2584 #PI length = 2395.74

The Mevik et al. (2015) pls library is useful for computing PLS and PCR.


5.5 Ridge Regression

Consider the MLR model Y = Xβ + e. Ridge regression uses the centered response Zi = Yi − Ȳ and standardized nontrivial predictors in the model Z = Wη + e. Then Ŷi = Ẑi + Ȳ. Note that in Definition 5.8, λ1,n is a tuning parameter, not an eigenvalue. The residuals r = r(βR) = Y − Ŷ.

Definition 5.7. Consider the MLR model Z = Wη + e. Let b be a (p − 1) × 1 vector. Then the fitted value Ẑi(b) = wi^T b and the residual ri(b) = Zi − Ẑi(b). The vector of fitted values Ẑ(b) = Wb and the vector of residuals r(b) = Z − Ẑ(b). If W has full rank p − 1, then the OLS estimator ηOLS = (W^T W)^{−1} W^T Z minimizes the OLS criterion QOLS(η) = r(η)^T r(η) over all vectors η ∈ R^{p−1}.

Definition 5.8. Consider fitting the MLR model Y = Xβ + e using Z = Wη + e. Let λ ≥ 0 be a constant. The ridge regression estimator ηR minimizes the ridge regression criterion

QR(η) = (1/a)(Z − Wη)^T (Z − Wη) + (λ1,n/a) ∑_{i=1}^{p−1} ηi^2    (5.15)

over all vectors η ∈ R^{p−1} where λ1,n ≥ 0 and a > 0 are known constants with a = 1, 2, n, and 2n common. Then

ηR = (W^T W + λ1,n Ip−1)^{−1} W^T Z.    (5.16)

The residual sum of squares RSS(η) = (Z − Wη)^T (Z − Wη), and λ1,n = 0 corresponds to the OLS estimator ηOLS. The ridge regression vector of fitted values is Ẑ = ẐR = WηR, and the ridge regression vector of residuals rR = r(ηR) = Z − ẐR. The estimator is said to be regularized if λ1,n > 0. Obtain Ŷ and βR using ηR, Ẑ, and Ȳ.

Using a vector of parameters η and a dummy vector η in QR is common for minimizing a criterion Q(η), often with estimating equations. See the paragraph below Definition 5.4. We could also write

QR(b) = (1/a) r(b)^T r(b) + (λ1,n/a) b^T b

where the minimization is over all vectors b ∈ R^{p−1}. Note that ∑_{i=1}^{p−1} ηi^2 = η^T η = ‖η‖_2^2. The literature often uses λa = λ = λ1,n/a.

Note that λ1,n b^T b = λ1,n ∑_{i=1}^{p−1} bi^2. Each coefficient bi is penalized equally by λ1,n. Hence using standardized nontrivial predictors makes sense so that if ηi is large in magnitude, then the standardized variable wi is important.


Remark 5.9. i) If λ1,n = 0, the ridge regression estimator becomes the OLS full model estimator: ηR = ηOLS.

ii) If λ1,n > 0, then W^T W + λ1,n Ip−1 is nonsingular. Hence ηR exists even if X and W are singular or ill conditioned, or if p > n.

iii) Following Hastie et al. (2001, p. 77), let the augmented matrix WA be formed by stacking W on top of √λ1,n Ip−1, and let the augmented response vector ZA = (Z^T 0^T)^T where 0 is the (p − 1) × 1 zero vector. For λ1,n > 0, the OLS estimator from regressing ZA on WA is

ηA = (WA^T WA)^{−1} WA^T ZA = ηR

since WA^T ZA = W^T Z and WA^T WA = W^T W + λ1,n Ip−1.

iv) A simple way to regularize a regression estimator, such as the L1 estimator, is to compute that estimator from regressing ZA on WA.

Remark 5.9 iii) is interesting. Note that for λ1,n > 0, the (n + p − 1) × (p − 1) matrix WA has full rank p − 1. The augmented OLS model consists of adding p − 1 pseudo-cases (wn+1^T, Zn+1)^T, ..., (wn+p−1^T, Zn+p−1)^T where Zj = 0 and wj = (0, ..., √λ1,n, 0, ..., 0)^T for j = n + 1, ..., n + p − 1 where the nonzero entry is in the kth position if j = n + k. For centered response and standardized nontrivial predictors, the population OLS regression fit runs through the origin (w^T, Z)^T = (0^T, 0)^T. Hence for λ1,n = 0, the augmented OLS model adds p − 1 typical cases at the origin. If λ1,n is not large, then the pseudo-data can still be regarded as typical cases. If λ1,n is large, the pseudo-data act as w–outliers (outliers in the standardized predictor variables), and the OLS slopes go to zero as λ1,n gets large, making Ẑ ≈ 0 so Ŷ ≈ Ȳ.

To prove Remark 5.9 ii), let (ψ, g) be an eigenvalue eigenvector pair of W^T W = nRu. Then [W^T W + λ1,n Ip−1]g = (ψ + λ1,n)g, and (ψ + λ1,n, g) is an eigenvalue eigenvector pair of W^T W + λ1,n Ip−1 > 0 provided λ1,n > 0.
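A small numerical check of Remark 5.9 iii), not part of linmodpack: OLS applied to the augmented data reproduces the ridge regression estimator (5.16).

set.seed(1)
n <- 50; q <- 4
u <- matrix(rnorm(n * q), n, q)
y <- 1 + u %*% c(1, 1, 0, 0) + rnorm(n)
w <- sqrt(n/(n - 1)) * scale(u)   # standardized nontrivial predictors
z <- y - mean(y)                  # centered response
lam1 <- 5                         # lambda_{1,n}
etaR <- solve(t(w) %*% w + lam1 * diag(q), t(w) %*% z)  # ridge estimator (5.16)
wA <- rbind(w, sqrt(lam1) * diag(q))                    # augmented W_A
zA <- c(z, rep(0, q))                                   # augmented Z_A
etaA <- solve(t(wA) %*% wA, t(wA) %*% zA)               # OLS on the augmented data
max(abs(etaR - etaA))             # essentially 0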

The degrees of freedom for a ridge regression with known λ1,n is also interesting and will be found in the next paragraph. The sample correlation matrix of the nontrivial predictors

Ru = (1/(n − g)) Wg^T Wg

where we will use g = 0 and W = W0. Then W^T W = nRu. By singular value decomposition (SVD) theory, the SVD of W is W = UΛV^T where the positive singular values σi are square roots of the positive eigenvalues of both W^T W and of WW^T. Also V = (e1 e2 · · · ep−1), and W^T W ei = σi^2 ei. Hence λi = σi^2 where λi = λi(W^T W) is the ith eigenvalue of W^T W, and ei is the ith orthonormal eigenvector of Ru and of W^T W. The SVD of W^T is W^T = V Λ^T U^T, and the Gram matrix WW^T is the n × n matrix of scalar products with (i, j)th entry wi^T wj. Warning: Note that σi is the ith singular value of W, not the standard deviation of wi.

Following Hastie et al. (2001, p. 63), if λi = λi(W^T W) is the ith eigenvalue of W^T W where λ1 ≥ λ2 ≥ · · · ≥ λp−1, then the (effective) degrees of freedom for the ridge regression of Z on W with known λ1,n is

df(λ1,n) = tr[W(W^T W + λ1,n Ip−1)^{−1} W^T] = ∑_{i=1}^{p−1} σi^2/(σi^2 + λ1,n) = ∑_{i=1}^{p−1} λi/(λi + λ1,n)    (5.17)

where the trace of a square (p − 1) × (p − 1) matrix A = (aij) is tr(A) = ∑_{i=1}^{p−1} aii = ∑_{i=1}^{p−1} λi(A). Note that the trace of A is the sum of the diagonal elements of A = the sum of the eigenvalues of A.

Note that 0 ≤ df(λ1,n) ≤ p − 1 where df(λ1,n) = p − 1 if λ1,n = 0 and df(λ1,n) → 0 as λ1,n → ∞. The R code below illustrates how to compute ridge regression degrees of freedom.

set.seed(13)

n<-100; q<-3 #q = p-1

b <- 0 * 1:q + 1

u <- matrix(rnorm(n * q), nrow = n, ncol = q)

y <- 1 + u %*% b + rnorm(n) #make MLR model

w1 <- scale(u) #t(w1) %*% w1 = (n-1) R = (n-1)*cor(u)

w <- sqrt(n/(n-1))*w1 #t(w) %*% w = n R = n cor(u)

t(w) %*% w/n

[,1] [,2] [,3]

[1,] 1.00000000 -0.04826094 -0.06726636

[2,] -0.04826094 1.00000000 -0.12426268

[3,] -0.06726636 -0.12426268 1.00000000

cor(u) #same as above

rs <- t(w)%*%w #scaled correlation matrix n R

svs <-svd(w)$d #singular values of w

lambda <- 0

d <- sum(svs^2/(svs^2+lambda))


#effective df for ridge regression using w

d

[1] 3 #= q = p-1
svs^2 #as above
[1] 112.60792 103.88089 83.51119
uu<-scale(u,scale=F) #centered but not scaled
svs <-svd(uu)$d #singular values of uu
svs^2
[1] 135.78205 108.85903 85.83395
d <- sum(svs^2/(svs^2+lambda))

#effective df for ridge regression using uu

#d is again 3 if lambda = 0

In general, if Ẑ = HλZ, then df(Ẑ) = tr(Hλ) where Hλ is an n × n “hat matrix.” For computing Ŷ, df(Ŷ) = df(Ẑ) + 1 since a constant β1 also needs to be estimated. These formulas for degrees of freedom assume that λ is known before fitting the model. The formulas give the false degrees of freedom if λ is selected from M values λ1, ..., λM using a criterion such as k-fold cross validation.

Suppose the ridge regression criterion is written, using a = 2n, as

QR,n(b) = (1/(2n)) r(b)^T r(b) + λ2n b^T b,    (5.18)

as in Hastie et al. (2015, p. 10). Then λ2n = λ1,n/(2n) using the λ1,n from (5.15).

The following remark is interesting if λ1,n and p are fixed. However, λ1,n is usually selected, for example, by 10-fold cross validation. The fact that ηR = An,λ ηOLS appears in Efron and Hastie (2016, p. 98), and Marquardt and Snee (1975). See Theorem 5.5 for the ridge regression central limit theorem.

Remark 5.10. Ridge regression has a simple relationship with OLS if n > p and (W^T W)^{−1} exists. Then

ηR = (W^T W + λ1,n Ip−1)^{−1} W^T Z = (W^T W + λ1,n Ip−1)^{−1}(W^T W)(W^T W)^{−1} W^T Z = An,λ ηOLS

where An,λ ≡ An = (W^T W + λ1,n Ip−1)^{−1} W^T W. By the OLS CLT Theorem 5.1 with V/n = (W^T W)^{−1}, a normal approximation for OLS is

ηOLS ∼ ANp−1(η, MSE (W^T W)^{−1}).

Hence a normal approximation for ridge regression is

ηR ∼ ANp−1(An η, MSE An (W^T W)^{−1} An^T) ∼ ANp−1[An η, MSE (W^T W + λ1,n Ip−1)^{−1}(W^T W)(W^T W + λ1,n Ip−1)^{−1}].


If the ui can be taken as a random sample from a population, then the sample correlation matrix Ru = W^T W/n P→ ρu, the population correlation matrix of the nontrivial predictors ui where x = (1 u^T)^T, and An P→ Ip−1.

Remark 5.11. The ridge regression criterion from Definition 5.8 can also be defined by

QR(η) = ‖Z − Wη‖_2^2 + λ1,n η^T η.    (5.19)

Then by Lemma 5.2, the gradient ∇QR = −2W^T Z + 2(W^T W)η + 2λ1,n η. Cancelling constants and evaluating the gradient at ηR gives the score equations

−W^T(Z − WηR) + λ1,n ηR = 0.    (5.20)

Following Efron and Hastie (2016, pp. 381-382, 392), this means ηR = W^T a for some n × 1 vector a. Hence −W^T(Z − WW^T a) + λ1,n W^T a = 0, or

W^T(WW^T + λ1,n In) a = W^T Z

which has solution a = (WW^T + λ1,n In)^{−1} Z. Hence

ηR = W^T a = W^T(WW^T + λ1,n In)^{−1} Z = (W^T W + λ1,n Ip−1)^{−1} W^T Z.

Using the n × n matrix WW^T is computationally efficient if p > n while using the p × p matrix W^T W is computationally efficient if n > p. If A is k × k, then computing A^{−1} has O(k^3) complexity.
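The computational point above can be checked directly in R: for λ1,n > 0 the two forms of the ridge estimator agree, and only an n × n inverse is needed when p > n. This sketch is illustrative only, and the algebra does not require standardized predictors.

set.seed(2)
n <- 20; q <- 100                 # more predictors than cases
w <- matrix(rnorm(n * q), n, q)
z <- rnorm(n)
lam1 <- 3
eta1 <- solve(t(w) %*% w + lam1 * diag(q), t(w) %*% z)  # (p-1) x (p-1) inverse
eta2 <- t(w) %*% solve(w %*% t(w) + lam1 * diag(n), z)  # n x n inverse
max(abs(eta1 - eta2))             # essentially 0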

The following identity from Gunst and Mason (1980, p. 342) is useful for ridge regression inference:

ηR = (W^T W + λ1,n Ip−1)^{−1} W^T Z = (W^T W + λ1,n Ip−1)^{−1} W^T W (W^T W)^{−1} W^T Z = (W^T W + λ1,n Ip−1)^{−1} W^T W ηOLS = An ηOLS = [Ip−1 − λ1,n(W^T W + λ1,n Ip−1)^{−1}] ηOLS = Bn ηOLS = ηOLS − (λ1,n/n) n(W^T W + λ1,n Ip−1)^{−1} ηOLS

since An − Bn = 0. See Problem 5.3. Assume

Ru = W^T W/n P→ V^{−1}    (5.21)

as n → ∞. Note that V^{−1} = ρu if the ui are a random sample from a population with a nonsingular population correlation matrix ρu. Under (5.21), if λ1,n/n → 0 then


(W^T W + λ1,n Ip−1)/n P→ V^{−1}, and n(W^T W + λ1,n Ip−1)^{−1} P→ V.

Note that (5.21) is a condition of the LS CLT Theorem 5.1. Also note that

An = An,λ = [(W^T W + λ1,n Ip−1)/n]^{−1} (W^T W)/n P→ V V^{−1} = Ip−1

if λ1,n/n → 0 since matrix inversion is a continuous function of a positive definite matrix. See, for example, Bhatia et al. (1990), Stewart (1969), and Severini (2005, pp. 348-349).
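The Gunst and Mason (1980) identity ηR = An ηOLS = Bn ηOLS can also be checked numerically; the sketch below is illustrative only and is not part of linmodpack.

set.seed(3)
n <- 200; q <- 3
u <- matrix(rnorm(n * q), n, q)
y <- 1 + u %*% c(1, 0, 0) + rnorm(n)
w <- sqrt(n/(n - 1)) * scale(u); z <- y - mean(y)
lam1 <- 10
WtW <- t(w) %*% w
etaOLS <- solve(WtW, t(w) %*% z)
etaR <- solve(WtW + lam1 * diag(q), t(w) %*% z)
An <- solve(WtW + lam1 * diag(q)) %*% WtW
Bn <- diag(q) - lam1 * solve(WtW + lam1 * diag(q))
max(abs(etaR - An %*% etaOLS))    # essentially 0
max(abs(etaR - Bn %*% etaOLS))    # essentially 0, so A_n = B_n here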

For model selection, the M values of λ = λ1,n are denoted by λ1, λ2, ..., λM where λi = λ1,n,i depends on n for i = 1, ..., M. If λs corresponds to the model selected, then λ1,n = λs. The following theorem shows that ridge regression and the OLS full model are asymptotically equivalent if λ1,n = oP(n^{1/2}) so λ1,n/√n P→ 0.

Theorem 5.5, RR CLT (Ridge Regression Central Limit Theorem). Assume p is fixed and that the conditions of the LS CLT Theorem 5.1 hold for the model Z = Wη + e.

a) If λ1,n/√n P→ 0, then √n(ηR − η) D→ Np−1(0, σ^2 V).

b) If λ1,n/√n P→ τ ≥ 0, then √n(ηR − η) D→ Np−1(−τ V η, σ^2 V).

Proof: If λ1,n/√n P→ τ ≥ 0, then by the above Gunst and Mason (1980) identity, ηR = [Ip−1 − λ1,n(W^T W + λ1,n Ip−1)^{−1}] ηOLS. Hence

√n(ηR − η) = √n(ηR − ηOLS + ηOLS − η) = √n(ηOLS − η) − (√n λ1,n/n) n(W^T W + λ1,n Ip−1)^{−1} ηOLS D→ Np−1(0, σ^2 V) − τ V η ∼ Np−1(−τ V η, σ^2 V). □

For p fixed, Knight and Fu (2000) note i) that ηR is a consistent estimator of η if λ1,n = o(n) so λ1,n/n → 0 as n → ∞, ii) OLS and ridge regression are asymptotically equivalent if λ1,n/√n → 0 as n → ∞, iii) ridge regression is a √n consistent estimator of η if λ1,n = O(√n) (so λ1,n/√n is bounded), and iv) if λ1,n/√n → τ ≥ 0, then

√n(ηR − η) D→ Np−1(−τ V η, σ^2 V).

Hence the bias can be considerable if τ ≠ 0. If τ = 0, then OLS and ridge regression have the same limiting distribution.

Even if p is fixed, there are several problems with ridge regression inference if λ1,n is selected, e.g. after 10-fold cross validation. For OLS forward selection, the probability that the model Imin underfits goes to zero, and each model with S ⊆ I produces a √n consistent estimator βI,0 of β. Ridge regression with 10-fold CV tends to underfit if both i) the number of population active predictors kS = aS − 1 in Equations (4.28) and (5.10) is greater than about 20, and ii) the predictors are highly correlated. If p is fixed and λ1,n = oP(√n), then the OLS full model and ridge regression are asymptotically equivalent, but much larger sample sizes may be needed for the normal approximation to be good for ridge regression since the ridge regression estimator can have large bias for moderate n. Ten fold CV does not appear to guarantee that λ1,n/√n P→ 0 or λ1,n/n P→ 0.

Ridge regression can be a lot better than the OLS full model if i) X^T X is singular or ill conditioned or ii) n/p is small. Ridge regression can be much faster than forward selection if M = 100 and n and p are large.

Roughly speaking, the biased estimation of the ridge regression estimator can make the MSE of βR or ηR less than that of βOLS or ηOLS, but the large sample inference may need larger n for ridge regression than for OLS. However, the large sample theory has n >> p. We will try to use prediction intervals to compare OLS, forward selection, ridge regression, and lasso for data sets where p > n. See Sections 5.9, 5.10, 5.11, and 5.12.

Warning. Although the R functions glmnet and cv.glmnet appear to do ridge regression, getting the fitted values, λ1,n, and degrees of freedom to match up with the formulas of this section can be difficult.

Example 5.2, continued. The ridge regression output below shows results for the marry data where 10-fold CV was used. A grid of 100 λ values was used, and λ0 > 0 was selected. A problem with getting the false degrees of freedom d for ridge regression is that it is not clear that λ = λ1,n/(2n). We need to know the relationship between λ and λ1,n in order to compute d. It seems unlikely that d ≈ 1 if λ0 is selected.

library(glmnet); y <- marry[,3]; x <- marry[,-3]

out<-cv.glmnet(x,y,alpha=0)

lam <- out$lambda.min #value of lambda that minimizes

#the 10-fold CV criterion

yhat <- predict(out,s=lam,newx=x)

res <- y - yhat

n <- length(y)

w1 <- scale(x)

w <- sqrt(n/(n-1))*w1 #t(w) %*% w = n R_u, u = x


diag(t(w)%*%w)

pop mmen mmilmen milwmn

26 26 26 26

#sum w_i^2 = n = 26 for i = 1, 2, 3, and 4

svs <- svd(w)$d #singular values of w,

pp <- 1 + sum(svs^2/(svs^2+2*n*lam)) #approx 1

# d for ridge regression if lam = lam_{1,n}/(2n)

AERplot2(yhat,y,res=res,d=pp)

$respi #90% PI for a future residual

[1] -5482.316 14854.268 #length = 20336.584

#try to reproduce the fitted values

z <- y - mean(y)

q<-dim(w)[2]

I <- diag(q)

M<- w%*%solve(t(w)%*%w + lam*I/(2*n))%*%t(w)

fit <- M%*%z + mean(y)

plot(fit,yhat) #they are not the same

max(abs(fit-yhat))

[1] 46789.11

M<- w%*%solve(t(w)%*%w + lam*I/(1547.1741))%*%t(w)

fit <- M%*%z + mean(y)

max(abs(fit-yhat)) #close

[1] 8.484979

5.6 Lasso

Consider the MLR model Y = Xβ + e. Lasso uses the centered response Zi = Yi − Ȳ and standardized nontrivial predictors in the model Z = Wη + e as described in Remark 5.1. Then Ŷi = Ẑi + Ȳ. The residuals r = r(βL) = Y − Ŷ.

Definition 5.9. Consider fitting the MLR model Y = Xβ + e using Z = Wη + e. The lasso estimator ηL minimizes the lasso criterion

QL(η) = (1/a)(Z − Wη)^T (Z − Wη) + (λ1,n/a) ∑_{i=1}^{p−1} |ηi|    (5.22)

over all vectors η ∈ R^{p−1} where λ1,n ≥ 0 and a > 0 are known constants with a = 1, 2, n, and 2n common. The residual sum of squares RSS(η) = (Z − Wη)^T (Z − Wη), and λ1,n = 0 corresponds to the OLS estimator ηOLS = (W^T W)^{−1} W^T Z if W has full rank p − 1. The lasso vector of fitted values is Ẑ = ẐL = WηL, and the lasso vector of residuals r(ηL) = Z − ẐL.


The estimator is said to be regularized if λ1,n > 0. Obtain Ŷ and βL using ηL, Ẑ, and Ȳ.

Using a vector of parameters η and a dummy vector η in QL is common for minimizing a criterion Q(η), often with estimating equations. See the paragraph below Definition 5.4. We could also write

QL(b) = (1/a) r(b)^T r(b) + (λ1,n/a) ∑_{j=1}^{p−1} |bj|,    (5.23)

where the minimization is over all vectors b ∈ R^{p−1}. The literature often uses λa = λ = λ1,n/a.

For fixed λ1,n, the lasso optimization problem is convex. Hence fast algorithms exist. As λ1,n increases, some of the ηi = 0. If λ1,n is large enough, then ηL = 0 and Ŷi = Ȳ for i = 1, ..., n. If none of the elements ηi of ηL are zero, then ηL can be found, in principle, by setting the partial derivatives of QL(η) to 0. Potential minimizers also occur at values of η where not all of the partial derivatives exist. An analogy is finding the minimizer of a real valued function of one variable h(x). Possible values for the minimizer include values of xc satisfying h′(xc) = 0, and values xc where the derivative does not exist. Typically some of the elements ηi of ηL that minimizes QL(η) are zero, and differentiating does not work.

The following identity from Efron and Hastie (2016, p. 308), for example, is useful for inference for the lasso estimator ηL:

−(1/n) W^T(Z − WηL) + (λ1,n/(2n)) sn = 0, or −W^T(Z − WηL) + (λ1,n/2) sn = 0

where sin ∈ [−1, 1] and sin = sign(ηi,L) if ηi,L ≠ 0. Here sign(ηi) = 1 if ηi > 0 and sign(ηi) = −1 if ηi < 0. Note that sn = sn,ηL depends on ηL. Thus

ηL = (W^T W)^{−1} W^T Z − (λ1,n/(2n)) n(W^T W)^{−1} sn = ηOLS − (λ1,n/(2n)) n(W^T W)^{−1} sn.

If none of the elements of η are zero, and if ηL is a consistent estimator of η, then sn P→ s = sη. If λ1,n/√n → 0, then OLS and lasso are asymptotically equivalent even if sn does not converge to a vector s as n → ∞ since sn is bounded. For model selection, the M values of λ are denoted by 0 ≤ λ1 < λ2 < · · · < λM where λi = λ1,n,i depends on n for i = 1, ..., M. Also, λM is the smallest value of λ such that ηλM = 0. Hence ηλi ≠ 0 for i < M. If λs corresponds to the model selected, then λ1,n = λs. The following theorem shows that lasso and the OLS full model are asymptotically equivalent if λ1,n = oP(n^{1/2}) so λ1,n/√n P→ 0: thus √n(ηL − ηOLS) = oP(1).


Theorem 5.6, Lasso CLT. Assume p is fixed and that the conditions of the LS CLT Theorem 5.1 hold for the model Z = Wη + e.

a) If λ1,n/√n P→ 0, then √n(ηL − η) D→ Np−1(0, σ^2 V).

b) If λ1,n/√n P→ τ ≥ 0 and sn P→ s = sη, then √n(ηL − η) D→ Np−1(−(τ/2) V s, σ^2 V).

Proof. If λ1,n/√n P→ τ ≥ 0 and sn P→ s = sη, then

√n(ηL − η) = √n(ηL − ηOLS + ηOLS − η) = √n(ηOLS − η) − (√n λ1,n/(2n)) n(W^T W)^{−1} sn D→ Np−1(0, σ^2 V) − (τ/2) V s ∼ Np−1(−(τ/2) V s, σ^2 V)

since under the LS CLT, n(W^T W)^{−1} P→ V.

Part a) does not need sn P→ s as n → ∞, since sn is bounded. □

Suppose p is fixed. Knight and Fu (2000) note i) that ηL is a consistent estimator of η if λ1,n = o(n) so λ1,n/n → 0 as n → ∞, ii) OLS and lasso are asymptotically equivalent if λ1,n → ∞ too slowly as n → ∞ (e.g. if λ1,n = λ is fixed), iii) lasso is a √n consistent estimator of η if λ1,n = O(√n) (so λ1,n/√n is bounded). Note that Theorem 5.6 shows that OLS and lasso are asymptotically equivalent if λ1,n/√n → 0 as n → ∞.

In the literature, the criterion often uses λa = λ1,n/a:

QL,a(b) = (1/a) r(b)^T r(b) + λa ∑_{j=1}^{p−1} |bj|.

The values a = 1, 2, and 2n are common. Following Hastie et al. (2015, pp. 9, 17, 19) for the next two paragraphs, it is convenient to use a = 2n:

QL,2n(b) = (1/(2n)) r(b)^T r(b) + λ2n ∑_{j=1}^{p−1} |bj|,    (5.24)

where the Zi are centered and the wj are standardized using g = 0 so w̄j = 0 and n σj^2 = ∑_{i=1}^n w_{i,j}^2 = n. Then λ = λ2n = λ1,n/(2n) in Equation (5.22).

For model selection, the M values of λ are denoted by 0 ≤ λ2n,1 < λ2n,2 < · · · < λ2n,M where ηλ = 0 iff λ ≥ λ2n,M and

λ2n,max = λ2n,M = max_j |(1/n) sj^T Z|

and sj is the jth column of W corresponding to the jth standardized nontrivial predictor Wj. In terms of the 0 ≤ λ1 < λ2 < · · · < λM used above Theorem 5.6, we have λi = λ1,n,i = 2n λ2n,i and

λM = 2n λ2n,M = 2 max_j |sj^T Z|.

For model selection we let I denote the index set of the predictors in the fitted model including the constant. The set A defined below is the index set without the constant.

Definition 5.10. The active set A is the index set of the nontrivial predictors in the fitted model: the predictors with nonzero ηi.

Suppose that there are k active nontrivial predictors. Then for lasso, k ≤ n. Let the n × k matrix WA correspond to the standardized active predictors. If the columns of WA are in general position, then the lasso vector of fitted values

ẐL = WA(WA^T WA)^{−1} WA^T Z − n λ2n WA(WA^T WA)^{−1} sA

where sA is the vector of signs of the active lasso coefficients. Here we are using the λ2n of (5.24), and n λ2n = λ1,n/2. We could replace n λ2n by λ2 if we used a = 2 in the criterion

QL,2(b) = (1/2) r(b)^T r(b) + λ2 ∑_{j=1}^{p−1} |bj|.    (5.25)

See, for example, Tibshirani (2015). Note that WA(WA^T WA)^{−1} WA^T Z is the vector of OLS fitted values from regressing Z on WA without an intercept.

Example 5.2, continued. The lasso output below shows results for the marry data where 10-fold CV was used. A grid of 38 λ values was used, and λ0 > 0 was selected.

library(glmnet); y <- marry[,3]; x <- marry[,-3]

out<-cv.glmnet(x,y)

lam <- out$lambda.min #value of lambda that minimizes

#the 10-fold CV criterion

yhat <- predict(out,s=lam,newx=x)

res <- y - yhat

pp <- out$nzero[out$lambda==lam] + 1 #d for lasso

AERplot2(yhat,y,res=res,d=pp)


$respi #90% PI for a future residual

-4102.672 4379.951 #length = 8482.62

There are some problems with lasso. i) Lasso large sample theory is worse than or as good as that of the OLS full model if n/p is large. ii) Ten fold CV does not appear to guarantee that λ1,n/√n P→ 0 or λ1,n/n P→ 0. iii) Lasso tends to underfit if aS ≥ 20 and the predictors are highly correlated. iv) Ridge regression can be better than lasso if aS > n.

Lasso can be a lot better than the OLS full model if i) X^T X is singular or ill conditioned or ii) n/p is small. iii) For lasso, M = M(lasso) is typically near 100. Let J ≥ 5. If n/J and p are both a lot larger than M(lasso), then lasso can be considerably faster than forward selection, PLS, and PCR if M = M(lasso) = 100 and M = M(F) = min(⌈n/J⌉, p) where F stands for forward selection, PLS, or PCR. iv) The number of nonzero coefficients in ηL is at most n even if p > n. This property of lasso can be useful if p >> n and the population model is sparse.

5.7 Relaxed Lasso

The variant of relaxed lasso that we will use applies OLS on a constant and the active predictors that have nonzero lasso ηi. This estimator is the LARS-OLS hybrid estimator of Efron et al. (2004), also called the relaxed lasso (φ = 0) estimator by Meinshausen (2007). Let XA denote the matrix with a column of ones and the unstandardized active nontrivial predictors. Hence the relaxed lasso estimator is βRL = (XA^T XA)^{−1} XA^T Y, and relaxed lasso is an alternative to forward selection. Relaxed lasso should be √n consistent when lasso is √n consistent. Suppose the n × q matrix x has the q = p − 1 nontrivial predictors. The following R code gives some output for a lasso estimator and then the corresponding relaxed lasso estimator.

library(glmnet)

y <- marry[,3]

x <- marry[,-3]

out<-glmnet(x,y,dfmax=2) #Use 2 for illustration:

#often dfmax approx min(n/J,p) for some J >= 5.

lam<-out$lambda[length(out$lambda)]

yhat <- predict(out,s=lam,newx=x)

#lasso with smallest lambda in grid such that df = 2

lcoef <- predict(out,type="coefficients",s=lam)

as.vector(lcoef) #first term is the intercept

#3.000397e+03 1.800342e-03 9.618035e-01 0.0 0.0

res <- y - yhat

AERplot(yhat,y,res,d=3,alph=1) #lasso response plot

##relaxed lasso =


#OLS on lasso active predictors and a constant

vars <- 1:dim(x)[2]

lcoef<-as.vector(lcoef)[-1] #don’t need an intercept

vin <- vars[lcoef!=0] #the lasso active set

vin

#1 2 since predictors 1 and 2 are active

sub <- lsfit(x[,vin],y) #relaxed lasso

sub$coef

# Intercept pop mmen

#2.380912e+02 6.556895e-05 1.000603e+00

# 238.091 6.556895e-05 1.0006

res <- sub$resid

yhat <- y - res

AERplot(yhat,y,res,d=3,alph=1) #response plot

Example 5.2, continued. The relaxed lasso output below shows results for the marry data where 10-fold CV was used to choose the lasso estimator. Then relaxed lasso is OLS applied to the active variables with nonzero lasso coefficients and a constant. A grid of 38 λ values was used, and λ0 > 0 was selected. The OLS SE, t statistic and pvalue are generally not valid for relaxed lasso.

library(glmnet); y <- marry[,3]; x <- marry[,-3]

out<-cv.glmnet(x,y)

lam <- out$lambda.min #value of lambda that minimizes

#the 10-fold CV criterion

pp <- out$nzero[out$lambda==lam] + 1

#d for relaxed lasso

#get relaxed lasso

lcoef <- predict(out,type="coefficients",s=lam)

lcoef<-as.vector(lcoef)[-1]

vin <- vars[lcoef!=0]

sub <- lsfit(x[,vin],y)

ls.print(sub)

Residual Standard Error=376.9412

R-Square=0.9999

F-statistic (df=2, 23)=147440.1

Estimate Std.Err t-value Pr(>|t|)

Intercept 238.0912 248.8616 0.9567 0.3487

pop 0.0001 0.0029 0.0223 0.9824

mmen 1.0006 0.0164 60.9878 0.0000

res <- sub$resid

yhat <- y - res

AERplot2(yhat,y,res=res,d=pp)

$respi #90% PI for a future residual

-822.759 1403.771 #length = 2226.53


Fig. 5.1 Marry Data Response Plots

To summarize Example 5.2, forward selection selected the model with the minimum Cp while the other methods used 10-fold CV. PLS and PCR used the OLS full model with PI length 2395.74, forward selection used a constant and mmen with PI length 2114.72, ridge regression had PI length 20336.58, lasso and relaxed lasso used a constant, mmen, and pop with lasso PI length 8482.62 and relaxed lasso PI length 2226.53. Figure 5.1 shows the response plots for forward selection, ridge regression, lasso, and relaxed lasso. The plots for PLS=PCR=OLS full model were similar to those of forward selection and relaxed lasso. The plots suggest that the MLR model is appropriate since the plotted points scatter about the identity line. The 90% pointwise prediction bands are also shown, and consist of two lines parallel to the identity line. These bands are very narrow in Figure 5.1 a) and d).

If n/p is large, then Section 4.3.5 showed that the shorth and prediction region method can simulate well for forward selection using the residual bootstrap with residuals from the OLS full model. A similar result may hold for the relaxed lasso: compute OLS and Cp (for each set of active predictors and a constant) for the M lasso models and the OLS full model. Use the relaxed lasso model Imin and βImin,0.

If λ1 = 0, a variant of relaxed lasso computes the OLS submodel for the subset corresponding to λi for i = 1, ..., M. If Cp is used, then this variant has nice large sample theory. See Remark 4.13. A more general version of relaxed lasso has an R package relaxo. See Meinshausen (2007).

5.8 The Elastic Net

Following Hastie et al. (2015, p. 57), let β = (β1, βS^T)^T, let λ1,n ≥ 0, and let α ∈ [0, 1]. Let

RSS(β) = (Y − Xβ)^T (Y − Xβ) = ‖Y − Xβ‖_2^2.

Definition 5.11. The elastic net estimator βEN minimizes the criterion

QEN(β) = (1/2) RSS(β) + λ1,n [ (1/2)(1 − α)‖βS‖_2^2 + α‖βS‖_1 ], or    (5.26)

Q2(β) = RSS(β) + λ1‖βS‖_2^2 + λ2‖βS‖_1    (5.27)

where 0 ≤ α ≤ 1, λ1 = (1 − α)λ1,n and λ2 = 2αλ1,n.

Note that α = 1 corresponds to lasso (using λa=0.5), and α = 0 corresponds to ridge regression. For α < 1 and λ1,n > 0, the optimization problem is strictly convex with a unique solution. The elastic net is due to Zou and Hastie (2005). It has been observed that the elastic net can have much better prediction accuracy than lasso when the predictors are highly correlated.

As with lasso, it is often convenient to use the centered response Z = Y − Ȳ where Ȳ = Ȳ 1, and the n × (p − 1) matrix of standardized nontrivial predictors W. Then regression through the origin is used for the model

Z = Wη + e    (5.28)

where the vector of fitted values Ŷ = Ȳ + Ẑ.

Ridge regression can be computed using OLS on augmented matrices. Similarly, the elastic net can be computed using lasso on augmented matrices. Let the elastic net estimator ηEN minimize

QEN(η) = RSSW(η) + λ1‖η‖_2^2 + λ2‖η‖_1    (5.29)

where λ1 = (1 − α)λ1,n and λ2 = 2αλ1,n. Let the (n + p − 1) × (p − 1) augmented matrix WA be formed by stacking W on top of √λ1 Ip−1, and let the (n + p − 1) × 1 augmented response vector ZA = (Z^T 0^T)^T where 0 is the (p − 1) × 1 zero vector. Let RSSA(η) = ‖ZA − WAη‖_2^2. Then ηEN can be obtained from the lasso of ZA on WA: that is, ηEN minimizes

QL(η) = RSSA(η) + λ2‖η‖_1 = QEN(η).    (5.30)

Proof: We need to show that QL(η) = QEN(η). Note that ZA^T ZA = Z^T Z, WAη = ((Wη)^T, √λ1 η^T)^T, and ZA^T WAη = Z^T Wη. Then

RSSA(η) = ‖ZA − WAη‖_2^2 = (ZA − WAη)^T (ZA − WAη) = ZA^T ZA − ZA^T WAη − η^T WA^T ZA + η^T WA^T WAη = Z^T Z − Z^T Wη − η^T W^T Z + η^T W^T Wη + λ1 η^T η.

Thus

QL(η) = Z^T Z − Z^T Wη − η^T W^T Z + η^T W^T Wη + λ1 η^T η + λ2‖η‖_1 = RSS(η) + λ1‖η‖_2^2 + λ2‖η‖_1 = QEN(η). □


Remark 5.12. i) You could compute the elastic net estimator using a grid of 100 λ1,n values and a grid of J ≥ 10 α values, which would take about J ≥ 10 times as long to compute as lasso. The above equivalent lasso problem (5.30) still needs a grid of λ1 = (1 − α)λ1,n and λ2 = 2αλ1,n values. Often J = 11, 21, 51, or 101. The elastic net estimator tends to be computed with fast methods for optimizing convex problems, such as coordinate descent. ii) Like lasso and ridge regression, the elastic net estimator is asymptotically equivalent to the OLS full model if p is fixed and λ1,n = oP(√n), but behaves worse than the OLS full model otherwise. See Theorem 5.7. iii) A false degrees of freedom d is d = tr[WAS(WAS^T WAS + λ2,n Ip−1)^{−1} WAS^T] where WAS corresponds to the active set (not the augmented matrix). This d seems to be, to a good approximation, the number of nonzero coefficients from the equivalent augmented lasso problem (5.30). See Tibshirani and Taylor (2012, p. 1214). Again λ2,n may not be the λ2 given by the software. iv) The number of nonzero lasso components (not including the constant) is at most min(n, p − 1). Elastic net tends to do variable selection, but the number of nonzero components can equal p − 1 (make the elastic net equal to ridge regression). Note that the number of nonzero components in the augmented lasso problem (5.30) is at most min(n + p − 1, p − 1) = p − 1. vi) The elastic net can be computed with glmnet, and there is an R package elasticnet. vii) For fixed α > 0, we could get λM for elastic net from the equivalent lasso problem. For ridge regression, we could use the λM for an α near 0.

Since lasso uses at most min(n, p − 1) nontrivial predictors, elastic net and ridge regression can perform better than lasso if the true number of active nontrivial predictors aS > min(n, p − 1). For example, suppose n = 1000, p = 5000, and aS = 1500.
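A quick glmnet illustration of this point (illustrative only): with p > n, the lasso path has at most about n nonzero coefficients, while ridge regression keeps them all.

library(glmnet)
set.seed(8)
n <- 50; q <- 200
x <- matrix(rnorm(n * q), n, q)
y <- x[,1] - x[,2] + rnorm(n)
outL <- glmnet(x, y, alpha = 1)        # lasso path
outR <- glmnet(x, y, alpha = 0)        # ridge path
max(outL$df)                           # nonzero coefficients: at most about n = 50
max(outR$df)                           # typically q = 200: ridge zeroes nothing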

Following Jia and Yu (2010), by standard Karush-Kuhn-Tucker (KKT) conditions for convex optimality for Equation (5.27), ηEN is optimal if

2W^T W ηEN − 2W^T Z + 2λ1 ηEN + λ2 sn = 0, or

(W^T W + λ1 Ip−1) ηEN = W^T Z − (λ2/2) sn, or

ηEN = ηR − n(W^T W + λ1 Ip−1)^{−1} (λ2/(2n)) sn.    (5.31)

Hence

ηEN = ηOLS − (λ1/n) n(W^T W + λ1 Ip−1)^{−1} ηOLS − (λ2/(2n)) n(W^T W + λ1 Ip−1)^{−1} sn = ηOLS − n(W^T W + λ1 Ip−1)^{−1} [ (λ1/n) ηOLS + (λ2/(2n)) sn ].

Note that if λ1,n/√n P→ τ and α P→ ψ, then λ1/√n P→ (1 − ψ)τ and λ2/√n P→ 2ψτ. The following theorem shows elastic net is asymptotically equivalent to the OLS full model if λ1,n/√n P→ 0. Note that we get the RR CLT if ψ = 0 and the lasso CLT (using 2λ1,n/√n P→ 2τ) if ψ = 1. Under these conditions,

√n(ηEN − η) = √n(ηOLS − η) − n(W^T W + λ1 Ip−1)^{−1} [ (λ1/√n) ηOLS + (λ2/(2√n)) sn ].

Theorem 5.7, Elastic Net CLT. Assume p is fixed and that the conditions of the LS CLT Theorem 5.1 hold for the model Z = Wη + e.

a) If λ1,n/√n P→ 0, then √n(ηEN − η) D→ Np−1(0, σ^2 V).

b) If λ1,n/√n P→ τ ≥ 0, α P→ ψ ∈ [0, 1], and sn P→ s = sη, then √n(ηEN − η) D→ Np−1(−V[(1 − ψ)τη + ψτs], σ^2 V).

Proof. By the above remarks and the RR CLT Theorem 5.5,

√n(ηEN − η) = √n(ηEN − ηR + ηR − η) = √n(ηR − η) + √n(ηEN − ηR) D→ Np−1(−(1 − ψ)τ V η, σ^2 V) − (2ψτ/2) V s ∼ Np−1(−V[(1 − ψ)τη + ψτs], σ^2 V).

The mean of the normal distribution is 0 under a) since α and sn are bounded. □

Example 5.2, continued. The slpack function enet does elastic net using 10-fold CV and a grid of α values {0, 1/am, 2/am, ..., am/am = 1}. The default uses am = 10. The default chose lasso with alph = 1. The function also makes a response plot, but does not add the lines for the pointwise prediction intervals since the false degrees of freedom d is not computed.

library(glmnet); y <- marry[,3]; x <- marry[,-3]

tem <- enet(x,y)

tem$alph

[1] 1 #elastic net was lasso

tem<-enet(x,y,am=100)

tem$alph

[1] 0.97 #elastic net was not lasso with a finer grid


5.9 Prediction Intervals

Results from Hastie, Tibshirani, and Wainwright (2015, pp. 20, 296, ch. 6, ch. 11) and Luo and Chen (2013) suggest that lasso, relaxed lasso, and forward selection with EBIC can perform well for sparse models: the subset S in (5.10) has aS small (for example aS/n → 0 as n → ∞).

Let d be a false degrees of freedom of the fitted model (e.g., plug in the degrees of freedom formula as if model selection did not occur). We want d = 1 if Ŷi ≡ Ȳ and d = p if the OLS full model is used and n > p. To avoid overfitting, we also want d ≤ min(⌈n/J⌉, p) for some integer J ≥ 5. Forward selection, lasso, and relaxed lasso use variables x∗1, ..., x∗d where x∗1 ≡ 1 is a constant, and vj = γj^T x = x∗j = xkj for j = 1, ..., d. PCR and PLS use variables that are linear combinations of the predictors vj = γj^T x for j = 1, ..., d. Then for forward selection, PLS, PCR, lasso, and relaxed lasso, d = the number of “variables” vj in the fitted model, including the constant v1. Hence for lasso and forward selection, d − 1 is the number of active predictors while d − 1 is the number of “components” used by PCR and PLS. See Efron and Hastie (2016, pp. 221, 222, 231) and Tibshirani (2015) for lasso degrees of freedom.

For ridge regression, we assume that the λ given by output uses a = 2n. Then we take d = 1 + df(λ1,n) where df(λ1,n) is given by Equation (5.17). Hence

d − 1 = ∑_{i=1}^{p−1} σi^2/(σi^2 + λ1,n) = ∑_{i=1}^{p−1} σi^2/(σi^2 + 2nλ2n) = ∑_{i=1}^{p−1} λi/(λi + 2nλ2n)    (5.32)

where the σi are the singular values of W and the λi are the eigenvalues of W^T W. Some R code for computing d for ridge regression is shown below. Warning. The code may not give the correct df since it is hard to determine how glmnet gets its fitted values. See the end of Section 5.5.

library(glmnet)

set.seed(974)

n <- 100

p <- 100

k <- 1

q <- p-1

b <- 0 * 1:q

b[1:k] <- 1

u <- matrix(rnorm(n * q), nrow = n, ncol = q)

y <- 1 + u %*% b + rnorm(n)

out<-cv.glmnet(u,y,alpha=0) #RR with 10-fold CV

lam <- out$lambda.min

w1 <- scale(u)

w <- sqrt(n/(n-1))*w1 #t(w) %*% w = n R_u


svs <- svd(w)$d #singular values of w,

d <- 1 + sum(svs^2/(svs^2+2*n*lam))

# d for ridge regression if lam = hat(lam)_{1,n}/(2n)

d

[1] 4.790351 #want d near k+1 = 2: RR overfit some

Equations (2.4) through (2.6) were used to give the Pelawa Watagoda and Olive (2019b) PI for additive error regression (7.7). MLR is a special case with Yf = m(xf) = xf^T β. Let the training data (Y1, x1), ..., (Yn, xn) be from the MLR model Yi = xi^T β + ei for i = 1, ..., n, and consider predicting a new test value Yf given the training data and xf where Yf = xf^T β + ef. Then a 100(1 − δ)% large sample PI for Yf is

[Ŷf + bn ξδ1, Ŷf + bn ξ1−δ2]    (5.33)

where the PI uses d to compute bn in Equation (2.6). Asymptotically, this PI for Yf adds Ŷf to the endpoints of the shorth PI for a future residual.
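A simplified sketch of this PI is given below: it adds the shorth interval of the residuals to Ŷf but omits the correction factor bn from Equation (2.6), so it will undercover somewhat for small n or large d. The function name shorth_int is made up for illustration.

shorth_int <- function(res, delta = 0.05) {
  r <- sort(res); n <- length(r)
  cc <- ceiling(n * (1 - delta))       # number of residuals the interval must contain
  widths <- r[cc:n] - r[1:(n - cc + 1)]
  i <- which.min(widths)
  c(r[i], r[i + cc - 1])               # shortest interval covering cc residuals
}
set.seed(5)
n <- 100; x <- matrix(rnorm(n * 3), n, 3)
y <- 1 + x[,1] + rnorm(n)
fit <- lsfit(x, y)
xf <- c(0.5, 0, 0)                     # new predictor vector
yfhat <- sum(c(1, xf) * fit$coef)
yfhat + shorth_int(fit$resid)          # crude large sample 95% PI for Y_f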

Many things can go wrong with prediction. It is assumed that the test data follows the same MLR model as the training data. Population drift is a common reason why the above assumption, which assumes that the various distributions involved do not change over time, is violated. Population drift occurs when the population distribution does change over time.

A second thing that can go wrong is that the training or test data set is distorted away from the population distribution. This could occur if outliers are present or if the training data set and test data set are drawn from different populations. For example, the training data set could be drawn from three hospitals, and the test data set could be drawn from two more hospitals. These two populations of three and two hospitals may differ.

A third thing that can go wrong is extrapolation: if xf is added to x1, ..., xn, then there is extrapolation if xf is not like the xi, e.g. xf is an outlier. Predictions based on extrapolation are not reliable. Check whether the Euclidean distance of xf from the coordinatewise median MED(X) of the x1, ..., xn satisfies Dxf(MED(X), Ip) ≤ max_{i=1,...,n} Di(MED(X), Ip). Alternatively, use the ddplot5 function, described above Definition 1.17, applied to x1, ..., xn, xf to check whether xf is an outlier.

When n ≥ 10p, let the hat matrix H = X(X^T X)^{−1}X^T. Let hi = hii be the ith diagonal element of H for i = 1, ..., n. Then hi is called the ith leverage and hi = xi^T(X^T X)^{−1}xi. Then the leverage of xf is hf = xf^T(X^T X)^{−1}xf. Then a rule of thumb is that extrapolation occurs if hf > max(h1, ..., hn). This rule works best if the predictors are linearly related in that a plot of xi versus xj should not have any strong nonlinearities. If there are strong nonlinearities among the predictors, then xf could be far from the xi but still have hf < max(h1, ..., hn). If the regression method, such as lasso or forward selection, uses a set I of a predictors, including a constant, where n ≥ 10a, the above rule of thumb could be used for extrapolation where xf, xi, and X are replaced by xI,f, xI,i, and XI.
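A small sketch of the leverage rule of thumb on simulated data (illustrative only):

set.seed(6)
n <- 100; x <- matrix(rnorm(n * 3), n, 3)
X <- cbind(1, x)                       # design matrix with a constant
XtXinv <- solve(t(X) %*% X)
hmax <- max(diag(X %*% XtXinv %*% t(X)))  # max(h_1, ..., h_n)
xf <- c(1, 4, 4, 4)                    # candidate new case, far from the training data
hf <- drop(t(xf) %*% XtXinv %*% xf)    # leverage of x_f
hf > hmax                              # TRUE suggests extrapolation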

5.10 Choosing from M Models

For the model Y = Xβ + e, methods such as forward selection, PCR, PLS, ridge regression, relaxed lasso, and lasso each generate M fitted models I1, ..., IM, where M depends on the method, and we considered several methods for selecting the final submodel Id. Only method i) needs n/p large.

i) Let Id = Imin be the model that minimizes Cp for forward selection, relaxed lasso, or lasso.

ii) Let Id use d = min(⌈n/J⌉, p) variables, including a constant, where J ≥ 5 is a positive integer. This method uses the OLS full model if n/p ≥ J. Hence large sample inference is simple if p is fixed. This method simulated well for d = p and J ≥ 10, but prediction intervals were often unreliable otherwise. See Pelawa Watagoda (2017ab).

iii) Let Id = Imin be the model that minimizes EBIC for forward selection or relaxed lasso. For forward selection, we used M = min(⌈n/5⌉, p).

iv) Choose Id using k–fold cross validation (CV). We used k = 5 or 10.

The following method is currently slow to simulate, but is a useful diagnostic. When the model underfits, PI (5.33) tends to have coverage near or greater than the nominal 0.95 coverage, but the PI length is long. When the model severely overfits, the PI tends to have short length with coverage less than 0.95.

v) Modify k–fold cross validation to compute the PI coverage and average PI length on all M models. Then n PIs are made for Yi using xf = xi for i = 1, ..., n. The coverage is the proportion of times the n PIs contained Yi. For example, choose the model Id with the shortest average PI length given that the nominal large sample 100(1 − δ)% PI had coverage

≥ cn = max(1 − δ − 1/(3√n), 1 − δ − 0.02).

If no model Ii had coverage ≥ cn, pick the model with the largest coverage.

Remark 5.13. Alternative methods could be used. For example, choose several “degrees of freedom” d1, ..., dL where di = min(⌈n/Ji⌉, p) for i = 1, ..., L and then apply some criterion on the L models (use the OLS full model if n/p ≥ max(Ji)). For example, use J1 = 5, J2 = 10, J3 = 20, and J4 = 50.

Definition 5.12. For k-fold cross validation (k-fold CV), randomly divide the training data into k groups or folds of approximately equal size nj ≈ n/k for j = 1, ..., k. Leave out the first fold, fit the statistical method to the k − 1 remaining folds, and then compute some criterion for the first fold. Repeat for folds 2, ..., k.

Following James et al. (2013, p. 181), if the statistical method is an MLR method, we often compute Ŷi(j) for each Yi in the fold j left out. Then

MSEj = (1/nj) ∑_{i=1}^{nj} (Yi − Ŷi(j))^2,

and the overall criterion is

CV(k) = (1/k) ∑_{j=1}^k MSEj.

Note that if each nj = n/k, then

CV(k) = (1/n) ∑_{i=1}^n (Yi − Ŷi(j))^2.

Then CV(k) ≡ CV(k)(Ii) is computed for i = 1, ..., M, and the model Ic with the smallest CV(k)(Ii) is selected. Method v) above Remark 5.13 uses an alternative criterion based on PIs.
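A minimal sketch of CV(k) for an OLS MLR fit is given below; the function name cv_mlr is made up, and folds of nearly equal size are formed as in the fold code shown later in this section.

cv_mlr <- function(x, y, k = 5) {
  n <- length(y)
  folds <- sample(rep(1:k, length.out = n))
  mse <- numeric(k)
  for (j in 1:k) {
    test <- folds == j
    fit <- lsfit(x[!test, , drop = FALSE], y[!test])
    pred <- cbind(1, x[test, , drop = FALSE]) %*% fit$coef
    mse[j] <- mean((y[test] - pred)^2)   # MSE_j for the left out fold
  }
  mean(mse)                              # CV_(k)
}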

Assume that model (2.28) holds: Y = x^T β + e = xS^T βS + e where βS is an aS × 1 vector. Suppose p is fixed and n → ∞. If βI is a × 1, form the p × 1 vector βI,0 from βI by adding 0s corresponding to the omitted variables. Remark 4.13 noted that βImin,0 is a √n consistent estimator of β under mild regularity conditions. Note that if aS = p, then βImin,0 is asymptotically equivalent to the OLS full model β (since the probability that Cp underfits goes to 0), and hence asymptotically normal.

Remark 5.14. Let p be fixed and n → ∞. For elastic net, forward selection, PCR, PLS, ridge regression, relaxed lasso, and lasso, if P(d → p) → 1 as n → ∞ then the seven methods are asymptotically equivalent to the OLS full model, and the PI (5.33) is asymptotically optimal on a large class of iid unimodal zero mean error distributions. The asymptotic optimality holds since the sample quantiles of the OLS full model residuals are consistent estimators of the population quantiles of the unimodal error distribution for a large class of distributions. Note that d P→ p if P(λ1,n → 0) → 1 for elastic net, lasso and ridge regression, and d P→ p if the number d − 1 of components (γj^T x or γj^T w) used by the method satisfies P(d − 1 → p − 1) → 1. Consistent estimators β of β also produce residuals such that the sample quantiles of the residuals are consistent estimators of quantiles of the error distribution. See Olive and Hawkins (2003) and Rousseeuw and Leroy (1987, p. 128).


Choosing folds for k-fold cross validation is similar to randomly allocating cases to treatment groups. James et al. (2013, p. 250) use code similar to that shown below for the Example 5.2 data set. This code is fast but the group sizes can differ, although the expected sizes are the same. The slpack function rand was used by Olive (2017a, p. 176) to randomly allocate cases to treatment groups, but is slow since the function uses a for loop.

k<-5

folds<-sample(1:k,nrow(marry),replace=TRUE)

folds

5 2 1 1 3 5 4 4 1 1 3 4 5 1 3 5 5 4 4 3 1 3 5 4 3 2

#fold5 has cases 1,6,13,16,17,23

#fold2 only has 2 cases: use sum(folds==2)

#Find folds using the slow slpack rand function.

n<-nrow(marry)

folds<-rand(n,k)$groups

folds

2 1 5 1 5 1 3 3 4 4 1 3 2 4 2 5 3 2 5 5 2 3 4 1 4 5

#last fold gets the extra cases if n is not divisible by k

The following code is more efficient for a simulation. It makes copies of 1 to k in a vector of length n called tfolds. The sample command makes a permutation of tfolds to get the folds. The lengths of the k folds differ by at most 1.

n<-26

k<-5

J<-as.integer(n/k)+1

tfolds<-rep(1:k,J)

tfolds<-tfolds[1:n] #can pass tfolds to a loop

folds<-sample(tfolds)

folds

4 2 3 5 3 3 1 5 2 2 5 1 2 1 3 4 2 1 5 5 1 4 1 4 4 3

Example 5.2, continued. The slpack function pifold uses k-fold CV to get the coverage and average PI lengths. We used 5-fold CV with coverage and average 95% PI length to compare the forward selection models. All 4 models had coverage 1, but the average 95% PI lengths were 2591.243, 2741.154, 2902.628, and 2972.963 for the models with 2 to 5 predictors. See the following R code.

y <- marry[,3]; x <- marry[,-3]

x1 <- x[,2]

x2 <- x[,c(2,3)]

x3 <- x[,c(1,2,3)]

pifold(x1,y) #nominal 95% PI

$cov


[1] 1

$alen

[1] 2591.243

pifold(x2,y)

$cov

[1] 1

$alen

[1] 2741.154

pifold(x3,y)

$cov

[1] 1

$alen

[1] 2902.628

pifold(x,y)

$cov

[1] 1

$alen

[1] 2972.963

#Validation PIs for submodels: the sample size is

#likely too small and the validation PI is formed

#from the validation set.

n<-dim(x)[1]

nH <- ceiling(n/2)

indx<-1:n

perm <- sample(indx,n)

H <- perm[1:nH]

vpilen(x1,y,H) #13/13 were in the validation PI

$cov

[1] 1.0

$len

[1] 116675.4

vpilen(x2,y,H)

$cov

[1] 1.0

$len

[1] 116679.8

vpilen(x3,y,H)

$cov

[1] 1.0

$len

[1] 116312.5

vpilen(x,y,H)

$cov

[1] 1.0

$len #shortest length


[1] 116270.7

Some more code is below.

n <- 100

p <- 4

k <- 1

q <- p-1

x <- matrix(rnorm(n * q), nrow = n, ncol = q)

b <- 0 * 1:q

b[1:k] <- 1

y <- 1 + x %*% b + rnorm(n)

x1 <- x[,1]

x2 <- x[,c(1,2)]

x3 <- x[,c(1,2,3)]

pifold(x1,y)

$cov

[1] 0.96

$alen

[1] 4.2884

pifold(x2,y)

$cov

[1] 0.98

$alen

[1] 4.625284

pifold(x3,y)

$cov

[1] 0.98

$alen

[1] 4.783187

pifold(x,y)

$cov

[1] 0.98

$alen

[1] 4.713151

n <- 10000

p <- 4

k <- 1

q <- p-1

x <- matrix(rnorm(n * q), nrow = n, ncol = q)

b <- 0 * 1:q

b[1:k] <- 1

y <- 1 + x %*% b + rnorm(n)

x1 <- x[,1]

x2 <- x[,c(1,2)]


x3 <- x[,c(1,2,3)]

pifold(x1,y)

$cov

[1] 0.9491

$alen

[1] 3.96021

pifold(x2,y)

$cov

[1] 0.9501

$alen

[1] 3.962338

pifold(x3,y)

$cov

[1] 0.9492

$alen

[1] 3.963305

pifold(x,y)

$cov

[1] 0.9498

$alen

[1] 3.96203

5.11 Hypothesis Testing After Model Selection, n/p Large

Assume n ≥ 20p and that the error distribution is unimodal and not highly skewed. The response plot and residual plot are plots with Ŷ = x^T β̂ on the horizontal axis and Y or r on the vertical axis, respectively. Then the plotted points in these plots should scatter in roughly even bands about the identity line (with unit slope and zero intercept) and the r = 0 line, respectively. See Figure 5.1. If the plots for the OLS full model suggest that the error distribution is skewed or multimodal, then much larger sample sizes may be needed.

If the error distribution is unknown, then large sample theory tests are straightforward if the estimator is asymptotically equivalent to the OLS full model, e.g. λ1,n = oP(√n), or choose the OLS full model if n ≥ 50p. The latter technique may be reasonable if the large sample theory of the method is not better than that of the OLS full model (elastic net, lasso, ridge regression), or if it is not known how to do inference unless the model is asymptotically equivalent to the OLS full model, e.g., if P(d → p) → 1 as n → ∞ (PCR, PLS, forward selection).

Section 4.3.5 showed how to do inference after forward selection using Cp and the residual bootstrap with the residuals from the OLS full model. Confidence intervals for βi and confidence regions for H0 : θ = Aβ = θ0 were given. The geometric argument Theorem 2.2 was used to argue that the coverage will be close to or higher than the nominal coverage for large n and B (n ≥ 20p and B ≥ 50p start to give good results for some light tailed error distributions). If λ1 = 0, these methods are expected to work well for forward selection with BIC and for lasso, relaxed lasso, and elastic net using k-fold CV by Sections 5.6–5.8 and Zhang (1992).

5.12 Two Set Inference

Randomly divide the data set into two half sets. On the first half set fit the model selection method, e.g. forward selection or lasso, to get the a predictors. On the second half set, use the standard OLS inference from regressing the response on the predictors found from the first half set. This method can be inefficient if n ≥ 10p, but is useful for a sparse model if n ≤ 5p, if the probability that the model underfits goes to zero, and if n ≥ 20a. A model is sparse if the number of predictors with nonzero coefficients is small.

For lasso, the active set from the first half set (training set) is found, but two set inference is not the relaxed lasso estimator. The two set estimator βI,2set has the same large sample theory as if I was chosen before obtaining the data. Lasso and relaxed lasso use fits from all n cases.

If P(S ⊆ I) → 1 for lasso, then the relaxed lasso estimator has the same limiting distribution as the forward selection estimator, except the πi differ.
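A minimal sketch of two set inference with the marry data, in the style of the earlier examples (illustrative only: with n = 26 the half sets are very small, and the code assumes the lasso active set from the first half set is nonempty).

library(glmnet); y <- marry[,3]; x <- marry[,-3]
n <- length(y)
H <- sample(1:n, ceiling(n/2))             # cases in the first half set
out <- cv.glmnet(x[H,], y[H], nfolds = 5)  # model selection on the first half set
lcoef <- as.vector(predict(out, type = "coefficients", s = out$lambda.min))[-1]
vin <- which(lcoef != 0)                   # active predictors chosen on half set 1
ls.print(lsfit(x[-H, vin], y[-H]))         # standard OLS inference on half set 2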

5.13 Simulations

A simulation study was done for PI (5.33). Let x = (1 u^T)^T where u is the (p−1) × 1 vector of nontrivial predictors. In the simulations, for i = 1, ..., n, we generated w_i ∼ N_{p−1}(0, I) where the m = p−1 elements of the vector w_i are iid N(0,1). Let the m × m matrix A = (a_ij) with a_ii = 1 and a_ij = ψ where 0 ≤ ψ < 1 for i ≠ j. Then the vector u_i = A w_i so that Cov(u_i) = Σ_u = AA^T = (σ_ij) where the diagonal entries σ_ii = [1 + (m−1)ψ²] and the off diagonal entries σ_ij = [2ψ + (m−2)ψ²]. Hence the correlations are cor(x_i, x_j) = ρ = (2ψ + (m−2)ψ²)/(1 + (m−1)ψ²) for i ≠ j where x_i and x_j are nontrivial predictors. If ψ = 1/√(cp), then ρ → 1/(c+1) as p → ∞ where c > 0. As ψ gets close to 1, the predictor vectors cluster about the line in the direction of (1, ..., 1)^T. Then Y_i = 1 + 1x_{i,2} + ··· + 1x_{i,k+1} + e_i for i = 1, ..., n. Hence β = (1, ..., 1, 0, ..., 0)^T with k+1 ones and p−k−1 zeros. The zero mean errors e_i were iid from five distributions: i) N(0,1), ii) t_3, iii) EXP(1) − 1, iv) uniform(−1, 1), and v) 0.9 N(0,1) + 0.1 N(0,100). Only distribution iii) is not symmetric.
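A short R sketch of this data generation scheme, with illustrative values of k and ψ:

# Generate the simulated data: u_i = A w_i gives cor(x_i, x_j) = rho for i != j.
set.seed(1)
n <- 100; p <- 20; k <- 1; psi <- 0.5
m <- p - 1
A <- matrix(psi, m, m); diag(A) <- 1
w <- matrix(rnorm(n * m), n, m)            # rows are w_i ~ N_{p-1}(0, I)
u <- w %*% t(A)                            # rows are u_i = A w_i
y <- 1 + rowSums(u[, 1:k, drop = FALSE]) + rnorm(n)   # k+1 coefficients equal to 1
rho <- (2 * psi + (m - 2) * psi^2) / (1 + (m - 1) * psi^2)
c(sample = cor(u[, 1], u[, 2]), population = rho)     # compare correlations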


The lengths of the asymptotically optimal 95% PIs are i) 3.92 = 2(1.96), ii) 6.365, iii) 2.996, iv) 1.90 = 2(0.95), and v) 13.490. The simulation used 5000 runs, so an observed coverage in [0.94, 0.96] gives no reason to doubt that the PI has the nominal coverage of 0.95. The simulation used ψ = 0, 1/√p, or 0.9; and k = 1, 19, or p−1. The simulation often used p = 20, 40, n, or 2n.

Table 5.1 shows some simulation results for PI (4.7) = PI (5.33) where forward selection used Cp for n ≥ 10p and EBIC for n < 10p. The other methods minimized 10-fold CV (with relaxed lasso applied to the selected lasso model). For ridge regression we used the same d used by lasso. Hence the ridge regression false degrees of freedom d is very crude. For methods other than ridge regression, the maximum number of variables used was approximately min(⌈n/5⌉, p). For n ≥ 10p, coverages tended to be near or higher than the nominal value of 0.95. The average PI length was often near 1.3 times the asymptotically optimal length for n = 10p and close to the optimal length for n = 100p. An exception was that lasso, relaxed lasso, and ridge regression often had too long lengths if both k ≥ 19 and ψ ≥ 0.5. These results also held for all five error distributions.

Table 5.1 Simulated Large Sample 95% PI Coverages and Lengths, ei ∼ N(0,1)

n    p    ψ    k          FS      lasso   RL      RR      PLS     PCR
100  20   0    1     cov  0.9644  0.9750  0.9666  0.9560  0.9438  0.9772
                     len  4.4490  4.8245  4.6873  4.5723  4.4149  5.5647
100  40   0    1     cov  0.9654  0.9774  0.9588  0.9274  0.8810  0.9882
                     len  4.4294  4.8889  4.6226  4.4291  4.0202  7.3393
100  200  0    1     cov  0.9648  0.9764  0.9268  0.9584  0.6616  0.9922
                     len  4.4268  4.9762  4.2748  6.1612  2.7695  12.412
100  50   0    49    cov  0.8996  0.9719  0.9736  0.9820  0.8448  1.0000
                     len  22.067  6.8345  6.8092  7.7234  4.2141  38.904
200  20   0    19    cov  0.9788  0.9766  0.9788  0.9792  0.9550  0.9786
                     len  4.9613  4.9636  4.9613  5.0458  4.3211  4.9610
200  40   0    19    cov  0.9742  0.9762  0.9740  0.9738  0.9324  0.9792
                     len  4.9285  5.2205  5.1146  5.2103  4.2152  5.3616
200  200  0    19    cov  0.9728  0.9778  0.9098  0.9956  0.3500  1.0000
                     len  4.8835  5.7714  4.5465  22.351  2.1451  51.896
400  20   0.9  19    cov  0.9664  0.9748  0.9604  0.9726  0.9554  0.9536
                     len  4.5121  10.609  4.5619  10.663  4.0017  3.9771
400  40   0.9  19    cov  0.9674  0.9608  0.9518  0.9578  0.9482  0.9646
                     len  4.5682  14.670  4.8656  14.481  4.0070  4.3797
400  400  0.9  19    cov  0.9348  0.9636  0.9556  0.9632  0.9462  0.9478
                     len  4.3687  47.361  4.8530  48.021  4.2914  4.4764
400  400  0    399   cov  0.9486  0.8508  0.5704  1.0000  0.0948  1.0000
                     len  78.411  37.541  20.408  244.28  1.1749  305.93
400  800  0.9  19    cov  0.9268  0.9652  0.9542  0.9672  0.9438  0.9554
                     len  4.3427  67.294  4.7803  66.577  4.2965  4.6533

For n ≤ p, good performance needed stronger regularity conditions. PLS tended to have severe undercoverage with small average length, but sometimes performed well for ψ = 0.9. The PCR length was often too long for ψ = 0. If there was k = 1 active population predictor, then forward selection, lasso, and relaxed lasso often performed well. For k = 19, forward selection performed well, as did lasso and relaxed lasso for ψ = 0. For dense models with k = p − 1 and n/p not large, there was often undercoverage. Let d − 1 be the number of active predictors in the selected model. For N(0,1) errors, ψ = 0, and d < k, an asymptotic population 95% PI has length 3.92√(k − d + 1). Note that when the (Y_i, u_i^T)^T follow a multivariate normal distribution, every subset follows a multiple linear regression model. EBIC occasionally had undercoverage, especially for k = 19 or p − 1, which was usually more severe for ψ = 0.9 or 1/√p.

A small simulation was done for confidence intervals and confidence regions, using the same type of data as for the prediction interval simulation, using B = max(1000, n, 20p) and 5000 runs. The regression model used β = (1, 1, 0, 0)^T with n = 100 and p = 4. When ψ = 0, the design matrix X consisted of iid N(0,1) random variables. See Section 2.3 and Table 2.2. Simulations for lasso were similar for ψ = 0. See Table 3.2 which is taken from Pelawa Watagoda (2017a).

Table 5.2 Bootstrapping Lasso, ψ = 0

n    ψ    type        β1      β2      β3      β4      test
100       1     cov   0.9440  0.9376  0.9910  0.9946  0.9790
                len   0.4143  0.4470  0.3759  0.3763  2.6444
          2     cov   0.9468  0.9428  0.9946  0.9944  0.9816
                len   0.6870  0.7565  0.6238  0.6226  2.6832
          3     cov   0.9418  0.9408  0.9930  0.9948  0.9840
                len   0.4110  0.4506  0.3743  0.3746  2.6684
          4     cov   0.9468  0.9370  0.9938  0.9948  0.9838
                len   0.2392  0.2578  0.2151  0.2153  2.6454
     0.5  5     cov   0.9438  0.9344  0.9988  0.9970  0.9924
                len   2.9380  2.5042  2.4912  2.4715  2.8536
     0.9  5     cov   0.9506  0.9290  0.9974  0.9976  0.9956
                len   3.9180  3.2760  3.7356  3.2739  2.8836

The simulations were done in R. See R Core Team (2016). Much larger simulation studies are in Pelawa Watagoda (2017ab). We used several R functions including forward selection as computed with the regsubsets function from the leaps library, principal components regression with the pcr function and partial least squares with the plsr function from the pls library, and ridge regression and lasso with the glmnet and cv.glmnet functions from the glmnet library.
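A brief sketch of how such fits might be obtained with these packages; the simulated x and y below are hypothetical and only illustrate the calls.

library(leaps); library(pls); library(glmnet)
set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p); y <- 1 + x[, 1] + rnorm(n)
# forward selection: best model of each size, then pick the minimum Cp model
out <- regsubsets(x, y, method = "forward", nvmax = min(ncol(x), 20))
which.min(summary(out)$cp)
# PCR and PLS with 10-fold CV over the number of components
pcrfit <- pcr(y ~ x, validation = "CV")
plsfit <- plsr(y ~ x, validation = "CV")
# ridge regression (alpha = 0) and lasso (alpha = 1) with 10-fold CV
ridgefit <- cv.glmnet(x, y, alpha = 0)
lassofit <- cv.glmnet(x, y, alpha = 1)
coef(lassofit, s = "lambda.min")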

Some useful functions for the inference are in the collection of R functions slpack. Table 3.1 was made with mspisim while Table 2.2 was made with regbootsim3 for the OLS full model and vsbootsim4 for forward selection. The function lassobootsim3 uses the prediction region method for lasso. See Table 3.2. For PI (2.8), the function valvspisim is for forward selection using the minimum Cp model, and the function valrelpisim simulates the relaxed lasso model corresponding to the lasso model chosen with 10-fold CV. For forward selection or relaxed lasso, the function pifold can be used to do k-fold CV with PI coverage and average length. Some more functions are described below.

i) If d = min(⌈n/J⌉, p), the function drelpisim simulates the PI for lasso and relaxed lasso, the function dpisim3 for PLS and PCR, the function dvspisim for forward selection, and the function dpisim2 for forward selection, PLS, PCR, lasso, and relaxed lasso.

ii) For 10-fold CV, lassopisim2 is for lasso and ridge regression, while the functions pcrpisim2, plspisim2, and relpisim2 are for PCR, PLS, and relaxed lasso (from the lasso model selected by 10-fold CV). (The Olive (2013a) PI can be simulated for n/p large with the corresponding functions that do not end with a 2, and with the function vspisim for the forward selection model that minimizes Cp.)

iii) For EBIC, the function evspisim is for forward selection.

The function fselboot bootstraps the minimum Cp model from forward selection. The function shorth3 can be used to find the Frey (2013) shorth(c) intervals for μi. The function vsbootsim4 can be used to simulate the bootstrap tests for minimum Cp forward selection. The function bicbootsim can be used if BIC is used instead of Cp.

5.14 Summary

1) The MLR model is Y_i = β1 + x_{i,2}β2 + ··· + x_{i,p}βp + e_i = x_i^T β + e_i for i = 1, ..., n. This model is also called the full model. In matrix notation, these n equations become Y = Xβ + e. Note that x_{i,1} ≡ 1.

2) The ordinary least squares (OLS) full model estimator β̂_OLS minimizes Q_OLS(β) = Σ_{i=1}^n r_i²(β) = RSS(β) = (Y − Xβ)^T (Y − Xβ). In the estimating equations Q_OLS(β), the vector β is a dummy variable. The minimizer β̂_OLS estimates the parameter vector β for the MLR model Y = Xβ + e. Note that β̂_OLS ∼ AN_p(β, MSE (X^T X)^{−1}).

3) Given an estimate b of β, the corresponding vector of predicted values or fitted values is Ŷ ≡ Ŷ(b) = Xb. Thus the ith fitted value

Ŷ_i ≡ Ŷ_i(b) = x_i^T b = x_{i,1} b_1 + ··· + x_{i,p} b_p.

The vector of residuals is r ≡ r(b) = Y − Ŷ(b). Thus the ith residual r_i ≡ r_i(b) = Y_i − Ŷ_i(b) = Y_i − x_{i,1} b_1 − ··· − x_{i,p} b_p. A response plot for MLR is a plot of Ŷ_i versus Y_i. A residual plot is a plot of Ŷ_i versus r_i. If the e_i are iid from a unimodal distribution that is not highly skewed, the plotted points should scatter about the identity line and the r = 0 line.
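A small R sketch of the response and residual plots for an OLS fit, using simulated data for illustration:

# Response plot: fitted values vs Y with the identity line;
# residual plot: fitted values vs residuals with the r = 0 line.
set.seed(1)
n <- 100; x <- matrix(rnorm(n * 3), n, 3)
y <- 1 + x[, 1] + rnorm(n)
fit <- lm(y ~ x)
par(mfrow = c(1, 2))
plot(fitted(fit), y, xlab = "YHAT", ylab = "Y"); abline(0, 1)
plot(fitted(fit), resid(fit), xlab = "YHAT", ylab = "r"); abline(h = 0)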


4)

Label                        coef   SE        shorth 95% CI for βi
Constant = intercept = x1    β̂1     SE(β̂1)    [L1, U1]
x2                           β̂2     SE(β̂2)    [L2, U2]
...
xp                           β̂p     SE(β̂p)    [Lp, Up]

The classical OLS large sample 95% CI for βi is β̂i ± 1.96 SE(β̂i). Consider testing H0: βi = 0 versus HA: βi ≠ 0. If 0 ∈ CI for βi, then fail to reject H0, and conclude xi is not needed in the MLR model given the other predictors are in the model. If 0 ∉ CI for βi, then reject H0, and conclude xi is needed in the MLR model.

5) Let x_i^T = (1 u_i^T). It is often convenient to use the centered response Z = Y − Ȳ where Ȳ = Ȳ 1, and the n × (p−1) matrix of standardized nontrivial predictors W = (W_ij). For j = 1, ..., p−1, let W_ij denote the (j+1)th variable standardized so that Σ_{i=1}^n W_ij = 0 and Σ_{i=1}^n W_ij² = n. Then the sample correlation matrix of the nontrivial predictors u_i is

R_u = W^T W / n.

Then regression through the origin is used for the model Z = Wη + e where the vector of fitted values Ŷ = Ȳ + Ẑ. Thus the centered response Z_i = Y_i − Ȳ and Ŷ_i = Ẑ_i + Ȳ. Then η does not depend on the units of measurement of the predictors. Linear combinations of the u_i can be written as linear combinations of the x_i, hence β̂ can be found from η̂.

6) A model for variable selection is x^T β = x_S^T β_S + x_E^T β_E = x_S^T β_S where x = (x_S^T, x_E^T)^T, x_S is an a_S × 1 vector, and x_E is a (p − a_S) × 1 vector. Let x_I be the vector of a terms from a candidate subset indexed by I, and let x_O be the vector of the remaining predictors (out of the candidate submodel). If S ⊆ I, then x^T β = x_S^T β_S = x_S^T β_S + x_{I/S}^T β_{(I/S)} + x_O^T 0 = x_I^T β_I where x_{I/S} denotes the predictors in I that are not in S. Since this is true regardless of the values of the predictors, β_O = 0 if S ⊆ I. Note that β_E = 0. Let k_S = a_S − 1 = the number of population active nontrivial predictors. Then k = a − 1 is the number of active predictors in the candidate submodel I.

7) Use Z_n ∼ AN_g(μ_n, Σ_n) to indicate that a normal approximation is used: Z_n ≈ N_g(μ_n, Σ_n). Let a be a constant, let A be a k × g constant matrix, and let c be a k × 1 constant vector. If √n(θ̂_n − θ) →D N_g(0, V), then aZ_n = a I_g Z_n with A = a I_g,

aZ_n ∼ AN_g(a μ_n, a² Σ_n), and A Z_n + c ∼ AN_k(A μ_n + c, A Σ_n A^T),

θ̂_n ∼ AN_g(θ, V/n), and A θ̂_n + c ∼ AN_k(Aθ + c, A V A^T / n).

8) LS CLT: Consider the MLR model Y_i = x_i^T β + e_i and assume that the zero mean errors are iid with E(e_i) = 0 and VAR(e_i) = σ². Assume p is fixed and n → ∞. Also assume that max h_i →P 0 as n → ∞ and

X^T X / n → V^{−1}

as n → ∞. Then the least squares (OLS) estimator β̂ satisfies √n(β̂ − β) →D N_p(0, σ² V). Equivalently, (X^T X)^{1/2}(β̂ − β) →D N_p(0, σ² I_p).

9) Let Q(η) be a real valued function of the k × 1 vector η. The gradient of Q(η) is the k × 1 vector

∇Q = ∇Q(η) = ∂Q/∂η = ∂Q(η)/∂η = ( ∂Q(η)/∂η_1, ∂Q(η)/∂η_2, ..., ∂Q(η)/∂η_k )^T.

Suppose there is a model with unknown parameter vector η. A set of estimating equations f(η) is minimized or maximized where η is a dummy variable vector in the function f : R^k → R^k.

10) As a mnemonic (memory aid) for the following results, note that the derivative d/dx (ax) = d/dx (xa) = a and d/dx (ax²) = d/dx (xax) = 2ax.

a) If Q(η) = a^T η = η^T a for some k × 1 constant vector a, then ∇Q = a.

b) If Q(η) = η^T A η for some k × k constant matrix A, then ∇Q = 2Aη.

c) If Q(η) = Σ_{i=1}^k |η_i| = ‖η‖_1, then ∇Q = s = s_η where s_i = sign(η_i) where sign(η_i) = 1 if η_i > 0 and sign(η_i) = −1 if η_i < 0. This gradient is only defined for η where none of the k values of η_i are equal to 0.

11) Forward selection with OLS generates a sequence of M models. Let I_1 use x*_1 = x_1 ≡ 1: the model has a constant but no nontrivial predictors. To form I_2, consider all models I with two predictors including x*_1. Compute Q_2(I) = SSE(I) = RSS(I) = r^T(I) r(I) = Σ_{i=1}^n r_i²(I) = Σ_{i=1}^n (Y_i − Ŷ_i(I))². Let I_2 minimize Q_2(I) for the p−1 models I that contain x*_1 and one other predictor. Denote the predictors in I_2 by x*_1, x*_2. In general, to form I_j consider all models I with j predictors including variables x*_1, ..., x*_{j−1}. Compute Q_j(I) = r^T(I) r(I) = Σ_{i=1}^n r_i²(I) = Σ_{i=1}^n (Y_i − Ŷ_i(I))². Let I_j minimize Q_j(I) for the p − j + 1 models I that contain x*_1, ..., x*_{j−1} and one other predictor not already selected. Denote the predictors in I_j by x*_1, ..., x*_j. Continue in this manner for j = 2, ..., M to form I_1, I_2, ..., I_M. Often M = min(⌈n/J⌉, p).
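A bare-bones R sketch of the sweep in 11), adding at each step the predictor that most reduces the residual sum of squares; in practice the regsubsets function from the leaps package is much faster, and the simulated data here are only illustrative.

# Greedy forward selection: at step j add the predictor minimizing RSS.
set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 1 + x[, 1] + x[, 2] + rnorm(n)
M <- min(ceiling(n / 5), p)               # J = 5
selected <- integer(0)                    # indices of chosen nontrivial predictors
for (j in seq_len(M)) {
  candidates <- setdiff(seq_len(p), selected)
  rss <- sapply(candidates, function(i)
    sum(resid(lm(y ~ x[, c(selected, i), drop = FALSE]))^2))
  selected <- c(selected, candidates[which.min(rss)])
}
selected                                  # the order in which predictors entered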

12) For the model Y = Xβ + e, methods such as forward selection, PCR, PLS, ridge regression, relaxed lasso, and lasso each generate M fitted models I_1, ..., I_M, where M depends on the method, and we considered several methods for selecting the final submodel I_d. Only method i) needs n/p large.

i) Let I_d = I_min be the model that minimizes Cp for forward selection, relaxed lasso, or lasso.

ii) Let I_d use d = min(⌈n/J⌉, p) variables where J is a positive integer. We used J = 5, 10, 20, and 50. Forward selection used M = d. (For ridge regression, we used the model I_c with the "degrees of freedom" closest to d.) For PCR and PLS, the "variables" were the v_j = γ̂_j^T x. This method uses the OLS full model if n/p ≥ J for forward selection, PCR, PLS, ridge regression, relaxed lasso, and lasso. Hence large sample inference is simple for these six model selection estimators if p is fixed. For lasso, several values of λ may have the same degrees of freedom: we chose the model with the smallest λ value. In the simulation for lasso with d = p, we used the lasso model with λ_0 instead of OLS.

iii) Let I_d = I_min be the model that minimizes EBIC for forward selection or relaxed lasso. For forward selection, we used M = min(⌈n/5⌉, p).

iv) Choose I_d using k-fold cross validation (CV). We used 10-fold CV.

The following method is currently slow to simulate, but is a useful diagnostic. When the model underfits, PI (4.7) = PI (5.33) tends to have coverage near or greater than the nominal 0.95 coverage, but the PI length is long. When the model severely overfits, the PI length is short, but the coverage is less than 0.95.

v) Modify k-fold cross validation to compute the PI coverage and average PI length on all M models. Then n PIs are made for Y_i using x_f = x_i for i = 1, ..., n. The coverage is the proportion of times the n PIs contained Y_i. For example, choose the model I_d with the shortest average PI length given that the nominal large sample 100(1 − δ)% PI had coverage

≥ c_n = max( 1 − δ − 1/(3√n), 1 − δ − 0.02 ).

If no model I_i had coverage ≥ c_n, pick the model with the largest coverage.

13) For PI (5.33), if n ≥ 10, then using d = p works fairly well since MLR methods tend to be inflexible with d ≤ p.

14) Consider choosing η̂ to minimize the criterion

Q(η) = (1/a)(Z − Wη)^T (Z − Wη) + (λ_{1,n}/a) Σ_{i=1}^{p−1} |η_i|^j      (5.34)

where λ_{1,n} ≥ 0, a > 0, and j > 0 are known constants. Then j = 2 corresponds to ridge regression, j = 1 corresponds to lasso, and a = 1, 2, n, and 2n are common. The residual sum of squares RSS_W(η) = (Z − Wη)^T (Z − Wη), and λ_{1,n} = 0 corresponds to the OLS estimator η̂_OLS = (W^T W)^{−1} W^T Z.

15) a) The OLS full model tends to be useful if n ≥ 10p with large sample theory better than that of lasso, ridge regression, and elastic net. Testing is easier and the Olive (2007) PI tailored to the OLS full model will work better for smaller sample sizes than PI (2.7) if n ≥ 10p. If n ≥ 10p but X^T X is singular or ill conditioned, other methods can perform better.

If n ≤ p then OLS interpolates the data and is a poor method. If n = Jp, then as J decreases from 10 to 1, other methods become competitive.

b) If n ≥ 10p and k_S < p − 1, then forward selection can give more precise inference than the OLS full model. When n/p is small, the PI (2.7) for forward selection can perform well if n/k_S is large. Forward selection can be worse than ridge regression or elastic net if k_S > min(n/J, p). Forward selection can be too slow if both n and p are large.

c) If n ≥ 10p, lasso can be better than the OLS full model if X^T X is ill conditioned. Lasso seems to perform best if k_S is not much larger than 10 or if the nontrivial predictors are orthogonal or uncorrelated. Lasso can be outperformed by ridge regression or elastic net if k_S > min(n, p − 1).

d) If n ≥ 10p, ridge regression and elastic net can be better than the OLS full model if X^T X is ill conditioned. Ridge regression (and likely elastic net) seems to perform best if k_S is not much larger than 10 or if the nontrivial predictors are orthogonal or uncorrelated. Ridge regression and elastic net can outperform lasso if k_S > min(n, p − 1).

e) The PLS PI (2.7) can perform well if n ≥ 10p, but PLS is inconsistent if n/p is not large. Ridge regression tends to be inconsistent unless P(d → p) → 1 so that ridge regression is asymptotically equivalent to the OLS full model.

16) Under strong regularity conditions, lasso and relaxed lasso with k-fold CV, and forward selection with EBIC can perform well even if n/p is small. So PI (5.33) can be useful when n/p is small.

17) Suppose Q(η) = RSS(η) + λ1‖η‖_2² + λ2‖η‖_1. Then by standard Karush-Kuhn-Tucker conditions for convex optimality, η̂ is optimal (minimizes Q(η)) if

2W^T W η̂ − 2W^T Z + 2λ1 η̂ + λ2 s_n = 0, or

(W^T W + λ1 I_{p−1}) η̂ = W^T Z − (λ2/2) s_n, or

η̂ = η̂_R − n(W^T W + λ1 I_{p−1})^{−1} (λ2/(2n)) s_n,

where η̂_R = (W^T W + λ1 I_{p−1})^{−1} W^T Z is the ridge regression type solution of the second display with λ2 = 0. This result is useful for finding the CLT for elastic net, the CLT for ridge regression if λ2 = 0, and the CLT for lasso if λ1 = 0. Note that for a k × 1 vector η, the squared (Euclidean) L2 norm ‖η‖_2² = η^T η = Σ_{i=1}^k η_i² and the L1 norm ‖η‖_1 = Σ_{i=1}^k |η_i|.
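A small numeric R check of the first display of 17) in the ridge case λ2 = 0: the closed form solution makes the gradient 2W^T W η − 2W^T Z + 2λ1 η vanish. The standardization and tuning value below are illustrative.

# Verify the KKT stationarity condition for ridge regression (lambda2 = 0).
set.seed(1)
n <- 100; pm1 <- 5
W <- scale(matrix(rnorm(n * pm1), n, pm1))     # standardized nontrivial predictors
Z <- as.vector(scale(W %*% rep(1, pm1) + rnorm(n), scale = FALSE))  # centered response
lambda1 <- 3
etahat <- solve(t(W) %*% W + lambda1 * diag(pm1), t(W) %*% Z)
grad <- 2 * t(W) %*% W %*% etahat - 2 * t(W) %*% Z + 2 * lambda1 * etahat
max(abs(grad))     # essentially zero, so etahat satisfies the stationarity condition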


5.15 Complements

Good references for forward selection, PCR, PLS, ridge regression, and lasso are Hastie et al. (2009, 2015), James et al. (2013), Pelawa Watagoda (2017a) and Pelawa Watagoda and Olive (2019b). Also see Efron and Hastie (2016). Ye (1998) and Tibshirani (2015) are good references for degrees of freedom. Under strong regularity conditions, Gunst and Mason (1980, ch. 10) covers inference for ridge regression (and a modified version of PCR) when the iid errors e_i ∼ N(0, σ²).

Xu et al. (2011) note that sparse algorithms are not stable. Belsley (1984) shows that centering can mask ill conditioning of X.

Classical principal component analysis based on the correlation matrix can be done using the singular value decomposition (SVD) of the scaled matrix W_S = W_g/√(n−1) using e_i and λ_i = σ_i² where λ_i = λ_i(W_S^T W_S) is the ith eigenvalue of W_S^T W_S. Here the scaling is using g = 1. For more information about the SVD, see Datta (1995, pp. 552-556) and Fogel et al. (2013).

There is massive literature on variable selection and a fairly large literature for inference after variable selection. See, for example, Bertsimas et al. (2016), Fan and Lv (2010), Ferrari and Yang (2015), Fithian et al. (2014), Hjort and Claeskins (2003), Knight and Fu (2000), Lee et al. (2016), Leeb and Potscher (2005, 2006), Lockhart et al. (2014), Qi et al. (2015), and Tibshirani et al. (2016).

When n/p is not large, inference is currently much more difficult. Under strong regularity conditions, lasso and forward selection with EBIC can work well. Leeb et al. (2015) suggests that the Berk et al. (2013) method does not really work. Also see Dezeure et al. (2015), Javanmard and Montanari (2014), Lu et al. (2017), Tibshirani et al. (2016), van de Geer et al. (2014), and Zhang and Cheng (2017). Fan and Lv (2010) gave large sample theory for some methods if p = o(n^{1/5}). See Tibshirani et al. (2016) for an R package.

Warning: For n < 5p, every estimator is unreliable, to my knowledge. Regularity conditions for consistency are strong if they exist. For example, PLS is sometimes inconsistent and sometimes √n consistent. Validating the MLR estimator with PIs can help. Also make response and residual plots.

Full OLS Model: A sufficient condition for β̂_OLS to be a consistent estimator of β is Cov(β̂_OLS) = σ²(X^T X)^{−1} → 0 as n → ∞. See Lai et al. (1979).

Forward Selection: See Olive and Hawkins (2005) and Pelawa Watagoda and Olive (2019ab).

Principal Components Regression: Principal components are Karhunen-Loeve directions of centered X. See Hastie et al. (2009, p. 66). A useful PCR paper is Cook and Forzani (2008).

Partial Least Squares: PLS was introduced by Wold (1975). Also see Wold (1985, 2006). Two useful papers are Cook et al. (2013) and Cook and Su (2016). Cook and Forzani (2018a) showed PLS can be √n consistent even if n < p. Hence the Chun and Keles (2010) result, showing PLS does not give a consistent estimator of β unless p/n → 0, holds under rather strong regularity conditions. Also see Cook and Forzani (2018b). Denham (1997) suggested a PI for PLS that assumes the number of components is selected in advance.

Ridge Regression: An important ridge regression paper is Hoerl and Kennard (1970). Also see Gruber (1998). Ridge regression is known as Tikhonov regularization in the numerical analysis literature.

Lasso: Lasso was introduced by Tibshirani (1996). Efron et al. (2004) and Tibshirani et al. (2012) are important papers. Su et al. (2017) note some problems with lasso. If n/p is large, the residual bootstrap with OLS full model residuals should work for lasso, relaxed lasso, and ridge regression if λ_{1,n} = o_P(√n). Also see Knight and Fu (2000). Camponovo (2015) suggested that the nonparametric bootstrap does not work for lasso. Chatterjee and Lahiri (2011) stated that the residual bootstrap with lasso does not work. Hall et al. (2009) stated that the residual bootstrap with OLS full model residuals does not work, but the m out of n residual bootstrap with OLS full model residuals does work. Rejchel (2016) gave a good review of lasso theory. Fan and Lv (2010) reviewed large sample theory for some alternative methods. See Lockhart et al. (2014) for a partial remedy for hypothesis testing with lasso. The Ning and Liu (2017) method needs a log likelihood. Knight and Fu (2000) gave theory for fixed p.

Regularity conditions for testing are strong. Often lasso tests assume that Y and the nontrivial predictors follow a multivariate normal (MVN) distribution. For the MVN distribution, the MLR model tends to be dense not sparse if n/p is small.

Relaxed lasso: Applying OLS on the k predictors that have nonzero lasso η̂_i is called relaxed lasso. See Meinshausen (2007) for a more general method. We want n ≥ 10k.

Lasso can also be used for other estimators, such as generalized linear models (GLMs). Then relaxed lasso is the "classical estimator," such as a GLM, applied to the lasso active set. For prediction, relaxed lasso may be better than lasso.
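A short R sketch of this recipe, assuming the glmnet package: lasso chooses the active set and OLS (or a GLM) is refit on that set. The simulated data are only illustrative.

library(glmnet)
set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- 1 + x[, 1] + x[, 2] + rnorm(n)
cvfit <- cv.glmnet(x, y)                        # lasso with 10-fold CV
b <- coef(cvfit, s = "lambda.min")[-1]
act <- which(b != 0)                            # lasso active set
relax <- lm(y ~ x[, act, drop = FALSE])         # relaxed lasso = OLS on the active set
# For a GLM, replace lm by, e.g., glm(y ~ x[, act], family = binomial).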

Dense Regression or Abundant Regression: occurs when most of the predictors contribute to the regression. Hence the regression is not sparse. See Cook, Forzani, and Rothman (2013).

Other Methods: Consider the MLR model Z = Wη + e. Let λ ≥ 0 be a constant and let q ≥ 0. The estimator η̂_q minimizes the criterion

Q_q(b) = r(b)^T r(b) + λ Σ_{j=1}^{p−1} |b_j|^q,      (5.35)

over all vectors b ∈ R^{p−1} where we take 0⁰ = 0. Then q = 1 corresponds to lasso and q = 2 corresponds to ridge regression. If q = 0, the penalty λ Σ_{j=1}^{p−1} |b_j|⁰ = λk where k is the number of nonzero components of b. Hence the q = 0 estimator is often called the "best subset" estimator. See Frank and Friedman (1993). For fixed p, large sample theory is given by Knight and Fu (2000). Following Hastie et al. (2001, p. 72), the optimization problem is convex if q ≥ 1 and λ is fixed.

If n ≤ 400 and p ≤ 3000, Bertsimas et al. (2016) give a fast "all subsets" variable selection method. Lin et al. (2012) claim to have a very fast method for variable selection. Lee and Taylor (2014) suggest the marginal screening algorithm: let W be the matrix of standardized nontrivial predictors. Compute W^T Y = (c_1, ..., c_{p−1})^T and select the J variables corresponding to the J largest |c_i|. These are the J standardized variables with the largest absolute correlations with Y. Then do an OLS regression of Y on these J variables and a constant. An algorithm somewhat similar to, but much slower than, the Lin et al. (2012) algorithm follows. Let a constant x_1 be in the model, and let W = [a_1, ..., a_{p−1}] and r = Y − Ȳ. Compute W^T r and let x*_2 correspond to the variable with the largest absolute entry. Remove the corresponding a_j from W to get W_1. Let r_1 be the OLS residuals from regressing Y on x_1 and x*_2. Compute W_1^T r_1 and let x*_3 correspond to the variable with the largest absolute entry. Continue in this manner to get x_1, x*_2, ..., x*_J where J = min(p, ⌈n/7⌉). Like forward selection, evaluate the J − 1 models I_j containing the first j predictors x_1, x*_2, ..., x*_j for j = 2, ..., J with a criterion such as Cp.
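A small R sketch of the marginal screening idea described above; J and the simulated data are illustrative choices, not from the original sources.

# Marginal screening: pick the J standardized predictors with the largest
# absolute correlations with Y, then do OLS on those J variables and a constant.
set.seed(1)
n <- 100; p <- 200; J <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 1 + x[, 1] + x[, 2] + rnorm(n)
W <- scale(x)                           # standardized nontrivial predictors
cvec <- as.vector(t(W) %*% y)           # c_1, ..., c_{p-1}
keep <- order(abs(cvec), decreasing = TRUE)[1:J]
fit <- lm(y ~ x[, keep])                # OLS on the J screened variables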

Gao and Huang (2010) give theory for a LAD-lasso estimator, and Qi et al. (2015) is an interesting lasso competitor.

Multivariate linear regression has m ≥ 2 response variables. See Olive (2017ab: ch. 12). PLS also works if m ≥ 1, and methods like ridge regression and lasso can also be extended to multivariate linear regression. See, for example, Haitovsky (1987) and Obozinski et al. (2011). Sparse envelope models are given in Su et al. (2016).

AIC and BIC Type Criterion: Olive and Hawkins (2005) and Burnham and Anderson (2004) are useful references when p is fixed. Some interesting theory for AIC appears in Zhang (1992). Zheng and Loh (1995) show that BICS can work if p = p_n = o(log(n)) and there is a consistent estimator of σ². For the Cp criterion, see Jones (1946) and Mallows (1973).

AIC and BIC type criteria and variable selection for high dimensional regression are discussed in Chen and Chen (2008), Fan and Lv (2010), Fujikoshi et al. (2014), and Luo and Chen (2013). Wang (2009) suggests using

WBIC(I) = log[SSE(I)/n] + n^{−1} |I| [log(n) + 2 log(p)].


See Bogdan et al. (2004), Cho and Fryzlewicz (2012), and Kim et al. (2012). Luo and Chen (2013) state that WBIC(I) needs p/n^a < 1 for some 0 < a < 1.
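A tiny R helper computing WBIC(I) from an lm fit of submodel I, where |I| is taken to be the number of estimated coefficients in I and p is the number of predictors in the full model; this reading of |I| is an assumption, not stated by the cited papers.

# WBIC(I) = log[SSE(I)/n] + |I| [log(n) + 2 log(p)] / n
wbic <- function(fitI, p) {
  n <- length(resid(fitI))
  sse <- sum(resid(fitI)^2)
  cardI <- length(coef(fitI))            # assumed: |I| = number of coefficients in I
  log(sse / n) + cardI * (log(n) + 2 * log(p)) / n
}
# Example: wbic(lm(y ~ x[, 1:3]), p = ncol(x) + 1)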

If n/p is large and one of the models being considered is the true model, then BIC tends to outperform AIC. If none of the models being considered is the true model, then AIC tends to outperform BIC. See Yang (2003).

Robust Versions: Hastie et al. (2015, pp. 26-27) discuss some modifications of lasso that are robust to certain types of outliers. Robust methods for forward selection and LARS that need n >> p are given by Uraibi et al. (2017, 2018). If n is not much larger than p, then Hoffman et al. (2015) have a robust Partial Least Squares-Lasso type estimator that uses a clever weighting scheme.

A simple method to make an MLR method robust to certain types of outliers is to find the covmb2 set B of Definition 1.16 applied to the quantitative predictors. Then use the MLR method (such as elastic net, lasso, PLS, PCR, ridge regression, or forward selection) applied to the cases corresponding to the x_j in B. Make a response and residual plot, based on the robust estimator β̂_B, using all n cases.

Prediction Intervals: Lei et al. (2017) and Wasserman (2014) suggested prediction intervals for estimators such as lasso. The method has interesting theory if the (x_i, Y_i) are iid from some population. Also see Butler and Rothman (1980). Steinberger and Leeb (2016) used leave-one-out residuals, but delete the upper and lower 2.5% of the residuals to make a 95% PI. Hence the PI will have undercoverage and the shorth PI will tend to be shorter when the error distribution is not symmetric. See Pelawa Watagoda (2017ab).

Degrees of Freedom: A formula for the model degrees of freedom d tends to be given for a model when there is no model selection or variable selection. Using the d after model selection or variable selection gives a false degrees of freedom. See Jansen et al. (2007) and Zou et al. (2007).

Scaled Shrinkage Estimators: For multiple linear regression, shrinkage estimators often shrink β̂ and the ESP too much. See Figure 1.9b for ridge regression. Suppose Y = β1 + β2 x2 + ··· + β101 x101 + e = x2 + e with n = 100 and p = 101. This model is sparse and lasso performs well, similar to Figure 1.9a. Ridge regression shrinks too much, but Ŷ is highly correlated with Y. This suggests regressing Y on Ŷ to get Y = a + b Ŷ + ε. Then the rescaled fit is Ŷ_2 = X β̂_2 where β̂_{i,2} = b̂ β̂_{i,M} for i = 2, ..., p and β̂_{1,2} = â + b̂ β̂_{1,M}, and M is the shrinkage method such as ridge regression. If b̂ ≈ 1 or if the response plot using shrinkage method M looks good (the plotted points are linear and cover the identity line), then the improvement is not needed. Also see "Harder Projects" 3) in Chapter 10.

This technique greatly improves the appearance of the response plot and the prediction intervals on the training data.
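A sketch in R of the scaling step for a ridge regression fit obtained with glmnet; the true model and tuning choice are illustrative.

library(glmnet)
set.seed(1)
n <- 100; p <- 101
x <- matrix(rnorm(n * (p - 1)), n, p - 1)
y <- x[, 1] + rnorm(n)                           # Y = x2 + e, a sparse model
cvfit <- cv.glmnet(x, y, alpha = 0)              # ridge regression
yhatM <- as.vector(predict(cvfit, newx = x, s = "lambda.min"))
sc <- lm(y ~ yhatM)                              # regress Y on the shrinkage fit
a <- coef(sc)[1]; b <- coef(sc)[2]
betaM <- as.vector(coef(cvfit, s = "lambda.min"))
beta2 <- c(a + b * betaM[1], b * betaM[-1])      # rescaled coefficients
yhat2 <- cbind(1, x) %*% beta2                   # equals a + b * yhatM
plot(yhat2, y); abline(0, 1)                     # response plot with identity line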

Inference After Model Selection


Mixture distributions can be useful for inference. Suppose that T_n is equal to the estimator T_{kn} with probability π_{kn} for k = 1, ..., M where Σ_k π_{kn} = 1 and π_{kn} →P π_k as n → ∞. Then T_n has a mixture distribution of the T_{kn} with probabilities π_{kn}, and the cdf of T_n is F_{T_n}(z) = Σ_k π_{kn} F_{T_{kn}}(z) where F_{T_{kn}}(z) is the cdf of T_{kn}.

Denote the nonzero π_k by π_j and assume u_{jn} = √n(T_{jn} − β) →D u_j with E(u_j) = μ_j and Cov(u_j) = Σ_j. Hence √n(T_n − β) has a mixture distribution of the √n(T_{kn} − β) with probabilities π_{kn}, and

√n(T_n − β) →D u      (5.36)

where the cdf of u is F_u(z) = Σ_j π_j F_{u_j}(z) and F_{u_j}(z) is the cdf of u_j. Thus u is a mixture distribution of the u_j with probabilities π_j, E(u) = Σ_j π_j μ_j, and Cov(u) = Σ_u = Σ_j π_j Σ_j + Σ_j π_j μ_j μ_j^T − E(u)[E(u)]^T.

For example, suppose T_n is the lasso, ridge regression, or elastic net (with α̂ →P ψ) estimator that uses λ̂ selected by a method such as 10-fold CV from a grid of 0 ≤ λ_{1n} < λ_{2n} < ··· < λ_{Mn}. If a λ_{jn} is such that λ_{jn}/√n → ∞, then √n(T_n − η) does not have a limiting distribution u. If the π_j correspond to λ_{jn} such that λ_{jn}/√n →P τ_j, then u is a mixture distribution of N_{p−1}(μ_j, σ²V) distributions since Σ_j ≡ σ²V. If all of the τ_j = τ_k, then u ∼ N_{p−1}(μ_k, σ²V), and if τ_k = 0 then u ∼ N_{p−1}(0, σ²V) so T_n is asymptotically equivalent to the OLS estimator.

For relaxed lasso with zero padding, if I_λ is the subset chosen, and P(S ⊆ I_λ) → 1 as n → ∞, then u is a mixture of N_{p−1}(0, σ²V_{j,0}) distributions, as was the case for the OLS forward selection estimator that used Cp.

The problem with this theory for lasso, ridge regression, and elastic net is that theory for λ̂ from 10-fold CV is needed. For example, if λ_{in} = iλ_{Mn}/M for i = 1, ..., M, then the lasso estimator is not consistent since λ_{in}/n does not go to 0 as n → ∞. (Note that λ_{Mn} produces η̂ = 0, which is not a consistent estimator of η. Hence λ_{Mn}/n does not go to 0 as n → ∞.)

5.16 Problems

5.1. For ridge regression, suppose V = ρ_u^{−1}. Show that if p/n and λ/n = λ_{1,n}/n are both small, then

η̂_R ≈ η̂_OLS − (λ/n) V η̂_OLS.

5.2. Consider choosing η̂ to minimize the criterion

Q(η) = (1/a)(Z − Wη)^T (Z − Wη) + (λ_{1,n}/a) Σ_{i=1}^{p−1} |η_i|^j

where λ_{1,n} ≥ 0, a > 0, and j > 0 are known constants. Consider the regression methods OLS, forward selection, lasso, PLS, PCR, ridge regression, and relaxed lasso.
a) Which method corresponds to j = 1?
b) Which method corresponds to j = 2?
c) Which method corresponds to λ_{1,n} = 0?

5.3. For ridge regression, let A_n = (W^T W + λ_{1,n} I_{p−1})^{−1} W^T W and B_n = [I_{p−1} − λ_{1,n}(W^T W + λ_{1,n} I_{p−1})^{−1}]. Show A_n − B_n = 0.

5.4. Suppose Ŷ = HY where H is an n × n hat matrix. Then the degrees of freedom df(Ŷ) = tr(H) = sum of the diagonal elements of H. An estimator with low degrees of freedom is inflexible while an estimator with high degrees of freedom is flexible. If the degrees of freedom is too low, the estimator tends to underfit while if the degrees of freedom is too high, the estimator tends to overfit.

a) Find df(Ŷ) if Ŷ = Ȳ 1 which uses H = (h_ij) where h_ij ≡ 1/n for all i and j. This inflexible estimator uses the sample mean Ȳ of the response variable as Ŷ_i for i = 1, ..., n.

b) Find df(Ŷ) if Ŷ = Y = I_n Y which uses H = I_n where h_ii = 1. This bad flexible estimator interpolates the response variable.

Table 5.3 Simulated Large Sample 95% PI Coverages and Lengths, k=19

                  lasso 10-fold CV                 relaxed lasso, d ≈ min(n/5, p)
                  ψ = 0          ψ = 0.5           ψ = 0           ψ = 0.5
n     p       cov    len     cov     len       cov    len      cov    len
100   20      0.984  5.612   0.992   9.140     0.983  5.693    0.984  5.707
100   40      0.981  6.100   0.995   13.842    0.980  6.231    0.971  7.110
100   100     0.962  6.583   0.999   25.811    0.973  8.428    0.961  8.782
100   200     0.931  6.807   0.9996  37.090    0.960  10.592   0.961  9.588
400   20      0.975  4.683   0.974   6.860     0.975  4.691    0.978  4.696
400   40      0.976  4.816   0.982   9.813     0.982  4.968    0.982  5.000
400   400     0.980  5.254   0.982   31.215    0.965  5.004    0.991  7.121
400   800     0.978  5.388   0.977   41.262    0.960  5.382    0.988  8.964
2000  20      0.952  4.034   0.956   5.766     0.954  4.035    0.961  4.036
2000  40      0.952  4.112   0.958   8.259     0.965  4.183    0.960  4.191
2000  2000    0.973  4.627   0.954   54.961    0.974  4.964    0.997  13.973
2000  4000    0.975  4.687   0.953   77.228    0.962  4.736

5.5. Table 5.3 shows some simulation results for lasso with 10-fold CV and relaxed lasso with d = min(n/5, p) when the iid errors e_i ∼ N(0,1) and k = 19. Then the asymptotically optimal length of the nominal 95% PI is 3.92. For ψ = 0 and fixed n, the PI lengths grew slowly as p increased. Much larger sample sizes appear to be needed for good performance when ψ = 0.5 for 10-fold CV. For N(0,1) errors, ψ = 0 and d < k, expect the PI length to be near 3.92√(k − d + 1) for forward selection if the coverage is not less than 0.95. Did relaxed lasso with d ≈ min(n/5, p) work well?

5.6. Suppose Y = Xβ + e, Z = Wη + e, Ẑ = Wη̂, Z = Y − Ȳ, and Ŷ = Ẑ + Ȳ. Let the n × p matrix W_1 = [1 W] and the p × 1 vector η̂_1 = (Ȳ  η̂^T)^T where the scalar Ȳ is the sample mean of the response variable. Show Ŷ = W_1 η̂_1.

5.7. Let Z = Y − Ȳ where Ȳ = Ȳ 1, and the n × (p−1) matrix of standardized nontrivial predictors G = (G_ij). For j = 1, ..., p−1, let G_ij denote the (j+1)th variable standardized so that Σ_{i=1}^n G_ij = 0 and Σ_{i=1}^n G_ij² = 1. Note that the sample correlation matrix of the nontrivial predictors u_i is R_u = G^T G. Then regression through the origin is used for the model

Z = Gη + e      (5.37)

where the vector of fitted values Ŷ = Ȳ + Ẑ. The standardization differs from that used for earlier regression models (see Remark 5.1), since Σ_{i=1}^n G_ij² = 1 ≠ n = Σ_{i=1}^n W_ij². Note that

G = (1/√n) W.

Following Zou and Hastie (2005), the naive elastic net η̂_N estimator is the minimizer of

Q_N(η) = RSS(η) + λ*_2 ‖η‖_2² + λ*_1 ‖η‖_1      (5.38)

where λ*_i ≥ 0. The term "naive" is used because the elastic net estimator is better. Let τ = λ*_2/(λ*_1 + λ*_2), γ = λ*_1/√(1 + λ*_2), and η_A = √(1 + λ*_2) η. Let the (n+p−1) × (p−1) augmented matrix G_A and the (n+p−1) × 1 augmented response vector Z_A be defined by

G_A = [ G ; √(λ*_2) I_{p−1} ] (G stacked on top of √(λ*_2) I_{p−1}), and Z_A = [ Z ; 0 ],

where 0 is the (p−1) × 1 zero vector. Let η̂_A = √(1 + λ*_2) η̂ be obtained from the lasso of Z_A on G_A: that is, η̂_A minimizes

Q_N(η_A) = ‖Z_A − G_A η_A‖_2² + γ‖η_A‖_1 = Q_N(η).

Prove Q_N(η_A) = Q_N(η).

(Then η̂_N = (1/√(1 + λ*_2)) η̂_A and η̂_EN = √(1 + λ*_2) η̂_A = (1 + λ*_2) η̂_N. The above elastic net estimator minimizes the criterion

Q_G(η) = η^T G^T G η / (1 + λ*_2) − 2 Z^T G η + (λ*_2/(1 + λ*_2)) ‖η‖_2² + λ*_1 ‖η‖_1,

and hence is not the elastic net estimator corresponding to Equation (5.29).)

5.8. Let β = (β1, β_S^T)^T. Consider choosing β̂ to minimize the criterion

Q(β) = RSS(β) + λ1‖β_S‖_2² + λ2‖β_S‖_1

where λ_i ≥ 0 for i = 1, 2.
a) Which values of λ1 and λ2 correspond to ridge regression?
b) Which values of λ1 and λ2 correspond to lasso?
c) Which values of λ1 and λ2 correspond to elastic net?
d) Which values of λ1 and λ2 correspond to the OLS full model?

5.9. For the output below, an asterisk means the variable is in the model. All models have a constant, so model 1 contains a constant and mmen.
a) List the variables, including a constant, that models 2, 3, and 4 contain.
b) The term out$cp lists the Cp criterion. Which model (1, 2, 3, or 4) is the minimum Cp model Imin?
c) Suppose β̂_Imin = (241.5445, 1.001)^T. What is β̂_{Imin,0}?

Selection Algorithm: forward #output for Problem 5.9

pop mmen mmilmen milwmn

1 ( 1 ) " " "*" " " " "

2 ( 1 ) " " "*" "*" " "

3 ( 1 ) "*" "*" "*" " "

4 ( 1 ) "*" "*" "*" "*"

out$cp

[1] -0.8268967 1.0151462 3.0029429 5.0000000

5.10. Consider the output above Example 4.8 for the OLS full model. The column resboot gives the large sample 95% CI for βi using the shorth applied to the β̂*_{ij} for j = 1, ..., B using the residual bootstrap. The standard large sample 95% CI for βi is β̂i ± 1.96 SE(β̂i). Hence for β2 corresponding to L, the standard large sample 95% CI is −0.001 ± 1.96(0.002) = −0.001 ± 0.00392 = [−0.00492, 0.00292] while the shorth 95% CI is [−0.005, 0.004].

a) Compute the standard 95% CIs for βi corresponding to W, H, and S. Also write down the shorth 95% CI. Are the standard and shorth 95% CIs fairly close?

b) Consider testing H0: βi = 0 versus HA: βi ≠ 0. If the corresponding 95% CI for βi does not contain 0, then reject H0 and conclude that the predictor variable Xi is needed in the MLR model. If 0 is in the CI then fail to reject H0 and conclude that the predictor variable Xi is not needed in the MLR model given that the other predictors are in the MLR model.

Which variables, if any, are needed in the MLR model? Use the standard CI if the shorth CI gives a different result. The nontrivial predictor variables are L, W, H, and S.

5.11. Tremearne (1911) presents a data set of about 17 measurements on 112 people of Hausa nationality. We used Y = height. Along with a constant x_{i,1} ≡ 1, the five additional predictor variables used were x_{i,2} = height when sitting, x_{i,3} = height when kneeling, x_{i,4} = head length, x_{i,5} = nasal breadth, and x_{i,6} = span (perhaps from left hand to right hand). The output below is for the OLS full model.

Estimate Std.Err 95% shorth CI

Intercept -77.0042 65.2956 [-208.864,55.051]

X2 0.0156 0.0992 [-0.177, 0.217]

X3 1.1553 0.0832 [ 0.983, 1.312]

X4 0.2186 0.3180 [-0.378, 0.805]

X5 0.2660 0.6615 [-1.038, 1.637]

X6 0.1396 0.0385 [0.0575, 0.217]

a) Give the shorth 95% CI for β2.
b) Compute the standard 95% CI for β2.
c) Which variables, if any, are needed in the MLR model given that the other variables are in the model?

Now we use forward selection and Imin is the minimum Cp model.

Estimate Std.Err 95% shorth CI

Intercept -42.4846 51.2863 [-192.281, 52.492]

X2 0 [ 0.000, 0.268]

X3 1.1707 0.0598 [ 0.992, 1.289]

X4 0 [ 0.000, 0.840]

X5 0 [ 0.000, 1.916]

X6 0.1467 0.0368 [ 0.0747, 0.215]

(Intercept) a b c d e

1 TRUE FALSE TRUE FALSE FALSE FALSE

2 TRUE FALSE TRUE FALSE FALSE TRUE

3 TRUE FALSE TRUE TRUE FALSE TRUE

4 TRUE FALSE TRUE TRUE TRUE TRUE

5 TRUE TRUE TRUE TRUE TRUE TRUE

> tem2$cp

[1] 14.389492 0.792566 2.189839 4.024738 6.000000

d) What is the value of Cp(Imin) and what is β̂_{Imin,0}?
e) Which variables, if any, are needed in the MLR model given that the other variables are in the model?


f) List the variables, including a constant, that model 3 contains.

5.12. Table 5.4 below shows simulation results for bootstrapping OLS (reg) and forward selection (vs) with Cp when β = (1, 1, 0, 0, 0)^T. The βi columns give coverage = the proportion of CIs that contained βi and the average length of the CI. The test is for H0 : (β3, β4, β5)^T = 0 and H0 is true. The "coverage" is the proportion of times the prediction region method bootstrap test failed to reject H0. Since 1000 runs were used, a cov in [0.93, 0.97] is reasonable for a nominal value of 0.95. Output is given for three different error distributions. If the coverage for both methods ≥ 0.93, the method with the shorter average CI length was more precise. (If one method had coverage ≥ 0.93 and the other had coverage < 0.93, we will say the method with coverage ≥ 0.93 was more precise.)

a) For β3, β4, and β5, which method, forward selection or the OLS full model, was more precise?

Table 5.4 Bootstrapping Forward Selection, n = 100, p = 5, ψ = 0, B = 1000

          β1     β2     β3     β4     β5     test
reg  cov  0.95   0.93   0.93   0.93   0.94   0.93
     len  0.658  0.672  0.673  0.674  0.674  2.861
vs   cov  0.95   0.94   0.998  0.998  0.999  0.993
     len  0.661  0.679  0.546  0.548  0.544  3.11
reg  cov  0.96   0.93   0.94   0.96   0.93   0.94
     len  0.229  0.230  0.229  0.231  0.230  2.787
vs   cov  0.95   0.94   0.999  0.997  0.999  0.995
     len  0.228  0.229  0.185  0.187  0.186  3.056
reg  cov  0.94   0.94   0.95   0.94   0.94   0.93
     len  0.393  0.398  0.399  0.399  0.398  2.839
vs   cov  0.94   0.95   0.997  0.997  0.996  0.990
     len  0.392  0.400  0.320  0.322  0.321  3.077

b) The test "length" is the average length of the interval [0, D_(U_B)] = D_(U_B) where the test fails to reject H0 if D0 ≤ D_(U_B). The OLS full model is asymptotically normal, and hence for large enough n and B the reg len row for the test column should be near √χ²_{3,0.95} = 2.795.

Were the three values in the test column for reg within 0.1 of 2.795?

5.13. Suppose the MLR model Y = Xβ + e, and the regression method fits Z = Wη + e. Suppose Ẑ = 245.63 and Ȳ = 105.37. What is Ŷ?

5.14. To get a large sample 90% PI for a future value Y_f of the response variable, find a large sample 90% PI for a future residual and add Ŷ_f to the endpoints of that PI. Suppose forward selection is used and the large sample 90% PI for a future residual is [−778.28, 1336.44]. What is the large sample 90% PI for Y_f if β̂_Imin = (241.545, 1.001)^T used a constant and the predictor mmen with corresponding x_{Imin,f} = (1, 75000)^T?


5.15. Table 5.5 below shows simulation results for bootstrapping OLS (reg), lasso, and ridge regression (RR) with 10-fold CV when β = (1, 1, 0, 0)^T. The βi columns give coverage = the proportion of CIs that contained βi and the average length of the CI. The test is for H0 : (β3, β4)^T = 0 and H0 is true. The "coverage" is the proportion of times the prediction region method bootstrap test failed to reject H0. OLS used 1000 runs while 100 runs were used for lasso and ridge regression. Since 100 runs were used, a cov in [0.89, 1] is reasonable for a nominal value of 0.95. If the coverage for both methods ≥ 0.89, the method with the shorter average CI length was more precise. (If one method had coverage ≥ 0.89 and the other had coverage < 0.89, we will say the method with coverage ≥ 0.89 was more precise.) The results for the lasso test were omitted since sometimes S*_T was singular. (Lengths for the test column are not comparable unless the statistics have the same asymptotic distribution.)

Table 5.5 Bootstrapping lasso and RR, n = 100, ψ = 0.9, p = 4, B = 250

            β1     β2     β3     β4     test
reg    cov  0.942  0.951  0.949  0.943  0.943
       len  0.658  5.447  5.444  5.438  2.490
RR     cov  0.97   0.02   0.11   0.10   0.05
       len  0.681  0.329  0.334  0.334  2.546
reg    cov  0.947  0.955  0.950  0.951  0.952
       len  0.658  5.511  5.497  5.500  2.491
lasso  cov  0.93   0.91   0.92   0.99
       len  0.698  3.765  3.922  3.803

a) For β3 and β4, which method, ridge regression or the OLS full model, was better?

b) For β3 and β4, which method, lasso or the OLS full model, was more precise?

5.16. Suppose n = 15 and 5-fold CV is used. Suppose observations are measured for the following people. Use the output below to determine which people are in the first fold.

folds: 4 3 4 2 1 4 3 5 2 2 3 1 5 5 1

1) Athapattu, 2) Azizi, 3) Cralley, 4) Gallage, 5) Godbold, 6) Gunawardana, 7) Houmadi, 8) Mahappu, 9) Pathiravasan, 10) Rajapaksha, 11) Ranaweera, 12) Safari, 13) Senarathna, 14) Thakur, 15) Ziedzor

5.17. Table 5.6 below shows simulation results for a large sample 95% prediction interval. Since 5000 runs were used, a cov in [0.94, 0.96] is reasonable for a nominal value of 0.95. If the coverage for a method ≥ 0.94, the method with the shorter average PI length was more precise. Ignore methods with cov < 0.94. The MLR model had β = (1, 1, ..., 1, 0, ..., 0)^T where the first k+1 coefficients were equal to 1. If ψ = 0 then the nontrivial predictors were uncorrelated, but highly correlated if ψ = 0.9.

Table 5.6 Simulated Large Sample 95% PI Coverages and Lengths, ei ∼ N(0,1)

n    p    ψ    k         FS      lasso   RL      RR      PLS     PCR
100  40   0    1    cov  0.9654  0.9774  0.9588  0.9274  0.8810  0.9882
                    len  4.4294  4.8889  4.6226  4.4291  4.0202  7.3393
400  400  0.9  19   cov  0.9348  0.9636  0.9556  0.9632  0.9462  0.9478
                    len  4.3687  47.361  4.8530  48.021  4.2914  4.4764

a) Which method was most precise, given cov ≥ 0.94, when n = 100?
b) Which method was most precise, given cov ≥ 0.94, when n = 400?

5.18. When doing a PI or CI simulation for a nominal 100(1 − δ)% = 95% interval, there are m runs. For each run, a data set and interval are generated, and for the ith run Y_i = 1 if μ or Y_f is in the interval, and Y_i = 0, otherwise. Hence the Y_i are iid Bernoulli(1 − δ_n) random variables where 1 − δ_n is the true probability (true coverage) that the interval will contain μ or Y_f. The observed coverage (= coverage) in the simulation is Ȳ = Σ_i Y_i/m. The variance V(Ȳ) = σ²/m where σ² = (1 − δ_n)δ_n ≈ (1 − δ)δ ≈ (0.95)(0.05) if δ_n ≈ δ = 0.05. Hence

SD(Ȳ) ≈ √( 0.95(0.05)/m ).

If the (observed) coverage is within 0.95 ± k SD(Ȳ) where the integer k is near 3, then there is no reason to doubt that the actual coverage 1 − δ_n differs from the nominal coverage 1 − δ = 0.95 if m ≥ 1000 (and as a crude benchmark, for m ≥ 100). In the simulation, the length of each interval is computed, and the average length is computed. For intervals with coverage ≥ 0.95 − k SD(Ȳ), intervals with shorter average length are better (have more precision).

a) If m = 5000 what is 3 SD(Ȳ), using the above approximation? Your answer should be close to 0.01.

b) If m = 1000 what is 3 SD(Ȳ), using the above approximation?

R Problem

Use the command source("G:/slpack.txt") to download the functions and the command source("G:/sldata.txt") to download the data. See Preface or Section 11.1. Typing the name of the slpack function, e.g. vsbootsim3, will display the code for the function. Use the args command, e.g. args(vsbootsim3), to display the needed arguments for the function. For the following problem, the R command can be copied and pasted from (http://lagrange.math.siu.edu/Olive/slrhw.txt) into R.

5.19. The R program generates data satisfying the MLR model


Y = β1 + β2x2 + β3x3 + β4x4 + e

where β = (β1, β2, β3, β4)^T = (1, 1, 0, 0)^T.

a) Copy and paste the commands for this part into R. The output gives β̂_OLS for the OLS full model. Give β̂_OLS. Is β̂_OLS close to β = (1, 1, 0, 0)^T?

b) The commands for this part bootstrap the OLS full model using the residual bootstrap. Copy and paste the output into Word. The output shows T*_j = β̂*_j for j = 1, ..., 5.

c) B = 1000 T*_j were generated. The commands for this part compute the sample mean T̄* of the T*_j. Copy and paste the output into Word. Is T̄* close to β̂_OLS found in a)?

d) The commands for this part bootstrap the forward selection using the residual bootstrap. Copy and paste the output into Word. The output shows T*_j = β̂*_{Imin,0,j} for j = 1, ..., 5. The last two variables may have a few 0s.

e) B = 1000 T*_j were generated. The commands for this part compute the sample mean T̄* of the T*_j where T*_j is as in d). Copy and paste the output into Word. Is T̄* close to β = (1, 1, 0, 0)^T?

5.20. This simulation is similar to that used to form Table 4.2, but 1000 runs are used so coverage in [0.93, 0.97] suggests that the actual coverage is close to the nominal coverage of 0.95.

The model is Y = x^T β + e = x_S^T β_S + e where β_S = (β1, β2, ..., β_{k+1})^T = (β1, β2)^T and k = 1 is the number of active nontrivial predictors in the population model. The output for test tests H0 : (β_{k+2}, ..., β_p)^T = (β3, ..., β_p)^T = 0 and H0 is true. The output gives the proportion of times the prediction region method bootstrap test fails to reject H0. The nominal proportion is 0.95.

After getting your output, make a table similar to Table 2.2 with 4 lines. If your p = 5 then you need to add a column for β5. Two lines are for reg (the OLS full model) and two lines are for vs (forward selection with Imin). The βi columns give the coverage and lengths of the 95% CIs for βi. If the coverage ≥ 0.93, then the shorter CI length is more precise. Were the CIs for forward selection more precise than the CIs for the OLS full model for β3 and β4?

To get the output, copy and paste the source commands from (http://lagrange.math.siu.edu/Olive/slrhw.txt) into R. Copy and paste the library command for this problem into R.

If you are person j then copy and paste the R code for person j for this problem into R.

5.21. This problem is like Problem 5.20, but ridge regression is used instead of forward selection. This simulation is similar to that used to form Table 3.2, but 100 runs are used so coverage in [0.89, 1.0] suggests that the actual coverage is close to the nominal coverage of 0.95.

The model is Y = x^T β + e = x_S^T β_S + e where β_S = (β1, β2, ..., β_{k+1})^T = (β1, β2)^T and k = 1 is the number of active nontrivial predictors in the population model. The output for test tests H0 : (β_{k+2}, ..., β_p)^T = (β3, ..., β_p)^T = 0 and H0 is true. The output gives the proportion of times the prediction region method bootstrap test fails to reject H0. The nominal proportion is 0.95.

After getting your output, make a table similar to Table 2.2 with 4 lines. If your p = 5 then you need to add a column for β5. Two lines are for reg (the OLS full model) and two lines are for ridge regression (with 10-fold CV). The βi columns give the coverage and lengths of the 95% CIs for βi. If the coverage ≥ 0.89, then the shorter CI length is more precise. Were the CIs for ridge regression more precise than the CIs for the OLS full model for β3 and β4?

To get the output, copy and paste the source commands from (http://lagrange.math.siu.edu/Olive/slrhw.txt) into R. Copy and paste the library command for this problem into R.

If you are person j then copy and paste the R code for person j for this problem into R.

5.22. This is like Problem 5.21, except lasso is used. If you are person j in Problem 5.21, then copy and paste the R code for person j for this problem into R. Make a table with 4 lines: two for OLS and 2 for lasso. Were the CIs for lasso more precise than the CIs for the OLS full model for β3 and β4?

5.23. This problem is like Problem 1.9, except elastic net is used instead of lasso.

a) Copy and paste the commands for this problem into R. Include the elastic net response plot in Word. The identity line passes right through the outliers which are obvious because of the large gap. Prediction interval (PI) bands are also included in the plot.

b) Copy and paste the commands for this problem into R. Include the elastic net response plot in Word. This did elastic net for the cases in the covmb2 set B applied to the predictors which included all of the clean cases and omitted the 5 outliers. The response plot was made for all of the data, including the outliers. (Problem 1.9 c) shows the DD plot for the data.)


Chapter 6

What if n is not >> p?

6.1 Two Set Inference

Consider a regression model with response variable Y and a p × 1 vector of predictors x. This model is the full model. Suppose there are n cases. Randomly divide the data set into two half sets. On the first half set fit the model selection method, e.g. forward selection or lasso, to get the a predictors. On the second half set, use the standard regression inference from regressing the response on the predictors found from the first half set. This method can be inefficient if n ≥ 10p, but is useful for a sparse model if n ≤ 5p, if the probability that the model underfits goes to zero, and if n ≥ 20a. A model is sparse if the number of predictors with nonzero coefficients is small.

The method is useful for many regression models when the cases are independent, including multiple linear regression, multivariate linear regression where there are m ≥ 2 response variables, generalized linear models (GLMs), the Cox (1972) proportional hazards regression model, and parametric survival regression models.

In a 1D regression model, the response variable Y is conditionally independent of the p × 1 vector of predictors x given the sufficient predictor SP = h(x), written

Y ⫫ x | SP or Y ⫫ x | h(x),      (6.1)

where the real valued function h : R^p → R. The estimated sufficient predictor ESP = ĥ(x). An important special case is a model with a linear predictor h(x) = x^T β where ESP = x^T β̂. This class of models includes many of the above regression models.

For the 1D regression model, let the ith case be (Y_i, x_i) for i = 1, ..., n where the n cases are independent. Variable selection is the search for a subset of predictor variables that can be deleted with little loss of information if n/p is large. Following Olive and Hawkins (2005), a model for variable selection can be described by


x^T β = x_S^T β_S + x_E^T β_E = x_S^T β_S    (6.2)

where x = (x_S^T, x_E^T)^T, x_S is an a_S × 1 vector, and x_E is a (p − a_S) × 1 vector. Given that x_S is in the model, β_E = 0 and E denotes the subset of terms that can be eliminated given that the subset S is in the model. Let x_I be the vector of a terms from a candidate subset indexed by I, and let x_O be the vector of the remaining predictors (out of the candidate submodel). Suppose that S is a subset of I and that model (6.2) holds. Then

x^T β = x_S^T β_S = x_S^T β_S + x_{I/S}^T β_{(I/S)} + x_O^T 0 = x_I^T β_I

where x_{I/S} denotes the predictors in I that are not in S. Since this is true regardless of the values of the predictors, β_O = 0 if S ⊆ I.

To do inference after model selection, randomly divide the data into two half sets H and V where H has n_H = ⌈n/2⌉ of the cases and V has the remaining n_V = n − n_H cases i_1, ..., i_{n_V}. The estimator β̂_I, e.g. based on the maximum likelihood estimator (MLE) or ordinary least squares (OLS), is computed using the training data set H. Then regress Y_V on X_{V,I} and perform the usual inference for the model using the j = 1, ..., n_V cases in the validation set V. If β̂_I uses a predictors, we want n_V ≥ 10a, and we want P(S ⊆ I) → 1 as n → ∞ or for (Y_V, X_{V,I}) to follow the regression model.

To get the subset I when n ≥ 10p, use variable selection with the criterion C_p due to Jones (1946) and Mallows (1973), AIC due to Akaike (1973), or BIC due to Schwarz (1978), or use other methods. When n < 5p, use forward selection with EBIC due to Luo and Chen (2013). Methods such as the Tibshirani (1996) lasso or the Zou and Hastie (2005) elastic net estimators can be used for both cases. Let relaxed lasso or relaxed elastic net apply the standard regression estimator, such as the MLE or OLS, to a constant and the nontrivial predictors that had nonzero lasso or elastic net coefficients. See Efron et al. (2004), Meinshausen (2007), and Olive (2019).

The method is simple: use one half set to get the predictors, then fit the regression model, such as a GLM or OLS, to the validation half set (Y_V, X_{V,I}). The regression model needs to hold for (Y_V, X_{V,I}), and we want n_V ≥ 10a if I uses a predictors. The regression model can hold if S ⊆ I and the model is sparse. Let x = (x_1, ..., x_p)^T where x_1 is a constant. If (Y, x_2, ..., x_p)^T follows a multivariate normal distribution, then (Y, x_I) follows a multiple linear regression model for every I. Hence the full model need not be sparse, although the selected model may be suboptimal.

Of course other sample sizes than half sets could be used. For example, if n = 1000p, use n_H = 10p cases for the training set and n_V = 990p cases for the validation set.
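
A minimal sketch of this half set method for multiple linear regression is given below, assuming the glmnet package is available; the function name halfsetinf and its arguments are illustrative, not part of linmodpack.

library(glmnet)
halfsetinf <- function(x, y) {
  # x: n x (p-1) matrix of nontrivial predictors, y: response vector
  n <- nrow(x)
  H <- sample(1:n, ceiling(n/2))          # training half set
  V <- setdiff(1:n, H)                    # validation half set
  cvfit <- cv.glmnet(x[H, ], y[H])        # lasso with 10 fold CV on the first half
  b <- as.matrix(coef(cvfit, s = "lambda.min"))[-1]
  I <- which(abs(b) > 0)                  # predictors with nonzero lasso coefficients
  if(length(I) == 0) I <- 1:ncol(x)       # keep the full model if nothing is selected
  # standard OLS inference on the validation half with the selected predictors
  summary(lm(y[V] ~ x[V, I, drop = FALSE]))
}

If a GLM is the full model, replace lm by glm with the appropriate family, and give cv.glmnet the matching family argument.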


6.2 Summary

6.3 Complements

Fan and Li (2002) give a method of variable selection for the Cox (1972) proportional hazards regression model. Using AIC is also useful.

For a time series Y_1, ..., Y_n, we could use Y_1, ..., Y_m as one set and Y_{m+1}, ..., Y_n as the other set. Three set inference may be needed: use Y_1, ..., Y_m as the first set (training data), Y_{m+1}, ..., Y_{m+k} as a burn in set, and Y_{m+k+1}, ..., Y_n as the third set for inference.

When the entire data set is used to build a model with the response variable, the inference tends to be invalid, and cross validation should not be used to check the model. See Hastie et al. (2009, p. 45). In order for the inference and cross validation to be useful, the response variable and the predictors for the regression should be chosen before looking at the response variable. Predictor transformations can be done as long as the response variable is not used to choose the transformation. You can do model building on the test set, and then do inference for the chosen (built) model as the full model with the validation set, provided this model follows the regression model used for inference (e.g. multiple linear regression or a GLM). This process is difficult to simulate.

6.4 Problems


Chapter 7

Robust Regression

This chapter needs a lot of work! Three mpack functions are useful for cleaning data with the RMVN set U described in Section 7.1.3. i) The function getu gets the RMVN set U. ii) If there are g groups (g = G for discriminant analysis, g = 2 for binary regression, and g = p for one way MANOVA), the function getubig gets the RMVN set U_i for each group and combines the g RMVN sets into one large set Ubig = U_1 ∪ U_2 ∪ · · · ∪ U_g. iii) If there are g groups and it can be assumed that the data x_1, ..., x_n only differ by g location vectors, then getuc subtracts the group coordinatewise median from each group, combines the centered data into one data set z_1, ..., z_n, and gets the RMVN set Uc applied to the z_i. All three functions return indx, the indices of the cases in the cleaned data set (U, Ubig, or Uc). Functions getubig and getuc also return grp, which gives the group to which each case in the cleaned data set belongs.

Two functions are useful for cleaning data with the covmb2 set B, which can be useful even if p > n. See Section 7.1.4. The function getB gets B, and the function getBbig is like getubig except it gets the covmb2 set B_i for each group, and Bbig = B_1 ∪ B_2 ∪ · · · ∪ B_g.

i) Resistant regression: Suppose the regression model has an m × 1 response vector y, and a p × 1 vector of predictors x. Assume that predictor transformations have been performed to make x, and that w consists of k ≤ p continuous predictor variables that are linearly related. Find the RMVN set based on the w to obtain n_U cases (y_ci, x_ci), and then run the regression method on the cleaned data. Often the theory of the method applies to the cleaned data set since y was not used to pick the subset of the data. Efficiency can be much lower since n_U cases are used, where n/2 ≤ n_U ≤ n, and the trimmed cases tend to be the “farthest” from the center of w.

The method will have the most outlier resistance if k = p (or k = p − 1 if there is a trivial predictor X_1 ≡ 1). If m = 1, make the response plot of Ŷ_c versus Y_c with the identity line added as a visual aid, and make the residual plot of Ŷ_c versus r_c = Y_c − Ŷ_c.


In R, assume Y is the vector of response variables, x is the data matrix of the predictors (often not including the trivial predictor), and w is the data matrix of the w_i. Then the following R commands can be used to get the cleaned data set. We could use the covmb2 set B instead of the RMVN set U computed from the w by replacing the command getu(w) by getB(w).

indx <- getu(w)$indx #often w = x
Yc <- Y[indx]
Xc <- x[indx,]
#example
indx <- getu(buxx)$indx
Yc <- buxy[indx]
Xc <- buxx[indx,]
outr <- lsfit(Xc,Yc)
MLRplot(Xc,Yc) #right click Stop twice

Two special cases are listed below.

a) Resistant additive error regression: An additive error regression model has the form Y = g(x) + e where there is m = 1 response variable Y, and the p × 1 vector of predictors x is assumed to be known and independent of the additive error e. An enormous variety of regression models have this form, including multiple linear regression, nonlinear regression, nonparametric regression, partial least squares, lasso, ridge regression, etc. As described above, find the RMVN set (or covmb2 set) based on the w to obtain n_U cases (Y_ci, x_ci), and then run the additive error regression method on the cleaned data.

b) Resistant Additive Error Multivariate Regression: Assume y = g(x) + ε = E(y|x) + ε where g : R^p → R^m, y = (Y_1, ..., Y_m)^T, and ε = (ε_1, ..., ε_m)^T. Many models have this form, including multivariate linear regression, seemingly unrelated regressions, partial envelopes, partial least squares, and the models in a) with m = 1 response variable. Clean the data as in a) but let the cleaned data be stored in (Zc, Xc). Again, the theory of the method tends to apply to the method applied to the cleaned data since the response variables were not used to select the cases, but the efficiency is often much lower. In the R code below, assume the y are stored in z.

indx <- getu(w)$indx #often w = x
Zc <- z[indx,]
Xc <- x[indx,]
#example
ht <- buxy
t <- cbind(buxx,ht);
z <- t[,c(2,5)];
x <- t[,c(1,3,4)]
indx <- getu(x)$indx
Zc <- z[indx,]
Xc <- x[indx,]
mltreg(Xc,Zc) #right click Stop four times

7.1 Outlier Detection

Outliers are cases that lie far away from the bulk of the data, and outliers can ruin a statistical analysis. For multiple linear regression, the response plot is often useful for outlier detection. Look for gaps in the response plot and for cases far from the identity line. There are no gaps in Figure 7.?, but case 44 is rather far from the identity line. Figure 7.? has a gap in the response plot.

Next, this section discusses a technique for outlier detection that works well for certain outlier configurations provided the bulk of the data consists of more than n/2 cases. The technique could fail if there are g > 2 groups of about n/g cases per group. First we need to define Mahalanobis distances and the coordinatewise median. Some univariate estimators will be defined first.

7.1.1 The Location Model

The location model is

Y_i = µ + e_i,  i = 1, . . . , n    (7.1)

where e_1, ..., e_n are error random variables, often independent and identically distributed (iid) with zero mean. The location model is used when there is one variable Y, such as height, of interest. The location model is a special case of the multivariate location and dispersion model, where there are p variables x_1, ..., x_p of interest, such as height and weight if p = 2. Statistical Learning is the analysis of multivariate data, and the location model is an example of univariate data, not an example of multivariate data.

The location model is often summarized by obtaining point estimates and confidence intervals for a location parameter and a scale parameter. Assume that there is a sample Y_1, . . . , Y_n of size n where the Y_i are iid from a distribution with median MED(Y), mean E(Y), and variance V(Y) if they exist. Also assume that the Y_i have a cumulative distribution function (cdf) F that is known up to a few parameters. For example, Y_i could be normal, exponential, or double exponential. The location parameter µ is often the population mean or median while the scale parameter is often the population standard deviation √V(Y). The ith case is Y_i.

Point estimation is one of the oldest problems in statistics and four important statistics for the location model are the sample mean, median, variance, and the median absolute deviation (MAD). Let Y_1, . . . , Y_n be the random sample; i.e., assume that Y_1, ..., Y_n are iid. The sample mean is a measure of location and estimates the population mean (expected value) µ = E(Y).

Definition 7.1. The sample mean

Ȳ = ∑_{i=1}^n Y_i / n.    (7.2)

If the data set Y_1, ..., Y_n is arranged in ascending order from smallest to largest and written as Y_(1) ≤ · · · ≤ Y_(n), then Y_(i) is the ith order statistic and the Y_(i)'s are called the order statistics. If the data Y_1 = 1, Y_2 = 4, Y_3 = 2, Y_4 = 5, and Y_5 = 3, then Ȳ = 3, Y_(i) = i for i = 1, ..., 5 and MED(n) = 3 where the sample size n = 5. The sample median is a measure of location while the sample standard deviation is a measure of spread. The sample mean and standard deviation are vulnerable to outliers, while the sample median and MAD, defined below, are outlier resistant.

Definition 7.2. The sample median

MED(n) = Y_((n+1)/2) if n is odd,    (7.3)

MED(n) = [Y_(n/2) + Y_((n/2)+1)] / 2 if n is even.

The notation MED(n) = MED(Y_1, ..., Y_n) will also be used.

Definition 7.3. The sample variance

S_n^2 = ∑_{i=1}^n (Y_i − Ȳ)^2 / (n − 1) = [∑_{i=1}^n Y_i^2 − n(Ȳ)^2] / (n − 1),    (7.4)

and the sample standard deviation S_n = √(S_n^2).

Definition 7.4. The sample median absolute deviation is

MAD(n) = MED(|Yi − MED(n)|, i = 1, . . . , n). (7.5)

Since MAD(n) is the median of n distances, at least half of the observations are within a distance MAD(n) of MED(n) and at least half of the observations are a distance of MAD(n) or more away from MED(n). Like the standard deviation, MAD(n) is a measure of spread.

Example 7.1. Let the data be 1, 2, 3, 4, 5, 6, 7, 8, 9. Then MED(n) = 5 and MAD(n) = 2 = MED{0, 1, 1, 2, 2, 3, 3, 4, 4}.
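
These quantities are easy to check in R; note that the base R function mad multiplies by 1.4826 by default, so constant = 1 gives MAD(n) as defined in Definition 7.4.

y <- 1:9
median(y)                   # MED(n) = 5
median(abs(y - median(y)))  # MAD(n) = 2
mad(y, constant = 1)        # same as the previous line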


7.1.2 Outlier Detection with Mahalanobis Distances

Now suppose the multivariate data has been collected into an n × p matrix

W = X = [x_1 x_2 . . . x_n]^T = (x_{i,j}) = [v_1 v_2 . . . v_p]

where the ith row of W is the ith case x_i^T = (x_{i,1}, x_{i,2}, . . . , x_{i,p}) and the jth column v_j of W corresponds to n measurements of the jth random variable X_j for j = 1, ..., p. Hence the n rows of the data matrix W correspond to the n cases, while the p columns correspond to measurements on the p random variables X_1, ..., X_p. For example, the data may consist of n visitors to a hospital where the p = 2 variables height and weight of each individual were measured.

Definition 7.5. The coordinatewise median MED(W) = (MED(X_1), ..., MED(X_p))^T where MED(X_i) is the sample median of the data in column i corresponding to variable X_i and v_i.

Example 7.2. Let the data for X_1 be 1, 2, 3, 4, 5, 6, 7, 8, 9 while the data for X_2 is 7, 17, 3, 8, 6, 13, 4, 2, 1. Then MED(W) = (MED(X_1), MED(X_2))^T = (5, 6)^T.

For multivariate data, sample Mahalanobis distances play a role similar to that of residuals in multiple linear regression. Let the observed training data be collected in an n × p matrix W. Let the p × 1 column vector T = T(W) be a multivariate location estimator, and let the p × p symmetric positive definite matrix C = C(W) be a dispersion estimator.

Definition 7.6. Let x_{1j}, ..., x_{nj} be measurements on the jth random variable X_j corresponding to the jth column of the data matrix W. The jth sample mean is x̄_j = (1/n) ∑_{k=1}^n x_{kj}. The sample covariance S_{ij} estimates Cov(X_i, X_j) = σ_{ij} = E[(X_i − E(X_i))(X_j − E(X_j))], and

S_{ij} = (1/(n − 1)) ∑_{k=1}^n (x_{ki} − x̄_i)(x_{kj} − x̄_j).

S_{ii} = S_i^2 is the sample variance that estimates the population variance σ_{ii} = σ_i^2. The sample correlation r_{ij} estimates the population correlation Cor(X_i, X_j) = ρ_{ij} = σ_{ij}/(σ_i σ_j), and

r_{ij} = S_{ij}/(S_i S_j) = S_{ij}/√(S_{ii} S_{jj}) = ∑_{k=1}^n (x_{ki} − x̄_i)(x_{kj} − x̄_j) / [√(∑_{k=1}^n (x_{ki} − x̄_i)^2) √(∑_{k=1}^n (x_{kj} − x̄_j)^2)].


Definition 7.7. Let x_1, ..., x_n be the data where x_i is a p × 1 vector. The sample mean or sample mean vector

x̄ = (1/n) ∑_{i=1}^n x_i = (x̄_1, ..., x̄_p)^T = (1/n) W^T 1

where 1 is the n × 1 vector of ones. The sample covariance matrix

S = (1/(n − 1)) ∑_{i=1}^n (x_i − x̄)(x_i − x̄)^T = (S_{ij}).

That is, the ij entry of S is the sample covariance S_{ij}. The classical estimator of multivariate location and dispersion is (T, C) = (x̄, S).

It can be shown that (n − 1)S = ∑_{i=1}^n x_i x_i^T − n x̄ x̄^T = W^T W − (1/n) W^T 1 1^T W. Hence if the centering matrix H = I − (1/n) 1 1^T, then (n − 1)S = W^T H W.

Definition 7.8. The sample correlation matrix

R = (r_{ij}).

That is, the ij entry of R is the sample correlation r_{ij}.

Let the standardized random variables

Z_j = (x_j − x̄_j)/√(S_{jj})

for j = 1, ..., p. Then the sample correlation matrix R is the sample covariance matrix of the z_i = (Z_{i1}, ..., Z_{ip})^T where i = 1, ..., n.

Often it is useful to standardize variables with a robust location estimator and a robust scale estimator. The R function scale is useful. The R code below shows how to standardize using

Z_j = (x_j − MED(x_j))/MAD(x_j)

for j = 1, ..., p. Here MED(x_j) = MED(x_{1j}, ..., x_{nj}) and MAD(x_j) = MAD(x_{1j}, ..., x_{nj}) are the sample median and sample median absolute deviation of the data for the jth variable: x_{1j}, ..., x_{nj}. See Definitions 7.8 and 7.10. Some of these results are illustrated with the following R code.


x <- buxx[,1:3]; cov(x)
                len       nasal     bigonal
len     118299.9257 -191.084603 -104.718925
nasal     -191.0846   18.793905   -1.967121
bigonal   -104.7189   -1.967121   36.796311
cor(x)
                len       nasal     bigonal
len      1.00000000 -0.12815187 -0.05019157
nasal   -0.12815187  1.00000000 -0.07480324
bigonal -0.05019157 -0.07480324  1.00000000
z <- scale(x)
cov(z)
                len       nasal     bigonal
len      1.00000000 -0.12815187 -0.05019157
nasal   -0.12815187  1.00000000 -0.07480324
bigonal -0.05019157 -0.07480324  1.00000000
medd <- apply(x,2,median)
madd <- apply(x,2,mad)/1.4826
z <- scale(x,center=medd,scale=madd)
ddplot4(z) #scaled data still has 5 outliers
cov(z)     #in the length variable
                len       nasal     bigonal
len     4731.997028  -12.738974   -6.981262
nasal    -12.738974    2.088212   -0.218569
bigonal   -6.981262   -0.218569    4.088479
cor(z)
                len       nasal     bigonal
len      1.00000000 -0.12815187 -0.05019157
nasal   -0.12815187  1.00000000 -0.07480324
bigonal -0.05019157 -0.07480324  1.00000000
apply(z,2,median)
    len   nasal bigonal
      0       0       0
#scaled data has coord. median = (0,0,0)^T
apply(z,2,mad)/1.4826
    len   nasal bigonal
      1       1       1   #scaled data has unit MAD

Notation. A rule of thumb is a rule that often but not always works well in practice.


Rule of Thumb 1.1. Multivariate procedures start to give good results for n ≥ 10p, especially if the distribution is close to multivariate normal. In particular, we want n ≥ 10p for the sample covariance and correlation matrices. For procedures with large sample theory on a large class of distributions, for any value of n, there are always distributions where the results will be poor, but will eventually be good for larger sample sizes. Norman and Streiner (1986, pp. 122, 130, 157) give this rule of thumb and note that some authors recommend n ≥ 30p. This rule of thumb is much like the rule of thumb that says the central limit theorem normal approximation for Ȳ starts to be good for many distributions for n ≥ 30.

Definition 7.9. The ith Mahalanobis distance D_i = √(D_i^2) where the ith squared Mahalanobis distance is

D_i^2 = D_i^2(T(W), C(W)) = (x_i − T(W))^T C^{-1}(W)(x_i − T(W))    (7.6)

for each point x_i. Notice that D_i^2 is a random variable (scalar valued). Let (T, C) = (T(W), C(W)). Then

D_x^2(T, C) = (x − T)^T C^{-1}(x − T).

Hence D_i^2 uses x = x_i.

Let the p × 1 location vector be µ, often the population mean, and let the p × p dispersion matrix be Σ, often the population covariance matrix. See Section 1.3.3. Notice that if x is a random vector, then the population squared Mahalanobis distance is

D_x^2(µ, Σ) = (x − µ)^T Σ^{-1}(x − µ)    (7.7)

and the term Σ^{-1/2}(x − µ) is the p-dimensional analog to the z-score used to transform a univariate N(µ, σ^2) random variable into a N(0, 1) random variable. Hence the sample Mahalanobis distance D_i = √(D_i^2) is an analog of the absolute value |Z_i| of the sample Z-score Z_i = (X_i − X̄)/σ. Also notice that the Euclidean distance of x_i from the estimate of center T(W) is D_i(T(W), I_p) where I_p is the p × p identity matrix.

7.1.3 The RMVN and RFCH Estimators

7.1.4 Outlier Detection if p > n

Most outlier detection methods work best if n ≥ 20p, but often data sets have p > n, and outliers are a major problem. One of the simplest outlier detection methods uses the Euclidean distances of the x_i from the coordinatewise median: D_i = D_i(MED(W), I_p). Concentration type steps compute the weighted median MED_j: the coordinatewise median computed from the “half set” of cases x_i with D_i^2 ≤ MED(D_i^2(MED_{j-1}, I_p)) where MED_0 = MED(W). We often used j = 0 (no concentration type steps) or j = 9. Let D_i = D_i(MED_j, I_p). Let W_i = 1 if D_i ≤ MED(D_1, ..., D_n) + k MAD(D_1, ..., D_n) where k ≥ 0 and k = 5 is the default choice. Let W_i = 0, otherwise. Using k ≥ 0 insures that at least half of the cases get weight 1. This weighting corresponds to the weighting that would be used in a one sided metrically trimmed mean (Huber type skipped mean) of the distances.

Application 7.1. This outlier resistant regression method uses terms from the following definition. Let the ith case w_i = (Y_i, x_i^T)^T where the continuous predictors from x_i are denoted by u_i for i = 1, ..., n. Apply the covmb2 estimator to the u_i, and then run the regression method on the m cases w_i corresponding to the covmb2 set B indices i_1, ..., i_m, where m ≥ n/2.

Definition 7.10. Let the covmb2 set B of at least n/2 cases correspond to the cases with weight W_i = 1. Then the covmb2 estimator (T, C) is the sample mean and sample covariance matrix applied to the cases in set B. Hence

T = ∑_{i=1}^n W_i x_i / ∑_{i=1}^n W_i  and  C = ∑_{i=1}^n W_i (x_i − T)(x_i − T)^T / (∑_{i=1}^n W_i − 1).
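
A minimal sketch of this estimator with j = 0 concentration type steps and the default k = 5 is given below; the function name covmb2sketch is illustrative, and linmodpack supplies the actual covmb2 and getB functions.

covmb2sketch <- function(x, k = 5) {
  # x: n x p data matrix
  med <- apply(x, 2, median)                       # coordinatewise median MED(W)
  d <- sqrt(mahalanobis(x, med, diag(ncol(x))))    # Euclidean distances D_i from MED(W)
  cut <- median(d) + k * mad(d, constant = 1)      # MED(D) + k MAD(D)
  B <- which(d <= cut)                             # cases with weight W_i = 1
  list(center = colMeans(x[B, , drop = FALSE]),    # T = sample mean of the cases in B
       cov = cov(x[B, , drop = FALSE]),            # C = sample covariance of the cases in B
       indx = B)
}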

Example 7.3. Let the clean data (nonoutliers) be i 1 for i = 1, 2, 3, 4, and 5 while the outliers are j 1 for j = 16, 17, 18, and 19. Here n = 9 and 1 is p × 1. Making a plot of the data for p = 2 may be useful. Then the coordinatewise median MED_0 = MED(W) = 5 1. The median Euclidean distance of the data is the Euclidean distance of 5 1 from 1 1 = the Euclidean distance of 5 1 from 9 1. The median ball is the hypersphere centered at the coordinatewise median with radius r = MED(D_i(MED(W), I_p), i = 1, ..., n) that tends to contain (n + 1)/2 of the cases if n is odd. Hence the clean data are in the median ball and the outliers are outside of the median ball. The coordinatewise median of the cases with the 5 smallest distances is the coordinatewise median of the clean data: MED_1 = 3 1. Then the median Euclidean distance of the data from MED_1 is the Euclidean distance of 3 1 from 1 1 = the Euclidean distance of 3 1 from 5 1. Again the clean cases are the cases with the 5 smallest Euclidean distances. Hence MED_j = 3 1 for j ≥ 1. For j ≥ 1, if x_i = j 1, then D_i = |j − 3|√p. Thus D_(1) = 0, D_(2) = D_(3) = √p, and D_(4) = D_(5) = 2√p. Hence MED(D_1, ..., D_n) = D_(5) = 2√p = MAD(D_1, ..., D_n) since the median distance of the D_i from D_(5) is 2√p − 0 = 2√p. Note that the 5 smallest absolute distances |D_i − D_(5)| are 0, 0, √p, √p, and 2√p. Hence W_i = 1 if D_i ≤ 2√p + 10√p = 12√p. The clean data get weight 1 while the outliers get weight 0 since the smallest distance D_i for the outliers is the Euclidean distance of 3 1 from 16 1 with D_i = ‖16 1 − 3 1‖ = 13√p. Hence the covmb2 estimator (T, C) is the sample mean and sample covariance matrix of the clean data. Note that the distance for the outliers to get zero weight is proportional to the square root of the dimension √p.

The covmb2 estimator attempts to give a robust dispersion estimator that reduces the bias by using a big ball about MED_j instead of a ball that contains half of the cases. The weighting is the default method, but you can also plot the squared Euclidean distances and estimate the number m ≥ n/2 of cases with the smallest distances to be used. The slpack function medout makes the plot, and the slpack function getB gives the set B of cases that got weight 1 along with the index indx of the case numbers that got weight 1. The function vecw stacks the columns of the dispersion matrix C into a vector. Then the elements of the matrix can be plotted.

The function ddplot5 plots the Euclidean distances from the coordinatewise median versus the Euclidean distances from the covmb2 location estimator. Typically the plotted points in this DD plot cluster about the identity line, and outliers appear in the upper right corner of the plot with a gap between the bulk of the data and the outliers. An alternative for outlier detection is to replace C by C_d = diag(σ̂_{11}, ..., σ̂_{pp}). For example, use σ̂_{ii} = C_{ii}. See Ro et al. (2015) and Tarr et al. (2016) for references.

Example 7.4. For the Buxton (1920) data with multiple linear regression, height was the response variable while an intercept, head length, nasal height, bigonal breadth, and cephalic index were used as predictors in the multiple linear regression model. Observation 9 was deleted since it had missing values. Five individuals, cases 61–65, were reported to be about 0.75 inches tall with head lengths well over five feet! See Problem 1.9 to reproduce the following plots.

Fig. 7.1 Response plot for lasso and lasso applied to the covmb2 set B.

Fig. 7.2 DD plot.

Figure 7.?a) shows the response plot for lasso. The identity line passes right through the outliers, which are obvious because of the large gap. Figure 7.?b) shows the response plot from lasso for the cases in the covmb2 set B applied to the predictors, and the set B included all of the clean cases and omitted the 5 outliers. The response plot was made for all of the data, including the outliers. Prediction interval (PI) bands are also included for both plots. Both plots are useful for outlier detection, but the method for plot 7.?b) is better for data analysis: impossible outliers should be deleted or given 0 weight, we do not want to predict that some people are about 0.75 inches tall, and we do want to predict that the people were about 1.6 to 1.8 meters tall. Figure 7.? shows the DD plot made using ddplot5. The five outliers are in the upper right corner.

Also see Problem 1.10 where the covmb2 set B deleted the 8 cases with the largest D_i, including 5 outliers and 3 clean cases.

When n > p, an important multivariate distribution is the multivariate normal distribution. See Section 1.5.

Fig. 7.3 Elements of C for outlier data.

Fig. 7.4 Elements of the classical covariance matrix S for outlier data.

Example 7.5. This example helps illustrate the effect of outliers on classical methods. The artificial data set had n = 50, p = 100, and the clean data was iid N_p(0, I_p). Hence the off diagonal elements of the population covariance matrix are 0 and the diagonal elements are 1. Plots of the elements of the sample covariance matrix S and the covmb2 estimator C are not shown, but were similar to Figure 7.?. Then the first ten cases were contaminated: x_i ∼ N_p(µ, 100 I_p) where µ = (10, 0, ..., 0)^T. Figure 7.? shows that the covmb2 dispersion matrix C was not much affected by the outliers. The diagonal elements are near 1 and the off diagonal elements are near 0. Figure 7.? shows that the sample covariance matrix S was greatly affected by the outliers. Several sample covariances are less than −20 and several sample variances are over 40.

R code used to produce Figures 7.? and 7.? is shown below.

#n = 50, p = 100
x<-matrix(rnorm(5000),nrow=50,ncol=100)
out<-medout(x) #no outliers, try ddplot5(x)
out <- covmb2(x,msteps=0)
z<-out$cov
plot(diag(z)) #plot the diagonal elements of C
plot(out$center) #plot the elements of T
vecz <- vecw(z)$vecz
plot(vecz)
out<-covmb2(x,m=45)
plot(out$center)
plot(diag(out$cov))
#outliers
x[1:10,] <- 10*x[1:10,]
x[1:10,1] <- x[1:10,1]+10
medout(x) #The 10 outliers are easily detected in
#the plot of the distances from the MED(X).
ddplot5(x) #two widely separated clusters of data
tem <- getB(x,msteps=0)
tem$indx #all 40 clean cases were used
dim(tem$B) #40 by 100
out<-covmb2(x,msteps=0)
z<-out$cov
plot(diag(z))
plot(out$center)
vecz <- vecw(z)$vecz
plot(vecz) #plot the elements of C
#Figure 7.?
#examine the sample covariance matrix and mean
plot(diag(var(x)))
plot(apply(x,2,mean)) #plot elements of xbar
zc <- var(x)
vecz <- vecw(zc)$vecz
plot(vecz) #plot the elements of S
#Figure 7.?
out<-medout(x) #10 outliers
out<-covmb2(x,m=40)
plot(out$center)
plot(diag(out$cov))

The covmb2 estimator can also be used for n > p. The slpack function mldsim6 suggests that for 40% outliers, the outliers need to be further away from the bulk of the data (covmb2(k=5) needs a larger value of pm) than for the other six estimators if n ≥ 20p. With some outlier types, covmb2(k=5) was often near best. Try the following commands. The other estimators need n > 2p, and as n gets close to 2p, covmb2 may outperform the other estimators. Also see Problem 1.11.

#near point mass on major axis
mldsim6(n=100,p=10,outliers=1,gam=0.25,pm=25)
mldsim6(n=100,p=10,outliers=1,gam=0.4,pm=25) #bad
mldsim6(n=100,p=40,outliers=1,gam=0.1,pm=100)
mldsim6(n=200,p=60,outliers=1,gam=0.1,pm=100)
#mean shift outliers
mldsim6(n=100,p=40,outliers=3,gam=0.1,pm=10)
mldsim6(n=100,p=40,outliers=3,gam=0.25,pm=20)
mldsim6(n=200,p=60,outliers=3,gam=0.1,pm=10)
#concentration steps can help
mldsim6(n=100,p=10,outliers=3,gam=0.4,pm=10,osteps=0)
mldsim6(n=100,p=10,outliers=3,gam=0.4,pm=10,osteps=9)

Elliptically contoured distributions, defined below, are an important class of distributions for multivariate data. The multivariate normal distribution is also an elliptically contoured distribution. These distributions are useful for multivariate analysis.

Definition 7.11: Johnson (1987, pp. 107-108). A p × 1 random vector X has an elliptically contoured distribution, also called an elliptically symmetric distribution, if X has joint pdf

f(z) = k_p |Σ|^{-1/2} g[(z − µ)^T Σ^{-1}(z − µ)],    (7.8)

and we say X has an elliptically contoured EC_p(µ, Σ, g) distribution.

If X has an elliptically contoured (EC) distribution, then the characteristic function of X is

φ_X(t) = exp(i t^T µ) ψ(t^T Σ t)    (7.9)

for some function ψ. If the second moments exist, then

E(X) = µ    (7.10)

and

Cov(X) = c_X Σ    (7.11)

where

c_X = −2ψ′(0).

7.2 Resistant Multiple Linear Regression

Consider the multiple linear regression model, written in matrix form as Y = Xβ + e. This model is a special case of the multivariate linear regression model with m = 1. (Ordinary) least squares (OLS) is the classical regression method. The OLS response and residual plots are very useful for detecting outliers and checking the model. See Table 1.1 for acronyms.

Resistant estimators are useful for detecting certain types of outliers. Some good resistant regression estimators are rmreg2 from Section 8.6.2, the hbreg estimator from Section 7.?, and the Olive (2005a) MBA and trimmed views estimators described below.

Resistant estimators are often created by computing several trial fits b_i that are estimators of β. Then a criterion is used to select the trial fit to be used in the resistant estimator. Suppose c ≈ n/2. The LMS(c) criterion is Q_LMS(b) = r_(c)^2(b) where r_(1)^2 ≤ · · · ≤ r_(n)^2 are the ordered squared residuals, and the LTS(c) criterion is Q_LTS(b) = ∑_{i=1}^c r_(i)^2(b). The LTA(c) criterion is Q_LTA(b) = ∑_{i=1}^c |r(b)|_(i) where |r(b)|_(i) is the ith ordered absolute residual. Three impractical high breakdown robust estimators are the Hampel (1975) least median of squares (LMS) estimator, the Rousseeuw (1984) least trimmed sum of squares (LTS) estimator, and the Hossjer (1991) least trimmed sum of absolute deviations (LTA) estimator. Also see Hawkins and Olive (1999ab). These estimators correspond to the β̂_L ∈ R^p that minimizes the corresponding criterion. LMS, LTA, and LTS have O(n^p) or O(n^{p+1}) complexity. See Bernholt (2005), Hawkins and Olive (1999b), Klouda (2015), and Mount et al. (2014). Estimators with O(n^4) or higher complexity take too long to compute. LTS and LTA are √n consistent while LMS has the lower n^{1/3} rate. See Kim and Pollard (1990), Čížek (2006, 2008), and Mašíček (2004). If c = n, the LTS and LTA criteria are the OLS and L1 criteria.
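
The three criteria are simple to evaluate for a given trial fit b; the sketch below uses c = ⌈n/2⌉ by default, and the function name crit3 is illustrative.

crit3 <- function(b, x, y, c = ceiling(nrow(x)/2)) {
  # b: trial coefficient vector, x: design matrix (with constant), y: response
  r <- y - x %*% b          # residuals for the trial fit
  r2 <- sort(r^2)           # ordered squared residuals
  ra <- sort(abs(r))        # ordered absolute residuals
  c(LMS = r2[c], LTS = sum(r2[1:c]), LTA = sum(ra[1:c]))
}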

A good resistant estimator is the Olive (2005a) median ball algorithm (MBA or mbareg). The Euclidean distance of the ith vector of predictors x_i from the jth vector of predictors x_j is

D_i(x_j) = D_i(x_j, I_p) = √((x_i − x_j)^T (x_i − x_j)).

For a fixed x_j consider the ordered distances D_(1)(x_j), ..., D_(n)(x_j). Next, let β̂_j(α) denote the OLS fit to the min(p + 3 + ⌊αn/100⌋, n) cases with the smallest distances where the approximate percentage of cases used is α ∈ {1, 2.5, 5, 10, 20, 33, 50}. (Here ⌊x⌋ is the greatest integer function so ⌊7.7⌋ = 7. The extra p + 3 cases are added so that OLS can be computed for small n and α.) This yields seven OLS fits corresponding to the cases with predictors closest to x_j. A fixed number of K cases are selected at random without replacement to use as the x_j. Hence 7K OLS fits are generated. We use K = 7 as the default. A robust criterion Q is used to evaluate the 7K fits and the OLS fit to all of the data. Hence 7K + 1 OLS fits are generated and the MBA estimator is the fit that minimizes the criterion. The median squared residual is a good choice for Q.

Three ideas motivate this estimator. First, x-outliers, which are outliers in the predictor space, tend to be much more destructive than Y-outliers, which are outliers in the response variable. Suppose that the proportion of outliers is γ and that γ < 0.5. We would like the algorithm to have at least one "center" x_j that is not an outlier. The probability of drawing a center that is not an outlier is approximately 1 − γ^K > 0.99 for K ≥ 7, and this result is free of p. Second, by using the different percentages of coverage, for many data sets there will be a center and a coverage that contains no outliers. Third, the MBA estimator is a √n consistent estimator of the same parameter vector β estimated by OLS under mild conditions on the zero mean error distribution.

Ellipsoidal trimming can be used to create resistant multiple linear regression (MLR) estimators. To perform ellipsoidal trimming, an estimator (T, C) is computed and used to create the squared Mahalanobis distances D_i^2 for each vector of observed predictors x_i. If the ordered distance D_(j) is unique, then j of the x_i's are in the ellipsoid

{x : (x − T)^T C^{-1}(x − T) ≤ D_(j)^2}.    (7.12)

The ith case (Y_i, x_i^T)^T is trimmed if D_i > D_(j). Then an estimator of β is computed from the remaining cases. For example, if j ≈ 0.9n, then about 10% of the cases are trimmed, and OLS or L1 could be used on the cases that remain. Ellipsoidal trimming differs from technique i) in Section 7.? that uses the RMVN set on the x_i, since the RMVN set uses a random amount of trimming. (The ellipsoidal trimming technique can also be used for other regression models, and the theory of the regression method tends to apply to the method applied to the cleaned data that was not trimmed since the response variables were not used to select the cases.)

A response plot is a plot of the fitted values Ŷ_i versus the response Y_i and is very useful for detecting outliers. If the MLR model holds and the MLR estimator is good, then the plotted points will scatter about the identity line that has unit slope and zero intercept. The identity line is added to the plot as a visual aid, and the vertical deviations from the identity line are equal to the residuals since Y_i − Ŷ_i = r_i. Note that the response and residual plots are made using all of the data, not just the cleaned data that was not trimmed.

The resistant trimmed views estimator combines ellipsoidal trimming and the response plot. First compute (T, C) on the x_i, perhaps using the RMVN estimator. Trim the M% of the cases with the largest Mahalanobis distances, and then compute the MLR estimator β̂_M from the remaining cases. Use M = 0, 10, 20, 30, 40, 50, 60, 70, 80, and 90 to generate ten response plots of the fitted values β̂_M^T x_i versus Y_i using all n cases. (Fewer plots are used for small data sets if β̂_M can not be computed for large M.) These plots are called "trimmed views."
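
A rough sketch of one trimmed view is below, using the classical estimator (x̄, S) on the nontrivial predictors instead of RMVN for simplicity; the function name trimview is illustrative.

trimview <- function(x, y, M = 70) {
  # x: matrix of nontrivial predictors, y: response, M: percent of cases trimmed
  d2 <- mahalanobis(x, colMeans(x), cov(x))         # squared distances for the x_i
  keep <- d2 <= quantile(d2, 1 - M/100)             # drop the M% largest distances
  bM <- coef(lm(y[keep] ~ x[keep, , drop = FALSE])) # OLS fit to the remaining cases
  fit <- cbind(1, x) %*% bM                         # fitted values for all n cases
  plot(fit, y); abline(0, 1)                        # trimmed view with the identity line
  invisible(bM)
}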

Definition 7.12. The trimmed views (TV) estimator β̂_{T,n} corresponds to the trimmed view where the bulk of the plotted points follow the identity line with smallest variance function, ignoring any outliers.

Example 7.6. For the Buxton (1920) data, height was the response variable while an intercept, head length, nasal height, bigonal breadth, and cephalic index were used as predictors in the multiple linear regression model. Observation 9 was deleted since it had missing values. Five individuals, cases 61–65, were reported to be about 0.75 inches tall with head lengths well over five feet! OLS was used on the cases remaining after trimming, and Figure 7.? shows four trimmed views corresponding to 90%, 70%, 40%, and 0% trimming. The OLS TV estimator used 70% trimming since this trimmed view was best. Since the vertical distance from a plotted point to the identity line is equal to the case's residual, the outliers had massive residuals for 90%, 70%, and 40% trimming. Notice that the OLS trimmed view with 0% trimming "passed through the outliers" since the cluster of outliers is scattered about the identity line.

The TV estimator β̂_{T,n} has good statistical properties if an estimator with good statistical properties is applied to the cases (X_{M,n}, Y_{M,n}) that remain after trimming. Candidates include OLS, L1, Huber's M–estimator, Mallows' GM–estimator, or the Wilcoxon rank estimator. See Rousseeuw and Leroy (1987, pp. 12-13, 150). The basic idea is that if an estimator with O_P(n^{-1/2}) convergence rate is applied to a set of n_M ∝ n cases, then the resulting estimator β̂_{M,n} also has O_P(n^{-1/2}) rate provided that the response Y was not used to select the n_M cases in the set. If ‖β̂_{M,n} − β‖ = O_P(n^{-1/2}) for M = 0, ..., 90 then ‖β̂_{T,n} − β‖ = O_P(n^{-1/2}) by Pratt (1959).

Fig. 7.5 4 Trimmed Views for the Buxton Data

Let X_n = X_{0,n} denote the full design matrix. Often when proving asymptotic normality of an MLR estimator β̂_{0,n}, it is assumed that

X_n^T X_n / n → W^{-1}.

If β̂_{0,n} has O_P(n^{-1/2}) rate and if for big enough n all of the diagonal elements of

(X_{M,n}^T X_{M,n} / n)^{-1}

are all contained in an interval [0, B) for some B > 0, then ‖β̂_{M,n} − β‖ = O_P(n^{-1/2}).

The distribution of the estimator β̂_{M,n} is especially simple when OLS is used and the errors are iid N(0, σ^2). Then

β̂_{M,n} = (X_{M,n}^T X_{M,n})^{-1} X_{M,n}^T Y_{M,n} ∼ N_p(β, σ^2 (X_{M,n}^T X_{M,n})^{-1})

and √n(β̂_{M,n} − β) ∼ N_p(0, σ^2 (X_{M,n}^T X_{M,n}/n)^{-1}). Notice that this result does not imply that the distribution of β̂_{T,n} is normal.

Warning: When Y_i = x_i^T β + e, MLR estimators tend to estimate the same slopes β_2, ..., β_p, but the constant β_1 tends to depend on the estimator unless the errors are symmetric. The MBA and trimmed views estimators do estimate the same β as OLS asymptotically, but samples may need to be huge before the MBA and trimmed views estimates of the constant are close to the OLS estimate of the constant. If the trimmed views estimator is modified so that the LTS, LTA, or LMS criterion is used to select the final estimator, then for the modified trimmed views estimator and the MBA estimator, it is likely that

√n(β̂ − β) →_D ∑_{i=1}^k π_i N_p(0, σ^2 W_i) where 0 ≤ π_i ≤ 1 and ∑_{i=1}^k π_i = 1.

The index i corresponds to the fits considered by the MBA estimator with k = 7K + 1 where often K = 7 so k = 50, or the modified trimmed views estimator with k = 10. For the MBA estimator and the modified trimmed views estimator, the prediction region method, described in Section 5.3, may be useful for testing hypotheses. Large sample sizes may be needed if the error distribution is not symmetric since the constant β_1 needs large samples. See Section 7.? for an explanation of why large sample sizes may be needed to estimate the constant.

The TV estimator can be modified to create a resistant weighted MLR estimator. To see this, recall that the weighted least squares (WLS) estimator using weights W_i can be found using the ordinary least squares (OLS) regression (without intercept) of √W_i Y_i on √W_i x_i. This idea can be used for categorical data analysis since the minimum chi-square estimator is often computed using WLS. Let x_i = (1, x_{i,2}, ..., x_{i,p})^T, let Y_i = x_i^T β + e_i, and let β̂ be an estimator of β.

Definition 7.13. For a multiple linear regression model with weights W_i, a weighted response plot is a plot of √W_i x_i^T β̂ versus √W_i Y_i. The weighted residual plot is a plot of √W_i x_i^T β̂ versus the WMLR residuals r_{Wi} = √W_i Y_i − √W_i x_i^T β̂.

Application 7.2. For resistant weighted MLR, use the WTV estimator which is selected from ten weighted response plots.

The conditions under which the rmreg2 estimator of Section 12.6.2 has been shown to be √n consistent are quite strong, but it seems likely that the estimator is a √n consistent estimator of β under mild conditions, where the parameter vector β is not, in general, the parameter vector estimated by OLS. For MLR, the mpack function rmregboot bootstraps the rmreg2 estimator, and the function rmregbootsim can be used to simulate rmreg2. Both functions use the residual bootstrap where the residuals come from OLS. See the R code below.

out<-rmregboot(belx,bely)
plot(out$betas)
ddplot4(out$betas) #right click Stop
out<-rmregboot(cbrainx,cbrainy)
ddplot4(out$betas) #right click Stop


7.3 MLR Outlier Detection

For multiple linear regression, the OLS response and residual plots are very useful for detecting outliers. The DD plot of the continuous predictors is also useful. Use the mpack functions MLRplot and ddplot4.

Huber and Ronchetti (2009, p. 154) noted that efficient methods for identifying leverage groups are needed. Such groups are often difficult to detect with regression diagnostics and residuals, but often have outlying fitted values and responses that can be detected with response and residual plots. The following rules of thumb are useful for finding influential cases and outliers. Look for points with large absolute residuals and for points far away from Ȳ. Also look for gaps separating the data into clusters. The OLS fit often passes through a cluster of outliers, causing a large gap between a cluster corresponding to the bulk of the data and the cluster of outliers. When such a gap appears, it is possible that the smaller cluster corresponds to good leverage points: the cases follow the same model as the bulk of the data. To determine whether small clusters are outliers or good leverage points, give zero weight to the clusters, and fit an MLR estimator such as OLS to the bulk of the data. Denote the weighted estimator by β̂_w. Then plot Ŷ_w versus Y using the entire data set. If the identity line passes through the cluster, then the cases in the cluster may be good leverage points; otherwise they may be outliers. The trimmed views estimator of the previous section is also useful. Dragging the plots, so that they are roughly square, can be useful.

Definition 7.14. Suppose that some analysis to detect outliers is performed. Masking occurs if the analysis suggests that one or more outliers are in fact good cases. Swamping occurs if the analysis suggests that one or more good cases are outliers. Suppose that a subset of h cases is selected from the n cases making up the data set. Then the subset is clean if none of the h cases are outliers.

Influence diagnostics such as Cook's distances CD_i from Cook (1977) and the weighted Cook's distances WCD_i from Pena (2005) are sometimes useful. Although an index plot of Cook's distance CD_i may be useful for flagging influential cases, the index plot provides no direct way of judging the model against the data. As a remedy, cases in the response and residual plots with CD_i > min(0.5, 2p/n) are highlighted with open squares, and cases with |WCD_i − median(WCD_i)| > 4.5 MAD(WCD_i) are highlighted with crosses, where the median absolute deviation MAD(w_i) = median(|w_i − median(w_i)|).
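
The Cook's distance part of this rule is easy to apply with base R; the WCD_i are not in base R, so only the CD_i cutoff is sketched below, assuming the Buxton data are loaded as elsewhere in this chapter.

fit <- lm(buxy ~ buxx)
p <- length(coef(fit)); n <- length(buxy)
cd <- cooks.distance(fit)
which(cd > min(0.5, 2*p/n))   # cases that would be highlighted with open squares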

Example 7.7. Figure 7.? shows the response plot and residual plot for the Buxton (1920) data. Notice that the OLS fit passes through the outliers, but the response plot is resistant to Y–outliers since Y is on the vertical axis. Also notice that although the outlying cluster is far from Ȳ, only two of the outliers had large Cook's distance and only one case had a large WCD_i. Hence masking occurred for the Cook's distances, the WCD_i, and for the OLS residuals, but not for the OLS fitted values. Plots using lmsreg and ltsreg were similar, but trimmed views and MBA were effective. See Figures 7.? and 7.?. Figure 7.? was made with the following R commands.

source("G:/mpack.txt"); source("G:/mrobdata.txt")

mlrplot4(buxx,buxy) #right click Stop twice

Fig. 7.6 Plots for Buxton Data

Fig. 7.7 Plots for HBK Data

High leverage outliers are a particular challenge to conventional numerical MLR diagnostics such as Cook's distance, but can often be visualized using the response and residual plots. (Using the trimmed views of Section 7.? is also effective for detecting outliers and other departures from the MLR model.)

Example 7.8. Hawkins et al. (1984) gave a well known artificial data set where the first 10 cases are outliers while cases 11-14 are good leverage points. Figure 7.? shows the residual and response plots based on the OLS estimator. The highlighted cases have Cook's distance > min(0.5, 2p/n), and the identity line is shown in the response plot. Since the good cases 11-14 have the largest Cook's distances and absolute OLS residuals, swamping has occurred. (Masking has also occurred since the outliers have small Cook's distances, and some of the outliers have smaller OLS residuals than clean cases.) To determine whether both clusters are outliers or if one cluster consists of good leverage points, cases in both clusters could be given weight zero and the resulting response plot created. (Alternatively, response plots based on the tvreg estimator of Section 7.? could be made where the cases with weight one are highlighted. For high levels of trimming, the identity line often passes through the good leverage points.)

The above example is typical of many "benchmark" outlier data sets for MLR. In these data sets traditional OLS diagnostics such as Cook's distance and the residuals often fail to detect the outliers, but the combination of the response plot and residual plot is usually able to detect the outliers. The CD_i and WCD_i are the most effective when there is a single cluster about the identity line. If there is a second cluster of outliers or good leverage points or if there is nonconstant variance, then these numerical diagnostics tend to fail.


Fig. 7.8 The Highlighted Points are More Concentrated about the Attractor

Often practical "robust estimators" generate a sequence of K trial fits called attractors: b_1, ..., b_K. Then some criterion is evaluated and the attractor b_A that minimizes the criterion is used in the final estimator. One way to obtain attractors is to generate trial fits called starts, and then use the concentration technique. Let b_{0,j} be the jth start and compute all n residuals r_i(b_{0,j}) = Y_i − x_i^T b_{0,j}. At the next iteration, the OLS estimator b_{1,j} is computed from the c_n ≈ n/2 cases corresponding to the smallest squared residuals r_i^2(b_{0,j}). This iteration can be continued for k steps resulting in the sequence of estimators b_{0,j}, b_{1,j}, ..., b_{k,j}. Then b_{k,j} is the jth attractor for j = 1, ..., K. Using k = 10 concentration steps often works well, and the basic resampling algorithm is a special case with k = 0, i.e., the attractors are the starts. Elemental starts are the fits from randomly selected "elemental sets" of p cases. Such an algorithm is called a CLTS concentration algorithm or CLTS.

A CLTA concentration algorithm would replace the OLS estimator by the L1 estimator, and the smallest c_n squared residuals by the smallest c_n absolute residuals. Many other variants are possible, but obtaining theoretical results may be difficult.
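
A minimal sketch of one CLTS attractor from a single elemental start is given below; the function name clts_attractor is illustrative, and the elemental fit is assumed to be nonsingular.

clts_attractor <- function(x, y, k = 10) {
  # x: n x p design matrix including the constant, y: response
  n <- nrow(x); p <- ncol(x); cn <- ceiling(n/2)
  start <- sample(1:n, p)                      # elemental start: p randomly selected cases
  b <- coef(lm(y[start] ~ x[start, ] - 1))     # elemental fit
  for(j in 1:k) {                              # k concentration steps
    r2 <- as.vector((y - x %*% b)^2)
    keep <- order(r2)[1:cn]                    # cn cases with smallest squared residuals
    b <- coef(lm(y[keep] ~ x[keep, ] - 1))     # OLS fit to those cases
  }
  b                                            # the attractor b_{k,j}
}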

Example 7.9. As an illustration of the CLTA concentration algorithm, consider the animal data from Rousseeuw and Leroy (1987, p. 57). The response Y is the log brain weight and the predictor x is the log body weight for 25 mammals and 3 dinosaurs (outliers with the highest body weight). Suppose that the first elemental start uses cases 20 and 14, corresponding to mouse and man. Then the start b_{s,1} = b_{0,1} = (2.952, 1.025)^T and the sum of the c = 14 smallest absolute residuals is ∑_{i=1}^{14} |r|_(i)(b_{0,1}) = 12.101. Figure 7.?a shows the scatterplot of x and y. The start is also shown and the 14 cases corresponding to the smallest absolute residuals are highlighted. The L1 fit to these c highlighted cases is b_{1,1} = (2.076, 0.979)^T and ∑_{i=1}^{14} |r|_(i)(b_{1,1}) = 6.990. The iteration consists of finding the cases corresponding to the c smallest absolute residuals, obtaining the corresponding L1 fit and repeating. The attractor b_{a,1} = b_{7,1} = (1.741, 0.821)^T and the LTA(c) criterion evaluated at the attractor is ∑_{i=1}^{14} |r|_(i)(b_{a,1}) = 2.172. Figure 7.?b shows the attractor and that the c highlighted cases corresponding to the smallest absolute residuals are much more concentrated than those in Figure 7.?a. Figure 7.?a shows 5 randomly selected starts while Figure 7.?b shows the corresponding attractors. Notice that the elemental starts have more variability than the attractors, but if the start passes through an outlier, so does the attractor.

Fig. 7.9 Starts and Attractors for the Animal Data

Suppose the data set has n cases where d are outliers and n − d are "clean" (not outliers). Then the outlier proportion γ = d/n. Suppose that K elemental sets are chosen with replacement and that it is desired to find K such that the probability P(that at least one of the elemental sets is clean) ≡ P1 ≈ 1 − α where α = 0.05 is a common choice. Then P1 = 1 − P(none of the K elemental sets is clean) ≈ 1 − [1 − (1 − γ)^p]^K by independence. Hence α ≈ [1 − (1 − γ)^p]^K or

K ≈ log(α) / log([1 − (1 − γ)^p]) ≈ log(α) / [−(1 − γ)^p]    (7.13)

using the approximation log(1 − x) ≈ −x for small x. Since log(0.05) ≈ −3, if α = 0.05, then K ≈ 3 / (1 − γ)^p. Frequently a clean subset is wanted even if the contamination proportion γ ≈ 0.5. Then for a 95% chance of obtaining at least one clean elemental set, K ≈ 3 (2^p) elemental sets need to be drawn. If the start passes through an outlier, so does the attractor. For concentration algorithms for multivariate location and dispersion, if the start passes through a cluster of outliers, sometimes the attractor would be clean.

Table 7.1 Largest p for a 95% Chance of a Clean Subsample.

 γ \ K   500  3000  10000   10^5   10^6   10^7   10^8   10^9
 0.01    509   687    807   1036   1265   1494   1723   1952
 0.05     99   134    158    203    247    292    337    382
 0.10     48    65     76     98    120    142    164    186
 0.15     31    42     49     64     78     92    106    120
 0.20     22    30     36     46     56     67     77     87
 0.25     17    24     28     36     44     52     60     68
 0.30     14    19     22     29     35     42     48     55
 0.35     11    16     18     24     29     34     40     45
 0.40     10    13     15     20     24     29     33     38
 0.45      8    11     13     17     21     25     28     32
 0.50      7     9     11     15     18     21     24     28

Notice that the number of subsets K needed to obtain a clean elemental set with high probability is an exponential function of the number of predictors p but is free of n. Hawkins and Olive (2002) showed that if K is fixed and free of n, then the resulting elemental or concentration algorithm (that uses k concentration steps) is inconsistent and zero breakdown. See Theorem 7.?.


Nevertheless, many practical estimators tend to use a value of K that is free of both n and p (e.g. K = 500 or K = 3000). Such algorithms include ALMS = FLMS = lmsreg and ALTS = FLTS = ltsreg. The "A" denotes that an algorithm was used. The "F" means that a fixed number of trial fits (K elemental fits) was used and the criterion (LMS or LTS) was used to select the trial fit used in the final estimator.

To examine the outlier resistance of such inconsistent zero breakdown estimators, fix both K and the contamination proportion γ and then find the largest number of predictors p that can be in the model such that the probability of finding at least one clean elemental set is high. Given K and γ, P(at least one of K subsamples is clean) = 0.95 ≈ 1 − [1 − (1 − γ)^p]^K. Thus the largest value of p satisfies 3/(1 − γ)^p ≈ K, or

p ≈ ⌊ log(3/K) / log(1 − γ) ⌋    (7.14)

if the sample size n is very large. Again ⌊x⌋ is the greatest integer function: ⌊7.7⌋ = 7.

Table 7.1 shows the largest value of p such that there is a 95% chance that at least one of K subsamples is clean, using the approximation given by Equation (7.14). Hence if p = 28, even with one billion subsamples, there is a 5% chance that none of the subsamples will be clean if the contamination proportion γ = 0.5. Since clean elemental fits have great variability, an algorithm needs to produce many clean fits in order for the best fit to be good. When contamination is present, all K elemental sets could contain outliers. Hence basic resampling and concentration algorithms that only use K elemental starts are doomed to fail if γ and p are large.
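
Equation (7.14) is easy to check in R; the snippet below reproduces, for example, the entry p = 28 for K = 10^9 and γ = 0.5, along with other entries of Table 7.1.

maxp <- function(K, gam) floor(log(3/K) / log(1 - gam))   # equation (7.14)
maxp(K = 1e9, gam = 0.5)    # 28, as in Table 7.1
outer(c(500, 3000, 10000, 1e5, 1e6, 1e7, 1e8, 1e9),
      c(0.01, 0.05, 0.5), maxp)   # rows index K, columns index gamma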

The outlier resistance of elemental algorithms that use K elemental sets decreases rapidly as p increases. However, for p < 10, such elemental algorithms are often useful for outlier detection. They can perform better than MBA, trimmed views, and rmreg2 if p is small and the outliers are close to the bulk of the data or if p is small and there is a mixture distribution: the bulk of the data follows one MLR model, but "outliers" and some of the clean data are fit well by another MLR model. For example, if there is one nontrivial predictor, suppose the plot of x versus Y looks like the letter X. Such a mixture distribution is not really an outlier configuration since outliers lie far from the bulk of the data. All practical estimators have outlier configurations where they perform poorly. If p is small, elemental algorithms tend to have trouble when there is a weak regression relationship for the bulk of the data and a cluster of outliers that are not good leverage points (do not fall near the hyperplane followed by the bulk of the data). The Buxton (1920) data set is an example.

Proposition 7.1. Let h = p be the number of randomly selected cases in an elemental set, and let γo be the highest percentage of massive outliers that a resampling algorithm can detect reliably. If n is large, then

γo ≈ min( (n − c)/n , 1 − [1 − (0.2)^(1/K)]^(1/h) ) 100%.    (7.15)

Proof. As in Remark 4.1, if the contamination proportion γ is fixed, then the probability of obtaining at least one clean subset of size h is high (say 1 − α = 0.8) when 0.8 = 1 − [1 − (1 − γ)^h]^K. Fix the number of starts K and solve this equation for γ. □

The value of γo depends on c ≥ n/2 and h. To maximize γo, take c ≈ n/2 and h = p. For example, with K = 500 starts, n > 100, and h = p ≤ 20 the resampling algorithm should be able to detect up to 24% outliers provided every clean start is able to at least partially separate inliers (clean cases) from outliers. However, if h = p = 50, this proportion drops to 11%.
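These percentages can be checked with a few base R lines (an added illustration; the function name gammao is not from the text). Since c ≈ n/2 makes the first term of (7.15) about 0.5, only the second term matters here.

gammao <- function(K, h) 1 - (1 - 0.2^(1/K))^(1/h)  #second term of Eq. (7.15)
100 * gammao(K = 500, h = 20)  #about 24.9, the roughly 24% cited above
100 * gammao(K = 500, h = 50)  #about 10.8, the roughly 11% cited above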

Definition 7.15. Let b1, ..., bJ be J estimators of β. Assume that J ≥ 2 and that OLS is included. A fit-fit (FF) plot is a scatterplot matrix of the fitted values Ŷ(b1), ..., Ŷ(bJ). Often Y is also included in the top or bottom row of the FF plot to see the response plots. A residual-residual (RR) plot is a scatterplot matrix of the residuals r(b1), ..., r(bJ). Often Y is also included in the top or bottom row of the RR plot to see the residual plots.

If the multiple linear regression model holds, if the predictors are bounded, and if all J regression estimators are consistent estimators of β, then the subplots in the FF and RR plots should be linear with a correlation tending to one as the sample size n increases. To prove this claim, let the ith residual from the jth fit bj be ri(bj) = Yi − xi^T bj where (Yi, xi^T) is the ith observation. Similarly, let the ith fitted value from the jth fit be Ŷi(bj) = xi^T bj. Then

‖ri(b1) − ri(b2)‖ = ‖Ŷi(b1) − Ŷi(b2)‖ = ‖xi^T (b1 − b2)‖ ≤ ‖xi‖ (‖b1 − β‖ + ‖b2 − β‖).    (7.16)

The FF plot is a powerful way for comparing fits. The commonly suggested alternative is to look at a table of the estimated coefficients, but coefficients can differ greatly while yielding similar fits if some of the predictors are highly correlated or if several of the predictors are independent of the response.

The mpack functions ffplot2 and rrplot2 make FF and RR plots using OLS, ALMS from lmsreg, ALTS from ltsreg, mbareg, an outlier detector mbalata, BB, and rmreg2 described in Section 8.6.2. OLS, BB, and mbareg are the three trial fits used by the default version of the √n consistent high breakdown hbreg estimator. See Section 7.?. The top row of ffplot2 shows the response plots. The R code below is useful and shows how to get some of the text’s data sets into R.

library(MASS)


rrplot2(buxx,buxy)

ffplot2(buxx,buxy)

#The following three data sets can be obtained with

#the source("G:/mrobdata.txt") command

#if the data file is on flash drive G.

rmreg2(buxx,buxy) #right click Stop twice

rmreg2(cbrainx,cbrainy)

rmreg2(gladox,gladoy)

hbk <- matrix(scan(),nrow=75,ncol=5,byrow=T)

hbk <- hbk[,-1]

rmreg2(hbk[,1:3],hbk[,4]) #Outliers are clear

#but fit avoids good leverage points.

nasty <- matrix(scan(),nrow=32,ncol=6,byrow=T)

nasty <- nasty[,-1]

rmreg2(nasty[,1:4],nasty[,5])

wood <- matrix(scan(),nrow=20,ncol=7,byrow=T)

wood <- wood[,-1]

rmreg2(wood[,1:5],wood[,6]) #failed to find

#the outliers

major <- matrix(scan(),nrow=112,ncol=7,byrow=T)

major <- major[,-1]

rmreg2(major[,1:5],major[,6])

Fig. 7.10 FF Plots for Buxton Data

Fig. 7.11 RR Plots for Buxton Data

Example 7.?, continued. The FF and RR plots for the Buxton (1920) data are shown in Figures 7.10 and 7.11. Note that only the last four estimators give large absolute residuals to the outliers. The top row of Figure 7.10 gives the response plots for the estimators. If there are two clusters, one in the upper right and one in the lower left of the response plot, then the identity line goes through both clusters. Hence the fit passes through the outliers. One feature of the MBA estimator is that it depends on the sample of 7 centers drawn and changes each time the function is called. In ten runs, about seven plots will look like Figures 7.10 and 7.11, but in about three plots the MBA estimator will also pass through the outliers.


Table 7.2 Summaries for Seven Data Sets; the Correlations of the Residuals from TV(M) and the Alternative Method are Given in the 1st 5 Rows

Method    Buxton  Gladstone  glado   hbk     major   nasty       wood
MBA        0.997  1.0        0.455   0.960   1.0     -0.004      0.9997
LMSREG    -0.114  0.671      0.938   0.977   0.981    0.9999     0.9995
LTSREG    -0.048  0.973      0.468   0.272   0.941    0.028      0.214
L1        -0.016  0.983      0.459   0.316   0.979    0.007      0.178
OLS        0.011  1.0        0.459   0.780   1.0      0.009      0.227
outliers   61-65  none       115     1-10    3,44     2,6,...,30 4,6,8,19
n          87     267        267     75      112      32         20
p          5      7          7       4       6        5          6
M          70     0          30      90      0        90         20

Table 7.2 compares the TV, MBA (for MLR), lmsreg, ltsreg, L1, and OLS estimators on 7 data sets available from the text’s website. The column headers give the file name while the remaining rows of the table give the sample size n, the number of predictors p, the amount of trimming M used by the TV estimator, the correlation of the residuals from the TV estimator with the corresponding alternative estimator, and the cases that were outliers. If the correlation was greater than 0.9, then the method was effective in detecting the outliers; otherwise the method failed. Sometimes the trimming percentage M for the TV estimator was picked after fitting the bulk of the data in order to find the good leverage points and outliers. Each model included a constant.

Notice that the TV, MBA, and OLS estimators were the same for the Gladstone (1905) data and for the Tremearne (1911) major data which had two small Y–outliers. For the Gladstone data, there is a cluster of infants that are good leverage points, and we attempt to predict brain weight with the head measurements height, length, breadth, size, and cephalic index. Originally, the variable length was incorrectly entered as 109 instead of 199 for case 115, and the glado data contains this outlier. In 1997, lmsreg was not able to detect the outlier while ltsreg did. Due to changes in the Splus 2000 code, lmsreg detected the outlier but ltsreg did not. These two functions change often, not always for the better.

7.4 Robust Regression

This section will consider the breakdown of a regression estimator and then develop the practical high breakdown hbreg estimator.


7.4.1 MLR Breakdown and Equivariance

Breakdown and equivariance properties have received considerable attention in the literature. Several of these properties involve transformations of the data, and are discussed below. If X and Y are the original data, then the vector of the coefficient estimates is

β̂ = β̂(X, Y) = T(X, Y),    (7.17)

the vector of predicted values is

Ŷ = Ŷ(X, Y) = X β̂(X, Y),    (7.18)

and the vector of residuals is

r = r(X, Y) = Y − Ŷ.    (7.19)

If the design matrix X is transformed into W and the vector of dependent variables Y is transformed into Z, then (W, Z) is the new data set.

Definition 7.16. Regression Equivariance: Let u be any p × 1 vector. Then β̂ is regression equivariant if

β̂(X, Y + Xu) = T(X, Y + Xu) = T(X, Y) + u = β̂(X, Y) + u.    (7.20)

Hence if W = X and Z = Y + Xu, then Ẑ = Ŷ + Xu and r(W, Z) = Z − Ẑ = r(X, Y). Note that the residuals are invariant under this type of transformation, and note that if u = −β̂, then regression equivariance implies that we should not find any linear structure if we regress the residuals on X. Also see Problem 7.3.

Definition 7.17. Scale Equivariance: Let c be any scalar. Then β̂ is scale equivariant if

β̂(X, cY) = T(X, cY) = c T(X, Y) = c β̂(X, Y).    (7.21)

Hence if W = X and Z = cY, then Ẑ = cŶ and r(X, cY) = c r(X, Y). Scale equivariance implies that if the Yi’s are stretched, then the fits and the residuals should be stretched by the same factor.

Definition 7.18. Affine Equivariance: Let A be any p × p nonsingular matrix. Then β̂ is affine equivariant if

β̂(XA, Y) = T(XA, Y) = A^{-1} T(X, Y) = A^{-1} β̂(X, Y).    (7.22)

Hence if W = XA and Z = Y, then Ẑ = W β̂(XA, Y) = XA A^{-1} β̂(X, Y) = Ŷ, and r(XA, Y) = Z − Ẑ = Y − Ŷ = r(X, Y). Note that both the predicted values and the residuals are invariant under an affine transformation of the predictor variables.

Definition 7.19. Permutation Invariance: Let P be an n × n permutation matrix. Then P^T P = P P^T = In where In is an n × n identity matrix and the superscript T denotes the transpose of a matrix. Then β̂ is permutation invariant if

β̂(PX, PY) = T(PX, PY) = T(X, Y) = β̂(X, Y).    (7.23)

Hence if W = PX and Z = PY, then Ẑ = PŶ and r(PX, PY) = P r(X, Y). If an estimator is not permutation invariant, then swapping rows of the n × (p + 1) augmented matrix (X, Y) will change the estimator. Hence the case number is important. If the estimator is permutation invariant, then the position of the case in the data cloud is of primary importance. Resampling algorithms are not permutation invariant because permuting the data causes different subsamples to be drawn.

The remainder of this subsection gives a standard definition of breakdown and then shows that if the median absolute residual is bounded in the presence of high contamination, then the regression estimator has a high breakdown value. The following notation will be useful. Let W denote the data matrix where the ith row corresponds to the ith case. For regression, W is the n × (p + 1) matrix with ith row (xi^T, Yi). Let W^n_d denote the data matrix where any dn of the cases have been replaced by arbitrarily bad contaminated cases. Then the contamination fraction is γ ≡ γn = dn/n, and the breakdown value of β̂ is the smallest value of γn needed to make ‖β̂‖ arbitrarily large.

Definition 7.20. Let 1 ≤ dn ≤ n. If T(W) is a p × 1 vector of regression coefficients, then the breakdown value of T is

B(T, W) = min{ dn/n : sup_{W^n_d} ‖T(W^n_d)‖ = ∞ }

where the supremum is over all possible corrupted samples W^n_d.

Definition 7.21. High breakdown regression estimators have γn → 0.5 as n → ∞ if the clean (uncontaminated) data are in general position: any p clean cases give a unique estimate of β. Estimators are zero breakdown if γn → 0 and positive breakdown if γn → γ > 0 as n → ∞.

The following result greatly simplifies some breakdown proofs and shows that a regression estimator basically breaks down if the median absolute residual MED(|ri|) can be made arbitrarily large. The result implies that if the breakdown value ≤ 0.5, breakdown can be computed using the median absolute residual MED(|ri|(W^n_d)) instead of ‖T(W^n_d)‖. Similarly β̂ is high breakdown if the median squared residual or the cn th largest absolute residual |ri|_(cn) or squared residual r^2_(cn) stay bounded under high contamination where cn ≈ n/2. Note that ‖β̂‖ ≡ ‖β̂(W^n_d)‖ ≤ M for some constant M that depends on T and W but not on the outliers if the number of outliers dn is less than the smallest number of outliers needed to cause breakdown.

Proposition 7.22. If the breakdown value ≤ 0.5, computing the breakdown value using the median absolute residual MED(|ri|(W^n_d)) instead of ‖T(W^n_d)‖ is asymptotically equivalent to using Definition 7.20.

Proof. Consider any contaminated data set W^n_d with ith row (wi^T, Zi)^T. If the regression estimator T(W^n_d) = β̂ satisfies ‖β̂‖ ≤ M for some constant M if d < dn, then the median absolute residual MED(|Zi − β̂^T wi|) is bounded by max_{i=1,...,n} |Yi − β̂^T xi| ≤ max_{i=1,...,n} [ |Yi| + Σ_{j=1}^p M |xi,j| ] if dn < n/2.

If the median absolute residual is bounded by M when d < dn, then ‖β̂‖ is bounded provided fewer than half of the cases lie on the hyperplane (and so have absolute residual of 0), as shown next. Now suppose that ‖β̂‖ = ∞. Since the absolute residual is the vertical distance of the observation from the hyperplane, the absolute residual |ri| = 0 if the ith case lies on the regression hyperplane, but |ri| = ∞ otherwise. Hence MED(|ri|) = ∞ if fewer than half of the cases lie on the regression hyperplane. This will occur unless the proportion of outliers dn/n > (n/2 − q)/n → 0.5 as n → ∞ where q is the number of “good” cases that lie on a hyperplane of lower dimension than p. In the literature it is usually assumed that the original data are in general position: q = p − 1. □

Suppose that the clean data are in general position and that the number of outliers is less than the number needed to make the median absolute residual and ‖β̂‖ arbitrarily large. If the xi are fixed, and the outliers are moved up and down by adding a large positive or negative constant to the Y values of the outliers, then for high breakdown (HB) estimators, β̂ and MED(|ri|) stay bounded where the bounds depend on the clean data W but not on the outliers even if the number of outliers is nearly as large as n/2. Thus if the |Yi| values of the outliers are large enough, the |ri| values of the outliers will be large.

If the Yi’s are fixed, arbitrarily large x-outliers tend to drive the slope estimates to 0, not ∞. If both x and Y can be varied, then a cluster of outliers can be moved arbitrarily far from the bulk of the data but may still have small residuals. For example, move the outliers along the regression hyperplane formed by the clean cases.

If the (xi^T, Yi) are in general position, then the contamination could be such that β̂ passes exactly through p − 1 “clean” cases and dn “contaminated” cases. Hence dn + p − 1 cases could have absolute residuals equal to zero with ‖β̂‖ arbitrarily large (but finite). Nevertheless, if T possesses reasonable equivariant properties and ‖T(W^n_d)‖ is replaced by the median absolute residual in the definition of breakdown, then the two breakdown values are asymptotically equivalent. (If T(W) ≡ 0, then T is neither regression nor affine equivariant. The breakdown value of T is one, but the median absolute residual can be made arbitrarily large if the number of contaminated cases is greater than n/2.)

If the Yi’s are fixed, arbitrarily large x-outliers will rarely drive ‖β̂‖ to ∞. The x-outliers can drive ‖β̂‖ to ∞ if they can be constructed so that the estimator is no longer defined, e.g. so that X^T X is nearly singular. The examples following some results on norms may help illustrate these points.

Definition 7.22. Let y be an n × 1 vector. Then ‖y‖ is a vector norm if
vn1) ‖y‖ ≥ 0 for every y ∈ R^n with equality iff y is the zero vector,
vn2) ‖ay‖ = |a| ‖y‖ for all y ∈ R^n and for all scalars a, and
vn3) ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x and y in R^n.

Definition 7.23. Let G be an n × p matrix. Then ‖G‖ is a matrix norm if
mn1) ‖G‖ ≥ 0 for every n × p matrix G with equality iff G is the zero matrix,
mn2) ‖aG‖ = |a| ‖G‖ for all scalars a, and
mn3) ‖G + H‖ ≤ ‖G‖ + ‖H‖ for all n × p matrices G and H.

Example 7.5. The q-norm of a vector y is ‖y‖_q = (|y1|^q + · · · + |yn|^q)^(1/q). In particular, ‖y‖_1 = |y1| + · · · + |yn|, the Euclidean norm ‖y‖_2 = √(y1^2 + · · · + yn^2), and ‖y‖_∞ = max_i |yi|. Given a matrix G and a vector norm ‖y‖_q, the q-norm or subordinate matrix norm of the matrix G is

‖G‖_q = max_{y ≠ 0} ‖Gy‖_q / ‖y‖_q.

It can be shown that the maximum column sum norm ‖G‖_1 = max_{1≤j≤p} Σ_{i=1}^n |gij|, the maximum row sum norm ‖G‖_∞ = max_{1≤i≤n} Σ_{j=1}^p |gij|, and the spectral norm ‖G‖_2 = √(maximum eigenvalue of G^T G). The Frobenius norm

‖G‖_F = √( Σ_{j=1}^p Σ_{i=1}^n |gij|^2 ) = √(trace(G^T G)).
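For readers who want to check these matrix norms numerically, base R’s norm() function computes them (an added illustration; the matrix G below is arbitrary).

G <- matrix(c(1, -2, 3, 4, 0, -1), nrow = 3, ncol = 2)
norm(G, type = "O")  #maximum column sum norm ||G||_1
norm(G, type = "I")  #maximum row sum norm    ||G||_inf
norm(G, type = "2")  #spectral norm           ||G||_2
norm(G, type = "F")  #Frobenius norm          ||G||_F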

Several useful results involving matrix norms will be used. First, for any subordinate matrix norm, ‖Gy‖_q ≤ ‖G‖_q ‖y‖_q. Let J = Jm = {m1, ..., mp} denote the p cases in the mth elemental fit bJ = XJ^{-1} YJ. Then for any elemental fit bJ (suppressing q = 2),

‖bJ − β‖ = ‖XJ^{-1}(XJ β + eJ) − β‖ = ‖XJ^{-1} eJ‖ ≤ ‖XJ^{-1}‖ ‖eJ‖.    (7.24)


The following results (Golub and Van Loan 1989, pp. 57, 80) on the Euclidean norm are useful. Let 0 ≤ σp ≤ σp−1 ≤ · · · ≤ σ1 denote the singular values of XJ = (x_{mi,j}). Then

‖XJ^{-1}‖ = σ1 / (σp ‖XJ‖),    (7.25)

max_{i,j} |x_{mi,j}| ≤ ‖XJ‖ ≤ p max_{i,j} |x_{mi,j}|, and    (7.26)

1 / (p max_{i,j} |x_{mi,j}|) ≤ 1/‖XJ‖ ≤ ‖XJ^{-1}‖.    (7.27)

From now on, unless otherwise stated, we will use the spectral norm as the matrix norm and the Euclidean norm as the vector norm.

Example 7.6. Suppose the response values Y are near 0. Consider the fit from an elemental set: bJ = XJ^{-1} YJ and examine Equations (7.?), (7.?), and (7.?). Now ‖bJ‖ ≤ ‖XJ^{-1}‖ ‖YJ‖, and since x-outliers make ‖XJ‖ large, x-outliers tend to drive ‖XJ^{-1}‖ and ‖bJ‖ towards zero not towards ∞. The x-outliers may make ‖bJ‖ large if they can make the trial design XJ nearly singular. Notice that the Euclidean norm ‖bJ‖ can easily be made large if one or more of the elemental response variables is driven far away from zero.

Example 7.7. Without loss of generality, assume that the clean Y’s are contained in an interval [a, f] for some a and f. Assume that the regression model contains an intercept β1. Then there exists an estimator β̂M of β such that ‖β̂M‖ ≤ max(|a|, |f|) if dn < n/2.

Proof. Let MED(n) = MED(Y1, ..., Yn) and MAD(n) = MAD(Y1, ..., Yn). Take β̂M = (MED(n), 0, ..., 0)^T. Then ‖β̂M‖ = |MED(n)| ≤ max(|a|, |f|). Note that the median absolute residual for the fit β̂M is equal to the median absolute deviation MAD(n) = MED(|Yi − MED(n)|, i = 1, ..., n) ≤ f − a if dn < ⌊(n + 1)/2⌋. □

Note that β̂M is a poor high breakdown estimator of β and Ŷi(β̂M) tracks the Yi very poorly. If the data are in general position, a high breakdown regression estimator is an estimator which has a bounded median absolute residual even when close to half of the observations are arbitrary. Rousseeuw and Leroy (1987, pp. 29, 206) conjectured that high breakdown regression estimators cannot be computed cheaply, and that if the algorithm is also affine equivariant, then the complexity of the algorithm must be at least O(n^p). The following theorem shows that these two conjectures are false.

Theorem 7.1. If the clean data are in general position and the model has an intercept, then a scale and affine equivariant high breakdown estimator β̂w can be found by computing OLS on the set of cases that have Yi ∈ [MED(Y1, ..., Yn) ± w MAD(Y1, ..., Yn)] where w ≥ 1 (so at least half of the cases are used).


Proof. Note that β̂w is obtained by computing OLS on the set J of the nj cases which have

Yi ∈ [MED(Y1, ..., Yn) ± w MAD(Y1, ..., Yn)] ≡ [MED(n) ± w MAD(n)]

where w ≥ 1 (to guarantee that nj ≥ n/2). Consider the estimator β̂M = (MED(n), 0, ..., 0)^T which yields the predicted values Ŷi ≡ MED(n). The squared residual ri^2(β̂M) ≤ (w MAD(n))^2 if the ith case is in J. Hence the weighted LS fit β̂w is the OLS fit to the cases in J and has

Σ_{i∈J} ri^2(β̂w) ≤ nj (w MAD(n))^2.

Thus

MED(|r1(β̂w)|, ..., |rn(β̂w)|) ≤ √nj w MAD(n) < √n w MAD(n) < ∞.

Thus the estimator β̂w has a median absolute residual bounded by √n w MAD(Y1, ..., Yn). Hence β̂w is high breakdown, and it is affine equivariant since the design is not used to choose the observations. It is scale equivariant since for constant c = 0, β̂w = 0, and for c ≠ 0 the set of cases used remains the same under scale transformations and OLS is scale equivariant. □

Note that if w is huge and MAD(n) ≠ 0, then the high breakdown estimator β̂w and β̂OLS will be the same for most data sets. Thus high breakdown estimators can be very nonrobust. Even if w = 1, the HB estimator β̂w only resists large Y outliers.
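A minimal base R sketch of the Theorem 7.1 estimator β̂w follows (added as an illustration; betaw is an illustrative name, and mad() is called with constant = 1 so that it matches MAD(n) rather than the rescaled default).

betaw <- function(x, y, w = 1) {
  #keep the cases with Y in [MED(n) +/- w MAD(n)], then fit OLS to them
  keep <- abs(y - median(y)) <= w * mad(y, constant = 1)
  coef(lm(y[keep] ~ x[keep, , drop = FALSE]))
}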

An ALTA concentration algorithm uses the L1 estimator instead of OLS in the concentration step and uses the LTA criterion. Similarly an ALMS concentration algorithm uses the L∞ estimator and the LMS criterion.

Theorem 7.2. If the clean data are in general position and if a high breakdown start is added to an ALTA, ALTS, or ALMS concentration algorithm, then the resulting estimator is HB.

Proof. Concentration reduces (or does not increase) the corresponding HB criterion that is based on cn ≥ n/2 absolute residuals, so the median absolute residual of the resulting estimator is bounded as long as the criterion applied to the HB estimator is bounded. □

For example, consider the LTS(cn) criterion. Suppose the ordered squared residuals from the high breakdown mth start b0m are obtained. If the data are in general position, then QLTS(b0m) is bounded even if the number of outliers dn is nearly as large as n/2. Then b1m is simply the OLS fit to the cases corresponding to the cn smallest squared residuals r^2_(i)(b0m) for i = 1, ..., cn. Denote these cases by i1, ..., icn. Then

QLTS(b1m) = Σ_{i=1}^{cn} r^2_(i)(b1m) ≤ Σ_{j=1}^{cn} r^2_{ij}(b1m) ≤ Σ_{j=1}^{cn} r^2_{ij}(b0m) = Σ_{j=1}^{cn} r^2_(j)(b0m) = QLTS(b0m)

where the second inequality follows from the definition of the OLS estimator. Hence concentration steps reduce or at least do not increase the LTS criterion. If cn = (n + 1)/2 for n odd and cn = 1 + n/2 for n even, then the LTS criterion is bounded iff the median squared residual is bounded.
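The base R lines below sketch one concentration step and check the inequality QLTS(b1m) ≤ QLTS(b0m) on simulated data (an added illustration with illustrative names qlts and concstep, not the mpack code; the OLS start here is only for demonstrating the inequality, which holds for any start).

qlts <- function(b, x, y, cn) sum(sort(as.vector(y - cbind(1, x) %*% b)^2)[1:cn])
concstep <- function(b0, x, y, cn) {
  keep <- order(as.vector(y - cbind(1, x) %*% b0)^2)[1:cn]  #cn smallest squared residuals
  coef(lm(y[keep] ~ x[keep, , drop = FALSE]))               #OLS fit to those cases
}
set.seed(1); n <- 100; x <- matrix(rnorm(2 * n), n, 2)
y <- drop(1 + x %*% c(1, 1) + rnorm(n)); cn <- ceiling(n/2)
b0 <- coef(lm(y ~ x)); b1 <- concstep(b0, x, y, cn)
qlts(b1, x, y, cn) <= qlts(b0, x, y, cn)  #TRUE: the criterion did not increase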

Theorem 7.? can be used to show that the following two estimators are high breakdown. The estimator β̂B is the high breakdown attractor used by the √n consistent high breakdown hbreg estimator of Definition 7.25 while the estimator bk,B is the high breakdown attractor used by the √n consistent high breakdown CLTA estimator of Theorem 7.?.

Definition 7.24. Make an OLS fit to the cn ≈ n/2 cases whose Y values are closest to the MED(Y1, ..., Yn) ≡ MED(n) and use this fit as the start for concentration. Define β̂B to be the attractor after k concentration steps. Define bk,B = 0.9999 β̂B.

Theorem 7.3. If the clean data are in general position, then β̂B and bk,B are high breakdown regression estimators.

Proof. The start can be taken to be β̂w with w = 1 from Theorem 7.1. Since the start is high breakdown, so is the attractor β̂B by Theorem 7.2. Multiplying a HB estimator by a positive constant does not change the breakdown value, so bk,B is HB. □
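A base R sketch of β̂B and bk,B from Definition 7.24 is given below (added as an illustration; betaB is an illustrative name and the k = 10 default mirrors the text, but this is not the mpack implementation).

betaB <- function(x, y, k = 10) {
  n <- length(y); cn <- ceiling(n/2)
  keep <- order(abs(y - median(y)))[1:cn]           #cases with Y closest to MED(n)
  b <- coef(lm(y[keep] ~ x[keep, , drop = FALSE]))  #start for concentration
  for (step in 1:k) {                               #k concentration steps
    keep <- order(as.vector(y - cbind(1, x) %*% b)^2)[1:cn]
    b <- coef(lm(y[keep] ~ x[keep, , drop = FALSE]))
  }
  list(betaB = b, bkB = 0.9999 * b)                 #bk,B = 0.9999 betaB
}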

The following result shows that it is easy to make a HB estimator that is asymptotically equivalent to a consistent estimator on a large class of iid zero mean symmetric error distributions, although the outlier resistance of the HB estimator is poor. The following result may not hold if β̂C estimates βC and β̂LMS estimates βLMS where βC ≠ βLMS. Then bk,B could have a smaller median squared residual than β̂C even if there are no outliers. The two parameter vectors could differ because the constant term is different if the error distribution is not symmetric. For a large class of symmetric error distributions, βLMS = βOLS = βC ≡ β, and then the ratio MED(ri^2(β̂))/MED(ri^2(β)) → 1 as n → ∞ for any consistent estimator β̂ of β. The estimator below has two attractors, β̂C and bk,B, and the probability that the final estimator β̂D is equal to β̂C goes to one under the strong assumption that the error distribution is such that both β̂C and β̂LMS are consistent estimators of β.

Theorem 7.4. Assume the clean data are in general position, and that the LMS estimator is a consistent estimator of β. Let β̂C be any practical consistent estimator of β, and let β̂D = β̂C if MED(ri^2(β̂C)) ≤ MED(ri^2(bk,B)). Let β̂D = bk,B, otherwise. Then β̂D is a HB estimator that is asymptotically equivalent to β̂C.

Proof. The estimator is HB since the median squared residual of β̂D is no larger than that of the HB estimator bk,B. Since β̂C is consistent, MED(ri^2(β̂C)) → MED(e^2) in probability where MED(e^2) is the population median of the squared error e^2. Since the LMS estimator is consistent, the probability that β̂C has a smaller median squared residual than the biased estimator bk,B goes to 1 as n → ∞. Hence β̂D is asymptotically equivalent to β̂C. □

7.4.2 A Practical High Breakdown Consistent Estimator

Olive and Hawkins (2011) showed that the practical hbreg estimator is a high breakdown √n consistent robust estimator that is asymptotically equivalent to the least squares estimator for many error distributions. The hbreg estimator was used in Section 12.6.1 to make the somewhat outlier resistant rmreg estimator of multivariate linear regression that was asymptotically equivalent to the classical estimator.

Suppose the algorithm estimator uses some criterion to choose an attractor as the final estimator where there are K attractors and K is fixed, e.g. K = 500, so K does not depend on n. The following theorem is powerful because it does not depend on the criterion used to choose the attractor. If the algorithm needs to use many attractors to achieve outlier resistance, then the individual attractors have little outlier resistance. Such estimators include elemental concentration algorithms, heuristic and genetic algorithms, and projection algorithms. Algorithms such as elemental concentration algorithms where all K of the attractors are inconsistent are especially untrustworthy.

Suppose there are K consistent estimators β̂j of β, each with the same rate n^δ. If β̂A is an estimator obtained by choosing one of the K estimators, then β̂A is a consistent estimator of β with rate n^δ by Pratt (1959).

Theorem 7.5. Suppose the algorithm estimator chooses an attractor as the final estimator where there are K attractors and K is fixed.
i) If all of the attractors are consistent, then the algorithm estimator is consistent.
ii) If all of the attractors are consistent with the same rate, e.g., n^δ where 0 < δ ≤ 0.5, then the algorithm estimator is consistent with the same rate as the attractors.
iii) If all of the attractors are high breakdown, then the algorithm estimator is high breakdown.

Proof. i) Choosing from K consistent estimators results in a consistent estimator, and ii) follows from Pratt (1959). iii) Let γn,i be the breakdown value of the ith attractor if the clean data are in general position. The breakdown value γn of the algorithm estimator can be no lower than that of the worst attractor: γn ≥ min(γn,1, ..., γn,K) → 0.5 as n → ∞. □

The consistency of the algorithm estimator changes dramatically if K is fixed but the start size h = hn = g(n) where g(n) → ∞. In particular, if K starts with rate n^{1/2} are used, the final estimator also has rate n^{1/2}. The drawback to these algorithms is that they may not have enough outlier resistance. Notice that the basic resampling result below is free of the criterion.

Proposition 7.8. Suppose Kn ≡ K starts are used and that all starts have subset size hn = g(n) ↑ ∞ as n → ∞. Assume that the estimator applied to the subset has rate n^δ.
i) For the hn-set basic resampling algorithm, the algorithm estimator has rate [g(n)]^δ.
ii) Under regularity conditions (e.g. given by He and Portnoy 1992), the k–step CLTS estimator has rate [g(n)]^δ.

Proof. i) The hn = g(n) cases are randomly sampled without replacement. Hence the classical estimator applied to these g(n) cases has rate [g(n)]^δ. Thus all K starts have rate [g(n)]^δ, and the result follows by Pratt (1959). ii) By He and Portnoy (1992), all K attractors have [g(n)]^δ rate, and the result follows by Pratt (1959). □

Remark 7.1. Proposition 7.2 shows that β̂ is HB if the median absolute or squared residual (or |r(β̂)|_(cn) or r^2_(cn) where cn ≈ n/2) stays bounded under high contamination. Let QL(β̂H) denote the LMS, LTS, or LTA criterion for an estimator β̂H; therefore, the estimator β̂H is high breakdown if and only if QL(β̂H) is bounded for dn near n/2 where dn < n/2 is the number of outliers. The concentration operator refines an initial estimator by successively reducing the LTS criterion. If β̂F refers to the final estimator (attractor) obtained by applying concentration to some starting estimator β̂H that is high breakdown, then since QLTS(β̂F) ≤ QLTS(β̂H), applying concentration to a high breakdown start results in a high breakdown attractor. See Theorem 7.?.

High breakdown estimators are, however, not necessarily useful for detecting outliers. Suppose γn < 0.5. On the one hand, if the xi are fixed, and the outliers are moved up and down parallel to the Y axis, then for high breakdown estimators, β̂ and MED(|ri|) will be bounded. Thus if the |Yi| values of the outliers are large enough, the |ri| values of the outliers will be large, suggesting that the high breakdown estimator is useful for outlier detection. On the other hand, if the Yi’s are fixed at any values and the x values perturbed, sufficiently large x-outliers tend to drive the slope estimates to 0, not ∞. For many estimators, including LTS, LMS, and LTA, a cluster of Y outliers can be moved arbitrarily far from the bulk of the data but still, by perturbing their x values, have arbitrarily small residuals.

Our practical high breakdown procedure is made up of three components.
1) A practical estimator β̂C that is consistent for clean data. Suitable choices would include the full-sample OLS and L1 estimators.
2) A practical estimator β̂A that is effective for outlier identification. Suitable choices include the mbareg, rmreg2, lmsreg, or FLTS estimators.
3) A practical high-breakdown estimator such as β̂B from Definition 7.13 with k = 10.

By selecting one of these three estimators according to the features each of them uncovers in the data, we may inherit some of the good properties of each of them.

Definition 7.25. The hbreg estimator β̂H is defined as follows. Pick a constant a > 1 and set β̂H = β̂C. If a QL(β̂A) < QL(β̂C), set β̂H = β̂A. If a QL(β̂B) < min[QL(β̂C), a QL(β̂A)], set β̂H = β̂B.

That is, find the smallest of the three scaled criterion values QL(β̂C), a QL(β̂A), a QL(β̂B). According to which of the three estimators attains this minimum, set β̂H to β̂C, β̂A, or β̂B respectively.
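A minimal base R sketch of this selection rule follows (an added illustration; hbregsel and the criterion function QL are illustrative names, not the mpack code, and the three fitted coefficient vectors are assumed to be available).

hbregsel <- function(betaC, betaA, betaB, QL, a = 1.4) {
  qC <- QL(betaC); qA <- QL(betaA); qB <- QL(betaB)
  bH <- betaC                                #default: the consistent estimator
  if (a * qA < qC) bH <- betaA               #outlier resistant attractor
  if (a * qB < min(qC, a * qA)) bH <- betaB  #high breakdown attractor
  bH
}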

Large sample theory for hbreg is simple and given in the following theorem. Let β̂L be the LMS, LTS, or LTA estimator that minimizes the criterion QL. Note that the impractical estimator β̂L is never computed. The following theorem shows that β̂H is asymptotically equivalent to β̂C on a large class of zero mean finite variance symmetric error distributions. Thus if β̂C is √n consistent or asymptotically efficient, so is β̂H. Notice that β̂A does not need to be consistent. This point is crucial since lmsreg is not consistent and it is not known whether FLTS is consistent. The clean data are in general position if any p clean cases give a unique estimate of β.

Theorem 7.6. Assume the clean data are in general position, and suppose that both β̂L and β̂C are consistent estimators of β where the regression model contains a constant. Then the hbreg estimator β̂H is high breakdown and asymptotically equivalent to β̂C.

Proof. Since the clean data are in general position and QL(β̂H) ≤ a QL(β̂B) is bounded for γn near 0.5, the hbreg estimator is high breakdown. Let Q*L = QL for LMS and Q*L = QL/n for LTS and LTA. As n → ∞, consistent estimators β̂ satisfy Q*L(β̂) − Q*L(β) → 0 in probability. Since LMS, LTS, and LTA are consistent and the minimum value is Q*L(β̂L), it follows that Q*L(β̂C) − Q*L(β̂L) → 0 in probability, while Q*L(β̂L) < a Q*L(β̂) for any estimator β̂. Thus with probability tending to one as n → ∞, QL(β̂C) < a min(QL(β̂A), QL(β̂B)). Hence β̂H is asymptotically equivalent to β̂C. □

Remark 7.2. i) Let β̂C = β̂OLS. Then hbreg is asymptotically equivalent to OLS when the errors ei are iid from a large class of zero mean finite variance symmetric distributions, including the N(0, σ^2) distribution, since the probability that hbreg uses OLS instead of β̂A or β̂B goes to one as n → ∞.

ii) The above theorem proves that practical high breakdown estimators with 100% asymptotic Gaussian efficiency exist; however, such estimators are not necessarily good.

iii) The theorem holds when both β̂L and β̂C are consistent estimators of β, for example, when the iid errors come from a large class of zero mean finite variance symmetric distributions. For asymmetric distributions, β̂C estimates βC and β̂L estimates βL where the constants usually differ. The theorem holds for some distributions that are not symmetric because of the penalty a. As a → ∞, the class of asymmetric distributions where the theorem holds greatly increases, but the outlier resistance decreases rapidly as a increases for a > 1.4.

iv) The default hbreg estimator used OLS, mbareg, and β̂B with a = 1.4 and the LTA criterion. For the simulated data with symmetric error distributions, β̂B appeared to give biased estimates of the slopes. However, for the simulated data with right skewed error distributions, β̂B appeared to give good estimates of the slopes but not the constant estimated by OLS, and the probability that the hbreg estimator selected β̂B appeared to go to one.

v) Both MBA and OLS are √n consistent estimators of β, even for a large class of skewed distributions. Using β̂A = β̂MBA and removing β̂B from the hbreg estimator results in a √n consistent estimator of β when β̂C = OLS is a √n consistent estimator of β, but massive sample sizes were still needed to get good estimates of the constant for skewed error distributions. For skewed distributions, if OLS needed n = 1000 to estimate the constant well, mbareg might need n > one million to estimate the constant well.

The situation is worse for multivariate linear regression when hbreg is used instead of OLS, since there are m constants to be estimated. If the distribution of the iid error vectors ei is not elliptically contoured, getting all m mbareg estimators to estimate all m constants well needs even larger sample sizes. See the rmreg estimator of Section 12.6.1.

vi) The outlier resistance of hbreg is not especially good, and could likely be improved by letting the constant ap depend on p. Let ap be very near 1 for low p and increase ap to 1.4 as p increases.

vii) Note that for a large class of symmetric elliptically contoured distributions for ε, the hbreg estimator could be used instead of OLS in multivariate linear regression, but the classical tests could still be applied since the rmreg and classical estimators were asymptotically equivalent. See Section 12.6.1.

viii) The estimator β̂A only needs to be practical to compute; it does not need to be consistent. Hence hbreg can be used to fix the estimators that are zero breakdown and inconsistent, but the maximal bias of hbreg will not be as good as that of some impractical high breakdown estimators.

The family of hbreg estimators is enormous and depends on i) the practical high breakdown estimator β̂B, ii) β̂C, iii) β̂A, iv) a, and v) the criterion QL. Note that the theory needs the error distribution to be such that both β̂C and β̂L are consistent. Sufficient conditions for LMS, LTS, and LTA to be consistent are rather strong. To have reasonable sufficient conditions for the hbreg estimator to be consistent, β̂C should be consistent under weak conditions. Hence OLS is a good choice that results in 100% asymptotic Gaussian efficiency.

We suggest using the LTA criterion since in simulations, hbreg behaved like β̂C for smaller sample sizes than those needed by the LTS and LMS criteria. We want a near 1 so that hbreg has outlier resistance similar to β̂A, but we want a large enough so that hbreg performs like β̂C for moderate n on clean data. Simulations suggest that a = 1.4 is a reasonable choice. The default hbreg program from mpack uses the √n consistent outlier resistant estimator mbareg as β̂A.

There are at least three reasons for using β̂B as the high breakdown estimator. First, β̂B is high breakdown and simple to compute. Second, the fitted values roughly track the bulk of the data. Lastly, although β̂B has rather poor outlier resistance, β̂B does perform well on several outlier configurations where some common alternatives fail. See Figures 7.? and 7.?.

Next we will show that the hbreg estimator implemented with a = 1.4 using QLTA, β̂C = OLS, and β̂B can greatly improve the estimator β̂A. We will use β̂A = ltsreg in R and Splus 2000. Depending on the implementation, the ltsreg estimators use the elemental resampling algorithm, the elemental concentration algorithm, or a genetic algorithm. Coverage is 50%, 75%, or 90%. The Splus 2000 implementation is an unusually poor genetic algorithm with 90% coverage. The R implementation appears to be the zero breakdown inconsistent elemental basic resampling algorithm that uses 50% coverage. The ltsreg function changes often.

Simulations were run in R with the xij (for j > 1) and ei iid N(0, σ^2) and β = 1, the p × 1 vector of ones. Then β̂ was recorded for 100 runs. The mean and standard deviation of the β̂j were recorded for j = 1, ..., p. For n ≥ 10p and OLS, the vector of means should be close to 1 and the vector of standard deviations should be close to 1/√n. The √n consistent high breakdown hbreg estimator performed like OLS if n ≈ 35p and 2 ≤ p ≤ 6, if n ≈ 20p and 7 ≤ p ≤ 14, or if n ≈ 15p and 15 ≤ p ≤ 40. See Table 7.3 for p = 5 and 100 runs. ALTS denotes ltsreg, HB denotes hbreg, and BB denotes β̂B. In the simulations, hbreg estimated the slopes well for the highly skewed lognormal data, but not the OLS constant. Use the mpack function hbregsim.


Table 7.3 MEAN β̂i and SD(β̂i)

  n   method  mn or sd   β̂1     β̂2     β̂3     β̂4     β̂5
 25   HB      mn       0.9921  0.9825  0.9989  0.9680  1.0231
              sd       0.4821  0.5142  0.5590  0.4537  0.5461
      OLS     mn       1.0113  1.0116  0.9564  0.9867  1.0019
              sd       0.2308  0.2378  0.2126  0.2071  0.2441
      ALTS    mn       1.0028  1.0065  1.0198  1.0092  1.0374
              sd       0.5028  0.5319  0.5467  0.4828  0.5614
      BB      mn       1.0278  0.5314  0.5182  0.5134  0.5752
              sd       0.4960  0.3960  0.3612  0.4250  0.3940
 400  HB      mn       1.0023  0.9943  1.0028  1.0103  1.0076
              sd       0.0529  0.0496  0.0514  0.0459  0.0527
      OLS     mn       1.0023  0.9943  1.0028  1.0103  1.0076
              sd       0.0529  0.0496  0.0514  0.0459  0.0527
      ALTS    mn       1.0077  0.9823  1.0068  1.0069  1.0214
              sd       0.1655  0.1542  0.1609  0.1629  0.1679
      BB      mn       1.0184  0.8744  0.8764  0.8679  0.8794
              sd       0.1273  0.1084  0.1215  0.1206  0.1269

As implemented in mpack, the hbreg estimator is a practical √n consistent high breakdown estimator that appears to perform like OLS for moderate n if the errors are unimodal and symmetric, and to have outlier resistance comparable to competing practical “outlier resistant” estimators.

The hbreg, lmsreg, ltsreg, OLS, and β̂B estimators were compared on the same 25 benchmark data sets. Also see Park et al. (2012). The HB estimator β̂B was surprisingly good in that the response plots showed that it was the best estimator for 2 data sets and that it usually tracked the data, but it performed poorly in 7 of the 25 data sets. The hbreg estimator performed well, but for a few data sets hbreg did not pick the attractor with the best response plot, as illustrated in the following example.

Fig. 7.12 Response Plots Comparing Robust Regression Estimators

Example 7.8. The LMS, LTA, and LTS estimators are determined by a “narrowest band” covering half of the cases. Hawkins and Olive (2002) suggested that the fit will pass through outliers if the band through the outliers is narrower than the band through the clean cases. This behavior tends to occur if the regression relationship is weak, and if there is a tight cluster of outliers where |Y| is not too large. As an illustration, Buxton (1920, pp. 232-5) gave 20 measurements of 88 men. Consider predicting stature using an intercept, head length, nasal height, bigonal breadth, and cephalic index. One case was deleted since it had missing values. Five individuals, numbers 61-65, were reported to be about 0.75 inches tall with head lengths well over five feet! Figure 7.12 shows the response plots for hbreg, OLS, ltsreg, and β̂B. Notice that only the fit from β̂B (BBFIT) did not pass through the outliers, but hbreg selected the OLS attractor. There are always outlier configurations where an estimator will fail, and hbreg should fail on configurations where LTA, LTS, and LMS would fail.

7.5 Summary

1) For the location model, the sample mean Ȳ = (Σ_{i=1}^n Yi)/n, the sample variance S^2_n = Σ_{i=1}^n (Yi − Ȳ)^2 / (n − 1), and the sample standard deviation Sn = √(S^2_n). If the data Y1, ..., Yn are arranged in ascending order from smallest to largest and written as Y(1) ≤ · · · ≤ Y(n), then Y(i) is the ith order statistic and the Y(i)’s are called the order statistics. The sample median

MED(n) = Y((n+1)/2) if n is odd, and MED(n) = [Y(n/2) + Y((n/2)+1)]/2 if n is even.

The notation MED(n) = MED(Y1, ..., Yn) will also be used. The sample median absolute deviation is MAD(n) = MED(|Yi − MED(n)|, i = 1, ..., n).

2) Suppose the multivariate data have been collected into an n × p matrix W = X with rows x1^T, ..., xn^T. The coordinatewise median MED(W) = (MED(X1), ..., MED(Xp))^T where MED(Xi) is the sample median of the data in column i corresponding to variable Xi. The sample mean x̄ = (1/n) Σ_{i=1}^n xi = (X̄1, ..., X̄p)^T where X̄i is the sample mean of the data in column i corresponding to variable Xi. The sample covariance matrix

S = (1/(n − 1)) Σ_{i=1}^n (xi − x̄)(xi − x̄)^T = (Sij).

That is, the ij entry of S is the sample covariance Sij. The classical estimator of multivariate location and dispersion is (T, C) = (x̄, S).

3) Let (T, C) = (T(W), C(W)) be an estimator of multivariate location and dispersion. The ith Mahalanobis distance Di = √(Di^2) where the ith squared Mahalanobis distance is

Di^2 = Di^2(T(W), C(W)) = (xi − T(W))^T C^{-1}(W) (xi − T(W)).
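In base R, these distances for the classical estimator (T, C) = (x̄, S) can be computed with the stats function mahalanobis (an added illustration; the data matrix x below is simulated).

x <- matrix(rnorm(300), ncol = 3)  #illustrative 100 x 3 data matrix
D2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))  #squared distances Di^2
D <- sqrt(D2)  #Mahalanobis distances Di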

4) The squared Euclidean distance of xi from the coordinatewise median is Di^2 = Di^2(MED(W), Ip). Concentration type steps compute the weighted median MEDj: the coordinatewise median computed from the cases xi with Di^2 ≤ MED(Di^2(MEDj−1, Ip)) where MED0 = MED(W). Often j = 0 (no concentration type steps) or j = 9 is used. Let Di = Di(MEDj, Ip). Let Wi = 1 if Di ≤ MED(D1, ..., Dn) + k MAD(D1, ..., Dn) where k ≥ 0 and k = 5 is the default choice. Let Wi = 0, otherwise.

5) Let the covmb2 set B of at least n/2 cases correspond to the cases with weight Wi = 1. Then the covmb2 estimator (T, C) is the sample mean and sample covariance matrix applied to the cases in set B. Hence

T = Σ_{i=1}^n Wi xi / Σ_{i=1}^n Wi  and  C = Σ_{i=1}^n Wi (xi − T)(xi − T)^T / (Σ_{i=1}^n Wi − 1).

The function ddplot5 plots the Euclidean distances from the coordinatewise median versus the Euclidean distances from the covmb2 location estimator. Typically the plotted points in this DD plot cluster about the identity line, and outliers appear in the upper right corner of the plot with a gap between the bulk of the data and the outliers.
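A base R sketch of the covmb2 type estimator described in 4) and 5), with j = 0 (no concentration type steps), is given below (an added illustration; covmb2sketch is an illustrative name, not the mpack covmb2 function).

covmb2sketch <- function(x, k = 5) {
  med <- apply(x, 2, median)                          #coordinatewise median MED(W)
  D <- sqrt(rowSums(sweep(x, 2, med)^2))              #Euclidean distances from MED(W)
  keep <- D <= median(D) + k * mad(D, constant = 1)   #cases with weight Wi = 1
  list(T = colMeans(x[keep, , drop = FALSE]), C = cov(x[keep, , drop = FALSE]))
}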

7.6 Complements

The fact that response plots are extremely useful for model assessment and for detecting influential cases and outliers for an enormous variety of statistical models does not seem to be well known. Certainly in any multiple linear regression analysis, the response plot and the residual plot of Ŷ versus r should always be made. Cook and Olive (2001) used response plots to select a response transformation graphically. Olive (2005a) suggested using residual, response, RR, and FF plots to detect outliers while Hawkins and Olive (2002, pp. 141, 158) suggested using the RR and FF plots. The four plots are best for n ≥ 5p. Olive (2008: § 6.4, 2017a: ch. 5-9) showed that the residual and response plots are useful for experimental design models. Park et al. (2012) showed response plots are competitive with the best robust regression methods for outlier detection on some outlier data sets that have appeared in the literature.

Olive (2004b, 2013b) used response plots for 1D regression, including generalized linear models and generalized additive models. Olive (2017a) covered 1D regression, multiple linear regression, and prediction intervals using the shorth. Olive and Hawkins (2003) showed that regression residuals behave well.


Rousseeuw and van Zomeren (1990) suggested that Mahalanobis distances based on “robust estimators” of location and dispersion can be more useful than the distances based on the sample mean and covariance matrix. They show that a plot of robust Mahalanobis distances RDi versus residuals from “robust regression” can be useful.

Hampel et al. (1986, pp. 96-98) and Donoho and Huber (1983) provided some history for breakdown. Maguluri and Singh (1997) gave interesting examples on breakdown. Morgenthaler (1989) and Stefanski (1991) conjectured that high breakdown estimators with high efficiency are not possible. Theorems 7.?, 7.?, and 7.? show that these conjectures are false. The cross checking estimator uses the classical estimator if it is “close” to an impractical high breakdown consistent estimator, and uses the high breakdown estimator, otherwise. See He and Portnoy (1992). The estimator in Theorem 7.? is similar, and has problems since the practical robust estimator is bad, and it is hard to define “close” when the robust and classical estimators are not estimating the same β, e.g. if the error distribution is not symmetric.

This paragraph will explain why mbareg needs large samples to give a good estimate of the constant for highly skewed error distributions. Note that the LMS, LTA, and LTS criteria use half sets. For simplicity, consider the LMS criterion that minimizes the median squared residual. Heuristically, for highly right skewed data, let the “left tail half set” shift the constant of the OLS hyperplane down so that the half set of cases closest to the plane are the half set with the smallest OLS residual values. These cases will have negative residuals and residuals close to zero, which are roughly the cases corresponding to the half set of errors in the left tail of the error distribution. Let the “OLS half set” correspond to the half set of cases with the smallest absolute OLS residuals, so the cases closest to the OLS hyperplane. Since the distribution is highly right skewed, the “OLS half set” has much more variability than the “left tail half set.” (For the location model, OLS is the sample mean which is greater than the sample median for right skewed data. The “left tail half set” shifts the mean down to the midpoint c of the minimum value and the median value, and often c ≈ median/2 if the support of the highly right skewed distribution is (0, ∞).) A trial fit that uses the same OLS slope estimates but which shifts the intercept down to use the “left tail half set” will have a smaller median squared residual than the median squared residual using OLS. The trial fits for mbareg are √n consistent, so estimate the OLS intercept eventually. However, the trial fit that uses 1% of the data has less than 1% efficiency since the x values are close in distance rather than spread out. Unless the sample size is large, the mbareg estimator tends to produce some trial fits that shift the intercept down, and one of these trial fits is selected to be the final mbareg estimator since it has a smaller median squared residual than the other trial fits.

The TV estimator was proposed by Olive (2002, 2005a) and is similar to an estimator proposed by Rousseeuw and van Zomeren (1992). Although both the TV and MBA estimators have the good OP(n^{-1/2}) convergence rate, their efficiency under normality may be very low. Chang and Olive (2007, 2010) suggested a method of adaptive trimming such that the resulting estimator is asymptotically equivalent to the OLS estimator.

Introductions to 1D regression and regression graphics are Cook and Weisberg (1999a, ch. 18, 19, and 20) and Cook and Weisberg (1999b), while Olive (2008: ch. 12, 2010: ch. 15, 2017a) considers 1D regression. Chang (2006) and Chang and Olive (2007, 2010) extended least squares theory to 1D regression models where h(x) = β^T x.

If n is not much larger than p, then Hoffman et al. (2015) gave a robust Partial Least Squares–Lasso type estimator that uses a clever weighting scheme.

7.6.1 More on Robust Regression

Theorems similar to Theorems 7.? and 7.? are easy to derive, but depend on the strong assumption that the robust estimator that minimizes the robust criterion and OLS estimate the same β. The basic resampling lmsreg estimator is an inconsistent zero breakdown estimator by Hawkins and Olive (2002), but the modification in Theorem 7.? ii) is HB and asymptotically equivalent to OLS for a large class of zero mean finite variance symmetric error distributions. Hence the modified estimator has a √n rate which is higher than the n^{1/3} rate of the LMS estimator. The maximum bias function of the resulting estimator is not the same as that of the LMS estimator.

Theorem 7.7. Suppose that the algorithm uses Kn ≡ K randomly selected elemental starts (e.g. K = 500) with k concentration steps and the two additional attractors β̂OLS and bk,B. Assume that β̂LTS and β̂OLS are both consistent estimators of β.
i) Then the resulting CLTS estimator is a √n consistent HB estimator if β̂OLS is √n consistent, and the estimator is asymptotically equivalent to β̂OLS.
ii) Suppose that a HB criterion is used on the K + 2 attractors such that the resulting estimator is HB if a HB attractor is used. Also assume that the global minimizer of the HB criterion is a consistent estimator for β (e.g. LMS or LTA). The resulting HB estimator is asymptotically equivalent to the OLS estimator.

Proof. i) Theorems 7.? and 7.? show that a LTS concentration algorithm that uses a HB start is HB, and that bk,B is a HB biased estimator. The LTS estimator is consistent by Mašíček (2004). As n → ∞, consistent estimators β̂ satisfy QLTS(β̂)/n − QLTS(β)/n → 0 in probability. Since bk,B is a biased estimator of β, OLS will have a smaller criterion value with probability tending to one. With probability tending to one, OLS will also have a smaller criterion value than the criterion value of the attractor from a randomly drawn elemental set (by He and Portnoy 1992). Since K randomly chosen elemental sets are used, the CLTS estimator is asymptotically equivalent to OLS.

ii) As in the proof of i), the OLS estimator will minimize the criterion value with probability tending to one as n → ∞. □

Researchers are starting to use intelligently chosen trial fits. Maronna and Yohai (2015) used OLS and 500 elemental sets as the 501 trial fits to produce an FS estimator used as the initial estimator for an FMM estimator. Since the 501 trial fits are zero breakdown, so is the FS estimator. Since the FMM estimator has the same breakdown as the initial estimator, the FMM estimator is zero breakdown. For regression, they show that the FS estimator is consistent on a large class of zero mean finite variance symmetric distributions. The theory is very similar to that of Theorems 7.?, 7.?, and 7.?. Consistency follows since the elemental fits and OLS are unbiased estimators of βOLS, but an elemental fit is an OLS fit to p cases. Hence the elemental fits are very variable, and the probability that the OLS fit has a smaller S-estimator criterion than a randomly chosen elemental fit (or K randomly chosen elemental fits) goes to one as n → ∞. (OLS and the S-estimator are both √n consistent estimators of β, so the ratio of their criterion values goes to one, and the S-estimator minimizes the criterion value.) Hence the FMM estimator is asymptotically equivalent to the MM estimator that has the smallest criterion value for a large class of iid zero mean finite variance symmetric error distributions. This FMM estimator is asymptotically equivalent to the FMM estimator that uses OLS as the initial estimator. When the error distribution is skewed the S-estimator and OLS population constant are not the same, and the probability that an elemental fit is selected is close to one for a skewed error distribution as n → ∞. (The OLS estimator β̂ gets very close to βOLS while the elemental fits are highly variable unbiased estimators of βOLS, so one of the elemental fits is likely to have a constant that is closer to the S-estimator constant while still having good slope estimators.) Hence the FS estimator is inconsistent, and the FMM estimator is likely inconsistent for skewed distributions. No practical method is known for computing a √n consistent FS or FMM estimator that has the same breakdown and maximum bias function as the S or MM estimator that has the smallest S or MM criterion value.

Olive (2008, ch. 7, 8, 9, 12, and 13) contains much more information about 1D regression and resistant and robust regression. The least squares response and residual plots are very useful for detecting outliers. For more on the behavior of fits from randomly selected elemental sets, see Hawkins and Olive (2002) and Olive and Hawkins (2007a). For the hbreg estimator, see Olive and Hawkins (2011). More on the MBA LATA estimator can be found in Olive (2008, ch. 8).

For the dominant robust statistics paradigm, most of the problems for high breakdown multivariate location and dispersion, discussed in Section 4.9, are also problems for high breakdown regression: the impractical “brand name” estimators have at least O(n^p) complexity, while the practical estimators used in the software have not been shown to be both high breakdown and consistent. See Hawkins and Olive (2002), Hubert et al. (2002), and Maronna and Yohai (2002). Huber and Ronchetti (2009, pp. xiii, 8-9, 152-154, 196-197) suggested that high breakdown regression estimators do not provide an adequate remedy for the ill effects of outliers, that their statistical and computational properties are not adequately understood, that high breakdown estimators “break down for all except the smallest regression problems by failing to provide a timely answer!” and that “there are no known high breakdown point estimators of regression that are demonstrably stable.”

A large number of impractical high breakdown regression estimators have been proposed, including LTS, LMS, LTA, S, LQD, τ, constrained M, repeated median, cross checking, one step GM, one step GR, t-type, and regression depth estimators. See Rousseeuw and Leroy (1987) and Maronna et al. (2006). The practical algorithms used in the software use a brand name criterion to evaluate a fixed number of trial fits and should be denoted as an F-brand name estimator such as FLTS. Two stage estimators, such as the MM estimator, that need an initial consistent high breakdown estimator often have the same breakdown value and consistency rate as the initial estimator. These estimators are typically implemented with a zero breakdown inconsistent initial estimator and hence are zero breakdown with zero efficiency.

The practical estimators can be i) used in RR and FF plots for outlier detection, or ii) used as $\hat\beta_A$ to create an hbreg estimator that is asymptotically equivalent to least squares on a large class of symmetric error distributions.

Some of the theory for the impractical robust regression estimators is useful for determining when the robust estimator and OLS estimate the same β. Many regression estimators $\hat\beta$ satisfy
\[
\sqrt{n}(\hat\beta - \beta) \xrightarrow{D} N_p(0,\ V(\hat\beta, F)\, W) \qquad (7.28)
\]
when $X^T X / n \to W^{-1}$, and when the errors $e_i$ are iid with a cdf F and a unimodal pdf f that is symmetric with a unique maximum at 0. When the variance $V(e_i)$ exists,
\[
V(OLS, F) = V(e_i) = \sigma^2 \quad \text{while} \quad V(L_1, F) = \frac{1}{4[f(0)]^2}.
\]
See Bassett and Koenker (1978).
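A small simulation sketch (not from the text) illustrating these asymptotic variances: with double exponential errors, f(0) = 1/2, so V(L1, F) = 1 while V(OLS, F) = σ² = 2, and here XᵀX/n → I so the scaled slope variances approximate these values. The quantreg package is assumed to be installed.

```r
# Compare OLS and L1 slope estimators under double exponential errors.
library(quantreg)               # rq() fits the L1 (median) regression
set.seed(1)
nsim <- 500; n <- 200
bols <- bl1 <- numeric(nsim)
for (s in 1:nsim) {
  x <- rnorm(n)
  e <- rexp(n) - rexp(n)        # double exponential (Laplace) errors, variance 2
  y <- 1 + 2 * x + e
  bols[s] <- coef(lm(y ~ x))[2]
  bl1[s]  <- coef(rq(y ~ x, tau = 0.5))[2]
}
n * var(bols)   # approximately V(OLS, F) = 2
n * var(bl1)    # approximately V(L1, F) = 1/(4 f(0)^2) = 1
```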


Theorem 7.8. Under regularity conditions similar to those in Conjecture 7.1 below, a) the LMS(τ) estimator converges at a cubed root rate to a non-Gaussian limit. b) The estimator $\hat\beta_{LTS}$ satisfies Equation (7.?) and
\[
V(LTS(\tau), F) = \frac{\int_{F^{-1}(1/2-\tau/2)}^{F^{-1}(1/2+\tau/2)} w^2\, dF(w)}{[\tau - 2F^{-1}(1/2+\tau/2)\, f(F^{-1}(1/2+\tau/2))]^2}. \qquad (7.29)
\]
The proof of Theorem 7.8a is given in Kim and Pollard (1990). The proof of b) is given in Mašíček (2004), Čížek (2006), and Víšek (2006).

Conjecture 7.1. Let the iid errors $e_i$ have a cdf F that is continuous and strictly increasing on its interval support with a symmetric, unimodal, differentiable density f that strictly decreases as |x| increases on the support. Then the estimator $\hat\beta_{LTA}$ satisfies Equation (7.?) and
\[
V(LTA(\tau), F) = \frac{\tau}{4[f(0) - f(F^{-1}(1/2+\tau/2))]^2}. \qquad (7.30)
\]
See Tableman (1994b, p. 392) and Hössjer (1994).

Čížek (2008) showed that LTA is √n consistent, but did not prove that LTA is asymptotically normal. Assume Conjecture 7.1 is true for the following LTA remarks in this section. Then as τ → 1, the efficiency of LTS approaches that of OLS and the efficiency of LTA approaches that of L1. Hence for τ close to 1, LTA will be more efficient than LTS when the errors come from a distribution for which the sample median is more efficient than the sample mean (Koenker and Bassett, 1978). The results of Oosterhoff (1994) suggest that when τ = 0.5, LTA will be more efficient than LTS only for sharply peaked distributions such as the double exponential. To simplify computations for the asymptotic variance of LTS, we will use truncated random variables.

Lemma 7.13. Under the symmetry conditions given in Conjecture 7.1,
\[
V(LTS(\tau), F) = \frac{\tau\,\sigma^2_{TF}(-k,k)}{[\tau - 2kf(k)]^2} \qquad (7.31)
\]
and
\[
V(LTA(\tau), F) = \frac{\tau}{4[f(0) - f(k)]^2} \qquad (7.32)
\]
where
\[
k = F^{-1}(0.5 + \tau/2). \qquad (7.33)
\]

Proof. Let W have cdf F and pdf f. Suppose that W is symmetric about zero, and by symmetry, $k = F^{-1}(0.5 + \tau/2) = -F^{-1}(0.5 - \tau/2)$. If W has been truncated at a = −k and b = k, then the variance of the truncated random variable $W_T$ is
\[
V(W_T) = \sigma^2_{TF}(-k,k) = \frac{\int_{-k}^{k} w^2\, dF(w)}{F(k) - F(-k)}
\]
by Definition ?. Hence
\[
\int_{F^{-1}(1/2-\tau/2)}^{F^{-1}(1/2+\tau/2)} w^2\, dF(w) = \tau\,\sigma^2_{TF}(-k,k),
\]
and the result follows from the definition of k. □

This result is useful since formulas for the truncated variance have been given in Section ?. The following examples illustrate the result. See Hawkins and Olive (1999b). The mpack functions cltv, deltv, and nltv are useful for computing the asymptotic variance of the LTS and LTA estimators for the Cauchy, double exponential, and normal error distributions, as given in the following three examples. See Problems 7.6, 7.7, and 7.8.

Example 7.13: N(0,1) Errors. If $Y_T$ is a N(0, σ²) random variable truncated at a = −kσ and b = kσ, then
\[
V(Y_T) = \sigma^2\left[1 - \frac{2k\phi(k)}{2\Phi(k) - 1}\right].
\]
At the standard normal,
\[
V(LTS(\tau), \Phi) = \frac{1}{\tau - 2k\phi(k)} \qquad (7.34)
\]
while
\[
V(LTA(\tau), \Phi) = \frac{\tau}{4[\phi(0) - \phi(k)]^2} = \frac{2\pi\tau}{4[1 - \exp(-k^2/2)]^2} \qquad (7.35)
\]
where φ is the standard normal pdf and $k = \Phi^{-1}(0.5 + \tau/2)$. Thus for τ ≥ 1/2, LTS(τ) has breakdown value 1 − τ and Gaussian efficiency
\[
\frac{1}{V(LTS(\tau), \Phi)} = \tau - 2k\phi(k). \qquad (7.36)
\]
The 50% breakdown estimator LTS(0.5) has a Gaussian efficiency of 7.1%. If it is appropriate to reduce the amount of trimming, we can use the 25% breakdown estimator LTS(0.75), which has a much higher Gaussian efficiency of 27.6%, as reported in Ruppert (1992, p. 255). Also see the column labeled “Normal” in table 1 of Hössjer (1994).
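A minimal R sketch of Equations (7.34)–(7.36), a simplified stand-in for the mpack function nltv (the actual function's interface may differ):

```r
# Asymptotic variances of LTS(tau) and LTA(tau) at the N(0,1) error
# distribution, Equations (7.34) and (7.35), plus the Gaussian
# efficiency of LTS from Equation (7.36).
nltv_sketch <- function(tau) {
  k <- qnorm(0.5 + tau/2)                           # k = Phi^{-1}(0.5 + tau/2)
  vlts <- 1 / (tau - 2 * k * dnorm(k))              # Equation (7.34)
  vlta <- 2 * pi * tau / (4 * (1 - exp(-k^2/2))^2)  # Equation (7.35)
  c(VLTS = vlts, VLTA = vlta, effLTS = 1/vlts)
}
nltv_sketch(0.5)   # LTS(0.5) Gaussian efficiency is about 0.071
nltv_sketch(0.75)  # LTS(0.75) Gaussian efficiency is about 0.276
```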

Example 7.14: Double Exponential Errors. The double exponential (Laplace) distribution is interesting since the L1 estimator corresponds to maximum likelihood, and so L1 beats OLS, reversing the comparison of the normal case. For a double exponential DE(0,1) random variable,
\[
V(LTS(\tau), DE(0,1)) = \frac{2 - (2 + 2k + k^2)\exp(-k)}{[\tau - k\exp(-k)]^2}
\]
while
\[
V(LTA(\tau), DE(0,1)) = \frac{\tau}{4[0.5 - 0.5\exp(-k)]^2} = \frac{1}{\tau}
\]
where k = −log(1 − τ). Note that LTA(0.5) and OLS have the same asymptotic efficiency at the double exponential distribution. Also see Tableman (1994ab).

Example 7.15: Cauchy Errors. Although the L1 estimator and the trimmed estimators have finite asymptotic variance when the errors are Cauchy, the OLS estimator has infinite variance (because the Cauchy distribution has infinite variance). If $X_T$ is a Cauchy C(0,1) random variable symmetrically truncated at −k and k, then
\[
V(X_T) = \frac{k - \tan^{-1}(k)}{\tan^{-1}(k)}.
\]
Hence
\[
V(LTS(\tau), C(0,1)) = \frac{2k - \pi\tau}{\pi\left[\tau - \frac{2k}{\pi(1+k^2)}\right]^2}
\quad \text{and} \quad
V(LTA(\tau), C(0,1)) = \frac{\tau}{4\left[\frac{1}{\pi} - \frac{1}{\pi(1+k^2)}\right]^2}
\]
where k = tan(πτ/2). The LTA sampling variance converges to a finite value as τ → 1 while that of LTS increases without bound. LTS(0.5) is slightly more efficient than LTA(0.5), but LTA pulls ahead of LTS if the amount of trimming is very small.
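The double exponential and Cauchy formulas can be coded the same way; the sketch below gives simplified stand-ins for the mpack functions deltv and cltv (the actual functions may differ):

```r
# Asymptotic variances of LTS(tau) and LTA(tau) for DE(0,1) errors
deltv_sketch <- function(tau) {
  k <- -log(1 - tau)                     # k = F^{-1}(0.5 + tau/2) for DE(0,1)
  vlts <- (2 - (2 + 2*k + k^2) * exp(-k)) / (tau - k * exp(-k))^2
  vlta <- 1 / tau
  c(VLTS = vlts, VLTA = vlta)
}
# Asymptotic variances of LTS(tau) and LTA(tau) for Cauchy C(0,1) errors
cltv_sketch <- function(tau) {
  k <- tan(pi * tau / 2)                 # k = F^{-1}(0.5 + tau/2) for C(0,1)
  vlts <- (2*k - pi*tau) / (pi * (tau - 2*k / (pi*(1 + k^2)))^2)
  vlta <- tau / (4 * (1/pi - 1/(pi*(1 + k^2)))^2)
  c(VLTS = vlts, VLTA = vlta)
}
deltv_sketch(0.75); cltv_sketch(0.75)
```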

Outliers

The outlier detection methods of Section 1.3 are due to Olive (2017b, section 4.7). For competing outlier detection methods, see Boudt et al. (2017). Also, google “novelty detection,” “anomaly detection,” and “artefact identification.”

7.7 Problems

PROBLEMS WITH AN ASTERISK * ARE ESPECIALLY USEFUL.

7.1. (See Aldrin et al. 1993.) Suppose
\[
Y = m(\beta^T x) + e \qquad (7.37)
\]
where m is a possibly unknown function and the zero mean errors e are independent of the predictors. Let $z = \beta^T x$ and let $w = x - E(x)$. Let $\Sigma_{x,Y} = Cov(x, Y)$, and let $\Sigma_x = Cov(x) = Cov(w)$. Let $r = w - (\Sigma_x\beta)\beta^T w$.

a) Recall that $Cov(x, Y) = E[(x - E(x))(Y - E(Y))^T]$ and show that $\Sigma_{x,Y} = E(wY)$.

b) Show that $E(wY) = \Sigma_{x,Y} = E[(r + (\Sigma_x\beta)\beta^T w)\, m(z)] = E[m(z)r] + E[\beta^T w\, m(z)]\Sigma_x\beta$.

c) Using $\beta_{OLS} = \Sigma_x^{-1}\Sigma_{x,Y}$, show that $\beta_{OLS} = c(x)\beta + u(x)$ where the constant $c(x) = E[\beta^T(x - E(x))\, m(\beta^T x)]$ and the bias vector $u(x) = \Sigma_x^{-1}E[m(\beta^T x)r]$.

d) Show that $E(wz) = \Sigma_x\beta$. (Hint: use $E(wz) = E[(x - E(x))x^T\beta] = E[(x - E(x))(x^T - E(x^T) + E(x^T))\beta]$.)

e) Assume m(z) = z. Using d), show that c(x) = 1 if $\beta^T\Sigma_x\beta = 1$.

f) Assume that $\beta^T\Sigma_x\beta = 1$. Show that E(zr) = E(rz) = 0. (Hint: find E(rz) and use d).)

g) Suppose that $\beta^T\Sigma_x\beta = 1$ and that the distribution of x is multivariate normal. Then the joint distribution of z and r is multivariate normal. Using the fact that E(zr) = 0, show Cov(r, z) = 0 so that z and r are independent. Then show that u(x) = 0.

(Note: the assumption $\beta^T\Sigma_x\beta = 1$ can be made without loss of generality since if $\beta^T\Sigma_x\beta = d^2 > 0$ (assuming $\Sigma_x$ is positive definite), then $Y = m(d(\beta/d)^T x) + e \equiv m_d(\eta^T x) + e$ where $m_d(u) = m(du)$, $\eta = \beta/d$, and $\eta^T\Sigma_x\eta = 1$.)

7.2. Referring to Definition 7.4, let $\hat Y_{i,j} = x_i^T\hat\beta_j = \hat Y_i(\hat\beta_j)$ and let $r_{i,j} = r_i(\hat\beta_j)$. Show that $\|r_{i,1} - r_{i,2}\| = \|\hat Y_{i,1} - \hat Y_{i,2}\|$.

7.3. Assume that the model has a constant β1 so that the first column of X is 1. Show that if the regression estimator is regression equivariant, then adding 1 to Y changes the estimated constant $\hat\beta_1$ but does not change the estimated slopes $\hat\beta_2, ..., \hat\beta_p$.

R Problems

Use the command source(“G:/mpack.txt”) to download the functions and the command source(“G:/mrobdata.txt”) to download the data. See Preface or Section 11.1. Typing the name of the mpack function, e.g. trviews, will display the code for the function. Use the args command, e.g. args(trviews), to display the needed arguments for the function. For some of the following problems, the R commands can be copied and pasted from (http://lagrange.math.siu.edu/Olive/mrsashw.txt) into R.

7.4. Paste the command for this problem into R to produce the second column of Table 7.1. Include the output in Word.

7.5. a) To get an idea for the amount of contamination a basic resampling or concentration algorithm can tolerate, enter or download the gamper function (with the source(“G:/mpack.txt”) command) that evaluates Equation (7.?) at different values of h = p.

b) Next enter the following commands and include the output in Word.

zh <- c(10,20,30,40,50,60,70,80,90,100)
for(i in 1:10) gamper(zh[i])

The “asymptotic variance” for LTA in Problems 7.6, 7.7, and 7.8 is actually the conjectured asymptotic variance for LTA if the multiple linear regression model is used instead of the location model. See Section 7.?.

7.6. a) Download the R function nltv that computes the asymptotic variance of the LTS and LTA estimators if the errors are N(0,1).

b) Enter the commands nltv(0.5), nltv(0.75), nltv(0.9), and nltv(0.9999). Write a table to compare the asymptotic variance of LTS and LTA at these coverages. Does one estimator always have a smaller asymptotic variance?

7.7. a) Download the R function deltv that computes the asymptotic variance of the LTS and LTA estimators if the errors are double exponential DE(0,1).

b) Enter the commands deltv(0.5), deltv(0.75), deltv(0.9), and deltv(0.9999). Write a table to compare the asymptotic variance of LTS and LTA at these coverages. Does one estimator always have a smaller asymptotic variance?

7.8. a) Download the R function cltv that computes the asymptotic variance of the LTS and LTA estimators if the errors are Cauchy C(0,1).

b) Enter the commands cltv(0.5), cltv(0.75), cltv(0.9), and cltv(0.9999). Write a table to compare the asymptotic variance of LTS and LTA at these coverages. Does one estimator always have a smaller asymptotic variance?

7.9∗. a) If necessary, use the commands source(“G:/mpack.txt”) and source(“G:/mrobdata.txt”).

b) Enter the command mbamv(belx,bely) in R. Click on the rightmost mouse button (and in R, click on Stop). You need to do this 7 times before the program ends. There is one predictor x and one response Y. The function makes a scatterplot of x and Y, and cases that get weight one are shown as highlighted squares. Each MBA sphere covers half of the data. When you find a good fit to the bulk of the data, hold down the Ctrl and c keys to make a copy of the plot. Then paste the plot in Word.

c) Enter the command mbamv2(buxx,buxy) in R. Click on the rightmost mouse button (and in R, click on Stop). You need to do this 14 times before the program ends. There are four predictors x1, ..., x4 and one response Y. The function makes the response and residual plots based on the OLS fit to the highlighted cases. Each MBA sphere covers half of the data. When you find a good fit to the bulk of the data, hold down the Ctrl and c keys to make a copy of the two plots. Then paste the plots in Word.

7.10. This problem compares the MBA estimator that uses the median squared residual MED(rᵢ²) criterion with the MBA estimator that uses the LATA criterion. On clean data, both estimators are √n consistent since both use 50 √n consistent OLS estimators. The MED(rᵢ²) criterion has trouble with data sets where the multiple linear regression relationship is weak and there is a cluster of outliers. The LATA criterion tries to give all x-outliers, including good leverage points, zero weight.

a) If necessary, use the commands source(“G:/mpack.txt”) and source(“G:/mrobdata.txt”). The mlrplot2 function is used to compute both MBA estimators. Use the rightmost mouse button to advance the plot (and in R, highlight stop).

b) Use the command mlrplot2(belx,bely) and include the resulting plot in Word. Is one estimator better than the other, or are they about the same?

c) Use the command mlrplot2(cbrainx,cbrainy) and include the resulting plot in Word. Is one estimator better than the other, or are they about the same? (The infants are likely good leverage cases instead of outliers.)

d) Use the command mlrplot2(museum[,3:11],museum[,2]) and include the resulting plot in Word. For this data set, most of the cases are based on humans but a few are based on apes. The MBA LATA estimator will often give the cases corresponding to apes larger absolute residuals than the MBA estimator based on MED(rᵢ²), but the apes appear to be good leverage cases.

e) Use the command mlrplot2(buxx,buxy) until the outliers are clustered about the identity line in one of the two response plots. (This will usually happen within 10 or fewer runs. Pressing the “up arrow” will bring the previous command to the screen and save typing.) Then include the resulting plot in Word. Which estimator went through the outliers and which one gave zero weight to the outliers?

f) Use the command mlrplot2(hx,hy) several times. Usually both MBA estimators fail to find the outliers for this artificial Hawkins data set that is also analyzed by Atkinson and Riani (2000, section 3.1). The lmsreg estimator can be used to find the outliers. In R use the commands library(MASS) and ffplot2(hx,hy). Include the resulting plot in Word.

7.11. Use the following R commands to make 100 N₃(0, I₃) cases and 100 trivariate non-EC cases.

n3x <- matrix(rnorm(300),nrow=100,ncol=3)
ln3x <- exp(n3x)

In R, type the command library(MASS).

a) Using the commands pairs(n3x) and pairs(ln3x), include both scatterplot matrices in Word. (Click on the plot and hit Ctrl and c at the same time. Then go to file in the Word menu and select paste.) Are strong nonlinearities present among the MVN predictors? How about the non-EC predictors? (Hint: a box or ball shaped plot is linear.)

b) Make a single index model and the sufficient summary plot with the following commands

ncy <- (n3x%*%1:3)^3 + 0.1*rnorm(100)
plot(n3x%*%(1:3),ncy)

and include the plot in Word.

c) The command trviews(n3x, ncy) will produce ten plots. To advance the plots, click on the rightmost mouse button and highlight Stop to advance to the next plot. The last plot is the OLS view. Include this plot in Word.

d) After all 10 plots have been viewed, the output will show 10 estimated predictors. The last estimate is the OLS (least squares) view and might look like the following output.

Intercept X1 X2 X3
4.417988 22.468779 61.242178 75.284664

If the OLS view is a good estimated sufficient summary plot, then the plot created from the command (leave out the intercept)

plot(n3x%*%c(22.469,61.242,75.285),n3x%*%1:3)

should cluster tightly about some line. Your linear combination will be different than the one used above. Using your OLS view, include the plot using the command above (but with your linear combination) in Word. Was this plot linear? Did some of the other trimmed views seem to be better than the OLS view, that is, did one of the trimmed views seem to have a smooth mean function with a smaller variance function than the OLS view?

e) Now type the R command

lncy <- (ln3x%*%1:3)^3 + 0.1*rnorm(100)

Use the command trviews(ln3x,lncy) to find the best view with a smooth mean function and the smallest variance function. This view should not be the OLS view. Include your best view in Word.

f) Get the linear combination from your view, say (94.848, 216.719, 328.444)ᵀ, and obtain a plot with the command

plot(ln3x%*%c(94.848,216.719,328.444),ln3x%*%1:3)

Include the plot in Word. If the plot is linear with high correlation, then your response plot in e) should be good.

7.12. At the beginning of your R session, use the source(“G:/mpack.txt”) command and library(MASS).

a) Perform the commands

nx <- matrix(rnorm(300),nrow=100,ncol=3)
lnx <- exp(nx)
SP <- lnx%*%1:3
lnsincy <- sin(SP)/SP + 0.01*rnorm(100)

For parts b), c), and d) below, to make the best trimmed view with trviews, ctrviews, or lmsviews, you may need to use the function twice. The first view trims 90% of the data, the next view trims 80%, etc. The last view trims 0% and is the OLS view (or lmsreg view). Remember to advance the view with the rightmost mouse button and highlight “Stop.” Then click on the plot and next simultaneously hit Ctrl and c. This makes a copy of the plot. Then in Word, use the menu command “Paste.”

b) Find the best trimmed view with OLS and covfch with the following commands and include the view in Word.

trviews(lnx,lnsincy)

(With trviews, suppose that 40% trimming gave the best view. Then instead of using the procedure above b), you can use the command

essp(lnx,lnsincy,M=40)

to make the best trimmed view. Then click on the plot and next simultaneously hit Ctrl and c. This makes a copy of the plot. Then in Word, use the menu command “Paste.” Click the rightmost mouse button and highlight “Stop” to return to the command prompt.)

c) Find the best trimmed view with OLS and (x̄, S) using the following commands and include the view in Word. See the paragraph above b).

ctrviews(lnx,lnsincy)

d) Find the best trimmed view with lmsreg and cov.mcd using the following commands and include the view in Word. See the paragraph above b).

lmsviews(lnx,lnsincy)

e) Which method or methods gave the best response plot? Explain briefly.

7.13. Warning: this problem may take too much time, but makes a good project. This problem is like Problem 7.12 but uses many more single index models.

a) Make some prototype functions with the following commands.

nx <- matrix(rnorm(300),nrow=100,ncol=3)
SP <- nx%*%1:3
ncuby <- SP^3 + rnorm(100)
nexpy <- exp(SP) + rnorm(100)
nlinsy <- SP + 4*sin(SP) + 0.1*rnorm(100)
nsincy <- sin(SP)/SP + 0.01*rnorm(100)
nsiny <- sin(SP) + 0.1*rnorm(100)
nsqrty <- sqrt(abs(SP)) + 0.1*rnorm(100)
nsqy <- SP^2 + rnorm(100)

b) Make sufficient summary plots similar to Figures 7.? and 7.? with the following commands and include both plots in Word.

plot(SP,ncuby)
plot(-SP,ncuby)

c) Find the best trimmed view with the following commands (in R, first type library(MASS)). Include the view in Word.

trviews(nx,ncuby)

You may need to use the function twice. The first view trims 90% of the data, the next view trims 80%, etc. The last view trims 0% and is the OLS view. Remember to advance the view with the rightmost mouse button and highlight “Stop.” Suppose that 40% trimming gave the best view. Then use the command

essp(nx,ncuby, M=40)

to make the best trimmed view. Then click on the plot and next simultaneously hit Ctrl and c. This makes a copy of the plot. Then in Word, use the menu command “Paste.”

d) To make a plot like Figure 7.?, use the following commands. Let tem = β̂ obtained from the trviews output. In Example 7.? (continued), tem can be obtained with the following command.

tem <- c(12.60514, 25.06613, 37.25504)

Include the plot in Word.

ESP <- nx%*%tem
plot(ESP,SP)

e) Repeat b), c), and d) with the following data sets.
i) nx and nexpy
ii) nx and nlinsy
iii) nx and nsincy
iv) nx and nsiny
v) nx and nsqrty
vi) nx and nsqy

Enter the following commands to do parts vii) to x).

lnx <- exp(nx)
SP <- lnx%*%1:3
lncuby <- (SP/3)^3 + rnorm(100)
lnlinsy <- SP + 10*sin(SP) + 0.1*rnorm(100)
lnsincy <- sin(SP)/SP + 0.01*rnorm(100)
lnsiny <- sin(SP/3) + 0.1*rnorm(100)
ESP <- lnx%*%tem

vii) lnx and lncuby
viii) lnx and lnlinsy
ix) lnx and lnsincy
x) lnx and lnsiny

7.14. Warning: this problem may take too much time, but makes a good project. Repeat Problem 7.13 but replace trviews with a) lmsviews, b) symviews (that creates views that sometimes work even when symmetry is present), and c) ctrviews.

Except for part a), the essp command will not work. Instead, for the best trimmed view, click on the plot and next simultaneously hit Ctrl and c. This makes a copy of the plot. Then in Word, use the menu command “Paste.”

7.15. a) In addition to the source(“G:/mpack.txt”) command, also use the source(“G:/mrobdata.txt”) command, and type the library(MASS) command.

b) Type the command tvreg(buxx,buxy,ii=1). Click the rightmost mouse button and highlight Stop. The response plot should appear. Repeat 10 times and remember which plot percentage M (say M = 0) had the best response plot. Then type the command tvreg2(buxx,buxy, M = 0) (except use your value of M, not 0). Again, click the rightmost mouse button (and in R, highlight Stop). The response plot should appear. Hold down the Ctrl and c keys to make a copy of the plot. Then paste the plot in Word.

c) The estimated coefficients $\hat\beta_{TV}$ from the best plot should have appeared on the screen. Copy and paste these coefficients into Word.

7.16. a) After entering the two source commands above Problem 7.4, enter the following command.

MLRplot(buxx,buxy)

Click the rightmost mouse button (and in R click on Stop). The response plot should appear. Again, click the rightmost mouse button (and in R click on Stop). The residual plot should appear. Hold down the Ctrl and c keys to make a copy of the two plots. Then paste the plots in Word.

b) The response variable is height, but 5 cases were recorded with heights of about 0.75 inches. The highlighted squares in the two plots correspond to cases with large Cook’s distances. With respect to the Cook’s distances, what is happening, swamping or masking?

c) RR plots: One feature of the MBA estimator is that it depends on the sample of 7 centers drawn and changes each time the function is called. In ten runs, about seven plots will look like Figure 7.?, but in about three plots the MBA estimator will also pass through the outliers. Make the RR plot by pasting the commands for this problem into R, and include the plot in Word.

d) FF plots: the plots in the top row will cluster about the identity line if the MLR model is good or if the fit passes through the outliers. Make the FF plot by pasting the commands for this problem into R, and include the plot in Word.


Chapter 8

Multivariate Linear Regression

This chapter will show that multivariate linear regression with m ≥ 2 response variables is nearly as easy to use, at least if m is small, as multiple linear regression, which has 1 response variable. For multivariate linear regression, at least one predictor variable is quantitative. Plots for checking the model, including outlier detection, are given. Prediction regions that are robust to nonnormality are developed. For hypothesis testing, it is shown that the Wilks’ lambda statistic, Hotelling Lawley trace statistic, and Pillai’s trace statistic are robust to nonnormality.

8.1 Introduction

Definition 8.1. The response variables are the variables that you want to predict. The predictor variables are the variables used to predict the response variables.

Definition 8.2. The multivariate linear regression model
\[
y_i = B^T x_i + \epsilon_i
\]
for i = 1, ..., n has m ≥ 2 response variables $Y_1, ..., Y_m$ and p predictor variables $x_1, x_2, ..., x_p$ where $x_1 \equiv 1$ is the trivial predictor. The ith case is $(x_i^T, y_i^T) = (1, x_{i2}, ..., x_{ip}, Y_{i1}, ..., Y_{im})$ where the 1 could be omitted. The model is written in matrix form as Z = XB + E where the matrices are defined below. The model has $E(\epsilon_k) = 0$ and $Cov(\epsilon_k) = \Sigma_\epsilon = (\sigma_{ij})$ for k = 1, ..., n. Then the p × m coefficient matrix $B = [\beta_1\ \beta_2\ \ldots\ \beta_m]$ and the m × m covariance matrix $\Sigma_\epsilon$ are to be estimated, and E(Z) = XB while $E(Y_{ij}) = x_i^T\beta_j$. The $\epsilon_i$ are assumed to be iid. Multiple linear regression corresponds to m = 1 response variable, and is written in matrix form as Y = Xβ + e. Subscripts are needed for the m multiple linear regression models $Y_j = X\beta_j + e_j$ for j = 1, ..., m where $E(e_j) = 0$. For the multivariate linear regression model, $Cov(e_i, e_j) = \sigma_{ij} I_n$ for i, j = 1, ..., m where $I_n$ is the n × n identity matrix.

Notation. The multiple linear regression model uses m = 1. The multivariate linear model $y_i = B^T x_i + \epsilon_i$ for i = 1, ..., n has m ≥ 2, and multivariate linear regression and MANOVA models are special cases. See Definition 10.2. This chapter will use $x_1 \equiv 1$ for the multivariate linear regression model. The multivariate location and dispersion model is the special case where X = 1 and p = 1.

The data matrix W = [X Z] except usually the first column 1 of X is omitted for software. The n × m matrix
\[
Z = \begin{bmatrix}
Y_{1,1} & Y_{1,2} & \ldots & Y_{1,m} \\
Y_{2,1} & Y_{2,2} & \ldots & Y_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
Y_{n,1} & Y_{n,2} & \ldots & Y_{n,m}
\end{bmatrix}
= \begin{bmatrix} Y_1 & Y_2 & \ldots & Y_m \end{bmatrix}
= \begin{bmatrix} y_1^T \\ \vdots \\ y_n^T \end{bmatrix}.
\]
The n × p design matrix of predictor variables is
\[
X = \begin{bmatrix}
x_{1,1} & x_{1,2} & \ldots & x_{1,p} \\
x_{2,1} & x_{2,2} & \ldots & x_{2,p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n,1} & x_{n,2} & \ldots & x_{n,p}
\end{bmatrix}
= \begin{bmatrix} v_1 & v_2 & \ldots & v_p \end{bmatrix}
= \begin{bmatrix} x_1^T \\ \vdots \\ x_n^T \end{bmatrix}
\]
where $v_1 = 1$. The p × m matrix
\[
B = \begin{bmatrix}
\beta_{1,1} & \beta_{1,2} & \ldots & \beta_{1,m} \\
\beta_{2,1} & \beta_{2,2} & \ldots & \beta_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
\beta_{p,1} & \beta_{p,2} & \ldots & \beta_{p,m}
\end{bmatrix}
= \begin{bmatrix} \beta_1 & \beta_2 & \ldots & \beta_m \end{bmatrix}.
\]
The n × m matrix
\[
E = \begin{bmatrix}
\epsilon_{1,1} & \epsilon_{1,2} & \ldots & \epsilon_{1,m} \\
\epsilon_{2,1} & \epsilon_{2,2} & \ldots & \epsilon_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
\epsilon_{n,1} & \epsilon_{n,2} & \ldots & \epsilon_{n,m}
\end{bmatrix}
= \begin{bmatrix} e_1 & e_2 & \ldots & e_m \end{bmatrix}
= \begin{bmatrix} \epsilon_1^T \\ \vdots \\ \epsilon_n^T \end{bmatrix}.
\]
Considering the ith row of Z, X, and E shows that $y_i^T = x_i^T B + \epsilon_i^T$.

Warning: The ei are error vectors, not orthonormal eigenvectors.

Definition 8.3. In the multiple linear regression model, m = 1 and
\[
Y_i = x_{i,1}\beta_1 + x_{i,2}\beta_2 + \cdots + x_{i,p}\beta_p + e_i = x_i^T\beta + e_i \qquad (8.1)
\]
for i = 1, ..., n. In matrix notation, these n equations become
\[
Y = X\beta + e, \qquad (8.2)
\]
where Y is an n × 1 vector of response variables, X is an n × p matrix of predictors, β is a p × 1 vector of unknown coefficients, and e is an n × 1 vector of unknown errors. Equivalently,
\[
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
= \begin{bmatrix}
x_{1,1} & x_{1,2} & \ldots & x_{1,p} \\
x_{2,1} & x_{2,2} & \ldots & x_{2,p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n,1} & x_{n,2} & \ldots & x_{n,p}
\end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}
+ \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}. \qquad (8.3)
\]
The $e_i$ are iid with zero mean and variance σ², and multiple linear regression is used to estimate the unknown parameters β and σ².

Each response variable in a multivariate linear regression model follows a multiple linear regression model $Y_j = X\beta_j + e_j$ for j = 1, ..., m where it is assumed that $E(e_j) = 0$ and $Cov(e_j) = \sigma_{jj} I_n$. Hence the errors corresponding to the jth response are uncorrelated with variance $\sigma_j^2 = \sigma_{jj}$. Notice that the same design matrix X of predictors is used for each of the m models, but the jth response variable vector $Y_j$, coefficient vector $\beta_j$, and error vector $e_j$ change and thus depend on j.

Now consider the ith case $(x_i^T, y_i^T)$, which corresponds to the ith row of Z and the ith row of X. Then
\[
\begin{aligned}
Y_{i1} &= \beta_{11}x_{i1} + \cdots + \beta_{p1}x_{ip} + \epsilon_{i1} = x_i^T\beta_1 + \epsilon_{i1} \\
Y_{i2} &= \beta_{12}x_{i1} + \cdots + \beta_{p2}x_{ip} + \epsilon_{i2} = x_i^T\beta_2 + \epsilon_{i2} \\
&\ \vdots \\
Y_{im} &= \beta_{1m}x_{i1} + \cdots + \beta_{pm}x_{ip} + \epsilon_{im} = x_i^T\beta_m + \epsilon_{im}
\end{aligned}
\]
or $y_i = \mu_{x_i} + \epsilon_i = E(y_i) + \epsilon_i$ where
\[
E(y_i) = \mu_{x_i} = B^T x_i = \begin{bmatrix} x_i^T\beta_1 \\ x_i^T\beta_2 \\ \vdots \\ x_i^T\beta_m \end{bmatrix}.
\]
The notation $y_i|x_i$ and $E(y_i|x_i)$ is more accurate, but usually the conditioning is suppressed. Taking $\mu_{x_i}$ to be a constant (or conditioning on $x_i$ if the predictor variables are random variables), $y_i$ and $\epsilon_i$ have the same covariance matrix. In the multivariate regression model, this covariance matrix $\Sigma_\epsilon$ does not depend on i. Observations from different cases are uncorrelated (often independent), but the m errors for the m different response variables for the same case are correlated. If X is a random matrix, then assume X and E are independent and that expectations are conditional on X.

Example 8.1. Suppose it is desired to predict the response variables Y1 = height and Y2 = height at shoulder of a person from partial skeletal remains. A model for prediction can be built from nearly complete skeletons or from living humans, depending on the population of interest (e.g. ancient Egyptians or modern US citizens). The predictor variables might be x1 ≡ 1, x2 = femur length, and x3 = ulna length. The two heights of individuals with x2 = 200mm and x3 = 140mm should be shorter on average than the two heights of individuals with x2 = 500mm and x3 = 350mm. In this example Y1, Y2, x2, and x3 are quantitative variables. If x4 = gender is a predictor variable, then gender (coded as male = 1 and female = 0) is qualitative.

Definition 8.4. Least squares is the classical method for fitting multivariate linear regression. The least squares estimators are
\[
\hat B = (X^T X)^{-1}X^T Z = \begin{bmatrix} \hat\beta_1 & \hat\beta_2 & \ldots & \hat\beta_m \end{bmatrix}.
\]
The predicted values or fitted values are
\[
\hat Z = X\hat B = \begin{bmatrix} \hat Y_1 & \hat Y_2 & \ldots & \hat Y_m \end{bmatrix}
= \begin{bmatrix}
\hat Y_{1,1} & \hat Y_{1,2} & \ldots & \hat Y_{1,m} \\
\hat Y_{2,1} & \hat Y_{2,2} & \ldots & \hat Y_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
\hat Y_{n,1} & \hat Y_{n,2} & \ldots & \hat Y_{n,m}
\end{bmatrix}.
\]
The residuals are
\[
\hat E = Z - \hat Z = Z - X\hat B = \begin{bmatrix} \hat\epsilon_1^T \\ \hat\epsilon_2^T \\ \vdots \\ \hat\epsilon_n^T \end{bmatrix}
= \begin{bmatrix} r_1 & r_2 & \ldots & r_m \end{bmatrix}
= \begin{bmatrix}
\hat\epsilon_{1,1} & \hat\epsilon_{1,2} & \ldots & \hat\epsilon_{1,m} \\
\hat\epsilon_{2,1} & \hat\epsilon_{2,2} & \ldots & \hat\epsilon_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
\hat\epsilon_{n,1} & \hat\epsilon_{n,2} & \ldots & \hat\epsilon_{n,m}
\end{bmatrix}.
\]
These quantities can be found from the m multiple linear regressions of $Y_j$ on the predictors: $\hat\beta_j = (X^T X)^{-1}X^T Y_j$, $\hat Y_j = X\hat\beta_j$, and $r_j = Y_j - \hat Y_j$ for j = 1, ..., m. Hence $\hat\epsilon_{i,j} = Y_{i,j} - \hat Y_{i,j}$ where $\hat Y_j = (\hat Y_{1,j}, ..., \hat Y_{n,j})^T$. Finally,
\[
\hat\Sigma_{\epsilon,d} = \frac{(Z - \hat Z)^T(Z - \hat Z)}{n - d} = \frac{(Z - X\hat B)^T(Z - X\hat B)}{n - d} = \frac{\hat E^T\hat E}{n - d} = \frac{1}{n - d}\sum_{i=1}^n \hat\epsilon_i\hat\epsilon_i^T.
\]
The choices d = 0 and d = p are common. If d = 1, then $\hat\Sigma_{\epsilon,d=1} = S_r$, the sample covariance matrix of the residual vectors $\hat\epsilon_i$, since the sample mean of the $\hat\epsilon_i$ is 0. Let $\hat\Sigma_\epsilon = \hat\Sigma_{\epsilon,p}$ be the unbiased estimator of $\Sigma_\epsilon$. Also,
\[
\hat\Sigma_{\epsilon,d} = (n - d)^{-1}Z^T[I - X(X^TX)^{-1}X^T]Z,
\quad \text{and} \quad
\hat E = [I - X(X^TX)^{-1}X^T]Z.
\]
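A minimal R sketch of Definition 8.4 on simulated data (the data and variable names are illustrative, not taken from the text's mpack functions):

```r
# Least squares for multivariate linear regression, Definition 8.4.
set.seed(1)
n <- 100; p <- 3; m <- 2
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # first column is 1
B <- matrix(c(1, 2, -1, 0, 1, 3), nrow = p, ncol = m) # true coefficients
E <- matrix(rnorm(n * m), n, m)                       # iid error vectors
Z <- X %*% B + E                                      # n x m response matrix
Bhat <- solve(t(X) %*% X, t(X) %*% Z)                 # (X^T X)^{-1} X^T Z
Zhat <- X %*% Bhat                                    # fitted values
Ehat <- Z - Zhat                                      # residual vectors
Sigma_hat <- t(Ehat) %*% Ehat / (n - p)               # estimator with d = p
# The same Bhat comes from the m separate multiple linear regressions:
all.equal(Bhat, coef(lm(Z ~ X[, -1])), check.attributes = FALSE)
```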

The following two theorems show that the least squares estimators are fairly good. Also see Theorem 8.7 in Section 8.4. Theorem 8.2 can also be used for $\hat\Sigma_{\epsilon,d} = \frac{n-1}{n-d}\,S_r$.

Theorem 8.1, Johnson and Wichern (1988, p. 304): Suppose X has full rank p < n and the covariance structure of Definition 8.2 holds. Then $E(\hat B) = B$ so $E(\hat\beta_j) = \beta_j$, and $Cov(\hat\beta_j, \hat\beta_k) = \sigma_{jk}(X^TX)^{-1}$ for j, k = 1, ..., m. Also $\hat E$ and $\hat B$ are uncorrelated, $E(\hat E) = 0$, and
\[
E(\hat\Sigma_\epsilon) = E\left(\frac{\hat E^T\hat E}{n - p}\right) = \Sigma_\epsilon.
\]

Theorem 8.2. $S_r = \Sigma_\epsilon + O_P(n^{-1/2})$ and $\frac{1}{n}\sum_{i=1}^n \hat\epsilon_i\hat\epsilon_i^T = \Sigma_\epsilon + O_P(n^{-1/2})$ if the following three conditions hold: $\hat B - B = O_P(n^{-1/2})$, $\frac{1}{n}\sum_{i=1}^n \epsilon_i x_i^T = O_P(1)$, and $\frac{1}{n}\sum_{i=1}^n x_i x_i^T = O_P(n^{1/2})$.

Proof. Note that $y_i = B^T x_i + \epsilon_i = \hat B^T x_i + \hat\epsilon_i$. Hence $\hat\epsilon_i = (B - \hat B)^T x_i + \epsilon_i$. Thus
\[
\sum_{i=1}^n \hat\epsilon_i\hat\epsilon_i^T = \sum_{i=1}^n (\hat\epsilon_i - \epsilon_i + \epsilon_i)(\hat\epsilon_i - \epsilon_i + \epsilon_i)^T
= \sum_{i=1}^n [\epsilon_i\epsilon_i^T + \epsilon_i(\hat\epsilon_i - \epsilon_i)^T + (\hat\epsilon_i - \epsilon_i)\epsilon_i^T + (\hat\epsilon_i - \epsilon_i)(\hat\epsilon_i - \epsilon_i)^T]
\]
\[
= \sum_{i=1}^n \epsilon_i\epsilon_i^T + \left(\sum_{i=1}^n \epsilon_i x_i^T\right)(B - \hat B) + (B - \hat B)^T\left(\sum_{i=1}^n x_i\epsilon_i^T\right)
+ (B - \hat B)^T\left(\sum_{i=1}^n x_i x_i^T\right)(B - \hat B).
\]
Thus
\[
\frac{1}{n}\sum_{i=1}^n \hat\epsilon_i\hat\epsilon_i^T = \frac{1}{n}\sum_{i=1}^n \epsilon_i\epsilon_i^T
+ O_P(1)O_P(n^{-1/2}) + O_P(n^{-1/2})O_P(1) + O_P(n^{-1/2})O_P(n^{1/2})O_P(n^{-1/2}),
\]
and the result follows since $\frac{1}{n}\sum_{i=1}^n \epsilon_i\epsilon_i^T = \Sigma_\epsilon + O_P(n^{-1/2})$ and
\[
S_r = \frac{n}{n-1}\;\frac{1}{n}\sum_{i=1}^n \hat\epsilon_i\hat\epsilon_i^T. \qquad \square
\]


$S_r$ and $\hat\Sigma_\epsilon$ are also √n consistent estimators of $\Sigma_\epsilon$ by Su and Cook (2012, p. 692). See Theorem 8.7.

8.2 Plots for the Multivariate Linear Regression Model

As in Chapter 10, this section suggests using residual plots, response plots, and the DD plot to examine the multivariate linear model. The DD plot is used to examine the distribution of the iid error vectors. The residual plots are often used to check for lack of fit of the multivariate linear model. The response plots are used to check linearity and to detect influential cases for the linearity assumption. The response and residual plots are used exactly as in the m = 1 case corresponding to multiple linear regression and experimental design models. See Olive (2010, 2017ac), Olive et al. (2015), Olive and Hawkins (2005), and Cook and Weisberg (1999a, p. 432; 1999b). Review Remark 10.2, which also applies to multivariate linear regression.

Notation. Plots will be used to simplify the regression analysis, and in this text a plot of W versus Z uses W on the horizontal axis and Z on the vertical axis.

Definition 8.5. A response plot for the jth response variable is a plot of the fitted values $\hat Y_{ij}$ versus the response $Y_{ij}$. The identity line with slope one and zero intercept is added to the plot as a visual aid. A residual plot corresponding to the jth response variable is a plot of $\hat Y_{ij}$ versus $r_{ij}$.

Remark 8.1. Make the m response and residual plots for any multivariate linear regression. In a response plot, the vertical deviations from the identity line are the residuals $r_{ij} = Y_{ij} - \hat Y_{ij}$. Suppose the model is good, the jth error distribution is unimodal and not highly skewed for j = 1, ..., m, and n ≥ 10p. Then the plotted points should cluster about the identity line in each of the m response plots. If outliers are present or if the plot is not linear, then the current model or data need to be transformed or corrected. If the model is good, then each of the m residual plots should be ellipsoidal with no trend and should be centered about the r = 0 line. There should not be any pattern in the residual plot: as a narrow vertical strip is moved from left to right, the behavior of the residuals within the strip should show little change. Outliers and patterns such as curvature or a fan shaped plot are bad.
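A minimal R sketch of the m response and residual plots of Definition 8.5, reusing Z, Zhat, and Ehat from the least squares sketch above (the text's mpack plotting functions may differ):

```r
# Response and residual plots for each of the m response variables.
m <- ncol(Z)
op <- par(mfrow = c(m, 2))
for (j in 1:m) {
  plot(Zhat[, j], Z[, j], xlab = "fitted", ylab = "response",
       main = paste("Response plot, Y", j))
  abline(0, 1)                              # identity line
  plot(Zhat[, j], Ehat[, j], xlab = "fitted", ylab = "residual",
       main = paste("Residual plot, Y", j))
  abline(h = 0)                             # r = 0 line
}
par(op)
```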

Rule of thumb 8.1. Use multivariate linear regression if
\[
n \geq \max((m + p)^2,\ mp + 30,\ 10p)
\]
provided that the m response and residual plots all look good. Make the DD plot of the $\hat\epsilon_i$. If a residual plot would look good after several points have been deleted, and if these deleted points were not gross outliers (points far from the point cloud formed by the bulk of the data), then the residual plot is probably good. Beginners often find too many things wrong with a good model. For practice, use the computer to generate several multivariate linear regression data sets, and make the m response and residual plots for these data sets. This exercise will help show that the plots can have considerable variability even when the multivariate linear regression model is good. The mpack function MLRsim simulates response and residual plots for various distributions when m = 1.

Rule of thumb 8.2. If the plotted points in the residual plot look like a left or right opening megaphone, the first model violation to check is nonconstant variance. (This is a rule of thumb because it is possible that such a residual plot results from another model violation such as nonlinearity, but nonconstant variance is much more common.)

Remark 8.2. Residual plots magnify departures from the model while the response plots emphasize how well the multivariate linear regression model fits the data.

Definition 8.6. An RR plot is a scatterplot matrix of the m sets of residuals $r_1, ..., r_m$.

Definition 8.7. An FF plot is a scatterplot matrix of the m sets of fitted values of the response variables $\hat Y_1, ..., \hat Y_m$. The m response variables $Y_1, ..., Y_m$ can be added to the plot.
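A minimal sketch of the RR and FF plots using base R's pairs(), again reusing the simulated quantities above:

```r
# RR and FF plots (Definitions 8.6 and 8.7) for the simulated data.
colnames(Ehat) <- paste0("r", 1:ncol(Ehat))
colnames(Zhat) <- paste0("Yhat", 1:ncol(Zhat))
colnames(Z)    <- paste0("Y", 1:ncol(Z))
pairs(Ehat)               # RR plot: scatterplot matrix of the m residual sets
pairs(cbind(Zhat, Z))     # FF plot: fitted values, with the responses added
```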

Remark 8.3. Some applications for multivariate linear regression need the m error vectors to be linearly related, and larger sample sizes may be needed if the error vectors are not linearly related. For example, the asymptotic optimality of the prediction regions of Section 8.3 needs the error vectors to be iid from an elliptically contoured distribution. Make the RR plot and a DD plot of the residual vectors $\hat\epsilon_i$ to check that the error vectors are linearly related. Make a DD plot of the continuous predictor variables to check for x-outliers. Make a DD plot of $Y_1, ..., Y_m$ to check for outliers, especially if it is assumed that the response variables come from an elliptically contoured distribution.

The RMVN DD plot of the residual vectors $\hat\epsilon_i$ is used to check the error vector distribution, to detect outliers, and to display the nonparametric prediction region developed in Section 8.3. The DD plot suggests that the error vector distribution is elliptically contoured if the plotted points cluster tightly about a line through the origin as n → ∞. The plot suggests that the error vector distribution is multivariate normal if the line is the identity line. If n is large and the plotted points do not cluster tightly about a line through the origin, then the error vector distribution may not be elliptically contoured. These applications of the DD plot for iid multivariate data are discussed in Olive (2002, 2008, 2013a) and Chapter 5. The RMVN estimator has not yet been proven to be a consistent estimator when computed from residual vectors, but simulations suggest that the RMVN DD plot of the residual vectors is a useful diagnostic plot. The mpack function mregddsim can be used to simulate the DD plots for various distributions.

Predictor transformations for the continuous predictors can be made exactly as in Section 2.4.

Warning: Rule of thumb 2.1 does not always work. For example, the log rule may fail. If the relationships in the scatterplot matrix are already linear or if taking the transformation does not increase the linearity, then no transformation may be better than taking a transformation. For the Arc data set evaporat.lsp with m = 1, the log rule suggests transforming the response variable Evap, but no transformation works better.

Response transformations can also be made as in Section 2.4, but also make the response plot of $\hat Y_j$ versus $Y_j$, and use the rules of Section 2.4 on $Y_j$ to linearize the response plot for each of the m response variables $Y_1, ..., Y_m$.

8.3 Asymptotically Optimal Prediction Regions

In this section, we will consider a more general multivariate regression model, and then consider the multivariate linear model as a special case. Given n cases of training or past data $(x_1, y_1), ..., (x_n, y_n)$ and a vector of predictors $x_f$, suppose it is desired to predict a future test vector $y_f$.

Definition 8.8. A large sample 100(1 − δ)% prediction region is a set $A_n$ such that $P(y_f \in A_n) \to 1 - \delta$ as n → ∞, and is asymptotically optimal if the volume of the region converges in probability to the volume of the population minimum volume covering region.

The classical large sample 100(1 − δ)% prediction region for a future value $x_f$ given iid data $x_1, ..., x_n$ is $\{x : D^2_x(\overline{x}, S) \leq \chi^2_{p,1-\delta}\}$, while for multivariate linear regression, the classical large sample 100(1 − δ)% prediction region for a future value $y_f$ given $x_f$ and past data $(x_1, y_1), ..., (x_n, y_n)$ is $\{y : D^2_y(\hat y_f, \hat\Sigma_\epsilon) \leq \chi^2_{m,1-\delta}\}$. See Johnson and Wichern (1988, pp. 134, 151, 312). By Equation (3.10), these regions may work for multivariate normal $x_i$ or $\epsilon_i$, but otherwise tend to have undercoverage. Section 5.2 and Olive (2013a) replaced $\chi^2_{p,1-\delta}$ by the order statistic $D^2_{(U_n)}$ where $U_n$ decreases to $\lceil n(1-\delta)\rceil$. This section will use a similar technique from Olive (2017b) to develop possibly the first practical large sample prediction region for the multivariate linear model with unknown error distribution. The following technical theorem will be needed to prove Theorem 8.4.


Theorem 8.3. Let a > 0 and assume that $(\hat\mu_n, \hat\Sigma_n)$ is a consistent estimator of $(\mu, a\Sigma)$.

a) $D^2_x(\hat\mu_n, \hat\Sigma_n) - \frac{1}{a}D^2_x(\mu, \Sigma) = o_P(1)$.

b) Let 0 < δ ≤ 0.5. If $(\hat\mu_n, \hat\Sigma_n) - (\mu, a\Sigma) = O_P(n^{-\delta})$ and $a\hat\Sigma_n^{-1} - \Sigma^{-1} = O_P(n^{-\delta})$, then
\[
D^2_x(\hat\mu_n, \hat\Sigma_n) - \frac{1}{a}D^2_x(\mu, \Sigma) = O_P(n^{-\delta}).
\]

Proof. Let $B_n$ denote the subset of the sample space on which $\hat\Sigma_n$ has an inverse. Then $P(B_n) \to 1$ as n → ∞. Now
\[
\begin{aligned}
D^2_x(\hat\mu_n, \hat\Sigma_n) &= (x - \hat\mu_n)^T\hat\Sigma_n^{-1}(x - \hat\mu_n)
= (x - \hat\mu_n)^T\left(\frac{\Sigma^{-1}}{a} - \frac{\Sigma^{-1}}{a} + \hat\Sigma_n^{-1}\right)(x - \hat\mu_n) \\
&= (x - \hat\mu_n)^T\left(-\frac{\Sigma^{-1}}{a} + \hat\Sigma_n^{-1}\right)(x - \hat\mu_n)
+ (x - \hat\mu_n)^T\left(\frac{\Sigma^{-1}}{a}\right)(x - \hat\mu_n) \\
&= \frac{1}{a}(x - \hat\mu_n)^T(-\Sigma^{-1} + a\hat\Sigma_n^{-1})(x - \hat\mu_n)
+ (x - \mu + \mu - \hat\mu_n)^T\left(\frac{\Sigma^{-1}}{a}\right)(x - \mu + \mu - \hat\mu_n) \\
&= \frac{1}{a}(x - \mu)^T\Sigma^{-1}(x - \mu) + \frac{2}{a}(x - \mu)^T\Sigma^{-1}(\mu - \hat\mu_n) \\
&\quad + \frac{1}{a}(\mu - \hat\mu_n)^T\Sigma^{-1}(\mu - \hat\mu_n)
+ \frac{1}{a}(x - \hat\mu_n)^T[a\hat\Sigma_n^{-1} - \Sigma^{-1}](x - \hat\mu_n)
\end{aligned}
\]
on $B_n$, and the last three terms are $o_P(1)$ under a) and $O_P(n^{-\delta})$ under b). □

Now suppose a prediction region for an m × 1 random vector $y_f$ given a vector of predictors $x_f$ is desired for the multivariate linear model. If we had many cases $z_i = B^T x_f + \epsilon_i$, then we could use the multivariate prediction region for m variables from Section 4.2. Instead, Theorem 8.4 will use the prediction region from Section 5.2 on the pseudodata $\hat z_i = \hat B^T x_f + \hat\epsilon_i = \hat y_f + \hat\epsilon_i$ for i = 1, ..., n. This takes the data cloud of the n residual vectors $\hat\epsilon_i$ and centers the cloud at $\hat y_f$. Note that $\hat z_i = (\hat B - B + B)^T x_f + (\hat\epsilon_i - \epsilon_i + \epsilon_i) = z_i + (\hat B - B)^T x_f + \hat\epsilon_i - \epsilon_i = z_i + (\hat B - B)^T x_f - (\hat B - B)^T x_i = z_i + O_P(n^{-1/2})$. Hence the distances based on the $\hat z_i$ and the distances based on the $z_i$ have the same quantiles, asymptotically (for quantiles that are continuity points of the distribution of $z_i$).

If the $\epsilon_i$ are iid from an $EC_m(0, \Sigma, g)$ distribution with continuous decreasing g and nonsingular covariance matrix $\Sigma_\epsilon = c\Sigma$ for some constant c > 0, then the population asymptotically optimal prediction region is $\{y : D_y(B^T x_f, \Sigma_\epsilon) \leq D_{1-\delta}\}$ where $P(D_y(B^T x_f, \Sigma_\epsilon) \leq D_{1-\delta}) = 1 - \delta$. For example, if the iid $\epsilon_i \sim N_m(0, \Sigma_\epsilon)$, then $D_{1-\delta} = \sqrt{\chi^2_{m,1-\delta}}$. If the error distribution is not elliptically contoured, then the above region still has 100(1 − δ)% coverage, but prediction regions with smaller volume may exist.

A natural way to make a large sample prediction region is to estimate the target population minimum volume covering region, but for moderate samples and many error distributions, the natural estimator that covers $\lceil n(1-\delta)\rceil$ of the cases tends to have undercoverage as high as min(0.05, δ/2). This empirical result is not too surprising since it is well known that the performance of a prediction region on the training data is superior to the performance on future test data, due in part to the unknown variability of the estimator. To compensate for the undercoverage, let $q_n$ be as in Theorem 8.4.

Theorem 8.4. Suppose $y_i = E(y_i|x_i) + \epsilon_i = \hat y_i + \hat\epsilon_i$ where $Cov(\epsilon_i) = \Sigma_\epsilon > 0$, and where the zero mean $\epsilon_f$ and the $\epsilon_i$ are iid for i = 1, ..., n. Given $x_f$, suppose the fitted model produces $\hat y_f$ and nonsingular $\hat\Sigma_\epsilon$. Let $\hat z_i = \hat y_f + \hat\epsilon_i$ and
\[
D^2_i \equiv D^2_i(\hat y_f, \hat\Sigma_\epsilon) = (\hat z_i - \hat y_f)^T\hat\Sigma_\epsilon^{-1}(\hat z_i - \hat y_f)
\]
for i = 1, ..., n. Let $q_n = \min(1 - \delta + 0.05,\ 1 - \delta + m/n)$ for δ > 0.1 and
\[
q_n = \min(1 - \delta/2,\ 1 - \delta + 10\delta m/n), \quad \text{otherwise}.
\]
If $q_n < 1 - \delta + 0.001$, set $q_n = 1 - \delta$. Let 0 < δ < 1 and $h = D_{(U_n)}$ where $D_{(U_n)}$ is the 100 $q_n$th sample quantile of the Mahalanobis distances $D_i$. Let the nominal 100(1 − δ)% prediction region for $y_f$ be given by
\[
\{z : (z - \hat y_f)^T\hat\Sigma_\epsilon^{-1}(z - \hat y_f) \leq D^2_{(U_n)}\}
= \{z : D^2_z(\hat y_f, \hat\Sigma_\epsilon) \leq D^2_{(U_n)}\}
= \{z : D_z(\hat y_f, \hat\Sigma_\epsilon) \leq D_{(U_n)}\}. \qquad (8.4)
\]

a) Consider the n prediction regions for the data where $(y_{f,i}, x_{f,i}) = (y_i, x_i)$ for i = 1, ..., n. If the order statistic $D_{(U_n)}$ is unique, then $U_n$ of the n prediction regions contain $y_i$ where $U_n/n \to 1 - \delta$ as n → ∞.

b) If $(\hat y_f, \hat\Sigma_\epsilon)$ is a consistent estimator of $(E(y_f), \Sigma_\epsilon)$, then (8.4) is a large sample 100(1 − δ)% prediction region for $y_f$.

c) If $(\hat y_f, \hat\Sigma_\epsilon)$ is a consistent estimator of $(E(y_f), \Sigma_\epsilon)$, and the $\epsilon_i$ come from an elliptically contoured distribution such that the unique highest density region is $\{z : D_z(0, \Sigma_\epsilon) \leq D_{1-\delta}\}$, then the prediction region (8.4) is asymptotically optimal.

Proof. a) Suppose $(x_f, y_f) = (x_i, y_i)$. Then
\[
D^2_{y_i}(\hat y_i, \hat\Sigma_\epsilon) = (y_i - \hat y_i)^T\hat\Sigma_\epsilon^{-1}(y_i - \hat y_i) = \hat\epsilon_i^T\hat\Sigma_\epsilon^{-1}\hat\epsilon_i = D^2_{\hat\epsilon_i}(0, \hat\Sigma_\epsilon).
\]
Hence $y_i$ is in the ith prediction region $\{z : D_z(\hat y_i, \hat\Sigma_\epsilon) \leq D_{(U_n)}(\hat y_i, \hat\Sigma_\epsilon)\}$ iff $\hat\epsilon_i$ is in the prediction region $\{z : D_z(0, \hat\Sigma_\epsilon) \leq D_{(U_n)}(0, \hat\Sigma_\epsilon)\}$, but exactly $U_n$ of the $\hat\epsilon_i$ are in the latter region by construction, if $D_{(U_n)}$ is unique. Since $D_{(U_n)}$ is the 100(1 − δ)th percentile of the $D_i$ asymptotically, $U_n/n \to 1 - \delta$.

b) Let $P[D_z(E(y_f), \Sigma_\epsilon) \leq D_{1-\delta}(E(y_f), \Sigma_\epsilon)] = 1 - \delta$. Since $\Sigma_\epsilon > 0$, Theorem 8.3 shows that if $(\hat y_f, \hat\Sigma_\epsilon) \xrightarrow{P} (E(y_f), \Sigma_\epsilon)$, then $D_z(\hat y_f, \hat\Sigma_\epsilon) \xrightarrow{D} D_z(E(y_f), \Sigma_\epsilon)$. Hence the percentiles of the distances converge in distribution, and the probability that $y_f$ is in $\{z : D_z(\hat y_f, \hat\Sigma_\epsilon) \leq D_{1-\delta}(\hat y_f, \hat\Sigma_\epsilon)\}$ converges to 1 − δ = the probability that $y_f$ is in $\{z : D_z(E(y_f), \Sigma_\epsilon) \leq D_{1-\delta}(E(y_f), \Sigma_\epsilon)\}$ at continuity points $D_{1-\delta}$ of the distribution of $D_z(E(y_f), \Sigma_\epsilon)$.

c) The asymptotically optimal prediction region is the region with the smallest volume (hence highest density) such that the coverage is 1 − δ, as n → ∞. This region is $\{z : D_z(E(y_f), \Sigma_\epsilon) \leq D_{1-\delta}(E(y_f), \Sigma_\epsilon)\}$ if the asymptotically optimal region for the $\epsilon_i$ is $\{z : D_z(0, \Sigma_\epsilon) \leq D_{1-\delta}(0, \Sigma_\epsilon)\}$. Hence the result follows by b). □

Notice that if $\hat\Sigma_\epsilon^{-1}$ exists, then 100$q_n$% of the n training data $y_i$ are in their corresponding prediction region with $x_f = x_i$, and $q_n \to 1 - \delta$ even if $(\hat y_i, \hat\Sigma_\epsilon)$ is not a good estimator or if the regression model is misspecified. Hence the coverage $q_n$ of the training data is robust to model assumptions. Of course the volume of the prediction region could be large if a poor estimator $(\hat y_i, \hat\Sigma_\epsilon)$ is used or if the $\epsilon_i$ do not come from an elliptically contoured distribution. The response, residual, and DD plots can be used to check model assumptions. If the plotted points in the RMVN DD plot cluster tightly about some line through the origin and if $n \geq \max[3(m+p)^2, mp + 30]$, we expect the volume of the prediction region may be fairly low for the least squares estimators.

If n is too small, then multivariate data is sparse and the covering ellipsoid for the training data may be far too small for future data, resulting in severe undercoverage. Also notice that $q_n = 1 - \delta/2$ or $q_n = 1 - \delta + 0.05$ for n ≤ 20p. At the training data, the coverage $q_n \geq 1 - \delta$, and $q_n$ converges to the nominal coverage 1 − δ as n → ∞. Suppose n ≤ 20p. Then the nominal 95% prediction region uses $q_n = 0.975$ while the nominal 50% prediction region uses $q_n = 0.55$. Prediction distributions depend both on the error distribution and on the variability of the estimator $(\hat y_f, \hat\Sigma_\epsilon)$. This variability is typically unknown but converges to 0 as n → ∞. Also, residuals tend to underestimate errors for small n. For moderate n, ignoring estimator variability and using $q_n = 1 - \delta$ resulted in undercoverage as high as min(0.05, δ/2). Letting the “coverage” $q_n$ decrease to the nominal coverage 1 − δ inflates the volume of the prediction region for small n, compensating for the unknown variability of $(\hat y_f, \hat\Sigma_\epsilon)$.


Consider the multivariate linear regression model. The semiparametric and parametric regions are only conjectured to be large sample prediction regions, but are useful as diagnostics. Let $\hat\Sigma_\epsilon = \hat\Sigma_{\epsilon,d=p}$, $\hat z_i = \hat y_f + \hat\epsilon_i$, and $D^2_i(\hat y_f, S_r) = (\hat z_i - \hat y_f)^T S_r^{-1}(\hat z_i - \hat y_f)$ for i = 1, ..., n. Then the large sample nonparametric 100(1 − δ)% prediction region is
\[
\{z : D^2_z(\hat y_f, S_r) \leq D^2_{(U_n)}\} = \{z : D_z(\hat y_f, S_r) \leq D_{(U_n)}\}, \qquad (8.5)
\]
while the (Johnson and Wichern 1988, p. 312) classical large sample 100(1 − δ)% prediction region is
\[
\{z : D^2_z(\hat y_f, \hat\Sigma_\epsilon) \leq \chi^2_{m,1-\delta}\} = \{z : D_z(\hat y_f, \hat\Sigma_\epsilon) \leq \sqrt{\chi^2_{m,1-\delta}}\}. \qquad (8.6)
\]

Theorem 8.5 will show that this prediction region (8.5) can also be found by applying the nonparametric prediction region (4.17) to the $\hat z_i$. Recall that $S_r$ defined in Definition 8.4 is the sample covariance matrix of the residual vectors $\hat\epsilon_i$. Section 4.2 describes the nonparametric, semiparametric, and parametric MVN prediction regions. Similar regions are used for multivariate linear regression. For the multivariate linear regression model, if $D_{1-\delta}$ is a continuity point of the distribution of D, Assumption D1 above Theorem 8.7 holds, and the $\epsilon_i$ have a nonsingular covariance matrix, then (8.5) is a large sample 100(1 − δ)% prediction region for $y_f$.

Theorem 8.5. For multivariate linear regression, when least squares is used to compute $\hat y_f$, $S_r$, and the pseudodata $\hat z_i$, prediction region (8.5) is the Section 4.2 nonparametric prediction region applied to the $\hat z_i$.

Proof. Multivariate linear regression with least squares satisfies Theorem 8.4 by Su and Cook (2012). (See Theorem 8.7.) Let (T, C) be the sample mean and sample covariance matrix (see Definition 2.6) applied to the $\hat z_i$. The sample mean and sample covariance matrix of the residual vectors is $(0, S_r)$ since least squares was used. Hence the $\hat z_i = \hat y_f + \hat\epsilon_i$ have sample covariance matrix $S_r$ and sample mean $\hat y_f$. Hence $(T, C) = (\hat y_f, S_r)$, and the $D_i(\hat y_f, S_r)$ are used to compute $D_{(U_n)}$. □
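A minimal R sketch of the nonparametric prediction region (8.5), continuing the simulated least squares example above; the coverage proportion qn follows Theorem 8.4, and a sample quantile is used in place of the order statistic D(Un) (the text's mpack functions may implement the details differently):

```r
# Nonparametric prediction region (8.5) for y_f given x_f.
delta <- 0.05
xf <- c(1, rnorm(p - 1))                     # future predictor vector (with the 1)
yf_hat <- drop(t(Bhat) %*% xf)               # predicted value for y_f
Sr <- cov(Ehat)                              # sample covariance of residual vectors
D2 <- mahalanobis(Ehat, center = rep(0, m), cov = Sr)  # D^2_i: zhat_i - yf_hat = residual
# coverage proportion qn from Theorem 8.4 (here delta = 0.05 <= 0.1)
qn <- min(1 - delta/2, 1 - delta + 10 * delta * m / n)
if (qn < 1 - delta + 0.001) qn <- 1 - delta
cutoff <- quantile(D2, qn)                   # approximates D^2_(Un)
# y_f is in the region iff (y_f - yf_hat)^T Sr^{-1} (y_f - yf_hat) <= cutoff
yf <- drop(t(B) %*% xf) + rnorm(m)           # one simulated future response
mahalanobis(t(yf), center = yf_hat, cov = Sr) <= cutoff
```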

The RMVN DD plot of the residual vectors will be used to display the prediction regions for multivariate linear regression. See Example 8.3. The nonparametric prediction region for multivariate linear regression of Theorem 8.5 uses $(T, C) = (\hat y_f, S_r)$ in (8.4), and has simple geometry. Let $R_r$ be the nonparametric prediction region (8.5) applied to the residuals $\hat\epsilon_i$ with $\hat y_f = 0$. Then $R_r$ is a hyperellipsoid with center 0, and the nonparametric prediction region is the hyperellipsoid $R_r$ translated to have center $\hat y_f$. Hence in a DD plot, all points to the left of the line MD = $D_{(U_n)}$ correspond to $y_i$ that are in their prediction region, while points to the right of the line are not in their prediction region.

The nonparametric prediction region has some interesting properties. This prediction region is asymptotically optimal if the $\epsilon_i$ are iid for a large class of elliptically contoured $EC_m(0, \Sigma, g)$ distributions. Also, if there are 100 different values $(x_{jf}, y_{jf})$ to be predicted, we only need to update $\hat y_{jf}$ for j = 1, ..., 100; we do not need to update the covariance matrix $S_r$.

It is common practice to examine how well the prediction regions work on the training data. That is, for i = 1, ..., n, set $x_f = x_i$ and see if $y_i$ is in the region with probability near to 1 − δ with a simulation study. Note that $\hat y_f = \hat y_i$ if $x_f = x_i$. Simulation is not needed for the nonparametric prediction region (8.5) for the data since the prediction region (8.5) centered at $\hat y_i$ contains $y_i$ iff $R_r$, the prediction region centered at 0, contains $\hat\epsilon_i$, since $\hat\epsilon_i = y_i - \hat y_i$. Thus 100$q_n$% of prediction regions corresponding to the data $(y_i, x_i)$ contain $y_i$, and 100$q_n$% → 100(1 − δ)%. Hence the prediction regions work well on the training data and should work well on $(x_f, y_f)$ similar to the training data. Of course simulation should be done for $(x_f, y_f)$ that are not equal to training data cases. See Section 8.5.

This training data result holds provided that the multivariate linear regression using least squares is such that the sample covariance matrix $S_r$ of the residual vectors is nonsingular; the multivariate regression model need not be correct. Hence the coverage at the n training data cases $(x_i, y_i)$ is robust to model misspecification. Of course, the prediction regions may be very large if the model is severely misspecified, but severity of misspecification can be checked with the response and residual plots. Coverage for a future value $y_f$ can also be arbitrarily bad if there is extrapolation or if $(x_f, y_f)$ comes from a different population than that of the data.

8.4 Testing Hypotheses

This section considers testing a linear hypothesis $H_0: LB = 0$ versus $H_1: LB \neq 0$ where L is a full rank r × p matrix.

Definition 8.9. Assume rank(X) = p. The total corrected (for the mean) sum of squares and cross products matrix is
\[
T = R + W_e = Z^T\left(I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right)Z.
\]
Note that T/(n − 1) is the usual sample covariance matrix $\hat\Sigma_y$ if all n of the $y_i$ are iid, e.g. if B = 0. The regression sum of squares and cross products matrix is
\[
R = Z^T\left[X(X^TX)^{-1}X^T - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right]Z = Z^TX\hat B - \frac{1}{n}Z^T\mathbf{1}\mathbf{1}^TZ.
\]
Let $H = \hat B^TL^T[L(X^TX)^{-1}L^T]^{-1}L\hat B$. The error or residual sum of squares and cross products matrix is
\[
W_e = (Z - \hat Z)^T(Z - \hat Z) = Z^TZ - Z^TX\hat B = Z^T[I_n - X(X^TX)^{-1}X^T]Z.
\]
Note that $W_e = \hat E^T\hat E$ and $W_e/(n - p) = \hat\Sigma_\epsilon$.

Warning: SAS output uses E instead of $W_e$.

The MANOVA table is shown below.

Summary MANOVA Table

Source                    matrix   df
Regression or Treatment   R        p − 1
Error or Residual         W_e      n − p
Total (corrected)         T        n − 1

Definition 8.10. Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$ be the ordered eigenvalues of $W_e^{-1}H$. Then there are four commonly used test statistics.

The Roy's maximum root statistic is $\lambda_{max}(L) = \lambda_1$.

The Wilks' Λ statistic is $\Lambda(L) = |(H + W_e)^{-1}W_e| = |W_e^{-1}H + I|^{-1} = \prod_{i=1}^m (1 + \lambda_i)^{-1}$.

The Pillai's trace statistic is $V(L) = \mathrm{tr}[(H + W_e)^{-1}H] = \sum_{i=1}^m \frac{\lambda_i}{1 + \lambda_i}$.

The Hotelling-Lawley trace statistic is $U(L) = \mathrm{tr}[W_e^{-1}H] = \sum_{i=1}^m \lambda_i$.
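A minimal R sketch computing the four statistics from the eigenvalues of $W_e^{-1}H$, continuing the simulated example above; here L is an illustrative 1 × p matrix testing whether the last row of B is zero:

```r
# Four MANOVA test statistics of Definition 8.10 for H0: LB = 0.
L <- matrix(0, nrow = 1, ncol = p); L[1, p] <- 1    # tests the last row of B
XtXinv <- solve(t(X) %*% X)
H  <- t(L %*% Bhat) %*% solve(L %*% XtXinv %*% t(L)) %*% (L %*% Bhat)
We <- t(Ehat) %*% Ehat                              # error SSCP matrix
lambda <- Re(eigen(solve(We) %*% H)$values)         # ordered eigenvalues
c(Roy       = max(lambda),
  Wilks     = prod(1 / (1 + lambda)),
  Pillai    = sum(lambda / (1 + lambda)),
  HotLawley = sum(lambda))
```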

Typically some function of one of the four above statistics is used to get pval, the estimated pvalue. Output often gives the pvals for all four test statistics. Be cautious about inference if the last three test statistics do not lead to the same conclusions (Roy's test may not be trustworthy for r > 1). Theory and simulations developed below for the four statistics will provide more information about the sample sizes needed to use the four test statistics. See below for the notation used in the next theorem.

Theorem 8.6. The Hotelling-Lawley trace statistic

U(L) = [1/(n − p)] [vec(LB̂)]^T [Σ̂ε^{-1} ⊗ (L(X^T X)^{-1}L^T)^{-1}] [vec(LB̂)].    (8.7)

Proof. Using the Searle (1982, p. 333) identity
tr(A G^T D G C) = [vec(G)]^T [CA ⊗ D^T][vec(G)], it follows that

(n − p)U(L) = tr[Σ̂ε^{-1} B̂^T L^T [L(X^T X)^{-1}L^T]^{-1} L B̂]

= [vec(LB̂)]^T [Σ̂ε^{-1} ⊗ (L(X^T X)^{-1}L^T)^{-1}] [vec(LB̂)] = T

where A = Σ̂ε^{-1}, G = LB̂, D = [L(X^T X)^{-1}L^T]^{-1}, and C = I. Hence (8.7) holds. □
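Identity (8.7) can be checked numerically. The minimal base R sketch below compares tr(W_e^{-1}H) with the Kronecker product form; the simulated data and the choice of L are illustrative only.

set.seed(3)
n <- 60; p <- 3; m <- 2
X <- cbind(1, matrix(rnorm(n*(p-1)), n, p-1))
Z <- matrix(rnorm(n*m), n, m)                   # B = 0, so H0 is true
XtXinv <- solve(t(X) %*% X)
Bhat <- XtXinv %*% t(X) %*% Z
We   <- t(Z - X %*% Bhat) %*% (Z - X %*% Bhat)
Sig  <- We/(n - p)                              # Sigma_eps hat
L    <- cbind(0, diag(p - 1))
H    <- t(L %*% Bhat) %*% solve(L %*% XtXinv %*% t(L)) %*% (L %*% Bhat)
U1   <- sum(diag(solve(We) %*% H))              # U(L) = tr(W_e^{-1} H)
v    <- as.vector(L %*% Bhat)                   # vec(L Bhat): stacks the columns
U2   <- t(v) %*% kronecker(solve(Sig), solve(L %*% XtXinv %*% t(L))) %*% v / (n - p)
all.equal(U1, as.numeric(U2))                   # the two forms of U(L) agree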


Some notation is useful to show (8.7) and to show that (n − p)U(L) →D χ²_rm under mild conditions if H0 is true. Following Henderson and Searle (1979), let matrix A = [a1 a2 . . . ap]. Then the vec operator stacks the columns of A on top of one another so

vec(A) = (a1^T, a2^T, ..., ap^T)^T.

Let A = (aij) be an m × n matrix and B a p × q matrix. Then the Kronecker product of A and B is the mp × nq matrix

A ⊗ B = [ a11B  a12B  · · ·  a1nB ]
        [ a21B  a22B  · · ·  a2nB ]
        [  ...   ...   · · ·   ... ]
        [ am1B  am2B  · · ·  amnB ].

An important fact is that if A and B are nonsingular square matrices, then [A ⊗ B]^{-1} = A^{-1} ⊗ B^{-1}. The following assumption is important.
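In base R, as.vector() performs the vec operation (it stacks the columns of a matrix) and kronecker() gives the Kronecker product, so the fact above can be checked numerically; the small matrices below are illustrative only.

A <- matrix(1:4, 2, 2); B <- matrix(c(2, 0, 1, 3), 2, 2)   # two nonsingular matrices
as.vector(A)                                # vec(A): the columns of A stacked
K <- kronecker(A, B)                        # A (x) B: 4 x 4 matrix of blocks a_ij * B
all.equal(solve(K), kronecker(solve(A), solve(B)))   # [A (x) B]^{-1} = A^{-1} (x) B^{-1}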

Assumption D1: Let hi be the ith diagonal element of X(X^T X)^{-1}X^T. Assume max_{1≤i≤n} hi →P 0 as n → ∞, assume that the zero mean iid error vectors have finite fourth moments, and assume that (1/n) X^T X →P W^{-1}.

Su and Cook (2012) proved a central limit type theorem for Σ̂ε and B̂ for the partial envelopes estimator, and the least squares estimator is a special case. These results prove the following theorem. Their theorem also shows that for multiple linear regression (m = 1), σ̂² = MSE is a √n consistent estimator of σ².

Theorem 8.7: Multivariate Least Squares Central Limit Theorem (MLS CLT). For the least squares estimator, if assumption D1 holds, then Σ̂ε is a √n consistent estimator of Σε and

√n vec(B̂ − B) →D Npm(0, Σε ⊗ W).

Theorem 8.8. If assumption D1 holds and if H0 is true, then

(n − p)U(L) →D χ²_rm.

Proof. By Theorem 8.7, √n vec(B̂ − B) →D Npm(0, Σε ⊗ W). Then under H0, √n vec(LB̂) →D Nrm(0, Σε ⊗ LWL^T), and

n [vec(LB̂)]^T [Σε^{-1} ⊗ (LWL^T)^{-1}] [vec(LB̂)] →D χ²_rm.

This result also holds if W and Σε are replaced by Ŵ = n(X^T X)^{-1} and Σ̂ε. Hence under H0 and using the proof of Theorem 8.6,

T = (n − p)U(L) = [vec(LB̂)]^T [Σ̂ε^{-1} ⊗ (L(X^T X)^{-1}L^T)^{-1}] [vec(LB̂)] →D χ²_rm.  □

Some more details on the above results may be useful. Consider testing a linear hypothesis H0 : LB = 0 versus H1 : LB ≠ 0 where L is a full rank r × p matrix. For now assume the error distribution is multivariate normal Nm(0, Σε). Then

vec(B̂ − B) = ((β̂1 − β1)^T, (β̂2 − β2)^T, ..., (β̂m − βm)^T)^T ∼ Npm(0, Σε ⊗ (X^T X)^{-1})

where C = Σε ⊗ (X^T X)^{-1} is the pm × pm block matrix whose (i, j)th p × p block is σij (X^T X)^{-1} for i, j = 1, ..., m.

Now let A be an rm × pm block diagonal matrix: A = diag(L, ..., L). Then

A vec(B̂ − B) = vec(L(B̂ − B)) = ((L(β̂1 − β1))^T, ..., (L(β̂m − βm))^T)^T ∼ Nrm(0, Σε ⊗ L(X^T X)^{-1}L^T)

where D = Σε ⊗ L(X^T X)^{-1}L^T = ACA^T is the rm × rm block matrix whose (i, j)th r × r block is σij L(X^T X)^{-1}L^T for i, j = 1, ..., m.

Under H0, vec(LB) = A vec(B) = 0, and

vec(LB̂) = ((Lβ̂1)^T, (Lβ̂2)^T, ..., (Lβ̂m)^T)^T ∼ Nrm(0, Σε ⊗ L(X^T X)^{-1}L^T).

Hence under H0,

[vec(LB̂)]^T [Σε^{-1} ⊗ (L(X^T X)^{-1}L^T)^{-1}] [vec(LB̂)] ∼ χ²_rm,

and

T = [vec(LB̂)]^T [Σ̂ε^{-1} ⊗ (L(X^T X)^{-1}L^T)^{-1}] [vec(LB̂)] →D χ²_rm.    (8.8)

A large sample level δ test will reject H0 if pval ≤ δ where

pval = P(T/(rm) < F_{rm, n−mp}).    (8.9)

Since least squares estimators are asymptotically normal, if the εi are iid for a large class of distributions,

√n vec(B̂ − B) = √n ((β̂1 − β1)^T, ..., (β̂m − βm)^T)^T →D Npm(0, Σε ⊗ W)

where (X^T X)/n →P W^{-1}.

Then under H0,

√n vec(LB̂) = √n ((Lβ̂1)^T, (Lβ̂2)^T, ..., (Lβ̂m)^T)^T →D Nrm(0, Σε ⊗ LWL^T),

and

n [vec(LB̂)]^T [Σε^{-1} ⊗ (LWL^T)^{-1}] [vec(LB̂)] →D χ²_rm.

Hence (8.8) holds, and (8.9) gives a large sample level δ test if the least squares estimators are asymptotically normal.

Kakizawa (2009) showed, under stronger assumptions than Theorem 8.8, that for a large class of iid error distributions, the following test statistics have the same χ²_rm limiting distribution when H0 is true, and the same noncentral χ²_rm(ω²) limiting distribution with noncentrality parameter ω² when H0 is false under a local alternative. Hence the three tests are robust to the assumption of normality. The limiting null distribution is well known when the zero mean errors are iid from a multivariate normal distribution. See Khattree and Naik (1999, p. 68): (n − p)U(L) →D χ²_rm, (n − p)V(L) →D χ²_rm, and −[n − p − 0.5(m − r + 3)] log(Λ(L)) →D χ²_rm. Results from Kshirsagar (1972, p. 301) suggest that the third chi-square approximation is very good if n ≥ 3(m + p)² for multivariate normal error vectors.

Theorems 8.6 and 8.8 are useful for relating multivariate tests with the partial F test for multiple linear regression that tests whether a reduced model that omits some of the predictors can be used instead of the full model that uses all p predictors. The partial F test statistic is

FR = [(SSE(R) − SSE(F))/(dfR − dfF)] / MSE(F)

where the residual sums of squares SSE(F) and SSE(R) and degrees of freedom dfF and dfR are for the full and reduced model while the mean square error MSE(F) is for the full model. Let the null hypothesis for the partial F test be H0 : Lβ = 0 where L sets the coefficients of the predictors in the full model but not in the reduced model to 0. Seber and Lee (2003, p. 100) show that

FR = [Lβ̂]^T (L(X^T X)^{-1}L^T)^{-1} [Lβ̂] / (r σ̂²)

is distributed as F_{r, n−p} if H0 is true and the errors are iid N(0, σ²). Note that for multiple linear regression with m = 1, FR = (n − p)U(L)/r since Σ̂ε^{-1} = 1/σ̂². Hence the scaled Hotelling Lawley test statistic is the partial F test statistic extended to m > 1 response variables by Theorem 8.6.

By Theorem 8.8, for example, rFR →D χ²_r for a large class of nonnormal error distributions. If Zn ∼ F_{k, dn}, then Zn →D χ²_k/k as dn → ∞. Hence using the F_{r, n−p} approximation gives a large sample test with correct asymptotic level, and the partial F test is robust to nonnormality.
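A minimal base R sketch for the m = 1 case is shown below: the partial F statistic from anova() on nested lm() fits agrees with the quadratic form above. The simulated data and the choice of dropped predictors are illustrative only.

set.seed(4)
n <- 100; p <- 4
x <- matrix(rnorm(n*(p-1)), n, p-1)
y <- 1 + x[, 1] + rnorm(n)
full <- lm(y ~ x)                        # full model: all p - 1 nontrivial predictors
red  <- lm(y ~ x[, 1])                   # reduced model drops the last two predictors
FR1  <- anova(red, full)$F[2]            # partial F statistic from output
X    <- cbind(1, x)
bhat <- coef(full)
L    <- cbind(matrix(0, 2, 2), diag(2))  # selects the coefficients of the dropped predictors
r    <- nrow(L)
s2   <- summary(full)$sigma^2            # MSE of the full model
FR2  <- t(L %*% bhat) %*% solve(L %*% solve(t(X) %*% X) %*% t(L)) %*% (L %*% bhat) / (r * s2)
all.equal(FR1, as.numeric(FR2))          # the two forms of F_R agree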

Similarly, using an F_{rm, n−pm} approximation for the following test statistics gives large sample tests with correct asymptotic level by Kakizawa (2009) and similar power for large n. The large sample test will have correct asymptotic level as long as the denominator degrees of freedom dn → ∞ as n → ∞, and dn = n − pm reduces to the partial F test if m = 1 and U(L) is used. Then the three test statistics are

−[n − p − 0.5(m − r + 3)]/(rm) log(Λ(L)),   [(n − p)/(rm)] V(L),   and   [(n − p)/(rm)] U(L).

By Berndt and Savin (1977) and Anderson (1984, pp. 333, 371),

V(L) ≤ − log(Λ(L)) ≤ U(L).


Hence the Hotelling Lawley test will have the most power and Pillai's test will have the least power.
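A minimal base R sketch converting given values of Λ(L), V(L), and U(L) into the three scaled statistics above and their pvals with pf() is shown below; the numerical inputs are illustrative only and do not come from a real data set.

n <- 100; p <- 4; m <- 2; r <- p - 1         # e.g. the MANOVA F test with L = [0 I_{p-1}]
Lam <- 0.85; V <- 0.16; U <- 0.17            # illustrative values of the three statistics
d1 <- r*m; d2 <- n - p*m                     # numerator and denominator degrees of freedom
Fw <- -(n - p - 0.5*(m - r + 3)) * log(Lam) / d1   # scaled Wilks' statistic
Fp <- (n - p) * V / d1                             # scaled Pillai's statistic
Fh <- (n - p) * U / d1                             # scaled Hotelling-Lawley statistic
1 - pf(c(Fw, Fp, Fh), d1, d2)                # pvals; reject H0 if pval <= delta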

Following Khattree and Naik (1999, pp. 67-68), there are several approximations used by the SAS software. For the Roy's largest root test, if h = max(r, m), use

[(n − p − h + r)/h] λmax(L) ≈ F(h, n − p − h + r).

The simulations in Section 8.5 suggest that this approximation is good for r = 1 but poor for r > 1. Anderson (1984, p. 333) stated that Roy's largest root test has the greatest power if r = 1 but is an inferior test for r > 1. Let g = n − p − (m − r + 1)/2, u = (rm − 2)/4 and t = √(r²m² − 4)/√(m² + r² − 5) for m² + r² − 5 > 0 and t = 1, otherwise. Assume H0 is true. Thus U →P 0, V →P 0, and Λ →P 1 as n → ∞. Then

[(gt − 2u)/(rm)] (1 − Λ^{1/t})/Λ^{1/t} ≈ F(rm, gt − 2u)   or   (n − p) t (1 − Λ^{1/t})/Λ^{1/t} ≈ χ²_rm.

For large n and t > 0, − log(Λ) = −t log(Λ^{1/t}) = −t log(1 + Λ^{1/t} − 1) ≈ t(1 − Λ^{1/t}) ≈ t(1 − Λ^{1/t})/Λ^{1/t}. If it can not be shown that

(n − p)[− log(Λ) − t(1 − Λ^{1/t})/Λ^{1/t}] →P 0 as n → ∞,

then it is possible that the approximate χ²_rm distribution may be the limiting distribution for only a small class of iid error distributions. When the εi are iid Nm(0, Σε), there are some exact results. For r = 1,

[(n − p − m + 1)/m] (1 − Λ)/Λ ∼ F(m, n − p − m + 1).

For r = 2,

[2(n − p − m + 1)/(2m)] (1 − Λ^{1/2})/Λ^{1/2} ∼ F(2m, 2(n − p − m + 1)).

For m = 2,

[2(n − p)/(2r)] (1 − Λ^{1/2})/Λ^{1/2} ∼ F(2r, 2(n − p)).

Let s = min(r, m), m1 = (|r − m| − 1)/2 and m2 = (n − p − m − 1)/2. Note that s(|r − m| + s) = min(r, m) max(r, m) = rm. Then

[(n − p)/(rm)] V/(1 − V/s) = [(n − p)/(s(|r − m| + s))] V/(1 − V/s) ≈ [(2m2 + s + 1)/(2m1 + s + 1)] V/(s − V) ≈

F(s(2m1 + s + 1), s(2m2 + s + 1)) ≈ F(s(|r − m| + s), s(n − p)) = F(rm, s(n − p)).


This approximation is asymptotically correct by Slutsky's theorem since 1 − V/s →P 1. Finally,

[(n − p)/(rm)] U = [(n − p)/(s(|r − m| + s))] U ≈ [2(sm2 + 1)/(s²(2m1 + s + 1))] U ≈ F(s(2m1 + s + 1), 2(sm2 + 1))

≈ F(s(|r − m| + s), s(n − p)) = F(rm, s(n − p)).

This approximation is asymptotically correct for a wide range of iid error distributions.

Multivariate analogs of tests for multiple linear regression can be derived with appropriate choice of L. Assume a constant x1 = 1 is in the model. As a textbook convention, use δ = 0.05 if δ is not given.

The four step MANOVA test of linear hypotheses is useful.
i) State the hypotheses H0 : LB = 0 and H1 : LB ≠ 0.
ii) Get the test statistic from output.
iii) Get the pval from output.
iv) State whether you reject H0 or fail to reject H0. If pval ≤ δ, reject H0 and conclude that LB ≠ 0. If pval > δ, fail to reject H0 and conclude that LB = 0 or that there is not enough evidence to conclude that LB ≠ 0.

The MANOVA test of H0 : B = 0 versus H1 : B ≠ 0 is the special case corresponding to L = I and H = B̂^T X^T X B̂ = Ẑ^T Ẑ, but is usually not a test of interest.

The analog of the ANOVA F test for multiple linear regression is the MANOVA F test that uses L = [0 I_{p−1}] to test whether the nontrivial predictors are needed in the model. This test should reject H0 if the response and residual plots look good, n is large enough, and at least one response plot does not look like the corresponding residual plot. A response plot for Yj will look like a residual plot if the identity line appears almost horizontal, hence the range of Ŷj is small. Response and residual plots are often useful for n ≥ 10p.

The 4 step MANOVA F test of hypotheses uses L = [0 I_{p−1}].
i) State the hypotheses H0: the nontrivial predictors are not needed in the mreg model H1: at least one of the nontrivial predictors is needed.
ii) Find the test statistic F0 from output.
iii) Find the pval from output.
iv) If pval ≤ δ, reject H0. If pval > δ, fail to reject H0. If H0 is rejected, conclude that there is a mreg relationship between the response variables Y1, ..., Ym and the predictors x2, ..., xp. If you fail to reject H0, conclude that there is not a mreg relationship between Y1, ..., Ym and the predictors x2, ..., xp. (Or there is not enough evidence to conclude that there is a mreg relationship between the response variables and the predictors. Get the variable names from the story problem.)


The Fj test of hypotheses uses Lj = [0, ..., 0, 1, 0, ..., 0], where the 1 is in the jth position, to test whether the jth predictor xj is needed in the model given that the other p − 1 predictors are in the model. This test is an analog of the t tests for multiple linear regression. Note that xj is not needed in the model corresponds to H0 : Bj = 0 while xj needed in the model corresponds to H1 : Bj ≠ 0 where Bj^T is the jth row of B.

The 4 step Fj test of hypotheses uses Lj = [0, ..., 0, 1, 0, ..., 0] where the 1 is in the jth position.
i) State the hypotheses H0: xj is not needed in the model H1: xj is needed.
ii) Find the test statistic Fj from output.
iii) Find pval from output.
iv) If pval ≤ δ, reject H0. If pval > δ, fail to reject H0. Give a nontechnical sentence restating your conclusion in terms of the story problem. If H0 is rejected, then conclude that xj is needed in the mreg model for Y1, ..., Ym given that the other predictors are in the model. If you fail to reject H0, then conclude that xj is not needed in the mreg model for Y1, ..., Ym given that the other predictors are in the model. (Or there is not enough evidence to conclude that xj is needed in the model. Get the variable names from the story problem.)

The Hotelling Lawley statistic

Fj = (1/dj) B̂j^T Σ̂ε^{-1} B̂j = (1/dj) (β̂j1, β̂j2, ..., β̂jm) Σ̂ε^{-1} (β̂j1, β̂j2, ..., β̂jm)^T

where B̂j^T is the jth row of B̂ and dj = (X^T X)^{-1}_jj, the jth diagonal entry of (X^T X)^{-1}. The statistic Fj could be used for forward selection and backward elimination in variable selection.

The 4 step MANOVA partial F test of hypotheses has a full model using all of the variables and a reduced model where r of the variables are deleted. The ith row of L has a 1 in the position corresponding to the ith variable to be deleted. Omitting the jth variable corresponds to the Fj test while omitting variables x2, ..., xp corresponds to the MANOVA F test. Using L = [0 I_k] tests whether the last k predictors are needed in the multivariate linear regression model given that the remaining predictors are in the model.
i) State the hypotheses H0: the reduced model is good H1: use the full model.
ii) Find the test statistic FR from output.
iii) Find the pval from output.


iv) If pval ≤ δ, reject H0 and conclude that the full model should be used. If pval > δ, fail to reject H0 and conclude that the reduced model is good.

The mpack function mltreg produces the m response and residual plots, gives B̂, Σ̂ε, the MANOVA partial F test statistic and pval corresponding to the reduced model that leaves out the variables given by indices (so x2 and x4 in the output below with F = 0.77 and pval = 0.614), Fj and the pval for the Fj test for variables 1, 2, ..., p (where p = 4 in the output below so F2 = 1.51 with pval = 0.284), and F0 and pval for the MANOVA F test (in the output below F0 = 3.15 and pval = 0.06). Right click Stop on the plots m times to advance the plots and to get the cursor back on the command line in R.

The command out <- mltreg(x,y,indices=c(2)) would produce a MANOVA partial F test corresponding to the F2 test while the command out <- mltreg(x,y,indices=c(2,3,4)) would produce a MANOVA partial F test corresponding to the MANOVA F test for a data set with p = 4 predictor variables. The Hotelling Lawley trace statistic is used in the tests.

out <- mltreg(x,y,indices=c(2,4))

$Bhat

[,1] [,2] [,3]

[1,] 47.96841291 623.2817463 179.8867890

[2,] 0.07884384 0.7276600 -0.5378649

[3,] -1.45584256 -17.3872206 0.2337900

[4,] -0.01895002 0.1393189 -0.3885967

$Covhat

[,1] [,2] [,3]

[1,] 21.91591 123.2557 132.339

[2,] 123.25566 2619.4996 2145.780

[3,] 132.33902 2145.7797 2954.082

$partial

partialF Pval

[1,] 0.7703294 0.6141573

$Ftable

Fj pvals

[1,] 6.30355375 0.01677169

[2,] 1.51013090 0.28449166

[3,] 5.61329324 0.02279833

[4,] 0.06482555 0.97701447

$MANOVA

MANOVAF pval

[1,] 3.150118 0.06038742


#Output for Example 8.2

y<-marry[,c(2,3)]; x<-marry[,-c(2,3)];

mltreg(x,y,indices=c(3,4))

$partial

partialF Pval

[1,] 0.2001622 0.9349877

$Ftable

Fj pvals

[1,] 4.35326807 0.02870083

[2,] 600.57002201 0.00000000

[3,] 0.08819810 0.91597268

[4,] 0.06531531 0.93699302

$MANOVA

MANOVAF pval

[1,] 295.071 1.110223e-16

Example 8.2. The above output is for the Hebbler (1847) data from the 1843 Prussia census. Sometimes if the wife or husband was not at the household, then s/he would not be counted. Y1 = number of married civilian men in the district, Y2 = number of women married to civilians in the district, x2 = population of the district in 1843, x3 = number of married military men in the district, and x4 = number of women married to military men in the district. The reduced model deletes x3 and x4. The constant uses x1 = 1.

a) Do the MANOVA F test.
b) Do the F2 test.
c) Do the F4 test.
d) Do an appropriate 4 step test for the reduced model that deletes x3 and x4.
e) The output for the reduced model that deletes x1 and x2 is shown below. Do an appropriate 4 step test.

$partial

partialF Pval

[1,] 569.6429 0

Solution:
a) i) H0: the nontrivial predictors are not needed in the mreg model
H1: at least one of the nontrivial predictors is needed
ii) F0 = 295.071
iii) pval = 0
iv) Reject H0, the nontrivial predictors are needed in the mreg model.

b) i) H0: x2 is not needed in the model H1: x2 is needed
ii) F2 = 600.57
iii) pval = 0
iv) Reject H0, population of the district is needed in the model.


c) i) H0: x4 is not needed in the model H1: x4 is needed
ii) F4 = 0.065
iii) pval = 0.937
iv) Fail to reject H0, number of women married to military men is not needed in the model given that the other predictors are in the model.

d) i) H0: the reduced model is good H1: use the full model.
ii) FR = 0.200
iii) pval = 0.935
iv) Fail to reject H0, so the reduced model is good.

e) i) H0: the reduced model is good H1: use the full model.
ii) FR = 569.6
iii) pval = 0.00
iv) Reject H0, so use the full model.

8.5 An Example and Simulations

The semiparametric prediction region and parametric MVN prediction region from Section 5.2 applied to the zi are only conjectured to be large sample prediction regions, but are added to the DD plot as visual aids. Cases below the horizontal line that crosses the identity line correspond to the semiparametric region while cases below the horizontal line that ends at the identity line correspond to the parametric MVN region. A vertical line dropped down from this point of intersection does correspond to a large sample prediction region for multivariate normal error vectors. Note that zi = ŷf + ε̂i, and adding a constant ŷf to all of the residual vectors does not change the Mahalanobis distances, so the DD plot of the residual vectors can be used to display the prediction regions.

Fig. 8.1 Scatterplot Matrix of the Mussels Data.

Fig. 8.2 DD Plot of the Mussels Data, MLD Model.

Fig. 8.3 Plots for Y1 = log(S).


Fig. 8.4 Plots for Y2 = log(M).

Fig. 8.5 DD Plot of the Residual Vectors for the Mussels Data.

Example 8.3. Cook and Weisberg (1999a, pp. 351, 433, 447) gave a data set on 82 mussels sampled off the coast of New Zealand. Let Y1 = log(S) and Y2 = log(M) where S is the shell mass and M is the muscle mass. The predictors are X2 = L, X3 = log(W), and X4 = H: the shell length, log(width), and height.

a) First use the multivariate location and dispersion model for this data. Figure 8.1 shows a scatterplot matrix of the data and Figure 8.2 shows a DD plot of the data with multivariate prediction regions added. These plots suggest that the data may come from an elliptically contoured distribution that is not multivariate normal. The semiparametric and nonparametric 90% prediction regions of Section 5.2 consist of the cases below the RD = 5.86 line and to the left of the MD = 4.41 line. These two lines intersect on a line through the origin that is followed by the plotted points. The parametric MVN prediction region is given by the points below the RD = 3.33 line and does not contain enough cases.

b) Now consider the multivariate linear regression model. To check linearity, Figures 8.3 and 8.4 give the response and residual plots for Y1 and Y2. The response plots show strong linear relationships. For Y1, case 79 sticks out while for Y2, cases 8, 25, and 48 are not fit well. Highlighted cases had Cook's distance > min(0.5, 2p/n). See Cook (1977).

To check the error vector distribution, the DD plot should be used instead of univariate residual plots, which do not take into account the correlations of the random variables ε1, ..., εm in the error vector ε. A residual vector ε̂ = (ε̂ − ε) + ε is a combination of ε and a discrepancy ε̂ − ε that tends to have an approximate multivariate normal distribution. The ε̂ − ε term can dominate for small to moderate n when ε is not multivariate normal, incorrectly suggesting that the distribution of the error vector ε is closer to a multivariate normal distribution than is actually the case. Figure 8.5 shows the DD plot of the residual vectors. The plotted points are highly correlated but do not cover the identity line, suggesting an elliptically contoured error distribution that is not multivariate normal. The nonparametric 90% prediction region for the residuals consists of the points to the left of the vertical line MD = 2.60. Comparing Figures 8.2 and 8.5, the residual distribution is closer to a multivariate normal distribution. Cases 8, 48, and 79 have especially large distances.

The four Hotelling Lawley Fj statistics were greater than 5.77 with pvalues less than 0.005, and the MANOVA F statistic was 337.8 with pvalue ≈ 0.


The response, residual, and DD plots are effective for finding influential cases, for checking linearity, for checking whether the error distribution is multivariate normal or some other elliptically contoured distribution, and for displaying the nonparametric prediction region. Note that cases to the right of the vertical line correspond to cases with yi that are not in their prediction region. These are the cases corresponding to residual vectors with large Mahalanobis distances. Adding a constant does not change the distance, so the DD plot for the residual vectors is the same as the DD plot for the zi.

Fig. 8.6 Plots for Y2 = M .

Fig. 8.7 DD Plot When Y2 = M .

c) Now suppose the same model is used except Y2 = M. Then the response and residual plots for Y1 remain the same, but the plots shown in Figure 8.6 show curvature about the identity and r = 0 lines. Hence the linearity condition is violated. Figure 8.7 shows that the plotted points in the DD plot have correlation well less than one, suggesting that the error vector distribution is no longer elliptically contoured. The nonparametric 90% prediction region for the residual vectors consists of the points to the left of the vertical line MD = 2.52, and contains 95% of the training data. Note that the plots can be used to quickly assess whether power transformations have resulted in a linear model, and whether influential cases are present. R code for producing the seven figures is shown below.

y <- log(mussels)[,4:5]

x <- mussels[,1:3]

x[,2] <- log(x[,2])

z<-cbind(x,y)

pairs(z, labels=c("L","log(W)","H","log(S)","log(M)"))

ddplot4(z) #right click Stop

out <- mltreg(x,y) #right click Stop 4 times

ddplot4(out$res) #right click Stop

y[,2] <- mussels[,5]

tem <- mltreg(x,y) #right click Stop 4 times

ddplot4(tem$res) #right click Stop


8.5.1 Simulations for Testing

A small simulation was used to study the Wilks' Λ test, the Pillai's trace test, the Hotelling Lawley trace test, and the Roy's largest root test for the Fj tests and the MANOVA F test for multivariate linear regression. The first row of B was always 1^T and the last row of B was always 0^T. When the null hypothesis for the MANOVA F test is true, all rows but the first row corresponding to the constant are equal to 0^T. When p ≥ 3 and the null hypothesis for the MANOVA F test is false, then the second to last row of B is (1, 0, ..., 0), the third to last row is (1, 1, 0, ..., 0) et cetera as long as the first row is not changed from 1^T. First m × 1 error vectors wi were generated such that the m random variables in the vector wi are iid with variance σ². Let the m × m matrix A = (aij) with aii = 1 and aij = ψ where 0 ≤ ψ < 1 for i ≠ j.

Then εi = A wi so that Σε = σ² A A^T = (σij) where the diagonal entries σii = σ²[1 + (m − 1)ψ²] and the off diagonal entries σij = σ²[2ψ + (m − 2)ψ²]; the simulation used ψ = 0.10. Hence the correlations are (2ψ + (m − 2)ψ²)/(1 + (m − 1)ψ²). As ψ gets close to 1, the error vectors cluster about the line in the direction of (1, ..., 1)^T. We used wi ∼ Nm(0, I), wi ∼ (1 − τ)Nm(0, I) + τ Nm(0, 25I) with 0 < τ < 1 and τ = 0.25 in the simulation, wi ∼ multivariate td with d = 7 degrees of freedom, or wi ∼ lognormal − E(lognormal): where the m components of wi were iid with distribution e^z − E(e^z) where z ∼ N(0, 1). Only the lognormal distribution is not elliptically contoured.
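A minimal base R sketch of this error generation for the multivariate normal case wi ∼ Nm(0, I) is shown below; it is illustrative only and is not the mregsim code.

m <- 5; psi <- 0.10; n <- 300
A <- matrix(psi, m, m); diag(A) <- 1     # a_ii = 1, a_ij = psi for i != j
w <- matrix(rnorm(n*m), n, m)            # rows w_i ~ N_m(0, I), so sigma^2 = 1
eps <- w %*% t(A)                        # rows eps_i = A w_i
Sig <- A %*% t(A)                        # implied Sigma_eps = A A^T (sigma^2 = 1)
Sig[1, 1]                                # sigma_ii = 1 + (m - 1) psi^2
Sig[1, 2]                                # sigma_ij = 2 psi + (m - 2) psi^2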

Table 8.1 Test Coverages: MANOVA F H0 is True.

w dist   n     test   F1     F2     Fp−1   Fp     FM
MVN      300   W      1      0.043  0.042  0.041  0.018
MVN      300   P      1      0.040  0.038  0.038  0.007
MVN      300   HL     1      0.059  0.058  0.057  0.045
MVN      300   R      1      0.051  0.049  0.048  0.993
MVN      600   W      1      0.048  0.043  0.043  0.034
MVN      600   P      1      0.046  0.042  0.041  0.026
MVN      600   HL     1      0.055  0.052  0.050  0.052
MVN      600   R      1      0.052  0.048  0.047  0.994
MIX      300   W      1      0.042  0.043  0.044  0.017
MIX      300   P      1      0.039  0.040  0.042  0.008
MIX      300   HL     1      0.057  0.059  0.058  0.039
MIX      300   R      1      0.050  0.050  0.051  0.993
MVT(7)   300   W      1      0.048  0.036  0.045  0.020
MVT(7)   300   P      1      0.046  0.032  0.042  0.011
MVT(7)   300   HL     1      0.064  0.049  0.058  0.045
MVT(7)   300   R      1      0.055  0.043  0.051  0.993
LN       300   W      1      0.043  0.047  0.040  0.020
LN       300   P      1      0.039  0.045  0.037  0.009
LN       300   HL     1      0.057  0.061  0.058  0.041
LN       300   R      1      0.049  0.055  0.050  0.994


Table 8.2 Test Coverages: MANOVA F H0 is False.

n     m = p   test   F1     F2     Fp−1   Fp     FM
30    5       W      0.012  0.222  0.058  0.000  0.006
30    5       P      0.000  0.000  0.000  0.000  0.000
30    5       HL     0.382  0.694  0.322  0.007  0.579
30    5       R      0.799  0.871  0.549  0.047  0.997
50    5       W      0.984  0.955  0.644  0.017  0.963
50    5       P      0.971  0.940  0.598  0.012  0.871
50    5       HL     0.997  0.979  0.756  0.053  0.991
50    5       R      0.996  0.978  0.744  0.049  1
105   10      W      0.650  0.970  0.191  0.000  0.633
105   10      P      0.109  0.812  0.050  0.000  0.000
105   10      HL     0.964  0.997  0.428  0.000  1
105   10      R      1      1      0.892  0.052  1
150   10      W      1      1      0.948  0.032  1
150   10      P      1      1      0.941  0.025  1
150   10      HL     1      1      0.966  0.060  1
150   10      R      1      1      0.965  0.057  1
450   20      W      1      1      0.999  0.020  1
450   20      P      1      1      0.999  0.016  1
450   20      HL     1      1      0.999  0.035  1
450   20      R      1      1      0.999  0.056  1

The simulation used 5000 runs, and H0 was rejected if the F statistic was greater than F_{d1,d2}(0.95) where P(F_{d1,d2} < F_{d1,d2}(0.95)) = 0.95 with d1 = rm and d2 = n − mp for the test statistics

−[n − p − 0.5(m − r + 3)]/(rm) log(Λ(L)),   [(n − p)/(rm)] V(L),   and   [(n − p)/(rm)] U(L),

while d1 = h = max(r, m) and d2 = n − p − h + r for the test statistic

[(n − p − h + r)/h] λmax(L).

Denote these statistics by W, P, HL, and R. Let the coverage be the proportion of times that H0 is rejected. We want coverage near 0.05 when H0 is true and coverage close to 1 for good power when H0 is false. With 5000 runs, coverage outside of (0.04, 0.06) suggests that the true coverage is not 0.05. Coverages are tabled for the F1, F2, Fp−1, and Fp test and for the MANOVA F test denoted by FM. The null hypothesis H0 was always true for the Fp test and always false for the F1 test. When the MANOVA F test was true, H0 was true for the Fj tests with j ≠ 1. When the MANOVA F test was false, H0 was false for the Fj tests with j ≠ p, but the Fp−1 test should be hardest to reject for j ≠ p by construction of B and the error vectors.

When the null hypothesis H0 was true, simulated values started to get close to nominal levels for n ≥ 0.8(m + p)², and were fairly good for n ≥ 1.5(m + p)². The exception was Roy's test which rejects H0 far too often if r > 1. See Table 8.1 where we want values for the F1 test to be close to 1 since H0 is false for the F1 test, and we want values close to 0.05, otherwise. Roy's test was very good for the Fj tests but very poor for the MANOVA F test. Results are shown for m = p = 10. As expected from Berndt and Savin (1977), Pillai's test rejected H0 less often than Wilks' test which rejected H0 less often than the Hotelling Lawley test. Based on a much larger simulation study in Pelawa Watagoda (2013, pp. 111-112), using the four types of error vector distributions and m = p, the tests had approximately correct level if n ≥ 0.83(m + p)² for the Hotelling Lawley test, if n ≥ 2.80(m + p)² for the Wilks' test (agreeing with the Kshirsagar (1972) n ≥ 3(m + p)² for multivariate normal data), and if n ≥ 4.2(m + p)² for Pillai's test.

In Table 8.2, H0 is only true for the Fp test where p = m, and we want values in the Fp column near 0.05. We want values near 1 for high power otherwise. If H0 is false, often H0 will be rejected for small n. For example, if n ≥ 10p, then the m residual plots should start to look good, and the MANOVA F test should be rejected. For the simulated data, the test had fair power for n not much larger than mp. Results are shown for the lognormal distribution.

Some R output for reproducing the simulation is shown below. The mpack function is mregsim and etype = 1 uses data from a MVN distribution. The fcov line computed the Hotelling Lawley statistic using Equation (8.8) while the hotlawcov line used Definition 8.10. The mnull=T part of the command means we want the first value near 1 for high power and the next three numbers near the nominal level 0.05, except for mancv where we want all of the MANOVA F test statistics to be near the nominal level of 0.05. The mnull=F part of the command means we want all values near 1 for high power except for the last column (for the terms other than mancv) corresponding to the Fp test where H0 is true, so we want values near the nominal level of 0.05. The "coverage" is the proportion of times that H0 is rejected, so "coverage" is short for "power" and "level": we want the coverage near 1 for high power when H0 is false and we want the coverage near the nominal level 0.05 when H0 is true. Also see Problem 8.10.

mregsim(nruns=5000,etype=1,mnull=T)

$wilkcov

[1] 1.0000 0.0450 0.0462 0.0430

$pilcov

[1] 1.0000 0.0414 0.0432 0.0400

$hotlawcov

[1] 1.0000 0.0522 0.0516 0.0490

$roycov

[1] 1.0000 0.0512 0.0500 0.0480

$fcov

[1] 1.0000 0.0522 0.0516 0.0490

$mancv

wcv pcv hlcv rcv fcv


[1,] 0.0406 0.0332 0.049 0.1526 0.049

mregsim(nruns=5000,etype=2,mnull=F)

$wilkcov

[1] 0.9834 0.9814 0.9104 0.0408

$pilcov

[1] 0.9824 0.9804 0.9064 0.0372

$hotlawcov

[1] 0.9856 0.9838 0.9162 0.0480

$roycov

[1] 0.9848 0.9834 0.9156 0.0462

$fcov

[1] 0.9856 0.9838 0.9162 0.0480

$mancv

wcv pcv hlcv rcv fcv

[1,] 0.993 0.9918 0.9942 0.9978 0.9942

8.5.2 Simulations for Prediction Regions

The same type of data and 5000 runs were used to simulate the prediction regions for yf given xf for multivariate regression. With n = 100, m = 2, and p = 4, the nominal coverage of the prediction region is 90%, and 92% of the training data is covered. Following Olive (2013a), consider the prediction region {z : (z − T)^T C^{-1}(z − T) ≤ h²} = {z : D²_z ≤ h²} = {z : Dz ≤ h}.

Then the ratio of the prediction region volumes

[h_i^m √det(Ci)] / [h_2^m √det(C2)]

was recorded where i = 1 was the nonparametric region, i = 2 was the semiparametric region, and i = 3 was the parametric MVN region. Here h1 and h2 were the cutoff D(Un)(Ti, Ci) for i = 1, 2, and h3 = √(χ²_{m,qn}).

If, as conjectured, the RMVN estimator is a consistent estimator when applied to the residual vectors instead of iid data, then the volume ratios converge in probability to 1 if the iid zero mean errors ∼ Nm(0, Σε), and the volume ratio converges to 1 for i = 1 for a large class of elliptically contoured distributions. These volume ratios were denoted by voln and volm for the nonparametric and parametric MVN regions. The coverage was the proportion of times the prediction region contained yf where ncov, scov, and mcov are for the nonparametric, semiparametric, and parametric MVN regions.


Table 8.3 Coverages for 90% Prediction Regions.

w dist   n      m = p   ncov   scov   mcov   voln   volm
MVN      48     2       0.901  0.905  0.888  0.941  0.964
MVN      300    5       0.889  0.887  0.890  1.006  1.015
MVN      1200   10      0.899  0.896  0.896  1.004  1.001
MIX      48     2       0.912  0.927  0.710  0.872  0.097
MIX      300    5       0.906  0.911  0.680  0.882  0.001
MIX      1200   10      0.904  0.911  0.673  0.889  0+
MVT(7)   48     2       0.903  0.910  0.825  0.914  0.646
MVT(7)   300    5       0.899  0.909  0.778  0.916  0.295
MVT(7)   1200   10      0.906  0.911  0.726  0.919  0.061
LN       48     2       0.912  0.926  0.651  0.729  0.090
LN       300    5       0.915  0.917  0.593  0.696  0.009
LN       1200   10      0.912  0.916  0.593  0.679  0+

In the simulations, we took n = 3(m + p)² and m = p. Table 8.3 shows that the coverage of the nonparametric region was close to 0.9 in all cases. The volume ratio voln was fairly close to 1 for the three elliptically contoured distributions. Since the volume of the prediction region is proportional to h^m, the volume can be very small if h is too small and m is large. Parametric prediction regions usually give poor estimates of h when the parametric distribution is misspecified. Hence the parametric MVN region only performed well for multivariate normal data.

Some R output for reproducing the simulation is shown below. The mpack function is mpredsim and etype = 1 uses data from a MVN distribution. The term "ncvr" is "ncov" in Table 8.3. Since up = 0.94, 94% of the training data is covered by the nominal 90% prediction region. Also see Problem 8.11.

mpredsim(nruns=5000,etype=1)

$ncvr

[1] 0.9162

$scvr

[1] 0.916

$mcvr

[1] 0.9138

$voln

[1] 0.9892485

$vols

[1] 1

$volm

[1] 1.004964

$up

[1] 0.94


8.6 Two Robust Estimators

8.6.1 The rmreg Estimator

The classical multivariate linear regression estimator is found from m least squares multiple linear regressions of Yj on the predictors. The first way to make a robust multivariate linear regression estimator is to replace least squares by a robust estimator, such as the m hbreg multiple linear regressions of Yj on the predictors. Olive and Hawkins (2011) showed that the probability the hbreg estimator is equal to the least squares estimator goes to one as n → ∞ for a large class of (univariate) error distributions. See Section 14.4. The class of (univariate) error distributions contains distributions that are not symmetric; however, for a skewed distribution the slope estimates are similar, but the intercept β1 will differ for least squares and hbreg. See the Warning in Section 14.2.

The mpack function rmreg replaces least squares with hbreg to make the first robust multivariate linear regression estimator. Then the probability that the rmreg estimator is equal to the classical multivariate linear regression estimator also goes to 1 on a large class of distributions for the error vector ε, including many elliptically contoured distributions.

Hence the large sample nonparametric prediction region and the large sample Wilks' test, Pillai's test, and Hotelling Lawley test using the robust estimator rmreg are asymptotically equivalent to their analogs using the classical estimator for a large class of error vector distributions.

The rmreg estimator has some useful theory for clean data, but replacing the least squares multiple linear regression estimator by a highly outlier resistant multiple linear regression estimator results in a multivariate linear regression estimator with outlier resistance that is still quite low. For rmreg, the tests are not valid when outliers are present since Σ̂ε uses the outliers.

Fig. 8.8 Plots for Y1 = head length.

Fig. 8.9 Plots for Y2 = height.

Fig. 8.10 DD Plot of the Residual Vectors for the Buxton Data.


Example 8.4. Buxton (1920) gave various measurements of 88 men. Head length and person's height were the response variables while an intercept, nasal height, bigonal breadth, and cephalic index were used as predictors in the multivariate linear regression model. Observation 9 was deleted since it had missing values. Five individuals, numbers 62–66, were reported to be about 0.75 inches tall with head lengths well over five feet! Figure 8.8 shows the response and residual plots corresponding to Y1 for the robust estimator rmreg. The response plot for the classical estimator, not shown, has the identity line tilted slightly above most of the plotted points in the lower part of the plot, while the plotted points in the lower part of the residual plot follow a line with negative slope instead of the r = 0 line. Figure 8.9 shows the response and residual plots corresponding to Y2 for the robust estimator. The response plot for the classical estimator, not shown, has the identity line tilted slightly below most of the plotted points in the upper part of the plot, while the plotted points in the upper part of the residual plot follow a line with negative slope instead of the r = 0 line. Figure 8.10 shows the DD plot. The 90% semiparametric and nonparametric regions use the 95th percentile which is a linear combination of an outlying case with a nonoutlying case. The parametric MVN region contains cases below the RD = 2.448 line, which is obscured by the identity line. The tests of hypotheses for the robust estimator are not robust to outliers because all n = 87 residual vectors are used to make Σ̂ε. As is often the case, outliers can be detected with the plots using the classical or robust estimator.

These three figures can be made with the following R commands.

ht <- buxy

z <- cbind(buxx,ht)

y <- z[,c(1,5)]

x <- z[,2:4]

# compare mltreg(x,y)

out<-rmltreg(x,y) #right click Stop 4 times

out; ddplot4(out$res) #right click Stop

R functions for simulating testing and prediction regions using rmreg are rmregsim and rmpredism. The similar functions for the classical estimator delete the initial "r." The function rmltreg produces output similar to the R function mltreg for the classical estimator, including response and residual plots and output for testing with the rmreg estimator. For prediction, the simulation results for the robust estimator were similar to those for the classical estimator. For testing, the robust estimator needed larger values of n and the tests did not work for the highly skewed lognormal distribution since the robust and classical estimators estimate the m constants (β1j for j = 1, ..., m) differently when the data is skewed. Rupasinghe Arachchige Don (2013) did a large simulation study for testing using rmreg. Results suggested that for the three elliptically contoured distributions used in Section 8.5.1 and m = p, the tests had approximately correct level if n > 150 + (m + p)² for the Hotelling Lawley test, if n > 140 + 3(m + p)² for the Wilks' test, and if n > 90 + 3.6(m + p)² for Pillai's test.

8.6.2 The rmreg2 Estimator

The robust multivariate linear regression estimator rmreg2 is the classical multivariate linear regression estimator applied to the RMVN set when RMVN is computed from the vectors ui = (xi2, ..., xip, Yi1, ..., Yim)^T for i = 1, ..., n. Hence ui is the ith case with xi1 = 1 deleted. This regression estimator has considerable outlier resistance, and is one of the most outlier resistant practical robust regression estimators for the m = 1 multiple linear regression case. See Chapter 14. The rmreg2 estimator has been shown to be consistent if the ui are iid from a large class of elliptically contoured distributions, which is a much stronger assumption than having iid error vectors εi.

First we will review some results for multiple linear regression. Let x = (1, w^T)^T and let

Cov(w) = E[(w − E(w))(w − E(w))^T] = Σw

and Cov(w, Y) = E[(w − E(w))(Y − E(Y))] = ΣwY. Let β = (α, η^T)^T be the population OLS coefficients from the regression of Y on x (w and a constant), where α is the constant and η is the vector of slopes. Let the OLS estimator be β̂ = (α̂, η̂^T)^T. Then the population coefficients from an OLS regression of Y on x are

α = E(Y) − η^T E(w)   and   η = Σw^{-1} ΣwY.    (8.10)

Then the OLS estimator β̂ = (X^T X)^{-1} X^T Y. The sample covariance matrix of w is

Σ̂w = [1/(n − 1)] ∑_{i=1}^n (wi − w̄)(wi − w̄)^T   where the sample mean   w̄ = (1/n) ∑_{i=1}^n wi.

Similarly, define the sample covariance vector of w and Y to be

Σ̂wY = [1/(n − 1)] ∑_{i=1}^n (wi − w̄)(Yi − Ȳ).

Suppose that (Yi, wi^T)^T are iid random vectors such that Σw^{-1} and ΣwY exist. Then

α̂ = Ȳ − η̂^T w̄ →P α

and

η̂ = Σ̂w^{-1} Σ̂wY →P η as n → ∞.

Now for multivariate linear regression, β̂j = (α̂j, η̂j^T)^T where α̂j = Ȳj − η̂j^T w̄ and η̂j = Σ̂w^{-1} Σ̂wYj. Let Σ̂wy = [1/(n − 1)] ∑_{i=1}^n (wi − w̄)(yi − ȳ)^T which has jth column Σ̂wYj for j = 1, ..., m. Let

u = (w^T, y^T)^T,   E(u) = µu = (µw^T, µy^T)^T,   and   Cov(u) = Σu = [ Σww  Σwy ]
                                                                     [ Σyw  Σyy ].

Let the vector of constants be α^T = (α1, ..., αm) and the matrix of slope vectors BS = [η1 η2 . . . ηm]. Then the population least squares coefficient matrix is

B = [ α^T ]
    [ BS  ]

where α = µy − BS^T µw and BS = Σw^{-1} Σwy where Σw = Σww. If the ui are iid with nonsingular covariance matrix Cov(u), the least squares estimator is

B̂ = [ α̂^T ]
    [ B̂S  ]

where α̂ = ȳ − B̂S^T w̄ and B̂S = Σ̂w^{-1} Σ̂wy. The least squares multivariate linear regression estimator can be calculated by computing the classical estimator (ū, Su) = (ū, Σ̂u) of multivariate location and dispersion on the ui, and then plugging the results into the formulas for α̂ and B̂S.

Let (T, C) = (µ̃u, Σ̃u) be a robust estimator of multivariate location and dispersion. If µ̃u is a consistent estimator of µu and Σ̃u is a consistent estimator of c Σu for some constant c > 0, then a robust estimator of multivariate linear regression is the plug in estimator α̃ = µ̃y − B̃S^T µ̃w and B̃S = Σ̃w^{-1} Σ̃wy.

For the rmreg2 estimator, (T, C) is the classical estimator applied to the RMVN set when RMVN is applied to the vectors ui for i = 1, ..., n (could use (T, C) = RMVN estimator since the scaling does not matter for this application). Then (T, C) is a √n consistent estimator of (µu, cΣu) if the ui are iid from a large class of ECd(µu, Σu, g) distributions where d = m + p − 1. Thus the classical and robust estimators of multivariate linear regression are both √n consistent estimators of B if the ui are iid from a large class of elliptically contoured distributions. This assumption is quite strong, but the robust estimator is useful for detecting outliers. When there are categorical predictors or the joint distribution of u is not elliptically contoured, it is possible that the robust estimator is bad and very different from the good classical least squares estimator.
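A minimal base R sketch of the plug in idea is shown below. The classical estimator (colMeans, cov) of multivariate location and dispersion stands in for the estimator applied to the RMVN set, which is not reproduced here; with this choice the plug in estimator reproduces least squares, and a robust (T, C) pair could be swapped in.

set.seed(5)
n <- 200; p <- 3; m <- 2                    # p - 1 = 2 nontrivial predictors
w <- matrix(rnorm(n*(p-1)), n, p-1)
y <- cbind(1 + w[, 1] + rnorm(n), 2 - w[, 2] + rnorm(n))
u <- cbind(w, y)                            # cases u_i = (w_i^T, y_i^T)^T
Tu <- colMeans(u); Cu <- cov(u)             # (T, C): swap in a robust pair here
iw <- 1:(p - 1); iy <- p:(p - 1 + m)        # index the w block and the y block of u
BS   <- solve(Cu[iw, iw]) %*% Cu[iw, iy]    # slopes: Sigma_w^{-1} Sigma_wy
alph <- Tu[iy] - t(BS) %*% Tu[iw]           # constants: mu_y - B_S^T mu_w
Bhat <- rbind(t(alph), BS)                  # (p x m) plug in coefficient matrix
coef(lm(y ~ w))                             # matches Bhat numerically since the classical (T, C) was used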


Fig. 8.11 Plots for Y1 = nasal height using rmreg.

Fig. 8.12 Plots for Y2 = height using rmreg.

Fig. 8.13 Plots for Y1 = nasal height using rmreg2.

Fig. 8.14 Plots for Y2 = height using rmreg2.

Example 8.4, continued. The mpack function rmreg2 computes the rmreg2 estimator and produces the response and residual plots. The plots for the rmreg2 estimator were very similar to Figures 8.8 and 8.9.

Now let Y1 = nasal height and Y2 = height with x2 = head length, x3 = bigonal breadth, and x4 = cephalic index. Then Y2 and x2 have massive outliers. Then the response and residual plots for the classical estimator and the robust estimator rmreg using hbreg were nearly identical. Figures 8.11 and 8.12 show that the fit using rmreg went right through the outliers. Figures 8.13 and 8.14 show that the response and residual plots corresponding to rmreg2 do not have fits that pass through the outliers.

These figures can be made with the following R commands.

ht <- buxy; z <- cbind(buxx,ht);

y <- z[,c(2,5)]; x <- z[,c(1,3,4)]

# compare mltreg(x,y) #right click Stop 4 times

rmltreg(x,y) #right click Stop 4 times

out <- rmreg2(x,y) #right click Stop 4 times

# try ddplot4(out$res) #right click Stop

The residual bootstrap for the test H0 : LB = 0 may be useful. Take a sample of size n with replacement from the residual vectors to form Z*_1 with ith row y*_i^T where y*_i = ŷi + ε̂*_i. The function rmreg3 gets the rmreg2 estimator without the plots. Using rmreg3, regress Z* on X to get vec(LB̂*_1). Repeat B times to get a bootstrap sample w1, ..., wB where wi = vec(LB̂*_i). The nonparametric bootstrap uses n cases drawn with replacement, and may also be useful. Apply the nonparametric prediction region to the wi and see if 0 is in the region. If L is r × p, then w is rm × 1, and we likely need n ≥ max[50rm, 3(m + p)²].
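A minimal base R sketch of the residual bootstrap just described is shown below, with least squares (lm) standing in for the rmreg3 fit and a simple Mahalanobis distance cutoff standing in for the nonparametric prediction region; the settings, data, and 0.95 quantile are illustrative only.

set.seed(6)
n <- 100; p <- 3; m <- 2; Bboot <- 200
X <- cbind(1, matrix(rnorm(n*(p-1)), n, p-1))
Z <- X %*% matrix(c(1,0,0, 1,0,0), p, m) + matrix(rnorm(n*m), n, m)   # H0 true here
fit  <- lm(Z ~ X - 1)                     # least squares stand-in for the rmreg3 fit
Zhat <- fitted(fit); Ehat <- resid(fit)
L <- cbind(0, diag(p - 1))                # H0: the nontrivial predictors are not needed
wmat <- matrix(NA, Bboot, nrow(L)*m)
for (i in 1:Bboot) {
  Zstar <- Zhat + Ehat[sample(1:n, n, replace = TRUE), ]   # resample residual vectors
  Bstar <- solve(t(X) %*% X) %*% t(X) %*% Zstar
  wmat[i, ] <- as.vector(L %*% Bstar)     # w_i = vec(L Bhat*_i)
}
center <- colMeans(wmat); S <- cov(wmat)
d0  <- mahalanobis(rep(0, ncol(wmat)), center, S)          # squared distance of 0
cut <- quantile(mahalanobis(wmat, center, S), 0.95)        # simple 95% cutoff
d0 <= cut                                 # TRUE: 0 is in the region, fail to reject H0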


8.7 Bootstrap

8.7.1 Parametric Bootstrap

8.7.2 Residual Bootstrap

8.7.3 Nonparametric Bootstrap

8.8 Two Set Inference

8.9 Summary

1) The multivariate linear regression model is a special case of the multivariate linear model where at least one predictor variable xj is continuous. The MANOVA model in Chapter 10 is a multivariate linear model where all of the predictors are categorical variables so the xj are coded and are often indicator variables.

2) The multivariate linear regression model yi = B^T xi + εi for i = 1, ..., n has m ≥ 2 response variables Y1, ..., Ym and p predictor variables x1, x2, ..., xp. The ith case is (xi^T, yi^T) = (xi1, xi2, ..., xip, Yi1, ..., Yim). The constant xi1 = 1 is in the model, and is often omitted from the case and the data matrix. The model is written in matrix form as Z = XB + E. The model has E(εk) = 0 and Cov(εk) = Σε = (σij) for k = 1, ..., n. Also E(ei) = 0 while Cov(ei, ej) = σij In for i, j = 1, ..., m. Then B and Σε are unknown matrices of parameters to be estimated, and E(Z) = XB while E(Yij) = xi^T βj.

3) Each response variable in a multivariate linear regression model follows a multiple linear regression model Yj = Xβj + ej for j = 1, ..., m where it is assumed that E(ej) = 0 and Cov(ej) = σjj In.

4) For each variable Yk make a response plot of Ŷik versus Yik and a residual plot of Ŷik versus rik = Yik − Ŷik. If the multivariate linear regression model is appropriate, then the plotted points should cluster about the identity line in each of the m response plots. If outliers are present or if the plot is not linear, then the current model or data need to be transformed or corrected. If the model is good, then each of the m residual plots should be ellipsoidal with no trend and should be centered about the r = 0 line. There should not be any pattern in the residual plot: as a narrow vertical strip is moved from left to right, the behavior of the residuals within the strip should show little change. Outliers and patterns such as curvature or a fan shaped plot are bad.

5) Make a scatterplot matrix of Y1, ..., Ym and of the continuous predictors. Use power transformations to remove strong nonlinearities.


6) Consider testing LB = 0 where L is an r × p full rank matrix. Let W_e = Ê^T Ê and W_e/(n − p) = Σ̂ε. Let H = B̂^T L^T [L(X^T X)^{-1}L^T]^{-1} L B̂. Let λ1 ≥ λ2 ≥ · · · ≥ λm be the ordered eigenvalues of W_e^{-1}H. Then there are four commonly used test statistics.
The Wilks' Λ statistic is Λ(L) = |(H + W_e)^{-1}W_e| = |W_e^{-1}H + I|^{-1} = ∏_{i=1}^m (1 + λi)^{-1}.
The Pillai's trace statistic is V(L) = tr[(H + W_e)^{-1}H] = ∑_{i=1}^m λi/(1 + λi).
The Hotelling-Lawley trace statistic is U(L) = tr[W_e^{-1}H] = ∑_{i=1}^m λi.
The Roy's maximum root statistic is λmax(L) = λ1.

7) Theorem: The Hotelling-Lawley trace statistic

U(L) = [1/(n − p)] [vec(LB̂)]^T [Σ̂ε^{-1} ⊗ (L(X^T X)^{-1}L^T)^{-1}] [vec(LB̂)].

8) Assumption D1: Let hi be the ith diagonal element of X(X^T X)^{-1}X^T. Assume max(h1, ..., hn) →P 0 as n → ∞, assume that the zero mean iid error vectors have finite fourth moments, and assume that (1/n) X^T X →P W^{-1}.

9) Multivariate Least Squares Central Limit Theorem (MLS CLT): For the least squares estimator, if assumption D1 holds, then Σ̂ε is a √n consistent estimator of Σε, and √n vec(B̂ − B) →D Npm(0, Σε ⊗ W).

10) Theorem: If assumption D1 holds and if H0 is true, then (n − p)U(L) →D χ²_rm.

11) Under regularity conditions, −[n − p + 1 − 0.5(m − r + 3)] log(Λ(L)) →D χ²_rm, (n − p)V(L) →D χ²_rm, and (n − p)U(L) →D χ²_rm. These statistics are robust against nonnormality.

12) For the Wilks' Lambda test,

pval = P( −[n − p + 1 − 0.5(m − r + 3)]/(rm) log(Λ(L)) < F_{rm, n−rm} ).

For the Pillai's trace test,

pval = P( [(n − p)/(rm)] V(L) < F_{rm, n−rm} ).

For the Hotelling Lawley trace test,

pval = P( [(n − p)/(rm)] U(L) < F_{rm, n−rm} ).

The above three tests are large sample tests, P(reject H0 | H0 is true) → δ as n → ∞, under regularity conditions.

13) The 4 step MANOVA F test of hypotheses uses L = [0 I_{p−1}].
i) State the hypotheses H0: the nontrivial predictors are not needed in the mreg model H1: at least one of the nontrivial predictors is needed.
ii) Find the test statistic F0 from output.
iii) Find the pval from output.
iv) If pval ≤ δ, reject H0. If pval > δ, fail to reject H0. If H0 is rejected, conclude that there is a mreg relationship between the response variables Y1, ..., Ym and the predictors x2, ..., xp. If you fail to reject H0, conclude that there is not a mreg relationship between Y1, ..., Ym and the predictors x2, ..., xp. (Get the variable names from the story problem.)

14) The 4 step Fj test of hypotheses uses Lj = [0, ..., 0, 1, 0, ..., 0] where the 1 is in the jth position. Let Bj^T be the jth row of B. The hypotheses are equivalent to H0 : Bj^T = 0 H1 : Bj^T ≠ 0.
i) State the hypotheses H0: xj is not needed in the model H1: xj is needed in the model.
ii) Find the test statistic Fj from output.
iii) Find pval from output.
iv) If pval ≤ δ, reject H0. If pval > δ, fail to reject H0. Give a nontechnical sentence restating your conclusion in terms of the story problem. If H0 is rejected, then conclude that xj is needed in the mreg model for Y1, ..., Ym. If you fail to reject H0, then conclude that xj is not needed in the mreg model for Y1, ..., Ym given that the other predictors are in the model.

15) The 4 step MANOVA partial F test of hypotheses has a full model using all of the variables and a reduced model where r of the variables are deleted. The ith row of L has a 1 in the position corresponding to the ith variable to be deleted. Omitting the jth variable corresponds to the Fj test while omitting variables x2, ..., xp corresponds to the MANOVA F test.
i) State the hypotheses H0: the reduced model is good H1: use the full model.
ii) Find the test statistic FR from output.
iii) Find the pval from output.
iv) If pval ≤ δ, reject H0 and conclude that the full model should be used. If pval > δ, fail to reject H0 and conclude that the reduced model is good.

16) The 4 step MANOVA F test should reject H0 if the response and residual plots look good, n is large enough, and at least one response plot does not look like the corresponding residual plot. A response plot for Yj will look like a residual plot if the identity line appears almost horizontal, hence the range of Ŷj is small.

17) The mpack function mltreg produces the m response and residual plots, gives B̂, Σ̂ε, the MANOVA partial F test statistic and pval corresponding to the reduced model that leaves out the variables given by indices (so x2 and x4 in the output below with F = 0.77 and pval = 0.614), Fj and the pval for the Fj test for variables 1, 2, ..., p (where p = 4 in the output below so F2 = 1.51 with pval = 0.284), and F0 and pval for the MANOVA F test (in the output below F0 = 3.15 and pval = 0.06). The command out <- mltreg(x,y,indices=c(2)) would produce a MANOVA partial F test corresponding to the F2 test while the command out <- mltreg(x,y,indices=c(2,3,4)) would produce a MANOVA partial F test corresponding to the MANOVA F test for a data set with p = 4 predictor variables. The Hotelling Lawley trace statistic is used in the tests.

out <- mltreg(x,y,indices=c(2,4))

$Bhat
[,1] [,2] [,3]

[1,] 47.96841291 623.2817463 179.8867890

[2,] 0.07884384 0.7276600 -0.5378649

[3,] -1.45584256 -17.3872206 0.2337900

[4,] -0.01895002 0.1393189 -0.3885967

$Covhat

[,1] [,2] [,3]

[1,] 21.91591 123.2557 132.339

[2,] 123.25566 2619.4996 2145.780

[3,] 132.33902 2145.7797 2954.082

$partial

partialF Pval

[1,] 0.7703294 0.6141573

$Ftable

Fj pvals

[1,] 6.30355375 0.01677169

[2,] 1.51013090 0.28449166

[3,] 5.61329324 0.02279833

[4,] 0.06482555 0.97701447

$MANOVA

MANOVAF pval

[1,] 3.150118 0.06038742

18) Given B̂ = [β̂1 β̂2 · · · β̂m] and xf, find ŷf = (ŷ1, ..., ŷm)^T where ŷi = β̂i^T xf.

19) Σ̂ε = Ê^T Ê/(n − p) = [1/(n − p)] ∑_{i=1}^n ε̂i ε̂i^T while the sample covariance matrix of the residuals is Sr = [(n − p)/(n − 1)] Σ̂ε = Ê^T Ê/(n − 1). Both Σ̂ε and Sr are √n consistent estimators of Σε for a large class of distributions for the error vectors εi.

20) The 100(1 − δ)% nonparametric prediction region for yf given xf is the nonparametric prediction region from Section 5.2 applied to zi = ŷf + ε̂i = B̂^T xf + ε̂i for i = 1, ..., n. This takes the data cloud of the n residual vectors ε̂i and centers the cloud at ŷf. Let

D²i(ŷf, Sr) = (zi − ŷf)^T Sr^{-1} (zi − ŷf)

for i = 1, ..., n. Let qn = min(1 − δ + 0.05, 1 − δ + m/n) for δ > 0.1 and

qn = min(1 − δ/2, 1 − δ + 10δm/n), otherwise.


If qn < 1 − δ + 0.001, set qn = 1 − δ. Let 0 < δ < 1 and h = D(Un) where D(Un) is the qnth sample quantile of the Di. The 100(1 − δ)% nonparametric prediction region for yf is

{y : (y − ŷf)^T Sr^{-1} (y − ŷf) ≤ D(Un)^2} = {y : Dy(ŷf, Sr) ≤ D(Un)}.

a) Consider the n prediction regions for the data where (yf,i, xf,i) = (yi, xi) for i = 1, ..., n. If the order statistic D(Un) is unique, then Un of the n prediction regions contain yi, where Un/n → 1 − δ as n → ∞.

b) If (ŷf, Sr) is a consistent estimator of (E(yf), Σε), then the nonparametric prediction region is a large sample 100(1 − δ)% prediction region for yf.

c) If (ŷf, Sr) is a consistent estimator of (E(yf), Σε), and the εi come from an elliptically contoured distribution such that the unique highest density region is {y : Dy(0, Σε) ≤ D1−δ}, then the nonparametric prediction region is asymptotically optimal.
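A minimal R sketch of this region, assuming E is the n × m matrix of residual vectors from the fit and yfhat is the point prediction for xf (the names E, yfhat, and delta are illustrative):

nonparamPR <- function(E, yfhat, delta = 0.1) {
  n <- nrow(E); m <- ncol(E)
  Sr <- cov(E)                              # sample covariance matrix of the residuals
  D <- sqrt(mahalanobis(E, rep(0, m), Sr))  # D_i, since z_i - yfhat equals the ith residual
  qn <- ifelse(delta > 0.1,
               min(1 - delta + 0.05, 1 - delta + m/n),
               min(1 - delta/2, 1 - delta + 10*delta*m/n))
  if (qn < 1 - delta + 0.001) qn <- 1 - delta
  cutoff <- quantile(D, qn, type = 1)       # D_(Un), a sample quantile of the D_i
  list(center = yfhat, Sr = Sr, cutoff = cutoff)
}
# a candidate y lies in the region returned by out <- nonparamPR(E, yfhat) when
# sqrt(mahalanobis(y, out$center, out$Sr)) <= out$cutoff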

21) On the DD plot for the residual vectors, the cases to the left of the vertical line correspond to cases that would have yf = yi in the nonparametric prediction region if xf = xi, while the cases to the right of the line would not have yf = yi in the nonparametric prediction region.

22) The DD plot for the residual vectors is interpreted almost exactly as a DD plot for iid multivariate data is interpreted. Plotted points clustering about the identity line suggest that the εi may be iid from a multivariate normal distribution, while plotted points that cluster about a line through the origin with slope greater than 1 suggest that the εi may be iid from an elliptically contoured distribution that is not MVN. The semiparametric and parametric MVN prediction regions correspond to horizontal lines on the DD plot. Robust distances have not been shown to be consistent estimators of the population distances, but are useful for a graphical diagnostic.
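A rough sketch of such a DD plot for the residual vectors, assuming E is the n × m residual matrix; the mpack function ddplot4 is the tool used in this chapter, and MASS::cov.rob is substituted here only to illustrate the robust distances:

library(MASS)
MD <- sqrt(mahalanobis(E, colMeans(E), cov(E)))   # classical Mahalanobis distances
rob <- cov.rob(E, method = "mcd")                 # a robust location and dispersion estimator
RD <- sqrt(mahalanobis(E, rob$center, rob$cov))   # robust distances
plot(MD, RD); abline(0, 1)                        # clustering about the identity line suggests MVN errors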

23) The robust multivariate linear regression method rmreg replaces least squares with the hbreg estimator. The probability that the rmreg estimator equals the classical estimator goes to 1 as n → ∞ for a large class of error distributions. Hence the hypothesis tests and nonparametric prediction regions for the classical method can be applied to the robust method. The entries of B̂ are hard to drive to ±∞ for the robust estimator, and the residuals corresponding to outliers are sometimes large. Since the residuals are used to compute Σ̂ε, the tests of hypotheses based on the robust estimator are not robust to the presence of outliers.

24) The robust multivariate linear regression method rmreg2 computes the classical estimator on the RMVN set where RMVN is computed from the n cases ui = (xi2, ..., xip, Yi1, ..., Yim)^T. This estimator has considerable outlier resistance, but theory currently needs very strong assumptions. The response and residual plots and DD plot of the residuals from this estimator


are useful for outlier detection. The rmreg2 estimator is superior to the rmreg estimator for outlier detection.

8.10 Complements

Multivariate linear regression is a semiparametric method that is nearly as easy to use as multiple linear regression if m is small. The material on plots and testing followed Olive et al. (2015) closely. The m response and residual plots should be made as well as the DD plot, and the response and residual plots are very useful for the m = 1 case of multiple linear regression and experimental design. These plots speed up the model building process for multivariate linear models since the success of power transformations achieving linearity can be quickly assessed, and influential cases can be quickly detected. See Cook and Olive (2001). Work is needed on variable selection and on determining the sample sizes for when the tests and prediction regions start to work well. Response and residual plots can look good for n ≥ 10p, but for testing and prediction regions, we may need n ≥ a(m + p)^2 where 0.8 ≤ a ≤ 5 even for well behaved elliptically contoured error distributions. Cook and Setodji (2003) used the FF plot.

Often observations (Y1, ..., Ym, x2, ..., xp) are collected on the same person or thing and hence are correlated. If transformations can be found such that the m response plots and residual plots look good, and n is large (n ≥ max[(m + p)^2, mp + 30] starts to give good results), then multivariate linear regression can be used to efficiently analyze the data. Examining m separate multiple linear regressions is an incorrect method for analyzing the data.
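In base R such a model can be fit with lm and a matrix response; a minimal sketch (the data frame dat and its column names are illustrative, while the mpack function mltreg supplies the plots and tests used in this chapter):

fit <- lm(cbind(Y1, Y2, Y3) ~ x2 + x3 + x4, data = dat)
coef(fit)                                  # the p x m coefficient matrix Bhat
crossprod(resid(fit)) / (nrow(dat) - 4)    # Sigma_eps hat = E'E/(n - p) with p = 4 here
anova(fit, test = "Hotelling-Lawley")      # MANOVA-type tests for the predictors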

In addition to robust estimators and seemingly unrelated regressions, envelope estimators and partial least squares (PLS) are competing methods for multivariate linear regression. See recent work by Cook such as Cook and Su (2013), Cook et al. (2013), and Su and Cook (2012). Methods like ridge regression and lasso can also be extended to multivariate linear regression. See, for example, Obozinski et al. (2011). Prediction regions for alternative methods with n >> p could be made following Section 8.3.

Section 8.3 follows Olive (2017b) closely. Consider the model yi = E(yi|xi) + εi = m(xi) + εi for i = 1, ..., n. A practical method for producing a prediction region for yf is to create pseudodata ŷf + ε̂1, ..., ŷf + ε̂n using the residual vectors ε̂i and the predicted value ŷf. Then apply prediction region (4.12) to the pseudodata, but modify c = kn = ⌈n(1 − δ)⌉ so that the coverage is better for moderate samples. Often the nonparametric prediction region (5.17) will work.

There is little competition for the nonparametric prediction region for multivariate regression if m > 1. For m = 1 and multiple linear regression, the prediction intervals of Olive (2007) based on the shorth should be shorter when the error distribution is not symmetric. The parametric MVN region


works if the errors εi come from a MVN distribution, but this region tends to have volume that is too small if the error distribution is not MVN. For m not much larger than two, the Lei et al. (2013) prediction region or the Hyndman (1996) prediction region (5.8) could be used on the pseudodata zi = ŷf + ε̂i, possibly modifying (5.8) with a value of δn that increases to δ as n → ∞. For m = 1, a similar procedure is done by Lei and Wasserman (2014).

Some robust estimators are described by Wilcox (2009) and Rousseeuw et al. (2004), where the practical FLTS and FMCD methods are not yet backed by theory and should be replaced by the methods in Section 8.6. Plugging in robust dispersion estimators in place of the covariance matrices, as done in Section 8.6.2, is not a new idea. Maronna and Morgenthaler (1986) used M–estimators when m = 1. Problems can occur if the error distribution is not elliptically contoured. See Nordhausen and Tyler (2015).

If m = 1 and n is not much larger than p, then Hoffman et al. (2015) gave a robust Partial Least Squares–Lasso type estimator that uses a clever weighting scheme.

The R software was used to make the plots and software. See R Core Team (2016). The function mpredsim was used to simulate the prediction regions, mregsim was used to simulate the tests of hypotheses, and mregddsim simulated the DD plots for various distributions. The function mltreg makes the response and residual plots and computes the Fj, MANOVA F, and MANOVA partial F test pvalues, while the function ddplot4 makes the DD plots.

Variable selection, also called subset or model selection, is the search for a subset of predictor variables that can be deleted without important loss of information. There is the full model x = (xI^T, xO^T)^T where xI is a candidate submodel. It is crucial to verify that a multivariate regression model is appropriate for the full model. For each of the m response variables, use the response plot and the residual plot for the full model to check this assumption. Variable selection for multivariate linear regression is discussed in Fujikoshi et al. (2014). R programs are needed to make variable selection easy. Forward selection would be especially useful.

To do crude variable selection, fit the model, leave out the variable with the largest Fj test pvalue > 0.1, fit the model again, and repeat. The statistic Cp(I) = (p − k)(FI − 1) + k may also be useful. Here p is the number of variables in the full model, k is the number of variables in the candidate model I, and FI is the MANOVA partial F statistic for testing whether the p − k variables xO (in the full model but not in the candidate model I) can be deleted. Models that have Cp(I) ≤ k are certainly interesting. Check the final submodel xI for multivariate linear regression with the FF and RR plots, and the response and residual plots for the full model and for the candidate model for each of the m response variables Y1, ..., Ym. The submodels use YIj for j = 1, ..., m.


If n < 10p, do forward selection until there are J ≈ n/10 predictors. Check that the model with J predictors is reasonable. Then compute Cp(I) for each model considered in the forward selection.
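A minimal sketch of the Cp(I) statistic above in R, assuming partialF is the MANOVA partial F statistic returned by mltreg(x, y, indices = ...)$partial for the p − k deleted variables:

CpI <- function(partialF, p, k) (p - k) * (partialF - 1) + k
# e.g. for the output earlier in this chapter with p = 4, k = 2, partial F = 0.7703:
CpI(0.7703294, p = 4, k = 2)   # about 1.54 <= k = 2, so the submodel is interesting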

The theory for multivariate linear regression assumes that the model is known before gathering data. If variable selection and response transformations are performed to build a model, then the estimators are biased and results for inference fail to hold in that pvalues and coverage of confidence and prediction regions will be wrong. When m = 1, see, for example, Berk (1978), Copas (1983), Miller (1984), and Rencher and Pun (1980). Hence it is a good idea to do a pilot study to suggest which transformations and variables to use. Then do a larger study (without using variable selection) using variables suggested by the pilot study.

Khattree and Naik (1999, pp. 91-98) discussed testing H0: LBM = 0 versus H1: LBM ≠ 0 where M = I gives a linear test of hypotheses. Johnstone and Nadler (2017) gave useful approximations for Roy's largest root test when the error vector distribution is multivariate normal.

8.11 Problems

PROBLEMS WITH AN ASTERISK * ARE ESPECIALLY USEFUL.

8.1∗. Consider the Hotelling Lawley test statistic. Let

T(W) = n [vec(LB̂)]^T [Σ̂ε^{-1} ⊗ (LWL^T)^{-1}] [vec(LB̂)].

Let X^T X/n = W^{-1}. Show that

T(W) = [vec(LB̂)]^T [Σ̂ε^{-1} ⊗ (L(X^T X)^{-1}L^T)^{-1}] [vec(LB̂)].

8.2. Consider the Hotelling Lawley test statistic. Let

T = [vec(LB̂)]^T [Σ̂ε^{-1} ⊗ (L(X^T X)^{-1}L^T)^{-1}] [vec(LB̂)].

Let L = Lj = [0, ..., 0, 1, 0, ..., 0] have a 1 in the jth position. Let b̂j^T = Lj B̂ be the jth row of B̂. Let dj = Lj (X^T X)^{-1} Lj^T = (X^T X)^{-1}_{jj}, the jth diagonal entry of (X^T X)^{-1}. Then Tj = (1/dj) b̂j^T Σ̂ε^{-1} b̂j. The Hotelling Lawley statistic

U = tr([(n − p)Σ̂ε]^{-1} B̂^T L^T [L(X^T X)^{-1}L^T]^{-1} LB̂).

Hence if L = Lj, then Uj = [1/(dj(n − p))] tr(Σ̂ε^{-1} b̂j b̂j^T).

Using tr(ABC) = tr(CAB) and tr(a) = a for scalar a, show that (n − p)Uj = Tj.

8.3. Consider the Hotelling Lawley test statistic. Using the Searle (1982, p. 333) identity

tr(AG^T DGC) = [vec(G)]^T [CA ⊗ D^T] [vec(G)],

show that

(n − p)U(L) = tr[Σ̂ε^{-1} B̂^T L^T [L(X^T X)^{-1}L^T]^{-1} LB̂] = [vec(LB̂)]^T [Σ̂ε^{-1} ⊗ (L(X^T X)^{-1}L^T)^{-1}] [vec(LB̂)]

by identifying A, G, D, and C.

#Output for problem 8.4.
$Ftable

Fj pvals

[1,] 82.147221 0.000000e+00

[2,] 58.448961 0.000000e+00

[3,] 15.700326 4.258563e-09

[4,] 9.072358 1.281220e-05

[5,] 45.364862 0.000000e+00

$MANOVA

MANOVAF pval

[1,] 67.80145 0

8.4. The output above is for the R Seatbelts data set where Y1 = drivers = number of drivers killed or seriously injured, Y2 = front = number of front seat passengers killed or seriously injured, and Y3 = back = number of back seat passengers killed or seriously injured. The predictors were x2 = kms = distance driven, x3 = price = petrol price, x4 = van = number of van drivers killed, and x5 = law = 0 if the law was in effect that month and 1 otherwise. The data consists of 192 monthly totals in Great Britain from January 1969 to December 1984, and the compulsory wearing of seat belts law was introduced in February 1983.

a) Do the MANOVA F test.

b) Do the F4 test.

8.5. a) Sketch a DD plot of the residual vectors ε̂i for the multivariate linear regression model if the error vectors εi are iid from a multivariate normal distribution.
b) Does the DD plot change if the one way MANOVA model is used instead of the multivariate linear regression model?

8.6. The output below is for the R judge ratings data set consisting of lawyer ratings for n = 43 judges. Y1 = oral = sound oral rulings, Y2 = writ = sound written rulings, and Y3 = rten = worthy of retention. The predictors were x2 = cont = number of contacts of lawyer with judge, x3 = intg = judicial integrity, x4 = dmnr = demeanor, x5 = dilg = diligence, x6 = cfmg = case flow managing, x7 = deci = prompt decisions, x8 = prep = preparation for trial, x9 = fami = familiarity with law, and x10 = phys = physical ability.

a) Do the MANOVA F test.

b) Do the MANOVA partial F test for the reduced model that deletes x2, x5, x6, x7, and x8.

y<-USJudgeRatings[,c(9,10,12)] #See problem 8.6.

x<-USJudgeRatings[,-c(9,10,12)]

mltreg(x,y,indices=c(2,5,6,7,8))

$partial

partialF Pval

[1,] 1.649415 0.1855314

$MANOVA

MANOVAF pval

[1,] 340.1018 1.121325e-14

8.7. Let β̂i be p × 1 and suppose

( β̂1 − β1 )             ( ( 0 )   [ σ11 (X^T X)^{-1}   σ12 (X^T X)^{-1} ] )
( β̂2 − β2 )  ~  N_2p (  ( 0 ) ,  [ σ21 (X^T X)^{-1}   σ22 (X^T X)^{-1} ] ).

Find the distribution of

[L 0] ( β̂1 − β1 )
      ( β̂2 − β2 )  =  L(β̂1 − β1)  =  Lβ̂1

where Lβ1 = 0 and L is r × p with r ≤ p. Simplify.

R Problems

Warning: Use the command source("G:/mpack.txt") to download the programs. See Preface or Section 11.1. Typing the name of the mpack function, e.g. ddplot, will display the code for the function. Use the args command, e.g. args(ddplot), to display the needed arguments for the function. For some of the following problems, the R commands can be copied and pasted from (http://lagrange.math.siu.edu/Olive/mrsashw.txt) into R.

8.8. This problem examines multivariate linear regression on the Cook and Weisberg (1999a) mussels data with Y1 = log(S) and Y2 = log(M) where S is the shell mass and M is the muscle mass. The predictors are X2 = L, X3 = log(W), and X4 = H: the shell length, log(width), and height.

a) The R command for this part makes the response and residual plots for each of the two response variables. Click the rightmost mouse button and highlight Stop to advance the plot. When you have the response and residual plots for one variable on the screen, copy and paste the two plots into Word. Do this two times, once for each response variable. The plotted points fall in roughly evenly populated bands about the identity or r = 0 line.


b) Copy and paste the output produced from the R command for this part from $partial on. This gives the output needed to do the MANOVA F test, MANOVA partial F test, and the Fj tests.

c) The R command for this part makes a DD plot of the residual vectors and adds the lines corresponding to the three prediction regions of Section 8.3. The robust cutoff is larger than the semiparametric cutoff. Place the plot in Word. Do the residual vectors appear to follow a multivariate normal distribution? (Right click Stop once.)

d) Do the MANOVA partial F test where the reduced model deletes X3 and X4.
e) Do the F2 test.
f) Do the MANOVA F test.

8.9. This problem examines multivariate linear regression on the SAS Institute (1985, p. 146) Fitness Club Data with Y1 = chinups, Y2 = situps, and Y3 = jumps. The predictors are X2 = weight, X3 = waist, and X4 = pulse.

a) The R command for this part makes the response and residual plots for each of the three response variables. Click the rightmost mouse button and highlight Stop to advance the plot. When you have the response and residual plots for one variable on the screen, copy and paste the two plots into Word. Do this three times, once for each response variable. Are there any outliers?

b) The R command for this part makes a DD plot of the residual vectors and adds the lines corresponding to the three prediction regions of Section 8.3. The robust cutoff is larger than the semiparametric cutoff. Place the plot in Word. Are there any outliers? (Right click Stop once.)

8.10. This problem uses the mpack function mregsim to simulate the Wilks' Λ test, Pillai's trace test, Hotelling Lawley trace test, and Roy's largest root test for the Fj tests and the MANOVA F test for multivariate linear regression. When mnull = T the first row of B is 1^T while the remaining rows are equal to 0^T. Hence the null hypothesis for the MANOVA F test is true. When mnull = F the null hypothesis is true for p = 2, but false for p > 2. Now the first row of B is 1^T and the last row of B is 0^T. If p > 2, then the second to last row of B is (1, 0, ..., 0), the third to last row is (1, 1, 0, ..., 0) et cetera as long as the first row is not changed from 1^T. First m iid errors zi are generated such that the m errors are iid with variance σ^2. Then εi = Azi so that Σε = σ^2 AA^T = (σij) where the diagonal entries σii = σ^2[1 + (m − 1)ψ^2] and the off diagonal entries σij = σ^2[2ψ + (m − 2)ψ^2] where ψ = 0.10. Terms like Wilkcov give the percentage of times the Wilks' test rejected the F1, F2, ..., Fp tests. The $mancv wcv pcv hlcv rcv fcv output gives the percentage of times the 4 test statistics reject the MANOVA F test. Here hlcov and fcov both correspond to the Hotelling Lawley test using the formulas in Problem 8.3.
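A small R sketch of the error construction just described (A has unit diagonal and off-diagonal entries ψ, so σ^2 AA^T has the stated entries; the values of m, psi, sigma, and n below are illustrative):

m <- 2; psi <- 0.10; sigma <- 1; n <- 20
A <- matrix(psi, m, m); diag(A) <- 1
z <- matrix(rnorm(n * m, sd = sigma), n, m)   # rows are the m iid errors z_i
eps <- z %*% t(A)                             # rows are eps_i = A z_i
sigma^2 * A %*% t(A)  # diagonal sigma^2*(1+(m-1)*psi^2), off diagonal sigma^2*(2*psi+(m-2)*psi^2)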

5000 runs will be used so the simulation may take several minutes. Sample sizes n = (m + p)^2, n = 3(m + p)^2, and n = 4(m + p)^2 were interesting. We want coverage near 0.05 when H0 is true and coverage close to 1 for good power when H0 is false. Multivariate normal errors were used in a) and b) below.

a) Copy the coverage parts of the output produced by the R commands for this part where n = 20, m = 2, and p = 4. Here H0 is true except for the F1 test. Wilks' and Pillai's tests had low coverage < 0.05 when H0 was false. Roy's test was good for the Fj tests, but why was Roy's test bad for the MANOVA F test?

b) Copy the coverage parts of the output produced by the R commands for this part where n = 20, m = 2, and p = 4. Here H0 is false except for the F4 test. Which two tests seem to be the best for this part?

8.11. This problem uses the mpack function mpredsim to simulate the prediction regions for yf given xf for multivariate regression. With 5000 runs this simulation may take several minutes. The R command for this problem generates iid lognormal errors then subtracts the mean, producing zi. Then the εi = Azi are generated as in Problem 8.10 with n = 100, m = 2, and p = 4. The nominal coverage of the prediction region is 90%, and 92% of the training data is covered. The ncvr output gives the coverage of the nonparametric region. What was ncvr?


Chapter 9

One Way MANOVA Type Models

9.1 Two Set Inference

9.2 Summary

9.3 Complements

9.4 Problems


Chapter 10

1D Regression

... estimates of the linear regression coefficients are relevant to the linear parameters of a broader class of models than might have been suspected.

Brillinger (1977, p. 509)

After computing β̂, one may go on to prepare a scatter plot of the points (β̂xj, yj), j = 1, ..., n and look for a functional form for g(·).

Brillinger (1983, p. 98)

Regression is the study of the conditional distribution Y|x of the response Y given the k × 1 vector of nontrivial predictors x. The scalar Y is a random variable and x is a random vector. In Chapter 2, a special case of regression was the multiple linear regression model Yi = wi,1 η1 + wi,2 η2 + · · · + wi,p ηp + ei = wi^T η + ei for i = 1, ..., n. In this chapter, the subscript i is often suppressed and the multiple linear regression model is written as Y = α + x1 β1 + · · · + xk βk + e = α + β^T x + e where k = p − 1. The primary difference is the separation of the constant term α and the nontrivial predictors x. In Chapter 2, wi,1 ≡ 1 for i = 1, ..., n. Taking Y = Yi, α = η1, βj = ηj+1, xj = wi,j+1, and e = ei for j = 1, ..., k = p − 1 shows that the two models are equivalent. The change in notation was made because the distribution of the nontrivial predictors is very important for the theory of the more general regression models.

Definition 10.1: In a 1D regression model, Y is conditionally independent of x given the sufficient predictor SP = h(x), written

Y ⊥ x | h(x),    (10.1)

where the real valued function h : Rp → R.

This chapter will primarily consider 1D regression models where h(x) = α + β^T x. An important 1D regression model, introduced by Li and Duan (1989), has the form

Y = g(α + β^T x, e)    (10.2)


where g is a bivariate (inverse link) function and e is a zero mean error that is independent of x. The constant term α may be absorbed by g if desired.

Special cases of the 1D regression model (10.1) include many important generalized linear models (GLMs) and the additive error single index model

Y = m(α + β^T x) + e.    (10.3)

Typically m is the conditional mean or median function. For example, if all of the expectations exist, then

E[Y|x] = E[m(α + β^T x)|x] + E[e|x] = m(α + β^T x).

The multiple linear regression model is an important special case where m is the identity function: m(α + β^T x) = α + β^T x. Another important special case of 1D regression is the response transformation model where

g(α + β^T x, e) = t^{-1}(α + β^T x + e)    (10.4)

and t^{-1} is a one to one (typically monotone) function. Hence

t(Y) = α + β^T x + e.

If Yi is an observed survival time, then many survival regression models, including the Cox (1972) proportional hazards model, are 1D regression models.

Definition 10.2. Regression is the study of the conditional distribution of Y|x. Focus is often on the mean function E(Y|x) and/or the variance function VAR(Y|x). There is a distribution for each value of x = xo such that Y|x = xo is defined. For a 1D regression with h(x) = β^T x,

E(Y|x = xo) = E(Y|β^T x = β^T xo) ≡ M(β^T xo) and

VAR(Y|x = xo) = VAR(Y|β^T x = β^T xo) ≡ V(β^T xo)

where M is the kernel mean function and V is the kernel variance function.

Notice that the mean and variance functions depend on the same linear combination if the 1D regression model is valid. This dependence is typical of GLMs where M and V are known kernel mean and variance functions that depend on the family of GLMs. See Cook and Weisberg (1999a, section 23.1). A heteroscedastic regression model

Y = M(β1^T x) + √V(β2^T x) e    (10.5)

is a 1D regression model if β2 = cβ1 for some scalar c.


In multiple linear regression, the difference between the response Yi and the estimated conditional mean function α̂ + β̂^T xi is the residual. For more general regression models this difference may not be the residual, and the "discrepancy" Yi − M(β̂^T xi) may not be estimating the error ei. To guarantee that the residuals are estimating the errors, the following definition is used when possible.

Definition 10.3: Cox and Snell (1968). Let the errors ei be iid with pdf f and assume that the regression model Yi = g(xi, η, ei) has a unique solution for ei:

ei = h(xi, η, Yi).

Then the ith residual

êi = h(xi, η̂, Yi)

where η̂ is a consistent estimator of η.

Example 10.1. Let η = (α, β^T)^T. If Y = m(α + β^T x) + e where m is known, then e = Y − m(α + β^T x). Hence êi = Yi − m(α̂ + β̂^T xi), which is the usual definition of the ith residual for such models.

Dimension reduction can greatly simplify our understanding of the conditional distribution Y|x. If a 1D regression model is appropriate, then the k–dimensional vector x can be replaced by the 1–dimensional scalar β^T x with "no loss of information about the conditional distribution." Cook and Weisberg (1999a, p. 411) define a sufficient summary plot (SSP) to be a plot that contains all the sample regression information about the conditional distribution Y|x of the response given the predictors.

Definition 10.4: For a 1D regression model, a sufficient summary plot is a plot of h(x) versus Y. A response plot or estimated sufficient summary plot (ESSP) is a plot of the estimated sufficient predictor (ESP) versus Y. If h(x) = β^T x, then Y ⊥ x | (a + cβ^T x) for any constants a and c ≠ 0. Hence a + cβ^T x is a SP with ESP = α̂ + β̂^T x where β̂ is an estimator of cβ for some nonzero constant c.

If there is only one predictor x, then the plot of x versus Y is both a sufficient summary plot and a response plot, but generally only a response plot can be made. Since a can be any constant, a = 0 is often used. The following section shows how to use the OLS regression of Y on x to obtain an ESP. If we plot the fitted values and the ESP versus Y, the plots are called fit–response and ESP–response plots. For multiple linear regression, these two plots are the same.


10.1 Estimating the Sufficient Predictor

Some notation is needed before giving theoretical results. Let x, a, t, and β be k × 1 vectors where only x is random.

Definition 10.5: Cook and Weisberg (1999a, p. 431). The predictors x satisfy the condition of linearly related predictors with 1D structure if

E[x|β^T x] = a + t β^T x.    (10.6)

If the predictors x satisfy this condition, then for any given predictor xj,

E[xj|β^T x] = aj + tj β^T x.

Notice that β is a fixed k × 1 vector. If x is elliptically contoured (EC) with 1st moments, then the assumption of linearly related predictors holds since

E[x|b^T x] = ab + tb b^T x

for any nonzero k × 1 vector b (see Lemma 10.4). The condition of linearly related predictors is impossible to check since β is unknown, but the condition is far weaker than the assumption that x is EC. The stronger EC condition is often used since there are checks for whether this condition is reasonable, eg use the DD plot. The following proposition gives an equivalent definition of linearly related predictors. Both definitions are frequently used in the dimension reduction literature.

Proposition 10.1. The predictors x are linearly related iff

E[b^T x|β^T x] = ab + tb β^T x    (10.7)

for any k × 1 constant vector b where ab and tb are constants that depend on b.

Proof. Suppose that the assumption of linearly related predictors holds. Then

E[b^T x|β^T x] = b^T E[x|β^T x] = b^T a + b^T t β^T x.

Thus the result holds with ab = b^T a and tb = b^T t.

Now assume that Equation (10.7) holds. Take bi = (0, ..., 0, 1, 0, ..., 0)^T, the vector of zeroes except for a one in the ith position. Then Equation (10.6) holds since

E[x|β^T x] = E[Ik x|β^T x] = E[(b1^T x, ..., bk^T x)^T | β^T x] = (a1 + t1 β^T x, ..., ak + tk β^T x)^T ≡ a + t β^T x.  □


Following Cook (1998a, p. 143-144), assume that there is an objective function

Ln(a, b) = (1/n) Σ_{i=1}^n L(a + b^T xi, Yi)    (10.8)

where L(u, v) is a bivariate function that is a convex function of the first argument u. Assume that the estimate (â, b̂) of (a, b) satisfies

(â, b̂) = argmin_{a,b} Ln(a, b).    (10.9)

For example, the ordinary least squares (OLS) estimator uses

L(a + b^T x, Y) = (Y − a − b^T x)^2.

Maximum likelihood type estimators such as those used to compute GLMs and Huber's M–estimator also work, as does the Wilcoxon rank estimator. Assume that the population analog (α∗, β∗) is the unique minimizer of E[L(a + b^T x, Y)] where the expectation exists and is with respect to the joint distribution of (Y, x^T)^T. For example, (α∗, β∗) is unique if L(u, v) is strictly convex in its first argument. The following result is a useful extension of Brillinger (1977, 1983).

Theorem 10.2 (Li and Duan 1989, p. 1016): Assume that the x are linearly related predictors, that the (Yi, xi^T)^T are iid observations from some joint distribution with Cov(xi) nonsingular. Assume L(u, v) is convex in its first argument and that β∗ is unique. Assume that Y ⊥ x | β^T x. Then β∗ = cβ for some scalar c.

Proof. See Li and Duan (1989) or Cook (1998a, p. 144).

Remark 10.1. This theorem basically means that if the 1D regression model is appropriate and if the condition of linearly related predictors holds, then the (eg OLS) estimator b̂ ≡ β̂∗ ≈ cβ. Li and Duan (1989, p. 1031) show that under additional conditions, (â, b̂) is asymptotically normal. In particular, the OLS estimator frequently has a √n convergence rate. If the OLS estimator (α̂, β̂) satisfies β̂ ≈ cβ when model (10.1) holds, then the response plot of

α̂ + β̂^T x versus Y

can be used to visualize the conditional distribution Y|(α + β^T x) provided that c ≠ 0.

Remark 10.2. If b̂ is a consistent estimator of β∗, then certainly

β∗ = cx β + ug


where ug = β∗ − cx β is the bias vector. Moreover, the bias vector ug = 0 if x is elliptically contoured under the assumptions of Theorem 10.2. This result suggests that the bias vector might be negligible if the distribution of the predictors is close to being EC. Often if no strong nonlinearities are present among the predictors, the bias vector is small enough so that b̂^T x is a useful ESP.

Remark 10.3. Suppose that the 1D regression model is appropriate and Y ⊥ x | β^T x. Then Y ⊥ x | cβ^T x for any nonzero scalar c. If Y = g(β^T x, e) and both g and β are unknown, then g(β^T x, e) = ha,c(a + cβ^T x, e) where

ha,c(w, e) = g((w − a)/c, e)

for c ≠ 0. In other words, if g is unknown, we can estimate cβ but we can not determine c or β; i.e., we can only estimate β up to a constant.

A very useful result is that if Y = m(x) for some function m, then m can be visualized with both a plot of x versus Y and a plot of cx versus Y if c ≠ 0. In fact, there are only three possibilities. If c > 0, then the two plots are nearly identical except that the labels of the horizontal axis change. (The two plots are usually not exactly identical since plotting controls to "fill space" depend on several factors and will change slightly.) If c < 0, then the plot appears to be flipped about the vertical axis. If c = 0, then m(0) is a constant, and the plot is basically a dot plot. Similar results hold if Yi = g(α + β^T xi, ei) if the errors ei are small. OLS often provides a useful estimator of cβ where c ≠ 0, but OLS can result in c = 0 if g is symmetric about the population median of α + β^T x.

Definition 10.6. If the 1D regression model (10.1) holds with h(x) = α + β^T x, and OLS is used, then the ESP may be called the OLS ESP and the response plot may be called the OLS response plot. Other estimators, such as SIR, may have similar labels.

Example 10.2. Suppose that xi ∼ N3(0, I3) and that

Y = m(β^T x) + e = (x1 + 2x2 + 3x3)^3 + e.

Then a 1D regression model Y ⊥ x | β^T x holds with β = (1, 2, 3)^T. Figure 10.1 shows the sufficient summary plot of β^T x versus Y, and Figure 10.2 shows the sufficient summary plot of −β^T x versus Y. Notice that the functional form m appears to be cubic in both plots and that both plots can be smoothed by eye or with a scatterplot smoother such as lowess. The two figures were generated with the following R/Splus commands.

X <- matrix(rnorm(300),nrow=100,ncol=3)

SP <- X%*%1:3

Y <- (SP)^3 + rnorm(100)


plot(SP,Y)

plot(-SP,Y)

Fig. 10.1 SSP for m(u) = u^3

Fig. 10.2 Another SSP for m(u) = u^3

We particularly want to use the OLS estimator (α̂, β̂) to produce an estimated sufficient summary plot. This estimator is obtained from the usual multiple linear regression of Yi on xi, but we are not assuming that the multiple linear regression model holds; however, we are hoping that the 1D regression model Y ⊥ x | β^T x is a useful approximation to the data and that β̂ ≈ cβ for some nonzero constant c. In addition to Theorem 10.2, nice results exist if the single index model is appropriate. Recall that

Cov(x, Y) = E[(x − E(x))(Y − E(Y))].

Definition 10.7. Suppose that (Yi, xi^T)^T are iid observations and that the positive definite k × k matrix Cov(x) = ΣX and the k × 1 vector Cov(x, Y) = ΣX,Y. Let the OLS estimator (α̂, β̂) be computed from the multiple linear regression of Y on x plus a constant. Then (α̂, β̂) estimates the population quantity (αOLS, βOLS) where

βOLS = ΣX^{-1} ΣX,Y.    (10.10)
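A minimal numerical check of (10.10) in R, using simulated data with illustrative names (with a constant in the model, the moment formula reproduces the OLS slopes exactly):

set.seed(1)
x <- matrix(rnorm(300), nrow = 100, ncol = 3)
Y <- as.vector(x %*% c(1, 2, 3) + rnorm(100))
bhat.ols <- coef(lm(Y ~ x))[-1]        # OLS slope vector, intercept dropped
bhat.mom <- solve(cov(x), cov(x, Y))   # sample version of Sigma_X^{-1} Sigma_{X,Y}
cbind(bhat.ols, bhat.mom)              # the two columns agree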

The following notation will be useful for studying the OLS estimator. Let the sufficient predictor z = β^T x and let w = x − E(x). Let r = w − (ΣX β) β^T w.

Theorem 10.3. In addition to the conditions of Definition 10.7, also assume that Yi = m(β^T xi) + ei where the zero mean constant variance iid errors ei are independent of the predictors xi. Then

βOLS = ΣX^{-1} ΣX,Y = cm,X β + um,X    (10.11)

where the scalar

cm,X = E[β^T (x − E(x)) m(β^T x)]    (10.12)

and the bias vector

um,X = ΣX^{-1} E[m(β^T x) r].    (10.13)


Moreover, um,X = 0 if x is from an EC distribution with nonsingular ΣX, and cm,X ≠ 0 unless Cov(x, Y) = 0. If the multiple linear regression model holds, then cm,X = 1 and um,X = 0.

The proof of the above result is outlined in Problem 10.2 using an argument due to Aldrin, Bølviken, and Schweder (1993). See related results in Stoker (1986) and Cook, Hawkins, and Weisberg (1992). If the 1D regression model is appropriate, then typically Cov(x, Y) ≠ 0 unless β^T x follows a symmetric distribution and m is symmetric about the median of β^T x.

Definition 10.8. Let (α̂, β̂) denote the OLS estimate obtained from the OLS multiple linear regression of Y on x. The OLS view is a response plot of â + β̂^T x versus Y. Typically â = 0 or â = α̂.

Remark 10.4. All of this awkward notation and theory leads to a remarkable result, perhaps first noted by Brillinger (1977, 1983) and called the 1D Estimation Result by Cook and Weisberg (1999a, p. 432). The result is that if the 1D regression model Y ⊥ x | β^T x is appropriate, then the OLS view will frequently be a useful estimated sufficient summary plot (ESSP). Hence the OLS predictor β̂^T x is a useful estimated sufficient predictor (ESP).

Although the OLS view is frequently a good ESSP if no strong nonlinearities are present in the predictors and if cm,X ≠ 0 (eg the sufficient summary plot of β^T x versus Y is not approximately symmetric), even better estimated sufficient summary plots can be obtained by using ellipsoidal trimming. This topic is discussed in the following section and follows Olive (2002) closely.

10.2 Visualizing 1D Regression

Cook and Nachtsheim (1994) and Cook (1998a, p. 152) demonstrate that the bias um,X can often be made small by ellipsoidal trimming. To perform ellipsoidal trimming, an estimator (T, C) is computed where T is a k × 1 multivariate location estimator and C is a k × k symmetric positive definite dispersion estimator. Then the ith squared Mahalanobis distance is the random variable

Di^2 = (xi − T)^T C^{-1} (xi − T)    (10.14)

for each vector of observed predictors xi. If the ordered distances D(j) are unique, then j of the xi are in the hyperellipsoid

{x : (x − T)^T C^{-1} (x − T) ≤ D(j)^2}.    (10.15)

The ith case (Yi, xi^T)^T is trimmed if Di > D(j). Thus if j ≈ 0.9n, then about 10% of the cases are trimmed.


Fig. 10.3 Scatterplot for Mussel Data, o Corresponds to Trimmed Cases

We suggest that the estimator (T, C) should be the classical sample mean and covariance matrix (x̄, S) or a robust multivariate location and dispersion estimator such as RFCH. See Section 10.7. When j ≈ n/2, the RFCH estimator attempts to make the volume of the hyperellipsoid given by Equation (10.15) small.

Ellipsoidal trimming seems to work for at least three reasons. The trimming divides the data into two groups: the trimmed cases and the remaining cases (xM, YM) where M% is the amount of trimming, eg M = 10 for 10% trimming. If the distribution of the predictors x is EC then the distribution of xM still retains enough symmetry so that the bias vector is approximately zero. If the distribution of x is not EC, then the distribution of xM will often have enough symmetry so that the bias vector is small. In particular, trimming often removes strong nonlinearities from the predictors and the weighted predictor distribution is more nearly elliptically symmetric than the predictor distribution of the entire data set (recall Winsor's principle: "all data are roughly Gaussian in the middle"). Secondly, under heavy trimming, the mean function of the remaining cases may be more linear than the mean function of the entire data set. Thirdly, if |c| is very large, then the bias vector may be small relative to cβ. Trimming sometimes inflates |c|. From Theorem 10.3, any of these three reasons should produce a better estimated sufficient predictor.

Example 10.3. Cook and Weisberg (1999a, p. 351, 433, 447) gave a data set on 82 mussels sampled off the coast of New Zealand. The variables are the muscle mass M in grams, the length L and height H of the shell in mm, the shell width W and the shell mass S. The robust and classical Mahalanobis distances were calculated, and Figure 10.3 shows a scatterplot matrix of the mussel data, the RDi's, and the MDi's. Notice that many of the subplots are nonlinear. The cases marked by open circles were given weight zero by the FMCD algorithm, and the linearity of the retained cases has increased. Note that only one trimming proportion is shown and that a heavier trimming proportion would increase the linearity of the cases that were not trimmed.

The two ideas of using ellipsoidal trimming to reduce the bias and choosing a view with a smooth mean function and smallest variance function can be combined into a graphical method for finding the estimated sufficient summary plot and the estimated sufficient predictor. Trim the M% of the cases with the largest Mahalanobis distances, and then compute the OLS estimator (α̂M, β̂M) from the cases that remain. Use M = 0, 10, 20, 30, 40, 50, 60, 70, 80, and 90 to generate ten plots of β̂M^T x versus Y using all n cases. In analogy with the Cook and Weisberg procedure for visualizing 1D structure with two predictors, the plots will be called "trimmed views." Notice that M = 0 corresponds to the OLS view.
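A minimal sketch of one trimmed view in R, using the classical (T, C) = (x̄, S) for the distances; the book's trviews function, illustrated below, is the tool actually used, and X, Y, M here are illustrative names:

trimmed.view <- function(X, Y, M = 50) {
  D2 <- mahalanobis(X, colMeans(X), cov(X))        # squared distances for the predictors
  keep <- D2 <= quantile(D2, 1 - M/100, type = 1)  # drop the M% of cases with the largest distances
  fit <- lsfit(X[keep, ], Y[keep])                 # OLS (alphaM, betaM) from the retained cases
  ESP <- fit$coef[1] + X %*% fit$coef[-1]          # ESP computed for all n cases
  plot(ESP, Y, main = paste0(M, "% trimming"))
  invisible(fit$coef)
}
# for (M in seq(90, 0, by = -10)) trimmed.view(X, Y, M)   # the ten trimmed views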

Definition 10.9. The best trimmed view is the trimmed view with a smooth mean function and the smallest variance function and is the estimated sufficient summary plot. If M∗ = E is the percentage of cases trimmed that corresponds to the best trimmed view, then β̂E^T x is the estimated sufficient predictor.

The following examples illustrate the R/Splus function trviews that is used to produce the ESSP. If R is used instead of Splus, the command

library(MASS)

needs to be entered to access the function cov.mcd called by trviews. The function trviews is used in Problem 10.6. Also notice the trviews estimator is basically the same as the tvreg estimator. The tvreg estimator can be used to simultaneously detect whether the data is following a multiple linear regression model or some other single index model. Plot α̂E + β̂E^T x versus Y and add the identity line. If the plotted points follow the identity line then the MLR model is reasonable, but if the plotted points follow a nonlinear mean function, then a nonlinear single index model may be reasonable.

Example 10.2 continued. The command

trviews(X, Y)

produced the following output.

Intercept X1 X2 X3

0.6701255 3.133926 4.031048 7.593501

Intercept X1 X2 X3

1.101398 8.873677 12.99655 18.29054

Intercept X1 X2 X3

0.9702788 10.71646 15.40126 23.35055

Intercept X1 X2 X3

0.5937255 13.44889 23.47785 32.74164

Intercept X1 X2 X3

1.086138 12.60514 25.06613 37.25504

Intercept X1 X2 X3

4.621724 19.54774 34.87627 48.79709

Intercept X1 X2 X3

3.165427 22.85721 36.09381 53.15153

Intercept X1 X2 X3

5.829141 31.63738 56.56191 82.94031

Intercept X1 X2 X3

4.241797 36.24316 70.94507 105.3816

Intercept X1 X2 X3


6.485165 41.67623 87.39663 120.8251

Fig. 10.4 Best View for Estimating m(u) = u^3

Fig. 10.5 The angle between the SP and the ESP is nearly zero.

The function generates 10 trimmed views. The first plot trims 90% of the cases while the last plot does not trim any of the cases and is the OLS view. To advance a plot, press the right button on the mouse (in R, highlight stop rather than continue). After all of the trimmed views have been generated, the output is presented. For example, the 5th line of numbers in the output corresponds to α̂50 = 1.086138 and β̂50^T where 50% trimming was used. The second line of numbers corresponds to 80% trimming while the last line corresponds to 0% trimming and gives the OLS estimate (α̂0, β̂0^T) = (â, b̂). The trimmed views with 50% and 90% trimming were very good.

We decided that the view with 50% trimming was the best. Hence β̂E = (12.60514, 25.06613, 37.25504)^T ≈ 12.5β. The best view is shown in Figure 10.4 and is nearly identical to the sufficient summary plot shown in Figure 10.1. Notice that the OLS estimate = (41.68, 87.40, 120.83)^T ≈ 42β. The OLS view is Figure 1.5 in Chapter 1, and is again very similar to the sufficient summary plot, but it is not quite as smooth as the best trimmed view.

The plot of the estimated sufficient predictor versus the sufficient predictor is also informative. Of course this plot can usually only be generated for simulated data since β is generally unknown. If the plotted points are highly correlated (with |corr(ESP,SP)| > 0.95) and follow a line through the origin, then the estimated sufficient summary plot is nearly as good as the sufficient summary plot. The simulated data used β = (1, 2, 3)^T, and the commands

SP <- X %*% 1:3

ESP <- X %*% c(12.60514, 25.06613, 37.25504)

plot(ESP,SP)

generated the plot shown in Figure 10.5.

Example 10.4. An artificial data set with 200 trivariate vectors xi was generated. The marginal distributions of xi,j are iid lognormal for j = 1, 2, and 3. Since the response Yi = sin(β^T xi)/β^T xi where β = (1, 2, 3)^T, the random vector xi is not elliptically contoured and the function m is strongly nonlinear. Figure 10.6d shows the OLS view and Figure 10.7d shows the best trimmed view. Notice that it is difficult to visualize the mean function with the OLS view, and notice that the correlation between Y and the ESP is very low. By focusing on a part of the data where the correlation is high, it may be possible to improve the estimated sufficient summary plot. For example, in Figure 10.7d, temporarily omit cases that have ESP less than 0.3 or greater than 0.75. From the untrimmed cases, obtain the ten trimmed estimates β̂90, ..., β̂0. Then using all of the data, obtain the ten views. The best view could be used as the ESSP.

Application 10.1. Suppose that a 1D regression analysis is desired on a data set. Use the trimmed views as an exploratory data analysis technique to visualize the conditional distribution Y|β^T x. The best trimmed view is an estimated sufficient summary plot. If the single index model (10.3) holds, the function m can be estimated from this plot using parametric models or scatterplot smoothers such as lowess. Notice that Y can be predicted visually using up and over lines.

Fig. 10.6 Estimated Sufficient Summary Plots Without Trimming

Fig. 10.7 1D Regression with Trimmed Views

Table 10.1 Estimated Sufficient Predictor Coefficients Estimating c(1, 2, 3)^T

method                       b1       b2       b3
OLS View                     0.0032   0.0011   0.0047
90% Trimmed OLS View         0.086    0.182    0.338
SIR View                    −0.394   −0.361   −0.845
10% Trimmed SIR View        −0.284   −0.473   −0.834
SAVE View                   −1.09     0.870   −0.480
40% Trimmed SAVE View        0.256    0.591    0.765
PHD View                    −0.072   −0.029   −0.0097
90% Trimmed PHD View        −0.558   −0.499   −0.664
LMSREG View                 −0.003   −0.005   −0.059
70% Trimmed LMSREG View      0.143    0.287    0.428

Application 10.2. The best trimmed view can also be used as a diagnostic for linearity and monotonicity.

For example, in Figure 10.4, if ESP = 0, then Y = 0 and if ESP = 100, then Y = 500. Figure 10.4 suggests that the mean function is monotone but not linear, and Figure 10.7 suggests that the mean function is neither linear nor monotone.


Application 10.3. Assume that a known 1D regression model is assumed for the data. Then the best trimmed view is a model checking plot and can be used as a diagnostic for whether the assumed model is appropriate.

The trimmed views are sometimes useful even when the assumption of linearly related predictors fails. Cook and Li (2002) summarize when competing methods such as the OLS view, sliced inverse regression (SIR), principal Hessian directions (PHD), and sliced average variance estimation (SAVE) can fail. All four methods frequently perform well if there are no strong nonlinearities present in the predictors.

Example 10.4 (continued). Figure 10.6 shows that the response plots for SIR, PHD, SAVE, and OLS are not very good while Figure 10.7 shows that trimming improved the SIR, SAVE and OLS methods.

Fig. 10.8 1D Regression with lmsreg

One goal for future research is to develop better methods for visualizing 1D regression. Trimmed views seem to become less effective as the number of predictors k = p − 1 increases. Consider the sufficient predictor SP = x1 + · · · + xk. With the sin(SP)/SP data, several trimming proportions gave good views with k = 3, but only one of the ten trimming proportions gave a good view with k = 10. In addition to problems with dimension, it is not clear which covariance estimator and which regression estimator should be used. We suggest using the RFCH estimator with OLS, and preliminary investigations suggest that the classical covariance estimator gives better estimates than cov.mcd. But among the many Splus regression estimators, lmsreg often worked well. Theorem 10.2 suggests that strictly convex regression estimators such as OLS will often work well, but there is no theory for the robust regression estimators.

Fig. 10.9 The Weighted lmsreg Fitted Values Versus Y

Example 10.4 continued. Replacing the OLS trimmed views by alternative MLR estimators often produced good response plots, and for single index models, the lmsreg estimator often worked the best. Figure 10.8 shows a scatterplot matrix of Y, ESP and SP where the sufficient predictor SP = β^T x. The ESP used ellipsoidal trimming with cov.mcd and with lmsreg instead of OLS. The top row of Figure 10.8 shows that the estimated sufficient summary plot and the sufficient summary plot are nearly identical. Also the correlation of the ESP and the SP is nearly one. Table 10.1 shows the estimated sufficient predictor coefficients b when the sufficient predictor coefficients are c(1, 2, 3)^T. Only the SIR, SAVE, OLS and lmsreg trimmed views produce estimated sufficient predictors that are highly correlated with the sufficient predictor.

Figure 10.9 helps illustrate why ellipsoidal trimming works. This view used 70% trimming and the open circles denote cases that were trimmed. The highlighted squares correspond to the cases (x70, Y70) that were not trimmed. Note that the highlighted cases are far more linear than the data set as a whole. Also lmsreg will give half of the highlighted cases zero weight, further linearizing the function. In Figure 10.9, the lmsreg constant α̂70 is included, and the plot is simply the response plot of the weighted lmsreg fitted values versus Y. The vertical deviations from the line through the origin are the "residuals" Yi − α̂70 − β̂70^T xi and at least half of the highlighted cases have small residuals.

10.3 Predictor Transformations

Even if the multiple linear regression model is valid, a model based on a subset of the predictor variables depends on the predictor distribution. If the predictors are linearly related (eg EC), then the submodel mean function is often well behaved, but otherwise the submodel mean function could be nonlinear and the submodel variance function could be nonconstant.

For 1D regression models, the presence of strong nonlinearities among the predictors can invalidate inferences. A necessary condition for x to have an EC distribution (or for no strong nonlinearities to be present among the predictors) is for each marginal plot of the scatterplot matrix of the predictors to have a linear or ellipsoidal shape if n is large.

One of the most useful techniques in regression is to remove gross nonlinearities in the predictors by using predictor transformations. Power transformations are particularly effective. A multivariate version of the Box–Cox transformation due to Velilla (1993) can cause the distribution of the transformed predictors to be closer to multivariate normal, and the Cook and Nachtsheim (1994) procedure can cause the distribution to be closer to elliptical symmetry. Marginal Box-Cox transformations also seem to be effective. Power transformations can also be selected with slider bars in Arc.
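One way to experiment with such multivariate power transformations in R is sketched below, assuming the car package and a matrix X of strictly positive predictors; car::powerTransform is used only as a stand-in for the Velilla (1993) style approach, and the names are illustrative:

library(car)
pt <- powerTransform(as.data.frame(X))                   # estimates a power for each predictor
summary(pt)
Xt <- bcPower(as.data.frame(X), coef(pt, round = TRUE))  # the transformed predictors
pairs(Xt)                                                # check the scatterplot matrix for linearity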

Section 1.1 gives several rules for predictor transformations, including the unit rule, log rule, and ladder rule. As an illustration of the ladder rule and log rule, in Figure 10.14c, small values of Y and large values of FESP need spreading, and using log(Y) would make the plot more linear.


10.4 Variable Selection

A standard problem in 1D regression is variable selection, also called subset or model selection. Assume that Y ⊥ x | (α + β^T x), that a constant is always included, and that x = (x1, ..., xp−1)^T are the p − 1 nontrivial predictors, which we assume to be of full rank. Then variable selection is a search for a subset of predictor variables that can be deleted without important loss of information. This section follows Olive and Hawkins (2005) closely.

Variable selection for the 1D regression model is very similar to variable selection for the multiple linear regression model (see Section 3.4). To clarify ideas, assume that there exists a subset S of predictor variables such that if xS is in the 1D model, then none of the other predictors are needed in the model. Write E for these ('extraneous') variables not in S, partitioning x = (xS^T, xE^T)^T. Then

SP = α + β^T x = α + βS^T xS + βE^T xE = α + βS^T xS.    (10.16)

The extraneous terms that can be eliminated given that the subset S is in the model have zero coefficients.

Now suppose that I is a candidate subset of predictors, that S ⊆ I and that O is the set of predictors not in I. Then

SP = α + β^T x = α + βS^T xS = α + βS^T xS + β(I/S)^T xI/S + 0^T xO = α + βI^T xI

(if I includes predictors from E, these will have zero coefficients). For any subset I that contains the subset S of relevant predictors, the correlation

corr(α + β^T xi, α + βI^T xI,i) = 1.    (10.17)

This observation, which is true regardless of the explanatory power of the model, suggests that variable selection for 1D regression models is simple in principle. For each value of j = 1, 2, ..., p − 1 nontrivial predictors, keep track of subsets I that provide the largest values of corr(ESP, ESP(I)). Any such subset for which the correlation is high is worth closer investigation and consideration. To make this advice more specific, use the rule of thumb that a candidate subset of predictors I is worth considering if the sample correlation of ESP and ESP(I) satisfies

corr(α̂ + β̂^T xi, α̂I + β̂I^T xI,i) = corr(β̂^T xi, β̂I^T xI,i) ≥ 0.95.    (10.18)

The difficulty with this approach is that fitting all of the possible submodels involves substantial computation. An exception to this difficulty is multiple linear regression where there are efficient "leaps and bounds" algorithms for searching all subsets when OLS is used (see Furnival and Wilson 1974). Since OLS often gives a useful ESP, the following all subsets procedure


can be used for 1D models when p ≤ 20. Forward selection and backward elimination can be used for much larger values of p.

• Fit a full model using the methods appropriate to that 1D problem to find the ESP α̂ + β̂^T x.
• Find the OLS ESP α̂OLS + β̂OLS^T x.
• If the 1D ESP and the OLS ESP have "a strong linear relationship" (for example |corr(ESP, OLS ESP)| ≥ 0.95), then infer that the 1D problem is one in which OLS may serve as an adequate surrogate for the correct 1D model fitting procedure.
• Use computationally fast OLS variable selection procedures such as forward selection, backward elimination and the leaps and bounds all subsets algorithm along with the Mallows (1973) Cp criterion to identify predictor subsets I containing k variables (including the constant) with Cp(I) ≤ 2k. A minimal R sketch of this step is given after this list.
• Perform a final check on the subsets that satisfy the Cp screen by using them to fit the 1D model.
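The OLS all subsets step could be carried out with the leaps package; a rough sketch (x is a predictor matrix and Y the response, both illustrative names):

library(leaps)
out <- summary(regsubsets(x, Y, nvmax = ncol(x)))   # leaps and bounds all subsets OLS search
k <- rowSums(out$which)                             # variables in each subset I, constant included
cand <- which(out$cp <= 2 * k)                      # the Cp(I) <= 2k screen
out$which[cand, , drop = FALSE]                     # candidate subsets to refit with the 1D method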

For a 1D model, the response, ESP and vertical discrepancies V = Y −ESP are important. When the multiple linear regression (MLR) model holds,the fitted values are the ESP: Y = ESP , and the vertical discrepancies arethe residuals.

Definition 10.10. a) The plot of α̂_I + β̂_I^T x_{I,i} versus α̂ + β̂^T x_i is called an EE plot (also called an FF plot for MLR).
b) The plot of discrepancies Y_i − α̂_I − β̂_I^T x_{I,i} versus Y_i − α̂ − β̂^T x_i is called a VV plot (also called an RR plot for MLR).
c) The plots of α̂_I + β̂_I^T x_{I,i} versus Y_i and of α̂ + β̂^T x_i versus Y_i are called estimated sufficient summary plots or response plots.
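These four plots are easy to make once the two ESPs are available; a sketch, assuming esp and espI hold the full model and submodel ESPs and Y is the response (hypothetical names):

par(mfrow = c(2, 2))
plot(espI, esp); abline(0, 1)           # EE plot: points should hug the identity line
plot(Y - espI, Y - esp); abline(0, 1)   # VV plot of the vertical discrepancies
plot(esp, Y)                            # response plot for the full model
plot(espI, Y)                           # response plot for the submodel
par(mfrow = c(1, 1))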

Many numerical methods such as forward selection, backward elimination, stepwise and all subset methods using the Cp criterion (Jones 1946, Mallows 1973), have been suggested for variable selection. The four plots in Definition 10.10 contain valuable information to supplement the raw numerical results of these selection methods. Particular uses include:

• The key to understanding which plots are the most useful is the observation that a plot of w versus z is used to visualize the conditional distribution of z given w. Since the 1D regression is the study of the conditional distribution of Y given α + β^T x, the response plot is used to visualize this conditional distribution and should always be made. A major problem with variable selection is that deleting important predictors can change the functional form of the model. In particular, if a multiple linear regression model is appropriate for the full model, linearity may be destroyed if important predictors are deleted. When the single index model (10.3) holds, m can be visualized with a response plot. Adding visual aids such as the estimated parametric mean function m(α̂ + β̂^T x) can be useful. If an estimated nonparametric mean function m(α̂ + β̂^T x) such as lowess follows the parametric curve closely, then often numerical goodness of fit tests will suggest that the model is good. See Chambers, Cleveland, Kleiner, and Tukey (1983, p. 280) and Cook and Weisberg (1999a, p. 425, 432). For variable selection, the response plots from the full model and submodel should be very similar if the submodel is good.

• Sometimes outliers will influence numerical methods for variable selection. Outliers tend to stand out in at least one of the plots. An EE plot is useful for variable selection because the correlation of ESP(I) and ESP is important. The EE plot can be used to quickly check that the correlation is high, that the plotted points fall about some line, that the line is the identity line, and that the correlation is high because the relationship is linear, rather than because of outliers.

• Numerical methods may include too many predictors. Investigators can examine the p–values for individual predictors, but the assumptions needed to obtain valid p–values are often violated; however, the OLS t tests for individual predictors are meaningful since deleting a predictor changes the Cp value by t² − 2 where t is the test statistic for the predictor. See Section 10.5, Daniel and Wood (1980, p. 100-101) and the following two remarks.

Remark 10.5. Variable selection with the Cp criterion is closely related to the partial F test that uses test statistic F_I. Suppose that the full model contains p predictors including a constant and the submodel I includes k predictors including a constant. If n ≥ 10p, then the submodel I is “interesting” if Cp(I) ≤ min(2k, p).

To see this claim notice that the following results are properties of OLS and hold even if the data does not follow a 1D model. If the candidate model of x_I has k terms (including the constant), then

F_I = [SSE(I) − SSE] / [(n − k) − (n − p)] ÷ [SSE/(n − p)] = [(n − p)/(p − k)] [SSE(I)/SSE − 1]

where SSE is the “residual” sum of squares from the full model and SSE(I) is the “residual” sum of squares from the candidate submodel. Then

Cp(I) = SSE(I)/MSE + 2k − n = (p − k)(F_I − 1) + k    (10.19)

where MSE is the “residual” mean square for the full model. Let ESP(I) = α̂_I + β̂_I^T x_I be the ESP for the submodel and let V_I = Y − ESP(I) so that V_{I,i} = Y_i − α̂_I − β̂_I^T x_{I,i}. Let ESP and V denote the corresponding quantities for the full model. Using Proposition 3.1 and Remarks 3.1 and 3.2 with corr(r, r_I) replaced by corr(V, V_I), it can be shown that

corr(V, V_I) = √[SSE/SSE(I)] = √[(n − p)/(Cp(I) + n − 2k)] = √[(n − p)/((p − k)F_I + n − p)].

It can also be shown that Cp(I) ≤ 2k corresponds to corr(V, V_I) ≥ d_n where

d_n = √(1 − p/n).

Notice that for a fixed value of k, the submodel I_k that minimizes Cp(I) also maximizes corr(V, V_I). If Cp(I) ≤ 2k and n ≥ 10p, then 0.948 ≤ corr(V, V_I), and both corr(V, V_I) → 1.0 and corr(OLS ESP, OLS ESP(I)) → 1.0 as n → ∞. Hence the plotted points in both the VV plot and the EE plot will cluster about the identity line (see Proposition 3.1 vi).
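Since (10.19) is an algebraic identity for OLS, it can be checked numerically on any full model/submodel pair; a short sketch on simulated MLR data:

set.seed(1)
n <- 100; x <- matrix(rnorm(n * 4), n, 4)
y <- x[, 1] + x[, 2] + rnorm(n)
full <- lm(y ~ x); sub <- lm(y ~ x[, 1:2])        # submodel I keeps 2 nontrivial predictors
p <- 5; k <- 3                                    # terms including the constant
FI  <- ((deviance(sub) - deviance(full)) / (p - k)) / (deviance(full) / (n - p))
CpI <- deviance(sub) / (deviance(full) / (n - p)) + 2 * k - n
all.equal(CpI, (p - k) * (FI - 1) + k)            # TRUE: both sides of (10.19) agree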

Remark 10.6. Suppose that the OLS ESP and the standard ESP are highly correlated: |corr(ESP, OLS ESP)| ≥ 0.95. Then often OLS variable selection can be used for the 1D data, and using the p–values from OLS output seems to be a useful benchmark. To see this, suppose that n > 5p and first consider the model I_i that deletes the predictor X_i. Then the model has k = p − 1 predictors including the constant, and the test statistic is t_i where

t_i² = F_{I_i}.

Using (10.19) and Cp(I_full) = p, it can be shown that

Cp(I_i) = Cp(I_full) + (t_i² − 2).

Using the screen Cp(I) ≤ min(2k, p) suggests that the predictor X_i should not be deleted if

|t_i| > √2 ≈ 1.414.

If |t_i| < √2 then the predictor can probably be deleted since Cp decreases. More generally, it can be shown that Cp(I) ≤ 2k iff

F_I ≤ p/(p − k).

Now k is the number of terms in the model including a constant while p − k is the number of terms set to 0. As k → 0, the partial F test will reject Ho: β_O = 0 (i.e., say that the full model should be used instead of the submodel I) unless F_I is not much larger than 1. If p is very large and p − k is very small, then the partial F test will tend to suggest that there is a model I that is about as good as the full model even though model I deletes p − k predictors.

The Cp(I) ≤ k screen tends to overfit. We simulated multiple linear regression and single index model data sets with p = 8 and n = 50, 100, 1000 and 10000. The true model S satisfied Cp(S) ≤ k for about 60% of the simulated data sets, but S satisfied Cp(S) ≤ 2k for about 97% of the data sets.

In many settings, not all of which meet the Li–Duan sufficient conditions, the full model OLS ESP is a good estimator of the sufficient predictor. If the fitted full 1D model Y ⫫ x | (α + β^T x) is a useful approximation to the data and if β̂_OLS is a good estimator of cβ where c ≠ 0, then a subset I will produce a response plot similar to the response plot of the full model if corr(OLS ESP, OLS ESP(I)) ≥ 0.95. Hence the response plots based on the full and submodel ESP can both be used to visualize the conditional distribution of Y.

Assuming that a 1D model holds, a common assumption made for variable selection is that the fitted full model ESP is a good estimator of the sufficient predictor, and the usual numerical and graphical checks on this assumption should be made. To see that this assumption is weaker than the assumption that the OLS ESP is good, notice that if a 1D model holds but β̂_OLS estimates cβ where c = 0, then the Cp(I) criterion could wrongly suggest that all subsets I have Cp(I) ≤ 2k. Hence we also need to check that c ≠ 0.

There are several methods for checking the OLS ESP, including: a) if an ESP from an alternative fitting method is believed to be useful, check that the ESP and the OLS ESP have a strong linear relationship: for example that |corr(ESP, OLS ESP)| ≥ 0.95. b) Often examining the OLS response plot shows that a 1D model is reasonable. For example, if the data are tightly clustered about a smooth curve, then a single index model may be appropriate. c) Verify that a 1D model is appropriate using graphical techniques given by Cook and Weisberg (1999a, p. 434-441). d) Verify that x has an EC distribution with nonsingular covariance matrix and that the mean function m(α + β^T x) is not symmetric about the median of the distribution of α + β^T x. Then results from Li and Duan (1989) suggest that c ≠ 0.

Condition a) is both the most useful (being a direct performance check) and the easiest to check. A standard fitting method should be used when available (e.g., for parametric 1D models such as GLMs). Conditions c) and d) need x to have a continuous multivariate distribution while the predictors can be factors for a) and b). Using trimmed views results in an ESP that can sometimes cause condition b) to hold when d) is violated.

To summarize, variable selection procedures, originally meant for MLR, can often be used for 1D data. If the fitted full 1D model Y ⫫ x | (α + β^T x) is a useful approximation to the data and if β̂_OLS is a good estimator of cβ where c ≠ 0, then a subset I is good if corr(OLS ESP, OLS ESP(I)) ≥ 0.95. If n is large enough, Remark 10.5 implies that this condition will hold if Cp(I) ≤ 2k or if F_I ≤ 1. This result suggests that within the (large) subclass of 1D models where the OLS ESP is useful, the OLS partial F test is robust (asymptotically) to model misspecifications in that F_I ≤ 1 correctly suggests that submodel I is good. The OLS t tests for individual predictors are also meaningful since if |t| < √2 then the predictor can probably be deleted since Cp decreases while if |t| ≥ 2 then the predictor is probably useful even when the other predictors are in the model. Section 10.5 provides related theory, and the following examples help illustrate the above discussion.

Example 10.5. This example illustrates that the plots are useful for general 1D regression models such as the response transformation model. Consider the data set on 82 mussels in Example 10.3. The response Y is the muscle mass in grams, and the four predictors are the logarithms of the shell length, width, height and mass. The logarithm transformation was used to remove strong nonlinearities that were evident in a scatterplot matrix of the untransformed predictors. The Cp criterion suggests using log(width) and log(shell mass) as predictors. The EE and VV plots are shown in Figure 10.10ab. The response plots based on the full and submodel are shown in Figure 10.10cd and are nearly identical, but not linear.

When log(muscle mass) is used as the response, the Cp criterion suggests using log(height) and log(shell mass) as predictors (the correlation between log(height) and log(width) is very high). Figure 10.11a shows the RR plot and 2 outliers are evident. These outliers correspond to the two cases with large negative residuals in the response plot shown in Figure 10.11b. After deleting the outliers, the Cp criterion still suggested using log(height) and log(shell mass) as predictors. The p–value for including log(height) in the model was 0.03, and making the FF and RR plots after deleting log(height) suggests that log(height) may not be needed in the model.

Fig. 10.10 Mussel Data with Muscle Mass as the Response

Fig. 10.11 Mussel Data with log(Muscle Mass) as the Response

Example 10.6. According to Li (1997), the predictors in the Boston housing data of Harrison and Rubinfeld (1978) have a nonlinear quasi–helix relationship which can cause regression graphics methods to fail. Nevertheless, the graphical diagnostics can be used to gain interesting information from the data. The response Y = log(CRIM) where CRIM is the per capita crime rate by town. The predictors used were x1 = proportion of residential land zoned for lots over 25,000 sq.ft., log(x2) where x2 is the proportion of nonretail business acres per town, x3 = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise), x4 = NOX = nitric oxides concentration (parts per 10 million), x5 = average number of rooms per dwelling, x6 = proportion of owner-occupied units built prior to 1940, log(x7) where x7 = weighted distances to five Boston employment centers, x8 = RAD = index of accessibility to radial highways, log(x9) where x9 = full-value property-tax rate per $10,000, x10 = pupil-teacher ratio by town, x11 = 1000(Bk − 0.63)² where Bk is the proportion of blacks by town, log(x12) where x12 = % lower status of the population, and log(x13) where x13 = median value of owner-occupied homes in $1000's. The full model has 506 cases and 13 nontrivial predictor variables.

Fig. 10.12 Response and Residual Plots for Boston Housing Data

Fig. 10.13 Relationships between NOX, RAD and Y = log(CRIM)

Figure 10.12ab shows the response plot and residual plot for the full model. The residual plot suggests that there may be three or four groups of data, but a linear model does seem reasonable. Backward elimination with Cp suggested the “min Cp submodel” with the variables x1, log(x2), NOX, x6, log(x7), RAD, x10, x11 and log(x13). The full model had R² = 0.878 and σ̂ = 0.7642. The Cp submodel had Cp(I) = 6.576, R²_I = 0.878, and σ̂_I = 0.762. Deleting log(x7) resulted in a model with Cp = 8.483 and the smallest coefficient p–value was 0.0095. The FF and RR plots for this model (not shown) looked like the identity line. Examining further submodels showed that NOX and RAD were the most important predictors. In particular, the OLS coefficients of x1, x6 and x11 were orders of magnitude smaller than those of NOX and RAD. The submodel including a constant, NOX, RAD and log(x2) had R² = 0.860, σ̂ = 0.811 and Cp = 67.368. Figure 10.12cd shows the response plot and residual plot for this submodel.

Although this submodel has nearly the same R² as the full model, the residuals show more variability than those of the full model. Nevertheless, we can examine the effect of NOX and RAD on the response by deleting log(x2). This submodel had R² = 0.842, σ̂ = 0.861 and Cp = 138.727. Figure 10.13a shows that the response plot for this model is no longer linear. The residual plot (not shown) also displays curvature. Figure 10.13a shows that there are two groups, one with high Y and one with low Y. There are three clusters of points in the plot of NOX versus RAD shown in Figure 10.13b (the single isolated point in the southeast corner of the plot actually corresponds to several cases). The two clusters of high NOX and high RAD points correspond to the cases with high per capita crime rate.

Fig. 10.14 Boston Housing Data: Nonlinear 1D Regression Model

The tiny filled in triangles in Figure 10.13a represent the fitted values for a quadratic. We added NOX², RAD² and NOX∗RAD to the full model and again tried variable selection. Although the full quadratic in NOX and RAD had a linear response plot, the submodel with NOX, RAD and log(x2) was very similar. For this data set, NOX and RAD seem to be the most important predictors, but other predictors are needed to make the model linear and to reduce residual variation.

In the Boston housing data, now let Y = CRIM. Since log(Y) has a linear relationship with the predictors, Y should follow a nonlinear 1D regression model. Consider the full model with predictors log(x2), x3, x4, x5, log(x7), x8, log(x9) and log(x12). Regardless of whether Y or log(Y) is used as the response, the minimum Cp model from backward elimination used a constant, log(x2), x4, log(x7), x8 and log(x12) as predictors. If Y is the response, then the model is nonlinear and Cp = 5.699. Remark 10.5 suggests that if Cp ≤ 2k, then the points in the VV plot should tightly cluster about the identity line even if a multiple linear regression model fails to hold. Figure 10.14 shows the VV and EE plots for the minimum Cp submodel. The response plots for the full model and submodel are also shown. Note that the clustering in the VV plot is indeed higher than the clustering in the EE plot. Note that the response plots are highly nonlinear but are nearly identical.

Fig. 10.15 Response Plot for Insulation Data

Example 10.7. This insulation data was contributed by Ms. Spector. A box with insulation was heated for 20 minutes then allowed to cool down. The response variable Y = temperature in middle of box was taken at time 0, 5, ..., 40. The type of insulation was a factor with type 1 = no insulation, 2 = corn pith, 3 = fiberglass, 4 = styrofoam and 5 = bubbles. There were 45 temperature measurements, one for each time type combination. The measurements were averages of ten trials and starting temperatures were close but not exactly equal.

The model using time, (time)², type, and the interactions type:time and type:(time)² had E(Y|x) ≈ (x^T β)². A second model used time, (time)² and type, and rather awkward R code for producing the response plot in Figure 10.15 is shown below. The solid curve corresponds to (x^T β̂, (x^T β̂)³) = (FIT, (FIT)³) where β̂ is the OLS estimator from regressing Y on x^T = (1, time, (time)², type). The thin curve corresponds to lowess. Since the two lines correspond, E(Y|x) ≈ (x^T β)³ or Y = m(x^T β) + e where m(w) = w³. See Problem 12.10 for producing the response plot in Arc.

#assume the insulation data is loaded
ftype <- as.factor(insulation[,2])
zi <- as.data.frame(insulation)
iout <- lm(ytemp~time+I(time^2)+ftype,data=zi)
FIT <- iout$fit
Y <- insulation[,1]
plot(FIT,Y)
lines(lowess(FIT,Y))
#get the (FIT,(FIT)^3) curve
zx <- FIT
z <- lsfit(cbind(zx,zx^2,zx^3),Y)
zfit <- Y-z$resid
lines(FIT,zfit)


10.5 Inference

This section follows Chang and Olive (2007, 2010) closely. Inference can be performed for trimmed views if M is chosen without using the response, e.g. if the trimming is done with a DD plot, and the dimension reduction (DR) method such as OLS or sliced inverse regression (SIR) is performed on the data (Y_{Mi}, x_{Mi}) that remains after trimming M% of the cases with ellipsoidal trimming based on the FCH or RFCH estimator.

First we review some theoretical results for the DR methods OLS and SIR and give the main theoretical result for OLS. Let

Cov(x) = E[(x − E(x))(x − E(x))^T] = Σ_x

and Cov(x, Y) = E[(x − E(x))(Y − E(Y))] = Σ_{xY}. Let the OLS estimator be (α̂_OLS, β̂_OLS). Then the population coefficients from an OLS regression of Y on x are

α_OLS = E(Y) − β_OLS^T E(x)  and  β_OLS = Σ_x^{-1} Σ_{xY}.    (10.20)

Let the data be (Y_i, x_i) for i = 1, ..., n. Let the p × 1 vector η = (α, β^T)^T, let X be the n × p OLS design matrix with ith row (1, x_i^T), and let Y = (Y_1, ..., Y_n)^T. Then the OLS estimator η̂ = (X^T X)^{-1} X^T Y. The sample covariance of x is

Σ̂_x = [1/(n − 1)] ∑_{i=1}^n (x_i − x̄)(x_i − x̄)^T   where the sample mean x̄ = (1/n) ∑_{i=1}^n x_i.

Similarly, define the sample covariance of x and Y to be

Σ̂_{xY} = (1/n) ∑_{i=1}^n (x_i − x̄)(Y_i − Ȳ) = (1/n) ∑_{i=1}^n x_i Y_i − x̄ Ȳ.

The first result shows that η̂ is a consistent estimator of η.

i) Suppose that (Y_i, x_i^T)^T are iid random vectors such that Σ_x^{-1} and Σ_{xY} exist. Then

α̂_OLS = Ȳ − β̂_OLS^T x̄  D→  α_OLS

and

β̂_OLS = [n/(n − 1)] Σ̂_x^{-1} Σ̂_{xY}  D→  β_OLS  as n → ∞.
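Result i) is easy to check numerically; note that R's var() and cov() both use the divisor n − 1, so the n/(n − 1) factor from the text (whose Σ̂_{xY} uses the divisor n) cancels:

set.seed(2)
n <- 200; x <- matrix(rnorm(n * 3), n, 3); y <- x %*% c(1, 2, 3) + rnorm(n)
b1 <- lsfit(x, y)$coef[-1]                 # OLS slope estimates
b2 <- solve(var(x)) %*% cov(x, y)          # Sigma_x^{-1} Sigma_xY form
all.equal(as.numeric(b1), as.numeric(b2))  # TRUE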

Some notation is needed for the following results. Many 1D regression models have an error e with

σ² = Var(e) = E(e²).    (10.21)

Let ê be the error residual for e. Let the population OLS residual

v = Y − α_OLS − β_OLS^T x    (10.22)

with

τ² = E[(Y − α_OLS − β_OLS^T x)²] = E(v²),    (10.23)

and let the OLS residual be

r = Y − α̂_OLS − β̂_OLS^T x.    (10.24)

Typically the OLS residual r is not estimating the error e and τ² ≠ σ², but the following results show that the OLS residual is of great interest for 1D regression models.

Assume that a 1D model holds, Y ⫫ x | (α + β^T x), which is equivalent to Y ⫫ x | β^T x. Then under regularity conditions, results ii) – iv) below hold.

ii) Li and Duan (1989): β_OLS = cβ for some constant c.

iii) Li and Duan (1989) and Chen and Li (1998):

√n (β̂_OLS − cβ)  D→  N_{p−1}(0, C_OLS)    (10.25)

where

C_OLS = Σ_x^{-1} E[(Y − α_OLS − β_OLS^T x)²(x − E(x))(x − E(x))^T] Σ_x^{-1}.    (10.26)

iv) Chen and Li (1998): Let A be a known full rank constant k × (p − 1) matrix. If the null hypothesis Ho: Aβ = 0 is true, then

√n (Aβ̂_OLS − cAβ) = √n Aβ̂_OLS  D→  N_k(0, A C_OLS A^T)

and

A C_OLS A^T = τ² A Σ_x^{-1} A^T.    (10.27)

Notice that C_OLS = τ² Σ_x^{-1} if v = Y − α_OLS − β_OLS^T x ⫫ x or if the MLR model holds. If the MLR model holds, τ² = σ².

To create test statistics, the estimator

τ̂² = MSE = [1/(n − p)] ∑_{i=1}^n r_i² = [1/(n − p)] ∑_{i=1}^n (Y_i − α̂_OLS − β̂_OLS^T x_i)²

will be useful. The estimator

Ĉ_OLS = Σ̂_x^{-1} [ (1/n) ∑_{i=1}^n (Y_i − α̂_OLS − β̂_OLS^T x_i)² (x_i − x̄)(x_i − x̄)^T ] Σ̂_x^{-1}    (10.28)

can also be useful. Notice that for general 1D regression models, the OLS MSE estimates τ² rather than the error variance σ².

v) Result iv) suggests that a test statistic for Ho: Aβ = 0 is

W_OLS = n β̂_OLS^T A^T [A Σ̂_x^{-1} A^T]^{-1} A β̂_OLS / τ̂²  D→  χ²_k,    (10.29)

the chi–square distribution with k degrees of freedom.
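A sketch of the W_OLS statistic in (10.29), assuming x is the n × (p − 1) predictor matrix, y the response, and A a k × (p − 1) constant matrix (hypothetical names):

wols <- function(x, y, A) {
  n <- nrow(x); p <- ncol(x) + 1
  fit <- lsfit(x, y)
  b <- fit$coef[-1]                       # betahat_OLS
  tau2 <- sum(fit$resid^2) / (n - p)      # MSE estimates tau^2
  sigx <- var(x)                          # sample covariance of x
  w <- n * t(A %*% b) %*% solve(A %*% solve(sigx) %*% t(A)) %*% (A %*% b) / tau2
  as.numeric(w)                           # compare with qchisq(0.95, nrow(A))
}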

Before presenting the main theoretical result, some results from OLS MLR theory are needed. Let the p × 1 vector η = (α, β^T)^T, let the known k × p constant matrix Ã = [a A] where a is a k × 1 vector, and let c be a known k × 1 constant vector. Following Seber and Lee (2003, p. 99–106), the usual F statistic for testing Ho: Ãη = c is

F_0 = [SSE(Ho) − SSE]/k ÷ [SSE/(n − p)] = (Ãη̂ − c)^T [Ã(X^T X)^{-1}Ã^T]^{-1}(Ãη̂ − c)/(k τ̂²)    (10.30)

where MSE = τ̂² = SSE/(n − p), SSE = ∑_{i=1}^n r_i², and

SSE(Ho) = ∑_{i=1}^n r_i²(Ho)

is the minimum sum of squared residuals subject to the constraint Ãη = c. Recall that if Ho is true, the MLR model holds and the errors e_i are iid N(0, σ²), then F_o ∼ F_{k,n−p}, the F distribution with k and n − p degrees of freedom. Also recall that if Z_n ∼ F_{k,n−p}, then

Z_n  D→  χ²_k/k    (10.31)

as n → ∞.

The main theoretical result of this section is Theorem 10.4 below. This theorem and (10.31) suggest that OLS output, originally meant for testing with the MLR model, can also be used for testing with many 1D regression data sets. Without loss of generality, let the 1D model Y ⫫ x | (α + β^T x) be written as

Y ⫫ x | (α + β_R^T x_R + β_O^T x_O)

where the reduced model is Y ⫫ x | (α + β_R^T x_R) and x_O denotes the terms outside of the reduced model. Notice that the OLS ANOVA F test corresponds to Ho: β = 0 and uses A = I_{p−1}. The tests for Ho: β_i = 0 use A = (0, ..., 0, 1, 0, ..., 0) where the 1 is in the ith position and are equivalent to the OLS t tests. The test Ho: β_O = 0 uses A = [0 I_j] if β_O is a j × 1 vector, and the test statistic (10.30) can be computed with the OLS partial F test: run OLS on the full model to obtain SSE and on the reduced model to obtain SSE(R) ≡ SSE(Ho).

In the theorem below, it is crucial that Ho: Aβ = 0. Tests for Ho: Aβ = 1, say, may not be valid even if the sample size n is large. Also, confidence intervals corresponding to the t tests are for cβ_i, and are usually not very useful when c is unknown.

Theorem 10.4. Assume that a 1D regression model (10.1) holds and that Equation (10.29) holds when Ho: Aβ = 0 is true. Then the test statistic (10.30) satisfies

F_0 = [(n − 1)/(kn)] W_OLS  D→  χ²_k/k

as n → ∞.

Proof. Notice that by (10.29), the result follows if F_0 = (n − 1)W_OLS/(kn). Let Ã = [0 A] so that Ho: Ãη = 0 is equivalent to Ho: Aβ = 0. Following Seber and Lee (2003, p. 106),

(X^T X)^{-1} =
[ 1/n + x̄^T D^{-1} x̄    −x̄^T D^{-1} ]
[ −D^{-1} x̄              D^{-1}      ]    (10.32)

where the (p − 1) × (p − 1) matrix

D^{-1} = [(n − 1) Σ̂_x]^{-1} = Σ̂_x^{-1}/(n − 1).    (10.33)

Using Ã and (10.32) in (10.30) shows that

F_0 = (Aβ̂_OLS)^T [ Ã (X^T X)^{-1} Ã^T ]^{-1} A β̂_OLS/(k τ̂²) = (Aβ̂_OLS)^T [A D^{-1} A^T]^{-1} A β̂_OLS/(k τ̂²),

and the result follows from (10.33) after algebra. QED
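The identity F_0 = (n − 1)W_OLS/(kn) is exact for OLS, so it can be checked numerically with the wols() sketch given after (10.29) (hypothetical names again):

set.seed(3)
n <- 100; x <- matrix(rnorm(n * 4), n, 4); y <- x[, 1] + rnorm(n)
A <- cbind(matrix(0, 2, 2), diag(2))             # Ho: beta_3 = beta_4 = 0, so k = 2
F0 <- anova(lm(y ~ x[, 1:2]), lm(y ~ x))$F[2]    # OLS partial F statistic (10.30)
c(F0, (n - 1) * wols(x, y, A) / (2 * n))         # the two values agree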

For SIR, the theory is more complicated. Following Chen and Li (1998), SIR produces eigenvalues λ̂_i and associated SIR directions β̂_{i,SIR} for i = 1, ..., p − 1. The SIR directions β̂_{i,SIR} for i = 1, ..., d are used for dD regression.

vi) Chen and Li (1998): For a 1D regression and vector A, a test statistic for Ho: Aβ_1 = 0 is

W_S = n β̂_{1,SIR}^T A^T [A Σ̂_x^{-1} A^T]^{-1} A β̂_{1,SIR} / [(1 − λ̂_1)/λ̂_1]  D→  χ²_1.    (10.34)

Ellipsoidal trimming can be used to create outlier resistant DR methods that can give useful results when the assumption of linearly related predictors (10.6) is violated. To perform ellipsoidal trimming, a robust estimator of multivariate location and dispersion (T, C) is computed and used to create the Mahalanobis distances D_i(T, C). The ith case (Y_i, x_i) is trimmed if D_i > D_(j). For example, if j ≈ 0.9n, then about M% = 10% of the cases are trimmed, and a DR method can be computed from the cases that remain.

For theory and outlier resistance, the choice of (T, C) and M are important. Chang and Olive (2007) used the MBA estimator (T_MBA, C_MBA) for (T, C), but we would now use the RFCH estimator because of its combination of speed, robustness and theory. The classical Mahalanobis distance uses (T, C) = (x̄, Σ̂_x). Denote the robust distances by RD_i and the classical distances by MD_i. Then the DD plot of the MD_i versus the RD_i can be used to choose M. The plotted points in the DD plot will follow the identity line with zero intercept and unit slope if the predictor distribution is multivariate normal (MVN), and will follow a line with zero intercept but non–unit slope if the distribution is elliptically contoured with nonsingular covariance matrix but not MVN. Delete M% of the cases with the largest MBA distances so that the remaining cases follow the identity line (or some line through the origin) closely. Let (Y_{Mi}, x_{Mi}) denote the data that was not trimmed where i = 1, ..., n_M. Then apply the DR method on these n_M cases.
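A sketch of the DD plot and ellipsoidal trimming; here cov.rob from the MASS package stands in for the RFCH or MBA estimator used in the text (an assumption), and the function name is hypothetical:

library(MASS)
ddtrim <- function(x, y, M = 10) {
  md <- sqrt(mahalanobis(x, colMeans(x), var(x)))   # classical distances MD_i
  rob <- cov.rob(x)                                 # robust location and dispersion (T, C)
  rd <- sqrt(mahalanobis(x, rob$center, rob$cov))   # robust distances RD_i
  plot(md, rd); abline(0, 1)                        # DD plot
  keep <- rd <= quantile(rd, 1 - M / 100)           # trim the M% largest robust distances
  list(yM = y[keep], xM = x[keep, , drop = FALSE])  # cases kept for the DR method
}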

As long as M is chosen only using the predictors, DR theory will apply if the data (Y_M, x_M) satisfies the regularity conditions. For example, if the MLR model is valid and the errors are iid N(0, σ²), then the OLS estimator

η̂_M = (X_M^T X_M)^{-1} X_M^T Y_M ∼ N_p(η, σ²(X_M^T X_M)^{-1}).

More generally, let β̂_{DM} denote a DR estimator applied to (Y_{Mi}, x_{Mi}) and assume that

√n_M (β̂_{DM} − c_M β)  D→  N_{p−1}(0, C_{DM})

where C_{DM} is nonsingular. Let φ_M = lim_{n→∞} n/n_M. Then

√n (β̂_{DM} − c_M β) = (√n/√n_M) √n_M (β̂_{DM} − c_M β)  D→  N_{p−1}(0, φ_M C_{DM}).    (10.35)

If Ho: Aβ = 0 is true and Ĉ_{DM} is a consistent estimator of C_{DM}, then

W_{DM} = n_M β̂_{DM}^T A^T [A Ĉ_{DM} A^T]^{-1} A β̂_{DM}/τ̂_M²  D→  χ²_k.

Notice that M = 0 corresponds to the full data set and n_0 = n.

The tradeoff is that if low amounts of trimming do not work, then larger amounts of trimming sometimes greatly improve DR methods, but large amounts of trimming will have large efficiency losses if low amounts of trimming work since n/n_M ≥ 1 and the diagonal elements of C_{DM} typically become larger with M.

Trimmed views can also be used to select M ≡ M_TV. If the MLR model holds and OLS is used, then the resulting trimmed views estimator β̂_{M,TV} is √n consistent, but need not be asymptotically normal.

Adaptive trimming can be used to obtain an asymptotically normal estimator that may avoid large efficiency losses. First, choose an initial amount of trimming M_I by using, e.g., the DD plot or trimmed views. Let β̂ denote the first direction of the DR method. Next compute |corr(β̂_M^T x, β̂_{M_I}^T x)| for M = 0, 10, ..., 90 and find the smallest value M_A ≤ M_I such that the absolute correlation is greater than 0.95. If no such value exists, then use M_A = M_I. The resulting adaptive trimming estimator is asymptotically equivalent to the estimator that uses 0% trimming if β̂_0 is a consistent estimator of c_0 β and if β̂_{M_I} is a consistent estimator of c_{M_I} β.
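A sketch of the adaptive trimming rule, assuming betaM(M) returns the first DR direction after M% ellipsoidal trimming (a hypothetical helper, e.g. built from the trimming sketch above):

adaptM <- function(x, betaM, MI) {
  bI <- betaM(MI)                                   # direction at the initial trimming MI
  for (M in seq(0, MI, by = 10)) {                  # smallest M_A <= M_I with high |corr|
    if (abs(cor(x %*% betaM(M), x %*% bI)) > 0.95) return(M)
  }
  MI                                                # no such value: use M_A = M_I
}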

Detecting outlying x is useful for any regression method, and now that effective methods such as RFCH are available, the DD plot should be used routinely. In a small simulation, the clean data Y = (α + β^T x)³ + e where α = 1, β = (1, 0, 0, 0)^T, e ∼ N(0, 1) and x ∼ N_4(0, I_4). The outlier percentage γ was either 0% or 49%. The 2 clusters of outliers were about the same size and had Y ∼ N(0, 1), x ∼ N_4(±10(1, 1, 1, 1)^T, I_4). Table 10.2 records the averages of β̂_i over 100 runs where the DR method used M = 0 or M = 50% trimming. SIR, SAVE and PHD were very similar except when γ = 49 and M = 0. When outliers were present, the average of β̂_{F,50} ≈ c_F(1, 0, 0, 0)^T where c_F depended on the DR method and F was OLS, SIR, SAVE or PHD. The sample size n = 1000 was used although OLS gave reasonable estimates for much smaller sample sizes. The rpack function drsim7 can be used to duplicate the simulation in R.

The following simulations show that ellipsoidal trimming based on the MBA estimator is useful for DR even when no outliers are present.

In the simulations, we used eight types of predictor distributions: d1) x ∼ N_{p−1}(0, I_{p−1}), d2) x ∼ 0.6N_{p−1}(0, I_{p−1}) + 0.4N_{p−1}(0, 25I_{p−1}), d3) x ∼ 0.4N_{p−1}(0, I_{p−1}) + 0.6N_{p−1}(0, 25I_{p−1}), d4) x ∼ 0.9N_{p−1}(0, I_{p−1}) + 0.1N_{p−1}(0, 25I_{p−1}), d5) x ∼ LN(0, I) where the marginals are iid lognormal(0,1), d6) x ∼ MVT_{p−1}(3), d7) x ∼ MVT_{p−1}(5) and d8) x ∼ MVT_{p−1}(19). Here x has a multivariate t distribution x_i ∼ MVT_{p−1}(ν) if x_i = z_i/√(W_i/ν) where z_i ∼ N_{p−1}(0, I_{p−1}) is independent of the chi–square random variable W_i ∼ χ²_ν. Of the eight distributions, only d5) is not elliptically contoured. The MVT distribution gets closer to the multivariate normal (MVN) distribution d1) as ν → ∞. The MVT distribution has first moments for ν ≥ 3 and second moments for ν ≥ 5. See Johnson and Kotz (1972, p. 134-135). All simulations used 1000 runs.

The simulations were for single index models with α = 1. Let the sufficient predictor SP = α + β^T x. Then the seven models considered were m1) Y = SP + e, m2) Y = (SP)² + e, m3) Y = exp(SP) + e, m4) Y = (SP)³ + e, m5) Y = sin(SP)/SP + 0.01e, m6) Y = SP + sin(SP) + 0.1e and m7) Y = √|SP| + 0.1e where e ∼ N(0, 1).

First, coefficient estimation was examined with β = (1, 1, 1, 1)^T, and for OLS the sample standard deviation (SD) of each entry β̂_{Mi,j} of β̂_{M,j} was computed for i = 1, 2, 3, 4 with j = 1, ..., 1000. For each of the 1000 runs, the Chen and Li formula

SE_cl(β̂_{Mi}) = √[ n_M^{-1} (Ĉ_M)_{ii} ]

was computed where

Ĉ_M = Σ̂_{x_M}^{-1} [ (1/n_M) ∑_{i=1}^{n_M} (Y_{Mi} − α̂_M − β̂_M^T x_{Mi})² (x_{Mi} − x̄_M)(x_{Mi} − x̄_M)^T ] Σ̂_{x_M}^{-1}

is the estimate (10.28) applied to (Y_M, x_M). The average of β̂_M and of √n SE_cl were recorded as well as the √n SD of β̂_{Mi,j} under the labels β̂_M, √n SE_cl and √n SD. Under regularity,

√n SE_cl ≈ √n SD ≈ √[ diag(C_M)/(1 − M/100) ]

where C_M is (10.26) applied to (Y_M, x_M).

Table 10.2 DR Coefficient Estimation with Trimming

type   γ   M    β1      β2      β3      β4
SIR    0    0   .0400   .0021  −.0006   .0012
SIR    0   50  −.0201  −.0015   .0014   .0027
SIR   49    0   .0004  −.0029  −.0013   .0039
SIR   49   50  −.0798  −.0014   .0004  −.0015
SAVE   0    0   .0400   .0012   .0010   .0018
SAVE   0   50  −.0201  −.0018   .0024   .0030
SAVE  49    0  −.4292  −.2861  −.3264  −.3442
SAVE  49   50  −.0797  −.0016  −.0006  −.0024
PHD    0    0   .0396  −.0009  −.0071  −.0063
PHD    0   50  −.0200  −.0013   .0024   .0025
PHD   49    0  −.1068  −.1733  −.1856  −.1403
PHD   49   50  −.0795   .0023   .0000  −.0037
OLS    0    0   5.974   .0083  −.0221   .0008
OLS    0   50   4.098   .0166   .0017  −.0016
OLS   49    0   2.269  −.7509  −.7390  −.7625
OLS   49   50   5.647   .0305   .0011   .0053

Table 10.3 OLS Coefficient Estimation with Trimming

m    x    M   β̂_M                   √n SE_cl               √n SD
m2   d1   0   2.00,2.01,2.00,2.00   7.81,7.79,7.76,7.80    7.87,8.00,8.02,7.88
m5   d4   0   −.03,−.03,−.03,−.03   .30,.30,.30,.30        .31,.32,.33,.31
m6   d5   0   1.04,1.04,1.04,1.04   .36,.36,.37,.37        .41,.42,.42,.40
m7   d6  10   .11,.11,.11,.11       .58,.57,.57,.57        .60,.58,.62,.61

For MVN x, MLR and 0% trimming, all three recorded quantities were near (1,1,1,1) for n = 60, 500, and 1000. For 90% trimming and n = 1000, the results were β̂_90 = (1.00, 1.00, 1.01, 0.99), √n SE_cl = (7.56, 7.61, 7.60, 7.54) and √n SD = (7.81, 8.02, 7.76, 7.59), suggesting that β̂_90 is asymptotically normal but inefficient.

For other distributions, results for 0 and 10% trimming were recorded as well as a “good” trimming value M_B. Results are “good” if all of the entries of both β̂_{M_B} and √n SE_cl were approximately equal, and if the theoretical √n SE_cl was close to the simulated √n SD. The results were good for MVN x and all seven models, and the results were similar for n = 500 and n = 1000. The results were good for models m1 and m5 for all eight distributions. Model m6 was good for 0% trimming except for distribution d5 and model m7 was good for 0% trimming except for distributions d5, d6 and d7. Trimming usually helped for models m2, m3 and m4 for distributions d5 – d8. Some results are shown in Table 10.3 for n = 500.

Table 10.4 Rejection Proportions for H0: β = 0

x     n    F      SIR      n    F      SIR
d1   100   0.041  0.057   500   0.050  0.048
d2   100   0.050  0.908   500   0.045  0.930
d3   100   0.047  0.955   500   0.050  0.930
d4   100   0.045  0.526   500   0.048  0.599
d5   100   0.055  0.621   500   0.061  0.709
d6   100   0.042  0.439   500   0.036  0.472
d7   100   0.054  0.214   500   0.047  0.197
d8   100   0.044  0.074   500   0.060  0.077

and all seven models, and the results were similar for n = 500 and n = 1000.The results were good for models m1 and m5 for all eight distributions. Modelm6 was good for 0% trimming except for distribution d5 and model m7 wasgood for 0% trimming except for distributions d5, d6 and d7. Trimmingusually helped for models m2, m3 and m4 for distributions d5 – d8. Someresults are shown in Table 10.3 for n = 500.

For SIR with h = 4 slices, β̂_M was recorded. The SIR results were similar to those for OLS, but often more trimming and larger sample sizes were needed than those for OLS. The results depended on h in that the largest sample sizes were needed for 2 slices and then for 3 slices.

Next testing was considered. Let F_M and W_M denote the OLS and SIR statistics (10.30) and (10.34) applied to the n_M cases (Y_M, x_M) that remained after trimming. Ho was rejected for OLS if F_M > F_{k,n_M−p}(0.95) and for SIR if W_M > χ²_k(0.95). For SIR, 2 slices were used since using more than h = 2 slices rejected Ho too often. As h increased from 2 to 3 to 4, λ̂_1 and the SIR chi–square test statistic W_0 rapidly increased. For h > 4 the increase was much slower.

For testing the nominal level was 0.05, and we recorded the proportion p of runs where Ho was rejected. Since 1000 runs were used, the count 1000p ∼ binomial(1000, 1 − δ_n) where 1 − δ_n converges to the true large sample level 1 − δ. The standard error for the proportion is √(p(1 − p)/1000) ≈ 0.0069 for p = 0.05. An observed coverage p ∈ (0.03, 0.07) suggests that there is no reason to doubt that the true level is 0.05.

Suppose a 1D model holds but Y ⫫ x. Then the Y_i are iid and the model reduces to Y = E(Y) + e = cα + e where e = Y − E(Y). As a special case, if Y = m(α + β^T x) + e and if Y ⫫ x, then Y = m(α) + e. For the corresponding test H0: β = 0 versus H1: β ≠ 0, the OLS F statistic (10.30) and SIR W statistic (10.34) are invariant with respect to a constant. This test is interesting since if Ho holds, then the results do not depend on the 1D model (10.1), but only on the distribution of x and the distribution of e. Since β_OLS = cβ, power can be good if c ≠ 0. The OLS test is equivalent to the ANOVA F test from MLR of Y on x. Under H0, the test should perform well provided that the design matrix is nonsingular and the error distribution and sample size are such that the central limit theorem holds. Table 10.4 shows the results for OLS and SIR for n = 100, 500 and for the eight different distributions. Since the true model was linear and normal, the exact OLS level is 0.05 even for n = 10. Table 10.4 shows that OLS performed as expected while SIR only gave good results for MVN x.

Table 10.5 Rejection Proportions for Ho: β2 = 0

m  x  Test   70    60    50    40    30    20    10     0   ADAP
1  1   F    .061  .056  .062  .051  .046  .050  .044  .043  .043
1  1   W    .007  .013  .015  .020  .027  .032  .045  .056  .056
5  1   F    .019  .023  .019  .019  .020  .022  .027  .037  .029
5  1   W    .002  .003  .006  .005  .010  .014  .025  .055  .026
2  2   F    .023  .024  .026  .070  .183  .182  .142  .166  .040
2  2   W    .007  .010  .021  .067  .177  .328  .452  .576  .050
4  3   F    .027  .058  .096  .081  .071  .057  .062  .123  .120
4  3   W    .028  .069  .152  .263  .337  .378  .465  .541  .539
6  4   F    .026  .024  .030  .032  .028  .044  .051  .088  .088
6  4   W    .012  .009  .013  .016  .030  .040  .076  .386  .319
7  5   F    .058  .058  .053  .054  .046  .044  .051  .037  .037
7  5   W    .001  .000  .005  .005  .034  .080  .118  .319  .250
3  6   F    .021  .024  .019  .025  .025  .034  .080  .374  .036
3  6   W    .003  .008  .007  .021  .019  .041  .084  .329  .264
6  7   F    .027  .032  .023  .041  .047  .053  .052  .055  .055
6  7   W    .007  .006  .013  .022  .019  .025  .054  .176  .169

Next the test Ho: β_2 = 0 was considered. The OLS test is equivalent to the t test from MLR of Y on x. The true model used α = 1 and β = (1, 0, 1, 1)^T. To simulate adaptive trimming, |corr(β̂_M^T x, β^T x)| was computed for M = 0, 10, ..., 90 and the initial trimming proportion M_I maximized this correlation. This process should be similar to choosing the best trimmed view by examining 10 plots. The rejection proportions were recorded for M = 0, ..., 90 and for adaptive trimming. The seven models, eight distributions and sample sizes n = 60, 150, and 500 were used. Table 10.5 shows some results for n = 150.

For OLS, the test that used adaptive trimming had proportions ≤ 0.072 except for model m4 with distributions d2, d3, d4, d6, d7 and d8; m2 with d4, d6 and d7 for n = 500 and d6 with n = 150; m6 with d4 and n = 60, 150; m5 with d7 and n = 500 and m7 with d7 and n = 500. With the exception of m4, when the adaptive p > 0.072, then 0% trimming had a rejection proportion near 0.1. Occasionally adaptive trimming was conservative with p < 0.03. The 0% trimming worked well for m1 and m6 for all eight distributions and for d1 and d5 for all seven models. Models m2 and m3 usually benefited from adaptive trimming. For distribution d1, the adaptive and 0% trimming methods had identical p for n = 500 except for m3 where the values were 0.038 and 0.042. Chang (2006) has much more extensive tables.

For SIR, results were not as good. Adaptive trimming worked about as often as it failed, and failed for model m1. 0% trimming performed well for all seven models for the MVN distribution d1, and there was always an M such that W_M did not reject Ho too often.

10.6 Complements

Introductions to 1D regression and regression graphics are Cook and Weisberg (1999a, ch. 18, 19, and 20) and Cook and Weisberg (1999b), while Olive (2010) considers 1D regression. More advanced treatments are Cook (1998a) and Li (2000). Important papers include Brillinger (1977, 1983) and Li and Duan (1989). Formal testing procedures for the single index model are given by Simonoff and Tsai (2002) and Gao and Liang (1997). Li (1997) shows that OLS F tests can be asymptotically valid for model (10.2) if x is multivariate normal and Σ_x^{-1} Σ_{xY} ≠ 0.

Let η = (α, β^T)^T. Then the ith Cook's distance

CD_i = (Ŷ_(i) − Ŷ)^T (Ŷ_(i) − Ŷ) / (p σ̂²) = ‖ESP_(i) − ESP‖² / [(p + 1) MSE]    (10.36)

where ESP_(i) = X^T η̂_(i) and the estimated sufficient predictor ESP = X^T η̂ estimates d x_j^T η for some constant d and j = 1, ..., n. This fact suggests that Cook's distances and MD_i² still give useful information on cases that influence the estimated sufficient summary plot although MSE is estimating E(r²) = E[(Y − α_OLS − x^T β_OLS)²] = τ².

There are many ways to estimate 1D models, including maximum likelihood for parametric models. The literature for estimating cβ when model (10.2) holds is growing, and Cook and Li (2002) summarize when competing methods such as ordinary least squares (OLS), sliced inverse regression (SIR), principal Hessian directions (PHD), and sliced average variance estimation (SAVE) can fail. All four methods frequently perform well if there are no strong nonlinearities present in the predictors. Cook and Ni (2005) provide theory for inverse regression methods such as SAVE. Further information about these and related methods can be found, for example, in Brillinger (1977, 1983), Chen and Li (1998), Cook (1998ab, 2004), Cook and Critchley (2000), Cook and Weisberg (1991, 1999ab), Li (1991, 1992, 2000) and Li and Zhu (2007).

Several papers have suggested that outliers and strong nonlinearities need to be removed from the predictors. See Brillinger (1991), Cook (1998a, p. 152), Cook and Nachtsheim (1994), Heng-Hui (2001), Li and Duan (1989, p. 1011, 1041, 1042) and Li (1991, p. 319). Outlier resistant methods for general methods such as SIR are less common, but see Gather, Hilker and Becker (2001, 2002) (where FMCD should be replaced by RFCH) and Čížek and Härdle (2006). Trimmed views were introduced by Olive (2002, 2004b).

Section 10.4 follows Olive and Hawkins (2005) closely. The literature on numerical methods for variable selection in the OLS multiple linear regression model is enormous, and the literature for other given 1D regression models is also growing. See Naik and Tsai (2001) and Kong and Xia (2007).

Section 10.5 followed Chang and Olive (2007, 2010) closely. More examples and many more simulations are in Chang (2006). Severini (1998) discusses when OLS output is relevant for the Gaussian additive error single index model. Also see Yoo, Patterson and Datta (2009).

The mussel data set is included as the file mussel.lsp in the Arc software and can be obtained from the web site (http://www.stat.umn.edu/arc/). The Boston housing data can be obtained from the STATLIB website (http://lib.stat.cmu.edu/datasets/boston). Both data sets can be obtained from the text website.

10.7 Problems

10.1. Refer to Definition 10.3 for the Cox and Snell (1968) definition for residuals, but replace η by β.

a) Find ê_i if Y_i = µ + e_i and T(Y) is used to estimate µ.
b) Find ê_i if Y_i = x_i^T β + e_i.
c) Find ê_i if Y_i = β_1 exp[β_2(x_i − x̄)] e_i where the e_i are iid exponential(1) random variables and x̄ is the sample mean of the x_i's.
d) Find ê_i if Y_i = x_i^T β + e_i/√w_i.

10.2∗. (Aldrin, Bølviken, and Schweder 1993). Suppose

Y = m(β^T x) + e    (10.37)

where m is a possibly unknown function and the zero mean errors e are independent of the predictors. Let z = β^T x and let w = x − E(x). Let Σ_{x,Y} = Cov(x, Y), and let Σ_x = Cov(x) = Cov(w). Let r = w − (Σ_x β)β^T w.

a) Recall that Cov(x, Y) = E[(x − E(x))(Y − E(Y))^T] and show that Σ_{x,Y} = E(wY).

b) Show that E(wY) = Σ_{x,Y} = E[(r + (Σ_x β)β^T w) m(z)] = E[m(z)r] + E[β^T w m(z)] Σ_x β.

c) Using β_OLS = Σ_x^{-1} Σ_{x,Y}, show that β_OLS = c(x)β + u(x) where the constant c(x) = E[β^T(x − E(x)) m(β^T x)] and the bias vector u(x) = Σ_x^{-1} E[m(β^T x)r].

d) Show that E(wz) = Σ_x β. (Hint: Use E(wz) = E[(x − E(x))x^T β] = E[(x − E(x))(x^T − E(x^T) + E(x^T))β].)

e) Assume m(z) = z. Using d), show that c(x) = 1 if β^T Σ_x β = 1.

f) Assume that β^T Σ_x β = 1. Show that E(zr) = E(rz) = 0. (Hint: Find E(rz) and use d).)

g) Suppose that β^T Σ_x β = 1 and that the distribution of x is multivariate normal. Then the joint distribution of z and r is multivariate normal. Using the fact that E(zr) = 0, show Cov(r, z) = 0 so that z and r are independent. Then show that u(x) = 0.

(Note: the assumption β^T Σ_x β = 1 can be made without loss of generality since if β^T Σ_x β = d² > 0 (assuming Σ_x is positive definite), then y = m(d(β/d)^T x) + e ≡ m_d(η^T x) + e where m_d(u) = m(du), η = β/d and η^T Σ_x η = 1.)

10.3. Suppose that you have a statistical model where both fitted values and residuals can be obtained. For example this is true for time series and for nonparametric regression models such as Y = f(x_1, ..., x_p) + e where Ŷ = f̂(x_1, ..., x_p) and the residual ê = Y − f̂(x_1, ..., x_p). Suggest graphs for variable selection for such models.

Output for Problem 10.4.

BEST SUBSET REGRESSION MODELS FOR CRIM

(A)LogX2 (B)X3 (C)X4 (D)X5 (E)LogX7 (F)X8 (G)LogX9 (H)LogX12

3 "BEST" MODELS FROM EACH SUBSET SIZE LISTED.

ADJUSTED

k CP R SQUARE R SQUARE RESID SS MODEL VARIABLES

-- ----- -------- -------- --------- ---------------

1 379.8 0.0000 0.0000 37363.2 INTERCEPT ONLY

2 36.0 0.3900 0.3913 22744.6 F

2 113.2 0.3025 0.3039 26007.8 G


2 191.3 0.2140 0.2155 29310.8 E

3 21.3 0.4078 0.4101 22039.9 E F

3 25.0 0.4036 0.4059 22196.7 F H

3 30.8 0.3970 0.3994 22442.0 D F

4 17.5 0.4132 0.4167 21794.9 C E F

4 18.1 0.4125 0.4160 21821.0 E F H

4 18.8 0.4117 0.4152 21850.4 A E F

5 10.2 0.4226 0.4272 21402.3 A E F H

5 10.8 0.4219 0.4265 21427.7 C E F H

5 12.0 0.4206 0.4252 21476.6 A D E F

6 5.7 0.4289 0.4346 21125.8 A C E F H

6 9.3 0.4248 0.4305 21279.1 A C D E F

6 10.3 0.4237 0.4294 21319.9 A B E F H

7 6.3 0.4294 0.4362 21065.0 A B C E F H

7 6.3 0.4294 0.4362 21066.3 A C D E F H

7 7.7 0.4278 0.4346 21124.3 A C E F G H

8 7.0 0.4297 0.4376 21011.8 A B C D E F H

8 8.3 0.4283 0.4362 21064.9 A B C E F G H

8 8.3 0.4283 0.4362 21065.8 A C D E F G H

9 9.0 0.4286 0.4376 21011.8 A B C D E F G H

10.4. The output above is for the Boston housing data from software that does all subsets variable selection. The full model is a 1D transformation model with response variable Y = CRIM and uses a constant and variables A, B, C, D, E, F, G and H. (Using log(CRIM) as the response would give an MLR model.) From this output, what is the best submodel? Explain briefly.

10.5∗. a) Show that Cp(I) ≤ 2k if and only if F_I ≤ p/(p − k).

b) Using (10.19), find E(Cp) and Var(Cp) assuming that an MLR model is appropriate and that Ho (the reduced model I can be used) is true.

c) Using (10.19), Cp(I_full) = p and the notation in Section 10.4, show that

Cp(I_i) = Cp(I_full) + (t_i² − 2).

R Problems Some R code for homework problems is at (www.math.siu.edu/olive/rsashw.txt).

Warning: Use a command like source(“G:/rpack.txt”) to download the programs. See the Preface or Section 11.1. Typing the name of the rpack function, e.g. trviews, will display the code for the function. Use the args command, e.g. args(trviews), to display the needed arguments for the function.


10.6. Use the following R commands to make 100 N3(0, I3) cases and 100 trivariate non-EC cases.

n3x <- matrix(rnorm(300),nrow=100,ncol=3)

ln3x <- exp(n3x)

In R, type the command library(MASS).

a) Use the commands pairs(n3x) and pairs(ln3x) and include both scatterplot matrices in Word. (Click on the plot and hit Ctrl and c at the same time. Then go to file in the Word menu and select paste.) Are strong nonlinearities present among the MVN predictors? How about the non-EC predictors? (Hint: a box or ball shaped plot is linear.)

b) Make a single index model and the sufficient summary plot with the following commands

ncy <- (n3x%*%1:3)^3 + 0.1*rnorm(100)

plot(n3x%*%(1:3),ncy)

and include the plot in Word.

c) The command trviews(n3x, ncy) will produce ten plots. To advance the plots, click on the rightmost mouse button (and in R select stop) to advance to the next plot. The last plot is the OLS view. Include this plot in Word.

d) After all 10 plots have been looked at, the output will show 10 estimated predictors. The last estimate is the OLS (least squares) view and might look like

Intercept X1 X2 X3

4.417988 22.468779 61.242178 75.284664

If the OLS view is a good estimated sufficient summary plot, then the plot created from the command (leave out the intercept)

plot(n3x%*%c(22.469,61.242,75.285),n3x%*%1:3)

should cluster tightly about some line. Your linear combination will be different than the one used above. Using your OLS view, include the plot using the command above (but with your linear combination) in Word. Was this plot linear? Did some of the other trimmed views seem to be better than the OLS view, that is, did one of the trimmed views seem to have a smooth mean function with a smaller variance function than the OLS view?

e) Now type the R command

lncy <- (ln3x%*%1:3)^3 + 0.1*rnorm(100)

Use the command trviews(ln3x,lncy) to find the best view with a smooth mean function and the smallest variance function. This view should not be the OLS view. Include your best view in Word.

f) Get the linear combination from your view, say (94.848, 216.719, 328.444)^T, and obtain a plot with the command


plot(ln3x%*%c(94.848,216.719,328.444),ln3x%*%1:3).

Include the plot in Word. If the plot is linear with high correlation, then your response plot in e) should be good.

10.7. (At the beginning of your R session, use the source(“G:/rpack.txt”) command (and library(MASS) in R).)

a) Perform the commands

> nx <- matrix(rnorm(300),nrow=100,ncol=3)

> lnx <- exp(nx)

> SP <- lnx%*%1:3

> lnsincy <- sin(SP)/SP + 0.01*rnorm(100)

For parts b), c) and d) below, to make the best trimmed view with trviews, ctrviews or lmsviews, you may need to use the function twice. The first view trims 90% of the data, the next view trims 80%, etc. The last view trims 0% and is the OLS view (or lmsreg view). Remember to advance the view with the rightmost mouse button (and in R, highlight “stop”). Then click on the plot and next simultaneously hit Ctrl and c. This makes a copy of the plot. Then in Word, use the menu commands “Copy>paste.”

b) Find the best trimmed view with OLS and covfch with the following commands and include the view in Word.

> trviews(lnx,lnsincy)

(With trviews, suppose that 40% trimming gave the best view. Then instead of using the procedure above b), you can use the command

> essp(lnx,lnsincy,M=40)

to make the best trimmed view. Then click on the plot and next simultaneously hit Ctrl and c. This makes a copy of the plot. Then in Word, use the menu commands “Copy>paste”. Click the rightmost mouse button (and in R, highlight “stop”) to return to the command prompt.)

c) Find the best trimmed view with OLS and (x̄, S) using the following commands and include the view in Word. See the paragraph above b).

> ctrviews(lnx,lnsincy)

d) Find the best trimmed view with lmsreg and cov.mcd using the following commands and include the view in Word. See the paragraph above b).

> lmsviews(lnx,lnsincy)

e) Which method or methods gave the best response plot? Explain briefly.

10.8. Warning: this problem may take too much time. This problem is like Problem 10.7 but uses many more single index models.

a) Make some prototype functions with the following commands.


> nx <- matrix(rnorm(300),nrow=100,ncol=3)

> SP <- nx%*%1:3

> ncuby <- SP^3 + rnorm(100)

> nexpy <- exp(SP) + rnorm(100)

> nlinsy <- SP + 4*sin(SP) + 0.1*rnorm(100)

> nsincy <- sin(SP)/SP + 0.01*rnorm(100)

> nsiny <- sin(SP) + 0.1*rnorm(100)

> nsqrty <- sqrt(abs(SP)) + 0.1*rnorm(100)

> nsqy <- SP^2 + rnorm(100)

b) Make sufficient summary plots similar to Figures 10.1 and 10.2 with the following commands and include both plots in Word.

> plot(SP,ncuby)

> plot(-SP,ncuby)

c) Find the best trimmed view with the following commands (first type library(MASS) if you are using R). Include the view in Word.

> trviews(nx,ncuby)

You may need to use the function twice. The first view trims 90% of the data, the next view trims 80%, etc. The last view trims 0% and is the OLS view. Remember to advance the view with the rightmost mouse button (and in R, highlight “stop”). Suppose that 40% trimming gave the best view. Then use the command

> essp(nx,ncuby, M=40)

to make the best trimmed view. Then click on the plot and next simultaneously hit Ctrl and c. This makes a copy of the plot. Then in Word, use the menu commands “Copy>paste”.

d) To make a plot like Figure 10.5, use the following commands. Let tem = β̂ obtained from the trviews output. In Example 10.2 (continued), tem can be obtained with the following command.

> tem <- c(12.60514, 25.06613, 37.25504)

Include the plot in Word.

> ESP <- nx%*%tem

> plot(ESP,SP)

e) Repeat b), c) and d) with the following data sets.
i) nx and nexpy
ii) nx and nlinsy
iii) nx and nsincy
iv) nx and nsiny
v) nx and nsqrty
vi) nx and nsqy
Enter the following commands to do parts vii) to x).


> lnx <- exp(nx)

> SP <- lnx%*%1:3

> lncuby <- (SP/3)^3 + rnorm(100)

> lnlinsy <- SP + 10*sin(SP) + 0.1*rnorm(100)

> lnsincy <- sin(SP)/SP + 0.01*rnorm(100)

> lnsiny <- sin(SP/3) + 0.1*rnorm(100)

> ESP <- lnx%*%tem

vii) lnx and lncuby
viii) lnx and lnlinsy
ix) lnx and lnsincy
x) lnx and lnsiny

10.9. Warning: this problem may take too much time. Repeat Problem 10.8 but replace trviews with a) lmsviews, b) symviews (that creates views that sometimes work even when symmetry is present), c) ctrviews and d) sirviews.

Except for part a), the essp command will not work. Instead, for the best trimmed view, click on the plot and next simultaneously hit Ctrl and c. This makes a copy of the plot. Then in Word, use the menu commands “Copy>paste”.


Chapter 11

Stuff for Students

11.1 R

R is the free version of Splus available from the CRAN website (https://cran.r-project.org/). As of January 2019, the author's personal computer has Version 3.3.1 (June 21, 2016) of R.

Downloading the book's files into R

Many of the homework problems use R functions contained in the book's website (http://lagrange.math.siu.edu/Olive/linmodbk.htm) under the file name linmodpack.txt. The following two R commands can be copied and pasted into R from near the top of the file (http://lagrange.math.siu.edu/Olive/linmodrhw.txt).

Downloading the book's R functions linmodpack.txt and data files linmoddata.txt into R: the commands

source("http://lagrange.math.siu.edu/Olive/linmodpack.txt")

source("http://lagrange.math.siu.edu/Olive/linmoddata.txt")

can be used to download the R functions and data sets into R. Type ls(). Nearly 10 R functions from linmodpack.txt should appear. In R, enter the command q(). A window asking “Save workspace image?” will appear. Click on No to remove the functions from the computer (clicking on Yes saves the functions in R, but the functions and data are easily obtained with the source commands).

Citing packages

We will use R packages often in this book. The following R command is useful for citing the Mevik et al. (2015) pls package.

citation("pls")

To cite package pls in publications use:


Bjørn-Helge Mevik, Ron Wehrens and Kristian Hovde Liland (2015).
pls: Partial Least Squares and Principal Component Regression.
R package version 2.5-0. https://CRAN.R-project.org/package=pls

Other packages cited in this book include MASS and class: both from Venables and Ripley (2010), glmnet: Friedman et al. (2015), and leaps: Lumley (2009).

This section gives tips on using R, but is no replacement for books such as Becker et al. (1988), Crawley (2005, 2013), Fox and Weisberg (2010), or Venables and Ripley (2010). Also see Mathsoft (1999ab) and use the website (www.google.com) to search for useful websites. For example, enter the search words R documentation.

The command q() gets you out of R.
Least squares regression can be done with the function lsfit or lm.
The commands help(fn) and args(fn) give information about the function fn, e.g. if fn = lsfit.
Type the following commands.

x <- matrix(rnorm(300),nrow=100,ncol=3)

y <- x%*%1:3 + rnorm(100)

out<- lsfit(x,y)

out$coef

ls.print(out)

The first line makes a 100 by 3 matrix x with N(0,1) entries. The second line makes y[i] = 0 + 1∗x[i, 1] + 2∗x[i, 2] + 3∗x[i, 3] + e where e is N(0,1). The term 1:3 creates the vector (1, 2, 3)^T and the matrix multiplication operator is %∗%. The function lsfit will automatically add the constant to the model. Typing "out" will give you a lot of irrelevant information, but out$coef and out$resid give the OLS coefficients and residuals respectively.
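The same fit can be obtained with the function lm. A minimal sketch (not from the text), reusing the x and y simulated above:

outlm <- lm(y ~ x)  # lm also adds the constant to the model
coef(outlm)         # should essentially agree with out$coef from lsfit
summary(outlm)      # coefficient table with standard errors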

To make a residual plot, type the following commands.

fit <- y - out$resid

plot(fit,out$resid)

title("residual plot")

The first term in the plot command is always the horizontal axis while the second is on the vertical axis.

To put a graph in Word, hold down the Ctrl and c buttons simultaneously. Then select "Paste" from the Word menu, or hit Ctrl and v at the same time.

To enter data, open a data set in Notepad or Word. You need to know the number of rows and the number of columns. Assume that each case is entered in a row. For example, assuming that the file cyp.lsp has been saved on your flash drive from the webpage for this book, open cyp.lsp in Word. It has 76 rows and 8 columns. In R, write the following command.


cyp <- matrix(scan(),nrow=76,ncol=8,byrow=T)

Then copy the data lines from Word and paste them in R. If a cursor does not appear, hit enter. The command dim(cyp) will show if you have entered the data correctly.

Enter the following commands

cypy <- cyp[,2]

cypx<- cyp[,-c(1,2)]

lsfit(cypx,cypy)$coef

to produce the output below.

Intercept X1 X2 X3

205.40825985 0.94653718 0.17514405 0.23415181

X4 X5 X6

0.75927197 -0.05318671 -0.30944144

Making functions in R is easy.

For example, type the following commands.

mysquare <- function(x){

# this function squares x

r <- x^2

r }

The second line in the function shows how to put comments into functions.
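A quick check of the function (an illustrative example, not from the text):

mysquare(3)    # returns 9
mysquare(1:4)  # returns 1 4 9 16 since ^ is vectorized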

Modifying your function is easy.

Use the fix command.

fix(mysquare)

This will open an editor such as Notepad and allow you to make changes. (In Splus, the command Edit(mysquare) may also be used to modify the function mysquare.)

To save data or a function in R, when you exit, click on Yes when the "Save workspace image?" window appears. When you reenter R, type ls(). This will show you what is saved. You should rarely need to save anything for this book. To remove unwanted items from the workspace, e.g. x, type rm(x).
pairs(x) makes a scatterplot matrix of the columns of x,
hist(y) makes a histogram of y,
boxplot(y) makes a boxplot of y,
stem(y) makes a stem and leaf plot of y,
scan(), source(), and sink() are useful on a Unix workstation.
To type a simple list, use y <- c(1,2,3.5).
The commands mean(y), median(y), var(y) are self explanatory.
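A short sketch illustrating several of these commands with made-up data:

y <- c(1, 2, 3.5, 7, 2.2)
mean(y); median(y); var(y)
hist(y)      # histogram of y
boxplot(y)   # boxplot of y
stem(y)      # stem and leaf plot of y
x <- matrix(rnorm(300), nrow = 100, ncol = 3)
pairs(x)     # scatterplot matrix of the columns of x
rm(x)        # remove x from the workspace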


The following commands are useful for a scatterplot created by the command plot(x,y).
lines(x,y), lines(lowess(x,y,f=.2))
identify(x,y)
abline(out$coef), abline(0,1)
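A minimal sketch with simulated data (not from the text) showing these commands together:

x <- rnorm(100)
y <- x + rnorm(100, sd = 0.3)
out <- lsfit(x, y)
plot(x, y)
lines(lowess(x, y, f = 0.2))  # scatterplot smoother
abline(out$coef)              # fitted OLS line
abline(0, 1)                  # identity line
# identify(x, y)  # then click on points to label them and stop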

The usual arithmetic operators are 2 + 4, 3 − 7, 8 ∗ 4, 8/4, and 2^10.

The ith element of vector y is y[i] while the ij element of matrix x is x[i, j]. The second row of x is x[2, ] while the 4th column of x is x[, 4]. The transpose of x is t(x).

The command apply(x,1,fn) will compute the row means if fn = mean. The command apply(x,2,fn) will compute the column variances if fn = var. The commands cbind and rbind combine column vectors or row vectors with an existing matrix or vector of the appropriate dimension.
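A small example of apply, cbind, and rbind with a made-up matrix:

x <- matrix(1:12, nrow = 4, ncol = 3)
apply(x, 1, mean)     # row means
apply(x, 2, var)      # column variances
cbind(x, 13:16)       # appends a column
rbind(x, c(0, 0, 0))  # appends a row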

Getting information about a library in R
In R, a library is an add-on package of R code. The command library() lists all available libraries, and information about a specific library, such as leaps for variable selection, can be found, e.g., with the command library(help=leaps).

Downloading a library into R
Many researchers have contributed a library or package of R code that can be downloaded for use. To see what is available, go to the website (http://cran.us.r-project.org/) and click on the Packages icon.

Following Crawley (2013, p. 8), you may need to "Run as administrator" before you can install packages (right click on the R icon to find this). Then use the following command to install the glmnet package.

install.packages("glmnet")

Open R and type the following command.

library(glmnet)

Next type help(glmnet) to make sure that the library is available for use.

Warning: R is free but not fool proof. If you have an old version of R and want to download a library, you may need to update your version of R. The libraries for robust statistics may be useful for outlier detection, but the methods have not been shown to be consistent or high breakdown. All software has some bugs. For example, Version 1.1.1 (August 15, 2000) of R had a random generator for the Poisson distribution that produced variates with too small of a mean θ for θ ≥ 10. Hence simulated 95% confidence intervals might contain θ 0% of the time. This bug seems to have been fixed in Versions 2.4.1 and later. Also, some functions in lregpack may no longer work in new versions of R.


11.2 Hints for Selected Problems

Chapter 1

1.1 a) Sort each column, then find the median of each column. Then MED(W) = (1430, 180, 120)^T.

b) The sample mean of (X1, X2, X3)^T is found by finding the sample mean of each column. Hence x̄ = (1232.8571, 168.00, 112.00)^T.

1.9. See Example 1.7.
1.10 a) X2 ∼ N(100, 6).

b) (X1, X3)^T ∼ N2((49, 17)^T, Σ) where the covariance matrix Σ has rows (3, −1) and (−1, 4).

c) X1 ⊥⊥ X4 and X3 ⊥⊥ X4.

d) ρ(X1, X3) = Cov(X1, X3)/√(VAR(X1) VAR(X3)) = −1/(√3 √4) = −0.2887.

1.11 a) Y |X ∼ N(49, 16) since Y ⊥⊥ X. (Or use E(Y |X) = μY + Σ12 Σ22^{-1}(X − μX) = 49 + 0(1/25)(X − 100) = 49 and VAR(Y |X) = Σ11 − Σ12 Σ22^{-1} Σ21 = 16 − 0(1/25)0 = 16.)
b) E(Y |X) = μY + Σ12 Σ22^{-1}(X − μX) = 49 + 10(1/25)(X − 100) = 9 + 0.4X.
c) VAR(Y |X) = Σ11 − Σ12 Σ22^{-1} Σ21 = 16 − 10(1/25)10 = 16 − 4 = 12.
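The arithmetic in parts b) and c) can be checked in R. A minimal sketch using the values appearing in the hint (μY = 49, μX = 100, Σ11 = 16, Σ12 = 10, Σ22 = 25):

muY <- 49; muX <- 100
S11 <- 16; S12 <- 10; S22 <- 25
c(muY - (S12/S22)*muX, S12/S22)  # intercept 9 and slope 0.4 of E(Y|X)
S11 - S12*(1/S22)*S12            # VAR(Y|X) = 12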

1.13 The proof is identical to that given in Example 3.2. (In addition, it is fairly simple to show that M1 = M2 ≡ M. That is, M depends on Σ but not on c or g.)

1.19 Σ B = E[E(X|B^T X) X^T B] = E(M_B B^T X X^T B) = M_B B^T Σ B. Hence M_B = Σ B (B^T Σ B)^{-1}.

1.26 a) N2((3, 2)^T, Σ) where the covariance matrix Σ has rows (3, 1) and (1, 2).

b) X2 ⊥⊥ X4 and X3 ⊥⊥ X4.

c) σ13/√(σ11 σ33) = 1/(√2 √3) = 1/√6 = 0.4082.

Chapter 3

3.7 Note that Z_A^T Z_A = Z^T Z, that G_A η_A is the stacked vector with top block Gη and bottom block √(λ*_2) η, and that Z_A^T G_A η_A = Z^T Gη. Then

RSS(η_A) = ‖Z_A − G_A η_A‖_2^2 = (Z_A − G_A η_A)^T (Z_A − G_A η_A)
= Z_A^T Z_A − Z_A^T G_A η_A − η_A^T G_A^T Z_A + η_A^T G_A^T G_A η_A
= Z^T Z − Z^T Gη − η^T G^T Z + η^T G^T Gη + λ*_2 η^T η.

Thus

Q_N(η_A) = Z^T Z − Z^T Gη − η^T G^T Z + η^T G^T Gη + λ*_2 η^T η + γ ‖η_A‖_1
= ‖Z − Gη‖_2^2 + λ*_2 ‖η‖_2^2 + (λ*_1/√(1 + λ*_2)) ‖η_A‖_1
= RSS(η) + λ*_2 ‖η‖_2^2 + λ*_1 ‖η‖_1 = Q(η). □
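The augmented-data identity above can be checked numerically in the special case λ*_1 = 0 (no L1 penalty, no intercept), where ordinary least squares on the augmented data reproduces the ridge regression solution. A minimal sketch with simulated data (not from the text):

set.seed(1)
n <- 50; p <- 5; lam <- 2
G <- matrix(rnorm(n*p), n, p)
Z <- G %*% rep(1, p) + rnorm(n)
bridge <- solve(t(G) %*% G + lam*diag(p), t(G) %*% Z)  # ridge solution
GA <- rbind(G, sqrt(lam)*diag(p))                      # augmented predictors
ZA <- c(Z, rep(0, p))                                  # augmented response
baug <- lsfit(GA, ZA, intercept = FALSE)$coef
max(abs(bridge - baug))                                # essentially 0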

11.3 Projects

Straightforward Projects
1) Bootstrap OLS and forward selection with Cp as in Table 2.2, but use more values of n, p, k, ψ, and error distributions. See some R code for Problem 3.12.

2) Bootstrap OLS and forward selection with BIC in a manner similar to bootstrapping OLS and forward selection with Cp as in Table 2.2, but use more values of n, p, k, ψ, and error distributions. The slpack functions bicboot and bicbootsim are useful.

Harder Projects
1) Compare the Bickel and Ren (2001) bootstrap confidence region (2.21) with the prediction region method bootstrap confidence region (2.22) on a problem. For example, for OLS or forward selection testing H0 : β0 = 0.

2) For multiple linear regression, shrinkage estimators often shrink β̂ and the ESP too much. See Figure 1.9b for ridge regression. Suppose Y = β1 + β2 x2 + · · · + β101 x101 + e = x2 + e with n = 100 and p = 101. This model is sparse and lasso performs well, similar to Figure 1.9a. Ridge regression shrinks too much, but Ŷ is highly correlated with Y. This suggests regressing Y on Ŷ to get Y = a + b Ŷ + ε. Then the corrected fit is Ŷ2 = X β̂2 where β̂i,2 = b β̂i,M for i = 2, ..., p and β̂1,2 = a + b β̂1,M, where M is the shrinkage method such as ridge regression. If b ≈ 1 or if the response plot using shrinkage method M looks good (the plotted points are linear and cover the identity line), then the improvement is not needed.

This technique greatly improves the appearance of the response plot and the prediction intervals on the training data. Investigate whether the technique improves the prediction intervals on test data. Consider automating the procedure by using the improvement if H0: b = 1 versus H1: b ≠ 1 is rejected, e.g. if 1 is not in the CI b̂ ± 2SE(b̂). Some R code is shown below, and a short sketch of the b = 1 check follows the code.

(It may be possible to improve shrinkage estimators for regression models such as Poisson regression. For Poisson regression, we would want exp(a + b β̂_M^T x) to track Y well.)

#Possible way to correct shrinkage estimator

#underfitting.

#The response plot looks much better, but is the idea

#useful for prediction? Usually x1 was x2 in

#the formula Y = 0 + x1 + e.

#The corrected version has "x1" coef approx 0.48.

library(glmnet)

set.seed(13)

par(mfrow=c(2,1))

x <- matrix(rnorm(10000),nrow=100,ncol=100)

Y <- x[,1] + rnorm(100,sd=0.1)

#sparse model, iid predictors

out <- cv.glmnet(x,Y,alpha=1) #lasso

lam <- out$lambda.min

fit <- predict(out,s=lam,newx=x)

res<- Y-fit

#PI bands used d = 1

AERplot2(yhat=fit,y=Y,res=res)

title("lasso")

cor(fit,Y) #about 0.997

tem <- lsfit(fit,Y)

tem$coef #changes even if set.seed is used

# Intercept 1

#0.0009741988 1.0132965955

out <- cv.glmnet(x,Y,alpha=0) #ridge regression

lam <- out$lambda.min

fit <- predict(out,s=lam,newx=x)

res<- Y-fit

#PI bands used d = 1

AERplot2(yhat=fit,y=Y,res=res)

#$respi

#[1] -1.276461 1.693856 #PI length about 2.97

title("ridge regression")

par(mfrow=c(1,1))

#ridge regression shrank betahat and ESP too much

cor(fit,Y) #about 0.91

tem <- lsfit(fit,Y)

tem$coef


# Intercept 1

#0.3523725 5.8094443 #Fig. 1.9 has -0.7008187 5.7954084

fit2 <- Y-tem$resid #Y = yhat + r, fit2 = yhat for scaled RR estimator

plot(fit2,Y) #response plot is much better

abline(0,1)

rrcoef <- predict(out,type="coefficients",s=lam)

plot(rrcoef)

bhat <- tem$coef[2]*rrcoef

bhat[1] <- bhat[1] + tem$coef[1]

#bhat is the betahat for the new ESP fit2

fit3 <- x%*%bhat[-1] + bhat[1]

plot(fit2,fit3)

max(abs(fit2-fit3))

#[1] 1.110223e-15

plot(rrcoef)

plot(bhat)

res2 <- Y - fit2

AERplot2(yhat=fit2,y=Y,res=res2)

$respi

[1] -0.7857706 0.6794579 #PI length about 1.47

title("Response Plot for Scaled Ridge Regression Estimator")

Research Ideas That Have Confounded the Author
1) We want clearer and weaker sufficient conditions for when the bootstrap methods work. In particular, we want to weaken sufficient conditions for when the shorth CI and prediction region method confidence region work. See Remark 2.9, Section 2.3.4, Equation (2.2), and the Warning before Example 2.8. Some heuristics for why these bootstrap methods may work for MLR forward selection are given in Sections 2.3.5 and 3.11.


11.4 Tables

Tabled values are F(k, d, 0.95) where P(F < F(k, d, 0.95)) = 0.95. 00 stands for ∞. Entries were produced with the qf(.95,k,d) command in R; a short qf example follows the table. The numerator degrees of freedom are k while the denominator degrees of freedom are d.

k 1 2 3 4 5 6 7 8 9 00

d

1 161 200 216 225 230 234 237 239 241 254

2 18.5 19.0 19.2 19.3 19.3 19.3 19.4 19.4 19.4 19.5

3 10.1 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.53

4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.63

5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.37

6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 3.67

7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.23

8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 2.93

9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 2.71

10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.54

11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.41

12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.30

13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.21

14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.13

15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.07

16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.01

17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 1.96

18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 1.92

19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 1.88

20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 1.84

25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 1.71

30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 1.62

00 3.84 3.00 2.61 2.37 2.21 2.10 2.01 1.94 1.88 1.00
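For example, the entry with k = 4 numerator and d = 10 denominator degrees of freedom can be reproduced with

qf(0.95, df1 = 4, df2 = 10)  # about 3.48, matching the table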


Tabled values are tα,d where P(t < tα,d) = α and t has a t distribution with d degrees of freedom. If d > 29 use the N(0, 1) cutoffs d = Z = ∞. A short qt example follows the table.

alpha pvalue

d 0.005 0.01 0.025 0.05 0.5 0.95 0.975 0.99 0.995 left tail

1 -63.66 -31.82 -12.71 -6.314 0 6.314 12.71 31.82 63.66

2 -9.925 -6.965 -4.303 -2.920 0 2.920 4.303 6.965 9.925

3 -5.841 -4.541 -3.182 -2.353 0 2.353 3.182 4.541 5.841

4 -4.604 -3.747 -2.776 -2.132 0 2.132 2.776 3.747 4.604

5 -4.032 -3.365 -2.571 -2.015 0 2.015 2.571 3.365 4.032

6 -3.707 -3.143 -2.447 -1.943 0 1.943 2.447 3.143 3.707

7 -3.499 -2.998 -2.365 -1.895 0 1.895 2.365 2.998 3.499

8 -3.355 -2.896 -2.306 -1.860 0 1.860 2.306 2.896 3.355

9 -3.250 -2.821 -2.262 -1.833 0 1.833 2.262 2.821 3.250

10 -3.169 -2.764 -2.228 -1.812 0 1.812 2.228 2.764 3.169

11 -3.106 -2.718 -2.201 -1.796 0 1.796 2.201 2.718 3.106

12 -3.055 -2.681 -2.179 -1.782 0 1.782 2.179 2.681 3.055

13 -3.012 -2.650 -2.160 -1.771 0 1.771 2.160 2.650 3.012

14 -2.977 -2.624 -2.145 -1.761 0 1.761 2.145 2.624 2.977

15 -2.947 -2.602 -2.131 -1.753 0 1.753 2.131 2.602 2.947

16 -2.921 -2.583 -2.120 -1.746 0 1.746 2.120 2.583 2.921

17 -2.898 -2.567 -2.110 -1.740 0 1.740 2.110 2.567 2.898

18 -2.878 -2.552 -2.101 -1.734 0 1.734 2.101 2.552 2.878

19 -2.861 -2.539 -2.093 -1.729 0 1.729 2.093 2.539 2.861

20 -2.845 -2.528 -2.086 -1.725 0 1.725 2.086 2.528 2.845

21 -2.831 -2.518 -2.080 -1.721 0 1.721 2.080 2.518 2.831

22 -2.819 -2.508 -2.074 -1.717 0 1.717 2.074 2.508 2.819

23 -2.807 -2.500 -2.069 -1.714 0 1.714 2.069 2.500 2.807

24 -2.797 -2.492 -2.064 -1.711 0 1.711 2.064 2.492 2.797

25 -2.787 -2.485 -2.060 -1.708 0 1.708 2.060 2.485 2.787

26 -2.779 -2.479 -2.056 -1.706 0 1.706 2.056 2.479 2.779

27 -2.771 -2.473 -2.052 -1.703 0 1.703 2.052 2.473 2.771

28 -2.763 -2.467 -2.048 -1.701 0 1.701 2.048 2.467 2.763

29 -2.756 -2.462 -2.045 -1.699 0 1.699 2.045 2.462 2.756

Z -2.576 -2.326 -1.960 -1.645 0 1.645 1.960 2.326 2.576

CI 90% 95% 99%

0.995 0.99 0.975 0.95 0.5 0.05 0.025 0.01 0.005 right tail

0.01 0.02 0.05 0.10 1 0.10 0.05 0.02 0.01 two tail
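The t cutoffs above can be reproduced with qt, and the Z row with qnorm. For example, with d = 10 degrees of freedom,

qt(0.975, df = 10)  # 2.228, the 0.975 entry in the d = 10 row
qt(0.05, df = 10)   # -1.812
qnorm(0.975)        # 1.960, the Z row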


Abraham, B., and Ledolter, J. (2006), Introduction to Regression Modeling,Thomson Brooks/Cole, Belmont, CA.

Agresti, A. (2015), Foundations of Linear and Generalized Linear Models,Wiley, Hoboken, NJ.

Akaike, H. (1973), “Information Theory as an Extension of the Maxi-mum Likelihood Principle,” in Proceedings, 2nd International Symposium onInformation Theory, eds. Petrov, B.N., and Csakim, F., Akademiai Kiado,Budapest, 267-281.

Akaike, H. (1977), “On Entropy Maximization Principle,” in Applicationsof Statistics, ed. Krishnaiah, P.R, North Holland, Amsterdam, 27-41.

Akaike, H. (1978), "A New Look at the Bayes Procedure," Biometrika, 65, 53-59.

Andersen, P.K., and Skovgaard, L.T. (2010), Regression with Linear Pre-dictors, Springer, New York, NY.

Anderson, T.W. (1971), The Statistical Analysis of Time Series, Wiley,New York, NY.

Anton, H., and Rorres, C. (2014), Elementary Linear Algebra, ApplicationsVersion, 11th ed., Wiley, New York, NY.

Becker, R.A., Chambers, J.M., and Wilks, A.R. (1988), The New SLanguage: a Programming Environment for Data Analysis and Graphics,Wadsworth and Brooks/Cole, Pacific Grove, CA.

Belsley, D.A. (1984), “Demeaning Conditioning Diagnostics Through Cen-tering,” The American Statistician, 38, 73-77.

Berk, R., Brown, L., Buja, A., Zhang, K., and Zhao, L. (2013), “ValidPost-Selection Inference,” The Annals of Statistics, 41, 802-837.

Bertsimas, D., King, A., and Mazmunder, R. (2016), “Best Subset Selec-tion Via a Modern Optimization Lens,” The Annals of Statistics, 44, 813-852.

Bhatia, R., Elsner, L., and Krause, G. (1990), “Bounds for the Variation ofthe Roots of a Polynomial and the Eigenvalues of a Matrix,” Linear Algebraand Its Applications, 142, 195-209.

Bickel, P.J., and Ren, J.–J. (2001), “The Bootstrap in Hypothesis Testing,”in State of the Art in Probability and Statistics: Festschrift for William R. vanZwet, eds. de Gunst, M., Klaassen, C., and van der Vaart, A., The Instituteof Mathematical Statistics, Hayward, CA, 91-112.

Bogdan, M., Ghosh, J., and Doerge, R. (2004), “Modifying the SchwarzBayesian Information Criterions to Locate Multiple Interacting QuantitativeTrait Loci,” Genetics, 167, 989-999.

Buchlmann, P., and Yu, B. (2002), “Analyzing Bagging,” The Annals ofStatistics, 30, 927-961.

Budny, K. (2014), “A Generalization of Chebyshev’s Inequality for Hilbert-Space-Valued Random Variables,” Statistics & Probability Letters, 88, 62-65.

Burnham, K.P., and Anderson, D.R. (2002), Model Selection and Mul-timodel Inference: a Practical Information-Theoretic Approach, 2nd ed.,Springer, New York, NY.


Burnham, K.P., and Anderson, D.R. (2004), “Multimodel Inference Under-standing AIC and BIC in Model Selection,” Sociological Methods & Research,33, 261-304.

Butler, R., and Rothman, E. (1980), “Predictive Intervals Based on Reuseof the Sample,” Journal of the American Statistical Association, 75, 881-889.

Buxton, L.H.D. (1920), “The Anthropology of Cyprus,” The Journal ofthe Royal Anthropological Institute of Great Britain and Ireland, 50, 183-235.

Camponovo, L. (2015), “On the Validity of the Pairs Bootstrap for LassoEstimators,” Biometrika, 102, 981-987.

Chang, J., and Olive, D.J. (2010), “OLS for 1D Regression Models,” Com-munications in Statistics: Theory and Methods, 39, 1869-1882.

Chatterjee, A., and Lahiri, S.N. (2011), “Bootstrapping Lasso Estimators,”Journal of the American Statistical Association, 106, 608-625.

Chen, J., and Chen, Z. (2008), “Extended Bayesian Information Criterionfor Model Selection With Large Model Spaces,” Biometrika, 95, 759-771.

Chen, S.X. (2016), “Peter Hall’s Contributions to the Bootstrap,” TheAnnals of Statistics, 44, 1821-1836.

Chen, C.H., and Li, K.C. (1998), “Can SIR be as Popular as MultipleLinear Regression?,” Statistica Sinica, 8, 289-316.

Chen, X. (2011), “A New Generalization of Chebyshev Inequality for Ran-dom Vectors,” see arXiv:0707.0805v2.

Chew, V. (1966), “Confidence, Prediction and Tolerance Regions for theMultivariate Normal Distribution,” Journal of the American Statistical As-sociation, 61, 605-617.

Chihara, L., and Hesterberg, T. (2011), Mathematical Statistics with Re-sampling and R, Wiley, Hoboken, NJ.

Cho, H., and Fryzlewicz, P. (2012), “High Dimensional Variable SelectionVia Tilting,” Journal of the Royal Statistical Society, B, 74, 593-622.

Christensen, R. (2011), Plane Answers to Complex Questions: the Theoryof Linear Models, 4th ed., Springer, New York, NY.

Chun, H., and Keles, S. (2010), “Sparse Partial Least Squares Regressionfor Simultaneous Dimension Reduction and Predictor Selection,” Journal ofthe Royal Statistical Society, B, 72, 3-25.

Claeskens, G., and Hjort, N.L. (2008), Model Selection and Model Averag-ing, Cambridge University Press, New York, NY.

Clarke, B.R. (1986), “Nonsmooth Analysis and Frechet Differentiability ofM Functionals,” Probability Theory and Related Fields, 73, 137-209.

Clarke, B.R. (2000), “A Review of Differentiability in Relation to Ro-bustness With an Application to Seismic Data Analysis,” Proceedings of theIndian National Science Academy, A, 66, 467-482.

Cook, R.D. (1998), Regression Graphics: Ideas for Studying RegressionThrough Graphics, Wiley, New York, NY.

Cook, R.D. (2018), An Introduction to Envelopes: Dimension Reductionfor Efficient Estimation in Multivariate Statistics, Wiley, Hoboken, NJ.


Cook, R.D., and Forzani, L. (2008), “Principal Fitted Components forDimension Reduction in Regression,” Statistical Science, 23, 485-501.

Cook, R.D., and Forzani, L. (2018a), “Big Data and Partial Least SquaresPrediction,” The Canadian Journal of Statistics, 46, 62-78.

Cook, R.D., and Forzani, L. (2018b), “Partial Least Squares Prediction inHigh-Dimensional Regression,” The Annals of Statistics, to appear.

Cook, R.D., Forzani, L., and Rothman, A. (2013), "Prediction in Abundant High-Dimensional Linear Regression," Electronic Journal of Statistics, 7, 3059-3088.

Cook, R.D., and Olive, D.J. (2001), “A Note on Visualizing ResponseTransformations in Regression,” Technometrics, 43, 443-449.

Cook, R.D., and Weisberg, S. (1999), Applied Regression Including Com-puting and Graphics, Wiley, New York, NY.

Cox, D.R. (1972), "Regression Models and Life-Tables," Journal of the Royal Statistical Society, B, 34, 187-220.

Draper, N.R., and Smith, H. (1966, 1981, 1998), Applied Regression Analysis, 1st, 2nd, and 3rd ed., Wiley, New York, NY.

Cornish, E.A. (1954), “The Multivariate t-Distribution Associated with aSet of Normal Sample Deviates,” Australian Journal of Physics, 7, 531-542.

Croux, C., Dehon, C., Rousseeuw, P.J., and Van Aelst, S. (2001), “Ro-bust Estimation of the Conditional Median Function at Elliptical Models,”Statistics & Probability Letters, 51, 361-368.

Datta, B.N. (1995), Numerical Linear Algebra and Applications,Brooks/Cole Publishing Company, Pacific Grove, CA.

Denham, M.C. (1997), “Prediction Intervals in Partial Least Squares,”Journal of Chemometrics, 11, 39-52.

Dezeure, R., Buhlmann, P., Meier, L., and Meinshausen, N. (2015), “High-Dimensional Inference: Confidence Intervals, p-Values and R-Software hdi,”Statistical Science, 30, 533-558.

Driscoll, M.F., and Krasnicka, B. (1995), “An Accessible Proof of Craig’sTheorem in the General Case,” The American Statistician, 49, 59-62.

Eaton, M.L. (1986), “A Characterization of Spherical Distributions,” Jour-nal of Multivariate Analysis, 20, 272-276.

Efron, B. (1982), The Jackknife, the Bootstrap and Other ResamplingPlans, SIAM, Philadelphia, PA.

Efron, B. (2014), “Estimation and Accuracy After Model Selection,” (withdiscussion), Journal of the American Statistical Association, 109, 991-1007.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), “LeastAngle Regression,” (with discussion), The Annals of Statistics, 32, 407-451.

Efron, B., and Hastie, T. (2016), Computer Age Statistical Inference, Cam-bridge University Press, New York, NY.

Efron, B., and Tibshirani, R.J. (1993), An Introduction to the Bootstrap,Chapman & Hall/CRC, New York, NY.

Fan, J., and Li, R. (2002), “Variable Selection for Cox’s Proportional Haz-ard Model and Frailty Model,” The Annals of Statistics, 30, 74-99.


Fan, J., and Lv, J. (2010), “A Selective Overview of Variable Selection inHigh Dimensional Feature Space,” Statistica Sinica, 20, 101-148.

Fan, R. (2016), “A Squared Correlation Coefficient of the Correlation Ma-trix,” unpublished manuscript at (http://lagrange.math.siu.edu/Olive/sfan.pdf).

Ferguson, T.S. (1996), A Course in Large Sample Theory, Chapman &Hall, New York, NY.

Fernholtz, L.T. (1983), von Mises Calculus for Statistical Functionals,Springer, New York, NY.

Ferrari, D., and Yang, Y. (2015), “Confidence Sets for Model Selection byF –Testing,” Statistica Sinica, 25, 1637-1658.

Fithian, W., Sun, D., and Taylor, J. (2014), “Optimal Inference AfterModel Selection,” ArXiv e-prints.

Fogel, P., Hawkins, D.M., Beecher, C., Luta, G., and Young, S. (2013), “ATale of Two Matrix Factorizations,” The American Statistician, 67, 207-218.

Frank, I.E., and Friedman, J.H. (1993), “A Statistical View of SomeChemometrics Regression Tools,” (with discussion), Technometrics, 35, 109-148.

Freedman, D.A. (2005), Statistical Models Theory and Practice, Cam-bridge University Press, New York, NY.

Frey, J. (2013), “Data-Driven Nonparametric Prediction Intervals,” Jour-nal of Statistical Planning and Inference, 143, 1039-1048.

Friedman, J.H., and Hall, P. (2007), “On Bagging and Nonlinear Estima-tion,” Journal of Statistical Planning and Inference, 137, 669-683.

Fujikoshi, Y., Sakurai, T., and Yanagihara, H. (2014), “Consistency ofHigh-Dimensional AIC–Type and Cp–Type Criteria in Multivariate LinearRegression,” Journal of Multivariate Analysis, 123, 184-200.

Furnival, G., and Wilson, R. (1974), “Regression by Leaps and Bounds,”Technometrics, 16, 499-511.

Gao, X., and Huang, J. (2010), “Asymptotic Analysis of High-DimensionalLAD Regression with Lasso,” Statistica Sinica, 20, 1485-1506.

Gill, R.D. (1989), “Non- and Semi-Parametric Maximum Likelihood Esti-mators and the von Mises Method, Part 1,” Scandinavian Journal of Statis-tics, 16, 97-128.

Gilmour, S.G. (1996), “The Interpretation of Mallows’s Cp-Statistic,” TheStatistician, 45, 49-56.

Gladstone, R.J. (1905), “A Study of the Relations of the Brain to the Sizeof the Head,” Biometrika, 4, 105-123.

Graybill, F.A. (1976), Theory and Application of the Linear Model, Dux–bury Press, North Scituate, MA.

Graybill, F.A. (1983), Matrices With Applications to Statistics, 2nd ed.,Wadsworth, Belmont, CA.

Graybill, F.A. (2000), Theory and Application of the Linear Model, Brooks/Cole,Pacific Grove, CA.


Gruber, M.H.J. (1998), Improving Efficiency by Shrinkage: the James-Stein and Ridge Regression Estimators, Marcel Dekker, New York, NY.

Gunst, R.F., and Mason, R.L. (1980), Regression Analysis and Its Appli-cation: a Data Oriented Approach, Marcel Dekker, New York, NY.

Guttman, I. (1982), Linear Models: an Introduction, Wiley, New York,NY.

Haitovsky, Y. (1987), “On Multivariate Ridge Regression,” Biometrika,74, 563-570.

Hall, P. (1988), “Theoretical Comparisons of Bootstrap Confidence Inter-vals,” (with discussion), The Annals of Statistics, 16, 927-985.

Hall, P., Lee, E.R., and Park, B.U. (2009), “Bootstrap-Based PenaltyChoice for the Lasso Achieving Oracle Performance,” Statistica Sinica, 19,449-471.

Hall, P., Martin, M.A., and Schucany, W.R. (1989), "Better Nonparametric Bootstrap Confidence Intervals for the Correlation Coefficient," Journal of Statistical Computation and Simulation, 33, 161-172.

Hastie, T., Tibshirani, R., and Friedman, J. (2001, 2009), The Elementsof Statistical Learning: Data Mining, Inference, and Prediction, 1st and 2nded. Springer, New York, NY.

Hastie, T., Tibshirani, R., and Wainwright, M. (2015), Statistical Learningwith Sparsity: the Lasso and Generalizations, CRC Press Taylor & Francis,Boca Raton, FL.

Hebbler, B. (1847), “Statistics of Prussia,” Journal of the Royal StatisticalSociety, A, 10, 154-186.

Henderson, H.V., and Searle, S.R. (1979), “Vec and Vech Operators forMatrices, with Some Uses in Jacobians and Multivariate Statistics,” TheCanadian Journal of Statistics, 7, 65-81.

Hesterberg, T., (2014), “What Teachers Should Know About the Boot-strap: Resampling in the Undergraduate Statistics Curriculum,” availablefrom (http://arxiv.org/pdf/1411.5279v1.pdf). (An abbreviated version waspublished (2015), The American Statistician, 69, 371-386.)

Hocking, R.R. (2013), Methods and Applications of Linear Models: Regres-sion and the Analysis of Variance, 3rd ed., Wiley, New York, NY.

Hoerl, A.E., and Kennard, R. (1970), “Ridge Regression: Biased Estima-tion for Nonorthogonal Problems,” Technometrics, 12, 55-67.

Hoffman, I., Serneels, S., Filzmoser, P., and Croux, C. (2015), “SparsePartial Robust M Regression,” Chemometrics and Intelligent Laboratory Sys-tems, 149, Part A, 50-59.

Hurvich, C.M., and Tsai, C.-L. (1991),“Bias of the Corrected AIC Cri-terion for Underfitted Regression and Time Series Models,” Biometrika, 78,499-509.

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013), An Intro-duction to Statistical Learning With Applications in R, Springer, New York,NY.


Jansen, L., Fithian, W., and Hastie, T. (2015), “Effective Degrees of Free-dom: a Flawed Metaphor,” Biometrika, 102, 479-485.

Javanmard, A., and Montanari, A. (2014), “Confidence Intervals andHypothesis Testing for High-Dimensional Regression,” Journal of MachineLearning Research, 15, 2869-2909.

Jia, J., and Yu, B. (2010), “On Model Selection Consistency of the ElasticNet When p >> n,” Statistica Sinica, 20, 595-611.

Johnson, M.E. (1987), Multivariate Statistical Simulation, Wiley, NewYork, NY.

Johnson, N.L., and Kotz, S. (1970), Distributions in Statistics: ContinuousUnivariate Distributions–2, Wiley, New York, NY.

Johnson, N.L., and Kotz, S. (1972), Distributions in Statistics: ContinuousMultivariate Distributions, Wiley, New York, NY.

Johnson, R.A., and Wichern, D.W. (1988, 2007), Applied Multivariate Sta-tistical Analysis, 2nd and 6th ed., Prentice Hall, Englewood Cliffs, NJ.

Jones, H.L. (1946), “Linear Regression Functions with Neglected Vari-ables,” Journal of the American Statistical Association, 41, 356-369.

Kelker, D. (1970), “Distribution Theory of Spherical Distributions and aLocation Scale Parameter Generalization,” Sankhya, A, 32, 419-430.

Kenard, R.W. (1971), “A Note on the Cp Statistics,” Technometrics, 13,899-900.

Khattree, R., and Naik, D.N. (1999), Applied Multivariate Statistics withSAS Software, 2nd ed., SAS Institute, Cary, NC.

Knight, K., and Fu, W.J. (2000), “Asymptotics for Lasso-Type Estima-tors,” Annals of Statistics, 28, 1356–1378.

Kowalski, C.J. (1973), “Non-Normal Bivariate Distributions with NormalMarginals,” The American Statistician, 27, 103-106.

Kuhn, M., and Johnson, K. (2013), Applied Predictive Modeling, Springer,New York, NY.

Lee, J., Sun, D., Sun, Y., and Taylor, J. (2016), “Exact Post-SelectionInference With Application to the Lasso,” The Annals of Statistics, 44, 907-927.

Lee, J.D., and Taylor, J.E. (2014), “Exact Post Model Selection Infer-ence for Marginal Screening,” in Advances in Neural Information ProcessingSystems, 136-144.

Leeb, H., and Potscher, B.M. (2005), “Model Selection and Inference: Factsand Fiction,” Econometric Theory, 21, 21-59.

Leeb, H., and Potscher, B.M. (2006), “Can One Estimate the ConditionalDistribution of Post–Model-Selection Estimators?” The Annals of Statistics,34, 2554-2591.

Leeb, H., Potscher, B.M., and Ewald, K. (2015), “On Various ConfidenceIntervals Post-Model-Selection,” Statistical Science, 30, 216-227.

Lehmann, E.L. (1999), Elements of Large–Sample Theory, Springer, NewYork, NY.


Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R.J., and Wasserman, L.(2018), “Distribution-Free Predictive Inference for Regression,” Journal ofthe American Statistical Association, 113, 1094-1111.

Leon, S.J. (1986), Linear Algebra with Applications, 2nd ed., MacmillanPublishing Company, New York, NY.

Leon, S.J. (2015), Linear Algebra with Applications, 9th ed., Pearson,Boston, MA.

Li, K.C., and Duan, N. (1989), “Regression Analysis Under Link Viola-tion,” The Annals of Statistics, 17, 1009-1052.

Lin, D., Foster, D.P., and Ungar, L.H. (2012), “VIF Regression, a FastRegression Algorithm for Large Data,” Journal of the American StatisticalAssociation, 106, 232-247.

Linhart, H., and Zucchini, W. (1986), Model Selection, Wiley, New York,NY.

Lockhart, R., Taylor, J., Tibshirani, R.J., and Tibshirani, R. (2014), “ASignificance Test for the Lasso,” (with discussion), The Annals of Statistics,42, 413-468.

Lu, S., Liu, Y., Yin, L., and Zhang, K. (2017), “Confidence Intervals andRegions for the Lasso by Using Stochastic Variational Inequality Techniquesin Optimization,” Journal of the Royal Statistical Society, B, 79 589-611.

Luo, S., and Chen, Z. (2013), “Extended BIC for Linear Regression ModelsWith Diverging Number of Relevant Features and High or Ultra-High FeatureSpaces,” Journal of Statistical Planning and Inference, 143, 494-504.

Machado, J.A.F., and Parente, P. (2005), “Bootstrap Estimation of Covari-ance Matrices Via the Percentile Method,” Econometrics Journal, 8, 70–78.

Mallows, C. (1973), "Some Comments on Cp," Technometrics, 15, 661-676.

Marden, J.I. (2012), Mathematical Statistics: Old School, unpublished text online at (www.stat.istics.net).

Mardia, K.V., Kent, J.T., and Bibby, J.M. (1979), Multivariate Analysis, Academic Press, London, UK.

Marquardt, D.W., and Snee, R.D. (1975), "Ridge Regression in Practice," The American Statistician, 29, 3-20.

MathSoft (1999a), S-Plus 2000 User's Guide, Data Analysis Products Division, MathSoft, Seattle, WA.

MathSoft (1999b), S-Plus 2000 Guide to Statistics, Volume 2, Data Analysis Products Division, MathSoft, Seattle, WA.

Meinshausen, N. (2007), "Relaxed Lasso," Computational Statistics & Data Analysis, 52, 374-393.

Mevik, B.-H., Wehrens, R., and Liland, K.H. (2015), pls: Partial Least Squares and Principal Component Regression, R package version 2.5-0, (https://CRAN.R-project.org/package=pls).

Monahan, J.F. (2008), A Primer on Linear Models, Chapman & Hall/CRC,Boca Rotan, FL.

Muller, K.E., and Stewart, P.W. (2006), Linear Model Theory: Univariate,Multivariate, and Mixed Models, Wiley, Hoboken, NJ.


Myers, R.H., and Milton, J.S. (1991), A First Course in the Theory ofLinear Statistical Models, Duxbury, Belmont, CA.

Navarro, J. (2014), “Can the Bounds in the Multivariate Chebyshev In-equality be Attained?” Statistics & Probability Letters, 91, 1-5.

Navarro, J. (2016), “A Very Simple Proof of the Multivariate Chebyshev’sInequality,” Communications in Statistics: Theory and Methods, 45, 3458-3463.

Ning, Y., and Liu, H. (2017), “A General Theory of Hypothesis Tests andConfidence Regions for Sparse High Dimensional Models,” The Annals ofStatistics, 45, 158-195.

Nishi, R. (1984), “Asymptotic Properties of Criteria for Selection of Vari-ables in Multiple Regression,” The Annals of Statistics, 12, 758-765.

Norman, G.R., and Streiner, D.L. (1986), PDQ Statistics, B.C. Decker,Philadelphia, PA.

Obozinski, G., Wainwright, M.J., and Jordan, M.I. (2011), “SupportUnion Recovery in High-Dimensional Multivariate Regression,” The Annalsof Statistics, 39, 1-47.

Olive, D.J. (2004a), “A Resistant Estimator of Multivariate Location andDispersion,” Computational Statistics & Data Analysis, 46, 99-102.

Olive, D.J. (2004b), “Visualizing 1D Regression,” in Theory and Applica-tions of Recent Robust Methods, eds. Hubert, M., Pison, G., Struyf, A., andVan Aelst, S., Birkhauser, Basel, Switzerland, 221-233.

Olive, D.J. (2005), “Two Simple Resistant Regression Estimators,” Com-putational Statistics & Data Analysis, 49, 809-819.

Olive, D.J. (2007), “Prediction Intervals for Regression Models,” Compu-tational Statistics & Data Analysis, 51, 3115-3122.

Olive, D.J. (2008b), Applied Robust Statistics, Unpublished Online Text,see (http://lagrange.math.siu.edu/Olive/ol-bookp.htm).

Olive, D.J. (2010), Multiple Linear and 1D Regression Models, onlinecourse notes, see (http://lagrange.math.siu.edu/Olive/regbk.htm).

Olive, D.J. (2013a), “Asymptotically Optimal Regression Prediction In-tervals and Prediction Regions for Multivariate Data,” International Journalof Statistics and Probability, 2, 90-100.

Olive, D.J. (2013b), “Plots for Generalized Additive Models,” Communi-cations in Statistics: Theory and Methods, 42, 2610-2628.

Olive, D.J. (2014), Statistical Theory and Inference, Springer, New York, NY.

Olive, D.J. (2017a), Linear Regression, Springer, New York, NY.

Olive, D.J. (2017b), Robust Multivariate Analysis, Springer, New York, NY.

Olive, D.J. (2018), "Applications of Hyperellipsoidal Prediction Regions," Statistical Papers, 59, 913-931.

Olive, D.J. (2019), Prediction and Statistical Learning, online course notes, see (http://lagrange.math.siu.edu/Olive/slearnbk.htm).


Olive, D.J., and Hawkins, D.M. (2003), “Robust Regression with HighCoverage,” Statistics & Probability Letters, 63, 259-266.

Olive, D.J., and Hawkins, D.M. (2005), “Variable Selection for 1D Regres-sion Models,” Technometrics, 47, 43-50.

Olive, D.J., and Hawkins, D.M. (2011), “Practical High Breakdown Re-gression,” preprint at (http://lagrange.math.siu.edu/Olive/pphbreg.pdf).

Olive, D.J., Pelawa Watagoda, L.C.R., and Rupasinghe Arachchige Don,H.S. (2015), “Visualizing and Testing the Multivariate Linear RegressionModel,” International Journal of Statistics and Probability, 4, 126-137.

Pelawa Watagoda, L.C.R. (2017a), “Inference After Variable Selection,”Ph.D. Thesis, Southern Illinois University. See (http://lagrange.math.siu.edu/Olive/slasanthiphd.pdf).

Pelawa Watagoda, L.C.R. (2017b), “Simulation for Inference After Vari-able Selection,” unpublished manuscript online at (http://lagrange.math.siu.edu/Olive/slasanthisim.pdf).

Pelawa Watagoda, L.C.R., and Olive, D.J. (2019a), “Bootstrapping Mul-tiple Linear Regression After Variable Selection.” Unpublished manuscriptat (http://lagrange.math.siu.edu/Olive/ppboottest.pdf).

Pelawa Watagoda, L.C.R., and Olive, D. J. (2019b), “Comparing Shrink-age Estimators With Asymptotically Optimal Prediction Intervals.” Unpub-lished manuscript at (http://lagrange.math.siu.edu/Olive/pppicomp.pdf).

Press, S.J. (2005), Applied Multivariate Analysis: Using Bayesian and Fre-quentist Methods of Inference, 2nd ed., Dover, Mineola, NY.

Qi, X., Luo, R., Carroll, R.J., and Zhao, H. (2015), “Sparse Regressionby Projection and Sparse Discriminant Analysis,” Journal of Computationaland Graphical Statistics, 24, 416-438.

R Core Team (2016), “R: a Language and Environment for Statisti-cal Computing,” R Foundation for Statistical Computing, Vienna, Austria,(www.R-project.org).

Rao, C.R. (1965, 1973) Linear Statistical Inference and Its Applications,1st and 2nd ed., Wiley, New York, NY.

Rao, C.R., Toutenberg, H., Shalabh, and Heunmann, C. (2008), Lin-ear Models and Generalizations: Least Squares and Alternatives, 3rd ed.,Springer, New York, NY.

Ravishanker, N., and Dey, D.K. (2002), A First Course in Linear ModelTheory, Chapman & Hall/CRC, Boca Raton, FL.

Reid, J.G., and Driscoll, M.F. (1988), “An Accessible Proof of Craig’sTheorem in the Noncentral Case,” The American Statistician, 42, 139-142.

Rejchel, W. (2016), “Lasso With Convex Loss: Model Selection Consis-tency and Estimation,” Communications in Statistics: Theory and Methods,45, 1989-2004.

Ren, J.-J. (1991), “On Hadamard Differentiability of Extended StatisticalFunctional,” Journal of Multivariate Analysis, 39, 30-43.


Ren, J.-J., and Sen, P.K. (1995), “Hadamard Differentiability on D[0,1]p,”Journal of Multivariate Analysis, 55, 14-28.

Rencher, A., and Pun, F. (1980), “Inflation of R2 in Best Subset Regres-sion,” Technometrics, 22, 49-53.

Rencher, A.C., and Schaalje, G.B. (2008), Linear Models in Statistics, 2nded., Wiley, Hoboken, NJ.

Rohatgi, V.K. (1976), An Introduction to Probability Theory and Mathe-matical Statistics, Wiley, New York, NY.

Rousseeuw, P.J., and Leroy, A.M. (1987), Robust Regression and OutlierDetection, Wiley, New York, NY.

Scheffe, H. (1959), The Analysis of Variance, Wiley, New York, NY.

Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461-464.

Searle, S.R. (1971), Linear Models, Wiley, New York, NY.

Searle, S.R., and Gruber, M.H.J. (2017), Linear Models, 2nd ed., Wiley, Hoboken, NJ.

Searle, S.R. (1982), Matrix Algebra Useful for Statistics, Wiley, New York, NY.

Seber, G.A.F., and Lee, A.J. (2003), Linear Regression Analysis, 2nd ed., Wiley, New York, NY.

Sen, P.K., and Singer, J.M. (1993), Large Sample Methods in Statistics: an Introduction with Applications, Chapman & Hall, New York, NY.

Sengupta, D., and Jammalamadaka, S.R. (2003), Linear Models: an Integrated Approach, World Scientific, Singapore.

Serfling, R.J. (1980), Approximation Theorems of Mathematical Statistics, Wiley, New York, NY.

Severini, T.A. (1998), "Some Properties of Inferences in Misspecified Linear Models," Statistics & Probability Letters, 40, 149-153.

Severini, T.A. (2005), Elements of Distribution Theory, Cambridge University Press, New York, NY.

Shao, J. (1993), "Linear Model Selection by Cross-Validation," Journal of the American Statistical Association, 88, 486-494.

Srivastava, M.S., and Khatri, C.G. (1979), An Introduction to Multivariate Statistics, North Holland, New York, NY.

Stapleton, J.H. (2009), Linear Statistical Models, 2nd ed., Wiley, Hoboken, NJ.

Staudte, R.G., and Sheather, S.J. (1990), Robust Estimation and Testing, Wiley, New York, NY.

Steinberger, L., and Leeb, H. (2016), "Leave-One-Out Prediction Intervals in Linear Regression Models With Many Variables," unpublished manuscript at (https://arxiv.org/pdf/1602.05801.pdf).

Stewart, G.M. (1969), “On the Continuity of the Generalized Inverse,”SIAM Journal on Applied Mathematics, 17, 33-45.

Su, W., Bogdan, M. , and Candes, E. (2017), “False Discoveries OccurEarly on the Lasso Path,” The Annals of Statistics, 45, 2133-2150.


Su, Z., and Cook, R.D. (2012), “Inner Envelopes: Efficient Estimation inMultivariate Linear Regression,” Biometrika, 99, 687-702.

Su, Z., Zhu, G., and Yang, Y. (2016), “Sparse Envelope Model: EfficientEstimation and Response Variable Selection in Multivariate Linear Regres-sion,” Biometrika, 103, 579-593.

Tibshirani, R. (1996), “Regression Shrinkage and Selection Via the Lasso,”Journal of the Royal Statistical Society, B, 58, 267-288.

Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J.,and Tibshirani, R.J. (2012), “Strong Rules for Discarding Predictors in Lasso-Type Problems,” Journal of the Royal Statistical Society, B, 74, 245–266.

Tibshirani, R.J. (2015), “Degrees of Freedom and Model Search,” StatisticaSinica, 25, 1265-1296.

Tibshirani, R.J., Rinaldo, A., Tibshirani, R., and Wasserman, L. (2018),“Uniform Asymptotic Inference and the Bootstrap After Model Selection,”The Annals of Statistics, 46, 1255-1287.

Tibshirani, R.J., and Taylor, J. (2012), “Degrees of Freedom in LassoProblems,” The Annals of Statistics, 40, 1198-1232.

Tibshirani, R.J., Taylor, J., Lockhart, R., and Tibshirani, R. (2016), “Ex-act Post-Selection Inference for Sequential Regression Procedures,” Journalof the American Statistical Association, 111, 600-620.

Uraibi, H.S., Midi, H., and Rana, S. (2017), “Robust Multivariate LeastAngle Regression,” Science Asia, 43, 56-60.

Uraibi, H.S., Midi, H., and Rana, S. (2018), “Selective Overview of ForwardSelection in Terms of Robust Correlations,” Communications in Statistics:Simulations and Computation, to appear.

van de Geer, S., Bulhmann, P., Ritov, Y., and Dezeure, R. (2014), “OnAsymptotically Optimal Confidence Regions and Tests for High-DimensionalModels,” The Annals of Statistics, 42, 1166-1202.

Wang, H. (2009), “Forward Regression for Ultra-High Dimensional Vari-able Screening,” Journal of the American Statistical Association, 104, 1512-1524.

Wang, S.-G., and Chow, S.-C. (1994), Advanced Linear Models: Theoryand Applications, Marcel Dekker, New York, NY.

Wasserman, L. (2014), “Discussion: A Significance Test for the Lasso,”The Annals of Statistics, 42, 501-508.

White, H. (1984), Asymptotic Theory for Econometricians, AcademicPress, San Diego, CA.

Wold, H. (1975), “Soft Modelling by Latent Variables: the Nonlinear Par-tial Least Squares (NIPALS) Approach,” in Perspectives in Probability andStatistics, Papers in Honor of M.S. Bartlett, ed. Gani, J., Academic Press,San Diego, CA, 117-144.

Wold, H. (1985), “Partial Least Squares,” International Journal of Cardi-ology, 147, 581-591.

Wold, H. (2006), “Partial Least Squares,” Encyclopedia of Statistical Sci-ences, Wiley, New York, NY.


Xu, H., Caramanis, C., and Mannor, S. (2011), “Sparse Algorithms are NotStable: a No-Free-Lunch Theorem,” IEEE Transactions on Pattern Analysisand Machine Intelligence, PP(99), 1-9.

Yang, Y. (2003), “Regression with Multiple Candidate Models: Selectingor Mixing?” Statistica Sinica, 13, 783-809.

Ye, J. (1998), “On Measuring and Correcting the Effects of Data Miningand Model Selection,” Journal of the American Statistical Association, 93,120-131.

Zhang, P. (1992), “On the Distributional Properties of Model SelectionCriterion,” Journal of the American Statistical Association, 87, 732-737.

Zhang, T., and Yang, B. (2017), “Box-Cox Transformation in Big Data,”Technometrics, 59, 189-201.

Zhang, X., and Cheng, G. (2017), “Simultaneous Inference for High-Dimensional Linear Models,” Journal of the American Statistical Association,112, 757-768.

Zheng, Z., and Loh, W.-Y. (1995), “Consistent Variable Selection in LinearModels,” Journal of the American Statistical Association, 90, 151-156.

Zou, H., and Hastie, T. (2005), “Regularization and Variable SelectionVia the Elastic Net,” Journal of the Royal Statistical Society Series, B, 67,301-320.

Zou, H., Hastie, T., and Tibshirani, R. (2007), “On the “Degrees of Free-dom” of the Lasso,” The Annals of Statistics, 35, 2173-2192.


Index

Cızek, 216, 2471D regression, 4, 77, 309, 314

Abraham, vactive set, 162additive error regression, 2additive predictor, 4AER, 3affine equivariant, 228affine transformation, 229Akaike, 143, 200Aldrin, 249, 316, 341Anderson, 78, 143, 187, 276, 277Anton, vAP, 3asymptotic distribution, 11, 14asymptotic theory, 11Atkinson, 252

Bφlviken, 316, 341Buchlmann, 110Bassett, 246, 247Becker, 341, 350Belsley, 185Berk, 185, 302Berndt, 276, 287Bernholt, 216Bertsimas, 185, 187best linear unbiased estimator, 68Bhatia, 41, 157Bickel, 106, 108, 109, 127bivariate normal, 32BLUE, 68Bogdan, 188bootstrap, 11, 127Boudt, 249Box, 10

Box–Cox transformation, 10, 322breakdown, 229, 243Brillinger, 309, 313, 316, 340Budny, 93Burnham, 143, 187Butler, 188Buxton, 43, 49, 212, 217, 220, 224, 226,

240, 291

Cızek, 341Camponovo, 186case, 1Cauchy Schwartz inequality, 127cdf, 3, 205centering matrix, 208cf, 3, 23Chambers, 325Chang, 76, 244, 331, 334, 340, 341Chatterjee, 186Chebyshev’s Inequality, 17Chen, 77, 93, 97, 127, 143, 187, 200, 332,

334, 341Cheng, 185Chew, 91Cho, 188Chow, vChristensen, vChun, 151, 186CI, 3Claeskins, 185Clarke, 127classical prediction region, 91Cleveland, 325CLT, 3CLTS, 222column space, 54, 75conditional distribution, 32


confidence region, 94consistent, 16consistent estimator, 16Continuity Theorem, 23Continuous Mapping Theorem:, 23converges almost everywhere, 18converges in distribution, 14converges in law, 14converges in probability, 16converges in quadratic mean, 16Cook, v, 8, 34, 35, 41, 44, 121, 143, 151,

185, 220, 242, 244, 264, 270, 273,283, 300, 304, 310–312, 316, 317,321, 322, 325, 327, 340

Cook’s distance, 340Copas, 302Cornish, 37covariance matrix, 30coverage, 92covmb2, 211, 242Cox, 10, 199, 310, 311, 341Craig’s Theorem, 59, 75Crawley, 350, 352Critchley, 341cross checking estimator, 243Croux, 36CV, 3

Daniel, 325Datta, 185, 341DD plot, 312degrees of freedom, 190Delta Method, 13Denham, 186Dey, vDezeure, 185diagnostic for linearity, 320dimension reduction, 312dominant robust statistics paradigm,

246Donoho, 243Driscoll, 59Duan, 77, 309, 313, 327, 332, 340

EAP, 3Eaton, 34EC, 3, 312EE plot, 324efficiency, 247Efron, 103, 110, 119, 127, 143, 155, 156,

160, 163, 169, 185, 186, 200eigenvalue, 147eigenvector, 147elastic net, 165

elemental set, 223ellipsoidal trimming, 216, 316elliptically contoured, 33, 37, 215, 312,

314elliptically contoured distribution, 90elliptically symmetric, 33, 215empirical cdf, 101empirical distribution, 101envelope estimators, 300ESP, 3, 316ESSP, 3, 316estimable, 84estimated additive predictor, 4estimated sufficient predictor, 4, 316estimated sufficient summary plot, 5,

311, 316Euclidean norm, 25, 231extrapolation, 170

Fan, 97, 185–187, 201Ferguson, 23, 41Fernholtz, 127Ferrari, 185FF plot, 225, 265Fithian, 185fitted values, 136, 180Fogel, 185Forzani, 151, 185Fox, 350Frank, 187Freedman, v, 103Frey, 86, 96, 105, 112, 129Friedman, vi, 110, 187, 350Fryzlewicz, 188Fu, 118, 157, 161, 185–187Fujikoshi, 187, 301full model, 112, 133, 180full rank, 55Furnival, 323

GAM, 3Gao, 187, 340Gather, 341Gauss Markov Theorem, 68general position, 230, 237generalized additive model, 4Generalized Cochran’s Theorem, 63generalized inverse, 55, 75generalized linear model, 4, 310Ghosh, 127Gill, 127Gladstone, 49, 122, 227GLM, 3Golub, 232


Gram matrix, 154Graybill, v, 58, 140Gruber, v, 186Gunst, 156, 157, 185Guttman, v, 68

Hardle, 341Hossjer, 216, 247, 248Hadamard derivative, 127Haitovsky, 187Hall, 97, 127, 186Hall (2007), 110Hampel, 216, 243Harrison, 328Hastie, 113, 127, 143, 148, 151, 153–156,

160, 161, 165, 166, 169, 185, 187,188, 191, 200

hat matrix, 136Hawkins, 112, 172, 185, 187, 199, 216,

221, 223, 242, 245, 248, 252, 264,290, 316, 323, 341

hbreg, 237He, 236, 243, 245Hebbler, 141, 281Henderson, 273Heng-Hui, 341Hesterberg, 11, 127heteroscedastic, 310Hilker, 341Hjort, 185Hocking, vHoerl, 186Hoffman, 188, 244, 301Huang, 187Huber, 218, 220, 243, 246Hubert, 246Hurvich, 143Hyndman, 301

i, 101identity line, 5, 264iid, 3, 5, 205, 206

Jacobian matrix, 26James, 2, 139, 172, 173, 185Jammalamadaka, vJansen, 188Javanmard, 185Jia, 167Johnson, 31, 33, 37, 76, 91, 147, 215,

263, 266, 270, 336Johnstone, 302joint distribution, 31Jolliffe, 150

Jones, 187, 200, 324

Kakizawa, 275, 276Karhunen Loeve direction, 148Karhunen Loeve directions, 185Karush-Kuhn-Tucker conditions, 184Keles, 151, 186Kelker, 35Kennard, 186Khatri, 46Khattree, 276, 277, 302Kim, 188, 216, 247Kleiner, 325Klouda, 216Knight, 118, 157, 161, 185–187Koenker, 246, 247Kong, 341Kotz, 37, 76, 336Kowalski, 33Krasnicka, 59Kshirsagar, 276, 287Kutner, v

ladder of powers, 7ladder rule, 8, 38Lahiri, 186Lai, 185Larsen, vlasso, 3, 8, 138, 187, 300least squares, 136least squares estimators, 262Ledolter, vLee, v, 33, 64, 67, 74, 107, 113, 143, 185,

187, 276, 333, 334Leeb, 185, 188Lehmann, 19, 20, 41Lei, 89, 188, 301Leon, vLeroy, 172, 218, 222, 232, 246Li, 77, 201, 309, 313, 321, 327, 328, 332,

334, 340Liang, 340limiting distribution, 12, 14Lin, 187linearly dependent, 54linearly independent, 54linearly related predictors, 312Liu, 186location model, 205Lockhart, 185, 186log rule, 8, 38Loh, 187lowess, 314LR, 3


LS CLT, 72, 139, 182LTA, 247LTS, 247Lu, 185Lumley, vi, 350Luo, 143, 187, 200Lv, 185–187

Masıcek, 216, 244, 247Machado, 108, 127MAD, 3, 206Maguluri, 243Mahalanobis distance, 34, 207, 210, 241,

316Mallows, 187, 200, 218, 324Marden, 2, 60Mardia, 37Markov’s Inequality, 17Maronna, 245, 246, 301Marquardt, 155Marx, vMasking, 220Mason, 156, 157, 185Mathsoft, 350matrix norm, 231MCLT, 3mean, 206MED, 3median, 206, 241median absolute deviation, 206, 241Meinshausen, 163, 165, 186, 200Mendenhall, vMevik, vi, 151, 349mgf, 3, 23Miller, 302Milton, vmixture distribution, 29, 40MLD, 3MLR, 2, 3MLS CLT, 273modified power transformation, 9moment generating function, 60Monahan, vmonotonicity, 320Montanari, 185Morgenthaler, 243, 301Mosteller, 9Mount, 216Muller, vmultiple linear regression, 2, 4, 112, 260,

310multiple linear regression model, 260Multivariate Central Limit Theorem, 25multivariate Chebyshev’s inequality, 93

Multivariate Delta Method, 26multivariate linear model, 260multivariate linear regression model, 259multivariate location and dispersion

model, 260multivariate normal, 30, 34multivariate t distribution, 336multivariate t-distribution, 37MVN, 3, 31Myers, v

Nachtsheim, 316, 322, 341Nadler, 302Naik, 276, 277, 302, 341Navarro, 93Ni, 340Ning, 186Nishi, 143Nishii, 113noncentral χ2 distribution, 60nonparametric bootstrap, 102, 112nonparametric prediction region, 91Nordhausen, 301norm, 231Norman, 210null space, 55

Obozinski, 187, 300
observation, 1
Olive, v, 2, 41, 76, 87, 90, 91, 93, 103, 105, 112, 115, 118, 125–127, 170, 172, 180, 185, 187, 199, 215, 216, 223, 242–245, 248, 249, 264–266, 290, 300, 316, 323, 331, 334, 340, 341

OLS, 3, 8, 136, 331, 340
OLS view, 316
Oosterhoff, 247
order statistics, 206, 241
outlier resistant regression, 211
outliers, 6, 205, 220

Potscher, 185
Parente, 108, 127
Park, 240, 242
Partial F Test Theorem, 73, 76
partial least squares, 138, 300
Patterson, 341
pdf, 3
Pena, 220
Pelawa Watagoda, 2, 87, 125–127, 170, 171, 179, 185, 188, 287
percentile method, 105
permutation invariant, 229


PHD, 340
PI, 3
pmf, 3
Polansky, 127
Pollard, 216, 247
population correlation, 32
population mean, 30
Portnoy, 236, 243, 245
positive definite, 56, 147
positive semidefinite, 56, 147
power transformation, 8
Pratt, 118, 218, 236
predicted values, 136, 180
prediction region, 90
prediction region method, 105, 126
predictor variables, 259
Press, 42
principal component direction, 148
principal component regression, 147
principal components regression, 138, 147
projection matrix, 55
Projection Matrix Theorem, 55
proportional hazards model, 310
Pun, 302
pvalue, 74

Qi, 185, 187
quadratic form, 56

R, 349
R Core Team, vi, 179, 301
rank, 54
Rank Nullity Theorem, 55
Rao, v, 30
Ravishanker, v
regression equivariance, 228
regression equivariant, 228
Reid, 59
Rejchel, 186
relaxed lasso, 138, 163, 186
Ren, 106, 108, 109, 127
Rencher, v, 302
residual plot, 5, 264
residuals, 136, 180, 311
response plot, 5, 264, 311, 313, 324
response transformation, 10
response transformation model, 310
response variable, 1, 4
response variables, 259
Riani, 252
ridge regression, 138, 187, 300
Ripley, vi, 350
Ro, 212

Rohatgi, 23, 33
Ronchetti, 220, 246
Rorres, v
Rothman, 186, 188
Rousseeuw, 172, 216, 218, 222, 232, 243, 246
row space, 54
RR plot, 265
Rubinfeld, 328
rule of thumb, 209
Ruppert, 248

S, 19
sample correlation matrix, 208
sample covariance matrix, 208, 241
sample mean, 11, 208, 241
SAS Institute, 305
SAVE, 340
Savin, 276, 287
scale equivariant, 228
Schaalje, v
Scheaffer, v
Scheffe, v
Schwarz, 143, 200
Schweder, 316, 341
score equations, 156
SE, 3, 11
Searle, v, 57, 63, 272, 273, 303
Seber, v, 33, 64, 67, 74, 107, 113, 143, 276, 333, 334
Sen, 41, 127, 139
Sengupta, v
Serfling, 41, 101
Setodji, 300
Severini, 27, 31, 41, 157, 341
Shao, 113, 143
Sheather, 130
Simonoff, 340
Singer, 41, 139
Singh, 243
single index model, 310, 315
singular value decomposition, 154
SIR, 331, 340
sliced inverse regression, 331
Slutsky’s Theorem, 22, 27
Snee, 155
Snell, 311, 341
SP, 3
span, 53, 75
sparse model, 3
spectral decomposition, 147
Spectral Decomposition Theorem, 57
spectral norm, 231
spherical, 34


square root matrix, 57, 75, 148
Srivastava, 46
SSP, 3, 311
standard deviation, 206
standard error, 11
Stapleton, v
Staudte, 130
Stefanski, 243
Steinberger, 188
Stewart, v, 41, 157
Stoker, 316
Streiner, 210
Su, 143, 185–187, 264, 270, 273, 300
submodel, 112
subspace, 53
sufficient predictor, 4, 309
sufficient summary plot, 311
supervised learning, 2
survival regression models, 310
SVD, 154
Swamping, 220

Tableman, 247, 249
Tanis, v
Tarr, 212
Taylor, 167, 187
test data, 2
Tibshirani, 127, 162, 167, 169, 185, 186, 200
Tikhonov regularization, 186
trace, 154
training data, 1
transformation plot, 9, 10
Tremearne, 5, 227
trimmed view, 318
trimmed views estimator, 217
Tsai, 143, 340, 341
Tukey, 9, 325
TV estimator, 218, 243
Tyler, 301

Uraibi, 188

Víšek, 247

van de Geer, 185
Van Loan, 232
van Zomeren, 243
variable selection, 112, 323, 341
variance, 205, 206
vector norm, 231
vector space, 53
Velilla, 322
Venables, vi, 350
von Mises differentiable statistical functions, 101
VV plot, 324

W, 19
Wackerly, v
Walpole, v
Wang, v, 50, 187
Wasserman, 188, 301
weighted least squares, 219
Weisberg, v, 8, 121, 244, 264, 283, 304, 310–312, 316, 317, 325, 327, 340, 350

White, 27, 41
Wichern, 31, 91, 147, 263, 266, 270
Wilcox, 301
Wilcoxon rank estimator, 218, 313
Wilson, 323
Winsor’s principle, 317
Wold, 185
Wood, 325

Xia, 341
Xu, 185

Yang, 41, 185, 188
Ye, v, 185
Yohai, 245, 246
Yoo, 341
Yu, 110, 167

Zhang, 41, 117, 177, 185, 187
Zheng, 187
Zhu, 341
Zou, 166, 188, 191, 200