rate optimal estimation and confidence intervals for high

39
Rate Optimal Estimation and Confidence Intervals for High-dimensional Regression with Missing Covariates Yining Wang 1 , Jialei Wang 3 , Sivaraman Balakrishnan 1,2 , and Aarti Singh 1 1 Machine Learning Department, Carnegie Mellon University 2 Department of Statistics, Carnegie Mellon University 3 Department of Computer Science, University of Chicago February 8, 2017 Abstract We consider the problem of estimating and constructing component-wise confidence inter- vals of a sparse high-dimensional linear regression model when some covariates of the design matrix are missing completely at random. A variant of the Dantzig selector (Candes & Tao, 2007) is analyzed for estimating the regression model and a de-biasing argument is employed to construct component-wise confidence intervals under additional assumptions on the covari- ance of the design matrix. We also derive rates of convergence of the mean-square estimation error and the average confidence interval length, and show that the dependency over several model parameters (e.g., sparsity s, portion of observed covariates ρ * , signal level kβ 0 k 2 ) are optimal in a minimax sense. 1 Introduction High-dimensional regression has been an active topic of research in statistics and machine learning over the past 20 years (Tibshirani, 1996; Efron et al., 2004; Donoho, 2006; Cand` es et al., 2006; Fan & Li, 2001). Generally speaking, the high-dimensional estimation problem concerns the setting where the number of variables (or features) is on par with, or even far exceeds the number of observations (data points) available. To make the estimation problem well-defined, it is usually assumed that only a small portion of the variables are related to the response variable. A typical high-dimensional regression model is defined as y = 0 + ε; X R n×p , ε|X ∼N n (02 ε I ) (1) where β 0 is a p-dimensional sparse linear model to be estimated and ε is i.i.d. Gaussian noise with zero mean and variance σ 2 ε . The number of observations n is assumed to be smaller than the dimension p of each data point, while it is assumed that s<n components of β 0 is non- zero. Popular estimators for Eq. (1) include the Lasso (Tibshirani, 1996), the SCAD (Fan & Li, 2001) and the Dantzig selector (Candes & Tao, 2007), whose asymptotic rates of convergence and 1

Upload: others

Post on 16-Oct-2021

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Rate Optimal Estimation and Confidence Intervals for High

Rate Optimal Estimation and Confidence Intervals forHigh-dimensional Regression with Missing Covariates

Yining Wang1, Jialei Wang3, Sivaraman Balakrishnan1,2, and Aarti Singh1

1Machine Learning Department, Carnegie Mellon University2Department of Statistics, Carnegie Mellon University

3Department of Computer Science, University of Chicago

February 8, 2017

Abstract

We consider the problem of estimating and constructing component-wise confidence inter-vals of a sparse high-dimensional linear regression model when some covariates of the designmatrix are missing completely at random. A variant of the Dantzig selector (Candes & Tao,2007) is analyzed for estimating the regression model and a de-biasing argument is employedto construct component-wise confidence intervals under additional assumptions on the covari-ance of the design matrix. We also derive rates of convergence of the mean-square estimationerror and the average confidence interval length, and show that the dependency over severalmodel parameters (e.g., sparsity s, portion of observed covariates ρ∗, signal level ‖β0‖2) areoptimal in a minimax sense.

1 Introduction

High-dimensional regression has been an active topic of research in statistics and machine learningover the past 20 years (Tibshirani, 1996; Efron et al., 2004; Donoho, 2006; Candes et al., 2006; Fan& Li, 2001). Generally speaking, the high-dimensional estimation problem concerns the settingwhere the number of variables (or features) is on par with, or even far exceeds the number ofobservations (data points) available. To make the estimation problem well-defined, it is usuallyassumed that only a small portion of the variables are related to the response variable. A typicalhigh-dimensional regression model is defined as

y = Xβ0 + ε; X ∈ Rn×p, ε|X ∼ Nn(0, σ2εI) (1)

where β0 is a p-dimensional sparse linear model to be estimated and ε is i.i.d. Gaussian noisewith zero mean and variance σ2

ε . The number of observations n is assumed to be smaller thanthe dimension p of each data point, while it is assumed that s < n components of β0 is non-zero. Popular estimators for Eq. (1) include the Lasso (Tibshirani, 1996), the SCAD (Fan & Li,2001) and the Dantzig selector (Candes & Tao, 2007), whose asymptotic rates of convergence and

1

Page 2: Rate Optimal Estimation and Confidence Intervals for High

model selection properties are well understood (Zhao & Yu, 2006; Bach, 2008; Bickel et al., 2009;Wainwright, 2009)

In many statistical applications, however, the full design (data) matrix X is not fully observedand missing/corrupted entries are common. For example, in a data set that records characterizationof p = 5520 genes for n = 46 patients with soft tissue tumors (Nielsen et al., 2002), a total of6.7% entries are missing; in addition, 78.6% of the 5520 genes and all of the 46 patients have atleast one missing covariate. Under such scenario, classical methods like list-deletion is no longerapplicable; imputation based methods require additional assumptions on the data generative modeland might lead to invalid confidence intervals because the noise of the imputed values is not takeninto consideration.

In this paper, we consider the problem of estimating and building component-wise confidenceintervals of β0 without imputing an incomplete design X . Let Rij ∈ 0, 1 be indicator variablesof whether Xij is missing and define the observable zero-filled n× p matrix X as

Xij = RijXij , 1 ≤ i ≤ n, 1 ≤ j ≤ p.

Following Loh & Wainwright (2012a), we assume that each element Xij is missing independentlyand completely at random, and impose a random design model over X where each row of X issampled i.i.d. from a sub-Gaussian distribution with zero mean and covariance Σ0. The main con-tributions of this paper are as follows:

Estimation with unknown Σ0 We analyze the noisy Dantzig selector estimator and show that itsaverage squared error of estimating β0 (i.e., E‖β−β0‖22) depends quadratically on ρ, the probabilityof observing each Xij , and linearly on ‖β0‖22. The dependency over ρ is better than existing estima-tors (Loh & Wainwright, 2012a; Chen & Caramanis, 2013) under similar settings, which depend onthe 4th power of ρ. It is also proved that the dependency over ρ2 cannot be removed in the minimaxsense. Our bounds are not directly comparable to (Rosenbaum & Tsybakov, 2010, 2013) that donot make random design assumptions, whose statistical rates depend on ‖β0‖1 instead of ‖β0‖2.

Estimation with known Σ0 We analyze a variant of the noisy Dantzig selector and show thatits averaged square error dependes linearly on ρ and ‖β0‖22. This improves over existing estimators(Loh & Wainwright, 2012b) for the known Σ0 setting, which depend quadratically on ρ. In addition,it is shown that under the identity covariance case Σ0 = I the dependency over both ρ and ‖β0‖2 isminimax optimal, as well as the dependency over conventional quantities s, p and n.

Component-wise Confidence Intervals with unknown Σ0 Under the additional assumption thatΣ−1

0 is sparse, coordinate-wise confidence intervals of β0 are constructed by de-biasing the noisyDantzig selector. The constructed confidence intervals are conditioned on X , with randomness overboth the missing pattern Rij and noise ε. Furthermore, under the identity covariance case Σ0 = Iit is shown that the length of the confidence interval matches the minimax rate up to universalconstants.

One important difference from existing de-biased sparse estimators (Javanmard & Montanari,2014; Cai et al., 2014; Zhang & Zhang, 2014; van de Geer et al., 2014) is that when X containsmissing values, the de-biasing matrix Θ (defined in Eq. (7)) correlates with the sparse estimator β(cf. Eqs. (2,3)), and the limiting distribution of the de-biased estimator depends on unseen covariates

2

Page 3: Rate Optimal Estimation and Confidence Intervals for High

in X . We use a variant of the CLIME estimator (Cai et al., 2011) to resolve the correlation issueand propose a data-driven estimator for the limiting variance of the de-biased estimator.

Notations For a vector x, we use ‖x‖p =(∑

j |xj |p)1/p

to denote the p-norm of x. For a matrixA, we use ‖A‖Lp to denote the operator p-norm of A; that is, ‖A‖Lp = supx 6=0 ‖Ax‖p/‖x‖p.We also write ‖A‖∞ for the max norm of a matrix: ‖A‖∞ = maxj,k |Ajk|. For a positive semi-definite matrix A, let λmax(A) and λmin(A) be the largest and smallest eigenvalues of A. We useBp(M) = x : ‖x‖p ≤M to denote the centered `p ball of radius M .

1.1 Related work

Rosenbaum & Tsybakov (2010) proposed the MU-selector for high-dimensional regression underan error-in-variable model, where the design matrix X is observed with deterministic (adversarial)measurement error W that is bounded in matrix max norm. The estimator was further generalizedto handle missing data in (Rosenbaum & Tsybakov, 2013) for which de-biasing the covarianceestimator leads to improved error bounds. Optimization algorithms and minimax rates when W isGaussian white noise are derived in (Belloni et al., 2016).

Loh & Wainwright (2012a) analyzed a gradient descent algorithm for optimizing a non-convexLasso-type loss function and derived rates of convergence from both statistical and optimizationperspectives. Their analysis shows a quadruple dependency over the observation/missing rate forthe mean-square estimation error and requires an upper RE condition. Loh & Wainwright (2012b)derived lower bounds on the minimax rate, assuming identity covariance for the design points andbounded signal level ‖β0‖2. The error lower bound depends linear in observation/missing rate(Loh & Wainwright, 2012b). A general analytical framework for non-convex optimization in high-dimensional regression problems is presented in (Loh & Wainwright, 2015). A similar rate ofconvergence was established in (Chen & Caramanis, 2013) for orthogonal matching pursuit (OMP)type estimators.

Datta & Zou (2015) proposed COCOLASSO, a variant of Lasso for error-in-variable modelswhere a covariance estimate Σ is projected onto a positive semi-definite cone so that the resultingoptimization problem is convex. Both additive and multiplicative measurement error models wereconsidered in (Datta & Zou, 2015) and corresponding rates of convergence were derived.

2 Rate-optimal estimation of β0

2.1 Problem setup and assumptions

Throughout this paper we make the following assumptions:

(A1) Homogenous Gaussian noise: ε ∼ Nn(0, σ2εI) for some σε <∞.

(A2) Sub-Gaussian random design: each row of X is sampled i.i.d. from some underlying sub-Gaussian distribution with covariance Σ0 and sub-Gaussian parameter σx < ∞. Assume0 < λmin(Σ0) ≤ λmax(Σ0) < ∞. For notational simplicity we drop Σ0 and use λmin, λmax

instead in the rest of this paper.

3

Page 4: Rate Optimal Estimation and Confidence Intervals for High

(A3) Missing completely at random: Rij are independent Rademacher variables with Pr[Rij =1] = ρj for some ρ1, · · · , ρp ∈ (0, 1). Also assume Riji,j ⊥⊥ X, ε and ρ∗ = min1≤j≤p ρj >0.

(A4) Sparsity: The support set J0 = supp(β0) = j : |β0j | 6= 0 satisfies |J0| ≤ s for somes n.

(A1), (A3) and (A4) are standard assumptions for high-dimensional regression with missingdata, and (A2) implies (with high probability) a deterministic Restricted Eigenvalue (RE) condi-tion (Bickel et al., 2009) on the sample covariance of X . which leads to a s log p/n fast rate forestimating β0.

2.2 The noisy Dantzig selector

Define X ∈ Rn×p and Σ ∈ Rp×p as

Xij =RijXij

ρj, Σ =

1

nX>X −Ddiag

(1

nX>X

),

where D = diag(1− ρ1, · · · , 1− ρp) is a known p× p diagonal matrix. It is a simple observationthat, conditioned on X , EX = X and EΣ = Σ = 1

nX>X . Define the noisy Dantzig selector as

βn ∈ argminβ∈Rp

‖β‖1 :

∥∥∥∥ 1

nX>y − Σβ

∥∥∥∥∞≤ λn

, (2)

where λn > 0 are tuning parameters. Eq. (2) is a variant of the Dantzig selector Candes & Tao(2007) and is in principle similar to the MU-selector in Rosenbaum & Tsybakov (2010). Note thatEq. (2) is always a convex optimization problem (regardless of whether Σ is positive semi-definite)and hence can be efficiently solved.

We also consider a variant of the noisy Dantzig selector under the idealized scenario where thepopulation covariance Σ0 for the design matrix is known. In particular, define βn as the solution of

βn ∈ argminβ∈Rp

‖β‖1 :

∥∥∥∥ 1

nX>y − Σ0β

∥∥∥∥∞≤ λn

. (3)

Note that the covariance estimate Σ is replaced with the known population covariance Σ0 in Eq. (3).The estimator βn is primarily for theoretical considerations, as Σ0 is in general unknown in mostapplications. In following sections we show that βn achieves exact minimax rates for recovering β0

with missing covariates.

2.3 Rates of convergence and minimax lower bounds

Theorem 2.1 establishes upper bounds on the mean square estimation error of β0. Eq. (4) corre-sponds to the setting where the population covariance Σ0 is known and Eq. (5) holds when Σ0 isunknown.

4

Page 5: Rate Optimal Estimation and Confidence Intervals for High

Theorem 2.1. Assume (A1) to (A4). If log pρ2∗n→ 0 and λn (σ2

x‖β0‖2 + σxσε)√

log pρ∗n

, then

‖βn − β0‖2 ≤ OP

σ2x

√s

λmin

(‖β0‖2

√log p

ρ∗n+σεσx

√log p

ρ∗n

). (4)

In addition, if maxσ4xs log(σxp/ρ∗)ρ3∗λ

2minn

, log pρ4∗n

→ 0 and λn σ2

x‖β0‖2√

log pρ2∗n

+ σxσε

√log pρ∗n

, then

‖βn − β0‖2 ≤ OP

σ2x

√s

λmin

(‖β0‖2

√log p

ρ2∗n

+σεσx

√log p

ρ∗n

). (5)

Finally, ‖βn−β0‖1 ≤ 2√s‖βn−β0‖2 and ‖βn−β0‖1 ≤ 2

√s‖βn−β0‖2 with probability 1−o(1).

Compared to Loh & Wainwright (2012a) our bounds are better by an O(1/ρ∗) factor for βnwhen Σ0 is unknown and an O(1/ρ

3/2∗ ) factor better when Σ0 is known. Our bounds are not

directly comparable to Rosenbaum & Tsybakov (2010) that considers a fixed-design setting withno stochastic model assumed over X . We however remark that error bounds in Rosenbaum &Tsybakov (2010) depend on ‖β0‖1, which could be a factor of

√s worse than ‖β0‖2.

We next present minimax lower bounds for theL2 estimation error ‖βn−β0‖22. We first considerthe simple case where the population covariance Σ0 is the identity.

Theorem 2.2. Suppose 4 ≤ s < 4p/5, s log(p/s)ρ∗n

→ 0 and Σ0 = I . Then

infβn

supβ0∈B2(M)∩B0(s)

E‖βn − β0‖22

≥ C0 ·min

σ2ε +

1− ρ∗1 + 2c

M2, e0.5c2(1−ρ∗)sσ2ε

·min

√s log(p/s)

(1− ρ∗)2n,s log(p/s)

ρ∗n

. (6)

Here C0 > 0 is a universal constant and c > 0 is an arbitrary constant.

Remark 2.1. Under the additional assumption that (1−ρ∗)2s log(p/s)ρ2∗n

→ 0, the right-hand side ofEq. (6) can be simplified to

C0 ·min

σ2ε +

1− ρ∗1 + 2c

M2, e0.5c2(1−ρ∗)sσ2ε

s log(p/s)

ρ∗n.

Suppose that s/p → 0 and hence log p and log(p/s) are of the same order. If the missing rate(1− ρ∗) is at least a constant and the sparsity level s or the noise level σε is not too small, the termec

2(1−ρ∗)sσ2ε is negligible because it increases exponentially with s. This term arises in the lower

bound because on a subset of of the design points whose size decreases exponentially fast with s,the covariates corresponding to the support of β0 are fully observed. Apart from this term, the lowerbound matches the estimation error rate of βn when Σ0 = I , corresponding to λmin = σx = 1.

The setting where Σ0 is unknown is more complicated because of the 1/ρ2∗ dependency in

the upper bound of βn (of squared `2 estimation error). The following theorem shows that suchdependency is unavoidable if the population covariance Σ0 is unknown.

5

Page 6: Rate Optimal Estimation and Confidence Intervals for High

Theorem 2.3. Let γ0 ∈ (0, 1/2) be an arbitrary small positive constant and suppose s ≥ 4,max σ2

εM2ρ∗n

, 1γ0ρ2∗n

→ 0. Define Λ(γ0) = Σ0 ∈ Sp+ : 1−γ0 ≤ λmin(Σ0) ≤ λmax(Σ0) ≤ 1+γ0,where Sp+ is the class of all positive definite p × p matrices. Then for any fixed j ∈ 1, · · · , p itholds that

infβn

supβ0∈B2(M)∩B0(s)

Σ0∈Λ(γ0)

E|βnj − β0j |2 ≥ C1 ·max

σ2ε

ρ∗n,min

(1− ρ∗1 + 2c

M2, e0.5c2(1−ρ∗)sσ2ε

)1

ρ2∗n

,

where C1 > 0 is a universal constant that only depends on γ0 and c > 0 is an arbitrary constant.

3 Confidence intervals of regression coefficients

We describe a method that builds confidence intervals over the estimated βn by de-biasing the noisyDantzig selector. We need the following additional assumption to justify the proposed approach:

(A5) There exist b0, b1 <∞ such that each row (and column) of Σ−10 belongs to B0(b0) ∩ B1(b1).

That is, each row of Σ−10 is b0-sparse and ‖Σ−1

0 ‖L1 ≤ b1.

Condition (A5) allows the usage of CLIME or node-wise Lasso to estimate an approximate “in-verse” of Σ0 that asymptotically de-biases an estimate βn. Similar conditions for high-dimensionalinference were studied in (van de Geer et al., 2014). We discuss potential settings where (A5) couldbe relaxed in Sec. 5.

3.1 The de-biased noisy Dantzig selector

Let Θ be a p× p matrix obtained by solving the following optimization problem:

Θ ∈ argminΘ∈Rp×p‖Θ‖1 : ‖ΣΘ− Ip×p‖∞ ≤ νn and ‖ΘΣ− Ip×p‖∞ ≤ νn

, (7)

where νn > 0 is some tuning parameter to be specified later. Eq. (7) is a missing data variant of theCLIME estimator proposed in Cai et al. (2011) for estimating precision matrices in high dimension.The following lemma formally establishes the performance of Θ for estimating Σ−1

0 . Its proof isstandard and for completeness we include it in the supplementary materials.

Lemma 3.1. Under (A1), (A3) and (A5), suppose log pρ2∗n→ 0 and νn σ2

xb1

√log pρ2∗n

. Then with

probability 1 − o(1) it holds that max‖Θ‖L1 , ‖Θ‖L∞ ≤ b1 and max‖Θ − Σ−10 ‖L1 , ‖Θ −

Σ−10 ‖L∞ ≤ 2νnb0b1.

With an estimator βn of β0 and Θ obtained from solving Eq. (7), the de-biased estimator βun isdefined as

βun = βn + Θ

(1

nX>y − Σβn

). (8)

We then have the following theorem that derives the limiting variance of βun .

6

Page 7: Rate Optimal Estimation and Confidence Intervals for High

Theorem 3.1. Define Γ ∈ Rp×p as

Γ =σ2ε

nX>X +

σ2ε

nDdiag(X>X) + Υ,

where D = diag( 1ρ1− 1, · · · , 1

ρp− 1) and

Υjk =

1n

∑ni=1

∑t 6=j

1−ρtρjρt

X2ijX

2itβ

20t, j = k;

1n

∑ni=1

∑t 6=j,k

1−ρtρt

XijXikX2itβ

20t, j 6= k.

If the conclusion in Lemma 3.1 holds and

σ2xb0b1νn

(σεσx

√log p

ρ∗+ ‖β0‖2

√log p

ρ2∗

+

√n‖βn − β0‖1σ2xb0b1

)p→ 0, (9)

then for any variable subset S ⊆ [p] with constant size it holds that with probability 1 − o(1) overthe random design X ,

√n(βun − β0

)SS

d→ N|S|(

0,[Σ−1

0 ΓΣ−10

]SS

)conditioned on X .

Remark 3.1. When the noisy Dantzig selector Eq. (2) is used for the initial estimation βn and both νnand λn are chosen at the rates specified in Theorem 2.1 and Lemma 3.1, then a sufficient conditionfor Eq. (9) to hold is that

σ4xb0b

21

√log2 p

ρ4∗n

(σε√ρ∗

σx+ ‖β0‖2

)(1 +

s

λminb0b1

)→ 0. (10)

Remark 3.2. We derive the asymptotic variance of√n(βunj − β0j) for a specific coordinate j under

the identity covariance setting Σ0 = I and demonstrate its rate optimality. For simplicity we alsoassume uniform observation rates ρ1 = ρ2 = · · · = ρp = ρ∗. Fix j as a single coordinate and letVj = Vjj

√n(βun − β0). By Theorem 3.1, when n is sufficiently large

Vjp→ Γjj

p→ σ2ε

ρ∗+

1− ρ∗ρ2∗

∑t6=j

β20t ≤

σ2ε

ρ∗+

1− ρ∗ρ2∗‖β0‖22. (11)

Comparing Eq. (11) with Theorem 2.3, the variance Vj achieves the minimax rates of coordinate-wise estimation up to problem independent constants. Formally, under the additional assumptionσ2ε e−0.5c2(1−ρ∗)s‖β0‖22 that σε is not exponentially small, we have that

lim supp,n→∞

V 2j

inf β′nsupβ′0∈B2(‖β0‖2)∩B0(s),Σ′0∈Λ(γ0) nE0′ |β′nj − β′0j |2

≤ 2C−11 (1 + 2c),

where C1 > 0 is a universal constant in Theorem 2.3.

7

Page 8: Rate Optimal Estimation and Confidence Intervals for High

3.2 Data-driven approximation of the limiting covariance

The limiting variance Σ−10 ΓΣ−1

0 derived in Theorem 3.1 depends on the full design matrixX and thepopulation precision matrix Σ−1

0 , both of which are unavailable in practice when some covariates ofX are missing. To overcome this difficulty, in this section a data-driven approximation of Σ−1

0 ΓΣ−10

is proposed, which only depends on the observed data X and y.Define Γ = σ2

εn X

>X + Υ, where

Υjk =1

n

n∑i=1

∑t6=j,k

(1− ρt)XijXikX2itβ

2nt,

for j, k ∈ 1, · · · , p. The following theorem shows that ΘΓΘ> is a good approximation ofΣ−1

0 ΓΣ−10 when n is sufficiently large:

Theorem 3.2. Suppose the conclusion in Lemma 3.1 holds, log pρ4∗n→ 0 and ‖βn − β0‖2

p→ 0. Then

∥∥∥ΘΓΘ> − Σ−10 ΓΣ−1

0

∥∥∥∞≤ OP

(σ4xb

21 log2 p

ρ2∗

(‖β0‖22 +

ρ∗σ2ε

σ2x

)(b0νn +

√log p

ρ∗n

)+ ‖β0‖2‖βn − β0‖1

).

Based on Theorems 3.1 and 3.2, an asymptotic (1− α) confidence interval of β0j can be com-puted as

CIj(α) =

βunj − Φ−1(1− α/2)√

(ΘΓΘ>)jj√n

, βunj +Φ−1(1− α/2)

√(ΘΓΘ>)jj

√n

, (12)

where Φ−1(·) is the inverse function of the CDF of the standard Gaussian distribution.

4 Simulation results

4.1 Synthetic data

We fix σε = 0.1 and set Σ0 = Ω−1 where Ω is chosen to be the following banded matrix:

Ωij =

0.5|i−j| if |i− j| ≤ 5

0 otherwise.

Assume uniform observation rate ρ1 = · · · = ρp = ρ, which ranges from 0.5 to 0.9. The supportset J0 ⊂ [p] of β0 is selected uniformly at random, with |J0| = 10. β0 is then generated as

β0ji.i.d.∼ Bernoulli(+1,−1) for j ∈ J0 and β0j = 0 for j /∈ J0. Both the noisy Dantzig selector

Eq. (2) and the noisy CLIME estimator Eq. (7) are solved using alternating direction methods ofmultipliers (ADMM).

8

Page 9: Rate Optimal Estimation and Confidence Intervals for High

4.1.1 Verification of asymptotic normality

We run 1000 independent realizations of R, X, y and study the distributions of√n(βun−β0). We

plot the empirical distribution of

δj =

√n(βunj − β0j)√

(ΘΓΘ>)jj

against the standard Normal distribution. Figure 1 shows that the empirical distribution of δj agreesquite well with N (0, 1). In addition, more observations (n) are required to deliver asymptoticnormality when observation rates are low (e.g., ρ = 0.5).

4.1.2 Average CI coverage and length

We calculate the average coverage and length of the constructed confidence intervals from T inde-pendent realizations, defined as

Avgcov(j) =1

T

T∑i=1

I(β0j ∈ CI(i)j (α)), and Avglen(j) =

1

T

T∑i=1

length(CI(i)j (α)),

where CIj(α) is defined in Eq. (12). We also report the average coverage and length of coordinate-wise confidence intervals across a coordinate subset J ⊆ [p], defined as

Avgcov(J) =1

|J |∑j∈J

Avgcov(j) and Avglen(J) =1

|J |∑j∈J

Avglen(j).

Tables 1 summarize the results for various (n, p, ρ) settings.

4.2 Real data

In this section we conduct experiments on two datasets: DNA and Madelon1, where the distri-bution of the design matrices are not necessarily sampled sub-Gaussian distributions. The DNAdata contains 2000 instances and 180 covariates, while Madelon contains 2000 data points and 500covariates. For these two datasets, we only uses their data matrix X and construct the responsey according to the sparse linear regression model as specified in previous section. Following thesimulation study, we randomly remove the observed covariates with probability 1 − ρ, and thenperform statistical inference based on the datasets with missing covariates. The performance ofthe constructed confidence intervals are reported in Table 3. We see that the proposed procedurecan produced roughly normal estimates for the parameters of interest when ρ is not too small, thisdemonstrate that the method can be robust to violations of the statistical assumptions on the designmatrix.

5 Discussion

Quadratic dependency over ρ∗ Theorem 2.3 shows that if the population covariance Σ0 of thedesign matrix X is unknown, then mean square estimation error of a particular component in β0

1Available from https://www.csie.ntu.edu.tw/˜cjlin/libsvmtools/datasets/

9

Page 10: Rate Optimal Estimation and Confidence Intervals for High

n = 1500, p = 500, ρ = 0.9 n = 5000, p = 500, ρ = 0.7

n = 8000, p = 500, ρ = 0.5 n = 12000, p = 500, ρ = 0.5

Figure 1: Empirical distribution and density of δj =√n(βunj−β0j)√(ΘΓΘ>)jj

of 1000 independent realizations.

Top row in each subfigure: two coordinates randomly chosen from J0; bottom row in each subfigure:two coordinates randomly chosen from Jc0 . Red curve: density of the standard Normal distribution.

10

Page 11: Rate Optimal Estimation and Confidence Intervals for High

Table 1: 95% confidence intervals for high-dimensional regression with missing data when ρ ∈[0.7, 0.9].

(n, p, ρ)Random j ∈ J0 Random j 6∈ J0 J0 Jc0

Avgcov Avglen Avgcov Avglen Avgcov Avglen Avgcov Avglen(1000,200,0.9) 0.941 0.182 0.951 0.192 0.938 0.208 0.966 0.187(1000,200,0.8) 0.945 0.318 0.948 0.329 0.944 0.334 0.979 0.331(1000,200,0.7) 0.952 0.494 0.983 0.540 0.949 0.547 0.989 0.529(1500,500,0.9) 0.931 0.155 0.966 0.170 0.945 0.183 0.971 0.158(1500,500,0.8) 0.927 0.278 0.982 0.294 0.937 0.308 0.985 0.284(1500,500,0.7) 0.963 0.415 0.994 0.469 0.971 0.497 0.995 0.450(2000,1000,0.9) 0.947 0.144 0.974 0.144 0.949 0.160 0.975 0.139(2000,1000,0.8) 0.967 0.249 0.987 0.264 0.939 0.281 0.990 0.254(2000,1000,0.7) 0.952 0.378 0.995 0.422 0.930 0.451 0.997 0.409(3000,2000,0.9) 0.958 0.116 0.954 0.118 0.951 0.133 0.981 0.115(3000,2000,0.8) 0.919 0.202 0.979 0.220 0.948 0.236 0.993 0.212(3000,2000,0.7) 0.891 0.315 0.998 0.349 0.950 0.372 0.998 0.348

Table 2: 95% confidence intervals for regression with missing data when ρ = 0.5.

(n, p, ρ)Random j ∈ J0 Random j 6∈ J0 J0 Jc0

Avgcov Avglen Avgcov Avglen Avgcov Avglen Avgcov Avglen(1000,200,0.5) 0.928 1.051 0.998 1.223 0.942 1.384 0.999 1.194(2000,200,0.5) 0.971 0.715 0.997 0.849 0.971 0.799 0.995 0.813(3000,200,0.5) 0.956 0.574 0.976 0.644 0.961 0.668 0.989 0.640(4000,200,0.5) 0.936 0.468 0.984 0.541 0.943 0.527 0.986 0.534(1500,500,0.5) 0.986 0.795 0.978 0.911 0.756 0.954 1.000 0.896(3000,500,0.5) 0.849 0.510 0.899 0.575 0.479 0.634 0.998 0.572(8000,500,0.5) 0.972 0.352 0.978 0.408 0.908 0.417 0.988 0.403(12000,500,0.5) 0.941 0.272 0.965 0.315 0.936 0.328 0.976 0.309

11

Page 12: Rate Optimal Estimation and Confidence Intervals for High

Table 3: 95% confidence intervals for regression with missing data on real world datasets.

(dataset, ρ)Random j ∈ J0 Random j 6∈ J0 J0 Jc0

Avgcov Avglen Avgcov Avglen Avgcov Avglen Avgcov Avglen(DNA,0.9) 0.924 0.120 0.956 0.128 0.937 0.128 0.957 0.129(DNA,0.8) 0.908 0.195 0.959 0.216 0.926 0.212 0.965 0.218(DNA,0.7) 0.888 0.286 0.967 0.318 0.925 0.314 0.973 0.317(DNA,0.5) 0.713 0.464 0.964 0.516 0.745 0.512 0.976 0.519

(Madelon,0.9) 0.943 0.095 0.963 0.101 0.949 0.098 0.945 0.105(Madelon,0.8) 0.966 0.167 0.976 0.174 0.961 0.181 0.971 0.223(Madelon,0.7) 0.962 0.229 0.977 0.236 0.956 0.253 0.977 0.261(Madelon,0.5) 0.663 0.334 0.977 0.357 0.682 0.377 0.965 0.356

must depend quadratically on the observation ratio ρ∗. We conjecture that such results also hold forthe estimation error of the entire regression model β0 as well. More specifically, we conjecture thatunder suitable finite-sample conditions,

infβn

supβ0∈B2(M)∩B0(s)

Σ0∈Λ(γ0)

E‖βn−β0‖22 ≥ C ′1·max

σ2εs log p

ρ∗n,min

(1− ρ∗1 + 2c

M2, ec2(1−ρ∗)sσ2

ε

)s log p

ρ2∗n

.

We are, however, unable to generalize our construction of diffcult cases in the proof of Theorem2.3 (cf. Sec. 6.4) to handle E‖β − β0‖22. The current construction relies on a carefully designedset of covariance matrices that leak no information unless both X1 and Xj are observed. Extendingsuch construction to multiple covariates requires new ideas.

Finite-sample conditions The finite-sample condition (i.e., relationship between n and the othermodel parameters for the asymptotic error bound to hold) required for Eq. (5) in Theorem 2.1 isslightly more restrictive than the actual error bound suggests. The condition arises from the useof Bernstein-type concentration inequalities, where the variance of a concentrated empirical sum ismuch smaller than its high-order moments. We are not sure whether such finite-sample conditionsare results of a fundamental information-theoretical limitation, or can be avoided by a more refinedanalysis.

Confidence intervals constructed in Sec 3 requires more stringent conditions to be asymptoti-cally level α. Eq. (10) suggests that at least n s2ρ−4

∗ log p needs to be satisfied. On the otherhand, Cai & Guo (2015) shows that no adaptive confidence intervals exist under the regime ofn < s2.

On Condition (A5) (A5) requires the population precision matrix Σ−10 to be sparse, which could

be restrictive as the preicision matrix is only a nuisance parameter and confidence intervals of β0 donot necessarily require good estimation of Σ−1

0 . Javanmard & Montanari (2014) drops such sparseprecision conditions at the cost of asymptotic efficiency of the average length of the resulting CI.

12

Page 13: Rate Optimal Estimation and Confidence Intervals for High

However, the techniques in (Javanmard & Montanari, 2014) cannot be easily adapted to the missingdata case because both estimates Θ and βn depend on the randomness of the missing patterns R.In our proof we circumvent this issue by connecting to the deterministic population precision Σ−1

0 ,which we do not know how to generalize to the case when Σ−1

0 is not sparse.

6 Proofs

6.1 Additional notations on concentration bounds

Definition 6.1. Let A,B be random or deterministic square matrices of the same size and ε be arandom vector of i.i.d.N (0, σ2

ε) components. Let ϕu,v(A,B; logN), ϕu,∞(A,B; logN), ϕε,∞(A)be terms such that, with probability 1−o(1), for all subset S of vectors with |S| ≤ N , the followinghold for all u, v ∈ S: ∣∣u>(A−B)v

∣∣ ≤ ϕu,v(A,B; logN) · ‖u‖2‖v‖2;∥∥∥A>ε∥∥∥∞≤ ϕε,∞(A) · σε.

Note that ϕu,v(·, ·) is symmetric and satisfies the triangle inequality. Also, infinity norms like‖A − B‖∞ or ‖(A − B)u‖∞ for a fixed u can be upper bounded by ϕu,v(A,B;O(log dim(A))),by considering the set of unit vectors e1, · · · , edim(A).

6.2 Proof of Theorem 2.1

We need the following two concentration lemmas, which are proved in the supplementary material.

Lemma 6.1. Denote random matrices A(`), ` ∈ 0, 1, 2 as A(0) = Σ, A(1) = 1nX>X and

A(2) = Σ, respectively. Then for ` ∈ 0, 1, 2:

ϕu,v

(A(`),Σ0; logN

)≤ O

(σ2x max

logN

ρ1.5`∗ n

,

√logN

ρ`∗n

).

Lemma 6.2. If log pρ∗n→ 0 then ϕε,∞( 1

nX) ≤ O(σεσx

√log pρ∗n

).

We present the following lemma commonly known as the basic inequality in high-dimensionalinference literature. Its proof is given in the supplementary material.

Lemma 6.3 (Basic inequality). Suppose log pρ4∗n

→ 0 for βn or log pρ2∗n

→ 0 for βn, and let J0 =

supp(β0) be the support of β0. If λn ≥ Ωσx√

log pn (σx‖β0‖2ρ∗

+ σε√ρ∗

) and λn ≥ Ωσx√

log pρ∗n

(σx‖β0‖2+

σε), then with probability 1 − o(1) we have that ‖(βn − β0)Jc0‖1 ≤ ‖(βn − β0)J0‖1 and ‖(βn −β0)Jc0‖1 ≤ ‖(βn − β0)J0‖1.

Definition 6.2 (Restricted eigenvalue condition). A p× p matrix A is said to satisfy RE(s, φmin) iffor all J ⊆ [p], |J | ≤ s the following holds:

infh6=0,‖hJc‖1≤‖hJ‖1

h>Ah

h>h≥ φmin.

13

Page 14: Rate Optimal Estimation and Confidence Intervals for High

The following lemma is proved in the supplementary material.

Lemma 6.4. Suppose σ4xs log(σx log p/ρ∗)

ρ3∗λ2minn

→ 0. Then with probability 1−o(1), Σ satisfies RE(s, (1−o(1))λmin(Σ0)).

We are now ready to prove Theorem 2.1 that establishes the rate of convergence of the noisyDantzig selector estimators. We consider βn first. Define λnµ = 1

nX>y − Σβn. By y = Xβ0 + ε,

we have that

Σ(βn − β0) =

(1

nX>X − Σ0

)β0 +

(Σ0 − Σ

)β0 − λnµ+

1

nX>ε.

Multiply both sides by (βn − β0) and apply Holder’s inequality:

(βn − β0)>Σ(βn − β0)

≤ ‖βn − β0‖1∥∥∥∥( 1

nX>X − Σ0

)β0

∥∥∥∥∞

+∥∥∥(Σ0 − Σ

)β0

∥∥∥∞

+ λn‖µ‖∞ +

∥∥∥∥ 1

nX>ε

∥∥∥∥∞

≤ ‖βn − β0‖1 ·OP

ϕu,v

(1

nX>X,Σ0; log p

)‖β0‖2 + ϕu,v

(Σ,Σ0; log p

)‖β0‖2 + λn + ϕε,∞

(1

nX

)σε

≤ ‖βn − β0‖1 ·OP

σ2x‖β0‖2

√log p

ρ2∗n

+ σxσε

√log p

ρ∗n+ λn

.

Here the last inequality is due to Lemmas 6.1 and 6.2. Suppose σ4xs log(σxp/ρ∗)ρ4∗λ

2minn

→ 0 and λn isappropriately set as in Lemma 6.3. We then have

‖βn − β0‖1 ≤ 2‖(βn − β0)J0‖1 ≤ 2√s‖βn − β0‖2

by Lemma 6.3 and

(βn − β0)>Σ(βn − β0) ≥ (1− o(1))λmin‖βn − β0‖22

by Lemma 6.4. Chaining all inequalities we get

‖βn − β0‖2 ≤ OP

( √s

λmin

σ2x‖β0‖2

√log p

ρ2∗n

+ σxσε

√log p

ρ∗n+ λn

)

≤ OP

( √s

λmin

σ2x‖β0‖2

√log p

ρ2∗n

+ σxσε

√log p

ρ∗n

).

The `1 norm error bound ‖βn−β0‖1 can be easily obtained by the fact that ‖βn−β0‖1 ≤ 2√s‖βn−

β0‖2.Finally, consider µn and define λnµ = 1

nX>y − Σ0βn. Note that ‖δ‖∞ ≤ 1 and

Σ0(βn − β0) =

(1

nX>X − Σ0

)β0 − λnµ+

1

nX>ε.

14

Page 15: Rate Optimal Estimation and Confidence Intervals for High

Note in addition that (βn−β0)>Σ0(βn−β0) ≥ λmin‖βn−β0‖22 by Assumption (A3). Subsequently,the same line of argument as βn yields

‖βn − β0‖2 ≤ 2√s

λmin·OP

ϕu,v

(1

nX>X,Σ0; log p

)‖β0‖2 + λn + ϕε,∞

(1

nX

)σε

≤ OP

(σ2x‖β0‖2 + σxσε

)√ s log p

λ2minρ∗n

.

6.3 Proof of Theorem 2.2

We consider the worst case with equal observation rates across covariates: ρ1 = · · · = ρp = ρ∗and use Fano’s inequality (Lemma A.1) to establish the minimax lower bound in Theorem 2.2.Construct hypothesis β as

β = ( a, · · · , a︸ ︷︷ ︸repeat s/2 times

, 0,±δ, 0, · · · ,±δ, 0︸ ︷︷ ︸exactly s/2 copies of δ

), (13)

where δ → 0 is some parameter to be chosen later and a =√

2M2

s − δ2 is carefully chosen so that‖β‖2 = M . Clearly β ∈ B2(M) ∩ B0(s). Let dH(β, β′) =

∑pj=1 I[βj 6= β′j ] be the Hamming

distance between β and β′. The following lemma shows that it is possible to construct a largehypothesis classes where any two models in the hypothesis class are far away under the Hammingdistance:

Lemma 6.5 (Raskutti et al. (2011), Lemma 4). Define H = z ∈ −1, 0,+1p : ‖z‖0 = s. Forp, s even and s < 2p/3, there exists a subset H ⊆ H with cardinality |H| ≥ exp s2 log p−s

s/2 such

that ρH(z, z′) ≥ s/2 for all dinstinct s, s′ ∈ H.

Without loss of generality we shall restrain ourselves to even p and s/2 scenarios. This doesnot affect the minimax lower bound to be proved. Using the above lemma and under the conditionthat s ≤ 4p/5, one can construct Θ consisting of hypothesis of the form in Eq. (13) such thatlog |Θ| s log(p/s) and ‖β − β′‖2 ≥

√s/4δ for all distinct β, β′ ∈ Θ. It remains to evaluate the

KL divergence between Pβ and Pβ′ .Let xobs and xmis denote the observed and missing covariates of a particular data point and let

βobs, βmis be the corresponding partition of coordinates of β. The likelihood of xobs and y can beobtained by integrating out xmis (assuming there are q coordinates that are observed):

p(y, xobs;β) = ρq∗(1− ρ∗)p−q∫Np(xobs, xmis; 0, I)N (y − (x>obsβobs − xmis)

>βmis; 0, σ2ε)dxmis

= p(xobs) ·1√

2π(σ2ε + ‖βmis‖22)

exp

(y − x>obsβobs)2

2(σ2ε + ‖βmis‖22)

.

Note that p(xobs) does not depend on β. Subsequently,

KL(Pβ‖Pβ′) = Eβ,ρ∗ logp(y, xobs;β

′)

p(y, xobs;β)

15

Page 16: Rate Optimal Estimation and Confidence Intervals for High

= Eβ,ρ∗

1

2log

σ2ε + ‖β′mis‖22σ2ε + ‖βmis‖22

+1

2

[(y − x>obsβ

′obs)

2

σ2ε + ‖β′mis‖22

−(y − x>obsβobs)

2

σ2ε + ‖βmis‖22

]= Eρ∗

1

2log

σ2ε + ‖β′mis‖22σ2ε + ‖βmis‖22

+1

2

[σ2ε + ‖βmis‖22 + ‖βobs − β′obs‖22

σ2ε + ‖β′mis‖22

− 1

](a)

≤ Eρ∗

1

2

[σ2ε + ‖β′mis‖22σ2ε + ‖βmis‖22

+σ2ε + ‖βmis‖22σ2ε + ‖β′mis‖22

]− 1 +

1

2

‖βobs − β′obs‖22σ2ε + ‖β′mis‖22

= Eρ∗

1

2

(‖β′mis‖22 − ‖βmis‖22

)2(σ2ε + ‖βmis‖22)(σ2

ε + ‖β′mis‖22)+

1

2

‖βobs − β′obs‖22σ2ε + ‖β′mis‖22

. (14)

Here for (a) we apply the inequality that log(1 + x) ≤ x for all x > 0. For some constant c ∈(0, 1/2), define E(c) as the event that at least 1−ρ∗

1+2c portion of the first s/2 coordinates in x are miss-ing. By Chernoff bound, 2 Pr[E(c)] ≥ 1 − e−c2(1−ρ∗)s. Note that under E(c), ‖βmis‖22, ‖β′mis‖22 ≥(1−ρ∗)s2(1+2c)a

2 almost surely. Subsequently,

KL(Pβ‖Pβ′) ≤1

2

Eρ∗|E(c)

(‖β′mis‖22 − ‖βmis‖22

)2(σ2ε + (1−ρ∗)s

2(1+2c)a2)2 +

1

2

Eρ∗|E(c)‖βobs − β′obs‖22σ2ε + (1−ρ∗)s

2(1+2c)a2

+ e−c2(1−ρ∗)s

[1

2

Eρ∗|E(c)

(‖β′mis‖22 − ‖βmis‖22

)2σ4ε

+1

2

Eρ∗|E(c)‖βobs − β′obs‖22σ2ε

].

Because β and β′ are identical in the first s/2 coordinates, both ‖β′mis‖22 − ‖βmis‖22 and ‖βobs −β′obs‖22 are independent of E(c). Therefore,

Eρ∗(‖β′mis‖22 − ‖βmis‖22

)2= Eρ∗

(‖β′mis,>s/2‖

22 − ‖βmis,>s/2‖22

)2≤ 4(1− ρ∗)2s2δ4;

Eρ∗‖β′obs − βobs‖22 = Eρ∗‖β′obs,>s/2 − βobs,>s/2‖22 ≤ 2ρ∗sδ2.

Here β·,>s/2 denote the β· vector without its first s/2 coordinates, and in both inequalities wenote by construction that ‖β>s/2‖0, ‖β′>s/2‖0 ≤ s/2. Because a2 = 2M2

s − δ2, we have that(1−ρ∗)s2(1+2c)a

2 = 1−ρ∗1+2cM

2 − (1−ρ∗)2(1+2c)sδ

2. For now assume that 1−ρ∗1+2c sδ

2 σ2ε + 1−ρ∗

1+2cM2, which then

implies σ2ε + (1−ρ∗)s

2(1+2c)a2 ≥ 1

2

(σ2ε + 1−ρ∗

1+2cM2)

. We will justify this assumption at the end of thisproof. Combining all inequalities we have

KL(Pβ‖Pβ′) ≤8(1− ρ∗)2s2δ4(σ2ε + 1−ρ∗

1+2cM2)2 +

2ρ∗sδ2

σ2ε + 1−ρ∗

1+2cM2

+ e−c2(1−ρ∗)s

[2(1− ρ∗)2s2δ4

σ4ε

+ρ∗sδ

2

σ2ε

].

Let Pnβ and Pnβ′ be the distribution of n i.i.d. samples parameterized by β and β′, respectively.Because the samples are i.i.d., we have that KL(Pnβ ‖Pnβ′) = nKL(Pβ‖Pβ′). On the other hand,

because log |Θ| s log(p/s), to ensure 1 −KL(Pnβ ‖P

nβ′ )+log 1/2

log |Θ| ≥ Ω(1) we only need to show

2If X1, · · · , Xn are i.i.d. random variables taking values in 0, 1 then Pr[ 1n

∑ni=1Xi < (1− δ)µ] ≤ exp− δ

2µ2

for 0 < δ < 1, where µ = EX .

16

Page 17: Rate Optimal Estimation and Confidence Intervals for High

KL(Pnβ ‖Pnβ′) s log(p/s), which is implied by

(1− ρ∗)2s2δ4(σ2ε + 1−ρ∗

1+2cM2)2

s log(p/s)

n⇐= δ2

(σ2ε +

1− ρ∗1 + 2c

M2

)√log(p/s)

(1− ρ∗)2sn;

ρ∗sδ2

σ2ε + 1−ρ∗

1+2cM2 s log(p/s)

n⇐= δ2

(σ2ε +

1− ρ∗1 + 2c

M2

)log(p/s)

ρ∗n;

e−c2(1−ρ∗)s (1− ρ∗)2s2δ4

σ4ε

s log(p/s)

n⇐= δ2 e0.5c2(1−ρ∗)sσ2

ε

√log(p/s)

(1− ρ∗)2sn;

e−c2(1−ρ∗)s ρ∗sδ

2

σ2ε

⇐= δ2 ec2(1−ρ∗)sσ2ε

log(p/s)

ρ∗n.

Combining all terms we have that

δ2 min

σ2ε +

1− ρ∗1 + 2c

M2, e0.5c2(1−ρ∗)sσ2ε

·min

√log(p/s)

(1− ρ∗)2sn,log(p/s)

ρ∗n

. (15)

The bound for ‖β − β′‖22 can then be obtained by ‖β − β′‖22 ≥ s4δ

2.The final part of the proof is to justify the assumption that 1−ρ∗

1+2c sδ2 σ2

ε + 1−ρ∗1+2cM

2. In-

voking Eq. (15), the assumption is valid if 1−ρ∗1+2c max

√s log(p/s)(1−ρ∗)2n ,

s log(p/s)ρ∗n

→ 0, which holds if

s log(p/s)ρ∗n

→ 0.

6.4 Proof of Theorem 2.3

We again take ρ1 = · · · = ρp = ρ∗. The first term σ2ε

ρ∗nin the minimax lower bound is trivial to

establish: consider β0 = δej and β1 = −δej with Σ0 = Σ1 = I . By Eq. (14), we have that

KL(Pnβ0‖Pnβ1) = n ·KL(Pβ0‖Pβ1) ≤ 2ρ∗nδ

2

σ2ε

.

Equating KL(Pnβ0‖Pnβ1

) O(1) we have that δ2 σ2ε

ρ∗n. Because σ2

εM2ρ∗n

→ 0, we know thatβ0, β1 ∈ B2(M) ∩ B0(1) when n is sufficiently large. Invoking Le Cam’s method (Lemma A.2)with |β0j − β1j |2 = 4δ2 σ2

ερ∗n

we prove the desired minimax lower bound of σ2ε

ρ∗n.

We next focus on the second term in the minimax lower bound that involves 1/ρ2∗n. Without

loss of generality assume j > s− 1. Construct two hypothesis (β0,Σ0) and (β1,Σ1) as follows:

β0 = (a√s− 2

, · · · , a√s− 2︸ ︷︷ ︸

repeat s− 2 times

, a, 0, · · · , 0, aγ, 0, · · · , 0︸ ︷︷ ︸β0j=aγ

), Σ0 = Ip×p − γ(es−1e>j + eje

>s−1);

β1 = (a√s− 2

, · · · , a√s− 2︸ ︷︷ ︸

repeat s− 2 times

, a, 0, · · · , 0,−aγ, 0, · · · , 0︸ ︷︷ ︸β0j=−aγ

), Σ1 = Ip×p + γ(es−1e>j + eje

>s−1).

17

Page 18: Rate Optimal Estimation and Confidence Intervals for High

Here γ → 0 is some parameter to be determined later and a is set to a =√

M2

2+γ2to ensure that

‖β0‖2 = ‖β1‖2 = M . It is immediate by definition that β0, β1 ∈ B2(M) ∩ B0(s). In addition, byGershgorin circle theorem all eigenvalues of Σ0 and Σ1 lie in [1 − γ, 1 + γ]. As γ → 0, it holdsthat Σ0,Σ1 ∈ Λ(γ0) for any constant γ0 ∈ (0, 1/2) when n is sufficiently large. A finite-samplestatement of this fact is given at the end of the proof.

Unlike the identity covariance case, the likelihood p(y, xobs;β,Σ) for incomplete observationsare complicated when Σ has non-zero off-diagonal elements. The following lemma gives a generalcharacterization of the likelihood when β 6= 0. Its proof is given in the supplementary material.

Lemma 6.6. Partition the covariance Σ as Σ =

[Σ11 Σ12

Σ21 Σ22

], where Σ11 corresponds to xobs

and Σ22 corresponds to xmis. Define Σ22:1 = Σ22−Σ21Σ−111 Σ12. Let q = dim(Σ11) be the number

of observed covariates. Then

p(y, xobs;β,Σ) = ρq∗(1− ρ∗)p−q ·1√

(2π)q|Σ11|exp

−1

2x>obsΣ

−111 xobs

· 1√

2π(σ2ε + β>misΣ22:1βmis)

exp

(y − x>obsβobs − β>misΣ21Σ−111 xobs)

2

2(σ2ε + β>misΣ22:1βmis)

.

We now present the following lemma, which is key to establish the 1/ρ2∗ rate in the minimax

lower bound. Its proof is given in the supplementary material.

Lemma 6.7. p(y, xobs;β0,Σ0) = p(y, xobs;β1,Σ1) unless both xs−1 and xj are observed.

Let P0 and P1 denote the distributions parameterized by (β0,Σ0) and (β1,Σ1), respectively.Let A denote the event that both xs−1 and xj are observed. By Lemma 6.7, we have that

KL(P0‖P1) = Pr[A]E0

[log

p(y, xobs;β0,Σ0)

p(y, xobs;β1,Σ1)

∣∣∣∣A] = ρ2∗E0

[log

p(y, xobs;β0,Σ0)

p(y, xobs;β1,Σ1)

∣∣∣∣A] .Suppose Σ0 = [Σ011 Σ012; Σ021 Σ022] and Σ1 = [Σ111 Σ112; Σ121 Σ122] are partitioned in thesame way as in Lemma 6.6. Conditioned on the eventA, we have that Σ022 = Σ122 = I(p−q)×(p−q),Σ012 = Σ>021 = Σ112 = Σ>121 = 0q×(p−q), Σ011 = Iq×q − γ(es−1e

>j + eje

>s−1), Σ111 = Iq×q +

γ(es−1e>j + eje

>s−1), and by Lemma A.3, we have that Σ−1

011 = I + γ2

1−γ2 (es−1e>s−1 + eje

>j ) +

γ1−γ2 (es−1e

>j + eje

>s−1) and Σ−1

111 = I + γ2

1−γ2 (es−1e>s−1 + eje

>j ) − γ

1−γ2 (es−1e>j + eje

>s−1). In

addition, det(Σ011) = det(Σ111) = 1 − γ2. Note also that Σ022:1 = Σ122:1 = I(p−q)×(p−q) andhence β>0misΣ022:1β0mis = β>1misΣ122:1β1mis because ‖β0mis‖22 = ‖β1mis‖22 regardless of whichcovariates are missing. Define xobs,<s = xj : xj is observed, j < s and βobs,<s = βj :xj is observed, j < s. Subsequently, invoking Lemma 6.6 we get

E0|A

[log

P0

P1

]= − 2γ

1− γ2E0[xs−1xj ]− E0|A

1

2

(y − x>obsβ0obs)2 − (y − x>obsβ1obs)

2

σ2ε + ‖β0mis‖22

(a)=

2γ2

1− γ2+ E0|A

xj(β0j − β1j)(y − x>obs,<sβ0obs,<s)

σ2ε + ‖β0mis‖22

18

Page 19: Rate Optimal Estimation and Confidence Intervals for High

(b)=

2γ2

1− γ2+ E0|A

xj(β0j − β1j)(x

>mis,<sβ0mis,<s + xjβ0j + ε)

σ2ε + ‖β0mis‖22

(c)=

2γ2

1− γ2+ ER|A

β0j(β0j − β1j)E0[x2

j ] + (β0j − β1j)E0|R[xj(x>mis,<s−1β0mis,<s−1 + ε)]

σ2ε + ‖β0mis‖22

=2γ2

1− γ2+ ER|A

β0j(β0j − β1j)E0[x2

j ]

σ2ε + ‖β0mis‖22

=2γ2

1− γ2+ ER|A

2a2γ2

σ2ε + ‖β0mis‖22

.

Here (a) is due to β0obs,<s = β1obs,<s and β20j = β2

1j , and (b) is because β0k = 0 for all k ≥ sexcept for k = j. Note also that under A, xj is observed and hence β0j always belongs to β0obs.For (c), note that xs−1 is observed under A and xj is independent of x<s−1 and ε conditioned onR, thanks to the missing completely at random assumption (A3). For any constant c ∈ (0, 1/2)define E ′(c) as the event that at least 1−ρ∗

1+2c portion of the first (s − 2) coordinates in x are missing.Note that ‖β0mis‖22 ≥

1−ρ∗1+2c a

2 almost surely under A ∩ E ′(C) and by Chernoff bound Pr[A] ≥1− e−c2(1−ρ∗)(s−2) ≥ 1− e−0.5c2(1−ρ∗)s for s ≥ 4. Subsequently, by law of total expectation

ER|A

2a2γ2

σ2ε + ‖β0mis‖22

≤ 2a2γ2

σ2ε + 1−ρ∗

1+2c a2

+ e−0.5c2(1−ρ∗)s 2a2γ2

σ2ε

.

Replace a2 = M2

2+γ2. We then have that

KL(Pn0 ‖Pn1 ) ≤ nρ2∗

[2γ2

1− γ2+

2M2γ2

(2 + γ2)σ2ε + 1−ρ∗

1+2cM2

+ e−0.5c2(1−ρ∗)s 2M2γ2

(2 + γ2)σ2ε

]

≤ nρ2∗

[2γ2

1− γ2+

2(1 + 2c)γ2

1− ρ∗+ e−0.5c2(1−ρ∗)sM

2γ2

σ2ε

].

Equating KL(Pn0 ‖Pn1 ) O(1) and applying the condition that γ2 → 0, we have that

γ2 min

1− ρ∗

2(1 + 2c), e0.5c2(1−ρ∗)s σ

M2

1

ρ2∗n. (16)

Subsequently, ∣∣β0j − β1j

∣∣2 = 4a2γ2 min

1− ρ∗

2(1 + 2c)M2, e0.5c2(1−ρ∗)sσ2

ε

1

ρ2∗n.

Invoking Lemma A.2 we finish the proof of the minimax lower bound.Finally, we justify the conditions γ2 → 0 and γ < γ0 that are used in the proof. Eq. (16) yields

γ2 ≤ O( 1ρ2∗n

). So γ2 → 0 and γ < γ0 is implied by 1γ20ρ

2∗n→ 0.

6.5 Proof of Theorem 3.1

Using y = Xβ0 + ε we have that

Σ(βn − β0) +

(1

nX>y − Σβn

)=

(1

nX>X − Σ︸ ︷︷ ︸

∆n

)β0 +

1

nX>ε. (17)

19

Page 20: Rate Optimal Estimation and Confidence Intervals for High

Define ∆n = 1nX>X − Σ. Recall that βun = βn + Θ

(1nX>y − Σβn

). Subsequently, multiplying

both sides of Eq. (17) with√nΘ and re-organizing terms we have

√n(βun − β0) =

√nΘ

(∆nβ0 +

1

nX>ε

)−√n(ΘΣ− I)(βn − β0)

=√nΣ−1

0

(∆nβ0 +

1

nX>ε

)−√n(ΘΣ− I)(βn − β0)︸ ︷︷ ︸

rn

+√n(Θ− Σ−1

0 )

(∆nβ0 +

1

nX>ε

)︸ ︷︷ ︸

rn

.

Define rn =√n(ΘΣ− I)(βn − β0) and rn =

√n(Θ− Σ−1

0 )(

∆nβ0 + 1nX>ε)

Lemma 6.8. Suppose log pρ4∗n→ 0 and the conclusion in Lemma 3.1 holds. Then ‖rn‖∞ ≤ OP(

√nνn‖βn−

β0‖1) and ‖rn‖∞ ≤ OP(σxb0b1νn(σε

√log pρ∗

+ σx‖β0‖2√

log pρ2∗

)).

Lemma 6.8 based on Holder’s inequality and is proved in the supplementary materials. If thecondition in Eq. (9) holds, Lemma 6.8 implies that max‖rn‖∞, ‖rn‖∞

p→ 0, which means bothterms rn and rn are asymptotically negligible in the infinity norm sense. It then suffices to analyzethe limiting distribution (conditioned on X) of an =

√nΣ−1

0

(∆nβ0 + 1

nX>ε)

. By Assumptions

(A1) and (A3), E∆n|X = 0, Eε|X = 0 and hence Ean|X = 0. We next analyze the conditionalcovariance Van|X . Recall that ∆n = 1

nX>X − Σ. By definition, for any j, k ∈ 1, · · · , p

[∆n]jk =

1n

∑ni=1

Rijρj

(1− Rik

ρk

)XijXik, j 6= k;

0, j = k.

Here Rij = 1 if Xij is observed and Rij = 0 otherwise. Subsequently, an = Σ−10 an where

[an]j =1√n

n∑i=1

(RijXij

ρjεi +

∑k 6=j

Rijρj

(1− Rik

ρk

)XijXikβ0k︸ ︷︷ ︸

Tij

).

Because R ⊥⊥ X, ε and ε ⊥⊥ X , we have that ETij |X = 0. Therefore, for any j ∈ 1, · · · , p

VTij |X = E[|Tij |2|X

]=σ2εX

2ij

ρj+∑t6=j

1− ρtρjρt

X2ijX

2itβ

20t

and for j 6= k,

cov(Tij , Tik|X) = E [TijTik|X] = σ2εXijXik +

∑t6=j,k

1− ρtρt

XijXikX2itβ

20t.

Because Tijni=1 are i.i.d. random variables, by central limiting theorem, for any subset S ⊆ [p]with constant size

[an]SSd→ N|S| (0, covSS(an|X))

d→ N|S|(

0,[Σ−1

0 ΓΣ−10

]SS

),

where all randomness is conditioned on X .

20

Page 21: Rate Optimal Estimation and Confidence Intervals for High

6.6 Proof of Theorem 3.2

By triangle inequality and Holder’s inequality,

‖Σ−10 ΓΣ−1

0 − ΘΓΘ>‖∞≤ ‖(Σ−1

0 − Θ)ΓΣ−10 ‖∞ + ‖ΘΓ(Σ−1

0 − Θ>)‖∞ + ‖Θ(Γ− Γ)Θ>‖∞

≤ 2 max‖Σ−1

0 ‖L1 , ‖Θ‖L1 , ‖Θ‖L∞

max‖Σ−1

0 − Θ‖L1 , ‖Σ−10 − Θ‖L∞

‖Γ‖∞ + ‖Θ‖2L1

‖Γ− Γ‖∞.

With Lemma 3.1, the bound can be simplified to (with probability 1− o(1))

‖Σ−10 ΓΣ−1

0 − ΘΓΘ>‖∞ ≤ 4b0b21νn‖Γ‖∞ + b21‖Γ− Γ‖∞. (18)

Note that by standard concentration inequalities of supreme of sub-Gaussian random variables,‖X‖∞ ≤ OP(σx

√log p). Also, by Holder’s inequality ‖Υ‖∞ ≤ ρ−2

∗ ‖X‖4∞‖β20‖1. Subsequently,

‖Γ‖∞ ≤σ2ε

ρ∗‖X‖2∞ +

‖X‖4∞‖β0‖22ρ2∗

≤ OP

σ4x log2 p

(σ2ε

σ2xρ∗

+‖β0‖22ρ2∗

). (19)

It remains to upper bound ‖Γ− Γ‖∞. Decompose the difference as

‖Γ− Γ‖∞ ≤ σ2ε

∥∥∥∥ 1

nX>X − 1

nX>X − Ddiag

(1

nX>X

)∥∥∥∥∞

+ ‖Υ− Υ‖∞.

We first focus on the first term. Recall that ‖D‖∞ ≤ 1−1/ρ∗ and 1nX>X = Σ+Ddiag( 1

nX>X).

Subsequently, the first infinity norm term is upper bounded by

‖Σ− Σ0‖∞ +

∥∥∥∥Ddiag

(1

nX>X

)− Ddiag(Σ0)

∥∥∥∥∞

+1

ρ∗‖Σ− Σ0‖∞.

By Lemma 6.1, if log pρ4∗n→ 0 then ‖Σ−Σ0‖∞ ≤ OP(σ2

x

√log pρ2∗n

) and ‖Σ−Σ0‖∞ ≤ OP(σ2x

√log pn ).

For the remaining term, we invoke the following lemma that is proved in the supplementary materi-als:

Lemma 6.9. If log pρ∗n→ 0 then ‖Ddiag( 1

nX>X)− Ddiag(Σ0)‖∞ ≤ OP(σ2

x

√log pρ3∗n

).

Consequently,

σ2ε

∥∥∥∥ 1

nX>X − 1

nX>X − Ddiag

(1

nX>X

)∥∥∥∥∞≤ OP

σ2εσ

2x

√log p

ρ3∗n

. (20)

Finally, we derive the upper bound for ‖Υ − Υ‖∞. We first construct a p × p matrix Υ as an“intermediate” quantity defined as

Υjk =1

n

n∑i=1

∑t6=j,k

(1− ρt)XijXikX2itβ

20t for j, k ∈ 1, · · · , p.

21

Page 22: Rate Optimal Estimation and Confidence Intervals for High

Note that Υ involves the missing design X and the true model β0. Further define Υjkt and Υjkt forj, k, t ∈ 1, · · · , p as

Υjkt =1

n

n∑i=1

(1− ρt)XijXikX2it, Υjkt = EΥjkt|X.

We next state the following concentration results on Υjkt and Υjkt, which will be proved in thesupplementary material.

Lemma 6.10. Fix j, k ∈ [p] and suppose log pρ3∗n→ 0. We then have that

maxj,k∈[p]

maxt6=j,k

∣∣Υjkt

∣∣ ≤ OP

(σ4x log2 p

ρ2∗

)and

maxj,k∈[p]

maxt6=j,k

∣∣Υjkt −Υjkt

∣∣ ≤ OP

(σ4x log2 p

√log p

ρ5∗n

).

We then upper bound ‖Υ− Υ‖∞ by bounding ‖Υ− Υ‖∞ and ‖Υ− Υ‖∞ separately.

Upper bound for ‖Υ− Υ‖∞ By definition, Υjk =∑

t6=j,k Υjktβ2nt and Υjkt =

∑t6=j,k Υjktβ

20t.

Holder’s inequality then yields

‖Υ− Υ‖∞ ≤ maxj,k∈[p]

maxt6=j,k

∣∣Υjkt

∣∣ · ‖β2n − β2

0‖1.

Under the condition that log pρ3∗n→ 0, it holds that maxj,k maxt6=j,k |Υjkt| ≤ OP(1)·maxj,k maxt6=j,k |Υjkt|.

Furthermore, ‖β2n−β2

0‖1 ≤ ‖βn+β0‖∞‖βn−β0‖1 ≤ (‖β0‖2 +‖βn−β0‖2)‖βn−β0‖1. InvokingLemma 6.10 and the condition that ‖βn − β0‖2

p→ 0 we get

‖Υ− Υ‖∞ ≤ OP

σ4x log2 p

ρ2∗‖β0‖2‖βn − β0‖1

. (21)

Upper bound for ‖Υ − Υ‖∞ Note that Υjk =∑

t6=j,k Υjktβ20t and Υjk =

∑t6=j,k Υjktβ

20t. By

Holder’s inequality,‖Υ− Υ‖∞ ≤ max

j,k∈[p]maxt6=j,k

∣∣Υjkt −Υjkt

∣∣ · ‖β20‖1.

Invoking Lemma 6.10 we then have

‖Υ− Υ‖∞ ≤ OP

σ4x log2 p‖β0‖22

√log p

ρ5∗n

. (22)

Finally, combining Eqs. (18,19,20,21,22) we complete the proof of Theorem 3.2.

22

Page 23: Rate Optimal Estimation and Confidence Intervals for High

Appendix A Technical Lemmas

Lemma A.1 (generalized Fano’s inequality, (Ibragimov & Has’ minskii, 2013)). Let Θ be a param-eter set and d : Θ × Θ → R≥0 be a semimetric. Let Pθ be the distribution induced by θ and Pnθbe the distribution of n i.i.d. observations from Pθ. If d(θ, θ′) ≥ α and KL(Pθ‖Pθ′) ≤ β for alldistinct θ, θ′ ∈ Θ, then

infθ

supθ∈Θ

EPθn[d(θ, θ)

]≥ α

2

(1− nβ + log 2

log |Θ|

).

Lemma A.2 (Le Cam’s method, (Le Cam, 2012)). Suppose Pθ0 and Pθ1 are distributions inducedby θ0 and θ1. Let Pnθ0 and Pnθ1 be distributions of n i.i.d. observations from Pθ0 and Pθ1 , respectively.Then for any estimator θ it holds that

1

2

[PrPnθ0

(θ 6= θ0) + PrPnθ1

(θ 6= θ1)

]≥ 1

2− 1

2‖Pnθ0 − P

nθ1‖TV ≥

1

2− 1

2√

2

√nKL(Pθ0‖Pθ1).

Lemma A.3 (Miller (1981), Eq. (13)). Suppose H is a matrix of rank at most 2 and (I + H) isinvertible. Then

(I +H)−1 = I − aH −H2

a+ b,

where a = 1 + tr(H) and 2b = [tr(H)]2 + tr(H2).

References

Bach, F. R. (2008). Consistency of the group lasso and multiple kernel learning. Journal of MachineLearning Research, 9(Jun), 1179–1225.

Belloni, A., Rosenbaum, M., & Tsybakov, A. B. (2016). Linear and conic programming estimatorsin high dimensional errors-in-variables models. Journal of the Royal Statistical Society: Series B(Statistical Methodology).

Bickel, P. J., Ritov, Y., & Tsybakov, A. B. (2009). Simultaneous analysis of lasso and dantzigselector. The Annals of Statistics, (pp. 1705–1732).

Cai, T., Liu, W., & Luo, X. (2011). A constrained L1 minimization approach to sparse precisionmatrix estimation. Journal of the American Statistical Association, 106(494), 594–607.

Cai, T. T., & Guo, Z. (2015). Confidence intervals for high-dimensional linear regression: Minimaxrates and adaptivity. arXiv preprint arXiv:1506.05539.

Cai, T. T., Liang, T., & Rakhlin, A. (2014). Geometric inference for general high-dimensional linearinverse problems. arXiv preprint arXiv:1404.4408.

Candes, E., & Tao, T. (2007). The dantzig selector: Statistical estimation when p is much largerthan n. The Annals of Statistics, (pp. 2313–2351).

23

Page 24: Rate Optimal Estimation and Confidence Intervals for High

Candes, E. J., Romberg, J., & Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruc-tion from highly incomplete frequency information. IEEE Transactions on information theory,52(2), 489–509.

Chen, Y., & Caramanis, C. (2013). Noisy and missing data regression: Distribution-oblivious sup-port recovery. In International Conference on Machine Learning.

Datta, A., & Zou, H. (2015). Cocolasso for high-dimensional error-in-variables regression. arXivpreprint arXiv:1510.07123.

Donoho, D. L. (2006). Compressed sensing. IEEE Transactions on information theory, 52(4),1289–1306.

Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., et al. (2004). Least angle regression. The Annalsof statistics, 32(2), 407–499.

Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracleproperties. Journal of the American statistical Association, 96(456), 1348–1360.

Ibragimov, I. A., & Has’ minskii, R. Z. (2013). Statistical estimation: asymptotic theory, vol. 16.Springer Science & Business Media.

Javanmard, A., & Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, 15(1), 2869–2909.

Le Cam, L. (2012). Asymptotic methods in statistical decision theory. Springer Science & BusinessMedia.

Loh, P.-L., & Wainwright, M. (2012a). High-dimensional regression with noisy and missing data:provable guarantees with nonconvexity. The Annals of Statistics, 40(3), 1637–1664.

Loh, P.-L., & Wainwright, M. J. (2012b). Corrupted and missing predictors: Minimax boundsfor high-dimensional linear regression. In 2012 IEEE International Symposium on InformationTheory Proceedings (ISIT).

Loh, P.-L., & Wainwright, M. J. (2015). Regularized m-estimators with nonconvexity: Statisticaland algorithmic theory for local optima. Journal of Machine Learning Research, 16, 559–616.

Miller, K. S. (1981). On the inverse of the sum of matrices. Mathematics Magazine, 54(2), 67–72.

Nielsen, T. O., West, R. B., Linn, S. C., Alter, O., Knowling, M. A., O’Connell, J. X., Zhu, S., Fero,M., Sherlock, G., Pollack, J. R., et al. (2002). Molecular characterisation of soft tissue tumours:a gene expression study. The Lancet, 359(9314), 1301–1307.

Raskutti, G., Wainwright, M. J., & Yu, B. (2011). Minimax rates of estimation for high-dimensionallinear regression over-balls. IEEE Transactions on Information Theory, 57(10), 6976–6994.

Rosenbaum, M., & Tsybakov, A. (2010). Sparse recovery under matrix uncertainty. The Annals ofStatistics, 38(5), 2620–2651.

Rosenbaum, M., & Tsybakov, A. (2013). Improved matrix uncertainty selector.

24

Page 25: Rate Optimal Estimation and Confidence Intervals for High

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the RoyalStatistical Society. Series B (Methodological), (pp. 267–288).

van de Geer, S., Buhlmann, P., Ritov, Y., & Dezeure, R. (2014). On asymptotically optimal confi-dence regions and tests for high-dimensional models. The Annals of Statistics, 32(3), 1166–1202.

Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using`1-constrained quadratic programming (lasso). IEEE transactions on information theory, 55(5),2183–2202.

Zhang, C.-H., & Zhang, S. (2014). Confidence intervals for low dimensional parameters in highdimensional linear models. Journal of the Royal Statistical Society, Series B (Statistical Method-ology), 76, 217–242.

Zhao, P., & Yu, B. (2006). On model selection consistency of lasso. Journal of Machine learningresearch, 7(Nov), 2541–2563.

25

Page 26: Rate Optimal Estimation and Confidence Intervals for High

Supplementary Material for: Rate Optimal Estimation andConfidence Intervals for High-dimensional Regression with

Missing Covariates

Yining Wang1, Jialei Wang3, Sivaraman Balakrishnan1,2, and Aarti Singh1

1Machine Learning Department, Carnegie Mellon University2Department of Statistics, Carnegie Mellon University

3Department of Computer Science, University of Chicago

February 8, 2017

This supplementary material provides detailed proofs for technical lemmas whose proofs areomitted in the main text.

A Proofs of concentration bounds

A.1 Proof of Lemma 6.1

Fix arbitrary $u, v \in S$. For $j, k \in [p]$ and $\ell \in \{0, 1, 2\}$, define
$$\xi^{(0)}_{jk}(R_i, \rho) = 1, \qquad \xi^{(1)}_{jk}(R_i, \rho) = \frac{R_{ij}}{\rho_j}, \qquad \xi^{(2)}_{jk}(R_i, \rho) = \begin{cases} \dfrac{R_{ij}}{\rho_j}, & j = k;\\[4pt] \dfrac{R_{ij}R_{ik}}{\rho_j\rho_k}, & j \neq k. \end{cases}$$
Also let $T^{(\ell)}_i = \sum_{j,k=1}^p \xi^{(\ell)}_{jk}(R_i, \rho)\, X_{ij} X_{ik}\, u_j v_k$, and write $\widetilde X_{ij} = R_{ij}X_{ij}/\rho_j$ for the inverse-propensity-weighted design. We then have that
$$\Bigl|u^\top\Bigl(\tfrac{1}{n}X^\top X - \Sigma_0\Bigr)v\Bigr| = \Bigl|\frac{1}{n}\sum_{i=1}^n T^{(0)}_i - \mathbb{E}T^{(0)}_i\Bigr|, \tag{S1}$$
$$\Bigl|u^\top\Bigl(\tfrac{1}{n}\widetilde X^\top X - \Sigma_0\Bigr)v\Bigr| = \Bigl|\frac{1}{n}\sum_{i=1}^n T^{(1)}_i - \mathbb{E}T^{(1)}_i\Bigr|, \tag{S2}$$
$$\bigl|u^\top\bigl(\widehat\Sigma - \Sigma_0\bigr)v\bigr| = \Bigl|\frac{1}{n}\sum_{i=1}^n T^{(2)}_i - \mathbb{E}T^{(2)}_i\Bigr|. \tag{S3}$$
The main idea is to use the Bernstein inequality with moment conditions (Lemma E.5) to establish concentration bounds that achieve the optimal dependency on $\rho$. Define $V^{(\ell)} = \mathbb{E}\bigl[|T^{(\ell)}_i - \mathbb{E}T^{(\ell)}_i|^2\bigr]$. We then have that
$$V^{(\ell)} \le \mathbb{E}|T^{(\ell)}_i|^2 = \sum_{j,k,j',k'=1}^p \mathbb{E}\bigl[\xi^{(\ell)}_{jk}\xi^{(\ell)}_{j'k'}\bigr]\,\mathbb{E}\bigl[X_{ij}X_{ik}X_{ij'}X_{ik'}\bigr]\, u_j v_k u_{j'} v_{k'}.$$
It is thus essential to evaluate $\mathbb{E}\bigl[\xi^{(\ell)}_{jk}\xi^{(\ell)}_{j'k'}\bigr]$. For $\ell = 0$ the expectation trivially equals 1. For $\ell = 1$ and $\ell = 2$ we apply the following proposition, which is easily proved from the definitions.

Proposition A.1. $\mathbb{E}\bigl[\xi^{(1)}_{jk}\xi^{(1)}_{j'k'}\bigr] = 1 + I[j = j']\bigl(\tfrac{1}{\rho_j} - 1\bigr)$, and $\mathbb{E}\bigl[\xi^{(2)}_{jk}\xi^{(2)}_{j'k'}\bigr] = 1 + I[j = j']\bigl(\tfrac{1}{\rho_j} - 1\bigr) + I[k = k']\bigl(\tfrac{1}{\rho_k} - 1\bigr) + I[j = j' \wedge k = k']\bigl(\tfrac{1}{\rho_j} - 1\bigr)\bigl(\tfrac{1}{\rho_k} - 1\bigr)$. Here $I[\cdot]$ is the indicator function.
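As a quick sanity check of Proposition A.1 (an illustration with arbitrary values of $p$, $\rho$ and the index tuples, not part of the proof), the $\ell = 2$ identities can be verified by simulating the Bernoulli missingness indicators directly:

import numpy as np

# Monte Carlo check of Proposition A.1 (l = 2) for a few index patterns.
# The dimension, rho and the index tuples below are arbitrary illustrative choices.
rng = np.random.default_rng(0)
p, n_mc = 4, 1_000_000
rho = np.array([0.9, 0.7, 0.5, 0.8])
R = rng.binomial(1, rho, size=(n_mc, p))       # MCAR indicators R_ij ~ Bernoulli(rho_j)

def xi2(j, k):
    # xi^(2)_{jk}(R_i, rho): R_ij/rho_j if j == k, else R_ij R_ik / (rho_j rho_k)
    if j == k:
        return R[:, j] / rho[j]
    return R[:, j] * R[:, k] / (rho[j] * rho[k])

def predicted(j, k, jp, kp):
    # Right-hand side of the second identity in Proposition A.1.
    out = 1.0 + (j == jp) * (1 / rho[j] - 1) + (k == kp) * (1 / rho[k] - 1)
    return out + (j == jp and k == kp) * (1 / rho[j] - 1) * (1 / rho[k] - 1)

for (j, k, jp, kp) in [(0, 1, 0, 2), (0, 1, 0, 1), (0, 1, 2, 3)]:
    emp = float(np.mean(xi2(j, k) * xi2(jp, kp)))
    print((j, k, jp, kp), round(emp, 2), round(float(predicted(j, k, jp, kp)), 2))

The empirical averages agree with the predicted values up to Monte Carlo error.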

We are now ready to derive $\mathbb{E}|T^{(\ell)}_i|^2$:
$$\mathbb{E}|T^{(0)}_i|^2 = \mathbb{E}\bigl[|X_i^\top u|^2|X_i^\top v|^2\bigr];$$
$$\mathbb{E}|T^{(1)}_i|^2 = \mathbb{E}\bigl[|X_i^\top u|^2|X_i^\top v|^2\bigr] + \sum_{j=1}^p\Bigl(\frac{1}{\rho_j}-1\Bigr)u_j^2\,\mathbb{E}\bigl[X_{ij}^2|X_i^\top v|^2\bigr] \le \mathbb{E}\bigl[|X_i^\top u|^2|X_i^\top v|^2\bigr] + \frac{1}{\rho_*}\sum_{j=1}^p u_j^2\,\mathbb{E}\bigl[|X_i^\top e_j|^2|X_i^\top v|^2\bigr];$$
$$\mathbb{E}|T^{(2)}_i|^2 = \mathbb{E}\bigl[|X_i^\top u|^2|X_i^\top v|^2\bigr] + \sum_{j=1}^p\Bigl(\frac{1}{\rho_j}-1\Bigr)\bigl(u_j^2+v_j^2\bigr)\mathbb{E}\bigl[X_{ij}^2|X_i^\top v|^2\bigr] + \sum_{j,k=1}^p\Bigl(\frac{1}{\rho_j}-1\Bigr)\Bigl(\frac{1}{\rho_k}-1\Bigr)u_j^2v_k^2\,\mathbb{E}\bigl[X_{ij}^2X_{ik}^2\bigr]$$
$$\le \mathbb{E}\bigl[|X_i^\top u|^2|X_i^\top v|^2\bigr] + \frac{1}{\rho_*}\sum_{j=1}^p\bigl(u_j^2+v_j^2\bigr)\mathbb{E}\bigl[|X_i^\top e_j|^2|X_i^\top v|^2\bigr] + \frac{1}{\rho_*^2}\sum_{j,k=1}^p u_j^2v_k^2\,\mathbb{E}\bigl[|X_i^\top e_j|^2|X_i^\top e_k|^2\bigr].$$
By the Cauchy–Schwarz inequality and the moment upper bounds for sub-Gaussian random variables (Lemma E.1), we have that
$$\mathbb{E}\bigl[|X_i^\top a|^2|X_i^\top b|^2\bigr] \le \sqrt{\mathbb{E}|X_i^\top a|^4}\sqrt{\mathbb{E}|X_i^\top b|^4} \le 16\sigma_x^4\|a\|_2^2\|b\|_2^2.$$
Consequently, there exists a universal constant $c_2 > 0$ such that
$$\mathbb{E}|T^{(0)}_i|^2 \le c_2\sigma_x^4\|u\|_2^2\|v\|_2^2, \qquad \mathbb{E}|T^{(1)}_i|^2 \le \frac{c_2}{\rho_*}\sigma_x^4\|u\|_2^2\|v\|_2^2, \qquad \mathbb{E}|T^{(2)}_i|^2 \le \frac{c_2}{\rho_*^2}\sigma_x^4\|u\|_2^2\|v\|_2^2.$$
We next establish $L > 0$ such that the moment condition in Lemma E.5 is satisfied, namely $\mathbb{E}|T^{(\ell)}_i - \mathbb{E}T^{(\ell)}_i|^k \le \frac12 V^{(\ell)}L^{k-2}k!$ for all $k > 1$. Note that for every $\ell \in \{0, 1, 2\}$ there exist functions $\bar\xi^{(\ell)}_j$ and $\tilde\xi^{(\ell)}_j$, depending on $(R_i,\rho)$ only through coordinate $j$, such that $\xi^{(\ell)}_{jk} = \bar\xi^{(\ell)}_j\bar\xi^{(\ell)}_k + I[j = k]\cdot\tilde\xi^{(\ell)}_j$, and furthermore $\max_j|\bar\xi^{(\ell)}_j| \le 1/\rho_*$, $\tilde\xi^{(0)}_j = \tilde\xi^{(1)}_j = 0$ and $\max_j|\tilde\xi^{(2)}_j| \le 1/\rho_*^2$. Subsequently,
$$\mathbb{E}|T^{(\ell)}_i - \mathbb{E}T^{(\ell)}_i|^k = \mathbb{E}\Biggl|\sum_{j,k=1}^p\bigl(\bar\xi^{(\ell)}_j\bar\xi^{(\ell)}_k + I[j=k]\cdot\tilde\xi^{(\ell)}_j - 1\bigr)X_{ij}X_{ik}u_jv_k\Biggr|^k \le 3^k\Biggl(\mathbb{E}\Bigl|\sum_{j,k=1}^p\bar\xi^{(\ell)}_j\bar\xi^{(\ell)}_kX_{ij}X_{ik}u_jv_k\Bigr|^k + \mathbb{E}\Bigl|\sum_{j=1}^p\tilde\xi^{(\ell)}_jX_{ij}^2u_jv_j\Bigr|^k + \mathbb{E}\Bigl|\sum_{j,k=1}^pX_{ij}X_{ik}u_jv_k\Bigr|^k\Biggr).$$


Here the inequality in the last display is a consequence of the elementary bound: for all $a, b, c \ge 0$, $(a+b+c)^k \le (3\max\{a,b,c\})^k \le 3^k\max\{a^k,b^k,c^k\} \le 3^k(a^k+b^k+c^k)$. Define $\bar u_j = u_j\bar\xi^{(\ell)}_j$, $\bar v_k = v_k\bar\xi^{(\ell)}_k$, $\hat u_j = u_j\sqrt{|\tilde\xi^{(\ell)}_j|}$ and $\hat v_j = v_j\sqrt{|\tilde\xi^{(\ell)}_j|}$. Apply Lemma E.6 with $\bigl|\sum_{j=1}^p\tilde\xi^{(\ell)}_jX_{ij}^2u_jv_j\bigr| \le X_i^\top AX_i$, where $A = \mathrm{diag}(|\hat u_1\hat v_1|, \cdots, |\hat u_p\hat v_p|)$, and note that $\mathrm{tr}(A) = |\hat u|^\top|\hat v| \le \|\hat u\|_2\|\hat v\|_2$, $\sqrt{\mathrm{tr}(A^2)} \le \mathrm{tr}(A)$, and $\|A\|_{\mathrm{op}} = \max_{1\le j\le p}|\hat u_j\hat v_j| \le \|\hat u\|_2\|\hat v\|_2$. Subsequently, for all $t > 0$,
$$\Pr\bigl[X_i^\top AX_i > 3\sigma_x^2\|\hat u\|_2\|\hat v\|_2(1+t)\bigr] \le e^{-t}. \tag{S4}$$
Let $F(x) = \Pr[X_i^\top AX_i \le x]$, $x \ge 0$, be the CDF of $X_i^\top AX_i$ and let $G(x) = 1 - F(x)$. Using integration by parts, we have that
$$\mathbb{E}|X_i^\top AX_i|^k = \int_0^\infty x^k\,\mathrm{d}F(x) = -\int_0^\infty x^k\,\mathrm{d}G(x) = \int_0^\infty kx^{k-1}G(x)\,\mathrm{d}x.$$
In the last equality we use the fact that $\lim_{x\to\infty}x^kG(x) = 0$ for any fixed $k\in\mathbb{N}$, because $G(x) \le \exp\{1 - \tfrac{x}{M}\}$ by Eq. (S4), where $M = 3\sigma_x^2\|\hat u\|_2\|\hat v\|_2$. Consequently,
$$\mathbb{E}|X_i^\top AX_i|^k = \int_0^M kx^{k-1}G(x)\,\mathrm{d}x + k\int_M^\infty x^{k-1}G(x)\,\mathrm{d}x \le M^k + k\int_0^\infty M^{k-1}(1+z)^{k-1}e^{-z}\cdot M\,\mathrm{d}z = M^k + kM^k\int_0^\infty(1+z)^{k-1}e^{-z}\,\mathrm{d}z \le M^k + kM^k\cdot k! \le (k+1)!\,M^k.$$
Here in the second step we apply the change of variable $x = M(1+z)$ together with the fact that $G(M(1+z)) \le e^{-z}$ in the tail integral. Because $2^k \ge k+1$ for all $k \ge 1$, we conclude that
$$\mathbb{E}\Bigl|\sum_{j=1}^p\tilde\xi^{(\ell)}_jX_{ij}^2u_jv_j\Bigr|^k \le 6^k\sigma_x^{2k}k!\,\mathbb{E}\bigl[\|\hat u\|_2^k\|\hat v\|_2^k\bigr], \qquad \forall k \ge 1.$$
Subsequently, applying the Cauchy–Schwarz inequality together with the moment bounds for sub-Gaussian random variables (Lemma E.1), we obtain
$$\mathbb{E}|T^{(\ell)}_i - \mathbb{E}T^{(\ell)}_i|^k \le 3^k\Bigl(\sqrt{\mathbb{E}|X_i^\top\bar u|^{2k}}\sqrt{\mathbb{E}|X_i^\top\bar v|^{2k}} + 6^k\sigma_x^{2k}k!\,\mathbb{E}\bigl[\|\hat u\|_2^k\|\hat v\|_2^k\bigr] + \sqrt{\mathbb{E}|X_i^\top u|^{2k}}\sqrt{\mathbb{E}|X_i^\top v|^{2k}}\Bigr)$$
$$\le 3^k\cdot 2^k\cdot 6^k\,\Gamma(k)\,\sigma_x^{2k}\Bigl(\sqrt{\mathbb{E}\|\bar u\|_2^{2k}}\sqrt{\mathbb{E}\|\bar v\|_2^{2k}} + \sqrt{\mathbb{E}\|\hat u\|_2^{2k}}\sqrt{\mathbb{E}\|\hat v\|_2^{2k}} + \|u\|_2^k\|v\|_2^k\Bigr) \le \rho_*^{\ell/2}\Biggl(\frac{C'\|u\|_2\|v\|_2\sigma_x^2}{\rho_*^{\ell}}\Biggr)^k k!,$$
where $C' < \infty$ is some absolute constant. Comparing this bound on $\mathbb{E}|T^{(\ell)}_i - \mathbb{E}T^{(\ell)}_i|^k$ with the bound on the variance $\mathbb{E}|T^{(\ell)}_i|^2$ obtained earlier, we see that $L = \sigma_x^2\|u\|_2\|v\|_2\cdot C'^3/\rho_*^{1.5\ell}$ is sufficient to guarantee


$\mathbb{E}|T^{(\ell)}_i - \mathbb{E}T^{(\ell)}_i|^k \le \frac12 V^{(\ell)}L^{k-2}k!$ for all $k > 2$ (the case $k = 2$ is trivially true). Applying the Bernstein inequality with moment conditions (Lemma E.5) and a union bound over all $u, v \in S$, we have that
$$\Pr\Biggl[\exists\, u, v \in S:\ \Bigl|\frac1n\sum_{i=1}^n T^{(\ell)}_i - \mathbb{E}T^{(\ell)}_i\Bigr| > \|u\|_2\|v\|_2\,\varepsilon\Biggr] \le 2N^2\exp\Biggl\{-\frac{n\varepsilon^2}{2\bigl(\bar V^{(\ell)} + \bar L\varepsilon\bigr)}\Biggr\}$$
for all $\varepsilon > 0$, where $\bar V^{(\ell)} = V^{(\ell)}/\bigl(\|u\|_2^2\|v\|_2^2\bigr)$ and $\bar L = L/\bigl(\|u\|_2\|v\|_2\bigr)$. Subsequently,
$$\sup_{u,v\in S}\Bigl|\frac1n\sum_{i=1}^n T^{(\ell)}_i - \mathbb{E}T^{(\ell)}_i\Bigr| \le O_{\mathbb{P}}\Biggl(\|u\|_2\|v\|_2\max\Biggl\{\frac{\bar L\log N}{n},\ \sqrt{\frac{\bar V^{(\ell)}\log N}{n}}\Biggr\}\Biggr) \le O_{\mathbb{P}}\Biggl(\sigma_x^2\|u\|_2\|v\|_2\max\Biggl\{\frac{\log N}{\rho_*^{1.5\ell}n},\ \sqrt{\frac{\log N}{\rho_*^{\ell}n}}\Biggr\}\Biggr).$$
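Before moving on, a small simulation sketch illustrating the object controlled by Lemma 6.1 (arbitrary parameter values, not part of the proof): the $\xi^{(2)}$-weighted covariance estimate built from incomplete data is unbiased for $\Sigma_0$ under MCAR missingness.

import numpy as np

# Unbiasedness sketch for the xi^(2)-weighted covariance estimate under MCAR missingness.
# Sigma0, rho, n and the number of trials are arbitrary illustrative choices.
rng = np.random.default_rng(5)
p, n, rho, trials = 3, 200, 0.6, 2000
Sigma0 = np.array([[1.0, 0.3, 0.1],
                   [0.3, 1.0, 0.2],
                   [0.1, 0.2, 1.0]])
L = np.linalg.cholesky(Sigma0)
acc = np.zeros((p, p))
for _ in range(trials):
    X = rng.normal(size=(n, p)) @ L.T
    R = rng.binomial(1, rho, size=(n, p))
    Xw = R * X / rho                          # inverse-propensity-weighted observed entries
    S = Xw.T @ Xw / n                         # off-diagonal weights R_ij R_ik / rho^2
    np.fill_diagonal(S, (R * X ** 2 / rho).mean(axis=0))   # diagonal weight R_ij / rho
    acc += S
print(np.round(acc / trials, 2))              # approximately Sigma0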

A.2 Proof of Lemma 6.2

Define $\delta_j = \frac1n\sum_{i=1}^n Z_{ij}$, where $Z_{ij} = \widetilde X_{ij}\varepsilon_i$. Because $\mathbb{E}[\varepsilon_i|X] = 0$, we have that $\mathbb{E}Z_{ij} = 0$. In addition,
$$\mathbb{E}|Z_{ij}|^2 = \frac{\sigma_\varepsilon^2\sigma_x^2}{\rho_j} \le \frac{\sigma_\varepsilon^2\sigma_x^2}{\rho_*} =: V$$
and
$$\mathbb{E}|Z_{ij}|^k = \rho_j\cdot\frac{1}{\rho_j^k}\cdot\mathbb{E}|\varepsilon_i|^k\cdot\mathbb{E}|X_{ij}|^k \le \frac{1}{\rho_*^{k-1}}\cdot k^2 2^k\sigma_x^k\sigma_\varepsilon^k\,\Gamma\Bigl(\frac k2\Bigr)^2 \le \frac{k^2(2\sigma_x\sigma_\varepsilon)^k}{\rho_*^{k-1}}\,k! \le \rho_*\Bigl(\frac{4\sigma_x\sigma_\varepsilon}{\rho_*}\Bigr)^k k!.$$
By setting $L = 64\sigma_x\sigma_\varepsilon/\rho_*$ we have that $\mathbb{E}|Z_{ij}|^k \le \frac12 VL^{k-2}k!$ for all $k > 1$. Subsequently, applying the Bernstein inequality with moment conditions (Lemma E.5) and a union bound over $j = 1,\cdots,p$, we have that
$$\Pr\bigl[\|\delta\|_\infty > \varepsilon\bigr] \le 2p\exp\Biggl\{-\frac{n\varepsilon^2}{2(V + L\varepsilon)}\Biggr\}$$
for any $\varepsilon > 0$. Suppose $\frac{\varepsilon L}{V}\to 0$. We then have that
$$\|\delta\|_\infty \le O_{\mathbb{P}}\Biggl(\sigma_\varepsilon\sigma_x\sqrt{\frac{\log p}{\rho_* n}}\Biggr).$$
The condition $\frac{\varepsilon L}{V}\to 0$ is satisfied when $\frac{\log p}{\rho_* n}\to 0$.
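As an illustration of the claimed $\sqrt{\log p/(\rho_* n)}$ scaling (a simulation sketch with arbitrary parameter values, not part of the proof):

import numpy as np

# Simulation sketch: delta_j = (1/n) sum_i R_ij X_ij eps_i / rho, compared with the
# sigma_eps * sigma_x * sqrt(log p / (rho n)) rate.  Parameter values are illustrative.
rng = np.random.default_rng(1)
n, p, sigma_x, sigma_eps = 2000, 500, 1.0, 1.0
for rho in (1.0, 0.5, 0.25):
    X = rng.normal(0.0, sigma_x, size=(n, p))
    eps = rng.normal(0.0, sigma_eps, size=n)
    R = rng.binomial(1, rho, size=(n, p))
    delta = ((R * X / rho) * eps[:, None]).mean(axis=0)
    rate = sigma_eps * sigma_x * np.sqrt(np.log(p) / (rho * n))
    print(f"rho = {rho:0.2f}:  ||delta||_inf = {np.abs(delta).max():.4f},  rate = {rate:.4f}")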


A.3 Proof of Lemma 6.9

Fix an arbitrary $j \in \{1, \cdots, p\}$ and consider
$$T_{ij} = (1-\rho_j)\widetilde X_{ij}^2 = \frac{(1-\rho_j)R_{ij}X_{ij}^2}{\rho_j^2}.$$
It is easy to verify that $\bigl[D_{\mathrm{diag}}\bigl(\tfrac1n X^\top X\bigr)\bigr]_{jj} = \frac1n\sum_{i=1}^n T_{ij}$ and $\bigl[D_{\mathrm{diag}}(\Sigma_0)\bigr]_{jj} = \frac1n\sum_{i=1}^n\mathbb{E}T_{ij} = \frac{(1-\rho_j)\Sigma_{0,jj}}{\rho_j}$. We use the moment-based Bernstein inequality (Lemma E.5) to bound the perturbation $\bigl|\frac1n\sum_{i=1}^n T_{ij} - \mathbb{E}T_{ij}\bigr|$. Define $V_j = \mathbb{E}|T_{ij} - \mathbb{E}T_{ij}|^2$. We then have
$$V_j \le \mathbb{E}|T_{ij}|^2 = \frac{(1-\rho_j)^2\,\mathbb{E}X_{ij}^4}{\rho_j^3} \le \frac{3\sigma_x^4}{\rho_*^3}$$
and, for all $k \ge 3$,
$$\mathbb{E}|T_{ij} - \mathbb{E}T_{ij}|^k \le 2^k\bigl(\mathbb{E}|T_{ij}|^k + |\mathbb{E}T_{ij}|^k\bigr) \le \frac{4^{k+1}}{\rho_*^{2k-1}}\sigma_x^{2k}k!.$$
It can then be verified that $\mathbb{E}|T_{ij} - \mathbb{E}T_{ij}|^k \le \frac12 V_jL^{k-2}k!$ for all $k \ge 2$ if $L = \frac{512\sigma_x^2}{\rho_*^2}$. By Lemma E.5 and a union bound over all $j \in \{1,\cdots,p\}$, we have that
$$\Pr\Biggl[\exists\, j:\ \Bigl|\frac1n\sum_{i=1}^n T_{ij} - \mathbb{E}T_{ij}\Bigr| > \varepsilon\Biggr] \le 2p\exp\Biggl\{-\frac{n\varepsilon^2}{2(V + L\varepsilon)}\Biggr\}$$
for all $\varepsilon > 0$, where $V = \frac{3\sigma_x^4}{\rho_*^3}$ and $L = \frac{512\sigma_x^2}{\rho_*^2}$. Under the assumption that $\frac{\varepsilon L}{V}\to 0$, we have that
$$\Bigl\|D_{\mathrm{diag}}\Bigl(\frac1n X^\top X\Bigr) - D_{\mathrm{diag}}(\Sigma_0)\Bigr\|_\infty = \sup_{1\le j\le p}\Bigl|\frac1n\sum_{i=1}^n T_{ij} - \mathbb{E}T_{ij}\Bigr| \le O_{\mathbb{P}}\Biggl(\sigma_x^2\sqrt{\frac{\log p}{\rho_*^3 n}}\Biggr).$$
The condition $\frac{\varepsilon L}{V}\to 0$ is then satisfied when $\frac{\log p}{\rho_* n}\to 0$.

A.4 Proof of Lemma 6.10

By definition and the missing data model,
$$\bar\Upsilon_{jkt} = \mathbb{E}\bigl[\Upsilon_{jkt}\,\big|\,X\bigr] = \begin{cases}\dfrac1n\displaystyle\sum_{i=1}^n\dfrac{1-\rho_t}{\rho_j\rho_t}X_{ij}^2X_{it}^2, & j = k;\\[6pt] \dfrac1n\displaystyle\sum_{i=1}^n\dfrac{1-\rho_t}{\rho_t}X_{ij}X_{ik}X_{it}^2, & j\neq k.\end{cases}$$
Subsequently,
$$\max_{j,k\in[p]}\max_{t\neq j,k}\bigl|\bar\Upsilon_{jkt}\bigr| \le \frac{\|X\|_\infty^4}{\rho_*^2} \le O_{\mathbb{P}}\Biggl(\frac{\sigma_x^4\log^2 p}{\rho_*^2}\Biggr).$$
To prove the second part of the lemma, first fix arbitrary $j, k \in [p]$ and $t \neq j, k$. Define
$$T_{ijkt} = \bigl(\xi_{jkt}(R_i,\rho) - \mathbb{E}\xi_{jkt}(R_i,\rho)\bigr)X_{ij}X_{ik}X_{it}^2, \qquad\text{where}\quad \xi_{jkt}(R_i,\rho) = \frac{(1-\rho_t)R_{ij}R_{ik}R_{it}}{\rho_j\rho_k\rho_t^2}.$$
It is easy to verify that $\Upsilon_{jkt} - \bar\Upsilon_{jkt} = \frac1n\sum_{i=1}^n T_{ijkt}$ and $\mathbb{E}[T_{ijkt}|X] = 0$. We then use the Bernstein inequality with support conditions (Lemma E.4) to bound the concentration of $\frac1n\sum_{i=1}^n T_{ijkt}$ around zero. Define $A = \max_{i,j,k,t}|T_{ijkt}|$ and $V = \max_{i,j,k,t}\mathbb{E}\bigl[|T_{ijkt}|^2\,\big|\,X\bigr]$. By Hölder's inequality we have that
$$A \le \frac{\|X\|_\infty^4}{\rho_*^4} \le O_{\mathbb{P}}\Biggl(\frac{\sigma_x^4\log^2 p}{\rho_*^4}\Biggr).$$
Here in the $O_{\mathbb{P}}(\cdot)$ notation the randomness is over the generating process of $X$, which is independent of the randomness of the missing pattern $R$. In addition, note that
$$\mathbb{E}\bigl|\bigl(\xi_{jkt} - \mathbb{E}\xi_{jkt}\bigr)\bigl(\xi_{jkt'} - \mathbb{E}\xi_{jkt'}\bigr)\bigr| \le \frac{1}{\rho_*^5}$$
for all $j, k, t, t' \in \{1,\cdots,p\}$ with $t, t' \neq j, k$. Subsequently,
$$V = \max_{i,j,k,t}\mathbb{E}\bigl[|T_{ijkt}|^2\,\big|\,X\bigr] \le \frac{1}{\rho_*^5}X_{ij}^2X_{ik}^2X_{it}^4 \le \frac{\|X\|_\infty^8}{\rho_*^5} \le O_{\mathbb{P}}\Biggl(\frac{\sigma_x^8\log^4 p}{\rho_*^5}\Biggr).$$
Applying Lemma E.4 conditioned on the event that the above bounds on $A$ and $V$ hold, we have that with probability $1-O(\delta)$, for some $\delta = o(1)$, the following holds:
$$\Bigl|\frac1n\sum_{i=1}^n T_{ijkt}\Bigr| \le O\Biggl(\sigma_x^4\log^2 p\sqrt{\frac{\log(1/\delta)}{\rho_*^5 n}}\Biggr) =: \varepsilon,$$
provided that $\frac{\varepsilon A}{V}\to 0$. Applying a union bound over all $j, k \in [p]$ and $t \in [p]\setminus\{j,k\}$, we get
$$\max_{j,k\in[p]}\max_{t\neq j,k}\Bigl|\frac1n\sum_{i=1}^n T_{ijkt}\Bigr| \le O_{\mathbb{P}}\Biggl(\sigma_x^4\log^2 p\sqrt{\frac{\log p}{\rho_*^5 n}}\Biggr).$$
The condition $\frac{\varepsilon A}{V}\to 0$ is satisfied when $\frac{\log p}{\rho_*^3 n}\to 0$.

B Proof of restricted eigenvalue conditions

Lemma B.1. Suppose $A, B$ are $p\times p$ random matrices with $\Pr[\|A-B\|_\infty \le M] \ge 1 - o(1)$ for some $M < \infty$. If $A$ satisfies RE$(s, \phi_{\min})$ and $B$ satisfies RE$(s, \phi'_{\min})$, then with probability $1-o(1)$ we have that
$$\phi'_{\min} \ge \phi_{\min} - O(1)\cdot\Bigl\{\varphi_{u,v}\bigl(A, B; O(s\log(Mp))\bigr) + O(1/n)\Bigr\}.$$

Proof. For any $h\in\mathbb{R}^p$ it holds that
$$\frac{h^\top Bh}{h^\top h} \ge \frac{h^\top Ah}{h^\top h} - \frac{\bigl|h^\top(B-A)h\bigr|}{h^\top h}.$$


With appropriate scalings, it suffices to bound
$$\sup_{h:\|h_{J^c}\|_1\le\|h_J\|_1,\,\|h\|_2\le1}\bigl|h^\top(B-A)h\bigr|$$
for all $J\subseteq[p]$ with $|J|\le s$, as this controls the largest possible gap between $\phi_{\min}$ and $\phi'_{\min}$. Define $\mathcal{B}_q(r) = \{x\in\mathbb{R}^p : \|x\|_q\le r\}$, the $\ell_q$-ball of radius $r$. Because $\|h_{J^c}\|_1\le\|h_J\|_1$ implies $\|h\|_1\le2\|h_J\|_1\le2\sqrt s\,\|h\|_2$, we have that
$$\sup_{h:\|h_{J^c}\|_1\le\|h_J\|_1,\,\|h\|_2\le1}\bigl|h^\top(B-A)h\bigr| \le \sup_{h\in\mathcal{B}_2(1)\cap\mathcal{B}_1(2\sqrt s)}\bigl|h^\top(B-A)h\bigr|.$$
By Lemma 11 in the supplementary material of Loh & Wainwright (2012a), we have that
$$\mathcal{B}_2(1)\cap\mathcal{B}_1(2\sqrt s) \subseteq 3\,\mathrm{conv}\bigl\{\mathcal{B}_0(4s)\cap\mathcal{B}_2(1)\bigr\} \subseteq \mathrm{conv}\underbrace{\bigl\{\mathcal{B}_0(4s)\cap\mathcal{B}_2(3)\bigr\}}_{K(4s)}.$$

Here $\mathrm{conv}(A)$ denotes the convex hull of a set $A$. Let $K(4s) = \mathcal{B}_0(4s)\cap\mathcal{B}_2(3)$ and denote by $N_{\varepsilon,\|\cdot\|_2}(K(4s))$ the covering number of $K(4s)$ with respect to the Euclidean norm $\|\cdot\|_2$; that is, $N_{\varepsilon,\|\cdot\|_2}(K(4s))$ is the size of the smallest covering set $H\subseteq K(4s)$ such that $\sup_{h\in K(4s)}\inf_{h'\in H}\|h-h'\|_2\le\varepsilon$. By the definition of the concentration bounds, we have that with probability $1-o(1)$,
$$\sup_{h\in H}\bigl|h^\top(A-B)h\bigr| \le \varphi_{u,u}\bigl(A,B;\log|H|\bigr)\,\sup_{h\in H}\|h\|_2^2 \le 9\,\varphi_{u,u}\bigl(A,B;\log N_{\varepsilon,\|\cdot\|_2}(K(4s))\bigr).$$

Subsequently, for any $\varepsilon\in(0,1)$, with probability $1-o(1)$,
$$\sup_{h\in\mathcal{B}_2(1)\cap\mathcal{B}_1(2\sqrt s)}\bigl|h^\top(B-A)h\bigr| \le \sup_{h\in\mathrm{conv}\{K(4s)\}}\bigl|h^\top(A-B)h\bigr| \le \sup_{\substack{\xi_1,\cdots,\xi_T\ge0,\ \xi_1+\cdots+\xi_T=1\\ h_1,\cdots,h_T\in K(4s)}}\;\sum_{i,j=1}^T\xi_i\xi_j\bigl|h_i^\top(A-B)h_j\bigr| \le \sup_{h,h'\in K(4s)}\bigl|h^\top(A-B)h'\bigr|$$
$$\le \sup_{h,h'\in H_{\varepsilon,\|\cdot\|_2}[K(4s)]}\bigl|h^\top(A-B)h'\bigr| + (6\varepsilon+3\varepsilon^2)\|A-B\|_{L_2} \le 36\,\varphi_{u,u}\bigl(A,B;\log N_{\varepsilon,\|\cdot\|_2}(K(4s))\bigr) + \varepsilon pM.$$
Here the last inequality uses the condition that $\|A-B\|_\infty\le M$ with probability $1-O(n^{-\alpha})$. Taking $\varepsilon = O(1/(p^2M))$ we have that $\varepsilon pM = O(1/p) = O(1/n)$.

The final part of the proof is to establish upper bounds on the covering number $N_{\varepsilon,\|\cdot\|_2}(K(4s))$. First note that by definition
$$K(4s) = \bigcup_{J\subseteq[p]:\,|J|\le4s}\bigl\{h : \mathrm{supp}(h) = J \ \wedge\ \|h\|_2\le3\bigr\}.$$

The covering number of a union of subsets can be upper bounded by the following proposition:


Proposition B.1. Let $K = K_1\cup\cdots\cup K_m$. Then $N_{\varepsilon,\|\cdot\|_2}(K) \le \sum_{i=1}^m N_{\varepsilon,\|\cdot\|_2}(K_i)$.

Proof. Let $H_i\subseteq K_i$ be $\varepsilon$-covering sets of the subsets $K_i$ of minimal size, and define $H = H_1\cup\cdots\cup H_m$. Clearly $|H| \le \sum_{i=1}^m|H_i| \le \sum_{i=1}^m N_{\varepsilon,\|\cdot\|_2}(K_i)$. It remains to prove that $H$ is a valid $\varepsilon$-covering set of $K$. Take an arbitrary $h\in K$. By definition, there exists $i\in[m]$ such that $h\in K_i$; subsequently, there exists $h^*\in H_i\subseteq H$ such that $\|h-h^*\|_2\le\varepsilon$. Therefore, $H$ is a valid $\varepsilon$-covering set of $K$.

Define $K_J(r) = \{h : \mathrm{supp}(h) = J \ \wedge\ \|h\|_2\le r\}$. The covering number of $K_J(r)$ is established in the following proposition:

Proposition B.2. $N_{\varepsilon,\|\cdot\|_2}(K_J(r)) \le \bigl(\frac{4r+\varepsilon}{\varepsilon}\bigr)^{|J|}$.

Proof. $K_J(r)$ is nothing but a centered $|J|$-dimensional ball of radius $r$, located at the coordinates indexed by $J$. The covering number bound for such balls is due to Lemma 2.5 of van de Geer (2010).

Combining the above propositions, we obtain
$$\log N_{\varepsilon,\|\cdot\|_2}(K(4s)) \le \log\sum_{j=0}^{4s}\binom pj + \log\Bigl(\frac{12+\varepsilon/2}{\varepsilon/2}\Bigr)^{4s} \le O\bigl(s\log(p/\varepsilon)\bigr).$$
With the configuration $\varepsilon = O(1/(p^2M))$, we have that
$$\log N_{\varepsilon,\|\cdot\|_2}(K(4s)) \le O\bigl(s\log(pM)\bigr).$$
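For completeness, the first term in the display above can be controlled by a standard binomial estimate (a routine step spelled out here under the mild additional assumption $4s \le p/2$):
$$\log\sum_{j=0}^{4s}\binom{p}{j} \le \log\Bigl[(4s+1)\binom{p}{4s}\Bigr] \le \log(4s+1) + 4s\log\frac{ep}{4s} \le O(s\log p),$$
while the second term equals $4s\log\frac{12+\varepsilon/2}{\varepsilon/2} \le O(s\log(1/\varepsilon))$ for $\varepsilon\in(0,1)$; together these give the stated $O(s\log(p/\varepsilon))$ bound, and plugging in $\varepsilon = O(1/(p^2M))$ gives $O(s\log(pM))$.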

We are now ready to prove Lemma 6.4.

Proof of Lemma 6.4. Consider $A = \widehat\Sigma$ and $B = \Sigma_0$ in Lemma B.1. Lemma 6.1 yields
$$\varphi_{u,v}\bigl(\widehat\Sigma,\Sigma_0;O(s\log(Mp))\bigr) \le O\Biggl(\sigma_x^2\max\Biggl\{\frac{s\log(Mp)}{\rho_*^3 n},\ \sqrt{\frac{s\log(Mp)}{\rho_*^2 n}}\Biggr\}\Biggr) =: \varepsilon.$$
By Lemma B.1, to prove this corollary it is sufficient to show that $\frac{\varepsilon}{\lambda_{\min}(\Sigma_0)}\to 0$. Note also that $M = \|\widehat\Sigma-\Sigma_0\|_\infty \le \frac{\|X\|_\infty}{\rho_*^2} \le O\bigl(\frac{\sigma_x\sqrt{\log p}}{\rho_*^2}\bigr)$ with probability $1-o(1)$. The condition $\frac{\varepsilon}{\lambda_{\min}(\Sigma_0)}\to 0$ can then be satisfied when $\frac{\sigma_x^4 s\log(\sigma_x\log p/\rho_*)}{\rho_*^3\lambda_{\min}^2 n}\to 0$.

C Proof of Lemma 3.1

Lemma C.1. Suppose $\frac{\log p}{\rho_*^4 n}\to 0$ and $\nu_n \gtrsim \sigma_x^2 b_1\sqrt{\frac{\log p}{\rho_*^2 n}}$. Then with probability $1-o(1)$ the population precision matrix $\Sigma_0^{-1}$ is a feasible solution to Eq. (7); that is, $\max\bigl\{\|\widehat\Sigma\Sigma_0^{-1} - I_{p\times p}\|_\infty,\ \|\Sigma_0^{-1}\widehat\Sigma - I_{p\times p}\|_\infty\bigr\} \le \nu_n$.


Proof. First, by Hölder's inequality we have that
$$\|\widehat\Sigma\Sigma_0^{-1} - I\|_\infty = \|(\widehat\Sigma-\Sigma_0)\Sigma_0^{-1}\|_\infty \le \|\Sigma_0^{-1}\|_{L_1}\|\widehat\Sigma-\Sigma_0\|_\infty \le b_1\|\widehat\Sigma-\Sigma_0\|_\infty.$$
By Lemma 6.1, with probability $1-o(1)$,
$$\|\widehat\Sigma-\Sigma_0\|_\infty \le \varphi_{u,v}\bigl(\widehat\Sigma,\Sigma_0;2\log p\bigr) \le O\Biggl(\sigma_x^2\sqrt{\frac{\log p}{\rho_*^2 n}}\Biggr),$$
provided that $\frac{\log p}{\rho_*^4 n}\to 0$. Subsequently, we have that
$$\|\Sigma_0^{-1}\|_{L_1}\|\widehat\Sigma-\Sigma_0\|_\infty \le O\Biggl(\sigma_x^2 b_1\sqrt{\frac{\log p}{\rho_*^2 n}}\Biggr) \le \nu_n \tag{S5}$$
with probability $1-o(1)$. The $\|\Sigma_0^{-1}\widehat\Sigma - I\|_\infty$ term can be bounded in the same way, by noting that
$$\|\Sigma_0^{-1}\widehat\Sigma - I\|_\infty \le \|\Sigma_0^{-1}\|_{L_\infty}\|\widehat\Sigma-\Sigma_0\|_\infty \le b_1\|\widehat\Sigma-\Sigma_0\|_\infty.$$

Lemma C.2. Suppose $\Sigma_0^{-1}$ is a feasible solution to the CLIME optimization problem in Eq. (7). Then $\max\{\|\Theta\|_{L_1}, \|\Theta\|_{L_\infty}\} \le \|\Sigma_0^{-1}\|_{L_1}$ and $\|\Theta-\Sigma_0^{-1}\|_\infty \le 2\nu_n\|\Sigma_0^{-1}\|_{L_1}$.

Proof. We first establish that $\|\Theta\|_{L_1} \le \|\Sigma_0^{-1}\|_{L_1}$. In Cai et al. (2011) it is proved that the solution set of Eq. (7) is identical to the solution set of
$$\Theta = \{\omega_i\}_{i=1}^p, \qquad \omega_i \in \mathop{\mathrm{argmin}}_{\omega_i\in\mathbb{R}^p}\bigl\{\|\omega_i\|_1 : \|\widehat\Sigma\omega_i - e_i\|_\infty \le \nu_n\bigr\}.$$
Because $\Sigma_0^{-1}$ belongs to the feasible set of the above constrained optimization problem, we have that $\|\omega_i\|_1 \le \|\Sigma_0^{-1}\|_{L_1}$ for all $i = 1,\cdots,p$ and hence $\|\Theta\|_{L_1} \le \|\Sigma_0^{-1}\|_{L_1}$. The inequality $\|\Theta\|_{L_\infty} \le \|\Sigma_0^{-1}\|_{L_1}$ can be proved by applying the same argument to $\Theta^\top$.

We next prove the infinity-norm bound for the estimation error $\Theta-\Sigma_0^{-1}$. By the triangle inequality,
$$\|\Sigma_0(\Theta-\Sigma_0^{-1})\|_\infty \le \|\widehat\Sigma\Theta - I\|_\infty + \|(\widehat\Sigma-\Sigma_0)\Theta\|_\infty \le \nu_n + \|(\widehat\Sigma-\Sigma_0)\Theta\|_\infty.$$
Using Hölder's inequality, we have that
$$\|(\widehat\Sigma-\Sigma_0)\Theta\|_\infty \le \|\Theta\|_{L_1}\|\widehat\Sigma-\Sigma_0\|_\infty \le \|\Sigma_0^{-1}\|_{L_1}\|\widehat\Sigma-\Sigma_0\|_\infty \le \nu_n.$$
Here the last inequality is due to Eq. (S5). Subsequently, $\|\Sigma_0(\Theta-\Sigma_0^{-1})\|_\infty \le 2\nu_n$. Applying Hölder's inequality again, we obtain
$$\|\Theta-\Sigma_0^{-1}\|_\infty \le \|\Sigma_0^{-1}\|_{L_1}\|\Sigma_0(\Theta-\Sigma_0^{-1})\|_\infty \le 2\nu_n\|\Sigma_0^{-1}\|_{L_1}.$$
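The Hölder-type step used twice above, namely $\|AB\|_\infty \le \|A\|_\infty\|B\|_{L_1}$ with $\|\cdot\|_\infty$ the entrywise maximum absolute value and $\|B\|_{L_1}$ the maximum column-wise $\ell_1$ norm, is easy to check numerically (a small illustration, not part of the argument):

import numpy as np

# Numerical spot check of ||A B||_max <= ||A||_max * ||B||_{L1} on random matrices.
rng = np.random.default_rng(2)
for _ in range(5):
    A = rng.normal(size=(6, 6))
    B = rng.normal(size=(6, 6))
    lhs = np.abs(A @ B).max()
    rhs = np.abs(A).max() * np.abs(B).sum(axis=0).max()
    assert lhs <= rhs + 1e-12
print("||AB||_max <= ||A||_max ||B||_L1 holds on all random draws")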

To translate the infinity-norm estimation error of $\Theta - \Sigma_0^{-1}$ into the desired $L_1$-norm bound, we need the following lemma, which establishes a basic inequality for the estimation error:

Lemma C.3. Suppose $\Sigma_0^{-1}$ is a feasible solution to Eq. (7). Then under Assumption (A5) we have that $\max\{\|\Theta-\Sigma_0^{-1}\|_{L_1},\ \|\Theta-\Sigma_0^{-1}\|_{L_\infty}\} \le 2b_0\|\Theta-\Sigma_0^{-1}\|_\infty$.


Proof. Let $\omega_i$ and $\omega_{0i}$ be the $i$th columns of $\Theta$ and $\Sigma_0^{-1}$, respectively, and let $J_i$ denote the support of $\omega_{0i}$. Define $h = \omega_i - \omega_{0i}$. We then have that
$$\|\omega_i\|_1 = \|\omega_{0i} + h_{J_i}\|_1 + \|h_{J_i^c}\|_1 \ge \|\omega_{0i}\|_1 - \|h_{J_i}\|_1 + \|h_{J_i^c}\|_1.$$
On the other hand, $\|\omega_i\|_1 \le \|\omega_{0i}\|_1$ as shown in the proof of Lemma C.2. Subsequently, $\|h_{J_i^c}\|_1 \le \|h_{J_i}\|_1$ and hence
$$\|\omega_i - \omega_{0i}\|_1 \le 2\|h_{J_i}\|_1 \le 2|J_i|\,\|h\|_\infty \le 2b_0\|\omega_i - \omega_{0i}\|_\infty.$$
Because the above inequality holds for all $i = 1,\cdots,p$, we conclude that $\|\Theta-\Sigma_0^{-1}\|_{L_1} \le 2b_0\|\Theta-\Sigma_0^{-1}\|_\infty$. The bound for $\|\Theta-\Sigma_0^{-1}\|_{L_\infty}$ can be proved by applying the same argument to $\Theta^\top$.

Combining all the above lemmas, we have that with probability $1-o(1)$,
$$\max\bigl\{\|\Theta-\Sigma_0^{-1}\|_{L_1},\ \|\Theta-\Sigma_0^{-1}\|_{L_\infty}\bigr\} \le 2b_0\cdot2\nu_n\|\Sigma_0^{-1}\|_{L_1} \le O\Biggl(\sigma_x^2b_0b_1^2\sqrt{\frac{\log p}{\rho_*^2 n}}\Biggr).$$

D Proofs of the other technical lemmas

D.1 Proof of Lemma 6.3

We first show that under the conditions on $n$, $\lambda_n$ and $\bar\lambda_n$ specified in the lemma, the true regression vector $\beta_0$ is feasible for both optimization problems with high probability; that is, $\|\frac1n X^\top y - \widehat\Sigma\beta_0\|_\infty \le \lambda_n$ and $\|\frac1n X^\top y - \Sigma_0\beta_0\|_\infty \le \bar\lambda_n$ with probability $1-o(1)$.

Consider $\widehat\beta_n$ first. Applying $y = X\beta_0 + \varepsilon$ and Definition 6.1, we have that with probability $1-o(1)$,
$$\Bigl\|\frac1n X^\top y - \widehat\Sigma\beta_0\Bigr\|_\infty \le \Bigl\|\Bigl(\frac1n X^\top X - \Sigma_0\Bigr)\beta_0\Bigr\|_\infty + \bigl\|(\widehat\Sigma-\Sigma_0)\beta_0\bigr\|_\infty + \Bigl\|\frac1n X^\top\varepsilon\Bigr\|_\infty$$
$$\le \Bigl[\varphi_{u,v}\Bigl(\frac1n X^\top X,\Sigma_0;\log p\Bigr) + \varphi_{u,v}\bigl(\widehat\Sigma,\Sigma_0;\log p\bigr)\Bigr]\|\beta_0\|_2 + \sigma_\varepsilon\,\varphi_{\varepsilon,\infty}\Bigl(\frac1n X\Bigr).$$
Now applying Lemmas 6.1 and 6.2, with probability $1-o(1)$,
$$\Bigl\|\frac1n X^\top y - \widehat\Sigma\beta_0\Bigr\|_\infty \le O\Biggl(\sigma_x\sqrt{\frac{\log p}{n}}\biggl(\frac{\sigma_x\|\beta_0\|_2}{\rho_*} + \frac{\sigma_\varepsilon}{\sqrt{\rho_*}}\biggr)\Biggr) \le \lambda_n,$$
provided that $\frac{\log p}{\rho_*^4 n}\to 0$. The same line of argument applies to the second inequality via the following decomposition: under the condition that $\frac{\log p}{\rho_*^2 n}\to 0$, with probability $1-o(1)$,
$$\Bigl\|\frac1n X^\top y - \Sigma_0\beta_0\Bigr\|_\infty \le \Bigl\|\Bigl(\frac1n X^\top X - \Sigma_0\Bigr)\beta_0\Bigr\|_\infty + \Bigl\|\frac1n X^\top\varepsilon\Bigr\|_\infty \le \varphi_{u,v}\Bigl(\frac1n X^\top X,\Sigma_0;\log p\Bigr)\|\beta_0\|_2 + \sigma_\varepsilon\,\varphi_{\varepsilon,\infty}\Bigl(\frac1n X\Bigr) \le O\Biggl(\sigma_x\sqrt{\frac{\log p}{\rho_* n}}\bigl(\sigma_x\|\beta_0\|_2 + \sigma_\varepsilon\bigr)\Biggr) \le \bar\lambda_n.$$


We are now ready to prove Lemma 6.3. We only prove the assertion involving $\widehat\beta_n$, because the same argument applies to $\bar\beta_n$ as well. Let $h = \widehat\beta_n - \beta_0$. Because $J_0 = \mathrm{supp}(\beta_0)$, we have that
$$\|\widehat\beta_n\|_1 = \|\beta_0 + h_{J_0}\|_1 + \|h_{J_0^c}\|_1 \ge \|\beta_0\|_1 - \|h_{J_0}\|_1 + \|h_{J_0^c}\|_1.$$
On the other hand, because both $\widehat\beta_n$ and $\beta_0$ are feasible, by the definition of the optimization problem we have that $\|\widehat\beta_n\|_1 \le \|\beta_0\|_1$. Combining both chains of inequalities we arrive at $\|h_{J_0^c}\|_1 \le \|h_{J_0}\|_1$, which is what we wanted to demonstrate.

D.2 Proof of Lemma 6.6

Proposition D.1. Suppose $X\sim\mathcal{N}(\mu,\nu^2)$ for $\mu\in\mathbb{R}$ and $\nu > 0$. Then for any $b\in\mathbb{R}$ and $a > 0$, it holds that
$$\mathbb{E}\Biggl[\frac{1}{\sqrt{2\pi a^2}}\exp\Biggl\{-\frac{(X-b)^2}{2a^2}\Biggr\}\Biggr] = \frac{1}{\sqrt{2\pi(a^2+\nu^2)}}\exp\Biggl\{-\frac{(\mu-b)^2}{2(a^2+\nu^2)}\Biggr\}.$$

Proof. Because $X\sim\mathcal{N}(\mu,\nu^2)$,
$$\sqrt{2\pi\nu^2}\,\mathbb{E}\exp\Biggl\{-\frac{(X-b)^2}{2a^2}\Biggr\} = \int\exp\Biggl\{-\frac{(x-\mu)^2}{2\nu^2}-\frac{(x-b)^2}{2a^2}\Biggr\}\mathrm{d}x = \int\exp\Biggl\{-\frac{(a^2+\nu^2)x^2 - 2(a^2\mu+\nu^2 b)x + a^2\mu^2+\nu^2b^2}{2a^2\nu^2}\Biggr\}\mathrm{d}x$$
$$= \int\exp\Biggl\{-\frac{1}{2a^2\nu^2}\Biggl[(a^2+\nu^2)\biggl(x - \frac{a^2\mu+\nu^2b}{a^2+\nu^2}\biggr)^2 - \frac{(a^2\mu+\nu^2b)^2}{a^2+\nu^2} + \nu^2b^2 + a^2\mu^2\Biggr]\Biggr\}\mathrm{d}x$$
$$= \exp\Biggl\{-\frac{(\mu-b)^2}{2(a^2+\nu^2)}\Biggr\}\int\exp\Biggl\{-\frac{a^2+\nu^2}{2a^2\nu^2}\biggl(x-\frac{a^2\mu+\nu^2b}{a^2+\nu^2}\biggr)^2\Biggr\}\mathrm{d}x = \exp\Biggl\{-\frac{(\mu-b)^2}{2(a^2+\nu^2)}\Biggr\}\sqrt{\frac{2\pi a^2\nu^2}{a^2+\nu^2}}.$$
The proposition follows after dividing both sides by $\sqrt{2\pi\nu^2}\cdot\sqrt{2\pi a^2}$.
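A quick numerical spot check of Proposition D.1 (an illustration with arbitrary values of $\mu$, $\nu$, $a$ and $b$, not part of the proof):

import numpy as np

# Monte Carlo check of Proposition D.1: for X ~ N(mu, nu^2),
# E[(2*pi*a^2)^(-1/2) exp(-(X-b)^2/(2a^2))] equals the N(b, a^2+nu^2) density at mu.
rng = np.random.default_rng(3)
mu, nu, a, b = 0.7, 1.3, 0.5, -0.2
X = rng.normal(mu, nu, size=2_000_000)
lhs = (np.exp(-(X - b) ** 2 / (2 * a ** 2)) / np.sqrt(2 * np.pi * a ** 2)).mean()
rhs = np.exp(-(mu - b) ** 2 / (2 * (a ** 2 + nu ** 2))) / np.sqrt(2 * np.pi * (a ** 2 + nu ** 2))
print(f"Monte Carlo: {lhs:.4f}   closed form: {rhs:.4f}")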

We now consider the likelihood $p(y, x_{\mathrm{obs}};\beta,\Sigma)$. Integrating out the missing part $x_{\mathrm{mis}}$, we have
$$p(y, x_{\mathrm{obs}};\beta,\Sigma) = p(x_{\mathrm{obs}})\int\frac{1}{\sqrt{2\pi\sigma_\varepsilon^2}}\exp\Biggl\{-\frac{(y - x_{\mathrm{obs}}^\top\beta_{\mathrm{obs}} - x_{\mathrm{mis}}^\top\beta_{\mathrm{mis}})^2}{2\sigma_\varepsilon^2}\Biggr\}\mathrm{d}P(x_{\mathrm{mis}}|x_{\mathrm{obs}}) = p(x_{\mathrm{obs}})\,\mathbb{E}_u\Biggl[\frac{1}{\sqrt{2\pi\sigma_\varepsilon^2}}\exp\Biggl\{-\frac{(y - x_{\mathrm{obs}}^\top\beta_{\mathrm{obs}} - u)^2}{2\sigma_\varepsilon^2}\Biggr\}\,\Bigg|\,x_{\mathrm{obs}}\Biggr],$$
where $u = x_{\mathrm{mis}}^\top\beta_{\mathrm{mis}}$ follows the conditional distribution $u|x_{\mathrm{obs}}\sim\mathcal{N}(\mu,\nu^2)$ with $\mu = \beta_{\mathrm{mis}}^\top\Sigma_{21}\Sigma_{11}^{-1}x_{\mathrm{obs}}$ and $\nu^2 = \beta_{\mathrm{mis}}^\top\Sigma_{22:1}\beta_{\mathrm{mis}}$, where $\Sigma_{22:1} = \Sigma_{22}-\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$. Applying Proposition D.1 with $a = \sigma_\varepsilon$ and $b = y - x_{\mathrm{obs}}^\top\beta_{\mathrm{obs}}$, we have


$$\mathbb{E}_u\Biggl[\frac{1}{\sqrt{2\pi\sigma_\varepsilon^2}}\exp\Biggl\{-\frac{(y-x_{\mathrm{obs}}^\top\beta_{\mathrm{obs}}-u)^2}{2\sigma_\varepsilon^2}\Biggr\}\,\Bigg|\,x_{\mathrm{obs}}\Biggr] = \frac{1}{\sqrt{2\pi\bigl(\sigma_\varepsilon^2+\beta_{\mathrm{mis}}^\top\Sigma_{22:1}\beta_{\mathrm{mis}}\bigr)}}\exp\Biggl\{-\frac{\bigl(y-x_{\mathrm{obs}}^\top\beta_{\mathrm{obs}}-\beta_{\mathrm{mis}}^\top\Sigma_{21}\Sigma_{11}^{-1}x_{\mathrm{obs}}\bigr)^2}{2\bigl(\sigma_\varepsilon^2+\beta_{\mathrm{mis}}^\top\Sigma_{22:1}\beta_{\mathrm{mis}}\bigr)}\Biggr\}.$$

Finally, $R \perp\!\!\!\perp x$ and $x_{\mathrm{obs}}\sim\mathcal{N}_q(0,\Sigma_{11})$, and hence
$$p(x_{\mathrm{obs}}) = \rho^q(1-\rho)^{p-q}\cdot\frac{1}{\sqrt{(2\pi)^q|\Sigma_{11}|}}\exp\Biggl\{-\frac12 x_{\mathrm{obs}}^\top\Sigma_{11}^{-1}x_{\mathrm{obs}}\Biggr\}.$$

D.3 Proof of Lemma 6.7

We prove this lemma by separately discussing the three cases in which at least one of the covariates $x_{s-1}$ and $x_j$ is missing. Assume in each case that $\Sigma_0$ and $\Sigma_1$ are partitioned as in Lemma 6.6; that is, $\Sigma_0 = [\Sigma_{011}\ \Sigma_{012};\ \Sigma_{021}\ \Sigma_{022}]$ and $\Sigma_1 = [\Sigma_{111}\ \Sigma_{112};\ \Sigma_{121}\ \Sigma_{122}]$.

1. Both $x_{s-1}$ and $x_j$ are missing. In this case $\Sigma_{011} = \Sigma_{111} = I_{q\times q}$ and $\Sigma_{012} = \Sigma_{112} = \Sigma_{021}^\top = \Sigma_{121}^\top = 0_{q\times(p-q)}$. Therefore $\Sigma_{011} = \Sigma_{111}$ and the first two terms in $p(y,x_{\mathrm{obs}};\beta_0,\Sigma_0)$ and $p(y,x_{\mathrm{obs}};\beta_1,\Sigma_1)$ are identical. In addition, $\Sigma_{022:1} = \Sigma_{022} = I - \gamma(e_{s-1}e_j^\top + e_je_{s-1}^\top)$ and $\Sigma_{122:1} = \Sigma_{122} = I + \gamma(e_{s-1}e_j^\top + e_je_{s-1}^\top)$. Subsequently, $\beta_{0\mathrm{mis}}^\top\Sigma_{022:1}\beta_{0\mathrm{mis}} = \|\beta_{0\mathrm{mis}}\|_2^2 - 2\gamma\beta_{0,s-1}\beta_{0j} = \|\beta_{0\mathrm{mis}}\|_2^2 - 2a^2\gamma^2$ and $\beta_{1\mathrm{mis}}^\top\Sigma_{122:1}\beta_{1\mathrm{mis}} = \|\beta_{1\mathrm{mis}}\|_2^2 + 2\gamma\beta_{1,s-1}\beta_{1j} = \|\beta_{1\mathrm{mis}}\|_2^2 - 2a^2\gamma^2$. Because $\|\beta_{0\mathrm{mis}}\|_2^2 = \|\beta_{1\mathrm{mis}}\|_2^2$ regardless of which covariates are missing, we have that $\beta_{0\mathrm{mis}}^\top\Sigma_{022:1}\beta_{0\mathrm{mis}} = \beta_{1\mathrm{mis}}^\top\Sigma_{122:1}\beta_{1\mathrm{mis}}$; hence the last term in $p(y,x_{\mathrm{obs}};\beta_0,\Sigma_0)$ and $p(y,x_{\mathrm{obs}};\beta_1,\Sigma_1)$ is also identical, because $\beta_{0\mathrm{mis}}^\top\Sigma_{021}\Sigma_{011}^{-1} = \beta_{1\mathrm{mis}}^\top\Sigma_{121}\Sigma_{111}^{-1} = 0$ and $\beta_{0\mathrm{obs}} = \beta_{1\mathrm{obs}}$ when $x_j$ is missing.

2. $x_{s-1}$ is observed but $x_j$ is missing. In this case, $\Sigma_{011} = \Sigma_{111} = I_{q\times q}$, $\Sigma_{022} = \Sigma_{122} = I_{(p-q)\times(p-q)}$, $\Sigma_{012} = \Sigma_{021}^\top = -\gamma e_{s-1}e_j^\top$ and $\Sigma_{112} = \Sigma_{121}^\top = \gamma e_{s-1}e_j^\top$. Therefore $\Sigma_{011} = \Sigma_{111} = I$ and hence the first two terms in the likelihoods are identical. In addition, $\Sigma_{022:1} = I - \gamma^2e_je_j^\top = \Sigma_{122:1}$ and hence $\beta_{0\mathrm{mis}}^\top\Sigma_{022:1}\beta_{0\mathrm{mis}} = \beta_{1\mathrm{mis}}^\top\Sigma_{122:1}\beta_{1\mathrm{mis}} = \|\beta_{\mathrm{mis}}\|_2^2 - a^2\gamma^4$. Finally, $\beta_{0\mathrm{obs}} = \beta_{1\mathrm{obs}}$ when $x_j$ is missing and $\beta_{0\mathrm{mis}}^\top\Sigma_{021}\Sigma_{011}^{-1} = \beta_{1\mathrm{mis}}^\top\Sigma_{121}\Sigma_{111}^{-1} = -a\gamma^2 e_{s-1}^\top$. Therefore the last term in both likelihoods is the same as well.

3. $x_j$ is observed but $x_{s-1}$ is missing. In this case, $\Sigma_{011} = \Sigma_{111} = I_{q\times q}$, $\Sigma_{022} = \Sigma_{122} = I_{(p-q)\times(p-q)}$, $\Sigma_{012} = \Sigma_{021}^\top = -\gamma e_je_{s-1}^\top$ and $\Sigma_{112} = \Sigma_{121}^\top = \gamma e_je_{s-1}^\top$. Therefore $\Sigma_{011} = \Sigma_{111} = I$ and hence the first two terms in the likelihoods are identical. In addition, $\Sigma_{022:1} = I - \gamma^2e_{s-1}e_{s-1}^\top = \Sigma_{122:1}$ and hence $\beta_{0\mathrm{mis}}^\top\Sigma_{022:1}\beta_{0\mathrm{mis}} = \beta_{1\mathrm{mis}}^\top\Sigma_{122:1}\beta_{1\mathrm{mis}} = \|\beta_{\mathrm{mis}}\|_2^2 - a^2\gamma^2$. Finally, $\beta_{0\mathrm{obs}}^\top x_{\mathrm{obs}} + \beta_{0\mathrm{mis}}^\top\Sigma_{021}\Sigma_{011}^{-1}x_{\mathrm{obs}} = \beta_{0\mathrm{obs},<s}^\top x_{\mathrm{obs},<s} + \beta_{0j}x_j - \gamma\beta_{0,s-1}x_j = \beta_{0\mathrm{obs},<s}^\top x_{\mathrm{obs},<s}$ because $\beta_{0j} = a\gamma$ and $\beta_{0,s-1} = a$. Similarly, $\beta_{1\mathrm{obs}}^\top x_{\mathrm{obs}} + \beta_{1\mathrm{mis}}^\top\Sigma_{121}\Sigma_{111}^{-1}x_{\mathrm{obs}} = \beta_{1\mathrm{obs},<s}^\top x_{\mathrm{obs},<s} + \beta_{1j}x_j + \gamma\beta_{1,s-1}x_j = \beta_{1\mathrm{obs},<s}^\top x_{\mathrm{obs},<s}$. Because $\beta_{0\mathrm{obs},<s} = \beta_{1\mathrm{obs},<s}$, we conclude that the last term of both likelihoods is the same.

D.4 Proof of Lemma 6.8

We first prove the upper bound on $\|r_n\|_\infty$. By Hölder's inequality,
$$\|r_n\|_\infty \le \sqrt n\,\|\Theta\widehat\Sigma - I\|_\infty\|\widehat\beta_n-\beta_0\|_1 \le \sqrt n\,\nu_n\|\widehat\beta_n-\beta_0\|_1,$$
where the last inequality is due to Eq. (7). We next focus on $\|\bar r_n\|_\infty$. Applying Hölder's inequality and the triangle inequality,
$$\|\bar r_n\|_\infty \le \sqrt n\,\|\Theta-\Sigma_0^{-1}\|_{L_\infty}\Bigl(\|\Delta_n\beta_0\|_\infty + \Bigl\|\frac1n X^\top\varepsilon\Bigr\|_\infty\Bigr) \le 2\sqrt n\,b_0b_1\nu_n\Bigl(\|\Delta_n\beta_0\|_\infty + \Bigl\|\frac1n X^\top\varepsilon\Bigr\|_\infty\Bigr).$$
Here in the second inequality we invoke the conclusion of Lemma 3.1. It then suffices to upper bound $\|\Delta_n\beta_0\|_\infty$ and $\|\frac1n X^\top\varepsilon\|_\infty$. With Definition 6.1, it holds with probability $1-o(1)$ that
$$\|\Delta_n\beta_0\|_\infty \le \Bigl\|\Bigl(\frac1n X^\top X - \Sigma_0\Bigr)\beta_0\Bigr\|_\infty + \bigl\|(\widehat\Sigma-\Sigma_0)\beta_0\bigr\|_\infty \le \Bigl[\varphi_{u,v}\Bigl(\frac1n X^\top X,\Sigma_0;\log p\Bigr) + \varphi_{u,v}\bigl(\widehat\Sigma,\Sigma_0;\log p\bigr)\Bigr]\|\beta_0\|_2$$
and
$$\Bigl\|\frac1n X^\top\varepsilon\Bigr\|_\infty \le \sigma_\varepsilon\,\varphi_{\varepsilon,\infty}\Bigl(\frac1n X\Bigr).$$
By Lemmas 6.1 and 6.2, if $\frac{\log p}{\rho_*^4 n}\to 0$ then
$$\|\Delta_n\beta_0\|_\infty \le O_{\mathbb{P}}\Biggl(\sigma_x^2\|\beta_0\|_2\sqrt{\frac{\log p}{\rho_*^2 n}}\Biggr) \qquad\text{and}\qquad \Bigl\|\frac1n X^\top\varepsilon\Bigr\|_\infty \le O_{\mathbb{P}}\Biggl(\sigma_x\sigma_\varepsilon\sqrt{\frac{\log p}{\rho_* n}}\Biggr).$$

E Tail inequalities

Lemma E.1 (Sub-Gaussian concentration inequality). Suppose $X$ is a univariate sub-Gaussian random variable with parameter $\sigma > 0$; that is, $\mathbb{E}X = 0$ and $\mathbb{E}e^{tX} \le e^{\sigma^2t^2/2}$ for all $t\in\mathbb{R}$. Then
$$\Pr\bigl[|X| \ge \varepsilon\bigr] \le 2e^{-\frac{\varepsilon^2}{2\sigma^2}}, \quad \forall\varepsilon > 0; \qquad \mathbb{E}|X|^r \le r\cdot2^{r/2}\cdot\sigma^r\cdot\Gamma\Bigl(\frac r2\Bigr), \quad \forall r = 1, 2, \cdots$$

Lemma E.2 (Sub-exponential concentration inequality). Suppose $X_1,\cdots,X_n$ are i.i.d. univariate sub-exponential random variables with parameter $\lambda > 0$; that is, $\mathbb{E}X_i = 0$ and $\mathbb{E}e^{tX_i} \le e^{t^2\lambda^2/2}$ for all $|t| \le 1/\lambda$. Then
$$\Pr\Biggl[\Bigl|\frac1n\sum_{i=1}^nX_i\Bigr| > \varepsilon\Biggr] \le 2\exp\Biggl\{-\frac n2\min\Bigl(\frac{\varepsilon^2}{\lambda^2},\frac\varepsilon\lambda\Bigr)\Biggr\}.$$

Lemma E.3 (Hoeffding inequality). Suppose $X_1,\cdots,X_n$ are independent univariate random variables with $X_i\in[a_i,b_i]$ almost surely. Then for all $t > 0$, we have that
$$\Pr\Biggl[\Bigl|\frac1n\sum_{i=1}^n(X_i - \mathbb{E}X_i)\Bigr| > t\Biggr] \le 2\exp\Biggl\{-\frac{2n^2t^2}{\sum_{i=1}^n(b_i-a_i)^2}\Biggr\}.$$


Lemma E.4 (Bernstein inequality, support condition). Suppose $X_1,\cdots,X_n$ are independent random variables with zero mean and finite variance. If $|X_i| \le M < \infty$ almost surely for all $i = 1,\cdots,n$, then
$$\Pr\Biggl[\Bigl|\frac1n\sum_{i=1}^nX_i\Bigr| > t\Biggr] \le 2\exp\Biggl\{-\frac{\tfrac12 n^2t^2}{\sum_{i=1}^n\mathbb{E}X_i^2 + \tfrac13Mnt}\Biggr\}, \quad\forall t > 0.$$

Lemma E.5 (Bernstein inequality, moment condition). Suppose $X_1,\cdots,X_n$ are independent random variables with zero mean and $\mathbb{E}|X_i|^2 \le \sigma^2 < \infty$. Assume in addition that there exists some positive number $L > 0$ such that
$$\mathbb{E}|X_i|^k \le \frac12\sigma^2L^{k-2}k!, \quad\forall k > 1.$$
Then we have that
$$\Pr\Biggl[\Bigl|\frac1n\sum_{i=1}^nX_i\Bigr| > t\Biggr] \le 2\exp\Biggl\{-\frac{nt^2}{2(\sigma^2+Lt)}\Biggr\}, \quad\forall t > 0.$$

Lemma E.6 (Hsu et al. (2012)). Suppose $X = (X_1,\cdots,X_p)$ is a $p$-dimensional zero-mean sub-Gaussian random vector; that is, there exists $\sigma > 0$ such that
$$\mathbb{E}\exp\bigl\{\alpha^\top X\bigr\} \le \exp\bigl\{\|\alpha\|_2^2\sigma^2/2\bigr\}, \quad\forall\alpha\in\mathbb{R}^p.$$
Let $A$ be a $p\times p$ positive semi-definite matrix. Then for all $t > 0$,
$$\Pr\Bigl[X^\top AX > \sigma^2\bigl(\mathrm{tr}(A) + 2\sqrt{\mathrm{tr}(A^2)t} + 2\|A\|_{\mathrm{op}}t\bigr)\Bigr] \le e^{-t}.$$
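A numerical illustration of Lemma E.6 (not from the paper) for the special case $X\sim\mathcal{N}(0,I_p)$ and $A = I_p$, so that $X^\top AX$ follows a chi-square distribution with $p$ degrees of freedom:

import numpy as np

# Empirical tail of X^T A X versus the Hsu-Kakade-Zhang bound, with A = I and sigma = 1.
rng = np.random.default_rng(4)
p, t, n_mc = 10, 2.0, 1_000_000
X = rng.normal(size=(n_mc, p))
quad = (X ** 2).sum(axis=1)                          # X^T A X with A = I
threshold = p + 2 * np.sqrt(p * t) + 2 * t           # tr(A) + 2 sqrt(tr(A^2) t) + 2 ||A||_op t
print("empirical tail:", (quad > threshold).mean(), " bound e^{-t}:", np.exp(-t))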

References

Cai, T., Liu, W., & Luo, X. (2011). A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494), 594–607.

Hsu, D., Kakade, S. M., & Zhang, T. (2012). A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17(52), 1–6.

Loh, P.-L., & Wainwright, M. (2012a). High-dimensional regression with noisy and missing data: provable guarantees with nonconvexity. The Annals of Statistics, 40(3), 1637–1664.

van de Geer, S. (2010). Empirical Processes in M-Estimation. Cambridge University Press.
