
Submitted to the Annals of Statistics

BOOTSTRAPPING MAX STATISTICS IN HIGH DIMENSIONS: NEAR-PARAMETRIC RATES UNDER WEAK VARIANCE DECAY AND APPLICATION TO FUNCTIONAL AND MULTINOMIAL DATA

By Miles E. Lopes∗, Zhenhua Lin† and Hans-Georg Müller‡

University of California, Davis

In recent years, bootstrap methods have drawn attention for their ability to approximate the laws of "max statistics" in high-dimensional problems. A leading example of such a statistic is the coordinate-wise maximum of a sample average of $n$ random vectors in $\mathbb{R}^p$. Existing results for this statistic show that the bootstrap can work when $n \ll p$, and rates of approximation (in Kolmogorov distance) have been obtained with only logarithmic dependence in $p$. Nevertheless, one of the challenging aspects of this setting is that established rates tend to scale like $n^{-1/6}$ as a function of $n$.

The main purpose of this paper is to demonstrate that improvement in rate is possible when extra model structure is available. Specifically, we show that if the coordinate-wise variances of the observations exhibit decay, then a nearly $n^{-1/2}$ rate can be achieved, independent of $p$. Furthermore, a surprising aspect of this dimension-free rate is that it holds even when the decay is very weak. Lastly, we provide examples showing how these ideas can be applied to inference problems dealing with functional and multinomial data.

1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional problems. In this direction, there has been a surge of recent interest in connection with "max statistics" such as
\[ T = \max_{1 \le j \le p} S_{n,j}, \]
where $S_{n,j}$ is the $j$th coordinate of the sum $S_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (X_i - \mathbb{E}[X_i])$, involving i.i.d. vectors $X_1, \ldots, X_n$ in $\mathbb{R}^p$.

∗Supported in part by NSF grant DMS 1613218
†Supported in part by NIH grant 5UG3OD023313-03
‡Supported in part by NSF grant DMS 1712864 and NIH grant 5UG3OD023313-03
MSC 2010 subject classifications: Primary 62G09, 62G15; secondary 62G05, 62G20.
Keywords and phrases: bootstrap, high-dimensional statistics, rate of convergence, functional data analysis, multinomial data, confidence region, hypothesis test

arXiv:1807.04429v2 [math.ST] 20 Jul 2019


This type of statistic has been a focal point in the literature for at least two reasons. First, it is an example of a statistic for which bootstrap methods can succeed in high dimensions under mild assumptions, which was established in several pathbreaking works (Arlot, Blanchard and Roquain, 2010a,b; Chernozhukov, Chetverikov and Kato, 2013, 2017). Second, the statistic $T$ is closely linked to several fundamental topics, such as suprema of empirical processes, nonparametric confidence regions, and multiple testing problems. Likewise, many applications of bootstrap methods for max statistics have ensued at a brisk pace in recent years (see, e.g., Chernozhukov, Chetverikov and Kato, 2014a; Wasserman, Kolar and Rinaldo, 2014; Chen, Genovese and Wasserman, 2015; Chang, Yao and Zhou, 2017; Zhang and Cheng, 2017; Dezeure, Bühlmann and Zhang, 2017; Chen, 2018; Fan, Shao and Zhou, 2018; Belloni et al., 2018).

One of the favorable aspects of bootstrap approximation results for the distribution $\mathcal{L}(T)$ is that rates have been established with only logarithmic dependence in $p$. For instance, the results in Chernozhukov, Chetverikov and Kato (2017) imply that under certain conditions, the Kolmogorov distance $d_K$ between $\mathcal{L}(T)$ and its bootstrap counterpart $\mathcal{L}(T^*|X)$ satisfies the bound

(1.1) \[ d_K\big(\mathcal{L}(T), \mathcal{L}(T^*|X)\big) \;\le\; \frac{c \log(p)^b}{n^{1/6}} \]

with high probability, where $c, b > 0$ are constants not depending on $n$ or $p$, and $X$ denotes the matrix whose rows are $X_1, \ldots, X_n$. (In the following, symbols such as $c$ will often be re-used to designate a positive constant not depending on $n$ or $p$, possibly with a different value at each occurrence.) Additional refinements of this result can be found in the same work, with regard to the choice of metric, or choice of bootstrap method. Also, recent progress in sharpening the exponent $b$ has been made by Deng and Zhang (2017). However, this mild dependence on $p$ is offset by the $n^{-1/6}$ dependence on $n$, which differs from the $n^{-1/2}$ rate in the multivariate Berry-Esseen theorem when $p \ll n$.

Currently, the general problem of determining the best possible rates for Gaussian and bootstrap approximations is largely open in the high-dimensional setting. In particular, if we let $\tilde{T}$ denote the counterpart of $T$ that arises from replacing $X_1, \ldots, X_n$ with independent Gaussian vectors $\tilde{X}_1, \ldots, \tilde{X}_n$ satisfying $\mathrm{cov}(\tilde{X}_i) = \mathrm{cov}(X_i)$, then a conjecture of Chernozhukov, Chetverikov and Kato (2017) indicates that a bound of the form $d_K(\mathcal{L}(T), \mathcal{L}(\tilde{T})) \le c\, n^{-1/6} \log(p)^b$ is optimal under certain conditions. A related conjecture in the setting of high-dimensional U-statistics may also be found in Chen (2018). (Further discussion of related work on Gaussian approximation is given in Appendix H.) Nevertheless, the finite-sample performance of bootstrap methods for max statistics is often more encouraging than what might be expected from the $n^{-1/6}$ dependence on $n$ (see, e.g. Zhang and Cheng, 2017; Fan, Shao and Zhou, 2018; Belloni et al., 2018). This suggests that improved rates are possible in at least some situations.

The purpose of this paper is to quantify an instance of such improvement when additional model structure is available. Specifically, we consider the case when the coordinates of $X_1, \ldots, X_n$ have decaying variances. If we let $\sigma_j^2 = \mathrm{var}(X_{1,j})$ for each $1 \le j \le p$, and write $\sigma_{(1)} \ge \cdots \ge \sigma_{(p)}$ for the sorted values, then this condition may be formalized as

(1.2) \[ \sigma_{(j)} \le c\, j^{-\alpha} \quad \text{for all } j \in \{1, \ldots, p\}, \]

where $\alpha > 0$ is a parameter not depending on $n$ or $p$. (A complete set of assumptions, including a weaker version of (1.2), is given in Section 2.) This type of condition arises in many contexts, and in Section 2 we discuss examples related to principal component analysis, count data, and Fourier coefficients of functional data. Furthermore, this condition can be assessed in practice, due to the fact that the parameters $\sigma_1, \ldots, \sigma_p$ can be accurately estimated, even in high dimensions (cf. Lemma D.7).

Within the setting of decaying variances, our main results show that a nearly parametric rate can be achieved for both Gaussian and bootstrap approximation of $\mathcal{L}(T)$. More precisely, this means that for any fixed $\delta \in (0, 1/2)$, the bound $d_K(\mathcal{L}(T), \mathcal{L}(\tilde{T})) \le c\, n^{-1/2+\delta}$ holds, and similarly, the event

(1.3) \[ d_K\big(\mathcal{L}(T), \mathcal{L}(T^*|X)\big) \le c\, n^{-1/2+\delta} \]

holds with high probability. Here, it is worth emphasizing a few basic aspects of these bounds. First, they are non-asymptotic and do not depend on $p$. Second, the parameter $\alpha$ is allowed to be arbitrarily small, and in this sense, the decay condition (1.2) is very weak. Third, the result for $T^*$ holds when it is constructed using the standard multiplier bootstrap procedure (Chernozhukov, Chetverikov and Kato, 2013).

With regard to the existing literature, it is important to clarify that our near-parametric rate does not conflict with the conjectured optimality of the rate $n^{-1/6}$ for Gaussian approximation. The reason is that the $n^{-1/6}$ rate has been established in settings where the values $\sigma_1, \ldots, \sigma_p$ are restricted from becoming too small. A basic version of such a requirement is that

(1.4) \[ \min_{1 \le j \le p} \sigma_j \ge c. \]


Hence, the conditions (1.2) and (1.4) are complementary. Also, it is interesting to observe that the two conditions "intersect" in the limit $\alpha \to 0^+$, suggesting there is a phase transition in rates at the "boundary" corresponding to $\alpha = 0$.

Another important consideration that is related to the conditions (1.2) and (1.4) is the use of standardized variables. Namely, it is of special interest to approximate the distribution of the statistic
\[ T' = \max_{1 \le j \le p} S_{n,j}/\sigma_j, \]
which is equivalent to approximating $\mathcal{L}(T)$ when each $X_{i,j}$ is standardized to have variance 1. Given that standardization eliminates variance decay, it might seem that the rate $n^{-1/2+\delta}$ has no bearing on approximating $\mathcal{L}(T')$. However, it is still possible to take advantage of variance decay, by using a basic notion that we refer to as "partial standardization".

The idea of partial standardization is to slightly modify $T'$ by using a fractional power of each $\sigma_j$. Specifically, if we let $\tau_n \in [0, 1]$ be a free parameter, then we can consider the partially standardized statistic

(1.5) \[ M = \max_{1 \le j \le p} S_{n,j}/\sigma_j^{\tau_n}, \]

which interpolates between $T$ and $T'$ as $\tau_n$ ranges over $[0, 1]$. This statistic has the following significant property: If $X_1, \ldots, X_n$ satisfy the variance decay condition (1.2), and if $\tau_n$ is chosen to be slightly less than 1, then our main results show that the rate $n^{-1/2+\delta}$ holds for bootstrap approximations of $\mathcal{L}(M)$. In fact, this effect occurs even when $\tau_n \to 1$ as $n \to \infty$. Further details can be found in Section 3. Also note that our main results are formulated entirely in terms of $M$, which covers the statistic $T$ as a special case.

In practice, simultaneous confidence intervals derived from approximations to $\mathcal{L}(M)$ are just as easy to use as those based on $\mathcal{L}(T')$. Although there is a slight difference between the quantiles of $M$ and $T'$ when $\tau_n < 1$, the important point is that the quantiles of $\mathcal{L}(M)$ may be preferred, since faster rates of bootstrap approximation are available. (See also Figure 1 in Section 4.) In this way, the statistic $M$ offers a simple way to blend the utility of standardized variables with the beneficial effects of variance decay.
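To make the interpolation in (1.5) concrete, the following minimal sketch computes $M$ from a data matrix. It is an illustration under simplifying assumptions of our own: the mean $\mathbb{E}[X_i]$ and the standard deviations $\sigma_j$ are treated as known inputs, and the function name is hypothetical.

import numpy as np

def partially_standardized_max(X, mu, sigma, tau):
    # X: (n, p) data matrix; mu: known mean vector E[X_i];
    # sigma: known coordinate-wise standard deviations; tau in [0, 1].
    n = X.shape[0]
    S_n = np.sqrt(n) * (X.mean(axis=0) - mu)  # equals n^{-1/2} sum_i (X_i - mu)
    return np.max(S_n / sigma**tau)           # M in (1.5)

Setting tau=0 recovers the ordinary max statistic $T$, while tau=1 recovers the fully standardized statistic $T'$.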

Outline. The remainder of the paper is organized as follows. In Section 2, we outline the problem setting, with a complete statement of the theoretical assumptions, as well as some motivating facts and examples. Our main results are given in Section 3, which consist of a Gaussian approximation result for $\mathcal{L}(M)$ (Theorem 3.1), and a corresponding bootstrap approximation result (Theorem 3.2). To provide a numerical illustration of our results, we discuss a problem in functional data analysis in Section 4, where the variance decay condition naturally arises. Specifically, we show how bootstrap approximations to $\mathcal{L}(M)$ can be used to derive simultaneous confidence intervals for the Fourier coefficients of a mean function. A second application to high-dimensional multinomial models is described in Section 5, which offers both a theoretical bootstrap approximation result, as well as some numerical results. Lastly, our conclusions are summarized in Section 6. All proofs are given in the appendices, found in the supplementary material.

Notation. The standard basis vectors in $\mathbb{R}^p$ are denoted $e_1, \ldots, e_p$, and the identity matrix of size $p \times p$ is denoted $I_p$. For any symmetric matrix $A \in \mathbb{R}^{p \times p}$, the ordered eigenvalues are denoted $\lambda(A) = (\lambda_1(A), \ldots, \lambda_p(A))$, where $\lambda_{\max}(A) = \lambda_1(A) \ge \cdots \ge \lambda_p(A) = \lambda_{\min}(A)$. The operator norm of a matrix $A$, denoted $\|A\|_{\mathrm{op}}$, is the same as its largest singular value. If $v \in \mathbb{R}^p$ is a fixed vector, and $r > 0$, we write $\|v\|_r = (\sum_{j=1}^p |v_j|^r)^{1/r}$. In addition, the weak-$\ell_r$ (quasi) norm is given by $\|v\|_{w\ell_r} = \max_{1 \le j \le p} j^{1/r} |v|_{(j)}$, where $|v|_{(1)} \ge \cdots \ge |v|_{(p)}$ are the sorted absolute entries of $v$. Likewise, the notation $v_{(1)} \ge \cdots \ge v_{(p)}$ refers to the sorted entries. In a slight abuse of notation, we write $\|\xi\|_r = \mathbb{E}[|\xi|^r]^{1/r}$ to refer to the $L^r$ norm of a scalar random variable $\xi$, with $r \ge 1$. The $\psi_1$-Orlicz norm is $\|\xi\|_{\psi_1} = \inf\{t > 0 \,|\, \mathbb{E}[\exp(|\xi|/t)] \le 2\}$. If $\{a_n\}$ and $\{b_n\}$ are sequences of non-negative real numbers, then the relation $a_n \lesssim b_n$ means that there is a constant $c > 0$ not depending on $n$, and an integer $n_0 \ge 1$, such that $a_n \le c b_n$ for all $n \ge n_0$. Also, we write $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $b_n \lesssim a_n$. Lastly, define the abbreviations $a_n \vee b_n = \max\{a_n, b_n\}$ and $a_n \wedge b_n = \min\{a_n, b_n\}$.

2. Setting and preliminaries. We consider a sequence of models indexed by $n$, with all parameters depending on $n$, except for those that are stated to be fixed. In particular, the dimension $p = p(n)$ is regarded as a function of $n$, and hence, if a constant does not depend on $n$, then it does not depend on $p$ either.

Assumption 2.1 (Data-generating model).

(i). There is a vector $\mu = \mu(n) \in \mathbb{R}^p$ and a positive semi-definite matrix $\Sigma = \Sigma(n) \in \mathbb{R}^{p \times p}$, such that the observations $X_1, \ldots, X_n \in \mathbb{R}^p$ are generated as $X_i = \mu + \Sigma^{1/2} Z_i$ for each $1 \le i \le n$, where the random vectors $Z_1, \ldots, Z_n \in \mathbb{R}^p$ are i.i.d.

(ii). The random vector $Z_1$ satisfies $\mathbb{E}[Z_1] = 0$ and $\mathbb{E}[Z_1 Z_1^\top] = I_p$, as well as $\sup_{\|u\|_2 = 1} \|Z_1^\top u\|_{\psi_1} \le c_0$, for some constant $c_0 > 0$ that does not depend on $n$.

Remarks. Note that no constraints are placed on the ratio $p/n$. Also, the sub-exponential tail condition in part (ii) is similar to other tail conditions that have been used in previous works on bootstrap methods for max statistics (Chernozhukov, Chetverikov and Kato, 2013; Deng and Zhang, 2017).

To state our next assumption, it is necessary to develop some notation. For any $d \in \{1, \ldots, p\}$, let $\mathcal{J}(d)$ denote a set of indices corresponding to the $d$ largest values among $\sigma_1, \ldots, \sigma_p$, i.e., $\{\sigma_{(1)}, \ldots, \sigma_{(d)}\} = \{\sigma_j \,|\, j \in \mathcal{J}(d)\}$. In addition, let $R(d) \in \mathbb{R}^{d \times d}$ denote the correlation matrix of the random variables $\{X_{1,j} \,|\, j \in \mathcal{J}(d)\}$. Lastly, let $a \in (0, \tfrac{1}{2})$ be a constant fixed with respect to $n$, and define the integers $\ell_n$ and $k_n$ according to

(2.1) \[ \ell_n = \big\lceil (1 \vee \log(n)^3) \wedge p \big\rceil \]
(2.2) \[ k_n = \big\lceil (\ell_n \vee n^{1/\log(n)^a}) \wedge p \big\rceil. \]

Note that both $\ell_n$ and $k_n$ grow slower than any fractional power of $n$, and always satisfy $1 \le \ell_n \le k_n \le p$.
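For concreteness, a direct transcription of (2.1) and (2.2) in code might look as follows; this is a sketch where the constant $a$ must be supplied by the user, and the function name is ours.

import numpy as np

def ell_and_k(n, p, a=0.25):
    # (2.1): ell_n = ceil((1 v log(n)^3) ^ p) and (2.2): k_n = ceil((ell_n v n^{1/log(n)^a}) ^ p),
    # where v and ^ denote maximum and minimum, and a is a fixed constant in (0, 1/2).
    log_n = np.log(n)
    ell = int(np.ceil(min(max(1.0, log_n**3), p)))
    k = int(np.ceil(min(max(ell, n**(1.0 / log_n**a)), p)))
    return ell, k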

Assumption 2.2 (Structural assumptions).

(i). The parameters $\sigma_1, \ldots, \sigma_p$ are positive, and there are positive constants $\alpha$, $c$, and $c_\circ \in (0, 1)$, not depending on $n$, such that

(2.3) \[ \sigma_{(j)} \le c\, j^{-\alpha} \quad \text{for all } j \in \{k_n, \ldots, p\}, \]
(2.4) \[ \sigma_{(j)} \ge c_\circ\, j^{-\alpha} \quad \text{for all } j \in \{1, \ldots, k_n\}. \]

(ii). There is a constant $\varepsilon_0 \in (0, 1)$, not depending on $n$, such that

(2.5) \[ \max_{i \ne j} R_{i,j}(\ell_n) \le 1 - \varepsilon_0. \]

Also, the matrix $R^+(\ell_n)$ with $(i,j)$ entry given by $\max\{R_{i,j}(\ell_n), 0\}$ is positive semi-definite, and there is a constant $C > 0$ not depending on $n$ such that

(2.6) \[ \sum_{1 \le i < j \le \ell_n} R^+_{i,j}(\ell_n) \le C \ell_n. \]


Remarks. Since $\ell_n, k_n \ll n$, it is possible to accurately estimate the parameters $\sigma_{(1)}, \ldots, \sigma_{(k_n)}$, as well as the matrix $R(\ell_n)$, even when $p$ is large (cf. Lemmas D.6 and D.7). In this sense, it is possible to empirically assess the conditions above. When considering the size of the decay parameter $\alpha$, note that if $\Sigma$ is viewed as a covariance operator acting on a Hilbert space, then the condition $\alpha > 1/2$ essentially corresponds to the case of a trace-class operator, a property that is typically assumed in functional data analysis (Hsing and Eubank, 2015). From this perspective, the condition $\alpha > 0$ is very weak, and allows the trace of $\Sigma$ to diverge as $p \to \infty$.

With regard to the conditions on the correlation matrix $R(\ell_n)$, it is important to keep in mind that they only apply to a small set of variables of size $O(\log(n)^3)$, and the dependence among the variables outside of $\mathcal{J}(\ell_n)$ is completely unrestricted. The interpretation of (2.6) is that it prevents excessive dependence among the coordinates with the largest variances. Meanwhile, the condition that $R^+(\ell_n)$ is positive semi-definite is more technical in nature, and is only used in order to apply a specialized version of Slepian's lemma (Lemma G.3). Nevertheless, this condition always holds in the important case where $R(\ell_n)$ is non-negative. Perturbation arguments may also be used to obtain other examples where some entries of $R(\ell_n)$ are negative.

2.1. Examples of correlation matrices. Some correlation matrices satisfying Assumption 2.2(ii) are given below.

• Autoregressive: $R_{i,j} = \rho_0^{|i-j|}$, for any $\rho_0 \in (0, 1)$.

• Algebraic decay: $R_{i,j} = 1\{i = j\} + \frac{1\{i \ne j\}}{4 |i-j|^{\gamma}}$, for any $\gamma \ge 2$.

• Banded: $R_{i,j} = \big(1 - \frac{|i-j|}{c_0}\big)_+$, for any $c_0 > 0$.

• Multinomial: $R_{i,j} = 1\{i = j\} - \sqrt{\frac{\pi_i \pi_j}{(1 - \pi_i)(1 - \pi_j)}}\, 1\{i \ne j\}$, where $(\pi_1, \ldots, \pi_p)$ is a probability vector.

By combining these types of correlation matrices with choices of $(\sigma_1, \ldots, \sigma_p)$ that satisfy (2.3) and (2.4), it is straightforward to construct examples of $\Sigma$ that satisfy all aspects of Assumption 2.2.

2.2. Examples of variance decay. To provide additional context for the decay condition (2.3), we describe some general situations where it occurs.


• Principal component analysis (PCA). The broad applicability of PCA rests on the fact that many types of data have an underlying covariance matrix with weakly sparse eigenvalues. Roughly speaking, this means that most of the eigenvalues of $\Sigma$ are small in comparison to the top few. Similar to the condition (2.3), this situation can be modeled with the decay condition

(2.7) \[ \lambda_j(\Sigma) \le c\, j^{-\gamma}, \]

for some parameter $\gamma > 0$ (e.g. Bunea and Xiao, 2015). Whenever this holds, it can be shown that the variance decay condition must hold for some associated parameter $\alpha > 0$, and this is done in Proposition 2.1 below. So, in a qualitative sense, this indicates that if a dataset is amenable to PCA, then it is also likely to fall within the scope of our setting.

Another way to see the relationship between PCA and variance decay is through the measure of "effective rank", defined as

(2.8) \[ \mathbf{r}(\Sigma) = \frac{\mathrm{tr}(\Sigma)}{\|\Sigma\|_{\mathrm{op}}}. \]

This quantity has played a key role in a substantial amount of recent work on PCA, because it offers a useful way to describe covariance matrices with an "intermediate" degree of complexity, which may be neither very low-dimensional, nor very high-dimensional. We refer to Vershynin (2012), Lounici (2014), Bunea and Xiao (2015), Reiß and Wahl (2019+), Koltchinskii and Lounici (2017a,b), Koltchinskii, Löffler and Nickl (2019+), Naumov, Spokoiny and Ulyanov (2018), and Jung, Lee and Ahn (2018), among others. Many of these works have focused on regimes where

(2.9) \[ \mathbf{r}(\Sigma) = o(n), \]

which conforms naturally with variance decay. Indeed, within a basic setup where $n \asymp p$ and $\|\Sigma\|_{\mathrm{op}} \asymp 1$, the condition (2.9) holds under $\sigma_{(j)} \le c\, j^{-\alpha}$ for any $\alpha > 0$.

• Count data. Consider a multinomial model based on $p$ cells and $n$ trials, parameterized by a vector of cell proportions $\pi = (\pi_1, \ldots, \pi_p)$. If the $i$th trial is represented as a vector $X_i \in \mathbb{R}^p$ in the set of standard basis vectors $\{e_1, \ldots, e_p\}$, then the marginal distributions of $X_i$ are binomial with $\sigma_j^2 = \pi_j(1 - \pi_j)$. In particular, it follows that all multinomial models satisfy the variance decay condition (2.3), because if we let $\sigma = (\sigma_1, \ldots, \sigma_p)$, then $\|\sigma\|_2^2 = \sum_j \pi_j(1 - \pi_j) \le \sum_j \pi_j = 1$, and so the weak-$\ell_2$ norm of $\sigma$ must satisfy $\|\sigma\|_{w\ell_2} \le \|\sigma\|_2 \le 1$, which implies

(2.10) \[ \sigma_{(j)} \le j^{-1/2} \]

for all $j \in \{1, \ldots, p\}$. In order to study the consequences of this further, we offer some detailed examples in Section 5. More generally, the variance decay condition also arises for other forms of count data. For instance, in the case of a high-dimensional distribution with sparse Poisson marginals, the relation $\mathrm{var}(X_{i,j}) = \mathbb{E}[X_{i,j}]$ shows that sparsity in the mean vector can lead to variance decay.

• Fourier coefficients of functional data. Let $Y_1, \ldots, Y_n$ be an i.i.d. sample of functional data, taking values in a separable Hilbert space $\mathcal{H}$. In addition, suppose that the covariance operator $\mathcal{C} = \mathrm{cov}(Y_1)$ is trace-class, which implies an eigenvalue decay condition of the form (2.7). Lastly, for each $i \in \{1, \ldots, n\}$, let $X_i \in \mathbb{R}^p$ denote the first $p$ generalized Fourier coefficients of $Y_i$ with respect to some fixed orthonormal basis $\{\psi_j\}$ for $\mathcal{H}$. That is, $X_i = (\langle Y_i, \psi_1 \rangle, \ldots, \langle Y_i, \psi_p \rangle)$.

Under the above conditions, it can be shown that no matter which basis $\{\psi_j\}$ is chosen, the vectors $X_1, \ldots, X_n$ always satisfy the variance decay condition. (This follows from Proposition 2.1 below.) In Section 4, we explore some consequences of this condition as it relates to simultaneous confidence intervals for the Fourier coefficients of the mean function $\mathbb{E}[Y_1]$.
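As an aside, the coefficient vectors $X_i$ in this construction are straightforward to compute for discretized curves. The sketch below uses the standard Fourier basis on $[0,1]$ and a Riemann-sum approximation of the inner products; the grid, basis ordering, and function names are our own choices.

import numpy as np

def fourier_basis(t, p):
    # First p functions of the standard Fourier basis on [0,1]: the constant
    # function, then pairs sqrt(2) sin(2 pi k t) and sqrt(2) cos(2 pi k t).
    Psi = np.empty((p, t.size))
    Psi[0] = 1.0
    for j in range(1, p):
        k = (j + 1) // 2
        Psi[j] = np.sqrt(2.0) * (np.sin if j % 2 else np.cos)(2 * np.pi * k * t)
    return Psi

def fourier_coefficients(Y, t, p):
    # Y: (n, m) curves evaluated on a uniform grid t of size m.
    # Returns the (n, p) matrix with entries <Y_i, psi_j>, via a Riemann sum.
    Psi = fourier_basis(t, p)
    return (Y @ Psi.T) * (t[1] - t[0])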

To conclude this section, we state a proposition that was used in the examples above. This basic result shows that decay among the eigenvalues $\lambda_1(\Sigma), \ldots, \lambda_p(\Sigma)$ requires at least some decay among $\sigma_1, \ldots, \sigma_p$.

Proposition 2.1. Fix two numbers $s \ge 1$ and $r \in (0, s)$. Then, there is a constant $c_{r,s} > 0$ depending only on $r$ and $s$, such that for any symmetric matrix $A \in \mathbb{R}^{p \times p}$, we have
\[ \|\mathrm{diag}(A)\|_{w\ell_s} \le c_{r,s} \|\lambda(A)\|_{w\ell_r}. \]
In particular, if $A = \Sigma$, and if there is a constant $c_0 > 0$ such that the inequality
\[ \lambda_j(\Sigma) \le c_0\, j^{-1/r} \]
holds for all $1 \le j \le p$, then the inequality
\[ \sigma_{(j)}^2 \le c_0 c_{r,s}\, j^{-1/s} \]
holds for all $1 \le j \le p$.

The proof is given in Appendix A, and follows essentially from the Schur-Horn majorization theorem, as well as inequalities relating $\|\cdot\|_r$ and $\|\cdot\|_{w\ell_r}$.

3. Main results. In this section, we present our main results on Gaussian approximation and bootstrap approximation.

3.1. Gaussian approximation. Let $\tilde{S}_n \sim N(0, \Sigma)$ and define the Gaussian counterpart of the partially standardized statistic $M$ in (1.5) according to

(3.1) \[ \tilde{M} = \max_{1 \le j \le p} \tilde{S}_{n,j}/\sigma_j^{\tau_n}. \]

Our first theorem shows that in the presence of variance decay, the distribution $\mathcal{L}(\tilde{M})$ can approximate $\mathcal{L}(M)$ at a nearly parametric rate in Kolmogorov distance. Recall that for any random variables $U$ and $V$, this distance is given by $d_K(\mathcal{L}(U), \mathcal{L}(V)) = \sup_{t \in \mathbb{R}} |\mathbb{P}(U \le t) - \mathbb{P}(V \le t)|$.
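In simulations, $d_K$ can be estimated by comparing empirical distribution functions; since ECDFs are step functions, the supremum is attained at the pooled sample points. A small sketch of this computation (our own helper, not part of the paper's formal development):

import numpy as np

def kolmogorov_distance(u, v):
    # Empirical Kolmogorov distance between two samples u and v: evaluate both
    # right-continuous ECDFs at every pooled sample point and take the max gap.
    grid = np.sort(np.concatenate([u, v]))
    F_u = np.searchsorted(np.sort(u), grid, side='right') / u.size
    F_v = np.searchsorted(np.sort(v), grid, side='right') / v.size
    return np.max(np.abs(F_u - F_v))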

Theorem 3.1 (Gaussian approximation). Fix any number $\delta \in (0, 1/2)$, and suppose that Assumptions 2.1 and 2.2 hold. In addition, suppose that $\tau_n \in [0, 1)$ with $(1 - \tau_n)\sqrt{\log(n)} \gtrsim 1$. Then,

(3.2) \[ d_K\big(\mathcal{L}(M), \mathcal{L}(\tilde{M})\big) \lesssim n^{-\frac{1}{2} + \delta}. \]

Remarks. As a basic observation, note that the result handles the ordinary max statistic $T$ as a special case with $\tau_n = 0$. In addition, it is worth emphasizing that the rate does not depend on the dimension $p$, or the variance decay parameter $\alpha$, provided that it is positive. In this sense, the result shows that even a small amount of structure can have a substantial impact on Gaussian approximation (in relation to existing $n^{-1/6}$ rates that hold when $\alpha = 0$). Lastly, the reason for imposing the lower bound on $1 - \tau_n$ is that if $\tau_n$ quickly approaches 1 as $n \to \infty$, then the variances $\mathrm{var}(\tilde{S}_{n,j}/\sigma_j^{\tau_n})$ will also quickly approach 1, thus eliminating the beneficial effect of variance decay.

3.2. Multiplier bootstrap approximation. In order to define the multiplier bootstrap counterpart of $M$, first define the sample covariance matrix

(3.3) \[ \hat{\Sigma}_n = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^\top, \]

where $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$. Next, let $S_n^\star \sim N(0, \hat{\Sigma}_n)$, and define the associated max statistic as

(3.4) \[ M^\star = \max_{1 \le j \le p} S^\star_{n,j}/\hat{\sigma}_j^{\tau_n}, \]

where $(\hat{\sigma}_1^2, \ldots, \hat{\sigma}_p^2) = \mathrm{diag}(\hat{\Sigma}_n)$. In the exceptional case when $\hat{\sigma}_j = 0$ for some $j$, the expression $S^\star_{n,j}/\hat{\sigma}_j$ is understood to be 0. This convention is natural, because the event $S^\star_{n,j} = 0$ holds with probability 1, conditionally on $\hat{\sigma}_j = 0$.

Remarks. The above description of $M^\star$ differs from some previous works insofar as we have suppressed the role of "multiplier variables", and have defined $S_n^\star$ as a sample from $N(0, \hat{\Sigma}_n)$. From a mathematical standpoint, this is equivalent to the multiplier formulation (Chernozhukov, Chetverikov and Kato, 2013), where $S_n^\star = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \xi_i^\star (X_i - \bar{X})$ and $\xi_1^\star, \ldots, \xi_n^\star$ are i.i.d. $N(0,1)$ random variables, generated independently of $X$.
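In code, draws of $M^\star$ can be generated directly from this multiplier formulation. The sketch below is a minimal version under our own naming, using the zero-variance convention stated above:

import numpy as np

def bootstrap_max_draws(X, tau, B, seed=None):
    # Returns B draws of M* = max_j S*_{n,j} / sigma_hat_j^tau, as in (3.4),
    # using S*_n = n^{-1/2} sum_i xi_i (X_i - Xbar) with xi_i i.i.d. N(0, 1).
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xc = X - X.mean(axis=0)                      # centered rows X_i - Xbar
    sigma_hat = np.sqrt(np.mean(Xc**2, axis=0))  # sqrt of diag(Sigma_hat_n)
    scale = np.where(sigma_hat > 0, sigma_hat**tau, np.inf)  # S*_j / 0 treated as 0
    xi = rng.standard_normal((B, n))             # multiplier variables
    S_star = xi @ Xc / np.sqrt(n)                # (B, p) draws of S*_n
    return np.max(S_star / scale, axis=1)

Empirical quantiles of the returned draws then estimate the quantiles of $\mathcal{L}(M^\star|X)$, which is how the bootstrap is applied in Section 4.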

Theorem 3.2 (Bootstrap approximation). Fix any number $\delta \in (0, 1/2)$, and suppose the conditions of Theorem 3.1 hold. Then, there is a constant $c > 0$ not depending on $n$, such that the event

(3.5) \[ d_K\big(\mathcal{L}(M), \mathcal{L}(M^\star|X)\big) \le c\, n^{-\frac{1}{2} + \delta} \]

occurs with probability at least $1 - \frac{c}{n}$.

Remarks. At a high level, the proofs of Theorems 3.1 and 3.2 are based on the following observation: When the variance decay condition holds, there is a relatively small subset of $\{1, \ldots, p\}$ that is likely to contain the maximizing index for $M$. In other words, if $\hat{j} \in \{1, \ldots, p\}$ denotes a random index satisfying $M = S_{n,\hat{j}}/\sigma_{\hat{j}}^{\tau_n}$, then the "effective range" of $\hat{j}$ is fairly small. Although this situation is quite intuitive when the decay parameter $\alpha$ is large, what is more surprising is that the effect persists even for small values of $\alpha$.

Once the maximizing index has been localized to a small set, it becomes possible to use tools that are specialized to the regime where $p \ll n$. For example, Bentkus' multivariate Berry-Esseen theorem (Bentkus, 2003) (cf. Lemma G.1) is helpful in this regard. Another technical aspect of the proofs worth mentioning is that they make essential use of the sharp constants in Rosenthal's inequality, as established in (Johnson, Schechtman and Zinn, 1985) (Lemma G.4).


4. Numerical illustration with functional data. Due to advances in technology and data collection, functional data have become ubiquitous in the past two decades, and statistical methods for their analysis have received growing interest. General references and surveys may be found in Ramsay and Silverman (2005); Ferraty and Vieu (2006); Horváth and Kokoszka (2012); Hsing and Eubank (2015); Wang, Chiou and Müller (2016).

The purpose of this section is to present an illustration of how the partially standardized statistic $M$ and the bootstrap can be employed to do inference on functional data. More specifically, we consider a one-sample test for a mean function, which proceeds by constructing simultaneous confidence intervals (SCI) for its Fourier coefficients. With regard to our theoretical results, this is a natural problem for illustration, because the Fourier coefficients of functional data typically satisfy the variance decay condition (1.2), as explained in the third example of Section 2.2. Additional background and recent results on mean testing for functional data may be found in Benko, Härdle and Kneip (2009); Degras (2011); Cao, Yang and Todem (2012); Horváth, Kokoszka and Reeder (2013); Zheng, Yang and Härdle (2014); Choi and Reimherr (2018); Zhang et al. (2018), as well as the references therein.

4.1. Tests for the mean function. To set the stage, let $\mathcal{H}$ be a separable Hilbert space of functions, and let $Y \in \mathcal{H}$ be a random function with mean $\mathbb{E}[Y] = \mu$. Given a sample $Y_1, \ldots, Y_n$ of i.i.d. realizations of $Y$, a basic goal is to test

(4.1) \[ H_0: \mu = \mu^\circ \quad \text{versus} \quad H_1: \mu \ne \mu^\circ, \]

where $\mu^\circ$ is a fixed function in $\mathcal{H}$.

This testing problem can be naturally formulated in terms of SCI, as follows. Let $\{\psi_j\}$ denote any fixed orthonormal basis for $\mathcal{H}$. Also, let $\{u_j\}$ and $\{u_j^\circ\}$ respectively denote the generalized Fourier coefficients of $\mu$ and $\mu^\circ$ with respect to $\{\psi_j\}$, so that
\[ \mu = \sum_{j=1}^{\infty} u_j \psi_j \quad \text{and} \quad \mu^\circ = \sum_{j=1}^{\infty} u_j^\circ \psi_j. \]
Then, the null hypothesis is equivalent to $u_j = u_j^\circ$ for all $j \ge 1$. To test this condition, one can construct a confidence interval $\hat{\mathcal{I}}_j$ for each $u_j$, and reject the null if $u_j^\circ \notin \hat{\mathcal{I}}_j$ for at least one $j \ge 1$. In practice, due to the infinite dimensionality of $\mathcal{H}$, one will choose a sufficiently large integer $p$, and reject the null if $u_j^\circ \notin \hat{\mathcal{I}}_j$ for at least one $j \in \{1, \ldots, p\}$.

Recently, a similar general strategy was pursued by Choi and Reimherr (2018), hereafter CR, who developed a test for the problem (4.1) based on a hyper-rectangular confidence region for $(u_1, \ldots, u_p)$, which is equivalent to constructing SCI. In the CR approach, the basis is taken to be the eigenfunctions $\{\psi_{\mathcal{C},j}\}$ of the covariance operator $\mathcal{C} = \mathrm{cov}(Y)$, and $p$ is chosen as the number of eigenfunctions $\psi_{\mathcal{C},1}, \ldots, \psi_{\mathcal{C},p}$ required to explain a certain fraction (say 99%) of the variance in the data. However, since $\mathcal{C}$ is unknown, the eigenfunctions must be estimated from the available data.

When $p$ is large, estimating the eigenfunctions $\psi_{\mathcal{C},1}, \ldots, \psi_{\mathcal{C},p}$ is a well-known challenge in functional data analysis. For instance, a large choice of $p$ may be needed to explain 99% of the variance if the sample paths of $Y_1, \ldots, Y_n$ are not sufficiently smooth. Another example occurs when $H_1$ holds but $\mu$ and $\mu^\circ$ are not well separated, which may require a large choice of $p$ in order to distinguish $(u_1, \ldots, u_p)$ and $(u_1^\circ, \ldots, u_p^\circ)$. In light of these considerations, we will pursue an alternative approach to constructing SCI that does not require estimation of eigenfunctions.

4.2. Applying the bootstrap. Let $\{\psi_j\}$ be any pre-specified orthonormal basis for $\mathcal{H}$. For instance, when $\mathcal{H} = L^2[0,1]$, a natural option is the standard Fourier basis. For a sample $Y_1, \ldots, Y_n \in \mathcal{H}$ as considered before, define random vectors $X_1, \ldots, X_n$ in $\mathbb{R}^p$ according to
\[ X_i = (\langle Y_i, \psi_1 \rangle, \ldots, \langle Y_i, \psi_p \rangle), \]
and note that $\mathbb{E}[X_1] = (u_1, \ldots, u_p)$. For simplicity, we retain the previous notations associated with $X_1, \ldots, X_n$, so that $S_{n,j} = n^{-1/2} \sum_{i=1}^{n} (X_{i,j} - u_j)$, and likewise for other quantities. In addition, for any $\tau_n \in [0, 1]$, let
\[ L = \min_{1 \le j \le p} S_{n,j}/\sigma_j^{\tau_n} \quad \text{and} \quad M = \max_{1 \le j \le p} S_{n,j}/\sigma_j^{\tau_n}. \]

For a given significance level $\varrho \in (0, 1)$, the $\varrho$-quantiles of $L$ and $M$ are denoted $q_L(\varrho)$ and $q_M(\varrho)$. Thus, the following event occurs with probability at least $1 - \varrho$,

(4.2) \[ \bigcap_{j=1}^{p} \left\{ \frac{q_L(\varrho/2)\, \sigma_j^{\tau_n}}{\sqrt{n}} \;\le\; \bar{X}_j - u_j \;\le\; \frac{q_M(1 - \varrho/2)\, \sigma_j^{\tau_n}}{\sqrt{n}} \right\}, \]

which leads to theoretical SCI for $(u_1, \ldots, u_p)$.

which leads to theoretical SCI for (u1, . . . , up).We now apply the bootstrap from Section 3.2 to estimate qL(%/2) and

qM (1− %/2). Specifically, we generate B ≥ 1 independent samples of M? asin (3.4), and then define qM (1− %/2) to be the empirical (1− %/2)-quantileof the B samples (and similarly for qL(%/2)), leading to the bootstrap SCI

(4.3) Ij =

[Xj −

qM (1−%/2)στnj√n

, Xj −qL(%/2)στnj√

n

]

Page 14: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

14 M. E. LOPES ET AL.

for each $j \in \{1, \ldots, p\}$.

It remains to select the value of $\tau_n$, for which we adopt the following simple rule. For each choice of $\tau_n$ in a set of possible candidates, say $\mathcal{T} = \{0, 0.1, \ldots, 0.9, 1\}$, we construct the associated intervals $\hat{\mathcal{I}}_1, \ldots, \hat{\mathcal{I}}_p$ as in (4.3), and then select the value $\hat{\tau}_n \in \mathcal{T}$ for which the average width $\frac{1}{p} \sum_{j=1}^{p} |\hat{\mathcal{I}}_j|$ is the smallest, where $|[a, b]| = b - a$.
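The sketch below combines the intervals (4.3) with this width-minimizing selection of $\tau_n$. It reuses the multiplier construction of Section 3.2, assumes all $\hat{\sigma}_j > 0$ (as is typical for functional data), and all names are ours.

import numpy as np

def bootstrap_sci(X, rho=0.05, B=1000, taus=np.arange(0.0, 1.05, 0.1), seed=None):
    # Returns (tau, lower, upper): the selected tau_n and the SCI (4.3) per
    # coordinate, with tau chosen to minimize the average interval width.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xbar = X.mean(axis=0)
    Xc = X - Xbar
    sigma_hat = np.sqrt(np.mean(Xc**2, axis=0))   # assumed positive here
    xi = rng.standard_normal((B, n))
    S_star = xi @ Xc / np.sqrt(n)                 # (B, p) multiplier draws of S*_n
    best = None
    for tau in taus:
        R = S_star / sigma_hat**tau               # partially standardized draws
        q_L = np.quantile(np.min(R, axis=1), rho / 2)       # estimates q_L(rho/2)
        q_M = np.quantile(np.max(R, axis=1), 1 - rho / 2)   # estimates q_M(1-rho/2)
        lower = Xbar - q_M * sigma_hat**tau / np.sqrt(n)
        upper = Xbar - q_L * sigma_hat**tau / np.sqrt(n)
        width = np.mean(upper - lower)            # average width over coordinates
        if best is None or width < best[0]:
            best = (width, tau, lower, upper)
    return best[1], best[2], best[3]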

In Figure 1, we illustrate the influence of $\tau_n$ on the shape of the SCI. There are two main points to notice: (1) The intervals change very gradually as a function of $\tau_n$, which shows that partial standardization is at most a mild adjustment of ordinary standardization. (2) The choice of $\tau_n$ involves a trade-off, which controls the "allocation of power" among the $p$ intervals. When $\tau_n$ is close to 1, the intervals are wider for the leading coefficients (small $j$), and narrower for the subsequent coefficients (large $j$). However, as $\tau_n$ decreases from 1, the widths of the intervals gradually become more uniform, and the intervals for the leading coefficients become narrower. Hence, if the vectors $(u_1, \ldots, u_p)$ and $(u_1^\circ, \ldots, u_p^\circ)$ differ in the leading coefficients, then choosing a smaller value of $\tau_n$ may lead to a gain in power. One last interesting point to mention is that in the simulations reported below, the selection rule of "minimizing the average width" typically selected values of $\tau_n$ around 0.8, and hence strictly less than 1.

Fig 1. Illustration of the impact of $\tau_n$ on the shape of simultaneous confidence intervals (SCI), shown for $\tau = 0.6$, $0.8$, and $1.0$. The curves represent upper and lower endpoints of the respective SCI, where the Fourier coefficients are indexed by $j$. Overall, the plot shows that the SCI change very gradually as a function of $\tau_n$, and that there is a trade-off in the widths of the intervals. Namely, as $\tau_n$ decreases, the intervals for the leading coefficients (small $j$) become tighter, while the intervals for the subsequent coefficients (large $j$) become wider.


Fig 2. Left: Mean functions for varying shape parameters $\omega \in \{0, 0.3, 0.8\}$ with $\rho = \theta = 0$. Middle: Mean functions for varying scale parameters $\rho \in \{0, 0.3, 0.8\}$ with $\omega = \theta = 0$. Right: Mean functions with different shift parameters $\theta \in \{0, 0.3, 0.5\}$ with $\omega = \rho = 0$.

4.3. Simulation settings. To study the numerical performance of the SCI described above, we generated i.i.d. samples from a Gaussian process on $[0,1]$, with population mean function
\[ \mu_{\omega,\rho,\theta}(t) = (1 + \rho)\Big( \exp[-\{g_\omega(t) + 2\}^2] + \exp[-\{g_\omega(t) - 2\}^2] \Big) + \theta \]
indexed by parameters $(\omega, \rho, \theta)$, where $g_\omega(t) := 8 h_\omega(t) - 4$, and $h_\omega(t)$ denotes the Beta distribution function with shape parameters $(2 + \omega, 2)$. This family of functions was considered in Chen and Müller (2012). To interpret the parameters, note that $\omega$ determines the shape of the mean function (see Figure 2), whereas $\rho$ and $\theta$ are scale and shift parameters. In terms of these parameters, the null hypothesis corresponds to $\mu = \mu^\circ := \mu_{0,0,0}$.

The population covariance function was taken to be the Matérn function
\[ \mathcal{C}(s, t) = \frac{(\sqrt{2\nu}\,|t - s|)^\nu}{16\,\Gamma(\nu)\, 2^{\nu - 1}}\, K_\nu\big(\sqrt{2\nu}\,|t - s|\big), \]
which was previously considered in CR, with $K_\nu$ being a modified Bessel function of the second kind. We set $\nu = 0.1$, which results in relatively rough sample paths, as illustrated in the left panel of Figure 3. Also, the significant presence of variance decay is shown in the right panel.
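For reference, the simulation model above can be reproduced on a grid along the following lines; this is a sketch with our own discretization choices (not the authors' code), and the diagonal of the Matérn kernel uses the small-argument limit $x^\nu K_\nu(x) \to \Gamma(\nu)\, 2^{\nu - 1}$, under which $\mathcal{C}(t, t) = 1/16$.

import numpy as np
from scipy.special import gamma, kv
from scipy.stats import beta

def mean_function(t, w=0.0, rho=0.0, theta=0.0):
    # mu_{omega, rho, theta} with g_w(t) = 8 h_w(t) - 4 and h_w the Beta(2+w, 2) cdf.
    g = 8.0 * beta.cdf(t, 2.0 + w, 2.0) - 4.0
    return (1.0 + rho) * (np.exp(-(g + 2.0)**2) + np.exp(-(g - 2.0)**2)) + theta

def matern_cov(t, nu=0.1):
    # C(s, t) = (sqrt(2 nu)|t-s|)^nu K_nu(sqrt(2 nu)|t-s|) / (16 Gamma(nu) 2^{nu-1}).
    x = np.sqrt(2.0 * nu) * np.abs(t[:, None] - t[None, :])
    np.fill_diagonal(x, 1.0)  # placeholder to avoid 0 * inf; overwritten below
    C = x**nu * kv(nu, x) / (16.0 * gamma(nu) * 2.0**(nu - 1.0))
    np.fill_diagonal(C, 1.0 / 16.0)  # the x -> 0+ limit of the kernel
    return C

def sample_curves(n, t, w=0.0, rho=0.0, theta=0.0, nu=0.1, seed=None):
    # Draw n Gaussian-process sample paths on the grid t.
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean_function(t, w, rho, theta),
                                   matern_cov(t, nu), size=n)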

When implementing the bootstrap in Section 4.2, we used the first $p = 100$ functions from the standard Fourier basis on $[0,1]$. (In principle, an even larger value of $p$ could have been selected, but we chose $p = 100$ to limit computation time.) For comparison purposes, we also implemented the 'Rzs' version of the method proposed in CR, using the accompanying R package fregion (Choi and Reimherr, 2016) under default settings, which typically utilized estimates of the first $p \approx 50$ eigenfunctions of $\mathcal{C}$.


Fig 3. Left: A sample of the functional data $Y_1, \ldots, Y_n$ in the simulation study. Right: The ordered values $\sigma_{(j)} = \sqrt{\mathrm{var}(X_{1,j})}$ are represented by dots, which are approximated by the decay profile $0.15\, j^{-0.69}$ (solid line).

Results on type I error. The nominal significance level was set to 5% in all simulations. To assess the actual type I error, we carried out 5,000 simulations under the null hypothesis, for both $n = 50$ and $n = 200$. When $n = 50$, the type I error was 6.7% for the bootstrap method, and 1.6% for CR. When $n = 200$, the results were 5.7% for the bootstrap method, and 2.6% for CR. So, in these cases, the bootstrap respects the nominal significance level relatively well. In addition, our numerical results support the idea that partial standardization can be beneficial, because in the fully standardized case where $\tau_n = 1$, we observed less accurate type I error rates of 7.0% for $n = 50$, and 6.4% for $n = 200$.

Results on power. To consider power, we varied each of the parameters $\omega$, $\rho$ and $\theta$, one at a time, while keeping the other two at their baseline value of zero. In each parameter setting, we carried out 1,000 simulations with sample size $n = 50$. The results are summarized in Figure 4, showing that the bootstrap achieves relative gains in power, especially with respect to the shape ($\omega$) and scale ($\rho$) parameters. In particular, it seems that using a large number of basis functions can help to catch small differences in these parameters (see also Figure 2).


Fig 4. Empirical power for the partially standardized bootstrap method (solid) and the CR method (dotted). Left: Empirical power for varying shape parameters $\omega$ while $\rho = \theta = 0$. Middle: Empirical power for varying scale parameters $\rho$ while $\omega = \theta = 0$. Right: Empirical power for varying shift parameters $\theta$ while $\omega = \rho = 0$.

5. Examples with multinomial data. When multinomial models are used in practice, it is not uncommon for the number of cells $p$ to be quite large. Indeed, the challenges of this situation have been a topic of sustained interest, and many inferential questions remain unresolved (e.g. Hoeffding, 1965; Holst, 1972; Fienberg and Holland, 1973; Cressie and Read, 1984; Zelterman, 1987; Paninski, 2008; Chafaï and Concordet, 2009; Balakrishnan and Wasserman, 2019). A recent survey is (Balakrishnan and Wasserman, 2018). As one illustration of how our approach can be applied to such models, this section will look at the task of constructing SCI for the cell proportions. Although this type of problem has been studied from a variety of perspectives over the years (e.g. Quesenberry and Hurst, 1964; Goodman, 1965; Fitzpatrick and Scott, 1987; Sison and Glaz, 1995; Wang, 2008; Chafaï and Concordet, 2009), relatively few theoretical results directly address the high-dimensional setting, and in this respect, our example offers some progress. Lastly, it is notable that multinomial data are of a markedly different character than the functional data considered in Section 4, which demonstrates how our approach has a broad scope of potential applications.

5.1. Theoretical example. Recall from Section 2.2 that we regard the observations in the multinomial model as lying in the set of standard basis vectors $\{e_1, \ldots, e_p\} \subset \mathbb{R}^p$. In this context, we also write $\hat{\pi}_j = \bar{X}_j$ to indicate that the $j$th coordinate of the sample mean is an estimate of the $j$th cell proportion $\pi_j$. In addition, it is important to clarify that a variance decay condition of the form (1.2) is automatically satisfied in this model (as explained in Section 2.2), and so it is not necessary to include this as a separate assumption. Below, we retain the definition of $k_n$ in (2.2).


Assumption 5.1 (Multinomial model).

(i) The observations $X_1, \ldots, X_n \in \mathbb{R}^p$ are i.i.d., with $\mathbb{P}(X_1 = e_j) = \pi_j$ for each $j \in \{1, \ldots, p\}$, where $\pi = (\pi_1, \ldots, \pi_p)$ is a probability vector that may vary with $n$.

(ii) There are constants $\alpha > 0$ and $\varepsilon_0 \in (0, 1)$, with neither depending on $n$, such that

(5.1) \[ \sigma_{(j)} \ge \varepsilon_0\, j^{-\alpha} \quad \text{for all } j \in \{1, \ldots, (k_n + 1) \wedge p\}. \]

Remarks. A concrete set of examples satisfying the conditions of Assumption 5.1 is given by probability vectors of the form $\pi_{(j)} \propto j^{-\eta}$, with $\eta > 1$. Furthermore, the condition $\eta > 1$ is mild, since the inequality $\pi_{(j)} \le j^{-1}$ is satisfied by every probability vector.

Applying the bootstrap. In the high-dimensional setting, the multinomial model differs in an essential way from the model in Section 2, because there will often be many empty cells (indices) $j \in \{1, \ldots, p\}$ for which $\hat{\sigma}_j = 0$. For the indices where this occurs, the usual confidence intervals of the form (4.3) have zero width, and thus cannot be used. More generally, if the number of observations in cell $j$ is small, then it is inherently difficult to construct a good confidence interval around $\pi_j$. Consequently, we will restrict our previous SCI (4.3) by focusing on a set of cells that contain a sufficient number of observations. For theoretical purposes, such a set may be defined as

(5.2) \[ \hat{J}_n = \Big\{ j \in \{1, \ldots, p\} \;\Big|\; \hat{\pi}_j \ge \sqrt{\tfrac{\log(n)}{n}} \Big\}. \]

Accordingly, the max statistic and its bootstrapped version are defined by taking maxima over the indices in $\hat{J}_n$, and we denote them as
\[ M = \max_{j \in \hat{J}_n} S_{n,j}/\sigma_j^{\tau_n} \quad \text{and} \quad M^\star = \max_{j \in \hat{J}_n} S^\star_{n,j}/\hat{\sigma}_j^{\tau_n}, \]
where we arbitrarily take $M$ and $M^\star$ to be zero in the exceptional case when $\hat{J}_n$ is empty.
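To illustrate, draws of $M^\star$ for this setting can be generated as in the following sketch. Rather than the multiplier form, it uses the identity $\hat{\Sigma}_n = \mathrm{diag}(\hat{\pi}) - \hat{\pi}\hat{\pi}^\top$, which holds exactly for one-hot observations; the restriction to $\hat{J}_n$, the empty-set convention, and all names follow our own choices.

import numpy as np

def multinomial_bootstrap_max_draws(counts, n, tau, B=1000, seed=None):
    # counts: (p,) cell counts from n trials, so pi_hat = counts / n.
    # Returns B draws of M* = max_{j in J_n} S*_{n,j} / sigma_hat_j^tau.
    rng = np.random.default_rng(seed)
    pi_hat = counts / n
    J = pi_hat >= np.sqrt(np.log(n) / n)     # the index set (5.2)
    if not np.any(J):
        return np.zeros(B)                   # convention: M* = 0 when J_n is empty
    pi_J = pi_hat[J]
    sigma_hat = np.sqrt(pi_J * (1.0 - pi_J)) # assumed positive on J_n
    # For one-hot X_i, Sigma_hat_n = diag(pi_hat) - pi_hat pi_hat^T exactly, so
    # S*_n restricted to J_n can be drawn directly from N(0, Sigma_hat_n[J, J]).
    Sigma_J = np.diag(pi_J) - np.outer(pi_J, pi_J)
    S_star = rng.multivariate_normal(np.zeros(J.sum()), Sigma_J, size=B)
    return np.max(S_star / sigma_hat**tau, axis=1)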

Although the presence of the random index set $\hat{J}_n$ complicates the distributions of $M$ and $M^\star$, it is a virtue of the bootstrap that this source of randomness is automatically accounted for in the resulting inference. In addition, the following result shows that the bootstrap continues to achieve a near-parametric rate of approximation.


Theorem 5.1. Fix any $\delta \in (0, 1/2)$, and suppose that Assumption 5.1 holds. In addition, suppose that $\tau_n \in [0, 1)$ with $(1 - \tau_n)\sqrt{\log(n)} \gtrsim 1$. Then, there is a constant $c > 0$ not depending on $n$ such that the event

(5.3) \[ d_K\big(\mathcal{L}(M), \mathcal{L}(M^\star|X)\big) \le c\, n^{-1/2 + \delta} \]

occurs with probability at least $1 - \frac{c}{n}$.

Remarks. The proof of this result shares much of the same structure as the proofs of Theorems 3.1 and 3.2, but there are a few differences. First, the use of the random index set $\hat{J}_n$ in the definition of $M$ and $M^\star$ entails some extra technical considerations, which are handled with the help of Kiefer's inequality (Lemma G.5). Second, we develop a lower bound for $\lambda_{\min}(\Sigma(k_n))$, where $\Sigma(k_n)$ is the covariance matrix of the variables indexed by $\mathcal{J}(k_n)$ (see Lemma F.3). This bound may be of independent interest for problems involving multinomial distributions, and does not seem to be well known; see also (Benasseni, 2012) for other related eigenvalue bounds.

5.2. Numerical example. We illustrate the bootstrap procedure in the case of the model $\pi_j \propto j^{-1}$, which was considered in a recent numerical study of Balakrishnan and Wasserman (2018). Taking $p = 1000$ and $n \in \{500, 1000\}$, we applied the bootstrap method to construct 95% SCI for the proportions $\pi_j$ corresponding to the cells with at least 5 observations. The cutoff value of 5 is based on a guideline that is commonly recommended in textbooks, e.g., (Agresti, 2003, p.19), (Rice, 2007, p.519). Lastly, the parameter $\tau_n$ was chosen in the same way as described in Section 4.2.

Based on 5,000 Monte Carlo runs, the average coverage probability was found to be 93.7% for $n = 500$, and 94.4% for $n = 1000$, demonstrating satisfactory performance. Regarding the parameter $\tau_n$, the selection rule typically produced values close to 0.8, for both $n = 500$ and $n = 1000$. As a point of comparison, it is also interesting to mention the coverage probabilities that occurred when $\tau_n$ was set to 1 (which eliminates all variance decay). In this case, the coverage probabilities became less accurate, with values of 92.7% for $n = 500$, and 93.1% for $n = 1000$. Hence, this shows that taking advantage of variance decay can enhance coverage probability.

6. Conclusions. The main conclusion to draw from our work is that a modest amount of variance decay in a high-dimensional model can substantially improve rates of bootstrap approximation for max statistics, which helps to reconcile some of the empirical and theoretical results in the literature. In particular, there are three aspects of this type of model structure that are worth emphasizing. First, the variance decay condition (1.2) is very weak, in the sense that the parameter $\alpha > 0$ is allowed to be arbitrarily small. Second, the condition is approximately checkable in practice, since the parameters $\sigma_1, \ldots, \sigma_p$ can be accurately estimated even when $n \ll p$. Third, this type of structure arises naturally in a variety of contexts.

Beyond our main theoretical focus on rates of bootstrap approximation, we have also shown that the technique of partial standardization leads to favorable numerical results. Specifically, this was illustrated with examples involving both functional and multinomial data, where variance decay is an inherent property that can be leveraged. Finally, we note that these applications are by no means exhaustive, and the adaptation of the proposed approach to other types of data may provide further opportunities for future work.

References.

Agresti, A. (2003). Categorical Data Analysis. John Wiley & Sons.
Arlot, S., Blanchard, G. and Roquain, E. (2010a). Some nonasymptotic results on resampling in high dimension, I: Confidence regions. The Annals of Statistics 38 51–82.
Arlot, S., Blanchard, G. and Roquain, E. (2010b). Some nonasymptotic results on resampling in high dimension, II: Multiple tests. The Annals of Statistics 38 83–99.
Asriev, A. V. and Rotar, V. I. (1986). On the convergence rate in the infinite-dimensional central limit theorem for probabilities of hitting parallelepipeds. Theory of Probability & Its Applications 30 691–701.
Balakrishnan, S. and Wasserman, L. (2018). Hypothesis testing for high-dimensional multinomials: A selective review. The Annals of Applied Statistics 12 727–749.
Balakrishnan, S. and Wasserman, L. (2019). Hypothesis testing for densities and high-dimensional multinomials: Sharp local minimax rates. The Annals of Statistics 47 1893–1927.
Belloni, A., Chernozhukov, V., Chetverikov, D., Hansen, C. and Kato, K. (2018). High-dimensional econometrics and regularized GMM. arXiv:1806.01888.
Benasseni, J. (2012). A new derivation of eigenvalue inequalities for the multinomial distribution. Journal of Mathematical Analysis and Applications 393 697–698.
Benko, M., Härdle, W. and Kneip, A. (2009). Common functional principal components. The Annals of Statistics 37 1–34.
Bentkus, V. (1986). Lower bounds for the rate of convergence in the central limit theorem in Banach spaces. Lithuanian Mathematical Journal 25 312–320.
Bentkus, V. (2003). On the dependence of the Berry-Esseen bound on dimension. Journal of Statistical Planning and Inference 113 385–402.
Bentkus, V. (2005). A Lyapunov-type bound in $\mathbb{R}^d$. Theory of Probability & Its Applications 49 311–323.
Bentkus, V. and Götze, F. (1993). On smoothness conditions and convergence rates in the CLT in Banach spaces. Probability Theory and Related Fields 96 137–151.
Bentkus, V., Götze, F., Paulauskas, V. and Račkauskas, A. (2000). The accuracy of Gaussian approximation in Banach spaces. In Limit Theorems of Probability Theory 25–111. Springer.
Bloznelis, M. (1997). On the rate of normal approximation in D[0,1]. Lithuanian Mathematical Journal 37 207–218.
Boucheron, S., Lugosi, G. and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford.
Bunea, F. and Xiao, L. (2015). On the sample covariance matrix estimator of reduced effective rank population matrices, with applications to fPCA. Bernoulli 21 1200–1230.
Cao, G., Yang, L. and Todem, D. (2012). Simultaneous inference for the mean function based on dense functional data. Journal of Nonparametric Statistics 24 359–377.
Chafaï, D. and Concordet, D. (2009). Confidence regions for the multinomial parameter with small sample size. Journal of the American Statistical Association 104 1071–1079.
Chang, J., Yao, Q. and Zhou, W. (2017). Testing for high-dimensional white noise using maximum cross-correlations. Biometrika 104 111–127.
Chen, X. (2018). Gaussian and bootstrap approximations for high-dimensional U-statistics and their applications. The Annals of Statistics 46 642–678.
Chen, L. H. and Fang, X. (2011). Multivariate normal approximation by Stein's method: The concentration inequality approach. arXiv:1111.4073.
Chen, Y. C., Genovese, C. R. and Wasserman, L. (2015). Asymptotic theory for density ridges. The Annals of Statistics 43 1896–1928.
Chen, D. and Müller, H. G. (2012). Nonlinear manifold representations for functional data. The Annals of Statistics 40 1–29.
Chernozhukov, V., Chetverikov, D. and Kato, K. (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics 41 2786–2819.
Chernozhukov, V., Chetverikov, D. and Kato, K. (2014a). Anti-concentration and honest, adaptive confidence bands. The Annals of Statistics 42 1787–1818.
Chernozhukov, V., Chetverikov, D. and Kato, K. (2014b). Gaussian approximation of suprema of empirical processes. The Annals of Statistics 42 1564–1597.
Chernozhukov, V., Chetverikov, D. and Kato, K. (2016). Empirical and multiplier bootstraps for suprema of empirical processes of increasing complexity, and related Gaussian couplings. Stochastic Processes and their Applications 126 3632–3651.
Chernozhukov, V., Chetverikov, D. and Kato, K. (2017). Central limit theorems and bootstrap in high dimensions. The Annals of Probability 45 2309–2352.
Choi, H. and Reimherr, M. (2016). R package 'fregion'. https://github.com/hpchoi/fregion.
Choi, H. and Reimherr, M. (2018). A geometric approach to confidence regions and bands for functional parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80 239–260.
Cressie, N. A. C. and Read, T. R. C. (1984). Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society: Series B 46 440–464.
Csörgő, M., Csörgő, S., Horváth, L. and Mason, D. M. (1986). Weighted empirical and quantile processes. The Annals of Probability 31–85.
Degras, D. A. (2011). Simultaneous confidence bands for nonparametric regression with functional data. Statistica Sinica 1735–1765.
Deng, H. and Zhang, C. H. (2017). Beyond Gaussian approximation: Bootstrap for maxima of sums of independent random vectors. arXiv:1705.09528.
Dezeure, R., Bühlmann, P. and Zhang, C.-H. (2017). High-dimensional simultaneous inference with the bootstrap (with discussion). Test 26 685–719.
Fan, J., Shao, Q.-M. and Zhou, W.-X. (2018). Are discoveries spurious? Distributions of maximum spurious correlations and their applications. The Annals of Statistics 46 989–1017.
Ferraty, F. and Vieu, P. (2006). Nonparametric Functional Data Analysis: Theory and Practice. Springer.
Fienberg, S. E. and Holland, P. W. (1973). Simultaneous estimation of multinomial cell probabilities. Journal of the American Statistical Association 68 683–691.
Fitzpatrick, S. and Scott, A. (1987). Quick simultaneous confidence intervals for multinomial proportions. Journal of the American Statistical Association 82 875–878.
Goodman, L. A. (1965). On simultaneous confidence intervals for multinomial proportions. Technometrics 7 247–254.
Götze, F. (1991). On the rate of convergence in the multivariate CLT. The Annals of Probability 724–739.
Hoeffding, W. (1965). Asymptotically optimal tests for multinomial distributions. The Annals of Mathematical Statistics 369–401.
Holst, L. (1972). Asymptotic normality and efficiency for certain goodness-of-fit tests. Biometrika 59 137–145.
Horn, R. A. and Johnson, C. R. (1990). Matrix Analysis. Cambridge.
Horváth, L. and Kokoszka, P. (2012). Inference for Functional Data with Applications. Springer.
Horváth, L., Kokoszka, P. and Reeder, R. (2013). Estimation of the mean of functional time series and a two-sample problem. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75 103–122.
Hsing, T. and Eubank, R. (2015). Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators. Wiley.
Johnson, W. B., Schechtman, G. and Zinn, J. (1985). Best constants in moment inequalities for linear combinations of independent and exchangeable random variables. The Annals of Probability 234–253.
Johnstone, I. M. (2017). Gaussian estimation: Sequence and wavelet models. http://statweb.stanford.edu/~imj/GE_08_09_17.pdf.
Jung, S., Lee, M. H. and Ahn, J. (2018). On the number of principal components in high dimensions. Biometrika 105 389–402.
Klivans, A. R., O'Donnell, R. and Servedio, R. A. (2008). Learning geometric concepts via Gaussian surface area. In Foundations of Computer Science, 2008 (FOCS '08) 541–550.
Koltchinskii, V., Löffler, M. and Nickl, R. (2019+). Efficient estimation of linear functionals of principal components. The Annals of Statistics (to appear).
Koltchinskii, V. and Lounici, K. (2017a). Concentration inequalities and moment bounds for sample covariance operators. Bernoulli 23 110–133.
Koltchinskii, V. and Lounici, K. (2017b). Normal approximation and concentration of spectral projectors of sample covariance. The Annals of Statistics 45 121–157.
Li, W. V. and Shao, Q. M. (2002). A normal comparison inequality and its applications. Probability Theory and Related Fields 122 494–508.
Lounici, K. (2014). High-dimensional covariance matrix estimation with missing observations. Bernoulli 20 1029–1058.
Marshall, A. W., Olkin, I. and Arnold, B. C. (2011). Inequalities: Theory of Majorization and Its Applications. Springer.
Massart, P. (1986). Rates of convergence in the central limit theorem for empirical processes. Annales de l'IHP Probabilités et Statistiques 22 381–423.
Massart, P. (1989). Strong approximation for multivariate empirical and related processes, via KMT constructions. The Annals of Probability 266–291.
Nagaev, S. V. (1976). An estimate of the remainder term in the multidimensional central limit theorem. In Proceedings of the Third Japan-USSR Symposium on Probability Theory 419–438.
Naumov, A., Spokoiny, V. and Ulyanov, V. (2018). Bootstrap confidence sets for spectral projectors of sample covariance. Probability Theory and Related Fields 1–42.
Nazarov, F. (2003). On the maximal perimeter of a convex set in $\mathbb{R}^n$ with respect to a Gaussian measure. In Geometric Aspects of Functional Analysis. Springer.
Norvaiša, R. and Paulauskas, V. (1991). Rate of convergence in the central limit theorem for empirical processes. Journal of Theoretical Probability 4 511–534.
Paninski, L. (2008). A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory 54 4750–4755.
Paulauskas, V. and Stieve, C. (1990). On the central limit theorem in D[0,1] and D([0,1], H). Lithuanian Mathematical Journal 30 267–276.
Portnoy, S. (1986). On the central limit theorem in $\mathbb{R}^p$ when $p \to \infty$. Probability Theory and Related Fields 73 571–583.
Quesenberry, C. P. and Hurst, D. C. (1964). Large sample simultaneous confidence intervals for multinomial proportions. Technometrics 6 191–195.
Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis, 2nd ed. Springer.
Reiß, M. and Wahl, M. (2019+). Non-asymptotic upper bounds for the reconstruction error of PCA. The Annals of Statistics (to appear).
Rhee, W. and Talagrand, M. (1984). Bad rates of convergence for the central limit theorem in Hilbert space. The Annals of Probability 843–850.
Rice, J. A. (2007). Mathematical Statistics and Data Analysis. Duxbury.
Sazonov, V. V. (1968). On the multi-dimensional central limit theorem. Sankhyā: The Indian Journal of Statistics, Series A 181–204.
Senatov, V. V. (1981). Uniform estimates of the rate of convergence in the multi-dimensional central limit theorem. Theory of Probability & Its Applications 25 745–759.

Sison, C. P. and Glaz, J. (1995). Simultaneous confidence intervals and sample sizedetermination for multinomial proportions. Journal of the American Statistical Asso-ciation 90 366–369.

Spokoiny, V. and Zhilova, M. (2015). Bootstrap confidence sets under model misspec-ification. The Annals of Statistics 43 2653–2675.

van der Vaart, A. W. and Wellner, J. A. (2000). Weak Convergence and EmpiricalProcesses. Springer.

Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices.In Compressed Sensing: theory and applications (Y. C. Eldar and G. Kutyniok, eds.)Cambridge.

Vershynin, R. (2018). High Dimensional Probability. Cambridge.Wang, H. (2008). Exact confidence coefficients of simultaneous confidence intervals for

multinomial proportions. Journal of Multivariate Analysis 99 896–911.Wang, J.-L., Chiou, J.-M. and Muller, H.-G. (2016). Functional data analysis. Annual

Review of Statistics and Its Application 3 257–295.Wasserman, L., Kolar, M. and Rinaldo, A. (2014). Berry-Esseen bounds for estimat-

ing undirected graphs. Electronic Journal of Statistics 8 1188–1224.Zelterman, D. (1987). Goodness-of-fit tests for large sparse multinomial distributions.

Journal of the American Statistical Association 82 624–629.Zhai, A. (2018). A high-dimensional CLT in W2 distance with near optimal convergence

rate. Probability Theory and Related Fields 170 821–845.Zhang, X. and Cheng, G. (2017). Simultaneous inference for high-dimensional linear

models. Journal of the American Statistical Association 112 757–768.Zhang, J.-T., Cheng, M.-Y., Wu, H.-T. and Zhou, B. (2018). A new test for func-

Page 24: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

24 M. E. LOPES ET AL.

tional one-way ANOVA with applications to ischemic heart screening. ComputationalStatistics & Data Analysis.

Zheng, S., Yang, L. and Hardle, W. K. (2014). A smooth simultaneous confidencecorridor for the mean of sparse functional data. Journal of the American StatisticalAssociation 109 661–673.

Page 25: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

BOOTSTRAPPING MAX STATISTICS 25

Supplementary Material

Organization of appendices. In Appendix A we prove Proposition 2.1, andin Appendices B and C we prove Theorems 3.1 and 3.2 respectively. Theseproofs rely on numerous technical lemmas, which are stated and proved inAppendix D. Next, the proof of Theorem 5.1 for the multinomial modelis given in Appendix E, and the associated technical lemmas are given inAppendix F. Lastly, in Appendix G we provide statements of backgroundresults, and in Appendix H we provide a discussion of related work in theGaussian approximation literature.

General remarks and notation. Based on the formulation of Theorems 3.1, 3.2,and 5.1, it is sufficient to show that these results hold for all large valuesof n, and it will simplify some of the proofs to make use of this reduction.For this reason, it is understood going forward that n is sufficiently large forany given expression to make sense. Another convention is that all proofsin Appendices B, C, D, E, and F will implicitly assume that p > kn (unlessotherwise stated), because once the proofs are given for this case, it will fol-low that the low-dimensional case where p ≤ kn can be handled as a directconsequence (which is explained on page 45).

To fix some notation that will be used throughout the appendices, letd ∈ {1, . . . , p}, and define a generalized version of M as

Md = maxj∈J (d)

Sn,j/στnj .

In particular, the statistic M defined in equation (1.5) is the same as Mp.Similarly, the Gaussian and bootstrap versions of Md are defined as

Md = maxj∈J (d)

Sn,j/στnj ,

andM?d = max

j∈J (d)S?n,j/σ

τnj .

In addition, define the parameter

(6.1) βn = α(1− τn).

Lastly, we will often use the fact that if a random variable ξ satisfies thebound ‖ξ‖ψ1 ≤ c for some constant c not depending on n, then there isanother constant C > 0 not depending on n, such that ‖ξ‖r ≤ C r for allr ≥ 1 (Vershynin, 2018, Proposition 2.7.1).

Page 26: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

26 M. E. LOPES ET AL.

APPENDIX A: PROOF OF PROPOSITION 2.1

Proof. It is a standard fact that for any s ≥ 1, the `s norm dominatesits w`s counterpart, and so ‖diag(A)‖w`s ≤ ‖diag(A)‖s. Next, since A issymmetric, the Schur-Horn Theorem implies that the vector diag(A) is ma-jorized by λ(A) (Marshall, Olkin and Arnold, 2011, p.300). Furthermore,when s ≥ 1, the function ‖ · ‖s is Schur-convex on Rp, which means thatif u ∈ Rp is majorized by v ∈ Rp, then ‖u‖s ≤ ‖v‖s (Marshall, Olkin andArnold, 2011, p.138). Hence,

‖diag(A)‖w`s ≤ ‖λ(A)‖s.

Finally, if r ∈ (0, s), then for any v ∈ Rp, the inequality

‖v‖s ≤(ζ(s/r)

)1/s ‖v‖w`rholds, where ζ(x) :=

∑∞j=1 j

−x for x > 1. This bound may be derived asin (Johnstone, 2017, p.257),

‖v‖ss =

p∑j=1

|v|s(j) ≤p∑j=1

(‖v‖w`rj−1/r

)s ≤ ζ(s/r) · ‖v‖sw`r ,

which completes the proof.

APPENDIX B: PROOF OF THEOREM 3.1

Proof. Consider the inequality

(B.1) dK(L(Mp),L(Mp)) ≤ In + IIn + IIIn,

where we define

In = dK

(L(Mp) , L(Mkn)

)(B.2)

IIn = dK

(L(Mkn) , L(Mkn)

)(B.3)

IIIn = dK

(L(Mkn) , L(Mp)

).(B.4)

Below, we show that the term IIn is at most of order n−12

+δ in Proposi-tion B.1. Later on, we establish a corresponding result for In and IIIn inProposition B.2. Taken together, these results complete the proof of Theo-rem 3.1.

Page 27: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

BOOTSTRAPPING MAX STATISTICS 27

Proposition B.1. Fix any number δ ∈ (0, 1/2), and suppose the condi-tions of Theorem 3.1 hold. Then,

(B.5) IIn . n−12

+δ.

Proof. Let Πkn ∈ Rkn×p denote the projection onto the coordinates in-dexed by J (kn). This means that if we write J (kn) = {j1, . . . , jkn} so that(σj1 , . . . , σjkn ) = (σ(1), . . . , σ(kn)), then the lth row of Πkn is the standard ba-sis vector ejl ∈ Rp. Next, define the diagonal matrixDkn = diag(σ(1), . . . , σ(kn)).It follows that

Mkn = max1≤j≤kn

e>j D−τnkn

ΠknSn.

Define the matrix C> = D−τnknΠknΣ1/2, which is of size kn × p. Also, let r

denote the rank of C, and note that r ≤ kn, since the matrix Σ need not beinvertible. Next, consider a decomposition

C = QR,

where the columns of Q ∈ Rp×r are an orthonormal basis for the image ofC, and R ∈ Rr×kn . Hence, if we define the random vector

(B.6) Z = 1√n

∑ni=1Q

>Zi,

then we haveD−τnkn

ΠknSn = R>Z.

It is simple to check that for any fixed t ∈ R, there exists a Borel convex setAt ⊂ Rr such that P(Mkn ≤ t) = P(Z ∈ At). By the same reasoning, we alsohave P(Mkn ≤ t) = γr(At), where γr is the standard Gaussian distributionon Rr. Therefore, the quantity IIn satisfies the bound

IIn ≤ supA∈A

∣∣∣P(Z ∈ A)− γr(A)∣∣∣,(B.7)

where A denotes the collection of all Borel convex subsets of Rr.We now apply Theorem 1.1 of Bentkus (2003) (Lemma G.1), to handle

the supremum above. First observe that the definition of Z in (B.6) satisfiesthe conditions of that result, since the terms Q>Z1, . . . , Q

>Zn are i.i.d. withzero mean and identity covariance matrix. Therefore,

IIn . r1/4 · E[‖Q>Z1‖32] · n−1/2.

It remains to bound the middle factor on the right side. By Lyapunov’sinequality,

E[‖Q>Z1‖32

]≤(E[(Z>1 QQ

>Z1

)2])3/4.(B.8)

Page 28: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

28 M. E. LOPES ET AL.

Next, if v1, . . . , vr denote the orthonormal columns of Q, then we have

QQ> =∑r

j=1 vjv>j .

Hence, if we put ζj = Z>1 vj , then

E[(Z>1 QQ

>Z1)2]

=∥∥∥∑r

j=1 ζ2j

∥∥∥2

2

≤(∑r

j=1 ‖ζ2j ‖2)2

. k2n,

(B.9)

where we have used the fact that r ≤ kn and ‖ζ2j ‖2 = ‖Z>1 vj‖24 . 1, based

on Assumption 2.1. Combining the last few steps gives E[‖Q>Z1‖32] . k6/4n ,

and henceIIn . k7/4

n n−1/2 . n−12

+δ,

as needed.

Proposition B.2. Fix any number δ ∈ (0, 1/2), and suppose the condi-tions of Theorem 3.1 hold. Then,

(B.10) In . n−12

+δ and IIIn . n−12

+δ.

Proof. We only prove the bound for In, since the same argument appliesto IIIn. It is simple to check that for any fixed real number t,∣∣∣P( max

1≤j≤pSn,j/σ

τnj ≤ t

)− P

(max

j∈J (kn)Sn,j/σ

τnj ≤ t

)∣∣∣ = P(A(t) ∩B(t)

),

where we define the events

(B.11) A(t) ={

maxj∈J (kn)

Sn,j/στnj ≤ t

}and B(t) =

{max

j∈J (kn)cSn,j/σ

τnj > t

},

and J (kn)c denotes the complement of J (kn) in {1, . . . , p}. Also, for anypair of real numbers t1,n and t2,n satisfying t1,n ≤ t2,n, it is straightforwardto check that the following inclusion holds for all t ∈ R,

(B.12) A(t) ∩B(t) ⊂ A(t2,n) ∪B(t1,n).

Page 29: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

BOOTSTRAPPING MAX STATISTICS 29

Applying a union bound, and then taking the supremum over t ∈ R, weobtain

In ≤ P(A(t2,n)) + P(B(t1,n)).

The remainder of the proof consists in selecting t1,n and t2,n so thatt1,n ≤ t2,n and that the probabilities P(A(t2,n)) and P(B(t1,n)) are suffi-ciently small. Below, Lemma B.1 shows that if t1,n and t2,n are chosen as

t1,n = c · k−βnn · log(n)(B.13)

t2,n = c◦ · `−βnn ·√

log(`n),(B.14)

for a certain constant c > 0, and c◦ as in (2.4), then P(A(t2,n)) and P(B(t1,n))

are at most of order n−12

+δ. Furthermore, the inequality t1,n ≤ t2,n holds forall large n, due to the definitions of `n, kn, and βn, as well as the condition(1− τn)

√log(n) & 1.

Lemma B.1. Fix any number δ ∈ (0, 1/2), and suppose the conditions ofTheorem 3.1 hold. Then, there are positive constants c and c◦, not dependingon n, that can be selected in the definitions of t1,n (B.13) and t2,n (B.14), sothat

(a) P(A(t2,n)) . n−12

+δ,

and

(b) P(B(t1,n)) . n−1.

Proof of Lemma B.1 part (a). Due to Proposition B.1 and the factthat J (`n) ⊂ J (kn), we have

P(A(t2,n)) ≤ P(

maxj∈J (kn)

Sn,j/στnj ≤ t2,n

)+ IIn

≤ P(

maxj∈J (`n)

Sn,j/στnj ≤ t2,n

)+ c n−

12

+δ.

(B.15)

To bound the probability in the last line, we will make use of the Gaussianityof Sn to apply certain results based on Slepian’s lemma, as contained inLemmas B.2 and G.3 below. As a preparatory step, consider some genericrandom variables {Yj} and positive scalars {aj} indexed by J (`n), as wellas a constant b such that maxj∈J (`n) aj ≤ b. Then,

P(

maxj∈J (`n)

Yj ≤ t2,n)≤ P

(max

j∈J (`n)ajYj ≤ b t2,n

),(B.16)

Page 30: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

30 M. E. LOPES ET AL.

which can be seen by expressing the left side in terms of ∩j{ajYj ≤ ajt2,n},and noting that this set is contained in ∩j{ajYj ≤ bt2,n}. Due to Assump-

tion 2.2 with c◦ ∈ (0, 1), we have the inequality στn−1j ≤ `βnn /c◦ for all

j ∈ J (`n), and so we may apply the previous observation with aj = στn−1j ,

and b = `βnn /c◦. Furthermore, the definition of t2,n gives b t2,n =√

log(`n),and so we if we let Yj = Sn,j/σ

τnj , it follows that

P(

maxj∈J (`n)

Sn,j/στnj ≤ t2,n

)≤ P

(max

j∈J (`n)Sn,j/σj ≤

√log(`n)

).

The proof is completed by applying the next result (Lemma B.2) in conjunc-tion with the conditions of Assumption 2.2. (Take m = `n in the statementof Lemma B.2.)

Remark. The lemma below may be of independent interest, and so we havestated it in a way that can be understood independently of the context ofour main assumptions. Also, the constants 1/2 and 1/3 in the exponent ofthe bound (B.19) can be improved slightly, but we have left the result inthis form for simplicity.

Lemma B.2. For each integer m ≥ 1, let R = R(m) be a correlationmatrix in Rm×m, and let R+ = R+(m) denote the matrix with (i, j) entrygiven by max{Ri,j , 0}. Suppose the matrix R+ is positive semi-definite forall m, and that there are constants ε0 ∈ (0, 1) and c > 0, not depending onm, such that the inequalities

maxi 6=j

Ri,j ≤ 1− ε0(B.17)

∑i 6=j R

+i,j ≤ cm(B.18)

hold for all m. Lastly, let (ζ1, . . . , ζm) be a Gaussian vector drawn fromN(0,R). Then, there is a constant C > 0, not depending on m, such thatthe inequality

(B.19) P(

max1≤j≤m

ζj ≤√

log(m))≤ C exp

(− 1

2m1/3)

holds for all m.

Proof. It is enough to show that the result holds for all large m, because

if m ≤ m0, then the result is clearly true when C = exp(12m

1/30 ). To begin

the argument, we may introduce a Gaussian vector (ξ1, . . . , ξm) ∼ N(0,R+),

Page 31: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

BOOTSTRAPPING MAX STATISTICS 31

since the matrix R+ is positive semi-definite. In turn, the version of Slepian’sLemma given in Lemma G.3 leads to

P(

max1≤j≤m

ζj ≤√

log(m))≤ P

(max

1≤j≤`nξj ≤

√log(m)

)≤ Km · Φm

(√log(m)

),

(B.20)

where we put

Km = exp

{ ∑1≤i<j≤m

log(

11− 2

πarcsin(R+

i,j)

)exp

(− log(m)

1+R+i,j

)}.

Next, we apply the assumption maxi 6=j R+i,j ≤ 1 − ε0. Since the functions

x 7→ log(1/(1 − x)) and x 7→ 2π arcsin(x) have bounded derivatives on any

closed subinterval of [0, 1), it follows that

log(

11− 2

πarcsin(R+

i,j)

)≤ cR+

i,j ,

for some constant c > 0 not depending on m. Therefore, by possibly increas-ing c, the condition (B.18) gives

Km ≤ exp{cm · exp

(− log(m)

1+(1−ε0)

)}= exp

{cm

1− 12−ε0

}.

(B.21)

To bound the earlier factor involving Φm(√

log(m)), let η0 ∈ (0, 1) be a smallconstant to be optimized below, and note that the following inequality holdsfor all sufficiently large s > 0,

Φ(√

(2− η0ε0) log(s))≤ 1− 1

s ,(B.22)

which may be found in (Boucheron, Lugosi and Massart, 2013, p.337). Tak-ing s = mκ0 with κ0 := 1

2−η0ε0 shows that for all large m,

Φm(√

log(m))≤(

1− 1mκ0

)m≤ exp

(−m1−κ0).(B.23)

We now collect the last several steps. If we observe that κ0 <1

2−ε0 , then thefollowing inequalities hold for all large m,

Km · Φm(√

log(m))≤ exp

{cm

1− 12−ε0 −m1−κ0

}≤ exp

{− (1− η0)m1−κ0

}.

(B.24)

Page 32: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

32 M. E. LOPES ET AL.

So, by possibly further decreasing η0, we have (1 − κ0) > 1/3, as well as(1− η0) > 1/2. This leads to the stated result.

Proof of Lemma B.1 part (b). Define the random variable

V = maxj∈J (kn)c

Sn,j/στnj ,

and letq = max

{2βn, log(n), 3

}.

Clearly, for any t > 0, we have the tail bound

(B.25) P(V ≥ t) ≤ ‖V ‖

qq

tq,

and furthermore

‖V ‖qq = E[∣∣∣ max

j∈J (kn)cSn,j/σ

τnj

∣∣∣q]≤

∑j∈J (kn)c

σq(1−τn)j E

[| 1σjSn,j |q

].

(B.26)

By Lemma D.4, we have ‖ 1σjSn,j‖q ≤ cq, and so

‖V ‖qq ≤ (cq)q∑

j∈J (kn)c

σq(1−τn)j

. (cq)qp∑

j=kn+1

j−qβn

≤ (cq)q∫ p

kn

x−qβndx

≤ (cq)q

qβn−1 k−qβn+1n ,

(B.27)

where we recall βn = α(1− τn), and note that qβn ≥ 2, which holds by the

definition of q. Hence, if we put Cn := c(qβn−1)1/q

· k1/qn , then

‖V ‖q ≤ Cn · q · k−βnn .

Page 33: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

BOOTSTRAPPING MAX STATISTICS 33

Furthermore, it is simple to check that Cn . 1, and that the assumption(1−τn)

√log(n) & 1 implies q . log(n). Therefore, from the inequality (B.25)

with t = e‖V ‖q, as well as the definition of q, we obtain

P

(V ≥ c · log(n) · k−βnn

)≤ e−q ≤ 1

n ,

for some constant c > 0 not depending on n, as needed.

APPENDIX C: PROOF OF THEOREM 3.2

Proof. Consider the inequality

(C.1) dK(L(Mp),L(M?p |X)) ≤ I′n + II′n(X) + III′n(X),

where we define

I′n = dK

(L(Mp) , L(Mkn)

)(C.2)

II′n(X) = dK

(L(Mkn) , L(M?

kn |X))

(C.3)

III′n(X) = dK

(L(M?

kn |X), L(M?

p |X)).(C.4)

Note that I′n is deterministic, whereas II′n(X) and III′n(X) are random vari-ables depending on X. The remainder of the proof consists in showing thateach of these terms are at most of order n−

12

+δ, with probability at least1− c

n . The terms II′n(X) and III′n(X) are handled in Sections C.2 and C.1 re-spectively. The first term I′n requires no further work, due to Proposition B.2(since I′n is equal to IIIn, defined in equation (B.4)).

C.1. Handling the term III′n(X). The proof of Proposition B.2 canbe partially re-used to show that for any fixed realization of X, and any realnumbers t′1,n ≤ t′2,n, the following bound holds

(C.5) III′n(X) ≤ P(A′(t′2,n)

∣∣X) + P(B′(t′1,n)

∣∣X),where we define the following events for any t ∈ R,

(C.6) A′(t) ={

maxj∈J (kn)

S?n,j/στnj ≤ t

}and B′(t) =

{max

j∈J (kn)cS?n,j/σ

τnj > t

}.

Page 34: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

34 M. E. LOPES ET AL.

Below, Lemma C.1 ensures that t′1,n and t′2,n can be chosen so that the

random variables P(B′(t′1,n)∣∣X) and P(A′(t′2,n)

∣∣X) are at most cn−12

+δ, withprobability at least 1 − c

n . Also, it is straightforward to check that underAssumption 2.2, the choices of t′1,n and t′2,n given in Lemma C.1 satisfyt′1,n ≤ t′2,n when n is sufficiently large.

Lemma C.1. Fix any number δ ∈ (0, 1/2), and suppose the conditionsof Theorem 3.1 hold. Then, there are positive constants c1, c2, and c, notdepending on n, for which the following statement is true:

If t′1,n and t′2,n are chosen as

t′1,n = c1 · k−βnn · log(n)3/2 and(C.7)

t′2,n = c2 · `−βnn ·√

log(`n),(C.8)

then the events

(a) P(A′(t′2,n)∣∣X) ≤ c n−

12

and

(b) P(B′(t′1,n)∣∣X) ≤ n−1

each hold with probability at least 1− cn .

Proof of Lemma C.1 part (a). Using the definition of II′n(X), followedby J (`n) ⊂ J (kn), we have

P(A′(t′2,n)|X) ≤ P(

maxj∈J (`n)

Sn,j/στnj ≤ t

′2,n

)+ II′n(X).(C.9)

Taking t′2,n = t2,n as in (B.14), the proof of Lemma B.1 part (a) shows that

the first term is O(n−1/2). With regard to the second term, Proposition C.1in the next subsection shows that there is a constant c > 0 not dependingon n such that the event

(C.10) II′n(X) ≤ c n−12

holds with probability at least 1− cn . This completes the proof.

Page 35: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

BOOTSTRAPPING MAX STATISTICS 35

Proof of Lemma C.1 part (b). Define the random variable

(C.11) V ? := maxj∈J (kn)c

S?n,j/στnj ,

and as in the proof of Lemma B.1(b), let q = max{

2βn, log(n), 3

}. The idea

of the proof is to construct a function b(·) such that the following boundholds for every realization of X,(

E[|V ?|q

∣∣X])1/q≤ b(X),

and then Chebyshev’s inequality gives the following inequality for any num-ber bn satisfying b(X) ≤ bn,

P(V ? ≥ ebn

∣∣∣X) ≤ e−q ≤ 1n .

In turn, we will derive an expression for bn such that the event {b(X) ≤ bn}holds with high probability. This will lead to the statement of the lemma,because it will turn out that t′1,n � bn.

To construct the function b(·), observe that the initial portion of the proofof Lemma B.1(b) shows that for any realization of X,

(C.12) E[|V ?|q

∣∣X] ≤ ∑j∈J (kn)c

σq(1−τn)j E

[| 1σjS?n,j |q|X

].

Next, Lemma D.4 ensures that for every j ∈ {1, . . . , p}, the event

E[| 1σjS?n,j |q|X

]≤ (c q)q,

holds with probability 1. Consequently, if we let s = q(1− τn) and considerthe random variable

(C.13) s :=

( ∑j∈J (kn)c

σsj

) 1s

,

as well asb(X) := c · q · s(1−τn),

then we obtain the bound

(C.14)(E[|V ?|q

∣∣X])1/q≤ b(X),

Page 36: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

36 M. E. LOPES ET AL.

with probability 1. To proceed, Lemma D.2 implies

(C.15) P(b(X) ≥ q · (c

√q)1−τn

(qβn−1)1/q· k−βn+1/q

n

)≤ e−q ≤ 1

n ,

for some constant c > 0 not depending on n. By weakening this tail boundslightly, it can be simplified to

(C.16) P(b(X) ≥ C ′n · q3/2 · k−βnn

)≤ 1

n ,

where C ′n := c k1/qn

(qβn−1)1/q, and we recall βn = α(1− τn). To simplify further, it

can be checked that C ′n . 1, and that the assumption (1− τn)√

log(n) & 1gives q . log(n). It follows that there is a constant c not depending on nsuch that if

bn := c · log(n)3/2 · k−βnn ,

then

(C.17) P(b(X) ≥ bn) ≤ 1n ,

which completes the proof.

C.2. Handling the term II′n(X).

Proposition C.1. Fix any number δ ∈ (0, 1/2), and suppose the con-ditions of Theorem 3.1 hold. Then, there is a constant c > 0 not dependingon n such that the event

(C.18) II′n(X) ≤ c n−12

holds with probability at least 1− cn .

Proof. Define the random variable

(C.19) M?kn := max

j∈J (kn)S?n,j/σ

τnj ,

which differs from M?kn

, since στnj is used in place of στnj . Consider thetriangle inequality

(C.20) II′n(X) ≤ dK

(L(Mkn) ,L(M?

kn |X))

+ dK

(L(M?

kn |X) , L(M?kn |X)

).

The two terms on the right will bounded separately.

Page 37: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

BOOTSTRAPPING MAX STATISTICS 37

To address the first term on the right side of (C.20), we will applyLemma D.3, for which a substantial amount of notation is needed. Recallthe matrix C = Σ1/2Π>knD

−τnkn

of size p × kn, where the projection matrix

Πkn ∈ Rkn×p is defined in the proof of Proposition B.1. Note that Mkn is thecoordinate-wise maximum of a Gaussian vector drawn from N(0,S), withS = C>C. To address M?

kn, let

Wn = 1n

∑ni=1(Zi − Z)(Zi − Z)>

where Z = 1n

∑ni=1 Zi, and observe that M?

knis the coordinate-wise max-

imum of Gaussian vector drawn from N(0, S), with S = C>WnC. Next,consider the s.v.d.,

C = UΛV >,

where if r denotes the rank of C, then we may take U ∈ Rp×r to haveorthonormal columns, Λ ∈ Rr×r to be invertible, and V > ∈ Rr×kn to haveorthonormal rows. In order to apply Lemma D.3 for a given realization ofS, it is necessary that the columns of S and S span the same subspaceof Rkn . This occurs with probability at least 1 − c

n , because the matrix

S is equal to V Λ(U>WnU)ΛV >, and the matrix (U>WnU) is invertiblewith probability at least 1− c

n (due to Lemma D.5). Another ingredient forapplying Lemma D.3 is the following algebraic relation, which is a directconsequence of the definitions just introduced,

(C.21)(V >SV

)−1/2(V >SV

)(V >SV

)−1/2= U>WnU.

Building on this relation, Lemma D.3 shows that if the event

(C.22) ‖U>WnU − Ir‖op ≤ ε,

holds for some number ε > 0, then the event

(C.23) dK

(L(Mkn) ,L(M?

kn |X))≤ c · k1/2

n · ε

also holds, where c > 0 is a constant not depending on n or ε. Thus, itremains to specify ε in the event (C.22). For this purpose, Lemma D.5 showsthat if ε = c ·n−1/2 ·kn · log(n), then the event (C.22) holds with probabilityat least 1− c

n . So, given that

n−1/2 · k3/2n · log(n) . n−

12

+δ,

the first term in the bound (C.20) requires no further consideration.

Page 38: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

38 M. E. LOPES ET AL.

To deal with the second term in (C.20), we proceed by considering thegeneral inequality

(C.24) dK(L(ξ),L(ζ)) ≤ supt∈R

P(|ζ − t| ≤ r

)+ P(|ξ − ζ| > r),

which holds for any random variables ξ and ζ, and any real number r > 0(cf. Chernozhukov, Chetverikov and Kato (2016, Lemma 2.1)). Specifically,we will let L(M?

kn|X) play the role of L(ξ), and let L(M?

kn|X) play the

role of L(ζ). In other words, we need to establish an anti-concentrationinequality for L(M?

kn|X), as well as a coupling inequality for M?

knand M?

kn,

conditionally on X.To establish the coupling inequality, if we put

(C.25) rn = c · n−1/2 · log(n)5/2,

for a suitable constant c not depending on n, then Lemma D.8 shows thatthe event

(C.26) P(∣∣M?

kn −M?kn

∣∣ > rn

∣∣∣X) ≤ cn

holds with probability at least 1− cn .

Lastly, the anti-concentration inequality can be derived from Nazarov’sinequality (Lemma G.2), since M?

knis obtained from a Gaussian vector,

conditionally on X. For this purpose, let

(C.27) σkn = minj∈J (kn)

σj .

In turn, Nazarov’s inequality implies that the event

(C.28) supt∈R

P(|M?

kn − t| ≤ rn∣∣∣X) ≤ c · rn

σ1−τnkn

·√

log(kn),

holds with probability 1. Meanwhile, Lemma D.6 and Assumption 2.2 implythat the event

(C.29)1

σ1−τnkn

≤ c kβnn

holds with probability at least 1 − cn . Combining the last few steps, we

conclude that the following bound holds with probability at least 1− cn ,

supt∈R

P(|M?

kn − t| ≤ rn∣∣∣X) ≤ c · n−1/2 · kβnn · log(n)5/2 ·

√log(kn)

≤ c n− 12+δ,

(C.30)

as needed.

Page 39: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

BOOTSTRAPPING MAX STATISTICS 39

APPENDIX D: TECHNICAL LEMMAS FOR THEOREMS 3.1 AND 3.2

Lemma D.1. Fix any number δ ∈ (0, 1/2), and suppose the conditions ofTheorem 3.1 hold. Also, let q = max{ 2

βn, log(n), 3}. Then, there is a constant

c > 0 not depending on n, such that for any j ∈ {1, . . . , p}, we have

(D.1) ‖σj‖q ≤ c · σj ·√q.

Proof. Define the vector u := 1σj

Σ1/2ej ∈ Rp, which satisfies ‖u‖2 = 1.

Observe that

1σj‖σj‖q =

∥∥∥∥∥( 1n

∑ni=1(Z>i u)2 − (Z>u)2

)1/2∥∥∥∥∥q

≤∥∥∥∥( 1

n

∑ni=1(Z>i u)2

)1/2∥∥∥∥q

=∥∥∥ 1n

∑ni=1(Z>i u)2

∥∥∥1/2

q/2.

(D.2)

Since the random variables (Z>1 u)2, . . . , (Z>n u)2 are independent and non-negative, part (i) of Rosenthal’s inequality in Lemma G.4 implies the Lq/2

norm in the last line satisfies

(D.3)

∥∥∥ 1n

∑ni=1(Z>i u)2

∥∥∥q/2≤ c · q ·max

{∥∥(Z>1 u)2∥∥1, n−1+2/q

∥∥(Z>1 u)2∥∥q/2

},

for an absolute constant c > 0. For the first term inside the maximum,observe that since ‖u‖2 = 1 and Z1 is isotropic, we have ‖(Z>1 u)2‖1 = 1.To handle the second term inside the maximum, Assumption 2.1 implies‖(Z>1 u)2‖q/2 . q2. Combining the last few steps, and noticing the square

root on the Lq/2 norm in the last line of (D.2), we obtain

1σj‖σj‖q .

√q ·max

{1 , n−1/2+1/qq

},(D.4)

and this implies the statement of the lemma.

Lemma D.2. Fix any number δ ∈ (0, 1/2), and suppose the conditionsof Theorem 3.1 hold. Also, let q = max{ 2

βn, log(n), 3}, and s = q(1 − τn),

and consider the random variables s and t defined by

s =

( ∑j∈J (kn)c

σsj

)1/s

and t =

( ∑j∈J (kn)

σsj

)1/s

.

Page 40: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

40 M. E. LOPES ET AL.

Then, there is a constant c > 0 not depending on n such that

(D.5) P

(s ≥ c

√q

(qβn−1)1/s· k−α+1/s

n

)≤ e−q,

and

(D.6) P(t ≥ c

√q

(qβn−1)1/s

)≤ e−q.

Proof. In light of the Chebyshev inequality P(s ≥ e‖s‖q

)≤ e−q, it

suffices to bound ‖s‖q (and similarly for t). We proceed by direct calculation,

‖s‖q =

∥∥∥∥∥ ∑j∈J (kn)c

σsj

∥∥∥∥∥1/s

q/s

( ∑j∈J (kn)c

∥∥σsj∥∥q/s)1/s

(triangle inequality for ‖ · ‖q/s, with q/s ≥ 1)

=

( ∑j∈J (kn)c

∥∥σj∥∥sq)1/s

.√q ·

( ∑j∈J (kn)c

σsj

)1/s

(Lemma D.1)

.√q ·

(∫ p

kn

x−sαdx

)1/s

.√q · k

−α+1/sn

(sα− 1)1/s,

and in the last step we have used the fact that sα = qβn > 1, which holdssince q is defined to satisfy qβn > 1. The calculation for t is essentially thesame, except that we use

∑j∈J (kn) σ

sj . 1.

Remark. The following result is a variant of Lemma A.7 in the paper (Spokoinyand Zhilova, 2015).

Lemma D.3. Let A and B be positive semi-definite matrices in Rd×d \ {0}whose columns span the same subspace of Rd. Define two multivariate nor-mal random vectors ξ ∼ N(0, A) and ζ ∼ N(0, B). Let r ≤ d be the dimen-sion of the subspace spanned by the columns of A and B, and let Q ∈ Rd×rhave columns that are an orthonormal basis for this subspace. Define the

Page 41: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

BOOTSTRAPPING MAX STATISTICS 41

r × r positive definite matrices A = Q>AQ and B = Q>BQ, and let H beany square matrix satisfying H>H = A. Finally, let ε > 0 be a number suchthat ‖(H−1)>B(H−1)− Ir‖op ≤ ε. Then, there is an absolute constant c > 0such that

(D.7) supt∈R

∣∣∣∣P( max1≤j≤d

ξj ≤ t)− P

(max

1≤j≤dζj ≤ t

)∣∣∣∣ ≤ c√r ε.Proof. We may assume that

√r ε ≤ 1/2, for otherwise the claim trivially

holds with c = 2. Define the r-dimensional random vectors ξ = Q>ξ and ζ =Q>ζ. As a consequence of the assumptions, the random vector ξ lies in thecolumn-span of Q almost surely, which gives Qξ = ξ almost surely. It followsthat for any t ∈ R, the event {max1≤j≤d ξj ≤ t} can be expressed as {ξ ∈ At}for some Borel set At ⊂ Rr. Likewise, we also have {max1≤j≤d ζj ≤ t} ={ζ ∈ At}. Hence, the left hand side of (D.7) is upper-bounded by the totalvariation distance between L(ξ) and L(ζ), and in turn, Pinsker’s inequality

implies this is upper-bounded by c√dKL(L(ζ),L(ξ)), where c > 0 is an

absolute constant, and dKL denotes the KL divergence. Since the randomvectors ξ and ζ are Gaussian, the following exact formula is available if welet C = (H>)−1B(H−1)− Ir,

dKL(L(ζ),L(ξ)) = 12

(tr(C)− log det(C + Ir)

)= 1

2

∑rj=1 λj(C)− log(λj(C) + 1).

(D.8)

Using the basic inequality |x − log(x + 1)| ≤ x2/(1 + x) that holds for anyx ∈ (−1,∞), as well as the condition |λj(C)| ≤ ε ≤ 1/2, we have

dKL(L(ζ),L(ξ)) ≤ c r ‖C‖2op

≤ c r ε2,(D.9)

for some absolute constant c > 0.

Lemma D.4. Suppose the conditions of Theorem 3.1 hold, and let q =max{ 2

βn, log(n), 3}. Then, there is a constant c > 0 not depending on n such

that for any j ∈ {1, . . . , p}, we have

(D.10) ‖ 1σjSn,j‖q ≤ c q,

and the following event holds with probability 1,

(D.11)(E[| 1σjS?n,j |q|X

])1/q≤ c q.

Page 42: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

42 M. E. LOPES ET AL.

Proof. We only prove the first bound, since the second one can be ob-tained by repeating the same argument, conditionally on X. Since q > 2,Lemma G.4 gives

(D.12) ‖ 1σjSn,j‖q . q ·max

{‖ 1σjSn,j‖2 , n−1/2+1/q‖ 1

σj(X1,j − µj)‖q

}.

Clearly,

(D.13) ‖ 1σjSn,j‖22 = var( 1

σjSn,j) = 1.

Furthermore, if we define the vector u := 1σj

Σ1/2ej in Rp, which satisfies

‖u‖2 = 1, then ∥∥ 1σj

(X1,j − µj)∥∥q

=∥∥Z>1 u∥∥q . q(D.14)

where the last step follows from Assumption 2.1. Applying the work aboveto the bound (D.12) gives

(D.15) ‖ 1σjSn,j‖q . q ·max

{1, n−1/2+1/q · q

}.

Finally, the stated choice of q implies that the right side in the last displayis of order q.

Lemma D.5. Let the random vectors Z1, . . . , Zn ∈ Rp be as in Assump-tion 2.1, and let Q ∈ Rp×r be a fixed matrix having orthonormal columnswith r ≤ kn. Lastly, let

(D.16) Wn =1

n

n∑i=1

(Zi − Z)(Zi − Z)>,

where Z = 1n

∑ni=1 Zi. Then, there is a constant c > 0 not depending on n,

such that the event

(D.17)∥∥Q>WnQ− Ir

∥∥op≤ c log(n)kn√

n,

holds with probability at least 1− cn .

Proof. Let ε ∈ (0, 1/2), and let N be an ε-net (with respect to the `2-norm) for the unit `2-sphere in Rr. It is well known that N can be chosenso that card(N ) ≤ (3/ε)r, and the inequality∥∥Q>WnQ− Ir

∥∥op≤ 1

1−2ε ·maxu∈N

∣∣∣u>(Q>WnQ− Ir

)u∣∣∣,

Page 43: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

BOOTSTRAPPING MAX STATISTICS 43

holds with probability 1 (Vershynin, 2012, Lemmas 5.2 and 5.4). For a fixedu ∈ N , put ξu,i := Z>i Qu, and consider the simple algebraic relation

(D.18) u>(Q>WnQ− Ir

)u =

(1n

∑ni=1 ξ

2i,u − 1

)︸ ︷︷ ︸

=:∆(u)

−(

1n

∑ni=1 ξi,u

)2

︸ ︷︷ ︸=:∆′(u)

.

We will show that both terms on the right side are small with high probabil-ity, and then take a union bound over u ∈ N . The high-probability boundswill be obtained by using Lemma G.4 to control ‖∆(u)‖q and ‖∆′(u)‖q whenq is sufficiently large.

To apply Lemma G.4, first observe the following bounds, which are con-sequences of Assumption 2.1,

(D.19) ‖ξi,u‖q . q,

and

‖ξ2i,u − 1‖q . q2.(D.20)

Therefore, when q > 2, Lemma G.4 gives

‖∆(u)‖q . qmax{‖∆(u)‖2 , 1

n

(∑ni=1 ‖ξ2

i,u − 1‖qq)1/q}

. qmax{

1√n, n−1+1/q · q2

}.

(D.21)

Due to Chebyshev’s inequality,

P(|∆(u)| ≥ e‖∆(u)‖q

)≤ e−q,

and so if we take q = max{C log(n)kn, 3} for some constant C > 0 to betuned below, then (D.21) gives ‖∆(u)‖q . q/

√n, and the following inequal-

ity holds for all u ∈ N ,

P(|∆(u)| ≥ c log(n)kn√

n

)≤ exp

{− C log(n)kn

}.

The random variable ∆′(u) can be analyzed with a similar set of steps, whichleads to the following inequality for all u ∈ N ,

(D.22) P(|∆′(u)| ≥ c

(log(n)kn√

n

)2)≤ exp

{− C log(n)kn

}.

Combining the previous work with a union bound, if we consider the choiceε = min{c log(n)kn/

√n, 1

4}, then

P(∥∥Q>WnQ− Ir

∥∥op≥ ε)≤ 2 exp

{− C · kn · log(n) + r · log(3/ε)

}.

Finally, choosing C sufficiently large implies the stated result.

Page 44: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

44 M. E. LOPES ET AL.

Remark. For the next results, define the correlation

ρj,j′ =Σj,j′σjσj′

,

and its sample version

ρj,j′ =Σj,j′σj σj′

,

for any j, j′ ∈ {1, . . . , p}.

Lemma D.6. Suppose the conditions of Theorem 3.1 hold. Then, thereis a constant c > 0 not depending on n such that the three events

(D.23) maxj∈J (kn)

∣∣∣ σjσj − 1∣∣∣ ≤ c log(n)√

n,

(D.24) minj∈J (kn)

σ1−τnj ≥

(min

j∈J (kn)σ1−τnj

)·(

1− c log(n)√n

),

and

(D.25) maxj,j′∈J (kn)

∣∣ρjj′ − ρjj′∣∣ ≤ c log(n)√n

each hold with probability at least 1− cn .

Proof. The result is a direct consequence of Lemma D.7 below. Thedetails are essentially algebraic manipulations, and so are omitted.

Lemma D.7. Suppose the conditions of Theorem 3.1 hold, and fix anytwo (possibly equal) indices j, j′ ∈ {1, . . . , p}. Then, for any number κ ≥ 1,there are positive constants c and c1(κ) not depending on n such that theevent

(D.26)∣∣∣ Σj,j′σjσj′

− ρj,j′∣∣∣ ≤ c1(κ) log(n)√

n

holds with probability at least 1− cn−κ.

Remark. The event in the lemma has been formulated to hold with proba-bility at least 1− cn−κ, rather than 1− c

n , in order to accommodate a unionbound for proving Lemma D.6.

Page 45: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

BOOTSTRAPPING MAX STATISTICS 45

Proof. Consider the `2-unit vectors u = Σ1/2ej/σj and v = Σ1/2ej′/σj′

in Rp. Letting Wn be as defined in (D.16), observe that

Σj,j′

σjσj′− ρj,j′ = u>(Wn − Ip)v.(D.27)

For each 1 ≤ i ≤ n, define the random variables ζi,u = Z>i u and ζi,v = Z>i v.In this notation, the relation (D.27) becomes

Σj,j′

σjσj′− ρj,j′ =

(1n

∑ni=1 ζi,uζi,v − u>v

)︸ ︷︷ ︸

=:∆(u,v)

−(

1n

∑ni=1 ζi,u

)(1n

∑ni=1 ζi,v

)︸ ︷︷ ︸

=:∆′(u,v)

.

Note that E[ζi,uζi,v] = u>v. Also, if we let q = max{κ log(n), 3}, then

‖ζi,uζi,v − u>v‖q . q2,

which follows from Assumption 2.1. Therefore, Lemma (G.4) gives the fol-lowing bound for q > 2,

‖∆(u, v)‖q . qmax{‖∆(u, v)‖2 , 1

n

(∑ni=1 ‖ζi,uζi,v − u>v‖

qq

)1/q}. qmax

{1√n, n−1+1/q · q2

}. log(n)√

n.

(D.28)

Using the Chebyshev inequality

P(|∆(u, v)| ≥ e‖∆(u, v)‖q) ≤ e−q,

we haveP(|∆(u, v)| ≥ cκ log(n)√

n

)≤ 1

nκ .

Similar reasoning leads to the following tail bound for ∆′(u, v),

P(|∆′(u, v)| ≥

( cκ log(n)√n

)2) ≤ 1nκ ,

and combining with the previous tail bound gives the stated result.

Lemma D.8. Fix any number δ ∈ (0, 1/2), and suppose the conditionsof Theorem 3.1 hold. Then, there is a constant c > 0 not depending on nsuch that the event (C.26) holds with probability at least 1− c

n .

Page 46: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

46 M. E. LOPES ET AL.

Proof. Let (a1, . . . , akn) and (b1, . . . , bkn) be real vectors, and note thebasic fact ∣∣∣∣ max

1≤j≤knaj − max

1≤j≤knbj

∣∣∣∣ ≤ max1≤j≤kn

|aj − bj |.

From this, it is simple to derive the inequality

(D.29)∣∣M?

kn −M?kn

∣∣ ≤ maxj∈J (kn)

∣∣∣( σjσj )τn − 1∣∣∣ · max

j∈J (kn)

∣∣∣S?j /στnj ∣∣∣.To handle the first factor on the right side, it follows from Lemma D.7 thatthe event

(D.30) maxj∈J (kn)

∣∣∣( σjσj )τn − 1∣∣∣ ≤ c · n−1/2 · log(n)

holds with probability at least 1− cn . Next, consider the random variable

(D.31) U? := maxj∈J (kn)

|S?n,j/στnj |.

It suffices to show there is possibly larger constant c > 0, such that the event

(D.32) P(U? ≥ c log(n)3/2

∣∣∣X) ≤ 1n

holds with probability at least 1 − cn . Using Chebyshev’s inequality with

q ≥ log(n) gives

P(U? ≥ e

(E[|U?|q|X]

)1/q∣∣∣X) ≤ e−q.Likewise, if the event

(D.33) (E[|U?|q|X])1/q ≤ c log(n)3/2

holds for some constant c > 0, then the event (D.32) also holds. For thispurpose, the argument in the proof of Lemma C.1(b) can be essentiallyrepeated with q = max{ 2

βn, log(n), 3

}to show that the event (D.33) holds

with probability at least 1− cn . The main detail to notice when repeating the

argument is that U? involves a maximum over J (kn), whereas the argumentfor Lemma C.1(b) involves a maximum over J (kn)c. This distinction can behandled by using the bound (D.6) in Lemma D.2.

The case when p ≤ kn. The previous proofs relied on the condition p > kn

only insofar as this implies kn ≥ n1

log(n)a and `n ≥ log(n)3. (These conditionsare used in the analyses of In and IIIn, as well as I′n and III′n(X).) However,if p ≤ kn, then the definition of kn implies that p = kn, which causes thequantities In, IIIn, I′n(X) and III′n(X) to become exactly 0. In this case,the proofs of Theorems 3.1 and 3.2 reduce to bounding IIn and II′n(X), andthese arguments can be repeated as before.

Page 47: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

BOOTSTRAPPING MAX STATISTICS 47

APPENDIX E: PROOF OF THEOREM 5.1

Notation and remarks. An important piece of notation for this appendix(and the next one) is the integer dn, which we define to be the largest indexdn ∈ {1, . . . , p} such that σ2

(dn) ≥ ε20/√n, with ε0 as in Assumption 5.1. (Such

an index must exist under Assumption 5.1.) Also, it is simple to check thatkn ≤ dn holds for all large n under Assumption 5.1. Furthermore, as in theprevious appendices, we will assume p > kn, since the case p ≤ kn can behandled using similar reasoning to that explained in the previous paragraph.Lastly, it will be helpful to note that σ2

(j) = π(j)(1 − π(j)), since it can bechecked that πi ≤ πj implies σi ≤ σj .

Outline of proof. Since the number dn will often play the role that p didin previous proofs, we will use a slightly different notation for the analoguesof the earlier quantities In, IIn, II′n(X), and III′n(X). This will also serve asa reminder that new details are involved in the context of the multinomialmodel. The new quantities are:

In = dK

(L(Mdn) , L(Mkn)

)IIn = dK

(L(Mkn) , L(Mkn)

)II′n(X) = dK

(L(Mkn) , L(M?

kn |X))

III′n(X) = dK

(L(M?

kn |X), L(M?

dn |X)).

The overall structure of the proof is based on the simple bound

dK

(L(M) , L(M?|X)

)≤ dK

(L(M) , L(Mdn)

)+ In + IIn + II′n(X) + III′n(X)

+ dK

(L(M?

dn |X) , L(M?|X)).

Most of the proof will be completed through four separate lemmas, showingthat each of the quantities in Roman numerals is at most of order n−1/2+δ.These lemmas are labeled as E.1 (for IIn), E.2 (for In), E.3 (for II′n(X))and E.4 (for III′n(X)). Finally, the other two quantities in the first andthird lines will be shown to be at most of order n−1/2+δ in Lemma E.5.

Lemma E.1. Fix any number δ ∈ (0, 1/2), and suppose the conditionsof Theorem 5.1 hold. Then,

(E.1) IIn . n−1/2+δ.

Page 48: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

48 M. E. LOPES ET AL.

Proof. The argument is rougly similar to the proof of Proposition B.1,and we retain some of the notation used there. For the new proof, define thekn × kn matrix

(E.2) Σ(kn) := ΠknΣΠ>kn ,

where Πkn ∈ Rkn×p is the projection onto the coordinates indexed by J (kn),as explained on page 26. Let rn denote the rank of Σ(kn), and let

Σ(kn) = QΛrnQ>

be a spectral decomposition of Σ(kn), where Q ∈ Rkn×rn has orthonormalcolumns, and Λ ∈ Rrn×rn is diagonal and invertible. In addition, define

C = Λ1/2rn Q

>D−τnkn,

as well as the rn-dimensional random vector

Z ′i = Λ−1/2rn Q>Πkn(Xi − π),

and the sample averageZ ′n = 1√

n

∑ni=1 Z

′i.

Since the random vector Πkn(Xi−π) lies in the column span of Σ(kn) almostsurely, it follows that the relation QQ>Πkn(Xi − π) = Πkn(Xi − π) holdsalmost surely. In turn, this gives

QΛ1/2rn Z

′i = Πkn(Xi − π),

and henceD−τnkn

ΠknSn = C>Z ′n.

Consequently, for any t ∈ R, there is a Borel convex set At ⊂ Rrn such that

P(Mkn ≤ t) = P

(max

j∈J (kn)D−τnkn

ΠknSn,j ≤ t)

= P(Z ′n ∈ At

).

Similar reasoning also can be applied to Mkn in order to obtain the expres-sion P(Mkn ≤ t) = γrn(At), where γrn is the standard Gaussian distributionon Rrn . Therefore, we have the bound

(E.3) IIn ≤ supA∈A

∣∣∣P(Z ′ ∈ A)− γrn(A)∣∣∣,

with A being the collection of Borel convex subsets of Rrn .

Page 49: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

BOOTSTRAPPING MAX STATISTICS 49

Since the vectors Z ′1, . . . , Z′n are i.i.d., with mean 0 and identity covariance

matrix, we may apply the Berry-Esseen bound of Bentkus (Lemma G.1). Theonly remaining detail is to bound E[‖Z ′1‖32], and show that it is at most afixed power of kn. To do this, Lemma F.3 implies there is a constant c > 0not depending on n such that the following inequalities hold with probability1,

‖Z ′1‖22 ≤ 1λrn (Σ(kn))

∥∥Πkn(Xi − π)∥∥2

2

≤ ckcn · kn,

where we have used the facts that ‖Q‖op ≤ 1, and the entries of Πkn(Xi−π)

are bounded in magnitude by 1. Thus, we have E[‖Z ′1‖32] . k(3/2)(c+1)n , which

completes the proof.

Lemma E.2. Fix any number δ ∈ (0, 1/2), and suppose the conditionsof Theorem 5.1 hold. Then,

(E.4) In . n−1/2+δ.

Proof. By repeating the proof of Proposition B.2, it follows that if t1,n

and t2,n are any two real numbers with t1,n ≤ t2,n, then

(E.5) In ≤ P(A(t2,n)) + P(B(t1,n)),

where we write

A(t) ={

maxj∈J (kn)

Sn,j/στnj ≤ t

}and B(t) =

{max

j∈J (dn)\J (kn)Sn,j/σ

τnj > t

},

for any t ∈ R. (Note that B(t) differs from B(t) only insofar as J (dn)\J (kn)replaces J (kn)c.)

To handle the probability P(A(t2,n)), we mimic the definition (B.14) andlet

(E.6) t2,n = ε0 · `−βnn ·√

log(`n),

with ε0 ∈ (0, 1) as in Assumption 5.1. Having made this choice, the proof ofLemma B.1(a) may be repeated essentially verbatim to show that P(A(t2,n)) .n−1/2+δ. In particular, it is important to note that the correlation ma-trix R(`n) in the multinomial case satisfies the conditions needed for thatargument to work, because R+(`n) = I`n . Also, this argument relies onLemma E.1 in the same way that the proof of Lemma B.1(a) relies onProposition B.1.

Page 50: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

50 M. E. LOPES ET AL.

To handle P(B(t1,n)), the proof of Lemma B.1(b) can be mostly repeatedto show that this probability is of order 1/n. However, there are a fewdifferences. First, we may regard the α from the context of Lemma B.1(b)as being equal to 1/2, due to the basic fact that σ(j) ≤ j−1/2 always holdsin the multinomial model. Likewise, we define

(E.7) t1,n = c · k−(1−τn)/2n · log(n)

as the analogue of t1,n in (B.13). The only other issues to notice are thatthe set J (kn)c is replaced by J (dn) \ J (kn), and that we must verify thefollowing condition. Namely, there is a constant c > 0 not depending on nsuch that the inequality

maxj∈J (dn)

∥∥ 1σjSn,j

∥∥q≤ cq

holds when q = max{ 2(1/2)(1−τn) , log(n), 3}. This will be verified later in

Lemma F.5. Finally, it is simple to check that t1,n ≤ t2,n holds for all largen.

Lemma E.3. Fix any number δ ∈ (0, 1/2), and suppose the conditionsof Theorem 5.1 hold. Then, there is a constant c > 0 not depending on nsuch that the event

(E.8) II′n(X) ≤ c n−1/2+δ

holds with probability at least 1− cn .

Proof. Let the random variable M?kn

be as defined in the proof of Propo-sition C.1, and consider the triangle inequality

(E.9) II′n(X) ≤ dK

(L(Mkn) ,L(M?

kn |X))

+ dK

(L(M?

kn |X) , L(M?kn |X)

).

Regarding the second term on the right, the proof of Proposition C.1 showsthat this term can be controlled with a coupling inequality for M?

knand

M?kn

, as well as an anti-concentration inequality for M?kn

. In the multino-mial context, both of these inequalities can be established using the sameoverall approach as before. The main items that need to be updated areto replace J (kn)c with J (dn) \ J (kn), and to control quantities involving{σj |j ∈ J (dn)} by using Lemmas (F.6, F.2) instead of Lemmas (D.6, D.1).

Next, to control the first term in the bound (E.9), we will use an argumentbased on Lemma D.3, which requires quite a few pieces of notation. First,

Page 51: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

BOOTSTRAPPING MAX STATISTICS 51

let Σ(kn) = ΠknΣΠ>kn ∈ Rkn×kn , and let the rank of this matrix be denotedby rn. Next, write the spectral decomposition of Σ(kn) as

Σ(kn) = QΛrnQ>,

where Q ∈ Rkn×rn has orthonormal columns and Λrn ∈ Rrn×rn is diagonal.In addition define

Σ(kn) = ΠknΣΠ>kn

Wn = Λ−1/2rn Q>Σ(kn)QΛ−1/2

rn .(E.10)

Since each vector Πkn(Xi − X) lies in the column span of Σ(kn) almostsurely, it follows that the relation QQ>Σ(kn)QQ> = Σ(kn) holds almostsurely, which is equivalent to

Σ(kn) = QΛ1/2rn WnΛ1/2

rn Q>.

With this in mind, let C = Λ1/2rn Q

>D−τnkn, and also define

S = C>C

S = C>WnC.(E.11)

It is straightforward to check that Mkn is the coordinate-wise maximum of aGaussian vector drawn from N(0,S), and similarly, M?

knis the coordinate-

wise maximum of a Gaussian vector drawn from N(0, S).We will now compare Mkn and M?

knby applying Lemma D.3. For this

purpose, let

(E.12) C = ULV >

be an s.v.d. for C, where the matrix L ∈ Rrn×rn is diagonal and invert-ible, and the matrices U ∈ Rrn×rn and V ∈ Rkn×rn each have orthonormalcolumns. From these definitions, it follows that

(E.13)(V >SV

)−1/2(V >SV

)(V >SV

)−1/2= U>WnU.

Thus, the matrix (V >SV )1/2 will play the role of H in the statement ofLemma D.3. Also, in order to apply that lemma, we need that the columnsof S and S span the same subspace of Rkn (with high probability). Notingthat S = V L2V > and S = V L(U>WnU)LV >, it follows that S and S havethe same column span whenever U>WnU is invertible, and by Lemma F.4,

Page 52: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

52 M. E. LOPES ET AL.

this holds with probability at least 1− c/n. Therefore, Lemma D.3 ensuresthat if the event

(E.14) ‖U>WnU − Irn‖op ≤ ε,

holds for some number ε > 0, then the event

(E.15) dK

(L(Mkn) ,L(M?

kn |X))≤ c · k1/2

n · ε

also holds. Finally, Lemma F.4 shows that if we take ε = c log(n)kcn/√n, then

the event (E.14) holds with probability at least 1 − c/n, which completesthe proof.

Lemma E.4. Fix any number δ ∈ (0, 1/2), and suppose the conditionsof Theorem 5.1 hold. Then, there is a constant c > 0 not depending on nsuch that the event

(E.16) III′n(X) ≤ c n−1/2+δ

holds with probability at least 1− cn .

Proof. The proof follows the argument outlined in Section C.1. Themain details to be updated for the multinomial context arise in controllingthe quantities {σj |j ∈ J (dn)}. Specifically, the index set J (kn)c must bereplaced with J (dn) \ J (kn), and Lemmas (F.6, F.2) must be used in placeof Lemmas (D.6, D.1).

Lemma E.5. Suppose the conditions of Theorem 5.1 hold. Then, thereis a constant c > 0 not depending on n, such that

(E.17) dK

(L(M),L(Mdn)

)≤ In + c

n ,

and the event

(E.18) dK

(L(M?|X),L(M?

dn |X))≤ III′n(X)

occurs with probability at least 1− cn .

Proof. Fix any t ∈ R. By intersecting the event {maxj∈Jn Sn,j/σ

τnj ≤ t}

with the events {J (kn) ⊂ Jn} and {J (kn) 6⊂ Jn}, and noting that themaximum can only become smaller on a subset, we have

P(

maxj∈Jn

Sn,j/στnj ≤ t

)≤ P

(max

j∈J (kn)Sn,j/σ

τnj ≤ t

)+ P

(Jkn 6⊂ Jn).

Page 53: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

BOOTSTRAPPING MAX STATISTICS 53

Therefore, by subtracting from both sides the probability involving the max-imum over J (dn), we have

(E.19) P(

maxj∈Jn

Sn,j/στnj ≤ t

)− P

(max

j∈J (dn)Sn,j/σ

τnj ≤ t

)≤ In + P

(Jkn 6⊂ Jn).

Similarly, by intersecting with the events {J ⊂ J (dn)} and {Jn 6⊂ J (dn)},we have

P(

maxj∈J (dn)

Sn,j/στnj ≤ t

)≤ P

(maxj∈Jn

Sn,j/στnj ≤ t

)+ P(Jn 6⊂ J (dn)).

Next, subtracting the probability involving the maximum over Jn gives

(E.20) P(

maxj∈J (dn)

Sn,j/στnj ≤ t

)− P

(maxj∈Jn

Sn,j/στnj ≤ t

)≤ 0 + P(Jn 6⊂ J (dn)).

Combining (E.19) and (E.20) implies

dK

(L(M) , L(Mdn)

)≤ In + P

(Jkn 6⊂ Jn) + P(Jn 6⊂ J (dn)),

and furthermore, Lemma F.1 shows that the last two probabilities on theright are at most c/n. This proves (E.17). The inequality (E.18) follows bysimilar reasoning, and is actually easier, because conditioning on X allowsus to work under the assumption that the events {J (kn) ⊂ Jn} and {Jn ⊂J (dn)} hold, since they occur with probability at least 1− c

n .

APPENDIX F: TECHNICAL LEMMAS FOR THEOREM 5.1

Lemma F.1. Suppose the conditions of Theorem 5.1 hold. Then, withprobability at least 1− c

n , the following two events hold simultaneously,

(F.1) J (kn) ⊂ Jn,

and

(F.2) Jn ⊂ J (dn).

Proof. We first address the event (F.1). By a union bound, the following inequalities hold for all large $n$,

$P\big(\mathcal{J}(k_n) \not\subset \hat{\mathcal{J}}_n\big) \;\leq\; \sum_{j \in \mathcal{J}(k_n)} P\Big(\hat{\pi}_j < \sqrt{\log(n)/n}\Big)$

(F.3)    $\leq\; k_n \cdot \max_{j \in \mathcal{J}(k_n)} P\Big(\sqrt{n}\,|\hat{\pi}_j - \pi_j| > n^{1/4}\Big),$

where the last step follows from the fact that if $j \in \mathcal{J}(k_n)$, then the crude (but adequate) inequality $\sqrt{n}\,\pi_j - \sqrt{\log(n)} \geq n^{1/4}$ holds for all large $n$. In turn, Hoeffding's inequality (van der Vaart and Wellner, 2000, p. 460) implies

$k_n \cdot \max_{j \in \mathcal{J}(k_n)} P\Big(\sqrt{n}\,|\hat{\pi}_j - \pi_j| > n^{1/4}\Big) \;\leq\; 2 k_n e^{-2\sqrt{n}} \;\lesssim\; \tfrac{1}{n},$

which establishes (F.1).

We now turn to (F.2). Under Assumption 5.1, it is simple to check that $\mathcal{J}(k_n) \subset \mathcal{J}(d_n)$ holds for all large $n$. Consequently, in the low-dimensional situation where $k_n = p$, we must have $\mathcal{J}(d_n) = \mathcal{J}(p)$ for all large $n$, and then (F.2) is clearly true. Hence, we may work in the situation where $k_n < p$. To begin, observe that Assumption 5.1 gives $\pi_{(1)}(1 - \pi_{(1)}) = \sigma_{(1)}^2 \geq \varepsilon_0^2$, which implies $1 - \pi_{(1)} \geq \varepsilon_0^2$, and hence $1 - \pi_j \geq \varepsilon_0^2$ for all $j \in \{1, \dots, p\}$. Based on this observation, if we consider the set

$\mathcal{J}_n' \;:=\; \Big\{ j \in \{1, \dots, p\} \;\Big|\; \pi_j \geq 1/\sqrt{n} \Big\},$

then $j \in \mathcal{J}_n'$ implies $\sigma_j^2 = \pi_j(1 - \pi_j) \geq \varepsilon_0^2/\sqrt{n}$. Now, recall that $\mathcal{J}(d_n)$ is defined so that $j \in \mathcal{J}(d_n) \iff \sigma_j^2 \geq \varepsilon_0^2/\sqrt{n}$. As a result of this definition, we have $\mathcal{J}_n' \subset \mathcal{J}(d_n)$. Therefore, in order to show that the event $\{\hat{\mathcal{J}}_n \subset \mathcal{J}(d_n)\}$ holds with probability at least $1 - c/n$, it suffices to show that the event $\{\hat{\mathcal{J}}_n \subset \mathcal{J}_n'\}$ holds with at least the same probability.

To proceed, observe that the following inclusion always holds,

(F.4)    $\hat{\mathcal{J}}_n \;\subset\; \Big\{ j \in \{1, \dots, p\} \;\Big|\; \sqrt{n}\,\pi_j \geq \sqrt{\log(n)} - \max_{1 \leq j \leq p} \sqrt{n}\,(\hat{\pi}_j - \pi_j) \Big\}.$

Therefore, if we can show that the event

$\mathcal{E} \;:=\; \Big\{ \sqrt{\log(n)} - \max_{1 \leq j \leq p} \sqrt{n}\,(\hat{\pi}_j - \pi_j) \geq 1 \Big\}$

occurs with probability at least $1 - c/n$, then the event $\{\hat{\mathcal{J}}_n \subset \mathcal{J}_n'\}$ will also occur with probability at least $1 - c/n$. Now, consider the union bound

(F.5)    $P(\mathcal{E}^c) \;\leq\; \sum_{j=1}^{p} P\Big(\sqrt{n}\,(\hat{\pi}_j - \pi_j) > \sqrt{\log(n)} - 1\Big).$

We will bound this sum by considering two different sets of indices. For the indices $j \in \mathcal{J}(k_n)^c$, the values $\pi_j$ are mostly small. This motivates the use of Kiefer's inequality (Lemma G.5), which implies there is some $c > 0$ not depending on $n$, such that the following bound holds for all large $n$, and $j \in \mathcal{J}(k_n)^c$,

(F.6)    $P\Big(\sqrt{n}\,(\hat{\pi}_j - \pi_j) > \sqrt{\log(n)} - 1\Big) \;\leq\; 2\exp\Big\{ -c\,\log(n)\log\big(\tfrac{1}{\pi_j}\big) \Big\} \;=\; 2\,\pi_j^{c\log(n)}.$

On the other hand, if $j \in \mathcal{J}(k_n)$, then $\pi_j$ is of moderate size, and Hoeffding's inequality (van der Vaart and Wellner, 2000, p. 460) implies the following inequality for all large $n$,

(F.7)    $P\Big(\sqrt{n}\,(\hat{\pi}_j - \pi_j) > \sqrt{\log(n)} - 1\Big) \;\leq\; 2\exp\big\{ -(3/2)\log(n) \big\} \;=\; 2n^{-3/2}.$

(The constant $3/2$ in the exponent has been chosen for simplicity, and is not of special importance.) Combining the two different types of bounds, and using the fact that any probability vector satisfies $\pi_{(j)} \leq j^{-1}$ for all $j \in \{1, \dots, p\}$, we obtain

(F.8)    $P(\mathcal{E}^c) \;\lesssim\; k_n n^{-3/2} + \sum_{j=k_n+1}^{p} \pi_{(j)}^{c\log(n)} \;\lesssim\; \tfrac{1}{n} + \int_{k_n}^{p} x^{-c\log(n)}\,dx \;\lesssim\; \tfrac{1}{n} + k_n^{-c\log(n)+1} \;\lesssim\; \tfrac{1}{n},$

as needed.
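Remark. Although not part of the formal argument, the inclusion (F.2) in Lemma F.1 is easy to examine by simulation. The following minimal Python sketch draws one multinomial sample and checks the inclusion; the power-law choice of $\pi$ and the value of $\varepsilon_0$ are assumptions made only for this illustration, not quantities fixed by Theorem 5.1.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 5000, 300
    pi = 1.0 / np.arange(1, p + 1) ** 2
    pi = pi / pi.sum()                        # illustrative power-law probabilities
    sigma2 = pi * (1 - pi)                    # coordinate-wise variances

    pi_hat = rng.multinomial(n, pi) / n       # empirical category frequencies
    J_hat = pi_hat >= np.sqrt(np.log(n) / n)  # empirical index set \hat{J}_n
    eps0 = 0.1                                # illustrative stand-in for epsilon_0
    J_d = sigma2 >= eps0 ** 2 / np.sqrt(n)    # deterministic index set J(d_n)

    # The inclusion \hat{J}_n \subset J(d_n) from (F.2), checked on this sample.
    print("inclusion holds:", not np.any(J_hat & ~J_d))

In repeated draws, the inclusion fails only rarely, in line with the $1 - c/n$ guarantee.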

Lemma F.2. Suppose the conditions of Theorem 5.1 hold, and let $q = \max\big\{ \frac{2}{(1/2)(1-\tau_n)},\, \log(n),\, 3 \big\}$. Then, there is a constant $c > 0$ not depending on $n$, such that for any $j \in \mathcal{J}(d_n)$, we have

(F.9)    $\|\hat{\sigma}_j\|_q \;\leq\; c \cdot \sigma_j \cdot \sqrt{q}.$

Page 56: Submitted to the Annals of Statistics · 1. Introduction. One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional

56 M. E. LOPES ET AL.

Proof. By direct calculation (recalling that $\|\hat{\sigma}_j\|_q = \|\hat{\sigma}_j^2\|_{q/2}^{1/2}$, that $\hat{\sigma}_j^2 = \frac{1}{n}\sum_{i=1}^n X_{i,j}^2 - \bar{X}_j^2$, and that $X_{i,j}^2 = X_{i,j}$ for Bernoulli coordinates),

$\frac{1}{\sigma_j}\|\hat{\sigma}_j\|_q \;=\; \frac{1}{\sigma_j}\Big\| \tfrac{1}{n}\textstyle\sum_{i=1}^n \big(X_{i,j}^2 - \pi_j\big) + \big(\pi_j - \pi_j^2\big) + \big(\pi_j^2 - \bar{X}_j^2\big) \Big\|_{q/2}^{1/2}$

$\leq\; \frac{1}{\sigma_j}\Big( \big\| \tfrac{1}{n}\textstyle\sum_{i=1}^n \big(X_{i,j}^2 - \pi_j\big) \big\|_{q/2}^{1/2} + \big(\pi_j - \pi_j^2\big)^{1/2} + \big\| \pi_j^2 - \bar{X}_j^2 \big\|_{q/2}^{1/2} \Big)$

$\leq\; \frac{1}{\sigma_j n^{1/4}} \big\| S_{n,j} \big\|_{q/2}^{1/2} \;+\; 1 \;+\; \frac{\sqrt{2}}{\sigma_j n^{1/4}} \big\| S_{n,j} \big\|_{q/2}^{1/2}.$

Since $1/(\sigma_j n^{1/4}) \leq 1/\varepsilon_0$ when $j \in \mathcal{J}(d_n)$, it follows from Lemma G.4 that the first and third terms are at most of order $\sqrt{q}$.

Remark. For the next lemma, put $\pi_{k_n} := (\pi_{(1)}, \dots, \pi_{(k_n)})$, and recall the definition $\Sigma(k_n) = \Pi_{k_n}\Sigma\Pi_{k_n}^{\top}$ from line (E.2).

Lemma F.3. If $r_n$ denotes the rank of $\Sigma(k_n)$, and the conditions of Theorem 5.1 hold, then there is a constant $c > 0$ not depending on $n$ such that

(F.10)    $\lambda_{r_n}(\Sigma(k_n)) \;\gtrsim\; k_n^{-c}.$

Proof. We first consider the case $k_n < p$, and then handle the case $k_n = p$ separately at the end of the proof. Under the multinomial model, we have for all $i, j \in \{1, \dots, k_n\}$,

$\Sigma_{i,j}(k_n) \;=\; \begin{cases} \pi_{(i)}(1 - \pi_{(i)}) & \text{if } i = j, \\ -\pi_{(i)}\pi_{(j)} & \text{if } i \neq j. \end{cases}$

For each $i \in \{1, \dots, k_n\}$, define the "deleted row sum",

$\varrho_i \;:=\; \sum_{j \neq i} \big|\Sigma_{i,j}(k_n)\big|.$

By the Geršgorin disc theorem (Horn and Johnson, 1990, Sec. 6.1),

(F.11)    $\lambda_{r_n}(\Sigma(k_n)) \;\geq\; \lambda_{\min}(\Sigma(k_n)) \;\geq\; \min_{1 \leq i \leq k_n} \big\{ \Sigma_{i,i}(k_n) - \varrho_i \big\} \;=\; \min_{1 \leq i \leq k_n} \pi_{(i)}\Big(1 - \sum_{j=1}^{k_n} \pi_{(j)}\Big) \;=\; \pi_{(k_n)}\Big(1 - \sum_{j=1}^{k_n} \pi_{(j)}\Big).$

When $k_n < p$, it follows from Assumption 5.1 that

(F.12)    $1 - \sum_{j=1}^{k_n} \pi_{(j)} \;\geq\; \pi_{(k_n+1)} \;\geq\; \sigma^2_{(k_n+1)} \;\geq\; \varepsilon_0^2 (k_n + 1)^{-2\alpha} \;\gtrsim\; k_n^{-2\alpha}.$

Hence, the previous steps show that $\lambda_{r_n}(\Sigma(k_n))$ is at least of order $k_n^{-4\alpha}$.

Finally, consider the case when $k_n = p$. In this case, it is a basic fact that the rank of $\Sigma = \Sigma(k_n)$ satisfies $r_n = p - 1$ (where we note that Assumption 5.1 ensures $\pi_{(p)} > 0$). Also, it is known from matrix analysis (Benasseni, 2012, Theorem 1) that

$\lambda_{p-1}(\Sigma) \;\geq\; \pi_{(p)},$

and therefore Assumption 5.1 leads to

(F.13)    $\lambda_{r_n}(\Sigma(k_n)) \;\geq\; \pi_{(k_n)} \;\geq\; \sigma^2_{(k_n)} \;\geq\; \varepsilon_0^2 k_n^{-2\alpha},$

as needed.
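Remark. The Geršgorin step (F.11) admits a quick numerical check. In the sketch below, the power-law probability vector and the block size are assumptions made only for illustration; the script compares the smallest eigenvalue of a leading block of the multinomial covariance with the lower bound $\pi_{(k)}\big(1 - \sum_{j \leq k} \pi_{(j)}\big)$.

    import numpy as np

    p, k = 12, 5
    pi = 1.0 / np.arange(1, p + 1) ** 2
    pi = pi / pi.sum()                          # probabilities, sorted decreasing
    Sigma = np.diag(pi) - np.outer(pi, pi)      # multinomial covariance matrix
    block = Sigma[:k, :k]                       # Sigma(k) for the top k categories

    bound = (pi[:k] * (1.0 - pi[:k].sum())).min()   # right-hand side of (F.11)
    lam_min = np.linalg.eigvalsh(block).min()
    print(lam_min, bound, lam_min >= bound)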

Lemma F.4. Let the deterministic matrix $U \in \mathbb{R}^{r_n \times r_n}$ and the random matrix $\hat{W}_n \in \mathbb{R}^{r_n \times r_n}$ be as defined in (E.12) and (E.10) respectively. Also, suppose that the conditions of Theorem 5.1 hold. Then, there is a constant $c > 0$ not depending on $n$ such that the event

(F.14)    $\big\| U^{\top} \hat{W}_n U - I_{r_n} \big\|_{\mathrm{op}} \;\leq\; \frac{c\, k_n^c \log(n)}{\sqrt{n}}$

holds with probability at least $1 - \frac{c}{n}$.

Proof. Let the notation from the proof of Lemma E.3 be in force, and observe that

$\big\| U^{\top} \hat{W}_n U - I_{r_n} \big\|_{\mathrm{op}} \;=\; \Big\| U^{\top}\big( \Lambda_{r_n}^{-1/2} Q^{\top} \hat{\Sigma}(k_n) Q \Lambda_{r_n}^{-1/2} - I_{r_n} \big) U \Big\|_{\mathrm{op}} \;\leq\; \Big\| \Lambda_{r_n}^{-1/2} Q^{\top}\big( \hat{\Sigma}(k_n) - \Sigma(k_n) \big) Q \Lambda_{r_n}^{-1/2} \Big\|_{\mathrm{op}} \;\leq\; \frac{1}{\lambda_{r_n}(\Sigma(k_n))} \big\| \hat{\Sigma}(k_n) - \Sigma(k_n) \big\|_{\mathrm{op}}.$

With regard to the first factor in the previous line, Lemma F.3 implies $\frac{1}{\lambda_{r_n}(\Sigma(k_n))} \lesssim k_n^c$.

To complete the proof, let $\pi_{k_n} = \Pi_{k_n}\pi$, and let $u \in \mathbb{R}^{k_n}$ be a generic unit vector. Consider the decomposition

$\big| u^{\top}\big( \hat{\Sigma}(k_n) - \Sigma(k_n) \big) u \big| \;\leq\; \bigg| \frac{1}{n}\sum_{i=1}^n \Big( (X_i^{\top}\Pi_{k_n}^{\top}u)^2 - u^{\top}\mathrm{diag}(\pi_{k_n})u \Big) \bigg| + \Big| (\bar{X}^{\top}\Pi_{k_n}^{\top}u)^2 - (\pi_{k_n}^{\top}u)^2 \Big| \;=:\; \Delta_n(u) + \Delta_n'(u).$

In order to control these terms, note that each random variable $(X_i^{\top}\Pi_{k_n}^{\top}u)^2$ is bounded in magnitude by 1, and has expectation equal to $u^{\top}\mathrm{diag}(\pi_{k_n})u$. In addition, we have

$|\Delta_n'(u)| \;\leq\; 2\,\big| \bar{X}^{\top}\Pi_{k_n}^{\top}u - \pi_{k_n}^{\top}u \big|.$

Based on these observations, the proof of Lemma D.5 can be essentially repeated to show that there is a constant $c > 0$ not depending on $n$ such that the event

$\big\| \hat{\Sigma}(k_n) - \Sigma(k_n) \big\|_{\mathrm{op}} \;\leq\; \frac{c \log(n)\, k_n}{\sqrt{n}}$

holds with probability at least $1 - \frac{c}{n}$. This completes the proof.
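Remark. As an informal illustration of the final display, the operator-norm error of the sample covariance can be simulated directly from one-hot multinomial vectors. The sample size, block size, and power-law probabilities below are assumptions made only for this sketch.

    import numpy as np

    rng = np.random.default_rng(1)
    p, k, n = 50, 10, 4000
    pi = 1.0 / np.arange(1, p + 1) ** 2
    pi = pi / pi.sum()

    X = rng.multinomial(1, pi, size=n)               # n one-hot rows
    Sigma = np.diag(pi) - np.outer(pi, pi)           # population covariance
    Sigma_hat = np.cov(X, rowvar=False, bias=True)   # (1/n)-normalized sample covariance

    err = np.linalg.norm(Sigma_hat[:k, :k] - Sigma[:k, :k], ord=2)
    print(err, np.log(n) * k / np.sqrt(n))           # observed error vs. log(n) k / sqrt(n)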

Lemma F.5. Let $q = \max\big\{ \frac{2}{(1/2)(1-\tau_n)},\, \log(n),\, 3 \big\}$, and suppose that the conditions of Theorem 5.1 hold. Then, there is a constant $c > 0$ not depending on $n$, such that

(F.15)    $\max_{j \in \mathcal{J}(d_n)} \big\| \tfrac{1}{\sigma_j} S_{n,j} \big\|_q \;\leq\; c\, q.$

In addition, the following event holds with probability 1,

(F.16)    $\max_{j \in \mathcal{J}(d_n)} \Big( E\big[ \big| \tfrac{1}{\sigma_j} S^{\star}_{n,j} \big|^q \,\big|\, X \big] \Big)^{1/q} \;\leq\; c\, q.$

Proof. The second inequality can be obtained by repeating the proof of Lemma D.4, since $S^{\star}_n$ is still Gaussian under the setup of Assumption 5.1. To prove the first inequality, note that since $q > 2$, Lemma G.4 gives

(F.17)    $\big\| \tfrac{1}{\sigma_j} S_{n,j} \big\|_q \;\lesssim\; q \cdot \max\Big\{ \big\| \tfrac{1}{\sigma_j} S_{n,j} \big\|_2 \,,\; n^{-1/2+1/q}\, \big\| \tfrac{1}{\sigma_j}(X_{1,j} - \pi_j) \big\|_q \Big\}.$

Clearly,

$\big\| \tfrac{1}{\sigma_j} S_{n,j} \big\|_2^2 \;=\; \mathrm{var}\big( \tfrac{1}{\sigma_j} S_{n,j} \big) \;=\; 1.$

For the stated choice of $q$, the second term inside the maximum satisfies

$n^{-1/2+1/q}\, \big\| \tfrac{1}{\sigma_j}(X_{1,j} - \pi_j) \big\|_q \;\lesssim\; \frac{1}{\sqrt{n}\,\sigma_j},$

and also, since $j \in \mathcal{J}(d_n)$, we have $\frac{1}{\sqrt{n}\,\sigma_j} \lesssim \frac{1}{n^{1/4}}$, which leads to the stated claim.

Lemma F.6. Suppose the conditions of Theorem 5.1 hold. Then, there is a constant $c > 0$ not depending on $n$ such that the events

(F.18)    $\max_{j \in \mathcal{J}(k_n)} \Big| \frac{\hat{\sigma}_j}{\sigma_j} - 1 \Big| \;\leq\; \frac{c\, k_n^c \sqrt{\log(n)}}{n^{1/2}},$

and

(F.19)    $\min_{j \in \mathcal{J}(k_n)} \hat{\sigma}_j^{1-\tau_n} \;\geq\; \Big( \min_{j \in \mathcal{J}(k_n)} \sigma_j^{1-\tau_n} \Big) \cdot \Big( 1 - \frac{c\, k_n^c \sqrt{\log(n)}}{n^{1/2}} \Big)$

each hold with probability at least $1 - \frac{c}{n}$.

Proof. Note that if (F.18) occurs, then (F.19) also occurs, and so we only deal with the former event. Fix any number $\kappa \geq 2$. By a union bound, it suffices to show there are positive constants $c$ and $c_1(\kappa)$ not depending on $n$, such that for any $j \in \mathcal{J}(k_n)$, the event

(F.20)    $\Big| \frac{\hat{\sigma}_j^2}{\sigma_j^2} - 1 \Big| \;\leq\; \frac{c_1(\kappa)\, k_n^c \sqrt{\log(n)}}{n^{1/2}}$

holds with probability at least $1 - \frac{c}{n^{\kappa}}$. To this end, observe that

$\Big| \frac{\hat{\sigma}_j^2}{\sigma_j^2} - 1 \Big| \;\leq\; \Big| \frac{1}{\sigma_j^2 n} \sum_{i=1}^n \big(X_{i,j}^2 - \pi_j\big) \Big| \;+\; \frac{1}{\sigma_j^2} \big| \pi_j^2 - \bar{X}_j^2 \big| \;\leq\; \frac{3}{\sigma_j^2 \sqrt{n}}\, |S_{n,j}|.$

Due to Assumption 5.1, and the fact that $j \in \mathcal{J}(k_n)$, we have

$\frac{1}{\sigma_j^2 \sqrt{n}} \;\lesssim\; \frac{k_n^{2\alpha}}{\sqrt{n}}.$

In addition, Hoeffding's inequality ensures there is a constant $c_1(\kappa)$ such that the event

$|S_{n,j}| \;\leq\; c_1(\kappa)\sqrt{\log(n)}$

holds with probability at least $1 - \frac{c}{n^{\kappa}}$.
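Remark. The relative-error bound (F.18) can be illustrated in the same spirit. The sketch below uses a single Bernoulli coordinate with an illustrative value of $\pi_j$, and compares the observed error $|\hat{\sigma}_j/\sigma_j - 1|$ with the $\sqrt{\log(n)}/n^{1/2}$ scale.

    import numpy as np

    rng = np.random.default_rng(2)
    n, pi_j = 2000, 0.3                       # illustrative values only
    x = rng.binomial(1, pi_j, size=n)         # one Bernoulli coordinate
    sigma = np.sqrt(pi_j * (1 - pi_j))        # true sigma_j
    sigma_hat = x.std()                       # plug-in estimate

    rel_err = abs(sigma_hat / sigma - 1.0)
    print(rel_err, np.sqrt(np.log(n) / n))    # error vs. sqrt(log(n)/n) scale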


APPENDIX G: BACKGROUND RESULTS

The following result is a multivariate version of the Berry-Esseen theorem due to Bentkus (2003).

Lemma G.1 (Bentkus' multivariate Berry-Esseen theorem). Let $V_1, \dots, V_n$ be i.i.d. random vectors in $\mathbb{R}^d$, with zero mean and identity covariance matrix. Furthermore, let $\gamma_d$ denote the standard Gaussian distribution on $\mathbb{R}^d$, and let $\mathcal{A}$ denote the collection of all Borel convex subsets of $\mathbb{R}^d$. Then, there is an absolute constant $c > 0$ such that

(G.1)    $\sup_{A \in \mathcal{A}} \Big| P\Big( \tfrac{1}{\sqrt{n}}(V_1 + \cdots + V_n) \in A \Big) - \gamma_d(A) \Big| \;\leq\; \frac{c \cdot d^{1/4} \cdot E\big[ \|V_1\|_2^3 \big]}{n^{1/2}}.$

The following is a version of Nazarov's inequality (Nazarov, 2003; Klivans, O'Donnell and Servedio, 2008), as formulated in (Chernozhukov, Chetverikov and Kato, 2016, Lemma 4.3).

Lemma G.2 (Nazarov's inequality). Let $(\xi_1, \dots, \xi_m)$ be a multivariate normal random vector, and suppose the parameter $\sigma^2 := \min_{1 \leq j \leq m} \mathrm{var}(\xi_j)$ is positive. Then, for any $r > 0$,

(G.2)    $\sup_{t \in \mathbb{R}} P\Big( \Big| \max_{1 \leq j \leq m} \xi_j - t \Big| \leq r \Big) \;\leq\; \frac{2r}{\sigma} \cdot \big( \sqrt{2\log(m)} + 2 \big).$
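Remark. Inequality (G.2) says that the distribution of a Gaussian maximum cannot place too much mass in any interval of width $2r$. A minimal Monte Carlo check, using independent standard normal coordinates (an assumption made only for simplicity, so that $\sigma = 1$), is given below.

    import numpy as np

    rng = np.random.default_rng(3)
    m, reps, r = 100, 50000, 0.1
    M = rng.standard_normal((reps, m)).max(axis=1)    # Gaussian max, sigma = 1

    t = np.median(M)                                  # a candidate worst-case t
    lhs = np.mean(np.abs(M - t) <= r)                 # estimated P(|max - t| <= r)
    rhs = (2 * r / 1.0) * (np.sqrt(2 * np.log(m)) + 2)
    print(lhs, rhs, lhs <= rhs)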

The result below is a version of Slepian's lemma, which is adapted from (Li and Shao, 2002, Theorem 2.2). (See references therein for earlier versions of this result.)

Lemma G.3 (Slepian's lemma). Let $m \geq 3$, and let $R \in \mathbb{R}^{m \times m}$ be a correlation matrix with $\max_{i \neq j} R_{i,j} < 1$. Also, let $R^{+}$ be the matrix with $(i,j)$ entry given by $\max\{R_{i,j}, 0\}$, and suppose $R^{+}$ is positive semi-definite. Furthermore, let $\zeta \sim N(0, R)$ and $\xi \sim N(0, R^{+})$. Then, the following inequalities hold for any $t \geq 0$,

(G.3)    $P\Big( \max_{1 \leq j \leq m} \zeta_j \leq t \Big) \;\leq\; P\Big( \max_{1 \leq j \leq m} \xi_j \leq t \Big) \;\leq\; K_m(t) \cdot \Phi^m(t),$

where

(G.4)    $K_m(t) \;=\; \exp\bigg\{ \sum_{1 \leq i < j \leq m} \log\Big( \frac{1}{1 - \frac{2}{\pi}\arcsin(R^{+}_{i,j})} \Big) \exp\Big( -\frac{t^2}{1 + R^{+}_{i,j}} \Big) \bigg\}.$

The following inequalities are due to Johnson, Schechtman and Zinn (1985).


Lemma G.4 (Rosenthal's inequality with best constants). Fix $r \geq 1$ and put $\mathrm{Log}(r) := \max\{\log(r), 1\}$. Let $\xi_1, \dots, \xi_m$ be independent random variables satisfying $E[|\xi_j|^r] < \infty$ for all $1 \leq j \leq m$. Then, there is an absolute constant $c > 0$ such that the following two statements are true.

(i). If $\xi_1, \dots, \xi_m$ are non-negative random variables, then

(G.5)    $\big\| \textstyle\sum_{j=1}^m \xi_j \big\|_r \;\leq\; c \cdot \tfrac{r}{\mathrm{Log}(r)} \cdot \max\Big\{ \big\| \textstyle\sum_{j=1}^m \xi_j \big\|_1 \,,\; \big( \textstyle\sum_{j=1}^m \|\xi_j\|_r^r \big)^{1/r} \Big\}.$

(ii). If $r > 2$, and the random variables $\xi_1, \dots, \xi_m$ all have mean 0, then

(G.6)    $\big\| \textstyle\sum_{j=1}^m \xi_j \big\|_r \;\leq\; c \cdot \tfrac{r}{\mathrm{Log}(r)} \cdot \max\Big\{ \big\| \textstyle\sum_{j=1}^m \xi_j \big\|_2 \,,\; \big( \textstyle\sum_{j=1}^m \|\xi_j\|_r^r \big)^{1/r} \Big\}.$

Remark. The non-negative case is handled in (Johnson, Schechtman and Zinn, 1985, Theorem 2.5). With regard to the mean 0 case, the statement above differs slightly from (Johnson, Schechtman and Zinn, 1985, Theorem 4.1), which requires symmetric random variables, but the remark on page 247 of that paper explains why the variables $\xi_1, \dots, \xi_m$ need not be symmetric as long as they have mean 0.
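Remark. A quick Monte Carlo computation can make the two terms in (G.6) concrete. The Rademacher variables below are an assumption for this sketch only; they have mean 0, $\|\xi_j\|_r = 1$, and $\|\sum_j \xi_j\|_2 = \sqrt{m}$.

    import numpy as np

    rng = np.random.default_rng(4)
    m, r, reps = 200, 4.0, 20000
    xi = rng.choice([-1.0, 1.0], size=(reps, m))   # mean-zero Rademacher variables
    S = xi.sum(axis=1)

    lhs = np.mean(np.abs(S) ** r) ** (1.0 / r)     # estimated || sum_j xi_j ||_r
    term_2 = np.sqrt(np.mean(S ** 2))              # || sum_j xi_j ||_2, about sqrt(m)
    term_r = m ** (1.0 / r)                        # ( sum_j ||xi_j||_r^r )^{1/r}
    print(lhs, max(term_2, term_r))                # lhs within c * r/Log(r) of the max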

The result below is a sharpened version of Hoeffding's inequality for handling the binomial distribution when the success probability is small (van der Vaart and Wellner, 2000, Corollary A.6.3).

Lemma G.5 (Kiefer's inequality). Let $\xi_1, \dots, \xi_m$ be independent Bernoulli random variables with success probability $\pi_0 \in (0, 1/e)$, and let $\bar{\xi} = \frac{1}{m}\sum_{i=1}^m \xi_i$. Then, the following inequality holds for any $m \geq 1$ and $t > 0$,

(G.7)    $P\big( \sqrt{m}\,|\bar{\xi} - \pi_0| \geq t \big) \;\leq\; 2\exp\Big\{ -t^2 \Big[ \log\big(\tfrac{1}{\pi_0}\big) - 1 \Big] \Big\}.$
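Remark. For small $\pi_0$, the factor $\log(1/\pi_0) - 1$ in (G.7) yields a much smaller bound than the plain Hoeffding bound $2e^{-2t^2}$. The sketch below compares both bounds with the exact binomial tail (computed via scipy.stats), for illustrative values of $m$, $\pi_0$ and $t$.

    import numpy as np
    from scipy.stats import binom

    m, pi0, t = 1000, 0.01, 1.0                   # illustrative values only
    k = np.arange(m + 1)
    dev = np.sqrt(m) * np.abs(k / m - pi0)
    exact = binom.pmf(k, m, pi0)[dev >= t].sum()  # exact tail probability

    kiefer = 2 * np.exp(-t**2 * (np.log(1 / pi0) - 1))
    hoeffding = 2 * np.exp(-2 * t**2)
    print(exact, kiefer, hoeffding)               # kiefer is far below hoeffding here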

APPENDIX H: RELATED WORK ON GAUSSIAN APPROXIMATION

Although our focus is primarily on rates of bootstrap approximation, this topic is closely related to rates of Gaussian approximation in the central limit theorem, for which there is a long line of work in finite-dimensional and infinite-dimensional settings. We refer to the chapter (Bentkus et al., 2000) for a general survey, as well as Appendix L of the paper (Chernozhukov, Chetverikov and Kato, 2013) for a discussion that is oriented more towards high-dimensional statistics.

To describe how our work fits into this literature, we fix some notation. Let $\mathbb{B}$ denote a Banach space, with $\mathcal{A}$ being a collection of its subsets, and $\varphi : \mathbb{B} \to \mathbb{R}$ being a function. For the moment, we will regard the summands of $S_n$ as centered i.i.d. random elements of $\mathbb{B}$, and let $\tilde{S}_n$ denote a centered Gaussian element of $\mathbb{B}$ with the same covariance as $S_n$. Broadly speaking, the literature on rates of Gaussian approximation is concerned with bounding the quantities $\rho_n(\mathcal{A}) = \sup_{A \in \mathcal{A}} \rho_n(A)$, or $\|\Delta_n\|_{\infty} = \sup_{t \in \mathbb{R}} |\Delta_n(t)|$, where

(H.1)    $\rho_n(A) \;=\; \Big| P(S_n \in A) - P(\tilde{S}_n \in A) \Big|,$

(H.2)    $\Delta_n(t) \;=\; \Big| P\big( \varphi(S_n) \leq t \big) - P\big( \varphi(\tilde{S}_n) \leq t \big) \Big|.$

In particular, note that if $\mathbb{B} = \mathbb{R}^p$ and $T = \max_{1 \leq j \leq p} S_{n,j}$, then the distance $d_K(\mathcal{L}(T), \mathcal{L}(\tilde{T}))$ can be represented as either $\rho_n(\mathcal{A})$ or $\|\Delta_n\|_{\infty}$, by taking $\mathcal{A}$ to be a class of rectangles, or by taking $\varphi$ to be the coordinate-wise maximum function.

Typically, the rates at which $\rho_n(\mathcal{A})$ and $\|\Delta_n\|_{\infty}$ decrease with $n$ depend on the distribution of $S_n$, the dimension of $\mathbb{B}$, and the smoothness of $\mathcal{A}$ and $\varphi$, among other factors. Although the study of $\rho_n(\mathcal{A})$ and $\|\Delta_n\|_{\infty}$ is highly multifaceted, it is a general principle that the smoothness of $\mathcal{A}$ and $\varphi$ tends to be much more influential when $\mathbb{B}$ is infinite-dimensional, as compared to the finite-dimensional case. (A discussion may be found in (Bentkus and Götze, 1993).) Indeed, this is worth emphasizing in relation to our work, since the choices of $\mathcal{A}$ and $\varphi$ corresponding to $d_K(\mathcal{L}(T), \mathcal{L}(\tilde{T}))$ are notable for their lack of smoothness.

To illustrate this point, consider the following two facts that are known to hold when $\mathbb{B}$ is separable and infinite-dimensional (Bentkus and Götze, 1993): (1) If $E[\|X_1\|_{\mathbb{B}}^3] < \infty$, and $\varphi$ satisfies smoothness conditions stronger than having three Fréchet derivatives, then $\|\Delta_n\|_{\infty}$ is of order $O(n^{-1/2})$. (2) If the previous conditions hold, except that $\varphi$ is only assumed to have one Fréchet derivative, then an example can be constructed so that $\|\Delta_n\|_{\infty}$ is bounded below by a sequence that converges to 0 arbitrarily slowly. (Other examples of lower bounds may be found in the papers (Bentkus, 1986; Rhee and Talagrand, 1984), among others.) By contrast, in the finite-dimensional case where $\mathbb{B} = \mathbb{R}^p$ with $p$ held fixed as $n \to \infty$ and $E[\|X_1\|_2^3] < \infty$, it is known that non-smooth choices of $\varphi$ and $\mathcal{A}$ can lead to an $n^{-1/2}$ rate. Namely, it is known that $\|\Delta_n\|_{\infty}$ and $\rho_n(\mathcal{A})$ are of order $O(n^{-1/2})$ when $\varphi$ is convex, or when $\mathcal{A}$ is the class of Borel convex sets. Beyond the case where $p$ is held fixed, many other works allow $n$ and $p$ to diverge together. When $p$ grows relatively slowly compared to $n$, the leading rate for the choices of $\varphi$ and $\mathcal{A}$ just mentioned is essentially $O(p^{7/4}/n^{1/2})$ (Bentkus, 2003, 2005). See also (Sazonov, 1968; Nagaev, 1976; Senatov, 1981; Portnoy, 1986; Götze, 1991; Chen and Fang, 2011; Zhai, 2018) for additional background. Meanwhile, when $p \gg n$, it has recently been established that rates of the form $O(\log(p)^b/n^{1/6})$ can be achieved if $\varphi$ is the coordinate-wise maximum function, or if $\mathcal{A}$ is a class of "sparsely convex sets", such as rectangles (Chernozhukov, Chetverikov and Kato, 2017).

In light of the previous paragraph, it is natural that results in the infinite-dimensional setting have focused predominantly on smooth choices of $\mathcal{A}$ and $\varphi$. Nevertheless, there are some special cases where notable results have been obtained for non-smooth choices of $\varphi$ and $\mathcal{A}$ that correspond to max statistics. One such result is established in the paper (Asriev and Rotar, 1986), which deals with bounds on $\rho_n(A(r))$ for rectangular sets such as $A(r) := \prod_{j=1}^{\infty} (-\infty, r_j]$, with $r = (r_1, r_2, \dots)$, in the infinite-dimensional Euclidean space $\mathbb{R}^{\infty}$. More specifically, if we continue to let $\sigma_j^2 = \mathrm{var}(X_{1,j})$ for $j = 1, 2, \dots$, and define the "effective dimension" parameter $d(r) = \sum_{j=1}^{\infty} \frac{\sigma_j}{\sigma_j + r_j}$, then the following bound holds under certain conditions on the distribution of $S_n$,

(H.3)    $\rho_n(A(r)) \;\lesssim\; \frac{\log(n)^{3/2}}{n^{1/2}} \cdot d(r) \cdot \big(1 + d(r)^2\big).$

To comment on how the bound (H.3) relates to our Gaussian approximation result in Theorem 3.1, note that they both involve near-parametric rates, and are governed by the parameters $(\sigma_1, \sigma_2, \dots)$. However, there are also some crucial differences. First, the bound (H.3) is non-uniform with respect to the set $A(r)$, whereas Theorem 3.1 is a uniform result. In particular, this difference becomes apparent by setting all $r_j$ equal to $\sigma_{(n)}$, which implies $d(r) \geq n/2$, and causes the bound (H.3) to diverge as $n \to \infty$. Second, the bound (H.3) relies on the assumption that $S_n$ has a diagonal covariance matrix, whereas Theorem 3.1 allows for much more general covariance structures.
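Remark. The divergence just described can be seen numerically: when every $r_j$ equals $\sigma_{(n)}$, each of the $n$ largest terms of $d(r)$ is at least $1/2$. In the sketch below, the power-law sequence $\sigma_j = 1/j$ and the truncation of the infinite sum are assumptions made only for illustration.

    import numpy as np

    n = 50
    sigma = 1.0 / np.arange(1, 10 * n + 1)    # decreasing sequence sigma_1 >= sigma_2 >= ...
    r = np.full_like(sigma, sigma[n - 1])     # every r_j set equal to sigma_(n)
    d = np.sum(sigma / (sigma + r))           # truncated effective dimension d(r)
    print(d, n / 2, d >= n / 2)               # d(r) is at least n/2, so (H.3) diverges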

Some further examples related to max statistics arise in the context of empirical process theory. Let $\mathbb{G}_n(f) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \big( f(X_i) - E[f(X_i)] \big)$ denote an empirical process that is generated from i.i.d. observations $X_1, \dots, X_n$, and indexed by a class of functions $f \in \mathcal{F}$. Also let the Gaussian counterpart of $\mathbb{G}_n$ be denoted by $\mathbb{G}$ (i.e., a Brownian bridge) (van der Vaart and Wellner, 2000). In this setting, the paper (Norvaiša and Paulauskas, 1991) studies the quantity

(H.4)    $\Delta_n(t) \;=\; \Big| P\big( \|\mathbb{G}_n\|_{\mathcal{F}} \leq t \big) - P\big( \|\mathbb{G}\|_{\mathcal{F}} \leq t \big) \Big|,$

which can be understood in terms of the earlier definition (H.2) by letting $\varphi(\mathbb{G}_n) = \|\mathbb{G}_n\|_{\mathcal{F}} = \sup_{f \in \mathcal{F}} |\mathbb{G}_n(f)|$. Under the assumption that $\mathcal{F}$ is


a VC subgraph class of uniformly bounded functions, it is shown in the paper (Norvaiša and Paulauskas, 1991) that the bound

(H.5)    $\Delta_n(t) \;\lesssim\; \frac{1}{(1+t)^3} \cdot \frac{\log(n)^2}{n^{1/6}}$

holds for all $t \geq 0$. In relation to the current work, the condition that the functions in $\mathcal{F}$ are uniformly bounded is an important distinction, because the max statistic $T = \max_{1 \leq j \leq p} \mathbb{G}_n(f_j)$ arises from the functions $f_j(x) = x_j$, which are unbounded.

More recently, the papers (Chernozhukov, Chetverikov and Kato, 2014b, 2016) have derived bounds on the quantity $\|\Delta_n\|_{\infty}$ (as well as coupling probabilities) under weaker assumptions on the class $\mathcal{F}$. For instance, these works allow for classes of functions that are non-Donsker or unbounded, provided that a suitable envelope function is available. Nevertheless, the resulting bounds on $\|\Delta_n\|_{\infty}$ involve restrictions on how quickly the parameter $\inf_{f \in \mathcal{F}} \mathrm{var}(\mathbb{G}_n(f))$ can decrease with $n$, whereas our results do not involve such restrictions. Also, the rates developed in these works are broadly similar to (H.5), and it is unclear if a modification of the techniques would lead to near-parametric rates in our setting. Lastly, a number of earlier results that are related to bounding $\|\Delta_n\|_{\infty}$, such as invariance principles and couplings, can be found in the papers (Csörgő et al., 1986; Massart, 1986, 1989; Paulauskas and Stieve, 1990; Bloznelis, 1997, and references therein). However, these works are tailored to fairly specialized processes (e.g., dealing with càdlàg functions on the unit interval, rectangles in $\mathbb{R}^p$ when $p$ is fixed, or processes generated by Uniform[0,1] random variables).

More recently, the papers (Chernozhukov, Chetverikov and Kato, 2014b,2016) have derived bounds on the quantity ‖∆n‖∞ (as well as couplingprobabilities) under weaker assumptions on the class F . For instance, theseworks allow for classes of functions that are non-Donsker or unbounded,provided that a suitable envelope function is available. Nevertheless, the re-sulting bounds on ‖∆n‖∞ involve restrictions on how quickly the parameterinff∈F var(Gn(f)) can decrease with n — whereas our results do not in-volve such restrictions. Also, the rates developed in these works are broadlysimilar to (H.5), and it is unclear if a modification of the techniques wouldlead to near-parametric rates in our setting. Lastly, a number of earlier re-sults that are related to bounding ‖∆n‖∞, such as invariance principles andcouplings, can be found in the papers (Csorgo et al., 1986; Massart, 1986,1989; Paulauskas and Stieve, 1990; Bloznelis, 1997, and references therein).However, these works are tailored to fairly specialized processes (e.g., deal-ing with cadlag functions on the unit interval, rectangles in Rp when p isfixed, or processes generated by Uniform[0,1] random variables).