
Innovated Higher Criticism for Detecting Sparse Signals in Correlated Noise

Peter Hall¹ and Jiashun Jin²

Abstract

Higher Criticism is a method for detecting signals that are both sparse and weak. Although first proposed in cases where the noise variables are independent, Higher Criticism also has reasonable performance in settings where those variables are correlated. In this paper we show that, by exploiting the nature of the correlation, performance can be improved by using a modified approach which exploits the potential advantages that correlation has to offer. Indeed, it turns out that the case of independent noise is the most difficult of all, from a statistical viewpoint, and that more accurate signal detection (for a given level of signal sparsity and strength) can be obtained when correlation is present. We characterize the advantages of correlation by showing how to incorporate them into the definition of an optimal detection boundary. The boundary has particularly attractive properties when correlation decays at a polynomial rate or the correlation matrix is Toeplitz.

Keywords: Adding noise, Cholesky factorization, empirical process, innovation, multiple hypothesis testing, sparse normal means, spectral density, Toeplitz matrix.

AMS 2000 subject classifications: Primary 62G10, 62M10; secondary 62G32, 62H15.

1 Introduction

Donoho and Jin [14] developed Tukey's [36] proposal for "Higher Criticism" (HC), showing that a method based on the statistical significance of a large number of statistically significant test results could be used very effectively to detect the presence of very sparsely distributed signals. They demonstrated that HC is capable of optimally detecting the presence of signals that are so weak, and so sparse, that the signal cannot be consistently estimated. Applications include the problem of signal detection against cosmic microwave background radiation (Cayon et al. [8], Cruz et al. [12], Jin [27, 28, 29], Jin et al. [31]). Related work includes that of Cai et al. [7], Hall et al. [20] and Meinshausen and Rice [32].

The context of Donoho and Jin's [14] work was that where the noise is white, although a small number of investigations have been made of the case of correlated noise (Hall et al. [20], Hall and Jin [21], Delaigle and Hall [13]). However, that research has focused on the ability of standard HC, applied in the form that is appropriate for independent data, to accommodate the non-independent case. In this paper we address the problem of how to modify HC by developing innovated Higher Criticism (iHC) and showing how to optimize performance for correlated noise.

¹ Department of Mathematics and Statistics, University of Melbourne, Parkville, VIC, 3010, Australia; Department of Statistics, University of California, Davis, CA 95616.

² Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213. The research of Jiashun Jin was supported in part by NSF CAREER award DMS-0908613.


Curiously, it turns out that when using the iHC method tuned to give optimal performance, the case of independence is the most difficult of all, statistically speaking. To appreciate why this result is reasonable, note that if the noise is correlated then it does not vary so much from one location to a nearby location, and so is a little easier to identify. In an extreme case, if the noise is perfectly correlated at different locations then it is constant, and in this instance it can easily be removed.

On the other hand, standard HC does not perform well in the case of correlated noise, because it utilizes only the marginal information in the data, without much attention to the correlation structure. Innovated HC is designed to exploit the advantages offered by correlation, and gives good performance across a wide range of settings.

The concept of the "detection boundary" was introduced by Donoho and Jin [14] in the context of white noise. In this paper, we extend it to the correlated case. In brief, the detection boundary describes the relationship between signal sparsity and signal strength that characterizes the boundary between cases where the signal can be detected, and cases where it cannot. In the setting of dependent data, this watershed depends on the correlation structure of the noise as well as on the sparsity and strength of the signal. When correlation decays at a polynomial rate we are able to characterize the detection boundary quite precisely. In particular, we show how to construct concise lower/upper bounds to the detection boundary, based on the diagonal components of the inverse of the correlation matrix, Σn. A special case is where Σn is Toeplitz; there the upper and the lower bounds to the detection boundary are asymptotically the same. In the Toeplitz case, iHC is optimal for signal detection but standard HC is not.

The paper is organized as follows. Section 2 introduces the sparse signal model, followed by a brief review of the uncorrelated case. Section 3 establishes lower bounds to the detection boundary in correlated settings. Section 4 introduces innovated HC and establishes an upper bound to the detection boundary. Section 5 applies the main results of Sections 3 and 4 to the case where the matrices Σn are Toeplitz. In this case, the lower bound coincides with the upper bound and innovated HC is optimal for detection. Section 6 discusses a case where the signals have a more complicated structure. Section 7 investigates a case of strong dependence. Simulations are given in Section 8, and discussion is given in Section 9. Sections 10, 11, and 12 give proofs of theorems, lemmas, and secondary lemmas, respectively.

2 Sparse signal model, review of HC in uncorrelated case

Consider an n-dimensional Gaussian vector

X = µ + Z where Z ∼ N(0, Σ), (2.1)

with the mean vector µ unknown and the dimension n large. In most parts of the paper, we assume that Σ = Σn is known and has unit diagonal elements (the case where Σn is unknown is discussed in Section 9). We are interested in testing whether no signal exists (i.e. µ = 0) or there is a sparse and faint signal.

Such a situation arises in many applications. One example is global testing in linear models. Consider a linear model Y ∼ N(Mµ, In), where the matrix M has many rows and columns, and we are interested in testing whether µ = 0. The setting is closely related to Model (2.1), since the least squares estimator of µ is distributed as N(µ, (M′M)^(−1)). The global testing problem is important in many applications. One is that of testing whether a


clinical outcome is associated with the expression pattern of a pre-specified group of genes (Goeman et al. [17, 18]), where M is the expression profile of the specified group of genes. Another is expression quantitative trait loci (eQTL) analysis, where M is related to the numbers of common alleles for different genetic markers and individuals (Chen et al. [9]). In both examples, M is either observable or can be estimated. Also, it is frequently seen that only a small proportion of genes is associated with the clinical outcome, and that each gene contributes weakly to the clinical outcome. In such a situation, the signals are both sparse and faint.

Back to Model (2.1). We model the number of nonzero entries of µ as

m = n^(1−β), where β ∈ (1/2, 1). (2.2)

This is a very sparse case, since the proportion of signals is much smaller than 1/√n. We suppose that the signals appear at m different locations, ℓ1 < ℓ2 < . . . < ℓm, that are randomly drawn from {1, 2, . . . , n} without replacement,

P{ℓ1 = n1, ℓ2 = n2, . . . , ℓm = nm} = (n choose m)^(−1), for all 1 ≤ n1 < n2 < . . . < nm ≤ n, (2.3)

and that they have a common magnitude of An = √(2r log n), where r ∈ (0, 1).

We are interested in testing which of the following two hypotheses is true:

H0 : µ = 0 vs. H1^(n) : µ is a sparse vector as above. (2.4)

This testing problem was found to be delicate even in the uncorrelated case where Σn = In. See [14] (also [7, 23, 24, 27, 32]) for details. Below, we briefly review the results in the uncorrelated case.

2.1 Detection boundary (Σn = In)

The testing problem is characterized by the curve r = ρ∗(β) in the β-r plane, where

ρ∗(β) = β − 1/2 for 1/2 < β ≤ 3/4, and ρ∗(β) = (1 − √(1 − β))² for 3/4 < β < 1, (2.5)

and we call r = ρ∗(β) the detection boundary. The detection boundary partitions the β-r plane into two sub-regions: the undetectable region below the boundary and the detectable region above the boundary; see Figure 1. In the interior of the undetectable region, the signals are so sparse and so faint that no test is able to successfully separate the alternative hypothesis from the null hypothesis in (2.4): the sum of Type I and Type II errors of any test tends to 1 as n diverges to ∞. In the interior of the detectable region, it is possible to have a test such that, as n diverges to ∞, the Type I error tends to 0 and the power tends to 1. (In fact, the Neyman-Pearson Likelihood Ratio Test (LRT) is such a test.) See [14, 23, 27] for example.
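As a quick aid to reading the phase diagram, the following is a minimal sketch of (2.5) in Python (Python is our choice for all code illustrations here; the function name rho_star is ours):

    import numpy as np

    def rho_star(beta):
        # Detection boundary (2.5), defined for beta in (1/2, 1).
        beta = np.asarray(beta, dtype=float)
        return np.where(beta <= 0.75, beta - 0.5, (1.0 - np.sqrt(1.0 - beta)) ** 2)

    # A pair (beta, r) is detectable when r > rho_star(beta) and undetectable when r < rho_star(beta).
    print(rho_star(0.6))   # 0.1
    print(rho_star(0.9))   # (1 - sqrt(0.1))^2, about 0.468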

The drawback of the LRT is that it requires detailed information about the unknown parameters (β, r). In practice, we need a test that does not require such information; this is where HC comes in.



Figure 1: Phase diagram for the detection problem in the uncorrelated case. The detection boundary separates the β-r plane into the detectable region and the undetectable region. In the estimable region, it is not only possible to reliably tell the existence of nonzero coordinates, but also possible to identify them individually.

2.2 Higher Criticism and its optimal adaptivity (Σn = In)

A notion that goes back to Tukey [36], Higher Criticism was first proposed in [14] to tackle the aforementioned testing problem in the uncorrelated case. To apply Higher Criticism, let pj = P{|N(0, 1)| ≥ |Xj|} be the p-value associated with the j-th observation unit, and let p(j) be the j-th p-value after sorting in ascending order. The Higher Criticism statistic is defined as

HC∗n = max_{j: 1/n ≤ p(j) ≤ 1/2} √n (j/n − p(j)) / √(p(j)(1 − p(j))). (2.6)

There are also other versions of HC; see [14, 15, 16] for example. When H0 is true, HC∗n equals in distribution the maximum of the standardized uniform empirical process [14]. Therefore, by a well-known result for empirical processes [33],

HC∗n / √(2 log log n) → 1 in probability. (2.7)

Consider the HC test which rejects the null hypothesis when

HC∗n ≥ (1 + a)√(2 log log n), where a > 0 is a constant. (2.8)

It follows from (2.7) that the Type I error tends to 0 as n diverges to ∞. For any parameters (β, r) that fall in the interior of the detectable region, the Type II error also tends to 0. This is the content of the following theorem, where we set a = 0.01 for simplicity of presentation.

Theorem 2.1 Consider the HC test that rejects H0 when HC∗n ≥ 1.01√(2 log log n). For every alternative H1^(n) where the associated parameters (r, β) satisfy r > ρ∗(β), the HC test has asymptotically full power for detection:

P_{H1^(n)}{Reject H0} → 1 as n → ∞.


That is, the HC test adapts to the unknown parameters (β, r), and yields asymptotically full power for detection throughout the entire detectable region. We call this the optimal adaptivity of HC [14].

Theorem 2.1 is closely related to [14, Theorem 1.2], where a mixture model is used. The mixture model reduces approximately to the current model if we randomly shuffle the coordinates of X. However, despite its appealing technical convenience, it is not clear how to generalize the mixture model from the uncorrelated case to general correlated settings. Theorem 2.1 is a special case of Theorem 4.2.

We now turn to the correlated case. In this case, the exact "detection boundary" may depend on Σn in a complicated manner, but it is possible to establish both a tight lower bound and a tight upper bound. We discuss the lower bound first.

3 Lower bound to the detectability

To establish the lower bound, a key element is the theory of comparison of experiments (e.g. [34]), where a useful guideline is that adding noise always makes inference more difficult. Thus, we can alter the model by either adding or subtracting a certain amount of noise, so that the difficulty level (measured by the Hellinger distance, or the χ²-distance, etc., between the null density and the alternative density) of the original problem is sandwiched by those of the two adjusted models. The correlation matrices in the latter have a simpler form and hence are much easier to analyze. Another key element is the recent development of matrix characterizations based on polynomial off-diagonal decay, which shows that the inverse of a matrix with this property shares the same rate of decay as the original matrix.

3.1 Comparison of experiments: adding noise makes inference harder

We begin by comparing two experiments that have the same mean, but where the data from one experiment are more noisy than those from the other. Intuitively, it is more difficult to make inferences in the noisier experiment. Specifically, consider the two Gaussian models

X = µ + Z, Z ∼ N(0, Σ), and X∗ = µ + Z∗, Z∗ ∼ N(0, Σ∗), (3.1)

where µ is an n-vector that is generated according to some distribution G = Gn. The second model is more noisy than the first, in the sense that Σ∗ ≥ Σ; see the definition below.

Definition 3.1 Consider two matrices A and B. We write A ≥ B if A − B is positive semi-definite.

The second model in (3.1) can be viewed as the result of adding noise to the first. Indeed, defining ∆ = Σ∗ − Σ, taking ξ to be N(0, ∆) (independently of Z), and noting that Z + ξ ∼ N(0, Σ + ∆), the second model is seen to be equivalent to X + ξ = µ + (Z + ξ). Intuitively, adding noise generally makes inference more difficult. This can be stated precisely by comparing distances between distributions, for example Hellinger distances. In detail, if we denote the Hellinger distance between X and Z by Hn(X, Z; µ, Σn), and that between X∗ and Z∗ by Hn(X∗, Z∗; µ, Σ∗n), then we have the following theorem, which is proved in Section 10.

Theorem 3.1 Suppose Σn ≤ Σ∗n in (3.1). Then Hn(X, Z; µ, Σn) ≥ Hn(X∗, Z∗; µ, Σ∗n).


3.2 Inverses of matrices having polynomial off-diagonal decay

Next, we review results concerning matrices with polynomial off-diagonal decay. The main message is that, under mild conditions, if a matrix has polynomial off-diagonal decay, then its inverse, as well as its Cholesky factorization (which is unique if we require the diagonal entries to be positive), also has polynomial off-diagonal decay, and at the same rate. This beautiful result was recently obtained by Jaffard [25]; see also [19, 35]. In detail, let Z be the set of all integers. Write ℓ2 for the set of square-summable sequences x = {xk}k∈Z, and let A = (A(j, k))j,k∈Z be an infinite matrix. Also, let |x|2 be the ℓ2-vector norm of x, and ‖A‖ the operator norm of A: ‖A‖ = sup_{x: |x|2=1} |Ax|2. Fixing positive constants λ, M, and c0, we define the class of matrices

Θ∞(λ, c0, M) = {A = (A(j, k))j,k∈Z : |A(j, k)| ≤ M(1 + |j − k|)^(−λ), ‖A‖ ≥ c0}. (3.2)

The following lemma follows directly from [35].

Lemma 3.1 Fix λ > 1, c0 > 0, and M > 0. For any matrix A ∈ Θ∞(λ, c0, M), there is a constant C > 0, depending only on λ, M and c0, such that |A^(−1)(j, k)| ≤ C · (1 + |j − k|)^(−λ).

Now, consider a sequence of matrices of finite but increasingly large sizes, where the entries have a given rate of polynomial off-diagonal decay and where the operator norm is uniformly bounded from below. Then the same rate of polynomial off-diagonal decay holds for their inverses, as well as for the inverses of their Cholesky factorizations. In detail, writing Θn for the set of n × n correlation matrices, we introduce the set of matrices

Θ∗n(λ, c0, M) = {Σn ∈ Θn : |Σn(j, k)| ≤ M(1 + |j − k|)^(−λ), ‖Σn‖ ≥ c0}. (3.3)

The following lemma follows from Lemma 3.1 and is proved in Section 11.

Lemma 3.2 Fix λ > 1, c0 > 0, and M > 0. For any sequence of matrices Σn, n ≥ 1, such that Σn ∈ Θ∗n(λ, c0, M), let Un be the inverse of the Cholesky factorization of Σn. There is a constant C = C(λ, c0, M) > 0 such that for any n and any 1 ≤ j, k ≤ n,

|Σn^(−1)(j, k)| ≤ C · (1 + |j − k|)^(−λ), |Un(j, k)| ≤ C · (1 + |j − k|)^(−λ).

When λ = 1, the first inequality continues to hold, and the second holds if we adjoin a log n factor to the right-hand side.

3.3 Lower bound to the detectability

We are now ready for the lower bound. Consider a sequence of matrices Σn ∈ Θ∗n(λ, c0, M). Suppose the extreme diagonal entries of Σn^(−1) have an upper limit 0 < γ̄0 < ∞, i.e.

lim_{n→∞} ( max_{√n ≤ k ≤ n−√n} Σn^(−1)(k, k) ) = γ̄0. (3.4)

Recall that the detection boundary in the uncorrelated case is r = ρ∗(β). The following theorem says that if we re-scale and write r = γ̄0^(−1) · ρ∗(β), then we obtain a lower bound.

Theorem 3.2 Fix β ∈ (1/2, 1), r ∈ (0, 1), λ > 1, c0 > 0, and M > 0. Consider a sequence of correlation matrices Σn ∈ Θ∗n(λ, c0, M) that satisfy (3.4). If r < γ̄0^(−1) ρ∗(β), then the null hypothesis and the alternative hypothesis in (2.4) merge asymptotically, and the sum of Type I and Type II errors of any test converges to 1 as n diverges to ∞.

We now turn to the upper bound. The key is to adapt Higher Criticism to correlated noise and form a new statistic: innovated Higher Criticism.


4 Innovated Higher Criticism, upper bound to detectability

Originally designed for the independent case, standard HC is not really appropriate for dependent data, for the following reasons. First, HC only summarizes the information that resides in the marginal effects of each coordinate, and neglects the correlation structure of the data. Second, HC remains the same if we randomly shuffle the coordinates of X. Such shuffling has no effect if Σn = In, but does otherwise. In this section we build the correlation into standard Higher Criticism and form a new statistic: innovated Higher Criticism (iHC). We then use iHC to establish an upper bound to detectability. iHC is intimately connected to the well-known notion of innovation in time series [6] (see (4.1) below), hence its name.

Below, we begin by discussing the role of correlation in the detection problem.

4.1 Correlation among different coordinates: curse or blessing?

Consider Model (2.1) in the two cases Σn = In and Σn ≠ In. Which is the more difficult detection problem?

Here is one way to look at it. Since the mean vectors are the same in the two cases, the problem where the noise vector contains more "uncertainty" is more difficult than the other. In information theory, the total amount of uncertainty is measured by the differential entropy, which in the Gaussian case is an increasing function of the determinant of the correlation matrix [11]. As the determinant of a correlation matrix is largest if and only if it is the identity matrix, the uncorrelated case contains the largest amount of "uncertainty" and therefore gives the most difficult detection problem. In a sense, the correlation is a "blessing" rather than a "curse", contrary to what one might have expected.

Here is another way to look at it. For any positive definite matrix Σn, denote the inverse of its Cholesky factorization by Un = Un(Σn) (so that UnΣnU′n = In). Model (2.1) is equivalent to

UnX = Unµ + UnZ, where UnZ ∼ N(0, In). (4.1)

(In the literature of time series [6], UnX is intimately connected to the notion of innovation.) Compare this with the uncorrelated case, i.e.

X = µ + Z, where Z ∼ N(0, In).

The noise vectors have the same distribution, but the signals in the former are stronger. In fact, let ℓ1 < ℓ2 < . . . < ℓm be the m locations where µ is nonzero. Recalling that µj = An if j ∈ {ℓ1, ℓ2, . . . , ℓm}, µj = 0 otherwise, and that Un is a lower triangular matrix,

(Unµ)_ℓk = An Σ_{j=1}^{k} Un(ℓk, ℓj) = An Un(ℓk, ℓk) + An Σ_{j=1}^{k−1} Un(ℓk, ℓj). (4.2)

Two key observations are as follows. First, since Σn has unit diagonal entries, every diagonal entry of Un is greater than or equal to 1; in particular,

Un(ℓk, ℓk) ≥ 1. (4.3)

Second, recall that m ≪ n, and ℓ1, ℓ2, . . . , ℓm are randomly generated from {1, 2, . . . , n}, so different ℓj are far apart from each other. Therefore, under mild decay conditions on Un,

Un(ℓk, ℓj) ≈ 0, j = 1, 2, . . . , k − 1. (4.4)


Inserting (4.3) and (4.4) into (4.2), we expect that

(Unµ)_ℓk ≳ An, k = 1, 2, . . . , m.

Therefore, "on average", Unµ has at least m entries each of which is at least as large as An. This says that, first, the correlated case is easier for detection than the uncorrelated case. Second, applying standard HC to UnX yields larger power than applying it to X directly.

Next we make the argument more precise. Fix a positive sequence {δn : n ≥ 1} that tends to 0 as n diverges to ∞, and a sequence of integers {bn : n ≥ 1} that satisfy 1 ≤ bn ≤ n. Let

Θ̃∗n(δn, bn) = {Σn ∈ Θn : Σ_{j=1}^{k−bn} |Un(Σn)(k, j)| ≤ δn for all k satisfying bn + 1 ≤ k ≤ n}.

Introducing Θ̃∗n seems a digression from our original plan of focusing on Θ∗n (the set of matrices with polynomial off-diagonal decay), but it is interesting in its own right. In fact, compared to Θ∗n, Θ̃∗n is much broader, as it does not impose much of a condition on Σn(j, k) for |j − k| ≤ bn. This helps to illustrate how broadly the aforementioned phenomenon holds. The following theorem is proved in Section 10.

Theorem 4.1 Fix β ∈ (1/2, 1) and r ∈ (ρ∗(β), 1). Let bn = n^(β/3), and let δn be a positive sequence that tends to 0 as n diverges to ∞. Suppose we apply standard Higher Criticism to Un(Σn)X and reject H0 if and only if the resulting score exceeds 1.01√(2 log log n). Then, uniformly in all sequences of Σn satisfying Σn ∈ Θ̃∗n(δn, bn),

P_{H0}{Reject H0} + P_{H1^(n)}{Accept H0} → 0, n → ∞.

Generally, directly applying standard HC to X does not yield the same result (e.g. [21]).

4.2 Innovated Higher Criticism: Higher Criticism based on innovations

We have learned that applying standard HC to UnX yields better results than applying it to X directly. Is this the best we can do? No, there is still room for improvement. In fact, HC applied to UnX is a special case of the innovated Higher Criticism to be elaborated in this section. Innovated Higher Criticism is even more powerful in detection.

To begin with, we revisit the vector Unµ via an example. Fix n = 100; let Σn be a symmetric tri-diagonal matrix with 1 on the main diagonal, 0.4 on the two sub-diagonals, and 0 elsewhere; and let µ be the vector with 1 at coordinates 27, 50, 71, and 0 elsewhere. Figure 2 compares µ and Unµ. In particular, the nonzero coordinates of Unµ appear in three visible clusters, each of which corresponds to a different nonzero entry of µ. Also, at coordinates 27, 50, 71, Unµ approximately equals 1.2, whereas µ equals 1.

Now we can either simply apply standard HC to UnX as before, or we can first linearly transform each cluster of signals into a singleton, and then apply standard HC. Note that in the second approach we may have fewer signals, but each of them is much stronger than those in UnX. Since the HC test is more sensitive to signal strength than to the number of signals, we expect the second approach to yield greater power for detection than the first.



Figure 2: Comparison of µ (left) and Un(Σn)µ (right). Here n = 100 and Σn is a symmetric tri-diagonal matrix with 1 on the main diagonal, 0.4 on the two sub-diagonals, and 0 elsewhere. Also, µ is 1 at coordinates 27, 50, and 71, and 0 elsewhere. In comparison, the nonzero entries of Un(Σn)µ appear in three visible clusters, each of which corresponds to a nonzero coordinate of µ.

In light of this we propose the following approach. Write Un = (ukj)1≤k,j≤n. We pick a bandwidth 1 ≤ bn ≤ n, and construct a matrix Ũn(bn) = Ũn(Σn, bn) by banding Un [4]:

Ũn(bn) ≡ (ũkj)1≤j,k≤n, where ũkj = ukj for k − bn + 1 ≤ j ≤ k, and ũkj = 0 otherwise. (4.5)

We then normalize each column of Ũn(bn) by its own ℓ2-norm, and call the resulting matrix Ūn(bn). Next, defining

Vn(bn) = Vn(bn; Σn) = Ū′n(bn; Σn) · Un(Σn), (4.6)

we transform Model (2.1) into

X ↦ Vn(bn)X = Vn(bn)µ + Vn(bn)Z. (4.7)

Finally, we apply standard Higher Criticism to Vn(bn)X, and call the resulting statistic innovated Higher Criticism:

iHC∗n(bn) = iHC∗n(bn; Σn) = (2bn − 1)^(−1/2) sup_{j: 1/n ≤ p(j) ≤ 1/2} √n (j/n − p(j)) / √(p(j)(1 − p(j))). (4.8)

Note that standard HC applied to UnX is a special case of iHC∗n with bn = 1. We briefly comment on the selection of the bandwidth parameter bn. First, for each k ∈ {ℓ1, ℓ2, . . . , ℓm}, direct calculations show that (Vn(bn)µ)k ≈ An · √(Σ_{j=1}^{bn} u²_{k,k−j+1}) ≥ An. Second, Vn(bn)Z ∼ N(0, Ū′n(bn)Ūn(bn)), where Ū′n(bn)Ūn(bn) is a banded correlation matrix with bandwidth 2bn − 1. Therefore, choosing bn involves a trade-off: a larger bn usually means stronger signals, but also stronger correlation among the noise. While it is hard to give a general rule for selecting the best bn, we must mention that in many cases the choice of bn is not very critical. For example, when Σn has polynomial off-diagonal decay, a logarithmically large bn is usually appropriate.
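Putting (4.5)-(4.8) together, a hedged end-to-end sketch reads as follows (ihc is our name; the banding and normalization follow the construction above):

    import numpy as np
    from scipy.linalg import cholesky, inv
    from scipy.stats import norm

    def ihc(x, Sigma, b):
        # Innovated HC: band U_n, normalize columns, form V_n(b), then apply standard HC to V_n(b) x.
        n = len(x)
        U = inv(cholesky(Sigma, lower=True))        # U Sigma U' = I_n
        Ub = np.tril(np.triu(U, -(b - 1)))          # keep u_kj for k - b + 1 <= j <= k, as in (4.5)
        Ub = Ub / np.linalg.norm(Ub, axis=0)        # normalize each column by its l2-norm
        y = Ub.T @ (U @ x)                          # V_n(b) x, with V_n(b) = Ub' U_n as in (4.6)-(4.7)
        p = np.sort(2.0 * norm.sf(np.abs(y)))       # marginal p-values (the diagonal of Ub'Ub is 1)
        j = np.arange(1, n + 1)
        hc = np.sqrt(n) * (j / n - p) / np.sqrt(p * (1.0 - p))
        keep = (p >= 1.0 / n) & (p <= 0.5)
        return hc[keep].max() / np.sqrt(2.0 * b - 1.0)   # normalization in (4.8)

With b = 1 this reduces to standard HC applied to UnX.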


4.3 Upper bound to detectability

We now establish an upper bound to detectability. The following lemma describes the signal strength in Vn(bn)X and is proved in Section 11.

Lemma 4.1 Fix c0 > 0, λ ≥ 1, and M > 0. Consider a sequence of bandwidths bn that tends to ∞. Let ℓ1, ℓ2, . . . , ℓm be the m random locations of signals in µ, arranged in ascending order. For sufficiently large n, there is a constant C = C(c0, λ, M) such that, except on an event with asymptotically vanishing probability,

(Vn(bn)µ)k ≥ (1 − C bn^(1/2−λ) + o(1)) · √(Σn^(−1)(k, k)) · An for all k ∈ {ℓ1, ℓ2, . . . , ℓm},

uniformly for all Σn ∈ Θ∗n(λ, c0, M), where the o(1) term tends to 0 algebraically fast.

Now, suppose the diagonal entries of Σn^(−1) have a lower limit, as follows:

lim_{n→∞} ( min_{√n ≤ k ≤ n−√n} Σn^(−1)(k, k) ) = γ̲0. (4.9)

Recall that the nonzero coordinates of µ are modeled as An = √(2r log n). So if we let bn = log n, then a direct result of Lemma 4.1 is that the vector Vn(bn)X has at least m nonzero coordinates, each of which is as large as √γ̲0 An = √(2γ̲0 r log n). As for the bandwidth, note that a larger bn cannot improve the signal strength significantly, but may yield much stronger correlation in Vn(bn)Z. Therefore, a smaller bandwidth is preferred. The choice bn = log n is mainly for convenience, and can be modified.

We now turn to the behavior of iHC∗n(bn) under the null hypothesis. In the independent case, iHC∗n reduces to HC∗n and is approximately equal to √(2 log log n). In the current situation, iHC∗n is comparatively larger due to the correlation. However, since the selected bandwidth is relatively small, iHC∗n remains logarithmically large. This is formally captured by the following lemma.

Lemma 4.2 Take the bandwidth to be bn = log n and suppose H0 is true. Then, except for an algebraically small probability, iHC∗n(bn) ≤ C(log n)^(3/2) for some constant C > 0, uniformly for all correlation matrices.

Lemma 4.2 is proved in Section 11. The key is to express iHC∗n as the maximum of (2bn − 1) standard HC statistics, and to apply the well-known Hungarian construction [10]. The following theorem elaborates on the upper bound, and is proved in Section 10.

Theorem 4.2 Fix c0 > 0, λ > 1, and M > 0, and set bn = log n. Suppose γ̲0 · r > ρ∗(β). If we reject H0 when iHC∗n(bn; Σn) ≥ (log n)², then, uniformly in all Σn ∈ Θ∗n(λ, c0, M),

P_{H0}{Reject H0} + P_{H1^(n)}{Accept H0} → 0, as n → ∞.

The cut-off value (log n)² can be replaced by other logarithmically large terms that tend to ∞ faster than (log n)^(3/2). For finite n, this cut-off value may be conservative. In Section 8 (i.e. experiment (a)), we suggest an alternative where we select the cut-off value by simulation.

In summary, a lower bound and an upper bound are established as r = γ̄0^(−1) ρ∗(β) and r = γ̲0^(−1) ρ∗(β), respectively, under reasonably weak off-diagonal decay conditions. When γ̄0 = γ̲0, the gap between the two bounds disappears, and iHC is optimal for detection. Below, we investigate several Toeplitz cases, ranging from weak dependence to strong dependence; in all of these cases, iHC is optimal for detection.


5 Application in the Toeplitz case

In this section, we discuss the case where Σn is a (truncated) Toeplitz matrix generated by a spectral density f defined over (−π, π). In detail, let ak = (2π)^(−1) ∫_{|θ|<π} f(θ) e^(−ikθ) dθ be the k-th Fourier coefficient of f. The n-th truncated Toeplitz matrix generated by f is the matrix Σn(f) whose (j, k)-th element is a_{j−k}, for 1 ≤ j, k ≤ n.

We assume that f is symmetric and positive, i.e.

c0(f) ≡ ess inf_{−π≤θ≤π} f(θ) > 0. (5.1)

First, note that f is a density, so a0 = 1 and Σn(f) has unit diagonal entries. Second, from the symmetry of f, it can be seen that Σn(f) is a real-valued symmetric matrix. Last, it is well known [5] that the smallest eigenvalue of Σn(f) is no smaller than c0(f), so Σn(f) is positive definite. Putting all this together, Σn(f) is seen to be a correlation matrix.

Toeplitz matrices enjoy convenient asymptotic properties. In detail, suppose additionally that f has at least λ bounded derivatives (interpreted in the sense of conventional derivatives together with a Hölder condition), where λ > 1. Then by elementary Fourier analysis, there is a constant M0 = M0(f) > 0 such that

|ak| ≤ M0(f)(1 + k)^(−λ) for k = 0, 1, 2, . . . . (5.2)

Comparing (5.1) and (5.2) with the definition of Θ∗n, we conclude that

Σn(f) ∈ Θ∗n(λ, c0(f), M0(f)). (5.3)

In addition, it is known that the inverse of Σn(f) is typically asymptotically equivalent to the Toeplitz matrix generated by 1/f. In particular, we have the following lemma, which is a direct result of [5, Theorem 2.15].

Lemma 5.1 Suppose (5.1) and (5.2) hold. For all √n ≤ k ≤ n − √n and each 1 < λ′ < λ,

|Σn^(−1)(f)(k, k) − Σn(1/f)(k, k)| ≤ C n^(−(λ′−1)/2).

The diagonal entries of Σn(1/f) are the well-known Wiener interpolation rates [37]:

C(f) = (2π)^(−1) ∫_{−π}^{π} (1/f(θ)) dθ. (5.4)

Therefore, as a direct result of Lemma 5.1, max_{√n≤k≤n−√n} |Σn^(−1)(f)(k, k) − C(f)| = o(1). Comparing this with (3.4) and (4.9) we deduce that

γ̄0 = γ̲0 = C(f). (5.5)

Combining (5.3) and (5.5), the following theorem is a direct result of Theorems 3.2 and 4.2 (the proof is omitted).

Theorem 5.1 Fix λ > 1, and let Σn(f) be the Toeplitz matrix generated by a symmetric spectral density f that satisfies (5.1) and (5.2). When C(f) · r < ρ∗(β), the null and alternative hypotheses merge asymptotically, and the sum of Type I and Type II errors of any test converges to 1 as n diverges to ∞. When C(f) · r > ρ∗(β), suppose we apply iHC with bandwidth bn = log n and reject the null hypothesis when iHC∗n(bn, Σn(f)) ≥ (log n)². Then the Type I error of iHC converges to zero, and its power converges to 1.

The curve r = C(f)^(−1)ρ∗(β) partitions the β-r plane into the undetectable region and the detectable region, similarly to the uncorrelated case. The regions of the current case can be viewed as the corresponding regions in the uncorrelated case squeezed vertically by a factor of 1/C(f). See Figure 3. (Note that C(f) ≥ 1, with equality if and only if f ≡ 1, which corresponds to the uncorrelated case.)



Figure 3: Phase diagram in the case where Σn is a Toeplitz matrix generated by a spectral density f. Similarly to Figure 1, the β-r plane is partitioned into three regions (undetectable, detectable, estimable), each of which can be viewed as the corresponding region in Figure 1 squeezed vertically by a factor of 1/C(f). In the rectangular region at the top, the largest signals in Vn(bn)X (see (4.6)) are large enough to stand out by themselves.

6 Extension: when signals appear in clusters

In the preceding sections (see e.g. (2.3) in Section 2), the m locations of signals were generated randomly from {1, 2, . . . , n}. Since m ≪ √n, the signals appear as singletons with overwhelming probability. In this section we investigate an extension where the signals may appear in clusters.

We consider a setting where the signals appear in a total of m clusters, whose locations are randomly generated from {1, 2, . . . , n}. Each cluster contains a total of K consecutive signals, whose strengths are g0An, g1An, . . . , g_{K−1}An, from right to left. Here, An = √(2r log n) as before, K ≥ 1 is a fixed integer, and the gi are constants. Approximately, the signal vector can be modeled as follows.

As before, let ℓ1, ℓ2, . . . , ℓm be indices that are randomly sampled from {1, 2, . . . , n}. Let µ = (µ1, . . . , µn)^T, where µj = An if j ∈ {ℓ1, ℓ2, . . . , ℓm}, and µj = 0 otherwise. Let B = Bn denote the "backward shift" matrix, with 0 in every position except that it has 1 in position (j + 1, j) for 1 ≤ j ≤ n − 1. Thus, Bµ differs from µ in that the components are shifted one position backward, with 0 added at the bottom. We model the signal vector as

ν = g0 µ + g1 Bµ + · · · + g_{K−1} B^(K−1) µ = ( Σ_{k=0}^{K−1} gk B^k ) µ.

Thus, ν is comprised of m clusters, each of which contains K consecutive signals. Let g be the function g(θ) = Σ_{0≤k≤K−1} gk e^(−ikθ). We note that Σ_{0≤k≤K−1} gk B^k is the lower triangular Toeplitz matrix generated by g. With the same spectral density f as in Section 5, we consider the following extension of the model there:

X = Σn(g)µ + Z, where Z ∼ N(0, Σn(f)). (6.1)

The model is closely related to that of Arias-Castro et al. [2], with gi = 1, m = 1, and f ≡ 1. See details therein.
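A small sketch of the clustered signal ν = (Σ_k gk B^k)µ may help fix ideas; the cluster shape g and the locations below are hypothetical:

    import numpy as np

    n, A_n = 1000, 2.0
    g = np.array([1.0, 0.6, 0.3])          # hypothetical cluster shape (K = 3, g_0 != 0)
    locs = np.arange(100, n, 200)          # cluster locations
    mu = np.zeros(n)
    mu[locs] = A_n

    nu = np.zeros(n)
    for k, gk in enumerate(g):             # B^k shifts the components k positions
        nu[k:n] += gk * mu[0:n - k]

    print(nu[locs[0]:locs[0] + 4])         # one cluster: A_n * (g_0, g_1, g_2), followed by 0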


We note that the model can be equivalently viewed as

X̃ = µ + Z̃, where Z̃ ∼ N(0, Σ̃n) and Σ̃n = Σn^(−1)(g) · Σn(f) · Σn^(−1)(ḡ),

with ḡ denoting the complex conjugate of g. Asymptotically,

Σ̃n^(−1) ∼ Σn(ḡ) · Σn^(−1)(f) · Σn(g) ∼ Σn(|g|²/f),

where the diagonal entries of Σn(|g|²/f) are

C(f, g) = (2π)^(−1) ∫_{−π}^{π} (|g(θ)|²/f(θ)) dθ.

If γ̄0 and γ̲0 are as defined in (3.4) and (4.9), then γ̄0 = γ̲0 = C(f, g), and we expect the detection boundary to be r = C(f, g)^(−1) · ρ∗(β). This is affirmed by the following theorem, which is proved in Section 10.

Theorem 6.1 Fix λ > 1. Suppose g0 ≠ 0 and let f be a symmetric spectral density that satisfies (5.1) and (5.2). When C(f, g) · r < ρ∗(β), the null and alternative hypotheses merge asymptotically, and the sum of Type I and Type II errors of any test converges to 1 as n diverges to ∞. When C(f, g) · r > ρ∗(β), if we apply iHC to Σn^(−1)(g)X with bandwidth bn = log n and reject the null hypothesis when iHC∗n(bn, Σn^(−1)(g)Σn(f)Σn^(−1)(ḡ)) ≥ (log n)², then the Type I error converges to zero, and the power converges to 1.

7 The case of strong dependence

So far, we have only discussed weakly dependent cases. In this section, we investigate the case of strong dependence.

Suppose that we observe an n-variate Gaussian vector X = µ + Z, where µ contains a total of m signals, of equal strength to be specified, whose locations are randomly drawn from {1, 2, . . . , n} without replacement, and Z ∼ N(0, Σn), where we assume that Σn displays slowly decaying correlation:

Σn(j, k) = max{0, 1 − |j − k|^α n^(−α0)}, 1 ≤ j, k ≤ n, (7.1)

with α > 0 and 0 < α0 ≤ α. The range of dependence can be calibrated in terms of k0 = k0(n; α, α0), the largest integer k satisfying k < n^(α0/α). Clearly, k0 ≈ n^(α0/α). Seemingly, the most interesting range is 0 < α0/α ≤ 1. The following lemma establishes cases for which Σn is a correlation matrix.

Lemma 7.1 Let Σn be as in (7.1). For sufficiently large n, a necessary condition and a sufficient condition for Σn to be positive definite are, respectively, 0 ≤ α ≤ 2 and 0 < α0 ≤ α ≤ 1.

Lemma 7.1 is proved in Section 12. Model (7.1) has been studied in detail by Hall and Jin [21], who showed that the detectability of standard HC is seriously damaged by strong dependence. However, it remained open what the detection boundary is, and how to adapt HC to overcome the strong dependence and obtain optimal detection. This is what we address in the current section.

The key idea is to decompose the correlation matrix as the product of three matrices, each of which is relatively easy to handle. To begin with, we introduce the spectral density

fα(θ) = 1 − Σ_{k=1}^{∞} [ (k + 1)^α + (k − 1)^α − 2k^α ] cos(kθ). (7.2)



Figure 4: Display of C(fα, g0). x-axis: α. y-axis: C(fα, g0).

Note that the Fourier coefficients of fα(θ) satisfy the decay condition in (5.2) with λ = 2 − α. Also, we have the following lemma, which is derived in Section 12.

Lemma 7.2 For 0 < α < 1, we have ess inf_{−π≤θ≤π} fα(θ) > 0.

Next, let

g0(θ) = 1 − e^(−iθ), an = an(α0) = n^(α0)/2.

The Toeplitz matrix Σn(g0) is a lower triangular matrix with 1's on the main diagonal, −1's on the sub-diagonal, and 0's elsewhere. Additionally, let Dn be the diagonal matrix whose first diagonal entry is 1 and whose remaining diagonal entries are √an. Let X̃ = Dn · Σn(g0) · X.
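In code, the transform X̃ = Dn Σn(g0) X is just a scaled first difference. A minimal sketch follows (our implementation; we read an as n^(α0)/2, which makes the scaled differences have unit variance under (7.1)):

    import numpy as np

    def tilde_transform(x, alpha0):
        # X_tilde = D_n Sigma_n(g_0) X: first differences rescaled by sqrt(a_n).
        n = len(x)
        a_n = n ** alpha0 / 2.0             # under (7.1), Var(X_k - X_{k-1}) = 2 n^{-alpha0} = 1/a_n
        d = np.empty(n)
        d[0] = x[0]                         # the first coordinate is left unscaled by D_n
        d[1:] = np.sqrt(a_n) * np.diff(x)   # sqrt(a_n) * (X_k - X_{k-1})
        return d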

Then Model (7.1) can be rewritten equivalently as

X̃ = µ̃ + Z̃, where µ̃ = Dn · Σn(g0) · µ and Z̃ ∼ N(0, Σ̃n), (7.3)

with Σ̃n = Dn · Σn(g0) · Σn · Σ′n(g0) · Dn. The key point is that Σ̃n is asymptotically equivalent to the Toeplitz matrix generated by fα. In detail, introduce the block-diagonal matrix

Σ̌n = diag(1, Σn−1(fα)).

The following lemma is proved in Section 11.

Lemma 7.3 The spectral norm of Σ̃n − Σ̌n tends to 0 as n tends to ∞.

Additionally, note that µ̃ = √an · Σn(g0) · µ except for the first coordinate. Therefore, we expect Model (7.3) to be approximately equivalent to

X̃ = √an · Σn(g0) · µ + Z, where Z ∼ N(0, Σn(fα)).

This is a special case of the cluster model considered in Section 6 with f = fα and g = g0, except that the signal strength has been re-scaled by √an. Therefore, if we calibrate the nonzero entries in µ as

an^(−1/2) · An = an^(−1/2) · √(2r log n), (7.4)


then the detection boundary for the model is succinctly characterized by

r = (1/C(fα, g0)) · ρ∗(β), where C(fα, g0) = (2π)^(−1) ∫_{−π}^{π} (|g0(θ)|²/fα(θ)) dθ = π^(−1) ∫_{−π}^{π} ((1 − cos θ)/fα(θ)) dθ.
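C(fα, g0) can be approximated numerically from a truncated version of (7.2); the sketch below is our approximation (the truncation point K is an arbitrary choice), using |g0(θ)|² = 2(1 − cos θ):

    import numpy as np
    from scipy.integrate import quad

    def make_f_alpha(alpha, K=50000):
        # Truncated version of the spectral density (7.2).
        k = np.arange(1.0, K + 1.0)
        coef = (k + 1.0) ** alpha + (k - 1.0) ** alpha - 2.0 * k ** alpha
        return lambda theta: 1.0 - np.dot(coef, np.cos(k * theta))

    def C_f_alpha_g0(alpha):
        f = make_f_alpha(alpha)
        val, _ = quad(lambda t: 2.0 * (1.0 - np.cos(t)) / f(t), -np.pi, np.pi)
        return val / (2.0 * np.pi)

    print(C_f_alpha_g0(0.3))   # compare with the curve in Figure 4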

See Figure 4 for the display of C(fα, g0). The following theorem is proved in Section 10.

Theorem 7.1 Let 0 < α0 ≤ α < 1/2, β ∈ (1/2, 1), and r ∈ (0, 1). Assume X is generated according to Model (7.1), with the signal strength re-scaled as in (7.4). When C(fα, g0) · r < ρ∗(β), the null and alternative hypotheses merge asymptotically, and the sum of Type I and Type II errors of any test converges to 1 as n diverges to ∞. When C(fα, g0) · r > ρ∗(β), if we apply iHC to X̃ with bandwidth bn = log n and reject the null when iHC∗n(bn, Σ̃n) ≥ (log n)², then the Type I error converges to zero, and the power converges to 1.


Figure 5: Sum of Type I and Type II errors as described in experiment (a). From top to bottom, then from left to right, (β, r) = (0.5, 0.2), (0.5, 0.25), (0.55, 0.2), (0.55, 0.25). In each panel, the x-axis displays ρ, and three curves (blue, dashed green, and red) display the sum of errors corresponding to HC, HC-a, and HC-b.

8 Simulation study

We conducted a small-scale empirical study to compare the performance of iHC and standard HC. For iHC, we investigated two choices of bandwidth: bn = 1 and bn = log n. In this section, we denote standard HC, iHC with bn = 1, and iHC with bn = log n by HC, HC-a, and HC-b, respectively.

The algorithm for generating data comprises the following four steps. (1) Fix n, β, and r; let m = n^(1−β) and An = √(2r log n). (2) Given a correlation matrix Σn, generate a Gaussian vector Z ∼ N(0, Σn). (3) Randomly draw m integers ℓ1 < ℓ2 < . . . < ℓm from {1, 2, . . . , n} without replacement, and let µ be the n-vector such that µj = An if j ∈ {ℓ1, ℓ2, . . . , ℓm} and µj = 0 otherwise. (4) Let X = µ + Z. Using data generated in this manner we explored three parameter settings, (a)-(c), which we now describe.
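A sketch of steps (1)-(4) in Python (our code; the example parameters match experiment (a) below):

    import numpy as np
    from scipy.linalg import toeplitz

    def simulate(n, beta, r, Sigma, rng):
        m = int(round(n ** (1.0 - beta)))               # (1) number of signals and their strength
        A_n = np.sqrt(2.0 * r * np.log(n))
        Z = np.linalg.cholesky(Sigma) @ rng.standard_normal(n)   # (2) correlated noise
        locs = rng.choice(n, size=m, replace=False)     # (3) random signal locations
        mu = np.zeros(n)
        mu[locs] = A_n
        return mu + Z                                   # (4) observed vector

    rng = np.random.default_rng(0)
    Sigma = toeplitz(np.r_[1.0, 0.4, np.zeros(998)])    # tri-diagonal matrix with rho = 0.4
    X = simulate(1000, 0.5, 0.25, Sigma, rng)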


In experiment (a), we took n = 1000 and Σn(ρ) as the tri-diagonal Toeplitz matrix generated by f(θ) = 1 + 2ρ cos(θ), |ρ| < 1/2. The corresponding detection boundary was r = ρ∗(β)/C(f) with C(f) = (2π)^(−1) ∫_{−π}^{π} (1 + 2ρ cos(θ))^(−1) dθ. We considered all ρ ranging from −0.45 to 0.45 in increments of 0.05, and four pairs of parameters (β, r) = (0.5, 0.2), (0.5, 0.25), (0.55, 0.2), and (0.55, 0.25). (Note that the corresponding parameters (m, An) are (32, 1.66), (32, 2.63), (22, 1.66), and (22, 2.63).) For each triple (β, r, ρ), we generated data according to (1)-(4), applied HC, HC-a, and HC-b to both Z and X, and repeated the whole process independently 100 times. As a result, for each triple (β, r, ρ) and each procedure, we obtained 100 HC scores corresponding to the null hypothesis and 100 HC scores corresponding to the alternative hypothesis.

We report the results in two ways. First, we report the minimum sum of Type I and Type II errors (i.e. the minimum of the sum across all possible cut-off values); see Figure 5. Second, we pick the upper 10% percentile of the 100 HC scores corresponding to the null hypothesis as a threshold (for later reference, we call this the empirical threshold), and calculate the empirical power of the test (i.e. the fraction of HC scores corresponding to the alternative hypothesis that exceed the threshold). The empirical thresholds are displayed in Table 1 (to save space, only some of the thresholds are reported), and the power is displayed in Figure 6. Recall that in Theorem 4.2 we recommended (log n)² as a cut-off point in the asymptotic setting. For moderately large n, this cut-off point is conservative, and we recommend the empirical threshold instead.
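In code, the empirical threshold and power amount to two lines; in the sketch below, null_scores and alt_scores stand for the 100 scores under each hypothesis (hypothetical arrays):

    import numpy as np

    def empirical_power(null_scores, alt_scores, level=0.10):
        threshold = np.quantile(null_scores, 1.0 - level)   # upper 10% point of the null scores
        power = np.mean(alt_scores > threshold)             # fraction of alternative scores above it
        return threshold, power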

The results suggest the following. (1) HC-b outperforms HC-a, and HC-a outperforms HC. (2) As |ρ| increases (note that a larger |ρ| means stronger correlation), the detection problem becomes progressively easier, and the advantage of iHC becomes increasingly prominent. (3) Under the null hypothesis, the HC-b scores are usually smaller than those of HC and HC-a. This is mainly due to the normalization term √(2bn − 1) in the definition of iHC (see (4.8)).


Figure 6: Power as described in experiment (a). From top to bottom, then from left to right, (β, r) = (0.5, 0.2), (0.5, 0.25), (0.55, 0.2), (0.55, 0.25). In each panel, the x-axis displays ρ, and three curves (blue, dashed green, and red) display the power of HC, HC-a, and HC-b.

In experiment (b), we took Σn to be the Toeplitz matrix generated by f(θ) = 1 + (1/2) cos(θ) + 2ρ cos(2θ), where ρ ranged from −0.2 to 0.45 in increments of 0.05 (the matrix Σn is positive definite when ρ is in this range).


ρ      -0.45  -0.35  -0.25  -0.15  -0.05  0.05   0.15   0.25   0.35   0.45
HC     3.078  2.681  2.849  2.843  2.577  2.613  2.968  2.659  3.078  3.072
HC-a   2.637  2.889  2.759  2.806  2.689  2.657  3.083  2.788  2.679  2.670
HC-b   0.973  0.810  0.771  0.805  0.716  0.752  0.817  0.764  0.819  0.938

Table 1: Display of empirical thresholds in experiment (a) for different ρ.


Figure 7: Sum of Type I and Type II errors as described in experiment (b). From top to bottom, then from left to right, (β, r) = (0.5, 0.2), (0.5, 0.25), (0.55, 0.2), (0.55, 0.25). In each panel, the x-axis displays ρ, and three curves (blue, dashed green, and red) display the sum of errors corresponding to HC, HC-a, and HC-b.

Other parameters are the same as in experiment (a). The minimum sums of Type I and Type II errors are reported in Figure 7. The results similarly suggest that HC-b outperforms HC-a, and HC-a outperforms HC.

In experiment (c), we investigated the behavior of HC-a/HC-b/HC for larger n. We took (β, r) = (0.5, 0.25), n = 500 × (1, 2, 3, 4, 5), and Σn as the tri-diagonal matrix in experiment (a) with ρ = 0.4. The sum of Type I and Type II errors is reported in Table 2. The results suggest that the performance of HC-a/HC-b/HC improves as n gets larger. (Investigation of the case where n was much larger than 2500 required much greater computer memory, and so we omitted it.)

n      500   1000  1500  2000  2500
HC     .130  .150  .090  .115  .085
HC-a   .040  .030  .015  .025  .015
HC-b   .025  .010  .005  .005  0

Table 2: Display of the sum of Type I and Type II errors in experiment (c) for different n.


9 Discussion

We have extended standard HC to innovated HC by building in the correlation structure. The extreme diagonal entries of Σn^(−1) play a key role in the testing problem. If these extreme values have finite upper and lower limits, γ̄0 and γ̲0, then in the β-r plane the detection boundary is bounded by the curve r = γ̲0^(−1) · ρ∗(β) from above and by the curve r = γ̄0^(−1) · ρ∗(β) from below. When the correlation matrix is Toeplitz, the upper and lower limits merge and equal the Wiener interpolation rate C(f). The detection boundary is therefore r = (C(f))^(−1) · ρ∗(β). The detection boundary partitions the β-r plane into a detectable region and an undetectable region. Innovated HC has asymptotically full power for detection whenever (β, r) falls in the interior of the detectable region, and is therefore optimally adaptive.

9.1 Connection to recent literature

This work complements that of Donoho and Jin [14] and Hall and Jin [21]. The focus of [14] is standard HC and its performance in the uncorrelated case. The focus of [21] is how strong dependence may harm the effectiveness of standard HC; a possible remedy, however, was not explored there. The innovated HC proposed in the current paper is optimal both for the model in [14] and for that in [21].

The work is related to that of Jager and Wellner [26], where the authors proposed a family of goodness-of-fit statistics for detecting sparse normal mixtures. The work is also related to that of Meinshausen and Rice [32] and of Cai, Jin and Low [7], where the authors focused on how to estimate εn, the proportion of non-null effects.

Recently, HC was also found to be useful for feature selection in high-dimensional classification. See Donoho and Jin [15, 16] and Hall et al. [20]. That work concerned the situation where there are relatively few samples containing a very large number of features, out of which only a small fraction is useful, and each useful feature contributes weakly to the classification problem. In a related setting, Delaigle and Hall [13] investigated HC for classification when the data are non-Gaussian or dependent.

9.2 Future work

The work is also intimately connected to the recent literature on estimating covariance matrices. While our study focuses on situations where the correlation matrices can be estimated using other approaches (e.g. [9, 17, 18]), it can be generalized to cases where the correlation matrix is unknown but can be estimated from data. In particular, it is noteworthy that Bickel and Levina [4] showed that when the correlation matrix has polynomial off-diagonal decay, the matrix and its inverse can be estimated accurately in terms of the spectral norm. In such situations we expect the proposed approach to perform well once we combine it with that in [4].

Another interesting direction is to explore cases where the correlation matrix does not have polynomial off-diagonal decay, but is sparse in an unspecified pattern. This is a more challenging situation, as relatively little is known about the inverse of the correlation matrix.

Our study also opens opportunities for improving other recent procedures. Take the aforementioned work on classification [15, 16, 20], for example. The approach derived in this paper suggests ways of incorporating correlation structure into feature selection, and therefore raises hopes for better classifiers. For reasons of space, we leave explorations along these directions to future study.


10 Proofs of main results

In this section we prove all theorems in the preceding sections, except Theorems 2.1 and 5.1. These two theorems are direct results of Theorems 3.2 and 4.2, so we omit the proofs. For simplicity, we drop the subscript n whenever there is no confusion.

10.1 Proof of Theorem 3.1

Rewrite the second model in (3.1) as X + ξ = µ + ξ + Z, where, independently, Z ∼ N(0, Σ), ξ ∼ N(0, ∆), and µ ∼ G, with ∆ = Σ∗n − Σn and G some distribution. It suffices to show monotonicity of the Hellinger affinity. Denote the density function of N(0, Σ) by f(x) = f(x1, x2, . . . , xn), and write dx1 dx2 . . . dxn as dx for short. Then the Hellinger affinity corresponding to the second model in (3.1) is

h(Σ, ∆, G) ≡ ∫ √( (E∆ EG f(x − µ − ξ)) (E∆ f(x − ξ)) ) dx.

By Hölder's inequality and Fubini's theorem, h(Σ, ∆, G) is not less than

∫ E∆[ √(EG f(x − µ − ξ) f(x − ξ)) ] dx = E∆[ ∫ √(EG f(x − µ − ξ) f(x − ξ)) dx ].

Note that ∫ √(EG f(x − µ − ξ) f(x − ξ)) dx ≡ ∫ √(EG f(x − µ) f(x)) dx for any fixed ξ. It follows that

h(Σ, ∆, G) ≥ E∆[ ∫ √(EG f(x − µ − ξ) f(x − ξ)) dx ] = ∫ √(EG f(x − µ) f(x)) dx,

where the last term is the Hellinger affinity corresponding to the first model in (3.1). Combining these results gives the claim.

10.2 Proof of Theorem 3.2

It is sufficient to show that the Hellinger distance between the joint densities of X and Z converges to 0 as n diverges to ∞. By the assumption γ̄0 r < ρ∗(β), we can choose a sufficiently small constant δ = δ(r, β, γ̄0) such that γ̄0 (1 − δ)^(−2) r < ρ∗(β). Let µ̃ = µ/√(1 − δ), let U be the inverse of the Cholesky factorization of Σ, and let Ū be the banded version of U:

Ū(i, j) = U(i, j) if |i − j| ≤ log²(n), and Ū(i, j) = 0 otherwise.

Model (2.1) can be equivalently written (after multiplying X by (1 − δ)^(−1/2)) as

X = µ̃ + Z, where Z ∼ N(0, (1 − δ)^(−1) · Σ). (10.1)

The key to the proof is to compare Model (10.1) with the following model:

X = µ̃ + Z, where Z ∼ N(0, (Ū′Ū)^(−1)). (10.2)

In fact, by Theorem 3.1, to establish the claim it suffices to prove that (i) (Ū′Ū)^(−1) ≤ (1 − δ)^(−1)Σ for sufficiently large n, and (ii) the Hellinger distance between the joint density of X and that of Z associated with Model (10.2) tends to 0 as n diverges to ∞.


To prove the first claim, noting that Σ = (U ′U)−1, it suffices to show (1−δ)U ′U ≤ U ′U .Define W = U − U and observe that there is a generic constant C > 0 such that ‖U‖ ≤ Cand ‖W‖ ≤ C, whence ‖U ′U − U ′U‖ = ‖W ′W + U ′W + W ′U‖ ≤ C‖W‖. Moreover, by[22, Theorem 5.6.9], for any symmetric matrix, the spectral norm is no greater than the`1-norm. In view of the definitions of W and Θ∗n(λ, c0,M), the `1-norm of W is no greaterthan (log n)−2(λ−1). Therefore, ‖U ′U − U ′U‖ ≤ C‖W‖ ≤ C(log n)−2(λ−1). This, and thefact that all eigenvalues of U ′U are bounded from below by a positive constant, imply theclaim.

We now consider the second claim. Model (10.2) can be equivalently written as X =U µ + Z where Z ∼ N(0, In). The key to the proof is that U is a banded matrix and µis a sparse vector where with probability converging to 1, the inter-distances of nonzerocoordinates are no less than 3(log n)2 (see Lemma 11.2 for the proof). As a result thenonzero coordinates of U µ are disjoint clusters of sizes O(log2 n), which simplifies thecalculation of the Hellinger distance. The derivation of the claim is lengthy, so we summarizeit in the following lemma, which is proved in Section 11.

Lemma 10.1 Fix β ∈ (12 , 1), r ∈ (0, 1), and δ ∈ (0, 1) such that γ0(1− δ)−2r < ρ∗(β). As

n tends to ∞ the Hellinger distance associated with Model (10.2) tends to 0.

10.3 Proof of Theorem 4.1

Put Y = Un(Σn)X, ν = Un(Σn)µ, and Z = Un(Σn)z. Model (4.1) reduces to

Y = ν + Z, Z ∼ N(0, In). (10.3)

Recalling thatHC∗n/√

2 log log n→ 1 in probability underH0, it follows that PH0Reject H0tends to 0 as n diverges to ∞, and it suffices to show P

H(n)1

Accept H0 → 0.

The key to the proof is to compare Model (10.3) with

Y ∗ = ν∗ + Z where Z ∼ N(0, In), (10.4)

with ν∗ having m nonzero entries of equal strength (1−δn)An whose locations are randomlydrawn from 1, 2, . . . , n without replacement. By (4.2)–(4.3) and the way Θ∗n(δn, bn) isdefined, we note that νj ≥ (1− δn)An for all j ∈ `1, `2, . . . , `m. Therefore,

signals in ν is both denser and stronger than those in ν∗. (10.5)

Intuitively, standard HC applied to Model (10.3) is no “less” than that applied to Model(10.4).

We now establish this point. Let F0(t) be the survival function of the central χ2-distribution χ2

1(0), and let Fn(t) and F ∗n be the empirical survival function of Y 2k nk=1 and

(Y ∗k )2nk=1, respectively. Using arguments similar to those of Donoho and Jin [14] it canbe shown that standard HC applied to Models (10.3) and (10.4), denoted by HC

(1)n and

HC(2)n for short, can be rewritten as

HC(1)n = sup

t: 1/n≤F0(t)≤1/2

√n(Fn(t)− F0(t))√

F0(t)F0(t)

,

HC(2)n = sup

t: 1/n≤F0(t)≤1/2

√n(F ∗n(t)− F0(t))√

F0(t)F0(t)

,

20

Page 21: Innovated Higher Criticism for Detecting Sparse Signals in ... · Innovated Higher Criticism for Detecting Sparse Signals in Correlated Noise Peter Hall1 and Jiashun Jin2 Abstract

respectively. The key fact is now that the family of non-central χ2-distribution χ21(δ), δ ≥

0 is a monotone likelihood ratio family (MLR), i.e. for any fixed x and δ2 ≥ δ1 ≥ 0,Pχ2

1(δ2) ≥ x ≥ Pχ21(δ1) ≥ x. Consequently, it follows from (10.5) and mathematical

induction that for any x and t, PF ∗n(t) ≥ x ≥ PFn(t) ≥ x. Therefore, for any fixedx > 0,

PHC(1)n < x ≤ PHC(2)

n < x. (10.6)

Finally, by an argument similar to that of Donoho and Jin [14, Section 5.1], the secondterm in (10.6) with x = 1.01

√2 log log n tends to 0 as n diverges to ∞. This implies the

claim.

10.4 Proof of Theorem 4.2

In view of Lemma 4.2, it suffices to show that PH

(n)1

Accept H0 → 0. Put U = U(bn),

V = Vn(bn), Y = V X, ν = V µ, Z = V Z. Model (4.7) reduces to

Y = ν + Z where Z ∼ N(0, U ′U). (10.7)

Let Fn(t) and F0(t) be the empirical survival function of Y 2k nk=1 and the survival func-

tion of χ21(0), respectively. Let q = q(β, r) = min(β + γ0r)2/(4γ0r), 4γ0r and set t∗n =√

2q log n. Since γ0r < ρ∗(β), then it can be shown that 0 < q < 1 and n−1 ≤ F0(t∗n) ≤ 1/2for sufficiently large n. Using an argument similar to that in the proof of Theorem 4.1,

iHC∗n = sups: 1/n≤F0(s)≤1/2

√n(Fn(s)− F0(s))√

(2bn − 1)F0(s)(1− F0(s))≥

√n(Fn(t∗n)− F0(t∗n))√

(2bn − 1)F0(t∗n)(1− F0(t∗n)),

and it follows that

PiHC∗n ≤ log3/2(n) ≤ P√n(Fn(t∗n)− F0(t∗n))√

(2bn − 1)F0(t∗n)(1− F0(t∗n))≤ log3/2(n). (10.8)

It remains to show that the right hand side of (10.8) is algebraically small. The proof needsdetailed calculation which we summarize in the lemma below, the proof of which is givenin Section 11.

Lemma 10.2 Under the condition of Theorem 4.2, the right hand side of (10.8) tends to0 algebraically fast as n diverges to ∞.

10.5 Proof of Theorem 6.1

Inspection of the proof of Theorems 3.2 and 4.2 reveals that the condition that Σn isa correlation matrix and that Σn ∈ Θ∗n(λ, c0,M) in those theorems can be relaxed. Inparticular, Σn need not have equal diagonal entries and the decay condition on Σn can bereplaced by a weaker condition that concerns the decay of Un (the inverse of the Choleskyfactorization of Σn), specifically

|Un(i, j)| ≤M(1 + |i− j|λ)−1.

Let Un(f) be the inverse of the Cholesky factorization of Σn(f), and define Un =Un(f)Σn(g). Since Σn(g) is a lower triangular matrix with positive diagonal entries, thenit is seen that Un is the inverse of the Cholesky factorization of Σn. By Lemma 3.2, Un(f)

21

Page 22: Innovated Higher Criticism for Detecting Sparse Signals in ... · Innovated Higher Criticism for Detecting Sparse Signals in Correlated Noise Peter Hall1 and Jiashun Jin2 Abstract

has polynomial off-diagonal decay with the parameter λ. It follows that Un decays at thesame rate. Applying Theorems 3.2 and 4.2, we see that all that remains to prove is that

max√n≤k≤n−

√n|Σ−1

n (k, k)− C(f, g)| → 0. (10.9)

By [5, Theorem 2.15], for any√n ≤ k ≤ n−

√n, k −K ≤ j ≤ k +K, and 1 ≤ λ′ < λ,

|Σ−1n (f)(k, j)− (Σn(1/f))(k, j)| = o(n−(1−λ′)/2).

Since Σ−1n = Σn(g) · Σ−1

n (f) · Σn(g), it follows that sup√n≤k≤n−√n |Σ−1n (k, k) − (Σn(g) ·

Σn(1/f) · Σn(g))(k, k)| → 0. Moreover, direct calculations show that (Σn(g) · Σn(1/f) ·Σn(g))(k, k) = C(f, g),

√n ≤ k ≤ n −

√n. Combining these results gives (10.9) and

concludes the proofs.

10.6 Proof of Theorem 7.1

Consider the first claim. It suffices to show that the Hellinger distance between X and Zin Model (7.3) tends to 0 as n diverges to ∞. Since C(fα, g0) · r < ρ∗(β), there is a smallconstant δ > 0 such that (1 − δ)−1 · C(fα, g0) · r < ρ∗(β). Using Lemma 7.2, we see thatΣn−1(fα) is a positive matrix the smallest eigenvalue of which is bounded away from 0.It follows from Lemma 7.3 and basic algebra that Σ ≥ (1 − δ)Σn for sufficiently large n.Compare Model (7.3) with

X∗ = µ+ Z∗ where Z∗ ∼ N(0, (1− δ)Σ). (10.10)

By the monotonicity of Hellinger distance (Theorem 3.1), it suffices to show that theHellinger distance between X∗ and Z∗ tends to 0 as n diverges to ∞.

Now, by the definition of µ, µ − √an · Σn(g0) · µ = (µn,√an · µn, 0, . . . , 0)′. Since

Pµn 6= 0 = o(1) then, except for an event with negligible probability, µ = µ. Therefore,replacing µ by

√an ·Σn(g0) ·µ in Model (10.10) alters the Hellinger distance only negligibly.

Note that the first coordinate of X∗ is uncorrelated with all other coordinates, and itsmean equals 0 with probability converging to 1, so removing it from the model only hasa negligible effect on the Hellinger distance. Combining these properties, Model (10.10)reduces to the following with only a negligible difference in the Hellinger distance:

X∗(2 : n) = Σn−1(g0)(√an · µ(2 : n)) + Z∗(2 : n), Z∗(2 : n) ∼ N(0, (1− δ)Σn−1(fα)),

where X(2 : n) denotes the vector X with the first entry removed. Dividing both sides by√1− δ, this reduces to the following model:

X(2 : n) = Σn−1(g0)(√an · µ(2 : n)√

1− δ)

+ Z(2 : n), Z(2 : n) ∼ N(0, Σn−1(fα)), (10.11)

which is in fact Model (6.1) considered in Section 6. It follows from (7.4) that√an · µ(2 :

n)/√

1− δ has m nonzero coordinates each of which equals√

2(1− δ)−1r log n. ComparingModel (10.11) with Model (6.1) and recalling that (1−δ)−1 ·r ·C(fα, g0) < ρ∗(β), the claimfollows from Theorem 6.1.

Consider the second claim. Since C(fα, g0) · r > ρ∗(β), then there is a small constantδ > 0 such that (1 − δ) · r · C(fα, g0) > ρ∗(β). Let Un be the inverse of the Cholesky

22

Page 23: Innovated Higher Criticism for Detecting Sparse Signals in ... · Innovated Higher Criticism for Detecting Sparse Signals in Correlated Noise Peter Hall1 and Jiashun Jin2 Abstract

factorization of Σn, and let Un(bn) and Vn(bn) be as defined right below (4.5). Write Model(7.1) equivalently as

V X = V µ+ V Z where V Z ∼ N(0, U ′(bn)U(bn)).

Recall that U ′(bn)U(bn) is a banded correlation matrix with bandwidth 2bn − 1. Let`1, `2, . . . , `m be the m locations of nonzero means of µ. By an argument similar to that inthe proof of Theorem 4.2, all remains to show is that, except for an event with negligibleprobability,

(V µ)k ≥√

2r′ log n for some constant r′ > ρ∗(β) and all k ∈ `1, `2, . . . , `k. (10.12)

We now show (10.12). First, by Lemma 4.1 and (7.4), except for an event with negligibleprobability,

(V µ)k ≥ (1− δ)1/4 · (an · Σn(k, k))−1/2 ·An, k ∈ `1, `2, . . . , `m.

Second, by the way Σn is defined,

(anΣ−1n )(k, k) = (Σn(g0) · Σ−1

n · Σn(g0))(k, k), for all k ≥ 2,

and by the way Σn is defined and Lemma 7.3, for sufficiently large n,

Σ−1n ≥ (1− δ)−1/2Σ−1

n , and so Σn(g0)Σ−1n Σn(g0) ≥ (1− δ)1/2Σn(g)Σ−1

n Σn(g).

Last, by [5, Theorem 2.15], |(Σn(g0) ·Σ−1n ·Σn(g0))(k, k)−C(fα, g0)| = o(1) when mink, n−

k is sufficiently large. Combining these results gives (10.12) with r′ = (1− δ) · r ·C(fα, g0),and the claim follows directly.

11 Appendix

This section contains proofs for all lemmas in preceding sections, except Lemmas 3.1, 5.1,7.1, and 7.2. Lemma 3.1 is the direct result of [35] and Lemma 5.1 is the direct result of[5, Theorem 2.15], so we omit the proofs. Lemmas 7.1 and 7.2 are proved in Section 12.

11.1 Proof of Lemma 3.2

Consider the first claim. Construct an infinite matrix Σ∞ by arranging the finite matricesalong the diagonal, and note that the inverses of Σ∞ is the matrix formed by arranging theinverse of the finite matrices along the diagonal. Since Σ∞(i, j) ≤ M(1 + |i− j|λ)−1, thenapplying Lemma 3.1 gives the claim.

Consider the second claim. It suffices to show that |Un(k, j)| ≤ C/(1 + |k − j|λ) for all1 ≤ j < k ≤ n. Denote the first k× k main diagonal sub-matrix of Σn by Σk, the k-th rowof Σk by (ξ′k−1, 1), and the k-th row of Un by u′k. It follows from direct calculations that

u′k =(1− ξ′k−1Σ−1

k−1ξk−1

)−1/2 · (ξ′k−1Σ−1k−1, 1). (11.1)

At the same time, by (11.1) and basic algebra,(1− ξ′k−1Σ−1

k ξk−1

)−1 ≤ u′kuk = Σ−1k (k, k). (11.2)

23

Page 24: Innovated Higher Criticism for Detecting Sparse Signals in ... · Innovated Higher Criticism for Detecting Sparse Signals in Correlated Noise Peter Hall1 and Jiashun Jin2 Abstract

Combining (11.1) and (11.2) gives

|Un(k, j)| = |uk(j)| ≤ C|(Σ−1k−1ξk−1)j |, 1 ≤ j ≤ k − 1. (11.3)

Now, by Lemma 3.1, |Σ−1k−1(j, s)| ≤ C(1 + |j − s|λ)−1 for all 1 ≤ i, j ≤ k − 1. Note that

|ξk−1(s)| ≤ C(1 + |s− k|λ)−1, 1 ≤ s ≤ n, and λ > 1. It follows from basic algebra that

|(Σ−1k−1ξk−1)j | ≤

n∑s=1

C

(1 + |j − s|λ)(1 + |s− k|λ)≤ C

1 + |k − j|λ. (11.4)

Inserting (11.4) into (11.3) gives the claim.

11.2 Proof of Lemma 4.1

Without loss of generality, assume `1 < `2 < . . . < `m. By Lemma 11.2, except for anevent with negligible probability, `1 ≥ bn, `m ≤ n − bn, and the inter-distances of the `j ’s≥ C log n · n2β−1. For any k ∈ `1, `2, . . . , `m, let dk =

(∑k+bn−1j=k u2

jk

)−1/2. By the wayU(bn) is defined,

(U ′(bn)Uµ)k = dk

n∑s,j=1

uksusjµj = dk

[ n∑s,j=1

uksusjµj −n∑

s,j=1

(uks − uks)usjµj]. (11.5)

Consider dk first. Write

1/d2k =

k+bn−1∑j=k

u2jk =

n∑j=k

u2jk −

n∑j=k−bn

u2jk.

First, U ′U = Σ−1,∑n

j=k u2jk = (U ′U)(k, k) = (Σ−1)(k, k). Second, by the polynomial

off-diagonal decay of U and basic calculus,

n∑j=k+bn

u2jk ≤ C

n∑j=k+bn

11 + |j − k|λ

≤ Cb1−2λn .

Last, note that the quantities Σ−1(k, k) are uniformly bounded away from 0 and ∞. Com-bining these results gives

|dk −√

Σ−1(k, k)| ≤ Cb1−2λn . (11.6)

Consider∑n

s,j=1 uksusjµj next. Recall that µj = An when j ∈ `1, `2, . . . , `m andµj = 0 otherwise. Since U ′U = Σ−1,

n∑s,j=1

uksusjµj =n∑j=1

(Σ−1)(k, j)µj = AnΣ−1(k, k) +An∑`s 6=k

Σ−1(k, `s).

Define Ln = nβ−1/2. By Lemma 11.2, except for an event with negligible probability, theinter-distance of `j is no less than Ln. So by the polynomial off-diagonal decay of Σ−1, thesecond term is algebraically small. Therefore,

n∑s,j=1

uksusjµj = An[(Σ−1)(k, k) + o(b1−λn )]. (11.7)

24

Page 25: Innovated Higher Criticism for Detecting Sparse Signals in ... · Innovated Higher Criticism for Detecting Sparse Signals in Correlated Noise Peter Hall1 and Jiashun Jin2 Abstract

Last, we consider∑n

s,j=1(uks − uks)usjµj . Direct calculations show that

|((U − U)′U)(k, j)| ≤

C

1+|k−j|λ , λ > 1,C logn

1+|k−j|λ , λ = 1,

so by a similar argument,∣∣∣∣ n∑s,j=1

(uks − uks)usjµj∣∣∣∣ =

∣∣∣∣ n∑j=1

((U − U)′U)(k, j)µj

∣∣∣∣ ≤ An · ((U − U)′U)(k, k) + o(1),

where o(1) is algebraically small. Moreover, by the Cauchy-Schwartz inequality,

((U − U)′U)(k, k) ≤n∑s=1

|(uks − uks)usk| ≤ b1/2−λn ,

and the claim follows.

11.3 Proof of Lemma 4.2

Without loss of generality, suppose n is divisible by 2bn − 1, and let N = N(n, bn) =n/(2bn − 1). Let pi be N iid samples from U(0, 1), and FN (t) be the empirical cdf. Thenormalized uniform stochastic process is defined as

WN (t) =√N [FN (t)− t]/

√t(1− t).

The following lemma is proved in Section 11.3.1.

Lemma 11.1 There is a generic constant C > 0 such that for sufficiently large n,

P

sup1/n≤t≤1/2

|WN (t)| ≥ C(log n)3/2≤ Cn−C .

We now prove Lemma 4.2. Define Y = U ′UX. Under the null hypothesis, Y ∼N(0, U ′U) and the coordinates Yk are block-wise dependent with a bandwidth ≤ 2bn − 1.Split the Yk’s into 2bn−1 different subsets Ωj = Yk : k ≡ j mod (2bn−1), 1 ≤ j ≤ 2bn−1.Note that the Yk’s in each subset are independent, and that |Ωj | = N , 1 ≤ j ≤ 2bn − 1.

Let Fn(t) and F0(t) be as in the proof of Theorem 4.1, and let

Fn,j =2bn − 1n

n∑k=1

1Y 2k ≥t, Yk∈Ωj, 1 ≤ j ≤ 2bn − 1.

Note that Fn(t) = 12bn−1

∑2bn−1j=1 Fn,j(t). By arguments similar to that of Donoho and Jin

[14], and basic algebra, it follows that

iHC∗n = supt

√n(Fn(t)− F0(t))√

(2bn − 1)F0(t)F0(t)

2bn−1∑j=1

supt

√N(Fn,j(t)− F0(t))√

F0(t)F0(t)

,

and so for any x > 0,

PiHC∗n ≥ x ≤2bn−1∑j=1

P

supt

√N(Fn,j(t)− F0(t))√

F0(t)F0(t)

≥ x

.

25

Page 26: Innovated Higher Criticism for Detecting Sparse Signals in ... · Innovated Higher Criticism for Detecting Sparse Signals in Correlated Noise Peter Hall1 and Jiashun Jin2 Abstract

Finally, since that Fn,j ’s are the empirical survival functions of N independent samplesfrom χ2

1(0), then

sup1/n≤F0(t)≤1/2

√N(Fn,j(t)− F0(t))√

F0(t)F0(t)

= sup

1/n≤t≤1/2WN (t) in distribution.

Therefore,PiHC∗n ≥ x ≤ (2bn − 1)P sup

1/n≤t≤1/2WN (t) ≥ x.

Taking x = C(log n)3/2, the claim follows from Lemma 11.1.

11.3.1 Proof of Lemma 11.1

By the Hungarian construction [10], there is a Brownian bridge B(t) such that

P sup1/n≤t≤1/2

|√N(FN (t)− t)− B(t)| ≥ C(logN + x)√

N ≤ Ce−Cx,

where C > 0 are generic constants. Noting that 1/√t(1− t) ≤

√n ≤ C

√N logN when

1/n ≤ t ≤ 1/2, it follows that

P

sup

1/n≤t≤1/2

∣∣∣∣√N(FN (t)− t)− B(t)√

t(1− t)

∣∣∣∣ ≥ C(logN)1/2(logN + x)≤ Ce−Cx. (11.8)

At the same time, by [33, Page 446],

P

sup

1/n≤t≤1/2

∣∣∣∣ B(t)√t(1− t)

∣∣∣∣ ≥ C(logN)1/2x

≤ C logN · e−Cx. (11.9)

Combining (11.8)–(11.9), taking x = C logN and using triangle inequality, gives the claim.

11.4 Proof of Lemma 7.3

By direct calculations and the way Σ is defined, we have

Σ =(

Σ∗ ξn−1

ξ′n−1 1

), (11.10)

where

ξ′n−1 =√

2n−α × (0, . . . , nα0 − k0(n)α, k0(n)α − (k0(n)− 1)α, . . . , 2α − 1, 1) (11.11)

and Σ∗ is a symmetric matrix with unit diagonal entries, and with the following on thek-th sub-diagonal:

12·

2kα − (k + 1)α − (k − 1)α, k ≤ k0(n)− 1,1 + ((k − 1)α − 2kα)/nα0 = O(n−α0/α), k = k0(n),−(1− (k − 1)α/nα0) = O(n−α0/α), k = k0(n) + 1,0, k ≥ k0(n) + 2.

26

Page 27: Innovated Higher Criticism for Detecting Sparse Signals in ... · Innovated Higher Criticism for Detecting Sparse Signals in Correlated Noise Peter Hall1 and Jiashun Jin2 Abstract

Note that Σn−1(g0) and Σ∗ share the same 2k0(n)− 1 sub-diagonals that are closest to themain diagonal (including the main diagonal). Let H1 be the matrix containing all othersub-diagonals of Σn−1(g0), and let H2 be the the matrix which contains the k0(n)-th andthe (k0(n) + 1)-th diagonals (upper and lower) of Σ∗. It is seen that

Σ− Σ =(H1 00 0

)+(H2 00 0

)+(

0 ξ′n−1

ξ′n−1 0

)≡ B1 +B2 +B3.

Let ‖ · ‖1 and ‖ · ‖2 denote the `1 matrix norm and the `2 matrix norm, respectively. First,by direct calculations, since α < 1/2, ‖B1 + B2‖1 ≤ Cnα0(α−1)/α ≤ Cn−α0 . At the sametime, by (11.11) and since α < 1/2,

‖B3‖2 ≤C

nα0

n∑k=1

(kα − (k + 1)α)2 ≤ C

nα0

n∑k=1

k2α−2 ≤ C/nα0 .

Since the spectral norm is no greater than the `1-matrix norm and the `2-matrix norm, thespectral norm of B1 +B2 +B3 is no greater than Cn−α0/2, and the claim follows.

11.5 Proof of Lemma 10.1

Let a =√

(1− δ)/γ0, r′ = γ0(1 − δ)−2r, U1 = aU , and ˜µ = 1a µ. Model (10.2) can be

equivalently written as

X = U µ+ Z = U1˜µ+ Z where Z ∼ N(0, In). (11.12)

Using the argument in the first paragraph of the proof of Theorem 3.2 it is not hard to verifythat (I) ˜µ has m = n1−β nonzero entries each of which is equal to

√2r′ log n with r′ < ρ∗(β),

and whose locations are randomly sampled from (1, 2, . . . , n); (II) U1, where U1(k, j) = 0 if|k− j| > (log n)2, is a banded lower triangular matrix; and (III) limn→∞max√n≤k≤n−√n(U ′1U1)(k, k) = (1− δ) < 1.

From now on, write µ = ˜µ and r = r′ for short. Note that the Hellinger affinityassociated with Model (10.2) is E0(

√W ∗n), where E0 denotes the law Z ∼ N(0, In), and

W ∗n = W ∗n(r, β;Z1, Z2, . . . , Zn) =(n

m

)−1 ∑`=(`1,`2,...,`m)

eµ′`U

′1Z−‖U1µ`‖2/2.

Introduce the set of indices

Sn =` = (`1, `2, . . . , `m), min

1≤j≤m−1|`j+1 − `j | ≥ 3(log n)2, `1 ≥

√n, n− `m ≥

√n.

(11.13)The following lemma is proved in Section 11.5.1.

Lemma 11.2 Let `1 < `2 < . . . < `m be m distinct indices randomly sampled from(1, 2, . . . , n) without replacement. Then for any 1 ≤ K ≤ n, (a) P`1 ≤ K ≤ Km/n,(b) P`m ≥ n−K ≤ Km/n, and (c) Pmin1≤i≤m−1|`i+1− `i| ≤ K ≤ Km(m+ 1)/n.As a result, P` = (`1, `2, . . . , `m) /∈ Sn = O(log n)2 n1−2β = o(1).

Applying Lemma 11.2, we make only a negligible difference by restricting ` to Sn anddefining

Wn =1(nm

) ∑`=(`1,`2,...,`m)∈Sn

eµ′`U

′1Z−‖U1µ`‖2/2, (11.14)

27

Page 28: Innovated Higher Criticism for Detecting Sparse Signals in ... · Innovated Higher Criticism for Detecting Sparse Signals in Correlated Noise Peter Hall1 and Jiashun Jin2 Abstract

in which caseE(W 1/2

n ) = E(W ∗n1/2) + o(1). (11.15)

Define Y = U ′1Z,σ2j = var(Yj) ≡ (U ′1U1)(j, j), 1 ≤ j ≤ n, (11.16)

and the eventDn = Yj/σj ≤

√2 log n, 1 ≤ j ≤ n.

By direct calculation, PDcn = o(1), and so by Holder’s inequality, E(W 1/2

n 1Dcn) =

E(W 1/2n )+o(1). Combining this result and (11.15) we deduce that E(W ∗n

1/2) = E(W 1/2n 1Dn)+

o(1), and comparing this property with the desired result we see that it is is sufficient toshow that

E(W 1/2n 1Dn) = 1 + o(1). (11.17)

The key to (11.17) is the following lemma, which is proved in Section 11.5.2.

Lemma 11.3 Consider Model (11.12), where U1 and µ satisfy (I)–(III). As n → ∞,E(Wn 1Dn) = 1 + o(1), and E(W 2

n 1Dn) = 1 + o(1).

Since

|W 1/2n 1Dn − 1| ≤

|Wn 1Dn − 1|

1 +W1/2n 1Dn

≤ |Wn 1Dn − 1|,

then by Holder’s inequality,(E|W 1/2

n 1Dn − 1|)2 ≤ ∣∣Wn 1Dn − 1

∣∣2 = E(W 2n 1Dn

)− 2E

(Wn 1Dn

)+ 1. (11.18)

Combining (11.18) with Lemma 11.3 gives (11.17).

11.5.1 Proof of Lemma 11.2

The last claim follows once (a)–(c) are proved. Consider (a)–(b) first. Fixing K ≥ 1,

P`1 = K =

(n−Km−1

)(nm

) = m(n−m)(n−m− 1) . . . (n−m−K + 2)

n(n− 1) . . . (n−K + 1)≤ m/n,

so P`1 ≤ K ≤ Km/n. Similarly, Pn− `m ≤ K ≤ Km/n. This gives (a) and (b).Next we prove (c). Denote the minimum inter-distance of `1, `2, . . . , `m by

L(`) = L(`;m,n) = min1≤i≤m−1|`i+1 − `i|.

Note that

PL(`) = K ≤m−1∑j=1

`j+1 − `j = K ≤m−1∑j=1

n∑k=1

P`j = k, `j+1 = k +K.

Writing P`j = k, `j+1 = k +K =(nm

)−1(k−1j−1

)(n−k−Km−j−1

), we have:

PL(`) = K ≤ 1(nm

) m−1∑j=1

n∑k=j

(k − 1j − 1

)(n− k −Km− j − 1

)=

1(nm

) n∑k=1

k∑j=1

(k − 1j − 1

)(n− k −Km− j − 1

),

where the last term is no greater than

1(nm

) n∑k=1

(n−K − 1m− 2

)≤ n(

n−2m

)( n

m− 2

)≤ m2/n,

and the claim follows.

28

Page 29: Innovated Higher Criticism for Detecting Sparse Signals in ... · Innovated Higher Criticism for Detecting Sparse Signals in Correlated Noise Peter Hall1 and Jiashun Jin2 Abstract

11.5.2 Proof of Lemma 11.3

Define Tn =√

2 log n. We need the following lemma, which is proved in Section 12.

Lemma 11.4 Consider a bivariate zero mean normal variable (X,Y )′ that satisfies Var(X) =σ2

1, Var(Y ) = σ22, and corr(X,Y ) = %, where c0 ≤ σ1, σ2 ≤ 1 for some constant c0 ∈ (0, 1).

Then there is a constant C > 0 such that for sufficiently large n,

E[exp(AnX − σ2

1A2n/2)· 1Y/σ2>Tn

]≤ C · n−(1−%

√r)2 ≤ Cn−(1−

√r)2 ,

E[exp(An(X + Y )− σ2

1 + σ22

2A2n

)· 1X/σ1≤Tn,Y/σ2≤Tn

]≤ Cn−d(r),

where d(r) = min2r, 1− 2(1−√r)2.

We also need the following definition.

Definition 11.1 We say that two indices j and k are near to each other if |j−k| ≤ (log n)2.

We now proceed to show Lemma 11.3. Consider the first claim. Note that for any` = (`1, `2, . . . , `m) ∈ Sn, the minimum inter-distance of `i is no less than 3(log n)2. In viewof the definition of Yj and σj (see (11.16)), we have

‖U1µ`‖2 = A2n

m∑i=1

(U ′1U1)(`i, `i) = A2n

m∑i=1

σ2`i.

Consequently, we can rewrite Wn as

Wn =1(nm

) ∑`=(`1,`2,...,`m)∈Sn

exp(An

m∑i=1

Y`i −A2n

2

m∑i=1

σ2`i

). (11.19)

Note that

1Dcn ≤n∑j=1

1Yj/σj>Tn. (11.20)

Combining (11.19) and (11.20) gives

E[Wn · 1Dcn] ≤1(nm

) ∑`=(`1,...,`m)∈Sn

n∑k=1

E

[exp

(An

m∑j=1

Y`j −A2n

2

m∑j=1

σ2`j

)· 1Yk/σk>Tn

].

(11.21)Now, for each 1 ≤ k ≤ n, when k is near one `j , say `j0 , Yk must be independent of allother Y`j with j 6= j0. It follows that

E

[exp

(An

m∑j=1

Y`j −A2n

2

m∑j=1

σ2`j

)·1Yk/σk>Tn

]= E

[exp(AnY`j0 −σ

2j0A

2n/2)·1Yk/σk>Tn

].

By Lemma 11.4, the right hand side is no greater than Cn−(1−√r)2 . Therefore,

E

[exp

(An

m∑j=1

Y`j −A2n

2

m∑j=1

σ2`j

)· 1Yk/σk>Tn

]≤ Cn−(1−

√r)2 . (11.22)

29

Page 30: Innovated Higher Criticism for Detecting Sparse Signals in ... · Innovated Higher Criticism for Detecting Sparse Signals in Correlated Noise Peter Hall1 and Jiashun Jin2 Abstract

Moreover, for each fixed ` = (`1, . . . , `m) ∈ Sn, there are at most 2m(log n)2 different indicesk that can be near some of the `j ’s; and when they are, they can be near only one such `j .Combining these results gives

E[Wn · 1Dcn] ≤1(nm

) ∑`=(`1,...,`m)∈Sn

C(log n)2mn−(1−√r)2 ≤ C(log n)2n(1−β)−(1−

√r)2 .

(11.23)By the definition of ρ∗(β) and the assumption of the lemma, r < ρ∗(β) ≤ (1 −

√1− β)2,

and so the first claim follows directly from (11.23).We now consider the second claim. Fix 0 ≤ N ≤ m, and let SN (`) denote the set of

all k = (k1, k2, . . . , km) ∈ Sn such that there are exactly N kj ’s that are near to one `i.(Clearly, any kj can be near to at most one `i). The two sets of indices (`1, `2, . . . , `m) and(k1, k2, . . . , km) form exactly N pairs, each contains one candidate from the first set andone candidate from the second. These pairs are not near to each other, and not near to anyremaining indices outside the pairs. Using (11.19), we write

E[W 2n · 1Dn] =

(n

m

)−2 ∑`=(`1,`2,...,`m)∈Sn

m∑N=0

∑k=(k1,k2,...,km)∈SN (`)

× E[exp

(An

m∑i=1

(Y`i + Yki)−A2n

2

m∑i=1

(σ2`i

+ σ2ki

))· 1Dn

]. (11.24)

For any fixed ` and k ∈ SN (`), by symmetry, and without loss of generality, we supposethe N pairs are (`1, k1), (`2, k2), . . . , (`N , kN ). By independence of the pairs with otherindices, and also by independence among the pairs,

E

[exp

(An

m∑j=1

(Y`j + Ykj )−A2n

2

m∑j=1

(σ2`j

+ σ2kj

))· 1Dn

]

≤ E[

exp(An

m∑j=1

(Y`j + Ykj )−A2n

2

m∑j=1

(σ2`j

+ σ2kj

))· 1Y`j /σ`j≤Tn,Ykj /σkj≤Tn, for all 1 ≤ j ≤ N

]

≤ E[

exp(An

N∑j=1

(Y`j + Ykj )−A2n

2

N∑j=1

(σ2`j

+ σ2kj

))· 1Y`j /σ`j≤Tn,Ykj /σkj≤Tn, for all 1 ≤ j ≤ N

]

= ΠNj=1

(E

[exp

An(Y`j + Ykj )−

A2n

2(σ2`j

+ σ2kj

)· 1Y`j /σ`j≤Tn,Ykj /σkj≤Tn

]). (11.25)

Here, in the first inequality, we have used the fact that

1Dn ≤ 1Y`j /σ`j≤Tn,Ykj /σkj≤Tn, for all 1 ≤ j ≤ N;

in the second inequality, we have utilized the independence and the fact that

E[exp(AnYj − σ2

jA2n/2)] = 1, for all j = 1, . . . , n;

and in the third equality, we have used again the independence. Moreover, in view of theway U1 is defined and Lemma 3.2, there is a constant c0 ∈ (0, 1) such that σj ∈ [c0, 1].Using Lemma 11.4, for sufficiently large n and each 1 ≤ j ≤ N ,

E

[exp

(An(Y`j + Ykj )−

A2n

2(σ2`j

+ σ2kj

))· 1Y`j /σ`j≤Tn,Ykj /σkj≤Tn

]≤ Cnd(r), (11.26)

30

Page 31: Innovated Higher Criticism for Detecting Sparse Signals in ... · Innovated Higher Criticism for Detecting Sparse Signals in Correlated Noise Peter Hall1 and Jiashun Jin2 Abstract

with d(r) being as in Lemma 11.4. Combining (11.25) and (11.26) gives

E[W 2n · 1Dn] ≤

(n

m

)−2 ∑`=(`1,...,`m)

m∑N=0

(Cnd(r))N∣∣SN (`)

∣∣. (11.27)

where |SN (`)| denote the cardinality of SN (`). By elementary combinatorics,

|SN (`)| ≤(m

N

)(2 log2 n)N

(n−Nm−N

)≤ (2 log2 n)N

(m

N

)(n

m−N

). (11.28)

Direct calculations show that(mN

)(n

m−N)(

nm

) =1N !

(m!

(m−N)!

)2 (n−m)!(n−m+N)!

.1N !(m2

n

)N. (11.29)

Substituting (11.28)-(11.29) into (11.27) and recalling that m = n1−β, we deduce that

E[W 2n · 1Dn] ≤

(n

m

)−1 ∑`=(`1,`2,...,`m)∈Sn

m∑N=0

1N !(m2

n

)N((C log2 n)nd(r)

)N, (11.30)

where the last term ≤∑∞

N=01N !

(C log2 n)n1+d(r)−2β

)N . By the assumption of the lemma,

r < ρ∗(β) =β − 1/2, 1/2 < β ≤ 3/4,(1−

√1− β)2, 3/4 ≤ β < 1,

thus it can be seen that 1 + d(r)− 2β < 0 for all fixed β and r ∈ (0, ρ∗(β)). Combining thiswith (11.30) gives the second claim.

11.6 Proof of Lemma 10.2

The key observation is that, there is a sequence of positive numbers δn that tends to 0 asn diverges to ∞ such that νk ≥ (1 − δn)An for all k ∈ `1, `2, . . . , `m, so it is natural tocompare Model (10.7) with the following model:

Y ∗ = ν∗ + Z, Z ∼ N(0, In), (11.31)

where ν∗ has m nonzero entries of equal strength (1− δn)An whose locations are randomlydrawn from 1, 2, . . . , n without replacement.

For short, write t = t∗n and

Hn(t) =√n(Fn(t)− F0(t))√

(2bn − 1)F0(t)(1− F0(t)).

Let F ∗n(t) be the empirical survival function of (Y ∗k )2nk=1, and let F (t) = E[Fn(t)] andF ∗(t) = E[F ∗n(t)]. Recall that the family of non-central χ2-distributions has monotonelikelihood ratio. Then F (t) ≥ F ∗(t) ≥ F0(t). Now, first, since the Yk’s are block-wisedependent with a block size ≤ 2bn − 1, it follows by direct calculations that

Var(Hn(t)) ≤ CF (t)/F0(t).

Second, by F (t) ≥ F ∗n(t),

31

Page 32: Innovated Higher Criticism for Detecting Sparse Signals in ... · Innovated Higher Criticism for Detecting Sparse Signals in Correlated Noise Peter Hall1 and Jiashun Jin2 Abstract

E[Hn(t)] =√n(F (t)− F0(t))√

(2bn − 1)F0(t)(1− F0(t))≥

√n(F ∗(t)− F0(t))√

(2bn − 1)F0(t)(1− F0(t)), (11.32)

where the right hand side diverges to ∞ algebraically fast, by an argument similar to thatin [14]. Combine Chebyshev’s inequality, the identity bn = log n, and the calculations ofthe mean and variance of Hn(t),

PHn(t) ≤ (log n)2 ≤ C(log n)F (t)

n(F (t)− F0(t)

)2 . (11.33)

It remains to show that the last term in (11.33) is algebraically small. We discussseparately the cases F (t)/F0(t) ≥ 2 and F (t)/F0(t) < 2. For the first case,

F (t)n(F (t)− F0(t))2

≤ C

nF (t)≤ C

nF0(t),

which is algebraically small since t =√

2q log n and 0 < q < 1. For the second case,

F (t)n(F (t)− F0(t))2

≤ CF0(t)n(F (t)− F0(t))2

≤ CF0(t)n(F ∗(t)− F0(t))2

, (11.34)

which is seen to be algebraically small by comparing it to the right hand side of (11.32).This concludes the claim.

12 Complementary technical details

12.1 Proof of Lemma 7.1

Consider the first claim. Suppose such an autoregression structure exists for α ≥ α0 > 0.Let

Yk =√an · (Xk+1 −Xk)/d, an = nα0/2, k = 1, 2, . . . , n− 1.

Clearly, var(Yk) = 1. At the same time, direct calculations show that the correlationbetween Y1 and Yj+1 equals to ((j + 1)α + (j − 1)α − 2jα)/2 for all 1 ≤ j ≤ n− 2, which isno larger than 1. Taking j = 2 yields (3α + 1− 2 · 2α)/2 ≤ 1, and hence α ≤ 2.

Consider the second claim. For any k ≥ 1, define the partial sum Sk(t) = 1+2∑k

j=1(1−jα

nα0 )+ cos(kt). By a well-known result in trigometrics [38], to show the positive-definitenessof Σn, it suffices to show that

Sk0+1(t) ≥ 0 for all t ∈ [−π, π] and Sk0+1(t) > 0 except for a measure 0 set. (12.1)

Here, k0 = k0(n;α, α0) is the largest integer k such that kα ≤ nα0 .We now show (12.1). By that in [38, Page 183], if we let a0 = 2, and aj = 2(1− jα

nα0 )+,1 ≤ j ≤ n− 1, then Sk0+1(t) =

∑k0−1j=0 (j+ 1)∆2ajKj(t) + (k0 + 1)Kk0(t)∆ak0 +Dn(t)ak0+1.

Here, ∆aj = aj − aj+1, ∆2aj = aj + aj+2 − 2aj+1, and Dj(t) and Kj(t) are the Dirichlet’skernel and the Fejer’s kernel, respectively,

Dj(t) =sin((j + 1

2)t)2 sin( t2)

, Kj(t) =2

j + 1

(sin( j+1

2 t)2 sin( t2)

)2

, j = 0, 1, . . . . (12.2)

32

Page 33: Innovated Higher Criticism for Detecting Sparse Signals in ... · Innovated Higher Criticism for Detecting Sparse Signals in Correlated Noise Peter Hall1 and Jiashun Jin2 Abstract

In view of the definition of k0, ak0+1 = (1 − (k0+1)α

nα0 )+ = 0. Also, by the monotonicity ofaj, ∆ak0 = ak0 − ak0+1 ≥ 0. Therefore, Sk0+1(t) ≥

∑k0−1j=0 (j + 1)∆2ajKj(t).

We claim that the sequence a0, a1, . . . , an−1 is convex. In detail, since α ≤ 1, thesequence jα is concave. As a result, the sequence (1 − jα

nα0 ) is convex, and so is thesequenced (1 − jα

nα0 )+. In view of the definition of aj , the claim follows directly. Theconvexity of aj ’s implies that ∆2aj ≥ 0, 0 ≤ j ≤ n − 2. Therefore, Sk0+1(t) ≥ 0. Thisproves the first part of (12.1).

We now prove the second part of (12.1). We discuss separately for two cases α < 1and α = 1. In the first case, ∆a0 = 1

nα0 (2 − 2α) > 0 and K0(t) = 12 . As a result,

Sk0+1(t) ≥ 2−2α

2nα0 > 0, and the claim follows. In the second case, ∆aj = 1nα0 (2j − j −

(j + 2)) = 0, and ∆ak0−1 = (1 − k0−1nα0 ) − 2(1 − k0

nα0 ) = 1nα0 (k0 + 1) − 1 > 0. Therefore,

Sk0+1(t) ≥ (k0 + 1)(k0+1nα0 − 1)Kk0(t). Clearly, Sk0+1(t) can only assume 0 when (k0+1

2 )t isa multiple of π. Since the set of such t has measure 0, the claim follows directly.

12.2 Proof of Lemma 7.2

Let a0 = 2, and ak = 2kα − (k+ 1)α − (k− 1)α, 1 ≤ k ≤ n− 1. Clearly, ak > 0 for all k, sofα(0;α) > 0. Furthermore, when θ 6= 0, by [38, Equation 1.7, Page 183],

fα(θ) =∞∑ν=0

(ν + 1)[aν+2 + aν − 2aν+1]aνKν(θ), (12.3)

where Kν(θ) is the Fejer’s kernel as in (12.2). By the positiveness of the Fejer’s kernel, allremains to show is that ak+1 + ak−1 − 2ak > 0, for all k ≥ 2.

Define h(x) = (1 + 2x)α + (1− 2x)α− 4(1 +x)α− 4(1−x)α + 6, 0 ≤ x ≤ 1/2. By directcalculations, for all k ≥ 2,

ak+1+ak−1−2ak = −kα[(1+2k

)α+(1− 2k

)α−4(1+1k

)α−4(1− 1k

)α+6] = −kαh(1k

). (12.4)

Also, by basic calculus, h′′(x) = 4α(α−1)[(1+2x)α−2+(1−2x)α−2−(1+x)α−2−(1−x)α−2].Since 0 < α < 1, xα−2 is a convex function. It follows that h′′(x) < 0 for all x ∈ (0, 1/2),and h(x) is a strictly concave function. At the same time, note that h(0) = h′(0) = 0, soh(x) < 0 for x ∈ (0, 1/2]. Combining this with (12.4) gives the claim.

12.3 Proof of Lemma 11.4

Denote the density, cdf and survival function of N(0, 1) by φ, Φ, and Φ. For the first claim,define W = X/σ1 and V = Y/σ2 if ρ ≥ 0 and V = −Y/σ2 otherwise. The proofs for twocases ρ ≥ 0 and ρ < 0 are similar, so we only show the first one. In this case, it suffices toshow

E[exp(σ1AnW − σ2

1A2n/2)· 1V >Tn

]≤ C · n−(1−%

√r)2 .

Write W = (W − ρV ) + ρV , and note that (1− ρ)2 + ρ2 ≤ 1. It is seen that

σ1AnW − σ21A

2n/2 ≤ [σ1An(W − ρV )− σ2

1(1− ρ)2A2n/2] + [σ1AnρV − σ2

1ρ2A2

n/2]. (12.5)

Since W and V have unit variance and a correlation ρ, then (W − ρV ) is independent of Vand is distributed as N(0, (1−ρ)2). Therefore, E[exp(σ1An(W−ρV )−σ2

1(1−ρ)2A2n/2)] = 1.

Combining this with (12.5) gives

E[exp(σ1AnW − σ2

1A2n/2)· 1V >Tn

]= E

[exp(σ1ρAnV − σ2

1ρ2A2

n/2)· 1V >Tn

].

33

Page 34: Innovated Higher Criticism for Detecting Sparse Signals in ... · Innovated Higher Criticism for Detecting Sparse Signals in Correlated Noise Peter Hall1 and Jiashun Jin2 Abstract

Now, by direct calculations,

E[exp(AnV −A2

n/2)· 1V >Tn

]=∫ ∞Tn

φ(x− σ1ρAn)dx = Φ(Tn − σ1ρAn).

Since Φ(x) ≤ Cφ(x) if x > 0, Φ(Tn−σ1ρAn) ≤ Cφ(Tn−σ1ρAn) = Cn−(1−ρ√r)2 . Combining

these results gives the claim.We now show the second claim. By Holder’s inequality, it suffices to show that

E[exp(2AnX − σ21A

2n) · 1X≤σ1Tn] ≤ Cn

−d(r).

Recalling that W = X/σ1, we have

E[exp(2AnX − σ1A2n) · 1X≤σ1Tn] = E[exp(2σ1AnW − σ2

1A2n) · 1W≤Tn].

By direct calculations,

E[exp(2σ1AnW − σ21A

2n) · 1W≤Tn] = eσ

21A

2n

∫ Tn

−∞φ(x− 2σ1An)dx = eσ

21A

2nΦ(Tn − 2σ1An).

Since Φ(x) ≤ Cφ(x) for all x < 0 and Φ(x) ≤ 1 for all x ≥ 0,

eσ21A

2nΦ(Tn − 2σ1An) ≤

Ceσ

21A

2n = Cn2σ2

1r, σ21r ≤ 1/4.

eσ21A

2nφ(Tn − 2σ1An) = Cn1−2(1−σ1

√r)2 , σ2

1r > 1/4;

in view of the definition of d(r), eσ21A

2nΦ(Tn − 2σ1An) ≤ Cnd(σ2

1r). Since that σ1 ≤ 1 andthat d(r) is a monotonely increasing function, we have d(σ2

1r) ≤ d(r). Combining theseresults gives the claim.

Acknowledgement: JJ would like to thank Christopher Genovese and Larry Wassermanfor extensive discussion. He would also like to thank Peter Bickel, Emmanuel Candes,David Donoho, Karlheinz Grochenig, Michael Leinert, Joel Tropp, Aad van der Vaart, andZepu Zhang for encouragement and pointers.

References

[1] Abramovich, F., Benjamini, Y., Donoho, D. and Johnstone, I. (2000). Adaptingto unknown sparsity by controlling the false discovery rate. Ann. Statist. 34 584–653.

[2] Arias-Castro, E., Donoho, D. and Huo, X. (2005). Near-optimal detection of geo-metric objects by fast Multiscale methods. IEEE Trans. on Info. Theory 51(7) 2402–2425.

[3] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: apractical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B. 57289–300.

[4] Bickel, P. and Levina, E. (2007). Regularized estimation of large covariance matrices.Ann. Statist. 36(1) 199–227.

[5] Bottcher, A. and Silbarmann, B. (1998). Introduction to Large Truncated ToeplitzMatrices. Springer.

34

Page 35: Innovated Higher Criticism for Detecting Sparse Signals in ... · Innovated Higher Criticism for Detecting Sparse Signals in Correlated Noise Peter Hall1 and Jiashun Jin2 Abstract

[6] Brockwell, P. and Davis, R. (1991). Time Series and Methods, 2nd edition.Springer.

[7] Cai, T., Jin, J. and Low, M. (2007). Estimation and confidence sets for sparse normalmixtures. Ann. Statist. 35 2421-2449.

[8] Cayon, L., Jin, J. and Treaster, A. (2005). Higher Criticism statistic: detectingand identifying non-Gaussianity in the WMAP first year data. Mon. Not. Roy. Astron.Soc. 362 826–832.

[9] Chen, L., Tong, T. and Zhao, H. (2005). Considering dependence among genes andmarkers for false discovery control in eQTL mapping. Bioinformatics 24(18) 2015–2022.

[10] Csorgo, M., Csorgo, S., Horvath, L. and Mason, D. (1986). Weighted empiricaland quantile processes. Ann. Prob. 14(1) 31–85.

[11] Cover, T. M. and Thomas. J. A. (2006). Elementary Information Theory. Wiely &Sons.

[12] Cruz, M., Cayon, L., Martınez-Gonzalez, E., Vieva, P. and Jin, J. (2007). Thenon-Gaussian cold spot in the 3-year WMAP data. Astrophys. J. 655 11–20.

[13] Delaigle. A. and Hall, J. (2008). Higher Criticism in the context of unknown distribu-tion, non-independence and classification. Platinum Jubilee Proceedings of the IndianStatistical Institute. To appear.

[14] Donoho, D. and Jin, J. (2004). Higher Criticism for detecting sparse heterogeneousmixtures. Ann. Statist. 32 962–994.

[15] Donoho, D. and Jin, J. (2008a). Higher Criticism thresholding: optimal featureselection when useful features are rare and weak. Proc. Nat. Acad. Sci. 105(39) 14790–14795.

[16] Donoho, D. and Jin, J. (2008b). Higher Criticism thresholding: optimal phase dia-gram. arxiv.org/abs/0812.2263.

[17] Goeman, J., van de Geer, S., de Kort, F. and van Houwelingen, H. (2004). Aglobal test for groups of genes: testing association with a clinical outcome. Bioinfor-matics 20(1) 93–99.

[18] Goeman, J., van de Geer, S. and van Houwelingen, H. (2006). Testing against ahigh dimensional alternative. J. R. Statist. Soc. Ser. B. 68(3) 477–493.

[19] Grochenig, K. and Leinert, M. (2006). Symmetry and inverse-closedness of matrixalgebra and functional calculus for infinite matrices. Trans. Amer. Math. Soc. 3582695-2711.

[20] Hall, P., Pittelkow, Y. and Ghosh, M. (2008). Theoretical measures of relativeperformance of classifiers for high dimensional data with small sample sizes. J. Roy.Statist. Soc. Ser. B. 70 159–173.

[21] Hall, P. and Jin, J. (2008). Properties of Higher Criticism under strong dependence.Ann. Statist. 36 381–402.

35

Page 36: Innovated Higher Criticism for Detecting Sparse Signals in ... · Innovated Higher Criticism for Detecting Sparse Signals in Correlated Noise Peter Hall1 and Jiashun Jin2 Abstract

[22] Horn, R.A. and Johnson, C.R. (2006). Matrix Analysis. Cambridge University Press.

[23] Ingster, Y.I. (1997). Some problems of hypothesis testing leading to infinitely divis-ible distribution. Math. Methods Statist. 6 47–69.

[24] Ingster, Y.I. (1999). Minimax detection of a signal for lpn-balls. Math. Methods Statist.7 401–428.

[25] Jaffard, S. (1990). Proprietes des matrices “bien localisees” pres de leur diagonaleet quelques applications. Ann. Inst. H. Poincare Anal. Non Lineaire 7 461–476.

[26] Jager, L. and Wellner, J. (2007). Goodness-of-fit tests via phi-divergences. Ann. Statist.35 2018–2053.

[27] Jin, J. (2004). Detecting a target in very noisy data from multiple looks. In AFestschrift to Honor Herman Rubin (Dasgupta eds.), IMS Lecture Notes Monogr. Ser.45 255–286. Inst. Math. Statist., Beachwood, OH.

[28] Jin, J. (2006). Higher Criticism statistic: theory and applications in non-Gaussiandetection. In Statistical Problems in Particle Physics, Astrophysics And Cosmology(L. Lyons and M.K. Unel. eds.). Imperial College Press, London.

[29] Jin, J. (2007). Proportion of nonzero normal means: universal oracle equivalences anduniformly consistent estimators. J. Roy. Statist. Soc. Ser. B. 70(3) 461–493.

[30] Jin, J. and Cai, T. (2007). Estimating the null and the proportion of non-null effectsin large-scale multiple comparisons. J. Amer. Statist. Assoc. 102 496–506.

[31] Jin, J., Starck, J.-L., Donoho, D., Aghanim, N. and Forni, O. (2005). Cosmolog-ical non-Gaussian signature detection: comparing performance of different statisticaltests. EURASIP J. Appl. Sig. Proc. 15 2470–2485.

[32] Meinshausen, M. and Rice, J. (2006). Estimating the proportion of false null hy-potheses among a large number of independently tested hypotheses. Ann. Statist. 34373–393.

[33] Shroack, G. and Wellner, J. (1986). Empirical Processes with Applications toStatistics. John Wiley & Sons.

[34] Strasser, H. (1998). Differentiability of statistical experiments. Statist. Decisions 16113–130.

[35] Sun, Q. (2005). Wiener’s lemma for infinite matrices with polynomial off-diagonaldecay. Comptes Rendus Mathematique 340 567–570.

[36] Tukey, J.W. (1989). Higher Criticism for individual significances in several tables orparts of tables. Internal working paper, Princeton University.

[37] Wiener, N. (1949). Extrapolation, Interpolation, and Smoothing of Stationary TimeSeries. Wiley, New York.

[38] Zygmund, A. (1959). Trigonometric Series, 2nd edition. Cambridge University Press.

36