Statistics for high-dimensional data: p-values and confidence intervals

Peter Bühlmann
Seminar für Statistik, ETH Zürich
June 2014
High-dimensional data
Behavioral economics and genetics (with Ernst Fehr, U. Zürich)

- n = 1'525 persons
- genetic information (SNPs): p ≈ 10^6
- 79 response variables, measuring "behavior"

p ≫ n

goal: find significant associations between behavioral responses and genetic markers
[Figure: number of significant target SNPs per phenotype (x-axis: phenotype index, y-axis: number of significant target SNPs)]
in high-dimensional statistics: a lot of progress has been achieved over the last 8-10 years for

- point estimation
- rates of convergence

but very little work on assigning measures of uncertainty: p-values, confidence intervals
we need uncertainty quantification! (the core of statistics)
goal (regarding the title of the talk):

p-values/confidence intervals for a high-dimensional linear model

(and we can then generalize to other models)
Motif regression and variable selection
for finding HIF1α transcription factor binding sites in DNA sequences (Müller, Meier, PB & Ricci)

for coarse DNA segments i = 1, ..., n:

- predictor $X_i = (X_i^{(1)}, \ldots, X_i^{(p)}) \in \mathbb{R}^p$: abundance scores of candidate motifs j = 1, ..., p in DNA segment i (using sequence data and computational biology algorithms, e.g. MDSCAN)
- univariate response $Y_i \in \mathbb{R}$: binding intensity of HIF1α to coarse DNA segment i (from ChIP-chip experiments)
question: what is the relation between the binding intensity Y and the abundance of short candidate motifs?

⇝ a linear model is often reasonable: "motif regression" (Conlon, X.S. Liu, Lieb & J.S. Liu, 2003)

$$Y_i = \sum_{j=1}^{p} \beta_j^0 X_i^{(j)} + \varepsilon_i, \quad i = 1, \ldots, n = 143, \; p = 195$$

goal: variable selection and significance of variables
⇝ find the relevant motifs among the p = 195 candidates
Lasso for variable selection:

$$\hat{S}(\lambda) = \{j;\ \hat{\beta}_j(\lambda) \neq 0\} \quad \text{as an estimate for} \quad S_0 = \{j;\ \beta_j^0 \neq 0\}$$

no significance testing involved: it's convex optimization only!
and it's very popular (Meinshausen & PB, 2006; Zhao & Yu, 2006; Wainwright, 2009; ...)
for motif regression (finding HIF1α transcription factor binding sites), n = 143, p = 195

⇝ Lasso selects 26 covariates when choosing λ = λ_CV via cross-validation, with resulting R² ≈ 50%

i.e. 26 interesting candidate motifs

how significant are the findings?
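As an illustration of this λ_CV selection step (not from the talk), a minimal scikit-learn sketch on synthetic stand-in data; all data-generating choices below are assumptions, since the motif data are not reproduced here:

```python
# Synthetic stand-in for the motif-regression data (the real data are not
# shown in the talk); all data-generating choices here are assumptions.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 143, 195
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:5] = 1.0                                  # a few truly relevant "motifs"
y = X @ beta0 + rng.standard_normal(n)

# choose lambda = lambda_CV by cross-validation, as on the slide
fit = LassoCV(cv=10).fit(X, y)
selected = np.flatnonzero(fit.coef_)
print(f"Lasso selects {selected.size} covariates, R^2 = {fit.score(X, y):.2f}")
```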
estimated coefficients $\hat{\beta}(\lambda_{CV})$

[Figure: estimated coefficients on the original data, plotted against the variable index 1-200]

p-values for $H_{0,j}: \beta_j^0 = 0$?
P-values for high-dimensional linear models

$$Y = X\beta^0 + \varepsilon$$

goal: statistical hypothesis testing

$$H_{0,j}: \beta_j^0 = 0 \quad \text{or} \quad H_{0,G}: \beta_j^0 = 0 \ \text{for all } j \in G \subseteq \{1, \ldots, p\}$$

background: if we could handle the asymptotic distribution of the Lasso $\hat{\beta}(\lambda)$ under the null hypothesis ⇝ could construct p-values

this is very difficult! the asymptotic distribution of $\hat{\beta}$ has some point mass at zero, ... (Knight and Fu, 2000, for p < ∞ and n → ∞)
⇝ standard bootstrapping and subsampling cannot be used either

but there are recent proposals using adaptations of standard resampling methods (Chatterjee & Lahiri, 2013; Liu & Yu, 2013) ⇝ non-uniformity/super-efficiency issues remain...
Low-dimensional projections and bias correction
or: de-sparsifying the Lasso estimator
(related work by Zhang and Zhang, 2011)

motivation: $\hat{\beta}_{OLS,j}$ = projection of Y onto the residuals $(X_j - X_{-j}\hat{\gamma}_{OLS}^{(j)})$

the projection is not well defined if p > n
⇝ use "regularized" residuals from the Lasso of $X_j$ on the other X-variables:

$$Z_j = X_j - X_{-j}\hat{\gamma}_{Lasso}^{(j)}, \quad \hat{\gamma}^{(j)} = \mathrm{argmin}_{\gamma}\, \|X_j - X_{-j}\gamma\|_2^2/n + \lambda_j \|\gamma\|_1$$
using $Y = X\beta^0 + \varepsilon$ ⇝

$$Z_j^T Y = Z_j^T X_j \beta_j^0 + \sum_{k \neq j} Z_j^T X_k \beta_k^0 + Z_j^T \varepsilon$$

and hence

$$\frac{Z_j^T Y}{Z_j^T X_j} = \beta_j^0 + \underbrace{\sum_{k \neq j} \frac{Z_j^T X_k}{Z_j^T X_j}\, \beta_k^0}_{\text{bias}} + \underbrace{\frac{Z_j^T \varepsilon}{Z_j^T X_j}}_{\text{noise component}}$$

⇝

$$\hat{b}_j = \frac{Z_j^T Y}{Z_j^T X_j} - \underbrace{\sum_{k \neq j} \frac{Z_j^T X_k}{Z_j^T X_j}\, \hat{\beta}_{Lasso;k}}_{\text{Lasso-estimated bias correction}}$$
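A minimal numpy/scikit-learn sketch of this projection estimator for one coordinate j (an illustration, not the reference implementation; the fixed tuning parameter lam_j, the centered design, and scikit-learn's Lasso penalty scaling, which differs from the slide's convention by a constant factor, are assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

def desparsified_coordinate(X, y, beta_lasso, j, lam_j=0.1):
    """b_j from the display above: projection onto the regularized
    residuals Z_j plus the Lasso-estimated bias correction."""
    n, p = X.shape
    idx = np.delete(np.arange(p), j)
    # nodewise Lasso of X_j on X_{-j} (columns assumed centered/scaled)
    gamma_j = Lasso(alpha=lam_j, fit_intercept=False).fit(X[:, idx], X[:, j]).coef_
    Z_j = X[:, j] - X[:, idx] @ gamma_j          # "regularized" residuals
    denom = Z_j @ X[:, j]                        # Z_j^T X_j
    # subtract sum_{k != j} (Z_j^T X_k / Z_j^T X_j) * beta_hat_{Lasso;k}
    bias = (Z_j @ X[:, idx]) @ beta_lasso[idx] / denom
    return Z_j @ y / denom - bias
```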
$\hat{b}_j$ is not sparse! ... and this is crucial to obtain a Gaussian limit
nevertheless: it is "optimal" (see later)

- target: the low-dimensional component $\beta_j^0$
- $\eta := \{\beta_k^0;\ k \neq j\}$ is a high-dimensional nuisance parameter

⇝ exactly as in semiparametric modeling!
and sparsely estimated (e.g. with the Lasso)
⇒ let's turn to the blackboard!
A general principle: de-sparsifying via "inversion" of the KKT conditions

KKT conditions (sub-differential of the objective function $\|Y - X\beta\|_2^2/n + \lambda\|\beta\|_1$), with $Y = X\beta^0 + \varepsilon$:

$$-X^T(Y - X\hat{\beta})/n + \lambda\hat{\tau} = 0, \qquad \|\hat{\tau}\|_\infty \le 1, \ \hat{\tau}_j = \mathrm{sign}(\hat{\beta}_j) \ \text{if } \hat{\beta}_j \neq 0$$

with $\hat{\Sigma} = X^T X/n$ ⇝ $\hat{\Sigma}(\hat{\beta} - \beta^0) + \lambda\hat{\tau} = X^T\varepsilon/n$

using a "regularized inverse" of $\hat{\Sigma}$, denoted by $\hat{\Theta}$ (not e.g. the GLasso):

$$\hat{\beta} - \beta^0 + \hat{\Theta}\lambda\hat{\tau} = \hat{\Theta}X^T\varepsilon/n - \Delta, \qquad \Delta = (\hat{\Theta}\hat{\Sigma} - I)(\hat{\beta} - \beta^0)$$

new estimator: $\hat{b} = \hat{\beta} + \hat{\Theta}\lambda\hat{\tau} = \hat{\beta} + \hat{\Theta}X^T(Y - X\hat{\beta})/n$
⇝ $\hat{b}$ is exactly the same estimator as before (based on low-dimensional projections using the residual vectors $Z_j$)

... when taking $\hat{\Theta}$ (the "regularized inverse of $\hat{\Sigma}$") with rows built from the ("nodewise") Lasso-estimated coefficients of $X_j$ versus $X_{-j}$:

$$\hat{\gamma}_j = \mathrm{argmin}_{\gamma \in \mathbb{R}^{p-1}}\, \|X_j - X_{-j}\gamma\|_2^2/n + 2\lambda_j\|\gamma\|_1$$

Denote by

$$\hat{C} = \begin{pmatrix} 1 & -\hat{\gamma}_{1,2} & \cdots & -\hat{\gamma}_{1,p} \\ -\hat{\gamma}_{2,1} & 1 & \cdots & -\hat{\gamma}_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ -\hat{\gamma}_{p,1} & -\hat{\gamma}_{p,2} & \cdots & 1 \end{pmatrix}$$

and let

$$\hat{T}^2 = \mathrm{diag}(\hat{\tau}_1^2, \ldots, \hat{\tau}_p^2), \qquad \hat{\tau}_j^2 = \|X_j - X_{-j}\hat{\gamma}_j\|_2^2/n + \lambda_j\|\hat{\gamma}_j\|_1$$

Then $\hat{\Theta}_{Lasso} = \hat{T}^{-2}\hat{C}$ (not symmetric...!)
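A sketch of the same construction in matrix form, building $\hat{\Theta} = \hat{T}^{-2}\hat{C}$ row by row via the nodewise Lasso; again an illustration, with a single assumed tuning parameter for all nodewise regressions (in practice each λ_j is tuned separately) and penalty scaling conventions treated loosely:

```python
import numpy as np
from sklearn.linear_model import Lasso

def desparsified_lasso(X, y, beta_lasso, lam=0.1):
    """b = beta_hat + Theta_hat X^T (y - X beta_hat) / n with
    Theta_hat = T^{-2} C from nodewise Lasso, as on the slide."""
    n, p = X.shape
    C = np.eye(p)
    tau2 = np.empty(p)
    for j in range(p):
        idx = np.delete(np.arange(p), j)
        gamma_j = Lasso(alpha=lam, fit_intercept=False).fit(X[:, idx], X[:, j]).coef_
        C[j, idx] = -gamma_j
        resid = X[:, j] - X[:, idx] @ gamma_j
        tau2[j] = resid @ resid / n + lam * np.abs(gamma_j).sum()
    Theta = C / tau2[:, None]                # T^{-2} C: scale row j by 1/tau_j^2
    b = beta_lasso + Theta @ X.T @ (y - X @ beta_lasso) / n
    return b, Theta
```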
"inverting" the KKT conditions is a very general principle
⇝ the principle can be used for GLMs and many other models
Asymptotic pivot and optimality

Theorem (van de Geer, PB & Ritov, 2013)

$$\sqrt{n}(\hat{b}_j - \beta_j^0) \Rightarrow \mathcal{N}(0, \sigma_\varepsilon^2 \Omega_{jj}) \quad (j = 1, \ldots, p)$$

$\Omega_{jj}$ has an explicit expression $\sim (\Sigma^{-1})_{jj}$: optimal, reaching the semiparametric information bound

⇝ asymptotically optimal p-values and confidence intervals, if we assume:

- sub-Gaussian design (i.i.d. rows of X sub-Gaussian) whose population covariance $\mathrm{Cov}(X) = \Sigma$ has minimal eigenvalue ≥ M > 0
- sparsity of the regression Y vs. X: $s_0 = o(\sqrt{n}/\log(p))$ ("quite sparse")
- sparsity of the design: $\Sigma^{-1}$ sparse, i.e. sparse regressions $X_j$ vs. $X_{-j}$: $s_j = o(n/\log(p))$ ("maybe restrictive")
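A hedged sketch of turning the pivot into p-values for $H_{0,j}$ and two-sided confidence intervals; the plug-in $\hat{\Omega} = \hat{\Theta}\hat{\Sigma}\hat{\Theta}^T$ and a given noise level σ_ε are assumptions of this illustration (in practice σ_ε is estimated, e.g. by the scaled Lasso):

```python
import numpy as np
from scipy.stats import norm

def pivot_inference(b, Theta, X, sigma_eps, alpha=0.05):
    """p-values/CIs from sqrt(n)(b_j - beta0_j) => N(0, sigma_eps^2 Omega_jj),
    with the plug-in Omega_hat = Theta Sigma_hat Theta^T (an assumed choice)."""
    n = X.shape[0]
    Sigma_hat = X.T @ X / n
    omega_diag = np.einsum('ij,jk,ik->i', Theta, Sigma_hat, Theta)  # diag(Omega_hat)
    se = sigma_eps * np.sqrt(omega_diag / n)
    pvals = 2 * norm.sf(np.abs(b) / se)          # two-sided p-values for H_{0,j}
    z = norm.ppf(1 - alpha / 2)
    ci = np.column_stack([b - z * se, b + z * se])
    return pvals, ci
```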
It is optimal! (Cramér-Rao)
for designs with $\Sigma^{-1}$ non-sparse:

- Ridge projection (PB, 2013): good type I error control, but not optimal in terms of power
- a convex program instead of the Lasso for $Z_j$ (Javanmard & Montanari, 2013; MSc. thesis Dezeure, 2013); Javanmard & Montanari prove optimality

so far: no convincing empirical evidence that we can deal well with such scenarios ($\Sigma^{-1}$ non-sparse)
Uniform convergence:

$$\sqrt{n}(\hat{b}_j - \beta_j^0) \Rightarrow \mathcal{N}(0, \sigma_\varepsilon^2 \Omega_{jj}) \quad (j = 1, \ldots, p)$$

the convergence is uniform over $B(s_0) = \{\beta;\ \|\beta\|_0 \le s_0\}$
⇝ honest tests and confidence regions!

and we can avoid post-model-selection inference (cf. Pötscher and Leeb)
Simultaneous inference over all components:

$$\sqrt{n}(\hat{b} - \beta^0) \approx (W_1, \ldots, W_p) \sim \mathcal{N}_p(0, \sigma_\varepsilon^2 \Omega)$$

⇝ can construct p-values for $H_{0,G}$ with any G, using the test statistic $\max_{j \in G}|\hat{b}_j|$, since the covariance structure Ω is known

and we can easily do efficient multiple testing adjustment, since the covariance structure Ω is known!
Alternatives?

- versions of bootstrapping (Chatterjee & Lahiri, 2013) ⇝ super-efficiency phenomenon (Joe Hodges), i.e. non-uniform convergence:
  • good for estimating the zeroes (i.e., $j \in S_0^c$ with $\beta_j^0 = 0$)
  • bad for estimating the non-zeroes (i.e., $j \in S_0$ with $\beta_j^0 \neq 0$)
- multiple sample splitting (Meinshausen, Meier & PB, 2009): split the sample repeatedly into two halves:
  • select variables on the first half
  • p-values using the second half, based on the selected variables
  ⇝ avoids (because of the sample splitting) over-optimistic p-values, but potentially suffers in terms of power; see the sketch after this list
- covariance test (Lockhart, Taylor, Tibshirani & Tibshirani, 2014)
- no sparsity assumption on $\Sigma^{-1}$ (Javanmard and Montanari, 2014)
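A sketch of the multiple sample-splitting procedure mentioned above, under simplifying assumptions: a fixed aggregation quantile γ = 0.5 instead of the adaptive choice in Meinshausen, Meier & PB (2009), and centered data without intercept:

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LassoCV

def multi_split_pvalues(X, y, B=50, seed=0):
    """Multi sample-splitting p-values (simplified sketch): select on one
    half, test by OLS on the other half, aggregate over B random splits."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pmat = np.ones((B, p))                       # unselected variables keep p = 1
    for b in range(B):
        perm = rng.permutation(n)
        i1, i2 = perm[: n // 2], perm[n // 2:]
        sel = np.flatnonzero(LassoCV(cv=5).fit(X[i1], y[i1]).coef_)
        if sel.size == 0 or sel.size >= i2.size:
            continue
        Xs = X[i2][:, sel]
        beta, *_ = np.linalg.lstsq(Xs, y[i2], rcond=None)
        df = i2.size - sel.size
        resid = y[i2] - Xs @ beta
        se = np.sqrt(resid @ resid / df * np.diag(np.linalg.inv(Xs.T @ Xs)))
        praw = 2 * stats.t.sf(np.abs(beta / se), df)
        pmat[b, sel] = np.minimum(praw * sel.size, 1.0)   # Bonferroni on |S_hat|
    return np.minimum(2 * np.median(pmat, axis=0), 1.0)  # gamma = 0.5 aggregation
```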
Some empirical results (Dezeure, PB, Meier & Meinshausen, in progress)

compare power and control of the familywise error rate (FWER); always p = 500, n = 100 and $s_0 = 15$

[Figure: FWER and power (each on a 0-1 scale) for the methods Covtest, JM, MS-Split, Ridge and Despars-Lasso]
confidence intervals

- for $\beta_j^0$ ($j \in S_0$)
- for $\beta_j^0 = 0$ ($j \in S_0^c$), where the intervals exhibit the worst coverage (for each method)

[Figure: per-variable coverage percentages of confidence intervals for the methods JM (2013), Liu & Yu, Res-Boot, MS-Split, Ridge, Lasso-Pro Z&Z and Lasso-Pro; Toeplitz design, $s_0 = 3$, coefficients from U[0,2]]
Motif regression example

one significant variable with both the "de-sparsified Lasso" and multi sample splitting

[Figure: estimated coefficients for the motif regression, plotted against the variable index 1-200, with the significant variable highlighted]

- highlighted variable/motif: FWER-adjusted p-value 0.006 (this variable corresponds to a known true motif)
- all other variables: p-values clearly larger than 0.05
for data sets with p ≈ 4'000-10'000 and n ≈ 100 ⇝ often no significant variable, because the ratio log(p)/n is too extreme
Behavioral economics and genomewide association
with Ernst Fehr, University of Zürich

- n = 1525 probands (all students!)
- m = 79 response variables measuring various behavioral characteristics (e.g. risk aversion) from well-designed experiments
- 460 target SNPs (as a proxy for ≈ 10^6 SNPs): 1380 parameters per response (but only 1341 meaningful parameters)

model: multivariate linear model

$$\underbrace{Y_{n \times m}}_{\text{responses}} = \underbrace{X_{n \times p}}_{\text{SNP data}}\, \beta_{p \times m} + \underbrace{\varepsilon_{n \times m}}_{\text{error}}$$

although p < n, the design matrix X (with categorical values ∈ {1, 2, 3}) does not have full rank
$$Y_{n \times m} = X_{n \times p}\,\beta_{p \times m} + \varepsilon_{n \times m}$$

interested in p-values for

$$H_{0,jk}: \beta_{jk} = 0 \ \text{versus} \ H_{A,jk}: \beta_{jk} \neq 0, \qquad H_{0,G}: \beta_{jk} = 0 \ \text{for all } (j,k) \in G \ \text{versus} \ H_{A,G} = H_{0,G}^c$$

adjusted to control the familywise error rate (i.e. a conservative criterion); in total we consider 110'857 hypotheses

we test for non-marginal regression coefficients ⇝ "predictive" GWAS
there is structure!

- 79 response experiments
- 23 chromosomes per response experiment
- 20 target SNPs per chromosome = 460 target SNPs

[Diagram: tree with a global node on top, the 79 responses below it, the 23 chromosomes below each response, and the 20 target SNPs below each chromosome]
do a hierarchical FWER adjustment (Meinshausen, 2008)

[Diagram: the same tree, with significant and non-significant nodes marked]

1. test the global hypothesis
2. if significant: test all single-response hypotheses
3. for the significant responses: test all single-chromosome hypotheses
4. for the significant chromosomes: test all target SNPs

⇝ powerful multiple testing with a data-dependent adaptation of the resolution level (our analysis with 20 target SNPs per chromosome is ad hoc); a minimal sketch of the top-down scheme follows below

cf. the general sequential testing principle (Goeman & Solari, 2010)
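A minimal sketch of such a top-down procedure, in the spirit of Meinshausen (2008); the tree encoding and the adjustment factor proportional to the number of leaves below a node are assumptions of this illustration:

```python
def hierarchical_test(tree, pval, n_leaves, alpha=0.05):
    """Top-down hierarchical FWER testing (sketch): a node's raw group
    p-value is inflated by n_leaves / #leaves(node), monotonized along the
    path, and children are tested only below rejected nodes."""
    significant = []
    stack = [("global", n_leaves, 0.0)]  # (node, #leaves below it, parent's adj. p)
    while stack:
        node, size, parent_p = stack.pop()
        p_adj = max(parent_p, min(1.0, pval[node] * n_leaves / size))
        if p_adj <= alpha:               # rejected -> refine the resolution level
            significant.append((node, p_adj))
            for child, child_size in tree.get(node, []):
                stack.append((child, child_size, p_adj))
    return significant
```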
testing a group hypothesis:

$$H_{0,G}: \beta_j^0 = 0 \ \text{for all } j \in G$$

test statistic:

$$\max_{j \in G}|\hat{b}_j|$$

since under $H_{0,G}$:

$$\sqrt{n}\,\hat{b}_G = \mathcal{N}_G(0, \sigma_\varepsilon^2 \Omega_G) + \Delta_G, \qquad \Delta_G = (\Delta_j;\ j \in G), \quad \sqrt{n}\|\Delta\|_\infty = o_P(1)$$

thus:

$$\max_{j \in G}\sqrt{n}|\hat{b}_j| \Rightarrow \sigma_\varepsilon \max_{j \in G}|W_j|, \qquad (W_1, \ldots, W_p) \sim \mathcal{N}_p(0, \Omega)$$

and we can easily simulate $\max_{j \in G}|W_j|$
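A short Monte-Carlo sketch of that simulation; plug-in estimates for σ_ε and Ω are assumed to be given, and the number of simulations is an illustration choice:

```python
import numpy as np

def group_max_pvalue(b, G, Omega, sigma_eps, n, n_sim=10_000, seed=0):
    """Monte-Carlo p-value for H_{0,G}: compare sqrt(n) max_{j in G} |b_j|
    with sigma_eps * max_{j in G} |W_j|, (W_1,...,W_p) ~ N_p(0, Omega)."""
    rng = np.random.default_rng(seed)
    G = np.asarray(G)
    W = rng.multivariate_normal(np.zeros(G.size), Omega[np.ix_(G, G)], size=n_sim)
    null_max = sigma_eps * np.abs(W).max(axis=1)
    observed = np.sqrt(n) * np.abs(b[G]).max()
    return (1 + np.sum(null_max >= observed)) / (1 + n_sim)
```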
number of significant SNP parameters per response

[Figure: number of significant target SNPs per phenotype (x-axis: phenotype index, y-axis: number of significant target SNPs)]

response 40 has the most significant (levels of) target SNPs
Conclusions

can construct asymptotically optimal p-values and confidence intervals for low-dimensional targets in high-dimensional models
R-package hdi ("high-dimensional inference"; Meier, 2013)

assuming/based on suitable conditions:

- sparsity of Y vs. X: $s_0 = o(\sqrt{n}/\log(p))$
- sparsity of $X_j$ vs. $X_{-j}$ (j = 1, ..., p): $\max_j s_j = o(n/\log(p))$
- design matrix X is not too ill-posed (e.g. a restricted eigenvalue assumption, or a nice population covariance)

these conditions are typically uncheckable... ⇝ confirmatory high-dimensional inference remains challenging

Thank you!
R-package: hdi (Meier, 2013)

References:

- Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methodology, Theory and Applications. Springer.
- Meinshausen, N., Meier, L. and Bühlmann, P. (2009). P-values for high-dimensional regression. Journal of the American Statistical Association 104, 1671-1681.
- Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli 19, 1212-1242.
- van de Geer, S., Bühlmann, P. and Ritov, Y. (2013). On asymptotically optimal confidence regions and tests for high-dimensional models. Preprint arXiv:1303.0518v1.
- Meier, L. (2013). hdi: High-dimensional inference. R package available from R-Forge.