

Computational and Statistical Aspects of Statistical Machine Learning

John Lafferty

Department of Statistics Retreat
Gleacher Center


Outline

• “Modern” nonparametric inference for high dimensional data
  – Nonparametric reduced rank regression

• Risk-computation tradeoffs
  – Covariance-constrained linear regression

• Other research and teaching activities

2


Context for High Dimensional Nonparametrics

Great progress in recent years on high dimensional linear models

Many problems have important nonlinear structure.

We’ve been studying “purely functional” methods for high dimensional, nonparametric inference

• no basis expansions

• no Mercer kernels

3


Additive Models

Fully nonparametric models appear hopeless

• Logarithmic scaling, p = log n (e.g., “Rodeo,” Lafferty and Wasserman (2008))

Additive models are a useful compromise

• Exponential scaling, p = exp(n^c) (e.g., “SpAM,” Ravikumar, Lafferty, Liu and Wasserman (2009))

4


Additive Models

[Figure 23.1 (from Chapter 23, Nonparametric Regression): Bone Mineral Density Data — change in BMD vs. Age, plotted separately for females and males]

[Figure 23.2: Diabetes Data — estimated additive component functions for Age, Bmi, Map, and Tc]

5


Multivariate Regression

Y ∈ R^q and X ∈ R^p. Regression function m(X) = E(Y | X).

Linear model Y = BX + ε, where B ∈ R^{q×p}.

Reduced rank regression: r = rank(B) ≤ C.

Recent work has studied properties and high dimensional scaling of reduced rank regression, where the nuclear norm ‖B‖∗ is used as a convex surrogate for the rank constraint (Yuan et al., 2007; Negahban and Wainwright, 2011). E.g.,

‖Bn − B∗‖F = OP( √( Var(ε) r (p + q) / n ) )

6


Low-Rank Matrices and Convex Relaxation

low rank matrices: rank(X) ≤ t   ↔   convex hull: ‖X‖∗ ≤ t

7


Nuclear Norm Regularization

Algorithms for nuclear norm minimization are a lot like iterative soft thresholding for lasso problems.

To project a matrix B onto the nuclear norm ball ‖X‖∗ ≤ t:

• Compute the SVD: B = U diag(σ) V^T

• Soft threshold the singular values:

B ← U diag(Soft_λ(σ)) V^T
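The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the talk; the function name and the example threshold are our own choices.

```python
import numpy as np

def svd_soft_threshold(B, lam):
    """Soft-threshold the singular values of B at level lam:
    B -> U diag(Soft_lam(sigma)) V^T."""
    U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    sigma_thr = np.maximum(sigma - lam, 0.0)  # Soft_lam(sigma)
    return U @ np.diag(sigma_thr) @ Vt

# A rank-3 matrix plus small noise: thresholding zeros out the
# small noise singular values, leaving a low-rank matrix.
rng = np.random.default_rng(0)
B_true = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 10))
B_noisy = B_true + 0.01 * rng.standard_normal((20, 10))
B_hat = svd_soft_threshold(B_noisy, lam=0.5)
```

Since every singular value shrinks by at most λ, the nuclear norm of the result is strictly smaller, which is exactly the shrinkage effect exploited in the lasso analogy.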

8


Nonparametric Reduced Rank Regression
Foygel, Horrell, Drton and Lafferty (NIPS 2012)

Nonparametric multivariate regression m(X) = (m1(X), . . . , mq(X))^T

Each component an additive model

mk(X) = ∑_{j=1}^p mkj(Xj)

What is the nonparametric analogue of the ‖B‖∗ penalty?

9


Low Rank Functions

What does it mean for a set of functions m1(x), . . . , mq(x) to be low rank?

Let x1, . . . , xn be a collection of points.

We require that the n × q matrix M(x1:n) = [mk(xi)] be low rank.

Stochastic setting: M = [mk(Xi)]. A natural penalty is

(1/√n) ‖M‖∗ = (1/√n) ∑_{s=1}^q σs(M) = ∑_{s=1}^q √( λs( (1/n) M^T M ) )

Population version:

|||M|||∗ := ‖ √Cov(M(X)) ‖∗ = ‖ Σ(M)^{1/2} ‖∗
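The empirical identity above — that (1/√n)‖M‖∗ equals the sum of the square roots of the eigenvalues of (1/n)MᵀM — is easy to sanity-check numerically. A small NumPy check, with a random matrix standing in for M:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 50, 4
M = rng.standard_normal((n, q))

# Left-hand side: (1/sqrt(n)) times the nuclear norm (sum of singular values)
lhs = np.linalg.svd(M, compute_uv=False).sum() / np.sqrt(n)

# Right-hand side: sum of sqrt eigenvalues of (1/n) M^T M
# (clip tiny negative eigenvalues caused by floating-point round-off)
eigvals = np.maximum(np.linalg.eigvalsh(M.T @ M / n), 0.0)
rhs = np.sqrt(eigvals).sum()
```

The two quantities agree to floating-point precision, since σs(M)² are exactly the eigenvalues of MᵀM.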

10


Constrained Rank Additive Models (CRAM)

Let Σj = Cov(Mj). Two natural penalties:

‖Σ1^{1/2}‖∗ + ‖Σ2^{1/2}‖∗ + · · · + ‖Σp^{1/2}‖∗

‖( Σ1^{1/2} Σ2^{1/2} · · · Σp^{1/2} )‖∗

Population risk (first penalty):

(1/2) E‖ Y − ∑j Mj(Xj) ‖_2^2 + λ ∑j |||Mj|||∗

Linear case:

∑_{j=1}^p ‖Σj^{1/2}‖∗ = ∑_{j=1}^p ‖Bj‖_2

‖( Σ1^{1/2} Σ2^{1/2} · · · Σp^{1/2} )‖∗ = ‖B‖∗

11


CRAM Backfitting Algorithm (Penalty 1)

Input: Data (Xi, Yi), regularization parameter λ.

Iterate until convergence:

For each j = 1, . . . , p:

• Compute the residual: Rj = Y − ∑_{k≠j} Mk(Xk)

• Estimate the projection Pj = E(Rj | Xj) by smoothing: Pj = Sj Rj

• Compute the SVD: (1/n) Pj Pj^T = U diag(τ) U^T

• Soft-threshold: Mj = U diag([1 − λ/√τ]+) U^T Pj

Output: Estimator M(Xi) = ∑j Mj(Xij).
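The backfitting loop above can be sketched in NumPy. This is a schematic illustration under our own assumptions, not the authors' implementation: a simple k-nearest-neighbor averaging matrix stands in for the smoother Sj, and all function and parameter names are illustrative. Components Mj are stored as n × q matrices, so the SVD step uses the q × q matrix (1/n) PjᵀPj.

```python
import numpy as np

def knn_smoother(x, k=10):
    """n x n smoother matrix S: (S r)[i] averages r over the k
    sample points whose x-value is closest to x[i]."""
    n = len(x)
    S = np.zeros((n, n))
    for i in range(n):
        nearest = np.argsort(np.abs(x - x[i]))[:k]
        S[i, nearest] = 1.0 / k
    return S

def cram_backfit(X, Y, lam, k=10, n_iter=20):
    """Backfitting for CRAM (penalty 1), schematic version."""
    n, p = X.shape
    q = Y.shape[1]
    S = [knn_smoother(X[:, j], k) for j in range(p)]
    M = [np.zeros((n, q)) for _ in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            R = Y - sum(M[l] for l in range(p) if l != j)  # residual R_j
            P = S[j] @ R                                   # smooth: P_j = S_j R_j
            # eigendecomposition of (1/n) P^T P: tau = eigenvalues (q x q case)
            tau, U = np.linalg.eigh(P.T @ P / n)
            shrink = np.maximum(1.0 - lam / np.sqrt(np.maximum(tau, 1e-12)), 0.0)
            M[j] = P @ U @ np.diag(shrink) @ U.T           # soft-threshold step
    return M

# Toy data: a rank-1 additive signal in the first covariate only.
rng = np.random.default_rng(1)
n, p, q = 100, 3, 4
X = rng.uniform(-1, 1, (n, p))
f = np.sin(np.pi * X[:, 0])
Y = np.outer(f, [1.0, 0.5, -0.5, 0.2]) + 0.1 * rng.standard_normal((n, q))
M_hat = cram_backfit(X, Y, lam=0.05)
fit = sum(M_hat)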

12


Scaling of Estimation Error

Using a “double covering” technique (1/2-parametric, 1/2-nonparametric), we bound the deviation between the empirical and population functional covariance matrices in spectral norm:

sup_V ‖ Σ(V) − Σn(V) ‖_sp = OP( √( (q + log(pq)) / n ) ).

This allows us to bound the excess risk of the empirical estimator relative to an oracle.

13


Summary

• Variations on additive models enjoy most of the good statistical and computational properties of sparse or low-rank linear models.

• We’re building a toolbox for large scale, high dimensional nonparametric inference.

14


Computation-Risk Tradeoffs

• In “traditional” computational learning theory, the dividing line between learnable and non-learnable is polynomial vs. exponential time

• Valiant’s PAC model

• Mostly negative results: it is not possible to efficiently learn in natural settings

• Claim: Distinctions in polynomial time matter most

15


Analogy: Numerical Optimization

In numerical optimization, it is well understood how to trade off computation for speed of convergence

• First order methods: linear cost, linear convergence

• Quasi-Newton methods: quadratic cost, superlinear convergence

• Newton’s method: cubic cost, quadratic convergence

Are similar tradeoffs possible in statistical learning?

16


Hints of a Computation-Risk Tradeoff

Graph estimation:

• Our method for estimating the graph of Ising models: n = Ω(d^3 log p), T = O(p^4) for graphs with p nodes and maximum degree d

• Information-theoretic lower bound: n = Ω(d log p)

17


Statistical vs. Computational Efficiency

Challenge: Understand how families of estimators with different computational efficiencies can yield different statistical efficiencies

Rate_{H,F}(n) = inf_{mn ∈ H} sup_{m ∈ F} Risk(mn, m)

• H: computationally constrained hypothesis class

• F: smoothness constraints on the “true” model

18


Computation-Risk Tradeoffs for Linear Regression

Dinah Shender has been studying such a tradeoff in the setting of high dimensional linear regression

19


Computation-Risk Tradeoffs for Linear Regression

Standard ridge estimator solves

( (1/n) X^T X + λn I ) βλ = (1/n) X^T Y

Sparsify the sample covariance to get the estimator

( Tt[Σ] + λn I ) βt,λ = (1/n) X^T Y

where Tt[Σ] is the hard-thresholded sample covariance:

Tt([mij]) = [ mij 1(|mij| > t) ]

Recent advance in theoretical CS (Spielman et al.): solving a symmetric diagonally-dominant linear system with m nonzero matrix entries can be done in time

O(m log^2 p)
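The thresholded estimator is straightforward to sketch in NumPy. This is an illustrative implementation under our own assumptions (function names and the dense `np.linalg.solve` call are ours; the point of thresholding is that a sparse near-SDD solver could exploit the zero pattern instead):

```python
import numpy as np

def thresholded_ridge(X, Y, t, lam):
    """Covariance-thresholded ridge: solve (T_t[Sigma] + lam I) beta = X^T Y / n."""
    n, p = X.shape
    Sigma = X.T @ X / n                      # sample covariance
    Sigma_t = Sigma * (np.abs(Sigma) > t)    # hard threshold T_t[Sigma]
    b = X.T @ Y / n
    # Dense solve for illustration; a fast sparse solver would be used
    # here to realize the computational savings from thresholding.
    return np.linalg.solve(Sigma_t + lam * np.eye(p), b)

# Toy example: sparse true coefficients, (near-)identity covariance,
# so most off-diagonal sample-covariance entries fall below t.
rng = np.random.default_rng(2)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:3] = [1.0, -2.0, 0.5]
Y = X @ beta_star + 0.1 * rng.standard_normal(n)
beta_hat = thresholded_ridge(X, Y, t=0.2, lam=0.1)
```

With t = 0 no entry is thresholded and the estimator reduces to ordinary ridge, which gives a convenient correctness check.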

20


Computation-Risk Tradeoffs for Linear Regression

Dinah has recently proved that the statistical error scales as

‖βt,λ − β∗‖ / ‖β∗‖ = OP( ‖Tt(Σ) − Σ‖_2 ) = O(t^{1−q})

for the class of covariance matrices with rows in sparse ℓq balls (as studied by Bickel and Levina).

• Combined with the computational advance, this gives us an explicit, fine-grained risk/computation tradeoff

21


Simulation

[Plot: risk vs. lambda]

22


Some Other Projects

Minhua Chen: Convex optimization for dictionary learning

Eric Janofsky: Nonparanormal component analysis

Min Xu: High dimensional conditional density and graph estimation

23


Courses in the Works

• Winter 2013: Nonparametric Inference (Undergraduate and Masters)

• Spring 2013: Machine Learning for Big Data (Undergraduate Statistics and Computer Science)

Charles Cary: Developing cloud-based infrastructure for the course. Candidate data: 80 million images, Yahoo! clickthrough data, Science journal articles, City of Chicago datasets.

24