

Computational and Statistical Aspects of Statistical Machine Learning

John Lafferty

Department of Statistics Retreat
Gleacher Center


Outline

• “Modern” nonparametric inference for high dimensional data
  – Nonparametric reduced rank regression

• Risk-computation tradeoffs
  – Covariance-constrained linear regression

• Other research and teaching activities

2


Context for High Dimensional Nonparametrics

Great progress in recent years on high dimensional linear models

Many problems have important nonlinear structure.

We’ve been studying “purely functional” methods for high dimensional, nonparametric inference

• no basis expansions

• no Mercer kernels

3


Additive Models

Fully nonparametric models appear hopeless

• Logarithmic scaling, p = log n (e.g., “Rodeo,” Lafferty and Wasserman (2008))

Additive models are a useful compromise

• Exponential scaling, p = exp(n^c) (e.g., “SpAM,” Ravikumar, Lafferty, Liu and Wasserman (2009))

4


Additive Models

[Figure 23.1 (from Chapter 23, Nonparametric Regression): Bone Mineral Density Data — change in BMD vs. Age, plotted separately for females and males]

[Figure 23.2: Diabetes Data — estimated additive component functions for Age, Bmi, Map, and Tc]

5


Multivariate Regression

Y ∈ R^q and X ∈ R^p. Regression function m(X) = E(Y | X).

Linear model Y = BX + ε, where B ∈ R^{q×p}.

Reduced rank regression: r = rank(B) ≤ C.

Recent work has studied properties and high dimensional scaling of reduced rank regression, where the nuclear norm ‖B‖∗ is used as a convex surrogate for the rank constraint (Yuan et al., 2007; Negahban and Wainwright, 2011). E.g.,

‖Bn − B∗‖F = OP( √( Var(ε) r (p + q) / n ) )

6


Low-Rank Matrices and Convex Relaxation

low rank matrices: rank(X) ≤ t   ↔   convex hull: ‖X‖∗ ≤ t

7


Nuclear Norm Regularization

Algorithms for nuclear norm minimization are a lot like iterative soft thresholding for lasso problems.

To project a matrix B onto the nuclear norm ball ‖X‖∗ ≤ t:

• Compute the SVD: B = U diag(σ) V^T

• Soft threshold the singular values:

B ← U diag(Soft_λ(σ)) V^T
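The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the talk; the function name and the example threshold are our own choices.

```python
import numpy as np

def svd_soft_threshold(B, lam):
    """Soft-threshold the singular values of B at level lam:
    B -> U diag(Soft_lam(sigma)) V^T."""
    U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    sigma_thr = np.maximum(sigma - lam, 0.0)  # Soft_lam(sigma)
    return U @ np.diag(sigma_thr) @ Vt

# A rank-3 matrix plus small noise: thresholding zeros out the
# small noise singular values, leaving a low-rank matrix.
rng = np.random.default_rng(0)
B_true = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 10))
B_noisy = B_true + 0.01 * rng.standard_normal((20, 10))
B_hat = svd_soft_threshold(B_noisy, lam=0.5)
```

Since every singular value shrinks by at most λ, the nuclear norm of the result is strictly smaller, which is exactly the shrinkage effect exploited in the lasso analogy.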

8


Nonparametric Reduced Rank Regression
Foygel, Horrell, Drton and Lafferty (NIPS 2012)

Nonparametric multivariate regression m(X) = (m1(X), . . . , mq(X))^T

Each component an additive model

mk(X) = ∑_{j=1}^p mkj(Xj)

What is the nonparametric analogue of the ‖B‖∗ penalty?

9


Low Rank Functions

What does it mean for a set of functions m1(x), . . . , mq(x) to be low rank?

Let x1, . . . , xn be a collection of points.

We require that the n × q matrix M(x1:n) = [mk(xi)] be low rank.

Stochastic setting: M = [mk(Xi)]. A natural penalty is

(1/√n) ‖M‖∗ = (1/√n) ∑_{s=1}^q σs(M) = ∑_{s=1}^q √( λs( (1/n) M^T M ) )

Population version:

|||M|||∗ := ‖ √Cov(M(X)) ‖∗ = ‖ Σ(M)^{1/2} ‖∗
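The empirical identity above — that (1/√n)‖M‖∗ equals the sum of the square roots of the eigenvalues of (1/n)MᵀM — is easy to sanity-check numerically. A small NumPy check, with a random matrix standing in for M:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 50, 4
M = rng.standard_normal((n, q))

# Left-hand side: (1/sqrt(n)) times the nuclear norm (sum of singular values)
lhs = np.linalg.svd(M, compute_uv=False).sum() / np.sqrt(n)

# Right-hand side: sum of sqrt eigenvalues of (1/n) M^T M
# (clip tiny negative eigenvalues caused by floating-point round-off)
eigvals = np.maximum(np.linalg.eigvalsh(M.T @ M / n), 0.0)
rhs = np.sqrt(eigvals).sum()
```

The two quantities agree to floating-point precision, since σs(M)² are exactly the eigenvalues of MᵀM.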

10


Constrained Rank Additive Models (CRAM)

Let Σj = Cov(Mj). Two natural penalties:

‖Σ1^{1/2}‖∗ + ‖Σ2^{1/2}‖∗ + · · · + ‖Σp^{1/2}‖∗

‖( Σ1^{1/2} Σ2^{1/2} · · · Σp^{1/2} )‖∗

Population risk (first penalty):

(1/2) E‖ Y − ∑j Mj(Xj) ‖_2^2 + λ ∑j |||Mj|||∗

Linear case:

∑_{j=1}^p ‖Σj^{1/2}‖∗ = ∑_{j=1}^p ‖Bj‖_2

‖( Σ1^{1/2} Σ2^{1/2} · · · Σp^{1/2} )‖∗ = ‖B‖∗

11


CRAM Backfitting Algorithm (Penalty 1)

Input: Data (Xi, Yi), regularization parameter λ.

Iterate until convergence:

For each j = 1, . . . , p:

• Compute the residual: Rj = Y − ∑_{k≠j} Mk(Xk)

• Estimate the projection Pj = E(Rj | Xj) by smoothing: Pj = Sj Rj

• Compute the SVD: (1/n) Pj Pj^T = U diag(τ) U^T

• Soft-threshold: Mj = U diag([1 − λ/√τ]+) U^T Pj

Output: Estimator M(Xi) = ∑j Mj(Xij).
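The backfitting loop above can be sketched in NumPy. This is a schematic illustration under our own assumptions, not the authors' implementation: a simple k-nearest-neighbor averaging matrix stands in for the smoother Sj, and all function and parameter names are illustrative. Components Mj are stored as n × q matrices, so the SVD step uses the q × q matrix (1/n) PjᵀPj.

```python
import numpy as np

def knn_smoother(x, k=10):
    """n x n smoother matrix S: (S r)[i] averages r over the k
    sample points whose x-value is closest to x[i]."""
    n = len(x)
    S = np.zeros((n, n))
    for i in range(n):
        nearest = np.argsort(np.abs(x - x[i]))[:k]
        S[i, nearest] = 1.0 / k
    return S

def cram_backfit(X, Y, lam, k=10, n_iter=20):
    """Backfitting for CRAM (penalty 1), schematic version."""
    n, p = X.shape
    q = Y.shape[1]
    S = [knn_smoother(X[:, j], k) for j in range(p)]
    M = [np.zeros((n, q)) for _ in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            R = Y - sum(M[l] for l in range(p) if l != j)  # residual R_j
            P = S[j] @ R                                   # smooth: P_j = S_j R_j
            # eigendecomposition of (1/n) P^T P: tau = eigenvalues (q x q case)
            tau, U = np.linalg.eigh(P.T @ P / n)
            shrink = np.maximum(1.0 - lam / np.sqrt(np.maximum(tau, 1e-12)), 0.0)
            M[j] = P @ U @ np.diag(shrink) @ U.T           # soft-threshold step
    return M

# Toy data: a rank-1 additive signal in the first covariate only.
rng = np.random.default_rng(1)
n, p, q = 100, 3, 4
X = rng.uniform(-1, 1, (n, p))
f = np.sin(np.pi * X[:, 0])
Y = np.outer(f, [1.0, 0.5, -0.5, 0.2]) + 0.1 * rng.standard_normal((n, q))
M_hat = cram_backfit(X, Y, lam=0.05)
fit = sum(M_hat)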

12


Scaling of Estimation Error

Using a “double covering” technique (1/2-parametric, 1/2-nonparametric), we bound the deviation between the empirical and population functional covariance matrices in spectral norm:

sup_V ‖ Σ(V) − Σn(V) ‖_sp = OP( √( (q + log(pq)) / n ) ).

This allows us to bound the excess risk of the empirical estimator relative to an oracle.

13


Summary

• Variations on additive models enjoy most of the good statistical and computational properties of sparse or low-rank linear models.

• We’re building a toolbox for large scale, high dimensional nonparametric inference.

14


Computation-Risk Tradeoffs

• In “traditional” computational learning theory, the dividing line between learnable and non-learnable is polynomial vs. exponential time

• Valiant’s PAC model

• Mostly negative results: it is not possible to efficiently learn in natural settings

• Claim: Distinctions in polynomial time matter most

15


Analogy: Numerical Optimization

In numerical optimization, it is well understood how to trade off computation for speed of convergence

• First order methods: linear cost, linear convergence

• Quasi-Newton methods: quadratic cost, superlinear convergence

• Newton’s method: cubic cost, quadratic convergence

Are similar tradeoffs possible in statistical learning?

16


Hints of a Computation-Risk Tradeoff

Graph estimation:

• Our method for estimating the graph of Ising models: n = Ω(d^3 log p), T = O(p^4) for graphs with p nodes and maximum degree d

• Information-theoretic lower bound: n = Ω(d log p)

17


Statistical vs. Computational Efficiency

Challenge: Understand how families of estimators with different computational efficiencies can yield different statistical efficiencies

Rate_{H,F}(n) = inf_{mn ∈ H} sup_{m ∈ F} Risk(mn, m)

• H: computationally constrained hypothesis class

• F: smoothness constraints on the “true” model

18


Computation-Risk Tradeoffs for Linear Regression

Dinah Shender has been studying such a tradeoff in the setting of high dimensional linear regression

19


Computation-Risk Tradeoffs for Linear Regression

Standard ridge estimator solves

( (1/n) X^T X + λn I ) βλ = (1/n) X^T Y

Sparsify the sample covariance to get the estimator

( Tt[Σ] + λn I ) βt,λ = (1/n) X^T Y

where Tt[Σ] is the hard-thresholded sample covariance:

Tt([mij]) = [ mij 1(|mij| > t) ]

Recent advance in theoretical CS (Spielman et al.): solving a symmetric diagonally-dominant linear system with m nonzero matrix entries can be done in time

O(m log^2 p)
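The thresholded estimator is straightforward to sketch in NumPy. This is an illustrative implementation under our own assumptions (function names and the dense `np.linalg.solve` call are ours; the point of thresholding is that a sparse near-SDD solver could exploit the zero pattern instead):

```python
import numpy as np

def thresholded_ridge(X, Y, t, lam):
    """Covariance-thresholded ridge: solve (T_t[Sigma] + lam I) beta = X^T Y / n."""
    n, p = X.shape
    Sigma = X.T @ X / n                      # sample covariance
    Sigma_t = Sigma * (np.abs(Sigma) > t)    # hard threshold T_t[Sigma]
    b = X.T @ Y / n
    # Dense solve for illustration; a fast sparse solver would be used
    # here to realize the computational savings from thresholding.
    return np.linalg.solve(Sigma_t + lam * np.eye(p), b)

# Toy example: sparse true coefficients, (near-)identity covariance,
# so most off-diagonal sample-covariance entries fall below t.
rng = np.random.default_rng(2)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:3] = [1.0, -2.0, 0.5]
Y = X @ beta_star + 0.1 * rng.standard_normal(n)
beta_hat = thresholded_ridge(X, Y, t=0.2, lam=0.1)
```

With t = 0 no entry is thresholded and the estimator reduces to ordinary ridge, which gives a convenient correctness check.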

20


Computation-Risk Tradeoffs for Linear Regression

Dinah has recently proved that the statistical error scales as

‖βt,λ − β∗‖ / ‖β∗‖ = OP( ‖Tt(Σ) − Σ‖_2 ) = O(t^{1−q})

for the class of covariance matrices with rows in sparse ℓq balls (as studied by Bickel and Levina).

• Combined with the computational advance, this gives us an explicit, fine-grained risk/computation tradeoff

21


Simulation

[Plot: risk vs. lambda]

22


Some Other Projects

Minhua Chen: Convex optimization for dictionary learning

Eric Janofsky: Nonparanormal component analysis

Min Xu: High dimensional conditional density and graph estimation

23


Courses in the Works

• Winter 2013: Nonparametric Inference (Undergraduate and Masters)

• Spring 2013: Machine Learning for Big Data (Undergraduate Statistics and Computer Science)

Charles Cary: Developing cloud-based infrastructure for the course. Candidate data: 80 million images, Yahoo! clickthrough data, Science journal articles, City of Chicago datasets.

24