TRANSCRIPT
©Kevin Jamieson 2018
Warm up
2
For each block compute the memory required in terms of n, p, d. If d << p << n, what is the most memory efficient program (blue, green, red)? If you have unlimited memory, what do you think is the fastest program?
1 float in NumPy = 8 bytes. 10^6 ≈ 2^20 bytes = 1 MB. 10^9 ≈ 2^30 bytes = 1 GB.
[Slide shows three code blocks (blue, green, red) annotated with handwritten memory computations]
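As a stand-in for the kind of accounting the warm-up asks for, here is a minimal NumPy sketch (the sizes n, d, p and the particular intermediate arrays are assumed placeholders, not the slide's blue/green/red programs):

```python
import numpy as np

# Assumed placeholder sizes with d << p << n
n, d, p = 10_000, 10, 1_000

X = np.random.randn(n, d)     # n*d floats -> 8*n*d bytes
Phi = np.random.randn(n, p)   # n*p floats -> 8*n*p bytes (dominates X when p >> d)
G = Phi.T @ Phi               # p*p floats -> 8*p*p bytes

for name, A in [("X (n x d)", X), ("Phi (n x p)", Phi), ("Phi^T Phi (p x p)", G)]:
    print(f"{name}: {A.nbytes / 2**20:.1f} MB")
```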
©2017 Kevin Jamieson 3
Regularization
Machine Learning – CSE546 Kevin Jamieson University of Washington
April 15, 2019
Ridge Regression
4
■ Old Least squares objective:
■ Ridge Regression objective:
©2017 Kevin Jamieson
Least squares:
$$\widehat{w}_{LS} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2$$

Ridge regression:
$$\widehat{w}_{ridge} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda \|w\|_2^2$$
Minimizing the Ridge Regression Objective
5©2017 Kevin Jamieson
$$\widehat{w}_{ridge} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda \|w\|_2^2$$

Setting the gradient to zero:
$$0 = \nabla_w \left( \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda \|w\|_2^2 \right)
= -\sum_{i=1}^n 2 x_i \left(y_i - x_i^T w\right) + 2\lambda w
= -2\left(\sum_{i=1}^n x_i y_i\right) + 2\left(\sum_{i=1}^n x_i x_i^T + \lambda I\right) w$$

$$\widehat{w}_{ridge} = \left(\sum_{i=1}^n x_i x_i^T + \lambda I\right)^{-1} \left(\sum_{i=1}^n x_i y_i\right) = (X^T X + \lambda I)^{-1} X^T y$$
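A minimal NumPy sketch of the closed-form solution above (X, y, and lam are assumed placeholders; np.linalg.solve is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Return (X^T X + lam*I)^{-1} X^T y without forming the inverse explicitly."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Example usage with synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.standard_normal(100)
print(ridge_closed_form(X, y, lam=1.0))
```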
Ridge Coefficient Path
■ Typical approach: select λ using cross validation, up next
6
From Kevin Murphy textbook
©2017 Kevin Jamieson
[Plot: ridge coefficient path as a function of 1/λ, for X^T X in general]
Bias-Variance Properties
7
■ Assume: $X^T X = nI$ and $y = Xw + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$
©2017 Kevin Jamieson
$$\widehat{w}_{ridge} = (X^T X + \lambda I)^{-1} X^T y$$
If $x \in \mathbb{R}^d$ and $Y \sim \mathcal{N}(x^T w, \sigma^2)$, what is $\mathbb{E}_{Y|x,\text{train}}\left[(Y - x^T \widehat{w}_{ridge})^2 \mid X = x\right]$?
$$\mathbb{E}_{Y|X,D}\left[(Y - x^T \widehat{w}_{ridge})^2 \mid X = x\right]$$
$$= \mathbb{E}_{Y|X}\left[(Y - \mathbb{E}_{Y|X}[Y \mid X = x])^2 \mid X = x\right] + \mathbb{E}_D\left[(\mathbb{E}_{Y|X}[Y \mid X = x] - x^T \widehat{w}_{ridge})^2\right]$$
$$= \mathbb{E}_{Y|X}\left[(Y - x^T w)^2 \mid X = x\right] + \mathbb{E}_D\left[(x^T w - x^T \widehat{w}_{ridge})^2\right]$$
$$= \sigma^2 + (x^T w - \mathbb{E}_D[x^T \widehat{w}_{ridge}])^2 + \mathbb{E}_D\left[(\mathbb{E}_D[x^T \widehat{w}_{ridge}] - x^T \widehat{w}_{ridge})^2\right]$$
$$= \underbrace{\sigma^2}_{\text{irreduc. error}} + \underbrace{\frac{\lambda^2}{(n+\lambda)^2}(w^T x)^2}_{\text{bias-squared}} + \underbrace{\frac{d\,\sigma^2 n}{(n+\lambda)^2}\,\|x\|_2^2}_{\text{variance}} \quad \text{(verify at home)}$$
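A small simulation sketch (synthetic sizes and a design with X^T X = nI are assumptions chosen to match the slide's setting) that estimates the bias-squared and variance of x^T ŵ_ridge over many draws of the training noise; it can be compared against the "(verify at home)" formula:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, sigma = 50, 5, 10.0, 0.5

# Design with X^T X = n I: orthonormal columns scaled by sqrt(n)
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
X = np.sqrt(n) * Q
w = rng.standard_normal(d)
x = rng.standard_normal(d)                     # fixed test point

preds = []
for _ in range(20_000):                        # many draws of the training noise
    y = X @ w + sigma * rng.standard_normal(n)
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    preds.append(x @ w_ridge)
preds = np.array(preds)

bias_sq = (x @ w - preds.mean()) ** 2          # (x^T w - E_D[x^T w_ridge])^2
variance = preds.var()                         # E_D[(x^T w_ridge - E_D[x^T w_ridge])^2]
print("irreducible:", sigma**2, " bias^2:", bias_sq, " variance:", variance)
```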
Ridge Regression: Effect of Regularization
8©2017 Kevin Jamieson
TRAIN error: $\frac{1}{|D|}\sum_{(x_i,y_i)\in D} \left(y_i - x_i^T \widehat{w}^{(\lambda)}_{D,ridge}\right)^2$, where $D \overset{\text{i.i.d.}}{\sim} P_{XY}$ and
$$\widehat{w}^{(\lambda)}_{D,ridge} = \arg\min_w \frac{1}{|D|}\sum_{(x_i,y_i)\in D} \left(y_i - x_i^T w\right)^2 + \lambda\|w\|_2^2$$
Ridge Regression: Effect of Regularization
9©2017 Kevin Jamieson
TRAIN error: $\frac{1}{|D|}\sum_{(x_i,y_i)\in D} \left(y_i - x_i^T \widehat{w}^{(\lambda)}_{D,ridge}\right)^2$
TEST error: $\frac{1}{|T|}\sum_{(x_i,y_i)\in T} \left(y_i - x_i^T \widehat{w}^{(\lambda)}_{D,ridge}\right)^2$
where $D \overset{\text{i.i.d.}}{\sim} P_{XY}$, $T \overset{\text{i.i.d.}}{\sim} P_{XY}$, and (important) $D \cap T = \emptyset$, with
$$\widehat{w}^{(\lambda)}_{D,ridge} = \arg\min_w \frac{1}{|D|}\sum_{(x_i,y_i)\in D} \left(y_i - x_i^T w\right)^2 + \lambda\|w\|_2^2$$
Ridge Regression: Effect of Regularization
10©2017 Kevin Jamieson
TRAIN and TEST error are defined as on the previous slide; here they are plotted as a function of the regularization strength:
[Plot: TRAIN and TEST error vs. 1/λ (large λ at left, small λ at right); each line is an i.i.d. draw of D or T]
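A sketch of the experiment behind such a plot (the synthetic data and λ grid are assumptions): fit ŵ^(λ)_{D,ridge} on a training draw D, then evaluate the train and test error over a grid of λ:

```python
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(X, y, w):
    return np.mean((y - X @ w) ** 2)

rng = np.random.default_rng(1)
n, d, sigma = 40, 30, 1.0
w_true = rng.standard_normal(d)
X_train = rng.standard_normal((n, d))
y_train = X_train @ w_true + sigma * rng.standard_normal(n)
X_test = rng.standard_normal((1000, d))
y_test = X_test @ w_true + sigma * rng.standard_normal(1000)

for lam in [0.01, 0.1, 1, 10, 100]:
    w = ridge_fit(X_train, y_train, lam)
    print(f"lam={lam:>6}: train={mse(X_train, y_train, w):.3f}  test={mse(X_test, y_test, w):.3f}")
```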
©2018 Kevin Jamieson 11
Cross-Validation
Machine Learning – CSE546 Kevin Jamieson University of Washington
October 9, 2016
How… How… How???????
12©2018 Kevin Jamieson
■ How do we pick the regularization constant λ…
■ How do we pick the number of basis functions…
■ We could use the test data, but…
How… How… How???????
■ How do we pick the regularization constant λ…
■ How do we pick the number of basis functions…
■ We could use the test data, but…
13©2018 Kevin Jamieson
■ Never ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever train on the test data
(LOO) Leave-one-out cross validation
■ Consider a validation set with 1 example:
  D – training data
  D\j – training data with the j-th data point (x_j, y_j) moved to the validation set
■ Learn classifier f_{D\j} with the D\j dataset
■ Estimate the true error as the squared error on predicting y_j:
  Unbiased estimate of error_true(f_{D\j})!
14©2018 Kevin Jamieson
(LOO) Leave-one-out cross validation
■ Consider a validation set with 1 example:
  D – training data
  D\j – training data with the j-th data point (x_j, y_j) moved to the validation set
■ Learn classifier f_{D\j} with the D\j dataset
■ Estimate the true error as the squared error on predicting y_j:
  Unbiased estimate of error_true(f_{D\j})!
■ LOO cross validation: Average over all data points j:
  For each data point you leave out, learn a new classifier f_{D\j}
  Estimate error as:
15©2018 Kevin Jamieson
$$\text{error}_{\text{LOO}} = \frac{1}{n}\sum_{j=1}^n \left(y_j - f_{D\backslash j}(x_j)\right)^2$$
LOO cross validation is an (almost) unbiased estimate of the true error of f_D!
■ When computing the LOOCV error, we only use N−1 data points
  So it's not an estimate of the true error of learning with N data points
  Usually pessimistic, though – learning with less data typically gives a worse answer
■ LOO is almost unbiased! Use the LOO error for model selection!!! E.g., picking λ
16©2018 Kevin Jamieson
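A minimal sketch of error_LOO for ridge regression, matching the formula above (synthetic data and a naive refit for each held-out point are assumptions):

```python
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def loo_error(X, y, lam):
    """error_LOO = (1/n) * sum_j (y_j - f_{D\\j}(x_j))^2, naively refitting n times."""
    n = X.shape[0]
    errs = []
    for j in range(n):
        mask = np.arange(n) != j          # leave point j out
        w = ridge_fit(X[mask], y[mask], lam)
        errs.append((y[j] - X[j] @ w) ** 2)
    return np.mean(errs)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ np.array([1.0, -1.0, 0.0, 2.0, 0.5]) + 0.3 * rng.standard_normal(50)
print(loo_error(X, y, lam=1.0))
```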
Computational cost of LOO
■ Suppose you have 100,000 data points
■ You implemented a great version of your learning algorithm: it learns in only 1 second
■ Computing LOO will take about 1 day!!!
17©2018 Kevin Jamieson
Use k-fold cross validation
■ Randomly divide training data into k equal parts D1,…,Dk
■ For each i:
  Learn classifier f_{D\D_i} using the data points not in D_i
  Estimate the error of f_{D\D_i} on validation set D_i:
18©2018 Kevin Jamieson
$$\text{error}_{D_i} = \frac{1}{|D_i|}\sum_{(x_j,y_j)\in D_i} \left(y_j - f_{D\backslash D_i}(x_j)\right)^2$$
Use k-fold cross validation
■ Randomly divide training data into k equal parts D_1, …, D_k
■ For each i:
  Learn classifier f_{D\D_i} using the data points not in D_i
  Estimate the error of f_{D\D_i} on validation set D_i:
  $$\text{error}_{D_i} = \frac{1}{|D_i|}\sum_{(x_j,y_j)\in D_i} \left(y_j - f_{D\backslash D_i}(x_j)\right)^2$$
■ k-fold cross validation error is the average over the data splits:
  $$\text{error}_{k\text{-fold}} = \frac{1}{k}\sum_{i=1}^k \text{error}_{D_i}$$
■ k-fold cross validation properties:
  Much faster to compute than LOO
  More (pessimistically) biased – using much less data, only n(k−1)/k
  Usually, k = 10
19©2018 Kevin Jamieson
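A minimal k-fold sketch matching the error_{D_i} formula above (synthetic data is assumed; ridge_fit is the closed-form solver from earlier, repeated here so the snippet is self-contained):

```python
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def kfold_error(X, y, lam, k=10, seed=0):
    """Average validation error over k random, roughly equal splits."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    fold_errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)               # D \ D_i
        w = ridge_fit(X[train], y[train], lam)
        fold_errs.append(np.mean((y[fold] - X[fold] @ w) ** 2))
    return np.mean(fold_errs)

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.standard_normal(200)

# Pick lambda by 10-fold cross validation over a small grid
best = min([0.01, 0.1, 1, 10, 100], key=lambda lam: kfold_error(X, y, lam))
print("lambda chosen by 10-fold CV:", best)
```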
Recap
■ Given a dataset, begin by splitting into TRAIN and TEST
■ Model selection: Use k-fold cross-validation on TRAIN to train predictor and choose magic parameters such as λ
■ Model assessment: Use TEST to assess the accuracy of the model you output
■ Never ever ever ever ever train or choose parameters based on the test data
20©2018 Kevin Jamieson
[Diagram: dataset split into TRAIN and TEST; TRAIN further divided into folds TRAIN-1/VAL-1, TRAIN-2/VAL-2, TRAIN-3/VAL-3]
Example
■ Given 10,000-dimensional data and n examples, we pick a subset of 50 dimensions that have the highest correlation with labels in the training set:
■ After picking our 50 features, we then use CV to train ridge regression with regularization λ
■ What’s wrong with this procedure?
21©2018 Kevin Jamieson
The 50 indices $j$ that have the largest $\dfrac{\left|\sum_{i=1}^n x_{i,j}\, y_i\right|}{\sqrt{\sum_{i=1}^n x_{i,j}^2}}$
Recap
■ Learning is…
  Collect some data
    ■ E.g., housing info and sale price
  Randomly split dataset into TRAIN, VAL, and TEST
    ■ E.g., 80%, 10%, and 10%, respectively
  Choose a hypothesis class or model
    ■ E.g., linear with non-linear transformations
  Choose a loss function
    ■ E.g., least squares with ridge regression penalty on TRAIN
  Choose an optimization procedure
    ■ E.g., set derivative to zero to obtain estimator, cross-validation on VAL to pick num. features and amount of regularization
  Justify the accuracy of the estimate
    ■ E.g., report TEST error
22©2018 Kevin Jamieson
Warm up
1©2018 Kevin Jamieson
Fix any a, b, c > 0.
1. What is the $x \in \mathbb{R}$ that minimizes $ax^2 + bx + c$?
2. What is the $x \in \mathbb{R}$ that minimizes $\max\{-ax + b,\; cx\}$?
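For reference, a worked sketch of the two answers (not shown on the slide):

```latex
% 1. For a > 0, ax^2 + bx + c is convex; setting the derivative to zero:
%    2ax + b = 0  =>  x = -b / (2a).
% 2. -ax + b is decreasing and cx is increasing, so max{-ax + b, cx} is
%    minimized where the two lines cross: -ax + b = cx  =>  x = b / (a + c).
\[
  x^\star_1 = -\frac{b}{2a},
  \qquad
  x^\star_2 = \frac{b}{a + c}.
\]
```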
©2018 Kevin Jamieson 23
Simple Variable Selection
LASSO: Sparse Regression
Machine Learning – CSE546 Kevin Jamieson University of Washington
October 9, 2016
Sparsity
■ Vector w is sparse if many entries are zero
24©2018 Kevin Jamieson
$$\widehat{w}_{LS} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2$$
Sparsity
■ Vector w is sparse if many entries are zero
■ Efficiency: If size(w) = 100 billion, each prediction is expensive:
  ■ If w is sparse, the prediction computation only depends on the number of non-zeros
25©2018 Kevin Jamieson
$$\widehat{w}_{LS} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2$$
$$\hat{y}_i = \widehat{w}_{LS}^{\,\top} x_i = \sum_{j=1}^d x_i[j]\, \widehat{w}_{LS}[j] = \sum_{j:\, \widehat{w}_j \neq 0} \widehat{w}_j\, h_j(x_i)$$
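A small sketch of why sparsity makes prediction cheap (the toy sizes are assumptions): with a sparse w, the dot product only needs to touch the nonzero coordinates:

```python
import numpy as np

d = 1_000_000
rng = np.random.default_rng(0)

w = np.zeros(d)
nonzero = rng.choice(d, size=10, replace=False)   # only 10 active features
w[nonzero] = rng.standard_normal(10)

x = rng.standard_normal(d)

# Dense prediction touches all d entries...
y_dense = x @ w
# ...but with a sparse w we only need the nonzero coordinates.
y_sparse = x[nonzero] @ w[nonzero]
print(np.isclose(y_dense, y_sparse))
```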
Sparsity
■ Vector w is sparse if many entries are zero
■ Interpretability: What are the relevant dimensions to make a prediction?
26©2018 Kevin Jamieson
$$\widehat{w}_{LS} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2$$
■ How do we find “best” subset among all possible?
[Figure: house listing with price "$?" and candidate features: lot size, single family, year built, last sold price, last sale price/sqft, finished sqft, unfinished sqft, finished basement sqft, # floors, flooring types, parking type, parking amount, cooling, heating, exterior materials, roof type, structure style, dishwasher, garbage disposal, microwave, range/oven, refrigerator, washer, dryer, laundry location, heating type, jetted tub, deck, fenced yard, lawn, garden, sprinkler system]
Finding best subset: Exhaustive
■ Try all subsets of size 1, 2, 3, … and choose the one that minimizes validation error
■ Problem?
27©2018 Kevin Jamieson
Finding best subset: Greedy
Forward stepwise: Start from a simple model and iteratively add the features most useful to the fit
Backward stepwise: Start with the full model and iteratively remove the features least useful to the fit
Combining forward and backward steps: In the forward algorithm, insert steps to remove features that are no longer as important
Lots of other variants, too.
28©2018 Kevin Jamieson
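A minimal sketch of forward stepwise selection for least squares (synthetic data and a fixed stopping size are assumptions): greedily add the column that most reduces the training residual:

```python
import numpy as np

def forward_stepwise(X, y, num_features):
    """Greedily add the column that most reduces the least-squares residual."""
    selected = []
    for _ in range(num_features):
        best_j, best_rss = None, np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            cols = selected + [j]
            w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ w) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = 3 * X[:, 4] - 2 * X[:, 7] + 0.1 * rng.standard_normal(100)
print(forward_stepwise(X, y, num_features=2))   # typically recovers columns 4 and 7
```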
Finding best subset: Regularize
Ridge regression makes coefficients small
29©2018 Kevin Jamieson
$$\widehat{w}_{ridge} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda\|w\|_2^2$$
Finding best subset: Regularize
Ridge regression makes coefficients small
30©2018 Kevin Jamieson
$$\widehat{w}_{ridge} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda\|w\|_2^2$$
[Plot: ridge coefficient path as a function of 1/λ, from Kevin Murphy textbook]
Thresholded Ridge Regression
31©2018 Kevin Jamieson
$$\widehat{w}_{ridge} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda\|w\|_2^2$$
Why don’t we just set small ridge coefficients to 0?
32©2018 Kevin Jamieson
$$\widehat{w}_{ridge} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda\|w\|_2^2$$
Consider two related features (bathrooms, showers)
Thresholded Ridge Regression
33©2018 Kevin Jamieson
$$\widehat{w}_{ridge} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda\|w\|_2^2$$
What if we didn’t include showers? Weight on bathrooms increases!
Thresholded Ridge Regression
Can another regularizer perform selection automatically?
Recall Ridge Regression
34
■ Ridge Regression objective:
©2018 Kevin Jamieson
$$\widehat{w}_{ridge} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda\|w\|_2^2$$
$$\|w\|_p = \left(\sum_{i=1}^d |w_i|^p\right)^{1/p}$$
Ridge vs. Lasso Regression
35
■ Ridge Regression objective:
■ Lasso objective:
©2018 Kevin Jamieson
Ridge:
$$\widehat{w}_{ridge} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda\|w\|_2^2$$
Lasso:
$$\widehat{w}_{lasso} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda\|w\|_1$$
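To see the selection effect, a short scikit-learn sketch comparing how many coefficients each method zeroes out (the synthetic data and the alpha values are assumptions; note that sklearn's Lasso scales the squared loss by 1/(2n), so its alpha is not the same λ as in the objective above):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, d = 100, 50
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:5] = [3, -2, 1.5, -1, 0.5]            # only 5 truly relevant features
y = X @ w_true + 0.5 * rng.standard_normal(n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge nonzeros:", np.sum(ridge.coef_ != 0))   # typically all d
print("lasso nonzeros:", np.sum(lasso.coef_ != 0))   # typically close to 5
```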
Penalized Least Squares
36©2018 Kevin Jamieson
$$\widehat{w}_r = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda\, r(w)$$
Ridge: $r(w) = \|w\|_2^2$    Lasso: $r(w) = \|w\|_1$
Penalized Least Squares
37©2018 Kevin Jamieson
$$\widehat{w}_r = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda\, r(w)$$
Ridge: $r(w) = \|w\|_2^2$    Lasso: $r(w) = \|w\|_1$
For any $\lambda \ge 0$ for which $\widehat{w}_r$ achieves the minimum, there exists a $\nu \ge 0$ such that
$$\widehat{w}_r = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 \quad \text{subject to } r(w) \le \nu$$
Penalized Least Squares
38©2018 Kevin Jamieson
$$\widehat{w}_r = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda\, r(w)$$
Ridge: $r(w) = \|w\|_2^2$    Lasso: $r(w) = \|w\|_1$
For any $\lambda \ge 0$ for which $\widehat{w}_r$ achieves the minimum, there exists a $\nu \ge 0$ such that
$$\widehat{w}_r = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 \quad \text{subject to } r(w) \le \nu$$