
Page 1:

©Kevin Jamieson 2018

Warm up

2

For each block, compute the memory required in terms of n, p, and d. If d << p << n, which is the most memory-efficient program (blue, green, or red)? If you have unlimited memory, what do you think is the fastest program?

1 float in NumPy = 8 bytes; $10^6 \approx 2^{20}$ bytes = 1 MB; $10^9 \approx 2^{30}$ bytes = 1 GB
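The blue/green/red code blocks from lecture aren't reproduced in this transcript, but the memory accounting can be sanity-checked by counting floats per array. A small sketch (the shapes and the values of n, p, d below are hypothetical):

```python
import numpy as np

# Hypothetical sizes; the actual blue/green/red programs are not shown in this transcript.
n, p, d = 100_000, 10_000, 100
BYTES_PER_FLOAT = 8  # NumPy float64

def mem_gb(shape):
    """Memory in GB for a dense float64 array of the given shape."""
    return np.prod(shape) * BYTES_PER_FLOAT / 2**30

# Typical intermediates a regression program might allocate:
print(f"X   (n x d): {mem_gb((n, d)):10.3f} GB")   # raw data matrix
print(f"H   (n x p): {mem_gb((n, p)):10.3f} GB")   # featurized data
print(f"HtH (p x p): {mem_gb((p, p)):10.3f} GB")   # Gram matrix
print(f"HHt (n x n): {mem_gb((n, n)):10.3f} GB")   # avoid when n is large
```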


Page 2:

©2017 Kevin Jamieson 3

Regularization

Machine Learning – CSE546 Kevin Jamieson University of Washington

April 15, 2019

Page 3:

Ridge Regression

4

■ Old least squares objective:

$$\hat{w}_{\text{LS}} = \arg\min_w \sum_{i=1}^{n}\left(y_i - x_i^T w\right)^2$$

■ Ridge regression objective:

$$\hat{w}_{\text{ridge}} = \arg\min_w \sum_{i=1}^{n}\left(y_i - x_i^T w\right)^2 + \lambda \|w\|_2^2$$

©2017 Kevin Jamieson

Page 4:

Minimizing the Ridge Regression Objective

5©2017 Kevin Jamieson

$$\hat{w}_{\text{ridge}} = \arg\min_w \sum_{i=1}^{n}\left(y_i - x_i^T w\right)^2 + \lambda \|w\|_2^2$$

Setting the gradient to zero:

$$0 = \nabla_w\!\left(\sum_{i=1}^{n}\left(y_i - x_i^T w\right)^2 + \lambda \|w\|_2^2\right) = -\sum_{i=1}^{n} 2x_i\left(y_i - x_i^T w\right) + 2\lambda w$$

$$= -2\left(\sum_{i=1}^{n} x_i y_i\right) + 2\left(\sum_{i=1}^{n} x_i x_i^T + \lambda I\right) w$$

$$\hat{w}_{\text{ridge}} = \left(\sum_{i=1}^{n} x_i x_i^T + \lambda I\right)^{-1}\left(\sum_{i=1}^{n} x_i y_i\right) = (X^T X + \lambda I)^{-1} X^T y$$
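As a sanity check on the closed form, here is a minimal NumPy sketch (assuming a dense n-by-d matrix X and a length-n vector y; names are illustrative):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for the ridge estimator."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    b = X.T @ y
    # Solving the linear system is cheaper and more stable than forming an explicit inverse.
    return np.linalg.solve(A, b)

# Tiny usage example on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, 0.0, -2.0, 0.5, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=100)
print(ridge_closed_form(X, y, lam=1.0))
```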

Page 5:

Ridge Coefficient Path

■ Typical approach: select λ using cross validation, up next

6

©2017 Kevin Jamieson

[Figure: ridge coefficient paths plotted against $1/\lambda$ (from the Kevin Murphy textbook), for $X^T X$ in general.]

Page 6:

Bias-Variance Properties

7©2017 Kevin Jamieson

■ Assume: $\hat{w}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$, $X^T X = nI$, and $y = Xw + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$.

If $x \in \mathbb{R}^d$ and $Y \sim \mathcal{N}(x^T w, \sigma^2)$, what is $\mathbb{E}_{Y|x,\text{train}}\big[(Y - x^T \hat{w}_{\text{ridge}})^2 \mid X = x\big]$?

$$\mathbb{E}_{Y|X,D}\big[(Y - x^T \hat{w}_{\text{ridge}})^2 \mid X = x\big]$$
$$= \mathbb{E}_{Y|X}\big[(Y - \mathbb{E}_{Y|X}[Y \mid X = x])^2 \mid X = x\big] + \mathbb{E}_D\big[(\mathbb{E}_{Y|X}[Y \mid X = x] - x^T \hat{w}_{\text{ridge}})^2\big]$$
$$= \mathbb{E}_{Y|X}\big[(Y - x^T w)^2 \mid X = x\big] + \mathbb{E}_D\big[(x^T w - x^T \hat{w}_{\text{ridge}})^2\big]$$
$$= \sigma^2 + \big(x^T w - \mathbb{E}_D[x^T \hat{w}_{\text{ridge}}]\big)^2 + \mathbb{E}_D\big[(\mathbb{E}_D[x^T \hat{w}_{\text{ridge}}] - x^T \hat{w}_{\text{ridge}})^2\big]$$

(irreducible error + bias-squared + variance)

$$= \sigma^2 + \frac{\lambda^2}{(n+\lambda)^2}(w^T x)^2 + \frac{d\,\sigma^2 n}{(n+\lambda)^2}\|x\|_2^2 \quad \text{(verify at home)}$$
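The "verify at home" line invites a numerical check. A rough Monte Carlo sketch (not from the slides): build an X with $X^T X = nI$ by scaling an orthonormal basis, redraw the noise many times, and estimate the bias-squared and variance of $x^T \hat{w}_{\text{ridge}}$ empirically for comparison with the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma, lam = 200, 5, 1.0, 10.0

# Design with X^T X = n I: orthonormal columns scaled by sqrt(n).
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))
X = np.sqrt(n) * Q

w = rng.normal(size=d)   # true weights
x = rng.normal(size=d)   # a fixed test point

preds = []
for _ in range(20_000):
    y = X @ w + sigma * rng.normal(size=n)     # redraw the training noise
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    preds.append(x @ w_ridge)
preds = np.array(preds)

bias_sq = (x @ w - preds.mean()) ** 2
variance = preds.var()
print("empirical bias^2:        ", bias_sq)
print("empirical variance:      ", variance)
print("empirical expected error:", sigma**2 + bias_sq + variance)
```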

Page 7:

Ridge Regression: Effect of Regularization

8©2017 Kevin Jamieson

$$D \overset{\text{i.i.d.}}{\sim} P_{XY}, \qquad \hat{w}^{(\lambda)}_{D,\text{ridge}} = \arg\min_w \frac{1}{|D|}\sum_{(x_i, y_i) \in D}\left(y_i - x_i^T w\right)^2 + \lambda \|w\|_2^2$$

TRAIN error: $\displaystyle \frac{1}{|D|}\sum_{(x_i, y_i) \in D}\left(y_i - x_i^T \hat{w}^{(\lambda)}_{D,\text{ridge}}\right)^2$

Page 8:

Ridge Regression: Effect of Regularization

9©2017 Kevin Jamieson

$$D \overset{\text{i.i.d.}}{\sim} P_{XY}, \qquad T \overset{\text{i.i.d.}}{\sim} P_{XY}, \qquad \text{Important: } D \cap T = \emptyset$$

$$\hat{w}^{(\lambda)}_{D,\text{ridge}} = \arg\min_w \frac{1}{|D|}\sum_{(x_i, y_i) \in D}\left(y_i - x_i^T w\right)^2 + \lambda \|w\|_2^2$$

TRAIN error: $\displaystyle \frac{1}{|D|}\sum_{(x_i, y_i) \in D}\left(y_i - x_i^T \hat{w}^{(\lambda)}_{D,\text{ridge}}\right)^2$

TEST error: $\displaystyle \frac{1}{|T|}\sum_{(x_i, y_i) \in T}\left(y_i - x_i^T \hat{w}^{(\lambda)}_{D,\text{ridge}}\right)^2$

Page 9:

Ridge Regression: Effect of Regularization

10©2017 Kevin Jamieson

$$D \overset{\text{i.i.d.}}{\sim} P_{XY}, \qquad T \overset{\text{i.i.d.}}{\sim} P_{XY}, \qquad \text{Important: } D \cap T = \emptyset$$

$$\hat{w}^{(\lambda)}_{D,\text{ridge}} = \arg\min_w \frac{1}{|D|}\sum_{(x_i, y_i) \in D}\left(y_i - x_i^T w\right)^2 + \lambda \|w\|_2^2$$

TRAIN error: $\displaystyle \frac{1}{|D|}\sum_{(x_i, y_i) \in D}\left(y_i - x_i^T \hat{w}^{(\lambda)}_{D,\text{ridge}}\right)^2$

TEST error: $\displaystyle \frac{1}{|T|}\sum_{(x_i, y_i) \in T}\left(y_i - x_i^T \hat{w}^{(\lambda)}_{D,\text{ridge}}\right)^2$

[Figure: TRAIN and TEST error plotted against $1/\lambda$ (small λ vs. large λ marked on the axis); each line is an i.i.d. draw of D or T.]
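A sketch of how such curves can be produced (synthetic data; the λ grid is arbitrary): fit ridge on a draw D for each λ, then evaluate on D (TRAIN) and on a disjoint draw T (TEST).

```python
import numpy as np
import matplotlib.pyplot as plt

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(7)
n, d, sigma = 100, 30, 1.0
w_true = rng.normal(size=d)

def draw(n):
    """One i.i.d. draw of a dataset from P_XY."""
    X = rng.normal(size=(n, d))
    return X, X @ w_true + sigma * rng.normal(size=n)

X_tr, y_tr = draw(n)   # D
X_te, y_te = draw(n)   # T, disjoint from D

lams = np.logspace(-2, 4, 30)
train_err, test_err = [], []
for lam in lams:
    w = ridge_fit(X_tr, y_tr, lam)
    train_err.append(np.mean((y_tr - X_tr @ w) ** 2))
    test_err.append(np.mean((y_te - X_te @ w) ** 2))

plt.semilogx(1 / lams, train_err, label="TRAIN error")
plt.semilogx(1 / lams, test_err, label="TEST error")
plt.xlabel("1 / lambda")
plt.legend()
plt.show()
```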

Page 10:

©2018 Kevin Jamieson 11

Cross-Validation

Machine Learning – CSE546 Kevin Jamieson University of Washington

October 9, 2016

Page 11:

How… How… How???????

12©2018 Kevin Jamieson

■ How do we pick the regularization constant λ…
■ How do we pick the number of basis functions…

■ We could use the test data, but…

Page 12:

How… How… How???????

■ How do we pick the regularization constant λ…
■ How do we pick the number of basis functions…

■ We could use the test data, but…

13©2018 Kevin Jamieson

■ Never ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever train on the test data

Page 13:

(LOO) Leave-one-out cross validation

■ Consider a validation set with 1 example: D is the training data, and D\j is D with the j-th data point (x_j, y_j) moved to the validation set.

■ Learn classifier f_{D\j} on the dataset D\j.
■ Estimate the true error as the squared error on predicting y_j:

An unbiased estimate of error_true(f_{D\j})!

14©2018 Kevin Jamieson

Page 14:

(LOO) Leave-one-out cross validation

■ Consider a validation set with 1 example: D is the training data, and D\j is D with the j-th data point (x_j, y_j) moved to the validation set.

■ Learn classifier f_{D\j} on the dataset D\j.
■ Estimate the true error as the squared error on predicting y_j: an unbiased estimate of error_true(f_{D\j})!

■ LOO cross-validation: average over all data points j. For each data point you leave out, learn a new classifier f_{D\j} and estimate the error as:

15©2018 Kevin Jamieson

$$\text{error}_{\text{LOO}} = \frac{1}{n}\sum_{j=1}^{n}\left(y_j - f_{D\setminus j}(x_j)\right)^2$$
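A minimal sketch of LOO for ridge regression (using the closed-form solver from earlier; note the n refits, which is what makes LOO expensive):

```python
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def loo_error(X, y, lam):
    """Leave-one-out CV error for ridge regression: refit n times."""
    n = X.shape[0]
    errs = []
    for j in range(n):
        mask = np.arange(n) != j                  # leave point j out
        w = ridge_fit(X[mask], y[mask], lam)
        errs.append((y[j] - X[j] @ w) ** 2)
    return np.mean(errs)

# Example: compare a few lambda values by LOO error on synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))
y = X @ rng.normal(size=8) + 0.5 * rng.normal(size=60)
for lam in [0.01, 0.1, 1.0, 10.0]:
    print(lam, loo_error(X, y, lam))
```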

Page 15:

LOO cross-validation is an (almost) unbiased estimate of the true error of h_D!

■ When computing the LOO CV error, we only use N-1 data points, so it is not an estimate of the true error of learning with N data points. It is usually pessimistic, though: learning with less data typically gives a worse answer.

■ LOO is almost unbiased! Use the LOO error for model selection, e.g., for picking λ.

16©2018 Kevin Jamieson

Page 16:

Computational cost of LOO

■ Suppose you have 100,000 data points.
■ You implemented a great version of your learning algorithm: it learns in only 1 second.
■ Computing LOO means retraining 100,000 times, i.e., about 10^5 seconds ≈ 28 hours, so it will take about 1 day!!!

17©2018 Kevin Jamieson

Page 17:

Use k-fold cross validation

■ Randomly divide training data into k equal parts D1,…,Dk

■ For each i: learn classifier f_{D\D_i} using the data points not in D_i, and estimate the error of f_{D\D_i} on the validation set D_i:

18©2018 Kevin Jamieson

$$\text{error}_{D_i} = \frac{1}{|D_i|}\sum_{(x_j, y_j) \in D_i}\left(y_j - f_{D\setminus D_i}(x_j)\right)^2$$

Page 18:

Use k-fold cross validation

■ Randomly divide training data into k equal parts D1,…,Dk

■ For each i: learn classifier f_{D\D_i} using the data points not in D_i, and estimate the error of f_{D\D_i} on the validation set D_i:

$$\text{error}_{D_i} = \frac{1}{|D_i|}\sum_{(x_j, y_j) \in D_i}\left(y_j - f_{D\setminus D_i}(x_j)\right)^2$$

■ The k-fold cross-validation error is the average over the data splits:

$$\text{error}_{k\text{-fold}} = \frac{1}{k}\sum_{i=1}^{k}\text{error}_{D_i}$$

■ k-fold cross-validation properties: much faster to compute than LOO; more (pessimistically) biased, since each fit uses much less data, only n(k-1)/k points; usually k = 10.

19©2018 Kevin Jamieson
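A hand-rolled sketch of k-fold CV for ridge (again reusing the closed-form solver; a library routine would work just as well):

```python
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def kfold_error(X, y, lam, k=10, seed=0):
    """Average validation error over k random folds D_1, ..., D_k."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errs = []
    for val in folds:
        train = np.setdiff1d(idx, val)            # D \ D_i
        w = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[val] - X[val] @ w) ** 2))
    return np.mean(errs)

# Pick the lambda with the smallest k-fold error.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 20))
y = X @ rng.normal(size=20) + rng.normal(size=500)
lams = [0.01, 0.1, 1.0, 10.0, 100.0]
print(min(lams, key=lambda lam: kfold_error(X, y, lam)))
```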

Page 19:

Recap

■ Given a dataset, begin by splitting it into TRAIN and TEST.

■ Model selection: Use k-fold cross-validation on TRAIN to train the predictor and choose magic parameters such as λ.

■ Model assessment: Use TEST to assess the accuracy of the model you output.

■ Never ever ever ever ever train or choose parameters based on the test data.

20©2018 Kevin Jamieson

[Figure: the dataset is split into TRAIN and TEST; TRAIN is further split into folds, each serving once as the validation set (TRAIN-1/VAL-1, TRAIN-2/VAL-2, TRAIN-3/VAL-3).]
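One way this recipe looks in code, sketched with scikit-learn (not a tool used in the lecture; the λ grid is made up):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 30))
y = X @ rng.normal(size=30) + rng.normal(size=1000)

# Split once; TEST is never touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# k-fold cross-validation on TRAIN to pick the regularization strength
# (scikit-learn's alpha plays the role of lambda).
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=10)
search.fit(X_train, y_train)

# Assess the chosen model on TEST exactly once.
print("chosen alpha:", search.best_params_["alpha"])
print("test R^2:", search.best_estimator_.score(X_test, y_test))
```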

Page 20:

Example

■ Given 10,000-dimensional data and n examples, we pick a subset of 50 dimensions that have the highest correlation with labels in the training set:

■ After picking our 50 features, we then use CV to train ridge regression with regularization λ

■ What’s wrong with this procedure?

21©2018 Kevin Jamieson

$$\text{the 50 indices } j \text{ with the largest } \ \frac{\left|\sum_{i=1}^{n} x_{i,j}\, y_i\right|}{\sqrt{\sum_{i=1}^{n} x_{i,j}^{2}}}$$
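For concreteness, a sketch of computing that screening statistic for every column of X (keeping the top 50 is the step the slide is questioning):

```python
import numpy as np

def correlation_scores(X, y):
    """|sum_i x_{i,j} y_i| / sqrt(sum_i x_{i,j}^2) for each column j."""
    return np.abs(X.T @ y) / np.sqrt((X ** 2).sum(axis=0))

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10_000))
y = rng.normal(size=200)
top50 = np.argsort(correlation_scores(X, y))[-50:]   # indices of the 50 largest scores
print(top50[:10])
```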

Page 21:

Recap

■ Learning is…
  Collect some data
    ■ E.g., housing info and sale price
  Randomly split the dataset into TRAIN, VAL, and TEST
    ■ E.g., 80%, 10%, and 10%, respectively
  Choose a hypothesis class or model
    ■ E.g., linear with non-linear transformations
  Choose a loss function
    ■ E.g., least squares with a ridge regression penalty, on TRAIN
  Choose an optimization procedure
    ■ E.g., set the derivative to zero to obtain the estimator; cross-validation on VAL to pick the number of features and the amount of regularization
  Justify the accuracy of the estimate
    ■ E.g., report TEST error

22©2018 Kevin Jamieson

Page 22:

Warm up

1©2018 Kevin Jamieson

Fix any a, b, c > 0.

1. What is the $x \in \mathbb{R}$ that minimizes $ax^2 + bx + c$?
2. What is the $x \in \mathbb{R}$ that minimizes $\max\{-ax + b,\; cx\}$?

Page 23:

©2018 Kevin Jamieson 23

Simple Variable Selection / LASSO: Sparse Regression

Machine Learning – CSE546 Kevin Jamieson University of Washington

October 9, 2016

Page 24:

Sparsity

■ Vector w is sparse if many of its entries are zero.

24©2018 Kevin Jamieson

$$\hat{w}_{\text{LS}} = \arg\min_w \sum_{i=1}^{n}\left(y_i - x_i^T w\right)^2$$

Page 25:

Sparsity

■ Vector w is sparse if many of its entries are zero.

Efficiency: If size(w) = 100 billion, each prediction is expensive:

$$\hat{w}_{\text{LS}} = \arg\min_w \sum_{i=1}^{n}\left(y_i - x_i^T w\right)^2, \qquad \hat{y}_i = \hat{w}_{\text{LS}}^\top x_i = \sum_{j=1}^{d} x_i[j]\,\hat{w}_{\text{LS}}[j]$$

■ If w is sparse, the prediction computation only depends on the number of non-zeros:

$$\hat{y}_i = \sum_{j:\,\hat{w}_j \neq 0} \hat{w}_j\, h_j(x_i)$$

25©2018 Kevin Jamieson

Page 26:

Sparsity

■ Vector w is sparse if many of its entries are zero.

Interpretability: Which dimensions are relevant for making a prediction?

26©2018 Kevin Jamieson

$$\hat{w}_{\text{LS}} = \arg\min_w \sum_{i=1}^{n}\left(y_i - x_i^T w\right)^2$$

■ How do we find the "best" subset among all possible subsets?

[Figure: a long list of candidate housing features — lot size, single family, year built, last sold price, last sale price/sqft, finished/unfinished sqft, finished basement sqft, # floors, flooring types, parking type and amount, cooling, heating, exterior materials, roof type, structure style, dishwasher, garbage disposal, microwave, range/oven, refrigerator, washer, dryer, laundry location, heating type, jetted tub, deck, fenced yard, lawn, garden, sprinkler system — with the question of which are relevant.]

Page 27:

Finding best subset: Exhaustive

■ Try all subsets of size 1, 2, 3, … and pick the one that minimizes validation error.

■ Problem?

27©2018 Kevin Jamieson

Page 28:

Finding best subset: Greedy

Forward stepwise: Start from a simple model and iteratively add the features most useful to the fit.

Backward stepwise: Start with the full model and iteratively remove the features least useful to the fit.

Combining forward and backward steps: In the forward algorithm, insert steps that remove features which are no longer as important.

Lots of other variants, too.

28©2018 Kevin Jamieson

Page 29:

Finding best subset: Regularize

Ridge regression makes coefficients small

29©2018 Kevin Jamieson

$$\hat{w}_{\text{ridge}} = \arg\min_w \sum_{i=1}^{n}\left(y_i - x_i^T w\right)^2 + \lambda \|w\|_2^2$$

Page 30:

Finding best subset: Regularize

Ridge regression makes coefficients small

30©2018 Kevin Jamieson

$$\hat{w}_{\text{ridge}} = \arg\min_w \sum_{i=1}^{n}\left(y_i - x_i^T w\right)^2 + \lambda \|w\|_2^2$$

[Figure: ridge coefficient paths as a function of $1/\lambda$, from the Kevin Murphy textbook.]

Page 31:

Thresholded Ridge Regression

31©2018 Kevin Jamieson

$$\hat{w}_{\text{ridge}} = \arg\min_w \sum_{i=1}^{n}\left(y_i - x_i^T w\right)^2 + \lambda \|w\|_2^2$$

Why don’t we just set small ridge coefficients to 0?

Page 32:

Thresholded Ridge Regression

32©2018 Kevin Jamieson

$$\hat{w}_{\text{ridge}} = \arg\min_w \sum_{i=1}^{n}\left(y_i - x_i^T w\right)^2 + \lambda \|w\|_2^2$$

Consider two related features (bathrooms, showers).

Page 33:

Thresholded Ridge Regression

33©2018 Kevin Jamieson

$$\hat{w}_{\text{ridge}} = \arg\min_w \sum_{i=1}^{n}\left(y_i - x_i^T w\right)^2 + \lambda \|w\|_2^2$$

What if we didn't include showers? The weight on bathrooms increases!

Can another regularizer perform selection automatically?
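Before moving on, a small numerical illustration of the bathrooms/showers effect (synthetic data, not from the slides): ridge splits weight across two nearly duplicate features, and the surviving feature's weight roughly doubles once its twin is removed.

```python
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(5)
n = 500
bathrooms = rng.normal(size=n)
showers = bathrooms + 0.01 * rng.normal(size=n)   # nearly identical feature
price = 2.0 * bathrooms + rng.normal(size=n)

X_both = np.column_stack([bathrooms, showers])
X_one = bathrooms[:, None]

print("both features: ", ridge_fit(X_both, price, lam=10.0))  # weight is split, roughly [1, 1]
print("bathrooms only:", ridge_fit(X_one, price, lam=10.0))   # roughly 2
```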

Page 34:

Recall Ridge Regression

34

■ Ridge Regression objective:

©2018 Kevin Jamieson

$$\hat{w}_{\text{ridge}} = \arg\min_w \sum_{i=1}^{n}\left(y_i - x_i^T w\right)^2 + \lambda \|w\|_2^2$$

$$\|w\|_p = \left(\sum_{i=1}^{d} |w_i|^p\right)^{1/p}$$

Page 35:

Ridge vs. Lasso Regression

35

■ Ridge regression objective:

$$\hat{w}_{\text{ridge}} = \arg\min_w \sum_{i=1}^{n}\left(y_i - x_i^T w\right)^2 + \lambda \|w\|_2^2$$

■ Lasso objective:

$$\hat{w}_{\text{lasso}} = \arg\min_w \sum_{i=1}^{n}\left(y_i - x_i^T w\right)^2 + \lambda \|w\|_1$$

©2018 Kevin Jamieson
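To see the qualitative difference, a quick sketch comparing the two penalties with scikit-learn (not used in the lecture): with the ℓ1 penalty many coefficients land at exactly zero, while ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(6)
n, d = 200, 50
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]          # only 5 relevant features
y = X @ w_true + 0.5 * rng.normal(size=n)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("exact zeros, ridge:", int(np.sum(ridge.coef_ == 0)), "of", d)  # typically none
print("exact zeros, lasso:", int(np.sum(lasso.coef_ == 0)), "of", d)  # most of the irrelevant ones
```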

Page 36:

Penalized Least Squares

36©2018 Kevin Jamieson

$$\hat{w}_r = \arg\min_w \sum_{i=1}^{n}\left(y_i - x_i^T w\right)^2 + \lambda\, r(w)$$

Ridge: $r(w) = \|w\|_2^2$    Lasso: $r(w) = \|w\|_1$

Page 37:

Penalized Least Squares

37©2018 Kevin Jamieson

$$\hat{w}_r = \arg\min_w \sum_{i=1}^{n}\left(y_i - x_i^T w\right)^2 + \lambda\, r(w)$$

Ridge: $r(w) = \|w\|_2^2$    Lasso: $r(w) = \|w\|_1$

For any $\lambda \geq 0$ for which $\hat{w}_r$ achieves the minimum, there exists a $\nu \geq 0$ such that

$$\hat{w}_r = \arg\min_w \sum_{i=1}^{n}\left(y_i - x_i^T w\right)^2 \quad \text{subject to } r(w) \leq \nu$$

Page 38:

Penalized Least Squares

38©2018 Kevin Jamieson

$$\hat{w}_r = \arg\min_w \sum_{i=1}^{n}\left(y_i - x_i^T w\right)^2 + \lambda\, r(w)$$

Ridge: $r(w) = \|w\|_2^2$    Lasso: $r(w) = \|w\|_1$

For any $\lambda \geq 0$ for which $\hat{w}_r$ achieves the minimum, there exists a $\nu \geq 0$ such that

$$\hat{w}_r = \arg\min_w \sum_{i=1}^{n}\left(y_i - x_i^T w\right)^2 \quad \text{subject to } r(w) \leq \nu$$