TRANSCRIPT
©Kevin Jamieson 2018
Warm up
2
For each block compute the memory required in terms of n, p, d. If d << p << n, what is the most memory efficient program (blue, green, red)? If you have unlimited memory, what do you think is the fastest program?
1 float in NumPy = 8 bytes. 10^6 ≈ 2^20 bytes = 1 MB. 10^9 ≈ 2^30 bytes = 1 GB.
[Slide shows three code blocks (blue, green, red) annotated with handwritten memory computations]
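As a stand-in for the kind of accounting the warm-up asks for, here is a minimal NumPy sketch (the sizes n, d, p and the particular intermediate arrays are assumed placeholders, not the slide's blue/green/red programs):

```python
import numpy as np

# Assumed placeholder sizes with d << p << n
n, d, p = 10_000, 10, 1_000

X = np.random.randn(n, d)     # n*d floats -> 8*n*d bytes
Phi = np.random.randn(n, p)   # n*p floats -> 8*n*p bytes (dominates X when p >> d)
G = Phi.T @ Phi               # p*p floats -> 8*p*p bytes

for name, A in [("X (n x d)", X), ("Phi (n x p)", Phi), ("Phi^T Phi (p x p)", G)]:
    print(f"{name}: {A.nbytes / 2**20:.1f} MB")
```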
©2017 Kevin Jamieson 3
Regularization
Machine Learning – CSE546 Kevin Jamieson University of Washington
April 15, 2019
Ridge Regression
4
■ Old Least squares objective:
■ Ridge Regression objective:
©2017 Kevin Jamieson
Least squares:
$$\widehat{w}_{LS} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2$$

Ridge regression:
$$\widehat{w}_{ridge} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda \|w\|_2^2$$
Minimizing the Ridge Regression Objective
5©2017 Kevin Jamieson
$$\widehat{w}_{ridge} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda \|w\|_2^2$$

Setting the gradient to zero:
$$0 = \nabla_w \left( \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda \|w\|_2^2 \right)
= -\sum_{i=1}^n 2 x_i \left(y_i - x_i^T w\right) + 2\lambda w
= -2\left(\sum_{i=1}^n x_i y_i\right) + 2\left(\sum_{i=1}^n x_i x_i^T + \lambda I\right) w$$

$$\widehat{w}_{ridge} = \left(\sum_{i=1}^n x_i x_i^T + \lambda I\right)^{-1} \left(\sum_{i=1}^n x_i y_i\right) = (X^T X + \lambda I)^{-1} X^T y$$
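A minimal NumPy sketch of the closed-form solution above (X, y, and lam are assumed placeholders; np.linalg.solve is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Return (X^T X + lam*I)^{-1} X^T y without forming the inverse explicitly."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Example usage with synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.standard_normal(100)
print(ridge_closed_form(X, y, lam=1.0))
```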
Ridge Coefficient Path
■ Typical approach: select λ using cross validation, up next
6
From Kevin Murphy textbook
©2017 Kevin Jamieson
[Plot: ridge coefficient path as a function of 1/λ, for X^T X in general]
Bias-Variance Properties
7
■ Assume: $X^T X = nI$ and $y = Xw + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$
©2017 Kevin Jamieson
$$\widehat{w}_{ridge} = (X^T X + \lambda I)^{-1} X^T y$$
If $x \in \mathbb{R}^d$ and $Y \sim \mathcal{N}(x^T w, \sigma^2)$, what is $\mathbb{E}_{Y|x,\text{train}}\left[(Y - x^T \widehat{w}_{ridge})^2 \mid X = x\right]$?
$$\mathbb{E}_{Y|X,D}\left[(Y - x^T \widehat{w}_{ridge})^2 \mid X = x\right]$$
$$= \mathbb{E}_{Y|X}\left[(Y - \mathbb{E}_{Y|X}[Y \mid X = x])^2 \mid X = x\right] + \mathbb{E}_D\left[(\mathbb{E}_{Y|X}[Y \mid X = x] - x^T \widehat{w}_{ridge})^2\right]$$
$$= \mathbb{E}_{Y|X}\left[(Y - x^T w)^2 \mid X = x\right] + \mathbb{E}_D\left[(x^T w - x^T \widehat{w}_{ridge})^2\right]$$
$$= \sigma^2 + (x^T w - \mathbb{E}_D[x^T \widehat{w}_{ridge}])^2 + \mathbb{E}_D\left[(\mathbb{E}_D[x^T \widehat{w}_{ridge}] - x^T \widehat{w}_{ridge})^2\right]$$
$$= \underbrace{\sigma^2}_{\text{irreduc. error}} + \underbrace{\frac{\lambda^2}{(n+\lambda)^2}(w^T x)^2}_{\text{bias-squared}} + \underbrace{\frac{d\,\sigma^2 n}{(n+\lambda)^2}\,\|x\|_2^2}_{\text{variance}} \quad \text{(verify at home)}$$
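A small simulation sketch (synthetic sizes and a design with X^T X = nI are assumptions chosen to match the slide's setting) that estimates the bias-squared and variance of x^T ŵ_ridge over many draws of the training noise; it can be compared against the "(verify at home)" formula:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, sigma = 50, 5, 10.0, 0.5

# Design with X^T X = n I: orthonormal columns scaled by sqrt(n)
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
X = np.sqrt(n) * Q
w = rng.standard_normal(d)
x = rng.standard_normal(d)                     # fixed test point

preds = []
for _ in range(20_000):                        # many draws of the training noise
    y = X @ w + sigma * rng.standard_normal(n)
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    preds.append(x @ w_ridge)
preds = np.array(preds)

bias_sq = (x @ w - preds.mean()) ** 2          # (x^T w - E_D[x^T w_ridge])^2
variance = preds.var()                         # E_D[(x^T w_ridge - E_D[x^T w_ridge])^2]
print("irreducible:", sigma**2, " bias^2:", bias_sq, " variance:", variance)
```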
Ridge Regression: Effect of Regularization
8©2017 Kevin Jamieson
TRAIN error: $\frac{1}{|D|}\sum_{(x_i,y_i)\in D} \left(y_i - x_i^T \widehat{w}^{(\lambda)}_{D,ridge}\right)^2$, where $D \overset{\text{i.i.d.}}{\sim} P_{XY}$ and
$$\widehat{w}^{(\lambda)}_{D,ridge} = \arg\min_w \frac{1}{|D|}\sum_{(x_i,y_i)\in D} \left(y_i - x_i^T w\right)^2 + \lambda\|w\|_2^2$$
Ridge Regression: Effect of Regularization
9©2017 Kevin Jamieson
TRAIN error: $\frac{1}{|D|}\sum_{(x_i,y_i)\in D} \left(y_i - x_i^T \widehat{w}^{(\lambda)}_{D,ridge}\right)^2$
TEST error: $\frac{1}{|T|}\sum_{(x_i,y_i)\in T} \left(y_i - x_i^T \widehat{w}^{(\lambda)}_{D,ridge}\right)^2$
where $D \overset{\text{i.i.d.}}{\sim} P_{XY}$, $T \overset{\text{i.i.d.}}{\sim} P_{XY}$, and (important) $D \cap T = \emptyset$, with
$$\widehat{w}^{(\lambda)}_{D,ridge} = \arg\min_w \frac{1}{|D|}\sum_{(x_i,y_i)\in D} \left(y_i - x_i^T w\right)^2 + \lambda\|w\|_2^2$$
Ridge Regression: Effect of Regularization
10©2017 Kevin Jamieson
TRAIN and TEST error are defined as on the previous slide; here they are plotted as a function of the regularization strength:
[Plot: TRAIN and TEST error vs. 1/λ (large λ at left, small λ at right); each line is an i.i.d. draw of D or T]
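A sketch of the experiment behind such a plot (the synthetic data and λ grid are assumptions): fit ŵ^(λ)_{D,ridge} on a training draw D, then evaluate the train and test error over a grid of λ:

```python
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(X, y, w):
    return np.mean((y - X @ w) ** 2)

rng = np.random.default_rng(1)
n, d, sigma = 40, 30, 1.0
w_true = rng.standard_normal(d)
X_train = rng.standard_normal((n, d))
y_train = X_train @ w_true + sigma * rng.standard_normal(n)
X_test = rng.standard_normal((1000, d))
y_test = X_test @ w_true + sigma * rng.standard_normal(1000)

for lam in [0.01, 0.1, 1, 10, 100]:
    w = ridge_fit(X_train, y_train, lam)
    print(f"lam={lam:>6}: train={mse(X_train, y_train, w):.3f}  test={mse(X_test, y_test, w):.3f}")
```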
©2018 Kevin Jamieson 11
Cross-Validation
Machine Learning – CSE546 Kevin Jamieson University of Washington
October 9, 2016
How… How… How???????
12©2018 Kevin Jamieson
■ How do we pick the regularization constant λ…
■ How do we pick the number of basis functions…
■ We could use the test data, but…
How… How… How???????
■ How do we pick the regularization constant λ…
■ How do we pick the number of basis functions…
■ We could use the test data, but…
13©2018 Kevin Jamieson
■ Never ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever train on the test data
(LOO) Leave-one-out cross validation
■ Consider a validation set with 1 example:
  D – training data
  D\j – training data with the j-th data point (x_j, y_j) moved to the validation set
■ Learn classifier f_{D\j} with the D\j dataset
■ Estimate the true error as the squared error on predicting y_j:
  Unbiased estimate of error_true(f_{D\j})!
14©2018 Kevin Jamieson
(LOO) Leave-one-out cross validation
■ Consider a validation set with 1 example:
  D – training data
  D\j – training data with the j-th data point (x_j, y_j) moved to the validation set
■ Learn classifier f_{D\j} with the D\j dataset
■ Estimate the true error as the squared error on predicting y_j:
  Unbiased estimate of error_true(f_{D\j})!
■ LOO cross validation: Average over all data points j:
  For each data point you leave out, learn a new classifier f_{D\j}
  Estimate error as:
15©2018 Kevin Jamieson
$$\text{error}_{\text{LOO}} = \frac{1}{n}\sum_{j=1}^n \left(y_j - f_{D\backslash j}(x_j)\right)^2$$
LOO cross validation is an (almost) unbiased estimate of the true error of f_D!
■ When computing the LOOCV error, we only use N−1 data points
  So it's not an estimate of the true error of learning with N data points
  Usually pessimistic, though – learning with less data typically gives a worse answer
■ LOO is almost unbiased! Use the LOO error for model selection!!! E.g., picking λ
16©2018 Kevin Jamieson
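A minimal sketch of error_LOO for ridge regression, matching the formula above (synthetic data and a naive refit for each held-out point are assumptions):

```python
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def loo_error(X, y, lam):
    """error_LOO = (1/n) * sum_j (y_j - f_{D\\j}(x_j))^2, naively refitting n times."""
    n = X.shape[0]
    errs = []
    for j in range(n):
        mask = np.arange(n) != j          # leave point j out
        w = ridge_fit(X[mask], y[mask], lam)
        errs.append((y[j] - X[j] @ w) ** 2)
    return np.mean(errs)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ np.array([1.0, -1.0, 0.0, 2.0, 0.5]) + 0.3 * rng.standard_normal(50)
print(loo_error(X, y, lam=1.0))
```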
Computational cost of LOO
■ Suppose you have 100,000 data points
■ You implemented a great version of your learning algorithm: it learns in only 1 second
■ Computing LOO will take about 1 day!!!
17©2018 Kevin Jamieson
Use k-fold cross validation
■ Randomly divide training data into k equal parts D1,…,Dk
■ For each i:
  Learn classifier f_{D\D_i} using the data points not in D_i
  Estimate the error of f_{D\D_i} on validation set D_i:
18©2018 Kevin Jamieson
$$\text{error}_{D_i} = \frac{1}{|D_i|}\sum_{(x_j,y_j)\in D_i} \left(y_j - f_{D\backslash D_i}(x_j)\right)^2$$
Use k-fold cross validation
■ Randomly divide training data into k equal parts D_1, …, D_k
■ For each i:
  Learn classifier f_{D\D_i} using the data points not in D_i
  Estimate the error of f_{D\D_i} on validation set D_i:
  $$\text{error}_{D_i} = \frac{1}{|D_i|}\sum_{(x_j,y_j)\in D_i} \left(y_j - f_{D\backslash D_i}(x_j)\right)^2$$
■ k-fold cross validation error is the average over the data splits:
  $$\text{error}_{k\text{-fold}} = \frac{1}{k}\sum_{i=1}^k \text{error}_{D_i}$$
■ k-fold cross validation properties:
  Much faster to compute than LOO
  More (pessimistically) biased – using much less data, only n(k−1)/k
  Usually, k = 10
19©2018 Kevin Jamieson
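A minimal k-fold sketch matching the error_{D_i} formula above (synthetic data is assumed; ridge_fit is the closed-form solver from earlier, repeated here so the snippet is self-contained):

```python
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def kfold_error(X, y, lam, k=10, seed=0):
    """Average validation error over k random, roughly equal splits."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    fold_errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)               # D \ D_i
        w = ridge_fit(X[train], y[train], lam)
        fold_errs.append(np.mean((y[fold] - X[fold] @ w) ** 2))
    return np.mean(fold_errs)

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.standard_normal(200)

# Pick lambda by 10-fold cross validation over a small grid
best = min([0.01, 0.1, 1, 10, 100], key=lambda lam: kfold_error(X, y, lam))
print("lambda chosen by 10-fold CV:", best)
```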
Recap
■ Given a dataset, begin by splitting into TRAIN and TEST
■ Model selection: Use k-fold cross-validation on TRAIN to train predictor and choose magic parameters such as λ
■ Model assessment: Use TEST to assess the accuracy of the model you output
■ Never ever ever ever ever train or choose parameters based on the test data
20©2018 Kevin Jamieson
[Diagram: dataset split into TRAIN and TEST; TRAIN further divided into folds TRAIN-1/VAL-1, TRAIN-2/VAL-2, TRAIN-3/VAL-3]
Example
■ Given 10,000-dimensional data and n examples, we pick a subset of 50 dimensions that have the highest correlation with labels in the training set:
■ After picking our 50 features, we then use CV to train ridge regression with regularization λ
■ What’s wrong with this procedure?
21©2018 Kevin Jamieson
The 50 indices $j$ that have the largest $\dfrac{\left|\sum_{i=1}^n x_{i,j}\, y_i\right|}{\sqrt{\sum_{i=1}^n x_{i,j}^2}}$
Recap
■ Learning is…
  Collect some data
    ■ E.g., housing info and sale price
  Randomly split dataset into TRAIN, VAL, and TEST
    ■ E.g., 80%, 10%, and 10%, respectively
  Choose a hypothesis class or model
    ■ E.g., linear with non-linear transformations
  Choose a loss function
    ■ E.g., least squares with ridge regression penalty on TRAIN
  Choose an optimization procedure
    ■ E.g., set derivative to zero to obtain estimator, cross-validation on VAL to pick num. features and amount of regularization
  Justify the accuracy of the estimate
    ■ E.g., report TEST error
22©2018 Kevin Jamieson
Warm up
1©2018 Kevin Jamieson
Fix any a, b, c > 0.
1. What is the $x \in \mathbb{R}$ that minimizes $ax^2 + bx + c$?
2. What is the $x \in \mathbb{R}$ that minimizes $\max\{-ax + b,\; cx\}$?
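For reference, a worked sketch of the two answers (not shown on the slide):

```latex
% 1. For a > 0, ax^2 + bx + c is convex; setting the derivative to zero:
%    2ax + b = 0  =>  x = -b / (2a).
% 2. -ax + b is decreasing and cx is increasing, so max{-ax + b, cx} is
%    minimized where the two lines cross: -ax + b = cx  =>  x = b / (a + c).
\[
  x^\star_1 = -\frac{b}{2a},
  \qquad
  x^\star_2 = \frac{b}{a + c}.
\]
```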
©2018 Kevin Jamieson 23
Simple Variable Selection
LASSO: Sparse Regression
Machine Learning – CSE546 Kevin Jamieson University of Washington
October 9, 2016
Sparsity
■ Vector w is sparse if many entries are zero
24©2018 Kevin Jamieson
$$\widehat{w}_{LS} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2$$
Sparsity
■ Vector w is sparse if many entries are zero
■ Efficiency: If size(w) = 100 billion, each prediction is expensive:
  ■ If w is sparse, the prediction computation only depends on the number of non-zeros
25©2018 Kevin Jamieson
$$\widehat{w}_{LS} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2$$
$$\hat{y}_i = \widehat{w}_{LS}^{\,\top} x_i = \sum_{j=1}^d x_i[j]\, \widehat{w}_{LS}[j] = \sum_{j:\, \widehat{w}_j \neq 0} \widehat{w}_j\, h_j(x_i)$$
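A small sketch of why sparsity makes prediction cheap (the toy sizes are assumptions): with a sparse w, the dot product only needs to touch the nonzero coordinates:

```python
import numpy as np

d = 1_000_000
rng = np.random.default_rng(0)

w = np.zeros(d)
nonzero = rng.choice(d, size=10, replace=False)   # only 10 active features
w[nonzero] = rng.standard_normal(10)

x = rng.standard_normal(d)

# Dense prediction touches all d entries...
y_dense = x @ w
# ...but with a sparse w we only need the nonzero coordinates.
y_sparse = x[nonzero] @ w[nonzero]
print(np.isclose(y_dense, y_sparse))
```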
Sparsity
■ Vector w is sparse if many entries are zero
■ Interpretability: What are the relevant dimensions to make a prediction?
26©2018 Kevin Jamieson
$$\widehat{w}_{LS} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2$$
■ How do we find “best” subset among all possible?
[Figure: house listing with price "$?" and candidate features: lot size, single family, year built, last sold price, last sale price/sqft, finished sqft, unfinished sqft, finished basement sqft, # floors, flooring types, parking type, parking amount, cooling, heating, exterior materials, roof type, structure style, dishwasher, garbage disposal, microwave, range/oven, refrigerator, washer, dryer, laundry location, heating type, jetted tub, deck, fenced yard, lawn, garden, sprinkler system]
Finding best subset: Exhaustive
■ Try all subsets of size 1, 2, 3, … and choose the one that minimizes validation error
■ Problem?
27©2018 Kevin Jamieson
Finding best subset: Greedy
Forward stepwise: Start from a simple model and iteratively add the features most useful to the fit
Backward stepwise: Start with the full model and iteratively remove the features least useful to the fit
Combining forward and backward steps: In the forward algorithm, insert steps to remove features that are no longer as important
Lots of other variants, too.
28©2018 Kevin Jamieson
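A minimal sketch of forward stepwise selection for least squares (synthetic data and a fixed stopping size are assumptions): greedily add the column that most reduces the training residual:

```python
import numpy as np

def forward_stepwise(X, y, num_features):
    """Greedily add the column that most reduces the least-squares residual."""
    selected = []
    for _ in range(num_features):
        best_j, best_rss = None, np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            cols = selected + [j]
            w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ w) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = 3 * X[:, 4] - 2 * X[:, 7] + 0.1 * rng.standard_normal(100)
print(forward_stepwise(X, y, num_features=2))   # typically recovers columns 4 and 7
```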
Finding best subset: Regularize
Ridge regression makes coefficients small
29©2018 Kevin Jamieson
$$\widehat{w}_{ridge} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda\|w\|_2^2$$
Finding best subset: Regularize
Ridge regression makes coefficients small
30©2018 Kevin Jamieson
$$\widehat{w}_{ridge} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda\|w\|_2^2$$
[Plot: ridge coefficient path as a function of 1/λ, from Kevin Murphy textbook]
Thresholded Ridge Regression
31©2018 Kevin Jamieson
$$\widehat{w}_{ridge} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda\|w\|_2^2$$
Why don’t we just set small ridge coefficients to 0?
32©2018 Kevin Jamieson
$$\widehat{w}_{ridge} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda\|w\|_2^2$$
Consider two related features (bathrooms, showers)
Thresholded Ridge Regression
33©2018 Kevin Jamieson
$$\widehat{w}_{ridge} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda\|w\|_2^2$$
What if we didn’t include showers? Weight on bathrooms increases!
Thresholded Ridge Regression
Can another regularizer perform selection automatically?
Recall Ridge Regression
34
■ Ridge Regression objective:
©2018 Kevin Jamieson
$$\widehat{w}_{ridge} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda\|w\|_2^2$$
$$\|w\|_p = \left(\sum_{i=1}^d |w_i|^p\right)^{1/p}$$
Ridge vs. Lasso Regression
35
■ Ridge Regression objective:
■ Lasso objective:
©2018 Kevin Jamieson
Ridge:
$$\widehat{w}_{ridge} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda\|w\|_2^2$$
Lasso:
$$\widehat{w}_{lasso} = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda\|w\|_1$$
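To see the selection effect, a short scikit-learn sketch comparing how many coefficients each method zeroes out (the synthetic data and the alpha values are assumptions; note that sklearn's Lasso scales the squared loss by 1/(2n), so its alpha is not the same λ as in the objective above):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, d = 100, 50
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:5] = [3, -2, 1.5, -1, 0.5]            # only 5 truly relevant features
y = X @ w_true + 0.5 * rng.standard_normal(n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge nonzeros:", np.sum(ridge.coef_ != 0))   # typically all d
print("lasso nonzeros:", np.sum(lasso.coef_ != 0))   # typically close to 5
```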
Penalized Least Squares
36©2018 Kevin Jamieson
$$\widehat{w}_r = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda\, r(w)$$
Ridge: $r(w) = \|w\|_2^2$    Lasso: $r(w) = \|w\|_1$
Penalized Least Squares
37©2018 Kevin Jamieson
$$\widehat{w}_r = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda\, r(w)$$
Ridge: $r(w) = \|w\|_2^2$    Lasso: $r(w) = \|w\|_1$
For any $\lambda \ge 0$ for which $\widehat{w}_r$ achieves the minimum, there exists a $\nu \ge 0$ such that
$$\widehat{w}_r = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 \quad \text{subject to } r(w) \le \nu$$
Penalized Least Squares
38©2018 Kevin Jamieson
$$\widehat{w}_r = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 + \lambda\, r(w)$$
Ridge: $r(w) = \|w\|_2^2$    Lasso: $r(w) = \|w\|_1$
For any $\lambda \ge 0$ for which $\widehat{w}_r$ achieves the minimum, there exists a $\nu \ge 0$ such that
$$\widehat{w}_r = \arg\min_w \sum_{i=1}^n \left(y_i - x_i^T w\right)^2 \quad \text{subject to } r(w) \le \nu$$