CSE/STAT 416: Intro to Machine Learning
CSE 416: Machine Learning, Emily Fox, University of Washington, April 12, 2018
©2018 Emily Fox
Lasso Regression: Regularization for feature selection
Symptom of overfitting
Overfitting is often associated with very large estimated parameters ŵ.

[Figure: price ($) vs. square feet (sq.ft.), showing data points y and an overfit high-order fit f_ŵ, labeled OVERFIT.]
[Diagram: ML pipeline. Training Data → Feature extraction h(x) → ML model → prediction ŷ; the quality metric and ML algorithm use the observed y to produce the estimate ŵ.]
Consider specific total cost
Total cost = measure of fit + measure of magnitude of coefficients
- measure of fit: RSS(w)
- measure of magnitude: ||w||₂²

Want to balance:
i. How well the function fits the data
ii. Magnitude of coefficients
Consider resulting objective
What if ŵ is selected to minimize

RSS(w) + λ||w||₂²

λ: tuning parameter = balance of fit and magnitude

Ridge regression (a.k.a. L2 regularization)
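As a sketch of this objective (synthetic data, numpy only, no intercept term; all names are hypothetical), ridge has the closed-form minimizer ŵ = (XᵀX + λI)⁻¹Xᵀy, and increasing λ visibly shrinks the estimated coefficients:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize RSS(w) + lam * ||w||_2^2 via the closed form (X'X + lam*I)^-1 X'y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ w_true + 0.1 * rng.normal(size=100)

w_small = ridge_fit(X, y, lam=0.1)    # mild penalty: close to least squares
w_large = ridge_fit(X, y, lam=1e4)    # heavy penalty: coefficients shrunk hard
# ||w_large|| is much smaller than ||w_small||, but no entry is exactly 0.
```

Note that even under a heavy penalty the coefficients become tiny without hitting exactly zero, which is the point the later slides return to.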
Measure of magnitude of regression coefficients

What summary number is indicative of the size of the regression coefficients?
- Sum? (positive and negative values cancel)
- Sum of absolute values? (L1 norm)
- Sum of squares? (L2 norm)
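A tiny worked example (hypothetical weights) of why the plain sum is a poor summary: large coefficients of opposite sign cancel, while the L1 and L2 norms do not:

```python
import numpy as np

w = np.array([1000.0, -1000.0])      # two huge coefficients of opposite sign

plain_sum = float(np.sum(w))         # 0.0: the signs cancel, hiding the magnitude
l1_norm = float(np.sum(np.abs(w)))   # 2000.0: sum of absolute values
l2_sq = float(np.sum(w ** 2))        # 2000000.0: sum of squares
```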
Feature selection task
Why might you want to perform feature selection?

Efficiency:
- If size(w) = 100B, each prediction is expensive
- If ŵ is sparse (many zeros), the computation depends only on the # of non-zeros:
  ŷᵢ = Σ_{ŵⱼ ≠ 0} ŵⱼ hⱼ(xᵢ)

Interpretability:
- Which features are relevant for prediction?
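A small sketch (made-up sparse weights) of why sparsity pays off at prediction time: the sum only needs the nonzero entries of ŵ:

```python
import numpy as np

D = 1000
w_hat = np.zeros(D)
w_hat[[3, 17, 256]] = [2.0, -1.5, 0.7]    # hypothetical selected features

x = np.linspace(0.0, 1.0, D)              # feature values h_j(x_i) for one example

nz = np.flatnonzero(w_hat)                # indices j with w_j != 0
y_sparse = float(w_hat[nz] @ x[nz])       # O(# non-zeros) work
y_dense = float(w_hat @ x)                # O(D) work; same answer
```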
Sparsity: Housing application
Predict the sale price ($?) of a house from a long list of candidate features:
Lot size, Single Family, Year built, Last sold price, Last sale price/sqft, Finished sqft, Unfinished sqft, Finished basement sqft, # floors, Flooring types, Parking type, Parking amount, Cooling, Heating, Exterior materials, Roof type, Structure style, Dishwasher, Garbage disposal, Microwave, Range / Oven, Refrigerator, Washer, Dryer, Laundry location, Heating type, Jetted Tub, Deck, Fenced Yard, Lawn, Garden, Sprinkler System, …
Sparsity: Reading your mind
Activity in which brain regions can predict happiness?

[Figure: brain activity map, colored on a scale from very sad to very happy.]
Option 1: All subsets or greedy variants
Find best model of size: 0

[Plot: RSS(ŵ) of the best model vs. # of features, starting at 0 features.]

Candidate features: # bedrooms, # bathrooms, sq.ft. living, sq.ft. lot, floors, year built, year renovated, waterfront
Find best model of size: 1

[Plot: RSS(ŵ) vs. # of features at sizes 0 and 1; each candidate single-feature model is fit in turn and the one with lowest RSS is kept.]
Find best model of size: 2

[Plot: RSS(ŵ) vs. # of features at sizes 0, 1, 2, searched over the same candidate features.]
Note: Not necessarily nested!

[Plot: the best model of size 2 need not contain the feature chosen in the best model of size 1.]
Find best model of size: 3, 4, 5, 6, 7, and finally 8 (all features)

[Plot: RSS(ŵ) of the best model of each size decreases as the # of features grows from 0 to 8; candidates: # bedrooms, # bathrooms, sq.ft. living, sq.ft. lot, floors, year built, year renovated, waterfront.]
Choosing model complexity?
Option 1: Assess on validation set
Option 2: Cross validation
Option 3+: Other metrics for penalizing model complexity, like BIC…
Complexity of “all subsets”
How many models were evaluated? Each is indexed by the features included:

yᵢ = εᵢ                                          [0 0 0 … 0 0 0]
yᵢ = w₀h₀(xᵢ) + εᵢ                               [1 0 0 … 0 0 0]
yᵢ = w₁h₁(xᵢ) + εᵢ                               [0 1 0 … 0 0 0]
yᵢ = w₀h₀(xᵢ) + w₁h₁(xᵢ) + εᵢ                    [1 1 0 … 0 0 0]
…
yᵢ = w₀h₀(xᵢ) + w₁h₁(xᵢ) + … + w_D h_D(xᵢ) + εᵢ  [1 1 1 … 1 1 1]

In total: 2^D models.
2^8 = 256; 2^30 = 1,073,741,824; 2^1000 ≈ 1.071509 × 10^301; 2^100B = HUGE!!!!!!

Typically, computationally infeasible.
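The counts above are just powers of two, and a one-liner confirms them (assuming one bit per feature, D features):

```python
# One candidate model per subset of D features => 2**D models in total.
counts = {D: 2 ** D for D in (8, 30, 1000)}
# counts[8] == 256, counts[30] == 1,073,741,824, and counts[1000] has 302 digits,
# matching 1.071509 x 10^301 -- enumeration is hopeless long before D = 100B.
```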
Greedy algorithms
Forward stepwise: Start from a simple model and iteratively add the feature most useful to the fit.

Backward stepwise: Start with the full model and iteratively remove the feature least useful to the fit.

Combining forward and backward steps: In the forward algorithm, insert steps to remove features that are no longer as important.

Lots of other variants, too.
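A minimal sketch of forward stepwise selection on synthetic data (`rss` and `forward_stepwise` are hypothetical helper names; real implementations also add a stopping rule):

```python
import numpy as np

def rss(X, y, cols):
    """RSS of the least squares fit restricted to the given feature columns."""
    if not cols:
        return float(y @ y)
    Xs = X[:, cols]
    w, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ w
    return float(r @ r)

def forward_stepwise(X, y, k):
    """Greedily add, k times, the feature whose inclusion most reduces RSS."""
    selected = []
    for _ in range(k):
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        best = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best)
    return selected

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = 4.0 * X[:, 2] - 3.0 * X[:, 5] + 0.1 * rng.normal(size=200)

order = forward_stepwise(X, y, k=2)   # recovers features 2 and 5, strongest first
```

Each of the k passes fits at most D models, so the greedy search costs O(kD) fits instead of the 2^D of all subsets.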
Option 2: Regularize
Ridge regression: L2 regularized regression
Total cost = measure of fit + λ · measure of magnitude of coefficients
- measure of fit: RSS(w)
- measure of magnitude: ||w||₂² = w₀² + … + w_D²

Encourages small weights, but not exactly 0
Coefficient path – ridge

[Plot: coefficient paths ŵⱼ vs. λ for ridge (bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, yr_renovat, waterfront); as λ grows from 0 to 200,000, every coefficient shrinks smoothly toward 0 without ever reaching exactly 0.]
Using regularization for feature selection
Instead of searching over a discrete set of solutions, can we use regularization?
- Start with the full model (all possible features)
- “Shrink” some coefficients exactly to 0 (i.e., knock out certain features)
- Non-zero coefficients indicate “selected” features
Thresholding ridge coefficients?
Why don’t we just set small ridge coefficients to 0?
[Bar chart: ridge coefficient values, relative to 0, for # bedrooms, # bathrooms, sq.ft. living, sq.ft. lot, floors, year built, year renovated, last sales price, cost per sq.ft., heating, # showers, waterfront.]
Selected features for a given threshold value

[Bar chart: the same ridge coefficients with a threshold band around 0; only features whose coefficients exceed the threshold would be selected.]
Let’s look at two related features…

[Bar chart: the coefficients for # bathrooms and # showers each fall below the threshold.]

Nothing measuring bathrooms was included!
If only one of the features had been included…

[Bar chart: with # showers removed, the coefficient on # bathrooms is much larger and clears the threshold.]
Would have included bathrooms in the selected model.

[Bar chart: same features; # bathrooms now exceeds the threshold.]
Can regularization lead directly to sparsity?
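The bathrooms/showers story can be reproduced numerically. In this sketch (synthetic stand-ins for the features; ridge fit in closed form without an intercept), two nearly duplicate features split one effect, so each coefficient alone looks small even though their sum is large:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300
bath = rng.normal(size=n)
shower = bath + 0.05 * rng.normal(size=n)   # near-duplicate of "bathrooms"
other = rng.normal(size=n)                  # an unrelated feature
X = np.column_stack([bath, shower, other])
y = 2.0 * bath + 3.0 * other + 0.1 * rng.normal(size=n)

# Ridge fit (closed form, no intercept).
lam = 50.0
w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
# w[0] and w[1] each carry roughly half the bathroom effect, so both can fall
# below a threshold that w[2] clears -- even though w[0] + w[1] is large.
```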
Try this cost instead of ridge…
Total cost = measure of fit + λ · measure of magnitude of coefficients
- measure of fit: RSS(w)
- measure of magnitude: ||w||₁ = |w₀| + … + |w_D|

Lasso regression (a.k.a. L1 regularized regression)
Leads to sparse solutions!
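A quick sketch with scikit-learn on synthetic data (its `alpha` plays the role of λ, under a slightly different scaling of the objective): lasso zeros out irrelevant coefficients exactly, while ridge merely shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:3] = [5.0, -4.0, 3.0]               # only the first 3 features matter
y = X @ w_true + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))   # several exact zeros
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))   # none: ridge only shrinks
```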
Lasso regression: L1 regularized regression
Just like ridge regression, the solution is governed by a continuous tuning parameter λ in

RSS(w) + λ||w||₁

λ balances fit and sparsity:
- If λ = 0: ŵ is the unregularized least squares solution ŵ_LS
- If λ = ∞: ŵ = 0
- If λ in between: 0 ≤ ||ŵ||₁ ≤ ||ŵ_LS||₁
Coefficient path – ridge

[Plot, repeated for comparison: ridge coefficient paths ŵⱼ vs. λ; all eight coefficients shrink smoothly toward 0 as λ grows, without hitting 0.]
Coefficient path – lasso

[Plot: lasso coefficient paths ŵⱼ vs. λ (bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, yr_renovat, waterfront); as λ grows from 0 to 200,000, the coefficients hit exactly 0 one by one, so larger λ gives a sparser model.]
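The path behavior can be summarized numerically without a plot. In this sketch (synthetic data; sklearn's `alpha` standing in for λ), the number of surviving coefficients falls to zero as the penalty grows:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 8))
w_true = np.array([6.0, -5.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=120)

nonzeros = []
for alpha in (0.01, 0.1, 1.0, 10.0):        # sweep the penalty upward
    m = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    nonzeros.append(int(np.sum(m.coef_ != 0.0)))
# nonzeros is non-increasing, ending at 0 once the penalty dominates the data.
```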
Revisit polynomial fit demo
What happens if we refit our high-order polynomial, but now using lasso regression?

Will consider a few settings of λ…
How to choose λ: Cross validation
If sufficient amount of data…
[Diagram: data split into Training set | Validation set | Test set. Fit ŵλ on the training set; test performance of ŵλ on the validation set to select λ*; assess the generalization error of ŵλ* on the test set.]
Start with a smallish dataset

[Diagram: all available data as a single block.]
Still form test set and hold out
[Diagram: data split into Rest of data | Test set.]
How do we use the other data?
Use the rest of the data for both training and validation, but not so naively.
Recall naïve approach
Is a small validation set enough to compare the performance of ŵλ across λ values? No.

[Diagram: Training set | small Valid. set.]
Choosing the validation set
We didn’t have to use the last tabulated data points to form the validation set; any data subset can be used.

[Diagram: a small Valid. set taken from elsewhere in the data.]
Choosing the validation set
Which subset should I use? ALL of them! Average performance over all choices.
K-fold cross validation

Preprocessing: Randomly assign the data to K groups, each of size N/K
(use the same split of data for all other steps)

[Diagram: Rest of data divided into K equal blocks of N/K points each.]
K-fold cross validation

For k = 1, …, K:
1. Estimate ŵλ(k) on the training blocks
2. Compute error on the validation block: errorₖ(λ)

[Diagram: block 1 held out as the Valid set; fit ŵλ(1) and compute error₁(λ).]
[The same two steps repeat with each block held out in turn: ŵλ(2) and error₂(λ), ŵλ(3) and error₃(λ), ŵλ(4) and error₄(λ), …]
For k = 1, …, K:
1. Estimate ŵλ(k) on the training blocks
2. Compute error on the validation block: errorₖ(λ)

[Diagram: block 5 held out; fit ŵλ(5) and compute error₅(λ).]

Compute average error: CV(λ) = (1/K) Σₖ₌₁ᴷ errorₖ(λ)
Repeat the procedure for each choice of λ.
Choose λ* to minimize CV(λ).
What value of K?
Formally, the best approximation occurs for validation sets of size 1 (K = N): leave-one-out cross validation.

But this is computationally intensive: it requires computing N fits of the model per λ.

Typically, K = 5 or 10 (5-fold or 10-fold CV).
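The whole procedure fits in a few lines. A sketch on synthetic data (`cv_error` is a hypothetical helper; sklearn's `alpha` stands in for λ):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 6))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0]) + 0.5 * rng.normal(size=150)

def cv_error(alpha, K=5):
    """CV(lambda): average validation MSE over the K held-out blocks."""
    errs = []
    for train, valid in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
        m = Lasso(alpha=alpha, max_iter=10000).fit(X[train], y[train])
        r = y[valid] - m.predict(X[valid])
        errs.append(float(np.mean(r ** 2)))
    return float(np.mean(errs))

alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
best_alpha = min(alphas, key=cv_error)    # lambda* minimizing CV(lambda)
```

Note the same K-group split is reused for every candidate λ, as the preprocessing slide requires.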
Choosing λ via cross validation for lasso
Cross validation chooses the λ that provides the best predictive accuracy.

This tends to favor less sparse solutions, and thus smaller λ, than the optimal choice for feature selection.

cf. “Machine Learning: A Probabilistic Perspective”, Murphy, 2012 for further discussion.
Practical concerns with lasso
Debiasing lasso

Lasso shrinks coefficients relative to the LS solution → more bias, less variance.

Can reduce bias as follows:
1. Run lasso to select features
2. Run least squares regression with only the selected features

“Relevant” features are then no longer shrunk relative to the LS fit of the same reduced model.

[Figure, used with permission of Mario Figueiredo (captions modified to fit course): True coefficients (D = 4096, non-zero = 160); L1 reconstruction (non-zero = 1024, MSE = 0.0072); Debiased (non-zero = 1024, MSE = 3.26e-005); Least squares (non-zero = 0, MSE = 1.568).]
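The two-step debiasing recipe, sketched on synthetic data (variable names hypothetical): lasso picks the support, then plain least squares on that support removes the shrinkage:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[[0, 5]] = [4.0, -3.0]                # true sparse coefficients
y = X @ w_true + 0.1 * rng.normal(size=200)

# Step 1: lasso selects features (its nonzero coefficients are shrunk).
lasso = Lasso(alpha=0.5).fit(X, y)
selected = np.flatnonzero(lasso.coef_)

# Step 2: unpenalized least squares on only the selected features.
ols = LinearRegression().fit(X[:, selected], y)

shrunk_err = float(np.abs(lasso.coef_[selected] - w_true[selected]).max())
debiased_err = float(np.abs(ols.coef_ - w_true[selected]).max())
# debiased_err is far smaller: the refit no longer shrinks the kept features.
```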
Issues with the standard lasso objective:
1. With a group of highly correlated features, lasso tends to select among them arbitrarily; often we would prefer to select them all together.
2. Empirically, ridge often has better predictive performance than lasso, but lasso leads to a sparser solution.

Elastic net aims to address these issues:
- a hybrid between lasso and ridge regression
- uses both L1 and L2 penalties

See Zou & Hastie ’05 for further discussion.
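A sketch of issue 1 and the elastic-net fix (synthetic data; sklearn's `ElasticNet` mixes the L1 and L2 penalties via `l1_ratio`): given two nearly identical features, the L2 part of elastic net keeps both with similar weights, whereas lasso's choice between them is essentially arbitrary:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(5)
n = 200
z = rng.normal(size=n)
x1 = z + 0.01 * rng.normal(size=n)          # two nearly identical copies
x2 = z + 0.01 * rng.normal(size=n)          # of the same underlying signal
X = np.column_stack([x1, x2, rng.normal(size=(n, 3))])
y = 3.0 * z + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)          # tends to load one copy arbitrarily
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
# The L2 penalty couples the correlated pair: enet gives both copies
# similar, nonzero weights that together carry the full effect.
```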
Summary for feature selection and lasso regression
Impact of feature selection and lasso
Lasso has changed machine learning, statistics, & electrical engineering.

But, for feature selection in general, be careful about interpreting the selected features:
- selection only considers the features included
- it is sensitive to correlations between features
- the result depends on the algorithm used
- there are theoretical guarantees for lasso under certain conditions
What you can do now…
• Describe “all subsets” and greedy variants for feature selection
• Analyze computational costs of these algorithms
• Formulate the lasso objective
• Describe what happens to estimated lasso coefficients as the tuning parameter λ is varied
• Interpret a lasso coefficient path plot
• Contrast ridge and lasso regression
• Implement K-fold cross validation to select the lasso tuning parameter λ