Online Linear Learning
John Langford, Large Scale Machine Learning Class, January 29
To follow along:
git clone git://github.com/JohnLangford/vowpal_wabbit.git
wget http://hunch.net/~jl/rcv1.train.txt.gz
(Post-presentation updated version)
Linear Learning
Features: a vector x ∈ Rⁿ
Label: y ∈ R
Goal: learn w ∈ Rⁿ such that y_w(x) = ∑_i w_i x_i is close to y.
Linear Learning
Features: a vector x ∈ Rⁿ
Label: y ∈ {−1, 1}
Goal: learn w ∈ Rⁿ such that y_w(x) = ∑_i w_i x_i is close to y.
Online Linear Learning
Start with w_i = 0 for all i. Repeatedly:
1. Get features x ∈ Rⁿ.
2. Make linear prediction y_w(x) = ∑_i w_i x_i.
3. Observe label y ∈ [0, 1].
4. Update weights so y_w(x) is closer to y.
Example: w_i ← w_i + η(y − y_w(x))x_i.
Online Linear Learning
Start with w_i = 0 for all i. Repeatedly:
1. Get features x ∈ Rⁿ.
2. Make logistic prediction y_w(x) = 1 / (1 + e^(−∑_i w_i x_i)).
3. Observe label y ∈ [0, 1].
4. Update weights so y_w(x) is closer to y.
Example: w_i ← w_i + η(y − y_w(x))x_i.
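As a concrete illustration, here is a minimal sketch of this loop in plain Python (a toy, not VW's implementation; the `logistic` flag is just a switch between the two prediction rules above):

```python
import math

def online_learn(examples, eta=0.1, logistic=False):
    """examples: iterable of (x, y) with x a list of floats, y in [0, 1]."""
    w = None
    for x, y in examples:
        if w is None:
            w = [0.0] * len(x)                    # start with all w_i = 0
        score = sum(wi * xi for wi, xi in zip(w, x))
        pred = 1.0 / (1.0 + math.exp(-score)) if logistic else score
        for i, xi in enumerate(x):                # move prediction toward y
            w[i] += eta * (y - pred) * xi
    return w

# Toy usage: feature 0 signals label 1, feature 1 signals label 0.
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)] * 50
print(online_learn(data, logistic=True))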
An Example: The RCV1 dataset
Pick whether a document is in category CCAT or not.
Dataset size: 781K examples, 60M nonzero features, 1.1GB.
Format: label | sparse features ...
1 | 13:3.9656971e-02 24:3.4781646e-02 ...
which corresponds to:
1 | tuesday year ...
Command: time vw --sgd rcv1.train.txt -c
takes 1-3 seconds on my laptop.
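A rough sketch of reading this format in Python (simplified: it ignores the namespaces, tags, and importance weights that the full VW format supports):

```python
def parse_vw_line(line):
    """Parse 'label | feature:value feature ...' into (label, dict)."""
    label_part, _, feature_part = line.partition("|")
    label = float(label_part)
    features = {}
    for tok in feature_part.split():
        name, _, val = tok.partition(":")
        features[name] = float(val) if val else 1.0  # bare names get value 1
    return label, features

print(parse_vw_line("1 | 13:3.9656971e-02 24:3.4781646e-02"))
```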
Reasons for Online Learning
1. Fast convergence to a good predictor.
2. It's RAM efficient: you need to store only one example in RAM rather than all of them. ⇒ Entirely new scales of data are possible.
3. Online learning algorithm = online optimization algorithm. Online learning algorithms ⇒ the ability to solve entirely new categories of applications.
4. Online learning = the ability to deal with drifting distributions.
Defining updates
1. Define a loss function L(y_w(x), y).
2. Update according to w_i ← w_i − η ∂L(y_w(x), y)/∂w_i.
Here η is the learning rate.
[Plot "common loss functions": 0/1, squared, logistic, quantile, and hinge loss vs. the prediction when y = 1.]
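A sketch of this recipe in Python. The table of derivatives is written with respect to the prediction p = y_w(x); as an illustrative assumption, the quantile entry fixes τ = 0.5, and the logistic and hinge entries take y ∈ {−1, +1} as in the plot:

```python
import math

# dL/dp for prediction p and label y
loss_grad = {
    "squared":  lambda p, y: 2.0 * (p - y),
    "logistic": lambda p, y: -y / (1.0 + math.exp(y * p)),  # y in {-1, +1}
    "hinge":    lambda p, y: -y if y * p < 1.0 else 0.0,    # y in {-1, +1}
    "quantile": lambda p, y: 0.5 if p > y else -0.5,        # tau = 0.5
}

def update(w, x, y, eta, loss="squared"):
    p = sum(wi * xi for wi, xi in zip(w, x))
    g = loss_grad[loss](p, y)
    # chain rule: dL/dw_i = (dL/dp) * x_i
    return [wi - eta * g * xi for wi, xi in zip(w, x)]
```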
Know your loss function semantics
1. What is a typical price for a house? Quantile: minimizer = median.
2. What is the expected return on a stock? Squared: minimizer = expectation.
3. What is the probability of a click on an ad? Logistic: minimizer = probability.
4. Is the digit a 1? Hinge: closest 0/1 approximation.
5. What do you really care about? Often 0/1.
The proof for quantile regression
Consider the conditional probability distribution D(y|x).
[Plot: a probability density over y given x.]
The proof for quantile regression
Discretize it into equal-mass bins.
[Plot: the same density, discretized into equal-mass bins ("discretized pdf").]
The proof for quantile regression
Where is the absolute value minimized?
[Plot: the discretized density, with a pair of equal-mass bins marked ("A Pair").]
The proof for quantile regression
Note that the minimizer = the discrete median, and take the limit as the bin mass goes to 0.
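A quick numerical check of this argument (on a made-up sample): grid-minimize the average absolute and squared losses over constant predictions and compare with the sample median and mean.

```python
import numpy as np

ys = np.array([0.0, 0.1, 0.2, 0.4, 3.0])   # a skewed sample
grid = np.linspace(-1.0, 4.0, 5001)

abs_best = grid[np.argmin([np.mean(np.abs(ys - c)) for c in grid])]
sq_best = grid[np.argmin([np.mean((ys - c) ** 2) for c in grid])]

print(abs_best, np.median(ys))  # absolute-loss minimizer ~ median = 0.2
print(sq_best, np.mean(ys))     # squared-loss minimizer ~ mean = 0.74
```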
rcv1 with different loss functions
All the common loss functions are sound for binary classification, so which is best is an empirical choice.
vw --sgd rcv1.train.txt -c --loss_function hinge
vw --sgd rcv1.train.txt -c --loss_function logistic
vw --sgd rcv1.train.txt -c --loss_function quantile
How do you know when you succeed?
Standard answer: train on the train set, test on the test set. But can we do better?
Progressive Validation
On timestep t, let l_t = L(y_{w_t}(x_t), y_t).
Report loss L = E_t l_t.
PV analysis
Let D be a distribution over (x, y), and let l*_t = E_{(x,y)∼D} L(y_{w_t}(x), y).
Theorem: for all probability distributions D(x, y) and all online learning algorithms, with probability at least 1 − δ:
|L − E_t l*_t| ≤ √(ln(2/δ) / (2T))
rcv1 with different loss functions
All the common loss functions are sound for binary classification, so which is best is an empirical choice.
vw --sgd rcv1.train.txt -c --loss_function hinge --binary
vw --sgd rcv1.train.txt -c --loss_function logistic --binary
vw --sgd rcv1.train.txt -c --loss_function quantile --binary
Progressive validation does not replace train/test discipline, but it can greatly aid empirical testing.
Part II, advanced updates
1. Importance weight invariance
2. Adaptive updates
3. Normalized updates
Learning with importance weights
A common scenario: you need to do classification, but one kind of mistake is more expensive than the other. An example: in spam detection, predicting nonspam as spam is worse than predicting spam as nonspam.
Let's say an example is I times more important than a typical example. How do you modify the update to use I?
The baseline approach: w_i ← w_i − ηI ∂L(y_w(x), y)/∂w_i.
Dealing with the importance weights
w_i ← w_i − ηI ∂L(y_w(x), y)/∂w_i performs poorly: for large I, a single scaled-up gradient step can overshoot the label.
[Plot "Baseline Importance Update": squared loss vs. the prediction when y = 1, with the baseline update stepping far past the minimum.]
Dealing with the importance weights
A better approach: apply w_i ← w_i − η ∂L(y_w(x), y)/∂w_i repeatedly, I times.
[Plot: as above, plus a "Repeated Update" curve approaching the minimum in small steps instead of overshooting.]
Dealing with the importance weights
An even better approach: w_i ← w_i − s(ηI) ∂L(y_w(x), y)/∂w_i, where s(·) is a loss-dependent scaling that gives, in closed form, the limit of infinitely many infinitesimal updates (see [Importance Aware Updates]).
[Plot: as above, plus an "Invariant Update" curve reaching the repeated-update endpoint in a single step.]
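For squared loss, s(ηI) has a closed form obtained by solving the gradient-flow limit of the repeated update: the prediction decays exponentially toward the label and never overshoots it. A sketch of that one case, using the factor-2 gradient convention 2(y_w(x) − y)x_i from the Dimensional Correction slide below (other losses have their own closed forms in [Importance Aware Updates]):

```python
import math

def invariant_update_squared(w, x, y, eta, imp=1.0):
    """Importance-aware squared-loss update: the limit of infinitely many
    infinitesimal updates with total importance weight `imp`."""
    p = sum(wi * xi for wi, xi in zip(w, x))
    xx = sum(xi * xi for xi in x)
    if xx == 0.0:
        return w
    # Solving dp/dtau = -2*eta*(p - y)*x.x gives an exponential decay of the
    # prediction toward y; the total weight change along x is:
    k = (y - p) * (1.0 - math.exp(-2.0 * eta * imp * xx)) / xx
    return [wi + k * xi for wi, xi in zip(w, x)]
```

For small ηI this matches the baseline step; as ηI → ∞ the prediction lands exactly on y instead of overshooting.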
Robust results for unweighted problems
[Plots: accuracy of the standard update (y-axis) vs. the importance-aware update (x-axis) on four problems: astro with logistic loss, spam with quantile loss, rcv1 with squared loss, and webspam with hinge loss.]
rcv1 with an invariant update
vw rcv1.train.txt -c --binary --invariant
This performs slightly worse with the default learning rate, but is much more robust to the learning rate choice.
Adaptive Learning
Learning rates must decay to converge, but how?
Common answer: η_t = 1/t^0.5 or η_t = 1/t.
Better answer: at each timestep t, let g_{i,t} = ∂L(y_w(x_t), y_t)/∂w_i.
New update rule: w_i ← w_i − η g_{i,t} / √(∑_{t'=1}^t g_{i,t'}²)
Common features stabilize quickly. Rare features can have large updates.
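A sketch of this per-feature adaptive rule (AdaGrad-style diagonal scaling; squared loss is assumed for concreteness):

```python
def adaptive_train(examples, eta=0.5):
    w, g2 = None, None                       # weights, per-feature sum of g^2
    for x, y in examples:
        if w is None:
            w, g2 = [0.0] * len(x), [0.0] * len(x)
        pred = sum(wi * xi for wi, xi in zip(w, x))
        for i, xi in enumerate(x):
            g = 2.0 * (pred - y) * xi        # squared-loss gradient
            g2[i] += g * g
            if g2[i] > 0.0:                  # skip never-active features
                w[i] -= eta * g / g2[i] ** 0.5
    return w
```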
Adaptive Learning example
vw rcv1.train.txt -c --binary --adaptive
Slightly worse. Adding --invariant -l 1 helps.
Dimensional Correction
g_{i,t} for squared loss = 2(y_w(x) − y)x_i, so the update is
w_i ← w_i − C x_i
The same form occurs for all linear updates.
Intrinsic problem! Doubling x_i implies halving w_i to get the same prediction. ⇒ The update rule has mixed units!
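A two-line illustration of the mixed units (made-up numbers): the same example with its feature rescaled by 100 turns a gentle step into a wild overshoot.

```python
def sq_step(w, x, y, eta):
    return w - eta * 2.0 * (w * x - y) * x  # one squared-loss gradient step

eta, y = 0.1, 1.0
w_a = sq_step(0.0, 1.0, y, eta)    # x = 1:   w = 0.2,  prediction = 0.2
w_b = sq_step(0.0, 100.0, y, eta)  # x = 100: w = 20.0, prediction = 2000
print(w_a * 1.0, w_b * 100.0)      # a scale-free rule would predict alike
```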
A standard solution: Gaussian sphering
For each feature x_i compute:
empirical mean µ_i = E_t x_{i,t}
empirical standard deviation σ_i = √(E_t (x_{i,t} − µ_i)²)
Let x'_i ← (x_i − µ_i)/σ_i.
Problems:
1. It is no longer online: computing µ_i and σ_i requires seeing the data first.
2. RCV1 becomes a factor of 500 larger, because subtracting the mean destroys sparsity.
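A sketch of the sphering itself on a dense NumPy array; note how subtracting µ_i fills in every zero entry, which is exactly why the sparse RCV1 data blows up:

```python
import numpy as np

def sphere(X):
    """Standardize each feature column to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0.0] = 1.0  # leave constant features unscaled
    return (X - mu) / sigma

X = np.array([[0.0, 3.0], [0.0, 5.0], [4.0, 7.0]])
print(sphere(X))  # the zeros in column 0 are now nonzero
```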
A scale-free update
NG(learning_rate η)
1. Initially w_i = 0, s_i = 0, N = 0.
2. For each timestep t, observe example (x, y):
   1. For each i, if |x_i| > s_i:
      1. w_i ← w_i s_i²/|x_i|² (renormalize w_i for the new scale)
      2. s_i ← |x_i| (adjust the scale)
   2. ŷ = ∑_i w_i x_i
   3. N ← N + ∑_i x_i²/s_i² (adjust the global scale)
   4. For each i: w_i ← w_i − η √(t/N) (1/s_i²) ∂L(ŷ, y)/∂w_i
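A sketch of NG following the pseudocode above (dense feature vectors and squared loss are assumptions of this sketch, as is the √(t/N) global factor reconstructed above):

```python
def ng_train(examples, eta=0.1):
    w, s, big_n, t = None, None, 0.0, 0
    for x, y in examples:
        if w is None:
            w, s = [0.0] * len(x), [0.0] * len(x)
        t += 1
        for i, xi in enumerate(x):
            if abs(xi) > s[i]:                   # feature exceeds known scale
                if s[i] > 0.0:
                    w[i] *= s[i] ** 2 / xi ** 2  # renormalize w_i for new scale
                s[i] = abs(xi)                   # adjust the scale
        pred = sum(wi * xi for wi, xi in zip(w, x))
        big_n += sum(xi ** 2 / s[i] ** 2
                     for i, xi in enumerate(x) if s[i] > 0.0)
        if big_n == 0.0:
            continue                             # example had no active features
        rate = eta * (t / big_n) ** 0.5          # global scale adjustment
        for i, xi in enumerate(x):
            if s[i] > 0.0:
                g = 2.0 * (pred - y) * xi        # squared-loss gradient
                w[i] -= rate * g / s[i] ** 2     # per-feature scale adjustment
    return w
```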
In combination
An adaptive, scale-free, importance-invariant update rule is the default:
vw rcv1.train.txt -c --binary
References
[RCV1 example] Léon Bottou, Stochastic Gradient Descent, 2007.
[VW] Vowpal Wabbit project, http://hunch.net/~vw, 2007-2012.
[Quantile Regression] Roger Koenker, Quantile Regression, Econometric Society Monograph Series, Cambridge University Press, 2005.
[Classification Consistency] Ambuj Tewari and Peter L. Bartlett, On the consistency of multiclass classification methods, COLT 2005.
[Progressive Validation I] Avrim Blum, Adam Kalai, and John Langford, Beating the Holdout: Bounds for K-fold and Progressive Cross-Validation, COLT 1999.
[Progressive Validation II] N. Cesa-Bianchi, A. Conconi, and C. Gentile, On the generalization ability of on-line learning algorithms, IEEE Transactions on Information Theory, 50(9):2050-2057, 2004.
[Importance Aware Updates] Nikos Karampatziakis and John Langford, Importance Weight Aware Gradient Updates, UAI 2010.
[Online Convex Programming] Martin Zinkevich, Online convex programming and generalized infinitesimal gradient ascent, ICML 2003.
[Adaptive Updates I] John Duchi, Elad Hazan, and Yoram Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, COLT 2010 & JMLR 2011.
[Adaptive Updates II] H. Brendan McMahan and Matthew Streeter, Adaptive Bound Optimization for Online Convex Optimization, COLT 2010.
[Scale invariant updates] Forthcoming.