Second Order Learning
Koby Crammer, Department of Electrical Engineering
ECML PKDD 2013, Prague

TRANSCRIPT
Thanks
• Mark Dredze
• Alex Kulesza
• Avihai Mejer
• Edward Moroshko
• Francesco Orabona
• Fernando Pereira
• Yoram Singer
• Nina Vaitz
Tutorial Context
[Diagram: the tutorial sits at the intersection of Online Learning, Optimization Theory, SVMs, and Real-World Data.]
Outline
• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data
Online Learning
[Illustrative slides: dinosaur images presented one at a time for classification – Tyrannosaurus rex, Triceratops, then Tyrannosaurus rex and Velociraptor together.]
Formal Setting – Binary Classification
• Instances – images, sentences
• Labels – parse trees, names
• Prediction rule – linear prediction rules
• Loss – number of mistakes
Predictions
• Discrete predictions – hard to optimize
• Continuous predictions:
  – Label: the sign of the prediction
  – Confidence: its magnitude
Loss Functions
• Natural loss – zero-one loss: 1 if the prediction is wrong, 0 otherwise
• Real-valued-prediction losses, for label y and score s:
  – Hinge loss: max(0, 1 − y s)
  – Exponential loss (Boosting): exp(−y s)
  – Log loss (Max Entropy, Boosting): log(1 + exp(−y s))
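The losses listed above can be sketched directly; this is an illustrative snippet (the names are not from the slides), for a label y in {−1, +1} and a real-valued score s:

```python
import math

def zero_one_loss(y, s):
    # 1 on a wrong (or zero-margin) prediction, 0 otherwise
    return 1.0 if y * s <= 0 else 0.0

def hinge_loss(y, s):
    return max(0.0, 1.0 - y * s)

def exp_loss(y, s):
    # exponential loss (Boosting)
    return math.exp(-y * s)

def log_loss(y, s):
    # log loss (Max Entropy, Boosting)
    return math.log(1.0 + math.exp(-y * s))
```

Each of the real-valued losses upper-bounds the zero-one loss, which is what makes them usable surrogates for optimization.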
Loss Functions
[Figure: zero-one loss and hinge loss as functions of the margin; the hinge loss upper-bounds the zero-one loss and is zero for margin ≥ 1.]
Online Learning
• Maintain model M; get instance x
• Predict label ŷ = M(x)
• Get true label y; suffer loss ℓ(y, ŷ)
• Update model M
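The protocol above can be written as a minimal loop; `predict` and `update` are illustrative stand-ins for the model M:

```python
def online_learning(stream, predict, update, w):
    """Run the online protocol: predict, observe the true label, suffer loss, update."""
    mistakes = 0
    for x, y in stream:           # get instance x, then its true label y
        y_hat = predict(w, x)     # predict label y_hat = M(x)
        if y_hat != y:            # suffer zero-one loss l(y, y_hat)
            mistakes += 1
        w = update(w, x, y)       # update model M
    return w, mistakes

# Example: a "model" that always predicts +1 and never updates.
w, m = online_learning([([1.0], +1), ([1.0], -1)],
                       predict=lambda w, x: +1,
                       update=lambda w, x, y: w,
                       w=None)
# m == 1: one mistake, on the negative example
```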
Linear Classifiers
• Any features
• W.l.o.g. (with some notation abuse), identify each instance with its feature vector x
• Binary classifiers of the form f(x) = sign(w · x)
Linear Classifiers (cont.)
• Prediction: ŷ = sign(w · x)
• Confidence in prediction: |w · x|
Linear Classifiers
[Figure: x is the input instance to be classified; w is the weight vector of the classifier.]
Margin
• Margin of an example (x, y) with respect to the classifier w: y (w · x)
• Note: the margin is positive iff the prediction is correct
• The set is separable iff there exists a w such that y (w · x) > 0 for every example
Geometrical Interpretation
[Figures: points on either side of the separating hyperplane, annotated Margin > 0, Margin >> 0, Margin < 0, Margin << 0 according to their signed distance from the boundary.]
Hinge Loss
[Figure: the hinge loss as a function of the margin.]
Why Online Learning?
• Fast
• Memory efficient – process one example at a time
• Simple to implement
• Formal guarantees – mistake bounds
• Online-to-batch conversions
• No statistical assumptions
• Adaptive
• But: not as good as a well-designed batch algorithm
The Perceptron Algorithm
• If no mistake – do nothing
• If mistake – update: w ← w + y x
• Margin after update: y ((w + y x) · x) = y (w · x) + ‖x‖²

Rosenblatt, 1958
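The conservative update above is a one-liner; a sketch (names are illustrative):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron_step(w, x, y):
    """If the example is misclassified (margin <= 0), set w <- w + y*x."""
    if y * dot(w, x) <= 0:                            # mistake
        w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

w = perceptron_step([0.0, 0.0], [1.0, 2.0], +1)       # mistake -> w = [1.0, 2.0]
```

After the update the margin on this example grows by ‖x‖², exactly as stated on the slide.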
Geometrical Interpretation
[Figure: effect of the Perceptron update on the weight vector.]
Gradient Descent
• Consider the batch problem: minimize the total loss L(w) = Σᵢ ℓ(w; (xᵢ, yᵢ))
• Simple algorithm:
  – Initialize w₁ = 0
  – Iterate, for t = 1, 2, …
  – Compute the gradient ∇L(w_t)
  – Set w_{t+1} = w_t − η ∇L(w_t)
Stochastic Gradient Descent
• Consider the same batch problem
• Simple algorithm:
  – Initialize w₁ = 0
  – Iterate, for t = 1, 2, …
  – Pick a random index i
  – Compute the gradient ∇ℓ(w_t; (xᵢ, yᵢ)) of the single-example loss
  – Set w_{t+1} = w_t − η ∇ℓ(w_t; (xᵢ, yᵢ))
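A sketch of SGD on the "hinge" loss discussed next; the learning rate and the toy data are illustrative choices, not from the slides:

```python
import random

def sgd_hinge(data, eta=1.0, steps=100, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])                       # initialize w = 0
    for _ in range(steps):
        x, y = rng.choice(data)                       # pick a random index
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        if margin <= 0:                               # gradient of the "hinge" is -y*x
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
        # otherwise the gradient is 0 and w is unchanged
    return w

w = sgd_hinge([([1.0, 0.0], +1), ([-1.0, 0.0], -1)])
# after the first update w = [1.0, 0.0], which separates both points
```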
Stochastic Gradient Descent
• "Hinge" loss: ℓ(w; (x, y)) = max(0, −y (w · x))
• The gradient: −y x if y (w · x) ≤ 0, and 0 otherwise
• Simple algorithm:
  – Initialize w₁ = 0
  – Iterate, for t = 1, 2, …
  – Pick a random index i
  – If yᵢ (w_t · xᵢ) ≤ 0 then set w_{t+1} = w_t + η yᵢ xᵢ, else set w_{t+1} = w_t

The Perceptron is a stochastic gradient descent algorithm on a sum of "hinge" losses, with a specific order of examples.
Motivation
• Perceptron: no guarantee of a margin after the update
• PA: enforce a minimal non-zero margin after the update
• In particular:
  – If the margin is large enough (≥ 1), do nothing
  – If the margin is less than 1, update so that the margin after the update equals 1
Input Space
[Figure: examples in input space with a separating hyperplane.]
Input Space vs. Version Space
• Input space:
  – Points are input data
  – One constraint is induced by a weight vector
  – Primal space
  – Half-space = all input examples that are classified correctly by a given predictor (weight vector)
• Version space:
  – Points are weight vectors
  – One constraint is induced by an input example
  – Dual space
  – Half-space = all predictors (weight vectors) that correctly classify a given input example
Weight Vector (Version) Space
The algorithm forces w to reside in this region.
Passive Step
Nothing to do: w already resides on the desired side.
Aggressive Step
The algorithm projects w onto the desired half-space.
Aggressive Update Step
• Set w_{t+1} to be the solution of the following optimization problem:
  w_{t+1} = argmin_w ½ ‖w − w_t‖²  s.t.  y_t (w · x_t) ≥ 1
• Solution:
  w_{t+1} = w_t + α_t y_t x_t,  with α_t = max(0, 1 − y_t (w_t · x_t)) / ‖x_t‖²
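The closed-form projection can be sketched directly (names are illustrative):

```python
def pa_update(w, x, y):
    """Project w onto {v : y (v . x) >= 1} using the closed-form PA step."""
    dot_wx = sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - y * dot_wx)                 # hinge loss
    if loss == 0.0:
        return w                                      # passive step
    alpha = loss / sum(xi * xi for xi in x)           # step size
    return [wi + alpha * y * xi for wi, xi in zip(w, x)]

w = pa_update([0.0, 0.0], [2.0, 0.0], +1)             # -> [0.5, 0.0]
# the margin after the update is exactly 1: +1 * (0.5 * 2.0) = 1
```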
Perceptron vs. PA
• Common update: w_{t+1} = w_t + α_t y_t x_t
• Perceptron: α_t = 1
• Passive-Aggressive: α_t = max(0, 1 − y_t (w_t · x_t)) / ‖x_t‖²
Perceptron vs. PA
[Figure: the update size α as a function of the margin, in three regimes: error; no error but small margin; no error and large margin. The Perceptron updates only on errors, with a fixed step; PA also updates on small-margin examples, with a step that shrinks as the margin grows.]
Geometrical Assumption
• All examples are bounded in a ball of radius R: ‖xᵢ‖ ≤ R
Separability
• There exists a unit vector u (‖u‖ = 1) that classifies the data correctly with margin γ: yᵢ (u · xᵢ) ≥ γ for all i
Perceptron's Mistake Bound
• The number of mistakes the algorithm makes is bounded by (R/γ)²
• Simple case: positive points and negative points separated by a hyperplane, for which the bound can be computed explicitly
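The two standard inequalities behind the (R/γ)² bound, sketched here as the usual textbook derivation (assuming ‖x_t‖ ≤ R and y_t (u · x_t) ≥ γ for a unit vector u, with M mistakes):

```latex
% Progress: each mistake adds at least \gamma to the projection on u.
u \cdot w_{t+1} = u \cdot w_t + y_t (u \cdot x_t) \ge u \cdot w_t + \gamma
\;\Rightarrow\; u \cdot w_{M+1} \ge M\gamma .
% Control: each mistake adds at most R^2 to the squared norm
% (the middle term is non-positive on a mistake).
\|w_{t+1}\|^2 = \|w_t\|^2 + 2\,y_t (w_t \cdot x_t) + \|x_t\|^2
\le \|w_t\|^2 + R^2
\;\Rightarrow\; \|w_{M+1}\|^2 \le M R^2 .
% Combining, via Cauchy--Schwarz:
M\gamma \le u \cdot w_{M+1} \le \|w_{M+1}\| \le \sqrt{M}\,R
\;\Rightarrow\; M \le (R/\gamma)^2 .
```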
Geometrical Motivation
[Figure: data whose informative directions have very different scales.]

SGD on such data
[Figure: the trajectory of SGD on such data.]
Second-Order Perceptron
• Assume all inputs are given
• Compute a "whitening" matrix from the data correlations
• Run the Perceptron on the whitened data
• Online version: a new "whitening" matrix is recomputed from the examples seen so far

Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005
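The batch whitening step can be sketched as follows; this is an illustrative sketch (the regularizer `a` and the exact normalization are assumptions, not the slide's exact recipe): scale the data by the inverse square root of its regularized correlation matrix, then run any first-order algorithm on the result.

```python
import numpy as np

def whiten(X, a=1.0):
    """X: (n, d) array of inputs. Returns X @ (a*I + X^T X)^(-1/2)."""
    d = X.shape[1]
    M = a * np.eye(d) + X.T @ X              # regularized correlation matrix
    vals, vecs = np.linalg.eigh(M)           # M is symmetric positive (semi)definite
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return X @ inv_sqrt

X = np.array([[10.0, 0.0], [0.0, 0.1]])
Xw = whiten(X, a=0.0)                        # directions are rescaled to equal magnitude
```

On this toy data the two axes differ in scale by a factor of 100; after whitening both examples have unit length, which is exactly what makes the subsequent Perceptron run better conditioned.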
Second-Order Perceptron: Mistake Bound
• The bound depends on the eigenvalues of the data correlation matrix, not only on the radius R
• For the same simple case as before, this bound can be much smaller than the Perceptron's (R/γ)²

Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005
Second-Order Perceptron
• If no mistake – do nothing
• If mistake – update both the weight vector and the correlation matrix:
  w ← w + y x,  S ← S + x xᵀ

Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005
SGD on whitened data
[Figure: the trajectory of SGD after whitening.]
Span-based Update Rules
• The weight vector is a linear combination of the examples
• Per-feature form of the update: w_f ← w_f + η y x_f, where w_f is the weight of feature f, x_f the feature value of the input instance, y the target label (either −1 or +1), and η the learning rate
• Two rate schedules (among many others):
  – Perceptron algorithm (conservative)
  – Passive-Aggressive
Sentiment Classification
• "Who needs this Simpsons book? You DOOOOOOOO. This is one of the most extraordinary volumes I've ever encountered … Exhaustive, informative, and ridiculously entertaining, it is the best accompaniment to the best television show … Very highly recommended!"

Pang, Lee, Vaithyanathan, EMNLP 2002
Sentiment Classification
• Many positive reviews contain the word best, so w_best grows large
• Then a negative review arrives – "boring book – best if you want to sleep in seconds"
• A linear update will reduce both w_best and w_boring
• But best appeared more often than boring: the model knows more about best than about boring
• Better to reduce the two weights at different rates: w_boring more than w_best
Natural Language Processing
• Big datasets, large number of features
• Many features are only weakly correlated with the target label
• Linear classifiers: features are associated with word counts
• Heavy-tailed feature distribution
[Figure: feature counts vs. feature rank.]
Natural Language Processing
[Figure: heavy-tailed distribution of feature counts in a real corpus.]
New Prediction Models
• Gaussian distributions over weight vectors: w ~ N(μ, Σ)
• The covariance Σ is either full or diagonal
• In NLP we have many features, so we use a diagonal covariance
Classification
• Given a new example x
• Stochastic:
  – Draw a weight vector w ~ N(μ, Σ)
  – Make a prediction sign(w · x)
• Collective:
  – Average weight vector
  – Average margin
  – Average prediction
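The two prediction schemes can be sketched as follows; all names are illustrative, and the collective rule shown is the average-weight-vector variant from the list above:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_stochastic(mu, Sigma, x):
    """Draw w ~ N(mu, Sigma), then predict sign(w . x)."""
    w = rng.multivariate_normal(mu, Sigma)
    return np.sign(w @ x)

def predict_collective(mu, x):
    """Predict with the mean weight vector: sign(mu . x)."""
    return np.sign(mu @ x)

mu = np.array([1.0, 0.0])
Sigma = 0.01 * np.eye(2)
x = np.array([1.0, 1.0])
y_collective = predict_collective(mu, x)      # 1.0
```

With a small covariance the stochastic prediction agrees with the collective one with high probability; as Σ grows, the draws disagree more often, which is exactly the uncertainty the model is meant to track.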
The Margin is a Random Variable
• The signed margin y (w · x), for w ~ N(μ, Σ), is a 1-d Gaussian:
  y (w · x) ~ N(y (μ · x), xᵀ Σ x)
• Thus the probability of an error on x is Φ(−y (μ · x) / √(xᵀ Σ x))
Linear Model vs. Distribution over Linear Models
[Figure: a single separator vs. a Gaussian over separators; labels: example, mean weight-vector.]
Weight Vector (Version) Space
The algorithm forces most of the probability mass of w to reside in this region.
Passive Step
Nothing to do: most of the weight vectors already classify the example correctly.
Aggressive Step
• The algorithm projects the current Gaussian distribution onto the half-space
• The mean is moved beyond the mistake-line (large margin)
• The covariance is shrunk in the direction of the input example
Projection Update
• Vectors (aka PA): project w_t onto the half-space under the squared Euclidean distance
• Distributions (new update): project N(μ_t, Σ_t) onto the set of Gaussians that classify the example correctly with probability at least η, the confidence parameter
Divergence
• The KL divergence between Gaussians is a sum of two divergences of the parameters: a Mahalanobis distance between the means and a matrix Itakura-Saito divergence between the covariances
• Convex in both arguments simultaneously
Constraint
• Probabilistic constraint: Pr_{w~N(μ,Σ)}[y (w · x) ≥ 0] ≥ η
• Equivalent margin constraint: y (μ · x) ≥ φ √(xᵀ Σ x), where φ = Φ⁻¹(η)
• Convex in μ, concave in Σ
• Solutions:
  – Linear approximation
  – Change variables to get a convex formulation
  – Relax (AROW)

Dredze, Crammer, Pereira. ICML 2008
Crammer, Dredze, Pereira. NIPS 2008
Crammer, Dredze, Kulesza. NIPS 2009
Convexity
• Change variables: work with the square root of the covariance instead of the covariance itself
• This yields an equivalent convex formulation

Crammer, Dredze, Pereira. NIPS 2008
AROW
• PA: squared-distance regularization plus a squared hinge loss on the mean
• AROW replaces the CW constraint with regularized losses:
  (μ_{t+1}, Σ_{t+1}) = argmin D_KL(N(μ, Σ) ‖ N(μ_t, Σ_t)) + (1/2r) ℓ²(μ; (x_t, y_t)) + (1/2r) x_tᵀ Σ x_t
• Similar update form as CW

Crammer, Dredze, Kulesza. NIPS 2009
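A sketch of the full-covariance AROW step, using the standard closed form β = 1/(xᵀΣx + r), α = max(0, 1 − y xᵀμ)·β, μ ← μ + α y Σx, Σ ← Σ − β Σx xᵀΣ (names and the toy numbers are illustrative):

```python
import numpy as np

def arow_update(mu, Sigma, x, y, r=1.0):
    """One AROW step on example (x, y) with regularization parameter r."""
    Sx = Sigma @ x
    beta = 1.0 / (x @ Sx + r)
    alpha = max(0.0, 1.0 - y * (x @ mu)) * beta       # zero when margin >= 1
    mu = mu + alpha * y * Sx
    Sigma = Sigma - beta * np.outer(Sx, Sx)           # shrink along x
    return mu, Sigma

mu, Sigma = np.zeros(2), np.eye(2)
mu, Sigma = arow_update(mu, Sigma, np.array([1.0, 0.0]), +1, r=1.0)
# mu -> [0.5, 0.0]; Sigma[0,0] shrinks from 1.0 to 0.5, Sigma[1,1] is untouched
```

Note that the covariance update runs even when α = 0: AROW keeps gaining confidence along directions it has seen, which is the "update but not a mistake" case in the analysis later.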
The Update
• The optimization update can be solved analytically:
  μ_{t+1} = μ_t + α_t y_t Σ_t x_t,  Σ_{t+1} = Σ_t − β_t Σ_t x_t x_tᵀ Σ_t
• The coefficients α_t, β_t depend on the specific algorithm
Definitions and Updates
• Each variant uses the same analytic update form with its own coefficients (α_t, β_t):
  – CW (linearization)
  – CW (change of variables)
  – AROW: β_t = 1 / (x_tᵀ Σ_t x_t + r), α_t = max(0, 1 − y_t (x_tᵀ μ_t)) β_t
Per-feature Learning Rate
• The diagonal of the covariance acts as a per-feature learning rate
• Updates keep reducing the learning rate: the eigenvalues of the covariance matrix shrink over time
Diagonal Matrix
• Given a matrix A, define diag(A) to be only the diagonal part of the matrix
• Make the matrix diagonal: Σ ← diag(Σ)
• Or make its inverse diagonal: Σ ← (diag(Σ⁻¹))⁻¹
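The two diagonalizations above are not the same operation; a small illustrative sketch:

```python
import numpy as np

def make_diagonal(Sigma):
    """Sigma <- diag(Sigma): keep only the diagonal part."""
    return np.diag(np.diag(Sigma))

def make_inverse_diagonal(Sigma):
    """Sigma <- (diag(Sigma^-1))^-1: make the inverse diagonal instead."""
    return np.linalg.inv(np.diag(np.diag(np.linalg.inv(Sigma))))

Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
D1 = make_diagonal(Sigma)            # diag(2, 2)
D2 = make_inverse_diagonal(Sigma)    # diag(1.5, 1.5) -- a different matrix
```

The difference matters exactly when off-diagonal correlations are present, as in this example.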
(Back to) Stochastic Gradient Descent
• Consider the batch problem: minimize Σᵢ ℓ(w; (xᵢ, yᵢ))
• Simple algorithm:
  – Initialize w₁ = 0
  – Iterate, for t = 1, 2, …
  – Pick a random index i
  – Compute the gradient g_t = ∇ℓ(w_t; (xᵢ, yᵢ))
  – Set w_{t+1} = w_t − η g_t
Adaptive Stochastic Gradient Descent (AdaGrad)
• Consider the same batch problem
• Simple algorithm:
  – Initialize w₁ = 0
  – Iterate, for t = 1, 2, …
  – Pick a random index i
  – Compute the gradient g_t = ∇ℓ(w_t; (xᵢ, yᵢ))
  – Set A_t from the accumulated outer products of the gradients (full or diagonal)
  – Set w_{t+1} = w_t − η A_t⁻¹ g_t

Duchi, Hazan, Singer, 2010; McMahan, Streeter, 2010
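A sketch of the diagonal variant, where A_t is the square root of the accumulated squared gradients; `eta`, `eps`, and the toy objective are illustrative choices:

```python
import numpy as np

def adagrad(grad, w0, steps=100, eta=0.5, eps=1e-8):
    """Diagonal AdaGrad: per-coordinate steps eta * g / sqrt(sum of g^2)."""
    w = np.array(w0, dtype=float)
    G = np.zeros_like(w)                    # running sum of squared gradients
    for _ in range(steps):
        g = grad(w)
        G += g * g                          # diagonal of A_t^2
        w -= eta * g / (np.sqrt(G) + eps)   # w <- w - eta * A_t^{-1} g
    return w

# Minimize f(w) = w1^2 + 10*w2^2: despite the 10x difference in curvature,
# the per-coordinate normalization makes both coordinates move at similar rates.
w = adagrad(lambda w: np.array([2.0 * w[0], 20.0 * w[1]]), [1.0, 1.0])
```

On the first step each coordinate is normalized by its own gradient magnitude, so both coordinates move by exactly η; plain SGD with a single η would either crawl along the flat coordinate or overshoot along the steep one.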
Adaptive Stochastic Gradient Descent
• Very general: can be used with various regularizations
• The matrix A can be either full or diagonal
• Comes with convergence and regret bounds
• Similar performance to AROW

Duchi, Hazan, Singer, 2010; McMahan, Streeter, 2010
Adaptive Stochastic Gradient Descent
[Figure: SGD vs. AdaGrad trajectories.]

Duchi, Hazan, Singer, 2010; McMahan, Streeter, 2010
Kernels
Proof
• Show that the parameters can be written in the span of the examples
• By induction over the rounds
Proof (cont.)
• By the update rule, μ_{t+1} = μ_t + α_t y_t Σ_t x_t
• Thus the new mean stays in the span of the examples
Proof (cont.)
• By the update rule, Σ_{t+1} = Σ_t − β_t (Σ_t x_t)(Σ_t x_t)ᵀ, so the covariance also stays in the span
Proof (cont.)
• Thus predictions and updates depend on the examples only through inner products, so the algorithm can be kernelized
Statistical Interpretation
• Margin constraint: y (μ · x) ≥ φ √(xᵀ Σ x)
• Distribution over weight vectors: w ~ N(μ, Σ)
• Equivalently: assume the input is corrupted with Gaussian noise
Statistical Interpretation
[Figure: version space vs. input space; labels: example, mean weight-vector, input instance, linear separator, good realization, bad realization.]
96
Mistake Bound
• For any reference weight vector, the number of mistakes made by AROW is upper bounded by
where
  – set of example indices with a mistake
  – set of example indices with an update but not a mistake
Orabona and Crammer, NIPS 2010
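The bound refers to AROW's update rule. For concreteness, a small sketch of that update (following Crammer, Kulesza, and Dredze, 2009); the default r = 1.0 is an illustrative choice, and this is a sketch rather than a reference implementation:

```python
import numpy as np

def arow_update(mu, Sigma, x, y, r=1.0):
    """One AROW step: mu is the mean weight vector, Sigma the covariance,
    (x, y) an example with y in {-1, +1}, r the regularization parameter
    (larger r gives smaller, more conservative updates)."""
    margin = y * mu.dot(x)
    if margin >= 1.0:                 # no hinge loss: leave the model untouched
        return mu, Sigma
    Sx = Sigma.dot(x)
    v = x.dot(Sx)                     # variance of the margin ("confidence")
    beta = 1.0 / (v + r)
    alpha = (1.0 - margin) * beta
    mu = mu + alpha * y * Sx                  # mean moves to reduce the loss
    Sigma = Sigma - beta * np.outer(Sx, Sx)   # covariance shrinks along x
    return mu, Sigma

# toy step: starting from mu = 0, one update fixes the mistake on (x, y)
mu, Sigma = np.zeros(2), np.eye(2)
mu, Sigma = arow_update(mu, Sigma, np.array([1.0, 0.0]), +1)
```

Note that the covariance only ever shrinks, so directions seen often get small updates later on, which is exactly what the diagonal bound below exploits.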
![Page 92: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/92.jpg)
97
Comment I
• Separable case and no updates:
where
![Page 93: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/93.jpg)
98
Comment II
• For large the bound becomes:
• When no updates are performed, the bound reduces to that of the Perceptron
![Page 94: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/94.jpg)
Bound for Diagonal Algorithm
• No. of mistakes is bounded by
• Is low when a feature is either rare or non-informative
• Exactly as in NLP …
Orabona and Crammer, NIPS 2010
![Page 95: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/95.jpg)
100
Outline
• Background:– Online learning + notation– Perceptron– Stochastic-gradient descent– Passive-aggressive
• Second-Order Algorithms– Second order Perceptron– Confidence-Weighted and AROW– AdaGrad
• Properties– Kernels– Analysis
• Empirical Evaluation– Synthetic– Real Data
![Page 96: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/96.jpg)
101
Synthetic Data
• 20 features
• 2 informative (rotated, skewed Gaussian)
• 18 noisy
• Using a single feature is as good as a random prediction
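A sketch of how data of this shape could be generated; the exact rotation, skew, and class separation used in the tutorial are not given, so the parameters below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_synthetic(n, noise_dims=18, angle=np.pi / 4):
    """Two informative features from a rotated, skewed Gaussian plus
    `noise_dims` pure-noise features (illustrative parameters)."""
    y = rng.choice([-1, 1], size=n)
    # skewed Gaussian in 2-d, class mean shifted along the first axis
    informative = rng.normal(0.0, 1.0, size=(n, 2)) * [3.0, 0.3]
    informative[:, 0] += 2.0 * y
    # rotate so neither raw coordinate separates the classes on its own
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])
    informative = informative @ R.T
    noise = rng.normal(0.0, 1.0, size=(n, noise_dims))
    return np.hstack([informative, noise]), y

X, y = make_synthetic(100)
```

After the rotation, the informative directions are correlated combinations of two coordinates, which is precisely the regime where a full covariance matrix helps and a diagonal one loses information.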
![Page 97: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/97.jpg)
102
Synthetic Data (cont'd)
Distribution after 50 examples (x1)
![Page 98: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/98.jpg)
103
Synthetic Data (no noise)
[Plot: learning curves on the noise-free synthetic data for Perceptron, PA, SOP, CW-full, and CW-diag]
![Page 99: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/99.jpg)
104
Synthetic Data (10% noise)
![Page 100: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/100.jpg)
105
Outline
• Background:– Online learning + notation– Perceptron– Stochastic-gradient descent– Passive-aggressive
• Second-Order Algorithms– Second order Perceptron– Confidence-Weighted and AROW– AdaGrad
• Properties– Kernels– Analysis
• Empirical Evaluation– Synthetic– Real Data
![Page 101: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/101.jpg)
106
Data
• Sentiment
  – Sentiment reviews from 6 Amazon domains (Blitzer et al.)
  – Classify a product review as either positive or negative
• Reuters, pairs of labels
  – Three divisions:
    • Insurance: Life vs. Non-Life; Business Services: Banking vs. Financial; Retail Distribution: Specialist Stores vs. Mixed Retail
  – Bag-of-words representation with binary features
• 20 News Groups, pairs of labels
  – Three divisions:
    • comp.sys.ibm.pc.hardware vs. comp.sys.mac.hardware, sci.electronics vs. sci.med, and talk.politics.guns vs. talk.politics.mideast
  – Bag-of-words representation with binary features
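The binary bag-of-words representation is easy to sketch; `binary_bow` and its whitespace tokenization are illustrative, not the preprocessing actually used in these experiments:

```python
def binary_bow(docs):
    """Binary bag-of-words: feature j is 1 iff word j occurs in the document.
    (Illustrative tokenization: lowercase, whitespace split.)"""
    vocab = {}
    for doc in docs:
        for w in sorted(set(doc.lower().split())):
            vocab.setdefault(w, len(vocab))   # first-seen order, deterministic
    X = []
    for doc in docs:
        words = set(doc.lower().split())
        X.append([1 if w in words else 0 for w in vocab])
    return X, vocab

X, vocab = binary_bow(["good great product", "bad product bad"])
```

With binary features, word frequency within a document is deliberately discarded; only presence matters, which keeps rare features sparse, the situation the diagonal bound above says second-order methods handle well.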
![Page 102: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/102.jpg)
107
Experimental Design
• Online to batch:
  – Multiple passes over the training data
  – Evaluate on a different test set after each pass
  – Compute error/accuracy
• Set parameters using held-out data
• 10-fold cross-validation
• ~2000 instances per problem
• Balanced class labels
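The online-to-batch protocol above can be sketched as follows; `online_to_batch`, its arguments, and the toy Perceptron learner are placeholders for any of the online algorithms in the tutorial (and the sketch evaluates on the training pairs only for brevity, where the slides use a separate test set):

```python
def online_to_batch(update, predict, w0, train, test, passes=5):
    """Run an online learner over the training data several times,
    recording test accuracy after each pass."""
    w, curve = w0, []
    for _ in range(passes):
        for x, y in train:
            w = update(w, x, y)
        acc = sum(predict(w, x) == y for x, y in test) / len(test)
        curve.append(acc)
    return w, curve

# toy online learner: the classic Perceptron
def perceptron_update(w, x, y):
    if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:  # mistake (or zero margin)
        w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def sign_predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

train = [([1.0, 0.0], 1), ([-1.0, 0.0], -1)]
w, curve = online_to_batch(perceptron_update, sign_predict,
                           [0.0, 0.0], train, train, passes=3)
```

The per-pass accuracy curve is what the "error reduction by multiple passes" comparison below measures.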
![Page 103: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/103.jpg)
108
Results vs Online – Sentiment
• StdDev and Variance – always better than baseline
• Variance – 5/6 significantly better
![Page 104: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/104.jpg)
109
Results vs Online – 20NG + Reuters
• StdDev and Variance – always better than baseline
• Variance – 4/6 significantly better
![Page 105: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/105.jpg)
110
Results vs Batch - Sentiment
• Always better than batch methods
• 3/6 significantly better
![Page 106: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/106.jpg)
111
Results vs Batch - 20NG + Reuters
• 5/6 better than batch methods
• 3/5 significantly better, 1/1 significantly worse
![Page 107: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/107.jpg)
112
![Page 108: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/108.jpg)
113
![Page 109: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/109.jpg)
114
![Page 110: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/110.jpg)
115
Results - Sentiment
• CW is better (5/6 cases), statistically significant (4/6)
• CW benefits less from many passes
[Plots: accuracy vs. passes of training data, one panel per dataset, with curves for PA and CW]
![Page 111: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/111.jpg)
116
Results – Reuters + 20NG
• CW is better (5/6 cases), statistically significant (4/6)
• CW benefits less from many passes
[Plots: accuracy vs. passes of training data, one panel per dataset, with curves for PA and CW]
![Page 112: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/112.jpg)
117
Error Reduction by Multiple Passes
• PA benefits more from multiple passes (8/12)
• Amount of benefit is data-dependent
![Page 113: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/113.jpg)
Bayesian Logistic Regression

BLR (T. Jaakkola and M. Jordan, 1997; based on the variational approximation):
• Covariance
• Mean

CW/AROW (conceptually decoupled update; a function of the margin/hinge-loss):
• Covariance
• Mean

118
![Page 114: 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague](https://reader037.vdocuments.us/reader037/viewer/2022110116/5516e54e550346fe558b46dd/html5/thumbnails/114.jpg)
Algorithms Summary

1st Order → 2nd Order:
– Perceptron → SOP
– PA → CW+AROW
– SGD → AdaGrad
– Logistic Regression (LR) → Bayesian LR
• Different motivation, similar algorithms
• All algorithms can be kernelized
• Work well for data that is NOT isotropic / symmetric
• State-of-the-art results in various domains
• Accompanied by theory
119