![Page 1: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/1.jpg)
Structured Sparsityin Natural Language Processing:
Models, Algorithms, and Applications
Andre F. T. Martins1,3 Dani Yogatama2 Noah A. Smith2
Mario A. T. Figueiredo1
1Instituto de TelecomunicacoesInstituto Superior Tecnico, Lisboa, Portugal
2Language Technologies Institute, School of Computer ScienceCarnegie Mellon University, Pittsburgh, PA, USA
3Priberam, Lisboa, Portugal
EACL 2014 Tutorial, Gothenburg, Sweden, April 27, 2014Slides online at http://tiny.cc/ssnlp14
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 1 / 128
![Page 2: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/2.jpg)
Welcome
This tutorial is about sparsity, a topic of great relevance to NLP.
Sparsity relates to feature selection, model compactness, runtime,memory footprint, interpretability of our models.
New idea in the last 7 years: structured sparsity. This tutorial tries toanswer:
What is structured sparsity?
How do we apply it?
How has it been used so far?
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 2 / 128
![Page 3: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/3.jpg)
Outline
1 Introduction
2 Loss Functions and Sparsity
3 Structured Sparsity
4 Algorithms
Batch Algorithms
Online Algorithms
5 Applications
6 Conclusions
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 3 / 128
![Page 4: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/4.jpg)
Notation
Many NLP problems involve mapping from one structured space toanother. Notation:
Input set X
For each x ∈ X, candidate outputs are Y(x) ⊆ Y
Mapping is hw : X→ Y
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 4 / 128
![Page 5: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/5.jpg)
Linear Models
Our predictor will take the form
hw(x) = arg maxy∈Y(x)
w>f(x , y)
where:
f is a vector function that encodes all the relevant things about(x , y); the result of a theory, our knowledge, feature engineering, etc.
w ∈ RD are the weights that parameterize the mapping.
NLP today: D is often in the tens or hundreds of millions.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 5 / 128
![Page 6: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/6.jpg)
Learning Linear Models
Max ent, perceptron, CRF, SVM, even supervised generative models all fitthe linear modeling framework.
General training setup:
We observe a collection of examples 〈xn, yn〉Nn=1.
Perform statistical analysis to discover w from the data.Ranges from “count and normalize” to complex optimization routines.
Optimization view:
w = arg minw
1
N
N∑n=1
L(w; xn, yn)︸ ︷︷ ︸empirical loss
+ Ω(w)︸ ︷︷ ︸regularizer
This tutorial will focus on the regularizer, Ω.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 6 / 128
![Page 7: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/7.jpg)
What is Sparsity?
The word “sparsity” has (at least) four related meanings in NLP!
1 Data sparsity: N is too small to obtain a good estimate for w.Also known as “curse of dimensionality.”(Usually bad.)
2 “Probability” sparsity: I have a probability distribution over events(e.g., X× Y), most of which receive zero probability.(Might be good or bad.)
3 Sparsity in the dual: associated with SVMs and other kernel-basedmethods; implies that the predictor can be represented via kernelcalculations involving just a few training instances.
4 Model sparsity: Most dimensions of f are not needed for a good hw;those dimensions of w can be zero, leading to a sparse w (model).
This tutorial is about sense #4: today, (model) sparsity is a good thing!
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 7 / 128
![Page 8: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/8.jpg)
Why Sparsity is Desirable in NLP
Occam’s razor and interpretability.
The bet on sparsity (Friedman et al., 2004): it’s often correct. When itisn’t, there’s no good solution anyway!
Models with just a few features are
easy to explain and implement
attractive as linguistic hypotheses
reminiscent of classical symbolic systems
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 8 / 128
![Page 9: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/9.jpg)
A decision list from Yarowsky (1995).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 9 / 128
![Page 10: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/10.jpg)
Why Sparsity is Desirable in NLP
Computational savings.
wd = 0 is equivalent to erasing the feature from the model; smallereffective D implies smaller memory footprint.
This, in turn, implies faster decoding runtime.
Further, sometimes entire kinds of features can be eliminated, givingasymptotic savings.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 10 / 128
![Page 11: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/11.jpg)
Why Sparsity is Desirable in NLP
Generalization.
The challenge of learning is to extract from the data only what willgeneralize to new examples.
Forcing a learner to use few features is one way to discourageoverfitting.
Text categorization experiments in Kazama and Tsujii (2003): +3accuracy points with 1% as many features
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 11 / 128
![Page 12: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/12.jpg)
(Automatic) Feature Selection
Human NLPers are good at thinking of features.
Can we automate the process of selecting which ones to keep?
Three kinds of methods:
1 filters
2 wrappers
3 embedded methods (this tutorial)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 12 / 128
![Page 13: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/13.jpg)
(Automatic) Feature Selection
Human NLPers are good at thinking of features.
Can we automate the process of selecting which ones to keep?
Three kinds of methods:
1 filters
2 wrappers
3 embedded methods (this tutorial)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 12 / 128
![Page 14: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/14.jpg)
Filter-based Feature Selection
For each candidate feature fd , apply a heuristic to determine whether toinclude it. (Excluding fd equates to fixing wd = 0.)
Examples:
Count threshold: is |n | fd (xn, yn) > 0| > τ?(Ignore rare features.)
Mutual information or correlation between features and labels
Advantage: speed!
Disadvantages:
Ignores the learning algorithm
Thresholds require tuning
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 13 / 128
![Page 15: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/15.jpg)
Ratnaparkhi (1996), on his POS tagger:
The behavior of a feature that occurs very sparsely in thetraining set is often difficult to predict, since its statistics maynot be reliable. Therefore, the model uses the heuristic that anyfeature which occurs less than 10 times in the data is unreliable,and ignores features whose counts are less than 10.1 While thereare many smoothing algorithms which use techniques morerigorous than a simple count cutoff, they have not yet beeninvestigated in conjunction with this tagger.
1Except for features that look only at the current word, i.e., features of theform wi =<word> and ti = <TAG>. The count of 10 was chosen by inspection ofTraining and Development data.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 14 / 128
![Page 16: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/16.jpg)
(Automatic) Feature Selection
Human NLPers are good at thinking of features.
Can we automate the process of selecting which ones to keep?
Three kinds of methods:
1 filters
2 wrappers
3 embedded methods (this tutorial)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 15 / 128
![Page 17: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/17.jpg)
(Automatic) Feature Selection
Human NLPers are good at thinking of features.
Can we automate the process of selecting which ones to keep?
Three kinds of methods:
1 filters
2 wrappers
3 embedded methods (this tutorial)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 15 / 128
![Page 18: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/18.jpg)
Wrapper-based Feature Selection
For each subset F ⊆ 1, 2, . . .D, learn hwFfor features fd | d ∈ F.
2D − 1 choices; so perform a search over subsets.
Cons:
NP-hard problem (Amaldi and Kann, 1998; Davis et al., 1997)
Must resort to greedy methods
Even those require iterative calls to a black-box learner
Danger of overfitting in choosing F.(Typically use development data or cross-validate.)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 16 / 128
![Page 19: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/19.jpg)
Della Pietra et al. (1997) add features one at a time. Step (3) involvesre-estimating parameters:
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 17 / 128
![Page 20: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/20.jpg)
(Automatic) Feature Selection
Human NLPers are good at thinking of features.
Can we automate the process of selecting which ones to keep?
Three kinds of methods:
1 filters
2 wrappers
3 embedded methods (this tutorial)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 18 / 128
![Page 21: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/21.jpg)
(Automatic) Feature Selection
Human NLPers are good at thinking of features.
Can we automate the process of selecting which ones to keep?
Three kinds of methods:
1 filters
2 wrappers
3 embedded methods (this tutorial)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 18 / 128
![Page 22: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/22.jpg)
Embedded Methods for Feature Selection
Formulate the learning problem as a trade-off between
minimizing loss (fitting the training data, achieving good accuracy onthe training data, etc.)
choosing a desirable model (e.g., one with no more features thanneeded)
minw
1
N
N∑n=1
L(w; xn, yn) + Ω(w)
Key advantage: declarative statements of model “desirability” often leadto well-understood, solvable optimization problems.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 19 / 128
![Page 23: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/23.jpg)
Useful Papers on Feature Selection and Sparsity
Overview of many feature selection methods:Guyon and Elisseeff (2003)
Greedy wrapper-based method used for max ent models in NLP:Della Pietra et al. (1997)
Early uses of sparsity in NLP:Kazama and Tsujii (2003); Goodman (2004)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 20 / 128
![Page 24: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/24.jpg)
Outline
1 Introduction
2 Loss Functions and Sparsity
3 Structured Sparsity
4 Algorithms
Batch Algorithms
Online Algorithms
5 Applications
6 Conclusions
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 21 / 128
![Page 25: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/25.jpg)
Learning Problem
Recall that we formulate the learning problem as:
minw
Ω(w)︸ ︷︷ ︸regularizer
+N∑
i=1
L(w, xi , yi )︸ ︷︷ ︸total loss
,
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 22 / 128
![Page 26: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/26.jpg)
Loss functions (I)
Regression (y ∈ R) typically uses the squared error loss:
LSE(w; x , y) =1
2
(y −w>f(x)
)2
Total loss:
1
2
N∑n=1
(yn −w>f(xn)
)2=
1
2‖Aw − y‖2
2
Design matrix: A = [Aij ]i=1,...,N; j=1,...,D , where Aij = fj (xi ).
Response vector: y = [y1, ..., yN ]>.
Arguably, the most/best studied loss function (statistics, machinelearning, signal processing).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 23 / 128
![Page 27: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/27.jpg)
Loss functions (I)
Regression (y ∈ R) typically uses the squared error loss:
LSE(w; x , y) =1
2
(y −w>f(x)
)2
Total loss:
1
2
N∑n=1
(yn −w>f(xn)
)2=
1
2‖Aw − y‖2
2
Design matrix: A = [Aij ]i=1,...,N; j=1,...,D , where Aij = fj (xi ).
Response vector: y = [y1, ..., yN ]>.
Arguably, the most/best studied loss function (statistics, machinelearning, signal processing).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 23 / 128
![Page 28: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/28.jpg)
Loss functions (I)
Regression (y ∈ R) typically uses the squared error loss:
LSE(w; x , y) =1
2
(y −w>f(x)
)2
Total loss:
1
2
N∑n=1
(yn −w>f(xn)
)2=
1
2‖Aw − y‖2
2
Design matrix: A = [Aij ]i=1,...,N; j=1,...,D , where Aij = fj (xi ).
Response vector: y = [y1, ..., yN ]>.
Arguably, the most/best studied loss function (statistics, machinelearning, signal processing).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 23 / 128
![Page 29: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/29.jpg)
Loss functions (I)
Regression (y ∈ R) typically uses the squared error loss:
LSE(w; x , y) =1
2
(y −w>f(x)
)2
Total loss:
1
2
N∑n=1
(yn −w>f(xn)
)2=
1
2‖Aw − y‖2
2
Design matrix: A = [Aij ]i=1,...,N; j=1,...,D , where Aij = fj (xi ).
Response vector: y = [y1, ..., yN ]>.
Arguably, the most/best studied loss function (statistics, machinelearning, signal processing).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 23 / 128
![Page 30: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/30.jpg)
Loss functions (I)
Regression (y ∈ R) typically uses the squared error loss:
LSE(w; x , y) =1
2
(y −w>f(x)
)2
Total loss:
1
2
N∑n=1
(yn −w>f(xn)
)2=
1
2‖Aw − y‖2
2
Design matrix: A = [Aij ]i=1,...,N; j=1,...,D , where Aij = fj (xi ).
Response vector: y = [y1, ..., yN ]>.
Arguably, the most/best studied loss function (statistics, machinelearning, signal processing).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 23 / 128
![Page 31: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/31.jpg)
Loss functions (II)
Classification and structured prediction using log-linear models(logistic regression, max ent, conditional random fields):
LLR(w; x , y) = − log P (y |x ; w)
= − logexp(w>f(x , y))∑
y ′∈Y(x) exp(w>f(x , y ′))
= −w>f(x , y) + log Z (w, x)
Partition function:
Z (w, x) =∑
y ′∈Y(x)
exp(w>f(x , y ′)).
Related loss functions: hinge loss (in SVM) and the perceptron loss.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 24 / 128
![Page 32: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/32.jpg)
Loss functions (II)
Classification and structured prediction using log-linear models(logistic regression, max ent, conditional random fields):
LLR(w; x , y) = − log P (y |x ; w)
= − logexp(w>f(x , y))∑
y ′∈Y(x) exp(w>f(x , y ′))
= −w>f(x , y) + log Z (w, x)
Partition function:
Z (w, x) =∑
y ′∈Y(x)
exp(w>f(x , y ′)).
Related loss functions: hinge loss (in SVM) and the perceptron loss.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 24 / 128
![Page 33: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/33.jpg)
Loss functions (II)
Classification and structured prediction using log-linear models(logistic regression, max ent, conditional random fields):
LLR(w; x , y) = − log P (y |x ; w)
= − logexp(w>f(x , y))∑
y ′∈Y(x) exp(w>f(x , y ′))
= −w>f(x , y) + log Z (w, x)
Partition function:
Z (w, x) =∑
y ′∈Y(x)
exp(w>f(x , y ′)).
Related loss functions: hinge loss (in SVM) and the perceptron loss.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 24 / 128
![Page 34: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/34.jpg)
Main Loss Functions: Summary
Squared (linear regression) 12
(y −w>f(x)
)2
Log-linear (MaxEnt, CRF, logistic) −w>f(x , y) + log∑y ′∈Y
exp(w>f(x , y ′))
Hinge (SVMs) −w>f(x , y) + maxy ′∈Y
(w>f(x , y ′) + c(y , y ′)
)Perceptron −w>f(x , y) + max
y ′∈Yw>f(x , y ′)
(in the SVM loss, c(y , y ′) is a cost function.)
The log-linear, hinge, and perceptron losses are particular cases of generalfamily (Martins et al., 2010).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 25 / 128
![Page 35: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/35.jpg)
Main Loss Functions: Summary
Squared (linear regression) 12
(y −w>f(x)
)2
Log-linear (MaxEnt, CRF, logistic) −w>f(x , y) + log∑y ′∈Y
exp(w>f(x , y ′))
Hinge (SVMs) −w>f(x , y) + maxy ′∈Y
(w>f(x , y ′) + c(y , y ′)
)Perceptron −w>f(x , y) + max
y ′∈Yw>f(x , y ′)
(in the SVM loss, c(y , y ′) is a cost function.)
The log-linear, hinge, and perceptron losses are particular cases of generalfamily (Martins et al., 2010).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 25 / 128
![Page 36: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/36.jpg)
Regularization Formulations
Tikhonov regularization: w = arg minwλΩ(w) +
N∑n=1
L(w; xn, yn)
Ivanov regularization
w = arg minw
N∑n=1
L(w; xn, yn)
subject to Ω(w) ≤ τ
Morozov regularization
w = arg minw
Ω(w)
subject toN∑
n=1
L(w; xn, yn) ≤ δ
Equivalent, under mild conditions (namely convexity).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 26 / 128
![Page 37: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/37.jpg)
Regularization Formulations
Tikhonov regularization: w = arg minwλΩ(w) +
N∑n=1
L(w; xn, yn)
Ivanov regularization
w = arg minw
N∑n=1
L(w; xn, yn)
subject to Ω(w) ≤ τ
Morozov regularization
w = arg minw
Ω(w)
subject toN∑
n=1
L(w; xn, yn) ≤ δ
Equivalent, under mild conditions (namely convexity).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 26 / 128
![Page 38: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/38.jpg)
Regularization Formulations
Tikhonov regularization: w = arg minwλΩ(w) +
N∑n=1
L(w; xn, yn)
Ivanov regularization
w = arg minw
N∑n=1
L(w; xn, yn)
subject to Ω(w) ≤ τ
Morozov regularization
w = arg minw
Ω(w)
subject toN∑
n=1
L(w; xn, yn) ≤ δ
Equivalent, under mild conditions (namely convexity).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 26 / 128
![Page 39: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/39.jpg)
Regularization Formulations
Tikhonov regularization: w = arg minwλΩ(w) +
N∑n=1
L(w; xn, yn)
Ivanov regularization
w = arg minw
N∑n=1
L(w; xn, yn)
subject to Ω(w) ≤ τ
Morozov regularization
w = arg minw
Ω(w)
subject toN∑
n=1
L(w; xn, yn) ≤ δ
Equivalent, under mild conditions (namely convexity).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 26 / 128
![Page 40: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/40.jpg)
Regularization
Why regularize?
Improve generalization by avoiding over-fitting.
Express prior knowledge about w.
Select relevant features (via sparsity-inducing regularization).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 27 / 128
![Page 41: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/41.jpg)
Regularization
Why regularize?
Improve generalization by avoiding over-fitting.
Express prior knowledge about w.
Select relevant features (via sparsity-inducing regularization).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 27 / 128
![Page 42: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/42.jpg)
Regularization
Why regularize?
Improve generalization by avoiding over-fitting.
Express prior knowledge about w.
Select relevant features (via sparsity-inducing regularization).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 27 / 128
![Page 43: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/43.jpg)
Regularization
Why regularize?
Improve generalization by avoiding over-fitting.
Express prior knowledge about w.
Select relevant features (via sparsity-inducing regularization).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 27 / 128
![Page 44: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/44.jpg)
Regularization
Why regularize?
Improve generalization by avoiding over-fitting.
Express prior knowledge about w.
Select relevant features (via sparsity-inducing regularization).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 27 / 128
![Page 45: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/45.jpg)
Regularization vs. Bayesian estimation
Regularized parameter estimate: w = arg minw
Ω(w) +N∑
n=1
L(w; xn, yn)
...interpretable as Bayesian maximum a posteriori (MAP) estimate:
w = arg maxw
exp (−Ω(w))︸ ︷︷ ︸prior p(w)
N∏n=1
exp (−L(w; xn, yn))︸ ︷︷ ︸likelihood (i.i.d. data)
.
This interpretation underlies the logistic regression (LR) loss:LLR(w; xn, yn) = − log P (yn|xn; w).
Same is true for the squared error (SE) loss:
LSE(w; xn, yn) = 12
(y −w>f(x)
)2= − logN(y |w>f(x), 1)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 28 / 128
![Page 46: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/46.jpg)
Regularization vs. Bayesian estimation
Regularized parameter estimate: w = arg minw
Ω(w) +N∑
n=1
L(w; xn, yn)
...interpretable as Bayesian maximum a posteriori (MAP) estimate:
w = arg maxw
exp (−Ω(w))︸ ︷︷ ︸prior p(w)
N∏n=1
exp (−L(w; xn, yn))︸ ︷︷ ︸likelihood (i.i.d. data)
.
This interpretation underlies the logistic regression (LR) loss:LLR(w; xn, yn) = − log P (yn|xn; w).
Same is true for the squared error (SE) loss:
LSE(w; xn, yn) = 12
(y −w>f(x)
)2= − logN(y |w>f(x), 1)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 28 / 128
![Page 47: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/47.jpg)
Regularization vs. Bayesian estimation
Regularized parameter estimate: w = arg minw
Ω(w) +N∑
n=1
L(w; xn, yn)
...interpretable as Bayesian maximum a posteriori (MAP) estimate:
w = arg maxw
exp (−Ω(w))︸ ︷︷ ︸prior p(w)
N∏n=1
exp (−L(w; xn, yn))︸ ︷︷ ︸likelihood (i.i.d. data)
.
This interpretation underlies the logistic regression (LR) loss:LLR(w; xn, yn) = − log P (yn|xn; w).
Same is true for the squared error (SE) loss:
LSE(w; xn, yn) = 12
(y −w>f(x)
)2= − logN(y |w>f(x), 1)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 28 / 128
![Page 48: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/48.jpg)
Classical Regularizers: Ridge
Regularized parameter estimate: w = arg minw
N∑n=1
L(w; xn, yn) + Ω(w)
Arguably, the most classical choice: squared `2 norm: Ω(w) =λ
2‖w‖2
2
Corresponds to zero-mean Gaussian prior p(w) ∝ exp(−λ
2‖w‖22
)
Ridge regression (SE loss): Hoerl and Kennard (1962 and 1970).
Ridge logistic regression: Schaefer et al. (1984), Cessie andHouwelingen (1992); in NLP: Chen and Rosenfeld (1999).
Closely related to Tikhonov (1943) and Wiener (1949).
Pros: smooth and convex, thus benign for optimization.
Cons: doesn’t promote sparsity (no explicit feature selection).
Cons: only encodes trivial prior knowledge.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 29 / 128
![Page 49: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/49.jpg)
Classical Regularizers: Ridge
Regularized parameter estimate: w = arg minw
N∑n=1
L(w; xn, yn) + Ω(w)
Arguably, the most classical choice: squared `2 norm: Ω(w) =λ
2‖w‖2
2
Corresponds to zero-mean Gaussian prior p(w) ∝ exp(−λ
2‖w‖22
)Ridge regression (SE loss): Hoerl and Kennard (1962 and 1970).
Ridge logistic regression: Schaefer et al. (1984), Cessie andHouwelingen (1992); in NLP: Chen and Rosenfeld (1999).
Closely related to Tikhonov (1943) and Wiener (1949).
Pros: smooth and convex, thus benign for optimization.
Cons: doesn’t promote sparsity (no explicit feature selection).
Cons: only encodes trivial prior knowledge.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 29 / 128
![Page 50: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/50.jpg)
Classical Regularizers: Ridge
Regularized parameter estimate: w = arg minw
N∑n=1
L(w; xn, yn) + Ω(w)
Arguably, the most classical choice: squared `2 norm: Ω(w) =λ
2‖w‖2
2
Corresponds to zero-mean Gaussian prior p(w) ∝ exp(−λ
2‖w‖22
)Ridge regression (SE loss): Hoerl and Kennard (1962 and 1970).
Ridge logistic regression: Schaefer et al. (1984), Cessie andHouwelingen (1992); in NLP: Chen and Rosenfeld (1999).
Closely related to Tikhonov (1943) and Wiener (1949).
Pros: smooth and convex, thus benign for optimization.
Cons: doesn’t promote sparsity (no explicit feature selection).
Cons: only encodes trivial prior knowledge.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 29 / 128
![Page 51: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/51.jpg)
Classical Regularizers: Ridge
Regularized parameter estimate: w = arg minw
N∑n=1
L(w; xn, yn) + Ω(w)
Arguably, the most classical choice: squared `2 norm: Ω(w) =λ
2‖w‖2
2
Corresponds to zero-mean Gaussian prior p(w) ∝ exp(−λ
2‖w‖22
)Ridge regression (SE loss): Hoerl and Kennard (1962 and 1970).
Ridge logistic regression: Schaefer et al. (1984), Cessie andHouwelingen (1992); in NLP: Chen and Rosenfeld (1999).
Closely related to Tikhonov (1943) and Wiener (1949).
Pros: smooth and convex, thus benign for optimization.
Cons: doesn’t promote sparsity (no explicit feature selection).
Cons: only encodes trivial prior knowledge.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 29 / 128
![Page 52: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/52.jpg)
Classical Regularizers: Ridge
Regularized parameter estimate: w = arg minw
N∑n=1
L(w; xn, yn) + Ω(w)
Arguably, the most classical choice: squared `2 norm: Ω(w) =λ
2‖w‖2
2
Corresponds to zero-mean Gaussian prior p(w) ∝ exp(−λ
2‖w‖22
)Ridge regression (SE loss): Hoerl and Kennard (1962 and 1970).
Ridge logistic regression: Schaefer et al. (1984), Cessie andHouwelingen (1992); in NLP: Chen and Rosenfeld (1999).
Closely related to Tikhonov (1943) and Wiener (1949).
Pros: smooth and convex, thus benign for optimization.
Cons: doesn’t promote sparsity (no explicit feature selection).
Cons: only encodes trivial prior knowledge.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 29 / 128
![Page 53: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/53.jpg)
Classical Regularizers: Ridge
Regularized parameter estimate: w = arg minw
N∑n=1
L(w; xn, yn) + Ω(w)
Arguably, the most classical choice: squared `2 norm: Ω(w) =λ
2‖w‖2
2
Corresponds to zero-mean Gaussian prior p(w) ∝ exp(−λ
2‖w‖22
)Ridge regression (SE loss): Hoerl and Kennard (1962 and 1970).
Ridge logistic regression: Schaefer et al. (1984), Cessie andHouwelingen (1992); in NLP: Chen and Rosenfeld (1999).
Closely related to Tikhonov (1943) and Wiener (1949).
Pros: smooth and convex, thus benign for optimization.
Cons: doesn’t promote sparsity (no explicit feature selection).
Cons: only encodes trivial prior knowledge.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 29 / 128
![Page 54: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/54.jpg)
Classical Regularizers: Ridge
Regularized parameter estimate: w = arg minw
N∑n=1
L(w; xn, yn) + Ω(w)
Arguably, the most classical choice: squared `2 norm: Ω(w) =λ
2‖w‖2
2
Corresponds to zero-mean Gaussian prior p(w) ∝ exp(−λ
2‖w‖22
)Ridge regression (SE loss): Hoerl and Kennard (1962 and 1970).
Ridge logistic regression: Schaefer et al. (1984), Cessie andHouwelingen (1992); in NLP: Chen and Rosenfeld (1999).
Closely related to Tikhonov (1943) and Wiener (1949).
Pros: smooth and convex, thus benign for optimization.
Cons: doesn’t promote sparsity (no explicit feature selection).
Cons: only encodes trivial prior knowledge.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 29 / 128
![Page 55: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/55.jpg)
Classical Regularizers: Lasso
Regularized parameter estimate: w = arg minw
N∑n=1
L(w; xn, yn) + Ω(w)
The new classic is the `1 norm: Ω(w) = λ‖w‖1 = λ
D∑i=1
|wi |.
Corresponds to zero-mean Laplacian prior p(wi ) ∝ exp (−λ|wi |)
Best known as: least absolute shrinkage and selection operator(Lasso) (Tibshirani, 1996).
Used earlier in signal processing (Claerbout and Muir, 1973; Tayloret al., 1979) neural networks (Williams, 1995),...
In NLP: Kazama and Tsujii (2003); Goodman (2004).
Pros: encourages sparsity: embedded feature selection.
Cons: convex, but non-smooth: challenging optimization.
Cons: only encodes trivial prior knowledge.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 30 / 128
![Page 56: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/56.jpg)
Classical Regularizers: Lasso
Regularized parameter estimate: w = arg minw
N∑n=1
L(w; xn, yn) + Ω(w)
The new classic is the `1 norm: Ω(w) = λ‖w‖1 = λ
D∑i=1
|wi |.
Corresponds to zero-mean Laplacian prior p(wi ) ∝ exp (−λ|wi |)Best known as: least absolute shrinkage and selection operator(Lasso) (Tibshirani, 1996).
Used earlier in signal processing (Claerbout and Muir, 1973; Tayloret al., 1979) neural networks (Williams, 1995),...
In NLP: Kazama and Tsujii (2003); Goodman (2004).
Pros: encourages sparsity: embedded feature selection.
Cons: convex, but non-smooth: challenging optimization.
Cons: only encodes trivial prior knowledge.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 30 / 128
![Page 57: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/57.jpg)
Classical Regularizers: Lasso
Regularized parameter estimate: w = arg minw
N∑n=1
L(w; xn, yn) + Ω(w)
The new classic is the `1 norm: Ω(w) = λ‖w‖1 = λ
D∑i=1
|wi |.
Corresponds to zero-mean Laplacian prior p(wi ) ∝ exp (−λ|wi |)Best known as: least absolute shrinkage and selection operator(Lasso) (Tibshirani, 1996).
Used earlier in signal processing (Claerbout and Muir, 1973; Tayloret al., 1979) neural networks (Williams, 1995),...
In NLP: Kazama and Tsujii (2003); Goodman (2004).
Pros: encourages sparsity: embedded feature selection.
Cons: convex, but non-smooth: challenging optimization.
Cons: only encodes trivial prior knowledge.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 30 / 128
![Page 58: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/58.jpg)
Classical Regularizers: Lasso
Regularized parameter estimate: w = arg minw
N∑n=1
L(w; xn, yn) + Ω(w)
The new classic is the `1 norm: Ω(w) = λ‖w‖1 = λ
D∑i=1
|wi |.
Corresponds to zero-mean Laplacian prior p(wi ) ∝ exp (−λ|wi |)Best known as: least absolute shrinkage and selection operator(Lasso) (Tibshirani, 1996).
Used earlier in signal processing (Claerbout and Muir, 1973; Tayloret al., 1979) neural networks (Williams, 1995),...
In NLP: Kazama and Tsujii (2003); Goodman (2004).
Pros: encourages sparsity: embedded feature selection.
Cons: convex, but non-smooth: challenging optimization.
Cons: only encodes trivial prior knowledge.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 30 / 128
![Page 59: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/59.jpg)
Classical Regularizers: Lasso
Regularized parameter estimate: w = arg minw
N∑n=1
L(w; xn, yn) + Ω(w)
The new classic is the `1 norm: Ω(w) = λ‖w‖1 = λ
D∑i=1
|wi |.
Corresponds to zero-mean Laplacian prior p(wi ) ∝ exp (−λ|wi |)Best known as: least absolute shrinkage and selection operator(Lasso) (Tibshirani, 1996).
Used earlier in signal processing (Claerbout and Muir, 1973; Tayloret al., 1979) neural networks (Williams, 1995),...
In NLP: Kazama and Tsujii (2003); Goodman (2004).
Pros: encourages sparsity: embedded feature selection.
Cons: convex, but non-smooth: challenging optimization.
Cons: only encodes trivial prior knowledge.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 30 / 128
![Page 60: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/60.jpg)
Classical Regularizers: Lasso
Regularized parameter estimate: w = arg minw
N∑n=1
L(w; xn, yn) + Ω(w)
The new classic is the `1 norm: Ω(w) = λ‖w‖1 = λ
D∑i=1
|wi |.
Corresponds to zero-mean Laplacian prior p(wi ) ∝ exp (−λ|wi |)Best known as: least absolute shrinkage and selection operator(Lasso) (Tibshirani, 1996).
Used earlier in signal processing (Claerbout and Muir, 1973; Tayloret al., 1979) neural networks (Williams, 1995),...
In NLP: Kazama and Tsujii (2003); Goodman (2004).
Pros: encourages sparsity: embedded feature selection.
Cons: convex, but non-smooth: challenging optimization.
Cons: only encodes trivial prior knowledge.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 30 / 128
![Page 61: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/61.jpg)
Classical Regularizers: Lasso
Regularized parameter estimate: w = arg minw
N∑n=1
L(w; xn, yn) + Ω(w)
The new classic is the `1 norm: Ω(w) = λ‖w‖1 = λ
D∑i=1
|wi |.
Corresponds to zero-mean Laplacian prior p(wi ) ∝ exp (−λ|wi |)Best known as: least absolute shrinkage and selection operator(Lasso) (Tibshirani, 1996).
Used earlier in signal processing (Claerbout and Muir, 1973; Tayloret al., 1979) neural networks (Williams, 1995),...
In NLP: Kazama and Tsujii (2003); Goodman (2004).
Pros: encourages sparsity: embedded feature selection.
Cons: convex, but non-smooth: challenging optimization.
Cons: only encodes trivial prior knowledge.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 30 / 128
![Page 62: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/62.jpg)
The Lasso and SparsityWhy does the Lasso yield sparsity?
The simplest case:
w = arg minw
1
2(w − y)2 + λ|w | = soft(y , λ) =
y − λ ⇐ y > λ0 ⇐ |y | ≤ λy + λ ⇐ y < −λ
Contrast with the squared `2 (ridge) regularizer (linear scaling):
w = arg minw
1
2(w − y)2 +
λ
2w 2 =
1
1 + λy
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 31 / 128
![Page 63: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/63.jpg)
The Lasso and SparsityWhy does the Lasso yield sparsity?
The simplest case:
w = arg minw
1
2(w − y)2 + λ|w | = soft(y , λ) =
y − λ ⇐ y > λ0 ⇐ |y | ≤ λy + λ ⇐ y < −λ
Contrast with the squared `2 (ridge) regularizer (linear scaling):
w = arg minw
1
2(w − y)2 +
λ
2w 2 =
1
1 + λy
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 31 / 128
![Page 64: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/64.jpg)
The Lasso and SparsityWhy does the Lasso yield sparsity?
The simplest case:
w = arg minw
1
2(w − y)2 + λ|w | = soft(y , λ) =
y − λ ⇐ y > λ0 ⇐ |y | ≤ λy + λ ⇐ y < −λ
Contrast with the squared `2 (ridge) regularizer (linear scaling):
w = arg minw
1
2(w − y)2 +
λ
2w 2 =
1
1 + λy
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 31 / 128
![Page 65: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/65.jpg)
The Lasso and Sparsity (II)
Why does the Lasso yield sparsity?
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 32 / 128
![Page 66: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/66.jpg)
The Lasso and Sparsity (II)
Why does the Lasso yield sparsity?
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 32 / 128
![Page 67: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/67.jpg)
Norms: A Quick Review
A norm is a function satisfying:
‖αw‖ = |α|‖w‖, for any w (homogeneity);
‖w + w′‖ ≤ ‖w‖+ ‖w′‖, for any w,w′ (triangle inequality);
‖w‖ = 0 if and only if w = 0.
Examples of norms:
‖w‖1 = (∑
i |wi |)1 =∑
i |wi |.
‖w‖2 =(∑
i |wi |2)1/2
=√∑
i |wi |2.
‖w‖p = (∑
i |wi |p)1/p (called `p norm, for p ≥ 1).
‖w‖∞ = limp→∞
‖w‖p = max|wi |, i = 1, ...,D
Fact: all norms are convex.
Also important (but not a norm): ‖w‖0 = limp→0‖w‖p
p = |i : wi 6= 0|
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 33 / 128
![Page 68: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/68.jpg)
Norms: A Quick Review
A norm is a function satisfying:
‖αw‖ = |α|‖w‖, for any w (homogeneity);
‖w + w′‖ ≤ ‖w‖+ ‖w′‖, for any w,w′ (triangle inequality);
‖w‖ = 0 if and only if w = 0.
Examples of norms:
‖w‖1 = (∑
i |wi |)1 =∑
i |wi |.
‖w‖2 =(∑
i |wi |2)1/2
=√∑
i |wi |2.
‖w‖p = (∑
i |wi |p)1/p (called `p norm, for p ≥ 1).
‖w‖∞ = limp→∞
‖w‖p = max|wi |, i = 1, ...,D
Fact: all norms are convex.
Also important (but not a norm): ‖w‖0 = limp→0‖w‖p
p = |i : wi 6= 0|
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 33 / 128
![Page 69: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/69.jpg)
Norms: A Quick Review
A norm is a function satisfying:
‖αw‖ = |α|‖w‖, for any w (homogeneity);
‖w + w′‖ ≤ ‖w‖+ ‖w′‖, for any w,w′ (triangle inequality);
‖w‖ = 0 if and only if w = 0.
Examples of norms:
‖w‖1 = (∑
i |wi |)1 =∑
i |wi |.
‖w‖2 =(∑
i |wi |2)1/2
=√∑
i |wi |2.
‖w‖p = (∑
i |wi |p)1/p (called `p norm, for p ≥ 1).
‖w‖∞ = limp→∞
‖w‖p = max|wi |, i = 1, ...,D
Fact: all norms are convex.
Also important (but not a norm): ‖w‖0 = limp→0‖w‖p
p = |i : wi 6= 0|
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 33 / 128
![Page 70: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/70.jpg)
Norms: A Quick Review
A norm is a function satisfying:
‖αw‖ = |α|‖w‖, for any w (homogeneity);
‖w + w′‖ ≤ ‖w‖+ ‖w′‖, for any w,w′ (triangle inequality);
‖w‖ = 0 if and only if w = 0.
Examples of norms:
‖w‖1 = (∑
i |wi |)1 =∑
i |wi |.
‖w‖2 =(∑
i |wi |2)1/2
=√∑
i |wi |2.
‖w‖p = (∑
i |wi |p)1/p (called `p norm, for p ≥ 1).
‖w‖∞ = limp→∞
‖w‖p = max|wi |, i = 1, ...,D
Fact: all norms are convex.
Also important (but not a norm): ‖w‖0 = limp→0‖w‖p
p = |i : wi 6= 0|
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 33 / 128
![Page 71: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/71.jpg)
Norms: A Quick Review
A norm is a function satisfying:
‖αw‖ = |α|‖w‖, for any w (homogeneity);
‖w + w′‖ ≤ ‖w‖+ ‖w′‖, for any w,w′ (triangle inequality);
‖w‖ = 0 if and only if w = 0.
Examples of norms:
‖w‖1 = (∑
i |wi |)1 =∑
i |wi |.
‖w‖2 =(∑
i |wi |2)1/2
=√∑
i |wi |2.
‖w‖p = (∑
i |wi |p)1/p (called `p norm, for p ≥ 1).
‖w‖∞ = limp→∞
‖w‖p = max|wi |, i = 1, ...,D
Fact: all norms are convex.
Also important (but not a norm): ‖w‖0 = limp→0‖w‖p
p = |i : wi 6= 0|
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 33 / 128
![Page 72: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/72.jpg)
Norms: A Quick Review
A norm is a function satisfying:
‖αw‖ = |α|‖w‖, for any w (homogeneity);
‖w + w′‖ ≤ ‖w‖+ ‖w′‖, for any w,w′ (triangle inequality);
‖w‖ = 0 if and only if w = 0.
Examples of norms:
‖w‖1 = (∑
i |wi |)1 =∑
i |wi |.
‖w‖2 =(∑
i |wi |2)1/2
=√∑
i |wi |2.
‖w‖p = (∑
i |wi |p)1/p (called `p norm, for p ≥ 1).
‖w‖∞ = limp→∞
‖w‖p = max|wi |, i = 1, ...,D
Fact: all norms are convex.
Also important (but not a norm): ‖w‖0 = limp→0‖w‖p
p = |i : wi 6= 0|
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 33 / 128
![Page 73: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/73.jpg)
Norms: A Quick Review
A norm is a function satisfying:
‖αw‖ = |α|‖w‖, for any w (homogeneity);
‖w + w′‖ ≤ ‖w‖+ ‖w′‖, for any w,w′ (triangle inequality);
‖w‖ = 0 if and only if w = 0.
Examples of norms:
‖w‖1 = (∑
i |wi |)1 =∑
i |wi |.
‖w‖2 =(∑
i |wi |2)1/2
=√∑
i |wi |2.
‖w‖p = (∑
i |wi |p)1/p (called `p norm, for p ≥ 1).
‖w‖∞ = limp→∞
‖w‖p = max|wi |, i = 1, ...,D
Fact: all norms are convex.
Also important (but not a norm): ‖w‖0 = limp→0‖w‖p
p = |i : wi 6= 0|
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 33 / 128
![Page 74: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/74.jpg)
Norms: A Quick Review
A norm is a function satisfying:
‖αw‖ = |α|‖w‖, for any w (homogeneity);
‖w + w′‖ ≤ ‖w‖+ ‖w′‖, for any w,w′ (triangle inequality);
‖w‖ = 0 if and only if w = 0.
Examples of norms:
‖w‖1 = (∑
i |wi |)1 =∑
i |wi |.
‖w‖2 =(∑
i |wi |2)1/2
=√∑
i |wi |2.
‖w‖p = (∑
i |wi |p)1/p (called `p norm, for p ≥ 1).
‖w‖∞ = limp→∞
‖w‖p = max|wi |, i = 1, ...,D
Fact: all norms are convex.
Also important (but not a norm): ‖w‖0 = limp→0‖w‖p
p = |i : wi 6= 0|
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 33 / 128
![Page 75: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/75.jpg)
Norms: A Quick Review
A norm is a function satisfying:
‖αw‖ = |α|‖w‖, for any w (homogeneity);
‖w + w′‖ ≤ ‖w‖+ ‖w′‖, for any w,w′ (triangle inequality);
‖w‖ = 0 if and only if w = 0.
Examples of norms:
‖w‖1 = (∑
i |wi |)1 =∑
i |wi |.
‖w‖2 =(∑
i |wi |2)1/2
=√∑
i |wi |2.
‖w‖p = (∑
i |wi |p)1/p (called `p norm, for p ≥ 1).
‖w‖∞ = limp→∞
‖w‖p = max|wi |, i = 1, ...,D
Fact: all norms are convex.
Also important (but not a norm): ‖w‖0 = limp→0‖w‖p
p = |i : wi 6= 0|
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 33 / 128
![Page 76: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/76.jpg)
Norms: A Quick Review
A norm is a function satisfying:
‖αw‖ = |α|‖w‖, for any w (homogeneity);
‖w + w′‖ ≤ ‖w‖+ ‖w′‖, for any w,w′ (triangle inequality);
‖w‖ = 0 if and only if w = 0.
Examples of norms:
‖w‖1 = (∑
i |wi |)1 =∑
i |wi |.
‖w‖2 =(∑
i |wi |2)1/2
=√∑
i |wi |2.
‖w‖p = (∑
i |wi |p)1/p (called `p norm, for p ≥ 1).
‖w‖∞ = limp→∞
‖w‖p = max|wi |, i = 1, ...,D
Fact: all norms are convex.
Also important (but not a norm): ‖w‖0 = limp→0‖w‖p
p = |i : wi 6= 0|
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 33 / 128
![Page 77: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/77.jpg)
Relationship Between `1 and `0
The `0 “norm” (number of non-zeros): ‖w‖0 = |i : wi 6= 0|.Not convex, but...
w = arg minw
1
2(w − y)2 + λ|w |0 = hard(y ,
√2λ) =
y ⇐ |y | >
√2λ
0 ⇐ |y | ≤√
2λ
The “ideal” feature selection criterion (best subset):
w = arg minw
N∑n=1
L(w; xn, yn)
subject to ‖w‖0 ≤ τ (limit the number of features)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 34 / 128
![Page 78: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/78.jpg)
Relationship Between `1 and `0
The `0 “norm” (number of non-zeros): ‖w‖0 = |i : wi 6= 0|.Not convex, but...
w = arg minw
1
2(w − y)2 + λ|w |0 = hard(y ,
√2λ) =
y ⇐ |y | >
√2λ
0 ⇐ |y | ≤√
2λ
The “ideal” feature selection criterion (best subset):
w = arg minw
N∑n=1
L(w; xn, yn)
subject to ‖w‖0 ≤ τ (limit the number of features)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 34 / 128
![Page 79: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/79.jpg)
Relationship Between `1 and `0
The `0 “norm” (number of non-zeros): ‖w‖0 = |i : wi 6= 0|.Not convex, but...
w = arg minw
1
2(w − y)2 + λ|w |0 = hard(y ,
√2λ) =
y ⇐ |y | >
√2λ
0 ⇐ |y | ≤√
2λ
The “ideal” feature selection criterion (best subset):
w = arg minw
N∑n=1
L(w; xn, yn)
subject to ‖w‖0 ≤ τ (limit the number of features)Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 34 / 128
![Page 80: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/80.jpg)
Relationship Between `1 and `0 (II)The best subset selection problem
is NP-hard Amaldi and Kann(1998)(Davis et al., 1997).
w = arg minw
N∑n=1
L(w; xn, yn)
subject to ‖w‖0 ≤ τ
A closely related problem, also NP-hard (Muthukrishnan, 2005).
w = arg minw‖w‖0
subject toN∑
n=1
L(w; xn, yn) ≤ δ
In some cases, one may replace `0 with `1 and obtain “similar” results:
central issue in compressive sensing (CS) (Candes et al., 2006; Donoho,
2006)
.Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 35 / 128
![Page 81: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/81.jpg)
Relationship Between `1 and `0 (II)The best subset selection problem is NP-hard Amaldi and Kann(1998)(Davis et al., 1997).
w = arg minw
N∑n=1
L(w; xn, yn)
subject to ‖w‖0 ≤ τ
A closely related problem, also NP-hard (Muthukrishnan, 2005).
w = arg minw‖w‖0
subject toN∑
n=1
L(w; xn, yn) ≤ δ
In some cases, one may replace `0 with `1 and obtain “similar” results:
central issue in compressive sensing (CS) (Candes et al., 2006; Donoho,
2006)
.Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 35 / 128
![Page 82: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/82.jpg)
Relationship Between `1 and `0 (II)The best subset selection problem is NP-hard Amaldi and Kann(1998)(Davis et al., 1997).
w = arg minw
N∑n=1
L(w; xn, yn)
subject to ‖w‖0 ≤ τ
A closely related problem,
also NP-hard (Muthukrishnan, 2005).
w = arg minw‖w‖0
subject toN∑
n=1
L(w; xn, yn) ≤ δ
In some cases, one may replace `0 with `1 and obtain “similar” results:
central issue in compressive sensing (CS) (Candes et al., 2006; Donoho,
2006)
.Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 35 / 128
![Page 83: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/83.jpg)
Relationship Between `1 and `0 (II)The best subset selection problem is NP-hard Amaldi and Kann(1998)(Davis et al., 1997).
w = arg minw
N∑n=1
L(w; xn, yn)
subject to ‖w‖0 ≤ τ
A closely related problem, also NP-hard (Muthukrishnan, 2005).
w = arg minw‖w‖0
subject toN∑
n=1
L(w; xn, yn) ≤ δ
In some cases, one may replace `0 with `1 and obtain “similar” results:
central issue in compressive sensing (CS) (Candes et al., 2006; Donoho,
2006)
.Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 35 / 128
![Page 84: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/84.jpg)
Relationship Between `1 and `0 (II)The best subset selection problem is NP-hard Amaldi and Kann(1998)(Davis et al., 1997).
w = arg minw
N∑n=1
L(w; xn, yn)
subject to ‖w‖0 ≤ τ
A closely related problem, also NP-hard (Muthukrishnan, 2005).
w = arg minw‖w‖0
subject toN∑
n=1
L(w; xn, yn) ≤ δ
In some cases, one may replace `0 with `1 and obtain “similar” results:
central issue in compressive sensing (CS) (Candes et al., 2006; Donoho,
2006).Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 35 / 128
![Page 85: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/85.jpg)
Take-Home Messages
Sparsity is desirable for interpretability, computational savings, andgeneralization
`1-regularization gives an embedded method for feature selection
Another view of `1: a convex surrogate for direct penalization ofcardinality (`0)
There are compelling algorithmic reasons for using convex surrogateslike `1
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 36 / 128
![Page 86: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/86.jpg)
Outline
1 Introduction
2 Loss Functions and Sparsity
3 Structured Sparsity
4 Algorithms
Batch Algorithms
Online Algorithms
5 Applications
6 Conclusions
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 37 / 128
![Page 87: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/87.jpg)
Models
`1 regularization promotes sparse models
A very simple sparsity pattern: prefer models with small cardinality
Our main question: how can we promote less trivial sparsity patterns?
We’ll talk about structured sparsity and group-Lasso regularization.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 38 / 128
![Page 88: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/88.jpg)
Models
`1 regularization promotes sparse models
A very simple sparsity pattern: prefer models with small cardinality
Our main question: how can we promote less trivial sparsity patterns?
We’ll talk about structured sparsity and group-Lasso regularization.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 38 / 128
![Page 89: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/89.jpg)
Models
`1 regularization promotes sparse models
A very simple sparsity pattern: prefer models with small cardinality
Our main question: how can we promote less trivial sparsity patterns?
We’ll talk about structured sparsity and group-Lasso regularization.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 38 / 128
![Page 90: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/90.jpg)
Structured Sparsity and Groups
Main goal: promote structural patterns, not just penalize cardinality
Group sparsity: discard entire groups of features
density inside each group
sparsity with respect to the groups which are selected
choice of groups: prior knowledge about the intended sparsity patterns
Leads to statistical gains if the prior assumptions are correct (Stojnicet al., 2009)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 39 / 128
![Page 91: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/91.jpg)
Structured Sparsity and Groups
Main goal: promote structural patterns, not just penalize cardinality
Group sparsity: discard entire groups of features
density inside each group
sparsity with respect to the groups which are selected
choice of groups: prior knowledge about the intended sparsity patterns
Leads to statistical gains if the prior assumptions are correct (Stojnicet al., 2009)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 39 / 128
![Page 92: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/92.jpg)
Structured Sparsity and Groups
Main goal: promote structural patterns, not just penalize cardinality
Group sparsity: discard entire groups of features
density inside each group
sparsity with respect to the groups which are selected
choice of groups: prior knowledge about the intended sparsity patterns
Leads to statistical gains if the prior assumptions are correct (Stojnicet al., 2009)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 39 / 128
![Page 93: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/93.jpg)
Tons of Uses
feature template selection (Martins et al., 2011b)
multi-task learning (Caruana, 1997; Obozinski et al., 2010)
multiple kernel learning (Lanckriet et al., 2004)
learning the structure of graphical models (Schmidt and Murphy,2010)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 40 / 128
![Page 94: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/94.jpg)
“Grid” Sparsity
For feature spaces that can be arranged as a grid (examples next)
Goal: push entire columns to have zero weights
The groups are the columns of the grid
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 41 / 128
![Page 95: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/95.jpg)
“Grid” Sparsity
For feature spaces that can be arranged as a grid (examples next)
Goal: push entire columns to have zero weights
The groups are the columns of the grid
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 41 / 128
![Page 96: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/96.jpg)
“Grid” Sparsity
For feature spaces that can be arranged as a grid (examples next)
Goal: push entire columns to have zero weights
The groups are the columns of the grid
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 41 / 128
![Page 97: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/97.jpg)
Example 1: Sparsity with Multiple Classes
Assume the feature map decomposes as f(x , y) = f(x)⊗ ey
In words: we’re conjoining each input feature with each output class
input features
labels
“Standard” sparsity is wasteful—we still need to hash all the input features
What we want: discard some input features, along with each class theyconjoin with
Solution: one group per input feature
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 42 / 128
![Page 98: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/98.jpg)
Example 1: Sparsity with Multiple Classes
Assume the feature map decomposes as f(x , y) = f(x)⊗ ey
In words: we’re conjoining each input feature with each output class
input features
labels
“Standard” sparsity is wasteful—we still need to hash all the input features
What we want: discard some input features, along with each class theyconjoin with
Solution: one group per input feature
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 42 / 128
![Page 99: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/99.jpg)
Example 2: Multi-Task Learning(Caruana, 1997; Obozinski et al., 2010)
Same thing, except now rows are tasks and columns are features
shared features
task
s
What we want: discard features that are irrelevant for all tasks
Solution: one group per feature
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 43 / 128
![Page 100: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/100.jpg)
Example 2: Multi-Task Learning(Caruana, 1997; Obozinski et al., 2010)
Same thing, except now rows are tasks and columns are features
shared features
task
s
What we want: discard features that are irrelevant for all tasks
Solution: one group per feature
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 43 / 128
![Page 101: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/101.jpg)
Group Sparsity
D features
M groups G1, . . . ,GM , eachGm ⊆ 1, . . . ,Dparameter subvectors w1, . . . ,wM
Group-Lasso (Bakin, 1999; Yuan and Lin, 2006):
Ω(w) =∑M
m=1 ‖wm‖2
Intuitively: the `1 norm of the `2 norms
Technically, still a norm (called a mixed norm, denoted `2,1)
λm: prior weight for group Gm (different groups have different sizes)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 44 / 128
![Page 102: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/102.jpg)
Group Sparsity
D features
M groups G1, . . . ,GM , eachGm ⊆ 1, . . . ,Dparameter subvectors w1, . . . ,wM
Group-Lasso (Bakin, 1999; Yuan and Lin, 2006):
Ω(w) =∑M
m=1 ‖wm‖2
Intuitively: the `1 norm of the `2 norms
Technically, still a norm (called a mixed norm, denoted `2,1)
λm: prior weight for group Gm (different groups have different sizes)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 44 / 128
![Page 103: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/103.jpg)
Group Sparsity
D features
M groups G1, . . . ,GM , eachGm ⊆ 1, . . . ,Dparameter subvectors w1, . . . ,wM
Group-Lasso (Bakin, 1999; Yuan and Lin, 2006):
Ω(w) =∑M
m=1 ‖wm‖2
Intuitively: the `1 norm of the `2 norms
Technically, still a norm (called a mixed norm, denoted `2,1)
λm: prior weight for group Gm (different groups have different sizes)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 44 / 128
![Page 104: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/104.jpg)
Group Sparsity
D features
M groups G1, . . . ,GM , eachGm ⊆ 1, . . . ,Dparameter subvectors w1, . . . ,wM
Group-Lasso (Bakin, 1999; Yuan and Lin, 2006):
Ω(w) =∑M
m=1 ‖wm‖2
Intuitively: the `1 norm of the `2 norms
Technically, still a norm (called a mixed norm, denoted `2,1)
λm: prior weight for group Gm (different groups have different sizes)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 44 / 128
![Page 105: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/105.jpg)
Group Sparsity
D features
M groups G1, . . . ,GM , eachGm ⊆ 1, . . . ,Dparameter subvectors w1, . . . ,wM
Group-Lasso (Bakin, 1999; Yuan and Lin, 2006):
Ω(w) =∑M
m=1 λm‖wm‖2
Intuitively: the `1 norm of the `2 norms
Technically, still a norm (called a mixed norm, denoted `2,1)
λm: prior weight for group Gm (different groups have different sizes)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 44 / 128
![Page 106: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/106.jpg)
Regularization Formulations (reminder)
Tikhonov regularization: w = arg minw
Ω(w) +N∑
n=1
L(w; xn, yn)
Ivanov regularization
w = arg minw
N∑n=1
L(w; xn, yn)
subject to Ω(w) ≤ τ
Morozov regularization
w = arg minw
Ω(w)
subject toN∑
n=1
L(w; xn, yn) ≤ δ
Equivalent, under mild conditions (namely convexity).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 45 / 128
![Page 107: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/107.jpg)
Lasso versus group-Lasso
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 46 / 128
![Page 108: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/108.jpg)
Lasso versus group-Lasso
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 46 / 128
![Page 109: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/109.jpg)
Other names, other norms
Statisticians call these composite absolute penalties (Zhao et al., 2009)
In general: the (weighted) `r -norm of the `q-norms (r ≥ 1, q ≥ 1), calledthe mixed `q,r norm
Ω(w) =(∑M
m=1λm‖wm‖rq
)1/r
Group sparsity corresponds to r = 1
This talk: q = 2
However q =∞ is also popular (Quattoni et al., 2009; Graca et al., 2009;
Wright et al., 2009; Eisenstein et al., 2011)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 47 / 128
![Page 110: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/110.jpg)
Other names, other norms
Statisticians call these composite absolute penalties (Zhao et al., 2009)
In general: the (weighted) `r -norm of the `q-norms (r ≥ 1, q ≥ 1), calledthe mixed `q,r norm
Ω(w) =(∑M
m=1λm‖wm‖rq
)1/r
Group sparsity corresponds to r = 1
This talk: q = 2
However q =∞ is also popular (Quattoni et al., 2009; Graca et al., 2009;
Wright et al., 2009; Eisenstein et al., 2011)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 47 / 128
![Page 111: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/111.jpg)
Other names, other norms
Statisticians call these composite absolute penalties (Zhao et al., 2009)
In general: the (weighted) `r -norm of the `q-norms (r ≥ 1, q ≥ 1), calledthe mixed `q,r norm
Ω(w) =(∑M
m=1λm‖wm‖rq
)1/r
Group sparsity corresponds to r = 1
This talk: q = 2
However q =∞ is also popular (Quattoni et al., 2009; Graca et al., 2009;
Wright et al., 2009; Eisenstein et al., 2011)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 47 / 128
![Page 112: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/112.jpg)
Three Scenarios
Non-overlapping Groups
Tree-structured Groups
Graph-structured Groups
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 48 / 128
![Page 113: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/113.jpg)
Three Scenarios
Non-overlapping Groups
Tree-structured Groups
Graph-structured Groups
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 48 / 128
![Page 114: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/114.jpg)
Non-overlapping Groups
Assume G1, . . . ,GM are disjoint
⇒ Each feature belongs to exactly one group
Ω(w) =∑M
m=1 λm‖wm‖2
Trivial choices of groups recover unstructured regularizers:
`2-regularization: one large group G1 = 1, . . . ,D`1-regularization: D singleton groups Gd = d
Examples of non-trivial groups:
label-based groups (groups are columns of a matrix)
template-based groups (next)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 49 / 128
![Page 115: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/115.jpg)
Non-overlapping Groups
Assume G1, . . . ,GM are disjoint
⇒ Each feature belongs to exactly one group
Ω(w) =∑M
m=1 λm‖wm‖2
Trivial choices of groups recover unstructured regularizers:
`2-regularization: one large group G1 = 1, . . . ,D`1-regularization: D singleton groups Gd = d
Examples of non-trivial groups:
label-based groups (groups are columns of a matrix)
template-based groups (next)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 49 / 128
![Page 116: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/116.jpg)
Non-overlapping Groups
Assume G1, . . . ,GM are disjoint
⇒ Each feature belongs to exactly one group
Ω(w) =∑M
m=1 λm‖wm‖2
Trivial choices of groups recover unstructured regularizers:
`2-regularization: one large group G1 = 1, . . . ,D`1-regularization: D singleton groups Gd = d
Examples of non-trivial groups:
label-based groups (groups are columns of a matrix)
template-based groups (next)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 49 / 128
![Page 117: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/117.jpg)
Non-overlapping Groups
Assume G1, . . . ,GM are disjoint
⇒ Each feature belongs to exactly one group
Ω(w) =∑M
m=1 λm‖wm‖2
Trivial choices of groups recover unstructured regularizers:
`2-regularization: one large group G1 = 1, . . . ,D`1-regularization: D singleton groups Gd = d
Examples of non-trivial groups:
label-based groups (groups are columns of a matrix)
template-based groups (next)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 49 / 128
![Page 118: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/118.jpg)
Example: Feature Template Selection
5 5
Input: We want to explore the feature spacePRP VBP TO VB DT NN NN
Output: B-NP B-VP I-VP I-VP B-NP I-NP I-NP
Goal: Select relevant feature templates
⇒ Make each group correspond to a feature template
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 50 / 128
![Page 119: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/119.jpg)
Example: Feature Template Selection
5 5
Input: We want to explore the feature spacePRP VBP TO VB DT NN NN
Output: B-NP B-VP I-VP I-VP B-NP I-NP I-NP
Goal: Select relevant feature templates
⇒ Make each group correspond to a feature template
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 50 / 128
![Page 120: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/120.jpg)
Example: Feature Template Selection
5 5
Input: We want to explore the feature spacePRP VBP TO VB DT NN NN
Output: B-NP B-VP I-VP I-VP B-NP I-NP I-NP
Goal: Select relevant feature templates
⇒ Make each group correspond to a feature template
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 50 / 128
![Page 121: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/121.jpg)
Example: Feature Template Selection
5
5
Input: We want to explore the feature spacePRP VBP TO VB DT NN NN
Output: B-NP B-VP I-VP I-VP B-NP I-NP I-NP
Goal: Select relevant feature templates
⇒ Make each group correspond to a feature template
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 50 / 128
![Page 122: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/122.jpg)
Example: Feature Template Selection
5
5
Input: We want to explore the feature spacePRP VBP TO VB DT NN NN
Output: B-NP B-VP I-VP I-VP B-NP I-NP I-NP
Goal: Select relevant feature templates
⇒ Make each group correspond to a feature template
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 50 / 128
![Page 123: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/123.jpg)
Example: Feature Template Selection
5
5Input: We want to explore the feature space
PRP VBP TO VB DT NN NNOutput: B-NP B-VP I-VP I-VP B-NP I-NP I-NP
Goal: Select relevant feature templates
⇒ Make each group correspond to a feature template
"the feature"
"explore the"
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 50 / 128
![Page 124: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/124.jpg)
Example: Feature Template Selection
5 5
Input: We want to explore the feature spacePRP VBP TO VB DT NN NN
Output: B-NP B-VP I-VP I-VP B-NP I-NP I-NP
Goal: Select relevant feature templates
⇒ Make each group correspond to a feature template
"the feature"
"explore the"
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 50 / 128
![Page 125: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/125.jpg)
Example: Feature Template Selection
5
5
Input: We want to explore the feature spacePRP VBP TO VB DT NN NN
Output: B-NP B-VP I-VP I-VP B-NP I-NP I-NP
Goal: Select relevant feature templates
⇒ Make each group correspond to a feature template
"the feature"
"explore the"
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 50 / 128
![Page 126: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/126.jpg)
Example: Feature Template Selection
5
5Input: We want to explore the feature space
PRP VBP TO VB DT NN NNOutput: B-NP B-VP I-VP I-VP B-NP I-NP I-NP
Goal: Select relevant feature templates
⇒ Make each group correspond to a feature template
"the feature"
"explore the"
"DT NN NN"
"VB DT NN"
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 50 / 128
![Page 127: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/127.jpg)
Example: Feature Template Selection
5 5
Input: We want to explore the feature spacePRP VBP TO VB DT NN NN
Output: B-NP B-VP I-VP I-VP B-NP I-NP I-NP
Goal: Select relevant feature templates
⇒ Make each group correspond to a feature template
"the feature"
"explore the"
"DT NN NN"
"VB DT NN"
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 50 / 128
![Page 128: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/128.jpg)
Example: Feature Template Selection
5 5
Input: We want to explore the feature spacePRP VBP TO VB DT NN NN
Output: B-NP B-VP I-VP I-VP B-NP I-NP I-NP
Goal: Select relevant feature templates
⇒ Make each group correspond to a feature template
"DT NN NN"
"VB DT NN"
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 50 / 128
![Page 129: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/129.jpg)
Example: Feature Template Selection
5 5
Input: We want to explore the feature spacePRP VBP TO VB DT NN NN
Output: B-NP B-VP I-VP I-VP B-NP I-NP I-NP
Goal: Select relevant feature templates
⇒ Make each group correspond to a feature template
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 50 / 128
![Page 130: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/130.jpg)
Example: Feature Template Selection
5 5
Input: We want to explore the feature spacePRP VBP TO VB DT NN NN
Output: B-NP B-VP I-VP I-VP B-NP I-NP I-NP
Goal: Select relevant feature templates
⇒ Make each group correspond to a feature template
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 50 / 128
![Page 131: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/131.jpg)
Example: Feature Template Selection
5 5
Input: We want to explore the feature spacePRP VBP TO VB DT NN NN
Output: B-NP B-VP I-VP I-VP B-NP I-NP I-NP
Goal: Select relevant feature templates
⇒ Make each group correspond to a feature template
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 50 / 128
![Page 132: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/132.jpg)
Three Scenarios
Non-overlapping Groups
Tree-structured Groups
Graph-structured Groups
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 51 / 128
![Page 133: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/133.jpg)
Three Scenarios
Non-overlapping Groups
Tree-structured Groups
Graph-structured Groups
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 51 / 128
![Page 134: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/134.jpg)
Tree-Structured Groups
Assumption: if two groups overlap, one is contained in the other
⇒ hierarchical structure (Kim and Xing, 2010; Mairal et al., 2010)
What is the sparsity pattern?
If a group is discarded, all its descendants are also discarded
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 52 / 128
![Page 135: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/135.jpg)
Tree-Structured Groups
Assumption: if two groups overlap, one is contained in the other
⇒ hierarchical structure (Kim and Xing, 2010; Mairal et al., 2010)
What is the sparsity pattern?
If a group is discarded, all its descendants are also discarded
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 52 / 128
![Page 136: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/136.jpg)
Tree-Structured Groups
Assumption: if two groups overlap, one is contained in the other
⇒ hierarchical structure (Kim and Xing, 2010; Mairal et al., 2010)
What is the sparsity pattern?
If a group is discarded, all its descendants are also discarded
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 52 / 128
![Page 137: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/137.jpg)
Tree-Structured Groups
Assumption: if two groups overlap, one is contained in the other
⇒ hierarchical structure (Kim and Xing, 2010; Mairal et al., 2010)
What is the sparsity pattern?
If a group is discarded, all its descendants are also discarded
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 52 / 128
![Page 138: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/138.jpg)
Tree-Structured Groups
Assumption: if two groups overlap, one is contained in the other
⇒ hierarchical structure (Kim and Xing, 2010; Mairal et al., 2010)
What is the sparsity pattern?
If a group is discarded, all its descendants are also discarded
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 52 / 128
![Page 139: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/139.jpg)
Tree-Structured Groups
Assumption: if two groups overlap, one is contained in the other
⇒ hierarchical structure (Kim and Xing, 2010; Mairal et al., 2010)
What is the sparsity pattern?
If a group is discarded, all its descendants are also discarded
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 52 / 128
![Page 140: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/140.jpg)
Tree-Structured Groups
Assumption: if two groups overlap, one is contained in the other
⇒ hierarchical structure (Kim and Xing, 2010; Mairal et al., 2010)
What is the sparsity pattern?
If a group is discarded, all its descendants are also discarded
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 52 / 128
![Page 141: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/141.jpg)
Three Scenarios
Non-overlapping Groups
Tree-structured Groups
Graph-structured Groups
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 53 / 128
![Page 142: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/142.jpg)
Three Scenarios
Non-overlapping Groups
Tree-structured Groups
Graph-structured Groups
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 53 / 128
![Page 143: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/143.jpg)
Graph-Structured Groups
In general: groups can be represented as a directed acyclic graph
set inclusion induces a partial order on groups (Jenatton et al., 2009)
feature space becomes a poset
sparsity patterns: given by this poset
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 54 / 128
![Page 144: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/144.jpg)
Example: coarse-to-fine regularization
1 Define a partial order between basic feature templates (e.g., p0 w0)
2 Extend this partial order to all templates by lexicographic closure:p0 p0p1 w0w1
Goal: only include finer features if coarser ones are also in the model
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 55 / 128
![Page 145: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/145.jpg)
Things to Keep in Mind
Structured sparsity cares about the structure of the feature space
Group-Lasso regularization generalizes `1 and it’s still convex
Choice of groups: problem dependent, opportunity to use priorknowledge to favour certain structural patterns
Next: algorithms
We’ll see that optimization is easier with non-overlapping ortree-structured groups than with arbitrary overlaps
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 56 / 128
![Page 146: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/146.jpg)
Things to Keep in Mind
Structured sparsity cares about the structure of the feature space
Group-Lasso regularization generalizes `1 and it’s still convex
Choice of groups: problem dependent, opportunity to use priorknowledge to favour certain structural patterns
Next: algorithms
We’ll see that optimization is easier with non-overlapping ortree-structured groups than with arbitrary overlaps
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 56 / 128
![Page 147: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/147.jpg)
Things to Keep in Mind
Structured sparsity cares about the structure of the feature space
Group-Lasso regularization generalizes `1 and it’s still convex
Choice of groups: problem dependent, opportunity to use priorknowledge to favour certain structural patterns
Next: algorithms
We’ll see that optimization is easier with non-overlapping ortree-structured groups than with arbitrary overlaps
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 56 / 128
![Page 148: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/148.jpg)
Outline
1 Introduction
2 Loss Functions and Sparsity
3 Structured Sparsity
4 Algorithms
Batch Algorithms
Online Algorithms
5 Applications
6 Conclusions
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 57 / 128
![Page 149: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/149.jpg)
Learning the Model
Recall that learning involves solving
minw
Ω(w)︸ ︷︷ ︸regularizer
+N∑
i=1
L(w, xi , yi )︸ ︷︷ ︸total loss
,
We’ll address two kinds of optimization algorithms:
batch algorithms (attacks the complete problem);
online algorithms (uses the training examples one by one)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 58 / 128
![Page 150: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/150.jpg)
Learning the Model
Recall that learning involves solving
minw
Ω(w)︸ ︷︷ ︸regularizer
+N∑
i=1
L(w, xi , yi )︸ ︷︷ ︸total loss
,
We’ll address two kinds of optimization algorithms:
batch algorithms (attacks the complete problem);
online algorithms (uses the training examples one by one)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 58 / 128
![Page 151: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/151.jpg)
Key Concepts: Convex Functions
f is a convex function if:
∀λ ∈ [0, 1], x and x ′ ∈ domain(f )
f (λx + (1− λ)x ′) ≤ λf (x) + (1− λ)f (x ′)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 59 / 128
![Page 152: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/152.jpg)
Outline
1 Introduction
2 Loss Functions and Sparsity
3 Structured Sparsity
4 Algorithms
Batch Algorithms
Online Algorithms
5 Applications
6 Conclusions
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 60 / 128
![Page 153: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/153.jpg)
Batch Algorithms
Subgradient methods
Proximal methods
Alternating direction method of multipliers
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 61 / 128
![Page 154: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/154.jpg)
Key Concepts: SubgradientsConvexity ⇒ continuity; convexity 6⇒ differentiability (e.g., f (w) = ‖w‖1).
Subgradients generalize gradients for (maybe non-diff.) convex functions:
v is a subgradient of f at x if f (x′) ≥ f (x) + v>(x′ − x)
Subdifferential: ∂f (x) = v : v is a subgradient of f at xIf f is differentiable, ∂f (x) = ∇f (x)
linear lower bound non-differentiable case
Notation: ∇f (x) is a subgradient of f at x
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 62 / 128
![Page 155: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/155.jpg)
Key Concepts: SubgradientsConvexity ⇒ continuity; convexity 6⇒ differentiability (e.g., f (w) = ‖w‖1).
Subgradients generalize gradients for (maybe non-diff.) convex functions:
v is a subgradient of f at x if f (x′) ≥ f (x) + v>(x′ − x)
Subdifferential: ∂f (x) = v : v is a subgradient of f at x
If f is differentiable, ∂f (x) = ∇f (x)
linear lower bound
non-differentiable case
Notation: ∇f (x) is a subgradient of f at x
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 62 / 128
![Page 156: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/156.jpg)
Key Concepts: SubgradientsConvexity ⇒ continuity; convexity 6⇒ differentiability (e.g., f (w) = ‖w‖1).
Subgradients generalize gradients for (maybe non-diff.) convex functions:
v is a subgradient of f at x if f (x′) ≥ f (x) + v>(x′ − x)
Subdifferential: ∂f (x) = v : v is a subgradient of f at xIf f is differentiable, ∂f (x) = ∇f (x)
linear lower bound
non-differentiable case
Notation: ∇f (x) is a subgradient of f at x
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 62 / 128
![Page 157: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/157.jpg)
Key Concepts: SubgradientsConvexity ⇒ continuity; convexity 6⇒ differentiability (e.g., f (w) = ‖w‖1).
Subgradients generalize gradients for (maybe non-diff.) convex functions:
v is a subgradient of f at x if f (x′) ≥ f (x) + v>(x′ − x)
Subdifferential: ∂f (x) = v : v is a subgradient of f at xIf f is differentiable, ∂f (x) = ∇f (x)
linear lower bound non-differentiable case
Notation: ∇f (x) is a subgradient of f at x
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 62 / 128
![Page 158: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/158.jpg)
Key Concepts: SubgradientsConvexity ⇒ continuity; convexity 6⇒ differentiability (e.g., f (w) = ‖w‖1).
Subgradients generalize gradients for (maybe non-diff.) convex functions:
v is a subgradient of f at x if f (x′) ≥ f (x) + v>(x′ − x)
Subdifferential: ∂f (x) = v : v is a subgradient of f at xIf f is differentiable, ∂f (x) = ∇f (x)
linear lower bound non-differentiable case
Notation: ∇f (x) is a subgradient of f at xMartins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 62 / 128
![Page 159: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/159.jpg)
Subgradient Methods
minw Ω(w) + Λ(w), where Λ(w) =∑N
i=1 L(w, xi , yi ) (loss)
Subgradient methods were invented by Shor in the 1970’s (Shor, 1985):
input: stepsize sequence (ηt)Tt=1
initialize wfor t = 1, 2, . . . do
(sub-)gradient step: w ← w − ηt
(∇Ω(w) + ∇Λ(w)
)end for
Key disadvantages:
The step size ηt needs to be annealed for convergence: very slow!
Doesn’t explicitly capture the sparsity promoted by sparse regularizers.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 63 / 128
![Page 160: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/160.jpg)
Subgradient Methods
minw Ω(w) + Λ(w), where Λ(w) =∑N
i=1 L(w, xi , yi ) (loss)
Subgradient methods were invented by Shor in the 1970’s (Shor, 1985):
input: stepsize sequence (ηt)Tt=1
initialize wfor t = 1, 2, . . . do
(sub-)gradient step: w ← w − ηt
(∇Ω(w) + ∇Λ(w)
)end for
Key disadvantages:
The step size ηt needs to be annealed for convergence: very slow!
Doesn’t explicitly capture the sparsity promoted by sparse regularizers.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 63 / 128
![Page 161: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/161.jpg)
Subgradient Methods
minw Ω(w) + Λ(w), where Λ(w) =∑N
i=1 L(w, xi , yi ) (loss)
Subgradient methods were invented by Shor in the 1970’s (Shor, 1985):
input: stepsize sequence (ηt)Tt=1
initialize wfor t = 1, 2, . . . do
(sub-)gradient step: w ← w − ηt
(∇Ω(w) + ∇Λ(w)
)end for
Key disadvantages:
The step size ηt needs to be annealed for convergence: very slow!
Doesn’t explicitly capture the sparsity promoted by sparse regularizers.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 63 / 128
![Page 162: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/162.jpg)
Key Concepts: Proximity OperatorsLet Ω : RD → R be a convex function.
The Ω-proximity operator is the following RD → RD map:
w 7→ proxΩ(w) = arg minu
1
2‖u−w‖2
2 + Ω(u)
...always well defined, because ‖u−w‖22 is strictly convex.
Classical examples:
Squared `2 regularization, Ω(w) = λ2‖w‖
22: scaling operation
proxΩ(w) =1
1 + λw
`1 regularization, Ω(w) = λ‖w‖1: soft-thresholding;
proxΩ(w) = soft(w, λ)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 64 / 128
![Page 163: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/163.jpg)
Key Concepts: Proximity OperatorsLet Ω : RD → R be a convex function.
The Ω-proximity operator is the following RD → RD map:
w 7→ proxΩ(w) = arg minu
1
2‖u−w‖2
2 + Ω(u)
...always well defined, because ‖u−w‖22 is strictly convex.
Classical examples:
Squared `2 regularization, Ω(w) = λ2‖w‖
22: scaling operation
proxΩ(w) =1
1 + λw
`1 regularization, Ω(w) = λ‖w‖1: soft-thresholding;
proxΩ(w) = soft(w, λ)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 64 / 128
![Page 164: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/164.jpg)
Key Concepts: Proximity OperatorsLet Ω : RD → R be a convex function.
The Ω-proximity operator is the following RD → RD map:
w 7→ proxΩ(w) = arg minu
1
2‖u−w‖2
2 + Ω(u)
...always well defined, because ‖u−w‖22 is strictly convex.
Classical examples:
Squared `2 regularization, Ω(w) = λ2‖w‖
22: scaling operation
proxΩ(w) =1
1 + λw
`1 regularization, Ω(w) = λ‖w‖1: soft-thresholding;
proxΩ(w) = soft(w, λ)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 64 / 128
![Page 165: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/165.jpg)
Key Concepts: Proximity OperatorsLet Ω : RD → R be a convex function.
The Ω-proximity operator is the following RD → RD map:
w 7→ proxΩ(w) = arg minu
1
2‖u−w‖2
2 + Ω(u)
...always well defined, because ‖u−w‖22 is strictly convex.
Classical examples:
Squared `2 regularization, Ω(w) = λ2‖w‖
22: scaling operation
proxΩ(w) =1
1 + λw
`1 regularization, Ω(w) = λ‖w‖1: soft-thresholding;
proxΩ(w) = soft(w, λ)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 64 / 128
![Page 166: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/166.jpg)
Key Concepts: Proximity Operators (II)
proxΩ(w) = arg minu
1
2‖u−w‖2
2 + Ω(u)
`2 regularization, Ω(w) = λ‖w‖2: vector soft thresholding
proxΩ(w) =
0 ⇐ ‖w‖ ≤ λ
w‖w‖ (‖w‖ − λ) ⇐ ‖w‖ > λ
indicator function, Ω(w) = ιS(w) =
0 ⇐ w ∈ S
+∞ ⇐ w 6∈ S
proxΩ(w) = PS(w)
Euclidean projection
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 65 / 128
![Page 167: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/167.jpg)
Key Concepts: Proximity Operators (II)
proxΩ(w) = arg minu
1
2‖u−w‖2
2 + Ω(u)
`2 regularization, Ω(w) = λ‖w‖2: vector soft thresholding
proxΩ(w) =
0 ⇐ ‖w‖ ≤ λ
w‖w‖ (‖w‖ − λ) ⇐ ‖w‖ > λ
indicator function, Ω(w) = ιS(w) =
0 ⇐ w ∈ S
+∞ ⇐ w 6∈ S
proxΩ(w) = PS(w)
Euclidean projection
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 65 / 128
![Page 168: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/168.jpg)
Key Concepts: Proximity Operators (II)
proxΩ(w) = arg minu
1
2‖u−w‖2
2 + Ω(u)
`2 regularization, Ω(w) = λ‖w‖2: vector soft thresholding
proxΩ(w) =
0 ⇐ ‖w‖ ≤ λ
w‖w‖ (‖w‖ − λ) ⇐ ‖w‖ > λ
indicator function, Ω(w) = ιS(w) =
0 ⇐ w ∈ S
+∞ ⇐ w 6∈ S
proxΩ(w) = PS(w)
Euclidean projection
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 65 / 128
![Page 169: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/169.jpg)
Key Concepts: Proximity Operators (III)
Group regularizers: Ω(w) =M∑
m=1
Ωm(wm)
Groups: Gm ⊂ 1, 2, ...,D. wm is a sub-vector of w with theindices in Gm.
Non-overlapping groups (Gm ∩ Gn = ∅, for m 6= n): separable proxoperator
[proxΩ(w)]m = proxΩm(wm)
Tree-structured groups: (two groups are either non-overlapping orone contais the other) proxΩ can be computed recursively (Jenattonet al., 2011).
Arbitrary groups:For Ωj (wm) = ‖wm‖2: solved via convex smooth optimization (Yuanet al., 2011).Sequential proximity steps (Martins et al., 2011a) (more later).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 66 / 128
![Page 170: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/170.jpg)
Key Concepts: Proximity Operators (III)
Group regularizers: Ω(w) =M∑
m=1
Ωm(wm)
Groups: Gm ⊂ 1, 2, ...,D. wm is a sub-vector of w with theindices in Gm.
Non-overlapping groups (Gm ∩ Gn = ∅, for m 6= n): separable proxoperator
[proxΩ(w)]m = proxΩm(wm)
Tree-structured groups: (two groups are either non-overlapping orone contais the other) proxΩ can be computed recursively (Jenattonet al., 2011).
Arbitrary groups:For Ωj (wm) = ‖wm‖2: solved via convex smooth optimization (Yuanet al., 2011).Sequential proximity steps (Martins et al., 2011a) (more later).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 66 / 128
![Page 171: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/171.jpg)
Key Concepts: Proximity Operators (III)
Group regularizers: Ω(w) =M∑
m=1
Ωm(wm)
Groups: Gm ⊂ 1, 2, ...,D. wm is a sub-vector of w with theindices in Gm.
Non-overlapping groups (Gm ∩ Gn = ∅, for m 6= n): separable proxoperator
[proxΩ(w)]m = proxΩm(wm)
Tree-structured groups: (two groups are either non-overlapping orone contais the other) proxΩ can be computed recursively (Jenattonet al., 2011).
Arbitrary groups:For Ωj (wm) = ‖wm‖2: solved via convex smooth optimization (Yuanet al., 2011).Sequential proximity steps (Martins et al., 2011a) (more later).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 66 / 128
![Page 172: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/172.jpg)
Key Concepts: Proximity Operators (III)
Group regularizers: Ω(w) =M∑
m=1
Ωm(wm)
Groups: Gm ⊂ 1, 2, ...,D. wm is a sub-vector of w with theindices in Gm.
Non-overlapping groups (Gm ∩ Gn = ∅, for m 6= n): separable proxoperator
[proxΩ(w)]m = proxΩm(wm)
Tree-structured groups: (two groups are either non-overlapping orone contais the other) proxΩ can be computed recursively (Jenattonet al., 2011).
Arbitrary groups:For Ωj (wm) = ‖wm‖2: solved via convex smooth optimization (Yuanet al., 2011).Sequential proximity steps (Martins et al., 2011a) (more later).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 66 / 128
![Page 173: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/173.jpg)
Proximal GradientRecall the problem: min
wΩ(w) + Λ(w)
Key assumptions: ∇Λ(w) and proxΩ “easy”.
wt+1 ← proxηt Ω (wt − ηt∇Λ(wt))
Key feature: each steps decouples the loss and the regularizer.
Projected gradient is a particular case, for proxΩ = PS.
Often called iterative shrinkage thresholding (IST).
Can be derived with different tools:
expectation-maximization (EM) (Figueiredo and Nowak, 2003);
majorization-minimization (Daubechies et al., 2004);
forward-backward splitting (Combettes and Wajs, 2006);
separable approximation (Wright et al., 2009).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 67 / 128
![Page 174: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/174.jpg)
Proximal GradientRecall the problem: min
wΩ(w) + Λ(w)
Key assumptions: ∇Λ(w) and proxΩ “easy”.
wt+1 ← proxηt Ω (wt − ηt∇Λ(wt))
Key feature: each steps decouples the loss and the regularizer.
Projected gradient is a particular case, for proxΩ = PS.
Often called iterative shrinkage thresholding (IST).
Can be derived with different tools:
expectation-maximization (EM) (Figueiredo and Nowak, 2003);
majorization-minimization (Daubechies et al., 2004);
forward-backward splitting (Combettes and Wajs, 2006);
separable approximation (Wright et al., 2009).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 67 / 128
![Page 175: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/175.jpg)
Proximal GradientRecall the problem: min
wΩ(w) + Λ(w)
Key assumptions: ∇Λ(w) and proxΩ “easy”.
wt+1 ← proxηt Ω (wt − ηt∇Λ(wt))
Key feature: each steps decouples the loss and the regularizer.
Projected gradient is a particular case, for proxΩ = PS.
Often called iterative shrinkage thresholding (IST).
Can be derived with different tools:
expectation-maximization (EM) (Figueiredo and Nowak, 2003);
majorization-minimization (Daubechies et al., 2004);
forward-backward splitting (Combettes and Wajs, 2006);
separable approximation (Wright et al., 2009).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 67 / 128
![Page 176: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/176.jpg)
Proximal GradientRecall the problem: min
wΩ(w) + Λ(w)
Key assumptions: ∇Λ(w) and proxΩ “easy”.
wt+1 ← proxηt Ω (wt − ηt∇Λ(wt))
Key feature: each steps decouples the loss and the regularizer.
Projected gradient is a particular case, for proxΩ = PS.
Often called iterative shrinkage thresholding (IST).
Can be derived with different tools:
expectation-maximization (EM) (Figueiredo and Nowak, 2003);
majorization-minimization (Daubechies et al., 2004);
forward-backward splitting (Combettes and Wajs, 2006);
separable approximation (Wright et al., 2009).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 67 / 128
![Page 177: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/177.jpg)
Proximal GradientRecall the problem: min
wΩ(w) + Λ(w)
Key assumptions: ∇Λ(w) and proxΩ “easy”.
wt+1 ← proxηt Ω (wt − ηt∇Λ(wt))
Key feature: each steps decouples the loss and the regularizer.
Projected gradient is a particular case, for proxΩ = PS.
Often called iterative shrinkage thresholding (IST).
Can be derived with different tools:
expectation-maximization (EM) (Figueiredo and Nowak, 2003);
majorization-minimization (Daubechies et al., 2004);
forward-backward splitting (Combettes and Wajs, 2006);
separable approximation (Wright et al., 2009).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 67 / 128
![Page 178: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/178.jpg)
Proximal GradientRecall the problem: min
wΩ(w) + Λ(w)
Key assumptions: ∇Λ(w) and proxΩ “easy”.
wt+1 ← proxηt Ω (wt − ηt∇Λ(wt))
Key feature: each steps decouples the loss and the regularizer.
Projected gradient is a particular case, for proxΩ = PS.
Often called iterative shrinkage thresholding (IST).
Can be derived with different tools:
expectation-maximization (EM) (Figueiredo and Nowak, 2003);
majorization-minimization (Daubechies et al., 2004);
forward-backward splitting (Combettes and Wajs, 2006);
separable approximation (Wright et al., 2009).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 67 / 128
![Page 179: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/179.jpg)
Proximal GradientRecall the problem: min
wΩ(w) + Λ(w)
Key assumptions: ∇Λ(w) and proxΩ “easy”.
wt+1 ← proxηt Ω (wt − ηt∇Λ(wt))
Key feature: each steps decouples the loss and the regularizer.
Projected gradient is a particular case, for proxΩ = PS.
Often called iterative shrinkage thresholding (IST).
Can be derived with different tools:
expectation-maximization (EM) (Figueiredo and Nowak, 2003);
majorization-minimization (Daubechies et al., 2004);
forward-backward splitting (Combettes and Wajs, 2006);
separable approximation (Wright et al., 2009).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 67 / 128
![Page 180: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/180.jpg)
Monotonicity and Convergence
Proximal gradient, a.k.a., iterative shrinkage thresholding (IST):
wt+1 ← proxηt Ω (wt − ηt∇Λ(wt)) .
Assume Λ(w) has L-Lipschitz gradient: ‖∇Λ(w)−∇Λ(w′)‖ ≤ L‖w −w′‖.
Monotonicity: if ηt ≤ 1/L, then Λ(wt+1) + Ω(wt+1) ≤ Λ(wt) + Ω(wt).
Convergence of objective value (Beck and Teboulle, 2009)
(Λ(wt) + Ω(wt)
)−(Λ(w∗) + Ω(w∗)
)= O
(1
ε
)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 68 / 128
![Page 181: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/181.jpg)
Monotonicity and Convergence
Proximal gradient, a.k.a., iterative shrinkage thresholding (IST):
wt+1 ← proxηt Ω (wt − ηt∇Λ(wt)) .
Assume Λ(w) has L-Lipschitz gradient: ‖∇Λ(w)−∇Λ(w′)‖ ≤ L‖w −w′‖.Monotonicity: if ηt ≤ 1/L, then Λ(wt+1) + Ω(wt+1) ≤ Λ(wt) + Ω(wt).
Convergence of objective value (Beck and Teboulle, 2009)
(Λ(wt) + Ω(wt)
)−(Λ(w∗) + Ω(w∗)
)= O
(1
ε
)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 68 / 128
![Page 182: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/182.jpg)
Monotonicity and Convergence
Proximal gradient, a.k.a., iterative shrinkage thresholding (IST):
wt+1 ← proxηt Ω (wt − ηt∇Λ(wt)) .
Assume Λ(w) has L-Lipschitz gradient: ‖∇Λ(w)−∇Λ(w′)‖ ≤ L‖w −w′‖.Monotonicity: if ηt ≤ 1/L, then Λ(wt+1) + Ω(wt+1) ≤ Λ(wt) + Ω(wt).
Convergence of objective value (Beck and Teboulle, 2009)
(Λ(wt) + Ω(wt)
)−(Λ(w∗) + Ω(w∗)
)= O
(1
ε
)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 68 / 128
![Page 183: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/183.jpg)
Accelerating IST: FISTA
Idea: compute wt+1 based, not only on wt , but also on wt−1.
Fast IST algorithm (FISTA) (Beck and Teboulle, 2009):
bt+1 =1+√
1+4 b2t
2
z = wt + bt−1bt+1
(wt −wt−1)
wt+1 = proxηΩ (z− η∇Λ(z))
Convergence of objective value (Beck and Teboulle, 2009)
(Λ(wt) + Ω(wt)
)−(Λ(w∗) + Ω(w∗)
)= O
(1√ε
)(vs O(1/ε) for IST)
Other IST variants: Nesterov’s method (Nesterov, 2007), SpaRSA (Wrightet al., 2009), TwIST (two-step IST; Bioucas-Dias and Figueiredo, 2007).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 69 / 128
![Page 184: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/184.jpg)
Accelerating IST: FISTA
Idea: compute wt+1 based, not only on wt , but also on wt−1.
Fast IST algorithm (FISTA) (Beck and Teboulle, 2009):
bt+1 =1+√
1+4 b2t
2
z = wt + bt−1bt+1
(wt −wt−1)
wt+1 = proxηΩ (z− η∇Λ(z))
Convergence of objective value (Beck and Teboulle, 2009)
(Λ(wt) + Ω(wt)
)−(Λ(w∗) + Ω(w∗)
)= O
(1√ε
)(vs O(1/ε) for IST)
Other IST variants: Nesterov’s method (Nesterov, 2007), SpaRSA (Wrightet al., 2009), TwIST (two-step IST; Bioucas-Dias and Figueiredo, 2007).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 69 / 128
![Page 185: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/185.jpg)
Accelerating IST: FISTA
Idea: compute wt+1 based, not only on wt , but also on wt−1.
Fast IST algorithm (FISTA) (Beck and Teboulle, 2009):
bt+1 =1+√
1+4 b2t
2
z = wt + bt−1bt+1
(wt −wt−1)
wt+1 = proxηΩ (z− η∇Λ(z))
Convergence of objective value (Beck and Teboulle, 2009)
(Λ(wt) + Ω(wt)
)−(Λ(w∗) + Ω(w∗)
)= O
(1√ε
)(vs O(1/ε) for IST)
Other IST variants: Nesterov’s method (Nesterov, 2007), SpaRSA (Wrightet al., 2009), TwIST (two-step IST; Bioucas-Dias and Figueiredo, 2007).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 69 / 128
![Page 186: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/186.jpg)
Accelerating IST: FISTA
Idea: compute wt+1 based, not only on wt , but also on wt−1.
Fast IST algorithm (FISTA) (Beck and Teboulle, 2009):
bt+1 =1+√
1+4 b2t
2
z = wt + bt−1bt+1
(wt −wt−1)
wt+1 = proxηΩ (z− η∇Λ(z))
Convergence of objective value (Beck and Teboulle, 2009)
(Λ(wt) + Ω(wt)
)−(Λ(w∗) + Ω(w∗)
)= O
(1√ε
)(vs O(1/ε) for IST)
Other IST variants: Nesterov’s method (Nesterov, 2007), SpaRSA (Wrightet al., 2009), TwIST (two-step IST; Bioucas-Dias and Figueiredo, 2007).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 69 / 128
![Page 187: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/187.jpg)
Alternating Direction Method of Multipliers
Combine benefits of dual decomposition and augmented Lagrangianmethods for constrained optimization (Hestenes, 1969; Powell, 1969).
Key ideas
break down the optimization problem into subproblems, eachdepending on a subset of w.
each subproblem p receives a “copy” of the subvector w, denoted byvp.
encode constraints forcing each vp to “agree” with the global solutionw.
Particularly suitable for distributed optimization.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 70 / 128
![Page 188: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/188.jpg)
Alternating Direction Method of Multipliers
Combine benefits of dual decomposition and augmented Lagrangianmethods for constrained optimization (Hestenes, 1969; Powell, 1969).
Key ideas
break down the optimization problem into subproblems, eachdepending on a subset of w.
each subproblem p receives a “copy” of the subvector w, denoted byvp.
encode constraints forcing each vp to “agree” with the global solutionw.
Particularly suitable for distributed optimization.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 70 / 128
![Page 189: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/189.jpg)
Alternating Direction Method of Multipliers
Combine benefits of dual decomposition and augmented Lagrangianmethods for constrained optimization (Hestenes, 1969; Powell, 1969).
Key ideas
break down the optimization problem into subproblems, eachdepending on a subset of w.
each subproblem p receives a “copy” of the subvector w, denoted byvp.
encode constraints forcing each vp to “agree” with the global solutionw.
Particularly suitable for distributed optimization.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 70 / 128
![Page 190: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/190.jpg)
Alternating Direction Method of Multipliers
Original problem minw
Ω(w) + Λ(w) where Ω(w) =M∑
m=1
Ωm(wm) .
ADMM objective minw,v
Ω(v) + Λ(w) subject to Av + Bw = c
For example, in the overlapping group lasso case, we have A = I andc = 0. The constraint becomes v = −Bw.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 71 / 128
![Page 191: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/191.jpg)
Alternating Direction Method of Multipliers
Original problem minw
Ω(w) + Λ(w) where Ω(w) =M∑
m=1
Ωm(wm) .
ADMM objective minw,v
Ω(v) + Λ(w) subject to Av + Bw = c
For example, in the overlapping group lasso case, we have A = I andc = 0. The constraint becomes v = −Bw.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 71 / 128
![Page 192: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/192.jpg)
Alternating Direction Method of Multipliers
Original problem minw
Ω(w) + Λ(w) where Ω(w) =M∑
m=1
Ωm(wm) .
ADMM objective minw,v
Ω(v) + Λ(w) subject to Av + Bw = c
For example, in the overlapping group lasso case, we have A = I andc = 0. The constraint becomes v = −Bw.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 71 / 128
![Page 193: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/193.jpg)
Alternating Direction Method of Multipliers
Original problem minw
Ω(w) + Λ(w) where Ω(w) =M∑
m=1
Ωm(wm) .
ADMM objective minw,v
Ω(v) + Λ(w) subject to Av + Bw = c
For example, in the overlapping group lasso case, we have A = I andc = 0. The constraint becomes v = −Bw.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 71 / 128
![Page 194: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/194.jpg)
Alternating Direction Method of Multipliers
The augmented Lagrangian is:
Ω(v) +Λ(w) + u>(Av + Bw − c) + ρ2‖Av + Bw − c‖2
2
ADMM iteratively solves:
w = arg minw Λ(w) + u>Bw + ρ2‖Av + Bw − c‖2
2
v = arg minv Ω(v) + u>Av + ρ2‖Av + Bw − c‖2
2
u = u + ρ(Av + Bw − c)
Key advantage: the minimization of v can be done in parallel.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 72 / 128
![Page 195: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/195.jpg)
Alternating Direction Method of Multipliers
The augmented Lagrangian is:
Ω(v) +Λ(w) + u>(Av + Bw − c) + ρ2‖Av + Bw − c‖2
2
ADMM iteratively solves:
w = arg minw Λ(w) + u>Bw + ρ2‖Av + Bw − c‖2
2
v = arg minv Ω(v) + u>Av + ρ2‖Av + Bw − c‖2
2
u = u + ρ(Av + Bw − c)
Key advantage: the minimization of v can be done in parallel.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 72 / 128
![Page 196: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/196.jpg)
Alternating Direction Method of Multipliers
The augmented Lagrangian is:
Ω(v) +Λ(w) + u>(Av + Bw − c) + ρ2‖Av + Bw − c‖2
2
ADMM iteratively solves:
w = arg minw Λ(w) + u>Bw + ρ2‖Av + Bw − c‖2
2
v = arg minv Ω(v) + u>Av + ρ2‖Av + Bw − c‖2
2
u = u + ρ(Av + Bw − c)
Key advantage: the minimization of v can be done in parallel.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 72 / 128
![Page 197: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/197.jpg)
Alternating Direction Method of Multipliers
Convergence of ADMM in theory (Boyd et al., 2010)
Assumptions:
Λ and Ω are closed, proper, and convex.
The unaugmented Lagrangian has a saddle point
As t →∞, we have:
Residual convergence: Av + Bw − c→ 0.
Primal convergence: Λ(wt) + Ω(vt)→ p∗ where p∗ is the optimalvalue.
Dual convergence: ut → u∗.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 73 / 128
![Page 198: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/198.jpg)
Alternating Direction Method of Multipliers
Convergence of ADMM in theory (Boyd et al., 2010)
Assumptions:
Λ and Ω are closed, proper, and convex.
The unaugmented Lagrangian has a saddle point
As t →∞, we have:
Residual convergence: Av + Bw − c→ 0.
Primal convergence: Λ(wt) + Ω(vt)→ p∗ where p∗ is the optimalvalue.
Dual convergence: ut → u∗.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 73 / 128
![Page 199: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/199.jpg)
Alternating Direction Method of Multipliers
Convergence of ADMM in theory (Boyd et al., 2010)
Assumptions:
Λ and Ω are closed, proper, and convex.
The unaugmented Lagrangian has a saddle point
As t →∞, we have:
Residual convergence: Av + Bw − c→ 0.
Primal convergence: Λ(wt) + Ω(vt)→ p∗ where p∗ is the optimalvalue.
Dual convergence: ut → u∗.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 73 / 128
![Page 200: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/200.jpg)
Alternating Direction Method of Multipliers
ADMM can handle various kinds of regularizers by adapting A and B.
ADMM is well suited for structured sparse models with group overlapsbecause we can design A and B such that Ω(v) no longer has overlappinggroups. Hence, we can solve each subproblem separately in parallel.
Practical considerations:
ADMM can be slow to converge in practice, but tens of iterations areoften enough to produce good results.
ADMM only produces weakly sparse solution (we only get sparsity inthe limit).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 74 / 128
![Page 201: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/201.jpg)
Alternating Direction Method of Multipliers
ADMM can handle various kinds of regularizers by adapting A and B.
ADMM is well suited for structured sparse models with group overlapsbecause we can design A and B such that Ω(v) no longer has overlappinggroups. Hence, we can solve each subproblem separately in parallel.
Practical considerations:
ADMM can be slow to converge in practice, but tens of iterations areoften enough to produce good results.
ADMM only produces weakly sparse solution (we only get sparsity inthe limit).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 74 / 128
![Page 202: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/202.jpg)
Alternating Direction Method of Multipliers
ADMM can handle various kinds of regularizers by adapting A and B.
ADMM is well suited for structured sparse models with group overlapsbecause we can design A and B such that Ω(v) no longer has overlappinggroups. Hence, we can solve each subproblem separately in parallel.
Practical considerations:
ADMM can be slow to converge in practice, but tens of iterations areoften enough to produce good results.
ADMM only produces weakly sparse solution (we only get sparsity inthe limit).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 74 / 128
![Page 203: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/203.jpg)
Alternating Direction Method of Multipliers
Recall that the ADMM objective is:
minw,v
Ωstruct(v) + Λ(w) subject to Av + Bw = c
We can introduce an additional lasso penalty (sparse group lasso;Friedman et al., 2010):
minw,v
Ωstruct(v) + Ωlasso(w) + Λ(w) subject to Av + Bw = c
We get sparse solutions and can still guarantee convergence (Yogatamaand Smith, 2014a).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 75 / 128
![Page 204: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/204.jpg)
Alternating Direction Method of Multipliers
Recall that the ADMM objective is:
minw,v
Ωstruct(v) + Λ(w) subject to Av + Bw = c
We can introduce an additional lasso penalty (sparse group lasso;Friedman et al., 2010):
minw,v
Ωstruct(v) + Ωlasso(w) + Λ(w) subject to Av + Bw = c
We get sparse solutions and can still guarantee convergence (Yogatamaand Smith, 2014a).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 75 / 128
![Page 205: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/205.jpg)
Alternating Direction Method of Multipliers
Recall that the ADMM objective is:
minw,v
Ωstruct(v) + Λ(w) subject to Av + Bw = c
We can introduce an additional lasso penalty (sparse group lasso;Friedman et al., 2010):
minw,v
Ωstruct(v) + Ωlasso(w) + Λ(w) subject to Av + Bw = c
We get sparse solutions and can still guarantee convergence (Yogatamaand Smith, 2014a).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 75 / 128
![Page 206: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/206.jpg)
Summary of Algorithms
Converges? Rate? Sparse? Groups? Overlaps?Prox-grad (IST) X O(1/ε) X X Not easyFISTA X O(1/
√ε) X X Not easy
ADMM X O(1/ε) No X X
Note that we can still get sparsity for ADMM with sparse group lasso(Yogatama and Smith, 2014a).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 76 / 128
![Page 207: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/207.jpg)
Summary of Algorithms
Converges? Rate? Sparse? Groups? Overlaps?Prox-grad (IST) X O(1/ε) X X Not easyFISTA X O(1/
√ε) X X Not easy
ADMM X O(1/ε) No X X
Note that we can still get sparsity for ADMM with sparse group lasso(Yogatama and Smith, 2014a).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 76 / 128
![Page 208: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/208.jpg)
Some Stuff We Didn’t Talk About
shooting method (Fu, 1998);
grafting (Perkins et al., 2003) and grafting-light (Zhu et al., 2010);(Afonso et al., 2010; Figueiredo and Bioucas-Dias, 2011).
forward stagewise regression (Hastie et al., 2007).
homotopy/continuation method (Osborne et al., 2000; Efron et al.,2004; Figueiredo et al., 2007; Hale et al., 2008).
Next: We’ll talk about online algorithms.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 77 / 128
![Page 209: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/209.jpg)
Some Stuff We Didn’t Talk About
shooting method (Fu, 1998);
grafting (Perkins et al., 2003) and grafting-light (Zhu et al., 2010);(Afonso et al., 2010; Figueiredo and Bioucas-Dias, 2011).
forward stagewise regression (Hastie et al., 2007).
homotopy/continuation method (Osborne et al., 2000; Efron et al.,2004; Figueiredo et al., 2007; Hale et al., 2008).
Next: We’ll talk about online algorithms.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 77 / 128
![Page 210: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/210.jpg)
Outline
1 Introduction
2 Loss Functions and Sparsity
3 Structured Sparsity
4 Algorithms
Batch Algorithms
Online Algorithms
5 Applications
6 Conclusions
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 78 / 128
![Page 211: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/211.jpg)
Why Online?
1 Suitable for large datasets
2 Suitable for structured prediction
3 Faster to approach a near-optimal region
4 Slower convergence, but this is fine in machine learning
cf. “the tradeoffs of large scale learning” (Bottou and Bousquet, 2007)
What we will say can be straighforwardly extended to the mini-batch case.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 79 / 128
![Page 212: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/212.jpg)
Why Online?
1 Suitable for large datasets
2 Suitable for structured prediction
3 Faster to approach a near-optimal region
4 Slower convergence, but this is fine in machine learning
cf. “the tradeoffs of large scale learning” (Bottou and Bousquet, 2007)
What we will say can be straighforwardly extended to the mini-batch case.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 79 / 128
![Page 213: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/213.jpg)
Why Online?
1 Suitable for large datasets
2 Suitable for structured prediction
3 Faster to approach a near-optimal region
4 Slower convergence, but this is fine in machine learning
cf. “the tradeoffs of large scale learning” (Bottou and Bousquet, 2007)
What we will say can be straighforwardly extended to the mini-batch case.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 79 / 128
![Page 214: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/214.jpg)
Why Online?
1 Suitable for large datasets
2 Suitable for structured prediction
3 Faster to approach a near-optimal region
4 Slower convergence, but this is fine in machine learning
cf. “the tradeoffs of large scale learning” (Bottou and Bousquet, 2007)
What we will say can be straighforwardly extended to the mini-batch case.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 79 / 128
![Page 215: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/215.jpg)
Why Online?
1 Suitable for large datasets
2 Suitable for structured prediction
3 Faster to approach a near-optimal region
4 Slower convergence, but this is fine in machine learning
cf. “the tradeoffs of large scale learning” (Bottou and Bousquet, 2007)
What we will say can be straighforwardly extended to the mini-batch case.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 79 / 128
![Page 216: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/216.jpg)
Why Online?
1 Suitable for large datasets
2 Suitable for structured prediction
3 Faster to approach a near-optimal region
4 Slower convergence, but this is fine in machine learning
cf. “the tradeoffs of large scale learning” (Bottou and Bousquet, 2007)
What we will say can be straighforwardly extended to the mini-batch case.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 79 / 128
![Page 217: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/217.jpg)
Plain Stochastic (Sub-)Gradient Descent
minw
Ω(w)︸ ︷︷ ︸regularizer
+1
N
N∑i=1
L(w, xi , yi )︸ ︷︷ ︸empirical loss
,
input: stepsize sequence (ηt)Tt=1
initialize w = 0for t = 1, 2, . . . do
take training pair (xt , yt)(sub-)gradient step: w ← w − ηt
(∇Ω(w) + ∇L(w; xt , yt)
)end for
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 80 / 128
![Page 218: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/218.jpg)
Plain Stochastic (Sub-)Gradient Descent
minw
Ω(w)︸ ︷︷ ︸regularizer
+1
N
N∑i=1
L(w, xi , yi )︸ ︷︷ ︸empirical loss
,
input: stepsize sequence (ηt)Tt=1
initialize w = 0for t = 1, 2, . . . do
take training pair (xt , yt)(sub-)gradient step: w ← w − ηt
(∇Ω(w) + ∇L(w; xt , yt)
)end for
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 80 / 128
![Page 219: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/219.jpg)
What’s the Problem with SGD?
(Sub-)gradient step: w ← w − ηt
(∇Ω(w) + ∇L(w; xt , yt)
)
`2-regularization Ω(w) = λ2‖w‖
22 =⇒ ∇Ω(w) = λw
w ← (1− ηtλ)w︸ ︷︷ ︸scaling
− ηt∇L(w; xt , yt)
`1-regularization Ω(w) = λ‖w‖1 =⇒ ∇Ω(w) = λsign(w)
w ← w − ηtλsign(w)︸ ︷︷ ︸constant penalty
− ηt∇L(w; xt , yt)
Problem: iterates are never sparse!
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 81 / 128
![Page 220: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/220.jpg)
What’s the Problem with SGD?
(Sub-)gradient step: w ← w − ηt
(∇Ω(w) + ∇L(w; xt , yt)
)`2-regularization Ω(w) = λ
2‖w‖22 =⇒ ∇Ω(w) = λw
w ← (1− ηtλ)w︸ ︷︷ ︸scaling
− ηt∇L(w; xt , yt)
`1-regularization Ω(w) = λ‖w‖1 =⇒ ∇Ω(w) = λsign(w)
w ← w − ηtλsign(w)︸ ︷︷ ︸constant penalty
− ηt∇L(w; xt , yt)
Problem: iterates are never sparse!
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 81 / 128
![Page 221: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/221.jpg)
What’s the Problem with SGD?
(Sub-)gradient step: w ← w − ηt
(∇Ω(w) + ∇L(w; xt , yt)
)`2-regularization Ω(w) = λ
2‖w‖22 =⇒ ∇Ω(w) = λw
w ← (1− ηtλ)w︸ ︷︷ ︸scaling
− ηt∇L(w; xt , yt)
`1-regularization Ω(w) = λ‖w‖1 =⇒ ∇Ω(w) = λsign(w)
w ← w − ηtλsign(w)︸ ︷︷ ︸constant penalty
− ηt∇L(w; xt , yt)
Problem: iterates are never sparse!
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 81 / 128
![Page 222: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/222.jpg)
What’s the Problem with SGD?
(Sub-)gradient step: w ← w − ηt
(∇Ω(w) + ∇L(w; xt , yt)
)`2-regularization Ω(w) = λ
2‖w‖22 =⇒ ∇Ω(w) = λw
w ← (1− ηtλ)w︸ ︷︷ ︸scaling
− ηt∇L(w; xt , yt)
`1-regularization Ω(w) = λ‖w‖1 =⇒ ∇Ω(w) = λsign(w)
w ← w − ηtλsign(w)︸ ︷︷ ︸constant penalty
− ηt∇L(w; xt , yt)
Problem: iterates are never sparse!
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 81 / 128
![Page 223: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/223.jpg)
Plain SGD with `2-regularization
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 82 / 128
![Page 224: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/224.jpg)
Plain SGD with `2-regularization
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 82 / 128
![Page 225: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/225.jpg)
Plain SGD with `2-regularization
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 82 / 128
![Page 226: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/226.jpg)
Plain SGD with `2-regularization
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 82 / 128
![Page 227: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/227.jpg)
Plain SGD with `2-regularization
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 82 / 128
![Page 228: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/228.jpg)
Plain SGD with `2-regularization
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 82 / 128
![Page 229: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/229.jpg)
Plain SGD with `2-regularization
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 82 / 128
![Page 230: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/230.jpg)
Plain SGD with `2-regularization
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 82 / 128
![Page 231: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/231.jpg)
Plain SGD with `2-regularization
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 82 / 128
![Page 232: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/232.jpg)
Plain SGD with `2-regularization
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 82 / 128
![Page 233: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/233.jpg)
Plain SGD with `1-regularization
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 83 / 128
![Page 234: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/234.jpg)
Plain SGD with `1-regularization
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 83 / 128
![Page 235: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/235.jpg)
Plain SGD with `1-regularization
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 83 / 128
![Page 236: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/236.jpg)
Plain SGD with `1-regularization
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 83 / 128
![Page 237: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/237.jpg)
Plain SGD with `1-regularization
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 83 / 128
![Page 238: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/238.jpg)
Plain SGD with `1-regularization
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 83 / 128
![Page 239: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/239.jpg)
Plain SGD with `1-regularization
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 83 / 128
![Page 240: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/240.jpg)
Plain SGD with `1-regularization
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 83 / 128
![Page 241: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/241.jpg)
Plain SGD with `1-regularization
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 83 / 128
![Page 242: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/242.jpg)
Plain SGD with `1-regularization
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 83 / 128
![Page 243: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/243.jpg)
“Sparse” Online Algorithms
Truncated Gradient (Langford et al., 2009)
Online Forward-Backward Splitting (Duchi and Singer, 2009)
Regularized Dual Averaging (Xiao, 2010)
Online Proximal Gradient (Martins et al., 2011a)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 84 / 128
![Page 244: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/244.jpg)
“Sparse” Online Algorithms
Truncated Gradient (Langford et al., 2009)
Online Forward-Backward Splitting (Duchi and Singer, 2009)
Regularized Dual Averaging (Xiao, 2010)
Online Proximal Gradient (Martins et al., 2011a)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 84 / 128
![Page 245: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/245.jpg)
Truncated Gradient (Langford et al., 2009)
input: laziness coefficient K , stepsize sequence (ηt)Tt=1
initialize w = 0for t = 1, 2, . . . do
take training pair (xt , yt)(sub-)gradient step: w ← w − ηt∇L(θ; xt , yt)if t/K is integer then
truncation step: w ← w − sign(w) (|w| − ηtKτ)︸ ︷︷ ︸soft-thresholding
end ifend for
take gradients only with respect to the loss
every K rounds: a “lazy” soft-thresholding step
Langford et al. (2009) also suggest other forms of truncation
converges to ε-accurate objective after O(1/ε2) iterations
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 85 / 128
![Page 246: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/246.jpg)
Truncated Gradient (Langford et al., 2009)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 86 / 128
![Page 247: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/247.jpg)
Truncated Gradient (Langford et al., 2009)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 86 / 128
![Page 248: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/248.jpg)
Truncated Gradient (Langford et al., 2009)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 86 / 128
![Page 249: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/249.jpg)
Truncated Gradient (Langford et al., 2009)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 86 / 128
![Page 250: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/250.jpg)
Truncated Gradient (Langford et al., 2009)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 86 / 128
![Page 251: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/251.jpg)
Truncated Gradient (Langford et al., 2009)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 86 / 128
![Page 252: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/252.jpg)
Truncated Gradient (Langford et al., 2009)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 86 / 128
![Page 253: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/253.jpg)
Truncated Gradient (Langford et al., 2009)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 86 / 128
![Page 254: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/254.jpg)
Truncated Gradient (Langford et al., 2009)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 86 / 128
![Page 255: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/255.jpg)
Truncated Gradient (Langford et al., 2009)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 86 / 128
![Page 256: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/256.jpg)
“Sparse” Online Algorithms
Truncated Gradient (Langford et al., 2009)
Online Forward-Backward Splitting (Duchi and Singer, 2009)
Regularized Dual Averaging (Xiao, 2010)
Online Proximal Gradient (Martins et al., 2011a)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 87 / 128
![Page 257: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/257.jpg)
Online Forward-Backward Splitting (Duchi andSinger, 2009)
input: stepsize sequences (ηt)Tt=1, (ρt)T
t=1
initialize w = 0for t = 1, 2, . . . do
take training pair (xt , yt)gradient step: w ← w − ηt∇L(w; xt , yt)proximal step: w ← proxρt Ω(w)
end for
generalizes truncated gradient to arbitrary regularizers Ωcan tackle non-overlapping or hierarchical group-Lasso, but arbitraryoverlaps are difficult to handle (more later)
practical drawback: without a laziness parameter, iterates areusually not very sparse
converges to ε-accurate objective after O(1/ε2) iterations
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 88 / 128
![Page 258: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/258.jpg)
“Sparse” Online Algorithms
Truncated Gradient (Langford et al., 2009)
Online Forward-Backward Splitting (Duchi and Singer, 2009)
Regularized Dual Averaging (Xiao, 2010)
Online Proximal Gradient (Martins et al., 2011a)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 89 / 128
![Page 259: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/259.jpg)
Regularized Dual Averaging (Xiao, 2010)
input: coefficient η0
initialize w = 0for t = 1, 2, . . . do
take training pair (xt , yt)gradient step: s ← s +∇L(w; xt , yt)proximal step: w ← η0
√t × proxΩ(−s/t)
end for
based on the dual averaging technique (Nesterov, 2009)
in practice: quite effective at getting sparse iterates (the proximalsteps are not vanishing)
O(C1/ε2 + C2/
√ε) convergence, where C1 is a Lipschitz constant,
and C2 is the variance of the stochastic gradients
drawback: requires storing two vectors (w and s), and s is not sparse
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 90 / 128
![Page 260: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/260.jpg)
What About Group Sparsity?
Both online forward-backward splitting (Duchi and Singer, 2009) andregularized dual averaging (Xiao, 2010) can handle groups
All that is necessary is to compute proxΩ(w)
easy for non-overlapping and tree-structured groups
But what about general overlapping groups?
Martins et al. (2011a): a prox-grad algorithm that can handle arbitraryoverlapping groups
decompose Ω(w) =∑J
j=1 Ωj (w) where each Ωj is non-overlapping
then apply proxΩjsequentially
still convergent (Martins et al., 2011a)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 91 / 128
![Page 261: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/261.jpg)
“Sparse” Online Algorithms
Truncated Gradient (Langford et al., 2009)
Online Forward-Backward Splitting (Duchi and Singer, 2009)
Regularized Dual Averaging (Xiao, 2010)
Online Proximal Gradient (Martins et al., 2011a)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 92 / 128
![Page 262: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/262.jpg)
Online Proximal Gradient (Martins et al., 2011a)
input: gravity sequence (σt)Tt=1, stepsize sequence (ηt)T
t=1
initialize w = 0for t = 1, 2, . . . do
take training pair (xt , yt)gradient step: w ← w − ηt∇L(θ; xt , yt)sequential proximal steps:for j = 1, 2, . . . do
w ← proxηtσt Ωj(w)
end forend for
PAC Convergence. ε-accurate solution after T ≤ O(1/ε2) rounds
Computational efficiency. Each gradient step is linear in thenumber of features that fire.Each proximal step is linear in the number of groups M.Both are independent of D.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 93 / 128
![Page 263: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/263.jpg)
Online Proximal Gradient (Martins et al., 2011a)
input: gravity sequence (σt)Tt=1, stepsize sequence (ηt)T
t=1
initialize w = 0for t = 1, 2, . . . do
take training pair (xt , yt)gradient step: w ← w − ηt∇L(θ; xt , yt)sequential proximal steps:for j = 1, 2, . . . do
w ← proxηtσt Ωj(w)
end forend for
PAC Convergence. ε-accurate solution after T ≤ O(1/ε2) rounds
Computational efficiency. Each gradient step is linear in thenumber of features that fire.Each proximal step is linear in the number of groups M.Both are independent of D.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 93 / 128
![Page 264: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/264.jpg)
Implementation Tricks (Martins et al., 2011b)
Budget driven shrinkage. Instead of a regularization constant,specify a budget on the number of selected groups. Each proximalstep sets σt to meet this target.
Sparseptron. Let L(w) = w>(f(x , y)− f(x , y)) be the perceptronloss. The algorithm becomes perceptron with shrinkage.
Debiasing. Run a few iterations of sparseptron to identify therelevant groups. Then run a unregularized learner at a second stage.
Memory efficiency. Only asmall active set of features needto be maintained. Entire groupscan be deleted after eachproximal step.Many irrelevant features arenever instantiated.
0 5 10 150
2
4
6x 10
6
# Epochs
# Fe
atur
es
MIRA
Sparceptron + MIRA (B=30)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 94 / 128
![Page 265: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/265.jpg)
Implementation Tricks (Martins et al., 2011b)
Budget driven shrinkage. Instead of a regularization constant,specify a budget on the number of selected groups. Each proximalstep sets σt to meet this target.
Sparseptron. Let L(w) = w>(f(x , y)− f(x , y)) be the perceptronloss. The algorithm becomes perceptron with shrinkage.
Debiasing. Run a few iterations of sparseptron to identify therelevant groups. Then run a unregularized learner at a second stage.
Memory efficiency. Only asmall active set of features needto be maintained. Entire groupscan be deleted after eachproximal step.Many irrelevant features arenever instantiated.
0 5 10 150
2
4
6x 10
6
# Epochs
# Fe
atur
es
MIRA
Sparceptron + MIRA (B=30)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 94 / 128
![Page 266: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/266.jpg)
Summary of Algorithms
Converges? Rate? Sparse? Groups? Overlaps?Prox-grad (IST) X O(1/ε) X X Not easyFISTA X O(1/
√ε) X X Not easy
ADMM X O(1/ε) No X XOnline subgradient X O(1/ε2) No X NoTruncated gradient X O(1/ε2) X No NoFOBOS X O(1/ε2) Sort of X Not easyRDA X O(1/ε2) X X Not easyOnline prox-grad X O(1/ε2) X X X
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 95 / 128
![Page 267: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/267.jpg)
Outline
1 Introduction
2 Loss Functions and Sparsity
3 Structured Sparsity
4 Algorithms
Batch Algorithms
Online Algorithms
5 Applications
6 Conclusions
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 96 / 128
![Page 268: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/268.jpg)
Applications of Structured Sparsity in NLP
1 Non-overlapping groups by feature template
2 Tree-structured groups: coarse-to-fine
3 Arbitrarily overlapping groups
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 97 / 128
![Page 269: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/269.jpg)
Applications of Structured Sparsity in NLP
1 Non-overlapping groups by feature template
2 Tree-structured groups: coarse-to-fine
3 Arbitrarily overlapping groups
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 97 / 128
![Page 270: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/270.jpg)
Martins et al. (2011b): Group by Template
Feature templates provide a straightforward way to define non-overlappinggroups.
To achieve group sparsity, we optimize:
minw
1
N
N∑n=1
L(w; xn, yn)︸ ︷︷ ︸empirical loss
+ Ω(w)︸ ︷︷ ︸regularizer
where we use the `2,1 norm:
Ω(w) = λ
M∑m=1
λm‖wm‖2
for M groups/templates.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 98 / 128
![Page 271: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/271.jpg)
Structured Prediction Tasks (Martins et al., 2011b)
Chunking (CoNLL 2000 shared task; Sang and Buchholz, 2000)+0.5 F1 with 30 groups (out of 96)
NER (CoNLL 2002/3 shared tasks on Spanish, Dutch, English; Sang,2002; Sang and De Meulder, 2003)+1–2 F1 with 200 groups (out of 452)
Dependency parsing (CoNLL-X shared task on several languages;Buchholz and Marsi, 2006), 684 feature templates based onMcDonald et al. (2005)
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 99 / 128
![Page 272: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/272.jpg)
Which features get selected?
Qualitative analysis of selected templates:
Arabic Danish Japanese Slovene Spanish TurkishBilexical ++ + +Lex. → POS + +POS → Lex. ++ + + + +POS → POS ++ +Middle POS ++ ++ ++ ++ ++ ++Shape ++ ++ ++ ++Direction + + + + +Distance ++ + + + + +
(Empty: none or very few templates selected; +: some templatesselected; ++: most or all templates selected.)
Morphologically-rich languages with small datasets (Turkish andSlovene) avoid lexical features.
In Japanese, contextual POS appear to be especially relevant.
Take this with a grain of salt: some patterns may be properties ofthe datasets, not the languages!
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 100 / 128
![Page 273: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/273.jpg)
Which features get selected?
Qualitative analysis of selected templates:
Arabic Danish Japanese Slovene Spanish TurkishBilexical ++ + +Lex. → POS + +POS → Lex. ++ + + + +POS → POS ++ +Middle POS ++ ++ ++ ++ ++ ++Shape ++ ++ ++ ++Direction + + + + +Distance ++ + + + + +
(Empty: none or very few templates selected; +: some templatesselected; ++: most or all templates selected.)
Morphologically-rich languages with small datasets (Turkish andSlovene) avoid lexical features.
In Japanese, contextual POS appear to be especially relevant.
Take this with a grain of salt: some patterns may be properties ofthe datasets, not the languages!
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 100 / 128
![Page 274: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/274.jpg)
Sociolinguistic Association Discovery(Eisenstein et al., 2011)
Dataset:
geotagged tweets from 9,250 authorsmapping of locations to the U.S. Census’ ZIP code tabulation areas(ZCTAs)a ten-dimensional vector of statistics on demographic attributes
Can we learn a compact set of terms used on Twitter that associatewith demographics?
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 101 / 128
![Page 275: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/275.jpg)
Sociolinguistic Association Discovery(Eisenstein et al., 2011)
Setup: multi-output regression.
xn is a P-dimensional vector of independent variables; matrix isX ∈ RN×P
yn is a T -dimensional vector of dependent variables; matrix isY ∈ RN×T
wp,t is the regression coefficient for the pth variable in the tth task;matrix is W ∈ RP×T
Regularized objective with squared error loss typical for regression:
minW
Ω(W) + ‖Y − XW‖2F
Regressions are run in both directions.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 102 / 128
![Page 276: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/276.jpg)
Structured Sparsity with `∞,1
Drive entire rows of W to zero (Turlach et al., 2005): “somepredictors are useless for any task”
Ω(W) = λ
T∑t=1
maxp
wp,t
Optimization with blockwise coordinate ascent (Liu et al., 2009) andsome tricks to maintain sparsity (Eisenstein et al., 2011)
See also: Duh et al. (2010) used multitask regression and `2,1 toselect features useful for reranking across many instances (applicationin machine translation).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 103 / 128
![Page 277: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/277.jpg)
Predicting Demographics from Text(Eisenstein et al., 2011)
Predict 10-dimensional ZCTA characterization from words tweeted inthat region (vocabulary is P = 5, 418)Measure Pearson’s correlation between prediction and correct value(average over tasks, cross-validated test sets)Compare with truncated SVD, greatest variance across authors, mostfrequent words
102
103
0.16
0.18
0.2
0.22
0.24
0.26
0.28
number of features
ave
rag
e c
orr
ela
tion
multi!output lassoSVDhighest variancemost frequent
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 104 / 128
![Page 278: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/278.jpg)
Predictive Words (Eisenstein et al., 2011)
wh
ite
Afr
.A
m.
His
p.
En
g.
lan
g.
Sp
an
.la
ng
.
oth
erla
ng
.
urb
an
fam
ily
ren
ter
med
.in
c.
- - - + - + + +;) - + - +:( -:) -:d + - + - +as - + -awesome + - - - +break - + - -campus - + - -dead - + - + + +hell - + - -shit - +train - + +will - + -would + -atlanta - + - -famu + - + - - -harlem - +bbm - + - + + +lls + - + - -lmaoo - + + - + + + +lmaooo - + + - + + + +lmaoooo - + + - + + +lmfaoo - + - + + +lmfaooo - + - + + +lml - + + - + + + + -odee - + - + + +
wh
ite
Afr
.A
m.
His
p.
En
g.
lan
g.
Sp
an
.la
ng
.
oth
erla
ng
.
urb
an
fam
ily
ren
ter
med
.in
c.
omw - + + - + + + +smfh - + + - + + + +smh - + + +w| - + - + + + +
con + - + +la - + - +si - + - +dats - + - + -deadass - + + - + + + +haha + - -hahah + -hahaha + - - +ima - + - + +madd - - + +nah - + - + + +ova - + - +sis - + +skool - + - + + + -wassup - + + - + + + + -wat - + + - + + + + -ya - + +yall - +yep - + - - - -yoo - + + - + + + +yooo - + - + +
Table: Demographically-indicative terms discovered by multi-output sparseregression. Statistically significant (p < .05) associations are marked (+/-).
Significant p < 0.05 positive (+) and negative (-) associations in a69-feature model (see the paper for the rest).
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 105 / 128
![Page 279: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/279.jpg)
Non-overlapping Groups for “Some” Ambiguity
Learning mappings from word types to labels (POS or semantic predicates)
Semisupervised lexicon expansion with graph-based learning (Das andSmith, 2012)
Elitist lasso (squared `1,2; Kowalski and Torresani, 2009) for per-wordsparsity
λ∑
v
(∑y
|wv ,y |
)2
where v is a word and y is a label.+3% accuracy on unknown-word frame prediction, with 35% as manylexicon entries
Unsupervised POS tagging with posterior regularization (Graca et al.,2009)
Incorporates `∞,1 norm+2–7% accuracy on 1-many POS evaluation
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 106 / 128
![Page 280: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/280.jpg)
Applications of Structured Sparsity in NLP
1 Non-overlapping groups by feature template
2 Tree-structured groups: coarse-to-fine
3 Arbitrarily overlapping groups
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 107 / 128
![Page 281: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/281.jpg)
Log-Linear Language Models(Nelakanti et al., 2013)
Setup: multinomial logistic regression (Della Pietra et al., 1997)
p(y | x) =exp(w>y f(x))∑
v∈V exp(w>v f(x))
Regularized objective with logistic loss:
minw−
N∑i=1
log p(yi | x1:k ; w) + Ω(w)
There are many choices for Ω(w). A key consideration is that the size ofw increases rapidly as k gets bigger.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 108 / 128
![Page 282: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/282.jpg)
Log-Linear Language Models(Nelakanti et al., 2013)
Setup: multinomial logistic regression (Della Pietra et al., 1997)
p(y | x) =exp(w>y f(x))∑
v∈V exp(w>v f(x))
Regularized objective with logistic loss:
minw−
N∑i=1
log p(yi | x1:k ; w) + Ω(w)
There are many choices for Ω(w). A key consideration is that the size ofw increases rapidly as k gets bigger.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 108 / 128
![Page 283: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/283.jpg)
Log-Linear Language Models(Nelakanti et al., 2013)
Setup: multinomial logistic regression (Della Pietra et al., 1997)
p(y | x) =exp(w>y f(x))∑
v∈V exp(w>v f(x))
Regularized objective with logistic loss:
minw−
N∑i=1
log p(yi | x1:k ; w) + Ω(w)
There are many choices for Ω(w). A key consideration is that the size ofw increases rapidly as k gets bigger.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 108 / 128
![Page 284: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/284.jpg)
Log-Linear Language Models(Nelakanti et al., 2013)
Encode history suffixes from length 0 to k in a tree; each is a feature.
Tree-structured penalty: a longer suffix can only be included if all itsshorter suffixes are included.
Can use `2,1 or `∞,1 norm
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 109 / 128
![Page 285: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/285.jpg)
Experimental Results: AP-news
Good generalization results (perplexity):
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 110 / 128
![Page 286: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/286.jpg)
Experimental Results: AP-news
Small model size:
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 111 / 128
![Page 287: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/287.jpg)
Groups from Word Clusters(Yogatama and Smith, 2014a)
Task: text classification
Model: bag-of-words logistic regression
Hierarchical clusters from Brown et al. (1992): include the words in acluster only if its parent cluster is included.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 112 / 128
![Page 288: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/288.jpg)
Brown et al. (1992) Clusters
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 113 / 128
![Page 289: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/289.jpg)
Regularize or Add Features?
20-newsgroups binary tasks:
+ Brown features Browndataset baseline lasso ridge elastic group lassoscience 91.90 (ridge) 86.96 90.51 91.14 93.04sports 93.71 (elastic) 82.66 88.94 85.43 93.71religion 92.47 (ridge) 94.98 96.93 96.93 92.89computer 87.13 (elastic) 55.72 96.65 67.57 86.36
Caveat: we ought to use more data to learn the clusters!
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 114 / 128
![Page 290: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/290.jpg)
Applications of Structured Sparsity in NLP
1 Non-overlapping groups by feature template
2 Tree-structured groups: coarse-to-fine
3 Arbitrarily overlapping groups
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 115 / 128
![Page 291: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/291.jpg)
Groups from Data(Yogatama and Smith, 2014b)
Task: text classification
Model: bag-of-words logistic regression
Groups: one group for every sentence in every training-set document
Intuition: only some sentences are relevantPast work used latent “relevance” variables (Yessenalina et al., 2010;Tackstrom and McDonald, 2011)
Use ADMM to handle thousands/millions of overlapping groups.
Copy weights allow inspection to see which training sentences are“selected”Additional `1 penalty for strong sparsity
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 116 / 128
![Page 292: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/292.jpg)
Topic Classification (IBM vs. Mac)
Bars show log-odds effect of removing the sentence: sentence, elastic,ridge, lasso.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 117 / 128
![Page 293: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/293.jpg)
Sentiment Analysis(Amazon DVDs; Blitzer et al., 2007)
Bars show log-odds effect of removing the sentence: sentence, elastic,ridge, lasso.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 118 / 128
![Page 294: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/294.jpg)
Outline
1 Introduction
2 Loss Functions and Sparsity
3 Structured Sparsity
4 Algorithms
Batch Algorithms
Online Algorithms
5 Applications
6 Conclusions
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 119 / 128
![Page 295: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/295.jpg)
Summary
Sparsity is desirable in NLP: feature selection, runtime, memoryfootprint, interpretability
Beyond plain sparsity: structured sparsity can be promoted throughgroup-Lasso regularization
Choice of groups reflects prior knowledge about the desired sparsitypatterns.
We have seen examples for feature template selection, tree structures,and data-driven groups, but many more are possible!
Small/medium scale: many batch algorithms available, with fastconvergence (IST, FISTA, SpaRSA, ...)
Large scale: distributed optimization algorithms (ADMM) or onlineproximal-gradient algorithms suitable to explore large feature spaces
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 120 / 128
![Page 296: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/296.jpg)
Thank you!
Questions?
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 121 / 128
![Page 297: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/297.jpg)
Acknowledgments
National Science Foundation (USA), CAREER grant IIS-1054319
Fundacao para a Ciencia e Tecnologia (Portugal), grantsPEst-OE/EEI/LA0008/2011 and PTDC/EEI-SII/2312/2012.
Fundacao para a Ciencia e Tecnologia and Information andCommunication Technologies Institute (Portugal/USA), through theCMU-Portugal Program.
Priberam: QREN/POR Lisboa (Portugal), EU/FEDER programme,Intelligo project, contract 2012/24803.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 122 / 128
![Page 298: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/298.jpg)
References I
Afonso, M., Bioucas-Dias, J., and Figueiredo, M. (2010). Fast image recovery using variable splitting and constrainedoptimization. IEEE Transactions on Image Processing, 19:2345–2356.
Amaldi, E. and Kann, V. (1998). On the approximation of minimizing non zero variables or unsatisfied relations in linearsystems. Theoretical Computer Science, 209:237–260.
Bakin, S. (1999). Adaptive regression and model selection in data mining problems. PhD thesis, Australian National University.
Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journalon Imaging Sciences, 2(1):183–202.
Bioucas-Dias, J. and Figueiredo, M. (2007). A new twist: two-step iterativeshrinkage/thresholding algorithms for imagerestoration. IEEE Transactions on Image Processing, 16:2992–3004.
Blitzer, J., Dredze, M., and Pereira, F. (2007). Biographies, bollywood, boom-boxes and blenders: Domain adaptation forsentiment classification. In Proc. of ACL.
Bottou, L. and Bousquet, O. (2007). The tradeoffs of large scale learning. NIPS, 20.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2010). Distributed optimization and statistical learning via thealternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122.
Brown, P. F., deSouza, P. V., Mercer, R. L., Pietra, V. J. D., and Lai, J. C. (1992). Class-based n-gram models of naturallanguage. Computational Linguistics, 18(4):467–479.
Buchholz, S. and Marsi, E. (2006). CoNLL-X shared task on multilingual dependency parsing. In Proc. of CoNLL.
Candes, E., Romberg, J., and Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incompletefrequency information. IEEE Transactions on Information Theory, 52:489–509.
Caruana, R. (1997). Multitask learning. Machine Learning, 28(1):41–75.
Cessie, S. L. and Houwelingen, J. C. V. (1992). Ridge estimators in logistic regression. Journal of the Royal Statistical Society;Series C, 41:191–201.
Chen, S. and Rosenfeld, R. (1999). A Gaussian prior for smoothing maximum entropy models. Technical report,CMU-CS-99-108.
Claerbout, J. and Muir, F. (1973). Robust modelling of erratic data. Geophysics, 38:826–844.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 123 / 128
![Page 299: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/299.jpg)
References IICombettes, P. and Wajs, V. (2006). Signal recovery by proximal forward-backward splitting. Multiscale Modeling and
Simulation, 4:1168–1200.
Das, D. and Smith, N. A. (2012). Graph-based lexicon expansion with sparsity-inducing penalties. In Proceedings of NAACL.
Daubechies, I., Defrise, M., and De Mol, C. (2004). An iterative thresholding algorithm for linear inverse problems with asparsity constraint. Communications on Pure and Applied Mathematics, 11:1413–1457.
Davis, G., Mallat, S., and Avellaneda, M. (1997). Greedy adaptive approximation. Journal of Constructive Approximation,13:57–98.
Della Pietra, S., Della Pietra, V., and Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on PatternAnalysis and Machine Intelligence, 19:380–393.
Donoho, D. (2006). Compressed sensing. IEEE Transactions on Information Theory, 52:1289–1306.
Duchi, J. and Singer, Y. (2009). Efficient online and batch learning using forward backward splitting. JMLR, 10:2873–2908.
Duh, K., Sudoh, K., Tsukada, H., Isozaki, H., and Nagata, M. (2010). n-best reranking by multitask learning. In Proceedings ofthe Joint Fifth Workshop on Statistical Machine Translation and Metrics.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32:407–499.
Eisenstein, J., Smith, N. A., and Xing, E. P. (2011). Discovering sociolinguistic associations with structured sparsity. In Proc. ofACL.
Figueiredo, M. and Bioucas-Dias, J. (2011). An alternating direction algorithm for (overlapping) group regularization. In Signalprocessing with adaptive sparse structured representations–SPARS11. Edinburgh, UK.
Figueiredo, M. and Nowak, R. (2003). An EM algorithm for wavelet-based image restoration. IEEE Transactions on ImageProcessing, 12:986–916.
Figueiredo, M., Nowak, R., and Wright, S. (2007). Gradient projection for sparse reconstruction: application to compressedsensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing: Special Issue on ConvexOptimization Methods for Signal Processing, 1:586–598.
Friedman, J., Hastie, T., Rosset, S., Tibshirani, R., and Zhu, J. (2004). Discussion of three boosting papers. Annals ofStatistics, 32(1):102–107.
Friedman, J., Hastie, T., and Tibshirani, R. (2010). A note on the group lasso and a sparse group lasso. Technical report,Stanford University.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 124 / 128
![Page 300: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/300.jpg)
References IIIFu, W. (1998). Penalized regressions: the bridge versus the lasso. Journal of computational and graphical statistics, pages
397–416.
Goodman, J. (2004). Exponential priors for maximum entropy models. In Proc. of NAACL.
Graca, J., Ganchev, K., Taskar, B., and Pereira, F. (2009). Posterior vs. parameter sparsity in latent variable models. Advancesin Neural Information Processing Systems.
Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research,3:1157–1182.
Hale, E., Yin, W., and Zhang, Y. (2008). Fixed-point continuation for l1-minimization: Methodology and convergence. SIAMJournal on Optimization, 19:1107–1130.
Hastie, T., Taylor, J., Tibshirani, R., and Walther, G. (2007). Forward stagewise regression and the monotone lasso. ElectronicJournal of Statistics, 1:1–29.
Hestenes, M. R. (1969). Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4:303–320.
Jenatton, R., Audibert, J.-Y., and Bach, F. (2009). Structured variable selection with sparsity-inducing norms. Technical report,arXiv:0904.3523.
Jenatton, R., Mairal, J., Obozinski, G., and Bach, F. (2011). Proximal methods for hierarchical sparse coding. Journal ofMachine Learning Research, 12:2297–2334.
Kazama, J. and Tsujii, J. (2003). Evaluation and extension of maximum entropy models with inequality constraints. In Proc. ofEMNLP.
Kim, S. and Xing, E. (2010). Tree-guided group lasso for multi-task regression with structured sparsity. In Proc. of ICML.
Kowalski, M. and Torresani, B. (2009). Sparsity and persistence: mixed norms provide simple signal models with dependentcoefficients. Signal, Image and Video Processing, 3(3):251–264.
Lanckriet, G. R. G., Cristianini, N., Bartlett, P., Ghaoui, L. E., and Jordan, M. I. (2004). Learning the kernel matrix withsemidefinite programming. JMLR, 5:27–72.
Langford, J., Li, L., and Zhang, T. (2009). Sparse online learning via truncated gradient. JMLR, 10:777–801.
Liu, H., Palatucci, M., and Zhang, J. (2009). Blockwise coordinate descent procedures for the multi-task lasso, with applicationsto neural semantic basis discovery. In Proceedings of the 26th Annual International Conference on Machine Learning, pages649–656. ACM.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 125 / 128
![Page 301: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/301.jpg)
References IVMairal, J., Jenatton, R., Obozinski, G., and Bach, F. (2010). Network flow algorithms for structured sparsity. In Advances in
Neural Information Processing Systems.
Martins, A. F. T., Figueiredo, M. A. T., Aguiar, P. M. Q., Smith, N. A., and Xing, E. P. (2011a). Online learning of structuredpredictors with multiple kernels. In Proc. of AISTATS.
Martins, A. F. T., Smith, N. A., Aguiar, P. M. Q., and Figueiredo, M. A. T. (2011b). Structured Sparsity in StructuredPrediction. In Proc. of Empirical Methods for Natural Language Processing.
Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q., and Figueiredo, M. A. T. (2010). Turbo parsers: Dependencyparsing by approximate variational inference. In Proc. of EMNLP.
McDonald, R. T., Pereira, F., Ribarov, K., and Hajic, J. (2005). Non-projective dependency parsing using spanning treealgorithms. In Proc. of HLT-EMNLP.
Muthukrishnan, S. (2005). Data Streams: Algorithms and Applications. Now Publishers, Boston, MA.
Nelakanti, A., Archambeau, C., Mairal, J., Bach, F., and Bouchard, G. (2013). Structured penalties for log-linear languagemodels. In Proc. of EMNLP.
Nesterov, Y. (2007). Gradient methods for minimizing composite objective function. Technical report, CORE report.
Nesterov, Y. (2009). Primal-dual subgradient methods for convex problems. Mathematical programming, 120(1):221–259.
Obozinski, G., Taskar, B., and Jordan, M. (2010). Joint covariate selection and joint subspace selection for multipleclassification problems. Statistics and Computing, 20(2):231–252.
Osborne, M., Presnell, B., and Turlach, B. (2000). A new approach to variable selection in least squares problems. IMA Journalof Numerical Analysis, 20:389–403.
Perkins, S., Lacker, K., and Theiler, J. (2003). Grafting: Fast, incremental feature selection by gradient descent in functionspace. Journal of Machine Learning Research, 3:1333–1356.
Powell, M. J. D. (1969). A method for nonlinear constraints in minimization problems. In Fletcher, R., editor, Optimization,pages 283–298. Academic Press.
Quattoni, A., Carreras, X., Collins, M., and Darrell, T. (2009). An efficient projection for l1,∞ regularization. In Proc. of ICML.
Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In Proc. of EMNLP.
Sang, E. (2002). Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In Proc. ofCoNLL.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 126 / 128
![Page 302: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/302.jpg)
References VSang, E. and Buchholz, S. (2000). Introduction to the CoNLL-2000 shared task: Chunking. In Proceedings of CoNLL-2000 and
LLL-2000.
Sang, E. and De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entityrecognition. In Proc. of CoNLL.
Schaefer, R., Roi, L., and Wolfe, R. (1984). A ridge logistic estimator. Communications in Statistical Theory and Methods,13:99–113.
Schmidt, M. and Murphy, K. (2010). Convex structure learning in log-linear models: Beyond pairwise potentials. In Proc. ofAISTATS.
Shor, N. (1985). Minimization Methods for Non-differentiable Functions. Springer.
Stojnic, M., Parvaresh, F., and Hassibi, B. (2009). On the reconstruction of block-sparse signals with an optimal number ofmeasurements. Signal Processing, IEEE Transactions on, 57(8):3075–3085.
Tackstrom, O. and McDonald, R. (2011). Discovering fine-grained sentiment with latent variable structured prediction models.In Proc. of ECIR.
Taylor, H., Bank, S., and McCoy, J. (1979). Deconvolution with the `1 norm. Geophysics, 44:39–52.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B., pages267–288.
Tikhonov, A. (1943). On the stability of inverse problems. In Dokl. Akad. Nauk SSSR, volume 39, pages 195–198.
Turlach, B. A., Venables, W. N., and Wright, S. J. (2005). Simultaneous variable selection. Technometrics, 47(3):349–363.
Wiener, N. (1949). Extrapolation, Interpolation, and Smoothing of Stationary Time Series. Wiley, New York.
Williams, P. (1995). Bayesian regularization and pruning using a Laplace prior. Neural Computation, 7:117–143.
Wright, S., Nowak, R., and Figueiredo, M. (2009). Sparse reconstruction by separable approximation. IEEE Transactions onSignal Processing, 57:2479–2493.
Xiao, L. (2010). Dual averaging methods for regularized stochastic learning and online optimization. Journal of MachineLearning Research, 11:2543–2596.
Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods. In Proc. of ACL.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 127 / 128
![Page 303: Structured Sparsity in Natural Language Processingafm/Home_files/eacl2014tutorial.pdf · 2014-04-27 · Structured Sparsity in Natural Language Processing: Models, Algorithms, and](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e8707501a6ff608dd63f857/html5/thumbnails/303.jpg)
References VI
Yessenalina, A., Yue, Y., and Cardie, C. (2010). Multi-level structured models for document sentiment classification. In Proc. ofEMNLP.
Yogatama, D. and Smith, N. A. (2014a). Linguistic structured sparsity in text categorization. In Proc. of ACL.
Yogatama, D. and Smith, N. A. (2014b). Making the most of bag of words: Sentence regularization with alternating directionmethod of multipliers. In Proc. of ICML.
Yuan, L., Liu, J., and Ye, J. (2011). Efficient methods for overlapping group lasso. In Advances in Neural InformationProcessing Systems 24, pages 352–360.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the RoyalStatistical Society (B), 68(1):49.
Zhao, P., Rocha, G., and Yu, B. (2009). Grouped and hierarchical model selection through composite absolute penalties. Annalsof Statistics, 37(6A):3468–3497.
Zhu, J., Lao, N., and Xing, E. (2010). Grafting-light: fast, incremental feature selection and structure learning of markovrandom fields. In Proc. of International Conference on Knowledge Discovery and Data Mining, pages 303–312.
Martins, Yogatama, Smith, Figueiredo (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp14 128 / 128