An Introduction to Structured Prediction
Carlo Ciliberto
20/11/2019
Electrical and Electronics Engineering
Imperial College London
Outline
Structured prediction: what & why?
Surrogate Frameworks
Examples
The Surrogate Approach
Likelihood Estimation Approaches
Structured Prediction with Implicit Embeddings
Structured prediction: what & why?
Structured Prediction
Structured Prediction Vs. Supervised Learning
Q: This seems “just” standard supervised learning, doesn’t it?
• Learn f : X → Y,
• Given many training examples (xi, yi)_{i=1}^n.
A: Indeed it is supervised learning!
However, standard learning methods do not apply here...
What changes is what we do to learn f .
Supervised Learning 101
• X input space, Y output space,
• ℓ : Y × Y → R loss function,
• ρ unknown probability on X × Y.
Goal: find f∗ : X → Y,
f∗ = argmin_{f : X→Y} E(f),   E(f) = E[ℓ(f(x), y)],
given only the dataset (xi, yi)_{i=1}^n sampled independently from ρ.
Prototypical Approach: Empirical Risk Minimization
Solve
f = argmin_{f∈F} (1/n) ∑_{i=1}^n ℓ(f(xi), yi),
where F ⊆ {f : X → Y} (usually a convex function space).
If Y is a vector space (e.g. Y = R):
• F easy to choose/optimize over: (generalized) linear models,
Kernel methods, Neural Networks, etc.
Example: Linear models. X = Rd
• f(x) = w⊤x for some w ∈ Rd.
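As a minimal runnable sketch of ERM in this linear setting (illustrative assumptions not on the slide: X = Y = R, the squared loss ℓ(y, y′) = (y − y′)², and a synthetic dataset):

```python
import random

# ERM sketch for a 1-D linear model f(x) = w*x with the squared loss
# l(y, y') = (y - y')^2. The slides leave the loss and model abstract;
# these are illustrative choices.

def erm_linear(xs, ys):
    """Closed-form minimizer of (1/n) * sum (w*x_i - y_i)^2 over w."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def empirical_risk(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(200)]
ys = [2.0 * x + random.gauss(0, 0.1) for x in xs]  # true slope w = 2

w_hat = erm_linear(xs, ys)
print(w_hat)  # close to 2.0 on this synthetic sample
```

The one-parameter case admits a closed form; in general the minimization over F is done numerically.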
Empirical Risk Minimization (ERM)
We are interested in controlling the Excess Risk of f
E(f) − E(f∗)
Wish list:
• Consistency:
lim_{n→+∞} E(f) − E(f∗) = 0
• Learning Rates:
E(f) − E(f∗) ≤ O(n^{−γ}),
with γ > 0 (the larger the better).
Prototypical Results: Empirical Risk Minimization
Several results allow us to study ERM’s consistency and rates when:
• Y = Rd and,
• F is a “standard” space of functions (e.g. a reproducing
kernel Hilbert space).
Examples of techniques/notions used to obtain these results:
• VC dimension,
• Rademacher & Gaussian complexity,
• Covering numbers,
• Stability,
• Empirical processes,
• . . . .
Prototypical Approach: Empirical Risk Minimization
Solve
f = argmin_{f∈F} (1/n) ∑_{i=1}^n ℓ(f(xi), yi),
where F ⊆ {f : X → Y} (usually a convex function space).
If Y is a vector space (e.g. Y = R):
• F easy to choose/optimize over: (generalized) linear models,
Kernel methods, Neural Networks, etc.
If Y is a “structured” space:
• How to choose F?
• How to perform optimization over it?
• How to study the statistics of f over F?
Structured Prediction Methods
Y arbitrary: how do we parametrize F and learn f?
Surrogate approaches
+ Clear theory (e.g. convergence and learning rates)
– Only for special cases (classification, ranking, multi-labeling etc.)
[Bartlett et al., 2006, Duchi et al., 2010, Mroueh et al., 2012]
Score learning techniques
+ General algorithmic framework
(e.g. StructSVM [Tsochantaridis et al., 2005])
– Limited theory (no consistency; see e.g. [Bakir et al., 2007])
Surrogate Frameworks
Example: Binary Classification setting
Binary Classification:
• “any” input space X
• output space Y = {−1, 1}
• 0-1 loss function, i.e. ℓ(y, y′) = 1{y ≠ y′}, which is 0 if y = y′ and 1 otherwise.
Example: Binary Classification Problem
• A classification rule is a map f : X → Y
• The risk of a rule f is E(f) = E_{(x,y)∼ρ}[1{f(x) ≠ y}].
• The classification rule that minimizes E is
f∗ : X → Y,   f∗(x) = argmax_{y∈Y} ρ(y | x).
• Why? Exercise :)
Example: Binary Classification
Goal: approximate f∗ given a training set (xi, yi)_{i=1}^n.
Issues:
i) Y is not linear ⇒ H = {classification rules} is not linear!
ii) ℓ(y, y′) = 1{y ≠ y′} is not convex ⇒ very hard to minimize!
Example: Binary Classification
Idea:
i) Rephrase the problem using a linear output space,
ii) Find a good convex “replacement” for ℓ.

i) Replace Y = {−1, 1} with R and consider functions g : X → R (a “surrogate” classification rule).
ii) Replace ℓ(y, y′) = 1{y ≠ y′} with a non-negative convex “surrogate” loss L : R × R → R+, e.g. the logistic, least-squares, or hinge loss.
Loss Functions for Binary Classification
Loss functions of the form L(y, y′) = L(y · y′)
[Plot: the 0-1 loss and its convex surrogates (square, hinge, logistic) as functions of the margin y · y′.]
Surrogate ERM
The loss L induces a surrogate risk
R(g) = E_{(x,y)∼ρ}[L(g(x), y)],
and we can define the surrogate ERM estimator
g = argmin_{g∈G} Rn(g),   Rn(g) = (1/n) ∑_{i=1}^n L(g(xi), yi).
Modeling. The output space is linear ⇒ many options for G!
Optimization. The loss is convex ⇒ we can efficiently find g!
Statistics. Standard results ⇒ generalization properties of g!
R(g) − R(g∗) → 0
Surrogate Vs. Original Problems
This is all nice and well, but . . .
• How can we go from g : X → R to some f : X → Y?
Standard approach: f(x) = sign(g(x))
• How is g∗ related to f∗?
Exercise. f∗(x) = sign(g∗(x))!
• Are surrogate learning rates for g of any use?
Theorem.
E(f) − E(f∗) ≤ ϕ(R(g) − R(g∗)),
(where ϕ : R → R+ depends on the surrogate loss L).
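The whole surrogate route for binary classification can be sketched end to end (a toy setup, not from the slides: a least-squares surrogate with a 1-D linear model, labels following sign(x) with 10% label noise):

```python
import random

# Surrogate route for binary classification: fit g(x) = w*x by least
# squares against labels y in {-1, +1}, then decode f(x) = sign(g(x)).
# The data-generating process below is an illustrative assumption.

def fit_least_squares(xs, ys):
    # argmin_w (1/n) sum (w*x_i - y_i)^2, closed form for one parameter
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def decode(g_value):
    return 1 if g_value >= 0 else -1

random.seed(1)
xs = [random.uniform(-1, 1) for _ in range(500)]
# Labels follow sign(x), flipped with probability 0.1 (label noise).
ys = [(1 if x >= 0 else -1) * (1 if random.random() > 0.1 else -1) for x in xs]

w = fit_least_squares(xs, ys)
f = lambda x: decode(w * x)

# With w > 0, sign(w*x) agrees with the Bayes rule sign(x) everywhere,
# even though the training labels are noisy.
accuracy = sum(f(x) == (1 if x >= 0 else -1) for x in xs) / len(xs)
print(w > 0, accuracy)
```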
Example: Multiclass Classification setting
Multiclass Classification:
• input space X
• output space Y = {1, 2, . . . , T}
• 0-1 loss function, i.e. ℓ(y, y′) = 1{y ≠ y′}
Issues:
• Can we still map Y into R?
• What surrogate L can replace ℓ?
Example: Multiclass Classification
• Attempt 1: Y = {1, 2, . . . , T} ⊂ R. Could replace Y with R
Not a good choice: induces an arbitrary distance on classes.
(i.e. 1 is closer to 2 than to 3, and so on . . . )
• Attempt 2: replace Y = {1, 2, . . . , T} with RT .
“Replace” means “embed” Y into RT using an
encoding c : Y → RT defined by
c(i) = ei,   i = 1, . . . , T,
where ei is the ith vector of the canonical basis of RT .
Example: Multiclass Classification
Given a surrogate loss L : RT ×RT → R (hinge? least squares?)...
. . . we can train the surrogate estimator g : X → RT
g = argmin_{g∈G} (1/n) ∑_{i=1}^n L(g(xi), c(yi)).
Question: But g has values in RT . . . How can we go back to Y?
Answer: via a decoding routine!
f(x) = argmax_{t=1,...,T} gt(x)
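The encoding/decoding pair of this slide can be sketched directly (classes are 0-indexed here for convenience, and T = 4 is an arbitrary choice):

```python
# Encoding/decoding mechanics for multiclass classification:
# c(i) = e_i (one-hot) and d(r) = argmax_t r_t.

T = 4  # number of classes (illustrative)

def encode(y):
    """c : {0, ..., T-1} -> R^T, class index to one-hot vector."""
    e = [0.0] * T
    e[y] = 1.0
    return e

def decode(r):
    """d : R^T -> {0, ..., T-1}, argmax over coordinates."""
    return max(range(T), key=lambda t: r[t])

# Decoding inverts encoding: d(c(y)) = y for every class y.
assert all(decode(encode(y)) == y for y in range(T))
# A surrogate estimator g only needs its largest coordinate to be right:
print(decode([0.1, 0.7, 0.3, 0.2]))  # class 1
```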
Multiclass Classification: Surrogate Vs. Original Problems
The same questions as for binary classification . . .
• How can we go from g : X → RT to some f : X → Y?
Decoding: f(x) = argmax_{t=1,...,T} gt(x)
• How is g∗ related to f∗?
Not clear: it strongly depends on L!
• Are surrogate learning rates for g of any use?
Not clear: it strongly depends on L!
The Surrogate Approach
Taking inspiration from the previous examples . . .
Surrogate Approach: Key Ingredients
A possible approach to structured prediction is to find:
1. A linear surrogate space H,
2. An encoding c : Y → H,
3. A surrogate loss L : H × H → R,
4. A decoding d : H → Y.
Surrogate Approach: Recipe
Then:
1. Encode the training set (xi, yi)_{i=1}^n into (xi, c(yi))_{i=1}^n,
2. Learn g = argmin_{g∈G} (1/n) ∑_{i=1}^n L(g(xi), c(yi))
(using standard supervised learning methods),
3. Decode f = d ◦ g.
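A toy end-to-end instance of the recipe (illustrative stand-ins, not from the slides: one-hot encoding c, a crude nearest-neighbour average in place of the surrogate ERM step, argmax decoding d):

```python
# The three-step recipe sketched on a toy 1-D multiclass problem.

def c(y, T):                       # step 1: encode label into R^T
    e = [0.0] * T
    e[y] = 1.0
    return e

def learn_g(xs, ys, T):            # step 2: simplified surrogate learner
    # Predict the encoded target averaged over the k nearest training
    # points -- an illustrative stand-in for surrogate ERM.
    def g(x):
        k = 3
        nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))[:k]
        return [sum(c(ys[i], T)[t] for i in nearest) / k for t in range(T)]
    return g

def d(r):                          # step 3: decode back to a label
    return max(range(len(r)), key=lambda t: r[t])

# Toy data: class 0 near x=0, class 1 near x=1, class 2 near x=2.
xs = [0.0, 0.1, -0.1, 1.0, 1.1, 0.9, 2.0, 2.1, 1.9]
ys = [0, 0, 0, 1, 1, 1, 2, 2, 2]
g = learn_g(xs, ys, T=3)
f = lambda x: d(g(x))              # f = d o g
print([f(0.05), f(1.05), f(2.05)])  # [0, 1, 2]
```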
Wish list
However, recall that learning g is solving a different problem...
R(g) = ∫ L(g(x), c(y)) dρ(x, y).
In order to be “useful”, a surrogate framework needs to satisfy:
• Fisher Consistency. E(f∗) = E(d ◦ g∗)
• Comparison Inequality. For any g : X → H,
E(d ◦ g) − E(f∗) ≤ ϕ(R(g) − R(g∗)),
with ϕ : R+ → R+ continuous, non-decreasing and ϕ(0) = 0.
Fisher consistency
Fisher consistency. We want this because it certifies that the surrogate problem and the decoding procedure are well chosen: decoding the best surrogate solution, d ◦ g∗, achieves the same risk as the best original solution f∗.
Comparison inequality
Comparison inequality. If we learn a g which approximates g∗. . .
R(g)−R(g∗)→ 0 as n→ +∞.
. . . then the comparison inequality implies,
E(d ◦ g)− E(f∗)→ 0 as n→ +∞.
Therefore f := d ◦ g is a good estimator for the original problem!
Rates. Moreover, if R(g) − R(g∗) ≤ n^{−α} for some α > 0, then
E(f) − E(f∗) ≤ ϕ(n^{−α}).
Knowledge of ϕ allows us to derive rates for f from the rates of g!
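As a concrete instance (a classical result for binary classification with the least-squares surrogate, in the spirit of [Bartlett et al., 2006]; not derived on these slides), the comparison inequality holds with ϕ(t) = √t, so surrogate rates transfer with a square root:

```latex
% Binary classification with the least-squares surrogate
% L(g(x), y) = (g(x) - y)^2: the comparison inequality holds with
% \varphi(t) = \sqrt{t}, i.e.
\mathcal{E}(d \circ g) - \mathcal{E}(f^{*})
  \;\le\; \sqrt{\mathcal{R}(g) - \mathcal{R}(g^{*})}\,,
% hence a surrogate rate \mathcal{R}(g) - \mathcal{R}(g^{*}) \le n^{-\alpha}
% yields
\mathcal{E}(f) - \mathcal{E}(f^{*}) \;\le\; n^{-\alpha/2}.
```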
Going back to the examples...
Surrogate framework for binary classification:
• Y = {1,−1}, H = R
• coding c : {1,−1} → R is the embedding Y ↪→ R
• L : R × R → R+: least squares, hinge, logistic
• decoding d : R→ {1,−1} is d(r) = sign(r).
Fisher consistency? Comparison inequality?
Exercise for the reader! :)
Going back to the examples...
Surrogate framework for multiclass classification:
• Y = {1, 2, . . . , T}, H = RT
• coding c : {1, 2, . . . , T} ↪→ RT with c(i) = ei.
• L : RT × RT → R+: least squares, hinge ×.
• decoding d : RT → {1, 2, . . . , T} is d(r) = argmaxt=1,...,T rt.
Fisher consistency? Comparison inequality?
Exercise for the reader! :)
To sum up
Pros
• Modeling. Directly borrow from ERM literature to design
(surrogate) learning algorithms (vector-valued regression!)
• Statistics. Extend surrogate ERM rates for g to f by means
of the comparison inequality.
• Optimization. Bypasses/postpones dealing with the non-convex ℓ until prediction time!
Cons
• Flexibility. Need to design a surrogate framework (H, c,L, d)
on a case-by-case basis for any (`,Y).
Likelihood Estimation Approaches
A standard approach
Alternative approach to address structured prediction problems:
• Model the likelihood of observing y given x as a function
F∗ : X × Y → [0, 1] with F∗(x, y) = ρ(y | x).
• Learn F : X × Y → R such that, ideally, F → F∗.
• Then:
• Ideal solution: f∗(x) = argmax_{y∈Y} F∗(x, y)
• Approximate solution: f(x) = argmax_{y∈Y} F(x, y)
Struct SVM [Tsochantaridis et al., 2005]
Model:
• joint feature map Ψ : Y × X → F with F a Hilbert space.
• F (y, x) = 〈w,Ψ(y, x)〉 with w ∈ F a parameter vector.
Algorithm: Find the parameters w that solve
minw∈F
‖w‖2
〈w,Ψ(yi, xi)〉 ≥ 〈w,Ψ(y, xi)〉+ 1
∀i = 1, . . . , n, ∀y ∈ Y \ yi
Intuition: the best y∗(x) must be such that F (x, y∗(x)) is
considerably larger than any other F (x, y)
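Prediction with such a model can be sketched when Y is small enough to enumerate (the feature map psi and the weights w below are toy choices, not from the slides):

```python
# Inference with a learned joint score F(y, x) = <w, Psi(y, x)>,
# sketched for a small finite Y where the argmax can be enumerated.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def psi(y, x):
    # Toy joint feature map: place x in the block of coordinates
    # belonging to class y (an illustrative choice).
    T, d = 3, len(x)
    feats = [0.0] * (T * d)
    feats[y * d:(y + 1) * d] = x
    return feats

def predict(w, x, labels=(0, 1, 2)):
    # f(x) = argmax_y <w, Psi(y, x)> -- tractable here because Y is tiny.
    return max(labels, key=lambda y: dot(w, psi(y, x)))

# Hand-picked w so that class t is scored by the t-th input coordinate.
w = [1, 0, 0,   0, 1, 0,   0, 0, 1]
print(predict(w, [0.2, 0.9, 0.1]))  # class 1: its block scores highest
```

In genuine structured problems Y is exponentially large, so this argmax itself requires problem-specific combinatorial optimization.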
Including the Loss
However, things are more complicated. . .
We don’t want to simply maximise ρ(y | x): the loss function ℓ is part of the problem:
E(f) = ∫ ℓ(f(x), y) dρ(y | x) dρX(x)
Struct SVM Variants
Model:
• joint feature map Ψ : Y × X → F, with F a Hilbert space.
• F(x, y) = ⟨w, Ψ(y, x)⟩, with w ∈ F a parameter vector.
Algorithm: find the parameters w that solve
min_{w∈F} ‖w‖²
subject to ⟨w, Ψ(yi, xi)⟩ ≥ ⟨w, Ψ(y, xi)⟩ + ℓ(yi, y), ∀i = 1, …, n, ∀y ∈ Y \ {yi}
36
![Page 56: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/56.jpg)
Struct SVM Variants
Model:
• joint feature map Ψ : Y × X → F, with F a Hilbert space.
• F(x, y) = ⟨w, Ψ(y, x)⟩, with w ∈ F a parameter vector.
Algorithm: find the parameters w that solve
min_{w∈F} ‖w‖² + (C/n) Σ_{i=1}^n ξi
subject to ⟨w, Ψ(yi, xi)⟩ ≥ ⟨w, Ψ(y, xi)⟩ + ℓ(yi, y) − ξi, ∀i = 1, …, n, ∀y ∈ Y \ {yi}
Generalizing the "slack" variables of the standard SVM…
37
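In practice the constrained program above is solved with cutting-plane or subgradient methods. As a purely illustrative sketch (not the algorithm from the slides), the following runs subgradient descent on the equivalent unconstrained (slack-substituted) objective for a toy multiclass problem, assuming the joint feature map Ψ(y, x) = e_y ⊗ x; all data and hyperparameters are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n = 3, 5, 60                      # classes, input dim, sample size

# Synthetic data: class y concentrated around a random class mean.
means = 3.0 * rng.normal(size=(T, d))
ys = rng.integers(0, T, size=n)
xs = means[ys] + rng.normal(size=(n, d))

V = np.ones((T, T)) - np.eye(T)         # 0-1 loss written as a matrix

def psi(y, x):
    """Joint feature map Psi(y, x) = e_y (outer) x, flattened."""
    out = np.zeros((T, x.size))
    out[y] = x
    return out.ravel()

def train(xs, ys, C=1.0, lr=0.01, epochs=50):
    # Subgradient descent on:
    #   ||w||^2 + (C/n) sum_i max_y [ l(y_i, y) + <w, Psi(y, x_i) - Psi(y_i, x_i)> ]
    w = np.zeros(T * xs.shape[1])
    for _ in range(epochs):
        grad = 2.0 * w
        for x, y in zip(xs, ys):
            # Loss-augmented inference: find the most violated output.
            scores = [V[y, z] + w @ psi(z, x) for z in range(T)]
            z_hat = int(np.argmax(scores))
            if z_hat != y:
                grad += (C / len(xs)) * (psi(z_hat, x) - psi(y, x))
        w -= lr * grad
    return w

def predict(w, x):
    return int(np.argmax([w @ psi(z, x) for z in range(T)]))

w = train(xs, ys)
train_acc = np.mean([predict(w, x) == y for x, y in zip(xs, ys)])
```

Each training iteration performs loss-augmented inference (the inner argmax over Y), which is the step that becomes expensive for large structured output spaces.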
![Page 57: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/57.jpg)
Optimization
38
![Page 58: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/58.jpg)
To sum up
Pros
• Flexibility. Can be applied to virtually any problem.
Cons
• Optimization. Requires solving an optimization over Y and
with respect to ℓ at every iteration. It can become very
expensive!
• Statistics. It has been shown that in some cases this
approach is not consistent [Bakir et al., 2007].
39
![Page 59: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/59.jpg)
Examples: Language Parsing
40
![Page 60: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/60.jpg)
Examples: Image Segmentation
41
![Page 61: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/61.jpg)
Examples: Pose Estimation
42
![Page 62: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/62.jpg)
To wrap up. . .
Surrogate approaches
+ Clear theory (e.g. convergence and learning rates)
– Only for special cases (classification, ranking, multi-labeling etc.)
[Bartlett et al., 2006, Duchi et al., 2010, Mroueh et al., 2012]
Score learning techniques
+ General algorithmic framework
(e.g. StructSVM [Tsochantaridis et al., 2005])
– Limited theory (no consistency, see e.g. [Bakir et al., 2007])
Can we get the best of both worlds?
43
![Page 64: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/64.jpg)
Structured Prediction with Implicit
Embeddings
![Page 65: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/65.jpg)
Wish List
We would like a method that:
• Is flexible: can be applied to (m)any Y and ℓ.
• Leads to efficient computations.
• Has strong theoretical guarantees (e.g. consistency, rates)
44
![Page 66: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/66.jpg)
Ideal solution
Let's study the expected risk of our problem:
E(f) = ∫ ℓ(f(x), y) dρ(x, y) = ∫ ( ∫ ℓ(f(x), y) dρ(y|x) ) dρ_X(x)
We can minimize it pointwise. The best f* : X → Y is:
f*(x) = argmin_{z∈Y} ∫ ℓ(z, y) dρ(y|x)
f* is the pointwise minimizer of the conditional expectation E_{y|x} ℓ(z, y) given x.
45
![Page 67: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/67.jpg)
Finite Dimensional Intuition
Consider again the case where Y = {1, . . . , T}.
Then any ℓ : Y × Y → R is represented by a matrix V ∈ R^{T×T}:
ℓ(y, z) = V_{y,z} = e_y⊤ V e_z, ∀y, z ∈ Y
where e_y is the y-th element of the canonical basis.
This (bi)linearity will be very useful. . .
46
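This matrix representation is easy to verify numerically; a minimal sketch with the 0–1 loss as V (labels 0-indexed for convenience):

```python
import numpy as np

T = 4                                  # number of classes, Y = {0, ..., T-1}

# Loss matrix V: here the 0-1 loss, but any T x T matrix works.
V = np.ones((T, T)) - np.eye(T)

def loss(y, z):
    """Evaluate l(y, z) = e_y^T V e_z via the canonical basis."""
    e_y, e_z = np.eye(T)[y], np.eye(T)[z]
    return float(e_y @ V @ e_z)

assert loss(1, 1) == 0.0               # no penalty on the correct label
assert loss(1, 2) == 1.0               # unit penalty on any mistake
```

Any real T×T matrix V defines a valid loss this way; symmetry or positive definiteness is not required.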
![Page 68: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/68.jpg)
Finite Dimensional Intuition (cont.)
Going back to f*…
f*(x) = argmin_{z∈Y} ∫ ℓ(z, y) dρ(y|x)
      = argmin_{z∈Y} ∫ e_z⊤ V e_y dρ(y|x)
      = argmin_{z∈Y} e_z⊤ V ∫ e_y dρ(y|x).
Denote by g* : X → R^T the function g*(x) = ∫ e_y dρ(y|x). Then:
f*(x) = argmin_{z∈Y} e_z⊤ V g*(x)
47
![Page 69: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/69.jpg)
Finite Dimensional Intuition (cont.)
Idea: replace g* : X → R^T in
f*(x) = argmin_{z∈Y} e_z⊤ V g*(x)
…with an estimator g : X → R^T:
f(x) = argmin_{z∈Y} e_z⊤ V g(x)
48
![Page 70: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/70.jpg)
Finite Dimensional Intuition (cont.)
What is a good algorithm to learn g?
Recall that g*(x) = ∫ e_y dρ(y|x) = E_{y|x}[e_y] is a conditional expectation…
It is easy to show that
g* = argmin_{g : X→R^T} R(g), with R(g) = ∫ ‖g(x) − e_y‖² dρ(x, y).
Therefore g can be taken to be the least-squares ERM estimator!
49
![Page 71: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/71.jpg)
Going back to surrogate methods
Natural way to find a surrogate framework:
• Encoding. c : Y → H = R^T such that y ↦ e_y,
• Loss. L(g(x), c(y)) = ‖g(x) − c(y)‖²,
• Decoding. d : R^T → Y such that for any h ∈ R^T
d(h) = argmin_{z∈Y} e_z⊤ V h
Very similar to the multiclass setting (but applicable to any ℓ)!
50
![Page 72: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/72.jpg)
Finite Dimensional Intuition (cont.)
We perform vector-valued ridge-regression.
Let X = R^d. We parametrize g(x) = Wx, where
W = argmin_{W∈R^{T×d}} (1/n) Σ_{i=1}^n ‖e_{yi} − W xi‖² + λ‖W‖²_F.
The solution is
W = Y⊤X (X⊤X + nλI)⁻¹
with I ∈ R^{d×d} the identity matrix, and X ∈ R^{n×d}, Y ∈ R^{n×T} the matrices whose i-th rows are xi and e_{yi} respectively.
51
![Page 73: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/73.jpg)
Finite Dimensional Intuition (cont.)
By some algebraic manipulation…
g(x) = Wx = Y⊤ X(X⊤X + nλI)⁻¹x = Σ_{i=1}^n αi(x) e_{yi},   (1)
where the weights α : X → R^n are given by
α(x) = (α1(x), …, αn(x))⊤ = X(X⊤X + nλI)⁻¹ x ∈ R^n.
52
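The closed form and the expansion (1) can be checked against each other numerically. A small sketch with random data (sizes and λ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 50, 4, 3
X = rng.normal(size=(n, d))            # inputs, one per row
ys = rng.integers(0, T, size=n)
Y = np.eye(T)[ys]                      # i-th row is e_{y_i}
lam = 0.1

# Closed-form ridge solution W = Y^T X (X^T X + n*lam*I)^{-1}
A = np.linalg.inv(X.T @ X + n * lam * np.eye(d))
W = Y.T @ X @ A

# Same value via the weights alpha(x) = X (X^T X + n*lam*I)^{-1} x
x = rng.normal(size=d)
alpha = X @ A @ x                      # alpha(x) in R^n
g_direct = W @ x
g_expanded = Y.T @ alpha               # sum_i alpha_i(x) e_{y_i}

assert np.allclose(g_direct, g_expanded)
```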
![Page 74: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/74.jpg)
Finite Dimensional Intuition (cont.)
Therefore, replacing g in the definition of f…
f(x) = argmin_{z∈Y} e_z⊤ V g(x) = argmin_{z∈Y} Σ_{i=1}^n αi(x) e_z⊤ V e_{yi}
and, since e_z⊤ V e_{yi} = ℓ(z, yi), in other words
f(x) = argmin_{z∈Y} Σ_{i=1}^n αi(x) ℓ(z, yi)
53
![Page 75: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/75.jpg)
To sum up. . .
This approach alternates between two phases:
• Learning. Where the weight function α : X → R^n is estimated.
• Prediction. Where we need to solve
f(x) = argmin_{z∈Y} Σ_{i=1}^n αi(x) ℓ(z, yi)
Note. Similarly to likelihood estimation methods, one needs to know how to optimize over Y (but only at prediction time!).
54
![Page 76: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/76.jpg)
Wish List
Going back to our wishlist:
• Is flexible: can be applied to (m)any Y and ℓ.
• Leads to efficient computations.
- No optimization over Y during training,
- Recovers many previous surrogate approaches.
• Has strong theoretical guarantees (e.g. consistency, rates)
In a minute. . .
55
![Page 77: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/77.jpg)
General Case: Implicit Embeddings
Goal: generalize the intuition from the finite case to any Y.
Definition. A continuous ℓ : Z × Y → R admits an Implicit Embedding (IE) if there exist a map c : Y → H into a separable Hilbert space H and a linear operator V : H → H such that
ℓ(z, y) = ⟨c(z), V c(y)⟩_H.
• For V = I, we recover the notion of reproducing kernel!
• Accounts for non positive definite, non-symmetric functions,
• Holds also for infinite dimensional surrogate spaces H!
A rather technical definition, however… when does it hold in practice?
56
![Page 79: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/79.jpg)
Which loss functions have an IE?
57
![Page 80: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/80.jpg)
A few useful sufficient conditions. . .
58
![Page 81: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/81.jpg)
Structured Prediction with Implicit Embeddings
If ℓ has an implicit embedding:
f*(x) = argmin_{z∈Y} ⟨c(z), V g*(x)⟩_H,
with g* : X → H such that
g*(x) = ∫ c(y) dρ(y|x),
the conditional mean embedding of ρ(·|x) with respect to the output kernel k_Y(z, y) = ⟨c(z), c(y)⟩_H (see [Song et al., 2009]).
59
![Page 82: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/82.jpg)
Structured Prediction with Implicit Embeddings (Cont.)
We approximate g* with g(x) = Wx, where
W = argmin_{W∈H⊗R^d} (1/n) Σ_{i=1}^n ‖c(yi) − W xi‖² + λ‖W‖²_F.
• If H = R^T, then W ∈ R^T ⊗ R^d = R^{T×d} is a matrix,
• If H is infinite dimensional, W ∈ H ⊗ R^d is an operator.
Still… the solution is
W = Y⊤X (X⊤X + nλI)⁻¹
with X ∈ R^{n×d} and Y ∈ R^n ⊗ H the matrices/operators whose i-th "row" is xi and c(yi) respectively.
60
![Page 84: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/84.jpg)
Structured Prediction with Implicit Embeddings (Cont.)
W contains infinitely many parameters. However…
g(x) = Wx = Y⊤ X(X⊤X + nλI)⁻¹x = Σ_{i=1}^n αi(x) c(yi),
where the weights α : X → R^n are given by
α(x) = (α1(x), …, αn(x))⊤ = X(X⊤X + nλI)⁻¹ x ∈ R^n,
which requires inverting only a d×d matrix.
Or, if we have a kernel k : X × X → R,
α(x) = (K + nλI)⁻¹ v(x) ∈ R^n,
– K ∈ R^{n×n} the kernel matrix, Kij = k(xi, xj),
– v(x) ∈ R^n the evaluation vector, v(x)i = k(xi, x).
61
![Page 86: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/86.jpg)
Structured Prediction with Implicit Embeddings (Cont.)
Therefore, analogously to the finite case…
f(x) = argmin_{z∈Y} ⟨c(z), V g(x)⟩ = argmin_{z∈Y} Σ_{i=1}^n αi(x) ⟨c(z), V c(yi)⟩
and, since ⟨c(z), V c(yi)⟩ = ℓ(z, yi) (the "loss trick"), in other words
f(x) = argmin_{z∈Y} Σ_{i=1}^n αi(x) ℓ(z, yi)
62
![Page 87: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/87.jpg)
The “loss trick”
f(x) = argmin_{z∈Y} Σ_{i=1}^n αi(x) ℓ(z, yi)
Analogous to the "kernel trick", the implicit embedding enables us to find an estimator f : X → Y…
without any need for explicit knowledge of (H, c, V)!
63
![Page 88: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/88.jpg)
Implicit Embeddings and Surrogate Methods
Implicit embeddings naturally induce a surrogate framework:
• Encoding. c : Y → H,
• Loss. L(g(x), c(y)) = ‖g(x) − c(y)‖²_H,
• Decoding. d : H → Y such that for any h ∈ H
d(h) = argmin_{z∈Y} ⟨c(z), V h⟩_H
Q: do Fisher consistency and a comparison inequality hold?
64
![Page 89: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/89.jpg)
Fisher Consistency & Comparison Inequality
Fisher consistency. We get it for free…
f*(x) = d(g*(x)) = argmin_{z∈Y} ⟨c(z), V g*(x)⟩_H
Comparison inequality. We have the following…
Theorem [Ciliberto et al., 2016]. Let ℓ admit an implicit embedding (H, c, V). Then, for any measurable g : X → H,
E(d ∘ g) − E(f*) ≤ q_ℓ √(R(g) − R(g*))
with q_ℓ = 2 sup_{y∈Y} ‖V c(y)‖_H.
65
![Page 91: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/91.jpg)
Universal consistency
We can borrow from the literature on vector-valued regression
[Caponnetto and De Vito, 2007] to study g.
Theorem (Universal Consistency). Let X, Y be compact, let ℓ admit an implicit embedding, and let k : X × X → R be a universal kernel¹. Choose λ = n^{−1/2} to train f. Then,
lim_{n→+∞} E(f) − E(f*) = 0
with probability 1.
¹Technical requirement. Use e.g. the Gaussian kernel k(x, x′) = e^{−‖x−x′‖²/σ}.
66
![Page 92: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/92.jpg)
Learning Rates
Theorem (Learning Rates). Let X, Y be compact and let ℓ admit an implicit embedding. Choose λ = n^{−1/2} to train f. Then, for any δ ∈ (0, 1),
E(f) − E(f*) ≤ q_ℓ log(1/δ) n^{−1/4}
holds with probability at least 1 − δ.
Comments.
• Same rates as worst-case binary classification (better rates under Tsybakov-like noise assumptions [Nowak-Vila et al., 2018]).
• Adaptive w.r.t. q_ℓ (it automatically chooses the "best" surrogate framework).
67
![Page 94: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/94.jpg)
Example Applications
![Page 95: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/95.jpg)
Predicting Probability Distributions [Luise, Rudi, Pontil, Ciliberto '18]
Setting: Y = P(R^d), probability distributions on R^d.
Loss: Wasserstein distance
ℓ(µ, ν) = min_{τ∈Π(µ,ν)} ∫ ‖z − y‖² dτ(z, y)
Digit Reconstruction
68
![Page 96: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/96.jpg)
Manifold Regression [Rudi, Ciliberto, Marconi, Rosasco '18]
Setting: Y a Riemannian manifold.
Loss: (squared) geodesic distance.
Optimization: Riemannian GD.
Fingerprint Reconstruction
(Y = S1 sphere)
Multi-labeling
(Y statistical manifold)
69
![Page 97: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/97.jpg)
Nonlinear Multi-task Learning [Ciliberto, Rudi, Rosasco, Pontil '17, Luise, Stamos, Pontil, Ciliberto '19]
Idea: instead of solving multiple learning problems (tasks)
separately, leverage the potential relations among them.
Previous Methods: only impose/learn linear task relations.
Unable to cope with non-linear constraints (e.g. ranking, robotics, etc.).
MTL + Structured Prediction
− Interpret multiple tasks as separate outputs.
− Impose constraints as structure on the joint output.
70
![Page 98: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/98.jpg)
Wrapping up. . .
Structured prediction poses hard optimization/modeling/statistical challenges. We have seen two main strategies:
• Likelihood Estimation. Flexible yet lacking theory.
• Surrogate Methods. Theoretically sound but not flexible.
By leveraging the concept of Implicit Embeddings we found a synthesis between these two strategies:
• Flexible. Can be applied to any ℓ admitting an implicit embedding.
• Optimization. Requires a minimization over Y only at test time.
• Sound. We have consistency and learning rates.
71
![Page 99: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/99.jpg)
Additional Work
Case studies:
• Learning to rank [Korba et al., 2018]
• Output Fisher Embeddings [Djerrab et al., 2018]
• Y = manifolds, ℓ = geodesic distance [Rudi et al., 2018]
• Y = probability spaces, ℓ = Wasserstein distance [Luise et al., 2018]
Refinements of the analysis:
• Alternative derivations [Osokin et al., 2017]
• Discrete loss [Nowak-Vila et al., 2018, Struminsky et al., 2018]
Extensions:
• Application to multitask-learning [Ciliberto et al., 2017]
• Beyond least squares surrogate [Nowak-Vila et al., 2019]
• Regularizing with trace norm [Luise et al., 2019]
72
![Page 100: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/100.jpg)
References i
Bakir, G. H., Hofmann, T., Schölkopf, B., Smola, A. J., Taskar, B., and
Vishwanathan, S. V. N. (2007). Predicting Structured Data. MIT Press.
Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2006). Convexity, classification,
and risk bounds. Journal of the American Statistical Association,
101(473):138–156.
Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares
algorithm. Foundations of Computational Mathematics, 7(3):331–368.
Ciliberto, C., Rosasco, L., and Rudi, A. (2016). A consistent regularization approach
for structured prediction. Advances in Neural Information Processing Systems 29
(NIPS), pages 4412–4420.
Ciliberto, C., Rudi, A., Rosasco, L., and Pontil, M. (2017). Consistent multitask
learning with nonlinear output relations. In Advances in Neural Information
Processing Systems, pages 1983–1993.
Djerrab, M., Garcia, A., Sangnier, M., and d'Alché-Buc, F. (2018). Output Fisher
embedding regression. Machine Learning, 107(8-10):1229–1256.
73
![Page 101: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/101.jpg)
References ii
Duchi, J. C., Mackey, L. W., and Jordan, M. I. (2010). On the consistency of ranking
algorithms. In Proceedings of the International Conference on Machine Learning
(ICML), pages 327–334.
Korba, A., Garcia, A., and d'Alché-Buc, F. (2018). A structured prediction approach
for label ranking. In Advances in Neural Information Processing Systems, pages
8994–9004.
Luise, G., Rudi, A., Pontil, M., and Ciliberto, C. (2018). Differential properties of
Sinkhorn approximation for learning with Wasserstein distance. In Advances in
Neural Information Processing Systems, pages 5859–5870.
Luise, G., Stamos, D., Pontil, M., and Ciliberto, C. (2019). Leveraging low-rank
relations between surrogate tasks in structured prediction. International Conference
on Machine Learning (ICML).
Mroueh, Y., Poggio, T., Rosasco, L., and Slotine, J.-J. (2012). Multiclass learning
with simplex coding. In Advances in Neural Information Processing Systems
(NIPS) 25, pages 2798–2806.
Nowak-Vila, A., Bach, F., and Rudi, A. (2018). Sharp analysis of learning with
discrete losses. AISTATS.
74
![Page 102: An Introduction to Structured Prediction...An Introduction to Structured Prediction Carlo Ciliberto 20/11/2019 Electrical and Electronics Engineering Imperial College London Outline](https://reader034.vdocuments.us/reader034/viewer/2022043010/5fa247bf510aeb254d4d5bba/html5/thumbnails/102.jpg)
References iii
Nowak-Vila, A., Bach, F., and Rudi, A. (2019). A general theory for structured
prediction with smooth convex surrogates. arXiv preprint arXiv:1902.01958.
Osokin, A., Bach, F., and Lacoste-Julien, S. (2017). On structured prediction theory
with calibrated convex surrogate losses. In Advances in Neural Information
Processing Systems, pages 302–313.
Rudi, A., Ciliberto, C., Marconi, G., and Rosasco, L. (2018). Manifold structured
prediction. In Advances in Neural Information Processing Systems, pages
5610–5621.
Song, L., Huang, J., Smola, A., and Fukumizu, K. (2009). Hilbert space embeddings
of conditional distributions with applications to dynamical systems. In Proceedings
of the 26th Annual International Conference on Machine Learning, pages 961–968.
ACM.
Struminsky, K., Lacoste-Julien, S., and Osokin, A. (2018). Quantifying learning
guarantees for convex but inconsistent surrogates. In Advances in Neural
Information Processing Systems, pages 669–677.
Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. (2005). Large margin
methods for structured and interdependent output variables. Journal of Machine
Learning Research, 6:1453–1484.
75