
Page 1:

An Introduction to Structured Prediction

Carlo Ciliberto

20/11/2019

Electrical and Electronics Engineering

Imperial College London

Page 2:

Outline

Structured prediction: what & why?

Surrogate Frameworks

Examples

The Surrogate Approach

Likelihood Estimation Approaches

Structured Prediction with Implicit Embeddings

2

Page 3:

Structured prediction: what & why?

Page 4:

Structured Prediction

3

Page 5:

Structured Prediction Vs. Supervised Learning

Q: This seems “just” standard supervised learning, doesn’t it?

• Learn f : X → Y,

• Given many training examples (x_i, y_i)_{i=1}^n.

A: Indeed it is supervised learning!

However, standard learning methods do not apply here...

What changes is what we do to learn f .

4

Page 6:

Supervised Learning 101

• X input space, Y output space,

• ℓ : Y × Y → R loss function,

• ρ unknown probability on X × Y.

Goal: find f? : X → Y

f? = argmin_{f : X → Y} E(f),   E(f) = E[ℓ(f(x), y)],

given only the dataset (x_i, y_i)_{i=1}^n sampled independently from ρ.

5

Page 7:

Prototypical Approach: Empirical Risk Minimization

Solve   f = argmin_{f∈F} (1/n) ∑_{i=1}^n ℓ(f(x_i), y_i),

where F ⊆ {f : X → Y} (usually a convex function space).

If Y is a vector space (e.g. Y = R):

• F easy to choose/optimize over: (generalized) linear models,

Kernel methods, Neural Networks, etc.

Example: Linear models. X = Rd

• f(x) = w^⊤ x for some w ∈ R^d.

6

Page 10:

Empirical Risk Minimization (ERM)

We are interested in controlling the Excess Risk of f

E(f) − E(f?)

Wish list:

• Consistency:   lim_{n→+∞} E(f) − E(f?) = 0

• Learning Rates:   E(f) − E(f?) ≤ O(n^{−γ}),   γ > 0 (the larger the better).

7

Page 11:

Prototypical Results: Empirical Risk Minimization

Several results allow us to study ERM’s consistency and rates when:

• Y = Rd and,

• F is a “standard” space of functions (e.g. a reproducing

kernel Hilbert space).

Examples of techniques/notions involved to obtain these results:

• VC dimension,

• Rademacher & Gaussian complexity,

• Covering numbers,

• Stability,

• Empirical processes,

• . . . .

8

Page 12:

Prototypical Approach: Empirical Risk Minimization

Solve   f = argmin_{f∈F} (1/n) ∑_{i=1}^n ℓ(f(x_i), y_i),

where F ⊆ {f : X → Y} (usually a convex function space).

If Y is a vector space (e.g. Y = R):

• F easy to choose/optimize over: (generalized) linear models,

Kernel methods, Neural Networks, etc.

If Y is a “structured” space:

• How to choose F?

• How to perform optimization over it?

• How to study the statistics of f over F?

9

Page 13:

Structured Prediction Methods

Y arbitrary: how do we parametrize F and learn f?

Surrogate approaches

+ Clear theory (e.g. convergence and learning rates)

– Only for special cases (classification, ranking, multi-labeling etc.)

[Bartlett et al., 2006, Duchi et al., 2010, Mroueh et al., 2012]

Score learning techniques

+ General algorithmic framework

(e.g. StructSVM [Tsochantaridis et al., 2005])

– Limited Theory (no consistency, see e.g. [Bakir et al., 2007] )

10

Page 14:

Surrogate Frameworks

Page 15:

Example: Binary Classification setting

Binary Classification:

• “any” input space X

• output space Y = {−1, 1}

• 0-1 loss function, i.e. ℓ(y, y′) = 1{y ≠ y′}, equal to 0 if y = y′ and 1 otherwise.

11

Page 16:

Example: Binary Classification Problem

• A classification rule is a map f : X → Y

• The risk of a rule f is E(f) = E_{(x,y)∼ρ}[1{f(x) ≠ y}].

• The classification rule that minimizes E is

f∗ : X → Y,   f∗(x) = argmax_{y∈Y} ρ(y | x).

• Why? Exercise : )

12

Page 17:

Example: Binary Classification

Goal: approximate f∗ given a training set (x_i, y_i)_{i=1}^n.

Issues:

i) Y is not linear! ⇒ H = {classification rules} is not linear!

ii) ℓ(y, y′) = 1{y ≠ y′} is not convex ⇒ very hard to minimize!

13

Page 21:

Example: Binary Classification

Idea:

i) Rephrase the problem using a linear output space,

ii) Find a good convex “replacement” for ℓ.

i) Replace Y = {−1, 1} with R and consider functions g : X → R (“surrogate” classification rule)

ii) Replace ℓ(y, y′) = 1{y ≠ y′} with a non-negative convex “surrogate” loss L : R × R → R+, e.g. logistic, least squares, hinge.

16

Page 22:

Loss Functions for Binary Classification

Loss functions of the form L(y, y′) = L(y · y′)

[Figure: the 0-1 loss together with its convex surrogates (square loss, hinge loss, logistic loss), plotted as functions of the margin y · y′.]
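For concreteness, a small Python sketch (my own illustration, not part of the slides) of the surrogate losses shown above, all written as functions of the margin m = y · g(x); the helper names are arbitrary:

import numpy as np

def zero_one_loss(m):
    # 0-1 loss as a function of the margin m = y * g(x)
    return (m <= 0).astype(float)

def square_loss(m):
    # least-squares surrogate: (1 - m)^2, since (y - g)^2 = (1 - y*g)^2 for y in {-1, +1}
    return (1.0 - m) ** 2

def hinge_loss(m):
    # hinge surrogate: max(0, 1 - m)
    return np.maximum(0.0, 1.0 - m)

def logistic_loss(m):
    # logistic surrogate: log(1 + exp(-m))
    return np.log1p(np.exp(-m))

margins = np.linspace(-2, 2, 9)
for name, fn in [("0-1", zero_one_loss), ("square", square_loss),
                 ("hinge", hinge_loss), ("logistic", logistic_loss)]:
    print(name, np.round(fn(margins), 2))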

17

Page 23:

Surrogate ERM

The loss L induces a surrogate risk

R(g) = E_{(x,y)∼ρ} L(g(x), y),

and we can define the surrogate ERM estimator

g = argmin_{g∈G} R_n(g),   R_n(g) = (1/n) ∑_{i=1}^n L(g(x_i), y_i).

Modeling. The output space is linear ⇒ many options for G!

Optimization. The loss is convex ⇒ we can efficiently find g!

Statistics. Standard results ⇒ generalization properties of g!

R(g)−R(g?)→ 0

18

Page 24:

Surrogate Vs. Original Problems

This is all nice and well, but . . .

• How can we go from g : X → R to some f : X → Y?

Standard approach: f(x) = sign(g(x))

• How is g? related to f??

Exercise. f?(x) = sign(g?(x))!

• Are surrogate learning rates for g of any use?

Theorem.

E(f) − E(f?) ≤ ϕ( R(g) − R(g?) ).

(where ϕ : R→ R+ depends on the surrogate loss L).

19

Page 28:

Example: Multiclass Classification setting

Multiclass Classification:

• input space X

• output space Y = {1, 2, . . . , T}

• 0-1 loss function, i.e. ℓ(y, y′) = 1{y ≠ y′}

Issues:

• Can we still map Y into R?

• What surrogate L can replace ℓ?

20

Page 29:

Example: Multiclass Classification

• Attempt 1: Y = {1, 2, . . . , T} ⊂ R. Could replace Y with R

Not a good choice: induces an arbitrary distance on classes.

(i.e. 1 is closer to 2 than 3 and so on . . . )

• Attempt 2: replace Y = {1, 2, . . . , T} with RT .

“Replace” means “embed” Y into RT using an

encoding c : Y → RT defined by

c(i) = e_i,   i = 1, . . . , T,

where ei is the ith vector of the canonical basis of RT .

21

Page 32:

Example: Multiclass Classification

Given a surrogate loss L : RT ×RT → R (hinge? least squares?)...

. . . we can train the surrogate estimator g : X → RT

g = argmin_{g∈G} (1/n) ∑_{i=1}^n L(g(x_i), c(y_i)).

Question: But g has values in RT . . . How can we go back to Y?

Answer: via a decoding routine!

f(x) = argmax_{t=1,...,T} g_t(x)
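As an illustration (not from the slides), a minimal NumPy sketch of this multiclass surrogate pipeline: one-hot encode the labels (classes indexed 0, . . . , T−1 here), fit a regularized least-squares model, and decode with argmax. The function names and the ridge parameter lam are my own choices.

import numpy as np

def encode_one_hot(y, T):
    # c(y) = e_y: map class indices 0..T-1 to canonical basis vectors of R^T
    return np.eye(T)[y]

def fit_least_squares(X, Y, lam=1e-3):
    # regularized least squares: W = Y^T X (X^T X + n*lam*I)^{-1}
    n, d = X.shape
    return Y.T @ X @ np.linalg.inv(X.T @ X + n * lam * np.eye(d))

def decode_argmax(scores):
    # f(x) = argmax_t g_t(x)
    return np.argmax(scores, axis=1)

# toy data: n points in R^d with T classes
rng = np.random.default_rng(0)
n, d, T = 200, 5, 3
X = rng.normal(size=(n, d))
y = rng.integers(0, T, size=n)
W = fit_least_squares(X, encode_one_hot(y, T))
y_pred = decode_argmax(X @ W.T)            # g(x) = W x, decoded class for each point
print("fraction of training labels recovered:", np.mean(y_pred == y))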

22

Page 35:

Multiclass Classification: Surrogate Vs. Original Problems

The same questions as for binary classification . . .

• How can we go from g : X → RT to some f : X → Y?

Decoding: f(x) = argmaxt=1,...,T gt(x)

• How is g? related to f??

Not clear: it strongly depends on L!

• Are surrogate learning rates for g of any use?

Not clear: it strongly depends on L!

23

Page 38:

The Surrogate Approach

Page 39:

Taking inspiration from the previous examples . . .

24

Page 40:

Surrogate Approach: Key Ingredients

A possible approach to structured prediction is to find:

1. A linear surrogate space H,

2. An encoding c : Y → H,

3. A surrogate loss L : H×H → R,

4. A decoding d : H → Y.

25

Page 41:

Surrogate Approach: Recipe

Then:

1. Encode training set (x_i, y_i)_{i=1}^n into (x_i, c(y_i))_{i=1}^n,

2. Learn g = argmin_{g∈G} (1/n) ∑_{i=1}^n L(g(x_i), c(y_i))

(using standard supervised learning methods)

3. Decode f = d ◦ g.
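Schematically, the recipe can be written as a small Python skeleton (my own sketch; encode, fit_surrogate and decode stand in for whatever choice of (H, c, L, d) and learning algorithm one makes):

def surrogate_fit_predict(X_train, y_train, X_test, encode, fit_surrogate, decode):
    # 1. Encode: replace each structured label y_i with c(y_i) in the linear space H
    H_train = [encode(y) for y in y_train]
    # 2. Learn: any standard (vector-valued) regression method on (x_i, c(y_i))
    g = fit_surrogate(X_train, H_train)
    # 3. Decode: map surrogate predictions back to the structured space Y
    return [decode(g(x)) for x in X_test]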

26

Page 42:

Wish list

However, recall that learning g is solving a different problem...

R(g) = ∫ L(g(x), c(y)) dρ(x, y).

In order to be “useful”, a surrogate framework needs to satisfy:

• Fisher Consistency. E(f?) = E(d ◦ g?)

• Comparison Inequality. For any g : X → H,

E(d ◦ g) − E(f?) ≤ ϕ( R(g) − R(g?) ),

with ϕ : R+ → R+ continuous, non-decreasing and ϕ(0) = 0.

27

Page 45:

Fisher consistency

Fisher consistency. We want this because it certifies that the surrogate problem and the decoding procedure are good ones: decoding the best surrogate solution gives d ◦ g∗ the same risk as the best original solution f∗.

28

Page 46:

Comparison inequality

Comparison inequality. If we learn a g which approximates g∗. . .

R(g)−R(g∗)→ 0 as n→ +∞.

. . . then the comparison inequality implies,

E(d ◦ g)− E(f∗)→ 0 as n→ +∞.

Therefore f := d ◦ g is a good estimator for the original problem!

Rates. Moreover, if R(g) − R(g?) ≤ n^{−α} for some α > 0, then

E(f) − E(f?) ≤ ϕ(n^{−α}).

Knowledge of ϕ allows us to derive rates for f from the rates of g!

29

Page 48:

Going back to the examples...

Surrogate framework for binary classification:

• Y = {1,−1}, H = R

• coding c : {1,−1} → R is the embedding Y ↪→ R

• L : R × R → R+: least squares, hinge, logistic

• decoding d : R→ {1,−1} is d(r) = sign(r).

Fisher consistency? Comparison inequality?

Exercise for the reader! :)

30

Page 49:

Going back to the examples...

Surrogate framework for multiclass classification:

• Y = {1, 2, . . . , T}, H = RT

• coding c : {1, 2, . . . , T} ↪→ RT with c(i) = ei.

• L : RT × RT → R+: least squares ✓, hinge ✗.

• decoding d : RT → {1, 2, . . . , T} is d(r) = argmaxt=1,...,T rt.

Fisher consistency? Comparison inequality?

Exercise for the reader! :)

31

Page 50:

To sum up

Pros

• Modeling. Directly borrow from ERM literature to design

(surrogate) learning algorithms (vector-valued regression!)

• Statistics. Extend surrogate ERM rates for g to f by means

of the comparison inequality.

• Optimization. Bypasses/postpones dealing with the non-convex ℓ until prediction time!

Cons

• Flexibility. Need to design a surrogate framework (H, c, L, d) on a case-by-case basis for any (ℓ, Y).

32

Page 51:

Likelihood Estimation Approaches

Page 52:

A standard approach

Alternative approach to address structured prediction problems:

• Model the likelihood of observing y given x as a function F∗ : X × Y → [0, 1] with F∗(x, y) = ρ(y | x).

• Learn F : X × Y → R from data, ideally with F → F∗.

• Then,

• Ideal solution: f∗(x) = argmax_{y∈Y} F∗(x, y)

• Approximate solution: f(x) = argmax_{y∈Y} F(x, y)

33

Page 53:

Struct SVM [Tsochantaridis et al., 2005]

Model:

• joint feature map Ψ : Y × X → F with F a Hilbert space.

• F (y, x) = 〈w,Ψ(y, x)〉 with w ∈ F a parameter vector.

Algorithm: Find the parameters w that solve

min_{w∈F} ‖w‖²

subject to   ⟨w, Ψ(y_i, x_i)⟩ ≥ ⟨w, Ψ(y, x_i)⟩ + 1,   ∀ i = 1, . . . , n,  ∀ y ∈ Y \ {y_i}

Intuition: the best y∗(x) must be such that F (x, y∗(x)) is

considerably larger than any other F (x, y)

34

Page 54:

Including the Loss

However, things are more complicated. . . we don’t want to simply maximise ρ(y | x): we also have a loss function ℓ as part of the problem:

E(f) = ∫ ℓ(f(x), y) dρ(y | x) dρ_X(x)

35

Page 55:

Struct SVM Variants

Model:

• joint feature map Ψ : Y × X → F with F a Hilbert space.

• F (y, x) = 〈w,Ψ(y, x)〉 with w ∈ F a parameter vector.

Algorithm: Find the parameters w that solve

min_{w∈F} ‖w‖²

subject to   ⟨w, Ψ(y_i, x_i)⟩ ≥ ⟨w, Ψ(y, x_i)⟩ + ℓ(y_i, y),   ∀ i = 1, . . . , n,  ∀ y ∈ Y \ {y_i}

36

Page 56:

Struct SVM Variants

Model:

• joint feature map Ψ : Y × X → F with F a Hilbert space.

• F (y, x) = 〈w,Ψ(y, x)〉 with w ∈ F a parameter vector.

Algorithm: Find the parameters w that solve

min_{w∈F, ξ} ‖w‖² + (C/n) ∑_{i=1}^n ξ_i

subject to   ⟨w, Ψ(y_i, x_i)⟩ ≥ ⟨w, Ψ(y, x_i)⟩ + ℓ(y_i, y) − ξ_i,   ∀ i = 1, . . . , n,  ∀ y ∈ Y \ {y_i}

Generalizing the “slack” variables in standard SVM . . .
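To make the role of these constraints concrete, here is a hedged sketch (not from the slides) of the “loss-augmented inference” step for a linear score F(y, x) = ⟨w, Ψ(y, x)⟩ over a finite candidate set: it finds the most violated margin constraint for one training example, which is the core subroutine of cutting-plane or subgradient StructSVM solvers. The names candidates, psi and loss are assumed inputs.

import numpy as np

def most_violated_constraint(w, x_i, y_i, candidates, psi, loss):
    # Return the y maximizing loss(y_i, y) + <w, psi(y, x_i)> - <w, psi(y_i, x_i)>,
    # i.e. the most violated margin constraint for the example (x_i, y_i).
    score_true = w @ psi(y_i, x_i)
    best_y, best_violation = None, -np.inf
    for y in candidates:
        if y == y_i:
            continue
        violation = loss(y_i, y) + w @ psi(y, x_i) - score_true
        if violation > best_violation:
            best_y, best_violation = y, violation
    return best_y, best_violation   # a positive violation is what the slack xi_i must absorb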

37

Page 57:

Optimization

38

Page 58:

To sum up

Pros

• Flexibility. Can be applied to virtually any problem.

Cons

• Optimization. Requires solving an optimization over Y with respect to ℓ at every iteration. It can become very expensive!

• Statistics. It has been shown that in some cases this

approach is not consistent [Bakir et al., 2007].

39

Page 59:

Examples: Language Parsing

40

Page 60:

Examples: Image Segmentation

41

Page 61:

Examples: Pose Estimation

42

Page 62:

To wrap up. . .

Surrogate approaches

+ Clear theory (e.g. convergence and learning rates)

– Only for special cases (classification, ranking, multi-labeling etc.)

[Bartlett et al., 2006, Duchi et al., 2010, Mroueh et al., 2012]

Score learning techniques

+ General algorithmic framework

(e.g. StructSVM [Tsochantaridis et al., 2005])

– Limited Theory (no consistency, see e.g. [Bakir et al., 2007] )

Can we get the best of both worlds?

43

Page 64:

Structured Prediction with Implicit Embeddings

Page 65:

Wish List

We would like a method that:

• Is flexible: can be applied to (m)any Y and ℓ.

• Leads to efficient computations.

• Has strong theoretical guarantees (i.e. consistency, rates)

44

Page 66:

Ideal solution

Let’s study the expected risk of our problem

E(f) = ∫ ℓ(f(x), y) dρ(x, y) = ∫ ( ∫ ℓ(f(x), y) dρ(y|x) ) dρ_X(x)

We can minimize it pointwise. The best f? : X → Y is then:

f?(x) = argmin_{z∈Y} ∫ ℓ(z, y) dρ(y|x)

f? is the pointwise minimizer of the expectation E_{y|x} ℓ(z, y), conditioned with respect to x.

45

Page 67:

Finite Dimensional Intuition

Consider again the case where Y = {1, . . . , T}.

Then any ℓ : Y × Y → R is represented by a matrix V ∈ R^{T×T}:

ℓ(y, z) = V_{yz} = e_y^⊤ V e_z   ∀ y, z ∈ Y

where e_y is the y-th element of the canonical basis.

This (bi)linearity will be very useful. . .
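A quick NumPy check of this representation (my own illustration), using the 0-1 loss on T classes, for which V = 11^⊤ − I (ones off the diagonal, zeros on it):

import numpy as np

T = 4
V = np.ones((T, T)) - np.eye(T)     # V_{yz} = 1 if y != z, 0 otherwise (0-1 loss)
e = np.eye(T)                       # e[y] is the canonical basis vector e_y

for y in range(T):
    for z in range(T):
        assert e[y] @ V @ e[z] == float(y != z)   # l(y, z) = e_y^T V e_z
print("0-1 loss recovered as the bilinear form e_y^T V e_z")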

46

Page 68:

Finite Dimensional Intuition (cont.)

Going back to f?. . .

f?(x) = argmin_{z∈Y} ∫ ℓ(z, y) dρ(y|x)

= argmin_{z∈Y} ∫ e_z^⊤ V e_y dρ(y|x)

= argmin_{z∈Y} e_z^⊤ V ∫ e_y dρ(y|x).

Denote by g? : X → R^T the function g?(x) = ∫ e_y dρ(y|x). Then:

f?(x) = argmin_{z∈Y} e_z^⊤ V g?(x)

47

Page 69:

Finite Dimensional Intuition (cont.)

Idea: replace g? : X → R^T in

f?(x) = argmin_{z∈Y} e_z^⊤ V g?(x)

. . . with an estimator g : X → R^T:

f(x) = argmin_{z∈Y} e_z^⊤ V g(x)

48

Page 70:

Finite Dimensional Intuition (cont.)

What is a good algorithm to learn g?

Recall that g?(x) = ∫ e_y dρ(y|x) = E_{y|x}[e_y] is a conditional expectation. . .

It is easy to show that

g? = argmin_{g : X→R^T} R(g),   R(g) = ∫ ‖g(x) − e_y‖² dρ(x, y)

Therefore g can be taken to be the least-squares ERM estimator!

49

Page 71:

Going back to surrogate methods

Natural way to find a surrogate framework:

• Encoding. c : Y → H = R^T such that y ↦ e_y,

• Loss. L(g(x), c(y)) = ‖g(x) − c(y)‖²,

• Decoding. d : R^T → Y such that for any h ∈ R^T,

d(h) = argmin_{z∈Y} e_z^⊤ V h

Very similar to the multiclass setting (but can be applied to any ℓ)!

50

Page 72:

Finite Dimensional Intuition (cont.)

We perform vector-valued ridge-regression.

Let X = Rd. We parametrize g(x) = Wx, where

W = argmin_{W∈R^{T×d}} (1/n) ∑_{i=1}^n ‖e_{y_i} − W x_i‖² + λ‖W‖²_F .

The solution is

W = Y^⊤ X (X^⊤X + nλI)^{−1},

with I ∈ R^{d×d} the identity matrix, and X ∈ R^{n×d}, Y ∈ R^{n×T} the matrices whose i-th rows are x_i and e_{y_i} respectively.

51

Page 73:

Finite Dimensional Intuition (cont.)

By some algebraic manipulation. . .

g(x) = W x = Y^⊤ X(X^⊤X + nλI)^{−1} x = ∑_{i=1}^n α_i(x) e_{y_i} ,

where the weights α : X → R^n are given by

α(x) = (α_1(x), . . . , α_n(x))^⊤ = X(X^⊤X + nλI)^{−1} x ∈ R^n.
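A small NumPy check of this rewriting (my own sketch): the closed-form ridge solution applied to a new point x gives the same vector as the weighted combination ∑_i α_i(x) e_{y_i}.

import numpy as np

rng = np.random.default_rng(0)
n, d, T, lam = 50, 4, 3, 0.1
X = rng.normal(size=(n, d))
Y = np.eye(T)[rng.integers(0, T, size=n)]      # rows are the encodings e_{y_i}

A_inv = np.linalg.inv(X.T @ X + n * lam * np.eye(d))
W = Y.T @ X @ A_inv                            # closed-form ridge solution

x = rng.normal(size=d)                         # a new input point
alpha = X @ A_inv @ x                          # alpha(x) = X (X^T X + n*lam*I)^{-1} x
print(np.allclose(W @ x, Y.T @ alpha))         # g(x) = W x  equals  sum_i alpha_i(x) e_{y_i}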

52

Page 74:

Finite Dimensional Intuition (cont.)

Therefore, plugging g into the definition of f . . .

f(x) = argmin_{z∈Y} e_z^⊤ V g(x) = argmin_{z∈Y} ∑_{i=1}^n α_i(x) e_z^⊤ V e_{y_i} ,   with e_z^⊤ V e_{y_i} = ℓ(z, y_i).

In other words,

f(x) = argmin_{z∈Y} ∑_{i=1}^n α_i(x) ℓ(z, y_i)

53

Page 75:

To sum up. . .

This approach proceeds in two phases:

• Learning. The score function α : X → R^n is estimated.

• Prediction. We need to solve

f(x) = argmin_{z∈Y} ∑_{i=1}^n α_i(x) ℓ(z, y_i)

Note. Similarly to likelihood estimation methods, one needs to know how to optimize over Y (but only at prediction time, not at every training iteration!).
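Putting the two phases together, an end-to-end sketch (my own illustration, under the simplest choices): linear ridge regression on the inputs gives the weights α(x), and prediction minimizes the α-weighted empirical loss over a finite list of candidate outputs. StructuredPredictor, loss, candidates and lam are all hypothetical names introduced for the example.

import numpy as np

class StructuredPredictor:
    def __init__(self, loss, lam=0.1):
        self.loss, self.lam = loss, lam

    def fit(self, X, Y_train):
        # Learning phase: only the input-side regression is solved here; no optimization over Y.
        n, d = X.shape
        self.X_train, self.Y_train = X, Y_train
        self.A_inv = np.linalg.inv(X.T @ X + n * self.lam * np.eye(d))
        return self

    def predict(self, x, candidates):
        # Prediction phase: f(x) = argmin_z sum_i alpha_i(x) * loss(z, y_i)
        alpha = self.X_train @ self.A_inv @ x
        scores = [sum(a * self.loss(z, y) for a, y in zip(alpha, self.Y_train))
                  for z in candidates]
        return candidates[int(np.argmin(scores))]

# toy usage with Y = {0, 1, 2} and the 0-1 loss
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)); y = rng.integers(0, 3, size=100)
model = StructuredPredictor(loss=lambda z, t: float(z != t)).fit(X, list(y))
print(model.predict(X[0], candidates=[0, 1, 2]))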

54

Page 76:

Wish List

Going back to our wishlist:

• Is flexible: can be applied to (m)any Y and ℓ.

• Leads to efficient computations.

- No optimization over Y during training,

- Recovers many previous surrogate approaches.

• Has strong theoretical guarantees (i.e. consistency, rates)

In a minute. . .

55

Page 77:

General Case: Implicit Embeddings

Goal: generalize the intuition from the finite case to any Y.

Definition. A continuous ℓ : Z × Y → R admits an Implicit Embedding (IE) if there exists a map c : Y → H into a separable Hilbert space H and a linear operator V : H → H such that

ℓ(z, y) = ⟨ c(z) , V c(y) ⟩_H.

• For V = I, we recover the notion of reproducing kernel !

• Accounts for non positive definite, non-symmetric functions,

• Holds also for infinite dimensional surrogate spaces H!

Quite technical definition however. . . when does it hold in practice?

56

Page 79:

Which loss functions have an IE?

57

Page 80:

A few useful sufficient conditions. . .

58

Page 81:

Structured Prediction with Implicit Embeddings

If ℓ has an implicit embedding:

f?(x) = argmin_{z∈Y} ⟨c(z), V g?(x)⟩_H ,

with g? : X → H such that

g?(x) = ∫ c(y) dρ(y|x),

the conditional mean embedding of ρ(·|x) with respect to the output kernel k_Y(z, y) = ⟨c(z), c(y)⟩_H (see [Song et al., 2009]).

59

Page 82:

Structured Prediction with Implicit Embeddings (Cont.)

We approximate g? with g(x) = Wx

W = argmin_{W∈H⊗R^d} (1/n) ∑_{i=1}^n ‖c(y_i) − W x_i‖² + λ‖W‖²_F .

• If H = R^T, then W ∈ R^T ⊗ R^d = R^{T×d} is a matrix,

• If H is infinite dimensional, W ∈ H ⊗ R^d is an operator.

Still. . . the solution is

W = Y^⊤ X (X^⊤X + nλI)^{−1},

with X ∈ R^{n×d} and Y ∈ R^n ⊗ H the matrices/operators whose i-th “rows” are x_i and c(y_i) respectively.

60

Page 84:

Structured Prediction with Implicit Embeddings (Cont.)

W contains infinitely many parameters. However. . .

g(x) = W x = Y^⊤ X(X^⊤X + nλI)^{−1} x = ∑_{i=1}^n α_i(x) c(y_i) ,

where the weights α : X → R^n are such that

α(x) = (α_1(x), . . . , α_n(x))^⊤ = X(X^⊤X + nλI)^{−1} x ∈ R^n   (only a d×d matrix to invert!)

Or, if we have a kernel k : X × X → R,

α(x) = (K + nλI)^{−1} v(x) ∈ R^n.

– K ∈ Rn×n kernel matrix Kij = k(xi, xj)

– v(x) ∈ Rn evaluation vector v(x)i = k(xi, x).
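In the kernel case the weights can be computed without ever forming W; a minimal sketch with a Gaussian kernel (the helper names are my own):

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / sigma), evaluated for all pairs of rows of A and B
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / sigma)

def kernel_weights(X_train, x, lam, sigma=1.0):
    # alpha(x) = (K + n*lam*I)^{-1} v(x), with K_{ij} = k(x_i, x_j) and v(x)_i = k(x_i, x)
    n = X_train.shape[0]
    K = gaussian_kernel(X_train, X_train, sigma)
    v = gaussian_kernel(X_train, x[None, :], sigma)[:, 0]
    return np.linalg.solve(K + n * lam * np.eye(n), v)

X_train = np.random.default_rng(0).normal(size=(20, 3))
print(kernel_weights(X_train, X_train[0], lam=0.1))   # length-20 weight vector alpha(x)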

61

Page 86:

Structured Prediction with Implicit Embeddings (Cont.)

Therefore, analogously to the finite case. . .

f(x) = argmin_{z∈Y} ⟨c(z), V g(x)⟩

= argmin_{z∈Y} ∑_{i=1}^n α_i(x) ⟨c(z), V c(y_i)⟩ ,   with ⟨c(z), V c(y_i)⟩ = ℓ(z, y_i)   (the “loss trick”).

In other words,

f(x) = argmin_{z∈Y} ∑_{i=1}^n α_i(x) ℓ(z, y_i)

62

Page 87:

The “loss trick”

f(x) = argmin_{z∈Y} ∑_{i=1}^n α_i(x) ℓ(z, y_i)

Analogous to the “kernel trick”, the implicit embedding enables us

to find an estimator f : X → Y. . .

without need for explicit knowledge of (H, c, V )!

63

Page 88:

Implicit Embeddings and Surrogate Methods

Implicit embeddings naturally induce a surrogate framework:

• Encoding. c : Y → H,

• Loss. L(g(x), c(y)) = ‖g(x)− c(y)‖2H,

• Decoding. d : H → Y such that for any h ∈ H,

d(h) = argmin_{z∈Y} ⟨c(z), V h⟩_H

Q: do Fisher consistency and a comparison inequality hold?

64

Page 89:

Fisher Consistency & Comparison Inequality

Fisher Consistency. We get it for free. . .

f?(x) = d(g?(x)) = argmin_{z∈Y} ⟨c(z), V g?(x)⟩_H

Comparison Inequality. We have the following. . .

Theorem [Ciliberto et al., 2016] Let ℓ admit an implicit embedding (H, c, V ). Then, for any measurable g : X → H

E(d ◦ g) − E(f?) ≤ q_ℓ √( R(g) − R(g?) )

with q_ℓ = 2 sup_{y∈Y} ‖V c(y)‖_H.

65

Page 91:

Universal consistency

We can borrow from the literature on vector-valued regression

[Caponnetto and De Vito, 2007] to study g.

Theorem (Universal Consistency). Let X , Y be compact, let ℓ admit an implicit embedding, and let k : X × X → R be a universal kernel¹. Choose λ = n^{−1/2} to train f. Then,

lim_{n→+∞} E(f) − E(f?) = 0,

with probability 1.

¹Technical requirement. Use e.g. the Gaussian kernel k(x, x′) = e^{−‖x−x′‖²/σ}.

66

Page 92:

Learning Rates

Theorem (Learning Rates). Let X , Y be compact and let ℓ admit an implicit embedding. Choose λ = n^{−1/2} to train f. Then, ∀ δ ∈ (0, 1),

E(f) − E(f?) ≤ q_ℓ log(1/δ) n^{−1/4}

holds with probability at least 1 − δ.

Comments.

• Same rates as worst-case binary classification (better rates with Tsybakov-like noise assumptions [Nowak-Vila et al., 2018]).

• Adaptive w.r.t. q_ℓ (it automatically chooses the “best” surrogate framework).

67

Page 94:

Example Applications

Page 95:

Predicting Probability Distributions [Luise, Rudi, Pontil, Ciliberto ’18]

Setting: Y = P(R^d), probability distributions on R^d.

Loss: Wasserstein distance

ℓ(µ, ν) = min_{τ∈Π(µ,ν)} ∫ ‖z − y‖² dτ(z, y)

Digit Reconstruction
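For intuition only, a tiny example of such a loss in the one-dimensional case, using SciPy's wasserstein_distance between two empirical samples (this is the 1-Wasserstein distance on R, a simplification of the squared-cost optimal-transport loss used in the reference):

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
mu_samples = rng.normal(loc=0.0, scale=1.0, size=500)   # samples from mu
nu_samples = rng.normal(loc=0.5, scale=1.0, size=500)   # samples from nu
print(wasserstein_distance(mu_samples, nu_samples))     # roughly 0.5, the mean shift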

68

Page 96:

Manifold Regression[Rudi, Ciliberto, Marconi, Rosasco ’18]

Setting: Y Riemannian

manifold.

Loss: (squared) geodesic

distance.

Optimization: Riemannian GD.

Fingerprint Reconstruction

(Y = S1 sphere)

Multi-labeling

(Y statistical manifold)

69

Page 97:

Nonlinear Multi-task Learning[Ciliberto, Rudi, Rosasco, Pontil ’17, Luise, Stamos, Pontil, Ciliberto ’19 ]

Idea: instead of solving multiple learning problems (tasks)

separately, leverage the potential relations among them.

Previous Methods: only imposing/learning linear task relations. Unable to cope with non-linear constraints (e.g. ranking, robotics, etc.).

MTL + Structured Prediction

− Interpret multiple tasks as separate outputs.

− Impose constraints as structure on the joint output.

70

Page 98:

Wrapping up. . .

Structured prediction poses hard optimization/modeling/statistical challenges. We have seen two main strategies:

• Likelihood Estimation. Flexible yet lacking theory.

• Surrogate Methods. Theoretically sound but not flexible.

By leveraging the concept of Implicit Embeddings we found a synthesis between these two strategies:

• Flexible. Can be applied to any ℓ admitting an implicit embedding.

• Optimization. Requires a minimization over Y only at test time.

• Sound. We have consistency and learning rates.

71

Page 99:

Additional Work

Case studies:

• Learning to rank [Korba et al., 2018]

• Output Fisher Embeddings [Djerrab et al., 2018]

• Y = manifolds, ℓ = geodesic distance [Rudi et al., 2018]

• Y = probability space, ℓ = Wasserstein distance [Luise et al., 2018]

Refinements of the analysis:

• Alternative derivations [Osokin et al., 2017]

• Discrete loss [Nowak-Vila et al., 2018, Struminsky et al., 2018]

Extensions:

• Application to multitask-learning [Ciliberto et al., 2017]

• Beyond least squares surrogate [Nowak-Vila et al., 2019]

• Regularizing with trace norm [Luise et al., 2019]

72

Page 100:

References i

Bakir, G. H., Hofmann, T., Scholkopf, B., Smola, A. J., Taskar, B., and

Vishwanathan, S. V. N. (2007). Predicting Structured Data. MIT Press.

Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2006). Convexity, classification,

and risk bounds. Journal of the American Statistical Association,

101(473):138–156.

Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares

algorithm. Foundations of Computational Mathematics, 7(3):331–368.

Ciliberto, C., Rosasco, L., and Rudi, A. (2016). A consistent regularization approach

for structured prediction. Advances in Neural Information Processing Systems 29

(NIPS), pages 4412–4420.

Ciliberto, C., Rudi, A., Rosasco, L., and Pontil, M. (2017). Consistent multitask

learning with nonlinear output relations. In Advances in Neural Information

Processing Systems, pages 1983–1993.

Djerrab, M., Garcia, A., Sangnier, M., and d’Alche Buc, F. (2018). Output fisher

embedding regression. Machine Learning, 107(8-10):1229–1256.

73

Page 101:

References ii

Duchi, J. C., Mackey, L. W., and Jordan, M. I. (2010). On the consistency of ranking

algorithms. In Proceedings of the International Conference on Machine Learning

(ICML), pages 327–334.

Korba, A., Garcia, A., and d’Alche Buc, F. (2018). A structured prediction approach

for label ranking. In Advances in Neural Information Processing Systems, pages

8994–9004.

Luise, G., Rudi, A., Pontil, M., and Ciliberto, C. (2018). Differential properties of

sinkhorn approximation for learning with wasserstein distance. In Advances in

Neural Information Processing Systems, pages 5859–5870.

Luise, G., Stamos, D., Pontil, M., and Ciliberto, C. (2019). Leveraging low-rank

relations between surrogate tasks in structured prediction. International Conference

on Machine Learning (ICML).

Mroueh, Y., Poggio, T., Rosasco, L., and Slotine, J.-J. (2012). Multiclass learning

with simplex coding. In Advances in Neural Information Processing Systems

(NIPS) 25, pages 2798–2806.

Nowak-Vila, A., Bach, F., and Rudi, A. (2018). Sharp analysis of learning with

discrete losses. AISTATS.

74

Page 102:

References iii

Nowak-Vila, A., Bach, F., and Rudi, A. (2019). A general theory for structured

prediction with smooth convex surrogates. arXiv preprint arXiv:1902.01958.

Osokin, A., Bach, F., and Lacoste-Julien, S. (2017). On structured prediction theory

with calibrated convex surrogate losses. In Advances in Neural Information

Processing Systems, pages 302–313.

Rudi, A., Ciliberto, C., Marconi, G., and Rosasco, L. (2018). Manifold structured

prediction. In Advances in Neural Information Processing Systems, pages

5610–5621.

Song, L., Huang, J., Smola, A., and Fukumizu, K. (2009). Hilbert space embeddings

of conditional distributions with applications to dynamical systems. In Proceedings

of the 26th Annual International Conference on Machine Learning, pages 961–968.

ACM.

Struminsky, K., Lacoste-Julien, S., and Osokin, A. (2018). Quantifying learning

guarantees for convex but inconsistent surrogates. In Advances in Neural

Information Processing Systems, pages 669–677.

Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484.

75